# Machine Learning: NLP Tasks

- Let's take a look at a few more classification tasks in NLP.

- In more complex NLP tasks, feature engineering (text vectorization) can be more complicated. We often need to come up with heuristics to extract features from the texts.

- In this lecture, we demonstrate a few NLP tasks and focus on a heuristics-based approach to feature engineering.

## NLP Tasks and Base Units for Classification

- Document Sentiment/Topic Classification
    - Unit: Document
    - Label: Document's sentiment or topic
- POS Classification
    - Unit: Word
    - Label: Word's POS
- Sentence Segmentation
    - Unit: Word
    - Label: Whether the word is a sentence boundary
- Dialogue Act Classification
    - Unit: Utterance
    - Label: The dialogue act of the utterance

---

```{tip}
For NLP classification tasks, it is very important to determine the base units on which the classification is being made. 

This should always be made explicit when we come up with the research questions.

```
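
One way to make the base unit explicit is to write down a few (unit, label) pairs before building any features. The sketch below does this with hand-made toy examples; the texts and the dictionary name `task_examples` are illustrative assumptions, not data from the corpora used later.

```python
# Toy (unit, label) pairs per task. Every example here is made up.
task_examples = {
    "document sentiment": ("a wonderfully acted film", "pos"),  # unit: document
    "pos tagging": ("cats", "NNS"),                             # unit: word
    "sentence segmentation": (".", True),                       # unit: candidate token
    "dialogue act": ("hey everyone", "Greet"),                  # unit: utterance
}

for task, (unit, label) in task_examples.items():
    print(f"{task}: unit={unit!r}, label={label!r}")
```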

In [1]:
import nltk, random

## Document Sentiment Classification

In [2]:
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

- Find the 2,000 most frequent words in the entire corpus
- Use the presence or absence of each of these words as a document feature

In [3]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
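# most_common() sorts by frequency; plain iteration over a FreqDist
# does not guarantee that order in current NLTK versions
word_features = [w for (w, _) in all_words.most_common(2000)]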

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
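
The same idea can be tried without downloading a corpus. The sketch below is a minimal stand-in that assumes a three-document toy corpus and uses `collections.Counter` in place of `nltk.FreqDist`; the names `toy_documents`, `toy_word_features`, and `toy_document_features` are hypothetical.

```python
from collections import Counter

# Toy corpus (an assumption for illustration): tokenized documents with labels.
toy_documents = [
    (["a", "wonderful", "film"], "pos"),
    (["a", "dull", "film"], "neg"),
    (["wonderful", "acting"], "pos"),
]

# Count every token and keep the N most frequent as the feature vocabulary.
toy_counts = Counter(w.lower() for words, _ in toy_documents for w in words)
toy_word_features = [w for w, _ in toy_counts.most_common(3)]

def toy_document_features(document, vocabulary):
    """Map a tokenized document to boolean 'contains(word)' features."""
    document_words = set(document)  # a set makes each membership test cheap
    return {f"contains({w})": (w in document_words) for w in vocabulary}

features = toy_document_features(["a", "wonderful", "day"], toy_word_features)
print(features)
```

Every document gets the same feature keys, whether or not a word occurs in it, which is what lets the classifier compare documents feature by feature.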

In [4]:
print(document_features(movie_reviews.words('pos/cv957_8737.txt'))) 


{'contains(,)': True, 'contains(the)': True, 'contains(.)': True, 'contains(a)': True, 'contains(and)': True, 'contains(of)': True, 'contains(to)': True, "contains(')": True, 'contains(is)': True, 'contains(in)': True, 'contains(s)': True, 'contains(")': True, 'contains(it)': True, 'contains(that)': True, 'contains(-)': True, 'contains())': True, 'contains(()': True, 'contains(as)': True, 'contains(with)': True, 'contains(for)': True, 'contains(his)': True, 'contains(this)': True, 'contains(film)': False, 'contains(i)': False, 'contains(he)': True, 'contains(but)': True, 'contains(on)': True, 'contains(are)': True, 'contains(t)': False, 'contains(by)': True, 'contains(be)': True, 'contains(one)': True, 'contains(movie)': True, 'contains(an)': True, 'contains(who)': True, 'contains(not)': True, 'contains(you)': True, 'contains(from)': True, 'contains(at)': False, 'contains(was)': False, 'contains(have)': True, 'contains(they)': True, 'contains(has)': True, 'contains(her)': False, 'conta

In [5]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [6]:
print(nltk.classify.accuracy(classifier, test_set))

0.81


In [7]:
classifier.show_most_informative_features(5)

Most Informative Features
   contains(outstanding) = True              pos : neg    =     13.2 : 1.0
         contains(mulan) = True              pos : neg    =      9.1 : 1.0
         contains(damon) = True              pos : neg    =      8.2 : 1.0
        contains(seagal) = True              neg : pos    =      8.1 : 1.0
   contains(wonderfully) = True              pos : neg    =      7.9 : 1.0
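
Each ratio in this table compares how likely a feature value is under one label versus the other. The sketch below recomputes such a ratio from raw counts. The counts are invented, and the add-0.5 smoothing is an assumption about the estimator (NLTK's Naive Bayes trainer defaults to an expected-likelihood estimate), so read it as an illustration of the idea rather than a reproduction of the numbers above.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Lidstone-smoothed estimate of P(feature value | label)."""
    return (count + gamma) / (total + bins * gamma)

# Hypothetical counts: documents per label, and how many contain the word.
docs_per_label = {"pos": 950, "neg": 950}
docs_with_word = {"pos": 40, "neg": 3}

p_pos = smoothed_prob(docs_with_word["pos"], docs_per_label["pos"])
p_neg = smoothed_prob(docs_with_word["neg"], docs_per_label["neg"])

# Ratio of the two conditional probabilities, in the same form as the table.
ratio = p_pos / p_neg
print(f"pos : neg = {ratio:.1f} : 1.0")
```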


## Part-of-Speech Tagging

In [8]:
from nltk.corpus import brown
suffix_fdist = nltk.FreqDist()

for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

In [9]:
common_suffixes = [suffix for (suffix, count) in 
                   suffix_fdist.most_common(100)]

In [10]:
print(common_suffixes)

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', "'", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', "'s", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']
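
The suffix counting can be sketched on a handful of words with the standard library alone. The word list is a made-up assumption. Unlike the cell above, this variant collects the three slices in a set first, so that a one-character token such as `,` is counted once rather than three times; that deduplication is a design choice of this sketch, not part of the original code.

```python
from collections import Counter

# Toy word list (an assumption for illustration).
toy_words = ["running", "jumped", "quickly", "walking", "played", "the", "a"]

suffix_counts = Counter()
for word in toy_words:
    word = word.lower()
    # For short words the slices coincide; the set keeps each suffix once.
    for suffix in {word[-1:], word[-2:], word[-3:]}:
        suffix_counts[suffix] += 1

top_suffixes = [s for s, _ in suffix_counts.most_common(5)]
print(top_suffixes)
```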


In [11]:
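def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    # Return after the loop so that every suffix becomes a feature.
    # Returning inside the loop keeps only the first suffix; the accuracy
    # and prediction recorded below are consistent with that version.
    return features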

In [12]:
tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]

In [13]:
size = int(len(featuresets) * 0.1)

In [14]:
train_set, test_set = featuresets[size:], featuresets[:size]
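
The same 90/10 split is repeated for each task in this lecture, so it can be wrapped in a small helper. The sketch below is one way to do that; the function name `split_featuresets` and the parameter `test_fraction` are hypothetical. Like the cells above, it does not shuffle, so the test set is simply the first slice of the data in corpus order.

```python
def split_featuresets(featuresets, test_fraction=0.1):
    """Hold out the first `test_fraction` of the items as the test set."""
    size = int(len(featuresets) * test_fraction)
    return featuresets[size:], featuresets[:size]

# Toy feature sets (an assumption for illustration).
toy_featuresets = [({"f": i}, "label") for i in range(20)]
toy_train, toy_test = split_featuresets(toy_featuresets)
print(len(toy_train), len(toy_test))
```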


In [15]:
classifier = nltk.DecisionTreeClassifier.train(train_set)

In [16]:
nltk.classify.accuracy(classifier, test_set)

0.17135753356539035

In [17]:
classifier.classify(pos_features('cats'))

'IN'

## Sentence Boundary Detection

In [18]:
sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent) # append tokens of each sent to `tokens`
    offset += len(sent) # update the index of each word token
    boundaries.add(offset-1) # record the index of sent boundary token
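
The offset bookkeeping is easier to follow on a tiny input. The sketch below runs the same loop on two hand-written sentences (an assumption for illustration) and shows which token indices end up in the boundary set.

```python
# Two toy sentences, already tokenized.
toy_sents = [["Hello", "world", "."], ["How", "are", "you", "?"]]

toy_tokens = []
toy_boundaries = set()
offset = 0
for sent in toy_sents:
    toy_tokens.extend(sent)         # flatten into one token list
    offset += len(sent)             # offset now points past this sentence
    toy_boundaries.add(offset - 1)  # index of the sentence-final token

print(toy_tokens)
print(sorted(toy_boundaries))
```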


In [19]:
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}
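
To see what the classifier actually receives, the sketch below applies the same feature extractor to a hand-written token list (an assumption for illustration). The function is repeated so that the block runs on its own.

```python
def punct_features(tokens, i):
    # Same feature extractor as above, repeated for self-containment.
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

toy_tokens = ["Mr", ".", "Smith", "left", ".", "He", "waved", "."]

# Index 1 is the period after "Mr"; index 4 ends the first sentence.
for i in (1, 4):
    print(i, punct_features(toy_tokens, i))
```

Both periods are followed by a capitalized word, so only `prev-word` separates the abbreviation from the true boundary here. The function reads `tokens[i+1]` and `tokens[i-1]`, which is why the candidate indices must exclude the first and last tokens.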

In [20]:
# create featuresets from candidate punctuation tokens only;
# the label records whether each candidate is a true sentence boundary
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1) 
               if tokens[i] in '.?!']

In [21]:
size = int(len(featuresets) * 0.1)

In [22]:
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.936026936026936

In [23]:
classifier.classify(punct_features(tokens, 2))
tokens[0:2]

['.', 'START']

In [24]:
def segment_sentences(words):
    start = 0
    sents = []
    # skip the first and last tokens: punct_features needs both neighbors
    for i in range(1, len(words)-1):
        word = words[i]
        if word in '.?!' and classifier.classify(punct_features(words, i)):
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents
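
The segmentation loop can be tested without a trained model by passing the boundary decision in as a function. The sketch below is a variant under that assumption: `segment_with` and `rule_based_boundary` are hypothetical names, and the hand-written rule is only a stand-in for `classifier.classify`.

```python
def segment_with(words, is_boundary):
    """Split `words` into sentences using a pluggable boundary predicate."""
    sents, start = [], 0
    for i in range(1, len(words) - 1):
        if words[i] in {".", "?", "!"} and is_boundary(words, i):
            sents.append(words[start:i + 1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])  # remaining tokens, including the last one
    return sents

def rule_based_boundary(words, i):
    # Hypothetical stand-in for the trained classifier: require a capitalized
    # next token and a previous token that is not a known title.
    titles = {"mr", "mrs", "dr"}
    return words[i + 1][0].isupper() and words[i - 1].lower() not in titles

toy_words = ["Mr", ".", "Smith", "left", ".", "He", "waved", "."]
toy_result = segment_with(toy_words, rule_based_boundary)
print(toy_result)
```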

In [25]:
text = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."
nltk.word_tokenize(text)[:20]

['Python',
 'is',
 'an',
 'interpreted',
 ',',
 'high-level',
 'and',
 'general-purpose',
 'programming',
 'language',
 '.',
 'Python',
 "'s",
 'design',
 'philosophy',
 'emphasizes',
 'code',
 'readability',
 'with',
 'its']

In [26]:

segment_sentences(nltk.word_tokenize(text))

[['Python',
  'is',
  'an',
  'interpreted',
  ',',
  'high-level',
  'and',
  'general-purpose',
  'programming',
  'language',
  '.'],
 ['Python',
  "'s",
  'design',
  'philosophy',
  'emphasizes',
  'code',
  'readability',
  'with',
  'its',
  'notable',
  'use',
  'of',
  'significant',
  'indentation',
  '.'],
 ['Its',
  'language',
  'constructs',
  'and',
  'object-oriented',
  'approach',
  'aim',
  'to',
  'help',
  'programmers',
  'write',
  'clear',
  ',',
  'logical',
  'code',
  'for',
  'small',
  'and',
  'large-scale',
  'projects',
  '.']]

## Dialogue Act Classification

- NPS Chat Corpus consists of over 10,000 posts from instant messaging sessions.
- These posts have been labeled with one of 15 dialogue act types.

In [27]:
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

In [28]:
[p.text for p in posts[:10]]

['now im left with this gay name',
 ':P',
 'PART',
 'hey everyone  ',
 'ah well',
 'NICK :10-19-20sUser7',
 '10-19-20sUser7 is a gay name.',
 '.ACTION gives 10-19-20sUser121 a golf clap.',
 ':)',
 'JOIN']

In [29]:
# bag-of-words
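def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    # Return after the loop so that every token contributes a feature.
    # Returning inside the loop keeps only the first token; the results
    # recorded below presumably reflect that first-token-only version.
    return features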
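
A bag-of-words extractor of this kind can be tried without NLTK's tokenizer models. The sketch below is a minimal variant that assumes whitespace tokenization via `str.split`; the function name `bag_of_words_features` and its `tokenize` parameter are hypothetical.

```python
def bag_of_words_features(post, tokenize=str.split):
    """Return {'contains(token)': True} for every distinct lowercased token."""
    features = {}
    for word in tokenize(post):
        features[f"contains({word.lower()})"] = True
    return features  # returned once, after all tokens have been added

toy_features = bag_of_words_features("hey everyone HEY")
print(toy_features)
```

Repeated tokens collapse into a single key, so the representation records presence rather than frequency, and an empty post yields an empty dictionary.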

In [30]:
featuresets = [(dialogue_act_features(post.text), post.get('class'))
               for post in posts]

In [31]:
size = int(len(featuresets) * 0.1)

train_set, test_set = featuresets[size:], featuresets[:size]

In [32]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [33]:
print('Accuracy: {:4.2f}'.format(nltk.classify.accuracy(classifier, test_set))) 

Accuracy: 0.75


In [34]:
test_featureset = [f for (f, l) in test_set]
test_label = [l for (f, l) in test_set] 

In [35]:

test_label_predicted = [classifier.classify(f) for f in test_featureset]

cm = nltk.ConfusionMatrix(test_label, test_label_predicted)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))

           |                                  w      y                      |
           |      S                           h      n                    C |
           |      t                           Q      Q             E      o |
           |      a             E             u      u             m      n |
           |      t      S      m             e      e      A      p      t |
           |      e      y      o      G      s      s      c      h      i |
           |      m      s      t      r      t      t      c      a      n |
           |      e      t      i      e      i      i      e      s      u |
           |      n      e      o      e      o      o      p      i      e |
           |      t      m      n      t      n      n      t      s      r |
-----------+----------------------------------------------------------------+
 Statement | <27.7%>  0.1%   0.6%   0.2%   0.2%   0.9%   0.3%   0.2%   0.8% |
    System |   0.1% <20.8%>     .      .      .      .      .   
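
The counts behind a confusion matrix are just tallies of (gold, predicted) pairs. The sketch below builds them with `collections.Counter` from two short hand-written label lists (an assumption for illustration, not output from the classifier above).

```python
from collections import Counter

# Toy gold and predicted labels, aligned by position.
gold = ["Statement", "System", "Greet", "Statement", "ynQuestion"]
predicted = ["Statement", "System", "Statement", "Statement", "Statement"]

# Each key is a (gold, predicted) cell of the matrix; the value is its count.
confusion = Counter(zip(gold, predicted))
for (g, p), n in sorted(confusion.items()):
    print(f"gold={g:<12} predicted={p:<12} count={n}")

# Accuracy is the share of pairs that fall on the diagonal.
toy_accuracy = sum(n for (g, p), n in confusion.items() if g == p) / len(gold)
print(toy_accuracy)
```

Off-diagonal cells with large counts show which dialogue acts the classifier tends to confuse, which a single accuracy figure hides.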

## References

- [NLTK Book Chapter 6: Learning to Classify Text](https://www.nltk.org/book/ch06.html)