## 6. Learning to Classify Text

### 1.1 Gender Identification

#### In Chapter 4, we saw that male and female names have some distinctive characteristics. Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Let's build a classifier to model these differences more precisely.

#### The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we'll start by just looking at the final letter of a given name. The following feature extractor function builds a dictionary containing relevant information about a given name:

In [None]:
import nltk

In [None]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [None]:
gender_features('John')

#### The returned dictionary, known as a feature set, maps from feature names to their values. Feature names are case-sensitive strings that typically provide a short human-readable description of the feature, as in the example 'last_letter'.

#### Now that we've defined a feature extractor, we need to prepare a list of examples and corresponding class labels. 

In [None]:
from nltk.corpus import names

In [None]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

In [None]:
labeled_names

In [None]:
labeled_names[-10:]

In [None]:
len(labeled_names)

In [None]:
import random

In [None]:
random.seed(1)

In [None]:
random.shuffle(labeled_names)

#### Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a training set and a test set. The training set is used to train a new "naive Bayes" classifier. (See Slides)


In [None]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

In [None]:
train_set, test_set = featuresets[500:], featuresets[:500]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

#### We will learn more about the naive Bayes classifier later in the chapter. For now, let's just test it out on some names that did not appear in its training data:

In [None]:
classifier.classify(gender_features('Neo'))

In [None]:
classifier.classify(gender_features('Trinity'))

#### Observe that these character names from The Matrix are correctly classified. Although this science fiction movie is set in 2199, it still conforms with our expectations about names and genders. We can systematically evaluate the classifier on a much larger quantity of unseen data:

In [None]:
print(nltk.classify.accuracy(classifier, test_set))
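
#### Conceptually, accuracy is just the fraction of test examples whose predicted label matches the gold label. The following plain-Python sketch illustrates that idea without NLTK; the accuracy() helper, the toy_classify() rule and the toy_test_set data are hypothetical stand-ins made up for illustration, not NLTK's implementation.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of examples whose predicted label equals the gold label."""
    correct = sum(1 for (features, label) in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)

# Hypothetical stand-in for a trained classifier: guess from the last letter.
def toy_classify(features):
    return 'female' if features['last_letter'] in 'aei' else 'male'

# Made-up test examples in the same (featureset, label) layout as test_set.
toy_test_set = [
    ({'last_letter': 'a'}, 'female'),   # e.g. "Anna"
    ({'last_letter': 'n'}, 'male'),     # e.g. "John"
    ({'last_letter': 'e'}, 'male'),     # e.g. "Mike": the rule gets this wrong
    ({'last_letter': 'y'}, 'female'),   # e.g. "Mary": the rule gets this wrong
]

toy_accuracy = accuracy(toy_classify, toy_test_set)
print(toy_accuracy)
```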

#### Finally, we can examine the classifier to determine which features it found most effective for distinguishing the names' genders:

In [None]:
classifier.show_most_informative_features(5)

#### This listing shows that the names in the training set that end in "a" are female 36 times more often than they are male, but names that end in "k" are male 32 times more often than they are female. These ratios are known as likelihood ratios, and can be useful for comparing different feature-outcome relationships.
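
#### To make the idea of a likelihood ratio concrete, here is a minimal sketch that computes one by hand for a made-up list of eight names. It assumes plain, unsmoothed relative frequencies; NLTK's naive Bayes classifier smooths its probability estimates by default, so the ratios it reports will differ. The toy_names data and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical miniature training set of (name, gender) pairs.
toy_names = [
    ('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
    ('Kate', 'female'), ('Joshua', 'male'), ('Mark', 'male'),
    ('Jack', 'male'), ('Peter', 'male'),
]

label_counts = Counter(gender for (_, gender) in toy_names)
letter_counts = Counter((name[-1], gender) for (name, gender) in toy_names)

def likelihood(letter, gender):
    """Unsmoothed estimate of P(last_letter = letter | gender)."""
    return letter_counts[(letter, gender)] / label_counts[gender]

def likelihood_ratio(letter, gender, other):
    """How many times more likely the letter is under `gender` than under `other`."""
    return likelihood(letter, gender) / likelihood(letter, other)

ratio_a = likelihood_ratio('a', 'female', 'male')
print(ratio_a)
```

#### In this toy data, 3 of the 4 female names but only 1 of the 4 male names end in "a", so the ratio is 0.75 / 0.25 = 3. An unsmoothed ratio is undefined when a letter never occurs with one of the labels, which is one reason smoothing is used in practice.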

### Exercise 1. Use this classifier to test your own names or any names of your own choosing.

### Exercise 2. Modify the gender_features() function to provide the classifier with features encoding the length of the name, or its first letter. Retrain the classifier with these new features, and test its accuracy. (3 minutes)

### 1.2   Choosing The Right Features

#### Selecting relevant features and deciding how to encode them for a learning method can have an enormous impact on the learning method's ability to extract a good model. Much of the interesting work in building a classifier is deciding what features might be relevant, and how we can represent them. Although it's often possible to get decent performance by using a fairly simple and obvious set of features, there are usually significant gains to be had by using carefully constructed features based on a thorough understanding of the task at hand.

#### Typically, feature extractors are built through a process of trial-and-error, guided by intuitions about what information is relevant to the problem.

#### It's common to start with a "kitchen sink" approach, including all the features that you can think of, and then checking to see which features actually are helpful.

In [None]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [None]:
gender_features2('John')

#### However, there are usually limits to the number of features that you should use with a given learning algorithm — if you provide too many features, then the algorithm will have a higher chance of relying on idiosyncrasies of your training data that don't generalize well to new examples. This problem is known as overfitting, and can be especially problematic when working with small training sets. (See Slides)

In [None]:
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]

In [None]:
train_set, test_set = featuresets[500:], featuresets[:500]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
print(nltk.classify.accuracy(classifier, test_set))

#### Once an initial set of features has been chosen, a very productive method for refining the feature set is error analysis. First, we select a development set, containing the corpus data for creating the model. This development set is then subdivided into the training set and the dev-test set. (See Slides)

In [None]:
train_names = labeled_names[1500:]

In [None]:
devtest_names = labeled_names[500:1500]

In [None]:
test_names = labeled_names[:500]
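
#### The three slices above can be wrapped in a small helper so that the split sizes are stated in one place. This is only a sketch: split_corpus() and its parameter names are hypothetical, and it assumes the list has already been shuffled.

```python
def split_corpus(items, n_test, n_devtest):
    """Return (train, devtest, test) slices of an already shuffled list."""
    test = items[:n_test]
    devtest = items[n_test:n_test + n_devtest]
    train = items[n_test + n_devtest:]
    return train, devtest, test

toy_items = list(range(20))   # stand-in for labeled_names
toy_train, toy_devtest, toy_test = split_corpus(toy_items, n_test=5, n_devtest=10)
print(len(toy_train), len(toy_devtest), len(toy_test))
```

#### Calling split_corpus(labeled_names, n_test=500, n_devtest=1000) would reproduce the slices used in this section.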

#### Having divided the corpus into appropriate datasets, we train a model using the training set [1], and then run it on the dev-test set [2].

In [None]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]

In [None]:
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]

In [None]:
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set) #1

In [None]:
print(nltk.classify.accuracy(classifier, devtest_set)) #2

#### Using the dev-test set, we can generate a list of the errors that the classifier makes when predicting name genders:

In [None]:
errors = []

In [None]:
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

#### We can then examine individual error cases where the model predicted the wrong label, and try to determine what additional pieces of information would allow it to make the right decision (or which existing pieces of information are tricking it into making the wrong decision). The feature set can then be adjusted accordingly. The names classifier that we have built generates about 100 errors on the dev-test corpus:

In [None]:
for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8} name={:<30}'.format(tag, guess, name))
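
#### With around a hundred errors, patterns can be hard to spot by eye. One possible aid, sketched here under the assumption that errors holds (tag, guess, name) triples as above, is to tally the misclassified names by suffix. The toy_errors list and the errors_by_suffix() helper are made up for illustration and are not output from the classifier.

```python
from collections import Counter

# Hypothetical error triples in the same (tag, guess, name) layout as `errors`.
toy_errors = [
    ('female', 'male', 'Carolyn'),
    ('female', 'male', 'Kathryn'),
    ('female', 'male', 'Jocelyn'),
    ('male', 'female', 'Aldrich'),
    ('male', 'female', 'Mitch'),
    ('male', 'female', 'Luke'),
]

def errors_by_suffix(error_list, n=2):
    """Count (suffix, correct tag) pairs among the misclassified names."""
    return Counter((name[-n:].lower(), tag) for (tag, guess, name) in error_list)

suffix_counts = errors_by_suffix(toy_errors)
for (suffix, tag), count in suffix_counts.most_common(3):
    print(suffix, tag, count)
```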

#### Looking through this list of errors makes it clear that some suffixes that are more than one letter can be indicative of name genders. For example, names ending in yn appear to be predominantly female, despite the fact that names ending in n tend to be male; and names ending in ch are usually male, even though names that end in h tend to be female. We therefore adjust our feature extractor to include features for two-letter suffixes:

In [None]:
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

#### Rebuilding the classifier with the new feature extractor, we can check whether the performance on the dev-test dataset improves:

In [None]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]

In [None]:
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
print(nltk.classify.accuracy(classifier, devtest_set))

### 1.3   Document Classification

#### In Chapter 1, we saw several examples of corpora where documents have been labeled with categories. Using these corpora, we can build classifiers that will automatically tag new documents with appropriate category labels. First, we construct a list of documents, labeled with the appropriate categories. For this example, we've chosen the Movie Reviews Corpus, which categorizes each review as positive or negative.

In [None]:
from nltk.corpus import movie_reviews

In [None]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

In [None]:
movie_reviews.categories()

In [None]:
movie_reviews.fileids('neg')

In [None]:
movie_reviews.words('neg/cv000_29416.txt')

In [None]:
documents[2]

In [None]:
random.seed(2)

In [None]:
random.shuffle(documents)

#### Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it should pay attention to. For document topic identification, we can define a feature for each word, indicating whether the document contains that word. To limit the number of features that the classifier needs to process, we begin by constructing a list of the 2000 most frequent words in the overall corpus [1]. We can then define a feature extractor [2] that simply checks whether each of these words is present in a given document. The document is converted to a set [3] because checking whether a word occurs in a set is much faster than checking whether it occurs in a list.

In [None]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

In [None]:
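# list(all_words) follows insertion order, not frequency, so use most_common()
word_features = [word for (word, count) in all_words.most_common(2000)] #1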

In [None]:
word_features

In [None]:
def document_features(document): #2
    document_words = set(document) #3
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [None]:
print(document_features(movie_reviews.words('pos/cv957_8737.txt'))) 
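
#### The same idea can be tried on a tiny hand-made corpus without downloading any data. This sketch uses collections.Counter from the standard library in place of nltk.FreqDist; the toy_docs data and the toy_vocab and toy_document_features() names are hypothetical.

```python
from collections import Counter

# Hypothetical two-document corpus, already tokenized and labeled.
toy_docs = [
    (['great', 'film', 'great', 'fun'], 'pos'),
    (['dull', 'film', 'great'], 'neg'),
]

# Count every word in the corpus and keep the 2 most frequent as the vocabulary.
toy_counts = Counter(w.lower() for (words, _) in toy_docs for w in words)
toy_vocab = [w for (w, _) in toy_counts.most_common(2)]

def toy_document_features(document, vocab):
    document_words = set(document)   # set lookup is faster than list lookup
    return {'contains({})'.format(w): (w in document_words) for w in vocab}

toy_features = toy_document_features(['a', 'great', 'cast'], toy_vocab)
print(toy_vocab)
print(toy_features)
```

#### Words outside the vocabulary ("a" and "cast" here) produce no features at all, which is how the vocabulary limit keeps the feature set small.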

#### Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews. To check how reliable the resulting classifier is, we compute its accuracy on the test set [1]. And once again, we can use show_most_informative_features() to find out which features the classifier found to be most informative.

In [None]:
featuresets = [(document_features(d), c) for (d,c) in documents]

In [None]:
train_set, test_set = featuresets[100:], featuresets[:100]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
print(nltk.classify.accuracy(classifier, test_set)) #1

In [None]:
classifier.show_most_informative_features(5)

### Exercise 3. Use this classifier to classify a new text (e.g. customer reviews) of your own choosing.

### 1.4   Part-of-Speech Tagging

#### In Chapter 5, we built a regular expression tagger that chooses a part-of-speech tag for a word by looking at the internal make-up of the word. However, this regular expression tagger had to be hand-crafted. Instead, we can train a classifier to work out which suffixes are most informative. Let's begin by finding out what the most common suffixes are:

In [None]:
from nltk.corpus import brown

In [None]:
suffix_fdist = nltk.FreqDist()

In [None]:
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

In [None]:
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]

In [None]:
print(common_suffixes)

#### Next, we'll define a feature extractor function which checks a given word for these suffixes:

In [None]:
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

In [None]:
print(pos_features("visited"))

#### Now that we've defined our feature extractor, we can use it to train a classifier. To train a decision tree classifier instead of naive Bayes, use nltk.DecisionTreeClassifier.

In [None]:
tagged_words = brown.tagged_words(categories='news')

In [None]:
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]

In [None]:
size = int(len(featuresets) * 0.1)

In [None]:
train_set, test_set = featuresets[size:], featuresets[:size]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
classifier.classify(pos_features('cats'))

### Exercise 4. Use this classifier to classify a new word of your own choosing.

### 1.5   Exploiting Context

#### By augmenting the feature extraction function, we could modify this part-of-speech tagger to leverage a variety of other word-internal features, such as the length of the word, the number of syllables it contains, or its prefix. However, as long as the feature extractor just looks at the target word, we have no way to add features that depend on the context that the word appears in. But contextual features often provide powerful clues about the correct tag — for example, when tagging the word "fly," knowing that the previous word is "a" will allow us to determine that it is functioning as a noun, not a verb.

#### In order to accommodate features that depend on a word's context, we must revise the pattern that we used to define our feature extractor. Instead of just passing in the word to be tagged, we will pass in a complete (untagged) sentence, along with the index of the target word.

In [None]:
def pos_features(sentence, i): #1
    features = {"suffix(1)": sentence[i][-1],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

In [None]:
brown.sents()[0]

In [None]:
pos_features(brown.sents()[0], 8)

In [None]:
tagged_sents = brown.tagged_sents(categories='news')

In [None]:
featuresets = []

In [None]:
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append(( pos_features(untagged_sent, i), tag))


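#### The nltk.tag.untag() function used above takes a tagged sentence and returns an untagged version of it, i.e., a list containing the first element of each tuple in the tagged sentence. The following small example shows how it is used to build feature sets: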

In [None]:
featuresets_ex = []

In [None]:
tagged_sent_ex = [('John', 'NNP'), ('saw', 'VBD'), ('Mary', 'NNP')]

In [None]:
untagged_sent_ex = nltk.tag.untag(tagged_sent_ex)

In [None]:
untagged_sent_ex

In [None]:
for i, (word, tag) in enumerate(tagged_sent_ex):
    featuresets_ex.append((pos_features(untagged_sent_ex, i), tag))

In [None]:
featuresets_ex

In [None]:
size = int(len(featuresets) * 0.1)

In [None]:
train_set, test_set = featuresets[size:], featuresets[:size]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(classifier, test_set)

### 1.6   Sequence Classification

#### One sequence classification strategy, known as consecutive classification or greedy sequence classification, is to find the most likely class label for the first input, then to use that answer to help find the best label for the next input. The process can then be repeated until all of the inputs have been labeled. This is the approach that was taken by the bigram tagger from Chapter 5, which began by choosing a part-of-speech tag for the first word in the sentence, and then chose the tag for each subsequent word based on the word itself and the predicted tag for the previous word.

#### First, we must augment our feature extractor function to take a history argument, which provides a list of the tags that we've predicted for the sentence so far [1]. Each tag in history corresponds with a word in sentence. But note that history will only contain tags for words we've already classified, that is, words to the left of the target word.

#### Having defined a feature extractor, we can proceed to build our sequence classifier [2]. During training, we use the annotated tags to provide the appropriate history to the feature extractor, but when tagging new sentences, we generate the history list based on the output of the tagger itself.

In [None]:
def pos_features(sentence, i, history): #1
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI): #2

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        # In Python 3, zip() returns an iterator; build a list of (word, tag) pairs
        return list(zip(sentence, history))
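
#### The tagging loop can be illustrated in isolation by replacing the trained classifier with a hand-written rule. In this sketch, toy_classify() is a hypothetical stand-in that looks only at the current word and the previously predicted tag, and the tag names are simplified labels rather than Brown Corpus tags.

```python
def toy_classify(word, prev_tag):
    """Hypothetical stand-in for a trained classifier."""
    if word in ('a', 'the'):
        return 'DET'
    if word in ('they', 'we'):
        return 'PRON'
    if prev_tag == 'DET':
        return 'NOUN'   # a word right after a determiner is treated as a noun
    return 'VERB'

def greedy_tag(sentence):
    history = []   # tags predicted so far, one per word already processed
    for i, word in enumerate(sentence):
        prev_tag = history[i - 1] if i > 0 else '<START>'
        history.append(toy_classify(word, prev_tag))
    return list(zip(sentence, history))

tagged_a_fly = greedy_tag(['a', 'fly'])
tagged_they_fly = greedy_tag(['they', 'fly'])
print(tagged_a_fly)
print(tagged_they_fly)
```

#### The word "fly" receives a different tag in each sentence because the decision for it depends on the tag already predicted for the word to its left.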

In [None]:
tagged_sents = brown.tagged_sents(categories='news')

In [None]:
size = int(len(tagged_sents) * 0.1)

In [None]:
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]

In [None]:
tagger = ConsecutivePosTagger(train_sents)

In [None]:
print(tagger.evaluate(test_sents))

### Exercise 5. Use this classifier to tag a new sentence of your own choosing.

## 2   Further Examples of Supervised Classification

### 2.1   Sentence Segmentation

#### Sentence segmentation can be viewed as a classification task for punctuation: whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence.

#### The first step is to obtain some data that has already been segmented into sentences and convert it into a form that is suitable for extracting features:

In [None]:
import nltk

In [None]:
sents = nltk.corpus.treebank_raw.sents()

In [None]:
sents[0]

In [None]:
sents[1]

In [None]:
tokens = []

In [None]:
boundaries = set()

In [None]:
offset = 0

#### The append() method adds its argument as a single element to the end of a list, so the list grows by exactly one element. The extend() method, by contrast, iterates over its argument and adds each element to the list, so the list grows by the length of the argument.
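
#### The difference is easy to see on a small example (the two sentences here are made up):

```python
flat = []
nested = []
for sent in [['Hello', '.'], ['How', 'are', 'you', '?']]:
    flat.extend(sent)     # adds each token of the sentence individually
    nested.append(sent)   # adds the whole sentence as one element

print(flat)
print(nested)
print(len(flat), len(nested))
```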

In [None]:
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)

In [None]:
tokens

In [None]:
tokens1=[]

In [None]:
for sent in sents:
    tokens1.append(sent)

In [None]:
tokens1

#### Here, tokens is a merged list of tokens from the individual sentences, and boundaries is a set containing the indexes of all sentence-boundary tokens. Next, we need to specify the features of the data that will be used in order to decide whether punctuation indicates a sentence-boundary:

In [None]:
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

#### Based on this feature extractor, we can create a list of labeled featuresets by selecting all the punctuation tokens, and tagging whether they are boundary tokens or not:

In [None]:
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']

#### Using these featuresets, we can train and evaluate a punctuation classifier:

In [None]:
size = int(len(featuresets) * 0.1)

In [None]:
train_set, test_set = featuresets[size:], featuresets[:size]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(classifier, test_set)

#### The following small example illustrates how the code above works:

In [None]:
tokens_ex = []

In [None]:
offset_ex = 0

In [None]:
boundaries_ex = set()

In [None]:
sents_ex = [["The", "first", "word", "is", "NLP", "."],
            ["What", "is", "the", "next", "word", "?"]]

In [None]:
for sent in sents_ex:
    tokens_ex.extend(sent)
    offset_ex += len(sent)
    boundaries_ex.add(offset_ex-1)

In [None]:
boundaries_ex

In [None]:
tokens_ex

In [None]:
len(tokens_ex)

In [None]:
list(range(1,11))

In [None]:
featuresets_ex = [(punct_features(tokens_ex, i), (i in boundaries_ex))
               for i in range(1, len(tokens_ex)-1)
               if tokens_ex[i] in '.?!']

In [None]:
featuresets_ex

#### To use this classifier to perform sentence segmentation, we simply check each punctuation mark to see whether it's labeled as a boundary, and divide the list of words at the boundary marks. The following function shows how this can be done.

In [None]:
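def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # punct_features() reads words[i+1], so the final token is skipped here;
        # any remaining words are collected after the loop.
        if (i < len(words) - 1 and word in '.?!'
                and classifier.classify(punct_features(words, i))):
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents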

In [None]:
segment_sentences(['I','am','going','to','attend','a','zoom','meeting','.','This','meeting','is','about','NLP'])
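
#### The segmentation loop does not depend on how boundaries are predicted. As a sketch, the version below takes the boundary decision as a function argument; segment_with() and the next_word_capitalized() rule are hypothetical stand-ins for the trained classifier, chosen so that the example runs without any corpus data.

```python
def segment_with(words, is_boundary):
    """Split `words` after each punctuation token that is_boundary() accepts."""
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in ('.', '?', '!') and is_boundary(words, i):
            sents.append(words[start:i + 1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])   # keep any trailing words
    return sents

def next_word_capitalized(words, i):
    """Hypothetical rule: a boundary ends the text or precedes a capitalized word."""
    return i + 1 == len(words) or words[i + 1][0].isupper()

toy_words = ['see', 'p', '.', '12', 'for', 'details', '.', 'Thanks', '!']
toy_sents = segment_with(toy_words, next_word_capitalized)
print(toy_sents)
```

#### The period after "p" is not treated as a boundary because the next token is not capitalized, so the first sentence runs through "details .".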

### 2.2   Identifying Dialogue Act Types

#### When processing dialogue, it can be useful to think of utterances as a type of action performed by the speaker. This interpretation is most straightforward for performative statements such as "I forgive you" or "I bet you can't climb that hill." But greetings, questions, answers, assertions, and clarifications can all be thought of as types of speech-based actions. Recognizing the dialogue acts underlying the utterances in a dialogue can be an important first step in understanding the conversation.

#### The NPS Chat Corpus, which was demonstrated in Chapter 1, consists of over 10,000 posts from instant messaging sessions. These posts have all been labeled with one of 15 dialogue act types, such as "Statement", "Emotion", "ynQuestion", and "Continuer". We can therefore use this data to build a classifier that can identify the dialogue act types for new instant messaging posts. The first step is to extract the basic messaging data. We will call xml_posts() to get a data structure representing the XML annotation for each post:

In [None]:
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

#### Next, we'll define a simple feature extractor that checks what words the post contains:

In [None]:
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

In [None]:
post_ex = posts[6001]

In [None]:
post_ex.text

In [None]:
dialogue_act_features(post_ex.text)

In [None]:
post_ex.get("class")

#### Finally, we construct the training and testing data by applying the feature extractor to each post (using post.get('class') to get a post's dialogue act type), and create a new classifier:

In [None]:
featuresets = [(dialogue_act_features(post.text), post.get('class'))
              for post in posts]

In [None]:
size = int(len(featuresets) * 0.1)

In [None]:
train_set, test_set = featuresets[size:], featuresets[:size]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
print(nltk.classify.accuracy(classifier, test_set))