## Chapter 6

### 1. Using Naive Bayes classifier described in this chapter, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. 

In [1]:
import nltk

In [2]:
from nltk.corpus import names

In [239]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
   [(name, 'female') for name in names.words('female.txt')])

In [240]:
len(labeled_names)

7944

In [241]:
import random

In [242]:
random.seed(1)

In [243]:
random.shuffle(labeled_names)

In [244]:
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

In [245]:
train_names = labeled_names[1000:]

In [246]:
devtest_names = labeled_names[500:1000]

In [247]:
test_names = labeled_names[:500]

In [248]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]

In [249]:
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]

In [250]:
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

In [251]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [252]:
print(nltk.classify.accuracy(classifier, devtest_set))

0.782


In [257]:
errors = []

In [258]:
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

In [259]:
for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8} name={:<30}'.format(tag, guess, name))

correct=female   guess=male     name=Allyson                       
correct=female   guess=male     name=Alyss                         
correct=female   guess=male     name=Arabel                        
correct=female   guess=male     name=Ardis                         
correct=female   guess=male     name=Bird                          
correct=female   guess=male     name=Bliss                         
correct=female   guess=male     name=Britaney                      
correct=female   guess=male     name=Cam                           
correct=female   guess=male     name=Carey                         
correct=female   guess=male     name=Caroljean                     
correct=female   guess=male     name=Chriss                        
correct=female   guess=male     name=Christan                      
correct=female   guess=male     name=Consuelo                      
correct=female   guess=male     name=Courtenay                     
correct=female   guess=male     name=Cynthy     

In [260]:
len(errors)

104

In [253]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

In [254]:
train_set, test_set = featuresets[500:], featuresets[:500]

In [255]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [256]:
print(nltk.classify.accuracy(classifier, test_set))

0.798


In [300]:
def gender_features2(name):
    features2 = {}
    features2["first_letter"] = name[0].lower()
    features2["suffix1"] = name[-1:].lower()
    features2["suffix2"] = name[-2:].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features2["count({})".format(letter)] = name.lower().count(letter)
    return features2

In [301]:
train_set = [(gender_features2(n), gender) for (n, gender) in train_names]

In [302]:
devtest_set = [(gender_features2(n), gender) for (n, gender) in devtest_names]

In [303]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [304]:
print(nltk.classify.accuracy(classifier, devtest_set))

0.79


In [305]:
errors2 = []

In [306]:
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features2(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

In [307]:
for (tag, guess, name) in sorted(errors2):
    print('correct={:<8} guess={:<8} name={:<30}'.format(tag, guess, name))

In [308]:
len(errors2)

0

In [309]:
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]

In [310]:
train_set, test_set = featuresets[500:], featuresets[:500]

In [311]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [312]:
print(nltk.classify.accuracy(classifier, test_set))

0.812


### 2. Using the movie review document classifier discussed in Chapter 6- Section 1.3 ( constructing a list of the 2500 most frequent words as features and use the first 150 documents as the test dataset) , generate a list of the 10 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising?

In [313]:
from nltk.corpus import movie_reviews

In [314]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

In [316]:
random.seed(2)

In [317]:
random.shuffle(documents)

In [318]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

In [319]:
word_features = list(all_words)[:2500]

In [320]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [321]:
featuresets = [(document_features(d), c) for (d,c) in documents]

In [322]:
train_set, test_set = featuresets[150:], featuresets[:150]

In [323]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [324]:
print(nltk.classify.accuracy(classifier, test_set))

0.8666666666666667


In [325]:
classifier.show_most_informative_features(10)

Most Informative Features
        contains(turkey) = True              neg : pos    =     12.0 : 1.0
        contains(annual) = True              pos : neg    =      9.8 : 1.0
       contains(frances) = True              pos : neg    =      9.1 : 1.0
 contains(unimaginative) = True              neg : pos    =      7.6 : 1.0
        contains(regard) = True              pos : neg    =      7.1 : 1.0
        contains(suvari) = True              neg : pos    =      6.9 : 1.0
          contains(mold) = True              neg : pos    =      6.9 : 1.0
          contains(mena) = True              neg : pos    =      6.9 : 1.0
    contains(schumacher) = True              neg : pos    =      6.5 : 1.0
       contains(singers) = True              pos : neg    =      6.4 : 1.0


In [326]:
# The above code line generates the 10 most informative features. Those particular features are informative because most of the words are not neutral but with positive or negative meanings so we may use them to evaluate if the comments are positive or not.The "turkey" and "schumacher" surprised me because they indicate negative context considering they are just noun. So I would assume they have different meanings in spoken english or slang.

### 3. Select one of the classification tasks described in this chapter, such as name gender detection, document classification, part-of-speech tagging, or dialog act classification. Using the same training and test data, and the same feature extractor, build three classifiers for the task: a decision tree, a naive Bayes classifier, and a  Maximum Entropy classifier. Compare the performance of the three classifiers on your selected task.

In [327]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

In [328]:
word_features = list(all_words)[:2500]

In [329]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [330]:
featuresets = [(document_features(d), c) for (d,c) in documents]

In [331]:
train_set, test_set = featuresets[150:], featuresets[:150]

In [332]:
classifier1 = nltk.NaiveBayesClassifier.train(train_set)

In [333]:
classifier2 = nltk.DecisionTreeClassifier.train(train_set)

In [334]:
classifier3 = nltk.MaxentClassifier.train(train_set)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.497


  exp_nf_delta = 2 ** nf_delta
  sum1 = numpy.sum(exp_nf_delta * A, axis=0)
  sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
  deltas -= (ffreq_empirical - sum1) / -sum2


         Final               nan        0.503


In [335]:
print(nltk.classify.accuracy(classifier1, test_set))

0.8666666666666667


In [336]:
print(nltk.classify.accuracy(classifier2, test_set))

0.64


In [337]:
print(nltk.classify.accuracy(classifier3, test_set))

0.46


### 4. Identify the NPS Chat Corpus, which was demonstrated in Chapter 2, consists of over 15,000 posts from instant messaging sessions. These posts have all been labeled with one of 15 dialogue act types, such as "Statement," "Emotion," "ynQuestion", and "Continuer." We can therefore use this data to build a classifier that can identify the dialogue act types for new instant messaging posts. Build a simple feature extractor that checks what words the post contains. Construct the training and testing data by applying the feature extractor to each post and create a Naïve Bayes classifier. Please print the accuracy of this classifier. We use the first 15,000 messages from these instant messages as our dataset and use 8% data as our test data.

In [339]:
posts = nltk.corpus.nps_chat.xml_posts()[:15000]

In [340]:
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

In [341]:
featuresets = [(dialogue_act_features(post.text), post.get('class'))
              for post in posts]

In [342]:
size = int(len(featuresets) * 0.08)

In [343]:
train_set, test_set = featuresets[size:], featuresets[:size]

In [344]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [345]:
print(nltk.classify.accuracy(classifier, test_set))

0.676923076923077


In [None]:
### 5.Given the following confusion matrix, please calculate: a) Accuracy Rate; b) Precision; c) Recall; d) F-Measure.

	No	Yes
No	104	33
Yes	13	50


In [351]:
# a) Accuracy Rate
observations = 104 + 33 + 13 + 50
print("Number of observations", observations)

Number of observations 200


In [352]:
accuracy = (104 + 50) / 200
print("The accuracy rate", accuracy)

The accuracy rate 0.77


In [353]:
# b) Precision
precision = 50 / (50 + 13)
print("Precision", precision)

Precision 0.7936507936507936


In [354]:
# c) Recall
recall = 50 / (50 + 33)
print("Recall", recall)

Recall 0.6024096385542169


In [355]:
# d) F-Measure
F = (2 * precision * recall) / (precision + recall)
print("F-Measure", F)

F-Measure 0.6849315068493151


## Chapter 7

### 6. Write a tag pattern to match noun phrases containing plural head nouns in the following sentence: "Many researchers discussed this project for two weeks." Try to do this by generalizing the tag pattern that handled singular noun phrases too. Please 1) pos-tag this sentence 2) write a tag pattern (i.e. grammar); 3) use RegexpParser to parse the sentence and 4) print out the result containing NP (noun phrases).

In [381]:
sentence = "Many researchers discussed this project for two weeks."

In [382]:
# 1)
def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    print(sentences)

In [383]:
ie_preprocess(sentence)

[[('Many', 'JJ'), ('researchers', 'NNS'), ('discussed', 'VBD'), ('this', 'DT'), ('project', 'NN'), ('for', 'IN'), ('two', 'CD'), ('weeks', 'NNS'), ('.', '.')]]


In [384]:
sentence = [('Many', 'JJ'), ('researchers', 'NNS'), ('discussed', 'VBD'), ('this', 'DT'), ('project', 'NN'), ('for', 'IN'), ('two', 'CD'), ('weeks', 'NNS'), ('.', '.')]

In [389]:
# 2)
grammar = "NP: {<DT>?<JJ>*<NNS|NN>}"

In [390]:
# 3)
cp = nltk.RegexpParser(grammar)

In [391]:
# 4)
result = cp.parse(sentence)

In [392]:
print(result)

(S
  (NP Many/JJ researchers/NNS)
  discussed/VBD
  (NP this/DT project/NN)
  for/IN
  two/CD
  (NP weeks/NNS)
  ./.)


### 7. Write a tag pattern to cover noun phrases that contain gerunds, e.g. "the/DT receiving/VBG end/NN", "assistant/NN managing/VBG editor/NN". Add these patterns to the grammar, one per line. Test your work using some tagged sentences of your own devising.

In [440]:
grammar = r"""
  NP: {<DT>?<VBG>*<NN>+}
      {<DT>?<JJ>*<NNS|NN>} """

In [441]:
sentence = "The testing end engineer monitoring the process."

In [442]:
def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    print(sentences)

In [443]:
ie_preprocess(sentence)

[[('The', 'DT'), ('testing', 'VBG'), ('end', 'NN'), ('engineer', 'NN'), ('monitoring', 'VBG'), ('the', 'DT'), ('process', 'NN'), ('.', '.')]]


In [444]:
sentence = [('The', 'DT'), ('testing', 'VBG'), ('end', 'NN'), ('engineer', 'NN'), ('monitoring', 'VBG'), ('the', 'DT'), ('process', 'NN'), ('.', '.')]

In [445]:
cp = nltk.RegexpParser(grammar)

In [446]:
result = cp.parse(sentence)

In [447]:
print(result)

(S
  (NP The/DT testing/VBG end/NN engineer/NN)
  monitoring/VBG
  (NP the/DT process/NN)
  ./.)


### 8. Use the Brown Corpus and the cascaded chunkers that has patterns for noun phrases, prepositional phrases, verb phrases, and clauses to print out all the verb phrases in the Brown corpus.

In [457]:
def find_chunks(grammer):
    brown = nltk.corpus.brown
    cp=nltk.RegexpParser(grammer)
    for sent in brown.tagged_sents():
        tree = cp.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'VP': print(subtree)

In [458]:
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          
  PP: {<IN><NP>}            
  VP: {<VB.*><NP|PP|CLAUSE>+$} 
  CLAUSE: {<NP><VP>}           
  """

In [459]:
find_chunks(grammar)

(VP Ask/VB-HL (NP jail/NN-HL deputies/NNS-HL))
(VP revolving/VBG-HL (NP fund/NN-HL))
(VP Issue/VB-HL (NP jury/NN-HL subpoenas/NNS-HL))
(VP Nursing/VBG-HL (NP home/NN-HL care/NN-HL))
(VP pay/VB-HL (NP doctors/NNS-HL))
(VP nursing/VBG-HL (NP homes/NNS))
(VP Asks/VBZ-HL (NP research/NN-HL funds/NNS-HL))
(VP Regrets/VBZ-HL (NP attack/NN-HL))
(VP Decries/VBZ-HL (NP joblessness/NN-HL))
(VP Underlying/VBG-HL (NP concern/NN-HL))
(VP bar/VB-HL (NP vehicles/NNS-HL))
(VP loses/VBZ-HL (NP pace/NN-HL))
(VP hits/VBZ-HL (NP homer/NN-HL))
(VP attend/VB-HL (NP races/NNS-HL))
(VP follows/VBZ-HL (NP ceremonies/NNS-HL))
(VP Noted/VBN-HL (NP artist/NN-HL))
(VP Cites/VBZ-HL (NP discrepancies/NNS-HL))
(VP calls/VBZ-HL (NP police/NNS-HL))
(VP held/VBN-HL (NP key/NN-HL))
(VP grant/VB-HL (NP bail/NN-HL))
(VP Held/VBD-HL (NP candle/NN-HL))
(VP Expresses/VBZ-HL (NP thanks/NNS-HL))
(VP Gets/VBZ-HL (NP car/NN-HL number/NN-HL))
(VP Attacks/VBZ-HL (NP officer/NN-HL))
(VP oks/VBZ-HL (NP pact/NN-HL))
(VP report/VB-HL (

### 9. The bigram chunker scores about 90% accuracy. Study its errors and try to work out why it doesn't get 100% accuracy. Experiment with trigram chunking. Are you able to improve the performance any more?

In [464]:
from nltk.corpus import conll2000

In [468]:
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])

In [469]:
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

In [470]:
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): 
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)
    def parse(self, sentence): 
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [471]:
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.3%%
    Precision:     82.3%%
    Recall:        86.8%%
    F-Measure:     84.5%%


In [472]:
class TrigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents): 
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)
    def parse(self, sentence): 
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [473]:
trigram_chunker = TrigramChunker(train_sents)
print(trigram_chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.3%%
    Precision:     82.5%%
    Recall:        86.8%%
    F-Measure:     84.6%%


In [474]:
# The bigram tagger manages to tag every word in a sentence it saw during training, but does badly on an unseen sentence. As soon as it encounters a new word, it is unable to assign a tag. It cannot tag the following word, even if it was seen during training, simply because it never saw it during training with none tag on the previous word. Consequently, the tagger fails to tag the rest of the sentence. That is the reason why it doesn't get 100% accuracy. Seen from the trigram chunking, the accuracy performance could not be improved. They probably need more data to train in order to give better results.

### 10. Explore the Brown Corpus to print out all the FACILITIES (one of the commonly used types of name entities).

In [528]:
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    chunk = nltk.ne_chunk(sent)
    for subtree in chunk.subtrees():
        if subtree.label() == 'FACILITY': print(subtree)

(FACILITY Raymondville/NP)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY Kremlin/NP)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL House/NN-TL)
(FACILITY White/JJ-TL House/NN-TL)
(FACILITY Franklin/NP-TL)
(FACILITY Kremlin/NP)
(FACILITY Franklin/NP-TL Square/NN-TL)
(FACILITY Pennsylvania/NP-TL Avenue/NN-TL)
(FACILITY Jenks/NP-TL Street/NN-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL House/NN-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY White/JJ-TL)
(FACILITY Pensacola/NP)
(FACILITY White/JJ-TL Sox/NPS-TL)
(FACILITY Caltech/NP)
(FACILITY White/JJ-TL House/NN-TL)
(FACILITY Whi