# Natural Language Processing

## Exercise Sheet 6

In [1]:
#imports for all exercises
import nltk
import random

### Exercise 1

Write a name gender classifier using the Names Corpus, the `apply_features` function, shuffling, and a test set of 500 instances. Use the following features:

a) first letter;  
b) last letter;  
c) last two letters;  
d) length;  
e) for each letter one feature, which is true if the name contains the letter.

Use the `NaiveBayesClassifier`, calculate the accuracy, and display the 10 most informative features.


In [2]:
from nltk.corpus import names
from nltk.classify import apply_features

labeled_names = ([(name, 'male') for name in names.words('male.txt')] + 
                 [(name, 'female') for name in names.words('female.txt')])

# feature extractor
def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower() 
    features["last_letter"] = name[-1].lower()  
    features["last_two_letters"] = name[-2].lower()
    features["length"] = len(name)
    # for each letter, construct a feature name. (e.g. for the letter 'a', it creates a feature named 'contains_a')
    # it checks whether the current letter exists in the name
    # if the letter is present in the name, the value of the feature is set to True; otherwise, False.
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features[f'contains_{letter}'] = letter in name.lower()
    return features

random.shuffle(labeled_names)

# split data and apply features
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])
classifier = nltk.NaiveBayesClassifier.train(train_set)

# calculate the accuracy
print(nltk.classify.accuracy(classifier, test_set))

# display the 10 most informative features
classifier.show_most_informative_features(10)

0.782
Most Informative Features
             last_letter = 'k'              male : female =     43.5 : 1.0
             last_letter = 'a'            female : male   =     34.6 : 1.0
             last_letter = 'f'              male : female =     14.5 : 1.0
             last_letter = 'p'              male : female =     11.8 : 1.0
             last_letter = 'v'              male : female =     11.1 : 1.0
             last_letter = 'd'              male : female =     10.4 : 1.0
             last_letter = 'o'              male : female =      8.5 : 1.0
             last_letter = 'm'              male : female =      8.2 : 1.0
        last_two_letters = 'o'              male : female =      7.5 : 1.0
        last_two_letters = 'u'              male : female =      7.2 : 1.0


This listing shows that the names in the training set that end in "a" are female 33 times more often than they are male, but names that end in "k" are male 31 times more often than they are female. 
These ratios are known as likelihood ratios.

### Exercise 2

The Senseval 2 Corpus contains data intended to train word-sense disambiguation classifiers. Using this dataset, build a `NaiveBayesClassifier` that predicts the correct sense tag for a given instance for the word "hard":

In [3]:
from nltk.classify import apply_features

# feature extractor
# extract the preceding and following words of the word "hard" in an instance
def features(inst):
    p = inst.position   # the inst.position attribute to find the position of "hard"
    # retrieve the context words
    features = {
        'prev_word': inst.context[p-1],
        'next_word': inst.context[p+1]
    }
    return features

In [4]:
from nltk.corpus import senseval

instances = senseval.instances('hard.pos')
labeled_instances = [(inst, inst.senses) for inst in instances]

num_iterations = 10
accuracies = []
 
for _ in range(num_iterations): 
    size = int(len(labeled_instances) * 0.1)
    random.shuffle(labeled_instances)
    train_set = apply_features(features, labeled_instances[size:])
    test_set = apply_features(features, labeled_instances[:size])
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    accuracy = nltk.classify.accuracy(classifier, test_set)
    accuracies.append(accuracy)
    print("Accuracy:", accuracy)

Accuracy: 0.8937644341801386
Accuracy: 0.9030023094688222
Accuracy: 0.9237875288683602
Accuracy: 0.8568129330254042
Accuracy: 0.8891454965357968
Accuracy: 0.9076212471131639
Accuracy: 0.9145496535796767
Accuracy: 0.9191685912240185
Accuracy: 0.8891454965357968
Accuracy: 0.9076212471131639


Use the preceding and following word as features. They can be calculated by retrieving the position of the word "hard" as `p=inst.position` and then accessing `inst.context[p-1]` and `inst.context[p+1]`.

Run 10 iterations by reshuffling the instances and printing the individual accuracies. Finally, print the average accuracy.

In [5]:
average_accuracy = sum(accuracies) / num_iterations
print("Average Accuracy:", average_accuracy)

Average Accuracy: 0.9004618937644342


### Exercise 3

The synonyms "strong" and "powerful" pattern differently. Use the tagged Brown corpus with the universal tagset to first list the nouns which follow "strong" vs. "powerful". Write for this a function `next_noun(word, tagged_text)` which returns the list of nouns that follow `word` in the `tagged_text`. Build then a `NaiveBayesClassifier` that predicts when each word should be used by using the function `apply_features` and the following noun as single feature.

Run 10 iterations by reshuffling the instances and printing the individual accuracies. Finally, print the average accuracy.


In [6]:
from nltk.corpus import brown

tagged_brown = brown.tagged_sents(tagset='universal')

# extract next nouns following a word
def next_noun(word, tagged_text):
    nouns = []
    for sentence in tagged_text:
        for i in range(len(sentence) - 1):
            if sentence[i][0].lower() == word:
                if sentence[i+1][1] == 'NOUN':
                    nouns.append(sentence[i+1][0].lower())
    return nouns

# labeled instances for 'strong' and 'powerful' based on next nouns
labeled_instances = []
for word in ['strong', 'powerful']:
    next_nouns = next_noun(word, tagged_brown)
    for noun in next_nouns:
        labeled_instances.append(({'next_noun': noun}, word))

In [7]:
num_iterations = 10
accuracies = []

for _ in range(num_iterations):
    random.shuffle(labeled_instances)
    size = int(len(labeled_instances) * 0.1)

    train_set = apply_features(lambda x: x, labeled_instances[size:])
    test_set = apply_features(lambda x: x, labeled_instances[:size])

    classifier = nltk.NaiveBayesClassifier.train(train_set)

    accuracy = nltk.classify.accuracy(classifier, test_set)
    accuracies.append(accuracy)
    print("Accuracy:", accuracy)

average_accuracy = sum(accuracies) / num_iterations
print("Average Accuracy:", average_accuracy)

Accuracy: 0.7857142857142857
Accuracy: 0.7857142857142857
Accuracy: 0.7857142857142857
Accuracy: 0.7142857142857143
Accuracy: 0.6428571428571429
Accuracy: 0.7857142857142857
Accuracy: 0.6428571428571429
Accuracy: 0.7142857142857143
Accuracy: 0.7857142857142857
Accuracy: 0.9285714285714286
Average Accuracy: 0.7571428571428572


### Exercise 4

Based on the Movie Reviews document classifier discussed in this chapter, build a new `NaiveBayesClassifier`. Tag first the Movie Reviews Corpus using the combined tagger from the previous chapter stored in `t2.pkl`. Filter the tagged words to contain only words for the tags `['JJ', 'JJR', 'JJS', 'RB', 'NN', 'NNS', 'VB', 'VBN', 'VBG', 'VBZ', 'VBD', 'QL']` as well as only alphabetic tokens with at least three characters. Convert the words to lowercase. Use the most common 5000 words as `word_features` in the function `document_features`. 

Run 10 iterations by reshuffling the instances and printing the accuracy and 5 most informative features for each iteration. Finally, print the average accuracy.
    

In [8]:
brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

In [9]:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

In [10]:
from pickle import dump, load
output = open('t2.pkl', 'wb')
dump(t2, output, -1)
output.close()

In [16]:
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
import pickle

with open('t2.pkl', 'rb') as f:
    combined_tagger = pickle.load(f)

# the list of allowed POS tags and the minimum token length
allowed_tags = ['JJ', 'JJR', 'JJS', 'RB', 'NN', 'NNS', 'VB', 'VBN', 'VBG', 'VBZ', 'VBD', 'QL']
min_token_length = 3

# extract features from a document
def document_features(document, word_features):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features

# prepare the movie reviews corpus and filter it
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

filtered_documents = []
for (words, category) in documents:
    tagged_words = combined_tagger.tag(words)
    filtered_words = [word.lower() for (word, tag) in tagged_words if tag in allowed_tags and len(word) >= min_token_length]
    filtered_documents.append((filtered_words, category))

# a list of most common 5000 words as word_features
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:5000]

num_iterations = 10
total_accuracy = 0

for i in range(num_iterations):
    random.shuffle(filtered_documents)
    featuresets = [(document_features(d, word_features), c) for (d, c) in filtered_documents]
    train_set, test_set = featuresets[100:], featuresets[:100]  # 100 test instances

    classifier = NaiveBayesClassifier.train(train_set)
    acc = accuracy(classifier, test_set)
    total_accuracy += acc

    print(f"Iteration {i + 1} - Accuracy: {acc:.2%}")
    print("Top 5 most informative features:")
    classifier.show_most_informative_features(5)
    print()

# Calculate and print the average accuracy
average_accuracy = total_accuracy / num_iterations
print(f"Average Accuracy over {num_iterations} iterations: {average_accuracy:.2%}")


Iteration 1 - Accuracy: 80.00%
Top 5 most informative features:
Most Informative Features
             outstanding = True              pos : neg    =     11.0 : 1.0
               ludicrous = True              neg : pos    =     10.4 : 1.0
               insulting = True              neg : pos    =     10.2 : 1.0
                   sucks = True              neg : pos    =      9.8 : 1.0
                chilling = True              pos : neg    =      9.0 : 1.0

Iteration 2 - Accuracy: 77.00%
Top 5 most informative features:
Most Informative Features
             outstanding = True              pos : neg    =     11.4 : 1.0
               insulting = True              neg : pos    =     10.6 : 1.0
                   sucks = True              neg : pos    =     10.6 : 1.0
                  hudson = True              neg : pos    =     10.3 : 1.0
               ludicrous = True              neg : pos    =      9.9 : 1.0

Iteration 3 - Accuracy: 83.00%
Top 5 most informative features:
Most

### Exercise 5

The PP Attachment Corpus is a corpus describing prepositional phrase attachment decisions. Each instance in the training corpus is encoded as a `PPAttachment` object:

    from nltk.corpus import ppattach
    ppattach.attachments('training')
    
        [PPAttachment(sent='0', verb='join', noun1='board',
            prep='as', noun2='director', attachment='V'),
        PPAttachment(sent='1', verb='is', noun1='chairman',
            prep='of', noun2='N.V.', attachment='N'),
        ...]

    inst = ppattach.attachments('training')[1]
    (inst.noun1, inst.prep, inst.noun2)
    
        ('chairman', 'of', 'N.V.')

In the same way, `ppattach.attachments('test')` accesses the test instances. Select only the instances where `inst.attachment` is `'N'`:

In [14]:
from nltk.corpus import ppattach
nattach = [inst for inst in ppattach.attachments('training')
               if inst.attachment == 'N']

Using this sub-corpus, build a `NaiveBayesClassifier` that attempts to predict which preposition is used to connect a given pair of nouns. For example, given the pair of nouns "team" and "researchers", the classifier should predict the preposition "of". 

Write for this purpose a function `prepare_featuresets(subcorpus)`, where `subcorpus` is either the string "training" or "test" to return the training set or the test set. 

Print the achieved accuracy as well as the result of `classifier.classify({ 'noun1': 'team', 'noun2': 'researchers' })`.

In [15]:
def prepare_featuresets(subcorpus):
    # select the subcorpus (either 'training' or 'test')
    instances = ppattach.attachments(subcorpus)

    # create a list of feature sets and labels
    featuresets = []
    for inst in instances:
        if inst.attachment == 'N':
            features = {'noun1': inst.noun1, 'noun2': inst.noun2}
            label = inst.prep
            featuresets.append((features, label))

    return featuresets

training_set = prepare_featuresets('training')

classifier = nltk.NaiveBayesClassifier.train(training_set)

test_set = prepare_featuresets('test')
accuracy = nltk.classify.accuracy(classifier, test_set)
print("Accuracy:", accuracy)

noun1 = 'team'
noun2 = 'researchers'
prediction = classifier.classify({'noun1': noun1, 'noun2': noun2})
print("Prediction for ('{}', '{}'):".format(noun1, noun2), prediction)

Accuracy: 0.5690032858707558
Prediction for ('team', 'researchers'): of
