# Lab 7 - Evaluation and Metrics

In a previous lab, we looked at how to train and test a POS-tagger. We used the evaluation concept to track how our POS-tagger was improving as we added more models to it. In this lab we will be looking at how we evaluate our models and what metrics we can use to see how our model is doing.

We will use [chapter 6](http://www.nltk.org/book/ch06.html) of the nltk book as our reference for this lab and we will work through one of the examples while focusing on the evaluation of our model and the various metrics we use for this.

In this lab, we will train a model for detecting the gender of names. First we need to identify what features we want to use to train our model. For our model, we will start off with one feature: the last letter of the name. Let's write a feature extractor in the following code block:

In [1]:
def featureExtractor(name):
    # name[-1:] selects the last letter of the name as a one-character string
    return {'suffix1': name[-1:]}
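
As a quick sanity check, we can call the extractor on a few inputs. The sample names below are arbitrary and only for illustration. The empty string shows one reason to prefer the slice `name[-1:]` over the index `name[-1]`: the slice returns an empty string instead of raising an error.

```python
def featureExtractor(name):
    # name[-1:] is a slice holding the last letter (empty string for an empty name)
    return {'suffix1': name[-1:]}

# arbitrary sample inputs, chosen only for illustration
for sample in ['Anna', 'Frank', '']:
    print(repr(sample), featureExtractor(sample))
```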

Now that we have our feature extractor, we load and shuffle our data. We will be using the nltk 'names' corpus, which consists of two files: male.txt and female.txt (no, gender is not binary; it's just the way it's encoded here).

We will extract the names and store them in a list of (name, gender) tuples.

In [2]:
# import the corpus and the random module
import nltk
from nltk.corpus import names
import random

# construct the dataset
labeledNames = ([(name, 'male') for name in names.words('male.txt')] +
                [(name, 'female') for name in names.words('female.txt')])

# shuffle the data set
random.shuffle(labeledNames)

Now let's look at how we will split up our dataset for training and evaluation. We will split up the dataset into a development set and a test set. As the names imply, the development set will be used for training and developing our model. Once we have a working model we can test it on our test set and get our evaluation metrics. 

The development set itself is further divided into a **training set** and a **development test set**. The training set is used to train our initial model, and the development test set is used to test our initial model and tweak it before testing it on our final test set. The reason for a separate development test set is that once we test our model on a set and then modify the model based on the results, that set can no longer give us an unbiased estimate of accuracy, since we used it to tweak the model. (Note that some people refer to this training + development test set as the **training set**. We are using the term **development** for the overall combination.)

So, in the next code block we generate a feature set from our data and split this up into the relevant subsets. We first find out the length of the total data set and then we split it into 70% training, 20% development test, and 10% testing.

In [3]:
len(labeledNames)

7944

In [4]:
# divide 70, 20, 10. You can do other splits if you want
length = len(labeledNames)
len_training = int(length * 0.7)
len_dev = int(length * 0.2)
len_test = int(length * 0.1)

# print to double-check
print(len_training)
print(len_dev)
print(len_test)

5560
1588
794


In [5]:
trainingNames = labeledNames[:len_training]
devtestNames = labeledNames[len_training:(len_training + len_dev)]
testNames = labeledNames[(len_training + len_dev):]

# print to double-check
print(len(trainingNames))
print(len(devtestNames))
print(len(testNames))

5560
1588
796
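
Note that the test set has 796 names, not the 794 printed earlier. Each call to `int()` rounds down, so the three sizes add up to 7942, and the slice for the test set picks up the two leftover names. A minimal sketch of one way to make the sizes agree is to compute the test size as the remainder. The helper name `split_sizes` is an assumption for illustration and is not used elsewhere in this lab.

```python
def split_sizes(length, train_frac=0.7, dev_frac=0.2):
    # int() truncates, so training and dev-test sizes are rounded down
    len_training = int(length * train_frac)
    len_dev = int(length * dev_frac)
    # whatever is left over goes to the test set, so nothing is lost
    len_test = length - len_training - len_dev
    return len_training, len_dev, len_test

# 7944 is the corpus size printed above
print(split_sizes(7944))
```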


In [6]:
# next we extract the features from each set
trainingSet = [(featureExtractor(n), gender) for (n, gender) in trainingNames]
devtestSet = [(featureExtractor(n), gender) for (n, gender) in devtestNames]
testSet = [(featureExtractor(n), gender) for (n, gender) in testNames]

Now that we have our various datasets, we can start training our model using the training set. We will be using a "naive Bayes" classifier which you can read more about in part 5 of the chapter reading linked above. We will then test it on our devtestSet to see where we can improve our model.

In [7]:
classifier = nltk.NaiveBayesClassifier.train(trainingSet)
print(nltk.classify.accuracy(classifier, devtestSet))

# We can check what the most informative features are in our model
classifier.show_most_informative_features(26)

0.7329974811083123
Most Informative Features
                 suffix1 = 'a'            female : male   =     40.3 : 1.0
                 suffix1 = 'f'              male : female =     20.7 : 1.0
                 suffix1 = 'o'              male : female =     10.2 : 1.0
                 suffix1 = 'd'              male : female =      9.1 : 1.0
                 suffix1 = 'm'              male : female =      8.9 : 1.0
                 suffix1 = 'p'              male : female =      8.4 : 1.0
                 suffix1 = 'w'              male : female =      8.4 : 1.0
                 suffix1 = 'r'              male : female =      6.9 : 1.0
                 suffix1 = 'g'              male : female =      5.3 : 1.0
                 suffix1 = 't'              male : female =      4.9 : 1.0
                 suffix1 = 'b'              male : female =      4.7 : 1.0
                 suffix1 = 's'              male : female =      4.2 : 1.0
                 suffix1 = 'u'              male : fema

As you can see above, names ending in 'a' are predominantly female according to our classifier and names ending in 'k' are mostly male. To improve our model, we will generate a list of names that our classifier gets wrong using the devtestSet.

In [17]:
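errors = []

for (name, tag) in devtestNames:
    guess = classifier.classify(featureExtractor(name))
    if guess != tag:
        # keep each misclassified name so the errors can be inspected later
        errors.append((tag, guess, name))
        print("correct=%s guess=%s name=%s" % (tag, guess, name))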

correct=male guess=female name=Luce
correct=female guess=male name=Umeko
correct=male guess=female name=Moise
correct=male guess=female name=Shelby
correct=male guess=female name=Tobie
correct=male guess=female name=Gustave
correct=female guess=male name=Gwendolyn
correct=female guess=male name=Keriann
correct=male guess=female name=Eddy
correct=male guess=female name=Andrea
correct=male guess=female name=Shurlocke
correct=male guess=female name=Mattie
correct=male guess=female name=Wally
correct=female guess=male name=Carol
correct=female guess=male name=Carleen
correct=female guess=male name=Jerrilyn
correct=male guess=female name=Doyle
correct=male guess=female name=Smitty
correct=male guess=female name=Reggie
correct=female guess=male name=Garnet
correct=female guess=male name=Erin
correct=female guess=male name=Katharyn
correct=male guess=female name=Esme
correct=male guess=female name=Zacharia
correct=male guess=female name=Claire
correct=male guess=female name=Curtice
correct=fe

Remember that our classifier only looks at the last letter of each name. From this list, however, we see that sometimes the last two letters are a better indicator of gender. This is because names ending in 'yn' or 'en' are mostly female, even though most names ending in 'n' are male. This tells us that we should add another feature to our model to improve it. This second feature will be the last two letters of the name.

In [28]:
def featureExtractor2(name):
    # suffix1 returns the last letter of the name and suffix2 returns the last two letters
    return {'suffix1': name[-1:], 'suffix2': name[-2:]}
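
To see what the new extractor produces, we can try it on two names from the error list above, plus a one-letter input as an edge case (the edge case is made up for illustration). Both names end in 'n', but their two-letter suffixes differ, which is exactly the extra signal we want the classifier to see.

```python
def featureExtractor2(name):
    # suffix1 is the last letter of the name and suffix2 is the last two letters
    return {'suffix1': name[-1:], 'suffix2': name[-2:]}

# two names from the error list above, plus a one-letter edge case
for sample in ['Gwendolyn', 'Carleen', 'A']:
    print(sample, featureExtractor2(sample))
```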

Now that we have created a second feature extractor, let's retrain our model and check it on the devtestSet again before we finally test it on our test set.

In [9]:
trainingSet2 = [(featureExtractor2(n), gender) for (n, gender) in trainingNames]
devtestSet2 = [(featureExtractor2(n), gender) for (n, gender) in devtestNames]
testSet2 = [(featureExtractor2(n), gender) for (n, gender) in testNames]

classifier2 = nltk.NaiveBayesClassifier.train(trainingSet2)
print(nltk.classify.accuracy(classifier2, devtestSet2))

0.7550377833753149


In [23]:
classifier2.show_most_informative_features(10)

Most Informative Features
                 suffix2 = 'na'           female : male   =     78.6 : 1.0
                 suffix2 = 'us'             male : female =     50.9 : 1.0
                 suffix2 = 'ra'           female : male   =     50.2 : 1.0
                 suffix1 = 'a'            female : male   =     40.3 : 1.0
                 suffix2 = 'ta'           female : male   =     35.2 : 1.0
                 suffix2 = 'ia'           female : male   =     30.6 : 1.0
                 suffix2 = 'rt'             male : female =     26.6 : 1.0
                 suffix2 = 'ld'             male : female =     25.7 : 1.0
                 suffix2 = 'rd'             male : female =     21.8 : 1.0
                 suffix1 = 'f'              male : female =     20.7 : 1.0


As you can see, our model improved by about 2 percentage points (from 73.3% to 75.5%) by adding one extra feature. But we need to remember that we are only testing on our devtestSet here. This is the set that we investigated and optimized for.

In order to really know how our model is doing, we need to test it on data that it has never seen before. This is why we have a testSet which we do not use until our model is finalized. Let's run this final model on our testSet and see the actual expected accuracy of our model.

In [10]:
print(nltk.classify.accuracy(classifier2, testSet2))

0.7814070351758794


As you can see, our model got about 78% of the tags correct in this run; the figure varies between roughly 76% and 81% depending on how the data was shuffled. This is fairly good since our model has never seen this data.

Since this lab is about evaluation metrics let's look at some other ways to see how our model is doing. Accuracy is a very basic evaluation metric and doesn't always give us very good information about our model.
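
To illustrate why accuracy alone can mislead, here is a small sketch with made-up, imbalanced labels (none of this data comes from the names corpus). A "classifier" that always answers with the majority label scores high on accuracy while never finding a single item of the minority label.

```python
# toy gold labels: 90 negatives and 10 positives (made-up data)
gold = ['neg'] * 90 + ['pos'] * 10
# a "classifier" that always answers 'neg'
predicted = ['neg'] * len(gold)

# accuracy: share of items whose predicted label matches the gold label
correct = sum(1 for g, p in zip(gold, predicted) if g == p)
accuracy = correct / len(gold)

# how many of the positive items were actually found
found_positives = sum(1 for g, p in zip(gold, predicted)
                      if g == 'pos' and p == 'pos')

print("Accuracy = ", accuracy)
print("Positives found = ", found_positives)
```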

### Precision and Recall

In binary classification tasks, ones where the answer is either true or false, precision and recall can tell us a lot about our model. In order to calculate these two metrics we first need to split our data into four categories:

- True Positives(TP): Items that were correctly labeled true
- True Negatives(TN): Items that were correctly labeled false
- False Positives(FP): Items that were incorrectly labeled true
- False Negatives(FN): Items that were incorrectly labeled false
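
A minimal sketch of how these four counts could be tallied from two parallel lists of labels. The function name `confusion_counts` and the toy labels are assumptions for illustration; they are not part of NLTK or of this lab's data.

```python
def confusion_counts(gold, predicted, positive):
    """Tally TP, TN, FP, FN, treating `positive` as the 'true' label."""
    TP = TN = FP = FN = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            TP += 1   # correctly labeled true
        elif p != positive and g != positive:
            TN += 1   # correctly labeled false
        elif p == positive and g != positive:
            FP += 1   # incorrectly labeled true
        else:
            FN += 1   # incorrectly labeled false
    return TP, TN, FP, FN

# made-up labels for illustration only
gold      = ['male', 'male', 'female', 'female', 'female', 'male']
predicted = ['male', 'female', 'female', 'male', 'female', 'male']
print(confusion_counts(gold, predicted, positive='male'))
```

Note that the four counts always add up to the number of items, which is a handy check when writing the counting code by hand.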

Once we have the numbers for the above categories, we can calculate the precision and recall of our model. Precision will tell us how many of the retrieved instances are correct or, in other words, how much we can trust the output. Recall, on the other hand, will tell us how many of the relevant instances are found. For a good read on precision and recall look at this [5 minute read](https://medium.com/@yashwant140393/the-3-pillars-of-binary-classification-accuracy-precision-recall-d2da3d09f664#:~:text=PRECISION%3A,total%20number%20of%20positive%20calls.).

Now, to calculate the precision of a model we divide the true positives by true positives plus false positives: TP/(TP+FP)

To calculate the recall of a model we divide the true positives by true positives plus false negatives: TP/(TP+FN)
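
The two formulas can be sketched as small functions. The counts below are made up for illustration, and returning 0.0 when the denominator is zero is just one possible convention, chosen here to avoid a division error.

```python
def precision(TP, FP):
    # share of items labeled positive that really are positive
    return TP / (TP + FP) if (TP + FP) > 0 else 0.0

def recall(TP, FN):
    # share of truly positive items that were labeled positive
    return TP / (TP + FN) if (TP + FN) > 0 else 0.0

# made-up counts for illustration only
TP, FP, FN = 8, 2, 4
print("Precision = ", precision(TP, FP))
print("Recall = ", recall(TP, FN))
```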

Let's now find the precision and recall for tagging male and female names. Since precision and recall are for binary classification tasks, we will need to formulate our question as follows:

1. What are the precision and recall for male names?
2. What are the precision and recall for female names?

I will show you how to do 1 and then you will follow up by doing 2. 

Let's calculate the following:
- TP: male names labeled as male
- TN: female names labeled as female
- FP: female names labeled as male
- FN: male names labeled as female

In [29]:
# initialize our variables to 0
TP = TN = FN = FP = 0

# loop over all the names in our development test set
for (name, tag) in devtestNames:
    # guess the label according to our classifier
    guess = classifier.classify(featureExtractor(name))
    
    # add one to the count depending on the category
    # if the correct tag is male and the model guesses correctly it's a TP
    if tag == "male" and guess == tag:
        TP += 1
    # if the correct tag is female and the model guesses correctly it's a TN
    if tag == "female" and guess == tag:
        TN += 1
    # if the correct tag is male and the model does not guess correct it's a FN
    if tag == "male" and guess != tag:
        FN += 1
    # if the correct tag is female and the model does not guess correct it's a FP
    if tag == "female" and guess != tag:
        FP += 1
        
print("TP = ", TP)
print("TN = ", TN)
print("FN = ", FN)
print("FP = ", FP)

# calculate precision and recall
print("Precision = ", TP/(TP+FP))
print("Recall = ", TP/(TP+FN))

TP =  329
TN =  835
FN =  258
FP =  166
Precision =  0.6646464646464646
Recall =  0.5604770017035775


As you can see, our model's precision is moderate. When the model tells us a name is male, we can be ~66% sure that this label is correct. The recall, however, is a little worse at ~56%. This means that our model isn't very good at finding male names: about 44% of all male names are missed.

Another metric that you will often find in research is the F-score or F-measure. This metric combines precision and recall into one single score. To find the F-score of a model you use the following formula: (2*Precision*Recall)/(Precision+Recall)
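
As a sketch, here is the formula applied to the counts for male names from the first classifier's output above. The function name `f_score` is an assumption for illustration, and returning 0.0 when both inputs are zero is a convention chosen to avoid a division error.

```python
def f_score(precision, recall):
    # guard against dividing by zero when both scores are 0
    if precision + recall == 0:
        return 0.0
    return (2 * precision * recall) / (precision + recall)

# counts for male names from the first classifier's output above
TP, FP, FN = 329, 166, 258
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print("F-score = ", f_score(precision, recall))
```

The result is roughly 0.61, which lies between the recall (~0.56) and the precision (~0.66).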

Calculate the precision, recall and F-score for female names with the second classifier (`classifier2`, which uses `featureExtractor2`):

In [41]:
# initialize our variables to 0
TP = TN = FN = FP = 0

# loop over all the names in our development test set
for (name, tag) in devtestNames:
    # guess the label according to our classifier
    guess = classifier2.classify(featureExtractor2(name))

    # add one to the count depending on the category (female is the positive class)
    # if the correct tag is female and the model guesses correctly it's a TP
    if tag == "female" and guess == tag:
        TP += 1
    # if the correct tag is male and the model guesses correctly it's a TN
    if tag == "male" and guess == tag:
        TN += 1
    # if the correct tag is female and the model does not guess correctly it's a FN
    if tag == "female" and guess != tag:
        FN += 1
    # if the correct tag is male and the model does not guess correctly it's a FP
    if tag == "male" and guess != tag:
        FP += 1

# print the counts for each category
print("TP = ", TP)
print("TN = ", TN)
print("FN = ", FN)
print("FP = ", FP)

# calculate precision and recall and print them
precision = TP/(TP+FP)
recall = TP/(TP+FN)
print("Precision = ", precision)
print("Recall = ", recall)

# calculate F-score and print it
fscore = (2*precision*recall)/(precision+recall)
print("F-score = ", fscore)

TP =  815
TN =  384
FN =  186
FP =  203
Precision =  0.8005893909626719
Recall =  0.8141858141858141
F-score =  0.8073303615651313
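
The same four counts also answer the question for male names with the second classifier, because swapping the positive class swaps TP with TN and FP with FN. Below is a minimal sketch that reuses the counts printed above; the helper name `scores` is an assumption for illustration.

```python
def scores(TP, FP, FN):
    """Return (precision, recall, F-score) for one positive class."""
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    fscore = (2 * precision * recall) / (precision + recall)
    return precision, recall, fscore

# counts printed above, with female as the positive class
TP, TN, FN, FP = 815, 384, 186, 203

print("female:", scores(TP, FP, FN))
# with male as the positive class, TP and TN trade places, and so do FP and FN
print("male:  ", scores(TN, FN, FP))
```

Comparing the two lines shows whether the classifier handles one label better than the other, which a single accuracy figure would hide.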
