# Machine Learning: A Simple Example

Let's assume that we have collected a list of personal names and we have their corresponding gender labels, i.e., whether the name is a male or female one.

The goal of this example is to create a classifier that would automatically classify a given name into either male or female.

## A Quick Example: Name Gender

In [1]:
import nltk

In [2]:
from nltk.corpus import names
import random

labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

In [3]:
def gender_features(word):
    return {'last_letter': word[-1]}
gender_features('Shrek')

{'last_letter': 'k'}

In [4]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]

In [5]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [6]:
classifier.classify(gender_features('Neo'))
classifier.classify(gender_features('Trinity'))

'female'

In [7]:
print(nltk.classify.accuracy(classifier, test_set))

0.768


In [8]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     36.8 : 1.0
             last_letter = 'k'              male : female =     32.3 : 1.0
             last_letter = 'v'              male : female =     18.7 : 1.0
             last_letter = 'f'              male : female =     16.0 : 1.0
             last_letter = 'p'              male : female =     12.6 : 1.0


In [9]:
from nltk.classify import apply_features
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])


## Features and Training

In [10]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [11]:
gender_features2('John') 

{'first_letter': 'j',
 'last_letter': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

In [12]:
train_set = apply_features(gender_features2, labeled_names[500:])
test_set = apply_features(gender_features2, labeled_names[:500])
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.79


In [20]:
classifier.show_most_informative_features(n = 20)

Most Informative Features
             last_letter = 'a'            female : male   =     37.4 : 1.0
             last_letter = 'k'              male : female =     27.7 : 1.0
             last_letter = 'v'              male : female =     16.6 : 1.0
             last_letter = 'f'              male : female =     12.7 : 1.0
             last_letter = 'p'              male : female =     11.3 : 1.0
             last_letter = 'd'              male : female =     10.6 : 1.0
             last_letter = 'm'              male : female =      8.9 : 1.0
             last_letter = 'o'              male : female =      8.4 : 1.0
                count(v) = 2              female : male   =      7.6 : 1.0
             last_letter = 'r'              male : female =      7.2 : 1.0
             last_letter = 'w'              male : female =      5.5 : 1.0
                count(i) = 3                male : female =      5.2 : 1.0
             last_letter = 'z'              male : female =      5.1 : 1.0

## Train-Development-Test Data Splits for Error Analysis

In [13]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

train_set = [(gender_features2(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features2(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))


0.778


In [14]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

In [15]:
import csv

with open('error-analysis.csv', 'w') as f: 
      
    # using csv.writer method from CSV package 
    write = csv.writer(f) 
    write.writerow(['tag','guess','name']) 
    write.writerows(errors) 

## Evaluation

![](../images/confusion-matrix.png)

- Confusion Matrix:
    - **True positives** are relevant items that we correctly identified as relevant.
    - **True negatives** are irrelevant items that we correctly identified as irrelevant.
    - **False positives** (or Type I errors) are irrelevant items that we incorrectly identified as relevant.
    - **False negatives** (or Type II errors) are relevant items that we incorrectly identified as irrelevant.
    Given these four numbers, we can define the following metrics:

- Evaluation Metrics:
    - **Precision**: how many of the items that we identified were relevant, is TP/(TP+FP).
    - **Recall**: how many of the relevant items that we identified, is TP/(TP+FN).
    - **F-Measure (or F-Score)**: the harmonic mean of the precision and recall,i.e.:
    

$$ 
\frac{(2 × Precision × Recall)}{(Precision + Recall)} 
$$


In [16]:
print('Accuracy: {:4.2f}'.format(nltk.classify.accuracy(classifier, test_set))) 

Accuracy: 0.77


In [29]:
t_f = [feature for (feature, label) in test_set] # features of test set
t_l = [label for (feature, label) in test_set] # labels of test set
t_l_pr = [classifier.classify(f) for f in t_f] # predicted labels of test set
cm = nltk.ConfusionMatrix(t_l, t_l_pr)

In [30]:
cm=nltk.ConfusionMatrix(t_l, t_l_pr)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))

       |      f        |
       |      e        |
       |      m      m |
       |      a      a |
       |      l      l |
       |      e      e |
-------+---------------+
female | <54.4%>  8.2% |
  male |  14.8% <22.6%>|
-------+---------------+
(row = reference; col = test)



In [31]:
def createCM(classifier, test_set):
    t_f = [feature for (feature, label) in test_set]
    t_l = [label for (feature, label) in test_set]
    t_l_pr = [classifier.classify(f) for f in t_f]
    cm = nltk.ConfusionMatrix(t_l, t_l_pr)
    print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))

In [32]:
createCM(classifier, test_set)

       |      f        |
       |      e        |
       |      m      m |
       |      a      a |
       |      l      l |
       |      e      e |
-------+---------------+
female | <54.4%>  8.2% |
  male |  14.8% <22.6%>|
-------+---------------+
(row = reference; col = test)



## Try Maxent Classifier

- Maxent is memory hungry, slower, and it requires `numpy`.


In [23]:
%%time
from nltk.classify import MaxentClassifier
classifier_maxent = MaxentClassifier.train(train_set, algorithm = 'gis', trace = 0, max_iter=10, min_lldelta=0.5)

CPU times: user 6.04 s, sys: 11.7 ms, total: 6.05 s
Wall time: 6.1 s


```{note}
The default algorithm for training is `iis` (Improved Iterative Scaling). Another alternative is `gis` (General Iterative Scaling), which is faster.
```

In [34]:
nltk.classify.accuracy(classifier_maxent, test_set)

0.768

In [35]:
classifier_maxent.show_most_informative_features(n=20)

  -0.175 last_letter=='a' and label is 'male'
  -0.125 last_letter=='k' and label is 'female'
  -0.111 count(v)==2 and label is 'male'
  -0.111 last_letter=='v' and label is 'female'
  -0.090 last_letter=='f' and label is 'female'
  -0.084 last_letter=='p' and label is 'female'
  -0.073 last_letter=='d' and label is 'female'
  -0.068 count(a)==3 and label is 'male'
  -0.066 last_letter=='m' and label is 'female'
  -0.063 last_letter=='o' and label is 'female'
  -0.057 last_letter=='i' and label is 'male'
  -0.056 last_letter=='r' and label is 'female'
  -0.053 count(l)==3 and label is 'male'
  -0.053 count(k)==2 and label is 'male'
  -0.052 count(a)==2 and label is 'male'
  -0.049 count(p)==3 and label is 'male'
  -0.048 count(i)==3 and label is 'female'
  -0.048 last_letter=='w' and label is 'female'
  -0.046 count(e)==3 and label is 'male'
  -0.046 last_letter=='z' and label is 'female'


In [36]:
createCM(classifier_maxent, test_set)

       |      f        |
       |      e        |
       |      m      m |
       |      a      a |
       |      l      l |
       |      e      e |
-------+---------------+
female | <51.2%> 11.4% |
  male |  11.8% <25.6%>|
-------+---------------+
(row = reference; col = test)



## Try Decision Tree

- Parameters:
    - `binary`: whether the features are binary
    - `entropy_cutoff`: a value used during tree refinement process (entropy=1 -> high-level uncertainty; entropy = 0 -> perfect model prediction)
    - `depth_cutoff`: to control the depth of the tree
    - `support_cutoff`: the mimimum number of instances that are required to make a decision about a feature.

In [44]:
%%time
from nltk.classify import DecisionTreeClassifier
classifier_dt = DecisionTreeClassifier.train(train_set, binary=True, 
                                             entropy_cutoff=0.8, depth_cutoff=5, support_cutoff =5)

CPU times: user 15.2 s, sys: 17 ms, total: 15.2 s
Wall time: 15.2 s


In [45]:
nltk.classify.accuracy(classifier_dt, test_set)

0.72

In [46]:
createCM(classifier_dt, test_set)

       |      f        |
       |      e        |
       |      m      m |
       |      a      a |
       |      l      l |
       |      e      e |
-------+---------------+
female | <58.8%>  3.8% |
  male |  24.2% <13.2%>|
-------+---------------+
(row = reference; col = test)



## Reference

- NLTK Book, Chapter 6.