# Machine Learning: A Simple Example

Let's assume that we have collected a list of personal names and we have their corresponding gender labels, i.e., whether the name is a male or female one.

The goal of this example is to create a classifier that would automatically classify a given name into either male or female.

## A Quick Example: Name Gender Prediction

### Prepare Data

- We use the data provided in NLTK. Please download the corpus data if necessary.
- We load the corpus, `nltk.corpus.names` and randomize it before we proceed.

In [1]:
import nltk
import numpy as np

In [2]:
from nltk.corpus import names
import random

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

### Feature Engineering

- As now our unit for classification is a name. In **feature engineering**, our goal is to transform the texts (i.e., names) into vectorized reprsentations.
- To start with, let's represent each text (name) by using its last character as the features.

In [3]:
def text_vectorizer(word):
    return {'last_letter': word[-1]}


text_vectorizer('Shrek')

{'last_letter': 'k'}

### Train-Test Split

- We then apply the feature engineering method to every text in the data and split the data into train and test sets.

In [4]:
featuresets = [(text_vectorizer(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]

### Train the Model

In [5]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

### Model Prediction

In [6]:
classifier.classify(text_vectorizer('Neo'))
classifier.classify(text_vectorizer('Trinity'))

'female'

In [7]:
print(nltk.classify.accuracy(classifier, test_set))

0.764


### Post-hoc Analysis

- One of the most important steps after model training is to examine which features contribute the most to the classifier prediction of the class.

In [8]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     39.8 : 1.0
             last_letter = 'k'              male : female =     31.3 : 1.0
             last_letter = 'f'              male : female =     16.0 : 1.0
             last_letter = 'v'              male : female =     11.2 : 1.0
             last_letter = 'p'              male : female =     10.5 : 1.0


- Please note that in `NLTK`, we can use the `apply_features` to create training and test datasets.
- When you have a very large feature set, this can be more effective in terms of memory management.

In [9]:
from nltk.classify import apply_features
train_set = apply_features(text_vectorizer, labeled_names[500:])
test_set = apply_features(text_vectorizer, labeled_names[:500])

## How can we improve the model?

In the following, we will talk about methods that we may consider to further improve the model training.

- Feature Engineering
- Error Analysis
- Cross Validation
- Try Different Machine-Learning Algorithms
- (Ensemble Methods)

## More Sophisticated Feature Engineering

- We can extract more features from the names.
- Use the following features for vectorized representations of names:
    - The first/last letter
    - Frequencies of all 26 alphabets in the names

In [10]:
def text_vectorizer2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [11]:
text_vectorizer2('John')

{'first_letter': 'j',
 'last_letter': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

In [12]:
train_set = apply_features(text_vectorizer2, labeled_names[500:])
test_set = apply_features(text_vectorizer2, labeled_names[:500])
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.746


In [13]:
classifier.show_most_informative_features(n=20)

Most Informative Features
             last_letter = 'a'            female : male   =     39.8 : 1.0
             last_letter = 'k'              male : female =     31.3 : 1.0
             last_letter = 'f'              male : female =     16.0 : 1.0
             last_letter = 'v'              male : female =     11.2 : 1.0
             last_letter = 'p'              male : female =     10.5 : 1.0
             last_letter = 'd'              male : female =     10.2 : 1.0
             last_letter = 'm'              male : female =     10.0 : 1.0
                count(v) = 2              female : male   =      8.8 : 1.0
             last_letter = 'o'              male : female =      8.5 : 1.0
             last_letter = 'r'              male : female =      6.9 : 1.0
                count(a) = 3              female : male   =      5.2 : 1.0
                count(i) = 3                male : female =      5.1 : 1.0
             last_letter = 'w'              male : female =      5.1 : 1.0

## Train-Development-Test Data Splits for Error Analysis

- Normally we have **train**-**test** splits of data
- Sometimes we use **development (dev)** set for error analysis and feature engineering.

- Now let's train the model on the **training set** and first check the classifier's performance on the **dev** set.
- We then identify the errors the classifier made in the **dev** set.
- We perform error analysis for further improvement.

In [14]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

train_set = [(text_vectorizer2(n), gender) for (n, gender) in train_names]
devtest_set = [(text_vectorizer2(n), gender) for (n, gender) in devtest_names]
test_set = [(text_vectorizer2(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.79


In [15]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(text_vectorizer2(name))
    if guess != tag:
        errors.append((tag, guess, name))

In [16]:
import csv

with open('error-analysis.csv', 'w') as f:

    # using csv.writer method from CSV package
    write = csv.writer(f)
    write.writerow(['tag', 'guess', 'name'])
    write.writerows(errors)

## Evaluation

![](../images/confusion-matrix.png)

- Confusion Matrix:
    - **True positives** are relevant items that we correctly identified as relevant.
    - **True negatives** are irrelevant items that we correctly identified as irrelevant.
    - **False positives** (or Type I errors) are irrelevant items that we incorrectly identified as relevant.
    - **False negatives** (or Type II errors) are relevant items that we incorrectly identified as irrelevant.
    

Given these four numbers, we can define the following model evaluation metrics:
- **Accuracy**: How many items were correctly classified?
- **Precision**: How many of the items identified by the classifier as relevant are indeed relevant, is TP/(TP+FP).
- **Recall**: How many of the true relevant items were successfully identified by the classifier, is TP/(TP+FN).
- **F-Measure (or F-Score)**: the harmonic mean of the precision and recall,i.e.:
    

$$ 
F= \frac{(2 × Precision × Recall)}{(Precision + Recall)} 
$$

In [17]:
print('Accuracy: {:4.2f}'.format(nltk.classify.accuracy(classifier, test_set)))

Accuracy: 0.74


In [18]:
## Compute the Confusion Matrix
t_f = [feature for (feature, label) in test_set]  # features of test set
t_l = [label for (feature, label) in test_set]  # labels of test set
t_l_pr = [classifier.classify(f) for f in t_f]  # predicted labels of test set
cm = nltk.ConfusionMatrix(t_l, t_l_pr)

In [19]:
cm = nltk.ConfusionMatrix(t_l, t_l_pr)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))

       |      f        |
       |      e        |
       |      m      m |
       |      a      a |
       |      l      l |
       |      e      e |
-------+---------------+
female | <52.4%> 10.0% |
  male |  15.6% <22.0%>|
-------+---------------+
(row = reference; col = test)



In [20]:
def createCM(classifier, test_set):
    t_f = [feature for (feature, label) in test_set]
    t_l = [label for (feature, label) in test_set]
    t_l_pr = [classifier.classify(f) for f in t_f]
    cm = nltk.ConfusionMatrix(t_l, t_l_pr)
    print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))

In [21]:
createCM(classifier, test_set)

       |      f        |
       |      e        |
       |      m      m |
       |      a      a |
       |      l      l |
       |      e      e |
-------+---------------+
female | <52.4%> 10.0% |
  male |  15.6% <22.0%>|
-------+---------------+
(row = reference; col = test)



## Cross Validation

- We can also check our average model performance using the cross-validation method.

---
![](../images/ml-kfold.png)
(Source: https://scikit-learn.org/stable/modules/cross_validation.html)
---

In [22]:
import sklearn.model_selection
kf = sklearn.model_selection.KFold(n_splits=10)
acc_kf = [] ## accuracy holder

## Cross-validation
for train_index, test_index in kf.split(train_set):
    #print("TRAIN:", train_index, "TEST:", test_index)
    classifier = nltk.NaiveBayesClassifier.train(
        train_set[train_index[0]:train_index[len(train_index) - 1]])
    cur_fold_acc = nltk.classify.util.accuracy(
        classifier, train_set[test_index[0]:test_index[len(test_index) - 1]])
    acc_kf.append(cur_fold_acc)
    print('accuracy:', np.round(cur_fold_acc, 2))

accuracy: 0.76
accuracy: 0.76
accuracy: 0.78
accuracy: 0.77
accuracy: 0.78
accuracy: 0.78
accuracy: 0.79
accuracy: 0.79
accuracy: 0.8
accuracy: 0.77


In [23]:
np.mean(acc_kf)

0.7783712315137699

## Try Different Machine Learning Algorithms

### Try Maxent Classifier

- Maxent is memory hungry, slower, and it requires `numpy`.


In [24]:
%%time
from nltk.classify import MaxentClassifier
classifier_maxent = MaxentClassifier.train(train_set,
                                           algorithm='gis',
                                           trace=0,
                                           max_iter=100,
                                           min_lldelta=0.5)

CPU times: user 5.9 s, sys: 11.2 ms, total: 5.91 s
Wall time: 5.95 s


```{note}
The default algorithm for training is `iis` (Improved Iterative Scaling). Another alternative is `gis` (General Iterative Scaling), which is faster.
```

In [25]:
nltk.classify.accuracy(classifier_maxent, test_set)

0.624

In [26]:
classifier_maxent.show_most_informative_features(n=20)

  -0.193 last_letter=='a' and label is 'male'
  -0.129 last_letter=='k' and label is 'female'
  -0.126 last_letter=='f' and label is 'female'
  -0.119 count(v)==2 and label is 'male'
  -0.078 last_letter=='p' and label is 'female'
  -0.076 last_letter=='m' and label is 'female'
  -0.075 count(a)==3 and label is 'male'
  -0.075 last_letter=='v' and label is 'female'
  -0.070 last_letter=='d' and label is 'female'
  -0.065 last_letter=='i' and label is 'male'
  -0.059 last_letter=='w' and label is 'female'
  -0.059 count(l)==3 and label is 'male'
  -0.058 last_letter=='o' and label is 'female'
  -0.055 count(e)==3 and label is 'male'
  -0.052 last_letter=='r' and label is 'female'
  -0.051 count(a)==2 and label is 'male'
  -0.050 last_letter=='z' and label is 'female'
  -0.050 count(p)==3 and label is 'male'
  -0.044 count(n)==3 and label is 'male'
   0.044 last_letter=='c' and label is 'male'


In [27]:
createCM(classifier_maxent, test_set)

       |      f        |
       |      e        |
       |      m      m |
       |      a      a |
       |      l      l |
       |      e      e |
-------+---------------+
female | <62.4%>     . |
  male |  37.6%     <.>|
-------+---------------+
(row = reference; col = test)



In [28]:
%%time
for train_index, test_index in kf.split(train_set):
    #print("TRAIN:", train_index, "TEST:", test_index)
    classifier = MaxentClassifier.train(
        train_set[train_index[0]:train_index[len(train_index) - 1]],
        algorithm='gis',
        trace=0,
        max_iter=10,
        min_lldelta=0.5)
    print(
        'accuracy:',
        nltk.classify.util.accuracy(
            classifier,
            train_set[test_index[0]:test_index[len(test_index) - 1]]))

accuracy: 0.6180124223602484
accuracy: 0.6304347826086957
accuracy: 0.6475155279503105
accuracy: 0.59472049689441
accuracy: 0.6360808709175739
accuracy: 0.6531881804043546
accuracy: 0.6485225505443235
accuracy: 0.5847589424572317
accuracy: 0.640746500777605
accuracy: 0.645412130637636
CPU times: user 59.7 s, sys: 132 ms, total: 59.8 s
Wall time: 1min 1s


### Try Decision Tree

- Parameters:
    - `binary`: whether the features are binary
    - `entropy_cutoff`: a value used during tree refinement process (entropy=1 -> high-level uncertainty; entropy = 0 -> perfect model prediction)
    - `depth_cutoff`: to control the depth of the tree
    - `support_cutoff`: the mimimum number of instances that are required to make a decision about a feature.

In [29]:
%%time
from nltk.classify import DecisionTreeClassifier
classifier_dt = DecisionTreeClassifier.train(train_set,
                                             binary=True,
                                             entropy_cutoff=0.7,
                                             depth_cutoff=5,
                                             support_cutoff=5)

CPU times: user 16.4 s, sys: 28.8 ms, total: 16.4 s
Wall time: 16.8 s


In [30]:
nltk.classify.accuracy(classifier_dt, test_set)

0.72

In [31]:
createCM(classifier_dt, test_set)

       |      f        |
       |      e        |
       |      m      m |
       |      a      a |
       |      l      l |
       |      e      e |
-------+---------------+
female | <59.2%>  3.2% |
  male |  24.8% <12.8%>|
-------+---------------+
(row = reference; col = test)



In [32]:
%%time

for train_index, test_index in kf.split(train_set):
    #print("TRAIN:", train_index, "TEST:", test_index)
    classifier = DecisionTreeClassifier.train(
        train_set[train_index[0]:train_index[len(train_index) - 1]],
        binary=True,
        entropy_cutoff=0.7,
        depth_cutoff=5,
        support_cutoff=5)
    print(
        'accuracy:',
        nltk.classify.util.accuracy(
            classifier,
            train_set[test_index[0]:test_index[len(test_index) - 1]]))

accuracy: 0.7065217391304348
accuracy: 0.7158385093167702
accuracy: 0.7515527950310559
accuracy: 0.6909937888198758
accuracy: 0.7325038880248833
accuracy: 0.7418351477449455
accuracy: 0.702954898911353
accuracy: 0.6765163297045101
accuracy: 0.7356143079315708
accuracy: 0.713841368584759
CPU times: user 2min 39s, sys: 301 ms, total: 2min 40s
Wall time: 2min 42s


### Try `sklearn` Classifiers

- `sklearn` is a very useful module for machine learning. We will talk more about this module in our later lectures.
- This package provides a lot more ML algorithms for classification tasks.

#### Naive Bayes in `sklearn`

In [33]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

sk_classifier = SklearnClassifier(MultinomialNB())
sk_classifier.train(train_set)

<SklearnClassifier(MultinomialNB())>

In [34]:
nltk.classify.accuracy(sk_classifier, test_set)

0.742

#### Logistic Regression in `sklearn`

In [35]:
from sklearn.linear_model import LogisticRegression
sk_classifier = SklearnClassifier(LogisticRegression(max_iter=500))
sk_classifier.train(train_set)
nltk.classify.accuracy(sk_classifier, test_set)

0.774

#### Support Vector Machine in `sklearn`

- `sklearn` provides several implementations for Support Vector Machines.
- Please see its documentation for more detail: [Support Vector Machine](https://scikit-learn.org/stable/modules/svm.html)

In [36]:
from sklearn.svm import SVC
sk_classifier = SklearnClassifier(SVC())
sk_classifier.train(train_set)
nltk.classify.accuracy(sk_classifier, test_set)

0.794

In [37]:
from sklearn.svm import LinearSVC
sk_classifier = SklearnClassifier(LinearSVC(max_iter=2000))
sk_classifier.train(train_set)
nltk.classify.accuracy(sk_classifier, test_set)

0.778

In [38]:
from sklearn.svm import NuSVC
sk_classifier = SklearnClassifier(NuSVC())
sk_classifier.train(train_set)
nltk.classify.accuracy(sk_classifier, test_set)

0.796

## References

- NLTK Book, [Chapter 6 Learning to Classify Texts](https://www.nltk.org/book/ch06.html)
- Géron (2019), Chapter 3 Classification