In [1]:
import pandas as pd
import nltk

# Classification

<img src="http://www.nltk.org/images/supervised-classification.png" alt="drawing" style="width:400px;"/>

## The Titanic Dataset

Dataset from Kaggle [here](https://www.kaggle.com/c/titanic/overview)

> This is the legendary Titanic ML competition: the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

We are starting with a non-text dataset since classifying text adds some complexity.


<table>
<tbody>
<tr><th><b>Variable</b></th><th><b>Definition</b></th><th><b>Key</b></th></tr>
<tr>
<td>survival</td>
<td>Survival</td>
<td>0 = No, 1 = Yes</td>
</tr>
<tr>
<td>pclass</td>
<td>Ticket class</td>
<td>1 = 1st, 2 = 2nd, 3 = 3rd</td>
</tr>
<tr>
<td>sex</td>
<td>Sex</td>
<td></td>
</tr>
<tr>
<td>Age</td>
<td>Age in years</td>
<td></td>
</tr>
<tr>
<td>sibsp</td>
<td># of siblings / spouses aboard the Titanic</td>
<td></td>
</tr>
<tr>
<td>parch</td>
<td># of parents / children aboard the Titanic</td>
<td></td>
</tr>
<tr>
<td>ticket</td>
<td>Ticket number</td>
<td></td>
</tr>
<tr>
<td>fare</td>
<td>Passenger fare</td>
<td></td>
</tr>
<tr>
<td>cabin</td>
<td>Cabin number</td>
<td></td>
</tr>
<tr>
<td>embarked</td>
<td>Port of Embarkation</td>
<td>C = Cherbourg, Q = Queenstown, S = Southampton</td>
</tr>
</tbody>
</table>

Read in the corpus. The result here will be a list of dictionaries.

In [2]:
# file_id = "1lXO7JEO99fLidhHJDdZUpxinXTiU8Vf-"
# url = f'https://drive.google.com/uc?id={file_id}'
url = 'corpora/titanic.csv'
df = pd.read_csv(url)
dlist = df.to_dict('records')

Shuffle the list of dictionaries. Then split it into two parts, one for training and one for testing.

In [3]:
import random
random.shuffle(dlist)
train_size = int(.9 * len(dlist))
train_list = dlist[:train_size]
test_list = dlist[train_size:]
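
The cell above does not fix a random seed, so the split changes on every run. Here is a minimal sketch of a reproducible version. The helper name `shuffle_and_split` and the seed value are assumptions for illustration, not part of the original notebook.

```python
import random

def shuffle_and_split(records, train_fraction=0.9, seed=0):
    """Return (train, test) lists after a seeded shuffle (hypothetical helper)."""
    rng = random.Random(seed)      # private generator, so the global state is untouched
    shuffled = list(records)       # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    train_size = int(train_fraction * len(shuffled))
    return shuffled[:train_size], shuffled[train_size:]

# Stand-in data: 20 integers instead of passenger dictionaries.
train_part, test_part = shuffle_and_split(list(range(20)))
print(len(train_part), len(test_part))
```

Calling the helper twice with the same seed gives the same split, which makes the accuracy figures later in the notebook comparable between runs.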

### Our First Classifier
We just guess, for every person, that they died. So, in this case, there's no training to be done.

We'll create two lists, a `gold_list` and a `guess_list`. The `gold_list` has the right answers. The `guess_list` has the guess made by our classifier.

In [None]:
gold_list = [r["Survived"] for r in test_list]
guess_list = [0 for r in test_list]

nltk has a tool for drawing a confusion matrix, given the gold_list and guess_list.

We'll also look at how many were correct.

In [None]:
def pprint(txt):
    print("<pre>{}</pre>".format(txt))

In [None]:
cm = nltk.ConfusionMatrix(gold_list, guess_list)
pprint(cm)
accuracy = (cm[0, 0] + cm[1, 1]) / len(test_list)
accuracy
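
To make the accuracy calculation concrete, here is a small sketch that tallies the confusion-matrix cells by hand with the standard library. The toy `gold` and `guess` lists are made up for illustration, and the function names are hypothetical; the notebook itself relies on `nltk.ConfusionMatrix`.

```python
from collections import Counter

def confusion_counts(gold, guess):
    """Count how often each (gold label, guessed label) pair occurs."""
    return Counter(zip(gold, guess))

def accuracy_from_counts(counts):
    """Fraction of items on the diagonal, where gold and guess agree."""
    total = sum(counts.values())
    correct = sum(n for (g, h), n in counts.items() if g == h)
    return correct / total

# Made-up labels: five passengers died (0), three survived (1).
gold = [0, 0, 1, 1, 0, 1, 0, 0]
guess = [0] * len(gold)            # the "everyone died" classifier

counts = confusion_counts(gold, guess)
print(counts[(0, 0)], counts[(1, 0)])   # correct "died" guesses, missed survivors
print(accuracy_from_counts(counts))
```

With the always-0 guess, accuracy is just the share of passengers in the test set who died, which is why a chance-corrected measure is useful.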

#### Cohen's Kappa
We should use another measure here, Cohen's Kappa, that adjusts for agreement by chance.

```math
\large \kappa=\frac{p_o-p_e}{1-p_e}
```

where `$p_o$` is the relative observed agreement among raters, and `$p_e$` is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category. If the raters are in complete agreement then `$\kappa=1$`. Here the two "raters" are the gold labels and our classifier's guesses.
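
The formula can be worked through by hand. The sketch below computes the observed agreement and the chance agreement from two label lists. The function name `cohen_kappa` and the toy lists are assumptions for illustration; the notebook itself uses `sklearn.metrics.cohen_kappa_score`.

```python
from collections import Counter

def cohen_kappa(gold, guess):
    """Cohen's kappa for two equal-length label lists (illustrative sketch)."""
    n = len(gold)
    # Observed agreement: share of items where the two lists match.
    p_o = sum(g == h for g, h in zip(gold, guess)) / n
    # Chance agreement: for each label, multiply the two marginal rates, then sum.
    gold_counts = Counter(gold)
    guess_counts = Counter(guess)
    labels = set(gold_counts) | set(guess_counts)
    p_e = sum((gold_counts[label] / n) * (guess_counts[label] / n) for label in labels)
    if p_e == 1:
        return float("nan")        # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1 - p_e)

# Made-up labels: five passengers died (0), three survived (1).
gold = [0, 0, 1, 1, 0, 1, 0, 0]
always_died = [0] * len(gold)

print(cohen_kappa(gold, always_died))   # constant guess: no better than chance
print(cohen_kappa(gold, gold))          # perfect agreement
```

For the constant guess, the observed agreement equals the chance agreement, so the numerator is zero and kappa is 0 even though accuracy looks respectable.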

In [None]:
import sklearn
from sklearn.metrics import cohen_kappa_score
cohen_kappa_score(gold_list, guess_list)

### Our Second Classifier: Everyone with Sex=female survives

Now let's try guessing that every female passenger survived and every male passenger died.

In [None]:
def simple_classifier(r):
    if r["Sex"] == "female":
        return 1
    else:
        return 0

In [None]:
random.shuffle(dlist)
train_list = dlist[:train_size]
test_list = dlist[train_size:]
gold_list = [r["Survived"] for r in test_list]
guess_list = [simple_classifier(r) for r in test_list]

In [None]:
cm = nltk.ConfusionMatrix(gold_list, guess_list)
pprint(cm)
accuracy = (cm[0, 0] + cm[1, 1]) / len(test_list)
print("accuracy = " + str(accuracy))
print("kappa = " + str(cohen_kappa_score(gold_list, guess_list)))

This works pretty well. In a way, that's a problem, since it will be hard to do better.

## Set up a more formal structure

We want a more formal structure, both to organize our work and because nltk expects data structures with a particular form.

We don't really need to start over. But we will, so we have everything in one place.

In [None]:
df = pd.read_csv('corpora/titanic.csv')
dlist = df.to_dict('records')

We convert the data to **labeled feature sets**. Each "feature set" is a dictionary of features. A labeled feature set pairs that dictionary with its "label," which is the right answer for that feature set.

We'll start with an extremely simple feature set, which just includes the "Sex" variable.

In [None]:
def passenger_features(r):
    return {"sex": r["Sex"]}

labeled_feature_sets = [(passenger_features(r), r["Survived"]) for r in dlist]
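
To show the shape of the result without loading the CSV, here is a sketch that applies the same function to two made-up records. The records are invented for illustration and are not rows from the dataset.

```python
def passenger_features(r):
    return {"sex": r["Sex"]}

# Two invented records with the same keys as the rows in dlist.
toy_records = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
]

# Each element is a (feature dictionary, label) tuple.
toy_labeled = [(passenger_features(r), r["Survived"]) for r in toy_records]
print(toy_labeled)
# [({'sex': 'female'}, 1), ({'sex': 'male'}, 0)]
```

Columns that `passenger_features` does not mention, such as `Pclass` here, are simply dropped, so the classifier only ever sees what the feature function returns.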

We split the labeled feature sets into training and test sets.

In [None]:
train_size = int(.9 * len(dlist))

train_set = labeled_feature_sets[:train_size]
test_set = labeled_feature_sets[train_size:]

**Create and train a classifier**

There are many machine learning algorithms that we can use.
From a practical point of view, switching from one algorithm to another can be as easy as changing part of one line of code. But you might want to change other parts of the pipeline.

We're going to start by making use of an algorithm called **Naive Bayes**, in part because it's relatively easy to explain how it works. (I'm going to explain it in a bit.)

In [None]:
import nltk
titanic_classifier = nltk.NaiveBayesClassifier.train(train_set)

We can now use this classifier to classify individual *feature sets*. For example:

In [None]:
print(labeled_feature_sets[0])
titanic_classifier.classify(labeled_feature_sets[0][0])

In [None]:
print(labeled_feature_sets[1])
titanic_classifier.classify(labeled_feature_sets[1][0])

**Evaluate**: Look at how this classifier does overall on various metrics

In [None]:
gold_list = [t[1] for t in test_set]
guess_list = [titanic_classifier.classify(t[0]) for t in test_set]
cm = nltk.ConfusionMatrix(gold_list, guess_list)
pprint(cm)
accuracy = nltk.classify.accuracy(titanic_classifier, test_set)
print("accuracy = " + str(accuracy))
print("kappa = " + str(cohen_kappa_score(gold_list, guess_list)))

**Altogether**: Let's put it all together now, but with a more extensive feature set.

This cell runs our whole process.

In [None]:
df = pd.read_csv('corpora/titanic.csv')
dlist = df.to_dict('records')
random.shuffle(dlist)

def passenger_features(r):
    return {"sex": r["Sex"], "pclass": r["Pclass"], "embarked": r["Embarked"]}

labeled_feature_sets = [(passenger_features(r), r["Survived"]) for r in dlist]
train_size = int(.9 * len(dlist))  # recomputed here so this cell stands alone
train_set = labeled_feature_sets[:train_size]
test_set = labeled_feature_sets[train_size:]
titanic_classifier = nltk.NaiveBayesClassifier.train(train_set)
gold_list = [t[1] for t in test_set]
guess_list = [titanic_classifier.classify(t[0]) for t in test_set]
cm = nltk.ConfusionMatrix(gold_list, guess_list)
pprint(cm)
accuracy = nltk.classify.accuracy(titanic_classifier, test_set)
print("accuracy = " + str(accuracy))
print("kappa = " + str(cohen_kappa_score(gold_list, guess_list)))

It's not clear whether this is better or worse than using the feature set that only had "sex."

The Naive Bayes classifier in nltk will also show us the "most informative features."

In [None]:
def show_most_informative_features(classif, number=25):
    print("<pre>")
    classif.show_most_informative_features(number)  # honor the caller's limit
    print("</pre>")

In [None]:
show_most_informative_features(titanic_classifier, 25)

## How Naive Bayes works

### The problem


We want to know the probability of a given label given a set of features.

```math
\large P(L|f_1 + f_2 + f_3+...)
```

For example, in the Titanic case, we'd like to know the probability that a passenger died, given that they were female, were in class 2, and embarked at S. If we knew the probability that they died, and the probability that they lived, we could pick the one that was larger.

We can use the training data to estimate probabilities like this.

Using the training set, we can easily gather data about how often a feature occurs with a given label:

```math
\large P(f_1|L), P(f_2|L), P(f_3|L)
```

So, the question is, can we get to the former from the latter?

The answer is yes, with Bayes' theorem and a bit of fudging.

```math
\large P(L|f_1 + f_2 + f_3+...) = \frac{P(L)P(f_1|L)P(f_2|L)P(f_3|L)\cdots}{P(f_1)P(f_2)P(f_3)\cdots}
```

Bayes' theorem will let us flip the L and the fs.
Being "naive" means treating the fs as independent of one another, which will let us split them up.
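
As a rough illustration of the numerator in the formula above, the sketch below estimates `P(L)` and each `P(f|L)` by counting, multiplies them, and then rescales the scores so they sum to 1. The toy training set and the function names are made up for illustration. It uses raw relative frequencies with no smoothing, so it is not a reproduction of nltk's `NaiveBayesClassifier`, whose probability estimates are smoothed.

```python
from collections import Counter, defaultdict
from math import prod

# Invented labeled feature sets; not rows from the Titanic data.
toy_train = [
    ({"sex": "female", "pclass": 1}, 1),
    ({"sex": "female", "pclass": 2}, 1),
    ({"sex": "female", "pclass": 3}, 0),
    ({"sex": "male", "pclass": 1}, 1),
    ({"sex": "male", "pclass": 2}, 0),
    ({"sex": "male", "pclass": 3}, 0),
    ({"sex": "male", "pclass": 3}, 0),
    ({"sex": "male", "pclass": 2}, 0),
]

def train_counts(labeled_sets):
    """Count labels, and (feature name, value) pairs within each label."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for features, label in labeled_sets:
        label_counts[label] += 1
        for item in features.items():
            feature_counts[label][item] += 1
    return label_counts, feature_counts

def posterior(features, label_counts, feature_counts):
    """Score each label as P(L) * product of P(f|L), then rescale to sum to 1."""
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        prior = n / total
        likelihoods = (feature_counts[label][item] / n for item in features.items())
        scores[label] = prior * prod(likelihoods)
    norm = sum(scores.values())
    if norm == 0:
        return scores              # every label scored zero; nothing to rescale
    return {label: score / norm for label, score in scores.items()}

label_counts, feature_counts = train_counts(toy_train)
result = posterior({"sex": "female", "pclass": 2}, label_counts, feature_counts)
print(result)
print("predicted label:", max(result, key=result.get))
```

The denominator of the formula is the same for every label, so it does not affect which label wins. That is why the sketch can skip it and simply rescale the numerators.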