# Learning
In this part, we try to learn how people die, based on the 4 features generated earlier:
* Education (8 bins)
* Sex
* Race (White, Black, Other)
* Marital status (Single, Married, ...)

We will use several learning models to "guess" how people die based on these 4 pieces of information.  
We will try to learn two targets:
* The **manner** of death. 7 categories:
  * Accident
  * Suicide
  * Homicide
  * Pending investigation
  * Could not determine
  * Self-Inflicted
  * Natural
* The **cause** of death (much more precise: 39 categories)

## Learning the manner of death (7 bins)

First, we load the features generated in the previous part.

In [1]:
import numpy as np
from pickle import load
from time import time

features = load(open("features4.pickle", "rb"))
y_all = load(open("manner.pickle", "rb"))

After that, we draw a random sample of 10000 people from the entire dataset and split it into a train set (67%) and a test set (33%).

In [2]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

# randint draws indices with replacement, so a few people may be selected more than once
selected = np.random.randint(y_all.shape[0],size=10000)
X = features[selected,:]
y = y_all[selected]

X, X_test, y, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
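
Because `np.random.randint` draws with replacement, the same person can end up in both the train and the test set. The sketch below only illustrates the difference between the two sampling modes; the dataset size `n_total` and the seed are made-up values, not taken from the real data.

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen for illustration only
n_total, n_sample = 100_000, 10_000   # n_total is a hypothetical dataset size

# Same behaviour as np.random.randint: indices may repeat
with_repl = rng.integers(n_total, size=n_sample)
# Sampling without replacement: every index is unique
without_repl = rng.choice(n_total, size=n_sample, replace=False)

n_dup = n_sample - np.unique(with_repl).size
n_dup_unique = n_sample - np.unique(without_repl).size
print("repeated draws with replacement:   ", n_dup)
print("repeated draws without replacement:", n_dup_unique)
```

If duplicates matter for the evaluation, indices such as `without_repl` could be used in place of `selected` before calling `train_test_split`.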

We then one-hot encode both sets, so that distances between **categorical** features are meaningful.

In [3]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
X = enc.fit_transform(X)
X_test = enc.transform(X_test)
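
To see why the encoding helps, here is a small sketch on a toy matrix (the three rows are invented for illustration). After one-hot encoding, the squared Euclidean distance between two people is twice the number of features on which they differ, so the integer codes themselves no longer influence the distance.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data: 3 people, 4 categorical features (values are illustrative only)
toy = np.array([[1, 0, 3, 0],
                [6, 1, 1, 1],
                [1, 1, 3, 0]])

enc_toy = OneHotEncoder()
onehot = enc_toy.fit_transform(toy).toarray()  # dense copy, for inspection only

def sq_dist(a, b):
    """Squared Euclidean distance between two encoded rows."""
    return float(np.sum((a - b) ** 2))

d01 = sq_dist(onehot[0], onehot[1])  # rows differ on all 4 features
d02 = sq_dist(onehot[0], onehot[2])  # rows differ on 1 feature only
print(onehot.shape, d01, d02)
```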

### Model comparison

#### SVC

In [4]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

C = 10
clf = OneVsRestClassifier(SVC(kernel='linear', C=C, probability=True), n_jobs=4)

print("TRAINING....")

t0 = time()
clf.fit(X, y)
t1 = time()
print("TIME:\n", t1-t0)

print("SCORE:")
print(clf.score(X_test, y_test))

TRAINING....
TIME:
 5.259270191192627
SCORE:
0.78


#### K neighbors

In [5]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=300)
print("TRAINING....")

t0 = time()
clf.fit(X, y)
t1 = time()
print("TIME:\n", t1-t0)

print("SCORE:")
print(clf.score(X_test, y_test))

TRAINING....
TIME:
 0.0017457008361816406
SCORE:
0.78


#### Naive Bayes

In [6]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
print("TRAINING....")

t0 = time()
clf.fit(X.toarray(), y)
t1 = time()
print("TIME:\n", t1-t0)


print("SCORE:")
print(clf.score(X_test.toarray(), y_test))

TRAINING....
TIME:
 0.0060193538665771484
SCORE:
0.779090909091


All three models reach nearly the same score (about 0.78), but the linear SVC is much slower to train than the other two.  
The examples below use the Naive Bayes model, which is the last one fitted.
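
One way to put such scores in context is to compare them with a majority-class baseline, which always predicts the most frequent label. The sketch below is not part of the original analysis and runs on synthetic data: the class proportions are made up and say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in data, NOT the real dataset: 4 categorical features, 7 classes
X_demo = rng.integers(0, 3, size=(1000, 4))
y_demo = rng.choice(7, size=1000, p=[0.1, 0.05, 0.03, 0.01, 0.005, 0.005, 0.8])

# Same 67% / 33% proportions as the split used in this notebook
X_tr, X_te = X_demo[:670], X_demo[670:]
y_tr, y_te = y_demo[:670], y_demo[670:]

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_score = baseline.score(X_te, y_te)

# The baseline accuracy equals the share of the majority class in the test set
majority_share = np.mean(y_te == np.bincount(y_tr).argmax())
print(baseline_score, majority_share)
```

A model is only informative to the extent that it beats this baseline on the same test set.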

### Examples

Black man with no education

In [7]:
ex = enc.transform([[1, 0, 3, 0]])
clf.predict_proba(ex)

array([[ 0.0936895 ,  0.1109537 ,  0.00976497,  0.2584388 ,  0.00478453,
         0.00826412,  0.51410439]])

Married white woman with high education

In [8]:
ex = enc.transform([[6, 1, 1, 1]])
clf.predict_proba(ex)

array([[  1.63399336e-01,   2.22790087e-02,   4.66005052e-03,
          2.64268740e-04,   1.02434968e-03,   5.27518901e-04,
          8.07845468e-01]])
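
The columns returned by `predict_proba` follow the order of the classifier's `classes_` attribute, so pairing the two makes the output easier to read. The sketch below uses a small synthetic model; the data and the labels 1 to 7 are placeholders, not the real manner codes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Synthetic stand-in data: 4 categorical features, hypothetical labels 1..7
X_demo = rng.integers(0, 3, size=(500, 4))
y_demo = rng.integers(1, 8, size=500)

enc_demo = OneHotEncoder()
nb = MultinomialNB().fit(enc_demo.fit_transform(X_demo), y_demo)

# Probabilities for one example, paired with the label each column refers to
proba = nb.predict_proba(enc_demo.transform([[1, 0, 2, 0]]))[0]
ranked = sorted(zip(nb.classes_, proba), key=lambda pair: -pair[1])
for label, p in ranked:
    print(label, round(float(p), 3))
```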

## Learning the cause of death (39 bins)

In [9]:
features = load(open("features4.pickle", "rb"))
y_all = load(open("cause.pickle", "rb"))

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

selected = np.random.randint(y_all.shape[0],size=10000)
X = features[selected,:]
y = y_all[selected]

X, X_test, y, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
X = enc.fit_transform(X)
X_test = enc.transform(X_test)

### K Neighbors

In [10]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=300)
print("TRAINING....")

t0 = time()
clf.fit(X, y)
t1 = time()
print("TIME:\n", t1-t0)

print("SCORE:")
print(clf.score(X_test, y_test))

TRAINING....
TIME:
 0.0033082962036132812
SCORE:
0.175454545455


As you can see, the four features do not carry enough information to learn these 39 bins. A score of about 0.18 is too low to expect reliable results from the model. We have reached the limit of this dataset.
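
With this many classes, a softer metric such as top-k accuracy (is the true cause among the k most probable predictions?) can show whether the model captures anything useful. This is a suggestion that goes beyond the original analysis; the helper `top_k_accuracy` and the tiny arrays below are hypothetical.

```python
import numpy as np

def top_k_accuracy(proba, y_true, classes, k=5):
    """Share of samples whose true label is among the k most probable classes.

    proba   : (n_samples, n_classes) array, e.g. from clf.predict_proba
    classes : label of each column, e.g. clf.classes_
    """
    top_k = np.argsort(proba, axis=1)[:, -k:]   # column indices of the k largest
    hits = (np.asarray(classes)[top_k] == np.asarray(y_true)[:, None]).any(axis=1)
    return float(hits.mean())

# Tiny hand-made example with 3 classes labelled 10, 20, 30
proba_demo = np.array([[0.1, 0.6, 0.3],
                       [0.5, 0.2, 0.3],
                       [0.2, 0.3, 0.5]])
classes_demo = np.array([10, 20, 30])
y_demo = [30, 20, 30]

top1 = top_k_accuracy(proba_demo, y_demo, classes_demo, k=1)
top2 = top_k_accuracy(proba_demo, y_demo, classes_demo, k=2)
print(top1, top2)
```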

To learn this precise cause of death, we would need more information about these people, such as medical details or geographic location.