In [32]:
import pandas as pd
import nltk
import sklearn
from sklearn.metrics import cohen_kappa_score

## The Adult Data set

In this workbook, you'll work with data extracted from the 1994 US census, which I got from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Adult). The data set's name, "Adult", is not very helpful.

Essentially, this data set records demographic and employment information about adults (age, education, occupation, and so on), plus whether each person's income was more than 50K a year or not. The classification task is to use the other variables to predict which side of that threshold someone falls on.

The data comes in a somewhat unusual format: an `arff` file (ARFF, the Attribute-Relation File Format used by Weka). I'll deal with reading it in for you.

In [None]:
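from scipy.io import arff

# loadarff returns a NumPy record array plus metadata describing the attributes
data, meta = arff.loadarff("corpora/adult/adult.arff")
df = pd.DataFrame(data)

# One dictionary per row, keyed by column name
dlist = df.to_dict('records')

# Nominal (categorical) values are read in as bytes; decode them to ordinary strings
for row in dlist:
    for k, v in row.items():
        if isinstance(v, bytes):
            row[k] = str(v, "utf-8")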

Now the data is in the same form we used in the preceding notebooks: a list of dictionaries, one per person.
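One reason the list-of-dictionaries form is convenient is that simple tallies need no extra machinery. Here is a minimal sketch using `collections.Counter`. The `records` list is made up for illustration (it is not real rows from the data set), and only the two keys shown are assumed.

```python
from collections import Counter

# Hypothetical records in the same shape as dlist (made-up values, not real rows).
records = [
    {"education": "Bachelors", "income": "<=50K"},
    {"education": "Bachelors", "income": ">50K"},
    {"education": "HS-grad", "income": "<=50K"},
    {"education": "Doctorate", "income": ">50K"},
    {"education": "HS-grad", "income": "<=50K"},
]

# How many records carry each income label
income_counts = Counter(r["income"] for r in records)

# How many records have each (education, income) combination
by_education = Counter((r["education"], r["income"]) for r in records)

print(income_counts.most_common())
print(by_education[("HS-grad", "<=50K")])
```

Running the same two `Counter` lines over `dlist` instead of `records` would give the real label and education counts.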

In [22]:
dlist[0]

{'age': 39.0,
 'workclass': 'State-gov',
 'fnlwgt': 77516.0,
 'education': 'Bachelors',
 'education-num': 13.0,
 'marital-status': 'Never-married',
 'occupation': 'Adm-clerical',
 'relationship': 'Not-in-family',
 'race': 'White',
 'sex': 'Male',
 'capital-gain': 2174.0,
 'capital-loss': 0.0,
 'hours-per-week': 40.0,
 'native-country': 'United-States',
 'income': '<=50K'}

In [33]:
import random

# Shuffle so the split below is random; no seed is set, so results vary between runs
random.shuffle(dlist)
train_size = int(.9 * len(dlist))  # 90% for training, 10% for testing

def adult_features(r):
    # Use education as the only feature for now
    return {"education": r["education"]}

# Pair each feature dictionary with its income label
labeled_feature_sets = [(adult_features(r), r["income"]) for r in dlist]
train_set = labeled_feature_sets[:train_size]
test_set = labeled_feature_sets[train_size:]
adult_classifier = nltk.NaiveBayesClassifier.train(train_set)

# Compare the true (gold) labels with the classifier's guesses on the test set
gold_list = [t[1] for t in test_set]
guess_list = [adult_classifier.classify(t[0]) for t in test_set]
cm = nltk.ConfusionMatrix(gold_list, guess_list)
print(cm)
accuracy = nltk.classify.accuracy(adult_classifier, test_set)
print("accuracy =", accuracy)
print("kappa = ", str(cohen_kappa_score(gold_list, guess_list)))

      |    <      |
      |    =    > |
      |    5    5 |
      |    0    0 |
      |    K    K |
------+-----------+
<=50K |<2361> 103 |
 >50K |  640 <153>|
------+-----------+
(row = reference; col = test)

accuracy = 0.7718759594719067
kappa =  0.19618365515979153
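
To see where these two numbers come from, here is a small sketch that recomputes them by hand from the four counts in the confusion matrix above. It uses the standard definition of Cohen's kappa, (observed agreement - chance agreement) / (1 - chance agreement). The variable names and the majority-class baseline are my own additions for illustration.

```python
# Counts copied from the confusion matrix above (rows = reference, columns = predicted).
labels = ["<=50K", ">50K"]
matrix = [
    [2361, 103],  # reference <=50K
    [640, 153],   # reference >50K
]

total = sum(sum(row) for row in matrix)

# Observed agreement: the fraction on the diagonal, which is the accuracy
observed = sum(matrix[i][i] for i in range(len(labels))) / total

row_totals = [sum(row) for row in matrix]
col_totals = [sum(matrix[i][j] for i in range(len(labels))) for j in range(len(labels))]

# Agreement expected by chance if reference and predicted labels were independent
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2

kappa = (observed - expected) / (1 - expected)

# Accuracy of always guessing the most common reference label
baseline = max(row_totals) / total

print("accuracy =", observed)
print("majority baseline =", baseline)
print("kappa =", kappa)
```

Always guessing `<=50K` would already be right for 2464 of the 3257 test rows (about 76%), so an accuracy of about 77% is only a small improvement. That is why kappa is so much lower than the accuracy.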


In [35]:
adult_classifier.show_most_informative_features()

Most Informative Features
               education = 'Prof-school'    >50K : <=50K  =      9.0 : 1.0
               education = 'Doctorate'      >50K : <=50K  =      8.4 : 1.0
               education = '1st-4th'       <=50K : >50K   =      7.3 : 1.0
               education = '5th-6th'       <=50K : >50K   =      6.3 : 1.0
               education = '9th'           <=50K : >50K   =      5.8 : 1.0
               education = '11th'          <=50K : >50K   =      5.7 : 1.0
               education = '7th-8th'       <=50K : >50K   =      5.3 : 1.0
               education = '10th'          <=50K : >50K   =      4.3 : 1.0
               education = 'Masters'        >50K : <=50K  =      4.1 : 1.0
               education = '12th'          <=50K : >50K   =      3.7 : 1.0
