# Classification Exercise

We'll be working with some California Census Data, using various features of an individual to predict which income class they belong to (>50K or <=50K).

Here is some information about the data:

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>age</td>
<td>Continuous</td>
<td>The age of the individual</td>
</tr>
<tr>
<td>workclass</td>
<td>Categorical</td>
<td>The type of employer the individual has (government, military, private, etc.).</td>
</tr>
<tr>
<td>fnlwgt</td>
<td>Continuous</td>
<td>The number of people the census takers believe that observation represents (sample weight). This variable will not be used.</td>
</tr>
<tr>
<td>education</td>
<td>Categorical</td>
<td>The highest level of education achieved by that individual.</td>
</tr>
<tr>
<td>education_num</td>
<td>Continuous</td>
<td>The highest level of education in numerical form.</td>
</tr>
<tr>
<td>marital_status</td>
<td>Categorical</td>
<td>Marital status of the individual.</td>
</tr>
<tr>
<td>occupation</td>
<td>Categorical</td>
<td>The occupation of the individual.</td>
</tr>
<tr>
<td>relationship</td>
<td>Categorical</td>
<td>Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.</td>
</tr>
<tr>
<td>race</td>
<td>Categorical</td>
<td>White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.</td>
</tr>
<tr>
<td>gender</td>
<td>Categorical</td>
<td>Female, Male.</td>
</tr>
<tr>
<td>capital_gain</td>
<td>Continuous</td>
<td>Capital gains recorded.</td>
</tr>
<tr>
<td>capital_loss</td>
<td>Continuous</td>
<td>Capital losses recorded.</td>
</tr>
<tr>
<td>hours_per_week</td>
<td>Continuous</td>
<td>Hours worked per week.</td>
</tr>
<tr>
<td>native_country</td>
<td>Categorical</td>
<td>Country of origin of the individual.</td>
</tr>
<tr>
<td>income_bracket</td>
<td>Categorical</td>
<td>"&gt;50K" or "&lt;=50K", meaning whether the person makes more than \$50,000 annually.</td>
</tr>
</tbody>
</table>

## Follow the Directions in Bold. If you get stuck, check out the solutions lecture.

### THE DATA

**Read in the census_data.csv data with pandas.**

In [1]:
import pandas as pd

In [43]:
census = pd.read_csv('./census_data.csv')

In [24]:
census.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [5]:
census.describe()

Unnamed: 0,age,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,10.080679,1077.648844,87.30383,40.437456
std,13.640433,2.57272,7385.292085,402.960219,12.347429
min,17.0,1.0,0.0,0.0,1.0
25%,28.0,9.0,0.0,0.0,40.0
50%,37.0,10.0,0.0,0.0,40.0
75%,48.0,12.0,0.0,0.0,45.0
max,90.0,16.0,99999.0,4356.0,99.0


In [67]:
census.shape

(32561, 13)


**TensorFlow won't be able to understand strings as labels, so you'll need to use the pandas .apply() method to apply a custom function that converts them to 0s and 1s. This might be hard if you aren't very familiar with pandas, so feel free to take a peek at the solutions for this part.**

**Convert the label column to 0s and 1s instead of strings.**

In [18]:
for c in census.columns:
    print(c, census[c].unique()[:5])

age [39 50 38 53 28]
workclass [' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov']
education [' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th']
education_num [13  9  7 14  5]
marital_status [' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated']
occupation [' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service']
relationship [' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried']
race [' White' ' Black' ' Asian-Pac-Islander' ' Amer-Indian-Eskimo' ' Other']
gender [' Male' ' Female']
capital_gain [ 2174     0 14084  5178  5013]
capital_loss [   0 2042 1408 1902 1573]
hours_per_week [40 13 16 45 50]
native_country [' United-States' ' Cuba' ' Jamaica' ' India' ' ?']
income_bracket [' <=50K' ' >50K']


In [44]:
# pd.Categorical orders the categories by sorted value, so ' <=50K' -> 0 and ' >50K' -> 1
census['target_income'] = pd.Categorical(census.income_bracket).codes

In [45]:
census.income_bracket.unique()

array([' <=50K', ' >50K'], dtype=object)
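The prompt above suggests pandas `.apply()` with a custom function, while the cell uses `pd.Categorical(...).codes`. A minimal sketch of the `.apply()` route follows. `label_fix` is a hypothetical helper name, and the sample list stands in for the `income_bracket` column so the snippet runs without the CSV. The raw strings carry a leading space, which the helper strips before comparing.

```python
def label_fix(label):
    """Map an income bracket string to 0 (<=50K) or 1 (>50K)."""
    # Raw values look like ' <=50K' and ' >50K' (leading space), so strip first.
    return 1 if label.strip() == '>50K' else 0

# Stand-in for census['income_bracket']; values copied from the unique() output.
sample_brackets = [' <=50K', ' >50K', ' <=50K']
sample_labels = [label_fix(b) for b in sample_brackets]
print(sample_labels)  # [0, 1, 0]

# With the DataFrame loaded, the equivalent pandas call would be:
# census['target_income'] = census['income_bracket'].apply(label_fix)
```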

### Perform a Train Test Split on the Data

In [39]:
from sklearn.model_selection import train_test_split

In [46]:
# Keep the numeric label separately, then drop both label columns from the features
label = census.target_income
census = census.drop(['target_income', 'income_bracket'], axis=1)

In [48]:
X_train, X_test, y_train, y_test = train_test_split(
    census, 
    label,
    test_size = 0.33, 
    random_state = 101
)
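As a quick sanity check on the split, the row counts can be worked out by hand. This sketch assumes scikit-learn rounds the test share up and gives the remainder to training, which is consistent with the 10746 test rows shown in the classification reports in the evaluation section.

```python
import math

n_rows = 32561       # rows reported by census.describe()
test_size = 0.33     # fraction passed to train_test_split

# Assumption: test rows = ceil(test_size * n_rows), train rows = the rest.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_test, n_train)  # 10746 21815
```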

In [49]:
X_train.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country
941,35,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,0,0,40,United-States
25762,60,Self-emp-not-inc,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States
3987,37,Private,Bachelors,13,Married-civ-spouse,Other-service,Husband,Asian-Pac-Islander,Male,0,0,30,China
17851,31,Private,Some-college,10,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States
12116,43,Private,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,4064,0,38,United-States


In [50]:
y_train.head()

941      0
25762    1
3987     0
17851    0
12116    0
Name: target_income, dtype: int8

### Create the Feature Columns for tf.estimator

**Take note of categorical vs continuous values!**

In [51]:
census.columns

Index(['age', 'workclass', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'gender', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country'],
      dtype='object')

**Import TensorFlow**

In [52]:
import tensorflow as tf

**Create the tf.feature_columns for the categorical values. Use vocabulary lists or just use hash buckets.**

In [57]:
for c in census.columns:
    print(c, census[c].unique().size)

age 73
workclass 9
education 16
education_num 16
marital_status 7
occupation 15
relationship 6
race 5
gender 2
capital_gain 119
capital_loss 92
hours_per_week 94
native_country 42
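To make the hash-bucket option concrete: a hashed column maps each string to a hash of the value modulo `hash_bucket_size`, so no vocabulary is needed, but distinct values can collide when the bucket count is close to the number of categories. The sketch below illustrates the idea with `hashlib.md5` as a stand-in. TensorFlow uses its own internal hash function, so the actual bucket ids will differ, and `bucket_id` is a hypothetical helper.

```python
import hashlib

def bucket_id(value, hash_bucket_size):
    """Deterministically map a string to an integer in [0, hash_bucket_size)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

# First five workclass values from the unique() output earlier in the notebook.
workclasses = [' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov', ' Local-gov']
ids = {w: bucket_id(w, 10) for w in workclasses}
print(ids)

# Fewer distinct ids than values means at least two values share a bucket.
n_collisions = len(workclasses) - len(set(ids.values()))
print('collisions:', n_collisions)
```

With 9 distinct `workclass` values and only 10 buckets, some collisions are likely; a larger `hash_bucket_size` reduces them at the cost of more weights.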


In [60]:
categorical_features = [
    tf.feature_column.categorical_column_with_hash_bucket('workclass', 10),
    tf.feature_column.categorical_column_with_hash_bucket('education', 20),
    tf.feature_column.categorical_column_with_hash_bucket('marital_status', 10),
    tf.feature_column.categorical_column_with_hash_bucket('occupation', 20),
    tf.feature_column.categorical_column_with_vocabulary_list('relationship', ['Wife', 'Own-child', 'Husband', 'Not-in-family', 'Other-relative', 'Unmarried']),
    tf.feature_column.categorical_column_with_vocabulary_list('race', ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black']),
    tf.feature_column.categorical_column_with_vocabulary_list('gender', ['Female', 'Male']),
    tf.feature_column.categorical_column_with_hash_bucket('native_country', 50),
]
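One thing worth checking before relying on the vocabulary lists: the `unique()` output earlier shows raw values with a leading space (for example `' Wife'`), whereas the lists here have none. The column repr printed later in the notebook shows `default_value=-1` and `num_oov_buckets=0`, so unmatched strings would fall back to -1. The pure-Python sketch below mimics that lookup; `vocab_lookup` is a hypothetical helper, not a TensorFlow function.

```python
def vocab_lookup(value, vocabulary, default_value=-1):
    """Return the index of value in vocabulary, or default_value if absent."""
    return vocabulary.index(value) if value in vocabulary else default_value

gender_vocab = ['Female', 'Male']

raw = ' Male'  # as stored in the DataFrame, with a leading space
unmatched = vocab_lookup(raw, gender_vocab)
matched = vocab_lookup(raw.strip(), gender_vocab)
print(unmatched, matched)  # -1 1

# One possible fix in pandas, applied before the train/test split:
# for col in census.select_dtypes(include='object').columns:
#     census[col] = census[col].str.strip()
```

Passing `skipinitialspace=True` to `pd.read_csv` is another option that removes the spaces at load time.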

**Create the continuous feature_columns for the continuous values using numeric_column.**

In [61]:
continuous_features = [
    tf.feature_column.numeric_column('age'),
    tf.feature_column.numeric_column('education_num'),
    tf.feature_column.numeric_column('capital_gain'),
    tf.feature_column.numeric_column('capital_loss'),
    tf.feature_column.numeric_column('hours_per_week'),
]

**Put all these variables into a single list with the variable name feat_cols.**

In [62]:
feat_cols = categorical_features + continuous_features

### Create Input Function

**The batch_size is up to you, but do make sure to shuffle!**

In [91]:
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x = X_train, 
    y = y_train, 
    batch_size = 100, 
    num_epochs = None, 
    shuffle = True
)
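Because `num_epochs = None` makes the input function repeat indefinitely, the `steps` argument passed to `train()` is what bounds training. A rough calculation of how many passes over the data that implies, assuming the training set is the 32561 rows minus the 10746 test rows:

```python
batch_size = 100
n_train = 32561 - 10746   # training rows: total minus the test rows

def epochs_seen(steps, batch_size, n_rows):
    """Approximate number of full passes over the data for a given step count."""
    return steps * batch_size / n_rows

linear_epochs = epochs_seen(5000, batch_size, n_train)    # steps used for the linear model
dnn_epochs = epochs_seen(10000, batch_size, n_train)      # steps used for the DNN

print(round(linear_epochs, 1), round(dnn_epochs, 1))  # 22.9 45.8
```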

#### Create your model with tf.estimator

**Create a LinearClassifier. (If you want to use a DNNClassifier, keep in mind you'll need to create embedded columns out of the categorical features that use strings; check out the previous lecture on this for more info.)**

In [92]:
linearModel = tf.estimator.LinearClassifier(
    feature_columns = feat_cols,
    n_classes = 2
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/7f/kl6zxp_50ddcw8xr42yvrpfm0000gn/T/tmpnsocjcdt', '_save_summary_steps': 100, '_log_step_count_steps': 100, '_save_checkpoints_secs': 600, '_keep_checkpoint_max': 5, '_session_config': None, '_tf_random_seed': 1, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_steps': None}


In [121]:
categorical_features

[_HashedCategoricalColumn(key='workclass', hash_bucket_size=10, dtype=tf.string),
 _HashedCategoricalColumn(key='education', hash_bucket_size=20, dtype=tf.string),
 _HashedCategoricalColumn(key='marital_status', hash_bucket_size=10, dtype=tf.string),
 _HashedCategoricalColumn(key='occupation', hash_bucket_size=20, dtype=tf.string),
 _VocabularyListCategoricalColumn(key='relationship', vocabulary_list=('Wife', 'Own-child', 'Husband', 'Not-in-family', 'Other-relative', 'Unmarried'), dtype=tf.string, default_value=-1, num_oov_buckets=0),
 _VocabularyListCategoricalColumn(key='race', vocabulary_list=('White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black'), dtype=tf.string, default_value=-1, num_oov_buckets=0),
 _VocabularyListCategoricalColumn(key='gender', vocabulary_list=('Female', 'Male'), dtype=tf.string, default_value=-1, num_oov_buckets=0),
 _HashedCategoricalColumn(key='native_country', hash_bucket_size=50, dtype=tf.string)]

In [125]:
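# Wrap each categorical column in an embedding column so the DNN gets dense inputs.
# The embedding dimension is set to the bucket count (hashed columns)
# or the vocabulary length (vocabulary-list columns).
dnn_categorical_features = []
for cf in categorical_features:
    if hasattr(cf, 'hash_bucket_size'):
        emb = tf.feature_column.embedding_column(
            cf, 
            cf.hash_bucket_size
        )
    else:
        emb = tf.feature_column.embedding_column(
            cf,
            len(cf.vocabulary_list)
        )
    dnn_categorical_features.append(emb)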
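The loop above sets each embedding's dimension equal to its bucket or vocabulary size, which works but is larger than usually needed. One rule of thumb mentioned in TensorFlow's feature-column material is roughly the fourth root of the number of categories; treat it as a starting point rather than a requirement. `suggested_dim` is a hypothetical helper, and its floor of 2 is an arbitrary choice made for this sketch.

```python
import math

def suggested_dim(n_categories):
    """Fourth-root rule of thumb, rounded up, with an (arbitrary) floor of 2."""
    return max(2, math.ceil(n_categories ** 0.25))

# Bucket / vocabulary sizes used for some of the categorical columns above.
sizes = {'workclass': 10, 'education': 20, 'native_country': 50, 'gender': 2}
dims = {name: suggested_dim(n) for name, n in sizes.items()}
print(dims)
```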

In [150]:
dnn_feat_cols = dnn_categorical_features + continuous_features
dnnModel = tf.estimator.DNNClassifier(
    hidden_units = [ 20 ] * 3, 
    feature_columns = dnn_feat_cols
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/7f/kl6zxp_50ddcw8xr42yvrpfm0000gn/T/tmpzv8hwxjm', '_save_summary_steps': 100, '_log_step_count_steps': 100, '_save_checkpoints_secs': 600, '_keep_checkpoint_max': 5, '_session_config': None, '_tf_random_seed': 1, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_steps': None}


**Train your model on the data for at least 5000 steps.**

In [93]:
linearModel.train(train_input_fn, steps=5000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/7f/kl6zxp_50ddcw8xr42yvrpfm0000gn/T/tmpnsocjcdt/model.ckpt.
INFO:tensorflow:step = 1, loss = 69.3147
INFO:tensorflow:global_step/sec: 104.473
INFO:tensorflow:step = 101, loss = 278.105 (0.959 sec)
INFO:tensorflow:global_step/sec: 113.484
INFO:tensorflow:step = 201, loss = 61.8264 (0.881 sec)
INFO:tensorflow:global_step/sec: 113.271
INFO:tensorflow:step = 301, loss = 154.323 (0.883 sec)
INFO:tensorflow:global_step/sec: 113.71
INFO:tensorflow:step = 401, loss = 1013.53 (0.879 sec)
INFO:tensorflow:global_step/sec: 115.365
INFO:tensorflow:step = 501, loss = 106.537 (0.867 sec)
INFO:tensorflow:global_step/sec: 112.636
INFO:tensorflow:step = 601, loss = 572.219 (0.888 sec)
INFO:tensorflow:global_step/sec: 112.729
INFO:tensorflow:step = 701, loss = 99.7765 (0.891 sec)
INFO:tensorflow:global_step/sec: 114.222
INFO:tensorflow:step = 801, loss = 123.912 (0.872 sec)
INFO:tensorflow:global_step/s

<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x12037e5c0>
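The first logged loss, 69.3147, has a simple explanation. With all weights starting at zero the model predicts probability 0.5 for every row, so each example contributes ln 2 to the log loss, and the logged figure appears to be the sum over the batch of 100 rather than the mean. A quick check:

```python
import math

batch_size = 100
per_example_loss = -math.log(0.5)   # log loss when the predicted probability is 0.5

initial_loss = batch_size * per_example_loss
print(round(initial_loss, 4))  # 69.3147
```

If the loss is indeed summed per batch, the logged values scale with `batch_size` and are only comparable between runs that use the same batch size.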

In [151]:
dnnModel.train(train_input_fn, steps=10000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/7f/kl6zxp_50ddcw8xr42yvrpfm0000gn/T/tmpzv8hwxjm/model.ckpt.
INFO:tensorflow:step = 1, loss = 2001.57
INFO:tensorflow:global_step/sec: 101.225
INFO:tensorflow:step = 101, loss = 41.706 (0.990 sec)
INFO:tensorflow:global_step/sec: 100.732
INFO:tensorflow:step = 201, loss = 32.6319 (0.993 sec)
INFO:tensorflow:global_step/sec: 102.046
INFO:tensorflow:step = 301, loss = 38.0829 (0.979 sec)
INFO:tensorflow:global_step/sec: 100.48
INFO:tensorflow:step = 401, loss = 24.8358 (0.996 sec)
INFO:tensorflow:global_step/sec: 108.883
INFO:tensorflow:step = 501, loss = 35.2721 (0.918 sec)
INFO:tensorflow:global_step/sec: 101.624
INFO:tensorflow:step = 601, loss = 28.9604 (0.984 sec)
INFO:tensorflow:global_step/sec: 114.32
INFO:tensorflow:step = 701, loss = 45.6234 (0.876 sec)
INFO:tensorflow:global_step/sec: 100.427
INFO:tensorflow:step = 801, loss = 38.545 (0.996 sec)
INFO:tensorflow:global_step/sec:

INFO:tensorflow:global_step/sec: 116.861
INFO:tensorflow:step = 8401, loss = 28.1763 (0.856 sec)
INFO:tensorflow:global_step/sec: 115.958
INFO:tensorflow:step = 8501, loss = 34.6208 (0.862 sec)
INFO:tensorflow:global_step/sec: 118.155
INFO:tensorflow:step = 8601, loss = 32.3669 (0.847 sec)
INFO:tensorflow:global_step/sec: 120.172
INFO:tensorflow:step = 8701, loss = 38.0892 (0.833 sec)
INFO:tensorflow:global_step/sec: 119.434
INFO:tensorflow:step = 8801, loss = 32.1122 (0.837 sec)
INFO:tensorflow:global_step/sec: 120.379
INFO:tensorflow:step = 8901, loss = 27.0036 (0.831 sec)
INFO:tensorflow:global_step/sec: 117.011
INFO:tensorflow:step = 9001, loss = 29.6537 (0.854 sec)
INFO:tensorflow:global_step/sec: 115.026
INFO:tensorflow:step = 9101, loss = 36.0277 (0.870 sec)
INFO:tensorflow:global_step/sec: 115.875
INFO:tensorflow:step = 9201, loss = 51.164 (0.863 sec)
INFO:tensorflow:global_step/sec: 107.98
INFO:tensorflow:step = 9301, loss = 34.4751 (0.931 sec)
INFO:tensorflow:global_step/sec:

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x1206b44a8>

### Evaluation

**Create a prediction input function. Remember to pass only the X_test data and keep shuffle=False.**

In [94]:
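# Prediction needs only the features, so no y is passed.
# shuffle = False keeps the predictions in the same order as y_test.
pred_input_fn = tf.estimator.inputs.pandas_input_fn(
    X_test, 
    batch_size = len(X_test), 
    shuffle = False
)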

**Use model.predict() and pass in your input function. This will produce a generator of predictions, which you can then transform into a list with list().**

In [95]:
linearPredictions = linearModel.predict(pred_input_fn)
linearPredictions = list(linearPredictions)

INFO:tensorflow:Restoring parameters from /var/folders/7f/kl6zxp_50ddcw8xr42yvrpfm0000gn/T/tmpnsocjcdt/model.ckpt-5000


In [152]:
dnnPredictions = dnnModel.predict(pred_input_fn)
dnnPredictions = list(dnnPredictions)

INFO:tensorflow:Restoring parameters from /var/folders/7f/kl6zxp_50ddcw8xr42yvrpfm0000gn/T/tmpzv8hwxjm/model.ckpt-10000


**Each item in your list will look like this:**

In [96]:
linearPredictions[0]

{'class_ids': array([0]),
 'classes': array([b'0'], dtype=object),
 'logistic': array([ 0.13777165], dtype=float32),
 'logits': array([-1.83392262], dtype=float32),
 'probabilities': array([ 0.86222839,  0.13777164], dtype=float32)}

In [153]:
dnnPredictions[0]

{'class_ids': array([0]),
 'classes': array([b'0'], dtype=object),
 'logistic': array([ 0.2315159], dtype=float32),
 'logits': array([-1.1997714], dtype=float32),
 'probabilities': array([ 0.76848412,  0.23151588], dtype=float32)}
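The fields in each prediction dictionary are related in a simple way for a binary classifier: `logistic` is the sigmoid of `logits`, `probabilities` is `[1 - p, p]`, and `class_ids` is the index of the larger probability. The sketch below reproduces the linear model's first prediction from its logit; `sigmoid` is defined locally.

```python
import math

def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

logit = -1.83392262            # 'logits' from linearPredictions[0]
p = sigmoid(logit)             # should match 'logistic'
probabilities = [1 - p, p]     # should match 'probabilities'
class_id = probabilities.index(max(probabilities))

print(round(p, 4), class_id)   # 0.1378 0
```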

**Create a list of only the class_ids key values from the prediction list of dictionaries. These are the predictions you will use to compare against the real y_test values.**

In [130]:
linearLabels = [ p['class_ids'][0] for p in linearPredictions ]
linearLabels[:10]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [154]:
dnnLabels = [ p['class_ids'][0] for p in dnnPredictions ]
dnnLabels[:10]

[0, 0, 1, 0, 1, 0, 0, 0, 0, 0]

**Import classification_report from sklearn.metrics and then see if you can figure out how to use it to easily get a full report of your model's performance on the test data.**

In [None]:
from sklearn.metrics import classification_report

In [102]:
print(classification_report(y_test, linearLabels))

             precision    recall  f1-score   support

          0       0.87      0.94      0.90      8161
          1       0.75      0.55      0.64      2585

avg / total       0.84      0.85      0.84     10746



In [155]:
print(classification_report(y_test, dnnLabels))

             precision    recall  f1-score   support

          0       0.89      0.93      0.91      8161
          1       0.73      0.62      0.67      2585

avg / total       0.85      0.85      0.85     10746
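For reference, the per-class columns in the report are derived from counts of true positives, false positives, and false negatives. A small sketch on made-up labels (not the census predictions) shows the arithmetic; `prf` is a hypothetical helper.

```python
def prf(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = prf(y_true, y_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.5 0.57
```

For context when reading the reports, a model that always predicted class 0 would already reach 8161 / 10746 ≈ 0.76 accuracy on this test set, so the recall on class 1 is the more telling number.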



# Great Job!