# Classification Exercise

We'll be working with some California census data, using various features of an individual to predict which income class they belong to (>50K or <=50K).

Here is some information about the data:

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>age</td>
<td>Continuous</td>
<td>The age of the individual</td>
</tr>
<tr>
<td>workclass</td>
<td>Categorical</td>
<td>The type of employer the  individual has (government,  military, private, etc.).</td>
</tr>
<tr>
<td>fnlwgt</td>
<td>Continuous</td>
<td>The number of people the census  takers believe that observation  represents (sample weight). This  variable will not be used.</td>
</tr>
<tr>
<td>education</td>
<td>Categorical</td>
<td>The highest level of education  achieved for that individual.</td>
</tr>
<tr>
<td>education_num</td>
<td>Continuous</td>
<td>The highest level of education in  numerical form.</td>
</tr>
<tr>
<td>marital_status</td>
<td>Categorical</td>
<td>Marital status of the individual.</td>
</tr>
<tr>
<td>occupation</td>
<td>Categorical</td>
<td>The occupation of the individual.</td>
</tr>
<tr>
<td>relationship</td>
<td>Categorical</td>
<td>Wife, Own-child, Husband,  Not-in-family, Other-relative,  Unmarried.</td>
</tr>
<tr>
<td>race</td>
<td>Categorical</td>
<td>White, Asian-Pac-Islander,  Amer-Indian-Eskimo, Other, Black.</td>
</tr>
<tr>
<td>gender</td>
<td>Categorical</td>
<td>Female, Male.</td>
</tr>
<tr>
<td>capital_gain</td>
<td>Continuous</td>
<td>Capital gains recorded.</td>
</tr>
<tr>
<td>capital_loss</td>
<td>Continuous</td>
<td>Capital Losses recorded.</td>
</tr>
<tr>
<td>hours_per_week</td>
<td>Continuous</td>
<td>Hours worked per week.</td>
</tr>
<tr>
<td>native_country</td>
<td>Categorical</td>
<td>Country of origin of the  individual.</td>
</tr>
<tr>
<td>income_bracket</td>
<td>Categorical</td>
<td>"&gt;50K" or "&lt;=50K", meaning  whether the person makes more  than \$50,000 annually.</td>
</tr>
</tbody>
</table>

## Follow the Directions in Bold. If you get stuck, check out the solutions lecture.

### THE DATA

**Read in the census_data.csv data with pandas**

In [1]:
import pandas as pd

In [2]:
census = pd.read_csv('census_data.csv')

In [3]:
census.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**TensorFlow won't be able to understand strings as labels. You'll need to use the pandas .apply() method to apply a custom function that converts them to 0s and 1s. This might be hard if you aren't very familiar with pandas, so feel free to take a peek at the solutions for this part.**

**Convert the label column to 0s and 1s instead of strings.**

In [4]:
census['income_bracket'].unique()

array([' <=50K', ' >50K'], dtype=object)

In [5]:
# Convert an income label to a binary value: 0 for ' <=50K', 1 otherwise (' >50K').
# Note the leading space in the raw labels.
def bin_label(label):
    if label == ' <=50K':
        return 0
    else:
        return 1

In [6]:
census['income_bracket'] = census['income_bracket'].apply(bin_label)
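
As an alternative to the custom function above, the same conversion could be sketched with vectorized string operations. This is a minimal sketch on a small made-up Series (`sample` is a hypothetical stand-in for `census['income_bracket']`), and it assumes the only two labels are the ones shown by `unique()` above.

```python
import pandas as pd

# Hypothetical stand-in for census['income_bracket']; note the leading spaces.
sample = pd.Series([' <=50K', ' >50K', ' <=50K'])

# Strip the whitespace, then map each label to its binary value.
label_map = {'<=50K': 0, '>50K': 1}
binary = sample.str.strip().map(label_map)

print(binary.tolist())
```

Unlike the `else` branch in `bin_label`, an unexpected label would become `NaN` here instead of silently turning into 1.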

### Perform a Train Test Split on the Data

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
labels = census['income_bracket']

In [9]:
x_data = census.drop(['income_bracket'],axis=1)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(x_data, labels, test_size=0.3, random_state=101)

### Create the Feature Columns for tf.estimator

**Take note of categorical vs continuous values!**

In [11]:
print(x_data.columns)

Index(['age', 'workclass', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'gender', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country'],
      dtype='object')
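
One way to separate continuous from categorical columns is to look at the dtypes. Here is a minimal sketch, assuming a small hypothetical frame `df` that mimics a few of the columns above:

```python
import pandas as pd

# Hypothetical miniature of x_data with two numeric and two string columns.
df = pd.DataFrame({
    'age': [39, 50],
    'workclass': [' State-gov', ' Private'],
    'hours_per_week': [40, 13],
    'gender': [' Male', ' Female'],
})

# Numeric dtypes are treated as continuous, everything else as categorical.
cont_cols = df.select_dtypes(include='number').columns.tolist()
cat_cols = df.select_dtypes(exclude='number').columns.tolist()

print(cont_cols)
print(cat_cols)
```

A dtype check is only a starting point: a numeric column such as education_num encodes an ordered category, so whether to treat it as continuous is still a judgment call.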


In [12]:
feat_cont = ['age', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']

In [13]:
import tensorflow as tf

In [14]:
featco = []
for val in feat_cont:
    featco.append(tf.feature_column.numeric_column(val))

In [15]:
feat_cat = ['workclass', 'education', 'marital_status','occupation','relationship', 'gender', 'native_country']

In [16]:
featca = []
for val in feat_cat:
    featca.append(tf.feature_column.categorical_column_with_vocabulary_list(val,x_data[val].unique()))

In [17]:
for it in featca:
    print(it)
    print("\n")

_VocabularyListCategoricalColumn(key='workclass', vocabulary_list=(' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov', ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay', ' Never-worked'), dtype=tf.string, default_value=-1, num_oov_buckets=0)


_VocabularyListCategoricalColumn(key='education', vocabulary_list=(' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th', ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' 7th-8th', ' Doctorate', ' Prof-school', ' 5th-6th', ' 10th', ' 1st-4th', ' Preschool', ' 12th'), dtype=tf.string, default_value=-1, num_oov_buckets=0)


_VocabularyListCategoricalColumn(key='marital_status', vocabulary_list=(' Never-married', ' Married-civ-spouse', ' Divorced', ' Married-spouse-absent', ' Separated', ' Married-AF-spouse', ' Widowed'), dtype=tf.string, default_value=-1, num_oov_buckets=0)


_VocabularyListCategoricalColumn(key='occupation', vocabulary_list=(' Adm-clerical', ' Exec-managerial', ' Handlers-cleaners', ' Prof-specialty', ' Other-service',

**Import TensorFlow**

**Create the tf.feature_columns for the categorical values. Use vocabulary lists or just use hash buckets.**

**Create the continuous feature_columns for the continuous values using numeric_column**

**Put all these variables into a single list with the variable name feat_cols**

In [18]:
feat_cols = featco + featca
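
The difference between a vocabulary list and hash buckets can be sketched without TensorFlow. The snippet below is only an illustration of the idea: `to_vocab_id` and `to_bucket_id` are hypothetical helpers, and the MD5-based hash is an assumption that does not match the hash function TensorFlow uses internally.

```python
import hashlib

def to_vocab_id(value, vocabulary):
    """Return the position of value in the vocabulary, or -1 if it is unknown."""
    return vocabulary.index(value) if value in vocabulary else -1

def to_bucket_id(value, num_buckets):
    """Map any string to one of num_buckets ids using a stable hash."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

vocab = [' Female', ' Male']
print(to_vocab_id(' Male', vocab))      # known value: its index in the list
print(to_vocab_id(' Unknown', vocab))   # unseen value: -1
print(to_bucket_id(' Unknown', 10))     # always lands in some bucket 0..9
```

A vocabulary list needs every value up front, while hashing handles unseen values at the cost of possible collisions between categories.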

### Create Input Function

**The batch_size is up to you, but do make sure to shuffle!**

In [20]:
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train,y=y_train,batch_size=100,num_epochs=None,shuffle=True)
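
To see what shuffling and batching mean for an input function, here is a minimal pure-Python sketch. `batch_indices` is a hypothetical helper, not part of TensorFlow, and it runs a single epoch, whereas `num_epochs=None` above lets the estimator keep cycling through the data until the requested number of training steps is reached.

```python
import random

def batch_indices(num_rows, batch_size, shuffle=True, seed=None):
    """Yield lists of row indices, one list per mini-batch, for one epoch."""
    indices = list(range(num_rows))
    if shuffle:
        # Shuffle once per epoch so each batch is a random sample of rows.
        random.Random(seed).shuffle(indices)
    for start in range(0, num_rows, batch_size):
        yield indices[start:start + batch_size]

batches = list(batch_indices(num_rows=10, batch_size=4, seed=101))
print([len(b) for b in batches])
```

Every row appears exactly once per epoch, and the last batch is smaller when the row count is not a multiple of the batch size.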

### Create Your Model with tf.estimator

**Create a LinearClassifier. (If you want to use a DNNClassifier, keep in mind you'll need to create embedded columns out of the categorical features that use strings. Check out the previous lecture on this for more info.)**

In [21]:
lin_model = tf.estimator.LinearClassifier(feature_columns=feat_cols)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\FRKSTE~1\\AppData\\Local\\Temp\\tmppfx84drb', '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_tf_random_seed': 1, '_save_summary_steps': 100, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_log_step_count_steps': 100, '_keep_checkpoint_max': 5}


**Train your model on the data for at least 5000 steps.**

In [22]:
lin_model.train(input_fn=input_func, steps=20000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\FRKSTE~1\AppData\Local\Temp\tmppfx84drb\model.ckpt.
INFO:tensorflow:loss = 69.3147, step = 1
INFO:tensorflow:global_step/sec: 209.525
INFO:tensorflow:loss = 763.57, step = 101 (0.481 sec)
INFO:tensorflow:global_step/sec: 248.713
INFO:tensorflow:loss = 183.881, step = 201 (0.402 sec)
INFO:tensorflow:global_step/sec: 243.847
INFO:tensorflow:loss = 191.058, step = 301 (0.410 sec)
INFO:tensorflow:global_step/sec: 240.324
INFO:tensorflow:loss = 136.186, step = 401 (0.416 sec)
INFO:tensorflow:global_step/sec: 243.849
INFO:tensorflow:loss = 241.906, step = 501 (0.410 sec)
INFO:tensorflow:global_step/sec: 234.668
INFO:tensorflow:loss = 56.3952, step = 601 (0.426 sec)
INFO:tensorflow:global_step/sec: 244.445
INFO:tensorflow:loss = 130.582, step = 701 (0.410 sec)
INFO:tensorflow:global_step/sec: 236.337
INFO:tensorflow:loss = 61.3952, step = 801 (0.422 sec)
INFO:tensorflow:global_step/sec: 247.479


INFO:tensorflow:global_step/sec: 243.254
INFO:tensorflow:loss = 27.355, step = 8401 (0.411 sec)
INFO:tensorflow:global_step/sec: 243.849
INFO:tensorflow:loss = 109.076, step = 8501 (0.410 sec)
INFO:tensorflow:global_step/sec: 248.091
INFO:tensorflow:loss = 124.492, step = 8601 (0.404 sec)
INFO:tensorflow:global_step/sec: 249.338
INFO:tensorflow:loss = 70.8065, step = 8701 (0.401 sec)
INFO:tensorflow:global_step/sec: 248.093
INFO:tensorflow:loss = 24.326, step = 8801 (0.402 sec)
INFO:tensorflow:global_step/sec: 250.589
INFO:tensorflow:loss = 40.8559, step = 8901 (0.400 sec)
INFO:tensorflow:global_step/sec: 239.745
INFO:tensorflow:loss = 53.5762, step = 9001 (0.416 sec)
INFO:tensorflow:global_step/sec: 237.455
INFO:tensorflow:loss = 76.0751, step = 9101 (0.421 sec)
INFO:tensorflow:global_step/sec: 248.102
INFO:tensorflow:loss = 42.3494, step = 9201 (0.403 sec)
INFO:tensorflow:global_step/sec: 248.713
INFO:tensorflow:loss = 44.8572, step = 9301 (0.402 sec)
INFO:tensorflow:global_step/sec:

INFO:tensorflow:global_step/sec: 242.662
INFO:tensorflow:loss = 44.5281, step = 16801 (0.412 sec)
INFO:tensorflow:global_step/sec: 246.257
INFO:tensorflow:loss = 34.9702, step = 16901 (0.407 sec)
INFO:tensorflow:global_step/sec: 247.479
INFO:tensorflow:loss = 25.3596, step = 17001 (0.403 sec)
INFO:tensorflow:global_step/sec: 250.589
INFO:tensorflow:loss = 74.3587, step = 17101 (0.400 sec)
INFO:tensorflow:global_step/sec: 248.713
INFO:tensorflow:loss = 33.7972, step = 17201 (0.401 sec)
INFO:tensorflow:global_step/sec: 251.854
INFO:tensorflow:loss = 35.3692, step = 17301 (0.397 sec)
INFO:tensorflow:global_step/sec: 251.853
INFO:tensorflow:loss = 47.4463, step = 17401 (0.398 sec)
INFO:tensorflow:global_step/sec: 245.651
INFO:tensorflow:loss = 52.47, step = 17501 (0.407 sec)
INFO:tensorflow:global_step/sec: 249.334
INFO:tensorflow:loss = 37.8384, step = 17601 (0.400 sec)
INFO:tensorflow:global_step/sec: 249.961
INFO:tensorflow:loss = 41.4469, step = 17701 (0.400 sec)
INFO:tensorflow:global

<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x244e88e7358>

### Evaluation

**Create a prediction input function. Remember to supply only the X_test data and keep shuffle=False.**

In [33]:
pred_input_func = tf.estimator.inputs.pandas_input_fn(x=X_test,batch_size=len(X_test),shuffle=False)

**Use model.predict() and pass in your input function. This will produce a generator of predictions, which you can then transform into a list with list().**

In [34]:
predictions = list(lin_model.predict(pred_input_func))

INFO:tensorflow:Restoring parameters from C:\Users\FRKSTE~1\AppData\Local\Temp\tmppfx84drb\model.ckpt-20000


**Each item in your list is a dictionary. The predicted class of the first item looks like this:**

In [25]:
print(predictions[0]['class_ids'][0])

0


**Create a list of only the class_ids key values from the prediction list of dictionaries. These are the predictions you will use to compare against the real y_test values.**

In [35]:
id_list = []
for pred in predictions:
    id_list.append(pred['class_ids'][0])
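
The same extraction can be written as a list comprehension. Here is a minimal sketch, using a hypothetical `mock_predictions` list in place of the real estimator output (real entries hold arrays and more keys than shown here):

```python
# Hypothetical stand-in for list(lin_model.predict(...)).
mock_predictions = [
    {'class_ids': [0]},
    {'class_ids': [1]},
    {'class_ids': [0]},
]

# Take the first (and only) class id from each prediction dictionary.
mock_ids = [pred['class_ids'][0] for pred in mock_predictions]
print(mock_ids)
```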

In [27]:
#id_list

**Import classification_report from sklearn.metrics, then see if you can figure out how to use it to easily get a full report of your model's performance on the test data.**

In [28]:
from sklearn.metrics import classification_report

In [37]:
print(classification_report(y_test, id_list))

             precision    recall  f1-score   support

          0       0.84      0.93      0.88      7436
          1       0.66      0.43      0.52      2333

avg / total       0.80      0.81      0.80      9769
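
To make the columns of the report concrete, here is a minimal sketch that computes precision, recall, and F1 for one class by hand. `class_metrics` is a hypothetical helper written for illustration, and the labels are made up; it is not how scikit-learn implements the report.

```python
def class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1) treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels for eight examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
print(class_metrics(y_true, y_pred, positive=1))
```

Applied to the report above, class 1 has an F1 of 2 × 0.66 × 0.43 / (0.66 + 0.43) ≈ 0.52, which matches the f1-score column. The recall of 0.43 means the model finds fewer than half of the >50K earners.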



# Great Job!