# Classification Exercise Using TensorFlow

We'll be working with California census data, using various features of an individual to predict which income class they belong to (>50K or <=50K).

Here is some information about the data:

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>age</td>
<td>Continuous</td>
<td>The age of the individual</td>
</tr>
<tr>
<td>workclass</td>
<td>Categorical</td>
<td>The type of employer the  individual has (government,  military, private, etc.).</td>
</tr>
<tr>
<td>fnlwgt</td>
<td>Continuous</td>
<td>The number of people the census  takers believe that observation  represents (sample weight). This  variable will not be used.</td>
</tr>
<tr>
<td>education</td>
<td>Categorical</td>
<td>The highest level of education  achieved for that individual.</td>
</tr>
<tr>
<td>education_num</td>
<td>Continuous</td>
<td>The highest level of education in  numerical form.</td>
</tr>
<tr>
<td>marital_status</td>
<td>Categorical</td>
<td>Marital status of the individual.</td>
</tr>
<tr>
<td>occupation</td>
<td>Categorical</td>
<td>The occupation of the individual.</td>
</tr>
<tr>
<td>relationship</td>
<td>Categorical</td>
<td>Wife, Own-child, Husband,  Not-in-family, Other-relative,  Unmarried.</td>
</tr>
<tr>
<td>race</td>
<td>Categorical</td>
<td>White, Asian-Pac-Islander,  Amer-Indian-Eskimo, Other, Black.</td>
</tr>
<tr>
<td>gender</td>
<td>Categorical</td>
<td>Female, Male.</td>
</tr>
<tr>
<td>capital_gain</td>
<td>Continuous</td>
<td>Capital gains recorded.</td>
</tr>
<tr>
<td>capital_loss</td>
<td>Continuous</td>
<td>Capital Losses recorded.</td>
</tr>
<tr>
<td>hours_per_week</td>
<td>Continuous</td>
<td>Hours worked per week.</td>
</tr>
<tr>
<td>native_country</td>
<td>Categorical</td>
<td>Country of origin of the  individual.</td>
</tr>
<tr>
<td>income_bracket</td>
<td>Categorical</td>
<td>"&gt;50K" or "&lt;=50K", meaning  whether the person makes more  than \$50,000 annually.</td>
</tr>
</tbody>
</table>

### THE DATA

**Read in the census_data.csv data with pandas.**

In [107]:
import pandas as pd

In [108]:
data = pd.read_csv('census_data.csv')

In [109]:
data.tail()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
32556,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
32560,52,Self-emp-inc,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K


**Convert the income_bracket column to 0s and 1s instead of strings, because TensorFlow won't be able to understand strings as labels.**

In [110]:
data['income_bracket'].unique()

array([' <=50K', ' >50K'], dtype=object)

In [111]:
def convert(income):
    if income==' <=50K':
        return 0
    else:
        return 1

In [112]:
data['income_bracket'] = data['income_bracket'].apply(convert)

In [113]:
income = data['income_bracket']

In [114]:
data.tail()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
32556,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32557,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32558,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32559,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0
32560,52,Self-emp-inc,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,1
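
The same conversion can also be written without a Python-level function. The sketch below runs on a small hand-made frame rather than the real CSV (the `sample` frame is hypothetical) and strips the leading space seen in the `unique()` output before comparing, so it should keep working if the file were re-exported without that padding.

```python
import pandas as pd

# Hypothetical three-row sample that mimics the padded labels in census_data.csv.
sample = pd.DataFrame({'income_bracket': [' <=50K', ' >50K', ' <=50K']})

# Strip whitespace, compare against the positive class, and cast the booleans to 0/1.
labels = (sample['income_bracket'].str.strip() == '>50K').astype(int)

print(labels.tolist())
```

Comparing against the positive class makes the mapping explicit: anything that is not `>50K` becomes 0, which matches the behavior of `convert` above.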


### Perform a Train Test Split on the Data

In [115]:
x = data.drop('income_bracket',axis=1)
y = data['income_bracket']

In [116]:
from sklearn.model_selection import train_test_split

In [117]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state=101)
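
To see what `test_size` and `random_state` do, here is the same call on a toy dataset. The ten-row data is made up for illustration, and `stratify` is an optional extra that the exercise itself does not use; it keeps the class balance similar in both splits, which may help when one class is much rarer than the other.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten rows, three of them in the positive class.
features = [[i] for i in range(10)]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# Same test_size and random_state as above; stratify is the addition being illustrated.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=101, stratify=labels)

print(len(f_train), len(f_test))   # 7 training rows, 3 test rows
print(sum(l_train), sum(l_test))   # positives in each split
```

Fixing `random_state` makes the split reproducible, so repeated runs evaluate on the same held-out rows.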

### Create the Feature Columns for tf.estimator


In [118]:
data.columns

Index(['age', 'workclass', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'gender', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country', 'income_bracket'],
      dtype='object')

**Import TensorFlow**

In [119]:
import tensorflow as tf

In [120]:
data.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [121]:
len(data['native_country'].unique())

42

**Create the tf.feature_column objects for the categorical values.**

In [122]:
workclass = tf.feature_column.categorical_column_with_hash_bucket('workclass',hash_bucket_size=10)
education = tf.feature_column.categorical_column_with_hash_bucket('education',hash_bucket_size=17)
marital_status = tf.feature_column.categorical_column_with_hash_bucket('marital_status',hash_bucket_size=8)
occupation = tf.feature_column.categorical_column_with_hash_bucket('occupation',hash_bucket_size=17)
relation = tf.feature_column.categorical_column_with_hash_bucket('relationship',hash_bucket_size=7)
race = tf.feature_column.categorical_column_with_hash_bucket('race',hash_bucket_size=6)
gender = tf.feature_column.categorical_column_with_hash_bucket('gender',hash_bucket_size=2)
native_country = tf.feature_column.categorical_column_with_hash_bucket('native_country',hash_bucket_size=45)
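
A hash bucket column does not reserve one bucket per category: each string is hashed and reduced modulo `hash_bucket_size`, so two categories can land in the same bucket. When the bucket count is close to the number of distinct values (45 buckets for 42 countries, or 2 buckets for 2 genders), collisions are likely. The sketch below illustrates the idea in plain Python. The `toy_bucket` helper is hypothetical and uses MD5 only for illustration; TensorFlow uses its own fingerprint function, so the actual bucket ids will differ.

```python
import hashlib

def toy_bucket(value, hash_bucket_size):
    """Map a string to a bucket id. Illustration only, not TensorFlow's hash."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

# The six relationship values listed in the table, hashed into 7 buckets as above.
relationships = ['Wife', 'Own-child', 'Husband',
                 'Not-in-family', 'Other-relative', 'Unmarried']
bucket_ids = {name: toy_bucket(name, 7) for name in relationships}
print(bucket_ids)

# Fewer distinct bucket ids than values means some categories share a bucket.
collisions = len(relationships) - len(set(bucket_ids.values()))
print('categories lost to collisions:', collisions)
```

When the full list of categories is known in advance, `tf.feature_column.categorical_column_with_vocabulary_list` avoids collisions entirely; otherwise a bucket size well above the number of distinct values reduces them.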


**Create the numeric feature columns for the continuous values.**

In [123]:
age = tf.feature_column.numeric_column('age')
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

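**Put all these variables into a single list.** This list is rebuilt below once the categorical columns have been wrapped in embeddings, because `DNNClassifier` accepts only dense feature columns.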

In [145]:
feat_cols = [workclass,education,marital_status,occupation,relation,race,gender,native_country,age,education_num,capital_gain,capital_loss,hours_per_week]

**Create embeddings for categorical columns**

In [125]:
workclass = tf.feature_column.embedding_column(workclass,dimension = 9)
education = tf.feature_column.embedding_column(education,dimension = 16)
marital_status = tf.feature_column.embedding_column(marital_status,dimension = 7)
occupation = tf.feature_column.embedding_column(occupation,dimension = 15)
relation = tf.feature_column.embedding_column(relation,dimension = 6)
race = tf.feature_column.embedding_column(race,dimension = 5)
gender = tf.feature_column.embedding_column(gender,dimension = 2)
native_country = tf.feature_column.embedding_column(native_country,dimension = 42)
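
An embedding column replaces each bucket id with a dense vector that is learned during training. The NumPy sketch below shows only the lookup step, under the assumption that it behaves like indexing into a weight table; the random table stands in for trained weights, and the bucket ids are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mirrors workclass above: 10 hash buckets, 9-dimensional embedding.
num_buckets, dimension = 10, 9
embedding_table = rng.normal(size=(num_buckets, dimension))  # trainable in the real model

# Hypothetical bucket ids for three rows of data.
bucket_ids = np.array([3, 7, 3])

# The lookup selects one row of the table per example.
dense_inputs = embedding_table[bucket_ids]

print(dense_inputs.shape)
```

Rows with the same bucket id receive the same vector, which is also why hash collisions force two categories to share a representation.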

**Create an input function and train your model on the data.**

In [130]:
feat_cols = [workclass,education,marital_status,occupation,relation,race,gender,native_country,age,education_num,capital_gain,capital_loss,hours_per_week]

In [131]:
input_func = tf.estimator.inputs.pandas_input_fn(x= x_train,y=y_train,batch_size=10,num_epochs=1000,shuffle=True)

In [132]:
dnn_model = tf.estimator.DNNClassifier(hidden_units=[20,20,20],feature_columns=feat_cols,n_classes=2)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_session_config': None, '_model_dir': 'C:\\Users\\apurv\\AppData\\Local\\Temp\\tmpwvbuv5rg', '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_tf_random_seed': 1, '_keep_checkpoint_every_n_hours': 10000, '_keep_checkpoint_max': 5, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100}


In [133]:
dnn_model.train(input_fn=input_func,steps=5000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\apurv\AppData\Local\Temp\tmpwvbuv5rg\model.ckpt.
INFO:tensorflow:loss = 23.0586, step = 1
INFO:tensorflow:global_step/sec: 378.792
INFO:tensorflow:loss = 5.12629, step = 101 (0.264 sec)
INFO:tensorflow:global_step/sec: 458.168
INFO:tensorflow:loss = 1.64382, step = 201 (0.218 sec)
INFO:tensorflow:global_step/sec: 408.226
INFO:tensorflow:loss = 3.69606, step = 301 (0.245 sec)
INFO:tensorflow:global_step/sec: 418.902
INFO:tensorflow:loss = 6.87487, step = 401 (0.254 sec)
INFO:tensorflow:global_step/sec: 350.424
INFO:tensorflow:loss = 2.4005, step = 501 (0.270 sec)
INFO:tensorflow:global_step/sec: 463.085
INFO:tensorflow:loss = 63.9322, step = 601 (0.232 sec)
INFO:tensorflow:global_step/sec: 437.174
INFO:tensorflow:loss = 5.07426, step = 701 (0.217 sec)
INFO:tensorflow:global_step/sec: 379.381
INFO:tensorflow:loss = 6.13865, step = 801 (0.260 sec)
INFO:tensorflow:global_step/sec: 426.652
...

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x1b56b3f6cf8>
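
How `batch_size`, `num_epochs`, and `steps` interact is easy to miss. The following is a rough back-of-the-envelope check, assuming the full file has 32,561 rows (the last index shown by `data.tail()` is 32560) and that the split sizes follow from `test_size=0.3`.

```python
import math

n_rows = 32561                       # assumed from the last index (32560) plus one
n_test = math.ceil(0.3 * n_rows)     # scikit-learn rounds the test share up
n_train = n_rows - n_test

batch_size = 10
steps = 5000

examples_seen = batch_size * steps          # rows consumed before training stops
epochs_covered = examples_seen / n_train    # passes over the training set

print(n_train, n_test)
print(examples_seen, round(epochs_covered, 2))
```

Under these assumptions, `steps=5000` ends training after roughly two passes over the training set, so `num_epochs=1000` is never the limiting factor here.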

### Evaluation

**Create a prediction input function.**

In [134]:
pred_input_func = tf.estimator.inputs.pandas_input_fn(x = x_test,batch_size=10,num_epochs=1,shuffle=False)

In [137]:
predictions = dnn_model.predict(pred_input_func)

In [138]:
my_pred = list(predictions)

INFO:tensorflow:Restoring parameters from C:\Users\apurv\AppData\Local\Temp\tmpwvbuv5rg\model.ckpt-5000


**Each item in your list will look like this:**

In [139]:
my_pred[0]

{'class_ids': array([0], dtype=int64),
 'classes': array([b'0'], dtype=object),
 'logistic': array([ 0.38203651], dtype=float32),
 'logits': array([-0.48091328], dtype=float32),
 'probabilities': array([ 0.61796349,  0.38203648], dtype=float32)}
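
The fields in this dictionary are related to one another. As a quick check, the sketch below recomputes them from the `logits` value shown above, assuming the binary classifier applies a sigmoid to the logit and predicts class 1 when the resulting probability exceeds 0.5.

```python
import math

logit = -0.48091328  # the 'logits' value from the prediction above

# Sigmoid of the logit gives the probability of class 1 ('logistic').
p_positive = 1.0 / (1.0 + math.exp(-logit))

# The two-element 'probabilities' array is [P(class 0), P(class 1)].
probabilities = [1.0 - p_positive, p_positive]

# The predicted label ('class_ids') is the more probable class.
class_id = int(p_positive > 0.5)

print(probabilities, class_id)
```

A negative logit therefore always corresponds to a prediction of class 0, and the magnitude of the logit indicates how confident the model is.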

**Create a list of only the class_ids values from the prediction dictionaries. These are the predictions you will compare against the real y_test values.**

In [140]:
final_preds = [pred['class_ids'][0] for pred in my_pred]

In [141]:
final_preds[:10]

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

**Full report of the model's performance on the test data.**

In [142]:
from sklearn.metrics import classification_report

In [143]:
print(classification_report(y_test,final_preds))

             precision    recall  f1-score   support

          0       0.89      0.92      0.91      7436
          1       0.72      0.63      0.67      2333

avg / total       0.85      0.85      0.85      9769
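
The `avg / total` row is a support-weighted average of the per-class rows. The sketch below recomputes it from the rounded figures in the report (so small rounding differences are possible) and adds a majority-class baseline for comparison. The baseline is not part of the original output.

```python
# Per-class figures copied from the report above: (precision, recall, f1, support).
report = {0: (0.89, 0.92, 0.91, 7436), 1: (0.72, 0.63, 0.67, 2333)}

total = sum(row[3] for row in report.values())

def weighted(metric_index):
    """Support-weighted average of one metric column."""
    return sum(row[metric_index] * row[3] for row in report.values()) / total

avg_precision, avg_recall, avg_f1 = weighted(0), weighted(1), weighted(2)

# Accuracy obtained by always predicting the larger class (class 0).
majority_baseline = report[0][3] / total

print(round(avg_precision, 2), round(avg_recall, 2), round(avg_f1, 2))
print(round(majority_baseline, 3))
```

Because support-weighted recall equals overall accuracy, the model's roughly 0.85 accuracy can be compared directly with the roughly 0.76 obtained by always predicting class 0. The lower recall on class 1 (0.63) shows where most of the remaining errors are.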

