# TensorFlow Estimator Linear Classifier

We'll be working with some California census data, using various features of an individual to predict which income class they belong to (>50K or <=50K).

Here is some information about the data:

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>age</td>
<td>Continuous</td>
<td>The age of the individual</td>
</tr>
<tr>
<td>workclass</td>
<td>Categorical</td>
<td>The type of employer the  individual has (government,  military, private, etc.).</td>
</tr>
<tr>
<td>fnlwgt</td>
<td>Continuous</td>
<td>The number of people the census  takers believe that observation  represents (sample weight). This  variable will not be used.</td>
</tr>
<tr>
<td>education</td>
<td>Categorical</td>
<td>The highest level of education  achieved for that individual.</td>
</tr>
<tr>
<td>education_num</td>
<td>Continuous</td>
<td>The highest level of education in  numerical form.</td>
</tr>
<tr>
<td>marital_status</td>
<td>Categorical</td>
<td>Marital status of the individual.</td>
</tr>
<tr>
<td>occupation</td>
<td>Categorical</td>
<td>The occupation of the individual.</td>
</tr>
<tr>
<td>relationship</td>
<td>Categorical</td>
<td>Wife, Own-child, Husband,  Not-in-family, Other-relative,  Unmarried.</td>
</tr>
<tr>
<td>race</td>
<td>Categorical</td>
<td>White, Asian-Pac-Islander,  Amer-Indian-Eskimo, Other, Black.</td>
</tr>
<tr>
<td>gender</td>
<td>Categorical</td>
<td>Female, Male.</td>
</tr>
<tr>
<td>capital_gain</td>
<td>Continuous</td>
<td>Capital gains recorded.</td>
</tr>
<tr>
<td>capital_loss</td>
<td>Continuous</td>
<td>Capital losses recorded.</td>
</tr>
<tr>
<td>hours_per_week</td>
<td>Continuous</td>
<td>Hours worked per week.</td>
</tr>
<tr>
<td>native_country</td>
<td>Categorical</td>
<td>Country of origin of the  individual.</td>
</tr>
<tr>
<td>income</td>
<td>Categorical</td>
<td>"&gt;50K" or "&lt;=50K", meaning  whether the person makes more  than \$50,000 annually.</td>
</tr>
</tbody>
</table>

## Follow the Directions in Bold. If you get stuck, check out the solutions lecture.

### THE DATA

**Read in the census_data.csv data with pandas.**

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

In [2]:
census = pd.read_csv('census_data.csv')

In [3]:
census.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**TensorFlow won't be able to understand strings as labels, so you'll need to use the pandas .apply() method to apply a custom function that converts them to 0s and 1s. This might be hard if you aren't very familiar with pandas, so feel free to take a peek at the solutions for this part.**

**Convert the label column to 0s and 1s instead of strings.**

In [4]:
census['income_bracket'].unique()

array([' <=50K', ' >50K'], dtype=object)

In [5]:
census['income_bracket'] = census['income_bracket'].apply(lambda x: 0 if x == ' <=50K' else 1)
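
The lambda above maps every value other than the exact string `' <=50K'` (note the leading space shown by `unique()`) to 1, so a missing or differently spaced label would silently become the positive class. A stricter variant might look like the sketch below. The helper name `income_to_label` is hypothetical and not part of the original notebook.

```python
def income_to_label(raw):
    """Map an income bracket string to 0 (<=50K) or 1 (>50K).

    Hypothetical helper: strips surrounding whitespace and raises on
    anything unexpected instead of defaulting to class 1.
    """
    cleaned = str(raw).strip()
    mapping = {'<=50K': 0, '>50K': 1}
    if cleaned not in mapping:
        raise ValueError('Unexpected income bracket: {!r}'.format(raw))
    return mapping[cleaned]


print(income_to_label(' <=50K'))
print(income_to_label(' >50K'))
```

It could be passed to `.apply()` in place of the lambda, for example `census['income_bracket'].apply(income_to_label)`.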

In [6]:
census.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [7]:
# Features
X = census.drop('income_bracket', axis=1)

In [8]:
# Labels
y = census['income_bracket']

### Perform a Train Test Split on the Data

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
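
Because no `random_state` is passed to `train_test_split`, each run produces a different split, and the numbers further down will vary slightly from run to run. As a rough illustration of what a shuffled split does, here is a standard-library sketch. The function `simple_train_test_split` is hypothetical, and rounding the test size up is an assumption of this sketch rather than a statement about scikit-learn's internals.

```python
import math
import random


def simple_train_test_split(features, labels, test_size=0.2, seed=None):
    """Shuffle row indices, then slice off a test portion (illustrative only)."""
    if len(features) != len(labels):
        raise ValueError('features and labels must have the same length')
    indices = list(range(len(features)))
    # A fixed seed makes the shuffle, and therefore the split, repeatable.
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(indices))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


rows = ['row{}'.format(i) for i in range(10)]
targets = [i % 2 for i in range(10)]
demo_X_train, demo_X_test, demo_y_train, demo_y_test = simple_train_test_split(
    rows, targets, test_size=0.2, seed=101)
print(len(demo_X_train), len(demo_X_test))
```

In the notebook itself, passing a fixed `random_state` to `train_test_split` would serve the same purpose as the `seed` argument here.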

In [11]:
print(X_train)

       age          workclass      education  education_num  \
1257    57            Private        HS-grad              9   
3278    59        Federal-gov        HS-grad              9   
20030   42            Private        HS-grad              9   
11894   44            Private   Some-college             10   
28133   26          State-gov   Some-college             10   
2241    54            Private        7th-8th              4   
5831    46            Private      Bachelors             13   
1580    26            Private        HS-grad              9   
18944   44        Federal-gov   Some-college             10   
301     21            Private   Some-college             10   
21163   42            Private     Assoc-acdm             12   
1019    19                  ?        HS-grad              9   
3731    37        Federal-gov   Some-college             10   
8959    27   Self-emp-not-inc        HS-grad              9   
21159   59            Private   Some-college           

### Create the Feature Columns for tf.estimator

**Take note of categorical vs continuous values!**

In [12]:
X.columns

Index(['age', 'workclass', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'gender', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country'],
      dtype='object')

In [13]:
# Make Features
age = tf.feature_column.numeric_column('age')
edu_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

**Import TensorFlow**

In [14]:
print(tf.__version__)

1.3.0


**Create the tf.feature_columns for the categorical values. Use vocabulary lists or just use hash buckets.**

In [15]:
# To use a DNNClassifier instead, convert these categorical columns to embedded categorical columns

workclass = tf.feature_column.categorical_column_with_hash_bucket('workclass', hash_bucket_size=1000)
education = tf.feature_column.categorical_column_with_hash_bucket('education', hash_bucket_size=1000)
marital_status = tf.feature_column.categorical_column_with_hash_bucket('marital_status', hash_bucket_size=1000)
occupation = tf.feature_column.categorical_column_with_hash_bucket('occupation', hash_bucket_size=1000)
relationship = tf.feature_column.categorical_column_with_hash_bucket('relationship', hash_bucket_size=1000)
race = tf.feature_column.categorical_column_with_hash_bucket('race', hash_bucket_size=1000)
gender = tf.feature_column.categorical_column_with_hash_bucket('gender', hash_bucket_size=2)
native_country = tf.feature_column.categorical_column_with_hash_bucket('native_country', hash_bucket_size=1000)
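
A hashed categorical column assigns each string to a bucket by hashing it and taking the result modulo `hash_bucket_size`, so two distinct categories can end up sharing a bucket. With `hash_bucket_size=2` for `gender`, the two values have little room to avoid each other. The sketch below illustrates the idea only: it uses MD5 as a stand-in hash, whereas TensorFlow uses its own fingerprint function, so the actual bucket assignments will differ. The helper names are hypothetical.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Assign a string to a bucket (MD5 is a stand-in, not TensorFlow's hash)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size


def find_collisions(values, hash_bucket_size):
    """Return only the buckets that received more than one value."""
    buckets = {}
    for value in values:
        buckets.setdefault(hash_bucket(value, hash_bucket_size), []).append(value)
    return {b: vals for b, vals in buckets.items() if len(vals) > 1}


genders = ['Female', 'Male']
print(find_collisions(genders, 2))     # small bucket count: a collision is plausible
print(find_collisions(genders, 1000))  # large bucket count: a collision is unlikely
```

When the full set of categories is known in advance, a vocabulary list (as the instruction above suggests) avoids collisions altogether.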

**Create the continuous feature_columns for the continuous values using numeric_column.**
**Put all these variables into a single list with the variable name feat_cols.**

In [16]:
feat_cols = [age, workclass, education, edu_num, marital_status, occupation, relationship, race, gender, capital_gain, capital_loss, hours_per_week, native_country]

### Create Input Function

**The batch size is up to you, but do make sure to shuffle!**

In [17]:
input_func = tf.estimator.inputs.pandas_input_fn(X_train, y_train, batch_size=100, num_epochs=None, shuffle=True)
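
With `num_epochs=None` the input function keeps cycling through the training data, so it is the `steps` argument passed to `train()` that ends training. The generator below is a simplified, standard-library sketch of that batching behaviour. The name `batch_indices` is hypothetical, and the real input function may handle epoch boundaries and queuing differently.

```python
import itertools
import random


def batch_indices(num_rows, batch_size, num_epochs=None, shuffle=True, seed=None):
    """Yield lists of row indices, one list per batch (illustrative only)."""
    rng = random.Random(seed)
    # num_epochs=None means "repeat forever"; the caller decides when to stop.
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        order = list(range(num_rows))
        if shuffle:
            rng.shuffle(order)
        for start in range(0, num_rows, batch_size):
            yield order[start:start + batch_size]


# Take only five batches from an endless stream, much like steps=5 would.
first_batches = list(itertools.islice(batch_indices(10, 4, seed=0), 5))
for batch in first_batches:
    print(batch)
```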

### Create your model with tf.estimator

**Create a LinearClassifier. (If you want to use a DNNClassifier, keep in mind you'll need to create embedded columns out of the categorical features that use strings. Check out the previous lecture on this for more info.)**

In [18]:
model = tf.estimator.LinearClassifier(feature_columns=feat_cols)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_tf_random_seed': 1, '_log_step_count_steps': 100, '_save_checkpoints_steps': None, '_model_dir': '/tmp/tmpqlz_cfpr', '_session_config': None, '_save_summary_steps': 100, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_secs': 600, '_keep_checkpoint_max': 5}


**Train your model on the data for at least 5000 steps.**

In [19]:
model.train(input_fn=input_func, steps=5000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmpqlz_cfpr/model.ckpt.
INFO:tensorflow:step = 1, loss = 69.31474
INFO:tensorflow:global_step/sec: 297.349
INFO:tensorflow:step = 101, loss = 1332.1401 (0.340 sec)
INFO:tensorflow:global_step/sec: 224.029
INFO:tensorflow:step = 201, loss = 274.10214 (0.445 sec)
INFO:tensorflow:global_step/sec: 221.262
INFO:tensorflow:step = 301, loss = 195.12384 (0.452 sec)
INFO:tensorflow:global_step/sec: 220.176
INFO:tensorflow:step = 401, loss = 862.121 (0.454 sec)
INFO:tensorflow:global_step/sec: 214.457
INFO:tensorflow:step = 501, loss = 382.1477 (0.466 sec)
INFO:tensorflow:global_step/sec: 208.925
INFO:tensorflow:step = 601, loss = 35.083088 (0.479 sec)
INFO:tensorflow:global_step/sec: 226.003
INFO:tensorflow:step = 701, loss = 84.96335 (0.442 sec)
INFO:tensorflow:global_step/sec: 235.911
INFO:tensorflow:step = 801, loss = 108.55966 (0.426 sec)
INFO:tensorflow:global_step/sec: 271.649
INFO:tensorflow:st

<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x7f63899889e8>
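
The first logged loss, 69.31474, can be reproduced by hand under two assumptions that the log itself does not state: the weights start at zero (so every example is predicted with probability 0.5), and the reported loss is the log loss summed over the batch of 100 rather than averaged. A short check of that arithmetic:

```python
import math

batch_size = 100            # matches batch_size=100 in the input function
initial_probability = 0.5   # assumed: zero weights give sigmoid(0) = 0.5

# Log loss of a single example when the model predicts 0.5 for either class.
per_example_loss = -math.log(initial_probability)

# Assumed: the logged loss is the sum over the batch, not the mean.
summed_batch_loss = batch_size * per_example_loss
print(round(summed_batch_loss, 4))  # 69.3147
```

The large swings in later steps (for example 1332.1401 at step 101) may be related to unscaled numeric features such as `capital_gain`, though the log alone does not confirm this.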

### Evaluation

**Create a prediction input function. Remember to only supply X_test data and keep shuffle=False.**

In [20]:
pred_input = tf.estimator.inputs.pandas_input_fn(X_test, batch_size=len(X_test), shuffle=False)

**Use model.predict() and pass in your input function. This will produce a generator of predictions, which you can then transform into a list with list().**

In [21]:
pred = model.predict(input_fn=pred_input)

**Each item in your list will look like this:**

In [22]:
predictions = list(pred)

INFO:tensorflow:Restoring parameters from /tmp/tmpqlz_cfpr/model.ckpt-5000


**Create a list of only the class_ids key values from the prediction list of dictionaries. These are the predictions you will use to compare against the real y_test values.**

In [23]:
predictions

[{'class_ids': array([0]),
  'classes': array([b'0'], dtype=object),
  'logistic': array([0.1082208], dtype=float32),
  'logits': array([-2.109045], dtype=float32),
  'probabilities': array([0.89177924, 0.1082208 ], dtype=float32)},
 {'class_ids': array([0]),
  'classes': array([b'0'], dtype=object),
  'logistic': array([7.635702e-06], dtype=float32),
  'logits': array([-11.782668], dtype=float32),
  'probabilities': array([9.999924e-01, 7.635702e-06], dtype=float32)},
 {'class_ids': array([0]),
  'classes': array([b'0'], dtype=object),
  'logistic': array([0.00010685], dtype=float32),
  'logits': array([-9.144009], dtype=float32),
  'probabilities': array([9.99893188e-01, 1.06846695e-04], dtype=float32)},
 {'class_ids': array([0]),
  'classes': array([b'0'], dtype=object),
  'logistic': array([0.00314912], dtype=float32),
  'logits': array([-5.7574797], dtype=float32),
  'probabilities': array([0.9968509 , 0.00314912], dtype=float32)},
 {'class_ids': array([0]),
  'classes': array([b'

In [24]:
final_preds = []

for ci in predictions:
    final_preds.append(ci['class_ids'][0])

print(len(final_preds))

6513
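
The fields in each prediction dictionary printed earlier are related to one another: `logistic` is the sigmoid of `logits`, `probabilities` is the pair (1 - logistic, logistic), and `class_ids` is the more probable of the two classes. The sketch below recomputes the first two predictions from their logits. The function names are hypothetical, and the 0.5 threshold is an assumption consistent with the outputs shown.

```python
import math


def sigmoid(logit):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-logit))


def describe_prediction(logit, threshold=0.5):
    """Rebuild the probability fields and class id from a single logit."""
    p = sigmoid(logit)
    return {
        'logistic': p,
        'probabilities': [1.0 - p, p],
        'class_id': 1 if p > threshold else 0,
    }


first = describe_prediction(-2.109045)    # first logit from the output above
second = describe_prediction(-11.782668)  # second logit from the output above
print(first)
print(second)
```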


In [25]:
final_preds

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,


**Import classification_report from sklearn.metrics and then see if you can figure out how to use it to easily get a full report of your model's performance on the test data.**

In [26]:
from sklearn.metrics import classification_report

In [27]:
print(classification_report(y_test, final_preds))

             precision    recall  f1-score   support

          0       0.86      0.92      0.89      4942
          1       0.69      0.54      0.61      1571

avg / total       0.82      0.83      0.82      6513
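
To make the report's columns concrete, the sketch below computes precision, recall and F1 for one class from true and predicted labels, using the standard definitions. The function names and the small example lists are illustrative and not taken from the notebook's data.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(binary_metrics(toy_true, toy_pred))

# The f1-score column in the report follows from its precision and recall:
print(round(f1_from(0.69, 0.54), 2))  # 0.61, as reported for class 1
```

Read this way, the report says the model finds about half of the >50K earners (recall 0.54 for class 1) while being right about 69% of the time when it predicts that class.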

