# Random Forest — more or less than 50K/year
Now with optimized random forests and neural networks!

## Step 0: Importing
First, we import the libraries needed to read the data, to use a `RandomForestClassifier`, a `LabelEncoder`, `GridSearchCV`, and our neural networks.

In [1]:
from __future__ import print_function

import pandas as pd
import numpy as np
from scipy import stats as stats
import keras
from keras.datasets import mnist, reuters
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV

Using TensorFlow backend.


In [2]:
def accuracy(predictions, results):
    corrects = predictions == results
    num_correct = sum(corrects)
    return num_correct/len(results)

## Step 0.5: Data parsing
Then, we parse the data from `adult_data.csv` (from https://archive.ics.uci.edu/ml/datasets/Adult) and we clarify which columns should be used as training data. We exclude `fnlwgt` and `result`: `fnlwgt` is a census sampling weight (roughly, how many people in the population the row represents) rather than a characteristic of the person, and `result` is the income label we are trying to predict, so neither belongs in the input.

In [3]:
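# Note: read_csv treats the first line of the file as a header by default, and the
# assignment below then overwrites those names. If the file has no header row, that
# first record is dropped; passing header=None to read_csv would keep it.
df = pd.read_csv("../data/adult_data.csv")
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'result']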
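As a minimal sketch of the header behaviour, the snippet below reads a small in-memory sample twice: once with the defaults and once with `header=None` and `names=`. The two rows are made up for illustration (they are not taken from the dataset), and it is an assumption that `adult_data.csv` has no header row of its own.

```python
import io
import pandas as pd

column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                'marital-status', 'occupation', 'relationship', 'race', 'sex',
                'capital-gain', 'capital-loss', 'hours-per-week',
                'native-country', 'result']

# Hypothetical two-record sample with no header row (values are invented).
raw_text = (
    "25,Private,100000,Bachelors,13,Never-married,Sales,Not-in-family,White,Female,0,0,40,United-States,<=50K\n"
    "41,Private,200000,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,45,United-States,>50K\n"
)

# Default behaviour: the first record is consumed as the header.
default_read = pd.read_csv(io.StringIO(raw_text))

# header=None keeps every record; names= supplies the column labels directly.
full_read = pd.read_csv(io.StringIO(raw_text), header=None, names=column_names)

print(default_read.shape)  # one data row
print(full_read.shape)     # two data rows
```

If the real file does start with a header line, the default call is already correct and only the column renaming is needed.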

In [4]:
test_features = ['age', 'workclass', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']

In [5]:
all_features = df.columns

In [6]:
df[test_features].head()

| | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Self-emp-not-inc | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 1 | 38 | Private | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 2 | 53 | Private | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 3 | 28 | Private | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |
| 4 | 37 | Private | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States |


In [7]:
df.shape

(32560, 15)

We then set aside the first 80% of the rows for training and the remaining 20% for testing.

In [8]:
num_training = int(0.8*df.shape[0])

In [9]:
df_training = df.iloc[:num_training, ]

In [10]:
df_training.shape

(26048, 15)

In [11]:
df_testing = df.iloc[num_training:, ]
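The split above takes rows in file order. As an alternative sketch (not what this notebook does), scikit-learn's `train_test_split` can shuffle the rows and keep the class balance the same in both parts. The `toy` DataFrame and the `random_state` value below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 25% of them in the positive class.
toy = pd.DataFrame({"age": np.arange(100), "result": [0] * 75 + [1] * 25})

# test_size=0.2 mirrors the 80/20 split; stratify keeps the 75/25 class ratio
# in both parts, and random_state makes the shuffle repeatable.
train_part, test_part = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["result"]
)

print(train_part.shape, test_part.shape)
print(test_part["result"].mean())  # share of positive rows in the test part
```

A shuffled split matters most when the file is sorted in some way; whether that is the case for this dataset is not something the notebook checks.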

In [12]:
le = preprocessing.LabelEncoder()

In [13]:
df_training_transformed = pd.DataFrame()
for feature in all_features:
    df_training_transformed[feature] = le.fit_transform(df_training[feature])

In [14]:
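# Note: fit_transform re-fits the encoder on the test split, so a category's integer
# code matches the training code only if both splits contain the same set of values.
df_testing_transformed = pd.DataFrame()
for feature in all_features:
    df_testing_transformed[feature] = le.fit_transform(df_testing[feature])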
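Because the encoder is fitted separately on each split, the same category can receive different integer codes in training and testing. The sketch below shows the effect on a made-up column and one possible way to keep the codes aligned, by fitting once on the values from both splits. This is an illustration under those assumptions, not the encoding used for the results in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column values; the test split happens to lack "Private".
train_col = pd.Series(["Private", "State-gov", "Private"])
test_col = pd.Series(["State-gov", "State-gov"])

# Fitting separately: "State-gov" is 1 in training but 0 in testing.
separate_train = LabelEncoder().fit_transform(train_col)
separate_test = LabelEncoder().fit_transform(test_col)

# Fitting once on all values keeps one code per category across both splits.
shared_encoder = LabelEncoder().fit(pd.concat([train_col, test_col]))
shared_train = shared_encoder.transform(train_col)
shared_test = shared_encoder.transform(test_col)

print(separate_train, separate_test)
print(shared_train, shared_test)
```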

We then define the X input data and the Y output data.

In [15]:
x_train = df_training_transformed[test_features]

In [16]:
y_train = df_training_transformed['result']

## Step 1: Training our original, random forest model
We then create a `RandomForestClassifier()`, which we train with our training data.

In [17]:
clf = RandomForestClassifier()

In [18]:
clf.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

We can then use `clf.predict` and compare the predicted results for our test data with the actual results to find the accuracy of our model.

In [19]:
clf.predict(df_testing_transformed[test_features])

array([0, 0, 0, ..., 0, 0, 1])

In [20]:
clf_perc = accuracy(clf.predict(df_testing_transformed[test_features]), df_testing_transformed['result'])
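The `accuracy` helper is the fraction of predictions that match the true labels. As a small sketch on made-up arrays, the same number can be computed with NumPy directly or with scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical predictions and labels; four of the five agree.
predicted = np.array([0, 1, 1, 0, 1])
actual = np.array([0, 1, 0, 0, 1])

# Fraction of matching entries, computed by hand.
manual = np.mean(predicted == actual)

# The same quantity from scikit-learn (true labels first, predictions second).
library = accuracy_score(actual, predicted)

print(manual, library)
```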

## Step 2: Training our optimized random forest model using GridSearchCV()
Using `GridSearchCV()`, we can optimize the parameters of our model. First, we create a new `RandomForestClassifier()`.

In [21]:
optimized_forest = RandomForestClassifier()

We then define the parameters that we're adjusting for and what values they can take on.

In [22]:
parameters = [
    {"n_estimators": [110, 120], "min_samples_split": [25, 30, 35]}
]

We then create a GridSearchCV optimized model and fit it with our training data.

In [23]:
clf_optimized = GridSearchCV(optimized_forest, parameters)

In [24]:
clf_optimized.fit(x_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'n_estimators': [110, 120], 'min_samples_split': [25, 30, 35]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

We can then look at the `best_estimator_` of our search (AKA the model with the parameter values that were chosen by `GridSearchCV`).

In [25]:
clf_optimized.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=25,
            min_weight_fraction_leaf=0.0, n_estimators=120, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
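Besides `best_estimator_`, a fitted `GridSearchCV` also exposes `best_params_`, `best_score_`, and `cv_results_`, which show how each parameter combination scored in cross-validation. The sketch below uses a small synthetic dataset and a made-up grid so that it runs quickly. The data, the grid values, and `cv=3` are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the census data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately tiny, hypothetical grid: 2 x 2 = 4 combinations.
grid = {"n_estimators": [10, 20], "min_samples_split": [2, 10]}

search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
search.fit(X, y)

print(search.best_params_)  # the winning combination
print(search.best_score_)   # its mean cross-validated accuracy

# Mean cross-validated score for every combination that was tried.
for params, mean_score in zip(search.cv_results_["params"],
                              search.cv_results_["mean_test_score"]):
    print(params, round(mean_score, 3))
```

Looking at all the scores, not just the winner, shows whether the best combination is clearly better or only marginally ahead of the others.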

As with our original model, we can test our model with new data to find its accuracy.

In [26]:
clf_optimized.predict(df_testing_transformed[test_features])

array([0, 0, 0, ..., 0, 0, 1])

In [27]:
clf_optimized_perc = accuracy(clf_optimized.predict(df_testing_transformed[test_features]), df_testing_transformed['result'])

In [28]:
print("Original: " + str(clf_perc))
print("Optimized: " + str(clf_optimized_perc))

Original: 0.808814496314
Optimized: 0.820792383292


Voila! On this test split, our `RandomForestClassifier`, when optimized with `GridSearchCV`, is more accurate: about 0.821 versus 0.809, a gain of roughly 1.2 percentage points.

## Step 3: A Neural Network for classification
Based on the work of Carl Shan and the Keras examples.

In [29]:
print(len(x_train), 'train sequences')
print(len(df_testing_transformed['result']), 'test sequences')

26048 train sequences
6512 test sequences


In [30]:
num_classes = np.max(y_train) + 1 # How many classes we have.
print(num_classes, 'classes')

2 classes


In [31]:
np.unique(y_train)

array([0, 1])

In [32]:
x_test = df_testing_transformed[test_features] # Only get the features for testing in our test set

In [33]:
# Make x_test processable by Keras.
x_test = x_test.values
x_test = x_test.astype('float32')

In [34]:
# Convert the integer class labels to a one-hot (binary) matrix, one column per class
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(df_testing_transformed['result'], num_classes)
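To make the one-hot step concrete, here is a minimal NumPy sketch of the kind of matrix `to_categorical` produces for two classes. The `labels` array is made up, and the NumPy indexing trick is only an illustration, not how Keras implements it.

```python
import numpy as np

# Hypothetical integer labels for four samples.
labels = np.array([0, 1, 1, 0])
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing it with the labels gives one row per sample.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot)
# The original labels can be recovered from the position of the 1 in each row.
print(np.argmax(one_hot, axis=1))
```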

In [35]:
batch_size = 32 # The batch size, set to save memory
epochs = 20 # The number of epochs for which we want to run

In [36]:
x_train = x_train.values # Make the data processable

Below, we train the model. I chose the layer hyperparameters based on the recommendations here: https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw, specifically the answer of `hobs`.

I chose the optimizer `adadelta` based on seeing someone else use it here: https://datascience.stackexchange.com/questions/10048/what-is-the-best-keras-model-for-multi-class-classification, and based on seeing this image in class: https://cdn-images-1.medium.com/max/1600/1*SjtKOauOXFVjWRR7iCtHiA.gif. When it ended up working, I was more than happy to keep using it.

In [37]:
print('Building model...')
model = Sequential() # Define our model as a sequential neural net
model.add(Dense(896, activation='relu', input_shape=(13,))) # First hidden layer: 896 nodes, fed by our 13 input features.
model.add(Dropout(0.2)) # Randomly zero 20% of the layer's outputs during training to help prevent overfitting.
model.add(Dense(896, activation='relu')) # Then a second 896-node hidden layer.
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax')) # Softmax output layer with one node per class.

model.summary()
# opt = SGD(lr=0.01)
model.compile(optimizer='adadelta', # Using adadelta based on https://cdn-images-1.medium.com/max/1600/1*SjtKOauOXFVjWRR7iCtHiA.gif
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Fit our model
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))

Building model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 896)               12544     
_________________________________________________________________
dropout_1 (Dropout)          (None, 896)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 896)               803712    
_________________________________________________________________
dropout_2 (Dropout)          (None, 896)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 1794      
Total params: 818,050
Trainable params: 818,050
Non-trainable params: 0
_________________________________________________________________
Train on 26048 samples, validate on 6512 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
...
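The `Param #` column in the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic for the three layers. The helper name `dense_params` is made up for illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for dense_1, dense_2 and dense_3 in the summary above.
layer_shapes = [(13, 896), (896, 896), (896, 2)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layer_shapes]

print(counts)       # [12544, 803712, 1794]
print(sum(counts))  # 818050
```

The `Dropout` layers add no parameters, which is why they show 0 in the summary.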

In [38]:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
neural_net_perc = score[1]

Test loss: 0.344766228659
Test accuracy: 0.840294840295


In [39]:
print("Original Random Forest: " + str(clf_perc))
print("Optimized Random Forest: " + str(clf_optimized_perc))
print(str(epochs) + "-Epoch Neural Network: " + str(neural_net_perc))

Original Random Forest: 0.808814496314
Optimized Random Forest: 0.820792383292
20-Epoch Neural Network: 0.840294840295


We see that our neural network is the most accurate of the 3 models on this test split (about 0.840, versus 0.821 and 0.809 for the random forests). I read the following article: http://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics, and found that the tradeoffs between random forests & neural networks are very data-dependent. For our case, the neural network is probably more effective because it can pick up complicated, non-obvious relationships between the features that the random forests did not capture, and because dropout helps keep it from overfitting. That's a big one. Neural networks can overfit too, but although I increased the `min_samples_split` within the grid-searched classifier, the forest can still overfit because of the noise and variance that comes with determining income. With neural networks, we can add `Dropout` layers so that a random fraction of units is ignored at each training step (while keeping the model accurate).
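To illustrate what a dropout rate of 0.2 does during training, here is a minimal NumPy sketch of "inverted" dropout: each activation is zeroed with probability 0.2 and the survivors are scaled by 1/(1 - 0.2), so the expected value stays the same. The function name, the seed, and the all-ones input are assumptions for illustration. This is a sketch of the idea, not Keras's own code.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero each entry with probability `rate`; rescale the rest by 1 / (1 - rate)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)        # fixed seed so the sketch is repeatable
activations = np.ones((1000, 896))    # hypothetical layer outputs, all equal to 1
dropped = inverted_dropout(activations, rate=0.2, rng=rng)

zero_fraction = np.mean(dropped == 0)
print(zero_fraction)   # close to 0.2
print(dropped.mean())  # close to 1.0, because survivors are scaled up
```

At prediction time dropout is switched off, so every unit contributes and no rescaling is needed.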