# Income Predictor - Random Forest Machine Learning Model
This notebook demonstrates my ability to create a machine learning model using a random forest classifier. The model predicts whether a person makes more than $50,000 a year based on features such as age, education, sex, and hours worked per week.

The process involves training the model on one subset of the existing data, trying several different sets of features, and testing the accuracy of the model on a held-out subset of the data.

The data was retrieved from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/census%20income).

## Load and Inspect the Data

In [95]:
# suppress all warnings by replacing warnings.warn with a function that does nothing
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [80]:
# import libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

The CSV file downloaded from the repository does not include column names, so they need to be supplied manually.

In [81]:
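# read in the data; the file has no header row, so the column names are supplied here
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_years', 'marital_status', 'occupation',
           'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
# fields are separated by a comma followed by a space; skipinitialspace strips that space
data = pd.read_csv('adult.data', header=None, names=columns, skipinitialspace=True)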

# preview data
data.head(10)

,age,workclass,fnlwgt,education,education_years,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [82]:
data.dtypes

age                 int64
workclass          object
fnlwgt              int64
education          object
education_years     int64
marital_status     object
occupation         object
relationship       object
race               object
sex                object
capital_gain        int64
capital_loss        int64
hours_per_week      int64
native_country     object
income             object
dtype: object

scikit-learn's random forest requires numeric input, so it cannot split on features that contain strings. Columns with the `object` datatype must either be recoded as numbers or left out of the features used to train the model.

The `income` column is currently an `object` datatype. It needs to be recoded before it can be used as a label for the model. This is done in the following section.
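As a small illustration of how numeric and non-numeric columns could be separated programmatically, here is a sketch on a toy DataFrame. The `toy` frame and its values are made up for this example and stand in for `data`; `select_dtypes` is a standard pandas method.

```python
import pandas as pd

# Toy stand-in for the census data (illustrative values only).
toy = pd.DataFrame({
    'age': [39, 50],
    'workclass': ['State-gov', 'Private'],
    'hours_per_week': [40, 13],
})

# Columns that can be used as features without recoding.
numeric_columns = toy.select_dtypes(include='number').columns.tolist()

# Columns that would need recoding before they could be used.
other_columns = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_columns)
print(other_columns)
```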

## Prepare Data for the Model
To train the model, labels and features must be defined. The data also needs to be split into training and testing subsets.

First, the `income` column needs to be recoded into a compatible datatype. Because the model is predicting whether a person makes more than $50,000 a year, we will code this condition as 0 for no and 1 for yes.

In [83]:
# create dictionary for mapping new coded values
income_condition = {'>50K': 1, '<=50K': 0}

# map the values to the income column
data['income'] = data['income'].map(income_condition)
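One thing worth checking after a `map` call like this is whether every value was covered by the dictionary, because `Series.map` turns unmatched values into `NaN` rather than raising an error. The sketch below uses a made-up series; the `'>50K.'` entry is a hypothetical variant included only to show the behavior.

```python
import pandas as pd

# Toy stand-in for the income column; the last value is a hypothetical
# variant that the mapping dictionary does not cover.
income = pd.Series(['<=50K', '>50K', '<=50K', '>50K.'])
income_condition = {'>50K': 1, '<=50K': 0}

coded = income.map(income_condition)

# Values missing from the dictionary become NaN instead of raising an error,
# so counting NaNs reveals any labels that were not recoded.
unmapped = int(coded.isna().sum())
print(unmapped)
```

If the count is greater than zero, the affected rows would need to be cleaned or dropped before training.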

In [84]:
data.head()

,age,workclass,fnlwgt,education,education_years,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [87]:
# set labels and features for the model
# use a 1-D Series for the labels, which is the shape scikit-learn expects
labels = data['income']
features = data[['age', 'capital_gain', 'capital_loss', 'hours_per_week']]

# split the data
train_data, test_data, train_labels, test_labels = train_test_split(features, labels, random_state = 1)
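When no `test_size` is given, `train_test_split` holds out 25% of the rows for testing. The sketch below shows this on made-up data and also passes `stratify`, an optional argument the notebook does not use, which keeps the class ratio similar in both subsets. All `toy_*` names and values are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with 25% positive labels (illustrative only).
toy_features = pd.DataFrame({'x': range(100)})
toy_labels = pd.Series([1] * 25 + [0] * 75)

# Default split is 75% train / 25% test; stratify preserves the label ratio.
X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_labels, random_state=1, stratify=toy_labels
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```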

## Build the Model
Now that the labels and features have been defined, and the data has been split into training and testing subsets, the random forest can be built.

In [88]:
# create the classifier object
forest = RandomForestClassifier(random_state=1)

# fit the model to train data
forest.fit(train_data, train_labels)

RandomForestClassifier(random_state=1)

## Test the Model
Now that the model has been trained, we can test its accuracy using the testing subset of the data.

In [89]:
# check model score
forest.score(test_data, test_labels)

0.8222577078982926
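The score above is plain accuracy, so it helps to compare it against what a model would get by always predicting the most common class. A minimal sketch of such a baseline follows, using scikit-learn's `DummyClassifier` on made-up labels; the 3-to-1 class ratio here is an assumption for illustration, not a figure taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: 3 of every 4 rows are class 0 (illustrative ratio only).
y = np.array([0, 0, 0, 1] * 50)
X = np.zeros((len(y), 1))  # the dummy model ignores the feature values

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)
```

In the notebook, the same idea could be applied by fitting the baseline on `train_data` and `train_labels` and scoring it on the test subset.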

An accuracy of about 82% is a reasonable start, but it may improve with more features. The `sex` column can be added to see whether the score improves. It first needs to be recoded as 0 for male and 1 for female.

In [91]:
# create remapping dictionary
sex_condition = {'Male': 0, 'Female': 1}

# recode the sex column
data['sex'] = data['sex'].map(sex_condition)

In [103]:
# reassign the model features
features = data[['age', 'capital_gain', 'capital_loss', 'hours_per_week', 'sex']]

# split the data
train_data, test_data, train_labels, test_labels = train_test_split(features, labels, random_state = 1)

# fit the model to train data
forest.fit(train_data, train_labels)

RandomForestClassifier(random_state=1)

In [104]:
# recheck the model score
forest.score(test_data, test_labels)

0.8272939442328953

Adding the `sex` column improved the model accuracy by only about half a percentage point (from roughly 0.822 to 0.827).

Two more features that may improve the accuracy of the model are the `education_years` column and the `race` column. The `education_years` column is already compatible, but the `race` column will need to be recoded.

In [107]:
# create mapping dict for race
race_condition = {'White': 0, 'Black': 1, 'Asian-Pac-Islander': 2, 'Amer-Indian-Eskimo': 3, 'Other': 4}

# recode race column
data['race'] = data['race'].map(race_condition)
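Integer codes like these impose an arbitrary order on categories that have none. An alternative worth considering is one-hot encoding, which gives each category its own indicator column. The sketch below uses `pd.get_dummies` on a made-up frame; `toy` and `encoded` are hypothetical names, and this is not the approach the notebook takes.

```python
import pandas as pd

# Toy stand-in for the race column (illustrative values only).
toy = pd.DataFrame({'race': ['White', 'Black', 'Other', 'White']})

# One indicator column per category, named <column>_<category>.
encoded = pd.get_dummies(toy, columns=['race'])

print(list(encoded.columns))
print(encoded)
```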

In [111]:
# reassign the model features
features = data[['age', 'capital_gain', 'capital_loss', 'hours_per_week', 'sex', 'education_years']]

# split the data
train_data, test_data, train_labels, test_labels = train_test_split(features, labels, random_state = 1)

# fit the model to train data
forest.fit(train_data, train_labels)

RandomForestClassifier(random_state=1)

In [112]:
# recheck the model score
forest.score(test_data, test_labels)

0.8368750767718953

Adding `education_years` raised the accuracy of the model by about one percentage point (from roughly 0.827 to 0.837). In a separate run, adding the `race` column lowered the score by about 0.2 percentage points, so it is left out of the feature list above and will not be used to train the model.
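Differences of a fraction of a percentage point on a single train/test split can come from the particular split rather than from the feature itself. One way to compare feature sets more robustly is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` in place of the census data; `cv_forest`, the sample sizes, and the column slices are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 300 rows, 6 numeric features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=1)

cv_forest = RandomForestClassifier(n_estimators=50, random_state=1)

# Accuracy on 5 folds using all features versus only the first four.
scores_all = cross_val_score(cv_forest, X, y, cv=5)
scores_subset = cross_val_score(cv_forest, X[:, :4], y, cv=5)

print(scores_all.mean(), scores_all.std())
print(scores_subset.mean(), scores_subset.std())
```

If the gap between the two means is small relative to the standard deviations, the extra features may not be making a reliable difference.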

## Make Predictions With the Model
Now the model can take a new individual's data and predict whether that person makes more than $50,000 per year. On the testing subset, these predictions were correct roughly 84% of the time.

In [142]:
# create hypothetical personal data
my_age = 24
my_capital_gain = 750
my_capital_loss = 0
my_hours_per_week = 30
my_sex = 0
my_education_years = 18

# create new instance of personal data
my_data = {'age': my_age, 'capital_gain': my_capital_gain, 'capital_loss': my_capital_loss, 'hours_per_week': my_hours_per_week, 'sex': my_sex, 'education_years': my_education_years}

# create df of new instance
new_data = pd.DataFrame([my_data])

In [143]:
# make prediction
forest.predict(new_data)

array([0])

Based on this hypothetical person's data, the model predicts that the individual does not make over $50,000 per year.
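Besides the hard 0 or 1 prediction, a random forest can also report how the trees' votes were distributed, which gives a rough sense of how confident the prediction is. The sketch below shows `predict_proba` on a tiny made-up dataset; `toy_X`, `toy_y`, `toy_forest`, and `person` are hypothetical names and values, not the notebook's trained model.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training set with two features (illustrative only).
toy_X = pd.DataFrame({
    'age': [25, 30, 45, 50, 38, 60],
    'hours_per_week': [20, 40, 50, 60, 35, 45],
})
toy_y = [0, 0, 1, 1, 0, 1]

toy_forest = RandomForestClassifier(n_estimators=20, random_state=1).fit(toy_X, toy_y)

# One new row with the same columns as the training data.
person = pd.DataFrame([{'age': 24, 'hours_per_week': 30}])

# Each row of the result holds one probability per class, in the order of classes_.
proba = toy_forest.predict_proba(person)
print(toy_forest.classes_)
print(proba)
```

With the notebook's model, the equivalent call would be `forest.predict_proba(new_data)`.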