# Project: Census Income

#### Project Goal: 
Use a random forest to classify income as <=50K or >50K.

#### Data: 
Data obtained from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/census%20income.

#### Analysis and Evaluation: 
A random forest model will be implemented, and its performance will be judged by the accuracy obtained and by the recall of each class.
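As a small illustration of the two metrics named above, the sketch below computes accuracy and per-class recall with scikit-learn. The `y_true` and `y_pred` lists are made-up values for demonstration only and are not taken from the census data.

```python
from sklearn.metrics import accuracy_score, recall_score

# hypothetical labels, for illustration only
y_true = ['<=50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K']
y_pred = ['<=50K', '<=50K', '<=50K', '>50K', '>50K', '<=50K']

# fraction of all predictions that are correct
accuracy = accuracy_score(y_true, y_pred)

# recall for each class, returned in the order given by `labels`
recall_per_class = recall_score(
    y_true, y_pred, labels=['<=50K', '>50K'], average=None
)

print(accuracy)
print(recall_per_class)
```

Accuracy counts every row equally, while recall is computed separately for each class, so the two can diverge when one class is much larger than the other.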

In [1]:
# import python modules

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# load data into pandas DataFrame
income_data = pd.read_csv('adult.csv', names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])

# inspect data and verify correct datatypes
print(income_data.head())
print(income_data.dtypes)

   age          workclass  fnlwgt   education  education-num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country  income  
0          2174             0              40   United-States   <=50

The "sex" and "native-country" columns will be used to predict income, but scikit-learn's random forest requires numeric features and cannot take string columns directly. A new column will be made that encodes "Male" as 0 and "Female" as 1. Another new column will encode a native country of "United-States" as 0 and all other countries as 1.

In [3]:
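# the values in the data file start with a space, so the strings
# compared against below include that leading space

# encode sex: "Male" -> 0, anything else -> 1
income_data["sex-int"] = income_data["sex"].apply(lambda value: 0 if value == ' Male' else 1)

# encode native country: "United-States" -> 0, anything else -> 1
income_data["country-int"] = income_data["native-country"].apply(lambda value: 0 if value == ' United-States' else 1)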
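The same encoding can also be sketched without hard-coding the leading space, by stripping whitespace before the comparison. The `toy` DataFrame below is a hypothetical stand-in for `income_data`, used only so the example runs on its own. This is one possible alternative, not the approach used in this notebook.

```python
import pandas as pd

# hypothetical rows mimicking the leading space in the raw file
toy = pd.DataFrame({
    'sex': [' Male', ' Female', ' Male'],
    'native-country': [' United-States', ' Mexico', ' ?'],
})

# strip whitespace, compare, then turn True/False into 1/0
toy['sex-int'] = (toy['sex'].str.strip() != 'Male').astype(int)
toy['country-int'] = (toy['native-country'].str.strip() != 'United-States').astype(int)

print(toy[['sex-int', 'country-int']])
```

As with the `apply` version, any value other than "United-States" is encoded as 1, including a `?` placeholder for a missing country.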

In [4]:
# specify the labels as the "income" column
labels = income_data.income

# select data to be used to predict income
data = income_data[['age', 'capital-gain', 'capital-loss', 'hours-per-week', 'sex-int', 'country-int', 'education-num']]

# split data and labels into a training set and test set
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size = 0.2, stratify=labels)

# create classifier and fit the data
forest = RandomForestClassifier(n_estimators = 200)
forest.fit(train_data, train_labels)

# determine and print classification report
predictions = forest.predict(test_data)
true_classes = test_labels
report = classification_report(true_classes, predictions)
print(report)

              precision    recall  f1-score   support

       <=50K       0.86      0.92      0.89      4945
        >50K       0.68      0.53      0.60      1568

    accuracy                           0.83      6513
   macro avg       0.77      0.73      0.74      6513
weighted avg       0.82      0.83      0.82      6513



The accuracy of 0.83 looks satisfactory at first glance, but the data is imbalanced: per the support, 4945 of the 6513 test rows are "<=50K", so a model that always predicted "<=50K" would already reach an accuracy of about 0.76. The per-class recall is therefore more informative. It is 0.92 for "<=50K" but only 0.53 for ">50K", meaning the model misses nearly half of the higher-income cases.
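The figures in the classification report can be put in context with a little arithmetic. The sketch below uses only the support and recall values printed in the report. The variable names are illustrative, and the recall values are rounded to two decimals in the report, so the mean is approximate.

```python
# support values copied from the classification report
support_low = 4945    # rows labelled "<=50K"
support_high = 1568   # rows labelled ">50K"

# accuracy of a trivial model that always predicts "<=50K"
baseline_accuracy = support_low / (support_low + support_high)

# recall values copied from the report (rounded there to two decimals)
recall_low, recall_high = 0.92, 0.53

# unweighted mean of the two recalls (the "macro avg" recall)
mean_recall = (recall_low + recall_high) / 2

print(round(baseline_accuracy, 3))
print(round(mean_recall, 3))
```

Comparing the reported accuracy against the majority-class baseline, and looking at the unweighted mean recall, gives a fairer picture of performance on imbalanced data than accuracy alone.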