## Random Forest


In [1]:
# import packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier


## Predicting Income with Random Forests

In this project, we will be using a dataset containing census information from [UCI's Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/census+income).
By using this census data with a random forest, we will try to predict whether a person's income exceeds $50K a year, using the following variables: age, sex, capital-gain, capital-loss, and hours-per-week.
Let's get started!

We want to get all of that data into a Pandas DataFrame. Use the `pd.read_csv()` function with `"income.csv"` as a parameter and store the result in a variable named `income_df`. There's a small problem with our data that is a little hard to catch: every string has an extra space at the start. For example, the first row's `native-country` is `" United-States"`.

In [2]:
# read the income.csv file
# Each value in the file is followed by a comma and a space. Using ", " as the delimiter
# makes the parser treat both characters as the separator, so the leading space
# never ends up in the DataFrame.
# A delimiter longer than one character requires engine="python".

income_df = pd.read_csv("C:/Users/erwin/Desktop/MIS536/Module 7/DecTrees/income.csv", engine="python", delimiter=", ")
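As an aside, pandas also has a dedicated option for this situation. The sketch below is an alternative, not what this notebook ran. It uses a small in-memory sample that mimics the file's layout (the sample text and the names `sample_csv` and `sample_df` are made up for illustration) and relies on the `skipinitialspace=True` parameter of `pd.read_csv()` to drop the space after each comma.

```python
import io

import pandas as pd

# Hypothetical sample mimicking income.csv: a space follows every comma.
sample_csv = io.StringIO(
    "age, sex, native-country, income\n"
    "39, Male, United-States, <=50K\n"
    "50, Male, United-States, <=50K\n"
)

# skipinitialspace=True skips spaces right after the delimiter,
# so the separator can stay a plain comma.
sample_df = pd.read_csv(sample_csv, skipinitialspace=True)

print(sample_df.columns.tolist())
print(repr(sample_df.loc[0, "native-country"]))
```

With this option the default parser can be kept, so `engine="python"` is not needed.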


In [3]:
# Explore the dataset
# print the first five rows, the column names, the shape, and summary statistics
print(income_df.head())
print(income_df.columns)
print(income_df.shape)
print(income_df.describe())

   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country income  
0          2174             0              40  United-States  <=50K  
1             0             0             

In [4]:
# clean the dataset: sex is stored as a string, so encode it as an integer (Male = 0, Female = 1)

income_df["sex"] = income_df["sex"].map({"Male": 0, "Female": 1}).astype(int)

print(income_df.sex.dtypes)

int32


In [5]:
# construct datasets for analysis

X = income_df[["age", "sex", "capital-gain", "capital-loss", "hours-per-week"]]
y = income_df["income"]

print(X.head(),"\n",y.head())

   age  sex  capital-gain  capital-loss  hours-per-week
0   39    0          2174             0              40
1   50    0             0             0              13
2   38    0             0             0              40
3   53    0             0             0              40
4   28    1             0             0              40 
 0    <=50K
1    <=50K
2    <=50K
3    <=50K
4    <=50K
Name: income, dtype: object


In [6]:
# create the training set and the validation set (80% / 20% split)
training_X, valid_X, training_y, valid_y = train_test_split(X, y, test_size=0.2, random_state=1)

print(training_X.shape)

(26048, 5)
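
The shape above can be checked by hand. This sketch assumes the full dataset has 32,561 rows (an assumption: the `shape` output earlier in this notebook is cut off). It also assumes that `train_test_split` rounds the validation-set size up and gives the remainder to the training set, which is my understanding of scikit-learn's behavior when only `test_size` is given.

```python
import math

n_samples = 32561   # assumed total row count; not shown in the notebook output
test_size = 0.2     # same fraction passed to train_test_split above

n_valid = math.ceil(test_size * n_samples)  # validation rows, rounded up
n_train = n_samples - n_valid               # everything else goes to training

print(n_train, n_valid)
```

Under these assumptions `n_train` comes out to 26048, which matches the printed shape.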


In [11]:
# create the random forest
# random_state: controls the randomness of the bootstrap samples used to build each tree
#               and of the features sampled when looking for the best split
# n_estimators: the number of trees in the forest; the default is 100
# max_features: the number of features to consider when looking for the best split;
#               "sqrt" means the square root of n, where n is the number of predictors.
#               Here n = 5 and sqrt(5) is about 2.24, so 2 features are considered per split

forest = RandomForestClassifier(random_state=5, n_estimators=101, max_features="sqrt")
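
To see what `max_features="sqrt"` works out to with five predictors, here is a small arithmetic sketch. It assumes scikit-learn truncates the square root to an integer and never goes below one feature; the variable names are illustrative only.

```python
import math

n_features = 5  # age, sex, capital-gain, capital-loss, hours-per-week

# Square root of the number of predictors, truncated to an integer, with a floor of 1.
features_per_split = max(1, int(math.sqrt(n_features)))

print(math.sqrt(n_features))
print(features_per_split)
```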


In [12]:
# train the forest on the training set
forest.fit(training_X, training_y)
# score() returns the mean accuracy on the validation set
print(forest.score(valid_X, valid_y))

0.8248119146322739
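
An accuracy figure is easier to judge against a simple baseline: always predicting the most common class. The sketch below uses a made-up list of labels (not the census data) and a hypothetical helper, `majority_baseline`, to show the calculation. Applying the same idea to `valid_y` would show how much the forest improves on always guessing the most frequent label.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    _, most_common_count = counts.most_common(1)[0]
    return most_common_count / len(labels)


# Toy labels for illustration only.
toy_labels = ["<=50K", "<=50K", "<=50K", ">50K"]
print(majority_baseline(toy_labels))
```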


In [15]:
# Which features tend to be more relevant?
importances = forest.feature_importances_

for name, importance in zip(valid_X.columns, importances):
    print(name, ": ", importance)


age :  0.2970407841838787
sex :  0.065691390765117
capital-gain :  0.31013913967352197
capital-loss :  0.12435212759803446
hours-per-week :  0.2027765577794479
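
Sorting makes the ranking easier to read. The sketch below copies the values printed above into a dictionary rather than reading them from `forest`, so it runs on its own. It also checks that the importances sum to 1, since scikit-learn normalizes impurity-based importances.

```python
# Importances copied from the output above.
importances_by_name = {
    "age": 0.2970407841838787,
    "sex": 0.065691390765117,
    "capital-gain": 0.31013913967352197,
    "capital-loss": 0.12435212759803446,
    "hours-per-week": 0.2027765577794479,
}

# Sort from most to least important.
ranked = sorted(importances_by_name.items(), key=lambda item: item[1], reverse=True)

for name, value in ranked:
    print(f"{name}: {value:.3f}")

# The values should add up to 1 (within floating-point error).
total = sum(importances_by_name.values())
print(round(total, 6))
```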


In [None]:
### Based on the results, the most relevant features are "capital-gain" (0.31) and "age" (0.30), followed by "hours-per-week" (0.20); "sex" is the least relevant (0.07).