## Income Level Predictor

ML analysis by Jason Yeoh

This Machine Learning model predicts income level based on __crimes__ and __socioeconomic factors__.

__Income levels__ are subdivided into five classes namely:
1. Top 20%
2. Upper 20%
3. Mid 20%
4. Lower 20%
5. Bottom 20%

In [165]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import geopandas as gpd
import visuals as vs
import time
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

### SOCIOECONOMIC INDICATORS
With this dataset, our team removed outliers

In [9]:
socioecon = pd.read_csv("ChicagoSocioecon.csv")
socioecon['Community Area Number'] = pd.to_numeric( socioecon['Community Area Number'], downcast='signed')
socioecon['Community Area Number'] = socioecon['Community Area Number'].fillna(0.0).apply(np.int64)
socioecon['HARDSHIP INDEX'] = socioecon['HARDSHIP INDEX'].fillna(0.0).apply(np.int64)
socioecon = socioecon.set_index('Community Area Number')
socioecon = socioecon.dropna()

### CRIMES

In [10]:
crimes = pd.read_csv("dataFiltered.csv")
crimes['Year'] = crimes['Year'].fillna(0.0).apply(np.int64)
crime = crimes.groupby('Community Area').count()['ID'].reset_index(name="Crime Count")

### DATA PREPROCESSING

The team joined crimes and socioeconomic tables on the key __Community Area Number__. In addition, we used abbreviations on various socioeconomic classes to make the table easily referenced.

Below is the list of abbreviations:
1. __NAME__ - Community Area Name
2. __PHC__ - Percent of Housing Crowded
3. __PHBP__ - Percent Households Below Poverty
4. __PAUN__ - Percent Aged 16+ Unemployed
5. __PAWHS__ - Percent Aged 25+ without High School Diploma
6. __PAU18__ - Percent Aged Under 18 or Over 64
7. __PCI__ - Per Capita Income
8. __HI__ - Hardship Index
9. __CC__ - Crime Count

In [240]:
crime_socioecon = socioecon.join(crime, on='Community Area Number')
crime_socioecon = crime_socioecon.drop(['Community Area'], axis=1)
crime_socioecon.columns = ['NAME', 'PHC', 'PHBP', 'PAUN', 'PAWHS', 'PAU18', 'PCI', 'HI', 'CC']
crime_socioecon.head()

Unnamed: 0_level_0,NAME,PHC,PHBP,PAUN,PAWHS,PAU18,PCI,HI,CC
Community Area Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,Rogers Park,7.7,23.6,8.7,18.2,27.5,23939,39,7748
2,West Ridge,7.8,17.2,8.8,20.8,38.5,23040,46,6722
3,Uptown,3.8,24.0,8.9,11.8,22.2,35787,20,7271
4,Lincoln Square,3.4,10.9,8.2,13.4,25.5,37524,17,3537
5,North Center,0.3,7.5,5.2,4.5,26.2,57123,6,2945


### ENCODING, LABELING AND DATA SPLITTING

In this section, our group encoded the values on __PCI__ (per capita income) metric into bins of income levels. Each bin corresponds to an income bracket. The rest of the columns were selected to be the features.

Moreover, we used these data to split it into training and testing datasets.

In [149]:
# Encode income levels
def encodeLabels(y):
    if y >= 33982.6:   return 0  #TOP 20%
    elif y >= 24018.4: return 1  #UPPER MID 20%
    elif y >= 18527.4: return 2  #MID 20%
    elif y >= 14846.6: return 3  #LOWER MID 20%
    else: return 4 #BOTTOM 20%

feature_cols = ['PHC', 'PHBP', 'PAUN', 'PAWHS', 'PAU18', 'CC']
numerical = ['PHC', 'PHBP', 'PAUN', 'PAWHS', 'PAU18', 'PCI', 'CC']

X_raw = crime_socioecon[feature_cols]
y = crime_socioecon['PCI']
y_raw = y.apply(encodeLabels)

X_train, X_test, y_train, y_test = train_test_split(X_raw, y_raw, test_size=0.2)

print ("Training set has {} samples.".format(X_train.shape[0]))
print ("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 62 samples.
Testing set has 16 samples.


### DATA TRAINING AND PREDICTING
In this segment of ML analysis, we used a classifier to train and predict testing and training datasets, and report performance based on a number of metrics. To get an accurate performance scores, the team aggregated the results on training and getting prediction on a classifier five times.

In [239]:
# Train and predict a classifier.
def train_and_predict(clf, X_train, X_test, y_train, y_test):
    start = time.time()
    clf = clf.fit(X_train, y_train)
    end = time.time()
    clf_time = end - start

    start = time.time()
    predictions_train = clf.predict(X_train)
    predictions_test = clf.predict(X_test)
    end = time.time() 
    pred_time = end - start
    
    accu_train = accuracy_score(y_train, predictions_train)
    accu_test = accuracy_score(y_test, predictions_test)
    
    f_train = fbeta_score(y_train, predictions_train, beta=0.5, average='micro')
    f_test = fbeta_score(y_test, predictions_test, beta=0.5, average='micro')
    
    return (clf_time, pred_time, accu_train, accu_test, f_train, f_test)
    
# Train and predict a classifier five times.
def train_and_predict_5_times(clf, X_train, X_test, y_train, y_test):
    clf_time = []
    pred_time = []
    accu_train = [] 
    accu_test = []
    f_train = []
    f_test = []
    
    for tries in range(0, 5):
        (a,b,c,d,e,f) = train_and_predict(clf, X_train, X_test, y_train, y_test)
        clf_time.append(a)
        pred_time.append(b)
        accu_train.append(c)
        accu_test.append(d)
        f_train.append(e) 
        f_test.append(f)
        
    # Display results
    print("[ {} ]".format(clf.__class__.__name__))
    print(" TRAINING TIME:         {}".format(np.mean(clf_time)))
    print(" PREDICTION TIME:       {}".format(np.mean(pred_time)))
    print(" ACCURACY ON TRAIN SET: {}".format(np.mean(accu_train)))
    print(" ACCURACY ON TEST SET:  {}".format(np.mean(accu_test)))
    print(" F1-SCORE ON TRAIN SET: {}".format(np.mean(f_train)))
    print(" F1-SCORE ON TEST SET:  {}".format(np.mean(f_test)))
    print("\n")

### EVALUATING PERFORMANCE 

Based on these metrics:
- Training time
- Prediction time
- Accuracy score for test & train
- F1 score for test & train

Random Forest Classifier looks viable as it scored the highest in accuracy and F1 score metrics, in which it gives the most accurate and precise prediction among the other three classifiers such as KNeighbors, Decision Tree, and Gaussian Naive-Bayes. 

In [231]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

clf_A = KNeighborsClassifier()
clf_B = DecisionTreeClassifier()
clf_C = GaussianNB()
clf_D = RandomForestClassifier()

scores = []

for clf in [clf_A, clf_B, clf_C, clf_D]:
    train_and_predict_5_times(clf, X_train, X_test, y_train, y_test)

[ KNeighborsClassifier ]
 CLF TIME:              0.001154327392578125
 PREDICTION TIME:       0.004290914535522461
 ACCURACY ON TRAIN SET: 0.4516129032258064
 ACCURACY ON TEST SET:  0.125
 F1-SCORE ON TRAIN SET: 0.4516129032258064
 F1-SCORE ON TEST SET:  0.125


[ DecisionTreeClassifier ]
 CLF TIME:              0.0015454292297363281
 PREDICTION TIME:       0.0018198490142822266
 ACCURACY ON TRAIN SET: 1.0
 ACCURACY ON TEST SET:  0.625
 F1-SCORE ON TRAIN SET: 1.0
 F1-SCORE ON TEST SET:  0.625


[ GaussianNB ]
 CLF TIME:              0.00230560302734375
 PREDICTION TIME:       0.0020112991333007812
 ACCURACY ON TRAIN SET: 0.7903225806451613
 ACCURACY ON TEST SET:  0.6875
 F1-SCORE ON TRAIN SET: 0.7903225806451613
 F1-SCORE ON TEST SET:  0.6875


[ RandomForestClassifier ]
 CLF TIME:              0.008571100234985352
 PREDICTION TIME:       0.003151607513427734
 ACCURACY ON TRAIN SET: 0.967741935483871
 ACCURACY ON TEST SET:  0.7375
 F1-SCORE ON TRAIN SET: 0.967741935483871
 F1-SCORE ON 

### FEATURE IMPORTANCE

As you can see from the weights listed below, a socioeconomic indicator __PAWHS__ (Percent Aged 25+ without High School Diploma) seem to be the most influential factor contributing to a neighborhood's income level, followed by __PHBP__ (Percent Households Below Poverty), and __PAUN__ (Percent Aged 16+ Unemployed).

In [238]:
model = RandomForestClassifier().fit(X_train, y_train)
feat_imp = pd.DataFrame({'features': X_train.columns, 'weight': model.feature_importances_})
feat_imp.sort_values(by='weight', ascending=False)

Unnamed: 0,features,weight
3,PAWHS,0.281223
1,PHBP,0.237257
2,PAUN,0.19367
4,PAU18,0.140166
5,CC,0.075423
0,PHC,0.07226


### HYPERPARAMETER TUNING
After finding the best classifier for our datasets, the team tried to optimize the performance of the Random Forest Classifier by finding the right configuration. Using GridSearchCV, we were able to find the best classifier model parameters.

In [235]:
def reportResults(pred, prediction):
    accu_score = accuracy_score(y_test, prediction)
    f_score = fbeta_score(y_test, prediction, beta=0.5, average='micro')
    
    print ("{}: \n ACCURACY SCORE: {} \n F1-SCORE: {}".format(pred, accu_score, f_score))

clf = RandomForestClassifier()

score = make_scorer(fbeta_score, beta=0.5, average="micro")
grid = GridSearchCV(clf, {'n_estimators': [5, 10, 15], 'max_features': [4, 5, 6, None]}, scoring=score)
grid = grid.fit(X_train, y_train)
new_clf = grid.best_estimator_

pred = (clf.fit(X_train, y_train)).predict(X_test)
new_pred = new_clf.predict(X_test)

reportResults("[OLD MODEL]", pred)
reportResults("[NEW MODEL]", new_pred)

[OLD MODEL]: 
 ACCURACY SCORE: 0.6875 
 F1-SCORE: 0.6875
[NEW MODEL]: 
 ACCURACY SCORE: 0.8125 
 F1-SCORE: 0.8125


### USING THE PREDICTING MODEL

Give inputs in the following order:
1. __PHC__ - Percent of Housing Crowded
2. __PHBP__ - Percent Households Below Poverty
3. __PAUN__ - Percent Aged 16+ Unemployed
4. __PAWHS__ - Percent Aged 25+ without High School Diploma
5. __PAU18__ - Percent Aged Under 18 or Over 64
6. __CC__ - Crime Count

In [336]:
labels = ['TOP 20%', 'UPPER 20%', 'MID 20%', 'LOWER 20%', 'BOTTOM 20%']

data = [ [30, 20, 30, 30, 30, 6000],   # Community 1
         [20, 20, 25, 25, 35, 10000],  # Community 2
         [15, 12, 9, 5, 44, 100]       # Community 3
       ]

for community, income_level in enumerate(new_clf.predict(data)):
   print( "Community {} is predicted to be in the income level of {}.".format(community, labels[community]) )

Community 0 is predicted to be in the income level of TOP 20%.
Community 1 is predicted to be in the income level of UPPER 20%.
Community 2 is predicted to be in the income level of MID 20%.
