# Introduction

In this exercise we aim to build a predictive model that receives an input array of city statistics and outputs a prediction for “nonViolPerPop”, the total number of non-violent crimes per 100K population in the city. We approached this problem by first reading in and cleaning the data and then building three predictive models: a Random Forest Regressor, a Lasso regression, and a Decision Tree Regressor.

In [57]:
import warnings
import shutup; shutup.please()
import pandas as pd
import numpy as np  
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.io import arff
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn import tree


In [31]:
# Read the raw data; the file has no header row
df = pd.read_csv('data/crime/CommViolPredUnnormalizedData.txt', encoding='latin-1', header=None)

# The column names are the second word in each row of data/crime/crime_headings.txt
with open('data/crime/crime_headings.txt') as f:
    headings = f.readlines()
col_names = []
types = []
for heading in headings:
    if len(heading.split()) <= 1:
        continue
    col_names.append(heading.split()[1])
    if heading.split()[2] == 'numeric':
        types.append(float)
    else:
        types.append(str)

df.columns = col_names

# Mark "?" placeholders as missing values (NaN)
df = df.replace('?', np.nan)

df = df.astype(dict(zip(col_names, types)))

#communityname, countyCode, communityCode, fold are not predictive so drop them
df = df.drop(['communityname', 'countyCode', 'communityCode', 'fold'], axis=1)
df.head()







| | State | pop | perHoush | pctBlack | pctWhite | pctAsian | pctHisp | pct12-21 | pct12-29 | pct16-24 | ... | burglaries | burglPerPop | larcenies | larcPerPop | autoTheft | autoTheftPerPop | arsons | arsonsPerPop | violentPerPop | nonViolPerPop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NJ | 11980.0 | 3.1 | 1.37 | 91.78 | 6.5 | 1.88 | 12.47 | 21.44 | 10.93 | ... | 14.0 | 114.85 | 138.0 | 1132.08 | 16.0 | 131.26 | 2.0 | 16.41 | 41.02 | 1394.59 |
| 1 | PA | 23123.0 | 2.82 | 0.8 | 95.57 | 3.44 | 0.85 | 11.01 | 21.3 | 10.48 | ... | 57.0 | 242.37 | 376.0 | 1598.78 | 26.0 | 110.55 | 1.0 | 4.25 | 127.56 | 1955.95 |
| 2 | OR | 29344.0 | 2.43 | 0.74 | 94.33 | 3.43 | 2.35 | 11.36 | 25.88 | 11.01 | ... | 274.0 | 758.14 | 1797.0 | 4972.19 | 136.0 | 376.3 | 22.0 | 60.87 | 218.59 | 6167.51 |
| 3 | NY | 16656.0 | 2.4 | 1.7 | 97.35 | 0.5 | 0.7 | 12.55 | 25.2 | 12.19 | ... | 225.0 | 1301.78 | 716.0 | 4142.56 | 47.0 | 271.93 | NaN | NaN | 306.64 | NaN |
| 4 | MN | 11245.0 | 2.76 | 0.53 | 89.16 | 1.17 | 0.52 | 24.46 | 40.53 | 28.69 | ... | 91.0 | 728.93 | 1060.0 | 8490.87 | 91.0 | 728.93 | 5.0 | 40.05 | NaN | 9988.79 |


# Dataset Cleaning

We read in the data using pandas read_csv in Python. The original data set did not include column names, so we read in the column names from a separate file and inserted those into our data frame. We did some data preprocessing and formatting of columns and dropped columns that are not predictive.

In [32]:
#useful functions
def drop_rows_missing_target(df):
    return df.dropna(subset=['nonViolPerPop'])


def fill_with_mean(df):
    for column in df.columns:
        if df[column].dtype == float:
            df[column] = df[column].fillna(df[column].mean())
    return df

def fill_with_median(df):
    for column in df.columns:
        if df[column].dtype == float:
            df[column] = df[column].fillna(df[column].median())
    return df

def convert_categorical_to_numeric(df):
    for column in df.columns:
        if df[column].dtype != float:
            df[column] = df[column].astype('category')
            df[column] = df[column].cat.codes
    return df

def normalize(df):
    for column in df.columns:
        if df[column].dtype == float:
            df[column] = (df[column] - df[column].mean()) / df[column].std()
    return df

def remove_random_features(df, n):
    dropped_cols = np.random.choice(df.columns[:-1], n, replace=False)
    return df.drop(dropped_cols, axis=1), dropped_cols


Our goal is to handle the missing values in the dataset as effectively as possible. We first dropped the columns with more than 90% missing values. We then dropped all rows with a missing value in our target, nonViolPerPop. Finally, we dropped all remaining columns that still contained missing values.
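The three cleaning steps described above can be sketched on a toy frame. This is a minimal illustration, not the notebook's own code: the helper name `drop_sparse_columns`, the toy column names, and the 0.5 threshold are all assumptions chosen so that each step visibly removes something.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_missing_fraction):
    """Return `frame` without the columns whose share of NaNs exceeds the threshold."""
    missing_fraction = frame.isnull().mean()  # per-column share of missing values
    keep = missing_fraction[missing_fraction <= max_missing_fraction].index
    return frame[keep]

# Toy data (hypothetical values, not the crime data set)
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, np.nan, 1.0],
    'feature': [1.0, 2.0, np.nan, 4.0, 5.0],
    'target': [10.0, np.nan, 30.0, 40.0, 50.0],
})

step1 = drop_sparse_columns(toy, 0.5)     # drops 'mostly_missing' (80% NaN)
step2 = step1.dropna(subset=['target'])   # drops the row with a missing target
step3 = step2.dropna(axis=1)              # drops 'feature', which still has a NaN
print(step1.shape, step2.shape, step3.shape)
```

Computing the missing fraction once with `isnull().mean()` avoids reassigning the frame inside a loop, and printing the shape after each step makes it easy to see which step removed what.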

In [60]:
# function to drop the columns with more than 90% missing values
print(df.shape)
def drop_missing_values(df):
    for column in df.columns:
        if df[column].isnull().sum() > 0.9*len(df):
            df = df.drop(column, axis=1)
    return df
# drop the columns with more than 90% missing values
df = drop_missing_values(df)
print(df.shape)
# drop the rows with missing values from the nonViolPerPop column
df = drop_rows_missing_target(df)
print(df.shape)
#drop the remaining columns with missing values
df = df.dropna(axis=1)
print(df.shape)


(2118, 121)
(2118, 121)
(2118, 121)
(2118, 121)


# Random Forest Regressor

A Random Forest Regressor (RFR) is a predictive model that, according to sklearn’s website, “fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.” In other words, an RFR takes an ensemble approach: it trains many randomized decision trees and averages their predictions, so the forest is usually more accurate and less prone to over-fitting than any single tree.
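The averaging step can be checked directly. The sketch below uses synthetic data from `make_regression` (an assumption, not the crime data set) and compares the forest's prediction with a hand-computed mean over its individual trees, which scikit-learn exposes as `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Average the predictions of the individual trees by hand
per_tree = np.stack([t.predict(X) for t in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should match the manual average
print(np.allclose(manual_average, forest.predict(X)))
```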

A fitted Random Forest Regressor also supports a simple form of feature selection. Once the model has been trained on the entire dataset, it exposes an attribute, “feature_importances_”, that scores how much each feature contributed to the trees’ splits. Using these importances, we kept the features that were most important in predicting the output (those with an importance above 0.01) and dropped the other features from our model.
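As a rough illustration of importance-based selection, the sketch below fits a forest on synthetic data in which only one column drives the target, then keeps the columns whose importance exceeds a cut-off. The column names and data are hypothetical; the 0.01 threshold mirrors the one used in this notebook's `feature_selection` function.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical features: one informative column and three pure-noise columns
toy = pd.DataFrame(rng.normal(size=(300, 4)),
                   columns=['signal', 'noise_a', 'noise_b', 'noise_c'])
target = 5.0 * toy['signal'] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(toy, target)
importances = pd.Series(forest.feature_importances_, index=toy.columns)

threshold = 0.01  # same cut-off as in feature_selection
kept = importances[importances > threshold].index.tolist()
print(importances.sort_values(ascending=False))
print('Kept columns:', kept)
```

The importances are normalized to sum to one, so a fixed threshold such as 0.01 becomes stricter in relative terms as the number of features grows.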

Our RFR produced impressive results both with and without this feature selection step. The results can be seen below.


In [27]:

def feature_selection(df, features_as_targets):
    # Drop only the target column, or the last 18 columns (the target plus the other response columns)
    if features_as_targets:
        X = df.drop(df.columns[-1:], axis=1)
    else:
        X = df.drop(df.columns[-18:], axis=1)
        
    y = df['nonViolPerPop']
    rf = RandomForestRegressor(n_estimators=100)
    rf.fit(X, y)
    importance = rf.feature_importances_
    # print(np.sort(importance))
    indices = np.argsort(importance)[::-1]
    kept_cols = []
    for f in range(X.shape[1]):
        if importance[indices[f]] > 0.01:
            kept_cols.append(X.columns[indices[f]])
    #put nonViolPerPop back in
    kept_cols.append('nonViolPerPop')
    return df[kept_cols], kept_cols


def fine_tune_features_approach(df, features_as_targets = False):
    black_box_df = fill_with_mean(normalize(drop_rows_missing_target(convert_categorical_to_numeric(df))))

    black_box_df, kept_cols = feature_selection(black_box_df, features_as_targets)

    train, test = train_test_split(black_box_df, test_size=0.2)

    X_train = train.drop(train.columns[-1], axis=1)
    y_train = train['nonViolPerPop']
    X_test = test.drop(test.columns[-1], axis=1)
    y_test = test['nonViolPerPop']

    # Create the model with 100 trees
    model = RandomForestRegressor(n_estimators=100,
                                    bootstrap = True,
                                    max_features = 'sqrt')
    # Fit on training data

    model.fit(X_train, y_train)

    # Predict the target for the held-out test set
    rf_predictions = model.predict(X_test)

    #calculate mae
    mae = np.mean(abs(rf_predictions - y_test))
    print('Mean Absolute Error:', mae)
    print('Kept columns:', kept_cols)
    print()
print('Using targets as features:')
fine_tune_features_approach(df, True)
print('\n\nNot using targets as features:')
fine_tune_features_approach(df, False)

Using targets as features:
Mean Absolute Error: 0.06659212833266745
Kept columns: ['larcPerPop', 'burglPerPop', 'autoTheftPerPop', 'nonViolPerPop']



Not using targets as features:
Mean Absolute Error: 0.49681147064804626
Kept columns: ['pctKids2Par', 'pct2Par', 'pctAllDivorc', 'pct12-17w2Par', 'rentLowQ', 'pctPopDenseHous', 'pctMaleDivorc', 'persHomeless', 'pctHousOccup', 'houseVacant', 'pctFemDivorc', 'persPerRenterOccup', 'persPoverty', 'kidsBornNevrMarr', 'nonViolPerPop']



In [28]:
# https://www.geeksforgeeks.org/random-forest-regression-in-python/

def random_forest_approach(df, features_as_targets = False, do_normalize = False):
    for i in range(1):
        black_box_df = df.copy()

        if not features_as_targets:
            # Note: this drops 16 columns (positions -17 to -2); the other cells drop df.columns[-18:] instead
            black_box_df = black_box_df.drop(black_box_df.columns[-17:-1], axis=1)
        if do_normalize:
            black_box_df = fill_with_mean(normalize(drop_rows_missing_target(convert_categorical_to_numeric(black_box_df))))
        else:
            black_box_df = fill_with_mean(drop_rows_missing_target(convert_categorical_to_numeric(black_box_df)))


        train, test = train_test_split(black_box_df, test_size=0.2)
        X_train = train.drop(train.columns[-1], axis=1)
        y_train = train['nonViolPerPop']
        X_test = test.drop(test.columns[-1], axis=1)
        y_test = test['nonViolPerPop']

        # Create the model with 100 trees
        model = RandomForestRegressor(n_estimators=100,
                                        bootstrap = True,
                                        max_features = 'sqrt')
        # Fit on training data
        model.fit(X_train, y_train)

        # Predict the target for the held-out test set
        rf_predictions = model.predict(X_test)

        #calculate mae
        mae = np.mean(abs(rf_predictions - y_test))
        print('Mean Absolute Error:', mae)

        
        
print('Using targets as features (normalize):')
random_forest_approach(df, True, True)
print('\n\nNot using targets as features (normalize):')
random_forest_approach(df, False, True)
print('\n\nUsing targets as features (no normalize):')
random_forest_approach(df, True, False)
print('\n\nNot using targets as features (no normalize):')
random_forest_approach(df, False, False)




Using targets as features (normalize):
Mean Absolute Error: 0.23966561863969835


Not using targets as features (normalize):
Mean Absolute Error: 0.48363125581579414


Using targets as features (no normalize):
Mean Absolute Error: 514.6430143867925


Not using targets as features (no normalize):
Mean Absolute Error: 1267.8243113207548


# Lasso

A Lasso regression is a linear regression with an L1 penalty on its coefficients, which lets it “select” the best factors for the model. The strength of the penalty is set by a parameter, usually written lambda (called alpha in scikit-learn). The penalty forces the coefficients of factors that have little to no effect on the response variable to exactly zero. Factors that have non-zero coefficients are “selected” and have an effect on the model.

We fit four different Lasso models using different strategies. Two models included the other response variables in the dataset as factors, and two excluded them. Within each pair, one model used normalized data and the other did not.

We used LassoCV from scikit-learn to find the best value of the penalty parameter and then fit our models. Each model had differing results. The models that included other response variables as factors performed the best, with low mean absolute error and high R-squared values. Those models have leakage, however, which skews our results. Without targets as features, our results were more realistic but not as good: the R-squared values were much lower and the mean absolute error was much higher. When we normalized our data, the performance of our models improved substantially. All selected factors are shown below; they include whitePerCap, medFamIncome, and persPerOccupHous. It was interesting to see the different variables the models selected and the different coefficients they had.
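The selection effect of the penalty can be seen on synthetic data. In the sketch below only the first two of ten columns influence the response; the data, the coefficients 3 and -2, and the fixed `alpha=0.5` are assumptions made for illustration (the notebook itself chooses alpha with LassoCV).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
# Only the first two columns influence the response
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# A fixed, fairly strong penalty (illustrative value)
model = Lasso(alpha=0.5).fit(X, y)

selected = np.flatnonzero(model.coef_)  # indices of non-zero coefficients
print('coefficients:', np.round(model.coef_, 3))
print('selected feature indices:', selected)
```

The surviving coefficients are shrunk toward zero relative to the values used to generate the data, which is the price the L1 penalty pays for setting the irrelevant ones to exactly zero.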


In [29]:
#doing lasso 
from sklearn.linear_model import LassoCV

def lasso(df, features_as_targets, do_normalize):
    if do_normalize:
        lasso_df = fill_with_mean(normalize(drop_rows_missing_target(convert_categorical_to_numeric(df))))
    else:
        lasso_df = fill_with_mean(drop_rows_missing_target(convert_categorical_to_numeric(df)))



    if features_as_targets:
        X = lasso_df.drop(lasso_df.columns[-1:], axis=1)
    else:
        X = lasso_df.drop(lasso_df.columns[-18:], axis=1)
    y = lasso_df['nonViolPerPop']

    alpha_predict = LassoCV(cv=5, random_state=0, max_iter=10000)
    alpha_predict.fit(X, y)

    model = Lasso(alpha=alpha_predict.alpha_)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = np.mean(abs(y_pred - y_test))
    print('Mean Absolute Error:', mae)
    # Map each feature name to its coefficient (keying by name avoids collisions between equal coefficients)
    coef_dict = dict(zip(X.columns, model.coef_))
    # Sort by absolute coefficient value, largest first
    coef_dict = dict(sorted(coef_dict.items(), key=lambda item: abs(item[1]), reverse=True))

    print('kept Columns:', [f'{name}: {coef:.5f}' for name, coef in coef_dict.items() if abs(coef) > 0.00001])
    #print R squared
    print('R squared:', model.score(X_test, y_test))




print('Using targets as features (normalized):')
lasso(df, True, True)
print('\n\nNot using targets as features (normalized):')
lasso(df, False, True)
print('\n\nUsing targets as features (not normalized):')
lasso(df, True, False)
print('\n\nNot using targets as features (not normalized):')
lasso(df, False, False)






Using targets as features (normalized):
Mean Absolute Error: 0.0010973321789923415
kept Columns: ['larcPerPop: 0.69865', 'burglPerPop: 0.27935', 'autoTheftPerPop: 0.18385', 'arsonsPerPop: 0.01365', 'State: -0.00001']
R squared: 0.9999975761617778


Not using targets as features (normalized):
Mean Absolute Error: 0.46124457718071876
kept Columns: ['persPerOccupHous: 0.32557', 'pctWsocsec: 0.32228', 'pctForeignBorn: 0.29128', 'pctMaleDivorc: 0.25351', 'pctEmploy: 0.25221', 'pctImmig-10: -0.24941', 'pctMaleNevMar: 0.23886', 'whitePerCap: 0.22923', 'pctPoverty: 0.21450', 'pctKids2Par: -0.15475', 'persPerOwnOccup: -0.13792', 'rentLowQ: -0.13377', 'perHoush: -0.13326', 'pctLowEdu: -0.12557', 'medYrHousBuilt: 0.12169', 'pctOccupMgmt: 0.11420', 'pctKidsBornNevrMarr: 0.10331', 'pctFgnImmig-5: -0.09419', 'pctFgnImmig-8: 0.09031', 'ownHousUperQ: -0.08896', 'pctRetire: -0.08769', 'pctSmallHousUnits: -0.08751', 'pctCollGrad: -0.08698', 'medOwnCostpct: -0.08124', 'kidsBornNevrMarr: -0.08069', 'perCa

# Decision Tree

A Decision Tree is a non-parametric supervised learning method. Sklearn says a decision tree “predicts the value of a target variable by learning simple decision rules inferred from the data features”. We used the default DecisionTreeRegressor, which chooses splits by minimizing squared error (Gini impurity is the default for classification trees, not regression trees). During training, the tree evaluates candidate splits on the features and picks the one that most reduces the error. This continues recursively until a full decision tree is made. When we run our predictions, each sample is passed down the tree and assigned the mean target value of the leaf it reaches.
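A minimal sketch of how a regression tree splits and predicts, using four hypothetical data points and a depth limit of one (both assumptions made for illustration). It prints the split criterion so it can be checked against the installed scikit-learn version, then shows that each prediction is the mean target of the leaf the sample falls into.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Four hypothetical samples with one feature
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 10.0, 12.0])

# A depth-1 tree makes a single split, here between x=1 and x=2
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

print(stump.get_params()['criterion'])          # split criterion in use
print(stump.predict(np.array([[0.0], [3.0]])))  # leaf means: 2.0 and 11.0
```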

With our Decision Tree we saw impressive results when we used the targets as features, but noticeably worse results when we did not. The results can be seen below.


In [58]:
def dec_tree(df, features_as_targets, do_normalize):
    if do_normalize:
        dec_df = fill_with_mean(normalize(drop_rows_missing_target(convert_categorical_to_numeric(df))))
    else:
        dec_df = fill_with_mean(drop_rows_missing_target(convert_categorical_to_numeric(df)))

    if features_as_targets:
        X = dec_df.drop(dec_df.columns[-1:], axis=1)
    else:
        X = dec_df.drop(dec_df.columns[-18:], axis=1)
    y = dec_df['nonViolPerPop']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    dtree = tree.DecisionTreeRegressor()
    dtree = dtree.fit(X_train, y_train)
    preds = dtree.predict(X_test)

    mae = np.mean(abs(preds - y_test))
    print('Mean Absolute Error:', mae)

print('Using targets as features (normalized):')
dec_tree(df, True, True)
print('\n\nNot using targets as features (normalized):')
dec_tree(df, False, True)
print('\n\nUsing targets as features (not normalized):')
dec_tree(df, True, False)
print('\n\nNot using targets as features (not normalized):')
dec_tree(df, False, False)


Using targets as features (normalized):
Mean Absolute Error: 0.1094720955465372


Not using targets as features (normalized):
Mean Absolute Error: 0.6719314177846643


Using targets as features (not normalized):
Mean Absolute Error: 324.0136320754718


Not using targets as features (not normalized):
Mean Absolute Error: 1918.6865566037736


# Conclusion

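This exercise was very helpful in understanding how to create explainable models, do feature engineering, and make predictions on a large dataset. When not using other targets as features, our best performing models on normalized data were the Lasso regression (mean absolute error of 0.461) and the Random Forest Regressor (0.484), both well ahead of the Decision Tree (0.672). Each figure comes from a single random train/test split, so the small gap between the first two should be read with caution.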
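The normalized and unnormalized errors reported in this notebook are on different scales, because `normalize` standardizes every float column, including the target. The sketch below, on hypothetical numbers, shows that an MAE measured on a standardized target equals the original-unit MAE divided by the target's standard deviation, so the two should not be compared directly.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(loc=5000.0, scale=2500.0, size=1000)  # hypothetical target values
y_pred = y_true + rng.normal(scale=800.0, size=1000)      # hypothetical predictions

mean, std = y_true.mean(), y_true.std(ddof=1)  # pandas .std() also uses ddof=1

mae_original = np.mean(np.abs(y_pred - y_true))
mae_standardized = np.mean(np.abs((y_pred - mean) / std - (y_true - mean) / std))

# Multiplying by the standard deviation recovers the original-unit error
print(mae_original, mae_standardized * std)
```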

Given more time, we think it would be interesting to implement a more opaque, black-box model and see how its prediction performance compares to that of our models. This would help us further understand how much performance improvement can be gained at the expense of explainability.

In addition, we would be interested in trying out different hyperparameters on our models, perhaps with a grid search, so we could truly optimize for both performance AND explainability.  
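One possible shape for such a grid search is sketched below with scikit-learn's `GridSearchCV`. The synthetic data and the small parameter grid are assumptions for illustration; a real search on the crime data would need a grid chosen for that problem.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem, for illustration only
X, y = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)

param_grid = {'n_estimators': [10, 30], 'max_depth': [3, None]}  # illustrative grid
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring='neg_mean_absolute_error',  # scores are maximized, so MAE is negated
    cv=3,
)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Cross-validated MAE:', -search.best_score_)
```

Because the score is averaged over cross-validation folds, it is less sensitive to a lucky or unlucky split than the single train/test split used in the cells above.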
