<h2>Crime Rate Prediction - Open Problem</h2>

This problem uses a dataset that combines the 1990 US Census with 1995 crime records for communities and counties across the United States. It contains more than 100 features, ranging from the community name, racial profiles and household data to police deployment data, and it records the number of crimes per 100k population. Given this dataset, you are tasked with designing an accurate predictive model that estimates crime levels in American communities. <b>The following attributes are non-predictive and should not be considered: community_name, state, countyCode, communityCode and fold</b>. A full description of the dataset can be found at the link below: https://archive.ics.uci.edu/dataset/183/communities+and+crime


You are tasked with predicting the <b>amount of violent crime per population</b> after careful estimation of the relevant features.

In [5]:
import pandas as pd
df = pd.read_csv('db/crime_data.csv')
df.head()

Unnamed: 0,Êcommunityname,state,countyCode,communityCode,fold,population,householdsize,racepctblack,racePctWhite,racePctAsian,...,burglaries,burglPerPop,larcenies,larcPerPop,autoTheft,autoTheftPerPop,arsons,arsonsPerPop,ViolentCrimesPerPop,nonViolPerPop\n
0,BerkeleyHeightstownship,NJ,39,5320,1,11980,3.1,1.37,91.78,6.5,...,14,114.85,138,1132.08,16,131.26,2,16.41,41.02,1394.59\n
1,Marpletownship,PA,45,47616,1,23123,2.82,0.8,95.57,3.44,...,57,242.37,376,1598.78,26,110.55,1,4.25,127.56,1955.95\n
2,Tigardcity,OR,?,?,1,29344,2.43,0.74,94.33,3.43,...,274,758.14,1797,4972.19,136,376.3,22,60.87,218.59,6167.51\n
3,Gloversvillecity,NY,35,29443,1,16656,2.4,1.7,97.35,0.5,...,225,1301.78,716,4142.56,47,271.93,?,?,306.64,?\n
4,Bemidjicity,MN,7,5068,1,11245,2.76,0.53,89.16,1.17,...,91,728.93,1060,8490.87,91,728.93,5,40.05,?,9988.79\n


In [None]:
df.info()

<b>Step 1: Load your dataset and clean any impurities in the data</b>

In [None]:
Nrec = len(df)  # number of records
Natts = len(df.columns)  # number of attributes

# Columns 0-2 hold the row index, community name and state (text), so they are
# left untouched; converting 'state' would turn every value into NaN.
# In the remaining columns '?' marks a missing value and is coerced to NaN.
for col in df.columns[3:]:
    # assigning by column name replaces the column, so the numeric dtype sticks
    df[col] = pd.to_numeric(df[col], errors='coerce')
df.head()

In [None]:
df.info()

In [None]:
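# isnull().sum().sum() counts missing cells; isnull().any().sum() would only
# count the columns that contain at least one missing value
print("Number of missing values:", df.isnull().sum().sum())
df.dropna(inplace=True)  # drops every row that has at least one missing value

print("Number of missing values after preprocessing:", df.isnull().sum().sum())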
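
The cleaning step above can also be done at load time. The sketch below is a minimal illustration on a made-up three-row CSV (the values and the reduced column set are assumptions, not the real file): `na_values="?"` lets `pd.read_csv` parse the placeholder as NaN, and `dropna(subset=...)` drops rows only when the target is missing, which may keep more records than dropping on any missing cell.

```python
import io
import pandas as pd

# Toy CSV standing in for db/crime_data.csv (illustrative values only).
raw = io.StringIO(
    "communityname,state,countyCode,population,ViolentCrimesPerPop\n"
    "A,NJ,39,11980,41.02\n"
    "B,OR,?,29344,218.59\n"
    "C,MN,7,11245,?\n"
)

# na_values="?" turns the placeholder into NaN while parsing, so numeric
# columns get a numeric dtype without a manual conversion loop.
toy = pd.read_csv(raw, na_values="?")

# Missing cells per column, then in total.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

# Drop rows only when the target is missing; a gap in a non-predictive
# column such as countyCode does not remove the record.
cleaned = toy.dropna(subset=["ViolentCrimesPerPop"])
print(total_missing, len(cleaned))
```

Whether to drop rows, drop sparse columns, or impute is a modelling choice; this only shows the mechanics.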

<b>Step 2: Identify the target variable and the potential predictive features</b>

In [None]:
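# strip the stray newline from the last column name (it holds non-violent crime)
df.rename(columns={'nonViolPerPop\n': 'nonViolPerPop'}, inplace=True)
# start at column 6 so that 'fold' (column 5, non-predictive) is excluded;
# note that this positional range still contains other crime columns
# (e.g. burglaries, larcenies), which leak information about the target
X = df.iloc[:, 6:145]
# the target is violent crime per population, selected by name
y = df[['ViolentCrimesPerPop']]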
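
Positional slices are easy to get wrong when the column layout changes, so selecting by name can be safer. The sketch below uses a toy frame with a few of the column names visible in `df.head()`; the values, the reduced column set and the `other_outcomes` list are assumptions for illustration.

```python
import pandas as pd

# Toy frame with a handful of the real column names; values are made up.
toy = pd.DataFrame({
    "communityname": ["A", "B", "C"],
    "state": ["NJ", "PA", "OR"],
    "fold": [1, 1, 2],
    "population": [11980, 23123, 29344],
    "householdsize": [3.10, 2.82, 2.43],
    "burglaries": [14, 57, 274],
    "ViolentCrimesPerPop": [41.02, 127.56, 218.59],
    "nonViolPerPop": [1394.59, 1955.95, 6167.51],
})

target = "ViolentCrimesPerPop"
non_predictive = ["communityname", "state", "fold"]
# Other crime outcomes are excluded on the assumption that they would not be
# known at prediction time; keeping or dropping them is a modelling decision.
other_outcomes = ["burglaries", "nonViolPerPop"]

# errors="ignore" skips any listed name that is absent from the frame.
X_toy = toy.drop(columns=non_predictive + other_outcomes + [target], errors="ignore")
y_toy = toy[target]
print(list(X_toy.columns))
```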

In [2]:
X.head()


In [6]:
df.to_csv('crime_data_curated.csv')

In [7]:
y.head()


<b>Step 3: Make use of the feature importance score of your choice to remove the least potent features</b>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

In [None]:
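from sklearn.feature_selection import SelectKBest, mutual_info_regression

N_perc = 0.6  # keep 60% of the features
# k must not exceed the number of columns in X, so it is based on X rather than Natts
featSelector = SelectKBest(mutual_info_regression, k=int(N_perc * X.shape[1]))
X_filtered = featSelector.fit_transform(X, y.values.ravel())  # the scorer expects a 1-D target
mask = featSelector.get_support()  # boolean mask of the selected columns
filtered_feats = X.columns[mask]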

In [None]:
print('Number of selected features:',len(filtered_feats))
filtered_feats
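
To see what the filter step is doing, it can help to run it on data where the answer is known. The sketch below is an illustration on synthetic data (the column names `a` to `e`, the sample size and the 40% fraction are assumptions): the target depends only on `a` and `c`, and the mutual information scores are inspected alongside the selected names.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)
n_samples = 500
X_toy = pd.DataFrame(rng.normal(size=(n_samples, 5)), columns=list("abcde"))
# Only columns "a" and "c" carry signal; the rest are pure noise.
y_toy = 3.0 * X_toy["a"] + 3.0 * X_toy["c"] + rng.normal(scale=0.1, size=n_samples)

keep_fraction = 0.4
k = max(1, int(keep_fraction * X_toy.shape[1]))  # 2 of the 5 columns

# Fixing random_state makes the nearest-neighbour MI estimate repeatable.
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectKBest(score_func, k=k).fit(X_toy, y_toy)

# Scores per column, highest first, and the names that survive the filter.
scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
selected = list(X_toy.columns[selector.get_support()])
print(scores.round(3))
print(selected)
```

Looking at the sorted scores before fixing `k` is one way to choose the cut-off from the data rather than from a fixed percentage.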

<b>Step 4: Normalise the feature vector</b>

In [None]:
from sklearn.preprocessing import StandardScaler

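Sc = StandardScaler()
X_sc = Sc.fit_transform(X)  # features are standardised (zero mean, unit variance)
# keep the original row labels so that X_sc stays aligned with y after dropna
X_sc = pd.DataFrame(X_sc, columns=X.columns, index=X.index)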

<b>Step 5: Split the datasets into Training and Test using 80/20 proportion</b>

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test  = train_test_split(X_sc,y, test_size=0.2, random_state=1234)
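
Fitting the scaler on the full dataset before splitting lets test-set statistics influence the training features. A common alternative, sketched below on synthetic data (array shapes and values are assumptions), is to split first and fit the scaler on the training portion only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1234)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = X_toy[:, 0] + rng.normal(size=200)

# 80/20 split first, so the test rows play no part in the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234
)

# Mean and standard deviation are learned from the training rows only...
scaler = StandardScaler().fit(X_tr)
X_tr_sc = scaler.transform(X_tr)
# ...and then applied unchanged to the test rows.
X_te_sc = scaler.transform(X_te)
print(X_tr_sc.shape, X_te_sc.shape)
```

The scaled test set will not have exactly zero mean and unit variance, which is expected.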

<b>Step 6: List your models of interest for the training phase</b>

In [None]:
#-- (1) Linear Regression
#---(2) Ridge Regression
#-- (3) LASSO Regression
#-- (4) Decision Trees
#-- (5) Random Forest
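
One way to organise the five candidates is a dictionary of estimators with a matching dictionary of search grids. The names, seeds and grid values below are illustrative assumptions, not tuned recommendations.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# One untrained estimator per model of interest.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "lasso": Lasso(),
    "decision_tree": DecisionTreeRegressor(random_state=1234),
    "random_forest": RandomForestRegressor(random_state=1234),
}

# Hypothetical search grids; plain linear regression has nothing to tune.
param_grids = {
    "ridge": {"alpha": [0.1, 1.0, 10.0]},
    "lasso": {"alpha": [0.01, 0.1, 1.0]},
    "decision_tree": {"max_depth": [None, 10], "min_samples_leaf": [1, 2, 3, 4, 5]},
    "random_forest": {"n_estimators": [100, 300], "max_depth": [None, 10]},
}

for name, model in models.items():
    print(name, "->", sorted(param_grids.get(name, {})))
```

Keeping the models in one structure makes it straightforward to loop the tuning and evaluation steps over all of them.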

<b>Step 7: Set aside 20% of your training set to build a validation set for hyperparameter tuning, or make use of k-fold cross-validation (k=5)</b>

<img src="https://miro.medium.com/v2/resize:fit:1400/1*0DKYwS627160j5YMu6aGXw.png" width="500px"/>

Go through this documentation: <br/>
<b>Cross-validation:</b> https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

<b>GridSearch Hyperparameter tuning:</b>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#examples-using-sklearn-model-selection-gridsearchcv


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

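dt_parameters = {'max_depth': [None, 10], 'min_samples_leaf': [1, 2, 3, 4, 5]}

mlr = LinearRegression()  # baseline model; it has no hyperparameters to tune
dt = DecisionTreeRegressor(random_state=1234)  # fixed seed so that results are repeatable
# 5-fold cross-validated grid search over the tree parameters, on the filtered features
dt_regression = GridSearchCV(dt, dt_parameters, cv=5)
dt_regression.fit(X_train.loc[:, filtered_feats], y_train)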

In [None]:
dt_regression.best_params_

In [None]:
dt_regression.best_score_
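
The two documentation links above cover different tools: `cross_validate` scores one fixed configuration across folds, while `GridSearchCV` repeats that for every configuration in a grid and keeps the best. The sketch below shows both on synthetic data from `make_regression`; the Ridge model and the alpha grid are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_validate

X_toy, y_toy = make_regression(
    n_samples=200, n_features=10, noise=5.0, random_state=0
)

# 5-fold cross-validation of a single configuration, with two metrics.
cv_results = cross_validate(
    Ridge(alpha=1.0),
    X_toy,
    y_toy,
    cv=5,
    scoring=("r2", "neg_root_mean_squared_error"),
)
mean_r2 = cv_results["test_r2"].mean()
# scikit-learn reports errors negated so that higher is always better.
mean_rmse = -cv_results["test_neg_root_mean_squared_error"].mean()

# Grid search: the same 5-fold procedure for each candidate alpha.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5, scoring="r2")
search.fit(X_toy, y_toy)
best_alpha = search.best_params_["alpha"]
print(round(mean_r2, 3), round(mean_rmse, 3), best_alpha)
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, so `search.best_estimator_` is ready to use for prediction.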

<b>Step 8: Retrain the models with the training dataset using the best hyperparameter configuration</b>

In [None]:
# Note: the grid search was run on filtered_feats, while this model is refit on all features
dt_regressor = DecisionTreeRegressor(**dt_regression.best_params_)
dt_regressor.fit(X_train, y_train)

<b>Step 9: Assess the performance of the models on the test set. Use the R2 score and RMSE as your metrics and provide goodness-of-fit scatter plots (i.e. on the test set results only)</b>

In [None]:
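import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt

y_pred_test = dt_regressor.predict(X_test)
r2_score_test_dt = metrics.r2_score(y_test, y_pred_test)
# RMSE is the square root of the mean squared error, in the units of the target
rmse_test_dt = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test))

plt.figure()
plt.scatter(y_test, y_pred_test, color="blue")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Test - DT: R2 = %.2f, RMSE = %.2f' % (r2_score_test_dt, rmse_test_dt))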
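
As a sanity check on the two metrics named in this step, the sketch below computes them by hand on four made-up values and compares the result with scikit-learn. The numbers are illustrative only; the residual sum of squares is 0.5 and the total sum of squares is 5.0, so R2 = 1 - 0.5/5.0 = 0.9 and RMSE = sqrt(0.5/4).

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.0])

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)          # 0.25 + 0 + 0.25 + 0 = 0.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 2.25 + 0.25 + 0.25 + 2.25 = 5.0
r2_manual = 1.0 - ss_res / ss_tot

# RMSE = sqrt(mean of squared residuals)
rmse_manual = np.sqrt(ss_res / len(y_true))

# The same quantities from scikit-learn.
r2_sklearn = metrics.r2_score(y_true, y_hat)
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_hat))
print(r2_manual, rmse_manual)
```

R2 is unitless, while RMSE is in the units of the target (here, crimes per 100k population), so the two are best reported together.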



<b>Step 10: Attempt performing feature selection using a wrapper-based method and assess the performance of the models obtained. </b>

Read the documentation below: 
https://scikit-learn.org/stable/modules/feature_selection.html

In [None]:
#--> all_features_model
#--> filtered_features_model

#---> run hyperparameter tuning for your model
#---> retrain the model using the best configuration

#---> test performance for all_features_model
#---> test performance for filtered_features_model


#(optional)
# wrapper-based selection on the filtered features or on all the features


In [None]:
from sklearn.feature_selection import SequentialFeatureSelector

dt = DecisionTreeRegressor()
sfs = SequentialFeatureSelector(dt, n_features_to_select=int(len(X_train.columns)*0.2))
sfs.fit(X_train, y_train)

mask = sfs.get_support()
mask

In [None]:
feats_feats = X_train.columns[mask]
feats_feats

In [None]:
from sklearn import metrics
import matplotlib.pyplot as plt

# refit the tuned tree on the features chosen by the sequential selector
dt_regressor.fit(X_train.loc[:, feats_feats], y_train)
y_pred_test = dt_regressor.predict(X_test.loc[:, feats_feats])
r2_score_test_dt = metrics.r2_score(y_test, y_pred_test)

plt.figure()
plt.scatter(y_test, y_pred_test, color="blue")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('R2 score - test - DT (SFS features): %.2f' % (r2_score_test_dt))
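
Wrapper methods refit the model many times, so it is worth trying them on a small problem first. The sketch below runs forward sequential selection on synthetic data; the linear model, the 8 features and the choice of 3 features to keep are assumptions made to keep the run fast.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# 8 features, of which 3 are informative (the first 3, since shuffle=False).
X_toy, y_toy = make_regression(
    n_samples=150,
    n_features=8,
    n_informative=3,
    noise=1.0,
    shuffle=False,
    random_state=0,
)

# Forward selection: start from no features and repeatedly add the one that
# most improves the 5-fold cross-validated score, until 3 are selected.
sfs_toy = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
sfs_toy.fit(X_toy, y_toy)

chosen = sfs_toy.get_support(indices=True)  # column indices of the kept features
X_reduced = sfs_toy.transform(X_toy)
print(chosen, X_reduced.shape)
```

Each forward step fits one model per remaining feature and per fold, so on the full crime dataset a decision tree with 20% of the features selected can take noticeably longer than the filter method in Step 3.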