**Preprocessing**

Data from: https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime

1. Data Cleaning
- Dealing with data types, e.g. making categorical data numeric by creating dummy features
- Handling missing data (imputing NA values instead of removing them)

2. Data Exploration
- Detecting outliers (Tukey IQR or kernel density estimation)
- Plotting distributions. Log transformation of data that is skewed (very long tails) can improve accuracy.
- Balancing the dataset to prevent the tree from being biased toward the classes that are dominant. Either sample an equal number of observations from each class, or normalise the sum of the sample weights for each class to the same value.

3. Feature Engineering
- Interactions between features
- Increasing dimensionality vs decreasing dimensionality
- SMOTE? (generates synthetic samples for the under-represented classes)

4. Feature Selection. Discard the least important variables to reduce noise. Good variables are often constructed using ratios, differences, averages of variables etc.
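
The sample-weight balancing mentioned under Data Exploration could look roughly like the sketch below. It uses scikit-learn's `compute_sample_weight` on a toy outcome; `X_toy` and `y_toy` are made-up values for illustration, not rows from the crime dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Toy imbalanced outcome (illustrative values only): six rows of class 0, two of class 1
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# 'balanced' gives each row the weight n_samples / (n_classes * size of its class)
weights = compute_sample_weight(class_weight='balanced', y=y_toy)

# After weighting, the weights of each class sum to the same value
weight_sum_low = weights[y_toy == 0].sum()
weight_sum_high = weights[y_toy == 1].sum()
print(weight_sum_low, weight_sum_high)

# The weights are passed to the tree when fitting
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_toy, y_toy, sample_weight=weights)
```

Passing `class_weight='balanced'` to `DecisionTreeClassifier` applies the same weighting without computing the weights by hand.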


In [None]:
import pandas as pd
from sklearn import tree
import numpy as np
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import matplotlib.image as pltimg


# Read data and assign NA to missing values
# header=None: the raw file is assumed to have no header row, so the first line is kept as data
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/00211/CommViolPredUnnormalizedData.txt',
                   sep=r'\s*,\s*', header=None, encoding='latin-1', engine='python', na_values=["?"])


data.columns = ['communityname','state','countyCode','communityCode','fold','population','householdsize','racepctblack',
           'racePctWhite','racePctAsian','racePctHisp','agePct12t21','agePct12t29','agePct16t24','agePct65up',
           'numbUrban','pctUrban','medIncome','pctWWage','pctWFarmSelf','pctWInvInc','pctWSocSec','pctWPubAsst',
           'pctWRetire','medFamInc','perCapInc','whitePerCap','blackPerCap','indianPerCap','AsianPerCap','OtherPerCap',
           'HispPerCap','NumUnderPov','PctPopUnderPov','PctLess9thGrade','PctNotHSGrad','PctBSorMore','PctUnemployed',
           'PctEmploy','PctEmplManu','PctEmplProfServ','PctOccupManu','PctOccupMgmtProf','MalePctDivorce',
           'MalePctNevMarr','FemalePctDiv','TotalPctDiv','PersPerFam','PctFam2Par','PctKids2Par','PctYoungKids2Par',
           'PctTeen2Par','PctWorkMomYoungKids','PctWorkMom','NumKidsBornNeverMar','PctKidsBornNeverMar','NumImmig',
           'PctImmigRecent','PctImmigRec5','PctImmigRec8','PctImmigRec10','PctRecentImmig','PctRecImmig5',
           'PctRecImmig8','PctRecImmig10','PctSpeakEnglOnly','PctNotSpeakEnglWell','PctLargHouseFam',
           'PctLargHouseOccup','PersPerOccupHous','PersPerOwnOccHous','PersPerRentOccHous','PctPersOwnOccup',
           'PctPersDenseHous','PctHousLess3BR','MedNumBR','HousVacant','PctHousOccup','PctHousOwnOcc','PctVacantBoarded',
           'PctVacMore6Mos','MedYrHousBuilt','PctHousNoPhone','PctWOFullPlumb','OwnOccLowQuart','OwnOccMedVal',
           'OwnOccHiQuart','OwnOccQrange','RentLowQ','RentMedian','RentHighQ','RentQrange','MedRent','MedRentPctHousInc',
           'MedOwnCostPctInc','MedOwnCostPctIncNoMtg','NumInShelters','NumStreet','PctForeignBorn','PctBornSameState',
           'PctSameHouse85','PctSameCity85','PctSameState85','LemasSwornFT','LemasSwFTPerPop','LemasSwFTFieldOps',
           'LemasSwFTFieldPerPop','LemasTotalReq','LemasTotReqPerPop','PolicReqPerOffic','PolicPerPop',
           'RacialMatchCommPol','PctPolicWhite','PctPolicBlack','PctPolicHisp','PctPolicAsian','PctPolicMinor',
           'OfficAssgnDrugUnits','NumKindsDrugsSeiz','PolicAveOTWorked','LandArea','PopDens','PctUsePubTrans',
           'PolicCars','PolicOperBudg','LemasPctPolicOnPatr','LemasGangUnitDeploy','LemasPctOfficDrugUn',
           'PolicBudgPerPop','murders','murdPerPop','rapes','rapesPerPop','robberies','robbbPerPop','assaults',
           'assaultPerPop','burglaries','burglPerPop','larcenies','larcPerPop','autoTheft','autoTheftPerPop','arsons',
           'arsonsPerPop','ViolentCrimesPerPop','nonViolPerPop']

print(data.head(5))

In [None]:
# Select the relevant columns to use in the model 
cols_final = ['population',
 'racepctblack',
 'agePct12t29',
 'numbUrban',
 'medIncome',
 'pctWWage',
 'pctWInvInc',
 'medFamInc',
 'perCapInc',
 'whitePerCap',
 'PctEmploy',
 'MalePctDivorce',
 'MalePctNevMarr',
 'TotalPctDiv',
 'PctKids2Par',
 'PctWorkMom',
 'NumImmig',
 'PctRecImmig8',
 'PctRecImmig10',
 'PctLargHouseOccup',
 'PersPerOccupHous',
 'PersPerRentOccHous',
 'PctPersOwnOccup',
 'PctPersDenseHous',
 'HousVacant',
 'PctHousOwnOcc',
 'OwnOccLowQuart',
 'OwnOccMedVal',
 'RentLowQ',
 'RentMedian',
 'MedRent',
 'MedOwnCostPctIncNoMtg',
 'NumStreet',
 'ViolentCrimesPerPop']
    
# drop all columns that are not in the required list
# (axis must be given by keyword; the positional form was removed in pandas 2.0)
data.drop(columns=data.columns.difference(cols_final), inplace=True)

# look at data again
data.describe()
data

In [None]:
# Take a look at the outcome variable i.e. crime
print(data['ViolentCrimesPerPop'].value_counts())


# need to transform this outcome into 0s and 1s, 0 for low crime, 1 for high crime.
# choose a suitable threshold: values below 795 are labelled low crime, although this is subjective.
# note: a missing (NaN) value fails the `x < 795` test and is therefore labelled 1.
data['ViolentCrimesPerPop'] = [0 if x < 795 else 1 for x in data['ViolentCrimesPerPop']]


# Then need to split up the features and outcomes
# So x as a data frame of features and y as a series of the outcome variable
x = data.drop('ViolentCrimesPerPop', axis=1)
y = data.ViolentCrimesPerPop


print('variables', x.head(5))
print('crime outcome', y.head(5))

In [None]:
# look at data again to see all variables and then y, the outcome, as 0s and 1s 
print(data.head(5))

In [None]:
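# use get_dummies in pandas to change categorical data to numerical
# ('communityname' is only available here if it was kept in cols_final above)
if 'communityname' in x.columns:
    print(pd.get_dummies(x['communityname']).head(5))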

**Data Cleaning**

Decide which categorical variables to use in the model.
Most models, including scikit-learn's decision trees, can only handle numerical features, so a categorical feature is transformed into a set of dummy features, each representing a unique category. In each dummy feature, 1 indicates that the observation belongs to that category and 0 that it does not, e.g. a 'female' dummy would be 1 for female and 0 for male.

Categories with low frequencies don't each need their own dummy feature; instead, they can be bucketed together as 'other'.
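
A minimal sketch of the bucketing idea, assuming a frequency cut-off chosen by hand. The helper `bucket_rare_categories`, its `min_count` parameter and the toy state codes are hypothetical and only for illustration.

```python
import pandas as pd

def bucket_rare_categories(series, min_count=2, other_label='other'):
    """Replace categories that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # keep the value where it is not rare, otherwise substitute the bucket label
    return series.where(~series.isin(rare), other_label)

# Toy categorical column: 'OR' and 'WY' each appear only once
states = pd.Series(['NJ', 'NJ', 'PA', 'PA', 'PA', 'OR', 'WY'], name='state')

bucketed = bucket_rare_categories(states, min_count=2)
dummies = pd.get_dummies(bucketed, prefix='state')
print(dummies.columns.tolist())
```

The two rare states end up sharing a single `state_other` column instead of getting one dummy each.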

In [None]:
# Check how many unique categories there are 
for col_name in x.columns:
    if x[col_name].dtypes == 'object':
        unique_cat = len(x[col_name].unique())
        print("Feature '{col_name}' has {unique_cat} unique categories".format(
            col_name=col_name, unique_cat=unique_cat))

In [None]:
# select the features which are not numeric
# (restricted to columns still present in x, since columns outside cols_final were dropped above)
todummy_list = [col for col in ['state', 'communityname'] if col in x.columns]

# Dummy all categorical variables used. Make them numeric and then missing values can be dealt with.
def dummy_df(data, todummy_list):
    for col in todummy_list:
        dummies = pd.get_dummies(data[col], prefix=col, dummy_na=False)
        data = data.drop(col, axis=1) # dropping the original feature
        data = pd.concat([data, dummies], axis=1) # adding the one to be used
    return data

x = dummy_df(x, todummy_list)
print(x.head(5))

**Removing items with missing data**

Most models can't handle missing data, so observations or features with missing values have to be either removed or filled in. Removing data is costly even if the values are missing at random, because a lot of data can be lost. Greater issues arise if the values are missing non-randomly, because the remaining data is then no longer representative of the whole population, which can introduce biases.

Imputation replaces missing values with another value, e.g. the mean, median or most frequent value of a given feature.
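
To make the three strategies concrete, here is a small sketch on a toy column with one missing value. The values and the names `toy` and `toy_imputer` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing value (illustrative values, not from the dataset)
toy = pd.DataFrame({'feature': [1.0, 2.0, 2.0, np.nan, 10.0]})

imputed = {}
for strategy in ['mean', 'median', 'most_frequent']:
    toy_imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = toy_imputer.fit_transform(toy)
    imputed[strategy] = filled[3, 0]  # the value that replaced the NaN in row 3
    print(strategy, imputed[strategy])
```

The large value 10 pulls the mean up, while the median and the most frequent value are unaffected by it, which is worth keeping in mind for skewed features.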

In [None]:
# Handle missing values by imputing them rather than removing them
# First establish how much data is missing
print(x.isnull().sum().sort_values(ascending=False).head())


# Impute the missing values using SimpleImputer in sklearn.impute
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(x)
x = pd.DataFrame(data=imp.transform(x), columns=x.columns)


# Check if there is still missing data
x.isnull().sum().sort_values(ascending=False).head()

**Outlier Detection**

An outlier is an observation that deviates drastically from the other values in the dataset. Decision trees are robust to outliers because they isolate them in small regions of the feature space. Since the prediction for each leaf is the average of the observations in that leaf (for regression), outliers that are isolated in separate leaves won't influence the mean of the other leaves, and so won't affect the rest of the predictions.

**Natural vs error:**

- Naturally occurring outliers are not errors, but they can still skew a model, e.g. by affecting the slope of a fitted line.
- Outliers caused by error are indicative of data quality issues, so they are not information that should be used in the model. Imputation can be used to deal with these erroneous values (the same way as dealing with missing data).

Methods of outlier detection include kernel density estimation and Tukey IQR.

**Tukey IQR** identifies extreme values in the data using the interquartile range (IQR = Q3 - Q1). It is preferable to using standard deviations from the mean to detect outliers because Tukey's method doesn't make assumptions about normality and is less sensitive to extreme values. To find only the most extreme values, use a larger multiplier than 1.5 (3 is a common choice).

A value is flagged as an outlier if it is below Q1 - 1.5(Q3 - Q1) or above Q3 + 1.5(Q3 - Q1).

One limitation of Tukey IQR outlier detection is that it only captures extreme values: unlike kernel density estimation, it cannot capture outliers that sit between the modes of a bimodal distribution.
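
The bimodal limitation can be illustrated with a rough sketch that uses scikit-learn's `KernelDensity`. The helper `find_outliers_kde`, its `bandwidth` and `quantile` defaults and the synthetic sample are all assumptions made for this illustration, and in practice both parameters would need tuning per feature.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity

def find_outliers_kde(x, bandwidth=0.5, quantile=0.01):
    """Flag the values whose estimated density falls in the lowest `quantile` of the sample."""
    values = x.to_numpy(dtype=float).reshape(-1, 1)
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(values)
    log_density = kde.score_samples(values)  # log of the estimated density at each point
    cutoff = np.quantile(log_density, quantile)
    mask = log_density <= cutoff
    return list(x.index[mask]), list(x[mask])

# Synthetic bimodal sample: two clusters plus one point (index 200) in the gap between them
rng = np.random.default_rng(0)
sample = pd.Series(np.concatenate([rng.normal(0, 0.5, 100),
                                   rng.normal(10, 0.5, 100),
                                   [5.0]]))

# Tukey fences for comparison
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
tukey_flags = (sample < q1 - 1.5 * iqr) | (sample > q3 + 1.5 * iqr)

kde_indices, kde_values = find_outliers_kde(sample)
print('Tukey flags the gap point:', bool(tukey_flags.iloc[200]))
print('KDE flags the gap point:', 200 in kde_indices)
```

The point at 5.0 lies well inside the Tukey fences because the quartiles sit in the two clusters, but it has the lowest estimated density in the sample.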

In [None]:
# find outliers using Tukey IQR
def find_outliers_tukey(x):
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3-q1 
    floor = q1 - 1.5*iqr # The floor = the first quartile minus 1.5 times the IQR.
    ceiling = q3 + 1.5*iqr # The ceiling = the third quartile plus 1.5 times the IQR.
    outlier_indices = list(x.index[(x < floor) | (x > ceiling)]) # indices of values below the floor or above the ceiling, to access these data points later
    outlier_values = list(x.loc[outlier_indices]) # the outlying values themselves

    return outlier_indices, outlier_values


# for example, check the outliers for 'medIncome'
tukey_indices, tukey_values = find_outliers_tukey(x['medIncome'])
print(np.sort(tukey_values))

**Distribution of Features

Plotting frequency histograms to show the distribution of a given feature.
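
Where a histogram shows a very long right tail, the log transformation mentioned in the outline is one option. The sketch below uses a synthetic log-normal sample (not a column from the dataset) and `np.log1p`, and compares skewness before and after; it only applies to non-negative features.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature, standing in for a long-tailed column
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000), name='skewed_feature')

# log1p computes log(1 + x), so zero values are handled without error
transformed = np.log1p(skewed)

# Sample skewness: close to 0 for a symmetric distribution, large and positive for a long right tail
skew_before = skewed.skew()
skew_after = transformed.skew()
print('skewness before:', skew_before)
print('skewness after:', skew_after)
```

Plotting `transformed` with the same histogram function should show a much more symmetric shape than the raw values.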

In [1]:
# plot histograms using pyplot in matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

def histo_plot(x):
    plt.hist(x,color='red', alpha=0.5)
    plt.title("'{var_name}' Histogram".format(var_name=x.name))
    plt.ylabel("Freq")
    plt.xlabel("Value")
    plt.show()
    
# plot histograms to show the distributions of given features
histo_plot(x['population'])
histo_plot(x['racepctblack'])
histo_plot(x['numbUrban'])
histo_plot(x['medIncome'])
histo_plot(x['pctWWage'])
histo_plot(x['pctWInvInc'])
histo_plot(x['medFamInc'])
histo_plot(x['whitePerCap'])
histo_plot(x['PctEmploy'])
histo_plot(x['MalePctDivorce'])
histo_plot(x['MalePctNevMarr'])
histo_plot(x['TotalPctDiv'])
histo_plot(x['PctKids2Par'])
histo_plot(x['PctWorkMom'])
histo_plot(x['NumImmig'])
histo_plot(x['PctRecImmig8'])
histo_plot(x['PctRecImmig10'])
histo_plot(x['PctLargHouseOccup'])
histo_plot(x['PersPerRentOccHous'])
histo_plot(x['PctPersOwnOccup'])
histo_plot(x['PctPersDenseHous'])
histo_plot(x['HousVacant'])
histo_plot(x['PctHousOwnOcc'])
histo_plot(x['OwnOccLowQuart'])
histo_plot(x['OwnOccMedVal'])
histo_plot(x['RentLowQ'])
histo_plot(x['RentMedian'])
histo_plot(x['MedRent'])
histo_plot(x['MedOwnCostPctIncNoMtg'])
histo_plot(x['NumStreet'])


In [None]:
# histograms showing the distribution of features by the outcome variable (dependent variable)
def histo_plot_dv(x, y):  # name matches the calls below
    plt.hist(list(x[y==0]), alpha=0.5, label='DV=0')
    plt.hist(list(x[y==1]), alpha=0.5, label='DV=1')
    plt.title("'{var_name}' Histogram by DV Category".format(var_name=x.name))
    plt.ylabel("Freq")
    plt.xlabel("Value")
    plt.legend(loc='upper right')
    plt.show()
    
# these show the distribution of each feature separately for outcome 0 (low crime) and outcome 1 (high crime)
histo_plot_dv(x['population'], y)
histo_plot_dv(x['racepctblack'], y)
histo_plot_dv(x['numbUrban'], y)
histo_plot_dv(x['medIncome'], y)
histo_plot_dv(x['pctWWage'], y)
histo_plot_dv(x['pctWInvInc'], y)
histo_plot_dv(x['medFamInc'], y)
histo_plot_dv(x['whitePerCap'], y)
histo_plot_dv(x['PctEmploy'], y)
histo_plot_dv(x['MalePctDivorce'], y)
histo_plot_dv(x['MalePctNevMarr'], y)
histo_plot_dv(x['TotalPctDiv'], y)
histo_plot_dv(x['PctKids2Par'], y)
histo_plot_dv(x['PctWorkMom'], y)
histo_plot_dv(x['NumImmig'], y)
histo_plot_dv(x['PctRecImmig8'], y)
histo_plot_dv(x['PctRecImmig10'], y)
histo_plot_dv(x['PctLargHouseOccup'], y)
histo_plot_dv(x['PersPerRentOccHous'], y)
histo_plot_dv(x['PctPersOwnOccup'], y)
histo_plot_dv(x['PctPersDenseHous'], y)
histo_plot_dv(x['HousVacant'], y)
histo_plot_dv(x['PctHousOwnOcc'], y)
histo_plot_dv(x['OwnOccLowQuart'], y)
histo_plot_dv(x['OwnOccMedVal'], y)
histo_plot_dv(x['RentLowQ'], y)
histo_plot_dv(x['RentMedian'], y)
histo_plot_dv(x['MedRent'], y)
histo_plot_dv(x['MedOwnCostPctIncNoMtg'], y)
histo_plot_dv(x['NumStreet'], y)

**Feature Engineering**

Either 1. increase the dimensionality or 2. decrease the dimensionality.

Increasing dimensionality means creating new features. This is useful if the impact of two or more features on the outcome is non-additive. A good automated way to do this is to look for interactions between features,
e.g. a simple 2-way interaction (where X3 is the interaction between X1 and X2):

X3 = X1 * X2

However, with lots of features this grows the data a lot: n features produce n(n-1)/2 two-way interactions. Therefore, it is better to use domain knowledge to pick out a few plausible interactions, so that there aren't too many interaction terms.

Increasing dimensionality has benefits because information is added; however, it is computationally costly and also increases the potential for overfitting the model. So it is a trade-off between creating new useful information and the potential for overfitting plus the computational cost.
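
A sketch of the domain-knowledge route: only hand-picked pairs are multiplied, rather than every combination. The helper `add_selected_interactions`, the toy values and the choice of pair are hypothetical; the pairing is for illustration only and is not a claim that these two features actually interact.

```python
from math import comb
import pandas as pd

def add_selected_interactions(df, pairs):
    """Return a copy of df with one product column per (col_a, col_b) pair."""
    out = df.copy()
    for col_a, col_b in pairs:
        out[col_a + '_x_' + col_b] = out[col_a] * out[col_b]
    return out

# Toy frame reusing three column names from the dataset (values are made up)
toy = pd.DataFrame({'medIncome': [30000.0, 50000.0],
                    'PctEmploy': [60.0, 70.0],
                    'population': [1000.0, 2000.0]})

with_interactions = add_selected_interactions(toy, [('medIncome', 'PctEmploy')])
print(with_interactions.columns.tolist())

# For comparison: the 33 features kept in cols_final (excluding the outcome)
# would give this many two-way interactions if all pairs were created
n_all_pairs = comb(33, 2)
print(n_all_pairs)
```

One extra column is added here, against several hundred for the all-pairs approach.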

In [None]:
# PolynomialFeatures in sklearn.preprocessing to create two-way interactions for ALL features
# not implemented because it would make the program too computationally slow
# from itertools import combinations
# from sklearn.preprocessing import PolynomialFeatures

# def add_interactions(df):
#     # get feature names
#     combos = list(combinations(list(df.columns), 2))
#     colnames = list(df.columns) + ['_'.join(x) for x in combos]
#
#     # establish the interactions in the data
#     poly = PolynomialFeatures(interaction_only=True, include_bias=False)
#     df = poly.fit_transform(df)
#     df = pd.DataFrame(df)
#     df.columns = colnames
#
#     # remove the interactions with 0 values
#     noint_indices = [i for i, x in enumerate(list((df == 0).all())) if x]
#     df = df.drop(df.columns[noint_indices], axis=1)
#
#     return df


# x = add_interactions(x)
# print(x.head(5))

**Decreasing dimensionality**

Principal component analysis (PCA) is a method that transforms a dataset with many features into the principal components that best summarise the underlying variance in the data.

Each principal component is established by finding the linear combination of features that maximises variance, whilst also ensuring zero correlation with the previously calculated principal components.

PCA, or decreasing dimensionality in general, is useful when:
- the data is very high-dimensional and the number of dimensions needs to be reduced;
- the dataset has many highly correlated variables, because the principal components capture their shared variance and are uncorrelated with each other;
- there is a poor observation-to-feature ratio.

However, dimensionality reduction makes the data harder to interpret and understand because the output is a set of arbitrary principal components, e.g. principal component number 1 is not as easy to interpret as medIncome. Therefore, in certain contexts, like explaining the results to a client, it becomes difficult to explain the drivers of the target outcome variable.
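
The properties described above can be checked on a small synthetic example. This is only a sketch: the three toy features are made up, and standardising with `StandardScaler` before PCA is an added assumption here (PCA is sensitive to feature scale), not something taken from the notebook's own code.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: feat_b is almost a copy of feat_a, feat_c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({'feat_a': a,
                    'feat_b': a + rng.normal(scale=0.1, size=500),
                    'feat_c': rng.normal(size=500)})

# Put all features on the same scale, then reduce three features to two components
scaled = StandardScaler().fit_transform(toy)
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

# Share of the total variance captured by each component
explained = pca.explained_variance_ratio_
# Loadings: how much each original feature contributes to each component
loadings = pd.DataFrame(pca.components_, columns=toy.columns, index=['PC1', 'PC2'])
# Correlation between the two component scores
score_corr = np.corrcoef(scores, rowvar=False)[0, 1]

print(explained)
print(loadings)
print(score_corr)
```

Inspecting `loadings` is one way to recover some interpretability, since it shows which original features dominate each component, although a component remains a blend rather than a single named variable.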

In [None]:
# not implementing this either, because it makes the results too difficult to interpret
# Using sklearn.decomposition PCA to find the principal components 
# from sklearn.decomposition import PCA

# transform entire data set into 10 features
# pca = PCA(n_components=10)
# x_pca = pd.DataFrame(pca.fit_transform(x))

# print(x_pca.head(5))

**End of pre-processing**