## Lesson 4: Predictive Policing
### Author: Ana Javed

#### Workplace Scenario

You are working for a data science consulting company. Your company is approached by a client requesting that you analyze crime data across the United States. At first glance, you notice that the data has 128 attributes and cannot be examined manually. The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. You are tasked to identify which are the most important features or attributes that contribute to crime. 

Generally, such data might be used for predictive policing, where police departments can predict potential criminal activity so they can ensure they are properly staffed and the areas of concern are patrolled accordingly.

##### Instructions

It is recommended you complete the lab exercises for this lesson before beginning the assignment.

Using the Communities and Crime dataset (http://archive.ics.uci.edu/ml/machine-learning-databases/communities/), create a new notebook and perform each of the following tasks and answer the related questions:

    - Read data.
    - Apply three techniques for feature selection: Filter methods, Wrapper methods, Embedded methods.
    - Describe your findings.


In [285]:
## Importing Necessary Libraries & Packages 
import matplotlib.pyplot as plt
import pandas as pd 
import numpy as np
import datetime as dt
import csv
import sklearn 


In [287]:
## Reading data file into Dataframe 
col_name_list = ["state","county","community","communityname","fold","population",
"householdsize","racepctblack","racePctWhite","racePctAsian","racePctHisp","agePct12t21",
"agePct12t29","agePct16t24","agePct65up","numbUrban","pctUrban","medIncome","pctWWage",
"pctWFarmSelf","pctWInvInc","pctWSocSec","pctWPubAsst","pctWRetire","medFamInc","perCapInc",
"whitePerCap","blackPerCap","indianPerCap","AsianPerCap","OtherPerCap","HispPerCap",
"NumUnderPov","PctPopUnderPov","PctLess9thGrade","PctNotHSGrad","PctBSorMore",
"PctUnemployed","PctEmploy","PctEmplManu","PctEmplProfServ","PctOccupManu",
"PctOccupMgmtProf","MalePctDivorce","MalePctNevMarr","FemalePctDiv","TotalPctDiv",
"PersPerFam","PctFam2Par","PctKids2Par","PctYoungKids2Par","PctTeen2Par",
"PctWorkMomYoungKids","PctWorkMom","NumIlleg","PctIlleg","NumImmig","PctImmigRecent",
"PctImmigRec5","PctImmigRec8","PctImmigRec10","PctRecentImmig","PctRecImmig5",
"PctRecImmig8","PctRecImmig10","PctSpeakEnglOnly","PctNotSpeakEnglWell","PctLargHouseFam",
"PctLargHouseOccup","PersPerOccupHous","PersPerOwnOccHous","PersPerRentOccHous","PctPersOwnOccup",
"PctPersDenseHous","PctHousLess3BR","MedNumBR","HousVacant","PctHousOccup",
"PctHousOwnOcc","PctVacantBoarded","PctVacMore6Mos","MedYrHousBuilt","PctHousNoPhone",
"PctWOFullPlumb","OwnOccLowQuart","OwnOccMedVal","OwnOccHiQuart","RentLowQ",
"RentMedian","RentHighQ","MedRent","MedRentPctHousInc","MedOwnCostPctInc","MedOwnCostPctIncNoMtg",
"NumInShelters","NumStreet","PctForeignBorn","PctBornSameState","PctSameHouse85",
"PctSameCity85","PctSameState85","LemasSwornFT","LemasSwFTPerPop","LemasSwFTFieldOps",
"LemasSwFTFieldPerPop","LemasTotalReq","LemasTotReqPerPop","PolicReqPerOffic",
"PolicPerPop","RacialMatchCommPol","PctPolicWhite","PctPolicBlack","PctPolicHisp",
"PctPolicAsian","PctPolicMinor","OfficAssgnDrugUnits","NumKindsDrugsSeiz","PolicAveOTWorked",
"LandArea","PopDens","PctUsePubTrans","PolicCars","PolicOperBudg","LemasPctPolicOnPatr",
"LemasGangUnitDeploy","LemasPctOfficDrugUn","PolicBudgPerPop","ViolentCrimesPerPop"]

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data"
df = pd.read_csv(url, sep=",", names = col_name_list)

## First & Last 5 Rows from Dataframe
# print(df.head())
# print(df.tail()) 

In [288]:
## Conducting Exploratory Data Analysis: 
    # Number of Instances: 1994
    # Number of Attributes: 128
    # Missing Values? Yes
    # Data Set Characteristics:  Multivariate
    # Attribute Characteristics: Real

print(df.shape)  # (1994, 128)
print(df.dtypes) 
print(df.describe()) 

(1994, 128)
state                    int64
county                  object
community               object
communityname           object
fold                     int64
                        ...   
LemasPctPolicOnPatr     object
LemasGangUnitDeploy     object
LemasPctOfficDrugUn    float64
PolicBudgPerPop         object
ViolentCrimesPerPop    float64
Length: 128, dtype: object
             state         fold   population  householdsize  racepctblack  \
count  1994.000000  1994.000000  1994.000000    1994.000000   1994.000000   
mean     28.683551     5.493982     0.057593       0.463395      0.179629   
std      16.397553     2.873694     0.126906       0.163717      0.253442   
min       1.000000     1.000000     0.000000       0.000000      0.000000   
25%      12.000000     3.000000     0.010000       0.350000      0.020000   
50%      34.000000     5.000000     0.020000       0.440000      0.060000   
75%      42.000000     8.000000     0.050000       0.540000      0.230000   
max 

In [289]:
## Function to Find which column has the most missing values

def missing_value_count(df):
    missing_dict = {}

    for i, row in enumerate(df.values):
        if row[0] in col_name_list:
            continue

        # print(i, row)
        for num, val in enumerate(row):
            # print(num, val)
            if val == '?':
                if str(num) not in missing_dict.keys():
                    missing_dict[str(num)] = 0
                missing_dict[str(num)] += 1 
            else:
                continue
                
    return missing_dict 
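
The row-by-row loop above keys its result by column position. A shorter vectorized variant is sketched below, assuming the only missing-value marker is the string '?'. The helper name `count_placeholder_by_column` and the toy frame are made up for illustration and are not part of the assignment code.

```python
import pandas as pd

def count_placeholder_by_column(frame, placeholder="?"):
    """Return {column name: count} for columns that contain the placeholder."""
    counts = frame.eq(placeholder).sum()
    return counts[counts > 0].to_dict()

# Tiny made-up frame, only to illustrate the call (not the real dataset).
toy = pd.DataFrame({
    "county": ["?", "5", "?"],
    "population": [0.19, 0.0, 0.04],
    "OtherPerCap": ["0.36", "?", "0.28"],
})
toy_missing = count_placeholder_by_column(toy)
print(toy_missing)
```

Keying by column name instead of position keeps the counts readable after columns are dropped. Another option is to pass `na_values="?"` to `pd.read_csv`, so the markers arrive as NaN and `df.isna().sum()` gives the same counts.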


In [290]:
## Missing Value Counts by Columns: 
missing_dict = missing_value_count(df)
print(missing_dict)

"""
It appears columns 101 - 117, 121-124, and 126 have the most missing values (1675 rows each). 
I will drop these columns since more than half of the values are missing. 
For other columns that have fewer missing values, I will impute them 
with the column median values if they are numeric. 
"""

# print(missing_dict.keys())


# Columns to Drop: 101 - 117, 121-124, 126
counter = 0 
for i, name in enumerate(col_name_list):
    if (i in range(101, 118, 1)) or (i in range(121, 125, 1))  or (i == 126):
        # print(name)
        df = df.drop(columns=name)
        del missing_dict[str(i)]
        counter +=1

print("Number of Columns Deleted: ", counter)


## Also Deleting the One String Column: communityname 
df = df.drop(columns=  'communityname')


{'1': 1174, '2': 1177, '101': 1675, '102': 1675, '103': 1675, '104': 1675, '105': 1675, '106': 1675, '107': 1675, '108': 1675, '109': 1675, '110': 1675, '111': 1675, '112': 1675, '113': 1675, '114': 1675, '115': 1675, '116': 1675, '117': 1675, '121': 1675, '122': 1675, '123': 1675, '124': 1675, '126': 1675, '30': 1}
Number of Columns Deleted:  22


In [291]:
### Imputing the Missing Values with the Median value (if applicable) 
print("Columns with Remaining Missing Values: ")
missing_dict = missing_value_count(df)
print(missing_dict)
# Remaining columns are: county, community, and OtherPerCap


# Convert the three remaining '?' columns to numeric ('?' becomes NaN),
# then fill each NaN with that column's median.
# Missing counts: county 1174, community 1177, OtherPerCap 1
# Assigning by column name (rather than df.loc[:, col]) keeps the numeric dtype.
for col in ["county", "community", "OtherPerCap"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df[col] = df[col].fillna(df[col].median())



Columns with Remaining Missing Values: 
{'1': 1174, '2': 1177, '29': 1}


In [292]:
## Checking Once More Regarding Missing values:
print("Columns with Remaining Missing Values: ")
missing_dict = missing_value_count(df)
print(missing_dict) 


print("\nDataframe Dimensions: ")
print(df.shape)

Columns with Remaining Missing Values: 
{}

Dataframe Dimensions: 
(1994, 105)


#### Now Z-Normalizing the Dataset 



In [293]:
from sklearn.preprocessing import StandardScaler

# Z-Normalizing the attributes: 
X = df.loc[:, list(df.columns[:104])]
y = df.loc[:, "ViolentCrimesPerPop"]
# print(X.head())

standardization_scale = StandardScaler().fit(X)
X = standardization_scale.transform(X)
X = pd.DataFrame(X) 

X.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,94,95,96,97,98,99,100,101,102,103
count,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,...,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0
mean,-9.443020000000001e-17,2.1380420000000003e-17,-1.812882e-16,-5.701446e-17,4.187e-17,-4.525523e-16,1.042296e-16,1.692617e-16,4.231542e-17,1.247191e-17,...,-3.5634040000000005e-17,4.187e-17,4.774961e-16,-1.7817020000000003e-17,2.4943830000000002e-17,2.5122e-16,4.6324250000000005e-17,5.256021e-17,7.928574e-17,-1.0690210000000002e-17
std,1.000251,1.000251,1.000251,1.000251,1.000251,1.000251,1.000251,1.000251,1.000251,1.000251,...,1.000251,1.000251,1.000251,1.000251,1.000251,1.000251,1.000251,1.000251,1.000251,1.000251
min,-1.688697,-0.4430138,-2.914039,-1.564227,-0.4539368,-2.831179,-0.708935,-3.089277,-0.7359319,-0.6196275,...,-0.2269325,-0.9328163,-2.980711,-2.951081,-3.124761,-3.287706,-0.5960859,-1.146831,-0.7060564,-0.3914469
25%,-1.017697,-0.177687,0.04806272,-0.8680839,-0.3751184,-0.6928041,-0.6300017,-0.5070787,-0.544384,-0.5766044,...,-0.2269325,-0.6731617,-0.6799173,-0.6345617,-0.5308709,-0.4618707,-0.4133235,-0.6543192,-0.6187192,-0.3914469
50%,0.3243035,-0.177687,0.04806272,-0.171941,-0.2963001,-0.1429362,-0.4721351,0.3946412,-0.4007231,-0.4475351,...,-0.2269325,-0.3702314,0.1033317,0.02730101,0.2173666,0.2445882,-0.2305611,-0.3095611,-0.400376,-0.3914469
75%,0.8123035,-0.177687,0.04806272,0.8722733,-0.05984503,0.4680281,0.1987979,0.7635267,0.07814654,0.06874201,...,-0.2269325,0.2789052,0.8253895,0.6891637,0.7161916,0.6987404,0.04358251,0.2322017,0.1236475,-0.3914469
max,1.666304,9.675587,2.916835,1.568416,7.427898,3.278464,3.23773,1.00945,4.052765,3.682682,...,9.735713,3.39476,1.914595,2.564441,1.863489,1.758429,8.542034,3.778285,3.660806,3.770572
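
The `std` row above reads 1.000251 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1), so the two differ by a factor of sqrt(n / (n - 1)). A minimal sketch on synthetic values (not the real data) with the same row count:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1994                      # same row count as the dataset
column = rng.normal(size=n_rows)   # synthetic values, for illustration only

# Z-score using the population standard deviation (ddof=0).
z = (column - column.mean()) / column.std(ddof=0)

sample_std = z.std(ddof=1)                    # the statistic describe() reports
expected = np.sqrt(n_rows / (n_rows - 1))     # sqrt(1994 / 1993)
print(round(float(sample_std), 6), round(float(expected), 6))
```

Both numbers round to 1.000251, matching the table, so the scaling itself behaved as intended.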


#### Now that I've removed all missing values from the data, standardized the values, and trimmed the columns down to 105 (104 features plus the target), the next step is to apply feature selection methods


### First Method: Filter methods

In [294]:
# from sklearn.metrics import mutual_info_score
## Conducting Pair-wise correlation for Columns in The Dataframe


def correlation_above_threshold(corr_df, threshold=0.5, target='ViolentCrimesPerPop'):
    """Return {feature: r} for features whose |r| with the target is >= threshold."""
    at_or_above_threshold = {}

    # Use the correlation matrix passed in, not a global variable
    for i, val in enumerate(corr_df[target]):
        if abs(val) >= threshold:
            if corr_df.columns[i] != target:
                print(corr_df.columns[i], ".... Column:", i, "....", val)
                at_or_above_threshold[corr_df.columns[i]] = val

    return at_or_above_threshold


## Creating a Pair-Wise Correlation Df 
corr_df = df.corr('pearson')
# corr_df.head(15)

## Finding Correlations Above Specified Threshold (using function above)
output = correlation_above_threshold(corr_df, threshold = 0.6)
print("Total: ", len(output)) 

# when threshold = 0.5, 15 variables; when threshold =0.6, 7 variables
# when threshold = 0.7, 3 variables; when threshold = 0.8, 0 variables 
# output

racepctblack .... Column: 6 .... 0.6312636346597023
racePctWhite .... Column: 7 .... -0.6847695762715443
PctFam2Par .... Column: 47 .... -0.7066674691569855
PctKids2Par .... Column: 48 .... -0.7384238020704434
PctYoungKids2Par .... Column: 49 .... -0.6660588959347982
PctTeen2Par .... Column: 50 .... -0.6615816444304072
PctIlleg .... Column: 54 .... 0.7379565498586647
Total:  7


### Second Method: Wrapper methods

In [295]:
from sklearn.feature_selection import RFE  # Recursive Feature Elimination
from sklearn.linear_model import LinearRegression

estimator = LinearRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn releases;
# step=1 removes one feature per iteration.
selector = RFE(estimator, n_features_to_select=7, step=1)
selector = selector.fit(X, y)
print(selector.support_)  # Boolean mask of the selected features
print(selector.ranking_)  # Selected features have rank 1; rank 2 was eliminated last,
                          # and the highest rank was eliminated first
        



[False False False False False False False False False False False False
 False False  True False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False  True False  True  True False False
  True False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False  True
  True False False False False False False False False False False False
 False False False False False False False False]
[49 36 85 79  2 91  3 55 84 31 39 25 42 76  1 23 17 21 53 24 41 97 38 20
  5  6 78 77 82 50 71 59 33 51 52 70 90 22 47 87 48 34  1 26  1  1 40 93
  1 63 98 58 29 44 12 35 72 81 74 75 73 10  9 57 92 19 66 28 16 37 15 14
  4 64 65 11 60 13 45 46 80 68 86  1  1 54  8 96 30  7 69 61 27 43 32 18
 67 94 95 88 83 89 56 62]


In [296]:
# X holds the first 104 columns of df, so position i maps to df.columns[i]
for i, val in enumerate(selector.ranking_):
    if val == 1:  # rank 1 marks a selected feature
        print(df.columns[i], ".... Column:", i, "....", val)
            

numbUrban .... Column: 14 .... 1
MalePctDivorce .... Column: 42 .... 1
FemalePctDiv .... Column: 44 .... 1
TotalPctDiv .... Column: 45 .... 1
PctKids2Par .... Column: 48 .... 1
OwnOccLowQuart .... Column: 83 .... 1
OwnOccMedVal .... Column: 84 .... 1


### Third Method: Embedded methods

In [297]:
# LASSO Embedded Method 
from sklearn import linear_model

alpha = 0.025 # Increasing alpha can shrink variable coefficients more to 0
clf = linear_model.Lasso(alpha=alpha)
clf.fit(X, y)

print(clf.coef_)
print(clf.intercept_)


[-0.         -0.         -0.         -0.          0.         -0.
  0.         -0.04264457  0.          0.         -0.         -0.
 -0.          0.          0.          0.         -0.         -0.
 -0.         -0.          0.          0.         -0.         -0.
 -0.         -0.         -0.         -0.          0.          0.
 -0.          0.          0.          0.          0.         -0.
  0.         -0.         -0.         -0.          0.         -0.
  0.0082995   0.          0.          0.00024754  0.         -0.
 -0.06535678 -0.         -0.         -0.         -0.          0.
  0.04202239  0.          0.          0.          0.          0.
  0.          0.          0.          0.         -0.          0.
  0.          0.          0.         -0.          0.         -0.
  0.00514212  0.         -0.          0.01755062 -0.         -0.
  0.         -0.          0.          0.          0.          0.
  0.          0.         -0.          0.          0.          0.
  0.          0.         

In [298]:
# X holds the first 104 columns of df, so position i maps to df.columns[i]
for i, val in enumerate(clf.coef_):
    if val != 0:  # LASSO sets the coefficients of dropped features to exactly 0
        print(df.columns[i], ".... Column:", i, "....", val)

racePctWhite .... Column: 7 .... -0.04264457357393647
MalePctDivorce .... Column: 42 .... 0.008299504974689956
TotalPctDiv .... Column: 45 .... 0.0002475415149962874
PctKids2Par .... Column: 48 .... -0.06535677567041379
PctIlleg .... Column: 54 .... 0.04202239072487789
PctPersDenseHous .... Column: 72 .... 0.005142119558077064
HousVacant .... Column: 75 .... 0.017550618444137436
NumStreet .... Column: 94 .... 0.004408861637625294


### Summary 

The policing dataset explored in this assignment originally contained 128 attributes, some with missing values. After cleaning and standardizing the data, 104 features remained, which is still too many to assess by hand, so I applied three different feature selection methods.

The first feature selection method applied was the Filter method. I computed the Pearson correlation between each feature and the target variable, and selected only the features whose absolute correlation met a specific threshold. I tested a few different thresholds to see how many attributes were returned. With a threshold of 0.8, no features were returned. With a threshold of 0.6, 7 attributes were returned:
- racepctblack = percentage of population that is African American (correlation: 0.631)
- racePctWhite = percentage of population that is Caucasian (correlation: -0.685)
- PctFam2Par = percentage of families (with kids) that are headed by two parents (correlation: -0.707)
- PctKids2Par = percentage of kids in family housing with two parents (correlation: -0.738)
- PctYoungKids2Par = percent of kids 4 and under in two parent households (correlation: -0.666)
- PctTeen2Par = percent of kids age 12-17 in two parent households (correlation: -0.662)
- PctIlleg = percentage of kids born to never-married parents (correlation: 0.738)

This list of attributes shows that race and family structure (e.g. families headed by two parents) were the attributes most strongly correlated with the total number of violent crimes per 100K population (the target variable). The variable with the strongest correlation (negative) was PctKids2Par, the percentage of kids in family housing with two parents. The next strongest (positive) was PctIlleg, the percentage of kids born to never-married parents. These two attributes are related to each other, and together they suggest that household structure is an important attribute.
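
The filter step described above can be sketched in a few lines with `DataFrame.corrwith`. The data below is synthetic, and the column names and the helper `select_by_correlation` are made up for illustration; only the idea of thresholding on the absolute correlation comes from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
signal = rng.normal(size=n)

# Synthetic stand-in data: two informative columns and one pure-noise column.
toy = pd.DataFrame({
    "strong_pos": signal + 0.3 * rng.normal(size=n),
    "strong_neg": -signal + 0.3 * rng.normal(size=n),
    "noise": rng.normal(size=n),
    "target": signal,
})

def select_by_correlation(frame, target, threshold):
    """Keep features whose |Pearson r| with the target meets the threshold."""
    corr = frame.drop(columns=target).corrwith(frame[target])
    return corr[corr.abs() >= threshold]

selected = select_by_correlation(toy, "target", threshold=0.6)
print(selected)
```

A filter like this scores each feature on its own, so it can return several near-duplicates, such as the four two-parent attributes above, without noticing that they carry much the same information.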

The second feature selection method applied was the Wrapper method. I used backward stepwise selection via Recursive Feature Elimination (RFE) with a linear regression estimator. Since the first method resulted in 7 attributes, I asked RFE() for 7 features to see how closely the attributes would match the filter method. The 7 attributes returned were:
- numbUrban = number of people living in areas classified as urban
- MalePctDivorce = percentage of males who are divorced
- FemalePctDiv =  percentage of females who are divorced
- TotalPctDiv = percentage of population who are divorced
- PctKids2Par = percentage of kids in family housing with two parents
- OwnOccLowQuart = owner occupied housing - lower quartile value
- OwnOccMedVal= owner occupied housing - median value

This was interesting because only one of these attributes (PctKids2Par) was also returned by the filter method. The new attributes that may be related to crime cover owner-occupied housing values, divorce rates, and the size of the urban population.
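
A minimal sketch of RFE on synthetic data follows, assuming a linear estimator as in the notebook. The arrays `X_toy` and `y_toy` are made up so that the informative columns are known in advance. One possible reason the real result differs from the filter list is that RFE ranks features by coefficient size in a joint model, where closely related columns (for example the three divorce percentages) can receive large offsetting coefficients; that is a hypothesis, not something the output above confirms.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples, n_features = 300, 10
X_toy = rng.normal(size=(n_samples, n_features))   # synthetic, all on one scale

# Only columns 0, 3 and 7 drive the synthetic target.
y_toy = (4.0 * X_toy[:, 0] - 3.0 * X_toy[:, 3] + 2.0 * X_toy[:, 7]
         + 0.1 * rng.normal(size=n_samples))

# Drop the weakest feature one at a time until three remain.
rfe = RFE(LinearRegression(), n_features_to_select=3, step=1).fit(X_toy, y_toy)
kept = np.flatnonzero(rfe.support_)
print(kept, rfe.ranking_)
```

Features with rank 1 are the ones kept; rank 2 was the last feature eliminated, and the highest rank was eliminated first.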

The final feature selection method applied was the Embedded method, specifically LASSO. To make the comparison across methods easy, I wanted LASSO to return around 7 attributes, which depends on the alpha passed to the function. A higher alpha means a larger penalty on the coefficients, so more of them shrink to exactly 0. I set alpha = 0.025, and 8 attributes were returned:
- racePctWhite = percentage of population that is Caucasian (coef = -0.0426)
- MalePctDivorce = percentage of males who are divorced (coef = 0.0083)
- TotalPctDiv = percentage of population who are divorced (coef = 0.0002)
- PctKids2Par = percentage of kids in family housing with two parents (coef = -0.0654)
- PctIlleg = percentage of kids born to never-married parents (coef = 0.0420)
- PctPersDenseHous = percent of persons in dense housing (coef = 0.0051)
- HousVacant = number of vacant households (coef = 0.0176)
- NumStreet = number of homeless people counted in the street (coef = 0.0044)

Here we see race, divorce status, parent presence, and community statistics. New attributes also appear, such as the number of vacant homes and the number of homeless people counted in the street. The attribute with the largest coefficient in absolute value was PctKids2Par (percentage of kids in family housing with two parents), followed by racePctWhite and PctIlleg.
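
The link between alpha and the number of surviving features can be checked with a small sweep. The sketch below uses synthetic standardized features with five informative columns; the alpha grid and all variable names are illustrative assumptions, not values tuned for the crime data.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(400, 20))                 # synthetic standardized features
true_coef = np.zeros(20)
true_coef[:5] = [0.5, -0.4, 0.3, -0.2, 0.1]        # five informative columns
y_toy = X_toy @ true_coef + 0.05 * rng.normal(size=400)

# Count how many coefficients stay non-zero as the penalty grows.
nonzero_counts = {}
for alpha in (0.001, 0.025, 0.15, 1.0):
    model = Lasso(alpha=alpha).fit(X_toy, y_toy)
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} non-zero coefficients")
```

Running the same kind of sweep on the real `X` and `y` is one way to pick an alpha that returns roughly the desired number of attributes instead of adjusting it by hand.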

While there are some similarities in attributes across the methods, there were still new features that were introduced by each method. Now that we have different features selected, these features should be passed into a machine learning model to see which features best predict the target variable.
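
One way to carry out that comparison is to score each candidate feature list with cross-validation. The sketch below assumes a plain linear regression and 5-fold R^2 as the metric. The data is synthetic and the subset names are hypothetical placeholders for the filter, wrapper, and LASSO lists.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X_toy = rng.normal(size=(300, 6))                  # synthetic features
y_toy = 2.0 * X_toy[:, 0] - 1.5 * X_toy[:, 1] + 0.2 * rng.normal(size=300)

# Hypothetical candidate subsets (column positions), one per selection method.
candidate_subsets = {
    "subset_a": [0, 1],
    "subset_b": [0, 2, 3],
    "subset_c": [4, 5],
}

cv_r2 = {}
for name, cols in candidate_subsets.items():
    scores = cross_val_score(LinearRegression(), X_toy[:, cols], y_toy,
                             cv=5, scoring="r2")
    cv_r2[name] = float(scores.mean())
    print(f"{name}: mean 5-fold R^2 = {cv_r2[name]:.3f}")

best_subset = max(cv_r2, key=cv_r2.get)
print("Best subset:", best_subset)
```

On the real data the subsets would be the three column lists found above. For an unbiased estimate, the selection step itself would ideally be repeated inside each training fold.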