# Data Mining - Mini Lab

#### Introduction
###### Data Description
The Dallas Crime Incident data set that is used in the Mini Project acts as a bridge between the citizens of Dallas and the Dallas PD. It represents the Dallas Police Public Data - RMS Incidents from June 1, 2014 to September 7, 2018. 
For purposes of this Mini Project, the main dataframe that is trimmed based on the analysis performed as part of Lab 1. The details of the data quality clean up and choice of columns have been detailed in the Lab 1 notebook link provided below.
Lab 1 Notebook Link - https://github.com/wtubin/MSDS7331-Data-Mining/MSDS7331_Data_Mining_Lab1_Data-Viz_Pre-Processing.ipynb


#### Objective

The objective of this unit is to perform Logistic Regression and Support Vector Machines categorization on the chosen data set and optimize the parameters in order to improve the accuracy of the model.
GitHub Repository containing the artifacts - https://github.com/wtubin/MSDS7331-Data-Mining
Location of the raw (compressed) data file - https://github.com/wtubin/MSDS7331-Data-Mining/Police_Incidents.7z 

The three models are:

- Logistic Regression, using GridSearchCV, with manual variable reduction
- Logistic Regression, using GridSearchCV, with Recursive Feature Elimination (RFE)
- Support Vector Machine (SVM)



### Create Models

##### Data Preparation

The dataset is loaded and cleaning is performed with some modifications as needed to feed into model. The attributes with zero variance (closer) or those attributes that does not have value in predicting the respose variable are removed. For example attributes like beats, sector, Location1, etc. serves no purpose for our model. 

Then dataset will be splitted into explanatory, reffered as X (Attributes) and response referred as "Y" (response variable: "Arrest_status") for running the models.

        - X : Explanatory variable (attributes)
        - Y : Response variable (Res_time_category)

The attributes will be scaled to have a mean of 0 and variance of 1 in order to imporve accuracy of the classification models. The data will then be splitted in to 80/20 training/test set split. To reduce possibility of "overfitting", 10-fold cross validation will be performed. The GrisdsearchCV method with manual variable reduction will be performed and we will be utilizing correlation scores, variance inflaion scores, variance inflation factors (VIFs) and significance for manual determination of attributes. This will help us reduce the attributes for our model.These remaining attributes will also be utilized in other two models: Logistic Regression using GridSearchCV with Recursive Feature Elimination and Support Vector Machine. The scikit-learn GridSearchCV feature will be utilized to adjust model parameters for adjusting class_weight.

Overall, accuracy, precision, and recall is determined by utilizing the modification of original function created by Dr. Drew in his Education Data Notebook for clasification to check for our proper classification success. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.linear_model import LogisticRegression

import warnings
warnings.simplefilter('ignore', DeprecationWarning)
pd.set_option('mode.chained_assignment', None)
warnings.filterwarnings('ignore')

path = "../Data/" # Generic path
incident = pd.read_csv(path + 'LAB1_completed_Dataset_clean.csv', low_memory= False)
print(incident.shape)
incident['Response_time'] = incident['Response_time'].fillna(incident['Response_time'].mean()).astype(np.int)
incident['Res_time_category'] = pd.cut(incident.Response_time,[0,20,1e6],2,labels=[0,1])
incident = incident[incident['Res_time_category'].isnull()==False]
# incident['Res_time_category'] = pd.Categorical(incident['Res_time_category']).codes

incident = incident[incident['Call_Received_Hour'].isnull()==False]

print(incident.shape)
print(incident.isnull().sum())

(255154, 44)
(223803, 45)
Year_of_Incident                0
Service_Number_ID               0
Watch                           0
Type_of_Incident                0
Type_Location                3828
Reporting_Area                129
Beat                           53
Division                        0
Sector                         53
Council_District                0
Day1_of_the_Week                0
Call_Received_Date_Time         0
Call_Cleared_Date_Time        148
Call_Dispatch_Date_Time        21
Person_Involvement_Type         0
Victim_Type                     0
Victim_Race                     0
Victim_Gender                   0
Victim_Age                      0
Offense_Status                399
Victim_Condition           206053
Hate_Crime                      0
Family_Offense                 29
Weapon_Used                 23889
Gang_Related_Offense            0
Drug_Related                    0
UCR_Offense_Name            12102
RMS_Code                        0
UCR_Code              

In [2]:
print(incident['Res_time_category'].value_counts())
incident['Res_time_category'].value_counts().plot(kind='barh')

0    142816
1     80987
Name: Res_time_category, dtype: int64


<matplotlib.axes._subplots.AxesSubplot at 0x21c224166d8>

In [3]:
# UCR_Offense_Name change this

incident.loc[:,'UCR_Offense_Name'] = incident['UCR_Offense_Name'].fillna("MISSING")

THEFT_FRAUD     = dict.fromkeys(['THEFT/BMV', 'THEFT ORG RETAIL', 'BURGLARY-RESIDENCE', 'OTHER THEFTS',
                                 'ROBBERY-INDIVIDUAL','THEFT/SHOPLIFT', 'BURGLARY-BUSINESS', 'FORGE & COUNTERFEIT', 
                                 'FRAUD', 'EMBEZZLEMENT','ROBBERY-BUSINESS','THEFT ORG RETAIL'],"THEFT_FRAUD" ) 
MVA_TRAFFIC      =dict.fromkeys(['ACCIDENT MV', 'MOTOR VEHICLE ACCIDENT', 'UUMV', 'TRAFFIC VIOLATION',
                                 'TRAFFIC FATALITY'],"MVA_TRAFFIC" )        
WEAPONS_FIREARMS =dict.fromkeys(['WEAPONS', 'ARSON', 'INJURED FIREARM'], "WEAPONS_FIREARMS")         
ASSUALT          = dict.fromkeys(['ASSAULT','VANDALISM & CRIM MISCHIEF', 'AGG ASSAULT - NFV', 'OFFENSE AGAINST CHILD',
                                  'AGG ASSAULT - FV'], "ASSUALT")
OTHERS_THREATS   = dict.fromkeys(['FOUND', 'OTHERS', 'LOST', 'CRIMINAL TRESPASS', 'DISORDERLY CONDUCT', 
                                  'ANIMAL BITE','INJURED HOME','INJURED PUBLIC', 'TERRORISTIC THREAT', 
                                  'EVADING', 'INJURED OCCUPA', 'ORANIZED CRIME', 'KIDNAPPING', 
                                  'RESIST ARREST','FAIL TO ID', 'HUMAN TRAFFICKING', 'MISSING'], "OTHERS_THREATS")
INTOXICATION     = dict.fromkeys(['DRUNK & DISORDERLY', 'DWI', 'NARCOTICS & DRUGS', 'LIQUOR OFFENSE', 
                                  'INTOXICATION MANSLAUGHTER'],"INTOXICATION")
MURDER_DEATH     = dict.fromkeys(['SUDDEN DEATH&FOUND BODIES','MURDER'], "MURDER_DEATH")
                    

incident.loc[:,'UCR_Offense_Name']= incident['UCR_Offense_Name'].replace(THEFT_FRAUD)
incident.loc[:,'UCR_Offense_Name']= incident['UCR_Offense_Name'].replace(MVA_TRAFFIC)
incident.loc[:,'UCR_Offense_Name']= incident['UCR_Offense_Name'].replace(WEAPONS_FIREARMS)
incident.loc[:,'UCR_Offense_Name']= incident['UCR_Offense_Name'].replace(ASSUALT)
incident.loc[:,'UCR_Offense_Name']= incident['UCR_Offense_Name'].replace(OTHERS_THREATS)
incident.loc[:,'UCR_Offense_Name']= incident['UCR_Offense_Name'].replace(INTOXICATION)
incident.loc[:,'UCR_Offense_Name']= incident['UCR_Offense_Name'].replace(MURDER_DEATH)

In [4]:
# # FILTERING OUT UNNECESSARY NULL DATA
incident = incident[incident['Watch']!=0]
incident = incident[(incident['Victim_Age']>=0) & (incident['Victim_Age']<=90)]
incident = incident[incident['Victim_Race']!="Unknown"]
incident = incident[incident['Victim_Type']!="Unknown"]
incident= incident[incident.Number_of_offense != "RP"]

incident = incident[incident['Victim_Gender']!="U"]
incident.loc[:,'IsMale'] = incident.Victim_Gender=='M' 
incident.IsMale = incident.IsMale.astype(np.int)

incident.loc[:,'Social_crime_score'] = incident['Hate_Crime']+incident['Gang_Related_Offense']+incident['Drug_Related']

incident.loc[:,'Victim_Age'] = incident['Victim_Age'].astype(np.int)
incident.loc[:,'Victim_Age_Group'] = pd.cut(incident.Victim_Age,[-1,18,30,60,999],4,labels=[0,1,2,3])

# incident['UCR_Offense_Name'] = pd.Categorical(incident['UCR_Offense_Name']).codes
incident['Day1_of_the_Week'] = pd.Categorical(incident['Day1_of_the_Week']).codes
# incident['Division'] = pd.Categorical(incident['Division']).codes
incident['Victim_Type'] = pd.Categorical(incident['Victim_Type']).codes
incident['Victim_Race'] = pd.Categorical(incident['Victim_Race']).codes

incident['Number_of_offense']= incident.Number_of_offense.astype(np.int)

In [5]:
# Fist 2 characters in the RMS_Code is offense degree.

incident['Degree']=incident['RMS_Code'].astype(str).str[:2]
incident['Degree_Fact'] = pd.Categorical(incident['Degree']).codes

In [6]:
tmp_df = pd.get_dummies(incident.Division,prefix='Div')
incident = pd.concat((incident,tmp_df),axis=1)

tmp_df = pd.get_dummies(incident.UCR_Offense_Name,prefix='UCR_')
incident = pd.concat((incident,tmp_df),axis=1)

In [7]:
incident.columns

Index(['Year_of_Incident', 'Service_Number_ID', 'Watch', 'Type_of_Incident',
       'Type_Location', 'Reporting_Area', 'Beat', 'Division', 'Sector',
       'Council_District', 'Day1_of_the_Week', 'Call_Received_Date_Time',
       'Call_Cleared_Date_Time', 'Call_Dispatch_Date_Time',
       'Person_Involvement_Type', 'Victim_Type', 'Victim_Race',
       'Victim_Gender', 'Victim_Age', 'Offense_Status', 'Victim_Condition',
       'Hate_Crime', 'Family_Offense', 'Weapon_Used', 'Gang_Related_Offense',
       'Drug_Related', 'UCR_Offense_Name', 'RMS_Code', 'UCR_Code',
       'X_Coordinate', 'Y_Coordinate', 'Zip_Code', 'City', 'State',
       'Location1', 'Call_Received', 'Call_Cleared', 'Call_Dispatch',
       'Number_of_offense', 'Response_time', 'Latitude', 'Longitude',
       'Arrest_status', 'Call_Received_Hour', 'Res_time_category', 'IsMale',
       'Social_crime_score', 'Victim_Age_Group', 'Degree', 'Degree_Fact',
       'Div_CENTRAL', 'Div_NORTH CENTRAL', 'Div_NORTHEAST', 'Div_NORTHWES

In [8]:
# Y Response variable dataframe
inci_Y = incident['Res_time_category']

# Attributes with no predictive features with respect to resposne variable

# incident = incident.drop(['Year_of_Incident','Service_Number_ID','Type_of_Incident','Type_Location', 'Reporting_Area', 
#                           'Beat', 'Division', 'Sector', 'Council_District', 'Call_Received_Date_Time', 
#                           'Call_Cleared_Date_Time', 'Call_Dispatch_Date_Time','Person_Involvement_Type', 'Offense_Status',
#                           'Victim_Condition','Family_Offense', 'Weapon_Used', 'RMS_Code', 'UCR_Code', 
#                           'Zip_Code', 'City', 'State','Location1', 'Call_Received', 'Call_Cleared', 'X_Coordinate', 
#                           'Y_Coordinate','Call_Dispatch', 'Latitude', 'Longitude','Victim_Gender',
#                           'Res_time_category','Victim_Age_Group','Response_time', 'Degree', 'Number_of_offense', 
#                           'Watch'],axis=1)


incident = incident.drop(['Year_of_Incident','Service_Number_ID','Type_of_Incident','Type_Location', 'Reporting_Area', 
                          'Beat','Sector', 'Council_District', 'Call_Received_Date_Time', 
                          'Call_Cleared_Date_Time', 'Call_Dispatch_Date_Time','Person_Involvement_Type', 'Offense_Status',
                          'Victim_Condition','Family_Offense', 'Weapon_Used', 'RMS_Code', 'UCR_Code', 
                          'Zip_Code', 'City', 'State','Location1', 'Call_Received', 'Call_Cleared', 'X_Coordinate', 
                          'Y_Coordinate','Call_Dispatch', 'Latitude', 'Longitude','Victim_Gender', 
                          'Response_time', 'Degree', 'Watch','Victim_Age_Group', 'Res_time_category',
                         'Victim_Type','Victim_Race','Victim_Age', 'Gang_Related_Offense','Hate_Crime',
                            'Drug_Related', 'Division', 'UCR_Offense_Name'],axis=1)



In [9]:
# BINS MODEL
# ['Day1_of_the_Week',  'Division','Responsetime_cat','Arrest_status','Social_crime_score', 
#  'IsMale','Call_Received_Hour','UCR_Offense_Name','Degree_Fact']

In [10]:
incident.sample(2)

Unnamed: 0,Day1_of_the_Week,Number_of_offense,Arrest_status,Call_Received_Hour,IsMale,Social_crime_score,Degree_Fact,Div_CENTRAL,Div_NORTH CENTRAL,Div_NORTHEAST,...,Div_SOUTHEAST,Div_SOUTHWEST,Div_Unknown,UCR__ASSUALT,UCR__INTOXICATION,UCR__MURDER_DEATH,UCR__MVA_TRAFFIC,UCR__OTHERS_THREATS,UCR__THEFT_FRAUD,UCR__WEAPONS_FIREARMS
82790,4,1,0,19.0,0,0,8,0,0,0,...,0,0,0,1,0,0,0,0,0,0
209090,2,1,0,20.0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,1,0


In [11]:
incident.describe()

Unnamed: 0,Day1_of_the_Week,Number_of_offense,Arrest_status,Call_Received_Hour,IsMale,Social_crime_score,Degree_Fact,Div_CENTRAL,Div_NORTH CENTRAL,Div_NORTHEAST,...,Div_SOUTHEAST,Div_SOUTHWEST,Div_Unknown,UCR__ASSUALT,UCR__INTOXICATION,UCR__MURDER_DEATH,UCR__MVA_TRAFFIC,UCR__OTHERS_THREATS,UCR__THEFT_FRAUD,UCR__WEAPONS_FIREARMS
count,200780.0,200780.0,200780.0,200780.0,200780.0,200780.0,200780.0,200780.0,200780.0,200780.0,...,200780.0,200780.0,200780.0,200780.0,200780.0,200780.0,200780.0,200780.0,200780.0,200780.0
mean,2.958756,1.087623,0.099118,12.676317,0.531911,0.01274,6.503586,0.125082,0.110868,0.179993,...,0.166446,0.153257,0.000254,0.27943,0.00141,0.014997,0.14616,0.13979,0.416585,0.001629
std,1.996215,0.41675,0.298822,6.270955,0.498982,0.113783,2.773755,0.330813,0.313969,0.384183,...,0.372481,0.360236,0.015936,0.44872,0.037517,0.121539,0.353267,0.34677,0.492994,0.040324
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,0.0,8.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3.0,1.0,0.0,13.0,1.0,0.0,7.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,5.0,1.0,0.0,18.0,1.0,0.0,8.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
max,6.0,22.0,1.0,23.0,1.0,2.0,11.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


#### Cross Validation

    1. The data is divided into 80/20 train -test split.
    2. 10 folds cross validation
    3. Random seed with random state 0 for random test and training splits for each iteration of cross validation

In [12]:
from sklearn.model_selection import ShuffleSplit
#Create Cross Validation Object with 10 folds with 80/20 train - test split
cv = ShuffleSplit(n_splits = 10, test_size=0.20, random_state=0)

#Create X Explanatory Variables DF to support the individual models
inci_X = incident

inci_X_Rfe = incident
inci_X_SVM = incident
print("inci_X", inci_X.info())
print("inci_X_Rfe", inci_X_Rfe.info())
print("inci_X_SVM", inci_X_SVM.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200780 entries, 0 to 255153
Data columns (total 22 columns):
Day1_of_the_Week         200780 non-null int8
Number_of_offense        200780 non-null int32
Arrest_status            200780 non-null int64
Call_Received_Hour       200780 non-null float64
IsMale                   200780 non-null int32
Social_crime_score       200780 non-null int64
Degree_Fact              200780 non-null int8
Div_CENTRAL              200780 non-null uint8
Div_NORTH CENTRAL        200780 non-null uint8
Div_NORTHEAST            200780 non-null uint8
Div_NORTHWEST            200780 non-null uint8
Div_SOUTH CENTRAL        200780 non-null uint8
Div_SOUTHEAST            200780 non-null uint8
Div_SOUTHWEST            200780 non-null uint8
Div_Unknown              200780 non-null uint8
UCR__ASSUALT             200780 non-null uint8
UCR__INTOXICATION        200780 non-null uint8
UCR__MURDER_DEATH        200780 non-null uint8
UCR__MVA_TRAFFIC         200780 non-null ui

##### Colinearity

The dataset had few issues with collinearity, this issue was found during ther LAB 1 iteration of data exploration. 

###### Starting Colinearity
Some of the attributes which were hightly correlated comes from either creation of new columns or they are derived from date related varibales or splitted from original categories.

In [13]:
#Create correlation matrix
CorrMat = incident.corr()

# Highest Correlation Pairs
corrPairs = CorrMat.unstack().sort_values(kind="quicksort", ascending=False)
#- REMOVE DUPLICATES
corrPairs = corrPairs[::2]
corrPairs = corrPairs[corrPairs.index.get_level_values(0) != corrPairs.index.get_level_values(1)]
with pd.option_context('display.max_rows',10):
    print(corrPairs)

UCR__OTHERS_THREATS  Degree_Fact            0.210516
Arrest_status        UCR__ASSUALT           0.206600
Degree_Fact          UCR__ASSUALT           0.159513
UCR__MVA_TRAFFIC     Degree_Fact            0.152628
UCR__MURDER_DEATH    Degree_Fact            0.113504
                                              ...   
UCR__MVA_TRAFFIC     UCR__ASSUALT          -0.257647
UCR__THEFT_FRAUD     UCR__OTHERS_THREATS   -0.340642
                     UCR__MVA_TRAFFIC      -0.349614
Degree_Fact          UCR__THEFT_FRAUD      -0.433983
UCR__THEFT_FRAUD     UCR__ASSUALT          -0.526213
Length: 231, dtype: float64


Ending Colinearity
The highly correlated attributes were manually removed from dataset. 
- Total attributes removed : _____****

In [14]:

#Drop highly correlated, insignificant and high VIF columns.
# inci_X = incident.drop(['Drug_Related', 'Call_Received_Hour', 'Gang_Related_Offense'], axis=1)

#Create correlation matrix
CorrMat = inci_X.corr()

# Highest Correlation Pairs
corrPairs = CorrMat.unstack().sort_values(kind="quicksort", ascending=False)
#- REMOVE DUPLICATES
corrPairs = corrPairs[::2]
corrPairs = corrPairs[corrPairs.index.get_level_values(0) != corrPairs.index.get_level_values(1)]
with pd.option_context('display.max_rows',10):
    print(corrPairs)

UCR__OTHERS_THREATS  Degree_Fact            0.210516
Arrest_status        UCR__ASSUALT           0.206600
Degree_Fact          UCR__ASSUALT           0.159513
UCR__MVA_TRAFFIC     Degree_Fact            0.152628
UCR__MURDER_DEATH    Degree_Fact            0.113504
                                              ...   
UCR__MVA_TRAFFIC     UCR__ASSUALT          -0.257647
UCR__THEFT_FRAUD     UCR__OTHERS_THREATS   -0.340642
                     UCR__MVA_TRAFFIC      -0.349614
Degree_Fact          UCR__THEFT_FRAUD      -0.433983
UCR__THEFT_FRAUD     UCR__ASSUALT          -0.526213
Length: 231, dtype: float64


###### Scale Data
In order to imporve accuracy and performance of our classification model and to prevent emphasis of one attribute over the other, attributes are scaled to have a mean of 0 and variance of 1 for all models in this report.Several features in the data set are decimal measurements that will never exceed 1.


In [15]:
from sklearn.preprocessing import StandardScaler

#Scale data
scaler = StandardScaler()
inci_X_scaled = scaler.fit_transform(inci_X)
inci_X_Rfe_scaled = scaler.fit_transform(inci_X_Rfe)
inci_X_SVM_scaled = scaler.fit_transform(inci_X_SVM)

#Save as data frames
df_inci_X_scaled = pd.DataFrame(inci_X_scaled)
df_inci_X_Rfe_scaled = pd.DataFrame(inci_X_Rfe_scaled)
df_inci_X_SVM_scaled= pd.DataFrame(inci_X_SVM_scaled)

#### Variance Inflation Factors (VIF)

The attributes analysis and scaling is indicated with hight variance inflation factors. Generally acceptable value should be under 10. This will help create better model. 

###### Initially, for manual reduction method for Logistic regression VIF is, 

In [17]:
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
df2_vif = pd.DataFrame()
df2_vif["VIF Factor"] = [vif(inci_X.values, i) for i in range(inci_X.shape[1])]
df2_vif["features"] = inci_X.columns
df2_vif

Unnamed: 0,VIF Factor,features
0,1.001417,Day1_of_the_Week
1,1.020348,Number_of_offense
2,1.067255,Arrest_status
3,1.006317,Call_Received_Hour
4,1.017149,IsMale
5,1.020684,Social_crime_score
6,1.294221,Degree_Fact
7,inf,Div_CENTRAL
8,inf,Div_NORTH CENTRAL
9,inf,Div_NORTHEAST


##### Applying threshold of 10 VIF is,
After applying a threshold of 10 and using the Logistic Regression-with manual variable reduction, dataset, the VIF factors have been reduced significantly and are in an acceptable range.

In [18]:
#Credit to:
###https://stats.stackexchange.com/questions/155028/how-to-systematically-remove-collinear-variables-in-python
###https://etav.github.io/python/vif_factor_python.html

from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

df2_vif = pd.DataFrame()
df2_vif["VIF Factor"] = [vif(df_inci_X_scaled.values, i) for i in range(df_inci_X_scaled.shape[1])]
df2_vif["features"] = inci_X.columns
df2_vif

Unnamed: 0,VIF Factor,features
0,1.001383,Day1_of_the_Week
1,1.020348,Number_of_offense
2,1.067249,Arrest_status
3,1.006317,Call_Received_Hour
4,1.017149,IsMale
5,1.020684,Social_crime_score
6,1.294149,Degree_Fact
7,inf,Div_CENTRAL
8,843914.7,Div_NORTH CENTRAL
9,inf,Div_NORTHEAST


### FEATURE SIGNIFICANCE

In [19]:
# LOGISTIC REGRESSION: SUMMARY TABLE WITHOUT SCALING- FEATURE SIGNIFICANCE, CROSS VALIDATION OF FULL MODEL


import statsmodels.api as sm
logit_model = sm.Logit(inci_Y, inci_X)
result = logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.632328
         Iterations 6
                            Results: Logit
Model:               Logit              Pseudo R-squared:  0.025      
Dependent Variable:  Res_time_category  AIC:               253959.7509
Date:                2018-10-03 12:26   BIC:               254174.1602
No. Observations:    200780             Log-Likelihood:    -1.2696e+05
Df Model:            20                 LL-Null:           -1.3026e+05
Df Residuals:        200759             LLR p-value:       0.0000     
Converged:           1.0000             Scale:             1.0000     
No. Iterations:      6.0000                                           
----------------------------------------------------------------------
                       Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
----------------------------------------------------------------------
Day1_of_the_Week      -0.0041   0.0024  -1.7305 0.0835 -0.0088  0.0005
Num

## Logistic Regresssion

##### Classifier Evaluation

- Functions and code utilized from Dr. Drew's NC models 
https://github.com/jakemdrew/EducationDataNC/blob/master/2016/Models/2016ComparingSegregatedHighSchoolCampuses.ipynb

In [20]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

results = []

def EvaluateClassifierEstimator(classifierEstimator, X, y, cv, model):
   
    #Perform cross validation 
    scores = cross_validate(classifierEstimator, inci_X, inci_Y, scoring=['accuracy','precision','recall']
                            , cv=cv, return_train_score=True)

    Accavg = scores['test_accuracy'].mean()
    Preavg = scores['test_precision'].mean()
    Recavg = scores['test_recall'].mean()

    print_str = "The average accuracy for all cv folds is: \t\t\t {Accavg:.5}"
    print_str2 = "The average precision for all cv folds is: \t\t\t {Preavg:.5}"
    print_str3 = "The average recall for all cv folds is: \t\t\t {Recavg:.5}"

    print(print_str.format(Accavg=Accavg))
    print(print_str2.format(Preavg=Preavg))
    print(print_str3.format(Recavg=Recavg))
    print('*********************************************************')

    print('Cross Validation Fold Mean Error Scores')
    scoresResults = pd.DataFrame()
    scoresResults['Accuracy'] = scores['test_accuracy']
    scoresResults['Precision'] = scores['test_precision']
    scoresResults['Recall'] = scores['test_recall']
    
    results.append({'Model': model, 'Accuracy': Accavg, 'Precision': Preavg, 'Recall': Recavg})

    return scoresResults

def EvaluateClassifierEstimator2(classifierEstimator, X, y, cv):
    
    #Perform cross validation 
    from sklearn.model_selection import cross_val_predict
    predictions = cross_val_predict(classifierEstimator, inci_X, inci_Y, cv=cv)
    
    #model evaluation 
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
    
    #pass true test set values and predictions to classification_report
    classReport = classification_report(Y,predictions)
    confMat = confusion_matrix(Y,predictions)
    acc = accuracy_score(Y,predictions)
    
    print (classReport)
    print (confMat)
    print (acc)

##### GridSearchCV Logistic Regression with Manual Feature Reduction


In [21]:
#Logisitic regression 10-fold cross-validation 
from sklearn.linear_model import LogisticRegression
regEstimator = LogisticRegression()

parameters = { 'penalty':['l2']
              ,'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
              ,'class_weight': ['balanced', 'none']
              ,'random_state': [0]
              ,'solver': ['lbfgs']
              ,'max_iter':[100,500]
             }

#Create a grid search object using the  
from sklearn.model_selection import GridSearchCV
regGridSearch = GridSearchCV(estimator=regEstimator
                   , n_jobs=8 # jobs to run in parallel
                   , verbose=1 # low verbosity
                   , param_grid=parameters
                   , cv=cv # KFolds = 10
                   , scoring='accuracy')

#Perform hyperparameter search to find the best combination of parameters for our data
#regGridSearch.fit(teamX, teamY)
regGridSearch.fit(df_inci_X_scaled, inci_Y)

Fitting 10 folds for each of 28 candidates, totalling 280 fits


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:   12.1s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:   44.0s
[Parallel(n_jobs=8)]: Done 280 out of 280 | elapsed:  1.1min finished


GridSearchCV(cv=ShuffleSplit(n_splits=10, random_state=0, test_size=0.2, train_size=None),
       error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'penalty': ['l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'class_weight': ['balanced', 'none'], 'random_state': [0], 'solver': ['lbfgs'], 'max_iter': [100, 500]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [22]:
#Diplay the top model parameters
regGridSearch.best_estimator_.coef_

array([[-0.00796553, -0.08838005, -0.17504015,  0.17255479, -0.03045134,
        -0.08302298,  0.12905592,  0.02997458, -0.03865177,  0.01099027,
         0.05481226, -0.01266453, -0.04454001,  0.00121415,  0.00491569,
        -0.05200729, -0.0295495 , -0.09433099,  0.05186489, -0.12319448,
         0.12497818, -0.03237182]])

#### Accuacy Precision, Recall, Attribute Weights, Model Parameters
Average accuracy, precision, and recall for the cross-validation folds is listed below.

In [23]:
#Use the best parameters for our Linear Regression object
classifierEst = regGridSearch.best_estimator_

#Evaluate the regression estimator above using our pre-defined cross validation and scoring metrics.
print("\n",round(EvaluateClassifierEstimator(classifierEst, df_inci_X_scaled, inci_Y, cv, "manual"),4))

#Use the best parameters for our Linear Regression object",
ClassiferParams = regGridSearch.best_params_
print("\n---- Logistic Regression - CV, Scaled 'Manual' Attr Elimination ----")
for keys,values in ClassiferParams.items():
    print(keys,": \t ",values)
    
# sort these attributes and spit them out\n",
name = inci_X
zip_vars = zip(regGridSearch.best_estimator_.coef_.T,name) # combine attributes
zip_vars = sorted(zip_vars, reverse=True)

# Print out Attributes and their weights\n",
print("\n---- Attributes and their weights -----\n")
for coef, name in zip_vars:
    print(name, ' has weight of', coef[0])

The average accuracy for all cv folds is: 			 0.64862
The average precision for all cv folds is: 			 0.51234
The average recall for all cv folds is: 			 0.010762
*********************************************************
Cross Validation Fold Mean Error Scores

    Accuracy  Precision  Recall
0    0.6480     0.4927  0.0096
1    0.6485     0.5048  0.0113
2    0.6470     0.4985  0.0119
3    0.6460     0.5524  0.0096
4    0.6468     0.5074  0.0097
5    0.6518     0.5100  0.0109
6    0.6516     0.4862  0.0114
7    0.6481     0.5226  0.0115
8    0.6472     0.5114  0.0111
9    0.6511     0.5374  0.0108

---- Logistic Regression - CV, Scaled 'Manual' Attr Elimination ----
C : 	  0.001
class_weight : 	  none
max_iter : 	  100
penalty : 	  l2
random_state : 	  0
solver : 	  lbfgs

---- Attributes and their weights -----

Call_Received_Hour  has weight of 0.17255478852412748
Degree_Fact  has weight of 0.1290559193980801
UCR__THEFT_FRAUD  has weight of 0.12497818088044535
Div_NORTHWEST  has weight

In [24]:
#Is there a difference between .predict and .best_estimator_.predict?  Nope.
print("Best Estimator GridSearch Prediction")
print(regGridSearch.best_estimator_.predict(df_inci_X_scaled))
print(regGridSearch.best_estimator_.predict_proba(df_inci_X_scaled))

Best Estimator GridSearch Prediction
[0 0 0 ... 0 0 0]
[[0.73277859 0.26722141]
 [0.63542657 0.36457343]
 [0.52470965 0.47529035]
 ...
 [0.57282696 0.42717304]
 [0.62291186 0.37708814]
 [0.58553968 0.41446032]]


##### GridSearchCV Logistic Regression with Recursive Feature Elimination

In [25]:
#Credit to:  Jake Drew NC Education Data Set Analysis

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import ShuffleSplit


print("RFECV Logistic Regression 1st Pass")
rfecvEstimator = LogisticRegression()

parameters = { 'penalty':['l2']
              ,'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
              ,'class_weight': ['balanced', 'none']
              ,'random_state': [0]
              ,'solver': ['lbfgs']
              ,'max_iter':[100,500]
             }

#Create a grid search object using the  
from sklearn.model_selection import GridSearchCV
rfecvGridSearch = GridSearchCV(estimator=rfecvEstimator
                   , n_jobs=8 # jobs to run in parallel
                   , verbose=1 # low verbosity
                   , param_grid=parameters
                   , cv=cv # KFolds = 10
                   , scoring='accuracy')

#Perform hyperparameter search to find the best combination of parameters for our data using RFECV
rfecvGridSearch.fit(df_inci_X_Rfe_scaled, inci_Y)

#Use the best parameters for our RFECV Linear Regression object
rfecvClassifierEst = rfecvGridSearch.best_estimator_

#Recursive Feature Elimination
rfecv = RFECV(estimator=rfecvClassifierEst, step=1, cv=cv, scoring='accuracy', verbose=1)
#X_BestFeatures = rfecv.fit_transform(teamX, teamY)
X_BestFeatures = rfecv.fit_transform(df_inci_X_Rfe_scaled, inci_Y)

#Print RFECV Details
print("Ranking", rfecv.ranking_)
print("Support", rfecv.support_)
print("Number of Features:", rfecv.n_features_)

print("Logistic Regression Second Pass")
#create a pipeline to scale all of the data and perform logistic regression during each grid search step.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

#Define a range of hyper parameters for grid search
parameters = { 'logisticregression__penalty':['l2']
              ,'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
              ,'logisticregression__class_weight': ['balanced','none']
              ,'logisticregression__random_state': [0]
              ,'logisticregression__solver': ['lbfgs']
              ,'logisticregression__max_iter':[100,500]
             }

#Perform the grid search using accuracy as a metric during cross validation.
grid = GridSearchCV(pipe, parameters, cv=cv, scoring='accuracy')

#Use the best features from recursive feature elimination during the grid search
grid.fit(df_inci_X_Rfe_scaled, inci_Y)

RFECV Logistic Regression 1st Pass
Fitting 10 folds for each of 28 candidates, totalling 280 fits


[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:   11.4s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:   44.2s
[Parallel(n_jobs=8)]: Done 280 out of 280 | elapsed:  1.1min finished


Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.
Fitting estimator with 2 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.


GridSearchCV(cv=ShuffleSplit(n_splits=10, random_state=0, test_size=0.2, train_size=None),
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'logisticregression__penalty': ['l2'], 'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'logisticregression__class_weight': ['balanced', 'none'], 'logisticregression__random_state': [0], 'logisticregression__solver': ['lbfgs'], 'logisticregression__max_iter': [100, 500]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [26]:
#Use the best parameters from RFE for our Linear Regression object

EvaluateClassifierEstimator(rfecvClassifierEst, df_inci_X_Rfe_scaled, inci_Y, cv, 'Rfe')

The average accuracy for all cv folds is: 			 0.64862
The average precision for all cv folds is: 			 0.51234
The average recall for all cv folds is: 			 0.010762
*********************************************************
Cross Validation Fold Mean Error Scores


Unnamed: 0,Accuracy,Precision,Recall
0,0.648048,0.492701,0.009555
1,0.648521,0.504762,0.011263
2,0.647027,0.498525,0.011924
3,0.64603,0.552419,0.009621
4,0.646753,0.507407,0.009655
5,0.651833,0.51,0.010939
6,0.651609,0.486239,0.011373
7,0.648073,0.522581,0.011452
8,0.647176,0.511401,0.011076
9,0.651136,0.537367,0.010763


###### Reiteration of manual feature reduction of Logistic Regression

In [27]:
print(grid.best_estimator_.predict(df_inci_X_Rfe_scaled))
print(grid.best_estimator_.predict_proba(df_inci_X_Rfe_scaled))

#Use the best parameters for our RFE  Regression object
rfecvClassifierEst = rfecvGridSearch.best_estimator_

#Evaluate the regression estimator above using our pre-defined cross validation and scoring metrics.
print("\n",round(EvaluateClassifierEstimator(rfecvClassifierEst, df_inci_X_Rfe_scaled, inci_Y, cv, "manual"),4))

#Use the best parameters for our RFECV Regression object",
rfecvClassiferParams = rfecvGridSearch.best_params_
print("\n---- RFECV Regression - CV, Scaled ----")
for keys,values in rfecvClassiferParams.items():
    print(keys,": \t ",values)
    
# sort these attributes and spit them out\n",
name = inci_X
zip_vars = zip(rfecvGridSearch.best_estimator_.coef_.T,name) # combine attributes
zip_vars = sorted(zip_vars, reverse=True)

# Print out Attributes and their weights\n",
print("\n---- Attributes and their weights -----\n")
for coef, name in zip_vars:
    print(name, ' has weight of', coef[0])

[0 0 0 ... 0 0 0]
[[0.73277859 0.26722141]
 [0.63542657 0.36457343]
 [0.52470965 0.47529035]
 ...
 [0.57282696 0.42717304]
 [0.62291186 0.37708814]
 [0.58553968 0.41446032]]
The average accuracy for all cv folds is: 			 0.64862
The average precision for all cv folds is: 			 0.51234
The average recall for all cv folds is: 			 0.010762
*********************************************************
Cross Validation Fold Mean Error Scores

    Accuracy  Precision  Recall
0    0.6480     0.4927  0.0096
1    0.6485     0.5048  0.0113
2    0.6470     0.4985  0.0119
3    0.6460     0.5524  0.0096
4    0.6468     0.5074  0.0097
5    0.6518     0.5100  0.0109
6    0.6516     0.4862  0.0114
7    0.6481     0.5226  0.0115
8    0.6472     0.5114  0.0111
9    0.6511     0.5374  0.0108

---- RFECV Regression - CV, Scaled ----
C : 	  0.001
class_weight : 	  none
max_iter : 	  100
penalty : 	  l2
random_state : 	  0
solver : 	  lbfgs

---- Attributes and their weights -----

Call_Received_Hour  has weight o

## SUPPORT VECTOR MACHINE (SVM)

In [28]:
#SVM model on main dataframe.
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn import metrics as mt

#train the model just as before
svm_clf = SVC(C=0.5, kernel='linear', degree=3, gamma='auto') # get object
svm_clf.fit(df_inci_X_SVM_scaled, inci_Y)  # train object

y_hat = svm_clf.predict(df_inci_X_SVM_scaled)

acc = mt.accuracy_score(inci_Y,y_hat)
conf = mt.confusion_matrix(inci_Y,y_hat)
prec = mt.precision_score(inci_Y, y_hat)
recall = mt.recall_score(inci_Y, y_hat)
print('accuracy:', acc )
print('precision:', prec)
print('recall:', recall)
print(conf)

results.append({'Model': 'SVM', 'Accuracy': acc, 'Precision': prec, 'Recall': recall})

accuracy: 0.6478185078195039
precision: 0.0
recall: 0.0
[[130069      0]
 [ 70711      0]]


  'precision', 'predicted', average, warn_for)


In [29]:
#look at the support vectors
print(svm_clf.support_vectors_.shape)
print(svm_clf.support_.shape)
print(svm_clf.n_support_ )

(142796, 22)
(142796,)
[72085 70711]


In [30]:
# SVM based Prediction
print(y_hat)

[0 0 0 ... 0 0 0]


## Create Model Summary

- The three models: 
    - Logistic regression with manual selection, 
    - Logistic regression with RFE selection, 
    - Support Vector Machine (SVM) 
    
These models were executed successfully. The models were cross validated with controls. 

- Stochastic Gradient Descent was not utilized for the support vector machine model ...................***

- The "GridSearchCV Logistic Regression with manual variable reduction" model ultimately produced the best accuracy and overall results. The results are summarized in the table below.


In [1]:
df_results = pd.DataFrame(results)
df_results = df_results[['Model', 'Accuracy', 'Precision', 'Recall']]
df_results

NameError: name 'pd' is not defined

### Model Advantages

For classification analysis both Logistic regression and Support Vector Machines are common machine learning algorithms for creating models.

- Logistic regression

    - For maximizing the probability of the data, logistic regression models are generally good. The accuracy of model is acheived at its best in these type of models when data points are distictly separated far away from hyperplane.
    - This is more probablistic model.
    - RFE (Recursive Feature Elimination) model chooses the peformance of feature and repeats process until all attibutes are analyzed. 

- Support Vector Machine

    - By definition, SVM models tired to score on hyperplane which maximizes the distance closest to margin or support vectors.
    - This is deterministic model.
    - The SVM model creates hyperplane and puts source data in these dimensional space which is different from original data and is analyzed accordingly. 


Generally, from the results produced by our models both Logistic Regrerssion and SVM have similar accuracy. However, manual and RFE model of Logistic regression performed well in terms of precision and recall then SVM. In terms of accuracy, manual and RFE model has 89.77%, whereas SVM had 89.74% accuracy, which is with less than 1% of each other. 
In terms of precision, which means correctly classified classes, RFE and manual models had 60.87% precision. SVM did not perform well for out dataset or our variable selection. Overall, the manual and RFE logistic regression model performed best for our dataset. 




## Feature Importance for Logistic Regression
In logistic models, feature weights will provide us with importance of attribute. We can compare RFE and manual models in terms of weight as both were normalized.

###### Manual Variable Selection Model

*** EXPLANATION RELATED TO ATTRIBUTES 

###### Recursive Selection Model

*** EXPLANATION RELATED TO ATTRIBUTES 

In [None]:
from matplotlib import pyplot as plt

def plotCoef(coef, names, t):
    imp = coef
    imp,names = zip(*sorted(zip(imp,names), key=lambda x: abs(x[0])))
    plt.figure(figsize=(9,12))
    barlist = plt.barh(range(len(names)), imp, align='center')
    for x in np.nditer(np.where(np.asarray(list(imp)) < 0)):
        barlist[x].set_color('r')
    plt.yticks(range(len(names)), names)
    plt.title(t)
    plt.show()



In [None]:
plotCoef(regGridSearch.best_estimator_.coef_[0], inci_X.columns.values, "Manual Logistic Features")
list(sorted(zip(regGridSearch.best_estimator_.coef_.ravel(), inci_X.columns.values)))

In [None]:
plotCoef(grid.best_estimator_.named_steps['logisticregression'].coef_.ravel(), inci_X_Rfe.columns.values, "Recursive Logistic Features")
list(sorted(zip(grid.best_estimator_.named_steps['logisticregression'].coef_.ravel(), inci_X_Rfe.columns.values)))

### Interpreting SVM Fields

For SVM models, the interpretation of field importance is not as straight forward. Non-linear SVM models create hyperplanes in infinite dimensional space. To accomplish this the source data used in the analysis must be mapped to a higher dimentional space and as a result is very different from the original data. Because of this it is not possible to determine feature weights like we did with the logisitc regessions above.

However, we can examine individual features to investigate how SVM approaches classification problems.**** EXPLANATION

In [None]:
#Credit To:
####http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
from matplotlib import pyplot as plt

#R and RA
pX = pd.DataFrame()

pX['a'] = inci_X_SVM['UCR_Offense_Name']
pX['b'] = inci_X_SVM['Number_of_offense']

psvc = SVC(kernel='linear', C=0.5, gamma='auto').fit(pX, inci_Y)

pXAmin = pX['a'].min() - 1
pXAmax = pX['a'].max() + 1
pXBmin = pX['b'].min() - 1
pXBmax = pX['b'].max() + 1

pxx, pyy = np.meshgrid(np.arange(pXAmin, pXAmax, 10), np.arange(pXBmin, pXBmax, 10))

plt.subplot(1, 1, 1)

pZ = psvc.predict(np.c_[pxx.ravel(), pyy.ravel()])

pZ = pZ.reshape(pxx.shape)
plt.contourf(pxx, pyy, pZ, cmap=plt.cm.Paired, alpha=0.8)

plt.scatter(pX['a'], pX['b'], c=inci_Y, cmap=plt.cm.Paired)
plt.xlabel('UCR_Offense_Name')
plt.ylabel('Number_of_offense')
plt.xlim(pxx.min(), pxx.max())
plt.title('SVM:  UCR_Offense_Name and Number_of_offense')
plt.show()

### END OF REPORT

In [None]:
# from sklearn.svc import SVM
# from sklearn.model_selection import StratifiedKFold

# X = inci_X_SVM_scaled
# y = inci_Y

# # Create the validation curve visualizer
# cv = StratifiedKFold(12)
# param_range = np.logspace(-6, -1, 12)

# viz = ValidationCurve(
#     SVC(), param_name="gamma", param_range=param_range,
#     logx=True, cv=cv, scoring="f1_weighted", n_jobs=8,
# )

# viz.fit(X, y)
# viz.poof()

In [None]:
from sklearn.model_selection import train_test_split

# Extract the numpy arrays from the data frame
X = inci_X.as_matrix()
y = inci_Y.as_matrix()

# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
from sklearn.linear_model import LogisticRegression

from yellowbrick.classifier import ROCAUC

# Instantiate the classification model and visualizer
logistic = LogisticRegression()
visualizer = ROCAUC(logistic)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.poof()             # Draw/show/poof the data