# MSDS 7331 - Mini Lab: Logistic Regression and SVMs

### Investigators
- [Matt Baldree](mailto:mbaldree@smu.edu?subject=lab1)
- [Tom Elkins](mailto:telkins@smu.edu?subject=lab1)
- [Austin Kelly](mailto:ajkelly@smu.edu?subject=lab1)
- [Murali Parthasarathy](mailto:mparthasarathy@smu.edu?subject=lab1)


<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:5px;'>
    <h3>Lab Instructions</h3>
    <p>You are to perform predictive analysis (classification) upon a data set: model the dataset using methods we have discussed in class: logistic regression and support vector machines, and making conclusions from the analysis. Follow the CRISP-DM framework in your analysis (you are not performing all of the CRISP-DM outline, only the portions relevant to the grading rubric outlined below). This report is worth 10% of the final grade. You may complete this assignment in teams of as many as three people.
Write a report covering all the steps of the project. The format of the document can be PDF, *.ipynb, or HTML. You can write the report in whatever format you like, but it is easiest to turn in the rendered iPython notebook. The results should be reproducible using your report. Please carefully describe every assumption and every step in your report.</p>
</div>

<a id='data_prep'></a>
## 1 - Data Preparation

In [19]:
#1.0.1 - Import the libraries we will need
import pandas as pd
import numpy as np
import math

%matplotlib inline
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(font_scale=2)
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings

from sklearn.linear_model import LogisticRegression
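# sklearn.cross_validation was removed in later scikit-learn releases;
# the same functions live in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_val_score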

# Read in the crime data from the Lab 1 CSV file
dc = pd.read_csv('data/DC_Crime_2015_Lab1.csv')

### *** TO DO:
###  * Incorporate a feature for the weather conditions during START_DATE and END_DATE so we can use rainfall/max temp/min temp in the regression
dc['REPORT_DAT'] = pd.to_datetime(dc['REPORT_DAT'])
dc=dc.rename(columns = {'REPORT_DAT':'REPORT_DATE'})
dc['START_DATE'] = pd.to_datetime(dc['START_DATE'])
dc['END_DATE'] = pd.to_datetime(dc['END_DATE'])
dc['XBLOCK'] = dc['XBLOCK'].astype(np.float64)
dc['YBLOCK'] = dc['YBLOCK'].astype(np.float64)
dc['Crime_Month'] = dc["START_DATE"].map(lambda x: x.month)
dc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36489 entries, 0 to 36488
Data columns (total 27 columns):
REPORT_DATE             36489 non-null datetime64[ns]
SHIFT                   36489 non-null object
OFFENSE                 36489 non-null object
METHOD                  36489 non-null object
DISTRICT                36442 non-null float64
PSA                     36441 non-null float64
WARD                    36489 non-null int64
ANC                     36489 non-null int64
NEIGHBORHOOD_CLUSTER    36489 non-null int64
CENSUS_TRACT            36489 non-null int64
VOTING_PRECINCT         36489 non-null int64
CCN                     36489 non-null int64
XBLOCK                  36489 non-null float64
YBLOCK                  36489 non-null float64
START_DATE              36489 non-null datetime64[ns]
END_DATE                36489 non-null datetime64[ns]
PSA_ID                  36489 non-null int64
DistrictID              36489 non-null int64
SHIFT_Code              36489 non-null int6

### 1.1 - Dataset Review
We continue to use the dataset selected for Lab 1: the 2015 Washington, D.C. Metro Crime data. The dataset records the type of crime committed in the "OFFENSE" field, from which we derived an "Offense_Code" field that assigns a numeric value to each offense type. (NOTE: The number does not imply a level of severity; the codes were simply assigned in order of appearance.) Each offense is also grouped into a "Crime_Type" of Violent or Property:

|Offense|Offense_Code|Crime_Type|
|:------|:----------:|---------:|
|Theft/Other|1|2 (Property)|
|Theft from Auto|2|2 (Property)|
|Burglary|3|2 (Property)|
|Assault with Dangerous Weapon|4|1 (Violent)|
|Robbery|5|1 (Violent)|
|Motor Vehicle Theft|6|2 (Property)|
|Homicide|7|1 (Violent)|
|Sex Abuse|8|1 (Violent)|
|Arson|9|2 (Property)|
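
The table above can be expressed as a small lookup. This is only a sketch of one way the derived fields could be built, not necessarily how they were built for Lab 1; the names `OFFENSE_CODES`, `VIOLENT_CODES`, and `crime_type` are hypothetical, and the offense strings are assumed to match the table exactly.

```python
# Offense names and codes copied from the table above.
OFFENSE_CODES = {
    "Theft/Other": 1,
    "Theft from Auto": 2,
    "Burglary": 3,
    "Assault with Dangerous Weapon": 4,
    "Robbery": 5,
    "Motor Vehicle Theft": 6,
    "Homicide": 7,
    "Sex Abuse": 8,
    "Arson": 9,
}

# Codes the table marks as violent; every other code is a property crime.
VIOLENT_CODES = {4, 5, 7, 8}


def crime_type(offense_code):
    """Return 1 (Violent) or 2 (Property) for an Offense_Code (hypothetical helper)."""
    return 1 if offense_code in VIOLENT_CODES else 2


# Crime_Type for every offense name in the table.
crime_types = {name: crime_type(code) for name, code in OFFENSE_CODES.items()}
print(crime_types)
```

With pandas, a call such as `dc["OFFENSE"].map(OFFENSE_CODES)` could apply the same lookup to the whole column, assuming the strings stored in the data match the keys used here.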

The dataset contains a variety of geographic identifiers representing different political, social, and legal boundaries.

DISTRICT -- the Police district within which the crime was committed<br>
Police Service Area (PSA) -- A subordinate area within a District<br>
Ward -- A political area, similar to a "county" in a larger state<br>
Advisory Neighborhood Commission (ANC) -- A social group consisting of neighbors and social leaders in a small geographic area<br>
Voting Precinct -- A political area for the management of voting residents<br>
Local Coordinates (XBLOCK and YBLOCK) -- location within the DC metro area based on the Maryland mapping system<br>
Global Coordinates (Latitude and Longitude) -- location on the planet<br>

There are also time-based identifiers provided in the data:
* The Start and End dates/times of when the crime *might* have been committed.
* The date/time the crime was reported (i.e. when the police responded and took the report)
* These can be further decomposed to Seasons, Months, Weeks, Day of the Week, etc.
* Shift - the police duty shift that responded to the crime (broken into 8-hour periods within a day)
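
As a rough illustration of the decomposition mentioned in the list above, the sketch below breaks a single timestamp into month, week, day of the week, and season. The helper names `season_of` and `decompose` are hypothetical, the season boundaries are an assumed (meteorological) convention, and the timestamp is made up for the example.

```python
from datetime import datetime


def season_of(month):
    """Map a month number to a season (assumed meteorological convention)."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"


def decompose(timestamp):
    """Break one timestamp into coarser time-based features."""
    return {
        "month": timestamp.month,
        "week": timestamp.isocalendar()[1],   # ISO week number
        "day_of_week": timestamp.weekday(),   # Monday = 0 ... Sunday = 6
        "season": season_of(timestamp.month),
    }


# Made-up timestamp, used only to exercise the helper.
example = decompose(datetime(2015, 7, 4, 22, 30))
print(example)
```

The same idea is what the `Crime_Month` column in cell 1.0.1 does for the month alone.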

From these time-based data we could associate environmental conditions as well, including temperatures, rainfall, phase of the moon, etc.

These features give us a variety of ways to attempt to classify the data.

### 1.2 - Classification Tasks
We decided to take a look at two different classification processes with our data set.

#### 1.2.1 - Crime_Type (Violent/Property)
The first classification task is a binary classification, in which we attempt to build a model to predict whether a crime will be against a person (violent) or against property. The goal is to help the Police manage resources more appropriately.

#### 1.2.2 - Offense/Offense_Code
For the second classification task, we attempt to build a model to predict the type of offense given the other features of the data (geographic location, time of day, political area, etc.). Again, the hope is that if a type of crime could be predicted, the Police would be better able to allocate offense-specific resources.

#### 1.2.3 - Model Comparison
Secondarily, we seek to compare the two models for consistency, i.e. if the Crime_Type prediction indicates a "Violent" crime, does the Offense prediction agree (Homicide, Sex Abuse, Robbery, or Assault with Dangerous Weapon)?
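
One possible way to quantify that agreement is sketched below. The helper names and the sample predictions are hypothetical and exist only to show the check; the set of violent codes comes from the offense table in section 1.1.

```python
# Offense codes that section 1.1 lists as violent.
VIOLENT_OFFENSE_CODES = {4, 5, 7, 8}


def predictions_agree(predicted_violent, predicted_offense_code):
    """True when both models place the crime in the same Violent/Property group."""
    return predicted_violent == (predicted_offense_code in VIOLENT_OFFENSE_CODES)


def agreement_rate(violent_flags, offense_codes):
    """Fraction of rows on which the two models agree."""
    pairs = list(zip(violent_flags, offense_codes))
    agreed = sum(1 for v, c in pairs if predictions_agree(v, c))
    return agreed / float(len(pairs))


# Made-up predictions, for illustration only.
violent_flags = [True, False, True, False]
offense_codes = [7, 1, 2, 5]
rate = agreement_rate(violent_flags, offense_codes)
print(rate)
```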


<a id="model_building"></a>
## 2 - Model Building

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>SVM and Logistic Regression Modeling</h3>
    <ol><li>[<b>50 points</b>] Create a logistic regression model and a support vector machine model for the classification task involved with your dataset. Assess how well each model performs (use 80/20 training/testing split for your data). Adjust parameters of the models to make them more accurate. If your dataset size requires the use of stochastic gradient descent, then linear kernel only is fine to use.</li>
    <li>[<b>10 points</b>]  Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.</li>
    <li>[<b>30 points</b>] Use the weights from logistic regression to interpret the importance of different features for each classification task. Explain your interpretation in detail. Why do you think some variables are more important?</li>
    <li>[<b>10 points</b>]  Look at the chosen support vectors for the classification task. Do these provide
any insight into the data? Explain.</li>
</ol>
</div>

### 2.1 - Logistic Regression Model for Crime_Type (Rubric Item 1)

In [42]:
#2.1.1 Dataset creation

#  The field "CRIME_TYPE" exists as 1 = Violent, and 2 = Property.  
#  We subtract from 2 to make it 1 = Violent, and 0 = Property
LRM_Response = 2 - dc["CRIME_TYPE"]
#print LRM_Response

#  What is the mean response
Mean_Response = LRM_Response.mean()
print("Mean response for the entire data set is " + str(Mean_Response))

#  Baseline: the accuracy of always predicting the majority class (Property)
Guess_Rate = 1.0 - Mean_Response
print("If we simply guessed 'Property' crime all the time, our accuracy would be " + str(Guess_Rate))

#  Set up model using all relevant features
LRM_Features = dc[["PSA_ID","WARD","ANC","NEIGHBORHOOD_CLUSTER","CENSUS_TRACT","VOTING_PRECINCT","SHIFT_Code","Latitude","Longitude","Crime_Month"]]
#print LRM_Features

#  Fit our model
LRM_Model = LogisticRegression()
LRM_Model = LRM_Model.fit(LRM_Features, LRM_Response)

#  How accurate is it?
#  NOTE: this is accuracy on the same rows the model was fitted to
Model_Acc = LRM_Model.score(LRM_Features, LRM_Response)
print("Accuracy of Logistic Regression model is " + str(Model_Acc))

if Model_Acc > Guess_Rate:
    print("The Logistic Regression model is better than simply guessing")
else:
    print("The Logistic Regression model is worse than simply guessing")


Mean response for the entire data set is 0.169174271698
If we simply guessed 'Property' crime all the time, our accuracy would be 0.830825728302
Accuracy of Logistic Regression model is 0.830250212393
The Logistic Regression model is worse than simply guessing


In [38]:
#  Display the coefficients to see if they tell us anything
#  list() is needed so that zip() also works as DataFrame input on Python 3
pd.DataFrame(list(zip(LRM_Features.columns, np.transpose(LRM_Model.coef_))))

| |0|1|
|:--|:--|:--|
|0|PSA_ID|[0.00167783221638]|
|1|WARD|[0.0195928056189]|
|2|ANC|[0.00886350684632]|
|3|NEIGHBORHOOD_CLUSTER|[-0.00232184509213]|
|4|CENSUS_TRACT|[-1.37492426304e-05]|
|5|VOTING_PRECINCT|[0.0014077633811]|
|6|SHIFT_Code|[0.79779478674]|
|7|Latitude|[-0.0193099343437]|
|8|Longitude|[0.0475113614735]|
|9|Crime_Month|[0.00955306118479]|


* The largest coefficient by far belongs to SHIFT_Code (i.e. the time of day). Because the features are not standardized, coefficient magnitude is only a rough guide to importance
* The second-largest coefficient is Longitude; its positive sign implies that the chance of a crime being violent increases as you move east
* The third-largest coefficient is the political WARD, which suggests that some Wards are worse than others, although WARD is a category label rather than a true numeric scale
* The fourth-largest coefficient is Latitude, with a negative sign, so there is a greater chance of a crime being violent as you move south
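
To make the coefficients a little easier to read, they can be converted to odds ratios: in a logistic regression, `exp(coefficient)` is the multiplicative change in the odds of the positive class (here, Violent) for a one-unit increase in that feature, with the other features held fixed. The sketch below applies this to the four coefficients discussed above; the values are copied from the output, and the variable names are only illustrative.

```python
import math

# Coefficients copied from the output above (the four largest by magnitude).
coefficients = {
    "SHIFT_Code": 0.79779478674,
    "Longitude": 0.0475113614735,
    "WARD": 0.0195928056189,
    "Latitude": -0.0193099343437,
}

# exp(coefficient) = change in the odds of a violent crime per unit increase.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

# Print in order of coefficient magnitude, largest first.
for name in sorted(coefficients, key=lambda n: abs(coefficients[n]), reverse=True):
    print("%-12s odds ratio per unit increase: %.4f" % (name, odds_ratios[name]))
```

An odds ratio above 1 raises the odds of a violent crime and one below 1 lowers them. The "unit" differs per feature (one shift code versus one degree of longitude), which is another reason the raw magnitudes are not directly comparable.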

In [43]:
# Split the data into a training set and a test set (80/20)

LRM_XTrain, LRM_XTest, LRM_YTrain, LRM_YTest = train_test_split(LRM_Features, LRM_Response, test_size=0.2, random_state=0)

# Fit the same features against the training data
LRM_Model2 = LogisticRegression()
LRM_Model2.fit(LRM_XTrain, LRM_YTrain)

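#  How accurate is it?
#  NOTE: this scores against the full data set (training and test rows combined),
#  not the held-out test set; the held-out accuracy is computed in the next cell
Model_Acc = LRM_Model2.score(LRM_Features, LRM_Response)
print("Accuracy of Logistic Regression model is " + str(Model_Acc))

if Model_Acc > Guess_Rate:
    print("The Logistic Regression model is better than simply guessing")
else:
    print("The Logistic Regression model is worse than simply guessing")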


Accuracy of Logistic Regression model is 0.830250212393
The Logistic Regression model is worse than simply guessing


In [34]:
# predict the class (1 = Violent, 0 = Property) for the held-out test set
predicted = LRM_Model2.predict(LRM_XTest)
print(predicted)

# generate class probabilities (column 0 = Property, column 1 = Violent)
probs = LRM_Model2.predict_proba(LRM_XTest)
print(probs)

# generate evaluation metrics on the held-out test set
print(metrics.accuracy_score(LRM_YTest, predicted))
print(metrics.roc_auc_score(LRM_YTest, probs[:, 1]))

print(metrics.confusion_matrix(LRM_YTest, predicted))
print(metrics.classification_report(LRM_YTest, predicted))

[0 0 0 ..., 0 0 0]
[[ 0.62327851  0.37672149]
 [ 0.86575354  0.13424646]
 [ 0.91667427  0.08332573]
 ..., 
 [ 0.8726165   0.1273835 ]
 [ 0.84396971  0.15603029]
 [ 0.86138094  0.13861906]]
0.825020553576
0.720589616457
[[5919   82]
 [1195  102]]
             precision    recall  f1-score   support

          0       0.83      0.99      0.90      6001
          1       0.55      0.08      0.14      1297

avg / total       0.78      0.83      0.77      7298
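
The figures in the report above can be recomputed by hand from the confusion matrix, which helps when reading them. The sketch below uses the four counts printed above (rows are the actual class, columns the predicted class); the variable names are illustrative only.

```python
# Counts copied from the confusion matrix above.
# Row 0 = actual Property, row 1 = actual Violent.
tn, fp = 5919, 82     # Property predicted as Property / as Violent
fn, tp = 1195, 102    # Violent predicted as Property / as Violent

total = tn + fp + fn + tp

accuracy = (tn + tp) / float(total)          # share of all predictions that are right
precision_violent = tp / float(tp + fp)      # of predicted Violent, how many were Violent
recall_violent = tp / float(tp + fn)         # of actual Violent, how many were found
test_baseline = (tn + fp) / float(total)     # accuracy of always guessing Property

print("accuracy          %.4f" % accuracy)
print("precision (1)     %.4f" % precision_violent)
print("recall (1)        %.4f" % recall_violent)
print("always-Property   %.4f" % test_baseline)
```

On this particular split the model's accuracy is only marginally above the always-Property baseline for the test rows, and the low recall for class 1 shows that it finds very few of the violent crimes.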



In [35]:
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), LRM_Features, LRM_Response, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())

[ 0.83068493  0.83068493  0.83068493  0.83091258  0.83091258  0.83063853
  0.83086623  0.83004386  0.81030702  0.73574561]
0.81914812025
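
The mean alone hides how uneven the folds are. The sketch below summarizes the ten scores printed above (copied by hand, so limited to the printed precision) with their spread and the position of the weakest fold; the variable names are illustrative.

```python
# Fold accuracies copied from the output above.
scores = [0.83068493, 0.83068493, 0.83068493, 0.83091258, 0.83091258,
          0.83063853, 0.83086623, 0.83004386, 0.81030702, 0.73574561]

mean_score = sum(scores) / len(scores)

# Population standard deviation of the fold accuracies.
variance = sum((s - mean_score) ** 2 for s in scores) / len(scores)
std_score = variance ** 0.5

# Index (0-based) of the weakest fold.
lowest_fold = scores.index(min(scores))

print("mean  %.4f" % mean_score)
print("std   %.4f" % std_score)
print("lowest fold: %d (%.4f)" % (lowest_fold, scores[lowest_fold]))
```

The first eight folds are nearly identical and the last two drop off. One possible explanation, which we have not verified, is that the rows are stored in some meaningful order and the folds are taken without shuffling.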


### 2.2 - Support Vector Machine Model for Crime_Type (Rubric Item 1)

### 2.3 - Logistic Regression Model for Offense_Code (Exceptional Work)

### 2.4 - Support Vector Machine Model for Offense_Code (Exceptional Work)

### 2.5 - Advantages of Each Model (Rubric Item 2)

### 2.6 - Logistic Regression Weights (Rubric Item 3)

### 2.7 - Support Vectors (Rubric Item 4)