# MSDS 7331 - Mini Lab: Logistic Regression and SVMs

### Investigators
- [Matt Baldree](mailto:mbaldree@smu.edu?subject=lab1)
- [Tom Elkins](mailto:telkins@smu.edu?subject=lab1)
- [Austin Kelly](mailto:ajkelly@smu.edu?subject=lab1)
- [Murali Parthasarathy](mailto:mparthasarathy@smu.edu?subject=lab1)


<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:5px;'>
    <h3>Lab Instructions</h3>
    <p>You are to perform predictive analysis (classification) upon a data set: model the dataset using methods we have discussed in class: logistic regression and support vector machines, and making conclusions from the analysis. Follow the CRISP-DM framework in your analysis (you are not performing all of the CRISP-DM outline, only the portions relevant to the grading rubric outlined below). This report is worth 10% of the final grade. You may complete this assignment in teams of as many as three people.
Write a report covering all the steps of the project. The format of the document can be PDF, *.ipynb, or HTML. You can write the report in whatever format you like, but it is easiest to turn in the rendered iPython notebook. The results should be reproducible using your report. Please carefully describe every assumption and every step in your report.</p>
</div>

<a id='data_prep'></a>
## 1 - Data Preparation

In [1]:
import pandas as pd
import numpy as np
import math

%matplotlib inline
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(font_scale=2)
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings

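# Read in the crime data from the combined CSV file
dc = pd.read_csv('data/DC_Crime_2015_Lab1.csv')

# Preview the first ten rows as loaded
display(dc.head(10))

# Drop the leftover index column from an earlier CSV export. The label contains
# a space ("Unnamed: 0"), and drop() returns a new frame, so reassign the result.
dc = dc.drop(columns=["Unnamed: 0"])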

|Unnamed: 0.1|Unnamed: 0|REPORT_DAT|SHIFT|OFFENSE|METHOD|DISTRICT|PSA|WARD|ANC|NEIGHBORHOOD_CLUSTER|...|PSA_ID|DistrictID|SHIFT_Code|OFFENSE_Code|METHOD_Code|CRIME_TYPE|AGE|TIME_TO_REPORT|Latitude|Longitude|
|---:|---:|:---|:---|:---|:---|---:|---:|---:|---:|---:|:---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
|0|0|2015-03-04 12:05:00|DAY|THEFT/OTHER|OTHERS|3.0|305.0|1|12|3|...|305|3|1|1|1|2|2678580.0|15897720.0|38.918640|-77.031953|
|1|1|2015-01-22 09:00:00|DAY|THEFT F/AUTO|OTHERS|4.0|408.0|1|14|2|...|408|4|1|2|1|2|176880.0|6299520.0|38.934826|-77.039955|
|2|2|2015-01-03 21:20:00|EVENING|THEFT/OTHER|OTHERS|3.0|302.0|1|11|2|...|302|3|2|1|1|2|600.0|7200.0|38.929513|-77.032731|
|3|3|2015-01-05 12:44:00|DAY|THEFT/OTHER|OTHERS|3.0|306.0|1|12|3|...|306|3|1|1|1|2|1449000.0|4440.0|38.922580|-77.019719|
|4|4|2015-01-20 07:01:00|DAY|THEFT F/AUTO|OTHERS|3.0|302.0|1|11|2|...|302|3|1|2|1|2|71940.0|120.0|38.928635|-77.029708|
|5|5|2015-01-20 06:38:00|MIDNIGHT|BURGLARY|OTHERS|3.0|305.0|1|12|3|...|305|3|3|3|1|2|75600.0|5880.0|38.918768|-77.023893|
|6|6|2015-01-20 11:30:00|DAY|ASSAULT W/DW|OTHERS|3.0|304.0|1|12|2|...|304|3|1|4|1|1|1140.0|2400.0|38.922424|-77.028876|
|7|7|2015-01-20 12:00:00|DAY|THEFT/OTHER|OTHERS|4.0|408.0|1|14|2|...|408|4|1|1|1|2|88200.0|77400.0|38.935881|-77.036459|
|8|8|2015-01-01 23:48:00|MIDNIGHT|ROBBERY|GUN|3.0|305.0|1|12|3|...|305|3|3|5|2|1|60.0|2880.0|38.919462|-77.025035|
|9|9|2015-01-04 01:28:00|MIDNIGHT|THEFT F/AUTO|OTHERS|3.0|304.0|1|12|2|...|304|3|3|2|1|2|12600.0|780.0|38.924116|-77.035347|


### 1.1 - Dataset Review
We continue to use the dataset selected for Lab 1: the 2015 Washington, D.C. Metro Crime data.  That dataset contains the type of crime committed (field name "OFFENSE"), from which we derived an "Offense_Code" field that assigns a numeric value to each offense type. (NOTE: The number does not imply a level of severity; the codes were simply assigned in order of appearance.) Each offense also falls into one of two "Crime_Type" categories, 1 (Violent) or 2 (Property):

|Offense|Offense_Code|Crime_Type|
|:------|:----------:|---------:|
|Theft/Other|1|2 (Property)|
|Theft from Auto|2|2 (Property)|
|Burglary|3|2 (Property)|
|Assault with Dangerous Weapon|4|1 (Violent)|
|Robbery|5|1 (Violent)|
|Motor Vehicle Theft|6|2 (Property)|
|Homicide|7|1 (Violent)|
|Sex Abuse|8|1 (Violent)|
|Arson|9|2 (Property)|
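
The order-of-appearance coding in the table can be reproduced with `pd.factorize`. The following is a minimal sketch on a hypothetical five-row frame, not the code that produced the Lab 1 file. The `VIOLENT` set follows the table; the spellings "HOMICIDE" and "SEX ABUSE" are assumptions, since those labels do not appear in the preview.

```python
import pandas as pd

# Hypothetical sample; the real data is loaded from the Lab 1 CSV.
sample = pd.DataFrame({"OFFENSE": ["THEFT/OTHER", "THEFT F/AUTO", "THEFT/OTHER",
                                   "BURGLARY", "ROBBERY"]})

# factorize() numbers labels from 0 in order of first appearance; add 1 to start at 1.
codes, labels = pd.factorize(sample["OFFENSE"])
sample["OFFENSE_Code"] = codes + 1

# Offenses against a person are coded 1 (Violent); everything else is 2 (Property).
# "HOMICIDE" and "SEX ABUSE" are assumed spellings of the raw labels.
VIOLENT = {"ASSAULT W/DW", "ROBBERY", "HOMICIDE", "SEX ABUSE"}
sample["CRIME_TYPE"] = sample["OFFENSE"].isin(VIOLENT).map({True: 1, False: 2})

print(sample)
```

Because the codes depend on the order in which offenses first appear, this toy frame assigns ROBBERY the code 4 rather than the 5 it receives in the full dataset.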

The dataset contains a variety of geographic identifiers representing different political, social, and legal boundaries.

DISTRICT -- the Police district within which the crime was committed<br>
Police Service Area (PSA) -- A subordinate area within a District<br>
Ward -- A political area, similar to a "county" in a larger state<br>
Advisory Neighborhood Commission (ANC) -- A social group consisting of neighbors and social leaders in a small geographic area<br>
Voting Precinct -- A political area for the management of voting residents<br>
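
These identifiers are labels rather than quantities, so a linear model should not treat Ward 2 as "twice" Ward 1. One common way to handle this is one-hot encoding. The sketch below uses a hypothetical three-row frame and is only an illustration of the idea, not a step taken from our pipeline.

```python
import pandas as pd

# Hypothetical rows; the values are illustrative only.
geo = pd.DataFrame({"DISTRICT": [3, 4, 3], "WARD": [1, 1, 2]})

# Expand each identifier into one 0/1 indicator column per distinct value.
geo_onehot = pd.get_dummies(geo, columns=["DISTRICT", "WARD"])

print(geo_onehot.columns.tolist())
```

Each original column is replaced by one indicator column per distinct value, named `<column>_<value>`.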

There are also time-based identifiers provided in the data:
* The Start and End dates/times of when the crime *might* have been committed.
* The date/time the crime was reported (i.e. when the police responded and took the report).
* These can be further decomposed into Seasons, Months, Weeks, Day of the Week, etc.
* Shift - the police duty shift that responded to the crime (broken into 8-hour periods within a day).

From these time-based data we could associate environmental conditions as well, including temperatures, rainfall, phase of the moon, etc.
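
The decomposition of a timestamp into months, days of the week, and seasons can be sketched with the pandas `.dt` accessor. This is a minimal illustration on two hypothetical timestamps in the `REPORT_DAT` format; the derived column names and the meteorological season boundaries are assumptions, not fields in our dataset.

```python
import pandas as pd

# Hypothetical timestamps in the same format as REPORT_DAT.
ts = pd.DataFrame({"REPORT_DAT": ["2015-03-04 12:05:00", "2015-01-03 21:20:00"]})
ts["REPORT_DAT"] = pd.to_datetime(ts["REPORT_DAT"])

# Calendar components (column names are illustrative).
ts["MONTH"] = ts["REPORT_DAT"].dt.month
ts["DAY_OF_WEEK"] = ts["REPORT_DAT"].dt.dayofweek  # Monday = 0, Sunday = 6
ts["HOUR"] = ts["REPORT_DAT"].dt.hour

# Assumed convention: meteorological seasons (Dec-Feb winter, Mar-May spring, ...).
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "fall", 10: "fall", 11: "fall"}
ts["SEASON"] = ts["MONTH"].map(SEASONS)

print(ts)
```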

These features give us a variety of ways to attempt to classify the data.

### 1.2 - Classification Tasks
We examine two different classification tasks with our dataset.

#### 1.2.1 - Offense/Offense_Code
For the first classification task, we attempt to build a model that predicts the type of offense (one of the nine Offense_Code values) given the other features of the data (geographic location, time of day, political area, etc.).  The hope is that if the type of crime could be predicted, the Police would be better able to allocate offense-specific resources.

#### 1.2.2 - Crime_Type (Violent/Property)
The second classification task is a binary classification, in which we attempt to build a model to predict whether the crime will be against a person (violent) or against property. Again, the goal is to help the Police manage resources more appropriately.

#### 1.2.3 - Model Comparison
Secondarily, we seek to compare the accuracy of the two models and check whether their predictions agree. For example, if the Crime_Type prediction indicates a "Violent" crime, does the Offense prediction also name a violent offense (Homicide, Sex Abuse, Robbery, or Assault with Dangerous Weapon)?
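
One way to quantify this agreement is to map each predicted Offense_Code to its implied Crime_Type and compare it with the Crime_Type prediction. The sketch below uses the mapping from the table in section 1.1. The function name `agreement_rate` and the prediction arrays are hypothetical, not results from our models.

```python
import numpy as np

# Offense_Code -> Crime_Type, taken from the table in section 1.1.
OFFENSE_TO_TYPE = {1: 2, 2: 2, 3: 2, 4: 1, 5: 1, 6: 2, 7: 1, 8: 1, 9: 2}

def agreement_rate(offense_pred, type_pred):
    """Fraction of rows where the offense prediction implies the predicted crime type."""
    implied = np.array([OFFENSE_TO_TYPE[code] for code in offense_pred])
    return float(np.mean(implied == np.asarray(type_pred)))

# Hypothetical predictions from the two models for four crimes.
offense_pred = [1, 5, 4, 2]
type_pred = [2, 1, 2, 2]

print(agreement_rate(offense_pred, type_pred))
```

In this toy example the third row disagrees: the offense model predicts Assault with Dangerous Weapon (violent) while the binary model predicts Property.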


<a id="model_building"></a>
## 2 - Model Building

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>SVM and Logistic Regression Modeling</h3>
    <ol><li>[<b>50 points</b>] Create a logistic regression model and a support vector machine model for the classification task involved with your dataset. Assess how well each model performs (use 80/20 training/testing split for your data). Adjust parameters of the models to make them more accurate. If your dataset size requires the use of stochastic gradient descent, then linear kernel only is fine to use.</li>
    <li>[<b>10 points</b>]  Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.</li>
    <li>[<b>30 points</b>] Use the weights from logistic regression to interpret the importance of different features for each classification task. Explain your interpretation in detail. Why do you think some variables are more important?</li>
    <li>[<b>10 points</b>]  Look at the chosen support vectors for the classification task. Do these provide
any insight into the data? Explain.</li>
</ol>
</div>
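
The workflow in rubric items 1 and 3 can be sketched with scikit-learn as follows. This is a hedged illustration only: it uses synthetic data from `make_classification` as a stand-in for the crime features, and the hyperparameters are library defaults rather than tuned values. `SGDClassifier` with hinge loss is used as the stochastic-gradient-descent linear SVM that the instructions allow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the crime features and a binary target (assumption).
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           random_state=0)

# 80/20 training/testing split, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Scale features so the logistic regression weights are comparable.
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# Hinge loss makes SGDClassifier a linear SVM trained by SGD.
svm = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", random_state=0))

for name, model in [("logistic regression", logit), ("linear SVM (SGD)", svm)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))

# Rank features by the magnitude of their logistic regression weights.
weights = logit.named_steps["logisticregression"].coef_.ravel()
ranking = np.argsort(-np.abs(weights))
print("features by |weight|:", ranking)
```

Note that `SGDClassifier` does not expose support vectors. For rubric item 4, a model such as `sklearn.svm.SVC(kernel="linear")` provides them through its `support_vectors_` attribute, at a higher training cost on large datasets.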

### 2.1 - Logistic Regression Model for Offense_Code (Rubric Item 1)

### 2.2 - Support Vector Machine Model for Offense_Code (Rubric Item 1)

### 2.3 - Logistic Regression Model for Crime_Type (Exceptional Work)

### 2.4 - Support Vector Machine Model for Crime_Type (Exceptional Work)

### 2.5 - Advantages of Each Model (Rubric Item 2)

### 2.6 - Logistic Regression Weights (Rubric Item 3)

### 2.7 - Support Vectors (Rubric Item 4)