# MSDS 7331 - Lab Two: Regress or Classify

### Investigators
- [Matt Baldree](mailto:mbaldree@smu.edu?subject=lab2)
- [Ben Brock](mailto:bbrock@smu.edu?subject=lab2)
- [Tom Elkins](mailto:telkins@smu.edu?subject=lab2)
- [Austin Kelly](mailto:ajkelly@smu.edu?subject=lab2)


<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:5px;'>
    <h3>Lab Instructions</h3>
    <p>You are to build upon the predictive analysis that you already completed in the previous mini-project, adding additional modeling from new classification algorithms as well as more explanations that are in line with the CRISP-DM framework. You should use appropriate cross validation for all of your analysis (explain your chosen method of performance validation <i>in detail</i>). Try to use as much testing data as possible <i>in a realistic manner</i> (you should define what you think is realistic and why).</p>
    <p>This report is worth 20% of the final grade. Please upload a report (one per team) with all code used, visualizations, and text in a single document. The results should be reproducible using your report. Please carefully describe every assumption and every step in your report.</p>
    <p>Report Sections:</p>
    <ol>
        <li>[Data Preparation](#data_preparation) <b>(15 points)</b></li>
        <li>[Modeling and Evaluation](#modeling_and_evaluation) <b>(70 points)</b></li>
        <li>[Deployment](#deployment) <b>(5 points)</b></li>
        <li>[Exceptional Work](#exceptional) <b>(10 points)</b></li>
    </ol>
</div>

<a id='data_preparation'></a>
## 1 - Data Preparation
<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Data Preparation (<b>15 points total</b>)</h3>
    <ul><li>[<b>10 points total</b>] [1.1 - Define and prepare your class variables](#define_and_prepare_class_variables). Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reductions, scaling, etc. Remove variables that are not needed/useful for the analysis.</li>
    <li>[<b>5 points total</b>] [1.2 - Describe the final dataset](#describe_final_dataset) that is used for classification/regression (include a description of any newly formed variables you created.)</li>
    </ul>
</div>

<a id='define_and_prepare_class_variables'></a>
### 1.1 - Define and Prepare Class Variables
** >> revise below << **

The data set chosen for Lab 1 is the 2015 Washington DC Metro crime data, inspired by a Kaggle data set found at https://www.kaggle.com/vinchinzu/dc-metro-crime-data. We obtained the data by following the steps on the [Using the Crime Map Application](http://mpdc.dc.gov/node/200622) page, which allowed us to download data by political ward for all eight wards from 01/01/2015 to 12/31/2015 as CSV files. We then merged the individual ward files into a single file. The data set contains 36,493 entries and 18 attributes, both continuous and discrete, which satisfies the requirement of at least 30,000 entries and 10 attributes of both kinds. Further definition of this data set will be discussed in the [Data Understanding](#data_understanding) section.

![Ward Map](images/wards_small.png "Washington DC Wards") 
<p style='text-align: center;'>
Washington DC Metro Ward Map
</p>

The Washington DC Metro police department publishes the crime data daily (see the image below) to give residents a clear picture of crime trends as they happen. The data is shared with resident groups such as Advisory Neighborhood Commissions to help the police determine how to keep neighborhoods safe. It is also analyzed to gauge the effectiveness of current investments such as putting more officers on the streets, buying more tools for police, and launching community partnerships; see the [Washington DC Metro Police Department report](http://mpdc.dc.gov/publication/mpd-annual-report-2015) for more details.

![2015 Crime Data](images/dc_2015_crime.tiff "Washington DC Year End Crime Data") 
<p style='text-align: center;'>
Washington DC Metro 2015 Year End Crime Data
</p>

### 1.1.1 - Load Data

In [None]:
# __future__ imports must come before any other statement in the cell
from __future__ import print_function

# generic imports
import pandas as pd
import numpy as np

# plotting setup
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import seaborn as sns
sns.set(font_scale=1)
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings

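# scikit imports
# (sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20;
# the same helpers are available from sklearn.model_selection)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_val_score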

# Read in the crime data from the CSV file
df = pd.read_csv('data/DC_Crime_2015_Lab2.csv')

<div style='color:red'>
Recommend:
<ul>
<li>Remove REPORT_DAT - this simply indicates when the Police responded to the report of the crime and has no predictive impact</li>
<li>Remove SHIFT - The duty shift is too coarse in resolution (8-hour blocks).  Better to include the hour the crime might have been committed (using the HOUR from the END_DATE field)</li>
<li>Keep OFFENSE as a separate vector for labels so we can properly label the results, but remove from the data set</li>
<li>Remove METHOD - this is not predictive.  We won't know if a gun/knife will be involved until the crime is committed</li>
<li>Remove DISTRICT - this field has missing values and can be replaced with DistrictID</li>
<li>Remove PSA - this field has missing values and can be replaced with PSA_ID</li>
<li>Remove WARD - although there appears to be an effect, the WARD is too coarse.  ANC provides more resolution, and Ward can be derived from ANC.</li>
<li>Keep ANC - This field identifies a geo-political grouping, and is finer resolution than WARD.  We can derive WARD from the leading digit of the ANC</li>
<li>Remove NEIGHBORHOOD_CLUSTER - This field provided no useful information</li>
<li>Remove CENSUS_TRACT - This field provided no useful information</li>
<li>Remove VOTING_PRECINCT - This field provided no useful information</li>
<li>Undecided on CCN - this value acts like an index because it is unique per crime report and may be used to get additional information from public data sources</li>
<li>Remove XBLOCK and YBLOCK - These give specific location, but the magnitude of the values tends to artificially inflate their importance.  These were converted to Latitude and Longitude</li>
<li>Remove START_DATE - For crimes in which there was a witness, the END_DATE is just as relevant.  For unwitnessed crimes, the START_DATE could be too far back in time to be useful</li>
<li>Keep PSA_ID - this gives a second geo-political grouping, but is more interpretable to the police department as it defines jurisdictional boundaries.  We can derive District from the leading digit of the PSA_ID value</li>
<li>Remove DistrictID - We can derive District from the PSA_ID value.</li>
<li>Remove SHIFT_Code - we know that time of day is more influential, and we can derive the Shift from the hour of the day</li>
<li>Keep OFFENSE_Code - this is our response variable</li>
<li>Remove CRIME_TYPE - We can derive Violent/Property crime type from the OFFENSE_Code</li>
<li>Remove AGE - Since the START_DATE for an unwitnessed crime is a guess based on the last observation of the area, we can get arbitrarily large values that will skew the results.</li>
<li>Remove TIME_TO_REPORT - this value represents an action AFTER the crime was committed and has no predictive value</li>
<li>Keep Latitude and Longitude - These provide specific location information on a continuous scale</li>
<li>ADD Crime_Hour - This was derived in the Mini-Lab as the Hour value from the END_DATE variable and showed highly influential coefficients</li>
<li>*POSSIBLY* Day of Week - This can be derived from the END_DATE variable.  There was a noticeable impact from weekend crimes (which are Day-of-Week values 5 and 6)</li>
<li>*POSSIBLY* Month - There was a fairly noticeable impact from the month the crime was witnessed</li>
<li>ADD Weather data - There are a variety of weather factors that may have an impact (high/low temperature, precipitation, winds, icy conditions, thunder/lightning, etc.)</li>
<li>*POSSIBLY* Phase of the moon - Since there was a significant increase in crimes at night, one factor that may influence that is ambient light.  Full moons provide considerable light in areas that have little domestic lighting - is this an advantage or a detriment?</li>
<li>*POSSIBLY* Phase of the moon - Also can't completely discount the psychological impacts of lunacy -- is there an increase in crimes during full moons?</li>
</ul>
</div>
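The recommendations above could be applied roughly as follows. This is a minimal sketch rather than the team's final pipeline: the `prepare_features` and `derive_ward` helpers, the `Day_Of_Week` and `Month` column names, the sample values, and the `END_DATE` timestamp format are all assumptions made for illustration. Only the source column names (`END_DATE`, `ANC`, `PSA_ID`, `OFFENSE_Code`, and so on) come from the list above.

```python
import pandas as pd

# Hypothetical three-row sample; the real frame is read from the lab CSV file.
sample = pd.DataFrame({
    'END_DATE': ['2015-01-03 23:15:00', '2015-06-15 08:40:00', '2015-10-31 02:05:00'],
    'ANC': ['1B', '6E', '2A'],          # assumed format: ward digit + letter
    'PSA_ID': [305, 101, 208],          # assumed format: district digit + two digits
    'OFFENSE_Code': [3, 5, 1],          # made-up codes for illustration
    'WARD': [1, 6, 2],
    'SHIFT': ['MIDNIGHT', 'DAY', 'MIDNIGHT'],
    'Latitude': [38.92, 38.90, 38.90],
    'Longitude': [-77.03, -77.02, -77.05],
})

def derive_ward(anc):
    """Recover the ward from the leading digit of the ANC value."""
    return anc.astype(str).str[0].astype(int)

def prepare_features(frame):
    """Return (features, labels): keep the recommended columns, add time features."""
    out = frame.copy()
    end = pd.to_datetime(out['END_DATE'])
    out['Crime_Hour'] = end.dt.hour          # hour of day, 0-23
    out['Day_Of_Week'] = end.dt.dayofweek    # Monday=0 ... Sunday=6
    out['Month'] = end.dt.month              # 1-12
    labels = out['OFFENSE_Code']             # response variable, kept separately
    keep = ['ANC', 'PSA_ID', 'Latitude', 'Longitude',
            'Crime_Hour', 'Day_Of_Week', 'Month']
    return out[keep], labels

features, labels = prepare_features(sample)
wards = derive_ward(sample['ANC'])
print(features)
```

Because WARD can be recovered from ANC with `derive_ward`, dropping the WARD column loses no information, which is the reasoning behind keeping only the finer-grained grouping.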

<a id='describe_final_dataset'></a>
### 1.2 - Describe the Final Dataset
Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created.) **(5 points total)**

<div style='color:red;'>
<p>From data exploration in Lab 1 and the Mini-Lab, we have two potential response variables: Crime_Type (Property crime vs. Violent crime), and Offense_Code (The more specific type of offense: Homicide, Robbery, Theft, Arson, etc.).  The goal is to provide the police with a model that can predict or classify a crime based on the available explanatory variables.</p>
<p>One problem with this data is that the victim profile data is missing (due to privacy concerns, and because property crimes are not necessarily driven by the owner's profile).  The explanatory variables for this dataset focus on time and location.  We believe that the detection/classification of a Violent crime would be based primarily on the victim's characteristics, and not exclusively on the location or time.  The other problem with this data is that (fortunately) there are far fewer violent crimes than there are property crimes (approximately 83% of the 36,000+ crime reports are against property rather than persons), so we have very unbalanced classification tasks.</p>
<p>Our exploration of the variables seems to indicate that time (not necessarily the day, but the time during the day) is one of the more significant factors.  We saw this in the SHIFT variable (which gives the Police duty shift that responded to the call).  When we broke the time down into individual hours of the day, we saw a pronounced cyclic effect, where night-time crimes were far more likely than daytime crimes.  Weekend crimes were slightly more likely than crimes during the work week, and monthly trends ran counter to intuition (fall crimes were more likely than winter or summer crimes).</p>
<p>Location also appeared to have some influence, but the <i>way</i> the locations were grouped altered the effect significantly.  Different political areas (Wards and the subordinate Advisory Neighborhood Commissions) showed a different trend than using global locations (Latitude and Longitude).  Police districts (and their subordinate Police Service Areas) showed a different trend than the Ward/ANC grouping.  This tells us that there are some location effects, but it is difficult to separate them out due to the correlation between geo-physical areas and the different (but overlapping) political mappings.</p>
<p>From what we can tell from our previous exploration, the existing variables alone are not sufficient for the classification tasks.  As such, we have looked at including other data to fill in the picture more completely.  As mentioned before, victim profile would be interesting and probably very helpful, but we would not be able to get access to that due to privacy concerns.  We are examining environmental data (primarily weather data, but we also have access to lunar phases to estimate luminance for night crimes or the psychological impact of lunacy).</p>
<p>For these reasons, we have decided to drop the Crime_Type (Property vs. Violent crime) classification and focus on the Offense_Code (type of offense) classification task.</p>
</div>
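To illustrate why the imbalance described above matters, the sketch below scores a classifier that always predicts the majority class. The labels are synthetic (83 property and 17 violent reports, mirroring the approximate 83% share quoted above) and the variable names are assumptions; none of this is a result from the actual data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 0 = property crime, 1 = violent crime (assumed encoding).
y = np.array([0] * 83 + [1] * 17)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)            # share of correct predictions
violent_recall = recall_score(y, pred)   # share of violent crimes detected

print('accuracy:', acc)
print('recall on violent crimes:', violent_recall)
```

The baseline reaches high accuracy while detecting no violent crimes at all, so accuracy alone would overstate the quality of any model trained on a split this unbalanced.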

<a id="modeling_and_evaluation"></a>
## 2 - Modeling and Evaluation

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Modeling and Evaluation (<b>70 points total</b>)</h3>
    <ul><li>[<b>10 points</b>] [2.1 - Choose and explain your evaluation metrics](#choose_and_explain) that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.</li>
    <li>[<b>10 points</b>] [2.2 - Choose the method you will use for dividing your data](#choose_the_method) into training and testing splits (i.e., are you using stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.</li>
    <li>[<b>20 points</b>] [2.3 - Create three different classification/regression models](#create_models) for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!</li>
    <li>[<b>10 points</b>] [2.4 - Analyze the results using your chosen method of evaluation](#analyze_results). Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.</li>
    <li>[<b>10 points</b>] [2.5 - Discuss the advantages of each model](#discuss_models) for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques - be sure they are appropriate for your chosen method of validation.</li>
    <li>[<b>10 points</b>] [2.6 - Which attributes from your analysis are most important](#important_attributes)? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.</li>
   </ul>
</div>



<a id='choose_and_explain'></a>
### 2.1 - Choose and explain your evaluation metrics
Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions. **(10 points total)**

<a id='choose_the_method'></a>
### 2.2 - Choose the method you will use for dividing your data
Choose the method you will use for dividing your data into training and testing splits (i.e., are you using stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time. **(10 points total)**
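As one possible starting point for this section, the sketch below shows how stratified 10-fold cross validation keeps the class proportions the same in every test fold. The labels are synthetic (an 80/20 split chosen so the counts divide evenly) and the variable names are assumptions. It illustrates the technique named in the prompt and does not record which method the team chose.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, unbalanced labels: 80 majority-class rows, 20 minority-class rows.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Record the minority-class share of each test fold.
fold_minority_share = []
for train_idx, test_idx in skf.split(X, y):
    fold_minority_share.append(y[test_idx].mean())

print(fold_minority_share)
```

Each test fold holds the same minority share as the full data set, which a plain shuffled split does not guarantee when one class is rare.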

<a id='create_models'></a>
### 2.3 - Create three different classification/regression models
Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms! **(20 points total)**

<a id='analyze_results'></a>
### 2.4 - Analyze the results using your chosen method of evaluation
Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model. **(10 points total)**

<a id='discuss_models'></a>
### 2.5 - Discuss the advantages of each model
Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques - be sure they are appropriate for your chosen method of validation. **(10 points total)**
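One way the 95% confidence comparison could be carried out is a paired t-based interval on the per-fold score differences of two models evaluated on the same folds. The sketch below uses made-up fold accuracies, and the names `acc_a` and `acc_b` are hypothetical. Because cross-validation folds share training data, the scores are not fully independent, so the interval should be read as approximate.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two models scored on the same 10 folds.
acc_a = np.array([0.71, 0.69, 0.72, 0.70, 0.73, 0.68, 0.71, 0.70, 0.72, 0.69])
acc_b = np.array([0.66, 0.65, 0.69, 0.66, 0.70, 0.65, 0.66, 0.67, 0.68, 0.66])

diff = acc_a - acc_b                       # paired differences, fold by fold
k = len(diff)
mean_diff = diff.mean()
std_err = diff.std(ddof=1) / np.sqrt(k)    # standard error of the mean difference
t_crit = stats.t.ppf(0.975, df=k - 1)      # two-sided 95% critical value

ci = (mean_diff - t_crit * std_err, mean_diff + t_crit * std_err)
significant = not (ci[0] <= 0.0 <= ci[1])  # interval excludes zero?

print('mean difference:', mean_diff)
print('95% CI:', ci)
print('significant at 95%:', significant)
```

If the interval excludes zero, the difference between the two models is significant at the 95% level under this test's assumptions; otherwise neither model can be declared better.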

<a id='important_attributes'></a>
### 2.6 - Which attributes from your analysis are most important
Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task. **(10 points total)**
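One common way to rank attributes is the impurity-based importance reported by a random forest. The sketch below runs on synthetic data in which the label depends only on the hour of day, so the expected ranking is known in advance. The column names echo the data set, but the values, the night-time rule, and the model settings are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 300

# Synthetic features; only Crime_Hour carries signal in this toy example.
X = pd.DataFrame({
    'Crime_Hour': rng.randint(0, 24, size=n),
    'Latitude': rng.normal(38.9, 0.03, size=n),
    'Longitude': rng.normal(-77.0, 0.03, size=n),
})

# Synthetic label: 1 for night-time hours (20:00-05:59), 0 otherwise.
y = ((X['Crime_Hour'] >= 20) | (X['Crime_Hour'] < 6)).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality or continuous features, so on the real data they are best cross-checked against another method before drawing conclusions.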

<a id="deployment"></a>
## 3 - Deployment

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Deployment (<b>5 points total</b>)</h3>
    <ul><li>[3.1 - How useful is your model](#model_usefulness) for interested parties (i.e., the companies or organizations that might want to use it for prediction)?</li>
    <li>[3.2 - How would you measure the model's value](#model_value) if it was used by these parties?</li>
    <li>[3.3 - How would you deploy your model](#model_deploy) for interested parties?</li>
    <li>[3.4 - What other data should be collected](#other_data)?</li>
    <li>[3.5 - How often would the model need to be updated](#model_update), etc.?</li>
   </ul>
</div>

<a id='model_usefulness'></a>
### 3.1 - How useful is your model
How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)?

<a id='model_value'></a>
### 3.2 - How would you measure the model's value
How would you measure the model's value if it was used by these parties?

<a id='model_deploy'></a>
### 3.3 - How would you deploy your model
How would you deploy your model for interested parties?

<a id='other_data'></a>
### 3.4 - What other data should be collected
What other data should be collected?

<a id='model_update'></a>
### 3.5 - How often would the model need to be updated
How often would the model need to be updated, etc.?

<a id="exceptional"></a>
## 4 - Exceptional Work

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Exceptional Work (<b>10 points total</b>)</h3>
   <p>Free rein to provide additional analysis. The following are possible ideas:</p>
   <ul>
       <li>Grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?</li>
       <li>Apply Synthetic Minority Over-sampling Technique (SMOTE)</li>
       <li>Utilize pipeline</li>
       <li>Visualize feature importance</li>
       <li>Utilize R implementation of ADA</li>
   </ul>
</div>