# MSDS 7331 - Lab Two: Regress or Classify

### Investigators
- [Matt Baldree](mailto:mbaldree@smu.edu?subject=lab2)
- [Tom Elkins](mailto:telkins@smu.edu?subject=lab2)
- [Austin Kelly](mailto:ajkelly@smu.edu?subject=lab2)
- [Murali Parthasarathy](mailto:mparthasarathy@smu.edu?subject=lab2)


<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:5px;'>
    <h3>Lab Instructions</h3>
    <p>You are to build upon the predictive analysis that you already completed in the previous mini-project, adding modeling from new classification algorithms as well as more explanations that are in line with the CRISP-DM framework. You should use appropriate cross-validation for all of your analysis (explain your chosen method of performance validation <i>in detail</i>). Try to use as much testing data as possible <i>in a realistic manner</i> (you should define what you think is realistic and why).</p>
    <p>This report is worth 20% of the final grade. Please upload a report (one per team) with all code used, visualizations, and text in a single document. The results should be reproducible using your report. Please carefully describe every assumption and every step in your report.</p>
    <p>Report Sections:</p>
    <ol>
        <li>[Data Preparation](#data_preparation) <b>(15 points)</b></li>
        <li>[Modeling and Evaluation](#modeling_and_evaluation) <b>(70 points)</b></li>
        <li>[Deployment](#deployment) <b>(5 points)</b></li>
        <li>[Exceptional Work](#exceptional) <b>(10 points)</b></li>
    </ol>
</div>

<a id='data_preparation'></a>
## 1 - Data Preparation
<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Data Preparation (<b>15 points total</b>)</h3>
    <ul><li>[<b>10 points total</b>] [1.1 - Define and prepare your class variables](#define_and_prepare_class_variables). Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reductions, scaling, etc. Remove variables that are not needed/useful for the analysis.</li>
    <li>[<b>5 points total</b>] [1.2 - Describe the final dataset](#describe_final_dataset) that is used for classification/regression (include a description of any newly formed variables you created.)</li>
    </ul>
</div>

<a id='define_and_prepare_class_variables'></a>
### 1.1 - Define and Prepare Class Variables
** >> revise below << **

The data set chosen for Lab 1 is the 2015 Washington DC Metro crime data, inspired by a Kaggle data set found at https://www.kaggle.com/vinchinzu/dc-metro-crime-data. We obtained the data by following the steps on the [Using the Crime Map Application](http://mpdc.dc.gov/node/200622) page, which let us download CSV files by political ward for all eight wards, covering 01/01/2015 through 12/31/2015. We then merged the individual ward files into a single file. The resulting data set contains 36,493 entries and 18 attributes, both continuous and discrete, which satisfies the requirement of at least 30,000 entries and 10 attributes of both kinds. The data set is defined further in the [Data Understanding](#data_understanding) section.
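The merge step described above could be sketched as follows. The helper name `merge_ward_files` and the file pattern in the usage comment are assumptions made for illustration, not the layout we actually used.

```python
import glob
import pandas as pd


def merge_ward_files(paths):
    """Concatenate per-ward CSV exports into a single DataFrame."""
    # Sort the paths so the merged row order is reproducible
    frames = [pd.read_csv(path) for path in sorted(paths)]
    # ignore_index=True renumbers rows 0..n-1 across all wards
    return pd.concat(frames, ignore_index=True)


# Hypothetical usage (file names are assumed, adjust to the real exports):
# merged = merge_ward_files(glob.glob('data/ward_*.csv'))
# merged.to_csv('data/DC_Crime_2015.csv', index=False)
```

Concatenating with `ignore_index=True` avoids duplicate index labels, which would otherwise repeat once per ward file.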

![Ward Map](images/wards_small.png "Washington DC Wards") 
<p style='text-align: center;'>
Washington DC Metro Ward Map
</p>

The Washington DC Metro police department publishes the crime data daily (see the image below) to give residents a clear picture of crime trends as they happen. The data is shared with resident groups such as Advisory Neighborhood Commissions to help the police determine how to keep neighborhoods safe. It is also analyzed to gauge the effectiveness of current investments, such as putting more officers on the streets, buying more tools for police, and launching community partnerships; see the [Washington DC Metro Police Department report](http://mpdc.dc.gov/publication/mpd-annual-report-2015) for more details.

![2015 Crime Data](images/dc_2015_crime.tiff "Washington DC Year End Crime Data") 
<p style='text-align: center;'>
Washington DC Metro 2015 Year End Crime Data
</p>

#### 1.1.1 - Load Data

In [None]:
# __future__ imports must be the first statement in the cell
from __future__ import print_function

# generic imports
import pandas as pd
import numpy as np

# plotting setup
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import seaborn as sns
sns.set(font_scale=1)
cmap = sns.diverging_palette(220, 10, as_cmap=True) # a diverging color map, one of many available

# scikit-learn imports
# (sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics

# Read in the crime data from the CSV file
df = pd.read_csv('data/DC_Crime_2015_Lab2.csv')

<a id='describe_final_dataset'></a>
### 1.2 - Describe the Final Dataset
Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created.) **(5 points total)**

<a id="modeling_and_evaluation"></a>
## 2 - Modeling and Evaluation

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Modeling and Evaluation (<b>70 points total</b>)</h3>
    <ul><li>[<b>10 points</b>] [2.1 - Choose and explain your evaluation metrics](#choose_and_explain) that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.</li>
    <li>[<b>10 points</b>] [2.2 - Choose the method you will use for dividing your data](#choose_the_method) into training and testing splits (i.e., are you using stratified 10-fold cross-validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.</li>
    <li>[<b>20 points</b>] [2.3 - Create three different classification/regression models](#create_models) for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!</li>
    <li>[<b>10 points</b>] [2.4 - Analyze the results using your chosen method of evaluation](#analyze_results). Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.</li>
    <li>[<b>10 points</b>] [2.5 - Discuss the advantages of each model](#discuss_models) for each classification task, if any. If there are no advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques - be sure they are appropriate for your chosen method of validation.</li>
    <li>[<b>10 points</b>] [2.6 - Which attributes from your analysis are most important](#important_attributes)? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.</li>
   </ul>
</div>



<a id='choose_and_explain'></a>
### 2.1 - Choose and explain your evaluation metrics
Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions. **(10 points total)**
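
As a reference point for the metrics named above, the sketch below computes accuracy, precision, recall, and F-measure with `sklearn.metrics`. The label vectors are invented toy values, not results from the crime data.

```python
from sklearn import metrics

# Toy labels invented for illustration only (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes
confusion = metrics.confusion_matrix(y_true, y_pred)

accuracy = metrics.accuracy_score(y_true, y_pred)    # (TP + TN) / all
precision = metrics.precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = metrics.recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = metrics.f1_score(y_true, y_pred)                # harmonic mean of the two

print(confusion)
print('accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f'
      % (accuracy, precision, recall, f1))
```

Here there are 3 true positives, 1 false positive, 1 false negative, and 5 true negatives, so accuracy is 0.80 while precision, recall, and F-measure are each 0.75. The gap between accuracy and the other three widens as classes become more imbalanced, which is one reason accuracy alone can mislead.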

<a id='choose_the_method'></a>
### 2.2 - Choose the method you will use for dividing your data
Choose the method you will use for dividing your data into training and testing splits (i.e., are you using stratified 10-fold cross-validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time. **(10 points total)**
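
A minimal sketch of stratified 10-fold splitting, assuming a synthetic, imbalanced label vector rather than the crime data. It checks that every test fold keeps the overall class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels: 80 of class 0 and 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_positive_rates = []
for train_idx, test_idx in skf.split(X, y):
    # Fraction of class 1 in this test fold; stratification keeps it near 20%
    fold_positive_rates.append(y[test_idx].mean())

print(fold_positive_rates)
```

If the records are treated as a time series instead, `sklearn.model_selection.TimeSeriesSplit` is one option that keeps each training set earlier in time than its test set.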

<a id='create_models'></a>
### 2.3 - Create three different classification/regression models
Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms! **(20 points total)**
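
One possible shape for this step is sketched below: three classifiers, each tuned with a small grid search over the same stratified folds. The data comes from `make_classification`, and the parameter grids and the `f1` scoring choice are illustrative assumptions, not tuned values for the crime data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared feature matrix and class labels
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# Reuse the same folds for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each entry: (estimator, parameter grid to investigate)
candidates = {
    'random_forest': (RandomForestClassifier(random_state=0),
                      {'n_estimators': [25, 50], 'max_depth': [3, None]}),
    'knn': (KNeighborsClassifier(),
            {'n_neighbors': [3, 7, 15]}),
    'logistic_regression': (LogisticRegression(max_iter=1000),
                            {'C': [0.1, 1.0, 10.0]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring='f1', cv=cv)
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Fixing `cv` to one splitter object with a set `random_state` means each model sees identical folds, which matters later when comparing models fold by fold.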

<a id='analyze_results'></a>
### 2.4 - Analyze the results using your chosen method of evaluation
Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model. **(10 points total)**

<a id='discuss_models'></a>
### 2.5 - Discuss the advantages of each model
Discuss the advantages of each model for each classification task, if any. If there are no advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques - be sure they are appropriate for your chosen method of validation. **(10 points total)**
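
One common way to answer the 95% confidence question is a paired comparison of per-fold scores. The sketch below builds a confidence interval on the mean difference; the per-fold accuracies are hypothetical numbers, and the helper `paired_confidence_interval` is an illustrative name rather than a library function.

```python
import math
import statistics

# Hypothetical per-fold accuracies from the SAME 10 folds for two models
model_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.81, 0.79]
model_b = [0.78, 0.77, 0.80, 0.79, 0.79, 0.76, 0.80, 0.78, 0.79, 0.77]


def paired_confidence_interval(scores_a, scores_b, t_critical):
    """Return (mean difference, lower bound, upper bound) for paired scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    margin = t_critical * std_err
    return mean_diff, mean_diff - margin, mean_diff + margin


# Two-sided 95% critical value of Student's t with 10 - 1 = 9 degrees of freedom
T_CRIT_95_DF9 = 2.262

mean_diff, lower, upper = paired_confidence_interval(model_a, model_b,
                                                     T_CRIT_95_DF9)
# The difference is significant at the 95% level if the interval excludes zero
significant = lower > 0 or upper < 0
print('mean diff=%.4f, 95%% CI=(%.4f, %.4f), significant=%s'
      % (mean_diff, lower, upper, significant))
```

Because cross-validation folds share training data, the per-fold differences are not fully independent, so an interval like this should be read as approximate.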

<a id='important_attributes'></a>
### 2.6 - Which attributes from your analysis are most important
Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task. **(10 points total)**
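
As one possible approach, the sketch below ranks attributes by a random forest's impurity-based importances. The data is synthetic and the `feature_N` names are placeholders; with `shuffle=False`, `make_classification` puts the informative features in the first columns, so the ranking can be sanity-checked.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the first 2 of 6 columns carry the signal, the rest are noise
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = ['feature_%d' % i for i in range(X.shape[1])]  # placeholder names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each name with its importance and sort from most to least important
ranking = sorted(zip(feature_names, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print('%-10s %.3f' % (name, importance))
```

Impurity-based importances sum to 1 and can favor attributes with many distinct values, so comparing them against another method, such as standardized logistic regression weights, may be a useful cross-check.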

<a id="deployment"></a>
## 3 - Deployment

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Deployment (<b>5 points total</b>)</h3>
    <ul><li>[3.1 - How useful is your model](#model_usefulness) for interested parties (i.e., the companies or organizations that might want to use it for prediction)?</li>
    <li>[3.2 - How would you measure the model's value](#model_value) if it was used by these parties?</li>
    <li>[3.3 - How would you deploy your model](#model_deploy) for interested parties?</li>
    <li>[3.4 - What other data should be collected](#other_data)?</li>
    <li>[3.5 - How often would the model need to be updated](#model_update), etc.?</li>
   </ul>
</div>

<a id='model_usefulness'></a>
### 3.1 - How useful is your model
How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)?

<a id='model_value'></a>
### 3.2 - How would you measure the model's value
How would you measure the model's value if it was used by these parties?

<a id='model_deploy'></a>
### 3.3 - How would you deploy your model
How would you deploy your model for interested parties?

<a id='other_data'></a>
### 3.4 - What other data should be collected
What other data should be collected?

<a id='model_update'></a>
### 3.5 - How often would the model need to be updated
How often would the model need to be updated, etc.?

<a id="exceptional"></a>
## 4 - Exceptional Work

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Exceptional Work (<b>10 points total</b>)</h3>
   <p>Free rein to provide additional analysis. The following are possible ideas:</p>
   <ul>
       <li>Grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?</li>
       <li>Apply Synthetic Minority Over-sampling Technique (SMOTE)</li>
       <li>Utilize pipeline</li>
       <li>Visualize feature importance</li>
       <li>Utilize R implementation of ADA</li>
   </ul>
</div>
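
Two of the ideas listed above, using a pipeline and grid searching parameters, can be combined as in the sketch below. The data is synthetic, and the grid values and the `N_JOBS` constant are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    'knn__n_neighbors': [3, 5, 9],
    'knn__weights': ['uniform', 'distance'],
}

N_JOBS = 1  # assumed default; set to -1 to run the search on all available cores
search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=N_JOBS)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline prevents the test folds from influencing the scaling statistics, which would otherwise leak information into the cross-validation scores.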