# MSDS 7331 - Lab Three: Clustering, Association Rules, or Recommenders

### Investigators
- [Matt Baldree](mailto:mbaldree@smu.edu?subject=lab2)
- [Ben Brock](mailto:bbrock@smu.edu?subject=lab2)
- [Tom Elkins](mailto:telkins@smu.edu?subject=lab2)
- [Austin Kelly](mailto:ajkelly@smu.edu?subject=lab2)

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:5px;'>
    <h3>CRISP-DM Capstone: Association Rule Mining, Clustering, or Collaborative Filtering</h3>
    <h3>Lab Instructions</h3>
    <p>In the final assignment for this course, you will be using one of three different analysis methods:</p>
    <ul>
    <li>Option A: Use transaction data for mining association rules</li>
    <li>Option B: Use clustering on an unlabeled dataset to provide insight or features</li>
    <li>Option C: Use collaborative filtering to build a custom recommendation system</li>
    </ul>
    <p>This report is worth 20% of the final grade. Please upload a report (one per team) with all code used, visualizations, and text in a single document. The results should be reproducible using your report. Please carefully describe every assumption and every step in your report.</p>
    <p>Your choice of dataset will largely determine the task that you are trying to achieve, though the dataset does not need to change from your previous tasks. For example, you might choose to use clustering on your data as a preprocessing step that extracts different features. Then you can use those features to build a classifier and analyze its performance in terms of accuracy (precision, recall) and speed. Alternatively, you might choose a completely different dataset and perform rule mining or build a recommendation system.</p>
    <p>Note that scikit-learn can be used for clustering analysis, but not for association rule mining (you should use R) or collaborative filtering (you should use GraphLab Create from Dato). Both can be run using IPython notebooks as shown in lecture.</p>
    <p>Write a report covering in detail all the steps of the project. The results need to be reproducible using only this report. Describe all assumptions you make and include all code you use in the IPython notebook or as supplemental functions. Follow the CRISP-DM framework in your analysis (you are performing all of the CRISP-DM outline). This report is worth 20% of the final grade.</p>
    <p>Report Sections:</p>
    <ol>
        <li><a href="#business_understanding">Business Understanding</a> <b>(10 points)</b></li>
        <li><a href="#data_understanding">Data Understanding</a> <b>(20 points)</b></li>
        <li><a href="#modeling_and_evaluation">Modeling and Evaluation</a> <b>(50 points)</b></li>
        <li><a href="#deployment">Deployment</a> <b>(10 points)</b></li>
        <li><a href="#exceptional_work">Exceptional Work</a> <b>(10 points)</b></li>
    </ol>
</div>
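
The instructions mention using clustering as a preprocessing step that extracts features for a classifier. The following is a minimal sketch of that idea, not the method used in this report. It assumes synthetic data from `make_blobs` as a stand-in for a real dataset, and it assumes k-means centroid distances as the extracted features; both choices are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): a real lab would load its own dataset.
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Scale using statistics from the training split only to avoid leakage.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit the clustering on the training split only.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train_s)

# KMeans.transform returns the distance from each row to each centroid;
# append those distances to the original features.
X_train_aug = np.hstack([X_train_s, kmeans.transform(X_train_s)])
X_test_aug = np.hstack([X_test_s, kmeans.transform(X_test_s)])

# Train a classifier on the augmented features and score it on held-out data.
clf = LogisticRegression(max_iter=1000).fit(X_train_aug, y_train)
accuracy = clf.score(X_test_aug, y_test)
print("augmented feature shape:", X_train_aug.shape)
print("test accuracy: %.3f" % accuracy)
```

Fitting the scaler and the clustering on the training split only keeps the held-out score honest; the number of clusters here is arbitrary and would need its own justification.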

<a id='business_understanding'></a>
## 1 - Business Understanding
<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Business Understanding (<b>10 points total</b>)</h3>
    <ul>
    <li>Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?).</li>
    <li>How will you measure the effectiveness of a good algorithm?</li>
    <li>Why does your chosen validation method make sense for this specific dataset and the stakeholders' needs?</li>
    </ul>
</div>

<a id='data_understanding'></a>
## 2 - Data Understanding
<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Data Understanding (<b>20 points total</b>)</h3>
    <ul>
    <li>[<b>10 points total</b>] <a href="#define_meaning">2.1 - Describe the meaning and type of data</a>.
        <ul>
        <li>Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.</li>
        <li>Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems?</li>
        </ul>
    </li>
    <li>[<b>10 points total</b>] <a href="#visualize_data">2.2 - Visualize data</a>.
        <ul>
        <li>Visualize any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.</li>
        </ul>
    </li>
    </ul>
</div>

<a id='define_meaning'></a>
### 2.1 - Describe the meaning and type of data (10 points)
<ul>
<li>Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.</li>
<li>Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems?</li>
</ul>
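
The data quality questions above (missing values, duplicates, outliers) can be checked with a few pandas calls. The sketch below runs on a small hypothetical frame; the column names are borrowed from the crime data, but every value is made up for illustration, and the 1.5 × IQR rule is only one of several possible outlier definitions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption): values are invented, not from the crime file.
toy = pd.DataFrame({
    "OFFENSE": ["THEFT", "ROBBERY", "THEFT", "THEFT", "BURGLARY", "THEFT"],
    "DISTRICT": [1.0, 2.0, np.nan, 1.0, 3.0, 1.0],
    "XBLOCK": [5.0, 7.0, 6.0, 5.0, 120.0, 5.0],
})

# Missing values: count of nulls per column.
missing = toy.isnull().sum()

# Duplicate data: rows identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Outliers: values outside 1.5 * IQR of the quartiles (one common rule of thumb).
q1, q3 = toy["XBLOCK"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["XBLOCK"] < q1 - 1.5 * iqr) | (toy["XBLOCK"] > q3 + 1.5 * iqr)
outliers = toy.loc[is_outlier, "XBLOCK"]

print(missing)
print("duplicate rows:", n_duplicates)
print("outlier values:", outliers.tolist())
```

Whether a flagged value is a mistake still needs a domain judgment; the checks only point to rows worth inspecting.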

<a id='visualize_data'></a>
### 2.2 - Visualize data (10 points)
<ul>
<li>Visualize any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.</li>
</ul>

In [1]:
# __future__ imports must be the first statement in the cell
from __future__ import print_function

# generic imports
import pandas as pd
import numpy as np

# plotting setup
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import seaborn as sns
sns.set(font_scale=1)
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings

# scikit imports
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import SGDClassifier

# Read in the crime data from the CSV file
df = pd.read_csv('data/DC_Crime_2015_Lab2_Weather.csv')
#df_foodstamps = pd.read_csv('data/foodstamps.csv')
df_anc_data = pd.read_csv('data/ANC Data Unemployment and Housing Master.csv')

In [2]:
# how is the data represented?
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36489 entries, 0 to 36488
Data columns (total 33 columns):
REPORT_DAT              36489 non-null object
SHIFT                   36489 non-null object
OFFENSE                 36489 non-null object
METHOD                  36489 non-null object
DISTRICT                36442 non-null float64
PSA                     36441 non-null float64
WARD                    36489 non-null int64
ANC                     36489 non-null int64
NEIGHBORHOOD_CLUSTER    36489 non-null int64
CENSUS_TRACT            36489 non-null int64
VOTING_PRECINCT         36489 non-null int64
CCN                     36489 non-null int64
XBLOCK                  36489 non-null float64
YBLOCK                  36489 non-null float64
START_DATE              36489 non-null object
END_DATE                36489 non-null object
PSA_ID                  36489 non-null int64
DistrictID              36489 non-null int64
SHIFT_Code              36489 non-null int64
OFFENSE_Code          

In [3]:
df_anc_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 6 columns):
YR                85 non-null int64
ANC               85 non-null int64
ANC2              85 non-null object
ANC3              85 non-null object
Housing_Prices    85 non-null int64
Unemployment      85 non-null float64
dtypes: float64(1), int64(3), object(2)
memory usage: 4.1+ KB
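
Both frames carry an integer `ANC` column, so one plausible way to combine them is a left merge on that key. The sketch below is an assumption about how the join might be done, not a step taken from this notebook: it uses tiny made-up frames in place of `df` and `df_anc_data`, and it assumes the ANC table should be filtered to a single year (`YR == 2015`) so that each ANC appears once.

```python
import pandas as pd

# Hypothetical stand-ins for df and df_anc_data (all values are invented).
crimes = pd.DataFrame({
    "OFFENSE": ["THEFT", "ROBBERY", "THEFT"],
    "ANC": [11, 12, 11],
})
anc = pd.DataFrame({
    "YR": [2014, 2015, 2015],
    "ANC": [11, 11, 12],
    "Housing_Prices": [400000, 420000, 350000],
    "Unemployment": [5.5, 5.1, 7.2],
})

# Assumption: keep one year so that ANC is unique on the right-hand side.
anc_2015 = anc.loc[anc["YR"] == 2015, ["ANC", "Housing_Prices", "Unemployment"]]

# Left merge keeps every crime row; validate raises if ANC is not unique on the right.
merged = crimes.merge(anc_2015, on="ANC", how="left", validate="many_to_one")

# Crime rows with no matching ANC record would show up as nulls here.
unmatched = int(merged["Unemployment"].isnull().sum())
print(merged)
print("rows without ANC data:", unmatched)
```

The `validate="many_to_one"` argument is a cheap guard against silently duplicating crime rows when the lookup table has repeated keys.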


<a id='modeling_and_evaluation'></a>
## 3 - Modeling and Evaluation
<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Modeling and Evaluation (<b>50 points total</b>)</h3>
<p>Different tasks will require different evaluation methods. Be as thorough as possible when analyzing the data you have chosen and use visualizations of the results to explain the performance and expected outcomes whenever possible. Guide the reader through your analysis with plenty of discussion of the results.</p>
<ul>
    <li><b>Option A: Cluster Analysis</b><ul>
        <li>Perform cluster analysis using several clustering methods.</li>
        <li>How did you determine a suitable number of clusters for each method?</li>
        <li>Use internal and/or external validation measures to describe and compare the clusterings and the clusters (some visual methods would be good).</li>
        <li>Describe your results. What findings are the most interesting and why?</li>
    </ul></li>
    <li><b>Option B: Association Rule Mining</b><ul>
        <li>Create frequent itemsets and association rules.</li>
        <li>Use tables/visualization to discuss the found results.</li>
        <li>Use several measures for evaluating how interesting different rules are.</li>
        <li>Describe your results. What findings are the most compelling and why?</li>
    </ul></li>
</ul>
</div>
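
For the cluster-analysis option, one common internal validation measure is the silhouette score, which can also guide the choice of cluster count. The sketch below is illustrative only: it assumes synthetic `make_blobs` data and k-means, and the candidate range of 2 to 7 clusters is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): replace with the scaled features of a real dataset.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette is one reasonable candidate.
best_k = max(scores, key=scores.get)
for k in sorted(scores):
    print("k=%d  silhouette=%.3f" % (k, scores[k]))
print("best k by silhouette:", best_k)
```

The same loop works for other clustering methods that produce hard labels, which makes it easy to compare methods on a shared measure; the silhouette score favors compact, well-separated clusters, so it should be read alongside visual checks.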

<a id='deployment'></a>
## 4 - Deployment
<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Deployment (<b>10 points total</b>)</h3>
<ul>
<li>Be critical of your performance and tell the reader how your current model might be usable by other parties.</li>
<li>Did you achieve your goals? If not, can you rein in the utility of your modeling?</li>
<li>How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?</li>
<li>How would you deploy your model for interested parties?</li>
<li>What other data should be collected?</li>
<li>How often would the model need to be updated, etc.?</li>
</ul>
</div>

<a id='exceptional_work'></a>
## 5 - Exceptional Work
<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Exceptional Work (<b>10 points total</b>)</h3>
<p>You have free rein to provide additional analyses or combine analyses.</p>
</div>