# Lab 3: CRISP-DM Capstone
## Association Rule Mining, Clustering, or Collaborative Filtering

### Ryan Bass, Brett Benefield, Cho Kim, Nicole Wittlin

<a id="top"></a>
## Contents
* Business Understanding
* Data Understanding
    * <a href="#data1">Data Understanding 1</a>
    * <a href="#data2">Data Understanding 2</a>
* Modeling and Evaluation
    * <a href="#Model1">Train and Adjust Parameters</a>
    * <a href="#Model2">Evaluate and Compare</a>
    * <a href="#Model3">Visualize Results</a>
    * <a href="#Model4">Summarize the Ramifications</a>
* <a href="#Deployment">Deployment</a>
* <a href="#Exceptional">Exceptional Work</a>


In [2]:
%matplotlib notebook

import warnings
warnings.filterwarnings("ignore")

In [3]:
import pandas as pd
import numpy as np
import yellowbrick as yb
import matplotlib.pyplot as plt
from math import sqrt
from pprint import pprint
from time import time
from datetime import datetime
from sklearn import metrics as mt
from sklearn import neighbors
from sklearn.feature_selection import VarianceThreshold, SelectFromModel, SelectPercentile, f_regression, mutual_info_regression
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, accuracy_score, f1_score, roc_auc_score, mean_absolute_error, make_scorer, mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler, Binarizer, scale
from sklearn.svm import LinearSVC, NuSVC, SVC, SVR
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier, RidgeClassifier, LinearRegression
from sklearn.linear_model import Ridge, Lasso, LassoCV
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, ShuffleSplit, StratifiedShuffleSplit, StratifiedKFold, GridSearchCV, cross_validate, RandomizedSearchCV
from yellowbrick.classifier import ClassificationReport, ConfusionMatrix, ClassPredictionError, ROCAUC
from yellowbrick.features import Rank1D, Rank2D, RFECV
from yellowbrick.features.importances import FeatureImportances
from yellowbrick.model_selection import ValidationCurve, LearningCurve
from yellowbrick.regressor import PredictionError, ResidualsPlot
from yellowbrick.regressor.alphas import AlphaSelection

### Slack Integration

In [9]:
# Some setup is required before you can use this because the token must be kept private
# I also need to add your name and unique identifier to the userID dictionary below
import os
from slackclient import SlackClient
from dotenv import load_dotenv

load_dotenv()

userID = {"brett": "UAN6UQEVC", "ryan": "UALUD69AB"}

# Uncomment once SLACK_BOT_TOKEN is set in your .env file; sendSlackMessage needs `sc` to exist
#slackToken = os.environ["SLACK_BOT_TOKEN"]
#sc = SlackClient(slackToken)

def sendSlackMessage(msg, user):
    # Post msg as a direct message to the given user (a key of userID, case-insensitive)
    result = sc.api_call(
        "chat.postMessage",
        channel=userID[user.lower()],
        text=msg)

    if not result['ok']:
        print("Error: {}".format(result))

### Supporting Functions

In [10]:
# Source: https://medium.com/@aneesha/visualising-top-features-in-linear-svm-with-scikit-learn-and-matplotlib-3454ab18a14d
def plot_coefficients(classifier, feature_names, top_features=20):
    coef = classifier.coef_.ravel()
    top_positive_coefficients = np.argsort(coef)[-top_features:]
    top_negative_coefficients = np.argsort(coef)[:top_features]
    top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])
    # create plot
    plt.figure(figsize=(15, 5))
    colors = ["red" if c < 0 else "blue" for c in coef[top_coefficients]]
    plt.bar(np.arange(2 * top_features), coef[top_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(1, 1 + 2 * top_features), feature_names[top_coefficients], rotation=60, ha="right")
    plt.show()
    
def getTopCoefficients(classifier, feature_names, top_features=20):
    coef = classifier.coef_.ravel()
    top_positive_coefficients = np.argsort(coef)[-top_features:]
    top_negative_coefficients = np.argsort(coef)[:top_features]
    top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])
    feature_names = np.array(feature_names)
    return feature_names[top_coefficients]

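# Source: https://stackoverflow.com/questions/39812885/retain-feature-names-after-scikit-feature-selection
from sklearn.feature_selection import f_classif  # not included in the import cell above

def percentile_threshold_selector(data, target, percent=10):
    # Univariate scoring is supervised, so fit() needs the target as well as the features
    selector = SelectPercentile(f_classif, percentile=percent)
    selector.fit(data, target)
    return data[data.columns[selector.get_support(indices=True)]]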

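def scale_data(data, scaler=None):
    # Standardize every column; defaults to StandardScaler when no scaler is passed in
    scaler = StandardScaler() if scaler is None else scaler
    scaled = scaler.fit_transform(data)
    return pd.DataFrame(scaled, columns=data.columns, index=data.index)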

# https://stackoverflow.com/questions/17778394/list-highest-correlation-pairs-from-a-large-correlation-matrix-in-pandas
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

# rmse, mape functions taken from: https://github.com/jakemdrew/EducationDataNC/blob/master/Other%20Projects/iPython%20Notebooks/Machine%20Learning/High%20School%20Minority%20Percentage%20February%202018.ipynb
#Use mean absolute error (MAE) to score the regression models created 
#(the scale of MAE is identical to the response variable)

#Function for Root mean squared error
#https://stackoverflow.com/questions/17197492/root-mean-square-error-in-python
def rmse(y_actual, y_predicted):
    return np.sqrt(mean_squared_error(y_actual, y_predicted))

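#Function for Mean Absolute Percentage Error (MAPE) - Untested
#Adapted from - https://stackoverflow.com/questions/42250958/how-to-optimize-mape-code-in-python
def mape(y_actual, y_predicted):
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    #Skip zero targets before dividing, and take the absolute value of the whole ratio
    #so negative targets cannot produce a negative percentage
    mask = y_actual != 0
    return np.fabs((y_actual[mask] - y_predicted[mask]) / y_actual[mask]).mean() * 100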

#Create scorers for rmse and mape functions
mae_scorer = make_scorer(score_func=mean_absolute_error, greater_is_better=False)
rmse_scorer = make_scorer(score_func=rmse, greater_is_better=False)
mape_scorer = make_scorer(score_func=mape, greater_is_better=False)

#Make scorer dictionary to pass into cross_validate() function for producing multiple scores for each cv fold.
errorScoring = {'MAE':  mae_scorer, 
                'RMSE': rmse_scorer,
                'MAPE': mape_scorer
               }

In [42]:
# Code used to evaluate the regression models
# Code from Dr. Drew's GitHub: https://github.com/jakemdrew/EducationDataNC/blob/master/Other%20Projects/iPython%20Notebooks/Machine%20Learning/High%20School%20Minority%20Percentage%20February%202018.ipynb

def EvaluateRegressionEstimator(regEstimator, X, y, cv):
    
    scores = cross_validate(regEstimator, X, y, scoring=errorScoring, cv=cv, return_train_score=True)

    #scorers built with greater_is_better=False return negated errors, so flip the signs back
    # https://github.com/scikit-learn/scikit-learn/issues/2439
    scores['test_MAE'] = scores['test_MAE'] * -1
    scores['test_MAPE'] = scores['test_MAPE'] * -1
    scores['test_RMSE'] = scores['test_RMSE'] * -1

    #print mean MAE for all folds 
    maeAvg = scores['test_MAE'].mean()
    print_str = "The average MAE for all cv folds is: \t\t\t {maeAvg:.5}"
    print(print_str.format(maeAvg=maeAvg))

    #print mean MAPE for all folds
    mape_avg = scores['test_MAPE'].mean()
    print_str = "The average MAE percentage (MAPE) for all cv folds is: \t {mape_avg:.5}"
    print(print_str.format(mape_avg=mape_avg))

    #print mean RMSE for all folds 
    RMSEavg = scores['test_RMSE'].mean()
    print_str = "The average RMSE for all cv folds is: \t\t\t {RMSEavg:.5}"
    print(print_str.format(RMSEavg=RMSEavg))
    print('*********************************************************')

    print('Cross Validation Fold Mean Error Scores')
    scoresResults = pd.DataFrame()
    scoresResults['MAE'] = scores['test_MAE']
    scoresResults['MAPE'] = scores['test_MAPE']
    scoresResults['RMSE'] = scores['test_RMSE']
    return scoresResults

In [1]:
# Brett's directory
# Laptop
%cd "C:\sandbox\SMU\dataMining\choNotebook\EducationDataNC\2017\Machine Learning Datasets"

# Ryan's directory
#%cd "C:\Users\Clovis\Documents\7331DataMining\EducationDataNC\2017\Machine Learning Datasets"

# Cho's directory
#%cd "/Users/chostone/Documents/Data Mining/7331DataMining/EducationDataNC/2017/Machine Learning Datasets"

# NW directory
#%cd "C:\Users\Nicole Wittlin\Documents\Classes\MSDS7331\Project\2017\Machine Learning Datasets"
dfPublicHS = pd.read_csv("PublicHighSchools2017_ML.csv")

[Errno 2] No such file or directory: 'C:\\sandbox\\SMU\\dataMining\\choNotebook\\EducationDataNC\\2017\\Machine Learning Datasets'
/Users/chostone/Documents/Data Mining/7331DataMining


NameError: name 'pd' is not defined

<a href="#top">Back to Top</a> 
## Data Understanding (20 points total)
<a id="data1"></a>
### Data Understanding 1 (10 points)

<span style="color: blue">Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems?</span>
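
The data-quality questions above can be checked with a few pandas calls. The following is a minimal sketch, not the notebook's actual analysis: `summarize_data_quality` is a hypothetical helper, the toy frame and its column names are made up to stand in for `dfPublicHS`, and the 1.5 × IQR rule is only one of several reasonable ways to flag outliers.

```python
import numpy as np
import pandas as pd

def summarize_data_quality(df):
    """Return per-column missing counts, the duplicate-row count, and per-column IQR outlier counts."""
    missing = df.isna().sum()
    n_duplicates = int(df.duplicated().sum())

    # Flag numeric values more than 1.5 * IQR outside the quartiles (assumed rule)
    numeric = df.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outlier_mask = (numeric < (q1 - 1.5 * iqr)) | (numeric > (q3 + 1.5 * iqr))
    outliers = outlier_mask.sum()
    return missing, n_duplicates, outliers

# Toy data with hypothetical column names; replace with dfPublicHS in the notebook
toy = pd.DataFrame({
    "enrollment": [500.0, 520.0, 510.0, 505.0, 5000.0, 500.0],
    "grad_rate": [88.0, 90.0, np.nan, 91.0, 89.0, 88.0],
})
missing, n_duplicates, outliers = summarize_data_quality(toy)
print(missing, n_duplicates, outliers, sep="\n")
```

Whether a flagged value is a mistake or a legitimately unusual school still has to be judged case by case; the counts only show where to look.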

<a href="#top">Back to Top</a> 
<a id="data2"></a>
### Data Understanding 2 (10 points)

<span style="color: blue">Visualize the any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.</span>

<span style="color: blue">Note: "Visualize the any" is verbatim from syllabus</span>


<a href="#top">Back to Top</a> 
## Modeling and Evaluation (50 points total)

<span style="color: blue">Different tasks will require different evaluation methods. Be as thorough as possible when analyzing the data you have chosen and use visualizations of the results to explain the performance and expected outcomes whenever possible. Guide the reader through your analysis with plenty of discussion of the results. Each option is broken down by:</span>

<ul style = "color: blue">
<li>Train and Adjust Parameters</li>
<li>Evaluate and Compare</li>
<li>Visualize Results</li>
<li>Summarize the Ramifications</li>
</ul>


<ul style="color: blue">
<li><strong>Option A: Cluster Analysis</strong></li>
    <ul style="color: blue">
    <li><strong>Train</strong>: Perform cluster analysis using several clustering methods (adjust parameters).</li>
    <li><strong>Eval</strong>: Use internal and/or external validation measures to describe and compare the clusterings and the clusters — how did you determine a suitable number of clusters for each method?</li>
    <li><strong>Visualize</strong>: Use tables/visualization to discuss the found results. Explain each visualization in detail.</li>
    <li><strong>Summarize</strong>: Describe your results. What findings are the most interesting and why?</li>
    </ul></ul>
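
One way to approach the Train and Eval steps of Option A is to fit a clustering method over a range of cluster counts and compare an internal validation measure. The sketch below is an assumption-laden illustration rather than this project's method: it uses only k-means, the silhouette coefficient as the internal measure, and synthetic blobs in place of the scaled school features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three well-separated groups (assumed, not the school data)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

# Fit k-means for each candidate k and record the mean silhouette coefficient
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, labels)

# The k with the highest silhouette is one candidate for a "suitable" number of clusters
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, score in silhouette_by_k.items():
    print("k={}: silhouette={:.3f}".format(k, score))
print("best k by silhouette:", best_k)
```

The same loop can be repeated with other estimators (for example agglomerative clustering) to compare methods on a common measure.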

<ul style="color: blue">
<li><strong>Option B: Association Rule Mining</strong></li>
    <ul style="color: blue">
    <li><strong>Train</strong>: Create frequent itemsets and association rules (adjust parameters).</li>
    <li><strong>Eval</strong>: Use several measures for evaluating how interesting different rules are.</li>
    <li><strong>Visualize</strong>: Use tables/visualization to discuss the found results.</li>
    <li><strong>Summarize</strong>: Describe your results. What findings are the most interesting and why?</li>
    </ul></ul>
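
For Option B, the usual interestingness measures are support, confidence, and lift. The sketch below computes them for single-antecedent rules using only pandas, so it does not rely on any particular rule-mining library. The `pairwise_rules` helper, the binarized column names, and the thresholds are all hypothetical and chosen for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical one-hot "transactions": each row is a school, each column a binarized attribute
baskets = pd.DataFrame({
    "high_poverty":  [1, 1, 1, 0, 0, 1, 0, 1],
    "low_grad_rate": [1, 1, 0, 0, 0, 1, 0, 1],
    "small_school":  [0, 1, 0, 1, 0, 0, 1, 0],
}).astype(bool)

def pairwise_rules(onehot, min_support=0.25, min_confidence=0.6):
    """Score every rule A -> B between two columns by support, confidence, and lift."""
    columns = ["antecedent", "consequent", "support", "confidence", "lift"]
    rows = []
    for a, b in combinations(onehot.columns, 2):
        for antecedent, consequent in ((a, b), (b, a)):
            # support: share of rows where both items occur
            support = (onehot[antecedent] & onehot[consequent]).mean()
            if support < min_support:
                continue
            # confidence: P(consequent | antecedent); lift: confidence / P(consequent)
            confidence = support / onehot[antecedent].mean()
            lift = confidence / onehot[consequent].mean()
            if confidence >= min_confidence:
                rows.append([antecedent, consequent, support, confidence, lift])
    rules = pd.DataFrame(rows, columns=columns)
    return rules.sort_values(["lift", "confidence"], ascending=False).reset_index(drop=True)

rules = pairwise_rules(baskets)
print(rules)
```

A lift above 1 means the two attributes co-occur more often than independence would predict; adjusting `min_support` and `min_confidence` is the parameter tuning the Train step refers to.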

<ul style = "color: blue">
<li><strong>Option C: Collaborative Filtering</strong></li>
    <ul style="color: blue">
    <li><strong>Train</strong>: Create user-item matrices or item-item matrices using collaborative filtering (adjust parameters).</li>
    <li><strong>Eval</strong>: Determine performance of the recommendations using different performance measures (explain the ramifications of each measure).</li>
    <li><strong>Visualize</strong>: Use tables/visualization to discuss the found results. Explain each visualization in detail.</li>
    <li><strong>Summarize</strong>: Describe your results. What findings are the most interesting and why?</li>
    </ul></ul>
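
For Option C, an item-item matrix can be built from a user-item ratings matrix with cosine similarity, and an unseen rating predicted as a similarity-weighted average of the user's existing ratings. This is a minimal NumPy sketch under stated assumptions: the ratings matrix is invented, 0 is assumed to mean "not rated", and `item_similarity` and `predict_rating` are hypothetical helper names.

```python
import numpy as np

# Hypothetical user-item ratings (rows: users, columns: items); 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def item_similarity(r):
    """Cosine similarity between item columns, returned as an item-item matrix."""
    norms = np.linalg.norm(r, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for items nobody rated
    normalized = r / norms
    return normalized.T @ normalized

def predict_rating(r, sim, user, item):
    """Similarity-weighted average of the ratings this user has already given."""
    rated = r[user] > 0
    weights = sim[item, rated]
    if weights.sum() == 0:
        return 0.0
    return float(weights @ r[user, rated] / weights.sum())

sim = item_similarity(ratings)
pred = predict_rating(ratings, sim, user=0, item=2)
print(np.round(sim, 3))
print("predicted rating for user 0, item 2: {:.3f}".format(pred))
```

To evaluate such a recommender, known ratings can be held out, predicted this way, and scored with measures such as RMSE or MAE.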

<a href="#top">Back to Top</a> 
<a id="Model1"></a>
### Train and Adjust Parameters (10 points)

<a href="#top">Back to Top</a> 
<a id="Model2"></a>
### Evaluate and Compare (10 points)

<a href="#top">Back to Top</a> 
<a id="Model3"></a>
### Visualize Results (10 points)

<a href="#top">Back to Top</a> 
<a id="Model4"></a>
### Summarize the Ramifications (20 points)

<a href="#top">Back to Top</a> 
<a id="Deployment"></a>
## Deployment

<span style="color: blue">Be critical of your performance and tell the reader how *your* current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?</span>

<ul style="color: blue">
<li>How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?</li>
<li>How would *you* deploy your model for interested parties?</li>
<li>What other data should be collected?</li>
<li>How often would the model need to be updated, etc.?</li>
</ul>

<a href="#top">Back to Top</a> 
<a id="Exceptional"></a>
## Exceptional Work (10 points total)

<span style="color: blue">You have free rein to provide additional analyses or combine analyses.</span>