# Lab 3: CRISP-DM Capstone
## Association Rule Mining, Clustering, or Collaborative Filtering

### Ryan Bass, Brett Benefield, Cho Kim, Nicole Wittlin

<a id="top"></a>
## Contents
* Business Understanding
* Data Understanding
    * <a href="#data1">Data Understanding 1</a>
    * <a href="#data2">Data Understanding 2</a>
* Modeling and Evaluation
    * <a href="#Model1">Train and Adjust Parameters</a>
    * <a href="#Model2">Evaluate and Compare</a>
    * <a href="#Model3">Visualize Results</a>
    * <a href="#Model4">Summarize the Ramifications</a>
* <a href="#Deployment">Deployment</a>
* <a href="#Exceptional">Exceptional Work</a>


In [1]:
# Display plots below cells
%matplotlib notebook

# Turn off annoying warnings
import warnings
warnings.filterwarnings("ignore")

In [3]:
import pandas as pd
import numpy as np
import yellowbrick as yb
import matplotlib.pyplot as plt
import seaborn as sns
from math import sqrt
from pprint import pprint
from time import time
from datetime import datetime
from sklearn import metrics as mt
from sklearn import neighbors
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.feature_selection import VarianceThreshold, SelectFromModel, SelectPercentile, f_regression, mutual_info_regression
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, accuracy_score, f1_score, roc_auc_score, mean_absolute_error, make_scorer, mean_squared_error, silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler, Binarizer, scale
from sklearn.svm import LinearSVC, NuSVC, SVC, SVR
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier, RidgeClassifier, LinearRegression
from sklearn.linear_model import Ridge, Lasso, LassoCV
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, ShuffleSplit, StratifiedShuffleSplit, StratifiedKFold, GridSearchCV, cross_validate, RandomizedSearchCV, cross_val_score
from sklearn.cluster import KMeans, MiniBatchKMeans
from yellowbrick.classifier import ClassificationReport, ConfusionMatrix, ClassPredictionError, ROCAUC
from yellowbrick.features import Rank1D, Rank2D, RFECV
from yellowbrick.features.importances import FeatureImportances
from yellowbrick.model_selection import ValidationCurve, LearningCurve
from yellowbrick.regressor import PredictionError, ResidualsPlot
from yellowbrick.regressor.alphas import AlphaSelection
from yellowbrick.cluster import SilhouetteVisualizer, KElbowVisualizer

In [4]:
# Show all columns/rows in output
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

### Slack Integration

In [None]:
# Some setup is required before you can use this because token must be kept private
# I also need to add your name and unique identifier to the dictionary userID below
import os
from slackclient import SlackClient
from dotenv import load_dotenv

load_dotenv()

userID = {"brett": "UAN6UQEVC", "ryan": "UALUD69AB"}

#slackToken = os.environ["SLACK_BOT_TOKEN"]
#sc = SlackClient(slackToken)

def sendSlackMessage(msg, user):
    # Requires `sc` to exist: uncomment the two lines above once SLACK_BOT_TOKEN is set
    result = sc.api_call(
        "chat.postMessage",
        channel=userID[user.lower()],
        text=msg)

    if not result['ok']:
        print("Error: {}".format(result))

### Supporting Functions

In [5]:
# Source: https://medium.com/@aneesha/visualising-top-features-in-linear-svm-with-scikit-learn-and-matplotlib-3454ab18a14d
def plot_coefficients(classifier, feature_names, top_features=20):
    coef = classifier.coef_.ravel()
    top_positive_coefficients = np.argsort(coef)[-top_features:]
    top_negative_coefficients = np.argsort(coef)[:top_features]
    top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])
    # create plot
    plt.figure(figsize=(15, 5))
    colors = ["red" if c < 0 else "blue" for c in coef[top_coefficients]]
    plt.bar(np.arange(2 * top_features), coef[top_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(1, 1 + 2 * top_features), feature_names[top_coefficients], rotation=60, ha="right")
    plt.show()
    
def getTopCoefficients(classifier, feature_names, top_features=20):
    coef = classifier.coef_.ravel()
    top_positive_coefficients = np.argsort(coef)[-top_features:]
    top_negative_coefficients = np.argsort(coef)[:top_features]
    top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])
    feature_names = np.array(feature_names)
    return feature_names[top_coefficients]

# Source: https://stackoverflow.com/questions/39812885/retain-feature-names-after-scikit-feature-selection
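def percentile_threshold_selector(data, target, percent=10):
    # f_regression is a supervised score function, so the target is passed to fit()
    selector = SelectPercentile(f_regression, percentile=percent)
    selector.fit(data, target)
    return data[data.columns[selector.get_support(indices=True)]]

def scale_data(data):
    # Assumes standardization (zero mean, unit variance) is the intended scaling
    scaled = StandardScaler().fit_transform(data)
    return pd.DataFrame(scaled, columns=data.columns, index=data.index)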

# https://stackoverflow.com/questions/17778394/list-highest-correlation-pairs-from-a-large-correlation-matrix-in-pandas
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

# rmse, mape functions taken from: https://github.com/jakemdrew/EducationDataNC/blob/master/Other%20Projects/iPython%20Notebooks/Machine%20Learning/High%20School%20Minority%20Percentage%20February%202018.ipynb
#Use mean absolute error (MAE) to score the regression models created 
#(the scale of MAE is identical to the response variable)

#Function for Root mean squared error
#https://stackoverflow.com/questions/17197492/root-mean-square-error-in-python
def rmse(y_actual, y_predicted):
    return np.sqrt(mean_squared_error(y_actual, y_predicted))

#Function for Mean Absolute Percentage Error (MAPE) - Untested
#Adapted from - https://stackoverflow.com/questions/42250958/how-to-optimize-mape-code-in-python
def mape(y_actual, y_predicted): 
    mask = y_actual != 0
    return (np.fabs(y_actual - y_predicted)/y_actual)[mask].mean() * 100

#Create scorers for rmse and mape functions
mae_scorer = make_scorer(score_func=mean_absolute_error, greater_is_better=False)
rmse_scorer = make_scorer(score_func=rmse, greater_is_better=False)
mape_scorer = make_scorer(score_func=mape, greater_is_better=False)

#Make scorer dictionary to pass into cross_validate() function for producing multiple scores for each cv fold.
errorScoring = {'MAE':  mae_scorer, 
                'RMSE': rmse_scorer,
                'MAPE': mape_scorer
               }

In [6]:
# code used to evaluate the regression models
# code from Dr. Drew github: https://github.com/jakemdrew/EducationDataNC/blob/master/Other%20Projects/iPython%20Notebooks/Machine%20Learning/High%20School%20Minority%20Percentage%20February%202018.ipynb

def EvaluateRegressionEstimator(regEstimator, X, y, cv):
    
    scores = cross_validate(regEstimator, X, y, scoring=errorScoring, cv=cv, return_train_score=True)

    #cross val score sign-flips the outputs of MAE
    # https://github.com/scikit-learn/scikit-learn/issues/2439
    scores['test_MAE'] = scores['test_MAE'] * -1
    scores['test_MAPE'] = scores['test_MAPE'] * -1
    scores['test_RMSE'] = scores['test_RMSE'] * -1

    #print mean MAE for all folds 
    maeAvg = scores['test_MAE'].mean()
    print_str = "The average MAE for all cv folds is: \t\t\t {maeAvg:.5}"
    print(print_str.format(maeAvg=maeAvg))

    #print mean test_MAPE for all folds
    mape_avg = scores['test_MAPE'].mean()
    print_str = "The average MAE percentage (MAPE) for all cv folds is: \t {mape_avg:.5}"
    print(print_str.format(mape_avg=mape_avg))

    #print mean RMSE for all folds 
    RMSEavg = scores['test_RMSE'].mean()
    print_str = "The average RMSE for all cv folds is: \t\t\t {RMSEavg:.5}"
    print(print_str.format(RMSEavg=RMSEavg))
    print('*********************************************************')

    print('Cross Validation Fold Mean Error Scores')
    scoresResults = pd.DataFrame()
    scoresResults['MAE'] = scores['test_MAE']
    scoresResults['MAPE'] = scores['test_MAPE']
    scoresResults['RMSE'] = scores['test_RMSE']
    return scoresResults
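
The sign flip inside `EvaluateRegressionEstimator` follows from how `make_scorer` treats `greater_is_better=False`. The sketch below illustrates that convention on synthetic data. The data, the `DummyRegressor` baseline, and all `demo_*` names are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic data (hypothetical, for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.normal(loc=20.0, scale=2.0, size=50)

# greater_is_better=False makes scikit-learn negate the metric,
# so that "higher is better" holds for every scorer
demo_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
demo_cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

demo_scores = cross_validate(DummyRegressor(strategy="mean"), X_demo, y_demo,
                             scoring={"MAE": demo_scorer}, cv=demo_cv)

raw_mae = demo_scores["test_MAE"]  # negated errors, one per fold
mae = -raw_mae                     # flip the sign to recover the usual MAE
print(raw_mae.mean(), mae.mean())
```

Multiplying by -1, as the function above does, returns each fold's error to the scale of the response variable.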

In [7]:
# Brett's directory
# Laptop
%cd "C:\sandbox\SMU\dataMining\choNotebook\EducationDataNC\2017\Machine Learning Datasets"

# Ryan's directory
#%cd "C:\Users\Clovis\Documents\7331DataMining\EducationDataNC\2017\Machine Learning Datasets"

# Cho's directory
#%cd "/Users/chostone/Documents/Data Mining/7331DataMining/EducationDataNC/2017/Machine Learning Datasets"

# NW directory
#%cd "C:\Users\Nicole Wittlin\Documents\Classes\MSDS7331\Project\2017\Machine Learning Datasets"

dfPublicHS = pd.read_csv("PublicHighSchools2017_ML.csv")

C:\sandbox\SMU\dataMining\choNotebook\EducationDataNC\2017\Machine Learning Datasets


In [12]:
# These names seem to cause problems so let's give them friendlier names
# renameCols = {'_1yr_tchr_trnovr_pct': 'One_yr_tchr_trnovr_pct',
#               '0-3 Years_LEA_Exp_Pct_Prin': 'less_3_years_LEA_Exp_Pct_Prin',
#               '10+ Years_LEA_Exp_Pct_Prin': 'ten_plus_years_LEA_Exp_Pct_Prin',
#               '4-10 Years_LEA_Exp_Pct_Prin': 'four_plus_years_LEA_Exp_Pct_Prin',
#               '4-Year_Cohort_Graduation_Rate_Score': 'four_Year_Cohort_Graduation_Rate_Score',
#               '_1_to_1_access_Yes': 'one_to_one_access_yes',
#               '4_10_Years_LEA_Exp_Pct_Prin': 'Four_to_Ten_Years_Exp_Pct_prin',
#               '10+_Years_LEA_Exp_Pct_Prin': 'Ten_Plus_Years_LEA_Exp_Pct_prin',
#               '0_3_Years_LEA_Exp_Pct_Prin': 'One_to_three_lea_exp_pct_prin'}

# # Rename columns
# dfPublicHS.rename(columns=renameCols, inplace = True)

In [8]:
# Replace all non alphanumeric characters with underscores
dfPublicHS.columns = dfPublicHS.columns.str.replace(r'\W', "_", regex=True)

In [9]:
# Change all columns to floats since some libraries only work with floats
dfPublicHS = dfPublicHS.astype(float)

# Treat unit_code as a string
dfPublicHS["unit_code"] = dfPublicHS["unit_code"].astype(str)

In [10]:
#want to delete any remaining variables related to the ACT score (such as ACT benchmarks) to not bias our model
dfDropped = dfPublicHS.copy()

# Keep the target before dropping every ACT-related column
temp = dfDropped['ACT_Score']

dropCols = dfDropped.filter(regex = r'ACT')

dfDropped.drop(columns = dropCols.columns, inplace = True)

dfDropped['ACT_Score'] = temp

In [11]:
#list of all the columns that were deleted (note that ACT Score was put back into dataframe that is being used)
#dropCols.info()

In [12]:
dropColsPrin = dfDropped.filter(regex = r'prin')

dfDropped.drop(columns = dropColsPrin.columns, inplace = True)

#dropColsPrin.info()

#### Determine the 25th and 75th percentiles (the first and third quartiles)

In [13]:
fig = plt.figure()
ax = fig.add_subplot()

dfDropped.boxplot(column=['ACT_Score'], ax=ax)
qSplit = dfDropped['ACT_Score'].quantile([.25, .50, .75, 1])


#### We want to explore the extremes

In [14]:
dfDropped["Q25"] = np.where(dfDropped['ACT_Score'] <= qSplit[.25], 1.0, 0.0)
dfDropped["Q50"] = np.where(dfDropped['ACT_Score'] <= qSplit[.50], 1.0, 0.0)
dfDropped["Q75"] = np.where(dfDropped['ACT_Score'] <= qSplit[.75], 1.0, 0.0)
dfDropped["Q100"] = np.where(dfDropped['ACT_Score'] <= qSplit[1], 1.0, 0.0)

<a href="#top">Back to Top</a> 
## Business Understanding (10 points total)

<span style="color: blue">Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). How will you measure the effectiveness of a good algorithm? Why does your chosen validation method make sense for this specific dataset and the stakeholders needs?</span>

<a href="#top">Back to Top</a> 
## Data Understanding (20 points total)
<a id="data1"></a>
### Data Understanding 1 (10 points)

<span style="color: blue">Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems?</span>

<a href="#top">Back to Top</a> 
<a id="data2"></a>
### Data Understanding 2 (10 points)

<span style="color: blue">Visualize the any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.</span>

<span style="color: blue">Note: "Visualize the any" is verbatim from syllabus</span>


In [14]:
zeroScore = dfDropped[dfDropped['ACT_Score'] == 0]
zeroScore[['student_num', 'ACT_Score']]

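|     | student_num | ACT_Score |
|-----|-------------|-----------|
| 8   | 56.0        | 0.0       |
| 16  | 64.0        | 0.0       |
| 51  | 62.0        | 0.0       |
| 78  | 8.0         | 0.0       |
| 187 | 443.0       | 0.0       |
| 311 | 68.0        | 0.0       |
| 332 | 46.0        | 0.0       |
| 340 | 149.0       | 0.0       |
| 353 | 502.0       | 0.0       |
| 463 | 59.0        | 0.0       |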


In [18]:
featCols = ['English_II_Size', 'SPG_Score', 'EVAAS_Growth_Score', 'NC_Math_1_Score', 
            'Passing_NC_Math_3', 'EOCSubjects_CACR_All', 'EOCEnglish2_CACR_Male',
            'EOCEnglish2_CACR_White', 'EOCMathI_CACR_White', 'EOCSubjects_CACR_White',
            'EOCBiology_CACR_SWD', 'st_short_susp_per_c_num', 'nbpts_num', 'pct_eds']

for feat in featCols:
    fig = plt.figure()
    fig.suptitle(feat)

    ax = fig.add_subplot(131)
    plt.boxplot(dfDropped[feat], showmeans=True)
    ax.set_xlabel('All Schools')

    ax = fig.add_subplot(132, sharey=ax)
    btmQ = dfDropped[dfDropped['Q25'] == 1]
    plt.boxplot(btmQ[feat], showmeans=True)
    ax.set_xlabel('Bottom Quartile Schools')

    ax = fig.add_subplot(133, sharey=ax)
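    # Top quartile: schools scoring above the 75th percentile
    # (Q100 is 1 for every school, so it cannot isolate the top group)
    topQ = dfDropped[dfDropped['ACT_Score'] > qSplit[.75]]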
    plt.boxplot(topQ[feat], showmeans=True)
    ax.set_xlabel('Top Quartile Schools')
    
# del dfDropped['Q100']


In [19]:
breaks = np.asarray(np.percentile(dfDropped['ACT_Score'], [25,50,75,100]))
dfDropped['ACT_Score_Quartiles'] = (dfDropped['ACT_Score'].values > breaks[..., np.newaxis]).sum(0)
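
The broadcasting comparison above counts how many percentile breaks each score exceeds, which yields labels 0 through 3. A small check on made-up scores may make that easier to see. The `demo_*` values below are hypothetical and chosen only so the result can be verified by hand. `np.searchsorted` is shown as one possible equivalent, not as the method used in this notebook.

```python
import numpy as np

# Hypothetical scores, chosen so the quartile boundaries are easy to verify by hand
demo_scores = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0])

demo_breaks = np.percentile(demo_scores, [25, 50, 75, 100])

# Compare every score against every break: the result has shape (4, n_scores)
exceeds = demo_scores > demo_breaks[..., np.newaxis]

# Summing over the break axis counts how many breaks each score exceeds
demo_quartiles = exceeds.sum(0)

# searchsorted with side="left" counts the breaks strictly below each score,
# which gives the same labels
alt_quartiles = np.searchsorted(demo_breaks, demo_scores, side="left")

print(demo_quartiles)
print(alt_quartiles)
```

Because no score can exceed the 100th percentile, the largest label is 3, which matches the four keys of the `quartile` dictionary used for plotting.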

In [21]:
quartile = {0: "First Q", 1: "Second Q", 2: "Third Q", 3: "Fourth Q"}

# scatter plot code from: https://stackoverflow.com/questions/21654635/scatter-plots-in-pandas-pyplot-how-to-plot-by-category
groups = dfDropped.groupby('ACT_Score_Quartiles')

for feat in featCols:
    fig, ax = plt.subplots()
    ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
    for name, group in groups:
        ax.plot(getattr(group, feat), group.ACT_Score, marker='o', 
                    linestyle='', ms=12, label=quartile[name])
    ax.legend()
    ax.set(title='ACT Score vs ' + feat, 
               xlabel=feat,
               ylabel='ACT Score')

    plt.show()


<a href="#top">Back to Top</a> 
## Modeling and Evaluation (50 points total)

<span style="color: blue">Different tasks will require different evaluation methods. Be as thorough as possible when analyzing the data you have chosen and use visualizations of the results to explain the performance and expected outcomes whenever possible. Guide the reader through your analysis with plenty of discussion of the results. Each option is broken down by:</span>

<ul style = "color: blue">
<li>Train and Adjust Parameters</li>
<li>Evaluate and Compare</li>
<li>Visualize Results</li>
<li>Summarize the Ramifications</li>
</ul>


<ul style="color: blue">
<li><strong>Option A: Cluster Analysis</strong></li>
    <ul style="color: blue">
    <li><strong>Train</strong>: Perform cluster analysis using several clustering methods (adjust parameters).</li>
    <li><strong>Eval</strong>: Use internal and/or external validation measures to describe and compare the clusterings and the clusters — how did you determine a suitable number of clusters for each method?</li>
    <li><strong>Visualize</strong>: Use tables/visualization to discuss the found results. Explain each visualization in detail.</li>
    <li><strong>Summarize</strong>: Describe your results. What findings are the most interesting and why?</li>
    </ul></ul>

<ul style="color: blue">
<li><strong>Option B: Association Rule Mining</strong></li>
    <ul style="color: blue">
    <li><strong>Train</strong>: Create frequent itemsets and association rules (adjust parameters).</li>
    <li><strong>Eval</strong>: Use several measures for evaluating how interesting different rules are.</li>
    <li><strong>Visualize</strong>: Use tables/visualization to discuss the found results.</li>
    <li><strong>Summarize</strong>: Describe your results. What findings are the most interesting and why?</li>
    </ul></ul>

<ul style = "color: blue">
<li><strong>Option C: Collaborative Filtering</strong></li>
    <ul style="color: blue">
    <li><strong>Train</strong>: Create user-item matrices or item-item matrices using collaborative filtering (adjust parameters).</li>
    <li><strong>Eval</strong>: Determine performance of the recommendations using different performance measures (explain the ramifications of each measure).</li>
    <li><strong>Visualize</strong>: Use tables/visualization to discuss the found results. Explain each visualization in detail.</li>
    <li><strong>Summarize</strong>: Describe your results. What findings are the most interesting and why?</li>
    </ul></ul>

| Method name                  | Parameters                                 | Scalability                                                 | Best Use Case                                                                   | Geometry/Measurement Metric                       |
|------------------------------|--------------------------------------------|-------------------------------------------------------------|---------------------------------------------------------------------------|----------------------------------------------|
| K-Means                      | number of clusters                         | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters  | Distances between points                     |
| DBSCAN                       | neighborhood size                          | Very large n_samples, medium n_clusters                     | Non-flat geometry, uneven cluster sizes                                   | Distances between nearest points             |
| Agglomerative clustering     | number of clusters, linkage type, distance | Large n_samples and n_clusters                              | Many clusters, possibly connectivity constraints, non Euclidean distances | Any pairwise distance                        |
| Ward hierarchical clustering | number of clusters                         | Large n_samples and n_clusters                              | Many clusters, possibly connectivity constraints                          | Distances between points                     |
| Spectral clustering          | number of clusters                         | Medium n_samples, small n_clusters                          | Few clusters, even cluster size, non-flat geometry                        | Graph distance (e.g. nearest-neighbor graph) |
| Gaussian mixtures            | many                                       | Not scalable                                                | Flat geometry, good for density estimation                                | Mahalanobis distances to centers             |
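
The "flat" versus "non-flat geometry" distinction in the table can be illustrated on a toy dataset. The sketch below is not part of the school analysis. It assumes scikit-learn's `make_moons` generator and an `eps` of 0.3, and compares K-Means with DBSCAN using the adjusted Rand index against the known moon labels.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved half-circles: a classic non-flat geometry
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=0)
X_moons = StandardScaler().fit_transform(X_moons)

# K-Means partitions by distance to centroids, so it tends to cut each moon in half
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# DBSCAN follows chains of nearby points, so it can trace each curved shape
dbscan_labels = DBSCAN(eps=0.3).fit_predict(X_moons)

kmeans_ari = adjusted_rand_score(y_moons, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_moons, dbscan_labels)
print("KMeans ARI: {:.3f}".format(kmeans_ari))
print("DBSCAN ARI: {:.3f}".format(dbscan_ari))
```

An external measure such as the adjusted Rand index is only available here because the toy data comes with true labels. The school data has none, which is why the cells below rely on the silhouette score instead.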

In [55]:
from sklearn import cluster, mixture
from sklearn.neighbors import kneighbors_graph

# normalize dataset once for easier parameter selection
# (a separate array, so dfDropped stays a DataFrame for later cells)
X_scaled = StandardScaler().fit_transform(dfDropped)

for kSize in np.arange(2,10):
    default_base = {'quantile': .3,
                    'eps': .3,
                    'damping': .9,
                    'preference': -200,
                    'n_neighbors': 10,
                    'n_clusters': int(kSize)}

    params = default_base.copy()

    # estimate bandwidth for mean shift
    bandwidth = cluster.estimate_bandwidth(X_scaled, quantile=params['quantile'])

    # connectivity matrix for structured Ward
    connectivity = kneighbors_graph(
        X_scaled, n_neighbors=params['n_neighbors'], include_self=False)
    # make connectivity symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)

    ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    two_means = cluster.MiniBatchKMeans(n_clusters=params['n_clusters'])
    ward = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='ward',
        connectivity=connectivity)
    spectral = cluster.SpectralClustering(
        n_clusters=params['n_clusters'], eigen_solver='arpack',
        affinity="nearest_neighbors")
    dbscan = cluster.DBSCAN(eps=params['eps'])
    affinity_propagation = cluster.AffinityPropagation(
        damping=params['damping'], preference=params['preference'])
    average_linkage = cluster.AgglomerativeClustering(
        linkage="average", affinity="cityblock",
        n_clusters=params['n_clusters'], connectivity=connectivity)
    birch = cluster.Birch(n_clusters=params['n_clusters'])
    gmm = mixture.GaussianMixture(
        n_components=params['n_clusters'], covariance_type='full')

    clustering_algorithms = (
        ('MiniBatchKMeans', two_means),
        ('AffinityPropagation', affinity_propagation),
        ('MeanShift', ms),
        ('SpectralClustering', spectral),
        ('Ward', ward),
        ('AgglomerativeClustering', average_linkage),
        #('DBSCAN', dbscan),
        ('Birch', birch),
        #('GaussianMixture', gmm)
    )

    print("Cluster Size: {}".format(kSize))

    # MeanShift and AffinityPropagation choose their own number of clusters,
    # so their scores do not change with kSize
    for name, algorithm in clustering_algorithms:
        cLabels = algorithm.fit_predict(X_scaled)
        silAvg = silhouette_score(X_scaled, cLabels)
        print("Algorithm: {}".format(name))
        print("Silhouette Avg Score: {}".format(silAvg))

Cluster Size: 2
Algorithm: MiniBatchKMeans
Silhouette Avg Score: 0.14355119901951233
Algorithm: AffinityPropagation
Silhouette Avg Score: 0.045905953968654274
Algorithm: MeanShift
Silhouette Avg Score: 0.14804173718670605
Algorithm: SpectralClustering
Silhouette Avg Score: 0.10705411381971865
Algorithm: Ward
Silhouette Avg Score: 0.13637376190872447
Algorithm: AgglomerativeClustering
Silhouette Avg Score: 0.47184114209764866
Algorithm: Birch
Silhouette Avg Score: 0.1768909993165435
Cluster Size: 3
Algorithm: MiniBatchKMeans
Silhouette Avg Score: 0.16453689371319058
Algorithm: AffinityPropagation
Silhouette Avg Score: 0.045905953968654274
Algorithm: MeanShift
Silhouette Avg Score: 0.14804173718670605
Algorithm: SpectralClustering
Silhouette Avg Score: 0.043592097286572935
Algorithm: Ward
Silhouette Avg Score: 0.05116956747867224
Algorithm: AgglomerativeClustering
Silhouette Avg Score: 0.46019790062906124
Algorithm: Birch
Silhouette Avg Score: 0.022090125204953116
Cluster Size: 4
Algorit

In [22]:
for kSize in np.arange(2,10):
    fig, ax = plt.subplots()
    model = KMeans(kSize)
    cLabels = model.fit_predict(dfDropped)
    silAvg = silhouette_score(dfDropped, cLabels)
    vis = SilhouetteVisualizer(model)

    vis.fit(dfDropped)
    vis.poof()
    
    print("For n_clusters =", kSize,
          "The average silhouette_score is :", silAvg)

For n_clusters = 2 The average silhouette_score is : 0.7139229234302018
For n_clusters = 3 The average silhouette_score is : 0.3287387744555162
For n_clusters = 4 The average silhouette_score is : 0.36616755863372996
For n_clusters = 5 The average silhouette_score is : 0.2764776411964059
For n_clusters = 6 The average silhouette_score is : 0.3214906504917908
For n_clusters = 7 The average silhouette_score is : 0.3188380247484674
For n_clusters = 8 The average silhouette_score is : 0.3355085886572387
For n_clusters = 9 The average silhouette_score is : 0.3341762881063567

In [19]:
for kSize in np.arange(2,10):
    fig, ax = plt.subplots()
    model = MiniBatchKMeans(kSize)
    cLabels = model.fit_predict(dfDropped[featCols])
    silAvg = silhouette_score(dfDropped[featCols], cLabels)
    vis = SilhouetteVisualizer(model)

    vis.fit(dfDropped[featCols])
    vis.poof()
    
    print("For n_clusters =", kSize,
          "The average silhouette_score is :", silAvg)

For n_clusters = 2 The average silhouette_score is : 0.3515934654110688
For n_clusters = 3 The average silhouette_score is : 0.2643402672478679
For n_clusters = 4 The average silhouette_score is : 0.296135259902697
For n_clusters = 5 The average silhouette_score is : 0.26982864580647903
For n_clusters = 6 The average silhouette_score is : 0.19760057384066357
For n_clusters = 7 The average silhouette_score is : 0.2113723758801071
For n_clusters = 8 The average silhouette_score is : 0.21897668130880238
For n_clusters = 9 The average silhouette_score is : 0.17165779522946464

In [49]:
dfExtremes = dfDropped[(dfDropped['ACT_Score_Quartiles'] == 0) | (dfDropped['ACT_Score_Quartiles'] == 3)]

for kSize in np.arange(2,10):
    fig, ax = plt.subplots()
    model = MiniBatchKMeans(kSize)
    cLabels = model.fit_predict(dfExtremes[featCols])
    silAvg = silhouette_score(dfExtremes[featCols], cLabels)
    vis = SilhouetteVisualizer(model)

    vis.fit(dfExtremes[featCols])
    vis.poof()
    
    print("For n_clusters =", kSize,
          "The average silhouette_score is :", silAvg)

For n_clusters = 2 The average silhouette_score is : 0.4733754315377237
For n_clusters = 3 The average silhouette_score is : 0.46028295478547504
For n_clusters = 4 The average silhouette_score is : 0.30196616349244637
For n_clusters = 5 The average silhouette_score is : 0.3192540677751459
For n_clusters = 6 The average silhouette_score is : 0.25107909520575666
For n_clusters = 7 The average silhouette_score is : 0.21897038480048311
For n_clusters = 8 The average silhouette_score is : 0.329997174979444
For n_clusters = 9 The average silhouette_score is : 0.23310185397849417

In [51]:
# clsRF = RandomForestRegressor(n_estimators=10, random_state=43)
# cv = ShuffleSplit(n_splits=10, test_size=.2, random_state=42)
# dfAll = dfDropped.drop(['ACT_Score'], axis = 1)
# y = dfDropped['ACT_Score']

model = MiniBatchKMeans(2)
cLabels = model.fit_predict(dfExtremes[featCols])
dfCluster = np.column_stack((dfExtremes[featCols], pd.get_dummies(cLabels)))

# EvaluateRegressionEstimator(clsRF, dfCluster, dfExtremes['ACT_Score'], cv = cv)
# EvaluateRegressionEstimator(clsRF, dfExtremes[featCols], dfExtremes['ACT_Score'], cv = cv)
# EvaluateRegressionEstimator(clsRF, dfDropped[featCols], y, cv = cv )
# EvaluateRegressionEstimator(clsRF, dfAll, y, cv = cv)

In [28]:
fig, ax = plt.subplots()
model = MiniBatchKMeans()
vis = KElbowVisualizer(model, k=(2,10), metric='silhouette', timings=False)

vis.fit(dfExtremes[featCols])
vis.poof()


In [29]:
fig, ax = plt.subplots()
model = MiniBatchKMeans()
vis = KElbowVisualizer(model, k=(2,10), metric='silhouette', timings=False)

vis.fit(dfExtremes)
vis.poof()


In [30]:
#this was taken from above:
# cv = ShuffleSplit(n_splits=10, test_size=.2, random_state=42)
# dfAll = dfDropped.drop(['ACT_Score'], axis = 1)
# y = dfDropped['ACT_Score']

#want to delete columns that give away what quartile it is
# del dfDropped['Q25']
# del dfDropped['Q50']
# del dfDropped['Q75']
# del dfDropped['Q100']
# del dfCluster["Q25", "Q50", "Q75", "Q100"]
# del dfExtremes["Q25", "Q50", "Q75", "Q100"]
# del dfAll["Q25", "Q50", "Q75", "Q100"]


#different X, y combinations for the models
# X_Cluster = dfCluster[(dfCluster['ACT_Score_Quartiles'] != 1 | dfCluster['ACT_Score_Quartiles'] != 2)]
# X_Extremes = dfExtremes[featCols]
X_Dropped = dfDropped[featCols]
# X_All = dfAll

# y_Cluster = dfExtremes['ACT_Score_Quartiles']
# y_Extremes = dfExtremes['ACT_Score_Quartiles']
y_Dropped = dfDropped['ACT_Score_Quartiles']
# y_All = dfDropped['ACT_Score_Quartiles']


#make train and test splits for them
#####################
# stratify=y was removed from this split because it raised an error
#####################
# X_train_Cluster, X_test_Cluster, y_train_Cluster, y_test_Cluster = train_test_split(X_Cluster, y_Cluster, random_state=42, stratify=y, test_size=.2)
# X_train_Extremes, X_test_Extremes, y_train_Extremes, y_test_Extremes = train_test_split(X_Extremes, y_Extremes, random_state=42, stratify=y, test_size=.2)
X_train_Dropped, X_test_Dropped, y_train_Dropped, y_test_Dropped = train_test_split(X_Dropped, y_Dropped, random_state=42, test_size=.2)
# X_train_All, X_test_All, y_train_All, y_test_All = train_test_split(X_All, y_All, random_state=42, stratify=y, test_size=.2)

#create cross validation variable
cv = ShuffleSplit(n_splits=10, test_size=.2, random_state=42)
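
Since the split above is not stratified, the quartile labels may be unevenly represented in the test set. The sketch below shows, on made-up data, how passing the label vector itself to `stratify` preserves class proportions. `X_demo` and `y_demo` are hypothetical stand-ins for `X_Dropped` and `y_Dropped`, and whether stratification suits this dataset is left as an open choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: 100 rows, 25 per quartile label
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.repeat([0, 1, 2, 3], 25)

# stratify takes the label array that should keep its proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Count how many test rows fall in each quartile label
test_counts = np.bincount(y_te)
print(test_counts)
```

With a 20% test size, each of the four labels contributes the same number of rows to the test set.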

In [31]:
######################
# chose accuracy, not sure if that's correct though
######################

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'n_estimators',
    param_range = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)],
    cv = cv, scoring = 'accuracy', n_jobs = -1
)

viz.fit(X_train_Dropped, y_train_Dropped)
viz.poof()


In [32]:
######################
# chose accuracy, not sure if that's correct though
######################

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'max_depth',
    param_range = [int(x) for x in np.linspace(40, 140, num = 11)], 
    cv = cv, scoring = 'accuracy', n_jobs = -1,
)

viz.fit(X_train_Dropped, y_train_Dropped)
viz.poof()


In [33]:
# Scoring metric: accuracy. This is a provisional choice; it is a fair summary
# only if the quartile classes are roughly balanced.

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'min_samples_split',
    param_range = np.arange(2, 10), 
    cv = cv, scoring = 'accuracy', n_jobs = -1,
)

viz.fit(X_train_Dropped, y_train_Dropped)
viz.poof()


In [34]:
# Scoring metric: accuracy. This is a provisional choice; it is a fair summary
# only if the quartile classes are roughly balanced.

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Create the validation curve visualizer
viz = ValidationCurve(
    RandomForestClassifier(random_state = 1), param_name = 'min_samples_leaf',
    param_range = np.arange(1, 10), 
    cv = cv, scoring = 'accuracy', n_jobs = -1,
)

viz.fit(X_train_Dropped, y_train_Dropped)
viz.poof()

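
The validation curves above all score on accuracy. One way to check whether that choice matters is to compare accuracy with class-aware metrics under the same kind of splitter. The sketch below does this on synthetic four-class data (`X_demo`, `y_demo`, and `cv_demo` are illustrative names, not objects from this notebook); if the three numbers sit close together, accuracy is probably an adequate summary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic four-class problem standing in for the quartile labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=4,
    n_clusters_per_class=1, random_state=0
)
print("class shares:", np.bincount(y_demo) / len(y_demo))

# Same style of splitter as the notebook, with fewer splits to keep it quick
cv_demo = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
metric_names = ["accuracy", "balanced_accuracy", "f1_macro"]
scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=1),
    X_demo, y_demo, cv=cv_demo, scoring=metric_names,
)

# Report mean and spread of each metric across the splits
for name in metric_names:
    vals = scores["test_" + name]
    print(f"{name}: {vals.mean():.3f} +/- {vals.std():.3f}")
```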

In [37]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split
max_features = ['auto', 'log2', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(70, 140, num = 8)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 4, 6]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 3]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'log2', 'sqrt'], 'max_depth': [70, 80, 90, 100, 110, 120, 130, 140, None], 'min_samples_split': [2, 4, 6], 'min_samples_leaf': [1, 2, 3], 'bootstrap': [True, False]}
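
To put the random search into perspective, it helps to count how many combinations the grid printed above contains. The sketch below rebuilds the same dictionary (under the illustrative name `grid_demo`) and counts it with scikit-learn's `ParameterGrid`.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the random_grid printed above
grid_demo = {
    "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
    "max_features": ["auto", "log2", "sqrt"],
    "max_depth": [70, 80, 90, 100, 110, 120, 130, 140, None],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2, 3],
    "bootstrap": [True, False],
}

# ParameterGrid enumerates the full Cartesian product of the lists
n_combos = len(ParameterGrid(grid_demo))
n_iter = 100
print(n_combos)                    # 4860
print(round(n_iter / n_combos, 4)) # share of the grid a 100-draw search visits
```

A search with `n_iter = 100` therefore samples roughly 2% of the 4,860 possible settings, which is why a narrower follow-up grid search can still be worthwhile.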


In [38]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rfc = RandomForestClassifier(random_state = 24)

# Random search of parameters, using the 10-split ShuffleSplit defined above,
# search across 100 different combinations, and use all available cores
rfc_randomCV = RandomizedSearchCV(estimator = rfc, 
                                  param_distributions = random_grid, 
                                  n_iter = 100, cv = cv, verbose = 2, 
                                  random_state = 18, n_jobs = -1, 
                                  scoring = 'accuracy')

# Fit the random search model
rfc_randomCV.fit(X_train_Dropped, y_train_Dropped)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done  98 tasks      | elapsed:   13.1s
[Parallel(n_jobs=-1)]: Done 301 tasks      | elapsed:   35.7s
[Parallel(n_jobs=-1)]: Done 584 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  1.8min finished



In [40]:
# examine the best model
print(rfc_randomCV.best_score_)
print(rfc_randomCV.best_params_)
print(rfc_randomCV.best_estimator_)

0.7223684210526315
{'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': 80, 'bootstrap': True}
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=80, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)
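
The best score on its own does not show how close the runner-up candidates were. A minimal sketch of reading the ranked candidates from `cv_results_` follows; it runs a deliberately tiny search on synthetic data, and `search`, `X_demo`, and `y_demo` are illustrative names rather than objects from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit

X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

# A small search so the example finishes quickly
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=24),
    param_distributions={"n_estimators": [10, 20, 30],
                         "min_samples_leaf": [1, 2, 3]},
    n_iter=5,
    cv=ShuffleSplit(n_splits=3, test_size=0.2, random_state=42),
    scoring="accuracy",
    random_state=18,
)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per sampled candidate
results = pd.DataFrame(search.cv_results_)
cols = ["rank_test_score", "mean_test_score", "std_test_score", "params"]
top = results.sort_values("rank_test_score")[cols].head(3)
print(top)
```

If the top few candidates differ by less than their `std_test_score`, the specific "best" parameters are probably not meaningfully better than their neighbours.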


In [None]:
# get best parameters from the randomized search for the all quartile model
bestValues = rfc_randomCV.best_params_

print("Best parameters set found on development set: {}".format(bestValues))

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Set model to best values found; random_state is fixed so the refit is reproducible
cls = RandomForestClassifier(random_state = 24, **bestValues)

classFit = cls.fit(X_train_Dropped, y_train_Dropped)

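# Test-set predictions from the fitted model
y_hat = cls.predict(X_test_Dropped)

# Wrap the already-fitted model in the confusion matrix visualizer
cm = ConfusionMatrix(classFit)

# Score on the test set; score() generates its own predictions internally
cm.score(X_test_Dropped, y_test_Dropped)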

cm.poof()

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Classification Report
vis = ClassificationReport(cls)
vis.fit(X_train_Dropped, y_train_Dropped)
vis.score(X_test_Dropped, y_test_Dropped)
vis.poof()

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Prediction Error Report
vis = ClassPredictionError(cls)

# Fit the training data to the visualizer
vis.fit(X_train_Dropped, y_train_Dropped)

# Evaluate the model on the test data
vis.score(X_test_Dropped, y_test_Dropped)

# Draw visualization
vis.poof()
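
The Yellowbrick visualizers above draw the results as figures. When the raw numbers are needed, for example to quote them in the write-up, scikit-learn's own metrics give the same information as text. This is a sketch on synthetic data; `best_params_demo` is a hypothetical stand-in for `best_params_`, and the other names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic four-class data standing in for the quartile problem
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4,
    n_clusters_per_class=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Hypothetical best parameters, unpacked the same way best_params_ could be
best_params_demo = {"n_estimators": 50, "min_samples_leaf": 2, "max_depth": 80}
model = RandomForestClassifier(random_state=24, **best_params_demo).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
y_pred = model.predict(X_te)
cm_demo = confusion_matrix(y_te, y_pred, labels=[0, 1, 2, 3])
print(cm_demo)
print(classification_report(y_te, y_pred))
```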

In [None]:
#feature importance of all quartile classification model

clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=80, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)

clf.fit(X_Dropped, y_Dropped)

# Map feature_name -> feature_importance; the names must come from the
# columns the model was fit on (X_Dropped), not from every column of dfDropped
feats = dict(zip(X_Dropped.columns, clf.feature_importances_))

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance').plot(kind='bar', rot=45)
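
Gini importance is computed from the training data and can favour features with many distinct values. A common cross-check is permutation importance on held-out data. The sketch below compares the two on synthetic data, taking feature names from the same frame the model was fit on; all variable names here are illustrative and the notebook itself does not run this analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Three informative features and three pure-noise features
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=50, random_state=24).fit(X_tr, y_tr)

# Gini importance: names come from the columns used in fit()
gini = pd.Series(model.feature_importances_, index=X_tr.columns, name="gini")

# Permutation importance: drop in test accuracy when one column is shuffled
perm = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
perm_mean = pd.Series(perm.importances_mean, index=X_te.columns,
                      name="permutation")

table = pd.concat([gini, perm_mean], axis=1).sort_values("gini", ascending=False)
print(table)
```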

In [46]:
dfTemp = dfDropped[(dfDropped['ACT_Score_Quartiles'] == 0) | (dfDropped['ACT_Score_Quartiles'] == 3)]

X_Temp = dfTemp[featCols]
y_Temp = dfTemp['ACT_Score_Quartiles']

X_train_Temp, X_test_Temp, y_train_Temp, y_test_Temp = train_test_split(X_Temp, y_Temp, random_state=42, test_size=.2)

cv = ShuffleSplit(n_splits=10, test_size=.2, random_state=42)

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rfc = RandomForestClassifier(random_state = 24)

# Random search of parameters, using the 10-split ShuffleSplit defined above,
# search across 100 different combinations, and use all available cores
rfc_randomCV2 = RandomizedSearchCV(estimator = rfc, 
                                  param_distributions = random_grid, 
                                  n_iter = 100, cv = cv, verbose = 2, 
                                  random_state = 18, n_jobs = -1, 
                                  scoring = 'accuracy')

# Fit the random search model
rfc_randomCV2.fit(X_train_Temp, y_train_Temp)

# examine the best model
print(rfc_randomCV2.best_score_)
print(rfc_randomCV2.best_params_)
print(rfc_randomCV2.best_estimator_)

0.9692307692307692
{'n_estimators': 1600, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 80, 'bootstrap': False}
RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=80, max_features='log2', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1600, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)
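
The two-class score above is not directly comparable with the four-class score from the earlier search, because the tasks have different chance levels. A rough sketch of that baseline gap, using scikit-learn's `DummyClassifier` on synthetic balanced labels (all names are illustrative), is:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Balanced four-class labels; the features are irrelevant to a dummy model
rng = np.random.RandomState(0)
X_demo = rng.rand(400, 3)
y_four = np.repeat([0, 1, 2, 3], 100)

# Keep only the lowest and highest quartile, as in the Q1 vs Q4 model
mask = (y_four == 0) | (y_four == 3)

cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
baseline = DummyClassifier(strategy="stratified", random_state=0)

# Accuracy of random guessing that respects the class proportions
acc_four = cross_val_score(baseline, X_demo, y_four,
                           cv=cv_demo, scoring="accuracy").mean()
acc_two = cross_val_score(baseline, X_demo[mask], y_four[mask],
                          cv=cv_demo, scoring="accuracy").mean()
print(f"four-class baseline: {acc_four:.2f}")
print(f"two-class baseline:  {acc_two:.2f}")
```

With balanced classes the baselines land near 0.25 and 0.50, so part of the jump between the two searches comes from the easier task rather than from a better model.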


In [53]:
# This includes dfDropped[featCols] + clustering labels
X_Cluster = dfCluster

# Just look at the Q1 and Q4 data
y_Cluster = dfExtremes['ACT_Score_Quartiles']

X_train_Cluster, X_test_Cluster, y_train_Cluster, y_test_Cluster = train_test_split(X_Cluster, y_Cluster, random_state=42, test_size=.2)

cv = ShuffleSplit(n_splits=10, test_size=.2, random_state=42)

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rfc = RandomForestClassifier(random_state = 24)

# Random search of parameters, using the 10-split ShuffleSplit defined above,
# search across 100 different combinations, and use all available cores
rfc_randomCV3 = RandomizedSearchCV(estimator = rfc, 
                                  param_distributions = random_grid, 
                                  n_iter = 100, cv = cv, verbose = 2, 
                                  random_state = 18, n_jobs = -1, 
                                  scoring = 'accuracy')

# Fit the random search model
rfc_randomCV3.fit(X_train_Cluster, y_train_Cluster)

# examine the best model
print(rfc_randomCV3.best_score_)
print(rfc_randomCV3.best_params_)
print(rfc_randomCV3.best_estimator_)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done  98 tasks      | elapsed:    8.1s
[Parallel(n_jobs=-1)]: Done 301 tasks      | elapsed:   25.8s
[Parallel(n_jobs=-1)]: Done 584 tasks      | elapsed:   51.3s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  1.4min finished


0.9769230769230769
{'n_estimators': 600, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 110, 'bootstrap': False}
RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=110, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=600, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)
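
The cell above takes features from one dataframe and labels from another. `train_test_split` pairs rows by position, so the two objects must hold the same rows in the same order; this section does not show whether `dfCluster` and `dfExtremes` do. A minimal sketch of an alignment check, using small made-up frames (`features` and `labels` are illustrative), is:

```python
import pandas as pd

# Made-up features for five rows and labels for only four of them
features = pd.DataFrame(
    {"f1": [0.1, 0.2, 0.3, 0.4, 0.5], "cluster": [0, 1, 0, 1, 1]},
    index=[10, 11, 12, 13, 14],
)
labels = pd.Series([0, 3, 3, 0], index=[10, 12, 13, 14],
                   name="ACT_Score_Quartiles")

# Keep only rows present in both, in a single shared order
shared = features.index.intersection(labels.index)
X_aligned = features.loc[shared]
y_aligned = labels.loc[shared]

# Fail early if the pairing is off, before any model sees the data
assert X_aligned.index.equals(y_aligned.index)
print(X_aligned.shape, y_aligned.shape)
```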


In [None]:
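# Create the parameter grid for a finer grid search on the cluster extreme model.
# NOTE: these ranges do not bracket the random-search result printed above
# (n_estimators=600, max_depth=110, max_features='sqrt'); re-center them if the
# grid is meant to refine that run.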
param_grid = {
    'bootstrap': [False],
    'max_depth': [135, 140, 145],
    'max_features': ['auto'],
    'min_samples_leaf': [1],
    'min_samples_split': [2, 3],
    'n_estimators': [725, 750, 775, 800, 825, 850, 875]
}

# Create a base model
rfc = RandomForestClassifier(random_state = 24)

# Instantiate the grid search model
rfc_gridCV3 = GridSearchCV(estimator = rfc, param_grid = param_grid, 
                                cv = cv, n_jobs = -1, verbose = 2, 
                                scoring = 'accuracy')

# Fit the grid search to the data
rfc_gridCV3.fit(X_train_Cluster, y_train_Cluster)

# examine the best model
print(rfc_gridCV3.best_score_)
print(rfc_gridCV3.best_params_)
print(rfc_gridCV3.best_estimator_)
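
A follow-up grid is easier to keep in sync with the random search if it is built from the search result rather than typed by hand. The sketch below shows one possible way to do that; `bracket_int` is a hypothetical helper written for this example, and `best` copies the parameters printed by the random search on the cluster data.

```python
def bracket_int(center, step, n_each_side=2, minimum=1):
    """Return integers around center, spaced by step, not below minimum."""
    values = [center + k * step for k in range(-n_each_side, n_each_side + 1)]
    return [v for v in values if v >= minimum]

# Best parameters as printed by the random search on the cluster data
best = {"n_estimators": 600, "min_samples_split": 2, "min_samples_leaf": 1,
        "max_features": "sqrt", "max_depth": 110, "bootstrap": False}

# Categorical choices are kept fixed; numeric ones get a narrow window
refined_grid = {
    "bootstrap": [best["bootstrap"]],
    "max_features": [best["max_features"]],
    "min_samples_leaf": bracket_int(best["min_samples_leaf"], 1, n_each_side=1),
    "min_samples_split": bracket_int(best["min_samples_split"], 1,
                                     n_each_side=1, minimum=2),
    "max_depth": bracket_int(best["max_depth"], 5),
    "n_estimators": bracket_int(best["n_estimators"], 25),
}
print(refined_grid)
```

The step sizes (5 for depth, 25 for trees) are arbitrary choices for illustration. A dictionary built this way could be passed as `param_grid` to `GridSearchCV`.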

In [None]:
# get best parameters from the gridsearch for the clustered model of Q1 vs Q4
bestValues = rfc_gridCV3.best_params_

print("Best parameters set found on development set: {}".format(bestValues))

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Set model to best values found
cls = RandomForestClassifier(random_state = 24, **bestValues)

classFit = cls.fit(X_train_Cluster, y_train_Cluster)

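# Test-set predictions from the fitted model
y_hat = cls.predict(X_test_Cluster)

# Wrap the already-fitted model in the confusion matrix visualizer
cm = ConfusionMatrix(classFit)

# Score on the test set; score() generates its own predictions internally
cm.score(X_test_Cluster, y_test_Cluster)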

cm.poof()

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Classification Report
vis = ClassificationReport(cls)
vis.fit(X_train_Cluster, y_train_Cluster)
vis.score(X_test_Cluster, y_test_Cluster)
vis.poof()

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Prediction Error Report
vis = ClassPredictionError(cls)

# Fit the training data to the visualizer
vis.fit(X_train_Cluster, y_train_Cluster)

# Evaluate the model on the test data
vis.score(X_test_Cluster, y_test_Cluster)

# Draw visualization
vis.poof()

In [None]:
#feature analysis of clustered RF model

clf = RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=135, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=800, n_jobs=None,
            oob_score=False, random_state=24, verbose=0, warm_start=False)

clf.fit(X_Cluster, y_Cluster)

# Map feature_name -> feature_importance; the names must come from the
# columns the model was fit on (X_Cluster, which includes the cluster labels)
feats = dict(zip(X_Cluster.columns, clf.feature_importances_))

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance').plot(kind='bar', rot=45)

<a href="#top">Back to Top</a> 
<a id="Model1"></a>
### Train and Adjust Parameters (10 points)

<a href="#top">Back to Top</a> 
<a id="Model2"></a>
### Evaluate and Compare (10 points)

<a href="#top">Back to Top</a> 
<a id="Model3"></a>
### Visualize Results (10 points)

<a href="#top">Back to Top</a> 
<a id="Model4"></a>
### Summarize the Ramifications (20 points)

<a href="#top">Back to Top</a> 
<a id="Deployment"></a>
## Deployment

<span style="color: blue">Be critical of your performance and tell the reader how *your* current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?</span>

<ul style="color: blue">
<li>How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?</li>
<li>How would *you* deploy your model for interested parties?</li>
<li>What other data should be collected?</li>
<li>How often would the model need to be updated, etc.?</li>
</ul>

<a href="#top">Back to Top</a> 
<a id="Exceptional"></a>
## Exceptional Work (10 points total)

<span style="color: blue">You have free rein to provide additional analyses or combine analyses.</span>