# Clustering Demo
### To run all the code, click 'Cell' on the toolbar at the top of the page and select 'Run All'
You will be prompted to enter a password to unlock the data file. Your coach can provide that password.


## <span style="color:blue">There are 10 questions below. After running the code, answer the questions in blue or discuss them with your team.</span>


---

### Load python libraries
This section loads the Python packages that provide prewritten code for the clustering and the visualizations.

In [None]:
#import the necessary packages
%matplotlib inline

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.font_manager

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

from yellowbrick.cluster import KElbowVisualizer
from yellowbrick.cluster import InterclusterDistance
from yellowbrick.cluster import SilhouetteVisualizer
from yellowbrick.model_selection import FeatureImportances

from zipfile import ZipFile

import seaborn as sns


import getpass

plt.style.use('ggplot')

pd.set_option("display.max_columns", None)
print("Everything was loaded correctly")

#Enter the password to access the data
str_pwd = getpass.getpass('Enter the password to access the data:  ')

### Open data files
This section unlocks the data and makes it ready to use

# <span style="color:blue">1. Look at all the columns. What kind of data are you using to cluster?</span>


In [None]:
#open the encrypted zipfile and extract (decrypt) its contents with the password
with ZipFile('clustering_file.zip') as zf:
    zf.extractall(pwd=bytes(str_pwd,'utf-8'))

#read the extracted csv file into a dataframe
df_data = pd.read_csv('clustering_file.csv')

#get list of features to use in visualizations
lst_features = df_data.columns.tolist()
print("data file was loaded correctly")

#View a sample of the data loaded in
df_data.head()

The table above shows all the data points (features) we will use in clustering. Scroll right to look at the types of features we're using in the clustering project. 

- There are 503 features (the columns).
- There are 788 stores (the rows) in this sample dataset, although only the first 5 are shown.

The numbers shown are already prepared for data science use. Each value is the store's index for that feature relative to all the other stores, normalized to a 0-1 scale so that features can be compared with one another. A 1 means the store has the highest index for that feature.
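
As a rough illustration of what a 0-1 index looks like, the sketch below applies min-max scaling to a small made-up table. The store names, column names, and values are hypothetical, and min-max scaling is only an assumption about how the real file was prepared.

```python
import pandas as pd

# Hypothetical raw values for three stores (not taken from the real data file)
df_raw = pd.DataFrame(
    {"weekly_sales": [200.0, 500.0, 350.0], "foot_traffic": [10.0, 40.0, 25.0]},
    index=["store_a", "store_b", "store_c"],
)

# Min-max scaling: the smallest value in each column becomes 0, the largest becomes 1
df_index = (df_raw - df_raw.min()) / (df_raw.max() - df_raw.min())
print(df_index)
```

In this toy table, `store_b` gets a 1 in both columns because it has the highest value for each feature.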

# <span style="color:blue">2. How many clusters do you think we should use?</span>


The biggest question is how many clusters we should use. One way to find the optimal number of clusters is an elbow plot. If the plot looks like an arm, the elbow of the arm marks the optimal number of clusters. A vertical line is drawn at the statistically optimal number of clusters.
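
The visualizer does this work for you. To show the idea behind it, here is a minimal sketch on synthetic data (not the store data) that fits KMeans for several values of k and records the inertia, the sum of squared distances from each point to its cluster center. The variable names are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with 3 well-separated groups
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Fit one model per k and keep its inertia
inertias = []
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(model.inertia_)

# How much the inertia falls when one more cluster is added;
# the elbow is where these drops flatten out
drops = -np.diff(inertias)
for k, drop in zip(range(2, 8), drops):
    print(f"k={k}: inertia fell by {drop:.1f}")
```

Adding clusters always lowers the inertia, so the useful signal is the point where the improvement becomes small.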



In [None]:
#how many clusters do you want to see on the list?
max_clusters=10

X = np.array(df_data)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

#a 2-tuple k=(a, b) is treated as a range that excludes b, so add 1 to include max_clusters
visualizer = KElbowVisualizer(KMeans(), k=(1,max_clusters+1), size=(900,600))
visualizer.fit(X_scaled) # Fit the data to the visualizer
visualizer.show()  

---

# Run Clustering

The code below runs the actual clustering for 2, 4, 7, and 10 cluster groups. Use the graph above to help decide which number of clusters is optimal.

In [None]:
#Creating 4 different clustering views
model_2 = KMeans(n_clusters = 2, init = "k-means++", max_iter = 300, n_init = 10, random_state = 0)
model_4 = KMeans(n_clusters = 4, init = "k-means++", max_iter = 300, n_init = 10, random_state = 0)
model_7 = KMeans(n_clusters = 7, init = "k-means++", max_iter = 300, n_init = 10, random_state = 0)
model_10 = KMeans(n_clusters = 10, init = "k-means++", max_iter = 300, n_init = 10, random_state = 0)

# <span style="color:blue">3. Which clustering option would you pick based on the bar plot view and why?</span>


## Bar plot view
These plots show the number of stores in each cluster. You want the stores to be reasonably well distributed across the clusters. If one cluster contains nearly all the stores, you probably need more clusters. Which clustering option do you think looks best?
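
The same balance check can be done numerically. This is a minimal sketch with hypothetical cluster labels for 12 stores, not output from the models above.

```python
import numpy as np

# Hypothetical cluster labels for 12 stores in a 3-cluster model
labels = np.array([0, 0, 1, 2, 1, 0, 2, 2, 1, 0, 0, 1])

counts = np.bincount(labels)      # number of stores in each cluster
shares = counts / counts.sum()    # fraction of all stores in each cluster

for cluster_id, (n, share) in enumerate(zip(counts, shares)):
    print(f"cluster {cluster_id}: {n} stores ({share:.0%})")
```

A cluster whose share is close to 100% would be the warning sign described above.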

In [None]:
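#Bar plots
def viz_bar_plot(str_model, int_cluster):
    #fit the model and get the cluster label of every store
    labels = str_model.fit_predict(X_scaled)
    #pass the labels by keyword; newer seaborn versions no longer accept them positionally
    ax = sns.countplot(x=labels)
    #write the store count above each bar
    for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+.3, p.get_height()+2))
    plt.title(str(int_cluster) + ' cluster')
    ax.set(ylabel='# of Stores', xlabel='Clusters')
    plt.show()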
    
viz_bar_plot(model_2, 2)
viz_bar_plot(model_4, 4)
viz_bar_plot(model_7, 7)
viz_bar_plot(model_10, 10)

# <span style="color:blue">4. Which clustering option would you pick based on the silhouette view and why?</span>


## Silhouette Visualizer
The silhouette score is the average of the silhouette coefficients of all samples. For each sample, the coefficient is the difference between the mean distance to the points in the nearest other cluster and the mean distance to the other points in its own cluster, divided by the larger of the two. The score ranges from -1 to 1: values near 1 indicate dense, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate samples that were probably assigned to the wrong cluster.

You're looking for the option with the best balance of:
- Largest average silhouette score (dotted red line)
- Smallest width of each cluster
- Fewest negative silhouette coefficient values (bars extending to the left)
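
The quantities in that list can also be computed directly with scikit-learn. The sketch below uses synthetic data rather than the store data. Each sample's coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in data, clustered into 3 groups
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.7, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# One coefficient per sample, each between -1 and 1
sample_scores = silhouette_samples(X_demo, labels)

# The overall score is the mean of the per-sample coefficients
avg_score = silhouette_score(X_demo, labels)

# Samples with a negative coefficient sit closer to another cluster than to their own
n_negative = int((sample_scores < 0).sum())
print(f"average score = {avg_score:.2f}, negative samples = {n_negative}")
```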

In [None]:
def viz_Silhouette(str_model):
    visualizer = SilhouetteVisualizer(str_model, colors='yellowbrick',
                                      size=(720,480))
    visualizer.fit(X_scaled)        # Fit the data to the visualizer
    # Print the average silhouette score (the dotted red line in the plot);
    # show() sets the axis labels itself, so no custom label is set here
    print("{} Clusters Silhouette Plot Score = {:.2f}".format(visualizer.n_clusters_, visualizer.silhouette_score_))
    visualizer.show()        # Finalize and render the figure

viz_Silhouette(model_2)
viz_Silhouette(model_4)
viz_Silhouette(model_7)
viz_Silhouette(model_10)

# <span style="color:blue">5. Which clustering option would you pick based on the intercluster distance map and why?</span>


## Intercluster Distance Map
Intercluster distance maps display an embedding of the cluster centers in 2 dimensions with the distance to other centers preserved. That is, the closer two centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric, which gives a sense of the relative importance of each cluster.

You're looking for clusters that do not overlap and are farther apart, meaning that they are more distinct.

The legend shows how circle size relates to the size of each cluster (by default the number of stores in it, like the bar plots). The two axes are the dimensions of the 2-D embedding rather than two of the original features.
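
The map is built from the distances between cluster centers. As a small illustration of that underlying quantity, the sketch below computes pairwise distances for three hypothetical centers (made-up numbers, not centers from the models above) and finds the closest pair.

```python
import numpy as np

# Hypothetical cluster centers in a 3-feature space (one row per cluster)
centers = np.array([
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.3, 0.4, 0.0],
])

# Euclidean distance between every pair of centers
diffs = centers[:, None, :] - centers[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=-1))

# Ignore each center's zero distance to itself, then find the closest pair
masked = dist + np.diag([np.inf] * len(centers))
i, j = np.unravel_index(np.argmin(masked), masked.shape)
print(f"closest centers: {i} and {j}, distance {masked[i, j]:.2f}")
```

Two centers that are this close relative to the others would appear as overlapping or nearly touching circles on the map.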

In [None]:
# Instantiate the clustering model and visualizer
def viz_InterCluster(str_model, int_cluster):
    visualizer = InterclusterDistance(str_model, size=(600,600),
                                     title="KMeans Intercluster map for " + str(int_cluster) + " cluster model")
    visualizer.fit(X_scaled)        # Fit the data to the visualizer
    visualizer.show()        # Finalize and render the figure

viz_InterCluster(model_2, 2)
viz_InterCluster(model_4, 4)
viz_InterCluster(model_7, 7)
viz_InterCluster(model_10, 10)

# <span style="color:blue">6. Using all of the information above, which clustering option do you choose?</span>


# <span style="color:blue">7. What insights can you find about the clustering you chose? </span>


# What features are important to the clusters

The figures below rank the features by how much each one contributes to a random forest model trained to predict the cluster labels. Importances are plotted relative to the most important feature, that is, as a percentage of its importance.

You can see which features matter most for each cluster model. Do some features seem more important than others? Does one of the clustering views use features that seem more realistic to you?
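
To make the idea of relative importance concrete, here is a minimal sketch on synthetic data (not the store data): cluster the data, train a random forest to predict the cluster labels, then express each feature's importance as a percentage of the largest one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 4 features, clustered into 3 groups
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# Train a classifier to predict the cluster label from the features
forest = RandomForestClassifier(n_estimators=10, random_state=13).fit(X_demo, labels)

# Relative importance: each feature as a percent of the most important feature
importances = forest.feature_importances_
relative = 100.0 * importances / importances.max()

# Feature indices ordered from most to least important
ranking = np.argsort(relative)[::-1]
for idx in ranking:
    print(f"feature {idx}: {relative[idx]:.0f}%")
```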

In [None]:
def viz_FeatureImportance(str_model, int_feat, int_cluster): 
    model_forest = RandomForestClassifier(n_estimators=10, random_state=13)
    viz = FeatureImportances(model_forest,
                             labels=lst_features,
                             topn=int_feat,
                             size=(720,1080),
                             title="Top " + str(int_feat) +" features for " + str(int_cluster) + " cluster model")
    viz.fit(X_scaled, str_model.fit_predict(X_scaled))
    viz.show()

int_feat = 20

viz_FeatureImportance(model_2, int_feat, 2)
viz_FeatureImportance(model_4, int_feat, 4)
viz_FeatureImportance(model_7, int_feat, 7)
viz_FeatureImportance(model_10, int_feat, 10)


The charts below show which features matter for each individual cluster, and in which direction.

- Positive side (right): the feature is a good indicator of that cluster (it positively correlates with membership).
- Negative side (left): the feature is a bad indicator of that cluster (it negatively correlates with membership).

If you have too few clusters, not enough data is available and the visualization can't be generated.
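
The positive and negative sides come from the signs of a classifier's coefficients. The sketch below shows this on synthetic data, using scikit-learn's default logistic regression settings rather than the exact settings in the cell that follows.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 4 features, scaled and clustered into 3 groups
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# Fit a classifier that predicts the cluster label from the features
clf = LogisticRegression(max_iter=1000).fit(X_demo, labels)

# coef_ has one row per cluster and one column per feature;
# the sign says whether a feature pushes toward (+) or away from (-) that cluster
for cluster_id, row in zip(clf.classes_, clf.coef_):
    signs = ["+" if c > 0 else "-" for c in row]
    print(f"cluster {cluster_id}: {' '.join(signs)}")
```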

In [None]:
def viz_FeatureRegression(str_model, int_feat, int_cluster):
    #multi_class="auto" is the default and newer scikit-learn releases deprecate the argument, so it is omitted
    model_regression = LogisticRegression(solver="liblinear")
    viz = FeatureImportances(model_regression, 
                             stack=True, 
                             labels=lst_features,
                             relative=False, 
                             topn=int_feat, 
                             size=(720,1080),
                            title="Top " + str(int_feat) +" features for " + str(int_cluster) + " cluster model")
    viz.fit(X_scaled, str_model.fit_predict(X_scaled))
    viz.show()

#try each model and report any that the visualizer cannot draw
for model, int_cluster in [(model_2, 2), (model_4, 4), (model_7, 7), (model_10, 10)]:
    try:
        viz_FeatureRegression(model, int_feat, int_cluster)
    except Exception:
        print("Not enough data to show for " + str(int_cluster) + " cluster model")

# <span style="color:blue">8. Share your cluster choice and insights with your group. Where did your team agree and where did you differ? </span>

# <span style="color:blue">9. What did you learn about clustering? </span>

# <span style="color:blue">10. How could you use clustering in your areas? </span>