# Lab 10

## Unsupervised Learning

In today's lab you will analyze an unlabeled dataset, cluster similar data points, and draw meaningful conclusions from the result. It involves more data exploration and visualization, along with careful feature extraction and engineering.

*****************************************************************
Today we will cluster customers into segments based on their spending trends, so that a wholesale distributor can understand what kinds of customers they are dealing with. Customer segmentation is an active machine learning problem in the e-commerce community, but we will look at a simplified version of it.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

data = pd.read_csv('data.csv')
data.head()

Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
0,2,3,12669,9656,7561,214,2674,1338
1,2,3,7057,9810,9568,1762,3293,1776
2,2,3,6353,8808,7684,2405,3516,7844
3,1,3,13265,1196,4221,6404,507,1788
4,2,3,22615,5410,7198,3915,1777,5185


The metadata for this dataset is as below:
Attribute Information:

1) FRESH: annual spending (m.u.) on fresh products (Continuous)

2) MILK: annual spending (m.u.) on milk products (Continuous)

3) GROCERY: annual spending (m.u.) on grocery products (Continuous)

4) FROZEN: annual spending (m.u.) on frozen products (Continuous)

5) DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)

6) DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)

7) CHANNEL: customers Channel - Horeca (Hotel/Restaurant/Cafe) or Retail channel (Nominal)

8) REGION: customers Region - Lisbon, Oporto or Other (Nominal)

### Task 1: Data Exploration

Q1. Print the attributes of the dataset along with summary statistics (mean, standard deviation, etc.) to get a better idea of how each attribute is distributed.

In [2]:
#TODO



440

Q2. Print 3 to 5 random data points from the dataset, and try to understand what kind of spender each one might be (e.g. a luxury spender or a tight-budget customer). This can inform your model selection later on.

In [None]:
#TODO



Q3. Unsupervised learning is all about understanding correlations between different attributes. Plot a scatter matrix using pandas between all attributes. If you wish, you can also plot a correlation heatmap. Analyze which attributes are more correlated and which aren't.

In [None]:
#TODO
sns.set_style('white')

plt.show()
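
As a rough illustration of what "more correlated" means numerically, here is a minimal sketch on synthetic data (the frame `toy` and its columns `a`, `b`, `c` are made up for this example and are not part of the lab dataset). It computes the Pearson correlation matrix with `DataFrame.corr()`; the same matrix can be passed to `sns.heatmap`, and `pd.plotting.scatter_matrix` draws the pairwise scatter plots.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic columns: b is almost proportional to a, c is independent noise.
toy = pd.DataFrame({
    "a": base,
    "b": 2.0 * base + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Pairwise Pearson correlations; entries near +1 or -1 indicate strong linear relationships.
toy_corr = toy.corr()
print(toy_corr.round(2))
```

In this toy frame, `a` and `b` should show a correlation close to 1, while `c` should be close to uncorrelated with both.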

Q4. Plot a facet grid which compares 2 attributes with each other based on region. For example, Grocery vs Frozen based on region. 

In [None]:
#TODO



### Task 2: Data Preprocessing

Sometimes the data is still hard to map out despite visualizing it. It may require some scaling and preprocessing before the visualizations make sense.

Q5. Apply a logarithmic transformation to the data, except for the attributes Channel and Region (they are nominal codes, so taking their logarithm would not make sense).

In [None]:
#TODO
log_data = None
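
The idea of transforming only the spending columns can be sketched on a small made-up table (the frame `toy` and its values are invented for illustration; only the column names mirror the lab data). This assumes all spending values are strictly positive; if zeros were present, `np.log1p` would be one alternative.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the spending table.
toy = pd.DataFrame({
    "Channel": [1, 2, 1, 2],
    "Region": [3, 3, 1, 2],
    "Fresh": [12000.0, 700.0, 45000.0, 3.0],
    "Milk": [9600.0, 1200.0, 300.0, 55.0],
})

# Every column except the two nominal ones gets transformed.
spend_cols = [c for c in toy.columns if c not in ("Channel", "Region")]

toy_log = toy.copy()
toy_log[spend_cols] = np.log(toy[spend_cols])  # natural log, element-wise

# np.exp undoes the transformation, which matters later when reading cluster centers.
restored = np.exp(toy_log[spend_cols])
```

Working on a copy keeps the original values available for comparison with the transformed ones.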


Q6. Plot the scatter matrix again on the transformed data. Do you see any differences or newly emerging correlations?

In [None]:
#TODO

plt.show()

Q7. Feature scaling is particularly important in unsupervised learning because it strongly affects clustering algorithms like K-means, and any other algorithm that relies on a 'distance' calculation. If you wish, apply a scaling other than the logarithmic one to the data below (you can also try removing the logarithmic scaling altogether):

In [None]:
#TODO
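
One possible alternative scaling is standardization. The sketch below uses scikit-learn's `StandardScaler` on a tiny made-up array (`toy` is an assumption for illustration, not the lab data) to show that columns on very different scales end up with zero mean and unit variance, and that the scaling can be inverted later.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
toy = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 5000.0]])

scaler = StandardScaler().fit(toy)
toy_scaled = scaler.transform(toy)            # each column: mean 0, standard deviation 1
toy_back = scaler.inverse_transform(toy_scaled)  # recovers the original values
```

Without scaling, Euclidean distances in `toy` would be dominated almost entirely by the second column.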



Q8. Remove outliers from the transformed data using Tukey's method. However, since we do not want to remove too many points, first view the points that are outliers with respect to each attribute, and then remove only those that are outliers for several attributes.

In [5]:
#TODO

outliers = []  #These are the indices of the points that you want to remove
cleaned_data = None  #Remove the above manually selected outlier points
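
Tukey's method flags a value as an outlier when it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. A minimal sketch on synthetic data follows; the helper `tukey_outlier_index`, the frame `toy`, and the threshold `min_features` are all hypothetical names chosen for this example.

```python
from collections import Counter

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
toy.loc[0] = [15.0, 15.0, 15.0]   # planted: extreme in every column
toy.loc[1] = [-15.0, 0.0, 0.0]    # planted: extreme in one column only

def tukey_outlier_index(series, k=1.5):
    """Return the index labels lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(series, [25, 75])
    step = k * (q3 - q1)
    mask = (series < q1 - step) | (series > q3 + step)
    return series[mask].index

# Count in how many columns each row is flagged.
counts = Counter()
for col in toy.columns:
    counts.update(tukey_outlier_index(toy[col]))

min_features = 2  # hypothetical threshold: drop rows flagged in at least 2 columns
repeat_outliers = sorted(idx for idx, n in counts.items() if n >= min_features)
toy_clean = toy.drop(index=repeat_outliers).reset_index(drop=True)
```

Resetting the index after dropping rows avoids misaligned rows if the cleaned frame is later concatenated with other frames by position.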


### Task 3: Feature Transformation

Once all the above is done, we need to generate features that capture the joint variation of the individual attributes. We do this by applying PCA (Principal Component Analysis). This transformation creates new dimensions along the directions of maximum variance in the data.

Q9. We will now work only on the continuous data, i.e. all fields except Channel and Region. Fit PCA on this data, then transform it using pca.transform.

In [None]:
from sklearn.decomposition import PCA
#TODO
continuous_data = None #Create a dataframe with only the continuous attributes.

pca = None    #Apply PCA by fitting the data with an arbitrary number of dimensions(your choice)

reduced_continuous_data = None   #Transform the continuous data

df_reduced = None   #Create a dataframe from the reduced data, rename the dimensions D1,D2.....
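
The fit/transform pattern can be sketched on synthetic data as follows (the frame `toy` and the names `toy_pca` and `toy_reduced` are assumptions for illustration). The three columns are built from two latent factors, so two components should capture essentially all of the variance.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))

# Three correlated columns driven by two underlying factors.
toy = pd.DataFrame({
    "x1": latent[:, 0],
    "x2": 0.8 * latent[:, 0] + 0.2 * latent[:, 1],
    "x3": latent[:, 1],
})

toy_pca = PCA(n_components=2).fit(toy)      # learn the directions of maximum variance
toy_reduced = pd.DataFrame(
    toy_pca.transform(toy),                 # project the data onto those directions
    columns=["D1", "D2"],
    index=toy.index,
)

# Fraction of total variance captured by each component, in decreasing order.
ratios = toy_pca.explained_variance_ratio_
```

Inspecting `explained_variance_ratio_` is one way to judge how many dimensions are worth keeping.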

Run the code cell below.

In [None]:
def pca_results(good_data, pca):

    # Dimension indexing
    dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]

    # PCA components
    components = pd.DataFrame(np.round(pca.components_, 4), columns = good_data.keys())
    components.index = dimensions

    # PCA explained variance
    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
    variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
    variance_ratios.index = dimensions

    # Create a bar plot visualization
    fig, ax = plt.subplots(figsize = (14,8))

    # Plot the feature weights as a function of the components
    components.plot(ax = ax, kind = 'bar');
    ax.set_ylabel("Feature Weights")
    ax.set_xticklabels(dimensions, rotation=0)


    # Display the explained variance ratios
    for i, ev in enumerate(pca.explained_variance_ratio_):
        ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Explained Variance\n          %.4f"%(ev))

    # Return a concatenated DataFrame
    return pd.concat([variance_ratios, components], axis = 1)

pca_results(continuous_data,pca)

You can observe the relative weights of each original feature in each new dimension/feature.

Now, we can begin the process of clustering and learning. 

Q10. Create a scatter plot of PCA dimensions 1 and 2, which capture the most variance in the dataset.

In [None]:
#TODO



Run this cell below:

In [None]:
def biplot(good_data, reduced_data, pca):

    fig, ax = plt.subplots(figsize = (14,8))
    # scatterplot of the reduced data    
    ax.scatter(x=reduced_data.loc[:, 'D1'], y=reduced_data.loc[:, 'D2'], 
        facecolors='b', edgecolors='b', s=70, alpha=0.5)
    
    feature_vectors = pca.components_.T

    # we use scaling factors to make the arrows easier to see
    arrow_size, text_pos = 7.0, 8.0

    # projections of the original features
    for i, v in enumerate(feature_vectors):
        ax.arrow(0, 0, arrow_size*v[0], arrow_size*v[1], 
                  head_width=0.2, head_length=0.2, linewidth=2, color='red')
        ax.text(v[0]*text_pos, v[1]*text_pos, good_data.columns[i], color='black', 
                 ha='center', va='center', fontsize=18)

    ax.set_xlabel("D1", fontsize=14)
    ax.set_ylabel("D2", fontsize=14)
    ax.set_title("PC plane with original feature projections.", fontsize=16);
    return ax

biplot(continuous_data,df_reduced,pca)

The above plot shows the projection of the original features onto the new components. It makes it easier to anticipate how the clustering will turn out. For example, you can see from the plot that someone who buys less milk and more delicatessen may fall into one category.

Let's explore another dimensionality reduction technique called t-SNE (t-distributed stochastic neighbor embedding). It is a probabilistic reduction technique that is also used for visualizing high-dimensional data.

Q11. Now, rerun the code cell for Q9, and choose a number of PCA dimensions greater than 2. Apply t-SNE on the reduced continuous data (check out: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html), and plot it.

In [None]:
from sklearn.manifold import TSNE
#TODO

tsne = None   #Instantiate t-SNE, set the number of components (n_components) to 2

X_tsne = None #Fit and transform the continuous reduced features 

#Create a scatter plot on X_tsne
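
A minimal sketch of the t-SNE call pattern on synthetic data (the array `toy` and the chosen `perplexity` are assumptions for illustration; the perplexity must be smaller than the number of samples). Note that scikit-learn's `TSNE` offers `fit_transform` but no separate `transform`, so the embedding is computed for exactly the data it is given.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs in 5 dimensions.
blob_a = rng.normal(loc=0.0, size=(30, 5))
blob_b = rng.normal(loc=8.0, size=(30, 5))
toy = np.vstack([blob_a, blob_b])

toy_tsne = TSNE(n_components=2, perplexity=10, random_state=0)
toy_embedded = toy_tsne.fit_transform(toy)   # shape: (n_samples, 2)
```

The two columns of `toy_embedded` can then be passed to `plt.scatter` as x and y coordinates.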



You can observe how higher dimensional data has been mapped down to 2 dimensions. 

Q12. (Optional) You can choose to apply both PCA and t-SNE, or just one of them, before proceeding. You can analyze the difference in performance later on.

In [None]:
#TODO



### Task 4: Clustering

We will now proceed to cluster the data. 

Q13. Pick any clustering algorithm (for starters, KMeans or Gaussian Mixture Model from sklearn). Check this link for more options, but read carefully whether each one is applicable to this kind of problem: http://scikit-learn.org/stable/modules/clustering.html.

In [None]:
#TODO

clusterer = None  #Apply the clustering algorithm on the reduced training data

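Q14. Generate cluster predictions for the reduced data (this lab has no separate test set, so predict on the same data the clusterer was fit on). Print the centers of the clusters.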

In [None]:
#TODO

preds = None

centers = None



Q15. The metric we will use for evaluating clustering performance is the silhouette score: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html. Print the silhouette score for your predictions on the reduced data.

In [None]:
#TODO

score = None
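
The fit, predict, and score steps from Q13 to Q15 can be sketched together on synthetic data (the array `toy` and the `toy_`-prefixed names are assumptions for illustration, and KMeans is just one possible choice). The silhouette score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two compact synthetic blobs in 2 dimensions.
toy = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(40, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(40, 2)),
])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)
toy_labels = toy_km.predict(toy)            # cluster index for each point
toy_centers = toy_km.cluster_centers_       # one row per cluster
toy_score = silhouette_score(toy, toy_labels)
```

With a `GaussianMixture` model, the analogous attribute holding the cluster centers is `means_`.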

Run this cell below:

In [None]:
import matplotlib.cm as cm

def cluster_results(reduced_data, preds, centers):

    predictions = pd.DataFrame(preds, columns = ['Cluster'])
    plot_data = pd.concat([predictions, reduced_data], axis = 1)

    # Generate the cluster plot
    fig, ax = plt.subplots(figsize = (14,8))

    # Color map
    # plt.get_cmap is used because cm.get_cmap was removed in Matplotlib 3.9
    cmap = plt.get_cmap('gist_rainbow')

    # Color the points based on assigned cluster
    for i, cluster in plot_data.groupby('Cluster'):   
        cluster.plot(ax = ax, kind = 'scatter', x = 'D1', y = 'D2', \
                    color = cmap((i)*1.0/(len(centers)-1)), label = 'Cluster %i'%(i), s=30);

    # Plot centers with indicators
    for i, c in enumerate(centers):
        ax.scatter(x = c[0], y = c[1], color = 'white', edgecolors = 'black', \
                   alpha = 1, linewidth = 2, marker = 'o', s=200);
        ax.scatter(x = c[0], y = c[1], marker='$%d$'%(i), alpha = 1, s=100);

    # Set plot title
    ax.set_title("Cluster Learning on PCA-Reduced Data - Centroids Marked by Number");
    
cluster_results(df_reduced,preds,centers)

Above you can see how the clusters are split. Try changing the number of clusters to explore.

### Task 5: Data Inverse-Transformation

Q16. Apply the inverse transformations to the cluster centers to obtain their true values. First apply pca.inverse_transform, then invert the log scaling or any other scaling you applied. Print the true values of the centers.

In [None]:
#TODO

log_centers = None

true_centers = None
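
The order of the inverse steps mirrors the forward pipeline in reverse. Below is a minimal sketch on synthetic data, assuming the forward pipeline was a natural log followed by PCA (`toy_spend`, `fake_center`, and the other names are made up for illustration). All three components are kept here so that the round trip is exact; with fewer components, `inverse_transform` returns only an approximation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy_spend = rng.uniform(100, 10000, size=(60, 3))   # made-up spending values

# Forward pipeline: log, then PCA.
toy_log = np.log(toy_spend)
toy_pca = PCA(n_components=3).fit(toy_log)
toy_reduced = toy_pca.transform(toy_log)

# Stand-in for a cluster center: the mean of the first 10 reduced points.
fake_center = toy_reduced[:10].mean(axis=0, keepdims=True)

# Inverse pipeline: undo PCA first, then undo the log.
center_log = toy_pca.inverse_transform(fake_center)
center_true = np.exp(center_log)
```

Because the averaging happens in log space, `center_true` corresponds to a geometric rather than arithmetic mean of the original spending values.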



Q17. You have now successfully clustered the data into customer segments. To improve your clustering performance, try tuning the algorithm's hyperparameters, in particular the number of clusters. Print your silhouette score after this.

In [None]:
#TODO
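
One simple way to search for the number of clusters is to fit the algorithm for several candidate values and keep the one with the highest silhouette score. The sketch below does this with KMeans on three synthetic blobs (the array `toy`, the candidate range, and the names `scores` and `best_k` are assumptions for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Three compact, well-separated synthetic blobs.
toy = np.vstack([
    rng.normal(loc=[0, 0], scale=0.4, size=(40, 2)),
    rng.normal(loc=[6, 0], scale=0.4, size=(40, 2)),
    rng.normal(loc=[3, 6], scale=0.4, size=(40, 2)),
])

# Silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy)
    scores[k] = silhouette_score(toy, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette score needs at least two clusters, which is why the candidate range starts at 2.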