# Introduction

In this project, we analyze a dataset of customers' annual spending (reported in monetary units) across diverse product categories to uncover its internal structure. One goal is to best describe the variation among the types of customers that a wholesale distributor interacts with. Doing so would give the distributor insight into how to structure its delivery service to meet the needs of each customer.

Initially, I drop the features **'Channel'** and **'Region'** from the analysis and focus instead on the six product categories recorded for customers.

# Importing Libraries

In [None]:
import os
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display # Allows the use of display() for DataFrames

# Suppress matplotlib user warnings
warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib")

# Display inline matplotlib plots with IPython
%matplotlib inline

# List the files available in the input directory
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
# Imports used by the helper functions below
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# GMM was removed in scikit-learn 0.20; GaussianMixture is its replacement
from sklearn.mixture import GaussianMixture as GMM
def pca_results(good_data, pca):
	'''
	Create a DataFrame of the PCA results
	Includes dimension feature weights and explained variance
	Visualizes the PCA results
	'''

	# Dimension indexing
	dimensions = ['Dimension {}'.format(i) for i in range(1, len(pca.components_)+1)]

	# PCA components
	components = pd.DataFrame(np.round(pca.components_, 4), columns = list(good_data.keys()))
	components.index = dimensions

	# PCA explained variance
	ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
	variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
	variance_ratios.index = dimensions

	# Create a bar plot visualization
	fig, ax = plt.subplots(figsize = (14,8))

	# Plot the feature weights as a function of the components
	components.plot(ax = ax, kind = 'bar');
	ax.set_ylabel("Feature Weights")
	ax.set_xticklabels(dimensions, rotation=0)


	# Display the explained variance ratios
	for i, ev in enumerate(pca.explained_variance_ratio_):
		ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Explained Variance\n          %.4f"%(ev))

	# Return a concatenated DataFrame
	return pd.concat([variance_ratios, components], axis = 1)

def cluster_results(reduced_data, preds, centers, pca_samples):
	'''
	Visualizes the PCA-reduced cluster data in two dimensions
	Adds cues for cluster centers and student-selected sample data
	'''

	predictions = pd.DataFrame(preds, columns = ['Cluster'])
	plot_data = pd.concat([predictions, reduced_data], axis = 1)

	# Generate the cluster plot
	fig, ax = plt.subplots(figsize = (14,8))

	# Color map
	cmap = plt.get_cmap('gist_rainbow')

	# Color the points based on assigned cluster
	for i, cluster in plot_data.groupby('Cluster'):   
	    cluster.plot(ax = ax, kind = 'scatter', x = 'Dimension 1', y = 'Dimension 2', \
	                 color = cmap((i)*1.0/(len(centers)-1)), label = 'Cluster %i'%(i), s=30);

	# Plot centers with indicators
	for i, c in enumerate(centers):
	    ax.scatter(x = c[0], y = c[1], color = 'white', edgecolors = 'black', \
	               alpha = 1, linewidth = 2, marker = 'o', s=200);
	    ax.scatter(x = c[0], y = c[1], marker='$%d$'%(i), alpha = 1, s=100);

	# Plot transformed sample points 
	ax.scatter(x = pca_samples[:,0], y = pca_samples[:,1], \
	           s = 150, linewidth = 4, color = 'black', marker = 'x');

	# Set plot title
	ax.set_title("Cluster Learning on PCA-Reduced Data - Centroids Marked by Number\nTransformed Sample Data Marked by Black Cross");


def biplot(good_data, reduced_data, pca):
    '''
    Produce a biplot that shows a scatterplot of the reduced
    data and the projections of the original features.
    
    good_data: original data, before transformation.
               Needs to be a pandas dataframe with valid column names
    reduced_data: the reduced data (the first two dimensions are plotted)
    pca: pca object that contains the components_ attribute

    return: a matplotlib AxesSubplot object (for any additional customization)
    
    This procedure is inspired by the script:
    https://github.com/teddyroland/python-biplot
    '''

    fig, ax = plt.subplots(figsize = (14,8))
    # scatterplot of the reduced data    
    ax.scatter(x=reduced_data.loc[:, 'Dimension 1'], y=reduced_data.loc[:, 'Dimension 2'], 
        facecolors='b', edgecolors='b', s=70, alpha=0.5)
    
    feature_vectors = pca.components_.T

    # we use scaling factors to make the arrows easier to see
    arrow_size, text_pos = 7.0, 8.0

    # projections of the original features
    for i, v in enumerate(feature_vectors):
        ax.arrow(0, 0, arrow_size*v[0], arrow_size*v[1], 
                  head_width=0.2, head_length=0.2, linewidth=2, color='red')
        ax.text(v[0]*text_pos, v[1]*text_pos, good_data.columns[i], color='black', 
                 ha='center', va='center', fontsize=18)

    ax.set_xlabel("Dimension 1", fontsize=14)
    ax.set_ylabel("Dimension 2", fontsize=14)
    ax.set_title("PC plane with original feature projections.", fontsize=16);
    return ax
    

def channel_results(reduced_data, outliers, pca_samples):
	'''
	Visualizes the PCA-reduced cluster data in two dimensions using the full dataset
	Data is labeled by "Channel" and cues added for student-selected sample data
	'''

	# Check that the dataset is loadable
	try:
	    full_data = pd.read_csv("../input/customers.csv")
	except (OSError, pd.errors.ParserError):
	    print("Dataset could not be loaded. Is the file missing?")
	    return False

	# Create the Channel DataFrame
	channel = pd.DataFrame(full_data['Channel'], columns = ['Channel'])
	channel = channel.drop(channel.index[outliers]).reset_index(drop = True)
	labeled = pd.concat([reduced_data, channel], axis = 1)
	
	# Generate the cluster plot
	fig, ax = plt.subplots(figsize = (14,8))

	# Color map
	cmap = plt.get_cmap('gist_rainbow')

	# Color the points based on assigned Channel
	labels = ['Hotel/Restaurant/Cafe', 'Retailer']
	grouped = labeled.groupby('Channel')
	for i, channel in grouped:   
	    channel.plot(ax = ax, kind = 'scatter', x = 'Dimension 1', y = 'Dimension 2', \
	                 color = cmap((i-1)*1.0/2), label = labels[i-1], s=30);
	    
	# Plot transformed sample points   
	for i, sample in enumerate(pca_samples):
		ax.scatter(x = sample[0], y = sample[1], \
	           s = 200, linewidth = 3, color = 'black', marker = 'o', facecolors = 'none');
		ax.scatter(x = sample[0]+0.25, y = sample[1]+0.3, marker='$%d$'%(i), alpha = 1, s=125);

	# Set plot title
	ax.set_title("PCA-Reduced Data Labeled by 'Channel'\nTransformed Sample Data Circled");

In [None]:
data = pd.read_csv("../input/customers.csv")

In [None]:
data.drop(['Region', 'Channel'], axis = 1, inplace = True)

In [None]:
data.head()

# Data Preprocessing

**1. Checking for Null Entries**

In [None]:
data.isnull().sum()

As the result above shows, the dataset has no null or missing values.

**2. Next step is to check the data types**

In [None]:
print(data.dtypes)

All features are numeric: the data has no categorical variables, which matches the data types shown in the dataset.

# EDA

In this section, we will begin exploring the data through visualizations and code to understand how each feature is related to the others. We will observe a statistical description of the dataset, consider the relevance of each feature, and select a few sample data points from the dataset which we will track throughout this project.

**Note that the dataset is composed of six important product categories: 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', and 'Delicatessen'. Consider what each category represents in terms of products you could purchase.**

In [None]:
# Display a description of the dataset
display(data.describe())

# Selecting Samples

To get a better understanding of the customers and how their data will transform through the analysis, it would be best to select a few sample data points and explore them in more detail. In the code block below, add three indices of your choice to the indices list which will represent the customers to track. It is suggested to try different sets of samples until you obtain customers that vary significantly from one another.

In [None]:
# Selecting three indices of your choice you wish to sample from the dataset
indices = [61,149,379]

# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset:")
display(samples)

# Kind of Establishment

1. Index 61: the first establishment seems to be a **supermarket**, because its spending is well above the mean for all product categories, which suggests it offers all kinds of products.<br>

2. Index 149: the second establishment seems to be a **supplier or wholesale market** for fresh food (i.e., vegetables, fruits, etc.). Its spending on Fresh is high, whereas spending on all other categories is well below average.<br>

3. Index 379: the third establishment seems to be a **cafe**, with high spending on Milk and above-average spending on Grocery (snacks, coffee, tea, etc.), while spending on the other categories is well below their average values.

# Feature relevance

One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

In [None]:
# Drop the Detergents_Paper feature to see how well the other features predict it.
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
feature_dropped = 'Detergents_Paper'

new_data = data.drop(feature_dropped, axis=1)
labels = data[feature_dropped]

# Split the data into training and testing sets (25% test) using the dropped feature as the target
X_train, X_test, y_train, y_test = train_test_split(new_data, labels, test_size=0.25, random_state=0)

# Create a decision tree regressor and fit it to the training set
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)

# Report the R^2 score of the prediction on the testing set
score = regressor.score(X_test, y_test)
display(np.round(score, 4))

The R^2 score obtained is 0.7287. A positive score this close to 1 means the remaining five features predict "Detergents_Paper" reasonably well, so most of the information it carries is already present in the other categories. The feature is therefore not necessary for identifying customer spending habits, and it could be removed in further model building.
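The same check can be repeated with each feature as the target in turn. The sketch below is only an illustration under stated assumptions: it uses synthetic data rather than the customer file, it swaps the decision tree for an ordinary least-squares fit so that it needs only NumPy, and it reports the in-sample R^2 rather than a held-out score. The helper name `r2_when_dropped` and the column names `a`, `b`, `c` are hypothetical.

```python
import numpy as np

def r2_when_dropped(X, target_col):
    """R^2 of predicting column `target_col` from the remaining columns
    with an ordinary least-squares fit (a stand-in for the decision tree)."""
    y = X[:, target_col]
    others = np.delete(X, target_col, axis=1)
    # Add an intercept column so the fit is not forced through the origin
    A = np.column_stack([others, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (NOT the customer dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = 2.0 * a + 0.1 * rng.normal(size=500)   # c is almost a scaled copy of a
X = np.column_stack([a, b, c])

scores = {name: r2_when_dropped(X, i) for i, name in enumerate(["a", "b", "c"])}
for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```

A feature with a score near 1 is largely redundant given the others, while a score near 0 (or negative, for a held-out score) suggests the feature carries information the others do not.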

# Visualizing Feature Distribution

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data. If you found that the feature you attempted to predict above is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others. 

Conversely, if you believe that feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data.

In [None]:
# Produce a scatter matrix for each pair of features in the data
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

In [None]:
sns.barplot(data=data, palette="Set1")

In [None]:
sns.violinplot(data=data, palette="Set1")

The scatter matrix above shows that the data is not normally distributed: every feature is highly right-skewed, with most points concentrated near the origin.

In [None]:
sns.heatmap(data.corr(), annot=True)

The correlation matrix above shows a high correlation between Detergents_Paper and Grocery (0.92), and Detergents_Paper is also fairly strongly correlated with Milk (0.66). One more pair is highly correlated: Grocery and Milk (0.73).

This confirms my suspicion that Detergents_Paper is not an important feature, as it is highly correlated with Milk and Grocery.

# Feature Scaling

We will create a better representation of customers by performing a scaling on the data and detecting (and optionally removing) outliers.

If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is a **Box-Cox transformation**, which estimates the power transformation of the data that best reduces skewness. A simpler approach which can work in most cases is to apply the **natural logarithm**.
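To make the effect of the logarithm concrete, the following sketch compares the skewness of a right-skewed sample before and after the transform. It is a minimal illustration on synthetic log-normal values, not on the customer data, and the helper `skewness` is a hypothetical name introduced here.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Synthetic right-skewed "spending" values (NOT the customer data)
rng = np.random.default_rng(42)
spending = rng.lognormal(mean=8.0, sigma=1.0, size=2000)

skew_raw = skewness(spending)
skew_log = skewness(np.log(spending))

print(f"mean vs median before log: {spending.mean():.0f} vs {np.median(spending):.0f}")
print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

A skewness near zero after the transform indicates a roughly symmetric distribution, which is what the log scaling is meant to achieve here.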

In [None]:
# Scaling the data using the natural logarithm
log_data = np.log(data)

# Scaling the sample data using the natural logarithm
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

# Observation

After applying a natural logarithm scaling to the data, the distribution of each feature should appear much more normal. For any pairs of features you may have identified earlier as being correlated, observe here whether that correlation is still present (and whether it is now stronger or weaker than before).

In [None]:
# Display the log-transformed sample data
display(log_samples)

# Outlier Detection

Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take into consideration these data points. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, we will use Tukey's Method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal.

We will need to implement the following:

1. Assign the value of the 25th percentile for the given feature to Q1. Use np.percentile for this.
2. Assign the value of the 75th percentile for the given feature to Q3. Again, use np.percentile.
3. Assign the calculation of an outlier step for the given feature to step.
4. Optionally remove data points from the dataset by adding indices to the outliers list.

In [None]:
# List of all outliers
outliers = []

# For each feature find the data points with extreme high or low values
for feature in log_data.keys():
    
    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data[feature], 25.)
    
    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data[feature], 75.)
    
    # Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = (Q3-Q1)*1.5
    print ("Outlier step:", step)
    
    # Display the outliers
    print ("Data points considered outliers for the feature '{}':".format(feature))
    feature_outliers = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))]
    display(feature_outliers)
    
    outliers += feature_outliers.index.tolist()
    
# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
print("Number of outliers (inc duplicates): ", len(outliers))
print("New dataset with removed outliers has {} samples with {} features each.".format(*good_data.shape))

There were 5 data points (65, 66, 75, 128, 154) that were considered outliers for more than one feature. Removing all 42 flagged points would lose a lot of information (around 10% of the total data), so only the points that are outliers for more than one feature should be removed.
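One way to find the points flagged for more than one feature is to count how often each index appears across the per-feature outlier lists. The sketch below assumes the indices have been collected per feature; the dictionary `outliers_by_feature`, its feature names and its indices are hypothetical placeholders, not values from this dataset.

```python
from collections import Counter

# Hypothetical per-feature outlier indices, for illustration only
outliers_by_feature = {
    "feature_a": [3, 7, 12],
    "feature_b": [7, 20],
    "feature_c": [3, 7, 31],
}

# Count how many features flag each index
counts = Counter(i for idxs in outliers_by_feature.values() for i in idxs)

# Keep only the indices flagged by more than one feature
repeated = sorted(i for i, n in counts.items() if n > 1)

print("flagged (including duplicates):", sum(counts.values()))
print("flagged for more than one feature:", repeated)
```

The resulting list could then be passed to `log_data.drop(...)` in place of the full list of flagged indices.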

# Applying PCA

We will use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.

Now that the data has been scaled to a more normal distribution and has had any necessary outliers removed, we can now apply PCA to the good_data to discover which dimensions about the data best maximize the variance of features involved. In addition to finding these dimensions, PCA will also report the explained variance ratio of each dimension — how much variance within the data is explained by that dimension alone. Note that a component (dimension) from PCA can be considered a new "feature" of the space, however it is a composition of the original features present in the data.

In [None]:
# Apply PCA by fitting the good data with the same number of dimensions as features
from sklearn.decomposition import PCA
pca = PCA(n_components=6).fit(good_data)

# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = pca_results(good_data, pca)


First 2 Principal components:

1. **1st PC:** 44.30%
2. **2nd PC:** 26.38% <br>

   **Total:** 70.68%

First 4 components:

1. **1st PC:** 44.30%
2. **2nd PC:** 26.38%
3. **3rd PC:** 12.31%
4. **4th PC:** 10.12%<br>

**Total: 93.11%**

Each component represents a different aspect of customer spending:

1. **First dimension** represents a wide variety of the feature set. It most prominently represents Frozen and also carries information about Fresh. However, it represents Milk, Grocery, Detergents_Paper and Delicatessen poorly and needs another component to help. <br>
2. **Second dimension** represents all six categories (Fresh, Frozen, Milk, Grocery, Detergents_Paper and Delicatessen) poorly and needs another component to help. <br>

3. **Third dimension** most prominently represents Delicatessen and also carries some information about Frozen and Milk. However, it represents Fresh and Detergents_Paper poorly; Grocery starts to improve, but another component is still needed to help. <br>

4. **Fourth dimension** most prominently represents Frozen and also carries some information about Detergents_Paper, Grocery and Milk. However, it represents Delicatessen and Fresh poorly, so another component is still needed to help. <br>



In [None]:
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))

# Dimensionality Reduction

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data, in effect reducing the complexity of the problem. Dimensionality reduction comes at a cost: fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. Additionally, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.
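As a small worked example, the cumulative explained variance can be computed directly from the per-component ratios. The sketch below uses the four ratios reported earlier for the first four components; the helper `components_needed` and the thresholds 0.70 and 0.90 are assumptions chosen for illustration.

```python
from itertools import accumulate

# Explained variance ratios of the first four components, as reported earlier
ratios = [0.4430, 0.2638, 0.1231, 0.1012]

# Running total of explained variance
cumulative = [round(c, 4) for c in accumulate(ratios)]
print(cumulative)

def components_needed(cumulative, threshold):
    """Smallest number of components whose cumulative ratio reaches
    `threshold`, or None if the listed components never reach it."""
    for n, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return n
    return None

print(components_needed(cumulative, 0.70))   # two components reach 70%
print(components_needed(cumulative, 0.90))   # four components reach 90%
```

With these ratios, two dimensions already explain about 70% of the variance, which is why the two-dimensional reduction is a reasonable basis for visualization.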


In [None]:
# Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2).fit(good_data)

# Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)

# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])
# Display sample log-data after applying PCA transformation in two dimensions
display(pd.DataFrame(np.round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))

# Visualizing a Biplot

A biplot is a scatterplot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case Dimension 1 and Dimension 2). In addition, the biplot shows the projection of the original features along the components. A biplot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.

In [None]:
# Create a biplot
biplot(good_data, reduced_data, pca)

From the above visualization we can see that 'Detergents_Paper', 'Grocery' and 'Milk' are strongly associated with Dimension 1, pointing in its negative direction. 'Fresh', 'Frozen' and 'Delicatessen' are strongly associated with Dimension 2, also in the negative direction.

# Clustering

In this section, we will use a K-Means clustering algorithm and a Gaussian Mixture Model, and compare the two to identify the various customer segments hidden in the data. We will then recover specific data points from the clusters to understand their significance by transforming them back into their original dimension and scale.

**Advantages of K-Means clustering algorithm** <br>

The following are the advantages of using K Means Clustering

1. Easy to implement
2. With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).
3. K-Means may produce tighter clusters than hierarchical clustering.
4. Easy to interpret the clustering results.

**Advantages of Gaussian Mixture Model (GMM) clustering algorithm** <br>

1. If the data is assumed to be generated by hidden (unobserved) parameters, GMM is a natural choice. GMM assigns each point a probability of belonging to each cluster, instead of a hard flag saying that the point belongs to one cluster as in classical k-Means.

2. GMM can model elliptical clusters of different sizes and orientations, controlled by the covariance of each component. k-Means can be viewed as a limiting case of GMM in which all components share the same spherical covariance and each point belongs to one cluster with probability 1 and to all others with probability 0, which is why k-Means tends to produce spherical clusters.

3. GMM allows for mixed membership of points to clusters. In k-Means, a point belongs to one and only one cluster, whereas in GMM a point belongs to each cluster to a different degree. The degree is based on the probability of the point being generated from each cluster’s (multivariate) normal distribution, with cluster center as the distribution’s mean and cluster covariance as its covariance. Depending on the task, mixed membership may be more appropriate (e.g. news articles can belong to multiple topic clusters) or not (e.g. organisms can belong to only one species).
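The difference between soft and hard membership can be seen in a tiny one-dimensional example. The sketch below computes, for a mixture whose parameters are already known, the posterior probability that a point came from each component. The mixture parameters and the helper names `gaussian_pdf` and `responsibilities` are hypothetical and chosen only for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def responsibilities(x, weights, means, stds):
    """Posterior probability that x was generated by each component."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-component mixture (illustrative parameters only)
weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

for x in (0.0, 2.0, 4.0):
    soft = responsibilities(x, weights, means, stds)
    hard = soft.index(max(soft))   # k-Means-style hard assignment
    print(x, [round(p, 3) for p in soft], "hard label:", hard)
```

A point halfway between the two means gets equal membership in both components, information that a hard label discards.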

# Creating Clusters

Depending on the problem, the number of clusters that you expect to be in the data may already be known. When the number of clusters is not known a priori, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data — if any. However, we can quantify the "goodness" of a clustering by calculating each data point's silhouette coefficient. The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides for a simple scoring method of a given clustering.
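For reference, the silhouette coefficient of a point is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the lowest mean distance to the points of any other cluster. The sketch below implements this definition with NumPy on a toy dataset. It is an illustration of the formula rather than a replacement for `sklearn.metrics.silhouette_score`, the helper name `silhouette_values` is hypothetical, and it assumes at least two clusters.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette coefficient s(i) = (b - a) / max(a, b) for every point.
    Assumes at least two distinct cluster labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():                   # singleton cluster: s(i) = 0
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Two well-separated toy clusters (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

mean_score = silhouette_values(X, labels).mean()
print(f"mean silhouette coefficient: {mean_score:.3f}")
```

Well-separated, compact clusters give a mean score close to 1, while a labeling that mixes the two groups gives a much lower score.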

We will need to implement the following:

1. Fit a clustering algorithm to the reduced_data and assign it to clusterer.<br>
2. Predict the cluster for each data point in reduced_data using clusterer.predict and assign them to preds.<br>
3. Find the cluster centers using the algorithm's respective attribute and assign them to centers.<br>
4. Predict the cluster for each sample data point in pca_samples and assign them sample_preds.<br>
5. Import sklearn.metrics.silhouette_score and calculate the silhouette score of reduced_data against preds.<br>
6. Assign the silhouette score to score and print the result.<br>

In [None]:
# Applying K means
def sil_coeff(no_clusters):
    # Apply your clustering algorithm of choice to the reduced data 
    clusterer_1 = KMeans(n_clusters=no_clusters, random_state=0 )
    clusterer_1.fit(reduced_data)
    
    # Predict the cluster for each data point
    preds_1 = clusterer_1.predict(reduced_data)
    
    # Find the cluster centers
    centers_1 = clusterer_1.cluster_centers_
    
    # Predict the cluster for each transformed sample data point
    sample_preds_1 = clusterer_1.predict(pca_samples)
    
    # Calculate the mean silhouette coefficient for the number of clusters chosen
    score = silhouette_score(reduced_data, preds_1)
    
    print("silhouette coefficient for `{}` clusters => {:.4f}".format(no_clusters, score))
    
clusters_range = range(2,11)
for i in clusters_range:
    sil_coeff(i)

Across the cluster counts tried (2 to 10), the best score was obtained for 2 clusters, with a score of **0.4472**.

In [None]:
# Applying GMM
# GMM was removed in scikit-learn 0.20; GaussianMixture is its replacement
from sklearn.mixture import GaussianMixture as GMM
from sklearn.metrics import silhouette_score

def produceGMM(k):
    global clusterer, preds, centers, sample_preds
    
    # Apply the clustering algorithm to the reduced data
    clusterer = GMM(n_components=k, random_state=0)
    clusterer.fit(reduced_data)

    # Predict the cluster for each data point
    preds = clusterer.predict(reduced_data)

    # Find the cluster centers
    centers = clusterer.means_

    # Predict the cluster for each transformed sample data point
    sample_preds = clusterer.predict(pca_samples)

    # Calculate the mean silhouette coefficient for the number of clusters chosen
    score = silhouette_score(reduced_data, preds)
    return score

# Collect one score per cluster count (DataFrame.append was removed in pandas 2.0)
scores = {k: produceGMM(k) for k in range(2, 11)}
results = pd.DataFrame({'Silhouette Score': scores})
results.columns.name = 'Number of Clusters'

display(results)

Across the cluster counts tried (2 to 10), the best score was obtained for 2 clusters, with a score of **0.4436**.

For 2 clusters, the K-Means silhouette score is 0.4472 while the GMM score is slightly lower at 0.4436, so we will proceed with K-Means, as it has the higher score.

# Cluster Visualization 


In [None]:
# 1. K Means Visualization 
# Display the results of the clustering from implementation for 2 clusters
clusterer = KMeans(n_clusters = 2)
clusterer.fit(reduced_data)
preds = clusterer.predict(reduced_data)
centers = clusterer.cluster_centers_
sample_preds = clusterer.predict(pca_samples)

cluster_results(reduced_data, preds, centers, pca_samples)

In [None]:
# 2. GMM Visualization 
# Display the results of the clustering from implementation for 2 clusters
clusterer = GMM(n_components = 2)
clusterer.fit(reduced_data)
preds = clusterer.predict(reduced_data)
centers = clusterer.means_ 
sample_preds = clusterer.predict(pca_samples)

cluster_results(reduced_data, preds, centers, pca_samples)

# Data Recovery 

Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster's center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.
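The two inverse steps can be sketched with NumPy alone. The example below is an illustration under stated assumptions: it runs on synthetic data rather than the customer dataset, it computes the PCA by hand with an SVD instead of scikit-learn, and the 2-D center `center_2d` is a hypothetical value. For an unwhitened PCA, the inverse transform is the reduced point times the component matrix plus the feature means.

```python
import numpy as np

# Synthetic positive "spending" data (NOT the customer dataset)
rng = np.random.default_rng(1)
spending = rng.lognormal(mean=8.0, sigma=1.0, size=(200, 6))
log_spending = np.log(spending)

# PCA via SVD: center the data, then keep the first two principal axes
mean = log_spending.mean(axis=0)
_, _, vt = np.linalg.svd(log_spending - mean, full_matrices=False)
components = vt[:2]                       # shape (2, 6)

# A hypothetical cluster center expressed in the reduced 2-D space
center_2d = np.array([[1.5, -0.5]])

# Inverse PCA: map back to the 6-D log space, then undo the logarithm
log_center = center_2d @ components + mean
true_center = np.exp(log_center)

print(np.round(true_center))
```

Projecting `log_center` back onto the two components returns the original 2-D center, which confirms that the two mappings are consistent with each other.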

In [None]:
# Inverse transform the centers from the reduced space back to the log-scaled feature space
log_centers = pca.inverse_transform(centers)

# Exponentiate the centers to undo the natural-logarithm scaling
true_centers = np.exp(log_centers)

# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)

In [None]:
plt.figure()
plt.axes().set_title("Segment 0")
sns.barplot(x=true_centers.columns.values,y=true_centers.iloc[0].values)

plt.figure()
plt.axes().set_title("Segment 1")
sns.barplot(x=true_centers.columns.values,y=true_centers.iloc[1].values)

From the above figures, in Segment 1 Grocery and Milk have spending well above their mean values, and Fresh spending is also sizeable but below its mean value. So, in my view, this segment might correspond to cafes/restaurants.

In Segment 0, by contrast, Fresh has the largest spending, which suggests suppliers or wholesale markets for fresh vegetables, fruits, etc.

In [None]:
# Display the predictions
for i, pred in enumerate(sample_preds):
    print("Sample point", i, "predicted to be in Cluster", pred)

**Index 61 (Cluster 1).** Previous assessment: supermarket, because its spending is well above the mean for all product categories. Model assessment: cafe/restaurant.

**Index 149 (Cluster 0).** Previous assessment: supplier/wholesale market, due to its high spending on Fresh. Model assessment: supplier/wholesale market. Comments: this agrees with the original prediction. I interpreted a predominance of Fresh as a characteristic of a supplier/wholesale market, and the model seems to suggest this too.

**Index 379 (Cluster 1).** Previous assessment: cafe, due to high spending on Milk and above-average spending on Grocery (snacks, coffee, tea, etc.), while spending on the other categories is well below average. Model assessment: cafe/restaurant. Comments: this agrees with the original prediction. I interpreted a predominance of Milk and Grocery as a characteristic of a cafe/restaurant, and the model seems to suggest this too.

The model has resulted in two main customer types: Cluster 0, 'supplier/wholesale market', and Cluster 1, 'restaurants/cafes'.

It is likely that customers from Cluster 1 who serve lots of fresh food will want deliveries 5 days a week in order to keep food as fresh as possible.

Cluster 0 could be more flexible: these customers buy a wider variety of perishable and non-perishable goods, so they do not necessarily need a daily delivery.

With this in mind, the company could run A/B tests and generalize from them. It can take a subset of data points close to each cluster center to act as representatives of that cluster, change the delivery frequency for half of the points in each subset, and see how those customers react in comparison to the other customers in their cluster who keep the old delivery frequency. Based on this, the company can predict how all customers in each cluster would react to the change in delivery frequency.
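The selection step of such an A/B test can be sketched as follows: pick the points nearest to a cluster center and split them at random into a control half and a treatment half. This is only an illustration on synthetic 2-D points; the helper `ab_groups`, the subset size of 20 and the fixed seed are assumptions introduced here.

```python
import numpy as np

def ab_groups(points, center, n_closest, seed=0):
    """Pick the `n_closest` points nearest to `center` and split them
    at random into a control half and a treatment half."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points - center, axis=1)
    nearest = np.argsort(dist)[:n_closest]
    shuffled = np.random.default_rng(seed).permutation(nearest)
    half = n_closest // 2
    return shuffled[:half], shuffled[half:]

# Illustrative 2-D points standing in for one cluster's reduced data
rng = np.random.default_rng(3)
cluster_points = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(100, 2))
center = cluster_points.mean(axis=0)

control, treatment = ab_groups(cluster_points, center, n_closest=20)
print("control indices:  ", sorted(control.tolist()))
print("treatment indices:", sorted(treatment.tolist()))
```

The treatment group would receive the new delivery frequency while the control group keeps the old one, and the procedure would be repeated separately for each cluster.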

The wholesale distributor could train a supervised machine learning classification algorithm (e.g. SVC, or decision tree classifier, etc) with the initial dataset's customer product spending as inputs and the customer segments (engineered feature as obtained from GMM clustering) as the target variable.

Once the classifier is trained it can be used to predict the customer segment for new customers which would then determine the most appropriate delivery service.

Standard supervised learning optimizations (boosting, cross-validation, etc.) can be used to tune the model for more accurate results.

At the beginning of this project, it was discussed that the 'Channel' and 'Region' features would be excluded from the dataset so that the customer product categories were emphasized in the analysis. By reintroducing the 'Channel' feature to the dataset, an interesting structure emerges when considering the same PCA dimensionality reduction applied earlier to the original dataset.

Run the code block below to see how each data point is labeled either 'HoReCa' (Hotel/Restaurant/Cafe) or 'Retail' in the reduced space. In addition, you will find the sample points are circled in the plot, which will identify their labeling.

In [None]:
channel_results(reduced_data, outliers, pca_samples)