## Wine drinkers

#### Part 1: Segmenting wine drinkers
Here I explore online sales data for a wine store on the Upper East Side in NYC. Online sales are not representative of total sales for this store (most of its sales are in-store), but it is still informative to look at what online customers are buying.

In Part 2 I'll use this data to build wine recommenders.

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn import metrics
from ggplot import *
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('wine_data.csv')

#### We have data for purchases by wine type. Each row is a customer.

In [None]:
data.head()

In [None]:
data.describe()

### The most popular wines

In [None]:
data.mean().plot(kind='bar')

The most popular wines are Pinot Noir, Zinfandel, Merlot, Chardonnay, and Sauvignon Blanc.

### Clustering

In [None]:
X = data[data.columns]

# All column names (wine types) are stored as x_cols
x_cols = data.columns

I'll use the elbow method to choose the number of clusters. The distortion (the within-cluster sum of squared errors, or SSE) generally decreases as k (the number of clusters) grows; the elbow is the value of k beyond which adding more clusters yields only small further reductions.

In [None]:
distortions = []
for i in range(1, 10):
    km = KMeans(n_clusters=i,
               init='k-means++',
               n_init=10,
               max_iter=300,
               random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)
    
plt.plot(range(1,10), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
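
For reference, the distortion that `inertia_` reports is the sum of squared Euclidean distances from each point to its assigned cluster center. Here is a minimal NumPy sketch on made-up data; the array values and the helper name `within_cluster_sse` are illustrative assumptions, not part of the wine data set.

```python
import numpy as np

def within_cluster_sse(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    diffs = points - centers[labels]  # offset of each point from its own center
    return float(np.sum(diffs ** 2))

# Toy data (made-up values): two tight groups of two points each
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k=1: every point is assigned to the overall mean
labels_k1 = np.zeros(4, dtype=int)
centers_k1 = points.mean(axis=0, keepdims=True)
sse_k1 = within_cluster_sse(points, labels_k1, centers_k1)

# k=2: each group gets its own center
labels_k2 = np.array([0, 0, 1, 1])
centers_k2 = np.array([points[:2].mean(axis=0), points[2:].mean(axis=0)])
sse_k2 = within_cluster_sse(points, labels_k2, centers_k2)

print(sse_k1, sse_k2)  # 104.0 4.0
```

Going from one cluster to two removes almost all of the distortion in this toy case; a drop that large followed by much smaller ones is what shows up as an elbow in the plot.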

It looks like the elbow is located at k=3. We can also use the silhouette score, which measures how similar an object is to its own cluster compared to other clusters. Scores range from -1 to 1 and are higher when clusters are dense and well separated. Scores around zero indicate overlapping clusters.

In [None]:
silhouette = {}
for i in range(2, 10):
    km = KMeans(n_clusters=i,
               init='k-means++',
               n_init=10,
               max_iter=300,
               tol=1e-04,
               random_state=0)
    km.fit(X)
    silhouette[i] = metrics.silhouette_score(X, km.labels_, metric='euclidean')

silhouette
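
To make the score concrete: for a single point, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the lowest mean distance to the points of any other cluster. The sketch below computes it by hand on made-up 1D data; the helper `silhouette_of_point` is a hypothetical name used only for illustration.

```python
import numpy as np

def silhouette_of_point(points, labels, idx):
    """Silhouette coefficient (b - a) / max(a, b) for the point at row idx."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    own = labels == labels[idx]
    own[idx] = False  # exclude the point itself from its own cluster
    if not own.any():
        return 0.0  # common convention for a single-member cluster
    a = dists[own].mean()
    b = min(dists[labels == other].mean()
            for other in np.unique(labels) if other != labels[idx])
    return float((b - a) / max(a, b))

# Toy data (made-up values): two well-separated pairs on a line
points = np.array([[0.0], [1.0], [10.0], [11.0]])

good_labels = np.array([0, 0, 1, 1])  # pairs grouped sensibly
scores = [silhouette_of_point(points, good_labels, i) for i in range(len(points))]
mean_score = float(np.mean(scores))

bad_labels = np.array([0, 1, 0, 1])   # each cluster straddles both pairs
bad_first = silhouette_of_point(points, bad_labels, 0)

print(round(mean_score, 3), round(bad_first, 3))
```

The sensible grouping scores close to 1, while the first point under the poor grouping gets a negative score because it sits closer to the other cluster than to its own.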

k=3 gives the highest score, by a hair. In general these scores are not that high, indicating that there will be a fair amount of overlap between clusters.

#### To visualize the data I will project the data to 2D.

In [None]:
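pca = PCA(n_components=2)

# Fit the projection once and reuse both components
coords = pca.fit_transform(data[x_cols])
data['x'] = coords[:, 0]
data['y'] = coords[:, 1]

# Requires data['cluster3'], the k=3 KMeans label assigned with fit_predict
clusters_2d = data[['cluster3', 'x', 'y']]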
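
A 2D projection discards some of the variance in the data, and a fitted `PCA` object reports how much each component retains through its `explained_variance_ratio_` attribute. Here is a small sketch on random data (not the wine data; the shape and scales are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Made-up data: 200 rows, 5 columns, with most of the spread in the first two columns
toy = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 0.5, 0.5, 0.5])

pca = PCA(n_components=2)
coords = pca.fit_transform(toy)  # one fit gives both projected columns

# Share of the total variance kept by the two components together
retained = pca.explained_variance_ratio_.sum()

print(coords.shape)
print(round(retained, 3))
```

A low retained share would be a reason to read the 2D scatter plot with caution, since clusters that overlap in the projection may still be separated in the full feature space.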

In [None]:
ggplot(clusters_2d, aes(x='x', y='y', color='cluster3')) + \
    scale_color_gradient(low='#E1FA72', high='#F46FEE') + \
    geom_point(size=75) + ggtitle("Customers Grouped by Cluster")

There is some overlap between the clusters. Let's look at the clusters more closely and see what people are buying for each cluster.

### Analyzing clusters

In [None]:
# Making columns that indicate whether a customer is in a particular cluster
data['is_0'] = data.cluster3==0.0
data['is_1'] = data.cluster3==1.0
data['is_2'] = data.cluster3==2.0

just_wine = data.drop(['cluster3', 'x', 'y'], axis=1)

In [None]:
# Let's group by cluster
cluster0 = just_wine.groupby('is_0').sum()
cluster1 = just_wine.groupby('is_1').sum()
cluster2 = just_wine.groupby('is_2').sum()

In [None]:
# Getting just the relevant row for each cluster
zero = cluster0.iloc[1:2]
one = cluster1.iloc[1:2]
two = cluster2.iloc[1:2]

# Let's put all the groups into one dataframe
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
all_clusters = pd.concat([zero, one, two], ignore_index=True)

all_clusters

In [None]:
# Depending on the pandas version, combining the frames can alphabetize the columns.
# The previous ordering was more convenient because reds were with reds
# and whites were with whites, so I'll go back to that column ordering.
# (reindex_axis was removed from pandas; reindex(columns=...) does the same job.)
all_clusters = all_clusters.reindex(columns=cluster0.columns)
all_clusters

In [None]:
all_clusters.drop(['is_1','is_2'], axis=1, inplace=True)
all_clusters

Now we can see which wines are most and least popular in each cluster, and compare the clusters more easily.
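
The same per-cluster totals can also be sketched with a single `groupby` on the cluster label, which avoids the boolean helper columns and keeps the original column order. The frame below is made up; its column names and values are illustrative assumptions, not the real purchase data.

```python
import pandas as pd

# Made-up purchase counts; column names and values are illustrative only
toy = pd.DataFrame({
    'Merlot':     [2, 0, 1, 3],
    'Chardonnay': [0, 4, 1, 0],
    'cluster3':   [0, 1, 0, 1],
})

# One row per cluster, one column per wine, original column order preserved
totals = toy.groupby('cluster3').sum()
print(totals)
```

On the notebook's own frame, the plotting columns `x` and `y` would presumably need to be dropped before grouping so that only wine columns are summed.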

### Most/least popular wines by cluster

In [None]:
cluster3 = KMeans(n_clusters=3,
               init='k-means++',
               n_init=10,
               max_iter=300,
               tol=1e-04,
               random_state=0)

In [None]:
# Add a column that indicates which cluster each point falls into
data['cluster3'] = cluster3.fit_predict(X)

# Let's see how many are in each cluster
data.cluster3.value_counts()

In [None]:
all_clusters.plot.bar().legend(loc='center left', bbox_to_anchor=(1, 0.5))

A few observations: most of the Pinot Noir, Zinfandel, Merlot, Chardonnay, and Sauvignon Blanc sales come from cluster 1, and most of the Syrah sales come from cluster 2.

I'm also interested in the mean purchases for each wine type, grouped by cluster.