# Customer Segmentation and Analysis

Steps to solve the problem:

*   Importing Libraries.
*   Exploration of data.
*   Data Visualization.
*   Clustering using K-Means: 
    - Segmentation using Age and Spending Score
    - Segmentation using Annual Income and Spending Score
*   Selection of Clusters.
*   Plotting the Cluster Boundary and Clusters.
*   3D Plot of Clusters.



## Importing Libraries

In [None]:
import os  # functions for interacting with the operating system
import numpy as np  # numerical computing
import pandas as pd  # data loading and processing
import matplotlib.pyplot as plt  # plotting library
import seaborn as sns  # statistical graphics, built on top of matplotlib
from sklearn.cluster import KMeans  # K-Means clustering

## Data Exploration (_Discover and Visualize the Data to Gain Insights_)

In [None]:
!wget -O MallCustomers.csv https://sagemaker-studio-591933579993-hqmmz6xgv3m.s3.amazonaws.com/Mall_Customers.csv

In [None]:
!ls

In [None]:
df = pd.read_csv('MallCustomers.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df['Gender'].value_counts()  # number of customers per gender

In [None]:
df.isnull().sum(axis=0)

## Data Visualization

[For Style reference](https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html)

In [None]:
#plt.style.use('fivethirtyeight')

In [None]:
plt.style.use('default')

### **Histograms**

A histogram is a bar-chart-like view of a numeric variable: the range of values is split into bins along the x-axis, and the height of each bar on the y-axis gives the count (or percentage) of observations that fall in that bin. It is used to visualize how the data are distributed.
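
The binning described above can be sketched with NumPy alone. The ages below are made up for illustration and are not taken from the dataset; `np.histogram` only counts, it does not draw anything.

```python
import numpy as np

# Hypothetical ages, invented for this illustration.
ages = np.array([18, 19, 21, 23, 25, 31, 35, 35, 40, 47, 49, 52, 60, 68, 70])

# Split the range 10-70 into 3 equal-width bins and count the values in each.
# Every bin is half-open [left, right) except the last, which includes its right edge.
counts, bin_edges = np.histogram(ages, bins=3, range=(10, 70))

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} value(s)")
```

A histogram plot draws one bar per bin with these counts as the bar heights.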

In [None]:
df.columns

In [None]:
plt.figure(1, figsize=(15, 5))
n = 0
for x in df.columns[2:]:
    n += 1
    plt.subplot(1, 3, n)
    sns.histplot(df[x], bins=20)
    plt.title('Histplot of {}'.format(x), fontsize=14)
plt.subplots_adjust(wspace=0.3)
plt.show()

`kde`: if `True`, `histplot` also computes a kernel density estimate, which draws a smoothed version of the distribution on top of the bars.
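
To give a feel for what a kernel density estimate is, here is a minimal NumPy sketch that places one Gaussian bump on each observation and averages them. The function name, the sample values, and the fixed bandwidth are all assumptions made for illustration; seaborn chooses its bandwidth automatically and its implementation differs.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = distance between grid point i and sample j, in bandwidth units
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return bumps.sum(axis=1) / (len(samples) * bandwidth)

samples = [20, 22, 25, 40, 41, 60]        # hypothetical values
grid = np.linspace(-20, 100, 1201)        # evaluation points, step 0.1
density = gaussian_kde_sketch(samples, grid, bandwidth=5.0)

# A density should integrate to roughly 1 (Riemann-sum approximation).
area = float(density.sum() * (grid[1] - grid[0]))
print(f"area under the curve: {area:.4f}")
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows the individual observations more closely.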

In [None]:
plt.figure(1, figsize=(15, 6))
n = 0
for x in df.columns[2:]:
    n += 1
    plt.subplot(1, 3, n)
    # Pass the column as a Series so seaborn labels the x-axis and skips the legend
    sns.histplot(df[x], bins=20, kde=True)
    plt.title('Smooth histplot of {}'.format(x), fontsize=14)
plt.subplots_adjust(wspace=0.3)
plt.show()

In [None]:
sns.pairplot(df[['Age','Annual Income (k$)','Spending Score (1-100)','Gender']],hue='Gender')

In [None]:
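# pairplot creates its own figure, so a preceding plt.figure() call has no effect on it;
# the size is controlled with `height` (inches per facet) instead.
sns.pairplot(df[['Age','Annual Income (k$)','Spending Score (1-100)','Gender']],hue='Gender',diag_kind="hist", diag_kws={'bins': 20}, height=6)
plt.show()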

### **Count Plot of Gender**

In [None]:
plt.figure(1, figsize=(12, 4))
df['Gender'].value_counts().plot(kind='bar', cmap='Blues_r')
plt.ylabel('Count', fontsize=12)
plt.xlabel('Gender', fontsize=12)
plt.xticks(rotation=0)
plt.show()

In [None]:
sns.countplot(x='Gender',data=df)

### **Looking for Correlations**

Since the dataset is not too large, you can easily compute the standard correlation
coefficient (also called Pearson’s r) between every pair of attributes using the corr()
method:

In [None]:
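# Gender is a string column, so restrict the correlation to the numeric columns
# (recent pandas versions raise an error on non-numeric data otherwise).
corr_matrix = df.select_dtypes(include='number').corr()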
corr_matrix
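
For reference, Pearson's r between two attributes is their covariance divided by the product of their standard deviations. The sketch below computes it by hand and checks it against `np.corrcoef`; the function name and the sample values are hypothetical and not taken from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum()))

income = [15, 20, 40, 60, 90]   # hypothetical annual incomes (k$)
score = [39, 81, 50, 47, 20]    # hypothetical spending scores

r = pearson_r(income, score)
reference = float(np.corrcoef(income, score)[0, 1])
print(f"by hand: {r:.4f}, np.corrcoef: {reference:.4f}")
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.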

In [None]:
from pandas.plotting import scatter_matrix

attributes = df.columns[2:]
scatter_matrix(df[attributes], figsize=(15, 10), alpha=0.8, marker='o', diagonal='hist', hist_kwds={'bins': 25})
plt.show()

In [None]:
df[['Age','Annual Income (k$)','Spending Score (1-100)']].corr()

In [None]:
sns.heatmap(df[['Age','Annual Income (k$)','Spending Score (1-100)']].corr())

### **Plotting the Relation between Age, Annual Income and Spending Score**

In [None]:
plt.figure(1, figsize=(15, 6))
for gender in ['Male', 'Female']:
    plt.scatter(x='Age', y='Annual Income (k$)', data=df[df['Gender'] == gender],
               s=200, alpha=0.5, label=gender)
plt.xlabel('Age')
plt.ylabel('Annual Income (k$)')
plt.title('Age vs Annual Income w.r.t Gender')
plt.legend()
plt.show()

In [None]:
plt.figure(1, figsize=(15, 6))
for gender in ['Male', 'Female']:
    plt.scatter(x='Annual Income (k$)', y='Spending Score (1-100)', data=df[df['Gender'] == gender],
               s=200, alpha=0.5, label=gender)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Annual Income vs Spending Score w.r.t Gender')
plt.legend()
plt.show()

**Distribution of values in Age, Annual Income and Spending Score according to Gender**

A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
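
The five-number summary can be computed directly. The sketch below uses made-up values and the common 1.5 × IQR rule for the whisker limits (the default in matplotlib and seaborn boxplots); points outside those limits are the ones drawn as outliers.

```python
import numpy as np

# Hypothetical values, invented for this illustration.
values = np.array([15, 18, 20, 22, 25, 28, 30, 35, 40, 120])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                      # interquartile range: width of the box

# Whiskers extend to the most extreme points still inside these fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {inside.min()} to {inside.max()}, outliers: {outliers.tolist()}")
```

Here the "minimum" and "maximum" of the boxplot are the whisker ends, which is why they appear in quotes: they are the most extreme values that are not flagged as outliers.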

In [None]:
plt.figure(1, figsize=(15, 6))
n = 0
for cols in df.columns[2:]:
    n += 1
    plt.subplot(1, 3, n)
    plt.subplots_adjust(hspace=0.5)
    sns.boxplot(x = cols, y = 'Gender', data=df, palette='vlag')
    sns.swarmplot(x = cols , y = 'Gender' , data = df)
    plt.ylabel('Gender' if n == 1 else '')
    plt.title('Boxplots & Swarmplots' if n == 2 else '')
plt.show()

In [None]:
plt.figure(1, figsize=(15, 6))
n = 0
for cols in df.columns[2:]:
    n += 1
    plt.subplot(1, 3, n)
    plt.subplots_adjust(hspace=0.5)
    sns.violinplot(x = cols, y = 'Gender', data=df, palette='vlag')
    sns.swarmplot(x = cols , y = 'Gender' , data = df)
    plt.ylabel('Gender' if n == 1 else '')
    plt.title('Violinplots & Swarmplots' if n == 2 else '')
plt.show()

[To know about boxplot in more detail.](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)

## Clustering using KMeans

#### **Segmentation using Age and Spending Score**

Selecting `n_clusters` based on inertia (the sum of squared distances between each instance and its closest centroid).
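
As scikit-learn computes it, `inertia_` is the sum of squared distances from each sample to its closest centroid. A minimal NumPy sketch of that quantity follows; the helper name, the points, and the centroids are hypothetical, and the centroids are fixed by hand rather than fitted.

```python
import numpy as np

def inertia(points, centroids):
    """Return (inertia, labels) for fixed centroids."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # sq_dists[i, k] = squared Euclidean distance from point i to centroid k
    sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)            # index of the closest centroid
    return float(sq_dists.min(axis=1).sum()), labels

# Hypothetical (age, spending score) pairs and two hand-picked centroids.
X_demo = np.array([[20, 80], [22, 76], [50, 40], [54, 44]])
centroids_demo = np.array([[21, 78], [52, 42]])

value, labels = inertia(X_demo, centroids_demo)
print(value, labels.tolist())
```

Inertia always decreases as more clusters are added, so the goal is not to minimize it outright but to find the point where adding another cluster stops helping much.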

[Visualize KMeans]()

In [None]:
X = df[['Age', 'Spending Score (1-100)']].values
inertias = []
for n in range(1, 11):
    kmeans = KMeans(n_clusters=n, algorithm='elkan', tol=1e-4, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

In [None]:
plt.figure(1, figsize=(15, 6))
plt.plot(range(1, 11), inertias, 'go-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
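
The elbow of a curve like this is usually read off by eye. As a rough cross-check, one heuristic (an assumption of this sketch, not something this notebook relies on) picks the k where the curve bends most sharply, measured by the largest second difference of the inertia values. The function name and the inertia values below are hypothetical.

```python
import numpy as np

def elbow_by_second_difference(inertias, first_k=1):
    """Return the k at which the inertia curve bends most sharply."""
    inertias = np.asarray(inertias, dtype=float)
    # second_diff[i] = inertias[i] - 2 * inertias[i + 1] + inertias[i + 2],
    # i.e. the curvature centred on k = first_k + i + 1
    second_diff = np.diff(inertias, n=2)
    return int(first_k + 1 + np.argmax(second_diff))

# Hypothetical inertias for k = 1..7 with a visible bend at k = 4.
demo_inertias = [1000, 700, 420, 150, 120, 100, 85]
best_k = elbow_by_second_difference(demo_inertias)
print(best_k)
```

The heuristic is sensitive to noise and to a large first drop, so it is best treated as a sanity check on the plot rather than a replacement for it.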

In [None]:
kmeans = KMeans(n_clusters=4, algorithm='elkan', tol=1e-4, random_state=42)
kmeans.fit(X)

In [None]:
def plot_data(X):
    plt.plot(X[:, 0], X[:, 1], 'ko', markersize=5)

def plot_centroids(centroids, circle_color='r', cross_color='w'):
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='o', s=35, linewidths=8, 
                color=circle_color, zorder=10, alpha=0.9)
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=2, linewidths=12,
                color=cross_color, zorder=11, alpha=1)

def plot_decision_boundaries(clusterer, X, resolution=1000, show_centroids=True,
                             show_xlabels=True, show_ylabels=True,
                             xlabel='Age', ylabel='Spending Score (1-100)'):
    # Build a dense grid covering the data and predict a cluster for every grid point.
    mins = X.min(axis=0) - 0.1
    maxs = X.max(axis=0) + 0.1
    xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),
                         np.linspace(mins[1], maxs[1], resolution))
    Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # Filled regions show each cluster's territory; black contours mark the boundaries.
    plt.contourf(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]), cmap='Pastel2')
    plt.contour(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                linewidths=1, colors='k')
    plot_data(X)
    if show_centroids:
        plot_centroids(clusterer.cluster_centers_)

    # Axis labels default to the Age / Spending Score segmentation.
    if show_xlabels:
        plt.xlabel(xlabel)

    if show_ylabels:
        plt.ylabel(ylabel)

In [None]:
plt.figure(1, figsize=(15, 6))
plot_decision_boundaries(kmeans, X)
plt.show()

#### **Segmentation using Annual Income and Spending Score**

In [None]:
X = df[['Annual Income (k$)', 'Spending Score (1-100)']].values
inertias = []
for n in range(1, 11):
    kmeans = KMeans(n_clusters=n, algorithm='elkan', tol=1e-4, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

In [None]:
plt.figure(1, figsize=(15, 6))
plt.plot(range(1, 11), inertias, 'go-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

In [None]:
kmeans = KMeans(n_clusters=5, algorithm='elkan', tol=1e-4, random_state=42)
kmeans.fit(X)
plt.figure(1, figsize=(15, 6))
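# The helper labels the x-axis 'Age' by default, so suppress that label
# and set the correct one for this segmentation.
plot_decision_boundaries(kmeans, X, show_xlabels=False)
plt.xlabel('Annual Income (k$)')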
plt.show()