# <center>Plotly_Mall_Customers - Selma MOURTADI</center>

<center><blockquote><b>Visualization gives you answers to questions you didn’t know you had.</b><i> - Ben Shneiderman</i></blockquote></center>

![image.png](attachment:image.png)

In [19]:
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff

In [20]:
data = pd.read_csv('Mall_customers.csv')

In [21]:
fig = go.Figure(data = [go.Table(
    header = dict(values = list(data.columns),
                fill_color = 'rgb(251,166,121)',
                align ='left'),
    cells=dict(values = [data['CustomerID'], data['Gender'], data['Age'], 
                       data['Annual Income (k$)'], data['Spending Score (1-100)']],
               fill_color = 'rgb(253,237,176)',
               align = 'left'))
])

fig.show()

<b><font color='purple'>Description :</font></b>
* **CustomerID** : Unique ID of each customer. (Numerical)
* **Gender** : Male or Female. (Categorical)
* **Age** : Age of the customer. (Numerical)
* **Annual Income (k$)** : Annual income of the customer, in thousands of dollars. (Numerical)
* **Spending Score (1-100)** : A score assigned to the customer based on some defined parameters, such as purchasing data. (Numerical)

In [22]:
print('Number of columns: {:}\nNumber of rows: {:}'.format(data.shape[1], data.shape[0]))

Number of columns: 5
Number of rows: 200


In [23]:
data.describe()

statistic,CustomerID,Age,Annual Income (k$),Spending Score (1-100)
count,200.0,200.0,200.0,200.0
mean,100.5,38.85,60.56,50.2
std,57.879185,13.969007,26.264721,25.823522
min,1.0,18.0,15.0,1.0
25%,50.75,28.75,41.5,34.75
50%,100.5,36.0,61.5,50.0
75%,150.25,49.0,78.0,73.0
max,200.0,70.0,137.0,99.0


* **Age** : Ages range from 18 to 70.
* **Annual Income** : Customer income ranges from 15k to 137k dollars per year.
* **Spending Score** : The minimum spending score is 1 and the maximum is 99.

<b><font color='purple'>Missing values :</font></b>

In [24]:
data.isnull().sum()

CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64

**None.**

## Interactive Histograms :

In [25]:
fig = px.histogram(data, x = "Age", color = 'Gender', title = 'Age Distribution')
fig.show()

In [26]:
fig = px.histogram(data, x ="Annual Income (k$)", color = 'Gender', title = 'Annual Income (k$) Distribution')
fig.show()

In [27]:
fig = px.histogram(data, x = "Spending Score (1-100)", color = 'Gender', title = 'Spending Score Distribution')
fig.show()

## Interactive Boxplots :

In [28]:
fig = px.box(data, y="Age", title = 'Age Boxplot')
fig.show()

In [29]:
fig = px.box(data, y="Annual Income (k$)", title = 'Annual Income (k$) Boxplot')
fig.show()

**We can easily spot an extreme value: 137k.
Let's take a look at the customer(s) with this annual income.**

In [30]:
data.loc[data['Annual Income (k$)'] == 137]

index,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
198,199,Male,32,137,18
199,200,Male,30,137,83
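
The box plot singles out 137k because of the usual whisker convention: points lying more than 1.5 × IQR beyond the quartiles are drawn individually. Here is a minimal sketch of that arithmetic, assuming the 1.5 × IQR rule and taking the quartiles straight from the `data.describe()` output earlier in this notebook. Plotly may compute quartiles slightly differently, so the fence it draws can differ a little.

```python
# Quartiles of Annual Income (k$) as reported by data.describe().
q1, q3 = 41.5, 78.0

# Interquartile range and the conventional 1.5 * IQR fences
# (an assumed convention; the notebook does not state which rule is used).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

max_income = 137.0  # maximum reported by data.describe()
is_outlier = max_income > upper_fence

print(f"IQR = {iqr}, fences = [{lower_fence}, {upper_fence}]")
print(f"137k beyond the upper fence: {is_outlier}")
```

Under these assumptions both customers with an income of 137k sit just above the upper fence, which is consistent with the isolated point on the box plot.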


In [31]:
fig = px.box(data, y="Spending Score (1-100)", title = 'Spending Score (1-100) Boxplot')
fig.show()

## Interactive Pie Chart  :

In [32]:
df = pd.DataFrame(data['Gender'].value_counts()).reset_index()
df.columns = ['Gender', 'Total']
df

index,Gender,Total
0,Female,112
1,Male,88


In [33]:
fig = px.pie(df, values = 'Total', names = 'Gender', title = 'Gender Pie Chart')
fig.show()

**56% of the customers are Female (112) and 44% are Male (88).**
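
For completeness, the percentages follow directly from the counts in the table above. This is a small sketch in plain Python, with the counts hard-coded from that output rather than read from the CSV.

```python
# Gender counts taken from the value_counts() table above.
counts = {"Female": 112, "Male": 88}

total = sum(counts.values())
# Share of each gender as a percentage of all customers.
shares = {gender: 100 * n / total for gender, n in counts.items()}

for gender, pct in shares.items():
    print(f"{gender}: {pct:.0f}%")
```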

## Interactive Barplots :

In [34]:
fig = px.bar(data, x = "Gender", y = "Age", color = "Gender", title = "Age By Gender")
fig.show()

In [35]:
fig = px.bar(data, x = "Gender", y = "Annual Income (k$)", color = "Gender", title = "Annual Income (k$) By Gender")
fig.show()

In [36]:
fig = px.bar(data, x = "Gender", y = "Spending Score (1-100)", color = "Gender", title = "Spending Score (1-100) By Gender")
fig.show()

In [37]:
fig = px.bar(data, x = "Age", y = "Annual Income (k$)", color = "Gender", title = "Annual Income By Age + Gender Info")
fig.show()

In [38]:
fig = px.bar(data, x = "Age", y = "Spending Score (1-100)", color = "Gender", title = "Spending Score by Age + Gender Info")
fig.show()

## Interactive Scatter Plot :

In [39]:
fig = px.scatter(data, x = "Annual Income (k$)", y = "Spending Score (1-100)", color = "Gender",
                 size = 'Age')
fig.show()

**Can't say much about the correlation between the Spending Score and the Annual Income.**
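
One way to put a number on that impression is Pearson's correlation coefficient. The sketch below spells out the formula in plain Python. It uses only the first five customers (the rows shown by `data_clusters.head()` in this notebook), which is far too few to say anything about the full dataset and only serves to show the arithmetic. The helper name `pearson_r` is an assumption for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)

# First five customers, as shown by data_clusters.head() in this notebook.
income = [15, 15, 16, 16, 17]
score = [39, 81, 6, 77, 40]

r_head = pearson_r(income, score)
print(f"r on five rows = {r_head:.3f}")
```

On the full dataframe, `data['Annual Income (k$)'].corr(data['Spending Score (1-100)'])` should compute the same quantity over all 200 rows.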

## Interactive Heatmap :

In [40]:
X = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
df_2 = data.copy()
df_2 = df_2.drop(['CustomerID'], axis = 1)

In [41]:
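# Restrict to the numeric columns in X: the 'Gender' strings cannot be correlated,
# and this keeps the matrix rows/columns in the same order as the axis labels.
fig = go.Figure(data = go.Heatmap(
        z = df_2[X].corr().values,
        x = X,
        y = X,
        colorscale='RdBu'))
fig.show()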

**We used a heatmap to check for a correlation between Spending Score and Age, and between Annual Income and Age. Unfortunately, the correlation values are low. Let's now move on to the clustering algorithms!**

## KMeans :

In [42]:
from sklearn.cluster import KMeans

**We will use WCSS (Within-Cluster Sum of Squares): the sum, over all clusters, of the squared distances between each data point and its cluster's centroid. Plotting WCSS against the number of clusters and looking for the bend in the curve is known as the Elbow Method.**
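
To make the definition concrete, here is a minimal sketch of the WCSS computation in plain Python. The function name `wcss_of` and the toy points are assumptions for illustration and are not taken from `Mall_customers.csv`. For a converged KMeans model, whose centroids are the cluster means, this quantity should agree with `kmeans.inertia_` up to floating-point error.

```python
def wcss_of(points, labels):
    """Within-cluster sum of squares for 2-D points and their cluster labels."""
    # Group the points by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        # Centroid = coordinate-wise mean of the cluster's points.
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        # Add each member's squared Euclidean distance to that centroid.
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)
    return total

# Hypothetical toy data: two tight pairs of points, far apart.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]

two_clusters = wcss_of(toy_points, [0, 0, 1, 1])  # one cluster per pair
one_cluster = wcss_of(toy_points, [0, 0, 0, 0])   # everything lumped together

print(two_clusters, one_cluster)
```

Splitting the toy points into their two natural groups gives a much smaller WCSS than lumping them together, which is the drop the elbow plot looks for.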

In [43]:
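# Columns 2 and 3 of df_2: Annual Income (k$) and Spending Score (1-100)
x = df_2.iloc[:,2:4].values
# Use the same list of cluster counts for the loop and for the x axis
k_values = list(range(1, 11))
wcss = []
for i in k_values:
    kmeans = KMeans(n_clusters = i, init = "k-means++", max_iter = 500, n_init = 10, random_state = 42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

fig = go.Figure(data = go.Scatter(x = k_values, y = wcss))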


fig.update_layout(title = 'WCSS By Cluster Number',
                   xaxis_title = 'Clusters',
                   yaxis_title = 'WCSS')
fig.show()

**Optimal Number of Clusters according to the Elbow Method : 5 clusters.**

In [44]:
# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters = 5, init = 'k-means++', n_init = 10, random_state = 0)
y = kmeans.fit_predict(x)

**Let's use the Silhouette Score this time to find the optimal number of clusters, and see whether it also gives 5.**

In [54]:
from sklearn.metrics import silhouette_score

In [55]:
range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10]
silscore = []
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    cluster_labels = clusterer.fit_predict(x)
    silhouette_avg = silhouette_score(x, cluster_labels)
    silscore.append(silhouette_avg)
    
    print("For n_clusters =", n_clusters,"The average silhouette_score is :", silhouette_avg)

For n_clusters = 2 The average silhouette_score is : 0.2968969162503008
For n_clusters = 3 The average silhouette_score is : 0.46761358158775435
For n_clusters = 4 The average silhouette_score is : 0.4931963109249047
For n_clusters = 5 The average silhouette_score is : 0.553931997444648
For n_clusters = 6 The average silhouette_score is : 0.5393922132561455
For n_clusters = 7 The average silhouette_score is : 0.5270287298101395
For n_clusters = 8 The average silhouette_score is : 0.4575689106804838
For n_clusters = 9 The average silhouette_score is : 0.4565077334305076
For n_clusters = 10 The average silhouette_score is : 0.449795408266166


**The highest Silhouette Score indicates the best number of clusters. Bingo! It's 5.**
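
For readers curious about what `silhouette_score` measures, here is a minimal plain-Python sketch of the mean silhouette coefficient with Euclidean distance. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to any other cluster, and the coefficient is `(b - a) / max(a, b)`. The function name `mean_silhouette` and the toy points are assumptions for illustration, not part of the dataset.

```python
from math import dist

def mean_silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(dist(p, q))
        if own not in by_cluster:
            # Singleton cluster: the coefficient is taken to be 0.
            scores.append(0.0)
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two tight pairs of points, far apart.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]

good = mean_silhouette(toy_points, [0, 0, 1, 1])  # labels follow the pairs
bad = mean_silhouette(toy_points, [0, 1, 0, 1])   # labels split the pairs

print(f"good labelling: {good:.3f}, bad labelling: {bad:.3f}")
```

A labelling that follows the natural groups scores close to 1, while one that cuts across them goes negative, which is why the highest average score is read as the best number of clusters.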

In [46]:
data_clusters = data.copy()
data_clusters['Cluster'] = y

**We created a dataframe with a column holding the cluster label assigned to each CustomerID.**

In [47]:
data_clusters.head()

index,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100),Cluster
0,1,Male,19,15,39,3
1,2,Male,21,15,81,1
2,3,Female,20,16,6,3
3,4,Female,23,16,77,1
4,5,Female,31,17,40,3


In [48]:
fig = px.scatter(data_clusters, x = "Annual Income (k$)", y = "Spending Score (1-100)", 
                 color = "Cluster",
                 size = 'Age',
                 title = "Clusters Visualization")
fig.show()

**Conclusion :**
* Low Annual Income (income <= 39) and Low Spending Score (score <= 40). - **Cluster 3 (Orange).**
* Low Annual Income (income <= 39) and High Spending Score (score >= 61). - **Cluster 1 (Purple).**
* Mid Annual Income (39 <= income <= 76) and Mid Spending Score (34 <= score <= 61). - **Cluster 0 (Blue).**
* High Annual Income (income >= 70) and Low Spending Score (score <= 39). - **Cluster 4 (Yellow).**
* High Annual Income (income >= 69) and High Spending Score (score >= 63). - **Cluster 2 (Pink).**

**Statistics about the Clusters :**

In [78]:
data_c0 = data_clusters[data_clusters['Cluster'] == 0]
data_c0.describe()

statistic,CustomerID,Age,Annual Income (k$),Spending Score (1-100),Cluster
count,81.0,81.0,81.0,81.0,81.0
mean,86.320988,42.716049,55.296296,49.518519,0.0
std,24.240889,16.447822,8.988109,6.530909,0.0
min,44.0,18.0,39.0,34.0,0.0
25%,66.0,27.0,48.0,44.0,0.0
50%,86.0,46.0,54.0,50.0,0.0
75%,106.0,54.0,62.0,55.0,0.0
max,143.0,70.0,76.0,61.0,0.0


In [79]:
data_c1 = data_clusters[data_clusters['Cluster'] == 1]
data_c1.describe()

statistic,CustomerID,Age,Annual Income (k$),Spending Score (1-100),Cluster
count,22.0,22.0,22.0,22.0,22.0
mean,23.090909,25.272727,25.727273,79.363636,1.0
std,13.147185,5.25703,7.566731,10.504174,0.0
min,2.0,18.0,15.0,61.0,1.0
25%,12.5,21.25,19.25,73.0,1.0
50%,23.0,23.5,24.5,77.0,1.0
75%,33.5,29.75,32.25,85.75,1.0
max,46.0,35.0,39.0,99.0,1.0


In [80]:
data_c2 = data_clusters[data_clusters['Cluster'] == 2]
data_c2.describe()

statistic,CustomerID,Age,Annual Income (k$),Spending Score (1-100),Cluster
count,39.0,39.0,39.0,39.0,39.0
mean,162.0,32.692308,86.538462,82.128205,2.0
std,22.803509,3.72865,16.312485,9.364489,0.0
min,124.0,27.0,69.0,63.0,2.0
25%,143.0,30.0,75.5,74.5,2.0
50%,162.0,32.0,79.0,83.0,2.0
75%,181.0,35.5,95.0,90.0,2.0
max,200.0,40.0,137.0,97.0,2.0


In [81]:
data_c3 = data_clusters[data_clusters['Cluster'] == 3]
data_c3.describe()

statistic,CustomerID,Age,Annual Income (k$),Spending Score (1-100),Cluster
count,23.0,23.0,23.0,23.0,23.0
mean,23.0,45.217391,26.304348,20.913043,3.0
std,13.56466,13.228607,7.893811,13.017167,0.0
min,1.0,19.0,15.0,3.0,3.0
25%,12.0,35.5,19.5,9.5,3.0
50%,23.0,46.0,25.0,17.0,3.0
75%,34.0,53.5,33.0,33.5,3.0
max,45.0,67.0,39.0,40.0,3.0


In [82]:
data_c4 = data_clusters[data_clusters['Cluster'] == 4]
data_c4.describe()

statistic,CustomerID,Age,Annual Income (k$),Spending Score (1-100),Cluster
count,35.0,35.0,35.0,35.0,35.0
mean,164.371429,41.114286,88.2,17.114286,4.0
std,21.457325,11.341676,16.399067,9.952154,0.0
min,125.0,19.0,70.0,1.0,4.0
25%,148.0,34.0,77.5,10.0,4.0
50%,165.0,42.0,85.0,16.0,4.0
75%,182.0,47.5,97.5,23.5,4.0
max,199.0,59.0,137.0,39.0,4.0


## Affinity Propagation :

In [149]:
from sklearn.cluster import AffinityPropagation

In [143]:
AP = AffinityPropagation(random_state = 0)
y = AP.fit_predict(x)

In [144]:
data_clusters = data.copy()
data_clusters['Cluster'] = y

In [145]:
fig = px.scatter(data_clusters, x = "Annual Income (k$)", y = "Spending Score (1-100)", 
                 color = "Cluster",
                 size = 'Age',
                 title = "Clusters Visualization")
fig.show()

**Affinity Propagation finds 10 clusters, which doesn't look like a good result.**

`[Other Clustering Algorithms coming soon, stay tuned hehe! :)]`