# Students Do: PCA in Action

In this activity, you will use PCA to reduce the dimensions of the customers shopping dataset from `4` to `2` features. After applying PCA, you will use the principal components data to fit a K-Means model with `k=6` and draw some conclusions.

In [1]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import plotly.express as px


## Instructions

1. Import the preprocessed data from the customers shopping dataset into a DataFrame called `df_shopping`.

In [2]:
# Import data
file_path = Path("Data/shopping_data_cleaned.csv")
df_shopping = pd.read_csv(file_path)
df_shopping.head()


Unnamed: 0,Genre,Age,Annual Income,Spending Score (1-100)
0,1,19,15.0,39
1,1,21,15.0,81
2,0,20,16.0,6
3,0,23,16.0,77
4,0,31,17.0,40


2. Standardize the data of all the DataFrame features.

In [3]:
# Standardize data
shopping_scaled = StandardScaler().fit_transform(df_shopping)
print(shopping_scaled[0:5])


[[ 1.12815215 -1.42456879 -1.73899919 -0.43480148]
 [ 1.12815215 -1.28103541 -1.73899919  1.19570407]
 [-0.88640526 -1.3528021  -1.70082976 -1.71591298]
 [-0.88640526 -1.13750203 -1.70082976  1.04041783]
 [-0.88640526 -0.56336851 -1.66266033 -0.39597992]]
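
As a rough illustration of what the scaling step does, the sketch below reproduces `StandardScaler` by hand on a small made-up array. The `toy` array and the variable names are assumptions for the example only and are not part of the shopping dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (assumption): three rows, two features
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# StandardScaler subtracts each column's mean and divides by its
# population standard deviation (ddof=0, NumPy's default for std)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)
scaled = StandardScaler().fit_transform(toy)

# Both approaches should agree up to floating-point error
print(np.allclose(manual, scaled))

# After scaling, every column has mean 0 and standard deviation 1
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Putting every feature on the same scale matters here because PCA looks for directions of maximum variance, so an unscaled feature with large values would otherwise dominate the components.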


3. Apply PCA to reduce dimensions from 4 to 2 and create a DataFrame with the principal components data.

In [4]:
# Initialize PCA model
pca = PCA(n_components=2)

# Get two principal components for the data.
shopping_pca = pca.fit_transform(shopping_scaled)



In [5]:
# Transform PCA data to a DataFrame
df_shopping_pca = pd.DataFrame(
    data=shopping_pca, columns=["principal component 1", "principal component 2"]
)
df_shopping_pca.head()


Unnamed: 0,principal component 1,principal component 2
0,-0.406383,-0.520714
1,-1.427673,-0.36731
2,0.050761,-1.894068
3,-1.694513,-1.631908
4,-0.313108,-1.810483


4. Fetch the explained variance, analyze its value, and answer the following question: Is two principal components the best number of new dimensions?

In [6]:
# Fetch the explained variance
pca.explained_variance_ratio_


array([0.33690046, 0.26230645])

**Sample Answer**

According to the explained variance, the first principal component accounts for `33.7%` of the variance and the second for `26.2%`. Together they retain only `59.9%` of the information in the original dataset, so it is worth increasing the number of principal components to three to check whether this ratio improves.
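
One possible way to judge how many components to keep is to look at the cumulative explained variance. The sketch below uses synthetic random data as a stand-in (an assumption, not the shopping data), and `demo`, `pca_full` and `cumulative` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 4 features
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 4))
demo_scaled = StandardScaler().fit_transform(demo)

# Fit with every component kept so the full variance profile is visible
pca_full = PCA().fit(demo_scaled)

# Running total of the variance retained by the first n components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for n, total in enumerate(cumulative, start=1):
    print(f"{n} component(s): {total:.1%} of the variance")
```

Running the same loop on `shopping_scaled` would show how much variance each additional component adds before committing to a value of `n_components`.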

5. If you conclude that two principal components is the appropriate number of new dimensions, proceed to step 6. Otherwise, explore what happens when you modify the number of principal components. Once you finish, write your conclusions.

In [6]:
# Initialize PCA model
pca = PCA(n_components=3)

# Get three principal components for the shopping data.
shopping_pca = pca.fit_transform(shopping_scaled)



In [7]:
# Transform PCA data to a DataFrame
df_shopping_pca = pd.DataFrame(
    data=shopping_pca,
    columns=["principal component 1", "principal component 2", "principal component 3"],
)
df_shopping_pca.head()



Unnamed: 0,principal component 1,principal component 2,principal component 3
0,-0.406383,-0.520714,-2.072527
1,-1.427673,-0.36731,-2.277644
2,0.050761,-1.894068,-0.367375
3,-1.694513,-1.631908,-0.717467
4,-0.313108,-1.810483,-0.42646


In [8]:
# Fetch the explained variance
pca.explained_variance_ratio_


array([0.33690046, 0.26230645, 0.23260639])

**Sample Conclusions**

With three principal components, we retain `83.2%` of the information in the original dataset, so we can conclude that using three principal components is a better approach to reducing the dimensions in this case.
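
The retained share can be checked directly by summing the ratios returned by `pca.explained_variance_ratio_`. The short sketch below copies the three values from the output above; the variable names are illustrative.

```python
# Explained variance ratios copied from the output above
ratios = [0.33690046, 0.26230645, 0.23260639]

# Variance retained by the first two and by all three components
retained_two = sum(ratios[:2])
retained_three = sum(ratios)

print(f"Two components:   {retained_two:.1%}")
print(f"Three components: {retained_three:.1%}")
```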

6. Fit the K-Means algorithm with `k=6` and the principal components data.

In [9]:
# Initialize the K-Means model
model = KMeans(n_clusters=6, random_state=0)

# Fit the model
model.fit(df_shopping_pca)

# Predict clusters
predictions = model.predict(df_shopping_pca)

# Add the predicted class column
df_shopping_pca["class"] = predictions
df_shopping_pca.head()


Unnamed: 0,principal component 1,principal component 2,principal component 3,class
0,-0.406383,-0.520714,-2.072527,3
1,-1.427673,-0.36731,-2.277644,3
2,0.050761,-1.894068,-0.367375,2
3,-1.694513,-1.631908,-0.717467,3
4,-0.313108,-1.810483,-0.42646,2
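
For reference, fitting and labeling can also be done in a single call. The sketch below is a minimal example on synthetic, well-separated blobs (an assumption, not the shopping data); `points` and `demo_model` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs in three dimensions (assumption)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 3))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 3))
points = np.vstack([blob_a, blob_b])

# fit_predict fits the model and returns the cluster label of each row
demo_model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = demo_model.fit_predict(points)

# The returned labels are the same ones stored in labels_
print(np.array_equal(labels, demo_model.labels_))

# Number of points assigned to each cluster
print(np.bincount(labels))
```

Note that cluster numbers are arbitrary identifiers, so the same grouping can come back with different label values under a different `random_state`.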


7. Plot the resulting clusters. Use the appropriate scatter plot depending on the number of dimensions you have.

In [11]:
# With three principal components, create a 3D scatter plot.
fig = px.scatter_3d(
    df_shopping_pca,
    x="principal component 3",
    y="principal component 2",
    z="principal component 1",
    color="class",
    symbol="class",
    width=800,
)
fig.update_layout(legend=dict(x=0, y=1))
fig.show()
