# Students Do: PCA in Action

In this activity, you will use PCA to reduce the dimensions of the customers shopping dataset from `4` to `2` features. After applying PCA, you will use the principal components data to fit a K-Means model with `k=6` and draw some conclusions.

In [54]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import plotly.express as px
import hvplot.pandas

## Instructions

1. Import the preprocessed data from the customers shopping dataset into a DataFrame called `df_shopping`.

In [55]:
# Import data
file_path = Path("./Resources/shopping_data_cleaned.csv")
df_shopping = pd.read_csv(file_path)
df_shopping.head()

Unnamed: 0,Gender,Age,Annual Income,Spending Score (1-100)
0,1,19,15.0,39
1,1,21,15.0,81
2,0,20,16.0,6
3,0,23,16.0,77
4,0,31,17.0,40


2. Standardize the data of all the DataFrame features.

In [56]:
# Standardize data
shopping_scaled = StandardScaler().fit_transform(df_shopping)
shopping_scaled[0:3]

array([[ 1.12815215, -1.42456879, -1.73899919, -0.43480148],
       [ 1.12815215, -1.28103541, -1.73899919,  1.19570407],
       [-0.88640526, -1.3528021 , -1.70082976, -1.71591298]])
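
To make the scaling step concrete, the sketch below checks that `StandardScaler` is equivalent to subtracting each column's mean and dividing by its population standard deviation. The `toy` array is a hypothetical stand-in, not the shopping dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 5 rows, 2 features (not the shopping dataset)
toy = np.array(
    [[19.0, 15.0], [21.0, 15.0], [20.0, 16.0], [23.0, 16.0], [31.0, 17.0]]
)

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(toy)

# Manual z-score; NumPy's default std (ddof=0) matches what StandardScaler uses
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Both approaches should agree, and each scaled column
# should have mean 0 and standard deviation 1
print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Because every feature ends up on the same scale, no single column (for example, one with a large numeric range) dominates the principal components.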

3. Apply PCA to reduce dimensions from 4 to 2 and create a DataFrame with the principal components data.

In [57]:
# Initialize PCA model
pca = PCA(n_components=2)

# Get two principal components for the data.
shopping_pca = pca.fit_transform(shopping_scaled)

In [58]:
# Transform PCA data to a DataFrame
df_shopping_pca = pd.DataFrame(
    data=shopping_pca, 
    columns=["principal component 1", "principal component 2"])
    
df_shopping_pca.head()

Unnamed: 0,principal component 1,principal component 2
0,-0.406383,-0.520714
1,-1.427673,-0.36731
2,0.050761,-1.894068
3,-1.694513,-1.631908
4,-0.313108,-1.810483


4. Fetch the explained variance, analyze its value, and answer the following question: Is two the best number of principal components?

In [59]:
# Fetch the explained variance
ratio = pca.explained_variance_ratio_

**Write Your Answer Here**



In [60]:
print(f'Total for two components: {(ratio[0] + ratio[1])*100}%')
print(f'We can do better')

Total for two components: 59.92069019819846%
We can do better
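
One possible way to answer this kind of question more systematically is to fit PCA with all components and look at the cumulative explained variance. The sketch below uses randomly generated stand-in data and an assumed 80% threshold; neither comes from this activity, so treat the names `toy`, `threshold`, and `n_needed` as illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with 4 features (not the shopping dataset)
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 4))
toy_scaled = StandardScaler().fit_transform(toy)

# Fit PCA keeping every component, then accumulate the variance ratios
pca_full = PCA().fit(toy_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Assumed threshold: keep the fewest components that explain at least 80%
threshold = 0.80
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative)
print(f"Components needed for {threshold:.0%}: {n_needed}")
```

The threshold itself is a judgment call; the point is only that the cumulative curve shows how much each extra component adds.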


5. If you conclude that two principal components is the appropriate number of new dimensions, proceed to step 6. Otherwise, explore what happens if you modify the number of principal components. Once you finish, write your conclusions.

In [61]:
# Initialize PCA model
pca = PCA(n_components=3)

# Get three principal components for the shopping data.
shopping_pca = pca.fit_transform(shopping_scaled)

In [62]:
# Transform PCA data to a DataFrame
df_shopping_pca = pd.DataFrame(
    shopping_pca,
    columns=["principal component 1", "principal component 2", "principal component 3"])
df_shopping_pca.head()

Unnamed: 0,principal component 1,principal component 2,principal component 3
0,-0.406383,-0.520714,-2.072527
1,-1.427673,-0.36731,-2.277644
2,0.050761,-1.894068,-0.367375
3,-1.694513,-1.631908,-0.717467
4,-0.313108,-1.810483,-0.42646


In [63]:
# Fetch the explained variance
ratio = pca.explained_variance_ratio_

**Write Your Conclusions Here**



In [64]:
print(f'Total for three components: {(ratio[0] + ratio[1] + ratio[2])*100}%')
print('Explained variance improves with three components')

Total for three components: 83.18132878845951%
Explained variance improves with three components


6. Fit a K-Means model with `k=6` to the principal components data.

In [65]:
def getInertiaPlot(data):
    inertia = []
    k = list(range(1, 11))

    # Calculate the inertia for the range of k values
    for i in k:
        km = KMeans(n_clusters=i, random_state=0)
        km.fit(data)
        inertia.append(km.inertia_)

    # Create the Elbow Curve using hvPlot
    elbow_data = {"k": k, "inertia": inertia}
    df_elbow = pd.DataFrame(elbow_data)
    fig = df_elbow.hvplot.line(x="k", y="inertia", title="Elbow Curve", xticks=k)
    return fig

In [66]:
fig = getInertiaPlot(df_shopping_pca)
fig
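
For reference, the inertia plotted in the elbow curve is the sum of squared distances from each sample to its assigned cluster center. The sketch below recomputes it by hand on synthetic blobs (an assumption for illustration, not the shopping data) and compares it with `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical synthetic data: 150 points around 3 centers
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# n_init is set explicitly so the behavior does not depend on the library default
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the center assigned to each sample, then sum the squared distances
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

# The manual value should match the attribute reported by scikit-learn
print(np.isclose(manual_inertia, km.inertia_))
```

Inertia always decreases as `k` grows, which is why the elbow (the point where the decrease flattens) is used rather than the minimum.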

In [67]:
# Initialize the K-Means model
model = KMeans(n_clusters=6, random_state=0)

# Fit the model
model.fit(df_shopping_pca)

# Predict clusters
predictions = model.predict(df_shopping_pca)

# Add the predicted class column
df_shopping_pca["class"] = predictions
df_shopping_pca.head()

Unnamed: 0,principal component 1,principal component 2,principal component 3,class
0,-0.406383,-0.520714,-2.072527,3
1,-1.427673,-0.36731,-2.277644,3
2,0.050761,-1.894068,-0.367375,2
3,-1.694513,-1.631908,-0.717467,3
4,-0.313108,-1.810483,-0.42646,2


7. Plot the resulting clusters using the scatter plot appropriate for the number of dimensions you have.

In [68]:
fig = px.scatter_3d(
    df_shopping_pca,
    x="principal component 1",
    y="principal component 2",
    z="principal component 3",
    color="class",
    symbol="class",
    width=1000,
)
fig.update_layout(legend=dict(x=0, y=1))
fig.show()
