# PCA and other Dimensionality Reduction Techniques

## Kernel-PCA Example

In [None]:
import matplotlib.pyplot as plt

import pandas as pd
import plotly.express as px

from sklearn.cluster import MiniBatchKMeans, DBSCAN
from sklearn.datasets import make_circles, load_wine
from sklearn.decomposition import KernelPCA, PCA
from sklearn.preprocessing import StandardScaler

In [None]:
# CHANGE THESE PARAMETERS TO CREATE OTHER CIRCLE-SHAPED DATASETS:
input_features, _ = make_circles(
    n_samples = 1000,
    factor=0.25,
    noise=0.1
)
df = pd.DataFrame(input_features)
df.columns = ['x1', 'x2']

In [None]:
plt.scatter(df['x1'], df['x2'])
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Original Data Spread')
plt.show()

The data is not linearly separable, so K-means does not produce the desired clusters:

In [None]:
kmeans = MiniBatchKMeans(
    n_clusters=2,
    init='k-means++',
)
kmeans_clusters = kmeans.fit_predict(df[['x1', 'x2']])

df['kmeans'] = kmeans_clusters

clus1 = df[df['kmeans'] == 0]
clus2 = df[df['kmeans'] == 1]

plt.scatter(clus1['x1'], clus1['x2'], color='b')
plt.scatter(clus2['x1'], clus2['x2'], color='r')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('K-means Clusters on Original Data')
plt.show()

In [None]:
# ************************************************************************************
# I had to tweak this for a while until I got the right value of the eps distance!!!
# This is not practical in some situations...
# ************************************************************************************
dbscan = DBSCAN(
    eps=0.15,  
    min_samples=5
)
dbscan_clusters = dbscan.fit_predict(df[['x1', 'x2']])
df['dbscan'] = dbscan_clusters

clus1 = df[df['dbscan'] == 0]
clus2 = df[df['dbscan'] == 1]

plt.scatter(clus1['x1'], clus1['x2'], color='b')
plt.scatter(clus2['x1'], clus2['x2'], color='r')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('DBSCAN Clusters on Original Data')
plt.show()
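
Hand-tuning `eps` can be partly replaced by a k-distance heuristic. This is a common rule of thumb, not part of the workflow above. The sketch below assumes a freshly generated circles dataset (the `random_state=0` seed and the variable names are illustrative). It looks at each point's distance to its 5th nearest neighbour, matching `min_samples=5`. A reasonable `eps` usually sits near the "elbow" of the sorted distances.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.neighbors import NearestNeighbors

# Illustrative data: same shape parameters as above, with a fixed seed (assumption)
points, _ = make_circles(n_samples=1000, factor=0.25, noise=0.1, random_state=0)

min_samples = 5
# kneighbors() on the training data counts each point as its own nearest
# neighbour (distance 0), which mirrors how DBSCAN counts min_samples
neighbours = NearestNeighbors(n_neighbors=min_samples).fit(points)
distances, _ = neighbours.kneighbors(points)

# Sorted distance to the min_samples-th neighbour; plotting this curve shows the elbow
k_distances = np.sort(distances[:, -1])

for q in (50, 90, 95, 99):
    print(f'{q}th percentile of k-distance: {np.percentile(k_distances, q):.3f}')
```

Plotting `k_distances` with `plt.plot` gives the usual elbow chart. The percentiles are only a quick numeric summary, and the value still needs a visual sanity check.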

Let's use KernelPCA with an RBF kernel! The kernel trick implicitly maps the data into a higher-dimensional space in which the two rings become linearly separable.

**In this case we are not using Kernel-PCA to reduce dimensionality.** We are only using it to project the data into a different 2D space where the two groups are linearly separable. The original circles were already separated in the 2D plane, just not by a straight line.
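
To make the kernel trick less of a black box, here is a minimal sketch of Kernel-PCA done by hand with NumPy, under a few assumptions: a smaller dataset (200 points, fixed seed) and the dense eigen-solver, so the result can be compared against `KernelPCA`. The variable names are illustrative. The steps are: build the RBF kernel matrix, centre it, and scale the top eigenvectors by the square root of their eigenvalues.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Small illustrative dataset (assumption: 200 points, fixed seed)
X, _ = make_circles(n_samples=200, factor=0.25, noise=0.1, random_state=0)
gamma = 10

# 1. RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)

# 2. Centre the kernel matrix (equivalent to centring in the implicit feature space)
n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)
K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# 3. Eigen-decomposition; keep the two largest eigenvalues
eigvals, eigvecs = np.linalg.eigh(K_centered)
top = np.argsort(eigvals)[::-1][:2]
manual_projection = eigvecs[:, top] * np.sqrt(eigvals[top])

# Compare with scikit-learn (component signs are arbitrary, so compare magnitudes)
sk_projection = KernelPCA(
    n_components=2, kernel='rbf', gamma=gamma, eigen_solver='dense'
).fit_transform(X)
max_gap = np.max(np.abs(np.abs(manual_projection) - np.abs(sk_projection)))
print('Largest difference in magnitude:', max_gap)
```

Note that the data is never mapped explicitly into the higher-dimensional space: everything is computed from pairwise kernel values, which is the point of the kernel trick.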

In [None]:
# PLAY WITH THE INPUT PARAMETERS TO SEE HOW THIS WORKS:
kpca = KernelPCA(
    n_components = 2,
    kernel='rbf',
    fit_inverse_transform=True,
    gamma=10,
    random_state=1000
)
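# Fit on the coordinates only: by now df also holds the 'kmeans' and 'dbscan' label columns
transformed_data = kpca.fit_transform(df[['x1', 'x2']])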

In [None]:
components = pd.DataFrame(transformed_data)
components.columns = ['pc1', 'pc2']

In [None]:
plt.scatter(components['pc1'], components['pc2'])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Principal Components Spread')
plt.show()

The two clusters are now clearly, and linearly, separated! Even the simple K-means algorithm should now be able to recognise the two classes. Let's reuse the exact same model as before:

In [None]:
kmeans_clusters = kmeans.fit_predict(components[['pc1', 'pc2']])

components['kmeans_pca'] = kmeans_clusters

clus1 = components[components['kmeans_pca'] == 0]
clus2 = components[components['kmeans_pca'] == 1]

plt.scatter(clus1['pc1'], clus1['pc2'], color='b')
plt.scatter(clus2['pc1'], clus2['pc2'], color='r')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Principal Components Spread with K-means Applied')
plt.show()

Not perfect, but a much better approximation than the initial K-means. Let's plot these clusters on the original data now:

In [None]:
df['kmeans_pca'] = components['kmeans_pca']

clus1 = df[df['kmeans_pca'] == 0]
clus2 = df[df['kmeans_pca'] == 1]

plt.scatter(clus1['x1'], clus1['x2'], color='b')
plt.scatter(clus2['x1'], clus2['x2'], color='r')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('K-means on Kernel-PCA Output, Shown on Original Data')
plt.show()
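
The plots give a visual impression only. One way to put a number on "much better" is the adjusted Rand index against the true ring labels that `make_circles` returns (the notebook above discards them as `_`). The sketch below is self-contained and makes several assumptions: a fixed seed, plain `KMeans` with `n_init=10` swapped in for `MiniBatchKMeans` to reduce run-to-run variation, and illustrative variable names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.metrics import adjusted_rand_score

# Keep the true ring labels this time (assumption: fixed seed for repeatability)
points, ring_labels = make_circles(n_samples=1000, factor=0.25, noise=0.1, random_state=0)


def kmeans_labels(data):
    """Cluster into two groups with plain K-means (swapped in for MiniBatchKMeans)."""
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)


# K-means directly on the raw coordinates
ari_raw = adjusted_rand_score(ring_labels, kmeans_labels(points))

# K-means on the Kernel-PCA projection
projected = KernelPCA(
    n_components=2, kernel='rbf', gamma=10, random_state=0
).fit_transform(points)
ari_kpca = adjusted_rand_score(ring_labels, kmeans_labels(projected))

print('Adjusted Rand index, raw data:        ', round(ari_raw, 3))
print('Adjusted Rand index, after Kernel-PCA:', round(ari_kpca, 3))
```

An index near 0 means the clusters are no better than random with respect to the rings, and 1 means a perfect match. The index ignores how the cluster numbers are assigned, so label swaps between runs do not matter.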

## PCA Example Without a Preset Number of Components

In this example we will see how to create a PCA model that automatically selects the number of components.

**Note:** There are a number of ways in which the `n_components` parameter can be set:
* If it is an integer of at least 1, it is the number of components to keep, as we already saw.
* If it is a value between 0 and 1 and the `svd_solver` is `full`, then `n_components` is the fraction of variance we want to retain in the data, and PCA keeps the smallest number of components whose cumulative explained variance exceeds that fraction.
* If its value is `mle` and `svd_solver` is `full`, the number of components is estimated automatically using Minka's MLE method.
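
The variance-fraction option can be reproduced by hand, which may make it easier to see what it does. The sketch below assumes the scaled wine data and a 0.65 target. The names `cumulative` and `n_needed` are illustrative. It fits a full PCA, accumulates the explained variance ratios, and picks the first component count whose cumulative ratio exceeds the target.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = StandardScaler().fit_transform(load_wine().data)

# Fit PCA with every component and accumulate the explained variance ratios
full_pca = PCA(svd_solver='full').fit(features)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds the target
target = 0.65
n_needed = int(np.searchsorted(cumulative, target, side='right')) + 1

# Let PCA make the same choice itself
auto_pca = PCA(n_components=target, svd_solver='full').fit(features)

print('Components needed (manual):', n_needed)
print('Components kept by PCA:    ', auto_pca.n_components_)
```

Printing `cumulative` as well shows how quickly the retained variance grows with each extra component, which helps when choosing the target fraction.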

In [None]:
wine_df = load_wine(return_X_y=True, as_frame=True)
wine_df = wine_df[0]

In [None]:
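# Important: Scale/Normalise your data!
# with_mean and with_std are boolean flags (both True by default), not target values,
# so the defaults already give zero mean and unit variance
wine_df = StandardScaler().fit_transform(wine_df)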
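
To see why scaling matters here, the following sketch compares the explained variance ratios of a 2-component PCA on the raw and on the standardised wine features. It assumes the same dataset, and the variable names are illustrative. Without scaling, features measured on large numeric scales dominate the first component.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

raw = load_wine().data
scaled = StandardScaler().fit_transform(raw)

# Same PCA, fitted once on the raw features and once on the standardised ones
raw_ratio = PCA(n_components=2).fit(raw).explained_variance_ratio_
scaled_ratio = PCA(n_components=2).fit(scaled).explained_variance_ratio_

print('Raw data, variance ratio of PC1 and PC2:   ', raw_ratio.round(3))
print('Scaled data, variance ratio of PC1 and PC2:', scaled_ratio.round(3))
```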

In [None]:
# I have no idea of how many components I should set... let's do 10...
vanilla_pca = PCA(n_components=10, svd_solver='auto')

v_pca = vanilla_pca.fit_transform(wine_df)
print('Shape of the PCA-transformed data: ',v_pca.shape)
print('Those are', v_pca.shape[1], 'components')
print('Variance explained: ', round(100*sum(vanilla_pca.explained_variance_ratio_), 2), '% of the total')

In [None]:
# Let's see what selection the MLE (Maximum Likelihood Estimator) comes up with:
mle_pca = PCA(n_components='mle', svd_solver='full')

mle_pca_data = mle_pca.fit_transform(wine_df)
print('Shape of the PCA-transformed data: ',mle_pca_data.shape)
print('Those are', mle_pca_data.shape[1], 'components')
print('Variance explained: ', round(100*sum(mle_pca.explained_variance_ratio_), 2), '% of the total')

In [None]:
# Let's see how many components we get if we are happy to maintain 65% of the variance:
# FEEL FREE TO CHANGE THAT PARAMETER TO SEE HOW MANY COMPONENTS ARE LEFT
var_pca = PCA(n_components=0.65, svd_solver='full')

var_pca_data = var_pca.fit_transform(wine_df)
print('Shape of the PCA-transformed data: ',var_pca_data.shape)
print('Those are', var_pca_data.shape[1], 'components')
print('Variance explained: ', round(100*sum(var_pca.explained_variance_ratio_), 2), '% of the total')

In [None]:
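# Name the columns after however many components were kept (3 for the 65% setting)
result = pd.DataFrame(var_pca_data, columns=[f'PC{i + 1}' for i in range(var_pca_data.shape[1])])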

With just 3 components we can maintain over 65% of the variance of the original data!

Since there are just 3 components, we can plot them in a 3D chart.

**Note: for some reason, I cannot visualise 3D charts in Jupyter Lab, only in Jupyter Notebook.**

Note that in the chart the X-axis corresponds to the first principal component (the most important one), followed by the Y-axis (second) and the Z-axis (third).

If you explore the chart below a little, you may notice that there seem to be 3 clusters: those could be the three classes of our original data*. Maybe? Let's run K-means with K=3 on it and see.

*Yes, one thing we already know is that this dataset has 3 classes... In a real-world scenario we wouldn't know how many classes there really are. However, lowering the dimension can help us visualise the data. In this case, the 3D chart seems to show 3 clusters.

In [None]:
fig = px.scatter_3d(result, x='PC1', y='PC2', z='PC3',
                    color_continuous_scale='Rainbow')
fig.show()

In [None]:
kmeans = MiniBatchKMeans(
    n_clusters=3,
    init='k-means++',
)
kmeans_clusters = kmeans.fit_predict(result[['PC1', 'PC2', 'PC3']])
result['kmeans'] = kmeans_clusters

fig = px.scatter_3d(result, x='PC1', y='PC2', z='PC3',
                    color='kmeans', color_continuous_scale='Rainbow')
fig.show()

Aha! That 3D chart looks good. The 3 clusters seem more or less well defined.

# Questions and learning exercises:

Have we been able to discover the 3 original classes of the dataset using Dimensionality Reduction without using the original target feature? What do you think? Can you prove it? (using the original target feature of the dataset or otherwise)

Can you achieve the same clustering result with DBSCAN, OPTICS or SpectralClustering? What about Gaussian Mixture Models? 