# Dimensionality Reduction Techniques: PCA, t-SNE, LDA
This notebook will perform and visualize various dimensionality reduction techniques on Scikit-learn's Breast Cancer Dataset.
Techniques covered include:
- Principal Component Analysis
- t-Distributed Stochastic Neighbor Embedding
- Linear Discriminant Analysis

In [27]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as pex
%matplotlib inline


In [28]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data['data'],columns=data['feature_names'])

In [30]:
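# Standardize each feature to zero mean and unit variance (pandas std uses ddof=1)
df = (df - df.mean(axis=0)) / df.std(axis=0)
df['status'] = data['target'] # Class labels: 0 = malignant, 1 = benign
df.head()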
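
The manual scaling above relies on pandas' `std`, which defaults to the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The following is a minimal sketch on a small made-up table (the `toy` values are hypothetical) showing that the two conventions differ only by a constant factor:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

# Manual z-score as in the notebook: pandas std uses ddof=1
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# StandardScaler uses ddof=0, so its output differs by a constant factor
scaled = StandardScaler().fit_transform(toy)

n = len(toy)
factor = np.sqrt((n - 1) / n)  # ratio between the two conventions
same_up_to_factor = np.allclose(manual.values, scaled * factor)
print(same_up_to_factor)
```

For a dataset with hundreds of rows the factor is very close to 1, so either convention should give nearly the same projections.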

## Principal Component Analysis (PCA)
This technique involves projecting the high-dimensional data onto orthogonal axes that maximize the variance in the data.

- We will perform this manually by taking the eigendecomposition of the covariance matrix of the data. 
- The eigenvectors represent the orthogonal axes which we will project the data onto. 
- Each eigenvalue represents the variance of the data when projected onto the axis represented by its corresponding eigenvector.
- To visualize the data in reduced dimensions, we will choose 3 axes / eigenvectors that preserve the most variance in the data.
- Therefore, these chosen eigenvectors will have the largest eigenvalues.
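
The steps listed above can be sketched on a small synthetic dataset before applying them to the real one. This is only an illustration under stated assumptions: the data is randomly generated, and `np.linalg.eigh` is used here because the covariance matrix is symmetric. The sketch checks that the variance of the data projected onto each eigenvector equals the corresponding eigenvalue and that the eigenvectors are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, correlated 3-feature data (illustrative only)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing
X = X - X.mean(axis=0)  # center the data

cov = np.cov(X, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the eigenvectors and measure the variance along each axis
projected = X @ eigvecs
projected_var = projected.var(axis=0, ddof=1)  # ddof=1 matches np.cov

variances_match = np.allclose(projected_var, eigvals)
axes_orthonormal = np.allclose(eigvecs.T @ eigvecs, np.eye(3))
print(variances_match, axes_orthonormal)
```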


In [31]:
cov_matrix = np.cov(df.drop('status',axis=1), rowvar=False)
cov_matrix.shape

In [32]:
values, vectors = np.linalg.eig(cov_matrix)

In [33]:
print("Values shape: " + str(values.shape))
print("Vectors shape: " + str(vectors.shape))

In [34]:
"""
- Each eigenvector is a column of `vectors` and has 30 entries, one for each feature.
- You can see how much each feature "contributes" (feature importance) to each axis / eigenvector
  by comparing the absolute values of entries in each eigenvector. For example, the first eigenvector:
"""

vectors[:,0]

In [35]:
plt.figure(figsize=(6,4),dpi=100)
plt.plot(values,marker='s',color='red',lw=1)
plt.xlabel('Number corresponding to each Eigenvalue')
plt.ylabel('Variance')
plt.title('Variances for Each Axis')
plt.show()
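
The eigenvalues plotted above can also be read as fractions of the total variance. Below is a minimal, self-contained sketch (it reloads and standardizes the data rather than reusing the notebook's variables; `explained_ratio` and `cumulative` are names introduced here) that computes the share of variance captured by each axis and the running total:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer()["data"]
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # same scaling as above

# eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

explained_ratio = eigvals / eigvals.sum()  # share of variance per axis
cumulative = np.cumsum(explained_ratio)    # running total
print(cumulative[:3])
```

The cumulative share after three axes gives a rough sense of how much structure the 3D plot can show.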

In [36]:
# Let's take the top 3 eigenvectors and project data onto them.
# np.linalg.eig does not guarantee sorted eigenvalues, so pick the 3 largest explicitly.
top_3_vectors = vectors[:, np.argsort(values)[::-1][:3]]

In [37]:
# Project data down to 3 axes by computing dot product
principal_comp = np.dot(df.drop('status',axis=1).values, top_3_vectors)

In [38]:
print(principal_comp.shape) # (569, 3): one row per sample, one column per principal component

In [121]:
pex.scatter_3d(x=principal_comp[:,0],y=principal_comp[:,1],z=principal_comp[:,2],color=df.status, color_continuous_scale=pex.colors.sequential.Viridis)

The classes separate clearly in this plot, which illustrates the benefit of dimensionality reduction. With a few more principal components to capture additional variance, we could train an SVM that exploits this separation to predict breast cancer status.
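
As a rough sketch of that idea, the snippet below chains standardization, PCA, and an SVM in a scikit-learn pipeline and scores it with cross-validation. The choices of `n_components=5`, the default RBF kernel, and `cv=5` are arbitrary assumptions for illustration, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scaling and PCA are fit inside each fold, so no test data leaks into them
clf = make_pipeline(StandardScaler(), PCA(n_components=5), SVC())

scores = cross_val_score(clf, X, y, cv=5)  # accuracy per fold
print(scores.mean())
```

Putting PCA inside the pipeline means the projection is learned from the training folds only.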

In [40]:
plt.figure(figsize=(10,6))
sns.heatmap(np.abs(top_3_vectors),yticklabels=df.drop('status',axis=1).columns,cmap = 'coolwarm')

With this plot, we can see which features contribute most to each principal component. For the first, second, and third axes, the features with the largest absolute loadings are
mean concave points, mean fractal dimension, and texture error, respectively.
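
The same reading can be done programmatically instead of by eye. The sketch below is self-contained (it recomputes the eigenvectors rather than reusing the notebook's variables; `top3` and `top_features` are names introduced here) and picks the feature with the largest absolute loading in each of the three leading components:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data["data"]
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # same scaling as above

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
top3 = eigvecs[:, np.argsort(eigvals)[::-1][:3]]  # three largest-variance axes

# Row index of the largest absolute loading in each column (component)
top_idx = np.abs(top3).argmax(axis=0)
top_features = data["feature_names"][top_idx]
print(top_features)
```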

## t-Distributed Stochastic Neighbor Embedding (t-SNE)
This technique maps the data to a lower-dimensional space by matching pairwise similarity scores, computed with a Normal distribution in the original space and a t-distribution in the reduced space.
- It first creates a similarity matrix containing similarity scores for all pairs of data points.
- The similarity score for two points is calculated by converting the distance between them to a probability using a Normal (Gaussian) distribution centered on one of the points.
- All points are then placed in the lower-dimensional space (randomly, or from a PCA projection as with `init='pca'` below). A new similarity matrix is calculated for this space, this time using the t-distribution.
- Using gradient descent, the low-dimensional points are moved so that the new similarity matrix matches the original one as closely as possible, as measured by the KL divergence. This pulls similar points together and pushes dissimilar points apart.
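
The two similarity matrices and the cost that compares them can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it assumes one fixed Gaussian width `sigma` for all points (real t-SNE tunes a width per point from the perplexity and symmetrizes conditional probabilities), it uses random toy data, and it omits the gradient descent step.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distance between every pair of rows in Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(axis=-1)

def gaussian_affinities(X, sigma=1.0):
    """High-dimensional similarities P from a Gaussian kernel (fixed sigma: a simplification)."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)  # a point is not its own neighbor
    return P / P.sum()

def student_t_affinities(Y):
    """Low-dimensional similarities Q from a Student-t kernel with one degree of freedom."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q): the quantity reduced by moving the low-dimensional points."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))         # toy high-dimensional points
Y = rng.normal(size=(20, 2)) * 1e-2  # random low-dimensional starting positions

P = gaussian_affinities(X)
Q = student_t_affinities(Y)
cost = kl_divergence(P, Q)
print(cost)
```

An optimizer would repeatedly update `Y` to lower `cost`; the heavier tails of the t-distribution leave moderately distant points more room to spread out in the low-dimensional space.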

In [115]:
from sklearn.manifold import TSNE
model = TSNE(n_components=2 ,init='pca')

In [116]:
reduced = model.fit_transform(df.drop('status',axis=1))

In [124]:
model.kl_divergence_ # Kullback-Leibler divergence between the high-dimensional and low-dimensional similarity distributions after optimization

In [122]:
pex.scatter(x=reduced[:,0],y=reduced[:,1],color=df['status'], color_continuous_scale=pex.colors.sequential.Viridis)

The plot above illustrates the separation of classes after the t-SNE approach. As you can see, there are some outliers, but overall there is relatively clear class separation. Let's try experimenting with the perplexity hyperparameter below.

In [136]:
perplexity = np.arange(5,55,5) # Hyperparameter that is roughly the effective number of nearest neighbors considered for each point.
kl = []

for i in perplexity:
    model = TSNE(n_components=2 ,init='pca', perplexity=i)
    reduced = model.fit_transform(df.drop('status',axis=1))
    kl.append(model.kl_divergence_)
    
plt.figure(figsize=(6,4),dpi=100)
plt.plot(perplexity,kl,marker='s',color='red',lw=1)
plt.title('KL Divergence After t-SNE Optimization for Varying Perplexity Values')
plt.ylabel('Divergence')
plt.xlabel('Perplexity Values')
plt.show()
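
To make the "number of nearest neighbors" reading of perplexity concrete: perplexity is defined as 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution. The sketch below uses two made-up distributions (the helper name `perplexity` and the example arrays are introduced here for illustration) to show that weight spread evenly over k neighbors gives a perplexity of k, while a peaked distribution gives a smaller value.

```python
import numpy as np

def perplexity(p):
    """Perplexity 2**H(p) of a discrete distribution p, with H the Shannon entropy in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability entries contribute nothing to the entropy
    entropy_bits = -np.sum(p * np.log2(p))
    return 2.0 ** entropy_bits

uniform_over_10 = np.full(10, 0.1)      # ten equally weighted neighbors
peaked = np.array([0.91] + [0.01] * 9)  # one dominant neighbor

perp_uniform = perplexity(uniform_over_10)
perp_peaked = perplexity(peaked)
print(perp_uniform, perp_peaked)
```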

In [135]:
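# Let's try perplexity = 50, since it produced the lowest divergence in the sweep above.
# Caveat: KL divergence tends to decrease as perplexity grows, so this is a heuristic, not proof of a better embedding.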
model = TSNE(n_components=2 ,init='pca', perplexity=50)
reduced = model.fit_transform(df.drop('status',axis=1))
pex.scatter(x=reduced[:,0],y=reduced[:,1],color=df['status'], color_continuous_scale=pex.colors.sequential.Viridis)

## Linear Discriminant Analysis (LDA)
LDA is another dimensionality reduction technique. Like PCA, it projects the data onto new axes; however, it is supervised: it uses the class labels to choose axes that maximize between-class separation while minimizing within-class scatter. With a binary-class dataset, the data is projected onto a line.
With an n-class dataset, the data is projected onto a space of at most n-1 dimensions.
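
For the two-class case, the axis can be written in closed form as Fisher's discriminant direction, proportional to S_w^{-1}(mu1 - mu0), where S_w is the within-class scatter matrix. The following is a minimal sketch on synthetic Gaussian data (all values and names such as `fisher_ratio` are assumptions for illustration, not scikit-learn's internal solver). It compares class separation along that direction with separation along the plain difference of class means.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic Gaussian classes sharing a covariance (illustrative data only)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter: scatter of each class around its own mean, summed
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Fisher's direction for two classes: proportional to S_w^{-1} (mu1 - mu0)
w = np.linalg.solve(S_w, mu1 - mu0)
w /= np.linalg.norm(w)

def fisher_ratio(direction):
    """Squared gap between projected class means over the projected within-class spread."""
    z0, z1 = X0 @ direction, X1 @ direction
    return (z1.mean() - z0.mean()) ** 2 / (z0.var() + z1.var())

ratio_lda = fisher_ratio(w)
ratio_mean_diff = fisher_ratio((mu1 - mu0) / np.linalg.norm(mu1 - mu0))
print(ratio_lda, ratio_mean_diff)
```

Because the features are correlated, the Fisher direction tilts away from the raw mean difference to avoid the direction of largest within-class spread.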

In [138]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as lda

model = lda(n_components=1) # n_components in this case is 1 less than the number of classes.

In [139]:
newData = model.fit_transform(df.drop('status', axis=1), df['status'])
# Column 0: projection onto the discriminant axis; column 1: class label
newData = pd.DataFrame({0: newData[:,0], 1: df['status']})
newData['y'] = 0 # Constant y so the points are drawn along a single line

In [140]:
model.explained_variance_ratio_ # Share of between-class variance explained by each discriminant axis; with a single axis this should be close to 1.0

In [141]:
pex.scatter(newData, x=0, y='y', color=1, color_continuous_scale=pex.colors.sequential.Viridis)

This plot shows the difference in class distributions for the data projected onto a single axis. Overall, we have relatively clear separation. Because the classes separate well along this axis, LDA can also be used as a classifier in addition to a dimensionality reduction technique.
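
A minimal sketch of LDA used as a classifier, assuming 5-fold cross-validation as an arbitrary evaluation choice. It is self-contained and reloads the data rather than reusing the notebook's `model`:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# The same estimator exposes predict(), so it can be scored like any classifier
lda_clf = LinearDiscriminantAnalysis()
scores = cross_val_score(lda_clf, X, y, cv=5)  # accuracy per fold
print(scores.mean())
```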

In [142]:
plt.figure(figsize=(10,6))
sns.heatmap(model.coef_, xticklabels=df.drop('status',axis=1).columns,cmap = 'coolwarm')
plt.show()

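Here, you can see the coefficients that LDA assigns to each feature. Because the features were standardized, their magnitudes can be compared as a rough measure of feature importance; mean radius stands out the most.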