# t-Distributed Stochastic Neighbour Embedding - Lecture 33

COMP 4304 / 6934 \
Terrence Tricco, Nov 2021

t-Distributed Stochastic Neighbour Embedding (t-SNE) is a dimensionality reduction method.

In this notebook, we will explore t-SNE using the method from the scikit-learn library, and compare to dimensionality reduction using PCA.

Sklearn t-SNE API: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html \
User guide: https://scikit-learn.org/stable/modules/manifold.html#t-sne


# Load Libraries

Scikit-learn, or sklearn, can be installed using pip or conda.

For example, ``pip install scikit-learn``. Note that the package is installed as ``scikit-learn`` but imported as ``sklearn``.

Sklearn has many useful tools for machine learning and working with data. We will use the TSNE class from the ``manifold`` module.

In [None]:
import pandas as pd
import plotly.express as px

from sklearn import decomposition
from sklearn import manifold

# Load Data

In [None]:
df = pd.read_csv('used_cars.csv')

# t-SNE

Just like PCA, t-SNE works with numeric data. Let's apply t-SNE to the numeric columns of our data set.

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df2 = df[['year', 'price', 'mileage', 'tax', 'mpg', 'engineSize']].dropna()

In [None]:
df2.head()

In [None]:
df2.info()

We can make a new TSNE object from sklearn.

In [None]:
tsne = manifold.TSNE()

In [None]:
type(tsne)

The TSNE object transforms the data to the lower-dimensional space using the ``fit_transform()`` method.

Unlike PCA, fitting and transforming must be done in a single step. t-SNE does not learn a reusable function for transforming data, so new data points cannot be transformed using an existing fit.
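As a small check of this difference, the sketch below inspects the two sklearn estimators: PCA exposes a separate ``transform()`` method, while TSNE only offers ``fit_transform()``. The variable names are just for illustration.

```python
from sklearn import decomposition
from sklearn import manifold

# PCA learns a projection, so it can transform new data after fitting.
pca_has_transform = hasattr(decomposition.PCA(), 'transform')

# TSNE only provides fit_transform(); there is no separate transform().
tsne_has_transform = hasattr(manifold.TSNE(), 'transform')

print(pca_has_transform, tsne_has_transform)
```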

Computing a t-SNE embedding is an expensive operation. Don't be surprised if the following cell takes several minutes to run.

In [None]:
tsne_data = tsne.fit_transform(df2)

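In the scikit-learn release used for this lecture, the default parameters are:
- perplexity = 30,
- 1000 iterations for the gradient descent process to find a suitable transformation, and
- learning rate (epsilon) = 200 (equivalent to 800 in other libraries).

Newer scikit-learn releases changed the default learning rate to ``'auto'``, so check the API page for the version you have installed.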

You can think of perplexity as related to the number of neighbours. Higher values mean that more data points are considered as neighbours. Try values ranging from as low as 5-10 up to 50-100.

1000 iterations is usually sufficient.

There is recent research that suggests the learning rate should scale with the size of the data set (say, n/12, or for sklearn, n/12/4, where n is the number of data points).
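The n/12/4 rule of thumb can be sketched as follows. This is only an illustration: ``suggested_learning_rate`` is a hypothetical helper (not part of sklearn), and the data is a small synthetic stand-in rather than the car data.

```python
import numpy as np
from sklearn import manifold

def suggested_learning_rate(n_samples, early_exaggeration=12.0):
    """Hypothetical helper: the n / 12 / 4 heuristic in sklearn's convention."""
    return n_samples / early_exaggeration / 4.0

# Synthetic stand-in data (600 points, 6 numeric columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 6))

# 600 / 12 / 4 = 12.5
lr_demo = suggested_learning_rate(len(X_demo))

# Pass the perplexity and learning rate explicitly instead of using defaults.
tsne_demo = manifold.TSNE(perplexity=30, learning_rate=lr_demo, random_state=0)
embedding_demo = tsne_demo.fit_transform(X_demo)
print(lr_demo, embedding_demo.shape)
```

For a data set of realistic size, the same heuristic gives a much larger learning rate, for example around 2000 for 100,000 rows.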

Computing the t-SNE can be a slow process, so I have pre-computed the embeddings using different perplexities. I will show visualizations using perplexities of 10, 30, 100 and 200 (30 is the default).
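For reference, here is a minimal sketch of how such files could be produced. The exact procedure used for the lecture files is not shown in this notebook, so treat this as an assumption. It uses synthetic stand-in data and an in-memory buffer; in practice you would pass a file name to ``to_csv()``.

```python
import io
import numpy as np
import pandas as pd
from sklearn import manifold

# Synthetic stand-in data (300 points, 6 numeric columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))

saved_csv = {}
for perplexity in [10, 30]:
    tsne_p = manifold.TSNE(perplexity=perplexity, random_state=0)
    df_out = pd.DataFrame(tsne_p.fit_transform(X_demo))
    buffer = io.StringIO()
    # index=False keeps only the two embedding columns in the CSV.
    df_out.to_csv(buffer, index=False)
    saved_csv[perplexity] = buffer.getvalue()

# Reading the CSV back gives string column names '0' and '1'.
df_back = pd.read_csv(io.StringIO(saved_csv[30]))
print(df_back.columns.tolist())
```

This is why the plots in this notebook refer to the t-SNE columns as ``'0'`` and ``'1'`` (strings) rather than integers.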

In [None]:
df_tsne_p10 = pd.read_csv('tsne-perplexity10.csv')

In [None]:
df_tsne_p30 = pd.read_csv('tsne-perplexity30.csv')

In [None]:
df_tsne_p100 = pd.read_csv('tsne-perplexity100.csv')

In [None]:
df_tsne_p200 = pd.read_csv('tsne-perplexity200.csv')

Note that the t-SNE transformed columns don't have meaningful names. The original columns have been blended together to create these new columns (dimensions), so each new column is a combination of the original data and is no longer directly relatable to any single original column.

In [None]:
df2.head()

In [None]:
df_tsne_p30.head()

In [None]:
df_tsne_p30.describe()

The scale of the columns has no particular meaning.

## Visualizing t-SNE Data

In the example below, we will add back some of the non-numeric data to our t-SNE transformed data set. These act like the "labels" of our data.

In [None]:
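# Column assignment aligns on the row index, so this assumes the
# pre-computed rows share the same index as df.
for df_tsne in [df_tsne_p10, df_tsne_p30, df_tsne_p100, df_tsne_p200]:
    df_tsne['brand'] = df['brand']
    df_tsne['model'] = df['model']
    df_tsne['price'] = df['price']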

A small perplexity attempts to identify small clusters.

Compare Audi to Toyota. Notice how there are many small clusters with only a few points, and those clusters are diffuse and scattered throughout the space.

In [None]:
px.scatter(x='0', y='1', data_frame=df_tsne_p10, color='brand', hover_data=['price'])

In [None]:
px.scatter(x='0', y='1', data_frame=df_tsne_p30, color='brand', hover_data=['price'])

A large perplexity attempts to identify large clusters. Notice how the t-SNE clusters are larger and better defined for perplexities of 100 and 200.

In [None]:
px.scatter(x='0', y='1', data_frame=df_tsne_p100, color='brand', hover_data=['price'])

In [None]:
px.scatter(x='0', y='1', data_frame=df_tsne_p200, color='brand', hover_data=['price'])

Some behaviour is common to both small and large perplexity values. For example, Toyota fills in the gaps around Audi, and Audi and BMW overlap very strongly.

However, the overall look of the embedding can vary significantly. The right choice of perplexity depends on which cluster scale is important to highlight.


# t-SNE of PCA

One option is to apply t-SNE to PCA-transformed data and visualize the result.

In [None]:
pca = decomposition.PCA()

In [None]:
data_pca = pca.fit_transform(df2)

In [None]:
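# Keep df2's index so the label columns added from df line up row by row,
# even if dropna() removed some rows.
df_pca = pd.DataFrame(data_pca, index=df2.index)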

In [None]:
df_pca['brand'] = df['brand']
df_pca['model'] = df['model']
df_pca['price'] = df['price']

In [None]:
px.scatter(x=0, y=1, data_frame=df_pca, color='brand', hover_data=['price'])

I have pre-computed the t-SNE of the PCA-transformed data, which we load for convenience.
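A minimal sketch of the PCA-then-t-SNE pipeline is given below, using synthetic stand-in data so that it runs quickly. The perplexity of 100 is taken from the file name of the pre-computed result; the other settings are assumptions and may differ from what was used for the lecture file.

```python
import numpy as np
import pandas as pd
from sklearn import decomposition
from sklearn import manifold

# Synthetic stand-in data (400 points, 6 numeric columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 6))

# Step 1: rotate the data onto its principal components.
X_pca_demo = decomposition.PCA().fit_transform(X_demo)

# Step 2: embed the PCA-transformed data in 2 dimensions with t-SNE.
tsne_pca_demo = manifold.TSNE(perplexity=100, random_state=0)
df_pca_tsne_demo = pd.DataFrame(tsne_pca_demo.fit_transform(X_pca_demo))
print(X_pca_demo.shape, df_pca_tsne_demo.shape)
```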

In [None]:
df_pca_tsne = pd.read_csv('tsne-pca-p100.csv')

In [None]:
df_pca_tsne['brand'] = df['brand']
df_pca_tsne['model'] = df['model']
df_pca_tsne['price'] = df['price']

In [None]:
px.scatter(x='0', y='1', data_frame=df_pca_tsne, color='brand', hover_data=['price'])

# Summary

t-SNE is a powerful technique for visualizing high-dimensional data. It typically reduces the data down to 2 dimensions. It is possible to reduce to 3 dimensions, but 2 dimensions are used almost universally.
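The number of output dimensions is set with the ``n_components`` parameter of the TSNE class. The short sketch below, on synthetic stand-in data, shows both the usual 2-dimensional case and a 3-dimensional embedding.

```python
import numpy as np
from sklearn import manifold

# Synthetic stand-in data (200 points, 6 numeric columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))

# n_components=2 is the default; 3 gives a three-dimensional embedding.
embedding_2d = manifold.TSNE(n_components=2, random_state=0).fit_transform(X_demo)
embedding_3d = manifold.TSNE(n_components=3, random_state=0).fit_transform(X_demo)
print(embedding_2d.shape, embedding_3d.shape)
```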

Using sklearn, the process is as simple as calling ``fit_transform()`` on the data. The result can then be visualized using any standard visualization library.

There are a number of caveats to keep in mind. Distances may not mean anything. Large-scale structure may not mean anything. Cluster size depends on the perplexity (and, to some extent, the learning rate). All the same, t-SNE is a valuable way to find commonalities in large data sets.