In [1]:
%matplotlib qt
import numpy as np
np.random.seed(2022)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import scipy.stats as stats

# PCA imports
from sklearn.decomposition import PCA
#t-SNE import
from sklearn import manifold, datasets

In [2]:
rng = np.random.RandomState(0)

n_samples = 1500
S_points, S_color = datasets.make_s_curve(n_samples, random_state=rng)

In [3]:
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

#ax.scatter(x_r[0],x_r[1],x_r[2])

ax.scatter(S_points.T[0],S_points.T[1],S_points.T[2],c=S_color)

<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x205b374cbe0>

In [4]:
pca = PCA(n_components = 3) # this defines the model

# PCA visualization
X_pca = pca.fit_transform(S_points) # transform the data into PCA space
print('Variance Explained: ', pca.explained_variance_ratio_)

fig =plt.figure()
ax=fig.add_subplot()
ax.scatter(X_pca[:,0],X_pca[:,1],c=S_color)
ax.set_aspect('equal')
plt.xlabel('PC1  '+str(100*round(pca.explained_variance_ratio_[0],3))+'%')
plt.ylabel('PC2  '+str(100*round(pca.explained_variance_ratio_[1],3))+'%')

Variance Explained:  [0.69133634 0.18724734 0.12141632]


Text(0, 0.5, 'PC2  18.7%')

In [5]:
fig =plt.figure()
ax=fig.add_subplot()
ax.scatter(X_pca[:,0],X_pca[:,2],c=S_color)
ax.set_aspect('equal')
plt.xlabel('PC1  '+str(100*round(pca.explained_variance_ratio_[0],3))+'%')
plt.ylabel('PC3  '+str(100*round(pca.explained_variance_ratio_[2],3))+'%')  # the y axis shows the third component

Text(0, 0.5, 'PC3  12.1%')

In [6]:
t_sne = manifold.TSNE(
    n_components=2,
    learning_rate="auto",
    perplexity=10,
    n_iter=250,
    init="random",
    random_state=rng,
)

S_t_sne = t_sne.fit_transform(S_points)

fig = plt.figure()

plt.scatter(S_t_sne.T[0],S_t_sne.T[1],c=S_color)

<matplotlib.collections.PathCollection at 0x205b5273a60>

In [7]:
# Let's create a data set with two clusters

x_1=np.random.normal(loc=1,scale=2.0,size=100)
x_2=np.random.normal(loc=1,scale=2.0,size=100)
x_3=np.random.normal(loc=1,scale=0.7,size=100)

y_1=np.random.normal(loc=1,scale=2.0,size=100)
y_2=np.random.normal(loc=1,scale=2.0,size=100)
y_3=np.random.normal(loc=-5,scale=0.7,size=100)

x=np.array([x_1,x_2,x_1+2*x_2+x_3])
y=np.array([y_1,y_2,y_3])

z=np.concatenate((x,y),axis=1)

clr=100*['b']+100*['r']

fig = plt.figure()
ax = fig.add_subplot(projection='3d')

ax.scatter(z[0],z[1],z[2],c=clr)

ax.set_xlabel(r'$x_1$')
ax.set_ylabel(r'$x_2$')
ax.set_zlabel(r'$x_3$')

ax.set_xlim((-10,25))
ax.set_ylim((-10,25))
ax.set_zlim((-10,25))

(-10.0, 25.0)

In [8]:
t_sne = manifold.TSNE(
    n_components=2,
    learning_rate="auto",
    perplexity=100,
    n_iter=250,
    init="random",
    random_state=rng,
)

z_t_sne = t_sne.fit_transform(z.T)

In [9]:
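fig = plt.figure()
ax = fig.add_subplot()  # new 2D axes; the old `ax` still belongs to the 3D figure

ax.scatter(z_t_sne.T[0],z_t_sne.T[1],c=clr)

# the embedding axes are t-SNE coordinates, not the original x_1, x_2,
# so the limits chosen for the 3D plot are not reused here
ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')

Text(0, 0.5, 't-SNE 2')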

#### t-SNE 

##### t-distributed stochastic neighbor embedding

The intuition behind t-SNE is that you want to reduce the dimensionality in a way that preserves neighborhoods: points that are close to each other in the original space should stay close in the low-dimensional map.

Example: let's move a village laid out on a 2D grid to a village with only one street.

How do you keep neighbors from the 2D village close to each other on the 1D street?

Example on board.

We can do it in the statistical sense. 

#### Side note: The curse of dimensionality

In 1D each point on a grid has 2 closest neighbors.

In 2D there are 4 closest points, in 3D 6, etc.

In a high-dimensional space, say 1000 dimensions, each point has 2000 closest neighbors.
If we have more dimensions than points, each point is effectively a neighbor of every other point.
Mapping such data to a few dimensions will be very difficult.
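
A small numerical sketch of this effect, assuming points drawn uniformly from the unit hypercube (an illustrative choice, not the grid used above; the helper name `distance_contrast` is made up for this example). As the dimension grows, the nearest and the farthest neighbor of a point end up at almost the same distance, so the notion of "neighbor" loses its meaning.

```python
import numpy as np

def distance_contrast(n_points, n_dims, seed=0):
    """Mean ratio (nearest-neighbor distance) / (farthest-neighbor distance)."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(size=(n_points, n_dims))
    # pairwise squared distances via |a-b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq = np.sum(pts**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T
    d = np.sqrt(np.maximum(d2, 0.0))
    np.fill_diagonal(d, np.nan)  # ignore the distance of a point to itself
    return np.mean(np.nanmin(d, axis=1) / np.nanmax(d, axis=1))

# a ratio close to 0 means clear neighbors; close to 1 means everyone is equally far
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(200, n_dims), 3))
```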

Back to t-SNE. 

Let's have $N$ points $x_i$. 

Let's create a probability distribution to be a neighbor, based on the distances between the points. 

Let's say a Gaussian centered on $x_i$:

$P_{j|i}=\frac{e^{-\frac{|x_i-x_j|^2}{2\sigma_i^2}}}{\sum_{k\neq i}e^{-\frac{|x_i-x_k|^2}{2\sigma_i^2}}}$

Here $\sigma_i$ is a width chosen separately for each point $x_i$, such that the entropy of the distribution $P_{j|i}$ over $j$ equals a preset value. That preset value is a parameter of the model. 
It is related to the $\bf{perplexity}$, a measure of how many neighbors we anticipate for each point in the data.

We will symmetrize this probability:

$P_{ij}=\frac{P_{i|j}+P_{j|i}}{2N}$
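
A minimal NumPy sketch of these two steps, assuming for simplicity the same width $\sigma_i=1$ for every point (in t-SNE proper, each $\sigma_i$ is tuned from the perplexity). The function names are illustrative and not taken from scikit-learn.

```python
import numpy as np

def squared_distances(X):
    """Matrix of |x_i - x_j|^2 for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff**2, axis=-1)

def conditional_p(X, sigma):
    """Row i holds the Gaussian neighbor probabilities of all j, centered on x_i."""
    logits = -squared_distances(X) / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    expd = np.exp(logits)
    return expd / expd.sum(axis=1, keepdims=True)

def joint_p(P_cond):
    """Symmetrize the conditional probabilities and normalize by 2N."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sigma = np.ones(50)  # assumption: one fixed width for all points
P = joint_p(conditional_p(X, sigma))
print(P.sum())       # the symmetrized matrix sums to 1 over all pairs
```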

#### Perplexity in t-SNE

This is a measure of how many neighbors we anticipate for each point in the data. 

The result depends on this parameter, which is not a parameter of the data itself. 

The perplexity is the effective number of neighbors of point $i$, calculated from the entropy of its conditional distribution:

$N_{eff}=2^{-\sum_{j} P_{j|i}\log_2 P_{j|i}}$
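
As a rough sketch of how a target perplexity fixes the width $\sigma_i$, the code below computes the effective number of neighbors of one point and then searches for the width that reproduces a chosen value. The bisection on $\sigma$ and the helper names are assumptions made for illustration; library implementations may organize this search differently.

```python
import numpy as np

def row_probabilities(d2_row, i, sigma):
    """Gaussian neighbor probabilities of all points j as seen from point i."""
    logits = -d2_row / (2.0 * sigma**2)
    logits[i] = -np.inf          # exclude the point itself
    logits -= logits.max()       # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Effective number of neighbors, 2 ** (entropy in bits)."""
    nz = p[p > 0]
    return 2.0 ** (-np.sum(nz * np.log2(nz)))

def find_sigma(d2_row, i, target, lo=1e-3, hi=1e3, n_steps=60):
    """Bisection in log(sigma); the perplexity increases with sigma."""
    for _ in range(n_steps):
        mid = np.sqrt(lo * hi)
        if perplexity(row_probabilities(d2_row, i, mid)) > target:
            hi = mid
        else:
            lo = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
sigma_0 = find_sigma(d2[0], 0, target=10.0)
print(sigma_0, perplexity(row_probabilities(d2[0], 0, sigma_0)))
```

A larger target perplexity gives a larger $\sigma_i$, so each point "sees" more of the data as neighbors.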

#### Hyperparameters in ML

Hyperparameters are parameters of the method rather than of the data, for example the perplexity or the learning rate. They are chosen before fitting and can influence the results. 

Continue with the procedure:

We will try to reduce the dimensionality of the data, i.e. create new points $y_i$ in lower dimensional space, say 2D. 
We will do that in a way that will preserve the distribution of distances, but with broader tails. 

$\bf{This\ means\ there\ will\ be\ more\ large\ distances\ in\ the\ reduced\ dimensionality\ map,\ as\ if\ we\ push\ points\ that\ are\ apart\ even\ further\ apart.}$ 

We will use the t-distribution (with one degree of freedom) for $y_i$. 

$Q_{ij}=\frac{(1+|y_i-y_j|^2)^{-1}}{\sum_{k\neq l}(1+|y_k-y_l|^2)^{-1}}$
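
A short sketch of this formula in NumPy, with randomly placed points $y_i$ standing in for an actual embedding (an assumption for illustration only). The last lines compare the unnormalized t kernel with a unit-width Gaussian kernel to show the broader tail.

```python
import numpy as np

def joint_q(Y):
    """Pair probabilities from the t kernel (1 + |y_i - y_j|^2)^(-1)."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    w = 1.0 / (1.0 + d2)
    np.fill_diagonal(w, 0.0)   # the sum runs over pairs k != l only
    return w / w.sum()

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 2))   # placeholder low-dimensional points
Q = joint_q(Y)
print(Q.sum())

# tail comparison at increasing distances (unnormalized kernels)
for dist in (1.0, 3.0, 5.0):
    t_kernel = 1.0 / (1.0 + dist**2)
    gauss_kernel = np.exp(-dist**2 / 2.0)
    print(dist, t_kernel, gauss_kernel)
```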

We will try to make these two distributions match by minimizing the "distance" between them.

$\bf{Kullback-Leibler\ divergence}$

$KL(P||Q)=\sum_{i\neq j}P_{ij}\ln\frac{P_{ij}}{Q_{ij}}$


If $P=Q$ (as a distribution) then $KL=0$. 
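
A minimal sketch of the divergence, using two small random pair distributions as stand-ins for $P$ and $Q$ (the helper `random_joint` is hypothetical and only generates test data). Pairs with $P_{ij}=0$ contribute nothing to the sum and are skipped.

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P||Q) summed over all pairs with P_ij > 0."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps)))

rng = np.random.default_rng(0)

def random_joint(n):
    """Random symmetric pair distribution with zero diagonal (test data only)."""
    A = rng.uniform(size=(n, n))
    A = A + A.T
    np.fill_diagonal(A, 0.0)
    return A / A.sum()

P = random_joint(5)
Q = random_joint(5)
print(kl_divergence(P, P))  # identical distributions
print(kl_divergence(P, Q))  # different distributions
```

Note that the divergence is not symmetric: in general $KL(P||Q)\neq KL(Q||P)$, which is why "distance" is in quotation marks.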

How do we minimize this? As always, we look for new points at which the derivatives with respect to their coordinates vanish: $\frac{dKL}{dy_i}=0$.

We can use steepest (gradient) descent, for example. 
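
A bare-bones sketch of such a steepest descent in NumPy. It relies on the gradient expression commonly quoted in the t-SNE literature, $\frac{dKL}{dy_i}=4\sum_j (P_{ij}-Q_{ij})(y_i-y_j)(1+|y_i-y_j|^2)^{-1}$, which is not derived in these notes. The fixed width $\sigma=1$, the step size, and the number of steps are assumptions for illustration; practical implementations add refinements such as momentum that are left out here.

```python
import numpy as np

def pairwise_sq_dists(A):
    return np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)

def joint_p_fixed_sigma(X, sigma=1.0):
    """Symmetrized Gaussian pair probabilities with one shared width."""
    logits = -pairwise_sq_dists(X) / (2.0 * sigma**2)
    np.fill_diagonal(logits, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)
    return (cond + cond.T) / (2.0 * X.shape[0])

def kl_and_grad(P, Y):
    """KL(P||Q) for the embedding Y and its gradient with respect to Y."""
    w = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(w, 0.0)
    Q = w / w.sum()
    mask = P > 0
    kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    coeff = 4.0 * (P - Q) * w
    # sum_j coeff_ij (y_i - y_j), written with matrix products
    grad = coeff.sum(axis=1)[:, None] * Y - coeff @ Y
    return kl, grad

def steepest_descent(P, Y0, step=1.0, n_steps=300):
    """Plain gradient descent; returns the final points and the KL history."""
    Y = Y0.copy()
    history = []
    for _ in range(n_steps):
        kl, grad = kl_and_grad(P, Y)
        history.append(kl)
        Y -= step * grad
    return Y, history

rng = np.random.default_rng(0)
# two separated clusters in 3D as toy data
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 3)),
               rng.normal(6.0, 1.0, size=(30, 3))])
P = joint_p_fixed_sigma(X)
Y0 = rng.normal(size=(60, 2))
Y, history = steepest_descent(P, Y0)
print(history[0], history[-1])
```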
