<a href="https://colab.research.google.com/github/akanksha0911/Colab-demonstrate-various-dimensionality-reduction-techniques/blob/main/cmpe255_Assign_ManifoldLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

MDS (multidimensional scaling) is an algorithm that maps a dataset to another, usually lower-dimensional, dataset while preserving the pairwise Euclidean distances between points as closely as possible.

In [None]:
# Data manipulation
import pandas as pd # for data manipulation

# Visualization
import plotly.express as px # for data visualization

# Sklearn
from sklearn.datasets import make_swiss_roll # for creating a swiss roll
from sklearn.manifold import MDS # for MDS dimensionality reduction

In [None]:
# Make a swiss roll
X, y = make_swiss_roll(n_samples=2000, noise=0.05)
# Make it thinner
X[:, 1] *= .5


# Create a 3D scatter plot
fig = px.scatter_3d(None, x=X[:,0], y=X[:,1], z=X[:,2], color=y,)

# Set figure title
fig.update_layout(title_text="Original 3D data plot")

# Update marker size
fig.update_traces(marker=dict(size=3, 
                              line=dict(color='black', width=0.1)))

fig.update(layout_coloraxis_showscale=False)
fig.show()

Now we use MDS to map this 3D structure to 2 dimensions while preserving the distances between points as well as possible.

In [None]:
### Step 1 - Configure the MDS model
model2d = MDS(n_components=2,
              random_state=1,
              dissimilarity='euclidean')

### Step 2 - Fit the data and transform it, so we have 2 dimensions instead of 3
X_trans = model2d.fit_transform(X)
    
### Step 3 - Print a few stats
print('The new shape of X: ',X_trans.shape)
print('No. of Iterations: ', model2d.n_iter_)
print('Stress: ', model2d.stress_)

The new shape of X:  (2000, 2)
No. of Iterations:  69
Stress:  3358831.5993010853


In [None]:
#  Dissimilarity matrix contains distances between data points in the original high-dimensional space
print('Dissimilarity Matrix: ', model2d.dissimilarity_matrix_)
# Embedding contains coordinates for data points in the new lower-dimensional space
print('Embedding: ', model2d.embedding_)

Dissimilarity Matrix:  [[ 0.         10.84964673 12.00670743 ... 18.69664622 12.62967785
  11.29091881]
 [10.84964673  0.          2.24712119 ...  8.81408413 14.66209606
   1.13049556]
 [12.00670743  2.24712119  0.         ...  7.90411188 14.9457985
   3.13314473]
 ...
 [18.69664622  8.81408413  7.90411188 ...  0.         22.69896892
   8.69447095]
 [12.62967785 14.66209606 14.9457985  ... 22.69896892  0.
  15.21687923]
 [11.29091881  1.13049556  3.13314473 ...  8.69447095 15.21687923
   0.        ]]
Embedding:  [[  1.60893181 -11.69152698]
 [ -3.18709215  -2.24261137]
 [ -3.6228236   -2.36699373]
 ...
 [-11.9151402    0.28308985]
 [ 11.29441111  -3.95685279]
 [ -3.62779935  -1.62494283]]


We can see that the shape of the new array is 2000 by 2, so the data has been reduced to 2 dimensions. The algorithm took 69 iterations to converge to the stress value reported above.
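To make the stress figure more concrete, here is a minimal sketch of how a raw (unnormalized) stress could be computed by hand: the sum of squared differences between original and embedded pairwise distances, counting each pair once. The helper names `pairwise_distances` and `raw_stress` are hypothetical, and the assumption that this matches the quantity scikit-learn reports in `stress_` should be checked against the installed version.

```python
import numpy as np

def pairwise_distances(points):
    """Return the full matrix of Euclidean distances between rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(X_high, X_low):
    """Sum of squared distance errors, counting each unordered pair once."""
    d_high = pairwise_distances(X_high)
    d_low = pairwise_distances(X_low)
    # The matrices are symmetric, so halve the full sum.
    return ((d_high - d_low) ** 2).sum() / 2

# Toy data (assumed for illustration): three 3D points "embedded"
# by simply dropping the third coordinate.
toy_high = np.array([[0.0, 0.0, 0.0],
                     [3.0, 4.0, 0.0],
                     [0.0, 0.0, 1.0]])
toy_low = toy_high[:, :2]

toy_stress = raw_stress(toy_high, toy_low)
print("Stress of the toy embedding:", toy_stress)
```

A stress of zero would mean every pairwise distance is reproduced exactly; larger values mean more distortion, and the raw value grows with the number of points, so it is not comparable across datasets of different sizes.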

Now we plot the new 2D data to see how it compares to the original 3D version.

In [None]:
# Create a scatter plot
fig = px.scatter(None, x=X_trans[:,0], y=X_trans[:,1], opacity=1, color=y)


# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Set figure title
fig.update_layout(title_text="MDS Transformation")

# Update marker size
fig.update_traces(marker=dict(size=5,
                             line=dict(color='black', width=0.2)))

fig.show()

The results are reasonably good: the embedding preserves the global structure without losing the separation between points along the original depth dimension.

While it depends on the exact problem we want to solve, MDS seems to perform better in this scenario than PCA.
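For reference, a PCA projection of a comparable Swiss roll could be produced as in the sketch below. This is an illustrative assumption rather than part of the original run: it uses a smaller sample (500 points) and a fixed `random_state`, and the variable names `X_demo`, `color_demo`, and `X_pca` are hypothetical.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA

# Build a smaller, thinner Swiss roll (assumed settings for a quick demo)
X_demo, color_demo = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)
X_demo[:, 1] *= 0.5

# Project onto the two directions of largest variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_demo)

print("The new shape of X: ", X_pca.shape)
print("Variance kept by 2 components: ", pca.explained_variance_ratio_.sum())
```

Plotting `X_pca` with the same `px.scatter` call used for the MDS result would allow a side-by-side visual comparison.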



**Isomap (a non-linear approach to dimensionality reduction): Isomap combines several algorithms (a nearest-neighbor graph, shortest-path distances along that graph, and MDS on those distances), which lets it reduce dimensions non-linearly while preserving local structure.**
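One way to see why Isomap behaves differently from plain MDS is to compare straight-line distance with distance measured along a neighborhood graph. The sketch below is an illustration only, not Isomap's internal code: it assumes a toy arc of points and uses `kneighbors_graph` and `shortest_path` to approximate graph distances. The names `arc`, `euclid_ends`, and `geodesic_ends` are hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

# Toy data: 200 points on an almost closed unit circle (a 2D "roll")
theta = np.linspace(0, 1.9 * np.pi, 200)
arc = np.column_stack([np.cos(theta), np.sin(theta)])

# Connect each point to its 4 nearest neighbors, weighted by distance
graph = kneighbors_graph(arc, n_neighbors=4, mode="distance")

# Shortest paths along the graph approximate distances along the curve
geodesic = shortest_path(graph, method="D", directed=False)

euclid_ends = np.linalg.norm(arc[0] - arc[-1])
geodesic_ends = geodesic[0, -1]

print("Straight-line distance between the two ends:", euclid_ends)
print("Graph distance between the two ends:", geodesic_ends)
```

The two ends of the arc are close in a straight line but far apart along the curve. Feeding graph distances rather than straight-line distances into MDS is what allows a rolled-up shape to be flattened.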


In [None]:
from sklearn.manifold import Isomap

In [None]:
### Step 1 - Configure the Isomap model
ModelIsomap = Isomap(
    n_neighbors=20, # default=5, algorithm finds local structures based on the nearest neighbors
    n_components=2, # number of dimensions
    metric='minkowski', # string, or callable, default=”minkowski”
    p=2, # default=2, Parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2
    metric_params=None # default=None, Additional keyword arguments for the metric function.
)

### Step 2 - Fit the data and transform it, so we have 2 dimensions instead of 3
X_Isomap = ModelIsomap.fit_transform(X)
    
### Step 3 - Print shape to test
print('The new shape of X: ',X_Isomap.shape)

The new shape of X:  (2000, 2)


In [None]:
# Create a scatter plot
fig = px.scatter(None, x=X_Isomap[:,0], y=X_Isomap[:,1], opacity=1, color=y)


# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Set figure title
fig.update_layout(title_text="ISOMAP Transformation")

# Update marker size
fig.update_traces(marker=dict(size=5,
                             line=dict(color='black', width=0.2)))

fig.show()

Isomap is a strong tool for dimensionality reduction when the relationships between data points are non-linear. It has been applied in practice to handwritten digit recognition. Similarly, you could use Isomap in NLP (Natural Language Processing) to reduce the high dimensionality of text data before training a classification model.
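As a rough sketch of the "reduce first, then classify" idea, the example below chains Isomap with a k-nearest-neighbors classifier on scikit-learn's bundled digits dataset. The subset size, the number of components, and the choice of classifier are all assumptions made for a quick demo, not tuned or recommended settings.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Use a 600-sample subset of the 64-dimensional digits data to keep it fast
digits = load_digits()
X_digits, y_digits = digits.data[:600], digits.target[:600]

X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits, test_size=0.25, random_state=0, stratify=y_digits
)

# Isomap reduces 64 features to 10, then k-NN classifies in the reduced space
pipeline = make_pipeline(
    Isomap(n_neighbors=10, n_components=10),
    KNeighborsClassifier(n_neighbors=5),
)
pipeline.fit(X_train, y_train)

accuracy = pipeline.score(X_test, y_test)
print("Test accuracy: ", accuracy)
```

Putting Isomap inside the pipeline means the embedding is fitted on the training split only, and test points are mapped with `transform`, which avoids leaking test data into the reduction step.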

**LLE (Locally Linear Embedding): MDS tries to preserve distances between faraway points when constructing the embedding. But what if we instead modified the algorithm so that it only preserves distances between nearby points? That is the idea behind LLE.**
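The local step of LLE can be sketched as follows: each point is described as a weighted combination of its neighbors, with weights that sum to one, and the embedding later tries to keep those weights valid. This is a simplified illustration under stated assumptions, not scikit-learn's implementation. The function name `reconstruction_weights` is hypothetical, and the regularization term follows the "reg times the trace" description used in the parameter comments in this notebook.

```python
import numpy as np

def reconstruction_weights(point, neighbors, reg=1e-3):
    """Weights that best rebuild `point` from `neighbors`, summing to 1."""
    offsets = neighbors - point          # neighbors relative to the point
    gram = offsets @ offsets.T           # local Gram (covariance-like) matrix
    trace = np.trace(gram)
    # Regularize the diagonal so the linear system is well conditioned
    scale = trace if trace > 0 else 1.0
    gram = gram + np.eye(len(neighbors)) * reg * scale
    weights = np.linalg.solve(gram, np.ones(len(neighbors)))
    return weights / weights.sum()

# Toy case: a point exactly halfway between its two neighbors
point = np.array([0.0, 0.0])
neighbors = np.array([[-1.0, 0.0],
                      [1.0, 0.0]])

weights = reconstruction_weights(point, neighbors)
print("Weights: ", weights)
print("Reconstructed point: ", weights @ neighbors)
```

In this toy case the two neighbors share the weight equally, which is what one would expect for a midpoint.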

In [None]:
from sklearn.manifold import LocallyLinearEmbedding as LLE 

model_lle = LLE(n_neighbors=30, # default=5, number of neighbors to consider for each point.
                    n_components=2, # default=2, number of dimensions of the new space 
                    reg=0.001, # default=1e-3, regularization constant, multiplies the trace of the local covariance matrix of the distances.
                    eigen_solver='auto', # {‘auto’, ‘arpack’, ‘dense’}, default=’auto’, auto : algorithm will attempt to choose the best method for input data
                   )
 
 
X_LLE = model_lle.fit_transform(X)

print('The new shape of X: ',X_LLE.shape)

The new shape of X:  (2000, 2)


In [None]:
# Create a scatter plot
fig = px.scatter(None, x=X_LLE[:,0], y=X_LLE[:,1], opacity=1, color=y)


# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Set figure title
fig.update_layout(title_text="LLE Transformation")

# Update marker size
fig.update_traces(marker=dict(size=5,
                             line=dict(color='black', width=0.3)))

fig.show()

In [None]:
model_lle2 = LLE(n_neighbors=30, # default=5, number of neighbors to consider for each point.
                    n_components=2, # default=2, number of dimensions of the new space 
                    reg=0.001, # default=1e-3, regularization constant, multiplies the trace of the local covariance matrix of the distances.
                    eigen_solver='auto', # {‘auto’, ‘arpack’, ‘dense’}, default=’auto’, auto : algorithm will attempt to choose the best method for input data
                    method='modified'
                   )
 
 
X_LLE2 = model_lle2.fit_transform(X)

print('The new shape of X: ',X_LLE2.shape)

The new shape of X:  (2000, 2)


In [None]:
# Create a scatter plot
fig = px.scatter(None, x=X_LLE2[:,0], y=X_LLE2[:,1], opacity=1, color=y)


# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Set figure title
fig.update_layout(title_text="Modified LLE Transformation")

# Update marker size
fig.update_traces(marker=dict(size=5,
                             line=dict(color='black', width=0.3)))

fig.show()

We can see that standard LLE was not able to unroll the Swiss roll successfully. Modified LLE and Isomap both unrolled it, with similar results.
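The visual comparison can be complemented with a number. The sketch below uses `sklearn.manifold.trustworthiness`, which scores between 0 and 1 how well local neighborhoods are retained in an embedding. The smaller sample (500 points), the fixed seeds, and `n_neighbors=10` for the score are assumptions chosen to keep the example quick, so the values will not correspond exactly to the plots above.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding, trustworthiness

# Smaller Swiss roll with a fixed seed (assumed settings for a quick check)
X_small, _ = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)
X_small[:, 1] *= 0.5

models = {
    "Isomap": Isomap(n_neighbors=20, n_components=2),
    "Standard LLE": LocallyLinearEmbedding(
        n_neighbors=30, n_components=2, random_state=0),
    "Modified LLE": LocallyLinearEmbedding(
        n_neighbors=30, n_components=2, method="modified", random_state=0),
}

# Score each embedding by how well it keeps local neighborhoods intact
scores = {}
for name, model in models.items():
    embedded = model.fit_transform(X_small)
    scores[name] = trustworthiness(X_small, embedded, n_neighbors=10)
    print(name, "trustworthiness:", scores[name])
```

Trustworthiness only looks at local neighborhoods, so it is best read alongside the plots rather than as a replacement for them.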

**t-SNE (t-distributed Stochastic Neighbor Embedding), a nonlinear dimensionality reduction technique:** unlike PCA, t-SNE is iterative. It models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects by distant points.
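The "t" in t-SNE refers to the Student-t kernel used to turn distances between low-dimensional points into similarities. The sketch below illustrates only that piece, assuming the standard form with one degree of freedom; the function name `student_t_similarities` and the toy coordinates are hypothetical, and the full algorithm also involves high-dimensional similarities and gradient descent, which are not shown.

```python
import numpy as np

def student_t_similarities(Y):
    """Normalized pairwise similarities using a Student-t kernel (1 dof)."""
    sq_dist = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    kernel = 1.0 / (1.0 + sq_dist)
    np.fill_diagonal(kernel, 0.0)   # a point is not its own neighbor
    return kernel / kernel.sum()    # all pairs together sum to 1

# Toy embedding: two points close together and one far away
Y = np.array([[0.0, 0.0],
              [0.1, 0.0],
              [5.0, 5.0]])

Q = student_t_similarities(Y)
print("Similarity of the close pair: ", Q[0, 1])
print("Similarity of the distant pair: ", Q[0, 2])
```

Nearby points receive a much larger share of the total similarity than distant ones, which is how "similar objects are modeled by nearby points" is expressed numerically.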

In [None]:
from sklearn.manifold import TSNE
embedding = TSNE(n_components=2) #result has 2 features
X_TSNE = embedding.fit_transform(X)

print('The new shape of X: ',X_TSNE.shape)

The new shape of X:  (2000, 2)


In [None]:
# Create a scatter plot
fig = px.scatter(None, x=X_TSNE[:,0], y=X_TSNE[:,1], opacity=1, color=y)


# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Set figure title
fig.update_layout(title_text="t-SNE Transformation")

# Update marker size
fig.update_traces(marker=dict(size=5,
                             line=dict(color='black', width=0.3)))

fig.show()

UMAP (Uniform Manifold Approximation and Projection):

In [None]:
pip install umap-learn


In [None]:
import umap


In [None]:
embedding = umap.UMAP(n_neighbors=30,
                      min_dist=0.3,
                      metric='euclidean').fit_transform(X)





In [None]:
print('The new shape of X: ',embedding.shape)

The new shape of X:  (2000, 2)


In [None]:
# Create a scatter plot
# Color by position along the roll (y), as in the other plots
fig = px.scatter(None, x=embedding[:,0], y=embedding[:,1], opacity=1, color=y)


# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))

# Set figure title
fig.update_layout(title_text="UMAP Transformation")

# Update marker size
fig.update_traces(marker=dict(size=5,
                             line=dict(color='black', width=0.3)))

fig.show()

UMAP often preserves more of the global structure of the data than t-SNE. UMAP isn't just for visualisation: you can also use it as a general-purpose dimension reduction technique, as a preliminary step to other machine learning tasks.