# Using t-SNE to reduce the MNIST data set

From the scikit-learn documentation: [t-distributed Stochastic Neighbor Embedding (t-SNE)](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) is a tool for visualizing high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data. The t-SNE cost function is not convex, so different initializations can produce different results.
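To make the objective concrete, here is a minimal NumPy sketch of the quantity being minimized. It is an illustration under simplifying assumptions, not scikit-learn's implementation: it uses one fixed Gaussian bandwidth `sigma` for every point instead of the per-point bandwidth that t-SNE derives from the perplexity, and the helper names and toy data are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off

def high_dim_affinities(X, sigma=1.0):
    """Symmetric joint probabilities P from Gaussian similarities.

    Simplification (assumption): a single fixed sigma for all points.
    """
    W = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                 # a point is not its own neighbor
    cond = W / W.sum(axis=1, keepdims=True)  # conditional p(j | i)
    return (cond + cond.T) / (2.0 * X.shape[0])

def low_dim_affinities(E):
    """Joint probabilities Q from a Student-t kernel with one degree of freedom."""
    W = 1.0 / (1.0 + pairwise_sq_dists(E))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), summed over the pairs where P is non-zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

# Toy data: 50 points in 10 dimensions (a stand-in, not MNIST).
X_toy = np.random.default_rng(0).normal(size=(50, 10))
P = high_dim_affinities(X_toy, sigma=2.0)

# Two different random starting embeddings give two different starting costs.
for seed in (1, 2):
    emb = np.random.default_rng(seed).normal(scale=1e-2, size=(50, 2))
    print(f'seed {seed}: KL = {kl_divergence(P, low_dim_affinities(emb)):.4f}')
```

t-SNE then moves the low-dimensional points by gradient descent to lower this cost. Because the cost is not convex, where the optimizer ends up depends on where it starts.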

Let's use it to reduce the MNIST data set from 784 dimensions to 2.

 - [Loading MNIST data set](#Loading-MNIST-data-set)
 - [t-SNE analysis](#t-SNE-analysis)

## Loading MNIST data set

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import manifold
from time import time
%matplotlib inline

In [3]:
# load the MNIST data set
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
mnist.keys()

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'DESCR', 'details', 'categories', 'url'])

In [11]:
print(f'mnist.data.shape: {mnist.data.shape}')
print(f'mnist.target.shape: {mnist.target.shape}')

mnist.data.shape: (70000, 784)
mnist.target.shape: (70000,)


In [12]:
X = mnist.data    # 70,000 images, each flattened to 784 pixel values
Y = mnist.target  # the corresponding digit labels

In [6]:
mnist.details

{'id': '554',
 'name': 'mnist_784',
 'version': '1',
 'format': 'ARFF',
 'upload_date': '2014-09-29T03:28:38',
 'licence': 'Public',
 'url': 'https://www.openml.org/data/v1/download/52667/mnist_784.arff',
 'file_id': '52667',
 'default_target_attribute': 'class',
 'tag': ['AzurePilot',
  'OpenML-CC18',
  'OpenML100',
  'study_1',
  'study_123',
  'study_41',
  'study_99',
  'vision'],
 'visibility': 'public',
 'status': 'active',
 'processing_date': '2018-10-03 21:23:30',
 'md5_checksum': '0298d579eb1b86163de7723944c7e495'}

## t-SNE analysis

In [8]:
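n_components = 2  # embed into two dimensions
perplexity = 30   # roughly the effective number of neighbors per point
# fix random_state for reproducibility; n_jobs=-1 uses all CPU cores
tsne = manifold.TSNE(n_components=n_components, init='random', random_state=42,
                     perplexity=perplexity, n_jobs=-1)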
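With `init='random'`, the starting layout depends on `random_state`, and since the cost function is not convex, a different seed can lead to a different embedding. The sketch below illustrates this on a small stand-in data set so that it runs in seconds. The use of scikit-learn's bundled `load_digits` data, the 300-sample subset, and the variable names are assumptions made for illustration and are not part of the MNIST run in this notebook.

```python
import numpy as np
from sklearn import manifold
from sklearn.datasets import load_digits

# Small stand-in for MNIST: 8x8 digit images bundled with scikit-learn.
digits = load_digits()
X_small = digits.data[:300]

def embed(seed):
    """Run t-SNE with a random initialization controlled by `seed`."""
    model = manifold.TSNE(n_components=2, init='random',
                          random_state=seed, perplexity=30)
    return model.fit_transform(X_small)

emb_a = embed(0)
emb_b = embed(1)

# Same data, different seeds: the coordinates do not coincide.
print('shapes:', emb_a.shape, emb_b.shape)
print('identical embeddings:', np.allclose(emb_a, emb_b))
```

This is why fixing `random_state` matters when a result needs to be reproduced or compared across runs.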

In [13]:
t0 = time()
X_reduced = tsne.fit_transform(X)
print(f'Time elapsed: {time() - t0:.4f} sec')

Time elapsed: 2006.6585 sec


In [14]:
X_reduced.shape

(70000, 2)

In [15]:
X_reduced[0]

array([23.745035  ,  0.43958613], dtype=float32)
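
Since t-SNE is primarily a visualization tool, a natural next step is to plot the two-dimensional embedding colored by digit. The helper below is a hedged sketch: `plot_embedding` is a hypothetical function, and the demo call uses random stand-in data so that the block runs on its own. In this notebook it would be called as `plot_embedding(X_reduced, Y)`.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_embedding(points, labels, max_points=10000, seed=0):
    """Scatter-plot a 2-D embedding, colored by integer class label.

    At most `max_points` randomly chosen points are drawn to keep the
    figure readable and fast to render.
    """
    points = np.asarray(points)
    labels = np.asarray(labels).astype(int)  # MNIST targets load as strings
    n = len(points)
    idx = np.random.default_rng(seed).choice(n, size=min(max_points, n),
                                             replace=False)
    fig, ax = plt.subplots(figsize=(8, 8))
    sc = ax.scatter(points[idx, 0], points[idx, 1], c=labels[idx],
                    cmap='tab10', vmin=0, vmax=9, s=3)
    fig.colorbar(sc, ax=ax, ticks=list(range(10)), label='digit')
    ax.set_title('t-SNE embedding')
    return fig, ax

# Stand-in data for a self-contained demo (not the MNIST embedding).
demo_points = np.random.default_rng(1).normal(size=(500, 2))
demo_labels = np.random.default_rng(2).integers(0, 10, size=500).astype(str)
fig, ax = plot_embedding(demo_points, demo_labels)
```

Fixing `vmin` and `vmax` keeps each digit on the same color even when a subsample happens to miss a class.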