# Implementation of Visualizing Data Using t-SNE

#### Aasha Reddy and Madeleine Beckner

#### Due: April 27, 2021

## Abstract

In this report, we will implement the t-SNE algorithm as described in the paper "Visualizing Data using t-SNE" (Hinton and van der Maaten, 2008). t-SNE is a dimension-reduction technique that visualizes high-dimensional data by giving each datapoint a corresponding location in a two or three-dimensional mapping. We will demonstrate its accuracy and performance through various applications to both real and simulated datasets, optimize the code, and compare this algorithm with other competing algorithms that address a similar problem. We create a package that allows users to implement an optimized version of t-SNE that uses Numba JIT (just-in-time compilation). Installation instructions can be found on the [accompanying Github Repo](https://github.com/aashareddy14/tsne_project). 

Key phrases: t-SNE, stochastic neighbor embedding, dimensionality reduction, visualization

## Code

[Accompanying Github Repo](https://github.com/aashareddy14/tsne_project)


[Package Installation Instructions](https://test.pypi.org/project/TSNE-reddy-beckner/0.0.3/)

To install the package, please see the above link. You may type 

`pip install -i https://test.pypi.org/simple/ TSNE-reddy-beckner==0.0.3`

into your terminal and load the library using 

`from tsne.tsne import TSNE, TSNE_plot`

## 1. Background

In this paper, we will implement the t-SNE algorithm for visualizing high-dimensional data as described in the paper "Visualizing Data using t-SNE" (Hinton and van der Maaten, 2008). This algorithm gives each datapoint a corresponding location in a two or three-dimensional mapping. The t-SNE algorithm improves upon the Stochastic Neighbor Embedding algorithm (Hinton and Roweis, 2002) by reducing issues with points crowding in the center of the map.

Visualizing high-dimensional data in an effective and interpretable way is important to many different areas of application. High-dimensional datasets are common in numerous fields of study, such as biology, chemistry, political science, economics, astronomy, physics, and more. As availability of computing resources continues to increase, so does the ability to collect and store more and more complex and high-dimensional datasets. Effective visualization  presents an interesting and important challenge in analyzing and understanding these complex datasets.

Many methods for visualizing high-dimensional data have been proposed in recent years, but most come with drawbacks. For example, techniques that use iconographic displays, such as pixel-based techniques or Chernoff faces, provide tools to visualize high-dimensional data while leaving the interpretation to the viewer. When scaled to thousands of dimensions, as is often the case with real-world data, these methods break down because humans cannot interpret displays at that scale. To avoid this issue, dimensionality-reduction methods have been proposed that convert high-dimensional data into two or three dimensions to be plotted. Popular linear dimensionality reduction methods include PCA (Principal Components Analysis) and MDS (Multi-Dimensional Scaling). However, these methods prioritize keeping low-dimensional representations of dissimilar points far apart, so issues arise in applications where low-dimensional representations of similar points must be kept close together.

To preserve local data structure, non-linear dimensionality reduction methods have been proposed as well, one of which, Stochastic Neighbor Embedding, forms the basis of this paper. While these methods address the concerns above and generally perform well on artificial datasets, they tend to break down on real-world high-dimensional datasets because they cannot retain both the local and the global structure of the data in a single map. t-SNE, the method presented in this paper, addresses this by capturing both local and global structure effectively, and it has been shown to be better than existing techniques at creating a single map that reveals structure at many different scales.

While t-SNE makes vast improvements in comparison to other existing methods, one should note that this algorithm has its own weaknesses as well. First, it is not clear how t-SNE performs on general dimensionality reduction tasks where d > 3 due to the heavy tails of the Student-t distribution. Additionally, the relatively local nature of t-SNE makes it sensitive to the curse of the intrinsic dimensionality of the data, which means t-SNE might be less successful if it is applied on data sets with a very high intrinsic dimensionality. The most significant potential weakness is that t-SNE is not guaranteed to converge to a global optimum of its cost function due to the non-convexity of the cost function. Therefore, we have to choose several optimization parameters, and the solutions may differ depending on the choice of these parameters. In spite of these drawbacks, t-SNE still outperforms other similar methods and improves upon common issues in high-dimensional visualization methods. In our paper, we will implement and evaluate t-SNE on multiple datasets. 


## 2. t-SNE Algorithm

t-SNE is a variation of the Stochastic Neighbor Embedding algorithm (Hinton and Roweis, 2002) that simplifies optimization and prevents crowding in the center of the map (also referred to as the crowding problem). In order to understand t-SNE, we must first discuss Stochastic Neighbor Embedding (SNE), which forms a basis for the t-SNE algorithm.

### 2.1 Stochastic Neighbor Embedding (SNE)

SNE first converts Euclidean distances between the high-dimensional datapoints into conditional probabilities, or pairwise affinities/similarities. The conditional probability shown below represents the similarity between two datapoints. In other words, $p_{j \mid i}$ is the probability that datapoint $x_i$ would choose $x_j$ as its closest point if closest points were chosen in proportion to their probability density under a Normal distribution centered at $x_i$. Therefore, when datapoints are close, the conditional probability will be high, and when they are far apart it will be low.

The conditional probability can be written mathematically as follows:
$$p_{j \mid i} = \frac{\text{exp}(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k\neq i} \text{exp}(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2) }$$
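
As a minimal sketch of the formula above, the following NumPy function computes the full matrix of conditional probabilities when the bandwidths $\sigma_i$ are already known. The function name `conditional_probabilities` and the toy data are our own illustrative assumptions, not part of the package.

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Return P with P[i, j] = p_{j|i} for fixed, given bandwidths sigma_i."""
    X = np.asarray(X, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    # Squared Euclidean distances between all pairs of rows.
    sq_norms = np.sum(X ** 2, axis=1)
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    D = np.maximum(D, 0.0)  # guard against tiny negative values from rounding
    # Gaussian kernel with one bandwidth per row i.
    logits = -D / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)            # exclude k = i, so p_{i|i} = 0
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

# Toy data (hypothetical): two nearby points and one distant point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = conditional_probabilities(X, sigmas=np.ones(3))
```

Each row of `P` sums to one, and for the first point the nearby neighbor receives a much larger probability than the distant one.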

The value of $\sigma_i$ cannot be set directly: it is found for each datapoint by binary search so that the resulting distribution has a fixed perplexity chosen by the user, typically between 5 and 50.

$$\text{Perplexity} = 2^{- \sum_j p_{j \mid i} \text{log}_2 p_{j \mid i}}$$
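
The binary search can be sketched as follows for a single datapoint. This is an illustration under our own assumptions (the function names, the search bracket, and the tolerance are hypothetical), and it relies on the fact that perplexity increases with $\sigma_i$.

```python
import numpy as np

def perplexity_of_row(sq_dists, sigma):
    """Perplexity 2^H of p_{.|i}, given squared distances from x_i to the other points."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    nz = p > 0                      # skip zero entries to avoid log2(0)
    entropy = -np.sum(p[nz] * np.log2(p[nz]))
    return 2.0 ** entropy

def find_sigma(sq_dists, target_perplexity, tol=1e-5, max_iter=100):
    """Bisection on sigma; the bracket [1e-10, 1e4] is an assumed search range."""
    lo, hi = 1e-10, 1e4
    mid = 0.5 * (lo + hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        perp = perplexity_of_row(sq_dists, mid)
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = mid                # distribution too flat: shrink sigma
        else:
            lo = mid                # distribution too peaked: grow sigma
    return mid

# Hypothetical squared distances from one point to 50 others.
rng = np.random.default_rng(0)
sq_dists = rng.uniform(0.1, 10.0, size=50)
sigma = find_sigma(sq_dists, target_perplexity=10.0)
```

In a full implementation this search is repeated once per datapoint, which is why each $x_i$ ends up with its own $\sigma_i$.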

Next in SNE, we calculate a similar probability for low-dimensional counterparts of the high-dimensional points $x_i$ and $x_j$. This time the variance is set to $\frac{1}{\sqrt2}$:

$$q_{j \mid i} = \frac{\text{exp}(-\lVert y_i - y_j \rVert^2 )}{\sum_{k\neq i} \text{exp}(-\lVert y_i - y_k \rVert^2) }$$ 

These two conditional probabilities will be equal if the map points $y_i$ and $y_j$ correctly model the similarity between the high-dimensional datapoints $x_i$ and $x_j$. Therefore, SNE searches for low-dimensional points that minimize the mismatch between $q_{j \mid i}$ and $p_{j \mid i}$. This mismatch is measured by the Kullback-Leibler (KL) divergence and minimized using gradient descent:

$$ \text{Cost for SNE} = \sum_i \sum_j p_{j \mid i} \text{log} \frac{p_{j \mid i}}{q_{j \mid i}} $$

This cost function is large for using far datapoints to represent near ones, but it is relatively small when using close datapoints to represent far ones. In this way, SNE uses an asymmetrical cost function to preserve local map structure.
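
The asymmetry can be illustrated numerically with the contribution of a single pair to the cost. The probabilities below are made-up values chosen only for illustration.

```python
import numpy as np

def kl_term(p, q):
    """Contribution p * log(p / q) of a single pair to the SNE cost."""
    return p * np.log(p / q)

# Nearby datapoints (large p) mapped far apart (small q): heavily penalized.
near_mapped_far = kl_term(0.8, 0.1)   # about 1.66
# Distant datapoints (small p) mapped close together (large q): small magnitude.
far_mapped_near = kl_term(0.1, 0.8)   # about -0.21
```

Individual terms can be negative, but the full sum over a pair of valid probability distributions is never negative. The point of the comparison is the difference in magnitude between the two kinds of error.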

The minimization of the cost above is performed with gradient descent. The gradient is defined here:

$$\frac{\delta C}{\delta y_i} = 2 \sum_j (p_{j \mid i} -  q_{j \mid i} + p_{i \mid j} - q_{i \mid j})(y_i - y_j)$$

### 2.2 t-SNE and Comparison

t-SNE is very similar to SNE, but differs in two main ways: it uses a different (symmetric) cost function, and it uses a Student t-distribution to compute similarity in the low-dimensional space. The heavy tail of the Student t-distribution alleviates both the crowding problem and the optimization difficulties of the SNE algorithm.

We define the probabilities differently to use the symmetric cost function. We will be using joint probabilities as follows (due to the properties of the Student t-distribution as well as symmetric cost function):


$$q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2 )^{-1}}{\sum_{k\neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1} }$$ 

$$p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}$$ 
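
A short sketch of both joint distributions is given below. The function names `joint_p` and `joint_q` are hypothetical, and the conditional probabilities are filled with random values purely so that the example runs on its own.

```python
import numpy as np

def joint_p(P_cond):
    """Symmetrize conditional probabilities: p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

def joint_q(Y):
    """Student-t affinities q_ij between low-dimensional map points."""
    sq_norms = np.sum(Y ** 2, axis=1)
    D = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T, 0.0)
    W = 1.0 / (1.0 + D)
    np.fill_diagonal(W, 0.0)        # the normalizing sum runs over k != l
    return W / W.sum()

# Hypothetical inputs: random row-normalized conditionals and random map points.
rng = np.random.default_rng(0)
P_cond = rng.random((5, 5))
np.fill_diagonal(P_cond, 0.0)
P_cond /= P_cond.sum(axis=1, keepdims=True)
P = joint_p(P_cond)
Q = joint_q(rng.normal(size=(5, 2)))
```

Both `P` and `Q` are symmetric and sum to one over all pairs, which is what allows the single KL divergence used by t-SNE.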

The Perplexity is defined the same as in SNE: $$\text{Perplexity} = 2^{- \sum_j p_{j \mid i} \text{log}_2 p_{j \mid i}}$$


t-SNE differs from SNE in choice of cost function. The cost function for t-SNE is symmetric and has simpler gradients:

$$ \text{Cost for t-SNE} = \sum_i \sum_j p_{ij} \text{log} \frac{p_{ij}}{q_{ij}} $$


The minimization of the cost above is performed with gradient descent using momentum. The gradient is defined here (simpler than that of SNE):

$$\frac{\delta C}{\delta y_i} = 4 \sum_j (p_{ij} -  q_{ij})(y_i - y_j) (1 + \lVert y_i - y_j \rVert^2)^{-1}$$
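
The cost and its gradient can be sketched in NumPy as below. This is an illustrative version with hypothetical function names, not the `grad_C` function from our package. A useful sanity check for any such implementation is to compare the analytic gradient against a finite-difference approximation of the cost.

```python
import numpy as np

def q_and_weights(Y):
    """Return Q, the unnormalized weights W, and all differences y_i - y_j."""
    diff = Y[:, None, :] - Y[None, :, :]           # shape (n, n, d)
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))    # (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W, diff

def kl_cost(P, Y):
    """KL divergence between the joint distributions P and Q(Y)."""
    Q, _, _ = q_and_weights(Y)
    mask = P > 0                                   # skips the zero diagonal
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def grad_kl(P, Y):
    """Gradient 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^-1."""
    Q, W, diff = q_and_weights(Y)
    coeff = (P - Q) * W
    return 4.0 * np.sum(coeff[:, :, None] * diff, axis=1)

# Hypothetical symmetric joint distribution and map points.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
A = A + A.T
np.fill_diagonal(A, 0.0)
P = A / A.sum()
Y = rng.normal(size=(6, 2))
G = grad_kl(P, Y)
```

The broadcasting approach builds an array of shape (n, n, d), so it is only practical for small n. It is meant to mirror the formula, not to be fast.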


Gradient descent using momentum allows us to speed up the optimization and to avoid poor local minima. This type of gradient descent remembers the update at each iteration and determines the next update as a linear combination of the gradient and the previous update. The update function is given below, where $\eta$ is defined to be the learning rate and $\alpha$ is the momentum. 

$$Y^t = Y^{t-1} + \eta \frac{\delta C}{\delta y_i} + \alpha (Y^{t-1} - Y^{t-2})$$
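
The momentum update can be sketched on a toy problem. One assumption to flag: the sketch subtracts the gradient term so that each step moves downhill, which is our reading of how the update is applied in a descent method. The quadratic objective and the values of $\eta$ and $\alpha$ are chosen only for illustration.

```python
import numpy as np

def momentum_step(Y, Y_prev, grad, eta, alpha):
    """One update: gradient step plus alpha times the previous displacement."""
    return Y - eta * grad + alpha * (Y - Y_prev)

# Toy objective 0.5 * ||Y||^2, whose gradient is simply Y and whose minimum is 0.
Y_prev = np.array([[3.0, -2.0]])
Y = Y_prev.copy()
for _ in range(200):
    grad = Y
    Y, Y_prev = momentum_step(Y, Y_prev, grad, eta=0.1, alpha=0.8), Y
```

After the loop, `Y` is very close to the minimizer at the origin. The term `alpha * (Y - Y_prev)` is the "memory" of the previous update described above.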


### 2.3 t-SNE Pseudocode

1. Begin with a data matrix $X$ and a chosen perplexity level (default 30).
2. Compute the pairwise distance matrix $D_{ij} = \lVert x_i - x_j \rVert^2$, using squared Euclidean distance.
3. Compute the pairwise similarity matrix (conditional probabilities) $p_{j \mid i}$, using $D$ and binary search to find the optimal $\sigma_i$.
4. Calculate the joint probabilities $p_{ij}$.
5. Sample the initial solution $Y^{0} = (Y_1, Y_2)$ from Normal(0, 0.0001).
6. For $t$ in 1:max_iterations:
    1. Compute the low-dimensional affinities $q_{ij}$.
    2. Compute the gradient $\frac{\delta C}{\delta y_i}$.
    3. Update $Y$: $Y^t = Y^{t-1} + \eta \frac{\delta C}{\delta y_i} + \alpha (Y^{t-1} - Y^{t-2})$.
7. The result is $Y$, a 2-dimensional data representation of $X$: $Y = (Y_1, Y_2)$.
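
The steps above can be assembled into a compact NumPy sketch. This is not the code from our package: the function names, the search over $\beta_i = 1 / (2\sigma_i^2)$, the default arguments, and the toy data are all illustrative assumptions. The gradient term is subtracted so that the update descends the cost, and refinements such as early exaggeration are left out.

```python
import numpy as np

def squared_distances(X):
    s = np.sum(X ** 2, axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * X @ X.T, 0.0)

def conditional_p(D, perplexity, n_steps=50):
    """Step 3: per-row binary search over beta_i = 1 / (2 sigma_i^2)."""
    n = D.shape[0]
    P = np.zeros((n, n))
    target = np.log2(perplexity)
    for i in range(n):
        d = np.delete(D[i], i)
        lo, hi, beta = 0.0, np.inf, 1.0
        for _ in range(n_steps):
            p = np.exp(-(d - d.min()) * beta)
            p /= p.sum()
            entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
            if entropy > target:      # too flat: increase beta (smaller sigma)
                lo = beta
                beta = beta * 2.0 if np.isinf(hi) else 0.5 * (lo + hi)
            else:                     # too peaked: decrease beta (larger sigma)
                hi = beta
                beta = 0.5 * (lo + hi)
        P[i, np.arange(n) != i] = p
    return P

def tsne_sketch(X, perplexity=30.0, n_iter=1000, eta=100.0, alpha=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    P_cond = conditional_p(squared_distances(X), perplexity)   # steps 2-3
    P = (P_cond + P_cond.T) / (2.0 * n)                        # step 4
    Y = rng.normal(0.0, 1e-2, size=(n, 2))   # step 5: std 0.01, variance 0.0001
    Y_prev = Y.copy()
    for _ in range(n_iter):                                    # step 6
        diff = Y[:, None, :] - Y[None, :, :]
        W = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))
        np.fill_diagonal(W, 0.0)
        Q = W / W.sum()
        grad = 4.0 * np.sum(((P - Q) * W)[:, :, None] * diff, axis=1)
        Y_new = Y - eta * grad + alpha * (Y - Y_prev)
        Y_prev, Y = Y, Y_new
    return Y                                                   # step 7

# Hypothetical toy data: two Gaussian blobs in 10 dimensions.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(20, 10)),
                    rng.normal(8.0, 1.0, size=(20, 10))])
Y_map = tsne_sketch(X_demo, perplexity=5.0, n_iter=100, eta=10.0, seed=0)
```

The toy call uses a smaller learning rate and fewer iterations than the defaults because the dataset is tiny. Suitable values depend on the data.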

### 2.4 Python Implementation

We implement an optimized version of t-SNE detailed below. Our implementation can be found in the library `TSNE-reddy-beckner` (version 0.0.3). Installation instructions can be found [here](https://test.pypi.org/project/TSNE-reddy-beckner/0.0.3/). The primary function within our package is called `TSNE`, which takes a high-dimensional data matrix $X$ as input and returns the reduced-dimension map, $Y$. The package also includes a function called `TSNE_plot` for plotting the $Y$ output of t-SNE based on given classification labels.

## 3. Optimization for Performance


### 3.1 Profiling

We tested our initial implementation of t-SNE on the MNIST dataset. The MNIST dataset is comprised of 60,000 grayscale images of handwritten digits, and we test on a 1000-observation subset. Profiling results for our initial implementation can be found below. We show the top ten functions ordered by cumulative time. Code for optimization testing can be found in `Optimization.ipynb` in the accompanying Github Repository.

    import pstats
    p = pstats.Stats('tsne.prof')
    p.sort_stats('cumulative').print_stats(10)
    pass

    Sat Apr 24 18:00:18 2021    tsne.prof

             1910406 function calls in 1163.256 seconds

       Ordered by: cumulative time
       List reduced from 61 to 10 due to restriction <10>

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000 1163.256 1163.256 {built-in method builtins.exec}
            1    0.000    0.000 1163.256 1163.256 <string>:1(<module>)
            1    2.048    2.048 1163.256 1163.256 <ipython-input-55-17c53d1c44a4>:118(tsne)
         1000  951.282    0.951  977.286    0.977 <ipython-input-55-17c53d1c44a4>:104(grad_C)
         1000  165.805    0.166  172.470    0.172 <ipython-input-55-17c53d1c44a4>:72(q_ij)
       177196   20.150    0.000   20.150    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        59399    0.169    0.000   19.003    0.000 {method 'sum' of 'numpy.ndarray' objects}
        59399    0.141    0.000   18.833    0.000 /opt/conda/lib/python3.6/site-packages/numpy/core/_methods.py:36(_sum)
         2001    0.035    0.000   13.753    0.007 

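We can see that our top-level `tsne` function has the largest cumulative time, but this is because it calls every other function: its own time is only about 2 seconds. The bulk of the run time is spent in the gradient function `grad_C` (about 977 of 1163 seconds, roughly 84%) and in the function that calculates $q_{ij}$ (about 172 seconds, roughly 15%), each of which is called once per iteration. We would therefore like to optimize these functions in particular, along with the binary search for the $\sigma_i$'s used to calculate the $p_{ij}$ matrix, if possible.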
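
For readers who want to reproduce this kind of report, a profile file can be written with Python's built-in `cProfile` module and then read back with `pstats`. The sketch below profiles a stand-in workload; the function `slow_sum_of_squares` and the file name `example.prof` are placeholders, and in practice the profiled call would be the t-SNE function.

```python
import cProfile
import os
import pstats
import tempfile

def slow_sum_of_squares(n):
    """Stand-in workload; replace with the function to be profiled."""
    total = 0
    for i in range(n):
        total += i * i
    return total

# Write the profile to a temporary file so the example leaves no clutter behind.
profile_path = os.path.join(tempfile.gettempdir(), "example.prof")
profiler = cProfile.Profile()
result = profiler.runcall(slow_sum_of_squares, 100_000)
profiler.dump_stats(profile_path)

# Read the file back and show the ten entries with the largest cumulative time.
stats = pstats.Stats(profile_path)
stats.sort_stats("cumulative").print_stats(10)
```

When reading the output, `tottime` is the time spent inside a function itself, while `cumtime` also includes the time spent in everything it calls.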

### 3.2 Optimization using Cython and Numba JIT

We first optimize performance by implementing our functions in Cython and with the JIT (just-in-time) compiler from the Python package `numba`. Cython translates Python code to C, while Numba compiles Python functions to machine code at run time using the LLVM compiler. Some functions had to be rewritten to avoid certain Python library calls that are incompatible with Numba. We find that the Cythonized implementation is slower than our un-optimized code (about 0.7 times the original speed), which we attribute to high overhead costs of type conversion, while our JIT implementation performs about 4.6 times faster (see the comparison table in Section 3.4).

### 3.3 Further Optimization using Explicit Loops and PCA Initialization

We are able to further optimize our code using explicit loops in the calculation of the pairwise distance matrix, $D$, in our `squared_euc_dist` function. This distance matrix is an input for our calculation of the $p_{ij}$ matrix of pairwise affinities. We had initially vectorized this calculation using `numpy` operations. We then optimized for performance by forgoing vectorization and instead defining explicit loops, which makes it even easier for the `numba` JIT to optimize our implementation. This improves the speed-up from about 4.6 times to about 5.2 times faster than our original implementation. We choose to define explicit loops in our `squared_euc_dist` function in the final implementation for our package.
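
The difference between the two styles can be sketched as follows. These are not the bodies of our package functions: the names `squared_euc_dist_vectorized` and `squared_euc_dist_looped` are hypothetical, and the `try`/`except` fallback is only there so that the example still runs when `numba` is not installed.

```python
import numpy as np

try:
    from numba import njit          # optional dependency
except ImportError:                 # plain-Python fallback so the sketch still runs
    def njit(func):
        return func

def squared_euc_dist_vectorized(X):
    """Pairwise squared distances using NumPy broadcasting."""
    s = np.sum(X ** 2, axis=1)
    return s[:, None] + s[None, :] - 2.0 * X @ X.T

@njit
def squared_euc_dist_looped(X):
    """Same quantity with explicit loops, a form that Numba compiles well."""
    n, d = X.shape
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            acc = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                acc += diff * diff
            D[i, j] = acc           # fill both halves of the symmetric matrix
            D[j, i] = acc
    return D

# Hypothetical test data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
D_vec = squared_euc_dist_vectorized(X)
D_loop = squared_euc_dist_looped(X)
```

The looped version only visits each pair once and avoids temporary arrays, which is part of why compiled loops can beat the vectorized form. Without compilation, the same loops would be far slower than NumPy.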

The user of our package is able to further optimize performance by performing PCA on the data before calling our `TSNE` function. This quickly performs an initial reduction in dimensions before further reduction is done using t-SNE. PCA is in general much faster than t-SNE, and this preprocessing speeds up computation of pairwise distances and suppresses some noise without severely distorting the distances between points. We see a slight increase in performance on top of explicit looping for distance calculations (about 5.3 times faster than our original implementation). We choose to leave PCA initialization out of our final package, leaving it as a choice for the user based on background knowledge of the data.
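
A PCA preprocessing step of this kind can be sketched with NumPy alone. The function name `pca_reduce` is hypothetical, and the default of 30 components is an assumption borrowed from the original t-SNE paper, not a value fixed by our package.

```python
import numpy as np

def pca_reduce(X, n_components=30):
    """Project X onto its leading principal components using the SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Hypothetical data: 100 observations with 50 features, reduced to 10.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
Z = pca_reduce(X, n_components=10)
```

The reduced matrix `Z` would then be passed to the t-SNE function in place of `X`, so that pairwise distances are computed over fewer columns.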

### 3.4 Comparison Table

The table below shows the speed-up multiplier of each optimization relative to our initial implementation of t-SNE. All tests were performed on MNIST data as described above. PCA-initialized Numba with explicit looped distance calculation is the fastest. For our final package, we use the Numba JIT implementation with looped distance, leaving PCA initialization up to the user.

    import pandas as pd
    speed = pd.read_csv('Report_Plots/speed_table_final.csv', header = 0, names = ['Implementation', 'Speed-up Multiplier'])
    speed

| Implementation | Speed-up Multiplier |
|---|---|
| Numba | 4.605476 |
| Cython | 0.700209 |
| PCA Initialized Numba | 4.992868 |
| Numba with Looped Distance | 5.213304 |
| PCA Initialized Numba with Looped Distance | 5.291994 |


## 4. Testing on Datasets from Original Paper

Because t-SNE is primarily a way to visualize high-dimensional data in lower dimensions, we test our implementation by comparing to known visualizations of t-SNE outputs on datasets from the original paper (Hinton and van der Maaten, 2008). These published visualizations serve as our source of "ground truth": we compare visualizations produced from the output of our implementation of t-SNE against those of Hinton and van der Maaten, 2008. This allows us to understand how our algorithm performs on datasets for which we have a reference t-SNE visualization.

To control testing as much as possible, we use the same parameters as outlined in the paper. Specifically, we choose number of iterations = 1000, perplexity = 40, momentum ($\alpha$) = 0.8, and learning rate ($\eta$) = 100. We note that differences in visualizations do arise, primarily because t-SNE output is highly sensitive to the choice of initial value for $Y$. We use the same initialization scheme as the paper, which draws the initial value of $Y$ randomly from a Normal distribution with mean 0 and variance 0.0001. However, the random seed is likely different, which leads to different initializations. Additionally, Hinton and van der Maaten, 2008 propose some further optimizations beyond the basic algorithm outlined above. For instance, they use an "early exaggeration" that multiplies all $p_{ij}$'s by a constant factor in the initial stages of the optimization, which can lead to differences in output as well. Overall, based on the visualizations below, the output of our implementation of t-SNE is similar to that of the implementation proposed in the paper.


All data can be found in the `Data` folder in the accompanying Github repository. The below plots were generated from `Applications_Testing.ipynb` and can also be found in the accompanying Github repository.  

### 4.1 MNIST Data

The MNIST dataset is comprised of 60,000 grayscale images of handwritten digits. We use the same dataset used in Hinton and van der Maaten, 2008 which is a subset of 6,000 randomly selected digits. The goal of t-SNE is to separate the image data into respective digits. We can see a comparison of our implementation vs. the Hinton and van der Maaten implementation below. There are slight differences as we mentioned above.


Authors' implementation:
![img1](Report_Plots/MNIST_paper.png)

Our implementation:
![img2](Report_Plots/MNIST_BR.png)

### 4.2 Olivetti Faces Data 

The Olivetti Faces Data contains 400 images of 40 people, with each image being 10,304 pixels. For some subjects, images were taken at different times, varying the lighting and facial expressions. The target of this dataset is an integer from 0 to 39 indicating the identity of the person pictured. The dataset contains only 10 instances per class and thus is relatively small. We can see below that our implementation of t-SNE produces a visual similar to that produced by the paper's implementation.

Authors' implementation:
![img3](Report_Plots/Olivetti_paper.png)

Our implementation:
![img4](Report_Plots/Olivetti_BR.png)

### 4.3 COIL-20 Data

COIL-20 Data consists of images for 20 types of objects that contain both the object and the background. The different objects are viewed from 72 equally spaced orientations, for a total of 1,440 images. The goal of t-SNE is to form clusters based on the type of object. We can see below that our implementation is close to that of the paper. In both our implementation and the paper's, t-SNE accurately represents the one-dimensional manifold of viewpoints as a closed loop. The paper suggests that for objects which look similar from the front and back, t-SNE distorts the loop so that the images of the front and back are mapped to nearby points. Specifically, for the types of toy cars in the data, which are the clusters that look like sausages or a U shape in both plots, the manifolds are aligned by the orientation of the cars to capture the large similarity between different types of cars. Thus t-SNE keeps these manifolds close together.

Authors' implementation:
![img4](Report_Plots/COIL_paper.png)

Our implementation:
![img5](Report_Plots/COIL_BR.png)

## 5. Applications to Real Datasets

We also test our implementation of the t-SNE algorithm on real-world examples outside of Hinton and van der Maaten, 2008. All data can be found in the `Data` folder in the accompanying Github Repository. The below plots were generated from `Applications_Testing.ipynb` and can also be found in the accompanying Github repository.

### 5.1 RNA Sequence Data for Tumor Classification

A number of different techniques exist for classifying gene expression data. Many researchers aim to visually understand clustering of RNA expression levels, as it has become possible to obtain expression data from thousands of cells. t-SNE is being increasingly used in this field as it does a good job of revealing local structure. One pitfall, however, is that it can often fail to represent the global structure of the data accurately.

We examine how our t-SNE implementation performs on the RNA-Seq (HiSeq) PANCAN data set, which comes from the UCI Machine Learning Repository. It is a random extraction of gene expressions of patients having different types of tumors. Our goal is to classify patients into one of five tumor categories: BRCA, KIRC, COAD, LUAD, and PRAD. Please note that this data is too large to store on the accompanying Github repo. We store a sample of the data on Github, but the full dataset can be found [here](https://data.mendeley.com/datasets/cgf6wnyc5w/1).

We can see below that our implementation of t-SNE does a relatively good job of showing clusters of the different types of tumors. The halo of points is due to the fact that our data contains some large numbers that cause difficulty with the binary search for the correct perplexity. This is a known issue discussed in Hinton and van der Maaten, 2008.

![img6](Report_Plots/RNA_BR.png)

### 5.2 Breast Cancer Data

t-SNE is also widely applied to image classification, as observed in section 4. Image processing has historically been an accurate tool for diagnosing breast tumors. We apply our implementation to Breast Cancer Tumor data obtained from the UCI Machine Learning Repository. Features are computed from digitized images of a fine needle aspirate of a breast tumor, and describe characteristics of the cell nuclei in the image. We aim to classify benign vs. malignant tumors with our t-SNE implementation. In the below image, 0 indicates benign (blue) and 1 indicates malignant (red).  

Our t-SNE implementation does a good job of clustering based on tumor diagnosis, and thus supports the idea that image processing can be an accurate tool for breast cancer diagnosis. 

![img7](Report_Plots/Breast_Cancer_BR.png)

### 5.3 Iris Data

Finally, we test our implementation of optimized t-SNE on the well-known Iris dataset, which contains three classes of iris plant, with 50 observations in each class. The iris dataset is available within Python's Seaborn library. We would like to use t-SNE to identify clusters for the classes of the irises. The important feature of this dataset is that `setosa` is linearly separable from the other two classes, `versicolor` and `virginica`, which are not linearly separable from each other. We can see that our implementation of t-SNE preserves this important structure and does a good job of separating classes into identifiable clusters.

![img8](Report_Plots/Iris_BR.png)

## 6. Comparative Analysis to Competing Algorithms

We compare our optimized implementation of t-SNE to competing dimension-reduction and visualization algorithms, namely PCA, Isomap, Locally Linear Embedding, Neighborhood Components Analysis, and MDS. We test these algorithms on the MNIST dataset that is described above. The below plots were generated from `Comparisons.ipynb` and can also be found in the accompanying Github repository.

### Our Implementation

We initialize with PCA before our implementation to decrease computation time. 
![img9](Report_Plots/MNIST_PCA_BR.png)

### Comparison: PCA

PCA, or Principal Components Analysis, is a traditional linear dimensionality reduction technique that focuses on keeping low-dimensional representations of dissimilar datapoints far apart.


![img10](Report_Plots/PCA_MNIST_plot.png)

Overall, PCA is much faster than all other methods, and it is also one of the simplest. However, it does a much worse job of visualizing the data, as it is difficult to see the patterns between the different labels.


### Comparison: Isomap

Isomap is a nonlinear dimensionality reduction technique that aims to preserve the local structure of data. Isomap can be viewed as an extension of Multi-dimensional Scaling (MDS), addressed later in this section.

![img11](Report_Plots/Isomap_MNIST_plot.png)


Isomap is an improvement to linear dimension reduction in the effectiveness of the visualization, but it is not as effective as t-SNE or more complex methods.

### Comparison: Locally Linear Embedding

LLE is a nonlinear dimensionality reduction technique that aims to preserve the local structure of data.


![img12](Report_Plots/LLE_MNIST_plot.png)

There is some separation with LLE, particularly for the '1' cluster, but we again notice a lot of overlap between classes, which results in a somewhat ineffective visualization.

### Comparison: Neighborhood Components Analysis

NCA is a machine learning algorithm that learns a linear transformation in a supervised fashion to improve the classification accuracy of stochastic nearest neighbors.

![img13](Report_Plots/NCA_MNIST_plot.png)

Again, NCA does not present an improvement over t-SNE. There is little separation between clusters here.

### Comparison: MDS

MDS, or classical multidimensional scaling, is a traditional linear dimensionality reduction technique that focuses on keeping low-dimensional representations of dissimilar datapoints far apart. MDS is used to translate information about the pairwise distances into n points mapped into an abstract Cartesian space.

![img14](Report_Plots/MDS_MNIST_plot.png)

While there is a bit of grouping where classes cluster together, there is again too much overlap between classes for this to be an effective visualization.

To summarize, the run times of the methods compared above are:

    import pandas as pd
    speed_comparison = pd.read_csv('Report_Plots/comparison_speed_table.csv', header = 0, names = ['Implementation', 'Time (s)'])
    speed_comparison

| Implementation | Time (s) |
|---|---|
| PCA | 0.113 |
| LLE | 14.108 |
| Isomap | 14.394 |
| NCA | 67.81 |
| MDS | 2215.406 |
| our t-SNE package | 785.343 |


## 7. Discussion and Conclusion

t-SNE is a highly effective method for visualizing high-dimensional data through a low-dimensional, visual representation. t-SNE often provides better visualizations than competing methods such as PCA, Isomap, and LLE by reducing the tendency to crowd points and preserving local structure. However, as noted in the paper, t-SNE is very sensitive to user specifications, specifically the values of perplexity and the initial choice of $Y$. Unlike PCA, it is also not a deterministic approach, meaning that it produces a different output each time unless a seed is set (as it is in our implementation). A very significant disadvantage of t-SNE is that it scales poorly to massive datasets.

Thus, t-SNE can be, and has been, improved upon in recent years to make up for some of these shortcomings. Much research has been conducted to optimize t-SNE for massive data. For instance, Barnes-Hut t-SNE accelerates t-SNE using tree-based algorithms (van der Maaten, 2014). A Multi-core t-SNE has also been developed, which is a modification of the Barnes-Hut acceleration with Torch CFFI-based wrappers.

## References

1. van der Maaten, Laurens and Hinton, Geoffrey. "Visualizing Data using t-SNE." Journal of Machine Learning Research 9 (2008): 2579--2605.

2. S. Liu, D. Maljovec, B. Wang, P.-T. Bremer and V. Pascucci. “Visualizing High-Dimensional Data: Advances in the Past Decade.” Eurographics Conference on Visualization (2015).

3. W. Nick Street, William H. Wolberg, and O.L. Mangasarian. “Nuclear Feature Extraction for Breast Tumor Diagnosis”. International Symposium on Electronic Imaging: Science and Technology (1993). https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.707&rep=rep1&type=pdf

4. Sci-kit Learn implementation: https://github.com/scikit-learn/scikit-learn/blob/95119c13af77c76e150b753485c662b7c52a41a2/sklearn/manifold/_utils.pyx

5. Zhang, Dehao. “Dimensionality Reduction using t-Distributed Stochastic Neighbor Embedding (t-SNE) on the MNIST Dataset.” Towards Data Science, https://towardsdatascience.com/dimensionality-reduction-using-t-distributed-stochastic-neighbor-embedding-t-sne-on-the-mnist-9d36a3dd4521

6. Sci-kit Learn Manifold Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.Isomap.html

7. RNA Data from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq

8. van der Maaten, Laurens. “Accelerating t-SNE using Tree-Based Algorithms”. Journal of Machine Learning Research 15 (2014) 1-21. https://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf

9. Multicore-t-SNE implementation: https://github.com/DmitryUlyanov/Multicore-TSNE/tree/master/multicore_tsne

10. ASU Feature Selection Tutorial: https://jundongl.github.io/scikit-feature/tutorial.html

11. Breast Cancer Data from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29

12. Schoneveld, Liam. "In Raw Numpy: t-SNE". (2021). https://nlml.github.io/in-raw-numpy/in-raw-numpy-t-sne/ 


Author Contributions: 

Mady Beckner: Package creation, Optimization, Testing and Analysis

Aasha Reddy: Initial implementation, Optimization, Testing and Analysis