# Introduction to t-SNE

### Goal

Learn to visualize **high-dimensional data** in a **low-dimensional space** using a **nonlinear dimensionality reduction** technique.

## What is t-SNE?

**t-SNE (t-distributed Stochastic Neighbor Embedding)** is an **unsupervised**, **nonlinear** dimensionality reduction algorithm for **data visualization**.
It maps high-dimensional data to **2D or 3D space**, revealing clusters and structures that are not linearly separable.

Unlike linear methods such as PCA, t-SNE can capture **nonlinear relationships** and preserve **local similarities** in the data.

## How t-SNE Works

The algorithm measures similarity between pairs of data points in both **high** and **low** dimensions, then optimizes the low-dimensional map so these similarities match as closely as possible.

### Step 1: High-Dimensional Similarities

For data points $x_i$ and $x_j$, define the conditional probability that $x_j$ is a neighbor of $x_i$:

$$
p_{j|i} = \frac{\exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\right)}{\sum_{k \neq i} \exp\left(-\frac{\|x_i - x_k\|^2}{2\sigma_i^2}\right)}
$$

where $\sigma_i$ is chosen so that the **perplexity** of the distribution equals a user-defined constant:

$$
Perp(P_i) = 2^{H(P_i)} \quad \text{where} \quad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}
$$
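As a quick, illustrative check of the perplexity formula (the helper `perplexity` below is a hypothetical name, not a library function), a uniform distribution over $k$ neighbors has entropy $\log_2 k$ and therefore perplexity $k$. This is why perplexity is commonly read as an effective number of neighbors.

```python
import numpy as np

def perplexity(p):
    """Return 2**H(p) for a discrete distribution p; zero entries are skipped."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    entropy_bits = -np.sum(nonzero * np.log2(nonzero))
    return 2.0 ** entropy_bits

uniform = np.full(8, 1 / 8)                    # 8 equally likely neighbors
peaked = np.array([0.97, 0.01, 0.01, 0.01])    # one dominant neighbor

print(perplexity(uniform))  # 8.0
print(perplexity(peaked))   # only slightly above 1
```

A sharply peaked distribution behaves as if the point had barely more than one neighbor, even though four candidates are listed.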

The joint probability is then symmetrized:

$$
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
$$
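A minimal NumPy sketch of Step 1 follows. It assumes a bisection search over $\beta_i = 1/(2\sigma_i^2)$ to reach the target perplexity. This is one common way to solve for $\sigma_i$, but the formulas above do not prescribe it. The function names and the random demo data are illustrative.

```python
import numpy as np

def conditional_affinities(X, target_perplexity=5.0, tol=1e-5, max_iter=100):
    """Row i holds p_{j|i}; beta_i = 1 / (2 * sigma_i**2) is found by bisection."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    target_entropy = np.log2(target_perplexity)
    P = np.zeros((n, n))
    for i in range(n):
        d_i = np.delete(sq_dists[i], i)          # exclude k == i from the sum
        lo, hi, beta = 0.0, np.inf, 1.0
        for _ in range(max_iter):
            w = np.exp(-(d_i - d_i.min()) * beta)  # shift for numerical stability
            p = w / w.sum()
            entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:         # too flat: raise beta (smaller sigma)
                lo = beta
                beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
            else:                                # too peaked: lower beta (larger sigma)
                hi = beta
                beta = (lo + hi) / 2.0
        P[i, np.arange(n) != i] = p
    return P

def symmetrize(P_cond):
    """Joint affinities p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))                # hypothetical demo data
P_cond = conditional_affinities(X_demo, target_perplexity=5.0)
P_joint = symmetrize(P_cond)

print(P_cond.sum(axis=1)[:3])   # each row is a probability distribution
print(P_joint.sum())            # the joint distribution sums to 1
```

Dividing by $2n$ rather than $2$ is what makes the symmetrized values sum to one over all pairs.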

### Step 2: Low-Dimensional Similarities

For mapped points $y_i$ and $y_j$ in low-dimensional space:

$$
q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}
$$

This uses a **Student t-distribution** with one degree of freedom. Its heavy tails let moderately dissimilar points sit farther apart in the map, which keeps points from crowding together.
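The following NumPy sketch (function name and demo data are illustrative assumptions) computes $q_{ij}$ for a small random map. It also compares the Student-t kernel with a unit-variance Gaussian kernel at distance 3 to show how much more weight the heavy tail keeps.

```python
import numpy as np

def low_dim_affinities(Y):
    """Return the joint Student-t affinities q_ij for map points Y (n x d)."""
    diff = Y[:, None, :] - Y[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    kernel = 1.0 / (1.0 + sq_dists)      # (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(kernel, 0.0)        # pairs with k == l are excluded
    return kernel / kernel.sum()         # normalize over all pairs k != l

rng = np.random.default_rng(1)
Y_demo = rng.normal(size=(6, 2))         # hypothetical 2D map of six points
Q_demo = low_dim_affinities(Y_demo)
print(Q_demo.sum())

dist = 3.0
student_t = 1.0 / (1.0 + dist ** 2)      # 0.1
gaussian = np.exp(-dist ** 2 / 2.0)      # about 0.011
print(student_t, gaussian)
```

At this distance the Student-t kernel is roughly nine times larger than the Gaussian one, so distant pairs still contribute to $q_{ij}$.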

### Step 3: Minimize the Divergence

t-SNE minimizes the **Kullback–Leibler divergence** between the two distributions:

$$
C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

The gradient with respect to each point $y_i$ is:

$$
\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}
$$
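One way to check the gradient formula is to compare it with a finite-difference estimate of the cost. The sketch below does this on a small random problem. The symmetric matrix `P_demo` is made up for illustration rather than derived from real data, and all function names are assumptions.

```python
import numpy as np

def student_t_q(Y):
    """Return Q, the unnormalized kernel, and the pairwise differences y_i - y_j."""
    diff = Y[:, None, :] - Y[None, :, :]
    kernel = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum(), kernel, diff

def kl_cost(P, Y):
    """C = sum_ij p_ij log(p_ij / q_ij), skipping entries where p_ij == 0."""
    Q, _, _ = student_t_q(Y)
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def kl_gradient(P, Y):
    """dC/dy_i = 4 sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1."""
    Q, kernel, diff = student_t_q(Y)
    return 4.0 * np.sum(((P - Q) * kernel)[:, :, None] * diff, axis=1)

rng = np.random.default_rng(0)
n = 8
A = rng.random((n, n))
P_demo = A + A.T                      # symmetric, made-up affinities
np.fill_diagonal(P_demo, 0.0)
P_demo /= P_demo.sum()                # joint distribution sums to 1
Y_demo = rng.normal(size=(n, 2))

grad = kl_gradient(P_demo, Y_demo)

# Central finite difference with respect to the first coordinate of y_0
eps = 1e-6
Y_plus, Y_minus = Y_demo.copy(), Y_demo.copy()
Y_plus[0, 0] += eps
Y_minus[0, 0] -= eps
numeric = (kl_cost(P_demo, Y_plus) - kl_cost(P_demo, Y_minus)) / (2 * eps)

print(grad[0, 0], numeric)            # the two values should agree closely
```

The analytic formula relies on $P$ being symmetric and summing to one, which is why the demo matrix is built that way.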

## Algorithm 1: Simple Version of t-Distributed Stochastic Neighbor Embedding

**Data:**
Data set $X = \{x_1, x_2, \ldots, x_n\}$

**Cost function parameters:**
Perplexity $Perp$

**Optimization parameters:**
Number of iterations $T$, learning rate $\eta$, momentum $\alpha(t)$

**Result:**
Low-dimensional data representation $Y^{(T)} = \{y_1, y_2, \ldots, y_n\}$

### Pseudocode

begin

1. Compute pairwise affinities $p_{j|i}$ with perplexity $Perp$ (the Step 1 formula):
       $$
        p_{j|i} = \frac{\exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\right)}{\sum_{k \neq i} \exp\left(-\frac{\|x_i - x_k\|^2}{2\sigma_i^2}\right)}
       $$
       
    - Choose each $\sigma_i$ so that $Perp(P_i) = 2^{H(P_i)}$ matches the target perplexity, where $H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$
2. Symmetrize affinities:
       $$
        p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
       $$
    
3. Sample initial solution:
       $$
        Y^{(0)} = \{y_1, y_2, \ldots, y_n\} \sim \mathcal{N}(0, 10^{-4}I)
       $$
    
4. For $t = 1$ to $T$ do

    a. Compute low-dimensional affinities $q_{ij}$ (the Step 2 formula)
           $$
            q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}
           $$
        
    b. Compute gradient (the Step 3 formula)
           $$
            \frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}
           $$
        
    c. Update the low-dimensional map, stepping against the gradient because $C$ is being minimized:
           $$
            Y^{(t)} = Y^{(t-1)} - \eta \frac{\partial C}{\partial Y} + \alpha(t) (Y^{(t-1)} - Y^{(t-2)})
           $$
    
    end for

end
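The loop above can be sketched in NumPy as follows. This is a deliberately simplified illustration, not a reference implementation:

- It uses one shared `sigma` for all points instead of calibrating each $\sigma_i$ to a perplexity.
- It takes a small learning rate suited to the tiny demo set.
- It assumes a two-stage momentum schedule.
- It subtracts the gradient, since the goal is to minimize $C$.

All names and parameter values are assumptions.

```python
import numpy as np

def simple_tsne(X, n_iter=300, eta=10.0, sigma=1.0, seed=0):
    """Gradient descent with momentum on the t-SNE cost (shared sigma, no perplexity search)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Steps 1-2: Gaussian affinities with ONE shared sigma (simplifying assumption)
    dx = X[:, None, :] - X[None, :, :]
    W = np.exp(-np.sum(dx ** 2, axis=-1) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P_cond = W / W.sum(axis=1, keepdims=True)
    P = (P_cond + P_cond.T) / (2.0 * n)
    mask = P > 0

    # Step 3: initial map Y^(0) ~ N(0, 1e-4 I), i.e. standard deviation 1e-2
    Y = rng.normal(scale=1e-2, size=(n, 2))
    Y_prev = Y.copy()
    costs = []

    for t in range(1, n_iter + 1):
        # 4a: low-dimensional affinities
        dy = Y[:, None, :] - Y[None, :, :]
        kernel = 1.0 / (1.0 + np.sum(dy ** 2, axis=-1))
        np.fill_diagonal(kernel, 0.0)
        Q = kernel / kernel.sum()
        costs.append(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

        # 4b: gradient of the KL divergence
        grad = 4.0 * np.sum(((P - Q) * kernel)[:, :, None] * dy, axis=1)

        # 4c: descent step plus momentum (schedule values are assumed)
        alpha = 0.5 if t < 250 else 0.8
        Y_new = Y - eta * grad + alpha * (Y - Y_prev)
        Y_prev, Y = Y, Y_new

    return Y, costs

# Hypothetical demo data: two well-separated Gaussian blobs in 5D
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 5)),
    rng.normal(4.0, 0.5, size=(30, 5)),
])
Y_demo, costs = simple_tsne(X_demo)
print(costs[0], costs[-1])   # the cost should be lower at the end than at the start
```

Practical implementations such as scikit-learn's `TSNE` add the per-point perplexity search and further optimizations, so this sketch is only meant to mirror the pseudocode step by step.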

### Explanation of Terms

| Symbol      | Meaning                                             |
| ----------- | --------------------------------------------------- |
| $X$         | Original high-dimensional data                      |
| $Y$         | Low-dimensional embedding                           |
| $p_{ij}$    | Similarity between points in high-dimensional space |
| $q_{ij}$    | Similarity between points in low-dimensional space  |
| $C$         | Kullback–Leibler divergence cost function           |
| $\eta$      | Learning rate                                       |
| $\alpha(t)$ | Momentum term at iteration $t$                      |

## Summary Table

| Step | Formula | Purpose |
| ---- | ------- | ------- |
| 1    | $p_{j \mid i} = \frac{e^{-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2}}{\sum_{k \neq i} e^{-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2}}$ | High-D similarity |
| 2    | $q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}$ | Low-D similarity |
| 3    | $C = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$ | Cost (KL divergence) |
| 4    | $\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \lVert y_i - y_j \rVert^2)^{-1}$ | Gradient |
| 5    | $Y^{(t)} = Y^{(t-1)} - \eta \frac{\partial C}{\partial Y} + \alpha(t)(Y^{(t-1)} - Y^{(t-2)})$ | Update rule |


## t-SNE Python Example

In [None]:
import plotly.express as px
from sklearn.datasets import make_classification

X, y = make_classification(
    n_features=6,
    n_classes=3,
    n_samples=1500,
    n_informative=2,
    random_state=5,
    n_clusters_per_class=1,
)


fig = px.scatter_3d(x=X[:, 0], y=X[:, 1], z=X[:, 2], color=y, opacity=0.8)  # plots only the first three of the six features
fig.show()

## Fitting and Transforming PCA

We will now apply PCA to the dataset and keep the first two principal components. The `fit_transform` method fits the model and projects the dataset in a single call.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

## PCA Visualization Python
We can now visualize the results by displaying the two PCA components on a scatter plot.

- x: First component
- y: Second component
- color: Target variable

We have also used the `update_layout` function to add a title and rename the x-axis and y-axis.

In [None]:
fig = px.scatter(x=X_pca[:, 0], y=X_pca[:, 1], color=y)
fig.update_layout(
    title="PCA visualization of Custom Classification dataset",
    xaxis_title="First Principal Component",
    yaxis_title="Second Principal Component",
)
fig.show()

## Fitting and Transforming t-SNE

Now we will apply the t-SNE algorithm to the dataset and compare the results.  

After fitting and transforming the data, we will display the Kullback-Leibler (KL) divergence between the high-dimensional probability distribution and the low-dimensional probability distribution. A lower KL divergence means the embedding reproduces the high-dimensional neighborhoods more closely.

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
tsne.kl_divergence_

## t-SNE Visualization Python
Similar to PCA, we will visualize two t-SNE components on a scatter plot. 

In [None]:
fig = px.scatter(x=X_tsne[:, 0], y=X_tsne[:, 1], color=y)
fig.update_layout(
    title="t-SNE visualization of Custom Classification dataset",
    xaxis_title="First t-SNE",
    yaxis_title="Second t-SNE",
)
fig.show()

## t-SNE on Customer Churn Dataset

In this section, we will use the real **Customer Churn** dataset of an Iranian telecom company. The dataset contains information on the customers' activity, such as call failures and subscription length, and a churn label.

Churn rate is the percentage of customers who stop using a particular service during a given time frame. In this dataset, the `Churn` column marks each customer who left.

## Data Dictionary
| Column                  | Explanation                                             |
|-------------------------|---------------------------------------------------------|
| Call Failure            | number of call failures                                 |
| Complaints              | binary (0: No complaint, 1: complaint)                  |
| Subscription Length     | total months of subscription                            |
| Charge Amount           | ordinal attribute (0: lowest amount, 9: highest amount) |
| Seconds of Use          | total seconds of calls                                  |
| Frequency of use        | total number of calls                                   |
| Frequency of SMS        | total number of text messages                           |
| Distinct Called Numbers | total number of distinct phone calls                    |
| Age Group               | ordinal attribute (1: younger age, 5: older age)        |
| Tariff Plan             | binary (1: Pay as you go, 2: contractual)               |
| Status                  | binary (1: active, 2: non-active)                       |
| Age                     | age of customer                                         |
| Customer Value          | the calculated value of customer                        |
| Churn                   | class label (1: churn, 0: non-churn)                    |

In [None]:
import pandas as pd

df = pd.read_csv("customer_churn.csv")
df.head(3)

## PCA Dimensionality Reduction
Next, we will:

- Create features (X) and target (y) using the Churn column.
- Normalize the features using a standard scaler.
- Split the dataset into a training and testing set.
- Apply PCA to the training dataset.
- Get the score using the testing dataset. The score represents the average log-likelihood of all samples.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop('Churn', axis=1)
y = df['Churn']

scaler = StandardScaler()
X_norm = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_norm, y, random_state=13, test_size=0.25, shuffle=True
)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)

pca.score(X_test)
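To illustrate what the score means, the sketch below uses a small synthetic dataset (an assumption made so the example runs without the churn CSV). It checks that `PCA.score` equals the mean of the per-sample log-likelihoods returned by `PCA.score_samples`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; the churn dataset is not needed for this check
X_demo, _ = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te = train_test_split(X_demo, test_size=0.25, random_state=13)

pca_demo = PCA(n_components=2).fit(X_tr)
per_sample = pca_demo.score_samples(X_te)   # log-likelihood of each held-out sample

print(pca_demo.score(X_te), per_sample.mean())   # the two numbers coincide
```

A higher (less negative) score means the held-out samples are more likely under the probabilistic PCA model fitted on the training data.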

## Visualizing PCA
We will now visualize the PCA result using a Plotly Express scatter plot.

In [None]:
fig = px.scatter(x=X_train_pca[:, 0], y=X_train_pca[:, 1], color=y_train)
fig.update_layout(
    title="PCA visualization of Customer Churn dataset",
    xaxis_title="First Principal Component",
    yaxis_title="Second Principal Component",
)
fig.show()

## Checking Perplexity vs. Divergence

For the t-SNE algorithm, **perplexity is a very important hyperparameter**. 

It controls the effective number of neighbors that each point considers during the dimensionality reduction process. We will run a loop to record the KL divergence for perplexity values from 5 to 50 in steps of 5. After that, we will display the result using a Plotly Express line plot.

In [None]:
import numpy as np

perplexity = np.arange(5, 55, 5)  # 5, 10, ..., 50
divergence = []

for perp in perplexity:
    model = TSNE(n_components=2, init="pca", perplexity=perp, random_state=42)  # fixed seed so runs are comparable
    model.fit_transform(X_train)  # only the final cost is needed, not the embedding
    divergence.append(model.kl_divergence_)

fig = px.line(x=perplexity, y=divergence, markers=True)
fig.update_layout(xaxis_title="Perplexity Values", yaxis_title="Divergence")
fig.update_traces(line_color="red", line_width=1)
fig.show()

The KL divergence levels off beyond a perplexity of about 40, so we will use `perplexity=40` for the final t-SNE model.

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=40, random_state=42)
X_train_tsne = tsne.fit_transform(X_train)

tsne.kl_divergence_

We will now use a Plotly Express scatter plot to display the two t-SNE components colored by target class.

In [None]:
fig = px.scatter(x=X_train_tsne[:, 0], y=X_train_tsne[:, 1], color=y_train)
fig.update_layout(
    title="t-SNE visualization of Customer Churn dataset",
    xaxis_title="First t-SNE",
    yaxis_title="Second t-SNE",
)
fig.show()

As we can see, the embedding shows multiple clusters and sub-clusters. Examining which customers fall into each group can help us understand churn patterns and come up with a strategy for retaining existing customers.

## Application of t-SNE

Apart from visualizing complex multi-dimensional data, t-SNE is used in many applied fields, particularly in biomedical research.

1. **Clustering and classification**: to group similar data points together in the lower-dimensional space. The resulting map can also support classification and help find patterns in the data.
2. **Anomaly detection**: to identify outliers and anomalies in the data. 
3. **Natural language processing**: to visualize word embeddings generated from a large corpus of text, which makes it easier to identify similarities and relationships between words.
4. **Computer security**: to visualize network traffic patterns and detect anomalies.
5. **Cancer research**: to visualize molecular profiles of tumor samples and identify subtypes of cancer. 
6. **Geological domain interpretation**: to visualize seismic attributes and to identify geological anomalies. 
7. **Biomedical signal processing**: to visualize electroencephalogram (EEG) recordings and detect patterns of brain activity.

## Conclusion

t-SNE is a powerful visualization tool for revealing hidden patterns and structures in complex datasets. You can use it on image, audio, biological, and signal data to identify anomalies and patterns.

In this notebook, we have learned about t-SNE, a popular dimensionality reduction technique that can visualize high-dimensional nonlinear data in a low-dimensional space. We have explained the main idea behind t-SNE, how it works, and its applications. Moreover, we showed some examples of applying t-SNE to synthetic and real datasets and how to interpret the results.