# Notebook 08 — Dimensionality Reduction
📁 File name: 08_dimensionality_reduction.ipynb

This notebook shows how to reduce the number of features in a dataset using two common dimensionality reduction techniques:

- **Principal Component Analysis (PCA)**: a linear method that transforms high-dimensional data into a smaller set of components while preserving as much variance as possible.

- **t-distributed Stochastic Neighbor Embedding (t-SNE)**: a non-linear method, used mainly for visualization, that preserves the local structure of the data in 2D or 3D space.

You'll learn when and why to use each technique, and how to apply them to numerical data using scikit-learn.

📒 Notebook Sections:

1. Title & Introduction
2. What is Dimensionality Reduction?
3. Load and Prepare Data
4. Apply PCA (Principal Component Analysis)
5. Visualize PCA Output
6. Apply t-SNE (t-distributed Stochastic Neighbor Embedding)
7. Visualize t-SNE Output
8. When to Use PCA vs t-SNE
9. Summary and What’s Next

## 1. Title & Introduction (Markdown Cell)
### 08: Dimensionality Reduction with PCA and t-SNE

This notebook demonstrates how to reduce the number of features in a dataset using two popular techniques:

- Principal Component Analysis (PCA)
- t-SNE (t-distributed Stochastic Neighbor Embedding)

Dimensionality reduction is useful when working with high-dimensional data, where the sheer number of features makes visualization difficult and can hurt model performance.


## 2. What is Dimensionality Reduction? (Markdown)
### What is Dimensionality Reduction?

Dimensionality reduction is the process of reducing the number of input variables (features) in your dataset while preserving as much useful information as possible.

Why it’s useful:

- Simplifies data visualization (e.g., reduce to 2D or 3D)
- Removes noise and redundant (highly correlated) features
- Reduces the risk of overfitting
- Shortens model training time
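
To make "redundancy" concrete, here is a minimal sketch on synthetic data. The array names, sample size, and noise level are assumptions chosen for illustration and are not part of the notebook's dataset. All three features are built from a single underlying signal, so one principal component should capture nearly all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# One hidden signal drives all three features (hypothetical toy data)
signal = rng.normal(size=500)
X_toy = np.column_stack([
    signal,
    2.0 * signal + rng.normal(scale=0.05, size=500),
    -1.0 * signal + rng.normal(scale=0.05, size=500),
])

pca_toy = PCA(n_components=3).fit(X_toy)

# Share of the total variance captured by each component
print(pca_toy.explained_variance_ratio_.round(3))
```

Because the second and third columns are almost exact multiples of the first, dropping to a single component loses very little information here.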


## 3. Load and Prepare Data

In [None]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("../data/sample_data.csv")

# Select numeric features only, since PCA and t-SNE operate on numbers
numeric_cols = df.select_dtypes(include="number").columns.tolist()
# Drop rows with missing values: neither method accepts NaNs
df_numeric = df[numeric_cols].dropna()

df_numeric.head()

## 4. Apply PCA (Principal Component Analysis)

In [None]:
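from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is sensitive to feature scale, so standardize to zero mean and unit variance first
X_scaled = StandardScaler().fit_transform(df_numeric)

# Apply PCA to reduce to 2 components
pca = PCA(n_components=2)
pca_result = pca.fit_transform(X_scaled)

# Create a DataFrame, keeping the original row index
df_pca = pd.DataFrame(pca_result, columns=["PC1", "PC2"], index=df_numeric.index)
df_pca.head()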
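
A common follow-up question is how much information the two components retain. The sketch below runs on scikit-learn's bundled Iris data as a stand-in for `sample_data.csv` (an assumption, since the notebook's dataset is not shown). It reads `explained_variance_ratio_` and finds the smallest number of components that reaches a chosen variance threshold. The 95% threshold is an illustrative choice, not a rule.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data: 150 samples, 4 numeric features
X_demo = load_iris().data

# Standardize so that no feature dominates because of its units
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components to inspect the full variance profile
pca_full = PCA().fit(X_demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

print("Cumulative explained variance:", cumulative.round(3))
print("Components needed for 95%:", n_keep)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, which performs a similar selection internally.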

## 5. Visualize PCA Output

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(df_pca["PC1"], df_pca["PC2"], alpha=0.7)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA Result (2 Components)")
plt.grid(True)
plt.show()

## 6. Apply t-SNE (t-distributed Stochastic Neighbor Embedding)

In [None]:
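from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Standardize features so distances are not dominated by large-scale columns
X_tsne_input = StandardScaler().fit_transform(df_numeric)

# Apply t-SNE for visualization (perplexity must be smaller than the number of rows)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
tsne_result = tsne.fit_transform(X_tsne_input)

df_tsne = pd.DataFrame(tsne_result, columns=["TSNE1", "TSNE2"], index=df_numeric.index)
df_tsne.head()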
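
The `perplexity` setting roughly controls how many neighbors each point attends to, and scikit-learn requires it to be smaller than the number of samples. As a hedged sketch, the loop below fits t-SNE at two perplexity values on the bundled Iris data so the resulting embeddings can be compared. Iris is a stand-in for the notebook's dataset, and the values 5 and 30 are arbitrary illustrations.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Stand-in data, standardized before computing distances
X_demo = StandardScaler().fit_transform(load_iris().data)

embeddings = {}
for perplexity in (5, 30):
    # Guard: perplexity must stay below the number of samples
    assert perplexity < X_demo.shape[0]
    tsne_demo = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = tsne_demo.fit_transform(X_demo)
    # kl_divergence_ is the final value of the objective that t-SNE minimizes
    print(f"perplexity={perplexity}: KL divergence = {tsne_demo.kl_divergence_:.3f}")
```

Plotting each embedding side by side is a quick way to check whether apparent clusters are stable across settings.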

## 7. Visualize t-SNE Output

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(df_tsne["TSNE1"], df_tsne["TSNE2"], alpha=0.7)
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.title("t-SNE Result (2 Components)")
plt.grid(True)
plt.show()

## 8. When to Use PCA vs t-SNE (Markdown)
### When to Use PCA vs t-SNE

**PCA**  
- Linear technique  
- Fast and interpretable  
- Good for feature reduction and model input

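**t-SNE**  
- Non-linear technique  
- Best for visualization only: scikit-learn's `TSNE` has no `transform` method, so it cannot embed new, unseen rows for a model  
- Slower and harder to interpret, since distances between clusters in the plot may not be meaningful

For model training: Use PCA  
For data exploration/visualization: Use t-SNE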
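
The "PCA for modeling" advice follows from how the two estimators behave in scikit-learn: a fitted `PCA` can project unseen rows with `transform`, whereas `TSNE` only offers `fit_transform`. Below is a minimal sketch on the bundled Iris data. The dataset, the classifier, and the train/test split are illustrative assumptions, not part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Scaler and PCA are fitted on the training split only, then reused on the test split
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy with 2 principal components:", round(test_accuracy, 3))

# TSNE has no transform method, so it cannot embed rows it was not fitted on
print("PCA has transform:", hasattr(PCA, "transform"))
print("TSNE has transform:", hasattr(TSNE, "transform"))
```

Wrapping the steps in a pipeline keeps the test data out of the fitting stage, which avoids leaking information into the projection.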

## 9. Summary and What’s Next (Markdown)
### Summary

- PCA reduces features by projecting data onto the principal axes, the directions of greatest variance.
- t-SNE reduces data to 2D or 3D while preserving local relationships.
- These tools help in visualizing and simplifying datasets with many features.

Next up: We’ll look at selecting the most relevant features using filter, wrapper, and embedded methods.

