# TP 5: PCA and Anomaly Detection

## 📝 Exercise 1: PCA on digits

In this exercise, you will apply **Principal Component Analysis (PCA)** to the `sklearn digits` dataset to understand dimensionality reduction and anomaly detection. You will analyze how data can be effectively compressed while preserving information.

### Part 0: Data Loading and Exploration

- Load the digits dataset from sklearn
- Explore the shape and characteristics of the data
- Visualize a sample of digits

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score, confusion_matrix
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target

print("Dataset shape:", X.shape)
print("Target classes:", np.unique(y))
print("Number of features:", X.shape[1])
print("Number of samples:", X.shape[0])

# Show the first nine digit images in a 3x3 grid
for i in range(1, 10):
    plt.subplot(3, 3, i)
    plt.imshow(digits.images[i - 1], cmap='Greys')

plt.show()

### Part 1: Implement PCA for 2D Projection

Apply PCA to reduce the digit images from 64 dimensions to 2 dimensions. Then create a visualization showing how the different digit classes are distributed in this 2D space.

1. Create a PCA model with 2 principal components

2. Transform the data to the 2D space using your model

3. Create a scatter plot where:
    - The x-axis is the first principal component
    - The y-axis is the second principal component
    - Each digit class (0-9) has a different color
    - Add a colorbar to show which color represents which digit
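
The steps above can be sketched as follows. This is one possible layout rather than a reference solution: the `tab10` colormap, the figure size, and the variable names `pca_2d` and `X_2d` are illustrative choices, not requirements of the exercise.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
X, y = digits.data, digits.target

# Fit PCA on the 64-dimensional pixel vectors and keep the two leading components
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X)

# Scatter plot colored by digit class; tab10 provides ten distinct colors
fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=10)
ax.set_xlabel('First principal component')
ax.set_ylabel('Second principal component')
fig.colorbar(points, ax=ax, ticks=range(10), label='Digit')
plt.show()
```

`fit_transform` fits the model and projects the data in a single call; calling `fit` and then `transform` gives the same projection.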

In [None]:
## TODO:

### Part 2: K-means Clustering with 10 Clusters

In the 2D PCA space you created in Part 1, apply K-means clustering to create 10 clusters (one for each digit). Then evaluate how well the clusters match the true digit labels.

1. Apply K-means with k=10 clusters to your 2D PCA data:
    - You may need to try multiple initializations to get exactly 10 non-empty clusters

2. Assign cluster labels by finding which digit class is most common in each cluster:
    - For each cluster, find the majority digit label
    - Assign this label to all points in that cluster

3. Evaluate the clustering using three metrics:
    - Precision: For each cluster, what fraction of its samples actually belong to the digit class assigned to it?
    - Recall: For each true digit class, what fraction of its samples were clustered correctly?
    - Gini-index: What percentage of all samples are in the "correct" cluster for their digit class?
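
A minimal sketch of this pipeline is shown below, assuming `n_init=10` restarts and a fixed `random_state` for reproducibility (both are illustrative settings). It reports per-digit precision and recall as computed by scikit-learn, plus the overall fraction of samples whose majority-vote label matches the true digit; the variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import precision_score, recall_score

X, y = load_digits(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)

# n_init=10 restarts K-means from ten initializations and keeps the best run
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X_2d)

# Relabel every cluster with the most frequent true digit among its members
y_pred = np.empty_like(y)
for c in np.unique(clusters):
    members = clusters == c
    y_pred[members] = np.bincount(y[members], minlength=10).argmax()

# Per-digit scores; zero_division=0 covers digits that never become
# the majority label of any cluster
digit_labels = np.arange(10)
precision = precision_score(y, y_pred, labels=digit_labels, average=None, zero_division=0)
recall = recall_score(y, y_pred, labels=digit_labels, average=None, zero_division=0)
fraction_correct = np.mean(y_pred == y)

print("Precision per digit:", np.round(precision, 2))
print("Recall per digit:   ", np.round(recall, 2))
print(f"Fraction correctly labeled: {fraction_correct:.3f}")
```

Because several clusters can share the same majority digit, some digits may receive no cluster at all, which is why the sketch guards against division by zero.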

In [None]:
## TODO:

### Part 3: 3D PCA Projection

Extend your analysis from 2D to 3D. With an extra dimension, you should capture more variance and see better separation of digit classes.

1. Create a 3D PCA model with 3 principal components

2. Transform the data to 3D space

3. Create a 3D scatter plot where:
    - Each digit class has a different color
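
One way to draw the 3D projection is sketched below. The colormap, marker size, and variable names are illustrative assumptions; in a notebook, an interactive backend makes it easier to rotate the view.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)

# Keep the three leading principal components
pca_3d = PCA(n_components=3)
X_3d = pca_3d.fit_transform(X)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
points = ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=y, cmap='tab10', s=10)
ax.set_xlabel('PC 1')
ax.set_ylabel('PC 2')
ax.set_zlabel('PC 3')
fig.colorbar(points, ax=ax, ticks=range(10), label='Digit')
plt.show()

# Share of the total variance retained by the three components
print(f"Explained variance (3 components): {pca_3d.explained_variance_ratio_.sum():.3f}")
```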

In [None]:
## TODO:

### Part 4: Optimal Dimensionality Selection

**How many dimensions do we actually need?**

With too few dimensions we lose important information; with too many we may overfit or negate the benefits of dimensionality reduction. We need to find the sweet spot.

1. Run PCA with increasing numbers of components:
    - For each k from 1 to 63:
        - Fit a PCA model with k components
        - Calculate the total explained variance ratio of the k components

2. Create a plot showing:
    - X-axis: component number (k = 1 to 63)
    - Y-axis: explained variance ratio
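
The loop described above might look like the sketch below. It assumes that "explained variance ratio" means the total ratio retained by the first k components, and it sets `svd_solver='full'` so that every fit is exact and deterministic; both are assumptions rather than requirements of the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

ks = list(range(1, 64))
explained = []
for k in ks:
    # Refit PCA with k components and record the total variance it retains
    pca = PCA(n_components=k, svd_solver='full')
    pca.fit(X)
    explained.append(pca.explained_variance_ratio_.sum())
explained = np.array(explained)

plt.plot(ks, explained, marker='.')
plt.xlabel('Number of components k')
plt.ylabel('Total explained variance ratio')
plt.title('Explained variance vs. number of components')
plt.grid(True)
plt.show()
```

With the exact solver, a single fit with all components followed by `np.cumsum(pca.explained_variance_ratio_)` should trace the same curve at a fraction of the cost; the explicit loop is kept here because it mirrors the steps listed in the exercise.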

In [None]:
## TODO:

## 📝 Exercise 2: Anomaly Detection in Facebook Spatial Likes

In this exercise, you will exploit the low-dimensional phenomenon to detect anomalous users in a real Facebook dataset. The dataset contains information about users' likes across different content categories.

- Dataset Description:
    - 9,000 users
    - 210 content categories (different categories of pages on Facebook)
    - 6 months of data

- Each entry represents the number of likes a user gave to pages in that category

- Data Format:
    - Rows = users
    - Columns = content categories
    - Matrix shape: 9000 × 210

### Part 0: Data Loading and Exploration

- Load the data `spatial_data.txt` and extract the content categories (exclude the first column which is user IDs).

In [None]:
# Load data
data = np.loadtxt('spatial_data.txt')
FBSpatial = data[:,1:]  # Remove user ID column
print(f"Data shape: {FBSpatial.shape}")
print(f"Number of users: {FBSpatial.shape[0]}")
print(f"Number of categories: {FBSpatial.shape[1]}")

- Check the total number of likes for each user (the row sums).

In [None]:
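# For non-negative like counts, the L1 norm of each row equals its row sum
FBSnorm = np.linalg.norm(FBSpatial, axis=1, ord=1)
plt.plot(FBSnorm)
plt.title('Number of Likes Per User')
plt.xlabel('Users')
plt.ylabel('Total likes')
plt.show()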

In many real-world datasets, most "normal" observations lie approximately in a low-dimensional subspace. This is called the low-dimensional phenomenon. What does this mean?

- Normal users might have similar like patterns across 210 categories
- These patterns can be summarized by just 20-25 principal components
- Anomalous users break this pattern - they have unusual preference profiles 
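
To build intuition, the sketch below generates a synthetic matrix (not the Facebook data) whose rows lie close to a 20-dimensional subspace and measures how much of its energy the leading singular values capture. The sizes, the noise level, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_categories, rank = 1000, 210, 20  # hypothetical sizes

# Low-rank "normal" behavior plus a small amount of dense noise
profiles = rng.normal(size=(n_users, rank))
patterns = rng.normal(size=(rank, n_categories))
M = profiles @ patterns + 0.1 * rng.normal(size=(n_users, n_categories))

# Cumulative share of the squared singular values (the "energy" of the matrix)
s = np.linalg.svd(M, compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)
print(f"Energy captured by the first {rank} singular values: {energy[rank - 1]:.4f}")
```

When the low-dimensional phenomenon holds, the singular values drop sharply after the first few, and the remaining ones carry mostly noise.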

Let's check whether the low-dimensional phenomenon holds for this dataset.

In [None]:
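# Singular values, normalized by the Frobenius norm of the matrix
u, s, vt = np.linalg.svd(FBSpatial, full_matrices=False)
plt.plot(s / np.linalg.norm(FBSpatial))
plt.title('Singular Values of Spatial Like Matrix')
plt.xlabel('Index')
plt.ylabel('Normalized singular value')
plt.show()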

### Part 1: Build the Normal Model and Extract Anomalies

Separate the portion of the data lying in the normal subspace from the portion lying in the anomalous subspace.

1. Compute SVD of FBSpatial

2. Use the first 25 singular vectors (which capture roughly 80-85% of the variance) to project the data onto the normal subspace

3. Calculate the residuals as the anomalous component
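
The three steps above can be sketched on stand-in data as follows. The helper `split_normal_anomalous` and the synthetic matrix `X_demo` are hypothetical; the sketch assumes the normal subspace is spanned by the leading right singular vectors of the uncentered matrix, which is one reasonable reading of the instructions.

```python
import numpy as np

def split_normal_anomalous(X, k):
    """Split X into its projection onto the top-k right singular subspace and the residual."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    V_k = vt[:k].T               # (n_features, k) orthonormal basis of the normal subspace
    X_normal = X @ V_k @ V_k.T   # projection of every row onto that subspace
    X_anomalous = X - X_normal   # residual, orthogonal to the normal subspace
    return X_normal, X_anomalous

# Stand-in data: rank-5 structure plus small noise (not the Facebook matrix)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 30))
X_demo += 0.05 * rng.normal(size=(200, 30))

X_normal, X_anomalous = split_normal_anomalous(X_demo, k=5)
print("Relative residual norm:", np.linalg.norm(X_anomalous) / np.linalg.norm(X_demo))
```

For the exercise, the same function would be called with `FBSpatial` and `k=25`.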

In [None]:
## TODO:

### Part 2: Identify Top 30 Anomalous Users

Find the 30 users with the largest anomaly scores and visualize their position relative to overall user activity.
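
A possible approach is sketched below on synthetic data with 30 planted outliers. The anomaly score used here, the squared Euclidean norm of each row's residual, is an assumption, since the exercise does not fix a definition; the sizes and variable names are likewise hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, d, k, n_top = 2000, 50, 5, 30  # hypothetical sizes

# Rank-k "normal" rows, with extra off-subspace energy added to a few planted rows
X_demo = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
planted = rng.choice(n, size=n_top, replace=False)
X_demo[planted] += 3.0 * rng.normal(size=(n_top, d))

# Residual after projecting onto the top-k right singular subspace
_, _, vt = np.linalg.svd(X_demo, full_matrices=False)
V_k = vt[:k].T
residual = X_demo - X_demo @ V_k @ V_k.T

# Anomaly score (an assumption): squared Euclidean norm of each row's residual
scores = np.sum(residual**2, axis=1)
top = np.argsort(scores)[::-1][:n_top]  # indices of the n_top largest scores

# Mark the top-scoring rows on the overall activity curve (L1 norm of each row)
activity = np.linalg.norm(X_demo, axis=1, ord=1)
plt.plot(activity, color='lightgray', label='All users')
plt.scatter(top, activity[top], color='red', s=15, zorder=3, label='Top anomaly scores')
plt.xlabel('Users')
plt.ylabel('L1 norm of row')
plt.legend()
plt.show()
```

Overlaying the top-scoring users on the activity curve shows whether a high anomaly score simply reflects high activity or an unusual pattern.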

In [None]:
## TODO:

### Part 3: Visualize Patterns of Anomalous Users

Plot the like patterns (across the 210 categories) for 9 of the top anomalous users.
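
A small plotting helper along these lines could be used for the 3x3 grid. `plot_row_patterns` is a hypothetical name, and the Poisson-distributed matrix stands in for the like counts purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_row_patterns(X, row_indices, n_rows=3, n_cols=3):
    """Plot one row of X per panel in an n_rows x n_cols grid (hypothetical helper)."""
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 8), sharex=True)
    for ax, idx in zip(axes.ravel(), row_indices):
        ax.plot(X[idx])
        ax.set_title(f'User {idx}')
    # Label the x-axis on the bottom row only, since the panels share it
    for ax in axes[-1]:
        ax.set_xlabel('Category index')
    fig.tight_layout()
    return fig

# Stand-in data: 50 users, 210 categories of non-negative counts
rng = np.random.default_rng(0)
X_demo = rng.poisson(2.0, size=(50, 210))
fig = plot_row_patterns(X_demo, row_indices=np.arange(9))
plt.show()
```

In the exercise, `row_indices` would be the first nine entries of the index array holding the top anomalous users, and `X` would be `FBSpatial`.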

In [None]:
## TODO: