# Lecture 5: Unsupervised Learning - PCA and t-SNE

## **Learning Outcomes**
- Understand the concept of dimensionality reduction and its importance in machine learning.
- Learn to apply Principal Component Analysis (PCA) and t-SNE for visualizing high-dimensional data.
- Explore real-world applications of unsupervised learning using satellite imagery data.

---

## **What We Will Learn**
1. What is dimensionality reduction?
2. How does PCA work, and what is its significance in data analysis?
3. How can t-SNE be applied to visualize non-linear structure in high-dimensional data?
4. Hands-on demonstration with the EuroSAT Dataset.

---

## **Questions to Consider**
- What are the key differences between PCA and t-SNE?
- Why is dimensionality reduction important for satellite imagery?
- Can we identify land-cover patterns in the EuroSAT dataset using unsupervised learning?

---

## **Dataset Description**
- **Dataset Name**: EuroSAT (Sentinel-2 Satellite Images)
- **Number of Classes**: 10 land-cover types (e.g., Forest, Residential, Annual Crop).
- **Data Shape**: RGB images of 64x64 pixels (64 x 64 x 3 values per image).
- **Use Case**: Dimensionality reduction to visualize and cluster satellite images.

---

## **Steps in this Notebook**
1. Import and preprocess the EuroSAT dataset.
2. Apply PCA to reduce dimensions and explain variance.
3. Visualize the data using t-SNE.
4. Cluster the images with K-Means, DBSCAN, and Spectral Clustering.
5. Discuss findings and implications of dimensionality reduction.

---

## **Guiding Questions**
1. How can we preprocess satellite images for unsupervised learning?
2. What are the practical use cases of PCA and t-SNE for satellite imagery?
3. How do PCA and t-SNE complement each other for data visualization?
4. What patterns can we observe in satellite images after dimensionality reduction?

---

## **Dataset Visualization and Feature Exploration**
### What We Will Do:
- Visualize sample images to understand the dataset structure.
- Explore the distribution of land-cover classes.
- Prepare the dataset for unsupervised learning.


## **EuroSAT Dataset Features**

The EuroSAT dataset consists of Sentinel-2 satellite images across 10 classes:
1. Annual Crop
2. Forest
3. Herbaceous Vegetation
4. Highway
5. Industrial
6. Pasture
7. Permanent Crop
8. Residential
9. River
10. Sea/Lake

### Key Details:
- **Image Dimensions**: 64x64 pixels
- **Number of Classes**: 10
- **Channels**: RGB (3 channels)
- **Labels**: Every image carries a land-use class label. In this notebook the labels are used only to color the plots, not to fit the models.
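
The image dimensions above determine how many features each image contributes once it is flattened for PCA or clustering. The following is a minimal sketch using random pixels as a stand-in for real images; the names `fake_batch` and `X_demo` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

# Shape of one EuroSAT RGB image as described above: height x width x channels.
height, width, channels = 64, 64, 3
n_features = height * width * channels  # 12288 values per flattened image

# Stand-in for a small batch of images (random pixels, not real EuroSAT data).
rng = np.random.default_rng(0)
fake_batch = rng.integers(0, 256, size=(5, height, width, channels), dtype=np.uint8)

# Flatten each image into one row so the batch becomes a 2D feature matrix.
X_demo = fake_batch.reshape(len(fake_batch), -1)
print(X_demo.shape)  # (5, 12288)
```

With 12288 features per image, the full dataset is high-dimensional enough that reducing it before visualization is worthwhile.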


In [None]:
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
import numpy as np

# Load EuroSAT dataset
dataset, info = tfds.load('eurosat', with_info=True, as_supervised=True)
train_dataset = dataset['train']

# Display dataset info
print(info)

# Extract class names
class_names = info.features['label'].names
print(f"Class Names: {class_names}")

# Visualize a few sample images with their class labels
def visualize_dataset(dataset, class_names):
    plt.figure(figsize=(12, 8))
    for i, (image, label) in enumerate(dataset.take(9)):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(image.numpy())
        plt.title(class_names[label.numpy()])
        plt.axis("off")
    plt.tight_layout()
    plt.show()

visualize_dataset(train_dataset, class_names)


In [None]:
# Get class distribution
from collections import Counter

def plot_class_distribution(dataset, class_names):
    # Count how many images carry each integer label
    class_counts = Counter(int(label.numpy()) for _, label in dataset)

    # Plot the classes in label order so the x-axis is stable between runs
    class_ids = sorted(class_counts)
    plt.figure(figsize=(10, 6))
    plt.bar([class_names[i] for i in class_ids], [class_counts[i] for i in class_ids], color='teal')
    plt.title('Class Distribution in EuroSAT Dataset')
    plt.xlabel('Class')
    plt.ylabel('Frequency')
    plt.xticks(rotation=45)
    plt.show()

plot_class_distribution(train_dataset, class_names)
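
A bar chart shows class balance visually; a numeric summary can make it easier to compare. The sketch below uses a small hypothetical label array (`demo_labels`, not EuroSAT data) to show one way of computing per-class counts, proportions, and a simple imbalance ratio.

```python
import numpy as np

# Hypothetical integer labels for 3 classes (not EuroSAT data).
demo_labels = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2, 2])

# Count occurrences of each label; minlength keeps classes with zero samples.
counts = np.bincount(demo_labels, minlength=3)
proportions = counts / counts.sum()

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(counts)           # [3 2 5]
print(proportions)      # [0.3 0.2 0.5]
print(imbalance_ratio)  # 2.5
```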


In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

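# Flatten each image into a 1D vector and collect its label in the same pass,
# so the rows of X_flat and the entries of y_labels stay aligned
X_list, y_list = [], []
for image, label in dataset['train']:
    X_list.append(image.numpy().reshape(-1))
    y_list.append(int(label.numpy()))
X_flat = np.asarray(X_list, dtype=np.float32)  # float32 halves memory versus float64
y_labels = np.asarray(y_list)

# Standardize every pixel feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_flat)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Variance explained by each component
explained_variance = pca.explained_variance_ratio_

# Plot PCA results, colored by the true class label
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5, c=y_labels, cmap='tab10')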
plt.colorbar(label='Class Labels')
plt.title('PCA: 2D Visualization of EuroSAT Data')
plt.xlabel(f'Principal Component 1 ({explained_variance[0]*100:.2f}%)')
plt.ylabel(f'Principal Component 2 ({explained_variance[1]*100:.2f}%)')
plt.show()
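
Two components are convenient for plotting, but they usually capture only part of the variance. One common way to decide how many components to keep is to look at the cumulative explained variance. The sketch below does this on synthetic data built from five latent factors (an assumption made purely for illustration; `X_demo`, `latent`, and `mixing` are not part of the notebook), and the 95% threshold is an arbitrary example value.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 50 features driven by 5 latent factors plus small noise.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 50))

# Fit PCA with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_keep = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_keep}")
```

On the EuroSAT features the same calculation would apply to a PCA fitted on `X_scaled`, although fitting all components on the full dataset can be slow.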


In [None]:
from sklearn.manifold import TSNE

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled)

# Plot t-SNE results
plt.figure(figsize=(8, 6))
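# Convert the label tensors to a NumPy array before using them as colors
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], alpha=0.5,
            c=np.array([label.numpy() for _, label in dataset['train']]), cmap='tab10')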
plt.colorbar(label='Class Labels')
plt.title('t-SNE: 2D Visualization of EuroSAT Data')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.show()
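
Running t-SNE directly on thousands of raw pixel features can be very slow. The scikit-learn documentation recommends first reducing dense high-dimensional data with another method such as PCA, which is one way the two techniques complement each other. The sketch below illustrates that two-step pipeline on synthetic data; the data, the choice of 20 PCA components, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in: 300 samples x 64 features in three shifted groups (not EuroSAT data).
groups = [rng.normal(loc=shift, scale=1.0, size=(100, 64)) for shift in (0.0, 3.0, 6.0)]
X_demo = np.vstack(groups)

# Step 1: PCA compresses the features into a smaller linear representation.
X_reduced = PCA(n_components=20, random_state=0).fit_transform(X_demo)

# Step 2: t-SNE embeds the PCA output in 2D for plotting.
X_embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_reduced)

print(X_reduced.shape, X_embedded.shape)  # (300, 20) (300, 2)
```

For the EuroSAT features, a similar approach would pass a PCA-reduced version of `X_scaled` (and possibly a random subsample of the rows) to `TSNE` instead of the full matrix.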


In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

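# Apply K-Means clustering (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Estimate the silhouette score on a random subsample; the exact score needs
# all pairwise distances, which is expensive for this many images
silhouette_avg = silhouette_score(X_scaled, kmeans_labels, sample_size=5000, random_state=42)
print(f"Silhouette Score for K-Means: {silhouette_avg:.2f}")

# Visualize the clusters (fitted on the full feature space) in the 2D PCA projection
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels, cmap='viridis', alpha=0.5)
plt.colorbar(label='Cluster Labels')
plt.title('K-Means Clusters Shown in PCA Space')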
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
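
The silhouette score only measures how compact and separated the clusters are; it does not say whether they correspond to land-cover classes. Because EuroSAT ships with labels, an external measure such as the adjusted Rand index (ARI) can check that agreement. The sketch below shows the idea on synthetic blobs from `make_blobs` (not EuroSAT data); the centers and variable names are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated synthetic groups with known labels.
X_demo, y_true = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Cluster without using the labels.
demo_clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# ARI is 1.0 for a perfect match and close to 0.0 for random assignments;
# it ignores how the cluster IDs happen to be numbered.
ari = adjusted_rand_score(y_true, demo_clusters)
print(f"Adjusted Rand index: {ari:.2f}")
```

In the notebook, the equivalent comparison would be between the true class labels and `kmeans_labels`.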


In [None]:
from sklearn.cluster import DBSCAN

# Apply DBSCAN. eps is a distance in the standardized feature space, so a fixed
# value such as 2 may need tuning; points that fit no cluster get the label -1 (noise)
dbscan = DBSCAN(eps=2, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)

# Count unique clusters, excluding the noise label -1
unique_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
print(f"Number of Clusters (DBSCAN): {unique_clusters}")

# Visualize the DBSCAN clusters in the 2D PCA projection
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=dbscan_labels, cmap='rainbow', alpha=0.5)
plt.colorbar(label='Cluster Labels')
plt.title('DBSCAN Clusters Shown in PCA Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
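
DBSCAN results depend heavily on `eps`, and a value that suits one feature space may mark almost every point as noise in another. A common heuristic is to inspect the distance from each point to its k-th nearest neighbor and pick `eps` near the upper end of that distribution. The sketch below applies this to two synthetic blobs; the data, the 95th-percentile cutoff, and the variable names are illustrative assumptions, not a tuned recipe for EuroSAT.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Two tight synthetic blobs in 2D (not EuroSAT data).
blob_a = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(100, 2))
X_demo = np.vstack([blob_a, blob_b])

min_samples = 5

# Distance from every point to its min_samples-th nearest neighbor.
# The query point itself counts as the first neighbor, matching how
# DBSCAN counts min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_dist = np.sort(distances[:, -1])

# Heuristic: choose eps so that about 95% of points qualify as core points.
eps_guess = float(np.percentile(k_dist, 95))

demo_labels = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(X_demo)
n_clusters = len(set(demo_labels)) - (1 if -1 in demo_labels else 0)
noise_fraction = float(np.mean(demo_labels == -1))

print(f"eps guess: {eps_guess:.3f}, clusters: {n_clusters}, noise: {noise_fraction:.1%}")
```

Plotting `k_dist` and looking for the "elbow" is the usual visual version of this heuristic; the percentile cutoff here is just a simple stand-in.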


In [None]:
from sklearn.cluster import SpectralClustering

# Apply Spectral Clustering
spectral = SpectralClustering(n_clusters=10, affinity='nearest_neighbors', random_state=42)
spectral_labels = spectral.fit_predict(X_scaled)

# Visualize the Spectral Clustering results in the 2D PCA projection
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=spectral_labels, cmap='rainbow', alpha=0.5)
plt.colorbar(label='Cluster Labels')
plt.title('Spectral Clusters Shown in PCA Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()


In [None]:
# Metrics summary
print("Summary of Clustering Metrics:")
print(f"Silhouette Score (K-Means): {silhouette_avg:.2f}")
print(f"Number of Clusters (DBSCAN): {unique_clusters}")
