## Unsupervised Learning
Unsupervised learning is a core area of machine learning where the model learns patterns from unlabeled data. Unlike supervised learning, where data comes with labels (e.g., images of cats labeled as "cat"), unsupervised learning works with inputs only and attempts to understand the structure or distribution of the data.

### What Is Unsupervised Learning?
Unsupervised learning refers to algorithms that infer patterns, relationships, and structures from data without being told the correct output. The goal is to find hidden structures or groupings in the data.

### Key Concepts
1. No Labels  
   Only input features are provided. The algorithm must "make sense" of the data on its own.

2. Learning Patterns  
   The system identifies groupings, distributions, or key features without explicit supervision.

### Main Types of Unsupervised Learning
#### 1. Clustering  
Groups similar data points together.  

#### Algorithms:  

K-Means: Assigns data into k clusters based on distance to centroids.

Hierarchical Clustering: Builds a tree of clusters.

DBSCAN: Groups based on density, good for discovering clusters of arbitrary shape.

#### Applications:  

Customer segmentation

Social network analysis

Image compression

#### 2. Dimensionality Reduction
Reduces the number of variables/features while retaining important information.  

#### Algorithms:  

PCA (Principal Component Analysis): Projects data to lower dimensions using linear transformations.

t-SNE: Visualizes high-dimensional data by mapping it to 2D/3D.

Autoencoders: Neural networks that compress and then reconstruct the data.

#### Applications:  

Data visualization

Noise reduction

Preprocessing for supervised learning

#### 3. Anomaly Detection  
Detects rare events or outliers.

#### Algorithms:

Isolation Forest: Isolates data points with random splits; outliers tend to be separated in fewer splits than normal points. Fast and effective for high-dimensional data.

One-Class SVM: Learns a boundary that encloses normal data; anything outside is considered an anomaly. Works well when normal data vastly outweighs anomalies.

Statistical models (e.g., Gaussian models): Assume the data follows a distribution (e.g., Gaussian) and flag values that fall outside the expected range (e.g., more than 3 standard deviations from the mean).
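
The 3-standard-deviation rule can be sketched in a few lines of NumPy. This is a minimal illustration, assuming roughly Gaussian data; the helper name `zscore_outliers`, the threshold default, and the synthetic sensor readings are all made up for the example.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask: True where a value lies more than
    `threshold` standard deviations from the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values identical: nothing can be flagged
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold

# Synthetic data (assumption): 200 normal readings plus one extreme value
rng = np.random.default_rng(0)
readings = np.append(rng.normal(loc=50.0, scale=2.0, size=200), 90.0)

mask = zscore_outliers(readings)
print("flagged values:", readings[mask])
```

Note that the mean and standard deviation are themselves pulled by outliers, so this rule is best treated as a quick first check rather than a robust detector.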

#### Applications:

Fraud detection

Fault detection in machines

Intrusion detection

#### 4. Association Rule Learning  
Finds relationships between variables in large datasets.

#### Algorithms:

Apriori: Generates frequent itemsets and then derives association rules. Uses support, confidence, and lift to evaluate rule strength.

Eclat: Uses a depth-first search strategy and a vertical data format (each item mapped to the set of transactions containing it) for efficiency. Often faster than Apriori on large datasets with many frequent items.

#### Applications:

Market basket analysis (e.g., “people who bought X also bought Y”)

Recommender systems

#### Key Metrics in association rule learning:  
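Support: Fraction of transactions that contain the itemset.

Confidence: For a rule X → Y, the fraction of transactions containing X that also contain Y, i.e., support(X ∪ Y) / support(X).

Lift: Confidence of X → Y divided by support(Y). A lift above 1 means the items co-occur more often than expected if they were independent; a lift below 1 means less often.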
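
To make the three metrics concrete, here is a small pure-Python sketch. The toy transactions and the function names `support`, `confidence`, and `lift` are illustrative assumptions, not part of any library.

```python
# Toy market-basket data (assumption): each transaction is a set of items
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X and Y together) / support(X) for the rule X -> Y."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence(X -> Y) / support(Y); 1.0 means no association."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"bread", "milk"}, transactions))     # 3 of 5 transactions
print(confidence({"bread"}, {"milk"}, transactions))
print(lift({"bread"}, {"milk"}, transactions))
```

In this toy data the lift of bread → milk comes out slightly below 1, so buying bread does not make milk more likely than its baseline frequency.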




### Example Use Cases of Unsupervised Learning Algorithms
| Use Case               | Method               | Description                                   |
| ---------------------- | -------------------- | --------------------------------------------- |
| Customer Segmentation  | Clustering (K-Means) | Group customers based on behavior             |
| Fraud Detection        | Anomaly Detection    | Identify abnormal transactions                |
| Image Compression      | PCA, Autoencoders    | Reduce image size without much loss of detail |
| Recommendation Systems | Association Rules    | Find co-purchased items                       |
| Data Visualization     | t-SNE, PCA           | Plot high-dimensional data in 2D or 3D        |


### 📌 What is K-Means Clustering?
K-Means is an unsupervised clustering algorithm that partitions data into K distinct clusters, where each data point belongs to the cluster with the nearest mean (centroid).

### 🧠 Key Concepts
Cluster: A group of data points with similar characteristics.

Centroid: The center of a cluster, calculated as the mean of all points in that cluster.

K: The number of clusters you want the algorithm to find (set manually).

Distance Metric: Usually Euclidean distance is used to assign points to centroids.

### ⚙️ How K-Means Works
1. Initialize: Choose K random data points as initial centroids.

2. Assign: Assign each point to the nearest centroid (based on distance).

3. Update: Recalculate the centroid of each cluster (mean of all assigned points).

4. Repeat: Repeat the assign and update steps until the centroids stop changing (convergence) or a maximum number of iterations is reached.
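
The initialize / assign / update / repeat loop can be written directly in NumPy. This is a teaching sketch under simple assumptions (Euclidean distance, random initialization, synthetic two-blob data); the function `kmeans` below is a hypothetical helper, and in practice `sklearn.cluster.KMeans` would normally be used instead.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means (illustrative helper, not a library function)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(centroids):
        # Squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(max_iter):
        labels = assign(centroids)
        # Update: mean of assigned points; keep the old centroid if a cluster is empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids

    labels = assign(centroids)
    # Within-cluster sum of squared distances (often called inertia)
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

# Synthetic data (assumption): two well-separated 2D blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids, inertia = kmeans(X, k=2)
print(centroids, inertia)
```

Because the starting centroids are random, different seeds can give different results on harder data, which is the initialization sensitivity listed under the limitations.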

### Objective of this algorithm
![image.png](attachment:image.png)

### 📊 Pros & Cons
✅ Advantages:  
Simple and fast

Scales well to large datasets

Easy to interpret

❌ Limitations:  
Must specify K in advance

Sensitive to initialization

Struggles with non-spherical clusters or uneven cluster sizes

Sensitive to outliers



### Evaluating Clustering
![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)



![image.png](attachment:image.png)

#### pd.crosstab(index, columns):  
pd.crosstab() is a Pandas function that creates a cross-tabulation table, which shows the frequency counts (i.e., how often combinations of values occur) between two or more categorical variables.
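
As a small sketch of how this helps when evaluating clusters, the example below cross-tabulates cluster labels against known categories. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: cluster assignments next to known species labels
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1, 2],
    "species": ["setosa", "setosa", "versicolor",
                "virginica", "versicolor", "virginica"],
})

# Rows: cluster labels, columns: species, cells: how often each pair occurs
ct = pd.crosstab(df["cluster"], df["species"])
print(ct)
```

A clustering that matches the known categories well shows up as a table where each row has most of its counts in a single column.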


### Pipelines
We can use `Pipeline` or `make_pipeline` in scikit-learn to chain preprocessing steps and estimators together.
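
A minimal sketch of both constructors, assuming a scaling step followed by K-Means on synthetic data (the data and the step names `"scaler"` and `"kmeans"` are choices made for this example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two groups that differ only in the first
# feature, while the second feature has a much larger scale
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=[1, 100], size=(50, 2)),
    rng.normal(loc=[10, 0], scale=[1, 100], size=(50, 2)),
])

# Pipeline: each step gets an explicit name
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
labels = pipe.fit_predict(X)

# make_pipeline: step names are generated from the lowercased class names
auto_pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)
```

The two forms behave the same; `Pipeline` is convenient when you want to refer to steps by your own names, and `make_pipeline` when the generated names are good enough.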

#### Pipeline
![image.png](attachment:image.png)

#### make_pipeline
![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

### StandardScaler and Normalizer
| Feature        | `Normalizer`                              | `StandardScaler`                           |
| -------------- | ----------------------------------------- | ------------------------------------------ |
| **Normalizes** | Each **sample** (row)                     | Each **feature** (column)                  |
| **Goal**       | Make vector length = 1 (unit norm)        | Make mean = 0 and std = 1 (z-score)        |
| **Affects**    | Direction of rows                         | Distribution of each feature               |
| **Used in**    | Cosine similarity, KNN, clustering (rows) | PCA, regression, SVM, clustering (columns) |
| **Norm types** | `'l1'`, `'l2'`, `'max'`                   | N/A                                        |
| **Common in**  | Text/NLP (TF-IDF), signal data            | Numeric data analysis & ML models          |
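
The row-versus-column difference is easy to see on a tiny array. This is a small sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

# Normalizer rescales each ROW to unit L2 length
X_norm = Normalizer(norm="l2").fit_transform(X)
row_lengths = np.linalg.norm(X_norm, axis=1)

# StandardScaler rescales each COLUMN to mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

print(X_norm)        # the first two rows become identical: same direction
print(row_lengths)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

The first two rows point in the same direction, so `Normalizer` maps them to the same unit vector, while `StandardScaler` keeps them distinct because it only looks at each column's distribution.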


## Hierarchical Clustering
Hierarchical clustering is an unsupervised learning algorithm used to group data into clusters based on similarity. Unlike flat clustering (e.g., k-means), it produces a tree-like structure (called a dendrogram) that shows how data points are grouped together step by step.

#### Types of Hierarchical Clustering
Agglomerative (Bottom-Up) – Starts with each data point as a single cluster and merges the closest pairs iteratively.

Divisive (Top-Down) – Starts with all points in one cluster and splits them recursively.

The most common method is agglomerative clustering.

#### Steps in Agglomerative Hierarchical Clustering
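1. Compute the distance matrix (e.g., Euclidean distance).

2. Choose a linkage criterion, which defines the distance between two clusters:
   - Single linkage: minimum distance between any pair of points across the two clusters
   - Complete linkage: maximum distance between any pair of points across the two clusters
   - Average linkage: average distance over all pairs of points across the two clusters
   - Ward’s method: merges the pair of clusters that gives the smallest increase in within-cluster variance

3. Merge the two closest clusters.

4. Repeat until one cluster remains or a stopping criterion is met.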

####  Visual Output
A dendrogram helps you visualize the hierarchy and decide where to “cut” the tree to form flat clusters.

The height at which clusters are merged corresponds to the distance between them.
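
A short SciPy sketch of the agglomerative workflow: build the linkage matrix, inspect the merge heights, and cut the tree into flat clusters. The synthetic two-group data and the choice of Ward linkage are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic data (assumption): two tight groups of 10 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(4, 0.3, (10, 2))])

# Each row of Z records one merge: the two cluster ids, the merge
# distance (dendrogram height), and the size of the new cluster
Z = linkage(X, method="ward")

# "Cut" the tree so that at most 2 flat clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")

# no_plot=True returns the dendrogram layout without needing matplotlib;
# drop it (with matplotlib installed) to draw the tree
tree = dendrogram(Z, no_plot=True)

print(Z[-1, 2])   # height of the final merge, the largest in the tree
print(labels)
```

The last merge joins the two groups and happens at a much greater height than the earlier ones, which is the kind of gap you look for when choosing where to cut.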

#### ⚖️ Pros and Cons
✅ Advantages  
No need to pre-specify number of clusters (unless you want a flat clustering)

Dendrogram provides rich insights into data structure

Works well for small to medium datasets

❌ Limitations  
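Computationally expensive (roughly O(n² log n) time with efficient implementations and O(n³) for the naive algorithm, plus O(n²) memory for the distance matrix)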

Not scalable to large datasets

Sensitive to noise and outliers

Merging decisions are not reversible

#### 🔄 Variants
Agglomerative Hierarchical Clustering (most commonly used)

Divisive Hierarchical Clustering (less common, top-down)

Balanced hierarchical clustering (used in specific applications like document clustering)

BIRCH: A scalable version for large datasets (implemented in scikit-learn)

#### Practical Use Cases
Gene expression data analysis

Social network analysis

Image segmentation

Document and text clustering

Customer segmentation in marketing

Anomaly detection

#### 📦 Python Libraries That Support It
scipy.cluster.hierarchy – for linkage matrix and dendrogram

sklearn.cluster.AgglomerativeClustering – for assigning cluster labels

seaborn.clustermap() – for clustering + heatmap visualization

#### 🧠 When to Use Hierarchical Clustering
Use it when:

You want to visualize the nested structure of the data.

You don’t know the ideal number of clusters in advance.

You’re working with small to medium-sized datasets.

You need to identify natural groupings or taxonomies in the data.

![image.png](attachment:image.png)

#### Dendrogram example
![image-2.png](attachment:image-2.png)



## What is t-SNE?
t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction technique used primarily for visualizing high-dimensional data in 2 or 3 dimensions.

Developed by Laurens van der Maaten and Geoffrey Hinton, it's especially good at preserving local structure (i.e., similar points stay close in the lower-dimensional space), making it popular for visualizing clusters or patterns.

#### 🧠 Core Idea
t-SNE works in two main steps:

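1. Convert pairwise distances in the high-dimensional space into probabilities using Gaussian distributions, so that nearby points receive high similarity.

2. Place the points in a low-dimensional space so that similarities computed there with a Student’s t-distribution match the high-dimensional ones as closely as possible (by minimizing the KL divergence between the two sets of probabilities). The heavy tails of the t-distribution help avoid the "crowding problem."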

#### Key parameters
| Parameter       | Description                                                             |
| --------------- | ----------------------------------------------------------------------- |
| `n_components`  | Number of output dimensions (usually 2 or 3 for visualization)          |
| `perplexity`    | Controls balance between local and global aspects (typical range: 5–50) |
| `learning_rate` | Affects optimization speed and convergence                              |
| `n_iter`        | Number of optimization iterations (default: 1000); renamed to `max_iter` in scikit-learn 1.5 |
| `init`          | Initialization: `'random'` or `'pca'`                                   |
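
A minimal usage sketch with scikit-learn's `TSNE`, assuming synthetic 10-dimensional data with two groups. The parameter values are illustrative choices rather than recommendations, and the iteration count is left at its default so the snippet does not depend on the parameter's name.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic data (assumption): two groups of 30 points in 10 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 10)), rng.normal(8, 1, (30, 10))])

tsne = TSNE(
    n_components=2,       # 2D output for plotting
    perplexity=10,        # must be smaller than the number of samples
    learning_rate=200.0,
    init="pca",
    random_state=0,       # t-SNE is stochastic; fix the seed for repeatability
)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)
```

`TSNE` has no separate `transform` method, so the embedding is computed for the data passed to `fit_transform` and cannot be applied to new points afterwards.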


![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

![image-4.png](attachment:image-4.png)

![image.png](attachment:image.png)

## 📊 What is PCA?
PCA (Principal Component Analysis) is a linear algebra-based technique for dimensionality reduction. It transforms a dataset from its original feature space to a new coordinate system such that:  

The first principal component captures the maximum variance possible in the data.

The second principal component captures the maximum remaining variance, subject to being orthogonal (uncorrelated) to the first.

This continues for all components.

So, the data is projected onto a new set of orthogonal axes ranked by the amount of information (variance) they capture.

####  Why Use PCA?
Reduce dimensionality of data while preserving as much variance as possible.

Remove redundancy (correlated features).

Helps with visualization, compression, and noise reduction.

Often used as a preprocessing step before applying machine learning algorithms.

#### 🧠 Core Concepts
1. Principal Components  
   Linear combinations of the original features, ordered by the amount of variance they explain.

2. Explained Variance  
   The proportion of total dataset variance captured by each principal component. You can use this to decide how many components to keep.

#### ⚙️ How PCA Works (Step-by-Step)
1. Center the data (subtract each feature's mean) and, when features are on different scales, standardize it to unit variance.

2. Compute the covariance matrix of the features.

3. Compute the eigenvectors and eigenvalues of the covariance matrix.

4. Sort the eigenvectors by decreasing eigenvalue.

5. Choose the top k eigenvectors to form a new basis.

6. Project the data onto the new subspace.
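
The steps above can be sketched directly in NumPy. This is an illustration under stated assumptions (standardized features, no constant columns, synthetic data); the function name `pca` is hypothetical, and `sklearn.decomposition.PCA` would be the usual choice in practice.

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix (illustrative helper)."""
    # Standardize: zero mean, unit variance (assumes no constant columns)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Covariance matrix of the features (columns)
    cov = np.cov(X_std, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort by decreasing eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep the top k eigenvectors and project the data onto them
    components = eigvecs[:, :k]
    projected = X_std @ components
    explained_ratio = eigvals[:k] / eigvals.sum()
    return projected, components, explained_ratio

# Synthetic data (assumption): two strongly correlated features plus one independent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

projected, components, explained_ratio = pca(X, k=2)
print(explained_ratio)
```

Since two of the three features carry nearly the same information, the first component alone accounts for most of the variance, which illustrates how PCA removes redundancy.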

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

![image-4.png](attachment:image-4.png)

![image-5.png](attachment:image-5.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image.png](attachment:image.png)

![image-3.png](attachment:image-3.png)

| Aspect                  | Sparse Matrix                                                                                  | Dense Matrix                                                             |
| ----------------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| **Definition**          | Matrix with **mostly zero** entries                                                            | Matrix where **most entries are non-zero**                               |
| **Storage**             | Stores **only non-zero elements** and their positions (e.g., CSR, CSC formats)                 | Stores **all elements explicitly**, zeros included (e.g., NumPy ndarray) |
| **Memory Usage**        | Very **memory efficient** for large, mostly-zero matrices                                      | Can consume a lot of memory if matrix is large, since all entries stored |
| **Computation Speed**   | Faster for operations that can skip zeros, but some operations slower due to indirect indexing | Faster for small matrices or when all values matter                      |
| **Use Cases**           | Text data (TF-IDF), graphs, recommender systems, social networks                               | Images, dense numerical data, small-to-medium datasets                   |
| **Supported Libraries** | `scipy.sparse` in Python, specialized sparse matrix libraries                                  | NumPy arrays, most numerical computing libraries                         |
| **Conversion**          | Can be converted to dense but might be very large                                              | Can be converted to sparse if many zeros present                         |
| **Example**             | Adjacency matrix of a large graph with few edges                                               | Matrix of pixel values in a photo                                        |
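
The storage difference in the table can be checked with a small experiment. This sketch assumes a 1000 × 1000 matrix with only three non-zero entries, chosen purely for illustration.

```python
import numpy as np
from scipy import sparse

# Dense matrix: every entry is stored, zeros included
dense = np.zeros((1000, 1000))
dense[0, 1] = 1.0
dense[2, 3] = 2.0
dense[999, 0] = 3.0

# CSR keeps only the non-zero values plus their column indices and row pointers
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print("non-zero entries:", csr.nnz)
print("dense bytes:", dense_bytes, "sparse bytes:", sparse_bytes)

# Converting back is lossless, but the result is full size again
restored = csr.toarray()
```

The sparse version needs a few kilobytes (mostly row pointers) against 8 MB for the dense array, which is why formats like CSR are the default for TF-IDF matrices.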


## 🔶 What is NMF (Non-negative Matrix Factorization)?
NMF is a linear dimensionality reduction and matrix factorization technique where a non-negative matrix is approximately factored into the product of two smaller non-negative matrices.

It is especially useful for parts-based representation — meaning it tends to find interpretable features or components (e.g., topics in text, features in images).
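
A minimal scikit-learn sketch of the factorization, assuming a small synthetic non-negative matrix that is exactly the product of two non-negative factors. The variable names `V`, `W`, and `H` and the parameter values are choices made for this example.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic data (assumption): a 20 x 6 non-negative matrix of rank 2
rng = np.random.default_rng(0)
W_true = rng.random((20, 2))
H_true = rng.random((2, 6))
V = W_true @ H_true

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)   # shape (20, 2): per-sample weights
H = model.components_        # shape (2, 6): the learned components

# Relative reconstruction error of the approximation V ≈ W @ H
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.shape, H.shape, rel_error)
```

Both factors contain only non-negative values, so each sample is built by adding components together, never by subtracting them; this is what makes the components readable as "parts."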

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)



![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

![image-4.png](attachment:image-4.png)