# Learning Agenda

- **What is Clustering**
- Types of Clustering:
    - **K-Means Clustering** 
        - Basic Steps of K-Means Clustering
        - Limitations of K-Means
        - Possible Improvements in K-Means
        - How to choose the value of K in K-Means?
        
    - **Hierarchical Clustering**
        - Types of hierarchical clustering: Agglomerative and Divisive
        - Creating a dendrogram
        - Distance measures in hierarchical clustering
        - Determining the optimal number of clusters in hierarchical clustering
        - Visualizing hierarchical clustering results


## What is Clustering?

**Clustering** is an unsupervised machine learning technique that groups similar objects together. The goal is to divide a large data set into smaller, more meaningful subgroups based on how similar the objects are to one another. This can reveal patterns, relationships, and insights that are hard to see in the raw data.


<img src="images/clustering.jpg" height=500px width=500px>

## K-Means Clustering

**K-means** is a centroid-based clustering algorithm. It divides the data set into k clusters, where k is a number chosen by the user. Each cluster is represented by its centroid, which is the mean of all the data points in the cluster. Starting from k initial centroids, the algorithm assigns each data point to its closest centroid and then recomputes each centroid as the mean of the points assigned to it. These two steps repeat until the centroids stop changing or a maximum number of iterations is reached.

<img src="images/k-means.png" height=500px width=500px>
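
The assign-and-update loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, its parameters, the random choice of initial centroids, and the toy `points` list are all assumptions made for this example. In practice a library implementation such as scikit-learn's `KMeans` would normally be used.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster points (tuples of floats) into k groups with the basic k-means loop."""
    rng = random.Random(seed)
    # Assumption: initialise centroids by picking k distinct data points at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)

    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]

        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])

        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster receives label 0 depends on the random initialisation.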

### Limitations of k-means clustering

- K-means clustering is sensitive to outliers. Because each centroid is the mean of its points, a few extreme values can pull a centroid away from the bulk of its cluster.

- It is not suitable for data whose clusters are not linearly separable, such as concentric rings or interleaved crescents. Each point is assigned to the nearest centroid by Euclidean distance, which can only produce straight-line boundaries between clusters.

- The number of clusters k must be chosen before applying k-means clustering, which can be difficult if there isn’t enough prior knowledge about the data set.

- K-means clustering assumes roughly spherical clusters, but in practice many datasets have more complex geometries, such as elongated or irregular shapes, which a spherical model cannot represent.
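
The outlier sensitivity in the first point follows directly from the centroid being a mean. The short sketch below, using made-up points and a hypothetical `centroid` helper, shows how a single extreme point shifts the centroid of an otherwise tight cluster.

```python
from statistics import mean


def centroid(points):
    """Return the coordinate-wise mean of a list of 2-D points."""
    return (mean(p[0] for p in points), mean(p[1] for p in points))


# Hypothetical cluster: four points forming a small square.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
# Hypothetical outlier far away from the square.
outlier = (20.0, 20.0)

clean = centroid(cluster)               # centre of the square: (1.5, 1.5)
skewed = centroid(cluster + [outlier])  # pulled to roughly (5.2, 5.2)

print(clean)
print(skewed)
```

The skewed centroid lies outside the square entirely, so it no longer represents any of the four original points well.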

## Hierarchical Clustering

**Hierarchical clustering** is a type of clustering algorithm that builds a hierarchy of nested clusters. The hierarchy is usually drawn as a dendrogram, a tree-like diagram that shows the relationships between the objects in the data set and the order in which clusters are merged or split. There are two main types of hierarchical clustering: Agglomerative and Divisive. Agglomerative (bottom-up) clustering starts by treating each data point as a separate cluster, then repeatedly merges the two closest clusters until all the data points are combined into a single cluster. Divisive (top-down) clustering works in the opposite direction: it starts with the entire data set as one cluster and recursively splits it into smaller clusters.

<img src="images/Hierarchical.jpeg" height=500px width=500px>
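
The agglomerative procedure can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the choice of single linkage (distance between the closest pair of points) as the cluster distance, and the toy `points` list are assumptions made for this example. The recorded `merges` list holds the information a dendrogram would display. Library routines such as `scipy.cluster.hierarchy.linkage` are the usual choice for real data.

```python
import math
from itertools import combinations


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    merges = []  # (cluster_a, cluster_b, distance), in merge order

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        dist = single_linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        # Delete j first: j > i, so index i is unaffected.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)

    return clusters, merges


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
clusters, merges = agglomerative(points, n_clusters=2)
for cluster in clusters:
    print(sorted(cluster))
```

Calling `agglomerative(points, n_clusters=1)` runs the merging to completion and returns the full merge history. Stopping at a larger `n_clusters` corresponds to cutting the dendrogram at that level.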


## Real-World Applications of Clustering

- **Customer Segmentation:** Marketers can use clustering to group similar customers together to target them with personalized offers and advertisements.


- **Image Segmentation:** Clustering can be used to group similar pixels together to segment images into meaningful objects.


- **Anomaly Detection:** Clustering can be used to identify unusual data points or outliers in a dataset.


- **Gene Expression Analysis:** Clustering can be used to group genes with similar patterns of expression into clusters.


- **Fraud Detection:** Clustering can be used to group transactions into similar clusters to detect fraudulent behavior.


- **Text Clustering:** Clustering can be used to group documents with similar topics or themes into clusters.


- **Recommendation Systems:** Clustering can be used to group similar users together to make personalized recommendations.


- **Speech Recognition:** Clustering can be used to group similar speech sounds together to improve speech recognition.


- **Medical Imaging:** Clustering can be used to group similar pixels in medical images to segment and diagnose medical conditions.


- **Financial Portfolio Management:** Clustering can be used to group similar stocks together to form portfolios with diverse investment opportunities.


- **Weather Forecasting:** Clustering can be used to group similar weather patterns together to make more accurate weather predictions.


- **Social Network Analysis:** Clustering can be used to group similar users together to identify communities and connections within a social network.


- **Cluster Analysis in Marketing:** Clustering can be used to group similar customers together based on their spending patterns and buying behavior.


- **Manufacturing Process Optimization:** Clustering can be used to group similar production processes together to optimize the manufacturing process.


- **Image Compression:** Clustering can be used to group similar pixels together to compress images.

## Interview Questions 

- What is Clustering and why is it important in Machine Learning?
- How does the k-means algorithm work, and what are its steps?
- How do you choose the number of clusters in k-means?
- What is the Elbow method and how is it used in k-means?
- What are the limitations of k-means, and how do you overcome them?
- Can you explain the difference between hard and soft clustering, and which of the two k-means performs?
- How do you evaluate the performance of a k-means model?
- What is the Silhouette score, and how is it used in k-means?
- What is hierarchical clustering, and how does it differ from k-means?
- Can you explain the types of hierarchical clustering (Agglomerative and Divisive)?
- How do you create a dendrogram in hierarchical clustering?
- What are the common distance measures used in hierarchical clustering?
- How do you determine the optimal number of clusters in hierarchical clustering?
- How do you visualize the results of hierarchical clustering?