<h1>Unsupervised Learning</h1>

Unsupervised learning is a type of machine learning in which the algorithm receives a dataset with no labeled outputs or predefined target values. It must learn the patterns, relationships, and structure in the data on its own.

The algorithm explores the inherent structure of the data to find hidden patterns or representations. Typical goals are to group similar data points or to reduce the dimensionality of the data.

Two main types of unsupervised learning are:

- **Clustering:** The algorithm identifies groups of similar instances in the data, grouping them together based on certain features or characteristics. Examples of clustering algorithms include k-means clustering and hierarchical clustering.

- **Dimensionality Reduction:** The algorithm aims to reduce the number of features or variables in the dataset while retaining the essential information. Principal Component Analysis (PCA) is a popular technique for dimensionality reduction.
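
As one illustration of dimensionality reduction, the sketch below implements PCA with NumPy's singular value decomposition. The function name `pca_project` and the toy data are hypothetical, chosen only for this example; library implementations may differ in details such as sign conventions and scaling.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    # PCA assumes centered data, so subtract the per-feature mean first.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance captured along each kept direction.
    explained_variance = singular_values[:n_components] ** 2 / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance

# Hypothetical toy data: 3 features, where the third is nearly a copy of the first.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])

Z, components, var = pca_project(X, n_components=2)
print(Z.shape)  # (200, 2)
```

Because the third feature is almost redundant, two components retain nearly all of the variation in this toy dataset.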

Unsupervised learning contrasts with supervised learning, where the algorithm is trained on labeled data with explicit input-output pairs. It is often used when labeled data is scarce or unavailable and the goal is to explore and understand the structure of the data.


<table width="100%" border="0">
<tr>
    <td width="50%">
        <img src="./images/00/supervised_learning.png" />
        <div align="center">Supervised learning example: Classification Problem</div>
    </td>
    <td>
        <img src="./images/00/unsupervised_learning.png" />
        <div align="center">Unsupervised learning example: Clustering Problem</div>
    </td>
</tr>
</table>

Supervised learning and unsupervised learning are two fundamental approaches in machine learning, and they differ primarily in the way they use labeled or unlabeled data for training.

**Supervised Learning:**
- **Training Data:**
    - **Labeled Data:** Supervised learning relies on a labeled training dataset, where each input is associated with a corresponding output or target. The algorithm learns to map inputs to specific outputs based on the provided labels.
- **Objective:**
    - **Prediction:** The main goal of supervised learning is to make accurate predictions or classifications for new, unseen data points based on the patterns learned during training.
- **Examples:**
    - **Classification:** Identifying which category or class an input belongs to (e.g., spam or not spam, image classification).
    - **Regression:** Predicting a continuous output variable (e.g., predicting house prices).

**Unsupervised Learning:**
- **Training Data:**
    - **Unlabeled Data:** Unsupervised learning works with unlabeled data, where the algorithm is given a dataset without explicit instructions on what the output should be.
- **Objective:**
    - **Discover Patterns:** The primary goal is to explore the inherent structure of the data, uncover hidden patterns, or find relationships between data points without predefined outputs.
- **Examples:**
    - **Clustering:** Grouping similar data points together based on certain features.
    - **Dimensionality Reduction:** Reducing the number of features while retaining important information.
    - **Association:** Discovering relationships and associations between variables in the data.

<table width="100%" border="0">
<tr>
    <td align="center" width="33%">
        <img src="./images/00/market_seg.jpg" />
        <div>Market Segmentation</div>
    </td>
    <td align="center" width="33%">
        <img src="./images/00/organize_comp.jpg" />
        <div>Organize computing clusters</div>
    </td>
    <td align="center">
        <img src="./images/00/astronomical.jpg" />
        <div>Astronomical data analysis</div>
    </td>
</tr>
</table>

## K-Means Clustering Algorithm

### Inputs:
- **Dataset:** $ X = \{x_1, x_2, \ldots, x_n\} $ - a set of $n$ data points in a feature space.

- **Number of Clusters:** $k$ - the desired number of clusters to partition the dataset into.

### Algorithm Steps:

1. **Initialization:**
    - Choose the number of clusters, $k$.
    - Randomly initialize $k$ cluster centroids: $\{c_1, c_2, \ldots, c_k\}$, where each centroid $c_i$ is a point in the feature space.

2. **Assignment Step:**
    - For each data point $x_j$ in the dataset, calculate the distance to each centroid:
        $ d_{ij} = \|x_j - c_i\| $
    - Assign the data point $x_j$ to the cluster with the nearest centroid:
        $ \text{cluster}(x_j) = \underset{i}{\text{argmin }} d_{ij} $

3. **Update Step:**
    - Recalculate the centroids of the clusters based on the data points assigned to them. The new centroid $c_i$ is the mean of all the points in the cluster:
        $$ c_i = \frac{1}{N_i} \sum_{j \,:\, \text{cluster}(x_j) = i} x_j $$
        where $N_i$ is the number of data points assigned to cluster $i$.

4. **Repeat:**
    - Repeat steps 2 and 3 until convergence. Convergence occurs when the cluster assignments stop changing, or when the centroids move less than a chosen tolerance between iterations.
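
As a small illustration of these steps (the numbers are made up for this example, not taken from a real dataset), consider the one-dimensional dataset $X = \{1, 2, 9, 10\}$ with $k = 2$ and initial centroids $c_1 = 1$, $c_2 = 2$:

- **Iteration 1:** The assignment step gives cluster 1 $= \{1\}$ and cluster 2 $= \{2, 9, 10\}$. The update step gives $c_1 = 1$ and $c_2 = \frac{2 + 9 + 10}{3} = 7$.
- **Iteration 2:** The point $2$ is now closer to $c_1$ (distance $1$) than to $c_2$ (distance $5$), so cluster 1 $= \{1, 2\}$ and cluster 2 $= \{9, 10\}$. The update step gives $c_1 = 1.5$ and $c_2 = 9.5$.
- **Iteration 3:** No point changes cluster, so the centroids stay at $1.5$ and $9.5$ and the algorithm has converged.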

### Outputs:
- **Cluster Assignments:** A set of cluster labels for each data point, indicating which cluster it belongs to.

- **Cluster Centroids:** The final positions of the $k$ centroids after convergence.

The algorithm seeks to partition the dataset into $k$ clusters, with each cluster represented by a centroid. The assignment step allocates each data point to the cluster associated with the nearest centroid, and the update step adjusts the centroids based on the mean of the points in each cluster. The process is iterated until convergence.
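
The procedure described above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the parameters `max_iters`, `tol`, and `seed`, the choice of initializing centroids from randomly selected data points, and the handling of empty clusters are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    """Cluster X (n_samples, n_features) into k groups; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialization (assumed strategy): use k distinct data points as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(max_iters):
        # Assignment step: distances[j, i] = ||x_j - c_i||, then pick the nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                new_centroids[i] = members.mean(axis=0)

        # Convergence check: stop once the centroids have effectively stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    return labels, centroids

# Hypothetical toy data: two well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

The result depends on the random initialization, so on harder datasets it is common to run the algorithm with several seeds and compare the outcomes.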