1. Problem identification 

2. Data wrangling

3. Exploratory data analysis

4. Pre-processing and training data development

5. **Modeling (Machine learning steps)**

6. Documentation

<div class="span5 alert alert-warning">
<h3>Clustering Models</h3>

- Clustering models are unsupervised machine learning algorithms that group data points into clusters based on similarity, without using labeled outcomes.
- Clustering models identify natural groupings or patterns in data by measuring how similar or close data points are to one another. Each cluster contains data points that are more similar to each other than to those in other clusters.

### <font color='brown'><b> K-means </b></font> 

- K-Means is an **unsupervised** machine learning algorithm that clusters data into K groups. It works by:

1️⃣ Choosing a number of clusters K and initializing K centroids.

2️⃣ Assigning data points to the nearest cluster center (centroid).

3️⃣ Updating centroids based on assigned points.

4️⃣ Repeating steps 2 and 3 until centroids stabilize.
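
The four steps above can be sketched in plain NumPy. This is only an illustrative sketch, not how scikit-learn implements K-Means: the function name `kmeans_sketch`, the random initialization from data points, and the toy data are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Minimal K-Means loop (illustrative only; prefer sklearn.cluster.KMeans)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (assumption): two well-separated blobs in 2D
rng = np.random.default_rng(1)
toy_points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_centroids)
```

In practice the scikit-learn `KMeans` class is the better choice, since it adds smarter initialization and multiple restarts.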

**<span style="background-color: goldenrod;">Model Code</span>**

```python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Placeholder data (assumption): swap in your own 2D feature arrays
rng = np.random.default_rng(42)
array = rng.normal(size=(300, 2))      # samples used to fit the model
new_array = rng.normal(size=(100, 2))  # new samples to assign to clusters

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3, n_init=10, random_state=42)

# Fit model to points
model.fit(array)

# Determine the cluster labels of new_array
labels = model.predict(new_array)

# Print cluster labels of new_array
print(labels)

# Visualize the result

# Assign the columns of new_array: xs and ys
xs = new_array[:, 0]
ys = new_array[:, 1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c=labels, alpha=0.5)

# Assign the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:, 0]
centroids_y = centroids[:, 1]

# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x, centroids_y, marker='D', s=50)
plt.show()

```
**<span style="background-color: goldenrod;">Scenarios to Use the Model In</span>**

**<span style="color: yellowgreen;">Scenario 1</span>**

- Company: A healthcare analytics startup

- Problem: “Can you group patients into risk profiles based on age, BMI, blood pressure, and lab results?”

**Why Clustering** (e.g., K-Means or Hierarchical Clustering):

- No labeled outcomes — we’re discovering patterns, not predicting a target

- Useful for identifying hidden subgroups (e.g., metabolic syndrome, cardiovascular risk)

- Helps clinicians personalize care without needing predefined categories

**Metrics**

Silhouette Score: Measures how well each patient fits within their assigned cluster compared with the nearest other cluster. Scores range from -1 to 1; a higher score indicates that patients sit close to their own cluster and far from the others.

Davies-Bouldin Index: Measures how compact and distinct the clusters are. A lower score means better-defined groupings — ideal for clinical interpretation.

Visual Inspection (e.g., t-SNE or PCA plots): Helps stakeholders see how patients are grouped. If clusters are clearly separated, it builds trust in the model’s insights.
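
As a rough sketch of how the first two metrics could be computed with scikit-learn: the patient data below is synthetic, and the column meanings (age, BMI, systolic blood pressure) and the choice of two clusters are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic stand-in for patient data (assumption): columns are age, BMI, systolic BP
rng = np.random.default_rng(0)
patients = np.vstack([
    rng.normal(loc=[35, 22, 115], scale=[5, 2, 8], size=(100, 3)),
    rng.normal(loc=[65, 31, 150], scale=[5, 2, 8], size=(100, 3)),
])

# Scale features so no single unit (years, kg/m^2, mmHg) dominates the distances
scaled = StandardScaler().fit_transform(patients)

risk_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

sil = silhouette_score(scaled, risk_labels)      # in [-1, 1], higher is better
dbi = davies_bouldin_score(scaled, risk_labels)  # >= 0, lower is better
print(f"Silhouette: {sil:.3f}, Davies-Bouldin: {dbi:.3f}")
```

Both scores are computed on the scaled features, since they are distance-based just like K-Means itself.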


**<span style="color: yellowgreen;">Scenario 2</span>**

- Company: An e-commerce platform

- Problem: “Can you segment customers based on purchase frequency, average spend, and browsing behavior?”

**Why Clustering** (e.g., K-Means or DBSCAN):

- No target variable — we’re uncovering behavioral patterns

- Enables personalized marketing and product recommendations

- Helps identify high-value vs. casual shoppers without manual tagging

**Metrics**

Silhouette Score: Measures how well each customer fits into their segment compared with the nearest other segment. A high score suggests the segments are well separated in feature space.

Calinski-Harabasz Index: Measures cluster separation and cohesion. A higher score means your segments are distinct and actionable.

Cluster Profiling: Once clusters are formed, you analyze their characteristics (e.g., “Cluster A spends $200/month, browses skincare, shops late at night”) — this is what drives business decisions.
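
A minimal sketch of the Calinski-Harabasz score and of cluster profiling with pandas follows. The customer table is synthetic, and the column names `purchases_per_month` and `avg_spend` are hypothetical, chosen only to mirror the scenario.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import calinski_harabasz_score

# Synthetic customers (assumption): hypothetical column names and values
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "purchases_per_month": np.concatenate([rng.poisson(2, 150), rng.poisson(10, 50)]),
    "avg_spend": np.concatenate([rng.normal(30, 8, 150), rng.normal(200, 30, 50)]),
})

# Scale before clustering so both features contribute comparably
scaled = StandardScaler().fit_transform(customers)
customers["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

ch = calinski_harabasz_score(scaled, customers["segment"])  # higher is better
print(f"Calinski-Harabasz: {ch:.1f}")

# Cluster profiling: per-segment averages and sizes
profile = customers.groupby("segment").agg(
    purchases_per_month=("purchases_per_month", "mean"),
    avg_spend=("avg_spend", "mean"),
    n_customers=("avg_spend", "size"),
)
print(profile)
```

The `profile` table is the part a business stakeholder would read: one row per segment, described in the original units.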



### <font color='brown'><b>Agglomerative Hierarchical Clustering</b></font> 


Agglomerative Hierarchical Clustering is a bottom-up clustering method that starts with each data point as its own cluster and gradually merges the closest clusters until all points belong to a single cluster or a predefined number of clusters is reached.

**How It Works**

1️⃣ Start with individual clusters → Each data point is its own cluster. 

2️⃣ Merge closest clusters → The algorithm finds the two most similar clusters and combines them, but how "similarity" is measured depends on the linkage method you choose:

- Single Linkage → Merges clusters based on the distance between their closest points.
- Complete Linkage → Merges clusters based on the distance between their farthest points.
- Average Linkage → Uses the average distance between all pairs of points in the two clusters.
- Ward’s Method → Merges the pair of clusters that least increases the variance inside clusters.
- Centroid Linkage → Uses the distance between the centers (means) of the clusters.

3️⃣ Repeat until desired clusters → This continues until a stopping condition is met (e.g., reaching a set number of clusters). 

4️⃣ Visualize with a dendrogram → The merging process can be represented as a tree-like diagram called a dendrogram, which helps understand how clusters form.
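
To make step 3 concrete, here is a small sketch that builds the merge history with several SciPy linkage methods and then cuts each tree into a fixed number of flat clusters using `fcluster`. The toy data and the choice of three clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (assumption): three loose groups in 2D
rng = np.random.default_rng(0)
toy_samples = np.vstack([
    rng.normal(loc=(0, 0), scale=0.4, size=(20, 2)),
    rng.normal(loc=(4, 0), scale=0.4, size=(20, 2)),
    rng.normal(loc=(2, 4), scale=0.4, size=(20, 2)),
])

# Steps 1-2: build the full merge history with a chosen linkage method
for method in ["single", "complete", "average", "centroid", "ward"]:
    merge_history = linkage(toy_samples, method=method)
    # Step 3: cut the tree so that at most 3 flat clusters remain
    flat_labels = fcluster(merge_history, t=3, criterion="maxclust")
    # fcluster labels start at 1, so drop the empty 0 bin when counting
    print(method, np.bincount(flat_labels)[1:])
```

On cleanly separated data the linkage methods tend to agree; differences usually show up on noisy or elongated clusters.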

**<span style="background-color: goldenrod;">Model Code</span>**

```python
# Perform the necessary imports
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Assumes `df` is an existing pandas DataFrame with the placeholder columns
# 'feature1', 'feature2', and 'labels'.

# Select relevant features for clustering.
# linkage expects a 2D array of shape (n_samples, n_features).
samples = df[['feature1', 'feature2']].values

# Calculate the linkage, which determines how clusters are merged based on their distances.
# Other methods include 'single', 'average', 'ward', and 'centroid'.
mergings = linkage(samples, method='complete')

# Visualize the dendrogram using labels.
# Note: labels are only used to annotate the plot, not to train the model.

# Load labels
varieties = df['labels'].values

# Plot the dendrogram, using varieties as labels
dendrogram(mergings,
           labels=varieties,
           leaf_rotation=90,
           leaf_font_size=6)

plt.show()

```

**<span style="background-color: goldenrod;">Model Evaluations</span>**

Cluster Validity Assessment using External Labels – Evaluates clustering quality by comparing predicted cluster assignments (labels) to known ground-truth categories (varieties).

Silhouette Score – Measures how well-separated clusters are; higher scores mean better-defined clusters.

Davies-Bouldin Index – Assesses cluster compactness and separation; lower values indicate better clustering.

Cluster Visualization – Helps understand separation using scatter plots or Principal Component Analysis (PCA) projections.

Rand Index – Measures the similarity between two clusterings, for example predicted clusters versus ground-truth labels. Values range from 0 to 1, where 1 means the two groupings are identical.
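
A short sketch of validation against external labels: a cross-tabulation of predicted clusters against known categories, plus the Rand index and its chance-corrected variant from scikit-learn. The toy data and the variety names `variety_a` and `variety_b` are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import rand_score, adjusted_rand_score

# Toy data (assumption): two groups with known, hypothetical variety names
rng = np.random.default_rng(0)
toy_samples = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 2)),
    rng.normal(loc=4.0, scale=0.5, size=(30, 2)),
])
true_varieties = np.array(["variety_a"] * 30 + ["variety_b"] * 30)

# Cluster without using the labels, then cut the tree into 2 flat clusters
predicted = fcluster(linkage(toy_samples, method="complete"), t=2, criterion="maxclust")

# Cross-tabulation: rows are predicted clusters, columns are known varieties
ct = pd.crosstab(predicted, true_varieties, rownames=["cluster"], colnames=["variety"])
print(ct)

# Rand index is in [0, 1]; the adjusted version corrects for chance agreement
ri = rand_score(true_varieties, predicted)
ari = adjusted_rand_score(true_varieties, predicted)
print(f"Rand index: {ri:.3f}, adjusted Rand index: {ari:.3f}")
```

A cross-tabulation where each row is dominated by a single column indicates that clusters line up with the known categories.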

### <font color='brown'><b>t-SNE (t-Distributed Stochastic Neighbor Embedding)</b></font> 

- t-SNE is a dimensionality reduction technique used to visualize high-dimensional data in 2D or 3D space. It doesn't cluster: it places similar points close together, so distinct groups often appear when plotted, but unlike K-Means or Hierarchical Clustering it doesn't assign cluster labels explicitly.


**<span style="background-color: goldenrod;">Visualization Code (Not a Clustering Model)</span>**


```python
# Import TSNE and pyplot
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Assumes `df` is an existing pandas DataFrame with these placeholder feature columns.
# 10–100+ features → Works well for complex datasets
samples = df[['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 
              'feature6', 'feature7', 'feature8', 'feature9', 'feature10']].values


# Create a TSNE instance: model
model = TSNE(learning_rate=200)

# Apply fit_transform to samples: tsne_features
# (t-SNE has no separate predict step; it embeds the data it is fitted on)
tsne_features = model.fit_transform(samples)

# Select the 0th feature: xs
xs = tsne_features[:, 0]

# Select the 1st feature: ys
ys = tsne_features[:, 1]

# Scatter plot, coloring by variety_numbers
# (assumed: a numeric array with one known category code per row of samples)
plt.scatter(xs, ys, c=variety_numbers)
plt.show()
```

**How it is used in the real world**

- ✅ Similarity & Grouping → It shows which companies (data points) have similar movement patterns.
- ✅ Hidden Structures → Reveals clusters of companies that behave alike, even if not obvious in raw data.
- ✅ Anomalies & Outliers → If a company appears far from others, it might behave differently from the rest.
- ✅ Market Trends → Can help identify groups of companies with shared movement dynamics (e.g., stock trends).

- Silhouette Score
- Davies-Bouldin Index
- Cluster Visualization
- Rand Index

### Clustering Algorithms in Scikit-learn
<table border="1">
<colgroup>
<col width="15%" />
<col width="16%" />
<col width="20%" />
<col width="27%" />
<col width="22%" />
</colgroup>
<thead valign="bottom">
<tr><th>Method name</th>
<th>Parameters</th>
<th>Scalability</th>
<th>Use Case</th>
<th>Geometry (metric used)</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>K-Means</td>
<td>number of clusters</td>
<td>Very large <span class="pre">n_samples</span>, medium <span class="pre">n_clusters</span> with
MiniBatch code</td>
<td>General-purpose, even cluster size, flat geometry, not too many clusters</td>
<td>Distances between points</td>
</tr>
<tr><td>Affinity propagation</td>
<td>damping, sample preference</td>
<td>Not scalable with n_samples</td>
<td>Many clusters, uneven cluster size, non-flat geometry</td>
<td>Graph distance (e.g. nearest-neighbor graph)</td>
</tr>
<tr><td>Mean-shift</td>
<td>bandwidth</td>
<td>Not scalable with <span class="pre">n_samples</span></td>
<td>Many clusters, uneven cluster size, non-flat geometry</td>
<td>Distances between points</td>
</tr>
<tr><td>Spectral clustering</td>
<td>number of clusters</td>
<td>Medium <span class="pre">n_samples</span>, small <span class="pre">n_clusters</span></td>
<td>Few clusters, even cluster size, non-flat geometry</td>
<td>Graph distance (e.g. nearest-neighbor graph)</td>
</tr>
<tr><td>Ward hierarchical clustering</td>
<td>number of clusters</td>
<td>Large <span class="pre">n_samples</span> and <span class="pre">n_clusters</span></td>
<td>Many clusters, possibly connectivity constraints</td>
<td>Distances between points</td>
</tr>
<tr><td>Agglomerative clustering</td>
<td>number of clusters, linkage type, distance</td>
<td>Large <span class="pre">n_samples</span> and <span class="pre">n_clusters</span></td>
<td>Many clusters, possibly connectivity constraints, non Euclidean
distances</td>
<td>Any pairwise distance</td>
</tr>
<tr><td>DBSCAN</td>
<td>neighborhood size</td>
<td>Very large <span class="pre">n_samples</span>, medium <span class="pre">n_clusters</span></td>
<td>Non-flat geometry, uneven cluster sizes</td>
<td>Distances between nearest points</td>
</tr>
<tr><td>Gaussian mixtures</td>
<td>many</td>
<td>Not scalable</td>
<td>Flat geometry, good for density estimation</td>
<td>Mahalanobis distances to  centers</td>
</tr>
<tr><td>Birch</td>
<td>branching factor, threshold, optional global clusterer.</td>
<td>Large <span class="pre">n_clusters</span> and <span class="pre">n_samples</span></td>
<td>Large dataset, outlier removal, data reduction.</td>
<td>Euclidean distance between points</td>
</tr>
</tbody>
</table>
Source: http://scikit-learn.org/stable/modules/clustering.html
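
To illustrate the "flat" versus "non-flat geometry" distinction in the table, the sketch below runs K-Means and DBSCAN on scikit-learn's two-moons toy dataset and compares each result with the known moon membership. The `eps` and `min_samples` values are illustrative guesses for this toy data, not tuned recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a standard example of non-flat geometry
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# eps and min_samples are illustrative guesses for this toy data, not tuned values
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # -1 marks noise points

print("K-Means ARI:", round(adjusted_rand_score(y_true, kmeans_labels), 3))
print("DBSCAN ARI:", round(adjusted_rand_score(y_true, dbscan_labels), 3))
```

Both estimators expose the same `fit_predict` interface, so swapping algorithms from the table is mostly a matter of changing the constructor and its parameters.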

<div class="span5 alert alert-warning">
<h3>Clustering Models Evaluation Metrics</h3>

Elbow Method – Helps choose the number of clusters by plotting within-cluster sum of squares (WCSS) against K and looking for the "elbow" where adding more clusters stops reducing WCSS substantially.

Cluster Validity Assessment using External Labels – Validates clustering quality by comparing predicted clusters (labels) to ground-truth categories (varieties).

Silhouette Score – Measures how well-separated clusters are; higher scores mean better-defined clusters.

Davies-Bouldin Index – Assesses cluster compactness and separation; lower values indicate better clustering.

Gap Statistic – Compares the within-cluster dispersion of the data with the dispersion expected for random reference data, to help choose the number of clusters.

Cluster Visualization – Helps understand separation using scatter plots or PCA projections.
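
As a minimal sketch of the Elbow Method listed above, the loop below fits K-Means for several values of K and records the WCSS, which scikit-learn exposes as the `inertia_` attribute. The toy data with three blobs and the range of K values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (assumption): three blobs in 2D
rng = np.random.default_rng(0)
toy_samples = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(2.5, 5), scale=0.5, size=(50, 2)),
])

ks = range(1, 7)
wcss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_samples)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this k

# Print WCSS per k; plotting ks against wcss would show the elbow visually
for k, value in zip(ks, wcss):
    print(k, round(value, 1))
```

The value of K to pick is where the drop in WCSS flattens out; on real data the elbow is often less sharp, so it is usually read alongside the silhouette score.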