<p align="center">
    <img src="JHU.png" width="200" alt="Johns Hopkins University logo">
</p>

## Hands-on Lab: Clustering Algorithms

Estimated time needed: **60** minutes

### Overview:

The primary objective of this assignment is to refine the data so that we can group observations into species and use their features to classify new observations.

In this lab, we will:

- **Classify and Compare:** Discuss and compare clustering algorithms such as K-Means and DBSCAN, and use them to cluster the data and find anomalies, outliers, or errors.

- **Implement and Visualize a decision tree:** Implement a decision tree model to classify the species, then visualize the fitted tree.

This lab is designed to deepen your understanding of machine learning concepts through practical application and comparison of different models and techniques.


### Learning Objectives:

In this lab, we aim to achieve the following objectives:

- **Explore and Compare Clustering algorithms:** Provide a high-level overview of various machine learning clustering algorithms including K-Means clustering and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). We will use these algorithms to cluster the data together and find anomalies, outliers or errors.


- **Implement and visualize a decision tree:** Fit a decision tree model to the data in the clustering_synthetic_dataset.csv dataset, use it to classify the species, and then visualize the tree.

These objectives are designed to enhance your understanding of key machine learning concepts and their practical application in data analysis.

### Introduction:
Let us first explore and compare two essential machine learning clustering algorithms: 
- K-Means
- DBSCAN

The focus will be on providing a high-level understanding of these algorithms, highlighting their strengths, weaknesses, and typical use cases.


As we progress, we will guide you through the following aspects of each algorithm:

- How it works: We'll briefly describe how each algorithm works.
- Advantages and disadvantages: We’ll evaluate each algorithm's strengths and weaknesses and when they matter.
- When to use: We'll discuss when to use each algorithm.

**K-Means:**
- **How it works**: K-Means is a widely used clustering algorithm that partitions data points into K clusters based on their similarity. The algorithm alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points, repeating until convergence is achieved.

- **Advantages**:
  - Easy to implement and interpret
  - Fast and efficient for large datasets
- **When to Use**: K-Means works well when the clusters are roughly spherical.
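
The assign-and-update loop described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-Means loop (hypothetical helper): returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return labels, centroids

# Toy data made up for illustration: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids.round(1))
```

In the lab itself, `sklearn.cluster.KMeans` does this work (with a more careful initialisation), so the sketch is only meant to make the iteration concrete.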

**DBSCAN (Density-Based Spatial Clustering of Applications with Noise):**

- **How it works**: The algorithm starts by randomly selecting an unvisited point and checking if it has enough neighbors within a specified radius. If the point has enough nearby neighbors, it is marked as part of a cluster. The algorithm then recursively checks if the neighbors also have enough neighbors within the radius, until all points in the cluster have been visited. Points that are not part of any cluster are marked as noise.

- **Advantages**: DBSCAN can find clusters of arbitrary shapes and sizes, unlike K-Means, which assumes spherical clusters. DBSCAN is also robust to noise and outliers because they are not assigned to any cluster. However, DBSCAN can be sensitive to the choice of distance metric and of parameters such as the radius and the minimum number of points required to form a cluster.

- **When to Use**: DBSCAN works well when clusters have arbitrary shapes and sizes.
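
The neighbor-counting and cluster-growing procedure described above might look roughly like the following NumPy sketch. It is an assumption-laden illustration rather than scikit-learn's code: `dbscan_sketch` is a hypothetical name, points are scanned in index order instead of randomly, and an explicit stack replaces recursion.

```python
import numpy as np

def dbscan_sketch(X, eps, min_samples):
    """Minimal DBSCAN (hypothetical helper): returns labels, with -1 marking noise."""
    n = len(X)
    # Pairwise distances; a point counts as its own neighbor.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)  # every point starts as noise
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue  # already clustered, or not dense enough to seed a cluster
        # Grow a new cluster outward from core point i.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    stack.append(q)
        cluster_id += 1
    return labels

# Toy data made up for illustration: two tight blobs and one far-away point.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.1, (30, 2)),
    rng.normal(3, 0.1, (30, 2)),
    [[10.0, 10.0]],
])
labels = dbscan_sketch(X, eps=0.5, min_samples=5)
print(np.unique(labels, return_counts=True))
```

The isolated point never gathers enough neighbors within `eps`, so it keeps the noise label -1, which is the property the lab relies on for anomaly detection.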


### Data Description:

The provided clustering_synthetic_dataset.csv contains two features for observations that can be grouped into different species. The task is to determine how many species are present and classify the data using clustering techniques, while handling anomalies and outliers to refine the dataset.

### Problem statement

**Implement clustering algorithms such as K-Means and DBSCAN to find anomalies, errors, and outliers in the clustering_synthetic_dataset.csv dataset. Visualize the clusters, remove the outliers, fit a decision tree model to the remaining data, and visualize the tree.**

Let’s start by loading the dataset and implementing these steps in Python.

In [1]:
import pandas as pd

# Load the dataset
file_path = 'clustering_synthetic_dataset.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
df.head()

|   | f1 | f2 |
|---|----|----|
| 0 | 0.494261 | 1.451067 |
| 1 | -1.428081 | -0.837064 |
| 2 | 0.338559 | 1.038759 |
| 3 | 0.119001 | -1.053976 |
| 4 | 1.122425 | 1.774937 |


The dataset contains 750 rows, each with two features: f1 (the X coordinate) and f2 (the Y coordinate).

**Task 1: Plot the data with a scatterplot. How many species are there in the dataset? (For the rest of this assignment, use that number as the number-of-clusters parameter in
methods such as KMeans.)**

In [2]:
import matplotlib.pyplot as plt
# Write your code here


<details><summary>Click here for the solution</summary>
 
```python
# The scatterplot shows three distinct groups, so there are likely 3 species in the dataset.
plt.scatter(df['f1'], df['f2'], c='blue')
plt.xlabel('f1')
plt.ylabel('f2')
plt.show()
```
 
</details>

**Explanation**: The scatterplot lets you estimate the number of species (clusters) visually: observations with similar feature values appear as distinct groups of points, and counting those groups gives a rough cluster count.

**Task 2: Find the rough feature ranges to classify these species correctly. It might be a good
idea to do this step visually from some data plots.**

In [6]:
# Display basic statistics
print(df.describe())

                 f1            f2
count  7.500000e+02  7.500000e+02
mean   5.217752e-15 -7.853274e-14
std    1.000667e+00  1.000667e+00
min   -2.274474e+00 -1.823801e+00
25%   -1.091894e+00 -7.775487e-01
50%    3.886712e-01 -4.095144e-01
75%    7.787849e-01  1.052538e+00
max    1.870438e+00  2.245794e+00


In [None]:
# Plotting histograms to understand feature ranges
# Write your code here
plt.figure(figsize=(12, 5))



<details><summary>Click here for the solution</summary>

```python
# Plotting histograms to understand feature ranges
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(df.iloc[:, 0], bins=30, color='skyblue')
plt.title('Distribution of Feature 1')
plt.xlabel('Feature 1')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(df.iloc[:, 1], bins=30, color='salmon')
plt.title('Distribution of Feature 2')
plt.xlabel('Feature 2')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
```
</details>


**Explanation**: The summary statistics and the histograms of Feature 1 and Feature 2 help identify rough feature ranges for the different species. The distribution of data points across these features gives insight into how the species are grouped.

However, **note** that this alone might not directly link the ranges to specific species. To refine your understanding of the species, you may need to complement this analysis with clustering techniques (e.g., KMeans or DBSCAN) to further identify which range belongs to which species.
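
Once cluster labels are available, linking ranges to species can be as simple as taking the per-cluster minimum and maximum of each feature. The sketch below illustrates that idea; the helper name `cluster_feature_ranges` and the four sample points are made up for this example and stand in for the lab's features and cluster labels.

```python
import numpy as np

def cluster_feature_ranges(X, labels):
    """Return {cluster_label: (per-feature minimum, per-feature maximum)}."""
    ranges = {}
    for lab in np.unique(labels):
        members = X[labels == lab]
        ranges[int(lab)] = (members.min(axis=0), members.max(axis=0))
    return ranges

# Made-up points and labels for illustration only.
X = np.array([[0.1, 1.2], [0.3, 1.0], [-1.4, -0.8], [-1.1, -0.9]])
labels = np.array([0, 0, 1, 1])

for lab, (lo, hi) in cluster_feature_ranges(X, labels).items():
    print(f"cluster {lab}: f1 in [{lo[0]}, {hi[0]}], f2 in [{lo[1]}, {hi[1]}]")
```

With the lab data, `X` would be `df[['f1', 'f2']].to_numpy()` and `labels` the output of a clustering step, with any noise label (-1) excluded first.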

> **Note**: In the next set of problems, we will remove the points that lie near the cluster boundaries. (These points might be due to errors or anomalies, or might simply be outliers.) This step refines the feature boundaries so that a scientist can classify the species manually and reliably, using rules that generalize well.

**Task 3: Use K-means clustering to find anomalies. (Hint: find cluster data points that are far
from the centroids.)**

In [10]:
# Import the required libraries.
from sklearn.cluster import KMeans
import numpy as np

In [None]:
# Write your code here
# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=0).fit(df)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Compute distances from centroids


# Define a threshold for anomaly detection


# Identify anomalies


# Plot results (optional)



<details><summary>Click here for the solution</summary>
 
```python
# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=0).fit(df)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Compute each point's Euclidean distance to its own cluster's centroid
distances = np.linalg.norm(df.to_numpy() - centroids[labels], axis=1)

# Define a threshold for anomaly detection
threshold = np.percentile(distances, 95)  # 95th percentile distance as threshold

# Identify anomalies
anomalies = distances > threshold

# Plot results: draw every point in blue, then overlay the anomalies in red
data_scaled = df.to_numpy()
plt.scatter(df['f1'], df['f2'], c='blue', label='Normal points')
plt.scatter(data_scaled[anomalies, 0], data_scaled[anomalies, 1], c='red', label='Anomalies')
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='x', s=100, label='Centroids')
plt.legend()
plt.title('K-means Clustering and Anomaly Detection')
plt.show()

```
 
</details>

**Explanation**: After applying K-means clustering, we calculate the Euclidean distance from each data point to its cluster's centroid. Points whose distance exceeds the 95th percentile of all distances (roughly the farthest 5% of points) are flagged as anomalies.
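
One design choice worth noting is that a single threshold over all distances tends to flag points mostly from the most spread-out cluster. A possible variant, sketched below under stated assumptions, computes the percentile separately within each cluster. The helper `per_cluster_anomalies` and the toy data are hypothetical and are not part of the lab's solution.

```python
import numpy as np

def per_cluster_anomalies(X, labels, centroids, pct=95):
    """Flag points beyond their own cluster's pct-th percentile centroid distance."""
    distances = np.linalg.norm(X - centroids[labels], axis=1)
    flags = np.zeros(len(X), dtype=bool)
    for lab in np.unique(labels):
        in_cluster = labels == lab
        cutoff = np.percentile(distances[in_cluster], pct)
        flags[in_cluster] = distances[in_cluster] > cutoff
    return flags

# Toy data made up for illustration: one tight cluster and one loose cluster.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(5, 1.0, (100, 2))])
labels = np.repeat([0, 1], 100)
centroids = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])

# Single global threshold, as in the solution above.
global_dist = np.linalg.norm(X - centroids[labels], axis=1)
global_flags = global_dist > np.percentile(global_dist, 95)
# Separate threshold for each cluster.
local_flags = per_cluster_anomalies(X, labels, centroids)

print("global threshold, flagged per cluster:",
      np.bincount(labels[global_flags], minlength=2))
print("per-cluster threshold, flagged per cluster:",
      np.bincount(labels[local_flags], minlength=2))
```

Whether the global or the per-cluster threshold is preferable depends on whether the species are expected to have similar spreads; either is a reasonable reading of the hint in Task 3.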

**Task 4: Use DBSCAN clustering to find anomalies. To be clear, look for anomalies with
DBSCAN in the full dataset; this is an alternative to Task 3's method.**

In [12]:
# Import the required libraries.
from sklearn.cluster import DBSCAN

In [None]:
# Write your code here
# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(df)
# Identify anomalies


# Plot results (optional)



<details><summary>Click here for the solution</summary>
 
```python
# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(df)
    
# Identify anomalies
anomalies = (labels == -1)
    
# Plot results
data_scaled = df.to_numpy()
plt.figure(figsize=(10, 7))
plt.scatter(df['f1'],df['f2'], c='blue', label='Normal points')
plt.scatter(data_scaled[anomalies, 0], data_scaled[anomalies, 1], c='red', label='Anomalies')
plt.title('DBSCAN Clustering and Anomaly Detection')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

```
 
</details>

**Explanation**: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) assigns the label -1 to points that do not belong to any dense region. These points are treated as anomalies.

**Task 5: Fit the Decision Tree**

Now, choose either the K-means results from Task 3 or the DBSCAN results from Task 4,
remove the points that the chosen method deemed anomalous, and train a decision tree
from the remaining data to classify the species. (You do not need to justify the choice; they
should both be reasonable options.) Visualize the model decision tree (but not just by
plotting lines on a scatterplot of the data). Hint: the result should look like Module 6’s
Jupyter Notebook’s cell.

In [14]:
# Import the required libraries.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.tree import plot_tree

In [None]:
# Create a new column of labels
df['label'] = dbscan.labels_
# Filter out anomalies
data_cleaned = df[df['label'] != -1]
# Separate features and target variable
X = data_cleaned.drop(columns=['label'])
y = data_cleaned['label']

# Split the data into training and test sets
# Write your code here

# Create and train the Decision Tree model
# Write your code here

# Make predictions
# Write your code here

# Evaluate the model
# Write your code here

# Plot the decision tree
plt.figure(figsize=(20, 10))
# Write your code here

<details><summary>Click here for the solution</summary>
 
```python

# Create a new column of labels
df['label'] = dbscan.labels_
# Filter out anomalies
data_cleaned = df[df['label'] != -1]
# Separate features and target variable
X = data_cleaned.drop(columns=['label'])
y = data_cleaned['label']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Make predictions
y_pred = dt_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(report)

# Plot the decision tree
plt.figure(figsize=(20, 10))
plot_tree(dt_model, feature_names=list(X.columns), filled=True, rounded=True)  # pass the column names as a plain list
plt.show()

```
 
</details>

**Explanation**: After removing the anomalies, we train a decision tree on the remaining data. The decision tree is visualized, showing how the model splits the data based on feature values.
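
Since the goal is to obtain feature boundaries that a scientist can apply by hand, it can also help to read the split thresholds as text. The sketch below uses scikit-learn's `export_text` on made-up, cleanly separable data; the toy data and the variable names are assumptions for illustration, and with the lab data you would pass the fitted `dt_model` instead.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up, cleanly separable data standing in for the cleaned lab dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.2, (40, 2)), rng.normal(2, 0.2, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Each line of the text dump is a threshold test on one feature or a leaf class.
rules = export_text(tree, feature_names=["f1", "f2"])
print(rules)
```

Each printed rule corresponds to one node in the plotted tree, so the two views can be checked against each other.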

### Summary:

In this assignment, we applied K-Means and DBSCAN clustering techniques to identify and remove anomalies in a synthetic dataset. After cleaning the data, we trained a decision tree to classify the species, evaluated it on a held-out test set, and visualized it. This exercise underscores the importance of handling anomalies for better model performance and generalization.