# DBSCAN Clustering Exercise: Mall Customer Segmentation

In this exercise, you will use DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to identify customer segments based on their shopping behavior. The analysis will help you understand the different customer groups and their characteristics.

## Dataset Description
The Mall Customer Segmentation dataset contains the following information about each customer. This notebook generates a synthetic 200-row stand-in for it, using the column names given in quotes:
- CustomerID: Unique identifier for each customer
- Gender: Customer's gender
- Age: Customer's age
- Annual Income (k$), column 'Annual_Income': Customer's annual income in thousands of dollars
- Spending Score (1-100), column 'Spending_Score': Score assigned by the mall based on customer behavior and spending nature

## Why DBSCAN?
DBSCAN is particularly useful for this dataset because:
1. It can identify clusters of varying shapes and sizes
2. It can detect outliers (unusual customer behavior)
3. It doesn't assume clusters are spherical
4. It doesn't require specifying the number of clusters beforehand
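
The first three points can be seen on a small toy example. The sketch below is not part of the exercise solution: it uses scikit-learn's make_moons generator instead of the customer data, appends two hand-placed outliers, and assumes eps=0.2 and min_samples=5, values that were chosen only because they suit this toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents: non-spherical clusters (toy data, not the mall dataset)
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Two isolated points appended by hand to play the role of unusual customers
outliers = np.array([[3.0, 3.0], [-3.0, -3.0]])
X = np.vstack([X, outliers])

# eps and min_samples are illustrative values for this toy data only
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN marks noise with the label -1, so it is excluded from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```

The number of clusters is never passed in; it follows from the density parameters, and the hand-placed points end up labelled as noise rather than being forced into a crescent.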

## Your Task
1. Load and explore the dataset
2. Preprocess the data
3. Implement DBSCAN clustering
4. Evaluate the clustering results
5. Visualize the customer segments
6. Analyze segment characteristics

Follow the steps below and fill in the code where indicated.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Your code here: Import required sklearn modules


# Create the customer dataset (seeded so every run produces the same rows)
np.random.seed(42)
data = {
    'CustomerID': range(1, 201),
    'Gender': np.random.choice(['Male', 'Female'], size=200),
    'Age': np.random.randint(18, 70, size=200),
    'Annual_Income': np.random.normal(60, 15, size=200),
    'Spending_Score': np.random.normal(50, 25, size=200)
}
df = pd.DataFrame(data)

# Ensure realistic values
df['Annual_Income'] = df['Annual_Income'].clip(20, 100)
df['Spending_Score'] = df['Spending_Score'].clip(1, 100)

# Display the first few rows and basic information about the dataset
# Your code here

## Data Preprocessing
1. Check for missing values
2. Convert categorical variables
3. Scale the features
4. Select relevant features for clustering

Your task:
- Examine the data for any missing values
- Handle the categorical 'Gender' variable
- Scale numerical features using StandardScaler
- Select appropriate features for clustering
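
As a rough illustration of these steps, here is a minimal sketch on four made-up rows. The values, the 'Gender_Code' column name, and the 0/1 mapping are assumptions for illustration, so adapt them to df rather than copying them. Scaling matters for DBSCAN because a single eps is applied to distances across all features.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows that reuse the exercise's column names
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Age': [25, 40, 33, 58],
    'Annual_Income': [30.0, 60.0, 75.0, 95.0],
    'Spending_Score': [80.0, 45.0, 20.0, 55.0],
})

# 1. Count missing values per column
missing_counts = toy.isnull().sum()

# 2. Encode the binary Gender column as 0/1 (the direction of the mapping is arbitrary)
toy['Gender_Code'] = toy['Gender'].map({'Male': 0, 'Female': 1})

# 3 and 4. Select the two numeric features and standardize them
#          to zero mean and unit variance
features = ['Annual_Income', 'Spending_Score']
X_scaled = StandardScaler().fit_transform(toy[features])

print(missing_counts.sum())
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```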

In [None]:
# Your code here
# 1. Check for missing values

# 2. Convert categorical variables

# 3. Scale the features 'Annual_Income' and 'Spending_Score'

# 4. Select features for clustering

## DBSCAN Implementation
1. Determine appropriate epsilon (eps) and min_samples parameters
2. Use NearestNeighbors to estimate a suitable eps value
3. Train the DBSCAN model
4. Analyze the clustering results

Your task:
- Plot the k-distance graph and read a candidate eps from its elbow
- Create and train the DBSCAN model
- Count the clusters and noise points (DBSCAN labels noise as -1)
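
To show how the k-distance values relate to the final labels, the sketch below runs the whole sequence on a hand-built toy set: two tight grids of points and one isolated point. The data, the name X_toy, and eps=0.5 are assumptions made for illustration. On the customer data the curve is usually smoother, and picking eps from it is a judgment call.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: two tight 3x3 grids of points plus one isolated point
offsets = np.array([[dx, dy] for dx in (-0.1, 0.0, 0.1) for dy in (-0.1, 0.0, 0.1)])
X_toy = np.vstack([offsets, offsets + [5.0, 5.0], [[10.0, 0.0]]])

k = 5  # reused as min_samples below

# Column 0 holds each point's zero distance to itself, so column k-1 is the
# radius that contains k points including the point itself.
nn = NearestNeighbors(n_neighbors=k).fit(X_toy)
distances, _ = nn.kneighbors(X_toy)
k_dist_sorted = np.sort(distances[:, k - 1])

# The sorted values stay small for the dense points and jump at the isolated one;
# eps is picked by eye inside that jump (0.5 is an illustrative choice).
eps = 0.5
labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X_toy)

# Summarize: cluster sizes and the share of points labelled as noise (-1)
cluster_ids, sizes = np.unique(labels[labels != -1], return_counts=True)
noise_share = float(np.mean(labels == -1))
print(dict(zip(cluster_ids.tolist(), sizes.tolist())), f"noise: {noise_share:.1%}")
```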

In [None]:
# Your code here
# 1. Find optimal eps using k-distance graph
from sklearn.neighbors import NearestNeighbors

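k = 5  # use the same value for DBSCAN's min_samples
neigh = NearestNeighbors(n_neighbors=k)
# X_cluster is a placeholder: replace it with the scaled feature matrix from the preprocessing step
neigh.fit(X_cluster)
distances, indices = neigh.kneighbors(X_cluster)

# Sort and plot k-distances.
# Each point is returned as its own nearest neighbor (distance 0, column 0), so
# column k-1 is the distance to the (k-1)-th other point. This matches
# min_samples=k, because DBSCAN counts the point itself.
distances_sorted = np.sort(distances[:, k-1])
plt.figure(figsize=(10, 6))
plt.plot(range(len(distances_sorted)), distances_sorted)
plt.xlabel('Points (sorted by distance)')
plt.ylabel(f'Distance to {k}-th nearest point (self included)')
plt.title('K-distance Graph')
plt.show()
# A sharp bend (elbow) in this curve is a common starting point for eps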

# 2. Train DBSCAN with optimal parameters

# 3. Analyze clustering results

## Visualization and Evaluation
1. Create scatter plots of customer segments
2. Analyze segment characteristics
3. Visualize feature distributions within segments
4. Compare with other clustering methods (optional)

Your task:
- Create visualizations of customer segments
- Summarize the characteristics of each segment (for example, its size and mean feature values)
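
For the segment-characteristics step, one possible approach is a per-cluster summary table. The sketch below uses made-up rows and a hypothetical 'Cluster' column standing in for the labels returned by DBSCAN. Noise points (label -1) are left out of the profile and reported as a percentage instead.

```python
import pandas as pd

# Hypothetical labelled rows; 'Cluster' stands in for DBSCAN's labels (-1 = noise)
seg = pd.DataFrame({
    'Annual_Income': [25.0, 30.0, 28.0, 80.0, 85.0, 90.0, 55.0],
    'Spending_Score': [75.0, 80.0, 85.0, 20.0, 15.0, 25.0, 99.0],
    'Cluster': [0, 0, 0, 1, 1, 1, -1],
})

# Size and mean feature values per segment, computed on non-noise rows only
clustered = seg[seg['Cluster'] != -1]
profile = clustered.groupby('Cluster').agg(
    size=('Annual_Income', 'size'),
    mean_income=('Annual_Income', 'mean'),
    mean_spending=('Spending_Score', 'mean'),
)

# Share of all rows that DBSCAN would have flagged as noise
noise_pct = 100 * (seg['Cluster'] == -1).mean()
print(profile)
print(f"noise: {noise_pct:.1f}%")
```

The same 'Cluster' column can be passed as the color argument of a scatter plot of income against spending score, which covers the visualization step.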

In [None]:
# Your code here
# 1. Create segment visualization

# 2. Analyze segment characteristics

## Conclusion and Interpretation
Summarize your findings and interpret the results. Consider the following questions:

1. What distinct customer segments did DBSCAN identify?
2. How many meaningful segments were found?
3. What percentage of customers were classified as outliers?
4. What are the characteristics of each segment?
5. How can these insights be used for marketing strategies?