### Step 1: Data Overview
- In this step, we load the marketing data from the AI marketing platform.
- The dataset includes the following numeric features: `Clicks`, `Exposure`, `Budget`, `Conversion_Volume`, and `Search_Index`. Step 6 also relies on a categorical `Industry` column.
- Let’s begin by checking for missing values and understanding the data distribution.
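
As a rough illustration of the checks mentioned above, the sketch below counts missing values and summarizes the distribution of each column. The `toy` frame and its values are made up for the example and only borrow the column names from the dataset; the real data is loaded in the next cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real marketing data
toy = pd.DataFrame({
    "Clicks": [120, 95, np.nan, 400],
    "Exposure": [1000, 800, 1200, 5000],
    "Budget": [50.0, 40.0, 60.0, np.nan],
    "Conversion_Volume": [12, 9, 15, 30],
    "Search_Index": [0.8, 0.6, 0.9, 1.4],
})

# Count missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Summary statistics (count, mean, std, quartiles) for each numeric column
summary = toy.describe()
print(summary)

# Skewness hints at long tails that may call for outlier handling
skewness = toy.skew(numeric_only=True)
print(skewness)
```

A strongly positive skew in a column is one sign that a few large values may dominate later steps.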


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Step 1: Load the data
# Assume the data is in 'data/marketing_data.csv'
data = pd.read_csv('data/marketing_data.csv')

# Display the first few rows of the dataset
data.head()


### Step 2: Data Cleaning
- We remove rows with missing values and cap `Clicks` and `Conversion_Volume` at their 99th percentile so that a few extreme values do not skew the analysis.
- Next, we will standardize the features so they have the same scale for clustering.
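
To make the capping step concrete, here is a minimal sketch on made-up numbers (99 ordinary values plus one extreme one). It uses `Series.clip`, which is one way to apply a percentile cap; the values are hypothetical and chosen only to make the effect visible.

```python
import pandas as pd

# Hypothetical click counts: 1..99 plus a single extreme value
clicks = pd.Series(list(range(1, 100)) + [5000], name="Clicks")

# 99th-percentile threshold (pandas interpolates linearly by default)
upper_limit = clicks.quantile(0.99)

# Values above the threshold are replaced by it; all others are unchanged
clicks_capped = clicks.clip(upper=upper_limit)

print(f"upper limit: {upper_limit:.2f}")
print(f"max before: {clicks.max()}, max after: {clicks_capped.max():.2f}")
```

Only the single extreme row changes; every value at or below the threshold is left as it was.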


In [None]:
# Step 2: Data Cleaning
# Check for missing values
missing_values = data.isnull().sum()
print("Missing values:\n", missing_values)

# Remove rows with missing values (optional, depending on your case).
# .copy() gives an independent DataFrame, which avoids pandas'
# SettingWithCopyWarning when columns are modified below.
data_clean = data.dropna().copy()

# Cap outliers at the 99th percentile for key numerical features
for col in ['Clicks', 'Conversion_Volume']:
    upper_limit = data_clean[col].quantile(0.99)
    data_clean[col] = data_clean[col].clip(upper=upper_limit)

# Display summary statistics to check data distribution
data_clean.describe()


### Step 3: Data Standardization
- Standardization rescales features such as `Clicks`, `Exposure`, and `Budget` to zero mean and unit variance. KMeans relies on Euclidean distances, so without this step the features with the largest numeric ranges would dominate the clustering.
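
For reference, the z-score behind `StandardScaler` is $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the column mean and the population standard deviation. The sketch below checks this against a manual computation on a small made-up array; the two columns only stand in for features on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: column 0 ~ Clicks, column 1 ~ Budget (very different scales)
X = np.array([[100.0, 5000.0],
              [200.0, 7000.0],
              [300.0, 9000.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither one outweighs the other in a distance computation.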


In [None]:
# Step 3: Standardize the data
scaler = StandardScaler()
# Standardize the numerical features in place (this overwrites the original values)
data_clean[['Clicks', 'Conversion_Volume', 'Exposure', 'Budget', 'Search_Index']] = scaler.fit_transform(
    data_clean[['Clicks', 'Conversion_Volume', 'Exposure', 'Budget', 'Search_Index']]
)

# Check the first few rows of the standardized data
data_clean.head()


### Step 4: KMeans Clustering
- We apply KMeans clustering to group the ads into 3 clusters.
- The silhouette score measures how well the clusters are separated. It ranges from -1 to 1, and values closer to 1 indicate compact, well-separated clusters.
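
Since the choice of 3 clusters is a starting point rather than a given, one common way to compare alternatives is to compute the silhouette score for several values of `k`. The sketch below does this on synthetic data from `make_blobs`; the blob centers, the range of `k`, and the variable names are assumptions made for the example, not part of the marketing dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for standardized features: 4 well-separated blobs
# (centers are hypothetical, chosen only to make the structure obvious)
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=123).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

# Pick the k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

On real data the scores are usually much closer together, so the highest value is a hint to weigh against business interpretability, not a definitive answer.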


In [None]:
# Step 4: KMeans Clustering
feature_cols = ['Clicks', 'Conversion_Volume', 'Exposure', 'Budget', 'Search_Index']

# We will try clustering the data into 3 clusters (this can be changed).
# n_init is set explicitly because its default differs across scikit-learn versions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=123)
data_clean['Cluster'] = kmeans.fit_predict(data_clean[feature_cols])

# Display the cluster centers (standardized units, columns ordered as in feature_cols)
print("Cluster Centers:\n", kmeans.cluster_centers_)

# Evaluate the silhouette score for the clustering
silhouette_avg = silhouette_score(data_clean[feature_cols], data_clean['Cluster'])
print(f"Silhouette Score: {silhouette_avg:.3f}")


### Step 5: Scatter Plot Visualization
- The scatter plot shows `Clicks` against `Conversion_Volume`, colored by the assigned cluster. Both axes are in standardized units (z-scores) because the features were scaled in Step 3.


In [None]:
# Step 5: Visualize Clustering Results (Scatter Plot)
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Clicks', y='Conversion_Volume', hue='Cluster', data=data_clean, palette='viridis', s=100)
plt.title('KMeans Clustering of Marketing Data (Clicks vs. Conversion Volume)')
plt.xlabel('Clicks (standardized)')
plt.ylabel('Conversion Volume (standardized)')
plt.show()


### Step 6: Heatmap of Conversion Volume by Industry
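- The heatmap shows the average conversion volume by industry and cluster, which helps to show which clusters perform better across industries. Because `Conversion_Volume` was standardized in Step 3, the cell values are mean z-scores rather than raw conversion counts.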
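
The table behind such a heatmap can be built with a `groupby` followed by `unstack`. The sketch below shows the reshaping on a few made-up rows; the industry names and values are illustrative only.

```python
import pandas as pd

# Hypothetical rows; industry names and values are illustrative only
toy = pd.DataFrame({
    "Industry": ["Retail", "Retail", "Retail", "Finance", "Finance"],
    "Cluster": [0, 0, 1, 0, 1],
    "Conversion_Volume": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Mean per (Industry, Cluster) pair, then pivot the Cluster level into columns
pivot = toy.groupby(["Industry", "Cluster"])["Conversion_Volume"].mean().unstack()
print(pivot)
```

Each row of `pivot` is an industry and each column a cluster. An industry with no rows in a given cluster gets `NaN` there, which `sns.heatmap` leaves as an empty cell.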


In [None]:
# Step 6: Visualize Clustering by Industry with Heatmap
# Group data by Industry and Cluster to calculate average conversion volume
heatmap_data = data_clean.groupby(['Industry', 'Cluster'])['Conversion_Volume'].mean().unstack()

# Plot the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Heatmap of Conversion Volume by Industry and Cluster')
plt.xlabel('Cluster')
plt.ylabel('Industry')
plt.show()
