# <center>Data Mining Project Code</center>

** **
## <center>*04 - Exploring Clustering Solutions*</center>

** **

After creating our clustering solutions, we decided to check how the customers were divided according to all the variables in our dataset.

In this notebook, we start by importing the full dataset together with the cluster labels. We then explore each variable by cluster label to look for business insights.



The team members are:
- Ana Farinha  - 20211514
- António Oliveira - 20211595
- Mariana Neto - 20211527
- Salvador Domingues - 20240597


# Table of Contents

<a class="anchor" id="top"></a>


1. [Importing Libraries & Data](#1.-Importing-Libraries-&-Data) <br><br>

2. [Clustering](#2.-Clustering) <br><br>


# 1. Importing Libraries & Data

In [73]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px


import visualizations_by_cluster as v
import functions as f

In [7]:
all_data = pd.read_csv('./data/data.csv', 
                   index_col = "customer_id")

In [18]:
data = pd.read_csv('./data/labels/cuisine_data.csv', 
                   index_col = "customer_id")

In [75]:
# Selecting the clustering solution column
# (the column name is spelled 'cusine_labels' in the labels file)
labels_col = 'cusine_labels'

In [21]:
merged_df = pd.merge(data[labels_col], all_data, on='customer_id')
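`pd.merge` defaults to an inner join, so any customer missing from either file is dropped without warning. The sketch below shows one way to check coverage before relying on `merged_df`. It uses small made-up frames in place of the real CSV files, and the names `check` and `match_counts` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for data.csv and cuisine_data.csv (all values are made up).
all_data = pd.DataFrame(
    {"customer_age": [25, 31, 40], "total_orders": [3, 1, 7]},
    index=pd.Index(["a1", "b2", "c3"], name="customer_id"),
)
labels = pd.DataFrame(
    {"cusine_labels": [0, 2]},
    index=pd.Index(["a1", "c3"], name="customer_id"),
)

# Outer merge on the index with an indicator column to expose unmatched rows.
# validate="one_to_one" raises MergeError if a customer_id is duplicated.
check = pd.merge(
    labels[["cusine_labels"]],
    all_data,
    left_index=True,
    right_index=True,
    how="outer",
    validate="one_to_one",
    indicator=True,
)

# Counts of rows found in both frames, only in labels, or only in all_data.
match_counts = check["_merge"].value_counts()
print(match_counts)
```

If every row falls under `both`, the inner merge used in this notebook loses no customers.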

In [70]:
for col in merged_df.columns:
    print(col, end=', ')

cusine_labels, customer_region, customer_age, vendor_count, product_count, is_chain, first_order, last_order, last_promo, payment_method, CUI_American, CUI_Asian, CUI_Beverages, CUI_Cafe, CUI_Chicken Dishes, CUI_Chinese, CUI_Desserts, CUI_Healthy, CUI_Indian, CUI_Italian, CUI_Japanese, CUI_Noodle Dishes, CUI_OTHER, CUI_Street Food / Snacks, CUI_Thai, DOW_0, DOW_1, DOW_2, DOW_3, DOW_4, DOW_5, DOW_6, HR_1, HR_2, HR_3, HR_4, HR_5, HR_6, HR_7, HR_8, HR_9, HR_10, HR_11, HR_12, HR_13, HR_14, HR_15, HR_16, HR_17, HR_18, HR_19, HR_20, HR_21, HR_22, HR_23, promo_DELIVERY, promo_DISCOUNT, promo_FREEBIE, promo_NO DISCOUNT, pay_CARD, pay_CASH, pay_DIGI, last_promo_enc, payment_method_enc, days_between, total_orders, avg_order_hour, total_spend, avg_spend_prod, is_repeat_customer, avg_prod_vendor, avg_orders_vendor, avg_prod_order, weekend_orders, weekday_orders, weekend_weekday_ratio, num_cuisines, average_spend_per_cuisine, CUI_American_ratio, CUI_Asian_ratio, CUI_Beverages_ratio, CUI_Cafe_ratio,

In [147]:
merged_df.groupby(labels_col)['CUI_Asian'].mean()

cusine_labels
0     1.902839
1    18.812087
2     7.730200
3    72.983790
4    40.692301
Name: CUI_Asian, dtype: float64
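Checking one variable at a time can hide how large each cluster is and how far it sits from the overall average. As a rough sketch on made-up data, several features can be profiled in one pass. The feature list and the names `profile` and `relative` are assumptions for illustration, not part of the project code.

```python
import pandas as pd

labels_col = "cusine_labels"
features = ["CUI_Asian", "total_spend"]

# Toy data standing in for merged_df (values are made up).
df = pd.DataFrame({
    labels_col: [0, 0, 1, 1, 1],
    "CUI_Asian": [0.0, 4.0, 20.0, 10.0, 30.0],
    "total_spend": [10.0, 30.0, 50.0, 40.0, 60.0],
})

# Mean of each feature per cluster, with the cluster size alongside.
profile = df.groupby(labels_col)[features].mean()
profile.insert(0, "n_customers", df.groupby(labels_col).size())

# Ratio of each cluster mean to the overall mean (1.0 = same as average).
relative = profile[features] / df[features].mean()

print(profile)
print(relative.round(2))
```

Values well above or below 1.0 in `relative` point to the features that distinguish a cluster.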

In [149]:
v.plot_boxplot_by_cluster(merged_df, labels_col, 'CUI_Asian_ratio')

In [93]:
v.plot_boxplot_by_cluster(merged_df, labels_col, 'CUI_Asian')

In [12]:
v.plot_boxplot_by_cluster(merged_df, labels_col, 'average_spend_per_cuisine')

In [34]:
v.plot_grouped_bar_chart(merged_df, labels_col, 'payment_method', labels_col)

In [71]:
v.plot_grouped_bar_chart(merged_df, labels_col, 'last_promo', labels_col)

In [14]:
v.plot_boxplot_by_cluster(merged_df, labels_col, 'last_order')

In [None]:
v.plot_customer_region_scatter(merged_df, labels_col)

In [60]:
# Calculate the count of customers for each unique combination of customer_region and label
merged_df['customer_count'] = merged_df.groupby(['customer_region', labels_col])['customer_region'].transform('count')

# Create a scatter plot using Plotly Express, varying point size by 'customer_count'
fig = px.scatter(
    merged_df,
    x='customer_region',  # 'customer_region' on the x-axis
    y=labels_col,  # Cluster labels on the y-axis
    color=labels_col,  # Color points based on cluster label
    size='customer_count',  # Vary size based on customer count in region and label combination
    labels={'customer_region': 'Customer Region', labels_col: 'Cluster Label', 'customer_count': 'Customer Count'},
    title='Scatter Plot of Customer Regions by Cluster Label with Customer Count Size'
)

# Treat customer_region as categorical so the regions are equally spaced
fig.update_xaxes(type='category')

# Show the plot
fig.show()
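The cell above attaches the count to every customer row, so Plotly draws one marker per customer and stacks identical markers on top of each other. A possible alternative, sketched here on made-up data, is to aggregate first so that each region and cluster combination becomes a single row. The region codes and the name `counts` are hypothetical.

```python
import pandas as pd

labels_col = "cusine_labels"

# Toy data standing in for merged_df (region codes are made up).
df = pd.DataFrame({
    "customer_region": ["A", "A", "B", "B", "B", "C"],
    labels_col: [0, 0, 0, 1, 1, 1],
})

# One row per (region, cluster) pair, with the number of customers in it.
counts = (
    df.groupby(["customer_region", labels_col])
      .size()
      .reset_index(name="customer_count")
)
print(counts)
```

A frame shaped like `counts` could be passed to `px.scatter` with the same `x`, `y`, `color`, and `size` arguments, and it leaves the original DataFrame without an extra column.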

In [31]:
# The dataset has hourly order columns HR_1 through HR_23
hour_columns = [f'HR_{i}' for i in range(1, 24)]

df_hour = merged_df[hour_columns + [labels_col]].groupby(labels_col).mean()

fig = px.imshow(df_hour, labels={'x': 'Hour', 'y': 'Cluster', 'color': 'Average Orders'},
                x=hour_columns, y=df_hour.index, color_continuous_scale='Cividis',
                title='Cluster Distribution Across Hours of the Day')
fig.show()
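Because the heatmap shows raw averages, clusters that order more overall tend to dominate the color scale. One option is to convert each cluster's row into shares that sum to 1, so the timing pattern can be compared regardless of volume. This is a minimal sketch on made-up data, and `means` and `shares` are illustrative names.

```python
import pandas as pd

labels_col = "cusine_labels"
hour_columns = ["HR_1", "HR_2", "HR_3"]  # shortened list for the example

# Toy data standing in for merged_df (values are made up).
df = pd.DataFrame({
    labels_col: [0, 0, 1, 1],
    "HR_1": [1, 1, 0, 0],
    "HR_2": [1, 1, 2, 4],
    "HR_3": [0, 0, 2, 4],
})

# Average orders per hour for each cluster.
means = df.groupby(labels_col)[hour_columns].mean()

# Divide each row by its own total so every cluster sums to 1.
shares = means.div(means.sum(axis=1), axis=0)
print(shares)
```

Passing `shares` to `px.imshow` in place of the raw means would put all clusters on a common 0 to 1 scale.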


In [32]:
dow_columns = [f'DOW_{i}' for i in range(0, 7)]

df_dow = merged_df[dow_columns + [labels_col]].groupby(labels_col).mean()

fig = px.imshow(df_dow, labels={'x': 'Day of Week', 'y': 'Cluster', 'color': 'Average Orders'},
                x=dow_columns, y=df_dow.index, color_continuous_scale='Cividis',
                title='Cluster Distribution Across Days of the Week')
fig.show()
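The busiest day for each cluster can also be read off programmatically instead of by eye. The sketch below does this with `idxmax` on made-up data. The name `peak_day` is illustrative, and no assumption is made about which weekday `DOW_0` represents.

```python
import pandas as pd

labels_col = "cusine_labels"
dow_columns = [f"DOW_{i}" for i in range(7)]

# Toy data standing in for merged_df (values are made up).
df = pd.DataFrame({
    labels_col: [0, 0, 1],
    "DOW_0": [1, 0, 4],
    "DOW_1": [0, 0, 0],
    "DOW_2": [0, 0, 0],
    "DOW_3": [0, 0, 0],
    "DOW_4": [0, 0, 0],
    "DOW_5": [3, 5, 0],
    "DOW_6": [0, 1, 0],
})

# Average orders per day for each cluster, then the column with the highest mean.
dow_means = df.groupby(labels_col)[dow_columns].mean()
peak_day = dow_means.idxmax(axis=1)
print(peak_day)
```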


In [33]:
v.plot_avg_hr_by_label(merged_df, labels_col)