<a href="https://colab.research.google.com/github/giakomorssi/Machine_Learning/blob/main/02_MarketSegmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import the Data

In [1]:
import pandas as pd

df = pd.read_csv('customer_segmentation_eda.csv')

In [2]:
# Parse every date column as datetime64 so date arithmetic works later
date_columns = [
    'purchase_date', 'approved_date', 'handled_by_logistic_date',
    'delivery_date', 'estimated_delivery_date', 'shipping_limit_date',
]
df[date_columns] = df[date_columns].apply(pd.to_datetime)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13718 entries, 0 to 13717
Data columns (total 25 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   order_id                       13718 non-null  object        
 1   customer_id                    13718 non-null  object        
 2   order_status                   13718 non-null  int64         
 3   purchase_date                  13718 non-null  datetime64[ns]
 4   approved_date                  13718 non-null  datetime64[ns]
 5   handled_by_logistic_date       13718 non-null  datetime64[ns]
 6   delivery_date                  13718 non-null  datetime64[ns]
 7   estimated_delivery_date        13718 non-null  datetime64[ns]
 8   payment_type                   13718 non-null  int64         
 9   payment_installments           13718 non-null  int64         
 10  payment_value                  13718 non-null  float64       
 11  customer_unique

# Recency value
Time since a customer’s last purchase.


**Calculate the Recency value:** 

For each customer, we take the `purchase_date` of their most recent order and subtract it from the latest `purchase_date` in the DataFrame, which serves as the reference date. The result, in whole days, is stored in a new column called "recency".

In [3]:
# Reference date: the latest purchase in the dataset
# (purchase_date was already converted to datetime in In [2])
current_date = df['purchase_date'].max()

# Most recent purchase per customer, then days elapsed up to the reference date
recency_df = df.groupby('customer_unique_id')['purchase_date'].max().reset_index()
recency_df['recency'] = (current_date - recency_df['purchase_date']).dt.days
recency_df = recency_df.drop(columns='purchase_date')

# merge recency_df with the original df using customer_unique_id
df = pd.merge(df, recency_df, on='customer_unique_id')

In [4]:
df['recency'].describe()

count    13718.000000
mean        73.242309
std         42.439815
min          0.000000
25%         35.000000
50%         74.000000
75%        110.000000
max        481.000000
Name: recency, dtype: float64
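
To make the recency calculation concrete, here is a minimal sketch on a hypothetical three-order table. The customer IDs and dates are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical toy data (not taken from the dataset)
toy = pd.DataFrame({
    'customer_unique_id': ['a', 'a', 'b'],
    'purchase_date': pd.to_datetime(['2018-01-01', '2018-01-10', '2018-01-04']),
})

# Reference date: the latest purchase in the table
reference_date = toy['purchase_date'].max()

# Last purchase per customer, then whole days elapsed up to the reference date
last_purchase = toy.groupby('customer_unique_id')['purchase_date'].max()
toy_recency = (reference_date - last_purchase).dt.days

print(toy_recency)
```

Customer `a` bought on the reference date and so gets a recency of 0, while customer `b` last bought six days earlier and gets 6. This is why the minimum in the summary above is 0.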

# Frequency value
Refers to the number of times a customer has made a purchase.


**Calculate the Frequency value:**

We group the rows by `customer_unique_id` and count the distinct `item_per_order` values for each customer. The result is stored in a new column called "frequency". Note that this counts distinct item counts rather than distinct orders, so it serves as a proxy for how often a customer purchases.

In [5]:
# Calculate frequency: number of distinct item_per_order values per customer
frequency_df = df.groupby('customer_unique_id')['item_per_order'].nunique().reset_index()
frequency_df.rename(columns={'item_per_order': 'frequency'}, inplace=True)

# merge frequency_df with the original df using customer_unique_id
df = pd.merge(df, frequency_df, on='customer_unique_id')

In [6]:
df['frequency'].describe()

count    13718.000000
mean         1.468071
std          1.168014
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         13.000000
Name: frequency, dtype: float64
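
The cell above counts distinct `item_per_order` values per customer. A common alternative definition of RFM frequency is the number of distinct orders per customer. The following is a minimal sketch of that alternative on hypothetical toy data, assuming `order_id` identifies one order; it is not the calculation that produced the summary above.

```python
import pandas as pd

# Hypothetical toy data: customer 'a' placed two orders of one item each
toy = pd.DataFrame({
    'customer_unique_id': ['a', 'a', 'b'],
    'order_id': ['o1', 'o2', 'o3'],
    'item_per_order': [1, 1, 2],
})

grouped = toy.groupby('customer_unique_id')

# Notebook's approach: distinct item_per_order values per customer
freq_items = grouped['item_per_order'].nunique()

# Alternative (assumption): distinct orders per customer
freq_orders = grouped['order_id'].nunique()

comparison = pd.DataFrame({'by_item_count': freq_items, 'by_order': freq_orders})
print(comparison)
```

In this toy table, customer `a` scores 1 under the first definition and 2 under the second, because both of their orders contain the same number of items. Which definition fits better depends on how the rows of the real dataset are structured.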

# Monetary value 
Refers to the total amount a customer has spent purchasing products.

**Calculate the Monetary value:**

We compute the total amount spent by each customer by summing `payment_value` over all rows that share a `customer_unique_id`. The result is stored in a new column called "monetary".

In [7]:
# Calculate monetary: total payment_value per customer
monetary_df = df.groupby('customer_unique_id')['payment_value'].sum().reset_index()
monetary_df.rename(columns={'payment_value': 'monetary'}, inplace=True)

# merge monetary_df with the original df using customer_unique_id
df = pd.merge(df, monetary_df, on='customer_unique_id')

In [8]:
df['monetary'].describe()

count    13718.000000
mean       403.794149
std       1196.035858
min         10.710000
25%         74.400000
50%        143.275000
75%        308.100000
max      29099.520000
Name: monetary, dtype: float64
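
As a design alternative to the aggregate-then-merge pattern used in this notebook, `groupby(...).transform` broadcasts a per-customer aggregate back onto every row in one step. The sketch below uses hypothetical toy data and computes the `monetary` column both ways for comparison.

```python
import pandas as pd

# Hypothetical toy data (not taken from the dataset)
toy = pd.DataFrame({
    'customer_unique_id': ['a', 'b', 'a'],
    'payment_value': [10.0, 5.0, 2.5],
})

# Pattern used in the notebook: aggregate per customer, then merge back
monetary_df = (
    toy.groupby('customer_unique_id')['payment_value']
    .sum()
    .reset_index()
    .rename(columns={'payment_value': 'monetary'})
)
merged = pd.merge(toy, monetary_df, on='customer_unique_id')

# Alternative: transform returns one value per original row, same index
toy['monetary'] = toy.groupby('customer_unique_id')['payment_value'].transform('sum')

print(merged)
print(toy)
```

Both approaches assign 12.5 to each row of customer `a` and 5.0 to customer `b`. The `transform` version avoids the intermediate DataFrame and keeps the original index, which may be convenient when several per-customer columns are added in sequence.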

# Export the data

In [9]:
df.to_csv('customer_segmentation_RFM.csv', index=False)
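
CSV files do not store column dtypes, so the date columns converted at the start of this notebook will need to be parsed again when the exported file is loaded. The following minimal sketch illustrates this with a hypothetical one-row table and an in-memory buffer standing in for the file.

```python
import io
import pandas as pd

# Hypothetical toy data (not taken from the dataset)
toy = pd.DataFrame({
    'customer_unique_id': ['a'],
    'purchase_date': pd.to_datetime(['2018-01-10']),
    'recency': [0],
})

# Write to an in-memory buffer instead of a file for this illustration
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain read: the date column comes back as text, not datetime
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates: the column is restored to datetime64
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['purchase_date'])

print(plain['purchase_date'].dtype, parsed['purchase_date'].dtype)
```

Passing `parse_dates` at load time is one way to avoid repeating the `pd.to_datetime` calls in a downstream notebook.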