# <center>Data Mining Project</center>

<center>
Master in Data Science and Advanced Analytics <br>
NOVA Information Management School
</center>

** **
## <center>*ABCDEats Inc*</center>

<center>
Group 19 <br>
Jan-Louis Schneider, 20240506  <br>
Marta Boavida, 20240519  <br>
Matilde Miguel, 20240549  <br>
Sofia Gomes, 20240848  <br>
</center>

** **

## <span style="color:salmon"> Notebook </span> 

In this notebook, we prepare the data using three techniques: Feature Encoding, Feature Transformation and Scaling.


## <span style="color:salmon"> Table of Contents </span>

<a class="anchor" id="top"></a>

1. [Import Libraries](#one-bullet) <br>

2. [Import Datasets](#two-bullet) <br>

3. [Feature Encoding](#three-bullet) <br>

4. [Feature Transformation](#four-bullet) <br>

5. [Scaling](#five-bullet) <br>

6. [Export Datasets](#six-bullet) <br> 


<a class="anchor" id="one-bullet"></a>
## <span style="color:salmon"> 1. Import Libraries </span> 

In [40]:
import pandas as pd 
import numpy as np
import scipy

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import chi2_contingency
import scipy.stats as stats
import warnings

from math import ceil
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.decomposition import PCA
from collections import Counter

import plotly.express as px

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)


<a class="anchor" id="two-bullet"></a>

## <span style="color:salmon"> 2. Import Dataset </span> 

<a href="#top">Top &#129033;</a>

In [41]:
df = pd.read_csv("../dataset/df_visualizations.csv")

In [42]:
columns_to_drop = df.columns[9:55]  # dropping these columns since they are already included in features created in New_Features
columns_to_drop

Index(['CUI_American', 'CUI_Asian', 'CUI_Beverages', 'CUI_Cafe',
       'CUI_Chicken Dishes', 'CUI_Chinese', 'CUI_Desserts', 'CUI_Healthy',
       'CUI_Indian', 'CUI_Italian', 'CUI_Japanese', 'CUI_Noodle Dishes',
       'CUI_OTHER', 'CUI_Street Food / Snacks', 'CUI_Thai', 'DOW_0', 'DOW_1',
       'DOW_2', 'DOW_3', 'DOW_4', 'DOW_5', 'DOW_6', 'HR_1', 'HR_2', 'HR_3',
       'HR_4', 'HR_5', 'HR_6', 'HR_7', 'HR_8', 'HR_9', 'HR_10', 'HR_11',
       'HR_12', 'HR_13', 'HR_14', 'HR_15', 'HR_16', 'HR_17', 'HR_18', 'HR_19',
       'HR_20', 'HR_21', 'HR_22', 'HR_23', 'lifetime_days'],
      dtype='object')

In [43]:
df = df.drop(columns=columns_to_drop)

In [44]:
df.head()

Unnamed: 0,customer_region,customer_age,vendor_count,product_count,is_chain,first_order,last_order,last_promo,payment_method,preferred_order_days,preferred_part_of_day,total_expenses,avg_per_product,avg_per_order,avg_order_size,culinary_variety,chain_preference,loyalty_to_venders,customer_age_group
0,2360.0,18,2,5.0,1,0,1,DELIVERY,DIGI,"['DOW_6', 'DOW_0']",['18h-00h'],28.88,5.776,14.44,2.5,0.06667,0.5,1.0,18-22
1,8670.0,17,2,2.0,2,0,1,DISCOUNT,DIGI,"['DOW_6', 'DOW_0']",['06h-12h'],19.21,9.605,9.605,1.0,0.13333,1.0,1.0,15-17
2,4660.0,38,1,2.0,2,0,1,DISCOUNT,CASH,"['DOW_6', 'DOW_0']",['06h-12h'],9.2,4.6,4.6,1.0,0.06667,1.0,0.5,36-49
3,4660.0,26,2,3.0,1,0,2,DELIVERY,DIGI,"['DOW_1', 'DOW_6']","['06h-12h', '12h-18h']",31.56,10.52,15.78,1.5,0.13333,0.5,1.0,23-28
4,4660.0,20,2,5.0,0,0,2,DELIVERY,DIGI,"['DOW_1', 'DOW_6']",['06h-12h'],55.44,11.088,27.72,2.5,0.13333,0.0,1.0,18-22


<a class="anchor" id="three-bullet"></a>

## <span style="color:salmon">3. Feature Encoding </span> 

<a href="#top">Top &#129033;</a>


Feature encoding transforms categorical (or textual) variables into numerical representations, since most machine learning algorithms cannot work directly with categories.

In [62]:
nonmetric_columns = ['customer_region', 'last_promo', 'payment_method', 'preferred_order_days', 'preferred_part_of_day']
# treat customer_region as nonmetric: its values are numbers, but they are region codes without a natural order

In [61]:
metric_columns = [col for col in df.columns if col not in nonmetric_columns]

In [52]:
columns_to_encode = ['last_promo', 'payment_method']

1. Frequency Encoding

Use the frequency of a category as its feature value.

In [53]:
df_freq_encoding = df.copy()

In [54]:
#Frequency Encoding
for col in columns_to_encode:
    freq_encoding = df_freq_encoding[col].value_counts(normalize=True)
    df_freq_encoding[col + '_freq_encoded'] = df_freq_encoding[col].map(freq_encoding)

df_freq_encoding[[col + '_freq_encoded' for col in columns_to_encode]].head()

Unnamed: 0,last_promo_freq_encoded,payment_method_freq_encoded
0,0.721075,0.191781
1,0.141488,0.191781
2,0.141488,0.176989
3,0.721075,0.191781
4,0.721075,0.191781
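
To make the arithmetic behind frequency encoding explicit, here is a minimal sketch that uses only the standard library. The toy list stands in for the `last_promo` column, and the `"LOYALTY"` category and the `0.0` fallback for unseen categories are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter

# Toy column standing in for `last_promo` (values are illustrative only)
last_promo = ["DELIVERY", "DISCOUNT", "DISCOUNT", "DELIVERY", "DELIVERY", "FREEBIE"]

# Relative frequency of each category, similar to value_counts(normalize=True)
counts = Counter(last_promo)
freq = {category: n / len(last_promo) for category, n in counts.items()}

# Replace each category by its relative frequency, similar to Series.map(freq)
encoded = [freq[value] for value in last_promo]

# A category that was never seen has no frequency; 0.0 is one possible fallback
# ("LOYALTY" is a hypothetical category used only for this example)
unseen = freq.get("LOYALTY", 0.0)

print(freq)
print(encoded)
```

One property worth keeping in mind: two categories that happen to occur equally often receive the same encoded value and can no longer be told apart.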


2. One-hot encoding

Convert categorical variables into binary columns.

In [55]:
df_ohencoding = df.copy()

In [56]:
for col in columns_to_encode:
    df_ohencoding[col] = df_ohencoding[col].astype(str)

ohc = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid collinearity 
ohc_encoded = ohc.fit_transform(df_ohencoding[columns_to_encode])


ohc_feat_names = ohc.get_feature_names_out(columns_to_encode)


ohc_df = pd.DataFrame(ohc_encoded, index=df_ohencoding.index, columns=ohc_feat_names)


df_ohencoding = pd.concat([df_ohencoding.drop(columns=columns_to_encode), ohc_df], axis=1)

df_ohencoding.head()

Unnamed: 0,customer_region,customer_age,vendor_count,product_count,is_chain,first_order,last_order,preferred_order_days,preferred_part_of_day,total_expenses,avg_per_product,avg_per_order,avg_order_size,culinary_variety,chain_preference,loyalty_to_venders,customer_age_group,last_promo_DISCOUNT,last_promo_FREEBIE,payment_method_CASH,payment_method_DIGI
0,2360.0,18,2,5.0,1,0,1,"['DOW_6', 'DOW_0']",['18h-00h'],28.88,5.776,14.44,2.5,0.06667,0.5,1.0,18-22,0.0,0.0,0.0,1.0
1,8670.0,17,2,2.0,2,0,1,"['DOW_6', 'DOW_0']",['06h-12h'],19.21,9.605,9.605,1.0,0.13333,1.0,1.0,15-17,1.0,0.0,0.0,1.0
2,4660.0,38,1,2.0,2,0,1,"['DOW_6', 'DOW_0']",['06h-12h'],9.2,4.6,4.6,1.0,0.06667,1.0,0.5,36-49,1.0,0.0,1.0,0.0
3,4660.0,26,2,3.0,1,0,2,"['DOW_1', 'DOW_6']","['06h-12h', '12h-18h']",31.56,10.52,15.78,1.5,0.13333,0.5,1.0,23-28,0.0,0.0,0.0,1.0
4,4660.0,20,2,5.0,0,0,2,"['DOW_1', 'DOW_6']",['06h-12h'],55.44,11.088,27.72,2.5,0.13333,0.0,1.0,18-22,0.0,0.0,0.0,1.0
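
The effect of `drop='first'` can be sketched without scikit-learn. The helper `one_hot_drop_first` below is a hypothetical name written for this illustration; it sorts the categories, drops the first one, and encodes the rest as binary columns, which is how the dropped category ends up represented by a row of zeros.

```python
def one_hot_drop_first(values):
    """One-hot encode a list of categories, dropping the first sorted category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the first category becomes the all-zero baseline
    rows = [[1.0 if value == cat else 0.0 for cat in kept] for value in values]
    return kept, rows


# Toy column standing in for `last_promo`
last_promo = ["DELIVERY", "DISCOUNT", "DISCOUNT", "DELIVERY", "FREEBIE"]
kept, rows = one_hot_drop_first(last_promo)

print(kept)
for value, row in zip(last_promo, rows):
    print(value, row)
```

With three categories, two binary columns are enough: a row with zeros in both columns can only be the dropped category, so no information is lost and the columns are no longer perfectly collinear.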


The encoded features stay in the list `nonmetric_columns` because they will not be used for clustering and scaling.

In [57]:
df = df_ohencoding.copy()   # keep the one-hot-encoded version as the working dataset

<a class="anchor" id="four-bullet"></a>

## <span style="color:salmon"> 4. Feature Transformation </span> 

<a href="#top">Top &#129033;</a>


Feature transformation is the process of modifying variables in a dataset to improve the performance of machine learning models, facilitate analysis, or meet the requirements of certain algorithms.

This involves changing the scale, distribution or shape of variables.


In [58]:
df_log = df.copy()

1. Logarithmic Transformation:

In [68]:
for col in metric_columns:
    df_log[col] = pd.to_numeric(df_log[col], errors='coerce')  # convert to numeric; non-numeric values become NaN

# Replace NaN with a default value where needed (here: 0)
df_log[metric_columns] = df_log[metric_columns].fillna(0)

# Apply the logarithmic transformation log(1 + x)
for col in metric_columns:
    df_log[col] = np.log1p(df_log[col])

# Show the first rows
df_log.head()

Unnamed: 0,customer_region,customer_age,vendor_count,product_count,is_chain,first_order,last_order,preferred_order_days,preferred_part_of_day,total_expenses,avg_per_product,avg_per_order,avg_order_size,culinary_variety,chain_preference,loyalty_to_venders,customer_age_group,last_promo_DISCOUNT,last_promo_FREEBIE,payment_method_CASH,payment_method_DIGI
0,2360.0,1.372307,0.554618,0.706395,0.423036,0.0,0.423036,"['DOW_6', 'DOW_0']",['18h-00h'],0.908648,0.727218,0.840822,0.594518,0.060666,0.292944,0.423036,0.0,0.0,0.0,0.0,0.693147
1,8670.0,1.358505,0.554618,0.554618,0.554618,0.0,0.423036,"['DOW_6', 'DOW_0']",['06h-12h'],0.870388,0.794049,0.794049,0.423036,0.111475,0.423036,0.423036,0.0,0.693147,0.0,0.0,0.693147
2,4660.0,1.539779,0.423036,0.554618,0.554618,0.0,0.423036,"['DOW_6', 'DOW_0']",['06h-12h'],0.788768,0.693971,0.693971,0.423036,0.060666,0.423036,0.292944,0.0,0.693147,0.0,0.693147,0.0
3,4660.0,1.457646,0.554618,0.6258,0.423036,0.0,0.554618,"['DOW_1', 'DOW_6']","['06h-12h', '12h-18h']",0.916415,0.804983,0.850279,0.501012,0.111475,0.292944,0.423036,0.0,0.0,0.0,0.0,0.693147
4,4660.0,1.397363,0.554618,0.706395,0.0,0.0,0.554618,"['DOW_1', 'DOW_6']",['06h-12h'],0.961666,0.811168,0.904995,0.594518,0.111475,0.0,0.423036,0.0,0.0,0.0,0.0,0.693147


In [69]:
df_log.describe().round(2)

Unnamed: 0,customer_region,customer_age,vendor_count,product_count,is_chain,first_order,last_order,total_expenses,avg_per_product,avg_per_order,avg_order_size,culinary_variety,chain_preference,loyalty_to_venders,customer_age_group,last_promo_DISCOUNT,last_promo_FREEBIE,payment_method_CASH,payment_method_DIGI
count,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0
mean,5205.87,1.46,0.57,0.64,0.46,0.81,0.96,0.88,0.74,0.76,0.46,0.12,0.29,0.38,0.0,0.1,0.1,0.12,0.13
std,2608.6,0.05,0.12,0.13,0.25,0.22,0.08,0.1,0.08,0.09,0.05,0.06,0.16,0.06,0.0,0.24,0.24,0.26,0.27
min,2360.0,1.33,0.42,0.42,0.0,0.0,0.0,0.24,0.24,0.24,0.42,0.06,0.0,0.09,0.0,0.0,0.0,0.0,0.0
25%,2360.0,1.43,0.42,0.55,0.42,0.77,0.95,0.83,0.68,0.71,0.42,0.06,0.23,0.35,0.0,0.0,0.0,0.0,0.0
50%,4660.0,1.46,0.55,0.63,0.55,0.88,0.98,0.89,0.75,0.77,0.45,0.11,0.36,0.42,0.0,0.0,0.0,0.0,0.0
75%,8670.0,1.5,0.67,0.73,0.63,0.95,0.99,0.94,0.8,0.83,0.5,0.15,0.42,0.42,0.0,0.0,0.0,0.0,0.0
max,8670.0,1.66,0.9,0.96,0.92,1.0,1.0,1.07,0.88,0.98,0.67,0.35,0.42,0.42,0.0,0.69,0.69,0.69,0.69
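
As a small illustration of what `log1p` does, the sketch below applies it to a handful of values with the standard library's `math` module. The list is a toy stand-in for `total_expenses`, and the value `1000.0` is a made-up outlier added to show the compression of large values.

```python
import math

# Toy values standing in for `total_expenses`; 1000.0 is a made-up outlier
total_expenses = [0.0, 9.2, 28.88, 55.44, 1000.0]

# log1p(x) = log(1 + x), which is defined at x = 0 and compresses large values
logged = [math.log1p(x) for x in total_expenses]

# expm1 is the inverse transform and recovers the original values
restored = [math.expm1(y) for y in logged]

# Applying the transform a second time changes the values again
logged_twice = [math.log1p(y) for y in logged]

print([round(y, 4) for y in logged])
print([round(y, 4) for y in logged_twice])
```

Since the transform is not idempotent, a cell that overwrites its columns in place gives different results if it is executed more than once on the same DataFrame. Re-creating the copy before each run is one way to avoid that.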


In [70]:
df = df_log.copy()

<a class="anchor" id="five-bullet"></a>

## <span style="color:salmon"> 5. Scaling</span> 

<a href="#top">Top &#129033;</a>

Scaling standardizes or normalizes numerical variables to improve the performance of scale-sensitive algorithms.


1. Min-Max Scaling


Goal: Rescale features to [0, 1]
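
The rescaling itself is the linear map `(x - min) / (max - min)`. A minimal sketch with plain Python follows; `min_max_scale` is a hypothetical helper written for this example, and returning zeros for a constant column is an assumption about how to handle the zero-range case.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0.0 (assumed convention)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy values standing in for `customer_age`
customer_age = [18, 17, 38, 26, 20]
scaled = min_max_scale(customer_age)

print([round(s, 3) for s in scaled])
```

Because the map depends only on the minimum and maximum, a single extreme value stretches the range and squeezes all other observations towards one end of the interval.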

In [71]:
df_minmax = df.copy()

To rescale the features, we use `MinMaxScaler`:

In [72]:
scaler_minmax = MinMaxScaler()
df_minmax[metric_columns] = scaler_minmax.fit_transform(df_minmax[metric_columns])
df_minmax.head()

Unnamed: 0,customer_region,customer_age,vendor_count,product_count,is_chain,first_order,last_order,preferred_order_days,preferred_part_of_day,total_expenses,avg_per_product,avg_per_order,avg_order_size,culinary_variety,chain_preference,loyalty_to_venders,customer_age_group,last_promo_DISCOUNT,last_promo_FREEBIE,payment_method_CASH,payment_method_DIGI
0,2360.0,0.133818,0.274363,0.528005,0.458431,0.0,0.424847,"['DOW_6', 'DOW_0']",['18h-00h'],0.804681,0.756698,0.813402,0.687394,0.0,0.692481,1.0,0.0,0.0,0.0,0.0,1.0
1,8670.0,0.092356,0.274363,0.245188,0.601023,0.0,0.424847,"['DOW_6', 'DOW_0']",['06h-12h'],0.758505,0.860903,0.749876,0.0,0.178396,1.0,1.0,0.0,1.0,0.0,0.0,1.0
2,4660.0,0.636922,0.0,0.245188,0.601023,0.0,0.424847,"['DOW_6', 'DOW_0']",['06h-12h'],0.659997,0.704858,0.613955,0.0,0.0,1.0,0.612716,0.0,1.0,0.0,1.0,0.0
3,4660.0,0.390187,0.274363,0.377826,0.458431,0.0,0.556992,"['DOW_1', 'DOW_6']","['06h-12h', '12h-18h']",0.814055,0.877952,0.826245,0.312573,0.178396,0.692481,1.0,0.0,0.0,0.0,0.0,1.0
4,4660.0,0.209091,0.274363,0.528005,0.0,0.0,0.556992,"['DOW_1', 'DOW_6']",['06h-12h'],0.868669,0.887597,0.900558,0.687394,0.178396,0.0,1.0,0.0,0.0,0.0,0.0,1.0


In [73]:
df_minmax[metric_columns].describe().round(3)

Unnamed: 0,customer_age,vendor_count,product_count,is_chain,first_order,last_order,total_expenses,avg_per_product,avg_per_order,avg_order_size,culinary_variety,chain_preference,loyalty_to_venders,customer_age_group,last_promo_DISCOUNT,last_promo_FREEBIE,payment_method_CASH,payment_method_DIGI
count,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0
mean,0.404,0.315,0.41,0.501,0.816,0.959,0.767,0.772,0.708,0.167,0.196,0.691,0.88,0.0,0.141,0.137,0.177,0.192
std,0.154,0.244,0.243,0.274,0.22,0.081,0.116,0.129,0.125,0.197,0.199,0.377,0.188,0.0,0.349,0.344,0.382,0.394
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.307,0.0,0.245,0.458,0.774,0.956,0.707,0.679,0.633,0.0,0.0,0.533,0.769,0.0,0.0,0.0,0.0,0.0
50%,0.39,0.274,0.378,0.601,0.887,0.983,0.784,0.796,0.723,0.119,0.178,0.855,1.0,0.0,0.0,0.0,0.0,0.0
75%,0.507,0.52,0.577,0.678,0.95,0.995,0.846,0.871,0.796,0.313,0.331,1.0,1.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0


2. Standard Scaling

Goal: Scale features to have a mean of 0 and a standard deviation of 1
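
Standard scaling computes `(x - mean) / std` per feature. The sketch below reproduces that with the standard library; `standard_scale` is a hypothetical helper, and it uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import statistics


def standard_scale(values):
    """Center values on their mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof = 0)
    if std == 0:
        # A constant column cannot be rescaled; return zeros (assumed convention)
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy values standing in for `avg_per_order`
avg_per_order = [14.44, 9.605, 4.6, 15.78, 27.72]
z = standard_scale(avg_per_order)

print([round(v, 3) for v in z])
```

Unlike min-max scaling, the result is not bounded to a fixed interval, so outliers remain visible as large positive or negative values.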

In [74]:
df_sd = df.copy()

In [75]:
scaler_standard = StandardScaler()
df_sd[metric_columns] = scaler_standard.fit_transform(df_sd[metric_columns])
df_sd.head()

Unnamed: 0,customer_region,customer_age,vendor_count,product_count,is_chain,first_order,last_order,preferred_order_days,preferred_part_of_day,total_expenses,avg_per_product,avg_per_order,avg_order_size,culinary_variety,chain_preference,loyalty_to_venders,customer_age_group,last_promo_DISCOUNT,last_promo_FREEBIE,payment_method_CASH,payment_method_DIGI
0,2360.0,-1.760971,-0.164909,0.486877,-0.154926,-3.706455,-6.565418,"['DOW_6', 'DOW_0']",['18h-00h'],0.325846,-0.115258,0.83994,2.633865,-0.983418,0.003333,0.639862,0.0,-0.405963,-0.399168,-0.463735,2.052873
1,8670.0,-2.03102,-0.164909,-0.677214,0.365379,-3.706455,-6.565418,"['DOW_6', 'DOW_0']",['06h-12h'],-0.07345,0.690477,0.332738,-0.847343,-0.08843,0.819189,0.639862,0.0,2.463276,-0.399168,-0.463735,2.052873
2,4660.0,1.515788,-1.288398,-0.677214,0.365379,-3.706455,-6.565418,"['DOW_6', 'DOW_0']",['06h-12h'],-0.925276,-0.516099,-0.752476,-0.847343,-0.983418,0.819189,-1.418201,0.0,2.463276,-0.399168,2.156403,-0.487122
3,4660.0,-0.091221,-0.164909,-0.131266,-0.154926,-3.706455,-4.942367,"['DOW_1', 'DOW_6']","['06h-12h', '12h-18h']",0.406911,0.822304,0.942483,0.735636,-0.08843,0.003333,0.639862,0.0,-0.405963,-0.399168,-0.463735,2.052873
4,4660.0,-1.270712,-0.164909,0.486877,-1.827696,-3.706455,-4.942367,"['DOW_1', 'DOW_6']",['06h-12h'],0.879171,0.896878,1.535811,2.633865,-0.08843,-1.833839,0.639862,0.0,-0.405963,-0.399168,-0.463735,2.052873


In [76]:
df_sd[metric_columns].describe(include='all').round(2)

Unnamed: 0,customer_age,vendor_count,product_count,is_chain,first_order,last_order,total_expenses,avg_per_product,avg_per_order,avg_order_size,culinary_variety,chain_preference,loyalty_to_venders,customer_age_group,last_promo_DISCOUNT,last_promo_FREEBIE,payment_method_CASH,payment_method_DIGI
count,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0,31098.0
mean,-0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,0.0,-0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
min,-2.63,-1.29,-1.69,-1.83,-3.71,-11.78,-6.63,-5.97,-5.65,-0.85,-0.98,-1.83,-4.67,0.0,-0.41,-0.4,-0.46,-0.49
25%,-0.64,-1.29,-0.68,-0.15,-0.19,-0.04,-0.52,-0.71,-0.6,-0.85,-0.98,-0.42,-0.59,0.0,-0.41,-0.4,-0.46,-0.49
50%,-0.09,-0.16,-0.13,0.37,0.33,0.29,0.15,0.19,0.12,-0.24,-0.09,0.44,0.64,0.0,-0.41,-0.4,-0.46,-0.49
75%,0.67,0.84,0.69,0.65,0.61,0.43,0.68,0.77,0.7,0.74,0.68,0.82,0.64,0.0,-0.41,-0.4,-0.46,-0.49
max,3.88,2.81,2.43,1.82,0.84,0.5,2.01,1.77,2.33,4.22,4.03,0.82,0.64,0.0,2.46,2.51,2.16,2.05


Because each feature is scaled with its own mean and standard deviation, logical relationships between features can be broken (for example, first_order > last_order after scaling).
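
A small toy example shows how this happens. The numbers below are made up; in the raw data every row satisfies `first_order <= last_order`, yet after each column is standardized separately some rows violate it.

```python
import statistics


def standardize(column):
    """Standardize one column with its own mean and population std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]


# Toy data: every customer satisfies first_order <= last_order
first_order = [0, 0, 5, 10, 20]
last_order = [60, 70, 80, 85, 90]

z_first = standardize(first_order)
z_last = standardize(last_order)

# Count rows where the scaled first_order exceeds the scaled last_order
violations = sum(f > l for f, l in zip(z_first, z_last))
print(violations)
```

Both scaled columns are centered on zero, so a customer whose first order is late relative to other first orders, but whose last order is only average, ends up with a larger scaled `first_order` than `last_order`.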

In [77]:
df_sd.loc[df_sd['first_order'] > df_sd['last_order']].count()   

customer_region          19196
customer_age             19196
vendor_count             19196
product_count            19196
is_chain                 19196
first_order              19196
last_order               19196
preferred_order_days     19196
preferred_part_of_day    19196
total_expenses           19196
avg_per_product          19196
avg_per_order            19196
avg_order_size           19196
culinary_variety         19196
chain_preference         19196
loyalty_to_venders       19196
customer_age_group       19196
last_promo_DISCOUNT      19196
last_promo_FREEBIE       19196
payment_method_CASH      19196
payment_method_DIGI      19196
dtype: int64

In [78]:
df_sd.loc[df_sd['vendor_count'] > df_sd['product_count']].count()
# the same happens for vendor_count and product_count

customer_region          17330
customer_age             17330
vendor_count             17330
product_count            17330
is_chain                 17330
first_order              17330
last_order               17330
preferred_order_days     17330
preferred_part_of_day    17330
total_expenses           17330
avg_per_product          17330
avg_per_order            17330
avg_order_size           17330
culinary_variety         17330
chain_preference         17330
loyalty_to_venders       17330
customer_age_group       17330
last_promo_DISCOUNT      17330
last_promo_FREEBIE       17330
payment_method_CASH      17330
payment_method_DIGI      17330
dtype: int64

In [79]:
df_sd.loc[df_sd['is_chain'] > df_sd['product_count']].count()
# and for is_chain and product_count

customer_region          15436
customer_age             15436
vendor_count             15436
product_count            15436
is_chain                 15436
first_order              15436
last_order               15436
preferred_order_days     15436
preferred_part_of_day    15436
total_expenses           15436
avg_per_product          15436
avg_per_order            15436
avg_order_size           15436
culinary_variety         15436
chain_preference         15436
loyalty_to_venders       15436
customer_age_group       15436
last_promo_DISCOUNT      15436
last_promo_FREEBIE       15436
payment_method_CASH      15436
payment_method_DIGI      15436
dtype: int64

**Solution:** Scale these features together
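
The idea can be sketched in plain Python: pool all values of the related columns, compute one mean and one standard deviation, and apply them to every column. The helper names `fit_pooled` and `transform_pooled` and the toy numbers are assumptions made for this illustration.

```python
import statistics


def fit_pooled(rows):
    """Return the mean and population std over all values of all columns."""
    flat = [value for row in rows for value in row]  # flatten row by row
    return statistics.fmean(flat), statistics.pstdev(flat)


def transform_pooled(rows, mean, std):
    """Apply one shared (mean, std) to every column, keeping the row layout."""
    return [[(value - mean) / std for value in row] for row in rows]


# Toy (first_order, last_order) pairs
orders = [(0, 60), (0, 70), (5, 80), (10, 85), (20, 90)]
mean, std = fit_pooled(orders)
scaled = transform_pooled(orders, mean, std)

# New data reuses the stored parameters instead of refitting
new_scaled = transform_pooled([(3, 45)], mean, std)

print(round(mean, 2), round(std, 2))
print(all(f <= l for f, l in scaled))
```

Since every column in the group goes through the same increasing linear map, any ordering between the raw values is preserved. The trade-off is that the individual columns no longer have mean 0 and standard deviation 1; only the pooled values do.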

In [80]:
import joblib
df_sd_2 = df.copy()

# features that are scaled independently (one mean/std per column):
metric_features_scale = [col for col in df_sd_2[metric_columns].columns if col not in ['first_order', 'last_order', 'is_chain', 'vendor_count', 'product_count']]
# two groups of features, each scaled with a single shared mean/std:
group1 = ['first_order', 'last_order']
group2 = ['is_chain', 'vendor_count', 'product_count']


# standard scaling for the independent features
scaler = StandardScaler()
scaler.fit(df_sd_2[metric_features_scale])
joblib.dump(scaler, "scalerGroupBasic.pkl")  # export the scaler to reuse it later on new data (in the interface)
df_sd_2[metric_features_scale] = scaler.transform(df_sd_2[metric_features_scale])

# scale the features of the first group together: stack all values into one column
scaler_group1 = StandardScaler()
group1_values = df_sd_2[group1].values.flatten().reshape(-1, 1)
scaler_group1.fit(group1_values)
joblib.dump(scaler_group1, "scalerGroup1.pkl")    # export the scaler to reuse it later on new data (in the interface)
scaled_group1 = scaler_group1.transform(group1_values)

# scale the features of the second group together: stack all values into one column
scaler_group2 = StandardScaler()
group2_values = df_sd_2[group2].values.flatten().reshape(-1, 1)
scaler_group2.fit(group2_values)
joblib.dump(scaler_group2, "scalerGroup2.pkl")  # export the scaler to reuse it later on new data (in the interface)
scaled_group2 = scaler_group2.transform(group2_values)

# reshape the values back to one column per feature
df_sd_2[group1] = scaled_group1.reshape(-1, len(group1))
df_sd_2[group2] = scaled_group2.reshape(-1, len(group2))


print(df_sd_2[metric_columns].head(1))

   customer_age  vendor_count  product_count  is_chain  first_order  \
0     -1.760971     -0.026319       0.761656 -0.709452    -4.908274   

   last_order  total_expenses  avg_per_product  avg_per_order  avg_order_size  \
0   -2.558657        0.325846        -0.115258        0.83994        2.633865   

   culinary_variety  chain_preference  loyalty_to_venders  customer_age_group  \
0         -0.983418          0.003333            0.639862                 0.0   

   last_promo_DISCOUNT  last_promo_FREEBIE  payment_method_CASH  \
0            -0.405963           -0.399168            -0.463735   

   payment_method_DIGI  
0             2.052873  


Let's check if it worked:

In [81]:
df_sd_2.loc[df_sd_2['first_order'] > df_sd_2['last_order']].count()   

customer_region          0
customer_age             0
vendor_count             0
product_count            0
is_chain                 0
first_order              0
last_order               0
preferred_order_days     0
preferred_part_of_day    0
total_expenses           0
avg_per_product          0
avg_per_order            0
avg_order_size           0
culinary_variety         0
chain_preference         0
loyalty_to_venders       0
customer_age_group       0
last_promo_DISCOUNT      0
last_promo_FREEBIE       0
payment_method_CASH      0
payment_method_DIGI      0
dtype: int64

In [82]:
df_sd_2.loc[df_sd_2['vendor_count'] > df_sd_2['product_count']].count()

customer_region          0
customer_age             0
vendor_count             0
product_count            0
is_chain                 0
first_order              0
last_order               0
preferred_order_days     0
preferred_part_of_day    0
total_expenses           0
avg_per_product          0
avg_per_order            0
avg_order_size           0
culinary_variety         0
chain_preference         0
loyalty_to_venders       0
customer_age_group       0
last_promo_DISCOUNT      0
last_promo_FREEBIE       0
payment_method_CASH      0
payment_method_DIGI      0
dtype: int64

In [83]:
df_sd_2.loc[df_sd_2['is_chain'] > df_sd_2['product_count']].count()
#those 75 already existed in the original data, probably wrong entries

customer_region          0
customer_age             0
vendor_count             0
product_count            0
is_chain                 0
first_order              0
last_order               0
preferred_order_days     0
preferred_part_of_day    0
total_expenses           0
avg_per_product          0
avg_per_order            0
avg_order_size           0
culinary_variety         0
chain_preference         0
loyalty_to_venders       0
customer_age_group       0
last_promo_DISCOUNT      0
last_promo_FREEBIE       0
payment_method_CASH      0
payment_method_DIGI      0
dtype: int64

In [84]:
df = df_sd_2.copy()   # take this version with standard scaling as final dataset

<a class="anchor" id="six-bullet"></a>

## <span style="color:salmon"> 6. Export Datasets</span> 

<a href="#top">Top &#129033;</a>

In [85]:
# Store the final DataFrame df in df_transform
df_transform = pd.DataFrame(df)

# Save to CSV
df_transform.to_csv('../dataset/df_transform.csv', index=False)

### NOTE: 

We did not perform feature selection because we ended up using all the features.