# Big Data Real-Time Analytics with Python and Spark

## Chapter 6 - Case Study 5 - Data Preprocessing for E-Commerce Analytics

**Note:** We are working on a big data science project divided into 3 chapters:
* Exploratory data analysis
    - EDA part 1
    - EDA part 2 
* Feature Engineering
* **Data Preprocessing**

The goal of preprocessing is to put the data in the best format for the next step of a data science project. Preprocessing is usually the last step before training a Machine Learning model.

Some preprocessing techniques should be fitted exclusively on the training data and then applied to the test data. Since we will not train a Machine Learning model in this case study, we apply every technique to the entire dataset. Our focus is on studying the techniques themselves.

![CaseStudy5 DSA](images/CaseStudy5.png "Case Study 5 DSA")

In [1]:
# Python Version
from platform import python_version
print('The version used in this notebook: ', python_version())

The version used in this notebook:  3.8.13


In [2]:
# Imports
import sklearn
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

In [3]:
# Package versions used in this notebook
%reload_ext watermark
%watermark -a 'Bianca Amorim' --iversions

Author: Bianca Amorim

pandas : 1.4.2
numpy  : 1.22.3
sklearn: 1.0.2



##  Loading Dataset (Generated at the end of case study 4)

In [4]:
# Load the dataset
df = pd.read_csv('datasets/df_eng.csv', index_col = 0)

In [5]:
df.shape

(10643, 16)

In [6]:
df.head()

Unnamed: 0,ID,corredor_armazem,modo_envio,numero_chamadas_cliente,avaliacao_cliente,custo_produto,compras_anteriores,prioridade_produto,genero,desconto,peso_gramas,entregue_no_prazo,performance_prioridade_envio,performance_modo_envio,discount_range,discount_performance_range
0,1,D,Aviao,4,2,177,3,baixa,F,44,1233,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
1,2,F,Aviao,4,5,216,2,baixa,M,59,3088,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
2,3,A,Aviao,2,2,183,4,baixa,M,48,3374,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
3,4,B,Aviao,3,3,176,4,media,M,10,1177,1,No Delay,No Delay,Below Average Discount,On Time Delivery at Below Average Discount
4,5,C,Aviao,2,2,184,3,media,F,46,2484,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10643 entries, 0 to 10999
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   ID                            10643 non-null  int64 
 1   corredor_armazem              10643 non-null  object
 2   modo_envio                    10643 non-null  object
 3   numero_chamadas_cliente       10643 non-null  int64 
 4   avaliacao_cliente             10643 non-null  int64 
 5   custo_produto                 10643 non-null  int64 
 6   compras_anteriores            10643 non-null  int64 
 7   prioridade_produto            10643 non-null  object
 8   genero                        10643 non-null  object
 9   desconto                      10643 non-null  int64 
 10  peso_gramas                   10643 non-null  int64 
 11  entregue_no_prazo             10643 non-null  int64 
 12  performance_prioridade_envio  10643 non-null  object
 13  performance_modo

In [8]:
df.rename({'ID': 'ID',
           'corredor_armazem': 'aisle_store',
           'modo_envio': 'delivery_mode',
           'numero_chamadas_cliente': 'client_calls_number',
           'avaliacao_cliente': 'customer_rating',
           'custo_produto': 'product_cost',
           'compras_anteriores': 'previous_purchases',
           'prioridade_produto': 'product_priority',
           'genero': 'gender',
           'desconto': 'discount',
           'peso_gramas': 'weigh_grams',
           'entregue_no_prazo': 'delivered_on_time',
           'performance_prioridade_envio': 'delivery_priority_performance',
           'performance_modo_envio': 'delivery_mode_performance',
           'discount_range': 'discount_range',
           'discount_performance_range': 'discount_performance_range'}, axis=1, inplace = True)

In [9]:
df.head()

Unnamed: 0,ID,aisle_store,delivery_mode,client_calls_number,customer_rating,product_cost,previous_purchases,product_priority,gender,discount,weigh_grams,delivered_on_time,delivery_priority_performance,delivery_mode_performance,discount_range,discount_performance_range
0,1,D,Aviao,4,2,177,3,baixa,F,44,1233,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
1,2,F,Aviao,4,5,216,2,baixa,M,59,3088,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
2,3,A,Aviao,2,2,183,4,baixa,M,48,3374,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
3,4,B,Aviao,3,3,176,4,media,M,10,1177,1,No Delay,No Delay,Below Average Discount,On Time Delivery at Below Average Discount
4,5,C,Aviao,2,2,184,3,media,F,46,2484,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average


## Label Encoding
Label Encoding is a popular encoding technique for handling categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering.
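
As a minimal sketch of the alphabetical assignment described above, the toy series below (hypothetical values that reuse the category names of `product_priority`) is encoded with `LabelEncoder`, and then with an explicit dictionary for comparison. The names `priority`, `encoder`, and `ordered_codes` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical), reusing the category names of product_priority
priority = pd.Series(['baixa', 'media', 'alta', 'baixa'])

# LabelEncoder sorts the distinct labels and numbers them starting at 0
encoder = LabelEncoder()
codes = encoder.fit_transform(priority)

print(encoder.classes_.tolist())  # ['alta', 'baixa', 'media']
print(codes.tolist())             # [1, 2, 0, 1]

# An explicit dictionary lets the codes follow the business order instead
ordered_codes = priority.map({'baixa': 0, 'media': 1, 'alta': 2})
print(ordered_codes.tolist())     # [0, 1, 2, 0]
```

Alphabetical codes are fine for nominal variables. For an ordinal variable, a hand-written mapping is one way to keep the integer order aligned with the category order.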

#### Method 1

In [10]:
# Ordinal categorical variable
df.product_priority.value_counts()

baixa    5174
media    4587
alta      882
Name: product_priority, dtype: int64

In [11]:
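# Mapping dictionary
# Note: these codes follow alphabetical order (alta=0, baixa=1, media=2),
# not the low < medium < high priority order
dic_product_priority = {'baixa': 1, 'media': 2, 'alta': 0}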

In [12]:
df['product_priority'] = df['product_priority'].map(dic_product_priority)

In [13]:
df.product_priority.value_counts()

1    5174
2    4587
0     882
Name: product_priority, dtype: int64

In [14]:
df.head()

Unnamed: 0,ID,aisle_store,delivery_mode,client_calls_number,customer_rating,product_cost,previous_purchases,product_priority,gender,discount,weigh_grams,delivered_on_time,delivery_priority_performance,delivery_mode_performance,discount_range,discount_performance_range
0,1,D,Aviao,4,2,177,3,1,F,44,1233,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
1,2,F,Aviao,4,5,216,2,1,M,59,3088,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
2,3,A,Aviao,2,2,183,4,1,M,48,3374,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
3,4,B,Aviao,3,3,176,4,2,M,10,1177,1,No Delay,No Delay,Below Average Discount,On Time Delivery at Below Average Discount
4,5,C,Aviao,2,2,184,3,2,F,46,2484,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average


In [15]:
# Ordinal categorical variable (open to interpretation)
df.delivery_mode.value_counts()

Navio       7212
Aviao       1728
Caminhao    1703
Name: delivery_mode, dtype: int64

In [16]:
# Mapping dictionary
dic_delivery_mode = {'Navio': 0, 'Aviao': 1, 'Caminhao': 2}

In [17]:
df['delivery_mode'] = df['delivery_mode'].map(dic_delivery_mode)

In [18]:
df.delivery_mode.value_counts()

0    7212
1    1728
2    1703
Name: delivery_mode, dtype: int64

In [19]:
df.head()

Unnamed: 0,ID,aisle_store,delivery_mode,client_calls_number,customer_rating,product_cost,previous_purchases,product_priority,gender,discount,weigh_grams,delivered_on_time,delivery_priority_performance,delivery_mode_performance,discount_range,discount_performance_range
0,1,D,1,4,2,177,3,1,F,44,1233,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
1,2,F,1,4,5,216,2,1,M,59,3088,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
2,3,A,1,2,2,183,4,1,M,48,3374,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
3,4,B,1,3,3,176,4,2,M,10,1177,1,No Delay,No Delay,Below Average Discount,On Time Delivery at Below Average Discount
4,5,C,1,2,2,184,3,2,F,46,2484,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average


#### Method 2
sklearn.preprocessing.LabelEncoder - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [20]:
# Nominal categorical variable
df.gender.value_counts()

F    5357
M    5286
Name: gender, dtype: int64

In [21]:
# Create the label encoder
le = LabelEncoder()

In [22]:
# Fit the encoder (we usually do this only with the training data)
le.fit(df.gender)

LabelEncoder()

In [23]:
list(le.classes_)

['F', 'M']

In [24]:
# Apply the fitted encoder
# (we do this on training and test data, and also on new data fed to the model)
df.gender = le.transform(df.gender)

In [25]:
df.gender.value_counts()

0    5357
1    5286
Name: gender, dtype: int64

## One-Hot Encoding

![one-hot-encoding](images/one-hot-encoding.png 'example of one-hot-encoding')

In [26]:
df.head()

Unnamed: 0,ID,aisle_store,delivery_mode,client_calls_number,customer_rating,product_cost,previous_purchases,product_priority,gender,discount,weigh_grams,delivered_on_time,delivery_priority_performance,delivery_mode_performance,discount_range,discount_performance_range
0,1,D,1,4,2,177,3,1,0,44,1233,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
1,2,F,1,4,5,216,2,1,1,59,3088,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
2,3,A,1,2,2,183,4,1,1,48,3374,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average
3,4,B,1,3,3,176,4,2,1,10,1177,1,No Delay,No Delay,Below Average Discount,On Time Delivery at Below Average Discount
4,5,C,1,2,2,184,3,2,0,46,2484,1,No Delay,No Delay,Above Average Discount,On Time Delivery at Above Average


In [27]:
# Nominal categorical variable
df.aisle_store.value_counts()

F    3539
B    1778
D    1777
A    1777
C    1772
Name: aisle_store, dtype: int64

In [28]:
# Nominal categorical variable (open to interpretation)
df.delivery_priority_performance.value_counts()

No Delay             6282
Tolerable Delay      2134
Problematic Delay    1917
Critical Delay        310
Name: delivery_priority_performance, dtype: int64

In [29]:
# Nominal categorical variable
df.delivery_mode_performance.value_counts()

No Delay                                  6282
Tolerable Delivery Delay by Ship          1453
Problematic Delay in Delivery by Ship     1307
Tolerable Delivery Delay by Truck          350
Tolerable Delivery Delay by Plane          331
Problematic Delay in Delivery by Truck     310
Problematic Delivery Delay by Plane        300
Critical Delay in Delivery by Ship         194
Critical Delivery Delay by Plane            65
Critical Delivery Truck Delay               51
Name: delivery_mode_performance, dtype: int64

In [30]:
# Nominal categorical variable
df.discount_range.value_counts()

Below Average Discount    8269
Above Average Discount    2374
Name: discount_range, dtype: int64

In [31]:
# Nominal categorical variable
df.discount_performance_range.value_counts()

Late Delivery with a Below Average discount    4361
On Time Delivery at Below Average Discount     3908
On Time Delivery at Above Average              2374
Name: discount_performance_range, dtype: int64

In [32]:
# Applying One-Hot Encoding
for cat in ['aisle_store',
           'delivery_priority_performance',
           'delivery_mode_performance',
           'discount_range',
           'discount_performance_range']:
    onehots = pd.get_dummies(df[cat], prefix = cat)
    df = df.join(onehots)

In [33]:
df.columns

Index(['ID', 'aisle_store', 'delivery_mode', 'client_calls_number',
       'customer_rating', 'product_cost', 'previous_purchases',
       'product_priority', 'gender', 'discount', 'weigh_grams',
       'delivered_on_time', 'delivery_priority_performance',
       'delivery_mode_performance', 'discount_range',
       'discount_performance_range', 'aisle_store_A', 'aisle_store_B',
       'aisle_store_C', 'aisle_store_D', 'aisle_store_F',
       'delivery_priority_performance_Critical Delay',
       'delivery_priority_performance_No Delay',
       'delivery_priority_performance_Problematic Delay',
       'delivery_priority_performance_Tolerable Delay',
       'delivery_mode_performance_Critical Delay in Delivery by Ship',
       'delivery_mode_performance_Critical Delivery Delay by Plane',
       'delivery_mode_performance_Critical Delivery Truck Delay',
       'delivery_mode_performance_No Delay',
       'delivery_mode_performance_Problematic Delay in Delivery by Ship',
       'delivery_

In [34]:
df.head()

Unnamed: 0,ID,aisle_store,delivery_mode,client_calls_number,customer_rating,product_cost,previous_purchases,product_priority,gender,discount,...,delivery_mode_performance_Problematic Delay in Delivery by Truck,delivery_mode_performance_Problematic Delivery Delay by Plane,delivery_mode_performance_Tolerable Delivery Delay by Plane,delivery_mode_performance_Tolerable Delivery Delay by Ship,delivery_mode_performance_Tolerable Delivery Delay by Truck,discount_range_Above Average Discount,discount_range_Below Average Discount,discount_performance_range_Late Delivery with a Below Average discount,discount_performance_range_On Time Delivery at Above Average,discount_performance_range_On Time Delivery at Below Average Discount
0,1,D,1,4,2,177,3,1,0,44,...,0,0,0,0,0,1,0,0,1,0
1,2,F,1,4,5,216,2,1,1,59,...,0,0,0,0,0,1,0,0,1,0
2,3,A,1,2,2,183,4,1,1,48,...,0,0,0,0,0,1,0,0,1,0
3,4,B,1,3,3,176,4,2,1,10,...,0,0,0,0,0,0,1,0,0,1
4,5,C,1,2,2,184,3,2,0,46,...,0,0,0,0,0,1,0,0,1,0


In [35]:
# We no longer need the original variables after applying One-Hot Encoding
df = df.drop(columns = ['aisle_store',
           'delivery_priority_performance',
           'delivery_mode_performance',
           'discount_range',
           'discount_performance_range'])
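
The join-then-drop pattern used in cells 32 and 35 can also be written as a single call, because `pd.get_dummies` accepts a `columns` argument. The sketch below uses a toy frame with hypothetical values; `toy` and `encoded` are illustrative names, and `dtype=int` is passed only so the dummies print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy frame (hypothetical values) with one nominal column and one numeric column
toy = pd.DataFrame({'aisle_store': ['D', 'F', 'A', 'F'],
                    'discount': [44, 59, 48, 10]})

# columns= encodes the listed columns and drops the originals in the same call
encoded = pd.get_dummies(toy, columns=['aisle_store'], dtype=int)

print(encoded.columns.tolist())
# ['discount', 'aisle_store_A', 'aisle_store_D', 'aisle_store_F']
```

The untouched columns come first and the dummy columns are appended, prefixed with the original column name.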

In [36]:
# Now we can drop the ID column
df = df.drop(columns = ['ID'])

In [37]:
df.head()

Unnamed: 0,delivery_mode,client_calls_number,customer_rating,product_cost,previous_purchases,product_priority,gender,discount,weigh_grams,delivered_on_time,...,delivery_mode_performance_Problematic Delay in Delivery by Truck,delivery_mode_performance_Problematic Delivery Delay by Plane,delivery_mode_performance_Tolerable Delivery Delay by Plane,delivery_mode_performance_Tolerable Delivery Delay by Ship,delivery_mode_performance_Tolerable Delivery Delay by Truck,discount_range_Above Average Discount,discount_range_Below Average Discount,discount_performance_range_Late Delivery with a Below Average discount,discount_performance_range_On Time Delivery at Above Average,discount_performance_range_On Time Delivery at Below Average Discount
0,1,4,2,177,3,1,0,44,1233,1,...,0,0,0,0,0,1,0,0,1,0
1,1,4,5,216,2,1,1,59,3088,1,...,0,0,0,0,0,1,0,0,1,0
2,1,2,2,183,4,1,1,48,3374,1,...,0,0,0,0,0,1,0,0,1,0
3,1,3,3,176,4,2,1,10,1177,1,...,0,0,0,0,0,0,1,0,0,1
4,1,2,2,184,3,2,0,46,2484,1,...,0,0,0,0,0,1,0,0,1,0


## Feature Scaling

**Feature scaling** consists of transforming feature values into a similar range so that machine learning algorithms that are sensitive to feature magnitude behave better. Standardization and normalization are two of the most common feature scaling techniques:

- **Normalization** transforms the feature values to fall within a fixed interval defined by a minimum and a maximum.
> Using MinMaxScaler() from sklearn.preprocessing
- **Standardization** transforms the feature values to have a mean of 0 and a standard deviation of 1. Standardization keeps useful information about outliers and makes the algorithm less sensitive to them, in contrast to min-max scaling.
> Using StandardScaler() from sklearn.preprocessing
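
The two definitions above can be checked by hand. The sketch below, on a hypothetical toy column `x`, computes both transformations with NumPy and compares them with the scikit-learn estimators. Note that `StandardScaler` uses the population standard deviation, which is also the default of `np.std`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column (hypothetical values), shaped (n_samples, 1) as scikit-learn expects
x = np.array([[10.0], [20.0], [30.0], [50.0]])

# Normalization: (x - min) / (max - min), so the result lies in [0, 1]
manual_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, using the population standard deviation
manual_standard = (x - x.mean()) / x.std()

# Both hand-written versions agree with the scikit-learn estimators
assert np.allclose(manual_minmax, MinMaxScaler().fit_transform(x))
assert np.allclose(manual_standard, StandardScaler().fit_transform(x))

print(manual_minmax.ravel())    # the values 0, 0.25, 0.5 and 1
print(manual_standard.ravel())
```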


In [38]:
df.head()

Unnamed: 0,delivery_mode,client_calls_number,customer_rating,product_cost,previous_purchases,product_priority,gender,discount,weigh_grams,delivered_on_time,...,delivery_mode_performance_Problematic Delay in Delivery by Truck,delivery_mode_performance_Problematic Delivery Delay by Plane,delivery_mode_performance_Tolerable Delivery Delay by Plane,delivery_mode_performance_Tolerable Delivery Delay by Ship,delivery_mode_performance_Tolerable Delivery Delay by Truck,discount_range_Above Average Discount,discount_range_Below Average Discount,discount_performance_range_Late Delivery with a Below Average discount,discount_performance_range_On Time Delivery at Above Average,discount_performance_range_On Time Delivery at Below Average Discount
0,1,4,2,177,3,1,0,44,1233,1,...,0,0,0,0,0,1,0,0,1,0
1,1,4,5,216,2,1,1,59,3088,1,...,0,0,0,0,0,1,0,0,1,0
2,1,2,2,183,4,1,1,48,3374,1,...,0,0,0,0,0,1,0,0,1,0
3,1,3,3,176,4,2,1,10,1177,1,...,0,0,0,0,0,0,1,0,0,1
4,1,2,2,184,3,2,0,46,2484,1,...,0,0,0,0,0,1,0,0,1,0


In [39]:
df.columns

Index(['delivery_mode', 'client_calls_number', 'customer_rating',
       'product_cost', 'previous_purchases', 'product_priority', 'gender',
       'discount', 'weigh_grams', 'delivered_on_time', 'aisle_store_A',
       'aisle_store_B', 'aisle_store_C', 'aisle_store_D', 'aisle_store_F',
       'delivery_priority_performance_Critical Delay',
       'delivery_priority_performance_No Delay',
       'delivery_priority_performance_Problematic Delay',
       'delivery_priority_performance_Tolerable Delay',
       'delivery_mode_performance_Critical Delay in Delivery by Ship',
       'delivery_mode_performance_Critical Delivery Delay by Plane',
       'delivery_mode_performance_Critical Delivery Truck Delay',
       'delivery_mode_performance_No Delay',
       'delivery_mode_performance_Problematic Delay in Delivery by Ship',
       'delivery_mode_performance_Problematic Delay in Delivery by Truck',
       'delivery_mode_performance_Problematic Delivery Delay by Plane',
       'delivery_mode_

**Attention:** When the data is split into training and test sets, call fit() on the MinMaxScaler() estimator using only the training set, then use that same fitted estimator to transform both the training and test sets. The same estimator should also be applied to new data when making predictions with the model.
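
A minimal sketch of that workflow, assuming a hypothetical random feature column `X` that stands in for a column such as `weigh_grams` (this notebook itself does not split the data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature column standing in for a column such as weigh_grams
rng = np.random.default_rng(42)
X = rng.integers(1000, 8000, size=(100, 1)).astype(float)

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min and max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training min and max, no refit

# Training values span [0, 1]; test values may fall slightly outside that interval
print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled.min(), X_test_scaled.max())
```

Fitting on the training rows only keeps information about the test set from leaking into the transformation.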

In [40]:
df.weigh_grams.sample(5)

7792    5383
3856    4282
752     1999
3950    5814
7097    5369
Name: weigh_grams, dtype: int64

In [41]:
df['weigh_grams'] = MinMaxScaler().fit_transform(df['weigh_grams'].values.reshape(len(df), 1))

In [42]:
df.weigh_grams.sample(5)

9349    0.120526
545     0.129291
5036    0.106939
3046    0.153104
9944    0.110007
Name: weigh_grams, dtype: float64

In [43]:
df.product_cost.sample(5)

759     202
9031    250
663     178
9735    261
7035    219
Name: product_cost, dtype: int64

In [44]:
df['product_cost'] = MinMaxScaler().fit_transform(df['product_cost'].values.reshape(len(df), 1))

In [45]:
df.product_cost.sample(5)

4371    0.518692
4394    0.369159
5880    0.177570
4072    0.686916
5542    0.214953
Name: product_cost, dtype: float64

**Attention:** When the data is split into training and test sets, call fit() on the StandardScaler() estimator using only the training set, then use that same fitted estimator to transform both the training and test sets. The same estimator should also be applied to new data when making predictions with the model.
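
As a rough illustration of what "the same estimator" means, a fitted `StandardScaler` stores the training mean and standard deviation in its `mean_` and `scale_` attributes and reuses them in `transform`. The arrays `train` and `new` below are hypothetical values for a column such as `discount`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training values and new values for a column such as discount
train = np.array([[5.0], [10.0], [15.0], [30.0]])
new = np.array([[10.0], [60.0]])

scaler = StandardScaler().fit(train)  # stores mean_ and scale_ from the training rows
new_scaled = scaler.transform(new)    # applies the stored statistics without refitting

# Same result as applying the training mean and standard deviation by hand
assert np.allclose(new_scaled, (new - scaler.mean_) / scaler.scale_)

print(scaler.mean_, scaler.scale_)
print(new_scaled.ravel())
```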

In [46]:
df['discount'] = StandardScaler().fit_transform(df['discount'].values.reshape(len(df), 1))

In [47]:
df.discount.sample(5)

6202   -0.302046
8912   -0.768984
4520   -0.635573
1350    3.099936
6119   -0.368751
Name: discount, dtype: float64

In [48]:
df['client_calls_number'] = StandardScaler().fit_transform(df['client_calls_number'].values.reshape(len(df), 1))

In [49]:
df.client_calls_number.sample(5)

7095    0.815832
408    -0.930527
6754   -0.057348
8194    0.815832
1238   -0.930527
Name: client_calls_number, dtype: float64

In [50]:
df['customer_rating'] = StandardScaler().fit_transform(df['customer_rating'].values.reshape(len(df), 1))

In [51]:
df.customer_rating.sample(5)

1918     1.423904
5403     0.007718
3384     1.423904
6312    -0.700376
10366    0.007718
Name: customer_rating, dtype: float64

In [52]:
df['previous_purchases'] = StandardScaler().fit_transform(df['previous_purchases'].values.reshape(len(df), 1))

In [53]:
df.previous_purchases.sample(5)

2278   -0.359702
5854   -0.359702
594    -1.135605
2060   -0.359702
8199    0.416201
Name: previous_purchases, dtype: float64

In [54]:
df.head()

Unnamed: 0,delivery_mode,client_calls_number,customer_rating,product_cost,previous_purchases,product_priority,gender,discount,weigh_grams,delivered_on_time,...,delivery_mode_performance_Problematic Delay in Delivery by Truck,delivery_mode_performance_Problematic Delivery Delay by Plane,delivery_mode_performance_Tolerable Delivery Delay by Plane,delivery_mode_performance_Tolerable Delivery Delay by Ship,delivery_mode_performance_Tolerable Delivery Delay by Truck,discount_range_Above Average Discount,discount_range_Below Average Discount,discount_performance_range_Late Delivery with a Below Average discount,discount_performance_range_On Time Delivery at Above Average,discount_performance_range_On Time Delivery at Below Average Discount
0,1,-0.057348,-0.700376,0.378505,-0.359702,1,0,2.099353,0.033893,1,...,0,0,0,0,0,1,0,0,1,0
1,1,-0.057348,1.423904,0.560748,-1.135605,1,1,3.099936,0.304894,1,...,0,0,0,0,0,1,0,0,1,0
2,1,-1.803706,-0.700376,0.406542,0.416201,1,1,2.366175,0.346676,1,...,0,0,0,0,0,1,0,0,1,0
3,1,-0.930527,0.007718,0.373832,0.416201,2,1,-0.168635,0.025712,1,...,0,0,0,0,0,0,1,0,0,1
4,1,-1.803706,-0.700376,0.411215,-0.359702,2,0,2.232764,0.216654,1,...,0,0,0,0,0,1,0,0,1,0


In [55]:
df.to_csv('datasets/dataset_final.csv', sep = ',', encoding = 'utf-8')

# The End