# Data Cleaning and Data Wrangling

## Data

The dataset contains socio-demographic and firmographic features for 2,240 
customers.

|Feature |Description|
|--:|---|
|AcceptedCmp1| 1 if customer accepted the offer in the 1st campaign, 0 otherwise|
|AcceptedCmp2| 1 if customer accepted the offer in the 2nd campaign, 0 otherwise|
|AcceptedCmp3| 1 if customer accepted the offer in the 3rd campaign, 0 otherwise|
|AcceptedCmp4| 1 if customer accepted the offer in the 4th campaign, 0 otherwise|
|AcceptedCmp5| 1 if customer accepted the offer in the 5th campaign, 0 otherwise|
|Response (target)| 1 if customer accepted the offer in the last campaign, 0 otherwise|
|Complain| 1 if customer complained in the last 2 years, 0 otherwise|
|Dt_Customer| date of customer's enrollment with the company|
|Education| customer's level of education|
|Marital_Status| customer's marital status|
|Kidhome| number of small children in customer's household|
|Teenhome| number of teenagers in customer's household|
|Income| customer's yearly household income|
|MntFishProducts| amount spent on fish products in the last 2 years|
|MntMeatProducts| amount spent on meat products in the last 2 years|
|MntFruits| amount spent on fruit in the last 2 years|
|MntSweetProducts| amount spent on sweet products in the last 2 years|
|MntWines| amount spent on wine in the last 2 years|
|MntGoldProds| amount spent on gold products in the last 2 years|
|NumDealsPurchases| number of purchases made with discount|
|NumCatalogPurchases| number of purchases made using catalog|
|NumStorePurchases| number of purchases made directly in stores|
|NumWebPurchases| number of purchases made through company's website|
|NumWebVisitsMonth| number of visits to company's website in the last month|
|Recency| number of days since the last purchase|
|Z_Revenue| revenue from the new gadget|
|Z_CostContact| cost of contact for the sixth campaign|

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime
from scipy import stats
from scipy.stats import kstest

import psutil
import warnings
warnings.filterwarnings("ignore")

## Data Exploration

In [2]:
# Storing path
path = Path("../data/ifood_customers.csv")

# Read CSV with pandas
data = pd.read_csv(path)

# showing a sample of the dataset
data.sample(8)

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
1861,10241,1975,2n Cycle,Divorced,11448.0,0,0,2013-12-15,16,0,...,6,0,0,0,0,0,0,3,11,0
1723,4686,1962,PhD,Widow,82571.0,0,0,2014-04-02,28,861,...,2,0,0,1,0,0,0,3,11,0
1647,7005,1981,Graduation,Single,58684.0,0,0,2014-06-16,71,479,...,2,0,1,0,0,0,0,3,11,0
1333,5147,1948,Graduation,Single,90842.0,0,0,2013-07-29,57,774,...,1,0,0,0,0,0,0,3,11,0
1217,8876,1963,PhD,Together,33629.0,1,1,2012-08-08,49,132,...,9,0,0,0,0,0,0,3,11,0
1261,3979,1983,PhD,Divorced,90687.0,0,0,2013-05-22,98,982,...,2,0,0,1,0,0,0,3,11,1
898,7037,1974,PhD,Married,37087.0,1,0,2013-08-11,50,194,...,6,0,0,0,0,0,0,3,11,0
393,455,1946,PhD,Married,51012.0,0,0,2013-04-18,86,102,...,6,0,0,0,0,0,0,3,11,0


In [3]:
# Exploring the columns and dtype

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

In [4]:
# Exploring unique values in some columns

print("\n'Education' Column Values:\n", "\t",
      data["Education"].unique(),
      sep=''
      )

print("\n'Marital_Status' Column Values:\n", "\t",
      data["Marital_Status"].unique(),
      sep=''
      )

print("\n'Dt_Customer' Column sample:\n",
      data["Dt_Customer"].sample(3), "\n", "\t",
      sep=''
      )



'Education' Column Values:
	['Graduation' 'PhD' 'Master' 'Basic' '2n Cycle']

'Marital_Status' Column Values:
	['Single' 'Together' 'Married' 'Divorced' 'Widow' 'Alone' 'Absurd' 'YOLO']

'Dt_Customer' Column sample:
1547    2012-11-09
2129    2013-11-03
1306    2013-04-03
Name: Dt_Customer, dtype: object
	


## Creating columns

In [5]:
# Casting Education and Marital_Status to the 'category' dtype,
# Dt_Customer to datetime and Income to the nullable 'Int64' dtype

data["Income"] = data["Income"].astype("Int64")
data["Education"] = data["Education"].astype("category")
data["Marital_Status"] = data["Marital_Status"].astype("category")
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], format="%Y-%m-%d")

data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   ID                   2240 non-null   int64         
 1   Year_Birth           2240 non-null   int64         
 2   Education            2240 non-null   category      
 3   Marital_Status       2240 non-null   category      
 4   Income               2216 non-null   Int64         
 5   Kidhome              2240 non-null   int64         
 6   Teenhome             2240 non-null   int64         
 7   Dt_Customer          2240 non-null   datetime64[ns]
 8   Recency              2240 non-null   int64         
 9   MntWines             2240 non-null   int64         
 10  MntFruits            2240 non-null   int64         
 11  MntMeatProducts      2240 non-null   int64         
 12  MntFishProducts      2240 non-null   int64         
 13  MntSweetProducts     2240 non-nul

In [6]:
# Creating new columns: customer age in years and time since becoming a customer

today = pd.to_datetime(datetime.today().strftime('%Y-%m-%d'))

data["Year_Old"] = (today.year - data["Year_Birth"])
data["CustomerFor"] = (today - data["Dt_Customer"])

data[["Year_Old", "CustomerFor"]].sample(5)

Unnamed: 0,Year_Old,CustomerFor
1956,48,3825 days
1550,73,3922 days
132,55,3801 days
1334,46,4342 days
528,60,4348 days
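
Because `today` comes from the system clock, `Year_Old` and `CustomerFor` change every time the notebook is re-run. A minimal sketch of pinning the reference date instead, assuming a hypothetical `REFERENCE_DATE` and a toy frame in place of the real data (neither is part of the original notebook):

```python
import pandas as pd

# Hypothetical fixed reference date (an assumption, chosen only for illustration)
REFERENCE_DATE = pd.Timestamp("2024-06-30")

# Toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Year_Birth": [1975, 1962],
    "Dt_Customer": pd.to_datetime(["2013-12-15", "2014-04-02"]),
})

toy["Year_Old"] = REFERENCE_DATE.year - toy["Year_Birth"]
toy["CustomerFor"] = REFERENCE_DATE - toy["Dt_Customer"]
# Tenure as a plain integer number of days, which is easier to bin or model
toy["CustomerForDays"] = toy["CustomerFor"].dt.days

print(toy)
```

With a fixed date, the derived columns stay the same across runs, at the cost of no longer reflecting the current age of each customer.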


In [7]:
# Reordering columns

pop_column = data.pop("Dt_Customer")
data.insert(2, "Dt_Customer", pop_column)

last_columns = data.columns[-2:]
first_columns = data.columns[:2]
middle_columns = data.columns[2:-2]
new_order = list(first_columns) + list(last_columns) + list(middle_columns)

data = data[new_order]

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 31 columns):
 #   Column               Non-Null Count  Dtype          
---  ------               --------------  -----          
 0   ID                   2240 non-null   int64          
 1   Year_Birth           2240 non-null   int64          
 2   Year_Old             2240 non-null   int64          
 3   CustomerFor          2240 non-null   timedelta64[ns]
 4   Dt_Customer          2240 non-null   datetime64[ns] 
 5   Education            2240 non-null   category       
 6   Marital_Status       2240 non-null   category       
 7   Income               2216 non-null   Int64          
 8   Kidhome              2240 non-null   int64          
 9   Teenhome             2240 non-null   int64          
 10  Recency              2240 non-null   int64          
 11  MntWines             2240 non-null   int64          
 12  MntFruits            2240 non-null   int64          
 13  MntMeatProducts   

In [8]:
# Checking if the customer bought in the last month

# 1 if the last purchase was fewer than 30 days ago, 0 otherwise
data["PurchaseLastMonth"] = (data["Recency"] < 30).astype("int64")

print(data["PurchaseLastMonth"].sample(5))

1967    1
887     0
1837    0
1438    0
95      0
Name: PurchaseLastMonth, dtype: int64


In [9]:
# Calculating total amount spent per customer

MntSpentTotal_sum = ["MntFishProducts", "MntFruits", "MntGoldProds", 
                     "MntMeatProducts", "MntSweetProducts", "MntWines"]
data["MntSpentTotal"] = data[MntSpentTotal_sum].sum(axis=1)

print(data["MntSpentTotal"].sample(5))

1982    2211
1333    1424
1780      47
873      182
829       22
Name: MntSpentTotal, dtype: int64


In [10]:
# How many campaigns the customer accepted

AcceptedCmpTotal_sum = ["AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", 
                        "AcceptedCmp4", "AcceptedCmp5", "Response"]
data["AcceptedCmpTotal"] = data[AcceptedCmpTotal_sum].sum(axis=1)

data["AcceptedCmpTotal"].sample(5)

1614    1
620     0
797     0
1969    2
1154    0
Name: AcceptedCmpTotal, dtype: int64

In [11]:
# How many children (Kids and teenagers) the customer has at home

ChildrenHome_sum = ["Kidhome", "Teenhome"]
data["ChildrenHome"] = data[ChildrenHome_sum].sum(axis="columns")

data["ChildrenHome"].sample(5)

1776    1
2008    1
1133    1
662     1
1556    2
Name: ChildrenHome, dtype: int64

### Columns reordered

In [12]:
# Reordering columns

pop_column = data.pop("AcceptedCmpTotal")
data.insert(27, "AcceptedCmpTotal", pop_column)

pop_column = data.pop("PurchaseLastMonth")
data.insert(17, "PurchaseLastMonth", pop_column)

pop_column = data.pop("MntSpentTotal")
data.insert(11, "MntSpentTotal", pop_column)

pop_column = data.pop("ChildrenHome")
data.insert(10, "ChildrenHome", pop_column)

pop_column = data.pop("AcceptedCmp2")
data.insert(25, "AcceptedCmp2", pop_column)

pop_column = data.pop("AcceptedCmp1")
data.insert(25, "AcceptedCmp1", pop_column)

pop_column = data.pop("Response")
data.insert(30, "Response", pop_column)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype          
---  ------               --------------  -----          
 0   ID                   2240 non-null   int64          
 1   Year_Birth           2240 non-null   int64          
 2   Year_Old             2240 non-null   int64          
 3   CustomerFor          2240 non-null   timedelta64[ns]
 4   Dt_Customer          2240 non-null   datetime64[ns] 
 5   Education            2240 non-null   category       
 6   Marital_Status       2240 non-null   category       
 7   Income               2216 non-null   Int64          
 8   Kidhome              2240 non-null   int64          
 9   Teenhome             2240 non-null   int64          
 10  ChildrenHome         2240 non-null   int64          
 11  Recency              2240 non-null   int64          
 12  MntSpentTotal        2240 non-null   int64          
 13  MntWines          
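
The chain of `pop` and `insert` calls above relies on positional indices that shift after every move. One possible alternative is to place a column relative to a named neighbour. This is only a sketch: `move_after` is a hypothetical helper and the frame is a toy example.

```python
import pandas as pd

def move_after(frame, column, anchor):
    """Return `frame` with `column` placed immediately after `anchor`."""
    cols = [c for c in frame.columns if c != column]
    position = cols.index(anchor) + 1
    cols.insert(position, column)
    return frame[cols]

# Toy frame standing in for the real dataset
toy = pd.DataFrame({"Kidhome": [1], "Teenhome": [0],
                    "Recency": [10], "ChildrenHome": [1]})

toy = move_after(toy, "ChildrenHome", "Teenhome")
print(list(toy.columns))
```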

## Data Cleaning

In [13]:
# Missing Values

row_nan = data[data.isna().any(axis=1)]
print("Customers with missing values: ", len(row_nan))

data.drop(row_nan.index, inplace=True)

Customers with missing values:  24
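
The earlier `data.info()` output lists `Income` with 2216 non-null entries out of 2240, which accounts for the 24 rows found here. A minimal sketch of checking missing values per column before dropping rows, using a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Income": [58138.0, np.nan, 71613.0, np.nan],
    "Recency": [58, 38, 26, 94],
})

# Missing values per column, showing only columns that have any
missing_per_column = toy.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Keep the removed rows for later inspection, then drop them
row_nan = toy[toy.isna().any(axis=1)]
cleaned = toy.dropna()
print(len(row_nan), len(cleaned))
```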


In [14]:
# Identifying logically incoherent customers and dropping from the dataframe

marital_filt = data[data["Marital_Status"].isin(['Alone', 'Absurd', 'YOLO'])]
print("Customers with a logically incoherent Marital Status: ", 
      len(marital_filt))

data.drop(marital_filt.index, inplace=True)

Customers with a logically incoherent Marital Status:  7


### Kolmogorov-Smirnov test

In [15]:
# KS-Test on 'Year_Old' column
ks_result = kstest(data["Year_Old"], stats.norm.cdf, 
                   args=(data["Year_Old"].mean(), data["Year_Old"].std()))

print(f"Test statistic: {ks_result.statistic:.4f}")
print(f"p Value: {ks_result.pvalue:.4f}")

Test statistic: 0.0590
p Value: 0.0000


In [16]:
# KS-Test on 'Income' column
ks_result = kstest(data["Income"], stats.norm.cdf, 
                   args=(data["Income"].mean(), data["Income"].std()))

print(f"Test statistic: {ks_result.statistic:.4f}")
print(f"p Value: {ks_result.pvalue:.4f}")

Test statistic: 0.0542
p Value: 0.0000
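
Both p-values round to zero, so the hypothesis of normality is rejected for `Year_Old` and `Income`. One caveat: the mean and standard deviation are estimated from the same sample being tested, so the standard KS p-value is only approximate (the Lilliefors variant of the test addresses this case). A minimal sketch of the same call pattern on synthetic data, wrapped in a hypothetical helper `ks_normal`:

```python
import numpy as np
from scipy import stats

def ks_normal(values):
    """KS test of `values` against a normal with the sample's own mean and std."""
    x = np.asarray(values, dtype=float)
    return stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

# Synthetic samples (assumptions for illustration, not the real columns)
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50, scale=10, size=2000)
skewed_sample = rng.exponential(scale=10, size=2000)

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    result = ks_normal(sample)
    print(f"{name}: statistic={result.statistic:.4f}, p={result.pvalue:.4f}")
```

The skewed sample should give a much larger test statistic than the normal one, which mirrors what the two cells above report for the real columns.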


### Obtaining outliers

In [17]:
# Quartiles and IQR
quartiles = data["Year_Old"].quantile([0.25, 0.75])
iqr = quartiles[0.75] - quartiles[0.25]

# Identify Outliers
lower_bound = quartiles[0.25] - 1.5 * iqr
upper_bound = quartiles[0.75] + 1.5 * iqr

# Filtering
Year_Old_outliers = data[(data["Year_Old"] < lower_bound) | 
                (data["Year_Old"] > upper_bound)]

print("IQR: ", iqr)
print("outliers: ")
print(Year_Old_outliers[["ID", "Year_Old", "Income", "CustomerFor", 
                "Marital_Status", "MntSpentTotal"]])

# Removing the outliers from the dataset
data.drop(Year_Old_outliers.index, inplace=True)

IQR:  18.0
outliers: 
        ID  Year_Old  Income CustomerFor Marital_Status  MntSpentTotal
192   7829       124   36640   3938 days       Divorced             65
239  11004       131   60182   3705 days         Single             22
339   1150       125   83532   3938 days       Together           1853
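
The rule applied here is Tukey's fences: a value is flagged when it lies more than 1.5 × IQR below the first quartile or above the third. Since the same steps are repeated for `Income`, they could be factored into a function. A minimal sketch, where `iqr_outliers` is a hypothetical helper and the frame holds made-up values:

```python
import pandas as pd

def iqr_outliers(frame, column, k=1.5):
    """Return rows of `frame` whose `column` lies outside Tukey's fences."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (frame[column] < lower) | (frame[column] > upper)
    return frame[mask]

# Toy frame standing in for the real dataset
toy = pd.DataFrame({"ID": range(1, 8),
                    "Year_Old": [40, 45, 50, 55, 60, 65, 125]})

outliers = iqr_outliers(toy, "Year_Old")
print(outliers)
```

Returning the flagged rows rather than dropping them in place leaves the decision to the caller, which matters for `Income`, where only one of the flagged rows is removed.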


In [18]:
# Quartiles and IQR
quartiles = data["Income"].quantile([0.25, 0.75])
iqr = quartiles[0.75] - quartiles[0.25]

# Identify Outliers
lower_bound = quartiles[0.25] - 1.5 * iqr
upper_bound = quartiles[0.75] + 1.5 * iqr

# Filtering
income_outliers = data[(data["Income"] < lower_bound) | 
                (data["Income"] > upper_bound)]

print("IQR: ", iqr)
print("outliers: ")
print(income_outliers[["ID", "Year_Old", "Income", "CustomerFor", 
                "Marital_Status", "MntSpentTotal"]]\
                    .sort_values("Income", ascending=False))

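# Storing the logically incoherent outlier (Income = 666666), then dropping it.
# drop(..., inplace=True) returns None, so the row is saved before the drop.
income_excluded = data.loc[[2233]]
data.drop(2233, inplace=True)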

IQR:  33383.5
outliers: 
         ID  Year_Old  Income CustomerFor Marital_Status  MntSpentTotal
2233   9432        47  666666   4054 days       Together             62
617    1503        48  162397   4053 days       Together            107
687    1501        42  160803   4356 days        Married           1717
1300   5336        53  157733   4052 days       Together             59
164    8475        51  157243   3782 days        Married           1608
1653   4931        47  157146   4088 days       Together           1730
2132  11181        75  156924   3966 days        Married              8
655    5555        49  153924   3804 days       Divorced              6


In [19]:
# Storing excluded entries

data_excluded = pd.concat([income_excluded, 
                          Year_Old_outliers, 
                          marital_filt, 
                          row_nan])

print(data_excluded[["ID", "Year_Old", "Income", "CustomerFor", 
                "Marital_Status", "MntSpentTotal"]]\
                    .sort_values("ID"))

         ID  Year_Old  Income CustomerFor Marital_Status  MntSpentTotal
153      92        36   34176   3710 days          Alone             89
131     433        66   61331   4138 days          Alone            632
2177    492        51   48432   4281 days           YOLO            424
339    1150       125   83532   3938 days       Together           1853
133    1295        61    <NA>   3984 days        Married            725
2061   1612        43    <NA>   4056 days         Single             47
10     1994        41    <NA>   3888 days        Married             19
312    2437        35    <NA>   4053 days        Married           1611
319    2863        54    <NA>   3972 days         Single           1052
1382   2902        66    <NA>   4326 days       Together             45
2081   3117        69    <NA>   3916 days         Single            450
1386   3769        52    <NA>   3781 days       Together             42
1383   4345        60    <NA>   3830 days         Single        

## New Categorical columns

In [20]:
# Age binning categories

print(data["Year_Old"].describe()[["min", "mean", "max"]])

bins = [25, 35, 45, 55, 65, 75, 2000]
labels = ["25_34", "35_44", "45_54", "55_64", "65_74", "75_above"]

data["Age_cat"] = pd.cut(data["Year_Old"], bins, labels=labels, right=False)

print("\n", data[["ID", "Year_Old", "Age_cat", "Income"]].sample(5), sep='')

min     28.000000
mean    55.101134
max     84.000000
Name: Year_Old, dtype: float64

        ID  Year_Old Age_cat  Income
463   7059        61   55_64   80124
715  10479        49   45_54   76618
275   2304        61   55_64   66313
478   8970        52   45_54   62010
24    1409        73   65_74   40689


In [21]:
# Income binning categories (10 equal-width bins; D1..D10 are not deciles)

print(data["Income"].describe()[["min", "mean", "max"]])

labels = [f"D{i}" for i in range(1, 11)]

data["Income_cat"] = pd.cut(data["Income"], 10, precision=0, labels=labels)

print("\n", data[["ID", "Age_cat", "Income", "Income_cat"]].sample(5), sep='')


min          1730.0
mean    51954.61542
max        162397.0
Name: Income, dtype: Float64

         ID Age_cat  Income Income_cat
719    1463   65_74   45160         D3
822       1   55_64   57091         D4
495    3136   55_64   59432         D4
252   10089   45_54  102692         D7
1495  10770   65_74   65492         D4
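
With an integer number of bins, `pd.cut` splits the value range into intervals of equal width, so the `D1` to `D10` groups can hold very different numbers of customers. If deciles are wanted, `pd.qcut` splits by quantiles instead. A minimal sketch comparing the two on synthetic, right-skewed incomes (the data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed incomes, standing in for the real column
income = pd.Series(rng.lognormal(mean=10.8, sigma=0.5, size=1000), name="Income")

labels = [f"D{i}" for i in range(1, 11)]

equal_width = pd.cut(income, 10, labels=labels)    # bins of equal width
equal_count = pd.qcut(income, 10, labels=labels)   # bins of (roughly) equal size

print(pd.DataFrame({
    "cut": equal_width.value_counts().sort_index(),
    "qcut": equal_count.value_counts().sort_index(),
}))
```

Which of the two is appropriate depends on the analysis: equal-width bins keep the income scale readable, while quantile bins keep the groups comparable in size.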


In [22]:
# Total amount spent binning categories

print(data["MntSpentTotal"].describe()[["min", "mean", "max"]])

data["MntTotal_cat"], intervals = pd.cut(data["MntSpentTotal"], 6, 
                                         precision=0, retbins=True)

# Exploring bins:
# print("\n", data["MntTotal_cat"].value_counts().sort_index(), sep='')

# Creating new, more descriptive bins
temp, first_int = pd.cut(np.arange(2, 426), 5, retbins=True)
bins = list(first_int) + list(intervals[2:])

data["MntTotal_cat"], intervals = pd.cut(data["MntSpentTotal"], bins, 
                                         precision=0, right=False, 
                                         retbins=True)

print("\n", data[["ID", "Age_cat", 
                  "Income", "MntTotal_cat", 
                  "MntSpentTotal"]].sample(5), sep='')


min        5.000000
mean     607.380499
max     2525.000000
Name: MntSpentTotal, dtype: float64

        ID   Age_cat  Income     MntTotal_cat  MntSpentTotal
2141  9216     45_54   35788      [2.0, 87.0)             44
1624  7019     55_64   54414   [171.0, 256.0)            211
1849  5010     35_44   25008      [2.0, 87.0)             34
1838  9847     65_74   62972   [425.0, 845.0)            587
22    1993  75_above   58607  [845.0, 1265.0)            972


In [23]:
# Recency binning categories

print(data["Recency"].describe()[["min", "mean", "max"]])

labels = ["0_24", "25_49", "50_74", "75_99"]

data["Recency_cat"] = pd.cut(data["Recency"], 4, precision=0, labels=labels)

print("\n", data[["MntTotal_cat", "Age_cat", 
                  "Income_cat", "Recency_cat"]].sample(5), sep='')


min      0.000000
mean    49.082993
max     99.000000
Name: Recency, dtype: float64

          MntTotal_cat Age_cat Income_cat Recency_cat
1801  [1685.0, 2105.0)   45_54         D5       75_99
2106     [87.0, 171.0)   55_64         D4        0_24
1561    [425.0, 845.0)   45_54         D3       50_74
919   [1265.0, 1685.0)   65_74         D5       50_74
1291       [2.0, 87.0)   45_54         D2       75_99


## Saving new Dataset

In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2205 entries, 0 to 2239
Data columns (total 39 columns):
 #   Column               Non-Null Count  Dtype          
---  ------               --------------  -----          
 0   ID                   2205 non-null   int64          
 1   Year_Birth           2205 non-null   int64          
 2   Year_Old             2205 non-null   int64          
 3   CustomerFor          2205 non-null   timedelta64[ns]
 4   Dt_Customer          2205 non-null   datetime64[ns] 
 5   Education            2205 non-null   category       
 6   Marital_Status       2205 non-null   category       
 7   Income               2205 non-null   Int64          
 8   Kidhome              2205 non-null   int64          
 9   Teenhome             2205 non-null   int64          
 10  ChildrenHome         2205 non-null   int64          
 11  Recency              2205 non-null   int64          
 12  MntSpentTotal        2205 non-null   int64          
 13  MntWines             22

In [27]:
# Saving DataFrame as csv
data.to_csv("../data/ifood_cleaned.csv")

# Storing excluded rows
data_excluded.to_csv("../data/ifood_excluded.csv")
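
A CSV file stores plain text only, so the `category` and datetime dtypes set earlier are not preserved and have to be restored on load. A minimal sketch of a round trip, using an in-memory buffer and a toy frame in place of the real files:

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned dataset
toy = pd.DataFrame({
    "ID": [1, 2],
    "Dt_Customer": pd.to_datetime(["2013-12-15", "2014-04-02"]),
    "Education": pd.Categorical(["PhD", "Master"]),
})

buffer = io.StringIO()           # in-memory stand-in for a file on disk
toy.to_csv(buffer, index=False)  # index=False assumes the row labels are not needed
buffer.seek(0)

# Restore the dtypes that plain text cannot carry
restored = pd.read_csv(
    buffer,
    parse_dates=["Dt_Customer"],
    dtype={"Education": "category"},
)
print(restored.dtypes)
```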

In [26]:
# Notebook info

!pip list

process = psutil.Process()
memory_used = process.memory_info().rss / (1024 ** 2)  # MB

print(f"\n\tMemory used: {memory_used:.2f} MB")


Package                   Version
------------------------- ---------
asttokens                 2.4.1
attrs                     23.2.0
Brotli                    1.1.0
certifi                   2024.6.2
charset-normalizer        3.3.2
colorama                  0.4.6
comm                      0.2.2
contourpy                 1.2.1
cycler                    0.12.1
Cython                    3.0.10
debugpy                   1.8.1
decorator                 5.1.1
exceptiongroup            1.2.0
executing                 2.0.1
fastjsonschema            2.20.0
fonttools                 4.51.0
idna                      3.7
importlib_metadata        7.1.0
importlib_resources       6.4.0
iniconfig                 2.0.0
ipykernel                 6.29.4
ipython                   8.24.0
ipywidgets                8.1.2
jedi                      0.19.1
jsonschema                4.22.0
jsonschema-specifications 2023.12.1
jupyter_client            8.6.2
jupyter_core              5.7.2
jupyterlab_widgets  