# Data Cleaning and Data Wrangling

## Data

The dataset contains socio-demographic and firmographic features for 2,240 
customers.

|Feature |Description|
|--:|---|
|AcceptedCmp1| 1 if customer accepted the offer in the 1st campaign, 0 otherwise|
|AcceptedCmp2| 1 if customer accepted the offer in the 2nd campaign, 0 otherwise|
|AcceptedCmp3| 1 if customer accepted the offer in the 3rd campaign, 0 otherwise|
|AcceptedCmp4| 1 if customer accepted the offer in the 4th campaign, 0 otherwise|
|AcceptedCmp5| 1 if customer accepted the offer in the 5th campaign, 0 otherwise|
|Response (target)| 1 if customer accepted the offer in the last campaign, 0 otherwise|
|Complain| 1 if customer complained in the last 2 years, 0 otherwise|
|Dt_Customer| date of customer's enrollment with the company|
|Education| customer's level of education|
|Marital_Status| customer's marital status|
|Kidhome| number of small children in customer's household|
|Teenhome| number of teenagers in customer's household|
|Income| customer's yearly household income|
|MntFishProducts| amount spent on fish products in the last 2 years|
|MntMeatProducts| amount spent on meat products in the last 2 years|
|MntFruits| amount spent on fruit products in the last 2 years|
|MntSweetProducts| amount spent on sweet products in the last 2 years|
|MntWines| amount spent on wine products in the last 2 years|
|MntGoldProds| amount spent on gold products in the last 2 years|
|NumDealsPurchases| number of purchases made with discount|
|NumCatalogPurchases| number of purchases made using catalog|
|NumStorePurchases| number of purchases made directly in stores|
|NumWebPurchases| number of purchases made through company's website|
|NumWebVisitsMonth| number of visits to company's website in the last month|
|Recency|number of days since the last purchase|
|Z_Revenue|revenue from the new gadget|
|Z_CostContact|cost of contact for the sixth campaign|

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime
from scipy import stats
from scipy.stats import kstest

import warnings
warnings.filterwarnings("ignore")

## Data Exploration

In [2]:
# Storing path
path = Path("../data/ifood_customers.csv")

# Read CSV with pandas
data = pd.read_csv(path)

# Showing a random sample of 8 rows
data.sample(8)

| |ID|Year_Birth|Education|Marital_Status|Income|Kidhome|Teenhome|Dt_Customer|Recency|MntWines|...|NumWebVisitsMonth|AcceptedCmp3|AcceptedCmp4|AcceptedCmp5|AcceptedCmp1|AcceptedCmp2|Complain|Z_CostContact|Z_Revenue|Response|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|1759|5883|1972|Graduation|Married|77981.0|1|0|2013-05-26|78|138|...|5|0|0|0|0|0|0|3|11|0|
|1256|9094|1955|2n Cycle|Married|62972.0|0|1|2012-08-03|39|313|...|6|0|0|0|0|0|0|3|11|1|
|1688|9256|1971|Graduation|Single|58350.0|0|1|2013-01-04|5|493|...|6|0|0|0|0|0|0|3|11|0|
|389|9799|1968|PhD|Divorced|83664.0|1|1|2013-05-08|57|866|...|5|0|0|0|0|0|0|3|11|0|
|1344|5181|1982|Basic|Single|24367.0|1|0|2013-03-20|58|2|...|9|0|0|0|0|0|0|3|11|0|
|111|7431|1991|PhD|Single|68126.0|0|0|2012-11-10|40|1332|...|9|0|1|0|0|0|0|3|11|1|
|1522|1998|1976|Graduation|Single|37697.0|1|0|2014-02-07|82|34|...|6|0|0|0|0|0|0|3|11|0|
|1736|7500|1967|Graduation|Single|79146.0|1|1|2014-04-24|33|245|...|6|0|0|0|0|0|0|3|11|0|


In [3]:
# Exploring the columns and dtype

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

In [4]:
# Exploring unique values in some columns

print("\n'Education' Column Values:\n", "\t",
      data["Education"].unique(),
      sep=''
      )

print("\n'Marital_Status' Column Values:\n", "\t",
      data["Marital_Status"].unique(),
      sep=''
      )

print("\n'Dt_Customer' Column sample:\n",
      data["Dt_Customer"].sample(3), "\n", "\t",
      sep=''
      )



'Education' Column Values:
	['Graduation' 'PhD' 'Master' 'Basic' '2n Cycle']

'Marital_Status' Column Values:
	['Single' 'Together' 'Married' 'Divorced' 'Widow' 'Alone' 'Absurd' 'YOLO']

'Dt_Customer' Column sample:
2040    2013-05-02
1676    2013-03-25
391     2013-12-09
Name: Dt_Customer, dtype: object
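
`unique()` lists the labels but not how often each one occurs, which matters when deciding whether odd labels such as 'Absurd' or 'YOLO' are rare enough to drop. A minimal sketch with `value_counts()`, using a small made-up series (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy series standing in for data["Marital_Status"] (illustrative values only)
toy = pd.Series(
    ["Single", "Married", "Married", "YOLO", "Together", "Married"],
    name="Marital_Status",
)

# Frequency of each label, most common first
counts = toy.value_counts()
print(counts)
```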
	


## Creating columns

In [5]:
# Casting Education and Marital_Status to the 'category' dtype,
# Dt_Customer to datetime and Income to the nullable 'Int64' dtype

data["Income"] = data["Income"].astype("Int64")
data["Education"] = data["Education"].astype("category")
data["Marital_Status"] = data["Marital_Status"].astype("category")
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], format="%Y-%m-%d")

data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   ID                   2240 non-null   int64         
 1   Year_Birth           2240 non-null   int64         
 2   Education            2240 non-null   category      
 3   Marital_Status       2240 non-null   category      
 4   Income               2216 non-null   Int64         
 5   Kidhome              2240 non-null   int64         
 6   Teenhome             2240 non-null   int64         
 7   Dt_Customer          2240 non-null   datetime64[ns]
 8   Recency              2240 non-null   int64         
 9   MntWines             2240 non-null   int64         
 10  MntFruits            2240 non-null   int64         
 11  MntMeatProducts      2240 non-null   int64         
 12  MntFishProducts      2240 non-null   int64         
 13  MntSweetProducts     2240 non-nul

In [6]:
# Creating new columns: approximate customer age in years
# and time elapsed since becoming a customer

today = pd.Timestamp.today().normalize()

data["Year_Old"] = today.year - data["Year_Birth"]
data["CustomerFor"] = today - data["Dt_Customer"]

data[["Year_Old", "CustomerFor"]].sample(5)

| |Year_Old|CustomerFor|
|---|---|---|
|1205|68|4154 days|
|226|48|4272 days|
|1639|48|4027 days|
|1801|50|4283 days|
|1587|59|3662 days|
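
Because both columns are computed from the current date, their values change every time the notebook is re-run. A minimal sketch of one way to make them reproducible, assuming a fixed reference date (the name `REFERENCE_DATE`, its value, and the toy frame are hypothetical and not part of the notebook):

```python
import pandas as pd

# Hypothetical fixed reference date; pick one that suits the analysis
REFERENCE_DATE = pd.Timestamp("2014-12-31")

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "Year_Birth": [1972, 1955],
    "Dt_Customer": pd.to_datetime(["2013-05-26", "2012-08-03"]),
})

# Approximate age in years and customer tenure relative to the fixed date
toy["Year_Old"] = REFERENCE_DATE.year - toy["Year_Birth"]
toy["CustomerFor"] = REFERENCE_DATE - toy["Dt_Customer"]

print(toy)
```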


In [7]:
# Reordering columns

pop_column = data.pop("Dt_Customer")
data.insert(2, "Dt_Customer", pop_column)

last_columns = data.columns[-2:]
first_columns = data.columns[:2]
middle_columns = data.columns[2:-2]
new_order = list(first_columns) + list(last_columns) + list(middle_columns)

data = data[new_order]

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 31 columns):
 #   Column               Non-Null Count  Dtype          
---  ------               --------------  -----          
 0   ID                   2240 non-null   int64          
 1   Year_Birth           2240 non-null   int64          
 2   Year_Old             2240 non-null   int64          
 3   CustomerFor          2240 non-null   timedelta64[ns]
 4   Dt_Customer          2240 non-null   datetime64[ns] 
 5   Education            2240 non-null   category       
 6   Marital_Status       2240 non-null   category       
 7   Income               2216 non-null   Int64          
 8   Kidhome              2240 non-null   int64          
 9   Teenhome             2240 non-null   int64          
 10  Recency              2240 non-null   int64          
 11  MntWines             2240 non-null   int64          
 12  MntFruits            2240 non-null   int64          
 13  MntMeatProducts   

In [8]:
# Flagging customers who purchased within the last 30 days (1 = yes, 0 = no)

data["PurchaseLastMonth"] = (data["Recency"] < 30).astype("int64")

print(data["PurchaseLastMonth"].sample(5))

1823    0
1232    1
546     0
659     0
1865    1
Name: PurchaseLastMonth, dtype: int64


In [9]:
# Calculating total amount spent per customer

MntSpentTotal_sum = ["MntFishProducts", "MntFruits", "MntGoldProds", 
                     "MntMeatProducts", "MntSweetProducts", "MntWines"]
data["MntSpentTotal"] = data[MntSpentTotal_sum].sum(axis=1)

print(data["MntSpentTotal"].sample(5))

820     1808
1814    1338
460      893
222      714
196      834
Name: MntSpentTotal, dtype: int64


In [10]:
# How many campaigns the customer accepted

AcceptedCmpTotal_sum = ["AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", 
                        "AcceptedCmp4", "AcceptedCmp5", "Response"]
data["AcceptedCmpTotal"] = data[AcceptedCmpTotal_sum].sum(axis=1)

data["AcceptedCmpTotal"].sample(5)

1694    0
289     0
162     0
428     0
2000    0
Name: AcceptedCmpTotal, dtype: int64

In [11]:
# How many children (Kids and teenagers) the customer has at home

ChildrenHome_sum = ["Kidhome", "Teenhome"]
data["ChildrenHome"] = data[ChildrenHome_sum].sum(axis="columns")

data["ChildrenHome"].sample(5)

2032    1
1455    0
1679    1
548     2
2030    1
Name: ChildrenHome, dtype: int64

## All columns reordered

In [12]:
# Reordering columns

pop_column = data.pop("AcceptedCmpTotal")
data.insert(27, "AcceptedCmpTotal", pop_column)

pop_column = data.pop("PurchaseLastMonth")
data.insert(17, "PurchaseLastMonth", pop_column)

pop_column = data.pop("MntSpentTotal")
data.insert(11, "MntSpentTotal", pop_column)

pop_column = data.pop("ChildrenHome")
data.insert(10, "ChildrenHome", pop_column)

pop_column = data.pop("AcceptedCmp2")
data.insert(25, "AcceptedCmp2", pop_column)

pop_column = data.pop("AcceptedCmp1")
data.insert(25, "AcceptedCmp1", pop_column)

pop_column = data.pop("Response")
data.insert(30, "Response", pop_column)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype          
---  ------               --------------  -----          
 0   ID                   2240 non-null   int64          
 1   Year_Birth           2240 non-null   int64          
 2   Year_Old             2240 non-null   int64          
 3   CustomerFor          2240 non-null   timedelta64[ns]
 4   Dt_Customer          2240 non-null   datetime64[ns] 
 5   Education            2240 non-null   category       
 6   Marital_Status       2240 non-null   category       
 7   Income               2216 non-null   Int64          
 8   Kidhome              2240 non-null   int64          
 9   Teenhome             2240 non-null   int64          
 10  ChildrenHome         2240 non-null   int64          
 11  Recency              2240 non-null   int64          
 12  MntSpentTotal        2240 non-null   int64          
 13  MntWines          
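
The `pop`/`insert` pairs above rely on numeric positions that shift after every move, so the order of the calls matters. A sketch of an alternative that places a column relative to a named neighbour instead; the helper `move_after` and the toy frame are hypothetical, not part of the notebook:

```python
import pandas as pd

def move_after(df: pd.DataFrame, column: str, anchor: str) -> pd.DataFrame:
    """Return a new DataFrame with `column` placed directly after `anchor`."""
    cols = [c for c in df.columns if c != column]
    cols.insert(cols.index(anchor) + 1, column)
    return df[cols]

# Toy frame with the derived column appended at the end (illustrative values)
toy = pd.DataFrame(
    {"Kidhome": [1], "Teenhome": [0], "Recency": [78], "ChildrenHome": [1]}
)

toy = move_after(toy, "ChildrenHome", "Teenhome")
print(list(toy.columns))
```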

## Data Cleaning

In [13]:
# Missing Values

row_nan = data[data.isna().any(axis=1)]
print("Customers with missing values: ", len(row_nan))

data.drop(row_nan.index, inplace=True)

Customers with missing values:  24
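
The earlier `data.info()` output shows 2216 non-null values for Income, which accounts for all 24 incomplete rows. Before dropping rows, it can help to confirm which columns hold the gaps. A minimal sketch on a made-up frame (values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Income": [58138.0, np.nan, 71613.0, np.nan],
    "Recency": [58, 38, 26, 94],
})

# Missing values per column, keeping only the columns that have any
nan_per_column = toy.isna().sum()
nan_per_column = nan_per_column[nan_per_column > 0]
print(nan_per_column)

# dropna() removes every row that contains at least one missing value
cleaned = toy.dropna()
print(len(toy) - len(cleaned), "rows dropped")
```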


In [14]:
# Identifying customers with an invalid marital status and dropping them from the DataFrame

marital_filt = data[data["Marital_Status"].isin(['Alone', 'Absurd', 'YOLO'])]
print("Customers with a logically incoherent Marital Status: ", 
      len(marital_filt))

data.drop(marital_filt.index, inplace=True)

Customers with a logically incoherent Marital Status:  7
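
Since `Marital_Status` was cast to the `category` dtype, dropping the rows does not remove the labels from the list of categories; they can still show up as empty groups in later `groupby` or plotting steps. A sketch of cleaning them up with `cat.remove_unused_categories()`, on a made-up frame (values are illustrative only):

```python
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "Marital_Status": pd.Categorical(["Single", "YOLO", "Married", "Absurd"]),
})

invalid = ["Alone", "Absurd", "YOLO"]
toy = toy[~toy["Marital_Status"].isin(invalid)].copy()

# The dropped labels are still registered as categories at this point
print(toy["Marital_Status"].cat.categories.tolist())

# Remove the categories that no longer occur in the column
toy["Marital_Status"] = toy["Marital_Status"].cat.remove_unused_categories()
print(toy["Marital_Status"].cat.categories.tolist())
```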


### Kolmogorov-Smirnov test

In [15]:
# KS-Test on 'Year_Old' column
ks_result = kstest(data["Year_Old"], stats.norm.cdf, 
                   args=(data["Year_Old"].mean(), data["Year_Old"].std()))

print(f"Test statistic: {ks_result.statistic:.4f}")
print(f"p Value: {ks_result.pvalue:.4f}")

Test statistic: 0.0590
p Value: 0.0000


In [16]:
# KS-Test on 'Income' column
ks_result = kstest(data["Income"], stats.norm.cdf, 
                   args=(data["Income"].mean(), data["Income"].std()))

print(f"Test statistic: {ks_result.statistic:.4f}")
print(f"p Value: {ks_result.pvalue:.4f}")

Test statistic: 0.0542
p Value: 0.0000
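
Both p-values are far below the conventional 0.05 level, so the test rejects normality for `Year_Old` and `Income`. Two caveats: the tests run before the outliers are removed, and estimating the mean and standard deviation from the same sample makes the standard KS p-value only approximate. The sketch below shows the decision rule on synthetic data; the sample, the seed, and the `ALPHA` threshold are assumptions for illustration, not values from the notebook:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample (not the real Income column)
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=50_000, size=2_000)

# Same call pattern as above: compare against a normal distribution
# parameterised with the sample's own mean and standard deviation
result = stats.kstest(skewed, "norm", args=(skewed.mean(), skewed.std(ddof=1)))

ALPHA = 0.05  # conventional significance level, assumed for this sketch
print(f"Test statistic: {result.statistic:.4f}")
print("Reject normality" if result.pvalue < ALPHA else "Cannot reject normality")
```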


### Obtaining outliers

In [17]:
# Quartiles and IQR
quartiles = data["Year_Old"].quantile([0.25, 0.75])
iqr = quartiles[0.75] - quartiles[0.25]

# Identify Outliers
lower_bound = quartiles[0.25] - 1.5 * iqr
upper_bound = quartiles[0.75] + 1.5 * iqr

# Filtering
Year_Old_outliers = data[(data["Year_Old"] < lower_bound) | 
                (data["Year_Old"] > upper_bound)]

print("IQR: ", iqr)
print("outliers: ")
print(Year_Old_outliers[["ID", "Year_Old", "Income", "CustomerFor", 
                "Marital_Status", "MntSpentTotal"]])

# Removing the outliers from the dataset
data.drop(Year_Old_outliers.index, inplace=True)

IQR:  18.0
outliers: 
        ID  Year_Old  Income CustomerFor Marital_Status  MntSpentTotal
192   7829       124   36640   3936 days       Divorced             65
239  11004       131   60182   3703 days         Single             22
339   1150       125   83532   3936 days       Together           1853
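
The same quartile and fence calculation is repeated for Income in the next cell. A sketch of factoring it into a small helper; the name `iqr_bounds` and the toy series are hypothetical, and the default multiplier of 1.5 mirrors the cells in this section:

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy series with one obvious outlier (illustrative values only)
toy = pd.Series([1, 2, 3, 4, 5, 100])

lower, upper = iqr_bounds(toy)
outliers = toy[(toy < lower) | (toy > upper)]

print(lower, upper)
print(outliers.tolist())
```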


In [18]:
# Quartiles and IQR
quartiles = data["Income"].quantile([0.25, 0.75])
iqr = quartiles[0.75] - quartiles[0.25]

# Identify Outliers
lower_bound = quartiles[0.25] - 1.5 * iqr
upper_bound = quartiles[0.75] + 1.5 * iqr

# Filtering
income_outliers = data[(data["Income"] < lower_bound) | 
                (data["Income"] > upper_bound)]

print("IQR: ", iqr)
print("outliers: ")
print(income_outliers[["ID", "Year_Old", "Income", "CustomerFor", 
                "Marital_Status", "MntSpentTotal"]]
      .sort_values("Income", ascending=False))

# Removing only the implausible income outlier (index 2233, Income = 666666);
# the remaining high incomes are kept
data.drop(2233, inplace=True)

IQR:  33383.5
outliers: 
         ID  Year_Old  Income CustomerFor Marital_Status  MntSpentTotal
2233   9432        47  666666   4052 days       Together             62
617    1503        48  162397   4051 days       Together            107
687    1501        42  160803   4354 days        Married           1717
1300   5336        53  157733   4050 days       Together             59
164    8475        51  157243   3780 days        Married           1608
1653   4931        47  157146   4086 days       Together           1730
2132  11181        75  156924   3964 days        Married              8
655    5555        49  153924   3802 days       Divorced              6


## Saving new Dataset

In [20]:
data.to_csv("../data/ifood_customers_cleaned.csv")
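
By default `to_csv` also writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If the original row labels are not needed downstream, passing `index=False` avoids that. A sketch of the round trip, using an in-memory buffer instead of the real file (toy values only):

```python
import io
import pandas as pd

# Toy frame with a non-default index, standing in for the cleaned data
toy = pd.DataFrame(
    {"ID": [5883, 9094], "Income": [77981, 62972]},
    index=[1759, 1256],
)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # omit the index column
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```

CSV does not store dtypes, so columns such as `Education` (category) and `Dt_Customer` (datetime) would need to be cast again after reloading.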