# Data Cleaning and Data Wrangling

## Data

The dataset contains socio-demographic and firmographic features for 2,240 customers.

|Feature |Description|
|--:|---|
|AcceptedCmp1| 1 if customer accepted the offer in the 1st campaign, 0 otherwise|
|AcceptedCmp2| 1 if customer accepted the offer in the 2nd campaign, 0 otherwise|
|AcceptedCmp3| 1 if customer accepted the offer in the 3rd campaign, 0 otherwise|
|AcceptedCmp4| 1 if customer accepted the offer in the 4th campaign, 0 otherwise|
|AcceptedCmp5| 1 if customer accepted the offer in the 5th campaign, 0 otherwise|
|Response (target)| 1 if customer accepted the offer in the last campaign, 0 otherwise|
|Complain| 1 if customer complained in the last 2 years, 0 otherwise|
|Dt_Customer| date of customer's enrollment with the company|
|Education| customer's level of education|
|Marital_Status| customer's marital status|
|Kidhome| number of small children in customer's household|
|Teenhome| number of teenagers in customer's household|
|Income| customer's yearly household income|
|MntFishProducts| amount spent on fish products in the last 2 years|
|MntMeatProducts| amount spent on meat products in the last 2 years|
|MntFruits| amount spent on fruit products in the last 2 years|
|MntSweetProducts| amount spent on sweet products in the last 2 years|
|MntWines| amount spent on wine products in the last 2 years|
|MntGoldProds| amount spent on gold products in the last 2 years|
|NumDealsPurchases| number of purchases made with discount|
|NumCatalogPurchases| number of purchases made using catalog|
|NumStorePurchases| number of purchases made directly in stores|
|NumWebPurchases| number of purchases made through company's web site|
|NumWebVisitsMonth| number of visits to company's web site in the last month|
|Recency|number of days since the last purchase|
|Z_Revenue|revenue from the new gadget|
|Z_CostContact|cost of contact for the sixth campaign|

In [17]:
import numpy as np
import pandas as pd
from pathlib import Path

import warnings
warnings.filterwarnings("ignore")

### Data Exploration

In [18]:
# Storing path
path = Path("../data/ifood_customers.csv")

# Read CSV with pandas
data = pd.read_csv(path)

# showing a sample of the dataset
data.sample(8)

| |ID|Year_Birth|Education|Marital_Status|Income|Kidhome|Teenhome|Dt_Customer|Recency|MntWines|...|NumWebVisitsMonth|AcceptedCmp3|AcceptedCmp4|AcceptedCmp5|AcceptedCmp1|AcceptedCmp2|Complain|Z_CostContact|Z_Revenue|Response|
|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
|167|3712|1959|Graduation|Divorced|52332.0|0|0|2013-08-28|63|212|...|4|0|1|0|0|0|0|3|11|0|
|592|4501|1965|Master|Single|69882.0|0|0|2013-11-10|94|292|...|1|0|0|0|0|0|0|3|11|0|
|1858|9029|1972|PhD|Married|70116.0|0|0|2013-01-26|73|707|...|1|0|0|0|0|0|0|3|11|0|
|116|1592|1970|Graduation|Married|90765.0|0|0|2014-01-24|25|547|...|1|0|0|1|1|0|0|3|11|0|
|1127|1010|1977|Graduation|Together|46931.0|2|1|2014-04-24|94|41|...|3|0|0|0|0|0|0|3|11|0|
|1900|10789|1964|PhD|Married|45759.0|1|1|2013-02-23|13|42|...|7|0|0|0|0|0|0|3|11|0|
|1698|10356|1957|PhD|Divorced|41437.0|1|1|2012-09-22|5|29|...|7|0|0|0|0|0|0|3|11|0|
|522|9214|1991|Graduation|Married|42691.0|0|0|2013-08-16|48|179|...|5|0|0|0|0|0|0|3|11|0|


In [19]:
# Exploring the column names

data.columns

Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
      dtype='object')
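
A common next step after listing the columns is to count missing values and duplicated rows before changing any dtypes. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `data`, and its values are made up rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are invented.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Income": [52332.0, np.nan, 70116.0, 70116.0],
    "Education": ["Graduation", "Master", "PhD", "PhD"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows that repeat an earlier row in every column
n_duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", n_duplicate_rows)
```

The same two calls could be run on `data` directly; whether the real dataset has missing or duplicated rows is not shown here.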

In [20]:
# Exploring unique values in some columns

print("\n'Education' Column Values:\n", "\t",
      data["Education"].unique(),
      sep=''
      )

print("\n'Marital_Status' Column Values:\n", "\t",
      data["Marital_Status"].unique(),
      sep=''
      )

print("\n'Dt_Customer' Column sample:\n",
      data["Dt_Customer"].sample(3), "\n", "\t",
      sep=''
      )



'Education' Column Values:
	['Graduation' 'PhD' 'Master' 'Basic' '2n Cycle']

'Marital_Status' Column Values:
	['Single' 'Together' 'Married' 'Divorced' 'Widow' 'Alone' 'Absurd' 'YOLO']

'Dt_Customer' Column sample:
1348    2013-08-31
1849    2014-02-02
1905    2013-05-11
Name: Dt_Customer, dtype: object
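
The `Marital_Status` values above include labels such as 'Alone', 'Absurd' and 'YOLO' that do not look like regular categories. One possible way to handle them, sketched on a hypothetical `toy` frame, is to merge 'Alone' into 'Single' and treat 'Absurd' and 'YOLO' as missing. That mapping is an assumption made for illustration, not a rule stated by the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the labels mirror those printed for Marital_Status.
toy = pd.DataFrame(
    {"Marital_Status": ["Single", "Alone", "Married", "Absurd", "YOLO", "Widow"]}
)

status = toy["Marital_Status"]

# Assumption: 'Alone' describes the same situation as 'Single'
status = status.replace({"Alone": "Single"})

# Assumption: 'Absurd' and 'YOLO' carry no usable information,
# so they are set to missing (where() keeps values only where the condition holds)
status = status.where(~status.isin(["Absurd", "YOLO"]))

# Cast after cleaning so the unwanted labels never become categories
cleaned = status.astype("category")

print(cleaned.cat.categories.tolist())
print("missing:", int(cleaned.isna().sum()))
```

Doing the cleanup before the cast to `category` avoids leaving unused categories behind.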
	


In [26]:
# Casting Education and Marital_Status to the 'category' dtype,
# and Dt_Customer and Year_Birth to a datetime dtype

data["Education"] = data["Education"].astype("category")
data["Marital_Status"] = data["Marital_Status"].astype("category")
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], format="%Y-%m-%d")
# Year_Birth holds only a year, so each value becomes January 1st of that year
data["Year_Birth"] = pd.to_datetime(data["Year_Birth"], format="%Y")
#data["Year_Birth"] = data["Year_Birth"].dt.year

# Earlier per-column checks, kept for reference:
# print("\n'Marital_Status' Data Type: ", data["Marital_Status"].dtype, sep='')
# print("\n'Education' Data Type: ", data["Education"].dtype, sep='')

# Printing the dtype of every column
for col in data.columns:
    print(f"column {col} with dtype '{data[col].dtype}'")

column ID with dtype 'int64'
column Year_Birth with dtype 'datetime64[ns]'
column Education with dtype 'category'
column Marital_Status with dtype 'category'
column Income with dtype 'float64'
column Kidhome with dtype 'int64'
column Teenhome with dtype 'int64'
column Dt_Customer with dtype 'datetime64[ns]'
column Recency with dtype 'int64'
column MntWines with dtype 'int64'
column MntFruits with dtype 'int64'
column MntMeatProducts with dtype 'int64'
column MntFishProducts with dtype 'int64'
column MntSweetProducts with dtype 'int64'
column MntGoldProds with dtype 'int64'
column NumDealsPurchases with dtype 'int64'
column NumWebPurchases with dtype 'int64'
column NumCatalogPurchases with dtype 'int64'
column NumStorePurchases with dtype 'int64'
column NumWebVisitsMonth with dtype 'int64'
column AcceptedCmp3 with dtype 'int64'
column AcceptedCmp4 with dtype 'int64'
column AcceptedCmp5 with dtype 'int64'
column AcceptedCmp1 with dtype 'int64'
column AcceptedCmp2 with dtype 'int64'
colum