# Data Cleaning and Data Wrangling

## Data

The dataset contains socio-demographic and firmographic features for 2,240
customers.

|Feature |Description|
|--:|---|
|AcceptedCmp1| 1 if customer accepted the offer in the 1st campaign, 0 otherwise|
|AcceptedCmp2| 1 if customer accepted the offer in the 2nd campaign, 0 otherwise|
|AcceptedCmp3| 1 if customer accepted the offer in the 3rd campaign, 0 otherwise|
|AcceptedCmp4| 1 if customer accepted the offer in the 4th campaign, 0 otherwise|
|AcceptedCmp5| 1 if customer accepted the offer in the 5th campaign, 0 otherwise|
|Response (target)| 1 if customer accepted the offer in the last campaign, 0 otherwise|
|Complain| 1 if customer complained in the last 2 years, 0 otherwise|
|Dt_Customer| date of customer's enrollment with the company|
|Education| customer's level of education|
|Marital_Status| customer's marital status|
|Kidhome| number of small children in customer's household|
|Teenhome| number of teenagers in customer's household|
|Income| customer's yearly household income|
|MntFishProducts| amount spent on fish products in the last 2 years|
|MntMeatProducts| amount spent on meat products in the last 2 years|
|MntFruits| amount spent on fruit in the last 2 years|
|MntSweetProducts| amount spent on sweet products in the last 2 years|
|MntWines| amount spent on wine in the last 2 years|
|MntGoldProds| amount spent on gold products in the last 2 years|
|NumDealsPurchases| number of purchases made with discount|
|NumCatalogPurchases| number of purchases made using catalog|
|NumStorePurchases| number of purchases made directly in stores|
|NumWebPurchases| number of purchases made through company's website|
|NumWebVisitsMonth| number of visits to company's website in the last month|
|Recency| number of days since the last purchase|
|Z_Revenue| revenue from the new gadget|
|Z_CostContact| cost of contact for the sixth campaign|

In [74]:
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime
from datetime import date

import warnings
warnings.filterwarnings("ignore")

### Data Exploration

In [75]:
# Storing path
path = Path("../data/ifood_customers.csv")

# Read CSV with pandas
data = pd.read_csv(path)

# showing a sample of the dataset
data.sample(8)

| |ID|Year_Birth|Education|Marital_Status|Income|Kidhome|Teenhome|Dt_Customer|Recency|MntWines|...|NumWebVisitsMonth|AcceptedCmp3|AcceptedCmp4|AcceptedCmp5|AcceptedCmp1|AcceptedCmp2|Complain|Z_CostContact|Z_Revenue|Response|
|--:|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|1158|7959|1961|Graduation|Married|79410.0|0|0|2014-05-29|19|658|...|1|0|0|0|0|0|0|3|11|0|
|79|1618|1965|Graduation|Together|56046.0|0|0|2013-01-02|9|577|...|8|1|0|0|0|0|0|3|11|1|
|1514|3865|1977|2n Cycle|Together|20981.0|0|0|2013-04-30|14|2|...|8|0|0|0|0|0|0|3|11|1|
|1725|2634|1979|Master|Single|16653.0|1|0|2014-04-18|10|5|...|6|0|0|0|0|0|0|3|11|1|
|1744|9710|1969|PhD|Divorced|58086.0|0|1|2013-01-20|80|708|...|8|0|0|0|0|0|0|3|11|0|
|159|2730|1955|Graduation|Single|80317.0|0|0|2013-08-20|64|536|...|1|0|0|0|0|0|0|3|11|0|
|789|347|1976|Graduation|Divorced|40780.0|0|1|2012-09-08|30|229|...|9|0|0|0|0|0|0|3|11|0|
|1220|10395|1986|Basic|Single|8940.0|1|0|2012-08-22|25|1|...|8|0|0|0|0|0|0|3|11|0|
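
In the sample above, `Z_CostContact` and `Z_Revenue` show the same value (3 and 11) in every row. If that holds for the full dataset, which the sample alone does not prove, those columns carry no information for modelling. A minimal sketch of how constant columns could be detected; the frame `toy` is a small stand-in with values copied from the sample, not the real `data`:

```python
import pandas as pd

# Toy stand-in for `data`; values are taken from the first three sampled rows above.
toy = pd.DataFrame({
    "Recency": [19, 9, 14],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

# Count distinct values per column; a column with at most one distinct value is constant.
unique_counts = toy.nunique()
constant_columns = unique_counts[unique_counts <= 1].index.tolist()

print(constant_columns)
```

On the real dataset the same two lines could be run on `data` instead of `toy`; whether to drop the flagged columns is a separate decision.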


In [76]:
# Exploring the column names

data.columns

Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
      dtype='object')

In [77]:
# Exploring unique values in some columns

print("\n'Education' Column Values:\n", "\t",
      data["Education"].unique(),
      sep=''
      )

print("\n'Marital_Status' Column Values:\n", "\t",
      data["Marital_Status"].unique(),
      sep=''
      )

print("\n'Dt_Customer' Column sample:\n",
      data["Dt_Customer"].sample(3), "\n", "\t",
      sep=''
      )



'Education' Column Values:
	['Graduation' 'PhD' 'Master' 'Basic' '2n Cycle']

'Marital_Status' Column Values:
	['Single' 'Together' 'Married' 'Divorced' 'Widow' 'Alone' 'Absurd' 'YOLO']

'Dt_Customer' Column sample:
1841    2013-06-03
1906    2013-09-06
1560    2013-02-14
Name: Dt_Customer, dtype: object
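
The `Marital_Status` values printed above include labels such as 'Alone', 'Absurd' and 'YOLO', which do not look like regular categories. One possible way to handle them is sketched below. The mapping is an assumption made for illustration (folding 'Alone' into 'Single' and treating 'Absurd' and 'YOLO' as missing), not a decision taken in this notebook, and `toy` and `marital_map` are hypothetical names:

```python
import pandas as pd

# Toy stand-in for `data`, holding exactly the labels printed above.
toy = pd.DataFrame({"Marital_Status": ["Single", "Together", "Married", "Divorced",
                                       "Widow", "Alone", "Absurd", "YOLO"]})

# Hypothetical mapping (an assumption): 'Alone' -> 'Single', joke labels -> missing.
marital_map = {"Alone": "Single", "Absurd": None, "YOLO": None}

# Labels found in the mapping are replaced; all other labels are kept unchanged.
toy["Marital_Status"] = toy["Marital_Status"].map(lambda v: marital_map.get(v, v))

# Casting to 'category' after cleaning keeps the irregular labels out of the categories.
toy["Marital_Status"] = toy["Marital_Status"].astype("category")

print(toy["Marital_Status"].value_counts(dropna=False))
```

Cleaning before the cast to `category` matters here: if the cast happens first, the irregular labels remain registered as categories until they are removed explicitly.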
	


In [78]:
# Cast Education and Marital_Status to the 'category' dtype
# and parse Dt_Customer as datetime

data["Education"] = data["Education"].astype("category")
data["Marital_Status"] = data["Marital_Status"].astype("category")
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], format="%Y-%m-%d")
# Year_Birth is kept as an integer year; the datetime conversion is left disabled
# data["Year_Birth"] = pd.to_datetime(data["Year_Birth"], format="%Y")

data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   ID                   2240 non-null   int64         
 1   Year_Birth           2240 non-null   int64         
 2   Education            2240 non-null   category      
 3   Marital_Status       2240 non-null   category      
 4   Income               2216 non-null   float64       
 5   Kidhome              2240 non-null   int64         
 6   Teenhome             2240 non-null   int64         
 7   Dt_Customer          2240 non-null   datetime64[ns]
 8   Recency              2240 non-null   int64         
 9   MntWines             2240 non-null   int64         
 10  MntFruits            2240 non-null   int64         
 11  MntMeatProducts      2240 non-null   int64         
 12  MntFishProducts      2240 non-null   int64         
 13  MntSweetProducts     2240 non-nul
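
The `info()` output above reports 2216 non-null values for `Income` out of 2240 rows, so 24 customers have no income recorded. A minimal sketch of how the gaps could be inspected and filled follows. Median imputation is only one option and is an assumption here, not a choice made in this notebook; `toy` and `Income_filled` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data` with one missing income.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Income": [58138.0, np.nan, 71613.0, 26646.0],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()

# Rows where Income is missing, for manual inspection.
rows_missing_income = toy[toy["Income"].isna()]

# Assumption: fill gaps with the median, which is less sensitive to outliers than the mean.
toy["Income_filled"] = toy["Income"].fillna(toy["Income"].median())

print(missing_per_column)
print(rows_missing_income)
```

Writing the result to a separate column keeps the original values available, so the effect of the imputation can still be checked later.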

In [79]:
data["Year_Birth"]

0       1957
1       1954
2       1965
3       1984
4       1981
        ... 
2235    1967
2236    1946
2237    1981
2238    1956
2239    1954
Name: Year_Birth, Length: 2240, dtype: int64

In [80]:
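# Rows with a missing Income can be inspected with:
# data[data["Income"].isna()]

# Today's date at midnight, used as the reference point for the derived columns
today = pd.Timestamp.today().normalize()

# Approximate age in years (ignores month and day of birth)
data["Year_Old"] = today.year - data["Year_Birth"]
# Time elapsed since enrollment, stored as a Timedelta
data["CustomerFor"] = today - data["Dt_Customer"]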
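
Because `today` changes with every run, the derived columns drift over time and the notebook's outputs cannot be reproduced exactly. A minimal sketch of an alternative uses a fixed reference date and stores the tenure as an integer number of days. The date `2015-01-01` and the names `reference_date`, `Age` and `CustomerForDays` are assumptions made for illustration; the two rows are copied from the sample shown earlier:

```python
import pandas as pd

# Toy stand-in for `data`, using two rows from the sample output.
toy = pd.DataFrame({
    "Year_Birth": [1961, 1965],
    "Dt_Customer": pd.to_datetime(["2014-05-29", "2013-01-02"], format="%Y-%m-%d"),
})

# Hypothetical fixed reference date (an assumption), so results do not depend on the run date.
reference_date = pd.Timestamp("2015-01-01")

# Approximate age in years at the reference date (ignores month and day of birth).
toy["Age"] = reference_date.year - toy["Year_Birth"]

# Tenure as an integer number of days instead of a Timedelta.
toy["CustomerForDays"] = (reference_date - toy["Dt_Customer"]).dt.days

print(toy)
```

An integer column is usually easier to feed into models and summary statistics than a `timedelta64` column.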



In [81]:
# Check the derived columns

data[["Year_Old", "CustomerFor"]]

| |Year_Old|CustomerFor|
|--:|--:|---|
|0|67|4320 days|
|1|70|3770 days|
|2|59|3969 days|
|3|40|3796 days|
|4|43|3818 days|
|...|...|...|
|2235|57|4038 days|
|2236|78|3676 days|
|2237|43|3812 days|
|2238|68|3813 days|
