# Introduction

This notebook cleans the **Customer Personality Analysis** dataset.


[Dataset](https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis)

In [3]:
# Import Standard Modules
import pandas as pd

import plotly.express as ex

# Set Pandas Options
pd.set_option('display.max_columns', 500)

# Read Data

In [4]:
# Read data from csv
data = pd.read_csv('../data/marketing_campaign.csv', sep='\t', encoding='latin1')

# Cleaning Outliers

Possible approaches:
1. Drop outliers - This technique can drastically reduce the amount of data.
2. Cap outliers - This technique is useful when we can assume that all outliers express the same behavior or pattern, so the model wouldn't learn anything new from them.
3. Fill using the mean - This technique replaces each outlier with the mean of its column.
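
The three approaches above can be sketched on a small toy series. This is a minimal illustration, not the notebook's actual cleaning step: the helper `iqr_bounds`, the toy values, and the choice to compute the fill mean from the non-outlier values only are all assumptions made for the example. Outliers are flagged with the 1.5 × IQR rule described in the next section.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data (assumption): five typical values and one obvious outlier
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0], name='value')

lower, upper = iqr_bounds(toy)
is_outlier = (toy < lower) | (toy > upper)

# 1. Drop outliers: keep only the rows inside the fences
dropped = toy[~is_outlier]

# 2. Cap outliers: pull values outside the fences back to the nearest fence
capped = toy.clip(lower=lower, upper=upper)

# 3. Fill using the mean: replace outliers with the mean of the remaining values
mean_filled = toy.mask(is_outlier, toy[~is_outlier].mean())
```

Dropping shortens the series, while capping and mean filling keep its length and only change the flagged values.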

## Interquartile Range (IQR)

In [None]:
# Define the list of columns for which to compute the IQR
iqr_columns = ['Year_Birth', 'Income', 'MntWines', 'MntFruits']

In [25]:
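# Compute Q1 and Q3 for every numeric column except 'Complain'
# (numeric_only=True skips text columns, which pandas >= 2.0 would otherwise reject)
q1 = data.drop(columns=['Complain']).quantile(0.25, numeric_only=True)
q3 = data.drop(columns=['Complain']).quantile(0.75, numeric_only=True)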

In [26]:
# Compute the IQR
iqr = q3 - q1

In [27]:
# Check lower filtering range
q1 - 1.5*iqr

ID                     -5571.0
Year_Birth              1932.0
Income                -14525.5
Kidhome                   -1.5
Teenhome                  -1.5
Recency                  -51.0
MntWines                -697.0
MntFruits                -47.0
MntMeatProducts         -308.0
MntFishProducts          -67.5
MntSweetProducts         -47.0
MntGoldProds             -61.5
NumDealsPurchases         -2.0
NumWebPurchases           -4.0
NumCatalogPurchases       -6.0
NumStorePurchases         -4.5
NumWebVisitsMonth         -3.0
AcceptedCmp3               0.0
AcceptedCmp4               0.0
AcceptedCmp5               0.0
AcceptedCmp1               0.0
AcceptedCmp2               0.0
Z_CostContact              3.0
Z_Revenue                 11.0
Response                   0.0
dtype: float64

In [28]:
# Check upper filtering range
q3 + 1.5*iqr

ID                      16827.0
Year_Birth               2004.0
Income                 118350.5
Kidhome                     2.5
Teenhome                    2.5
Recency                   149.0
MntWines                 1225.0
MntFruits                  81.0
MntMeatProducts           556.0
MntFishProducts           120.5
MntSweetProducts           81.0
MntGoldProds              126.5
NumDealsPurchases           6.0
NumWebPurchases            12.0
NumCatalogPurchases        10.0
NumStorePurchases          15.5
NumWebVisitsMonth          13.0
AcceptedCmp3                0.0
AcceptedCmp4                0.0
AcceptedCmp5                0.0
AcceptedCmp1                0.0
AcceptedCmp2                0.0
Z_CostContact               3.0
Z_Revenue                  11.0
Response                    0.0
dtype: float64
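
In the output above, the binary columns (`AcceptedCmp1` to `AcceptedCmp5`, `Response`) have lower and upper bounds of 0, because their IQR is 0. Applying the rule to them would flag every row with a value of 1, which is one reason to restrict the filter to a short list of continuous columns. The following is a minimal sketch of such a filter. It runs on a toy DataFrame with invented values rather than the real CSV, and the names `toy`, `cols`, `in_range`, `outlier_counts`, and `filtered` are assumptions made for the example.

```python
import pandas as pd

# Toy data (assumption): the last row is implausible in both columns
toy = pd.DataFrame({
    'Year_Birth': [1970, 1975, 1980, 1985, 1990, 1893],
    'Income': [30000.0, 40000.0, 50000.0, 60000.0, 70000.0, 666666.0],
})

# Only the continuous columns are checked
cols = ['Year_Birth', 'Income']

q1 = toy[cols].quantile(0.25)
q3 = toy[cols].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Compare each column against its own bounds (axis=1 aligns the bounds with the columns)
in_range = toy[cols].ge(lower, axis=1) & toy[cols].le(upper, axis=1)

# Number of flagged values per column, useful before deciding how to treat them
outlier_counts = (~in_range).sum()

# Keep only the rows where every checked column is inside its bounds
filtered = toy[in_range.all(axis=1)]
```

Checking `outlier_counts` first shows how many rows each column would cost before committing to dropping, capping, or filling.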