# Introduction

This notebook cleans the **Customer Personality Analysis** dataset, focusing on outliers.


[Dataset](https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis)

In [3]:
# Import Standard Modules
import pandas as pd

import plotly.express as ex

# Set Pandas Options
pd.set_option('display.max_columns', 500)

# Read Data

In [4]:
# Read data from csv
data = pd.read_csv('../data/marketing_campaign.csv', sep='\t', encoding='latin1')

# Cleaning Outliers

Possible approaches:
1. Drop outliers - This technique can drastically reduce the amount of data.
2. Cap outliers - This technique is useful when we can assume that all outliers express the same behavior or pattern, so the model would not learn anything new from them.
3. Fill using mean - This technique replaces each outlier with the column mean.
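The three approaches above can be sketched on a toy series. This is only an illustration: the values are made up, the outlier rule is the 1.5 * IQR rule used later in this notebook, and filling with the mean of the non-outlier values (rather than the mean of all values) is an assumption.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0], name="value")

# IQR fences used to flag outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)

# 1. Drop outliers: fewer rows remain
dropped = s[~is_outlier]

# 2. Cap outliers: values beyond the fences are clipped to the fences
capped = s.clip(lower=lower, upper=upper)

# 3. Fill using mean: outliers are replaced by the mean of the
#    non-outlier values (assumption, see the note above)
filled = s.mask(is_outlier, s[~is_outlier].mean())
```

Dropping changes the number of rows, while capping and filling keep the shape of the data and only alter the flagged values.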

## Interquartile Range (IQR)
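For reference, the rule applied in this section flags a value $x$ as an outlier when it falls outside the fences built from the first and third quartiles, with the multiplier 1.5 used in the filter cell:

$$
\mathrm{IQR} = Q_3 - Q_1, \qquad x < Q_1 - 1.5\,\mathrm{IQR} \quad \text{or} \quad x > Q_3 + 1.5\,\mathrm{IQR}
$$

As a worked example with the `Income` values printed in the outputs of this section ($Q_1 = 35303$, $Q_3 = 68522$):

$$
\mathrm{IQR} = 68522 - 35303 = 33219, \qquad Q_1 - 1.5\,\mathrm{IQR} = -14525.5, \qquad Q_3 + 1.5\,\mathrm{IQR} = 118350.5
$$

so only incomes above 118350.5 would be flagged, since the lower fence is negative.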

In [5]:
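# Compute Q1 and Q3 on numeric columns only (pandas >= 2.0 no longer skips non-numeric columns by default)
q1 = data.quantile(0.25, numeric_only=True)
q3 = data.quantile(0.75, numeric_only=True)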

In [18]:
q1

ID                      2828.25
Year_Birth              1959.00
Income                 35303.00
Kidhome                    0.00
Teenhome                   0.00
Recency                   24.00
MntWines                  23.75
MntFruits                  1.00
MntMeatProducts           16.00
MntFishProducts            3.00
MntSweetProducts           1.00
MntGoldProds               9.00
NumDealsPurchases          1.00
NumWebPurchases            2.00
NumCatalogPurchases        0.00
NumStorePurchases          3.00
NumWebVisitsMonth          3.00
AcceptedCmp3               0.00
AcceptedCmp4               0.00
AcceptedCmp5               0.00
AcceptedCmp1               0.00
AcceptedCmp2               0.00
Complain                   0.00
Z_CostContact              3.00
Z_Revenue                 11.00
Response                   0.00
Name: 0.25, dtype: float64

In [20]:
q3

ID                      8427.75
Year_Birth              1977.00
Income                 68522.00
Kidhome                    1.00
Teenhome                   1.00
Recency                   74.00
MntWines                 504.25
MntFruits                 33.00
MntMeatProducts          232.00
MntFishProducts           50.00
MntSweetProducts          33.00
MntGoldProds              56.00
NumDealsPurchases          3.00
NumWebPurchases            6.00
NumCatalogPurchases        4.00
NumStorePurchases          8.00
NumWebVisitsMonth          7.00
AcceptedCmp3               0.00
AcceptedCmp4               0.00
AcceptedCmp5               0.00
AcceptedCmp1               0.00
AcceptedCmp2               0.00
Complain                   0.00
Z_CostContact              3.00
Z_Revenue                 11.00
Response                   0.00
Name: 0.75, dtype: float64

In [13]:
# Compute the IQR
iqr = q3 - q1

In [17]:
iqr

ID                      5599.5
Year_Birth                18.0
Income                 33219.0
Kidhome                    1.0
Teenhome                   1.0
Recency                   50.0
MntWines                 480.5
MntFruits                 32.0
MntMeatProducts          216.0
MntFishProducts           47.0
MntSweetProducts          32.0
MntGoldProds              47.0
NumDealsPurchases          2.0
NumWebPurchases            4.0
NumCatalogPurchases        4.0
NumStorePurchases          5.0
NumWebVisitsMonth          4.0
AcceptedCmp3               0.0
AcceptedCmp4               0.0
AcceptedCmp5               0.0
AcceptedCmp1               0.0
AcceptedCmp2               0.0
Complain                   0.0
Z_CostContact              0.0
Z_Revenue                  0.0
Response                   0.0
dtype: float64

In [14]:
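# Keep only the outlier values based on IQR (all other cells become NaN)
# Restrict to the numeric columns present in iqr so the comparison aligns without reindexing
numeric = data[iqr.index]
outliers = numeric[(numeric < (q1 - 1.5*iqr)) | (numeric > (q3 + 1.5*iqr))]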
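Indexing a DataFrame with a same-shaped boolean mask keeps the original shape and leaves NaN wherever the condition is false, so the result is not yet a filtered dataset. A minimal sketch of two possible follow-up steps, counting outliers per column and dropping rows that contain any outlier, is shown below. The toy frame and the names `df`, `is_outlier`, `outliers_per_column` and `rows_without_outliers` are hypothetical and only stand in for the numeric columns of `data`.

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for the numeric columns of `data`
df = pd.DataFrame({
    "Income": [30000.0, 42000.0, 51000.0, 58000.0, 66000.0, 250000.0],
    "Recency": [5.0, 20.0, 40.0, 60.0, 80.0, 99.0],
})

q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Boolean mask: True where a cell lies outside the 1.5 * IQR fences
is_outlier = (df < (q1 - 1.5 * iqr)) | (df > (q3 + 1.5 * iqr))

# Number of flagged values in each column
outliers_per_column = is_outlier.sum()

# Rows in which no column is flagged
rows_without_outliers = df[~is_outlier.any(axis=1)]
```

Counting first is a cheap way to see how much data the "drop" approach would remove before committing to it.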

