# __Data Preprocessing__

## __Dealing with Missing Values__

Customer lifetime value (CLV) is the total monetary worth of a customer to a business over the course of their relationship. This section walks through several missing-value imputation techniques using simulated customer lifetime value data:

- Checking the number of null values
- Dropping null values
- Mean/median/mode imputation
- Multiple imputation using regression
- Imputation using nearest neighbors

### __Import the Libraries__

In [2]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, KNNImputer

In [26]:
df = pd.read_csv('datasets/clv_data.csv')

# Real CLV calculations are complex, so this example assumes a flat value of 20 per purchase
df['lifetime_value'] = df['purchases'] * 20

df.head()

Unnamed: 0.1,Unnamed: 0,id,age,gender,income,days_on_platform,city,purchases,lifetime_value
0,0,0,,Male,126895,14.0,San Francisco,0,0
1,1,1,,Male,161474,14.0,Tokyo,0,0
2,2,2,24.0,Male,104723,34.0,London,1,20
3,3,3,29.0,Male,43791,28.0,London,2,40
4,4,4,18.0,Female,132181,26.0,London,2,40


### __Checking null values__

A sensible first step in any data analysis or ML workflow is to count the null values in each column.

In [8]:
df.isnull().sum()

Unnamed: 0             0
id                     0
age                 2446
gender                 0
income                 0
days_on_platform     141
city                   0
purchases              0
lifetime_value         0
dtype: int64
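
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not the CLV data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the CLV dataset
toy = pd.DataFrame({
    'age': [np.nan, np.nan, 24.0, 29.0, 18.0],
    'days_on_platform': [14.0, np.nan, 34.0, 28.0, 26.0],
    'income': [126895, 161474, 104723, 43791, 132181],
})

# Null count per column, plus the same figure as a percentage of all rows
null_summary = pd.DataFrame({
    'null_count': toy.isnull().sum(),
    'null_pct': toy.isnull().mean() * 100,
})
print(null_summary)
```

A column with a high null percentage is usually a poor candidate for row-wise dropping, since most of the dataset would go with it.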

### __Dropping null values__

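Dropping rows that contain nulls is the quickest way to handle missing values, but it discards every row with at least one null. Here that includes the 2,446 rows with a missing `age`.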

In [9]:
drop_df = df.copy()

drop_df = drop_df.dropna()
drop_df.isnull().sum()

Unnamed: 0          0
id                  0
age                 0
gender              0
income              0
days_on_platform    0
city                0
purchases           0
lifetime_value      0
dtype: int64
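
Before committing to this approach, it can help to measure how many rows are lost. The sketch below uses a hypothetical `toy` frame (not the CLV data) to compare `dropna()` with `dropna(subset=...)`, which only drops rows that are null in the named columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the CLV dataset
toy = pd.DataFrame({
    'age': [np.nan, np.nan, 24.0, 29.0, 18.0],
    'days_on_platform': [14.0, np.nan, 34.0, 28.0, np.nan],
    'purchases': [0, 0, 1, 2, 2],
})

# Drop every row that contains at least one null
dropped_any = toy.dropna()

# Drop only rows where 'age' is null, keeping nulls in other columns
dropped_age = toy.dropna(subset=['age'])

rows_lost = len(toy) - len(dropped_any)
print(f"Rows before: {len(toy)}, after dropna(): {len(dropped_any)}, lost: {rows_lost}")
print(f"Rows after dropna(subset=['age']): {len(dropped_age)}")
```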

#### __Divide dataset into training and testing sets__

In [10]:
# Create feature and target variables
X = drop_df[['age', 'gender', 'days_on_platform', 'income']]
y = drop_df['lifetime_value']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### __Mean/Median/Mode Imputation for missing values__

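Mean, median, or mode imputation fills each null with a summary statistic of its column: typically the mean or median for numeric columns and the mode for categorical ones. The example below uses the mean.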

In [38]:
m_df = df.copy()

# Create feature and target variables
X = m_df[['age', 'gender', 'days_on_platform', 'income']]
y = m_df['lifetime_value']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [39]:
# Compute the fill values (here, the mean) on the training set only,
# then reuse them for the testing set to avoid data leakage
fill_values = {
    'age': X_train['age'].mean(),
    'days_on_platform': X_train['days_on_platform'].mean(),
}

# Reassign instead of using inplace=True, which can trigger SettingWithCopyWarning on split data
X_train = X_train.fillna(fill_values)
X_test = X_test.fillna(fill_values)
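
The cell above uses the mean; the median and the mode follow the same pattern. As a minimal sketch, assuming hypothetical `train` and `test` frames in which `gender` also has gaps (it has none in the CLV data), the fill values are again derived from the training set only:

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames, for illustration only
train = pd.DataFrame({
    'age': [20.0, 30.0, np.nan, 70.0],
    'gender': ['Male', 'Female', 'Male', None],
})
test = pd.DataFrame({
    'age': [np.nan, 40.0],
    'gender': [None, 'Female'],
})

# Fill values are computed from the training set only
fill_values = {
    'age': train['age'].median(),         # median is less sensitive to outliers than the mean
    'gender': train['gender'].mode()[0],  # mode is the most frequent category
}

train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```

The median is often preferred over the mean for skewed numeric columns such as income, while the mode is the usual choice for categorical columns, where a mean is undefined.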