# __Data Preprocessing__

## __Train-Test split__

A train-test split is a model validation procedure that simulates how a model would perform on new, unseen data by splitting a dataset into a training set and a testing set.

- The training set is the data used to train the model, and the testing set (which is new to the model) is used to evaluate the model's performance.
- The data can also be split three ways by adding a validation set, which is used to fine-tune hyperparameters and optimize the model during the training process, so that the testing set stays untouched until the final evaluation.

<img src="https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/1_train-test-split_0.jpg" width=700 />

#### __Methods for Splitting Data in a Train Test Split__

Some common methods of splitting data in a train test split:

1. __Random Splitting:__ involves randomly shuffling data and splitting it into training and testing sets based on given percentages (like 75% training and 25% testing).
   
2. __Stratified Splitting:__ divides a dataset in a way that preserves its proportion of classes or categories, so the training and testing sets have class proportions representative of the original dataset. This keeps rare classes from being under-represented (or missing entirely) in either set, which matters most for imbalanced datasets. In scikit-learn, pass the labels to the `stratify` parameter of the `train_test_split()` function.

3. __Time-Based Splitting:__ involves ordering data by time, ensuring past data is in the training set and later data is in the testing set. Splitting on time simulates real-world forecasting scenarios (for example, predicting future financial or market trends) and avoids training on information from the future. One drawback is that it may not fully capture trends in non-stationary data (data whose statistical properties, such as mean and variance, change over time). In scikit-learn, the `TimeSeriesSplit` cross-validator produces successive train/test splits in which every test fold comes after its training fold.
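The three splitting methods above can be sketched with scikit-learn as follows. This is a minimal illustration on made-up data: the array shapes, the 90/10 class imbalance, and the variable names are assumptions chosen for the example, and the rows are simply assumed to be in chronological order for the time-based part.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Toy data (hypothetical): 200 samples, 2 features, 90/10 class imbalance
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.array([0] * 180 + [1] * 20)

# 1. Random split: shuffle, then 75% train / 25% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 2. Stratified split: both sets keep the 90/10 class proportions
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# 3. Time-based split: each test fold comes strictly after its training fold
tscv = TimeSeriesSplit(n_splits=3)
folds = [(train_idx, test_idx) for train_idx, test_idx in tscv.split(X)]
```

With a plain random split the share of the rare class in the test set can drift from 10%, whereas the stratified split holds it fixed. `TimeSeriesSplit` never shuffles, so it is only meaningful when the rows are already sorted by time.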

## __Data Standardization__

Data standardization is needed when features of the input data set have large differences between their ranges, or when they are measured in different units. It puts the features on a common scale, so that no feature dominates a machine learning or statistical model simply because of its units.

- Standardizing data helps scale-sensitive algorithms (for example, distance-based methods such as k-nearest neighbors and models trained with gradient descent) treat all features comparably.
- Z-score normalization, or standardization, is one of the most popular methods to standardize data.
- With this method, data is transformed to have a mean of 0 and a standard deviation of 1, giving all data points the same scale.
- In scikit-learn, this is implemented by the `StandardScaler` class.
- `StandardScaler` provides three main methods: `fit()`, `transform()`, and `fit_transform()`.
- `fit()` method - takes the dataset we aim to standardize as an argument and computes the mean and standard deviation of each feature.
- `transform()` method - applies the statistics learned by `.fit()` to every feature value, subtracting the mean and dividing by the standard deviation.
- `fit_transform()` method - does both `.fit()` and `.transform()` in a single call on the same data, which is more convenient than calling them separately.
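As a small sketch of what these three methods do, the example below standardizes a made-up array (the ages and incomes are hypothetical values chosen for illustration) and compares the result with a manual z-score computation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: age (years), income (dollars)
X = np.array([[25.0, 40_000.0],
              [35.0, 60_000.0],
              [45.0, 80_000.0]])

scaler = StandardScaler()
scaler.fit(X)                    # learns the per-feature mean_ and scale_ (std)
X_scaled = scaler.transform(X)   # applies (x - mean_) / scale_

# fit_transform() gives the same result in one call
X_scaled_2 = StandardScaler().fit_transform(X)

# Manual z-score for comparison (StandardScaler uses the population std, ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, each column has mean 0 and standard deviation 1, so age and income contribute on the same scale.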

### __Should we perform `.fit_transform()` before or after the split of training and test data?__

Normalization/standardization should be fitted after splitting the data into train and test sets, using the training set only. The reason is to avoid data leakage. The same applies to `CountVectorizer()`, which learns a vocabulary from the text messages and counts how often each word occurs.

__Data Leakage:__ Data leakage happens when information from outside the training set is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know and in turn invalidate the estimated performance of the model being constructed.

The testing data points stand in for real-world data that the model has not seen. Feature normalization (or data standardization) of the explanatory (or predictor) variables centers and scales the data by subtracting the mean and dividing by the standard deviation. If you compute the mean and standard deviation on the whole dataset, you introduce information from the test set (i.e. future information) into the explanatory variables through those statistics.

- We perform standardization with `fit_transform()` on the training set and `transform()` on the testing set.
- `fit_transform()` on the train data calculates the mean and standard deviation of the train data and standardizes it. `transform()` on the test data then applies the statistics calculated from the train set to the test set. This prevents data leakage, i.e. learning something from the test data, so that the selected model is tested accurately.
- Using `fit_transform()` on the entire dataset could make the model look better than it really is, because the preprocessing would have prior knowledge of the test data (its mean and standard deviation, or its vocabulary in the case of `CountVectorizer()`), leading to an unrealistic assessment of performance.
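The split-then-fit order described above might look like the sketch below. The numeric data and the three short text messages are invented for illustration; only the order of the calls is the point.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # hypothetical features
y = rng.integers(0, 2, size=200)                     # hypothetical labels

# Split first, so the test rows play no part in fitting the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

# Same pattern for text: the vocabulary is learned from the training messages
train_texts = ["free prize now", "see you at lunch"]
test_texts = ["free lunch tomorrow"]
vec = CountVectorizer()
train_counts = vec.fit_transform(train_texts)
test_counts = vec.transform(test_texts)  # words unseen in training are ignored
```

The scaled test set will generally not have a mean of exactly 0, which is expected: it was scaled with the training statistics, just as real incoming data would be.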

## __Missing data imputation__

Missing data imputation is a technique used to fill in missing values within a dataset, preventing potential issues with analysis or model training. It involves replacing missing values with estimated values based on the existing data, ensuring the dataset is complete and usable for further analysis.

- __Univariate Imputation:__ This method focuses on a single variable, using the mean, median, or mode of the non-missing values to fill in the missing values for that specific variable. 
- __Multivariate Imputation:__ This method considers multiple variables to estimate the missing values. It often involves using regression models or other statistical methods to predict the missing values based on the relationships between the variables. 
- __Multiple Imputation:__ This technique creates multiple imputed datasets by generating different estimates for the missing values. This allows for incorporating uncertainty about the true values, as analysis can be performed separately on each imputed dataset and results can be pooled. 
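One possible way to sketch multiple imputation in scikit-learn is to run `IterativeImputer` several times with `sample_posterior=True` and a different `random_state` each time, then pool the results. The toy data, the number of imputations `m`, and the choice of "analysis" (a column mean) are all assumptions made for this illustration, not a prescribed procedure.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data (hypothetical): the second column depends linearly on the first
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=100)
X = np.column_stack([x1, x2])
X[::10, 1] = np.nan  # remove every 10th value of the second column

# Draw m imputed datasets; sample_posterior=True makes each draw differ
m = 5
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(m)
]

# Run the analysis (here: the mean of column 2) on each dataset, then pool
estimates = [Xi[:, 1].mean() for Xi in imputed_sets]
pooled_estimate = float(np.mean(estimates))
```

The spread of `estimates` across the imputed datasets gives a rough sense of how much uncertainty the missing values add to the result.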

#### __Common Imputation Techniques:__

1. __Mean, Median, and Mode Imputation:__ Replacing missing values with the average, middle value, or most frequent value, respectively. 
2. __Constant Value Imputation:__ Replacing missing values with a predetermined constant, which can be a specific value or a value representing an "unknown" or "missing" category. 
3. __Regression Imputation:__ Using regression models to predict missing values based on other available variables. 
4. __K-Nearest Neighbors Imputation:__ Replacing missing values with the average of the values from the nearest neighbors in the dataset. 
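Several of the techniques listed above are available in `sklearn.impute`. The sketch below applies them to a tiny made-up array so the filled values are easy to check by hand; the array and the constant `-1.0` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical data with one missing value in each column
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [np.nan, 40.0]])

# Mean, median, and constant-value imputation, column by column
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)
median_imputed = SimpleImputer(strategy="median").fit_transform(X)
const_imputed = SimpleImputer(strategy="constant", fill_value=-1.0).fit_transform(X)

# KNN imputation: average the value over the 2 nearest rows that have it
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
```

For brevity the imputers here are fitted on the whole array. In a modelling workflow they would be fitted on the training set and then applied to the test set with `transform()`, for the same data-leakage reasons as standardization.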

## __Dealing with Missing Values__

Customer lifetime value is the total monetary worth of a customer to the business over the course of the business-customer relationship. The sections below work through different missing value imputation techniques on simulated customer lifetime value data.

- Check the number of null values
- Dropping Null Values
- Mean/Median/Mode Imputation
- Multiple Imputation using Regression
- Imputation using Nearest Neighbors

### __Import the Libraries__

In [2]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, KNNImputer

In [26]:
df = pd.read_csv('datasets/clv_data.csv')

# Since calculation of customer lifetime value is extremely complex, we will use a simple assumption for this example
df['lifetime_value'] = df['purchases'] * 20

df.head()

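| Unnamed: 0.1 | Unnamed: 0 | id | age | gender | income | days_on_platform | city | purchases | lifetime_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | NaN | Male | 126895 | 14.0 | San Francisco | 0 | 0 |
| 1 | 1 | 1 | NaN | Male | 161474 | 14.0 | Tokyo | 0 | 0 |
| 2 | 2 | 2 | 24.0 | Male | 104723 | 34.0 | London | 1 | 20 |
| 3 | 3 | 3 | 29.0 | Male | 43791 | 28.0 | London | 2 | 40 |
| 4 | 4 | 4 | 18.0 | Female | 132181 | 26.0 | London | 2 | 40 |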


### __Checking null values__

Checking for null values is one of the first steps in any data analysis or ML workflow.

In [8]:
df.isnull().sum()

Unnamed: 0             0
id                     0
age                 2446
gender                 0
income                 0
days_on_platform     141
city                   0
purchases              0
lifetime_value         0
dtype: int64
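Raw counts are easier to interpret next to the share of rows they represent. As a minimal sketch, the example below builds a small stand-in DataFrame (the `toy` frame and its values are hypothetical, not the CLV data) and reports both the count and the percentage of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CLV data, small enough to check by hand
toy = pd.DataFrame({
    "age": [np.nan, np.nan, 24.0, 29.0, 18.0],
    "days_on_platform": [14.0, 14.0, 34.0, np.nan, 26.0],
    "income": [126895, 161474, 104723, 43791, 132181],
})

null_counts = toy.isnull().sum()   # number of missing values per column
null_share = toy.isnull().mean()   # fraction of missing values per column

summary = pd.DataFrame({
    "n_missing": null_counts,
    "pct_missing": (null_share * 100).round(1),
})
```

The same two calls can be run on `df` directly. The percentage helps decide whether dropping rows is affordable or whether imputation is the better option.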

### __Dropping null values__

Dropping rows with nulls is the quickest and easiest way to remove null/missing values, at the cost of discarding every row that has at least one missing field.

In [9]:
drop_df = df.copy()

drop_df = drop_df.dropna()
drop_df.isnull().sum()

Unnamed: 0          0
id                  0
age                 0
gender              0
income              0
days_on_platform    0
city                0
purchases           0
lifetime_value      0
dtype: int64
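Before committing to this approach it can help to see how many rows it removes. The sketch below uses a small hypothetical frame (not the CLV data) to compare dropping rows with any missing value against dropping only rows that are missing one specific column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with missing values in two columns
toy = pd.DataFrame({
    "age": [np.nan, np.nan, 24.0, 29.0, 18.0],
    "days_on_platform": [14.0, 14.0, 34.0, np.nan, 26.0],
    "income": [126895, 161474, 104723, 43791, 132181],
})

dropped_any = toy.dropna()                # drop rows with any missing value
dropped_age = toy.dropna(subset=["age"])  # drop only rows missing 'age'

rows_lost_any = len(toy) - len(dropped_any)
rows_lost_age = len(toy) - len(dropped_age)
```

When a large share of rows is lost this way, the imputation techniques covered in this section are usually worth considering instead.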

#### __Divide dataset into training and testing sets__

In [10]:
# Create Feature and target variables
X = drop_df[['age', 'gender', 'days_on_platform', 'income']]
y = drop_df['lifetime_value']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### __Mean/Median/Mode Imputation for missing values__

Next is mean/median/mode imputation, which fills the null/missing values with a summary statistic of the column.

In [38]:
m_df = df.copy()

# Create Feature and target variables
X = m_df[['age', 'gender', 'days_on_platform', 'income']]
y = m_df['lifetime_value']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [39]:
# Calculate the means on the training set only, then use them to fill the null values
# in both the training and testing sets to avoid data leakage
fill_values = {'age': X_train['age'].mean(), 'days_on_platform': X_train['days_on_platform'].mean()}

# Assign the result rather than modifying the split frames in place
X_train = X_train.fillna(fill_values)
X_test = X_test.fillna(fill_values)