# Missing Values

In this section, we'll be learning different missing value imputation techniques by analyzing simulated customer lifetime value data. Customer Lifetime Value is the total monetary worth the customer has to the business, over the course of its business-customer relationship. In this dataset, to keep things simple, we'll be using purchases as a proxy for CLV. 

We'll be covering how to:

- Check the number of null values

- Dropping Null Values

- Mean/Median/Mode Imputation

- Multiple Imputation using Regression

- Imputation using Nearest Neighbors

Let's get started!

## Import Libraries

First, we'll need to import the relevant libraries. We'll be using the standard `pandas`, `numpy` libraries for data manipulation. We'll then be using `sklearn` for the more advanced imputation techniques. We will use `scipy` for the `mode` imputation later on:

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, KNNImputer

## Load Data

We load our customer lifetime value dataset. we have about 6 columns. The `purchases` column is the column we care about in our customer lifetime value problem. 

In [2]:
import random

df = pd.read_csv("clv_data.csv")
df['lifetime_value'] = df['purchases'] * 20
df.head()

Unnamed: 0.1,Unnamed: 0,id,age,gender,income,days_on_platform,city,purchases,lifetime_value
0,0,0,,Male,126895,14.0,San Francisco,0,0
1,1,1,,Male,161474,14.0,Tokyo,0,0
2,2,2,24.0,Male,104723,34.0,London,1,20
3,3,3,29.0,Male,43791,28.0,London,2,40
4,4,4,18.0,Female,132181,26.0,London,2,40


In [3]:
random.randint(5,50)

22

## Checking Null Values

The first step in any data analysis or ML model is to check null values. We can check the number of nulls in a single line:

In [4]:
df.isnull().sum()

Unnamed: 0             0
id                     0
age                 2446
gender                 0
income                 0
days_on_platform     141
city                   0
purchases              0
lifetime_value         0
dtype: int64

### Percentages of NULL values

In [5]:
def nulls_summary_table(df):
    """
    Returns a summary table showing null value counts and percentage
    
    Parameters:
    df (DataFrame): Dataframe to check
    
    Returns:
    null_values (DataFrame)
    """
    null_values = pd.DataFrame(df.isnull().sum())
    null_values[1] = null_values[0]/len(df)
    null_values.columns = ['null_count','null_pct']
    return null_values

nulls_summary_table(df)

Unnamed: 0,null_count,null_pct
Unnamed: 0,0,0.0
id,0,0.0
age,2446,0.4892
gender,0,0.0
income,0,0.0
days_on_platform,141,0.0282
city,0,0.0
purchases,0,0.0
lifetime_value,0,0.0


### Dropping Null Values

Dropping nulls is the quickest and easiest method to dropping nulls. We will use the internal pandas method `dropna` which will simply drop all rows that contain nulls:

In [7]:
drop_df = df.copy()
drop_df = drop_df.dropna()
drop_df.isnull().sum()

Unnamed: 0          0
id                  0
age                 0
gender              0
income              0
days_on_platform    0
city                0
purchases           0
lifetime_value      0
dtype: int64

In [8]:
X_d = drop_df[['age','days_on_platform','income']]
y_d = drop_df['lifetime_value']

X_train_d = X_d[:4000]
y_train_d = y_d[:4000]

X_test_d = X_d[1000:]
y_test_d = y_d[1000:]

### Mean/Median/Mode Imputation

The next is mean/median/mode imputation. We can use the native numpy functions for the mean and median. We can use scipy for the mode. Then, pandas as a native `fillna` method we can use to impute the nulls with the mean/median/mode:

In [9]:
m_df = df.copy()

X_m = m_df[['age','days_on_platform','income']]
y_m = m_df['lifetime_value']


X_train_m = X_m[:4000]
y_train_m = y_m[:4000]

X_test_m = X_m[1000:]
y_test_m = y_m[1000:]

In [11]:
## Mean
X_train_m.loc[:,'age'] = X_train_m['age'].fillna(np.mean(X_train_m['age']))
X_test_m.loc[:,'age'] = X_test_m['age'].fillna(np.mean(X_train_m['age'])) ## Cannot use training dataset to impute


X_train_m.loc[:,'days_on_platform'] = X_train_m['days_on_platform'].fillna(np.mean(X_train_m['days_on_platform']))
X_test_m.loc[:,'days_on_platform'] = X_test_m['days_on_platform'].fillna(np.mean(X_train_m['days_on_platform'])) ## Cannot use training dataset to impute

In [13]:
## Median
m_df.loc[:,'age'] = df['age'].fillna(np.median(m_df['age']))

In [18]:
## Mode
m_df.loc[:,'age'] = m_df['age'].fillna(stats.mode(m_df['age'], keepdims=True).mode[0])

### Multiple Imputation Using Regression

Now that we've covered the simpler imputation techniques, we'll cover a more complicated imputation technique: Multiple Imputation.

Multiple imputation has a few different estimators, using the `estimator` argument:

- `BayesianRidge`: Regularized Linear Regression

- `RandomForestRegressor`: Random Forest Model. Mimics missForest in the R language.

The `missing_values` argument is a placeholder for the data type of the missing values we want to impute. 

It's important to use `add_indicatorbool` as it'll create a placeholder indicating that we've imputed a missing value. This is important, because there could be patterns behind how a value is missing. Adding an indicator would allow us to keep track of where we made an imputation. Plus, it could also add signal into your model. 

`max_iter`: The number of iteration rounds.

In [23]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.linear_model import LinearRegression

## Target - Purchases in the first six months

r_df = df.copy()

X_r = r_df[['age','days_on_platform','income']]
y_r = r_df['lifetime_value']


X_train_r = X_r[:4000]
y_train_r = y_r[:4000]

X_test_r = X_r[1000:]
y_test_r = y_r[1000:]


Imp = IterativeImputer(estimator=LinearRegression(), max_iter=10, random_state = 0)
Imp.fit(X_train_r)

X_train_r = Imp.transform(X_train_r)
X_test_r = Imp.transform(X_test_r)

X_train_r = pd.DataFrame(X_train_r)
X_train_r.columns = X_train_r.columns

X_test_r = pd.DataFrame(X_test_r)
X_test_r.columns = X_test_r.columns

r_df = pd.concat([X_train_r,X_test_r],axis = 0)

### Nearest Neighbor Imputation

On top of using linear regression or random forest regression to impute values, we can also use nearest neighbors imputation. Nearest neighbor imputation essentially uses a K-Nearest Neighbors algorithm to find the most similar data points, to impute the null values. 

In [24]:
imputer = KNNImputer(n_neighbors=5, weights="uniform")
imputer.fit(X_train_r)
X_train_k = imputer.transform(X_train_r)
X_test_k = imputer.transform(X_test_r)

y_train_k = y_train_r.copy()
y_test_k = y_test_r.copy()

## Comparison

In [25]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Drop Null Model
clf_n = RandomForestRegressor(random_state=0)
clf_n.fit(X_train_d, y_train_d)
pred_dropna = clf_n.predict(X_test_d)

# Mean Imputation Model
clf_m = RandomForestRegressor(random_state=0)
clf_m.fit(X_train_m, y_train_m)
pred_m = clf_m.predict(X_test_m)

# Regression Imputation
clf_r = RandomForestRegressor(random_state=0)
clf_r.fit(X_train_r, y_train_r)
pred_r = clf_r.predict(X_test_r)

#Nearest Neighbor Imputation
clf_n = RandomForestRegressor(random_state=0)
clf_n.fit(X_train_k, y_train_k)
pred_k = clf_n.predict(X_test_k)


print('Drop Null MAE Score: %.3f' % mean_absolute_error(y_test_d,pred_dropna))
print('Mean Impute MAE Score: %.3f' % mean_absolute_error(y_test_m,pred_m))
print('Regression MAE Score: %.3f '% mean_absolute_error(y_test_r,pred_r))
print('Nearest Neighbor MAE Score: %.3f'% mean_absolute_error(y_test_k,pred_k))

Drop Null MAE Score: 7.636
Mean Impute MAE Score: 10.828
Regression MAE Score: 10.785 
Nearest Neighbor MAE Score: 10.785
