# Data Pre-Processing

Before the data can be used to train models, it first needs to be pre-processed. Namely, we need to do four things:
- one-hot encode categorical variables (customer rank and acquisition channel).
- convert sex to a binary variable.
- remove variables that aren't needed for analysis (customer number, name, IT-Tag and Port).
- impute missing values for age and highest win.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import os
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

In [2]:
# Make sure we're in the right directory
correct_directory = "C:\\Users\\anear\\Desktop\\LiveScore_ML_Challenge\\LiveScore ML Challenge 4"
if os.getcwd() != correct_directory:
    os.chdir(correct_directory)
print("Current working directory: ", os.getcwd())

Current working directory:  C:\Users\anear\Desktop\LiveScore_ML_Challenge\LiveScore ML Challenge 4


In [3]:
# Load the data
_data = np.load('data/dfs.npy', allow_pickle=True).tolist()
train = _data['train']
test = _data['test']
print("Loaded!")

Loaded!


In [4]:
# Print information about data
print("No. of customers in train set: {}".format(len(train)))
print("No. of customers in test set: {}".format(len(test)))
print("Proportion of retained customers in train set: {:.1f}%".format(100*train['Retained'].mean()))

No. of customers in train set: 891
No. of customers in test set: 418
Proportion of retained customers in train set: 38.4%


In [5]:
# Show sample data in train
train.head()

Unnamed: 0,Customer,Retained,CuRank,Name,Sex,Age,FriPlay,RelPlay,IT-Tag,Port,HighWin,AqChan
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,C85,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,C123,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,,8.05,S


In [6]:
# Show sample data in test
test.head()

Unnamed: 0,Customer,CuRank,Name,Sex,Age,FriPlay,RelPlay,IT-Tag,Port,HighWin,AqChan
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,,7.8292,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,,7.0,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,,9.6875,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,,8.6625,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,,12.2875,S


In [7]:
# Missing data in train set
print('No. of customers with missing data by variable (train set)')
print('')
for variable in train.columns:
    print('{}: {} ({:.1f}%)'.format(variable, sum(train[variable].isna()), 100*sum(train[variable].isna())/len(train)))

No. of customers with missing data by variable (train set)

Customer: 0 (0.0%)
Retained: 0 (0.0%)
CuRank: 0 (0.0%)
Name: 0 (0.0%)
Sex: 0 (0.0%)
Age: 177 (19.9%)
FriPlay: 0 (0.0%)
RelPlay: 0 (0.0%)
IT-Tag: 0 (0.0%)
Port: 687 (77.1%)
HighWin: 0 (0.0%)
AqChan: 2 (0.2%)


In [8]:
# Missing data in test set
print('No. of customers with missing data by variable (test set)')
print('')
for variable in test.columns:
    print('{}: {} ({:.1f}%)'.format(variable, sum(test[variable].isna()), 100*sum(test[variable].isna())/len(test)))

No. of customers with missing data by variable (test set)

Customer: 0 (0.0%)
CuRank: 0 (0.0%)
Name: 0 (0.0%)
Sex: 0 (0.0%)
Age: 86 (20.6%)
FriPlay: 0 (0.0%)
RelPlay: 0 (0.0%)
IT-Tag: 0 (0.0%)
Port: 327 (78.2%)
HighWin: 1 (0.2%)
AqChan: 0 (0.0%)


## One-Hot Encoding

Customer rank and acquisition channel are categorical variables, so each needs to be one-hot encoded. That is, each feature is replaced by one new feature per category, and each new feature takes the value 0 or 1 to indicate whether the row belongs to that category.
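
As a small illustration of the idea, the sketch below one-hot encodes a hypothetical toy column (not the real dataset) whose values mirror the acquisition channel codes shown earlier. The names `toy` and `toy_onehot` are made up for this example.

```python
import pandas as pd

# Hypothetical toy column; the values mirror the AqChan codes seen in the sample data
toy = pd.DataFrame({"AqChan": ["S", "C", "Q", "S"]})

# One indicator column per category; dtype=int gives 0/1 rather than booleans
toy_onehot = pd.get_dummies(toy["AqChan"], prefix="AqChan", dtype=int)
print(toy_onehot)
```

Each row has exactly one 1 across the three new columns, marking the category that row belonged to.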

In [9]:
# One-hot encoding CuRank in train
curank_onehot = pd.get_dummies(train.CuRank, prefix='CuRank')
train = pd.merge(train, curank_onehot, left_index=True, right_index=True)
train = train.drop('CuRank', axis=1)

# One-hot encoding CuRank in test
curank_onehot = pd.get_dummies(test.CuRank, prefix='CuRank')
test = pd.merge(test, curank_onehot, left_index=True, right_index=True)
test = test.drop('CuRank', axis=1)

In [10]:
# One-hot encoding AqChan in train
aqchan_onehot = pd.get_dummies(train.AqChan, prefix='AqChan')
train = pd.merge(train, aqchan_onehot, left_index=True, right_index=True)
train['AqChan_NaN'] = train['AqChan'].isna()*1
train = train.drop('AqChan', axis=1)

# One-hot encoding AqChan in test
aqchan_onehot = pd.get_dummies(test.AqChan, prefix='AqChan')
test = pd.merge(test, aqchan_onehot, left_index=True, right_index=True)
test['AqChan_NaN'] = test['AqChan'].isna()*1
test = test.drop('AqChan', axis=1)
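
Encoding train and test separately only yields matching columns if every category appears in both sets. That appears to hold for this data, but it is not guaranteed in general. A minimal sketch of one way to guard against a mismatch, using hypothetical toy frames (`toy_train`, `toy_test`), is to reindex the test columns to those of train:

```python
import pandas as pd

# Hypothetical toy data: category "Q" never occurs in the test frame
toy_train = pd.DataFrame({"AqChan": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"AqChan": ["S", "S", "C"]})

train_onehot = pd.get_dummies(toy_train["AqChan"], prefix="AqChan", dtype=int)
test_onehot = pd.get_dummies(toy_test["AqChan"], prefix="AqChan", dtype=int)

# Align test to train's columns; categories absent from test become all-zero columns
test_onehot = test_onehot.reindex(columns=train_onehot.columns, fill_value=0)
print(test_onehot)
```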

## Convert Sex to Binary Variable

Sex is a categorical variable, where the value is marked 'male' or 'female'. This feature will need to be replaced with a binary variable, where the values are either 0 or 1. In this instance, a value of 1 represents that the customer is female.
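
A comparison such as `== 'female'` maps every other value, including a missing one, to 0. Sex has no missing values in this data, so that is harmless here. As a hedged alternative, the sketch below uses an explicit mapping on a hypothetical toy column so that unexpected or missing values surface as NaN instead of silently becoming 0.

```python
import pandas as pd

# Hypothetical toy column with one missing entry
toy = pd.DataFrame({"Sex": ["male", "female", "female", None]})

# Explicit mapping; anything outside the mapping (including missing) becomes NaN
sex_binary = toy["Sex"].map({"male": 0, "female": 1})

# Count entries that could not be mapped, so they can be inspected before modelling
n_unmapped = int(sex_binary.isna().sum())
print(n_unmapped)
```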

In [11]:
# Convert Sex to binary variable in train
train['Sex'] = (train['Sex']=='female')*1

# Convert Sex to binary variable in test
test['Sex'] = (test['Sex']=='female')*1

## Drop Variables

Customer number, name, IT-Tag and port won't be used for our purposes. As such, we drop them from our datasets.

In [12]:
# Drop columns that won't be used in our models
train = train.drop(['Customer','Name','IT-Tag','Port'], axis=1)
test = test.drop(['Customer','Name','IT-Tag','Port'], axis=1)

## Impute Missing Data

Roughly 20% of ages are missing in each dataset (19.9% in the train set and 20.6% in the test set). In order to train our machine learning models, we have to impute the missing data.

One option is to simply use the mean age. This is a simple and quick method to implement. However, it ignores the relationship between variables in the dataset.
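
For reference, mean imputation can be sketched in a couple of lines. The toy frame below is hypothetical and only illustrates the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with two missing entries
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

# Series.mean() skips NaN by default, so this is the mean of the observed ages
mean_age = toy["Age"].mean()

# Every missing age receives the same value, regardless of the other variables
toy["Age_filled"] = toy["Age"].fillna(mean_age)
print(toy)
```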

To leverage the information contained in the other variables, we will instead train a linear regression model to predict the missing ages. The other variables will be used as features to train the model. All rows with ages present will be used for training, and predictions will be made on rows with a missing age.

There is also a missing highest win value in the test set. Linear regression will be used for imputation here as well.
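
The regression-based imputation described above follows the same steps for each variable, so it could be wrapped in a helper. The function below is a sketch under stated assumptions, not the code used in this notebook: `impute_with_regression` and its `exclude` parameter are hypothetical names, and rows whose features are themselves incomplete are simply left out of fitting and prediction.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def impute_with_regression(df, target, exclude=()):
    """Return a copy of df with missing `target` values predicted from the other columns."""
    out = df.copy()
    features = out.drop(columns=[target, *exclude])
    missing = out[target].isna()

    # Only rows with complete features can be used for fitting or prediction
    complete = features.notna().all(axis=1)
    fit_rows = ~missing & complete
    pred_rows = missing & complete
    if not pred_rows.any():
        return out

    # Fit the scaler and the model on rows where the target is present
    scaler = StandardScaler()
    X_fit = scaler.fit_transform(features[fit_rows])
    model = LinearRegression().fit(X_fit, out.loc[fit_rows, target])

    # Apply the same scaler to the rows to be imputed, then fill in the predictions
    X_pred = scaler.transform(features[pred_rows])
    out.loc[pred_rows, target] = model.predict(X_pred)
    return out


# Hypothetical toy data with two missing ages
toy = pd.DataFrame({
    "Sex": [0, 1, 1, 0, 1, 0],
    "HighWin": [7.25, 71.28, 7.93, 53.10, 8.05, 30.0],
    "Age": [22.0, 38.0, np.nan, 35.0, 27.0, np.nan],
})
filled = impute_with_regression(toy, target="Age")
print(filled)
```

With such a helper, a label column like `Retained` could be passed via `exclude` so that it is not used as a feature.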

In [13]:
# Impute missing ages for train set
# Skip the first column (Retained) so the label is not used as a feature
data = train[train.columns[1:]]
X = np.array(data.drop('Age', axis=1))
y = np.array(data['Age'])

# Define training set (where age is present)
X_train = X[~np.isnan(y)]
y_train = y[~np.isnan(y)]

# Define test set (where age is missing)
X_test = X[np.isnan(y)]

# Scale the training data using a z-transform, and apply that scaler to the test data
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)

# Train a model to predict the missing ages
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Predict ages
y_pred = model.predict(X_test_scaled)

# Assign values to train set
m = train['Age'].isna()
train.loc[m,'Age'] = y_pred

In [14]:
# Impute missing ages for test set
# Exclude the row with a missing HighWin, since the regression cannot handle NaN features
data = test.dropna(subset=['HighWin'])
X = np.array(data.drop('Age', axis=1))
y = np.array(data['Age'])

# Define training set (where age is present)
X_train = X[~np.isnan(y)]
y_train = y[~np.isnan(y)]

# Define test set (where age is missing)
X_test = X[np.isnan(y)]

# Scale the training data using a z-transform, and apply that scaler to the test data
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)

# Train a model to predict the missing ages
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Predict ages
y_pred = model.predict(X_test_scaled)

# Assign values to test set (only rows kept in `data`, i.e. those with HighWin present)
m = test['Age'].isna() & test['HighWin'].notna()
test.loc[m,'Age'] = y_pred

In [15]:
# Impute missing HighWin for test set
data = test
X = np.array(data.drop('HighWin', axis=1))
y = np.array(data['HighWin'])

# Define training set (where HighWin is present)
X_train = X[~np.isnan(y)]
y_train = y[~np.isnan(y)]

# Define test set (where HighWin is missing)
X_test = X[np.isnan(y)]

# Scale the training data using a z-transform, and apply that scaler to the test data
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)

# Train a model to predict the missing HighWin value
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Predict HighWin
y_pred = model.predict(X_test_scaled)

# Assign values to test set
m = test['HighWin'].isna()
test.loc[m,'HighWin'] = y_pred

In [16]:
# Show sample data from train
train.head()

Unnamed: 0,Retained,Sex,Age,FriPlay,RelPlay,HighWin,CuRank_1,CuRank_2,CuRank_3,AqChan_C,AqChan_Q,AqChan_S,AqChan_NaN
0,0,0,22.0,1,0,7.25,0,0,1,0,0,1,0
1,1,1,38.0,1,0,71.2833,1,0,0,1,0,0,0
2,1,1,26.0,0,0,7.925,0,0,1,0,0,1,0
3,1,1,35.0,1,0,53.1,1,0,0,0,0,1,0
4,0,0,35.0,0,0,8.05,0,0,1,0,0,1,0


In [17]:
# Show sample data from test
test.head()

Unnamed: 0,Sex,Age,FriPlay,RelPlay,HighWin,CuRank_1,CuRank_2,CuRank_3,AqChan_C,AqChan_Q,AqChan_S,AqChan_NaN
0,0,34.5,0,0,7.8292,0,0,1,0,1,0,0
1,1,47.0,1,0,7.0,0,0,1,0,0,1,0
2,0,62.0,0,0,9.6875,0,1,0,0,1,0,0
3,0,27.0,0,0,8.6625,0,0,1,0,0,1,0
4,1,22.0,1,1,12.2875,0,0,1,0,0,1,0


In [18]:
# Missing data in train set
print('No. of customers with missing data by variable (train set)')
print('')
for variable in train.columns:
    print('{}: {} ({:.1f}%)'.format(variable, sum(train[variable].isna()), 100*sum(train[variable].isna())/len(train)))

No. of customers with missing data by variable (train set)

Retained: 0 (0.0%)
Sex: 0 (0.0%)
Age: 0 (0.0%)
FriPlay: 0 (0.0%)
RelPlay: 0 (0.0%)
HighWin: 0 (0.0%)
CuRank_1: 0 (0.0%)
CuRank_2: 0 (0.0%)
CuRank_3: 0 (0.0%)
AqChan_C: 0 (0.0%)
AqChan_Q: 0 (0.0%)
AqChan_S: 0 (0.0%)
AqChan_NaN: 0 (0.0%)


In [19]:
# Missing data in test set
print('No. of customers with missing data by variable (test set)')
print('')
for variable in test.columns:
    print('{}: {} ({:.1f}%)'.format(variable, sum(test[variable].isna()), 100*sum(test[variable].isna())/len(test)))

No. of customers with missing data by variable (test set)

Sex: 0 (0.0%)
Age: 0 (0.0%)
FriPlay: 0 (0.0%)
RelPlay: 0 (0.0%)
HighWin: 0 (0.0%)
CuRank_1: 0 (0.0%)
CuRank_2: 0 (0.0%)
CuRank_3: 0 (0.0%)
AqChan_C: 0 (0.0%)
AqChan_Q: 0 (0.0%)
AqChan_S: 0 (0.0%)
AqChan_NaN: 0 (0.0%)


In [20]:
# Save to processed_dfs.npy
if not os.path.exists('./data'):
    os.makedirs('./data')

tosave = {'train': train, 'test': test}
np.save('data/processed_dfs.npy',tosave)
print("Saved!")

Saved!
