# Data pre-processing

Description: This script pre-processes the data. Pre-processing includes removing rows with missing values and encoding the categorical data.

Author: Caroline Risoud

License:

Last update date: 23.10.2021

In [2]:
import pandas as pd

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import log_loss

import xgboost as xgb

from time import time

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

import matplotlib.pyplot as plt
import matplotlib


In [None]:
# Reading the csv data file
df_train_validate = pd.read_csv('nhts_train_validate.csv', index_col='TRIPID')

### Missing values in the dataset

In the following cell:

- We create a dataframe 'df_null_values' that lists every column (features and label) of the dataset together with the number of NULL (None, NaN) values it contains.

- NULL values (exactly 8) are found only in the TRAVELMODE column, which is the target/label column.


In [None]:
# Finding the amount of NULL values

# For each column, total stores the number of null values
total = df_train_validate.isnull().sum().sort_values(ascending=False)


# For each column, percent converts the number of null values into a percentage
percent = total/len(df_train_validate)*100


# Dataframe with Total as the first column and Percent as the second one
df_null_values = pd.concat([total,percent], axis=1, keys=['Total', 'Percent'])

- We will drop the 8 rows with missing TRAVELMODE. We assume that removing these rows will not introduce sampling bias.
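
As an illustration, here is a minimal sketch of dropping only the rows whose label is missing. The toy frame, the feature column `TRPMILES` and all values are made up for this example. Restricting `dropna` to the label column through `subset` removes exactly the unlabelled rows, whereas `how='any'` over all columns would also remove rows with a missing feature, should one ever appear.

```python
import numpy as np
import pandas as pd

# Toy data: the feature column TRPMILES and all values are made up for illustration
toy = pd.DataFrame(
    {
        "TRPMILES": [1.2, np.nan, 3.4, 0.5],
        "TRAVELMODE": ["WALK", "DRIVE", None, "BUS"],
    },
    index=pd.Index([101, 102, 103, 104], name="TRIPID"),
)

# Drop only the rows where the label is missing
toy_labelled = toy.dropna(subset=["TRAVELMODE"])

# Drop every row that has a missing value in any column
toy_complete = toy.dropna(how="any", axis=0)

print(toy_labelled.index.tolist())
print(toy_complete.index.tolist())
```

In this notebook both calls should give the same result, because the NULL values are reported only in TRAVELMODE.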

In [None]:
df_train_validate = df_train_validate.dropna(how='any', axis=0)

In [None]:
df_train_validate

### Categorical data - One-hot encoding

All data must be numerical.

For our target label, TRAVELMODE, we choose numerical (integer) encoding: the target must remain a single column, so each category is mapped to one integer.

For the features, we choose one-hot encoding (one binary indicator column per category). On the one hand, unlike numerical encoding, this solution does not impose an order or a distance between categories. On the other hand, it makes the data much wider by creating many more features.
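
The difference between the two encodings can be seen on a small example. This is only a sketch: the toy frame and its values are illustrative, and `cat.codes` is used here as one possible way of producing integer codes, not the mapping applied to TRAVELMODE in this notebook.

```python
import pandas as pd

# Toy frame: the values are illustrative only
toy = pd.DataFrame(
    {
        "TRIPPURP": ["HBW", "HBO", "HBW", "NHB"],
        "TRPMILES": [1.0, 2.5, 0.3, 7.1],
    }
)

# Integer encoding: a single column, but the codes imply an order and a distance
codes = toy["TRIPPURP"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(toy, prefix_sep=":", columns=["TRIPPURP"])

print(codes.tolist())
print(one_hot.columns.tolist())
```

A single categorical column with three categories becomes three indicator columns, which is why the one-hot encoded data is wider.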

In [None]:
# shows us the type of each column - the object types are the ones to be encoded

display(df_train_validate.dtypes)

In [None]:
# numerical encoding for the target categorical label TRAVELMODE

str_to_val = {
    'WALK': 0,
    'CYCLE': 1,
    'RAIL': 2,
    'BUS': 3,
    'DRIVE': 4,
    'PASSENGER': 5,
    'TAXI': 6,
    'OTHER': 7
}

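# Replacing the strings with their respective values
# (assigning the result back avoids relying on an in-place update of a column view)
df_train_validate['TRAVELMODE'] = df_train_validate['TRAVELMODE'].replace(str_to_val)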

df_train_validate.head()

In [None]:
# One-hot encoding for the 4 remaining categorical features: 'TRIPPURP', 'HHSTATE', 'OBHUR', 'DBHUR'

categorical_cols = [
    'TRIPPURP',
    'HHSTATE',
    'OBHUR',
    'DBHUR'
]


df_processed = pd.get_dummies(
    df_train_validate, prefix_sep=':', columns=categorical_cols)

df_processed.head()

### Internal and external validation

The provided test data cannot be used for external validation because it does not include the choice label.

To create a test set for external validation, we separate the train_validate set into a train_validate_rev set and a test_rev set. Cross-validation to optimise the hyperparameters is carried out on the train_validate_rev set, and the test_rev set is then used for external validation.
 
    train_validate data:
        --> 80 % train_validate_rev
        --> 20 % test_rev
 

Cross-validation with random search will be used to optimise the hyperparameters. It will also account for the hierarchical nature of the data. It will be used as follows:

- Train on 4 folds, validate on 1 fold
- Training data: 80% of train_validate_rev
- Validation data: 20% of train_validate_rev
- Random sampling of validation folds

--> INTERNAL VALIDATION

The final test:

- Training data: 100% of train_validate_rev, with the optimal hyperparameters found previously
- Test data: 100% of test_rev

--> EXTERNAL VALIDATION
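
The internal validation step could look roughly like the sketch below. Several things here are assumptions rather than part of this notebook: the data is synthetic, a decision tree stands in for the XGBoost model, the parameter grid is arbitrary, and `GroupKFold` on a hypothetical household id is just one possible way to respect the hierarchical structure (trips nested in households) by keeping all trips of a household in the same fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for X_train_validate_rev and y_train_validate_rev
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hypothetical household ids: 50 households with 4 trips each
groups = np.repeat(np.arange(50), 4)

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),  # stand-in for the real model
    param_distributions={
        "max_depth": [2, 3, 4, 5, 6],
        "min_samples_leaf": [1, 2, 5, 10],
    },
    n_iter=5,                   # number of random hyperparameter draws
    cv=GroupKFold(n_splits=5),  # train on 4 folds, validate on 1
    scoring="neg_log_loss",
    random_state=0,
)
search.fit(X, y, groups=groups)

print(search.best_params_)
```

After the search, `search.best_estimator_` is refitted on all of the data passed to `fit`, which corresponds to training on 100% of train_validate_rev before scoring once on test_rev.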

Splitting the encoded dataframe df_processed (built from df_train_validate) by row position into df_train_validate_rev and df_test_rev:

- The first 80% of the rows are assigned to df_train_validate_rev
- The last 20% of the rows are assigned to df_test_rev

In [None]:
# Row position that separates the first 80% of the rows from the last 20%
id1 = round(len(df_processed)*0.8)

df_train_validate_rev = df_processed.iloc[:id1,:]
df_test_rev = df_processed.iloc[id1:,:]

print("Shape of new dataframes - {} , {}".format(df_train_validate_rev.shape, df_test_rev.shape))

df_processed.columns.tolist()

In [None]:
# We extract the features and labels, removing the id and context columns

target = ['TRAVELMODE']

id_context = ['TRIPID', 
              'HOUSEID', 
              'PERSONID', 
              'TDTRPNUM',
              'LOOP_TRIP'
             ]



features = [c for c in df_processed.columns 
            if c not in (target + id_context)]

# y is the target label (here: 'TRAVELMODE')
# X contains the features that are input to the model to predict the target label

# ravel() is used to flatten the multi-dimensional array to a vector
y = df_processed[target].values.ravel()
X = df_processed[features]

y_train_validate_rev = df_train_validate_rev[target].values.ravel()
X_train_validate_rev = df_train_validate_rev[features]

y_test_rev = df_test_rev[target].values.ravel()
X_test_rev = df_test_rev[features]

In [None]:
%store X_train_validate_rev, y_train_validate_rev
%store X_test_rev, y_test_rev