# Machine Learning Models

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import re

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import Ridge, Lasso
from xgboost import XGBRegressor

from nltk.tokenize import word_tokenize
import gensim.models

from scipy.sparse import hstack
from sklearn.metrics import mean_squared_log_error

from prettytable import PrettyTable

import joblib
import gc

In [2]:
import warnings
warnings.filterwarnings('ignore')

### Loading the featurized data

In [3]:
df_train, df_cv = joblib.load('df_train_cv_after_featurizations.joblib')
y_train = np.array(df_train['price'].values).reshape(-1, 1)
y_cv    = np.array(df_cv['price'].values).reshape(-1, 1)

In [4]:
y_train = np.log1p(y_train)

The target variable was converted to its log form so that we can directly optimize for Mean Squared Error (MSE) instead of Mean Squared Log Error (MSLE). This was easy since most models optimize for MSE by default for Regression tasks while not all support direct optimization for MSLE.

#### Defining the loss metric for evaluation: Root mean squared log error (RMSLE)

In [5]:
def rmsle(y_true, y_pred):
    '''
    This function take the tuple of true class labels and the predicted class labels as input and gives the Root mean squared
    log error between these as the output
    '''
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

#### Baseline Model
We build a baseline model so as to get an upper bound to the error with a very simple model that predicts the mean price of the training data for each CV data point. In this way, we'll know, how much better, each of our model performs

In [6]:
y_pred_baseline = np.expm1(np.repeat(np.mean(y_train), len(y_cv)))
print("Baseline Model CV RMSLE = ", np.round(rmsle(y_cv, y_pred_baseline), 3))

Baseline Model CV RMSLE =  0.746


#### Since, text is a crucial part in deciding the price of the product (intuition), we will featurize it in form of sparse as well as dense vectors

#### TF-IDF Vectorization for Text Data (Sparse Vectors)

I tried different number of MaxFeatures and found out that using 100,000 features is appropriate as it captures substantial amount of information <br>
Bi-grams were used for text field which contained item description with the intuition that there were brands like "Michael Kors" and since brands played an important role in predicting the price (observed in EDA), it would be better if this information is captured in a better way through bi-grams

Token pattern was changed to '\w+' because in normal TF-IDF vectorization, punctuations are treated as token separators which may not always be the case in this case study since there are reviews which contain alphabets, numbers, periods, underscores (for brand names) and hence '\w+' token pattern is more appropriate for this case study.

In [7]:
def tfidf_encoder(train_data, test_data, N_GRAMS = 1):
    '''
    This function returns the TF-IDF encoding of the text
    
    Input ->
    
        train_data       : Text (string or list of strings or Pandas Series with elements as strings)
        test_data        : Text (string or list of strings or Pandas Series with elements as strings)
        N_GRAMS(int)     : Upper bound of the n_grams to be considered while vectorizing the data using TF-IDF encoder
                           For eg., If the n_grams = 2, then both unigrams and bi-grams will be used while vectorizing
                           the text data. Default value is kept as 1, which means only uni-grams will be generated if this
                           argument is not supplied explicitly while calling this function
    
    Output -> Tuple of TF-IDF vectors of "train_data" and "test_data" computed using sklearn's Tfidfvectorizer() 
    
    Task   -> Given a text (string), return the TF-IDF vectors for that text
              The vectorizer is fitted on the train_data and used to tranform both the train data and the test data
    '''
    vectorizer = TfidfVectorizer(max_features = 100000,
                                 ngram_range = (1, N_GRAMS),
                                 strip_accents = 'unicode',
                                 analyzer = 'word',
                                 token_pattern = r'\w+')
    
    train_tdidf = vectorizer.fit_transform(train_data)
    test_tfidf =  vectorizer.transform(test_data)
    return (train_tdidf, test_tfidf)

In [8]:
X_tr_name, X_cv_name = tfidf_encoder(df_train['name'], df_cv['name'], N_GRAMS = 1)
X_tr_text, X_cv_text = tfidf_encoder(df_train['text'], df_cv['text'], N_GRAMS = 2)

In [9]:
print(X_tr_name.shape)
print(X_tr_text.shape)
print(X_cv_name.shape)
print(X_cv_text.shape)

(1111234, 85394)
(1111234, 100000)
(370424, 85394)
(370424, 100000)


#### One Hot Encoding for Shipping and item_condition_id

In [10]:
def one_hot_encoder(train_data, cv_data):
    '''
    This function returns the One Hot Encoded vectors for the given train and CV data
    Input ->
        train_data : Training data to be one hot encoded (List of integers/strings or a Pandas Series)
        cv_data    : Cross Validation data to be one hot encoded (List of integers/strings or a Pandas Series)
    Output -> Tuple of One hot encoded vectors of training and CV data
    Task   -> This function converts the raw values (integers/strings) into one hot encoded vectors using
              sklearn's OneHotEncoder()
    '''
    ohe_encoder = OneHotEncoder()
    train_ohe = ohe_encoder.fit_transform(train_data)
    cv_ohe = ohe_encoder.transform(cv_data)
    return train_ohe, cv_ohe

In [11]:
X_tr_shipping, X_cv_shipping = one_hot_encoder(np.reshape(df_train['shipping'].values, (-1, 1)), np.reshape(df_cv['shipping'].values, (-1, 1)))
X_tr_item_condition, X_cv_item_condition = one_hot_encoder(np.reshape(df_train['item_condition_id'].values, (-1, 1)), np.reshape(df_cv['item_condition_id'].values, (-1, 1)))

In [12]:
print(X_tr_shipping.shape)
print(X_cv_shipping.shape)
print(X_tr_item_condition.shape)
print(X_cv_item_condition.shape)

(1111234, 2)
(370424, 2)
(1111234, 5)
(370424, 5)


#### Combining all the features to create the training data matrix

In [13]:
X_train = hstack((X_tr_name,
                  X_tr_text,
                  X_tr_shipping,
                  X_tr_item_condition)).tocsr().astype('float32')

X_cv   = hstack((X_cv_name,
                 X_cv_text,
                 X_cv_shipping,
                 X_cv_item_condition)).tocsr().astype('float32')

In [14]:
print(X_train.shape)
print(X_cv.shape)

(1111234, 185401)
(370424, 185401)


### Model 1: Ridge Regression (Linear Regression + L2 Regularization)

Since, Linear regression does not work well with correlated features, hence, we'll use only one of the price statistics features for linear regression as they're highly correlated with each other. We choose median of the log price since it had the highest correlation with the target variable (as observed in EDA)

In [15]:
def ridge_model(X_train, y_train, parameter):
    '''
    Task -> This function fits the Ridge Model (Linear Regression + L2 Regularization) using sklearn's Ridge() Model 
            on the training data(X_train and y_train) and returns the fitted model
    
    Input ->
        
        X_train:   Training data matrix (Numpy/Scipy Array)
        y_train:   Class labels of the training data matrix in the form of a 1-D numpy array/Pandas Series
        parameter: Hyperparameter "alpha" of Ridge() model of sklearn.linear_model
    
    Output -> Ridge model fitted on X_train and y_train
    '''
    # Solver was chosen to be 'lsqr' as it converged very quickly
    
    model = Ridge(solver = 'lsqr', fit_intercept=False, alpha = parameter, random_state = 11)
    model.fit(X_train, y_train)
    return model

In [16]:
# Since, we took log of original y_train, we do the inverse process so as to get back the original y_train which 
# we will use for calculating training error

y_train_original = np.expm1(y_train)

In [18]:
'''
Here, we find the best hyperparameter alpha for the Ridge model by evaluating the performance on both train as well as CV data
using various values of hyperparameter alpha
'''
for alpha in [10**x for x in range(-5, 4)]:
    clf = ridge_model(X_train, y_train, alpha)
    y_pred_train = clf.predict(X_train)[:, 0]
    y_pred_cv = clf.predict(X_cv)[:, 0]
    preds_ridge_train = np.expm1(y_pred_train.reshape(-1, 1))[:, 0]
    preds_ridge_cv = np.expm1(y_pred_cv.reshape(-1, 1))[:, 0]
    print("Train RMSLE for alpha = ", alpha, " is ", rmsle(y_train_original, preds_ridge_train))
    print("CV RMSLE for alpha = ", alpha, " is ", rmsle(np.reshape(y_cv, (-1, 1)), preds_ridge_cv))
    del clf
    gc.collect()
    print("--------------------------------------------------------------")

Train RMSLE for alpha =  1e-05  is  0.4349957935621989
CV RMSLE for alpha =  1e-05  is  0.4551584494167266
--------------------------------------------------------------
Train RMSLE for alpha =  0.0001  is  0.4349957935639252
CV RMSLE for alpha =  0.0001  is  0.4551583628082112
--------------------------------------------------------------
Train RMSLE for alpha =  0.001  is  0.43499579374758734
CV RMSLE for alpha =  0.001  is  0.45515749692449653
--------------------------------------------------------------
Train RMSLE for alpha =  0.01  is  0.4349958122058277
CV RMSLE for alpha =  0.01  is  0.4551488582121009
--------------------------------------------------------------
Train RMSLE for alpha =  0.1  is  0.43499764082049175
CV RMSLE for alpha =  0.1  is  0.45506446447269716
--------------------------------------------------------------
Train RMSLE for alpha =  1  is  0.4367416695328943
CV RMSLE for alpha =  1  is  0.455041713196296
----------------------------------------------------

#### Evaluating performance on Cross Validation Data for best hyperparameter

In [19]:
best_alpha = 1

In [20]:
clf = ridge_model(X_train, y_train, best_alpha)
y_pred = clf.predict(X_cv)[:, 0]
preds_ridge = np.expm1(y_pred.reshape(-1, 1))[:, 0]
rmsle(y_cv, preds_ridge)

0.455041713196296

### Model 2: Lasso Regression (Linear Regression + L1 Regularization)
Since we have very sparse data, it is logical to train a Lasso model so that inherent feature selection happens

In [16]:
def lasso_model(X_train, y_train, parameter):
    '''
    Task -> This function fits the Lasso Model (Linear Regression + L1 Regularization) using sklearn's Lasso() Model 
            on the training data(X_train and y_train) and returns the fitted model
    
    Input ->
        
        X_train:   Training data matrix (Numpy/Scipy Array)
        y_train:   Class labels of the training data matrix in the form of a 1-D numpy array/Pandas Series
        parameter: Hyperparameter "alpha" of Lasso() model of sklearn.linear_model
    
    Output -> Lasso model fitted on X_train and y_train
    '''
    model = Lasso(alpha = parameter,
                  normalize = True,
                  max_iter=2000,
                  random_state = 11)
    model.fit(X_train, y_train)
    return model

In [17]:
%%time
clf = lasso_model(X_train, y_train, 1)

Wall time: 5min 26s


#### Evaluating performance on Cross Validation Data

In [18]:
y_pred = clf.predict(X_cv)
preds_lasso = np.expm1(y_pred.reshape(-1, 1))[:, 0]
rmsle(y_cv, preds_lasso)

0.74622476

It seems like the model did not converge within 2000 iterations and hence it is not feasible to use it since it will take a lot of time to train

### Model 3: XGBoost on TF-IDF Features

In [21]:
y_cv = np.log1p(y_cv)

In [17]:
# Since XGBoost supported direct optimization of squared log error, hence we used original y_train for training here
xgb_reg = XGBRegressor(random_state = 21,
                       n_jobs = -1)

#### Using RandomSearch to find the best set of hyperparameters on XGBoost model on TF-IDF features

In [8]:
params = {'n_estimators' : [500, 1000, 1500],
          'max_depth' : [5, 7],
          'subsample' : [0.7, 0.9],
          'colsample_bytree' : [0.7, 0.9]}

In [33]:
clf = RandomizedSearchCV(xgb_reg,
                         params,
                         cv=3,
                         n_jobs=-1,
                         random_state=21,
                         verbose=2)

clf.fit(X_train, y_train)

clf.best_params_

{'subsample': 0.9,
 'n_estimators': 1500,
 'max_depth': 7,
 'colsample_bytree': 0.7}

In [20]:
clf.best_score_

0.2552079203446068

In [41]:
best_est = clf.best_estimator_
y_pred_train = best_est.predict(X_train)
'''
Below statement was added since RMSLE's definition does not allow negative numbers and it makes logical sense also since product
price can never be negative. However, to be on the safe side, we make this ammendment
'''
y_pred_train[y_pred_train < 0] = 0
y_pred_cv = best_est.predict(X_cv)
y_pred_cv[y_pred_cv < 0] = 0
print("Training RMSLE = ", rmsle(y_train, y_pred_train))
print("CV RMSLE = ", rmsle(y_cv, y_pred_cv))

Training RMSLE =  0.4766871070500511
CV RMSLE =  0.490089474076724


Training time for XGBoost was significantly high but still the RSMLE is higher than simple linear regression model <br> This could be because we used high number of features for XGBoost, which is a tree based ensemble and does not work well with high dimensional features <br> Now, we will use dense Word2Vec representations for vectorizing text fields

### Model 4: XGBoost on Word2Vec Features

### Dense Vectorizations (Word2Vec)
Here, we use 300-dimensional Word2Vec dense vectors trained on Google News

In [2]:
'''
Here, we tokenize the 'name' and 'text' fields of the train and CV data using NLTK's work_tokenize() function
'''
train_name = [0]*len(df_train)
for index, review in enumerate(df_train['name'].values):
    train_name[index] = word_tokenize(review)

cv_name = [0]*len(df_cv)
for index, review in enumerate(df_cv['name'].values):
    cv_name[index] = word_tokenize(review)
    
train_text = [0]*len(df_train)
for index, review in enumerate(df_train['text'].values):
    train_text[index] = word_tokenize(review)
    
cv_text = [0]*len(df_cv)
for index, review in enumerate(df_cv['text'].values):
    cv_text[index] = word_tokenize(review)

In [4]:
# Word2Vec model is used to generate 300 - dimensional dense vectors for the train and text data
# Here, we load the pre-trained model trained on Google News data, downloaded from https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

model = gensim.models.KeyedVectors.load_word2vec_format(r'./GoogleNews-vectors-negative300.bin', binary=True)

In [3]:
#Code inspired from: https://github.com/sdimi/average-word2vec/blob/master/avg_word2vec_from_documents.py
zeros_vector = np.zeros(300)

def document_vector(word2vec_model, doc):
    '''
    Task -> This function finds the average word vector representation for all the words given in the "doc"
            This is done by performing a lookup for the words in the doc which are also present in the word2vec_model's
            vocabulary and then calculating the mean/average vector for all the words in the doc which are also present in the
            w2v model's vocabulary. If none of the words in the doc are present in the w2v model's vocabulary, we return a 300
            dimensional vector of all zeroes
            
    Input ->
            word2vec_model: Pre-trained word2vec model trained on Google News data (300-dimensional dense vectors)
            doc           : (string) The document/text for which we have to calculate the average word2vec
    
    Output ->
            300 dimensional average dense word vector of "doc" or a same sized vector of all zeroes if none of the word
            in doc is present in the word2vec_model's vocabulary    
    '''
    doc = [word for word in doc if word in word2vec_model.vocab]
    if len(doc) != 0:
        return np.mean(word2vec_model[doc], axis=0)
    else:
        return zeros_vector

In [5]:
'''
Here, we calculate the average word-vectors for each of the document in the 'name' and 'text' fields of the training and CV
data using the document_vector() function defined above
'''
avg_word2vec_train_name = np.zeros((len(train_name), 300))
for index, doc in enumerate(train_name): 
    avg_word2vec_train_name[index] = document_vector(model, doc)

avg_word2vec_cv_name = np.zeros((len(cv_name), 300))
for index, doc in enumerate(cv_name): 
    avg_word2vec_cv_name[index] = document_vector(model, doc)

avg_word2vec_train_text = np.zeros((len(train_text), 300))
for index, doc in enumerate(train_text): 
    avg_word2vec_train_text[index] = document_vector(model, doc)

avg_word2vec_cv_text = np.zeros((len(cv_text), 300))
for index, doc in enumerate(cv_text):
    avg_word2vec_cv_text[index] = document_vector(model, doc)

#### One Hot Encoding for Shipping and item_condition_id

In [26]:
X_tr_shipping, X_cv_shipping = one_hot_encoder(np.reshape(df_train['shipping'].values, (-1, 1)), np.reshape(df_cv['shipping'].values, (-1, 1)))
X_tr_item_condition, X_cv_item_condition = one_hot_encoder(np.reshape(df_train['item_condition_id'].values, (-1, 1)), np.reshape(df_cv['item_condition_id'].values, (-1, 1)))

#### Combining all the features to create the training data matrix

In [38]:
'''
All the data is converted to 'float32' type for efficient RAM utilisation
'''
avg_word2vec_train_name = avg_word2vec_train_name.astype('float32')
avg_word2vec_train_text = avg_word2vec_train_name.astype('float32')
avg_word2vec_cv_name = avg_word2vec_cv_name.astype('float32')
avg_word2vec_cv_text = avg_word2vec_cv_text.astype('float32')

X_tr_shipping = X_tr_shipping.astype('float32')
X_cv_shipping = X_cv_shipping.astype('float32')

X_tr_item_condition = X_tr_item_condition.astype('float32')
X_cv_item_condition = X_cv_item_condition.astype('float32')

In [60]:
'''
Converting all data to dense type as these are not sparse vectors
'''
X_tr_shipping = X_tr_shipping.todense()
X_tr_item_condition = X_tr_item_condition.todense()
X_cv_shipping = X_cv_shipping.todense()
X_cv_item_condition = X_cv_item_condition.todense()

#### Combining the dense features to form the training and cross validation data matrices

In [64]:
X_train = np.concatenate((avg_word2vec_train_name,
                          avg_word2vec_train_text,
                          X_tr_shipping,
                          X_tr_item_condition), axis = 1)

In [68]:
X_cv= np.concatenate((avg_word2vec_cv_name,
                      avg_word2vec_cv_text,
                      X_cv_shipping,
                      X_cv_item_condition), axis = 1)

### Model 4.1: XGboost with 500 trees

In [42]:
xgb_reg = XGBRegressor(n_estimators = 500,
                       random_state = 21,
                       n_jobs = -1)

xgb_reg.fit(X_train, y_train)

y_pred_train = xgb_reg.predict(X_train)
y_pred_train[y_pred_train < 0] = 0
y_pred_cv = xgb_reg.predict(X_cv)
y_pred_cv[y_pred_cv < 0] = 0
print("Training RMSLE = ", rmsle(y_train, y_pred_train))
print("CV RMSLE = ", rmsle(y_cv, y_pred_cv))

Training RMSLE =  0.53467884038316
CV RMSLE =  0.5667811273104297


### Model 4.2: XGboost with 1000 trees

In [43]:
xgb_reg = XGBRegressor(n_estimators = 1000,
                       random_state = 21,
                       n_jobs = -1)

xgb_reg.fit(X_train, y_train)

y_pred_train = xgb_reg.predict(X_train)
y_pred_train[y_pred_train < 0] = 0
y_pred_cv = xgb_reg.predict(X_cv)
y_pred_cv[y_pred_cv < 0] = 0
print("Training RMSLE = ", rmsle(y_train, y_pred_train))
print("CV RMSLE = ", rmsle(y_cv, y_pred_cv))

Training RMSLE =  0.5048164553553274
CV RMSLE =  0.5566548151608218


### Model 4.3: XGboost with 1500 trees

In [44]:
xgb_reg = XGBRegressor(n_estimators = 1500,
                       random_state = 21,
                       n_jobs = -1)

xgb_reg.fit(X_train, y_train)

y_pred_train = xgb_reg.predict(X_train)
y_pred_train[y_pred_train < 0] = 0
y_pred_cv = xgb_reg.predict(X_cv)
y_pred_cv[y_pred_cv < 0] = 0
print("Training RMSLE = ", rmsle(y_train, y_pred_train))
print("CV RMSLE = ", rmsle(y_cv, y_pred_cv))

Training RMSLE =  0.48509613301263727
CV RMSLE =  0.5519472286014295


### None of the boosting models performed well on Dense vector representations
- This maybe because most of the words in the given text corpus are not present in the Word2Vec model's vocabulary and therefore, we have to use a vector of all zeros place of those. <br>
- Exhaustive GridSearch/RandomSearch was not performed due to memory constraints as the dense vector representations consumed very high RAM

### Model 5: MLP on TF-IDF Features

In [23]:
# For reproducible results
tf.random.set_seed(21)

In [24]:
'''
Since, we took log of the original y_train and y_cv (so that we could directly optimize for MSE), we convert it back
to the original format so that this can be used to evaluate model's performance on training data
'''
y_train_original = np.expm1(y_train.reshape(-1, 1))
y_cv_original    = np.expm1(y_cv.reshape(-1, 1))

In [33]:
def mlp_model_1(train_shape):
    '''
    Task  -> This function builds the architecture of an MLP model with the input dimensions as "train_shape"
             The architecture of the model is as follows:
             Input Layer -> Dense (256) -> Dense (128) -> Dense (1) -> Output Layer
             The activation function is kept as ReLu for the hidden layers and linear activation (f(x) = x) for the output layer
    
    Input  -> train_shape: Input shape (dimensions) of the data which will be fed to the MLP
    
    Output -> Builded MLP Model
    '''
    model_input = Input(shape=(train_shape,), dtype='float32', sparse=True)
    out = Dense(256, activation='relu')(model_input)
    out = Dense(128, activation='relu')(out)
    model_out = Dense(1)(out)
    model = Model(model_input, model_out)
    return model
    
def mlp_model_2(train_shape):
    '''
    Task  -> This function builds the architecture of an MLP model with the input dimensions as "train_shape"
             The architecture of the model is as follows:
             Input Layer -> Dense (1024) -> Dense (512) -> Dense (256) -> Dense (128) -> Dense (64) -> Dense (32) -> Dense (1)
             -> Output Layer
             The activation function is kept as ReLu for the hidden layers and linear activation (f(x) = x) for the output layer
    
    Input  -> train_shape: Input shape (dimensions) of the data which will be fed to the MLP
    
    Output -> Builded MLP Model
    
    '''
    model_input = Input(shape=(train_shape,), dtype='float32', sparse=True)
    out = Dense(1024, activation='relu')(model_input)
    out = Dense(512, activation='relu')(out)
    out = Dense(256, activation='relu')(out)
    out = Dense(128, activation='relu')(out)
    out = Dense(64, activation='relu')(out)
    out = Dense(32, activation='relu')(out)
    out = Dense(1)(out)
    model = Model(model_input, out)
    return model

Note: Each of these models were trained for only 3 epochs since they heavily overfitted within that only <br> The doubling of batch size after each epoch was a hack that worked brilliantly (inspired by the winners' solution to this problem)

#### Training MLP Model  - 1

In [9]:
'''
We build the MLP Model - 1 using the above function mlp_model_1(), then compile it using the Adam Optimizer, optimizing for
the mean squared error. Finally, we fit the model on the training data for 3 epochs doubling the batch size after each epoch
with an initial batch size of 256
'''

mlp1 = mlp_model_1(X_train.shape[1])
mlp1.compile(optimizer='adam', loss='mean_squared_error')
mlp1.fit(X_train, y_train, batch_size = 256, epochs = 1, verbose = 1)
mlp1.fit(X_train, y_train, batch_size = 512, epochs = 1, verbose = 1)
mlp1.fit(X_train, y_train, batch_size = 1024, epochs = 1, verbose = 1)



#### Predictions from MLP Model - 1

In [12]:
'''
We obtain the predictions from the above trained model on the training data and calculate the train RMSLE
'''

y_pred_train = mlp1.predict(X_train)[:, 0]
y_pred_train_mlp_1 = np.expm1(y_pred_train.reshape(-1, 1))[:, 0]
print("Train RMSLE for MLP model 1 is: ", rmsle(y_train_original, y_pred_train_mlp_1))

Train RMSLE for MLP model 1 is:  0.18930484114517668


In [36]:
'''
We obtain the predictions from the above trained model on the CV data and calculate the CV RMSLE
'''
y_pred_cv = mlp1.predict(X_cv)[:, 0]
y_pred_cv_mlp_1 = np.expm1(y_pred_cv.reshape(-1, 1))[:, 0]
print("CV RMSLE for MLP model 1 is: ", rmsle(y_cv_original, y_pred_cv_mlp_1))

CV RMSLE for MLP model 1 is:  0.4119441261549299


#### Training MLP Model  - 2

In [13]:
'''
We build the MLP Model - 2 using the above function mlp_model_2(), then compile it using the Adam Optimizer, optimizing for
the mean squared error. Finally, we fit the model on the training data for 3 epochs doubling the batch size after each epoch
with an initial batch size of 256
'''

mlp2 = mlp_model_2(X_train.shape[1])
mlp2.compile(optimizer='adam', loss='mean_squared_error')
mlp2.fit(X_train, y_train, batch_size = 256, epochs = 1, verbose = 1)
mlp2.fit(X_train, y_train, batch_size = 512, epochs = 1, verbose = 1)
mlp2.fit(X_train, y_train, batch_size = 1024, epochs = 1, verbose = 1)



#### Predictions from MLP Model - 2

In [16]:
'''
We obtain the predictions from the above trained model on the training data and calculate the train RMSLE
'''
y_pred_train = mlp2.predict(X_train)[:, 0]
y_pred_train_mlp_2 = np.expm1(y_scaler.inverse_transform(y_pred_train.reshape(-1, 1))[:, 0])
print("Train RMSLE for MLP model 2 is: ", rmsle(y_train_original, y_pred_train_mlp_2))

Train RMSLE for MLP model 2 is:  0.15686919200114158


In [37]:
'''
We obtain the predictions from the above trained model on the CV data and calculate the CV RMSLE
'''
y_pred_cv = mlp2.predict(X_cv)[:, 0]
y_pred_cv_mlp_2 = np.expm1(y_pred_cv.reshape(-1, 1))[:, 0]
print("CV RMSLE for MLP model 2 is: ", rmsle(y_cv_original, y_pred_cv_mlp_2))

CV RMSLE for MLP model 2 is:  0.40526546757131204


In [25]:
#This code is inspired from: https://github.com/debayanmitra1993-data/Mercari-Price-Recommendation/blob/master/kaggle_sub.py

def ensemble_generator(mlp1_preds, mlp2_preds):
    '''
    
    Task   -> This function calculates the best ensemble we can generate from predictions of MLP model 1 and MLP model 2.
              This is done by assigning a weight of 'w' to the first model's predictions and a weight of '(1-w)' to the second
              model's predictions. This is like weighted averaging of the two models to generate a better model.
              We select w using hyperparameter search, with, w in range [0, 0.02, 0.04, 0.06, ..., 1)
              Best w is selected which gives the lowest RMSLE on the CV data
              The best w along with the final weighted predictions are returned as an output
    
    Input  -> Tuple: (MLP_Model_1_Predictions, MLP_Model_2_Predictions)
    
    Output -> Tuple: (best_w, final_weighted_predictions using averaging from best_w)
    
    '''
    
    weights = np.arange(0, 1, 0.02)
    scores = []
    
    for w in weights:
        preds_f = (w*mlp1_preds) + (1-w)*(mlp2_preds)
        scores.append(rmsle(y_cv, preds_f))
    
    min_rmsle_index = np.argmin(scores)
    
    w_min_rmsle = weights[min_rmsle_index]

    preds_final = (w_min_rmsle*mlp1_preds) + (1-w_min_rmsle)*(mlp2_preds)
    
    return w_min_rmsle, preds_final

#### Evaluating performance of the Ensemble model on training and CV data

In [38]:
w_best, final_predictions_cv = ensemble_generator(y_pred_cv_mlp_1, y_pred_cv_mlp_2)
print("Best w =", w_best)
print("CV RMSLE for ensemble of MLP-1 and MLP-2 is:", np.round(rmsle(y_cv_original, final_predictions_cv), 4))

Best w = 0.4
CV RMSLE for ensemble of MLP-1 and MLP-2 is: 0.3986


In [31]:
final_predictions_train = w_best*y_pred_train_mlp_1 + (1-w_best)*y_pred_train_mlp_2
print("Train RMSLE for ensemble of MLP-1 and MLP-2 is:", np.round(rmsle(y_train_original, final_predictions_train), 4))

Train RMSLE for ensemble of MLP-1 and MLP-2 is: 0.1511


## Evaluating the model's performance on unseen testing data (on Kaggle)

### Loading the test data

In [2]:
df_test = pd.read_csv('test_stg2.tsv', sep='\t')
test_ids = df_test['test_id'].values

### Preprocessing the test data

#### Handling Null Values

In [9]:
df_test.fillna('', inplace=True)
df_test['item_description']  = df_test['item_description'].str.replace('^no description yet$', '', regex=True)

#### Combining text fields together

In [None]:
df_test['name'] = df_test['name'] + " " + df_test['brand_name']
df_test['text'] = df_test['item_description'] + " " + df_test['name'] + " " + df_test['category_name']

### Cleaning Text Data

In [13]:
# Ref: AAIC Notebook for Donors' Choose
def decontracted(sent):
    '''
    Task:   This Function changes common short forms like can't, won't to can not, will not resp. (Decontraction)
            This is done to ensure uniformity in the whole text
    Input:  Raw Text
    Output: Decontracted Text
    '''
    sent = re.sub(r"aren\'t", "are not", sent)
    sent = re.sub(r"didn\'t", "did not", sent)
    sent = re.sub(r"can\'t", "can not", sent)
    sent = re.sub(r"couldn\'t", "could not", sent)
    sent = re.sub(r"won\'t", "would not", sent)
    sent = re.sub(r"wouldn\'t", "would not", sent)
    sent = re.sub(r"haven\'t", "have not", sent)
    sent = re.sub(r"shouldn\'t", "should not", sent)
    sent = re.sub(r"doesn\'t", "does not", sent)
    sent = re.sub(r"don\'t", "do not", sent)
    sent = re.sub(r"didn\'t", "did not", sent)
    sent = re.sub(r"mustn\'t", "must not", sent)
    sent = re.sub(r"needn\'t", "need not", sent)
    
    return sent

In [14]:
df_test['name'] = df_test['name'].apply(lambda x : decontracted(x))
df_test['text'] = df_test['text'].apply(lambda x : decontracted(x))

In [15]:
#Defining some special regexes which would be used in the function text_preprocessing() to clean the text
regex_special_chars = re.compile('[^A-Za-z0-9.]+')
regex_decimal_digits = re.compile('(?<!\d)\.(?!\d)')
regex_white_space = re.compile(r'\s+')           
    
#Creating a slightly modified list of stopwords which does not contain "no", "nor" or "not"
stop_words = set(stopwords.words("english")) - {"no", "nor", "not"}

In [16]:
def text_preprocessing(sent):
    '''
    Input  -> Raw text (string)
    Output -> Cleaned Text (string)
    Task   -> The objective of this function is to clean the text and make it suitable for Bag of Words/TF-IDF vectorization
              This includes removal of new lines, special characters, emojis etc.
    
    '''
    #Removing special characters such as carriage return and newline character
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\n', ' ')

    #Removing all special characters except the period
    sent = regex_special_chars.sub(' ', sent)
    
    #Removing periods which are not either followed or preceeded by a digit
    #Ref: https://stackoverflow.com/questions/6599646/remove-decimal-point-when-not-between-two-digits
    
    sent = regex_decimal_digits.sub(' ', sent)
    
    #Converting multiple white spaces to single white space
    sent = regex_white_space.sub(' ', sent)
    
    #Removing space at starting and ending and converting to lower case
    sent = sent.strip().lower()
    
    # Lemmatizing the text: Lemmetization in NLP means to convert similar words to the same word while taking care of grammar
    sent_list = sent.split()
    lem = WordNetLemmatizer()
    text = [lem.lemmatize(word) for word in sent_list if word not in stop_words] 
    sent = " ".join(text)
    
    return sent

In [17]:
df_test['name'] = df_test['name'].apply(lambda x : text_preprocessing(x))
df_test['text'] = df_test['text'].apply(lambda x : text_preprocessing(x))

#### TF-IDF Encoding the text data

In [4]:
def tfidf_encoder_evaluate(train_data, test_data, N_GRAMS = 1):
    '''
    
    This function returns the TF-IDF encoding of the test data
    
    Input ->
    
        train_data       : Text (string or list of strings or Pandas Series with elements as strings)
        test_data        : Text (string or list of strings or Pandas Series with elements as strings)
        n_grams(int)     : Upper bound of the n_grams to be considered while vectorizing the data using TF-IDF encoder
                           For eg., If the n_grams = 2, then both unigrams and bi-grams will be used while vectorizing
                           the text data. Default value is kept as 1, which means only uni-grams will be generated if this
                           argument is not supplied explicitly while calling this function
        MaxFeatures(int) : Maximum number of features that the TF-IDF vectorizer will consider while vectorizing the text data
                           provided as input in the "train_data" and "test_data" arguments. Default value in kept as 10,000
    
    Output -> Tuple of TF-IDF vectors of "test_data" computed using sklearn's Tfidfvectorizer() with parameters
              as "n_grams" and "MaxFeatures"
    
    Task   -> Given a text (string), return the TF-IDF vectors for that text
              The vectorizer is fitted on the train_data and used to tranform the test data
    
    '''
    vectorizer = TfidfVectorizer(max_features = 100000,
                                 ngram_range = (1, N_GRAMS),
                                 strip_accents = 'unicode',
                                 analyzer = 'word',
                                 token_pattern = r'\w+')
    
    vectorizer.fit(train_data)
    
    test_tfidf = vectorizer.transform(test_data)
    
    del vectorizer
    gc.collect()
    
    return test_tfidf

In [17]:
X_test_name = tfidf_encoder_evaluate(df_train['name'], df_test['name'], N_GRAMS = 1)
X_test_text = tfidf_encoder_evaluate(df_train['text'], df_test['text'], N_GRAMS = 2)

#### One Hot Encoding for Shipping and item_condition_id

In [21]:
def one_hot_encoder_evaluate(train_data, test_data):
    '''
    This function returns the One Hot Encoded vectors for the given train and test data
    Input ->
        train_data : Training data to be fitted on (List of integers/strings or a Pandas Series)
        test_data  : Testing data to be one hot encoded (List of integers/strings or a Pandas Series)
    Output -> Tuple of One hot encoded vectors of testing data
    Task   -> This function converts the raw values (integers/strings) into one hot encoded vectors using
              sklearn's OneHotEncoder()
    '''
    ohe_encoder = OneHotEncoder()
    ohe_encoder.fit(train_data)
    test_ohe = ohe_encoder.transform(test_data)
    return test_ohe

In [22]:
X_test_shipping       = one_hot_encoder_evaluate(np.reshape(df_train['shipping'].values, (-1, 1)),
                                                 np.reshape(df_test['shipping'].values, (-1, 1)))

X_test_item_condition = one_hot_encoder_evaluate(np.reshape(df_train['item_condition_id'].values, (-1, 1)),
                                                 np.reshape(df_test['item_condition_id'].values, (-1, 1)))

#### Combining all the features to create the test data matrix

In [24]:
X_test = hstack((X_test_name,
                 X_test_text,
                 X_test_shipping,
                 X_test_item_condition)).tocsr().astype('float32')

In [25]:
print(X_test.shape)

(3460725, 190470)


#### Getting the final outputs using MLP Models

In [38]:
'''
We obtain the predictions from the above trained model on the Test data
'''
y_pred_test       = mlp1.predict(X_test)[:, 0]
y_pred_test_mlp_1 = np.expm1(y_pred_test.reshape(-1, 1))[:, 0]

y_pred_test       = mlp2.predict(X_test)[:, 0]
y_pred_test_mlp_2 = np.expm1(y_pred_test.reshape(-1, 1))[:, 0]

In [42]:
w_best = 0.4
final_predictions_test = w_best*y_pred_test_mlp_1 + (1-w_best)*y_pred_train_mlp_2

#### Storing the result to a csv file to submit to Kaggle

In [16]:
submission = pd.DataFrame({'test_id' : test_ids, 'price' : final_predictions_test})
submission.to_csv("submission.csv", index=False)

In [24]:
submission.head(10)

Unnamed: 0,test_id,price
0,0,6.283154
1,1,9.482358
2,2,50.16456
3,3,12.036359
4,4,8.741852
5,5,7.547048
6,6,8.493185
7,7,25.18097
8,8,62.259933
9,9,11.92149


### Conclusion

In [23]:
x = PrettyTable()

x.field_names = ["Model", "Vectorization", "RMSLE"]

x.add_row(["Ridge Regression", "TF-IDF Sparse vectors", "0.455"])
x.add_row(["Lasso Regression", "TF-IDF Sparse vectors", "0.746"])
x.add_row(["XGBoost", "TF-IDF Sparse vectors", "0.490"])
x.add_row(["XGBoost", "Average W2V dense vectors", "0.552"])
x.add_row(["MLP Model - 1", "TF-IDF Sparse vectors", "0.412"])
x.add_row(["MLP Model - 2", "TF-IDF Sparse vectors", "0.405"])
x.add_row(["Ensemble Model (MLP-1 + MLP-2)", "TF-IDF Sparse vectors", "0.398"])

print(x)

+--------------------------------+---------------------------+-------+
|             Model              |       Vectorization       | RMSLE |
+--------------------------------+---------------------------+-------+
|        Ridge Regression        |   TF-IDF Sparse vectors   | 0.455 |
|        Lasso Regression        |   TF-IDF Sparse vectors   | 0.746 |
|            XGBoost             |   TF-IDF Sparse vectors   | 0.490 |
|            XGBoost             | Average W2V dense vectors | 0.552 |
|         MLP Model - 1          |   TF-IDF Sparse vectors   | 0.412 |
|         MLP Model - 2          |   TF-IDF Sparse vectors   | 0.405 |
| Ensemble Model (MLP-1 + MLP-2) |   TF-IDF Sparse vectors   | 0.398 |
+--------------------------------+---------------------------+-------+


### Summary

1. The best model we got was an ensemble of MLP Model 1 and MLP Model 2. It had a train RMSLE of 0.151 and a <b> CV RMSLE of 0.398 </b> <br> <br>
2. Rest all the models could not capture the information present in the data and thus had a relatively higher bias but reduced variance, while, the MLP models, being Deep Neural Networks, were able to capture the information in the data and had very less bias but a high variance. <br> <br>
3. New features like historical price statistics which were engineered during Feature Engineering phase were not finally used as  these caused significant decrease in model's performance, probably, because these were causing overiftting due to being very focused towards the training data <br> <br>
4. Sparse vecrorizations (TF-IDF) were used instead of dense vectorizations (Word2Vec) as most of the words (~70%) in the given text corpus were not present in the vocabulary of Word2Vec model trained Google News dataset and hence the information present in the corpus was not utilized at its full potential. <br> <br>
5. Ridge Regression model is a very simple model, still, it performed quite well due to the presence of high number of features. Lasso Regression typically works better in case of sparse vectorization, but, it failed to converge in this case and hence could not be used. XGBoost models are complex tree based ensembles and hence these don't perform well with high dimensional data. Finally, the MLP models performed very well on the test data, but, as these are deep neural networks, these are prone to overfitting. <br>

#### Kaggle Submission

The final RMSLE obtained on the unseen test data of 3.5 million rows on Kaggle was 0.405, which resulted in <b> 16th position out of 2380 participants </b>, that is, in <b>top 1% of the final results.</b>

![Kaggle_Submission](https://i.imgur.com/BS8zTqz.png)
![Kaggle_Leaderboard](https://i.imgur.com/7m7jXB9.png)