<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Data-preprocessing" data-toc-modified-id="Data-preprocessing-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Data preprocessing</a></span></li></ul></li><li><span><a href="#Get-category-embeddings" data-toc-modified-id="Get-category-embeddings-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Get category embeddings</a></span><ul class="toc-item"><li><span><a href="#Get-the-cardinality-of-the-categorical-features" data-toc-modified-id="Get-the-cardinality-of-the-categorical-features-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Get the cardinality of the categorical features</a></span></li></ul></li></ul></div>

In [1]:
import os
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils
import feather

In [2]:
from fastai.imports import *
from fastai.torch_imports import *
from fastai.dataset import *
from fastai.learner import *
from fastai.structured import *
from fastai.column_data import *

  data = yaml.load(f.read()) or {}


In [3]:
import torch.optim as optim

In [4]:
import torch.nn as nn
import torch.nn.functional as F

0. Data preprocessing
1. Define dataset
2. Initialize dataloader
3. Categorize continuous & categorical features in the data
4. Define and intialize embeddings for the categorical features
5. Define the model architecture
6. Pick loss function 
7. Train the model
8. Room for improvements

### Data preprocessing

Let's load the train & test data. We have already done lot of feature engineering & preprocessing on the data. Before feeding data to neural networks we need to

1. Pick the most important features for training
2. Categorize continuous & categorical data features
3. Replace null or missing values with median
4. Mark categorical features as `categorical`
5. Scale/normalize continuous data as neural networks work better with normalized data

In [4]:
PATH = 'data/elo/'
dep = 'target'
df_raw = feather.read_dataframe('train_df_alpha')
df_test = feather.read_dataframe('test_df_alpha')

In [13]:
y = df_raw_copy['target'].values

After loading our previously calculated aggregates to data frames, we exclude some unimportant columns which showed little improvement to the model's accuracy. We label features with <100 unique values as categorical columns and the rest as contiguous columns. We save the column names of categorical & contiguous columns to `cat_flds` and `cont_flds` respectively.

In [8]:
### Convert date cols

for df in [df_raw, df_test]:
    for f in ['purchase_date_max','purchase_date_min','purchase_date_max_old',\
                     'purchase_date_min_old', 'observation_date_old']:
        df[f] = df[f].astype(np.int64) * 1e-9

### Remove some low importance cols


cols_excluded = ['purchase_date_max', 'purchase_date_max_old', 'card_id', 'first_active_month',
                 'target','outliers','card_id_size', 'card_id_size_old', 
                 'purchase_date_min', 'purchase_date_min_old','first_active_monthYear',
                 'first_active_monthMonth',
                 'first_active_monthWeek',
                 'first_active_monthDay',
                 'first_active_monthDayofweek',
                 'first_active_monthDayofyear',
                 'first_active_monthIs_month_end',
                 'first_active_monthIs_month_start',
                 'first_active_monthIs_quarter_end',
                 'first_active_monthIs_quarter_start',
                 'first_active_monthIs_year_end',
                 'Black_Friday_2017_mean',
                 'amount_month_ratio_max',
                 'purchase_Month_mean_old',
                 'purchase_amount_total_max',
                 'first_active_monthIs_year_start']

### Pick highly important features

cols_included = ['feature_1','feature_2','feature_3','transactions_count','subsector_id_nunique','merchant_id_nunique','merchant_category_id_nunique','purchase_Month_mean','purchase_Month_min','purchase_Month_max','purchase_Week_nunique','purchase_Week_mean','purchase_Week_min','purchase_Week_max','purchase_Dayofweek_mean','purchase_Dayofweek_min','purchase_Dayofweek_max','purchase_Day_nunique','purchase_Day_mean','purchase_Day_min','purchase_Day_max','purchase_Hour_nunique','purchase_Hour_mean','purchase_Hour_min','purchase_Hour_max','purchase_amount_sum','purchase_amount_max','purchase_amount_min','purchase_amount_mean','purchase_amount_var','purchase_amount_skew','installments_sum','installments_max','installments_mean','installments_var','installments_skew','month_lag_max','month_lag_min','month_lag_mean','month_lag_var','month_lag_skew','month_diff_mean','month_diff_var','month_diff_skew','purchased_on_weekend_mean','category_1_mean','category_2_mean','category_3_mean','card_id_count','price_mean','price_max','price_min','price_var','Christmas_Day_2017_mean','Children_day_2017_mean','Black_Friday_2017_mean','Mothers_Day_2018_mean','duration_mean','duration_min','duration_max','duration_var','duration_skew','amount_month_ratio_mean','amount_month_ratio_min','amount_month_ratio_max','amount_month_ratio_var','amount_month_ratio_skew','category_2_mean_mean','category_3_mean_mean','purchase_date_diff','purchase_date_average','purchase_date_uptonow','purchase_date_uptomin','transactions_count_old','subsector_id_nunique_old','merchant_id_nunique_old','merchant_category_id_nunique_old','purchase_Month_nunique','purchase_Month_mean_old','purchase_Month_min_old','purchase_Month_max_old','purchase_Week_nunique_old','purchase_Week_mean_old','purchase_Week_min_old','purchase_Week_max_old','purchase_Dayofweek_mean_old','purchase_Day_nunique_old','purchase_Day_mean_old','purchase_Day_min_old','purchase_Hour_nunique_old','purchase_Hour_mean_old','purchase_Hour_min_old','purchase_Hour_max_old','purchase_amount_sum_old','purchase_amount_max_old','purchase_amount_min_old','purchase_amount_mean_old','purchase_amount_var_old','purchase_amount_skew_old','installments_sum_old','installments_max_old','installments_mean_old','installments_var_old','installments_skew_old','month_lag_max_old','month_lag_min_old','month_lag_mean_old','month_lag_var_old','month_lag_skew_old','month_diff_max','month_diff_min','month_diff_mean_old','month_diff_var_old','month_diff_skew_old','authorized_flag_mean','purchased_on_weekend_mean_old','category_1_mean_old','category_2_mean_old','category_3_mean_old','card_id_count_old','price_sum','price_mean_old','price_max_old','price_min_old','price_var_old','Christmas_Day_2017_mean_old','Mothers_Day_2017_mean','fathers_day_2017_mean','Children_day_2017_mean_old','Valentine_Day_2017_mean','Black_Friday_2017_mean_old','Mothers_Day_2018_mean_old','duration_mean_old','duration_min_old','duration_max_old','duration_var_old','duration_skew_old','amount_month_ratio_mean_old','amount_month_ratio_min_old','amount_month_ratio_max_old','amount_month_ratio_var_old','amount_month_ratio_skew_old','category_2_mean_mean_old','category_3_mean_mean_old','purchase_date_diff_old','purchase_date_average_old','purchase_date_uptonow_old','purchase_date_uptomin_old','quarter','observed_elapsed_time','days_feature1','days_feature2','days_feature3','days_feature1_ratio','days_feature2_ratio','days_feature3_ratio','feature_sum','feature_mean','feature_max','feature_min','feature_var','card_id_total','card_id_count_total','card_id_count_ratio','purchase_amount_total','purchase_amount_total_mean','purchase_amount_total_max','purchase_amount_total_min','purchase_amount_sum_ratio','hist_first_buy','new_first_buy','hist_last_buy','new_last_buy','month_diff_ratio','installments_total','installments_ratio','price_total','CLV','CLV_old','CLV_ratio']

df_train_columns = [c for c in cols_included if c not in cols_excluded]

exp_cols = ['merchant_address_id_nunique', 'merchant_rating_nunique']

df_train_columns = df_train_columns + exp_cols

len(df_train_columns)

df_raw_copy = df_raw.copy()
df_test_copy = df_test.copy()

df_raw = df_raw[df_train_columns]
df_test = df_test[df_train_columns]

### Get validation idx

n_valid = 12000
n_trn = len(df_raw)-n_valid
val_idx = list(range(n_trn, len(df_raw)))

### Get categorical & continous fields

cat_flds = [n for n in df_raw.columns.values if (df_raw[n].nunique()<100) & (n != 'outliers')]
' '.join(cat_flds)

[n for n in df_raw.drop(cat_flds,axis=1).columns if not is_numeric_dtype(df_raw[n])]

for n in cat_flds: df_raw[n] = df_raw[n].astype('category').cat.as_ordered()

cont_flds = [n for n in df_raw.columns if n not in cat_flds and n!= 'outliers']


We have some infinity values in our dataframe (came from ratios), boosting trees handle infinity very well. But it's not the case with neural nets. Let's impute them by replacing them with 0 and also replacing the missing values with median.

In [6]:
df_raw.replace(np.inf, 0, inplace=True)
df_raw.replace(-np.inf, 0, inplace=True)

df_test.replace(np.inf, 0, inplace=True)
df_test.replace(-np.inf, 0, inplace=True)

for n in cat_flds: df_raw[n] = df_raw[n].astype('category').cat.as_ordered()

for n in cont_flds: df_raw[n] = df_raw[n].fillna(df_raw[n].median()).astype('float32')
for n in cont_flds: df_test[n] = df_test[n].fillna(df_test[n].median()).astype('float32')


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  method=method)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.


Let's go ahead and mark all the categorical columns as categories. Pandas assigns them an integer for each category and stores the mapping of category to integer separately. For neural net embeddings we need our categories to be integers.

In [8]:
def apply_cats(df, trn):
    """Changes any columns of strings in df into categorical variables using trn as
    a template for the category codes.
    Parameters:
    -----------
    df: A pandas dataframe. Any columns of strings will be changed to
        categorical values. The category codes are determined by trn.
    trn: A pandas dataframe. When creating a category for df, it looks up the
        what the category's code were in trn and makes those the category codes
        for df.
    Examples:
    ---------
    >>> df = pd.DataFrame({'col1' : [1, 2, 3], 'col2' : ['a', 'b', 'a']})
    >>> df
       col1 col2
    0     1    a
    1     2    b
    2     3    a
    note the type of col2 is string
    >>> train_cats(df)
    >>> df
       col1 col2
    0     1    a
    1     2    b
    2     3    a
    now the type of col2 is category {a : 1, b : 2}
    >>> df2 = pd.DataFrame({'col1' : [1, 2, 3], 'col2' : ['b', 'a', 'a']})
    >>> apply_cats(df2, df)
           col1 col2
        0     1    b
        1     2    a
        2     3    a
    now the type of col is category {a : 1, b : 2}
    """
    for n,c in df.items():
        if (n in trn.columns) and (trn[n].dtype.name=='category'):
            df[n] = c.astype('category').cat.as_ordered()
            df[n].cat.set_categories(trn[n].cat.categories, ordered=True, inplace=True)

In [40]:
apply_cats(df_test, df_raw)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df[n] = c.astype('category').cat.as_ordered()


We initialise the `target` in test dataframe to be `0`.

In [41]:
df_raw['target'] = df_raw_copy['target']
df_test['target'] = 0.0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


We write a function to convert our df into an entirely numeric dataframe for reason mentioned above. For each column of df which is not in skip_flds nor in ignore_flds, na values are replaced by the
    median value of the column.

In [10]:
def proc_df(df, y_fld=None, skip_flds=None, ignore_flds=None, do_scale=False, na_dict=None,
            preproc_fn=None, max_n_cat=None, subset=None, mapper=None):
    """ proc_df takes a data frame df and splits off the response variable, and
    changes the df into an entirely numeric dataframe. For each column of df 
    which is not in skip_flds nor in ignore_flds, na values are replaced by the
    median value of the column.
    Parameters:
    -----------
    df: The data frame you wish to process.
    y_fld: The name of the response variable
    skip_flds: A list of fields that dropped from df.
    ignore_flds: A list of fields that are ignored during processing.
    do_scale: Standardizes each column in df. Takes Boolean Values(True,False)
    na_dict: a dictionary of na columns to add. Na columns are also added if there
        are any missing values.
    preproc_fn: A function that gets applied to df.
    max_n_cat: The maximum number of categories to break into dummy values, instead
        of integer codes.
    subset: Takes a random subset of size subset from df.
    mapper: If do_scale is set as True, the mapper variable
        calculates the values used for scaling of variables during training time (mean and standard deviation).
    Returns:
    --------
    [x, y, nas, mapper(optional)]:
        x: x is the transformed version of df. x will not have the response variable
            and is entirely numeric.
        y: y is the response variable
        nas: returns a dictionary of which nas it created, and the associated median.
        mapper: A DataFrameMapper which stores the mean and standard deviation of the corresponding continuous
        variables which is then used for scaling of during test-time.
    Examples:
    ---------
    >>> df = pd.DataFrame({'col1' : [1, 2, 3], 'col2' : ['a', 'b', 'a']})
    >>> df
       col1 col2
    0     1    a
    1     2    b
    2     3    a
    note the type of col2 is string
    >>> train_cats(df)
    >>> df
       col1 col2
    0     1    a
    1     2    b
    2     3    a
    now the type of col2 is category { a : 1, b : 2}
    >>> x, y, nas = proc_df(df, 'col1')
    >>> x
       col2
    0     1
    1     2
    2     1
    >>> data = DataFrame(pet=["cat", "dog", "dog", "fish", "cat", "dog", "cat", "fish"],
                 children=[4., 6, 3, 3, 2, 3, 5, 4],
                 salary=[90, 24, 44, 27, 32, 59, 36, 27])
    >>> mapper = DataFrameMapper([(:pet, LabelBinarizer()),
                          ([:children], StandardScaler())])
    >>>round(fit_transform!(mapper, copy(data)), 2)
    8x4 Array{Float64,2}:
    1.0  0.0  0.0   0.21
    0.0  1.0  0.0   1.88
    0.0  1.0  0.0  -0.63
    0.0  0.0  1.0  -0.63
    1.0  0.0  0.0  -1.46
    0.0  1.0  0.0  -0.63
    1.0  0.0  0.0   1.04
    0.0  0.0  1.0   0.21
    """
    if not ignore_flds: ignore_flds=[]
    if not skip_flds: skip_flds=[]
    if subset: df = get_sample(df,subset)
    else: df = df.copy()
    ignored_flds = df.loc[:, ignore_flds]
    df.drop(ignore_flds, axis=1, inplace=True)
    if preproc_fn: preproc_fn(df)
    if y_fld is None: y = None
    else:
        if not is_numeric_dtype(df[y_fld]): df[y_fld] = pd.Categorical(df[y_fld]).codes
        y = df[y_fld].values
        skip_flds += [y_fld]
    df.drop(skip_flds, axis=1, inplace=True)

    if na_dict is None: na_dict = {}
    else: na_dict = na_dict.copy()
    na_dict_initial = na_dict.copy()
    for n,c in df.items(): na_dict = fix_missing(df, c, n, na_dict)
    if len(na_dict_initial.keys()) > 0:
        df.drop([a + '_na' for a in list(set(na_dict.keys()) - set(na_dict_initial.keys()))], axis=1, inplace=True)
    if do_scale: mapper = scale_vars(df, mapper)
    for n,c in df.items(): numericalize(df, c, n, max_n_cat)
    df = pd.get_dummies(df, dummy_na=True)
    df = pd.concat([ignored_flds, df], axis=1)
    res = [df, y, na_dict]
    if do_scale: res = res + [mapper]
    return res

In [42]:
df_raw_copy = df_raw[cat_flds+cont_flds+[dep]].copy()
df, y, nas, mapper_t = proc_df(df_raw_copy, 'target', do_scale=True)

  sqr = np.multiply(arr, arr, out=arr)


For validation set, we pick 10000 rows from the dataframe and copy their indices to `val_idx`.

In [None]:
val_idx = list(range(n_trn, len(df)))

In [43]:
df.shape

(201917, 178)

Apply the same transformation we did for training data set. We will be passing the mapping of categorical columns so that our test dataset has the same mapping as our training dataset.

In [44]:
%%time
df_test_copy = df_test[cat_flds+cont_flds+[dep]].copy()
df_t, _, nas, mapper = proc_df(df_test_copy,'target', do_scale=True, mapper=mapper_t)

CPU times: user 5.86 s, sys: 379 ms, total: 6.24 s
Wall time: 5 s


In [45]:
df.shape, df_t.shape

((201917, 178), (123623, 178))

Define the Root Mean Square Error function.

In [5]:
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

In [9]:
df = feather.read_dataframe('df_torch_train')
df_t = feather.read_dataframe('df_torch_test')

## Get category embeddings

### Get the cardinality of the categorical features

In [15]:
embedding_cardinality = {n: len(c.cat.categories)+1 for n,c in df_raw[cat_flds].items()}

Now for the embeddings, we get the unique categories in each of the categorical fields. We create embeddings with half the caridanality of the categorical features with minimum being 10.

In [16]:
emb_sizes = [(size, max(5, size//2)) for item, size in embedding_cardinality.items()]

Specify the target variable range to be max & min of the existing target range. We define the range of our output value to be within 20% of our maximum value.

In [19]:
y_range=(np.min(y)*1,np.max(y)*1.2)

Finally we have 178 features of which 53 are categorical and 125 are contiguous.

In [20]:
len(cat_flds), len(cont_flds), len(df.columns)

(53, 125, 178)

Now let's prepare our dataset using pytorch Dataset class for feeding our model. Our dataset simply returns an array of [categorical + contiguous] feature values as x & target as y.

In [53]:
class ModelDataset(Dataset):
    def __init__(self, df, cat_fields, cont_fields, y):
        self.df = df
        self.y = y.astype(np.float32)
        cat_values = [c.values for n,c in df[cat_fields].items()]
        cont_values = [c.values for n,c in df[cont_fields].items()]
        self.cat_features = np.stack(cat_values, 1).astype(np.int64)
        self.cont_features = np.stack(cont_values, 1).astype(np.float32)
#         self.cats = np.stack(cat_fields,  1).astype(np.int64)

    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        cat_val = self.cat_features[idx]
        cont_val = self.cont_features[idx]
        y = self.y[idx]
        return [cat_val, cont_val, y]

Let's initialise our training dataset with the `ModelDataset` class.

In [54]:
train_ds = ModelDataset(df, cat_flds, cont_flds, y)

For the dataloader we use pytorch provided `DataLoader`, and will initialise with our training dataset and a batch size of 20.

In [59]:
train_dl = torch.utils.data.DataLoader(train_ds, 20)

Let's have a look at how each mini-batch of data looks like

In [60]:
l1,l2,l3 = next(iter(train_dl))

In [61]:
l1.shape, l2.shape, l3.shape

(torch.Size([20, 53]), torch.Size([20, 125]), torch.Size([20]))

Let's create a function which uses pytorch ModuleList to create embeddings given the embedding sizes. We will initialise the embeddings inplace with uniformly distributed data between `(-0.01,0.01)`. Kaiming He initialization also proved to work good for this.

In [63]:
def get_embeddings(emb_sizes):
    embeddings = nn.ModuleList([nn.Embedding(car, siz) for car,siz in emb_sizes])
    for emb in embeddings:
        emb.weight.data.uniform_(-0.01,0.01)
    return embeddings

Let's build the pytorch model which will take embeddings for categorical features and add linear layers with batch normalization & dropouts and output a single value which is our target. This is a deep neural net as it has more than one layers with non linear layers woven in between.

In [14]:
class TabularModel(nn.Module):
    def __init__(self, emb_sizes, emb_dropout, lin_layers, lin_layers_dropout, n_cat_fields, n_cont_fields, y_range):
        super().__init__()
        # get embeddings
        self.embeddings = get_embeddings(emb_sizes)
        # embedding dropout
        self.emb_dropout = nn.Dropout(emb_dropout)
        # calculate linear layer sizes accounting embeddings
        emb_vectors_sum = sum([e.embedding_dim for e in self.embeddings])
        
        # Linear layer sizes are sum of embeddings size + contiguous fields' size + linear layers we wish to have
        linear_szs = [emb_vectors_sum + n_cont_fields] + lin_layers
        
        self.n_cont_fields = n_cont_fields
        # initialize linear layers
        self.lin_layers = nn.ModuleList([nn.Linear(linear_szs[i], linear_szs[i+1]) 
                                         for i in range(len(linear_szs)-1)])
        # Define output layer
        self.output_layer = nn.Linear(linear_szs[-1], 1)
        
        # Initialize batch normalisation for linear layers
        self.batch_norms_lin = nn.ModuleList([nn.BatchNorm1d(s) for s in linear_szs[1:]])
        # Initialize batch normalisation for continous fields
        self.batch_norm_cont = nn.BatchNorm1d(n_cont_fields)
        
        # dropout for linear layers
        self.linear_drops = nn.ModuleList([nn.Dropout(p) for p in lin_layers_dropout])
        
        self.y_range = y_range
        
    def forward(self, cat_fields, cont_fields):
        # Initialize embeddings for respective categorical fields
        x1 = [e(cat_fields[:,i]) for i,e in enumerate(self.embeddings)]
        # concatenate all the embeddings on axis 1
        x1 = torch.cat(x1, 1)
        # apply dropout on embeddings
        x1 = self.emb_dropout(x1)
        
        # apply batch normalization on continous fields
        x2 = self.batch_norm_cont(cont_fields)
        
        # concatenate along axis 1
        x1 = torch.cat([x1,x2], 1)
        
        # apply linear layers and respective batch norms followed by dropouts 
        for lin, drop, bn in zip(self.lin_layers, self.linear_drops, self.batch_norms_lin):
            # Non linear activation function relu will give only the non-negative values, negatives zeroed.
            x1 = F.relu(lin(x1))
            x1 = bn(x1)
            x1 = drop(x1)
        x1 = self.output_layer(x1)
        # pass the final layer through sigmoid which gives a value between 0 & 1
        x1 = torch.sigmoid(x1)
        y_min = self.y_range[0]
        y_max = self.y_range[1]
        # Mulitply/scale the output from sigmoid with the range of target to get our required y value.
        x1 = x1*(y_max-y_min)
        x1 = y_min + x1
        
        return x1

Let's initialize our pytorch model with 2 linear layers of sizes 1000, 500 with 1e-3 & 1e-2 dropouts. We will apply a dropout of 0.1 on our embeddings to ensure they're optimally regularized to prevent over fitting.

In [87]:
model = TabularModel(emb_sizes, 0.1, [1000,500], [0.001,0.01], len(cat_flds), len(cont_flds), y_range)

Let's have a look of the model parameters here. We can see the embedding layers followed by dropout, linear layers, batch normalization layers and finally dropout.

In [73]:
model.parameters

<bound method Module.parameters of TabularModel(
  (embeddings): ModuleList(
    (0): Embedding(6, 5)
    (1): Embedding(4, 5)
    (2): Embedding(3, 5)
    (3): Embedding(88, 44)
    (4): Embedding(25, 12)
    (5): Embedding(87, 43)
    (6): Embedding(40, 20)
    (7): Embedding(13, 6)
    (8): Embedding(13, 6)
    (9): Embedding(11, 5)
    (10): Embedding(53, 26)
    (11): Embedding(53, 26)
    (12): Embedding(8, 5)
    (13): Embedding(8, 5)
    (14): Embedding(31, 15)
    (15): Embedding(32, 16)
    (16): Embedding(32, 16)
    (17): Embedding(24, 12)
    (18): Embedding(25, 12)
    (19): Embedding(25, 12)
    (20): Embedding(14, 7)
    (21): Embedding(3, 5)
    (22): Embedding(3, 5)
    (23): Embedding(88, 44)
    (24): Embedding(63, 31)
    (25): Embedding(64, 32)
    (26): Embedding(64, 32)
    (27): Embedding(35, 17)
    (28): Embedding(93, 46)
    (29): Embedding(12, 6)
    (30): Embedding(12, 6)
    (31): Embedding(12, 6)
    (32): Embedding(53, 26)
    (33): Embedding(49, 24)
  

We will use Stochastic Gradient Descent as our optimiser function. let's go ahead and intialize the optimizer function with model parameters and learning rate of 1e-4. I found that 1e-3 works better in converging faster.

In [90]:
optimizer = optim.SGD(model.parameters(), 1e-4)

We will pick Mean Square Error as our loss function.

In [91]:
criterion = nn.MSELoss()

Let's train our models for 2 epochs

In [93]:
for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(train_dl, 0):
        # get the inputs; data is a list of [inputs, labels]
        in1, in2, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(in1, in2)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

[1,  2000] loss: 15.206
[1,  4000] loss: 14.758
[1,  6000] loss: 14.831
[1,  8000] loss: 14.637
[1, 10000] loss: 15.437
[2,  2000] loss: 14.602
[2,  4000] loss: 14.734
[2,  6000] loss: 14.821
[2,  8000] loss: 14.631
[2, 10000] loss: 15.434
Finished Training


Let's calculate the accuracy on our training data. 

In [107]:
train_dl_all = torch.utils.data.DataLoader(train_ds, 201917)

In [110]:
x_al1, x_al2, y_all = next(iter(train_dl_all))

In [111]:
preds = model(x_al1, x_al2)

In [120]:
preds.shape, y.shape

(torch.Size([201917, 1]), (201917,))

In [127]:
preds.detach().numpy()[:, 0].shape, y.shape

((201917,), (201917,))

In [128]:
rmse(preds.detach().numpy()[:, 0], y)

3.8548685360753963

3.85 is our training rmse. There is lot of scope for improvement here. We can use one-cycle policy to determine to train faster on larger epochs to converge better. We can finetune the hyperparameters like batchsize, linear layers size, embeddings size, dropout etc. I was able to get an rmse of 3.61 on the test dataset for the finetuned model on kaggle competition.