# DATA 607 -- Assignment 3
## Graeme Kempthorne, 30130245

In this assignment, we apply the ideas underlying dense word embeddings like Word2Vec and GloVe to construct dense embeddings of categorical features.

The context of our exploration will be the [Rossmann Store Sales Competition](https://www.kaggle.com/c/rossmann-store-sales/overview/description) from *Kaggle*, the goal of which is to forecast store sales using store, promotion, and competitor data.

## Instructions

1. Download the data from the competition page or from [my github](https://github.com/mgreenbe/rossmann).

2. Replace each date in the `Date` column with the number of days between it and January 1, 2013, the earliest date in the table.

3. Use `pd.get_dummies` to construct dataframes `stores`, `days_of_week`, and `state_holidays` containing 1-hot encodings of the categorical variables `Store`, `DayOfWeek`, and `StateHoliday`, respectively.

4. Assemble these encoded features, together with the numerical ones (`Date`, `Customers`) and binary ones (`Open`, `Promo`, `SchoolHoliday`), in a matrix `X`, the first 1115 columns of which represent the store ID.

5. Split the data `X` and `Y` into training and validation sets. Standardize the numerical feature columns. Here, the relevant means and standard deviations should be computed from *training data*.

6. Train the model `MyModel`, below, using `MeanSquaredLogarithmicError` as the loss function. Explain, briefly, why this is an appropriate choice of loss function. Stop training when validation error stabilizes.

7. **(Optional)** Add hidden layers to this model and tune the `store_emb_dim` hyperparameter to improve your results.


# 1

In [475]:
from google.colab import drive
import tensorflow as tf
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

drive.mount('/content/drive/')
df_main = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/rossmann-main/store.csv")
df_sample = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/rossmann-main/sample_submission.csv")
df_test = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/rossmann-main/test.csv")
df_train = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/rossmann-main/train.csv")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).




# 2

In [476]:
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train['Date'] = df_train['Date'] - pd.to_datetime('2013-01-01')
df_train['Date'] = df_train['Date'] / pd.Timedelta(1, unit='d')
df_train['Date']= df_train['Date'].astype('int')
df_train

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,5,941,5263,555,1,1,0,1
1,2,5,941,6064,625,1,1,0,1
2,3,5,941,8314,821,1,1,0,1
3,4,5,941,13995,1498,1,1,0,1
4,5,5,941,4822,559,1,1,0,1
...,...,...,...,...,...,...,...,...,...
1017204,1111,2,0,0,0,0,0,a,1
1017205,1112,2,0,0,0,0,0,a,1
1017206,1113,2,0,0,0,0,0,a,1
1017207,1114,2,0,0,0,0,0,a,1
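
The conversion in step 2 can also be written with the `.dt.days` accessor, which returns whole-day counts as integers in one step. The following is a minimal sketch on a toy frame; the frame `toy` and its three dates are illustrative assumptions, not rows taken from `train.csv`.

```python
import pandas as pd

# Toy frame standing in for train.csv (illustrative values only).
toy = pd.DataFrame({"Date": ["2013-01-01", "2013-01-02", "2015-07-31"]})

# Subtracting a Timestamp from a datetime column yields timedeltas;
# .dt.days extracts the whole-day count as integers.
origin = pd.Timestamp("2013-01-01")
toy["Date"] = (pd.to_datetime(toy["Date"]) - origin).dt.days

print(toy["Date"].tolist())
```

The last toy date, July 31, 2015, maps to 941, the same offset that appears in the first rows of the table above.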


In [477]:
df_train_sample = df_train.sample(n=10000, random_state=42)
df_train_sample

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
76435,616,7,873,0,0,0,0,0,0
923026,592,2,84,5548,710,1,1,0,1
731180,526,6,256,7467,1150,1,0,0,0
790350,601,2,203,3360,323,1,0,0,1
252134,953,1,713,11414,853,1,1,0,0
...,...,...,...,...,...,...,...,...,...
319724,160,5,640,0,0,0,1,a,0
356752,832,1,601,4933,448,1,0,0,1
912066,782,5,94,4491,425,1,0,0,1
448301,857,1,510,6055,781,1,0,0,0


# 3

In [478]:
df_train_all = pd.get_dummies(df_train_sample, columns=['Store','DayOfWeek','StateHoliday'])
df_train_all

Unnamed: 0,Date,Sales,Customers,Open,Promo,SchoolHoliday,Store_1,Store_2,Store_3,Store_4,Store_5,Store_6,Store_7,Store_8,Store_9,Store_10,Store_11,Store_12,Store_13,Store_14,Store_15,Store_16,Store_17,Store_18,Store_19,Store_20,Store_21,Store_22,Store_23,Store_24,Store_25,Store_26,Store_27,Store_28,Store_29,Store_30,Store_31,Store_32,Store_33,Store_34,...,Store_1088,Store_1089,Store_1090,Store_1091,Store_1092,Store_1093,Store_1094,Store_1095,Store_1096,Store_1097,Store_1098,Store_1099,Store_1100,Store_1101,Store_1102,Store_1103,Store_1104,Store_1105,Store_1106,Store_1107,Store_1108,Store_1109,Store_1110,Store_1111,Store_1112,Store_1113,Store_1114,Store_1115,DayOfWeek_1,DayOfWeek_2,DayOfWeek_3,DayOfWeek_4,DayOfWeek_5,DayOfWeek_6,DayOfWeek_7,StateHoliday_0,StateHoliday_0.1,StateHoliday_a,StateHoliday_b,StateHoliday_c
76435,873,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0
923026,84,5548,710,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
731180,256,7467,1150,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0
790350,203,3360,323,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
252134,713,11414,853,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319724,640,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0
356752,601,4933,448,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
912066,94,4491,425,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0
448301,510,6055,781,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
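
The two columns `StateHoliday_0` and `StateHoliday_0.1` in the output above suggest that the `StateHoliday` column holds the "no holiday" value in two spellings, most likely the integer `0` and the string `"0"`. Assuming that is the cause, casting the column to `str` before encoding merges them. The sketch below uses a toy column (an assumption for illustration) to show the effect.

```python
import pandas as pd

# Toy column mixing the integer 0 with the string "0" (illustrative values only).
toy = pd.DataFrame({"StateHoliday": [0, "0", "a", "b", 0, "0"]})

# Encoding the mixed column treats 0 and "0" as different categories.
mixed = pd.get_dummies(toy, columns=["StateHoliday"])

# Casting to str first merges the two spellings of "no holiday".
clean = pd.get_dummies(toy.astype({"StateHoliday": str}), columns=["StateHoliday"])

print(mixed.shape[1], clean.shape[1])
print(list(clean.columns))
```

Applying the same cast to the real data would change the number of columns in `X`, so any downstream slicing by position would need to be adjusted accordingly.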


# 4

In [479]:
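# Combine all of the predictor variables.
# Columns 0-5 are Date, Sales, Customers, Open, Promo, SchoolHoliday; the one-hot
# columns start at position 6. The -1 stop also leaves out the last dummy, StateHoliday_c.
X = df_train_all.iloc[:,6:-1]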
X = pd.concat([X, df_train_all['Date'], df_train_all['Customers'], df_train_all['Open'], df_train_all['Promo'], df_train_all['SchoolHoliday']],  axis = 1)
X

Unnamed: 0,Store_1,Store_2,Store_3,Store_4,Store_5,Store_6,Store_7,Store_8,Store_9,Store_10,Store_11,Store_12,Store_13,Store_14,Store_15,Store_16,Store_17,Store_18,Store_19,Store_20,Store_21,Store_22,Store_23,Store_24,Store_25,Store_26,Store_27,Store_28,Store_29,Store_30,Store_31,Store_32,Store_33,Store_34,Store_35,Store_36,Store_37,Store_38,Store_39,Store_40,...,Store_1092,Store_1093,Store_1094,Store_1095,Store_1096,Store_1097,Store_1098,Store_1099,Store_1100,Store_1101,Store_1102,Store_1103,Store_1104,Store_1105,Store_1106,Store_1107,Store_1108,Store_1109,Store_1110,Store_1111,Store_1112,Store_1113,Store_1114,Store_1115,DayOfWeek_1,DayOfWeek_2,DayOfWeek_3,DayOfWeek_4,DayOfWeek_5,DayOfWeek_6,DayOfWeek_7,StateHoliday_0,StateHoliday_0.1,StateHoliday_a,StateHoliday_b,Date,Customers,Open,Promo,SchoolHoliday
76435,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,873,0,0,0,0
923026,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,84,710,1,1,1
731180,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,256,1150,1,0,0
790350,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,203,323,1,0,1
252134,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,713,853,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319724,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,640,0,0,1,0
356752,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,601,448,1,0,1
912066,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,94,425,1,0,1
448301,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,510,781,1,0,0


# 5

In [480]:
# Calculating the mean and std of the numerical columns from the full df_train
# (all rows, not only the training split)
datemean = np.mean(df_train['Date'])
datestd  = np.std(df_train['Date'])
cusmean = np.mean(df_train['Customers'])
cusstd  = np.std(df_train['Customers'])
print(datemean, datestd, cusmean, datestd )  # fourth value repeats datestd; cusstd is not printed



465.0629959034967 274.4539720515303 633.1459464082602 274.4539720515303


In [481]:
# Standardize the numerical columns directly; no apply over the frame is needed.
X['Date'] = (X['Date'] - datemean) / datestd
X['Customers'] = (X['Customers'] - cusmean) / cusstd
X

Unnamed: 0,Store_1,Store_2,Store_3,Store_4,Store_5,Store_6,Store_7,Store_8,Store_9,Store_10,Store_11,Store_12,Store_13,Store_14,Store_15,Store_16,Store_17,Store_18,Store_19,Store_20,Store_21,Store_22,Store_23,Store_24,Store_25,Store_26,Store_27,Store_28,Store_29,Store_30,Store_31,Store_32,Store_33,Store_34,Store_35,Store_36,Store_37,Store_38,Store_39,Store_40,...,Store_1092,Store_1093,Store_1094,Store_1095,Store_1096,Store_1097,Store_1098,Store_1099,Store_1100,Store_1101,Store_1102,Store_1103,Store_1104,Store_1105,Store_1106,Store_1107,Store_1108,Store_1109,Store_1110,Store_1111,Store_1112,Store_1113,Store_1114,Store_1115,DayOfWeek_1,DayOfWeek_2,DayOfWeek_3,DayOfWeek_4,DayOfWeek_5,DayOfWeek_6,DayOfWeek_7,StateHoliday_0,StateHoliday_0.1,StateHoliday_a,StateHoliday_b,Date,Customers,Open,Promo,SchoolHoliday
76435,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1.486359,-1.363330,0,0,0
923026,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,-1.388440,0.165487,1,1,1
731180,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,-0.761742,1.112923,1,0,0
790350,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,-0.954852,-0.667826,1,0,1
252134,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0.903383,0.473404,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319724,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0.637400,-1.363330,0,1,0
356752,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0.495300,-0.398668,1,0,1
912066,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,-1.352004,-0.448193,1,0,1
448301,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0.163732,0.318369,1,0,0


In [482]:

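from sklearn.model_selection import train_test_split

# Hold out 20% of the sampled rows for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X, df_train_all['Sales'], test_size=0.2)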

print(X_tr.shape, X_te.shape,y_tr.shape, y_te.shape )

(8000, 1131) (2000, 1131) (8000,) (2000,)
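
Step 5 asks for the standardization statistics to be computed from the training data only, so that nothing about the validation rows leaks into preprocessing. A minimal sketch of that pattern follows, assuming a toy frame with made-up values and a simple positional split; the names `toy`, `train`, `valid`, `mu`, and `sigma` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the assembled features (illustrative values only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Date": rng.integers(0, 942, size=100).astype(float),
    "Customers": rng.integers(0, 2000, size=100).astype(float),
    "Promo": rng.integers(0, 2, size=100),
})

# Simple 80/20 split by position (the notebook uses train_test_split instead).
train, valid = toy.iloc[:80].copy(), toy.iloc[80:].copy()

# Statistics come from the training rows only.
num_cols = ["Date", "Customers"]
mu = train[num_cols].mean()
sigma = train[num_cols].std(ddof=0)  # population std, matching np.std

# The same training statistics are applied to both splits.
train[num_cols] = (train[num_cols] - mu) / sigma
valid[num_cols] = (valid[num_cols] - mu) / sigma

print(train[num_cols].mean().round(6).tolist())
```

Binary columns such as `Promo` are left untouched, in line with the instruction to standardize only the numerical features.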


# 6

In [483]:
from tensorflow import keras
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import mean_squared_log_error

class MyModel(keras.Model):
  def __init__(self, n_stores=1115, store_emb_dim=20):
    super(MyModel, self).__init__()
    self.n_stores = n_stores
    self.encoder = keras.layers.Dense(store_emb_dim, name="encoder")
    self.regressor = keras.layers.Dense(1, name="regressor")


  def call(self, X):
    x = tf.concat([self.encoder(X[:, :self.n_stores]), X[:, self.n_stores:]], axis=-1)
    return self.regressor(x)

In [484]:
model = MyModel()

In [524]:
model.compile(loss="MeanSquaredLogarithmicError", optimizer=Adam(learning_rate=0.0001), metrics=["mean_absolute_error"])


In [525]:
model.fit(X_tr, y_tr, validation_split=0.2, epochs=50) 
#learning rate adjusted as model trained

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f66de9ccc90>

In [526]:
model.summary()

Model: "my_model2_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
encoder (Dense)              multiple                  11160     
_________________________________________________________________
regressor (Dense)            multiple                  27        
Total params: 11,187
Trainable params: 11,187
Non-trainable params: 0
_________________________________________________________________
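
Step 6 says to stop training when the validation error stabilizes. In Keras this is usually handled by passing a `keras.callbacks.EarlyStopping` callback (with arguments such as `monitor="val_loss"` and `patience`) to `model.fit`. The underlying rule is simple enough to sketch in plain Python. The helper `stopping_epoch` and the loss values below are hypothetical, written only to illustrate the patience rule; they are not taken from the training run above.

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached at epoch 4.
history = [0.90, 0.60, 0.45, 0.44, 0.46, 0.45, 0.47, 0.43]
print(stopping_epoch(history, patience=3))
```

With a patience of 3, the rule stops at epoch 7, three epochs after the best loss, and never sees the later improvement at epoch 8. A larger patience trades extra training time for robustness to such noise.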


In [527]:
results = model.predict(X_te)

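# Mean signed difference between predictions and actual sales (not the MSLE itself).
# Flatten the (n, 1) predictions so the subtraction is elementwise rather than broadcast to (n, n).
mean_diff = np.mean(results.reshape(-1) - y_te.values)
print (f"Average difference between model predictions and actual sales : {mean_diff}")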

Average difference between model predictions and actual sales : -3168.2118432040215


`MeanSquaredLogarithmicError` is an appropriate loss function for this model because the target, `Sales`, is a continuous, nonnegative quantity. The loss minimizes the squared difference between the logarithms of the predicted and actual sales for each row, so it penalizes relative rather than absolute error, which suits sales figures that differ widely in scale from row to row. Classification-style measures would not be applicable here: a probability-based measure expects targets between 0 and 1, and accuracy would almost always be 0 because a predicted value would essentially never match the actual sales value exactly.
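
To make the relative-error behaviour concrete, here is a small NumPy sketch of the mean squared logarithmic error, computed as the mean of $(\log(1 + \hat{y}) - \log(1 + y))^2$. The function name `msle` and the sample values are assumptions for illustration, and the sketch assumes nonnegative inputs.

```python
import numpy as np


def msle(y_true, y_pred):
    """Mean squared logarithmic error for nonnegative arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x), which is defined at x = 0 (closed stores).
    return np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)


# A 10% overestimate on a small and on a large sales figure.
small = msle([100.0], [110.0])
large = msle([10_000.0], [11_000.0])
print(small, large)
```

The two losses are nearly equal even though the absolute errors differ by a factor of 100, whereas a plain squared error would be dominated by the high-volume store.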

# 7

In [503]:
class MyModel2(keras.Model):
  def __init__(self, n_stores=1115, store_emb_dim=10):
    super(MyModel2, self).__init__()
    self.n_stores = n_stores
    self.encoder = keras.layers.Dense(store_emb_dim, name="encoder")
    self.regressor = keras.layers.Dense(1, name="regressor")

   

  def call(self, X):
    x = tf.concat([self.encoder(X[:, :self.n_stores]), X[:, self.n_stores:]], axis=-1)
    return self.regressor(x)

In [504]:
model2 = MyModel2()

In [505]:
model2.compile(loss="MeanSquaredLogarithmicError", optimizer="adam", metrics=["mean_absolute_error"])

In [512]:
model2.fit(X_tr, y_tr, validation_split=0.2, epochs=5) 

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f66de887bd0>

In [513]:
model2.summary()

Model: "my_model2_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
encoder (Dense)              multiple                  11160     
_________________________________________________________________
regressor (Dense)            multiple                  27        
Total params: 11,187
Trainable params: 11,187
Non-trainable params: 0
_________________________________________________________________
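
The parameter counts in this summary can be checked by hand: a `Dense` layer with bias has (inputs × units) + units parameters. The sketch below reproduces the arithmetic, assuming `store_emb_dim=10` and the 1131 columns of `X` reported earlier; the helper `dense_params` is a hypothetical name introduced for illustration.

```python
def dense_params(n_inputs, n_units, use_bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return n_inputs * n_units + (n_units if use_bias else 0)


n_stores, store_emb_dim = 1115, 10
n_other = 1131 - n_stores  # the 16 non-store columns of X

# The encoder sees only the store columns; the regressor sees the
# embedding concatenated with the remaining features.
encoder = dense_params(n_stores, store_emb_dim)
regressor = dense_params(store_emb_dim + n_other, 1)

print(encoder, regressor, encoder + regressor)
```

This gives 11,160 for the encoder, 27 for the regressor, and 11,187 in total, matching the summary.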


In [508]:
results = model2.predict(X_te)

# Mean signed difference between predictions and actual sales (not the MSLE itself).
mean_diff = np.mean(results.reshape(-1) - y_te.values)
print (mean_diff)

-5780.300213445186


- This model is *not* built with `keras.models.Sequential` -- it's not simply passing data through a sequence of layers. The first 1115 columns of the input, representing the store ID, are projected onto a `store_emb_dim`-dimensional space. The resulting projections are then concatenated with the remaining features before applying linear regression. (Notice the absence of nonlinear activation functions.)

- **Warning:** The data set contains > 1 million rows. To avoid running out of memory, work initially with a subset of the rows (say, 10,000). Train on as large a subset of the whole dataset as you can without crashing your session.
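
The projection described in the first note is the same operation as an embedding lookup: multiplying a 1-hot row by the encoder's weight matrix simply selects one row of that matrix. The NumPy sketch below checks this on a toy example; the sizes, the random weights `W` and `b`, and the zero-based `store_ids` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stores, store_emb_dim = 6, 3  # toy sizes; the assignment uses 1115 stores

# Stand-ins for the encoder's kernel and bias.
W = rng.normal(size=(n_stores, store_emb_dim))
b = rng.normal(size=store_emb_dim)

store_ids = np.array([4, 0, 4, 2])    # zero-based store indices
one_hot = np.eye(n_stores)[store_ids]  # shape (4, n_stores)

dense_out = one_hot @ W + b   # what a linear Dense layer computes
lookup_out = W[store_ids] + b  # direct row lookup, no 1-hot matrix needed

print(np.allclose(dense_out, lookup_out))
```

Because the lookup never materializes the 1-hot matrix, storing integer store IDs instead of 1115 indicator columns is one way to reduce the memory pressure mentioned in the warning.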

## References

Rachel Thomas, [An Introduction to Deep Learning for Tabular Data](https://www.fast.ai/2018/04/29/categorical-embeddings/) (fast.ai blog, April 29, 2018)

Cheng Guo and Felix Berkhahn, [Entity Embeddings of Categorical Variables](https://arxiv.org/pdf/1604.06737.pdf) (April 25, 2016)