
# DATA 607 -- Assignment 3

Authors: Albert Leung & Li Lam

In this assignment, we apply the ideas underlying dense word embeddings like Word2Vec and GloVe to construct dense embeddings of categorical features.

The context of our exploration will be the [Rossmann Store Sales Competition](https://www.kaggle.com/c/rossmann-store-sales/overview/description) from *Kaggle*, the goal of which is to forecast store sales using store, promotion, and competitor data.

## Instructions

1. Download the data from the competition page or from [my GitHub](https://github.com/mgreenbe/rossmann).

2. Replace each date in the `Date` column with the number of days between it and January 1, 2013, the earliest date in the table.

3. Use `pd.get_dummies` to construct dataframes `stores`, `days_of_week`, and `state_holidays` containing one-hot encodings of the categorical variables `Store`, `DayOfWeek`, and `StateHoliday`, respectively.

4. Assemble these encoded features, together with the numerical ones (`Date`, `Customers`) and binary ones (`Open`, `Promo`, `SchoolHoliday`), in a matrix `X`, the first 1115 columns of which represent the store ID.

5. Split the features `X` and the target `Y` (the `Sales` column) into training and validation sets. Standardize the numerical feature columns. Here, the relevant means and standard deviations should be computed from the *training data* only.

6. Train the model `MyModel`, below, using `MeanSquaredLogarithmicError` as the loss function. Explain, briefly, why this is an appropriate choice of loss function. Stop training when validation error stabilizes.

7. **(Optional)** Add hidden layers to this model and tune the `store_emb_dim` hyperparameter to improve your results.
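As a rough illustration of the loss named in step 6, the sketch below computes the mean squared logarithmic error by hand with NumPy, assuming the usual definition: the mean of squared differences between log(1 + y) and log(1 + ŷ). The function name `msle` and the sample numbers are made up for illustration. Because the loss compares logarithms, it responds to relative rather than absolute error, and the `1 +` keeps it defined when the target is zero.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared logarithmic error: mean((log(1 + y) - log(1 + y_hat))**2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Hypothetical sales figures: the same 10% overestimate at two very different scales.
small = msle([100.0], [110.0])
large = msle([10_000.0], [11_000.0])

# The squared errors differ by a factor of 10,000 between the two cases,
# while the MSLE values stay on the same order of magnitude.
print(small, large)

# log1p keeps the loss finite when the true value is zero.
print(msle([0.0], [0.0]))
```

Under plain mean squared error, the second case would dominate training; with this loss, stores with small and large sales volumes contribute comparably.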


## Setup

In [30]:
!pip install --user h5py==2.10.0
!pip install --upgrade scikit-learn keras tensorflow

Requirement already up-to-date: scikit-learn in /usr/local/lib/python3.7/dist-packages (0.24.1)
Requirement already up-to-date: keras in /usr/local/lib/python3.7/dist-packages (2.4.3)
Requirement already up-to-date: tensorflow in /usr/local/lib/python3.7/dist-packages (2.4.1)


In [36]:
#convert the CSV to HDF5 files
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import datetime
import gc

base_uri = 'https://raw.githubusercontent.com/mgreenbe/rossmann/main/'
df = pd.read_csv(base_uri + 'train.csv', parse_dates=['Date'], infer_datetime_format=True,
                 dtype={'Store' : 'category', 'DayOfWeek' : 'category', 'StateHoliday' : 'category',
                        'Open' : np.int8, 'Promo' : np.int8, 'SchoolHoliday' : np.int8})
df.Date = (df.Date - datetime.datetime.strptime('2013-01-01','%Y-%m-%d')).dt.days
df.to_hdf('train.hdf5', key='train', complevel=9, mode='w', format='table', data_columns=True)

#generate stores dataframe 
stores = pd.get_dummies(df.Store, prefix='Store')
stores.to_hdf('stores.hdf5', key='stores', complevel=9, mode='w', data_columns=True)

#generate days_of_week dataframe
days_of_week = pd.get_dummies(df.DayOfWeek, prefix='Day_Of_Week')
days_of_week.to_hdf('days_of_week.hdf5', key='days_of_week', complevel=9, mode='w', data_columns=True)

#generate state_holidays dataframe
state_holidays = pd.get_dummies(df.StateHoliday, prefix='State_Holiday')
state_holidays.to_hdf('state_holidays.hdf5', key='state_holidays', complevel=9, mode='w', data_columns=True)

#merge dummy tables together
categorical = stores.merge(days_of_week, left_index=True, right_index=True, how='inner', copy=False)
categorical = categorical.merge(state_holidays, left_index=True, right_index=True, how='inner', copy=False)
categorical.to_hdf('categorical.hdf5', key='categorical', complevel=9, mode='w', data_columns=True)

del stores, days_of_week, state_holidays

#separate the target (Sales) from the features
sales = df.Sales
sales.to_hdf('sales.hdf5', key='sales', complevel=9, mode='w', data_columns=True)

df.drop(columns=['Store', 'DayOfWeek', 'Sales', 'StateHoliday'], inplace=True)

dataset = categorical.merge(df, left_index=True, right_index=True, copy=False)
dataset.to_hdf('dataset.hdf5', key='dataset', complevel=9, mode='w', data_columns=True)

del df
gc.collect()
dataset.info()


#Split the original data into the corresponding training and testing sets
X_train, X_test, y_train, y_test = train_test_split(dataset, sales, test_size=0.20, random_state=42)

X_train.to_hdf('X_train.hdf5', key='x_train', complevel=9, mode='w', data_columns=True)
y_train.to_hdf('y_train.hdf5', key='y_train', complevel=9, mode='w', data_columns=True)
X_test.to_hdf('X_test.hdf5', key='x_test', complevel=9, mode='w', data_columns=True)
y_test.to_hdf('y_test.hdf5', key='y_test', complevel=9, mode='w', data_columns=True)

del dataset, sales, X_train, X_test, y_train, y_test
gc.collect()




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Columns: 1131 entries, Store_1 to SchoolHoliday
dtypes: int64(2), int8(3), uint8(1126)
memory usage: 1.1 GB


96
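Step 5 asks for the numerical columns to be standardized with statistics taken from the training data, which the cell above does not do. A minimal sketch on a toy frame follows; the helper name `standardize` and the toy values are hypothetical, and only the column names `Date`, `Customers`, and `Promo` come from the dataset. Using the population standard deviation (`ddof=0`) is an assumption; the sample version would serve equally well.

```python
import pandas as pd

def standardize(train, valid, columns):
    """Scale `columns` in both frames with the training mean and standard deviation."""
    # astype returns new frames, so the inputs are left untouched.
    train = train.astype({c: float for c in columns})
    valid = valid.astype({c: float for c in columns})
    mean = train[columns].mean()
    std = train[columns].std(ddof=0)  # assumption: population standard deviation
    train[columns] = (train[columns] - mean) / std
    valid[columns] = (valid[columns] - mean) / std  # training statistics reused here
    return train, valid

# Toy stand-ins for X_train and X_test (values are made up).
toy_train = pd.DataFrame({'Date': [0, 10, 20, 30],
                          'Customers': [500, 600, 700, 800],
                          'Promo': [0, 1, 0, 1]})
toy_valid = pd.DataFrame({'Date': [40], 'Customers': [650], 'Promo': [1]})

toy_train_std, toy_valid_std = standardize(toy_train, toy_valid, ['Date', 'Customers'])
print(toy_train_std)
print(toy_valid_std)
```

The binary column `Promo` is passed through unchanged, and the validation rows are never used to compute the mean or standard deviation, so no information leaks from validation into training.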

In [None]:

import tensorflow as tf
from tensorflow import keras

class MyModel(keras.Model):
  def __init__(self, n_stores=1115, store_emb_dim=20):
    super(MyModel, self).__init__()
    self.n_stores = n_stores
    self.encoder = keras.layers.Dense(store_emb_dim, name="encoder")
    self.regressor = keras.layers.Dense(1, name="regressor")

  def call(self, X):
    x = tf.concat([self.encoder(X[:, :self.n_stores]), X[:, self.n_stores:]], axis=-1)
    return self.regressor(x)

- This model is *not* built with `keras.models.Sequential` -- it's not simply passing data through a sequence of layers. The first 1115 columns of the input, representing the store ID, are projected onto a `store_emb_dim`-dimensional space. The resulting projections are then concatenated with the remaining features before applying linear regression. (Notice the absence of nonlinear activation functions.)

- **Warning:** The data set contains > 1 million rows. To avoid running out of memory, work initially with a subset of the rows (say, 10,000). Train on as large a subset of the whole dataset as you can without crashing your session.
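To see why projecting the one-hot store columns amounts to learning an embedding, the NumPy sketch below checks that multiplying a one-hot row by a weight matrix selects a single row of that matrix. The sizes, store indices, and random weights are made up for illustration. The bias term that `keras.layers.Dense` adds by default is left out; it would shift every store's vector by the same constant.

```python
import numpy as np

n_stores, store_emb_dim = 5, 3                   # tiny stand-ins for 1115 and 20
rng = np.random.default_rng(0)
W = rng.normal(size=(n_stores, store_emb_dim))   # plays the role of the encoder kernel

store_ids = np.array([2, 0, 4])                  # hypothetical 0-based store indices
one_hot = np.eye(n_stores)[store_ids]            # shape (3, n_stores)

projected = one_hot @ W                          # what a bias-free linear layer computes
looked_up = W[store_ids]                         # plain row lookup

print(np.allclose(projected, looked_up))
```

Each row of `W` is therefore the dense vector for one store, and training the layer adjusts those vectors directly.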

## References

Rachel Thomas, [An Introduction to Deep Learning for Tabular Data](https://www.fast.ai/2018/04/29/categorical-embeddings/) (fast.ai blog, April 29, 2018)

Cheng Guo and Felix Berkhahn, [Entity Embeddings of Categorical Variables](https://arxiv.org/pdf/1604.06737.pdf) (April 25, 2016)