# Multivariate Time Series Forecasting with LSTM

Neural networks like Long Short-Term Memory (LSTM) recurrent neural networks are able to almost seamlessly model problems with multiple input variables.
This is a great benefit in time series forecasting, where classical linear methods can be difficult to adapt to multivariate or multiple input forecasting problems.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
from datetime import datetime
import numpy as np
import pandas as pd
import pylab
import calendar
from scipy import stats
import seaborn as sns
from sklearn import model_selection, preprocessing
from scipy.stats import kendalltau
from sklearn.metrics import mean_squared_error
import warnings
import matplotlib.pyplot as plt

# Keras imports
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, BatchNormalization, PReLU
from keras.optimizers import Adam
from keras import callbacks
from keras.callbacks import ModelCheckpoint

# Use SciKeras for scikit-learn compatibility
from scikeras.wrappers import KerasRegressor

# Scikit-learn imports
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.pipeline import Pipeline

In [9]:
print('Loading train, prop and sample data')
train = pd.read_csv("/content/drive/MyDrive/zillow-prize-1/train_2016_v2.csv", parse_dates=["transactiondate"])
prop = pd.read_csv('/content/drive/MyDrive/zillow-prize-1/properties_2016.csv')
sample = pd.read_csv('/content/drive/MyDrive/zillow-prize-1/sample_submission.csv')

Loading train, prop and sample data




###### Label Encoder
LabelEncoder is a utility class that normalizes labels: it encodes categorical values as integers between 0 and n_classes-1.

Here, we fill missing values with -1 and label-encode the object-typed columns of the properties dataset.
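As a minimal sketch of what the encoder does, the example below fits a `LabelEncoder` on a short list of codes. The code strings are invented for illustration and are not taken from the Zillow files.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category codes, invented for illustration only.
codes = ["LAR1", "LAR3", "LAR1", "SCUR2"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(codes)

# Classes are stored in sorted order; each value maps to its index in classes_.
print(list(encoder.classes_))
print(list(encoded))

# The mapping is reversible.
print(list(encoder.inverse_transform(encoded)))
```

Note that the resulting integers are arbitrary identifiers; their order carries no meaning about the categories themselves.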

In [10]:
print('Fitting Label Encoder on properties')
for c in prop.columns:
    prop[c]=prop[c].fillna(-1)
    if prop[c].dtype == 'object':
        lbl = LabelEncoder()
        lbl.fit(list(prop[c].values))
        prop[c] = lbl.transform(list(prop[c].values))

Fitting Label Encoder on properties


In [11]:
print('Creating training set:')
df_train = train.merge(prop, how='left', on='parcelid')

print('Creating df_test  :')
sample['parcelid'] = sample['ParcelId']

print("Merge Sample with property data :")
df_test = sample.merge(prop, on='parcelid', how='left')

Creating training set:
Creating df_test  :
Merge Sample with property data :


In [12]:
df_train["transactiondate"] = pd.to_datetime(df_train["transactiondate"])
df_train['transactiondate_quarter'] = df_train['transactiondate'].dt.quarter

basedate = pd.to_datetime('2015-11-15').toordinal()
df_train['cos_season'] = \
        ( (pd.to_datetime(df_train['transactiondate']).apply(lambda x: x.toordinal()-basedate)) * \
          (2*np.pi/365.25) ).apply(np.cos)
df_train['sin_season'] = \
        ( (pd.to_datetime(df_train['transactiondate']).apply(lambda x: x.toordinal()-basedate)) * \
          (2*np.pi/365.25) ).apply(np.sin)

#test dataset
df_test["transactiondate"] = pd.to_datetime('2016-11-15')
df_test['transactiondate_quarter'] = df_test['transactiondate'].dt.quarter

df_test['cos_season'] = np.cos( (pd.to_datetime('2016-11-15').toordinal() - basedate) * \
                                    (2*np.pi/365.25) )
df_test['sin_season'] = np.sin( (pd.to_datetime('2016-11-15').toordinal() - basedate) * \
                                    (2*np.pi/365.25) )

df_train_x = df_train.drop(['logerror','parcelid', 'transactiondate', 'propertyzoningdesc', 'propertycountylandusecode',
                             'fireplacecnt', 'fireplaceflag'],axis=1)
df_train = df_train.drop(['parcelid', 'transactiondate', 'propertyzoningdesc', 'propertycountylandusecode',
                             'fireplacecnt', 'fireplaceflag'], axis=1)

train_columns = df_train_x.columns
df_test=df_test[train_columns]

#### LSTM Data Preparation
The first step is to prepare the properties dataset for the LSTM. This involves framing the dataset as a supervised learning problem and normalizing the input variables.
We frame the supervised learning problem as predicting the log error for a row given the features and conditions at the prior time step (the previous row).

We can transform the dataset using the series_to_supervised() function defined below:

In [13]:
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
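To make the framing concrete, here is a small sketch of the same shift-and-concatenate idea on a toy table with one lag step. The values and column names are invented for illustration, and the snippet is a simplified stand-in for the function above rather than a replacement for it.

```python
import numpy as np
import pandas as pd

# Toy series: 4 time steps, 2 variables (values invented for illustration).
toy = pd.DataFrame(np.arange(8).reshape(4, 2), columns=["var1", "var2"])

# Lagged inputs (t-1) placed side by side with the current values (t).
lagged = toy.shift(1).add_suffix("(t-1)")
current = toy.add_suffix("(t)")

# The first row has no predecessor, so its lagged values are NaN and it is dropped.
framed = pd.concat([lagged, current], axis=1).dropna()

print(framed)
```

Each remaining row pairs the previous row's values with the current row's values, which is why one row is lost per lag step.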

In [14]:
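Values = df_train.values
test_values = df_test.values
values = Values.astype('float32')
test_values = test_values.astype('float32')
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# NOTE: fit_transform refits `scaler` on the test matrix, overwriting the min/max learned from the training data
scaled_test = scaler.fit_transform(test_values)
# one lag step in, one step out: columns are var*(t-1) followed by var*(t)
reframed = series_to_supervised(scaled, 1, 1)
# NOTE: this call reframes `scaled` (the training matrix), not `scaled_test`
reframed_test = series_to_supervised(scaled,1,1)
# keep the var*(t-1) inputs plus var1(t) (logerror, the first column) as the target; drop the other var*(t) columns
reframed.drop(reframed.columns[58:116], axis=1, inplace=True)
reframed_test.drop(reframed_test.columns[56:111],axis =1,inplace=True)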

train_X, train_Y = reframed.iloc[:80000,:-1], reframed.iloc[:80000,-1]
valid_X, valid_Y = reframed.iloc[80000:,:-1], reframed.iloc[80000:,-1]

train_X = np.array(train_X)
train_Y = np.array(train_Y)

valid_X = np.array(valid_X)
valid_Y = np.array(valid_Y)
test_X = np.array(reframed_test)
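The cell above calls `fit_transform` on both the training and the test matrices. A common alternative, sketched here under the assumption that both matrices share the same columns, is to fit the scaler once on the training data and only `transform` any held-out data. The numbers and the names `train_part` and `held_out` are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented values standing in for training and held-out feature matrices.
train_part = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]], dtype="float32")
held_out = np.array([[2.5, 15.0], [12.0, 40.0]], dtype="float32")

demo_scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = demo_scaler.fit_transform(train_part)  # learns per-column min and max
held_out_scaled = demo_scaler.transform(held_out)     # reuses the training min and max

print(train_scaled)
# Held-out values beyond the training range land outside [0, 1].
print(held_out_scaled)
# inverse_transform maps scaled values back to the original units.
print(demo_scaler.inverse_transform(held_out_scaled))
```

Keeping a single fitted scaler also means predictions can later be mapped back to the original units with the same `inverse_transform`.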

In [15]:
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
valid_X = valid_X.reshape((valid_X.shape[0], 1, valid_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_Y.shape, valid_X.shape, valid_Y.shape,test_X.shape)

(80000, 1, 57) (80000,) (10274, 1, 57) (10274,) (90274, 1, 59)
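The reshape above turns each 2D matrix into the 3D layout that Keras LSTM layers expect: (samples, timesteps, features). The toy array below, with invented values, shows that only an axis of length 1 is added and no data changes.

```python
import numpy as np

# 6 samples with 4 features each (invented values).
flat = np.arange(24, dtype="float32").reshape(6, 4)

# Add a timesteps axis of length 1: (samples, features) -> (samples, 1, features).
sequences = flat.reshape((flat.shape[0], 1, flat.shape[1]))
print(flat.shape, "->", sequences.shape)

# Dropping the timesteps axis again recovers the original matrix.
restored = sequences.reshape((sequences.shape[0], sequences.shape[2]))
```

With a single timestep, the LSTM sees each row as a sequence of length one, so it cannot draw on longer history within a sample.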


In [16]:
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
model.fit(train_X, train_Y, epochs=50, batch_size=72, validation_data=(valid_X, valid_Y), verbose=2, shuffle=False)
yhat = model.predict(valid_X)
valid_X = valid_X.reshape((valid_X.shape[0], valid_X.shape[2]))

rmse = np.sqrt(mean_squared_error(valid_Y, yhat))
print('Validation RMSE: %.3f' % rmse)  # computed on valid_X / valid_Y



Epoch 1/50
1112/1112 - 5s - 4ms/step - loss: 0.0171 - val_loss: 0.0490
Epoch 2/50
1112/1112 - 6s - 5ms/step - loss: 0.0086 - val_loss: 0.0452
Epoch 3/50
1112/1112 - 5s - 5ms/step - loss: 0.0083 - val_loss: 0.0382
Epoch 4/50
1112/1112 - 2s - 2ms/step - loss: 0.0082 - val_loss: 0.0332
Epoch 5/50
1112/1112 - 3s - 2ms/step - loss: 0.0081 - val_loss: 0.0282
Epoch 6/50
1112/1112 - 5s - 4ms/step - loss: 0.0079 - val_loss: 0.0235
Epoch 7/50
1112/1112 - 3s - 2ms/step - loss: 0.0078 - val_loss: 0.0207
Epoch 8/50
1112/1112 - 2s - 2ms/step - loss: 0.0078 - val_loss: 0.0180
Epoch 9/50
1112/1112 - 4s - 3ms/step - loss: 0.0077 - val_loss: 0.0156
Epoch 10/50
1112/1112 - 5s - 4ms/step - loss: 0.0077 - val_loss: 0.0134
Epoch 11/50
1112/1112 - 5s - 5ms/step - loss: 0.0077 - val_loss: 0.0119
Epoch 12/50
1112/1112 - 3s - 2ms/step - loss: 0.0076 - val_loss: 0.0109
Epoch 13/50
1112/1112 - 4s - 3ms/step - loss: 0.0076 - val_loss: 0.0098
Epoch 14/50
1112/1112 - 4s - 3ms/step - loss: 0.0076 - val_loss: 0.0090

# Summary:

In this notebook, we implemented an LSTM network (a 50-unit LSTM layer followed by a single-unit Dense output layer) to predict LogError = log(Zestimate) - log(SalePrice), using the 2016 property dataset and the corresponding log error values provided by Zillow for home value prediction. The model is built in Python with the Keras and TensorFlow deep learning libraries.

Finally, we predicted logerror values for the last quarter (November to December) of 2016 and 2017 in the test dataset.
The RMSE of the network is 0.017, an improvement over the result obtained with a simple neural network. We can also infer that a smaller error in the predicted logerror gives better predictions. The model could be improved further by adding more layers or by changing the backpropagation parameters.
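As a worked illustration of the two quantities in this summary, the sketch below computes the log error from its definition and then an RMSE against some predictions. All prices and predictions are invented for illustration and are not taken from the Zillow data or from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented prices for illustration only.
zestimate = np.array([310_000.0, 450_000.0, 198_000.0])
sale_price = np.array([300_000.0, 455_000.0, 200_000.0])

# LogError = log(Zestimate) - log(SalePrice)
# Positive when the estimate is above the sale price, negative when below.
logerror = np.log(zestimate) - np.log(sale_price)

# Hypothetical model predictions of the log error.
predicted = np.array([0.02, 0.0, -0.01])

# RMSE: square root of the mean squared difference.
rmse = np.sqrt(mean_squared_error(logerror, predicted))
print(logerror)
print(rmse)
```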
