# Cross validation for LSTM Graph Laplacian Regularized Matrix Factorization

> About the author: Jinming Yang (yangjm67@sjtu.edu.cn), Center for Intelligent Transportation Systems and Unmanned Aerial Systems Applications Research, School of Naval Architecture, Ocean and Civil Engineering, Shanghai Jiao Tong University, Shanghai 200240, China. 

In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
from numpy.linalg import inv as inv
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Activation
from sklearn.preprocessing import MinMaxScaler
import time

Using TensorFlow backend.


### Load the Seattle speed data set
The Seattle speed dataset is stored in a 323 by 17568 matrix.
- **323** stands for 323 different sensors (loop detectors) in Seattle.
- **17568** stands for 17568 time slots over 61 days (288 five-minute time slots per day).

In [2]:
# directory = '../datasets/Seattle_loop-data-set/'
directory = 'E:/ACADEMIC/CODE/datasets/Seattle_loop-data-set/'
ADJ = np.load( directory + 'Loop_Seattle_2015_A.npy')
dense_mat = np.load( directory + 'dense_mat.npy')
print('Adjacency matrix shape:')
print(ADJ.shape)
print()
print('Dataset shape:')
print(dense_mat.shape)

missing_rate = 0.2
# =============================================================================
### Random missing (PM) scenario
### Set the PM scenario by:
rm_random_mat = np.load(directory + 'rm_random_mat.npy')
binary_mat = np.round(rm_random_mat + 0.5 - missing_rate)
# =============================================================================
# =============================================================================
### Non-random missing (CM) scenario
### Set the CM scenario by:
# nm_random_mat = np.load(directory + 'nm_random_mat.npy')
# binary_tensor = np.zeros((dense_mat.shape[0], 61, 288))
# for i1 in range(binary_tensor.shape[0]):
#     for i2 in range(binary_tensor.shape[1]):
#         binary_tensor[i1, i2, :] = np.round(nm_random_mat[i1, i2] + 0.5 - missing_rate)
# binary_mat = binary_tensor.reshape([binary_tensor.shape[0], binary_tensor.shape[1] * binary_tensor.shape[2]])
# =============================================================================

sparse_mat = np.multiply(dense_mat, binary_mat)

Adjacency matrix shape:
(323, 323)

Dataset shape:
(323, 17568)
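
The masking step in the cell above can be illustrated on a small synthetic matrix. This is a minimal sketch that assumes `rm_random_mat` holds values drawn uniformly from $[0, 1)$; `random_mat` below is a hypothetical stand-in and not the stored file.

```python
import numpy as np

rng = np.random.default_rng(0)
missing_rate = 0.2

# Hypothetical stand-in for rm_random_mat (assumed uniform in [0, 1)).
random_mat = rng.random((50, 400))

# Values below missing_rate round to 0 (missing); all others round to 1 (observed).
binary_mat = np.round(random_mat + 0.5 - missing_rate)

observed_fraction = binary_mat.mean()
print('Observed fraction: %.3f' % observed_fraction)
```

Under that assumption, roughly `1 - missing_rate` of the entries survive the element-wise product with `dense_mat`.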


## Train LSTM-GL-ReMF spatial temporal feature matrices and LSTM coefficients

Before moving to the online prediction part of the framework, the static data features (spatial feature matrix `W` and temporal feature matrix `X`) and the LSTM network coefficients (`var(f)`) must be trained first.

The following function is used to generate training samples for the LSTM neural network:

- `dataset` is the spatial temporal matrix (training data matrix), with time along its first axis.
- `rate`, taking values in $(0, 1]$, is the sampling rate.
- `time_lags` is the lag set, which denotes the temporal correlation topology.

In [3]:
def create_lstm_samples(dataset, time_lags, rate):
    dataX, dataY = [], []
    data_len = dataset.shape[0] - np.max(time_lags)
    t_sample = np.random.choice(data_len, int(rate * data_len), replace = False)
    
    for t in t_sample:
        a = dataset[t + np.max(time_lags) - time_lags, :][::-1]
        dataX.append(a)
        dataY.append(dataset[t + np.max(time_lags), :])
    return np.array(dataX), np.array(dataY)
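
To see what the sampler returns, here is a small sketch on a toy matrix. The function is repeated so the snippet runs on its own; `X_toy` and the lag set `[1, 2, 4]` are made-up illustration values.

```python
import numpy as np

def create_lstm_samples(dataset, time_lags, rate):
    dataX, dataY = [], []
    data_len = dataset.shape[0] - np.max(time_lags)
    t_sample = np.random.choice(data_len, int(rate * data_len), replace=False)
    for t in t_sample:
        # Rows at the lagged time steps, ordered from oldest to newest.
        a = dataset[t + np.max(time_lags) - time_lags, :][::-1]
        dataX.append(a)
        dataY.append(dataset[t + np.max(time_lags), :])
    return np.array(dataX), np.array(dataY)

# Toy temporal matrix: 10 time steps, rank 2; row i equals [2i, 2i + 1].
X_toy = np.arange(20, dtype=float).reshape(10, 2)
toy_lags = np.array([1, 2, 4])

np.random.seed(0)
dataX, dataY = create_lstm_samples(X_toy, toy_lags, 1.0)
print(dataX.shape, dataY.shape)
```

Each sample pairs the rows at times `T - 4`, `T - 2` and `T - 1` (oldest first) with the target row at time `T`, so the input has shape `(samples, len(time_lags), rank)`.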

The following function creates an LSTM network followed by a fully connected layer. Each input sample is a sequence of `lag_len` vectors of size `rank`; the LSTM layer has `rank` units and the fully connected layer also has `rank` units.

In [4]:
def lstmmodel(rank, lag_len):
    # create the LSTM network
    model = Sequential()
#     model.add(LSTM(rank, input_shape = (lag_len, rank), return_sequences = True)) # If you need multi-layer LSTM
    model.add(LSTM(rank, input_shape = (lag_len, rank)))
    model.add(Dense(rank))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

## Error calculator
<div class="alert alert-block alert-warning">
<ul>
<li><b><code>mean_absolute_percentage_error</code>:</b> <font color="black">Compute the value of Mean Absolute Percentage Error (MAPE).</font></li>
<li><b><code>root_mean_squared_error</code>:</b> <font color="black">Compute the value of Root Mean Square Error (RMSE).</font></li>
</ul>
</div>

> Note that $$\mathrm{MAPE}=\frac{1}{n} \sum_{i=1}^{n} \frac{\left|y_{i}-\hat{y}_{i}\right|}{y_{i}} \times 100, \quad\mathrm{RMSE}=\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}},$$ where $n$ is the total number of estimated values, and $y_i$ and $\hat{y}_i$ are the actual value and its estimate, respectively.

In [5]:
def mean_absolute_percentage_error(y_true, y_pred, pos): 
    return np.mean(np.abs((y_true[pos] - y_pred[pos]) / y_true[pos])) * 100
def root_mean_squared_error(y_true, y_pred, pos): 
    return np.sqrt(np.mean(np.square(y_true[pos] - y_pred[pos])))
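
As a quick sanity check, the two metrics can be evaluated by hand on a tiny example. The arrays below are made-up values; `pos` selects the nonzero ground-truth entries, in the same way the notebook builds its evaluation positions.

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred, pos):
    return np.mean(np.abs((y_true[pos] - y_pred[pos]) / y_true[pos])) * 100

def root_mean_squared_error(y_true, y_pred, pos):
    return np.sqrt(np.mean(np.square(y_true[pos] - y_pred[pos])))

y_true = np.array([[10.0, 0.0], [20.0, 40.0]])
y_pred = np.array([[9.0, 5.0], [22.0, 40.0]])

# Only entries with a nonzero ground truth are scored; the zero entry is skipped.
pos = np.where(y_true != 0)

mape = mean_absolute_percentage_error(y_true, y_pred, pos)  # (0.1 + 0.1 + 0) / 3 * 100
rmse = root_mean_squared_error(y_true, y_pred, pos)         # sqrt((1 + 4 + 0) / 3)
print('MAPE = %.2f, RMSE = %.2f' % (mape, rmse))
```

Restricting the metrics to `pos` also avoids a division by zero in MAPE when the ground truth contains zeros.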

### LSTM-GL-ReMF training algorithm

The function **LSTM_GL_ReMF** is used to train spatial temporal feature matrices and LSTM network parameters.

- `sparse_mat` is the training set spatial temporal matrix.
- `ADJ` is the adjacency matrix of sensors.
- `init` holds the initial values of the LSTM-GL-ReMF factors: the initial spatial matrix `W` and the initial temporal matrix `X`.
- `time_lags` is the lag set, which denotes the temporal correlation topology.
- `lambda_w`, `lambda_x` and `eta` are regularization parameters.
- `sampling_rate` is the fraction of samples used to retrain the LSTM network after the first iteration (the first iteration uses all samples).
- `maxiter` is the maximum number of iterations.
- `track` is a 0 or 1 flag that indicates whether to compute errors while training.
- `patience` is the number of consecutive error checks without improvement that are tolerated before training stops. It is only required when `track` is 1.
- `dense_mat` is the training ground truth without the simulated missing data. It is only required when `track` is 1.

In [6]:
def LSTM_GL_ReMF(sparse_mat, ADJ, init, time_lags, lambda_w, lambda_x, eta, sampling_rate, maxiter, track, patience = 5, dense_mat = 0):
    W = init["W"]
    X = init["X"]
    dim1, dim2 = sparse_mat.shape
    binary_mat = np.zeros((dim1,dim2))
    position = np.where((sparse_mat != 0))
    binary_mat[position] = 1
    d = len(time_lags)
    max_lags = np.max(time_lags)
    r = X.shape[1]
    if track:
        mape_pre = float(np.inf)
        rmse_pre = float(np.inf)
        pos_err = np.where((sparse_mat == 0) & (dense_mat != 0))
        count = 0
    model = lstmmodel(r, d)
    model_reverse = lstmmodel(r, d)
    start_time = time.time()
    for iters in range(maxiter):
        for i in range(dim1):
            pos0 = np.where(sparse_mat[i, :] != 0)
            pos1 = np.where(ADJ[i,:] == 1)[0]
            vec1 = np.sum(W[pos1, :], axis=0)
            
            Xt = X[pos0[0], :]
            vec0 = np.matmul(Xt.T, sparse_mat[i, pos0[0]]) + vec1
            # Use the local rank r rather than the global variable `rank`
            mat0 = inv(np.matmul(Xt.T, Xt) + lambda_w * eta * np.eye(r) + len(pos1) * lambda_w * np.eye(r))
            W[i, :] = np.matmul(mat0, vec0)

        for t in range(dim2):
            pos0 = np.where(sparse_mat[:, t] != 0)
            Wt = W[pos0[0], :]
            if iters == 0 or t < max_lags:
                X[t, :] = np.matmul(inv(np.matmul(Wt.T, Wt) + lambda_x * eta * np.eye(r)), np.matmul(Wt.T, sparse_mat[pos0[0], t]))
            else:
                X_hat = X[t - time_lags, :][::-1]
                X_hat_feed = X_hat[np.newaxis, :, :]
                Qt =  model.predict(X_hat_feed)[0]
                X[t, :] = np.matmul(inv(np.matmul(Wt.T, Wt)
                                           + lambda_x * np.eye(r) + lambda_x * eta * np.eye(r)),
                                       (np.matmul(Wt.T, sparse_mat[pos0[0], t]) + lambda_x * Qt))
        
        if iters == 0:
            lstmX, lstmY = create_lstm_samples(X, time_lags, 1)
            model.fit(lstmX, lstmY, epochs=20, batch_size=50, verbose=0)
        else:
            lstmX, lstmY = create_lstm_samples(X, time_lags, sampling_rate)
            model.fit(lstmX, lstmY, epochs=2, batch_size=200, verbose=0)
        if (iters + 1) % 10 == 0:
#             print('Iterations: %d, time cost: %ds'%((iters + 1), (time.time() - start_time)))
#             start_time = time.time()
            if track:
                mat_hat = np.matmul(W, X.T)
                mat_hat[position] = sparse_mat[position]
                mat_hat[mat_hat < 0] = 0
                rmse = root_mean_squared_error(dense_mat, mat_hat, pos_err)
                mape = mean_absolute_percentage_error(dense_mat, mat_hat, pos_err)
#                 print('Imputation RMSE = %.2f'%rmse)
#                 print('Imputation MAPE = %.2f'%mape)
                rmse_dif = rmse_pre - rmse
                mape_dif = mape_pre - mape
                rmse_pre = rmse
                mape_pre = mape
                if rmse_dif < 0.001 and mape_dif < 0.001:
                    count += 1
                    if count == patience:
#                         print('Wait step: %d'%count)
                        break
                else:
                    count = 0
            print()
#     model.save('model_save\lstm_trained1.h5')
    mat_hat = np.matmul(W, X.T)
    mat_hat[position] = sparse_mat[position]
    mat_hat[mat_hat < 0] = 0
    return mat_hat, W, X, model
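
The temporal update inside the training loop is a regularized least-squares solve. The sketch below checks this numerically on random data. The objective written in the comment is inferred from the update formula in the code and is not stated in the source; `q` is a random stand-in for the LSTM prediction `Qt`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, r = 8, 3
Wt = rng.normal(size=(n_obs, r))  # spatial factors of the sensors observed at time t
y = rng.normal(size=n_obs)        # observed values at time t
q = rng.normal(size=r)            # stand-in for the LSTM prediction Qt
lambda_x, eta = 2.0, 0.1

# Same system as the training loop, solved without forming an explicit inverse.
A = Wt.T @ Wt + lambda_x * np.eye(r) + lambda_x * eta * np.eye(r)
b = Wt.T @ y + lambda_x * q
x = np.linalg.solve(A, b)

# Half the gradient of the assumed objective
#   ||y - Wt x||^2 + lambda_x ||x - q||^2 + lambda_x * eta * ||x||^2
grad = Wt.T @ (Wt @ x - y) + lambda_x * (x - q) + lambda_x * eta * x
print(np.abs(grad).max())
```

The gradient vanishes at `x`, so the update pulls `X[t, :]` toward both the observations and the LSTM prediction. Using `np.linalg.solve` here is a design choice of this sketch; the notebook uses `inv` followed by `np.matmul`.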

### Online temporal embedding calibration algorithm

In [7]:
def OnlineLSTMReMF(sparse_vec, init, lambda_x, time_lags):
    time_lags = time_lags[::-1]
    W = init["W"]
    X = init["X"]
    lambda_x *= 6000
    model = init["model"]
    dim = sparse_vec.shape[0]
    t, rank = X.shape
    X_hat = X[t - 1 - time_lags, :].copy()
    X_hat_feed = X_hat[np.newaxis, :, :]
    Qt =  model.predict(X_hat_feed)[0]
    pos0 = np.where(sparse_vec != 0)
    Wt = W[pos0[0], :]
    var_mu = np.matmul(Wt.T, sparse_vec[pos0]) + lambda_x * Qt
    inv_var_Lambda = inv(np.matmul(Wt.T, Wt) + lambda_x * np.eye(rank))
    return np.matmul(inv_var_Lambda, var_mu)

### Online prediction framework

In [8]:
def online_prediction(sparse_mat, init, time_lags, lambda_x, eta):
    W = init["W"]
    X = init["X"]
    model = init["model"]
    pre_step_num = X.shape[0]
    rank = X.shape[1]
    dim1, dim2 = sparse_mat.shape
    X_hat = np.zeros((dim2 + pre_step_num, rank))
    mat_pred = np.zeros((dim1, dim2))
    X_hat[:pre_step_num,:] = X.copy()
    start_time = time.time()
    for t in range(dim2):
        if t == 0:
            X_star = X_hat[pre_step_num + t - time_lags, :][::-1]
            X_star_feed = X_star[np.newaxis, :, :]
            Qt =  model.predict(X_star_feed)[0]
            X_hat[pre_step_num + t, :] = Qt.copy()
        else:
            sparse_vec = sparse_mat[:, t - 1]
            if np.where(sparse_vec > 0)[0].shape[0] > 0:
                init = {"W": W, "X": X_hat[pre_step_num + t - np.max(time_lags) - 1 : pre_step_num + t, :],
                        "model": model}
                X_c = OnlineLSTMReMF(sparse_vec, init, lambda_x/dim2, time_lags)
                X_hat[pre_step_num + t - 1, :] = X_c.copy()
                X_star = X_hat[pre_step_num + t - time_lags, :][::-1]
                X_star_feed = X_star[np.newaxis, :, :]
                Qt =  model.predict(X_star_feed)[0]
                X_hat[pre_step_num + t, :] = Qt.copy()
            else:
                X_star = X_hat[pre_step_num + t - time_lags, :][::-1]
                X_star_feed = X_star[np.newaxis, :, :]
                Qt =  model.predict(X_star_feed)[0]
                X_hat[pre_step_num + t, :] = Qt.copy()
        mat_pred[:, t] = np.matmul(W, X_hat[pre_step_num + t, :])
        if (t + 1) % 1000 == 0:
            print('Time step: %d, time cost: %d s'%((t + 1), (time.time() - start_time)))
            start_time = time.time()
            
    sparse_vec = sparse_mat[:, -1]
    init = {"W": W, "X": X_hat[dim2 + pre_step_num - np.max(time_lags) - 1 : , :], "model": model}
    X_c = OnlineLSTMReMF(sparse_vec, init, lambda_x/dim2, time_lags)
    X_hat[dim2 + pre_step_num - 1,:] = X_c.copy()
    mat_rec = np.matmul(W, X_hat[pre_step_num : , :].T)
    return np.round(mat_rec), np.round(mat_pred)

## Cross validation for hyper-parameter tuning
Hyper-parameters are selected using grid search with sliding window cross validation. The original dataset is partitioned into three non-overlapping continuous sub-datasets, each containing 5856 time slots (about 20 days) of the 61-day dataset. The last two days of each sub-dataset are held out as the test set. The hyper-parameter set with which the model obtains the minimum average RMSE over all sub-datasets is selected as the final hyper-parameter set.


### Hyper-parameters
- `lambda_w = lambda_x`
- `eta`

In the LSTM-GL-ReMF model, the parameters $\lambda_w$ and $\lambda_x$ weight the spatial and temporal regularizations, respectively, and $\eta$ balances the temporal/spatial regularization against the $l_2$ regularization. The temporal and spatial information of the data are assumed to be equally important, so $\lambda_w$ and $\lambda_x$ are set to be equal.

### Partitioning dataset using sliding window

In [9]:
N = 3
test_len = 2 * 288
dim1, dim2 = dense_mat.shape
training_sets = []
test_sets = []
training_ground_truths = []
test_ground_truths = []
for i in range(N):
    start = int(np.floor(dim2 / N) * i)
    if i < N - 1:
        end = int(np.floor(dim2 / N) * (i + 1))
    else:
        end = dim2
    span = end - start
    training_sets.append(sparse_mat[:, start : end - test_len])
    test_sets.append(sparse_mat[:, end - test_len : end])
    training_ground_truths.append(dense_mat[:, start : end - test_len])
    test_ground_truths.append(dense_mat[:, end - test_len : end])
    print('Sub-dataset %d of size (%d, %d) created.'%(i+1, dim1, span))
print('Data partition complete!')
print()

Sub-dataset 1 of size (323, 5856) created.
Sub-dataset 2 of size (323, 5856) created.
Sub-dataset 3 of size (323, 5856) created.
Data partition complete!
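
The index arithmetic behind the partition can be sketched as a small helper. `window_bounds` is a hypothetical name introduced for illustration. It returns `(start, train_end, end)` for each window and lets the last window absorb any leftover columns, which makes no difference here because 17568 is divisible by 3.

```python
def window_bounds(dim2, n_windows, test_len):
    """Return (start, train_end, end) column indices for each window."""
    width = dim2 // n_windows
    bounds = []
    for i in range(n_windows):
        start = width * i
        # The last window runs to the end of the data.
        end = width * (i + 1) if i < n_windows - 1 else dim2
        bounds.append((start, end - test_len, end))
    return bounds

bounds = window_bounds(17568, 3, 2 * 288)
for start, train_end, end in bounds:
    print('train [%d, %d), test [%d, %d)' % (start, train_end, train_end, end))
```

Each window therefore trains on 5280 time slots and tests on the 576 slots (two days) that follow.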



## Cross validation with grid search

In [10]:
rank = 60
maxiter = 50
sampling_rate = 1.0
time_lags = np.array([1, 2, 288]) 
track = False
patience = 10

lambda_wxs = np.array([100, 200, 300, 400])
etas = np.array([0.1, 0.2, 0.3, 0.4])
pred_RMSE_mat = np.zeros((len(lambda_wxs), len(etas)))
for p in range(N):
    for l_ind in range(len(lambda_wxs)):
        for e_ind in range(len(etas)):
            dim1, dim2 = training_sets[p].shape
            init = {"W": 0.1 * np.random.rand(dim1, rank), "X": 0.1 * np.random.rand(dim2, rank)}
            mat_hat, W, X, model = LSTM_GL_ReMF(training_sets[p], ADJ, init, time_lags, 
                               lambda_wxs[l_ind], lambda_wxs[l_ind], etas[e_ind], sampling_rate, maxiter, track, patience, training_ground_truths[p])
            init = {"W": W, "X": X[- np.max(time_lags): , :], "model": model}
            test_mat_rec, test_mat_pred = online_prediction(test_sets[p], init, time_lags, lambda_wxs[l_ind], etas[e_ind])
            pos = np.where(test_ground_truths[p] != 0)
            testPred_rmse = root_mean_squared_error(test_ground_truths[p], test_mat_pred, pos)
            pred_RMSE_mat[l_ind, e_ind] += testPred_rmse
pred_RMSE_mat = pred_RMSE_mat / N  # average over the N sub-datasets (does not change the argmin)
pos = np.where(pred_RMSE_mat == np.min(pred_RMSE_mat))
opt_lambda = lambda_wxs[pos[0][0]]
opt_eta = etas[pos[1][0]]
print('The optimal lambda is %d, and the optimal eta is %.1f'%(opt_lambda, opt_eta))

The optimal lambda is 100, and the optimal eta is 0.2
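
The final selection step reduces to locating the minimum of the accumulated RMSE grid. Below is a minimal sketch with `np.unravel_index`; the `scores` values are arbitrary placeholders chosen for illustration and are not results from this notebook.

```python
import numpy as np

lambda_wxs = np.array([100, 200, 300, 400])
etas = np.array([0.1, 0.2, 0.3, 0.4])

# Placeholder grid (rows: lambda, columns: eta); NOT measured RMSE values.
scores = np.array([[5.0, 4.0, 6.0, 7.0],
                   [3.0, 8.0, 9.0, 6.0],
                   [7.0, 6.0, 2.0, 5.0],
                   [9.0, 8.0, 7.0, 6.0]])

l_ind, e_ind = np.unravel_index(np.argmin(scores), scores.shape)
best_lambda = lambda_wxs[l_ind]
best_eta = etas[e_ind]
print('Best lambda: %d, best eta: %.1f' % (best_lambda, best_eta))

# Dividing by the number of folds rescales every entry equally,
# so the location of the minimum is unchanged.
same_location = np.argmin(scores / 3) == np.argmin(scores)
```

Summing or averaging the per-fold RMSE therefore selects the same hyper-parameter pair.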


# License

<div class="alert alert-block alert-danger">
<b>This work is released under the MIT license.</b>
</div>