<h2>FINAL DATA PREPROCESSING & MODEL TRAINING</h2>

This script preprocesses the data into its final form for model training and then trains the LSTM model itself. In outline, it covers:
<h4>Data Preprocessing</h4>
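- Padding & truncating the data so that every trj_id has the same number of records: 21 in the code below, i.e. 20 input timesteps plus 1 target timestep. Trajectories that are too short are padded at the front (by default with a copy of the first record, with speed and bearing set to 0) and trajectories that are too long are truncated. Note that the call below uses `truncating='post'`, which keeps the earliest records; pass `truncating='pre'` to keep the most recent ones instead.
- Splitting the data into X (coordinates, speed, etc. that are fed to the model during training) and y (the ground-truth values that the model's predictions are compared against). y is taken from the last record of each trajectory (latitude, longitude, speed, and bearing in the code below).
- Splitting the data into training, validation, and test sets (80%/10%/10% in the code below).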
<br/>
<h4>Model Training</h4>
- Defining and training an LSTM model on the prepared data.
- The architecture, number of epochs, loss, optimizer, and so on can be changed freely to try other models.
<br/>
To see the performance of the first model I trained, it is saved in the models folder as model_v1.keras. The model can be tested with the Model Test.ipynb script.

In [1]:
## Import libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
import datetime
import math

In [2]:
# Load Tensorboard
%load_ext tensorboard

In [3]:
## Read data from csv
resampled_data = pd.read_csv('../clean_data.csv', index_col='trj_id').drop('Unnamed: 1', axis=1)
print(resampled_data)

          rawlat      rawlng      speed     bearing  hour_of_day  day_of_week
trj_id                                                                       
1      -6.198042  106.769008   4.322800  179.920000           14            3
1      -6.200972  106.769202   8.014167  173.233333           14            3
1      -6.205394  106.769768  10.116136  171.477273           14            3
1      -6.210496  106.771217   9.307667  156.683333           14            3
1      -6.214969  106.773830  10.103333  139.777778           14            3
...          ...         ...        ...         ...          ...          ...
9999   -6.187751  106.845707  10.584667  329.600000            4            6
9999   -6.184123  106.843546   4.508780  324.512195            4            6
9999   -6.182706  106.842869   2.776724  287.137931            4            6
9999   -6.180504  106.842337   5.244333  326.850000            4            6
9999   -6.179029  106.841998   2.330952  231.095238            4

In [4]:
# Pad and truncate the timesteps in the dataframe
# This makes the model input uniform: every sample ends up with exactly max_len timesteps.
# Longer samples are truncated; shorter ones are padded at the front.
# pad_value=None --> pad with the first record (speed and bearing set to 0)

def pad_truncate_dataframe(df, max_len, padding='pre', truncating='post', pad_value=None):
  # Split the dataframe by samples (first level of multi-index)
  samples = df.groupby(level=0)

  # Define a function to pad/truncate a single sample
  def pad_truncate_sample(sample):
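    # Extract the values from a single sample
    values = sample.values
    # Copy the first row so that zeroing speed/bearing does not alter the real data
    first_element = values[0].copy()
    first_element[2] = 0  # speed
    first_element[3] = 0  # bearing

    # Pad/truncate automatically using pad_sequences
    if pad_value is None: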
      padded_truncated = tf.keras.preprocessing.sequence.pad_sequences(
          [values], maxlen=max_len, padding=padding, truncating=truncating, value=first_element, dtype='float64'
      )[0]
    else:
      padded_truncated = tf.keras.preprocessing.sequence.pad_sequences(
          [values], maxlen=max_len, padding=padding, truncating=truncating, value=pad_value, dtype='float64'
      )[0]

    # Convert back to pandas dataframe
    return pd.DataFrame(padded_truncated, columns=sample.columns)

  # Apply the function to each sample and recreate the multi-index dataframe
  padded_df = samples.apply(pad_truncate_sample)
  return padded_df

df_new = pad_truncate_dataframe(resampled_data, 21)
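
To make the padding and truncation behaviour concrete, here is a minimal NumPy-only sketch of what the call above does to a single trajectory under the default arguments (`padding='pre'`, `truncating='post'`, `pad_value=None`). The helper name `pad_truncate_array` and the toy values are made up for illustration; the sketch mimics `pad_sequences` rather than calling it.

```python
import numpy as np

def pad_truncate_array(values, max_len):
    """Hypothetical helper mirroring padding='pre', truncating='post'."""
    values = np.asarray(values, dtype="float64")
    if len(values) >= max_len:
        # truncating='post' drops rows from the end, keeping the earliest ones
        return values[:max_len]
    # Pad row: copy of the first record with speed (col 2) and bearing (col 3) zeroed
    pad_row = values[0].copy()
    pad_row[2:4] = 0.0
    n_missing = max_len - len(values)
    # padding='pre' places the pad rows in front of the real data
    return np.vstack([np.tile(pad_row, (n_missing, 1)), values])

# Toy trajectory: 3 timesteps x 6 features (lat, lng, speed, bearing, hour, day)
short = np.array([
    [-6.20, 106.77, 4.3, 180.0, 14, 3],
    [-6.21, 106.78, 8.0, 173.0, 14, 3],
    [-6.22, 106.79, 10.1, 171.0, 14, 3],
])
padded = pad_truncate_array(short, max_len=5)
print(padded.shape)  # (5, 6)
print(padded[0])     # pad row: first record with speed and bearing set to 0

# Toy trajectory that is too long: 8 timesteps, row i starts with the value 6 * i
long = np.arange(8 * 6, dtype="float64").reshape(8, 6)
print(pad_truncate_array(long, max_len=5)[-1, 0])  # 24.0 (rows 0..4 are kept)
```

If the most recent records should be kept instead, slicing with `values[-max_len:]` would correspond to `truncating='pre'`.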

In [5]:
## Convert the multi index dataframe to a numpy 3D array for better integration to TensorFlow (samples, timesteps, features)
numpy_data = df_new.to_xarray().to_array().to_numpy()
numpy_data = np.transpose(numpy_data, (1, 2, 0))
numpy_data.shape # Should be (55994, 21, 6) for 55994 samples, 21 timesteps, and 6 features

(55994, 21, 6)
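
The transpose in the cell above is easier to follow on a toy array. `to_xarray().to_array()` stacks the columns along a new leading axis, so the raw array is ordered (features, samples, timesteps). The sketch below uses made-up sizes and plain NumPy as a stand-in for the real data to show how `(1, 2, 0)` reorders that into (samples, timesteps, features).

```python
import numpy as np

n_features, n_samples, n_timesteps = 6, 3, 21  # toy sizes, not the real dataset

# Stand-in for df_new.to_xarray().to_array().to_numpy(): features come first
raw = np.zeros((n_features, n_samples, n_timesteps))
raw[2, 1, 5] = 9.9  # feature 2 (speed) of sample 1 at timestep 5

# New axis 0 = old axis 1 (samples), new axis 1 = old axis 2 (timesteps),
# new axis 2 = old axis 0 (features)
reordered = np.transpose(raw, (1, 2, 0))

print(reordered.shape)     # (3, 21, 6)
print(reordered[1, 5, 2])  # 9.9
```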

In [6]:
## Split the data to x (feature values) and y (target values)
x_data = numpy_data[:, :-1, :]
y_data = numpy_data[:, -1, 0:4]

y_data

array([[ -6.23979377, 106.80146247,   4.28211538,  71.57692308],
       [ -6.14106118, 106.81269453,  12.37416661, 253.79166667],
       [ -6.19673309, 106.84036612,   4.83801934, 154.61538462],
       ...,
       [ -6.1200961 , 106.68786577,   4.31941182, 287.23529412],
       [ -6.13541954, 106.73027638,  16.60461538, 179.84615385],
       [ -6.15251156, 106.79305789,  21.26433333, 171.8       ]])

In [7]:
# Split the data to train, val, test splits (train to train the model, val to validate the model on every epoch,
# test to test the model on new data after trained)
val_size = 0.1
test_size = 0.1

x_train, x_test_val, y_train, y_test_val = train_test_split(x_data, y_data, test_size=test_size + val_size)
x_val, x_test, y_val, y_test = train_test_split(x_test_val, y_test_val, test_size=test_size / (test_size + val_size))

print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)
print(x_test.shape, y_test.shape)

(44795, 20, 6) (44795, 4)
(5599, 20, 6) (5599, 4)
(5600, 20, 6) (5600, 4)
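
The split sizes printed above can be reproduced with a little arithmetic. The sketch below assumes that `train_test_split` rounds the test portion up and gives the remainder to the training portion, which is my understanding of scikit-learn's behaviour when only `test_size` is passed as a float; the function name `split_counts` is made up for illustration.

```python
import math

def split_counts(n_samples, val_size=0.1, test_size=0.1):
    """Hypothetical helper: sample counts produced by the two-stage split."""
    # Stage 1: hold out val + test together (20% of the data)
    n_holdout = math.ceil((val_size + test_size) * n_samples)
    n_train = n_samples - n_holdout
    # Stage 2: split the held-out part; test gets test_size / (val_size + test_size) of it
    n_test = math.ceil(test_size / (val_size + test_size) * n_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

print(split_counts(55994))  # (44795, 5599, 5600)
```

The second `test_size` is 0.5 rather than 0.1 because it is a fraction of the held-out 20%, not of the full dataset.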


In [8]:
## Define normalization layer
normalize_layer = tf.keras.layers.Normalization(axis=-1, input_shape=(20, 6))
normalize_layer.adapt(numpy_data)



In [9]:
## Get Mean and Variance of Normalization layer to use for normalizing y and denormalizing later
normalize_weights = normalize_layer.get_weights()

mean_variance = np.array([normalize_weights[0][0:4], normalize_weights[1][0:4]])

np.save('../mean_variance.npy', mean_variance)

In [10]:
## Normalize y
normalize_y = tf.keras.layers.Normalization(mean=mean_variance[0], variance=mean_variance[1])

y_train = normalize_y(y_train)
y_val = normalize_y(y_val)
y_test = normalize_y(y_test)
print(y_train)
print(y_val)

tf.Tensor(
[[-0.5862844   0.65615994  0.18755443  0.41767535]
 [-0.5462206  -0.06288259 -0.15033329 -0.6807784 ]
 [ 0.90683174  0.509788   -0.03517684 -1.0217757 ]
 ...
 [ 1.1145216   0.3029411   0.7912001   1.1384069 ]
 [-0.68917215  0.00963067 -0.05179683 -0.86025065]
 [-0.4310927   0.90393716 -0.07247689  1.4244599 ]], shape=(44795, 4), dtype=float32)
tf.Tensor(
[[ 0.255698   -1.0703495  -0.00495611  1.0894012 ]
 [ 0.71838766 -0.21824789 -0.6323975   0.8738113 ]
 [ 0.1812674   0.40795785 -0.25622222  0.3198691 ]
 ...
 [ 1.2228751  -0.8086927   0.06414869  1.3854537 ]
 [ 0.7649391  -1.7854972   0.7312101   0.70437944]
 [-0.27539232  0.51162916 -0.05745172  0.5577775 ]], shape=(5599, 4), dtype=float32)
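
Because the targets are normalized here, model predictions have to be mapped back to real units before they can be read as coordinates. Below is a minimal NumPy sketch of the z-score transform and its inverse, using the same layout as `mean_variance` (row 0 = means, row 1 = variances). The statistics and the sample are made-up values, not the ones saved in `mean_variance.npy`, and the sketch ignores the small epsilon that Keras uses to guard against zero variance.

```python
import numpy as np

# Made-up statistics standing in for the saved mean_variance array:
# row 0 = per-feature mean, row 1 = per-feature variance (lat, lng, speed, bearing)
mean_variance = np.array([
    [-6.2, 106.8, 8.0, 180.0],
    [0.0025, 0.0036, 25.0, 10000.0],
])

def normalize(y, mean_variance):
    mean, variance = mean_variance
    return (y - mean) / np.sqrt(variance)

def denormalize(y_norm, mean_variance):
    mean, variance = mean_variance
    return y_norm * np.sqrt(variance) + mean

y = np.array([[-6.15, 106.74, 13.0, 80.0]])  # one made-up target row
y_norm = normalize(y, mean_variance)
y_back = denormalize(y_norm, mean_variance)

print(y_norm)                  # values in units of standard deviations
print(np.allclose(y, y_back))  # True
```

With the real statistics, `mean_variance` would instead be loaded with `np.load('../mean_variance.npy')`.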


In [11]:
## Save test data for model testing
np.save('../x_test.npy', x_test)
np.save('../y_test.npy', y_test)

In [12]:
# Clear any previous models
tf.keras.backend.clear_session()

# Name scope each layer
def create_named_layer(layer_type, name, **kwargs):
  with tf.name_scope(name):
    return layer_type(**kwargs)

# Define the model
num_features = 6
batch_size = 32

model = tf.keras.Sequential()
model.add(normalize_layer)
model.add(create_named_layer(tf.keras.layers.LSTM, "lstm_1", units=64, return_sequences=True, dropout=0.2))
model.add(create_named_layer(tf.keras.layers.LSTM, "lstm_2", units=64, dropout=0.2))
model.add(create_named_layer(tf.keras.layers.Dense, "dense_output", units=4))


# Define optimizer
lr = 1e-3
opt = tf.keras.optimizers.Adam(learning_rate=lr)

model.compile(loss='huber', optimizer=opt, metrics=['mae', 'mse'])
model.summary()
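
For readers unfamiliar with the `'huber'` loss used in `model.compile`, here is a small NumPy sketch of the formula, assuming the Keras default `delta=1.0`. It is quadratic for small errors and linear for large ones, so outlier trajectories pull on the model less than they would under MSE. The function is an illustration, not the Keras implementation.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Illustrative Huber loss averaged over all elements."""
    error = np.asarray(y_pred, dtype="float64") - np.asarray(y_true, dtype="float64")
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2                   # used where |error| <= delta
    linear = delta * abs_error - 0.5 * delta ** 2  # used where |error| > delta
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

y_true = np.zeros(4)
small = np.array([0.5, -0.5, 0.5, -0.5])
large = np.array([3.0, -3.0, 3.0, -3.0])

print(huber(y_true, small))  # 0.125 (half of the MSE, which is 0.25)
print(huber(y_true, large))  # 2.5 (the MSE would be 9.0)
```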




In [13]:
## Define LR scheduling (optional: it only takes effect if lr_scheduler is added to the callbacks list)
start_lr = 0.001
rampup_epochs = 20
exp_decay = 0.01

# Define the scheduling function: hold start_lr for the first rampup_epochs epochs,
# then decay exponentially with the epoch number
def schedule(epoch):
  if epoch < rampup_epochs:
    return start_lr
  return start_lr * math.exp(-exp_decay * epoch)
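
To see what this schedule produces, the sketch below tabulates it for a few epochs. Because the decay exponent uses the absolute epoch number, the learning rate drops abruptly at epoch 20 (to about 82% of `start_lr`) rather than decaying smoothly from it. `continuous_schedule` is a hypothetical variant, not part of the notebook, that measures the decay from the end of the constant phase instead.

```python
import math

start_lr = 0.001
rampup_epochs = 20
exp_decay = 0.01

def schedule(epoch):
    # Same rule as the cell above
    if epoch < rampup_epochs:
        return start_lr
    return start_lr * math.exp(-exp_decay * epoch)

def continuous_schedule(epoch):
    # Hypothetical variant: decay starts from start_lr at epoch == rampup_epochs
    if epoch < rampup_epochs:
        return start_lr
    return start_lr * math.exp(-exp_decay * (epoch - rampup_epochs))

# Columns: epoch, schedule, continuous_schedule
for epoch in (0, 19, 20, 50, 100):
    print(epoch, round(schedule(epoch), 6), round(continuous_schedule(epoch), 6))
```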

In [14]:
## Define callbacks and fit the model
log_dir = "../logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', patience=20)

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(schedule)

model.fit(
  x=x_train, 
  y=y_train,
  epochs=200,
  batch_size=64,
  validation_data=(x_val, y_val),
  callbacks=[tensorboard_callback, early_stopping,])

model.evaluate(x_test, y_test)

Epoch 1/200
700/700 ━━━━━━━━━━━━━━━━━━━━ 15s 16ms/step - loss: 0.1850 - mae: 0.4392 - mse: 0.4396 - val_loss: 0.0907 - val_mae: 0.2983 - val_mse: 0.1924
Epoch 2/200
700/700 ━━━━━━━━━━━━━━━━━━━━ 11s 16ms/step - loss: 0.1370 - mae: 0.3509 - mse: 0.3236 - val_loss: 0.0915 - val_mae: 0.2991 - val_mse: 0.1946
Epoch 3/200
700/700 ━━━━━━━━━━━━━━━━━━━━ 12s 17ms/step - loss: 0.1352 - mae: 0.3448 - mse: 0.3211 - val_loss: 0.0863 - val_mae: 0.2832 - val_mse: 0.1841
Epoch 4/200
700/700 ━━━━━━━━━━━━━━━━━━━━ 12s 17ms/step - loss: 0.1316 - mae: 0.3380 - mse: 0.3113 - val_loss: 0.0894 - val_mae: 0.2938 - val_mse: 0.1899
Epoch 5/200
700/700 ━━━━━━━━━━━━━━━━━━━━ 11s 16ms/step - loss: 0.1331 - mae: 0.3402 - mse: 0.3143 - val_loss: 0.0874 - val_mae: 0.2867 - val_mse: 0.1861
Epoch 6/200
700/700 ━━━━━━━━━━━━━━━━━━━━

KeyboardInterrupt: 

In [None]:
## Save the model
model.save('../models/model_v3.keras')