<h2>FINAL DATA PREPROCESSING & MODEL TRAINING</h2>

This script preprocesses the data into its final form for model training, and then trains the LSTM model itself. In outline, it covers:
<h4>Data Preprocessing</h4>
- Padding & truncating the data, so that every trj_id ends up with the same number of rows (61 in the code below): trajectories that are too short are padded at the front, and trajectories that are too long are cut down to that length (with the default `truncating='post'`, the rows beyond the limit are dropped from the end).
- Splitting the data into X (coordinates and the other input features fed to the model for training) and y (the true coordinates that the model's predicted coordinates are compared against). y is taken from the final coordinates of each trajectory (the last `output_steps` = 10 timesteps in the code below).
- Splitting the data into training data and testing data, with a 90%/10% split.
<br/>
<h4>Model Training</h4>
- Defining and training the LSTM model on the prepared data.
- The architecture, number of epochs, loss, optimizer, and so on can be changed freely to try other models.
<br/>
If you want to see the performance of the first model I trained, it is saved in the models folder as model_v1.keras. The model can be tested with the Model Test.ipynb script.

In [1]:
## Import libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
import datetime

In [2]:
# Load Tensorboard
%load_ext tensorboard

In [3]:
## Read data from csv
resampled_data = pd.read_csv('../clean_data.csv', index_col='trj_id').drop('Unnamed: 1', axis=1)
print(resampled_data)

          rawlat      rawlng      speed     bearing  hour_of_day  day_of_week
trj_id                                                                       
1      -6.198042  106.769008   4.322800  179.920000           14            3
1      -6.200972  106.769202   8.014167  173.233333           14            3
1      -6.205394  106.769768  10.116136  171.477273           14            3
1      -6.210496  106.771217   9.307667  156.683333           14            3
1      -6.214969  106.773830  10.103333  139.777778           14            3
...          ...         ...        ...         ...          ...          ...
9999   -6.184174  106.843572   3.187241  229.396552            4            6
9999   -6.182703  106.842868   2.684167  277.566667            4            6
9999   -6.180504  106.842337   5.244333  326.850000            4            6
9999   -6.178920  106.841976   0.843966   83.672414            4            6
9999   -6.178846  106.841960   0.000000    0.000000            4

In [4]:
# Optimize the dataset further: keep only trajectories that have at least 30 rows
counts = resampled_data.groupby(level='trj_id').size()
filtered_df = resampled_data[resampled_data.index.get_level_values('trj_id').isin(counts[counts >= 30].index)]
print(filtered_df)

          rawlat      rawlng      speed     bearing  hour_of_day  day_of_week
trj_id                                                                       
10003  -6.174612  106.897354   4.663077  283.730769           10            4
10003  -6.175318  106.895763   4.550667  219.833333           10            4
10003  -6.175477  106.893716   7.932833  260.850000           10            4
10003  -6.172101  106.888970  11.278000  302.183333           10            4
10003  -6.168788  106.883564  12.240500  300.483333           10            4
...          ...         ...        ...         ...          ...          ...
9999   -6.184174  106.843572   3.187241  229.396552            4            6
9999   -6.182703  106.842868   2.684167  277.566667            4            6
9999   -6.180504  106.842337   5.244333  326.850000            4            6
9999   -6.178920  106.841976   0.843966   83.672414            4            6
9999   -6.178846  106.841960   0.000000    0.000000            4
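
To see what the filter does, the same steps can be run on a toy frame. This is only an illustration: the trj_id values, the single column, and the threshold of 3 are made up (the notebook uses 30).

```python
import pandas as pd

# Toy data: trj_id 'a' has 3 rows, 'b' has 1 row, 'c' has 4 rows (illustrative values only)
toy = pd.DataFrame(
    {"rawlat": [0.0, 0.1, 0.2, 1.0, 2.0, 2.1, 2.2, 2.3]},
    index=pd.Index(["a", "a", "a", "b", "c", "c", "c", "c"], name="trj_id"),
)

min_rows = 3  # hypothetical threshold; the notebook uses 30

# Number of rows per trajectory
counts = toy.groupby(level="trj_id").size()

# Keep only rows whose trajectory is long enough
long_enough = counts[counts >= min_rows].index
filtered = toy[toy.index.isin(long_enough)]

kept_ids = sorted(filtered.index.unique())
print(kept_ids)
print(len(filtered))
```

Trajectory 'b' is dropped entirely, while every row of 'a' and 'c' is kept.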

In [5]:
# Pad and truncate the timesteps of each trajectory in the dataframe.
# This makes the model input uniform: every sample ends up with exactly max_len timesteps.
# Longer samples are truncated; shorter ones are padded at the front.
# pad_value=None --> pad with the sample's first row (speed and bearing set to 0)

def pad_truncate_dataframe(df, max_len, padding='pre', truncating='post', pad_value=None):
  # Split the dataframe by samples (grouped on the trj_id index)
  samples = df.groupby(level=0)

  # Define a function to pad/truncate a single sample
  def pad_truncate_sample(sample):
    # Extract the values from a single sample
    values = sample.values

    # Default padding row: a copy of the first row with speed and bearing zeroed.
    # Copying keeps the real first row unmodified.
    if pad_value is None:
      fill_value = values[0].copy()
      fill_value[2] = 0  # speed
      fill_value[3] = 0  # bearing
    else:
      fill_value = pad_value

    # Pad/truncate automatically using pad_sequences
    padded_truncated = tf.keras.preprocessing.sequence.pad_sequences(
        [values], maxlen=max_len, padding=padding, truncating=truncating, value=fill_value, dtype='float64'
    )[0]

    # Convert back to pandas dataframe
    return pd.DataFrame(padded_truncated, columns=sample.columns)

  # Apply the function to each sample and recreate the multi-index dataframe
  padded_df = samples.apply(pad_truncate_sample)
  return padded_df

df_new = pad_truncate_dataframe(filtered_df, 61)
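
To make the `padding='pre'` / `truncating='post'` defaults concrete, here is a plain-Python imitation of what they do to a single sequence. It is only a sketch of the semantics, not the Keras implementation; `pad_truncate_rows` and the toy rows are hypothetical.

```python
def pad_truncate_rows(rows, max_len, pad_row):
    """Imitate padding='pre', truncating='post' for a single sequence of rows."""
    if len(rows) >= max_len:
        # truncating='post': drop rows from the end, keep the first max_len
        return rows[:max_len]
    # padding='pre': put the filler rows in front of the real rows
    missing = max_len - len(rows)
    return [list(pad_row) for _ in range(missing)] + rows

short = [[1.0, 10.0], [2.0, 20.0]]
long = [[float(i), float(i * 10)] for i in range(1, 6)]

padded = pad_truncate_rows(short, 4, pad_row=short[0])
truncated = pad_truncate_rows(long, 4, pad_row=long[0])

print(padded)     # two copies of the first row, then the two real rows
print(truncated)  # the first four rows; the fifth is dropped
```

With `truncating='post'` a long trajectory keeps its earliest rows; passing `truncating='pre'` would keep the latest ones instead.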

In [6]:
## Convert the multi index dataframe to a numpy 3D array for better integration to TensorFlow (samples, timesteps, features)
numpy_data = df_new.to_xarray().to_array().to_numpy()
numpy_data = np.transpose(numpy_data, (1, 2, 0))
numpy_data.shape # (samples, timesteps, features); here 7935 samples, 61 timesteps, and 6 features

(7935, 61, 6)

In [7]:
output_steps = 10

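## Split the data to x (feature values) and y (target values)
# Keep columns 0, 1, 4, 5: rawlat, rawlng, hour_of_day, day_of_week (speed and bearing are dropped)
simple_data = numpy_data[:, :, [0, 1, 4, 5]]
x_data = simple_data[:, :-output_steps, :]    # all timesteps except the last output_steps
y_data = simple_data[:, -output_steps:, 0:2]  # rawlat, rawlng of the last output_steps timesteps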
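
The slicing above is easiest to check by looking at shapes. A minimal sketch on a zero-filled toy array with the same layout (3 samples instead of 7935; the values themselves are irrelevant here):

```python
import numpy as np

output_steps = 10

# Toy array shaped like the notebook's data: (samples, timesteps, features)
toy = np.zeros((3, 61, 6))

features = toy[:, :, [0, 1, 4, 5]]     # drop columns 2 and 3
x = features[:, :-output_steps, :]     # first 51 timesteps, 4 features
y = features[:, -output_steps:, 0:2]   # last 10 timesteps, first 2 features

print(x.shape, y.shape)  # (3, 51, 4) (3, 10, 2)
```

With 61 timesteps per sample and `output_steps = 10`, each input sequence has 61 - 10 = 51 timesteps.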

In [8]:
# Split the data to train and test splits (train to train the model, test to test the model on new data after trained)
test_size = 0.1

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_size)
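
`train_test_split` shuffles randomly, so the call above produces a different split on every run. If a repeatable split is wanted, `random_state` can be fixed. A small sketch on toy arrays (the seed 42 is an arbitrary choice, not something the notebook uses):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for x_data and y_data: 20 samples
x_toy = np.arange(20 * 51 * 4, dtype="float64").reshape(20, 51, 4)
y_toy = np.arange(20 * 10 * 2, dtype="float64").reshape(20, 10, 2)

def split(seed):
    # Same call as in the notebook, plus a fixed random_state
    return train_test_split(x_toy, y_toy, test_size=0.1, random_state=seed)

x_train_a, x_test_a, y_train_a, y_test_a = split(42)
x_train_b, x_test_b, y_train_b, y_test_b = split(42)

print(x_train_a.shape, x_test_a.shape)     # (18, 51, 4) (2, 51, 4)
print(np.array_equal(x_test_a, x_test_b))  # True: same seed, same split
```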

In [9]:
## Define normalization layer
normalize_layer = tf.keras.layers.Normalization(axis=-1)
normalize_layer.adapt(simple_data)

In [10]:
## Get Mean and Variance of Normalization layer to use for normalizing y and denormalizing later
normalize_weights = normalize_layer.get_weights()

mean_variance = np.array([normalize_weights[0][0:2], normalize_weights[1][0:2]])

np.save('../mean_variance_E.npy', mean_variance)
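
The saved array stacks the mean (row 0) and the variance (row 1) of rawlat and rawlng. Assuming the layer standardizes as (x - mean) / sqrt(variance), predictions in normalized space can be mapped back to coordinates by inverting that formula. A sketch with made-up statistics (the real values come from `mean_variance_E.npy`; `normalize` and `denormalize` are hypothetical helpers):

```python
import numpy as np

# Made-up statistics in the same layout as mean_variance: row 0 = mean, row 1 = variance
mean_variance = np.array([[-6.2, 106.8],
                          [0.0025, 0.0036]])

def normalize(coords, stats):
    # Standardize: z = (x - mean) / std
    return (coords - stats[0]) / np.sqrt(stats[1])

def denormalize(normalized, stats):
    # Invert the standardization: x = z * std + mean
    return normalized * np.sqrt(stats[1]) + stats[0]

coords = np.array([[-6.198042, 106.769008],
                   [-6.200972, 106.769202]])

restored = denormalize(normalize(coords, mean_variance), mean_variance)
print(np.allclose(restored, coords))  # True
```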

In [11]:
## Normalize y
normalize_y = tf.keras.layers.Normalization(mean=mean_variance[0], variance=mean_variance[1])

y_train = normalize_y(y_train)
y_test = normalize_y(y_test)
print(y_train)

tf.Tensor(
[[[ 0.8586281   0.4043706 ]
  [ 0.89817     0.41383514]
  [ 0.9049063   0.39166653]
  ...
  [ 0.8360101   0.20294754]
  [ 0.8344211   0.18109654]
  [ 0.83275944  0.16108763]]

 [[-0.87570155  0.73086524]
  [-0.87653023  0.7469994 ]
  [-0.88981056  0.7602752 ]
  ...
  [-1.02762     0.86317813]
  [-1.0585887   0.91437554]
  [-1.0791308   0.95623547]]

 [[ 1.2726755  -0.34815508]
  [ 1.2793862  -0.38150325]
  [ 1.2827735  -0.38918924]
  ...
  [ 1.3560139  -0.42260095]
  [ 1.3576713  -0.4222198 ]
  [ 1.358658   -0.42202926]]

 ...

 [[ 0.76683635 -0.12329301]
  [ 0.7718298  -0.10900094]
  [ 0.7717401  -0.09680503]
  ...
  [ 0.7604418  -0.02585279]
  [ 0.7620607   0.0040653 ]
  [ 0.764       0.02750432]]

 [[ 1.3024185   0.28298318]
  [ 1.2806847   0.20955366]
  [ 1.2533723   0.11833843]
  ...
  [ 1.1927032  -0.16508941]
  [ 1.2173374  -0.15498967]
  [ 1.2078502  -0.1457157 ]]

 [[ 0.534369   -1.664551  ]
  [ 0.6042134  -1.6646144 ]
  [ 0.6530973  -1.6383171 ]
  ...
  [ 0.8275182

In [12]:
## Save test data for model testing
np.save('../x_test_E.npy', x_test)
np.save('../y_test_E.npy', y_test)

In [13]:
num_features = 4
num_targets = 2

# Clear any previous models
tf.keras.backend.clear_session()

# Define the model
model = tf.keras.Sequential()
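# Take the sequence length from the data: 61 timesteps minus output_steps gives 51, not 50
model.add(tf.keras.layers.Input(shape=(x_train.shape[1], num_features)))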
model.add(normalize_layer)
model.add(tf.keras.layers.LSTM(128, return_sequences=True))
model.add(tf.keras.layers.LSTM(128))
model.add(tf.keras.layers.Dense(num_targets * output_steps))
model.add(tf.keras.layers.Reshape([output_steps, num_targets]))


# Define optimizer
lr = 1e-3
opt = tf.keras.optimizers.Adam(learning_rate=lr)

# Note: 'accuracy' is not a meaningful metric for this regression task; judge the model by mae/mse
model.compile(loss='huber', optimizer=opt, metrics=['mae', 'mse', 'accuracy'])
model.summary()
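
For reference, the Huber loss selected above is quadratic for small errors and linear for large ones. A plain-Python sketch of the per-element formula, assuming the Keras default `delta=1.0` (the helper `huber` is hypothetical and averages over a flat list of numbers):

```python
def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss over two equally long lists of numbers."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2               # quadratic region
        else:
            total += delta * (err - 0.5 * delta)  # linear region
    return total / len(y_true)

small = huber([0.0, 0.0], [0.5, -0.5])  # both errors below delta
large = huber([0.0], [3.0])             # error above delta

print(small)  # 0.125
print(large)  # 2.5
```

While all errors stay below delta, the Huber loss equals half the mean squared error.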




In [14]:
## Define LR scheduling (optional; only takes effect if lr_scheduler is added to the callbacks)
start_lr = lr
min_lr = 0.00001
max_lr = 0.001
rampup_epochs = 0
sustain_epochs = 0
exp_decay = 0.0

# Define the scheduling function: linear ramp-up, then hold at max_lr, then exponential decay towards min_lr
def schedule(epoch):
  if epoch < rampup_epochs:
    return (max_lr - start_lr) / rampup_epochs * epoch + start_lr
  elif epoch < rampup_epochs + sustain_epochs:
    return max_lr
  else:
    return (max_lr - min_lr) * exp_decay**(epoch - rampup_epochs - sustain_epochs) + min_lr
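
With the parameters set above (no ramp-up, no sustain, `exp_decay = 0.0`), the schedule returns `max_lr` at epoch 0 and `min_lr` from epoch 1 onwards, because `0.0**0` is 1 in Python and `0.0**n` is 0 for n > 0. The standalone sketch below re-implements the same formula to show this, and compares it with a hypothetical `exp_decay = 0.8` (`lr_at` is an illustrative name, not part of the notebook):

```python
def lr_at(epoch, start_lr=1e-3, min_lr=1e-5, max_lr=1e-3,
          rampup_epochs=0, sustain_epochs=0, exp_decay=0.0):
    """Same piecewise formula as the notebook's schedule, with explicit parameters."""
    if epoch < rampup_epochs:
        return (max_lr - start_lr) / rampup_epochs * epoch + start_lr
    if epoch < rampup_epochs + sustain_epochs:
        return max_lr
    return (max_lr - min_lr) * exp_decay ** (epoch - rampup_epochs - sustain_epochs) + min_lr

notebook_values = [lr_at(e) for e in range(3)]                 # exp_decay = 0.0
decaying_values = [lr_at(e, exp_decay=0.8) for e in range(3)]  # hypothetical alternative

print(notebook_values)
print(decaying_values)
```

A value of `exp_decay` strictly between 0 and 1 gives a gradual decay from `max_lr` towards `min_lr` instead of a single drop after the first epoch.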

In [15]:
## Define callbacks and fit the model
log_dir = "../logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', patience=50)

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(schedule)

model.fit(
  x=x_train,
  y=y_train,
  epochs=200,
  validation_data=(x_test, y_test),
  callbacks=[tensorboard_callback, early_stopping])  # lr_scheduler can also be added here

model.evaluate(x_test, y_test)

Epoch 1/200
224/224 ━━━━━━━━━━━━━━━━━━━━ 21s 78ms/step - accuracy: 0.7958 - loss: 0.2678 - mae: 0.5761 - mse: 0.6616 - val_accuracy: 0.9336 - val_loss: 0.0481 - val_mae: 0.2315 - val_mse: 0.0970
Epoch 2/200
224/224 ━━━━━━━━━━━━━━━━━━━━ 16s 70ms/step - accuracy: 0.9370 - loss: 0.0444 - mae: 0.2119 - mse: 0.0952 - val_accuracy: 0.9516 - val_loss: 0.0251 - val_mae: 0.1617 - val_mse: 0.0503
Epoch 3/200
224/224 ━━━━━━━━━━━━━━━━━━━━ 17s 75ms/step - accuracy: 0.9500 - loss: 0.0262 - mae: 0.1580 - mse: 0.0559 - val_accuracy: 0.9538 - val_loss: 0.0205 - val_mae: 0.1463 - val_mse: 0.0410
Epoch 4/200
224/224 ━━━━━━━━━━━━━━━━━━━━ 15s 67ms/step - accuracy: 0.9551 - loss: 0.0214 - mae: 0.1430 - mse: 0.0441 - val_accuracy: 0.9596 - val_loss: 0.0183 - val_mae: 0.1370 - val_mse: 0.0367
Epoch 5/200
224/224 ━━━━━━━━━━━━━━━━━━━━ 15s 69ms/

[0.004361042287200689,
 0.06371072679758072,
 0.008722084574401379,
 0.9838792085647583]
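
`model.evaluate` returns the loss followed by the metrics in the order they were passed to `compile`, so the list above reads as loss, mae, mse, accuracy. A small sketch that labels the values (the names are typed by hand here, assuming that order):

```python
# Values copied from the evaluate output above
results = [0.004361042287200689,
           0.06371072679758072,
           0.008722084574401379,
           0.9838792085647583]

# Assumed order: loss first, then the metrics as listed in model.compile
names = ["loss (huber)", "mae", "mse", "accuracy"]
labeled = dict(zip(names, results))

for name, value in labeled.items():
    print(f"{name:>12}: {value:.6f}")

# Huber loss is half the MSE when every error is below delta
ratio = labeled["loss (huber)"] / labeled["mse"]
print(f"loss / mse = {ratio:.3f}")
```

These numbers are in normalized units, since y was normalized before training.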

In [1]:
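## Save the model (run this in the same kernel session as the training cell; after a restart, `model` is no longer defined)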
model.save('../models/model_v4.keras')

NameError: name 'model' is not defined