<h2>FINAL DATA PREPROCESSING & MODEL TRAINING</h2>

This script preprocesses the data into its final form for model training, and then trains the LSTM model itself. In general, the script covers:
<h4>Data Preprocessing</h4>
- Padding & truncating the data, which ensures that every trj_id has the same number of timesteps (70 in the code below: 60 input steps plus 10 output steps). Trajectories that are too short are padded at the front, by default with a copy of their first row, and trajectories that are too long are truncated.
- Splitting the data into X (the coordinates, speed, and so on that are fed into the model for training) and y (the true coordinates that the model's predicted coordinates are compared against). y is taken from the last coordinates of each trajectory (the final 10 timesteps in the code below).
- Splitting the data into training and testing sets, with a 90%/10% split.
<br/>
<h4>Model Training</h4>
- Defining and training the LSTM model on the prepared data.
- The architecture, number of epochs, loss, optimizer, and so on can be changed freely to try other models.
<br/>
To see the performance of the first model I trained, that model is saved in the models folder as model_v1.keras. It can be tested using the script Model Test.ipynb.

In [1]:
## Import libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
import datetime
import keras

In [2]:
# Load Tensorboard
%load_ext tensorboard

In [3]:
## Read data from parquet (the commented-out line is the earlier CSV version)
#resampled_data = pd.read_csv('clean_data.csv', index_col='trj_id').drop('Unnamed: 1', axis=1)
resampled_data = pd.read_parquet('clean_non_resampled_data.parquet')
print(resampled_data)

          rawlat      rawlng  speed  bearing  minute_of_hour  hour_of_day  \
trj_id                                                                      
2      -6.248311  106.930447  11.35       88              51            0   
2      -6.248345  106.930673  12.43       87              51            0   
2      -6.248374  106.930931  13.83       84              51            0   
2      -6.248391  106.931061  14.29       85              51            0   
2      -6.248410  106.931184  14.58       87              51            0   
...          ...         ...    ...      ...             ...          ...   
55995  -6.178844  106.841960   0.00        0              57            4   
55995  -6.178844  106.841960   0.00        0              57            4   
55995  -6.178844  106.841961   0.00        0              57            4   
55995  -6.178845  106.841963   0.00        0              57            4   
55995  -6.178845  106.841964   0.00        0              57            4   

In [4]:
# Optimize the dataset further: keep only trajectories with at least 50 rows
counts = resampled_data.groupby(level='trj_id').size()
filtered_df = resampled_data[resampled_data.index.get_level_values('trj_id').isin(counts[counts >= 50].index)]
print(filtered_df)

          rawlat      rawlng  speed  bearing  minute_of_hour  hour_of_day  \
trj_id                                                                      
2      -6.248311  106.930447  11.35       88              51            0   
2      -6.248345  106.930673  12.43       87              51            0   
2      -6.248374  106.930931  13.83       84              51            0   
2      -6.248391  106.931061  14.29       85              51            0   
2      -6.248410  106.931184  14.58       87              51            0   
...          ...         ...    ...      ...             ...          ...   
55995  -6.178844  106.841960   0.00        0              57            4   
55995  -6.178844  106.841960   0.00        0              57            4   
55995  -6.178844  106.841961   0.00        0              57            4   
55995  -6.178845  106.841963   0.00        0              57            4   
55995  -6.178845  106.841964   0.00        0              57            4   
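The filter above keeps only trajectories with at least 50 rows. As a rough illustration of the same idea without pandas, the sketch below counts rows per trajectory ID and drops the short trajectories. The toy `rows` list and the `min_rows` name are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical toy data: (trj_id, value) pairs standing in for dataframe rows
rows = [(1, 0.1), (1, 0.2), (1, 0.3), (2, 0.4), (3, 0.5), (3, 0.6)]
min_rows = 2  # the notebook uses 50; a small value keeps the example readable

# Count how many rows each trajectory has
counts = Counter(trj_id for trj_id, _ in rows)

# Keep only rows whose trajectory meets the minimum length
kept_ids = {trj_id for trj_id, n in counts.items() if n >= min_rows}
filtered_rows = [row for row in rows if row[0] in kept_ids]

print(sorted(kept_ids))
print(len(filtered_rows))
```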

In [5]:
# Pad and truncate the timesteps in the dataframe
# This makes the model input uniform: every sample ends up with exactly max_len timesteps.
# Longer samples are truncated; shorter samples are padded at the front.
# pad_value=None --> pad with a copy of the sample's first row instead of a constant

def pad_truncate_dataframe(df, max_len, padding='pre', truncating='post', pad_value=None):
  # Split the dataframe by samples (first level of multi-index)
  samples = df.groupby(level=0)

  # Define a function to pad/truncate a single sample
  def pad_truncate_sample(sample):
    # Extract the values from a single sample
    values = sample.values
    # Copy the first row so that zeroing columns 3 and 4 does not alter the real data
    first_element = values[0].copy()
    first_element[3] = 0
    first_element[4] = 0

    # Pad/truncate automatically using pad_sequences
    if pad_value is None:
      padded_truncated = tf.keras.preprocessing.sequence.pad_sequences(
          [values], maxlen=max_len, padding=padding, truncating=truncating, value=first_element, dtype='float64'
      )[0]
    else:
      padded_truncated = tf.keras.preprocessing.sequence.pad_sequences(
          [values], maxlen=max_len, padding=padding, truncating=truncating, value=pad_value, dtype='float64'
      )[0]

    # Convert back to pandas dataframe
    return pd.DataFrame(padded_truncated, columns=sample.columns)

  # Apply the function to each sample and recreate the multi-index dataframe
  padded_df = samples.apply(pad_truncate_sample)
  return padded_df

df_new = pad_truncate_dataframe(filtered_df, 70)
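To make the padding and truncation behaviour concrete, here is a minimal pure-Python sketch of what `padding='pre'` and `truncating='post'` do to a single sample: short samples get copies of a pad row added at the front, and long samples lose rows from the end. The helper name `pad_truncate_rows` and the toy rows are assumptions for illustration, not part of the notebook.

```python
def pad_truncate_rows(rows, max_len, pad_row):
    """Mimic padding='pre' and truncating='post' for one sample (a list of rows)."""
    if len(rows) >= max_len:
        # 'post' truncation drops rows from the end, keeping the first max_len
        return [list(r) for r in rows[:max_len]]
    # 'pre' padding inserts copies of pad_row before the real rows
    missing = max_len - len(rows)
    return [list(pad_row) for _ in range(missing)] + [list(r) for r in rows]

short_sample = [[1.0, 10.0], [2.0, 20.0]]
long_sample = [[float(i), float(i * 10)] for i in range(6)]

padded = pad_truncate_rows(short_sample, max_len=4, pad_row=short_sample[0])
truncated = pad_truncate_rows(long_sample, max_len=4, pad_row=long_sample[0])

print(padded)
print(truncated)
```

Note that with `truncating='post'` the earliest rows of a long trajectory are the ones kept; keeping the most recent rows instead would require `truncating='pre'`.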

In [6]:
## Convert the multi index dataframe to a numpy 3D array for better integration to TensorFlow (samples, timesteps, features)
numpy_data = df_new.to_xarray().to_array().to_numpy()
numpy_data = np.transpose(numpy_data, (1, 2, 0))
numpy_data.shape # (samples, timesteps, features); (45065, 70, 7) for this dataset

(45065, 70, 7)

In [7]:
output_steps = 10

## Split the data to x (feature values) and y (target values)
simple_data = numpy_data[:, :, [0, 1]]
x_data = simple_data[:, :-output_steps, :]
y_data = simple_data[:, -output_steps:, 0:2]
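The slicing above treats the first 60 timesteps of each padded trajectory as the model input and the last 10 as the prediction target. A small list-based sketch of the same split, using a hypothetical 7-step trajectory and `output_steps = 2` to keep it readable:

```python
output_steps = 2  # the notebook uses 10

# Hypothetical trajectory: 7 timesteps of (lat, lng)
trajectory = [(-6.20 + 0.01 * t, 106.80 + 0.01 * t) for t in range(7)]

# Everything except the last output_steps rows is the input window
x_window = trajectory[:-output_steps]
# The last output_steps rows are the targets the model should predict
y_window = trajectory[-output_steps:]

print(len(x_window), len(y_window))
```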

In [8]:
# Split the data to train and test splits (train to train the model, test to test the model on new data after trained)
test_size = 0.1

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_size)
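Because `train_test_split` shuffles the data, the split above differs between runs unless a seed is fixed (scikit-learn accepts a `random_state` argument for this). The sketch below shows the idea of a seeded 90/10 split in plain Python; the function name `seeded_split` and the seed value are hypothetical.

```python
import random

def seeded_split(items, test_size, seed):
    """Shuffle indices with a fixed seed, then cut off a test fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

samples = list(range(20))
train_a, test_a = seeded_split(samples, test_size=0.1, seed=42)
train_b, test_b = seeded_split(samples, test_size=0.1, seed=42)

print(len(train_a), len(test_a))
print(test_a == test_b)
```

A fixed seed matters here because the splits are saved to `.npy` files and reused by the test notebook; rerunning the cell without one would produce a different test set.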

In [9]:
## Define normalization layer
normalize_layer = tf.keras.layers.Normalization(axis=-1)
denormalize_layer = tf.keras.layers.Normalization(axis=-1, invert=True)

normalize_layer.adapt(simple_data)
denormalize_layer.adapt(simple_data[:, :, 0:2])

normalize_weights = normalize_layer.get_weights()
mean_variance = np.array([normalize_weights[0][0:2], normalize_weights[1][0:2]])
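The two layers above standardize the inputs and map the model's outputs back to coordinate space. As a rough sketch of the arithmetic, assuming the layers compute `(x - mean) / sqrt(variance)` per feature and the exact inverse of that, with hypothetical latitude values:

```python
import math

# Hypothetical latitude values for one feature column
values = [-6.25, -6.20, -6.18, -6.30]

mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)  # population variance
std = math.sqrt(variance)

# Forward pass: standardize to zero mean and unit variance
normalized = [(v - mean) / std for v in values]
# Inverse pass: recover the original scale
restored = [z * std + mean for z in normalized]

print(normalized)
print(restored)
```

Standardizing matters for this data because latitude and longitude vary only in the third or fourth decimal place within a trajectory, which is a very small signal for an LSTM to learn from in raw form.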

In [17]:
## Save the train/test splits and the coordinate data for model testing
np.save('x_test.npy', x_test)
np.save('y_test.npy', y_test)
np.save('x_train.npy', x_train)
np.save('y_train.npy', y_train)
np.save('simple_data.npy', simple_data)

In [11]:
## Define custom metric(s)
@keras.saving.register_keras_serializable(package="custom_metrics", name="avg_m_diff")
def avg_m_diff(y_true, y_pred):
    # Ensure the input tensors have the correct shape
    assert y_true.shape == y_pred.shape
    assert y_true.shape[2] == 2
    
    def haversine_distance(lat1, lon1, lat2, lon2):
        # Radius of the Earth in meters
        R = 6371000.0
        
        # Convert latitude and longitude from degrees to radians
        lat1 = lat1 * (tf.constant(3.141592653589793) / 180.0)
        lon1 = lon1 * (tf.constant(3.141592653589793) / 180.0)
        lat2 = lat2 * (tf.constant(3.141592653589793) / 180.0)
        lon2 = lon2 * (tf.constant(3.141592653589793) / 180.0)
        
        # Compute differences
        dlat = lat2 - lat1
        dlon = lon2 - lon1
        
        # Haversine formula
        a = tf.math.sin(dlat / 2)**2 + tf.math.cos(lat1) * tf.math.cos(lat2) * tf.math.sin(dlon / 2)**2
        c = 2 * tf.math.atan2(tf.math.sqrt(a), tf.math.sqrt(1 - a))
        
        # Distance in meters
        distance = R * c
        
        return distance
    
    # Reshape the tensors to 2D arrays for easier manipulation
    y_true_flat = tf.reshape(y_true, [-1, 2])
    y_pred_flat = tf.reshape(y_pred, [-1, 2])
    
    # Split the coordinates into separate tensors
    lat_true, lon_true = tf.split(y_true_flat, num_or_size_splits=2, axis=1)
    lat_pred, lon_pred = tf.split(y_pred_flat, num_or_size_splits=2, axis=1)
    
    # Calculate the distance for each pair of points
    distances = haversine_distance(lat_true, lon_true, lat_pred, lon_pred)
    
    # Calculate the average distance
    average_distance = tf.reduce_mean(distances)
    
    return average_distance
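A quick way to sanity-check the metric is to compare it against a known distance. The sketch below reimplements the same haversine formula with Python's `math` module and checks that one degree of latitude comes out at roughly 111.2 km for the Earth radius used above. The function name `haversine_m` is hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius=6371000.0):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return radius * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# One degree of latitude along a meridian
one_degree = haversine_m(-6.0, 106.8, -5.0, 106.8)
# Distance from a point to itself
same_point = haversine_m(-6.2, 106.8, -6.2, 106.8)

print(round(one_degree), same_point)
```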

In [12]:
input_steps = 60
num_features = 2
num_targets = 2

# Clear any previous models
tf.keras.backend.clear_session()

# Define the model
model = tf.keras.Sequential()
model.add(tf.keras.layers.Input(shape=(input_steps, num_features)))
model.add(normalize_layer)
model.add(tf.keras.layers.LSTM(64))
model.add(tf.keras.layers.Dense(num_targets * output_steps))
model.add(tf.keras.layers.Reshape([output_steps, num_targets]))
model.add(denormalize_layer)

# Define optimizer
lr = 1e-3
opt = tf.keras.optimizers.Adam(learning_rate=lr)

model.compile(loss='huber', optimizer=opt, metrics=['mae', 'mse', avg_m_diff, 'root_mean_squared_error'])
model.summary()
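As a cross-check on `model.summary()`, the trainable parameter counts of the LSTM and Dense layers can be worked out by hand. The sketch assumes the standard Keras LSTM layout (four gates, each with an input kernel, a recurrent kernel, and a bias) and the layer sizes defined above.

```python
num_features = 2
num_targets, output_steps = 2, 10
lstm_units = 64

# Each of the 4 LSTM gates has input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (num_features + lstm_units) + lstm_units)

# The Dense layer maps the final LSTM state to a flat vector of all predicted coordinates
dense_outputs = num_targets * output_steps
dense_params = lstm_units * dense_outputs + dense_outputs

print(lstm_params, dense_params, lstm_params + dense_params)
```

The Reshape layer then turns the flat vector of 20 values into 10 (lat, lng) pairs and adds no parameters of its own.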




In [13]:
## Define LR scheduling (optional if want to use or not)
start_lr = lr
min_lr = 0.00001
max_lr = 0.001
rampup_epochs = 0
sustain_epochs = 0
exp_decay = 0.0

# Define the scheduling function
def schedule(epoch):
  # Linear ramp-up, then a plateau at max_lr, then exponential decay toward min_lr
  def compute_lr(epoch, start_lr, min_lr, max_lr, rampup_epochs, sustain_epochs, exp_decay):
    if epoch < rampup_epochs:
      return (max_lr - start_lr)/rampup_epochs * epoch + start_lr
    elif epoch < rampup_epochs + sustain_epochs:
      return max_lr
    else:
      return (max_lr - min_lr) * exp_decay**(epoch-rampup_epochs-sustain_epochs) + min_lr
  return compute_lr(epoch, start_lr, min_lr, max_lr, rampup_epochs, sustain_epochs, exp_decay)
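It can help to print the schedule before training with it. The sketch below restates the same ramp-up, sustain, and exponential-decay rule as a standalone function (the name `lr_at` is hypothetical) and evaluates it with the notebook's settings. With `exp_decay = 0.0`, the rate starts at `max_lr` and drops straight to `min_lr` from the second epoch on; a value such as 0.9, used here purely as an illustration, gives a gradual decay instead.

```python
def lr_at(epoch, start_lr, min_lr, max_lr, rampup_epochs, sustain_epochs, exp_decay):
    """Learning rate for a given epoch: linear ramp-up, plateau, then exponential decay."""
    if epoch < rampup_epochs:
        return (max_lr - start_lr) / rampup_epochs * epoch + start_lr
    if epoch < rampup_epochs + sustain_epochs:
        return max_lr
    decay_steps = epoch - rampup_epochs - sustain_epochs
    return (max_lr - min_lr) * exp_decay ** decay_steps + min_lr

# Notebook settings: no ramp-up, no plateau, exp_decay = 0.0
notebook_lrs = [lr_at(e, 1e-3, 1e-5, 1e-3, 0, 0, 0.0) for e in range(3)]
# Hypothetical alternative with a gradual decay
gradual_lrs = [lr_at(e, 1e-3, 1e-5, 1e-3, 0, 0, 0.9) for e in range(3)]

print(notebook_lrs)
print(gradual_lrs)
```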

In [14]:
## Define callbacks and fit the model
log_dir = "logs/fit/v6_2_" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', patience=200)

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(schedule)

model.fit(
  x=x_train, 
  y=y_train,
  epochs=1000,
  validation_data=(x_test, y_test),
  callbacks=[tensorboard_callback, early_stopping,]) #Can add lr_scheduler

model.evaluate(x_test, y_test)

Epoch 1/1000
1268/1268 ━━━━━━━━━━━━━━━━━━━━ 27s 19ms/step - avg_m_diff: 3377.2061 - loss: 0.0010 - mae: 0.0188 - mse: 0.0020 - root_mean_squared_error: 0.0409 - val_avg_m_diff: 245.9202 - val_loss: 1.5253e-05 - val_mae: 0.0014 - val_mse: 3.0506e-05 - val_root_mean_squared_error: 0.0055
Epoch 2/1000
1268/1268 ━━━━━━━━━━━━━━━━━━━━ 23s 18ms/step - avg_m_diff: 179.5875 - loss: 4.4652e-06 - mae: 0.0010 - mse: 8.9304e-06 - root_mean_squared_error: 0.0030 - val_avg_m_diff: 117.9398 - val_loss: 2.6995e-06 - val_mae: 6.7526e-04 - val_mse: 5.3991e-06 - val_root_mean_squared_error: 0.0023
Epoch 3/1000
1268/1268 ━━━━━━━━━━━━━━━━━━━━ 24s 19ms/step - avg_m_diff: 135.0188 - loss: 1.4542e-06 - mae: 7.7846e-04 - mse: 2.9084e-06 - root_mean_squared_error: 0.0017 - val_avg_m_diff: 102.0482 - val_loss: 1.0311e-06 - val_mae: 5.8250e-04 - val_mse: 2.0621e-06 - val_root_mean_squared_error: 0.0014
Epoch 4/100

[2.590947900671381e-08,
 0.00012997060548514128,
 5.181895446071394e-08,
 22.65254783630371,
 0.00022763776360079646]
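`model.evaluate` returns the loss followed by the metrics in the order they were passed to `compile`, so the list above can be labeled as in the sketch below. The values are copied from the output above, and the labels assume that ordering. Two quick consistency checks follow: the RMSE should equal the square root of the MSE, and for errors this small the Huber loss should be half the MSE.

```python
import math

names = ["loss (huber)", "mae", "mse", "avg_m_diff", "root_mean_squared_error"]
values = [2.590947900671381e-08, 0.00012997060548514128, 5.181895446071394e-08,
          22.65254783630371, 0.00022763776360079646]

# Pair each returned value with its assumed metric name
results = dict(zip(names, values))
for name, value in results.items():
    print(f"{name}: {value:.6g}")

# RMSE should match sqrt(MSE)
rmse_gap = abs(math.sqrt(results["mse"]) - results["root_mean_squared_error"])
# In the quadratic region of the Huber loss, loss = 0.5 * squared error
huber_gap = abs(0.5 * results["mse"] - results["loss (huber)"])
print(rmse_gap, huber_gap)
```

Read this way, the `avg_m_diff` value means the predicted points are on average about 22.65 m away from the true points on the test split. Keras can also return the labeled form directly via `model.evaluate(..., return_dict=True)`.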

In [15]:
## Save the model
model.save('model_v6_1.keras')