<h2>DATA CLEANING</h2>

This script cleans the GrabPosisi data into a form that is easier for the LSTM model to interpret and learn from later. The data cleaning process consists of:
- Sorting the data by trj_id and then by pingtimestamp so that it is properly ordered
- Grouping the data into a Pandas MultiIndex DataFrame with trj_id as the index, to make it easier to separate trajectories later
- Resampling the data (the raw data has a sampling rate of once per **second**, but the interval between pings is sometimes longer than 1 second). After resampling, the data has a constant sampling rate of once per **minute** (the interval between pings no longer varies)
<br/>
The output of this data cleaning has already been exported to clean_data.csv, so this script does not need to be run unless you want to change the data cleaning process above.
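
The resampling step described above can be illustrated with a minimal sketch on toy data. The values below are made up, and keeping the first ping of each minute is an assumed aggregation choice, not necessarily the one used to produce clean_data.csv.

```python
import pandas as pd

# Toy data (hypothetical values): one trajectory with irregular ping intervals
pings = pd.DataFrame({
    "trj_id": [1] * 5,
    "pingtimestamp": pd.to_datetime([
        "2019-04-11 14:17:35", "2019-04-11 14:17:36", "2019-04-11 14:18:10",
        "2019-04-11 14:18:12", "2019-04-11 14:19:05",
    ]),
    "rawlat": [-6.1976, -6.1977, -6.1980, -6.1981, -6.1990],
    "rawlng": [106.7690, 106.7690, 106.7692, 106.7692, 106.7695],
})

# Sort, then resample each trajectory separately onto a fixed 1-minute grid,
# keeping the first ping in each minute (assumed aggregation)
resampled = (
    pings.sort_values(["trj_id", "pingtimestamp"])
         .set_index("pingtimestamp")
         .groupby("trj_id")[["rawlat", "rawlng"]]
         .resample("1min")
         .first()
)
print(resampled)
```

The result is indexed by (trj_id, pingtimestamp), so individual trajectories can be selected with `resampled.loc[trj_id]`.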

In [1]:
## Import libraries
import pandas as pd
from pathlib import Path

In [2]:
# Read Data
data_dir = Path('../Mobility-Prediction-Tensorflow-LSTM/GrabPosisi')
frames = []
for parquet_file in data_dir.glob("*.parquet"):
    print(f"Adding file: {parquet_file}")
    frames.append(pd.read_parquet(parquet_file))

# Concatenate once at the end instead of re-copying the growing DataFrame on every iteration
full_df = pd.concat(frames, ignore_index=True)

Adding file: ..\Mobility-Prediction-Tensorflow-LSTM\GrabPosisi\part-00000-8bbff892-97d2-4011-9961-703e38972569.c000.snappy.parquet
Adding file: ..\Mobility-Prediction-Tensorflow-LSTM\GrabPosisi\part-00001-8bbff892-97d2-4011-9961-703e38972569.c000.snappy.parquet
Adding file: ..\Mobility-Prediction-Tensorflow-LSTM\GrabPosisi\part-00002-8bbff892-97d2-4011-9961-703e38972569.c000.snappy.parquet
Adding file: ..\Mobility-Prediction-Tensorflow-LSTM\GrabPosisi\part-00003-8bbff892-97d2-4011-9961-703e38972569.c000.snappy.parquet
Adding file: ..\Mobility-Prediction-Tensorflow-LSTM\GrabPosisi\part-00004-8bbff892-97d2-4011-9961-703e38972569.c000.snappy.parquet
Adding file: ..\Mobility-Prediction-Tensorflow-LSTM\GrabPosisi\part-00005-8bbff892-97d2-4011-9961-703e38972569.c000.snappy.parquet
Adding file: ..\Mobility-Prediction-Tensorflow-LSTM\GrabPosisi\part-00006-8bbff892-97d2-4011-9961-703e38972569.c000.snappy.parquet
Adding file: ..\Mobility-Prediction-Tensorflow-LSTM\GrabPosisi\part-00007-8bbff892-

In [3]:
# Remove unnecessary columns (driving_mode, osname, accuracy)
train_df = full_df[['trj_id', 'pingtimestamp', 'rawlat', 'rawlng', 'speed', 'bearing']]

# Sort by trj_id and then pingtimestamp
train_df = train_df.sort_values(by=["trj_id", "pingtimestamp"])

# Convert pingtimestamp from unix timestamp to Pandas DateTime
train_df['pingtimestamp'] = pd.to_datetime(train_df['pingtimestamp'], unit='s')
print(train_df)

         trj_id  pingtimestamp    rawlat      rawlng  speed  bearing
29091989      1     1554992255 -6.197622  106.769017   5.58      180
29074360      1     1554992256 -6.197667  106.769007   5.33      177
10694992      1     1554992257 -6.197713  106.769012   5.43      177
10676471      1     1554992258 -6.197764  106.769020   5.84      178
22141074      1     1554992259 -6.197809  106.769018   5.28      179
...         ...            ...       ...         ...    ...      ...
19921434   9999     1555822630 -6.178844  106.841960   0.00        0
7604364    9999     1555822631 -6.178844  106.841960   0.00        0
25980783   9999     1555822632 -6.178844  106.841961   0.00        0
14850195   9999     1555822634 -6.178845  106.841963   0.00        0
19919728   9999     1555822635 -6.178845  106.841964   0.00        0

[55988420 rows x 6 columns]


In [6]:
def reorganize_trj_id(df):
    unique_trj_ids = df['trj_id'].unique()
    trj_id_map = {old_id: new_id for new_id, old_id in enumerate(unique_trj_ids, start=1)}
    df['trj_id'] = df['trj_id'].map(trj_id_map)
    return df, max(trj_id_map.values())

def split_large_timestamp_gaps(group, start_id):
    group = group.reset_index(drop=True)

    # Calculate the difference between consecutive timestamps
    time_diff = group['pingtimestamp'].diff()

    # Identify rows where the difference is greater than 60 seconds
    large_gaps = time_diff > pd.Timedelta(seconds=60)

    # If there are large gaps, split the group
    if large_gaps.any():
        split_groups = []
        current_group = []
        current_trj_id = group['trj_id'].iloc[0]

        # Iterate through the rows
        for i, row in group.iterrows():
            if i == 0 or not large_gaps.iloc[i]:
                current_group.append(row)
            else:
                # Log the gap and split
                gap_duration = time_diff.iloc[i].total_seconds()
                new_trj_id = start_id
                start_id += 1

                size = len(current_group)
                size_split = len(group) - sum([len(g) for g in split_groups]) - len(current_group)
                print(f"Found {gap_duration} seconds gap in trj_id {current_trj_id}, splitting to {current_trj_id} with {size} steps and {new_trj_id} with {size_split} steps")
                
                # Assign trj_id to the current group
                for j in range(len(current_group)):
                    current_group[j]['trj_id'] = current_trj_id
                split_groups.append(pd.DataFrame(current_group))
                
                # Start a new group
                current_group = [row]
                current_trj_id = new_trj_id
        
        # Append the last group
        for j in range(len(current_group)):
            current_group[j]['trj_id'] = current_trj_id
        split_groups.append(pd.DataFrame(current_group))

        # Concatenate all new groups into one DataFrame
        return pd.concat(split_groups).reset_index(drop=True), start_id
    else:
        return group, start_id


# 1. Reorganize the original trj_id
df, max_trj_id = reorganize_trj_id(train_df)
print(df)

# 2. Apply the split function with the starting id for new trj_ids
start_id = max_trj_id + 1

result = []
for trj_id, group in df.groupby('trj_id'):
    split_group, start_id = split_large_timestamp_gaps(group, start_id)
    result.append(split_group)

# set_index(..., inplace=True) returns None, so use the returned DataFrame instead
split_df = pd.concat(result).reset_index(drop=True).set_index('trj_id')

          trj_id       pingtimestamp    rawlat      rawlng  speed  bearing
29091989       1 2019-04-11 14:17:35 -6.197622  106.769017   5.58      180
29074360       1 2019-04-11 14:17:36 -6.197667  106.769007   5.33      177
10694992       1 2019-04-11 14:17:37 -6.197713  106.769012   5.43      177
10676471       1 2019-04-11 14:17:38 -6.197764  106.769020   5.84      178
22141074       1 2019-04-11 14:17:39 -6.197809  106.769018   5.28      179
...          ...                 ...       ...         ...    ...      ...
19921434   55995 2019-04-21 04:57:10 -6.178844  106.841960   0.00        0
7604364    55995 2019-04-21 04:57:11 -6.178844  106.841960   0.00        0
25980783   55995 2019-04-21 04:57:12 -6.178844  106.841961   0.00        0
14850195   55995 2019-04-21 04:57:14 -6.178845  106.841963   0.00        0
19919728   55995 2019-04-21 04:57:15 -6.178845  106.841964   0.00        0

[55988420 rows x 6 columns]
Found 108.0 seconds gap in trj_id 6, splitting to 6 with 394 steps and 
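
The row-by-row loop in the cell above can be slow on tens of millions of pings. As a hedged alternative, the same partition into gap-free segments can be computed with vectorized operations. The function name `split_on_gaps` is hypothetical, and unlike the cell above it renumbers all segments consecutively rather than keeping the original id for the first segment.

```python
import pandas as pd

def split_on_gaps(df, max_gap_seconds=60):
    """Give each run of pings without a gap > max_gap_seconds its own trj_id."""
    df = df.sort_values(["trj_id", "pingtimestamp"]).reset_index(drop=True)
    # Time since the previous ping of the same trajectory (NaT for its first ping)
    gap = df.groupby("trj_id")["pingtimestamp"].diff()
    # A segment starts at the first ping of a trajectory or right after a large gap
    starts_segment = gap.isna() | (gap > pd.Timedelta(seconds=max_gap_seconds))
    # The running count of segment starts yields consecutive ids 1, 2, 3, ...
    df["trj_id"] = starts_segment.cumsum()
    return df

# Toy data (hypothetical): trajectory 1 has a 99-second gap, trajectory 2 has none
base = pd.Timestamp("2019-04-11 14:17:35")
toy = pd.DataFrame({
    "trj_id": [1, 1, 1, 1, 2, 2],
    "pingtimestamp": [base + pd.Timedelta(seconds=s) for s in (0, 1, 100, 101, 0, 1)],
})
split_toy = split_on_gaps(toy)
print(split_toy)
```

Here trajectory 1 is split into segments 1 and 2, and the original trajectory 2 becomes segment 3.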

In [45]:
def remove_large_gaps(df, gap_seconds=10):
    # Function to check for large gaps within a group
    def has_large_gaps(group):
        time_diff = group['pingtimestamp'].diff()
        return (time_diff > pd.Timedelta(seconds=gap_seconds)).any()

    # Find index (trj_id) values with large gaps
    trj_ids_with_gaps = df.groupby(df.index).filter(has_large_gaps).index.unique()

    # Remove rows with those trj_ids
    cleaned_df = df[~df.index.isin(trj_ids_with_gaps)]
    
    return cleaned_df

# Drop every trajectory that still contains a gap longer than 30 seconds
# (split_df is the DataFrame from the previous step, indexed by trj_id)
pruned_df = remove_large_gaps(split_df, gap_seconds=30)
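
For reference, the same pruning can be sketched without calling a Python function per group. This is a minimal sketch, assuming the DataFrame is indexed by trj_id and already sorted by pingtimestamp within each trajectory. The name `drop_trajectories_with_gaps` is hypothetical.

```python
import pandas as pd

def drop_trajectories_with_gaps(df, gap_seconds=30):
    """Remove every trajectory that contains a ping gap longer than gap_seconds."""
    # Per-trajectory time difference between consecutive pings
    gap = df.groupby(level=0)["pingtimestamp"].diff()
    too_large = (gap > pd.Timedelta(seconds=gap_seconds)).to_numpy()
    # Index values (trj_id) that have at least one oversized gap
    bad_ids = df.index[too_large].unique()
    # Return an explicit copy so later column assignments do not raise warnings
    return df[~df.index.isin(bad_ids)].copy()

# Toy data (hypothetical): trajectory 1 has a 49-second gap, trajectory 2 does not
base = pd.Timestamp("2019-04-11 14:17:35")
toy = pd.DataFrame(
    {"pingtimestamp": [base + pd.Timedelta(seconds=s) for s in (0, 1, 50, 0, 5)]},
    index=pd.Index([1, 1, 1, 2, 2], name="trj_id"),
)
kept = drop_trajectories_with_gaps(toy, gap_seconds=30)
print(kept)
```

Only trajectory 2 survives, because trajectory 1 contains a gap above the 30-second threshold.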

In [48]:
# Derive minute_of_hour, hour_of_day and day_of_week from pingtimestamp (so the model can use time of day and day of week when predicting user movement)
pruned_df = pruned_df.copy() # Work on an explicit copy to avoid SettingWithCopyWarning
pruned_df['minute_of_hour'] = pruned_df['pingtimestamp'].dt.minute
pruned_df['hour_of_day'] = pruned_df['pingtimestamp'].dt.hour
pruned_df['day_of_week'] = pruned_df['pingtimestamp'].dt.dayofweek
pruned_df = pruned_df.drop('pingtimestamp', axis=1) # Drop pingtimestamp as it is no longer needed
print(pruned_df)


          rawlat      rawlng  speed  bearing  minute_of_hour  hour_of_day  \
trj_id                                                                      
2      -6.248311  106.930447  11.35       88              51            0   
2      -6.248345  106.930673  12.43       87              51            0   
2      -6.248374  106.930931  13.83       84              51            0   
2      -6.248391  106.931061  14.29       85              51            0   
2      -6.248410  106.931184  14.58       87              51            0   
...          ...         ...    ...      ...             ...          ...   
55995  -6.178844  106.841960   0.00        0              57            4   
55995  -6.178844  106.841960   0.00        0              57            4   
55995  -6.178844  106.841961   0.00        0              57            4   
55995  -6.178845  106.841963   0.00        0              57            4   
55995  -6.178845  106.841964   0.00        0              57            4   
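
Because pingtimestamp is converted with `unit='s'` and carries no timezone, the features above are effectively in UTC. If local time of day is preferred, the conversion could look like the sketch below. It assumes the pings were recorded in Jakarta (a guess based on the coordinates in the sample output), so the `Asia/Jakarta` timezone is an assumption and not part of the original script.

```python
import pandas as pd

# Two Unix timestamps taken from the sample output above
ts = pd.Series(pd.to_datetime([1554992255, 1555822630], unit="s"))

# Mark the naive timestamps as UTC, then convert to the assumed local timezone
local = ts.dt.tz_localize("UTC").dt.tz_convert("Asia/Jakarta")

features = pd.DataFrame({
    "minute_of_hour": local.dt.minute,
    "hour_of_day": local.dt.hour,
    "day_of_week": local.dt.dayofweek,  # Monday = 0, Sunday = 6
})
print(features)
```

With this conversion the first ping falls at hour 21 instead of 14, which matters if the model is expected to learn daily patterns such as rush hours.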

In [49]:
# Export data to Parquet
pruned_df.to_parquet('clean_non_resampled_data.parquet')