# **Grab Safe Driver Telematics Risk Analysis**

This project uses Grab driver telematics data (anonymized accelerometer, gyroscope, and GPS readings) to analyze driving behavior and identify patterns associated with accidents or hazardous events. Through data exploration, feature engineering, driver segmentation, and risk prediction modeling, the results are intended to help Grab develop risk mitigation strategies, training programs, and driver incentives that improve road safety and overall service quality.

# **1. Import Libraries**
In this section, we import the libraries used in this notebook for data manipulation, statistical analysis, and visualization.

In [15]:
import pandas as pd
import numpy as np
import os
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid')

# **2. Load Data & Data Understanding**
In this section, we load each CSV partition from the `data` folder into its own pandas DataFrame.

### **2.1 Load Dataset**

In [16]:
def load_dataset(name, folder="data"):
    """Load `<folder>/<name>.csv` into a DataFrame; return None if loading fails."""
    path = os.path.join(folder, f"{name}.csv")
    try:
        df = pd.read_csv(path)
        print(f"✅ Loaded: {name}.csv | Shape: {df.shape}")
        return df
    except Exception as e:
        print(f"❌ Failed to load {name}.csv: {e}")
        return None

# NOTE: part-00006 is listed twice and part-00007 is not listed, so the
# `datasets` dict below ends up holding nine distinct partitions.
platform_names = [
    "part-00000-e6120af0-10c2-4248-97c4-81baf4304e5c-c000", "part-00001-e6120af0-10c2-4248-97c4-81baf4304e5c-c000", "part-00002-e6120af0-10c2-4248-97c4-81baf4304e5c-c000", "part-00003-e6120af0-10c2-4248-97c4-81baf4304e5c-c000",
    "part-00004-e6120af0-10c2-4248-97c4-81baf4304e5c-c000", "part-00005-e6120af0-10c2-4248-97c4-81baf4304e5c-c000", "part-00006-e6120af0-10c2-4248-97c4-81baf4304e5c-c000", "part-00006-e6120af0-10c2-4248-97c4-81baf4304e5c-c000",
    "part-00008-e6120af0-10c2-4248-97c4-81baf4304e5c-c000", "part-00009-e6120af0-10c2-4248-97c4-81baf4304e5c-c000"
]

# Keep only the partitions that loaded successfully, keyed by file name
datasets = {}
for name in platform_names:
    df = load_dataset(name)
    if df is not None:
        datasets[name] = df

✅ Loaded: part-00000-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv | Shape: (1613554, 11)
✅ Loaded: part-00001-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv | Shape: (1613558, 11)
✅ Loaded: part-00002-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv | Shape: (1613555, 11)
✅ Loaded: part-00003-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv | Shape: (1613553, 11)
✅ Loaded: part-00004-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv | Shape: (1613559, 11)
✅ Loaded: part-00005-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv | Shape: (1613551, 11)
✅ Loaded: part-00006-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv | Shape: (1613553, 11)
✅ Loaded: part-00006-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv | Shape: (1613553, 11)
✅ Loaded: part-00008-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv | Shape: (1613560, 11)
✅ Loaded: part-00009-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv | Shape: (1613562, 11)


### **2.2 Displaying the First 5 Rows**

In [17]:
def show_head(df, n=5):
    print(f"Number of rows and columns: {df.shape}")
    print(f"First {n} rows:")
    print(df.head(n))

for name, df in datasets.items():
    print(f"\n--- {name.upper()} ---")
    show_head(df)


--- PART-00000-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---
Number of rows and columns: (1613554, 11)
First 5 rows:
       bookingID  Accuracy  Bearing  acceleration_x  acceleration_y  \
0  1202590843006     3.000    353.0        1.228867        8.900100   
1   274877907034     9.293     17.0        0.032775        8.659933   
2   884763263056     3.000    189.0        1.139675        9.545974   
3  1073741824054     3.900    126.0        3.871542       10.386364   
4  1056561954943     3.900     50.0       -0.112882       10.550960   

   acceleration_z    gyro_x    gyro_y    gyro_z  second      Speed  
0        3.986968  0.008221  0.002269 -0.009966  1362.0   0.000000  
1        4.737300  0.024629  0.004028 -0.010858   257.0   0.190000  
2        1.951334 -0.006899 -0.015080  0.001122   973.0   0.667059  
3       -0.136474  0.001344 -0.339601 -0.017956   902.0   7.913285  
4       -1.560110  0.130568 -0.061697  0.161530   820.0  20.419409  

--- PART-00001-E6120AF0-10C2-4248-97C4-8

#### **Insight**

- The dataset consists of **multiple large parts, each containing approximately 1.6 million rows and 11 columns**, indicating a very large volume of sensor data collected from multiple bookings (`bookingID`).
- Key features include **`Accuracy`, `Bearing`, accelerometer readings (`acceleration_x`, `acceleration_y`, `acceleration_z`), gyroscope readings (`gyro_x`, `gyro_y`, `gyro_z`), `second` (time), and `Speed`**.
- The sensor values show **wide variability** across the previewed rows:
    - `Accuracy` ranges roughly from 3 to 20+, reflecting different levels of GPS precision.
    - `Bearing` covers the full 360 degrees, representing direction.
    - Accelerometer and gyroscope values fluctuate between positive and negative values, indicating motion and rotation dynamics.
    - `Speed` values range from 0 up to around 25+, showing diverse movement speeds.
- The data seems to be **time-series sensor readings per bookingID**, useful for analyzing movement patterns, driving behavior, or vehicle dynamics.
- Given the **high volume and multidimensional nature of data**, it is essential to apply data cleaning, handle missing or noisy sensor readings, and consider feature engineering like summarizing acceleration or rotation magnitudes.
- The presence of **multiple parts suggests the data was partitioned for storage or processing efficiency**, so combining and aligning these parts correctly is important for downstream analysis.
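The magnitude features and the combining step mentioned above could be sketched as follows. This is a minimal illustration, not the notebook's actual preprocessing: the helper names `add_magnitudes` and `combine_parts` and the new column names `acc_magnitude` and `gyro_magnitude` are hypothetical, and sorting by `bookingID` and `second` assumes that `second` orders the readings within a trip.

```python
import numpy as np
import pandas as pd


def add_magnitudes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the Euclidean norm of each 3-axis sensor."""
    out = df.copy()
    # Hypothetical column names; the norm removes the dependence on axis orientation
    out["acc_magnitude"] = np.sqrt(
        out["acceleration_x"] ** 2
        + out["acceleration_y"] ** 2
        + out["acceleration_z"] ** 2
    )
    out["gyro_magnitude"] = np.sqrt(
        out["gyro_x"] ** 2 + out["gyro_y"] ** 2 + out["gyro_z"] ** 2
    )
    return out


def combine_parts(parts: dict) -> pd.DataFrame:
    """Stack all partitions, then order rows by trip and time offset."""
    combined = pd.concat(parts.values(), ignore_index=True)
    return combined.sort_values(["bookingID", "second"]).reset_index(drop=True)
```

With the `datasets` dict from section 2.1, a call such as `add_magnitudes(combine_parts(datasets))` would produce one time-ordered table. Sorting after concatenation matters because rows of a single `bookingID` may be spread over several partitions.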

### **2.3 Dataset Information**

In [18]:
def show_data_info(df):
    print("\n🔍 Data Types:")
    # df.info() prints its report directly and returns None, so it is not wrapped in print()
    df.info()

for name, df in datasets.items():
    print(f"\n--- {name.upper()} ---")
    show_data_info(df)


--- PART-00000-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---

🔍 Data Types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1613554 entries, 0 to 1613553
Data columns (total 11 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   bookingID       1613554 non-null  int64  
 1   Accuracy        1613554 non-null  float64
 2   Bearing         1613554 non-null  float64
 3   acceleration_x  1613554 non-null  float64
 4   acceleration_y  1613554 non-null  float64
 5   acceleration_z  1613554 non-null  float64
 6   gyro_x          1613554 non-null  float64
 7   gyro_y          1613554 non-null  float64
 8   gyro_z          1613554 non-null  float64
 9   second          1613554 non-null  float64
 10  Speed           1613554 non-null  float64
dtypes: float64(10), int64(1)
memory usage: 135.4 MB

--- PART-00001-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---

🔍 Data Types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1613558 entries,

#### **Insight**

- Each dataset part contains **approximately 1.6 million rows and 11 columns**, showing this is a very **large-scale sensor dataset**.
- The data columns are consistent across parts with **no missing values** (non-null counts equal to total rows), indicating **complete data without nulls**.
- Key columns include:
    - **`bookingID`** (integer) as identifier,
    - **`Accuracy`** and **`Bearing`** (floats) representing GPS precision and orientation,
    - **`acceleration_x/y/z`** and **`gyro_x/y/z`** (floats) representing 3D movement and rotation data,
    - **`second`** (float) likely representing timestamp or time offset,
    - **`Speed`** (float) representing movement speed.
- Memory usage per part is around **135 MB**, indicating manageable size per file but collectively very large data volume.
- Uniform data types and complete non-null counts indicate **structurally well-formed sensor data**; this check says nothing about value quality, which the descriptive statistics and outlier checks address.
- This multi-part dataset likely captures **detailed, high-frequency motion and GPS sensor data** for multiple `bookingID`s, useful for applications in movement analysis, travel pattern recognition, or behavior modeling.
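Since each part takes about 135 MB as `float64`, one way to reduce the collective footprint is to downcast the sensor columns to `float32`. The sketch below is an assumption-laden illustration rather than a step the notebook performs: the helper name `downcast_sensor_columns` is hypothetical, and `second` is kept as `float64` on the assumption that its values can be too large for `float32` to represent exactly.

```python
import pandas as pd


def downcast_sensor_columns(df: pd.DataFrame, keep_float64=("second",)) -> pd.DataFrame:
    """Return a copy with float64 columns stored as float32, except those in keep_float64."""
    out = df.copy()
    for col in out.select_dtypes(include="float64").columns:
        if col not in keep_float64:
            # float32 halves the memory of the column at roughly 7 significant digits
            out[col] = out[col].astype("float32")
    # Integer columns such as bookingID are left untouched: its values exceed the int32 range
    return out
```

Comparing `df.memory_usage(deep=True).sum()` before and after shows the saving. Whether single precision is acceptable depends on the downstream model, so this is a trade-off to validate rather than a default.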

### **2.4 Statistical Description**

In [19]:
def show_descriptive_stats(df):
    print("\n📊 Descriptive Statistics (Numeric):")
    print(df.describe(include='number'))

    print("\n📊 Descriptive Statistics (Categorical):")
    categorical_cols = df.select_dtypes(include='object').columns
    if len(categorical_cols) > 0:
        print(df.describe(include='object'))
    else:
        print("This dataset has no categorical columns.")


for name, df in datasets.items():
    print(f"\n--- {name.upper()} ---")
    show_descriptive_stats(df)


--- PART-00000-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---

📊 Descriptive Statistics (Numeric):
          bookingID      Accuracy       Bearing  acceleration_x  \
count  1.613554e+06  1.613554e+06  1.613554e+06    1.613554e+06   
mean   8.185688e+11  1.157350e+01  1.689111e+02    6.807467e-02   
std    4.951339e+11  8.665737e+01  1.072641e+02    1.426141e+00   
min    0.000000e+00  7.500000e-01  0.000000e+00   -3.344084e+01   
25%    3.779571e+11  3.900000e+00  7.800000e+01   -5.087678e-01   
50%    8.074539e+11  4.259000e+00  1.682945e+02    6.130981e-02   
75%    1.254130e+12  8.000000e+00  2.624448e+02    6.344828e-01   
max    1.709397e+12  6.063000e+03  3.599985e+02    2.961647e+01   

       acceleration_y  acceleration_z        gyro_x        gyro_y  \
count    1.613554e+06    1.613554e+06  1.613554e+06  1.613554e+06   
mean     4.469578e+00    8.985294e-01 -1.754218e-03  4.358774e-04   
std      8.129320e+00    3.249632e+00  1.430876e-01  3.460380e-01   
min     -5.730359e+01   -

#### **Insight**

- **Dataset contains about 1.6 million records with only numerical features; no categorical columns present.**
- **Key variables are sensor-related measurements** such as Accuracy, Bearing, acceleration (x/y/z), gyro (x/y/z), Speed, and second.
- **High variability and presence of outliers** indicated by large standard deviations and extreme min/max values (e.g., Accuracy max ~6000, Speed max >150).
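- **Negative acceleration values are expected**, because accelerometer axes are signed. **Negative `Speed` values are not physically meaningful** and more likely reflect sensor noise or invalid GPS readings.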
- **The `second` feature has an extremely large max value** (up to hundreds of millions), likely representing elapsed time or timestamps with very wide range or anomalies.
- Means show central tendencies (e.g., Bearing ~169°, Speed ~9 units), but **large spread indicates heterogeneous conditions or diverse sensor activity.**
- **Overall, data is complex and noisy, requiring careful preprocessing and outlier handling** for further analysis or modeling.
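As a starting point for the preprocessing mentioned above, suspicious rows can be flagged before deciding whether to drop or repair them. The sketch below is illustrative only: the helper name `flag_implausible_rows` is hypothetical, and the thresholds `max_second` and `max_accuracy` are placeholder assumptions, not values derived from this dataset.

```python
import pandas as pd


def flag_implausible_rows(
    df: pd.DataFrame, max_second: float = 86_400, max_accuracy: float = 100
) -> pd.DataFrame:
    """Return one boolean column per rule, aligned with df's index."""
    flags = pd.DataFrame(index=df.index)
    # Speed below zero cannot be a real speed
    flags["negative_speed"] = df["Speed"] < 0
    # Assumed cap: a time offset longer than one day is treated as suspicious
    flags["long_second"] = df["second"] > max_second
    # Assumed cap on the GPS Accuracy value; tune it against the observed distribution
    flags["poor_accuracy"] = df["Accuracy"] > max_accuracy
    return flags
```

Calling `flag_implausible_rows(df).mean()` gives the share of rows hit by each rule, which helps judge whether a rule is too aggressive before any rows are removed.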

### **2.5 Check Missing Values**

In [20]:
def show_missing_values(df):
    print("Number of Missing Values:")
    print(df.isnull().sum())

for name, df in datasets.items():
    print(f"\n--- {name.upper()} ---")
    show_missing_values(df)


--- PART-00000-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---
Number of Missing Values:
bookingID         0
Accuracy          0
Bearing           0
acceleration_x    0
acceleration_y    0
acceleration_z    0
gyro_x            0
gyro_y            0
gyro_z            0
second            0
Speed             0
dtype: int64

--- PART-00001-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---
Number of Missing Values:
bookingID         0
Accuracy          0
Bearing           0
acceleration_x    0
acceleration_y    0
acceleration_z    0
gyro_x            0
gyro_y            0
gyro_z            0
second            0
Speed             0
dtype: int64

--- PART-00002-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---
Number of Missing Values:
bookingID         0
Accuracy          0
Bearing           0
acceleration_x    0
acceleration_y    0
acceleration_z    0
gyro_x            0
gyro_y            0
gyro_z            0
second            0
Speed             0
dtype: int64

--- PART-00003-E6120AF0-10C2-4248-97

#### **Insight**

- **No missing values detected across all features and all data parts.**
- **Every column including bookingID, Accuracy, Bearing, acceleration (x, y, z), gyro (x, y, z), second, and Speed is complete without any nulls.**
- This indicates **the dataset is clean in terms of missing data**, ensuring reliability for further analysis or modeling without the need for imputation.

### **2.6 Check Duplicates**

In [21]:
def show_duplicates(df):
    # Counts rows that exactly repeat an earlier row within the same partition
    print("\n📎 Duplicate count:", df.duplicated().sum())

for name, df in datasets.items():
    print(f"\n--- {name.upper()} ---")
    show_duplicates(df)


--- PART-00000-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---

📎 Duplicate count: 0

--- PART-00001-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---

📎 Duplicate count: 0

--- PART-00002-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---

📎 Duplicate count: 0

--- PART-00003-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---

📎 Duplicate count: 0

--- PART-00004-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---

📎 Duplicate count: 0

--- PART-00005-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---

📎 Duplicate count: 0

--- PART-00006-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---

📎 Duplicate count: 0

--- PART-00008-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---

📎 Duplicate count: 0

--- PART-00009-E6120AF0-10C2-4248-97C4-81BAF4304E5C-C000 ---

📎 Duplicate count: 0


#### **Insight**

- **No duplicate records found in any part of the dataset.**
- **All rows are unique within each partition.** The check runs per partition, so rows repeated across different partitions would not be detected by it.
- This means **the dataset does not contain redundant information**, which supports accurate analysis and modeling without bias from repeated data points.
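Because `df.duplicated()` is applied to one partition at a time, a check across partitions needs the parts stacked first. The following is a minimal sketch under stated assumptions: the helper name `count_cross_partition_duplicates` is hypothetical, and treating (`bookingID`, `second`) as the natural key of a reading is an assumption about the data, not something the source confirms.

```python
import pandas as pd


def count_cross_partition_duplicates(parts: dict, key=("bookingID", "second")) -> dict:
    """Count repeated rows after stacking every partition into one frame."""
    combined = pd.concat(parts.values(), ignore_index=True)
    return {
        # Rows identical in every column to an earlier row
        "full_row": int(combined.duplicated().sum()),
        # Rows that repeat an earlier row's key, even if sensor values differ
        "by_key": int(combined.duplicated(subset=list(key)).sum()),
    }
```

Stacking all partitions at once can take more than a gigabyte of memory at the sizes reported in section 2.3, so this is best run once, or on the key columns only.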

### **2.7 Detect Outliers**

In [None]:
# Detect outliers with the IQR method and print a tidy summary table
def detect_outliers_table(name, df):
    numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
    outlier_stats = []

    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Count flagged rows from a boolean mask instead of copying them into a new DataFrame
        n_outliers = int(((df[col] < lower_bound) | (df[col] > upper_bound)).sum())
        outlier_ratio = n_outliers / len(df)

        outlier_stats.append([col, n_outliers, f"{outlier_ratio:.2%}"])

    if outlier_stats:
        outlier_df = pd.DataFrame(outlier_stats, columns=["Column", "Outlier Count", "Outlier Ratio"])
        print(f"\n📦 Dataset: {name}  |  Numeric columns: {len(numeric_cols)}")
        print(outlier_df.sort_values(by="Outlier Count", ascending=False).to_string(index=False))
    else:
        print(f"\n📦 Dataset: {name}  |  No numeric columns.")

# Run for every dataset
for name, df in datasets.items():
    detect_outliers_table(name, df)


📦 Dataset: part-00000-e6120af0-10c2-4248-97c4-81baf4304e5c-c000  |  Numeric columns: 11
        Column  Outlier Count Outlier Ratio
        gyro_y         261642        16.22%
        gyro_x         248843        15.42%
        gyro_z         229809        14.24%
      Accuracy         113783         7.05%
acceleration_x         103231         6.40%
acceleration_z          63517         3.94%
        second          24687         1.53%
acceleration_y           1520         0.09%
         Speed             18         0.00%
       Bearing              0         0.00%
     bookingID              0         0.00%

📦 Dataset: part-00001-e6120af0-10c2-4248-97c4-81baf4304e5c-c000  |  Numeric columns: 11
        Column  Outlier Count Outlier Ratio
        gyro_y         261818        16.23%
        gyro_x         249230        15.45%
        gyro_z         229234        14.21%
      Accuracy         113892         7.06%
acceleration_x         102546         6.36%
ac

#### **Insight**
- **The gyroscope axes have the highest outlier ratios**: `gyro_y` around **16.2%**, `gyro_x` around **15.4%**, and `gyro_z` around **14.2%**, consistently across the dataset parts. These flagged values point to extreme readings or abrupt movements in the recorded signals.
- **`Accuracy` (~7%) and `acceleration_x` (~6.4%) have moderate outlier ratios**, which suggests measurement variability or noise and may hurt model performance if left unhandled.
- **`acceleration_z` (~3.9%), `second` (~1.5%), and `acceleration_y` (~0.1%) have few outliers**, so these columns are comparatively stable under the IQR rule.
- **`Bearing` has no outliers and `Speed` has almost none (~0%)**, so both look stable under this rule. `bookingID` also shows 0%, but it is an identifier, so an IQR check is not meaningful for it.
- The outlier pattern is consistent across dataset parts, which suggests systematic sensor characteristics rather than random noise.
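One possible way to handle the flagged values is to clip each column to the same IQR fences used for detection instead of dropping rows. This is a sketch of that option only, not the notebook's chosen treatment: the helper names `iqr_bounds` and `clip_to_iqr` are hypothetical, and clipping gyroscope readings may erase exactly the abrupt movements that matter for risk analysis.

```python
import pandas as pd


def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple:
    """Return the lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_to_iqr(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df with each listed column clipped to its own fences."""
    out = df.copy()
    for col in columns:
        lower, upper = iqr_bounds(out[col], k)
        # Values beyond a fence are replaced by the fence value; no rows are removed
        out[col] = out[col].clip(lower=lower, upper=upper)
    return out
```

For example, `clip_to_iqr(df, ["Accuracy", "acceleration_x"])` would cap only those two columns. A larger `k` keeps more of the tails, which may suit the gyroscope axes better than the default of 1.5.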


# **3. Data Preprocessing**

# **4. Exploratory Data Analysis**