# Cleaning Process
## Comprehensive Data Description

**Dataset Overview:**  

| Column | Data Type | Description |
|--------|-----------|-------------|
| `vehicle_id` | int | Unique identifier for each autonomous vehicle |
| `timestamp` | object/datetime | Date and time of the measurement |
| `gps_latitude` | float | Latitude position of the vehicle from GPS sensor |
| `gps_longitude` | float | Longitude position of the vehicle from GPS sensor |
| `lidar_points` | int | Number of points detected by the LiDAR sensor |
| `radar_objects` | int | Number of objects detected by the radar sensor |
| `camera_objects` | int | Number of objects detected by the camera sensor |
| `packet_drop_rate` | float | Fraction of lost communication packets |
| `packet_delivery_ratio` | float | Ratio of successfully delivered packets |
| `latency_ms` | float | Communication latency in milliseconds |
| `throughput_kbps` | float | Network throughput in kilobits per second |
| `collision_detected` | int (0 or 1) | Binary indicator of whether a collision occurred |
| `obstacle_detection_accuracy` | float | Accuracy of obstacle detection (0–1) |
| `decision_accuracy` | float | Accuracy of autonomous decision making (0–1) |
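
The documented ranges in the table (0–1 for the accuracy columns, 0 or 1 for `collision_detected`) can be checked mechanically. The following is a minimal sketch, not part of the original notebook; the helper name `check_ranges` and the sample rows are made up for illustration.

```python
import pandas as pd

# Expected ranges taken from the column descriptions in the table.
UNIT_INTERVAL_COLS = ["obstacle_detection_accuracy", "decision_accuracy"]
BINARY_COLS = ["collision_detected"]

def check_ranges(frame: pd.DataFrame) -> dict:
    """Count non-missing values that fall outside the documented range (hypothetical helper)."""
    violations = {}
    for col in UNIT_INTERVAL_COLS:
        values = frame[col].dropna()
        violations[col] = int((~values.between(0, 1)).sum())
    for col in BINARY_COLS:
        values = frame[col].dropna()
        violations[col] = int((~values.isin([0, 1])).sum())
    return violations

# Made-up rows for illustration only (not taken from the dataset).
sample = pd.DataFrame({
    "obstacle_detection_accuracy": [0.84, 1.20, None],
    "decision_accuracy": [0.79, 0.82, 0.89],
    "collision_detected": [1, 0, 2],
})
report = check_ranges(sample)
print(report)
```

Missing values are skipped rather than counted as violations, so this check is independent of the missing-value handling.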

**Summary Statistics:**  

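- Sensor features (`lidar_points`, `radar_objects`, `camera_objects`) vary with the environment: after de-duplication, `lidar_points` spans 10,002–29,981 and both object counts span 5–49.  
- Network features (`latency_ms`, `throughput_kbps`, `packet_drop_rate`) vary with wireless communication conditions: latency spans 10–299 ms, throughput 100–999 kbps, and packet drop rate 0–0.3.  
- Accuracy features (`obstacle_detection_accuracy`, `decision_accuracy`) range from 0.7 to 1.0, with a median of 0.85.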
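
Per-group ranges like those in the bullets can be computed with `DataFrame.agg`. This is a sketch under assumptions: the grouping dictionary and the helper `summarize_groups` are hypothetical, and the three sample rows are copied from the `head()` output shown in the Data Loading section.

```python
import pandas as pd

# Feature groups as named in the summary bullets.
FEATURE_GROUPS = {
    "sensor": ["lidar_points", "radar_objects", "camera_objects"],
    "network": ["latency_ms", "throughput_kbps", "packet_drop_rate"],
    "accuracy": ["obstacle_detection_accuracy", "decision_accuracy"],
}

def summarize_groups(frame: pd.DataFrame) -> dict:
    """Return {group name: min/median/max table} (hypothetical helper)."""
    return {
        name: frame[cols].agg(["min", "median", "max"])
        for name, cols in FEATURE_GROUPS.items()
    }

# First three rows of the notebook's head() output.
rows = pd.DataFrame({
    "lidar_points": [28474.0, 25569.0, 11304.0],
    "radar_objects": [37.0, 48.0, 36.0],
    "camera_objects": [20.0, 32.0, 5.0],
    "latency_ms": [228.0, 209.0, 89.0],
    "throughput_kbps": [803.0, 120.0, 993.0],
    "packet_drop_rate": [0.042, 0.009, 0.014],
    "obstacle_detection_accuracy": [0.84, 0.78, 0.78],
    "decision_accuracy": [0.79, 0.82, 0.89],
})
summary = summarize_groups(rows)
for name, table in summary.items():
    print(f"\n{name} features")
    print(table)
```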

_________________________________________
# Data Loading

i. Load the dataset

ii. Identify the dataset shape

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore

# Load the dataset, skipping the two change-log columns
LOG_COLUMNS = ["modified_column_name", "modified_row_index"]
df = pd.read_csv("Dirty_Dataset_with_Log.csv", usecols=lambda col: col not in LOG_COLUMNS)

print("Dataset loaded successfully.")
display(df.head())

# Dataset shape
print(f"Number of Rows: {df.shape[0]}")
print(f"Number of Columns: {df.shape[1]}\n")


Dataset loaded successfully.


Unnamed: 0,vehicle_id,timestamp,gps_latitude,gps_longitude,lidar_points,radar_objects,camera_objects,packet_drop_rate,packet_delivery_ratio,latency_ms,throughput_kbps,collision_detected,obstacle_detection_accuracy,decision_accuracy
0,1038,01/01/2025 00:00,37.111566,-121.062897,28474.0,37.0,20.0,0.042,0.771,228.0,803.0,1.0,0.84,0.79
1,1083,01/01/2025 00:00,37.5314,-121.999292,25569.0,48.0,32.0,0.009,0.733,209.0,120.0,0.0,0.78,0.82
2,1034,01/01/2025 00:00,37.342874,-121.807894,11304.0,36.0,5.0,0.014,0.822,89.0,993.0,0.0,0.78,0.89
3,1080,01/01/2025 00:00,37.160521,-121.266002,12801.0,40.0,33.0,0.153,0.716,239.0,348.0,0.0,0.83,0.88
4,1008,01/01/2025 00:00,37.797779,-121.473512,26214.0,27.0,44.0,0.212,0.972,288.0,688.0,0.0,0.96,0.75


Number of Rows: 2306
Number of Columns: 14



_______________
# Redundant Data Summary

1. Find redundant rows, defined as rows that share the same `vehicle_id` and `timestamp`
2. Remove the redundant rows, keeping the first occurrence of each pair

In [2]:
# Find duplicates based on Vehicle ID and Timestamp
duplicates = df[df.duplicated(subset=["vehicle_id", "timestamp"], keep=False)]

print("Redundant rows:")
print(duplicates)

# Remove redundant rows (keep only the first)
df_cleaned = df.drop_duplicates(subset=["vehicle_id", "timestamp"], keep="first")

df_cleaned.to_csv("Dirty_Dataset_cleaned.csv", index=False)
print("Duplicates removed. Cleaned dataset saved as Dirty_Dataset_cleaned.csv")

Redundant rows:
      vehicle_id         timestamp  gps_latitude  gps_longitude  lidar_points  \
0           1038  01/01/2025 00:00     37.111566    -121.062897       28474.0   
2           1034  01/01/2025 00:00     37.342874    -121.807894       11304.0   
3           1080  01/01/2025 00:00     37.160521    -121.266002       12801.0   
5           1059  01/01/2025 00:00     37.815540    -121.009048       24821.0   
6           1061  01/01/2025 00:00     37.306271    -121.215717       12705.0   
...          ...               ...           ...            ...           ...   
2280        1098  01/01/2025 00:37     37.298140    -121.592749       26174.0   
2284        1015  01/01/2025 00:37     37.214000    -121.946089       28290.0   
2285        1056  01/01/2025 00:37     37.834664    -121.033074       28802.0   
2297        1079  01/01/2025 00:38     37.959093    -121.349770       15617.0   
2299        1079  01/01/2025 00:38     37.487633    -121.001654       25770.0   

      radar
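
Some flagged rows, such as 2297 and 2299, share `vehicle_id` and `timestamp` but report different GPS positions, so a key-based match does not guarantee identical rows. The sketch below is an optional check, not part of the original notebook, that separates exact copies from key matches with conflicting values. The helper `split_duplicates` and the toy frame are hypothetical.

```python
import pandas as pd

KEY_COLS = ["vehicle_id", "timestamp"]

def split_duplicates(frame: pd.DataFrame):
    """Split key-based duplicates into exact copies and conflicting rows (hypothetical helper)."""
    key_dupes = frame[frame.duplicated(subset=KEY_COLS, keep=False)]
    # Among rows sharing a key, mark those that are identical in every column.
    exact_mask = key_dupes.duplicated(keep=False)
    return key_dupes[exact_mask], key_dupes[~exact_mask]

# Toy data for illustration only.
toy = pd.DataFrame({
    "vehicle_id": [1, 1, 2, 2, 3],
    "timestamp": ["01/01/2025 00:00"] * 5,
    "gps_latitude": [37.1, 37.1, 37.2, 37.9, 37.3],
})
exact, conflicting = split_duplicates(toy)
print("Exact copies:")
print(exact)
print("Same key, different values:")
print(conflicting)
```

Rows in `conflicting` may be separate measurements taken within the same minute, so it is worth inspecting them before `keep="first"` discards them.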

_________________________________________________
# Cleaned Dataset Overview

In [3]:
print("Dataset Info (Cleaned):")
df_cleaned.info()

print("\nSummary Statistics (Cleaned):")
display(df_cleaned.describe())

Dataset Info (Cleaned):
<class 'pandas.core.frame.DataFrame'>
Index: 1750 entries, 0 to 2305
Data columns (total 14 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   vehicle_id                   1750 non-null   int64  
 1   timestamp                    1750 non-null   object 
 2   gps_latitude                 1747 non-null   float64
 3   gps_longitude                1741 non-null   float64
 4   lidar_points                 1747 non-null   float64
 5   radar_objects                1748 non-null   float64
 6   camera_objects               1745 non-null   float64
 7   packet_drop_rate             1742 non-null   float64
 8   packet_delivery_ratio        1742 non-null   float64
 9   latency_ms                   1741 non-null   float64
 10  throughput_kbps              1745 non-null   float64
 11  collision_detected           1745 non-null   float64
 12  obstacle_detection_accuracy  1744 non-null   float64
 13 

Unnamed: 0,vehicle_id,gps_latitude,gps_longitude,lidar_points,radar_objects,camera_objects,packet_drop_rate,packet_delivery_ratio,latency_ms,throughput_kbps,collision_detected,obstacle_detection_accuracy,decision_accuracy
count,1750.0,1747.0,1741.0,1747.0,1748.0,1745.0,1742.0,1742.0,1741.0,1745.0,1745.0,1744.0,1741.0
mean,1049.265714,37.498901,-121.506359,19914.896966,27.397597,26.677364,0.150261,0.85192,157.612866,550.735244,0.053868,0.846812,0.849776
std,28.895223,0.290577,0.295209,5766.606653,12.880692,12.897607,0.085413,0.087494,83.259127,261.938541,0.225822,0.086203,0.087617
min,1000.0,37.000115,-121.999854,10002.0,5.0,5.0,0.0,0.7,10.0,100.0,0.0,0.7,0.7
25%,1024.0,37.242139,-121.757025,14894.0,16.0,16.0,0.075,0.777,87.0,326.0,0.0,0.77,0.78
50%,1049.0,37.498618,-121.503844,19768.0,27.0,26.0,0.154,0.852,160.0,546.0,0.0,0.85,0.85
75%,1075.0,37.755774,-121.246314,24977.5,38.0,38.0,0.22375,0.931,228.0,782.0,0.0,0.92,0.93
max,1099.0,37.997538,-121.001123,29981.0,49.0,49.0,0.3,1.0,299.0,999.0,1.0,1.0,1.0
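
The `info()` output lists count columns such as `lidar_points` and `collision_detected` as `float64`, although the data description declares them as integers. pandas stores an integer column as float when it contains missing values. A minimal sketch, assuming the integer representation is wanted before imputation, uses the nullable `Int64` dtype. The toy frame is illustrative and is not the notebook's data.

```python
import pandas as pd

# Columns described as int in the data description.
INT_COLS = ["lidar_points", "radar_objects", "camera_objects", "collision_detected"]

# Toy frame with missing values, which forces float64 on load.
toy = pd.DataFrame({
    "lidar_points": [28474.0, None, 11304.0],
    "radar_objects": [37.0, 48.0, 36.0],
    "camera_objects": [20.0, 32.0, None],
    "collision_detected": [1.0, 0.0, 0.0],
})

# "Int64" (capital I) is the nullable integer dtype; missing entries become <NA>.
converted = toy.astype({col: "Int64" for col in INT_COLS})
print(converted.dtypes)
print(converted)
```

The conversion only succeeds when every non-missing value is a whole number, which also makes it a quick sanity check on these columns.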


# Missing Values Summary

In [4]:
# Select every row that has at least one missing value
missing_rows = df_cleaned[df_cleaned.isna().any(axis=1)].copy()

# Identify which columns are missing in each row
missing_rows["missing_columns"] = missing_rows.apply(
    lambda row: [col for col in df_cleaned.columns if pd.isna(row[col])],
    axis=1
)

# Show only key columns
missing_info = missing_rows[["vehicle_id", "timestamp", "missing_columns"]]

# Temporarily lift the display limits so every affected row is shown
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(missing_info)

total_missing = df_cleaned.isna().sum().sum()
print("Total missing values in cleaned dataset:", total_missing)

Unnamed: 0,vehicle_id,timestamp,missing_columns
61,1084,01/01/2025 00:01,[latency_ms]
125,1063,01/01/2025 00:02,[gps_longitude]
127,1056,01/01/2025 00:02,[gps_longitude]
132,1049,01/01/2025 00:02,[latency_ms]
161,1047,01/01/2025 00:02,[latency_ms]
205,1088,01/01/2025 00:03,[gps_longitude]
216,1086,01/01/2025 00:03,[collision_detected]
217,1019,01/01/2025 00:03,[throughput_kbps]
221,1011,01/01/2025 00:03,[decision_accuracy]
248,1069,01/01/2025 00:04,[packet_drop_rate]


Total missing values in cleaned dataset: 72
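
The total of 72 can be broken down by column. The non-null counts in the `info()` output already imply the split (for example, 1750 − 1741 = 9 missing `gps_longitude` values). The sketch below shows one way to compute it directly; the helper `missing_by_column` and the toy frame are hypothetical.

```python
import pandas as pd

def missing_by_column(frame: pd.DataFrame) -> pd.Series:
    """Missing-value count per column, largest first, omitting complete columns (hypothetical helper)."""
    counts = frame.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Toy frame for illustration only.
toy = pd.DataFrame({
    "vehicle_id": [1084, 1063, 1056],
    "latency_ms": [None, 209.0, None],
    "gps_longitude": [-121.06, None, -121.80],
})
per_column = missing_by_column(toy)
print(per_column)
print("Total:", int(per_column.sum()))
```

Applied to `df_cleaned`, the per-column counts should sum to the total printed above.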


