# Cleaning Process
## Comprehensive Data Description

**Dataset Overview:**  

| Column | Data Type | Description |
|--------|-----------|-------------|
| `vehicle_id` | int | Unique identifier for each autonomous vehicle |
| `timestamp` | object/datetime | Date and time of the measurement |
| `gps_latitude` | float | Latitude position of the vehicle from GPS sensor |
| `gps_longitude` | float | Longitude position of the vehicle from GPS sensor |
| `lidar_points` | int | Number of points detected by the LiDAR sensor |
| `radar_objects` | int | Number of objects detected by the radar sensor |
| `camera_objects` | int | Number of objects detected by the camera sensor |
| `packet_drop_rate` | float | Fraction of lost communication packets |
| `packet_delivery_ratio` | float | Ratio of successfully delivered packets |
| `latency_ms` | float | Communication latency in milliseconds |
| `throughput_kbps` | float | Network throughput in kilobits per second |
| `collision_detected` | int (0 or 1) | Binary indicator if collision occurred |
| `obstacle_detection_accuracy` | float | Accuracy of obstacle detection (0–1) |
| `decision_accuracy` | float | Accuracy of autonomous decision making (0–1) |

**Summary Statistics:**  

- Sensor features (`lidar_points`, `radar_objects`, `camera_objects`) vary with the driving environment.
- Network features (`latency_ms`, `throughput_kbps`, `packet_drop_rate`) vary with wireless communication conditions.
- Accuracy features (`obstacle_detection_accuracy`, `decision_accuracy`) mostly fall between 0.7 and 0.9.
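
The ranges stated in the column descriptions can be checked before any cleaning starts. The following is a minimal sketch on toy rows, not the real dataset: the values are invented for illustration, and the `[0, 1]` bound for `packet_drop_rate` is an assumption based on its description as a fraction.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only).
sample = pd.DataFrame({
    "packet_drop_rate": [0.02, 0.10, 1.30],
    "obstacle_detection_accuracy": [0.85, 0.78, 0.91],
    "decision_accuracy": [0.88, -0.10, 0.74],
    "collision_detected": [0, 1, 2],
})

# Expected bounds taken from the column descriptions in the table above.
# packet_drop_rate is described as a fraction, so [0, 1] is assumed here.
bounds = {
    "packet_drop_rate": (0.0, 1.0),
    "obstacle_detection_accuracy": (0.0, 1.0),
    "decision_accuracy": (0.0, 1.0),
    "collision_detected": (0, 1),
}

# Count values outside each column's expected range (NaN is not counted).
out_of_range = {
    col: int((~sample[col].between(lo, hi) & sample[col].notna()).sum())
    for col, (lo, hi) in bounds.items()
}
print(out_of_range)
```

Any column with a non-zero count would deserve a closer look before imputation, since out-of-range values also influence the neighbours chosen by KNN.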

_________________________________________
# 1. Data Loading

i. Load the dataset

ii. Identify the dataset shape

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore

# Load the dataset, excluding the modified_column_name and modified_row_index columns
df = pd.read_csv("Dirty_Dataset_with_Log.csv", usecols=lambda col: col not in ["modified_column_name", "modified_row_index"])

print("Dataset loaded successfully.")
display(df.head())

# Dataset shape
print(f"Number of Rows: {df.shape[0]}")
print(f"Number of Columns: {df.shape[1]}\n")
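
The description table lists `timestamp` as `object/datetime`, so the column may load as plain strings. A possible follow-up step, sketched here on a toy frame (the sample values and the name `toy` are hypothetical), converts it to a true datetime dtype and counts the entries that could not be parsed.

```python
import pandas as pd

# Toy frame; in the notebook this would be `df` from the loading cell.
toy = pd.DataFrame({
    "vehicle_id": [1, 2, 3],
    "timestamp": ["2024-01-01 08:00:00", "not a date", None],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
toy["timestamp"] = pd.to_datetime(toy["timestamp"], errors="coerce")

print(toy.dtypes)
print("Unparseable or missing timestamps:", toy["timestamp"].isna().sum())
```

Coercing rather than raising keeps the load step from failing on a single bad value, at the cost of turning that value into a missing one that must be handled later.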


_______________
# Redundant Data Summary

1. Find redundant rows, i.e. rows that are identical in every column (not only in Vehicle ID and Timestamp)
2. Remove the redundant rows, keeping the first occurrence of each

In [None]:
# Identify FULL duplicates (entire row is identical)
full_duplicates = df[df.duplicated(keep=False)]

print("\nFULL redundant rows:")
display(full_duplicates)

# Remove only TRUE duplicates (entire row same)
df_cleaned = df.drop_duplicates(keep="first")

print("\nAfter removing redundant rows:", df_cleaned.shape)

# Save cleaned CSV
df_cleaned.to_csv("Cleaned_Dataset.csv", index=False)
print("\nCleaned dataset saved as Cleaned_Dataset.csv")
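
Full-row deduplication leaves in place any rows that share a vehicle and a timestamp but disagree in some other column. A check keyed on `vehicle_id` and `timestamp` can surface those rows. This is a sketch on invented toy data, and the names `toy`, `key_dupes`, and `conflicts` are hypothetical.

```python
import pandas as pd

# Toy data: rows 0 and 1 are identical; row 2 shares the key but differs.
toy = pd.DataFrame({
    "vehicle_id": [1, 1, 1, 2],
    "timestamp": ["t0", "t0", "t0", "t0"],
    "latency_ms": [10.0, 10.0, 12.5, 9.0],
})

key = ["vehicle_id", "timestamp"]

# Rows identical in every column (what drop_duplicates() removes).
full_dupes = toy[toy.duplicated(keep=False)]

# Rows that share the key, whether or not the other columns agree.
key_dupes = toy[toy.duplicated(subset=key, keep=False)]

# Key duplicates that are not full duplicates: conflicting readings.
conflicts = key_dupes.drop(full_dupes.index)

print(len(full_dupes), len(key_dupes), len(conflicts))
```

How to resolve the conflicting readings (keep the first, average them, or drop them) depends on how the data was collected, so the sketch only reports them.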


_________________________________________________
# Cleaned Dataset Overview

In [None]:
print("Dataset Info (Cleaned):")
df_cleaned.info()

print("\nSummary Statistics (Cleaned):")
display(df_cleaned.describe())

# Missing Values Summary

In [None]:
missing_rows = df_cleaned[df_cleaned.isna().any(axis=1)].copy()

# Identify which columns are missing in each row
missing_rows["missing_columns"] = missing_rows.apply(
    lambda row: [col for col in df_cleaned.columns if pd.isna(row[col])],
    axis=1
)

# Show only key columns
missing_info = missing_rows[["vehicle_id", "timestamp", "missing_columns"]]

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
display(missing_info)
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')

total_missing = df_cleaned.isna().sum().sum()
print("Total missing values in cleaned dataset:", total_missing)
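
Alongside the row-level listing, a per-column view shows which features carry most of the missing values. The sketch below uses invented toy values, and `missing_summary` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy frame; in the notebook this would be `df_cleaned`.
toy = pd.DataFrame({
    "latency_ms": [10.0, np.nan, 12.0, np.nan],
    "throughput_kbps": [500.0, 480.0, np.nan, 510.0],
    "decision_accuracy": [0.8, 0.9, 0.85, 0.7],
})

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(2),
})

# Keep only columns that actually have gaps, worst first.
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
missing_summary = missing_summary.sort_values("missing_count", ascending=False)
print(missing_summary)
```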

# Fill In Missing Values Using KNN Imputation

In [None]:
from sklearn.impute import KNNImputer

# Get all numeric columns (any integer or float width, not only 64-bit)
numeric_cols = df_cleaned.select_dtypes(include="number").columns.tolist()

# Remove vehicle_id from numeric columns
if "vehicle_id" in numeric_cols:
    numeric_cols.remove("vehicle_id")

print("Numeric columns to fill:", numeric_cols)

# Create a copy of the cleaned dataframe for KNN imputation
df_knn = df_cleaned.copy()

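# Create KNN imputer with k=5 (scikit-learn's default for n_neighbors)
imputer = KNNImputer(n_neighbors=5)

# Apply only to numeric columns. The imputer returns floats, so integer
# columns (e.g. collision_detected) become float and may get fractional fills.
df_knn[numeric_cols] = imputer.fit_transform(df_knn[numeric_cols])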

df_knn.to_csv("Cleaned_Dataset_KNNFilled.csv", index=False)
print("KNN-filled dataset saved as Cleaned_Dataset_KNNFilled.csv")
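
`KNNImputer` measures distance on the raw feature values, so columns with large ranges (such as `throughput_kbps`) can dominate the choice of neighbours. One possible refinement, sketched here and not part of the pipeline above, standardizes the features first, imputes, maps the result back to the original units, and rounds the binary flag. The toy values, `n_neighbors=2`, and the rounding rule are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy data with two clear groups of rows (illustrative values only).
toy = pd.DataFrame({
    "latency_ms": [10.0, 12.0, np.nan, 50.0, 52.0, 11.0, 51.0],
    "throughput_kbps": [900.0, 880.0, 890.0, 300.0, 310.0, 905.0, 305.0],
    "collision_detected": [0.0, 0.0, 0.0, 1.0, np.nan, 0.0, 1.0],
})
cols = toy.columns.tolist()

# StandardScaler ignores NaN when fitting and keeps it when transforming.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[cols])

# Impute in the standardized space so every feature has comparable weight.
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Map back to the original units.
filled = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled), columns=cols, index=toy.index
)

# A binary flag should stay 0/1, so round the averaged neighbour values.
filled["collision_detected"] = filled["collision_detected"].round().astype(int)
print(filled)
```

Rounding is only one way to keep the flag binary. Excluding `collision_detected` from imputation, or filling it with the most frequent neighbour value, would be equally defensible choices.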

In [None]:
# 1. Identify missing locations BEFORE imputation (vectorized, row-major order)
row_pos, col_pos = np.where(df_cleaned[numeric_cols].isna().to_numpy())
missing_locs = [(df_cleaned.index[r], numeric_cols[c]) for r, c in zip(row_pos, col_pos)]

# 2. Create a results table
results = []

for row, col in missing_locs:
    original = df_cleaned.loc[row, col]   # always NaN here; kept to make before/after explicit
    filled = df_knn.loc[row, col]   # value after KNN imputation

    results.append({
        "row_index": row,
        "vehicle_id": df_cleaned.loc[row, "vehicle_id"],
        "timestamp": df_cleaned.loc[row, "timestamp"],
        "column_imputed": col,
        "original_value": original,
        "filled_value": filled
    })

# 3. Convert to DataFrame for display
imputation_report = pd.DataFrame(results)

print("KNN Imputation Report (Before vs After):")
display(imputation_report)

# Save report
imputation_report.to_csv("KNN_Imputed_Values_Report.csv", index=False)
print("Saved: KNN_Imputed_Values_Report.csv")
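
A short sanity check can confirm that imputation filled every gap and left the observed values untouched. The sketch below uses two small toy frames in place of `df_cleaned` and `df_knn`; the names `before`, `after`, and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the data before and after imputation.
before = pd.DataFrame({
    "latency_ms": [10.0, np.nan, 12.0],
    "throughput_kbps": [500.0, 480.0, np.nan],
})
after = before.copy()
after.loc[1, "latency_ms"] = 11.0
after.loc[2, "throughput_kbps"] = 490.0

was_missing = before.isna().to_numpy()

# Cells that were observed before imputation must be unchanged afterwards.
unchanged_ok = np.allclose(
    before.to_numpy()[~was_missing], after.to_numpy()[~was_missing]
)

# No gaps should remain, and the number of filled cells should match
# the number of rows in the imputation report.
all_filled = not after.isna().any().any()
n_imputed = int(was_missing.sum())

print(unchanged_ok, all_filled, n_imputed)
```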
