# Data Cleaning Notebook

This notebook cleans the new `car_sales_data.csv` dataset using **Polars**. It includes explicit checks for duplicate rows and for values that fail type casting.

In [1]:
import polars as pl

from src.utils.data_manager import load_from, save_to

## 1. Load Data
Loading raw data from `data/raw/car_sales_data.csv`.

In [2]:
df_raw = pl.read_csv(load_from("raw", "car_sales_data.csv"))
print(f"Raw shape: {df_raw.shape}")
df_raw.head()

Raw shape: (50000, 7)


| Manufacturer | Model | Engine size | Fuel type | Year of manufacture | Mileage | Price |
| --- | --- | --- | --- | --- | --- | --- |
| str | str | f64 | str | i64 | i64 | i64 |
| "Ford" | "Fiesta" | 1.0 | "Petrol" | 2002 | 127300 | 3074 |
| "Porsche" | "718 Cayman" | 4.0 | "Petrol" | 2016 | 57850 | 49704 |
| "Ford" | "Mondeo" | 1.6 | "Diesel" | 2014 | 39190 | 24072 |
| "Toyota" | "RAV4" | 1.8 | "Hybrid" | 1988 | 210814 | 1705 |
| "VW" | "Polo" | 1.0 | "Petrol" | 2006 | 127869 | 4101 |

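`load_from` and `save_to` come from the project's own `src.utils.data_manager` module, which is not shown in this notebook. Judging only from the paths mentioned here (`data/raw/...` and `data/cleaned/...`), helpers of this kind might look roughly like the sketch below. The `DATA_ROOT` constant, the `root` parameter, the function bodies, and the folder-creation behavior are all assumptions for illustration, not the project's actual implementation.

```python
from pathlib import Path

# Assumed project layout: data/<stage>/<filename>. DATA_ROOT is hypothetical.
DATA_ROOT = Path("data")


def load_from(stage: str, filename: str, root: Path = DATA_ROOT) -> Path:
    """Return the path of an existing file under <root>/<stage>/ (hypothetical helper)."""
    path = root / stage / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected input file is missing: {path}")
    return path


def save_to(stage: str, filename: str, root: Path = DATA_ROOT) -> Path:
    """Return an output path under <root>/<stage>/, creating the folder if needed (hypothetical helper)."""
    path = root / stage / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

Returning a `Path` keeps the helpers compatible with both `pl.read_csv(...)` and `DataFrame.write_csv(...)`, which accept path-like objects.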

## 2. Inspect Anomalies
Identifying duplicate rows and values that do not match the expected column types before cleaning.

In [3]:
# 1. Validation check: duplicate rows (is_duplicated() flags every copy, including the first)
duplicates = df_raw.filter(df_raw.is_duplicated())
if duplicates.height > 0:
    print(f"Found {duplicates.height} duplicate rows:")
    print(duplicates)
else:
    print("No duplicates found.")

# 2. Apply transformations WITHOUT dropping nulls initially to find bad data
df_casted = (
    df_raw
    .unique()  # drop exact duplicate rows first (row order is not preserved)
    .with_columns([
        # String standardization
        # NOTE: to_titlecase() also lowercases acronyms, so "VW" becomes "Vw"
        pl.col("Manufacturer").str.strip_chars().str.to_titlecase(),
        pl.col("Model").str.strip_chars(),
        pl.col("Fuel type").str.strip_chars().str.to_titlecase(),

        # Numeric casting (strict=False turns errors into nulls)
        pl.col("Engine size").cast(pl.Float64, strict=False),
        pl.col("Year of manufacture").cast(pl.Int64, strict=False),
        pl.col("Mileage").cast(pl.Int64, strict=False),
        pl.col("Price").cast(pl.Float64, strict=False)
    ])
)

# 3. Check for rows containing nulls (a null after a non-strict cast indicates bad data)
null_rows = df_casted.filter(pl.any_horizontal(pl.all().is_null()))

if null_rows.height > 0:
    print(f"\nFound {null_rows.height} rows with nulls (potential casting errors):")
    print(null_rows)
else:
    print("\nNo null rows found after casting.")

Found 23 duplicate rows:
shape: (23, 7)
┌──────────────┬────────┬─────────────┬───────────┬─────────────────────┬─────────┬───────┐
│ Manufacturer ┆ Model  ┆ Engine size ┆ Fuel type ┆ Year of manufacture ┆ Mileage ┆ Price │
│ ---          ┆ ---    ┆ ---         ┆ ---       ┆ ---                 ┆ ---     ┆ ---   │
│ str          ┆ str    ┆ f64         ┆ str       ┆ i64                 ┆ i64     ┆ i64   │
╞══════════════╪════════╪═════════════╪═══════════╪═════════════════════╪═════════╪═══════╡
│ Ford         ┆ Mondeo ┆ 1.4         ┆ Diesel    ┆ 1987                ┆ 224569  ┆ 883   │
│ VW           ┆ Polo   ┆ 1.2         ┆ Petrol    ┆ 2003                ┆ 10000   ┆ 8024  │
│ VW           ┆ Polo   ┆ 1.2         ┆ Petrol    ┆ 2003                ┆ 10000   ┆ 8024  │
│ Toyota       ┆ Yaris  ┆ 1.0         ┆ Petrol    ┆ 1996                ┆ 13500   ┆ 5087  │
│ VW           ┆ Passat ┆ 1.8         ┆ Diesel    ┆ 1996                ┆ 13500   ┆ 9394  │
│ …            ┆ …      ┆ …           ┆ 

## 3. Finalize Cleaning
Dropping rows that contain nulls. The drop in row count below also includes the duplicates removed by `.unique()` in section 2.

In [4]:
df_cleaned = df_casted.drop_nulls()

print(f"Original shape: {df_raw.shape}")
print(f"Cleaned shape:  {df_cleaned.shape}")
df_cleaned.head()

Original shape: (50000, 7)
Cleaned shape:  (49988, 7)


| Manufacturer | Model | Engine size | Fuel type | Year of manufacture | Mileage | Price |
| --- | --- | --- | --- | --- | --- | --- |
| str | str | f64 | str | i64 | i64 | f64 |
| "Toyota" | "Yaris" | 1.0 | "Petrol" | 1989 | 221849 | 671.0 |
| "Porsche" | "Cayenne" | 2.6 | "Petrol" | 2008 | 67439 | 26149.0 |
| "Ford" | "Mondeo" | 2.0 | "Hybrid" | 2014 | 46317 | 29459.0 |
| "Toyota" | "RAV4" | 2.2 | "Hybrid" | 2006 | 109397 | 14497.0 |
| "Vw" | "Polo" | 1.4 | "Petrol" | 2011 | 68754 | 10406.0 |


## 4. Export Cleaned Data
Saving to `data/cleaned/car_sales_data_cleaned.csv`.

In [5]:
df_cleaned.write_csv(save_to("cleaned", "car_sales_data_cleaned.csv"))
print("Saved successfully.")

Saved successfully.
