![](assets/image.png)


# What is TDD?

![alt text](assets/tdd-schema.png){width=500}


What is TDD?

We will see it countless times during this talk.

Before we start with TDD, let me take you on a brief journey. We will take some pandas code and, WITHOUT LOOKING AT THE DATA, translate it to polars with the help of refactoring tests, to get used to polars' syntax and specificities.

Then we will see how we can use pelage to write a meaningful data test, see it fail, then write some simple code, and make it pass.


# Step 1: From Pandas To Polars

![kent-hat](assets/kent-beck-hat.png){width=150}


In [2]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import polars as pl

from load_data import data_loader

emissions_pandas = (
    data_loader().filter(pl.col.electric_range_km.is_not_null()).collect().to_pandas()
)

In [3]:
%load_ext notifier

In [4]:

df = emissions_pandas
df = df.dropna(subset=["fuel_consumption"])
df.drop(["obfcm_data_source", "registered_category"], axis=1, inplace=True)

df[df["fuel_type"] == "PETROL"]["fuel_type"] = "petrol"
df["fuel_consumption_per_100km"] = df["fuel_consumption"] * 100

grouped = []
for manufacturer in df["manufacturer_name"].unique():
    manuf_df = df[df["manufacturer_name"] == manufacturer]
    for year in df["year"].unique():
        subset_df = manuf_df[(manuf_df["year"] == year)]
        result = {"manufacturer_name": manufacturer,
            "year": year,
            "mean_fuel_consumption": subset_df["fuel_consumption_per_100km"].mean(), "mean_electric_range": subset_df["electric_range_km"].mean(), "vehicle_count": subset_df["vehicle_id"].nunique(),
        }
        grouped.append(result)

grouped = pd.DataFrame(grouped)
grouped = grouped.dropna()
grouped = grouped.sort_values(["mean_fuel_consumption", "year"], ascending=False)
grouped = grouped[grouped["vehicle_count"] >= 100]
grouped = grouped.reset_index(drop=True)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(["obfcm_data_source", "registered_category"], axis=1, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df["fuel_type"] == "PETROL"]["fuel_type"] = "petrol"
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["fuel_consumption_per_100km"] = df["fuel_consumption"] * 100

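The second warning above is more than noise. A chained assignment such as `df[mask]["col"] = value` writes to a temporary copy, so the original frame is left untouched and the `"PETROL"` values are never renamed. A minimal sketch on a made-up three-row frame (the data is invented for illustration), next to the single-step `.loc` form that the warning suggests:

```python
import warnings

import pandas as pd

# Toy data, invented for this illustration.
toy = pd.DataFrame({"fuel_type": ["PETROL", "diesel", "PETROL"]})

with warnings.catch_warnings():
    # Silence the chained-assignment warning for this demo only.
    warnings.simplefilter("ignore")
    # Chained assignment: the write lands on a temporary copy.
    toy[toy["fuel_type"] == "PETROL"]["fuel_type"] = "petrol"

# The original frame still contains the uppercase values.
unchanged = toy["fuel_type"].tolist()

# Single-step assignment with .loc modifies the frame itself.
fixed = toy.copy()
fixed.loc[fixed["fuel_type"] == "PETROL", "fuel_type"] = "petrol"
changed = fixed["fuel_type"].tolist()

print(unchanged)
print(changed)
```

The bug stays invisible in the loop-based cell because `fuel_type` is never used in the aggregation, which is one more reason to back a refactoring with tests.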

## Refactoring



In [6]:
main_yearly_emissions_per_manufacturer = (
    emissions_pandas.dropna(subset=["fuel_consumption"])
    .drop(["obfcm_data_source", "registered_category"], axis=1)
    .assign(
        fuel_type=lambda df: df["fuel_type"].replace({"PETROL": "petrol"}),
        fuel_consumption_per_100km=lambda df: df["fuel_consumption"] * 100,
    )
    .groupby(["manufacturer_name", "year"])
    .agg(
        mean_fuel_consumption=("fuel_consumption_per_100km", "mean"),
        mean_electric_range=("electric_range_km", "mean"),
        vehicle_count=("vehicle_id", "nunique"),
    )
    .reset_index()
    .sort_values(["mean_fuel_consumption", "year"], ascending=False)
    .loc[lambda df: df["vehicle_count"] >= 100]
    .reset_index(drop=True)
)

emissions_polars = pl.DataFrame(emissions_pandas)

main_yearly_emissions_per_manufacturer = (
    emissions_polars.drop_nulls(subset=["fuel_consumption"])
    .drop(["obfcm_data_source", "registered_category"])
    .with_columns(
        fuel_type=pl.col.fuel_type.replace({"PETROL": "petrol"}),
        fuel_consumption_per_100km=pl.col.fuel_consumption * 100,
    )
    .group_by("manufacturer_name", "year")
    .agg(
        mean_fuel_consumption=pl.col.fuel_consumption_per_100km.mean(),
        mean_electric_range=pl.col.electric_range_km.mean(),
        vehicle_count=pl.col.vehicle_id.n_unique().cast(pl.Int64),
    )
    .sort(["mean_fuel_consumption", "year"], descending=True)
    .filter(pl.col.vehicle_count >= 100)
)

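Translating without looking at the data only works if a test confirms that both versions agree. One way to write such an equivalence test is sketched below on made-up rows. The toy data, the shortened pipelines, the function names, and the `min_vehicles` parameter are assumptions for illustration; the comparison goes through plain dictionaries so that no pandas/polars conversion is needed.

```python
import pandas as pd
import polars as pl

# Toy data, invented for this illustration (None marks a missing value).
toy = {
    "manufacturer_name": ["A", "A", "B", "B", "B"],
    "year": [2021, 2021, 2021, 2022, 2022],
    "vehicle_id": [1, 2, 3, 4, 5],
    "fuel_consumption": [1.0, 3.0, 5.0, None, 4.0],
    "electric_range_km": [40.0, 50.0, 60.0, 70.0, 80.0],
}


def pandas_version(df: pd.DataFrame, min_vehicles: int) -> pd.DataFrame:
    return (
        df.dropna(subset=["fuel_consumption"])
        .assign(fuel_consumption_per_100km=lambda d: d["fuel_consumption"] * 100)
        .groupby(["manufacturer_name", "year"])
        .agg(
            mean_fuel_consumption=("fuel_consumption_per_100km", "mean"),
            mean_electric_range=("electric_range_km", "mean"),
            vehicle_count=("vehicle_id", "nunique"),
        )
        .reset_index()
        .sort_values(["mean_fuel_consumption", "year"], ascending=False)
        .loc[lambda d: d["vehicle_count"] >= min_vehicles]
        .reset_index(drop=True)
    )


def polars_version(df: pl.DataFrame, min_vehicles: int) -> pl.DataFrame:
    return (
        df.drop_nulls(subset=["fuel_consumption"])
        .with_columns(fuel_consumption_per_100km=pl.col("fuel_consumption") * 100)
        .group_by("manufacturer_name", "year")
        .agg(
            mean_fuel_consumption=pl.col("fuel_consumption_per_100km").mean(),
            mean_electric_range=pl.col("electric_range_km").mean(),
            vehicle_count=pl.col("vehicle_id").n_unique(),
        )
        .sort(["mean_fuel_consumption", "year"], descending=True)
        .filter(pl.col("vehicle_count") >= min_vehicles)
    )


expected = pandas_version(pd.DataFrame(toy), min_vehicles=1).to_dict(orient="list")
actual = polars_version(pl.DataFrame(toy), min_vehicles=1).to_dict(as_series=False)

# The refactoring test: both implementations must produce the same table.
assert actual == expected
```

Comparing dictionaries checks values and row order but ignores dtypes. A stricter, dtype-aware comparison would need the polars result to match pandas' integer type, which is presumably why the polars cell above casts `vehicle_count` to `pl.Int64`.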
# But What About TDD?

## This is Kent Beck:

![alt text](assets/kent_beck.png){width=300}

- ##### Kent coined the term TDD
- ##### Kent created one of the first testing frameworks
- ##### Kent found that a testing framework should be written in the same language as the code
- ##### Kent says that TDD reduces developer anxiety (Better than Xanax!)
- ##### Kent usually has good ideas about software development


## Enter `Pelage`

##### A testing framework to be used with polars to express data science tests clearly and easily.

##### Pass a dataframe to a testing function:

- ##### If it fails, you get a nice descriptive error message!
- ##### If it passes, you get your dataframe back!

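Returning the dataframe on success and raising on failure is what lets a check sit inside a method chain. A minimal sketch of that contract in plain polars follows. `DataCheckError` and `check_no_nulls` are hypothetical names written for this illustration; they are not pelage's implementation.

```python
import polars as pl


class DataCheckError(Exception):
    """Hypothetical error type, standing in for a library's assertion error."""


def check_no_nulls(df: pl.DataFrame) -> pl.DataFrame:
    """Return df unchanged if it has no nulls, otherwise raise with details."""
    null_counts = df.null_count().row(0, named=True)
    offending = {name: n for name, n in null_counts.items() if n > 0}
    if offending:
        raise DataCheckError(f"Null values found (column -> count): {offending}")
    return df


# Toy frames, invented for this illustration.
clean = pl.DataFrame({"vehicle_id": [1, 2], "country": ["IT", "SE"]})
dirty = pl.DataFrame({"vehicle_id": [1, 2], "country": ["IT", None]})

# Passing check: the dataframe comes back, so the chain can continue.
passed = clean.pipe(check_no_nulls).filter(pl.col("country") == "IT")

# Failing check: the chain stops with a descriptive error.
try:
    dirty.pipe(check_no_nulls)
    failure_message = None
except DataCheckError as error:
    failure_message = str(error)

print(failure_message)
```

Because the check hands back its input untouched, it can be added to or removed from a chain without changing the result of the pipeline.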

## Step 2: Now Let's Do Some Real TDD In Polars!


In [7]:
emissions_for_tdd = (
    data_loader().filter(pl.col.electric_range_km.is_not_null()).collect()
)

## TDD With Polars and Pelage


In [8]:
import pelage as plg

primary_key_columns = [
    "vehicle_id",
    "reporting_period",
    "obfcm_data_source",
    "used_in_calculation",
]
(
    # Transformations: the code written to make the checks below pass.
    emissions_for_tdd.drop_nulls(subset=["model_variant", "license_plate"])
    .with_columns(pl.col.fuel_type.str.to_lowercase())
    .filter(pl.col.fuel_type.is_in(["diesel", "petrol"]).not_())
    .cast({"used_in_calculation": pl.Boolean})
    .filter(pl.len().over(primary_key_columns) == 1)
    # Data tests: each check returns the dataframe on success and raises on failure.
    .pipe(plg.accepted_range, {"mass_kg": (1000, 3000), "engine_power_kw": (25, 600)})
    .pipe(
        plg.has_no_nulls,
        [
            "country",
            "manufacturer_name",
            "model_type",
            "model_variant",
            "license_plate",
            "brand_name",
            *primary_key_columns,
        ],
    )
    .pipe(plg.custom_check, pl.col.fuel_type.str.contains(r"[A-Z]").not_())
    .pipe(
        plg.unique_combination_of_columns,
        primary_key_columns,
    )
    .pipe(plg.not_accepted_values, {"fuel_type": ["diesel", "petrol"]})
    .pipe(plg.has_dtypes, {"used_in_calculation": pl.Boolean})
)

| vehicle_id | obfcm_data_source | reporting_period | used_in_calculation | country | manufacturer_name | model_type | model_variant | license_plate | brand_name | commercial_name | registered_category | ewltp_g_per_km | fuel_type | fuel_mode | year | fuel_consumption | electric_consumption_wh_per_km | electric_range_km | engine_capacity_cm3 | engine_power_kw | mass_kg | total_fuel_consumed_l | total_distance_travelled_km |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | i64 | bool | str | str | str | str | str | str | str | str | f64 | str | str | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 12718530 | "OEM" | 2023 | false | "IT" | "STELLANTIS EUROPE" | "MP" | "JHPFP" | "MN1BBH1" | "JEEP" | "COMPASS" | "M1G" | 45.0 | "petrol/electric" | "P" | 2022 | 1.9 | 163.0 | 48.0 | 1332.0 | 132.0 | 1935.0 | 0.2 | 0.0 |
| 7190988 | "MS" | 2023 | true | "SE" | "STELLANTIS EUROPE" | "MP" | "JHPFP" | "MN1BBH1" | "JEEP" | "COMPASS" | "M1G" | 45.0 | "petrol/electric" | "P" | 2022 | 1.9 | 162.0 | 48.0 | 1332.0 | 132.0 | 1935.0 | 1043.36 | 12483.0 |
| 6051896 | "OEM" | 2023 | true | "ES" | "STELLANTIS EUROPE" | "MP" | "JHPFP" | "ML1BBH1" | "JEEP" | "COMPASS" | "M1G" | 45.0 | "petrol/electric" | "P" | 2022 | 1.9 | 166.0 | 47.0 | 1332.0 | 96.0 | 1935.0 | 115.0 | 3789.0 |
| 12713671 | "OEM" | 2023 | true | "IT" | "STELLANTIS EUROPE" | "MP" | "JHPFP" | "ML1BBH1" | "JEEP" | "COMPASS" | "M1G" | 45.0 | "petrol/electric" | "P" | 2022 | 1.9 | 166.0 | 47.0 | 1332.0 | 96.0 | 1935.0 | 3410.8 | 49094.0 |
| 12718724 | "OEM" | 2023 | true | "IT" | "STELLANTIS EUROPE" | "MP" | "JHPFP" | "ML1BBH1" | "JEEP" | "COMPASS" | "M1G" | 45.0 | "petrol/electric" | "P" | 2022 | 1.9 | 166.0 | 47.0 | 1332.0 | 96.0 | 1935.0 | 325.7 | 5313.0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 699643 | "OEM" | 2023 | true | "PL" | "MERCEDES-BENZ AG" | "204 X" | "R81IT1" | "CZAA051A" | "MERCEDES-BENZ" | "GLC 300 DE 4MATIC" | "M1" | 54.0 | "diesel/electric" | "P" | 2021 | 2.0 | 281.0 | 42.0 | 1950.0 | 143.0 | 2135.0 | 1276.04 | 27013.7 |
| 4712170 | "OEM" | 2023 | true | "BE" | "MERCEDES-BENZ AG" | "204 X" | "R81IT1" | "CZAA05BB" | "MERCEDES-BENZ" | "GLC 300 DE 4MATIC" | "M1" | 52.0 | "diesel/electric" | "P" | 2021 | 2.0 | 257.0 | 42.0 | 1950.0 | 143.0 | 2135.0 | 4949.28 | 85724.2 |
| 4759155 | "OEM" | 2023 | true | "DE" | "MERCEDES-BENZ AG" | "204 X" | "R81IT1" | "CZAA05BB" | "MERCEDES-BENZ" | "GLC 300 DE 4MATIC" | "M1" | 55.0 | "diesel/electric" | "P" | 2021 | 2.1 | 264.0 | 41.0 | 1950.0 | 143.0 | 2135.0 | 5575.46 | 67522.9 |
| 185753 | "OEM" | 2023 | true | "IT" | "VOLVO" | "X" | "XZBW" | "XZBWVF0" | "VOLVO" | "XC40" | "M1" | 47.0 | "petrol/electric" | "P" | 2021 | null | 152.0 | 45.0 | 1477.0 | 95.0 | 1812.0 | 433.0 | 17254.0 |


## TDD With Polars and Pelage (Part 2)


In [13]:
loaded_stocks = pl.read_parquet("data/stocks.parquet")
full_stocks_by_10min = (
    loaded_stocks
    .filter(
        (
            pl.col.zone.is_in(["zone_28", "zone_29", "zone_17"])
            & (pl.col.subzone == "subzone_a")
        ).not_()
    )
    .with_columns(
        pl.col.timestamp.str.to_datetime(time_unit="ms").dt.truncate("10m"),
        pl.col.stock_value.clip(0, None),
    )
    .group_by("zone", "subzone", "timestamp")
    .agg(
        pl.col.stock_value.min().name.suffix("_min"),
        pl.col.stock_value.max().name.suffix("_max"),
    )
    .with_columns(
        date=pl.col.timestamp.dt.date(),
    )
    .filter(pl.len().over("zone", "subzone", "date") > 1)
    .sort("zone", "subzone", "timestamp")
    .pipe(plg.has_dtypes, {"timestamp": pl.Datetime(time_unit="ms")})
    .pipe(plg.has_no_nulls)
    .pipe(plg.is_monotonic, "timestamp", strict=True, group_by=["zone", "subzone"])
    .pipe(plg.unique_combination_of_columns, ["zone", "subzone", "timestamp"])
    .pipe(
        plg.accepted_range, {"stock_value_min": (0, None), "stock_value_max": (0, None)}
    )
    .pipe(
        plg.custom_check,
        (
            pl.col.zone.is_in(["zone_28", "zone_29", "zone_17"])
            & (pl.col.subzone == "subzone_a")
        ).not_(),
    )
    .pipe(plg.not_constant, columns="timestamp", group_by=["zone", "subzone", "date"])
)
full_stocks_by_10min


| zone | subzone | timestamp | stock_value_min | stock_value_max | date |
| --- | --- | --- | --- | --- | --- |
| str | str | datetime[ms] | i64 | i64 | date |
| "zone_01" | "subzone_a" | 2025-04-04 19:00:00 | 17 | 17 | 2025-04-04 |
| "zone_01" | "subzone_a" | 2025-04-04 19:10:00 | 16 | 16 | 2025-04-04 |
| "zone_01" | "subzone_a" | 2025-04-04 20:30:00 | 11 | 15 | 2025-04-04 |
| "zone_01" | "subzone_a" | 2025-04-04 20:40:00 | 10 | 10 | 2025-04-04 |
| "zone_01" | "subzone_a" | 2025-04-04 21:30:00 | 8 | 9 | 2025-04-04 |
| … | … | … | … | … | … |
| "zone_63" | "subzone_c" | 2025-05-29 22:40:00 | 7 | 8 | 2025-05-29 |
| "zone_63" | "subzone_c" | 2025-05-29 22:50:00 | 6 | 8 | 2025-05-29 |
| "zone_63" | "subzone_c" | 2025-05-29 23:20:00 | 7 | 7 | 2025-05-29 |
| "zone_63" | "subzone_c" | 2025-05-29 23:30:00 | 7 | 8 | 2025-05-29 |


# Conclusion

#### Method Chaining: From horrible slow-performing pandas to good, easy-to-read pandas code.

#### Translation to polars becomes easy: write a good equivalence test, add `.pipe(pl.DataFrame)` at the end of the chain, then shift it up one step at a time!

#### With TDD in data science, we rapidly get larger insights from just a few data-tests as we build up our analysis:

- **Being able to express tests in the same manner as your code is a critical part of the TDD workflow.**
- **This is already done for you in pelage: the core logic leverages polars (except for processing user inputs).**
- **The result: a simple, easy-to-use package that lets you write, from the start, code of the quality required in production.**

#### All you have to do is:

- **❌ Write a failing test!**
- **✅ Write some code to make it pass!**
- **🔄 Refactor to make it better!**


# And Now?

- ### Type `uv add pelage` (~~pip install pelage~~ works too, but `uv` is better)
- ### Go to the website: https://alixtc.github.io/pelage/
- ### QR Code for the Presentation
  ![](assets/qr_presentation_link.png){width=200}
- ### QR Code for my LinkedIn
  ![](assets/qr_linkedin_profile.png){width=200}
