In [None]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import pelage as plg
import polars as pl

from load_data import data_loader

emissions = data_loader()

My name is Alix Tiran-Cappello. In my day job, I work as a data scientist / MLOps pipeline maintainer for Renault Group.

I am only saying this to explain that I get called A LOT to debug pandas code that is either failing or showing absolutely garbage performance.

This experience has definitely shaped some ideas that will come through during this presentation. I will refrain from swearing, but you might experience decent levels of salt at times.

But today I am here on my own, and what I say is not associated in any way with my company.

The reason I am here is to present a small open source package that I created: `pelage`.

My goal is to bring you basic key software engineering practices, which have proven useful for maintaining velocity and embedding quality from the start,
to help you write better data science code with polars and TDD.


# What is TDD?

![alt text](assets/tdd-schema.png){width=500}


What is TDD?

We will see it countless times during this talk.

Before we can start with TDD, I'd like to guide you through a brief journey. During this time, we will take some pandas code and, WITHOUT LOOKING AT THE DATA, translate it to polars using refactoring tests, to get used to its syntax and specificities.

Then we will see how we can use pelage to write a meaningful data test, see it fail, then write some simple code, and make it pass.


In [70]:
emissions_pandas = (
    emissions.filter(pl.col.electric_range_km.is_not_null()).collect().to_pandas()
)

In [71]:
%load_ext notifier



# Step 1: From Pandas To Polars

![kent-hat](assets/kent-beck-hat.png){width=150}


In [None]:

df = emissions_pandas
df = df.dropna(subset=["fuel_consumption"])
df.drop(["obfcm_data_source", "registered_category"], axis=1, inplace=True)

df[df["fuel_type"] == "PETROL"]["fuel_type"] = "petrol"
df["fuel_consumption_per_100km"] = df["fuel_consumption"] * 100

grouped = []
for manufacturer in df["manufacturer_name"].unique():
    manuf_df = df[df["manufacturer_name"] == manufacturer]
    for year in df["year"].unique():
        subset_df = manuf_df[(manuf_df["year"] == year)]
        result = {"manufacturer_name": manufacturer,
            "year": year,
            "mean_fuel_consumption": subset_df["fuel_consumption_per_100km"].mean(), "mean_electric_range": subset_df["electric_range_km"].mean(), "vehicle_count": subset_df["vehicle_id"].nunique(),
        }
        grouped.append(result)

grouped = pd.DataFrame(grouped)
grouped = grouped.dropna()
grouped = grouped.sort_values(["mean_fuel_consumption", "year"], ascending=False)
grouped = grouped[grouped["vehicle_count"] >= 100]
grouped = grouped.reset_index(drop=True)


```python
main_yearly_emissions_per_manufacturer = (
    emissions_pandas.dropna(subset=["fuel_consumption"])
    .drop(["obfcm_data_source", "registered_category"], axis=1)
    .assign(
        fuel_type=lambda df: df["fuel_type"].replace({"PETROL": "petrol"}),
        fuel_consumption_per_100km=lambda df: df["fuel_consumption"] * 100,
    )
    .groupby(["manufacturer_name", "year"])
    .agg(
        mean_fuel_consumption=("fuel_consumption_per_100km", "mean"),
        mean_electric_range=("electric_range_km", "mean"),
        vehicle_count=("vehicle_id", "nunique"),
    )
    .reset_index()
    .sort_values(["mean_fuel_consumption", "year"], ascending=False)
    .loc[lambda df: df["vehicle_count"] >= 100]
    .reset_index(drop=True)
)

emissions_polars = pl.DataFrame(emissions_pandas)

main_yearly_emissions_per_manufacturer = (
    emissions_polars.drop_nulls(subset=["fuel_consumption"])
    .drop(["obfcm_data_source", "registered_category"])
    .with_columns(
        fuel_type=pl.col.fuel_type.replace({"PETROL": "petrol"}),
        fuel_consumption_per_100km=pl.col.fuel_consumption * 100,
    )
    .group_by("manufacturer_name", "year")
    .agg(
        mean_fuel_consumption=pl.col.fuel_consumption_per_100km.mean(),
        mean_electric_range=pl.col.electric_range_km.mean(),
        vehicle_count=pl.col.vehicle_id.n_unique().cast(pl.Int64),
    )
    .sort(["mean_fuel_consumption", "year"], descending=True)
    .filter(pl.col.vehicle_count >= 100)
)
```


# But What About TDD?

## This is Kent Beck:

![alt text](assets/kent_beck.png){width=350}

- ##### Kent coined the term TDD
- ##### Kent created one of the first testing frameworks
- ##### Kent found that a testing framework should be written in the same language as the code
- ##### Kent says that TDD reduces developer anxiety (Better than Xanax!)
- ##### Kent usually has good ideas about software development


## Enter Pelage

##### A testing framework to be used with polars to express data science concepts clearly and easily.

##### Pass a dataframe to a testing function:

- ##### If it fails you get a nice descriptive error message!
- ##### If it passes, you get your dataframe back!


In [91]:
emissions_for_tdd = emissions.filter(pl.col.electric_range_km.is_not_null()).collect()

## Step 2: Now Let's Do Some Real TDD In Polars!


| vehicle_id | reporting_period | obfcm_data_source | used_in_calculation | country | manufacturer_name | model_type | model_variant | license_plate | brand_name | commercial_name | registered_category | ewltp_g_per_km | fuel_type | fuel_mode | year | fuel_consumption | electric_consumption_wh_per_km | electric_range_km | engine_capacity_cm3 | engine_power_kw | mass_kg | total_fuel_consumed_l | total_distance_travelled_km |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | i64 | str | bool | str | str | str | str | str | str | str | str | f64 | str | str | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 9515604 | 2023 | "OEM" | false | "BE" | "MERCEDES-BENZ AG" | "R1ES" | "U21IT0" | "CZAA050C" | "MERCEDES-BENZ" | "E 300 DE" | "M1" | 36.0 | "diesel/electric" | "P" | 2023 | 1.4 | 212.0 | 52.0 | 1950.0 | 143.0 | 2145.0 | 1.68 | 8.2 |
| 13523581 | 2023 | "OEM" | true | "FR" | "CHRYSLER" | "JK" | "JTAFG" | "N5H62A" | "JEEP" | "WRANGLER UNLIMITED" | "M1G" | 79.0 | "petrol/electric" | "P" | 2022 | 3.5 | 221.0 | 44.0 | 1995.0 | 200.0 | 2348.0 | 1140.2 | 11678.3 |
| 6849622 | 2023 | "OEM" | true | "NL" | "BMW AG" | "G5X" | "TA61" | "IAA50900" | "BMW" | "X5 xDrive45e" | "M1G" | 29.0 | "petrol/electric" | "P" | 2021 | 1.3 | 251.0 | 85.0 | 2998.0 | 210.0 | 2510.0 | 2096.85 | 31197.1 |
| 11510555 | 2023 | "OEM" | true | "DE" | "BMW AG" | "G3X" | "TS11" | "IAW50000" | "BMW" | "X3 xDrive30e" | "M1G" | 46.0 | "petrol/electric" | "P" | 2021 | 2.0 | 188.0 | 50.0 | 1998.0 | 135.0 | 2065.0 | 941.06 | 17041.8 |
| 739896 | 2023 | "OEM" | true | "LU" | "MERCEDES-BENZ AG" | "212" | "U01IT1" | "CZAA050A" | "MERCEDES-BENZ" | "E 300 DE 4MATIC" | "M1" | 37.0 | "diesel/electric" | "P" | 2021 | 1.4 | 236.0 | 51.0 | 1950.0 | 143.0 | 2130.0 | 1436.4 | 30657.2 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 10739202 | 2023 | "OEM" | false | "FR" | "AUTOMOBILES PEUGEOT" | "M" | "4" | "DGZU-C12000" | "PEUGEOT" | "3008" | "M1" | 28.0 | "petrol/electric" | "P" | 2023 | 1.2 | 163.0 | 67.0 | 1598.0 | 133.0 | 1838.0 | 3.56 | 11.2 |
| 10715121 | 2023 | "OEM" | true | "DE" | "AUTOMOBILES PEUGEOT" | "M" | "4" | "5GBU-C1J000" | "PEUGEOT" | "3008" | "M1" | 30.0 | "petrol/electric" | "P" | 2021 | 1.3 | 153.0 | 63.0 | 1598.0 | 147.0 | 1915.0 | 1505.49 | 19862.6 |
| 8267319 | 2023 | "OEM" | false | "SE" | "TOYOTA" | "XA5P(EU,M)" | "AXAP54(N)" | "AXAP54L-ANXGBW(1A)" | "TOYOTA" | "TOYOTA RAV4" | "M1" | 22.0 | "petrol/electric" | "P" | 2021 | 1.0 | 166.0 | 75.0 | 2487.0 | 136.0 | 2005.0 | 0.0 | 0.0 |
| 5015536 | 2023 | "OEM" | true | "BE" | "MERCEDES-BENZ AG" | "H1GLE" | "B21JT1" | "CZAAB56A" | "MERCEDES-BENZ" | "GLE 350 DE 4MATIC" | "M1" | 20.0 | "diesel/electric" | "P" | 2022 | 0.7 | 271.0 | 95.0 | 1950.0 | 143.0 | 2655.0 | 4832.79 | 70584.6 |


```python
primary_key_columns = [
    "vehicle_id",
    "reporting_period",
    "obfcm_data_source",
    "used_in_calculation",
]

(
    emissions_for_tdd.drop_nulls(subset=["model_variant", "license_plate"])
    .with_columns(pl.col.fuel_type.str.to_lowercase())
    .filter(pl.col.fuel_type.is_in(["diesel", "petrol"]).not_())
    .cast({"used_in_calculation": pl.Boolean})
    .filter(pl.len().over(primary_key_columns) == 1)
    .pipe(plg.accepted_range, {"mass_kg": (1000, 3000), "engine_power_kw": (25, 600)})
    .pipe(
        plg.has_no_nulls,
        [
            "country",
            "manufacturer_name",
            "model_type",
            "model_variant",
            "license_plate",
            "brand_name",
            *primary_key_columns,
        ],
    )
    .pipe(plg.custom_check, pl.col.fuel_type.str.contains(r"[A-Z]").not_())
    .pipe(
        plg.unique_combination_of_columns,
        primary_key_columns,
    )
    .pipe(plg.not_accepted_values, {"fuel_type": ["diesel", "petrol"]})
    .pipe(plg.has_dtypes, {"used_in_calculation": pl.Boolean})
)
```


In [None]:
loaded_stocks = pl.read_parquet("data/stocks.parquet")

```python
full_stocks_by_10min = (
    loaded_stocks
    .filter(
        (
            pl.col.zone.is_in(["zone_28", "zone_29", "zone_17"])
            & (pl.col.subzone == "subzone_a")
        ).not_()
    )
    .with_columns(
        pl.col.timestamp.cast(pl.Datetime(time_unit="ms")).dt.truncate("10m"),
        pl.col.stock_value.clip(0, None),
    )
    .group_by("zone", "subzone", "timestamp")
    .agg(
        pl.col.stock_value.min().name.suffix("_min"),
        pl.col.stock_value.max().name.suffix("_max"),
    )
    .with_columns(
        date=pl.col.timestamp.dt.date(),
    )
    .filter(pl.len().over("zone", "subzone", "date") > 1)
    .sort("zone", "subzone", "timestamp")
    .pipe(plg.has_dtypes, {"timestamp": pl.Datetime(time_unit="ms")})
    .pipe(plg.has_no_nulls)
    .pipe(plg.is_monotonic, "timestamp", strict=True, group_by=["zone", "subzone"])
    .pipe(plg.unique_combination_of_columns, ["zone", "subzone", "timestamp"])
    .pipe(
        plg.accepted_range, {"stock_value_min": (0, None), "stock_value_max": (0, None)}
    )
    .pipe(
        plg.custom_check,
        (
            pl.col.zone.is_in(["zone_28", "zone_29", "zone_17"])
            & (pl.col.subzone == "subzone_a")
        ).not_(),
    )
    .pipe(plg.not_constant, columns="timestamp", group_by=["zone", "subzone", "date"])
)
full_stocks_by_10min
```


I'll leave it here for now, because we have seen enough examples for you to get a reasonable idea of pelage's capabilities.
We saw checks that apply to columns, rows and even groups, and there are more features for time series, infinite values and more.

# Conclusion

To summarize this presentation, we have seen how to refactor some horrible, slow pandas code into good pandas code using method chaining.

We have seen that it is then fairly easy to translate into polars with a good equivalence test: you just write a `.pipe(pl.DataFrame)` and then you shift it up!

And then we discovered how many valuable insights we can derive from just a few simple data tests to build up our analysis.

As you can see, being able to express tests in the same manner as the code you are writing is a critical part of the TDD workflow.

But thankfully this part has already been done for you in pelage because, under the hood, besides a few things to process user inputs, all the core logic leverages polars.

The result is a relatively simple, easy-to-use package that lets you write, from the very start, the high-quality code that production requires.

All you have to do is: write a failing test, write some code to make it pass and, once the test passes, refactor to make it better.

# And Now?

- ### Go to the website: https://alixtc.github.io/pelage/
- ### Type `uv add pelage`
