# Robust Data Transformation with Pandas: Typing, Validation, Testing

*EuroPython 2023 Tutorial*

*by [Jakub Urban]() and [Jan Pipek]()*

## Abstract

We will explore ways to make data analyses and transformations in Pandas robust and production-ready. We will see how advanced group-by, resample, and rolling aggregations work on large time series of weather data. (As a bonus, you will learn about Prague's climate.) We will use type annotations and schema validation with the Pandera library to make our code more readable and robust. We will also show the potential of property-based testing with the Hypothesis package, using strategies generated from Pandera schemas. We will show how to avoid time zone issues when working with time series data. By the end of the tutorial, you will have a deeper understanding of advanced Pandas aggregations and be able to write robust, production-ready Pandas code.

## Motivation

* Pandas is a great tool for data analysis and transformation.
* It may not be obvious how to write robust, production-ready code with it.
  * This is mainly because Pandas is very permissive and flexible.
* There are lesser-known features and libraries that help make Pandas code more robust and more maintainable.
* Time series data offer a spectrum of challenges that we will explore.
  * Time zones can be tricky.
  * Time series aggregations are not always straightforward.

## What you will learn

* Structuring Pandas code as reusable, testable and production-ready functions.
* Using [Pandera](https://pandera.readthedocs.io/en/stable/) for data validation.
* Using [Hypothesis](https://hypothesis.readthedocs.io/en/latest/) for property-based testing.
* Checking type annotations with [mypy](https://mypy.readthedocs.io/en/stable/).
* Safe time zone handling in Pandas.
* Advanced time series resampling and aggregations.
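
To give a flavour of the last two items, here is a minimal sketch of a time-zone-aware daily aggregation. It is not taken from the tutorial materials: the function name `daily_mean_temperature`, the `temperature` column, and the synthetic data are assumptions made purely for illustration.

```python
import pandas as pd


def daily_mean_temperature(df: pd.DataFrame, tz: str = "Europe/Prague") -> pd.Series:
    """Return the mean temperature per calendar day, with days defined in `tz`.

    Hypothetical helper: assumes a time-zone-aware DatetimeIndex and a
    numeric column named "temperature".
    """
    if not isinstance(df.index, pd.DatetimeIndex) or df.index.tz is None:
        # Refuse naive timestamps instead of silently guessing their zone.
        raise ValueError("expected a time-zone-aware DatetimeIndex")
    # Convert to local time first so that daily bins start at local midnight.
    local = df.tz_convert(tz)
    return local["temperature"].resample("D").mean()


# Synthetic hourly readings stored in UTC (illustrative values only).
index = pd.date_range(
    "2023-07-01 00:00", periods=48, freq=pd.Timedelta(hours=1), tz="UTC"
)
readings = pd.DataFrame({"temperature": [float(i) for i in range(48)]}, index=index)

daily = daily_mean_temperature(readings)
print(daily)
```

Because Prague is two hours ahead of UTC in July, the 48 UTC hours fall on three local calendar days rather than two. Rejecting time-zone-naive input up front is one possible design choice; it keeps the function from silently misinterpreting timestamps.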