# Code examples for Pandas vs Polars Talk
## Arturo Regalado
### 10 May 2023

Dataset for benchmarks and main examples: [Reddit Sarcasm Posts](https://www.kaggle.com/datasets/danofer/sarcasm)

Size of data: 1 million rows by 10 columns

In [23]:
# Imports
import pandas as pd
import polars as pl
data_url = r"C:\Users\artre\Downloads\archive (1)\train-balanced-sarcasm.csv"

## Loading comparisons

In [24]:
%%timeit -r 10
# Pandas eager
df_pandas = pd.read_csv(data_url, sep=',', on_bad_lines='skip', low_memory=False)

4.35 s ± 132 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)


In [25]:
%%timeit -r 10
# Polars eager execution
df_polars = pl.read_csv(data_url)

254 ms ± 10.1 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)


In [26]:
%%timeit -r 10
# Polars lazy execution

df_polars = (
    pl.scan_csv(data_url)
    .collect()
)

250 ms ± 7.06 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)


In [27]:
%%timeit -r 10
# Polars with lazy execution and streaming. Useful when the data is too big to hold in memory
df_polars = (
    pl.scan_csv(data_url)
    .collect(streaming=True)
)

145 ms ± 2.65 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)


In [28]:
%%timeit -r 10
# Polars lazy execution with fetch. Reads only the first n_rows, so it is useful when debugging.

df_polars = (
    pl.scan_csv(data_url)
    .fetch(n_rows=100_000)
)

41.5 ms ± 1.4 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
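
The `%%timeit -r 10` cells above only run inside IPython or Jupyter. As a rough equivalent for a plain Python script, the sketch below uses the standard-library `timeit` module with 10 runs of one call each. The `load` function is a hypothetical stand-in workload, not part of the original notebook; in practice it would wrap a call such as `pl.read_csv(data_url)`.

```python
import statistics
import timeit


def load():
    # Hypothetical stand-in workload (assumption); replace the body with
    # the loader being measured, e.g. pl.read_csv(data_url).
    return sum(range(100_000))


# repeat=10 mirrors "-r 10"; number=1 means one call per run.
runs = timeit.repeat(load, repeat=10, number=1)

mean_s = statistics.mean(runs)
sd_s = statistics.pstdev(runs)  # population standard deviation of the runs
print(f"{mean_s * 1e3:.3f} ms ± {sd_s * 1e3:.3f} ms per loop (10 runs, 1 loop each)")
```

Fixing `number=1` keeps every run to a single call, which suits slow operations such as reading a large CSV; for very fast operations a larger `number` gives steadier figures.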


## Processing data examples

In [30]:
%%timeit -r 10
# Pandas with method chaining syntax

df_pandas = (
    pd.read_csv(data_url, on_bad_lines='skip', low_memory=False)
    .query('subreddit == "politics"')
    .query('score > 0')
)

df_pandas.head()

4.47 s ± 82.8 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)


In [31]:
%%timeit -r 10
# Polars with eager execution
df_polars = (
    pl.read_csv(data_url)
    .filter(pl.col('subreddit') == 'politics')
    .filter(pl.col('score') > 0)
)

df_polars.head()

320 ms ± 22.2 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)


In [32]:
%%timeit -r 10
# Polars with lazy execution

q1 = (
    pl.scan_csv(data_url)
    .filter(pl.col('subreddit') == 'politics')
    .filter(pl.col('score') > 0)
    .collect()
)

183 ms ± 4.6 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)


## Summary of benchmarks

### Loading times
Test | Average | SD
---------|----------|---------
 Pandas | 4.35 s | 132 ms
 Polars Eager | 254 ms | 10.1 ms
 Polars Lazy | 250 ms | 7.06 ms

### Processing
Test | Average | SD
---------|----------|---------
 Pandas | 4.47 s | 82.8 ms
 Polars Eager | 320 ms | 22.2 ms
 Polars Lazy | 183 ms | 4.6 ms
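
The averages in the two tables can be turned into speedup factors relative to Pandas. The short sketch below does that arithmetic; the `speedups` helper is a hypothetical name introduced here, and the numbers are the mean timings from the tables converted to seconds.

```python
# Mean timings copied from the summary tables, converted to seconds.
loading = {"Pandas": 4.35, "Polars Eager": 0.254, "Polars Lazy": 0.250}
processing = {"Pandas": 4.47, "Polars Eager": 0.320, "Polars Lazy": 0.183}


def speedups(timings, baseline="Pandas"):
    """Return how many times faster each entry is than the baseline."""
    base = timings[baseline]
    return {
        name: round(base / seconds, 1)
        for name, seconds in timings.items()
        if name != baseline
    }


loading_speedups = speedups(loading)
processing_speedups = speedups(processing)
print("Loading:", loading_speedups)
print("Processing:", processing_speedups)
```

On these figures, Polars loads the file roughly 17 times faster than Pandas, and the lazy processing query is roughly 24 times faster. These ratios apply to this dataset and machine only.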