# Intro to Polars

## Outline

1. Introduction, basic Polars DataFrame API, differences with pandas, simple I/O
2. The expression system, filtering rows, transforming columns
3. Working with semi-structured data, integration with Pydantic
4. Cloud storage, out-of-core processing with the new streaming engine

## 1. Introduction

### What is Polars?

https://pola.rs/

![Polars](static/polars_github_banner.svg)

> Polars is an open-source library for data manipulation, known for being one of the fastest data processing solutions on a single machine. It features a well-structured, typed API that is both expressive and easy to use.

In summary:

- Expressive API (familiar to R and Spark users, yet still approachable for pandas users)
- Fast (thanks to its Rust core)
- Support for zero-copy Apache Arrow data
- Out-of-core processing (with its new streaming engine)

### Dataset

We will play with [data from Stack Overflow](https://www.kaggle.com/datasets/stackoverflow/stacksample):

![Stack Overflow data sample](static/stacksample.png)

In [36]:
# Uncomment to generate a sample of the dataset
#
# import polars as pl
#
# pl.scan_csv("data/Questions.csv", encoding="utf8-lossy").collect(engine="streaming").sample(fraction=0.01).write_parquet("data/questions-sample.parquet")
# pl.scan_csv("data/Tags.csv").collect(engine="streaming").write_parquet("data/tags.parquet")

In [37]:
import polars as pl

In [38]:
df = pl.read_parquet("data/questions-sample.parquet")
df.head()

Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
i64,str,str,str,i64,str,str
8869230,"""1109161""","""2012-01-15T11:11:05Z""","""NA""",1,"""How to display two or more mar…","""<p>I have a question about the…"
31915780,"""4021972""","""2015-08-10T09:15:43Z""","""NA""",-2,"""AngularJS: $http.get 405 (Meth…","""<p>when I get a request from a…"
30964930,"""2786156""","""2015-06-21T13:16:50Z""","""NA""",0,"""Invariant parameters in Java""","""<p>I'm reading Bloch's Effecti…"
38084100,"""2938167""","""2016-06-28T18:51:24Z""","""NA""",0,"""How to make Excel macro splitt…","""<p>For splitting a file into s…"
9601400,"""346977""","""2012-03-07T12:19:37Z""","""NA""",2,"""Rails Custom Validators: Testi…","""<p>I'm trying to write up a ra…"


In [39]:
df_tags = pl.read_parquet("data/tags.parquet")
df_tags.head()

Id,Tag
i64,str
80,"""flex"""
80,"""actionscript-3"""
80,"""air"""
90,"""svn"""
90,"""tortoisesvn"""


In [40]:
len(df), len(df_tags)

(12642, 3750994)

In [41]:
print(f"Estimated size in memory (questions sample): {df.estimated_size() >> 20} MiB")

Estimated size in memory (questions sample): 18 MiB


In [42]:
!du -h data/{questions*,tags}.parquet

6.1M	data/questions-sample.parquet
 14M	data/tags.parquet


In [43]:
print(df.head(3))

shape: (3, 7)
┌──────────┬─────────────┬─────────────────┬────────────┬───────┬─────────────────┬────────────────┐
│ Id       ┆ OwnerUserId ┆ CreationDate    ┆ ClosedDate ┆ Score ┆ Title           ┆ Body           │
│ ---      ┆ ---         ┆ ---             ┆ ---        ┆ ---   ┆ ---             ┆ ---            │
│ i64      ┆ str         ┆ str             ┆ str        ┆ i64   ┆ str             ┆ str            │
╞══════════╪═════════════╪═════════════════╪════════════╪═══════╪═════════════════╪════════════════╡
│ 8869230  ┆ 1109161     ┆ 2012-01-15T11:1 ┆ NA         ┆ 1     ┆ How to display  ┆ <p>I have a    │
│          ┆             ┆ 1:05Z           ┆            ┆       ┆ two or more     ┆ question about │
│          ┆             ┆                 ┆            ┆       ┆ mar…            ┆ the…           │
│ 31915780 ┆ 4021972     ┆ 2015-08-10T09:1 ┆ NA         ┆ -2    ┆ AngularJS:      ┆ <p>when I get  │
│          ┆             ┆ 5:43Z           ┆            ┆       ┆ $http.get 4

In [44]:
df.describe()

statistic,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
str,f64,str,str,str,f64,str,str
"""count""",12642.0,"""12642""","""12642""","""12642""",12642.0,"""12642""","""12642"""
"""null_count""",0.0,"""0""","""0""","""0""",0.0,"""0""","""0"""
"""mean""",21342000.0,,,,1.730264,,
"""std""",11516000.0,,,,10.770727,,
"""min""",2750.0,"""1000030""","""2008-08-05T19:51:29Z""","""2010-08-14T20:49:33Z""",-9.0,""" Cached hidden input value is …","""<blockquote>  <p><strong>Poss…"
"""25%""",11379210.0,,,,0.0,,
"""50%""",21879410.0,,,,0.0,,
"""75%""",31554180.0,,,,1.0,,
"""max""",40140000.0,"""NA""","""2016-10-19T19:29:06Z""","""NA""",582.0,"""zoom property, for regular vid…","""<ul> <li>I have a task which c…"


In [45]:
df_tags["Tag"].value_counts().sort("count", descending=True).head(10)

Tag,count
str,u32
"""javascript""",124155
"""java""",115212
"""c#""",101186
"""php""",98808
"""android""",90659
"""jquery""",78542
"""python""",64601
"""html""",58976
"""c++""",47591
"""ios""",47009


In [46]:
_s = df["Title"]
print(type(_s))

_s.head(3)

<class 'polars.series.series.Series'>


Title
str
"""How to display two or more mar…"
"""AngularJS: $http.get 405 (Meth…"
"""Invariant parameters in Java"""


In [47]:
df.dtypes

[Int64, String, String, String, Int64, String, String]

### Differences with pandas

One notable difference with pandas is that Polars DataFrames don't have an index. [This is what the documentation used to say](https://web.archive.org/web/20220206194551/https://pola-rs.github.io/polars-book/user-guide/coming_from_pandas.html#no-index):

> ### No index
>
> They are not needed. Not having them makes things easier. Convince me otherwise

[Since pandas 2.0 it is possible to use PyArrow as a backend](https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#argument-dtype-backend-to-return-pyarrow-backed-or-numpy-backed-nullable-dtypes), so the performance gap has narrowed somewhat. Still, the considerations about the API and the lazy capabilities stand.

Some advice on how to migrate from pandas can be found at https://docs.pola.rs/user-guide/migration/pandas/

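Here is a trivial I/O microbenchmark on the full dataset: read the CSV and write it back out. Each cell is timed with a single run, so treat the numbers as indicative only. Note that pandas also writes its index as an extra column by default, so it does slightly more work: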

In [48]:
import pandas as pd

In [50]:
%%timeit -n1 -r1
pd.read_csv("data/Questions.csv", encoding_errors="replace").to_csv("/tmp/questions-throwaway.csv")

38.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [51]:
%%timeit -n1 -r1
pl.read_csv("data/Questions.csv", encoding="utf8-lossy").write_csv("/tmp/questions-throwaway.csv")

3.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


## 2. The expression system, filtering rows, transforming columns

### Exercise

What are the most upvoted Python questions?

## 3. Working with semi-structured data, integration with Pydantic

### Exercise

## 4. Cloud storage, out-of-core processing with the new streaming engine