![Image Description](img/polars-101/Slide1.png)


# Getting started with Polars
 
To help you get started this notebook introduces some of the key concepts that make Polars a powerful data analysis tool.

The key concepts we meet are:
- fast flexible analysis with the `Expression API` in Polars
- easy `parallel` computations
- `automatic query optimisation` in `lazy mode`
- `streaming` to work with `larger-than-memory` datasets in Polars

## Importing Polars
We begin by importing Polars as `pl`. Following this convention makes it easier to work with examples from the official documentation.

In [2]:
import polars as pl

We want to restrict how many rows of a `DataFrame` are printed out to the screen. Polars allows us to control configuration using options in `pl.Config`.

## Setting configuration options
We can adjust default configuration options with `pl.Config`.

In this notebook we want Polars to print at most 6 rows of a `DataFrame`, so we use `pl.Config.set_tbl_rows`.

In [3]:
pl.Config.set_tbl_rows(6)

polars.config.Config

You can see the full range of configuration options here: https://pola-rs.github.io/polars/py-polars/html/reference/config.html

In the workshop we see how to apply the right configuration options in a range of contexts.

## Input data
Polars can read from a wide range of data formats including CSV, Parquet, Arrow, JSON, Excel and database connections.

To get started we will use a sample Titanic passenger dataset stored as a CSV file. This dataset gives details of passengers on the Titanic and whether they survived.

We begin by setting the path to this CSV.

In [3]:
csvFile = "../data/titanic.csv"

We read the CSV into a Polars `DataFrame` with the `read_csv` function. 

We then call `head` to print out the first few rows of the `DataFrame`

In [4]:
df = pl.read_csv(csvFile)
df.head(5)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs.…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. Wil…","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


Each row of the `DataFrame` has details about a passenger on the Titanic including the class they travelled in (`Pclass`), their name (`Name`) and `Age`.

Alternatively we can use `glimpse` to get a dense preview of the `DataFrame` with the first data points arranged vertically. This is handy for wide dataframes:

- The output has one line per column, so wide dataframes display nicely.
- Each line shows the column name, the data type and the first few values.

In [5]:
df.glimpse()

Rows: 891
Columns: 12
$ PassengerId <i64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ Survived    <i64> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1
$ Pclass      <i64> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2
$ Name        <str> Braund, Mr. Owen Harris, Cumings, Mrs. John Bradley (Florence Briggs Thayer), Heikkinen, Miss. Laina, Futrelle, Mrs. Jacques Heath (Lily May Peel), Allen, Mr. William Henry, Moran, Mr. James, McCarthy, Mr. Timothy J, Palsson, Master. Gosta Leonard, Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg), Nasser, Mrs. Nicholas (Adele Achem)
$ Sex         <str> male, female, female, female, male, male, male, male, female, female
$ Age         <f64> 22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0, 2.0, 27.0, 14.0
$ SibSp       <i64> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1
$ Parch       <i64> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0
$ Ticket      <str> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 373450, 330877, 17463, 349909, 347742, 237736
$ Fare        <f64> 7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708


## Expressions
You can use square brackets to select rows and columns in Polars...

In [6]:
df[:3,["Pclass","Name","Age"]]

Pclass,Name,Age
i64,str,f64
3,"""Braund, Mr. Ow…",22.0
1,"""Cumings, Mrs. …",38.0
3,"""Heikkinen, Mis…",26.0


...but using this square bracket approach means that you don't get all the benefits of parallelisation and query optimisation.

To really take advantage of Polars we use the Expression API.

### Selecting and transforming columns with the Expression API

We see a simple example of the Expression API here where we select the `Pclass`, `Name` and `Age` columns inside a `select` statement

In [7]:
(
    df
    .select(
        [
            pl.col("Pclass"),
            pl.col("Name"),
            pl.col("Age"),
        ]
    )
)

Pclass,Name,Age
i64,str,f64
3,"""Braund, Mr. Ow…",22.0
1,"""Cumings, Mrs. …",38.0
3,"""Heikkinen, Mis…",26.0
…,…,…
3,"""Johnston, Miss…",
1,"""Behr, Mr. Karl…",26.0
3,"""Dooley, Mr. Pa…",32.0


In the Expression API we use `pl.col` to refer to a column.

**However, the Expression API allows us not only to refer to a column but also to transform it.**

In this example we select the same three columns, but this time we:

- keep `Pclass` as it is
- convert `Name` to lower case
- round `Age` to 2 decimal places

In [8]:
(
    df
    .select(
        [
            pl.col("Pclass"),
            pl.col("Name").str.to_lowercase(),
            pl.col("Age").round(2)
        ]
    )
)

Pclass,Name,Age
i64,str,f64
3,"""braund, mr. ow…",22.0
1,"""cumings, mrs. …",38.0
3,"""heikkinen, mis…",26.0
…,…,…
3,"""johnston, miss…",
1,"""behr, mr. karl…",26.0
3,"""dooley, mr. pa…",32.0


**When we have multiple expressions like this Polars runs them in parallel.**

We can chain expressions together to do more complicated transformations on a column. 

In this example:

- we first split the `Name` column on spaces to get a list of words
- in the same expression we then count the length of this list
- we add the result as a new column by giving it an `alias` at the end of the expression

In [9]:
(
    df
    .select(
        [
            pl.col("Pclass"),
            pl.col("Name"),
            pl.col("Name").str.split(" ").alias("Name_split"),
            pl.col("Name").str.split(" ").arr.lengths().alias("Name_word_count"),
        ]
    )
)

Pclass,Name,Name_split,Name_word_count
i64,str,list[str],u32
3,"""Braund, Mr. Ow…","[""Braund,"", ""Mr."", … ""Harris""]",4
1,"""Cumings, Mrs. …","[""Cumings,"", ""Mrs."", … ""Thayer)""]",7
3,"""Heikkinen, Mis…","[""Heikkinen,"", ""Miss."", ""Laina""]",3
…,…,…,…
3,"""Johnston, Miss…","[""Johnston,"", ""Miss."", … """"Carrie""""]",5
1,"""Behr, Mr. Karl…","[""Behr,"", ""Mr."", … ""Howell""]",4
3,"""Dooley, Mr. Pa…","[""Dooley,"", ""Mr."", ""Patrick""]",3


We look at expressions in detail throughout the workshop to find the right expression for many different scenarios.

### Filtering a `DataFrame` with the Expression API

We filter a `DataFrame` by applying a condition to an expression.

In this example we find all the passengers over 70 years of age

In [10]:
(
    df
    .filter(
        pl.col("Age") > 70
    )
)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
97,0,1,"""Goldschmidt, M…","""male""",71.0,0,0,"""PC 17754""",34.6542,"""A5""","""C"""
117,0,3,"""Connors, Mr. P…","""male""",70.5,0,0,"""370369""",7.75,,"""Q"""
494,0,1,"""Artagaveytia, …","""male""",71.0,0,0,"""PC 17609""",49.5042,,"""C"""
631,1,1,"""Barkworth, Mr.…","""male""",80.0,0,0,"""27042""",30.0,"""A23""","""S"""
852,0,3,"""Svensson, Mr. …","""male""",74.0,0,0,"""347060""",7.775,,"""S"""


The Expression API is not limited to these operations. It is at the heart of all data transformations in Polars, as we see below.

## Analytics
Polars has a wide range of functionality for analysing data. In the workshop we look at a wider range of analytic methods and how we can use expressions to write more complicated analysis in a concise way.

We begin by getting an overview of the `DataFrame` with `describe`

In [11]:
df.describe()

describe,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
str,f64,f64,f64,str,str,f64,f64,f64,str,f64,str,str
"""count""",891.0,891.0,891.0,"""891""","""891""",891.0,891.0,891.0,"""891""",891.0,"""891""","""891"""
"""null_count""",0.0,0.0,0.0,"""0""","""0""",177.0,0.0,0.0,"""0""",0.0,"""687""","""2"""
"""mean""",446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
…,…,…,…,…,…,…,…,…,…,…,…,…
"""median""",446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
"""25%""",223.0,0.0,2.0,,,20.0,0.0,0.0,,7.8958,,
"""75%""",669.0,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


The output of `describe` shows us how many records there are, how many `null` values and some key statistics.

### Value counts on a column
We use `value_counts` to count occurrences of values in a column.

In this example we count how many passengers there are in each class with `value_counts`

In [12]:
df["Pclass"].value_counts()

Pclass,counts
i64,u32
2,184
3,491
1,216


### Groupby and aggregations
Polars has a fast parallel algorithm for `groupby` operations. 

Here we first group by the `Survived` and the `Pclass` columns. We then aggregate in `agg` by counting the number of passengers in each group

In [13]:
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""
…,…,…,…,…,…,…,…,…,…,…,…
889,0,3,"""Johnston, Miss…","""female""",,1,2,"""W./C. 6607""",23.45,,"""S"""
890,1,1,"""Behr, Mr. Karl…","""male""",26.0,0,0,"""111369""",30.0,"""C148""","""C"""
891,0,3,"""Dooley, Mr. Pa…","""male""",32.0,0,0,"""370376""",7.75,,"""Q"""


In [14]:
(
    df
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
)

Survived,Pclass,counts
i64,i64,u32
1,2,87
0,2,97
1,3,119
0,1,80
0,3,372
1,1,136


We use the Expression API for each aggregation in `agg`.

Groupby operations in Polars are fast because Polars has a parallel algorithm for getting the groupby keys. Aggregations are also fast because Polars runs multiple expressions in `agg` in parallel.

### Window operations
Window operations occur when we want to add a column that reflects not just data from that row but from a related group of rows. Windows occur in many contexts including rolling or temporal statistics and Polars covers these use cases.

Another example of a window operation is when we want the percentage breakdown within a group. We use the `over` expression for this.

For example, here we use `over` to calculate what percentage of passengers in each class survived 

In [15]:
survivedPercentageDf = (
    df
    # Groupby Survived and Pclass
    .groupby(["Survived","Pclass"])
    # Count the number of passengers in each group
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
    # Divide the number of passengers in each group by the total passengers in each class
    .with_columns(
        (100*(
            pl.col("counts")/pl.col("counts").sum().over("Pclass")
        )
        )
        .alias("% Survived")
    )
    # Sort the output
    .sort(["Pclass","Survived"],descending=True)
)

survivedPercentageDf

Survived,Pclass,counts,% Survived
i64,i64,u32,f64
1,3,119,24.236253
0,3,372,75.763747
1,2,87,47.282609
0,2,97,52.717391
1,1,136,62.962963
0,1,80,37.037037


## Lazy mode and query optimisation
In the examples above we work in eager mode. In eager mode Polars runs each part of a query step-by-step.

Polars has a powerful feature called lazy mode. In this mode Polars looks at a query as a whole to make a query graph. Before running the query Polars passes the query graph through its query optimiser to see if there are ways to make the query faster.

When working with a CSV we can switch from eager mode to lazy mode by replacing `read_csv` with `scan_csv`

In [17]:
(
    pl.scan_csv(csvFile)
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
)

The output of a lazy query is a `LazyFrame`, and we see the unoptimised query plan when we output a `LazyFrame`.

### Query optimiser
We can see the optimised query plan that Polars will actually run by adding `explain` at the end of the query

In [18]:
print(
    pl.scan_csv(csvFile)
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
    .explain()
)

AGGREGATE
	[col("PassengerId").count().alias("counts")] BY [col("Survived"), col("Pclass")] FROM
	
  CSV SCAN ../data/titanic.csv
  PROJECT 3/12 COLUMNS


In this example Polars has identified an optimisation:
```
PROJECT 3/12 COLUMNS
```
There are 12 columns in the CSV, but the query optimiser sees that only 3 of these columns are required for the query. When the query is evaluated Polars will `PROJECT` 3 out of 12 columns: Polars will only read the 3 required columns from the CSV. This projection saves memory and computation time.

A different optimisation happens when we apply a `filter` to a query. In this case we want the same analysis of survival by class but only for passengers over 50

In [19]:
print(
    pl.scan_csv(csvFile)
    .filter(pl.col("Age") > 50)
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
    .explain()
)

AGGREGATE
	[col("PassengerId").count().alias("counts")] BY [col("Survived"), col("Pclass")] FROM
	
  CSV SCAN ../data/titanic.csv
  PROJECT 4/12 COLUMNS
  SELECTION: [(col("Age")) > (50.0)]


In this example the query optimiser has seen that:
- 4 out of 12 columns are now required `PROJECT 4/12 COLUMNS` and
- only passengers over 50 should be selected `SELECTION: [(col("Age")) > (50.0)]`

These optimisations are applied as Polars reads the CSV file, so the whole dataset need not be read into memory.

### Query evaluation

To evaluate the full query and output a `DataFrame` we call `collect` 

In [None]:
print(
    pl.scan_csv(csvFile)
    .filter(pl.col("Age") > 50)
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
    .collect()
)

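During development with a large dataset it may be better to limit evaluation to a small number of rows. We can do this by replacing `collect` with `fetch`. Note that `fetch` limits the number of rows read from the source rather than guaranteeing the number of rows in the output, so the results (such as the counts below) reflect only that subset.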

In [20]:
print(
    pl.scan_csv(csvFile)
    .filter(pl.col("Age") > 50)
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
    .fetch(3)
)

shape: (3, 3)
┌──────────┬────────┬────────┐
│ Survived ┆ Pclass ┆ counts │
│ ---      ┆ ---    ┆ ---    │
│ i64      ┆ i64    ┆ u32    │
╞══════════╪════════╪════════╡
│ 1        ┆ 2      ┆ 1      │
│ 0        ┆ 1      ┆ 1      │
│ 1        ┆ 1      ┆ 1      │
└──────────┴────────┴────────┘


## Streaming larger-than-memory datasets
By default Polars reads your full dataset into memory when evaluating a lazy query. However, if your dataset is too large to fit into memory Polars can run many operations in *streaming* mode. With streaming Polars processes your query in batches rather than all at once.

To enable streaming we pass the `streaming = True` argument to `collect`

In [21]:
(
    pl.scan_csv(csvFile)
    .filter(pl.col("Age") > 50)
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
    .collect(streaming = True)
)

Survived,Pclass,counts
i64,i64,u32
1,2,3
0,2,12
1,3,1
0,1,21
1,1,18
0,3,9


In the workshop we look at how you can write your own streaming algorithms and we see how streaming affects different queries.

## Summary
This notebook has been a quick overview of the key ideas that make Polars a powerful data analysis tool:
- expressions allow us to write complex transformations concisely and run them in parallel
- lazy mode allows Polars to apply query optimisations that reduce memory usage and computation time
- streaming lets us process larger-than-memory datasets with Polars