# Polars: the key concepts
 
This notebook introduces some of the key concepts that make Polars a powerful data analysis tool. 

The key concepts we meet are:
- fast flexible analysis with the Expression API in Polars
- easy parallel computations
- automatic query optimisation in lazy mode
- streaming to work with larger-than-memory datasets in Polars

## Importing Polars
We begin by importing Polars as `pl`. Following this convention will allow you to work with examples from the official documentation.

In [None]:
import polars as pl

We want to restrict how many rows of a `DataFrame` are printed out to the screen. Polars allows us to control configuration using options in `pl.Config`.

## Setting configuration options
We can adjust default configuration options with `pl.Config`.

In this notebook we want Polars to print 6 rows of a `DataFrame`, so we use `pl.Config.set_tbl_rows`

In [None]:
pl.Config.set_tbl_rows(6)

You can see the full range of configuration options here: https://pola-rs.github.io/polars/py-polars/html/reference/config.html

In the course we see how to apply the right configuration options in a range of contexts.

## Input data
Polars can read from a wide range of data formats including CSV, Parquet, Arrow, JSON, Excel and database connections. We cover all of these in the course.

For this introduction we use a CSV with the Titanic passenger dataset. This dataset gives details of all the passengers on the Titanic and whether they survived.

We begin by setting the path to this CSV

In [None]:
csvFile = "../data/titanic.csv"

We read the CSV into a Polars `DataFrame` with the `read_csv` function. 

We then call `head` to print out the first few rows of the `DataFrame`

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

Each row of the `DataFrame` has details about a passenger on the Titanic including the class they travelled in (`Pclass`), their name (`Name`) and `Age`.

## Expressions
You can use square brackets to select rows and columns in Polars...

In [None]:
df[:3,["Pclass","Name","Age"]]

...but using this square bracket approach means that you don't get all the benefits of parallelisation and query optimisation.

To really take advantage of Polars we use the Expression API.

### Selecting and transforming columns with the Expression API

We see a simple example of the Expression API here where we select the `Pclass`, `Name` and `Age` columns inside a `select` statement

In [None]:
(
    df
    .select(
        [
            pl.col("Pclass"),
            pl.col("Name"),
            pl.col("Age"),
        ]
    )
)

In the Expression API we use `pl.col` to refer to a column.

However, the Expression API allows us not only to refer to a column but also to transform it.

In this example we select the same three columns, but this time we convert each name to lowercase and we round the age to 2 decimal places

In [None]:
(
    df
    .select(
        [
            pl.col("Pclass"),
            pl.col("Name").str.to_lowercase(),
            pl.col("Age").round(2)
        ]
    )
)

When we have multiple expressions like this Polars runs them in parallel.

We can chain expressions together to do more complicated transformations on a column. 

In this example we first split the `Name` column into words separated by a space to get a list of words. In the same expression we then count the length of this list. We add the result as a new column by giving it an `alias` at the end of the expression

In [None]:
(
    df
    .select(
        [
            pl.col("Pclass"),
            pl.col("Name"),
            pl.col("Name").str.split(" ").arr.lengths().alias("Name_word_count"),
        ]
    )
)

We look at expressions in detail throughout the course to find the right expression for many different scenarios.

### Filtering a `DataFrame` with the Expression API

We filter a `DataFrame` by applying a condition to an expression.

In this example we find all the passengers over 70 years of age

In [None]:
(
    df
    .filter(
        pl.col("Age") > 70
    )
)

The Expression API is not limited to selecting and filtering. It is at the heart of all data transformations in Polars, as we see below.

## Analytics
Polars has a wide range of functionality for analysing data. In the course we look at a wider range of analytic methods and how we can use expressions to write more complicated analysis in a concise way.

We begin by getting an overview of the `DataFrame` with `describe`

In [None]:
df.describe()

The output of `describe` shows us how many records there are, how many `null` values and some key statistics.

### Value counts on a column
We use `value_counts` to count occurrences of values in a column.

In this example we count how many passengers there are in each class with `value_counts`

In [None]:
df["Pclass"].value_counts()

### Groupby and aggregations
Polars has a fast parallel algorithm for `groupby` operations. 

Here we first group by the `Survived` and the `Pclass` columns. We then aggregate in `agg` by counting the number of passengers in each group

In [None]:
(
    df
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
)

We use the Expression API for each aggregation in `agg`.

Groupby operations in Polars are fast because Polars has a parallel algorithm for getting the groupby keys. Aggregations are also fast because Polars runs multiple expressions in `agg` in parallel.

### Window operations
Window operations occur when we want to add a column that reflects not just data from that row but from a related group of rows. Windows occur in many contexts including rolling or temporal statistics and Polars covers these use cases.

Another example of a window operation is when we want the percentage breakdown within a group. We use the `over` expression for this.

For example, here we use `over` to calculate what percentage of passengers in each class survived 

In [None]:
survivedPercentageDf = (
    df
    # Groupby Survived and Pclass
    .groupby(["Survived","Pclass"])
    # Count the number of passengers in each group
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
    # Divide the number of passengers in each group by the total passengers in each class
    .with_column(
        100*(
            pl.col("counts")/pl.col("counts").sum().over("Pclass")
        )
        .alias("% Survived")
    )
    # Sort the output
    .sort(["Pclass","Survived"],reverse=True)
)
survivedPercentageDf

### Visualisation

We can use popular plotting libraries like Plotly or Matplotlib with Polars.

In this example we create a grouped bar chart of survival by class in Plotly

In [None]:
import plotly.express as px
fig = px.histogram(
          x = survivedPercentageDf["Pclass"], 
          y = survivedPercentageDf["% Survived"],
          color = survivedPercentageDf['Survived'], 
          barmode = 'group',
          title="% survival by class",
          labels = {
              "x":"Passenger class",
              "y":"% Survived",
              "color":"Survived"
          },
          height=400
)
fig.show()

## Lazy mode and query optimisation
In the examples above we work in eager mode. In eager mode Polars runs each part of a query step-by-step.

Polars has a powerful feature called lazy mode. In this mode Polars looks at a query as a whole to make a query graph. Before running the query Polars passes the query graph through its query optimiser to see if there are ways to make the query faster.

When working with a CSV we can switch from eager mode to lazy mode by replacing `read_csv` with `scan_csv`

In [None]:
(
    pl.scan_csv(csvFile)
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
)

The output of a lazy query is a `LazyFrame`, and we see the unoptimised query plan when we output a `LazyFrame`.

### Query optimiser
We can see the optimised query plan that Polars will actually run by adding `describe_optimized_plan` at the end of the query

In [None]:
print(
    pl.scan_csv(csvFile)
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
    .describe_optimized_plan()
)

In this example Polars has identified an optimisation:
```python
PROJECT 3/12 COLUMNS
```
There are 12 columns in the CSV, but the query optimiser sees that only 3 of these columns are required for the query. When the query is evaluated Polars will `PROJECT` 3 out of 12 columns: Polars will only read the 3 required columns from the CSV. This projection saves memory and computation time.

A different optimisation happens when we apply a `filter` to a query. In this case we want the same analysis of survival by class but only for passengers over 50

In [None]:
print(
    pl.scan_csv(csvFile)
    .filter(pl.col("Age") > 50)
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
    .describe_optimized_plan()
)

In this example the query optimiser has seen that:
- 4 out of 12 columns are now required `PROJECT 4/12 COLUMNS` and
- only passengers over 50 should be selected `SELECTION: Some([(col("Age")) > (50f64)])`

These optimisations are applied as Polars reads the CSV file, so the whole dataset does not need to be read into memory.

### Query evaluation

To evaluate the full query and output a `DataFrame` we call `collect` 

In [None]:
(
    pl.scan_csv(csvFile)
    .filter(pl.col("Age") > 50)
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
    .collect()
)

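During development with a large dataset it may be better to evaluate the query on only a small number of rows. We can do this by replacing `collect` with `fetch`. `fetch(3)` limits the number of rows read from the source rather than the number of rows in the output, so the result can have fewer than 3 rows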

In [None]:
(
    pl.scan_csv(csvFile)
    .filter(pl.col("Age") > 50)
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
    .fetch(3)
)

## Streaming larger-than-memory datasets
By default Polars reads your full dataset into memory when evaluating a lazy query. However, if your dataset is too large to fit into memory Polars can run many operations in *streaming* mode. With streaming Polars processes your query in batches rather than all at once.

To enable streaming we pass the `allow_streaming = True` argument to `collect`

In [None]:
(
    pl.scan_csv(csvFile)
    .filter(pl.col("Age") > 50)
    .groupby(["Survived","Pclass"])
    .agg(
        pl.col("PassengerId").count().alias("counts")
    )
    .collect(allow_streaming = True)
)

In the course we look at how you can write your own streaming algorithms and we see how streaming affects different queries.

## Summary
This notebook has been a quick overview of the key ideas that make Polars a powerful data analysis tool:
- expressions allow us to write complex transformations concisely and run them in parallel
- lazy mode allows Polars to apply query optimisations that reduce memory usage and computation time
- streaming lets us process larger-than-memory datasets with Polars