<a target="_blank" href="https://colab.research.google.com/github/bettercodepaul/data2day_2023_polars/blob/main/data2day_2023_Polars_Teil_1.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Polars: The Turbo Boost for Dataframes

In this notebook, we'll get to know Polars, a very fast dataframe library and in-memory query engine. It features parallel execution, cache-efficient algorithms, and an expressive API, which makes it well suited to efficient querying and transformation of data.

Polars is written in Rust, uses Apache Arrow's column-oriented format, and has a Python API.

More information is available here:

- Homepage of Polars: https://www.pola.rs/
- User Guide: https://pola-rs.github.io/polars/user-guide/
- API Reference: https://pola-rs.github.io/polars/py-polars/html/reference/

## Installation + Set-Up

In [None]:
import urllib.request
import os.path

In [None]:
# load requirements.txt with required libraries
REQUIREMENTS_URL = "https://github.com/bettercodepaul/data2day_2023_polars/raw/main/requirements.txt"
urllib.request.urlretrieve(REQUIREMENTS_URL, os.path.basename(REQUIREMENTS_URL))

In [None]:
# don't forget that you might need to restart the kernel
!pip install -qr requirements.txt

In [None]:
# import polars
import polars as pl

In [None]:
# output up to 60 characters per column and do not abbreviate floating point numbers
pl.Config(fmt_str_lengths=60, fmt_float="full")

In [None]:
# download CSV data
DATA_URL = "https://github.com/bettercodepaul/data2day_2023_polars/raw/main/spotify-charts-2017-2021-global-top200.csv.gz"
LOCAL_DATA_FILE_NAME = os.path.basename(DATA_URL)
urllib.request.urlretrieve(DATA_URL, LOCAL_DATA_FILE_NAME)

In [None]:
# download exercises and utility functions
EXERCISES_URL = "https://github.com/bettercodepaul/data2day_2023_polars/raw/main/data2day_exercises_en.py"
urllib.request.urlretrieve(EXERCISES_URL, os.path.basename(EXERCISES_URL))

In [None]:
# import exercises and utility functions
from data2day_exercises_en import *

## Data Loading

Polars supports different formats when loading data into a dataframe:

- CSV (`read_csv`, `read_csv_batched`)
- Apache Parquet (`read_parquet`)
- Databricks Delta (`read_delta`)
- SQL databases (`read_database`, `read_database_uri`)
- JSON (`read_json`, `read_ndjson`)
- Microsoft Excel (`read_excel`)
- OpenDocument Spreadsheet (`read_ods`)
- Apache Avro (`read_avro`)
- Apache Arrow IPC (`read_ipc`, `read_ipc_stream`)
- Apache Iceberg

First, we load a CSV file:

In [None]:
# load a CSV file
df = pl.read_csv("spotify-charts-2017-2021-global-top200.csv.gz")
df.head(2) # output the first 2 rows

The file contains the daily Spotify charts. The following information is included:

- `title`: title of the song
- `rank`: ranking in the charts
- `date`: day on which the chart was compiled
- `artist`: band or artists performing the song
- `url`: URL where the song can be heard on Spotify
- `region`: region or country for which the charts were collected
- `chart`: name or type of the charts
- `trend`: development of the song's ranking compared to the previous day
- `streams`: number of streams of the song on that day

In each column, under the column name, you can see the data type of the column. The date column was read in as a string (`str`); this can be corrected with the `try_parse_dates` option.

In [None]:
# Load data from CSV file and parse date columns
df = pl.read_csv("spotify-charts-2017-2021-global-top200.csv.gz", try_parse_dates=True)
df.head(2) # output the first 2 rows

## Projection (select columns)

If not all columns of a dataframe are needed, certain columns can be selected with the `select` method.

In [None]:
df.select("title", "artist", "url", "streams").head(2)

## Generalized projection (change or add columns)

With the help of expressions we can change columns or add new columns.

A column can be referenced with the `pl.col` function by specifying the column name.

In [None]:
df.select("title", "artist", pl.col("url"), pl.col("streams")).head(2)

In order to get smaller numbers, we can specify, for example, the number of streams in thousands.

In [None]:
# Division with the "/" operator converts to floating point
df.select(pl.col("title"), pl.col("artist"), pl.col("url"), pl.col("streams")/1000).head(2)

In [None]:
# Alternative with "floordiv", mixed notation with pure column name and pl.col is also possible
df.select("title", "artist", "url", pl.col("streams").floordiv(1000)).head(2)

In addition to standard operators such as `+`, `-`, `*` and `/`, a variety of expressions for calculations with numbers is available:

- https://pola-rs.github.io/polars/py-polars/html/reference/expressions/computation.html
- https://pola-rs.github.io/polars/py-polars/html/reference/expressions/operators.html

There are also many functions for manipulating strings. They are accessed via their own namespace, `str`.

- https://pola-rs.github.io/polars/py-polars/html/reference/expressions/string.html

A selection of commonly used functions for strings:

- `str.starts_with`, `str.ends_with`, `str.contains`
- `str.slice`
- `str.replace`
- `str.to_date`, `str.to_datetime`
- `str.split`
- `str.strip_chars`
- `str.n_chars`

In [None]:
df.select(pl.col("title").str.to_uppercase(), "artist", "url", "streams").head(2)

To avoid listing every column that is left unchanged, the `with_columns` method can be used.

In [None]:
# with_columns corresponds to select supplemented by all missing columns
df.with_columns(pl.col("title").str.to_uppercase()).head(2)

So far we have not added any columns. A new column will be created if we specify a name that does not exist yet. We can use the following methods for this:

- `alias` for a completely new name
- `prefix`/`suffix` to add a prefix/suffix to the existing name

In [None]:
# Extract trackId from the URL
df.select("title", "url").with_columns(pl.col("url").str.slice(len("https://open.spotify.com/track/")).alias("trackId"), pl.col("title").str.to_uppercase().suffix("_uppercase")).head(2)

If the entire query becomes too long, wrap it in parentheses and split it across several lines. This creates a typical "query pipeline" that can be read from top to bottom.

In [None]:
(df
  .select("title", "url")
  .with_columns(
    pl.col("url").str.slice(len("https://open.spotify.com/track/")).alias("trackId"),
    pl.col("title").str.to_uppercase().suffix("_uppercase")
  )
  .head(2)
)

We can also use aggregating functions like `min`, `max`, `sum`, `mean`, `median`, etc. in `select`, which returns an aggregated result. If we use a column more than once, we have to give each result a distinct name, either with `alias` or with `suffix`.

In [None]:
# Determine period for which data is available
df.select(pl.col("date").min().suffix("_min"), pl.col("date").max().suffix("_max"))

## Selection/Filter

A selection filters the dataframe down to specific rows.

For a quick overview the methods `head`, `tail` and `sample` can be used.

In [None]:
# the first two rows
df.head(2)

In [None]:
# the last two rows
df.tail(2)

In [None]:
# two random rows (absolute with parameter "n" or relative with parameter "fraction")
df.sample(n=2)
df.sample(fraction=2/len(df)) # 2/362182 ≈ 0.000006 is equivalent here to n=2

The rows with the largest or smallest values in a column can be selected with the methods `top_k` and `bottom_k`.

In [None]:
# the most streamed song on Spotify in one day worldwide: Easy On Me by Adele
df.top_k(1, by="streams")

We can listen to that one, too.

In [None]:
# plays a preview of the song with Spotify. If there are multiple songs in the dataframe, a row number can be specified.
play_song(df.top_k(1, by="streams"))

Rows can be selected precisely using the `filter` method and a Boolean expression. For example, we can select all records of a particular artist.

In [None]:
# two rows for the singer "Adele".
# eq stands for equals
df.filter(pl.col("artist").eq("Adele")).head(2)

An overview of important operators:
- Equal (`==`): `eq`
- Not Equal (`!=`): `ne`
- Greater Than (`>`, `>=`): `gt`, `ge`
- Less Than (`<`, `<=`): `lt`, `le`
- Between: `is_between`
- Equal to one of a set: `is_in`

Logical expressions can be linked with:
- conjunction/AND: `&`
- disjunction/OR: `|`
- exclusive disjunction/XOR: `^`
- negation/NOT: `~`

In [None]:
# two entries for the song "Easy On Me" by Adele with more than 3 million streams in one day
df.filter(pl.col("artist").eq("Adele") & pl.col("title").eq("Easy On Me") & pl.col("streams").gt(3_000_000)).head(2)

Instead of the methods `eq` and `gt`, it is also possible to use the standard Python operators `==` and `>`. In that case every logical subexpression has to be wrapped in parentheses, because `&` binds more tightly than the comparison operators in Python. Which style you prefer is ultimately a matter of taste 😁

In [None]:
df.filter(pl.col("artist").eq("Adele") & pl.col("title").eq("Easy On Me") & pl.col("streams").gt(3_000_000)).head(2)
df.filter((pl.col("artist") == "Adele") & (pl.col("title") == "Easy On Me") & (pl.col("streams") > 3_000_000)).head(2)

For a comparison with a specific date, the date can be generated with the function `pl.date`.

In [None]:
# two entries for May 1, 2017
df.filter(pl.col("date").eq(pl.date(2017, 5, 1))).head(2)

In [None]:
# the ranks 5 to 10 for July 19, 2018
df.filter(pl.col("date").eq(pl.date(2018, 7, 19)) & pl.col("rank").is_between(5, 10))

We can also plot daily streams or ranks with a helper function.

In [None]:
some_song_df = df.filter(pl.col("artist").eq("Juice WRLD") & pl.col("title").eq("Lucid Dreams"))

In [None]:
plot_streams(some_song_df)

In [None]:
plot_rank(some_song_df)

## Exercises on projection and selection

You can do the exercises right here in the notebook. For each exercise there is an object (`q1`, `q2`, `q3`, ...) that provides the question, a hint, an answer check, and the solution.

In [None]:
# The method "question" prints the question.
q0.question()

In [None]:
# Then there is always a cell with a hint in which variables the solution should be written.
# Feel free to create more cells to inspect your solution more closely.
awesome_company = ...

In [None]:
# the method "check" checks a solution
q0.check(awesome_company)

In [None]:
# the "hint" method displays a hint
q0.hint()

In [None]:
# the "solution" method prints the solution
q0.solution()

Now it's your turn with the real exercises!

### Question 1

In [None]:
q1.question()

In [None]:
q1_df = ...

In [None]:
q1.check(q1_df)
#q1.hint()
#q1.solution()

### Question 2

In [None]:
q2.question()

In [None]:
q2_df = ...

In [None]:
q2.check(q2_df)

### Question 3

In [None]:
q3.question()

In [None]:
q3_df = ...

In [None]:
q3.check(q3_df)

### Question 4

In [None]:
q4.question()

In [None]:
rank_1 = ...
rank_200 = ...

In [None]:
q4.check(rank_1, rank_200)

### Question 5

In [None]:
q5.question()

In [None]:
q5_df = ...

In [None]:
q5.check(q5_df)

### Question 6

In [None]:
q6.question()

In [None]:
q6_df = ...

In [None]:
q6.check(q6_df)

## Series

Normally we work on a dataframe. For the sake of completeness, however, it should be mentioned that individual columns have their own data type, `Series`. A column can be retrieved from a dataframe with the `get_column` method or the indexing operator `[]`.

In [None]:
df.head(2).get_column("title")

In [None]:
df.head(2)["title"]

In [None]:
type(df.get_column("title"))

## Data types

Polars supports many different data types for columns.

### Numbers and Boolean values

- `Int8`, `Int16`, `Int32`, `Int64`: integer number
- `Float32`, `Float64`: floating point number
- `UInt8`, `UInt16`, `UInt32`, `UInt64`: unsigned integer number
- `Decimal`: 128-bit fixed-point decimal number with high precision, experimental
- `Boolean`: logical/boolean value

Numbers are created in Polars as 64-bit data types unless otherwise stated.

A column can be converted to another data type with the `cast` method, e.g. to save memory.

In [None]:
# Default data type is Int64 or Float64 for numbers
df.select(pl.col("streams")).head(2)

In [None]:
# throws an error because some values are too large for Int16
try:
    df.select(pl.col("streams").cast(pl.Int16)).head(2)
except pl.ComputeError as e:
    print(e.args)


In [None]:
# does not throw an error because Int32 is sufficiently large
df.select(pl.col("streams").cast(pl.Int32)).head(2)

Be careful before converting everything to the smallest possible data type: with 32-bit data types, overflows can occur during calculations without any warning being issued!

In [None]:
print(f'Number of total streams with Int64 is {df.select(pl.col("streams").sum()).item()}')
print(f'Number of total streams with Int32 is {df.select(pl.col("streams").cast(pl.Int32).sum()).item()}')
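
What a silent 32-bit overflow looks like can be modelled with the standard library alone. This sketch does not call Polars; it only imitates signed 32-bit wraparound with `ctypes`, and the helper `wrap_to_int32` and the stream counts are made up for illustration.

```python
import ctypes

INT32_MAX = 2**31 - 1


def wrap_to_int32(value: int) -> int:
    """Reinterpret a Python integer as a wrapped signed 32-bit value."""
    return ctypes.c_int32(value).value


# Made-up daily totals; each one fits into Int32, but their sum does not.
daily_streams = [1_500_000_000, 1_500_000_000]

exact_total = sum(daily_streams)            # Python integers never overflow
wrapped_total = wrap_to_int32(exact_total)  # what a 32-bit accumulator would hold

print(exact_total > INT32_MAX)
print(wrapped_total)
```

The wrapped total is negative even though every input is positive, which is exactly the kind of result that goes unnoticed without a warning.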

The `shrink_dtype` method can also reduce memory consumption automatically to some extent. However, it never switches from a signed to an unsigned data type, even if no negative values are present.

In [None]:
df.select(pl.col("rank").shrink_dtype()).head(2)

### Date and time

- `Date`: calendar date
- `Time`: time of day
- `Datetime`: date and time
- `Duration`: time duration

You can extract components from dates and times using the `dt` namespace.

In [None]:
(df
    .select("date")
    .with_columns(
        pl.col("date").dt.year().alias("year"),
        pl.col("date").dt.quarter().alias("quarter"),
        pl.col("date").dt.month().alias("month"),
        pl.col("date").dt.week().alias("week"),
        pl.col("date").dt.weekday().alias("weekday"), # Monday == 1, Sunday == 7
        pl.col("date").dt.day().alias("day"),
    )
    .sample(5)
)

We can also subtract dates from each other or add or subtract a period of time (`offset_by`).

In [None]:
(df
    .select("date")
    .with_columns(
        (pl.col("date").dt.month_end() - pl.col("date")).alias("days_till_month_end"),
        pl.col("date").dt.offset_by("1w").alias("same_day_next_week")
    )
    .sample(5)
)

In [None]:
(df
    .filter(pl.col("date").eq(pl.col("date").dt.month_end()))
    .select("date", pl.col("artist"))
    .sample(5)
)

### Character strings

- `Utf8`: any character string
- `Categorical`: character string encoded as category

### Structures

- `List`: list with variable length per row
- `Array`: list with fixed length in all rows, e.g. coordinates
- `Struct`: named fields

### Other

- `Binary`: binary data
- `Object`: any Python object

## Sort

The `sort` method makes it easy to sort dataframes.

In [None]:
df.sort("rank").head(3)

In [None]:
df.sort("streams", descending=True).head(3)

In [None]:
df.sort(["rank", "streams"], descending=[False, True]).head(3)

## Write data

A dataframe can be written to a file in various formats using the `write_*` methods.

In [None]:
df_2020 = df.filter(pl.col("date").dt.year().eq(2020))

In [None]:
# as CSV (approx. 9 MB)
df_2020.write_csv("2020_write_test.csv")

In [None]:
# as compressed CSV (approx. 2 MB)
import gzip

with gzip.open("2020_write_test.csv.gz", "wb") as f:
    df_2020.write_csv(f)

In [None]:
# as Apache Parquet (approx. 1 MB)
df_2020.write_parquet("2020_write_test.parquet")

In [None]:
!ls -l 2020_write_test*

## Optional exercises

### Question 7

In [None]:
q7.question()

In [None]:
q7_df = ...

In [None]:
q7.check(q7_df)

### Question 8

In [None]:
q8.question()

In [None]:
q8_monday = ...
q8_friday = ...

In [None]:
q8.check(q8_monday, q8_friday)

### Question 9

In [None]:
q9.question()

In [None]:
q9_df = ...

In [None]:
q9.check(q9_df)

### Question 10

In [None]:
q10.question()

In [None]:
q10_df = ...

In [None]:
q10.check(q10_df)

### Question 11

In [None]:
q11.question()

In [None]:
q11_ohne_zedd = ...
q11_mit_zedd = ...

In [None]:
q11.check(q11_ohne_zedd, q11_mit_zedd)

### Question 12

In [None]:
q12.question()

In [None]:
q12_df = ...

In [None]:
q12.check(df, q12_df)