<a target="_blank" href="https://colab.research.google.com/github/bettercodepaul/data2day_2023_polars/blob/main/data2day_2023_Polars_Part_2.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Polars: The Turbo Boost for Dataframes - Part 2

Important links as a reminder:

- Homepage of Polars: https://www.pola.rs/
- User Guide: https://pola-rs.github.io/polars/user-guide/
- API Reference: https://pola-rs.github.io/polars/py-polars/html/reference/

## Installation + Set-Up

In [None]:
import urllib.request
import os.path

In [None]:
REQUIREMENTS_URL = "https://github.com/bettercodepaul/data2day_2023_polars/raw/main/requirements.txt"
urllib.request.urlretrieve(REQUIREMENTS_URL, os.path.basename(REQUIREMENTS_URL))

In [None]:
# don't forget that you might need to restart the kernel
!pip install -qr requirements.txt

In [None]:
import polars as pl

In [None]:
# output up to 60 characters per column and do not abbreviate floating point numbers
pl.Config(fmt_str_lengths=60, fmt_float="full")

In [None]:
# download data
DATA_URL = "https://github.com/bettercodepaul/data2day_2023_polars/raw/main/spotify-charts-2017-2021-global-top200.csv.gz"
LOCAL_DATA_FILE_NAME = os.path.basename(DATA_URL)
urllib.request.urlretrieve(DATA_URL, LOCAL_DATA_FILE_NAME)
GENRES_DATA_URL = "https://github.com/bettercodepaul/data2day_2023_polars/raw/main/track-genres.parquet"
LOCAL_GENRES_DATA_FILE_NAME = os.path.basename(GENRES_DATA_URL)
urllib.request.urlretrieve(GENRES_DATA_URL, LOCAL_GENRES_DATA_FILE_NAME)

In [None]:
# download exercises and utility functions
EXERCISES_URL = "https://github.com/bettercodepaul/data2day_2023_polars/raw/main/data2day_exercises_en.py"
urllib.request.urlretrieve(EXERCISES_URL, os.path.basename(EXERCISES_URL))

In [None]:
from data2day_exercises_en import *

In [None]:
# Load data from CSV file and parse date columns
df = pl.read_csv("spotify-charts-2017-2021-global-top200.csv.gz", try_parse_dates=True)
df.head(2) # output the first 2 rows

## Aggregations on groups

In the first part you have already learned about aggregate functions like `max`, `min`, `mean` and `sum`. These functions become really powerful when you apply them to groups that you can form from almost any expression.

The group is formed with the `group_by` method.

The subsequent aggregation is done with the `agg` method. It works similarly to `select`, but for aggregations.

In [None]:
# the five most streamed artists
(df
    .group_by("artist")
    .agg(pl.col("streams").sum())
    .top_k(5, by="streams")
)

An aggregation can contain multiple expressions...

In [None]:
# the five most streamed artists and their average ranking in the charts
(df
    .group_by("artist")
    .agg(pl.col("streams").sum(), pl.col("rank").mean())
    .top_k(5, by="streams")
)

The grouping can also be done with multiple expressions...

In [None]:
# the 5 most streamed artists in a year
(df
    .group_by("artist", pl.col("date").dt.year().alias("year"))
    .agg(pl.col("streams").sum())
    .top_k(5, by="streams")
    .sort("year")
)

But now the year 2020 is missing, because `top_k` returns the five largest rows overall rather than per year. Fortunately, the function `head` also works on a grouped dataframe and then returns the first *n* rows per group. The disadvantage compared to `top_k` is that we have to sort the entire data set first.

In [None]:
# Artists with the most streams per year
(df
    .group_by("artist", pl.col("date").dt.year().alias("year"))
    .agg(pl.col("streams").sum())
    .sort("streams", descending=True)
    .group_by("year")
    .head(1)
    .sort("year")
)

You can also check which artists had the largest number of different songs in the top 200.

In [None]:
# Artists with the largest number of different songs in the Top 200 per year
(df
    .group_by("artist", pl.col("date").dt.year().alias("year"))
    .agg(pl.col("title").n_unique().alias("distinctSongsInTop200"))
    .sort("distinctSongsInTop200", descending=True)
    .group_by("year")
    .head(1)
    .sort("year")
)

What would the rankings look like if you used the days at #1 as your benchmark?

In [None]:
# Artist with the most days at number 1 per year
(df
    .filter(pl.col("rank").eq(1))
    .group_by("artist", pl.col("date").dt.year().alias("year"))
    .agg(pl.len().alias("daysOnNumberOne"))
    .sort("daysOnNumberOne", descending=True)
    .group_by("year")
    .head(1)
    .sort("year")
)

In [None]:
# Instead of filtering the entire data set, we can even filter data in the aggregation
(df
    .group_by("artist", pl.col("date").dt.year().alias("year"))
    .agg(
        pl.col("streams").sum(),
        pl.col("date").filter(pl.col("rank").eq(1)).len().alias("daysOnNumberOne")
    )
    .sort(["daysOnNumberOne", "streams"], descending=True)
    .group_by("year")
    .head(1)
    .sort("year")
)

We have more than one artist per row because we treat each collaboration as a separate artist. There are many well-known songs from such collaborations...

In [None]:
top_5_colabs = (df
    .filter(pl.col("artist").str.contains(", "))
    .group_by("artist", "title", "url")
    .agg(pl.col("streams").sum())
    .top_k(5, by="streams")
)
top_5_colabs

In [None]:
play_song(top_5_colabs, 0)

## A special data type: lists

Polars can also handle lists as a special data type very well. Such a list is created, for example, when we split a string with the method `str.split`.

In [None]:
# "artist" as string
df.filter(pl.col("artist").eq("Shawn Mendes, Camila Cabello")).head(1)

In [None]:
# "artist" as a list of strings
(df
    .filter(pl.col("artist").eq("Shawn Mendes, Camila Cabello"))
    .head(1)
    .with_columns(pl.col("artist").str.split(", "))
)

It is sometimes very handy to unroll such lists with the `explode` method. The row is then repeated once per list element, and the column can be treated like any other column.

In [None]:
(df
    .filter(pl.col("artist").eq("Shawn Mendes, Camila Cabello"))
    .head(1)
    .with_columns(pl.col("artist").str.split(", "))
    .explode("artist")
)

We can now compute the artists with the most days at #1 without interpreting each collaboration as its own artist.

In [None]:
# artists with most days on number 1 per year
(df
    .with_columns(pl.col("artist").str.split(", "))
    .explode("artist")
    .group_by("artist", pl.col("date").dt.year().alias("year"))
    .agg(
        pl.col("streams").sum(),
        pl.col("date").filter(pl.col("rank").eq(1)).len().alias("daysOnNumberOne")
    )
    .sort(["daysOnNumberOne", "streams"], descending=True)
    .group_by("year")
    .head(1)
    .sort("year")
)

Instead of unrolling the lists, we can also work directly on list columns. Suitable methods are available in the `list` namespace, e.g. `list.len()` for the length of a list.

In [None]:
# How many artists are there per top 200 entry?
(df
    .select(pl.col("artist"))
    .with_columns(pl.col("artist").str.split(", "))
    .with_columns(pl.col("artist").list.len().alias("artistCount"))
    .group_by("artistCount")
    .len()
    .sort("artistCount")
    .with_columns((pl.col("len")/pl.col("len").sum()).round(2).alias("percentage"))
)

In [None]:
# The chart entry with 10 artists is "Pa' La Cultura" at #151 on 8/7/2020
play_song(df
    .with_columns(pl.col("artist").str.split(", ").list.len().alias("artistCount"))
    .filter(pl.col("artistCount").eq(10))
)

## Exercises on groupings and aggregations

### Question 13

In [None]:
q13.question()

In [None]:
q13_df = ...

In [None]:
q13.check(q13_df)
#q13.hint()
#q13.solution()

### Question 14

In [None]:
q14.question()

In [None]:
q14_df = ...

In [None]:
q14.check(q14_df)

### Question 15

In [None]:
q15.question()

In [None]:
q15_df = ...

In [None]:
q15_df

In [None]:
q15.check(q15_df)

## Joins & Co. - Connecting Dataframes

### Concatenate with `pl.concat`
A flexible and simple way to concatenate two dataframes is the `pl.concat` function.

In [None]:
# how="vertical" stacks two dataframes on top of each other, names and types of columns must match
pl.concat([
    df.sample(1),
    df.sample(1)
], how="vertical")

In [None]:
# how="vertical_relaxed" tries to adjust the data types if necessary
pl.concat([
    df.sample(1),
    df.sample(1).with_columns(pl.col("artist").cast(pl.Categorical))
], how="vertical_relaxed")

In [None]:
# how="diagonal" can also handle other column names
pl.concat([
    df.sample(1).select("title", "artist", pl.col("rank").alias("position")),
    df.sample(1).select("title", pl.col("artist").alias("performer"), "rank")
], how="diagonal")

In [None]:
# how="horizontal" puts dataframes side by side, the number of records must match
some_df = df.sample(4)
pl.concat([
    some_df.select("title", "artist"),
    some_df.select("streams", "rank")
], how="horizontal")

In [None]:
# how="align" puts dataframes side by side and tries to align them to the common key columns
pl.concat([
    some_df.sample(fraction=1.0, shuffle=True).select("url", "date", "title"),
    some_df.sample(fraction=1.0, shuffle=True).select("url", "date", "artist"),
    some_df.sample(fraction=0.5, shuffle=True).select("url", "date", "streams")
], how="align")

With `how="align"` a join is in fact already performed, but it is not really clear on which columns.

In most cases it will therefore be better to perform an explicit join.

### Connect with `join`

Joins allow us to connect two dataframes. Polars supports the following join types:

`left.join(right, on=..., how=...)`.

- `full`: all rows from `left` and `right`, even if they have no join partner in the other dataframe
- `left`: all rows from `left`, even if they have no join partner in `right`.
- `inner`: rows from `left` and `right` with matching join partner in the other dataframe
- `semi`: rows from `left` with matching join partner in `right` (like `inner`, but no new columns from `right`)
- `anti`: rows from `left` without matching join partner in `right` (opposite of `semi`)

In [None]:
left = pl.DataFrame({
    "key": [0, 1, 2],
    "value": ["a", "b", "c"]
})
right = pl.DataFrame({
    "key": [1, 2, 3],
    "value": ["x", "y", "z"]
})

In [None]:
# full outer join
left.join(right, on="key", how="full").sort("key")

In [None]:
# left join
left.join(right, on="key", how="left")

In [None]:
# inner join
left.join(right, on="key", how="inner")

In [None]:
# semi join
left.join(right, on="key", how="semi")

In [None]:
# anti join
left.join(right, on="key", how="anti")

## Exercises on joins

### Question 16

In [None]:
q16.question()

In [None]:
q16_df = ...

In [None]:
q16.check(q16_df)

### Question 17

In [None]:
q17.question()

In [None]:
q17_df = ...

In [None]:
q17.check(q17_df)

### Question 18

In [None]:
q18.question()

In [None]:
q18_df = ...

In [None]:
q18.check(q18_df)

## Grouping and Joining with Expressions: `over` Expressions

For many calculations, it can be helpful to evaluate an expression over a group.

For example, we could try to determine the newcomer of the year. For this we need to know when an artist first appeared in the charts.

In [None]:
first_appearance = df.group_by("artist").agg(pl.col("date").min().alias("firstChartAppearance"))
first_appearance.filter(pl.col("artist").is_in(["Billie Eilish", "Lewis Capaldi"]))

We can now join this new information to the overall data set to determine the Newcomer of the Year.

In [None]:
(df
    .join(first_appearance, on="artist")
    .filter(pl.col("date").dt.year().eq(pl.col("firstChartAppearance").dt.year()))
    .group_by(pl.col("date").dt.year().alias("year"), "artist")
    .agg(pl.col("streams").sum())
    .sort("streams", descending=True)
    .group_by("year")
    .head(1)
    .sort("year")
)

Ed Sheeran was no longer a newcomer in 2017, but we lack the information from previous years to do better...

We can achieve the same without the intermediate data set by using an `over` expression.

In [None]:
(df
    # expression with over instead of temporary dataframe with group_by, agg and join
    .with_columns(pl.col("date").min().over("artist").alias("firstChartAppearance"))
    .filter(pl.col("date").dt.year().eq(pl.col("firstChartAppearance").dt.year()))
    .group_by(pl.col("date").dt.year().alias("year"), "artist")
    .agg(pl.col("streams").sum())
    .sort("streams", descending=True)
    .group_by("year")
    .head(1)
    .sort("year")
)

## Reshaping

For some calculations and especially plots it is helpful to switch between different variants of a dataframe.

The wide format has more columns and fewer rows.
The long format has more rows and fewer columns.

In [None]:
some_df = pl.DataFrame({
    "month": ["2023-01", "2023-01", "2023-01", "2023-02"],
    "genre": ["pop", "rock", "hip-hop", "pop"],
    "streams": [100, 200, 300, 150] 
})
some_df

With the `pivot` method we can make a data set *wider*, i.e. transport information from rows to new columns. The following parameters are important:

- `index`: columns that will be kept
- `on`: column with values, from which new column names are formed
- `values`: column with values, which are written into the new columns

In [None]:
some_df.pivot(index="month", on="genre", values="streams")

We could replace the resulting `null` values with `fill_null`.

In [None]:
some_df.pivot(index="month", on="genre", values="streams").fill_null(0)

With the counterpart `unpivot` we can make a data set *longer* again, i.e. transport information from columns to rows. The following parameters are important:

- `index`: columns that are to be preserved
- `on`: columns whose values go into the `value_name` column
- `variable_name`: name of the column that receives the column names from `on`
- `value_name`: name of the column that receives the values from the `on` columns

In [None]:
(some_df
    .pivot(index="month", on="genre", values="streams")
    .unpivot(index="month", on=["pop", "rock", "hip-hop"], variable_name="genre", value_name="streams")
    .sort("month")
)

We could remove the `null` values with `drop_nulls`.

In [None]:
(some_df
    .pivot(index="month", on="genre", values="streams")
    .unpivot(index="month", on=["pop", "rock", "hip-hop"], variable_name="genre", value_name="streams")
    .sort("month")
    .drop_nulls()
)

## Selectors + horizontal expressions

Especially for data in "wide" format it is often helpful to perform operations on several columns without having to name each column explicitly. In fact, sometimes the column names are not even known when a query is written, because they only emerge from the concrete data.

So far we have always passed a single column name to `pl.col`, but there are more possibilities:

In [None]:
# select multiple columns by name
df.select(pl.col("rank", "streams").log()).head(2)

In [None]:
# select multiple columns by data type
df.select(pl.col(pl.Utf8).str.to_lowercase()).head(2)

In [None]:
# select multiple columns with a regular expression
df.select(pl.col("^.*rt.*$")).head(2)

In addition, there is also the possibility to select all columns.

In [None]:
df.select(pl.all()).head(2)

Or even to exclude certain columns.

In [None]:
# all columns, but not "url"
df.select(pl.exclude("url")).head(2)

In [None]:
# all string columns, but not "url".
df.select(pl.col(pl.Utf8).exclude("url")).head(2)

On such a column selection, which contains more than one column, we can also perform "horizontal" calculations. For this purpose there are the functions `pl.sum_horizontal`, `pl.min_horizontal` and `pl.max_horizontal`.

In [None]:
df.select(pl.sum_horizontal(pl.exclude(pl.Utf8))).head(2)

## Exercises (optional)

### Question 19

In [None]:
q19.question()

In [None]:
q19_df = ...

In [None]:
q19.check(q19_df)

### Question 20

In [None]:
q20.question()

In [None]:
q20_df = ...

In [None]:
q20.check(df, q20_df)