In [None]:
%%capture
!pip install polars
!pip install altair
!pip install plotly
!pip install matplotlib

In [None]:
import polars as pl
import pandas as pd
import altair as alt
import plotly.express as px

# `polars` and `altair`

Two tools we're using in the Analytics South team.  
With these two tools we can achieve lightning-fast operations and build complex graphs in a breeze.

The tools also allow us to decouple code segments much more easily than some other options, like `pandas` or `matplotlib`.

We'll do a comparison throughout this workshop, with

`polars` <-> `pandas`  
`altair` <-> `plotly`

# `polars`

Let's start with `polars`.  
`polars` is a DataFrame library written in Rust. It is incredibly fast and has a more ergonomic API than `pandas` (IMO).

`Polars` goes to great lengths to:

- Reduce redundant copies
- Traverse memory cache efficiently
- Minimize contention in parallelism

![polars](https://www.ritchievink.com/img/post-35-polars-0.15/db-benchmark.png)

It's lazy and semi-lazy!

## DataFrame instantiation

Let's see how we can instantiate DataFrames!

In [None]:
df_pl = pl.DataFrame({"a": [1,2], "b": [2,3]}); df_pl

In [None]:
df_pd = pd.DataFrame({"a": [1,2], "b": [2,3]}); df_pd

That was easy, and very similar to `pandas`, but how can we interop with `pandas`?

In [None]:
pl.from_pandas(df_pd)

In [None]:
df_pl.to_pandas()

That was even easier!

We're now ready for some action, how about exploring `pl.Expr`?

### `pl.Expr` a super-charged expression

In [None]:
pl_expr = pl.col("a") == 1; pl_expr

Hmmm... What do we make of this?

1. It's a non-executed expression
    - Could be viewed as a SQL-expression
2. It's not tied to any data-source

And this gives us?

1. Simpler to test
2. Easier to refactor and separate code

Let's try it!

In [None]:
df_pl.filter(pl_expr)

In [None]:
df_pl.filter(~pl_expr)

Did you ever think about how `pandas` works more in-depth? Like filtering?

`df[df["id"].isin(list_of_ids)]`

What happens?

1. We create an array of true/false values, i.e. a boolean indexing array (a mask)
2. We filter our `DataFrame` using said indexing array

This means that we allocate one object completely unnecessarily, as we're not re-using it.
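To make the two steps visible, here is a small sketch that splits the one-liner apart. The column names and values are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
list_of_ids = [2, 4]

# Step 1: pandas materialises a full boolean Series (the mask), one entry per row ...
mask = df["id"].isin(list_of_ids)

# Step 2: ... and only then uses that mask to pick the rows.
filtered = df[mask]

print(mask.tolist())
print(filtered["value"].tolist())
```

The `mask` object is as long as the whole DataFrame, even though we throw it away right after the filter.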

Another thing people tend not to think about is that `pandas` is single-threaded, while a computer usually has _multiple cores_.

What we're doing in `pandas` is often ridiculously parallelizable: `filter`, `map`, `groupBy`, and even `set_column`.

This means that we're **not making use of the majority of our computer**, but simply a small piece.

For now we need to be aware of the following:

1. `polars` uses `pl.Expr`
2. `polars` has no index (which actually makes it a lot easier...)
3. `polars` uses `null` over `NaN` in multiple places
4. `polars` can be `lazy`
5. `polars` has a simpler API (IMO)
    - Bonus: Fewer warnings like "could be bad to set attribute to copy"
6. `polars` has a stricter type system
7. `polars` is built in Rust, if you care

We also need to be aware of the following drawbacks:

1. The stricter type system in `polars` makes it more prone to fail on `concat` and similar operations
2. `polars` readers are not as versatile and well-implemented as those in `pandas`
    - Bonus: `polars` is much, much faster
    - Example: `polars` does not support `,`-decimals in any simple way, or merging two columns into a date
3. `pandas` is the _lingua franca_ and widely used within APIs, as we'll see with `altair` and `plotly`.

### A simple cheat-sheet

In [None]:
df = df_pl

# `groupby` was renamed to `group_by` in newer polars releases
df.group_by(pl.col("a")).agg(pl.col("b"))  # collects the values of "b" into a list per group
df.group_by(pl.col("a")).agg([pl.mean("b")])

In [None]:
b_over_a = [pl.col("a"), pl.mean("b").over("a")]
df.select(b_over_a) # can use expressions to make it really cool

In [None]:
pl.concat([df, df]).select(b_over_a)

In [None]:
df.select([pl.mean("a"), pl.mean("b")])

In [None]:
df.with_columns([
    pl.mean("a").alias("mean_a"),
    pl.min("b").alias("min_b")
])

In [None]:
df.filter(pl.col("a") <= 1)

In [None]:
df.partition_by("a")

In [None]:
df.select((pl.col("a") == 1).sum())

In [None]:
df.select(pl.when(pl.col("a") == 1).then(100).otherwise(pl.col("a") / 10).alias("conditional_a"))

In [None]:
# newer polars releases: `with_column` -> `with_columns`, `.list()` -> `.implode()`
df.with_columns(pl.col("b").implode().alias("b_list"))

### What about being `lazy`?

I thought we could go through the `lazy` mode and discuss it further.

In [None]:
df.lazy().select(pl.col("a")).filter(pl.col("a") <= 1)

In [None]:
# `describe_optimized_plan()` was replaced by `explain()` in newer polars releases
print(df.lazy().select(pl.col("a")).filter(pl.col("a") <= 1).explain())

In [None]:
print(df.lazy().filter(pl.col("a") <= 1).select(pl.col("a")).explain())

In [None]:
df.lazy().filter(pl.col("a") <= 1).select(pl.col("a")).collect()

## Playing around!

- https://github.com/unitedstates/congress-legislators
- https://www.kaggle.com/datasets/uciml/electric-power-consumption-data-set
- https://www.kaggle.com/datasets/rounakbanik/pokemon

# `altair` the simple way to plot

I think `altair` has helped me out a lot in my project(s).

I do miss something closer to `ggplot`, but the only Python library that offers this isn't well supported by other tooling.

`plotly` is very nice when it works, but gets very awkward quickly!

`matplotlib` is obviously OK, but I love interactivity, and other libraries do that better.

In [None]:
import altair as alt
import plotly.express as px
import matplotlib.pyplot as plt

In [None]:
df = pl.read_csv("https://raw.githubusercontent.com/lgreski/pokemonData/master/Pokemon.csv"); df.head()

In [None]:
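import time

def benchmark(func, n_iter: int = 1000):
    """Call `func` n_iter times and print the average wall-clock time per call."""
    t0 = time.perf_counter()  # monotonic clock, better suited for timing than time.time()
    for _ in range(n_iter):
        func()
    print(f"Average time {(time.perf_counter() - t0) / n_iter:.2} s to run func")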

In [None]:
benchmark(lambda: df.group_by(["Type1", "Type2"]).agg([pl.max("Attack"), pl.mean("Generation")]))
df_pd = df.to_pandas()
benchmark(lambda: df_pd.groupby(["Type1", "Type2"]).agg({"Attack": "max", "Generation": "mean"}))

In [None]:
df_large = pl.concat([df for i in range(100)])
df_pd_large = df_large.to_pandas()

In [None]:
benchmark(lambda: df_large.group_by(["Type1", "Type2"]).agg([pl.max("Attack"), pl.mean("Generation")]))
benchmark(lambda: df_pd_large.groupby(["Type1", "Type2"]).agg({"Attack": "max", "Generation": "mean"}))

In [None]:
df_large.shape

In [None]:
df_large = pl.concat([df for i in range(1000)])
df_pd_large = df_large.to_pandas()

In [None]:
benchmark(lambda: df_large.group_by(["Type1", "Type2"]).agg([pl.max("Attack"), pl.mean("Generation")]))
benchmark(lambda: df_pd_large.groupby(["Type1", "Type2"]).agg({"Attack": "max", "Generation": "mean"}))

In [None]:
df_large.shape

The performance difference is like night and day. Now onto `altair`!

## Plotting with `altair`

In [None]:
df.head()

In [None]:
px.histogram(df.to_pandas(), y="Type1")

In [None]:
alt.Chart(df.to_pandas()).mark_bar().encode(x="Type1", y="count(Type1)")

In [None]:
alt.Chart(df.to_pandas()).mark_bar(tooltip=True).encode(x="Type1", y="count(Type1)")

In [None]:
df.head()

In [None]:
chart = alt.Chart(df.to_pandas())
point_chart = chart.mark_circle(tooltip=True).encode(x="Type1", y="Attack", color="Type1", size="Attack")
histogram_health = chart.mark_tick(tooltip=True).encode(x="Type1", y="HP", color="Type1")

In [None]:
(point_chart+histogram_health).resolve_scale(y="independent")

In [None]:
point_chart | histogram_health

In [None]:
point_chart & histogram_health

In [None]:
brush = alt.selection_interval()
# `add_selection` is deprecated in altair 5+; `add_params` is the replacement
point_chart = point_chart.add_params(
    brush
)
histogram_health = histogram_health.transform_filter(
    brush
)
point_chart | histogram_health

In [None]:
df.head()

In [None]:
point_chart = chart.mark_circle(tooltip=True).encode(x="HP", y="Attack", color="Type1", size="Defense")
point_chart.interactive()

# Let's play!

Let us play around with `polars` and `altair`!


1. Add Selection Tools (selectboxes, select by legend etc)
2. Add timeline-slider
3. Try to combine multiple `.over` statements
4. Try adding an `altair` transform, such as `transform_filter` or `transform_aggregate`, to transform the data, e.g. showing the `mean` of a column.

More details can be found at: 
1. [_polars user guide_](https://pola-rs.github.io/polars-book/user-guide/introduction.html)
2. [altair docs](https://altair-viz.github.io/gallery/index.html)

In [None]:
# https://www.kaggle.com/datasets/uciml/electric-power-consumption-data-set
# https://www.kaggle.com/datasets/rounakbanik/pokemon