In [1]:
import polars as pl
import pandas as pd

from content.utils import polars_read_tsv_file, pandas_read_tsv_file
from content.constants import IMDB_DATASET_PATH

%load_ext memory_profiler

# Is this a break-up?

Frankly, I personally never fell in love with Pandas because I always felt the performance was not entirely up to speed with what I was used to coming from other environments.
However, from the perspective of data analysis I can now understand it is easy to fall in love with a library that makes correlating data so much easier while keeping the syntax relatively easy to read.

Since then, though, Pandas has come a long way and has made incredible performance improvements.
However, it still has important shortcomings compared to Polars.
We will explore these below with a few exercises, which will also familiarize you with how Pandas code can be rewritten in Polars!

## Performance
The first one we will explore is, surprise surprise, the difference in performance.
Pandas is written in Python on top of NumPy and C extensions, while Polars is written in Rust.
The reduced overhead is already an important difference, but it also means that Polars can fully utilize all cores, and it has been designed with this in mind.
Although CPython 3.12 introduced important changes (such as a per-interpreter GIL) that make this possible from Python as well, it will take time before Pandas is rewritten to take full advantage of them.
Interestingly, for smaller data frames we see different results than for larger ones, and not always in favour of Polars.

### Exercise 3.1
In this exercise we will load a small CSV dataset.
Load the StudentsPerformance dataset with Pandas and measure the execution time, then load the same file with Polars and measure the execution time again.
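
Note that the `%%memit` magic in the cells below reports memory usage rather than time. For wall-clock time, the built-in `%%time` magic is one option. Another is a small helper like the sketch below. The name `timed` is hypothetical (it is not part of `content.utils`), and the `sum` call is only a stand-in for the Pandas or Polars read call you want to measure.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; replace with the read call you want to time.
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, took {seconds:.4f}s")
```

A single measurement is noisy, so it is worth repeating the call a few times before comparing the two libraries.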

In [None]:
%%memit

In [None]:
%%memit

### Exercise 3.2
In this exercise, let's check how the performance compares when reading in a large dataset.
For this purpose, we will use the IMDB dataset.

In [None]:
%%memit

In [None]:
%%memit

You likely found a slightly higher execution time for Polars on the small dataset and a significantly lower one on the large dataset.
This appears to be the general pattern, which is why Polars is especially recommended for large datasets in practice.

## Real null values
One of the main annoyances about Pandas is that null values are not actual null values.
This complicates further processing of individual column data and sometimes forces users to make assumptions about their data.
In Pandas, the "null" value depends on the data type of the column, which complicates post-processing:

In [None]:
import numpy as np


test_df = pd.DataFrame(
    {
        "test": pd.Series([1, 2, 3, None], dtype=pd.Int8Dtype()),
        "numbers": [1, 2, 3, None],
        "int_values": [1, 2, 3, 4],
        "str_values": ["a", "b", "c", None],
        "bool_values": [True, False, None, None],
    }
)
display(test_df)
display(test_df.memory_usage(deep=True))
display(type(test_df["test"][0]))

Additionally, the dtype of a column can change if one of its values is set to None.
For example, if a value in an int64 column is changed to None, the dtype of the column changes along with it:

In [None]:
test_df.loc[3, "int_values"] = None
display(test_df)
display(type(test_df["int_values"][0]))

### Exercise 3.3
In the last data frame, we observe three different dtypes and corresponding "null" values.
What are the Python types of these different null values, and how can they be filtered out when post-processing columns?

In Polars, null values are consistently displayed as "null" when showing data and, more importantly, consistently evaluate to Python None values.
This facilitates more intuitive None value checks and corresponding data processing.

### Exercise 3.4
Recreate the previous data frame in Polars and display and evaluate the null values in a similar fashion.

## Syntax
Polars and Pandas are similar in many of their methods and much of their functionality, but they differ fundamentally in other respects.
The most fundamental difference in core principles is that Polars has no index, while in Pandas you can always address and manipulate individual rows through the index.
While it is possible to do the same in Polars, you will either need to add an index column to your data frame or use cumbersome syntax with possibly expensive operations to achieve index-based manipulation.

The main idea behind this is to encourage transparency in data manipulation: hidden index state and opaque `reset_index` calls make it harder to see what the code does.
In Polars, you should instead consider which conditions the record(s) you need to manipulate meet, and address them based on those conditions rather than on their position within the frame.

Below, we will explore a few other syntax differences between the two by loading in data and rewriting the Pandas statements to Polars.
Within all of these differences, you may notice that transparency is a recurring principle.

In [33]:
games_pl = pl.read_csv("content/data/game_recommendations_on_steam/games.csv")
games_pd = pd.read_csv("content/data/game_recommendations_on_steam/games.csv")

### Exercise 3.5
Rewrite to Polars: select the titles of the games and obtain the result as a list

In [None]:
games_pd.loc[:, "title"].to_list()

### Exercise 3.6
Filter the Leisure Suit Larry games scoring higher than 80%.

In [None]:
games_pd[(games_pd["title"].str.contains("Leisure Suit Larry")) & (games_pd["positive_ratio"] > 80)]

### Exercise 3.7
Give The Witcher the score it deserves.

In [None]:
games_pd.loc[games_pd["title"] == "The Witcher 3: Wild Hunt - Blood and Wine", "positive_ratio"] = 100
games_pd[games_pd["title"] == "The Witcher 3: Wild Hunt - Blood and Wine"]

## Being lazy pays off
One of the most important distinctions between Pandas and Polars is the lazy loading and lazy manipulation of data.
This allows users to pass around, extend and modify queries on the data before any data is read, so that Polars can combine and optimize all the operations when the data is finally collected.
For this we have no exercises here, as we dedicated the next part of this hackathon to it.
Continue with part 4!