# Exploring Polars

In this exercise, we are going to explore Polars using a small dataset.
The dataset was taken from https://www.kaggle.com/datasets/spscientist/students-performance-in-exams and describes the marks obtained by students in different areas of a test.
Occasionally, we'll explore differences from Pandas.
We will also observe that on small datasets, there is a negligible difference between Pandas and Polars in terms of computational speed and memory usage.


In [7]:
import polars as pl

%load_ext memory_profiler

The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler


## Load the data

Load the CSV data using Polars and visualise the loaded DataFrame.

In [8]:
df = pl.read_csv("content/data/StudentsPerformance.csv")
df

gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
str,str,str,str,str,i64,i64,i64
"""female""","""group B""","""bachelor's degree""","""standard""","""none""",72,72,74
"""female""","""group C""","""some college""","""standard""","""completed""",69,90,88
"""female""","""group B""","""master's degree""","""standard""","""none""",90,95,93
"""male""","""group A""","""associate's degree""","""free/reduced""","""none""",47,57,44
"""male""","""group C""","""some college""","""standard""","""none""",76,78,75
…,…,…,…,…,…,…,…
"""female""","""group E""","""master's degree""","""standard""","""completed""",88,99,95
"""male""","""group C""","""high school""","""free/reduced""","""none""",62,55,55
"""female""","""group C""","""high school""","""free/reduced""","""completed""",59,71,65
"""female""","""group D""","""some college""","""standard""","""completed""",68,78,77


## Manipulate the data
In the remainder of this notebook, we will manipulate the data to answer the provided questions.
Consult the documentation at https://docs.pola.rs/api/python/stable/reference/index.html.

### Exercise 2.1
Calculate and show the mean score for each test area separately.

In [9]:
df.mean().select("math score", "reading score", "writing score")

math score,reading score,writing score
f64,f64,f64
66.089,69.169,68.054


### Exercise 2.2
Calculate the 90th percentile score for each test area.

In [10]:
df.quantile(0.9).select("math score", "reading score", "writing score")

math score,reading score,writing score
f64,f64,f64
86.0,87.0,87.0


### Exercise 2.3
How many students scored higher than the 90th percentile for each test area? 
* Use only Polars methods and operators within a `select()` context

In [11]:
df.select(
    pl.col("math score") > pl.quantile("math score", 0.9),
    pl.col("reading score") > pl.quantile("reading score", 0.9),
    pl.col("writing score") > pl.quantile("writing score", 0.9),
).sum()

math score,reading score,writing score
u32,u32,u32
95,100,98


## Group by clause
In the exercise below, the `group_by()` method may come in handy.
This method returns a `GroupBy` instance, which allows you to define an operation for all columns through global methods such as `sum()`.
More interestingly, it allows you to define per-column operations in combination with the `agg()` method.
Inside `agg()`, we can use `pl.col()` to specify what happens to the grouped values in each column; columns not passed to `agg()` are discarded from the result.
If we just specify the name without transformations, the grouped lists will be returned as is:

In [12]:
import polars as pl

df_example = (
    pl.DataFrame(
        {
            "a": [1, 2, 3, 4, 5],
            "b": [5, 4, 3, 2, 1],
            "c": [1, 1, 2, 2, 3],
        }
    )
    .group_by("c")
    .agg(pl.col("a").sum().alias("sum_a"), "b")
)
df_example

c,sum_a,b
i64,i64,list[i64]
3,5,[1]
1,3,"[5, 4]"
2,7,"[3, 2]"




### Exercise 2.4
Find out the proportions of students who failed on all parts of the test per gender.
* Polars expressions support calculations using native operators in combination with `pl.col` expressions
* The `group_by` method may come in handy
* In the Netherlands, only scores of 55% or higher are considered a pass, but feel free to define passing to your own liking!

In [13]:
df.group_by(
    "gender",
    (pl.col("math score") < 55)
    .and_(pl.col("reading score") < 55)
    .and_(pl.col("writing score") < 55)
    .alias("failed"),
).len().sort("failed").group_by("gender").agg(
    pl.col("len").filter(pl.col("failed")).first() / pl.col("len").sum()
)

gender,len
str,f64
"""male""",0.143154
"""female""",0.084942


### Exercise 2.5
Find out whether there is a correlation between the lunch type and the test scores.

In [14]:
df.group_by("lunch").agg(pl.mean_horizontal("math score", "reading score", "writing score").mean())

lunch,math score
str,f64
"""free/reduced""",62.199061
"""standard""",70.837209


### Exercise 2.6
Which race/ethnicity group scored lowest overall?

In [15]:
df.group_by("race/ethnicity").agg(
    pl.mean_horizontal("math score", "reading score", "writing score")
    .mean()
    .alias("mean_score")
).sort("mean_score").head(1)

race/ethnicity,mean_score
str,f64
"""group A""",62.992509


### Exercise 2.7
Which race/ethnicity group had a significantly higher score in math than in the other test areas?

_NOTE: I think we can conclude that the score is not necessarily significantly higher than the mean score on the other test areas, but the difference is significantly higher compared to the other groups who all scored lower on maths than other areas_

In [23]:
df.group_by("race/ethnicity").agg(
    pl.col("math score").mean(),
    pl.mean_horizontal("reading score", "writing score").mean(),
).select(
    pl.all(), (pl.col("math score") / pl.col("reading score")).alias("math_ratio")
).sort("math_ratio", descending=True)

race/ethnicity,math score,reading score,math_ratio
str,f64,f64,f64
"""group E""",73.821429,72.217857,1.022205
"""group A""",61.629213,63.674157,0.967884
"""group D""",67.362595,70.087786,0.961117
"""group B""",63.452632,66.476316,0.954515
"""group C""",64.46395,68.465517,0.941554
