# Exploring Polars

In this exercise, we are going to explore Polars using a small dataset.
The dataset was taken from https://www.kaggle.com/datasets/spscientist/students-performance-in-exams and describes marks obtained by students in different areas of a test.
Occasionally, we'll explore differences from Pandas.
We will also observe that on small datasets, there is a negligible difference between Pandas and Polars in terms of computational speed and memory usage.


In [2]:
import polars as pl
import pandas as pd
import time

%load_ext memory_profiler

## Load the data

Load the CSV data using Polars and visualise the loaded DataFrame.

In [None]:
df = pl.read_csv("content/data/StudentsPerformance.csv")
df

## Manipulate the data
In the remainder of this notebook, we will manipulate the data to answer the provided questions.
Consult the documentation at https://docs.pola.rs/api/python/stable/reference/index.html.

### Exercise 2.1
Calculate and show the mean score for each test area separately.

### Exercise 2.2
Calculate the 90th percentile score for each test area.

### Exercise 2.3
How many students scored higher than the 90th percentile for each test area? 
* Use only Polars methods and operators within a `select()` call

## Group by clause
In the exercise below, the `group_by()` method may come in handy.
This method returns a `GroupBy` instance, which lets you apply one operation to all columns through global methods such as `sum()`.
More interestingly, it lets you define per-column operations in combination with the `agg()` method.
Inside `agg()`, expressions built with the `pl.col()` function specify what happens to the grouped values in each column, while columns that are not passed to `agg()` are discarded from the result.
If we just specify the name without transformations, the grouped lists will be returned as is:

In [None]:
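import polars as pl

# Use a separate name so the students DataFrame `df` loaded earlier is not overwritten.
example = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [5, 4, 3, 2, 1],
        "c": [1, 1, 2, 2, 3],
    }
).group_by("c").agg(pl.col("a").sum().alias("sum_a"), "b")  # "b" is returned as a list per group
example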



### Exercise 2.4
Find out the proportions of students who failed on all parts of the test per gender.
* Polars expressions support calculations using native operators in combination with `pl.col` expressions
* The `group_by` method may come in handy
* In the Netherlands, only scores of 55% or higher are considered a pass, but feel free to define passing to your own liking!

### Exercise 2.5
Find out whether there is a correlation between the lunch type and the test scores.

### Exercise 2.6
Which race/ethnicity group scored lowest overall?

### Exercise 2.7
Which race/ethnicity group had a significantly higher score in math than in the other test areas?