# Quantiles
By the end of this lecture you will be able to:
- calculate a quantile on a `DataFrame`
- calculate a quantile on an expression
- calculate multiple quantiles

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

## Quantiles

### Quantiles on a `DataFrame`
We calculate quantiles on a `DataFrame` using the `quantile` method, passing the quantile as a fraction between 0 and 1. To get the 90th percentile of every column we do:

In [None]:
df.quantile(0.9)

### Quantiles in an expression

We can also calculate the quantile of a single column as an expression:

In [None]:
(
    df
    .select(
        pl.col('Age').quantile(0.9)
    )
)

### Multiple quantiles
We can calculate multiple quantiles in a single `select` by building a list of expressions with a list comprehension. Because Polars evaluates the expressions in a `select` in parallel, the quantiles are calculated in parallel too.

In [None]:
quantileList = [0.1,0.5,0.9]
(
    df
    .select(
        [
            pl.col('Age').quantile(q).alias(f"Age_quantile_{q}") for q in quantileList
        ]
    )
)

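To calculate multiple quantiles on multiple columns we can use `suffix` (spelled `name.suffix` in newer Polars releases) to avoid column name collisions. The suffix is appended to each original column name, so every output column gets a unique name.

In this example we calculate multiple quantiles on all of the floating-point columns.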

In [None]:
quantileList = [0.1,0.5,0.9]
(
    df
    .select(
        [
            # Newer Polars releases use `.name.suffix`; on older releases use `.suffix` instead
            pl.col(pl.Float64).quantile(q).name.suffix(f"_quantile_{q}") for q in quantileList
        ]
    )
)

### Interpolation strategy for quantiles
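When a quantile falls between two data points, the interpolation strategy decides which value is returned. We can choose between:
- `nearest` (the default): the closer of the two neighbouring values
- `higher`: the larger of the two neighbouring values
- `lower`: the smaller of the two neighbouring values
- `midpoint`: the average of the two neighbouring values
- `linear`: a linear interpolation between the two neighbouring values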

In [None]:
df.select(
    [
        pl.col('Age').quantile(0.25,interpolation='nearest').alias('Age_nearest'),
        pl.col('Age').quantile(0.25,interpolation='linear').alias('Age_linear'),
    ]
)
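
To see how the strategies differ, here is a minimal pure-Python sketch of the conventional definitions. It is not Polars' own implementation: the function `quantile_sketch`, the sample values, and the choice to break exact ties upward for `nearest` are all assumptions made for illustration, and Polars may treat edge cases differently.

```python
import math


def quantile_sketch(values, q, interpolation="nearest"):
    """Illustrative quantile of `values` for 0 <= q <= 1 (not the Polars implementation)."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")

    data = sorted(values)
    pos = (len(data) - 1) * q  # fractional position in the sorted data
    lo = math.floor(pos)       # index of the neighbour below
    hi = math.ceil(pos)        # index of the neighbour above

    if interpolation == "lower":
        return data[lo]
    if interpolation == "higher":
        return data[hi]
    if interpolation == "nearest":
        # Assumption: an exact tie (fraction of 0.5) goes to the higher neighbour
        return data[hi] if pos - lo >= 0.5 else data[lo]
    if interpolation == "midpoint":
        return (data[lo] + data[hi]) / 2
    if interpolation == "linear":
        return data[lo] + (pos - lo) * (data[hi] - data[lo])
    raise ValueError(f"unknown interpolation: {interpolation}")


# Hypothetical sample: the 0.25 quantile sits at position 0.75, between 10.0 and 20.0
sample = [10.0, 20.0, 30.0, 40.0]
strategies = ("nearest", "lower", "higher", "midpoint", "linear")
results = {name: quantile_sketch(sample, 0.25, name) for name in strategies}
print(results)
# {'nearest': 20.0, 'lower': 10.0, 'higher': 20.0, 'midpoint': 15.0, 'linear': 17.5}
```

The strategies only disagree when the position is fractional. If the quantile lands exactly on a data point, all five return the same value.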

# Exercises

In the exercises you will develop your understanding of:
- calculating quantiles
- using different interpolation methods

## Exercise 1 - calculating quantiles
Calculate the 25th, 50th and 75th percentiles for the `Age` column. Output the results as 3 columns (with appropriate names) in a one-row `DataFrame`.

In [None]:
qs = [0.25,0.5,0.75]
df = pl.read_csv(csvFile)
df<blank>

Calculate the same percentiles for all of the numeric columns.

Hint: you can pass a list of dtypes to `pl.col`

# Solutions

## Solution to Exercise 1
Calculate the 25th, 50th and 75th percentiles for the `Age` column. Output the results as 3 columns (with appropriate names) in a one-row `DataFrame`.

In [None]:
qs = [0.25,0.5,0.75]
(
    df
    .select(
        [pl.col('Age').quantile(q).alias(f"Age_{q}_quantile") for q in qs]
    )
)

Calculate the same percentiles for all of the numeric columns.

Hint: you can pass a list of dtypes to `pl.col`

In [None]:
qs = [0.25,0.5,0.75]
(
    df
    .select(
        [
            # Newer Polars releases use `.name.suffix`; on older releases use `.suffix` instead
            pl.col([pl.Float64,pl.Int64]).quantile(q).name.suffix(f"_{q}_quantile") for q in qs
        ]
    )
)