## Expressions and Contexts
### Expressions
An expression is a lazy representation of a data transformation: defining one computes nothing until it is used inside a context.

In [12]:
import polars as pl
bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
print(bmi_expr)

[(col("weight")) / (col("height").pow([dyn int: 2]))]


### Contexts
A Polars expression produces a result only when it is executed inside a context.
The same expression can produce different results depending on the context it is used in.

In [13]:
# The four contexts covered below: 1. select  2. with_columns  3. filter  4. group_by
# First, build the example dataframe used in every context.
from datetime import date

df = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
        "birthdate": [
            date(1997, 1, 10),
            date(1985, 2, 15),
            date(1983, 3, 22),
            date(1981, 4, 30),
        ],
        "weight": [57.9, 72.5, 53.6, 83.1],  # (kg)
        "height": [1.56, 1.77, 1.65, 1.75],  # (m)
    }
)

print(df)

shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name           ┆ birthdate  ┆ weight ┆ height │
│ ---            ┆ ---        ┆ ---    ┆ ---    │
│ str            ┆ date       ┆ f64    ┆ f64    │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   │
└────────────────┴────────────┴────────┴────────┘


In [21]:
# The select context applies expressions over columns.
# It may produce new columns that are aggregations, combinations of other columns, or literals:
result = df.select(
    bmi = bmi_expr,
    avg_bmi = bmi_expr.mean(),
    ideal_max_bmi = 25,
)
print(result)
      
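# Scalar results such as avg_bmi and the literal 25 are broadcast to match the column length.
# Broadcasting can also occur within an expression. Below, the scalar mean and standard deviation are broadcast against the bmi values: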
result = df.select(deviation = (bmi_expr - bmi_expr.mean())/bmi_expr.std())
print(result)

shape: (4, 3)
┌───────────┬───────────┬───────────────┐
│ bmi       ┆ avg_bmi   ┆ ideal_max_bmi │
│ ---       ┆ ---       ┆ ---           │
│ f64       ┆ f64       ┆ i32           │
╞═══════════╪═══════════╪═══════════════╡
│ 23.791913 ┆ 23.438973 ┆ 25            │
│ 23.141498 ┆ 23.438973 ┆ 25            │
│ 19.687787 ┆ 23.438973 ┆ 25            │
│ 27.134694 ┆ 23.438973 ┆ 25            │
└───────────┴───────────┴───────────────┘
shape: (4, 1)
┌───────────┐
│ deviation │
│ ---       │
│ f64       │
╞═══════════╡
│ 0.115645  │
│ -0.097471 │
│ -1.22912  │
│ 1.210946  │
└───────────┘


In [18]:
# The with_columns context is very similar to select.
# The main difference is that with_columns creates a new dataframe that contains
# the columns of the original dataframe plus the new columns defined by its input expressions,
# whereas select only includes the columns produced by its input expressions:
result = df.with_columns(
    bmi = bmi_expr,
    avg_bmi = bmi_expr.mean(),
    ideal_max_bmi = 25,    
)
print(result)

shape: (4, 7)
┌────────────────┬────────────┬────────┬────────┬───────────┬───────────┬───────────────┐
│ name           ┆ birthdate  ┆ weight ┆ height ┆ bmi       ┆ avg_bmi   ┆ ideal_max_bmi │
│ ---            ┆ ---        ┆ ---    ┆ ---    ┆ ---       ┆ ---       ┆ ---           │
│ str            ┆ date       ┆ f64    ┆ f64    ┆ f64       ┆ f64       ┆ i32           │
╞════════════════╪════════════╪════════╪════════╪═══════════╪═══════════╪═══════════════╡
│ Alice Archer   ┆ 1997-01-10 ┆ 57.9   ┆ 1.56   ┆ 23.791913 ┆ 23.438973 ┆ 25            │
│ Ben Brown      ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   ┆ 23.141498 ┆ 23.438973 ┆ 25            │
│ Chloe Cooper   ┆ 1983-03-22 ┆ 53.6   ┆ 1.65   ┆ 19.687787 ┆ 23.438973 ┆ 25            │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1   ┆ 1.75   ┆ 27.134694 ┆ 23.438973 ┆ 25            │
└────────────────┴────────────┴────────┴────────┴───────────┴───────────┴───────────────┘


In [20]:
# The context filter filters the rows of a dataframe based on one or more expressions that evaluate to the Boolean data type.

result = df.filter(
    pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),
    pl.col("height") > 1.7,
)
print(result)

shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name      ┆ birthdate  ┆ weight ┆ height │
│ ---       ┆ ---        ┆ ---    ┆ ---    │
│ str       ┆ date       ┆ f64    ┆ f64    │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5   ┆ 1.77   │
└───────────┴────────────┴────────┴────────┘


In [23]:
# group_by and aggregations 
# In the context group_by, rows are grouped according to the unique values of the grouping expressions. 
# You can then apply expressions to the resulting groups, which may be of variable lengths.
result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
).agg(pl.col("name"))
print(result)

shape: (2, 2)
┌────────┬─────────────────────────────────┐
│ decade ┆ name                            │
│ ---    ┆ ---                             │
│ i32    ┆ list[str]                       │
╞════════╪═════════════════════════════════╡
│ 1980   ┆ ["Ben Brown", "Chloe Cooper", … │
│ 1990   ┆ ["Alice Archer"]                │
└────────┴─────────────────────────────────┘


In [32]:
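# Multiple grouping expressions: each unique combination of values forms a group.
mean_weight = df["weight"].mean()

# Group by decade of birth, height below 1.7 m, and weight above the mean weight.
result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),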
    (pl.col("height") < 1.7).alias("isShort"),
    (pl.col("weight") > mean_weight).alias("isObese")
).agg(pl.col("name"))

print(result)

shape: (3, 4)
┌────────┬─────────┬─────────┬─────────────────────────────────┐
│ decade ┆ isShort ┆ isObese ┆ name                            │
│ ---    ┆ ---     ┆ ---     ┆ ---                             │
│ i32    ┆ bool    ┆ bool    ┆ list[str]                       │
╞════════╪═════════╪═════════╪═════════════════════════════════╡
│ 1980   ┆ true    ┆ false   ┆ ["Chloe Cooper"]                │
│ 1990   ┆ true    ┆ false   ┆ ["Alice Archer"]                │
│ 1980   ┆ false   ┆ true    ┆ ["Ben Brown", "Daniel Donovan"… │
└────────┴─────────┴─────────┴─────────────────────────────────┘


In [41]:
# Multiple aggregation functions
result = df.group_by(
    (pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
    (pl.col("height") < 1.7).alias("isShort"),
).agg(
    pl.len(),
    pl.col("height").max().alias("tallest"),
    pl.col("weight", "height").mean().name.prefix("avg_"),
    # The line above is equivalent to:
    # pl.col("weight").mean().alias("avg_weight"),
    # pl.col("height").mean().alias("avg_height"),
)
print(result)

# There are also group_by_dynamic and rolling for grouping contexts.


shape: (3, 6)
┌────────┬─────────┬─────┬─────────┬────────────┬────────────┐
│ decade ┆ isShort ┆ len ┆ tallest ┆ avg_weight ┆ avg_height │
│ ---    ┆ ---     ┆ --- ┆ ---     ┆ ---        ┆ ---        │
│ i32    ┆ bool    ┆ u32 ┆ f64     ┆ f64        ┆ f64        │
╞════════╪═════════╪═════╪═════════╪════════════╪════════════╡
│ 1980   ┆ true    ┆ 1   ┆ 1.65    ┆ 53.6       ┆ 1.65       │
│ 1980   ┆ false   ┆ 2   ┆ 1.77    ┆ 77.8       ┆ 1.76       │
│ 1990   ┆ true    ┆ 1   ┆ 1.56    ┆ 57.9       ┆ 1.56       │
└────────┴─────────┴─────┴─────────┴────────────┴────────────┘


### Expression Expansion
Expression expansion is shorthand for applying the same transformation to multiple columns: a single expression expands into one expression per matching column.

In [47]:
# From the aggregation example above:
# pl.col("weight", "height").mean().name.prefix("avg_")
# is equivalent to:
# pl.col("weight").mean().alias("avg_weight"),
# pl.col("height").mean().alias("avg_height"),

# Selecting by data type: in df this expands to the 2 Float64 columns (weight and height).
expr = (pl.col(pl.Float64) * 1.1).name.suffix("*1..1")
result = df.select(expr)
print(result)

# This will expand to 0 columns, because df2 has no Float64 columns.
df2 = pl.DataFrame(
    {
        "ints": [1, 2, 3, 4],
        "letters": ["A", "B", "C", "D"],
    }
)
result = df2.select(expr)
print(result)

shape: (4, 2)
┌─────────────┬─────────────┐
│ weight*1..1 ┆ height*1..1 │
│ ---         ┆ ---         │
│ f64         ┆ f64         │
╞═════════════╪═════════════╡
│ 63.69       ┆ 1.716       │
│ 79.75       ┆ 1.947       │
│ 58.96       ┆ 1.815       │
│ 91.41       ┆ 1.925       │
└─────────────┴─────────────┘
shape: (0, 0)
┌┐
╞╡
└┘


### Conclusion
Because expressions are lazy, Polars can try to simplify an expression used inside a context before running the data transformation it describes.