# Window functions in SQL and Polars

In this supplementary note, we discuss window functions in SQL and Polars. Last week we covered window functions in pandas through the `transform()` method, but we have not yet seen how to do the same thing in SQL or Polars.

In `pandas`, aggregation, transformation, and apply are all done after calling the `groupby()` method. The syntax for a groupwise transformation is a bit different in SQL. Let's bring back the z-score example from pandas, this time with the full dataset.

We first import relevant packages:

In [12]:
import pandas as pd
import numpy as np
import polars as pl
import duckdb

And we load the stations information:

In [13]:
stations = pd.read_csv("station-metadata.csv")

Let's prepare the dataframe for the full data:

In [14]:
%%time
intervals = [f"{10 * i + 1}-{10 * (i + 1)}" for i in range(190, 202)]  # quiz! 1901-1910 to 2011-2020.
dfs = []
for interval in intervals:
    filepath = f"datafiles/{interval}.csv"
    df = pd.read_csv(filepath)
    dfs.append(df)
df = (pd.concat(dfs, axis=0, ignore_index=True)
      .melt(
        id_vars = ["ID", "Year"],
        value_vars = [f"VALUE{i}" for i in range(1, 13)],
        var_name = "Month",
        value_name = "Temp")
      .query("~Temp.isnull()") 
      .assign(Month = lambda x : x.Month.str[5:].astype(int))
      .assign(Temp = lambda x : x.Temp / 100)
      .merge(stations, on="ID", how="inner")
     )

CPU times: user 4.64 s, sys: 974 ms, total: 5.61 s
Wall time: 5.85 s


Then the following cell computes the z-scores within each `NAME`-`Month` group:

In [25]:
%%time
def z_score(x):
    m = np.mean(x)
    s = np.std(x)
    return (x - m)/s
grouped = df.groupby(['NAME', 'Month'])["Temp"]
df["z"] = grouped.transform(z_score)

CPU times: user 41.7 s, sys: 1.11 s, total: 42.8 s
Wall time: 43.1 s


The next cell should be much faster than the one above, because it uses pandas' built-in, well-optimized `mean` and `std` aggregations instead of a slow Python function:

In [62]:
%%time
grouped = df.groupby(['NAME', 'Month'])["Temp"]
means = grouped.transform("mean")
stds = grouped.transform("std")
df["z"] = (df["Temp"] - means) / stds
df["z"]

CPU times: user 1.26 s, sys: 308 ms, total: 1.57 s
Wall time: 1.72 s


0          -0.069232
1          -0.490032
2           0.856529
3          -0.978160
4          -1.146480
              ...   
13983499    0.707107
13983500    0.401712
13983501    0.584073
13983502    1.364803
13983503    0.760456
Name: z, Length: 13983504, dtype: float64
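
One detail worth checking when comparing the two cells above: `np.std` defaults to the population standard deviation (`ddof=0`), while pandas' built-in `"std"` uses the sample standard deviation (`ddof=1`), so the two versions do not give exactly the same z-scores. Here is a minimal sketch on a toy frame; the data and the names `toy`, `z_numpy`, `z_pandas`, and `z_numpy_ddof1` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: two groups of temperatures.
toy = pd.DataFrame({
    "NAME": ["A", "A", "A", "B", "B"],
    "Temp": [1.0, 2.0, 3.0, 10.0, 14.0],
})
grouped = toy.groupby("NAME")["Temp"]

# Python-level function, as in the slow cell: np.std uses ddof=0 by default.
z_numpy = grouped.transform(lambda x: (x - np.mean(x)) / np.std(x))

# Built-in aggregations, as in the fast cell: "std" uses ddof=1.
z_pandas = (toy["Temp"] - grouped.transform("mean")) / grouped.transform("std")

# Passing ddof=1 explicitly makes the Python-level version agree with pandas.
z_numpy_ddof1 = grouped.transform(lambda x: (x - np.mean(x)) / np.std(x, ddof=1))

print(pd.DataFrame({"z_numpy": z_numpy,
                    "z_pandas": z_pandas,
                    "z_numpy_ddof1": z_numpy_ddof1}))
```

With many observations per group the difference is small, but it becomes visible for groups that contain only a few years of data.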

In [63]:
df

Unnamed: 0,ID,Year,Month,Temp,LATITUDE,LONGITUDE,STNELEV,NAME,z
0,AG000060390,1901,1,10.34,36.7167,3.2500,24.0,ALGER_DAR_EL_BEIDA,-0.069232
1,AG000060390,1902,1,9.84,36.7167,3.2500,24.0,ALGER_DAR_EL_BEIDA,-0.490032
2,AG000060390,1903,1,11.44,36.7167,3.2500,24.0,ALGER_DAR_EL_BEIDA,0.856529
3,AG000060390,1904,1,9.26,36.7167,3.2500,24.0,ALGER_DAR_EL_BEIDA,-0.978160
4,AG000060390,1905,1,9.06,36.7167,3.2500,24.0,ALGER_DAR_EL_BEIDA,-1.146480
...,...,...,...,...,...,...,...,...,...
13983499,USC00393079,1902,5,17.03,43.0500,-98.5333,379.5,FORT_RANDALL,0.707107
13983500,USC00281280,1952,3,5.68,39.9167,-75.1167,4.3,CAMDEN,0.401712
13983501,USC00281280,1952,5,17.83,39.9167,-75.1167,4.3,CAMDEN,0.584073
13983502,USC00281280,1952,6,25.05,39.9167,-75.1167,4.3,CAMDEN,1.364803


Then, how do we do it with SQL?

In [70]:
%%time
duckdb.sql("""
SELECT
  *,
  (Temp - AVG(Temp) OVER (PARTITION BY NAME, Month))
    / (STDDEV(Temp) OVER (PARTITION BY NAME, Month)) AS z
FROM df
""").df()

CPU times: user 15.9 s, sys: 6.91 s, total: 22.8 s
Wall time: 11.2 s


Unnamed: 0,ID,Year,Month,Temp,LATITUDE,LONGITUDE,STNELEV,NAME,z,z_1
0,FRM00007005,1994,12,6.62,50.143,1.832,67.1,ABBEVILLE,-0.183382,-0.183382
1,FRM00007005,1970,12,3.08,50.143,1.832,67.1,ABBEVILLE,-1.075202,-1.075202
2,FRM00007005,1995,12,2.24,50.143,1.832,67.1,ABBEVILLE,-1.286820,-1.286820
3,FRM00007005,1981,12,2.40,50.143,1.832,67.1,ABBEVILLE,-1.246512,-1.246512
4,FRM00007005,1997,12,5.70,50.143,1.832,67.1,ABBEVILLE,-0.415154,-0.415154
...,...,...,...,...,...,...,...,...,...,...
13983499,CHXLT114097,1989,6,22.00,37.480,105.670,1185.0,ZHONGNING,-0.146898,-0.146898
13983500,CHXLT114097,1973,6,22.05,37.480,105.670,1185.0,ZHONGNING,-0.100777,-0.100777
13983501,CHXLT114097,1975,6,21.85,37.480,105.670,1185.0,ZHONGNING,-0.285261,-0.285261
13983502,CHXLT114097,1959,6,21.25,37.480,105.670,1185.0,ZHONGNING,-0.838711,-0.838711


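Let's break this down a little bit. (The result shows both `z` and `z_1` because `df` already contained the `z` column computed in pandas; `z_1` is the column produced by this query.)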

- `AVG(Temp) OVER (PARTITION BY NAME, Month)` calculates the mean temperature for each `NAME`-`Month` group.
- `STDDEV(Temp) OVER (PARTITION BY NAME, Month)` calculates the standard deviation of temperature for each `NAME`-`Month` group.
- `(PARTITION BY NAME, Month)` is the window we define; it creates the grouping we need. `AVG` and `STDDEV` are computed `OVER` this window, so every row receives its group's value and no rows are collapsed.


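If you also want the `means` and `stds` columns, you can nest the `SELECT` statement. The `WINDOW` clause gives the window a name (`w`) so that it does not have to be written out twice: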

In [77]:
%%time
duckdb.sql("""
SELECT 
    *, (Temp - means) / stds AS z
FROM (
    SELECT *, AVG(Temp) OVER w AS means, STDDEV(Temp) OVER w AS stds
    FROM df
    WINDOW w AS (PARTITION BY NAME, Month)
)
""").df()

CPU times: user 15.9 s, sys: 7.38 s, total: 23.3 s
Wall time: 12.1 s


Unnamed: 0,ID,Year,Month,Temp,LATITUDE,LONGITUDE,STNELEV,NAME,z,means,stds,z_1
0,MXXLT440932,1969,5,27.14,20.65,-89.51,10.0,ABALA_ABALA,-0.575050,27.639091,0.867909,-0.575050
1,MXXLT440932,1968,5,27.69,20.65,-89.51,10.0,ABALA_ABALA,0.058657,27.639091,0.867909,0.058657
2,MXXLT440932,2000,5,28.40,20.65,-89.51,10.0,ABALA_ABALA,0.876715,27.639091,0.867909,0.876715
3,MXXLT440932,1999,5,28.90,20.65,-89.51,10.0,ABALA_ABALA,1.452813,27.639091,0.867909,1.452813
4,MXXLT440932,1998,5,27.83,20.65,-89.51,10.0,ABALA_ABALA,0.219964,27.639091,0.867909,0.219964
...,...,...,...,...,...,...,...,...,...,...,...,...
13983499,CHXLT114097,2005,6,24.79,37.48,105.67,1185.0,ZHONGNING,2.426644,22.159254,1.084109,2.426644
13983500,CHXLT114097,2007,6,22.76,37.48,105.67,1185.0,ZHONGNING,0.554138,22.159254,1.084109,0.554138
13983501,CHXLT114097,1984,6,21.34,37.48,105.67,1185.0,ZHONGNING,-0.755693,22.159254,1.084109,-0.755693
13983502,CHXLT114097,2000,6,22.02,37.48,105.67,1185.0,ZHONGNING,-0.128450,22.159254,1.084109,-0.128450


The same thing can be done in Polars, as in the cell below. The syntax may look closer to SQL than to the `groupby` method of pandas dataframes. Instead of `groupby`, the window function is written as a Polars expression using `.over()`, which lets Polars optimize the operation with less involvement of the Python interpreter.

In [59]:
df_pl = pl.from_pandas(df)

In [65]:
%%time
# First add the per-group mean and std, then derive z from those columns.
df_pl = (df_pl
         .with_columns(means=pl.col("Temp").mean().over(["NAME", "Month"]),
                       stds=pl.col("Temp").std().over(["NAME", "Month"]))
         .with_columns(z=(pl.col("Temp") - pl.col("means")) / pl.col("stds")))

CPU times: user 1.02 s, sys: 819 ms, total: 1.84 s
Wall time: 461 ms


In [66]:
df_pl

ID,Year,Month,Temp,LATITUDE,LONGITUDE,STNELEV,NAME,z,means,stds
str,i64,i64,f64,f64,f64,f64,str,f64,f64,f64
"""AG000060390""",1901,1,10.34,36.7167,3.25,24.0,"""ALGER_DAR_EL_BEIDA""",-0.069232,10.422262,1.188213
"""AG000060390""",1902,1,9.84,36.7167,3.25,24.0,"""ALGER_DAR_EL_BEIDA""",-0.490032,10.422262,1.188213
"""AG000060390""",1903,1,11.44,36.7167,3.25,24.0,"""ALGER_DAR_EL_BEIDA""",0.856529,10.422262,1.188213
"""AG000060390""",1904,1,9.26,36.7167,3.25,24.0,"""ALGER_DAR_EL_BEIDA""",-0.97816,10.422262,1.188213
"""AG000060390""",1905,1,9.06,36.7167,3.25,24.0,"""ALGER_DAR_EL_BEIDA""",-1.14648,10.422262,1.188213
…,…,…,…,…,…,…,…,…,…,…
"""USC00393079""",1902,5,17.03,43.05,-98.5333,379.5,"""FORT_RANDALL""",0.707107,17.01,0.028284
"""USC00281280""",1952,3,5.68,39.9167,-75.1167,4.3,"""CAMDEN""",0.401712,3.256667,6.032512
"""USC00281280""",1952,5,17.83,39.9167,-75.1167,4.3,"""CAMDEN""",0.584073,15.558955,3.888292
"""USC00281280""",1952,6,25.05,39.9167,-75.1167,4.3,"""CAMDEN""",1.364803,19.873478,3.792871
