## Rolling time series analysis

In [1]:
from datetime import datetime, timedelta

import polars as pl

pl.Config.set_tbl_rows(8)

polars.config.Config

In [2]:
start_datetime = datetime(2020, 1, 1)
end_datetime = datetime(2020, 1, 2)

df = (
    pl.DataFrame(
        {
            "date": pl.datetime_range(
                start_datetime, end_datetime, interval="1h", eager=True
            )
        }
    ).with_row_index("values")
    .select("date", "values")
)
df

date,values
datetime[μs],u32
2020-01-01 00:00:00,0
2020-01-01 01:00:00,1
2020-01-01 02:00:00,2
2020-01-01 03:00:00,3
…,…
2020-01-01 21:00:00,21
2020-01-01 22:00:00,22
2020-01-01 23:00:00,23
2020-01-02 00:00:00,24


### Built-in rolling aggregations

Polars has built-in `rolling_*_by` expressions that compute common aggregations such as `mean`, `min`, `max` and `sum` over a rolling time window.

In [3]:
df.with_columns(
    roll_mean = pl.col("values").rolling_mean_by(
        "date",
        window_size="3h",
        closed="right"
    ),
    roll_sum = pl.col("values").rolling_sum_by(
        "date",
        window_size=timedelta(hours=3),
        closed="right"
    )
)

date,values,roll_mean,roll_sum
datetime[μs],u32,f64,u32
2020-01-01 00:00:00,0,0.0,0
2020-01-01 01:00:00,1,0.5,1
2020-01-01 02:00:00,2,1.0,3
2020-01-01 03:00:00,3,2.0,6
…,…,…,…
2020-01-01 21:00:00,21,20.0,60
2020-01-01 22:00:00,22,21.0,63
2020-01-01 23:00:00,23,22.0,66
2020-01-02 00:00:00,24,23.0,69


## Rolling expression

The `rolling_*` expressions above are only available for the most common aggregations. 

We can specify any expression to be evaluated in rolling windows using the `rolling` expression.

In [4]:
df.with_columns(
    # Get the first row index in each window
    window_row_first = pl.col("values").first().rolling(
        index_column="date",
        period="3h"
    ),
    # Get the last row index in each window
    window_row_last = pl.col("values").last().rolling(
        index_column="date",
        period="3h"
    )
)

date,values,window_row_first,window_row_last
datetime[μs],u32,u32,u32
2020-01-01 00:00:00,0,0,0
2020-01-01 01:00:00,1,0,1
2020-01-01 02:00:00,2,0,2
2020-01-01 03:00:00,3,1,3
…,…,…,…
2020-01-01 21:00:00,21,19,21
2020-01-01 22:00:00,22,20,22
2020-01-01 23:00:00,23,21,23
2020-01-02 00:00:00,24,22,24


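We can also pass an `offset`, which shifts the start of each window relative to its row. The default is `-period`, so by default each window ends at the current row.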

In [5]:
df.with_columns(
    # Get the first row index in each window
    window_row_first = pl.col("values").first().rolling(
        index_column="date",
        period="3h",
        offset="-1h"
    ),
    # Get the last row index in each window
    window_row_last = pl.col("values").last().rolling(
        index_column="date",
        period="3h",
        offset="-1h"
    ),
    window_row_indexes = pl.col("date").agg_groups().rolling(
        index_column="date",
        period="3h",
        offset="-1h"
    )
)

date,values,window_row_first,window_row_last,window_row_indexes
datetime[μs],u32,u32,u32,list[u32]
2020-01-01 00:00:00,0,0,2,"[0, 1, 2]"
2020-01-01 01:00:00,1,1,3,"[1, 2, 3]"
2020-01-01 02:00:00,2,2,4,"[2, 3, 4]"
2020-01-01 03:00:00,3,3,5,"[3, 4, 5]"
…,…,…,…,…
2020-01-01 21:00:00,21,21,23,"[21, 22, 23]"
2020-01-01 22:00:00,22,22,24,"[22, 23, 24]"
2020-01-01 23:00:00,23,23,24,"[23, 24]"
2020-01-02 00:00:00,24,24,24,[24]
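
The window bounds implied by `period` and `offset` can be checked by hand. The following is a minimal pure-Python sketch, not Polars code: `window_rows` is a hypothetical helper, and it assumes the default right-closed window, so the row at time `t` sees the rows in `(t + offset, t + offset + period]`.

```python
from datetime import datetime, timedelta


def window_rows(times, period, offset=None):
    """For each time t, list the row indexes in (t + offset, t + offset + period]."""
    if offset is None:
        # Mirrors the default described above: offset = -period
        offset = -period
    windows = []
    for t in times:
        lower = t + offset
        upper = lower + period
        windows.append([i for i, s in enumerate(times) if lower < s <= upper])
    return windows


# Same hourly grid as the notebook: 25 rows from 2020-01-01 00:00 to 2020-01-02 00:00
times = [datetime(2020, 1, 1) + timedelta(hours=h) for h in range(25)]

default_windows = window_rows(times, timedelta(hours=3))
shifted_windows = window_rows(times, timedelta(hours=3), offset=timedelta(hours=-1))

print(default_windows[3])   # [1, 2, 3]
print(shifted_windows[0])   # [0, 1, 2]
print(shifted_windows[23])  # [23, 24]
```

The printed lists agree with the first and last row indexes in the outputs above for the rows at 03:00 (default offset) and at 00:00 and 23:00 (`offset="-1h"`).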


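We can also apply the aggregation first and then call `rolling`. With the default `offset` and `closed` settings this reproduces the `rolling_mean_by` and `rolling_sum_by` results from earlier.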

In [6]:
df.with_columns(
    roll_mean = pl.col("values").mean().rolling(
        "date",
        period="3h"
    ),
    roll_sum = pl.col("values").sum().rolling(
        "date",
        period="3h"
    )
)

date,values,roll_mean,roll_sum
datetime[μs],u32,f64,u32
2020-01-01 00:00:00,0,0.0,0
2020-01-01 01:00:00,1,0.5,1
2020-01-01 02:00:00,2,1.0,3
2020-01-01 03:00:00,3,2.0,6
…,…,…,…
2020-01-01 21:00:00,21,20.0,60
2020-01-01 22:00:00,22,21.0,63
2020-01-01 23:00:00,23,22.0,66
2020-01-02 00:00:00,24,23.0,69


## Rolling on a `DataFrame`

Both the `rolling_*` and `rolling` expressions are useful when we want to add one or more new columns to a `DataFrame`.

We can also call `rolling` on a `DataFrame` and aggregate, which creates a new `DataFrame` containing the index column and the rolling aggregation columns.

In [7]:
df.rolling(
    index_column="date",
    period="3h"
).agg(
    window_row_first = pl.col("values").first(),
    window_row_last = pl.col("values").last()
).head(4)

date,window_row_first,window_row_last
datetime[μs],u32,u32
2020-01-01 00:00:00,0,0
2020-01-01 01:00:00,0,1
2020-01-01 02:00:00,0,2
2020-01-01 03:00:00,1,3


### Rolling windows by group

In [8]:
df2 = pl.concat(
    [
        df.with_columns(
            id = pl.lit("A")
        ),
        df.with_columns(
            id = pl.lit("B")
        )
    ]
)

df2

date,values,id
datetime[μs],u32,str
2020-01-01 00:00:00,0,"""A"""
2020-01-01 01:00:00,1,"""A"""
2020-01-01 02:00:00,2,"""A"""
2020-01-01 03:00:00,3,"""A"""
…,…,…
2020-01-01 21:00:00,21,"""B"""
2020-01-01 22:00:00,22,"""B"""
2020-01-01 23:00:00,23,"""B"""
2020-01-02 00:00:00,24,"""B"""


In [11]:
df2.rolling(
    index_column="date",
    period="3h",
    group_by="id"
).agg(
    window_row_first = pl.col("values").first(),
    window_row_last = pl.col("values").last() 
)

id,date,window_row_first,window_row_last
str,datetime[μs],u32,u32
"""A""",2020-01-01 00:00:00,0,0
"""A""",2020-01-01 01:00:00,0,1
"""A""",2020-01-01 02:00:00,0,2
"""A""",2020-01-01 03:00:00,1,3
…,…,…,…
"""B""",2020-01-01 21:00:00,19,21
"""B""",2020-01-01 22:00:00,20,22
"""B""",2020-01-01 23:00:00,21,23
"""B""",2020-01-02 00:00:00,22,24


## Lazy mode and streaming?

Unfortunately, rolling operations are not currently supported by the streaming engine.

## Rolling or `group_by_dynamic`?

The `rolling` and `group_by_dynamic` methods both do windowed time series aggregations, but they build their windows differently:
- `group_by_dynamic` works with constant window intervals
- `rolling` works with windows that depend on the data

Essentially, `group_by_dynamic` looks at the first and last time points and divides the time interval between them into fixed boxes.

`rolling` looks at each time point and creates a window anchored to that point, so the output has one row per input row.

## Exercises

### Exercise 1

We create a `DataFrame` from the Spotify data.

In [12]:
spotify_csv = "data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv, try_parse_dates=True)
spotify_df.head(3)

title,rank,date,artist,url,region,chart,trend,streams
str,i64,date,str,str,str,str,str,i64
"""Starboy""",1,2017-01-01,"""The Weeknd, Daft Punk""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",3135625
"""Closer""",2,2017-01-01,"""The Chainsmokers, Halsey""","""https://open.spotify.com/track…","""Global""","""top200""","""SAME_POSITION""",3015525
"""Let Me Love You""",3,2017-01-01,"""DJ Snake, Justin Bieber""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_UP""",2545384


For the track `Starboy`, add a one-week rolling mean and rolling max of the `streams` column using `rolling_mean_by` and `rolling_max_by`.

In [25]:
spotify_df.filter(
    pl.col("title") == "Starboy"
).select(
    "title", "artist", "date", "trend", "streams"
).sort(
    "date"
).with_columns(
    roll_mean = pl.col("streams").rolling_mean_by(
        "date",
        window_size="1w",
        closed="right"
    ),
    roll_max = pl.col("streams").rolling_max_by(
        "date",
        window_size="1w",
        closed="right"
    ),
)

title,artist,date,trend,streams,roll_mean,roll_max
str,str,date,str,i64,f64,i64
"""Starboy""","""The Weeknd, Daft Punk""",2017-01-01,"""SAME_POSITION""",3135625,3.135625e6,3135625
"""Starboy""","""The Weeknd, Daft Punk""",2017-01-02,"""SAME_POSITION""",3342769,3.239197e6,3342769
"""Starboy""","""The Weeknd, Daft Punk""",2017-01-03,"""SAME_POSITION""",3563076,3.3472e6,3563076
"""Starboy""","""The Weeknd, Daft Punk""",2017-01-04,"""SAME_POSITION""",3619247,3.4152e6,3619247
…,…,…,…,…,…,…
"""Starboy""","""The Weeknd, Daft Punk""",2021-03-30,"""MOVE_DOWN""",741385,760032.571429,812216
"""Starboy""","""The Weeknd, Daft Punk""",2021-03-31,"""MOVE_DOWN""",744903,758873.142857,812216
"""Starboy""","""The Weeknd, Daft Punk""",2021-04-02,"""NEW_ENTRY""",735424,745988.666667,812216
"""Starboy""","""The Weeknd, Daft Punk""",2021-10-31,"""NEW_ENTRY""",760744,760744.0,760744


Visualize the rolling mean number of streams for the most popular tracks

- Add a column called `title_artist` that is a string concatenation of the `title` and `artist` columns separated by `:`
- Sort the `DataFrame` by the `title_artist` and `date` columns

In [26]:
roll_spotify_df = spotify_df.with_columns(
    title_artist = pl.concat_str(["title", "artist"], separator=":")
).sort("title_artist", "date")

roll_spotify_df.head()

title,rank,date,artist,url,region,chart,trend,streams,title_artist
str,i64,date,str,str,str,str,str,i64,str
"""!""",56,2019-08-09,"""Trippie Redd""","""https://open.spotify.com/track…","""Global""","""top200""","""NEW_ENTRY""",1266004,"""!:Trippie Redd"""
"""!""",152,2019-08-10,"""Trippie Redd""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_DOWN""",703785,"""!:Trippie Redd"""
"""#PROUDCATOWNERREMIX""",189,2019-08-23,"""XXXTENTACION, Rico Nasty""","""https://open.spotify.com/track…","""Global""","""top200""","""NEW_ENTRY""",679370,"""#PROUDCATOWNERREMIX:XXXTENTACI…"
"""$$$ - with Matt Ox""",67,2018-03-16,"""XXXTENTACION""","""https://open.spotify.com/track…","""Global""","""top200""","""NEW_ENTRY""",1199692,"""$$$ - with Matt Ox:XXXTENTACIO…"
"""$$$ - with Matt Ox""",83,2018-03-17,"""XXXTENTACION""","""https://open.spotify.com/track…","""Global""","""top200""","""MOVE_DOWN""",950497,"""$$$ - with Matt Ox:XXXTENTACIO…"


Continue by:
- doing a weekly `rolling` aggregation on the `date` column, grouped by `title_artist`
- creating an aggregated `roll_streams` column with the weekly mean of the streams for each track
- sorting the output by `roll_streams` with the largest values at the top

In [27]:
roll_spotify_df.rolling(
    index_column="date",
    period="1w",
    group_by="title_artist"
).agg(
    roll_streams = pl.col("streams").mean()
).sort("roll_streams", descending=True)

title_artist,date,roll_streams
str,date,f64
"""drivers license:Olivia Rodrigo""",2021-01-17,1.2716e7
"""drivers license:Olivia Rodrigo""",2021-01-18,1.2713e7
"""drivers license:Olivia Rodrigo""",2021-01-19,1.2408e7
"""Girls Want Girls (with Lil Bab…",2021-09-03,1.238475e7
…,…,…
"""Superstition - Single Version:…",2017-01-01,331376.0
"""Secrets:The Weeknd""",2017-01-01,331233.0
"""Take Me To Church:Hozier""",2017-01-02,330936.0
"""Ni**as In Paris:JAY-Z, Kanye W…",2017-01-01,325951.0


We want to continue by visualizing the results only for the most popular tracks. 

However, we want to keep all dates for these tracks, not just the most streamed dates so:

- filter the `DataFrame` to keep **all** rows for any track that appears in the top 300 rows of `roll_streams`

The output should have 14,358 rows

In [28]:
roll_spotify_df = roll_spotify_df.rolling(
    index_column="date",
    period="1w",
    group_by="title_artist"
).agg(
    roll_streams = pl.col("streams").mean()
).sort("roll_streams", descending=True).pipe(
    lambda df: df.join(df.head(300), on="title_artist", how="semi")
).sort("date")

roll_spotify_df

title_artist,date,roll_streams
str,date,f64
"""Shape of You:Ed Sheeran""",2017-01-06,6.151345e6
"""Shape of You:Ed Sheeran""",2017-01-07,6376919.5
"""Shape of You:Ed Sheeran""",2017-01-08,6.4371e6
"""Shape of You:Ed Sheeran""",2017-01-09,6.6431e6
…,…,…
"""Dynamite:BTS""",2021-12-20,1.0486e6
"""Way 2 Sexy (with Future & Youn…",2021-12-20,972878.571429
"""Fair Trade (with Travis Scott)…",2021-12-20,890003.8
"""Shape of You:Ed Sheeran""",2021-12-20,849296.714286


Visualize the results as a time series line chart with Plotly:
- time on the x-axis
- `roll_streams` on the y-axis
- `title_artist` in color

In [29]:
import plotly.express as px

px.line(
    roll_spotify_df,
    x="date",
    y="roll_streams",
    color="title_artist",
    width=1000
)