# Windowing Operations

Pandas supports windowing operations, which work similarly to `groupby()`. Each windowing operation does two things:

1. Builds a sliding partition of the values (how it slides depends on the type of window).
2. Performs an aggregation over each sliding partition of values.
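
The two steps can be sketched by hand and compared with the pandas result. This is only an illustration with a made-up series `s` and window size `w` (both hypothetical names), not a description of how pandas implements it internally.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
w = 3  # window size

# Step 1: slide a partition over the values
# (row i plus up to w - 1 previous rows).
windows = [s.iloc[max(0, i - w + 1): i + 1] for i in range(len(s))]

# Step 2: aggregate each partition (here: sum).
manual = pd.Series([win.sum() for win in windows], index=s.index, dtype="float64")

# min_periods=1 keeps the results of the shorter leading windows.
builtin = s.rolling(window=w, min_periods=1).sum()

print(pd.DataFrame({"manual": manual, "builtin": builtin}))
```

Both columns should hold the same values, which is the sense in which a rolling operation is "partition, then aggregate".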

Windowing operations can be applied to a `Series` or a `DataFrame`. There are four types of windows:

1. Rolling window
2. Weighted window
3. Expanding window
4. Exponentially weighted window
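
As a rough orientation, three of these window types can be compared on the same toy series. The series `s` and the parameter values (`window=2`, `alpha=0.5`) are arbitrary choices for illustration. Weighted windows (`win_type=...`) are left out of this sketch because they require SciPy.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Rolling: fixed-size window that slides along the data.
rolling_mean = s.rolling(window=2).mean()

# Expanding: window starts at the first row and keeps growing.
expanding_mean = s.expanding().mean()

# Exponentially weighted: older rows get exponentially smaller weights.
ewm_mean = s.ewm(alpha=0.5, adjust=False).mean()

summary = pd.DataFrame(
    {"rolling": rolling_mean, "expanding": expanding_mean, "ewm": ewm_mean}
)
print(summary)
```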

Some general properties apply to all of these window types.

**NOTE:** The most important things to understand are the rolling window, the expanding window, and
what each type of window does (or does not) support.

In [54]:
import pandas as pd
import numpy as np

np.random.seed(0)

In [55]:
## Handy functions
from IPython.display import display_html, display, HTML

def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

def display_several(*args):
    for df in args:
        display(df)

def display_windowed(windowed):
    table_title_html = '<div style="display:inline-block; vertical-align:top; width:15%; margin:1px;"><h5>window {0} (type: {1})</h5>{2}</div>'

    html_str=''
    for i, window in enumerate(windowed):
        if isinstance(window, pd.Series):
            window = window.to_frame()
            html_str+=table_title_html.format(i, "s",window.to_html())
        else:
            html_str+=table_title_html.format(i, "df",window.to_html())
        
    display_html(html_str,raw=True)


## General Properties

1. It is possible to iterate over windows.
2. All windowing operations support a `min_periods` argument. `min_periods` is the minimum number of non-NaN values a window must contain in order to return a result; otherwise the result is `NaN`. For `.rolling()`, the default is:
    - 1 for time-based windows (offset window size)
    - the window size for fixed windows
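
A small sketch of how `min_periods` decides which windows return a result. The series `b` and the threshold of 2 are arbitrary values chosen for the illustration: the non-NaN values of each window are counted by iterating over the windows, and the count is compared with the rows where the aggregation is not `NaN`.

```python
import numpy as np
import pandas as pd

b = pd.Series([np.nan, 1, 2, np.nan, np.nan, 3])

# Number of non-NaN observations in each window of size 3.
non_nan = [int(win.notna().sum()) for win in b.rolling(window=3)]

# A window yields a result only when it holds at least min_periods observations.
result = b.rolling(window=3, min_periods=2).sum()
has_result = result.notna().tolist()

print(non_nan)
print(has_result)
```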


In [56]:
df = pd.DataFrame(
    { "A": range(6),
      "B" : [np.nan, 1, 2, np.nan, np.nan, 3]
     }, 
     index=pd.date_range('2020-01-01', periods=6, freq='1D')
     )
df

Unnamed: 0,A,B
2020-01-01,0,
2020-01-02,1,1.0
2020-01-03,2,2.0
2020-01-04,3,
2020-01-05,4,
2020-01-06,5,3.0


In [57]:
# 1. Iterate over windows
# NOTE: with rolling, each window is built from the current row and completed
# with the previous rows (if there are any) until it reaches the fixed size.
# For that reason, the first two windows have sizes 1 and 2.
for window in df.rolling(window = 3):
    display(window)

Unnamed: 0,A,B
2020-01-01,0,


Unnamed: 0,A,B
2020-01-01,0,
2020-01-02,1,1.0


Unnamed: 0,A,B
2020-01-01,0,
2020-01-02,1,1.0
2020-01-03,2,2.0


Unnamed: 0,A,B
2020-01-02,1,1.0
2020-01-03,2,2.0
2020-01-04,3,


Unnamed: 0,A,B
2020-01-03,2,2.0
2020-01-04,3,
2020-01-05,4,


Unnamed: 0,A,B
2020-01-04,3,
2020-01-05,4,
2020-01-06,5,3.0


In [58]:
# Or using our handy function display_windowed, we can display side by side
# the windows
display_windowed(df.rolling(window = 3))

Unnamed: 0,A,B
2020-01-01,0,

Unnamed: 0,A,B
2020-01-01,0,
2020-01-02,1,1.0

Unnamed: 0,A,B
2020-01-01,0,
2020-01-02,1,1.0
2020-01-03,2,2.0

Unnamed: 0,A,B
2020-01-02,1,1.0
2020-01-03,2,2.0
2020-01-04,3,

Unnamed: 0,A,B
2020-01-03,2,2.0
2020-01-04,3,
2020-01-05,4,

Unnamed: 0,A,B
2020-01-04,3,
2020-01-05,4,
2020-01-06,5,3.0


In [59]:
# 2. Using min_periods (FOCUS on column B)
# NOTE: every window above has at least 1 non-NaN value in column B, except
# the first one. So every window returns a result in column B except the first.
df.rolling(window = 3 , min_periods= 1).sum()


Unnamed: 0,A,B
2020-01-01,0.0,
2020-01-02,1.0,1.0
2020-01-03,3.0,3.0
2020-01-04,6.0,3.0
2020-01-05,9.0,2.0
2020-01-06,12.0,3.0


In [60]:
# NOTE: in column B, windows 0, 1, 4 and 5 have fewer than 2 non-NaN values,
# so the result is NaN for those windows. In column A, only window 0 does.
df.rolling(window = 3 , min_periods= 2).sum()


Unnamed: 0,A,B
2020-01-01,,
2020-01-02,1.0,
2020-01-03,3.0,3.0
2020-01-04,6.0,3.0
2020-01-05,9.0,
2020-01-06,12.0,


In [61]:
# NOTE: every window has fewer than 3 non-NaN values in column B, so every
# result in that column is NaN.
df.rolling(window = 3 , min_periods= 3).sum()

Unnamed: 0,A,B
2020-01-01,,
2020-01-02,,
2020-01-03,3.0,
2020-01-04,6.0,
2020-01-05,9.0,
2020-01-06,12.0,


In [62]:
# NOTE: for fixed-size windows the default min_periods is the window size
# (in this case 3), so the result matches the example above.
df.rolling(window = 3).sum()

Unnamed: 0,A,B
2020-01-01,,
2020-01-02,,
2020-01-03,3.0,
2020-01-04,6.0,
2020-01-05,9.0,
2020-01-06,12.0,


In [63]:
# NOTE: for time-based windows the default min_periods is 1, so the result
# matches our first min_periods example.
df.rolling(window='3D').sum()


Unnamed: 0,A,B
2020-01-01,0.0,
2020-01-02,1.0,1.0
2020-01-03,3.0,3.0
2020-01-04,6.0,3.0
2020-01-05,9.0,2.0
2020-01-06,12.0,3.0


## Rolling Window

A rolling window, also known as a moving window, supports three kinds of windows through `.rolling(window=<value>)`:

1. Fixed windows: `window=<integer>`.
2. Time-based windows defined by an offset: `window=<time-based offset>`. They produce variable-size windows and require a monotonic time-based index.
3. Custom windows: `window=<custom_indexer>`.
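
To make the difference between fixed and time-based windows concrete, the sketch below counts the rows in each window for an irregular date index. The dates and the `"3D"` offset are arbitrary illustration values.

```python
import pandas as pd

times = pd.DatetimeIndex(
    ["2020-01-01", "2020-01-03", "2020-01-04", "2020-01-05", "2020-01-29"]
)
s = pd.Series(range(5), index=times)

# Fixed window: up to 3 rows, no matter how far apart they are in time.
fixed_sizes = [len(win) for win in s.rolling(window=3)]

# Time-based window: every row within the last 3 days, however many there are.
offset_sizes = [len(win) for win in s.rolling(window="3D")]

print(fixed_sizes)
print(offset_sizes)
```

The isolated row on `2020-01-29` ends up alone in its time-based window, while the fixed window still reaches back to the two previous rows.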

Beyond these three well-defined kinds of windows, some parameters of `.rolling()` change which rows fall into each window. The parameters that I consider important are:

1. `center`
2. `closed`

For the aggregation step, we can use built-in functions (such as `.mean()`) or user-defined functions. Here, I consider it important to understand:

1. `.apply()` for user-defined functions (UDFs)
2. `.cov()` or `.corr()` for binary calculations.
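
A minimal sketch of a user-defined aggregation with `.apply()`. The function `value_range` and the series `s` are hypothetical names invented for this example. With `raw=True` each window is passed to the function as a NumPy array rather than as a `Series`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0, 5.0])

def value_range(values: np.ndarray) -> float:
    """Spread (max - min) of one window; gets a NumPy array because raw=True."""
    return values.max() - values.min()

# The first row is NaN because the default min_periods equals the window size.
spread = s.rolling(window=2).apply(value_range, raw=True)
print(spread)
```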

**NOTE:** we use the handy function `display_windowed` to display the windows.

In [64]:
times = ['2020-01-01', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-29']

df = pd.DataFrame(
    { "A": range(5),
      "B" : np.random.randint(10, size = 5)
     }, 
     index=pd.DatetimeIndex(times)
     )
df

Unnamed: 0,A,B
2020-01-01,0,5
2020-01-03,1,0
2020-01-04,2,3
2020-01-05,3,3
2020-01-29,4,7


In [65]:
# 1. fixed window using integer number
# NOTE: the first windows are smaller than 3 because of how the windows
# are created. This is explained later, together with the center parameter of
# rolling()
windowed = df.rolling(window=3)
display_windowed(windowed)

Unnamed: 0,A,B
2020-01-01,0,5

Unnamed: 0,A,B
2020-01-01,0,5
2020-01-03,1,0

Unnamed: 0,A,B
2020-01-01,0,5
2020-01-03,1,0
2020-01-04,2,3

Unnamed: 0,A,B
2020-01-03,1,0
2020-01-04,2,3
2020-01-05,3,3

Unnamed: 0,A,B
2020-01-04,2,3
2020-01-05,3,3
2020-01-29,4,7


In [67]:
# 2. Time-based window using an offset
# NOTE: it requires a time-based index; each window spans an interval of
# 3 days (3D), which produces variable-size windows
windowed = df.rolling(window="3D")
display_windowed(windowed)

Unnamed: 0,A,B
2020-01-01,0,5

Unnamed: 0,A,B
2020-01-01,0,5
2020-01-03,1,0

Unnamed: 0,A,B
2020-01-03,1,0
2020-01-04,2,3

Unnamed: 0,A,B
2020-01-03,1,0
2020-01-04,2,3
2020-01-05,3,3

Unnamed: 0,A,B
2020-01-29,4,7


### Custom Indexer
A custom window uses a `BaseIndexer` subclass, which lets you define a custom `get_window_bounds` method that calculates the bounds of each window.
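
To see what such bounds look like, the sketch below calls `get_window_bounds` on the built-in `FixedForwardWindowIndexer` (a ready-made `BaseIndexer` subclass) instead of writing a new subclass. The window size of 2 and the five-row series are arbitrary illustration values.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

indexer = FixedForwardWindowIndexer(window_size=2)

# start[i] and end[i] delimit window i as the half-open slice [start, end).
start, end = indexer.get_window_bounds(num_values=5)

s = pd.Series(range(5))

# Aggregating the slices by hand should match what rolling() does with the indexer.
manual = [float(s.iloc[a:b].sum()) for a, b in zip(start, end)]
builtin = s.rolling(indexer, min_periods=1).sum().tolist()

print(list(zip(start.tolist(), end.tolist())))
print(manual, builtin)
```

Here each window looks forward (row *i* and the next row), and the last window is clipped at the end of the data.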

**NOTE:** by default, windows are built incrementally: window *i* takes row *i* plus the previous rows until it reaches the fixed size (for fixed windows) or covers the interval (for time-based windows).

In [14]:
# center=True: window i is centered on row i instead of ending at it
windowed = df.rolling(window=3, center=True)
display_windowed(windowed)

Unnamed: 0,A,B
2020-01-01,0,5
2020-01-03,1,0

Unnamed: 0,A,B
2020-01-01,0,5
2020-01-03,1,0
2020-01-04,2,3

Unnamed: 0,A,B
2020-01-03,1,0
2020-01-04,2,3
2020-01-05,3,3

Unnamed: 0,A,B
2020-01-04,2,3
2020-01-05,3,3
2020-01-29,4,7

Unnamed: 0,A,B
2020-01-05,3,3
2020-01-29,4,7


### Window endpoints and `closed` parameter

The `closed` parameter lets us include or exclude the endpoints of our windows.

- `closed='right'` includes the right endpoint but excludes the left one (default).
- `closed='left'` includes the left endpoint but excludes the right one.
- `closed='both'` includes both endpoints.
- `closed='neither'` excludes both endpoints.

The following picture shows the endpoints of a window (with fixed size 3, `window=3`) and the effect of the `closed` parameter. Remember that the right endpoint is included by default. In other words, `closed='right'` is the default behavior.

<img src="./assets/imgs/window_endpoint.jpg" width="500"/>

**NOTE:** although the window size is fixed, `both` and `neither` change the number of rows in each window, regardless of the fixed size 3. For example, in the image above `both` returns a window of size 4 and `neither` a window of size 2.

**NOTE:** with time-based windows the behavior is the same, but remember that the window size is variable, so windows may be larger or smaller depending on the amount of data in each time interval.
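
The effect of `closed` on the number of rows per window can be checked with a short sketch. It assumes a pandas version that accepts every `closed` option for fixed windows, and it uses a toy frame with a single column `A`.

```python
import pandas as pd

df = pd.DataFrame({"A": range(6)})

# Number of rows in each window for every value of `closed`.
sizes = {
    closed: [len(win) for win in df.rolling(window=3, closed=closed)]
    for closed in ("right", "left", "both", "neither")
}

for closed, window_sizes in sizes.items():
    print(f"{closed:>8}: {window_sizes}")
```

With `both` the windows grow to 4 rows and with `neither` they never exceed 2 rows, even though `window=3` in every case.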



In [15]:
df = pd.DataFrame({'A': range(6)})
df

Unnamed: 0,A
0,0
1,1
2,2
3,3
4,4
5,5


In [16]:
# 1. default closed='right'
# NOTE: the example in the picture focuses on window 3
rolling_window = df.rolling(window=3, closed='right')
display_windowed(rolling_window)

Unnamed: 0,A
0,0

Unnamed: 0,A
0,0
1,1

Unnamed: 0,A
0,0
1,1
2,2

Unnamed: 0,A
1,1
2,2
3,3

Unnamed: 0,A
2,2
3,3
4,4

Unnamed: 0,A
3,3
4,4
5,5


In [17]:
# 2. closed='left'
rolling_window = df.rolling(window=3, closed='left')
display_windowed(rolling_window)

Unnamed: 0,A

Unnamed: 0,A
0,0

Unnamed: 0,A
0,0
1,1

Unnamed: 0,A
0,0
1,1
2,2

Unnamed: 0,A
1,1
2,2
3,3

Unnamed: 0,A
2,2
3,3
4,4


In [18]:
# 3. closed='both'
rolling_window = df.rolling(window=3, closed='both')
display_windowed(rolling_window)

Unnamed: 0,A
0,0

Unnamed: 0,A
0,0
1,1

Unnamed: 0,A
0,0
1,1
2,2

Unnamed: 0,A
0,0
1,1
2,2
3,3

Unnamed: 0,A
1,1
2,2
3,3
4,4

Unnamed: 0,A
2,2
3,3
4,4
5,5


In [19]:
# 4. closed='neither'
rolling_window = df.rolling(window=3, closed='neither')
display_windowed(rolling_window)

Unnamed: 0,A

Unnamed: 0,A
0,0

Unnamed: 0,A
0,0
1,1

Unnamed: 0,A
1,1
2,2

Unnamed: 0,A
2,2
3,3

Unnamed: 0,A
3,3
4,4


In [20]:
from pandas.api.indexers import VariableOffsetWindowIndexer

df = pd.DataFrame(range(10), index=pd.date_range("2020", periods=10))

offset = pd.offsets.BDay(1)

indexer = VariableOffsetWindowIndexer(index=df.index, offset=offset)

df

Unnamed: 0,0
2020-01-01,0
2020-01-02,1
2020-01-03,2
2020-01-04,3
2020-01-05,4
2020-01-06,5
2020-01-07,6
2020-01-08,7
2020-01-09,8
2020-01-10,9


In [21]:
from pandas.api.indexers import FixedForwardWindowIndexer

indexer = FixedForwardWindowIndexer(window_size=2)

df.rolling(indexer, min_periods=1).sum()

Unnamed: 0,0
2020-01-01,1.0
2020-01-02,3.0
2020-01-03,5.0
2020-01-04,7.0
2020-01-05,9.0
2020-01-06,11.0
2020-01-07,13.0
2020-01-08,15.0
2020-01-09,17.0
2020-01-10,9.0


In [22]:
df = pd.DataFrame(
    data=[
        [pd.Timestamp("2018-01-01 00:00:00"), 100],
        [pd.Timestamp("2018-01-01 00:00:01"), 101],
        [pd.Timestamp("2018-01-01 00:00:03"), 103],
        [pd.Timestamp("2018-01-01 00:00:04"), 111],
    ],
    columns=["time", "value"],
).set_index("time")


df

Unnamed: 0_level_0,value
time,Unnamed: 1_level_1
2018-01-01 00:00:00,100
2018-01-01 00:00:01,101
2018-01-01 00:00:03,103
2018-01-01 00:00:04,111


In [23]:
df["value2"] = df["value"] * 2

In [24]:
def mad(x):
    print(x)
    return np.fabs(x - x.mean()).mean()

df.rolling(window=4).apply(mad, raw=False)

time
2018-01-01 00:00:00    100.0
2018-01-01 00:00:01    101.0
2018-01-01 00:00:03    103.0
2018-01-01 00:00:04    111.0
dtype: float64
time
2018-01-01 00:00:00    200.0
2018-01-01 00:00:01    202.0
2018-01-01 00:00:03    206.0
2018-01-01 00:00:04    222.0
dtype: float64


Unnamed: 0_level_0,value,value2
time,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-01-01 00:00:00,,
2018-01-01 00:00:01,,
2018-01-01 00:00:03,,
2018-01-01 00:00:04,3.625,7.25


In [25]:
df[::-1]

Unnamed: 0_level_0,value,value2
time,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-01-01 00:00:04,111,222
2018-01-01 00:00:03,103,206
2018-01-01 00:00:01,101,202
2018-01-01 00:00:00,100,200


In [51]:
df = pd.DataFrame(
    np.arange(12).reshape((3,4)).T,
    columns=["A", "B", "C"],
)
s = pd.Series([0, 1, 2, 3])

df

Unnamed: 0,A,B,C
0,0,4,8
1,1,5,9
2,2,6,10
3,3,7,11


In [44]:
display_windowed(df.rolling(window=2))

Unnamed: 0,A,B,C
0,0,4,8

Unnamed: 0,A,B,C
0,0,4,8
1,1,5,9

Unnamed: 0,A,B,C
1,1,5,9
2,2,6,10

Unnamed: 0,A,B,C
2,2,6,10
3,3,7,11


In [47]:
np.corrcoef([4,5],[0,1])

array([[1., 1.],
       [1., 1.]])

In [52]:
# NOTE: corr (and cov) calculates the correlation between each column of the
# window of df and the series s. Remember that the pairing between a window
# column and the series s is done by matching indexes.


df.rolling(window=2).corr(s)

Unnamed: 0,A,B,C
0,,,
1,1.0,1.0,1.0
2,1.0,1.0,1.0
3,1.0,1.0,1.0
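
The windows above are perfectly linear, so every correlation is 1.0. As a sanity check on less trivial data, the sketch below compares one rolling correlation with `np.corrcoef` computed on the same rows. The series `a` and `b` are made-up values, and the check covers only the last window.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, 4.0, 3.0, 6.0])
b = pd.Series([2.0, 1.0, 3.0, 5.0, 4.0])

# Rolling correlation between a and b, paired by matching index labels.
rolling_corr = a.rolling(window=3).corr(b)

# Recompute the last window (rows 2..4) directly with NumPy.
expected_last = np.corrcoef(a.iloc[2:], b.iloc[2:])[0, 1]

print(rolling_corr.iloc[-1], expected_last)
```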



In [20]:

offset = pd.offsets.BDay(1)
offset


<BusinessDay>

In [98]:
df = pd.DataFrame(
    {"x": 1},
    index=[
        pd.Timestamp("20130101 09:00:01"),
        pd.Timestamp("20130101 09:00:02"),
        pd.Timestamp("20130101 09:00:03"),
        pd.Timestamp("20130101 09:00:04"),
        pd.Timestamp("20130101 09:00:06"),
    ],
)
df

Unnamed: 0,x
2013-01-01 09:00:01,1
2013-01-01 09:00:02,1
2013-01-01 09:00:03,1
2013-01-01 09:00:04,1
2013-01-01 09:00:06,1


In [107]:
windowed = df.rolling("2s", closed='left')
display_windowed(windowed)

Unnamed: 0,x

Unnamed: 0,x
2013-01-01 09:00:01,1

Unnamed: 0,x
2013-01-01 09:00:01,1
2013-01-01 09:00:02,1

Unnamed: 0,x
2013-01-01 09:00:02,1
2013-01-01 09:00:03,1

Unnamed: 0,x
2013-01-01 09:00:04,1


In [10]:
df = pd.DataFrame({'A': ['a', 'b', 'a', 'b', 'a'], 'B': range(5)})

result = df.groupby('A').expanding().sum()

display_side_by_side(df, result)

Unnamed: 0,A,B
0,a,0
1,b,1
2,a,2
3,b,3
4,a,4

Unnamed: 0_level_0,Unnamed: 1_level_0,B
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0,0.0
a,2,2.0
a,4,6.0
b,1,1.0
b,3,4.0


In [11]:
display_windowed(df.groupby('A').expanding())

Unnamed: 0,B
0,0


Unnamed: 0,B
0,0
2,2


Unnamed: 0,B
0,0
2,2
4,4


Unnamed: 0,B
1,1


Unnamed: 0,B
1,1
3,3


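**NOTE:** per the pandas documentation, windowing operations currently support only numeric data (integer and float) and will always return float64 values.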
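
A quick check of the float64 remark, using a made-up integer series `ints`: even though the input holds integers, the rolling and expanding results come back as floats.

```python
import pandas as pd

ints = pd.Series([1, 2, 3, 4])  # integer input

rolled = ints.rolling(window=2).sum()
expanded = ints.expanding().sum()

print(ints.dtype, rolled.dtype, expanded.dtype)
```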