# Windowing Operations

Pandas supports windowing operations, which work similarly to `groupby()`. A windowing operation does two things:

1. Builds a sliding partition of values (how the partition slides depends on the type of window).
2. Performs an aggregation over each sliding partition of values.
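The two steps can be reproduced by hand for a fixed-size window. The sketch below is only illustrative: the series `s` and the helper `manual_rolling_sum` are made up for this example, and it assumes a window that ends at the current row and only returns a value once it is full.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative data

def manual_rolling_sum(series, size):
    """Hypothetical helper: slide a window of `size` rows and sum each partition."""
    out = []
    for end in range(1, len(series) + 1):
        window = series.iloc[max(0, end - size):end]  # step 1: sliding partition
        # step 2: aggregate, but only once the window is full
        out.append(window.sum() if len(window) == size else np.nan)
    return pd.Series(out, index=series.index)

manual = manual_rolling_sum(s, 3)
builtin = s.rolling(window=3).sum()
print(pd.DataFrame({"manual": manual, "builtin": builtin}))
```

Both columns should agree, which suggests that `rolling(...).sum()` is doing the partition-then-aggregate work in a single call.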

A windowing operation can be applied to a `Series` or a `DataFrame`. The available window types are the following:

1. Rolling window
2. Weighted window
3. Expanding window
4. Exponentially weighted window
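As a quick orientation, three of the four window types can be compared on a small series. This is a minimal sketch on made-up data; the weighted window is left as a comment because `win_type` needs SciPy to be installed.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # illustrative data

rolling_sum = s.rolling(window=2).sum()            # fixed-size sliding window
expanding_sum = s.expanding().sum()                # window grows from the first row
ewm_mean = s.ewm(alpha=0.5, adjust=False).mean()   # exponentially decaying weights

# A weighted window would look like this (requires SciPy):
# s.rolling(window=2, win_type="triang").mean()

print(pd.DataFrame({"rolling": rolling_sum,
                    "expanding": expanding_sum,
                    "ewm": ewm_mean}))
```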

There are also some general properties that apply to all of these window types.

In [115]:
import pandas as pd
import numpy as np

np.random.seed(0)

In [116]:
## Handy functions
from IPython.display import display, display_html

def display_side_by_side(*args):
    # Render several DataFrames next to each other
    html_str = ''.join(df.to_html() for df in args)
    # Patch only the opening <table> tags so closing tags and cell text stay intact
    display_html(html_str.replace('<table', '<table style="display:inline"'), raw=True)

def display_several(*args):
    # Display each object with its default notebook representation
    for df in args:
        display(df)

def display_windowed(windowed):
    # Show every window of a windowing object side by side, labelled with
    # its position and type ("s" = Series, "df" = DataFrame)
    table_title_html = '<div style="display:inline-block; vertical-align:top; width:15%; margin:1px;"><h4>window {0} (type: {1})</h4>{2}</div>'

    html_str = ''
    for i, window in enumerate(windowed):
        if isinstance(window, pd.Series):
            kind, window = "s", window.to_frame()
        else:
            kind = "df"
        html_str += table_title_html.format(i, kind, window.to_html())

    display_html(html_str, raw=True)


## General Properties

1. It is possible to iterate over the windows.
2. All windowing operations support a `min_periods` argument. `min_periods` is the minimum number of non-NaN values a window must contain in order to return a result; otherwise the result is `NaN`.
    - default = 1 for time-based windows (or offset window size)
    - default = window size for fixed windows
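The `min_periods` rule can be checked by counting the valid values in each window. The following sketch assumes a fixed window of 3 rows and uses the same values as column `B` in the example below; the names `b` and `valid` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

b = pd.Series([np.nan, 1, 2, np.nan, np.nan, 3])

# Number of non-NaN observations in each window of 3 rows
valid = b.notna().astype(int).rolling(window=3, min_periods=1).sum()

for k in (1, 2, 3):
    result = b.rolling(window=3, min_periods=k).sum()
    # A result is produced exactly where the window holds at least k valid values
    assert result.notna().equals(valid >= k)
    print(k, result.tolist())
```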


In [117]:
df = pd.DataFrame(
    { "A": range(6),
      "B" : [np.nan, 1, 2, np.nan, np.nan, 3]
     }, 
     index=pd.date_range('2020-01-01', periods=6, freq='1D')
     )
df

Unnamed: 0,A,B
2020-01-01,0,
2020-01-02,1,1.0
2020-01-03,2,2.0
2020-01-04,3,
2020-01-05,4,
2020-01-06,5,3.0


In [118]:
# 1. Iterate over windows
# NOTE: with rolling, each window contains the current row plus as many
# preceding rows as are available, up to the fixed window size.
# For that reason, the first two windows have sizes 1 and 2.
for window in df.rolling(window = 3):
    display(window)

Unnamed: 0,A,B
2020-01-01,0,


Unnamed: 0,A,B
2020-01-01,0,
2020-01-02,1,1.0


Unnamed: 0,A,B
2020-01-01,0,
2020-01-02,1,1.0
2020-01-03,2,2.0


Unnamed: 0,A,B
2020-01-02,1,1.0
2020-01-03,2,2.0
2020-01-04,3,


Unnamed: 0,A,B
2020-01-03,2,2.0
2020-01-04,3,
2020-01-05,4,


Unnamed: 0,A,B
2020-01-04,3,
2020-01-05,4,
2020-01-06,5,3.0


In [119]:
# Or, using our handy function display_windowed, we can display the windows
# side by side
display_windowed(df.rolling(window = 3))

Unnamed: 0,A,B
2020-01-01,0,

Unnamed: 0,A,B
2020-01-01,0,
2020-01-02,1,1.0

Unnamed: 0,A,B
2020-01-01,0,
2020-01-02,1,1.0
2020-01-03,2,2.0

Unnamed: 0,A,B
2020-01-02,1,1.0
2020-01-03,2,2.0
2020-01-04,3,

Unnamed: 0,A,B
2020-01-03,2,2.0
2020-01-04,3,
2020-01-05,4,

Unnamed: 0,A,B
2020-01-04,3,
2020-01-05,4,
2020-01-06,5,3.0


In [76]:
# 2. Using min_periods (focus on column B)
# NOTE: every window shown above has at least 1 non-NaN value in column B,
# except the first one. So every row returns a result except the first one.
df.rolling(window=3, min_periods=1).sum()


Unnamed: 0,A,B
2020-01-01,0.0,
2020-01-02,1.0,1.0
2020-01-03,3.0,3.0
2020-01-04,6.0,3.0
2020-01-05,9.0,2.0
2020-01-06,12.0,3.0


In [77]:
# NOTE: in column B, windows 0, 1, 4 and 5 have fewer than 2 non-NaN values,
# so the result is NaN for those windows.
df.rolling(window=3, min_periods=2).sum()


Unnamed: 0,A,B
2020-01-01,,
2020-01-02,1.0,
2020-01-03,3.0,3.0
2020-01-04,6.0,3.0
2020-01-05,9.0,
2020-01-06,12.0,


In [78]:
# NOTE: every window has fewer than 3 non-NaN values in column B, so the
# whole column is NaN in the result.
df.rolling(window=3, min_periods=3).sum()

Unnamed: 0,A,B
2020-01-01,,
2020-01-02,,
2020-01-03,3.0,
2020-01-04,6.0,
2020-01-05,9.0,
2020-01-06,12.0,


In [79]:
# NOTE: for fixed-size windows the default min_periods is the window size
# (3 in this case), so the result matches the example above.
df.rolling(window=3).sum()

Unnamed: 0,A,B
2020-01-01,,
2020-01-02,,
2020-01-03,3.0,
2020-01-04,6.0,
2020-01-05,9.0,
2020-01-06,12.0,


In [80]:
# NOTE: for time-based windows the default min_periods is 1, so the result
# matches our first min_periods example.
df.rolling(window='3D').sum()


Unnamed: 0,A,B
2020-01-01,0.0,
2020-01-02,1.0,1.0
2020-01-03,3.0,3.0
2020-01-04,6.0,3.0
2020-01-05,9.0,2.0
2020-01-06,12.0,3.0


## Rolling Window

In [81]:
times = ['2020-01-01', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-29']

df = pd.DataFrame(
    { "A": range(5),
      "B" : np.random.randint(10, size = 5)
     }, 
     index=pd.DatetimeIndex(times)
     )
df

Unnamed: 0,A,B
2020-01-01,0,5
2020-01-03,1,0
2020-01-04,2,3
2020-01-05,3,3
2020-01-29,4,7


In [89]:
windowed = df.rolling(window=3)
display_windowed(windowed)

Unnamed: 0,A,B
2020-01-01,0,5

Unnamed: 0,A,B
2020-01-01,0,5
2020-01-03,1,0

Unnamed: 0,A,B
2020-01-01,0,5
2020-01-03,1,0
2020-01-04,2,3

Unnamed: 0,A,B
2020-01-03,1,0
2020-01-04,2,3
2020-01-05,3,3

Unnamed: 0,A,B
2020-01-04,2,3
2020-01-05,3,3
2020-01-29,4,7


In [87]:
windowed = df.rolling(window=3, center=True)
display_windowed(windowed)

Unnamed: 0,A,B
2020-01-01,0,5
2020-01-03,1,0

Unnamed: 0,A,B
2020-01-01,0,5
2020-01-03,1,0
2020-01-04,2,3

Unnamed: 0,A,B
2020-01-03,1,0
2020-01-04,2,3
2020-01-05,3,3

Unnamed: 0,A,B
2020-01-04,2,3
2020-01-05,3,3
2020-01-29,4,7

Unnamed: 0,A,B
2020-01-05,3,3
2020-01-29,4,7
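The two cells above differ only in `center`: by default a window ends at the row it is labelled with, whereas `center=True` places the label on the middle row. A small sketch on made-up data shows how this shifts the aggregated values; `min_periods=1` is an assumption made here so that the shorter edge windows still return a value.

```python
import pandas as pd

s = pd.Series(range(5))  # illustrative data

# Label on the last row of each window
trailing = s.rolling(window=3, min_periods=1).sum()
# Label on the middle row of each window
centered = s.rolling(window=3, min_periods=1, center=True).sum()

print(pd.DataFrame({"trailing": trailing, "centered": centered}))
```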


### Window endpoints and `closed` parameter

The `closed` parameter lets us include or exclude the endpoints of our windows.

- `closed='right'` includes the right endpoint but excludes the left one (default).
- `closed='left'` includes the left endpoint but excludes the right one.
- `closed='both'` includes both the left and right endpoints.
- `closed='neither'` excludes both the left and right endpoints.

The following picture shows the endpoints of a window with fixed size 3 (`window=3`) and the effect of the `closed` parameter. It is important to remember that `closed='right'` is the default behavior, so the right endpoint is included unless stated otherwise.

<img src="./assets/imgs/window_endpoint.jpg" width="500"/>

**NOTE:** although the window size is fixed, `both` and `neither` change the number of rows in each window, regardless of the nominal size 3. For example, in the image above `both` returns a window of size 4 and `neither` a window of size 2.
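One way to see these sizes is to roll over a series of ones, so that each sum equals the number of rows in the window. This is a sketch under assumptions: `min_periods=1` is set so that partial windows return a value, and empty windows (which yield NaN) are displayed as size 0.

```python
import pandas as pd

ones = pd.Series(1, index=range(6))  # one per row, so each sum is a row count

sizes = pd.DataFrame({
    closed: ones.rolling(window=3, closed=closed, min_periods=1).sum()
    for closed in ("right", "left", "both", "neither")
})

# Empty windows yield NaN; show them as size 0
print(sizes.fillna(0).astype(int))
```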

**NOTE:** with a time-based window the behavior is the same, but remember that the number of rows per window is variable: a window can be larger or smaller depending on how much data falls in each time interval.
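The variable size of time-based windows can be sketched the same way, by summing a series of ones over a `"2s"` window for each `closed` option. The timestamps are the ones used in the time-based example further down; the names `ones` and `counts` are introduced only for this illustration.

```python
import pandas as pd

idx = pd.to_datetime([
    "2013-01-01 09:00:01", "2013-01-01 09:00:02", "2013-01-01 09:00:03",
    "2013-01-01 09:00:04", "2013-01-01 09:00:06",
])
ones = pd.Series(1, index=idx)  # each sum is the number of rows in the window

counts = pd.DataFrame({
    closed: ones.rolling("2s", closed=closed).sum()
    for closed in ("right", "both", "left", "neither")
})

# NaN marks an empty window (no rows inside the interval)
print(counts)
```

The gap at `09:00:05` is what makes some windows shorter than their neighbours.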



In [140]:
df = pd.DataFrame({'A': range(6)})
df

Unnamed: 0,A
0,0
1,1
2,2
3,3
4,4
5,5


In [141]:
# 1. closed='right' (the default)
# NOTE: the example described in the picture focuses on window 3
rolling_window = df.rolling(window=3, closed='right')
display_windowed(rolling_window)

Unnamed: 0,A
0,0

Unnamed: 0,A
0,0
1,1

Unnamed: 0,A
0,0
1,1
2,2

Unnamed: 0,A
1,1
2,2
3,3

Unnamed: 0,A
2,2
3,3
4,4

Unnamed: 0,A
3,3
4,4
5,5


In [142]:
# 2. closed='left'
rolling_window = df.rolling(window=3, closed='left')
display_windowed(rolling_window)

Unnamed: 0,A

Unnamed: 0,A
0,0

Unnamed: 0,A
0,0
1,1

Unnamed: 0,A
0,0
1,1
2,2

Unnamed: 0,A
1,1
2,2
3,3

Unnamed: 0,A
2,2
3,3
4,4


In [143]:
# 3. closed='both'
rolling_window = df.rolling(window=3, closed='both')
display_windowed(rolling_window)

Unnamed: 0,A
0,0

Unnamed: 0,A
0,0
1,1

Unnamed: 0,A
0,0
1,1
2,2

Unnamed: 0,A
0,0
1,1
2,2
3,3

Unnamed: 0,A
1,1
2,2
3,3
4,4

Unnamed: 0,A
2,2
3,3
4,4
5,5


In [144]:
# 4. closed='neither'
rolling_window = df.rolling(window=3, closed='neither')
display_windowed(rolling_window)

Unnamed: 0,A

Unnamed: 0,A
0,0

Unnamed: 0,A
0,0
1,1

Unnamed: 0,A
1,1
2,2

Unnamed: 0,A
2,2
3,3

Unnamed: 0,A
3,3
4,4


In [98]:
df = pd.DataFrame(
    {"x": 1},
    index=[
        pd.Timestamp("20130101 09:00:01"),
        pd.Timestamp("20130101 09:00:02"),
        pd.Timestamp("20130101 09:00:03"),
        pd.Timestamp("20130101 09:00:04"),
        pd.Timestamp("20130101 09:00:06"),
    ],
)
df

Unnamed: 0,x
2013-01-01 09:00:01,1
2013-01-01 09:00:02,1
2013-01-01 09:00:03,1
2013-01-01 09:00:04,1
2013-01-01 09:00:06,1


In [107]:
windowed = df.rolling("2s", closed='left')
display_windowed(windowed)

Unnamed: 0,x

Unnamed: 0,x
2013-01-01 09:00:01,1

Unnamed: 0,x
2013-01-01 09:00:01,1
2013-01-01 09:00:02,1

Unnamed: 0,x
2013-01-01 09:00:02,1
2013-01-01 09:00:03,1

Unnamed: 0,x
2013-01-01 09:00:04,1


In [10]:
df = pd.DataFrame({'A': ['a', 'b', 'a', 'b', 'a'], 'B': range(5)})

result = df.groupby('A').expanding().sum()

display_side_by_side(df, result)

Unnamed: 0,A,B
0,a,0
1,b,1
2,a,2
3,b,3
4,a,4

A,Unnamed: 1,B
a,0,0.0
a,2,2.0
a,4,6.0
b,1,1.0
b,3,4.0
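An expanding sum within each group is a per-group running total, so it can be cross-checked against `groupby(...).cumsum()`. The sketch below selects column `B` explicitly and realigns the result to the original row order; the names `expanding_sum`, `cumulative` and `aligned` are introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a", "b", "a"], "B": range(5)})

# Indexed by (group key, original row label)
expanding_sum = df.groupby("A")["B"].expanding().sum()
# Indexed by the original row label
cumulative = df.groupby("A")["B"].cumsum()

# Drop the group level and restore the original row order before comparing
aligned = expanding_sum.droplevel(0).sort_index()
print(aligned.tolist(), cumulative.tolist())

# The windowing result is float even though column B holds integers
print(expanding_sum.dtype)
```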


In [11]:
display_windowed(df.groupby('A').expanding())

Unnamed: 0,B
0,0


Unnamed: 0,B
0,0
2,2


Unnamed: 0,B
0,0
2,2
4,4


Unnamed: 0,B
1,1


Unnamed: 0,B
1,1
3,3


**NOTE:** windowing aggregations will always return float64 values.