## Moving Window Functions
* Sometimes we want to manipulate a chunk of a DataFrame by time period (e.g. 5 minutes) instead of by group
* We can do this by indexing into the DataFrame with a date range comparison
* Let's look at an example from the stock market, using Amazon (AMZN) prices
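As a minimal sketch of the date range comparison mentioned above, here is a tiny synthetic example. The prices and the `High` column values are made up for illustration and are not taken from the AMZN dataset.

```python
import pandas as pd

# Synthetic daily prices (illustrative values, not real AMZN data)
idx = pd.date_range("2019-12-30", periods=6, freq="D")
prices = pd.DataFrame({"High": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]}, index=idx)

# A boolean comparison against a date string keeps rows strictly after that date
after_new_year = prices[prices.index > "2020-01-01"]

# Label-based slicing on a sorted DatetimeIndex includes both endpoints
first_days = prices.loc["2020-01-01":"2020-01-03"]

print(after_new_year)
print(first_days)
```

The comparison form is what the notebook uses next; the `.loc` slice is an alternative when you want a closed range.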


In [None]:
import pandas as pd
df = pd.read_csv("datasets/AMZN.csv")
df.head()

In [None]:
# Let's look at the daily highs and lows, indexed by date
amzn = df[["High", "Low"]].set_index(pd.to_datetime(df["Date"]))
# keep only the rows dated after 2020-01-01
amzn = amzn[amzn.index > "2020-01-01"]
import matplotlib.pyplot as plt
amzn.plot()
# notice we have two lines, one per column!

* The `rolling()` method lets us specify a window size and apply a function to each window of that many observations along our time series
* Close your eyes and imagine one of those stock prices over time. How would we calculate the 7-day moving average?
* First, what does the 7-day moving average even mean?


* Now, what's the algorithm you would use to calculate it?

In [None]:
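import numpy as np
# each window covers the 7 rows ending just before position i;
# the +1 makes sure the final window (ending at the last row) is included
for i in range(7, len(amzn) + 1):
    print(np.mean(amzn["High"].iloc[i-7:i]))

# a loop. ickers.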

* In pandas, the `rolling` function allows us to do this with arbitrary window sizes and a function
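To make the connection between the loop and `rolling` concrete, here is a small sketch on a made-up Series. The values and the "spread" function are illustrative assumptions; the point is only that `rolling(n)` reproduces the explicit loop and also accepts an arbitrary function through `apply`.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration
s = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0])

# Built-in aggregation: the first two entries are NaN because the window is not full yet
rolling_mean = s.rolling(3).mean()

# Arbitrary function: spread (max - min) inside each 3-value window
# raw=True passes each window as a NumPy array
rolling_spread = s.rolling(3).apply(lambda w: w.max() - w.min(), raw=True)

# The equivalent explicit loop over every full window
manual = [s.iloc[i - 3:i].mean() for i in range(3, len(s) + 1)]

print(rolling_mean)
print(rolling_spread)
print(manual)
```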

In [None]:
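# Here, let's do a 7 day rolling window on the dataframe (7 rows, i.e. 7 trading days)
# Note the output we get back: the first 6 rows are NaN because their windows are incomplete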
amzn.rolling(7).apply(np.mean)

In [None]:
# there are the usual benefits (faster, less error prone, not icky) versus iteration
# we can also control those NaNs with min_periods, the minimum number of
# observations that must be present in a window before the function is applied
amzn.rolling(7,min_periods=1).apply(np.mean).plot()
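The effect of `min_periods` is easier to see on a handful of numbers than on a plot. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up values for illustration
s = pd.Series([2.0, 4.0, 6.0, 8.0])

# Default: a window of 3 needs 3 observations, so the first two results are NaN
strict = s.rolling(3).mean()

# min_periods=1: partial windows at the start are averaged over whatever is available
relaxed = s.rolling(3, min_periods=1).mean()

print(strict)
print(relaxed)
```

Note that the early values in `relaxed` are averages over fewer points, so they are noisier than the rest of the series.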

In [None]:
# here's a 30 day moving average
amzn.rolling(30,min_periods=1).apply(np.mean).plot()

# Time Series and Gaps
What do we do when we have missing data in a time series?

In [None]:
import pandas as pd
# let's look at some data on reported cases of measles in England and Wales
# (sep=r"\s+" splits on whitespace; delim_whitespace is deprecated in recent pandas)
df = pd.read_csv("datasets/ewcitmeas.txt", sep=r"\s+", dtype=float, na_values="*")
df.head()

In [None]:
# what a painful date time format! Welcome to my world!
df.rename(columns={'DD': 'day', 'MM': 'month', 'YY': 'year'}, inplace=True)
df['year'] = df['year'] + 1900
df=df.set_index(pd.to_datetime(df[['year', 'month', 'day']])).drop(["day","month","year"], axis='columns')
df.head()

In [None]:
%matplotlib inline
# we can setup some features for matplotlib here
#import matplotlib as mpl
#mpl.rcParams['figure.figsize'] = 12, 8

In [None]:
# instead of the line plots we have been using, let's look at a scatter plot
df["London"].plot(style=".")

* We notice several things here. First, the seasonality of measles is visible. Second, the weekly resolution makes sense at some times, but at other times more frequent data collection would have helped

In [None]:
# I want to show you how to deal with holes in your data
# I'm just going to pull out 500 random observations as an example into a new Series
df2 = df["London"].sample(500)
# Now, look at the randomness of dates/values in the sample itself
display(df2.head())

In [None]:
# One way of filling holes is to resample and forward fill values
# (Resampler.fillna(method=...) is deprecated in recent pandas; use .ffill() directly)
df_ffill = df2.resample("D").ffill()
df_ffill.head()
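What resampling plus forward fill does is clearer on two points. A minimal sketch, with made-up dates and values:

```python
import pandas as pd

# Two made-up observations three days apart
obs = pd.Series(
    [10.0, 40.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-04"]),
)

# Resampling to daily frequency creates the missing days as NaN
daily = obs.resample("D").asfreq()

# Forward fill copies the last known value into each hole
filled = obs.resample("D").ffill()

print(daily)
print(filled)
```

Forward fill never invents values that were not observed, but it does turn each gap into a flat step.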

In [None]:
# let's compare these two
import matplotlib.pyplot as plt
df2.plot(style=".")
plt.figure()
df_ffill.plot(style=".")

In [None]:
# What if we applied a rolling window?
import numpy as np
df2.resample("D").asfreq().rolling(10,min_periods=1).apply(np.nanmean).dropna().plot(style=".")

* We have another great option in pandas: the `interpolate()` function
* `interpolate()` fills NaN values in different ways depending on the `method` you pass
* The `method="time"` option weights by the actual time elapsed between observations, but it only works on data at daily or higher resolution
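The difference between index-agnostic and time-aware interpolation shows up as soon as the timestamps are unevenly spaced. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up series: one missing value, with an uneven gap after it
s = pd.Series(
    [0.0, None, 30.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"]),
)

# "linear" ignores the index and treats the points as equally spaced
linear = s.interpolate(method="linear")

# "time" weights by elapsed time: Jan 2 is one third of the way from Jan 1 to Jan 4
timed = s.interpolate(method="time")

print(linear)
print(timed)
```

Here `linear` fills the hole with the midpoint of its neighbours, while `timed` fills it with a value about one third of the way between them.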

In [None]:
df2.resample("D").interpolate(method="linear").plot(style=".")

In [None]:
# polynomial interpolation is delegated to SciPy, so SciPy must be installed
df2.resample("D").interpolate(method="polynomial", order=3).plot(style=".")

In [None]:
# what does interpolate work on?
import pandas as pd
import numpy as np
df=pd.read_csv("datasets/run.csv")
df.head()

In [None]:
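# drop rows 10-19 to create a gap (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df[0:10], df[20:100]])
df = df.set_index(pd.to_datetime(df["timestamp"]))
df.head(20)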

In [None]:
df["heart_rate"].plot(style=".")

In [None]:
df.resample("1s").interpolate(method="time").head(20)

In [None]:
df.resample("1s").interpolate(method="time")["heart_rate"].plot(style=".")

In [None]:
import matplotlib.pyplot as plt
# time-weighted interpolation (blue)
df.resample("1s").interpolate(method="time")["heart_rate"].plot(style="b.")
plt.figure()
# forward fill (red); Resampler.fillna(method=...) is deprecated, so call .ffill() directly
df.resample("1s").ffill()["heart_rate"].plot(style="r.")
plt.figure()
# nearest-neighbour fill (green)
df.resample("1s").nearest()["heart_rate"].plot(style="g.")