# Time Series Data
* We've already seen pandas can handle date/time formats
* Time series data adds new manipulation options to our data, and pandas was actually developed with time series data in mind.

## Resampling 
* the process of converting a time series from one frequency to another.
  * downsampling: going from a high frequency (e.g. daily) to a lower frequency (e.g. weekly)
  * upsampling: going from a lower frequency to higher frequency
  * remapping: aligning data to a set frequency (e.g. mapping weekly data to Sundays)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# We can create date ranges with
pd.date_range?
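As a quick illustration of what `pd.date_range` accepts, here is a minimal sketch. The variable names are only for illustration; the point is that `start`, `end`, `periods`, and `freq` can be combined in several ways.

```python
import pandas as pd

# start + end, default daily frequency
daily = pd.date_range(start="2018-01-01", end="2018-01-07")

# start + number of periods + an explicit weekly frequency (anchored on Sunday)
weekly = pd.date_range(start="2018-01-01", periods=3, freq="W")

# start + number of periods at a 15-minute frequency
quarter_hours = pd.date_range(start="2018-01-01", periods=4, freq="15min")

print(daily)
print(weekly)
print(quarter_hours)
```

Since 1 January 2018 is a Monday, the first weekly timestamp is the following Sunday, 7 January.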

In [2]:
# Some sample data
dates = pd.date_range(start='1/1/2018', end='05/31/2018')
ts = pd.Series(np.random.randn(len(dates)), index=dates)
ts.head()

2018-01-01   -0.702027
2018-01-02   -0.562127
2018-01-03   -0.389513
2018-01-04    0.202224
2018-01-05    2.364082
Freq: D, dtype: float64

In [3]:
# You need a date or time index on your dataframe to do some resampling
# When we resample we need to determine the new frequency we want and how we want to resample
# Let's change our daily data down to weekly data
resampler=ts.resample('W')
resampler

<pandas.core.resample.DatetimeIndexResampler object at 0x7fb4f65353c8>

In [4]:
# Just like groupby, this is an object which will do the resampling for us
# Since we are downsampling (D->W) we need to decide how to aggregate our datapoints
# We are now very used to this!
resampler.apply(np.mean).head()

2018-01-07    0.153167
2018-01-14   -0.506203
2018-01-21    0.276566
2018-01-28    0.146692
2018-02-04    0.096839
Freq: W-SUN, dtype: float64
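To see which days land in each weekly bin, a small sketch with deterministic values may help. It assumes the default `'W'` behaviour (weeks anchored on Sunday, bins closed and labelled on the right); `days` and `values` are illustrative names, not part of the notebook.

```python
import pandas as pd

# 14 days starting on Monday 2018-01-01, with values 1..14
days = pd.date_range(start="2018-01-01", periods=14, freq="D")
values = pd.Series(range(1, 15), index=days)

weekly_sum = values.resample("W").sum()
print(weekly_sum)
# Each label is the Sunday that ends the week:
# 2018-01-07 holds Mon 1 Jan .. Sun 7 Jan   (1 + 2 + ... + 7  = 28)
# 2018-01-14 holds Mon 8 Jan .. Sun 14 Jan  (8 + 9 + ... + 14 = 77)
```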

* Notice the frequency is now W-SUN (weekly, anchored on Sunday: each bin ends on a Sunday and is labeled with that date)
* When we downsample we are "binning" our values and need to determine which end of the bin is open/closed
* By default the right side is **closed** for weekly binning, which we did here
  * Closed vs. open can be confusing! For example, with daily bins, is an observation at exactly midnight (00:00:00) on October 13, 2020 a Tuesday observation, or a Monday observation?
  * If you have defined things as left closed, then it's Tuesday (the bin includes its starting midnight). If you defined them as right closed, then it's Monday (the bin includes its ending midnight).
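The midnight question can be checked directly. The sketch below uses three made-up observations (the values 1, 10 and 100 are arbitrary) and daily bins; `label="left"` is passed explicitly in both calls so that the bin names are comparable.

```python
import pandas as pd

# One observation exactly at midnight on Tuesday 2020-10-13,
# plus one at noon on Monday and one at noon on Tuesday.
obs = pd.Series(
    [1, 10, 100],
    index=pd.to_datetime(
        ["2020-10-12 12:00:00", "2020-10-13 00:00:00", "2020-10-13 12:00:00"]
    ),
)

# Same daily bins, named by their left edge; only the closed side differs
left_closed = obs.resample("D", closed="left", label="left").sum()
right_closed = obs.resample("D", closed="right", label="left").sum()

print(left_closed)   # the midnight value (10) is counted with Tuesday
print(right_closed)  # the midnight value (10) is counted with Monday
```

With left-closed bins the midnight value joins Tuesday's noon value; with right-closed bins it joins Monday's.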

# Here's an example
* if you have a bunch of time sampled data in seconds and you are downsampling to minutes then:
  * if you are **left closed** you are saying "each minute bin holds the values that are **>=** its starting whole minute and **<** the next whole minute"
  * if you are **right closed** you are saying "each minute bin holds the values that are **>** its starting whole minute and **<=** the next whole minute"
* clear as a No.3 Dark Maple Syrup, eh?

In [5]:
# Let's look at 9 seconds which cross the minute boundary
# (recent pandas versions prefer the lowercase aliases 's' and 'min' over 'S' and 'T')
index = pd.date_range('10/13/2020 12:59:55', periods=9, freq='S')
series = pd.Series(range(9), index=index)
series

2020-10-13 12:59:55    0
2020-10-13 12:59:56    1
2020-10-13 12:59:57    2
2020-10-13 12:59:58    3
2020-10-13 12:59:59    4
2020-10-13 13:00:00    5
2020-10-13 13:00:01    6
2020-10-13 13:00:02    7
2020-10-13 13:00:03    8
Freq: S, dtype: int64

In [6]:
# if we resample this to 1 minute intervals closed on the left 
# then the first five observations (12:59:55 to 12:59:59) fall in the 12:59 bin (< 13:00:00)
series.resample('1T',closed="left").apply(np.max)

2020-10-13 12:59:00    4
2020-10-13 13:00:00    8
Freq: T, dtype: int64

In [7]:
# if we resample this to 1 minute intervals closed on the right 
# then the first six observations (12:59:55 to 13:00:00) fall in the 12:59 bin (<= 13:00:00)
series.resample('1T',closed="right").apply(np.max)

2020-10-13 12:59:00    5
2020-10-13 13:00:00    8
Freq: T, dtype: int64

<a href="https://stackoverflow.com/questions/48340463/how-to-understand-closed-and-label-arguments-in-pandas-resample-method">https://stackoverflow.com/questions/48340463/how-to-understand-closed-and-label-arguments-in-pandas-resample-method</a>
<img src="https://i.stack.imgur.com/nX6yv.png"></img>
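The linked discussion covers the `label` argument as well as `closed`. As a rough sketch of the difference: `closed` decides which observations belong to a bin, while `label` only decides which edge of the bin is used as its name. The example rebuilds the nine-second series from above so that it runs on its own.

```python
import pandas as pd

# Same nine one-second observations crossing 13:00:00
index = pd.date_range("2020-10-13 12:59:55", periods=9, freq="s")
series = pd.Series(range(9), index=index)

# Identical right-closed bins, named by their left edge vs. their right edge
left_label = series.resample("1min", closed="right", label="left").max()
right_label = series.resample("1min", closed="right", label="right").max()

print(left_label)
print(right_label)
```

The maxima are the same in both results; only the timestamps attached to them move by one minute.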

In [8]:
# another example, with 12 periods in minute chunks
ts = pd.Series(np.arange(12), index=pd.date_range(start='1/1/2018', periods=12,freq='T'))
ts

2018-01-01 00:00:00     0
2018-01-01 00:01:00     1
2018-01-01 00:02:00     2
2018-01-01 00:03:00     3
2018-01-01 00:04:00     4
2018-01-01 00:05:00     5
2018-01-01 00:06:00     6
2018-01-01 00:07:00     7
2018-01-01 00:08:00     8
2018-01-01 00:09:00     9
2018-01-01 00:10:00    10
2018-01-01 00:11:00    11
Freq: T, dtype: int64

In [9]:
# what do you think will happen if we resample to the nearest whole 5 minute mark but close left?
# look at the data and make a prediction before running the cell
ts.resample("5min", closed='left').apply(np.sum)

2018-01-01 00:00:00    10
2018-01-01 00:05:00    35
2018-01-01 00:10:00    21
Freq: 5T, dtype: int64

In [11]:
# what do you think will happen if we resample to the nearest whole 5 minute mark but close right?
# look at the data and make a prediction before running the cell
ts.resample("5min", closed='right').apply(np.sum)

2017-12-31 23:55:00     0
2018-01-01 00:00:00    15
2018-01-01 00:05:00    40
2018-01-01 00:10:00    11
Freq: 5T, dtype: int64
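One way to convince yourself of these results is to rebuild the bins by hand. The sketch below is not how pandas implements `resample`; it only mimics the two cases above by rounding each timestamp to a bin label with `floor` and `ceil` and then grouping on that label.

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(12),
               index=pd.date_range(start="2018-01-01", periods=12, freq="min"))
five_min = pd.Timedelta("5min")

# Left closed, left labelled: each timestamp rounds DOWN to its bin's left edge
left_bins = ts.groupby(ts.index.floor("5min")).sum()

# Right closed, left labelled: round UP to the right edge, then step back one bin
right_bins = ts.groupby(ts.index.ceil("5min") - five_min).sum()

print(left_bins)
print(right_bins)
```

The observation at exactly 00:00:00 is already on a bin edge, so in the right-closed case it belongs to the bin that started at 23:55 the previous day, which is why that extra row appears.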

In [12]:
# Also, downsampling really is an aggregation exercise, so you can do all sorts of things
# What do you think this does in real language?
ts.resample('2min').apply(lambda x: pd.Series({"mean":np.mean(x),"max":np.max(x)})).unstack()

                     mean   max
2018-01-01 00:00:00   0.5   1.0
2018-01-01 00:02:00   2.5   3.0
2018-01-01 00:04:00   4.5   5.0
2018-01-01 00:06:00   6.5   7.0
2018-01-01 00:08:00   8.5   9.0
2018-01-01 00:10:00  10.5  11.0
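A similar table can be produced more directly by passing a list of aggregation names to `agg`. This is an alternative to the lambda above, not a replacement for it; the series is rebuilt here so the sketch runs on its own.

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(12),
               index=pd.date_range(start="2018-01-01", periods=12, freq="min"))

# One row per 2-minute bin, one column per aggregation
summary = ts.resample("2min").agg(["mean", "max"])
print(summary)
```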


In [13]:
# Inline activity!
df=pd.read_csv('datasets/si330_dstat.csv',skiprows=5)
df.head()
# How do we generate a dataframe which shows the 30 second averages and the 
# standard deviations of the idl (CPU idle) time?


          epoch    usr    sys     idl    wai    stl
0  1602536000.0  0.497  0.106  99.347  0.049  0.001
1  1602536000.0  0.438  0.312   99.25    0.0    0.0
2  1602536000.0  0.125  0.125   99.75    0.0    0.0
3  1602536000.0  0.125  0.063  99.812    0.0    0.0
4  1602536000.0  0.125  0.063  99.812    0.0    0.0
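One possible approach to the activity is sketched below on synthetic data, since the CSV itself is not reproduced here. It assumes that `epoch` holds Unix time in seconds and that `idl` is the CPU idle percentage, as the preview suggests; the `fake` frame and its values are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dstat file: one row per second for two minutes.
# The real data would come from pd.read_csv(...) as in the cell above.
rng = np.random.default_rng(0)
fake = pd.DataFrame({
    "epoch": 1602536000 + np.arange(120),        # assumed: Unix time in seconds
    "idl": 99 + rng.uniform(-1, 1, size=120),    # assumed: CPU idle percentage
})

# Turn the epoch column into a DatetimeIndex so that resample() can be used
fake.index = pd.to_datetime(fake["epoch"], unit="s")

# 30-second bins, with the mean and standard deviation of idle time per bin
idl_stats = fake["idl"].resample("30s").agg(["mean", "std"])
print(idl_stats)
```

The key step is the conversion from epoch seconds to a date/time index; without it `resample` has nothing to bin on.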


In [None]:
# With upsampling there is no need to aggregate. 

# let's create a dataframe, with two weekly indices, and four columns. First the 
# indices
dates = pd.date_range(start='1/1/2018', periods=2, freq='W')
# now let's fill in the DataFrame
df = pd.DataFrame(np.random.randn(2,4), index=dates, 
                  columns=['col1','col2','col3','col4'])
df.head()

In [None]:
# Now we upsample from weekly frequency to daily frequency
df_daily = df.resample('D').asfreq()
df_daily.head()

In [None]:
# As you notice, there will be NaN values, so let's fill them in
# using forward fill (ffill) or backward fill (bfill)
df.resample('D').ffill()

In [None]:
# We can also choose to only fill a certain number of periods, by using the limit 
# parameter in the ffill() function. For instance, here, we are filling at most 
# three consecutive missing observations after each known value
df.resample('D').ffill(limit=3)
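Because the cells above use random numbers, the effect of each fill strategy is easier to see with known values. The sketch below uses a single made-up column going from 0 to 7 and puts the strategies side by side; `interpolate()` is not used in the cells above and is included only as a comparison with forward filling.

```python
import pandas as pd

# Two weekly observations with known values
dates = pd.date_range(start="2018-01-01", periods=2, freq="W")
weekly = pd.DataFrame({"col1": [0.0, 7.0]}, index=dates)

as_is = weekly.resample("D").asfreq()           # new rows are NaN
filled = weekly.resample("D").ffill()           # carry the last value forward
limited = weekly.resample("D").ffill(limit=3)   # fill at most 3 rows per gap
linear = weekly.resample("D").interpolate()     # straight line between points

print(pd.concat(
    {"asfreq": as_is["col1"], "ffill": filled["col1"],
     "ffill(limit=3)": limited["col1"], "interpolate": linear["col1"]},
    axis=1,
))
```

Forward filling repeats the last known value, whereas linear interpolation estimates values between the two known points, so the two answer different questions about the missing days.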

# Working with time series data
* we've now seen downsampling and upsampling, and have a better sense of how date ranges are handled in pandas
* let's go back to a favorite dataset of ours, which has lots of interesting time series data in it, and explore a bit

In [None]:
df=pd.read_excel("datasets/AnnArbor-TicketViolation2016.xls",skiprows=1)
print(df.columns)
df.head()

In [None]:
# First up, let's create a date/time index. We have an "Issue Date " column and 
# an "IssueTime" column (assumed here to hold the time as an HHMM number, e.g. 930)
def clean_time(x):
    # left-pad to four digits so that e.g. 930 becomes "0930" and 5 becomes "0005"
    issue_time=str(x["IssueTime"]).zfill(4)
    # the first 11 characters of the date string are "YYYY-MM-DD " (with the space)
    date_time="{}{}:{}".format(
        str(x["Issue Date "])[0:11], 
        issue_time[0:2], 
        issue_time[-2:])
    return pd.to_datetime(date_time, format='%Y-%m-%d %H:%M')
df=df.set_index(df[["Issue Date ","IssueTime"]].apply(clean_time, axis=1))
df.head()
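Row-wise `apply` works but can be slow on large files. A vectorized alternative is sketched below on a tiny made-up frame; it assumes the date column is already parsed as datetimes and that the time column is an integer in HHMM form, which may not hold exactly for the real spreadsheet.

```python
import pandas as pd

# Hypothetical stand-in for the ticket data: a date column and an HHMM integer
tickets = pd.DataFrame({
    "Issue Date ": pd.to_datetime(["2016-01-04", "2016-01-04", "2016-01-05"]),
    "IssueTime": [5, 930, 1415],   # 00:05, 09:30, 14:15
})

hhmm = tickets["IssueTime"].astype(int)
issued_at = (
    tickets["Issue Date "]
    + pd.to_timedelta(hhmm // 100, unit="h")    # hours part
    + pd.to_timedelta(hhmm % 100, unit="min")   # minutes part
)
tickets = tickets.set_index(pd.DatetimeIndex(issued_at))
print(tickets)
```

Splitting the number with integer division and remainder avoids string padding altogether.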

In [None]:
# Now let's plot the fines over the year!
import matplotlib.pyplot as plt
df[" Fine "].plot()

In [None]:
# gah! That's meaningless. How would we find signal in that noise?
# let's zoom in on a single month, pandas does the "right thing" with date/time slicing!
df.loc["2016-01-01":"2016-02-01", " Fine "].plot()

In [None]:
# This is, btw, much cooler than it seems at first blush, check this out
df.index < "2016-01-03"
# WOW!

In [None]:
# so this means we can use date/times as masks!
df[df.index<"2016-02"].head()
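The same masking idea can be tried without the ticket file. The sketch below uses a made-up hourly frame (`demo`) and also shows partial string indexing with `.loc`, where a date string selects every row that falls on that day.

```python
import pandas as pd

# Hypothetical hourly data covering the first five days of 2016
idx = pd.date_range("2016-01-01", periods=5 * 24, freq="h")
demo = pd.DataFrame({"value": range(len(idx))}, index=idx)

# Comparing the index with a date string yields a boolean array (a mask)
mask = demo.index < "2016-01-03"
before_third = demo[mask]

# Partial string indexing: every row whose timestamp falls on 2 January 2016
second_only = demo.loc["2016-01-02"]

print(mask.sum(), len(before_third), len(second_only))
```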

In [None]:
# Now let's resample this and look at daily totals
df.loc["2016-01-01":"2016-02-01", " Fine "].resample("1D").apply(np.sum).plot()

In [None]:
# January 10th 2016 was a Sunday! It looks pretty clear that very few tickets 
# are given out on Sundays!
# Also, David Bowie died on this day. :(
# How do things change if we look at mean values?
df.loc["2016-01-01":"2016-02-01", " Fine "].resample("1D").apply(np.mean).plot()

In [None]:
# We could also look at tickets per hour in a single week
df.loc["2016-01-11":"2016-01-18", " Fine "].resample("1H").apply(len).plot()

In [None]:
# That 13th-14th has some big values, let's zoom in a bit
df.loc["2016-01-13":"2016-01-14", " Fine "].resample("15T").apply(len).plot()

In [None]:
# We can also explore multiple series of data plotted on the same chart by calling plot()
# several times in a single cell
df.loc["2016-01-13":"2016-01-14", " Fine "].resample("15T").apply(len).plot()
df.loc["2016-01-13":"2016-01-14", " Fine "].resample("60T").apply(len).plot()
df.loc["2016-01-13":"2016-01-14", " Fine "].resample("180T").apply(len).plot()
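When several lines share a chart, a legend makes it clear which is which. The sketch below swaps in plain matplotlib calls on an explicit axes object rather than the pandas `plot()` shortcut used above, and it runs on made-up ticket timestamps (`fines` is hypothetical data, not the Ann Arbor file).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical ticket timestamps: 500 random minutes within two days
rng = np.random.default_rng(0)
minutes = np.sort(rng.integers(0, 2 * 24 * 60, size=500))
stamps = pd.Timestamp("2016-01-13") + pd.to_timedelta(minutes, unit="min")
fines = pd.Series(25, index=stamps)

fig, ax = plt.subplots()
for freq in ["15min", "60min", "180min"]:
    counts = fines.resample(freq).size()   # number of tickets in each bin
    ax.plot(counts.index, counts.values, label=freq)
ax.set_xlabel("time")
ax.set_ylabel("tickets per bin")
ax.legend()
```

Wider bins collect more tickets each, so the three lines sit at different heights; dividing by the bin width would put them on a comparable per-minute scale.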