In [21]:
import pandas as pd
import numpy as np

## 1. Resampling

    This notebook starts out with time series analysis and labeling tasks.
    A common scenario is having data for every minute when you only need
    hourly data, or vice versa. Another is having data at irregular time
    intervals that you would like to turn into a regular time series.

In [3]:
rng_index = pd.date_range(start = '1/1/2011', periods = 72, freq = 'H')
df = pd.Series(list(range(len(rng_index))), index = rng_index)
df.head()

2011-01-01 00:00:00    0
2011-01-01 01:00:00    1
2011-01-01 02:00:00    2
2011-01-01 03:00:00    3
2011-01-01 04:00:00    4
Freq: H, dtype: int64

    Suppose we want to view the data above at 45-minute intervals instead of hourly. In the output below,
    asfreq drops the original data points that do not fall on the new 45-minute grid and adds timestamps
    that did not exist before. We choose to carry the value at the previous timestamp forward to each newly
    created timestamp (method='ffill'), since the original df has no value there. Therefore, even though
    old data points can be lost, they still influence the newly generated data points.

In [4]:
df_modified = df.asfreq('45min', method='ffill')
df_modified.head()

2011-01-01 00:00:00    0
2011-01-01 00:45:00    0
2011-01-01 01:30:00    1
2011-01-01 02:15:00    2
2011-01-01 03:00:00    3
Freq: 45T, dtype: int64

    Whereas ffill looks at the data point that comes before a new timestamp to fill in its value,
    bfill looks at the data point that comes after it.
    Domain knowledge can guide the choice of fill method for new data.
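
    To make the contrast concrete, here is a minimal sketch that puts both fill methods side by side
    on the same toy series. The names hourly, ffilled, bfilled and comparison are introduced only for
    this illustration, and the lowercase 'h' alias is an assumption about the installed pandas version
    (recent releases prefer it over 'H').

```python
import pandas as pd

# Same toy series as above: 72 hourly points valued 0..71.
# Assumption: lowercase 'h' is accepted as the hourly alias by the installed pandas.
rng_index = pd.date_range(start='1/1/2011', periods=72, freq='h')
hourly = pd.Series(range(len(rng_index)), index=rng_index)

# Forward fill: each new timestamp takes the most recent earlier observation.
ffilled = hourly.asfreq('45min', method='ffill')

# Backward fill: each new timestamp takes the next later observation.
bfilled = hourly.asfreq('45min', method='bfill')

# Put the two results next to each other for comparison.
comparison = pd.DataFrame({'ffill': ffilled, 'bfill': bfilled})
print(comparison.head())
```

    At 00:45, for example, ffill reports the 00:00 value while bfill reports the 01:00 value.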

In [6]:
#when you'd like to keep only the values at timestamps that already exist in the
#original data and leave the new timestamps empty (NaN), method=None does the job.
df_modified2 = df.asfreq('45min', method=None)
print(df_modified2.head())

2011-01-01 00:00:00    0.0
2011-01-01 00:45:00    NaN
2011-01-01 01:30:00    NaN
2011-01-01 02:15:00    NaN
2011-01-01 03:00:00    3.0
Freq: 45T, dtype: float64


In [8]:
#look into the docs and source for the other parameters (normalize, fill_value, etc.); ?? is IPython syntax
df.asfreq??
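
    As one example of those parameters, the sketch below uses fill_value to put a constant into the
    timestamps created by upsampling instead of NaN. The shortened 6-point series and the sentinel
    value -1 are arbitrary choices made for this illustration.

```python
import pandas as pd

# A shortened hourly series (6 points) keeps the output small.
# Assumption: lowercase 'h' is accepted as the hourly alias by the installed pandas.
rng_index = pd.date_range(start='1/1/2011', periods=6, freq='h')
hourly = pd.Series(range(len(rng_index)), index=rng_index)

# Timestamps with no counterpart in the hourly index get -1 instead of NaN.
# The sentinel -1 is arbitrary; pick one that cannot be confused with real data.
filled = hourly.asfreq('30min', fill_value=-1)
print(filled.head())
```

    Because no NaN is introduced, the integer dtype of the series is preserved here.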

In [13]:
low_freq = df.asfreq('3H')
low_freq.head()

2011-01-01 00:00:00     0
2011-01-01 03:00:00     3
2011-01-01 06:00:00     6
2011-01-01 09:00:00     9
2011-01-01 12:00:00    12
Freq: 3H, dtype: int64

In [20]:
#we might need summary statistics on the data points that are
#dropped from the original dataset. So we use resample, which
#is a more flexible version of asfreq: it returns a groupby-like
#object that supports aggregations in addition to ffill or bfill
type(df.resample('2H', label="right")) #not displayed, since only the last expression of a cell is shown
df.resample('2H', label="right").mean().head()

2011-01-01 02:00:00    0.5
2011-01-01 04:00:00    2.5
2011-01-01 06:00:00    4.5
2011-01-01 08:00:00    6.5
2011-01-01 10:00:00    8.5
Freq: 2H, dtype: float64
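
    A short sketch of what "fancier stuff" can look like: several statistics per bin in one call, and
    a check that label only changes the timestamp a bin is reported under, not its contents. The names
    hourly, summary, left_labeled and right_labeled are illustrative, and the lowercase '2h' alias is
    an assumption about the pandas version.

```python
import pandas as pd

# Same toy series as above: 72 hourly points valued 0..71.
rng_index = pd.date_range(start='1/1/2011', periods=72, freq='h')
hourly = pd.Series(range(len(rng_index)), index=rng_index)

# Group the hourly points into 2-hour bins and compute several statistics at once.
summary = hourly.resample('2h').agg(['min', 'max', 'sum', 'mean'])
print(summary.head())

# The default labels each bin with its left edge; label='right' uses the right edge.
left_labeled = hourly.resample('2h').mean()
right_labeled = hourly.resample('2h', label='right').mean()
print(left_labeled.index[0], right_labeled.index[0])
```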

## 2. Irregular Time Series and Resampling

    The code below uses np.random.choice to draw 10 random positions without replacement.
    We use these positions to select the corresponding rows
    from the original df. This creates an irregular time series, on which we can use
    resample to show its versatility.

In [37]:
#select rows by integer position explicitly with iloc, since df is indexed by timestamps
irreg_ts = df.iloc[np.random.choice(a=len(rng_index), size=10, replace=False)]

In [41]:
irreg_ts

2011-01-02 04:00:00    28
2011-01-02 19:00:00    43
2011-01-03 20:00:00    68
2011-01-01 20:00:00    20
2011-01-01 15:00:00    15
2011-01-03 14:00:00    62
2011-01-02 01:00:00    25
2011-01-01 16:00:00    16
2011-01-01 03:00:00     3
2011-01-03 15:00:00    63
dtype: int64
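
    The draw above changes on every run. If a repeatable sample is wanted, one option is a seeded
    NumPy generator, sketched below. The seed value 0 and the names hourly, positions and irregular
    are arbitrary choices for this illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same toy series as above: 72 hourly points valued 0..71.
rng_index = pd.date_range(start='1/1/2011', periods=72, freq='h')
hourly = pd.Series(range(len(rng_index)), index=rng_index)

# A seeded generator makes the draw repeatable; the seed value 0 is arbitrary.
rng = np.random.default_rng(0)
positions = rng.choice(len(hourly), size=10, replace=False)

# Select rows by integer position; the result is unsorted and irregularly spaced.
irregular = hourly.iloc[positions]
print(irregular)
```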

In [42]:
#calling asfreq on the above dataset to create a time series of daily frequency
#does not give a useful result, since pandas expects the index of a time series to be sorted.
irreg_ts.asfreq('D') #the output proves our point

2011-01-02 04:00:00    28.0
2011-01-03 04:00:00     NaN
Freq: D, dtype: float64

In [44]:
#so we bring order to the data by sorting the index
irreg_ts = irreg_ts.sort_index()

In [45]:
irreg_ts.head()

2011-01-01 03:00:00     3
2011-01-01 15:00:00    15
2011-01-01 16:00:00    16
2011-01-01 20:00:00    20
2011-01-02 01:00:00    25
dtype: int64

In [53]:
irreg_ts.asfreq('D', method='ffill')

2011-01-01 03:00:00     3
2011-01-02 03:00:00    25
2011-01-03 03:00:00    43
Freq: D, dtype: int64

In [55]:
irreg_ts.resample('1D').mean()

2011-01-01    13.500000
2011-01-02    32.000000
2011-01-03    64.333333
Freq: D, dtype: float64
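
    A daily mean hides how many observations went into each day. The sketch below rebuilds the ten
    observations printed above by hand (so that it does not depend on the random draw) and reports the
    count next to the mean. The names hours, irregular and daily are illustrative.

```python
import pandas as pd

# The ten observations printed above, expressed as hours since 2011-01-01 00:00.
# In the toy series each value equals its hour offset.
hours = [3, 15, 16, 20, 25, 28, 43, 62, 63, 68]
index = pd.Timestamp('2011-01-01') + pd.to_timedelta(hours, unit='h')
irregular = pd.Series(hours, index=index)

# One row per calendar day: number of observations and their mean.
daily = irregular.resample('D').agg(['count', 'mean'])
print(daily)
```

    The means match the output above, and the counts show that each day is backed by only three or four points.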

## 3. Exercises

In [None]:
"""
Difference between asfreq() and resample()?

asfreq() lets you upsample or downsample your time series, filling in new data
according to your choice of fill method. resample() also lets you upsample and
downsample, but it returns a groupby-like object, so it can be used to apply
aggregate statistics when downsampling your time series data, unlike asfreq().
"""

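    The difference between asfreq() and resample() can be seen on a single downsampling step. This is
    a minimal sketch on the same toy hourly series. The names hourly, picked and averaged are
    illustrative, and the lowercase '3h' alias is an assumption about the pandas version.

```python
import pandas as pd

# Same toy series as above: 72 hourly points valued 0..71.
rng_index = pd.date_range(start='1/1/2011', periods=72, freq='h')
hourly = pd.Series(range(len(rng_index)), index=rng_index)

# asfreq keeps only the values that sit exactly on the new 3-hour grid.
picked = hourly.asfreq('3h')

# resample groups all points of each 3-hour bin and aggregates them.
averaged = hourly.resample('3h').mean()

print(picked.head(3))
print(averaged.head(3))
```

    The first bin holds the values 0, 1 and 2: asfreq reports 0, while resample reports their mean, 1.0.
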
In [57]:
"""
How to partially forward fill?

Consider the following scenario: I have a regular time series whose freq=6D.
I upsample this time series to freq=1D, and I know that the values for new timestamps
can be forward filled for up to 2 days only. The rest should be filled with NaN. So, how
can we do this?
"""

#asfreq with method=None (the default, shown here explicitly), then ffill with a limit!
#here the hourly df is upsampled to 10 minutes and each value is carried forward at most 3 steps
df.asfreq('10min', method=None).ffill(limit=3).head(10)

2011-01-01 00:00:00    0.0
2011-01-01 00:10:00    0.0
2011-01-01 00:20:00    0.0
2011-01-01 00:30:00    0.0
2011-01-01 00:40:00    NaN
2011-01-01 00:50:00    NaN
2011-01-01 01:00:00    1.0
2011-01-01 01:10:00    1.0
2011-01-01 01:20:00    1.0
2011-01-01 01:30:00    1.0
Freq: 10T, dtype: float64
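
    The same idea applied to the scenario as stated in the question (freq=6D upsampled to 1D, forward
    filled for at most 2 days) might look like the sketch below. The series sparse and its values
    10, 20, 30 are made up for illustration.

```python
import pandas as pd

# Hypothetical series observed every 6 days; the values are made up.
sparse = pd.Series([10, 20, 30],
                   index=pd.date_range('2011-01-01', periods=3, freq='6D'))

# Upsample to daily frequency (new days are NaN), then carry each
# observation forward for at most 2 days.
daily = sparse.asfreq('D').ffill(limit=2)
print(daily)

# The Resampler object offers the same limited forward fill in one step.
via_resample = sparse.resample('D').ffill(limit=2)
```

    Each observation covers its own day plus the next two, and the remaining three days before the next observation stay NaN.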