# Resampling and Reindexing

When working with time series data, you will often want to change the times at which your observations are made.

If your data is sampled every 5 minutes, you may instead want it every hour.

This operation is often necessary when a calculation combines two data sets that were collected at different timestamps.
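
As a minimal sketch of that situation, assume two hypothetical meters (`meter_a` and `meter_b`, with invented values) that report at different rates. Putting both on the same timestamps first makes the combined calculation straightforward.

```python
import pandas as pd

# Hypothetical meters: one reports every 15 minutes, the other every 30 minutes.
meter_a = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range('2016-01-01 20:00:00', periods=5, freq='15min'),
)
meter_b = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.date_range('2016-01-01 20:00:00', periods=3, freq='30min'),
)

# Put meter_a on meter_b's 30-minute timestamps, then combine the two.
total = meter_a.reindex(meter_b.index) + meter_b
print(total)
```

Only the timestamps the two series share after reindexing contribute to `total`.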

You have two options:

- resampling
- reindexing

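When you reindex, you provide a new index (perhaps hourly or daily) on which to sample your data.
Observations whose timestamps exactly match the new index are kept; where no observation exists for a timestamp, you can choose how to fill the gap.
When you resample, pandas groups the data into regular time bins, and you choose how each bin is summarized or filled.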
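
To make the difference between the two options concrete, here is a minimal sketch that assumes hypothetical 5-minute readings (the `readings` series and its values are invented for illustration). Reindexing keeps only the values that fall exactly on the new hourly timestamps, while resampling groups every reading within each hour and summarizes it, here with the mean.

```python
import pandas as pd

# Hypothetical readings taken every 5 minutes for two hours (values 0..23).
index = pd.date_range('2016-01-01 20:00:00', periods=24, freq='5min')
readings = pd.Series(range(24), index=index, dtype='float64')

# The new hourly timestamps we want the data on.
hourly_index = pd.date_range('2016-01-01 20:00:00', periods=2, freq='60min')

# Reindexing picks the values that sit exactly on the new timestamps.
reindexed = readings.reindex(hourly_index)

# Resampling collects every reading in each hour and aggregates it.
resampled = readings.resample('60min').mean()

print(reindexed)
print(resampled)
```

The mean is only one possible aggregation; which one is appropriate depends on what the measurement represents.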

Documentation links

- [Resample Pandas Docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html)
- [Reindex Pandas Docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html)
- [DateRange Pandas Docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html)
- [Pandas Interpolation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.tseries.resample.Resampler.interpolate.html)

Tutorial links

- [Interpolation (slightly outdated)](http://machinelearningmastery.com/resample-interpolate-time-series-data-python/)



In [1]:
from io import StringIO
import pandas as pd
print('this uses pandas version', pd.__version__)

this uses pandas version 0.19.2


In [2]:
# This data has an extra and unwanted sample at an odd time

csv_data = '''timestamp,unit,energy
2016-01-01 20:00:00,KWH,26.989
2016-01-01 20:15:00,KWH,26.992
2016-01-01 20:30:00,KWH,26.994
2016-01-01 20:45:00,KWH,26.997
2016-01-01 20:45:59,KWH,26.997
2016-01-01 21:00:00,KWH,26.999
2016-01-01 21:15:00,KWH,27.002
2016-01-01 21:30:00,KWH,27.004
2016-01-01 21:45:00,KWH,27.008
2016-01-01 22:00:00,KWH,27.012
'''

from io import StringIO
import pandas as pd
data = pd.read_csv(StringIO(csv_data), parse_dates=True, index_col=0)
data

timestamp,unit,energy
2016-01-01 20:00:00,KWH,26.989
2016-01-01 20:15:00,KWH,26.992
2016-01-01 20:30:00,KWH,26.994
2016-01-01 20:45:00,KWH,26.997
2016-01-01 20:45:59,KWH,26.997
2016-01-01 21:00:00,KWH,26.999
2016-01-01 21:15:00,KWH,27.002
2016-01-01 21:30:00,KWH,27.004
2016-01-01 21:45:00,KWH,27.008
2016-01-01 22:00:00,KWH,27.012


In [3]:
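# Build a regular 15-minute index. reindex keeps only rows whose timestamps
# match it exactly, so the stray 20:45:59 sample is dropped.
date_index = pd.date_range('2016-01-01 20:00:00', periods=9, freq='15min')
data.reindex(date_index)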

,unit,energy
2016-01-01 20:00:00,KWH,26.989
2016-01-01 20:15:00,KWH,26.992
2016-01-01 20:30:00,KWH,26.994
2016-01-01 20:45:00,KWH,26.997
2016-01-01 21:00:00,KWH,26.999
2016-01-01 21:15:00,KWH,27.002
2016-01-01 21:30:00,KWH,27.004
2016-01-01 21:45:00,KWH,27.008
2016-01-01 22:00:00,KWH,27.012


In [4]:
# this data has the odd sample and some missing data

csv_data = '''timestamp,unit,energy
2016-01-01 20:00:00,KWH,26.989
2016-01-01 20:15:00,KWH,26.992
2016-01-01 20:30:00,KWH,26.994
2016-01-01 20:45:00,KWH,26.997
2016-01-01 20:45:59,KWH,26.997
2016-01-01 21:00:00,KWH,26.999
2016-01-01 21:15:00,KWH,27.002
2016-01-01 22:00:00,KWH,27.012
'''

from io import StringIO
import pandas as pd
data = pd.read_csv(StringIO(csv_data), parse_dates=True, index_col=0)
data

timestamp,unit,energy
2016-01-01 20:00:00,KWH,26.989
2016-01-01 20:15:00,KWH,26.992
2016-01-01 20:30:00,KWH,26.994
2016-01-01 20:45:00,KWH,26.997
2016-01-01 20:45:59,KWH,26.997
2016-01-01 21:00:00,KWH,26.999
2016-01-01 21:15:00,KWH,27.002
2016-01-01 22:00:00,KWH,27.012


In [5]:
# Timestamps on the new index that have no sample (21:30 and 21:45) come back as NaN
date_index = pd.date_range('2016-01-01 20:00:00', periods=9, freq='15min')
data.reindex(date_index)

,unit,energy
2016-01-01 20:00:00,KWH,26.989
2016-01-01 20:15:00,KWH,26.992
2016-01-01 20:30:00,KWH,26.994
2016-01-01 20:45:00,KWH,26.997
2016-01-01 21:00:00,KWH,26.999
2016-01-01 21:15:00,KWH,27.002
2016-01-01 21:30:00,NaN,NaN
2016-01-01 21:45:00,NaN,NaN
2016-01-01 22:00:00,KWH,27.012


Note that the reindexed data holds missing values (NaN) at the timestamps that had no sample in the original data.
You can fill those gaps from neighboring samples.
The simplest of the many options is to repeat the last valid sample.

Which option makes the most sense depends on your own data and analysis.

In [6]:
# method='pad' (also spelled 'ffill') fills each gap with the last valid sample
date_index = pd.date_range('2016-01-01 20:00:00', periods=9, freq='15min')
data.reindex(date_index, method='pad')

,unit,energy
2016-01-01 20:00:00,KWH,26.989
2016-01-01 20:15:00,KWH,26.992
2016-01-01 20:30:00,KWH,26.994
2016-01-01 20:45:00,KWH,26.997
2016-01-01 21:00:00,KWH,26.999
2016-01-01 21:15:00,KWH,27.002
2016-01-01 21:30:00,KWH,27.002
2016-01-01 21:45:00,KWH,27.002
2016-01-01 22:00:00,KWH,27.012
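
Padding is only one of the fill options that `reindex` accepts. The sketch below tries a few others on a small frame that reproduces the same gap (the frame is rebuilt here so the block runs on its own; the variable names are illustrative): `method='bfill'` takes the next valid sample, `method='nearest'` takes the closest one in time, and `limit` caps how many consecutive gaps get filled.

```python
import pandas as pd

# Same gap as in the example above: no samples at 21:30 and 21:45.
data = pd.DataFrame(
    {'unit': ['KWH', 'KWH', 'KWH'], 'energy': [26.999, 27.002, 27.012]},
    index=pd.to_datetime([
        '2016-01-01 21:00:00',
        '2016-01-01 21:15:00',
        '2016-01-01 22:00:00',
    ]),
)
date_index = pd.date_range('2016-01-01 21:00:00', periods=5, freq='15min')

# Fill each gap with the next valid sample instead of the previous one.
backfilled = data.reindex(date_index, method='bfill')

# Fill each gap with whichever sample is closest in time.
nearest = data.reindex(date_index, method='nearest')

# Pad forward, but fill at most one consecutive missing timestamp.
limited = data.reindex(date_index, method='pad', limit=1)

print(backfilled)
print(nearest)
print(limited)
```

With `limit=1`, the 21:30 row is filled from 21:15 but the 21:45 row stays missing, which is useful when a stale value should not be carried too far.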


In [7]:
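# You can also resample onto a regular 15-minute grid and interpolate linearly.
# Only the numeric energy column is interpolated; unit stays missing for the new rows.

data.resample('15min').interpolate(method='linear')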

timestamp,unit,energy
2016-01-01 20:00:00,KWH,26.989
2016-01-01 20:15:00,KWH,26.992
2016-01-01 20:30:00,KWH,26.994
2016-01-01 20:45:00,KWH,26.997
2016-01-01 21:00:00,KWH,26.999
2016-01-01 21:15:00,KWH,27.002
2016-01-01 21:30:00,NaN,27.005333
2016-01-01 21:45:00,NaN,27.008667
2016-01-01 22:00:00,KWH,27.012
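
In the output above, the text column `unit` is left missing on the interpolated rows, and recent pandas releases may warn about or reject interpolating non-numeric columns. One possible way around both, sketched here under the assumption that the unit never changes between samples, is to put the data on the regular grid first and then treat each column separately. The small frame is rebuilt so the block runs on its own.

```python
import pandas as pd

# Same gap as before: no samples at 21:30 and 21:45.
data = pd.DataFrame(
    {'unit': ['KWH', 'KWH', 'KWH'], 'energy': [26.999, 27.002, 27.012]},
    index=pd.to_datetime([
        '2016-01-01 21:00:00',
        '2016-01-01 21:15:00',
        '2016-01-01 22:00:00',
    ]),
)

# Put the data on a regular 15-minute grid; missing timestamps become NaN.
resampled = data.resample('15min').asfreq()

# Interpolate the numeric column, weighting by the time between samples.
resampled['energy'] = resampled['energy'].interpolate(method='time')

# Assumption: the unit is constant, so carrying it forward is safe.
resampled['unit'] = resampled['unit'].ffill()

print(resampled)
```

On an evenly spaced grid like this one, `method='time'` and `method='linear'` give the same values; they differ only when the index spacing is irregular.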
