# Examples of the use of _pandas.resample()_ to calculate sum and average in different periods

## The input file has measurements of air temperature, relative humidity and radiation, as well as a timestamp field for each measurement. It includes one complete day of measurements, taken approximately every 5 minutes. The measurements are _not_ exactly equidistant in time (not isochronal).

## The function _pandas.resample()_ will be used to help aggregate the measurements of temperature and radiation in two different ways:
    * The temperature will be averaged into hourly temperature
    * The radiation will be integrated (added up)

# <center>*</center>

## Import needed libraries

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
plt.ioff()

## Read data

In [None]:
df = pd.read_csv( '../../../data/sensors_to_resample.csv', sep=';' )

### First lines of the table, just to check

In [None]:
df.head()

### And the last lines

In [None]:
df.tail()

## We define the index of the table (previously a consecutive number) to be the data column _Timestamp_

In [None]:
df.index = pd.DatetimeIndex( df['Timestamp'] )
df.index

## Notice that the index now has the type _DatetimeIndex_ (datetime64), while the column _Timestamp_ itself keeps "whatever" type was read from the file (_object_, in this case)
## This is the crucial step!

In [None]:
print( df.index.dtype) 
print( df['Timestamp'].dtype )
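The same check can be reproduced without the input file. This is a minimal sketch with a made-up three-row table (`df_demo` and its values are assumptions, not data from the CSV), showing that only the index becomes datetime-like while the text column is left untouched:

```python
import pandas as pd

# Hypothetical sample rows; not taken from the real input file
df_demo = pd.DataFrame({
    'Timestamp': ['2018-01-01 00:00:00', '2018-01-01 00:05:02', '2018-01-01 00:10:03'],
    'Temperature': [10.0, 10.5, 11.0],
})

# Same step as in the notebook: build a DatetimeIndex from the text column
df_demo.index = pd.DatetimeIndex(df_demo['Timestamp'])

print(df_demo.index.dtype)         # datetime-like index
print(df_demo['Timestamp'].dtype)  # the column still holds text
```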

### First lines of the table (you can never check too much)

In [None]:
df.head()

### And the last lines again

In [None]:
df.tail()

## Notice that the timestamps are __not__ exactly isochronal: that is our main problem here

In [None]:
df['Timestamp'].head()

In [None]:
df['Timestamp'].tail()

## Just to show it more clearly, we can check the difference (the discrete derivative, to be precise) between consecutive timestamps

In [None]:
duration_between_timestamps = df.index.to_series().diff()

## Convert to seconds to compare more easily

In [None]:
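# total_seconds() returns the full duration; .dt.seconds only returns the seconds component (0 to 86399)
duration_between_timestamps = duration_between_timestamps.dt.total_seconds()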

In [None]:
print( duration_between_timestamps.min() )
print( duration_between_timestamps.max() )

## Plot the number of seconds between measurements

In [None]:
fig, ax = plt.subplots( nrows=1, ncols=1, figsize=(20,5) )
ax.plot( df.index, duration_between_timestamps, linestyle='', marker='s' ) 
plt.show()

### A shift of one or two seconds doesn't seem like a big deal, but towards midnight it adds up to a couple of minutes: we don't have the measurements at "minute 5" anymore

In [None]:
df['Timestamp'].tail()
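How small shifts accumulate can be made explicit by summing the deviation from the nominal 5 minutes (300 seconds). This is a minimal sketch on made-up timestamps; the values and the names `gaps` and `drift` are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical timestamps, each one or two seconds later than the nominal 5 minutes
idx = pd.DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 00:05:01',
                        '2018-01-01 00:10:03', '2018-01-01 00:15:04'])

# Seconds between consecutive measurements (the first value is NaN)
gaps = idx.to_series().diff().dt.total_seconds()
print(gaps.min(), gaps.max())

# Accumulated deviation from the nominal 300 seconds (NaN is skipped by sum)
drift = gaps.sub(300).sum()
print(drift)
```

With roughly 288 measurements in a day, a drift of this size per step is enough to move the last timestamps off the 5-minute grid by minutes.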

## This is the problem to solve, and we want to do it with _pandas.resample_

### For _resample_ to work properly, the series must have a time index, as stated in the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html):
    Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.

#### (that was the reason to use DatetimeIndex previously)
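The quoted documentation also mentions the `on` keyword as an alternative to setting the index. A minimal sketch, assuming a made-up table whose _Timestamp_ column has already been converted to datetimes (`df_demo` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical table; the Timestamp column must be datetime-like for on= to work
df_demo = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2018-01-01 00:00:00',
                                 '2018-01-01 00:30:05',
                                 '2018-01-01 01:00:07']),
    'Temperature': [10.0, 12.0, 14.0],
})

# Resample on the column instead of the index ('1h' is the lowercase hourly alias)
hourly = df_demo.resample('1h', on='Timestamp')['Temperature'].mean()
print(hourly)
```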

## If we have that index, we can do things like:

In [None]:
print( df.index.min(), df.index.max() )

In [None]:
print( df.resample( '1H' ).mean().index.min(), df.resample( '1H' ).mean().index.max() )

In [None]:
print( df.resample( '5Min' ).mean().index.min(), df.resample( '5Min' ).mean().index.max() )
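To see what the 5-minute resampling does to drifting timestamps, here is a minimal sketch with made-up values (the series and its numbers are assumptions, not taken from the input file). It uses the alias '5min', which should be accepted by both older and newer pandas versions:

```python
import pandas as pd

# Hypothetical measurements, drifting a few seconds away from the 5-minute grid
idx = pd.DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 00:05:02',
                        '2018-01-01 00:10:03', '2018-01-01 00:15:05'])
temperature = pd.Series([10.0, 10.5, 11.0, 11.5], index=idx)

# Each value falls into the bin that starts at the previous multiple of 5 minutes
regular = temperature.resample('5min').mean()
print(regular)
```

The new index sits exactly on minutes 0, 5, 10 and 15. If the drift grows enough for a measurement to cross into the next bin, one bin can end up empty (NaN) and another can hold two values, so it is worth checking `count()` as well.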

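#### For the record, _5 minutes_ can be '5Min' or '5T'; '5M' means 5 months (month ends). Newer pandas versions (2.2 and later) deprecate 'T', 'H' and 'M' in favor of 'min', 'h' and 'ME'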

In [None]:
print( df.resample( '5M' ).mean().index.min(), df.resample( '5M' ).mean().index.max() )
print( df.resample( '5Min' ).mean().index.min(), df.resample( '5Min' ).mean().index.max() )
print( df.resample( '5T' ).mean().index.min(), df.resample( '5T' ).mean().index.max() )

## What happened?

### What happened was:
.resample( 'new intervals' ).mean()

### The index is recalculated to 1 hour or 5 minutes, and the _mean_(!) is taken as the new value for the intervals. This is perhaps easier to see in the hourly example:

In [None]:
df['Temperature'].head(15)

In [None]:
df['Temperature'].resample( '1H' ).mean().head()
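The hourly value can be verified by hand. This is a minimal sketch on made-up temperatures (all values are assumptions), comparing the first resampled value with the plain mean of everything measured before 01:00:

```python
import pandas as pd

# Hypothetical measurements: three in the first hour, two in the second
idx = pd.DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 00:30:01',
                        '2018-01-01 00:59:59', '2018-01-01 01:00:02',
                        '2018-01-01 01:30:04'])
temperature = pd.Series([10.0, 11.0, 12.0, 20.0, 22.0], index=idx)

# Each hourly bin is labelled with its start and covers [start, start + 1 hour)
hourly = temperature.resample('1h').mean()

# Manual mean of the measurements that fall before 01:00
manual = temperature[temperature.index < pd.Timestamp('2018-01-01 01:00:00')].mean()

print(hourly)
print(manual)
```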

### Other methods for resample are:
    bfill()    # Backward fill
    count()    # Number of values in the interval
    ffill()    # Forward fill
    first()    # Use the first (valid) data
    last()     # Use the last (valid) data
    max()      # Maximum value in the interval
    mean()     # Mean of the interval
    median()   # Median of values in the interval
    min()      # Minimum value in the interval
    nunique()  # Number of unique values
    std()      # Standard deviation
    sum()      # Sum of the values in the interval
    var()      # Variance in the interval

### Here is a complete list (still need to check it thoroughly, though)

In [None]:
tmp = df['Temperature'].resample( '1H' )
methods = [ method_name for method_name in dir(tmp) if callable(getattr(tmp, method_name)) ]
methods = [ method_name for method_name in methods if '_' not in method_name ]
print( methods )

## A couple of examples

In [None]:
df['Temperature'].head(15)

In [None]:
df['Temperature'].resample( '1H' ).min().head(2)

In [None]:
df['Temperature'].resample( '1H' ).max().head(2)

In [None]:
df['Temperature'].resample( '1H' ).sum().head(2)

In [None]:
df['Temperature'].resample( '1H' ).first().head(2)

## Some of the methods make more sense for upsampling, that is, when the resampling frequency is higher than the original:

In [None]:
df['Temperature'].resample( '1H' ).ffill().head(2)

In [None]:
df['Temperature'].resample( '1H' ).bfill().head(2)

### Now the same with 1 minute as new period:

In [None]:
df['Temperature'].resample( '1T' ).ffill().head(10)

In [None]:
df['Temperature'].resample( '1T' ).bfill().head(10)

In [None]:
# nearest() is the direct equivalent of fillna('nearest'), which newer pandas versions deprecate
df['Temperature'].resample( '1T' ).nearest().head(10)

In [None]:
df['Temperature'].resample( '1T' ).nearest(limit=1).head(10)
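The difference between the fill methods is easier to follow on a tiny example. This is a minimal sketch, assuming two made-up measurements five minutes apart (the values are not from the input file); `limit` is shown here on `ffill()` only:

```python
import pandas as pd

# Hypothetical series with two measurements, 5 minutes apart
idx = pd.DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 00:05:00'])
temperature = pd.Series([10.0, 14.0], index=idx)

# Upsample to 1 minute: four new timestamps appear between the two measurements
upsampled = temperature.resample('1min')

forward = upsampled.ffill()                 # repeat the previous measurement
backward = upsampled.bfill()                # use the next measurement
nearest = upsampled.nearest()               # use the closest measurement in time
forward_limited = upsampled.ffill(limit=1)  # fill at most one step, leave the rest as NaN

print(pd.DataFrame({'ffill': forward, 'bfill': backward,
                    'nearest': nearest, 'ffill(limit=1)': forward_limited}))
```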

## Lastly, if used on a complete DataFrame, it applies to all columns, so please check to see if that is what you actually want

In [None]:
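# numeric_only=True skips non-numeric columns such as the original Timestamp text, which newer pandas versions refuse to average
df.resample('1H').mean(numeric_only=True).head()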

In [None]:
df.resample('1H').sum().head()
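When each column needs its own aggregation, as with the averaged temperature and the added-up radiation, `agg()` accepts a dictionary that maps columns to methods. This is a minimal sketch on a made-up table; the column name `Radiation` and all values are assumptions, since the real column names of the input file may differ:

```python
import pandas as pd

# Hypothetical table with two measurements per hour
idx = pd.DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 00:30:02',
                        '2018-01-01 01:00:03', '2018-01-01 01:30:05'])
df_demo = pd.DataFrame({'Temperature': [10.0, 12.0, 14.0, 18.0],
                        'Radiation': [0.0, 5.0, 20.0, 30.0]}, index=idx)

# Mean for the temperature, sum for the radiation, both in hourly bins
hourly = df_demo.resample('1h').agg({'Temperature': 'mean', 'Radiation': 'sum'})
print(hourly)
```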

# <center>*</center>

## References and further reading

### Documentation and examples:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html

### To check the aggregation functions, and general examples:
http://benalexkeen.com/resampling-time-series-data-with-pandas/

### To check time units available:
https://stackoverflow.com/questions/17001389/pandas-resample-documentation
### And also:
https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

# <center>*</center>

___

# To check next:

### About the numerical integration of time series, using scipy:
https://nbviewer.jupyter.org/gist/metakermit/5720498