# Comparison of pivot building methods

Methods include the following:

 * `make_pivots`, used in the original diffusion frequencies module. This function uses `np.linspace`, which requires the number of points to be specified ahead of time rather than the spacing between points. Notably, `linspace` includes the last value in the range by default. It does not allow start and end dates to override the range of the given date observations.
 * `get_pivots`, used in the new modular augur interface (but not provided in the frequency estimators module). This function uses `np.arange`, which requires the step size between points, defined here as a floating point value. `arange` does not include the final value in the range unless floating point rounding error changes the computed length of the output. When the step size is not an integer, [this function is known to produce inconsistent results due to floating point rounding error](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html#numpy.arange). For example, the final value in the output range can be greater than the given stop value. `linspace` is recommended instead. It allows either a start or an end date to override the range of the given date observations.
 * `calculate_pivots`, used in the KDE frequencies class (but only available as a class method). This function uses `pd.date_range`, which requires either the frequency of points in the requested range or the number of points, combining the functionality of `arange` and `linspace` into one date-aware function. `date_range` also includes the final value by default and supports closed intervals on the left or right. The `calculate_pivots` method requires both a start and an end date to override the range of the given date observations.
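
The recommendation to prefer `linspace` over `arange` for non-integer steps can be sketched as follows. This is only an illustration: `linspace_pivots` is a hypothetical helper, not a function in augur, and it assumes that the first and last pivots already lie on the pivot grid. It converts the desired spacing into a whole number of points so that both endpoints are always included.

```python
import numpy as np

def linspace_pivots(first_pivot, last_pivot, pivots_per_year):
    """Return evenly spaced pivots that include both endpoints.

    Hypothetical helper for illustration only; not part of augur.
    """
    # Round to the nearest whole number of steps so that rounding error in
    # the floating point range cannot add or drop a pivot.
    n_steps = int(round((last_pivot - first_pivot) * pivots_per_year))

    # linspace takes the number of points, which is one more than the
    # number of steps, and includes the endpoint by default.
    return np.linspace(first_pivot, last_pivot, n_steps + 1)

# Monthly and quarterly pivots over ranges like the ones used below.
monthly = linspace_pivots(2010.0 + 2 / 12.0, 2011.75, 12)
quarterly = linspace_pivots(2010.0, 2011.75, 4)
print(len(monthly), monthly[-1])
print(len(quarterly), quarterly[-1])
```

With this approach the number of pivots is decided by an explicit integer rounding step instead of by the rounding behavior inside `arange`.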


In [1]:
import numpy as np
import pandas as pd

np.random.seed(1)

def float_to_datestring(time):
    """Convert a floating point date to a date string

    >>> float_to_datestring(2010.75)
    '2010-10-01'
    >>> float_to_datestring(2011.25)
    '2011-04-01'
    """
    year = int(time)
    month = int(np.rint(((time - year) * 12) + 1))
    day = 1
    return "%s-%02d-%02d" % (year, month, day)

def timestamp_to_float(time):
    """Convert a pandas timestamp to a floating point date.

    >>> import datetime
    >>> time = datetime.date(2010, 10, 1)
    >>> timestamp_to_float(time)
    2010.75
    >>> time = datetime.date(2011, 4, 1)
    >>> timestamp_to_float(time)
    2011.25
    """
    return time.year + ((time.month - 1) / 12.0)

Generate some example date observations in floating point space like augur would provide.

In [29]:
n_samples = 15
observations = np.array(sorted(np.random.choice([2010.0, 2011.0], size=n_samples) +
                               np.random.random(n_samples) * 0.75))
observations[:10]

array([ 2010.21428914,  2010.32525726,  2010.3449102 ,  2010.43651063,
        2010.67904424,  2011.01164996,  2011.36769014,  2011.40976011,
        2011.44511106,  2011.44933273])

Compare the calculation of monthly pivots with the different methods described above.
Start with the `linspace` approach from `make_pivots`.

In [30]:
# Determine the range of the given observations.
dt = np.max(observations) - np.min(observations)

# Define number of pivots to match the arange and date_range examples below.
n_pivots = 21

# Calculate pivots with linear spacing between start and end with 1% of the total range
# added to each side of the range.
np.linspace(np.min(observations) - 0.01 * dt, np.max(observations) + 0.01 * dt, n_pivots)

array([ 2010.19938122,  2010.27541159,  2010.35144195,  2010.42747232,
        2010.50350268,  2010.57953305,  2010.65556341,  2010.73159378,
        2010.80762415,  2010.88365451,  2010.95968488,  2011.03571524,
        2011.11174561,  2011.18777597,  2011.26380634,  2011.3398367 ,
        2011.41586707,  2011.49189743,  2011.5679278 ,  2011.64395816,
        2011.71998853])

Next, use the `arange` approach from `get_pivots`.

In [31]:
min_date = None
max_date = None
pivots_per_year = 12
dt = 1.0 / pivots_per_year

first_pivot = min_date if min_date else np.floor(np.min(observations) / dt) * dt
last_pivot = max_date if max_date else np.ceil(np.max(observations) / dt) * dt
np.arange(first_pivot, last_pivot, dt)

array([ 2010.16666667,  2010.25      ,  2010.33333333,  2010.41666667,
        2010.5       ,  2010.58333333,  2010.66666667,  2010.75      ,
        2010.83333333,  2010.91666667,  2011.        ,  2011.08333333,
        2011.16666667,  2011.25      ,  2011.33333333,  2011.41666667,
        2011.5       ,  2011.58333333,  2011.66666667,  2011.75      ])

In [32]:
# Floating point step size leads to rounding error in the arange call above.
dt

0.08333333333333333

In [33]:
first_pivot

2010.1666666666665

In [34]:
last_pivot

2011.75

Depending on which observations are included above, the last pivot may or may not be included in the output from `arange` because the step size of `1 / 12.0` cannot be represented exactly in binary floating point, so rounding error affects the computed length of the output.
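
The following sketch illustrates why the monthly step is affected. It relies on the length rule given in the NumPy documentation for `arange`, `ceil((stop - start) / step)`, and reuses the `first_pivot` and `last_pivot` values printed above. The variable names `steps` and `expected_length` are introduced here for illustration only.

```python
from fractions import Fraction
import math

import numpy as np

# A step of 1/4 is stored exactly in binary floating point; 1/12 is not.
quarter_is_exact = Fraction(1 / 4.0) == Fraction(1, 4)
twelfth_is_exact = Fraction(1 / 12.0) == Fraction(1, 12)
print(quarter_is_exact, twelfth_is_exact)

# Values printed by the monthly example above.
first_pivot = 2010.1666666666665
last_pivot = 2011.75
dt = 1 / 12.0

# The range spans slightly more than 19 steps because of rounding error,
# so the ceiling is 20 and the stop value ends up in the output.
steps = (last_pivot - first_pivot) / dt
expected_length = int(math.ceil(steps))

monthly = np.arange(first_pivot, last_pivot, dt)
print(steps, expected_length, len(monthly))
```

Had the rounding error fallen in the other direction, the quotient would have been at most 19 and the stop value would have been dropped.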

In [35]:
min_date = None
max_date = None
pivots_per_year = 4
dt = 1.0 / pivots_per_year

first_pivot = min_date if min_date else np.floor(np.min(observations) / dt) * dt
last_pivot = max_date if max_date else np.ceil(np.max(observations) / dt) * dt
np.arange(first_pivot, last_pivot, dt)

array([ 2010.  ,  2010.25,  2010.5 ,  2010.75,  2011.  ,  2011.25,  2011.5 ])

In [36]:
last_pivot

2011.75

Now the last pivot value is not included in the output from `arange` because the step size of `1 / 4.0` is exactly representable in binary floating point and introduces no rounding error.
Next, try to replicate the above examples with the pandas `date_range` function.

In [37]:
# Use same step size as first arange example above of 1 month between pivots.
pivot_frequency = 1
dt = 1.0 / 12

first_pivot = float_to_datestring(min_date if min_date else np.floor(np.min(observations) / dt) * dt)
last_pivot = float_to_datestring(max_date if max_date else np.ceil(np.max(observations) / dt) * dt)
pd.date_range(first_pivot, last_pivot, freq="%sMS" % pivot_frequency)

DatetimeIndex(['2010-03-01', '2010-04-01', '2010-05-01', '2010-06-01',
               '2010-07-01', '2010-08-01', '2010-09-01', '2010-10-01',
               '2010-11-01', '2010-12-01', '2011-01-01', '2011-02-01',
               '2011-03-01', '2011-04-01', '2011-05-01', '2011-06-01',
               '2011-07-01', '2011-08-01', '2011-09-01', '2011-10-01'],
              dtype='datetime64[ns]', freq='MS')

In [38]:
first_pivot

'2010-03-01'

In [39]:
last_pivot

'2011-10-01'

In [40]:
# Convert datetime values to floats
np.array([
    timestamp_to_float(timestamp)
    for timestamp in pd.date_range(first_pivot, last_pivot, freq="%sMS" % pivot_frequency)
])

array([ 2010.16666667,  2010.25      ,  2010.33333333,  2010.41666667,
        2010.5       ,  2010.58333333,  2010.66666667,  2010.75      ,
        2010.83333333,  2010.91666667,  2011.        ,  2011.08333333,
        2011.16666667,  2011.25      ,  2011.33333333,  2011.41666667,
        2011.5       ,  2011.58333333,  2011.66666667,  2011.75      ])

In [41]:
# Use same step size as second arange example above of 3 months between pivots (4 pivots per year).
pivot_frequency = 3
dt = 1.0 / 4

first_pivot = float_to_datestring(min_date if min_date else np.floor(np.min(observations) / dt) * dt)
last_pivot = float_to_datestring(max_date if max_date else np.ceil(np.max(observations) / dt) * dt)
pd.date_range(first_pivot, last_pivot, freq="%sMS" % pivot_frequency)

DatetimeIndex(['2010-01-01', '2010-04-01', '2010-07-01', '2010-10-01',
               '2011-01-01', '2011-04-01', '2011-07-01', '2011-10-01'],
              dtype='datetime64[ns]', freq='3MS')

In [42]:
# Convert datetime values to floats
np.array([
    timestamp_to_float(timestamp)
    for timestamp in pd.date_range(first_pivot, last_pivot, freq="%sMS" % pivot_frequency)
])

array([ 2010.  ,  2010.25,  2010.5 ,  2010.75,  2011.  ,  2011.25,
        2011.5 ,  2011.75])

Note that the pivots from `date_range` include the last pivot value regardless of the step size between pivots, as long as the end date falls on the requested frequency (as it does in both examples above). Values in the given range are [calculated by applying the "frequency" as an integer offset from the start value up to the end date](https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/tseries/offsets.py#L2380-L2452).
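
As a small check of the offset-based behavior, the sketch below compares an end date that lies on the quarterly grid with one that does not. The dates are chosen for illustration and the names `aligned` and `unaligned` are not from augur.

```python
import pandas as pd

# The end date is a whole number of 3-month offsets from the start date,
# so it is included as the last pivot.
aligned = pd.date_range("2010-01-01", "2011-10-01", freq="3MS")

# The end date falls between two offsets, so the range stops at the last
# offset on or before it.
unaligned = pd.date_range("2010-01-01", "2011-09-01", freq="3MS")

print(aligned[-1], len(aligned))
print(unaligned[-1], len(unaligned))
```

Because the offsets are applied as whole months, the result does not depend on how a fractional step is rounded in floating point.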