In [2]:
import numpy as np
import pandas as pd

import core.artificial_signal_generators as sig_gen
import core.finance as fin
import core.statistics as stats
import helpers.unit_test as hut

In [7]:
def _generate_series(seed: int) -> pd.Series:
    # No AR or MA coefficients: an ARMA(0, 0) process, i.e. white noise.
    arma_process = sig_gen.ArmaProcess([], [])
    realization = arma_process.generate_sample(
        {"start": "2000-01-01", "periods": 75, "freq": "D"},
        scale=1,
        seed=seed,
    )
    return realization

Generate a series with daily (`D`) frequency and 75 observations.

In [8]:
srs = _generate_series(seed=1)
srs

2000-01-01    1.624345
2000-01-02   -0.611756
2000-01-03   -0.528172
2000-01-04   -1.072969
2000-01-05    0.865408
                ...   
2000-03-11   -1.444114
2000-03-12   -0.504466
2000-03-13    0.160037
2000-03-14    0.876169
2000-03-15    0.315635
Freq: D, Name: arma(0,0), Length: 75, dtype: float64
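
For readers without access to `core.artificial_signal_generators`, here is a stand-in sketch. With no AR or MA terms the process is plain Gaussian white noise, so NumPy's legacy `RandomState` can produce a comparable series. That the helper draws from this exact generator is an assumption based only on the first printed values, and `make_white_noise` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def make_white_noise(seed: int, periods: int = 75) -> pd.Series:
    """Hypothetical stand-in for `_generate_series`: Gaussian white noise."""
    # Legacy NumPy generator; assumed (not confirmed) to match the helper.
    rng = np.random.RandomState(seed)
    index = pd.date_range("2000-01-01", periods=periods, freq="D")
    return pd.Series(rng.standard_normal(periods), index=index, name="arma(0,0)")


srs_sketch = make_white_noise(seed=1)
print(srs_sketch.head())
```

The later checks only rely on the series being daily and 75 points long, so any similar series would do.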

## Testing current implementation

The current version of `fin.compute_average_holding_period()` is ugly, but it works as intended:

In [11]:
result_in_days = fin.compute_average_holding_period(srs)
result_in_days

0.727585777509342

Applied to `srs` with the default arguments, the result is approximately 0.7276.

Checking by hand:

In [61]:
srs.abs().mean() / srs.diff().abs().mean()

0.727585777509342

Seems correct.
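
The by-hand check is the ratio of the mean absolute value to the mean absolute one-step change. A minimal sketch on a tiny hand-made series (the values are illustrative only, not taken from `srs`) makes the arithmetic easy to follow:

```python
import pandas as pd

# Toy values; illustrative only.
toy = pd.Series([1.0, -1.0, 2.0, -2.0])

# (1 + 1 + 2 + 2) / 4 = 1.5
mean_abs_value = toy.abs().mean()
# diff() gives [NaN, -2, 3, -4]; mean() skips the NaN: (2 + 3 + 4) / 3 = 3.0
mean_abs_change = toy.diff().abs().mean()

toy_holding_period = mean_abs_value / mean_abs_change
print(toy_holding_period)  # 0.5
```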

Therefore, if we want to output the calculated average holding period in months,
the output should be equal to `0.72759/30`.

Please note: if the statement above is incorrect and the result in months should be different in this case, tell me and skip everything below, because this assumption is crucial to my reasoning here.

In [62]:
result_in_month = fin.compute_average_holding_period(srs, unit='M')
result_in_month

0.024252859250311398

In [63]:
result_in_days/30

0.024252859250311398

So we see that the function is working correctly.

## Approach with `resample()` is not really applicable

First, let's take a look at the function suggested in https://github.com/alphamatic/amp/pull/378#issuecomment-647843776

In [39]:
def get_mean_holding(srs: pd.Series, freq: str) -> float:
    num = srs.abs().resample(freq).sum().mean()
    denom = srs.diff().abs().resample(freq).sum().mean()
    return num / denom

The problem is that `resample()` is applied to both `num` and `denom`, which makes the resampling pointless.
In both `srs.abs()` and `srs.diff().abs()` we compute the mean by summing all values of the series and
dividing by their number. In other words, `srs.abs().mean() = srs.abs().sum() / srs.abs().size` and
`srs.diff().abs().mean() = srs.diff().abs().sum() / srs.diff().abs().size` (strictly, `mean()` skips the leading `NaN` produced by `diff()`, so the second divisor is `size - 1`).

When we apply `.resample(freq).sum()`, the total is unchanged, because `srs.abs().sum() = srs.abs().resample(freq).sum().sum()` and `srs.diff().abs().sum() = srs.diff().abs().resample(freq).sum().sum()`.
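
The two identities can be checked numerically. The sketch below uses a synthetic daily series rather than `srs`, and buckets months with `"MS"` (month start) instead of `'M'`, because recent pandas versions deprecate the `'M'` alias; the calendar months covered are the same.

```python
import numpy as np
import pandas as pd

# Synthetic daily white noise; a stand-in for `srs`, not the same values.
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=75, freq="D")
x = pd.Series(rng.normal(size=75), index=index)

totals_match = {}
for freq in ["D", "W", "MS"]:
    # Total of |x| before and after bucketing.
    same_abs = np.isclose(x.abs().sum(), x.abs().resample(freq).sum().sum())
    # Total of |diff(x)| before and after bucketing (NaN counts as 0 in both).
    same_diff = np.isclose(
        x.diff().abs().sum(), x.diff().abs().resample(freq).sum().sum()
    )
    totals_match[freq] = bool(same_abs and same_diff)

print(totals_match)
```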

Therefore, resampling affects only the size of the series, which becomes smaller. However, if we resample both `num` and `denom`, their sizes shrink by the same factor and remain equal. When we compute `num / denom`, these sizes cancel out, so we get the same output for every frequency, which in my opinion doesn't make any sense:

In [92]:
get_mean_holding(srs, freq='D')

0.7374180177459547

In [93]:
get_mean_holding(srs, freq='M')

0.7374180177459548

In [94]:
get_mean_holding(srs, freq='Y')

0.7374180177459544

You can see that the results are equal up to floating-point rounding, which is not what we want.
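
The cancellation can be reproduced on a toy series. The sketch also shows why the daily value here (0.7374) is not the by-hand 0.7276: `diff()` leaves a leading `NaN`, which `mean()` skips (dividing by n - 1), while `resample(...).sum()` turns it into 0 (so the later mean divides by n). For the 75-point series that is a factor of 75/74. The helper name `ratio_resampled` and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", periods=5, freq="D")
toy = pd.Series([1.0, -1.0, 2.0, -2.0, 1.0], index=index)


def ratio_resampled(srs: pd.Series, freq: str) -> float:
    # Same computation as `get_mean_holding`: both parts are resampled.
    num = srs.abs().resample(freq).sum().mean()
    denom = srs.diff().abs().resample(freq).sum().mean()
    return num / denom


# Denominator mean over 4 values (NaN skipped).
by_hand = toy.abs().mean() / toy.diff().abs().mean()
# Denominator mean over 5 daily buckets (NaN summed as 0).
daily = ratio_resampled(toy, "D")
# Fewer buckets, but the same count in numerator and denominator.
weekly = ratio_resampled(toy, "W")

print(by_hand, daily, weekly)
print(daily / by_hand)  # n / (n - 1) with n = 5
```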

I tried to resolve this problem by resampling only `denom`, naively thinking that it would fix everything.

In [95]:
def get_mean_holding_1(srs: pd.Series, freq: str) -> float:
    num = srs.abs().mean()
    denom = srs.diff().abs().resample(freq).sum().mean()
    return num / denom

In [96]:
get_mean_holding_1(srs, freq='D')

0.7374180177459547

In [97]:
get_mean_holding_1(srs, freq='M')

0.029496720709838196

This looks fine at first sight. However, if we compare it with the correct result, there is a difference:

In [98]:
get_mean_holding_1(srs, freq='D')/30 

0.024580600591531825

In [99]:
(get_mean_holding_1(srs, freq='M')) / (get_mean_holding_1(srs, freq='D')/30) 

1.2000000000000002

20% higher than it should be! The effect is even more pronounced with years:

In [100]:
get_mean_holding_1(srs, freq='Y')

0.009832240236612727

In [101]:
get_mean_holding_1(srs, freq='D')/365

0.002020323336290287

In [102]:
(get_mean_holding_1(srs, freq='Y')) / (get_mean_holding_1(srs, freq='D')/365)

4.866666666666665

Almost 5 times higher. This happens because, after resampling, the mean is taken over the number of buckets, which is not the right size.
For example, `srs.diff().abs().size` is 75, so the resampled series would need a size of 75/30 = 2.5 for the mean to be computed
correctly. However:

In [103]:
srs.diff().abs().size

75

In [106]:
srs.diff().abs().resample('M').sum().size

3

In [107]:
srs.diff().abs().resample('M').sum()

2000-01-31    40.664780
2000-02-29    23.207425
2000-03-31    15.320120
Freq: M, Name: arma(0,0), dtype: float64

As we can see, March is counted as a full month, which biases the result.
The same happens in all the other cases, so this approach is not really applicable.
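
The factors 1.2 and 4.87 observed earlier follow directly from the bucket counts. With `n_obs` daily points split into `n_buckets` buckets, `get_mean_holding_1` divides the total by `n_buckets`, whereas the reference value divides the daily result by the nominal number of days per unit. A small sketch of that arithmetic (`resample_bias` is a hypothetical helper name):

```python
def resample_bias(n_buckets: int, days_per_unit: float, n_obs: int) -> float:
    """Ratio of `get_mean_holding_1(srs, freq)` to the daily value / `days_per_unit`."""
    # Resampled denominator mean: total / n_buckets.
    # Daily denominator mean:     total / n_obs.
    # The totals cancel, leaving n_buckets * days_per_unit / n_obs.
    return n_buckets * days_per_unit / n_obs


# 75 daily observations fall into 3 calendar months and 1 calendar year.
monthly_bias = resample_bias(n_buckets=3, days_per_unit=30, n_obs=75)
yearly_bias = resample_bias(n_buckets=1, days_per_unit=365, n_obs=75)
print(monthly_bias, yearly_bias)
```

The bias disappears only when `n_buckets * days_per_unit` equals `n_obs`, that is, when the series covers a whole number of nominal periods.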

This means that we should compute the average holding period at the input frequency and then multiply it by the frequency coefficient, which is more or less fixed for each pair of frequencies (`D`/`M` = 1/30, `W`/`Y` = 1/52, etc.).

The same logic is implemented in the current version.

Julia has suggested using the following construct to do this:

In [52]:
pd.date_range("2020-01-01", freq='D', periods=1)[-1]

Timestamp('2020-01-01 00:00:00', freq='D')

However, I believe that this approach is biased too.

It does not return the date one `freq` after the given start. Instead, the frequency has its own anchoring, and the call just picks the date at the end of the current `freq` period.
For example:

In [108]:
pd.date_range("2020-01-01", freq="W", periods=1)[-1]

Timestamp('2020-01-05 00:00:00', freq='W-SUN')

In [109]:
pd.date_range("2020-01-04", freq="W", periods=1)[-1]

Timestamp('2020-01-05 00:00:00', freq='W-SUN')

By setting `freq='W'` we did not obtain a date 7 days after the specified one. Moreover, different input dates give the same output. This can bias our calculations, and it is much more visible in the other examples:

Same output for `freq='M'` and two dates in the same month:

In [110]:
pd.date_range("2020-01-07", freq="M", periods=1)[-1]

Timestamp('2020-01-31 00:00:00', freq='M')

In [111]:
pd.date_range("2020-01-18", freq="M", periods=1)[-1]

Timestamp('2020-01-31 00:00:00', freq='M')

Same output for `freq='Y'` and two dates in the same year:

In [112]:
pd.date_range("2020-03-07", freq="Y", periods=1)[-1]

Timestamp('2020-12-31 00:00:00', freq='A-DEC')

In [113]:
pd.date_range("2020-09-19", freq="Y", periods=1)[-1]

Timestamp('2020-12-31 00:00:00', freq='A-DEC')

Therefore, neither approach satisfies our needs.
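
To make the contrast concrete, the sketch below compares the anchored weekly frequency with a fixed 7-day step from the same start date. It only illustrates the anchoring behaviour and is not a proposed replacement.

```python
import pandas as pd

start = pd.Timestamp("2020-01-04")  # a Saturday

# Anchored frequency: snaps to the next Sunday, one day after `start`.
anchored = pd.date_range(start, freq="W", periods=1)[-1]
# Fixed-length step: exactly 7 days after `start`.
fixed = start + pd.Timedelta(days=7)

print(anchored.date(), fixed.date())
```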

Here is an alternative: write a function that returns the ratio of the length of one frequency to that of another.
A simplified example:

In [129]:
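def get_freq_coef(freq_1: str, freq_2: str) -> float:
    # time_dict[freq_1][freq_2] is the number of `freq_1` units in one
    # `freq_2` unit, using a 30-day month and a 365-day year.
    time_dict = {
        'D': {'D': 1, 'M': 30, 'Y': 365},
        'M': {'D': 1 / 30, 'M': 1, 'Y': 12},
        'Y': {'D': 1 / 365, 'M': 1 / 12, 'Y': 1},
    }
    return time_dict[freq_1][freq_2]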

In [130]:
get_freq_coef(freq_1='Y', freq_2='D')

0.0027397260273972603

In [131]:
get_freq_coef(freq_1='D', freq_2='Y')

365

In [132]:
get_freq_coef(freq_1='M', freq_2='Y')

12

In [133]:
get_freq_coef(freq_1='Y', freq_2='M')

0.08333333333333333

Something like this can be created for all the relevant time frequencies and then reused wherever we need to convert a number of days into a number of months, etc.
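
One possible way to scale this to more frequencies is to store a single nominal length per unit and derive every pair from it, instead of filling in a full pairwise table. The sketch below assumes the 30-day month and 365-day year from the simplified table, plus a 7-day week; `_APPROX_DAYS` and `get_freq_coef_from_days` are hypothetical names. Note that with these lengths the `M`/`Y` coefficient comes out as 365/30 (about 12.17) rather than exactly 12, so the pairwise table is still the better fit if exact calendar ratios matter.

```python
# Assumed nominal unit lengths in days (30-day month, 365-day year, 7-day week).
_APPROX_DAYS = {"D": 1, "W": 7, "M": 30, "Y": 365}


def get_freq_coef_from_days(freq_1: str, freq_2: str) -> float:
    """Hypothetical variant: number of `freq_1` units in one `freq_2` unit."""
    return _APPROX_DAYS[freq_2] / _APPROX_DAYS[freq_1]


# Convert the daily holding period computed earlier into months.
result_in_days = 0.727585777509342
result_in_months = result_in_days / get_freq_coef_from_days("D", "M")
print(result_in_months)
```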