
# Max overlapping timeseries algorithm

This algorithm finds, at the start date of each timeseries, how many timeseries have at least a minimum number of days of samples remaining.

For example, assume that we have timeseries produced by some type of meter that reports values periodically. The start and end dates can differ from one series to another. To build a dataset for a forecasting model that uses several input variables, we need a set of overlapping timeseries with a minimum duration, so that the ML algorithm can capture the periodicity of the samples.

For each start date, the proposed algorithm reports how many overlapping timeseries are available with at least the minimum duration remaining.

Assume that we have N meters, which generate N files, each containing a timeseries that starts at a `start date` and finishes at an `end date`.

We must generate a dataframe containing:

| `Start date` | `Timeseries ID` | `Duration of timeseries in days` |
|---|---|---|

For example:

```
#Start date, meter ID, Duration in days
2019-06-02T09:37:00.000Z, vgbiwenoi2323, 367
2019-06-05T09:37:00.000Z, dscafweee3498, 450
2019-06-06T09:37:00.000Z, cncinnenr7325, 348
2019-06-09T09:37:00.000Z, onjdqweni8623, 317
2019-06-10T09:37:00.000Z, eiwhdoqwu3764, 347
2019-06-11T09:37:00.000Z, lidscbnqo1387, 227
2019-06-13T09:37:00.000Z, vgbiweeef2424, 367
2019-06-13T09:37:00.000Z, ebnqiunin1298, 387
2019-06-18T09:37:00.000Z, ommfiunun3546, 357
2019-06-21T09:37:00.000Z, tyrfeunht6543, 398
```
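As a minimal sketch of how such a summary could be built, the snippet below reduces each meter's timestamps to one row. It assumes the duration is the number of whole days between the first and last sample, and it uses the column names `start`, `meterId` and `duration` that the code further down expects. The helper `summarise_meter` and the in-memory `readings` are hypothetical stand-ins for the real per-meter files.

```python
import pandas as pd

def summarise_meter(meter_id, timestamps):
    """Return one summary row: first sample, meter ID, duration in whole days."""
    ts = pd.to_datetime(pd.Series(timestamps), utc=True)
    start, end = ts.min(), ts.max()
    return {"start": start, "meterId": meter_id, "duration": (end - start).days}

# Hypothetical readings standing in for the per-meter files.
readings = {
    "meter_a": ["2019-06-02T09:37:00Z", "2019-06-03T09:37:00Z", "2020-06-03T09:37:00Z"],
    "meter_b": ["2019-06-05T09:37:00Z", "2019-07-05T09:37:00Z"],
}

time_ranges = (
    pd.DataFrame([summarise_meter(meter_id, ts) for meter_id, ts in readings.items()])
    .sort_values("start")
    .reset_index(drop=True)
)
print(time_ranges)
```

Saving `time_ranges` with `to_csv` would give a file with the same columns as the one read by the notebook, provided the real data follows the same assumptions.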



In [1]:
import pandas as pd

In [2]:
def run_algorithm(input_csv_filename, min_period):
    timeseries_df= pd.read_csv(input_csv_filename)
    timeseries_df['start'] = pd.to_datetime(timeseries_df['start'])

    # The dataframe must be ordered by increasing date
    timeseries_df=timeseries_df.sort_values(by="start")
    timeseries_df=timeseries_df.reset_index(drop=True)

    # We calculate the delta between rows and leave it in a column as integers
    timeseries_df['delta']= timeseries_df['start'].diff()
    timeseries_df.iloc[0, timeseries_df.columns.get_loc('delta')] = pd.Timedelta('0 days')
    timeseries_df['delta'] = timeseries_df['delta'].dt.days.astype('int64')

    # New column holding, for each row, the number of timeseries with at least min_period days remaining
    timeseries_df['complete_periods']=0

    # Skip the first row: there is no earlier timeseries to update,
    # so its complete_periods value stays at 0
    for i in range(1, len(timeseries_df)):
        # Subtract the days elapsed since the previous start date from the
        # remaining duration of every earlier timeseries
        timeseries_df.loc[0:i-1, 'duration'] = timeseries_df.loc[0:i-1, 'duration'] - timeseries_df.loc[i, 'delta']
        # Count the timeseries (including the current one) that still have at least min_period days left
        timeseries_df.loc[i, 'complete_periods'] = (timeseries_df.loc[0:i, 'duration'] >= min_period).sum()

    return timeseries_df
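The loop in `run_algorithm` rewrites the remaining duration of every earlier row at each step, which is quadratic in the number of meters. The sketch below is an alternative, not the notebook's method: it performs the same kind of count with a single sweep and a min-heap. It assumes day offsets measured from the earliest start date, and unlike `run_algorithm` it also counts the first row. The function name `count_overlaps_sweep` is hypothetical; the sample rows are the ten example series listed at the top.

```python
import heapq
import pandas as pd

def count_overlaps_sweep(df, min_period):
    """For each start date, count the series already started that still
    have at least `min_period` days of data left."""
    df = df.sort_values("start", kind="stable").reset_index(drop=True)
    # Whole days elapsed since the earliest start date.
    start_day = (df["start"] - df["start"].iloc[0]).dt.days
    # Latest day offset at which each series can still supply min_period days.
    last_valid_day = start_day + df["duration"] - min_period

    active = []  # min-heap of last_valid_day for the series seen so far
    counts = []
    for day, limit in zip(start_day, last_valid_day):
        heapq.heappush(active, limit)
        # Start dates only increase, so an expired series never comes back.
        while active and active[0] < day:
            heapq.heappop(active)
        counts.append(len(active))
    df["complete_periods"] = counts
    return df

example = pd.DataFrame({
    "start": pd.to_datetime([
        "2019-06-02T09:37:00.000Z", "2019-06-05T09:37:00.000Z",
        "2019-06-06T09:37:00.000Z", "2019-06-09T09:37:00.000Z",
        "2019-06-10T09:37:00.000Z", "2019-06-11T09:37:00.000Z",
        "2019-06-13T09:37:00.000Z", "2019-06-13T09:37:00.000Z",
        "2019-06-18T09:37:00.000Z", "2019-06-21T09:37:00.000Z",
    ]),
    "meterId": [
        "vgbiwenoi2323", "dscafweee3498", "cncinnenr7325", "onjdqweni8623",
        "eiwhdoqwu3764", "lidscbnqo1387", "vgbiweeef2424", "ebnqiunin1298",
        "ommfiunun3546", "tyrfeunht6543",
    ],
    "duration": [367, 450, 348, 317, 347, 227, 367, 387, 357, 398],
})

result = count_overlaps_sweep(example, min_period=365)
print(result[["start", "meterId", "complete_periods"]])
```

With a 365-day window, the count peaks at three series on 2019-06-13 and again on 2019-06-21. On data whose start times fall at different hours of the day, the per-row day truncation in `run_algorithm` may give slightly different counts than this sketch.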


In [3]:
csv_input_filename = 'time_ranges.csv'

### Find the best time window

In [14]:
for min_period in range(360, 721, 30):
    returned_df = run_algorithm(csv_input_filename, min_period)
    number_meters = returned_df[returned_df['complete_periods'] > 0].shape[0]
    print(f"Period of {min_period} days: {number_meters} meters")

Period of 360 days: 15946 meters
Period of 390 days: 15814 meters
Period of 420 days: 15801 meters
Period of 450 days: 15772 meters
Period of 480 days: 15750 meters
Period of 510 days: 15746 meters
Period of 540 days: 15689 meters
Period of 570 days: 15621 meters
Period of 600 days: 15215 meters
Period of 630 days: 14932 meters
Period of 660 days: 0 meters
Period of 690 days: 0 meters


KeyboardInterrupt: 

In [15]:
csv_input_filename = 'time_ranges.csv'
for min_period in range(630, 661, 5):
    returned_df = run_algorithm(csv_input_filename, min_period)
    number_meters = returned_df[returned_df['complete_periods'] > 0].shape[0]
    print(f"Period of {min_period} days: {number_meters} meters")

Period of 630 days: 14932 meters
Period of 635 days: 14829 meters
Period of 640 days: 14805 meters
Period of 645 days: 14554 meters
Period of 650 days: 14139 meters
Period of 655 days: 62 meters
Period of 660 days: 0 meters


### Run the algorithm with the best time window

In [48]:
best_time_period = 365
results = run_algorithm(csv_input_filename, best_time_period)

In [49]:
max_periods = max(results['complete_periods'])
results[results['complete_periods'] == max_periods]

Unnamed: 0,start,meterId,duration,delta,complete_periods
15796,2020-02-28 23:00:00+00:00,CIR0141691342,26,1,15699
15798,2020-03-04 23:00:00+00:00,CIR0141691341,27,3,15699


In [52]:
results_2 = results[:15796]
results_2[results_2['duration']>0]

Unnamed: 0,start,meterId,duration,delta,complete_periods
0,2019-05-25 22:00:00+00:00,CIR0141449180,83,0,0
1,2019-05-25 22:00:00+00:00,CIR0141600959,83,0,2
2,2019-05-25 22:00:00+00:00,CIR0141601720,83,0,3
3,2019-05-25 22:00:00+00:00,CIR0141682188,81,0,4
4,2019-05-25 22:00:00+00:00,CIR0141441118,83,0,5
...,...,...,...,...,...
15787,2020-02-23 01:00:00+00:00,ZIV0045681792,15,0,15694
15788,2020-02-23 23:00:00+00:00,SAG0196250062,28,0,15695
15793,2020-02-26 01:00:00+00:00,CIR0141691347,27,1,15696
15794,2020-02-27 10:00:00+00:00,ZIV0046094242,27,1,15697


In [18]:
min_date = max(results['start'])
print("Min date: " + str(min_date))

Min date: 2019-06-06 00:00:00+00:00


### Try the results

In [17]:
from datetime import datetime as dt, timedelta as td

In [57]:
df = pd.read_csv('reactive_values/meter_data_ZIV0046096055_S02.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])

In [58]:
expected_delta = td(days=best_time_period)
print(max(df['timestamp']))
print(min(df['timestamp']))

2021-03-12 00:00:00+00:00
2020-02-27 13:00:00+00:00
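One way the first and last timestamps printed above could be compared against the chosen window is sketched below. The helper `covers_period` is hypothetical, and the check assumes that only the overall span matters; it does not detect gaps inside the series.

```python
import pandas as pd

def covers_period(timestamps, min_days):
    """Return (True/False, span): whether the series spans at least min_days."""
    ts = pd.to_datetime(pd.Series(timestamps), utc=True)
    span = ts.max() - ts.min()
    return span >= pd.Timedelta(days=min_days), span

# First and last timestamps of the meter file, as printed above.
ok, span = covers_period(["2020-02-27T13:00:00Z", "2021-03-12T00:00:00Z"], 365)
print(ok, span)
```

For this meter the span is 378 days and 11 hours, so it covers a 365-day window.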


In [28]:
dt(2021, 3, 11) - dt(2019, 9, 6)

datetime.timedelta(days=552)

In [59]:
df['timestamp'].sort_values()

0      2020-02-27 13:00:00+00:00
1      2020-02-27 14:00:00+00:00
2      2020-02-27 15:00:00+00:00
3      2020-02-27 16:00:00+00:00
4      2020-02-27 17:00:00+00:00
                  ...           
7121   2021-03-11 20:00:00+00:00
7122   2021-03-11 21:00:00+00:00
7123   2021-03-11 22:00:00+00:00
7124   2021-03-11 23:00:00+00:00
7125   2021-03-12 00:00:00+00:00
Name: timestamp, Length: 7126, dtype: datetime64[ns, UTC]

In [39]:
df = pd.read_csv('time_ranges.csv')

In [42]:
df.sort_values(by='start')

Unnamed: 0,start,meterId,duration
2950,2016-12-21T18:07:14.000Z,CIR0141691637,1541
15511,2019-05-25T22:00:00.000Z,CIR0141456543,656
4157,2019-05-25T22:00:00.000Z,CIR0141600597,654
14409,2019-05-25T22:00:00.000Z,CIR0141601720,656
14797,2019-05-25T22:00:00.000Z,CIR0141600959,656
...,...,...,...
3580,2021-03-04T23:00:00.000Z,SAG0205909972,7
751,2021-03-05T23:00:00.000Z,SAG0205909973,6
16095,2021-03-06T23:00:00.000Z,SAG0205910047,5
11087,2021-03-08T01:00:00.000Z,SAG0205910036,1
