$\newcommand{\bbT}{\mathbb{T}}$
$\newcommand{\bbR}{\mathbb{R}}$
$\newcommand{\bbL}{\mathbb{L}}$
$\newcommand{\oc}{\overline{c}}$


## Timeseries vs Event-Sets

When working with temporal data in real-world datasets or databases, you typically encounter two kinds of data over a time domain $\bbT$:

- Timeseries-based data, which may be seen as a **mapping**
$ts: \bbT \rightarrow \bbR^m$, e.g., in our specific *fitbit* dataset, the calories:

In [1]:
import nbimporter
from ETLBasics_t1 import calories_to_df
PATH = '../../../datasets/pmdata/'
calories_to_df(PATH, [1,2,3])

partecipant,TS,calories
1,2019-11-01 00:00:00,1.39
1,2019-11-01 00:01:00,1.39
1,2019-11-01 00:02:00,1.39
1,2019-11-01 00:03:00,1.39
1,2019-11-01 00:04:00,1.39
...,...,...
3,2020-03-31 23:55:00,1.34
3,2020-03-31 23:56:00,1.34
3,2020-03-31 23:57:00,1.34
3,2020-03-31 23:58:00,1.34


Notice that there is one timeline per participant and that, in the **raw** data, each timeline has a granularity of one *minute*. The timeline can therefore be *conveniently transformed* into one with a *coarser* granularity,
such as 15 minutes, half an hour, or an hour, by grouping the minutes that fall into the same interval and summing their calories.
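
As a minimal sketch of this coarsening step, assuming a per-minute series indexed by timestamp (the toy values below are made up for illustration and are not read from the dataset), pandas `resample` can sum the calories falling into each 15-minute bucket:

```python
import pandas as pd

# Hypothetical toy series: 60 one-minute readings of 1.39 calories each.
idx = pd.date_range("2019-11-01 00:00", periods=60, freq="min")
calories = pd.Series(1.39, index=idx, name="calories")

# Coarsen the timeline from 1 minute to 15 minutes by summing each bucket.
calories_15min = calories.resample("15min").sum()
print(calories_15min)
```

Summing is appropriate here because calories are an additive quantity; for a signal such as heart rate, an average would probably be the more natural aggregate.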

- Event-based data, which may be seen as a set $E\subseteq \bbT\times \bbR^m \times \bbL^n$. Unlike a timeseries, which by definition assigns a value to every timestamp, $E$ does not necessarily *cover* all the possible timestamps in $\bbT$. In our specific *fitbit* dataset this kind of data is exemplified by the exercises:

In [2]:
import json
import pandas as pd
with open(PATH + 'p01/fitbit/' + 'exercise.json') as file:
    dict_ex = json.load(file)
df_ex = pd.DataFrame.from_dict(dict_ex)
df_ex['PatID'] = 1
df_ex[['startTime', 'duration','steps','activityName',  'PatID']]

,startTime,duration,steps,activityName,PatID
0,2019-11-01 14:56:32,1331000,1878,Walk,1
1,2019-11-01 19:03:11,2202000,2786,Walk,1
2,2019-11-02 13:26:38,2458000,3035,Walk,1
3,2019-11-04 21:22:08,1024000,1284,Walk,1
4,2019-11-05 19:27:25,973000,1065,Walk,1
...,...,...,...,...,...
185,2020-03-27 13:07:53,1076000,1203,Walk,1
186,2020-03-27 16:22:27,2918000,3909,Walk,1
187,2020-03-28 09:58:08,5581000,7599,Walk,1
188,2020-03-29 07:42:53,1076000,1279,Walk,1


The table above is the set of events $E$ for the exercises of participant $1$. It is a simplified version of the data suggested for extraction, but it serves as an example of an event set. First, notice that the granularity of $\bbT$ (represented by the $startTime$ attribute) is *seconds*; this is of little importance, however, since an event set does not cover the full $\bbT$. Second, there are two numerical attributes (i.e., with values in $\bbR$), $duration$ and $steps$, and one categorical attribute, i.e., a label, $activityName$. $PatID$ is better seen as part of the index that identifies events belonging to the same context, in this case the same participant.
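
The split of an event into its $\bbT$, $\bbR^m$ and $\bbL^n$ components can be made explicit in code. The sketch below retypes by hand the first three rows of the table above and separates the columns by role; the variable names (`time_col`, `context_cols`, and so on) are illustrative choices, not part of the dataset:

```python
import pandas as pd

# First three rows of the exercise table shown above, typed in by hand.
events = pd.DataFrame({
    "startTime": pd.to_datetime([
        "2019-11-01 14:56:32", "2019-11-01 19:03:11", "2019-11-02 13:26:38",
    ]),
    "duration": [1331000, 2202000, 2458000],
    "steps": [1878, 2786, 3035],
    "activityName": ["Walk", "Walk", "Walk"],
    "PatID": [1, 1, 1],
})

time_col = "startTime"      # the T component
context_cols = ["PatID"]    # identifies the context, not an attribute
payload = [c for c in events.columns if c != time_col and c not in context_cols]

# Numerical attributes (R^m) versus labels (L^n).
numeric_cols = [c for c in payload if pd.api.types.is_numeric_dtype(events[c])]
label_cols = [c for c in payload if c not in numeric_cols]
print(numeric_cols, label_cols)
```

For this table the split gives $m = 2$ numerical attributes and $n = 1$ label.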


### From Timeseries to Event-Sets

For the purpose of our analysis, Event-Sets are better suited than timeseries to being represented as temporal transactions. However, we may also want to bring information from timeseries into our temporal transactions. One intuitive way to do that is to turn a timeseries $ts: \bbT \rightarrow \bbR$ into a set of
events $E_{ts} \subseteq \bbT \times \bbR^m \times \bbL^n$ where each event
$(t, r_1, \ldots, r_m, l_1, \ldots, l_n)$ represents a specific behaviour of $ts$
*around* time $t$ (e.g., within a time frame of length $5$ minutes centered at $t$).
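
One possible reading of "behaviour around $t$" is a summary over a window centered at $t$. The sketch below assumes a per-minute series with made-up values, uses a 5-sample centered rolling window, and attaches a label from a hypothetical threshold of 1.5; none of these choices come from the dataset:

```python
import pandas as pd

# Hypothetical per-minute series (values are illustrative only).
idx = pd.date_range("2019-11-01 00:00", periods=10, freq="min")
ts = pd.Series([1.0, 1.0, 2.0, 4.0, 2.0, 1.0, 1.0, 1.0, 3.0, 1.0], index=idx)

# 5-sample window centered on each timestamp t.
# Timestamps without a full window (two at each edge) yield NaN and are dropped.
window = ts.rolling(window=5, center=True)
events = pd.DataFrame({"mean": window.mean(), "std": window.std()}).dropna()

# Label component: a hypothetical threshold separating 'high' from 'low' activity.
events["level"] = events["mean"].apply(lambda v: "high" if v >= 1.5 else "low")
print(events)
```

Each row of `events` is then a tuple $(t, r_1, r_2, l_1)$ with $m = 2$ and $n = 1$.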

## Timeseries to events: an example

There are various ways of extracting events from timeseries (see the references below),
and in the current task you are encouraged to try some of them, or one of your own, according to the structure of the data
that you profiled during task $t2$. Moreover, you may want to use distinct techniques or distinct parameters for different timeseries (e.g., one technique or set of parameters for extracting events from the calories and another for the heart rate timeseries).


Let us consider the $calories$ example and suppose that we want to extract hourly events
of the form $(t, \oc, c_{\sigma}, weekend, week\_day, hour\_day, eight\_part\_day, quarter\_day, half\_day)$ where:
- $t$ is the timestamp of the beginning of the hour over which we take the measure;
- $\oc$ is the average calories per minute in the given hour;
- $c_{\sigma}$ is the standard deviation of the calories per minute in the given hour;
- $weekend$ is a boolean denoting whether the timestamp falls on a weekend day (i.e., Saturday or Sunday) rather than on a weekday;
- $week\_day$ is the day of the week, e.g., Monday;
- $hour\_day$ is the hour in which the measure took place, e.g., 12 for the interval $[12, 13)$;
- $eight\_part\_day$ is the part of the day in which the measure took place, with the day split into eight parts: 0 for hours in the interval [0, 2], 1 for hours in the interval [3, 5] and so on; in general, $i$ for hours in the interval $[3i, 3(i+1)-1]$ with $0\leq i\leq 7$;
- $quarter\_day$ is the same as above but with the day split into four parts: $i$ for hours in the interval $[6i, 6(i+1)-1]$ with $0\leq i\leq 3$;
- $half\_day$ is the same as above but with just two values: $0$ for hours from $0$ to $11$ and $1$ for hours from $12$ to $23$.

The following cells perform the above operation step by step:

In [3]:
df = calories_to_df(PATH, [1,2,3])
# Copy the two index levels (participant id, timestamp) into regular columns.
df['pid'] = [x[0] for x in list(df.index)]
df['ts'] = [x[1] for x in list(df.index)]

In [4]:
# Hour-level key such as '2019-11-01_00', used below to group minutes by hour.
df['tsh'] =  df['ts'].apply(lambda x: x.strftime('%Y-%m-%d_%H'))

In [5]:
df

partecipant,TS,calories,pid,ts,tsh
1,2019-11-01 00:00:00,1.39,1,2019-11-01 00:00:00,2019-11-01_00
1,2019-11-01 00:01:00,1.39,1,2019-11-01 00:01:00,2019-11-01_00
1,2019-11-01 00:02:00,1.39,1,2019-11-01 00:02:00,2019-11-01_00
1,2019-11-01 00:03:00,1.39,1,2019-11-01 00:03:00,2019-11-01_00
1,2019-11-01 00:04:00,1.39,1,2019-11-01 00:04:00,2019-11-01_00
...,...,...,...,...,...
3,2020-03-31 23:55:00,1.34,3,2020-03-31 23:55:00,2020-03-31_23
3,2020-03-31 23:56:00,1.34,3,2020-03-31 23:56:00,2020-03-31_23
3,2020-03-31 23:57:00,1.34,3,2020-03-31 23:57:00,2020-03-31_23
3,2020-03-31 23:58:00,1.34,3,2020-03-31 23:58:00,2020-03-31_23


In [6]:
df_agg = df.groupby(['pid', 'tsh']).agg({'calories':['mean', 'std']})
df_agg

,,calories,calories
pid,tsh,mean,std
1,2019-11-01_00,1.390000,0.000000
1,2019-11-01_01,1.397000,0.054222
1,2019-11-01_02,1.392333,0.018074
1,2019-11-01_03,1.399333,0.035217
1,2019-11-01_04,1.399333,0.035217
...,...,...,...
3,2020-03-31_19,2.070667,1.175683
3,2020-03-31_20,1.552000,0.541876
3,2020-03-31_21,1.340000,0.000000
3,2020-03-31_22,1.340000,0.000000


In [7]:
# Copy the two aggregates into top-level columns and drop the nested 'calories' group.
df_agg['c_mean'] = df_agg['calories']['mean']
df_agg['c_std'] = df_agg['calories']['std']
df_agg = df_agg.drop(['calories'], axis=1)
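
As an alternative to aggregating and then renaming, pandas named aggregation produces the flat `c_mean` and `c_std` columns in a single step. The sketch below runs on a hypothetical toy frame (`toy`) with the same column names as `df`, not on the dataset itself:

```python
import pandas as pd

# Hypothetical toy frame with the same columns as `df` above.
toy = pd.DataFrame({
    "pid": [1, 1, 1, 1],
    "tsh": ["2019-11-01_00", "2019-11-01_00", "2019-11-01_01", "2019-11-01_01"],
    "calories": [1.0, 3.0, 2.0, 2.0],
})

# Named aggregation: output_column=(input_column, aggregation).
toy_agg = toy.groupby(["pid", "tsh"]).agg(
    c_mean=("calories", "mean"),
    c_std=("calories", "std"),
)
print(toy_agg)
```

Note that pandas computes `std` as the sample standard deviation (`ddof=1`) by default, which is also what the aggregation in the cell above uses.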

In [8]:
df_agg['ts'] = [x[1] for x in list(df_agg.index)]

In [9]:
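from dateutil import parser as du_parser  # import the parser submodule explicitly
# Strip the trailing '_HH' and parse the remaining 'YYYY-MM-DD' part.
df_agg['date'] = df_agg['ts'].apply(lambda x: du_parser.parse(x[0:-3]))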

In [10]:
df_agg['hour'] =  df_agg['ts'].apply(lambda x: int(x[-2:])) 
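
The date and the hour can also be recovered without string slicing, by parsing the whole key with an explicit format. This is a sketch on three hypothetical key strings in the same `'%Y-%m-%d_%H'` layout as `tsh`:

```python
import pandas as pd

# Hypothetical keys in the same layout as the `tsh` column.
keys = pd.Series(["2019-11-01_00", "2019-11-01_13", "2019-11-02_23"])

# Parse the full key once, then derive date and hour from the timestamps.
parsed = pd.to_datetime(keys, format="%Y-%m-%d_%H")
date = parsed.dt.normalize()   # midnight of the same day
hour = parsed.dt.hour          # integer hour, 0..23
print(pd.DataFrame({"date": date, "hour": hour}))
```

Parsing with an explicit format fails loudly on a malformed key, whereas slicing by position would silently return the wrong substring.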

In [11]:
df_agg['eight_part_day'] = df_agg['hour'].apply(lambda x: int(x/3)) 

In [12]:
df_agg['eight_part_day'].iloc[0:24]

pid  tsh          
1    2019-11-01_00    0
     2019-11-01_01    0
     2019-11-01_02    0
     2019-11-01_03    1
     2019-11-01_04    1
     2019-11-01_05    1
     2019-11-01_06    2
     2019-11-01_07    2
     2019-11-01_08    2
     2019-11-01_09    3
     2019-11-01_10    3
     2019-11-01_11    3
     2019-11-01_12    4
     2019-11-01_13    4
     2019-11-01_14    4
     2019-11-01_15    5
     2019-11-01_16    5
     2019-11-01_17    5
     2019-11-01_18    6
     2019-11-01_19    6
     2019-11-01_20    6
     2019-11-01_21    7
     2019-11-01_22    7
     2019-11-01_23    7
Name: eight_part_day, dtype: int64

In [13]:
df_agg['quarter_day'] = df_agg['hour'].apply(lambda x: int(x/6)) 
df_agg['quarter_day'].iloc[0:24]

pid  tsh          
1    2019-11-01_00    0
     2019-11-01_01    0
     2019-11-01_02    0
     2019-11-01_03    0
     2019-11-01_04    0
     2019-11-01_05    0
     2019-11-01_06    1
     2019-11-01_07    1
     2019-11-01_08    1
     2019-11-01_09    1
     2019-11-01_10    1
     2019-11-01_11    1
     2019-11-01_12    2
     2019-11-01_13    2
     2019-11-01_14    2
     2019-11-01_15    2
     2019-11-01_16    2
     2019-11-01_17    2
     2019-11-01_18    3
     2019-11-01_19    3
     2019-11-01_20    3
     2019-11-01_21    3
     2019-11-01_22    3
     2019-11-01_23    3
Name: quarter_day, dtype: int64

In [14]:
df_agg['half_day'] = df_agg['hour'].apply(lambda x: int(x/12)) 
df_agg['half_day'].iloc[0:24]

pid  tsh          
1    2019-11-01_00    0
     2019-11-01_01    0
     2019-11-01_02    0
     2019-11-01_03    0
     2019-11-01_04    0
     2019-11-01_05    0
     2019-11-01_06    0
     2019-11-01_07    0
     2019-11-01_08    0
     2019-11-01_09    0
     2019-11-01_10    0
     2019-11-01_11    0
     2019-11-01_12    1
     2019-11-01_13    1
     2019-11-01_14    1
     2019-11-01_15    1
     2019-11-01_16    1
     2019-11-01_17    1
     2019-11-01_18    1
     2019-11-01_19    1
     2019-11-01_20    1
     2019-11-01_21    1
     2019-11-01_22    1
     2019-11-01_23    1
Name: half_day, dtype: int64
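
The three partitions follow the same pattern: a part of width $w$ hours is obtained by integer division of the hour by $w$. The sketch below computes them with vectorized `//` on the 24 hours of a day and checks them against the interval formula $[wi, w(i+1)-1]$ given earlier; the `widths` dictionary is an illustrative helper, not part of the original cells:

```python
import pandas as pd

hours = pd.Series(range(24), name="hour")

# Width (in hours) of each part of the day.
widths = {"eight_part_day": 3, "quarter_day": 6, "half_day": 12}
parts = pd.DataFrame({col: hours // w for col, w in widths.items()})

# Check: part i of width w covers exactly the hours [w*i, w*(i+1) - 1].
for col, w in widths.items():
    bounds = hours.groupby(parts[col]).agg(["min", "max"])
    i = bounds.index.to_numpy()
    assert (bounds["min"].to_numpy() == w * i).all()
    assert (bounds["max"].to_numpy() == w * (i + 1) - 1).all()

print(parts.head(7))
```

For non-negative integer hours, `x // w` and `int(x / w)` give the same result, so this matches the cells above.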

In [15]:
df_agg['week_day'] = df_agg['date'].apply(lambda x: x.strftime('%A'))
df_agg['week_day'].iloc[0:24] 

pid  tsh          
1    2019-11-01_00    Friday
     2019-11-01_01    Friday
     2019-11-01_02    Friday
     2019-11-01_03    Friday
     2019-11-01_04    Friday
     2019-11-01_05    Friday
     2019-11-01_06    Friday
     2019-11-01_07    Friday
     2019-11-01_08    Friday
     2019-11-01_09    Friday
     2019-11-01_10    Friday
     2019-11-01_11    Friday
     2019-11-01_12    Friday
     2019-11-01_13    Friday
     2019-11-01_14    Friday
     2019-11-01_15    Friday
     2019-11-01_16    Friday
     2019-11-01_17    Friday
     2019-11-01_18    Friday
     2019-11-01_19    Friday
     2019-11-01_20    Friday
     2019-11-01_21    Friday
     2019-11-01_22    Friday
     2019-11-01_23    Friday
Name: week_day, dtype: object

In [16]:
df_agg['week_day'].iloc[24:48] 

pid  tsh          
1    2019-11-02_00    Saturday
     2019-11-02_01    Saturday
     2019-11-02_02    Saturday
     2019-11-02_03    Saturday
     2019-11-02_04    Saturday
     2019-11-02_05    Saturday
     2019-11-02_06    Saturday
     2019-11-02_07    Saturday
     2019-11-02_08    Saturday
     2019-11-02_09    Saturday
     2019-11-02_10    Saturday
     2019-11-02_11    Saturday
     2019-11-02_12    Saturday
     2019-11-02_13    Saturday
     2019-11-02_14    Saturday
     2019-11-02_15    Saturday
     2019-11-02_16    Saturday
     2019-11-02_17    Saturday
     2019-11-02_18    Saturday
     2019-11-02_19    Saturday
     2019-11-02_20    Saturday
     2019-11-02_21    Saturday
     2019-11-02_22    Saturday
     2019-11-02_23    Saturday
Name: week_day, dtype: object

In [17]:
df_agg['weekend'] = df_agg['week_day'].isin(['Saturday', 'Sunday'])

In [18]:
df_agg['weekend'].iloc[36:96]

pid  tsh          
1    2019-11-02_12     True
     2019-11-02_13     True
     2019-11-02_14     True
     2019-11-02_15     True
     2019-11-02_16     True
     2019-11-02_17     True
     2019-11-02_18     True
     2019-11-02_19     True
     2019-11-02_20     True
     2019-11-02_21     True
     2019-11-02_22     True
     2019-11-02_23     True
     2019-11-03_00     True
     2019-11-03_01     True
     2019-11-03_02     True
     2019-11-03_03     True
     2019-11-03_04     True
     2019-11-03_05     True
     2019-11-03_06     True
     2019-11-03_07     True
     2019-11-03_08     True
     2019-11-03_09     True
     2019-11-03_10     True
     2019-11-03_11     True
     2019-11-03_12     True
     2019-11-03_13     True
     2019-11-03_14     True
     2019-11-03_15     True
     2019-11-03_16     True
     2019-11-03_17     True
     2019-11-03_18     True
     2019-11-03_19     True
     2019-11-03_20     True
     2019-11-03_21     True
     2019-11-03_22     True
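
Matching on English day names relies on `strftime('%A')` returning English strings, which depends on the locale of the machine. A sketch of a locale-independent variant, on four hypothetical dates, uses the numeric day of the week instead (Monday is 0, so Saturday and Sunday are 5 and 6):

```python
import pandas as pd

# Four consecutive hypothetical dates starting on Friday 2019-11-01.
dates = pd.Series(pd.to_datetime(
    ["2019-11-01", "2019-11-02", "2019-11-03", "2019-11-04"]
))

week_day = dates.dt.day_name()      # English names by default
weekend = dates.dt.dayofweek >= 5   # Monday=0 ... Saturday=5, Sunday=6
print(pd.DataFrame({"week_day": week_day, "weekend": weekend}))
```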
 

In [19]:
df_agg = df_agg.drop(['date', 'ts'], axis=1)

In [20]:
df_agg

pid,tsh,c_mean,c_std,hour,eight_part_day,quarter_day,half_day,week_day,weekend
1,2019-11-01_00,1.390000,0.000000,0,0,0,0,Friday,False
1,2019-11-01_01,1.397000,0.054222,1,0,0,0,Friday,False
1,2019-11-01_02,1.392333,0.018074,2,0,0,0,Friday,False
1,2019-11-01_03,1.399333,0.035217,3,1,0,0,Friday,False
1,2019-11-01_04,1.399333,0.035217,4,1,0,0,Friday,False
...,...,...,...,...,...,...,...,...,...
3,2020-03-31_19,2.070667,1.175683,19,6,3,1,Tuesday,False
3,2020-03-31_20,1.552000,0.541876,20,6,3,1,Tuesday,False
3,2020-03-31_21,1.340000,0.000000,21,7,3,1,Tuesday,False
3,2020-03-31_22,1.340000,0.000000,22,7,3,1,Tuesday,False
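
The cell-by-cell steps above can be collected into a single function. The sketch below is one possible consolidation, not the notebook's own code: it assumes a frame with columns `pid`, `ts` (datetime) and `calories`, keys each event by the timestamp of the beginning of the hour rather than by the string `tsh`, and is exercised on made-up toy data. The names `hourly_events` and `toy` are hypothetical.

```python
import pandas as pd

def hourly_events(minute_df):
    """Build hourly events from per-minute rows with columns pid, ts, calories."""
    # t: beginning of the hour each minute belongs to.
    hour_start = minute_df["ts"].dt.floor("h")
    out = (
        minute_df.assign(t=hour_start)
        .groupby(["pid", "t"])
        .agg(c_mean=("calories", "mean"), c_std=("calories", "std"))
        .reset_index()
    )
    # Calendar attributes derived from t.
    out["hour"] = out["t"].dt.hour
    out["eight_part_day"] = out["hour"] // 3
    out["quarter_day"] = out["hour"] // 6
    out["half_day"] = out["hour"] // 12
    out["week_day"] = out["t"].dt.day_name()
    out["weekend"] = out["t"].dt.dayofweek >= 5
    return out

# Toy data: two hours of per-minute readings for one hypothetical participant,
# starting at 11:00 on Saturday 2019-11-02.
ts = pd.date_range("2019-11-02 11:00", periods=120, freq="min")
toy = pd.DataFrame({"pid": 1, "ts": ts, "calories": 1.39})
events = hourly_events(toy)
print(events)
```

Keeping $t$ as a timestamp makes the result match the event definition directly and avoids parsing the string key back into a date.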


### References

[Brockwell, Peter J., et al. Introduction to Time Series and Forecasting. Springer, 2016.](https://link.springer.com/book/10.1007/978-3-319-29854-2)

Chatfield, C. The Analysis of Time Series: An Introduction, 6th edition. Boca Raton, Florida: Chapman and Hall/CRC Press, 2003.

Fan, J., and Yao, Q. Nonlinear Time Series: Nonparametric and Parametric Methods. New York: Springer-Verlag, 2003.

Dasu, T., and Johnson, T. Exploratory Data Mining and Data Cleaning. Hoboken, New Jersey: Wiley, 2003.
