In [None]:
import pandas as pd
import json
import datetime as dt
import dateutil as du
import dateutil.parser  # make sure the du.parser submodule is loaded

You can download the Fitbit dataset from [https://datasets.simula.no/pmdata/](https://datasets.simula.no/pmdata/); the site also features useful references on the structure of the dataset. Please change the variable below to the path where you extracted the dataset:

In [None]:
PATH = '../../../datasets/pmdata/'

The structure of the dataset is the following:

In [None]:
!ls $PATH

Each participant has an associated folder that contains various information collected over a six-month period during which they agreed to self-monitor. We are interested in the Fitbit files contained in the folder `pNUMBER/fitbit`:

In [None]:
p_id = 1
p_folder = PATH + "p{:02d}".format(p_id) + '/fitbit/'
!ls {p_folder}
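The folder naming convention above can also be expressed with `pathlib`. This is only a minimal sketch: the helper name `fitbit_folder` is hypothetical and is not used elsewhere in this notebook.

```python
from pathlib import Path

PATH = "../../../datasets/pmdata/"  # same root as above; adjust to your setup

def fitbit_folder(root, p_id):
    """Return the fitbit folder of a participant, e.g. <root>/p01/fitbit."""
    # The participant id is zero-padded to two digits, as in the dataset.
    return Path(root) / f"p{p_id:02d}" / "fitbit"

print(fitbit_folder(PATH, 1))
```

Building paths with `Path` avoids having to remember trailing slashes when concatenating strings.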

Let us begin with the **calories.json** file which, according to the description, shows how many calories the person has burned in the last minute.

In [None]:
calories_file = 'calories.json'

with open(p_folder + calories_file) as file:
    dict_cal = json.load(file) 
    
nrows  = 10  

print(f'First {nrows} rows of json list of length {len(dict_cal)}:\n' +  '\n'.join([str(d) for d in dict_cal[0:nrows]]))

As we will see, the JSON structure of these files does not feature arbitrary nesting depth, so it is easy to turn them into a dataframe. First we convert the data to their appropriate types and assign suitable names:

In [None]:
for d in dict_cal:
    d['TS'] = du.parser.parse(d['dateTime'])
    d['calories'] = float(d['value'])
    d.pop('dateTime')
    d.pop('value')
dict_cal[0:nrows]

Then we turn the list of dictionaries into a dataframe:

In [None]:
df_cal = pd.DataFrame.from_dict(dict_cal)
df_cal

We want the index to be the timestamp (as it should be):

In [None]:
df_cal = df_cal.set_index('TS')
df_cal

The only thing missing is the participant id, which we add so that we can
deal with multiple participants and compare them:

In [None]:
df_cal['partecipant'] = p_id
df_cal
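The steps above can also be sketched with vectorised pandas operations instead of a Python loop. The records below are made up for illustration and only mimic the `dateTime`/`value` layout; the date format of the real file is an assumption here.

```python
import pandas as pd

# Made-up records mimicking the {'dateTime', 'value'} layout (not real data).
records = [
    {"dateTime": "2020-01-01 00:00:00", "value": "1.39"},
    {"dateTime": "2020-01-01 00:01:00", "value": "1.50"},
]

df = pd.DataFrame(records)
df["TS"] = pd.to_datetime(df["dateTime"])    # parse the whole column at once
df["calories"] = df["value"].astype(float)   # strings -> floats
df = df.drop(columns=["dateTime", "value"]).set_index("TS")
df["partecipant"] = 1                        # same column name as above
print(df)
```

On large files, converting whole columns is usually faster than parsing each record individually.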

Let us now wrap what we have done so far into a Python function
```python
calories_to_df(root_path, partecipants)
```
that performs all the above operations for each participant in the ***partecipants*** list, starting from the designated pmdata root folder ***root_path***, and returns the concatenation of the resulting dataframes for the selected participants:

In [None]:
def calories_to_df(root_path, partecipants):
    """Load calories.json for each participant id into a single dataframe."""
    dfs = []
    for p_id in partecipants:
        p_folder = root_path + "p{:02d}".format(p_id) + '/fitbit/'
        calories_file = 'calories.json'
        with open(p_folder + calories_file) as file:
            dict_cal = json.load(file)
        # Convert each record to proper types and rename its fields
        for d in dict_cal:
            d['TS'] = du.parser.parse(d['dateTime'])
            d['calories'] = float(d['value'])
            d.pop('dateTime')
            d.pop('value')
        df_cal = pd.DataFrame.from_dict(dict_cal)
        df_cal['partecipant'] = p_id
        # (partecipant, TS) identifies each row uniquely
        df_cal = df_cal.set_index(['partecipant','TS'])
        dfs.append(df_cal)
    # Stack all participants and sort the index for fast slicing
    r = pd.concat(dfs)
    r = r.sort_index()
    return r

The only difference is that we have put the participant in the index to avoid duplicates in the index, exactly as we would do for a key in a relational database table.
Note that the index is lexicographically sorted to speed up slicing, which is especially useful when dealing with time series.
Let us test the end result:

In [None]:
calories = calories_to_df(PATH, [1,10])


In [None]:
calories

In [None]:
calories.loc[1].loc[dt.datetime(2020,1,1):dt.datetime(2020,1,10)]
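The same kind of selection can be written as a single `.loc` call on the sorted two-level index. The sketch below uses a small synthetic dataframe (the values are made up) that only imitates the `(partecipant, TS)` index produced by `calories_to_df`.

```python
import pandas as pd

# Synthetic stand-in for the output of calories_to_df (values are made up).
ts = pd.date_range("2020-01-01", periods=4, freq="D")
idx = pd.MultiIndex.from_product([[1, 10], ts], names=["partecipant", "TS"])
demo = pd.DataFrame(
    {"calories": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]}, index=idx
).sort_index()

# Slice participant 1 between two timestamps (both ends are inclusive).
start = (1, pd.Timestamp("2020-01-02"))
end = (1, pd.Timestamp("2020-01-03"))
window = demo.loc[start:end]
print(window)
```

Range slicing with tuples like this relies on the index being sorted, which is why `calories_to_df` calls `sort_index()` before returning.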

# Data Mining Project (part 1) 

Write a function similar to
```python
calories_to_df
```
for each of the following JSON files:

- sedentary_minutes.json
- distance.json
- sleep.json
- exercise.json
- heart_rate.json
- steps.json
- lightly_active_minutes.json
- time_in_heart_rate_zones.json
- moderately_active_minutes.json
- very_active_minutes.json
- resting_heart_rate.json


### Note: 
It is NOT mandatory to import all the fields. In particular, the file **exercise.json** contains many fields that are not very informative from an analysis point of view. Its structure is the following:

In [None]:
with open(p_folder + 'exercise.json') as file:
    dict_ex = json.load(file)
dict_ex[0]

Unfortunately, pandas reads only the first level of a nested dictionary:

In [None]:
df_ex = pd.DataFrame.from_dict(dict_ex)
df_ex.iloc[0:1]
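For nested dictionaries, `pd.json_normalize` can flatten the inner keys into separate columns. The record below is hypothetical and is not taken from the dataset; it only illustrates the behaviour.

```python
import pandas as pd

# Hypothetical nested record, not taken from the dataset.
nested = [{"logId": 1, "source": {"type": "tracker", "name": "demo"}}]

# Inner keys become columns named <outer><sep><inner>.
flat = pd.json_normalize(nested, sep="_")
print(flat.columns.tolist())
```

Note that values which are lists of dictionaries are left as single cells by default, so they still need separate handling.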

In my opinion, a good dataframe for storing this information would have the following index and columns:

INDEX:
- startTime 
- patID

COLUMNS:
- activityName
- activityLevelSedentary
- activityLevelLightly
- activityLevelFairly
- activityLevelVery
- averageHeartRate
- calories
- activeDuration
- steps
- heartRateZonesOutofRange
- heartRateZonesFatBurn
- heartRateZonesCardio
- heartRateZonesPeak
- elevationGain
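One possible way to obtain columns such as `activityLevelSedentary` is to spread a list of per-level entries over prefixed columns. This is only a sketch: the record below and its field names (`name`, `minutes`) are assumptions made for illustration and must be checked against the real exercise.json, and `flatten_levels` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical record: field names and nesting are assumptions, not real data.
record = {
    "activityName": "Walk",
    "calories": 120,
    "activityLevel": [
        {"name": "sedentary", "minutes": 0},
        {"name": "lightly", "minutes": 5},
    ],
}

def flatten_levels(rec, key, prefix):
    """Turn a list of {'name', 'minutes'} dicts into one column per name."""
    flat = {k: v for k, v in rec.items() if k != key}
    for item in rec.get(key, []):
        # e.g. "sedentary" -> "activityLevelSedentary"
        column = prefix + item["name"].title().replace(" ", "")
        flat[column] = item["minutes"]
    return flat

df_flat = pd.DataFrame([flatten_levels(record, "activityLevel", "activityLevel")])
print(df_flat.columns.tolist())
```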


### Note:

This is only a suggestion on how to transform the information. If you prefer a different choice, I encourage you to make it, but you have to motivate it.

Do the same for the following CSV files in the */pmsys* folder:

In [None]:
!ls $PATH/p01/pmsys

If you want to know more about SRPE, here are some references to satisfy your curiosity.

#### More on SRPE: Session Rating of Perceived Exertion

<a href='https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5673663/'>A new approach to monitoring exercise training</a>

<a href='https://pubmed.ncbi.nlm.nih.gov/11708692/'>
Session-RPE Method for Training Load Monitoring: Validity, Ecological Usefulness, and Influencing Factors
</a>    



