# Introduction

This is a tutorial for using Forest to analyze Beiwe data. We will first download the data using mano, then process it with Forest, and finally create time series plots from the generated summary statistics. There are five parts to this tutorial.

1. Check the Python version and download Forest.
2. Download data for your study from the server.
3. Explore the file structure of your data.
4. Process data using Forest.
5. Create time series plots.

## Check Python Version and Download Forest

Before we begin, we need to check the current distribution of Python. Note that forest is built using Python 3.11. 

In [1]:
from platform import python_version
import sys

- Print the python version and the path to the Python interpreter. 

In [2]:
print(python_version()) ## Prints your version of python
print(sys.executable) ## Prints your current python installation

3.12.5
/n/onnela_dp_l3/Lab/envs/.forest_venv/bin/python


*The output should display two lines.* 

1. The Python version installed. Make sure you are not using a version of Python earlier than 3.11.
2. The path to where Python is currently installed.
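The version requirement above can also be checked in code rather than by eye. The following is a minimal sketch; the helper name `python_is_supported` is hypothetical and is not part of Forest or mano.

```python
import sys

# Minimum version taken from the note above (Forest targets Python 3.11).
MINIMUM_PYTHON = (3, 11)

def python_is_supported(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True when the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)

if not python_is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is older than 3.11")
```

Running this in the first cell gives an early warning before any packages are installed.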

- You may need to install git, pip, mano and forest. To do so, either run the cell below (the one with lines starting with `%pip`) or enter the commands listed after it (the ones without the `%` prefix) in a command-line shell. If you already have mano and forest installed, you can skip to the next step.

In [3]:
#run this chunk to install mano and forest
%pip install mano 
%pip install --upgrade https://github.com/onnela-lab/forest/tarball/develop
%pip install orjson

Note: you may need to restart the kernel to use updated packages.
Collecting https://github.com/onnela-lab/forest/tarball/develop
  Using cached https://github.com/onnela-lab/forest/tarball/develop
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


`# Or, copy and paste the below lines into a command-line shell` 

`pip install mano`

`pip install https://github.com/onnela-lab/forest/tarball/develop`

Note: In this notebook, you will install the develop branch of forest. This branch has all of the most recent features (including location type information), but function names are slightly different than in the main branch, so they may not match what is on the website. To find documentation specific to the develop branch, look at the current version's docstring by typing a function name and holding shift+tab.

## Download Beiwe Data


In this notebook, we will download data from a Beiwe study. Edit the cell below to match the parameters of your study.

- For **study_id**, enter the "study ID, found in the top right corner of the study page". 
- For **direc**, the current working directory will be used. If you want data to be stored in another directory, change this variable to another string with the desired filepath. 
- For **dest_folder_name**, enter the "name of the folder you want raw data stored in". 
- For **server**, enter the server where data is located. If your Beiwe website URL starts with studies.beiwe.org, enter "studies".
- For **time_start**, enter the earliest date you want to download data for, in YYYY-MM-DD format.
- For **time_end**, enter the latest date you want to download data for, in YYYY-MM-DD format. If this is None, mano will download all data available (up until today at midnight). 
- For **data_streams**, enter a list of data streams you want to download. Forest currently analyzes `gps`, `survey_timings`, `calls`, and `texts` data streams. A full list of data types can be found under the "Download Data" tab of the Beiwe website. If this is None, all possible data streams will be downloaded. 
- For **beiwe_ids**, enter a list of Beiwe IDs you want to download data for. If you leave this as an empty list, mano will attempt to download data for all user IDs.
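Because **time_start** and **time_end** are plain strings, a malformed or reversed date range is easy to miss until the download fails. The sketch below shows one way to check them up front; `check_date_range` is a hypothetical helper, not a mano function, and `strptime` is slightly lenient (it also accepts dates without zero padding).

```python
from datetime import datetime

def check_date_range(time_start, time_end=None):
    """Parse YYYY-MM-DD strings and confirm the range is ordered.

    Returns a (start, end) tuple of datetime.date objects; `end` stays None
    when no end date is given, mirroring the `time_end = None` option above.
    Raises ValueError for a malformed date or a reversed range.
    """
    start = datetime.strptime(time_start, "%Y-%m-%d").date()
    end = None if time_end is None else datetime.strptime(time_end, "%Y-%m-%d").date()
    if end is not None and end < start:
        raise ValueError(f"time_end {time_end} is earlier than time_start {time_start}")
    return start, end

print(check_date_range("2017-01-01", "2017-01-02"))
```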

In [4]:
import os
import sys
import pandas as pd 

# Test for one HOPE study, one user
# in the order of: DFCI_Wright_HOPE Trial Phase 2, DFCI_Wright_HOPE Trial Phase 2_Passive Data Only, DFCI_Wright_HOPE Trial, DFCI_Wright_HOPE Troubleshooting, DFCI_Wright_HOPE Test, DFCI_Wright_HOPE_Test 2
HOPE_STUDY_IDS = ['598365d5388cd66a62ac1f9e', '5a2ae1dc03d3c425ef0ea752', '588224eff4d48a76f488cdfd',
                  '5a79f17d03d3c45080924ed4', '59c2b5b4388cd6715a958247', '5h7D9XT2vrN3BWkdcbYVNtpI']
study_id = HOPE_STUDY_IDS[0]  # test on one part of the study
direc = os.path.abspath("../../data/")  # set to the desired directory
# to be used to pull participant ids

# TODO change destination folder name to reflect the data being downloaded
# TODO change to reflect the data being downloaded

dest_folder_name = "bulk_download_test"
server = "studies"
# time_start = "2017-01-01"
# time_end = "2017-01-02"


# For Jupyter Notebook ONLY
sys.argv = ['script_name', '1']  
task_id = int(sys.argv[1])
print("task_id: " + str(task_id))

METADATA_TB_PATH = os.path.join(direc, "trial_phase2_download_participants.csv")
print(METADATA_TB_PATH)

#==================             ========================

user_params = pd.read_csv(METADATA_TB_PATH) # this is the file that contains the user parameters
row = user_params.iloc[task_id]
print(row)

# study_id = row['study id'].values[0]
# direc = os.getcwd() #current working directory, 
# dest_folder_name = "raw_data"

server = "studies"

time_start = row['First Registration Date'].split(" ")[0]
print(time_start)
time_end = None

# data_streams = ["gps", "survey_timings", "survey_answers", "audio_recordings", "calls", "texts", "accelerometer"]
# test gps for now -- change to include all passive features
data_streams = ["gps", "accelerometer", "survey_timings", "survey_answers"] #, "survey_timings", "survey_answers", "audio_recordings", "calls", "texts", "accelerometer"]

beiwe_ids = [row['Patient ID']]
print("beiwe_ids: " + str(beiwe_ids))

# if dest_dir doesn't exist, create it
dest_dir = os.path.join(direc, dest_folder_name)
print("destination directory: " + dest_dir)

os.makedirs(dest_dir, exist_ok=True)  # no error if the directory already exists


task_id: 1
/n/onnela_dp_l3/Lab/HOPE/beiwe/data/trial_phase2_download_participants.csv
Created On                                    2018-05-22
Patient ID                                      2n18iikg
Status                                          Inactive
OS Type                                              IOS
First Registration Date        2018-06-06 19:00:00 (UTC)
Last Registration                                    NaN
Last Upload                                          NaN
Last Survey Download                                 NaN
Last Set Password                                    NaN
Last Push Token Update                               NaN
Last Device Settings Update                          NaN
Last OS Version                                      NaN
App Version Code                                     NaN
App Version Name                                     NaN
Last Heartbeat                                       NaN
Name: 1, dtype: object
2018-06-06
beiwe_ids: ['2n18iikg']
d
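In the cell above, `time_start` is taken as the first space-separated token of the "First Registration Date" value. A small sketch of that step, with an added check that the token really is a date, might look like this. The function name `registration_date` is hypothetical, and a missing value (NaN) in that column would need separate handling because it is not a string.

```python
from datetime import datetime

def registration_date(timestamp):
    """Return the YYYY-MM-DD part of a 'YYYY-MM-DD HH:MM:SS (UTC)' string."""
    day = timestamp.split(" ")[0]
    # Parsing confirms that the first token really is a calendar date.
    datetime.strptime(day, "%Y-%m-%d")
    return day

print(registration_date("2018-06-06 19:00:00 (UTC)"))  # prints 2018-06-06
```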

In this next cell, we will import our keyring_studies.py file which includes download credentials. If you haven't already done this, open the keyring_studies.py file and paste your credentials inside. 

If your keyring_studies.py file is in a different directory than `direc`, replace `sys.path.insert(0, direc)` with `sys.path.insert(0, 'path/to/dir/containing/file/')`.
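The `sys.path.insert` mechanism described above can be tried in isolation. The sketch below writes a throwaway module to a temporary directory and imports it; the module name `demo_keyring` and its contents are made up for illustration and say nothing about what keyring_studies.py actually contains.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the directory that holds keyring_studies.py.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "demo_keyring.py").write_text("LOADED = True\n")

# Putting the directory first on sys.path makes its modules importable.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
demo_keyring = importlib.import_module("demo_keyring")

print(demo_keyring.LOADED)
```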

In [5]:
# import .py file located in another directory if needed
import mano
import sys
sys.path.insert(0, direc)

import keyring_studies
kr = mano.keyring(None)

In [18]:
type(beiwe_ids)

list

This next cell will download your data. Downloading your data will probably be the most time-consuming part of the whole process, so if you've already downloaded the data, you will save time by not running this cell.
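One way to avoid repeating a long download is to check which participants already have data on disk. The sketch below assumes the layout shown later in this notebook, with one folder per Beiwe ID under the destination directory. `ids_missing_data` is a hypothetical helper and the IDs in the demonstration are made up.

```python
import tempfile
from pathlib import Path

def ids_missing_data(dest_dir, beiwe_ids):
    """Return the IDs whose folder under dest_dir is absent or empty."""
    missing = []
    for beiwe_id in beiwe_ids:
        user_dir = Path(dest_dir) / beiwe_id
        if not user_dir.is_dir() or not any(user_dir.iterdir()):
            missing.append(beiwe_id)
    return missing

# Demonstration on a throwaway directory with made-up IDs.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "user_a" / "gps").mkdir(parents=True)
(demo_dir / "user_b").mkdir()
print(ids_missing_data(demo_dir, ["user_a", "user_b", "user_c"]))
```

Passing only the returned IDs to the download step would skip participants whose folders already contain files. This is a coarse check: it does not confirm that a download finished or covers the requested dates.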

In [19]:
import os

from helper_functions import download_data
download_data(kr, study_id, dest_dir, beiwe_ids, time_start, time_end, data_streams)

Downloading data for 2n18iikg


In [6]:
# unit testing bulk data processing code 

# NOTE: in publicly available version of this file, some variable values 
# have been replaced by a placeholder / "foo" value. 

from platform import python_version
from datetime import date, datetime
import sys
import os
import pandas as pd
import mano
import mano.sync as msync
import numpy as np

print(sys.argv)
job_id = int(sys.argv[1]) # job_id = 0
print('job_id = ' + str(job_id))

# # -------------------------------------------------------------------------------
# # manually edited params
# # -------------------------------------------------------------------------------

# beiwe_id_list_path    = "foo/beiwe_2_0_start_and_end_dates_clean.csv"
# beiwe_data_dir        = 'foo'
# beiwe_data_error_dir  = 'foo'
# keyring_dir           = 'foo'

# # study id
# study_id = 'foo'

# # set up the keyring
# sys.path.insert(0, keyring_dir)
# import keyring_studies_MK
# Keyring = mano.keyring(None)


# # -------------------------------------------------------------------------------
# # run
# # -------------------------------------------------------------------------------

# # read data frame with study users
# beiwe_id_df = pd.read_csv(beiwe_id_list_path, sep =',')
# beiwe_id_df = beiwe_id_df[beiwe_id_df['beiwe_id'].notna()]
# # beiwe_id_df.columns.values
# beiwe_id_df_rows = list(range(beiwe_id_df.shape[0]))
# execute_indices = [i for i in beiwe_id_df_rows if i % 15 == job_id]

# # iterate over subset of indexes of beiwe_id_df
# for row_i in execute_indices: # row_i = 0

#     # -------------------------------------------------------------------------------
#     # pull data specific to particular user
#     # -------------------------------------------------------------------------------
#     # beiwe_id_df.index[beiwe_id_df['beiwe_id'] == '114akcoc'].tolist()
#     user_id = beiwe_id_df['beiwe_id'].tolist()[row_i]
#     # bug fixed: replaced [job_id] with [row_i]
#     day_date_min = beiwe_id_df['beiwe_data_query_start'].tolist()[row_i]
#     day_date_max = beiwe_id_df['beiwe_data_query_end'].tolist()[row_i]
#     print('Starting user_id = ' + user_id)

#     # create sequence of days over which we will be querying data day by day
#     download_rn_start = datetime.strptime(day_date_min, '%Y-%m-%d').date()
#     # download_rn_end   = date.today()
#     download_rn_end   = datetime.strptime(day_date_max, '%Y-%m-%d').date()
#     download_days_seq = pd.date_range(download_rn_start, download_rn_end, freq = 'd').tolist()
#     download_days_seq = [str(i.date()) + "T00:00:00" for i in download_days_seq]

#     # -------------------------------------------------------------------------------
#     # run download day by day: gps
#     # -------------------------------------------------------------------------------
#     j_range_max = len(download_days_seq) - 1
#     for j in range(j_range_max): # j = 2
#         # define currently considered
#         time_start = download_days_seq[j]
#         time_end = download_days_seq[j + 1]
#         # counter for how many times the download has been retried
#         repeat_cnt = 0
#         while True:
#             # download GPS
#             try:
#                 zf = msync.download(Keyring, study_id, user_id, data_streams = ['gps'], time_start = time_start, time_end = time_end)
#                 zf.extractall(beiwe_data_dir)
#                 # if successful, break
#                 break
#             except BaseException as e:
#                 print(str(e))
#                 repeat_cnt = repeat_cnt + 1
#                 print('repeat_cnt = repeat_cnt + 1 [gps] -- ' + user_id + " " + str(j))
#             # if exceeded the number of breaks
#             if repeat_cnt > 10:
#                  # append error log
#                 if not os.path.exists(beiwe_data_error_dir):
#                     os.makedirs(beiwe_data_error_dir)
#                 file_tmp = os.path.join(beiwe_data_error_dir, user_id + "_gps_" + str(j))
#                 open(file_tmp, mode = 'a').close()
#                 break

#     print('FINISHED user_id = ' + user_id)

print('--- FINISHED PYTHON SCRIPT RUN --- job_id = ' + str(job_id))

['/n/home01/egraff/.conda/envs/ood_env/lib/python3.12/site-packages/ipykernel_launcher.py', '-f', '/n/home01/egraff/.local/share/jupyter/runtime/kernel-3b51dcad-d842-4f70-9c8f-db222c47b6eb.json']


ValueError: invalid literal for int() with base 10: '-f'
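The `ValueError` above arises because, inside Jupyter, `sys.argv[1]` is the kernel flag `-f` rather than a job number. A minimal sketch of a more forgiving parser follows; `parse_job_id` and its default of 0 are assumptions for illustration.

```python
import sys

def parse_job_id(argv, default=0):
    """Return the first command-line argument as an int, or `default`.

    Under Jupyter, sys.argv[1] is a kernel flag such as '-f', so int()
    would raise ValueError; in that case the default is used instead.
    """
    try:
        return int(argv[1])
    except (IndexError, ValueError):
        return default

print(parse_job_id(["script_name", "3"]))  # prints 3
print(parse_job_id(["ipykernel_launcher.py", "-f", "kernel.json"]))  # prints 0
```

With this approach the same cell can run both as a batch script and interactively, without overwriting `sys.argv` by hand.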

In [2]:
# TODO: Erase once the script download_data.py is tested and working

# script to download certain passive data covariates from the HOPE dataset
# references
# forest_and_mano_usage.ipynb
# https://github.com/onnela-lab/als-beiwe-passive-data/blob/main/py/download_data_gps_AWS.py 
# TODO refactor to download all passive data covariates for all users in the HOPE study

# pasted from forest_and_mano_usage.ipynb; generalized to download for all participants in a given study
# from platform import python_version
# import sys

# print(python_version()) ## Prints your version of python
# print(sys.executable) ## Prints your current python installation

# #run this chunk to install mano and forest
# %pip install mano 
# %pip install --upgrade https://github.com/onnela-lab/forest/tarball/develop
# %pip install orjson

import os
# Test for one HOPE study, one user
# in the order of: DFCI_Wright_HOPE Trial Phase 2, DFCI_Wright_HOPE Trial Phase 2_Passive Data Only, DFCI_Wright_HOPE Trial, DFCI_Wright_HOPE Troubleshooting, DFCI_Wright_HOPE Test, DFCI_Wright_HOPE_Test 2
HOPE_STUDY_IDS = ['598365d5388cd66a62ac1f9e', '5a2ae1dc03d3c425ef0ea752', '588224eff4d48a76f488cdfd', '5a79f17d03d3c45080924ed4', '59c2b5b4388cd6715a958247', '5h7D9XT2vrN3BWkdcbYVNtpI'] 
study_id = HOPE_STUDY_IDS[0] # test on one part of the study 
direc = os.path.abspath("../../data/") # set to the desired directory
# to be used to pull participant ids
PART_TB_PATH = direc + "/download_participants/trial_phase2_download_participants.csv"
dest_folder_name = "bulk_download_test"
server = "studies"
time_start = "2017-01-01"
time_end = "2017-01-02" # TODO for now test one day's to make sure data download script runs end to end
# for now focus on passive data - specifically gps and accelerometer
data_streams = ["gps", "accelerometer", "survey_timings", "survey_answers"] #, "survey_timings", "survey_answers", "audio_recordings", "calls", "texts", "accelerometer"]
# for now test on one user
# then pull all user ids from participant table 
import pandas as pd

# TODO: rename participant table with study id, rather than hardcoding 
beiwe_id_df = pd.read_csv(PART_TB_PATH, sep = ",")

# convert patient id to string and then to list
beiwe_id_df['Patient ID'] = beiwe_id_df['Patient ID'].astype(str)
beiwe_ids = beiwe_id_df['Patient ID'].tolist()
print(beiwe_ids) # sanity check 

# beiwe_ids = ['u2l3u6og'] 

dest_dir = os.path.join(direc, dest_folder_name)

# import .py file located in another directory if needed
import mano
import sys
sys.path.insert(0, direc)

import keyring_studies
kr = mano.keyring(None)


# from helper_functions import download_data
# download_data(kr, study_id, dest_dir, beiwe_ids, time_start, time_end, data_streams) 

# up until cell 5 in the forest_mano notebook ========================================

['1s5wlcm6', '2n18iikg', '2umdx87r', '41n4f9xk', '4aczi869', '4kp3bpt3', '4uu7odcm', '4yqhd5tz', '56l6mrgc', '6pqstf1g', '7ijvywnb', '81rasxas', '8a4lsxex', '8lqhw9ep', '93ubkqfi', '9r91qo8x', 'aej6wxh8', 'aj6hu6x', 'akka6vtq', 'azybbrqe', 'bhjfpzgc', 'bntzix8l', 'boai4zds', 'btjka62m', 'c3rv1bs9', 'ckud9s4q', 'cpqdi58s', 'erv2mapd', 'fc2k5drf', 'g5pvhkig', 'gb4hxmdq', 'h1qpaxs7', 'h4yepx8s', 'hfnit4ev', 'hrqqx8bu', 'i5b2fdez', 'i9fqxgwn', 'igfq43wg', 'j47zkf1o', 'j8soi1xe', 'jg55qc21', 'jxfij7hr', 'k2rf8qm4', 'kkxkaagr', 'kzt9osem', 'mihep4', 'ox6k2lbi', 'peeff4t8', 'q2zyj5m4', 'r7dszv23', 'rccun', 'rckq58a9', 's18ydzme', 'srpx1ilr', 'tg9vyfot', 'u2l3u6og', 'ua1djdlg', 'uiwlw4n5', 'ujg255xj', 'ujjuisc1', 'utff7e4t', 'uwk9bwdt', 'uz1v2g7u', 'vyck9k79', 'x8wr4182', 'xcdydxji', 'xfd4twop', 'ychlwvnz', 'zj4gbhlp', 'zo8vuchq']


In [3]:
# convert patient id to string and then to list
beiwe_id_df['Patient ID'] = beiwe_id_df['Patient ID'].astype(str)
beiwe_ids = beiwe_id_df['Patient ID'].tolist()
print(beiwe_ids)    

['1s5wlcm6', '2n18iikg', '2umdx87r', '41n4f9xk', '4aczi869', '4kp3bpt3', '4uu7odcm', '4yqhd5tz', '56l6mrgc', '6pqstf1g', '7ijvywnb', '81rasxas', '8a4lsxex', '8lqhw9ep', '93ubkqfi', '9r91qo8x', 'aej6wxh8', 'aj6hu6x', 'akka6vtq', 'azybbrqe', 'bhjfpzgc', 'bntzix8l', 'boai4zds', 'btjka62m', 'c3rv1bs9', 'ckud9s4q', 'cpqdi58s', 'erv2mapd', 'fc2k5drf', 'g5pvhkig', 'gb4hxmdq', 'h1qpaxs7', 'h4yepx8s', 'hfnit4ev', 'hrqqx8bu', 'i5b2fdez', 'i9fqxgwn', 'igfq43wg', 'j47zkf1o', 'j8soi1xe', 'jg55qc21', 'jxfij7hr', 'k2rf8qm4', 'kkxkaagr', 'kzt9osem', 'mihep4', 'ox6k2lbi', 'peeff4t8', 'q2zyj5m4', 'r7dszv23', 'rccun', 'rckq58a9', 's18ydzme', 'srpx1ilr', 'tg9vyfot', 'u2l3u6og', 'ua1djdlg', 'uiwlw4n5', 'ujg255xj', 'ujjuisc1', 'utff7e4t', 'uwk9bwdt', 'uz1v2g7u', 'vyck9k79', 'x8wr4182', 'xcdydxji', 'xfd4twop', 'ychlwvnz', 'zj4gbhlp', 'zo8vuchq']


In [4]:
len(beiwe_ids)
beiwe_ids_list = beiwe_ids

Next, we can directly explore the structure of the sample Beiwe data that we've just downloaded. 

At the top level of the directory `/data`, subject-level data is stored in separate subdirectories. Each subdirectory is named according to the subject's assigned Beiwe ID. In this sample, each subdirectory listed below belongs to a separate study participant.
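If the `tree` helper is not available, the participant folders can be listed with the standard library alone. The following sketch uses a temporary directory for demonstration; `participant_dirs` is a hypothetical function and is not part of helper_functions.

```python
import tempfile
from pathlib import Path

def participant_dirs(data_dir):
    """Return the sorted names of the immediate subdirectories of data_dir."""
    return sorted(p.name for p in Path(data_dir).iterdir() if p.is_dir())

# Demonstration: two participant folders and one stray file.
demo = Path(tempfile.mkdtemp())
for name in ("2n18iikg", "4aczi869"):
    (demo / name).mkdir()
(demo / "notes.txt").write_text("not a participant folder\n")
print(participant_dirs(demo))  # prints ['2n18iikg', '4aczi869']
```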

In [5]:
from helper_functions import tree
from pathlib import Path
import pandas as pd

tree(dest_dir, level=1, limit_to_directories=True)

bulk_download_test
├── 2n18iikg
├── 4aczi869
├── 4kp3bpt3
├── 2umdx87r
├── 41n4f9xk
├── 56l6mrgc
├── 4uu7odcm
├── 7ijvywnb
├── 93ubkqfi
├── azybbrqe
├── aej6wxh8
├── akka6vtq
├── 8lqhw9ep
├── bntzix8l
├── aj6hu6x
├── bhjfpzgc
├── boai4zds
├── 9r91qo8x
├── erv2mapd
├── gb4hxmdq
├── fc2k5drf
├── btjka62m
└── h1qpaxs7

23 directories


## Process Data using Forest 
- Using the Forest library developed by the Onnela lab, we compute daily GPS and communication summary statistics.

First, we generate the GPS-related summary statistics by using the **gps_stats_main** function in **traj2stats.py** in the Jasmine tree of Forest. This code will take between 15 minutes and 12 hours to run, depending on your machine and the quantity of data downloaded. To make sure that everything is working, first set the `beiwe_ids` argument to a list with just a couple of the Beiwe IDs in your study rather than `None`.

- For **data_dir**, enter the "path to the data file directory". This will be the same directory you downloaded data into.
- For **output_dir**, enter the "path to the file directory where output is to be stored". 
- For **tz_str**, enter the time zone where the study was conducted. Here, it's **"America/New_York"**. You can use `pytz.all_timezones` to check all options.
- For **frequency**, choose the temporal resolution of the summary statistics: daily, hourly, or both. This must be passed as a member of the `Frequency` class imported from `forest.constants`, for example `Frequency.HOURLY` or `Frequency.DAILY`.
- For **save_traj**, use `True` if you want to save the trajectories as a csv file, `False` if you don't (default: `False`). Here, we chose `True`.
- For **beiwe_ids**, enter the list of Beiwe IDs to run Forest on. If this is `None`, jasmine will run on all users in the data_dir directory.
- For **places_of_interest**, enter a list of places of interest. This list must contain keywords from [openstreetmaps](https://wiki.openstreetmap.org/wiki/OpenStreetBrowser/Category_list).

There are also more optional arguments that can be passed to the function, which are located in the Hyperparameters class in the traj2stats.py file. These include:
- For **log_threshold**, enter the number of minutes a participant must spend at a place for it to be logged.
- For **save_osm_log**, enter whether you want to save the log associated with places of interest.

and others, as can be seen in the class definition.
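A mistyped **tz_str** (for example, one with a stray trailing period) is easy to overlook, so it can help to check the string before starting a long run. The sketch below uses the standard-library `zoneinfo` module instead of `pytz`, which is a substitution made here for illustration. `is_known_timezone` is a hypothetical helper, and its result depends on the time zone database available on your system.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

def is_known_timezone(tz_str):
    """Return True when tz_str names a time zone available on this system."""
    try:
        ZoneInfo(tz_str)
    except (ZoneInfoNotFoundError, ValueError):
        return False
    return True

print(is_known_timezone("America/New_York"))
print(is_known_timezone("America/New_York."))
```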

In [None]:
from forest.jasmine.traj2stats import gps_stats_main, Hyperparameters
from forest.constants import Frequency

data_dir = dest_dir
gps_output_dir = "gps_output"
tz_str = "America/New_York"
freq = Frequency.DAILY
save_traj = True 
beiwe_ids = ["2n18iikg"] # Test for one person
places_of_interest = None

# if you are not interested in more specific hyperparameters, you can use the default ones
# by setting parameters = None or not passing in the parameters argument
parameters = Hyperparameters()
parameters.save_osm_log = False
parameters.log_threshold = 60 # threshold, in minutes, for logging locations if OSM analysis is enabled
parameters.pcr_bool = True # enables physical circadian rhythm (PCR) statistics
parameters.pcr_window = 14 # number of days to look back and forward for calculating PCR
parameters.pcr_sample_rate = 30 # sample rate in seconds

gps_stats_main(
    data_dir, gps_output_dir, tz_str, freq, save_traj, places_of_interest = places_of_interest, 
    participant_ids = beiwe_ids, parameters = parameters
)






*The output should describe how the data is being processed. If this is working correctly, you will see something like:*
    
><i>User: tcqrulfj  
Read in the csv files ...  
Collapse data within 10 second intervals ...  
Extract flights and pauses ...  
Infer unclassified windows ...  
Merge consecutive pauses and bridge gaps ...  
Selecting basis vectors ...  
Imputing missing trajectories ...  
Tidying up the trajectories...  
Calculating the daily summary stats...</i>

Liz testing

In [8]:
# Next, we can directly explore the structure of the sample Beiwe data that we've just downloaded. 

# At the top level of the directory `/data`, subject-level data is separately contained with subdirectories. Each subdirectory are named according to the subject's assigned Beiwe ID. In this sample, we observe the six subdirectories each from a separate study participant. 
import os
from helper_functions import tree
from pathlib import Path
import pandas as pd
from forest.jasmine.traj2stats import gps_stats_main, Hyperparameters
from forest.constants import Frequency
import sys
import glob
import yaml

# Get the absolute path to the directory this script lives in
SCRIPT_DIR = os.getcwd()

# Construct the path to the config file relative to the script
CONFIG_PATH = os.path.join(SCRIPT_DIR, "../../config/HOPE_config.yaml")
# # quick and dirty hardcoding for testing
# CONFIG_PATH = "/n/onnela_dp_l3/Lab/HOPE/beiwe/config/HOPE_config.yaml"
CONFIG_DIR = os.path.dirname(CONFIG_PATH)

with open(CONFIG_PATH, "r") as f:
    config = yaml.safe_load(f)

# data_dir = os.path.abspath(os.path.join(CONFIG_DIR, config["data_dir"]))
raw_data_dir = os.path.abspath(os.path.join(CONFIG_DIR, config["raw_data_dir"]))
metadata_path = os.path.abspath(os.path.join(CONFIG_DIR, config["metadata_path"]))

data_dir = os.path.abspath(raw_data_dir)

beiwe_id_df = pd.read_csv(metadata_path, sep = ",")

# convert patient id to string and then to list
beiwe_id_df["beiwe_id"] = beiwe_id_df["beiwe_id"].astype(str)
beiwe_ids = beiwe_id_df["beiwe_id"].tolist()
print(beiwe_ids)  

tree(dest_dir, level=1, limit_to_directories=True)

# from forest.jasmine.traj2stats import gps_stats_main, Hyperparameters
# from forest.constants import Frequency

# user_params = pd.read_csv(metadata_path) # this is the file that contains the user parameters
# row = user_params.iloc[task_id]
# print(row)
# beiwe_ids = [row["beiwe_id"]]
# print("beiwe ids: " + str(beiwe_ids), flush=True)

gps_output_dir = "gps_output"
tz_str = "America/New_York"
freq = Frequency.HOURLY
save_traj = True 
# TEST ON SINGLE SUBJECT
beiwe_ids = ["d3o6kuvf"] # Test one id for now "d3o6kuvf" smallish dataset

places_of_interest = None
# if you are not interested in more specific hyperparameters, you can use the default ones
# by setting parameters = None or not passing in the parameters argument
parameters = Hyperparameters()
parameters.save_osm_log = False
parameters.log_threshold = 60 # threshold, in minutes, for logging locations if OSM analysis is enabled
parameters.pcr_bool = True # enables physical circadian rhythm (PCR) statistics
parameters.pcr_window = 14 # number of days to look back and forward for calculating PCR
parameters.pcr_sample_rate = 30 # sample rate in seconds

gps_stats_main(
    data_dir, gps_output_dir, tz_str, freq, save_traj, places_of_interest = places_of_interest, 
    participant_ids = beiwe_ids, parameters = parameters
)

from helper_functions import concatenate_summaries


concatenate_summaries(dir_path = os.path.join(direc, gps_output_dir), 
                      output_filename = os.path.join(direc,"gps_summaries.csv"))

print(f"GPS data processing complete. Output saved to {os.path.join(direc, 'gps_summaries.csv')}")

['jy8yzsap', 'md7fnll7', 'wrb5oh7u', 'wgs5rptp', 'q21jny47', '52td8unr', '67zqexic', 'piiua4us', '4458ann3', '7h42nxij', 's18ydzme', 'em2heqoc', 'boai4zds', 'qrb9cncj', '1s5wlcm6', 'aj6hu6x', 'nzvk7jwo', 'behdfa31', 'bhjfpzgc', 'igfq43wg', 'j3ba35d2', 'h599npeu', 'xcdydxji', '2n18iikg', 'cpjgpdpc', 'sgjviesj', '4kp3bpt3', '4zqq46uj', 'srpx1ilr', 'd3o6kuvf', 'nnu1qurv', 'btjka62m', '7ijvywnb', 'ua1djdlg', 'dglk2mak', 'ud1ic49g', 'unlkfxnc', 'sg5x9we2', 'i5b2fdez', '3j2fdbb', 'jg55qc21', '4s7o5pps', '9r91qo8x', 'kzt9osem', 'asj59ze4', 'akka6vtq', 'o912nw3y', 'aej6wxh8', 'vby6fos4', 'fxh8kqh4', 'j8soi1xe', 'o3ze4d97', '8lpaia8q', 'uiwlw4n5', 'ujg255xj', 'lglv5j7s', 'hfnit4ev', 'see4r8y6', '8lqhw9ep', 't9xq7k8b', '5iv5o89o', '83i73qyl', '4fyd1ssv', 'k93dkv8w', 'h4yepx8s', 'fc2k5drf', 'rmtdpger', 'mk9ee8p4', '56l6mrgc', '1rkt87d2', '8r4hkqi6', '4aczi869', 'ychlwvnz', 'ukexs41t', '76rm9dnl', 'ige4zl6o', 'x8wr4182', 'qkcrsaa9', 'i9fqxgwn', 'kkxkaagr', 'pwby6ex2', '6ogn9wsa', '3tg2dbdl', 'n7g4

NameError: name 'dest_dir' is not defined

In [10]:
os.path.join(raw_data_dir, "manual_HOPE_paper2_download")

'/n/onnela_dp_l3/Lab/HOPE/beiwe/data/raw/manual_HOPE_paper2_download'

We will now concatenate the GPS summaries into one file.

>*Note: this function appends the frequency value to the end of the filename, e.g. "gps_summaries_daily.csv".*
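The `concatenate_summaries` helper is not shown in this notebook. As a rough illustration of the idea only, the sketch below stacks per-participant CSV files into one file using the standard library. It assumes each file is named after its Beiwe ID and that all files share one header; both are assumptions, as is the `Beiwe_ID` column name (chosen to match the plotting code later on). It does not append the frequency to the filename.

```python
import csv
import tempfile
from pathlib import Path

def concatenate_csvs(dir_path, output_path, id_column="Beiwe_ID"):
    """Stack every CSV in dir_path into one file, tagging rows with the file stem.

    Assumes all input files share the same header and that output_path lies
    outside dir_path. Returns the number of data rows written.
    """
    rows_written = 0
    writer = None
    with open(output_path, "w", newline="") as out_file:
        for csv_path in sorted(Path(dir_path).glob("*.csv")):
            with open(csv_path, newline="") as in_file:
                reader = csv.DictReader(in_file)
                for row in reader:
                    if writer is None:
                        # Build the output header from the first file read.
                        fieldnames = [id_column] + list(reader.fieldnames)
                        writer = csv.DictWriter(out_file, fieldnames=fieldnames)
                        writer.writeheader()
                    row[id_column] = csv_path.stem
                    writer.writerow(row)
                    rows_written += 1
    return rows_written

# Demonstration with made-up participants and values.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "user_a.csv").write_text("year,month,day,dist_traveled\n2018,6,6,1.5\n")
(demo_dir / "user_b.csv").write_text("year,month,day,dist_traveled\n2018,6,6,0.2\n2018,6,7,3.0\n")
combined = Path(tempfile.mkdtemp()) / "gps_summaries.csv"
print(concatenate_csvs(demo_dir, combined))  # prints 3
```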

In [None]:
from helper_functions import concatenate_summaries


concatenate_summaries(dir_path = os.path.join(direc, gps_output_dir), 
                      output_filename = os.path.join(direc,"gps_summaries.csv"))



Second, we compute the call- and text-based summary statistics by using the **log_stats_main** function in **log_stats.py** in the Willow tree of Forest. This should run a lot faster than `forest.jasmine.traj2stats.gps_stats_main`.


- For **data_dir**, enter the "path to the data file directory". 
- For **output_dir**, enter the "path to the file directory where output is to be stored". 
- For **tz_str**, enter the time zone where the study was conducted. Here, it's **"America/New_York"**.
- For **option**, choose a Frequency value corresponding to the temporal resolution you would like data to be aggregated to. 
- For **beiwe_ids**, enter the list of Beiwe IDs to run Forest on. If this is `None`, willow will run on all users in the data_dir directory.

In [None]:
import forest.willow.log_stats
data_dir = dest_dir
comm_output_dir = "comm_output"
tz_str = "America/New_York"
option = Frequency.DAILY
beiwe_ids = None



forest.willow.log_stats.log_stats_main(
    data_dir, comm_output_dir, tz_str, option, beiwe_ids = beiwe_ids
)

*The output should describe how the data is being processed (e.g., read, collapse, extracted...imputing, tidying, and calculating daily summary stats).*

>*Note: calls and texts data are only collected on Android phones. If you only enrolled users with iPhones in your study, you will not have any output here.*

- The following code is used to concatenate these files into a single csv for the **communication summaries**.

In [None]:
from helper_functions import concatenate_summaries

concatenate_summaries(dir_path = os.path.join(direc,comm_output_dir), 
                      output_filename = os.path.join(direc,"comm_summaries.csv"))


*The output should show the data for the first five observations in the concatenated dataset.*

Next, we summarize survey information using the **compute_survey_stats** function in **base.py** in the Sycamore tree of Forest. This will take between 5 minutes and 2 hours to run, depending on how many surveys were administered during your study.


- For **data_dir**, enter the "path to the data file directory". 
- For **output_dir**, enter the "path to the file directory where output is to be stored". 
- For **tz_str**, enter the time zone where the study was conducted. Here, it's **"America/New_York"**.
- For **beiwe_ids**, enter the list of Beiwe IDs to run Forest on. If this is `None`, sycamore will run on all users in the data_dir directory.
- For **config_path**, enter the filepath to your downloaded survey config file. This can be downloaded by clicking "edit study" on your study page, and clicking "Export study settings JSON file" under "Export/Import study settings". If this is None, Sycamore will still run, but fewer outputs will be produced.
- For **interventions_filepath**, enter the filepath to your downloaded interventions timing file. This can be downloaded by clicking "edit study" on your study page, and clicking "Download Interventions" next to "Intervention Data". If this is None, Sycamore will still run, but fewer outputs will be produced. (Note: this doesn't apply if you are using the main version of sycamore.)

In [None]:
from forest.sycamore.base import compute_survey_stats

data_dir = dest_dir
survey_output_dir = "survey_output"
tz_str = "America/New_York"
beiwe_ids = None
config_path = None
interventions_filepath = None

compute_survey_stats(
    study_folder = data_dir, output_folder = survey_output_dir,
    config_path = config_path, tz_str = tz_str, users = beiwe_ids,
    start_date = time_start, end_date = time_end, 
    interventions_filepath = interventions_filepath)

Now, we summarize accelerometer data using the **run** function in **base.py** in the Oak tree of Forest. This tree is in beta testing, so don't be surprised if you encounter errors running this function.

- For **data_dir**, enter the "path to the data file directory". 
- For **accelerometer_output_dir**, enter the "path to the file directory where output is to be stored". 
- For **tz_str**, enter the time zone where the study was conducted. Here, it's **"America/New_York"**.
- For **frequency**, choose a `Frequency` value, in the same way as for jasmine.
- For **beiwe_ids**, enter the list of Beiwe IDs to run Forest on. If this is `None`, oak will run on all users in the data_dir directory.

In [None]:
from forest.oak.base import run

data_dir = dest_dir
accelerometer_output_dir = "accel_output"
tz_str = "America/New_York"
frequency = Frequency.DAILY
beiwe_ids = None

run(data_dir, accelerometer_output_dir, 
    tz_str, frequency, users = beiwe_ids)

In [None]:
from helper_functions import concatenate_summaries


concatenate_summaries(dir_path = os.path.join(direc, accelerometer_output_dir), 
                      output_filename = os.path.join(direc,"accel_summaries.csv"))

## Plot Data

Now we will generate some time series plots using the generated summary statistics.
- To read the file, we need to define **response_filename** with the concatenated dataset. Here, we are using 'gps_summaries_daily.csv'.

In [None]:
import matplotlib.pyplot as plt
import os
import pandas as pd

direc = os.getcwd()
response_filename = 'gps_summaries_daily.csv'
path_resp = os.path.join(direc, response_filename)    

# read data
response_data = pd.read_csv(path_resp)

# GPS data (jasmine)
response_data['Date'] = pd.to_datetime(response_data[['year', 'month', 'day']])

# Accelerometer data (oak)
# response_data.rename(columns={'date': 'Date'}, inplace=True)

The data needs to be sorted by date. The following code sorts the data and places 4 evenly spaced ticks on the x-axis of the plot.

In [None]:
## Make sure the data is sorted according to date
response_data.sort_values('Date', inplace = True)
response_data.reset_index(drop = True, inplace = True)

def time_series_plot(var_to_plot, ylab = '', xlab = 'Date', num_x_ticks = 4):
    # plot one line per participant, using only that participant's rows
    for key, grp in response_data.groupby('Beiwe_ID'):
        plt.plot(grp.Date, grp[var_to_plot], label=key)
    
    #if len(response_data['Beiwe_ID'].unique()) > 1: ## more than one user to plot
    #    plt.plot(response_data.Date, response_data[var_to_plot], c=response_data['Beiwe_ID'].astype('category'))
    #else:
    #    plt.plot(response_data.Date, response_data[var_to_plot]) #just one user
    title = f"Time Series Plot of {var_to_plot}"
    plt.title(title)
    plt.xlabel(xlab)
    plt.ylabel(ylab)
    
    ## get evenly spaced indices into the unique dates
    tick_indices = [(i * (len(response_data.Date.unique()) - 1)) // (num_x_ticks - 1) for i in range(num_x_ticks) ]
    
    plt.xticks(response_data.Date.unique()[tick_indices])
    plt.show()
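The tick-spacing rule used in the plotting function above can be checked on its own. This sketch isolates the same integer formula in a hypothetical helper, `evenly_spaced_indices`, and adds a guard for the case of fewer than two ticks, which would otherwise divide by zero.

```python
def evenly_spaced_indices(n_points, n_ticks=4):
    """Return n_ticks indices spread from 0 to n_points - 1 (inclusive)."""
    if n_ticks < 2 or n_points < 1:
        raise ValueError("need at least two ticks and one data point")
    return [(i * (n_points - 1)) // (n_ticks - 1) for i in range(n_ticks)]

print(evenly_spaced_indices(10))  # prints [0, 3, 6, 9]
print(evenly_spaced_indices(8))   # prints [0, 2, 4, 7]
```

The first and last dates are always included. When the number of dates minus one is not divisible by the number of gaps, integer division makes the spacing only approximately even, as in the second example.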

- You can now create time series plots using **time_series_plot('variable')**.

In [None]:
time_series_plot('dist_traveled', ylab = "km")

*The output displays a time series plot for the variable, "dist_traveled."*

In [None]:
time_series_plot('sd_flight_length', ylab = "km")

*The output displays a time series plot for the variable, "sd_flight_length."*