## Dataset preparation
This notebook imports participant data from the TIME study, keeps only the participants who completed the study, computes all the features, and then saves two files:
1. Feature set for all the users
2. A sample of users to try different ML algorithms

## Import libraries
Import essential libraries here.

In [16]:
import sys
import numpy as np
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob

## Import participant status
We will import the participant status data so that we can keep only the participants who completed the study.

In [6]:
## Import the status file
status_file = '/Users/adityaponnada/Downloads/time_study_data/participant_status_tracking_v2.csv'
status_df = pd.read_csv(status_file)

## Show the first few rows
print(status_df.head())
# Also print the column names
print(status_df.columns)

   Record ID            Visualizer ID Participant Status  Consent Date  \
0       9001       sharpnessnextpouch           Completed    3/17/2020   
1       9002     uniformlyharmfulbush          Unenrolled    3/18/2020   
2       9003     hacksawscoldingdares            Withdrew    3/27/2020   
3       9004    dimnesscranialunheard           Completed    3/28/2020   
4       9005  coynessculminatebarista           Completed     4/8/2020   

  Date participant completed Date participant withdrew  \
0                  3/17/2021                       NaN   
1                        NaN                       NaN   
2                        NaN                 12/4/2020   
3                  3/28/2021                       NaN   
4                   4/8/2021                       NaN   

  Date participant unenrolled Date Devices Mailed ID of device loaned  \
0                         NaN           3/25/2020        C2F9214C2188   
1                  10/20/2020           3/25/2020        C2F

Now keep only the completed participants.

In [7]:
## Filter completed participants, keeping only the Visualizer ID and status columns.
## Note: the 'Participant Status ' column name has a trailing space in the source file.
status_df = status_df[status_df['Participant Status '] == 'Completed'][['Visualizer ID', 'Participant Status ']].copy()
# Rename the columns to participant_id and status
status_df = status_df.rename(columns={'Visualizer ID': 'participant_id', 'Participant Status ': 'status'})
# Reset the index
status_df = status_df.reset_index(drop=True)
# Append @timestudy_com so the IDs match those used in the compliance matrix
status_df['participant_id'] = status_df['participant_id'] + '@timestudy_com'
## Show the first few rows
print(status_df.head())
# Also print the shape of the dataframe
print(status_df.shape)


                           participant_id     status
0        sharpnessnextpouch@timestudy_com  Completed
1     dimnesscranialunheard@timestudy_com  Completed
2   coynessculminatebarista@timestudy_com  Completed
3  spinstersubatomiccoyness@timestudy_com  Completed
4     sadlyskilledlustfully@timestudy_com  Completed
(136, 2)
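
The filter above depends on the exact column name `'Participant Status '`, including its trailing space. As a minimal sketch, the same step can be written so that it tolerates stray whitespace in the headers. The helper name `keep_completed` and the toy rows are hypothetical and used only for illustration; they are not part of the study data.

```python
import pandas as pd


def keep_completed(status_df: pd.DataFrame) -> pd.DataFrame:
    """Return completed participants with tidy column names (hypothetical helper)."""
    df = status_df.copy()
    # Strip leading/trailing whitespace from every column name.
    df.columns = df.columns.str.strip()
    completed = df.loc[
        df["Participant Status"] == "Completed",
        ["Visualizer ID", "Participant Status"],
    ]
    completed = completed.rename(
        columns={"Visualizer ID": "participant_id", "Participant Status": "status"}
    ).reset_index(drop=True)
    # Append the suffix used by the participant IDs elsewhere in the notebook.
    completed["participant_id"] = completed["participant_id"] + "@timestudy_com"
    return completed


# Toy rows with made-up IDs; note the trailing space in the status column name.
toy = pd.DataFrame(
    {
        "Visualizer ID": ["alpha", "beta", "gamma"],
        "Participant Status ": ["Completed", "Withdrew", "Completed"],
    }
)
toy_completed = keep_completed(toy)
print(toy_completed)
```

Stripping the headers once keeps the later column references free of hidden whitespace, at the cost of no longer matching the raw file header character for character.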


Save the completed participant IDs as a list.

In [9]:
completed_participants = status_df['participant_id'].tolist()
# Display the completed participants
print(completed_participants)

['sharpnessnextpouch@timestudy_com', 'dimnesscranialunheard@timestudy_com', 'coynessculminatebarista@timestudy_com', 'spinstersubatomiccoyness@timestudy_com', 'sadlyskilledlustfully@timestudy_com', 'unfittedfactoiddivisive@timestudy_com', 'groinunratedbattery@timestudy_com', 'exploreparadoxmangle@timestudy_com', 'penpalsandbanklifting@timestudy_com', 'showplacefacingsanta@timestudy_com', 'lyricallymalformedrigor@timestudy_com', 'neutergoldfishsworn@timestudy_com', 'debatableuneasyeveryone@timestudy_com', 'peddlingventricleexert@timestudy_com', 'collisionmolarbreeze@timestudy_com', 'faucetsquealingcatapult@timestudy_com', 'bannisterhardwiredladle@timestudy_com', 'resupplyclappingyahoo@timestudy_com', 'punctuatelandingdeferred@timestudy_com', 'tattlingsupperlegroom@timestudy_com', 'vagabondnumerousflatterer@timestudy_com', 'anagramprobingscrooge@timestudy_com', 'equallustinessuntil@timestudy_com', 'crestedserpentspongy@timestudy_com', 'fracturerepurposealgebra@timestudy_com', 'cohesivepr

## Import compliance matrix
We will import the hourly compliance matrix for all the completed participants.

In [20]:
folder_path = '/Users/adityaponnada/Downloads/time_study_data/compliance_matrix/'
# Import the csv files in this folder for the completed participants only, then concatenate them into a single dataframe.
# Note: the folder is structured as folder_path/participant_id/uema_feature_mx_*.csv,
# where * is a wildcard that matches any characters.
# For each completed participant, look up the matching participant_id folder,
# then read every csv file that matches the pattern uema_feature_mx_*.csv.
all_files = []
for participant in completed_participants:
    participant_folder = f"{folder_path}{participant}/"
    # Find all the csv files that match the pattern uema_feature_mx_*.csv
    files = glob.glob(participant_folder + 'uema_feature_mx_*.csv')
    for file in files:
        all_files.append(pd.read_csv(file))
# Concatenate all the dataframes in the list into a single dataframe
compliance_matrix = pd.concat(all_files, ignore_index=True)
# Show the first few rows of the compliance matrix
print(compliance_matrix.head())
# Also print the shape of the compliance matrix
print(compliance_matrix.shape)

                     Participant_ID Initial_Prompt_Date Prompt_Type  \
0  sharpnessnextpouch@timestudy_com          2020-06-24   EMA_Micro   
1  sharpnessnextpouch@timestudy_com          2020-06-24   EMA_Micro   
2  sharpnessnextpouch@timestudy_com          2020-06-24   EMA_Micro   
3  sharpnessnextpouch@timestudy_com          2020-06-24   EMA_Micro   
4  sharpnessnextpouch@timestudy_com          2020-06-24   EMA_Micro   

  Study_Mode     Initial_Prompt_Local_Time Answer_Status  \
0       TIME  Wed Jun 24 05:34:02 PDT 2020     Completed   
1       TIME  Wed Jun 24 05:43:02 PDT 2020     Completed   
2       TIME  Wed Jun 24 05:51:02 PDT 2020     Completed   
3       TIME  Wed Jun 24 06:14:03 PDT 2020  NeverStarted   
4       TIME  Wed Jun 24 06:33:05 PDT 2020     Completed   

       Actual_Prompt_Local_Time  First_Question_Completion_Unixtime  \
0  Wed Jun 24 05:34:02 PDT 2020                       1593002047735   
1  Wed Jun 24 05:43:02 PDT 2020                       1593002586653   
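
The loading loop above skips participants silently when their folder is missing or holds no matching files, and `pd.concat` raises a `ValueError` if nothing was read at all. One possible variant, sketched below, records which participants had no files and fails with a clearer message. The function name `load_compliance_matrix`, its parameters, and the toy folder are assumptions made for illustration, not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_compliance_matrix(root, participant_ids, pattern="uema_feature_mx_*.csv"):
    """Read every matching csv per participant; return (dataframe, ids_without_files)."""
    root = Path(root)
    frames, missing = [], []
    for pid in participant_ids:
        folder = root / pid
        files = sorted(folder.glob(pattern)) if folder.is_dir() else []
        if not files:
            missing.append(pid)  # no folder, or no file matching the pattern
            continue
        frames.extend(pd.read_csv(f) for f in files)
    if not frames:
        raise FileNotFoundError(f"No files matching {pattern} under {root}")
    return pd.concat(frames, ignore_index=True), missing


# Toy demonstration in a temporary directory (made-up IDs and values).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "alpha@timestudy_com"
    folder.mkdir()
    pd.DataFrame(
        {
            "Participant_ID": ["alpha@timestudy_com"] * 2,
            "Answer_Status": ["Completed", "NeverStarted"],
        }
    ).to_csv(folder / "uema_feature_mx_toy.csv", index=False)
    demo_matrix, demo_missing = load_compliance_matrix(
        tmp, ["alpha@timestudy_com", "beta@timestudy_com"]
    )

print(demo_matrix.shape)
print(demo_missing)
```

In the notebook this could be called as `load_compliance_matrix(folder_path, completed_participants)`; the returned list of IDs without files is worth inspecting before moving on.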

In [22]:
## Get the number of rows in compliance_matrix
num_rows = compliance_matrix.shape[0]
print(f"Number of rows in compliance_matrix: {num_rows}")
# Get the number of columns in compliance_matrix
num_cols = compliance_matrix.shape[1]
print(f"Number of columns in compliance_matrix: {num_cols}")
# Get the number of unique participants in compliance_matrix
num_participants = compliance_matrix['Participant_ID'].nunique()
print(f"Number of unique participants in compliance_matrix: {num_participants}")

Number of rows in compliance_matrix: 1495495
Number of columns in compliance_matrix: 62
Number of unique participants in compliance_matrix: 137
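
The matrix reports 137 unique participants, while only 136 completed participants were kept earlier. The notebook does not explain the difference, so it is worth checking which IDs differ between the two sources. A minimal sketch of such a check follows; the helper name `compare_ids` and the toy IDs are hypothetical.

```python
def compare_ids(matrix_ids, expected_ids):
    """Return IDs present in only one of the two collections (hypothetical helper)."""
    found = set(matrix_ids)
    expected = set(expected_ids)
    return {
        "unexpected": sorted(found - expected),  # in the matrix, not in the list
        "missing": sorted(expected - found),     # in the list, not in the matrix
    }


# Toy example with made-up IDs.
demo = compare_ids(
    ["a@timestudy_com", "b@timestudy_com", "c@timestudy_com"],
    ["a@timestudy_com", "b@timestudy_com"],
)
print(demo)
```

In the notebook the call would be `compare_ids(compliance_matrix['Participant_ID'].unique(), completed_participants)`. The comparison is exact, so IDs that differ only in case or whitespace will show up in both lists.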


Save the file for later access

In [23]:
## Save compliance_matrix to a csv file with the current date and time appended to the filename.
current_time = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
# Build the output path once so the saved file and the printed message always agree
output_path = f'/Users/adityaponnada/Downloads/time_study_data/compliance_matrix_{current_time}.csv'
compliance_matrix.to_csv(output_path, index=False)
print(f"Compliance matrix saved to {output_path}")

Compliance matrix saved to /Users/adityaponnada/Downloads/time_study_data/compliance_matrix_20250701_115558.csv
