# This is a starting point for metadata about the glasses.

- (1) we need estimates of GoPro Timestamp at start
- (2) we need to reduce the filesize of the video CSV from openface to something workable
- (3) we need to keep track of where the files are, etc.
- (4) we will save all of this to a data structure that we can load in other places for convenience

----

First step is to run:

(1)  `ls *.MP4 | sed "s/^/file '/g; s/$/'/g" > filelist.txt` followed by `ffmpeg -f concat -safe 0 -i filelist.txt -c copy video.mp4` to concatenate the GoPro videos into one.

(2)  then we run the openface feature extractor with `FeatureExtraction.exe -f "E:\data_cap_val_2\patrick_1\GoPro\video.mp4" -2Dfp -3Dfp -pdmparams -pose -aus -gaze`.  This will give all kinds of features from openface, timestamped in secs relative to start of video, one row per frame.

(3) Finally we use `python3 run_blink_extractor_threaded.py --video video.mp4` to get an '\_eyeratio\_final.csv' file. This file has one row per frame, with a timestamp relative to the start of the video (in milliseconds, which is what the consolidation step at the end of this notebook assumes) and an eye ratio that we can threshold to detect blinks.  The script and instructions for running it are located in the Video Blink Extraction directory.
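
As a rough illustration of the thresholding idea in step (3), here is a minimal sketch that groups consecutive below-threshold frames into blink events. The function name `find_blinks`, the threshold value, the `min_frames` parameter, and the sample numbers are all made up for this example, and it assumes a lower eye ratio means a more closed eye; the real blink script may work differently.

```python
def find_blinks(eye_ratio, timestamps, threshold, min_frames=2):
    """Return (start, end) timestamps for each run of >= min_frames frames below threshold."""
    blinks = []
    run_start = None  # index of the first frame in the current below-threshold run
    for i, ratio in enumerate(eye_ratio):
        if ratio < threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_frames:
                blinks.append((timestamps[run_start], timestamps[i - 1]))
            run_start = None
    # close a run that is still open when the series ends
    if run_start is not None and len(eye_ratio) - run_start >= min_frames:
        blinks.append((timestamps[run_start], timestamps[-1]))
    return blinks

# made-up values: one 2-frame blink, one 1-frame dip (ignored), one blink at the end
ratios = [0.30, 0.31, 0.10, 0.09, 0.29, 0.08, 0.30, 0.07, 0.06]
times = [i / 60 for i in range(len(ratios))]
blinks = find_blinks(ratios, times, threshold=0.2)
print(blinks)
```

Requiring a minimum run length is one simple way to ignore single-frame dips caused by tracking noise; the right threshold has to be chosen per video.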

----

We will copy over the files to this folder 'cleaned_data'.  The folder contains:
- `captivateFiltered.json` (the full MongoDB dump, as a JSON file).
- `metadata.p` (a file we create that tells us basic data that we input; i.e. what day each test was, who it was, which folder it corresponds to, an estimated timestamp for the gopro video starting, how large of an error we expect in this estimate, an estimate of gopro duration, an estimate of glasses data session start times and durations).  This file will be generated with this script.  Helpers to estimate gopro start time are included here.
- folders for each user test by day, i.e. `david_1, patrick_1, david_2, juliana_1, etc`
- - each folder will have a `video_eyeratio_final.csv` file from the blink script run on the video
- - each folder will have a `video.csv` file from openface run on the video (optionally)
- - each folder will have a `video_min.csv` file with just the most important features from `video.csv`, generated with this script.  GoPro video was recorded at 60 Hz (FPS).
- - each folder will have a `glasses_sessions.p` file, which includes the relevant sorted data as continuous sessions from the glasses (at 10Hz, blink data is 1kHz).  This file will generate that data.

----
To use this, download the mongoDB json file and put it in this folder.  Generate a folder for the session with `video.csv` and `video_eyeratio_final.csv` manually.  Add in metadata below (a gopro estimated timestamp, date of test, session name, etc) and run through this script.

In [17]:
import json
from datetime import datetime , timezone
import time
import numpy as np
import codecs
import struct
import matplotlib.pyplot as plt
import pandas as pd
import pathlib
import pickle

%matplotlib inline

# Estimating a GO-PRO timestamp

GoPro videos are all 11m47s long at 60 fps.  There is no easy way to get a timestamp for the video-- the create date and the last modified date that should come as embedded metadata don't seem to be timestamped accurately.  

Based on what we've seen, it looks like the best estimate of the start time of the video is the 'date modified' of the first gopro video file minus the 11m47s duration of the video.  Here's an example from Juliana; based on when it looks like the glasses turned on in the video and our first glasses packet, we expect recording started at around 10:21:40am on 04/09:

In [None]:
def timestamp_ms_to_string(timestamp):
    local_tz = datetime.now().astimezone().tzinfo
    return datetime.fromtimestamp(timestamp/1000, tz=local_tz).strftime('%m/%d/%y %I:%M:%S%p %Z')

def string_to_timestamp_ms(datestring):
    return datetime.strptime(datestring, '%m/%d/%y %I:%M:%S%p %Z').timestamp()*1000
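
One caveat with `string_to_timestamp_ms`: `datetime.strptime` accepts `%Z` but returns a naive datetime, so `.timestamp()` interprets the fields in the local timezone of the machine running the notebook, and the abbreviation in the string does not change the result. That is fine as long as everything runs on one machine in US Eastern time. A possible alternative that honors the abbreviation is sketched below; the name `string_to_timestamp_ms_tz` and the fixed offset table are assumptions for illustration, not part of the original helpers.

```python
from datetime import datetime, timedelta, timezone

# Assumption: only US Eastern abbreviations appear in this dataset.
TZ_OFFSETS_HOURS = {'EST': -5, 'EDT': -4}

def string_to_timestamp_ms_tz(datestring):
    """Parse 'MM/DD/YY HH:MM:SSAM TZ' using the UTC offset named in the string."""
    stamp, tz_name = datestring.rsplit(' ', 1)
    tz = timezone(timedelta(hours=TZ_OFFSETS_HOURS[tz_name]))
    parsed = datetime.strptime(stamp, '%m/%d/%y %I:%M:%S%p').replace(tzinfo=tz)
    return parsed.timestamp() * 1000

ts_edt = string_to_timestamp_ms_tz('04/09/21 10:21:33AM EDT')
ts_est = string_to_timestamp_ms_tz('04/09/21 10:21:33AM EST')
print(ts_est - ts_edt)  # the same wall-clock time in EST is one hour later in UTC
```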

In [8]:
fname = pathlib.Path('/Volumes/ExtDrive_(ResEnv)/data_cap_val_2/juliana_1/GoPro/GH010082.MP4')

modified_time = fname.stat().st_mtime
create_time = fname.stat().st_ctime  #on macOS/Linux this is the last metadata change, not the file creation time

print('modified time:', timestamp_ms_to_string(modified_time*1000))
print('create time:', timestamp_ms_to_string(create_time*1000))
print('modified time - 11m47s:', timestamp_ms_to_string(modified_time*1000 - (11*60+47)*1000))

modified time: 04/09/21 10:33:20AM EDT
create time: 04/14/21 05:49:40PM EDT
modified time - 11m47s: 04/09/21 10:21:33AM EDT


#### use the below to find a gopro start timestamp for each participant based on the first video file


In [12]:
#apply the above to new participants:
fname = pathlib.Path('/Volumes/ExtDrive_(ResEnv)/data_cap_val_2/david_3/GH010078.MP4')
print(timestamp_ms_to_string(fname.stat().st_mtime*1000 - (11*60+47)*1000))

04/06/21 04:48:22PM EDT


# NOTES

### DAVID_2 (3/25, 2p - 8:30p)

*No GoPro video to timestamp.* Glasses use appears to start at 44min12sec into 6hr28min45sec video, so about 5hr45min of useful data.  We see 5hr51 of glasses data, so this matches.  The timestamp of first glasses packet is 03/25/21 02:25:13PM EDT, so we'll assume the video starts at **03/25/21 01:41:01PM EDT**.

### PATRICK_1 (3/29, 3:30p - 8p)

*No GoPro video to timestamp.*

### PATRICK_2 (4/1, 2:45p - 8p)

*No GoPro video to timestamp.*

### BEATA_1 (4/2, 12:15p - 9:15p) 

*No GoPro video to timestamp.*  Based on the Apple Watch visible in frame, the GoPro starts within the 12:16pm minute (time guess 41:09, focus guess 41:26).  We'll assume the video starts at **04/02/21 12:16:30PM EDT**.

### DAVID_3 (4/6, 5:30p - 11:30p)

GoPro reset!! Seems we have 40 min of video GH010078.mp4 and GH020078.mp4 before a reset, which then changes the names starting at GH010079.mp4, GH020079.mp4....  The starting time of the first ~40 min video is **04/06/21 04:48:22PM EDT**, while the main section is **04/06/21 05:28:57PM EDT**.  We can split this into 2.

### IRMANDY_1 (4/8, 5:30p - 12p)
 
only 11 minutes of video BOOOOOOO.  Starts at **04/08/21 05:30:56PM EDT**.

### JULIANA_1 (4/9, 10a - 3p)

GoPro starts before glasses, by about 20-30s.  GoPro Video 1 gives a date created of 04/10/21 00:58, and a date modified of 04/09/21 10:33.  Juliana's glasses data starts at 04/09/21 10:22:05AM EDT, so we expect a start time of 10:21:40 or so.  Based on the modified time of the video minus its 11m47s duration, as in the example above, we get a start time of **04/09/21 10:21:33AM EDT**.

----
Now that we've gotten our GoPro timestamps, and we've created folders in this folder with all of the raw openface and eyeratio data and the latest mongo instance, let's add our GoPro timestamps below and process the data into usable chunks.  We start with this structure, which we populate by hand:

In [14]:
video_metadata = {
'david_2'  : {'date': '03/25', 'vid_start': '03/25/21 01:41:01PM EDT', 'start_error_sec':60},      
'patrick_1': {'date': '03/29', 'vid_start': '', 'start_error_sec':60},
'patrick_2': {'date': '04/01', 'vid_start': '', 'start_error_sec':60},
'beata_1'  : {'date': '04/02', 'vid_start': '04/02/21 12:16:30PM EDT', 'start_error_sec':60},
'david_3a' : {'date': '04/06', 'vid_start': '04/06/21 04:48:22PM EDT', 'start_error_sec':15},
'david_3b' : {'date': '04/06', 'vid_start': '04/06/21 05:28:57PM EDT', 'start_error_sec':15},
'irmandy_1': {'date': '04/08', 'vid_start': '04/08/21 05:30:56PM EDT', 'start_error_sec':15},
'juliana_1': {'date': '04/09', 'vid_start': '04/09/21 10:21:33AM EDT', 'start_error_sec':15}
}

## Create and Save Glasses Data as Sessions with Each User

Let's first load in our glasses data and process them into packets.  We can store relevant
sessions in the folder that corresponds to the day above, and store metadata about session
start and end time.  This will improve loading time, as the full dataset is several GB (small enough to fit in RAM, but large enough to be a huge pain to load frequently).

If this gets too out of hand to reprocess for each participant, we should start a new 'batch' in a new folder.  This is the only hefty operation for each participant that is required.
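
The session-splitting rule used by `create_sessions` in the next cell can be illustrated on toy data. The sketch below is a simplified stand-in, not the notebook's function: it looks only at device ticks (no server timestamps, no duplicate handling, no minimum session length), and the name `split_on_tick_gaps` and the tick values are made up.

```python
def split_on_tick_gaps(ticks_ms, thresh_sec=5):
    """Split device ticks wherever the tick goes backwards or jumps by >= thresh_sec."""
    sessions = [[ticks_ms[0]]]
    for prev, cur in zip(ticks_ms, ticks_ms[1:]):
        if prev < cur < prev + thresh_sec * 1000:
            sessions[-1].append(cur)   # still the same session
        else:
            sessions.append([cur])     # large gap or tick reset: start a new session
    return sessions

# made-up ticks: a >5 s jump after 300, then a device reset after 9100
ticks = [100, 200, 300, 9000, 9100, 50, 150]
sessions = split_on_tick_gaps(ticks)
print(sessions)
```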

In [60]:
flatten = lambda t: [item for sublist in t for item in sublist]

def tick_to_timestamp_ms_converter(tick_ref, timestamp_ms_ref):
    return lambda tick_val: int(timestamp_ms_ref + tick_val - tick_ref)
    
def get_packets_by_range(start="01/01/01 01:01:01AM EST", end=np.inf):
    #accepts timestamp string or timestamp value in ms; returns ordered packets by servertimestamp
    if type(start) == str: start = string_to_timestamp_ms(start)
    if type(end) == str: end = string_to_timestamp_ms(end)
    return sorted([x for x in data if x['serverTimestamp']>=start and x['serverTimestamp']<=end], key=lambda k: k['serverTimestamp'])

def create_sessions(packets, thresh_sec=5):
    #accepts list of sorted packets, breaks into sessions when: 
    # (1) packet_tick_ms jumps forward by >= thresh_sec since the last accepted packet
    # (2) we see a packet_tick_ms (or serverTimestamp) descend instead of ascend
    
    final_packet_array_of_arrays = []
    current_session = []
    current_timestamp_packets = [packets[0]]
    
    last_seen = packets[0]['serverTimestamp']
    last_tick = packets[0]['packet_tick_ms']
    
    for packet in packets:
        #duplicate packet
        if packet['serverTimestamp'] == last_seen and packet['packet_tick_ms'] == last_tick:
            pass 
        #packet that is at the same server time, but is increasing packet_tick from prev servertime
        elif packet['serverTimestamp'] == last_seen and packet['packet_tick_ms'] > last_tick and packet['packet_tick_ms'] < last_tick + thresh_sec*1000:
            current_timestamp_packets.append(packet)
        #packet that is increased server time and packet tick
        elif packet['serverTimestamp'] > last_seen and packet['packet_tick_ms'] > last_tick and packet['packet_tick_ms'] < last_tick + thresh_sec*1000:
            current_session.extend(sorted(current_timestamp_packets, key=lambda k: k['packet_tick_ms']))
            current_timestamp_packets = [packet]
            last_seen = packet['serverTimestamp']
            last_tick = current_session[-1]['packet_tick_ms']
        #packet that has not increased servertime or not increased packettime within threshold 
        else:
            #print('new session: ', last_seen, packet['serverTimestamp'], last_tick, packet['packet_tick_ms'])
            if len(current_session) > 10:
                final_packet_array_of_arrays.append(current_session)
            current_session = []
            current_timestamp_packets = [packet]
            last_seen = packet['serverTimestamp']
            last_tick = packet['packet_tick_ms']
 
    if len(current_session) > 10:
        final_packet_array_of_arrays.append(current_session)
    
    return final_packet_array_of_arrays
    
def filter_short_sessions(sessions, min_thresh=5):
    return [session for session in sessions if (session[-1]['serverTimestamp'] - session[0]['serverTimestamp']) > min_thresh*60*1000]
    
def get_sessions_by_day(sessions, day="03/29"):
    return [session for session in sessions if timestamp_ms_to_string(session[0]['serverTimestamp']).startswith(day)]
    
def print_session_times(sessions):
    last_day = ''
    total_time = None
    #separate days and print in groups by day
    for i, session in enumerate(sessions):
        session_day = timestamp_ms_to_string(session[0]['serverTimestamp'])[:8]
        if session_day != last_day:
            if total_time is None: total_time = 0
            else:
                print('\n\t total time spent: %02d:%02d' % (total_time // 60, total_time % 60))
                total_time = 0
            
            last_day = session_day
            print('\n',session_day,'------')
        
        duration = (session[-1]['serverTimestamp'] - session[0]['serverTimestamp'])/(60*1000)
        duration_string = "duration=%02d:%02d" % (duration // 60, duration % 60)
        total_time += duration
                      
        print('\t', timestamp_ms_to_string(session[0]['serverTimestamp']), 'to', timestamp_ms_to_string(session[-1]['serverTimestamp']), '\t', duration_string)
    
    print('\n\t total time spent: %02d:%02d' % (total_time // 60, total_time % 60))
                

with open('./cleaned_data/captivateFiltered.json') as f: data = json.load(f)

#grab packets from a certain range (accepts timestamp ints in ms or datestrings of this form)
sorted_data = get_packets_by_range(start="03/24/21 12:00:00PM EST")

#sort them into sessions and see how many sessions we got
data_by_session = create_sessions(sorted_data)
total_num_sessions = len(data_by_session)

#filter out the sessions that are less than 5 min
data_by_session = filter_short_sessions(data_by_session)
print(len(data_by_session),'sessions found. (%d short sessions filtered.)' % (total_num_sessions - len(data_by_session)))

#print the session information by day
print_session_times(data_by_session)

43 sessions found. (98 short sessions filtered.)

 03/24/21 ------
	 03/24/21 04:10:28PM EDT to 03/24/21 04:22:03PM EDT 	 duration=00:11
	 03/24/21 04:25:50PM EDT to 03/24/21 06:47:50PM EDT 	 duration=02:21

	 total time spent: 02:33

 03/25/21 ------
	 03/25/21 02:25:13PM EDT to 03/25/21 05:55:40PM EDT 	 duration=03:30
	 03/25/21 06:11:11PM EDT to 03/25/21 08:32:30PM EDT 	 duration=02:21

	 total time spent: 05:51

 03/29/21 ------
	 03/29/21 03:28:19PM EDT to 03/29/21 04:07:45PM EDT 	 duration=00:39
	 03/29/21 04:10:38PM EDT to 03/29/21 05:48:17PM EDT 	 duration=01:37
	 03/29/21 05:51:01PM EDT to 03/29/21 06:49:15PM EDT 	 duration=00:58
	 03/29/21 06:49:41PM EDT to 03/29/21 07:27:35PM EDT 	 duration=00:37
	 03/29/21 07:29:10PM EDT to 03/29/21 08:01:04PM EDT 	 duration=00:31

	 total time spent: 04:25

 04/01/21 ------
	 04/01/21 02:35:30PM EDT to 04/01/21 02:45:05PM EDT 	 duration=00:09
	 04/01/21 03:07:10PM EDT to 04/01/21 03:23:42PM EDT 	 duration=00:16
	 04/01/21 03:24:54PM EDT to

Now that we've loaded all the glasses data in and separated them into sessions, we'll go through
and push them to each folder by day, and update the video_metadata as well:

In [28]:
for folder in video_metadata:
    
    #get sessions for the day that is represented by the folder
    sessions_to_folder = get_sessions_by_day(data_by_session, day=video_metadata[folder]['date'])
    
    #collect and append metadata about session start/end timestamps and durations
    video_metadata[folder]['glasses_session_times'] = \
        [[timestamp_ms_to_string(session[0]['serverTimestamp']), 
          timestamp_ms_to_string(session[-1]['serverTimestamp'])] for session in sessions_to_folder]
    
    video_metadata[folder]['glasses_session_durations_min'] = \
        [(session[-1]['serverTimestamp'] - session[0]['serverTimestamp'])/(60*1000) for session in sessions_to_folder]
                                                                      
    total_dur = sum(video_metadata[folder]['glasses_session_durations_min'])
    video_metadata[folder]['glasses_sessions_total_duration_string'] = "%02d:%02d" % (total_dur // 60, total_dur % 60)
        
    #save sessions in folder as pickle (the with-block closes the file handle)
    with open("./cleaned_data/" + folder + "/glasses_sessions.p", "wb") as f:
        pickle.dump(sessions_to_folder, f)

#remove this giant object from memory
del data

for folder in video_metadata:
    print(folder)
    print(video_metadata[folder])

david_2
{'date': '03/25', 'vid_start': '03/25/21 01:41:01PM EDT', 'start_error_sec': 60, 'glasses_session_times': [['03/25/21 02:25:13PM EDT', '03/25/21 05:55:40PM EDT'], ['03/25/21 06:11:11PM EDT', '03/25/21 08:32:30PM EDT']], 'glasses_session_durations_sec': [210.46036666666666, 141.32051666666666], 'glasses_sessions_total_duration_string': '05:51', 'glasses_session_durations_min': [210.46036666666666, 141.32051666666666]}
patrick_1
{'date': '03/29', 'vid_start': '', 'start_error_sec': 60, 'glasses_session_times': [['03/29/21 03:28:19PM EDT', '03/29/21 04:07:45PM EDT'], ['03/29/21 04:10:38PM EDT', '03/29/21 05:48:17PM EDT'], ['03/29/21 05:51:01PM EDT', '03/29/21 06:49:15PM EDT'], ['03/29/21 06:49:41PM EDT', '03/29/21 07:27:35PM EDT'], ['03/29/21 07:29:10PM EDT', '03/29/21 08:01:04PM EDT']], 'glasses_session_durations_sec': [39.43695, 97.6538, 58.235866666666666, 37.906083333333335, 31.891], 'glasses_sessions_total_duration_string': '04:25', 'glasses_session_durations_min': [39.43695,

----
**A Note from Patrick:**

The csv is rather large, so unless you have enough RAM, you can load it in chunks or load individual columns (ref: https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas)

 

As for the data, the full details of the format are explained here but in summary, you’re going to want to look at columns AU45_r and AU45_c for blinks and pose_Rx, pose_Ry, and pose_Rz for head pose (these are in radians).

----
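
As a small sketch of the two loading strategies mentioned in the note above (individual columns, or chunks), assuming a recent pandas: `usecols` and `chunksize` are standard `pd.read_csv` parameters, and `skipinitialspace=True` is used here as one way to cope with the `, ` separator. The CSV text is a tiny made-up stand-in for a real OpenFace file.

```python
import io
import pandas as pd

# Made-up stand-in for an OpenFace CSV; real files have many more columns and rows.
csv_text = (
    "frame, timestamp, AU45_r, pose_Rx\n"
    "1, 0.000, 0.1, 0.02\n"
    "2, 0.017, 2.3, 0.03\n"
    "3, 0.033, 0.2, 0.04\n"
)
wanted = ['timestamp', 'AU45_r']

# Option 1: load only the columns we care about.
df_cols = pd.read_csv(io.StringIO(csv_text), usecols=wanted, skipinitialspace=True)

# Option 2: stream the file in fixed-size chunks, keeping only the wanted columns.
pieces = []
for chunk in pd.read_csv(io.StringIO(csv_text), skipinitialspace=True, chunksize=2):
    pieces.append(chunk[wanted])
df_chunks = pd.concat(pieces, ignore_index=True)

print(df_cols.shape, df_chunks.shape)
```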
 

## Generate our video_min.csv files so we don't have to work with GB csvs
Now we want to run through all folders: if there is a 'video.csv' file and no 'video_min.csv', we
will create 'video_min.csv' with only a few features so we have something we can load into memory easily.

----

#### Openface data: http://lukacreu.blogspot.com/2017/11/week-8-openface-ouput.html

- frame, timestamp in sec, confidence in track, success decision on track
- gaze: direction vector in world coord, normalized
- gaze_angle: direction in radians in world coord
- eye_lmk: pixel loc of landmarks in 2D
- pose_T: head location relative to camera, in mm
- pose_R: head rotation in radian, left hand positive (pitch/roll/yaw) prob in world coord, can be camera coord (world_coord flag)
- lowercase x/y: pixel landmarks in 2D
- uppercase x/y/z: mm landmarks in 3D
- p: point distribution model of face shape

#### AU mapping: https://imotions.com/blog/facial-action-coding-system/

- AU<XX>_r is intensity from 0-5
- AU<XX>_c is presence (0 or 1)

ones of interest may be:
- AU 41-45 (eyelid droop, slit, closed, squint, blink)
- AU 1-7 (1-4 eyebrows, 5 raising eyelids in surprise, 6 cheeks when smile, 7 ~squint)
- AU 61-64 (eye movements)

In [None]:
#print possible columns to include in video_min.csv
'''
columns_to_choose_from = pd.read_csv('./cleaned_data/david_1/video.csv', index_col=0, nrows=0).columns.tolist()
for c in columns_to_choose_from:
    print(c, end='\t')
''' 

In [30]:
#it will save only these columns:
columns_to_save = ['timestamp', 'AU45_r', 'AU45_c', 
                   'pose_Rx', 'pose_Ry', 'pose_Rz', 
                   'pose_Tx', 'pose_Ty', 'pose_Tz']
    
def minimal_csv_file(csv_filename):
    chunksize = 3600 #one min at 60fps
    chunks = []

    #openface separates fields with ', ', which needs the python engine
    with pd.read_csv(csv_filename, delimiter=', ', engine='python', chunksize=chunksize) as reader:
        for chunk in reader:
            print('.', end='')
            chunks.append(chunk[columns_to_save])

    #concatenate once at the end instead of re-copying the frame on every chunk
    video_df = pd.concat(chunks)
    video_df.to_csv(csv_filename[:-4] + '_min.csv')
    print('DONE!')

for folder in video_metadata:
    folder_path = './cleaned_data/' + folder 
    if not pathlib.Path(folder_path + '/video_min.csv').is_file():
        try:
            print('no video_min.csv for ' + folder + ', trying min command...')
            minimal_csv_file(folder_path + '/video.csv')
        except:
            print('min command failed! probably no video.csv in folder!')
    else:
        print('found video_min.csv for ' + folder + '.')

found video_min.csv for david_2.
no video_min.csv for patrick_1, trying min command...
min command failed! probably no video.csv in folder!
found video_min.csv for patrick_2.
found video_min.csv for beata_1.
no video_min.csv for david_3a, trying min command...
min command failed! probably no video.csv in folder!
no video_min.csv for david_3b, trying min command...
min command failed! probably no video.csv in folder!
found video_min.csv for irmandy_1.
found video_min.csv for juliana_1.


## Finish Generating Metadata 
Now that we have video_min.csv files to work with, we will run through and grab durations
from our video_min files and add them to our metadata:

In [40]:
for folder in video_metadata:
    min_csv_path = './cleaned_data/' + folder + '/video_min.csv' 
    try:
        video_metadata[folder]['vid_duration_sec'] = pd.read_csv(min_csv_path)['timestamp'].iloc[-1]
    except:
        print('no min_csv file for ' + folder)

no min_csv file for patrick_1
no min_csv file for david_3a
no min_csv file for david_3b


now we can print out a summary nice and easily:

In [46]:
for folder in video_metadata:
    try:
        print('='*20)
        print(folder, 'on', video_metadata[folder]['date'])
        print(video_metadata[folder]['glasses_sessions_total_duration_string'], \
               'hr of glassess data starting at', \
               video_metadata[folder]['glasses_session_times'][0][0])
        print('%02d:%02d hr of  vid starting at ' % 
              (video_metadata[folder]['vid_duration_sec'] / 60 // 60, 
               video_metadata[folder]['vid_duration_sec'] / 60 % 60 )  + \
               video_metadata[folder]['vid_start'])
    except:
        print('failed to load video analysis file')

david_2 on 03/25
05:51 hr of glassess data starting at 03/25/21 02:25:13PM EDT
04:30 hr of  vid starting at 03/25/21 01:41:01PM EDT
patrick_1 on 03/29
04:25 hr of glassess data starting at 03/29/21 03:28:19PM EDT
failed to load video analysis file
patrick_2 on 04/01
04:44 hr of glassess data starting at 04/01/21 02:35:30PM EDT
04:46 hr of  vid starting at 
beata_1 on 04/02
08:47 hr of glassess data starting at 04/02/21 12:13:25PM EDT
08:07 hr of  vid starting at 04/02/21 12:16:30PM EDT
david_3a on 04/06
05:12 hr of glassess data starting at 04/06/21 10:24:56AM EDT
failed to load video analysis file
david_3b on 04/06
05:12 hr of glassess data starting at 04/06/21 10:24:56AM EDT
failed to load video analysis file
irmandy_1 on 04/08
05:07 hr of glassess data starting at 04/08/21 05:26:31PM EDT
00:14 hr of  vid starting at 04/08/21 05:30:56PM EDT
juliana_1 on 04/09
04:03 hr of glassess data starting at 04/09/21 10:22:05AM EDT
05:58 hr of  vid starting at 04/09/21 10:21:33AM EDT


## Save the metadata

In [47]:
with open("./cleaned_data/metadata.p", "wb") as f:
    pickle.dump(video_metadata, f)

# Clean the Sessions

Now let's go through and turn our sessions into easy-to-work-with DataFrames (saved as pickle files) that we can load
for our blink and position analysis.


In [62]:
def blink_timestamp_to_array(timestamp_ms, num_points=100, Fs=1000):
    T_ms = (1000./Fs)
    start_time = timestamp_ms - ((num_points-1) * T_ms)
    return [int(start_time + T_ms*i) for i in range(0, num_points)]
    
def blink_string_to_array(string):
    return struct.unpack('<200B', string.encode('UTF-16-LE'))[::2] #'UTF-16-LE'
    

## GET SPECIFIC DATA TYPE
def get_blink_data(data):
    #remove any packets without blinkdata
    data = [d for d in data if len(d['blink_data']) == 100]
    
    if not len(data): 
        print('NO VALID BLINK DATA DETECTED')
        return [],[]
    
    #associate first packet timestamp with tick to give us a tick to timestamp converter function
    tick_to_timestamp_fn = tick_to_timestamp_ms_converter(data[0]['blink_tick_ms'], data[0]['serverTimestamp'])
    
    #grab data and associated timestamp/ticks
    times = [blink_timestamp_to_array(tick_to_timestamp_fn(d['blink_tick_ms'])) for d in data]
    blink_data = [blink_string_to_array(d['blink_data']) for d in data]

    return flatten(times), flatten(blink_data)

def get_temp_data(data):
    #remove any packets without tempdata
    data = [d for d in data if (d['temple_tick_ms'] != 0 and d['nose_tick_ms'] != 0)]
    
    if not len(data): 
        print('NO VALID TEMP DATA DETECTED')
        return [],[],[],[]
    
    #associate first packet timestamp with tick to give us a tick to timestamp converter function
    tick_to_timestamp_fn = tick_to_timestamp_ms_converter(data[0]['temple_tick_ms'], data[0]['serverTimestamp'])
    
    #grab data and associated timestamp/ticks
    temple_times = [tick_to_timestamp_fn(d['temple_tick_ms']) for d in data]
    temple_data = [d['temple_temp'] for d in data]
    nose_times = [tick_to_timestamp_fn(d['nose_tick_ms']) for d in data]
    nose_data = [d['nose_temp'] for d in data]
    
    return temple_times, temple_data, nose_times, nose_data 

def get_quat_data(data):
    #remove any packets without quatdata
    data = [d for d in data if (d['rot_tick_ms'] != 0)]
    
    if not len(data): 
        print('NO VALID QUAT DATA DETECTED')
        return [],[],[],[],[],[]
    
    #associate first packet timestamp with tick to give us a tick to timestamp converter function
    tick_to_timestamp_fn = tick_to_timestamp_ms_converter(data[0]['rot_tick_ms'], data[0]['serverTimestamp'])
    
    #grab data and associated timestamp/ticks
    times = [tick_to_timestamp_fn(d['rot_tick_ms']) for d in data]
    I_data = [d['quatI'] for d in data]
    J_data = [d['quatJ'] for d in data]
    K_data = [d['quatK'] for d in data]
    real_data = [d['quatReal'] for d in data]
    accuracy_data = [d['quatRadianAccuracy'] for d in data]
    
    return times, I_data, J_data, K_data, real_data, accuracy_data  

def get_pos_data(data):
    #remove any packets without posdata
    data = [d for d in data if (d['tick_ms_pos'] != 0)]
    
    if not len(data): 
        print('NO VALID POS DATA DETECTED')
        return [],[],[],[],[]
    
    #associate first packet timestamp with tick to give us a tick to timestamp converter function
    tick_to_timestamp_fn = tick_to_timestamp_ms_converter(data[0]['tick_ms_pos'], data[0]['serverTimestamp'])
    
    #grab data and associated timestamp/ticks
    times = [tick_to_timestamp_fn(d['tick_ms_pos']) for d in data]
    x_data = [d['pos_x'] for d in data]
    y_data = [d['pos_y'] for d in data]
    z_data = [d['pos_z'] for d in data]
    accuracy_data = [d['pos_accuracy'] for d in data]
    
    return times, x_data, y_data, z_data, accuracy_data 
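
To make the blink packet handling above concrete, here is a small worked example on a made-up 4-sample packet (real packets carry 100 samples). It re-implements the two blink helpers under different names (`decode_blink_string`, `backfill_times`) so it runs on its own, and it assumes, as the code above implies, that the packet tick marks the last sample in the packet.

```python
import struct

def decode_blink_string(string):
    """Recover one byte per character from a string of code points 0-255."""
    raw = string.encode('UTF-16-LE')                     # two bytes per character
    return struct.unpack('<%dB' % len(raw), raw)[::2]    # keep the low byte of each pair

def backfill_times(last_sample_ms, num_points, Fs=1000):
    """Timestamps for num_points samples ending at last_sample_ms, spaced 1000/Fs ms apart."""
    T_ms = 1000.0 / Fs
    start = last_sample_ms - (num_points - 1) * T_ms
    return [int(start + T_ms * i) for i in range(num_points)]

samples = [0, 17, 128, 255]
packet = ''.join(chr(v) for v in samples)   # made-up packet payload
values = decode_blink_string(packet)
times = backfill_times(5000, num_points=len(values))
print(list(zip(times, values)))
```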


In [63]:
for folder in video_metadata:
    path_base = './cleaned_data/' + folder 
    
    sessions = pickle.load(open(path_base + '/glasses_sessions.p', 'rb'))
    print('working on ' + folder + ',', len(sessions), 'sessions...')

    b_sessions = [get_blink_data(s) for s in sessions]
    b_sessions = [pd.DataFrame({'timestamp_ms':s[0], 'value':s[1]}) for s in b_sessions]
    pickle.dump( b_sessions, open( path_base + "/sessions_blink.p", "wb" ) )

    t_sessions = [get_temp_data(s)  for s in sessions]
    t_sessions = [pd.DataFrame({'timestamp_tmpl_ms':s[0], 'tmpl_temp':s[1],
                            'timestamp_nose_ms':s[2], 'nose_temp':s[3],}) for s in t_sessions]
    pickle.dump( t_sessions, open( path_base + "/sessions_temps.p", "wb" ) )

    q_sessions = [get_quat_data(s)  for s in sessions]
    q_sessions = [pd.DataFrame({'timestamp_ms':s[0], 'quatI':s[1], 'quatJ':s[2], 'quatK':s[3],
                            'quatReal':s[4], 'quatRadianAccuracy':s[5]}) for s in q_sessions]
    pickle.dump( q_sessions, open( path_base + "/sessions_accel.p", "wb" ) )

working on david_2, 2 sessions...
working on patrick_1, 5 sessions...
working on patrick_2, 7 sessions...
NO VALID QUAT DATA DETECTED
working on beata_1, 2 sessions...
working on david_3a, 10 sessions...
NO VALID TEMP DATA DETECTED
working on david_3b, 10 sessions...
NO VALID TEMP DATA DETECTED
working on irmandy_1, 5 sessions...
working on juliana_1, 5 sessions...


# Consolidate Video Data

take our openface features (min) and our eyeratio features, strap on the estimated timestamp based on our estimated starttime, and consolidate into one df to rule them all.
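
The merge in this section joins the two video tables on exact `tick_ms` equality, so a 1 ms rounding difference between the sources drops that row. As a possible alternative (a swapped-in technique, not what this notebook does), `pd.merge_asof` can match each row to the nearest tick within a tolerance. The toy frames and the 2 ms tolerance below are made up for illustration.

```python
import pandas as pd

# made-up ticks: the two sources disagree by 1 ms on the 2nd and 4th frames
df_er = pd.DataFrame({'tick_ms': [0, 17, 33, 50], 'eye_ratio': [0.30, 0.28, 0.05, 0.29]})
df_of = pd.DataFrame({'tick_ms': [0, 16, 33, 51], 'AU45_r': [0.0, 0.1, 2.5, 0.2]})

# exact join: only ticks present in both tables survive
exact = pd.merge(df_er, df_of, on='tick_ms')

# nearest join: both frames must be sorted by tick_ms; tolerance is in ms
nearest = pd.merge_asof(df_er, df_of, on='tick_ms', direction='nearest', tolerance=2)

print(len(exact), len(nearest))
```

With a nearest match, two rows on one side can pair with the same row on the other, so it is worth checking for duplicates before trusting the result.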

In [66]:
def make_gopro_timestamp_fnc(start_timestamp="01/01/01 01:01:00AM EST"):
    #start_timestamp can be string of form above, or timestamp in ms
    if type(start_timestamp) == str:
        start_timestamp = datetime.strptime(start_timestamp, '%m/%d/%y %I:%M:%S%p %Z').timestamp()*1000
    
    def get_gopro_timestamp_ms(offset_ms):
        return int(start_timestamp + offset_ms)
    
    return get_gopro_timestamp_ms

In [77]:
for folder in video_metadata:
    print('working on ' + folder + ' video data...')

    try:
        #load video eyeratio, ~60 Hz (~16.7 ms between samples)
        df_er = pd.read_csv('./cleaned_data/' + folder + '/video_eyeratio_final.csv', index_col=0)
    
        #load video openface features, ~60 Hz (~16.7 ms between samples)
        df_of = pd.read_csv('./cleaned_data/' + folder + '/video_min.csv', index_col=0)
      
        gopro_est_timestamp_ms = make_gopro_timestamp_fnc(video_metadata[folder]['vid_start'])

        #df_er is in ms already (because I generated it)
        df_er['tick_ms'] = df_er['timestamp'].apply(lambda v: round(v-df_er['timestamp'].iloc[0]))
        df_er['est_timestamp_ms'] = df_er['tick_ms'].apply(gopro_est_timestamp_ms)
        del df_er['timestamp']

        #df_of is in partial secs
        df_of['tick_ms'] = df_of['timestamp'].apply(lambda v: round(v*1000.))
        df_of['est_timestamp_ms'] = df_of['tick_ms'].apply(gopro_est_timestamp_ms)
        del df_of['timestamp']

        #consolidate based on tick_ms
        final_df = pd.merge(df_er, df_of, on=["tick_ms", "est_timestamp_ms"])

        if (len(final_df) != min(len(df_er),len(df_of))): print('MERGE FAILED AHHHHHHHHH')
        else: final_df.to_csv('./cleaned_data/' + folder + '/video_consolidated.csv')
    
    except Exception as e:
        #print(e)
        print('FAILED: prob no eyeratio or video_min file, or no go-pro init timestamp!')

working on patrick_2video data...
   eye_ratio  timestamp
0   0.001489  16.683426
1   0.001445  33.366851
2   0.001373  50.050277
3   0.001240  66.733702
4   0.001435  83.417128
time data '' does not match format '%m/%d/%y %I:%M:%S%p %Z'
FAILED: prob no eyeratio or video_min file!



## DONE!  


all the data is now clean: we can load it as follows:

get the metadata about the folders:

```
metadata = pickle.load(open('./cleaned_data/metadata.p', 'rb'))
print('Please Choose a Session to Work With:')
for k in metadata: print(k, end=',  ')
```

pick a session and load it (video data is in a pandas DF, glasses data is an array of DFs
broken up by session):

```
SESSION = 'david_1'

#load video data (eyeratio and openface features), ~60 Hz (~16.7 ms between samples)
df_vid = pd.read_csv('./cleaned_data/' + SESSION + '/video_consolidated.csv', index_col=0)

#load glasses blink data, 1kHz (1 ms between samples)
blink_sess = pickle.load(open('./cleaned_data/' + SESSION + '/sessions_blink.p', 'rb'))

#load glasses accel data, 10 Hz (100 ms between samples)
accel_sess = pickle.load(open('./cleaned_data/' + SESSION + '/sessions_accel.p', 'rb'))

#load glasses temp data, 10 Hz (100 ms between samples)
temps_sess = pickle.load(open('./cleaned_data/' + SESSION + '/sessions_temps.p', 'rb'))
```



----
#### Go forth and rejoice!