# Project Description

## Background

Eye tracking is a technology used to measure the movement and position of the eye. It can provide a variety of information, such as where someone is looking (also known as the gaze point). The raw eye tracking data can also be used to engineer new features, called eye tracking events, which yield further information.

The eye tracking events that can be measured include fixations, which are periods of time when the eye rests on a target, and saccades, during which the eyes move between points of fixation. There are also post-saccadic oscillations and glissades, where the eye oscillates after a saccade before settling on a fixation point. Post-saccadic oscillations overshoot the target, while glissades undershoot.

These events can be detected by applying different threshold techniques. I-VT applies a velocity threshold: if the speed between two gaze points is below the threshold, the sample is identified as a fixation; if the speed is above the threshold, it is a saccade. There is also a dispersion (distance) based method known as I-DT, which instead uses the distance between gaze points to classify fixations and saccades. These threshold algorithms are common in practice, but they cannot classify more complex events.

For the I-VT algorithm, a speed threshold of 0.5 px/ms was selected, and for I-DT a dispersion threshold of 1º was selected.
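
To make the dispersion idea concrete, here is a minimal sketch of the windowed form of I-DT, where the dispersion of a window is taken as (max x - min x) + (max y - min y). This is a common textbook formulation and is offered as an assumption for illustration; it is not necessarily the pairwise-distance variant used in this notebook. The names `dispersion`, `idt_labels` and `min_samples`, the toy coordinates and the threshold of 1.0 (in arbitrary units, not degrees) are all hypothetical.

```python
import numpy as np

def dispersion(x, y):
    """Dispersion of a window: horizontal extent plus vertical extent."""
    return (x.max() - x.min()) + (y.max() - y.min())

def idt_labels(x, y, max_dispersion, min_samples):
    """Label each sample 1 (fixation) or 0 (not a fixation) with windowed I-DT."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.zeros(len(x), dtype=int)
    start = 0
    while start + min_samples <= len(x):
        end = start + min_samples  # current window is [start, end)
        if dispersion(x[start:end], y[start:end]) <= max_dispersion:
            # grow the window while the dispersion stays within the threshold
            while end < len(x) and dispersion(x[start:end + 1], y[start:end + 1]) <= max_dispersion:
                end += 1
            labels[start:end] = 1
            start = end
        else:
            start += 1  # drop the first sample and try again
    return labels

# hypothetical toy data: a tight cluster, a jump, then another tight cluster
x = [0.0, 0.1, 0.0, 0.1, 2.0, 4.0, 5.0, 5.1, 5.0, 5.1]
y = [0.0, 0.0, 0.1, 0.1, 2.0, 4.0, 5.0, 5.0, 5.1, 5.1]
labels = idt_labels(x, y, max_dispersion=1.0, min_samples=3)
print(labels)
```

In this sketch the samples inside each tight cluster are labeled as fixation samples, and the samples that bridge the two clusters are left unlabeled, which is where a saccade would be expected.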

## Dataset

The dataset used in this notebook comes from a study performed in the University of Guelph DRiVE lab. Participants wore eye-tracking glasses (Tobii Pro Glasses 3) and drove an OKTAL driving simulator. The dataset contains 72 participants who are randomly separated into train, test and validation sets. The data is pre-split by participant to ensure that there is no leakage between participant data, which could affect the training of the models, and to ensure a more consistent evaluation of model performance.

Each file contains 3 different sets of data: device information in the sheet titled 'Event Data', IMU sensor data in the sheet titled 'IMU Data', and eye tracking data in the sheet titled 'Gaze Data'. The sheets have 4, 22 and 11 columns respectively. The data from the eye tracker is collected at 60Hz, and each participant file has roughly 20000 records.
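
The participant-level split described in this section can be sketched as follows. The split fractions, the seed, the function name `split_participants` and the placeholder participant IDs are assumptions for illustration; the notebook does not state the ratios that were actually used.

```python
import random

def split_participants(participant_ids, train_frac=0.7, test_frac=0.15, seed=0):
    """Randomly assign whole participants to train, test and validation sets."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = int(len(ids) * train_frac)
    n_test = int(len(ids) * test_frac)
    return {
        'train': ids[:n_train],
        'test': ids[n_train:n_train + n_test],
        'validation': ids[n_train + n_test:],  # whatever remains
    }

# placeholder IDs standing in for the 72 participant files
participants = [f'participant_{i:02d}' for i in range(72)]
splits = split_participants(participants)
print({name: len(ids) for name, ids in splits.items()})
```

Because each participant lands in exactly one set, no recording from a single person can appear in both the training data and the evaluation data.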

## Procedure

1. The Gaze Data is read into the notebook using an Excel library.
2. For each participant file, any gaps in the data are first filled in using linear interpolation.
3. Next, each pair of consecutive records is used to calculate the labels using I-VT and I-DT, and the results are stored in a new dataframe.
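
Steps 2 and 3 can be sketched on a toy table with plain pandas. This version pairs records with `shift` rather than with the two-copy approach used in the notebook cells, and the data values are made up for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical toy gaze samples with one gap (NaN) in each coordinate
gaze = pd.DataFrame({
    'Timestamp': [0.00, 0.02, 0.04, 0.06],
    'Data_Gaze2D_X': [0.50, np.nan, 0.54, 0.56],
    'Data_Gaze2D_Y': [0.20, 0.22, np.nan, 0.26],
})

# step 2: fill gaps by linear interpolation
gaze = gaze.interpolate(method='linear', limit_direction='both')

# step 3: pair each record with the one that follows it
nxt = gaze.shift(-1).add_suffix('_2')
pairs = pd.concat([gaze, nxt], axis=1).iloc[:-1]  # the last record has no successor
print(pairs)
```

Each row of `pairs` then holds one record next to its successor (columns ending in `_2`), which is the layout the threshold calculations operate on.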

# Set Up Python Notebook

## Import Python Libraries

In [10]:
import os
from os import listdir
import pandas as pd

os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

# import spark libraries
import findspark
findspark.init()
from pyspark.sql import SparkSession

import pyspark.pandas as ps

import numpy as np

## Define Global Variables

In [11]:
datasets = ['dataset_training','dataset_testing','dataset_validation'] # directories for training, testing and validation
sheet = 'Gaze Data' # name of sheet with the eye tracking data

column_names = ['Type', 'Timestamp', 'Data_Gaze2D_X', 'Data_Gaze2D_Y', 'Data_Gaze3D_X',
       'Data_Gaze3D_Y', 'Data_Gaze3D_Z', 'Data_Eyeleft_Gazeorigin_X',
       'Data_Eyeleft_Gazeorigin_Y', 'Data_Eyeleft_Gazeorigin_Z',
       'Data_Eyeleft_Gazedirection_X', 'Data_Eyeleft_Gazedirection_Y',
       'Data_Eyeleft_Gazedirection_Z', 'Data_Eyeleft_Pupildiameter',
       'Data_Eyeright_Gazeorigin_X', 'Data_Eyeright_Gazeorigin_Y',
       'Data_Eyeright_Gazeorigin_Z', 'Data_Eyeright_Gazedirection_X',
       'Data_Eyeright_Gazedirection_Y', 'Data_Eyeright_Gazedirection_Z',
       'Data_Eyeright_Pupildiameter']

# create empty dataframe variables
df_train = None
df_test = None
df_validation = None

#speed threshold
ivt_thresh = 0.5

#dispersion threshold
idt_thresh = 1

## Create Spark Session

In [12]:
spark = SparkSession.builder\
    .appName("Cis6180_FinalProject")\
    .config("spark.driver.memory", "48g")\
    .config("spark.memory.offHeap.enabled","true")\
    .config("spark.memory.offHeap.size","5g")\
    .config("spark.executor.memory", "48g")\
    .config("spark.executor.cores", 4)\
    .config("spark.shuffle.service.enabled", True)\
    .config("spark.dynamicAllocation.enabled", True)\
    .config("spark.dynamicAllocation.minExecutors", 1)\
    .config("spark.dynamicAllocation.maxExecutors", 4)\
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")\
    .getOrCreate()

# Conversion issues between pandas and Spark were fixed by setting spark.sql.execution.arrow.pyspark.enabled to true.
# Not all pandas commands work; tail, for example, still seems to give issues.

## Create List of Files to Import

In [13]:
file_paths = []
file_limits = [1,1,1]

# iterate through the files in each dataset directory and collect their paths
for ds_num,dataset in enumerate(datasets):
    data_files = listdir(dataset)
    for f_num,f in enumerate(data_files):
        if f_num == file_limits[ds_num]:
            break
        file_path = dataset + '/' + f # file path is the relative file path for the current excel file
        file_paths.append((file_path,ds_num))

print(len(file_paths))

3


## Iterate Through All Files to Read into DataFrames

In [14]:
file_num = 1

# iterate through the list of files to read them, interpolate missing data, merge them and convert to pyspark
# this takes 5-20 minutes and should be optimized in the future #

for file_path in file_paths:
    print(f'File: {file_num}/{len(file_paths)} {file_path[0]}')
    
    # read the sheet with pandas first (the data comes from an Excel spreadsheet), then convert to pandas-on-Spark
    ppdf = pd.read_excel(io=file_path[0],sheet_name=sheet) # read excel as pandas

    # print(f'Shape pre copy and merge: {ppdf.shape}')# this is fine

    ppdf.columns = column_names
    ppdf = ppdf.drop(['Type'],axis=1)
    ppdf = ppdf.interpolate(method='linear',limit_direction='both')

    ppdf_first = ppdf.iloc[:-1] # create copy of ppdf and remove last row
    ppdf_second = ppdf.copy().iloc[1:] # create a copy of ppdf and remove the first row
    ppdf_second.reset_index(drop=True, inplace=True)  # Reset index to ensure alignment

    ppdf_second.columns = [col + '_2' for col in ppdf_second.columns]# rename the columns of the second copy
    ppdf_combined = pd.concat([ppdf_first,ppdf_second],axis=1) # merge first copy and second copy side by side

    # print(f'Shape post copy and merge: {ppdf_combined.shape}')# this is fine

    psdf = ps.DataFrame(ppdf_combined) # convert the pandas DataFrame to a pandas-on-Spark DataFrame

    # print(f'pySpark DF shape: {psdf.count()}') # this does not work
    # concatenation returns a new DataFrame, so the result must be assigned back to the variable
    # ps.concat is used because DataFrame.append is deprecated in recent pandas-on-Spark releases
    if file_path[1] == 0:  # train
        if df_train is None:
            df_train = psdf
        else:
            df_train = ps.concat([df_train, psdf], ignore_index=True) # add the current file to the existing training data

    elif file_path[1] == 1:  # test
        if df_test is None:
            df_test = psdf
        else:
            df_test = ps.concat([df_test, psdf], ignore_index=True) # add the current file to the existing testing data
    elif file_path[1] == 2:  # validation
        if df_validation is None:
            df_validation = psdf
        else:
            df_validation = ps.concat([df_validation, psdf], ignore_index=True) # add the current file to the existing validation data

    file_num+=1

File: 1/3 dataset_training/eye-data-10327.xlsx


File: 2/3 dataset_testing/eye-data-11868.xlsx
File: 3/3 dataset_validation/eye-data-11085.xlsx


In [15]:
df_train.head()

Unnamed: 0,Timestamp,Data_Gaze2D_X,Data_Gaze2D_Y,Data_Gaze3D_X,Data_Gaze3D_Y,Data_Gaze3D_Z,Data_Eyeleft_Gazeorigin_X,Data_Eyeleft_Gazeorigin_Y,Data_Eyeleft_Gazeorigin_Z,Data_Eyeleft_Gazedirection_X,Data_Eyeleft_Gazedirection_Y,Data_Eyeleft_Gazedirection_Z,Data_Eyeleft_Pupildiameter,Data_Eyeright_Gazeorigin_X,Data_Eyeright_Gazeorigin_Y,Data_Eyeright_Gazeorigin_Z,Data_Eyeright_Gazedirection_X,Data_Eyeright_Gazedirection_Y,Data_Eyeright_Gazedirection_Z,Data_Eyeright_Pupildiameter,Timestamp_2,Data_Gaze2D_X_2,Data_Gaze2D_Y_2,Data_Gaze3D_X_2,Data_Gaze3D_Y_2,Data_Gaze3D_Z_2,Data_Eyeleft_Gazeorigin_X_2,Data_Eyeleft_Gazeorigin_Y_2,Data_Eyeleft_Gazeorigin_Z_2,Data_Eyeleft_Gazedirection_X_2,Data_Eyeleft_Gazedirection_Y_2,Data_Eyeleft_Gazedirection_Z_2,Data_Eyeleft_Pupildiameter_2,Data_Eyeright_Gazeorigin_X_2,Data_Eyeright_Gazeorigin_Y_2,Data_Eyeright_Gazeorigin_Z_2,Data_Eyeright_Gazedirection_X_2,Data_Eyeright_Gazedirection_Y_2,Data_Eyeright_Gazedirection_Z_2,Data_Eyeright_Pupildiameter_2
0,0.003732,0.591048,0.267156,-403.732235,641.688232,2104.947219,31.702048,-5.298482,-29.809115,-0.191349,0.282862,0.93988,5.573006,-35.001305,-5.039065,-30.152352,-0.163314,0.287835,0.943652,5.30228,0.023823,0.590733,0.269367,-392.841447,621.014096,2055.604595,31.704264,-5.298387,-29.812654,-0.191369,0.282087,0.940109,5.552172,-34.999259,-5.050988,-30.150719,-0.16215,0.283922,0.945037,5.280413
1,0.023823,0.590733,0.269367,-392.841447,621.014096,2055.604595,31.704264,-5.298387,-29.812654,-0.191369,0.282087,0.940109,5.552172,-34.999259,-5.050988,-30.150719,-0.16215,0.283922,0.945037,5.280413,0.043804,0.590784,0.270588,-409.056617,643.079464,2139.308191,31.706771,-5.300355,-29.812396,-0.191091,0.281428,0.940363,5.530343,-35.005476,-5.053979,-30.153545,-0.163002,0.282109,0.945434,5.258709
2,0.043804,0.590784,0.270588,-409.056617,643.079464,2139.308191,31.706771,-5.300355,-29.812396,-0.191091,0.281428,0.940363,5.530343,-35.005476,-5.053979,-30.153545,-0.163002,0.282109,0.945434,5.258709,0.063896,0.590586,0.272172,-384.384972,601.730043,2014.898284,31.706589,-5.302493,-29.811885,-0.191299,0.280865,0.940489,5.516606,-35.002751,-5.061798,-30.153049,-0.161806,0.279179,0.946508,5.237484
3,0.063896,0.590586,0.272172,-384.384972,601.730043,2014.898284,31.706589,-5.302493,-29.811885,-0.191299,0.280865,0.940489,5.516606,-35.002751,-5.061798,-30.153049,-0.161806,0.279179,0.946508,5.237484,0.083877,0.590482,0.27365,-343.499936,535.095116,1802.799571,31.703131,-5.30431,-29.811325,-0.192118,0.280499,0.940431,5.499812,-34.99657,-5.073592,-30.151606,-0.159965,0.27609,0.947727,5.218326
4,0.083877,0.590482,0.27365,-343.499936,535.095116,1802.799571,31.703131,-5.30431,-29.811325,-0.192118,0.280499,0.940431,5.499812,-34.99657,-5.073592,-30.151606,-0.159965,0.27609,0.947727,5.218326,0.103969,0.590604,0.274184,-318.678837,494.660143,1670.253406,31.698023,-5.306287,-29.810815,-0.193283,0.280127,0.940304,5.488733,-34.992541,-5.078294,-30.152909,-0.158745,0.27506,0.948231,5.20036


# Calculate Threshold Values

## Apply Velocity Threshold (I-VT)

### Calculate Distance

In [16]:
df_train['Distance'] = ((df_train['Data_Gaze2D_X'] - df_train['Data_Gaze2D_X_2'])**2 + (df_train['Data_Gaze2D_Y'] - df_train['Data_Gaze2D_Y_2'])**2)**0.5
df_test['Distance'] = ((df_test['Data_Gaze2D_X'] - df_test['Data_Gaze2D_X_2'])**2 + (df_test['Data_Gaze2D_Y'] - df_test['Data_Gaze2D_Y_2'])**2)**0.5
df_validation['Distance'] = ((df_validation['Data_Gaze2D_X'] - df_validation['Data_Gaze2D_X_2'])**2 + (df_validation['Data_Gaze2D_Y'] - df_validation['Data_Gaze2D_Y_2'])**2)**0.5

### Calculate Time

In [17]:
df_train['Elapsed_T'] = df_train['Timestamp_2'] - df_train['Timestamp']
df_test['Elapsed_T'] = df_test['Timestamp_2'] - df_test['Timestamp']
df_validation['Elapsed_T'] = df_validation['Timestamp_2'] - df_validation['Timestamp']

### Calculate Velocity

In [18]:
df_train['Velocity'] = df_train['Distance'] / df_train['Elapsed_T']
df_test['Velocity'] = df_test['Distance'] / df_test['Elapsed_T']
df_validation['Velocity'] = df_validation['Distance'] / df_validation['Elapsed_T']
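
The sample rows in this notebook suggest that the `Data_Gaze2D_X` and `Data_Gaze2D_Y` values are fractions of the frame and that the timestamps are in seconds, in which case the velocity computed above is not directly in px/ms. The sketch below shows one way the units might be converted. The 1920x1080 frame size, the seconds-based timestamps and the helper name `velocity_px_per_ms` are assumptions, not facts stated in the notebook.

```python
import pandas as pd

# ASSUMPTIONS (not stated in the notebook): Gaze2D values are fractions of a
# 1920x1080 scene-camera frame, and timestamps are in seconds.
FRAME_WIDTH_PX = 1920
FRAME_HEIGHT_PX = 1080

def velocity_px_per_ms(df):
    """Return point-to-point gaze speed in pixels per millisecond."""
    dx_px = (df['Data_Gaze2D_X_2'] - df['Data_Gaze2D_X']) * FRAME_WIDTH_PX
    dy_px = (df['Data_Gaze2D_Y_2'] - df['Data_Gaze2D_Y']) * FRAME_HEIGHT_PX
    dt_ms = (df['Timestamp_2'] - df['Timestamp']) * 1000.0
    return (dx_px ** 2 + dy_px ** 2) ** 0.5 / dt_ms

# first row of the training sample printed earlier in this notebook
sample = pd.DataFrame({
    'Timestamp': [0.003732], 'Timestamp_2': [0.023823],
    'Data_Gaze2D_X': [0.591048], 'Data_Gaze2D_X_2': [0.590733],
    'Data_Gaze2D_Y': [0.267156], 'Data_Gaze2D_Y_2': [0.269367],
})
sample['Velocity_px_ms'] = velocity_px_per_ms(sample)
print(sample['Velocity_px_ms'])
```

The example runs on plain pandas so that it is self-contained. The helper only uses column arithmetic, so it would likely carry over to the pandas-on-Spark DataFrames used here, though that has not been verified.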

### Calculate I-VT Classification
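
A minimal sketch of how the velocity threshold could be turned into labels is shown here on a toy pandas DataFrame. The helper `classify_ivt`, the column name `IVT_Label` and the toy velocities are hypothetical, and the sketch assumes the velocities are expressed in the same units as the threshold.

```python
import numpy as np
import pandas as pd

ivt_thresh = 0.5  # speed threshold, same value as the notebook's global variable

def classify_ivt(velocity, threshold):
    """Return 'fixation' where velocity is below the threshold, else 'saccade'."""
    # a velocity exactly equal to the threshold is treated as a saccade (an assumption)
    return np.where(velocity < threshold, 'fixation', 'saccade')

# toy velocities standing in for the 'Velocity' column (values are made up)
toy = pd.DataFrame({'Velocity': [0.11, 0.06, 0.75, 1.20, 0.03]})
toy['IVT_Label'] = classify_ivt(toy['Velocity'], ivt_thresh)
print(toy)
```

On a pandas-on-Spark DataFrame, a plain boolean comparison such as `df_train['Velocity'] < ivt_thresh` may be the simpler route, since NumPy functions are not guaranteed to accept pandas-on-Spark objects.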

### Check DataFrame

In [19]:
df_train.head()

Unnamed: 0,Timestamp,Data_Gaze2D_X,Data_Gaze2D_Y,Data_Gaze3D_X,Data_Gaze3D_Y,Data_Gaze3D_Z,Data_Eyeleft_Gazeorigin_X,Data_Eyeleft_Gazeorigin_Y,Data_Eyeleft_Gazeorigin_Z,Data_Eyeleft_Gazedirection_X,Data_Eyeleft_Gazedirection_Y,Data_Eyeleft_Gazedirection_Z,Data_Eyeleft_Pupildiameter,Data_Eyeright_Gazeorigin_X,Data_Eyeright_Gazeorigin_Y,Data_Eyeright_Gazeorigin_Z,Data_Eyeright_Gazedirection_X,Data_Eyeright_Gazedirection_Y,Data_Eyeright_Gazedirection_Z,Data_Eyeright_Pupildiameter,Timestamp_2,Data_Gaze2D_X_2,Data_Gaze2D_Y_2,Data_Gaze3D_X_2,Data_Gaze3D_Y_2,Data_Gaze3D_Z_2,Data_Eyeleft_Gazeorigin_X_2,Data_Eyeleft_Gazeorigin_Y_2,Data_Eyeleft_Gazeorigin_Z_2,Data_Eyeleft_Gazedirection_X_2,Data_Eyeleft_Gazedirection_Y_2,Data_Eyeleft_Gazedirection_Z_2,Data_Eyeleft_Pupildiameter_2,Data_Eyeright_Gazeorigin_X_2,Data_Eyeright_Gazeorigin_Y_2,Data_Eyeright_Gazeorigin_Z_2,Data_Eyeright_Gazedirection_X_2,Data_Eyeright_Gazedirection_Y_2,Data_Eyeright_Gazedirection_Z_2,Data_Eyeright_Pupildiameter_2,Distance,Elapsed_T,Velocity
0,0.003732,0.591048,0.267156,-403.732235,641.688232,2104.947219,31.702048,-5.298482,-29.809115,-0.191349,0.282862,0.93988,5.573006,-35.001305,-5.039065,-30.152352,-0.163314,0.287835,0.943652,5.30228,0.023823,0.590733,0.269367,-392.841447,621.014096,2055.604595,31.704264,-5.298387,-29.812654,-0.191369,0.282087,0.940109,5.552172,-34.999259,-5.050988,-30.150719,-0.16215,0.283922,0.945037,5.280413,0.002233,0.020091,0.111155
1,0.023823,0.590733,0.269367,-392.841447,621.014096,2055.604595,31.704264,-5.298387,-29.812654,-0.191369,0.282087,0.940109,5.552172,-34.999259,-5.050988,-30.150719,-0.16215,0.283922,0.945037,5.280413,0.043804,0.590784,0.270588,-409.056617,643.079464,2139.308191,31.706771,-5.300355,-29.812396,-0.191091,0.281428,0.940363,5.530343,-35.005476,-5.053979,-30.153545,-0.163002,0.282109,0.945434,5.258709,0.001222,0.019981,0.061165
2,0.043804,0.590784,0.270588,-409.056617,643.079464,2139.308191,31.706771,-5.300355,-29.812396,-0.191091,0.281428,0.940363,5.530343,-35.005476,-5.053979,-30.153545,-0.163002,0.282109,0.945434,5.258709,0.063896,0.590586,0.272172,-384.384972,601.730043,2014.898284,31.706589,-5.302493,-29.811885,-0.191299,0.280865,0.940489,5.516606,-35.002751,-5.061798,-30.153049,-0.161806,0.279179,0.946508,5.237484,0.001597,0.020092,0.079462
3,0.063896,0.590586,0.272172,-384.384972,601.730043,2014.898284,31.706589,-5.302493,-29.811885,-0.191299,0.280865,0.940489,5.516606,-35.002751,-5.061798,-30.153049,-0.161806,0.279179,0.946508,5.237484,0.083877,0.590482,0.27365,-343.499936,535.095116,1802.799571,31.703131,-5.30431,-29.811325,-0.192118,0.280499,0.940431,5.499812,-34.99657,-5.073592,-30.151606,-0.159965,0.27609,0.947727,5.218326,0.001482,0.019981,0.07416
4,0.083877,0.590482,0.27365,-343.499936,535.095116,1802.799571,31.703131,-5.30431,-29.811325,-0.192118,0.280499,0.940431,5.499812,-34.99657,-5.073592,-30.151606,-0.159965,0.27609,0.947727,5.218326,0.103969,0.590604,0.274184,-318.678837,494.660143,1670.253406,31.698023,-5.306287,-29.810815,-0.193283,0.280127,0.940304,5.488733,-34.992541,-5.078294,-30.152909,-0.158745,0.27506,0.948231,5.20036,0.000547,0.020092,0.027237
