# Project Description

## Background

Eye tracking is a technology used to measure the movement and position of the eye. It can provide a variety of information, such as where someone is looking (also known as the gaze point). The raw eye-tracking data can also be used to engineer new features, known as eye-tracking events, which can in turn be used to obtain more information.

The eye-tracking events that can be measured include fixations, which are periods of time during which the eye fixates on a target, and saccades, during which the eyes move between fixation points. There are also post-saccadic oscillations and glissades, in which the eye oscillates after a saccade before settling on a fixation point. Post-saccadic oscillations overshoot the target, while glissades undershoot.

These events can be identified by applying different threshold techniques. I-VT applies a velocity threshold: if the speed between two gaze points is below a certain threshold, the sample is identified as a fixation; if the speed is above the threshold, it is a saccade. There is also a dispersion (distance) based method, known as I-DT, which instead uses the distance between gaze points to classify fixations and saccades. These threshold algorithms are common in practice, but they cannot classify more complex events.

For the I-VT algorithm, a speed threshold of 0.5 px/ms was selected, and for I-DT a dispersion threshold of 1° was selected.
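
As a rough illustration of the two thresholds, the sketch below labels each pair of consecutive samples. It is only a minimal sketch under several assumptions that are not confirmed by the notebook: timestamps are in milliseconds, 2D gaze coordinates are in pixels, the I-DT distance is taken as the angle between consecutive 3D gaze direction vectors, and a value exactly equal to the threshold is treated as a saccade. The function names `ivt_labels` and `idt_labels` are hypothetical. Note that this pairwise distance check follows the description above and is simpler than the windowed form of I-DT.

```python
import numpy as np

VELOCITY_THRESHOLD = 0.5    # px/ms, the I-VT threshold quoted above
DISPERSION_THRESHOLD = 1.0  # degrees, the I-DT threshold quoted above


def ivt_labels(t_ms, x_px, y_px, threshold=VELOCITY_THRESHOLD):
    """Label each pair of consecutive samples by point-to-point speed.

    Assumes timestamps in milliseconds and gaze positions in pixels.
    """
    t, x, y = (np.asarray(a, dtype=float) for a in (t_ms, x_px, y_px))
    dist = np.hypot(np.diff(x), np.diff(y))  # pixels between consecutive points
    speed = dist / np.diff(t)                # px/ms
    return np.where(speed < threshold, "fixation", "saccade")


def idt_labels(directions, threshold=DISPERSION_THRESHOLD):
    """Label each pair of consecutive samples by angular distance.

    Assumes `directions` is an (n, 3) array of gaze direction vectors.
    """
    d = np.asarray(directions, dtype=float)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)          # unit vectors
    cos = np.clip(np.sum(d[:-1] * d[1:], axis=1), -1.0, 1.0)  # guard arccos domain
    angle = np.degrees(np.arccos(cos))                        # degrees
    return np.where(angle < threshold, "fixation", "saccade")


# Four samples at roughly 60 Hz with one large jump in the middle
t = [0.0, 16.7, 33.4, 50.1]
x = [100.0, 101.0, 160.0, 161.0]
y = [200.0, 200.0, 200.0, 201.0]
print(ivt_labels(t, x, y))  # ['fixation' 'saccade' 'fixation']
```

Four samples produce three labels because each label describes the movement between a sample and the one that follows it.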

## Dataset

The dataset used in this notebook comes from a study performed at the University of Guelph DRiVE lab. Participants wore eye-tracking glasses (Tobii Pro 3 glasses) and drove an OKTAL driving simulator. The dataset contains 72 participants that are randomly separated into train, test and validation sets. The data is pre-split by participant to ensure that there is no leakage between participant data, which could affect the training of the models, and to ensure a more consistent evaluation of the performance of the models.

Each file contains 3 different sets of data. Device information is in the sheet titled 'Event Data', IMU sensor data is in the sheet titled 'IMU Data', and the eye-tracking data is in the sheet titled 'Gaze Data'. The sheets have 4, 22 and 11 columns respectively. The data from the eye tracker is collected at 60 Hz, and each participant file has roughly 20,000 records.
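
The idea of splitting by participant rather than by record can be sketched as follows. This is only an illustration: the study's actual split procedure is not described here, and the function name `split_participants`, the fractions, the seed and the example IDs are all assumptions.

```python
import random


def split_participants(participant_ids, test_fraction=0.15,
                       validation_fraction=0.15, seed=0):
    """Randomly assign whole participants to training, testing or validation.

    Every record from a participant ends up in exactly one set, so no
    participant's data can leak between sets. Fractions and seed are
    illustrative assumptions.
    """
    ids = sorted(participant_ids)      # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)   # shuffle with a private, seeded generator
    n_test = round(len(ids) * test_fraction)
    n_val = round(len(ids) * validation_fraction)
    return {
        "testing": ids[:n_test],
        "validation": ids[n_test:n_test + n_val],
        "training": ids[n_test + n_val:],
    }


# Hypothetical participant IDs, used only to show the call
splits = split_participants(range(1000, 1020))
print({name: len(members) for name, members in splits.items()})
```

Splitting at the participant level keeps a model from being evaluated on gaze behaviour from a person it has already seen during training.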

## Procedure

1. The 'Gaze Data' sheet is read into the notebook using an Excel library.
2. For each participant file, any gaps in the data are first filled in using linear interpolation.
3. Next, each pair of consecutive records is used to calculate the labels with I-VT and I-DT, and the results are stored in a new dataframe.
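
Steps 2 and 3 can be illustrated on a toy dataframe. This is a minimal sketch rather than the notebook's own code: the values are invented, the timestamps are assumed to be in milliseconds, and `shift(-1)` is used here as one possible way to place each record next to the record that follows it.

```python
import numpy as np
import pandas as pd

# Toy gaze data with two missing values (invented numbers)
toy = pd.DataFrame({
    "Timestamp": [0.0, 16.7, 33.4, 50.1],
    "Data_Gaze2D_X": [100.0, np.nan, 104.0, 105.0],
    "Data_Gaze2D_Y": [200.0, 201.0, np.nan, 203.0],
})

# Step 2: fill the gaps with linear interpolation
filled = toy.interpolate(method="linear", limit_direction="both")

# Step 3 (pairing): put each record beside its successor, with a '_2' suffix,
# and drop the last row because it has no successor
following = filled.shift(-1).add_suffix("_2")
pairs = pd.concat([filled, following], axis=1).iloc[:-1]

print(pairs)
```

Each row of `pairs` then holds everything needed to compute a speed or a distance between two consecutive gaze points.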

# Set Up Python Notebook

## Import Python Libraries

In [13]:
import os
from os import listdir
import pandas as pd

os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

# import spark libraries
import findspark
findspark.init()
from pyspark.sql import SparkSession

import pyspark.pandas as ps
from pyspark.sql.functions import col
from pyspark.sql.functions import when
from pyspark.sql.functions import count
from pyspark.sql.window import Window
from pyspark.sql import functions as F

## Define Global Variables

In [14]:
datasets = ['dataset_training','dataset_testing','dataset_validation'] # directories for training, testing and validation
sheet = 'Gaze Data' # name of sheet with the eye tracking data

column_names = ['Type', 'Timestamp', 'Data_Gaze2D_X', 'Data_Gaze2D_Y', 'Data_Gaze3D_X',
       'Data_Gaze3D_Y', 'Data_Gaze3D_Z', 'Data_Eyeleft_Gazeorigin_X',
       'Data_Eyeleft_Gazeorigin_Y', 'Data_Eyeleft_Gazeorigin_Z',
       'Data_Eyeleft_Gazedirection_X', 'Data_Eyeleft_Gazedirection_Y',
       'Data_Eyeleft_Gazedirection_Z', 'Data_Eyeleft_Pupildiameter',
       'Data_Eyeright_Gazeorigin_X', 'Data_Eyeright_Gazeorigin_Y',
       'Data_Eyeright_Gazeorigin_Z', 'Data_Eyeright_Gazedirection_X',
       'Data_Eyeright_Gazedirection_Y', 'Data_Eyeright_Gazedirection_Z',
       'Data_Eyeright_Pupildiameter']

# create empty dataframe variables
df_train = None
df_test = None
df_validation = None

## Create Spark Session

In [15]:
spark = SparkSession.builder\
    .appName("Cis6180_FinalProject")\
    .config("spark.driver.memory", "48g")\
    .config("spark.memory.offHeap.enabled","true")\
    .config("spark.memory.offHeap.size","10g")\
    .config("spark.executor.memory", "48g")\
    .config("spark.executor.cores", 4)\
    .config("spark.shuffle.service.enabled", True)\
    .config("spark.dynamicAllocation.enabled", True)\
    .config("spark.dynamicAllocation.minExecutors", 1)\
    .config("spark.dynamicAllocation.maxExecutors", 4)\
    .getOrCreate()


## Create List of Files to Import

In [16]:
file_list = []

# iterate through all the files in each dataset directory to build a list of (file path, dataset index) pairs
for ds_num, dataset in enumerate(datasets):
    for f in listdir(dataset):
        file_path = dataset + '/' + f  # relative file path for the current excel file
        file_list.append((file_path, ds_num))

print(len(file_list))

73


## Iterate Through All Files to Read into DataFrames

In [17]:
# iterate through the list of files to read them, interpolate missing data, merge them and convert to pyspark
# this takes 5-20 minutes and should be optimized in the future

for file_num, (file_path, ds_num) in enumerate(file_list, start=1):
    print(f'File: {file_num}/{len(file_list)} {file_path}')

    # read the sheet with pandas first and convert to pandas-on-Spark afterwards,
    # since the data has to be read in from an excel spreadsheet
    ppdf = pd.read_excel(io=file_path, sheet_name=sheet)
    ppdf.columns = column_names
    ppdf = ppdf.drop(['Type'], axis=1)
    ppdf = ppdf.interpolate(method='linear', limit_direction='both')  # fill gaps in the data

    ppdf_first = ppdf.iloc[:-1]  # all rows except the last
    ppdf_second = ppdf.iloc[1:].reset_index(drop=True)  # all rows except the first, re-indexed to align with ppdf_first
    ppdf_second.columns = [name + '_2' for name in ppdf_second.columns]  # rename the columns of the second copy
    ppdf_combined = pd.concat([ppdf_first, ppdf_second], axis=1)  # each row holds a record and the record that follows it

    psdf = ps.from_pandas(ppdf_combined)

    # ps.concat returns a new dataframe, so the result is assigned back to the dataset variable
    if ds_num == 0:  # train
        df_train = psdf if df_train is None else ps.concat([df_train, psdf], ignore_index=True)
    elif ds_num == 1:  # test
        df_test = psdf if df_test is None else ps.concat([df_test, psdf], ignore_index=True)
    elif ds_num == 2:  # validation
        df_validation = psdf if df_validation is None else ps.concat([df_validation, psdf], ignore_index=True)

File: 1/73 dataset_training/eye-data-10327.xlsx
File: 2/73 dataset_training/eye-data-12471.xlsx
File: 3/73 dataset_training/eye-data-18514.xlsx
File: 4/73 dataset_training/eye-data-20116.xlsx
File: 5/73 dataset_training/eye-data-21051.xlsx
File: 6/73 dataset_training/eye-data-21895.xlsx
File: 7/73 dataset_training/eye-data-22013.xlsx
File: 8/73 dataset_training/eye-data-23090.xlsx
File: 9/73 dataset_training/eye-data-23753.xlsx
File: 10/73 dataset_training/eye-data-25462.xlsx
File: 11/73 dataset_training/eye-data-26370.xlsx
File: 12/73 dataset_training/eye-data-28334.xlsx
File: 13/73 dataset_training/eye-data-29048.xlsx
File: 14/73 dataset_training/eye-data-34473.xlsx
File: 15/73 dataset_training/eye-data-35217.xlsx
File: 16/73 dataset_training/eye-data-35745.xlsx
File: 17/73 dataset_training/eye-data-41517.xlsx
File: 18/73 dataset_training/eye-data-46121.xlsx
File: 19/73 dataset_training/eye-data-46307.xlsx
File: 20/73 dataset_training/eye-data-47274.xlsx
File: 21/73 dataset_training/eye-data-47402.xlsx
File: 22/73 dataset_training/eye-data-48737.xlsx
File: 23/73 dataset_training/eye-data-51637.xlsx
File: 24/73 dataset_training/eye-data-52063.xlsx
File: 25/73 dataset_training/eye-data-53209.xlsx
File: 26/73 dataset_training/eye-data-53349.xlsx
File: 27/73 dataset_training/eye-data-54455.xlsx
File: 28/73 dataset_training/eye-data-55367.xlsx
File: 29/73 dataset_training/eye-data-55746.xlsx
File: 30/73 dataset_training/eye-data-56135.xlsx
File: 31/73 dataset_training/eye-data-56233.xlsx
File: 32/73 dataset_training/eye-data-59774.xlsx
File: 33/73 dataset_training/eye-data-63923.xlsx
File: 34/73 dataset_training/eye-data-64765.xlsx
File: 35/73 dataset_training/eye-data-69876.xlsx
File: 36/73 dataset_training/eye-data-70253.xlsx
File: 37/73 dataset_training/eye-data-70615.xlsx
File: 38/73 dataset_training/eye-data-71291.xlsx
File: 39/73 dataset_training/eye-data-76001.xlsx
File: 40/73 dataset_training/eye-data-79820.xlsx
File: 41/73 dataset_training/eye-data-83008.xlsx
File: 42/73 dataset_training/eye-data-84384.xlsx
File: 43/73 dataset_training/eye-data-86812.xlsx
File: 44/73 dataset_training/eye-data-91060.xlsx
File: 45/73 dataset_training/eye-data-94231.xlsx
File: 46/73 dataset_training/eye-data-95397.xlsx
File: 47/73 dataset_training/eye-data-95985.xlsx
File: 48/73 dataset_training/eye-data-96194.xlsx
File: 49/73 dataset_training/eye-data-96679.xlsx
File: 50/73 dataset_training/eye-data-97448.xlsx
File: 51/73 dataset_training/eye-data-97973.xlsx
File: 52/73 dataset_testing/eye-data-11868.xlsx
File: 53/73 dataset_testing/eye-data-21182.xlsx
File: 54/73 dataset_testing/eye-data-22446.xlsx
File: 55/73 dataset_testing/eye-data-23921.xlsx
File: 56/73 dataset_testing/eye-data-38989.xlsx
File: 57/73 dataset_testing/eye-data-46094.xlsx
File: 58/73 dataset_testing/eye-data-54097.xlsx
File: 59/73 dataset_testing/eye-data-72799.xlsx
File: 60/73 dataset_testing/eye-data-75601.xlsx
File: 61/73 dataset_testing/eye-data-91260.xlsx
File: 62/73 dataset_testing/eye-data-97051.xlsx
File: 63/73 dataset_validation/eye-data-11085.xlsx
File: 64/73 dataset_validation/eye-data-14732.xlsx
File: 65/73 dataset_validation/eye-data-17381.xlsx
File: 66/73 dataset_validation/eye-data-19733.xlsx
File: 67/73 dataset_validation/eye-data-26585.xlsx
File: 68/73 dataset_validation/eye-data-29097.xlsx
File: 69/73 dataset_validation/eye-data-37883.xlsx
File: 70/73 dataset_validation/eye-data-39692.xlsx
File: 71/73 dataset_validation/eye-data-41473.xlsx
File: 72/73 dataset_validation/eye-data-51553.xlsx
File: 73/73 dataset_validation/eye-data-64087.xlsx

In [18]:
df_train.dtypes

Timestamp                          float64
Data_Gaze2D_X                      float64
Data_Gaze2D_Y                      float64
Data_Gaze3D_X                      float64
Data_Gaze3D_Y                      float64
Data_Gaze3D_Z                      float64
Data_Eyeleft_Gazeorigin_X          float64
Data_Eyeleft_Gazeorigin_Y          float64
Data_Eyeleft_Gazeorigin_Z          float64
Data_Eyeleft_Gazedirection_X       float64
Data_Eyeleft_Gazedirection_Y       float64
Data_Eyeleft_Gazedirection_Z       float64
Data_Eyeleft_Pupildiameter         float64
Data_Eyeright_Gazeorigin_X         float64
Data_Eyeright_Gazeorigin_Y         float64
Data_Eyeright_Gazeorigin_Z         float64
Data_Eyeright_Gazedirection_X      float64
Data_Eyeright_Gazedirection_Y      float64
Data_Eyeright_Gazedirection_Z      float64
Data_Eyeright_Pupildiameter        float64
Timestamp_2                        float64
Data_Gaze2D_X_2                    float64
Data_Gaze2D_Y_2                    float64
Data_Gaze3D

In [19]:
df_train.columns

Index(['Timestamp', 'Data_Gaze2D_X', 'Data_Gaze2D_Y', 'Data_Gaze3D_X',
       'Data_Gaze3D_Y', 'Data_Gaze3D_Z', 'Data_Eyeleft_Gazeorigin_X',
       'Data_Eyeleft_Gazeorigin_Y', 'Data_Eyeleft_Gazeorigin_Z',
       'Data_Eyeleft_Gazedirection_X', 'Data_Eyeleft_Gazedirection_Y',
       'Data_Eyeleft_Gazedirection_Z', 'Data_Eyeleft_Pupildiameter',
       'Data_Eyeright_Gazeorigin_X', 'Data_Eyeright_Gazeorigin_Y',
       'Data_Eyeright_Gazeorigin_Z', 'Data_Eyeright_Gazedirection_X',
       'Data_Eyeright_Gazedirection_Y', 'Data_Eyeright_Gazedirection_Z',
       'Data_Eyeright_Pupildiameter', 'Timestamp_2', 'Data_Gaze2D_X_2',
       'Data_Gaze2D_Y_2', 'Data_Gaze3D_X_2', 'Data_Gaze3D_Y_2',
       'Data_Gaze3D_Z_2', 'Data_Eyeleft_Gazeorigin_X_2',
       'Data_Eyeleft_Gazeorigin_Y_2', 'Data_Eyeleft_Gazeorigin_Z_2',
       'Data_Eyeleft_Gazedirection_X_2', 'Data_Eyeleft_Gazedirection_Y_2',
       'Data_Eyeleft_Gazedirection_Z_2', 'Data_Eyeleft_Pupildiameter_2',
       'Data_Eyeright_Gazeorigin_X

In [21]:
df_train.count()

Py4JJavaError: An error occurred while calling o33182.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 7.0 failed 1 times, most recent failure: Lost task 2.0 in stage 7.0 (TID 21) (ER1CHS-LAPT0P executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:612)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:594)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:789)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.EOFException
	at java.base/java.io.DataInputStream.readInt(DataInputStream.java:398)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
	... 25 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:612)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:594)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:789)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.EOFException
	at java.base/java.io.DataInputStream.readInt(DataInputStream.java:398)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
	... 25 more
