# **<font style = "color:rgb(58, 45, 247)">Data Processing for Model </font>**
This note describes how we process the data before training the model.

In [22]:
import tensorflow as tf
import pandas as pd

# Check whether TensorFlow can see a GPU
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
# Check the installed TensorFlow version
print("Current Tensorflow Version: " + tf.__version__)

Num GPUs Available:  0
Current Tensorflow Version: 2.16.1


In [4]:
DATAPATH = "../full_dataset"
data_walking_path = DATAPATH + "/2_walking.csv"
data_walking = pd.read_csv(data_walking_path)
# Display the first few rows
data_walking.head(15)

Unnamed: 0,Timestamp,Gazepoint X,Gazepoint Y,Pupil area (right) sq mm,Pupil area (left) sq mm,Eye event
0,1,0.4751,0.2892,0.38,0.35,S
1,1,0.4086,0.6167,0.18,0.32,S
2,1,0.3461,0.7543,0.47,0.47,S
3,1,0.4077,0.2892,0.48,0.46,S
4,1,0.3911,0.2945,0.46,0.43,S
5,2,0.3755,0.2836,0.16,0.28,S
6,2,0.5444,0.3246,0.35,0.41,S
7,2,0.5572,0.2428,0.35,0.38,BE
8,2,0.4598,0.3404,0.44,0.45,S
9,2,0.4217,0.4518,0.46,0.43,S


In [5]:
data_walking.describe()

Unnamed: 0,Timestamp,Gazepoint X,Gazepoint Y,Pupil area (right) sq mm,Pupil area (left) sq mm
count,54.0,54.0,54.0,54.0,54.0
mean,5.259259,0.484067,0.324465,0.278148,0.26963
std,2.526726,0.219625,0.085108,0.1049,0.116812
min,1.0,0.2362,0.2295,0.11,0.07
25%,3.0,0.3025,0.287475,0.21,0.16
50%,5.0,0.40815,0.3129,0.265,0.23
75%,7.75,0.61745,0.333675,0.365,0.3775
max,9.0,0.8983,0.7543,0.48,0.47


## **<font style = "color:rgb(58, 45, 247)"> Adding labels to dataset </font>**
To add an activity label to every recording, we'll run the code from `data_processor.py` to:

1. **Iterate through each CSV file in the `full_dataset` folder**.
2. **Determine the expected action** from each file’s name (e.g., `walking`, `reading`, `playing`).
3. **Map each action** to a numerical code: `walking -> 1`, `reading -> 2`, and `playing -> 3`.
4. **Add a new column** called `result` with the correct numeric code based on the filename.
5. **Save the updated file** back with the added column, and log each file processed.

Since we train with TensorFlow and Keras metrics expect integer labels, the activity label (`result`) should not be stored as a string (i.e., `walking`, `reading`, `playing`), so we convert it to an integer.
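One detail worth keeping in mind: Keras sparse categorical losses and metrics read integer labels as class indices starting at 0. The snippet below is a small illustration, assuming the model later uses a 3-class output with a sparse categorical loss (this note does not state the model setup). The array `result` is made-up sample data, not taken from the dataset.

```python
import numpy as np

# Made-up sample of the `result` column: walking=1, reading=2, playing=3
result = np.array([1, 2, 3, 1, 3])

# Shift to 0..2 so every label is a valid index into a 3-class output
zero_based = result - 1

num_classes = 3  # assumption: one output unit per activity
labels_are_valid = bool(zero_based.min() >= 0 and zero_based.max() < num_classes)
print(zero_based, labels_are_valid)
```

Alternatively, the 1-based codes can be kept if the output layer is given one extra, unused class.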

### <font style = "color:rgb(58, 45, 247)"> Explanation </font>

- **Logging**: This will keep track of each file processed, allowing you to see which files were updated.
- **Action Mapping**: The dictionary `action_map` ensures the correct numeric value is assigned based on the action type in the filename.
- **Error Handling**: If there’s an issue parsing the filename or another error arises, it’s logged for later review.
- **Output Directory Creation**: The line `os.makedirs(output_folder, exist_ok=True)` ensures `full_dataset_labelled` is created if it doesn’t already exist.
- **File Saving Path**: Each modified CSV file is saved in `full_dataset_labelled` rather than overwriting the original.
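
The steps and notes above can be sketched roughly as follows. This is a minimal sketch, not the actual contents of `data_processor.py`: the function names `action_from_filename` and `label_folder` are hypothetical, and the sketch assumes filenames of the form `<id>_<action>.csv`, as seen in the log output of this notebook. `action_map` and `output_folder` follow the names used in the explanation.

```python
import logging
import os

import pandas as pd

# Mapping described in step 3
action_map = {"walking": 1, "reading": 2, "playing": 3}


def action_from_filename(filename):
    """Return the action encoded in a name such as '224_walking.csv'."""
    stem = os.path.splitext(filename)[0]  # '224_walking'
    action = stem.split("_")[-1]          # 'walking'
    if action not in action_map:
        raise ValueError(f"Unknown action in filename: {filename}")
    return action


def label_folder(input_folder, output_folder):
    """Add a numeric `result` column to each CSV and save a labelled copy."""
    os.makedirs(output_folder, exist_ok=True)
    for filename in sorted(os.listdir(input_folder)):
        if not filename.endswith(".csv"):
            continue
        try:
            action = action_from_filename(filename)
            df = pd.read_csv(os.path.join(input_folder, filename))
            df["result"] = action_map[action]
            # Write to the output folder so the original file is untouched
            df.to_csv(os.path.join(output_folder, filename), index=False)
            logging.info("Processed file %s - Action: %s", filename, action)
        except Exception as exc:
            # Log the problem and continue with the next file
            logging.error("Failed to process %s: %s", filename, exc)
```

Under these assumptions, calling `logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")` followed by `label_folder("../full_dataset", "../full_dataset_labelled")` would log one timestamped line per file.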

You can either run the Python file from your console or uncomment the cell below to execute it from here. The process can take a few minutes, and the log file may take another 1-2 minutes to appear because the thread needs to flush the log.

In [10]:
"""
The following cell will run the script in `data_processor.py`
"""
# import subprocess
# import sys
# # Run the script and capture output in the notebook
# data_processor_result = subprocess.run(
#     [sys.executable, 'data_processor.py'],
#     stdout=subprocess.PIPE,
#     stderr=subprocess.PIPE,
#     text=True
# )
# 
# # Print both stdout and stderr
# print(data_processor_result.stdout)
# print(data_processor_result.stderr)

2024-11-02 18:50:01,476 - Processed file 224_walking.csv - Action: walking
2024-11-02 18:50:01,477 - Processed file 96_walking.csv - Action: walking
2024-11-02 18:50:01,478 - Processed file 876_walking.csv - Action: walking
2024-11-02 18:50:01,479 - Processed file 894_walking.csv - Action: walking
2024-11-02 18:50:01,479 - Processed file 98_walking.csv - Action: walking
2024-11-02 18:50:01,480 - Processed file 362_reading.csv - Action: reading
2024-11-02 18:50:01,481 - Processed file 976_reading.csv - Action: reading
2024-11-02 18:50:01,482 - Processed file 872_walking.csv - Action: walking
2024-11-02 18:50:01,483 - Processed file 749_playing.csv - Action: playing
2024-11-02 18:50:01,484 - Processed file 578_reading.csv - Action: reading
2024-11-02 18:50:01,485 - Processed file 345_playing.csv - Action: playing
2024-11-02 18:50:01,485 - Processed file 727_playing.csv - Action: playing
2024-11-02 18:50:01,486 - Processed file 547_reading.csv - Action: reading
2024-11-02 18:50:01,487 - P

## **<font style = "color:rgb(58, 45, 247)"> Combine all labeled files into 3 CSV files </font>**
To combine all the labeled CSV files in `full_dataset_labelled` into three separate files (`walking.csv`, `reading.csv`, and `playing.csv`), follow these steps:

1. **Iterate through the files in `full_dataset_labelled`** and categorize them based on the activity type (walking, reading, playing).
2. **Concatenate Data** from each category into a DataFrame for each activity.
3. **Remove Duplicate Headers**: Only include the header row once in the output files.
4. **Save the Combined Files** to the `full_dataset_combined` directory.
5. **Log the Process** for tracking.

### <font style = "color:rgb(58, 45, 247)"> Explanation </font>

- **Combine Data for Each Activity**: Each activity (`walking`, `reading`, and `playing`) has a separate DataFrame to accumulate rows from all corresponding files.
- **Avoid Duplicate Headers**: Because each file is read into a DataFrame before concatenation, the header row is written only once in each output file.
- **Logging**: Logs each file addition and the final save action to track progress.
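
The combination step can be sketched along these lines. This is a minimal sketch under the same filename assumption (`<id>_<action>.csv`), not the actual `data_concatenation.py`; the function name `combine_by_activity` and the constant `ACTIVITIES` are hypothetical.

```python
import logging
import os

import pandas as pd

ACTIVITIES = ("walking", "reading", "playing")


def combine_by_activity(input_folder, output_folder):
    """Merge all labelled CSVs into one CSV file per activity."""
    os.makedirs(output_folder, exist_ok=True)
    frames = {activity: [] for activity in ACTIVITIES}

    for filename in sorted(os.listdir(input_folder)):
        stem, ext = os.path.splitext(filename)
        activity = stem.split("_")[-1]
        if ext != ".csv" or activity not in frames:
            continue
        frames[activity].append(pd.read_csv(os.path.join(input_folder, filename)))
        logging.info("Added data from %s to %s.csv", filename, activity)

    for activity, parts in frames.items():
        if not parts:
            continue  # no files found for this activity
        # Concatenating parsed DataFrames yields a single header on save
        combined = pd.concat(parts, ignore_index=True)
        out_path = os.path.join(output_folder, f"{activity}.csv")
        combined.to_csv(out_path, index=False)
        logging.info("Saved %d rows to %s", len(combined), out_path)
```

Collecting the frames in a list and calling `pd.concat` once per activity avoids repeatedly copying a growing DataFrame inside the loop.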
  
You can run `data_concatenation.py` in Jupyter using the `subprocess.run` method, as shown in the cell below. This will create three combined CSV files in `full_dataset_combined`, with all data from the labeled files merged into their respective files.


In [12]:
"""
The following cell will run the script in `data_concatenation.py`
"""
# import subprocess
# import sys
# # Run the script and capture output in the notebook
# data_concat_result = subprocess.run(
#     [sys.executable, 'data_concatenation.py'],
#     stdout=subprocess.PIPE,
#     stderr=subprocess.PIPE,
#     text=True
# )
# 
# # Print both stdout and stderr
# print(data_concat_result.stdout)
# print(data_concat_result.stderr)

2024-11-02 22:18:32,210 - Added data from 224_walking.csv to walking.csv
2024-11-02 22:18:32,210 - Added data from 96_walking.csv to walking.csv
2024-11-02 22:18:32,211 - Added data from 876_walking.csv to walking.csv
2024-11-02 22:18:32,211 - Added data from 894_walking.csv to walking.csv
2024-11-02 22:18:32,212 - Added data from 98_walking.csv to walking.csv
2024-11-02 22:18:32,212 - Added data from 362_reading.csv to reading.csv
2024-11-02 22:18:32,213 - Added data from 976_reading.csv to reading.csv
2024-11-02 22:18:32,213 - Added data from 872_walking.csv to walking.csv
2024-11-02 22:18:32,214 - Added data from 749_playing.csv to playing.csv
2024-11-02 22:18:32,215 - Added data from 578_reading.csv to reading.csv
2024-11-02 22:18:32,215 - Added data from 345_playing.csv to playing.csv
2024-11-02 22:18:32,216 - Added data from 727_playing.csv to playing.csv
2024-11-02 22:18:32,216 - Added data from 547_reading.csv to reading.csv
2024-11-02 22:18:32,217 - Added data from 1033_readin

## **<font style = "color:rgb(58, 45, 247)"> Overview of the results </font>**

Now that the preprocessing is done, we check each of the three files.

In [13]:
FILE_PATH = "../full_dataset_combined"
walking_data_path = FILE_PATH + "/walking.csv"
reading_data_path = FILE_PATH + "/reading.csv"
playing_data_path = FILE_PATH + "/playing.csv"

walking_data = pd.read_csv(walking_data_path)
reading_data = pd.read_csv(reading_data_path)
playing_data = pd.read_csv(playing_data_path)

In [15]:
walking_data.head()

Unnamed: 0,Timestamp,Gazepoint X,Gazepoint Y,Pupil area (right) sq mm,Pupil area (left) sq mm,Eye event,result
0,1,0.5502,0.315,0.18,0.0,S,1
1,1,0.548,0.2929,0.18,0.0,FB,1
2,1,0.5473,0.2965,0.18,0.0,,1
3,1,0.5477,0.2965,0.17,0.0,,1
4,1,0.5737,0.3835,0.19,0.0,S,1


In [16]:
reading_data.head()

Unnamed: 0,Timestamp,Gazepoint X,Gazepoint Y,Pupil area (right) sq mm,Pupil area (left) sq mm,Eye event,result
0,1,0.3992,0.8388,0.52,0.42,FB,2
1,1,0.4026,0.8242,0.72,0.57,,2
2,1,0.4547,0.778,0.64,0.55,S,2
3,1,0.4586,0.7392,0.83,0.63,S,2
4,1,0.5456,0.7639,0.62,0.72,S,2


In [17]:
playing_data.head()

Unnamed: 0,Timestamp,Gazepoint X,Gazepoint Y,Pupil area (right) sq mm,Pupil area (left) sq mm,Eye event,result
0,1,0.4784,0.4913,0.41,0.44,,3
1,1,0.4841,0.51,0.41,0.44,FEx0.474y0.482d0.183,3
2,1,0.4018,0.5011,0.42,0.42,S,3
3,1,0.3985,0.4477,0.43,0.45,S,3
4,1,0.4067,0.4727,0.43,0.45,S,3


In [18]:
walking_data.describe()

Unnamed: 0,Timestamp,Gazepoint X,Gazepoint Y,Pupil area (right) sq mm,Pupil area (left) sq mm,result
count,51282.0,51282.0,51282.0,51282.0,51282.0,51282.0
mean,4.928532,0.491353,0.406102,0.249498,0.199427,1.0
std,2.579411,0.169161,0.129553,0.167004,0.14598,0.0
min,1.0,-0.1561,-0.3179,0.0,0.0,1.0
25%,3.0,0.3772,0.3242,0.14,0.13,1.0
50%,5.0,0.4867,0.3989,0.17,0.2,1.0
75%,7.0,0.5944,0.4759,0.3075,0.27,1.0
max,9.0,1.0828,1.303,1.0,1.0,1.0


In [19]:
reading_data.describe()

Unnamed: 0,Timestamp,Gazepoint X,Gazepoint Y,Pupil area (right) sq mm,Pupil area (left) sq mm,result
count,43833.0,43833.0,43833.0,43833.0,43833.0,43833.0
mean,4.869528,0.484123,0.528648,0.35819,0.350515,2.0
std,2.573818,0.107528,0.115623,0.145378,0.143969,0.0
min,1.0,0.0063,0.2784,0.0,0.0,2.0
25%,3.0,0.4,0.4375,0.25,0.26,2.0
50%,5.0,0.4763,0.4835,0.34,0.32,2.0
75%,7.0,0.5634,0.6309,0.45,0.39,2.0
max,9.0,1.12,1.2605,1.0,1.0,2.0


In [20]:
playing_data.describe()

Unnamed: 0,Timestamp,Gazepoint X,Gazepoint Y,Pupil area (right) sq mm,Pupil area (left) sq mm,result
count,36416.0,36416.0,36416.0,36416.0,36416.0,36416.0
mean,4.912703,0.505684,0.493492,0.395138,0.519348,3.0
std,2.580238,0.06516,0.150218,0.14101,0.112524,0.0
min,1.0,0.0,-0.5389,0.0,0.0,3.0
25%,3.0,0.4743,0.412,0.26,0.45,3.0
50%,5.0,0.5001,0.5038,0.44,0.51,3.0
75%,7.0,0.5309,0.5792,0.49,0.58,3.0
max,9.0,1.3063,1.3889,1.0,1.0,3.0
