# Data Processing for Model Training
This note describes how we process the data before training the model.

In [1]:
import tensorflow as tf
import pandas as pd

# Check whether TensorFlow can see a GPU
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

2024-11-02 18:43:43.871013: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-02 18:43:43.871356: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-11-02 18:43:43.873506: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-11-02 18:43:43.897874: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Num GPUs Available:  0


In [4]:
DATAPATH = "../full_dataset"
data_walking_path = DATAPATH + "/2_walking.csv"
data_walking = pd.read_csv(data_walking_path)
# Display the first few rows
data_walking.head(15)

Unnamed: 0,Timestamp,Gazepoint X,Gazepoint Y,Pupil area (right) sq mm,Pupil area (left) sq mm,Eye event
0,1,0.4751,0.2892,0.38,0.35,S
1,1,0.4086,0.6167,0.18,0.32,S
2,1,0.3461,0.7543,0.47,0.47,S
3,1,0.4077,0.2892,0.48,0.46,S
4,1,0.3911,0.2945,0.46,0.43,S
5,2,0.3755,0.2836,0.16,0.28,S
6,2,0.5444,0.3246,0.35,0.41,S
7,2,0.5572,0.2428,0.35,0.38,BE
8,2,0.4598,0.3404,0.44,0.45,S
9,2,0.4217,0.4518,0.46,0.43,S


In [5]:
data_walking.describe()

Unnamed: 0,Timestamp,Gazepoint X,Gazepoint Y,Pupil area (right) sq mm,Pupil area (left) sq mm
count,54.0,54.0,54.0,54.0,54.0
mean,5.259259,0.484067,0.324465,0.278148,0.26963
std,2.526726,0.219625,0.085108,0.1049,0.116812
min,1.0,0.2362,0.2295,0.11,0.07
25%,3.0,0.3025,0.287475,0.21,0.16
50%,5.0,0.40815,0.3129,0.265,0.23
75%,7.75,0.61745,0.333675,0.365,0.3775
max,9.0,0.8983,0.7543,0.48,0.47


## Adding Labels to dataset
To add the labels, we run `data_processor.py`, which does the following:

1. **Iterate through each CSV file in the `full_dataset` folder**.
2. **Determine the expected action** from each file’s name (e.g., `walking`, `reading`, `playing`).
3. **Map each action** to a numerical code: `walking -> 1`, `reading -> 2`, and `playing -> 3`.
4. **Add a new column** called `result` with the correct numeric code based on the filename.
5. **Save the updated file** with the added column to `full_dataset_labelled`, and log each file processed.
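Steps 2 and 3 can be sketched as a small pure function. This is a minimal illustration, assuming every file is named `<id>_<action>.csv` as in the log output further down; the names `ACTION_MAP` and `label_from_filename` are hypothetical and not taken from `data_processor.py`.

```python
from pathlib import Path

# Mapping from step 3: action name -> numeric code.
ACTION_MAP = {"walking": 1, "reading": 2, "playing": 3}


def label_from_filename(filename: str) -> int:
    """Return the numeric code for a file named like '<id>_<action>.csv'."""
    stem = Path(filename).stem            # '224_walking.csv' -> '224_walking'
    _, _, action = stem.rpartition("_")   # '224_walking' -> 'walking'
    if action not in ACTION_MAP:
        raise ValueError(f"Unrecognised action in filename: {filename!r}")
    return ACTION_MAP[action]


print(label_from_filename("224_walking.csv"))  # 1
print(label_from_filename("362_reading.csv"))  # 2
```

Raising on an unknown action, rather than returning a default code, keeps mislabelled files out of the training set.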

Because we train with TensorFlow and Keras metrics expect integer labels, the activity label (`result`) should not be stored as a string (`walking`, `reading`, `playing`), so we convert it to an integer.
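One detail worth checking at training time: Keras losses that take integer class labels, such as sparse categorical cross-entropy, expect indices from 0 to `num_classes - 1`, whereas the codes used here start at 1. A minimal sketch of shifting them, assuming a model with three output units; the helper `to_class_index` is hypothetical and this step is not part of `data_processor.py`.

```python
NUM_CLASSES = 3  # walking, reading, playing


def to_class_index(result_code: int) -> int:
    """Convert a 1-based `result` code (1..3) to a 0-based class index (0..2)."""
    if not 1 <= result_code <= NUM_CLASSES:
        raise ValueError(f"Unexpected result code: {result_code}")
    return result_code - 1


labels = [1, 2, 3, 1]
print([to_class_index(code) for code in labels])  # [0, 1, 2, 0]
```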

### Explanation

- **Logging**: This will keep track of each file processed, allowing you to see which files were updated.
- **Action Mapping**: The dictionary `action_map` ensures the correct numeric value is assigned based on the action type in the filename.
- **Error Handling**: If there’s an issue parsing the filename or another error arises, it’s logged for later review.
- **Output Directory Creation**: The line `os.makedirs(output_folder, exist_ok=True)` ensures `full_dataset_labelled` is created if it doesn’t already exist.
- **File Saving Path**: Each modified CSV file is saved in `full_dataset_labelled` rather than overwriting the original.
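The points above could fit together roughly as follows. This is a minimal sketch, not the contents of `data_processor.py`: the function name `label_folder` and its parameters are hypothetical, and it uses the standard-library `csv` module to stay dependency-free, whereas the real script may well use pandas.

```python
import csv
import logging
import os

# Timestamped messages, similar in shape to the log output shown below.
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")

action_map = {"walking": 1, "reading": 2, "playing": 3}


def label_folder(input_folder: str, output_folder: str) -> int:
    """Write a labelled copy of each CSV to output_folder; return the file count."""
    os.makedirs(output_folder, exist_ok=True)  # create the output folder if missing
    processed = 0
    for name in sorted(os.listdir(input_folder)):
        if not name.endswith(".csv"):
            continue
        try:
            # '224_walking.csv' -> 'walking'
            action = os.path.splitext(name)[0].rsplit("_", 1)[-1]
            code = action_map[action]  # KeyError for an unknown action
            with open(os.path.join(input_folder, name), newline="") as src:
                rows = list(csv.reader(src))
            if not rows:
                raise ValueError("empty file")
            # Append the header name, then the same code to every data row.
            labelled = [rows[0] + ["result"]]
            labelled += [row + [str(code)] for row in rows[1:]]
            with open(os.path.join(output_folder, name), "w", newline="") as dst:
                csv.writer(dst).writerows(labelled)
            logging.info("Processed file %s - Action: %s", name, action)
            processed += 1
        except (KeyError, ValueError, OSError) as exc:
            # Log the problem and continue with the remaining files.
            logging.error("Skipped file %s: %r", name, exc)
    return processed


# Example call (paths are assumptions based on this notebook):
# label_folder("../full_dataset", "../full_dataset_labelled")
```

Because the originals are never modified, the function can be rerun safely after fixing any file that was skipped.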

The process can take a few minutes.

In [8]:
import sys
print(sys.executable)

/home/vy/LUT/SeeTrue-AI/seetrue-venv/bin/python


In [10]:
import subprocess
import sys
# Run the script and capture output in the notebook
result = subprocess.run(
    [sys.executable, 'data_processor.py'],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True
)

# Print both stdout and stderr
print(result.stdout)
print(result.stderr)

2024-11-02 18:50:01,476 - Processed file 224_walking.csv - Action: walking
2024-11-02 18:50:01,477 - Processed file 96_walking.csv - Action: walking
2024-11-02 18:50:01,478 - Processed file 876_walking.csv - Action: walking
2024-11-02 18:50:01,479 - Processed file 894_walking.csv - Action: walking
2024-11-02 18:50:01,479 - Processed file 98_walking.csv - Action: walking
2024-11-02 18:50:01,480 - Processed file 362_reading.csv - Action: reading
2024-11-02 18:50:01,481 - Processed file 976_reading.csv - Action: reading
2024-11-02 18:50:01,482 - Processed file 872_walking.csv - Action: walking
2024-11-02 18:50:01,483 - Processed file 749_playing.csv - Action: playing
2024-11-02 18:50:01,484 - Processed file 578_reading.csv - Action: reading
2024-11-02 18:50:01,485 - Processed file 345_playing.csv - Action: playing
2024-11-02 18:50:01,485 - Processed file 727_playing.csv - Action: playing
2024-11-02 18:50:01,486 - Processed file 547_reading.csv - Action: reading
2024-11-02 18:50:01,487 - P