This short notebook converts the original mouse data into the conventional (t, x, y) format.

Originally, the mouse data was recorded to 16 text files, one per participant. Each file consists of a header line and 280 data lines, one per trial. Within each line, the variables are tab-separated.

Mouse coordinates were supposed to be stored in the column _PositionList_ (9th column). However, it seems there was an issue with the data recording software: instead of storing each mouse trajectory in its own line, the software accumulated the mouse data from trial to trial, so that column 9 of the last data line (trial 280) contains __all the mouse trajectories of a participant__.
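
To make the accumulation concrete, here is a small sketch on made-up data. The exact serialization of _PositionList_ is not shown in this notebook, so the toy field below (a quoted, Python-style list of (t, x, y) tuples in the 9th column) is an assumption, as are the helper name `parse_positions` and the sample values. If the column is cumulative, the number of samples per line never decreases and each line's list begins with the previous line's list.

```python
import ast

# Toy stand-in for three data lines: 8 dummy columns, then a cumulative PositionList.
# The quoted-list format is an assumption made for illustration only.
positions = [
    [(0, 10, 20), (1, 11, 21)],
    [(0, 10, 20), (1, 11, 21), (0, 50, 60)],
    [(0, 10, 20), (1, 11, 21), (0, 50, 60), (0, 70, 80), (1, 71, 81)],
]
toy_lines = ['\t'.join(['x'] * 8 + [f'"{p}"']) for p in positions]

def parse_positions(line):
    # take column 9 (index 8) and strip its first and last characters (the quotes)
    return ast.literal_eval(line.split('\t')[8][1:-1])

parsed = [parse_positions(line) for line in toy_lines]

# number of samples stored in each line
counts = [len(p) for p in parsed]

# each line's list should start with the full list of the previous line
is_cumulative = all(cur[:len(prev)] == prev for prev, cur in zip(parsed, parsed[1:]))

print(counts)         # [2, 3, 5]
print(is_cumulative)  # True
```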

To make the data usable, we process the original data file by file. The function `process_all_mouse_data` loops over the individual files and calls `process_participant_data` on each of them. The processed data is then written back to individual files, one per participant (later we'll merge them into a single data file for convenience).

To process each file, we read it line by line and extract from each line only the data that does not overlap with what was extracted before. Most of the work is done by this expression:

```python
eval(line.split('\t')[8][1:-1])[previous_trajs_len:]
```
We then add the number of samples in the trajectory just processed to a running total, so that these samples are skipped when slicing the next lines.

```python
previous_trajs_len += len(trajectory)
```

Finally, we add the participant and trial number to each trajectory and, once all trajectories are extracted, concatenate them into a single dataframe.
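
The steps above can be sketched end to end on toy data. This is a minimal illustration rather than the notebook's actual code: it uses `ast.literal_eval` in place of `eval` (which assumes the field contains only plain Python literals), and the line format, the function name `extract_trials`, and the sample values are all made up.

```python
import ast
import pandas as pd

def extract_trials(lines, participant):
    trajectories = []
    previous_trajs_len = 0  # samples already assigned to earlier trials
    for trial_no, line in enumerate(lines, start=1):
        # column 9 holds every sample recorded so far (assumed quoted list)
        all_samples = ast.literal_eval(line.split('\t')[8][1:-1])
        # keep only the samples that were not seen in previous lines
        new_samples = all_samples[previous_trajs_len:]
        previous_trajs_len += len(new_samples)
        trajectory = pd.DataFrame(new_samples, columns=['t', 'x', 'y'])
        trajectory['participant'] = participant
        trajectory['trial_no'] = trial_no
        trajectories.append(trajectory)
    return pd.concat(trajectories, ignore_index=True)

# Toy cumulative data: trial 1 has 2 samples, trial 2 has 1, trial 3 has 2
toy_positions = [
    [(0, 10, 20), (1, 11, 21)],
    [(0, 10, 20), (1, 11, 21), (0, 50, 60)],
    [(0, 10, 20), (1, 11, 21), (0, 50, 60), (0, 70, 80), (1, 71, 81)],
]
toy_lines = ['\t'.join(['x'] * 8 + [f'"{p}"']) for p in toy_positions]

toy_df = extract_trials(toy_lines, participant=1)
print(toy_df.groupby('trial_no').size().tolist())  # [2, 1, 2]
```

Note that every line still has to be parsed in full before slicing, which is why the real files take a while to process.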

This all takes a while, because we still need to read every single line of the sixteen original data files (about 500 MB each). However, we only need to do this once, as the processed data is saved to CSV at the end.

In [None]:
import numpy as np
import pandas as pd
import os
import re

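def process_participant_data(data, participant):
    # data: list of raw data lines (header excluded), one line per trial
    trajectories = []
    # number of samples already extracted from previous lines
    previous_trajs_len = 0

    for j, line in enumerate(data):
        trial_no = j+1
        # column 9 (PositionList) holds all samples recorded so far: strip its first
        # and last characters, evaluate it, and keep only the samples not seen before
        samples = eval(line.split('\t')[8][1:-1])[previous_trajs_len:]
        trajectory = pd.DataFrame(samples, columns=['t', 'x', 'y'])
        previous_trajs_len += len(trajectory)
        trajectory['participant'] = participant
        trajectory['trial_no'] = trial_no
        trajectories.append(trajectory)

    return pd.concat(trajectories)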

def process_all_mouse_data(input_data_path, processed_data_path):
    # create the output directory if it does not exist yet
    os.makedirs(processed_data_path, exist_ok=True)
    
    for file_name in os.listdir(input_data_path):
        participant = int(re.findall(r'\d+', file_name)[0])
        print('Processing participant %i...' % (participant))
        file_path = os.path.join(input_data_path, file_name)

        with open(file_path) as f:
            # skip the header line
            f.readline()
            data = f.readlines()
        data = process_participant_data(data, participant)
        # reorder the columns so that participant and trial number come first
        data = data.loc[:, ['participant', 'trial_no', 't', 'x', 'y']]
        data.to_csv(os.path.join(processed_data_path, 'P%i_tftmouse.csv' % (participant)), index=False)
        print('Processing finished!')

input_data_path = '../../data/TfT/MouseData/'
processed_data_path = '../../data/TfT/MouseData_processed/'

process_all_mouse_data(input_data_path, processed_data_path)
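
Once the per-participant files exist, merging them into one data file could look roughly like the sketch below. The helper name `merge_processed_files` is hypothetical; the sketch only assumes the `P<participant>_tftmouse.csv` naming used above, and it is demonstrated on a temporary directory with toy files rather than on the real data.

```python
import glob
import os
import tempfile
import pandas as pd

def merge_processed_files(processed_data_path):
    # collect all per-participant csv files that follow the naming scheme above
    pattern = os.path.join(processed_data_path, 'P*_tftmouse.csv')
    frames = [pd.read_csv(path) for path in sorted(glob.glob(pattern))]
    return pd.concat(frames, ignore_index=True)

# Demo on two toy files written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    for participant in (1, 2):
        toy = pd.DataFrame({'participant': participant,
                            'trial_no': [1, 1, 2],
                            't': [0, 1, 0],
                            'x': [10, 11, 50],
                            'y': [20, 21, 60]})
        toy.to_csv(os.path.join(tmp_dir, 'P%i_tftmouse.csv' % participant), index=False)
    merged = merge_processed_files(tmp_dir)

print(merged.shape)  # (6, 5)
```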