# Preprocessing Reading Time Corpora

This notebook reads each reading-time corpus and generates ```.tsv``` files containing the data we need, i.e.:

- ```Identifier```: a string containing the text ID and the token position in the text. The IDs may or may not match the original ones: they are created in an arbitrary order here, following our standard naming, and can be mapped back to the originals using the ```.json``` maps (see below).
- ```Token```: a string containing the token as shown to the subjects (e.g. punctuation is attached to the neighbouring token). For some datasets we need some acrobatics to infer what exactly was shown on the screen. In general, it should be the text contained in the Interest Area column. When the raw texts are available, we check that the token order matches the texts.
- ```<Measure>:Subj_i```: the measure for subject $i$ at each token. E.g. for first-pass regression, a binary variable which is 1 if subject $i$ initiated a regression at the current token. One such column is created for each subject $i$.

For standardization, we refer to subjects and texts by integers and save the maps between the original ids and these integers. Text ids, subject ids and word positions all start from 0. They can be mapped back to the originals using the ```.json``` maps in the directories ```preprocessed/maps/``` and ```preprocessed/texts/```.
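The naming convention and the way back to the original ids can be sketched as follows. This is only an illustration: `parse_identifier`, `invert_map` and the example subject map are hypothetical names and values, and the two format strings are assumed to mirror the identifier and column conventions described above.

```python
import json

# Format strings assumed to match the conventions described above.
TOKEN_ID = 'text_{}_token_{}'   # text ID and 0-based token position
VAR_SUBJ = '{}:Subj_{}'         # measure short name and subject integer


def parse_identifier(identifier):
    """Split an identifier such as 'text_3_token_12' into (3, 12)."""
    _, text_id, _, position = identifier.split('_')
    return int(text_id), int(position)


def invert_map(original_to_int):
    """Turn an {original id: integer} map into {integer: original id}."""
    return {new: original for original, new in original_to_int.items()}


# Hypothetical subject map, as it could look after a JSON round trip.
subject_map = json.loads(json.dumps({'C01': 0, 'C02': 1}))
int_to_subject = invert_map(subject_map)

identifier = TOKEN_ID.format(3, 12)
column = VAR_SUBJ.format('fpregout', 1)
text_id, position = parse_identifier(identifier)
subject_int = int(column.split('Subj_')[1])
original_subject = int_to_subject[subject_int]
```

Note that JSON object keys are always strings, so any integer keys (for instance token positions) come back as strings after loading and need an `int(...)` conversion.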

### Decisions

- We extract first-pass regressions (i.e. we do not consider regressions that occur in subsequent passes) and the number of fixations.
- For first-pass regressions, the value is ```0.0``` if no regression was initiated at that token in the first pass. If there is a regression from that token, the value is ```1.0```. For tokens skipped in the first-pass reading, we use ```-1.0```.
- Missing data for a subject are filled with ```NaN```.
- Some clarification requests were sent to the authors of Provo, Rastros, Nicenboim. See meta.md for the replies.

Tokens can be:
1. Never fixated (therefore no regressions)
2. Skipped at first pass but fixated later (regressions can happen, but we don't care for them here)
3. Fixated at first pass (and thus regress / not regress)

For us, cases 1 and 2 are the same (the word was skipped at first pass). We need to extract that from each data format. If the word was skipped, we use the skip label. Otherwise, we use the first-pass regression label.
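The labelling rule above can be summarised in a small function. This is a minimal sketch, not the code used further down: `first_pass_label` and its boolean arguments are hypothetical, and the label values are assumed to be the ones listed under Decisions.

```python
import math

SKIP_LABEL = -1.0         # word skipped in first-pass reading (cases 1 and 2)
MISSING_LABEL = math.nan  # no data recorded for this subject and token


def first_pass_label(has_data, skipped_first_pass, regressed):
    """Return the label for one subject and one token."""
    if not has_data:
        return MISSING_LABEL
    if skipped_first_pass:
        # later regressions from this token are deliberately ignored
        return SKIP_LABEL
    return 1.0 if regressed else 0.0
```

The skip check comes before the regression check on purpose: a token skipped at first pass gets the skip label even if a regression was recorded from it in a later pass.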

### Generated Files

Running this notebook will create the following outputs:

- ```preprocessed/human_data/```: one ```.tsv``` file for each measure and each dataset.
- ```preprocessed/maps/```: two ```JSON``` files for each dataset. ```*_subjects``` maps original subject ids to integers and ```*_texts``` maps original text ids to integers.
- ```preprocessed/texts/```: a ```JSON``` file for each dataset. One key for each text id (our integers). The values are dictionaries mapping token positions (from 0 to text length - 1) to tokens.
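As an illustration of how a generated ```.tsv``` file could be consumed, here is a minimal sketch using only the standard library. The sample rows are made up, `read_measures` is a hypothetical helper, and the layout (an unnamed index column first, empty cells for missing data) is assumed from the default behaviour of `DataFrame.to_csv`.

```python
import csv
import io
import math

# Hypothetical excerpt in the layout of the generated .tsv files: an unnamed
# index column, Identifier, Token, then one column per subject. Empty cells
# stand for missing data (NaN).
sample = (
    "\tIdentifier\tToken\tfpregout:Subj_0\tfpregout:Subj_1\n"
    "0\ttext_0_token_0\tMudanças\t0.0\t\n"
    "1\ttext_0_token_1\tclimáticas\t-1.0\t1.0\n"
)


def read_measures(handle):
    """Return {identifier: {subject column: float}} from a .tsv handle."""
    reader = csv.DictReader(handle, delimiter='\t')
    measures = {}
    for row in reader:
        measures[row['Identifier']] = {
            column: float(value) if value != '' else math.nan
            for column, value in row.items()
            if 'Subj_' in column
        }
    return measures


measures = read_measures(io.StringIO(sample))
```

With a real file, the `io.StringIO(sample)` handle would be replaced by an open file handle with `newline=''` and a UTF-8 encoding.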

In [1]:
import os
from collections import Counter, namedtuple
from pathlib import Path

import csv
import json
import numpy as np
import pandas as pd
import pyreadr
import rdata
import seaborn as sns

sns.set_theme()
pd.set_option('display.max_rows', 50)

### Auxiliary functions and constants

In [2]:
PATH_TO_TEXTS = Path('preprocessed/texts/')
PATH_TO_HUMAN_DATA = Path('preprocessed/human_data/')
PATH_TO_MAPS = Path('preprocessed/maps/')

# columns and naming standards
IDENTIFIER = 'Identifier'
TOKEN = 'Token'
VAR_SUBJ = '{}:Subj_{}'        # measure name and subject ID
TOKEN_ID = 'text_{}_token_{}'  # text ID and token position in text

Decide what value to use for words that were skipped and for really missing data:

In [3]:
SKIP_LABEL = -1.0
MISSING_LABEL = np.nan

We'll standardise the sometimes differing measure names as follows:

In [4]:
Measure = namedtuple('Measure', ['name', 'shortform'])

# Was there a regression initiating at a token in the first pass?
FPREGOUT = Measure('first-pass-regression-out', 'fpregout')  # binary, or categorical if skips are identified

In [5]:
def create_dataframe(varname, measures, identifiers, ordered_subjects):
    """Return a dataframe in the standard format, given the data dictionaries."""
    columns = [IDENTIFIER, TOKEN] + [VAR_SUBJ.format(varname.shortform, subject) for subject in ordered_subjects]

    preprocessed = pd.DataFrame(columns=columns)
    for identifier, word in identifiers.items():
        measure_by_subj = [measures[identifier][subject] for subject in ordered_subjects]
        new_row = [identifier, word] + measure_by_subj
        n = len(preprocessed)
        preprocessed.loc[n] = new_row

    return preprocessed


def save_preprocessed(corpus_name, varname, preproc_data):
    """Save dataframe as a .tsv file."""
    preproc_data.to_csv(PATH_TO_HUMAN_DATA / f'{corpus_name}_{varname.name}.tsv', sep='\t')  


def save_meta(corpus_name, texts, text_ids, subject_ids):
    """Save the text id, subject id and texts as JSON files."""
    with open(PATH_TO_TEXTS / f'{corpus_name}.json', 'w') as file:
        json.dump(texts, file)

    with open(PATH_TO_MAPS / f'{corpus_name}_subjects.json', 'w') as file:
        json.dump(subject_ids, file)

    with open(PATH_TO_MAPS / f'{corpus_name}_texts.json', 'w') as file:
        json.dump(text_ids, file)

## RastrOS

Preprocessing the Brazilian Portuguese RastrOS corpus. The details have been published in a master's thesis ([Vieira, 2020](https://repositorio.ufc.br/handle/riufc/55798)) and also in [Leal et al. (2022)](https://doi.org/10.1007/s10579-022-09609-0).

- Eye Link 1000 Hz (SR Research)
- 37 participants, all read all paragraphs
- 50 paragraphs
- 120 sentences
- 2494 words; 2831 tokens (with punctuation)
- journalistic, literary, popular science
- Participants read paragraphs, one by one in a random order. 

We follow the documentation in Table 5 of Leal et al. (2022). For each subject, we retrieve the ```Word_Unique_ID```, which conveniently encodes the text/paragraph ID and the word position in the text. We also get the word as it was shown on the screen from ```IA_LABEL``` (or ```Word```, see discussion below) and the measures. We use the ```IA_REGRESSION_OUT``` measure in the file ```RastrOS_Corpus_Eyetracking_Data.tsv```, which is documented as:

> Whether regression(s) was made from the current interest area to earlier interest areas (e.g., previous parts of the sentence) prior to leaving that interest area in a forward direction. 1 if a saccade exits the current interest area to a lower IA_ID (to the left in English) before a later interest area was fixated; 0 if not.

We need the help of the variable ```IA_SKIP```:

> An interest area is considered skipped (i.e.,IA_SKIP = 1) if no fixation occurred in first-pass reading.

In [6]:
RASTROS_NAME = 'rastros_ptbr'
PATH_TO_RASTROS = Path('data/RastrOS/osfstorage/')

rastros_measures = {
    'IA_REGRESSION_OUT': FPREGOUT
}

In [7]:
# we need the quoting none because of one field in which the word begins with "
rastros_raw = pd.read_csv(PATH_TO_RASTROS / 'RastrOS_Corpus_Eyetracking_Data.tsv', 
                          sep='\t', low_memory=False, quoting=csv.QUOTE_NONE)

In [8]:
rastros_raw

| | RECORDING_SESSION_LABEL | Word_Unique_ID | Text_ID | Genre | Word_Number | Sentence_Number | Word_In_Sentence_Number | Word_Place_In_Sent | Word | Word_Cleaned | ... | IA_REGRESSION_IN_COUNT | IA_REGRESSION_OUT | IA_REGRESSION_OUT_COUNT | IA_REGRESSION_OUT_FULL | IA_REGRESSION_OUT_FULL_COUNT | IA_REGRESSION_PATH_DURATION | IA_FIRST_SACCADE_AMPLITUDE | IA_FIRST_SACCADE_ANGLE | IA_FIRST_SACCADE_START_TIME | IA_FIRST_SACCADE_END_TIME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C01 | UID_13_1 | 13 | DC | 1 | 1 | 1 | 1 | Mudanças | mudanças | ... | 1 | 0 | 0 | 0 | 0 | 157 | . | . | . | . |
| 1 | C01 | UID_13_2 | 13 | DC | 2 | 1 | 2 | 1 | climáticas | climáticas | ... | 0 | 0 | 0 | 0 | 0 | 306 | 4.14 | 2.20 | 224 | 270 |
| 2 | C01 | UID_13_3 | 13 | DC | 3 | 1 | 3 | 1 | estão | estão | ... | 1 | 0 | 0 | 0 | 0 | 122 | 1.41 | -177.52 | 850 | 878 |
| 3 | C01 | UID_13_4 | 13 | DC | 4 | 1 | 4 | 1 | aquecendo | aquecendo | ... | 0 | 1 | 1 | 1 | 1 | 319 | 5.84 | -0.22 | 613 | 652 |
| 4 | C01 | UID_13_5 | 13 | DC | 5 | 1 | 5 | 2 | o | o | ... | . | . | . | . | . | . | . | . | . | . |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 87878 | I5 | UID_39_45 | 39 | JN | 45 | 3 | 7 | 3 | quilômetros | quilômetros | ... | 0 | 0 | 0 | 0 | 0 | 308 | 4.09 | 3.93 | 11194 | 11229 |
| 87879 | I5 | UID_39_46 | 39 | JN | 46 | 3 | 8 | 3 | e | e | ... | . | . | . | . | . | . | . | . | . | . |
| 87880 | I5 | UID_39_47 | 39 | JN | 47 | 3 | 9 | 4 | vários | vários | ... | 0 | 0 | 0 | 0 | 0 | 293 | 5.23 | 4.99 | 11538 | 11580 |
| 87881 | I5 | UID_39_48 | 39 | JN | 48 | 3 | 10 | 4 | trechos | trechos | ... | 0 | 0 | 0 | 0 | 0 | 341 | 3.87 | 1.41 | 11874 | 11913 |


Inspect the values used for each variable:

In [9]:
for column, measure_type in rastros_measures.items():
    print(f'\n{measure_type.name}')
    print(Counter(rastros_raw[column]))
    print('NaNs: ', rastros_raw[column].isna().sum())

print('\nfirst pass skips')
print(Counter(rastros_raw['IA_SKIP']))
print('NaNs: ', rastros_raw['IA_SKIP'].isna().sum())


assert set(rastros_raw['IA_REGRESSION_OUT'].unique()) == set(['0', '1', '.'])
assert set(rastros_raw['IA_SKIP'].unique()) == set([0, 1])


first-pass-regression-out
Counter({'0': 58254, '.': 17715, '1': 11914})
NaNs:  0

first pass skips
Counter({0: 59061, 1: 28822})
NaNs:  0


None of the measures has ```NaN```s, but the column we are most interested in (```IA_REGRESSION_OUT```) sometimes contains a ```.```. Let's investigate why.

In [10]:
for index, row in rastros_raw.iterrows():
    regression_label = row['IA_REGRESSION_OUT']
    assert regression_label in ('0', '1', '.')
    if regression_label in ('0', '1'):
        if row['IA_SKIP'] == 1.:
            # tokens skipped at first pass but fixated later 
            assert row['IA_FIXATION_COUNT'] > 0
            # they use regression 0 for these cases
            assert regression_label == '0'
        # if there is a label, the token was fixated at least once
        assert row['IA_DWELL_TIME'] > 0.
    if regression_label == '.':
        # the dots imply skipped at first pass and also skipped altogether
        assert row['IA_SKIP'] == 1.
        assert row['IA_DWELL_TIME'] == 0.
        assert row['IA_FIXATION_COUNT'] == 0.

All dots correspond to skips, but not all skips correspond to dots. When we inspect the dwell time, we see that dots occur only in tokens whose total dwell time is $0$. So if a word was never fixated, ```IA_REGRESSION_OUT``` contains a dot. If it was fixated, it contains either $0$ or $1$, but it may also be $0$ when it has been skipped at the first pass (which is what ```IA_SKIP``` means.)
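Under the findings above, each raw row falls into one of the three token cases listed in the Decisions section. The following sketch makes that mapping explicit; `rastros_case` is a hypothetical helper and takes the raw string from ```IA_REGRESSION_OUT``` and the integer from ```IA_SKIP```.

```python
def rastros_case(regression_out, ia_skip):
    """Classify a raw RastrOS row into one of the three token cases.

    regression_out is the raw string ('0', '1' or '.'); ia_skip is 0 or 1.
    """
    if regression_out == '.':
        # a dot only occurs for tokens with no fixation at all
        return 'never fixated'
    if ia_skip == 1:
        # fixated at some point, but not during first-pass reading
        return 'skipped at first pass, fixated later'
    return 'fixated at first pass'
```

Since the first two cases are treated identically in this notebook, checking ```IA_SKIP``` alone is enough to decide whether the skip label applies.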

Create mappings from subject ids and texts ids to integers, which will be used to create our identifiers.

In [11]:
rastros_subject_ids = {subj_id: i for i, subj_id in enumerate(rastros_raw['RECORDING_SESSION_LABEL'].unique())}
rastros_text_ids = {int(text_id): i for i, text_id in enumerate(rastros_raw['Text_ID'].unique())}

First we create a list of token ids and their corresponding tokens, making sure that it is consistent across subjects. We also check whether the ```Word``` column contains the same token as ```IA_LABEL```, which is the string actually shown on the screen.

For some reason, the row UID_8_9 has a wrong label for the text id. So we'll split the Word_Unique_ID to retrieve the text and token instead of using the columns Text_ID and Word_Number.
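Splitting the unique id can be sketched as below. `split_unique_id` is a hypothetical helper, and the `UID_<text>_<word number>` layout is assumed from the values visible in the raw table.

```python
def split_unique_id(word_unique_id):
    """Return (original text id, 0-based token position) from e.g. 'UID_13_1'."""
    prefix, text_id, token_number = word_unique_id.split('_')
    if prefix != 'UID':
        raise ValueError(f'unexpected id format: {word_unique_id!r}')
    # word numbers in the raw file start at 1; our positions start at 0
    return int(text_id), int(token_number) - 1
```

For example, `split_unique_id('UID_13_1')` yields the original text id 13 and token position 0; the text id still has to go through the text-id map to obtain our internal integer.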

In [12]:
def rastros_clean(word):
    """Manually clean up some encoding issues that could not be solved."""
    cleaned = word.replace('\x96', '--').replace('\x94', '”').replace('\x93', '“').replace('\x97', '--').replace('\xa0', '')
    return cleaned

rastros_identifiers = {}
inconsistent_words = {}

for index, row in rastros_raw.iterrows():
    _, orig_text_id, orig_token_number = row['Word_Unique_ID'].split('_')
    text_id = rastros_text_ids[int(orig_text_id)]
    token_position = int(orig_token_number) - 1
    identifier = TOKEN_ID.format(text_id, token_position)

    word = row['Word']
    # Deciding which word column to use
    if row['Word'] != row['IA_LABEL']:
        if row['Word'] == row['IA_LABEL'].replace('.', ','):
            # cases where commas have been replaced by full stops in the IA_LABEL
            # the author said it was a postprocessing mistake
            # in such cases, we stick to row['Word'] containing the comma
            pass  
        elif row['Word'] == row['IA_LABEL'].replace('\x92', "’").replace('\x94', '”').replace('\x93', '“').replace('.', ','):
            # these cases seem to be an encoding issue in the file that we
            # could not solve using pandas
            # in these cases, we'll also use row['Word'] containing the correct 
            # quotes (assuming that the humans saw the correct text)
            pass
        else:
            # for other inconsistent cases (35 when we counted), we'll stick to IA_LABEL instead
            # with some manual cleaning of encoding issues
            word = rastros_clean(row['IA_LABEL'])
            inconsistent_words[index] = (row['Word'], row['IA_LABEL'], word)
    
    if identifier not in rastros_identifiers:
        rastros_identifiers[identifier] = word
    # check that identifiers across different subjects always map to the same word
    assert rastros_identifiers[identifier] == word

assert len(rastros_identifiers) == 2494  # total number of tokens from the paper

Some rows have mismatches between 'Word' and 'IA_LABEL', mostly because IA_LABEL seems to replace commas with full stops, or due to encoding issues that I could not fix while reading the file. We emailed the authors to ask what exactly the subjects saw with respect to the commas. They said that the commas may have been replaced by full stops in a post-processing step.

So our approach is: if 'Word' and 'IA_LABEL' differ only because of swapped commas or encoding issues, we stick to 'Word'. Otherwise, we use 'IA_LABEL', manually cleaning/replacing the \x symbols. This is only necessary for 33 instances, so probably not a big deal, but should be mentioned in the report. The authors said that IA_LABEL is what was shown in the eye-tracking experiment, and Word was used for the cloze task experiment.

In [13]:
n_mismatches = len(set(inconsistent_words.values()))
print(f'There are {n_mismatches} tokens with remaining mismatches!')

There are 33 tokens with remaining mismatches!


The following subjects have missing data; their missing values will be filled with NaN.

In [14]:
for s_id in rastros_subject_ids:
    n_obs = rastros_raw[rastros_raw['RECORDING_SESSION_LABEL'] == s_id].shape[0]
    if n_obs != len(rastros_identifiers):
        print(s_id)

C01
C03
C04
C06
C15
E02
E05
E07
E12
I06
I09
I16
I17
I19
I21


Build the texts. Although, in this corpus, the Word_Unique_ID already conveniently encodes the text id and the word position in the text, we reconstruct the texts from our internal integers instead, for consistency with the other corpora. We check that the resulting texts contain all indexes from 0 to the text length.

In [15]:
rastros_texts = {i: {} for i in rastros_text_ids.values()}

for identifier, word in rastros_identifiers.items():
    _, text_id, _, word_position = identifier.split('_')
    assert int(word_position) not in rastros_texts[int(text_id)]
    rastros_texts[int(text_id)][int(word_position)] = word

for token_ids in rastros_texts.values():
    assert set(token_ids.keys()) == set(range(len(token_ids)))

In [16]:
# fix an issue with the open quotation mark in the first position
# otherwise it leads to an error in the adjusted outputs for the models
rastros_texts[27][0] = rastros_texts[27][0].replace('"', '“')

Get the measure we want. We initialise the measures dictionary with ```MISSING_LABEL``` for all identifiers and all subjects, so any missing data (subjects that do not have data for all texts) will end up with the same value.

In [17]:
def get_rastros_measures(measure, subjects, identifiers, data, replacer):
    """Create dictionary with measure for each subject in RastrOS data."""
    # initialize the dictionary with empty values, so that all identifiers have
    # all subject keys, even when no data was collected for them
    measures = {identifier: {subject: MISSING_LABEL for subject in subjects.values()} for identifier in identifiers.keys()}

    for index, row in data.iterrows():
        _, orig_text_id, orig_token_number = row['Word_Unique_ID'].split('_')
        text_id = rastros_text_ids[int(orig_text_id)]
        token_position = int(orig_token_number) - 1
        identifier = TOKEN_ID.format(text_id, token_position)
        # sanity check that the word or ia column matches
        assert (identifiers[identifier] == row['Word'] 
                or identifiers[identifier] == row['IA_LABEL'] 
                or index in inconsistent_words)

        orig_subject = row['RECORDING_SESSION_LABEL']
        subject = subjects[orig_subject]
        
        measure_value = row[measure]
        was_skipped = row['IA_SKIP']
        # no NaNs in this corpus
        assert not pd.isna(measure_value)
        assert was_skipped in (0., 1.)

        # skipped words get replaced with chosen label
        if was_skipped == 1. and replacer is not None:
            measure_value = replacer
        else:
            measure_value = float(measure_value)

        # each observation should be unique, otherwise there is a problem in the data
        # we check that the value is empty before adding it
        assert np.isnan(measures[identifier][subject])
        measures[identifier][subject] = measure_value

    return measures

Create and save dataframes with the format described at the top of this notebook:

In [18]:
save_meta(RASTROS_NAME, rastros_texts, rastros_text_ids, rastros_subject_ids)

In [19]:
# fix order of the subjects across dataframes
ordered_subjs = list(rastros_subject_ids.values())

def rastros_build_and_save(column_name, variable_name, replacer):
    measures = get_rastros_measures(column_name,
                                    rastros_subject_ids,
                                    rastros_identifiers,
                                    rastros_raw,
                                    replacer)

    preproc = create_dataframe(variable_name,
                               measures,
                               rastros_identifiers,
                               ordered_subjs)
    save_preprocessed(RASTROS_NAME, variable_name, preproc)
    return preproc


# first pass regressions, binary -- replace first-pass skips (including '.') with SKIP_LABEL
rastros_fpregs = rastros_build_and_save('IA_REGRESSION_OUT', FPREGOUT, replacer=SKIP_LABEL)

In [20]:
rastros_fpregs.isna().sum()

Identifier             0
Token                  0
fpregout:Subj_0      100
fpregout:Subj_1        0
fpregout:Subj_2       51
fpregout:Subj_3       56
fpregout:Subj_4        0
fpregout:Subj_5       40
fpregout:Subj_6        0
fpregout:Subj_7        0
fpregout:Subj_8        0
fpregout:Subj_9        0
fpregout:Subj_10     270
fpregout:Subj_11       0
fpregout:Subj_12      39
fpregout:Subj_13       0
fpregout:Subj_14     588
fpregout:Subj_15      55
fpregout:Subj_16       0
fpregout:Subj_17       0
fpregout:Subj_18       0
fpregout:Subj_19     296
fpregout:Subj_20       0
fpregout:Subj_21       0
fpregout:Subj_22       0
fpregout:Subj_23      36
fpregout:Subj_24       0
fpregout:Subj_25       0
fpregout:Subj_26      49
fpregout:Subj_27       0
fpregout:Subj_28       0
fpregout:Subj_29       0
fpregout:Subj_30     440
fpregout:Subj_31      96
fpregout:Subj_32       0
fpregout:Subj_33     407
fpregout:Subj_34       0
fpregout:Subj_35    1872
fpregout:Subj_36       0
dtype: int64

In [21]:
rastros_fpregs

| | Identifier | Token | fpregout:Subj_0 | fpregout:Subj_1 | fpregout:Subj_2 | fpregout:Subj_3 | fpregout:Subj_4 | fpregout:Subj_5 | fpregout:Subj_6 | fpregout:Subj_7 | ... | fpregout:Subj_27 | fpregout:Subj_28 | fpregout:Subj_29 | fpregout:Subj_30 | fpregout:Subj_31 | fpregout:Subj_32 | fpregout:Subj_33 | fpregout:Subj_34 | fpregout:Subj_35 | fpregout:Subj_36 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | text_0_token_0 | Mudanças | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 |
| 1 | text_0_token_1 | climáticas | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 1.0 | 1.0 | NaN | 0.0 |
| 2 | text_0_token_2 | estão | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | ... | -1.0 | 0.0 | 1.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 |
| 3 | text_0_token_3 | aquecendo | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | -1.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 1.0 | 0.0 | 0.0 | NaN | 0.0 |
| 4 | text_0_token_4 | o | -1.0 | -1.0 | -1.0 | 1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | 0.0 | NaN | -1.0 | -1.0 | 1.0 | -1.0 | NaN | -1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2489 | text_43_token_43 | Tais | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | -1.0 | -1.0 | 0.0 | NaN | 0.0 | NaN | 0.0 |
| 2490 | text_43_token_44 | objetos | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 | 0.0 | NaN | 0.0 | NaN | 0.0 |
| 2491 | text_43_token_45 | são | NaN | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 0.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | -1.0 |
| 2492 | text_43_token_46 | chamados | NaN | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN | -1.0 | NaN | 0.0 |


## PoTeC
Preprocessing the German Potsdam Textbook Corpus, available at [OSF](https://osf.io/dn5hp/).

- SR Research Eyelink 1000 eyetracker
- 75 (or 62 valid?) participants, all read all texts
- 12 scientific texts (6 biology and 6 physics)
- On average 158 words per text
- Each text fits onto a single screen
- Apparently it contains no punctuation

We follow the documentation described in their OSF wiki and use the ```FPReg```
measure in the ```eyetracking_data/readingMeasures/*.txt```, which contains:

> 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD exc))

We need the auxiliary ```FPF``` to identify skipped tokens: 

> 1 if the word was fixated in the first-pass, otherwise 0


Let's first extract the words from ```eyetracking_data/texts/texts_tags/<TextID>.tags```.

In [22]:
POTEC_NAME = 'potec_de'
PATH_TO_POTEC = Path('data/PoTeC/osfstorage/')

potec_measures = {'FPReg': FPREGOUT}

Extract texts from the raw text files. It's hard to extract the punctuation precisely because no documentation is available. So we'll add only commas and full stops.

Jäger's paper mentions that 13 subjects were removed in Makovski's paper due to poor calibration. Based on their ```mergeFixationsWordfeatures.py``` script, the subject IDs to be removed are:

-  2, 9, 16, 19, 22, 31, 39, 41, 64, 72, 83, 85, 90, 93
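Filtering the excluded readers by file name can be sketched as follows, assuming file names of the form `reader<ID>_<TextID>_rm.txt`. The helper `parse_reading_measure_name` and the example file names are hypothetical.

```python
from pathlib import Path

# Reader numbers listed above as excluded.
EXCLUDED = {2, 9, 16, 19, 22, 31, 39, 41, 64, 72, 83, 85, 90, 93}


def parse_reading_measure_name(file_name):
    """Split 'reader<ID>_<TextID>_rm.txt' into (reader number, text id)."""
    reader, text_id, _ = Path(file_name).stem.split('_')
    return int(reader.replace('reader', '')), text_id


# Hypothetical file names, for illustration only.
names = ['reader0_b0_rm.txt', 'reader2_b0_rm.txt', 'reader10_p3_rm.txt']
kept = [name for name in names
        if parse_reading_measure_name(name)[0] not in EXCLUDED]
```

`Path(...).stem` is used because it removes exactly the file extension, whereas `str.strip('.txt')` removes any of the characters `.`, `t` and `x` from both ends of the string.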

In [23]:
potec_text_ids = {}
potec_texts = {}
potec_subject_ids = {}

def add_punctuation(before, word, after):
    text = word
    # according to the STTS documentation, $( means "other punctuation";
    # some other columns mention parentheses and quotes, but are undocumented.
    # Since we cannot know for sure which mark was shown, we add nothing for $(
    # https://www.cis.lmu.de/~schmid/tools/TreeTagger/data/STTS-Tagset.pdf
    # The tag "$." marks sentence-final punctuation; we also do not know exactly
    # which mark it was, so we use a full stop
    assert before in ('$(', 'None')
    # ignore a pure '$(' in before because we don't know what it was
    if after not in ('$(', 'None'):
        after = after.replace('$(', '')
        assert after in ('$.', '$,')
        text = text + after.replace('$', '')
    return text


# from https://stackoverflow.com/a/56469905
with os.scandir(PATH_TO_POTEC / 'texts' / 'texts_tags') as directory:
    for entry in directory:
        if entry.name.endswith('.tags') and entry.is_file():
            text_id, _ = entry.name.split('.')
            text_index = len(potec_text_ids)
            potec_text_ids[text_id] = text_index
            potec_texts[text_index] = {} 
            with open(entry.path, 'r') as file:
                for word_id, line in enumerate(file.readlines()[1:]):
                    _, word, _, _, _, _, _, _, punc_before, punc_after, *_ = line.split('\t')
                    ia_label = add_punctuation(punc_before, word, punc_after)
                    potec_texts[text_index][word_id] = ia_label


POTEC_EXCLUDED_SUBJS = [2, 9, 16, 19, 22, 31, 39, 41, 64, 72, 83, 85, 90, 93]
with os.scandir(PATH_TO_POTEC / 'eyetracking_data' / 'eyetracking_data' / 'readingMeasures') as directory:
    for entry in directory:
        if entry.name.endswith(".txt") and entry.is_file():
            # Path.stem drops the '.txt' suffix (str.strip would remove characters, not a suffix)
            subj_id, *_ = Path(entry.name).stem.split('_')
            # add new subject
            if subj_id not in potec_subject_ids:
                if int(subj_id.replace('reader', '')) not in POTEC_EXCLUDED_SUBJS:
                    potec_subject_ids[subj_id] = len(potec_subject_ids)

In [24]:
potec_identifiers = {TOKEN_ID.format(text_id, word_position): word 
                     for text_id, words in potec_texts.items() for word_position, word in words.items()}

In [25]:
# fix two issues with the full stops
# probably this should not be a full stop because it's in the middle of a sentence, so we'll use the other possibility which is ;
potec_texts[5][46] = potec_texts[5][46].replace('Stärke.', 'Stärke;')
potec_texts[6][84] = potec_texts[6][84].replace('sind.', 'sind;')

Now we extract the measures from each ```reader<ID>_<TextID>_rm.txt``` file: 

In [26]:
potec_data_files = []
# from https://stackoverflow.com/a/56469905
with os.scandir(PATH_TO_POTEC / 'eyetracking_data' / 'eyetracking_data' / 'readingMeasures') as directory:
    for entry in directory:
        if entry.name.endswith(".txt") and entry.is_file():
            potec_data_files.append(entry)

def get_potec_measures(measure, subjects, identifiers):
    """Create dictionary with the measure for each subject in the PoTeC data."""
    measures = {identifier: {subj: MISSING_LABEL for subj in subjects.values()} for identifier in identifiers.keys()}
    for entry in potec_data_files:
        # Path.stem drops the '.txt' suffix (str.strip would remove characters, not a suffix)
        orig_subj_id, orig_text_id, _ = Path(entry.name).stem.split('_')
        if int(orig_subj_id.replace('reader', '')) in POTEC_EXCLUDED_SUBJS:
            # ignore the files of readers with poor calibration (according to the original code)
            continue
        subj_id = subjects[orig_subj_id]
        text_id = potec_text_ids[orig_text_id]
        data = pd.read_csv(entry.path, sep='\t')
        assert data.shape[0] == len(potec_texts[text_id])
        for index, row in data.iterrows():
            measure_value = float(row[measure])
            # no missing values in the corpus
            assert not pd.isna(measure_value)
            assert measure_value in (0., 1.)
            # skip is the opposite of first pass fix
            assert float(row['FPF']) in (0., 1.)
            was_skipped = 1. - float(row['FPF'])
            if was_skipped == 1.:
                assert measure_value == 0.
                measure_value = SKIP_LABEL
            identifier = TOKEN_ID.format(text_id, index)
            # value should be empty until we add it
            assert np.isnan(measures[identifier][subj_id])
            measures[identifier][subj_id] = measure_value
    
    for subj_data in measures.values():
        assert sum([np.isnan(value) for value in subj_data.values()]) == 0

    return measures

In [27]:
save_meta(POTEC_NAME, potec_texts, potec_text_ids, potec_subject_ids)

In [28]:
potec_ordered_subjs = list(potec_subject_ids.values())

def potec_build_and_save(column_name, variable_name):
    measures = get_potec_measures(column_name, potec_subject_ids, potec_identifiers)
    preproc = create_dataframe(variable_name, measures, potec_identifiers, potec_ordered_subjs)
    save_preprocessed(POTEC_NAME, variable_name, preproc)
    return preproc

potec_fpregs = potec_build_and_save('FPReg', FPREGOUT)

In [29]:
potec_fpregs.isna().sum()

Identifier          0
Token               0
fpregout:Subj_0     0
fpregout:Subj_1     0
fpregout:Subj_2     0
                   ..
fpregout:Subj_57    0
fpregout:Subj_58    0
fpregout:Subj_59    0
fpregout:Subj_60    0
fpregout:Subj_61    0
Length: 64, dtype: int64

In [30]:
potec_fpregs

| | Identifier | Token | fpregout:Subj_0 | fpregout:Subj_1 | fpregout:Subj_2 | fpregout:Subj_3 | fpregout:Subj_4 | fpregout:Subj_5 | fpregout:Subj_6 | fpregout:Subj_7 | ... | fpregout:Subj_52 | fpregout:Subj_53 | fpregout:Subj_54 | fpregout:Subj_55 | fpregout:Subj_56 | fpregout:Subj_57 | fpregout:Subj_58 | fpregout:Subj_59 | fpregout:Subj_60 | fpregout:Subj_61 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | text_0_token_0 | Photonische | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | text_0_token_1 | Kristalle | 0.0 | 0.0 | 0.0 | 1.0 | -1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | text_0_token_2 | sind | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | -1.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | 0.0 |
| 3 | text_0_token_3 | räumlich | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | text_0_token_4 | periodische | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1890 | text_11_token_136 | Wellenlänge | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | -1.0 | 1.0 | 1.0 | 0.0 |
| 1891 | text_11_token_137 | des | 0.0 | -1.0 | 0.0 | 1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | 0.0 | 1.0 | 0.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 0.0 |
| 1892 | text_11_token_138 | beleuchtenden | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1893 | text_11_token_139 | Lichtes | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |


In [31]:
# double check that in this corpus there are no missing values
for c, column in potec_fpregs[[c for c in potec_fpregs.columns if 'Subj' in c]].items():
    assert set(column) == set([0., -1., 1.])


## Provo
Preprocessing the English Provo corpus. 

- 84 participants
- Eye Link 1000 Plus (SR Research)
- 55 paragraphs of various sources
- The texts were presented in a random order for each participant.

We follow the documentation in Table 2 of Luke and Christianson (2018). We use the ```IA_REGRESSION_OUT``` measure in the file ```Provo_Corpus-Eyetracking_Data.csv```, which is documented as:

> Whether regression(s) was made from the current interest area to earlier interest areas (e.g., previous parts of the sentence) prior to leaving that interest area in a forward direction. 1 if a saccade exits the current interest area to a lower IA_ID (to the left in English) before a later interest area was fixated; 0 if not.

We use ```IA_SKIP``` to check first-pass skips:
> An interest area is considered skipped (i.e.,IA_SKIP = 1) if no fixation occurred in first-pass reading.

In [32]:
PROVO_NAME = 'provo_en'
PATH_TO_PROVO = Path('data/Provo/osfstorage/')

provo_measures = {
    'IA_REGRESSION_OUT': FPREGOUT,
}

In [33]:
provo_texts = {}
provo_text_ids = {}

PATH_PROVO_TEXTS = PATH_TO_PROVO / 'Provo Corpus Eyelink Program Files and Raw Data' / 'Paragraph Reading' / 'datasets' / 'TRIAL_DataSource_Paragraph_Reading_BLOCKTRIAL.dat'

with open(PATH_PROVO_TEXTS, 'r') as file:
    for index, line in enumerate(file.readlines()[1:]):
        _, text_id, text = line.split('\t')
        provo_text_ids[text_id.strip('"')] = index
        # [1:-1] removes " at the beginning and the end while keeping cases where there is a double " due to a real " in the text
        #text = text.strip('\n')[1:-1].split()
        # however, it seems that the IA_LABEL does not contain these quotations in the raw file, 
        # so we remove them here
        text = text.strip('\n').replace('"', '').split()
        provo_texts[index] = {i: word for i, word in enumerate(text)}

In [34]:
provo_raw = pd.read_csv(PATH_TO_PROVO / 'Provo_Corpus-Eyetracking_Data.csv', sep=',', 
                        low_memory=False, doublequote=False, encoding='utf-8')

In [35]:
provo_raw

| | RECORDING_SESSION_LABEL | Participant_ID | Word_Unique_ID | Text_ID | Word_Number | Sentence_Number | Word_In_Sentence_Number | Word | Word_Cleaned | Word_Length | ... | IA_REGRESSION_IN_COUNT | IA_REGRESSION_OUT | IA_REGRESSION_OUT_COUNT | IA_REGRESSION_OUT_FULL | IA_REGRESSION_OUT_FULL_COUNT | IA_REGRESSION_PATH_DURATION | IA_FIRST_SACCADE_AMPLITUDE | IA_FIRST_SACCADE_ANGLE | IA_FIRST_SACCADE_END_TIME | IA_FIRST_SACCADE_START_TIME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80 | Sub01 | QID1 | 1 | 2.0 | 1.0 | 2.0 | are | are | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 147.0 | 2.39 | 9.31 | 247.0 | 221.0 |
| 1 | 80 | Sub01 | QID2 | 1 | 3.0 | 1.0 | 3.0 | now | now | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 193.0 | 1.86 | 1.89 | 415.0 | 395.0 |
| 2 | 80 | Sub01 | QID3 | 1 | 4.0 | 1.0 | 4.0 | rumblings | rumblings | 9.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 198.0 | 2.17 | 2.46 | 632.0 | 609.0 |
| 3 | 80 | Sub01 | QID4 | 1 | 5.0 | 1.0 | 5.0 | that | that | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 80 | Sub01 | QID5 | 1 | 6.0 | 1.0 | 6.0 | Apple | apple | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 235.0 | 3.89 | 0.66 | 864.0 | 831.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 230407 | sub99 | Sub84 | QID2742 | 55 | 58.0 | 2.0 | 34.0 | unreliable | unreliable | 10.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 330.0 | 3.19 | 3.73 | 8728.0 | 8702.0 |
| 230408 | sub99 | Sub84 | QID2743 | 55 | 59.0 | 2.0 | 35.0 | and | and | 3.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 230409 | sub99 | Sub84 | QID2744 | 55 | 60.0 | 2.0 | 36.0 | detestable | detestable | 10.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 181.0 | 5.38 | 8.19 | 9109.0 | 9076.0 |
| 230410 | sub99 | Sub84 | NaN | 55 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 145.0 | 4.01 | -3.07 | 220.0 | 193.0 |


Inspect the values:

In [36]:
for column, measure_type in provo_measures.items():
    print(f'\n{measure_type.name}')
    print(provo_raw[column].value_counts())
    print('NaNs: ', provo_raw[column].isna().sum())

print('\nfirst pass skips')
print(Counter(provo_raw['IA_SKIP']))
print('NaNs: ', provo_raw['IA_SKIP'].isna().sum())

assert set(provo_raw['IA_REGRESSION_OUT'].dropna().unique()) == set([0, 1])
assert set(provo_raw['IA_SKIP'].unique()) == set([0, 1])


first-pass-regression-out
IA_REGRESSION_OUT
0.0    129221
1.0     24345
Name: count, dtype: int64
NaNs:  76846

first pass skips
Counter({0: 128807, 1: 101605})
NaNs:  0


The regression variable contains many ```NaN```s, whereas ```IA_SKIP``` is always either 0 or 1. Let's investigate where the ```NaN```s come from. It seems to be the same case as the dot in RastrOS: if the word was fixated at any point, ```IA_REGRESSION_OUT``` is either 0 or 1. When it is ```NaN```, the token was skipped in the first pass and never fixated afterwards.

In [37]:
for index, row in provo_raw.iterrows():
    regression_label = row['IA_REGRESSION_OUT']
    assert regression_label in (0, 1) or pd.isna(regression_label)
    if regression_label in (0, 1):
        if row['IA_SKIP'] == 1.:
            # tokens skipped at first pass but fixated later 
            assert row['IA_FIXATION_COUNT'] > 0
            # strange cases where it was skipped but regression is one,
            # probably because of recursive regressions
            #assert regression_label == 0
        # if there is a label, the token was fixated at least once
        assert row['IA_DWELL_TIME'] > 0.
        # checking if first run fixation count is more reliable
        assert not pd.isna(row['IA_FIRST_RUN_FIXATION_COUNT'])
        assert row['IA_FIRST_RUN_FIXATION_COUNT'] >= 1.
    if pd.isna(regression_label):
        # the dots mean skipped at first pass and also skipped altogether
        assert row['IA_SKIP'] == 1.
        assert row['IA_DWELL_TIME'] == 0.
        assert row['IA_FIXATION_COUNT'] == 0.
        assert pd.isna(row['IA_FIRST_RUN_FIXATION_COUNT'])
    if pd.isna(row['IA_FIRST_RUN_FIXATION_COUNT']):
        # when it is NaN, regression is also NaN
        assert row['IA_SKIP'] == 1.
        assert pd.isna(regression_label)
    else:
        # otherwise regression has a value
        assert regression_label in (0, 1)

So when ```IA_REGRESSION_OUT``` is ```NaN```, ```IA_SKIP``` is always 1 and the token was never fixated.
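
The row-wise loop above is explicit but slow on a table with more than 230,000 rows. The same invariants can also be checked with boolean masks. Below is a minimal sketch on a toy frame: only the column names come from the Provo file, the values are invented.

```python
import numpy as np
import pandas as pd

# toy rows with the same column names as the Provo file (values are made up)
toy = pd.DataFrame({
    'IA_REGRESSION_OUT': [0.0, 1.0, np.nan, 0.0],
    'IA_SKIP': [0, 0, 1, 1],
    'IA_FIXATION_COUNT': [2, 1, 0, 1],
    'IA_DWELL_TIME': [310.0, 180.0, 0.0, 95.0],
})

no_label = toy['IA_REGRESSION_OUT'].isna()

# NaN label: skipped in the first pass and never fixated afterwards
assert (toy.loc[no_label, 'IA_SKIP'] == 1).all()
assert (toy.loc[no_label, 'IA_FIXATION_COUNT'] == 0).all()
assert (toy.loc[no_label, 'IA_DWELL_TIME'] == 0).all()

# any label (0 or 1): the token was fixated at least once
assert (toy.loc[~no_label, 'IA_DWELL_TIME'] > 0).all()

n_unlabelled = int(no_label.sum())
print(n_unlabelled)
```

The last toy row is the case of a token skipped in the first pass but fixated later: it has a label, so it is not covered by the ```NaN``` mask.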

In [38]:
provo_identifiers = {TOKEN_ID.format(text_id, word_id): word for text_id, words in provo_texts.items() for word_id, word in words.items()}
provo_subject_ids = {subj_id: i for i, subj_id in enumerate(provo_raw['Participant_ID'].unique())}

Let's check whether the tokens in 'Word' and 'IA_LABEL' match and also if they match the texts.

In [39]:
inconsistent_words = {}
internal_inconsistencies = {}

for index, row in provo_raw.iterrows():
    text_id = row['Text_ID']
    word_id = row['IA_ID']
    identifier = TOKEN_ID.format(provo_text_ids[str(text_id)], word_id - 1)
    
    # check that different subjects identifiers always map to the same word
    if provo_identifiers[identifier] != row['IA_LABEL'].strip():
        inconsistent_words[index] = (provo_identifiers[identifier], row['IA_LABEL'])

    if row['Word'] != row['IA_LABEL']:
        internal_inconsistencies[index] = (row['Word'], row['IA_LABEL'])

In [40]:
n_mismatches = len(set(internal_inconsistencies.values()))
print(f'There are {n_mismatches} tokens with mismatches between IA_LABEL and Word!')

n_mismatches = len(set(inconsistent_words.values()))
print(f'There are {n_mismatches} tokens with mismatches wrt the texts!')

print(set(inconsistent_words.values()))

There are 1366 tokens with mismatches between IA_LABEL and Word!
There are 4 tokens with mismatches wrt the texts!
{('Ñ', '? '), ('doesnÕt', 'doesn?t '), ('bondsÕ', 'bonds? '), ('womenÕs', 'women?s ')}


We cannot trust the ```Word``` and ```Word_ID``` fields. They have too many inconsistencies and unexpected ```NaN```s. We'll rely on ```IA_LABEL``` and ```IA_ID``` instead, which seem to be correct, except for some quotation marks (which we removed a few cells up) and the 4 tokens above, which we'll keep as in the original texts.
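
One plausible explanation for the 4 mismatches (an assumption on our side, the corpus documentation does not say so) is an encoding problem: the text file may contain typographic apostrophes and a dash stored as Mac Roman bytes that were read as Latin-1, while ```IA_LABEL``` has ```?``` in their place. The sketch below only illustrates this hypothesis. The helper `reinterpret_mac_roman` is hypothetical and is not used in the preprocessing.

```python
def reinterpret_mac_roman(text):
    """Re-read Latin-1-decoded text as Mac Roman (an assumed source encoding)."""
    return text.encode('latin-1').decode('mac_roman')


for garbled in ('doesnÕt', 'bondsÕ', 'womenÕs', 'Ñ'):
    # repr() makes the recovered punctuation visible as escape-free text
    print(repr(garbled), '->', repr(reinterpret_mac_roman(garbled)))
```

Under this assumption ```Õ``` maps to a right single quotation mark and ```Ñ``` to a long dash, which would fit the positions of the ```?``` in the labels.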

In [41]:
def get_provo_measures(measure, subjects, identifiers, data, replacer):
    """Create dictionary with measure for each subject in ProVo data."""
    # initialize the dictionary with np.nans, so that all identifiers have
    # all subject keys, even when no data was collected for them
    measures = {identifier: {subject: MISSING_LABEL for subject in subjects.values()} for identifier in identifiers.keys()}

    for index, row in data.iterrows():

        orig_text = str(row['Text_ID'])
        text_id = provo_text_ids[orig_text]
        token_number = row['IA_ID'] - 1
        identifier = TOKEN_ID.format(text_id, token_number)
        assert (identifiers[identifier] == row['IA_LABEL'].strip() 
                or index in inconsistent_words)

        orig_subject = row['Participant_ID']
        subject = subjects[orig_subject]
        
        measure_value = row[measure]
        was_skipped = row['IA_SKIP']
        assert was_skipped in (0., 1.)

        if was_skipped == 1. and replacer is not None:
            measure_value = replacer
        else:
            measure_value = float(measure_value)

        # each observation should be unique, otherwise there is a problem in the data
        assert np.isnan(measures[identifier][subject])
        measures[identifier][subject] = measure_value

    return measures
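
The labelling logic of the function above can be tried out on a few toy observations. This is a minimal sketch: the values of `MISSING_LABEL` and `SKIP_LABEL` and the `TOKEN_ID` template are assumptions inferred from the conventions described at the top of the notebook and from the saved identifiers, and `encode` is a hypothetical helper.

```python
import math

MISSING_LABEL = math.nan        # assumed: the subject has no data for the token
SKIP_LABEL = -1.0               # assumed: token skipped in the first pass
TOKEN_ID = 'text_{}_token_{}'   # assumed template, inferred from the identifiers


def encode(regression_out, first_pass_skip, replacer=SKIP_LABEL):
    """Return the value stored for one (token, subject) observation."""
    if first_pass_skip == 1 and replacer is not None:
        return replacer
    return float(regression_out)


# toy observations: (text, token, subject, IA_REGRESSION_OUT, IA_SKIP)
rows = [
    (0, 0, 0, 0.0, 0),        # fixated, no regression
    (0, 1, 0, 1.0, 0),        # fixated, regression out
    (0, 2, 0, math.nan, 1),   # never fixated
    (0, 0, 1, 1.0, 1),        # skipped at first, fixated later
]

identifiers = [TOKEN_ID.format(0, i) for i in range(3)]
measures = {ident: {subj: MISSING_LABEL for subj in (0, 1)} for ident in identifiers}

for text, token, subj, reg, skip in rows:
    ident = TOKEN_ID.format(text, token)
    # each (token, subject) pair may be filled only once
    assert math.isnan(measures[ident][subj])
    measures[ident][subj] = encode(reg, skip)

print(measures)
```

The last toy row shows a consequence of this choice: a token skipped in the first pass is stored as ```-1.0``` even when its raw regression value is 1, and pairs that never occur in the data stay ```NaN```.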

In [42]:
save_meta(PROVO_NAME, provo_texts, provo_text_ids, provo_subject_ids)

In [43]:
# fix order of the subjects across dataframes
provo_ordered_subjs = list(provo_subject_ids.values())

def provo_build_and_save(column_name, variable_name, replacer):
    measures = get_provo_measures(column_name, provo_subject_ids, provo_identifiers, provo_raw, replacer)
    preproc = create_dataframe(variable_name, measures, provo_identifiers, provo_ordered_subjs)
    save_preprocessed(PROVO_NAME, variable_name, preproc)
    return preproc

# first pass regressions, binary -- replace np.nan with skip value
provo_fpregs = provo_build_and_save('IA_REGRESSION_OUT', FPREGOUT, replacer=SKIP_LABEL)
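
The helpers ```create_dataframe``` and ```save_preprocessed``` are defined elsewhere in the notebook and are not shown in this section. As a rough illustration of the table layout they are expected to produce (one row per token, one column per subject), here is a stand-in written with the standard library. The function `to_tsv` is hypothetical and is not the notebook's implementation.

```python
import csv
import io
import math


def to_tsv(measure_name, measures, tokens, subjects):
    """Write one row per token: Identifier, Token, then one column per subject."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
    header = ['Identifier', 'Token'] + [f'{measure_name}:Subj_{s}' for s in subjects]
    writer.writerow(header)
    for identifier, token in tokens.items():
        # the subject order is fixed by the `subjects` list
        writer.writerow([identifier, token] + [measures[identifier][s] for s in subjects])
    return buffer.getvalue()


tokens = {'text_0_token_0': 'There', 'text_0_token_1': 'are'}
measures = {
    'text_0_token_0': {0: 0.0, 1: -1.0},
    'text_0_token_1': {0: math.nan, 1: 1.0},
}
tsv = to_tsv('fpregout', measures, tokens, subjects=[0, 1])
print(tsv)
```

Passing the subject order explicitly is what keeps the columns aligned across the tables of different measures.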

In [44]:
provo_fpregs.isna().sum()

Identifier          0
Token               0
fpregout:Subj_0     0
fpregout:Subj_1     0
fpregout:Subj_2     0
                   ..
fpregout:Subj_79    0
fpregout:Subj_80    0
fpregout:Subj_81    0
fpregout:Subj_82    0
fpregout:Subj_83    0
Length: 86, dtype: int64

In [45]:
provo_fpregs

Unnamed: 0,Identifier,Token,fpregout:Subj_0,fpregout:Subj_1,fpregout:Subj_2,fpregout:Subj_3,fpregout:Subj_4,fpregout:Subj_5,fpregout:Subj_6,fpregout:Subj_7,...,fpregout:Subj_74,fpregout:Subj_75,fpregout:Subj_76,fpregout:Subj_77,fpregout:Subj_78,fpregout:Subj_79,fpregout:Subj_80,fpregout:Subj_81,fpregout:Subj_82,fpregout:Subj_83
0,text_0_token_0,There,0.0,-1.0,0.0,0.0,-1.0,0.0,0.0,-1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0
1,text_0_token_1,are,0.0,-1.0,-1.0,0.0,0.0,-1.0,-1.0,0.0,...,-1.0,0.0,0.0,-1.0,0.0,0.0,-1.0,0.0,-1.0,-1.0
2,text_0_token_2,now,0.0,-1.0,0.0,0.0,-1.0,0.0,-1.0,-1.0,...,0.0,-1.0,-1.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,-1.0
3,text_0_token_3,rumblings,0.0,1.0,1.0,-1.0,1.0,0.0,0.0,1.0,...,0.0,1.0,1.0,0.0,1.0,0.0,-1.0,-1.0,0.0,0.0
4,text_0_token_4,that,-1.0,-1.0,0.0,-1.0,0.0,0.0,0.0,-1.0,...,-1.0,1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2738,text_54_token_55,most,-1.0,0.0,-1.0,0.0,-1.0,-1.0,1.0,-1.0,...,0.0,0.0,0.0,-1.0,0.0,-1.0,-1.0,-1.0,0.0,0.0
2739,text_54_token_56,unreliable,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0
2740,text_54_token_57,and,-1.0,-1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,...,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,-1.0
2741,text_54_token_58,detestable,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0


## MECO L1 (Multilingual)

Preprocessing the multilingual MECO L1 corpus (Dutch, English, Estonian, Finnish, German, Greek, Hebrew, Italian, Korean, Norwegian, Russian, Spanish, Turkish). We'll extract only some of the languages.

- 13 languages 
- 12 texts, 5 translated for each language and 7 only on similar topics
- Wikipedia-style
- around 45 subjects for each language
- EyeLink Portable Duo, 1000 or 1000+ (SR Research)
- Each of the 12 texts appeared on a separate screen


We follow the documentation of the popEye package, which they point to [here](https://rdrr.io/github/sascha2schroeder/popEye/). The variables themselves are described [here](https://rdrr.io/github/sascha2schroeder/popEye/f/materials/Measures.md). The main variable we want is ```firstrun.reg.out```: 

> Variable indicating whether there was a regression from the word during first-pass reading

We use ```firstrun.skip```  to detect the tokens that were skipped in the first run:

> Variable indicating whether the IA was skipped during first-pass reading

In [46]:
MECOL1_NAME = 'mecol1'
PATH_TO_MECOL1 = Path('data/MECO-L1/osfstorage/release 1.0/version 1.2/primary data/eye tracking data/joint_data_trimmed.rda')
PATH_TO_MECOL1_TEXTS = Path('data/MECO-L1/osfstorage/release 1.0/version 1.2/auxiliary files/reading task materials/supp texts.csv')

meco_l1_measures = {
    'firstrun.reg.out': FPREGOUT,
}

In [47]:
parsed = rdata.parser.parse_file(PATH_TO_MECOL1)
mecol1_raw = rdata.conversion.convert(parsed)['joint.data']

# get only Dutch, other languages have inconsistencies
mecol1_raw = mecol1_raw[mecol1_raw.lang == 'du']
mecol1_texts_raw = pd.read_csv(PATH_TO_MECOL1_TEXTS, index_col=0).loc['Dutch']

In [48]:
mecol1_text_ids = {text: i for i, text in enumerate(mecol1_texts_raw.index) if 'Unnamed' not in text}
mecol1_texts = {i: {} for i in mecol1_text_ids.values()}

for index, text in mecol1_texts_raw.items():
    if 'Unnamed' not in index:
        text_id = mecol1_text_ids[index]
        tokens = text.replace('-', '- ')
        mecol1_texts[text_id] = {i: word for i, word in enumerate(tokens.split())}

Checking ```NaN```s in both measures. Here, the number of fixations also contains ```NaN```s, exactly as many as the regressions:

In [49]:
for column, measure_type in meco_l1_measures.items():
    print(f'\n{measure_type.name}')
    print(mecol1_raw[column].value_counts())
    print('NaNs:', mecol1_raw[column].isna().sum())

print('\nfirst run skip')
print(mecol1_raw['firstrun.skip'].value_counts())
print('NaNs: ', mecol1_raw['firstrun.skip'].isna().sum())

assert set(mecol1_raw['firstrun.reg.out'].dropna().unique()) == set([0, 1])
assert set(mecol1_raw['firstrun.skip'].unique()) == set([0, 1])


first-pass-regression-out
firstrun.reg.out
0.0    34748
1.0    12054
Name: count, dtype: int64
NaNs: 20793

first run skip
firstrun.skip
1.0    36058
0.0    31537
Name: count, dtype: int64
NaNs:  0


In [50]:
for index, row in mecol1_raw.iterrows():
    regression_label = row['firstrun.reg.out']
    assert regression_label in (0.0, 1.0) or pd.isna(regression_label)
    if regression_label in (0.0, 1.0):
        assert row['firstrun.nfix'] > 0.
        assert not pd.isna(row['firstrun.nfix'])
        assert row['dur'] > 0.
    if pd.isna(regression_label):
        # nans imply skipped tokens
        assert row['firstrun.skip'] == 1.
        assert pd.isna(row['firstrun.dur'])
        assert pd.isna(row['firstrun.nfix'])

If there is a regression label, there was a first run fixation. If regression is nan, then first run skip is 1.

In [51]:
for index, row in mecol1_raw.iterrows():
    label = row['nfix']
    if not pd.isna(label):
        if row['firstrun.skip'] != 0.:
            #print(index)  # Too many!
            pass
        assert row['dur'] > 0.
    else:
        assert row['skip'] == 1. or pd.isna(row['skip'])
        assert pd.isna(row['dur'])

So, for both measures, NaN means the token was not fixated.

In [52]:
mecol1_identifiers = {TOKEN_ID.format(text_id, word_id): word for text_id, words in mecol1_texts.items() for word_id, word in words.items()}

In [53]:
mecol1_subject_ids = {subj_id: i for i, subj_id in enumerate(mecol1_raw['uniform_id'].unique())}

Check that ```trialid``` is indeed fixed across subjects and encodes the text identifier, even though the documentation describes it as the position in the experiment.

In [54]:
temp_dic = {x: {} for x in range(1, 13)}
for index, row in mecol1_raw.iterrows():
    # assuming this ID is fixed, although they mention position in experiment in the doc
    text_id = row['trialid']
    word_id = row['ianum']
    word = row['ia']

    if word_id not in temp_dic[text_id]:
        temp_dic[text_id][word_id] = word
    assert temp_dic[text_id][word_id] == word

Check whether it matches the raw texts. Alternatively we can use only the info in the data file.

In [55]:
for text_id, words in temp_dic.items():
    for word_id in range(1, len(words)+1):
        try:
            word = words[word_id]
        except KeyError:
            print('Investigate...')
            # nothing to compare for this position, avoid reusing the previous word
            continue

        if word != mecol1_texts[text_id - 1][word_id - 1]:
            print('Mismatch!', word, mecol1_texts[text_id - 1][word_id - 1])

In [56]:
def get_mecol1_measures(measure, subjects, identifiers, data, replacer):
    """Create dictionary with measure for each subject in MECO-L1 data."""
    # initialize the dictionary with np.nans, so that all identifiers have
    # all subject keys, even when no data was collected for them
    measures = {identifier: {subject: MISSING_LABEL for subject in subjects.values()} for identifier in identifiers.keys()}

    for index, row in data.iterrows():

        text_id = int(row['trialid'] - 1)
        token_number = int(row['ianum'] - 1)
        identifier = TOKEN_ID.format(text_id, token_number)
        assert row['ia'].strip(' ') == mecol1_texts[text_id][token_number]
        
        subject = subjects[row['uniform_id']]

        measure_value = row[measure]
        was_skipped = row['firstrun.skip']
        assert was_skipped in (0., 1.)

        if was_skipped == 1. and replacer is not None:
            measure_value = replacer
        else:
            measure_value = float(measure_value)

        # each observation should be unique, otherwise there is a problem in the data
        assert np.isnan(measures[identifier][subject])
        measures[identifier][subject] = measure_value
            
    return measures

In [57]:
save_meta(f'{MECOL1_NAME}_du', mecol1_texts, mecol1_text_ids, mecol1_subject_ids)

In [58]:
# fix order of the subjects across dataframes
mecol1_ordered_subjs = list(mecol1_subject_ids.values())

def mecol1_build_and_save(column_name, variable_name, replacer):
    measures = get_mecol1_measures(column_name,
                                   mecol1_subject_ids,
                                   mecol1_identifiers,
                                   mecol1_raw,
                                   replacer)

    preproc = create_dataframe(variable_name,
                               measures,
                               mecol1_identifiers,
                               mecol1_ordered_subjs)
    save_preprocessed(f'{MECOL1_NAME}_du', variable_name, preproc)
    return preproc

mecol1_fpregs = mecol1_build_and_save('firstrun.reg.out', FPREGOUT, replacer=SKIP_LABEL)

In [59]:
mecol1_fpregs.isna().sum()

Identifier             0
Token                  0
fpregout:Subj_0      383
fpregout:Subj_1     1103
fpregout:Subj_2     1112
fpregout:Subj_3     1122
fpregout:Subj_4      161
fpregout:Subj_5      347
fpregout:Subj_6        0
fpregout:Subj_7      756
fpregout:Subj_8      552
fpregout:Subj_9      907
fpregout:Subj_10     745
fpregout:Subj_11     748
fpregout:Subj_12     886
fpregout:Subj_13     936
fpregout:Subj_14    1123
fpregout:Subj_15     920
fpregout:Subj_16    1117
fpregout:Subj_17     789
fpregout:Subj_18     342
fpregout:Subj_19       0
fpregout:Subj_20    1132
fpregout:Subj_21     760
fpregout:Subj_22     763
fpregout:Subj_23    1325
fpregout:Subj_24     173
fpregout:Subj_25     730
fpregout:Subj_26     511
fpregout:Subj_27    1298
fpregout:Subj_28     169
fpregout:Subj_29    1277
fpregout:Subj_30     573
fpregout:Subj_31     213
fpregout:Subj_32     938
fpregout:Subj_33     918
fpregout:Subj_34     578
fpregout:Subj_35    1112
fpregout:Subj_36    1295
fpregout:Subj_37     929


In [60]:
mecol1_fpregs

Unnamed: 0,Identifier,Token,fpregout:Subj_0,fpregout:Subj_1,fpregout:Subj_2,fpregout:Subj_3,fpregout:Subj_4,fpregout:Subj_5,fpregout:Subj_6,fpregout:Subj_7,...,fpregout:Subj_35,fpregout:Subj_36,fpregout:Subj_37,fpregout:Subj_38,fpregout:Subj_39,fpregout:Subj_40,fpregout:Subj_41,fpregout:Subj_42,fpregout:Subj_43,fpregout:Subj_44
0,text_0_token_0,Janus,0.0,,0.0,,0.0,,-1.0,,...,-1.0,0.0,,,0.0,-1.0,0.0,,0.0,0.0
1,text_0_token_1,is,-1.0,,0.0,,-1.0,,-1.0,,...,-1.0,0.0,,,-1.0,0.0,-1.0,,1.0,-1.0
2,text_0_token_2,in,-1.0,,-1.0,,1.0,,-1.0,,...,-1.0,-1.0,,,-1.0,-1.0,1.0,,-1.0,0.0
3,text_0_token_3,de,-1.0,,0.0,,-1.0,,1.0,,...,-1.0,-1.0,,,-1.0,1.0,-1.0,,0.0,-1.0
4,text_0_token_4,oude,1.0,,-1.0,,-1.0,,1.0,,...,-1.0,0.0,,,1.0,0.0,0.0,,0.0,-1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2226,text_11_token_164,met,0.0,,-1.0,-1.0,0.0,-1.0,0.0,0.0,...,,,-1.0,-1.0,-1.0,,,-1.0,-1.0,-1.0
2227,text_11_token_165,overheden,1.0,,0.0,0.0,-1.0,0.0,-1.0,0.0,...,,,0.0,0.0,-1.0,,,-1.0,0.0,-1.0
2228,text_11_token_166,en,-1.0,,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,,,-1.0,-1.0,-1.0,,,-1.0,-1.0,-1.0
2229,text_11_token_167,internationale,0.0,,0.0,1.0,0.0,0.0,0.0,0.0,...,,,1.0,0.0,-1.0,,,1.0,0.0,1.0


## MECO L2
Preprocessing the MECO corpus (second language English).

- 543 participants with 12 L1 languages (1 is English)
- popEye 
- 12 texts (training materials for the ACCUPLACER Reading test and the English as Second Language Reading Skills test)
- Participants read 12 texts in English silently for comprehension while their eye movements were recorded, and then answered four yes/no questions after each text.

This corpus was released as ```rda``` files. We use the ```rdata``` library to read it into a pandas dataframe.

The documentation of the variables is [here](https://rdrr.io/github/sascha2schroeder/popEye/f/materials/Measures.md). We will use the same variables as MECO L1.

In [61]:
meco_l2_measures = {
    'firstrun.reg.out': FPREGOUT
}

In [62]:
MECOL2_NAME = 'mecol2_enl2'

PATH_TO_MECOL2 = Path('data/MECO-L2/osfstorage/release 1.0/version 1.1/primary data/eye tracking data/joint_data_l2_trimmed.rda')
parsed = rdata.parser.parse_file(PATH_TO_MECOL2)
mecol2_raw = rdata.conversion.convert(parsed)['joint.data']

PATH_TO_MECOL2_TEXTS = Path('data/MECO-L2/osfstorage/release 1.0/version 1.1/auxiliary files/materials/texts.meco.l2.rda')
parsed = rdata.parser.parse_file(PATH_TO_MECOL2_TEXTS)
meco_texts_raw = rdata.conversion.convert(parsed)['d']



In [63]:
mecol2_texts = {}
mecol2_text_ids = {}

for index, row in meco_texts_raw.iterrows():
    text_id = row['trialid']
    mecol2_text_ids[text_id] = index - 1

    text_raw = row['text']
    # we need this so that the tokens match what was shown to participants (hyphenated words become two interest areas)
    text_tokens = text_raw.replace('-', '- ').split()

    mecol2_texts[index - 1] = {i: word for i, word in enumerate(text_tokens)} 
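
The effect of the hyphen handling is easiest to see on a short example. The sentence below is made up and `tokenize` is a hypothetical helper that only wraps the expression used in the loop above.

```python
def tokenize(text):
    """Whitespace-tokenize after forcing a break behind every hyphen."""
    return text.replace('-', '- ').split()


tokens = tokenize('a well-known nineteenth-century inventor')
print(tokens)

# a hyphen that is already followed by a space does not create an empty token
print(tokenize('pre- and post-war'))
```

The hyphen stays attached to the first part of the word, so each hyphenated word contributes two tokens, matching two interest areas.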

In [64]:
for column, measure_type in meco_l2_measures.items():
    print(f'\n{measure_type.name}')
    print(mecol2_raw[column].value_counts())
    print('NaNs:', mecol2_raw[column].isna().sum())

print('\nfirst run skip')
print(mecol2_raw['firstrun.skip'].value_counts())
print('NaNs: ', mecol2_raw['firstrun.skip'].isna().sum())

assert set(mecol2_raw['firstrun.reg.out'].dropna().unique()) == set([0, 1])
assert set(mecol2_raw['firstrun.skip'].unique()) == set([0, 1])


first-pass-regression-out
firstrun.reg.out
0.0    407130
1.0     97913
Name: count, dtype: int64
NaNs: 164910


Let's investigate the ```NaN```s in the ```firstrun.reg.out``` variable. As in Provo, whenever ```firstrun.reg.out``` is ```NaN```, the token was skipped in the first run.

In [65]:
for index, row in mecol2_raw.iterrows():
    regression_label = row['firstrun.reg.out']
    assert regression_label in (0.0, 1.0) or pd.isna(regression_label)
    if regression_label in (0.0, 1.0):
        assert row['firstrun.nfix'] > 0.
        assert not pd.isna(row['firstrun.nfix'])
        assert row['dur'] > 0.
    if pd.isna(regression_label):
        # nan imply skip
        assert row['firstrun.skip'] == 1.
        assert pd.isna(row['firstrun.dur'])
        assert pd.isna(row['firstrun.nfix'])

In [66]:
for index, row in mecol2_raw.iterrows():
    label = row['nfix']
    if not pd.isna(label):
        if row['firstrun.skip'] != 0.:
            #print(index)  # Too many!
            pass
        assert row['dur'] > 0.
    else:
        assert row['skip'] == 1.
        assert pd.isna(row['dur'])

```nfix``` is ```NaN``` when the token was never fixated. The number of such tokens differs from the number of first-pass skips, because a word skipped in the first pass can still be fixated later: asserting ```row['skip'] == 1.``` in the first-run check (In [65]) throws an error, whereas in the ```nfix``` check (In [66]) it does not.
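
The difference between the two kinds of skip can be illustrated with a toy fixation sequence. The sketch below uses a simplified definition (a word is skipped in the first run if a word further to the right is fixated before it); popEye's exact algorithm may differ, and `skip_flags` is a hypothetical helper.

```python
def skip_flags(fixated_words, n_words):
    """Return (first_run_skip, total_skip) lists with one 0/1 entry per word.

    fixated_words: word indices in the order in which they were fixated.
    A word counts as skipped in the first run if a word further to the right
    was fixated before the word itself received its first fixation.
    """
    first_run_skip = [1] * n_words
    total_skip = [1] * n_words
    rightmost = -1  # rightmost word fixated so far
    for word in fixated_words:
        if total_skip[word] == 1 and word > rightmost:
            first_run_skip[word] = 0  # reached while still moving forward
        total_skip[word] = 0
        rightmost = max(rightmost, word)
    return first_run_skip, total_skip


# the reader jumps over word 1, comes back to it, and never looks at word 4
first_run, total = skip_flags([0, 2, 1, 3], n_words=5)
print(first_run)
print(total)
```

Word 1 is a first-run skip but not a total skip, which is exactly the case that makes the two counts differ.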

In [67]:
mecol2_identifiers = {TOKEN_ID.format(text_id, word_id): word for text_id, words in mecol2_texts.items() for word_id, word in words.items()}

# ignore subjects who have strangely repeated data
EXCLUDED_SUBJS_MECOL2 = ('macmo03', 'macmo06', 'macmo11', 'macmo38', 'macmo39')
mecol2_subject_ids = {subj_id: i for i, subj_id in enumerate(mecol2_raw['subid'].unique()) if subj_id not in EXCLUDED_SUBJS_MECOL2}
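
Because the filter sits inside the comprehension, the excluded subjects still consume an integer, so the resulting subject IDs are not consecutive. The toy example below (made-up subject labels) shows the difference to filtering first; we keep the version with gaps, this is only meant to make the numbering easier to interpret.

```python
EXCLUDED = ('s1', 's3')   # hypothetical subject labels
all_subjects = ['s0', 's1', 's2', 's3', 's4']

# filtering inside the comprehension keeps the position in the full list
with_gaps = {s: i for i, s in enumerate(all_subjects) if s not in EXCLUDED}

# filtering first gives consecutive integers instead
kept = [s for s in all_subjects if s not in EXCLUDED]
contiguous = {s: i for i, s in enumerate(kept)}

print(with_gaps)
print(contiguous)
```

Either way, the original labels can be recovered from the saved ```.json``` maps.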

Check that there are no inconsistencies between the interest area column and the texts:

In [68]:
inconsistent_words = {}

for index, row in mecol2_raw.iterrows():
    if index > 10000:
        # check only up to a point because it's too large
        # the final check is also done upon adding the token to the dictionary below
        break
    text_id = row['itemid']
    word_id = row['ianum']
    word = mecol2_texts[mecol2_text_ids[int(text_id)]][word_id - 1]

    if row['ia'] != word:
        inconsistent_words[index] = (word, row['ia'])

assert len(inconsistent_words) == 0

In [69]:
def get_mecol2_measures(measure, subjects, identifiers, data, replacer):
    """Create dictionary with measure for each subject in MECO-L2 data."""
    # initialize the dictionary with np.nans, so that all identifiers have
    # all subject keys, even when no data was collected for them
    measures = {identifier: {subject: MISSING_LABEL for subject in subjects.values()} for identifier in identifiers.keys()} 

    for index, row in data.iterrows():

        text_id = mecol2_text_ids[float(row['itemid'])]
        token_number = int(row['ianum'] - 1)
        identifier = TOKEN_ID.format(text_id, token_number)
        assert row['ia'].strip(' ') == mecol2_texts[text_id][token_number]
        
        if row['subid'] in EXCLUDED_SUBJS_MECOL2:
            # these subjects have strangely repeated data
            continue
        subject = subjects[row['subid']]

        measure_value = row[measure]
        was_skipped = row['firstrun.skip']
        assert was_skipped in (0., 1.)

        if was_skipped == 1. and replacer is not None:
            measure_value = replacer
        else:
            measure_value = float(measure_value)

        # each observation should be unique, otherwise there is a problem in the data
        assert np.isnan(measures[identifier][subject])
        measures[identifier][subject] = measure_value
            
    return measures

In [70]:
save_meta(MECOL2_NAME, mecol2_texts, mecol2_text_ids, mecol2_subject_ids)

In [71]:
# fix order of the subjects across dataframes
mecol2_ordered_subjs = list(mecol2_subject_ids.values())

def mecol2_build_and_save(column_name, variable_name, replacer):
    measures = get_mecol2_measures(column_name,
                                   mecol2_subject_ids,
                                   mecol2_identifiers,
                                   mecol2_raw,
                                   replacer)

    preproc = create_dataframe(variable_name,
                               measures,
                               mecol2_identifiers,
                               mecol2_ordered_subjs)
    save_preprocessed(MECOL2_NAME, variable_name, preproc)
    return preproc


mecol2_fpregs = mecol2_build_and_save('firstrun.reg.out', FPREGOUT, replacer=SKIP_LABEL)

In [72]:
mecol2_fpregs.isna().sum()

Identifier             0
Token                  0
fpregout:Subj_0      676
fpregout:Subj_1      635
fpregout:Subj_2       98
                    ... 
fpregout:Subj_538    526
fpregout:Subj_539    455
fpregout:Subj_540    963
fpregout:Subj_541    803
fpregout:Subj_542    801
Length: 540, dtype: int64

In [73]:
mecol2_fpregs

Unnamed: 0,Identifier,Token,fpregout:Subj_0,fpregout:Subj_1,fpregout:Subj_2,fpregout:Subj_3,fpregout:Subj_4,fpregout:Subj_5,fpregout:Subj_6,fpregout:Subj_7,...,fpregout:Subj_533,fpregout:Subj_534,fpregout:Subj_535,fpregout:Subj_536,fpregout:Subj_537,fpregout:Subj_538,fpregout:Subj_539,fpregout:Subj_540,fpregout:Subj_541,fpregout:Subj_542
0,text_0_token_0,Samuel,0.0,-1.0,0.0,,0.0,,0.0,,...,0.0,-1.0,,,0.0,0.0,0.0,-1.0,-1.0,-1.0
1,text_0_token_1,"Morse,",-1.0,1.0,-1.0,,1.0,,0.0,,...,1.0,-1.0,,,1.0,1.0,1.0,1.0,-1.0,1.0
2,text_0_token_2,best,1.0,-1.0,1.0,,0.0,,0.0,,...,0.0,-1.0,,,0.0,0.0,0.0,-1.0,-1.0,0.0
3,text_0_token_3,known,0.0,0.0,-1.0,,-1.0,,0.0,,...,0.0,-1.0,,,-1.0,0.0,0.0,0.0,-1.0,0.0
4,text_0_token_4,today,0.0,0.0,1.0,,1.0,,-1.0,,...,0.0,-1.0,,,-1.0,0.0,0.0,0.0,-1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1653,text_11_token_143,their,-1.0,0.0,-1.0,1.0,0.0,0.0,0.0,0.0,...,,0.0,,,,0.0,,,-1.0,-1.0
1654,text_11_token_144,connectivity,-1.0,-1.0,0.0,-1.0,-1.0,0.0,0.0,0.0,...,,0.0,,,,0.0,,,1.0,-1.0
1655,text_11_token_145,in,-1.0,-1.0,-1.0,1.0,1.0,0.0,-1.0,-1.0,...,,0.0,,,,-1.0,,,0.0,-1.0
1656,text_11_token_146,personal,-1.0,-1.0,1.0,0.0,1.0,1.0,0.0,1.0,...,,0.0,,,,0.0,,,1.0,-1.0


## Nicenboim's corpus
Preprocessing Nicenboim's corpus (Spanish).

- 71 participants
- EyeLink 1000 
- 120+48 items

This corpus was released as an ```rda``` file. We use the ```pyreadr``` library to read it into a pandas dataframe (```rdata``` threw some warnings).

I could not find documentation anywhere. Nicenboim could not confirm whether these are the variables we want. He pointed to [this link](https://mran.microsoft.com/snapshot/2014-08-18_0233/web/packages/em2/em2.pdf), but it is no longer available. So we'll use:

- ```fp_reg```: assuming that it refers to first-pass regression like the others

Although this corpus has a measure ```skip```, it's unclear whether it refers to first-pass skips or not. We will use ```FPRT``` as an auxiliary variable. Although no documentation is available, the PoTeC documentation has a measure with the same name. We inspected that column and it seems to contain ```NaN```s for the tokens skipped in the first pass (all other values are numbers > 0).

Participants saw sentences with different conditions. It would require some workaround if we wanted to use this corpus. But we can use only the fillers, which seem to have been the same across participants, based on the reverse engineering below. Nicenboim said that the fillers were the same for all participants.

In [74]:
NICENBOIM_NAME = 'nicenboim_es'

PATH_TO_NICENBOIM = 'data/Nicenboim/NicenboimEtAl2013ET.Rda'
nicenboim_raw = pyreadr.read_r(PATH_TO_NICENBOIM)
print(nicenboim_raw.keys())

odict_keys(['indivET', 'dataET', 'dataexpET'])


I understand that 'dataET' contains all sentences (from their experiment, the secondary experiment and the fillers) and 'dataexpET' contains only their experiment. Nicenboim also confirmed that "dataET includes experimental sentences for other experiments as well that were used as fillers here". 

In [75]:
nicenboim_raw = nicenboim_raw['dataET']

In [76]:
nicenboim_raw

rownames,subj,sentenceid,FFD,FFP,SFD,FPRT,RBRT,TFT,RPD,CRPD,...,x,xbeg,xend,p,pcu,ran,word,wordn,question.acc,question.corrans
1,2,156,169,1,169,169,169,169,169,169,...,33,10,36,728,0.800635,19.039000,Los,1,,
169,2,156,231,1,231,231,231,231,231,400,...,99,41,111,756,0.800635,19.039000,pasajeros,2,,
337,2,156,,0,0,,0,,,400,...,,116,189,,0.800635,19.039000,ignoraban,3,,
505,2,156,321,1,321,321,321,321,321,721,...,204,194,206,763,0.800635,19.039000,si,4,,
673,2,156,178,1,0,178,178,329,178,899,...,246,211,300,697,0.800635,19.039000,solucionaría,5,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209184,76,169,363,1,363,363,363,363,363,1869,...,445,404,450,683,0.572381,19.556625,novela,10,1.0,F
209349,76,169,,0,0,,0,,,1869,...,,455,467,,0.572381,19.556625,la,11,1.0,F
209511,76,169,299,1,299,299,299,299,299,2168,...,502,472,530,693,0.572381,19.556625,maestra,12,1.0,F
209671,76,169,301,1,301,301,301,301,301,2469,...,544,535,553,671,0.572381,19.556625,de,13,1.0,F


A bit of reverse engineering to identify sentences that were ***not*** the same across participants, and later exclude them.

In [77]:
aux_texts = {}
varying_sentences = set()

for index, row in nicenboim_raw.iterrows():
    text_id = row['sentenceid']
    if text_id not in aux_texts:
        aux_texts[text_id] = {}

    word = row['word']
    word_id = row['wordn']

    if word_id not in aux_texts[text_id]:
        aux_texts[text_id][word_id] = word
    if word != aux_texts[text_id][word_id]:
        varying_sentences.add(text_id)

In [78]:
print(f'{len(varying_sentences)} out of {len(nicenboim_raw["sentenceid"].unique())} sentences vary across subjects.') 

120 out of 168 sentences vary across subjects.
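
The detection logic can be checked on a handful of toy rows. This is a minimal sketch with invented sentences; `find_varying` is a hypothetical helper that mirrors the loop above.

```python
def find_varying(rows):
    """Return sentence ids whose word at some position differs between rows.

    rows: iterable of (sentence_id, word_position, word) tuples, pooled over subjects.
    """
    seen = {}
    varying = set()
    for sentence_id, position, word in rows:
        # remember the first word seen at this position, compare all later ones
        first_word = seen.setdefault((sentence_id, position), word)
        if word != first_word:
            varying.add(sentence_id)
    return varying


rows = [
    ('7', 1, 'Los'), ('7', 2, 'pasajeros'),   # subject A
    ('7', 1, 'Los'), ('7', 2, 'pasajeros'),   # subject B, same sentence
    ('8', 1, 'La'), ('8', 2, 'maestra'),      # subject A
    ('8', 1, 'El'), ('8', 2, 'maestro'),      # subject B, different version
]
varying = find_varying(rows)
print(varying)
```

One limitation of this comparison: two versions that differ only by extra words at the end would not be flagged, because there is no conflicting word at any shared position.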


Build the texts, i.e. the sentences that were the same across subjects, which are the ones we will use. Also check that the tokens were the same for all subjects.

In [79]:
nicenboim_texts = {}
nicenboim_text_ids = {}

for index, row in nicenboim_raw.iterrows():
    text_id = row['sentenceid']
    if text_id in varying_sentences:
        # we'll not use sentences that vary across subjects
        continue
    if text_id not in nicenboim_text_ids:
        new_id = len(nicenboim_text_ids)
        nicenboim_text_ids[text_id] = new_id
        nicenboim_texts[new_id] = {}

    word_id = row['wordn'] - 1
    word = row['word']

    text_id = nicenboim_text_ids[text_id]
    if word_id not in nicenboim_texts[text_id]:
        nicenboim_texts[text_id][word_id] = word
    assert word == nicenboim_texts[text_id][word_id]

Get the filtered dataframe containing only filler sentences:

In [80]:
fillers = list(nicenboim_text_ids.keys())
nicenboim_filtered = nicenboim_raw[(nicenboim_raw.sentenceid.isin(fillers))]

Some integers are not used as sentence IDs, for some reason:

In [81]:
for x in range(0, 171):
    if str(x) not in fillers and str(x) not in varying_sentences:
        print(x)

0
97
170


```fp_reg``` has no ```NaN```s to handle here.

In [82]:
print(nicenboim_filtered['fp_reg'].value_counts())
print('NaNs:', nicenboim_filtered['fp_reg'].isna().sum())

fp_reg
0.0    50087
1.0     6055
Name: count, dtype: int64
NaNs: 0


```FPRT``` has no zero values (nothing below 1), but it does have NaNs:

In [83]:
print((nicenboim_filtered['FPRT']<1).sum())
print('NaNs:', nicenboim_filtered['FPRT'].isna().sum())

0
NaNs: 23263
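
Rows without a first-pass reading time are the ones later recoded as skipped. A minimal sketch of that recoding on toy values; the `-1.0` skip value follows the convention stated in the Decisions section, and the names `toy`, `SKIP` and `skipped` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

SKIP = -1.0  # assumed to match the notebook's skip label

# Toy rows (hypothetical values): two fixated tokens, two skipped ones.
toy = pd.DataFrame({
    'FPRT':   [210.0, np.nan, 180.0, np.nan],
    'fp_reg': [0.0, 0.0, 1.0, 0.0],
})

# A missing FPRT is read as "token skipped in the first pass".
skipped = toy['FPRT'].isna()
# A skipped token should not have launched a first-pass regression.
assert (toy.loc[skipped, 'fp_reg'] == 0).all()

# Keep fp_reg where the token was fixated, use SKIP elsewhere.
toy['fpregout'] = toy['fp_reg'].where(~skipped, SKIP)
print(toy['fpregout'].tolist())
```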


In [84]:
# fp_reg must be binary once NaNs are dropped
assert set(nicenboim_raw['fp_reg'].dropna().unique()) == {0, 1}

In [85]:
nicenboim_identifiers = {TOKEN_ID.format(text_id, word_id): word for text_id, words in nicenboim_texts.items() for word_id, word in words.items()}
nicenboim_subject_ids = {subj_id: i for i, subj_id in enumerate(nicenboim_raw['subj'].unique())}

In [86]:
def get_nicenboim_measures(measure, subjects, identifiers, data, replacer):
    """Create dictionary with measure for each subject in Nicenboim data.

    `data` must contain only sentences present in nicenboim_text_ids.
    If `replacer` is not None, it replaces the value of tokens without FPRT,
    i.e. tokens skipped in the first pass.
    """
    # start with every (token, subject) cell marked as missing
    measures = {identifier: {subject: MISSING_LABEL for subject in subjects.values()} for identifier in identifiers}

    for index, row in data.iterrows():

        text_id = nicenboim_text_ids[row['sentenceid']]
        token_number = int(row['wordn'] - 1)
        identifier = TOKEN_ID.format(text_id, token_number)
        assert row['word'] == nicenboim_texts[text_id][token_number]

        subject = subjects[row['subj']]
        measure_value = row[measure]

        if pd.isna(row['FPRT']) and replacer is not None:
            # no first-pass reading time: the token was skipped
            assert measure_value == 0.
            measure_value = replacer
        else:
            assert row['FPRT'] > 0
            measure_value = float(measure_value)

        # each (token, subject) cell must be filled at most once
        assert np.isnan(measures[identifier][subject])
        measures[identifier][subject] = measure_value

    return measures

In [87]:
save_meta(NICENBOIM_NAME, nicenboim_texts, nicenboim_text_ids, nicenboim_subject_ids)

In [88]:
# fix order of the subjects across dataframes
nicenboim_ordered_subjs = list(nicenboim_subject_ids.values())

def nicenboim_build_and_save(column_name, variable_name, replacer):
    # pass the filtered data: the raw data still contains the excluded sentences
    measures = get_nicenboim_measures(column_name, nicenboim_subject_ids, nicenboim_identifiers, nicenboim_filtered, replacer)
    preproc = create_dataframe(variable_name, measures, nicenboim_identifiers, nicenboim_ordered_subjs)
    save_preprocessed(NICENBOIM_NAME, variable_name, preproc)
    return preproc

nicenboim_fpregs = nicenboim_build_and_save('fp_reg', FPREGOUT, replacer=SKIP_LABEL)

In [89]:
nicenboim_fpregs.isna().sum()

Identifier           0
Token                0
fpregout:Subj_0      0
fpregout:Subj_1     19
fpregout:Subj_2      0
                    ..
fpregout:Subj_66     0
fpregout:Subj_67     0
fpregout:Subj_68     0
fpregout:Subj_69     0
fpregout:Subj_70     0
Length: 73, dtype: int64

In [90]:
nicenboim_fpregs

Unnamed: 0,Identifier,Token,fpregout:Subj_0,fpregout:Subj_1,fpregout:Subj_2,fpregout:Subj_3,fpregout:Subj_4,fpregout:Subj_5,fpregout:Subj_6,fpregout:Subj_7,...,fpregout:Subj_61,fpregout:Subj_62,fpregout:Subj_63,fpregout:Subj_64,fpregout:Subj_65,fpregout:Subj_66,fpregout:Subj_67,fpregout:Subj_68,fpregout:Subj_69,fpregout:Subj_70
0,text_0_token_0,El,0.0,0.0,-1.0,-1.0,0.0,0.0,0.0,0.0,...,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0
1,text_0_token_1,boxeador,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0
2,text_0_token_2,anunció,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,...,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
3,text_0_token_3,que,0.0,-1.0,-1.0,0.0,-1.0,-1.0,0.0,0.0,...,-1.0,1.0,-1.0,0.0,0.0,-1.0,0.0,0.0,0.0,-1.0
4,text_0_token_4,se,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,0.0,0.0,...,-1.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
786,text_47_token_11,había,0.0,0.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,...,0.0,-1.0,-1.0,0.0,-1.0,-1.0,0.0,0.0,-1.0,0.0
787,text_47_token_12,derrotado,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
788,text_47_token_13,a,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,0.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0
789,text_47_token_14,Franco,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
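
The long-to-wide reshaping behind this table can be sketched with a pivot. This is not the notebook's `create_dataframe` helper, only an illustration on toy rows: the column naming `fpregout:Subj_i` follows the output above, and everything else (`toy_long`, `wide`, the values) is assumed for the example.

```python
import numpy as np
import pandas as pd

# Toy long-format rows (hypothetical): subject 'b' never saw position 2.
toy_long = pd.DataFrame({
    'subj':       ['a', 'a', 'b'],
    'sentenceid': ['1', '1', '1'],
    'wordn':      [1, 2, 1],
    'word':       ['El', 'boxeador', 'El'],
    'fp_reg':     [0.0, 1.0, -1.0],
})

# Zero-based integer subject ids, in order of first appearance.
subject_ids = {s: i for i, s in enumerate(toy_long['subj'].unique())}

wide = (
    toy_long
    .assign(subject=toy_long['subj'].map(subject_ids),
            position=toy_long['wordn'] - 1)  # positions start from 0
    .pivot(index=['sentenceid', 'position', 'word'],
           columns='subject', values='fp_reg')
)
# One column per subject; cells with no observation stay NaN.
wide.columns = [f'fpregout:Subj_{i}' for i in wide.columns]
wide = wide.reset_index()
print(wide)
```

`pivot` raises an error when the same token and subject pair occurs twice, which plays a role similar to the assertion that each cell is filled only once.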
