In [1]:
# default_exp datasets

# Preparing Free Recall Data
Whether we work with simulated free recall data or the real deal, the data has to be represented in a standardized way so that library functions can be used interchangeably.

For most of our analyses of free recall data, we'll follow the lead of the [Psifr library](https://psifr.readthedocs.io/en/latest/index.html) and represent our data in a long table format (encoded using `pandas`). Each row corresponds to a study or recall event and tracks a subject index, a trial index, an input or output position, and an item id. To identify items, we'll use a unique index or text string, depending on the features of the item we're interested in.
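
As a rough illustration of this format, the sketch below builds the raw event table for one made-up trial. The words and values are invented for the example; the column names are the ones the preparation functions in this notebook pass to `fr.merge_free_recall`.

```python
import pandas as pd

# Hypothetical trial: subject 1 studies three words on list 1, then recalls two.
rows = [
    # subject, list, trial_type, position, item
    [1, 1, 'study', 1, 'hollow'],
    [1, 1, 'study', 2, 'pupil'],
    [1, 1, 'study', 3, 'river'],
    [1, 1, 'recall', 1, 'river'],   # first output: the last studied word
    [1, 1, 'recall', 2, 'hollow'],  # second output: the first studied word
]

events = pd.DataFrame(
    rows, columns=['subject', 'list', 'trial_type', 'position', 'item'])

# `position` is an input position on study rows and an output position on recall rows
print(events)
```

Items are text strings here; the datasets prepared in this notebook use integer indices in the same column.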

Here we'll demonstrate the format using a few well-known datasets I'm using for research. Data preprocessing code is assigned to the `datasets` submodule of the `compmemlearn` package.

## Murdock1962 Dataset
> Murdock, B. B., Jr. (1962). The serial position effect of free recall. Journal of Experimental Psychology, 64(5), 482-488. https://doi.org/10.1037/h0045106  

Our data structure associated with Murdock (1962) has three `LL` structures, each of which seems to correspond to a different dataset with a different list length. Each structure contains:
- `recalls`, with 1200 rows and 50 columns. Each row presumably represents a trial, and each column seems to
  correspond to a recall position, with -1 coded for intrusions. `MurdData_clean.mat` probably doesn't have these
  intrusions coded at all.
- `listlength`, an integer indicating how long the studied list is.
- `subject`, a 1200x1 vector coding the identity of the subject for each row. Subjects seem to get 80 rows apiece. Did he really collect that much data for each subject?
- `session`, which similarly codes the index of the session for each row; it's always 1 in this case.
- `presitemnumbers`, which probably codes the number associated with each item; here that is just its presentation index.

We'll enable selection of relevant information from these structures based on which `LL` structure we're interested in using a `dataset_index` parameter.
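
Because the `subject` vector only says which subject each row belongs to, list numbers within a subject have to be derived. A minimal sketch of that step, assuming rows for the same subject are stored contiguously (`list_indices` is a hypothetical helper name, not part of the library):

```python
def list_indices(subjects):
    """Number the rows of each subject 1, 2, 3, ... in order of appearance.

    Assumes all rows belonging to one subject are adjacent.
    """
    indices = []
    previous = None
    count = 0
    for subject in subjects:
        # restart the counter whenever the subject code changes
        if subject != previous:
            count = 0
            previous = subject
        count += 1
        indices.append(count)
    return indices


print(list_indices([1, 1, 1, 2, 2, 3]))  # [1, 2, 3, 1, 2, 1]
```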

In [2]:
# export

import scipy.io as sio
import numpy as np
import pandas as pd
from psifr import fr

def prepare_murdock1962_data(path, dataset_index=0):
    """
    Prepares data formatted like `data/MurdData_clean.mat` for fitting.

    Loads data from `path` with same format as `data/MurdData_clean.mat` and
    returns a selected dataset as an array of unique recall trials and a
    dataframe of unique study and recall events organized according to `psifr`
    specifications.

    **Arguments**:
    - path: source of data file
    - dataset_index: index of the dataset to be extracted from the file

    **Returns**:
    - trials: int64-array where rows identify a unique trial of responses and
        columns correspond to a unique recall index.
    - merged: a long format table where each row describes one study or
        recall event.
    - list_length: length of lists studied in the considered dataset
    """

    # load all the data
    matfile = sio.loadmat(path, squeeze_me=True)
    murd_data = [matfile['data'].item()[0][i].item() for i in range(3)]

    # encode dataset into psifr format
    trials, list_length, subjects = murd_data[dataset_index][:3]
    trials = trials.astype('int64')

    data = []
    for trial_index, trial in enumerate(trials):

        # every time the subject changes, reset list_index
        if not data or data[-1][0] != subjects[trial_index]:
            list_index = 0
        list_index += 1

        # add study events
        for i in range(list_length):
            data += [[subjects[trial_index],
                      list_index, 'study', i+1, i+1]]

        # add recall events
        for recall_index, recall_event in enumerate(trial):
            if recall_event != 0:
                data += [[subjects[trial_index], list_index,
                          'recall', recall_index+1, recall_event]]

    data = pd.DataFrame(data, columns=[
        'subject', 'list', 'trial_type', 'position', 'item'])
    merged = fr.merge_free_recall(data)
    return trials, merged, list_length

In [3]:
murd_trials, murd_events, murd_length = prepare_murdock1962_data(
    '../data/MurdData_clean.mat', 0)

murd_events.head()

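| | subject | list | item | input | output | study | recall | repeat | intrusion |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 5.0 | True | True | 0 | False |
| 1 | 1 | 1 | 2 | 2 | 7.0 | True | True | 0 | False |
| 2 | 1 | 1 | 3 | 3 | NaN | True | False | 0 | False |
| 3 | 1 | 1 | 4 | 4 | NaN | True | False | 0 | False |
| 4 | 1 | 1 | 5 | 5 | NaN | True | False | 0 | False |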


## MurdockOkada1970 Dataset
> Murdock, B. B., & Okada, R. (1970). Interresponse times in single-trial free recall. Journal of Experimental Psychology, 86(2), 263.

The authors investigated interresponse times in single-trial free recall. Each of 72 undergraduates was given 20 test lists of 20 words each, presented visually at either 60 or 120 words/min. The format of the data in these files is as follows:  

Row 1: subject/trial information  
Row 2: serial position as a function of output position  
Row 3: inter-response time as a function of output position  

The code 88 means that the subject made an extra-list intrusion.
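
A minimal sketch of how the second row of one such record might be read is shown below. The sample text is made up, whitespace-separated fields are an assumption about the file layout, and `parse_recall_row` is a hypothetical helper rather than part of the library.

```python
def parse_recall_row(row, list_length=20):
    """Return the unique within-list serial positions in output order.

    Codes outside 1..list_length (such as 88 for extra-list intrusions)
    and repeated recalls of the same position are dropped.
    """
    recalled = []
    for token in row.split():
        position = int(token)
        if 1 <= position <= list_length and position not in recalled:
            recalled.append(position)
    return recalled


# Made-up record in the three-row layout described above.
record = [
    "1    1",                                  # row 1: subject/trial information
    "20    19    88    4    19    7",          # row 2: serial positions by output position
    "0    950    1210    800    400    2300",  # row 3: inter-response times
]

print(parse_recall_row(record[1]))  # [20, 19, 4, 7]
```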

In [4]:
# export

def prepare_murdock1970_data(path):
    """
    Prepares data formatted like `data/mo1970.txt` for fitting.

    Loads data from `path` with same format as `data/mo1970.txt` and
    returns an array of unique recall trials and a dataframe of unique study
    and recall events organized according to `psifr` specifications.

    **Arguments**:
    - path: source of data file

    **Returns**:
    - trials: int64-array where rows identify a unique trial of responses and
        columns correspond to a unique recall index.
    - merged: a long format table where each row describes one study or
        recall event.
    - list_length: length of lists studied in the considered dataset
    """
    
    with open(path) as f:
        oka_data = f.read()

    counter = 0
    trials = []
    subjects = []
    list_length = 20

    for line in oka_data.split('\n'):

        if not line:
            continue

        # build subjects array
        if counter == 0:
            subjects.append(int(line.strip().split('    ')[1]))

        # build trials array
        if counter == 1:

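            trial = [int(each) for each in line.strip().split('    ')]

            # drop extra-list intrusions (coded 88), keeping serial positions
            trial = [each for each in trial if each <= list_length]

            # drop repeated recalls, keeping first occurrences in output order
            already = []
            for each in trial:
                if each not in already:
                    already.append(each)
            trial = already

            # pad with zeros so every trial has at least 13 recall positions
            while len(trial) < 13:
                trial.append(0)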

            trials.append(trial)

        # keep track of which row we are on for the given trial
        counter += 1
        if counter == 3:
            counter = 0

    trials = np.array(trials).astype('int64')
    
    data = []
    for trial_index, trial in enumerate(trials):

        # every time the subject changes, reset list_index
        if not data or data[-1][0] != subjects[trial_index]:
            list_index = 0
        list_index += 1

        # add study events
        for i in range(list_length):
            data += [[subjects[trial_index], 
                      list_index, 'study', i+1, i+1]]

        # add recall events
        for recall_index, recall_event in enumerate(trial):
            if recall_event != 0:
                data += [[subjects[trial_index], list_index, 
                          'recall', recall_index+1, recall_event]]

    data = pd.DataFrame(data, columns=[
        'subject', 'list', 'trial_type', 'position', 'item'])
    merged = fr.merge_free_recall(data)
    return trials, merged, list_length

In [5]:
murd_trials, murd_events, murd_length = prepare_murdock1970_data('../data/mo1970.txt')
murd_events.head()

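| | subject | list | item | input | output | study | recall | repeat | intrusion |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | NaN | True | False | 0 | False |
| 1 | 1 | 1 | 2 | 2 | NaN | True | False | 0 | False |
| 2 | 1 | 1 | 3 | 3 | NaN | True | False | 0 | False |
| 3 | 1 | 1 | 4 | 4 | NaN | True | False | 0 | False |
| 4 | 1 | 1 | 5 | 5 | NaN | True | False | 0 | False |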


## Lohnas2014 Dataset
> Lohnas, L. J., & Kahana, M. J. (2014). A retrieved context account of spacing and repetition effects in free recall. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(3), 755.

Across 4 sessions, 35 subjects performed delayed free recall of 48 lists. Subjects were University of Pennsylvania undergraduates, graduates and staff, age 18-32. List items were drawn from a pool of 1638 words taken from the University of South Florida free association norms (Nelson, McEvoy, & Schreiber, 2004; Steyvers, Shiffrin, & Nelson, 2004, available at http://memory.psych.upenn.edu/files/wordpools/PEERS_wordpool.zip). Within each session, words were drawn without replacement. Words could repeat across sessions so long as they did not repeat in two successive sessions. Words were also selected to ensure that no strong semantic associates co-occurred in a given list (i.e., the semantic relatedness between any two words on a given list, as determined using WAS (Steyvers et al., 2004), did not exceed a threshold value of 0.55).

Subjects encountered four different types of lists:
1. control lists containing only once-presented items;
2. pure massed lists containing only twice-presented items;
3. pure spaced lists consisting of items presented twice at lags 1-8, where lag is defined as the number of intervening items between a repeated item's presentations;
4. mixed lists consisting of once-presented, massed, and spaced items.

Within each session, subjects encountered three lists of each of these four types.

In each list there were 40 presentation positions, such that in the control lists each position was occupied by a unique list item, and in the pure massed and pure spaced lists, 20 unique words were presented twice to occupy the 40 positions. In the mixed lists 28 once-presented and six twice-presented words occupied the 40 positions. In the pure spaced lists, spacings of repeated items were chosen so that each of the lags 1-8 occurred with equal probability. In the mixed lists, massed repetitions (lag=0) and spaced repetitions (lags 1-8) were chosen such that each of the 9 lags of 0-8 were used exactly twice within each session. The order of presentation for the different list types was randomized within each session. For the first session, the first four lists were chosen so that each list type was presented exactly once. An experimenter sat in with the subject for these first four lists, though no subject had difficulty understanding the task.
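
To make the lag definition concrete, here is a small sketch that computes the lag of each repeated item in a presentation sequence. `repetition_lags` is a hypothetical helper, and it assumes each item is presented at most twice, as in this experiment.

```python
def repetition_lags(presentation):
    """Return {item: lag} for every item that appears twice.

    Lag is the number of intervening items between the two presentations,
    so a massed repetition has lag 0.
    """
    first_position = {}
    lags = {}
    for position, item in enumerate(presentation):
        if item in first_position:
            lags[item] = position - first_position[item] - 1
        else:
            first_position[item] = position
    return lags


# Made-up six-position list: item 0 is massed, item 1 is spaced with one item between.
print(repetition_lags([0, 0, 1, 2, 1, 3]))  # {0: 0, 1: 1}
```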

The data for this experiment is stored in `data/repFR.mat`. We define a dedicated `prepare_lohnas2014_data` function to build structures from the dataset that work with our existing data analysis and fitting functions.

As in `prepare_murdock1962_data`, we need list lengths, a data frame for visualizations with psifr, and a trials array encoding recall events as sequences of presentation positions. Here we also need an additional array tracking presentation order.
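
The presentation-order array can be thought of as relabeling each list's item numbers by order of first appearance, so that repeated presentations share an index. A minimal sketch with made-up item numbers (`to_unique_indices` is a hypothetical name):

```python
def to_unique_indices(pres_itemnos):
    """Map item numbers to the order in which each item first appeared."""
    first_index = {}
    # setdefault assigns the next free index to unseen items and
    # returns the stored index for items seen before
    return [first_index.setdefault(item, len(first_index))
            for item in pres_itemnos]


# Made-up word-pool numbers for a six-position list with two repeated items.
pres_itemnos = [501, 17, 501, 88, 17, 342]
presentation = to_unique_indices(pres_itemnos)
print(presentation)  # [0, 1, 0, 2, 1, 3]

# A recall coded as study position 5 (1-indexed) names the same item as position 2.
print(presentation[5 - 1] == presentation[2 - 1])  # True
```

With this array in hand, a recall stored as a study position can be resolved to an item index by a single lookup, which is how the recall events are filled in.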

In [71]:
# export

def prepare_lohnas2014_data(path):
    """
    Prepares data formatted like `data/repFR.mat` for fitting.
    """
    
    # load all the data
    matfile = sio.loadmat(path, squeeze_me=True)['data'].item()
    subjects = matfile[0]
    pres_itemnos = matfile[4]
    recalls = matfile[6]
    list_types = matfile[7]
    list_length = matfile[12]
    
    # convert pres_itemnos into rows of unique indices for easier model encoding
    presentations = []
    for i in range(len(pres_itemnos)):
        seen = []
        presentations.append([])
        for p in pres_itemnos[i]:
            if p not in seen:
                seen.append(p)
            presentations[-1].append(seen.index(p))
    presentations = np.array(presentations)

    # discard intrusions from recalls
    trials = []
    for i in range(len(recalls)):
        trials.append([])
        
        trial = list(recalls[i])
        for t in trial:
            if (t > 0) and (t not in trials[-1]):
                trials[-1].append(t)
        
        while len(trials[-1]) < list_length:
            trials[-1].append(0)
            
    trials = np.array(trials)
    
    # encode dataset into psifr format
    data = []
    for trial_index, trial in enumerate(trials):
        presentation = presentations[trial_index]
        
        # every time the subject changes, reset list_index
        if not data or data[-1][0] != subjects[trial_index]:
            list_index = 0
        list_index += 1
        
        # add study events
        for presentation_index, presentation_event in enumerate(presentation):
            data += [[subjects[trial_index], 
                      list_index, 'study', presentation_index+1, presentation_event,  list_types[trial_index]
                     ]]
            
        # add recall events
        for recall_index, recall_event in enumerate(trial):
            if recall_event != 0:
                data += [[subjects[trial_index], list_index, 
                          'recall', recall_index+1, presentation[recall_event-1], list_types[trial_index]
                         ]]
                
    data = pd.DataFrame(data, columns=[
        'subject', 'list', 'trial_type', 'position', 'item', 'condition'])
    merged = fr.merge_free_recall(data, list_keys=['condition'])
    
    return trials, merged, list_length, presentations, list_types, data, subjects

In [72]:
trials, events, list_length, presentations, list_types, rep_data, subjects = prepare_lohnas2014_data(
    '../data/repFR.mat')

events.head()

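| | subject | list | item | input | output | study | recall | repeat | intrusion | condition |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1.0 | True | True | 0 | False | 4 |
| 1 | 1 | 1 | 1 | 2 | 2.0 | True | True | 0 | False | 4 |
| 2 | 1 | 1 | 2 | 3 | 3.0 | True | True | 0 | False | 4 |
| 3 | 1 | 1 | 3 | 4 | 4.0 | True | True | 0 | False | 4 |
| 4 | 1 | 1 | 4 | 5 | 5.0 | True | True | 0 | False | 4 |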


## HowaKaha05 Dataset
> Kahana, M. J., & Howard, M. W. (2005). Spacing and lag effects in free recall of pure lists. Psychonomic Bulletin & Review, 12(1), 159-164.

Sixty-six students studied and attempted free recall of 15 different lists of high-frequency nouns drawn from the Toronto Noun Pool (Friendly, Franklin, Hoffman, & Rubin, 1982). The lists consisted of 30 words, each repeated three times for a total of 90 presentations per list. List presentation was auditory, and the subjects made their responses vocally into a headset microphone. The words were presented at a rate of one every 1.5 sec. After list presentation, the subjects were given a distractor task involving simple arithmetic problems of the form A + B + C = ?. The subjects had to correctly answer 15 problems in a row before they could proceed to the recall phase.

There were three list types: massed, spaced short, and spaced long. In the massed lists, each word was repeated three times successively. In the spaced-short lists, the presentation order was randomized, subject to the constraint that the lag between repetitions was at least 2 and no more than 6. For the spaced-long lists, presentation order was randomized, subject to the constraint that interrepetition lags were at least 6 and not more than 20.

As is typical in free recall studies, we took measures to eliminate warm-up effects by excluding the first 2 lists from our data analyses. One of these first 2 practice lists was massed, and the other was randomly chosen to be either spaced short or spaced long. Of the subsequent 12 lists, 4 were massed, 4 were spaced short, and 4 were spaced long, presented in an individually randomized order for each subject.

In [97]:
# export
def prepare_howakaha05_data(path):
    """
    Prepares data formatted like `../data/HowaKaha05.dat` for fitting.
    """
    
    with open(path) as f:
        howa_data = f.read()

    subject_count = 66
    trial_count = 15
    total_lines = subject_count * trial_count * 5  # each trial spans 5 lines
    list_length = 90

    lines = [each.split('\t') for each in howa_data.split('\n')]
    trial_info_inds = np.arange(1, total_lines, 5)
    presentation_info_inds = np.arange(2, total_lines, 5)
    recall_info_inds = np.arange(4, total_lines, 5)

    # build vectors/matrices tracking list types and presentation item numbers across trials
    list_types = np.array([int(lines[trial_info_inds[i]-1][2]) for i in range(subject_count * trial_count)])
    subjects = np.array([int(lines[trial_info_inds[i]-1][0]) for i in range(subject_count * trial_count)])
    pres_itemnos = np.array([[int(each) for each in lines[presentation_info_inds[i]-1][:-1]] for i in range(
        subject_count * trial_count)])
        
    # convert pres_itemnos into rows of unique indices for easier model encoding
    presentations = []
    for i in range(len(pres_itemnos)):
        seen = []
        presentations.append([])
        for p in pres_itemnos[i]:
            if p not in seen:
                seen.append(p)
            presentations[-1].append(seen.index(p))
    presentations = np.array(presentations)

    # track recalls, discarding intrusions
    trials = []
    for i in range(subject_count * trial_count):
        trials.append([])
        
        # if it can be cast as a positive integer and is not yet in the recall sequence, it's not an intrusion
        trial = lines[recall_info_inds[i]-1][:-1]
        for t in trial:
            try:
                t = int(t)
                if (t in pres_itemnos[i]):
                    item = presentations[i][np.where(pres_itemnos[i] == t)[0][0]]+1
                    if item not in trials[-1]:
                        trials[-1].append(item)
            except ValueError:
                continue
        
        # pad with zeros to make sure the list is the right length
        while len(trials[-1]) < list_length:
            trials[-1].append(0)
            
    trials = np.array(trials)

    # encode dataset into psifr format
    data = []
    for trial_index, trial in enumerate(trials):
        presentation = presentations[trial_index]
        
        # every time the subject changes, reset list_index
        if not data or data[-1][0] != subjects[trial_index]:
            list_index = 0
        list_index += 1
        
        # add study events
        for presentation_index, presentation_event in enumerate(presentation):
            data += [[subjects[trial_index], 
                      list_index, 'study', presentation_index+1, presentation_event,  list_types[trial_index]
                     ]]
            
        # add recall events
        for recall_index, recall_event in enumerate(trial):
            if recall_event != 0:
                data += [[subjects[trial_index], list_index, 
                          'recall', recall_index+1, presentation[recall_event-1], list_types[trial_index]
                         ]]
                
    data = pd.DataFrame(data, columns=[
        'subject', 'list', 'trial_type', 'position', 'item', 'condition'])
    merged = fr.merge_free_recall(data, list_keys=['condition'])
    
    return trials, merged, list_length, presentations, list_types, data, subjects

In [98]:
trials, events, list_length, presentations, list_types, rep_data, subjects = prepare_howakaha05_data(
    '../data/HowaKaha05.dat')

events.head()

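| | subject | list | item | input | output | study | recall | repeat | intrusion | condition |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 118 | 1 | 0 | 1 | 3.0 | True | True | 0 | False | 0 |
| 1 | 118 | 1 | 0 | 1 | 9.0 | False | True | 1 | False | 0 |
| 2 | 118 | 1 | 0 | 2 | 3.0 | True | True | 0 | False | 0 |
| 3 | 118 | 1 | 0 | 2 | 9.0 | False | True | 1 | False | 0 |
| 4 | 118 | 1 | 0 | 3 | 3.0 | True | True | 0 | False | 0 |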


## Simulated Datasets
The approach for creating simulated datasets is to initialize a model with specified parameters and experience sequences and then populate a psifr-formatted dataframe with the outcomes of simulated free recall.

The `simulate_data` function below presumes that each item is presented just once and that a model has already been initialized, so it is best suited to quick baseline characterization of model performance. Datasets with item repetitions during presentation violate this premise; a more specialized function is normally necessary to simulate them in a performant way.

Because simulations run this way have so far fed directly into visualization, no corresponding `trials` array is produced.
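
To show the interface such a model needs, here is a toy stand-in that recalls items in random order, with the event rows built the same way `simulate_data` builds them. `ToyModel` is entirely hypothetical. In particular, having `free_recall` return the forced item as its first output is an assumption of this sketch, not a description of any real model in the library.

```python
import random


class ToyModel:
    """Hypothetical stand-in exposing the attributes simulate_data relies on."""

    def __init__(self, item_count, seed=0):
        self.item_count = item_count
        self.items = list(range(item_count))
        self.memory = []
        self.forced = None
        self.rng = random.Random(seed)

    def experience(self, items):
        # "encode" items by simply remembering them
        self.memory.extend(items)

    def force_recall(self, item):
        # remember which item should come out first on the next recall
        self.forced = item

    def free_recall(self):
        # recall every stored item once, in random order after any forced item
        remaining = [i for i in self.memory if i != self.forced]
        self.rng.shuffle(remaining)
        recalled = ([] if self.forced is None else [self.forced]) + remaining
        self.forced = None
        return recalled


model = ToyModel(item_count=4)
model.experience(model.items)

rows = []
for experiment in range(2):
    rows += [[experiment, 0, 'study', i + 1, i] for i in range(model.item_count)]
for experiment in range(2):
    model.force_recall(2)
    rows += [[experiment, 0, 'recall', i + 1, o]
             for i, o in enumerate(model.free_recall())]

print(len(rows))  # 16
print(rows[8])    # [0, 0, 'recall', 1, 2]
```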

In [8]:
# export

def simulate_data(model, experiment_count, first_recall_item=None):
    """
    Encodes `model.items` into an already-initialized model and then populates
    a psifr-formatted dataframe with the outcomes of performing free recall
    `experiment_count` times.

    **Required model attributes**:
    - items: the items passed to `experience` for encoding
    - item_count: specifies number of items encoded into memory
    - context: vector representing an internal contextual state
    - experience: function that adds new traces to the memory model
    - free_recall: function that freely recalls a given number of items or until recall stops
    - force_recall: function that forces recall of `first_recall_item`; only
        called when that argument is provided
    """
    
    # encode items
    model.experience(model.items)

    # simulate retrieval for the specified number of times, tracking results in df
    data = []
    for experiment in range(experiment_count):
        data += [[experiment, 0, 'study', i + 1, i] for i in range(model.item_count)]
    for experiment in range(experiment_count):
        if first_recall_item is not None:
            model.force_recall(first_recall_item)
        data += [[experiment, 0, 'recall', i + 1, o] for i, o in enumerate(model.free_recall())]
    data = pd.DataFrame(data, columns=['subject', 'list', 'trial_type', 'position', 'item'])
    merged = fr.merge_free_recall(data)
    
    return merged