

 # Lohnas & Kahana, 2014 Dataset



 > Lohnas, L. J., & Kahana, M. J. (2014). A retrieved context account of spacing and repetition effects in free recall. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(3), 755.

 Across 4 sessions, 35 subjects performed delayed free recall of 48 lists. Subjects were University of Pennsylvania undergraduates, graduates and staff, age 18-32. List items were drawn from a pool of 1638 words taken from the University of South Florida free association norms (Nelson, McEvoy, & Schreiber, 2004; Steyvers, Shiffrin, & Nelson, 2004, available at http://memory.psych.upenn.edu/files/wordpools/PEERS_wordpool.zip). Within each session, words were drawn without replacement. Words could repeat across sessions so long as they did not repeat in two successive sessions. Words were also selected to ensure that no strong semantic associates co-occurred in a given list (i.e., the semantic relatedness between any two words on a given list, as determined using WAS (Steyvers et al., 2004), did not exceed a threshold value of 0.55).

 Subjects encountered four different types of lists:
 1. Control lists that contained all once-presented items;
 2. Pure massed lists containing all twice-presented items;
 3. Pure spaced lists consisting of items presented twice at lags 1-8, where lag is defined as the number of intervening items between a repeated item's presentations;
 4. Mixed lists consisting of once-presented, massed, and spaced items.

 Within each session, subjects encountered three lists of each of these four types.

 In each list there were 40 presentation positions, such that in the control lists each position was occupied by a unique list item, and in the pure massed and pure spaced lists, 20 unique words were presented twice to occupy the 40 positions. In the mixed lists, 28 once-presented and six twice-presented words occupied the 40 positions (28 + 2 × 6 = 40). In the pure spaced lists, spacings of repeated items were chosen so that each of the lags 1-8 occurred with equal probability. In the mixed lists, massed repetitions (lag=0) and spaced repetitions (lags 1-8) were chosen such that each of the 9 lags from 0 to 8 was used exactly twice within each session. The order of presentation for the different list types was randomized within each session. For the first session, the first four lists were chosen so that each list type was presented exactly once. An experimenter sat in with the subject for these first four lists, though no subject had difficulty understanding the task.
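The lag definition and the position counts above can be made concrete with a short sketch. The helper `repetition_lags` is a hypothetical name, not part of JAXCMR. It takes one row of within-list item numbers and returns, for each twice-presented item, the number of items between its two presentations. The example row is copied from the first trial printed in the verification output of this notebook, and the sketch assumes each item appears at most twice.

```python
import numpy as np

def repetition_lags(row):
    """Map each repeated item number to its lag (items between its two presentations).

    Assumes each item appears at most twice in the row.
    """
    first_seen = {}
    lags = {}
    for position, item in enumerate(row):
        item = int(item)
        if item in first_seen:
            lags[item] = position - first_seen[item] - 1
        else:
            first_seen[item] = position
    return lags

# First trial shown in this notebook's verification output.
row = np.array([
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14, 15, 16, 17, 10, 18,
    19, 20, 19, 21, 22, 23, 20, 24, 25, 26, 22, 27, 28, 24, 29, 30, 31, 32, 33, 34,
])

lags = repetition_lags(row)
n_unique = len(set(row.tolist()))
n_repeated = len(lags)
print(lags)
print(n_unique - n_repeated, "once-presented,", n_repeated, "twice-presented")
```

For this row the sketch finds six repeated items and 28 once-presented items, which matches the composition described for mixed lists.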

 Here, we read these raw data from `repFR.mat` and transform them into a Python dictionary
 adhering to the JAXCMR [RecallDataset protocol](#RecallDataset-Protocol), described below.



 ## 1. Loading the Raw `repFR.mat` File

 We use `scipy.io.loadmat` to read the .mat file. The `mat_file["data"]` field
 holds an array of objects, from which we extract relevant fields.
 The file contains:

 - **Subjects**: Identifiers for each trial.
 - **Sessions**: Session indicators for each trial.
 - **Recalled items**: Indices and times.
 - **Presented items**: Indices at each presentation position.
 - **Other**: Additional info like list type, etc.

 We store them in Python lists and then progressively transform them into
 integer arrays.

In [1]:
import scipy.io as sio
import numpy as np
from jaxcmr.helpers import save_dict_to_hdf5

# Path to the raw .mat file
path = 'data/raw/repFR.mat'
# Load the .mat file (with squeeze_me=True to reduce dimensionality)
mat_file = sio.loadmat(path, squeeze_me=True)

# 'data' is an array of objects in the .mat file
# We convert it to a Python list for easier indexing
mat_data = [mat_file['data'].item()[i] for i in range(14)]

# The total list length is constant for each trial
list_length = mat_data[12]
print("List Length =", list_length)

List Length = 40
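Positional indexing into the struct (for example `mat_data[12]`) depends on the field order inside the file. As a hedged alternative, the sketch below looks fields up by name through the structured dtype that `scipy.io.loadmat` returns for MATLAB structs. It does not read `repFR.mat`; it writes a small stand-in struct to an in-memory buffer, and the field names used (`subject`, `session`, `listLength`) are assumptions for illustration that may not match the real file.

```python
import io
import numpy as np
import scipy.io as sio

# Write a tiny stand-in struct to memory (field names are illustrative assumptions).
buffer = io.BytesIO()
sio.savemat(buffer, {"data": {
    "subject": np.array([1, 1, 2]),
    "session": np.array([0, 1, 0]),
    "listLength": 40,
}})
buffer.seek(0)

mat_file = sio.loadmat(buffer, squeeze_me=True)
data = mat_file["data"]

# For a MATLAB struct, the loaded array has one dtype field per struct field.
field_names = data.dtype.names
fields = {name: data[name].item() for name in field_names}

print(field_names)
print("List Length =", int(fields["listLength"]))
```

Looking fields up by name keeps the loading code readable and makes it fail loudly with a `KeyError` if an expected field is missing.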




 ## 2. Constructing Presented Items

 We have arrays indicating which item IDs were shown at each presentation position.

 Next, we translate these "raw IDs" into *within-list* item numbers (`pres_itemnos`)
 in a way that each new *unique* item encountered in a trial is assigned a new index.

In [2]:
pres_itemnos_raw = mat_data[4].astype('int64')  # shape: (n_trials, positions)

presentations = []  # We'll fill as a Python list-of-lists, then convert to np.array

for i in range(len(pres_itemnos_raw)):
    seen = []
    presentations.append([])
    for p in pres_itemnos_raw[i]:
        if p not in seen:
            seen.append(p)
        # 'pres_itemnos' is the index of p in the "seen" array
        presentations[-1].append(seen.index(p))

presentations = np.array(presentations)+1
print("Shape of presentations =", presentations.shape)

Shape of presentations = (1680, 40)
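As a small illustration of the renumbering step, the sketch below applies the same first-occurrence rule to one short, made-up row of item IDs. The function name `renumber_within_list` is hypothetical, and it uses a dictionary lookup instead of `list.index`, which is a swapped-in implementation detail rather than the notebook's code.

```python
import numpy as np

def renumber_within_list(raw_ids):
    """Assign 1-based within-list numbers in order of first appearance."""
    first_index = {}
    numbered = []
    for raw in raw_ids:
        raw = int(raw)
        # setdefault stores the next free number only the first time an ID is seen
        numbered.append(first_index.setdefault(raw, len(first_index) + 1))
    return np.array(numbered, dtype="int64")

toy_row = np.array([919, 919, 955, 692, 955])  # made-up IDs with two repeats
print(renumber_within_list(toy_row))
```

Repeated IDs receive the number assigned at their first presentation, so a row with 34 unique items never produces a value above 34.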


 ## 3. Constructing Recalls

 We have `rec_itemnos`, which identifies which item was recalled, and `recalls`,
 which gives the study position of each recalled item (1-based, as in MATLAB). We also have the time of each recall.

We need to transform these into a format that is more convenient for analysis.
Each nonzero value in our `trials` array will track the first study position of each recalled item in its trial.
Each nonzero value in our `trial_items` array will track the cross-trial item index of the recalled item.
And finally, each nonzero value in our `trial_irts` array will track the inter-recall time of the recalled item.

In [3]:
list_length = mat_data[12]
rec_itemids_raw = mat_data[2].astype('int64')
recalls_raw = mat_data[6]
irt_raw = mat_data[3].astype('int64')

trials = []
trial_items = []
trial_irts = []

for i in range(len(recalls_raw)):
    trials.append([])
    trial_items.append([])
    trial_irts.append([])

    trial = list(recalls_raw[i])
    for j in range(len(trial)):
        t = trial[j]  # Study position of the recalled item (1-based, from MATLAB)
        trial_item = rec_itemids_raw[i][j]  # The cross-list ID
        rt = irt_raw[i][j]  # The recall time
        # Exclude 0 or negative values and repeated items in the same trial
        if (t > 0) and (t not in trials[-1]):
            trials[-1].append(t)
            trial_items[-1].append(trial_item)
            trial_irts[-1].append(rt)


    # Pad up to the known list_length with zeros
    while len(trials[-1]) < list_length:
        trials[-1].append(0)
        trial_items[-1].append(0)
        trial_irts[-1].append(0)

trials = np.array(trials, dtype='int64')
trial_items = np.array(trial_items, dtype='int64')
trial_irts = np.array(trial_irts, dtype='int64')
print("Shape of trials =", trials.shape)
print("Shape of trial_items =", trial_items.shape)

Shape of trials = (1680, 40)
Shape of trial_items = (1680, 40)
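The filtering and padding rule can be checked on a toy trial. The helper below, `clean_recalls`, is a hypothetical name used only for illustration. It keeps the first occurrence of each positive study position, drops non-positive codes and within-trial repeats, and pads with zeros to a fixed length.

```python
import numpy as np

def clean_recalls(raw_positions, list_length):
    """Keep first occurrences of positive study positions, then zero-pad."""
    kept = []
    for position in raw_positions:
        position = int(position)
        if position > 0 and position not in kept:
            kept.append(position)
    padding = [0] * (list_length - len(kept))
    return np.array(kept + padding, dtype="int64")

# Made-up raw trial: -1 and 0 are non-positive codes, and position 3 is recalled twice.
toy_trial = [3, 1, -1, 3, 5, 0]
print(clean_recalls(toy_trial, list_length=8))
```

Because every row is padded to the same length, the per-trial lists stack into a rectangular `int64` array without ragged edges.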




 ## 4. Constructing the Final Result Dictionary

 We now assemble all the fields into a single dictionary called `result`.

 All arrays are `int64` and 2D, with zero-padding for unused entries.

In [4]:
# Additional fields from mat_data
subject = np.expand_dims(mat_data[0].astype('int64'), axis=1)
session = np.expand_dims(mat_data[1].astype('int64'), axis=1)+1
list_type = np.expand_dims(mat_data[7].astype('int64'), axis=1)
list_length = np.expand_dims(np.ones(np.shape(mat_data[0]), dtype="int64"), axis=1) * 40
pres_itemids = mat_data[4].astype('int64')

result = {
    "subject":      subject,            # (n_trials, 1)
    "session":      session,            # (n_trials, 1) - optional
    #'pres_items': mat_data[2],
    #'rec_items':  mat_data[2],
    "pres_itemnos": presentations,      # (n_trials, 40)
    "pres_itemids": pres_itemids,       # (n_trials, 40)
    "rec_itemids":  trial_items,        # (n_trials, 40)
    "recalls":      trials,             # (n_trials, 40)
    "listLength":   list_length,        # (n_trials, 1)
    "list_type":    list_type,          # (n_trials, 1)
    "irt":          trial_irts,         # (n_trials, 40)
    #'pres_lag': 
    #'recalls_lag':
    #'trial': mat_data[9].astype('int64'),
    #'intrusions': 
    #'subject_sess':
    #'massed_recalls':
}
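Because `subject`, `session`, and `list_type` are stored as `(n_trials, 1)` columns, selecting trials means flattening the column before using it as a row mask. The sketch below shows this on a made-up three-trial dictionary with the same layout. `select_trials` is a hypothetical helper, not a JAXCMR function, and the `list_type` codes are invented for the example.

```python
import numpy as np

def select_trials(dataset, key, value):
    """Return a copy of the dataset restricted to trials where dataset[key] == value."""
    mask = dataset[key][:, 0] == value  # flatten (n_trials, 1) to (n_trials,)
    return {name: array[mask] for name, array in dataset.items()}

# Made-up dataset with three trials and three recall slots per trial.
toy = {
    "subject":   np.array([[1], [1], [2]]),
    "list_type": np.array([[1], [4], [4]]),
    "recalls":   np.array([[2, 1, 0], [3, 0, 0], [1, 2, 3]]),
}

subset = select_trials(toy, "list_type", 4)
print(subset["subject"].shape, subset["recalls"].shape)
```

Applying one mask to every field keeps the rows aligned across fields, which is the property the shared `n_trials` first dimension is meant to guarantee.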



 ## 5. Verifying the Result

 We check the shape and type of each field in the `result` dictionary.
 We also check the maximum and minimum values for each field.
 This is a good practice to ensure that the data is as expected.

In [5]:
# Print a short summary
for k, v in result.items():
    print(k, "->", np.shape(v), v.dtype)
    print(v[:1])  # Print the first row of each field
    print(f"Max value in {k}:", np.max(v))
    print(f"Min value in {k}:", np.min(v))
    print()

# Store `result` in HDF5 format.
save_dict_to_hdf5(result, "data/LohnasKahana2014.h5")

subject -> (1680, 1) int64
[[1]]
Max value in subject: 37
Min value in subject: 1

session -> (1680, 1) int64
[[1]]
Max value in session: 4
Min value in session: 1

pres_itemnos -> (1680, 40) int64
[[ 1  2  3  4  5  6  7  8  9 10 11 12 12 13 14 15 16 17 10 18 19 20 19 21
  22 23 20 24 25 26 22 27 28 24 29 30 31 32 33 34]]
Max value in pres_itemnos: 40
Min value in pres_itemnos: 1

pres_itemids -> (1680, 40) int64
[[1585  886 1045  695  809   39 1636  358  249  692 1029  919  919  955
  1407  745   81   19  692  321  279  170  279  212  639  840  170  302
  1025  364  639  698  696  302 1562  819  105 1559  887  187]]
Max value in pres_itemids: 1638
Min value in pres_itemids: 1

rec_itemids -> (1680, 40) int64
[[1585  886 1045  695  809   39 1636  249  692 1029   81  955  919 1407
   639  321  302  364  887 1559  105   19    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0]]
Max value in rec_itemids: 1638
Min value in rec_itemids: 0

recalls -> (1

We also add automated tests to verify our final `result` structure meets the following conditions:

 1. **All entries** are 2D.
 2. **Subject**, **session**, **listLength**, and **list_type** each has exactly one column.
 3. All **other entries** have the same number of columns (40).
 4. The **minimum value** for all entries is at least 0.
 5. **Zeros** in `rec_itemids`, `recalls`, and `irt` occupy **the same** indices (if `irt` is the same shape).

In [6]:
# 1. All entries are 2D
for key, val in result.items():
    assert val.ndim == 2, f"{key} must be 2D. Got shape={val.shape}."

# 2. subject, session, listLength, and list_type must have exactly one column
single_col_keys = ["subject", "session", "listLength", "list_type"]
for k in single_col_keys:
    if k in result:
        assert result[k].shape[1] == 1, f"{k} must have shape (n_trials, 1). Got {result[k].shape}."

# 3. All other entries have the same number of columns (40)
column_40_keys = [
    "pres_itemnos", "recalls", "pres_itemids", "rec_itemids", "irt"
]

for k in column_40_keys:
    if k in result:
        assert result[k].shape[1] == 40, f"{k} must have 40 columns. Got {result[k].shape}."

# 4. Minimum value for all entries is at least 0
for key, val in result.items():
    min_val = val.min()
    assert min_val >= 0, f"{key} has negative values (min={min_val})."

# 5. 0 values in rec_itemids, recalls, and irt occupy the same indices
#    (only if your `irt` array has the same shape as recalls).
if "rec_itemids" in result and "recalls" in result:
    rec_itemids_zeros = (result["rec_itemids"] == 0)
    recalls_zeros     = (result["recalls"]     == 0)

    # If irt is present and the same shape, compare it too
    if "irt" in result and result["irt"].shape == result["recalls"].shape:
        irt_zeros = (result["irt"] == 0)
        # All must match
        assert (
            (rec_itemids_zeros == recalls_zeros).all() and 
            (recalls_zeros == irt_zeros).all()
        ), "Mismatch in zero indices among rec_itemids, recalls, and irt."
    else:
        # If no irt or shape mismatch, just compare rec_itemids and recalls
        assert (rec_itemids_zeros == recalls_zeros).all(), \
            "Mismatch in zero indices between rec_itemids and recalls."
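One further check may be worth adding, sketched here under an assumption: if `recalls` stores 1-based study positions and `rec_itemids` stores the corresponding item IDs, then each nonzero recall should point at a study position holding the same ID in `pres_itemids`. The function `recalls_match_presentations` is a hypothetical helper, and whether the real data satisfy this property has not been verified here.

```python
import numpy as np

def recalls_match_presentations(recalls, rec_itemids, pres_itemids):
    """Check that each nonzero recall points at a study position with the recalled ID."""
    recalled = recalls > 0
    # Clip so padded zeros index a valid column; those entries are masked out below.
    positions = np.clip(recalls - 1, 0, pres_itemids.shape[1] - 1)
    looked_up = np.take_along_axis(pres_itemids, positions, axis=1)
    return bool(np.all(looked_up[recalled] == rec_itemids[recalled]))

# Made-up two-trial example with four study positions each.
pres_itemids = np.array([[10, 20, 20, 30], [40, 50, 60, 70]])
recalls      = np.array([[2, 1, 0, 0], [4, 0, 0, 0]])
rec_itemids  = np.array([[20, 10, 0, 0], [70, 0, 0, 0]])

print(recalls_match_presentations(recalls, rec_itemids, pres_itemids))
```

On the real dictionary this could be called as `recalls_match_presentations(result["recalls"], result["rec_itemids"], result["pres_itemids"])`, since all three arrays share the shape `(n_trials, 40)`.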