# Pivot Test
We're going to test whether pandas' `pivot_table` method is sufficient to generate the arrays we normally maintain separately from the psifr-formatted DataFrame.

## Initial Representation
Let's retrieve the relatively complex Lohnas 2014 item repetitions dataset to support our comparisons.

In [None]:
# export

import scipy.io as sio
import numpy as np
import pandas as pd
from psifr import fr

def prepare_repdata(path):
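    """
    Prepares data formatted like `data/repFR.mat` for fitting.

    Returns the recall `trials` array, the merged psifr events DataFrame,
    `list_length`, the `presentations` array, `list_types`, the unmerged
    events DataFrame, and the `subjects` vector.
    """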
    
    # load all the data
    matfile = sio.loadmat(path, squeeze_me=True)['data'].item()
    subjects = matfile[0]
    pres_itemnos = matfile[4]
    recalls = matfile[6]
    list_types = matfile[7]
    list_length = matfile[12]
    
    # convert pres_itemnos into rows of unique indices for easier model encoding
    presentations = []
    for i in range(len(pres_itemnos)):
        seen = []
        presentations.append([])
        for p in pres_itemnos[i]:
            if p not in seen:
                seen.append(p)
            presentations[-1].append(seen.index(p))
    presentations = np.array(presentations)

    # discard intrusions and repeated recalls, then pad each trial with zeros
    trials = []
    for i in range(len(recalls)):
        trials.append([])
        
        trial = list(recalls[i])
        for t in trial:
            if (t > 0) and (t not in trials[-1]):
                trials[-1].append(t)
        
        while len(trials[-1]) < list_length:
            trials[-1].append(0)
            
    trials = np.array(trials)
    
    # encode dataset into psifr format
    data = []
    for trial_index, trial in enumerate(trials):
        presentation = presentations[trial_index]
        
        # every time the subject changes, reset list_index
        if not data or data[-1][0] != subjects[trial_index]:
            list_index = 0
        list_index += 1
        
        # add study events
        for presentation_index, presentation_event in enumerate(presentation):
            data += [[subjects[trial_index], 
                      list_index, 'study', presentation_index+1, presentation_event,  list_types[trial_index]
                     ]]
            
        # add recall events
        for recall_index, recall_event in enumerate(trial):
            if recall_event != 0:
                data += [[subjects[trial_index], list_index, 
                          'recall', recall_index+1, presentation[recall_event-1], list_types[trial_index]
                         ]]
    
    data = pd.DataFrame(data, columns=['subject', 'list', 'trial_type', 'position', 'item', 'condition'])
    merged = fr.merge_free_recall(data, list_keys=['condition'])
    
    return trials, merged, list_length, presentations, list_types, data, subjects

In [None]:
trials, events, list_length, presentations, list_types, rep_data, subjects = prepare_repdata(
    '../data/repFR.mat')

events.head()

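| | subject | list | item | input | output | study | recall | repeat | intrusion | condition |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1.0 | True | True | 0 | False | 4 |
| 1 | 1 | 1 | 1 | 2 | 2.0 | True | True | 0 | False | 4 |
| 2 | 1 | 1 | 2 | 3 | 3.0 | True | True | 0 | False | 4 |
| 3 | 1 | 1 | 3 | 4 | 4.0 | True | True | 0 | False | 4 |
| 4 | 1 | 1 | 4 | 5 | 5.0 | True | True | 0 | False | 4 |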


## Examples

### Trials Array
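Each row corresponds to a specific trial. Each column corresponds to a recall (output) position. Each value contains the study (input) position of the item recalled at that recall position on that trial. Note that `pivot_table` aggregates with the mean by default, so whenever several rows share the same `subject`, `list`, and `output` (presumably items that were presented more than once), the value is the average of their study positions. This is why fractional values such as 14.5 appear in the output.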

In [None]:
trials_df = events.pivot_table(index=['subject', 'list'], columns='output', values='input')
trials_df.head()

Columns are `output` positions:

| subject | list | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0 | ... | 28.0 | 29.0 | 30.0 | 31.0 | 32.0 | 33.0 | 34.0 | 35.0 | 36.0 | 37.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 9.0 | 14.5 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2 | 38.0 | 39.0 | 5.0 | 6.0 | 14.0 | 11.0 | 1.0 | 33.0 | 16.0 | 34.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 3 | 5.5 | 5.5 | 7.5 | 8.5 | 6.0 | 10.5 | 12.5 | 12.0 | 21.5 | 21.5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 4 | 1.5 | 3.5 | 5.5 | 7.5 | 9.5 | 13.5 | 15.5 | 23.5 | 17.5 | 19.5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 5 | 2.0 | 3.0 | 16.0 | 12.0 | 11.0 | 8.0 | 9.0 | 6.0 | 18.0 | 17.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


We can also convert this into an array. Passing `na_value=0` to `to_numpy` fills the recall positions where no item was recalled with 0. Coding intrusions would require a different manipulation, since they are absent from this DataFrame.

Adding an `astype` operation converts the result into an integer array. The cast truncates any fractional values produced by the pivot, so an entry such as 5.5 becomes 5.

In [None]:
trials_df.to_numpy(na_value=0).astype('int64')

array([[ 1,  2,  3, ...,  0,  0,  0],
       [38, 39,  5, ...,  0,  0,  0],
       [ 5,  5,  7, ...,  0,  0,  0],
       ...,
       [ 4,  5,  7, ...,  0,  0,  0],
       [38,  6,  6, ...,  0,  0,  0],
       [ 2,  1,  5, ...,  0,  0,  0]], dtype=int64)
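
The fractional entries in `trials_df` (14.5, 5.5) and their truncated counterparts in the integer array suggest that several rows share the same `subject`, `list`, and `output` values, which `pivot_table` averages by default. The sketch below reproduces the effect on a toy frame with hypothetical values and shows one possible alternative, `aggfunc='min'`, which keeps the earliest study position. Whether the earliest position is the right coding for repeated items is an assumption here, not something the dataset dictates.

```python
import pandas as pd

# Toy events in a psifr-like long format (hypothetical values).
# In list 1, the item recalled first was studied at input positions 2 and 5,
# so output position 1 maps to two rows.
toy = pd.DataFrame({
    'subject': [1, 1, 1, 1],
    'list':    [1, 1, 1, 2],
    'input':   [2, 5, 3, 4],
    'output':  [1.0, 1.0, 2.0, 1.0],
})

# Default aggregation is the mean: (2 + 5) / 2 = 3.5
mean_df = toy.pivot_table(
    index=['subject', 'list'], columns='output', values='input')

# Taking the minimum keeps the earliest study position instead
min_df = toy.pivot_table(
    index=['subject', 'list'], columns='output', values='input', aggfunc='min')

# The integer cast truncates 3.5 to 3 in the mean-based array
mean_trials = mean_df.to_numpy(na_value=0).astype('int64')
min_trials = min_df.to_numpy(na_value=0).astype('int64')

print(mean_trials)
print(min_trials)
```

With the mean, the first recall of list 1 is coded as study position 3, which belongs to a different item; with the minimum it is coded as position 2.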

### Presentations Array

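Presentation order varies in this dataset. Can we retrieve that, too? Each row corresponds to a unique trial. Each column corresponds to a unique input position. Each value is the index of the item presented at that position. Item indices start at 0 here, so filling with `na_value=0` would be ambiguous if any list were shorter than the others.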

In [None]:
events.pivot_table(index=['subject', 'list'], columns='input', values='item').to_numpy(na_value=0).astype('int64')

array([[ 0,  1,  2, ..., 31, 32, 33],
       [ 0,  1,  2, ..., 37, 38, 39],
       [ 0,  1,  2, ..., 19, 18, 17],
       ...,
       [ 0,  1,  2, ..., 19, 18, 19],
       [ 0,  1,  2, ..., 17, 18, 19],
       [ 0,  1,  2, ..., 37, 38, 39]], dtype=int64)
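
One way to check that the pivot really recovers a presentations array is a round trip: unroll a known array into long format and pivot it back. The sketch below does this with a small hypothetical array rather than the Lohnas data, and assumes each (`subject`, `list`, `input`) combination appears exactly once, so the default mean aggregation leaves values unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical presentations array: 2 lists x 4 study positions.
# The second list repeats item 1 at positions 2 and 4.
presentations = np.array([[0, 1, 2, 3],
                          [0, 1, 2, 1]])

# Unroll into one row per study event, using the column names from above
rows = [
    {'subject': 1, 'list': list_index + 1, 'input': position + 1, 'item': item}
    for list_index, row in enumerate(presentations)
    for position, item in enumerate(row)
]
long_df = pd.DataFrame(rows)

# Pivot back: one row per (subject, list), one column per input position
recovered = long_df.pivot_table(
    index=['subject', 'list'], columns='input', values='item'
).to_numpy(na_value=0).astype('int64')

print(np.array_equal(recovered, presentations))
```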

### Subsetting by Categorical Variables
If we work with array representations of our data, we'll often need vectors coding categorical information such as condition or subject ID. Other times, we might just want to select a specific subset of our data based on the values of these variables.

In [None]:
events.loc[(events.subject==1) & (events.condition==1)]

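| | subject | list | item | input | output | study | recall | repeat | intrusion | condition |
|---|---|---|---|---|---|---|---|---|---|---|
| 40 | 1 | 2 | 0 | 1 | 7.0 | True | True | 0 | False | 1 |
| 41 | 1 | 2 | 1 | 2 | 11.0 | True | True | 0 | False | 1 |
| 42 | 1 | 2 | 2 | 3 | NaN | True | False | 0 | False | 1 |
| 43 | 1 | 2 | 3 | 4 | NaN | True | False | 0 | False | 1 |
| 44 | 1 | 2 | 4 | 5 | 3.0 | True | True | 0 | False | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1915 | 1 | 48 | 35 | 36 | NaN | True | False | 0 | False | 1 |
| 1916 | 1 | 48 | 36 | 37 | NaN | True | False | 0 | False | 1 |
| 1917 | 1 | 48 | 37 | 38 | NaN | True | False | 0 | False | 1 |
| 1918 | 1 | 48 | 38 | 39 | 18.0 | True | True | 0 | False | 1 |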


#### Vector Extraction

Since `pivot_table` sorts its result by the index keys, the vector values should be ordered by each index level in the order the levels are specified (here, by `subject` and then by `list`).

In [None]:
list_types = events.pivot_table(index=['subject', 'list'], values='condition').to_numpy().flatten()
list_types

array([4, 1, 3, ..., 3, 3, 1], dtype=int64)
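
Because both pivots use the same index keys, a vector extracted this way should line up row for row with the trials array, so it can serve as a boolean mask. The sketch below illustrates this on a toy frame with hypothetical values whose rows are deliberately out of order.

```python
import pandas as pd

# Hypothetical long-format events, deliberately shuffled
toy = pd.DataFrame({
    'subject':   [2, 1, 1, 2, 1],
    'list':      [1, 2, 1, 1, 2],
    'input':     [1, 1, 1, 2, 2],
    'output':    [1.0, 2.0, 1.0, 2.0, 1.0],
    'condition': [3, 1, 4, 3, 1],
})

index = ['subject', 'list']

# One row per (subject, list), sorted by subject and then list
toy_trials = toy.pivot_table(
    index=index, columns='output', values='input'
).to_numpy(na_value=0).astype('int64')

# One condition code per (subject, list), in the same order
toy_list_types = toy.pivot_table(
    index=index, values='condition').to_numpy().flatten()

# Select only the trials from condition 1
condition_1_trials = toy_trials[toy_list_types == 1]

print(toy_trials)
print(toy_list_types)
print(condition_1_trials)
```

The mask picks out the single condition-1 trial (subject 1, list 2) even though its rows were not adjacent in the input.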

We're again pivoting by `subject` and `list`. Can we still grab a `subject` vector in this context?

In [None]:
events.pivot_table(index=['subject', 'list']).index.get_level_values('subject').values

array([ 1,  1,  1, ..., 37, 37, 37], dtype=int64)

Applying `np.arange` to the length of a vector like this one is then enough to generate trial indices.
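
As a minimal sketch of that last step, again on a toy frame with hypothetical values, the pivoted index yields the `subject` vector, the within-subject `list` numbers, and a running trial index:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format events: two subjects with two lists each
toy = pd.DataFrame({
    'subject': [1, 1, 2, 2, 2],
    'list':    [1, 2, 1, 1, 2],
    'input':   [1, 1, 1, 2, 1],
})

pivoted = toy.pivot_table(index=['subject', 'list'])

# One entry per (subject, list) row of the pivoted table
subject_vector = pivoted.index.get_level_values('subject').to_numpy()
list_numbers = pivoted.index.get_level_values('list').to_numpy()

# Running index over all trials, regardless of subject
trial_indices = np.arange(len(subject_vector))

print(subject_vector)
print(list_numbers)
print(trial_indices)
```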