# Text Preprocessing

The raw data obtained from the Brown-Schmidt lab for this research requires extensive preprocessing before it is suitable for downstream analyses. Here we do that by:

- Removing irrelevant boilerplate
- Putting text into a maintainable file format
- Extracting serial order of study idea units from texts and researcher segmentations/correspondences
- Pairing extracted trial data with semantic similarity data

## Dataset Overview
We render an overview of the dataset prepared for our publication:

In [1]:
from IPython.display import Markdown

def render_tex(tex_path, bib_path, csl_path):
    result = !pandoc -C --ascii {tex_path} -f latex -t markdown_mmd --bibliography {bib_path} --csl {csl_path}
    return Markdown('\n'.join(result))

render_tex('writing/BrownSchmidt_Dataset.tex', 'writing/references.bib', 'writing/main/apa.csl')

Recall for narratives, if split into idea units – "meaningful
chunks of information that convey a piece of the narrative" – that
are numbered according to chronological order, can be examined using
analytic techniques developed for free and serial list recall tasks.
This framework enables direct comparison between ideas, assumptions, and
models applied to understand how people remember sequences such as word
lists and those used to understand memory for narrative texts. To
support analysis of narrative recall this way, we considered a dataset
collected, preprocessed, and presented by Cutler et al. (2019). In
corresponding experiments, research participants read 6 distinct short
stories. Upon reading a story, participants performed immediate free
recall of the narrative twice. Three weeks later, participants performed
free recall of each narrative again. Each recall period was limited to
five minutes. Following data collection, a pair of research assistants
in the Brown-Schmidt laboratory were each instructed to independently
split stories and participant responses into idea units as defined
above, and to identify correspondences between idea units in participant
responses and corresponding studied stories reflecting recall. Following
this initial preprocessing, research assistants then compared and
discussed their results and recorded consensus decisions regarding the
segmentation and correspondence of idea units across the dataset.
Further analysis focused on the sequences of story idea units recalled
by participants on each trial as tracked by these researchers.

<div id="refs" class="references csl-bib-body hanging-indent"
markdown="1" line-spacing="2">

<div id="ref-cutler2019narrative" class="csl-entry" markdown="1">

Cutler, R., Palan, J., Polyn, S., & Brown-Schmidt, S. (2019). Semantic
and temporal structure in memory for narratives: A benefit for
semantically congruent ideas. *Context and Episodic Memory Symposium*.

</div>

</div>

## Standardizing Text Representations

Using the data in `raw`, we produce in `texts` one subdirectory for each passage. Each subdirectory holds one file per recall period; each file contains only the recalled text for a particular passage, subject, and recall period and is named accordingly (e.g. `Supermarket_1_1.txt`). The text of each source passage is stored as a separate file at the base of `texts`.
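
As a rough illustration of this layout, the sketch below builds the target path for a single recall file. The helper name `recall_file_path` and its arguments are hypothetical; the cells that follow build these names inline rather than through a function.

```python
import os

# passage order is assumed to match the 1-based numbering in the raw file names
source_names = ['Fisherman', 'Supermarket', 'Flight', 'Cat', 'Fog', 'Beach']

def recall_file_path(root, passage_index, subject_index, phase_index):
    """Build `<root>/<Passage>/<Passage>_<subject>_<phase>.txt` (hypothetical helper)."""
    passage = source_names[passage_index - 1]  # passage indices are 1-based
    filename = '{}_{}_{}.txt'.format(passage, subject_index, phase_index)
    return os.path.join(root, passage, filename)

# second passage, first subject, first recall period
example_path = recall_file_path(os.path.join('data', 'texts'), 2, 1, 1)
```

Here `example_path` points at `Supermarket_1_1.txt` inside the `Supermarket` subdirectory.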

### We start with some initial dependencies and constants.

In [2]:
# import dependencies
import os
import pathlib
import docxpy
import ftfy

# key paths
source_directory = os.path.join('data', 'raw')
target_directory = os.path.join('data', 'texts')

source_names = ['Fisherman', 'Supermarket', 'Flight', 'Cat', 'Fog', 'Beach']
source_titles = ['where does susie go at noon?']
title_tags = [['''man and the bear'''], ['''act of kindness'''], 
              ["""a man can’t just sit""", "a man just can’t sit"], 
              ["where does susie go at noon?"], ["fog: a maine t"], 
              ["day at the beach"]]
author_tags = ['author unknown', 'anonymous', 'chris holm', 'adapted from',
               'unknown', 'anonymous']

### Next we create directories in our file system to organize preprocessed data.

In [3]:
# make the `texts` directory (and any missing parents) if it doesn't already exist
os.makedirs(target_directory, exist_ok=True)

# generate subdirectory for each passage
for source_name in source_names:
    os.makedirs(os.path.join(target_directory, source_name), exist_ok=True)

### Preprocess raw `docx` files and store as text

In [4]:
# for each pt1 written recall file, extract text and remove boilerplate, 
# and save to correct location in `texts`
for path, subdirs, files in os.walk(os.path.join(
    source_directory, 'recall', 'Written Recall Part 1')):
    for name in files:
        recall_path = str(pathlib.PurePath(path, name))
        
        # extract text and remove boilerplate
        recall_text = '\n'.join(
            docxpy.process(recall_path).split('\n')[1:]).strip()
        passage_index = recall_path[-9:-8]
        subject_index = recall_path.split(name)[0][-3:-1]
        phase_index = recall_path[-7:-6]
        targetname = '{}_{}_{}.txt'.format(
            source_names[int(passage_index)-1], int(subject_index), phase_index)
        
        # fix known typos that would otherwise break title matching
        recall_text = recall_text.replace(
            'vbeach', 'beach').replace('Susie gp at noon', 'Susie go at noon')
        
        # filter out source titles from recall data
        if any([each in recall_text[:recall_text.find(
            '.')].lower() for each in title_tags[int(passage_index)-1]]):
            if len(recall_text[:recall_text.find('\n')]) < 100:
                recall_text = recall_text[recall_text.find('\n'):].strip()
                
        # filter out source authors from recall data
        if (recall_text[:len(author_tags[int(
            passage_index)-1])].lower() == author_tags[int(passage_index)-1]):
            recall_text = recall_text[recall_text.find('\n'):].strip()
            
        # clean the data
        recall_text = ftfy.fix_text(recall_text)
            
        # save to correct location in `texts`
        with open(
            os.path.join(target_directory, source_names[int(passage_index)-1], 
                         targetname), 'w', encoding='utf-8') as f:
            f.write(recall_text)
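
To make the title filter easier to follow, here is a minimal standalone sketch of the same check applied to toy strings. The function name `strip_title` and the `max_title_length` parameter are hypothetical names introduced for illustration; the toy response text is invented.

```python
def strip_title(recall_text, title_tags, max_title_length=100):
    """Drop the first line when it looks like a copy of the passage title."""
    # text up to the first period, lowercased for case-insensitive matching
    first_sentence = recall_text[:recall_text.find('.')].lower()
    first_line = recall_text[:recall_text.find('\n')]
    if any(tag in first_sentence for tag in title_tags):
        # only treat short first lines as titles, not full sentences
        if len(first_line) < max_title_length:
            return recall_text[recall_text.find('\n'):].strip()
    return recall_text

tags = ['where does susie go at noon?']
titled = 'Where Does Susie Go at Noon?\nSusie is a cat. She leaves at noon.'
untitled = 'Susie is a cat. She leaves at noon.'

cleaned_titled = strip_title(titled, tags)
cleaned_untitled = strip_title(untitled, tags)
```

The length check guards against removing a genuine first line of recall that happens to quote the title.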

Part 1 and Part 2 data were collected in slightly different contexts, so they are preprocessed a little differently:

In [5]:
# for each pt2 written recall file, extract text and remove boilerplate, 
# and save to correct location in `texts`
for path, subdirs, files in os.walk(
    os.path.join(source_directory, 'recall', 'Written Recall Part 2')):
    for name in files:
        recall_path = str(pathlib.PurePath(path, name))
        
        # identify correct location in `texts`
        passage_index = recall_path[-7:-6]
        subject_index, phase_index = recall_path.split(name)[0][-3:-1], 3
        if len(passage_index.strip()) == 0:
            continue
        targetname = '{}_{}_{}.txt'.format(
            source_names[int(passage_index)-1], int(subject_index), phase_index)

        # extract text and remove boilerplate
        boilerplate = 'You have 5 minutes to type the story you just read for memory. There is no word limit. Please write as much as you can remember.'
        recall_text = docxpy.process(
            recall_path).replace(boilerplate, '').strip()
        recall_text = '\n'.join(recall_text.split('\n')[1:]).strip()
        
        # clean text
        recall_text = ftfy.fix_text(recall_text)
        
        # save to correct location
        with open(os.path.join(
            target_directory, source_names[int(passage_index)-1], 
            targetname), 'w', encoding='utf-8') as f:
            f.write(recall_text)
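
The Part 2 cleanup can be sketched on a toy document as follows. The helper name `strip_prompt` and the toy document, including its `Story 4` first line, are assumptions made for illustration; the actual first line of the raw files is not shown in this notebook.

```python
PROMPT = ('You have 5 minutes to type the story you just read for memory. '
          'There is no word limit. Please write as much as you can remember.')

def strip_prompt(document_text, prompt=PROMPT):
    """Remove the recall prompt, then drop the first remaining line."""
    text = document_text.replace(prompt, '').strip()
    return '\n'.join(text.split('\n')[1:]).strip()

# invented document: a header line, the prompt, then the typed recall
toy_document = ('Story 4\n' + PROMPT +
                '\n\nSusie is a cat who leaves every day at noon.')
cleaned = strip_prompt(toy_document)
```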

### The result is an organized directory of text representations of participant responses absent methodology-specific details such as the content of the recall prompt.

## Extracting Sequence Representations

Most of our analyses don't work directly on the text data we preprocess above. Instead, we want something formatted more like the traditional object of free recall modeling: vectors tracking the order in which items were recalled in each trial.

What kind of information do we want stored about each trial for subsequent analyses?

For downstream interpretability:
- `source_units`, text representation of each idea unit in the original story
- `response_units`, text representation of each idea unit in participant's recall response

To support simulation/prediction:
- `trials`, the recall sequence for each trial: for each response unit in order, the serial index of the matched source unit; 0 marks termination
- `cycles`, vector of integers grouping successive source units into cycles based on the sentence structure of the studied narrative; 0 marks termination
- `similarities`, matrix of pairwise similarities between source units, a common parameter for models of narrative comprehension and memory; the diagonal is 0

`cycles` and `source_units` are story-specific, so we'll want a vector that codes, for each trial, which story to simulate along with these structures. We'll similarly want vectors that code the subject and session index for each trial.
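
One way to read "0 for termination" is that recall sequences of different lengths are right-padded with zeros into a rectangular array. The sketch below shows that idea on invented sequences; the helper `pad_trials` is hypothetical, and the padding step itself is an assumption about a downstream format rather than something this notebook performs.

```python
def pad_trials(sequences, pad_value=0):
    """Right-pad recall sequences with `pad_value` so every row has equal length."""
    width = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [pad_value] * (width - len(seq)) for seq in sequences]

# three invented trials: source-unit indices in the order they were recalled
toy_trials = pad_trials([[1, 3, 2], [2, 1], []])
```

Because source units are indexed from 1, a 0 can be reserved for padding without colliding with a real item.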

All of these can be stored together in a single file (a MAT file, for example). Consider pregenerating the psifr dataframe as well.

Human raters have gotten us most of what we want in the spreadsheet at `data/raw/Narrative Recall Data.xlsx`. Most of the remaining preprocessing is devoted to sorting source units into cycles and tracking response units that raters did not match to a source unit. The latter is unlikely to matter for most downstream analyses but is still worth keeping track of (and we already have most of the code for it from previous projects).

In [None]:
# dependencies
import os
import spacy
import pandas as pd
nlp = spacy.load("en_core_web_trf")

# key paths
source_directory = os.path.join('data', 'raw')
text_directory = os.path.join('data', 'texts')
target_directory = os.path.join('data', 'sequences', 'human')

# names for relevant passages
passage_names = ['Fisherman', 'Supermarket', 'Flight', 'Cat', 'Fog', 'Beach']

# we use the original xlsx
data = pd.read_excel(os.path.join(
    source_directory, 'Narrative Recall Data.xlsx'), 
                     list(range(22)), engine='openpyxl')

data[0].head()

### Semantic Similarity

In [None]:
# export

import torch
from sentence_transformers import SentenceTransformer

if torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")

class TextSimilarity(torch.nn.Module):
    def __init__(self, model_name="stsb-distilbert-base"):
        super(TextSimilarity, self).__init__()
        self.model_name = model_name

        # get the model and tokenizer, assuming we are using DistilBert
        self.model = SentenceTransformer(model_name)
        self.to(device)
  
    @staticmethod
    def cosine_similarity(x: torch.Tensor) -> torch.Tensor:
        """
        Custom implementation of cosine similarities
        """
        norm = x.norm(dim=-1).unsqueeze(0)

        # this is the formula for cosine similarities in a symmetric matrix
        return x @ x.t() / (norm.t() @ norm)

    def forward(self, cycles: list) -> torch.Tensor:
        """
        Assumes input is a list of lists of sentences/text units
        i.e. a List of Lists of strings
        """
        embeddings = torch.cat([self.model.encode(i, convert_to_tensor=True) for i in cycles])
        init_connections = self.cosine_similarity(embeddings)
        return init_connections
    
ts = TextSimilarity("stsb-distilbert-base")
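
The description of `similarities` earlier in this notebook calls for a 0 diagonal, whereas a plain cosine similarity matrix has 1 on the diagonal. The dependency-free sketch below computes pairwise cosine similarity on toy vectors and optionally zeroes the diagonal. It is an illustration only, not the `TextSimilarity` implementation; the function name and the `zero_diagonal` flag are hypothetical.

```python
import math

def cosine_similarity_matrix(vectors, zero_diagonal=True):
    """Pairwise cosine similarity between rows of `vectors` (lists of floats)."""
    norms = [math.sqrt(sum(v * v for v in vec)) for vec in vectors]
    matrix = []
    for i, a in enumerate(vectors):
        row = []
        for j, b in enumerate(vectors):
            if zero_diagonal and i == j:
                row.append(0.0)  # self-similarity is masked out
            else:
                dot = sum(x * y for x, y in zip(a, b))
                row.append(dot / (norms[i] * norms[j]))
        matrix.append(row)
    return matrix

# toy 2-d "embeddings"; real sentence embeddings have many more dimensions
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_similarities = cosine_similarity_matrix(toy_embeddings)
```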

### Story Information
- Strings identifying idea units within each story
- Semantic similarity matrix between source idea units
- Cycles grouping source idea units based on co-occurrence in the same sentence

In [None]:
all_cycles = []
all_source_units = []
all_similarities = []
story_sequence = []

for trial_index, trial in data[0].groupby(['story', 'timeTest']):
    
    # we only consider each story once
    if trial['timeTest'].values[0] > 1:
        continue
    
    # identify story
    story_index = trial['story'].values[0]
    story_sequence.append(story_index)
    print(story_index)
    
    # source units are reproduced perfectly in xlsx file
    source_units = [each for each in list(trial['origText']) if type(each) == str]
    
    # collect relevant text
    with open(os.path.join(
        text_directory, passage_names[story_index-1] + '.txt'), encoding='utf8') as f:
        story_text = f.read()
        
    # identify discrete sentences in the text so we can sort units into cycles
    sentences = [each for each in nlp(story_text).sents if len(each) > 1]
    
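    # build cycle vector identifying the number of idea units per sentence
    cycles = []
    last = 0
    counter = 0
    for unit in source_units:

        # find the first sentence containing this unit
        for sentence_index, sentence in enumerate(sentences):
            if unit in sentence.text:

                if sentence_index == last:
                    counter += 1
                else: 
                    # a new sentence begins: close out the previous cycle
                    if counter > 0:
                        cycles.append(counter)
                    last = sentence_index
                    counter = 1
                break

    # close out the final cycle, which the loop above never appends
    if counter > 0:
        cycles.append(counter)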
                
    # track semantic similarity between each source unit
    similarities = ts([source_units]).detach().tolist()
    
    all_cycles.append(cycles)
    all_similarities.append(similarities)
    all_source_units.append(source_units)
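
The counting idea behind `cycles` can be sketched without spaCy by using plain strings as sentences. The function name `units_per_sentence` and the toy story are invented for illustration; the sketch assumes each unit appears verbatim in exactly one sentence and that units are listed in story order.

```python
from itertools import groupby

def units_per_sentence(units, sentences):
    """Count how many consecutive idea units fall in each sentence."""
    sentence_of_unit = []
    for unit in units:
        for sentence_index, sentence in enumerate(sentences):
            if unit in sentence:
                sentence_of_unit.append(sentence_index)
                break  # take the first sentence that contains the unit
    # collapse runs of identical sentence indices into run lengths
    return [len(list(group)) for _, group in groupby(sentence_of_unit)]

toy_sentences = ['The cat left at noon and walked to the pier.',
                 'Nobody knew why.']
toy_units = ['The cat left at noon', 'walked to the pier', 'Nobody knew why']
toy_cycles = units_per_sentence(toy_units, toy_sentences)
```

With this toy story, the first sentence contributes two idea units and the second contributes one.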

### Trial Information

Again, we want a `trials` array with each row indicating the order of recalled source units for each trial. We also want for each trial the strings identifying the idea units in participants' responses. 

We also want vectors coding the story, timeTest, and subject for each trial.

In [None]:
all_correspondences = []
all_response_units = []
all_subject = []
all_story = []
all_timeTest = []

# path format for subject text data
text_path = os.path.join(text_directory, '{}', '{}_{}_{}.txt')

# consider each unique trial
for subject_index, subject in enumerate(data):
    print(subject)
    for trial_index, trial in data[subject].groupby(['story', 'timeTest']):
        
        # identify story, timeTest (we already have subject_index)
        story_index = trial['story'].values[0]
        timeTest = trial['timeTest'].values[0]
        passage_name = passage_names[story_index-1]
        
        # grab preprocessed textual representation of participant response
        # so we can identify response units that raters did not match to a source unit
        response_text_path = text_path.format(
            passage_name, passage_name, subject_index+1, trial_index[1])
        
        try:
            with open(response_text_path, encoding='utf8') as f:
                raw_response = f.read()
        except FileNotFoundError:
            print('Could not find:', response_text_path)
            continue
        
        # for idea unit identification, start with initial units coded in xlsx
        # as well as the sentences in the raw text
        initial_units = [each for each in list(
            trial['recText']) if each and type(each) == str]
        response_sentences = [each for each in nlp(raw_response).sents if any(char.isalnum() for char in each.text)]
        
        # use line breaks to perform segmentation, discarding pre-existing ones
        response_text = raw_response.replace('\n', '\t') 
        
        # segment by human-coded units then any non-overlapping sentence units
        for unit in initial_units:
            response_text = response_text.replace(unit, f'\n{unit}\n')
        for sentence in response_sentences:
            response_text = response_text.replace(
                sentence.text.replace('\n', ''), f'\n{sentence.text}\n')
        
        # using this initial split, build list of response idea units
        # we'll ignore units light in content (no alphanumeric, etc)
        response_units = []
        correspondences = []
        
        for unit in response_text.split('\n'):

            # reject units that have no alphanumeric characters
            if not any(char.isalnum() for char in unit):
                continue
                
            # identify index of source unit associated w/ this proposed recall unit
            # if unit is unmatched to a source unit, code as -1
            # reserve 0 for termination
            if unit in initial_units:
                # take the first matching row; int() on a whole Series is deprecated
                correspondence = int(
                    trial.loc[trial['recText'] == unit, 'serialPos'].iloc[0])
            else:
                correspondence = -1
            response_units.append(' '.join([word.text for word in nlp(unit)]).lower())
            correspondences.append(correspondence)
            
        all_correspondences.append(correspondences)
        all_response_units.append(response_units)
        all_subject.append(subject_index)
        all_timeTest.append(timeTest)
        all_story.append(story_index)
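
The segmentation and matching step can be illustrated on a toy response. This simplified sketch keeps only the human-coded split and omits the spaCy sentence pass and tokenization; `code_response` and the toy data are hypothetical, with a dict standing in for the `recText` and `serialPos` columns.

```python
def code_response(raw_response, coded_units):
    """Split a response around human-coded units and look up their source index.

    `coded_units` maps coded response text to a source serial position.
    Text that raters did not code is kept as its own unit and coded -1.
    """
    text = raw_response.replace('\n', '\t')  # discard pre-existing line breaks
    for unit in coded_units:
        text = text.replace(unit, '\n' + unit + '\n')

    units, correspondences = [], []
    for unit in text.split('\n'):
        # skip fragments with no alphanumeric content
        if not any(char.isalnum() for char in unit):
            continue
        units.append(unit.strip())
        correspondences.append(coded_units.get(unit, -1))
    return units, correspondences

toy_response = 'the cat left at noon. I forget the rest. she went to the pier'
toy_coded = {'the cat left at noon': 1, 'she went to the pier': 2}
toy_units, toy_trial = code_response(toy_response, toy_coded)
```

The middle fragment is not in `toy_coded`, so it is retained as a response unit but coded -1.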

### Storage
For flexibility, we'll go for a simple JSON representation for now and leave further processing for the Data Preparation notebook or other downstream pipelines.

In [None]:
import json

result = {}

# story information
result['cycles'] = all_cycles
result['source_units'] = all_source_units
result['similarities'] = all_similarities
result['story_names'] = passage_names

# response information
result['subject'] = all_subject
result['response_units'] = all_response_units
result['story'] = [int(each) for each in all_story]
result['timeTest'] = [int(each) for each in all_timeTest]
result['trials'] = [[int(each) for each in row] for row in all_correspondences]
#print(json.dumps(result, indent=4))
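
The cell above assembles `result` but does not show it being written to disk. A minimal sketch of a JSON round trip follows, assuming a plain dictionary of lists like `result`. The helper names and the file name `sequences.json` are hypothetical, and a temporary directory stands in for the real target location.

```python
import json
import os
import tempfile

def save_result(result, path):
    """Write a dictionary of lists to `path` as indented JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(result, f, indent=4)

def load_result(path):
    """Read back a dictionary previously written by `save_result`."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)

# invented stand-in for the real `result` dictionary
toy_result = {'story_names': ['Fisherman'], 'trials': [[1, -1, 2]], 'subject': [0]}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'sequences.json')  # hypothetical file name
    save_result(toy_result, toy_path)
    restored = load_result(toy_path)
```

JSON only stores native Python types, which is why the cell above converts NumPy integers with `int(...)` before storage.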

In [None]:
for index, each in enumerate(result['response_units']):
    print(result['subject'][index], result['story'][index], result['timeTest'][index])
    print(each)
    print(result['trials'][index])
    print()