# Text Preprocessing

The raw data obtained from the Brown-Schmidt lab for this research requires preprocessing before it is suitable for downstream analyses. Here we do that by:

- Removing irrelevant boilerplate
- Putting text into a maintainable file format
- Extracting serial order of study idea units from texts and researcher segmentations/correspondences
- Pairing extracted trial data with semantic similarity data

## Dataset Overview
We render an overview of the dataset prepared for our publication:

In [2]:
from IPython.display import Markdown

def render_tex(tex_path, bib_path, csl_path):
    result = !pandoc -C --ascii {tex_path} -f latex -t markdown_mmd --bibliography {bib_path} --csl {csl_path}
    return Markdown('\n'.join(result))

render_tex('writing/BrownSchmidt_Dataset.tex', 'writing/references.bib', 'writing/main/apa.csl')

Narrative recall can be examined using analytic techniques developed for
free and serial list recall tasks when narratives are split into idea
units ("meaningful chunks of information that convey a piece of the
narrative") numbered in chronological order.
This framework enables direct comparison between ideas, assumptions, and
models applied to understand how people remember sequences such as word
lists and those used to understand memory for narrative texts. To
support analysis of narrative recall this way, we considered a dataset
collected, preprocessed, and presented by Cutler et al. (2019). In
corresponding experiments, research participants read 6 distinct short
stories. Upon reading a story, participants performed immediate free
recall of the narrative twice. Three weeks later, participants performed
free recall of each narrative again. Each recall period was limited to
five minutes. Following data collection, a pair of research assistants
in the Brown-Schmidt laboratory were each instructed to independently
split stories and participant responses into idea units as defined
above, and to identify correspondences between idea units in participant
responses and corresponding studied stories reflecting recall. Following
this initial preprocessing, research assistants then compared and
discussed their results and recorded consensus decisions regarding the
segmentation and correspondence of idea units across the dataset.
Further analysis focused on the sequences of story idea units recalled
by participants on each trial as tracked by these researchers.

<div id="refs" class="references csl-bib-body hanging-indent"
markdown="1" line-spacing="2">

<div id="ref-cutler2019narrative" class="csl-entry" markdown="1">

Cutler, R., Palan, J., Polyn, S., & Brown-Schmidt, S. (2019). Semantic
and temporal structure in memory for narratives: A benefit for
semantically congruent ideas. *Context and Episodic Memory Symposium*.

</div>

</div>

## Standardizing Text Representations

Using the data in `raw`, we populate `texts` with one subdirectory per passage. Each subdirectory holds one file per recall period, containing only the recalled text for a particular passage, subject, and recall period and named accordingly (e.g. `Supermarket_1_1.txt`). The text of each source passage is stored as a separate file at the base of `texts`.
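The naming scheme described above can be sketched as a small helper. This is an illustration only: `build_target_path` and `SOURCE_NAMES` are hypothetical names that the cells below do not use, and the `<passage>/<passage>_<subject>_<phase>.txt` layout is inferred from the description and example filename above.

```python
import os

# Passage names, in the same order as the `source_names` constant defined below
SOURCE_NAMES = ['Fisherman', 'Supermarket', 'Flight', 'Cat', 'Fog', 'Beach']


def build_target_path(target_directory, passage_index, subject_index, phase_index):
    """Hypothetical helper: map a 1-based passage index, a subject index,
    and a recall phase to the path of the corresponding text file."""
    passage = SOURCE_NAMES[int(passage_index) - 1]
    filename = '{}_{}_{}.txt'.format(
        passage, int(subject_index), int(phase_index))
    return os.path.join(target_directory, passage, filename)


# Indices are often parsed from file paths as strings, so strings are accepted
example = build_target_path(os.path.join('data', 'texts'), '2', '01', '1')
print(example)
```

Casting the subject index to `int` drops any zero padding, so subject `01` and subject `1` map to the same file name.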

### We start with some initial dependencies and constants.

In [2]:
# import dependencies
import os
import pathlib
import docxpy
import ftfy

# key paths
source_directory = os.path.join('data', 'raw')
target_directory = os.path.join('data', 'texts')

source_names = ['Fisherman', 'Supermarket', 'Flight', 'Cat', 'Fog', 'Beach']
source_titles = ['where does susie go at noon?']
title_tags = [['''man and the bear'''], ['''act of kindness'''], 
              ["""a man can’t just sit""", "a man just can’t sit"], 
              ["where does susie go at noon?"], ["fog: a maine t"], 
              ["day at the beach"]]
author_tags = ['author unknown', 'anonymous', 'chris holm', 'adapted from',
               'unknown', 'anonymous']

### Next we create directories in our file system to organize preprocessed data.

In [3]:
# make the target `texts` directory (and any missing parents) if needed
os.makedirs(target_directory, exist_ok=True)

# generate a subdirectory for each passage
for source_name in source_names:
    os.makedirs(os.path.join(target_directory, source_name), exist_ok=True)

### Preprocess raw `docx` files and store as text

In [4]:
# for each part 1 written recall file, extract text, remove boilerplate,
# and save to the correct location in `texts`
for path, subdirs, files in os.walk(os.path.join(
    source_directory, 'recall', 'Written Recall Part 1')):
    for name in files:
        recall_path = str(pathlib.PurePath(path, name))
        
        # extract text and remove boilerplate
        recall_text = '\n'.join(
            docxpy.process(recall_path).split('\n')[1:]).strip()
        passage_index = recall_path[-9:-8]
        subject_index = recall_path.split(name)[0][-3:-1]
        phase_index = recall_path[-7:-6]
        targetname = '{}_{}_{}.txt'.format(
            source_names[int(passage_index)-1], int(subject_index), phase_index)
        
        # correct known typos in the raw text before matching title tags
        recall_text = recall_text.replace(
            'vbeach', 'beach').replace('Susie gp at noon', 'Susie go at noon')
        
        # filter out source titles from recall data
        if any([each in recall_text[:recall_text.find(
            '.')].lower() for each in title_tags[int(passage_index)-1]]):
            if len(recall_text[:recall_text.find('\n')]) < 100:
                recall_text = recall_text[recall_text.find('\n'):].strip()
                
        # filter out source authors from recall data
        if (recall_text[:len(author_tags[int(
            passage_index)-1])].lower() == author_tags[int(passage_index)-1]):
            recall_text = recall_text[recall_text.find('\n'):].strip()
            
        # clean the data
        recall_text = ftfy.fix_text(recall_text)
            
        # save to the correct location in `texts`
        with open(
            os.path.join(target_directory, source_names[int(passage_index)-1], 
                         targetname), 'w', encoding='utf-8') as f:
            f.write(recall_text)
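The title filter in the cell above can be isolated as a small function for testing. The sketch below is a variant, not the cell's exact logic: `strip_leading_title` is a hypothetical name, it searches for the tag in the first line rather than in the text before the first period, and it leaves the text unchanged when there is no line break (where `str.find` would return `-1`). The sample response is invented for illustration.

```python
def strip_leading_title(text, title_tags, max_title_length=100):
    """Drop the first line of `text` if it looks like a passage title.

    Hypothetical helper: a first line counts as a title when it contains
    one of `title_tags` (case-insensitive) and is shorter than
    `max_title_length` characters.
    """
    first_line, newline, remainder = text.partition('\n')

    # without a line break there is no separate title line to remove
    if not newline:
        return text

    looks_like_title = any(tag in first_line.lower() for tag in title_tags)
    if looks_like_title and len(first_line) < max_title_length:
        return remainder.strip()
    return text


# invented sample response that begins with the passage title
sample = ('Where does Susie go at noon?\n'
          'Susie was a cat who left the house every day.')
cleaned = strip_leading_title(sample, ['where does susie go at noon?'])
print(cleaned)
```

Using `str.partition` avoids slicing with a `-1` index when no line break is present.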

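Part 1 and Part 2 data were collected in slightly different contexts, so they are preprocessed a little differently. In the Part 2 files, the passage index sits at a different position in the file path, the recall prompt is included in the document text and must be removed, and every response is stored as recall phase 3: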

In [5]:
# for each part 2 written recall file, extract text, remove boilerplate,
# and save to the correct location in `texts`
for path, subdirs, files in os.walk(
    os.path.join(source_directory, 'recall', 'Written Recall Part 2')):
    for name in files:
        recall_path = str(pathlib.PurePath(path, name))
        
        # identify the correct location in `texts`; skip files with no passage index
        passage_index = recall_path[-7:-6]
        subject_index, phase_index = recall_path.split(name)[0][-3:-1], 3
        if len(passage_index.strip()) == 0:
            continue
        targetname = '{}_{}_{}.txt'.format(
            source_names[int(passage_index)-1], int(subject_index), phase_index)

        # extract text and remove boilerplate
        boilerplate = 'You have 5 minutes to type the story you just read for memory. There is no word limit. Please write as much as you can remember.'
        recall_text = docxpy.process(
            recall_path).replace(boilerplate, '').strip()
        recall_text = '\n'.join(recall_text.split('\n')[1:]).strip()
        
        # clean text
        recall_text = ftfy.fix_text(recall_text)
        
        # save to correct location
        with open(os.path.join(
            target_directory, source_names[int(passage_index)-1], 
            targetname), 'w', encoding='utf-8') as f:
            f.write(recall_text)
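The boilerplate removal for Part 2 can be checked on a toy string without any `docx` files. The sketch below mirrors the two string operations in the cell above; `strip_prompt_and_header` is a hypothetical name, and the sample text (including its `Subject 12` first line) is invented, since the actual first line of the raw files is not shown here.

```python
PROMPT = ('You have 5 minutes to type the story you just read for memory. '
          'There is no word limit. Please write as much as you can remember.')


def strip_prompt_and_header(raw_text, prompt=PROMPT):
    """Hypothetical helper: remove the recall prompt, then drop the first
    remaining line, as the Part 2 cell above does."""
    without_prompt = raw_text.replace(prompt, '').strip()
    lines = without_prompt.split('\n')
    return '\n'.join(lines[1:]).strip()


# invented example: a header line, the prompt, then the recalled text
raw = 'Subject 12\n' + PROMPT + '\n\nThe fog rolled in over the harbor.'
recalled = strip_prompt_and_header(raw)
print(recalled)
```

Because the first remaining line is dropped unconditionally, this approach assumes every raw file starts with a line that is not part of the response.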

### The result is an organized directory of text representations of participant responses absent methodology-specific details such as the content of the recall prompt.