# Pirahã syntax

In this notebook we are going to explore Pirahã syntax. As you know, Pirahã is (in)famous in the field of linguistics for many reasons. One of the most prominent claims made about the language is a lack of syntactic recursion. Futrell et al. (2016) analyze a corpus of Pirahã for evidence of syntactic embedding. They report that they do not find any such evidence, and that the corpus is consistent with an analysis of Pirahã as a [regular language](https://en.wikipedia.org/wiki/Regular_language). If you want to know more about Pirahã and the controversy surrounding it, read the following:

- Everett, Daniel L. "Pirahã culture and grammar: a response to some criticisms." Language 85.2 (2009): 405-442.
- Everett, Daniel, et al. "Cultural constraints on grammar and cognition in Piraha: Another look at the design features of human language." Current anthropology 46.4 (2005): 621-646.
- Frank, Michael C., et al. "Number as a cognitive technology: Evidence from Pirahã language and cognition." Cognition 108.3 (2008): 819-824.
- Futrell, Richard, et al. "A corpus investigation of syntactic embedding in Pirahã." PloS one 11.3 (2016): e0145289.
- Nevins, Andrew, David Pesetsky, and Cilene Rodrigues. "Pirahã exceptionality: A reassessment." Language 85.2 (2009): 355-404.

Throughout all the notebooks of this workshop, we're not going to be concerned with whether the arguments are right or wrong. Rather, we care about practising our Python skills so that we can do similar linguistic analyses. In particular, in this notebook we're going to practise the following topics:

### Core ideas of this notebook
- Object types
- Manipulating text data
- Regular expressions
- User-defined functions
- Writing modular code
- Documentation
- Going from messy to structured data
- Using other people's code


### Our goal
We want to turn <a href="data/piraha.txt">this</a> into this:                                   


|story_num|speaker|                fname                 |utt_num|                               utt_translation                               |sent_num|                                   words                                   |                 sent_translation                  |
|--------:|-------|--------------------------------------|------:|-----------------------------------------------------------------------------|-------:|---------------------------------------------------------------------------|---------------------------------------------------|
|        1|Aogioso|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|      1|Early in the day I spoke. BaIgipOhoasi spoke (carried sound). Is Kato sleepy?|       1|['ti', 'xahoa', '-gI', 'ti', 'iga', 'O', '-p', '-I', '-xi']|[Early in the] day I spoke.                        |
|        1|Aogioso|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|      1|Early in the day I spoke. BaIgipOhoasi spoke (carried sound). Is Kato sleepy?|       2|['hi', 'igA', 'xai', 'baIgipOhoasi']       |BaIgipOhoasi [speaker's sister] spoke.             |
|        1|Aogioso|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|      1|Early in the day I spoke. BaIgipOhoasi spoke (carried sound). Is Kato sleepy?|       3|['KatO', 'hi', 'o', '*', '-b', '-a', '-p', '-I', '-aag', '-oxoihI', 'KatO']|"Is Kato sleepy?" [Lit: "Kato-- her eyes flutter?"]|
|        1|Aogioso|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|      2|He (TixohOI) fell by the fire.                                               |       1|['hi', 'hoaI', 'ib', '-a', '-b', '-og', '-aA']|He [TixohOI, KatO's baby] almost fell in the fire. |
|        1|Aogioso|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|      3|I spoke (carried sound). TixohOI is crying on the ground                     |       1|['ti', 'igA', 'xai', '-ai']                         |I spoke!                                           |


## Data

The corpus used in the paper consists of Pirahã stories that were originally collected and translated by Steve Sheldon and Dan Everett over a period of several decades. It also includes aligned Pirahã-English translations, along with shallow syntactic parses and approximate English glosses.

Here's how the authors describe the way they created the corpus:
> We obtained glossed transcriptions of 17 stories in Pirahã, consisting of a total of 1149 sentences
and 6830 words in our analysis. 13 of the stories were collected by Steve Sheldon in the
1970s, and the remaining 4 stories were collected by Dan Everett over the period 1980–2009.
Each story was told by a single speaker with no recorded interruptions. The stories were transcribed
by Everett or Sheldon; audio recordings are only available for stories 2 and 3. According
to Everett, the texts are fairly representative of how the Pirahã tell stories to one another.

If you're interested, you can see all the original texts in pdf format [here](https://github.com/languageMIT/piraha/tree/master/sources).

### Downloading the data

Futrell et al. make their data [available](https://github.com/languageMIT/piraha), but I've already downloaded it for you. In the `data` folder there's a file called `piraha.txt`. 

### Structure of the data

Although it's great that the data is made freely available, it's in a really messy format. It's going to take some preprocessing for us to reformat the data into a more usable form. Here's the description from the [README](https://github.com/languageMIT/piraha/blob/master/README.md) of how the corpus is structured:

> Each text in the corpus is preceded by three lines of hashes and information about the source of the text. The corpus is divided into stories; stories are divided into "utterances"; and utterances are divided into sentences. Utterances correspond to the sentence breaks in Steve Sheldon's original glosses. Each utterance is preceded by two English glosses (free translations). The first is labeled with a hash and a code of two numbers, and reflects the English translation given by Steve Sheldon or Daniel L. Everett in their original translation. The second is labeled with a hash and a code of three numbers, and reflects the current best translation, as judged by Daniel L. Everett, Steve Sheldon, and the other authors. Clarifications are provided in square brackets in these glosses. The glosses allow simple text searching to find rough equivalents of English words and phrases (e.g. what does Pirahã use to convey meanings glossed as "and"?).

> Each Pirahã sentence is labeled with a unique code of three numbers, appearing on the preceding line along with its current best English translation. The first number indicates the text in which the utterance appears. The second number indicates the utterance's sequential placement within the text. This second number reflects the utterance boundaries present in the original transcriptions (e.g. by Dan Everett and Steve Sheldon). In many cases, text which was originally translated into a single utterance actually includes a group of Pirahã sentences according to our current best translations. When this occurs, the third number indicates the order of the Pirahã sentences within this grouping. Otherwise, the third number is simply 1.

Here's my simplification of it:

- The corpus is all in one file, `piraha.txt`.
- There are 17 stories in the corpus. They are separated from one another by three lines of hashtags.
- Each story starts with three pieces of metadata: the source, the informant and the comment.
- The source holds the names of the original pdf files of the story and the story number.
- The informant is the speaker (for most stories there's only one speaker).
- The comment holds various other things, like background, summary of the story, etc.
- After the metadata, each story has a series of utterances. Utterances are the original sentence boundaries given by either Everett or Sheldon. Each utterance has a number within the story (e.g. 4.16 for utterance number 16 in story number 4). Immediately after the numeric identifier comes the original English translation of the utterance.
- However, Sheldon and Everett have re-analyzed many utterances to in fact consist of more than one sentence. So each utterance actually consists of one or more sentences. Each sentence is identified by the story number, utterance number and sentence number (so 4.16.2 is the 2nd sentence of the 16th utterance of the 4th text). Immediately after each numeric identifier of a sentence is a shallow parse tree.
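
The numbering scheme described above can be sketched in a few lines. This is only an illustration, assuming identifiers look exactly like `4.16` or `4.16.2`; the helper name `parse_identifier` is made up for this example and is not part of the corpus or its tools.

```python
def parse_identifier(identifier):
    """Split an identifier like '4.16' or '4.16.2' into named integers."""
    names = ['story', 'utterance', 'sentence']
    parts = [int(part) for part in identifier.split('.')]
    # zip stops at the shorter sequence, so '4.16' has no 'sentence' key
    return dict(zip(names, parts))

print(parse_identifier('4.16'))
print(parse_identifier('4.16.2'))
```

Returning a dictionary rather than a bare tuple means the caller never has to remember which position holds which number.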



### Reading in data in Python

The main way to read in data in Python is to use the `open` function. The `open` function is always available as soon as you start Python (whether from a Jupyter notebook, an interactive session from the terminal, in IDLE, or in Spyder).

###  1. Read in the data
_Hint: It's stored in a file called 'piraha.txt' in a folder called 'data'._

In [1]:
### FILL IN THE BLANKS
fname = 'data/piraha.txt'
with open(fname, 'r') as f:
    raw_text = f.read()

- What type is `fname`?
- What type is `open(fname)`?
- Can you access the opened file after the `with` statement?
- Old school way of opening files
- Joining paths together
- Finding the type of a file
- Reading in files line by line
- Read Python docs for [open](https://docs.python.org/3/library/functions.html) and [os.path](https://docs.python.org/3/library/os.path.html)
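
Several of the points above can be tried out on a throwaway file. The sketch below does not touch the real corpus: the file name `toy.txt` and its contents are invented, and the temporary folder is only there so the example runs anywhere.

```python
import os
import tempfile

# Build a small file to experiment with (not the real corpus).
folder = tempfile.mkdtemp()
toy_fname = os.path.join(folder, 'toy.txt')  # uses the right separator for your OS
with open(toy_fname, 'w', encoding='utf-8') as f:
    f.write('first line\nsecond line\n')

# Read the whole file at once.
with open(toy_fname, 'r', encoding='utf-8') as f:
    whole = f.read()

# After the `with` block the name `f` still exists, but the file is closed.
print(type(toy_fname), f.closed)

# Read line by line; each line keeps its trailing newline unless we strip it.
with open(toy_fname, 'r', encoding='utf-8') as f:
    lines = [line.rstrip('\n') for line in f]

# The "old school" way: you have to remember to close the file yourself.
f = open(toy_fname, 'r', encoding='utf-8')
first = f.readline()
f.close()
```

The `with` form is usually preferred because the file gets closed even if an error occurs halfway through reading.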

### 2. Split `raw_text` into individual stories

_Hint: Each text is separated by a special string. Use that special string to turn the raw text into a list of stories called `stories`._

In [2]:
### FILL IN THE BLANKS
text_boundary = '''############################################################################################################################################ 
############################################################################################################################################ 
############################################################################################################################################'''
stories = raw_text.split(text_boundary)[1:]
stories = [story.strip() for story in stories]

- What type is `stories`?
- How can you get the first story?
- What type is each story?
- How can you get the first 100 characters of the seventh story?
- What does `strip()` [do](https://docs.python.org/3.6/library/stdtypes.html#text-sequence-type-str)?
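
To experiment with these questions without scrolling through the whole corpus, a miniature version helps. The text and the `===` boundary below are invented stand-ins for the real file and its rows of hashes.

```python
toy_text = 'header\n===\n story one \n===\n story two \n'

toy_stories = toy_text.split('===')[1:]          # drop whatever precedes the first boundary
toy_stories = [s.strip() for s in toy_stories]   # remove leading and trailing whitespace

print(type(toy_stories))      # the container
print(toy_stories[0])         # indexing gets one element
print(type(toy_stories[0]))   # the type of each element
print(toy_stories[1][:5])     # slicing gets the first 5 characters
```

The same indexing and slicing work on the real `stories` list, just with bigger numbers.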

### 3. Turn what you just did in 2 into a function

In [3]:
def split_stories(text):
    """Return all the stories in `text`.
    
    Parameters
    ----------
    text : str
        The raw text of the corpus
    
    Returns
    -------
    stories: list(str)
    """
    text_boundary = '''############################################################################################################################################ 
############################################################################################################################################ 
############################################################################################################################################'''
    stories = text.split(text_boundary)[1:]
    return [story.strip() for story in stories]

stories = split_stories(raw_text)
story = stories[3]

- Does the function `split_stories`'s parameter have the name you gave it?
- Do user-defined functions have to have parameters?
- If you forget the `return` statement, what does the function return?
- Why do we bother ever defining functions?
- Lambda functions
- Higher-order functions (e.g. for `sorted`)
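
The last few points can be illustrated with a short sketch. The function `no_parameters` is a made-up name, and the word list is borrowed from the example table at the top of the notebook.

```python
def no_parameters():
    """A function with no parameters and no return statement."""
    pass

result = no_parameters()   # without `return`, a function gives back None
print(result)

words = ['xahoa', 'ti', 'baIgipOhoasi', 'hi']

# `sorted` is a higher-order function: it accepts another function as `key`.
by_length = sorted(words, key=len)                    # an existing function
by_last_letter = sorted(words, key=lambda w: w[-1])   # an anonymous (lambda) function
print(by_length)
print(by_last_letter)
```

A lambda is handy when the function is so small that giving it a name and a docstring would be overkill.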

### 4. Import the library for regular expressions in Python

In [4]:
### FILL IN THE BLANKS
import re

- Why do we have to import libraries at all? Wouldn't it be easier if everything was immediately available like the `open` function is?
- What type is the module?
- What if we really wanted to have our own variable with the same name as a library we import? Won't their names clash?
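
One possible answer to the name-clash question is to import the module under an alias. The sketch below uses the standard `math` module rather than `re`, so that running it does not interfere with the rest of the notebook; the alias `m` is an arbitrary choice.

```python
import math as m   # `m` is an arbitrary alias chosen for this example

math = 'my own variable'   # no clash: the module is bound to `m`, not `math`

print(type(m))             # modules are objects too
print(m.sqrt(16), math)
```
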

### 5. Extract the story number from a story

_Hint: Look in `piraha.txt` to see that a story's number is stored on the line that starts with "# SOURCE". What is the regular expression for a single digit?_

In [5]:
match = re.search(r'# SOURCE: (\d+)', story)
if match:
    number = match.group(1)
    number = int(number)
else:
    number = -1

- How do we access the function called `search` in the `re` module?
- What if we had imported everything from the `re` module like this: `from re import *`?
- What parameters does the `search` function take?
- What's the difference between `search` and `match`?
- What type of object does the `re.search` function return if it does find a match? What if it doesn't?
- How can I turn something of type `int` into a string? How can I turn a string into an integer?
- Use [pythex](https://pythex.org/) to check your regexes
- Compiling regexes in Python
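
The questions above can be explored with a small experiment. The line below is invented and only loosely modelled on the corpus, so treat the exact format as an assumption.

```python
import re

line = '# SOURCE: 04, some_file.pdf'   # invented example line

found = re.search(r'\d+', line)     # scans the whole string for a match
anchored = re.match(r'\d+', line)   # only tries at the very start of the string

print(type(found))   # a match object
print(anchored)      # None, because the string starts with '#'

digits = re.compile(r'\d+')                 # compile once, reuse many times
number = int(digits.search(line).group())   # str -> int
as_text = str(number)                       # int -> str
print(number, as_text)
```

Because a failed search returns `None`, it is worth checking the result (as the cell above does with `if match:`) before calling `.group()` on it.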

### 6. Now turn what you did in 5 into a function

In [6]:
### FILL IN THE BLANKS
def story_number(story):
    """Extract the number of `story`.
    
    Parameters
    ----------
    story : str
    
    Returns
    -------
    number: int
        The position of the story in the corpus
    """
    match = re.search(r'# SOURCE: (\d+)', story)
    if match:
        number = match.group(1)
        number = int(number)
        return number
    else:
        return -1

- What type is `story_number`?
- Function documentation
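
As a small illustration of where a docstring ends up, here is a sketch with a made-up function called `double`.

```python
def double(x):
    """Return `x` multiplied by two."""
    return 2 * x

print(type(double))     # functions are objects with their own type
print(double.__doc__)   # the docstring is stored on the function itself
```

Calling `help(double)` in the notebook shows the same text together with the function's signature.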

### 7. Define functions for extracting the filename, speaker and comment from a story

_Hint: The filename is stored on the same line as the story number. The speaker and comment are stored in a similar fashion. You'll need to look at the file to see exactly what regex to use._

In [7]:
### FILL IN THE BLANKS
def story_filename(story):
    """Extract the original pdf filename of `story`.
    
    Parameters
    ----------
    story : str
    
    Returns
    -------
    filename: str
    """
    match = re.search(r'# SOURCE: .*, (.*\.pdf)', story)
    if match:
        filename = match.group(1)
        return filename
    return None

In [8]:
### FILL IN THE BLANKS
def story_informant(story):
    """Extract the name of the speaker of `story`.
    
    Parameters
    ----------
    story : str
    
    Returns
    -------
    name: str
    """
    speaker_pattern = re.compile(r'# INFORMANT: (.*)\n')
    match = re.search(speaker_pattern, story)
    if match:
        return match.group(1)
    return None

In [9]:
### FILL IN THE BLANKS
def story_comment(story):
    """Extract the comment (background, summary, etc.) of `story`."""
    comment_pattern = re.compile(r'# COMMENT: (.*)\n')
    match = re.search(comment_pattern, story)
    if match:
        return match.group(1)
    return None

### 8. Split a story into a list of its utterances

_Hint: Looking at the file 'piraha.txt' again, note that within each story, there are blocks of data separated by blank lines. Each block is an utterance. We want a function that takes in a story (what type will this be?) and returns a list of utterances (i.e. a list of those blocks). A newline is represented as the sequence '\n' in Python, so a blank line shows up as two newlines in a row._

In [10]:
def split_utterances(story):
    """Return all the utterances in `story`.
    
    Parameters
    ----------
    story : str
        The raw text of a story
    
    Returns
    -------
    utterances: list(str)
    """
    utterance_boundary = '\n\n'
    utterances = story.split(utterance_boundary)[1:]
    first_utterance = utterances[0]
    if first_utterance.startswith('### NOTE'):
        first_utterance = re.sub('### NOTE.*\n', '', first_utterance)
    utterances = [first_utterance] + utterances[1:]
    return utterances

- If '\n' is a newline, what is a tab?
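
Escape sequences are easiest to understand by poking at them. The string below is invented; it mixes single newlines, a blank line and a tab.

```python
block = 'meta line\n\nfirst block\tcol2\nstill first\n\nsecond block'

pieces = block.split('\n\n')    # split on blank lines, not on every newline
print(len(pieces))
print(pieces[1].split('\t'))    # '\t' is the tab character
print(len('\t'), len(r'\t'))    # one character vs. a backslash plus a 't'
```

The raw-string prefix `r` switches off escape processing, which is why regular expressions are usually written as raw strings.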

### 9. Extract the utterance number and the translation

_Hint: In an utterance, the first line always consists of the number and the translation (in English). The number is of the form "x.y", where "x" is the story number, and "y" is the utterance number (we just want the utterance number). The translation is everything after the ":" after the number._

In [11]:
def utterance_number(utt):
    """Extract the number of `utt`.
    
    Parameters
    ----------
    utt : str
        An utterance

    Returns
    -------
    number: int
        The position of the utterance in the story
    """
    match = re.search(r'(\d+\.\d+):', utt)
    if match:
        whole_number = match.group(1)
        number = whole_number.split('.')[1]
        number = int(number)
        return number
    return -1

In [12]:
def utterance_translation(utt):
    """Extract the (original) translation of `utt`.
    
    Parameters
    ----------
    utt : str
        An utterance

    Returns
    -------
    translation: str
        The translation of the utterance
    """
    match = re.search(r'(\d+\.\d+): (.*)', utt)
    if match:
        translation = match.group(2)
        return translation
    return None
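
The two functions above each run their own search. As an alternative sketch, a single regex with several capture groups can pull out all the pieces at once. The utterance text here is invented, and its layout is an assumption based on the README description.

```python
import re

toy_utt = '# 4.16: He fell by the fire.\n# 4.16.1: He almost fell.'   # invented

# (\d+)\.(\d+) captures story and utterance numbers separately;
# (.*) stops at the end of the line because '.' does not match a newline.
match = re.search(r'# (\d+)\.(\d+): (.*)', toy_utt)
if match:
    story_num = int(match.group(1))
    utt_num = int(match.group(2))
    translation = match.group(3)
    print(story_num, utt_num, translation)
```

Whether one combined function or two small ones is nicer is a matter of taste; small functions are easier to test one at a time.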

### 10. Split an utterance into its sentences

_Hint: Within an utterance, after the first line containing the number and English translation, there are one or more sentences. We want a function that takes an utterance and returns a list of sentences._

In [13]:
def split_sentences(utt):
    """Return all the sentences in `utt`.
    
    Parameters
    ----------
    utt : str
        The raw text of an utterance
    
    Returns
    -------
    sentences: list(str)
    """
    all_sentences = re.sub(r'(# \d+\.\d+): (.*)', '', utt).strip()
    # Split before each following sentence label; the first sentence keeps its leading '# '.
    sentences = re.split(r'\n# ', all_sentences)
    return sentences
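
To see what this kind of splitting produces, here is a sketch on an invented utterance. The layout (a two-number line, then three-number lines each followed by a tree) follows the README description, but the trees and glosses are made up.

```python
import re

toy_utt = ('# 1.2: He fell by the fire.\n'
           '# 1.2.1: He almost fell.\n'
           '(S (NP hi/he) (VP ib/fall))\n'
           '# 1.2.2: He cried.\n'
           '(S (NP hi/he) (VP is/cry))')

# Remove the utterance line (two-number code), then split before each later '# '.
body = re.sub(r'# \d+\.\d+: .*', '', toy_utt).strip()
toy_sentences = re.split(r'\n# ', body)

for s in toy_sentences:
    print(repr(s))
```

Note that the first piece still starts with `# ` while the later ones do not, so any function that processes a sentence has to cope with both forms.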

### 11. Extract the sentence number and translation

_Hint: Every sentence starts with the number in the form of "x.y.z", where "x" is the story number, "y" is the utterance number, and "z" is the sentence number._

In [14]:
def sentence_number(sent):
    """Extract the number of `sent`.
    
    Parameters
    ----------
    sent : str
        A sentence

    Returns
    -------
    number: int
        The position of the sentence in the utterance
    """
    match = re.search(r'(\d+\.\d+\.\d+):', sent)
    if match:
        whole_number = match.group(1)
        number = whole_number.split('.')[2]
        number = int(number)
        return number
    return -1

In [15]:
def sentence_translation(sent):
    """Extract the translation of `sent`.
    
    Parameters
    ----------
    sent : str
        A sentence

    Returns
    -------
    translation: str
        The translation of the sentence
    """
    match = re.search(r'(\d+\.\d+\.\d+): (.*)', sent)
    if match:
        translation = match.group(2)
        return translation
    return None

### Sentence trees

_Hint: This part is nuanced, so don't worry about following every single detail for the time being. The core concepts we should focus on are `try/except` clauses and using other people's code._
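
If `try/except` is new to you, here is the idea in isolation. This sketch uses plain `int` conversion instead of tree parsing so that it needs no extra libraries; `safe_int` is a made-up helper name.

```python
def safe_int(text):
    """Return `text` as an int, or None if it cannot be converted."""
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError for strings such as 'sixteen'
        return None

print(safe_int('16'), safe_int('sixteen'))
```

Catching one specific exception type, rather than every possible error, keeps unrelated bugs from being silently swallowed.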

In [16]:
from nltk.tree import ParentedTree
from nltk.tgrep import tgrep_nodes, tgrep_positions

In [17]:
def sentence_tree(sent):
    """Extract the tree of `sent`.
    
    Parameters
    ----------
    sent : str

    Returns
    -------
    tree: nltk.tree.ParentedTree or None
        The parsed tree, or None if the tree string cannot be parsed
    """
    try:
        non_tree_pattern = re.compile(r'(?:# )?\d+\.\d+(\.\d+)?: .*')
        tree = re.sub(non_tree_pattern, '', sent).strip()
        tree = re.sub(r'#.*', '', tree).strip()
        return ParentedTree.fromstring(tree)
    except ValueError:
        return None

In [18]:
def words_from_tree(t):
    """Return list of words from tree `t`."""
    if t:
        return [leaf.split('/')[0] for leaf in t.leaves()]
    else:
        return None

### Putting the pieces together

Now, we have functions that can extract the relevant information from the data. We don't really want to have to extract a particular piece of information out every time we're interested in it (i.e. when we're asking questions like "Do the different speakers in the corpus use this word differently?"). We want to do it once, and then work with the extracted data. We also want a way of linking data from the same sentence together. That is, for every sentence in the corpus, we'd like to be able to know who the speaker is, what the words are, what story it came from, etc. In the olden days, this was how we did that:

In [19]:
sentences = []
for story in split_stories(raw_text):
    for utterance in split_utterances(story):
        for sentence in split_sentences(utterance):
            dictionary = {}
            dictionary['story_num'] = story_number(story)
            dictionary['fname'] = story_filename(story)
            dictionary['speaker'] = story_informant(story)
            #dictionary['comment'] = story_comment(story)
            dictionary['utt_num'] = utterance_number(utterance)
            dictionary['utt_translation'] = utterance_translation(utterance)
            dictionary['sent_num'] = sentence_number(sentence)
            dictionary['sent_translation'] = sentence_translation(sentence)
            #dictionary['sent_tree'] = sentence_tree(sentence)
            dictionary['words'] = words_from_tree(sentence_tree(sentence))
            sentences.append(dictionary)

- What type is `sentences`?
- What type is `sentences[0]`?
- What type is `sentences[0]['words']`?
- What type is `sentences[0]['words'][0]`?
- What type is `sentences[0]['words'][0][0]`?
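
One way to answer these questions is to peel the structure apart one layer at a time. The sketch below uses a tiny hand-made list in the same shape as `sentences`, so it runs on its own.

```python
toy_sentences = [
    {'speaker': 'Aogioso', 'words': ['ti', 'xahoa', '-gI']},
    {'speaker': 'Aogioso', 'words': ['hi', 'igA']},
]

# Each extra index goes one level deeper into the nested structure.
layers = [
    toy_sentences,
    toy_sentences[0],
    toy_sentences[0]['words'],
    toy_sentences[0]['words'][0],
    toy_sentences[0]['words'][0][0],
]
for layer in layers:
    print(type(layer).__name__, repr(layer))
```
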

### 12. Get a list of all the sentences

_Hint: Each sentence is a list of strings, so what we want is a list of lists of strings._

In [20]:
[s['words'] for s in sentences]

[['ti', 'xahoa', '-gI', 'ti', 'iga', 'O', '-p', '-I', '-xi'],
 ['hi', 'igA', 'xai', 'baIgipOhoasi'],
 ['KatO', 'hi', 'o', '*', '-b', '-a', '-p', '-I', '-aag', '-oxoihI', 'KatO'],
 ['hi', 'hoaI', 'ib', '-a', '-b', '-og', '-aA'],
 ['ti', 'igA', 'xai', '-ai'],
 ['hi', 'big', 'a', '-I'],
 ['*', 'is', '-Aaga', '-haI', 'TixohOI'],
 ['*', 'hoaI', '*', '-b', '-o', '-i', '-hI', 'pixAi', '-xIga'],
 ['ti', 'xaigIA', 'igA', 'xai', '-ai', 'xaI', 'Xopisi'],
 ['hi', 'o', '-b', '-a', '-hohI', 'pixAi', '-xIga', 'TixohOI'],
 ['*', 'hoai', 'Is', '-aagA', '-haI'],
 ['hi', 'o', '-b', '-a', '-hoi', '-hI', 'pixAi', '-xIga'],
 ['ti', 'xaigIa', 'igA', 'xai', '-ai'],
 ['xaI', 'ti', 'aigIa', 'hi', 'xoi', '-si', 'kaop', '-aI'],
 ['hi', 'Ao', 'big', '*', '-o', '-b', 'Ao', '-p', '-aI', '-xai'],
 ['ti', 'xaigIa', 'igA', 'xai', '-aI'],
 ['xopIsi',
  'hi',
  'o',
  '*',
  '-b',
  'a',
  '-p',
  '-I',
  '-aag',
  '-ag',
  '-i',
  '-sahaxaI'],
 ['*', 'hoaI', '*', '-b', '-O', '-i', '-hI', 'pixAi', '-xIga'],
 ['hI', 'O', 

### Homework: Verify the following
> The phonological segments of Pirahã are /i/, /a/, /u/, /p/, /t/, /k/, /h/, /s/, /b/, /g/, and /ʔ/. In the orthography we adopt for this paper, < x > represents the glottal stop and < o > represents /u/.

_Hint: You'll need to join together all the words in all the sentences into one big string. Then you can use the [set](https://docs.python.org/3/tutorial/datastructures.html#sets) constructor._
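
As a starting point, the hint can be sketched on two sentences copied from the output above. On the real data you may first need to skip sentences whose `words` entry is `None`.

```python
toy_words = [['ti', 'xahoa', '-gI'], ['hi', 'igA', 'xai']]

# Flatten the list of lists and glue everything into one string.
all_text = ''.join(word for sentence in toy_words for word in sentence)

# A set keeps each character once; sorted() gives a predictable order.
# lower() merges high-tone and low-tone vowels, so drop it to study tone.
symbols = sorted(set(all_text.lower()))
print(symbols)
```
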

> The language has two tones, high and low. Note that in this corpus high tones are marked with a capitalized letter. 

Once you've verified this, change all high-tone vowels to the IPA vowel with an acute accent, and all low-tone vowels to the IPA vowel with a grave accent. Change the < o > to a /u/ as well.

> The sound /s/ is usually absent from women’s speech; women use /h/ where men use /s/.

Can you infer which speakers are female and which are male?

> 1149 sentences, with an average of 5.9 words per sentence.

### Pandas

[Pandas](http://pandas.pydata.org/pandas-docs/stable/) is a new-ish Python library that makes data analysis easy and flexible. The main object it defines for us is a `DataFrame`. This will be familiar to people who use R. When I think of a `DataFrame`, I think of an Excel spreadsheet: each column is a variable, and each row is an observation.

In [21]:
import pandas as pd
corpus = pd.DataFrame(sentences)

- What does the `as pd` bit do?
- Why did we choose `pd` as the name to import `pandas` as?
- Constructing DataFrames
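
To see how a list of dictionaries becomes a table, here is a sketch with two hand-made rows in the same shape as `sentences`. It assumes pandas is installed; the names `toy_rows` and `toy_corpus` are invented for this example.

```python
import pandas as pd

toy_rows = [
    {'speaker': 'Aogioso', 'sent_num': 1, 'words': ['ti', 'xahoa']},
    {'speaker': 'Aogioso', 'sent_num': 2, 'words': ['hi', 'igA']},
]
toy_corpus = pd.DataFrame(toy_rows)   # one row per dict, one column per key

# Indexing with a list of names returns those columns in that order.
reordered = toy_corpus[['sent_num', 'speaker', 'words']]
print(reordered.shape)
print(list(reordered.columns))
```
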

In [22]:
corpus

|   |fname|sent_num|sent_translation|speaker|story_num|utt_num|utt_translation|words|
|--:|-----|-------:|----------------|-------|--------:|------:|---------------|-----|
|0|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|1|[Early in the] day I spoke.|Aogioso|1|1|Early in the day I spoke. BaIgipOhoasi spoke (...|[ti, xahoa, -gI, ti, iga, O, -p, -I, -xi]|
|1|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|2|BaIgipOhoasi [speaker's sister] spoke.|Aogioso|1|1|Early in the day I spoke. BaIgipOhoasi spoke (...|[hi, igA, xai, baIgipOhoasi]|
|2|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|3|"Is Kato sleepy?" [Lit: "Kato-- her eyes flut...|Aogioso|1|1|Early in the day I spoke. BaIgipOhoasi spoke (...|[KatO, hi, o, *, -b, -a, -p, -I, -aag, -oxoihI...|
|3|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|1|He [TixohOI, KatO's baby] almost fell in the f...|Aogioso|1|2|He (TixohOI) fell by the fire.|[hi, hoaI, ib, -a, -b, -og, -aA]|
|4|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|1|I spoke!|Aogioso|1|3|I spoke (carried sound). TixohOI is crying on ...|[ti, igA, xai, -ai]|
|5|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|2|"He [TixohOI] is on the ground."|Aogioso|1|3|I spoke (carried sound). TixohOI is crying on ...|[hi, big, a, -I]|
|6|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|3|"TixohOI is crying."|Aogioso|1|3|I spoke (carried sound). TixohOI is crying on ...|[*, is, -Aaga, -haI, TixohOI]|
|7|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|1|[He] certainly fell by the fire just now.|Aogioso|1|4|He fell by the fire right now.|[*, hoaI, *, -b, -o, -i, -hI, pixAi, -xIga]|
|8|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|1|I thus spoke [to] Xopisi [speaker's husband]!|Aogioso|1|5|I spoke to Opisi. 'Did TixohOI burn himself ju...|[ti, xaigIA, igA, xai, -ai, xaI, Xopisi]|
|9|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|2|"Did TixohOI fall down [and burn himself] just...|Aogioso|1|5|I spoke to Opisi. 'Did TixohOI burn himself ju...|[hi, o, -b, -a, -hohI, pixAi, -xIga, TixohOI]|


We can view the `DataFrame` we created simply by typing its name. The Jupyter notebook knows about pandas dataframes and prints them nicely (with alternating colours for rows). We can use the `head` method of a dataframe to see just the first few rows (five by default).

In [23]:
corpus.head()

|   |fname|sent_num|sent_translation|speaker|story_num|utt_num|utt_translation|words|
|--:|-----|-------:|----------------|-------|--------:|------:|---------------|-----|
|0|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|1|[Early in the] day I spoke.|Aogioso|1|1|Early in the day I spoke. BaIgipOhoasi spoke (...|[ti, xahoa, -gI, ti, iga, O, -p, -I, -xi]|
|1|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|2|BaIgipOhoasi [speaker's sister] spoke.|Aogioso|1|1|Early in the day I spoke. BaIgipOhoasi spoke (...|[hi, igA, xai, baIgipOhoasi]|
|2|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|3|"Is Kato sleepy?" [Lit: "Kato-- her eyes flut...|Aogioso|1|1|Early in the day I spoke. BaIgipOhoasi spoke (...|[KatO, hi, o, *, -b, -a, -p, -I, -aag, -oxoihI...|
|3|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|1|He [TixohOI, KatO's baby] almost fell in the f...|Aogioso|1|2|He (TixohOI) fell by the fire.|[hi, hoaI, ib, -a, -b, -og, -aA]|
|4|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|1|I spoke!|Aogioso|1|3|I spoke (carried sound). TixohOI is crying on ...|[ti, igA, xai, -ai]|


In [24]:
columns = ['story_num', 'speaker', 'fname', 'utt_num', 'utt_translation', 'sent_num', 'words', 'sent_translation']
corpus = corpus[columns]
corpus.head()

|   |story_num|speaker|fname|utt_num|utt_translation|sent_num|words|sent_translation|
|--:|--------:|-------|-----|------:|---------------|-------:|-----|----------------|
|0|1|Aogioso|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|1|Early in the day I spoke. BaIgipOhoasi spoke (...|1|[ti, xahoa, -gI, ti, iga, O, -p, -I, -xi]|[Early in the] day I spoke.|
|1|1|Aogioso|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|1|Early in the day I spoke. BaIgipOhoasi spoke (...|2|[hi, igA, xai, baIgipOhoasi]|BaIgipOhoasi [speaker's sister] spoke.|
|2|1|Aogioso|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|1|Early in the day I spoke. BaIgipOhoasi spoke (...|3|[KatO, hi, o, *, -b, -a, -p, -I, -aag, -oxoihI...|"Is Kato sleepy?" [Lit: "Kato-- her eyes flut...|
|3|1|Aogioso|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|2|He (TixohOI) fell by the fire.|1|[hi, hoaI, ib, -a, -b, -og, -aA]|He [TixohOI, KatO's baby] almost fell in the f...|
|4|1|Aogioso|01_KATO'S BABY FALLS NEAR THE FIRE.pdf|3|I spoke (carried sound). TixohOI is crying on ...|1|[ti, igA, xai, -ai]|I spoke!|
