# Pirahã syntax

In this notebook we are going to explore Pirahã syntax. As you know, Pirahã is (in)famous in the field of linguistics for many reasons. One of the most prominent claims made about the language is a lack of syntactic recursion. Futrell et al. (2016) analyze a corpus of Pirahã for evidence of syntactic embedding. They report that they do not find any such evidence, and that the corpus is consistent with an analysis of Pirahã as a regular language. If you want to know more about Pirahã and the controversy surrounding it, read the following:

- Everett, Daniel L. "Pirahã culture and grammar: a response to some criticisms." Language 85.2 (2009): 405-442.
- Everett, Daniel, et al. "Cultural constraints on grammar and cognition in Piraha: Another look at the design features of human language." Current anthropology 46.4 (2005): 621-646.
- Frank, Michael C., et al. "Number as a cognitive technology: Evidence from Pirahã language and cognition." Cognition 108.3 (2008): 819-824.
- Futrell, Richard, et al. "A corpus investigation of syntactic embedding in Pirahã." PloS one 11.3 (2016): e0145289.
- Nevins, Andrew, David Pesetsky, and Cilene Rodrigues. "Pirahã exceptionality: A reassessment." Language 85.2 (2009): 355-404.

Futrell et al. (2016) make their corpus data freely available, which is why I chose this article. In particular, the corpus comes in a nice messy format. Throughout all the notebooks of this workshop, we're not going to be concerned with whether the arguments are right or wrong. Rather, we care about practising our Python skills so that we can do similar linguistic analyses.

## Data

The corpus used in the paper consists of Pirahã stories that were originally collected and translated by Steve Sheldon and Dan Everett over the period of several decades. It also includes aligned translation between Pirahã
and English, including shallow syntactic parses and approximate English glosses.

Here's how the authors describe the way they created the corpus:
> We obtained glossed transcriptions of 17 stories in Pirahã, consisting of a total of 1149 sentences
and 6830 words in our analysis. 13 of the stories were collected by Steve Sheldon in the
1970s, and the remaining 4 stories were collected by Dan Everett over the period 1980–2009.
Each story was told by a single speaker with no recorded interruptions. The stories were transcribed
by Everett or Sheldon; audio recordings are only available for stories 2 and 3. According
to Everett, the texts are fairly representative of how the Pirahã tell stories to one another.

If you're interested, you can see all the original texts in pdf format [here](https://github.com/languageMIT/piraha/tree/master/sources).

### Downloading the data

I've already downloaded the data for you. In the `data` folder there's a file called `piraha.txt`.

### Structure of the data

Although it's great that the data is made freely available, it's in a really messy format. It's going to take some preprocessing for us to reformat the data into a more usable form. Here's the description from the [README](https://github.com/languageMIT/piraha/blob/master/README.md) of how the corpus is structured:

> Each text in the corpus is preceded by three lines of hashes and information about the source of the text. The corpus is divided into stories; stories are divided into "utterances"; and utterances are divided into sentences. Utterances correspond to the sentence breaks in Steve Sheldon's original glosses. Each utterance is preceded by two English glosses (free translations). The first is labeled with a hash and a code of two numbers, and reflects the English translation given by Steve Sheldon or Daniel L. Everett in their original translation. The second is labeled with a hash and a code of three numbers, and reflects the current best translation, as judged by Daniel L. Everett, Steve Sheldon, and the other authors. Clarifications are provided in square brackets in these glosses. The glosses allow simple text searching to find rough equivalents of English words and phrases (e.g. what does Pirahã use to convey meanings glossed as "and"?).

> Each Pirahã sentence is labeled with a unique code of three numbers, appearing on the preceding line along with its current best English translation. The first number indicates the text in which the utterance appears. The second number indicates the utterance's sequential placement within the text. This second number reflects the utterance boundaries present in the original transcriptions (e.g. by Dan Everett and Steve Sheldon). In many cases, text which was originally translated into a single utterance actually includes a group of Pirahã sentences according to our current best translations. When this occurs, the third number indicates the order of the Pirahã sentences within this grouping. Otherwise, the third number is simply 1.

Here's my simplification of it:

- The corpus is all in one file, `piraha.txt`.
- There are 17 stories in the corpus. They are separated from one another by three lines of hashtags.
- Each story starts with three pieces of metadata: the source, the informant and the comment.
- The source holds the names of the original pdf files of the story and the story number.
- The informant is the speaker (for most stories there's only one speaker).
- The comment holds various other things, like background, summary of the story, etc.
- After the metadata, each story has a series of utterances. Utterances are the original sentence boundaries given by either Everett or Sheldon. Each utterance has a number within the story (e.g. 4.16 for utterance number 16 in story number 4). Immediately after the numeric identifier comes the original English translation of the utterance.
- However, Sheldon and Everett have re-analyzed many utterances to in fact consist of more than one sentence. So each utterance actually consists of one or more sentences. Each sentence is identified by the story number, utterance number and sentence number (so 4.16.2 is the 2nd sentence of the 16th utterance of the 4th text). Immediately after each numeric identifier of a sentence is a shallow parse tree.



### Reading in the data

In [101]:
fname = 'data/piraha.txt'
with open(fname, 'r') as f:
    raw_text = f.read()

### Working with a single story

In [261]:
def split_stories(text):
    """Return all the stories in `text`.
    
    Parameters
    ----------
    text : str
        The raw text of the corpus
    
    Returns
    -------
    stories: list(str)
    """
    text_boundary = '''############################################################################################################################################ 
############################################################################################################################################ 
############################################################################################################################################'''
    stories = text.split(text_boundary)[1:]
    return [story.strip() for story in stories]

stories = split_stories(raw_text)
story = stories[3]

### Extracting metadata

In [None]:
import os
import re

In [246]:
def story_number(story):
    """Extract the number of `story`.
    
    Parameters
    ----------
    story : str
    
    Returns
    -------
    number: int
        The position of the story in the corpus
    """
    match = re.search(r'# SOURCE: (\d+)', story)
    if match:
        number = match.group(1)
        number = int(number)
        return number
    else:
        return -1

story_number(story)

17

In [247]:
def story_filename(story):
    """Extract the original pdf filename of `story`.
    
    Parameters
    ----------
    story : str
    
    Returns
    -------
    filename: str
        The filename of the original pdf
    """
    match = re.search(r'# SOURCE: .*, (.*.*\.pdf)', story)
    if match:
        filename = match.group(1)
        return filename
    return None

story_filename(story)

'17 PORTO VELHO IS BIG.pdf'

In [248]:
def story_informant(story):
    """Extract the name of the speaker of `story`.
    
    Parameters
    ----------
    story : str
    
    Returns
    -------
    name: str
    """
    speaker_pattern = re.compile(r'# INFORMANT: (.*)\n')
    match = re.search(speaker_pattern, story)
    if match:
        return match.group(1)
    return None

story_informant(story)

'Tisahai'

In [249]:
def story_comment(story):
    """Extract the comment of `story`.
    
    Parameters
    ----------
    story : str
    
    Returns
    -------
    comment: str
    """
    comment_pattern = re.compile(r'# COMMENT: (.*)\n')
    match = re.search(comment_pattern, story)
    if match:
        return match.group(1)
    return None

story_comment(story)

'Tisahai and her husband Apibai were in Porto Velho attending a workshop when this story was told.  She was sending it as a letter to others back in the village to tell them some of her impressions.'

### Working with a single utterance

In [267]:
def split_utterances(story):
    """Return all the utterances in `story`.
    
    Parameters
    ----------
    story : str
        The raw text of a story
    
    Returns
    -------
    utterances: list(str)
    """
    utterance_boundary = '\n\n'
    utterances = story.split(utterance_boundary)[2:]
    return utterances

utterances = split_utterances(story)
utterance = utterances[0]

### Extracting utterance data

In [236]:
def utterance_number(utt):
    """Extract the number of `utt`.
    
    Parameters
    ----------
    utt : str
        An utterance

    Returns
    -------
    number: int
        The position of the utterance in the story
    """
    match = re.search(r'(\d+\.\d+):', utt)
    if match:
        whole_number = match.group(1)
        number = whole_number.split('.')[1]
        number = int(number)
        return number
    return -1

utterance_number(utterances[-2])

19

In [245]:
def utterance_translation(utt):
    """Extract the (original) translation of `utt`.
    
    Parameters
    ----------
    utt : str
        An utterance

    Returns
    -------
    translation: str
        The translation of the utterance
    """
    match = re.search(r'(\d+\.\d+): (.*)', utt)
    if match:
        translation = match.group(2)
        return translation
    return None

utterance_translation(utterances[0])

'Itaibigai he (Steve) bought a lot of fish.'

### Working with a single sentence

In [365]:
def split_sentences(utt):
    """Return all the sentences in `utterance`.
    
    Parameters
    ----------
    utt : str
        The raw text of an utterance
    
    Returns
    -------
    sentences: list(str)
    """
    all_sentences = re.sub(r'(# \d+\.\d+): (.*)', '', utt).strip()
    sentences = re.split(r'(?: ^|\n)# ', all_sentences)
    return sentences

### Extracting sentence data

In [354]:
def sentence_number(sent):
    """Extract the number of `sent`.
    
    Parameters
    ----------
    sent : str
        A sentence

    Returns
    -------
    number: int
        The position of the sentence in the utterance
    """
    match = re.search(r'(\d+\.\d+\.\d+):', sent)
    if match:
        whole_number = match.group(1)
        number = whole_number.split('.')[2]
        number = int(number)
        return number
    return -1

sentence_number(sentence)

1

In [356]:
def sentence_translation(sent):
    """Extract the translation of `sent`.
    
    Parameters
    ----------
    sent : str
        An utterance

    Returns
    -------
    translation: str
        The translation of the sent
    """
    match = re.search(r'(\d+\.\d+\.\d+): (.*)', sent)
    if match:
        translation = match.group(2)
        return translation
    return None

sentence_translation(sentence)

'[You] have not seen the foreigners.'

In [366]:
from nltk.tree import ParentedTree
from nltk.tgrep import tgrep_nodes, tgrep_positions

In [368]:
def sentence_tree(sent):
    """Extract the tree of `sent`.
    
    Parameters
    ----------
    sent : str

    Returns
    -------
    tree: str
    """
    non_tree_pattern = re.compile(r'# \d+\.\d+(\.\d+)?: .*')
    tree = re.sub(non_tree_pattern, '', sent).strip()
    return ParentedTree.fromstring(tree)

t = sentence_tree(sentence)
type(t)

nltk.tree.ParentedTree

In [372]:
for story in split_stories(raw_text):
    print(story_number(story), story_filename(story), story_informant(story))
    print()
    print(story_comment(story))
    print()
    for utterance in split_utterances(story):
        print(utterance_number(utterance), utterance_translation(utterance))
        print()
        for sentence in split_sentences(utterance):
            print(sentence_number(sentence), sentence_translation(sentence))
            print()

1 01_KATO'S BABY FALLS NEAR THE FIRE.pdf Aogioso

Steve Sheldon early 1970s; Translations and glosses completed by Dan Everett 2009. This is a story about a small baby (Tixohóí) who during the night falls off the bed and near the fire. His mother (Kato) and father (Hoagaixoóxai) are sound sleepers and do not waken to his crying. The grandmother (Xaogoíso) who is telling the story is very angry as is her sister (Xigaoxixaihoái) and husband (Xopísi). The father’s brother (Xoii) is also angry at their sleepiness and threatens that because the mother is such a sound sleeper he will have sexual contact with her some night and she would never know it. The baby is not really burned nor hurt, and finally grandmother takes baby back to his mommy to be nursed and put back to sleep. When the baby is sleeping again, the grandmother also returns to sleep.

2 He (TixohOI) fell by the fire.

1 He [TixohOI, KatO's baby] almost fell in the fire.

3 I spoke (carried sound). TixohOI is crying on the grou

-1 None

1 [Dan speaking:] Thus you finish talking.

2 [Dan speaking:] [I will take it] with [me] to another jungle [for them to see.]

17 OK?

-1 None

1 [Dan speaking:] Okay?

18 OK

1 Okay.

19 Well, OK then. I speak. OK then.

1 Well [it] is okay.

2 Thus I speak (as has been mentioned).

3 [It] is okay.

20 Foreigners in the (other) jungle?

1 Foreigners are in [another] jungle?

21 I a whole lot want to see.

1 Very many of them want to see me.  ###OR: [They] want to see very many of them [like] me. Meaning: They want to see many Piraha, many people like me. 

22 There is a small (amount here).  There is a small amount here.

1 A few [are here.] ###Or:  A few of them [the Piraha] [are here].

23 Paóxai wants to see.  A whole lot (of people). 

1 Paoxai [Dan] wants to see very many of them [Piraha].

24 Thus a whole lot he wants to see.

1 Thus he [Dan] wants to see very many of them [Piraha].

25 A really whole lot. He wants to see. Here.

1 He wants to see very many of them ther


2 Opisi and the snake climbed the tree.

1 (I heard that as has been mentioned,) Xopisi climbed the tree.

3 Iohoabi said, "The snake has become angry."

1 (As has been mentioned,) Iaohoabi spoke.

2 "The animal [snake] has become angry."

4 The snake almost went after Opisi.

1 (I heard that as has been mentioned,) the snake almost went after him [Xopisi].

5 Iaohoabi said, "The snake has become angry."

1 (As has been mentioned,) Iaohoabi spoke.

2 "The snake has become angry."

6 It (snake) is there.

1 "[The snake] is [there]."

7 Don't go into the jungle to hunt!

1 "Don't [go into] the jungle to hunt it/things!"

2 "Xopisi"

8 Iaohoabi said, "Opisi you did not kill the snake."

1 (As has been mentioned,) Iaohoabi spoke.

2 "[You] certainly did not kill the snake."

9 The snake has become angry.

1 "The animal has become angry."

10 Iohoabi said, "Don't hunt Opisi."

1 (As has been mentioned,) Iaohoabi spoke.

2 "[You] did not kill it, Xopisi."

11 It's angry.  It has become angr

### TODO
- Check code so far
- Put in pandas dataframe

## TODO
- Check overview
- The word order in Pirahã is predominantly verb-final, with subjects (S) usually preceding objects (O) for a predominantly SOV word order
- For example, if the subject intervened between the object and the verb (as in an OSV order), the object was labeled as a topic-obj. Similarly, noun phrases appearing after the verb were labeled as topics
- We used labels similar to the Penn Treebank labels for syntactic categories [40]: NP (noun phrase); IN (adposition); PP (adpositional phrase); VP (verb phrase); S (sentence); NN (a common noun); PRP (pronoun); NNP (proper noun); POS (possessive NP); JJ (adjective); DT (determiner); CD (quantity term); RB (adverb); FW (foreign word); FRAG (fragment). We also introduced the symbol Q dominating the contents of direct speech reports
- Use [this](https://github.com/wroberts/nltk_tgrep) for parsing trees.
- Can download files [here](https://osf.io/kt2e8/) too.
- Note that in this corpus high tones are marked with a capitalized letter.
- 1149 sentences, with an average of 5.9 words per sentence. But also contains original sentence boundaries, making 749 sentences.
- Shallow POS was added, and grammatical relations (subject, object, indirect object, locative, temporal, instrumental, vocative, topic)