Conversational data formatted in CHAT provides transcriptions with rich annotations for both linguistic and extra-linguistic information. PyLangAcq is designed to extract data and annotations in CHAT and expose them in Python data structures for flexible modeling work. This notebook explains how PyLangAcq represents CHAT data and annotations.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

#Optional: move to the desired location:
#%cd drive/My Drive/DIRECTORY_IN_YOUR_DRIVE

Mounted at /content/drive


In [1]:
!pip install --upgrade -q pylangacq


In [6]:
import pylangacq

In [7]:
# Path to the zipped Nadig corpus on the mounted Drive (a local path, not a URL)
nadig_path = "drive/My Drive/Omdena/SriLanka/Autism/asd_talkbank/Nadig.zip"

nadig = pylangacq.read_chat(nadig_path)
nadig.n_files()  # number of CHAT files in this dataset

38

# Explanation of ASDBank Utterance Structure

**A typical structure of (two) utterances**:


```
*CHI:       more cookie . [+ IMP]
%mor:       qn|more n|cookie .
%gra:       1|2|QUANT 2|0|INCROOT 3|2|PUNCT
%int:       distinctive , loud
*MOT:       you 0v more cookies ?
%mor:       pro:per|you 0v|v qn|more n|cookie-PL ?
%gra:       1|2|SUBJ 2|0|ROOT 3|4|QUANT 4|2|OBJ 5|2|PUNCT
```



PyLangAcq handles CHAT data by paying attention to the following:

**Participants**: The two participants are CHI and MOT. It is customary to denote the target child by CHI and the child’s mother by MOT. The asterisk * that comes just before the participant code signals a transcription line. Each utterance must begin with the transcription line.

**Transcriptions**: The two transcription lines are `more cookie . [+ IMP]` from the child (CHI) and `you 0v more cookies ?` from the mother (MOT). The transcriptions are word-segmented by spaces, and punctuation marks are treated as “words”. Transcription lines can also carry annotations, such as `[+ IMP]` and `0v` here.

**Dependent tiers**: Between one transcription line and the next come the dependent tiers, marked by %, which belong to the transcription line immediately above them. CHI’s utterance has the dependent tiers %mor (morphological information), %gra (grammatical relations), and %int (intonation), whereas MOT’s has only %mor and %gra.
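To make the grouping rule concrete, here is a minimal sketch in plain Python that attaches each `%` line to the `*` line above it. The helper `group_utterances` is hypothetical and written only for illustration. It is not part of PyLangAcq, whose own parser handles many more details of CHAT (headers, line continuations, time marks, and so on).

```python
# The two-utterance excerpt shown above.
chat_excerpt = """\
*CHI:       more cookie . [+ IMP]
%mor:       qn|more n|cookie .
%gra:       1|2|QUANT 2|0|INCROOT 3|2|PUNCT
%int:       distinctive , loud
*MOT:       you 0v more cookies ?
%mor:       pro:per|you 0v|v qn|more n|cookie-PL ?
%gra:       1|2|SUBJ 2|0|ROOT 3|4|QUANT 4|2|OBJ 5|2|PUNCT
"""


def group_utterances(chat_text):
    """Group CHAT lines into utterances (hypothetical helper, not a PyLangAcq function)."""
    utterances = []
    for line in chat_text.splitlines():
        if not line.strip():
            continue
        # Split at the first colon only: tier content may itself contain colons (e.g. pro:per).
        label, _, content = line.partition(":")
        content = content.strip()
        if label.startswith("*"):
            # A transcription line opens a new utterance.
            utterances.append(
                {"participant": label[1:], "transcription": content, "tiers": {}}
            )
        elif label.startswith("%") and utterances:
            # A dependent tier belongs to the most recent transcription line.
            utterances[-1]["tiers"][label] = content
    return utterances


utterances = group_utterances(chat_excerpt)
for utterance in utterances:
    print(utterance["participant"], sorted(utterance["tiers"]))
```

Each utterance keeps only the tiers that actually follow it, so CHI’s entry carries `%int` while MOT’s does not.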

**The %mor tier**: The morphological information aligns one-to-one with the segmented words (including punctuation marks) in the transcription line; annotations in the transcription line are ignored. In each item of %mor, the part-of-speech tag is on the left of the pipe |, e.g., `qn` for a nominal quantifier in `qn|more`, aligned to `more` in CHI’s line. Inflectional and derivational information is on the right of |, e.g., `cookie-PL` for the plural form of “cookie” in `n|cookie-PL`, aligned to `cookies` in MOT’s line.
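The alignment can be checked by hand with a short sketch. The helpers `words_for_alignment` and `split_mor_item` are hypothetical names introduced here for illustration, not PyLangAcq functions. The sketch assumes that only bracketed annotations such as `[+ IMP]` need to be dropped before aligning, which holds for this example but not for every CHAT annotation.

```python
import re


def words_for_alignment(transcription):
    """Drop bracketed annotations such as [+ IMP]; keep words and punctuation."""
    without_annotations = re.sub(r"\[[^\]]*\]", " ", transcription)
    return without_annotations.split()


def split_mor_item(item):
    """Split one %mor item into (part-of-speech tag, morphological information)."""
    pos, separator, morphology = item.partition("|")
    if not separator:
        # Punctuation has no pipe; use the mark itself as the tag.
        return item, ""
    return pos, morphology


def align(transcription, mor_tier):
    """Pair every word with its %mor item, insisting on a one-to-one match."""
    words = words_for_alignment(transcription)
    mor_items = mor_tier.split()
    if len(words) != len(mor_items):
        raise ValueError("transcription and %mor are not aligned one-to-one")
    return [(word, *split_mor_item(item)) for word, item in zip(words, mor_items)]


chi_aligned = align("more cookie . [+ IMP]", "qn|more n|cookie .")
mot_aligned = align("you 0v more cookies ?", "pro:per|you 0v|v qn|more n|cookie-PL ?")

for word, pos, morphology in mot_aligned:
    print(f"{word!r}: pos={pos!r}, mor={morphology!r}")
```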

**The %gra tier**: CHAT represents grammatical relations in terms of heads and dependents in dependency grammar. Every item on the %gra tier corresponds one-to-one to the segmented words in the transcription (and therefore one-to-one to the %mor items as well). In MOT’s %gra, `3|4|QUANT` means `more` at position 3 of the utterance is a dependent of the word `cookies` at position 4 as the head, and that the relation is one of *quantification*.
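As an illustration, the sketch below reads MOT’s %gra tier into (dependent, head, relation) triples and looks up each head word. `GraItem` and `parse_gra` are hypothetical names used only in this sketch; PyLangAcq provides its own representation of grammatical relations. A head position of 0 marks the root of the utterance, which has no head word.

```python
from typing import NamedTuple


class GraItem(NamedTuple):
    """One %gra item (illustrative stand-in, not PyLangAcq's own class)."""

    dep: int   # position of the dependent word (1-based)
    head: int  # position of the head word; 0 means the word is the root
    rel: str   # name of the grammatical relation


def parse_gra(gra_tier):
    """Parse a %gra tier string into a list of GraItem triples."""
    items = []
    for item in gra_tier.split():
        dep, head, rel = item.split("|")
        items.append(GraItem(int(dep), int(head), rel))
    return items


words = "you 0v more cookies ?".split()
gra_items = parse_gra("1|2|SUBJ 2|0|ROOT 3|4|QUANT 4|2|OBJ 5|2|PUNCT")

relations = []
for word, item in zip(words, gra_items):
    # Positions are 1-based, so subtract 1 to index into the word list.
    head_word = None if item.head == 0 else words[item.head - 1]
    relations.append((word, item.rel, head_word))

for word, rel, head_word in relations:
    print(f"{word} --{rel}--> {head_word}")
```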

**Other tiers**: Apart from %mor and %gra, other dependent tiers may appear in CHAT data files. Some of them contain more linguistic information, e.g., %int for intonation in CHI’s utterance here, and others contain contextual information about the utterance or recording session. **Many of these tiers are used only as needed** (%int is not used in MOT’s utterance in this example).

# Exploring Utterance Metadata in Code Using the PyLangAcq Library

In [8]:
nadig_utterance = nadig.utterances(by_files=True)  # get the utterances of every CHAT file in the Nadig data, grouped by file
print(len(nadig_utterance))  # number of files
print(nadig_utterance[0][0])  # show the first utterance of the first conversation

38
Utterance(participant='MOT', tokens=[Token(word='what', pos='pro:int', mor='what', gra=Gra(dep=1, head=2, rel='MOD')), Token(word="animal's", pos='n', mor='animal', gra=Gra(dep=2, head=3, rel='SUBJ')), Token(word='CLITIC', pos='cop', mor='be&3S', gra=Gra(dep=3, head=0, rel='ROOT')), Token(word='on', pos='prep', mor='on', gra=Gra(dep=4, head=3, rel='JCT')), Token(word='there', pos='n', mor='there', gra=Gra(dep=5, head=6, rel='MOD')), Token(word='Tracy', pos='n:prop', mor='Tracy', gra=Gra(dep=6, head=4, rel='POBJ')), Token(word='?', pos='?', mor='', gra=Gra(dep=7, head=3, rel='PUNCT'))], time_marks=None, tiers={'MOT': "what animal's on there Tracy ?", '%mor': 'pro:int|what n|animal~cop|be&3S prep|on n|there n:prop|Tracy ?', '%gra': '1|2|MOD 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|6|MOD 6|4|POBJ 7|3|PUNCT'})


Let's further analyse the first utterance of the first CHAT file:

In [18]:
first_utter = nadig_utterance[0][0]
participant = first_utter.participant  # participant code of the utterance
list_of_tokens = first_utter.tokens  # tokens of the utterance

print(f'Participant: {participant}')
print('Tokens: ')
# Let's analyse each token
for i, token in enumerate(list_of_tokens):
    print(f'token {i+1}: {token}')
    token_word = token.word  # the word form of the token
    print(f'\t token word: {token_word}')
    token_pos = token.pos  # the part-of-speech tag of the token
    print(f'\t token pos: {token_pos}')
    token_mor = token.mor  # the morphological information of the token
    print(f'\t token mor: {token_mor}')
    token_gra = token.gra  # the grammatical relation of the token (dependent, head, relation)
    print(f'\t token gra: {token_gra}: {token_gra.dep} | {token_gra.head} | {token_gra.rel}')

Participant: MOT
Tokens: 
token 1: Token(word='what', pos='pro:int', mor='what', gra=Gra(dep=1, head=2, rel='MOD'))
	 token word: what
	 token pos: pro:int
	 token mor: what
	 token gra: Gra(dep=1, head=2, rel='MOD'): 1 | 2 | MOD
token 2: Token(word="animal's", pos='n', mor='animal', gra=Gra(dep=2, head=3, rel='SUBJ'))
	 token word: animal's
	 token pos: n
	 token mor: animal
	 token gra: Gra(dep=2, head=3, rel='SUBJ'): 2 | 3 | SUBJ
token 3: Token(word='CLITIC', pos='cop', mor='be&3S', gra=Gra(dep=3, head=0, rel='ROOT'))
	 token word: CLITIC
	 token pos: cop
	 token mor: be&3S
	 token gra: Gra(dep=3, head=0, rel='ROOT'): 3 | 0 | ROOT
token 4: Token(word='on', pos='prep', mor='on', gra=Gra(dep=4, head=3, rel='JCT'))
	 token word: on
	 token pos: prep
	 token mor: on
	 token gra: Gra(dep=4, head=3, rel='JCT'): 4 | 3 | JCT
token 5: Token(word='there', pos='n', mor='there', gra=Gra(dep=5, head=6, rel='MOD'))
	 token word: there
	 token pos: n
	 token mor: there
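	 token gra: Gra(dep=5, head=6, rel='MOD'): 5 | 6 | MOD
token 6: Token(word='Tracy', pos='n:prop', mor='Tracy', gra=Gra(dep=6, head=4, rel='POBJ'))
	 token word: Tracy
	 token pos: n:prop
	 token mor: Tracy
	 token gra: Gra(dep=6, head=4, rel='POBJ'): 6 | 4 | POBJ
token 7: Token(word='?', pos='?', mor='', gra=Gra(dep=7, head=3, rel='PUNCT'))
	 token word: ?
	 token pos: ?
	 token mor: 
	 token gra: Gra(dep=7, head=3, rel='PUNCT'): 7 | 3 | PUNCT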

In [25]:
tier = first_utter.tiers  # dictionary holding the raw transcription, %mor and %gra strings of the utterance
print(tier)
utter_sentence = tier[participant] #get the dialogue sentence
print(f'sentence: {utter_sentence}')
morphological_info = tier['%mor'] #get the morphological info of all the words
print(f'%mor: {morphological_info}')
grammatical_info = tier['%gra'] #get the grammatical relation info of all the words
print(f'%gra: {grammatical_info}')

{'MOT': "what animal's on there Tracy ?", '%mor': 'pro:int|what n|animal~cop|be&3S prep|on n|there n:prop|Tracy ?', '%gra': '1|2|MOD 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|6|MOD 6|4|POBJ 7|3|PUNCT'}
sentence: what animal's on there Tracy ?
%mor: pro:int|what n|animal~cop|be&3S prep|on n|there n:prop|Tracy ?
%gra: 1|2|MOD 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|6|MOD 6|4|POBJ 7|3|PUNCT
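Note that the transcription has six space-separated words while %gra has seven items. The difference comes from the `~` in `n|animal~cop|be&3S`, which marks the clitic in `animal's` and is why the token list contains an extra token whose word is `CLITIC`. The sketch below reproduces that count from the raw tier strings in plain Python. `expand_clitics` is a hypothetical helper that only mimics the output shown in this notebook; it is not PyLangAcq's implementation, and it assumes `~` is the only clitic marker to handle.

```python
# Raw tier strings copied from the output above.
tiers = {
    "MOT": "what animal's on there Tracy ?",
    "%mor": "pro:int|what n|animal~cop|be&3S prep|on n|there n:prop|Tracy ?",
    "%gra": "1|2|MOD 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|6|MOD 6|4|POBJ 7|3|PUNCT",
}


def expand_clitics(words, mor_items):
    """Return (word, pos, mor) rows, giving each '~' clitic its own row."""
    rows = []
    for word, mor_item in zip(words, mor_items):
        for index, part in enumerate(mor_item.split("~")):
            pos, separator, mor = part.partition("|")
            # The first part keeps the written word; later parts are clitics.
            row_word = word if index == 0 else "CLITIC"
            rows.append((row_word, pos, mor if separator else ""))
    return rows


words = tiers["MOT"].split()
rows = expand_clitics(words, tiers["%mor"].split())
gra_items = tiers["%gra"].split()

print(len(words), len(rows), len(gra_items))
for row, gra_item in zip(rows, gra_items):
    print(row, gra_item)
```

After the clitic is expanded, every row lines up with exactly one %gra item.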
