<font size ='4'>A quick guide to using pylangacq with the ASDBank (TalkBank) transcript datasets</font>

<font size="4"> Datasets can be downloaded from https://asd.talkbank.org/access/ . This guide uses the Nadig dataset: https://asd.talkbank.org/access/English/Nadig.html . The same data was used in the paper **Detecting Autism Spectrum Disorders with Machine Learning Models Using Speech Transcripts** (Vikram Ramesh, 2021). </font>

<font size="4"> Reference: https://pylangacq.org/quickstart.html</font>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

#Optional: move to the desired location:
#%cd drive/My Drive/DIRECTORY_IN_YOUR_DRIVE

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install --upgrade pylangacq

In [None]:
import pylangacq

In [None]:
# Local path to the zipped corpus on the mounted Drive (a file path, not a URL)
path = "drive/My Drive/Omdena/SriLanka/Autism/asd_talkbank/Nadig.zip"

nadig = pylangacq.read_chat(path)
nadig.n_files()  # number of CHAT files in this dataset

38

In [None]:
nadig.ages() #ages of child participants in (year, month, day) format

[(3, 6, 5),
 (4, 2, 4),
 (2, 7, 21),
 (3, 7, 2),
 (6, 6, 27),
 (3, 1, 0),
 (6, 1, 0),
 (2, 10, 13),
 (4, 4, 21),
 (3, 1, 11),
 (2, 8, 0),
 (3, 6, 25),
 (2, 8, 16),
 (4, 10, 17),
 (2, 10, 6),
 (5, 1, 21),
 (1, 11, 19),
 (5, 4, 25),
 (3, 1, 5),
 (4, 4, 13),
 (6, 3, 10),
 (1, 11, 5),
 (5, 3, 7),
 (2, 7, 20),
 (5, 5, 15),
 (2, 8, 8),
 (4, 10, 28),
 (2, 7, 27),
 (2, 4, 24),
 (2, 2, 5),
 (2, 3, 12),
 (2, 2, 29),
 (2, 2, 3),
 (1, 10, 28),
 (1, 8, 29),
 (1, 8, 14),
 (3, 7, 16),
 (3, 6, 29)]
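
<font size='4'>Age tuples are awkward to compare or plot directly. As a minimal sketch, the tuples above can be turned into a single number of months. The helper name age_in_months and the 30-days-per-month approximation are assumptions made for illustration; they are not part of the notebook, and pylangacq may offer its own option for this.</font>

```python
from typing import Optional, Tuple


def age_in_months(age: Optional[Tuple[int, int, int]]) -> Optional[float]:
    """Convert a (years, months, days) tuple to months; None stays None."""
    if age is None:
        return None
    years, months, days = age
    # Assumption: a month is approximated as 30 days for the fractional part.
    return years * 12 + months + days / 30


# First three tuples from the output above, used as sample data.
sample_ages = [(3, 6, 5), (4, 2, 4), (2, 7, 21)]
ages_in_months = [age_in_months(a) for a in sample_ages]
print([round(m, 1) for m in ages_in_months])
```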

In [None]:
words = nadig.words()  # list of strings, for all the words across all files
print(len(words))  # total word count
print(words[:8])

43514
['what', "animal's", 'on', 'there', 'Tracy', '?', 'hm', 'horsie']


<font size = '4'>By default, words() returns a flat list of words from all the files. To get the results for each file separately, pass the optional boolean parameter by_files=True.</font>

In [None]:
words_by_files = nadig.words(by_files=True)  # list of lists of strings, each inner list for one file
print(len(words_by_files))  # expect 38, the number of files in ``nadig``
for words_one_file in words_by_files:
    print(len(words_one_file))

38
1584
1431
1214
1018
965
1381
1135
1098
792
1795
867
636
1773
1293
1875
1115
1248
1196
756
1157
1386
985
1462
1224
1082
1314
258
1215
1342
772
1555
1167
635
683
729
1240
889
1247


<font size = '4'>Apart from transcriptions, CHAT data has rich annotations for linguistic and extra-linguistic information. These annotations are accessible through the methods tokens() and utterances().

Many CHAT datasets on CHILDES have the %mor and %gra tiers for morphological information and grammatical relations, respectively. The Nadig data has these tiers too, as the output below shows. A reader such as nadig from above makes all this information available via tokens(). Think of tokens() as words() with annotations: </font>

In [None]:
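all_tokens = nadig.tokens(by_files=True)  # list of lists of Token objects, one inner list per file
print(f'files with tokens: {len(all_tokens)}')  # with by_files=True this is the number of files, not tokens
some_tokens = all_tokens[0][:5]  # first 5 tokens of the first file
print(some_tokens)

files with tokens: 38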
[Token(word='what', pos='pro:int', mor='what', gra=Gra(dep=1, head=2, rel='MOD')), Token(word="animal's", pos='n', mor='animal', gra=Gra(dep=2, head=3, rel='SUBJ')), Token(word='CLITIC', pos='cop', mor='be&3S', gra=Gra(dep=3, head=0, rel='ROOT')), Token(word='on', pos='prep', mor='on', gra=Gra(dep=4, head=3, rel='JCT')), Token(word='there', pos='n', mor='there', gra=Gra(dep=5, head=6, rel='MOD'))]


In [None]:
# The Token class is a dataclass. A Token instance has attributes as shown above.
for token in some_tokens:
    print(f'Word: {token.word}, POS: {token.pos}')

Word: what, POS: pro:int
Word: animal's, POS: n
Word: CLITIC, POS: cop
Word: on, POS: prep
Word: there, POS: n
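
<font size='4'>Once tokens carry part-of-speech tags, a common next step is to count the tags. The sketch below does this with collections.Counter. To stay runnable without the corpus, it uses SimpleToken, a hypothetical stand-in that keeps only the word and pos fields of the tokens printed above; with the real data, the same Counter expression should work on some_tokens.</font>

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for pylangacq's Token, keeping only two of its fields.
SimpleToken = namedtuple("SimpleToken", ["word", "pos"])

# The five tokens printed above, re-entered by hand as sample data.
sample_tokens = [
    SimpleToken("what", "pro:int"),
    SimpleToken("animal's", "n"),
    SimpleToken("CLITIC", "cop"),
    SimpleToken("on", "prep"),
    SimpleToken("there", "n"),
]

# Count how often each POS tag occurs.
pos_counts = Counter(token.pos for token in sample_tokens)
print(pos_counts.most_common())
```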


<font size='4'>Beyond the %mor and %gra tiers, an utterance has yet more information from the original CHAT data file. If you need information such as the unsegmented transcription, time marks, or any unparsed tiers, utterances() is what you need: </font>

In [None]:
nadig_utterance = nadig.utterances(by_files = True)
print(len(nadig_utterance))
print(nadig_utterance[0][0])  # show the first utterance of the first file

38
Utterance(participant='MOT', tokens=[Token(word='what', pos='pro:int', mor='what', gra=Gra(dep=1, head=2, rel='MOD')), Token(word="animal's", pos='n', mor='animal', gra=Gra(dep=2, head=3, rel='SUBJ')), Token(word='CLITIC', pos='cop', mor='be&3S', gra=Gra(dep=3, head=0, rel='ROOT')), Token(word='on', pos='prep', mor='on', gra=Gra(dep=4, head=3, rel='JCT')), Token(word='there', pos='n', mor='there', gra=Gra(dep=5, head=6, rel='MOD')), Token(word='Tracy', pos='n:prop', mor='Tracy', gra=Gra(dep=6, head=4, rel='POBJ')), Token(word='?', pos='?', mor='', gra=Gra(dep=7, head=3, rel='PUNCT'))], time_marks=None, tiers={'MOT': "what animal's on there Tracy ?", '%mor': 'pro:int|what n|animal~cop|be&3S prep|on n|there n:prop|Tracy ?', '%gra': '1|2|MOD 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|6|MOD 6|4|POBJ 7|3|PUNCT'})
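
<font size='4'>The tiers dictionary keeps each tier as a raw string. pylangacq already parses %gra into the Gra objects shown above, so the following is only a sketch of how the raw string maps onto (dep, head, rel) triples. The function name parse_gra is hypothetical, and the input string is copied from the output above.</font>

```python
from typing import List, Tuple


def parse_gra(gra_tier: str) -> List[Tuple[int, int, str]]:
    """Split a raw %gra tier string into (dep, head, rel) triples."""
    relations = []
    for item in gra_tier.split():
        dep, head, rel = item.split("|")
        relations.append((int(dep), int(head), rel))
    return relations


# %gra tier of the first utterance, copied from the output above.
gra_tier = "1|2|MOD 2|3|SUBJ 3|0|ROOT 4|3|JCT 5|6|MOD 6|4|POBJ 7|3|PUNCT"
relations = parse_gra(gra_tier)

# Positions of the tokens labeled as the root of the utterance.
root_positions = [dep for dep, head, rel in relations if rel == "ROOT"]
print(relations[:3])
print(root_positions)
```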


<font size='4'>Word Frequencies and Ngrams: For word combinatorics, check out word_frequencies() and word_ngrams():</font>

In [None]:
word_freq = nadig.word_frequencies(by_files=True)  # a list of collections.Counter objects, one per file
print(len(word_freq))
print(word_freq[0].most_common(5)) #print the 5 most common words for the first chat

38
[('.', 164), ('?', 87), ('you', 62), ('a', 35), ('tea', 34)]
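
<font size='4'>The most common entries above are punctuation marks, because they are counted like words. A minimal sketch of filtering them out follows. The PUNCTUATION set is an assumption about which marks to drop, and sample_freq re-enters the five counts printed above so the example runs without the corpus.</font>

```python
from collections import Counter

# The five entries printed above, used as stand-in data.
sample_freq = Counter({".": 164, "?": 87, "you": 62, "a": 35, "tea": 34})

# Assumption: these are the marks to treat as punctuation; extend as needed.
PUNCTUATION = {".", "?", "!", ","}

# Keep only the entries that are not punctuation marks.
content_freq = Counter(
    {word: count for word, count in sample_freq.items() if word not in PUNCTUATION}
)
print(content_freq.most_common(3))
```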


In [None]:
bigrams = nadig.word_ngrams(2, by_files=True)  # a list of collections.Counter objects, one per file
bigrams[0].most_common(5) #print the 5 most common bigrams for the first chat

[(('you', 'wanna'), 12),
 (('do', 'you'), 8),
 (('yeah', '.'), 8),
 (('tea', '.'), 8),
 (('I', 'think'), 7)]
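
<font size='4'>To make the idea of a bigram concrete, here is a small sketch that counts adjacent word pairs in a single utterance using only the standard library. It illustrates the concept and is not pylangacq's implementation; how word_ngrams() treats utterance boundaries is not shown in this notebook. The word list is the first utterance from the output earlier.</font>

```python
from collections import Counter

# Words of the first utterance, copied from the earlier output.
utterance_words = ["what", "animal's", "on", "there", "Tracy", "?"]

# Pair each word with its right-hand neighbor to form bigrams.
bigram_counts = Counter(zip(utterance_words, utterance_words[1:]))

print(len(bigram_counts))
print(bigram_counts[("on", "there")])
```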