Source: https://asd.talkbank.org/access/English/Eigsti.html

Associated dataset paper link: https://asd.talkbank.org/access/0docs/Eigsti2007.pdf

- The files are transcripts of 30-minute free-play sessions with children aged 3-6 years in three groups: ASD, non-ASD developmental delay, and typical development (n=16 per group; 48 files in total).
- The first 100 utterances for each participant were included.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

#Optional: move to the desired location:
#%cd drive/My Drive/DIRECTORY_IN_YOUR_DRIVE

Mounted at /content/drive


In [2]:
!pip install --upgrade pylangacq

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pylangacq
  Downloading pylangacq-0.17.0-py3-none-any.whl (65 kB)
Installing collected packages: pylangacq
Successfully installed pylangacq-0.17.0


In [3]:
import pylangacq
import pandas as pd
import os

In [4]:
path = "drive/My Drive/Omdena/SriLanka/Autism/asd_talkbank/Eigsti.zip"  # local ZIP archive on the mounted Drive
eigsti = pylangacq.read_chat(path)
eigsti.n_files()  # number of CHAT files in this dataset

48

# Extract Raw Data

In [7]:
file_names = eigsti.file_paths()  # file names will be reused later when saving the csv files
# Paths look like 'Eigsti/1010.cha': keep the last path component and drop the extension to get '1010'.
file_names = [os.path.splitext(os.path.basename(file_name))[0] for file_name in file_names]

In [9]:
file_names[:5]

['1010', '1011', '1012', '1013', '1017']

<font size ='4'>An utterance is one line of a conversation together with its metadata, such as tokens, time marks, and grammatical information. Calling the utterances(by_files=True) method returns a nested list indexed as **[chat_file_index][utterance_index]**, that is, one inner list of utterance objects per chat file.</font>

In [10]:
utterrances = eigsti.utterances(by_files=True)

<font size ='4'>As an example, we take the utterances of the first chat file (index 0). Each utterance has a **participant** attribute indicating who is speaking and a dictionary named **tiers** that holds the original text plus additional grammatical information. We can therefore extract the original text from **tiers** using the participant code as the key.</font>

In [11]:
first_file_utterrances = utterrances[0]
for utter in first_file_utterrances:
  participant = utter.participant
  participant_line = utter.tiers[participant]  # text on the speaker's own tier
  print(f'{participant} : {participant_line}')

INV2 : alright !
INV2 : you have fun, okay ?
CHI : okay .
INV2 : alright .
INV1 : did you see this ?
CHI : did you see this ?
INV1 : you can play with any of the stuff in there .
INV1 : did you get all of these ?
CHI : yeah .
INV1 : you did ?
INV1 : how'd you get (th)em ?
INV1 : where'd you get these ?
CHI : xxx let's see +...
INV1 : you what ?
CHI : xxx .
INV1 : hm: ?
CHI : <she's not fit> [?] .
INV1 : hm: ?
CHI : she's [//] she doesn't fit .
INV1 : does it go in ?
CHI : yeah .
INV1 : show me .
INV1 : oh ‡ there you go .
INV1 : good job, good job .
INV1 : do you know what that's called ?
CHI : a star .
INV1 : what's it called, a star ?
CHI : yeah .
INV1 : yeah .
CHI : a square .
INV1 : uhhuh .
INV1 : where'd you learn all those ?
INV1 : where'd you learn your shapes ?
CHI : I did i(t) .
INV1 : you did, where ?
CHI : here .
CHI : aw@o .
INV1 : what happens ?
CHI : æəwəwəwəiibəcamibəbəbəbəbəgheorn@b .
INV1 : here let's move back, can we move back a little bit ?
CHI : yeah .
INV1 : just 
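
The participant-to-tier lookup used above can be tried without the corpus. The following is a minimal sketch, assuming a hypothetical `FakeUtterance` stand-in that mimics only the two attributes this notebook relies on (`participant` and `tiers`); it is not pylangacq's own class. The spoken lines are copied from the output above, while the `'%mor'` values are placeholders standing in for the grammatical tiers.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeUtterance:
    """Hypothetical stand-in exposing only the two attributes used above."""
    participant: str
    tiers: dict


# Spoken lines copied from the printed output; '%mor' values are placeholders.
first_file = [
    FakeUtterance('INV2', {'INV2': 'alright !', '%mor': '<placeholder>'}),
    FakeUtterance('CHI', {'CHI': 'okay .', '%mor': '<placeholder>'}),
    FakeUtterance('INV1', {'INV1': 'did you see this ?', '%mor': '<placeholder>'}),
    FakeUtterance('CHI', {'CHI': 'did you see this ?', '%mor': '<placeholder>'}),
]


def speaker_lines(utterances):
    """Return (participant, text) pairs, indexing tiers by the speaker code."""
    return [(u.participant, u.tiers[u.participant]) for u in utterances]


pairs = speaker_lines(first_file)
turns_per_speaker = Counter(speaker for speaker, _ in pairs)

print(pairs[1])
print(turns_per_speaker['CHI'])
```

Keying `tiers` by the speaker code picks out the spoken line and skips the other tiers stored in the same dictionary, which is why the loop above prints only the transcript text.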

<font size ='4'>We will save each conversation to its own CSV file named after the chat file, so chat file **1010.cha** becomes **1010.csv**. Each CSV has two columns:
<ul>
<li>participant</li>
<li>sentence</li>
</ul>
</font>

In [12]:
column_names = ['participant', 'sentence']
save_dir = "drive/My Drive/Omdena/SriLanka/Autism/asd_talkbank/eigsti"
os.makedirs(save_dir, exist_ok=True)  # make sure the destination folder exists
for chat_file_index, (name, chat_file) in enumerate(zip(file_names, utterrances), start=1):
  # Collect one (participant, sentence) row per utterance and build the frame in one step;
  # DataFrame.append was removed in pandas 2.0 and was slow when called once per row.
  rows = [(utter.participant, utter.tiers[utter.participant]) for utter in chat_file]
  chat_df = pd.DataFrame(rows, columns=column_names)
  file_name = name + '.csv'  # construct the csv file name, e.g. 1010.csv
  chat_df.to_csv(os.path.join(save_dir, file_name), index=False)  # save the csv file in the destination folder
  print(f'{chat_file_index}: {file_name}')

1: 1010.csv
2: 1011.csv
3: 1012.csv
4: 1013.csv
5: 1017.csv
6: 1020.csv
7: 1022.csv
8: 1024.csv
9: 1025.csv
10: 1027.csv
11: 1028.csv
12: 1030.csv
13: 1031.csv
14: 1032.csv
15: 1035.csv
16: 1040.csv
17: 1041.csv
18: 1042.csv
19: 1043.csv
20: 1044.csv
21: 1045.csv
22: 1046.csv
23: 1047.csv
24: 1048.csv
25: 1052.csv
26: 1054.csv
27: 1055.csv
28: 1056.csv
29: 1057.csv
30: 1058.csv
31: 1060.csv
32: 1061.csv
33: 1062.csv
34: 1063.csv
35: 1064.csv
36: 1068.csv
37: 1069.csv
38: 1070.csv
39: 1072.csv
40: 1074.csv
41: 1075.csv
42: 1076.csv
43: 1078.csv
44: 1079.csv
45: 1080.csv
46: 1081.csv
47: 1082.csv
48: 1085.csv
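
Once the files are written, it can be worth reading one back to check that the two columns survived the round trip. The following is a minimal sketch using only the standard library so that it runs without the Drive mount; `summarize_chat_csv` and the throwaway `demo.csv` are hypothetical names introduced for illustration, and the three demo rows are copied from the first file's output above.

```python
import csv
import os
import tempfile
from collections import Counter


def summarize_chat_csv(path):
    """Count rows per participant in a CSV with 'participant' and 'sentence' columns."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        return Counter(row['participant'] for row in reader)


# Write a small throwaway file in the same two-column layout as the saved CSVs.
demo_rows = [
    ('INV2', 'alright !'),
    ('INV2', 'you have fun, okay ?'),  # the comma is quoted automatically by csv
    ('CHI', 'okay .'),
]
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
with open(demo_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['participant', 'sentence'])
    writer.writerows(demo_rows)

counts = summarize_chat_csv(demo_path)
print(counts)
```

To check a real export, point `summarize_chat_csv` at one of the saved files, for example `os.path.join(save_dir, '1010.csv')`, and compare the total against the number of utterances in that chat file.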
