## Getting started with `scikit-talk`

`scikit-talk` can be used to explore and analyse conversation files.

It contains three main levels of objects:

- Corpora, described by the `Corpus` class
- Conversations, described by the `Conversation` class
- Utterances, described by the `Utterance` class
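
As a rough mental model of how these three levels nest, the sketch below uses plain dataclasses. The `Toy*` classes are illustrative assumptions and not the actual `scikit-talk` implementation; only the attribute names (`utterances`, `conversations`, `metadata`, `participant`, `time`) are borrowed from the outputs shown further down in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class ToyUtterance:
    """Hypothetical stand-in for an utterance: who said what, and when."""
    utterance: str
    participant: str
    time: list  # [begin_ms, end_ms]


@dataclass
class ToyConversation:
    """Hypothetical stand-in for a conversation: utterances plus metadata."""
    utterances: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


@dataclass
class ToyCorpus:
    """Hypothetical stand-in for a corpus: conversations plus metadata."""
    conversations: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


conversation = ToyConversation(
    utterances=[
        ToyUtterance("I thought I'd lost you", "S", [2775, 3773]),
        ToyUtterance("not communicating across cultures", "H", [21090, 23087]),
    ],
    metadata={"source": "01.cha"},
)
corpus = ToyCorpus(conversations=[conversation], metadata={"name": "toy corpus"})

# Walk the hierarchy from the corpus down to a single utterance.
first = corpus.conversations[0].utterances[0]
print(first.participant, first.time)
```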

The best entry point to `scikit-talk` is a parser, which loads a transcription file into a `scikit-talk` object.

`scikit-talk` currently has the following parsers:

- `Conversation.from_cha()`, which parses .cha files.
- `Conversation.from_eaf()`, which parses ELAN (.eaf) files.

Future plans include the creation of parsers for:

- .TextGrid files
- .xml files

Parsers return an object of the `Conversation` class.

To get started with `scikit-talk`, import the module:

In [2]:
import sktalk

To see it in action, we will need to start with a transcription file.

For example, you can download a file from the
[Griffith Corpus of Spoken Australian English](https://ca.talkbank.org/data-orig/GCSAusE/). This publicly available corpus contains transcription files in `.cha` format.

Another publicly available corpus is the [IFADV](https://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFADVcorpus/Annotations/EAF/) corpus, which contains annotations as `.eaf` files.

We will go over both options below.

### Parsing a `.cha` file

From the Griffith corpus, we have downloaded [this file](https://ca.talkbank.org/data-orig/GCSAusE/01.cha).

We will parse the file with the `Conversation.from_cha()` method, resulting in a `Conversation` object:

In [3]:
griffith01 = sktalk.Conversation.from_cha('01.cha')

griffith01

<sktalk.corpus.conversation.Conversation at 0x10fce4be0>

A parsed `.cha` file is a `Conversation` object. It has metadata and a collection of utterances:

In [4]:
griffith01.utterances[:10]

[Utterance(utterance='0', participant='S', time=[0, 1500], begin='00:00:00.000', end='00:00:01.500', metadata=None, utterance_clean='0', utterance_list=['0'], n_words=1, n_characters=1, time_to_next=None, dyadic=None, FTO=None),
 Utterance(utterance="mm I'm glad I saw you⇗", participant='S', time=[1500, 2775], begin='00:00:01.500', end='00:00:02.775', metadata=None, utterance_clean='mm Im glad I saw you', utterance_list=['mm', 'Im', 'glad', 'I', 'saw', 'you'], n_words=6, n_characters=15, time_to_next=None, dyadic=None, FTO=None),
 Utterance(utterance="I thought I'd lost you", participant='S', time=[2775, 3773], begin='00:00:02.775', end='00:00:03.773', metadata=None, utterance_clean='I thought Id lost you', utterance_list=['I', 'thought', 'Id', 'lost', 'you'], n_words=5, n_characters=17, time_to_next=None, dyadic=None, FTO=None),
 Utterance(utterance='le⌉,', participant='H', time=[4052, 5515], begin='00:00:04.052', end='00:00:05.515', metadata=None, utterance_clean='le', utterance_list

In [5]:
griffith01.metadata

{'source': '01.cha',
 'UTF8': '',
 'PID': '11312/t-00017232-1',
 'Languages': ['eng'],
 'Participants': {'S': {'name': 'Sarah',
   'language': 'eng',
   'corpus': 'GCSAusE',
   'age': '',
   'sex': '',
   'group': '',
   'ses': '',
   'role': 'Adult',
   'education': '',
   'custom': ''},
  'H': {'name': 'Hannah',
   'language': 'eng',
   'corpus': 'GCSAusE',
   'age': '',
   'sex': '',
   'group': '',
   'ses': '',
   'role': 'Adult',
   'education': '',
   'custom': ''}},
 'Options': 'CA',
 'Media': '01, audio'}

We can explore the conversation using the `summary` method:

In [6]:
griffith01.summary(n=30)

(0 - 1500) S: '0'
(1500 - 2775) S: 'mm I'm glad I saw you⇗'
(2775 - 3773) S: 'I thought I'd lost you'
(4052 - 5515) H: 'le⌉,'
(4052 - 5817) S: '⌊xxx⌋'
(6140 - 9487) S: ': (.) if ʔI couldn't boʔrrow, (1.3)'
(9487 - 12888) S: 'r'
(12888 - 14050) H: 'nicating acro-'
(14050 - 17014) H: 'for family gender and sexuality'
(17014 - 18611) S: 'that's the second on is itʔ'
(18611 - 21090) H: '+≋ I think it's s⌈ame family gender⌉ has a second book'
(19011 - 20132) S: '⌊whatever xxx⌋'
(21090 - 23087) H: 'not communicating across cultures'
(24457 - 25746) H: '⌈family gen⌈der has two'
(24457 - 25931) S: '⌊can-   ⌊can I borrow it⇗'
(25931 - 26971) H: 'ʔh ⌈sure'
(26576 - 27215) S: '⌊thank you'
(27554 - 28309) H: 'I've got all my-'
(28700 - 30774) H: 'in fact all my reading books are all together,'
(31400 - 31876) H: 'so that'
(32276 - 33530) H: 'se them⇗'
(33800 - 34706) H: 'I do ∆sort of∆ think-'
(34706 - 38006) H: 'cause I don't think that one I'll be using (0.2) particularly'
(38100 - 39261) H: 'in

The same method can restrict the output to, for example, a single participant:

In [7]:
griffith01.summary(participant = 'S', n = 5)

(0 - 1500) S: '0'
(1500 - 2775) S: 'mm I'm glad I saw you⇗'
(2775 - 3773) S: 'I thought I'd lost you'
(4052 - 5817) S: '⌊xxx⌋'
(6140 - 9487) S: ': (.) if ʔI couldn't boʔrrow, (1.3)'
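
To make the behaviour of the two arguments concrete, here is a minimal sketch of a summary routine in plain Python. The function `sketch_summary` and the tuple layout are hypothetical and only mimic the output format shown above; they are not how `scikit-talk` implements `summary`. The sketch filters by participant first and truncates to `n` lines afterwards, which is consistent with the output above.

```python
def sketch_summary(utterances, participant=None, n=10):
    """Return up to n formatted lines, optionally for a single participant.

    Each utterance is a (begin_ms, end_ms, participant, text) tuple.
    """
    selected = [
        u for u in utterances
        if participant is None or u[2] == participant
    ]
    # Filter first, then truncate, so n counts lines of the chosen participant.
    return [
        f"({begin} - {end}) {who}: '{text}'"
        for begin, end, who, text in selected[:n]
    ]


# A few utterances copied from the Griffith example above.
sample = [
    (0, 1500, "S", "0"),
    (2775, 3773, "S", "I thought I'd lost you"),
    (14050, 17014, "H", "for family gender and sexuality"),
    (21090, 23087, "H", "not communicating across cultures"),
    (27554, 28309, "H", "I've got all my-"),
]

for line in sketch_summary(sample, participant="H", n=2):
    print(line)
```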


### Parsing an `.eaf` file

From the IFADV corpus, we have downloaded [this file](https://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFADVcorpus/Annotations/EAF/DVA3E.EAF).

We will use the `Conversation.from_eaf()` method to parse the file, resulting in a `Conversation` object.

In [8]:
ifadv03 = sktalk.Conversation.from_eaf("DVA3E.EAF")

ifadv03

<sktalk.corpus.conversation.Conversation at 0x11780f550>

ELAN files are somewhat more complex than `.cha` files, as they may contain additional annotations (e.g. for gestures). ELAN stores these annotations as separate tiers, which end up in the `Conversation` object as utterances from different participants.

We can look at the participants in the conversation:

In [9]:
ifadv03.participants

{'kijkrichting spreker1 [v] (TIE1)',
 'kijkrichting spreker2 [v] (TIE3)',
 'spreker1 [v] (TIE0)',
 'spreker2 [v] (TIE2)'}

In this case, we are only interested in `'spreker1 [v] (TIE0)'` and `'spreker2 [v] (TIE2)'`. We want to remove the other "participants" from the conversation.

In [10]:
ifadv03.remove(participant = 'kijkrichting spreker1 [v] (TIE1)')
ifadv03.remove(participant = 'kijkrichting spreker2 [v] (TIE3)')

ifadv03.participants

{'spreker1 [v] (TIE0)', 'spreker2 [v] (TIE2)'}

Another way to ensure that only the right tiers are included is to specify the tiers we want to parse when we call the `from_eaf()` method:

In [11]:
ifadv03 = sktalk.Conversation.from_eaf("DVA3E.EAF", tiers = ['spreker1 [v] (TIE0)', 'spreker2 [v] (TIE2)'])

ifadv03.participants

{'spreker1 [v] (TIE0)', 'spreker2 [v] (TIE2)'}
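
Because `.eaf` files are XML documents in which each tier is a `TIER` element with a `TIER_ID` attribute, the tiers can also be inspected with the Python standard library before parsing. The sketch below is independent of `scikit-talk`; the embedded sample is a made-up, heavily reduced fragment, and the helper `list_tier_ids` is hypothetical. The identifiers found this way may not match the participant labels that `scikit-talk` reports character for character.

```python
import xml.etree.ElementTree as ET

# Made-up, minimal EAF-like fragment; real files contain many more
# elements (header, time order, annotations) and attributes.
EAF_SAMPLE = """
<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="speaker1" LINGUISTIC_TYPE_REF="utterance"/>
  <TIER TIER_ID="gaze speaker1" LINGUISTIC_TYPE_REF="gaze"/>
  <TIER TIER_ID="speaker2" LINGUISTIC_TYPE_REF="utterance"/>
</ANNOTATION_DOCUMENT>
"""


def list_tier_ids(eaf_text):
    """Return the TIER_ID of every TIER element, in document order."""
    root = ET.fromstring(eaf_text.strip())
    return [tier.get("TIER_ID") for tier in root.iter("TIER")]


print(list_tier_ids(EAF_SAMPLE))
```

For a file on disk, `ET.parse("DVA3E.EAF").getroot()` would take the place of `ET.fromstring(...)`.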

## Analyzing turn-taking dynamics

When creating a `Conversation` object, a number of calculations and transformations are performed on the `Utterance` objects within.
For example, the number of words in each utterance is calculated, and stored under `Utterance.n_words`.
You can see this for a specific utterance as follows:

In [12]:
print(griffith01.utterances[13].utterance)
print(griffith01.utterances[13].utterance_clean)
print(griffith01.utterances[13].n_words)

⌈family gen⌈der has two
family gender has two
4
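
The exact cleaning rules that `scikit-talk` applies are not described here. As a rough approximation, the sketch below drops every character that is not an ASCII letter, digit, or whitespace, and then counts words and characters. The helpers `clean_utterance` and `describe` are hypothetical; this simple rule agrees with the examples shown in this guide, but may differ from the library for other transcription symbols.

```python
import re


def clean_utterance(text):
    """Keep only ASCII letters, digits and single spaces (an assumed rule)."""
    stripped = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return " ".join(stripped.split())


def describe(text):
    """Return the cleaned text with word and character counts."""
    clean = clean_utterance(text)
    words = clean.split()
    return {
        "utterance_clean": clean,
        "n_words": len(words),
        # Characters are counted without the spaces between words.
        "n_characters": sum(len(word) for word in words),
    }


print(describe("⌈family gen⌈der has two"))
print(describe("I thought I'd lost you"))
```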


More sophisticated calculations can be performed, but do not happen automatically.
An example of this is the calculation of the Floor Transfer Offset (FTO) per utterance.
FTO is defined as the start time of a turn minus the end time of the most relevant prior turn by the other participant.
If there is overlap between these turns, the FTO is negative.
If there is a pause between these utterances, the FTO is positive.
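
The sign convention can be checked with simple arithmetic. In the sketch below, the hypothetical helper `offset_ms` subtracts the end of the prior turn from the start of the current turn. The first call uses timings from the Griffith example (S ends at 3773 ms, H starts at 4052 ms); the second uses made-up timings to show an overlap.

```python
def offset_ms(prior_end_ms, current_start_ms):
    """Start of the current turn minus the end of the relevant prior turn."""
    return current_start_ms - prior_end_ms


gap = offset_ms(prior_end_ms=3773, current_start_ms=4052)
overlap = offset_ms(prior_end_ms=3000, current_start_ms=2900)

print(gap)      # positive: a pause between the turns
print(overlap)  # negative: the turns overlap
```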

We can calculate the FTOs of the utterances in a conversation:

In [13]:
griffith01.calculate_FTO()

for utterance in griffith01.utterances[:10]:
    print(f'{utterance.time} {utterance.participant} - FTO: {utterance.FTO}')

[0, 1500] S - FTO: None
[1500, 2775] S - FTO: None
[2775, 3773] S - FTO: None
[4052, 5515] H - FTO: 279
[4052, 5817] S - FTO: None
[6140, 9487] S - FTO: None
[9487, 12888] S - FTO: None
[12888, 14050] H - FTO: 0
[14050, 17014] H - FTO: None
[17014, 18611] S - FTO: 0


To determine which prior turn is the relevant turn for FTO calculation, the following criteria are used to find a relevant utterance prior to an utterance U:

- the relevant utterance must be by another participant
- the relevant utterance must be the most recent utterance by that participant
- the relevant utterance must have started more than a specified number of ms before the start of U. This time defaults to 200 ms, but can be changed with the `planning_buffer` argument.
- the relevant utterance must be partly or entirely within the context window. The context window is defined as 10s (or 10000ms) prior to the utterance U. The size of this window can be changed with the `window` argument.
- the context window must contain no more than 2 speakers. This limit can be raised to 3 with the `n_participants` argument.
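
The criteria above can be sketched in plain Python as follows. This is one literal reading of the list, not the library's code: the `Turn` class and `sketch_fto` function are hypothetical, the timings are made up, and the sketch assumes turns are sorted by start time. `calculate_FTO()` may apply further conditions, so the sketch is not guaranteed to reproduce its results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    participant: str
    start: int  # ms
    end: int    # ms


def sketch_fto(turns: List[Turn], index: int, window: int = 10000,
               planning_buffer: int = 200,
               n_participants: int = 2) -> Optional[int]:
    """Return the FTO of turns[index], or None if no prior turn qualifies."""
    current = turns[index]
    window_start = current.start - window

    # Earlier turns that fall partly or entirely inside the context window.
    in_window = [t for t in turns[:index]
                 if t.end >= window_start and t.start <= current.start]

    # Too many speakers in the window: no FTO.
    speakers = {t.participant for t in in_window} | {current.participant}
    if len(speakers) > n_participants:
        return None

    # Most recent turn by another participant.
    others = [t for t in in_window if t.participant != current.participant]
    if not others:
        return None
    relevant = max(others, key=lambda t: t.start)

    # Assumed reading: if that turn started too recently, there is no FTO.
    if current.start - relevant.start <= planning_buffer:
        return None

    return current.start - relevant.end


turns = [
    Turn("A", 0, 1000),
    Turn("B", 1300, 2000),
    Turn("A", 1400, 2500),
    Turn("A", 2600, 3000),
    Turn("B", 2900, 3500),
]
ftos = [sketch_fto(turns, i) for i in range(len(turns))]
print(ftos)
```

In this toy conversation, the third turn gets no FTO because the other speaker started only 100 ms earlier, and the last turn gets a negative FTO because it starts before the previous speaker has finished.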

When calculating the FTO, the settings for the arguments `planning_buffer`, `window`, and `n_participants` can be changed. Their values are stored in the metadata of the conversation object when the FTOs are calculated.

They can be retrieved as follows:

In [14]:
griffith01.metadata["Calculations"]["FTO"]

{'window': 10000, 'planning_buffer': 200, 'n_participants': 2}

## The `Corpus` object

A `Corpus` is a collection of conversations.

A Corpus can be initialized from a single conversation, or a list of conversations.
It can also be initialized as an empty object, with metadata.

In [15]:
GCSAusE = sktalk.Corpus(name = "Griffith Corpus of Spoken Australian English",
                        url = "https://ca.talkbank.org/data-orig/GCSAusE/")

GCSAusE.metadata

{'name': 'Griffith Corpus of Spoken Australian English',
 'url': 'https://ca.talkbank.org/data-orig/GCSAusE/'}

We can add conversations to a `Corpus`:

In [16]:
GCSAusE.append(griffith01)

GCSAusE.conversations

[<sktalk.corpus.conversation.Conversation at 0x10fce4be0>]

## Storing and retrieving `Conversation` and `Corpus` objects


Both `Conversation` and `Corpus` objects can be written to file in .csv and .json formats.

### json
.json files are comprehensive, and contain the entire object in one file:

In [17]:
# Corpus
GCSAusE.write_json(path = "CGSAusE.json")

# Conversation
griffith01.write_json(path = "GCSAusE_01.json")

Object saved to CGSAusE.json
Object saved to GCSAusE_01.json


The objects can be recreated from the .json files:

In [18]:
# Corpus
GCSAusE_2 = sktalk.Corpus.from_json("CGSAusE.json")

# Conversation
griffith01_2 = sktalk.Conversation.from_json(path = "GCSAusE_01.json")

### csv

When writing to .csv, two files are created. One contains the utterances, and the other contains the metadata.
The utterance file is written to the path provided; the metadata file takes the same name with `_metadata` inserted before the `.csv` extension.

In [19]:
# Corpus
GCSAusE.write_csv(path = "CGSAusE.csv")

# Conversation
griffith01.write_csv(path = "GCSAusE_01.csv")

Utterances saved to CGSAusE.csv
Metadata saved to CGSAusE_metadata.csv
Utterances saved to GCSAusE_01.csv
Metadata saved to GCSAusE_01_metadata.csv
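
To locate the metadata file that belongs to a given utterance file, the naming pattern seen in the output above can be reproduced with `pathlib`. The helper `metadata_path` is hypothetical and only mirrors that pattern; it is not part of `scikit-talk`.

```python
from pathlib import Path


def metadata_path(path):
    """Insert '_metadata' before the file extension of the given path."""
    p = Path(path)
    return p.with_name(f"{p.stem}_metadata{p.suffix}")


print(metadata_path("GCSAusE_01.csv"))
```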
