## Getting started with `scikit-talk`

`scikit-talk` can be used to explore and analyse conversation files.

It is built around three levels of objects:
- Corpora, described by the `Corpus` class
- Conversations, described by the `Conversation` class
- Utterances, described by the `Utterance` class

The best entry point into `scikit-talk` is a parser. A parser loads the data from a transcription file into a `scikit-talk` object.

`scikit-talk` currently has the following parsers:

- `ChaFile.parse()`, which parses `.cha` files.

Future plans include the creation of parsers for:
- `.eaf` files
- `.TextGrid` files
- `.xml` files
- `.csv` files
- `.json` files

Parsers return an object of the `Conversation` class.

To get started with `scikit-talk`, import the module:

In [2]:
import sktalk

To see it in action, we will need to start with a transcription file.

For example, you can download a file from the
[Griffith Corpus of Spoken Australian English](https://ca.talkbank.org/data-orig/GCSAusE/). This publicly available corpus contains transcription files in `.cha` format.

We use the `ChaFile.parse` method to create a `Conversation` object:

In [10]:
cha01 = sktalk.ChaFile('GCSAusE_01.cha').parse()

cha01

<sktalk.corpus.conversation.Conversation at 0x10b395240>

A parsed `.cha` file is a `Conversation` object. It has metadata and a collection of utterances:

In [11]:
cha01.utterances[:10]

[Utterance(utterance='0', participant='S', time=(0, 1500), begin='00:00:00.000', end='00:00:01.500', metadata=None),
 Utterance(utterance="mm I'm glad I saw you⇗", participant='S', time=(1500, 2775), begin='00:00:01.500', end='00:00:02.775', metadata=None),
 Utterance(utterance="I thought I'd lost you (0.3)", participant='S', time=(2775, 3773), begin='00:00:02.775', end='00:00:03.773', metadata=None),
 Utterance(utterance="⌈no I've been here for a whi:le⌉,", participant='H', time=(4052, 5515), begin='00:00:04.052', end='00:00:05.515', metadata=None),
 Utterance(utterance='⌊xxx⌋ (0.3)', participant='S', time=(4052, 5817), begin='00:00:04.052', end='00:00:05.817', metadata=None),
 Utterance(utterance="⌊hm:: (.) if ʔI couldn't boʔrrow, (1.3) the second (0.2) book of readings fo:r", participant='S', time=(6140, 9487), begin='00:00:06.140', end='00:00:09.487', metadata=None),
 Utterance(utterance='commu:nicating acro-', participant='H', time=(12888, 14050), begin='00:00:12.888', end='00:00:
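
Each utterance records its speaker and a `time` tuple which, judging by the matching `begin` and `end` strings, holds start and end offsets in milliseconds. As a minimal sketch of what these fields allow, the snippet below sums speaking time per participant. It does not use `scikit-talk` itself: `Utt` is a hypothetical stand-in holding values copied from the output above, on the assumption that real `Utterance` objects expose `participant` and `time` as attributes, as their printed form suggests.

```python
from collections import defaultdict
from typing import NamedTuple


class Utt(NamedTuple):
    """Hypothetical stand-in for an utterance; only the fields used here."""
    utterance: str
    participant: str
    time: tuple  # (start_ms, end_ms), mirroring the output above


# Values copied from the first five utterances shown above.
utterances = [
    Utt("0", "S", (0, 1500)),
    Utt("mm I'm glad I saw you⇗", "S", (1500, 2775)),
    Utt("I thought I'd lost you (0.3)", "S", (2775, 3773)),
    Utt("⌈no I've been here for a whi:le⌉,", "H", (4052, 5515)),
    Utt("⌊xxx⌋ (0.3)", "S", (4052, 5817)),
]


def speaking_time_ms(utts):
    """Sum utterance durations (end - start) per participant, in milliseconds."""
    totals = defaultdict(int)
    for utt in utts:
        start, end = utt.time
        totals[utt.participant] += end - start
    return dict(totals)


totals = speaking_time_ms(utterances)
print(totals)  # {'S': 5538, 'H': 1463}
```

Note that overlapping speech, such as the two utterances starting at 4052 ms, is counted once for each speaker.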

In [12]:
cha01.metadata

{'source': 'GCSAusE_01.cha',
 'UTF8': '',
 'PID': '11312/t-00017232-1',
 'Languages': ['eng'],
 'Participants': {'S': {'name': 'Sarah',
   'language': 'eng',
   'corpus': 'GCSAusE',
   'age': '',
   'sex': '',
   'group': '',
   'ses': '',
   'role': 'Adult',
   'education': '',
   'custom': ''},
  'H': {'name': 'Hannah',
   'language': 'eng',
   'corpus': 'GCSAusE',
   'age': '',
   'sex': '',
   'group': '',
   'ses': '',
   'role': 'Adult',
   'education': '',
   'custom': ''}},
 'Options': 'CA',
 'Media': '01, audio'}
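
The metadata prints as a nested dictionary, so it can presumably be navigated with ordinary dictionary operations. The sketch below works on a trimmed copy of the output above rather than on a live `Conversation` object, and maps each participant code to the speaker's name.

```python
# Trimmed copy of the metadata shown above; empty fields are left out.
metadata = {
    "source": "GCSAusE_01.cha",
    "Languages": ["eng"],
    "Participants": {
        "S": {"name": "Sarah", "language": "eng", "role": "Adult"},
        "H": {"name": "Hannah", "language": "eng", "role": "Adult"},
    },
}

# Map each participant code (as used in the utterances) to a name.
names = {code: info["name"] for code, info in metadata["Participants"].items()}
print(names)  # {'S': 'Sarah', 'H': 'Hannah'}
```

A lookup table like this makes it easy to replace the single-letter codes in the utterances with readable speaker names.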

We can write the conversation to a JSON file:

In [13]:
cha01.write_json(name = "GCSAusE_01", directory = ".")

## The `Corpus` object

A `Corpus` is a collection of conversations.

A `Corpus` can be initialized from a single conversation or from a list of conversations.
It can also be initialized as an empty object that holds only metadata.

In [14]:
GCSAusE = sktalk.Corpus(name = "Griffith Corpus of Spoken Australian English",
                        url = "https://ca.talkbank.org/data-orig/GCSAusE/")

GCSAusE.metadata

{'name': 'Griffith Corpus of Spoken Australian English',
 'url': 'https://ca.talkbank.org/data-orig/GCSAusE/'}

We can add conversations to a `Corpus`:

In [15]:
GCSAusE.append(cha01)

GCSAusE.conversations

[<sktalk.corpus.conversation.Conversation at 0x10b395240>]
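
To fill a corpus from a whole folder of transcripts, one option is to collect the `.cha` files first and then parse and append them one by one. The helper below is a sketch using only the standard library; `collect_cha_files` and the `data` directory are hypothetical names, not part of `scikit-talk`. The commented-out loop relies only on the calls shown in this section.

```python
from pathlib import Path


def collect_cha_files(directory):
    """Return the .cha files directly inside `directory`, sorted by path."""
    return sorted(Path(directory).glob("*.cha"))


# With scikit-talk installed and a corpus created as above, each file
# could then be parsed and appended (assuming a local "data" folder):
#
# for path in collect_cha_files("data"):
#     GCSAusE.append(sktalk.ChaFile(str(path)).parse())
```

Sorting the paths keeps the order of conversations in the corpus reproducible across runs.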