# Word segmentation

In this notebook we are going to explore word segmentation: the task of finding words in a stream of speech. This is a difficult problem for language learners because there are few acoustic cues to word boundaries. Yet children as young as 7.5 months begin to parse a speech stream into individual words. How do they do this?

Hockema (2006) is a nice study of this ability. It looks at whether the order of phonemes carries enough information to segment words successfully. In particular, the study is concerned with the probability of one phoneme following another, estimated from relative frequencies in a corpus. For example, in English, the phoneme sequence /zg/ rarely, if ever, occurs within a word. Thus, if a transition from /z/ to /g/ is encountered, odds are high that the /z/ ends one word and the /g/ begins the next. Hockema argues that a model attending only to these probabilities can segment American English speech with high accuracy.
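To make "probabilities derived from relative frequencies" concrete, here is a minimal sketch that estimates the probability of one phoneme following another from a toy corpus. The function name `transition_probabilities` and the toy utterances are made up for illustration, and this is not necessarily the exact statistic computed in the paper.

```python
from collections import Counter

def transition_probabilities(utterances):
    """Estimate P(following | current) from counts of adjacent phoneme pairs."""
    pair_counts = Counter()   # how often each (current, following) pair occurs
    first_counts = Counter()  # how often each phoneme occurs as the first member of a pair
    for phonemes in utterances:
        for current, following in zip(phonemes, phonemes[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1
    # Relative frequency: count of the pair divided by count of its first phoneme
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

# Toy data (an assumption): each utterance is a flat list of phoneme symbols,
# with no word boundaries marked.
toy_utterances = [
    ["DH", "AH", "D", "AO", "G"],  # "the dog"
    ["DH", "AH", "K", "AE", "T"],  # "the cat"
]

probs = transition_probabilities(toy_utterances)
for pair, p in sorted(probs.items()):
    print(pair, round(p, 2))
```

In this toy corpus, "AH" is followed by "D" half the time and by "K" the other half, while "DH" is always followed by "AH". A low-probability transition is the kind of cue that might signal a word boundary.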

For more details about word segmentation more broadly, read the following:

- Goldwater, Sharon, Thomas L. Griffiths, and Mark Johnson. "A Bayesian framework for word segmentation: Exploring the effects of context." Cognition 112.1 (2009): 21-54.
- Hockema, Stephen A. "Finding words in speech: An investigation of American English." Language Learning and Development 2.2 (2006): 119-146.
- Johnson, Elizabeth K., and Peter W. Jusczyk. "Word segmentation by 8-month-olds: When speech cues count more than statistics." Journal of Memory and Language 44.4 (2001): 548-567.
- Saffran, Jenny R., Richard N. Aslin, and Elissa L. Newport. "Statistical learning by 8-month-old infants." Science 274.5294 (1996): 1926-1928.

Throughout the notebooks of this workshop, we're not going to be concerned with whether the arguments are right or wrong. Rather, we care about practicing our Python skills so that we can do similar linguistic analyses. In particular, we're going to practice the following skills:

- Strings
- Dataframes
- Python objects
- etc.

## Data

The corpus used in this study is the American English part of CHILDES. Here's the relevant description of the corpus from the paper:

> Utterances were extracted from the CHILDES corpora, which consist of an assortment of speech interactions involving children. All of the corpora in the American English portion were used. Any utterance by a speaker older than 6 years was extracted and analyzed. (Speech produced by children still in the process of actively acquiring the language was thus excluded.) A phonemic pronunciation of each word was obtained from an electronic copy of the Carnegie Mellon Pronouncing Dictionary (1998, cmudict, Version 0.6d) containing 129,482 phonemically annotated entries using an alphabet of 39 phonemes (a subset of ARPAbet). Thus, there were 1,521 (39^2) possible phoneme transition pairs (PTPs). In cases where there was more than one pronunciation listed for a word, the first (most common) was always used.

Here's my simplification of it:
- American English CHILDES, which is orthographically transcribed
- Child-directed speech (speakers over 6 years old)
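- Used the CMU Pronouncing Dictionary to map each word to ARPAbet phonemes, always taking the first listed pronunciation
- The dictionary uses 39 phonemes (a subset of ARPAbet), so there are 39^2 = 1,521 possible phoneme pairs (e.g. /pt/, /ko/)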
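The pair count and the "first pronunciation" rule can be sketched in a few lines. The phoneme list below is the 39-symbol set of the CMU Pronouncing Dictionary as I understand it (stress markers removed). `toy_dict` and `to_phonemes` are hypothetical stand-ins for illustration, not the real dictionary or the paper's code.

```python
from itertools import product

# The 39 phoneme symbols (assumed here to match the set described in the paper)
PHONEMES = """AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG
OW OY P R S SH T TH UH UW V W Y Z ZH""".split()

# Every ordered pair of phonemes, including a phoneme followed by itself
all_pairs = list(product(PHONEMES, repeat=2))
print(len(PHONEMES), len(all_pairs))

# Toy stand-in for the pronouncing dictionary: word -> list of pronunciations
toy_dict = {
    "the": [["DH", "AH"], ["DH", "IY"]],
    "dog": [["D", "AO", "G"]],
}

def to_phonemes(words, pron_dict):
    """Map words to one flat phoneme list, always using the first pronunciation."""
    phonemes = []
    for word in words:
        phonemes.extend(pron_dict[word][0])
    return phonemes

print(to_phonemes(["the", "dog"], toy_dict))
```

Note that a real corpus will contain words that are missing from the dictionary, which this sketch does not handle.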

### Downloading the data

I've always had a lot of trouble downloading the CHILDES corpus: it's slow and unreliable. It's available from the [CHILDES site](http://childes.talkbank.org/), but I've included a copy for you in the `data` folder of this directory. In that folder you'll find subdirectories labelled by language name, and within each language's subdirectory are zip files. Unzipping one of those gives you another folder with more subdirectories. Inside each of those are the raw CHILDES data files, stored in `.cha` format.
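If you'd rather not unzip everything by hand, something like the following should work. It's a minimal sketch that assumes the zip files sit somewhere under `data` and that extracting each one next to itself is what you want. The helper name `unzip_all` is made up for this example.

```python
import zipfile
from pathlib import Path

def unzip_all(data_dir="data"):
    """Extract every .zip found under data_dir into the folder that contains it."""
    extracted = []
    for zip_path in sorted(Path(data_dir).rglob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            # Unpack next to the zip file itself
            archive.extractall(zip_path.parent)
        extracted.append(zip_path)
    return extracted

# Usage (uncomment to run against your own copy of the data):
# for zip_path in unzip_all("data"):
#     print("unzipped", zip_path)
```

The function returns the list of archives it extracted, so you can check that nothing was missed.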

### After unzipping the files

In [27]:
import os
import glob
import pylangacq as pla

In [31]:
# Collect every .cha file under the corpus folder, at any depth
path = glob.glob('data/ENG-MA-MOR/**/*.cha', recursive=True)

In [None]:
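# Start the reader with the first file, then add the rest one at a time
reader = pla.read_chat(path[0])
skipped = []
for filename in path[1:]:
    try:
        reader.add(filename)
    except Exception:
        # Some files fail to parse; keep track of them instead of silently ignoring them
        skipped.append(filename)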