Skip to content

doches/childes

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 

Repository files navigation

Scripts for CHILDES

If you've ever tried to work with the CHILDES corpus before, you know that dealing with it can be...difficult. It's organized into a tremendous number of flat text files, one per session; the file format used to encode the session transcript (not to mention the metadata!) is, to put it mildly, highly irregular.

In short, it's a mess.

This repository is a central place for various CHILDES-related tools, scripts, and hacks. The first of these are a set of scripts to wrangle the .cha format (the 'highly irregular' thing above) into sanity-inducing XML, adding in POS tags and dependency parses along the way.

Tasks

Produce an XML version of CHILDES in xml/ (do this first!)

ruby tools/xmlize.rb Eng-USA/ && ruby tools/centralize_xml.rb Eng-USA/ xml/

Get a list of nouns (and filter by WordNet)

ruby tools/extract_nouns.rb xml/ | ruby tools/filter_nouns.rb > childes.nouns

Build a basic clustering using WordNet hypernyms

cat childes.nouns | ruby tools/cluster_nouns.rb > childes.wordnet_cluster

Tools

xml2document

Takes an XML file, extracts all of the utterances, and prints them to standard output for later processing. Takes an optional key to ignore (e.g. CHI)

Usage

ruby tools/xml2document.rb path/to/childes.xml <ignore_key>

cluster_wordnet_nouns

Reads a list of noun/count pairs from STDIN (like the output of extract_nouns or filter_nouns) and prints a yaml clustering based on their first-order WordNet synsets. Requires doches/rwordnet.

Usage

[cat noun.list] | ruby tools/cluster_wordnet_nouns.rb

corpusize

Takes a path to a directory containing XML (as ouput by cha2xml), finds all of the utterances therein, and writes a target corpus of all of them to standard out. Basically a wrapper around xml2corpus, and probably the sort of thing you're only interested in if you're also using doches/corncob.

Usage

ruby tools/corpusize.rb path/to/xml/root

centralize_xml

Replicates the directory structure of input in output, copying only xml files over into the new structure. Use this after xmlize, to build a version of CHILDES containing only XML. You don't have to do this (other tools will silently ignore .cha files, preferring XML), but it satisfies my housekeeping urges.

Usage

ruby tools/centralize_xml.rb path/to/CHILDES/input path/to/xml/output

build_agemap

Takes a directory containing CHILDES XML and prints a mapping (tab-delimited) of child age to filename.

Usage

ruby tools/build_agemap.rb path/to/childes

xml2corpus

Takes an XML file, extracts all of the utterances, and prints a TargetCorpus (see doches/corncob) using nouns from the POStag list as target words. Like cha2xml, you probably want to call this automatically from some other script. Takes an optional key to ignore (e.g. CHI)

Usage

ruby tools/xml2corpus.rb path/to/childes.xml <ignore_key>

filter_corpus

Filters a target corpus (from standard input) to include only lines involving target words from a list, printing the result to standard out. Used to clean up the output of xml2corpus according to the output of filter_nouns.

Usage

[cat file.target_corpus] | ruby tools/filter_corpus.rb path/to/nouns.filtered

cha2xml

Reads a CHILDES .cha file from STDIN and outputs XML file containing cleaned dialog to STDOUT

Usage

[cat thing.cha] | ruby cha2xml.rb <options>

Options

xmlize calls cha2xml with all of these options on by default

  • --braces Strip out experimenter annotations ("foo [this is a note] bar") from utterances.
  • --clean Remove words containing nonsensical (i.e. non-word) characters.
  • --minipar Run utterances through MINIPAR, including the result in the <parse> tag. Looks for ./vendor/pdemo/pdemo, with data files in ./vendor/data.
  • --tag Run utterances through a pure Ruby implementation of the Brill tagger, including the result in the <tags> tag.

agemap2documents

Reads an agemap from standard input and creates a set of nlda-friendly corpora in , binned into six-month periods

Usage

cat [agemap] | ruby tools/agemap2nldacorpora.rb path/to/output

compute_reading_levels

Compute reading levels for each document in a directory, and produce a data file ready for plotting with GnuPlot.

Usage

ruby tools/compute_reading_levels.rb path/to/dir > file.dat

xmlize

Scans a diretory for .cha files, converting any it finds into dialog XML

Usage

ruby xmlize.rb <path/to/CHILDES/root>

filter_nouns

Reads a list of noun/counts (as output by extract_nouns) from STDIN, filtering the list to include only nouns appearing in WordNet (as nouns). Requires doches/rwordnet.

Usage

[cat noun.txt] | ruby tools/filter_nouns.rb

agemap2corpora

Reads an agemap from standard input and creates a set of corpora in , one per each six-month period

Usage

cat [agemap] | ruby tools/agemap2corpora.rb path/to/output

reading_level

Compute reading levels (e.g. Coleman-Liau Index, Automated Readability Index, Flesh-Kincaid Readability Test) for a given target_corpus

Input

One sentence per line.

Usage

ruby tools/reading_level.rb path/to/file.target_corpus <options>

Output

Prints a tab-delimited list of metric names as a comment (e.g. "# coleman ari words"), followed by a tab-delimited list of computed metrics (e.g. "4.3 4.1 6.0 132.8")

Options

extract_nouns

Looks recursively in a directory for xml, scanning each file found for nouns in the POS tag list and outputting a list of all nouns found (plus counts)

Usage

ruby tools/extract_nouns.rb path/to/xml/root

About

Code for wrangling the CHILDES corpus into something useful

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages