Switchboard Dialog Act Corpus with Penn Treebank links
The Switchboard Dialog Act Corpus (SwDA) extends the Switchboard-1 Telephone Speech Corpus, Release 2, with turn/utterance-level dialog-act tags. The tags summarize syntactic, semantic, and pragmatic information about the associated turn. The SwDA project was undertaken at the University of Colorado at Boulder in the late 1990s.
The SwDA is not inherently linked to the Penn Treebank 3 parses of Switchboard, and it is far from straightforward to align the two resources. In addition, the SwDA is not distributed with the Switchboard's tables of metadata about the conversations and their participants.
This project includes a version of the corpus that pools all of this information to the best of my ability. In addition, it includes Python classes that should make it easy to work with this merged resource.
This project was originally part of my LSA Linguistic Institute 2011 course Computational Pragmatics. Additional resources from that course:
- Corpus overview
- Experiment: Question acts and interrogative clauses in the SwDA
- Analysis: Clustering words by tags in the SwDA
swda.py: the module for processing this corpus distribution
swda.zip: the corpus; needs to be unzipped
swda_functions.py: some simple examples aggregating information with
metadata_processor.py: auxiliary processing file used to create
Transcript objects model the individual files in the corpus. A Transcript object is built from a transcript filename and the corpus metadata file:
    >>> from swda import Transcript
    >>> trans = Transcript('swda/sw00utt/sw_0001_4325.utt.csv', 'swda/swda-metadata.csv')
    >>> trans.topic_description
    'CHILD CARE'
    >>> trans.prompt
    'FIND OUT WHAT CRITERIA THE OTHER CALLER WOULD USE IN SELECTING CHILD \
    CARE SERVICES FOR A PRESCHOOLER. IS IT EASY OR DIFFICULT TO FIND SUCH CARE?'
    >>> trans.talk_day
    datetime.datetime(1992, 3, 23, 0, 0)
    >>> trans.talk_day.year
    1992
    >>> trans.talk_day.month
    3
    >>> trans.from_caller
    1632
    >>> trans.from_caller_sex
    'FEMALE'
Transcript instances have many attributes:
    >>> for a in sorted([a for a in dir(trans) if not a.startswith('_')]): print(a)
    conversation_no
    from_caller
    from_caller_birth_year
    from_caller_dialect_area
    from_caller_education
    from_caller_sex
    header
    length
    metadata
    prompt
    ptd_basename
    swda_filename
    talk_day
    to_caller
    to_caller_birth_year
    to_caller_dialect_area
    to_caller_education
    to_caller_sex
    topic_description
    utterances
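As a small illustration of combining these attributes, the sketch below estimates the age of the calling party in the year of the call from talk_day and from_caller_birth_year. It assumes that the birth-year attribute is stored as an integer, which is not confirmed above. The helper name caller_age_at_call, the stand-in class FakeTranscript, and the birth year 1960 are all hypothetical, introduced only so that the example runs without the corpus files.

```python
import datetime
from collections import namedtuple

# Hypothetical stand-in for a Transcript object; it carries only the two
# attributes the helper reads, so the sketch runs without the corpus.
FakeTranscript = namedtuple("FakeTranscript", ["talk_day", "from_caller_birth_year"])


def caller_age_at_call(trans):
    """Approximate age of the 'from' caller in the year of the call.

    Assumes trans.talk_day is a datetime and that
    trans.from_caller_birth_year is an int (an assumption about how the
    metadata is stored).
    """
    return trans.talk_day.year - trans.from_caller_birth_year


# Made-up example values: the call date matches the transcript shown
# above, but the birth year is invented for illustration.
demo = FakeTranscript(datetime.datetime(1992, 3, 23), 1960)
print(caller_age_at_call(demo))  # 32
```

With the real data, the same function could presumably be applied directly to a Transcript instance such as trans.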
The utterances attribute holds the transcript's utterances as a sequence of objects. These have many attributes and methods. Some examples:
    >>> utt = trans.utterances[19]
    >>> utt.caller
    'B'
    >>> utt.act_tag
    'sv'
    >>> utt.text
    '[ I guess + --'
    >>> utt.pos
    '[ I/PRP ] guess/VBP --/:'
    >>> utt.pos_words()
    ['I', 'guess', '--']
    >>> utt.pos_lemmas(wn_lemmatize=True)
    [('I', 'prp'), ('guess', 'v'), ('--', ':')]
    >>> len(utt.trees)
    1
    >>> utt.trees[0].pprint()
    '(S (EDITED (RM (-DFL- \\[)) (S (NP-SBJ (PRP I)) (VP-UNF (VBP guess))) (IP (-DFL- \\+))) (NP-SBJ (PRP I)) (VP (VBP guess) (RS (-DFL- \\])) (SBAR (-NONE- 0) (S (NP-SBJ (PRP we)) (VP (MD can) (VP (VB start)))))) (. .))'
Because the trees often properly contain the utterance, they cannot be used to gather word- or phrase-level statistics unless care is taken to restrict attention to the subtrees, or fragments thereof, that represent the utterance itself.
Not all utterances have trees; only a subset of the Switchboard is fully parsed. Thus, of the 221,616 utterances in the SwDA, 118,218 (53%) have at least one tree.
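The coverage figure can be recomputed with a short loop over utterances. The following is a minimal sketch, not code from swda.py: the function name tree_coverage and the stand-in class FakeUtterance are hypothetical, and the only attribute it relies on is trees, as shown in the examples above. It is written for Python 3 and accepts any iterable of utterances, such as trans.utterances.

```python
from collections import namedtuple

# Hypothetical stand-in for an utterance object; only `trees` is used.
FakeUtterance = namedtuple("FakeUtterance", ["trees"])


def tree_coverage(utterances):
    """Return (total, with_trees, fraction) for an iterable of utterances.

    An utterance counts as covered if its `trees` list is non-empty.
    """
    total = 0
    with_trees = 0
    for utt in utterances:
        total += 1
        if len(utt.trees) > 0:
            with_trees += 1
    fraction = with_trees / total if total else 0.0
    return total, with_trees, fraction


# Made-up demo data: two utterances with trees, one without.
demo = [
    FakeUtterance(["(S ...)"]),
    FakeUtterance([]),
    FakeUtterance(["(S ...)", "(S ...)"]),
]
total, with_trees, fraction = tree_coverage(demo)
print(total, with_trees)

# Sanity check on the figures quoted above: 118,218 of 221,616.
print(round(100 * 118218 / 221616))  # 53
```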
The main interface provided by
swda.py is the
CorpusReader, which allows you to
iterate through the entire corpus, gathering information as you go.
CorpusReader objects are built from just the root of the directory containing your csv files. (It assumes that swda-metadata.csv is in the first directory below that root.)
    >>> from swda import CorpusReader
    >>> corpus = CorpusReader('swda')
The two central methods for CorpusReader objects are iter_transcripts and iter_utterances. The method iter_utterances is basically an abbreviation of the following nested loop:
    for trans in corpus.iter_transcripts():
        for utt in trans.utterances:
            yield utt
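A typical use of this kind of iteration is to tally something over every utterance. The sketch below counts dialog-act tags. It is a hedged illustration rather than code from this project: count_act_tags and FakeUtterance are hypothetical names, the demo tags are made-up data, and the function relies only on the act_tag attribute shown earlier. It takes any iterable of utterances, so it runs here on stand-in objects.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for an utterance object, so the sketch runs
# without swda.py or the corpus files.
FakeUtterance = namedtuple("FakeUtterance", ["caller", "act_tag"])


def count_act_tags(utterances):
    """Count how often each dialog-act tag occurs in an iterable of utterances."""
    return Counter(utt.act_tag for utt in utterances)


# Made-up demo data, not taken from the corpus.
demo = [
    FakeUtterance("A", "sd"),
    FakeUtterance("B", "sv"),
    FakeUtterance("B", "sd"),
]
counts = count_act_tags(demo)
for tag, n in counts.most_common():
    print(tag, n)

# With the real corpus (requires swda.py and the unzipped data), the same
# function could be fed the utterance iterator directly:
# from swda import CorpusReader
# counts = count_act_tags(CorpusReader('swda').iter_utterances())
```

Because the function consumes its input lazily, it should work on the full corpus without holding all utterances in memory at once.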
For some illustrations, see
There's a much fuller overview here: http://compprag.christopherpotts.net/swda.html