I downloaded all of the Seinfeld scripts from seinology.com and wrote some code to extract their contents and load them into a SQLite database.
Feel free to message me if you want the DB file.
- Download the scripts: `python download.py scripts`
- Fix any issues in the data (see CHANGES MADE TO DATA)
- Build the database: `./run.sh seinfeld.db scripts`
    sqlite> .schema episode
    CREATE TABLE episode(
        id INTEGER PRIMARY KEY,
        season_number INTEGER NOT NULL,
        episode_number INTEGER NOT NULL,
        title TEXT,
        the_date TEXT,
        writer TEXT,
        director TEXT,
        UNIQUE(season_number, episode_number)
    );

    sqlite> select * from episode limit 3;
    id  season_number  episode_number  title                the_date          writer                       director
    1   1              0               Good News, Bad News  July 5, 1989      Larry David, Jerry Seinfeld  Art Wolff
    2   2              5               The Apartment        April 4, 1991     Peter Mehlman                Tom Cherones
    3   6              16              The Beard            February 9, 1995  Carol Leifer                 Andy Ackerman
    CREATE TABLE utterance(
        id INTEGER PRIMARY KEY,
        episode_id INTEGER NOT NULL,
        utterance_number INTEGER NOT NULL,
        speaker TEXT NOT NULL,
        text TEXT NOT NULL,
        UNIQUE(episode_id, utterance_number),
        FOREIGN KEY(episode_id) REFERENCES episode(id)
    );

    sqlite> select * from utterance limit 3;
    id  episode_id  utterance_number  speaker  text
    1   1           1                 JERRY    (pointing at George's shirt) See, to me, that button is in the worst possible spot. The second button literally makes or breaks the shirt, look at it. It's too high! It's in no-man's-land. You look like you live with your mother.
    2   1           2                 GEORGE   Are you through?
    3   1           3                 JERRY    You do of course try on, when you buy?
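To show how the two tables fit together, here is a minimal sketch in Python that recreates the schema in an in-memory database and pulls one episode's lines in spoken order. The helper names `build_demo_db` and `lines_for_episode` are made up for this example and are not part of the repo; the sample rows are copied from the output above.

```python
import sqlite3

# Schema as shown above.
SCHEMA = """
CREATE TABLE episode(
    id INTEGER PRIMARY KEY,
    season_number INTEGER NOT NULL,
    episode_number INTEGER NOT NULL,
    title TEXT,
    the_date TEXT,
    writer TEXT,
    director TEXT,
    UNIQUE(season_number, episode_number)
);
CREATE TABLE utterance(
    id INTEGER PRIMARY KEY,
    episode_id INTEGER NOT NULL,
    utterance_number INTEGER NOT NULL,
    speaker TEXT NOT NULL,
    text TEXT NOT NULL,
    UNIQUE(episode_id, utterance_number),
    FOREIGN KEY(episode_id) REFERENCES episode(id)
);
"""


def build_demo_db():
    """Hypothetical helper: in-memory DB holding a few rows from the README."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO episode VALUES (?, ?, ?, ?, ?, ?, ?)",
        (1, 1, 0, "Good News, Bad News", "July 5, 1989",
         "Larry David, Jerry Seinfeld", "Art Wolff"),
    )
    conn.executemany(
        "INSERT INTO utterance VALUES (?, ?, ?, ?, ?)",
        [
            (2, 1, 2, "GEORGE", "Are you through?"),
            (3, 1, 3, "JERRY", "You do of course try on, when you buy?"),
        ],
    )
    return conn


def lines_for_episode(conn, season, episode):
    """Return (speaker, text) pairs for one episode, ordered by utterance_number."""
    query = """
        SELECT u.speaker, u.text
        FROM utterance AS u
        JOIN episode AS e ON e.id = u.episode_id
        WHERE e.season_number = ? AND e.episode_number = ?
        ORDER BY u.utterance_number
    """
    return conn.execute(query, (season, episode)).fetchall()
```

Against the real file you would pass `sqlite3.connect("seinfeld.db")` instead of the demo connection.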
- Script transcribers sometimes describe how a line is spoken or what's going on in a scene as parentheticals preceding lines. I'd like to remove these and I think it may be as easy as looking for a pair of parentheses at the beginning of a line.
- A lot of the character names are used inconsistently.
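The parenthetical idea from the first item could look roughly like the sketch below. This is not in the repo yet; the function name `strip_leading_parenthetical` is made up, and it assumes parentheticals are never nested and only need removing at the start of a line.

```python
import re

# One or more "(...)" groups at the very start of the text, along with the
# whitespace around them. Nested parentheses are deliberately not handled.
LEADING_PARENS = re.compile(r"^\s*(?:\([^()]*\)\s*)+")


def strip_leading_parenthetical(text):
    """Drop stage directions such as "(pointing at George's shirt)" from the front of a line."""
    return LEADING_PARENS.sub("", text)
```

Parentheticals in the middle of a line are left alone, and a line that is nothing but a parenthetical comes back as an empty string, so those would need separate handling before going into the `text` column.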
CHANGES MADE TO DATA
In the file '01.shtml' look for "pc: 101, season 1, episode 1 (Pilot)" and change "episode 1" to "episode 0".
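I made this change by hand, but it could also be scripted. A minimal sketch, where `fix_pilot_episode_number` is a made-up helper that works on the page contents as a string:

```python
# Marker text quoted from the note above, before and after the fix.
OLD = "pc: 101, season 1, episode 1 (Pilot)"
NEW = "pc: 101, season 1, episode 0 (Pilot)"


def fix_pilot_episode_number(html):
    """Return the page text with the pilot relabelled as episode 0.

    Raises ValueError if the marker is missing, for example when the
    file has already been fixed.
    """
    if OLD not in html:
        raise ValueError("pilot marker not found")
    # Replace only the first occurrence so nothing else is touched.
    return html.replace(OLD, NEW, 1)
```

To apply it, read '01.shtml' from wherever the download step saved it, pass the contents through the function, and write the result back. The exact path and file encoding are not covered here and would need checking.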
####Characters with the most lines
    SELECT speaker, count(*) as count
    FROM utterance
    GROUP BY speaker
    ORDER BY count DESC
    LIMIT 20;

    speaker       count
    JERRY         14645
    GEORGE        9613
    ELAINE        7967
    KRAMER        6656
    NEWMAN        625
    MORTY         502
    HELEN         470
    FRANK         429
    SUSAN         382
    ESTELLE       273
    MAN           207
    PETERMAN      199
    WOMAN         199
    PUDDY         163
    LEO           145
    JACK          124
    STEINBRENNER  122
    MICKEY        118
    BANIA         102
    ROSS          102