## Cornell Movie-Dialogs Corpus

more info: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

zip file: http://www.mpi-sws.org/~cristian/data/cornell_movie_dialogs_corpus.zip



In [1]:
import gzip, io, requests, zipfile

# Load zip file

In [2]:
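zip_url = "http://www.mpi-sws.org/~cristian/data/cornell_movie_dialogs_corpus.zip"
r = requests.get(zip_url)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page as a zip

In [3]:
# r.content holds the full response body as bytes; wrap it so ZipFile can seek in memory
z = zipfile.ZipFile(io.BytesIO(r.content))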

## List files in zip

In [4]:
for name in z.namelist():
    if name.endswith(".txt"):
        print(name)

cornell movie-dialogs corpus/movie_characters_metadata.txt
cornell movie-dialogs corpus/movie_conversations.txt
cornell movie-dialogs corpus/movie_lines.txt
cornell movie-dialogs corpus/movie_titles_metadata.txt
cornell movie-dialogs corpus/raw_script_urls.txt
cornell movie-dialogs corpus/README.txt
__MACOSX/cornell movie-dialogs corpus/._README.txt
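
The listing above also picks up an entry under `__MACOSX/`, which appears to be macOS archive metadata rather than corpus data. Here is a minimal sketch of filtering such entries out, assuming only the corpus `.txt` files are wanted. The helper name `corpus_text_files` and the `paper.pdf` member are hypothetical, and the demo archive is built in memory so nothing is downloaded.

```python
import io
import zipfile


def corpus_text_files(archive):
    """Return names of .txt members, skipping the __MACOSX/ metadata folder."""
    return [
        name
        for name in archive.namelist()
        if name.endswith(".txt") and not name.startswith("__MACOSX/")
    ]


# Build a tiny stand-in archive in memory so the sketch runs without a download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("cornell movie-dialogs corpus/README.txt", "demo")
    demo.writestr("__MACOSX/cornell movie-dialogs corpus/._README.txt", "demo")
    demo.writestr("cornell movie-dialogs corpus/paper.pdf", "demo")  # hypothetical name

with zipfile.ZipFile(buffer) as demo:
    names = corpus_text_files(demo)
print(names)
```

The same function should work unchanged on the `z` object opened earlier, for example `corpus_text_files(z)`.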


## Check out the README

In [5]:
readme = z.read('cornell movie-dialogs corpus/README.txt')
print(readme.decode('iso-8859-1'))

Cornell Movie-Dialogs Corpus

Distributed together with:

"Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs"
Cristian Danescu-Niculescu-Mizil and Lillian Lee
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011.

(this paper is included in this zip file)

NOTE: If you have results to report on these corpora, please send email to cristian@cs.cornell.edu or llee@cs.cornell.edu so we can add you to our list of people using this data.  Thanks!


Contents of this README:

	A) Brief description
	B) Files description
	C) Details on the collection procedure
	D) Contact


A) Brief description:

This corpus contains a metadata-rich collection of fictional conversations extracted from raw movie scripts:

- 220,579 conversational exchanges between 10,292 pairs of movie characters
- involves 9,035 characters from 617 movies
- in total 304,713 utterances
- movie metadata included:
	- genres
	- rele

# Load data with Pandas

In [6]:
import pandas as pd
from IPython.display import HTML

## Titles

In [23]:
movies_headers = ['movieID', 'title', 'year', 'imdb_rating', 'no_imdb_votes', 'genres']
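movies_df = pd.read_csv(io.StringIO(z.read('cornell movie-dialogs corpus/movie_titles_metadata.txt').decode('iso-8859-1')),
                        sep=r' \+\+\+\$\+\+\+ ',  # fields are separated by the literal " +++$+++ "
                        engine='python',           # regex separators need the Python parser
                        names=movies_headers,
                        index_col='movieID')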

print('count:', len(movies_df))
HTML(movies_df.tail(3).to_html())

count: 617




| movieID | title | year | imdb_rating | no_imdb_votes | genres |
|---|---|---|---|---|---|
| m614 | x-men | 2000 | 7.4 | 122149 | ['action', 'sci-fi'] |
| m615 | young frankenstein | 1974 | 8.0 | 57618 | ['comedy', 'sci-fi'] |
| m616 | zulu dawn | 1979 | 6.4 | 1911 | ['action', 'adventure', 'drama', 'history', 'w... |
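
Each record in the raw file is one line whose fields are joined by ` +++$+++ `, and the `genres` field arrives as the text of a Python list rather than an actual list. Here is a minimal standard-library sketch of parsing one such line. The sample line is reassembled from the `m614` row shown above, and `parse_title_line` is a hypothetical helper name.

```python
import ast

SEPARATOR = " +++$+++ "
FIELDS = ["movieID", "title", "year", "imdb_rating", "no_imdb_votes", "genres"]


def parse_title_line(line):
    """Split one raw metadata line into a dict and decode the genres list."""
    record = dict(zip(FIELDS, line.rstrip("\n").split(SEPARATOR)))
    # genres is stored as the text of a Python list literal
    record["genres"] = ast.literal_eval(record["genres"])
    return record


sample = "m614 +++$+++ x-men +++$+++ 2000 +++$+++ 7.4 +++$+++ 122149 +++$+++ ['action', 'sci-fi']"
movie = parse_title_line(sample)
print(movie["title"], movie["genres"])
```

On the DataFrame itself, the same conversion could be applied with `movies_df['genres'].apply(ast.literal_eval)`.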


## Characters

In [22]:
characters_headers = ['characterID', 'character_name', 'movieID', 'movie_title', 'gender', 'credits_position']
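characters_df = pd.read_csv(io.StringIO(z.read('cornell movie-dialogs corpus/movie_characters_metadata.txt').decode('iso-8859-1')),
                            sep=r' \+\+\+\$\+\+\+ ',  # fields are separated by the literal " +++$+++ "
                            engine='python',           # regex separators need the Python parser
                            names=characters_headers,
                            index_col='movieID')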

print('count:', len(characters_df))
HTML(characters_df.tail(3).to_html())

count: 9035




| movieID | characterID | character_name | movie_title | gender | credits_position |
|---|---|---|---|---|---|
| m616 | u9032 | NORRIS-NEWMAN | zulu dawn | ? | ? |
| m616 | u9033 | STUART SMITH | zulu dawn | ? | ? |
| m616 | u9034 | VEREKER | zulu dawn | ? | ? |
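
The rows above use `?` where the gender or credits position is unknown. Here is a small sketch of normalising that placeholder, assuming `?` is the only missing-value marker in these columns (an assumption, not something the README excerpt confirms). The helper name `clean_character` and the second sample row are hypothetical.

```python
def clean_character(row):
    """Return a copy of `row` with "?" replaced by None and the position as int."""
    cleaned = {key: (None if value == "?" else value) for key, value in row.items()}
    if cleaned["credits_position"] is not None:
        cleaned["credits_position"] = int(cleaned["credits_position"])
    return cleaned


# The first row is taken from the output above; the second is a made-up example.
known_missing = {"characterID": "u9034", "character_name": "VEREKER",
                 "gender": "?", "credits_position": "?"}
hypothetical = {"characterID": "u0000", "character_name": "EXAMPLE",
                "gender": "f", "credits_position": "1"}

print(clean_character(known_missing))
print(clean_character(hypothetical))
```

When loading with pandas, passing `na_values='?'` to `pd.read_csv` is another way to get the same effect at parse time.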


## Lines

In [20]:
lines_headers = ['lineID', 'characterID', 'movieID', 'character_name', 'text']
lines_df = pd.read_csv(io.StringIO(z.read('cornell movie-dialogs corpus/movie_lines.txt').decode('iso-8859-1')),
                       sep=r' \+\+\+\$\+\+\+ ',  # fields are separated by the literal " +++$+++ "
                       engine='python',           # regex separators need the Python parser
                       names=lines_headers,
                       index_col='lineID',
                      )
lines_df = lines_df.fillna("")  # replace missing values (NaN) with empty strings
print('count:', len(lines_df))
HTML(lines_df.tail(3).to_html())

count: 304713

| lineID | characterID | movieID | character_name | text |
|---|---|---|---|---|
| L666369 | u9030 | m616 | DURNFORD | Your orders, Mr Vereker? |
| L666257 | u9030 | m616 | DURNFORD | Good ones, yes, Mr Vereker. Gentlemen who can ... |
| L666256 | u9034 | m616 | VEREKER | Colonel Durnford... William Vereker. I hear yo... |


## Conversations

In [21]:
conversations_headers = ['firstCharacterID', 'secondCharacterID', 'movieID', 'order']
conversations_df = pd.read_csv(io.StringIO(z.read('cornell movie-dialogs corpus/movie_conversations.txt').decode('iso-8859-1')),
                               sep=r' \+\+\+\$\+\+\+ ',  # fields are separated by the literal " +++$+++ "
                               engine='python',           # regex separators need the Python parser
                               names=conversations_headers
                              )
print('count:', len(conversations_df))
HTML(conversations_df.tail(3).to_html())

count: 83097

|  | firstCharacterID | secondCharacterID | movieID | order |
|---|---|---|---|---|
| 83094 | u9030 | u9034 | m616 | ['L666256', 'L666257'] |
| 83095 | u9030 | u9034 | m616 | ['L666369', 'L666370', 'L666371', 'L666372'] |
| 83096 | u9030 | u9034 | m616 | ['L666520', 'L666521', 'L666522'] |
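
The `order` column holds each conversation as the text of a list of line IDs, so rebuilding a dialog means decoding that text and looking each ID up among the lines. Here is a minimal standard-library sketch using the two lines of conversation 83094 as displayed earlier. The texts are copied in their truncated form, and `conversation_utterances` is a hypothetical helper name.

```python
import ast

# Truncated texts copied as displayed above; the full utterances are longer.
lines = {
    "L666256": ("VEREKER", "Colonel Durnford... William Vereker. I hear yo..."),
    "L666257": ("DURNFORD", "Good ones, yes, Mr Vereker. Gentlemen who can ..."),
}


def conversation_utterances(order, lines_by_id):
    """Decode the stringified list of line IDs and look each one up in order."""
    return [lines_by_id[line_id] for line_id in ast.literal_eval(order)]


dialog = conversation_utterances("['L666256', 'L666257']", lines)
for speaker, text in dialog:
    print(f"{speaker}: {text}")
```

With the DataFrames loaded above, the lookup step could instead be `lines_df.loc[ids, 'text']`, where `ids` is the decoded list.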


# Save out data

In [54]:
import gzip, pickle

def save_file(name, df):
    """Pickle `df` into a gzip-compressed file at path `name`."""
    with gzip.open(name, 'wb') as f:
        pickle.dump(df, f)

save_file('df_movies.pkl.gz', movies_df)
save_file('df_characters.pkl.gz', characters_df)
save_file('df_lines.pkl.gz', lines_df)
save_file('df_conversations.pkl.gz', conversations_df)
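
To read the saved files back, the inverse of `save_file` is a gzip open followed by `pickle.load`. Here is a minimal sketch of that round trip. The name `load_file` is hypothetical, and a plain dict in a temporary directory stands in for a DataFrame so the example runs without the corpus.

```python
import gzip
import os
import pickle
import tempfile


def load_file(name):
    """Read back an object written with gzip + pickle."""
    with gzip.open(name, "rb") as f:
        return pickle.load(f)


# Round trip with a plain dict standing in for a DataFrame.
payload = {"m614": "x-men", "m615": "young frankenstein"}
path = os.path.join(tempfile.mkdtemp(), "demo.pkl.gz")
with gzip.open(path, "wb") as f:
    pickle.dump(payload, f)

restored = load_file(path)
print(restored == payload)
```

For the DataFrames saved above, `load_file('df_movies.pkl.gz')` should return `movies_df`, and `pd.read_pickle('df_movies.pkl.gz')` is an alternative since pandas infers gzip compression from the `.gz` extension. Pickle files can execute arbitrary code when loaded, so only open files from a trusted source.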