# Dataset acquisition
The data comes from the Cornell Movie-Dialogs Corpus: [https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)
<pre><code>
@InProceedings{Danescu-Niculescu-Mizil+Lee:11a,
  author={Cristian Danescu-Niculescu-Mizil and Lillian Lee},
  title={Chameleons in imagined conversations: 
  A new approach to understanding coordination of linguistic style in dialogs.},
  booktitle={Proceedings of the 
        Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011},
  year={2011}
}
</code></pre>
The description below is taken from the corpus `README.txt`.

## Brief description:

This corpus contains a metadata-rich collection of fictional conversations extracted from raw movie scripts:

- 220,579 conversational exchanges between 10,292 pairs of movie characters
- involves 9,035 characters from 617 movies
- in total 304,713 utterances
- movie metadata included:
	- genres
	- release year
	- IMDB rating
	- number of IMDB votes
- character metadata included:
	- gender (for 3,774 characters)
	- position on movie credits (3,321 characters)


## Files description:

In all files the field separator is `" +++$+++ "`
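
As a minimal sketch of what this separator means in practice, a raw line can be split with plain `str.split`. The sample line is illustrative (modelled on the `characters.head()` output shown in the Characters section), and `split_fields` is a hypothetical helper name, not part of the corpus tooling.

```python
SEP = " +++$+++ "

def split_fields(line):
    """Split one raw corpus line into its fields (hypothetical helper)."""
    return line.rstrip("\n").split(SEP)

# Illustrative line in the layout of movie_characters_metadata.txt.
sample = "u0 +++$+++ BIANCA +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ f +++$+++ 4\n"
fields = split_fields(sample)
print(fields)
```

Because the separator contains `+` and `$`, it has to be escaped when it is used as a regular expression, for example as the `sep` argument of `pd.read_csv`.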

### movie_titles_metadata.txt
contains information about each movie title
- fields:
    - movieID
    - movie title
    - movie year
    - IMDB rating
    - no. IMDB votes
    - genres in the format `['genre1','genre2',…,'genreN']`

### movie_characters_metadata.txt
contains information about each movie character
- fields:
    - characterID
    - character name
    - movieID
    - movie title
    - gender ("?" for unlabeled cases)
    - position in credits ("?" for unlabeled cases) 

### movie_lines.txt
contains the actual text of each utterance
- fields:
    - lineID
    - characterID (who uttered this phrase)
    - movieID
    - character name
    - text of the utterance

### movie_conversations.txt
the structure of the conversations
- fields
    - characterID of the first character involved in the conversation
    - characterID of the second character involved in the conversation
    - movieID of the movie in which the conversation occurred
    - list of the utterances that make up the conversation, in chronological
        order: `['lineID1','lineID2',…,'lineIDN']`;
        this has to be matched with movie_lines.txt to reconstruct the actual content
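
The matching step described above can be sketched as a dictionary lookup. This is only an illustration under assumptions: `lines_by_id` is a toy stand-in for the contents of movie_lines.txt with placeholder texts, and `reconstruct` is a hypothetical helper name.

```python
import ast

# Toy stand-in for movie_lines.txt: lineID -> (character name, utterance text).
# The texts are placeholders, not real corpus content.
lines_by_id = {
    "L198": ("BIANCA", "placeholder utterance 1"),
    "L199": ("CAMERON", "placeholder utterance 2"),
}

def reconstruct(line_ids_field, lines_by_id):
    """Turn the last field of a conversation row into a list of utterances."""
    line_ids = ast.literal_eval(line_ids_field)  # "['L198', 'L199']" -> list of str
    return [lines_by_id[line_id] for line_id in line_ids]

conversation = reconstruct("['L198', 'L199']", lines_by_id)
for name, text in conversation:
    print(f"{name}: {text}")
```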

### raw_script_urls.txt
the urls from which the raw sources were retrieved

In [1]:
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook

In [2]:
# Raw string: the separator is used as a regex, so + and $ must be escaped.
sep = r' \+\+\+\$\+\+\+ '

In [3]:
tolist = lambda x: [w.strip("'") for w in x.lstrip('[]').rstrip(']').split(', ')]
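
To make the behaviour of this helper concrete, the sketch below repeats the same logic and compares it with `ast.literal_eval` from the standard library. `parse_list` is a hypothetical alternative, not something the notebook uses; the main difference shows up if a field is an empty list `[]`.

```python
import ast

# Same logic as the helper above, repeated so the example is self-contained.
tolist = lambda x: [w.strip("'") for w in x.lstrip('[]').rstrip(']').split(', ')]

def parse_list(field):
    """Hypothetical alternative: parse the list literal directly."""
    return ast.literal_eval(field)

print(tolist("['comedy', 'romance']"))      # two genre strings
print(tolist("[]"))                         # a list holding one empty string
print(parse_list("[]"))                     # an empty list
```

If empty lists matter downstream (for example when counting genres), the string-based helper would need an extra filter for empty strings.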

## Movies

In [4]:
movies_file = 'data/movie_titles_metadata.txt'
columns = ['title', 'year', 'rating', 'votes', 'genres']
movies_raw = pd.read_csv(movies_file, sep=sep, engine='python', header=None, 
                     index_col=0, names=columns)
movies = movies_raw[columns[:-1]]
genres = [(i, w) for i, row in movies_raw.iterrows() for w in tolist(row.genres)]
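
The `genres` list holds one `(movieID, genre)` pair per genre of each movie. As a small sketch of how such pairs might be used, the example below counts genres with `collections.Counter`; the pairs are toy values chosen for illustration, not taken from the corpus.

```python
from collections import Counter

# Toy (movieID, genre) pairs in the same shape as `genres`; values are illustrative.
genres = [("m0", "comedy"), ("m0", "romance"), ("m1", "adventure"), ("m2", "comedy")]

# Count how many movies carry each genre label.
genre_counts = Counter(genre for _, genre in genres)
print(genre_counts.most_common(1))
```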

## Characters

In [5]:
characters_file = 'data/movie_characters_metadata.txt'
columns = ['name', 'movie', 'title', 'gender', 'pos']
characters = pd.read_csv(characters_file, sep=sep, engine='python', 
                         header=None, names=columns, index_col=0)

In [6]:
characters.head()

| | name | movie | title | gender | pos |
|---|---|---|---|---|---|
| u0 | BIANCA | m0 | 10 things i hate about you | f | 4 |
| u1 | BRUCE | m0 | 10 things i hate about you | ? | ? |
| u2 | CAMERON | m0 | 10 things i hate about you | m | 3 |
| u3 | CHASTITY | m0 | 10 things i hate about you | ? | ? |
| u4 | JOEY | m0 | 10 things i hate about you | m | 6 |


In [7]:
characters.shape

(9035, 5)

## Movie lines

In [8]:
lines_file = 'data/movie_lines.txt'
columns = ['character', 'movie', 'name', 'text']
movie_lines = pd.read_csv(lines_file, sep=sep, names=columns, header=None, index_col=0, engine='python')

In [9]:
movie_lines.head()

| | character | movie | name | text |
|---|---|---|---|---|
| L1045 | u0 | m0 | BIANCA | They do not! |
| L1044 | u2 | m0 | CAMERON | They do to! |
| L985 | u0 | m0 | BIANCA | I hope so. |
| L984 | u2 | m0 | CAMERON | She okay? |
| L925 | u0 | m0 | BIANCA | Let's go. |


In [10]:
movie_lines.shape

(304713, 4)

## Conversations

In [11]:
conversations_file = 'data/movie_conversations.txt'
columns = ['character_a', 'character_b', 'movie', 'lines']
conversations = pd.read_csv(conversations_file, sep=sep, names=columns, 
                             header=None, engine='python')

In [12]:
conversations.head()

| | character_a | character_b | movie | lines |
|---|---|---|---|---|
| 0 | u0 | u2 | m0 | `['L194', 'L195', 'L196', 'L197']` |
| 1 | u0 | u2 | m0 | `['L198', 'L199']` |
| 2 | u0 | u2 | m0 | `['L200', 'L201', 'L202', 'L203']` |
| 3 | u0 | u2 | m0 | `['L204', 'L205', 'L206']` |
| 4 | u0 | u2 | m0 | `['L207', 'L208']` |


In [13]:
conversations.shape

(83097, 4)

## To MongoDB

In [14]:
import pymongo

In [15]:
db = pymongo.MongoClient()['movie-dialogs']

In [23]:
movie_records = {}
for i, row in movies_raw.iterrows():
    record = dict(row)
    record['genres'] = tolist(row.genres)
    record['id'] = i
    try:
        record['year'] = int(row.year)
    except ValueError:
        del(record['year'])
    movie_records[i] = record

In [24]:
movie_collection = db['movies']
movie_collection.insert_many(list(movie_records.values()))

<pymongo.results.InsertManyResult at 0x11f197780>

In [25]:
character_records = {}
for i, row in characters.iterrows():
    c = dict(row)
    c['movie'] = movie_records[row.movie]
    del(c['title'])
    try:
        del(c['movie']['_id'])
    except KeyError:
        pass
    try:
        c['pos'] = int(c['pos'])
    except ValueError:
        del(c['pos'])
    character_records[i] = c
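
The `del(c['movie']['_id'])` step is there because `insert_many` adds an `_id` key to each dict it inserts, and every character record points at the same movie dict held in `movie_records`. The sketch below imitates that side effect without a database (the `_id` value is a placeholder) and shows, as one possible alternative, that embedding a shallow copy would keep the embedded movie unaffected.

```python
movie_records = {"m0": {"id": "m0", "title": "10 things i hate about you"}}

# Embedding by reference: the character shares the dict stored in movie_records.
shared = {"name": "BIANCA", "movie": movie_records["m0"]}
# Embedding a shallow copy: later changes to movie_records do not show up here.
copied = {"name": "CAMERON", "movie": dict(movie_records["m0"])}

# Imitate what inserting the movie document does: an `_id` key is added in place.
movie_records["m0"]["_id"] = "placeholder-object-id"

print("_id" in shared["movie"])
print("_id" in copied["movie"])
```

Note that deleting `_id` through the shared reference also removes it from the entry in `movie_records`, since both names refer to the same object.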

In [26]:
characters_collection = db['characters']
characters_collection.insert_many(list(character_records.values()))

<pymongo.results.InsertManyResult at 0x11f6b9aa0>

In [36]:
line_records = {}
for i, row in movie_lines.iterrows():
    c = dict(row)
    cid = c['character']
    c['character'] = character_records[cid]
    c['character']['id'] = cid
    del(c['movie'])
    del(c['name'])
    try:
        del(c['character']['_id'])
    except KeyError:
        pass
    c['id'] = i
    line_records[i] = c

In [38]:
line_collection = db['lines']
line_collection.insert_many(list(line_records.values()))

<pymongo.results.InsertManyResult at 0x12eb9fa00>

In [60]:
conversation_records = {}
for i, row in conversations.iterrows():
    c = dict(row)
    c['character_a'] = dict([(k, v) for k, v in 
                             character_records[c['character_a']].items() if k not in ['movie']])
    c['character_b'] = dict([(k, v) for k, v in 
                             character_records[c['character_b']].items() if k not in ['movie']])
    c['movie'] = movie_records[c['movie']]
    lines_raw = [line_records[l] for l in tolist(c['lines'])]
    lines = [{'line': x['id'], 'text': x['text'], 
              'character': x['character']['id'], 
              'gender': x['character']['gender']} for x in lines_raw]
    c['lines'] = lines
    c['len'] = len(lines)
    conversation_records[i] = c
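
Each conversation record built this way embeds its movie under `movie` and stores the number of utterances under `len`. Once such documents are stored in a collection, they could be queried with a filter like the one sketched below. `conversation_filter` is a hypothetical helper, and the commented-out lines assume a running MongoDB instance and a `conversations` collection in the `db` handle created earlier.

```python
def conversation_filter(movie_id, min_len):
    """Build a MongoDB filter: conversations of one movie with at least min_len lines."""
    # 'movie.id' uses dot notation to reach the `id` field of the embedded movie.
    return {"movie.id": movie_id, "len": {"$gte": min_len}}

query = conversation_filter("m0", 4)
print(query)

# With a running MongoDB (assumption), the filter could be used like this:
# for doc in db['conversations'].find(query).limit(3):
#     print(doc['len'], [line['text'] for line in doc['lines']])
```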

In [61]:
conversations_collection = db['conversations']
conversations_collection.insert_many(list(conversation_records.values()))

<pymongo.results.InsertManyResult at 0x17c261f50>