# Cornell

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

[#6](https://github.com/at15/snowbot/issues/6) explore the [Cornell Movie Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). Kaggle seems to have a (non-official) [cleaner version](https://www.kaggle.com/Cornell-University/movie-dialog-corpus) with a detailed explanation. [Currie32/Chatbot-from-Movie-Dialogue](https://github.com/Currie32/Chatbot-from-Movie-Dialogue) and [stanford cs20si/chatbot](https://github.com/chiphuyen/stanford-tensorflow-tutorials/blob/master/assignments/chatbot/data.py) show how to preprocess this data.

In [2]:
ls cornell

chameleons.pdf                 movie_lines.txt            README.txt
movie_characters_metadata.txt  movie_titles_metadata.txt
movie_conversations.txt        raw_script_urls.txt


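- The separator is '+++$+++', which is longer than one character, so pd.read_csv would interpret it as a regular expression and fall back to the Python parsing engine; we parse the files by hand instead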

In [12]:
# NOTE: it seems I can't use pd.read_csv directly. From the pandas docs:
# "In addition, separators longer than 1 character and different from '\s+' will be
# interpreted as regular expressions and will also force the use of the Python parsing engine."
# movie_titles_metadata = pd.read_csv('cornell/movie_titles_metadata.txt', sep='+++$+++')
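
Since the separator is a fixed literal string, a plain `str.split` is enough to break a record into fields. The snippet below is a minimal sketch of that idea; the helper name `split_fields` and the constant `SEPARATOR` are made up for illustration, and the sample line is copied from `movie_lines.txt`.

```python
SEPARATOR = '+++$+++'  # literal field separator used by the corpus files


def split_fields(line, sep=SEPARATOR):
    """Split one raw corpus line on the literal separator and trim whitespace."""
    return [cell.strip() for cell in line.split(sep)]


sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n"
fields = split_fields(sample)
print(fields)
```

Because `str.split` treats the separator literally, there is no need to escape the `+` and `$` characters as a regular expression would require.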

In [16]:
def line_count(file):
    count = 0
    # NOTE: the file has encoding problems, so we ignore decoding errors
    with open(file, errors='ignore') as f:
        for line in f:
            count += 1
    return count

files = ['movie_titles_metadata.txt', 'movie_characters_metadata.txt', 'movie_lines.txt', 'movie_conversations.txt']
for f in files:
    print(f, line_count('cornell/' + f))

movie_titles_metadata.txt 617
movie_characters_metadata.txt 9035
movie_lines.txt 304713
movie_conversations.txt 83097
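
A note on `errors='ignore'`: it silently drops any bytes that cannot be decoded, so some characters may vanish from the text. The sketch below shows the effect on a hand-made byte string (not taken from the corpus). Whether the corpus files are actually ISO-8859-1 encoded is an assumption here, not something this notebook verifies.

```python
# A made-up byte string containing one byte (0xE9) that is invalid in UTF-8
# but is the letter 'é' in ISO-8859-1 (Latin-1).
raw = b'caf\xe9 au lait'

# errors='ignore' drops the undecodable byte entirely.
utf8_ignored = raw.decode('utf-8', errors='ignore')

# Decoding as ISO-8859-1 maps every byte to a character, so nothing is lost.
latin1 = raw.decode('iso-8859-1')

print(repr(utf8_ignored))
print(repr(latin1))
```

If the files do turn out to be ISO-8859-1, passing `encoding='iso-8859-1'` to `open` would keep those characters instead of discarding them.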


Each conversation record is: user1, user2, movie, lines, where the lines alternate between the two speakers, e.g. [u1, u2, u1, u2]. NOTE: the number of lines may not be even; some people just [ignore the extra line for one-turn QA](https://github.com/suriyadeepan/datasets/blob/master/seq2seq/cornell_movie_corpus/scripts/prepare_data.py)

In [5]:
def file_head(file, lines=10):
    with open(file, errors='ignore') as f:
        for i in range(lines):
            print(i, f.readline())

In [15]:
print('No.c1        c2        movie       lines')
file_head('cornell/movie_conversations.txt')

No.c1        c2        movie       lines
0 u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']

1 u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']

2 u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']

3 u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']

4 u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']

5 u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L271', 'L272', 'L273', 'L274', 'L275']

6 u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L276', 'L277']

7 u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L280', 'L281']

8 u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L363', 'L364']

9 u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L365', 'L366']



In [19]:
print('No.line    characterID    movie   character name   text(utterance)')
file_head('cornell/movie_lines.txt')

No.line    characterID    movie   character name   text(utterance)
0 L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!

1 L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!

2 L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.

3 L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?

4 L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.

5 L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow

6 L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.

7 L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No

8 L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I'm kidding.  You know how sometimes you just become this "persona"?  And you don't know how to quit?

9 L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?



Now we can convert those files into a more readable in-memory structure and save it to CSV

- conversations: uid, uid, mid, line ids
- lines: number, uid, mid, text

In [44]:
# process conversations
conversations = []
conversation_lines = []
with open('cornell/movie_conversations.txt') as f:
    for line in f:
        cells = line.split('+++$+++')
        cells = [c.strip() for c in cells]
        conv = (int(cells[0][1:]), int(cells[1][1:]), int(cells[2][1:]), cells[3])
        conversations.append(conv)
        conversation_lines.append([c.strip() for c in cells[3].replace('\'', '')[1:-1].split(',')])
conv_df = pd.DataFrame.from_records(conversations, 
                                    columns=['c1', 'c2', 'mid', 'lines'])
conv_df.head()

Unnamed: 0,c1,c2,mid,lines
0,0,2,0,"['L194', 'L195', 'L196', 'L197']"
1,0,2,0,"['L198', 'L199']"
2,0,2,0,"['L200', 'L201', 'L202', 'L203']"
3,0,2,0,"['L204', 'L205', 'L206']"
4,0,2,0,"['L207', 'L208']"
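
The `lines` column is still a string that looks like a Python list. The cell above turns it into a list by stripping quotes and brackets by hand; an alternative sketch, assuming every value is a valid Python list literal as in the rows shown, is to use `ast.literal_eval` from the standard library. The helper name `parse_line_ids` is hypothetical.

```python
import ast


def parse_line_ids(cell):
    """Parse a string such as "['L194', 'L195']" into a list of line ids."""
    ids = ast.literal_eval(cell)  # evaluates literals only, never arbitrary code
    if not isinstance(ids, list):
        raise ValueError('expected a list literal, got: %r' % (cell,))
    return ids


line_ids = parse_line_ids("['L194', 'L195', 'L196', 'L197']")
print(line_ids)
```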


In [45]:
conv_df.c1 = conv_df.c1.astype('category')
conv_df.c2 = conv_df.c2.astype('category')
conv_df.mid = conv_df.mid.astype('category')
conv_df.describe(include='all')

Unnamed: 0,c1,c2,mid,lines
count,83097,83097,83097,83097
unique,5420,5608,617,83097
top,4331,1475,289,"['L127507', 'L127508', 'L127509', 'L127510', '..."
freq,193,187,338,1


We split each conversation into multiple QA pairs built from consecutive lines: `['L204', 'L205', 'L206']` becomes two pairs, `(L204, L205)` and `(L205, L206)`. We do not simply take the odd lines as questions and the even lines as responses like [suriyadeepan/datasets does](https://github.com/suriyadeepan/datasets/blob/master/seq2seq/cornell_movie_corpus/scripts/prepare_data.py#L48-L60); that approach yields only about half as many QA pairs.
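
To make the difference between the two pairing strategies concrete, here is a small sketch. Both function names are made up for illustration, and `alternating_pairs` is only a paraphrase of the odd/even idea, not a copy of the linked script.

```python
def consecutive_pairs(line_ids):
    """Every line is paired with the line that follows it."""
    return [(line_ids[i], line_ids[i + 1]) for i in range(len(line_ids) - 1)]


def alternating_pairs(line_ids):
    """Lines at even positions are questions, the next line is the answer;
    a trailing unpaired line is dropped."""
    return [(line_ids[i], line_ids[i + 1]) for i in range(0, len(line_ids) - 1, 2)]


conv = ['L204', 'L205', 'L206']
print(consecutive_pairs(conv))
print(alternating_pairs(conv))
```

A conversation with n lines gives n - 1 pairs with the consecutive strategy and n // 2 pairs with the alternating one.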

In [46]:
# turn lines into a dictionary: key is the line id, e.g. L204; value is the utterance (text), e.g. How you doing
line2text = {}
with open('cornell/movie_lines.txt', errors='ignore') as f:
    for line in f:
        cells = line.split('+++$+++')
        cells = [c.strip() for c in cells]
        line2text[cells[0]] = cells[4]
len(line2text)

304713

In [49]:
questions = []
answers = []
for conv in conversation_lines:
    for i in range(len(conv) - 1):
        questions.append(line2text[conv[i]])
        answers.append(line2text[conv[i+1]])
print(len(questions))
print(len(answers))

221616
221616


In [52]:
for i in range(6):
    print('Q:', questions[i])
    print('A:', answers[i], '\n')

Q: Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
A: Well, I thought we'd start with pronunciation, if that's okay with you. 

Q: Well, I thought we'd start with pronunciation, if that's okay with you.
A: Not the hacking and gagging and spitting part.  Please. 

Q: Not the hacking and gagging and spitting part.  Please.
A: Okay... then how 'bout we try out some French cuisine.  Saturday?  Night? 

Q: You're asking me out.  That's so cute. What's your name again?
A: Forget it. 

Q: No, no, it's my fault -- we didn't have a proper introduction ---
A: Cameron. 

Q: Cameron.
A: The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does. 



TODO: now we need to clean the text (remove things like `-`), split it into words, and map the words to ids
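
One possible shape for that TODO, as a rough sketch only: lowercase the text, replace everything except letters, digits, apostrophes and spaces, split on whitespace, then assign ids by frequency. The function names, the `<pad>` and `<unk>` tokens and the cleaning rule are all assumptions chosen for illustration, not decisions made in this notebook.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase, replace punctuation such as '--' and '.' with spaces, split into words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", ' ', text)
    return text.split()


def build_vocab(token_lists, min_count=1):
    """Map each word to an integer id; ids 0 and 1 are reserved (hypothetical choice)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {'<pad>': 0, '<unk>': 1}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each word with its id, falling back to the <unk> id."""
    return [vocab.get(tok, vocab['<unk>']) for tok in tokens]


tokens = tokenize("Okay -- you're gonna need to learn how to lie.")
vocab = build_vocab([tokens])
print(tokens)
print(encode(tokens, vocab))
```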