# Notebook 1: Data Cleaning

In order to build a movie character recommendation system, I will gather and merge two datasets: one from the [Cornell Movie Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) and the other from the [Cornell Movie Quotes Corpus](http://www.cs.cornell.edu/~cristian/memorability.html). Both datasets contain information about the movie title, the characters, genres, and the movie lines. 

In [1]:
import pandas as pd
import re
import os
import pandas as pd
import numpy as np
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## Cleaning Movie Dialogs Corpus

The movie dialogs corpus is split into 3 csv files, `movie_lines.csv`, `char_data.csv`, and `movie_metadata.csv`.

The `movie_lines.csv` file has the character name, their lines, and the movie id that they are associated with.

The `char_data.csv` contains the character names and the movie name, so `movie_lines.csv` will be merged with `char_data.csv` to have a comprehensive dataframe that contains the movie title, characters, and their lines.

#### **Examining and Cleaning the `movie_lines.csv`:**

In [2]:
mov_line = pd.read_csv('../data/movie_lines.csv')
mov_line.head()

Unnamed: 0,lineID,charID,movieID,charName,line,Unnamed: 5,Unnamed: 6
0,L1045,u0,m0,BIANCA,They do not!,,
1,L1044,u2,m0,zz,They do to!,,
2,L985,u0,m0,BIANCA,I hope so.,,
3,L984,u2,m0,CAMERON,She okay?,,
4,L925,u0,m0,BIANCA,Let's go.,,


The `movie_lines.csv` has 65499 unique movie lines.

In [3]:
print(len(mov_line) == len(mov_line['lineID'].unique()))
print(mov_line.shape)

True
(65499, 7)


Checking for null values:

In [4]:
mov_line.isnull().sum()

lineID            0
charID            0
movieID           0
charName          0
line              0
Unnamed: 5    65489
Unnamed: 6    65498
dtype: int64

In [5]:
mov_line[mov_line['Unnamed: 5'].notnull()]

Unnamed: 0,lineID,charID,movieID,charName,line,Unnamed: 5,Unnamed: 6
3214,L6723,u114,m6,WELLES,A weirdo making S,M films? Who'd have thought it?,
3215,L6722,u108,m6,MAX,"Dino Velvet... yeah, he's like the John Luc G...","M flicks, supposed to be a real weirdo.",
3235,L6618,u108,m6,MAX,There's two kinds of specialty product; legal...,"M and bondage films, they straddle the line. ...",
32773,L229891,u1036,m68,ALEXANDER,"By Grabthar's Hammer, this is true. 159",NT. LIVING ROOM - SOMEWHERE - NIGHT,159.0
32836,L229706,u1042,m68,JASON,BRANDON!,TIME TO GO!,
32875,L229857,u1041,m68,GWEN,"All systems are working, Commander. ,~ -cc",PINK) -' C -,
33068,L229881,u1049,m68,TOMMY,I see them! I see them! RD STREET,PASADENA 57,
33070,L229801,u1041,m68,GWEN,What are you doing? What are thev doino? ~7C ...,h37C,
36206,L237881,u1117,m73,REDBEARD,"It would have been a beautiful bridge, John. ...",suppose... ...never really pay much attention...,
65374,L381472,u1914,m125,ZED'S VOICE,We've got about eight or nine prospects,I want you look--,


In [6]:
mov_line[mov_line['Unnamed: 6'].notnull()]

Unnamed: 0,lineID,charID,movieID,charName,line,Unnamed: 5,Unnamed: 6
32773,L229891,u1036,m68,ALEXANDER,"By Grabthar's Hammer, this is true. 159",NT. LIVING ROOM - SOMEWHERE - NIGHT,159.0


In [7]:
len(mov_line[mov_line['Unnamed: 5'].notnull()])

10

I am going to drop both `Unnamed: 5` and `Unnamed: 6` columns because they only have 11 non-null values combined, so there is not much importance contained in the columns.

In [8]:
mov_line = mov_line.drop(['Unnamed: 5', 'Unnamed: 6'],1)
mov_line.to_csv('../data/movie_lines_updated.csv', index_label = False)
mov_line = pd.read_csv('../data/movie_lines_updated.csv')
mov_line.head()

Unnamed: 0,lineID,charID,movieID,charName,line
0,L1045,u0,m0,BIANCA,They do not!
1,L1044,u2,m0,zz,They do to!
2,L985,u0,m0,BIANCA,I hope so.
3,L984,u2,m0,CAMERON,She okay?
4,L925,u0,m0,BIANCA,Let's go.


In [9]:
mov_line.isnull().sum()

lineID      0
charID      0
movieID     0
charName    0
line        0
dtype: int64

Stripping the beginning and leading spaces for each column: `lineID`, `charID`, `movieID`, `charName`, `line`.

In [10]:
def strip_extra_spaces(df):
    
    # Iterate through each column in the dataframe
    for column in df.columns:
        
        # DF series must not be a float and an integer
        if (df[column].dtypes != float) and (df[column].dtypes != int):
            try: 
                # Strip beginning and leading spaces for all observations
                df[column] = df[column].apply(lambda x: x.strip())
                print(f'{column} is clean.')
            except:
                print(f'{column} is not clean')

In [11]:
strip_extra_spaces(mov_line)

lineID is clean.
charID is clean.
movieID is clean.
charName is clean.
line is clean.


In [12]:
mov_line.dtypes

lineID      object
charID      object
movieID     object
charName    object
line        object
dtype: object

Examining the number of unique characters based on their id column `charID`.

In [13]:
len(mov_line['charID'].unique())

1917

Each character in each movie should have a unique character id (`charID`). I am going to look for any character id's that have multiple names associated with it.

In [14]:
def not_unique_character(df):
    counter = 0
    
    # Iterating through each unique character ID
    for unique_character in df['charID'].unique():
        
        # Series that contains the names associated with the unique ID
        character_name = df['charName'][df['charID'] == unique_character]
        
        # if there are multiple names per id
        if len(character_name.value_counts().index) > 1:
            print(f'{unique_character} has multiple character names:')
            print(character_name.value_counts())
            counter += 1
    
    print(f'\nNumber of non-unique character(s): {counter}')

In [16]:
not_unique_character(mov_line)

u2 has multiple character names:
CAMERON    77
zz          1
Name: charName, dtype: int64

Number of non-unique character(s): 1


As a result, the name 'zz' should be replaced by 'CAMERON'.

In [17]:
mov_line[(mov_line['charName'] =='zz')|(mov_line['charName'] =='CAMERON')].head(3)

Unnamed: 0,lineID,charID,movieID,charName,line
1,L1044,u2,m0,zz,They do to!
3,L984,u2,m0,CAMERON,She okay?
5,L924,u2,m0,CAMERON,Wow


In [18]:
mov_line.at[1,'charName'] = 'CAMERON'

In [19]:
mov_line.to_csv('../data/movie_lines_updated.csv', index_label = False)

#### **Examining and Cleaning the `char_data.csv`:**

The `char_data.csv` file has the character name, movie id, and the movie name.

In [20]:
char = pd.read_csv('../data/char_data.csv')
char.head(10)

Unnamed: 0,charID,charName,movieID,movieName,gender,pos
0,u0,BIANCA,m0,10 things i hate about you,f,4
1,u1,BRUCE,m0,10 things i hate about you,?,?
2,u2,CAMERON,m0,10 things i hate about you,m,3
3,u3,CHASTITY,m0,10 things i hate about you,?,?
4,u4,JOEY,m0,10 things i hate about you,m,6
5,u5,KAT,m0,10 things i hate about you,f,2
6,u6,MANDELLA,m0,10 things i hate about you,f,7
7,u7,MICHAEL,m0,10 things i hate about you,m,5
8,u8,MISS PERKY,m0,10 things i hate about you,?,?
9,u9,PATRICK,m0,10 things i hate about you,m,1


Stripping the beginning and leading spaces for each column: `charID`, `charName`, `movieID`, `movieName`, `gender`, `pos`.

In [21]:
strip_extra_spaces(char)

charID is clean.
charName is clean.
movieID is clean.
movieName is clean.
gender is clean.
pos is clean.


Checking for nulls:

In [22]:
char.isnull().sum().sum()

0

I will now investigate the `gender` column.

In [23]:
char['gender'].value_counts(normalize = True)

?    0.666187
m    0.210183
f    0.101937
M    0.016602
F    0.004981
$    0.000111
Name: gender, dtype: float64

Unfortunately, over 66% of the data contains a null value, so I will have to drop the `gender` column.

In [24]:
char = char.drop('gender',1)

The `pos` column is the character's position in movie credits. For the model's purposes, this column will be dropped.

In [25]:
char = char.drop('pos',1)

In [26]:
char.head()

Unnamed: 0,charID,charName,movieID,movieName
0,u0,BIANCA,m0,10 things i hate about you
1,u1,BRUCE,m0,10 things i hate about you
2,u2,CAMERON,m0,10 things i hate about you
3,u3,CHASTITY,m0,10 things i hate about you
4,u4,JOEY,m0,10 things i hate about you


Each character in each movie has a unique character id (`charID`). I am going to look for any character id's that have multiple names.

In [27]:
not_unique_character(char)


Number of non-unique character(s): 0


Great, there are none.

In [28]:
char.sort_values('charName').head()

Unnamed: 0,charID,charName,movieID,movieName
6558,u6558,,m436,memento
3764,u3764,,m248,arctic blue
5011,u5011,"""BRILL""",m333,enemy of the state
1602,u1602,"""DOCTOR""",m106,jacob's ladder
7845,u7845,"""ILIA""",m531,star trek: the motion picture


None of the blank character names show up in the `mov_line` dataframe.

In [29]:
mov_line[(mov_line['charID'] == 'u6558')|(mov_line['charID'] == 'u3764')]

Unnamed: 0,lineID,charID,movieID,charName,line


I'll have to drop the blank values in `charName` as well as remove quotation marks `""` in the other names.

In [30]:
char['charName'] = char['charName'].replace('"','')
char[char['charName'] == 'BRILL']

Unnamed: 0,charID,charName,movieID,movieName
5015,u5015,BRILL,m333,enemy of the state


In [31]:
char.to_csv('../data/char_data_updated.csv', index_label = False)

#### Grouping all lines per unique character per movie:

In [32]:
mov_line[['charID', 'charName', 'line']].head()

Unnamed: 0,charID,charName,line
0,u0,BIANCA,They do not!
1,u2,CAMERON,They do to!
2,u0,BIANCA,I hope so.
3,u2,CAMERON,She okay?
4,u0,BIANCA,Let's go.


I will now combine each character's lines into one document per character ID.

In [33]:
id_doc = {}
for character in mov_line['charID'].unique():
    character_line = ''
    for line in mov_line[mov_line['charID'] == character]['line']:
        character_line += line + ' '
    id_doc[character] = character_line

Creating a dataframe of the character ID and all of their associated movie lines.

In [34]:
mov_id_lines = pd.DataFrame.from_dict(id_doc, orient='index')
mov_id_lines = mov_id_lines.reset_index(drop=False)
mov_id_lines.columns = ['charID','all_lines']
mov_id_lines.head()

Unnamed: 0,charID,all_lines
0,u0,They do not! I hope so. Let's go. Okay -- you'...
1,u2,"They do to! She okay? Wow No The ""real you"". I..."
2,u3,You think you ' re the only sophomore at the p...
3,u4,"Listen, I want to talk to you about the prom. ..."
4,u5,Perm? It's just you. What? To completely damag...


There are 1917 unique movie characters.

In [35]:
mov_id_lines.shape

(1917, 2)

In [36]:
mov_line.head(3)

Unnamed: 0,lineID,charID,movieID,charName,line
0,L1045,u0,m0,BIANCA,They do not!
1,L1044,u2,m0,CAMERON,They do to!
2,L985,u0,m0,BIANCA,I hope so.


In [37]:
char.head(3)

Unnamed: 0,charID,charName,movieID,movieName
0,u0,BIANCA,m0,10 things i hate about you
1,u1,BRUCE,m0,10 things i hate about you
2,u2,CAMERON,m0,10 things i hate about you


Rearranging `char` columns:

In [38]:
char = char[['charID', 'charName','movieName','movieID']]

#### **Merging `mov_line` and  `char` dataframes:**
Combining all the lines per character ID and the character names, movie, and movie id.

In [39]:
mov_dialog = mov_id_lines.set_index('charID').join(char.set_index('charID'))
mov_dialog = mov_dialog.drop_duplicates()
mov_dialog = mov_dialog.reset_index()

In [40]:
mov_dialog.head()

Unnamed: 0,charID,all_lines,charName,movieName,movieID
0,u0,They do not! I hope so. Let's go. Okay -- you'...,BIANCA,10 things i hate about you,m0
1,u2,"They do to! She okay? Wow No The ""real you"". I...",CAMERON,10 things i hate about you,m0
2,u3,You think you ' re the only sophomore at the p...,CHASTITY,10 things i hate about you,m0
3,u4,"Listen, I want to talk to you about the prom. ...",JOEY,10 things i hate about you,m0
4,u5,Perm? It's just you. What? To completely damag...,KAT,10 things i hate about you,m0


In [41]:
mov_dialog.to_csv('../data/movie_dialogue_updated.csv', index_label = False)

## Cleaning Movie Quotes Corpus

The movie quotes corpus contains the following information from 1068 movie scripts:
- Movie Title
- Character
- Text

In [42]:
mov_quo = open('../data/moviequotes.scripts.txt', 'r') 
mov_quo = open('../data/moviequotes.scripts.txt', encoding='utf-8', errors='ignore').read().split('\n')

There are 894,015 observations

In [43]:
len(mov_quo)

894015

In [44]:
mov_quo[:3]

['0 +++$+++ "murderland" +++$+++ 1 +++$+++ announcer +++$+++  +++$+++ Ladies and gentlemen, the official mascot of Murderland.... Scraps the Dog !',
 '1 +++$+++ "murderland" +++$+++ 2 +++$+++ announcer +++$+++  +++$+++ Choose the doorway that starts you on your magical journey into  MURDERLAND !',
 '2 +++$+++ "murderland" +++$+++ 3 +++$+++ johnny +++$+++  +++$+++ I didn\'t think he\'d make it past Scraps.']

Cleaning the dataset by removing uncommon separators.

In [45]:
clean_list = []
for line in mov_quo:
    line = line.replace(' +++$+++ ', '+$+')
    line = line.replace('"', '')
    line = line.split('+$+')
    clean_list.append(line)

In [46]:
len(clean_list)
clean_list[0]

['0',
 'murderland',
 '1',
 'announcer',
 '',
 'Ladies and gentlemen, the official mascot of Murderland.... Scraps the Dog !']

In [47]:
movie_quotes = pd.DataFrame(clean_list, columns=['LINE_ID','MOVIE_TITLE','MOVIE_LINE_NR','CHARACTER','REPLY_TO_LINE_ID','TEXT'])
movie_quotes.head()

Unnamed: 0,LINE_ID,MOVIE_TITLE,MOVIE_LINE_NR,CHARACTER,REPLY_TO_LINE_ID,TEXT
0,0,murderland,1,announcer,,"Ladies and gentlemen, the official mascot of M..."
1,1,murderland,2,announcer,,Choose the doorway that starts you on your mag...
2,2,murderland,3,johnny,,I didn't think he'd make it past Scraps.
3,3,murderland,4,bruce,2.0,Let's just see if he can make it into round tw...
4,4,murderland,5,bruce,,Don't.


I do not need `LINE_ID`, `MOVIE_LINE_NR`, and `REPLY_TO_LINE_ID` because I am not interested in the order of the dialogue.

In [48]:
movie_quotes = movie_quotes.drop(['LINE_ID','MOVIE_LINE_NR','REPLY_TO_LINE_ID'],1)

In [49]:
movie_quotes.head()

Unnamed: 0,MOVIE_TITLE,CHARACTER,TEXT
0,murderland,announcer,"Ladies and gentlemen, the official mascot of M..."
1,murderland,announcer,Choose the doorway that starts you on your mag...
2,murderland,johnny,I didn't think he'd make it past Scraps.
3,murderland,bruce,Let's just see if he can make it into round tw...
4,murderland,bruce,Don't.


#### Grouping all quotes per unique character per movie:

In [50]:
movie_quotes = pd.DataFrame(movie_quotes.groupby(['MOVIE_TITLE','CHARACTER'])['TEXT'].agg(lambda line: ' '.join(line)))
movie_quotes = movie_quotes.reset_index()
movie_quotes.to_csv('../data/movie_quotes_updated.csv', index_label = False)


In [51]:
movie_quotes.head()

Unnamed: 0,MOVIE_TITLE,CHARACTER,TEXT
0,10 things i hate about you,bartender,What can I get you? You forgot to pay!
1,10 things i hate about you,bianca,Did you change your hair? You might wanna thin...
2,10 things i hate about you,bianca and walter,The sound of a fifteen-year-old in labor.
3,10 things i hate about you,bogey,"Nice to see you. Martini bar to the right, sh..."
4,10 things i hate about you,boy,"Hey, Bianca. Pet my kitty!"


## Combining Movie Quotes and Dialogue Dataframes

Dropping `charName` and `movieID` from the movie dialogue dataframe.

In [52]:
mov_dialog = mov_dialog.drop(['charID','movieID'],1)

In [53]:
mov_dialog['charName'] = mov_dialog['charName'].str.lower()

Changing the order of columns for the dialogue dataframe and renaming them.

In [54]:
mov_dialog = mov_dialog[['movieName','charName','all_lines']]
mov_dialog.columns = ['title','character','text']

In [55]:
mov_dialog.head(3)

Unnamed: 0,title,character,text
0,10 things i hate about you,bianca,They do not! I hope so. Let's go. Okay -- you'...
1,10 things i hate about you,cameron,"They do to! She okay? Wow No The ""real you"". I..."
2,10 things i hate about you,chastity,You think you ' re the only sophomore at the p...


Renaming the movie quotes dataframe columns.

In [56]:
movie_quotes.columns = ['title','character','text']

Combining the `movie_quotes` and `mov_dialog` dataframes.

In [57]:
mov_combo = pd.concat([mov_dialog,movie_quotes], 
                      ignore_index = True)

In [58]:
mov_combo.head()

Unnamed: 0,title,character,text
0,10 things i hate about you,bianca,They do not! I hope so. Let's go. Okay -- you'...
1,10 things i hate about you,cameron,"They do to! She okay? Wow No The ""real you"". I..."
2,10 things i hate about you,chastity,You think you ' re the only sophomore at the p...
3,10 things i hate about you,joey,"Listen, I want to talk to you about the prom. ..."
4,10 things i hate about you,kat,Perm? It's just you. What? To completely damag...


Tokenizing the `text` column to create a `token_text` column.



In [59]:
tokenizer = RegexpTokenizer(r'[A-Za-z]+')

In [60]:
mov_combo['tokenized_text'] = mov_combo['text'].map(lambda x: tokenizer.tokenize(x))

I will use the `token_text` column in order to create a `word_count` column.

In [61]:
mov_combo['word_count'] = mov_combo['tokenized_text'].map(lambda x: len(x))

I will remove each movie's duplicate characters because there are overlaps between the dialogue and quotes corpus dataframes. 

For example, assume there is a unique character, 'Jim Bob', that appears in both corpora. If Jim Bob has 100 words in his lines in the quotes dataframe and 150 in the dialogue dataframe, I will drop the observation with the fewer word count.

In [62]:
mov_combo = mov_combo.sort_values('word_count', ascending = False)
mov_combo = mov_combo.sort_values(['title','character'])
mov_combo.reset_index(drop=True,inplace=True)
mov_combo.head()

Unnamed: 0,title,character,text,tokenized_text,word_count
0,10 things i hate about you,bartender,What can I get you? You forgot to pay!,"[What, can, I, get, you, You, forgot, to, pay]",9
1,10 things i hate about you,bianca,Did you change your hair? You might wanna thin...,"[Did, you, change, your, hair, You, might, wan...",1295
2,10 things i hate about you,bianca,They do not! I hope so. Let's go. Okay -- you'...,"[They, do, not, I, hope, so, Let, s, go, Okay,...",1015
3,10 things i hate about you,bianca and walter,The sound of a fifteen-year-old in labor.,"[The, sound, of, a, fifteen, year, old, in, la...",9
4,10 things i hate about you,bogey,"Nice to see you. Martini bar to the right, sh...","[Nice, to, see, you, Martini, bar, to, the, ri...",13


The following variable returns unique characters per movie with the higher word count.

In [63]:
unique_character = list(mov_combo[['title','character']].drop_duplicates().index)

There are 1737 duplicate characters that must be removed:

In [64]:
duplicate_characters = len(mov_combo) - len(mov_combo.loc[unique_character, :])
duplicate_characters

1737

In [65]:
mov_combo = mov_combo.loc[unique_character, :]
mov_combo.reset_index(drop=True,inplace=True)

Removing non alpha numeric characters in the `character` names. 

In [66]:
mov_combo['character'] = mov_combo['character'].map(lambda x: re.sub('[^A-Za-z]+',' ', x))

## VADER Sentiment Analysis

I will utilize VADER's sentiment analysis to examine the positive, negative, and neutral weights of each character's movie lines. 

The Sentiment Intensity Analyzer (SIA) returns 4 sentiment metrics:

- compound - ranges between -1 to +1, from most extreme negative to most extreme positive
- positive - a percentage rating of how positive a post is
- neutral - a percentage rating of how neutral a post is
- negative - a percentage rating of how negative a post is

The purpose of the SIA is to use the sentiment of each post as an additional feature when implementing the movie character recommendation.

In [67]:
sia = SentimentIntensityAnalyzer()

In [68]:
# Long process
mov_combo['vader'] = mov_combo['text'].map(lambda x: sia.polarity_scores(x))

Changing object types to string values:

In [69]:
mov_combo.dtypes

title             object
character         object
text              object
tokenized_text    object
word_count         int64
vader             object
dtype: object

In [70]:
mov_combo['title'] = mov_combo['title'].astype(str)
mov_combo['character'] = mov_combo['character'].astype(str)
mov_combo['text'] = mov_combo['text'].astype(str)

In [71]:
mov_combo.dtypes

title             object
character         object
text              object
tokenized_text    object
word_count         int64
vader             object
dtype: object

Saving file as a pickle to preserve column datatypes.

In [72]:
mov_combo.to_pickle('../data/mov_combo.pkl')

# Proceed to Notebook 2: Data Scraping

The current movie combo dataframe has the title, character, text, tokenized_text, word count, and vader sentiment analysis scores. In order to have the proper IMDb movie titles and genres, I am going to scrape the IMDb website in the following notebook.