# Notebook 2: Data Scraping

To supplement the combined movie dialogue and quotes dataframe, I will create a `movie_info` dataframe with the following columns:

|Feature|Type|Description|
|---|---|---|
|**original_title**|string|Original movie title from the `mov_combo` dataframe|
|**imdb_title**|string|Official IMDb movie title|
|**imdb_url**|string|Movie's IMDb URL|
|**genre**|list|Movie's genre(s)|
|**pic_url**|string|URL of the movie's promotional picture|

Afterwards, the `movie_info` and `mov_combo` dataframes will be merged and cleaned to create a `mov_combo_final` dataframe with the following columns:

|Feature|Type|Description|
|---|---|---|
|**imdb_title**|string|Official IMDb movie title|
|**character**|string|Movie character|
|**text**|string|Movie character lines|
|**tokenized_text**|list|Movie character lines split into individual words|
|**word_count**|integer|Text word count|
|**vader**|dictionary|Sentiment analysis score|
|**genre**|list|Movie's genre(s)|
|**imdb_url**|string|Movie's IMDb URL|
|**pic_url**|string|URL of the movie's promotional picture|

In [427]:
mov_combo_final.head(1)

Unnamed: 0,imdb_title,character,text,tokenized_text,word_count,vader,genre,imdb_url,pic_url
0,10 Things I Hate About You (1999),bartender,What can I get you? You forgot to pay!,"[What, can, I, get, you, You, forgot, to, pay]",9,"{'neg': 0.195, 'neu': 0.805, 'pos': 0.0, 'comp...","[Comedy, Drama, Romance]",https://www.imdb.com/title/tt0147800/,https://m.media-amazon.com/images/M/MV5BMmVhZj...


In [440]:
import os
import pickle
import re
import sys
import time
import webbrowser

import bs4
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [17]:
mov_combo = pd.read_pickle('../data/mov_combo.pkl')
mov_combo.head(3)

Unnamed: 0,title,character,text,tokenized_text,word_count,vader
0,10 things i hate about you,bartender,What can I get you? You forgot to pay!,"[What, can, I, get, you, You, forgot, to, pay]",9,"{'neg': 0.195, 'neu': 0.805, 'pos': 0.0, 'comp..."
1,10 things i hate about you,bianca,Did you change your hair? You might wanna thin...,"[Did, you, change, your, hair, You, might, wan...",1295,"{'neg': 0.108, 'neu': 0.726, 'pos': 0.166, 'co..."
2,10 things i hate about you,bianca and walter,The sound of a fifteen-year-old in labor.,"[The, sound, of, a, fifteen, year, old, in, la...",9,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."


Grabbing the unique movie titles in the `mov_combo` dataframe.

In [21]:
unique_titles = list(mov_combo['title'].unique())
len(unique_titles)

1075

There are 1075 unique movies.

Creating the `movie_info` dataframe.

In [75]:
movie_info = pd.DataFrame({'original_title':unique_titles,
                           'imdb_title': 'none',
                           'imdb_url': 'none',
                           'genre': 'none',
                           'pic_url': 'none',
                           'info_all_collected':0
                          })
movie_info.head()

Unnamed: 0,original_title,imdb_title,imdb_url,genre,pic_url,info_all_collected
0,10 things i hate about you,none,none,none,none,0
1,1492: conquest of paradise,none,none,none,none,0
2,15 minutes,none,none,none,none,0
3,2001: a space odyssey,none,none,none,none,0
4,48 hrs.,none,none,none,none,0


Creating the movie scraper function, `movie_info_update()`. The function submits the original movie title plus 'movie imdb' to Google search and follows the first result, which is usually the movie's IMDb page. The IMDb title, genre(s), and picture URL are then scraped from that page.

If any of the columns `imdb_title`, `imdb_url`, `genre`, or `pic_url` cannot be obtained, `info_all_collected` stays at 0. If all the information is collected, `info_all_collected` is set to 1.
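
As a rough offline illustration of the search step described above, the sketch below builds the query URL with `urllib.parse` and pulls the target address out of a Google-style redirect link. The helper names `build_search_url` and `extract_target_url` and the sample `href` are assumptions made for this example; they are not part of the notebook, and Google's actual result markup may differ. Unlike the scraper function, this sketch percent-encodes `&` rather than replacing it with 'and'.

```python
from urllib.parse import parse_qs, quote_plus, urlparse


def build_search_url(original_title):
    """Build a Google search URL for '<title> movie imdb' (hypothetical helper)."""
    # quote_plus turns spaces into '+' and percent-encodes characters such as '&'
    query = quote_plus(f"{original_title} movie imdb")
    return f"https://google.com/search?q={query}"


def extract_target_url(href):
    """Return the 'q' parameter of a '/url?q=...' redirect link, or None."""
    params = parse_qs(urlparse(href).query)
    targets = params.get("q")
    return targets[0] if targets else None


# Sample redirect link, assumed for illustration only
sample_href = "/url?q=https://www.imdb.com/title/tt0147800/&sa=U&ved=abc"

search_url = build_search_url("love & basketball")
imdb_url = extract_target_url(sample_href)
print(search_url)
print(imdb_url)
```

Parsing the query string this way avoids relying on the position of the first `&` in the link.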

In [241]:
def movie_info_update(df, row_index, original_title):
    
    print(f'Processing {original_title}...')
    # input movie title
    movie_title = str(original_title)
    
    # replace '&' with 'and', then spaces with pluses for google search
    query = movie_title.replace("&","and").replace(" ","+")

    # selects the first url from google search, which is IMDB
    try:
        res = requests.get('https://google.com/search?q='+query+'+movie+imdb')
        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        linkElements = soup.select('.r a')
    
    # Grab movie's imdb url
        movie_imdb_link = linkElements[0].get('href').replace('/url?q=','').split('&')[0]
    
    except Exception:
        print("Failed to obtain IMDb url")
        movie_imdb_link = np.nan
        df.at[row_index,'info_all_collected'] = 0
    
    # wait for 5 seconds until next request
    time.sleep(5)
    
    # Go to the movie's imdb page
    try:
        res = requests.get(movie_imdb_link)
        soup = bs4.BeautifulSoup(res.text, 'lxml')
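    except Exception:
        # If the search page was parsed above, `soup` still holds it here,
        # so the title scraped below will read "... - Google Search"
        print("Failed to request IMDb url")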
        df.at[row_index,'info_all_collected'] = 0
    
    # Get movie's IMDB title
    try:
        title = soup.find('title').text.replace(" - IMDb","")
    except Exception:
        title = np.nan
        df.at[row_index,'info_all_collected'] = 0
    
    # Find movie's genre(s)
    try:
        for_genre = soup.find_all('div', {'class':'subtext'})
        genres = [g.text for g in for_genre[0].find_all('a')][:-1]
        genres = ",".join(genres)
    except Exception:
        genres = np.nan
        df.at[row_index,'info_all_collected'] = 0
    
    # Grab movie's picture URL:
    try:
        for_pic = soup.find('script', {'type':'application/ld+json'}).text
        movie_pic = re.findall(r'https://m\.media[^"]*', for_pic)[0]
    except Exception:
        movie_pic = np.nan
        df.at[row_index,'info_all_collected'] = 0
    
    # Updating imdb title, imdb url, genre, pic url
    df.at[row_index,'imdb_title'] = title
    df.at[row_index,'imdb_url'] = movie_imdb_link
    df.at[row_index,'genre'] = genres
    df.at[row_index,'pic_url'] = movie_pic
    
    # Changing info_all_collected to 1 if there are no nulls 
    df.loc[~df.isnull().any(axis=1),'info_all_collected'] = 1
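
The picture URL is pulled out of the page's JSON-LD `<script>` block with a regular expression. As a hedged alternative, the sketch below parses such a block with the standard `json` module instead. The sample payload and its `image` key are assumptions for illustration (they follow the schema.org `Movie` convention); the real IMDb markup may differ and should be checked before relying on this.

```python
import json

# Hypothetical JSON-LD payload, shaped like a schema.org Movie record
sample_ld_json = """
{
  "@context": "https://schema.org",
  "@type": "Movie",
  "name": "Example Movie",
  "image": "https://m.media-amazon.com/images/M/example.jpg",
  "genre": ["Comedy", "Drama"]
}
"""


def picture_url_from_ld_json(raw_text):
    """Return the 'image' field of a JSON-LD string, or None if unavailable."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    image = data.get("image")
    # Only accept a plain string URL
    return image if isinstance(image, str) else None


pic_url = picture_url_from_ld_json(sample_ld_json)
print(pic_url)
```

Returning `None` on any failure keeps the caller's "mark as missing" logic in one place.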

In [79]:
%%time
for row_index, title in movie_info['original_title'].items():
    movie_info_update(movie_info, row_index, title)

Processing 10 things i hate about you...
Processing 1492: conquest of paradise...
Processing 15 minutes...
Processing 2001: a space odyssey...
Processing 48 hrs....
Processing 8mm...
Processing a bucket of blood...
Processing a clockwork orange...
Processing a dry white season...
Processing a few good men...
Processing a hard day's night...
Processing a nightmare on elm street...
Processing a nightmare on elm street 3: dream warriors...
Processing a nightmare on elm street 4: the dream master...
Processing a nightmare on elm street part 2: freddy's revenge...
Failed to obtain IMDb url
Failed to request IMDb url
Processing a nightmare on elm street: the dream child...
Processing a perfect world...
Processing a serious man...
Processing a walk to remember...
Failed to obtain IMDb url
Failed to request IMDb url
Processing above the law...
Processing absolute power...
Processing ace ventura pet detective...
Processing adaptation...
Processing affliction...
Processing after...
Processing af

Processing dumb and dumber...
Processing dumb and dumberer: when harry met lloyd...
Processing dune...
Processing e...
Processing eagle eye...
Processing eastern promises...
Processing ed tv...
Processing ed wood...
Processing edtv...
Processing edward scissorhands...
Processing eight legged freaks...
Processing el mariachi...
Processing election...
Processing elizabeth the golden age...
Processing enemy of the state...
Processing entrapment...
Failed to obtain IMDb url
Failed to request IMDb url
Processing erik the viking...
Processing erin brockovich...
Processing escape from l...
Processing escape from new york...
Processing escape from the planet of the apes...
Processing eternal sunshine of the spotless mind...
Processing even cowgirls get the blues...
Processing event horizon...
Processing evil dead...
Processing evil dead 2: dead before dawn...
Processing evil dead ii dead by dawn...
Processing excalibur...
Processing existenz...
Processing extract...
Failed to obtain IMDb url
F

Processing living in oblivion...
Processing lock, stock and two smoking barrels...
Processing logan's run...
Processing lone star...
Processing lord of illusions...
Processing lord of the rings return of the king...
Processing lord of the rings the two towers...
Processing lord of war...
Processing lost highway...
Processing lost horizon...
Processing lost in space...
Processing lost in translation...
Processing lost souls...
Processing love & basketball...
Processing love and basketball...
Processing lΘon...
Processing made...
Failed to obtain IMDb url
Failed to request IMDb url
Processing magnolia...
Processing major league...
Processing malcolm x...
Processing malibu's most wanted...
Processing man in the iron mask...
Processing man on fire...
Processing man on the moon...
Processing man trouble...
Processing manhunt...
Processing manhunter...
Failed to obtain IMDb url
Failed to request IMDb url
Processing maniac...
Processing margot at the wedding...
Processing mariachi, el...
Proc

Processing star trek v: the final frontier...
Failed to obtain IMDb url
Failed to request IMDb url
Processing star trek vi: the undiscovered country...
Processing star trek: insurrection...
Processing star trek: the wrath of khan...
Processing star wars...
Processing star wars a new hope...
Processing star wars attack of the clones...
Failed to obtain IMDb url
Failed to request IMDb url
Processing star wars return of the jedi...
Processing star wars revenge of the sith...
Processing star wars the empire strikes back...
Processing star wars the phantom menace...
Processing star wars: episode vi - return of the jedi...
Processing starman...
Processing starship troopers...
Processing state and main...
Failed to obtain IMDb url
Failed to request IMDb url
Processing station west...
Processing stepmom...
Processing stir of echoes...
Processing storytelling...
Processing strange days...
Processing strangers on a train...
Processing stranglehold...
Processing suburbia...
Failed to obtain IMDb 

Processing the sandlot kids...
Processing the scarlet letter...
Processing the searchers...
Processing the seventh seal...
Failed to obtain IMDb url
Failed to request IMDb url
Processing the seventh victim...
Processing the shawshank redemption...
Processing the shining...
Processing the shipping news...
Failed to obtain IMDb url
Failed to request IMDb url
Processing the siege...
Processing the sixth sense...
Processing the sting...
Failed to obtain IMDb url
Failed to request IMDb url
Processing the stuntman...
Processing the surfer king...
Processing the sweet hereafter...
Processing the thin man...
Processing the thing...
Failed to obtain IMDb url
Failed to request IMDb url
Processing the third man...
Processing the three musketeers...
Failed to obtain IMDb url
Failed to request IMDb url
Processing the time machine...
Processing the tourist...
Processing the truman show...
Failed to obtain IMDb url
Failed to request IMDb url
Processing the usual suspects...
Processing the verdict...


In [191]:
movie_info.to_csv('../data/movie_info_rough.csv', index_label= False)

Observing unprocessed information:

In [179]:
# If there are no nulls across the columns for each row, then all the information is collected
movie_info.loc[~movie_info.isnull().any(axis=1),'info_all_collected'] = 1
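
To make the flag logic concrete, here is a small sketch on a toy dataframe (the rows are invented for illustration). It also shows one caveat: the `'none'` placeholder strings used when `movie_info` was created are not nulls, so a row that was never processed would still pass the null check. The stricter `strict_flag` column is an assumption of this example, not something the notebook computes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movie_info; values are made up for illustration
toy = pd.DataFrame({
    "original_title": ["a", "b", "c"],
    "imdb_title": ["A (1999)", np.nan, "none"],
    "genre": ["Comedy", np.nan, "none"],
    "info_all_collected": [0, 0, 0],
})

# Same rule as the cell above: a row with no nulls counts as fully collected
toy.loc[~toy.isnull().any(axis=1), "info_all_collected"] = 1

# Stricter variant (an assumption, not the notebook's rule):
# also treat the 'none' placeholder as missing
data_cols = ["imdb_title", "genre"]
is_missing = toy[data_cols].isnull() | toy[data_cols].eq("none")
toy["strict_flag"] = (~is_missing.any(axis=1)).astype(int)

print(toy[["original_title", "info_all_collected", "strict_flag"]])
```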

There are 114 unprocessed movie titles. I will attempt to run them through the function again.

In [181]:
len(movie_info[movie_info['info_all_collected'] ==0])

114

In [182]:
unprocessed_mov = movie_info[movie_info['info_all_collected'] ==0]['original_title']

In [183]:
%%time
for row_index, title in unprocessed_mov.items():
    movie_info_update(movie_info, row_index, title)

Processing a nightmare on elm street part 2: freddy's revenge...
Processing a walk to remember...
Processing ali...
Processing all the president's men...
Processing antitrust...
Processing antz...
Processing arctic blue...
Failed to obtain IMDb url
Failed to request IMDb url
Processing babel...
Processing basic instinct...
Processing beloved...
Processing bridesmaids...
Processing bringing out the dead...
Processing catch me if you can...
Processing chaos...
Processing cinema paradiso...
Processing cold mountain...
Processing crouching tiger, hidden dragon...
Processing cube...
Failed to obtain IMDb url
Failed to request IMDb url
Processing darkman...
Processing deception...
Processing die hard...
Failed to obtain IMDb url
Failed to request IMDb url
Processing dogma...
Processing double indemnity...
Failed to obtain IMDb url
Failed to request IMDb url
Processing entrapment...
Processing extract...
Processing fight club...
Processing four rooms...
Processing g...
Processing grand hotel.

In [207]:
# If there are no nulls across the columns for each row, then all the information is collected
movie_info.loc[~movie_info.isnull().any(axis=1),'info_all_collected'] = 1

In [186]:
# Same process again:
unprocessed_mov = movie_info[(movie_info['info_all_collected'] ==0)&(movie_info['genre'].isnull())]['original_title']
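
The manual re-runs in this notebook follow the same pattern each time: select the rows still flagged 0 and try again. A minimal sketch of that pattern as a bounded loop is shown below. The `retry_until_done` helper, the `fake_update` stand-in, and the `max_passes` parameter are all hypothetical; to use this with the real scraper, `movie_info_update` would need to report success (for example by returning a boolean), which it does not currently do.

```python
def retry_until_done(pending, update, max_passes=3):
    """Call update(item) on each pending item, re-trying failures.

    Returns the items that still failed after max_passes passes.
    """
    remaining = list(pending)
    for _ in range(max_passes):
        if not remaining:
            break
        # Keep only the items whose update did not succeed this pass
        remaining = [item for item in remaining if not update(item)]
    return remaining


# Stand-in for the scraper: 'cube' succeeds on its second attempt,
# 'die hard' never succeeds, everything else succeeds immediately
attempts = {}


def fake_update(title):
    attempts[title] = attempts.get(title, 0) + 1
    if title == "die hard":
        return False
    if title == "cube":
        return attempts[title] >= 2
    return True


still_failing = retry_until_done(["antz", "cube", "die hard"], fake_update)
print(still_failing)
```

Titles that never resolve, like the franchise names handled manually later on, are returned rather than retried forever.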

In [188]:
%%time
unprocessed_mov = movie_info[(movie_info['info_all_collected'] ==0)&(movie_info['genre'].isnull())]['original_title']
for row_index, title in unprocessed_mov.items():
    movie_info_update(movie_info, row_index, title)

Processing arctic blue...
Processing cube...
Processing die hard...
Processing double indemnity...
Processing lΘon...
Processing pet sematary ii...
Processing pirates of the caribbean...
Failed to obtain IMDb url
Failed to request IMDb url
Processing star wars...
Processing star wars attack of the clones...
Processing the hustler...
Processing the white ribbon...
Failed to obtain IMDb url
Failed to request IMDb url
Processing there's something about mary...
CPU times: user 2.74 s, sys: 67.4 ms, total: 2.8 s
Wall time: 1min 24s


In [192]:
movie_info[movie_info.isnull().any(axis=1)].sort_values('genre')

Unnamed: 0,original_title,imdb_title,imdb_url,genre,pic_url,info_all_collected
613,plastic man,Plastic Man,https://www.imdb.com/title/tt9411946/,"Action,Sci-Fi",,0
855,the goonies 2,The Goonies 2,https://www.imdb.com/title/tt3652896/,Adventure,,0
1048,white jazz,White Jazz,https://www.imdb.com/title/tt0892384/,"Crime,Drama,Mystery",,0
466,labor of love,Labor of Love,https://www.imdb.com/title/tt1677731/,"Drama,Mystery",,0
663,s,S (2015),https://www.imdb.com/title/tt4510636/,"Short,Drama",,0
721,spare me,Spare Me (1993),https://www.imdb.com/title/tt0108200/,Thriller,,0
228,die hard,Die Hard Series,https://www.imdb.com/list/ls057313912/,,,0
500,lΘon,IMDb Top 250,https://www.imdb.com/chart/top,,,0
610,pirates of the caribbean,pirates of the caribbean movie imdb - Google S...,,,,0
740,star wars,STAR WARS FILMS (1977-2015),https://www.imdb.com/list/ls070150896/,,,0


Manually filling in `pic_url` for lesser-known movies that do not have a photo on IMDb:

In [194]:
movie_info.at[613,'pic_url'] = 'https://m.media-amazon.com/images/M/MV5BZWQ1ZjdkNWYtZTgwNS00MzNiLWExYWEtMWE3NzgwNmYxNTU1XkEyXkFqcGdeQXVyNjExODE1MDc@._V1_.jpg'
movie_info.at[855,'pic_url'] = 'https://cdn3.movieweb.com/i/article/vNfany1Ej6mA3nCrWmVDnabqEQSdjG/798:50/Goonies-2-Wont-Happen-Richard-Donner-Old-Age.jpg'
movie_info.at[1048,'pic_url'] = 'http://r72.cooltext.com/rendered/cooltext324247263286443.gif'
movie_info.at[466,'pic_url'] = 'https://www.ethanproductions.com/movies-newDB/images/402540600007.jpg'
movie_info.at[663,'pic_url'] = 'https://www.timelesswroughtiron.com/v/vspfiles/photos/UGHT-IRON-HOUSE-LETTER-S-LET-S-2T.jpg'
movie_info.at[721,'pic_url'] = 'https://upload.wikimedia.org/wikipedia/en/thumb/e/e3/Video.SM.jpg/220px-Video.SM.jpg'

In [195]:
movie_info[movie_info.isnull().any(axis=1)]

Unnamed: 0,original_title,imdb_title,imdb_url,genre,pic_url,info_all_collected
228,die hard,Die Hard Series,https://www.imdb.com/list/ls057313912/,,,0
500,lΘon,IMDb Top 250,https://www.imdb.com/chart/top,,,0
610,pirates of the caribbean,pirates of the caribbean movie imdb - Google S...,,,,0
740,star wars,STAR WARS FILMS (1977-2015),https://www.imdb.com/list/ls070150896/,,,0
966,the white ribbon,the white ribbon movie imdb - Google Search,,,,0


In [201]:
movie_info.loc[~movie_info.isnull().any(axis=1),'info_all_collected'] = 1

In [205]:
%%time
unprocessed_mov = movie_info[movie_info['info_all_collected'] == 0]['original_title']
print(unprocessed_mov)
for row_index, title in unprocessed_mov.items():
    movie_info_update(movie_info, row_index, title)

228                    die hard
500                        lΘon
610    pirates of the caribbean
740                   star wars
966            the white ribbon
Name: original_title, dtype: object
Processing die hard...
Processing lΘon...
Processing pirates of the caribbean...
Processing star wars...
Processing the white ribbon...
CPU times: user 1.75 s, sys: 855 ms, total: 2.6 s
Wall time: 35.3 s


In [208]:
movie_info[movie_info['info_all_collected'] == 0]

Unnamed: 0,original_title,imdb_title,imdb_url,genre,pic_url,info_all_collected
228,die hard,Die Hard Series,https://www.imdb.com/list/ls057313912/,,,0
500,lΘon,IMDb Top 250,https://www.imdb.com/chart/top,,,0
610,pirates of the caribbean,Pirates of the Caribbean Film Series,https://www.imdb.com/list/ls059602270/,,,0
740,star wars,STAR WARS FILMS (1977-2015),https://www.imdb.com/list/ls070150896/,,,0


Manually changing the last four movies' information.
I will first rename 'lΘon' (most likely a mis-encoded 'léon') to 'leon the professional' in both dataframes, `movie_info` and `mov_combo`.

In [228]:
mov_combo.loc[mov_combo[mov_combo['title'].str.contains('lΘon')].index,'title'] = 'leon the professional'

In [232]:
movie_info.loc[movie_info[movie_info['original_title'].str.contains('lΘon')].index,'original_title'] = 'leon the professional'
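
The same kind of rename can be written with the boolean mask passed straight to `.loc`, without the intermediate `.index` lookup. A sketch on a toy dataframe follows; the titles and word counts are invented, and a plain-ASCII placeholder (`le0n`) stands in for the mis-encoded title.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["le0n", "magnolia", "le0n"],
                    "word_count": [10, 20, 30]})

# regex=False treats the pattern as a literal substring
mask = toy["title"].str.contains("le0n", regex=False)
toy.loc[mask, "title"] = "leon the professional"

print(toy)
```

Because `str.contains` matches substrings, an exact comparison with `==` may be safer when other titles could contain the same text.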

In [235]:
movie_info[movie_info['info_all_collected'] == 0]

Unnamed: 0,original_title,imdb_title,imdb_url,genre,pic_url,info_all_collected
228,die hard,Die Hard Series,https://www.imdb.com/list/ls057313912/,,,0
500,leon the professional,IMDb Top 250,https://www.imdb.com/chart/top,,,0
610,pirates of the caribbean,Pirates of the Caribbean Film Series,https://www.imdb.com/list/ls059602270/,,,0
740,star wars,STAR WARS FILMS (1977-2015),https://www.imdb.com/list/ls070150896/,,,0


In [242]:
movie_info_update(movie_info, 500, 'leon the professional')

Processing leon the professional...


In [243]:
movie_info[movie_info['info_all_collected'] == 0]

Unnamed: 0,original_title,imdb_title,imdb_url,genre,pic_url,info_all_collected
228,die hard,Die Hard Series,https://www.imdb.com/list/ls057313912/,,,0
610,pirates of the caribbean,Pirates of the Caribbean Film Series,https://www.imdb.com/list/ls059602270/,,,0
740,star wars,STAR WARS FILMS (1977-2015),https://www.imdb.com/list/ls070150896/,,,0


In [248]:
movie_info_update(movie_info, 228, 'die hard 1988')
movie_info_update(movie_info, 610, 'pirates of the caribbean black pearl')
movie_info_update(movie_info, 740, 'Star Wars episode iv')

Processing die hard 1988...
Processing pirates of the caribbean black pearl...
Processing Star Wars episode iv...


There are no nulls, huzzah!

In [249]:
movie_info[movie_info['info_all_collected'] == 0]

Unnamed: 0,original_title,imdb_title,imdb_url,genre,pic_url,info_all_collected


In [251]:
movie_info.isnull().sum().sum()

0

Looking for duplicate movies based on their `imdb_title`.

In [267]:
len(movie_info[movie_info.duplicated('imdb_title')].sort_values('imdb_title'))

51

There are 51 rows whose `imdb_title` repeats that of an earlier row. Since the `original_title` column came from the `mov_combo` dataframe, I will merge the two dataframes and then remove the duplicate movies; otherwise the same movie's characters would likely appear more than once under different title spellings.
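
How `duplicated` counts depends on its `keep` argument: the default flags every repeat after the first occurrence, while `keep=False` flags all members of a duplicated group. A small sketch with placeholder titles:

```python
import pandas as pd

toy = pd.DataFrame({
    "original_title": ["movie a", "movie a!", "movie c",
                       "movie b", "movie b 2", "movie b 3"],
    "imdb_title": ["Movie A", "Movie A", "Movie C",
                   "Movie B", "Movie B", "Movie B"],
})

# Default keep='first': the first occurrence is not flagged
n_repeats = toy.duplicated("imdb_title").sum()

# keep=False: every row that shares its imdb_title with another row is flagged
n_in_groups = toy.duplicated("imdb_title", keep=False).sum()

print(n_repeats, n_in_groups)
```

The count of 51 above uses the default, so it counts repeats rather than all rows involved in a duplicate group.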

In [289]:
movie_info.to_csv('../data/movie_info.csv', index_label = False)

## Merging `movie_info` and `mov_combo`

I am going to merge the dataframes on their respective movie title columns. I first need to check that the titles match by taking the set difference of the movie titles.

In [276]:
# Check that the number of movie titles matches
len(movie_info) == len(mov_combo['title'].unique())

True

In [278]:
set(movie_info['original_title']).difference(set(mov_combo['title'].unique()))

set()

An empty set indicates that I can go ahead and merge the dataframes together.
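
Note that `difference` is one-directional: it only lists titles in the first set that are missing from the second. Combined with the equal counts checked above, an empty result is enough here, but a symmetric difference checks both directions at once. A sketch with toy sets:

```python
left = {"alien", "antz", "cube"}
right = {"alien", "antz", "dune"}

# One-directional: items in left that are missing from right
only_left = left.difference(right)

# Both directions at once
either_side = left.symmetric_difference(right)

print(only_left)
print(sorted(either_side))
```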

In [290]:
mov_combo_final  = mov_combo.merge(movie_info, left_on = 'title', right_on = 'original_title')
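
`merge` can also check these assumptions itself. The sketch below, on toy frames, uses the `validate` and `indicator` parameters of `DataFrame.merge`. It is not what the notebook ran (the cell above uses the default inner join), and the toy rows are placeholders.

```python
import pandas as pd

lines = pd.DataFrame({"title": ["alien", "alien", "antz"],
                      "character": ["ripley", "ash", "z"]})
info = pd.DataFrame({"original_title": ["alien", "antz"],
                     "imdb_title": ["Alien (1979)", "Antz (1998)"]})

merged = lines.merge(
    info,
    left_on="title",
    right_on="original_title",
    how="left",
    validate="many_to_one",  # raises MergeError if info has repeated titles
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

With a left join, any dialogue row whose title has no match shows up in `unmatched` instead of silently disappearing.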

In [285]:
len(mov_combo_final['original_title'].unique())

1075

There are several duplicate movies with slightly different `original_title` values. For each pair of duplicates, I will remove the version with the lower word count.

In [357]:
#mov_combo_final.to_pickle('../data/mov_combo_final.pkl')

In [313]:
dups = movie_info[movie_info.duplicated('imdb_title',keep = False)].sort_values('imdb_title')
dups.head(6)

Unnamed: 0,original_title,imdb_title,imdb_url,genre,pic_url,info_all_collected
29,airplane 2 the sequel,Airplane II: The Sequel (1982),https://www.imdb.com/title/tt0083530/,"Comedy,Sci-Fi",https://m.media-amazon.com/images/M/MV5BMmVkYj...,1
30,airplane ii: the sequel,Airplane II: The Sequel (1982),https://www.imdb.com/title/tt0083530/,"Comedy,Sci-Fi",https://m.media-amazon.com/images/M/MV5BMmVkYj...,1
28,airplane,Airplane! (1980),https://www.imdb.com/title/tt0080339/,Comedy,https://m.media-amazon.com/images/M/MV5BZjA3Yj...,1
31,airplane!,Airplane! (1980),https://www.imdb.com/title/tt0080339/,Comedy,https://m.media-amazon.com/images/M/MV5BZjA3Yj...,1
34,alien,Alien (1979),https://www.imdb.com/title/tt0078748/,"Horror,Sci-Fi",https://m.media-amazon.com/images/M/MV5BMmQ2Mm...,1
38,alien vs,Alien (1979),https://www.imdb.com/title/tt0078748/,"Horror,Sci-Fi",https://m.media-amazon.com/images/M/MV5BMmQ2Mm...,1


In [334]:
dup_titles = list(dups['original_title'])

# printing the mask expression so it can be pasted into the next cell
for position, title in enumerate(dup_titles):
    quotes = '"'
    if position != len(dup_titles)-1:
        print(f"(mov_combo_final['original_title'] == {quotes}{title}{quotes})|")
    else:
        print(f"(mov_combo_final['original_title'] == {quotes}{title}{quotes})")

(mov_combo_final['original_title'] == "airplane 2 the sequel")|
(mov_combo_final['original_title'] == "airplane ii: the sequel")|
(mov_combo_final['original_title'] == "airplane")|
(mov_combo_final['original_title'] == "airplane!")|
(mov_combo_final['original_title'] == "alien")|
(mov_combo_final['original_title'] == "alien vs")|
(mov_combo_final['original_title'] == "american werewolf in london")|
(mov_combo_final['original_title'] == "an american werewolf in london")|
(mov_combo_final['original_title'] == "austin powers   international man of mystery")|
(mov_combo_final['original_title'] == "austin powers: international man of mystery")|
(mov_combo_final['original_title'] == "bachelor party")|
(mov_combo_final['original_title'] == "the bachelor party")|
(mov_combo_final['original_title'] == "beavis and butt-head do america")|
(mov_combo_final['original_title'] == "beavis and butt head do america")|
(mov_combo_final['original_title'] == "blast from the past")|
(mov_combo_final['origin

In [336]:
mov_combo_dup_title = mov_combo_final[
(mov_combo_final['original_title'] == "airplane 2 the sequel")|
(mov_combo_final['original_title'] == "airplane ii: the sequel")|
(mov_combo_final['original_title'] == "airplane")|
(mov_combo_final['original_title'] == "airplane!")|
(mov_combo_final['original_title'] == "alien")|
(mov_combo_final['original_title'] == "alien vs")|
(mov_combo_final['original_title'] == "american werewolf in london")|
(mov_combo_final['original_title'] == "an american werewolf in london")|
(mov_combo_final['original_title'] == "austin powers   international man of mystery")|
(mov_combo_final['original_title'] == "austin powers: international man of mystery")|
(mov_combo_final['original_title'] == "bachelor party")|
(mov_combo_final['original_title'] == "the bachelor party")|
(mov_combo_final['original_title'] == "beavis and butt-head do america")|
(mov_combo_final['original_title'] == "beavis and butt head do america")|
(mov_combo_final['original_title'] == "blast from the past")|
(mov_combo_final['original_title'] == "the blast from the past")|
(mov_combo_final['original_title'] == "cinema paradiso")|
(mov_combo_final['original_title'] == "nuovo cinema paradiso")|
(mov_combo_final['original_title'] == "crazy love")|
(mov_combo_final['original_title'] == "crazylove")|
(mov_combo_final['original_title'] == "dragon slayer")|
(mov_combo_final['original_title'] == "dragonslayer")|
(mov_combo_final['original_title'] == "ed tv")|
(mov_combo_final['original_title'] == "edtv")|
(mov_combo_final['original_title'] == "el mariachi")|
(mov_combo_final['original_title'] == "mariachi, el")|
(mov_combo_final['original_title'] == "evil dead ii dead by dawn")|
(mov_combo_final['original_title'] == "evil dead 2: dead before dawn")|
(mov_combo_final['original_title'] == "face/off")|
(mov_combo_final['original_title'] == "face off")|
(mov_combo_final['original_title'] == "ghostbusters 2")|
(mov_combo_final['original_title'] == "ghostbusters ii")|
(mov_combo_final['original_title'] == "glengarry glen gross")|
(mov_combo_final['original_title'] == "glengarry glen ross")|
(mov_combo_final['original_title'] == "gone in 60 seconds")|
(mov_combo_final['original_title'] == "gone in sixty seconds")|
(mov_combo_final['original_title'] == "grosse point blank")|
(mov_combo_final['original_title'] == "grosse pointe blank")|
(mov_combo_final['original_title'] == "halloween the curse of michael myers")|
(mov_combo_final['original_title'] == "halloween: the curse of michael myers")|
(mov_combo_final['original_title'] == "hellbound: hellraiser ii")|
(mov_combo_final['original_title'] == "hellbound hellraiser ii")|
(mov_combo_final['original_title'] == "hellraiser 3 hell on earth")|
(mov_combo_final['original_title'] == "hellraiser iii: hell on earth")|
(mov_combo_final['original_title'] == "interview with the vampire")|
(mov_combo_final['original_title'] == "interview with the vampire: the vampire chronicles")|
(mov_combo_final['original_title'] == "l'avventura")|
(mov_combo_final['original_title'] == "avventura, l' (the adventure)")|
(mov_combo_final['original_title'] == "love & basketball")|
(mov_combo_final['original_title'] == "love and basketball")|
(mov_combo_final['original_title'] == "leon the professional")|
(mov_combo_final['original_title'] == "leon")|
(mov_combo_final['original_title'] == "mighty morphin power rangers")|
(mov_combo_final['original_title'] == "mighty morphin power rangers the movie")|
(mov_combo_final['original_title'] == "o brother where art thou%3f")|
(mov_combo_final['original_title'] == "o brother, where art thou?")|
(mov_combo_final['original_title'] == "punch drunk love")|
(mov_combo_final['original_title'] == "punch-drunk love")|
(mov_combo_final['original_title'] == "rambo first blood ii the mission")|
(mov_combo_final['original_title'] == "rambo: first blood part ii")|
(mov_combo_final['original_title'] == "romeo & juliet")|
(mov_combo_final['original_title'] == "romeo and juliet")|
(mov_combo_final['original_title'] == "se7en")|
(mov_combo_final['original_title'] == "seven")|
(mov_combo_final['original_title'] == "south park")|
(mov_combo_final['original_title'] == "south park: bigger longer & uncut")|
(mov_combo_final['original_title'] == "spider man")|
(mov_combo_final['original_title'] == "spider-man")|
(mov_combo_final['original_title'] == "star trek ii the wrath of khan")|
(mov_combo_final['original_title'] == "star trek: the wrath of khan")|
(mov_combo_final['original_title'] == "star wars")|
(mov_combo_final['original_title'] == "star wars a new hope")|
(mov_combo_final['original_title'] == "star wars return of the jedi")|
(mov_combo_final['original_title'] == "star wars: episode vi - return of the jedi")|
(mov_combo_final['original_title'] == "sugar and spice")|
(mov_combo_final['original_title'] == "sugar & spice")|
(mov_combo_final['original_title'] == "terminator 2 judgement day")|
(mov_combo_final['original_title'] == "terminator 2: judgment day")|
(mov_combo_final['original_title'] == "terminator 2: judgment day - omitted scenes")|
(mov_combo_final['original_title'] == "the adventures of buckaroo banzai across the 8th dimension")|
(mov_combo_final['original_title'] == "the adventures of buckaroo banzai across the eighth dimension")|
(mov_combo_final['original_title'] == "la battaglia di algeri")|
(mov_combo_final['original_title'] == "the battle of algiers")|
(mov_combo_final['original_title'] == "le grand bleu")|
(mov_combo_final['original_title'] == "the big blue")|
(mov_combo_final['original_title'] == "jurassic park the lost world")|
(mov_combo_final['original_title'] == "the lost world: jurassic park")|
(mov_combo_final['original_title'] == "the majestic (the bijou)")|
(mov_combo_final['original_title'] == "the majestic")|
(mov_combo_final['original_title'] == "the x files")|
(mov_combo_final['original_title'] == "the x files fight the future")|
(mov_combo_final['original_title'] == "three kings")|
(mov_combo_final['original_title'] == "three kings (spoils of war)")|
(mov_combo_final['original_title'] == "twin peaks")|
(mov_combo_final['original_title'] == "twin peaks: fire walk with me")|
(mov_combo_final['original_title'] == "u turn")|
(mov_combo_final['original_title'] == "u-turn")|
(mov_combo_final['original_title'] == "who framed roger rabbit")|
(mov_combo_final['original_title'] == "who framed roger rabbit%3f")|
(mov_combo_final['original_title'] == "x men")|
(mov_combo_final['original_title'] == "x-men")]
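
The long chain of `==` comparisons is equivalent to a single `isin` call on the list of titles, which avoids printing and pasting the expression. A sketch on a toy dataframe (the rows and the `magnolia` word count are placeholders for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "original_title": ["airplane", "airplane!", "magnolia", "alien"],
    "word_count": [7458, 2869, 100, 8559],
})
dup_titles = ["airplane", "airplane!", "alien"]

# Rows whose title appears in dup_titles
in_dups = toy[toy["original_title"].isin(dup_titles)]

# The complement, as needed when removing titles
not_in_dups = toy[~toy["original_title"].isin(dup_titles)]

print(in_dups)
print(not_in_dups)
```

In the notebook's terms this would read `mov_combo_final[mov_combo_final['original_title'].isin(dup_titles)]`.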

In [367]:
title_wc = pd.DataFrame(mov_combo_dup_title.sort_values('original_title').groupby('original_title')['word_count'].sum())
title_wc = title_wc.reset_index()

# merging title_wc with movie_info to sort by imdb_title
title_wc  = title_wc.merge(movie_info[['original_title','imdb_title']], on = 'original_title')

I will now sort the `title_wc` dataframe by IMDb title and word count, from smallest to largest.
I will then extract the `original_title` values with the lower word count so that they can be removed from the `mov_combo_final` dataframe.

In [369]:
title_wc = title_wc.sort_values(['imdb_title','word_count'], ascending = True)

title_wc.head(8)

Unnamed: 0,original_title,word_count,imdb_title
1,airplane 2 the sequel,9025,Airplane II: The Sequel (1982)
2,airplane ii: the sequel,9180,Airplane II: The Sequel (1982)
3,airplane!,2869,Airplane! (1980)
0,airplane,7458,Airplane! (1980)
4,alien,8559,Alien (1979)
5,alien vs,12062,Alien (1979)
6,american werewolf in london,7364,An American Werewolf in London (1981)
7,an american werewolf in london,7541,An American Werewolf in London (1981)


In [371]:
movies_to_remove = list(title_wc.drop_duplicates(subset='imdb_title')['original_title'])
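
One caveat: `drop_duplicates` keeps only the first row per `imdb_title`, so when three variants share a title (the three 'terminator 2' spellings in the mask above appear to be such a case), only the smallest one is selected for removal. As a sketch of an alternative, the code below keeps the variant with the highest word count in each group and marks all the others for removal. The toy titles and numbers are invented, and `toy_wc` is a stand-in for `title_wc`.

```python
import pandas as pd

# Toy stand-in for title_wc; word counts are invented
toy_wc = pd.DataFrame({
    "original_title": ["movie a", "movie a!", "movie b", "movie b 2", "movie b 3"],
    "imdb_title": ["Movie A", "Movie A", "Movie B", "Movie B", "Movie B"],
    "word_count": [500, 300, 100, 900, 400],
})

# Index label of the row with the highest word count in each imdb_title group
keep_idx = toy_wc.groupby("imdb_title")["word_count"].idxmax()

# Every other variant is marked for removal
to_remove = toy_wc.drop(index=keep_idx.tolist())["original_title"].tolist()

print(to_remove)
```

For groups of exactly two titles this gives the same result as the sort-and-drop approach used in the notebook.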

In [375]:
# creating a mask to remove the movies
for position, title in enumerate(movies_to_remove):
    quotes = '"'
    if position != len(movies_to_remove)-1:
        print(f"(mov_combo_final['original_title'] != {quotes}{title}{quotes})&")
    else:
        print(f"(mov_combo_final['original_title'] != {quotes}{title}{quotes})")

(mov_combo_final['original_title'] != "airplane 2 the sequel")&
(mov_combo_final['original_title'] != "airplane!")&
(mov_combo_final['original_title'] != "alien")&
(mov_combo_final['original_title'] != "american werewolf in london")&
(mov_combo_final['original_title'] != "austin powers   international man of mystery")&
(mov_combo_final['original_title'] != "bachelor party")&
(mov_combo_final['original_title'] != "beavis and butt-head do america")&
(mov_combo_final['original_title'] != "blast from the past")&
(mov_combo_final['original_title'] != "nuovo cinema paradiso")&
(mov_combo_final['original_title'] != "crazy love")&
(mov_combo_final['original_title'] != "dragon slayer")&
(mov_combo_final['original_title'] != "edtv")&
(mov_combo_final['original_title'] != "el mariachi")&
(mov_combo_final['original_title'] != "evil dead ii dead by dawn")&
(mov_combo_final['original_title'] != "face/off")&
(mov_combo_final['original_title'] != "ghostbusters ii")&
(mov_combo_final['original_title'] 

In [380]:
mov_combo_final = mov_combo_final[(mov_combo_final['original_title'] != "airplane 2 the sequel")&
(mov_combo_final['original_title'] != "airplane!")&
(mov_combo_final['original_title'] != "alien")&
(mov_combo_final['original_title'] != "american werewolf in london")&
(mov_combo_final['original_title'] != "austin powers   international man of mystery")&
(mov_combo_final['original_title'] != "bachelor party")&
(mov_combo_final['original_title'] != "beavis and butt-head do america")&
(mov_combo_final['original_title'] != "blast from the past")&
(mov_combo_final['original_title'] != "nuovo cinema paradiso")&
(mov_combo_final['original_title'] != "crazy love")&
(mov_combo_final['original_title'] != "dragon slayer")&
(mov_combo_final['original_title'] != "edtv")&
(mov_combo_final['original_title'] != "el mariachi")&
(mov_combo_final['original_title'] != "evil dead ii dead by dawn")&
(mov_combo_final['original_title'] != "face/off")&
(mov_combo_final['original_title'] != "ghostbusters ii")&
(mov_combo_final['original_title'] != "glengarry glen gross")&
(mov_combo_final['original_title'] != "gone in sixty seconds")&
(mov_combo_final['original_title'] != "grosse point blank")&
(mov_combo_final['original_title'] != "halloween: the curse of michael myers")&
(mov_combo_final['original_title'] != "hellbound: hellraiser ii")&
(mov_combo_final['original_title'] != "hellraiser 3 hell on earth")&
(mov_combo_final['original_title'] != "interview with the vampire")&
(mov_combo_final['original_title'] != "l'avventura")&
(mov_combo_final['original_title'] != "love & basketball")&
(mov_combo_final['original_title'] != "leon the professional")&
(mov_combo_final['original_title'] != "mighty morphin power rangers the movie")&
(mov_combo_final['original_title'] != "o brother where art thou%3f")&
(mov_combo_final['original_title'] != "punch drunk love")&
(mov_combo_final['original_title'] != "rambo: first blood part ii")&
(mov_combo_final['original_title'] != "romeo and juliet")&
(mov_combo_final['original_title'] != "seven")&
(mov_combo_final['original_title'] != "south park: bigger longer & uncut")&
(mov_combo_final['original_title'] != "spider-man")&
(mov_combo_final['original_title'] != "star trek: the wrath of khan")&
(mov_combo_final['original_title'] != "star wars")&
(mov_combo_final['original_title'] != "star wars: episode vi - return of the jedi")&
(mov_combo_final['original_title'] != "sugar and spice")&
(mov_combo_final['original_title'] != "terminator 2: judgment day - omitted scenes")&
(mov_combo_final['original_title'] != "the adventures of buckaroo banzai across the 8th dimension")&
(mov_combo_final['original_title'] != "the battle of algiers")&
(mov_combo_final['original_title'] != "le grand bleu")&
(mov_combo_final['original_title'] != "jurassic park the lost world")&
(mov_combo_final['original_title'] != "the majestic (the bijou)")&
(mov_combo_final['original_title'] != "the x files fight the future")&
(mov_combo_final['original_title'] != "three kings")&
(mov_combo_final['original_title'] != "twin peaks")&
(mov_combo_final['original_title'] != "u turn")&
(mov_combo_final['original_title'] != "who framed roger rabbit%3f")&
(mov_combo_final['original_title'] != "x-men")]
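As an alternative to pasting the printed conditions, the same kind of filter can be written with `Series.isin`. This is only a sketch on a toy dataframe, not the cell that was actually run; `toy_final` and `toy_to_remove` are hypothetical stand-ins for `mov_combo_final` and `movies_to_remove`, and it assumes the removal list matches the titles in the hand-built mask above.

```python
import pandas as pd

# Hypothetical miniature of mov_combo_final and movies_to_remove.
toy_final = pd.DataFrame({
    'original_title': ['airplane', 'airplane!', 'alien', 'alien vs'],
    'character': ['striker', 'striker', 'ripley', 'lex'],
})
toy_to_remove = ['airplane!', 'alien']

# isin() marks rows whose title is in the removal list; ~ inverts the mask,
# so only rows with titles NOT in the list are kept.
toy_kept = toy_final[~toy_final['original_title'].isin(toy_to_remove)]
toy_kept = toy_kept.reset_index(drop=True)
print(toy_kept)
```

Filtering on the list directly avoids the copy and paste step, so the mask cannot drift out of sync with `movies_to_remove`.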

In [382]:
mov_combo_final.reset_index(drop=True,inplace=True)

Dropping the unnecessary columns `info_all_collected`, `original_title`, and `title`: every remaining observation has all of its information collected, and the official IMDb movie titles replace the other two title columns.

In [385]:
# Confirming that every remaining row has info_all_collected == 1 before dropping the column
mov_combo_final['info_all_collected'].value_counts()

1    77400
Name: info_all_collected, dtype: int64

In [400]:
# Counting rows where title and original_title differ (0 means title is redundant)
(mov_combo_final['title'] != mov_combo_final['original_title']).sum()

0

In [401]:
# Passing axis as a keyword; the positional form drop(labels, 1) is rejected by pandas 2.0+
mov_combo_final = mov_combo_final.drop(['info_all_collected', 'title'], axis=1)
mov_combo_final.reset_index(drop=True, inplace=True)

In [402]:
mov_combo_final = mov_combo_final.drop(['original_title'], axis=1)

Changing `genre` values into lists for EDA purposes.

In [403]:
mov_combo_final['genre'] = mov_combo_final['genre'].map(lambda x: x.split(','))
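One thing to watch with `split(',')`: if the scraped genre strings have a space after each comma, the resulting list items keep that leading space. I have not verified which form the scraped strings take, so the sketch below is only a precaution on made-up values; `toy_genre` is a hypothetical series, not the real `genre` column.

```python
import pandas as pd

# Hypothetical genre strings, with and without a space after the comma.
toy_genre = pd.Series(['Comedy, Drama, Romance', 'Horror,Sci-Fi'])

# Split on commas, then strip surrounding whitespace from each genre.
toy_genre_lists = toy_genre.map(lambda x: [g.strip() for g in x.split(',')])
print(toy_genre_lists.tolist())
```

Stripping is harmless when there is no extra whitespace, so the same line works for either form of the input.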

Rearranging the columns to match the `mov_combo_final` layout described at the top of the notebook.

In [425]:
mov_combo_final = mov_combo_final[['imdb_title', 'character','text','tokenized_text','word_count','vader','genre','imdb_url','pic_url']]

Removing the rows where a character's lines contain 0 words.

In [441]:
len(mov_combo_final[mov_combo_final['word_count'] == 0])

1273

In [443]:
mov_combo_final = mov_combo_final[mov_combo_final['word_count']>0]
mov_combo_final.reset_index(drop=True,inplace=True)

Saving `mov_combo_final` as a pickle to preserve the dictionaries and lists within the dataframe, and saving the `movie_info` dataframe as a CSV.

In [444]:
mov_combo_final.to_pickle('../data/mov_combo_final.pkl')
movie_info.to_csv('../data/movie_info.csv', index_label = False)
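To illustrate why the pickle format matters here, the following sketch round-trips a toy dataframe through both formats in a temporary directory. The dataframe `toy` and the file names are made up for the example; only the column names echo `mov_combo_final`.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with a list column and a dict column, like genre and vader.
toy = pd.DataFrame({
    'imdb_title': ['Alien (1979)'],
    'genre': [['Horror', 'Sci-Fi']],
    'vader': [{'neg': 0.0, 'neu': 1.0, 'pos': 0.0}],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'toy.pkl')
    csv_path = os.path.join(tmp, 'toy.csv')

    # Write the same dataframe in both formats, then read each back.
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle restores the Python objects; CSV returns their string representation.
pickle_type = type(from_pickle.loc[0, 'genre'])
csv_type = type(from_csv.loc[0, 'genre'])
print(pickle_type, csv_type)
```

After a CSV round trip the lists and dictionaries would have to be parsed back from strings, which is the extra step the pickle avoids.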

# Proceed to Notebook 3: EDA

Now that `mov_combo_final` is cleaned up and ready to go, I will explore different trends in the movie lines.