# Scrape movie sequel data from Box Office Mojo 

## 'Reach' features to scrape from each movie's individual webpage:
> * genre
> * lead actor
> * weeks in release
> * director
> * writer
> * MPAA rating (G, PG-13, etc.)

In [26]:
import requests
import time
import random
import pickle
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

Get data from Box Office Mojo. Its franchise list does not lump the Marvel Cinematic Universe (or similar shared universes) into a single series, and it only includes movies with a theatrical box office run, since direct-to-video sales data is very spotty.

## Scrape list of franchises and franchise summary webpages

In [2]:
url = 'http://www.boxofficemojo.com/franchises/'
response = requests.get(url)
response.raise_for_status()  # stop here if the request failed
# print(response.text)
page = response.text
soup = BeautifulSoup(page,"lxml")
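
A failed request would otherwise be parsed as if it were the franchise index. Below is a minimal sketch of a fetch helper that raises on HTTP errors before parsing. The name `fetch_soup`, the `timeout` value, and the injectable `get` argument are assumptions made for illustration, and the sketch uses the built-in `html.parser` rather than `lxml` so it has one fewer dependency.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=30, get=requests.get):
    """Fetch `url` and return the parsed HTML, raising on HTTP errors.

    `get` is injectable (hypothetical design choice) so the helper can be
    exercised without touching the network.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return BeautifulSoup(response.text, "html.parser")


# Usage (performs a real request, so it is left commented out):
# soup = fetch_soup('http://www.boxofficemojo.com/franchises/')
```

Raising early keeps an error page from silently producing an empty or misaligned table further down the pipeline.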

Get list of webpage links for franchises

In [3]:
site_link = 'http://www.boxofficemojo.com/franchises'
inflation_link = '&adjust_yr=2018&p=.htm'
franchise_table = soup.find_all('table')[3]
franchise_links = []
franchise_list = []
for franchise in franchise_table.find_all('tr')[1:]:
    franchise_links.append(site_link + (franchise.find('a')['href'][1:]) + inflation_link)
    franchise_list.append(franchise.find('a').text)
print(franchise_links[:5])
print(franchise_list[:5])

['http://www.boxofficemojo.com/franchises/chart/?id=3ninjas.htm&adjust_yr=2018&p=.htm', 'http://www.boxofficemojo.com/franchises/chart/?id=300.htm&adjust_yr=2018&p=.htm', 'http://www.boxofficemojo.com/franchises/chart/?id=agathachristie.htm&adjust_yr=2018&p=.htm', 'http://www.boxofficemojo.com/franchises/chart/?id=alexcross.htm&adjust_yr=2018&p=.htm', 'http://www.boxofficemojo.com/franchises/chart/?id=aliceinwonderland.htm&adjust_yr=2018&p=.htm']
['3 Ninjas', '300', 'Agatha Christie', 'Alex Cross', 'Alice in Wonderland']
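
The links above are built by slicing the first character off each `href` and concatenating strings. As a hedged alternative, `urllib.parse.urljoin` resolves relative links against the index URL. This sketch assumes the table's `href` values look like `./chart/?id=3ninjas.htm` (inferred from the slicing, not confirmed by the page source); `build_franchise_url`, `BASE_URL`, and `INFLATION_SUFFIX` are illustrative names.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.boxofficemojo.com/franchises/'
INFLATION_SUFFIX = '&adjust_yr=2018&p=.htm'


def build_franchise_url(href, base=BASE_URL, suffix=INFLATION_SUFFIX):
    """Resolve a relative franchise link and append the inflation parameters."""
    return urljoin(base, href) + suffix


print(build_franchise_url('./chart/?id=3ninjas.htm'))
# http://www.boxofficemojo.com/franchises/chart/?id=3ninjas.htm&adjust_yr=2018&p=.htm
```

Note that the base URL needs its trailing slash here; without it, `urljoin` would resolve the link relative to the site root instead of `/franchises/`.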


In [4]:
print(f'There are {len(franchise_links)} movie franchises with at least one box office sequel.')

There are 254 movie franchises with at least one box office sequel.


## Webscrape to data frame pipeline

Scrape basic movie data (7 features) from each franchise's summary page: rank, title, studio, inflation-adjusted domestic gross, theaters, release date, and franchise name. Some sequels share the exact same title, so a column holding each movie's webpage URL is also added as a unique identifier.
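
Most of the cleaning in this step is turning display strings such as `$64,010,300` or `1,954` into integers. A minimal sketch of that conversion as a standalone helper follows; `parse_int` is a hypothetical name, and returning `None` for a placeholder like `-` is a design choice of this sketch (the pipeline below stores `0` for a missing theater count instead).

```python
def parse_int(text):
    """Convert a display string like '$64,010,300' or '1,954' to an int.

    Returns None when nothing numeric is left (for example '-' or 'N/A').
    """
    digits = text.strip().lstrip('$').replace(',', '')
    return int(digits) if digits.isdigit() else None


print(parse_int('$64,010,300'))  # 64010300
print(parse_int('1,954'))        # 1954
print(parse_int('-'))            # None
```

Keeping missing values distinct from zero avoids treating an unlisted theater count as a release in zero theaters when the data is analyzed later.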

In [13]:
franchise_name = []
movie_title = []
rank = []
studio = []
adjusted_domestic_gross = []
n_theaters = []
release = []
movie_webpage = []

for url in franchise_links: 
    response = requests.get(url)
    # print(response.status_code)
    page = response.text
    soup = BeautifulSoup(page,"lxml")

    franchise_table = soup.find_all('table')[4]
    
    for row in franchise_table.find_all('tr')[1:-2]:  # skip the header row and the last two rows
        cells = row.find_all('td')
        rank.append(int(cells[0].text))
        movie_title.append(cells[1].text)
        studio.append(cells[2].text)
        # Strip the leading '$' and the thousands separators before converting
        adj_domestic_gross_str = cells[3].text[1:]
        adjusted_domestic_gross.append(int(adj_domestic_gross_str.replace(',', '')))
        franchise_name.append(soup.find('h1').text)
        movie_webpage.append(site_link + row.find('a')['href'])
        # A '-' in the theaters column is stored as 0
        theaters = cells[4].text
        n_theaters.append(int(theaters.replace(',', '').replace('-', '0')))
        theater_release_str = cells[7].text
        if theater_release_str != 'N/A':
            release.append(pd.to_datetime(theater_release_str))
        else:
            release.append(pd.NaT)  # no release date listed
    
    sec = random.uniform(5,12)
    print(f'Next scrape in {int(sec)} seconds. Just finished {franchise_name[-1]}.')
    time.sleep(sec)    


df = pd.DataFrame(
    {'franchise_name': franchise_name,
     'movie_title': movie_title,
     'rank': rank,
     'studio': studio,
     'adjusted_domestic_gross': adjusted_domestic_gross,
     'release_date': release,
     'theaters': n_theaters,
     'movie_webpage': movie_webpage
    })

Next scrape in 5 seconds. Just finished 3 Ninjas.
Next scrape in 11 seconds. Just finished 300.
Next scrape in 8 seconds. Just finished Agatha Christie.
Next scrape in 5 seconds. Just finished Alex Cross.
Next scrape in 8 seconds. Just finished Alice in Wonderland.
Next scrape in 7 seconds. Just finished Alien.
Next scrape in 8 seconds. Just finished Alvin and the Chipmunks.
Next scrape in 8 seconds. Just finished American Ninja.
Next scrape in 5 seconds. Just finished American Pie.
Next scrape in 10 seconds. Just finished Amityville.
Next scrape in 7 seconds. Just finished Angry Birds.
Next scrape in 7 seconds. Just finished Annabelle.
Next scrape in 9 seconds. Just finished Ant-Man.
Next scrape in 9 seconds. Just finished Arthur.
Next scrape in 6 seconds. Just finished Atlas Shrugged Franchise.
Next scrape in 7 seconds. Just finished Austin Powers.
Next scrape in 5 seconds. Just finished Avatar.
Next scrape in 5 seconds. Just finished Avengers.
[output truncated]

Next scrape in 9 seconds. Just finished The Mummy.
Next scrape in 11 seconds. Just finished The Muppets.
Next scrape in 9 seconds. Just finished My Big Fat Greek Wedding.
Next scrape in 5 seconds. Just finished The Naked Gun.
Next scrape in 10 seconds. Just finished Neighbors.
Next scrape in 8 seconds. Just finished The Neverending Story.
Next scrape in 9 seconds. Just finished Night at the Museum.
Next scrape in 5 seconds. Just finished Nightmare on Elm Street.
Next scrape in 5 seconds. Just finished Now You See Me.
Next scrape in 7 seconds. Just finished The Nut Job.
Next scrape in 11 seconds. Just finished Ocean's 11.
Next scrape in 8 seconds. Just finished Oh, God!.
Next scrape in 8 seconds. Just finished The Omen.
Next scrape in 11 seconds. Just finished Ong Bak.
Next scrape in 8 seconds. Just finished Ouija.
Next scrape in 8 seconds. Just finished Pacific Rim.
Next scrape in 10 seconds. Just finished Paddington Bear.
Next scrape in 9 seconds. Just finished Paranormal Activity.
[output truncated]

In [18]:
with open('raw_movie_data.pickle', 'wb') as to_write:
    pickle.dump(df, to_write)
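
Since the scrape takes a long time, it is worth confirming that the saved file loads back intact. Here is a small sketch of the round trip using pandas' own `to_pickle` and `read_pickle`; the two-row `df_example` frame and the temporary path are stand-ins for illustration, not the real scraped data.

```python
import os
import tempfile

import pandas as pd

# Stand-in frame with a few of the scraped columns (values from the 3 Ninjas rows)
df_example = pd.DataFrame({
    'franchise_name': ['3 Ninjas', '3 Ninjas'],
    'movie_title': ['3 Ninjas', '3 Ninjas Kick Back'],
    'adjusted_domestic_gross': [64010300, 25855900],
    'release_date': pd.to_datetime(['1992-08-07', '1994-05-06']),
})

# Write to a temporary directory so the real pickle is not overwritten
path = os.path.join(tempfile.mkdtemp(), 'raw_movie_data.pickle')
df_example.to_pickle(path)

restored = pd.read_pickle(path)
print(restored.equals(df_example))  # True
```

Unlike a CSV export, a pickle preserves column dtypes, so `release_date` comes back as datetimes without re-parsing.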

In [27]:
df.head()

| | franchise_name | movie_title | rank | studio | adjusted_domestic_gross | release_date | theaters | movie_webpage |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 Ninjas | 3 Ninjas | 1 | BV | 64010300 | 1992-08-07 | 1954 | http://www.boxofficemojo.com/franchises/movies... |
| 1 | 3 Ninjas | 3 Ninjas Kick Back | 2 | TriS | 25855900 | 1994-05-06 | 2043 | http://www.boxofficemojo.com/franchises/movies... |
| 2 | 3 Ninjas | 3 Ninjas Knuckle Up | 3 | Sony | 870700 | 1995-03-10 | 52 | http://www.boxofficemojo.com/franchises/movies... |
| 3 | 3 Ninjas | 3 Ninjas: High Noon at Mega Mountain | 4 | Sony | 734000 | 1998-04-10 | 120 | http://www.boxofficemojo.com/franchises/movies... |
| 4 | 300 | 300 | 1 | WB | 280411700 | 2007-03-09 | 3280 | http://www.boxofficemojo.com/franchises/movies... |
