# Box Office Mojo Data Scraper

## Library Importation

Basic libraries for data and HTML manipulation. `lxml` is an HTML and XML processing library, used here for its convenient HTML element selection when web scraping. `tqdm` adds flexible progress bars by wrapping an iterable in the `tqdm()` function.

In [286]:
import requests
from lxml import html
import pandas as pd 
import numpy as np
from bs4 import BeautifulSoup as bs
from tqdm import tqdm, trange

## Pagination Function

This function takes a `requests` response object and converts it into a soup object using the faster `lxml` parser. It then finds the main table on the Box Office Mojo Domestic Box Office page for a given year and returns a list of relative links to individual movie pages, which are later completed and passed to the page scraping function.

In [287]:
def pagination_func(req_url):
    soup = bs(req_url.content, 'lxml')
    
    table = soup.find('table')
    links = [a['href'] for a in table.find_all('a', href=True)]
    pagination_list = []

    substring = '/release'
    for link in links:
        if substring in link:
            pagination_list.append(link)
            
    return pagination_list
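
The filtering idea can be tried offline without hitting the site. The sketch below is not the function above: it swaps BeautifulSoup for the standard library's `html.parser` so it runs with no third-party packages, and the `sample_html` markup and the `ReleaseLinkParser` name are made up for illustration. It keeps only `href` values containing `/release` that sit inside the first table.

```python
from html.parser import HTMLParser


class ReleaseLinkParser(HTMLParser):
    """Collect hrefs containing '/release' from the first <table> only."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the first table
        self.done = False   # True once the first table has closed
        self.links = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == 'table':
            self.depth += 1
        elif tag == 'a' and self.depth > 0:
            href = dict(attrs).get('href')
            if href and '/release' in href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'table' and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True


# Hypothetical markup, loosely shaped like a yearly listing page.
sample_html = """
<a href="/release/rl000/">outside the table</a>
<table>
  <tr><td><a href="/release/rl111/?ref_=x">Movie A</a></td>
      <td><a href="/studio/st1/">Studio</a></td></tr>
  <tr><td><a href="/release/rl222/?ref_=y">Movie B</a></td></tr>
</table>
"""

parser = ReleaseLinkParser()
parser.feed(sample_html)
release_links = parser.links
print(release_links)
```

The link outside the table and the `/studio` link are both dropped, which mirrors the `table.find_all('a', href=True)` plus substring check used in `pagination_func`.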

## Link Retrieval

The following `for` loop takes a hand-typed list of years stored as strings. For each year it builds a `pagination_url` with `str.format()`, requests that page, and passes the response to `pagination_func`, producing one list of relative links per year. The second loop, which follows the `complete_links` initialization, prefixes each relative link with the site's base URL so it can be requested during scraping.

In [288]:
years = ['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']
link_list_by_year = []
for count, year in tqdm(enumerate(years)):
    pagination_url = 'https://www.boxofficemojo.com/year/{}/?grossesOption=calendarGrosses'.format(year)
    pagination = requests.get(pagination_url)
    link_list_by_year.append(pagination_func(pagination))

10it [00:26,  2.66s/it]
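
As a small variation, the year strings could be generated rather than typed by hand. This is only a sketch of the URL-building step; `URL_TEMPLATE` and `pagination_urls` are names introduced here, and no request is made.

```python
# Same URL pattern as the loop above, pulled out into a constant.
URL_TEMPLATE = 'https://www.boxofficemojo.com/year/{}/?grossesOption=calendarGrosses'

# range(2010, 2020) covers 2010 through 2019 inclusive.
years = [str(year) for year in range(2010, 2020)]
pagination_urls = [URL_TEMPLATE.format(year) for year in years]

print(len(pagination_urls))
print(pagination_urls[0])
```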


In [289]:
complete_links = []

In [290]:
for link in link_list_by_year:
    for url in link:
        complete_links.append('https://www.boxofficemojo.com{}'.format(url))
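
An alternative way to complete the relative links is `urllib.parse.urljoin` from the standard library, which handles the slash between base and path. The sketch below uses made-up release paths in place of the scraped `link_list_by_year`; `BASE_URL` is a name introduced for the example.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.boxofficemojo.com'

# Hypothetical stand-in for the scraped data: one list of relative links per year.
link_list_by_year = [
    ['/release/rl111/?ref_=x', '/release/rl222/?ref_=y'],
    ['/release/rl333/?ref_=z'],
]

# Flatten the per-year lists and turn each relative link into a full URL.
complete_links = [
    urljoin(BASE_URL, url)
    for year_links in link_list_by_year
    for url in year_links
]

print(complete_links)
```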

## Page Scraping Function


The function uses the `lxml` library to target the page elements to scrape. `lxml` locates these elements with XPath expressions; the text of the first match is appended to the corresponding list, and if no element matches, the string 'Missing' is appended instead. The function chains `str` methods such as `.replace()` so that each value is cleaned in a single readable line.

In [291]:
def scrape_page(req_page):
    tree = html.fromstring(req_page.content)
    
    titles = tree.xpath('//*[@id="a-page"]/main/div/div[1]/div[1]/div/div/div[2]/h1/text()')
    if titles:
        title.append(titles[0])
    else:
        title.append('Missing')
        
    domestics = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[1]/span[2]/span/text()')
    if domestics:
        domestic.append(domestics[0].replace('$','').replace(',',''))
    else:
        domestic.append('Missing')
    
    internationals = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[2]/span[2]/a/span/text()')
    if internationals:
        international.append(internationals[0].replace('$','').replace(',',''))
    else:
        international.append('Missing')
    
    worldwides = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[3]/span[2]/a/span/text()')
    if worldwides:
        worldwide.append(worldwides[0].replace('$','').replace(',',''))
    else:
        worldwide.append('Missing')
        
    openings = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')
    if openings:
        opening.append(openings[0].replace('$','').replace(',',''))
    else:
        opening.append('Missing')
                   
    opening_theatress = tree.xpath('/html/body/div[1]/main/div/div[3]/div[4]/div[2]/span[2]/text()')
    if opening_theatress:
        opening_theatres.append(opening_theatress[0].replace('\n', '').replace(',','').split()[0])
    else:
        opening_theatres.append('Missing')
    

# Following contents commented out as I debug their functionality.

#     with_budget = ['5', '6', '7']
#     without_budget = ['4', '5', '6']
    
#     substring = '\n'
    
#     MPAAs = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[4]/span[2]/text()')
    
#     if substring in MPAAs[0]:
#         MPAAs = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(with_budget[0]))
#         print(MPAAs)
#         try:
#             if MPAAs:
#                 MPAA.append(MPAAs[0])
#             else:
#                 MPAA.append('Missing')
#         except:
#             MPAA.append('Missing')
        
#         run_times = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(with_budget[1]))
#         print(run_times)
#         try:
#             if run_times:
#                 run_time.append(run_times[0])
#             else:
#                 run_time.append('Missing')
#         except:
#             run_time.append('Missing')
        
#         genress = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(with_budget[2]))
#         print(genress)
#         try:
#             if genress:
#                 genres.append(genress[0].replace('\n','').split())
#             else:
#                 genres.append('Missing')
#         except:
#             genres.append('Missing')
        
#     else:
#         MPAAs = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(without_budget[0]))
#         print(MPAAs)
#         try:
#             if MPAAs:
#                 MPAA.append(MPAAs[0])
#             else:
#                 MPAA.append('Missing')
#         except:
#             MPAA.append('Missing')

        
#         run_times = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(without_budget[1]))
#         print(run_times)
#         try:
#             if run_times:
#                 run_time.append(run_times[0])
#             else:
#                 run_time.append('Missing')
#         except:
#             run_time.append('Missing')
            
#         genress = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(without_budget[2]))
#         print(genress)
#         try:
#             if genress:
#                 genres.append(genress[0].replace('\n','').split())
#             else:
#                 genres.append('Missing')
#         except: 
#             genres.append('Missing')
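
The cleaning pattern repeated throughout `scrape_page` can be exercised on literal values. The helpers below are hypothetical (`clean_amount` and `first_token` do not appear in the notebook) and take the list that an XPath query would return; the sample strings are invented to resemble the scraped text.

```python
def clean_amount(raw, default='Missing'):
    """Strip '$' and ',' from the first match, or return the placeholder."""
    if not raw:
        return default
    return raw[0].replace('$', '').replace(',', '')


def first_token(raw, default='Missing'):
    """Return the first whitespace-separated token of the first match."""
    if not raw:
        return default
    tokens = raw[0].replace('\n', '').replace(',', '').split()
    # Guard against a match that contains only whitespace.
    return tokens[0] if tokens else default


print(clean_amount(['$1,234,567']))
print(clean_amount([]))
print(first_token(['\n            3,401\n            theaters']))
```

Collecting the shared logic in one place would also make a slip such as appending to the wrong list harder to introduce.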

`List` initialization for the `scrape_page()` function.

In [292]:
title = []
domestic = []
international = []
worldwide = []
opening = []
opening_theatres = []
MPAA = []
run_time = []
genres = []

## Scrape Process

In [293]:
# Scrapes every entry in the complete_links list (!!WARNING!!: Will take a while)
# To test first, slice the list, e.g. complete_links[:20] for the first 20 entries

for link in tqdm(complete_links):
    movie = requests.get(link)
    scrape_page(movie)



100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8319/8319 [1:03:53<00:00,  2.17it/s]


## Data Storage

Takes the lists generated by the page scraping function and zips them together. The `zipped` variable is then wrapped in a pandas `dataframe`.

In [298]:
zipped  = zip(title, domestic, international, worldwide, opening, opening_theatres)

In [299]:
scraper_frame = pd.DataFrame(list(zipped), 
                 columns=['Title', 'Domestic_Gross', 'International_Gross', 'Worldwide_Gross', 'Opening_Gross',
                         'Number_of_Opening_Theaters'])
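
One reason the 'Missing' placeholders matter: `zip` stops at its shortest input, so the lists must stay the same length for rows to line up. A minimal sketch with invented values:

```python
title = ['Movie A', 'Movie B', 'Movie C']
domestic = ['1000', 'Missing', '250']
# Hypothetical failure case: one page was skipped without a placeholder.
opening = ['400', '90']

rows_aligned = list(zip(title, domestic))
rows_truncated = list(zip(title, domestic, opening))

print(len(rows_aligned))    # every title keeps its row
print(len(rows_truncated))  # the last title is silently dropped
```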

Uses `.where()` on the pandas `dataframe` to replace all 'Missing' and '-' values with 0.

In [300]:
scraper_frame.where(scraper_frame !='Missing', 0, inplace=True)
scraper_frame.where(scraper_frame !='-', 0, inplace=True)
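
The effect of `.where()` is easiest to see on a tiny frame: values are kept where the condition is `True` and replaced with the second argument where it is `False`. The sketch below uses invented rows and the non-in-place form, assigning the result to a new name.

```python
import pandas as pd

# Toy frame with the same placeholder values the scraper can produce.
frame = pd.DataFrame(
    [['Movie A', '1000', '-'], ['Movie B', 'Missing', '250']],
    columns=['Title', 'Domestic_Gross', 'International_Gross'],
    dtype=object,
)

# Keep cells that are not 'Missing'; put 0 everywhere else.
cleaned = frame.where(frame != 'Missing', 0)
# Repeat for the '-' placeholder, using the already cleaned frame.
cleaned = cleaned.where(cleaned != '-', 0)

print(cleaned)
```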

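Displays the last 30 rows of the `dataframe`. The commented-out `.astype()` call is meant to type the named `columns` as `int64` so mathematical operations can be performed; because `.astype()` returns a new `dataframe` rather than modifying the original, its result would need to be assigned back.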

In [301]:
display(scraper_frame.tail(30))

# scraper_frame.astype({'Domestic_Gross':'int64',
#                      'International_Gross':'int64',
#                      'Worldwide_Gross':'int64',
#                      'Opening_Gross':'int64',
#                      'Number_of_Opening_Theaters':'int64'})



| | Title | Domestic_Gross | International_Gross | Worldwide_Gross | Opening_Gross | Number_of_Opening_Theaters |
|---|---|---|---|---|---|---|
| 8289 | Tea With the Dames | 889343 | 2291809 | 3181152 | 14777 | 1 |
| 8290 | Iceman | 2138 | 4581 | 6719 | 1372 | 7 |
| 8291 | Detour | 16172 | 0 | 16172 | 5127 | 1 |
| 8292 | Jihadists | 2104 | 0 | 2104 | 824 | 1 |
| 8293 | Divide and Conquer | 38510 | 0 | 38510 | 18833 | 15 |
| 8294 | A German Youth | 2343 | 0 | 2343 | 416 | 1 |
| 8295 | A Fish in the Bathtub | 1237 | 0 | 1237 | 1237 | 3 |
| 8296 | Division 19 | 1699 | 981 | 2680 | 0 | 0 |
| 8297 | Little Q | 1652 | 16358581 | 16360233 | 1652 | 1 |
| 8298 | Game Day | 1624 | 0 | 1624 | 1624 | 4 |
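
A minimal sketch of the cast, assuming every remaining value is either a digit string or the integer 0 left by the placeholder replacement. The frame here is invented; the point is that the `.astype()` result is assigned to a name, since the original frame is left unchanged.

```python
import pandas as pd

# Toy frame: digit strings plus a 0 that replaced a placeholder.
frame = pd.DataFrame(
    {
        'Title': ['Movie A', 'Movie B'],
        'Domestic_Gross': ['1000', 0],
        'Opening_Gross': ['400', '90'],
    },
    dtype=object,
)

# astype returns a new frame, so keep the result.
typed = frame.astype({'Domestic_Gross': 'int64', 'Opening_Gross': 'int64'})

total = typed['Domestic_Gross'].sum()
print(typed.dtypes)
print(total)
```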


Writes the `dataframe` to a `.csv` file. 

In [302]:

scraper_frame.to_csv(r'C:\Users\Nero_\Desktop\CourseWork\Project_Mod_1\dsc-mod-1-project-v2-1-onl01-dtsc-ft-070620\Scraper_test.csv', index=False)

## Feature Testing

In [178]:
test_url = 'https://www.boxofficemojo.com/release/rl24544769/?ref_=bo_yld_table_191'
req_page = requests.get(test_url)

In [179]:
test_rating = []
test_run_time = []
test_genre = []

In [192]:
def test_scrape(page_req):
    tree = html.fromstring(page_req.content)  # use the argument, not the global req_page
    
    with_budget = ['5', '6', '7']
    without_budget = ['4', '5', '6']
    
    substring = '\n'
    substring2 = 'min'
    
    
    MPAAs = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[4]/span[2]/text()')
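    # NOTE: this condition is parsed as `substring or (substring2 in MPAAs[0])`.
    # '\n' is a non-empty string, so it is always truthy and the
    # elif/else branches below never run.
    if substring or substring2 in MPAAs[0]:
        print('Sub or Sub2')
        if substring2 in MPAAs[0]:
            test_rating.append('Missing')
            print('Sub2')
        # NOTE: type() returns a class, not the string 'list', so this is always False.
        if type(MPAAs) == 'list':
            print('List')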
    elif substring in MPAAs[0]:
        print('Elif sub')
        MPAAss = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(with_budget[0]))
        if MPAAss:
            test_rating.append(MPAAss[0])
        else:
            test_rating.append('Missing')
    else:
        print('Else')
        MPAAs = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(without_budget[0]))
        if MPAAs:
            test_rating.append(MPAAs[0])
        else:
            test_rating.append('Missing')
    
#     if substring in MPAAs[0]:
            
#         MPAAs = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(with_budget[0]))
#         if MPAAs:
#             test_rating.append(MPAAs[0])
#         else:
#             test_rating.append('Missing')
        
#         run_times = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(with_budget[1]))
#         if run_times:
#             test_run_time.append(run_times[0])
#         else:
#             test_run_time.append('Missing')
        
#         genress = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(with_budget[2]))
#         if genress:
#             test_genre.append(genress[0].replace('\n','').split())
#         else:
#             test_genre.append('Missing')
        
#     else:
#         MPAAs = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(without_budget[0]))
#         if MPAAs:
#             test_rating.append(MPAAs[0])
#         else:
#             test_rating.append('Missing')
        
#         run_times = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(without_budget[1]))
#         if run_times:
#             test_run_time.append(run_times[0])
#         else:
#             test_run_time.append('Missing')
        
#         genress = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[{}]/span[2]/text()'.format(without_budget[2]))
#         if genress:
#             test_genre.append(genress[0].replace('\n','').split())
#         else:
#             test_genre.append('Missing')
        
    
#     MPAAs = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[4]/span[2]/text()')
#     print(MPAAs)
#     if substring in MPAAs[0]:
#         error_handle = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[5]/span[2]/text()')
#         test_rating.append(error_handle[0])
#     elif MPAAs:
#         test_rating.append(MPAAs[0])
#     else:
#         test_rating.append('Missing')
    
#     run_times = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[5]/span[2]/text()')
#     print(run_times)
#     if run_times:
#         test_run_time.append(run_times[0])
#     else:
#         test_run_time.append('Missing')

    
#     genress = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[6]/span[2]/text()')
#     print(genress)
#     if genress:
#         test_genre.append(genress[0].replace('\n','').split())
#     else:
#         test_genre.append('Missing')

# Budget element on pages with budget data -> //*[@id="a-page"]/main/div/div[3]/div[4]/div[3]/span[2]/span/text()


# MPAA element on pages with budget data -> //*[@id="a-page"]/main/div/div[3]/div[4]/div[5]/span[2]/text()
# Run-time element on pages with budget data -> //*[@id="a-page"]/main/div/div[3]/div[4]/div[6]/span[2]/font/font/text()
# Genres element on pages with budget data -> //*[@id="a-page"]/main/div/div[3]/div[4]/div[7]/span[2]/text()


# 3rd box xpath =  //*[@id="a-page"]/main/div/div[3]/div[4]/div[3]/span[2]/span
# 5th box xpath =  //*[@id="a-page"]/main/div/div[3]/div[4]/div[5]/span[2]
# 6th box xpath = //*[@id="a-page"]/main/div/div[3]/div[4]/div[7]/span[2]
    

In [193]:
test_scrape(req_page)

Sub or Sub2


In [194]:
test_rating

[]
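
The run above prints `Sub or Sub2` yet leaves `test_rating` empty, which is consistent with how Python evaluates `substring or substring2 in MPAAs[0]`. The sketch below isolates that behavior on a made-up rating string (`'PG-13'` is an assumption, not scraped data) and shows two ways the check could be written instead.

```python
text = 'PG-13'       # hypothetical scraped value
substring = '\n'
substring2 = 'min'

# `or` binds more loosely than `in`, so this is `substring or (substring2 in text)`.
# A non-empty string is truthy, so the expression evaluates to substring itself.
always_truthy = substring or substring2 in text

# Testing each substring explicitly gives the intended result.
intended = substring in text or substring2 in text
any_form = any(s in text for s in (substring, substring2))

# Comparing a type to a string is always False; isinstance is the usual check.
type_vs_string = type([text]) == 'list'
type_check = isinstance([text], list)

print(repr(always_truthy), intended, any_form, type_vs_string, type_check)
```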