### Web Scraping Box Office Mojo Movies
This file is a script that webscrapes information about daily box office (in the US) on a movie of choice.

We start by importing all needed packages. (Write "!pip install <package>" to install missing packages)

In [1]:
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as BS

We then get the website we are looking for.

Change the url to the url of the movie's daily box office. This one is for "Star Wars IX: The Rise of Skywalker".

In [2]:
url="https://www.boxofficemojo.com/release/rl3305145857/?ref_=bo_gr_rls"
page = requests.get(url)
soup = BS(page.content, 'html.parser')

To view each data point and their describtion, see the link you are webscraping from.

Use BeautifulSoup to get all interesting data and the movie title for file naming:

In [16]:
movie_title = soup.find_all('title')
movie_title = movie_title[0].string

days = soup.find_all('td', class_='a-text-left mojo-header-column mojo-truncate mojo-field-type-date_interval mojo-sort-column')
special_events = soup.find_all('span', class_='a-size-small a-color-secondary')
dow = soup.find_all('td', class_='a-text-left mojo-field-type-date_interval')
rank = soup.find_all('td', class_='a-text-right mojo-field-type-rank')
daily = soup.find_all('td', class_='a-text-right mojo-field-type-money mojo-estimatable')
theaters = soup.find_all('td', class_='a-text-right mojo-field-type-positive_integer mojo-estimatable')

special_events[0]

<span class="a-size-small a-color-secondary">Christmas Day</span>

Save all data points in arrays:

In [4]:
days_str = []
#special_events = np.empty(len(days), dtype='<U10')
days_ar = []
dow_ar = []
rank_ar = []
daily_ar = []
theaters_ar = []

for i in range(len(days)):
    #We need the date (days) in the format yyyy-mm-dd
    #Luckily we found this information in the "href" (the link that the day is connected to)
    #Now we extract it along with everything else
    days_ar.append(str(days[i].select('a'))) 
    days_ar[i] = days_ar[i][days_ar[i].find('2'):]
    days_ar[i] = days_ar[i][:days_ar[i].find('/')]
    
    dow_ar.append(dow[i].select('a')[0].string)
    rank_ar.append(rank[i].string)
    daily_ar.append(daily[i].string)
    theaters_ar.append(theaters[i].string)

Make a pandas dataframe with all of the arrays:

In [5]:
movie_df = pd.DataFrame({'days': days_ar, 'dow': dow_ar, 'rank':rank_ar, 'daily': daily_ar, 'theaters': theaters_ar})

All of the data points are strings, but we would like a float format so that we can do computations later.

We remove all commas and dollar-signs in order to convert strings to floats using the function below:

In [6]:
def from_str_to_float(column):
    float_column = np.empty(len(column), dtype='<U10')
    for i in range(len(column)):
        float_column[i] = column[i].replace(',','')
        if '$' in column[i]:
            float_column[i] = float(float_column[i].replace('$',''))
    return float_column

Now we filter the columns:

In [7]:
movie_df['daily'] = from_str_to_float(movie_df['daily'])
movie_df['rank'] = from_str_to_float(movie_df['rank'])
movie_df['theaters'] = from_str_to_float(movie_df['theaters'])

Now we want to save our data frame to a csv file:

In [9]:
#Because spaces are a no-go in file names
movie_title = movie_title.replace(" ", "_")

movie_df
#movie_df.to_csv(movie_title+".csv")

Unnamed: 0,days,dow,rank,daily,theaters
0,2019-12-20,Friday,1,89615288.0,4406
1,2019-12-21,Saturday,1,20339.0,4406
2,2019-12-22,Sunday,1,89615288.0,4406
3,2019-12-23,Monday,1,47467565.0,4406
4,2019-12-24,Tuesday,1,10773.0,4406
...,...,...,...,...,...
86,2020-03-15,Sunday,23,485616199.,189
87,2020-03-16,Monday,18,3529380.0,189
88,2020-03-17,Tuesday,18,1154.0,189
89,2020-03-18,Wednesday,16,489145579.,189
