### Madeline Carter and Alex Isbill

In this notebook, we scrape movie data from the-numbers.com. We first import the necessary libraries: `requests`, `bs4` (Beautiful Soup 4), and `pandas`. We also import the `time` module so that we can pause between requests, in keeping with good web scraping practice.

In [11]:
import requests
import bs4
import pandas as pd
import time

First, we create a dictionary whose keys correspond to the column names of the table on the-numbers.com. Each key maps to a list that will hold the values of that column.

In [12]:
data = {
    "Release Date" : [],
    "Titles" : [],
    "Budget" : [],
    "TN Domestic Gross" : [],
    "Worldwide Gross" : [] 
}

We want to collect data on the 1000 highest-budget movies in the the-numbers.com database. However, the URL "https://www.the-numbers.com/movie/budgets/all" displays only the top 100 movies. The next 100 movies are found by appending "/101" to the URL, and each following page of 100 movies uses the same pattern with the hundreds digit increased by 1 ("/201", "/301", and so on). Storing these suffixes in a list, *pages*, lets us scrape all 10 URLs concisely.
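
The suffix pattern described above can also be generated rather than typed out. The following is a minimal sketch of that idea; the names `BASE_URL`, `page_suffixes`, and `urls` are illustrative and do not appear in the notebook.

```python
# Base URL taken from the paragraph above.
BASE_URL = "https://www.the-numbers.com/movie/budgets/all"

# The first page has no suffix; later pages start at ranks 101, 201, ..., 901.
page_suffixes = [""] + [f"/{start}" for start in range(101, 1000, 100)]

# Full URLs for all ten pages.
urls = [BASE_URL + suffix for suffix in page_suffixes]

print(page_suffixes)
print(len(urls))
```

Generating the suffixes makes it easier to change the number of pages later, at the cost of being slightly less explicit than a hand-written list.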

We then loop over this list. On each iteration of the for loop, we scrape the table at *url* and append its rows to our *data* dictionary.

To keep the data consistent and allow conversion from strings to floats and ints (part 3), we keep only the first whitespace-separated token of each monetary value, then strip the dollar sign and the commas (e.g. "$400,000,000" becomes "400000000").
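
The cleaning step can be read in isolation as a small helper. This is a sketch that mirrors the string operations used in the scraping loop; the function name `clean_money` is hypothetical and is not defined in the notebook.

```python
def clean_money(text):
    """Turn a string such as '$400,000,000' into '400000000'."""
    first_token = text.split()[0]           # drop anything after the first whitespace
    without_symbol = first_token[1:]        # drop the leading character (the dollar sign)
    return without_symbol.replace(",", "")  # remove the thousands separators

cleaned = clean_money("$400,000,000")
print(cleaned)       # 400000000
print(int(cleaned))  # 400000000
```

Note that the slice `[1:]` removes the first character whatever it is, so this approach assumes every value starts with a currency symbol.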

In [13]:
# Path suffixes for the ten result pages (ranks 1-100, 101-200, ..., 901-1000).
pages = ["", "/101", "/201", "/301", "/401", "/501", "/601", "/701", "/801", "/901"]
for page in pages:
    url = "https://www.the-numbers.com/movie/budgets/all" + page
    response = requests.get(url)
    time.sleep(1)  # pause between requests to avoid overloading the server
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    rows = soup.find('table').find_all("tr")
    for movie in rows[1:]:  # skip the header row
        links = movie.find_all('a')             # release date link, then title link
        cells = movie.find_all(class_='data')   # rank, budget, domestic, worldwide
        data["Release Date"].append(links[0].text)
        data["Titles"].append(links[1].text)
        # Keep the first token, drop the leading "$", and remove the commas.
        data["Budget"].append(cells[1].text.split()[0][1:].replace(',', ''))
        data["TN Domestic Gross"].append(cells[2].text.split()[0][1:].replace(',', ''))
        data["Worldwide Gross"].append(cells[3].text.split()[0][1:].replace(',', ''))
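
The loop above pauses after every request, including the last one. One possible way to keep the pacing separate from the parsing is sketched below. The helper `fetch_all` and the stand-in `fake_fetch` are hypothetical names introduced only for illustration; in the notebook, the role of `fetch` would be played by something like `requests.get(url).text`.

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results

# Stand-in fetcher so the sketch runs offline (an assumption, not real scraping).
def fake_fetch(url):
    return f"<html>{url}</html>"

pages_html = fetch_all(["page-1", "page-2"], fake_fetch, delay=0)
print(pages_html)
```

Passing the fetcher in as an argument also makes the pacing logic testable without touching the network.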

Now we will convert our complete *data* dictionary to a data frame and display it.

In [14]:
df = pd.DataFrame(data)
df

| | Release Date | Titles | Budget | TN Domestic Gross | Worldwide Gross |
|---|---|---|---|---|---|
| 0 | Apr 23, 2019 | Avengers: Endgame | 400000000 | 858373000 | 2797800564 |
| 1 | May 20, 2011 | Pirates of the Caribbean: On Stranger Tides | 379000000 | 241071802 | 1045713802 |
| 2 | Apr 22, 2015 | Avengers: Age of Ultron | 365000000 | 459005868 | 1395316979 |
| 3 | Dec 16, 2015 | Star Wars Ep. VII: The Force Awakens | 306000000 | 936662225 | 2064615817 |
| 4 | Apr 25, 2018 | Avengers: Infinity War | 300000000 | 678815482 | 2044540523 |
| ... | ... | ... | ... | ... | ... |
| 995 | Dec 25, 2008 | The Spirit | 60000000 | 19806188 | 39006188 |
| 996 | Oct 19, 2001 | The Last Castle | 60000000 | 18208078 | 20541668 |
| 997 | Jan 23, 2009 | Inkheart | 60000000 | 17303424 | 66655938 |
| 998 | Feb 18, 2020 | Monster Hunter | 60000000 | 15104790 | 44400541 |
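
The *data* dictionary stores one list per column, and `pd.DataFrame(data)` lines those lists up by position to form rows. The same idea can be illustrated in plain Python. This is only a sketch: the values are copied from the first two rows of the output above, and `sample` and `rows` are illustrative names.

```python
sample = {
    "Release Date": ["Apr 23, 2019", "May 20, 2011"],
    "Titles": ["Avengers: Endgame", "Pirates of the Caribbean: On Stranger Tides"],
    "Budget": ["400000000", "379000000"],
}

# All column lists must have the same length for the rows to line up.
lengths = {len(values) for values in sample.values()}
assert len(lengths) == 1

# zip(*columns) pairs the i-th entry of every column into one row.
rows = [dict(zip(sample.keys(), row)) for row in zip(*sample.values())]
for row in rows:
    print(row)
```

If one of the scraped fields were missing for a row, the lists would drift out of step, which is why every column is appended to once per table row.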


Finally, we save the collected data to a .csv file for use in part 3 of this lab. Passing `index=False` leaves the row index out of the file.

In [15]:
df.to_csv("the_numbers.csv", index=False)
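
Because the release dates contain commas, those fields are quoted in the CSV file. The sketch below illustrates the file format with Python's standard `csv` module and an in-memory buffer instead of `the_numbers.csv`; it is not how the notebook writes the file, and the names `records`, `buffer`, and `budgets` are illustrative. With the `csv` module every field is read back as a string, so the budget is converted to an int explicitly.

```python
import csv
import io

fieldnames = ["Release Date", "Titles", "Budget"]
records = [
    {"Release Date": "Apr 23, 2019", "Titles": "Avengers: Endgame", "Budget": "400000000"},
    {"Release Date": "Dec 25, 2008", "Titles": "The Spirit", "Budget": "60000000"},
]

# Write the records to an in-memory text buffer in CSV format.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back and convert the budget column from str to int.
reader = csv.DictReader(io.StringIO(csv_text))
budgets = [int(row["Budget"]) for row in reader]
print(budgets)
```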