### Madeline Carter and Alex Isbill

In this notebook, we will scrape movie data from imdb.com. We will first import the necessary libraries: `requests`, `bs4` (Beautiful Soup 4), and `pandas`. We will additionally import the `time` module so that we can pause between requests, in keeping with good web scraping practices.

In [1]:
import requests
import bs4
import pandas as pd
import time

The first thing we want to do is create a dictionary whose keys correspond to the columns of the data frame we will build from the imdb.com listing. Each key maps to an empty list that we will fill as we scrape.

In [2]:
data = {
    "Titles" : [],
    "Runtime" : [],
    "User Rating" : [],
    "Metascore" : [],
    "Votes" : [],
    "Gross" : [],
    "Genre" : [] 
}

We want to collect data on the top 1000 movies released between 2000-01-01 and 2020-12-31, sorted by US box office gross in descending order. However, the URL "https://www.imdb.com/search/title/?release_date=2000-01-01,2020-12-31&sort=boxoffice_gross_us,desc&start=1" displays only the first 50 movies. The next 50 movies are shown when we replace the trailing "1" with "51", and each following page increases the `start` value by another 50. Storing these start values in a list *pages* lets us concisely scrape data from all 20 URLs.

We will traverse this list. On each iteration of the for loop, we will scrape the data at *url* and add it to our *data* dictionary.
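
As a minimal sketch of the pagination described above, the start values and URLs can also be generated instead of typed out. The names `BASE_URL`, `PAGE_SIZE`, `TOTAL_MOVIES`, and `urls` are ours and are used only for illustration.

```python
# Names below are illustrative; only the URL itself comes from the notebook.
BASE_URL = (
    "https://www.imdb.com/search/title/"
    "?release_date=2000-01-01,2020-12-31"
    "&sort=boxoffice_gross_us,desc&start="
)
PAGE_SIZE = 50       # movies shown per results page
TOTAL_MOVIES = 1000  # how many movies we want in total

# Start values 1, 51, 101, ..., 951: one per results page.
pages = list(range(1, TOTAL_MOVIES + 1, PAGE_SIZE))
urls = [BASE_URL + str(start) for start in pages]

print(len(urls))
print(urls[1])
```

Generating the values this way keeps the list in step with the page size if either number changes.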

To keep the data consistent and allow conversion from strings to floats and ints (part 3), we remove commas, dollar signs, and letters (e.g. "$936.66M" --> 936.66).
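
The cleaning steps can be sketched as small helper functions. This is only an illustration: the helper names are ours, and the input formats ("887,665", "$936.66M", "138 min") are assumed from what the scraping code strips away.

```python
def clean_votes(text: str) -> int:
    """Remove thousands separators, e.g. '887,665' -> 887665."""
    return int(text.replace(",", ""))


def clean_gross(text: str) -> float:
    """Remove '$', commas and the trailing 'M'; the result is in millions of dollars."""
    return float(text.strip().lstrip("$").rstrip("M").replace(",", ""))


def clean_runtime(text: str) -> int:
    """Keep only the number of minutes, e.g. '138 min' -> 138."""
    return int(text.split()[0])


print(clean_votes("887,665"), clean_gross("$936.66M"), clean_runtime("138 min"))
```

In the notebook itself the values stay as strings while scraping, and the conversion to numbers is left for part 3.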

In [3]:
pages = [1, 51, 101, 151, 201, 251, 301, 351, 401, 451, 501, 551, 601, 651, 701, 751, 801, 851, 901, 951]
for i in pages:
    url = "https://www.imdb.com/search/title/?release_date=2000-01-01,2020-12-31&sort=boxoffice_gross_us,desc&start=" + str(i)
    response = requests.get(url)
    time.sleep(1)  # pause between requests so we do not overload the server
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    # Each movie on the results page sits in its own "lister-item-content" div.
    mv = soup.find_all('div', class_="lister-item-content")
    for movie in mv:
        data["Titles"].append(movie.h3.a.text)
        # "138 min" -> "138"
        data["Runtime"].append(movie.find(class_="runtime").text.split()[0])
        data["User Rating"].append(movie.find(class_="inline-block ratings-imdb-rating").text.split()[0])
        # Not every movie has a Metascore: find() then returns None and .text raises AttributeError.
        try:
            data["Metascore"].append(movie.find(class_="inline-block ratings-metascore").text.split()[0])
        except AttributeError:
            data["Metascore"].append("NaN")
        # The "nv" spans hold the vote count and the US gross, in that order.
        nv = movie.find_all("span", {"name": "nv"})
        data["Votes"].append(nv[0].text.replace(',', ''))
        # Drop the leading "$" and the trailing "M".
        data["Gross"].append(nv[1].text[1:].rstrip("M"))
        # Keep only the first listed genre.
        data["Genre"].append(movie.find(class_="genre").text.split()[0].rstrip(","))
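
To illustrate the missing Metascore case without contacting the site, here is a small sketch run on hand-written HTML. The markup is hypothetical and heavily simplified; only the class names are taken from the scraping loop above, and `metascore_or_nan` is a helper name of ours.

```python
import bs4

# Hypothetical markup: the real results page contains much more than this.
html = """
<div class="lister-item-content">
  <h3><a>Movie With Score</a></h3>
  <div class="inline-block ratings-metascore"><span>80</span> Metascore</div>
</div>
<div class="lister-item-content">
  <h3><a>Movie Without Score</a></h3>
</div>
"""


def metascore_or_nan(movie):
    """Return the Metascore as a string, or "NaN" when the block is absent."""
    try:
        return movie.find(class_="inline-block ratings-metascore").text.split()[0]
    except AttributeError:
        # find() returned None, so there is no .text attribute to read.
        return "NaN"


soup = bs4.BeautifulSoup(html, "html.parser")
movies = soup.find_all("div", class_="lister-item-content")
scores = [metascore_or_nan(movie) for movie in movies]
print(scores)  # ['80', 'NaN']
```

Appending a placeholder for the missing value keeps every list in *data* the same length, which the data frame conversion requires.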

Now we will convert our complete *data* dictionary to a data frame and display it.

In [4]:
df = pd.DataFrame(data)
df

| | Titles | Runtime | User Rating | Metascore | Votes | Gross | Genre |
|---|---|---|---|---|---|---|---|
| 0 | Star Wars: Episode VII - The Force Awakens | 138 | 7.8 | 80 | 887665 | 936.66 | Action |
| 1 | Avengers: Endgame | 181 | 8.4 | 78 | 951205 | 858.37 | Action |
| 2 | Avatar | 162 | 7.8 | 83 | 1160171 | 760.51 | Action |
| 3 | Black Panther | 134 | 7.3 | 88 | 681500 | 700.06 | Action |
| 4 | Avengers: Infinity War | 149 | 8.4 | 68 | 933216 | 678.82 | Action |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Son of God | 138 | 5.7 | 37 | 18731 | 59.70 | Biography |
| 996 | Love Actually | 135 | 7.6 | 55 | 448957 | 59.70 | Comedy |
| 997 | Gothika | 98 | 5.8 | 38 | 110540 | 59.69 | Horror |
| 998 | My Big Fat Greek Wedding 2 | 94 | 6.0 | 37 | 29640 | 59.69 | Comedy |


Finally, we will save the collected data as a .csv file to use in part 3 of this lab.

In [5]:
df.to_csv("imdb.csv", index=False)
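
Because every scraped value is stored as a string, part 3 will need numeric columns. One possible way to do the conversion, shown here only as a sketch, is `pd.to_numeric`. The first two rows below are copied from the output above; the third row is hypothetical and exists only to show how a missing Metascore behaves.

```python
import pandas as pd

# Values are strings, the way the scraper collects them.
sample = pd.DataFrame({
    "Titles": ["Avatar", "Gothika", "Hypothetical Movie"],  # third row is made up
    "Runtime": ["162", "98", "100"],
    "User Rating": ["7.8", "5.8", "6.0"],
    "Metascore": ["83", "38", "NaN"],
    "Votes": ["1160171", "110540", "1000"],
    "Gross": ["760.51", "59.69", "1.00"],
})

numeric_columns = ["Runtime", "User Rating", "Metascore", "Votes", "Gross"]
for column in numeric_columns:
    # errors="coerce" turns anything that cannot be parsed into a missing value.
    sample[column] = pd.to_numeric(sample[column], errors="coerce")

print(sample.dtypes)
```

The "NaN" placeholder becomes a proper missing value, so later calculations can skip it instead of failing on a string.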