# <center> Please go to https://ccv.jupyter.brown.edu </center>

## By the end of today you will learn about:

- Scraping IMDB for movies that came out in 2019
- Scraping a single movie
- Scraping all movies from a single page
- Scraping all movies from all pages

- Scraping IMDB for movies that came out in 2019
- <font color='LIGHTGRAY'> Scraping a single movie </font>
- <font color='LIGHTGRAY'> Scraping all movies from a single page </font>
- <font color='LIGHTGRAY'> Scraping all movies from all pages </font>

# Scraping IMDB Movie Ratings
Modified from https://www.dataquest.io/blog/web-scraping-beautifulsoup/

|Title|Year|Genre|Runtime|Rating|Synopsis|Director|Vote|
|---|---|---|---|---|---|---|---|
|...|...|...|...|...|...|...|...|

## Explore website to decide how to scrape

We want to scrape the movies released in 2019 that are in IMDB's database. https://www.imdb.com has an advanced search page (https://www.imdb.com/search/title) that we can use to generate a query to get this list of movies. 

We first need to figure out how querying works. Let's search for "Feature Films" released between 2019-01-01 and 2019-12-31 with a score between 1 and 10 (to exclude movies without votes). Let's set Display Options to "250 per page" and "Release Date Descending".  

The URL for the query is:
https://www.imdb.com/search/title/?title_type=feature&release_date=2019-01-01,2019-12-31&user_rating=1.0,10.0&sort=release_date,desc&count=250
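
As a rough sketch of how the query string is put together, the same URL can be built from a dictionary of filters with the standard library's `urllib.parse.urlencode`. The variable names `params` and `query_url` are illustrative, not part of the original notebook.

```python
from urllib.parse import urlencode

# Search filters copied from the query URL above.
params = {
    "title_type": "feature",
    "release_date": "2019-01-01,2019-12-31",
    "user_rating": "1.0,10.0",
    "sort": "release_date,desc",
    "count": 250,
}

# safe="," keeps the commas readable instead of percent-encoding them as %2C.
query_url = "https://www.imdb.com/search/title/?" + urlencode(params, safe=",")
print(query_url)
```

Keeping the filters in a dictionary makes it easier to change one of them (for example the date range) without editing a long string by hand.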

In [None]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import time
import warnings
from IPython.display import clear_output

- <font color='LIGHTGRAY'> Scraping IMDB for movies that came out in 2019 </font>
- Scraping a single movie
- <font color='LIGHTGRAY'> Scraping all movies from a single page </font>
- <font color='LIGHTGRAY'> Scraping all movies from all pages </font>

## Scrape a single movie

In [None]:
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2019-01-01,2019-12-31&user_rating=1.0,10.0&sort=release_date,desc&count=250"
response = get(url)
print(response.status_code)

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

### Find the movie containers

In [None]:
movie_containers_lst = soup.find_all('div', class_ = 'lister-item mode-advanced')
print(len(movie_containers_lst))

### Scrape the first movie container

In [None]:
first_movie = movie_containers_lst[0].find(class_='lister-item-content')
print(first_movie.prettify())

#### The HTML for a single movie container is very long. We will use the browser's developer tools to help find the data we want.

In [None]:
title_str = first_movie.h3.a.get_text()
print(title_str)

In [None]:
year_str = first_movie.h3.find('span', class_ = 'lister-item-year text-muted unbold').get_text()
print(year_str)

In [None]:
genre_str = first_movie.p.find('span', class_ = 'genre').get_text()
runtime_str = first_movie.p.find('span', class_ = 'runtime').get_text()
print(genre_str)
print(runtime_str)

In [None]:
rating_flt = float(first_movie.select('.ratings-bar div strong')[0].get_text())
print(rating_flt)

In [None]:
synopsis_str = first_movie.find_all('p', class_ = 'text-muted')[1].get_text()
print(synopsis_str)

In [None]:
director_str = first_movie.find_all('p')[2].a.get_text()
print(director_str)

#### We can search for a tag by its attributes, such as `<span name='nv'>`

In [None]:
votes_tag = first_movie.find('span', attrs = {'name':'nv'})
print(votes_tag)

#### We can treat tags like dictionaries, where the key-value pairs are the tag's attributes

In [None]:
votes_int = int(votes_tag['data-value'])
print(votes_int)
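
To illustrate the dictionary-style access on its own, here is a minimal sketch that parses a hand-written snippet instead of the live page. The HTML string and the names `sample_tag`, `sample_votes`, and `missing` are made up for this example.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the vote-count span; the numbers are made up.
html = '<p><span name="nv" data-value="1234">1,234</span></p>'
sample_tag = BeautifulSoup(html, "html.parser").find("span", attrs={"name": "nv"})

# Square brackets raise a KeyError if the attribute is missing.
sample_votes = int(sample_tag["data-value"])
# .get() returns None for a missing attribute instead of raising.
missing = sample_tag.get("data-missing")
print(sample_votes, missing)
```

Using `.get()` is one option when an attribute may not be present on every tag.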

- <font color='LIGHTGRAY'> Scraping IMDB for movies that came out in 2019 </font>
- <font color='LIGHTGRAY'> Scraping a single movie </font>
- Scraping all movies from a single page
- <font color='LIGHTGRAY'> Scraping all movies from all pages </font>

## Next, we will scrape all movie containers from the page

In [None]:
# Lists to store the scraped data in
titles_lst = []
years_lst = []
genres_lst = []
runtimes_lst = []
ratings_lst = []
synopsi_lst = []
directors_lst = []
votes_lst = []
# Extract data from individual movie container
for container in movie_containers_lst:
    # movie title
    title_str = container.h3.a.get_text()
    titles_lst.append(title_str)
    # year
    year_str = container.h3.find('span', class_ = 'lister-item-year text-muted unbold').get_text()
    years_lst.append(year_str)
    # genre(s)
    genre_str = container.p.find('span', class_ = 'genre').get_text()
    genres_lst.append(genre_str)
    # runtime
    runtime_str = container.p.find('span', class_ = 'runtime').get_text()
    runtimes_lst.append(runtime_str)
    # IMDB rating
    rating_flt = float(container.select('.ratings-bar div strong')[0].get_text())
    ratings_lst.append(rating_flt)
    # synopsis
    synopsis_str = container.find_all('p', class_ = 'text-muted')[1].get_text()
    synopsi_lst.append(synopsis_str)
    # director(s)
    director_str = container.find_all('p', class_ = '')[1].a.get_text()
    directors_lst.append(director_str)
    # vote count
    votes_tag = container.find('span', attrs = {'name':'nv'})
    vote_int = int(votes_tag['data-value'])
    votes_lst.append(vote_int)

#### There are often exceptions to the rule on a web page, so we need to debug to account for these cases. Printing each title as it is scraped shows which movie breaks the loop.

In [None]:
# Lists to store the scraped data in
titles_lst = []
years_lst = []
genres_lst = []
runtimes_lst = []
ratings_lst = []
synopsi_lst = []
directors_lst = []
votes_lst = []
# Extract data from individual movie container
for container in movie_containers_lst:
    # movie title
    title_str = container.h3.a.get_text()
    titles_lst.append(title_str)
    print(title_str)
    # year
    year_str = container.h3.find('span', class_ = 'lister-item-year text-muted unbold').get_text()
    years_lst.append(year_str)
    # genre(s)
    genre_str = container.p.find('span', class_ = 'genre').get_text()
    genres_lst.append(genre_str)
    # runtime
    runtime_str = container.p.find('span', class_ = 'runtime').get_text()
    runtimes_lst.append(runtime_str)
    # IMDB rating
    rating_flt = float(container.select('.ratings-bar div strong')[0].get_text())
    ratings_lst.append(rating_flt)
    # synopsis
    synopsis_str = container.find_all('p', class_ = 'text-muted')[1].get_text()
    synopsi_lst.append(synopsis_str)
    # director(s)
    director_str = container.find_all('p', class_ = '')[1].a.get_text()
    directors_lst.append(director_str)
    # vote count
    votes_tag = container.find('span', attrs = {'name':'nv'})
    vote_int = int(votes_tag['data-value'])
    votes_lst.append(vote_int)

### The problem is that not all movies have a listed runtime. In that case `find` returns `None`, so we check for `None` before calling `get_text()`.

In [None]:
# Lists to store the scraped data in
titles_lst = []
years_lst = []
genres_lst = []
runtimes_lst = []
ratings_lst = []
synopsi_lst = []
directors_lst = []
votes_lst = []
# Extract data from individual movie container
for container in movie_containers_lst:
    # movie title
    title_str = container.h3.a.get_text()
    titles_lst.append(title_str)
    print(title_str)
    # year
    year_str = container.h3.find('span', class_ = 'lister-item-year text-muted unbold').get_text()
    years_lst.append(year_str)
    # genre(s)
    genre_str = container.p.find('span', class_ = 'genre').get_text()
    genres_lst.append(genre_str)
    # runtime
    if container.p.find('span', class_ = 'runtime') is not None:
        runtime_str = container.p.find('span', class_ = 'runtime').get_text()
    else:
        runtime_str = ''
    runtimes_lst.append(runtime_str)
    # IMDB rating
    rating_flt = float(container.select('.ratings-bar div strong')[0].get_text())
    ratings_lst.append(rating_flt)
    # synopsis
    synopsis_str = container.find_all('p', class_ = 'text-muted')[1].get_text()
    synopsi_lst.append(synopsis_str)
    # director(s)
    director_str = container.find_all('p', class_ = '')[1].a.get_text()
    directors_lst.append(director_str)
    # vote count
    votes_tag = container.find('span', attrs = {'name':'nv'})
    vote_int = int(votes_tag['data-value'])
    votes_lst.append(vote_int)
    print(vote_int)

### Not all movies have a listed genre either.

In [None]:
# Lists to store the scraped data in
titles_lst = []
years_lst = []
genres_lst = []
runtimes_lst = []
ratings_lst = []
synopsi_lst = []
directors_lst = []
votes_lst = []
# Extract data from individual movie container
for container in movie_containers_lst:
    # movie title
    title_str = container.h3.a.get_text()
    titles_lst.append(title_str)
    print(title_str)
    # year
    year_str = container.h3.find('span', class_ = 'lister-item-year text-muted unbold').get_text()
    years_lst.append(year_str)
    # genre(s)
    if container.p.find('span', class_ = 'genre') is not None:
        genre_str = container.p.find('span', class_ = 'genre').get_text()
    else:
        genre_str = ''
    genres_lst.append(genre_str)
    # runtime
    if container.p.find('span', class_ = 'runtime') is not None:
        runtime_str = container.p.find('span', class_ = 'runtime').get_text()
    else:
        runtime_str = ''
    runtimes_lst.append(runtime_str)
    # IMDB rating
    rating_flt = float(container.select('.ratings-bar div strong')[0].get_text())
    ratings_lst.append(rating_flt)
    # synopsis
    synopsis_str = container.find_all('p', class_ = 'text-muted')[1].get_text()
    synopsi_lst.append(synopsis_str)
    # director(s)
    director_str = container.find_all('p', class_ = '')[1].a.get_text()
    directors_lst.append(director_str)
    # vote count
    votes_tag = container.find('span', attrs = {'name':'nv'})
    vote_int = int(votes_tag['data-value'])
    votes_lst.append(vote_int)

### Some movies are also missing a director link.

In [None]:
# Lists to store the scraped data in
titles_lst = []
years_lst = []
genres_lst = []
runtimes_lst = []
ratings_lst = []
synopsi_lst = []
directors_lst = []
votes_lst = []
# Extract data from individual movie container
for container in movie_containers_lst:
    # movie title
    title_str = container.h3.a.get_text()
    titles_lst.append(title_str)
    print(title_str)
    # year
    year_str = container.h3.find('span', class_ = 'lister-item-year text-muted unbold').get_text()
    years_lst.append(year_str)
    # genre(s)
    if container.p.find('span', class_ = 'genre') is not None:
        genre_str = container.p.find('span', class_ = 'genre').get_text()
    else:
        genre_str = ''
    genres_lst.append(genre_str)
    # runtime
    if container.p.find('span', class_ = 'runtime') is not None:
        runtime_str = container.p.find('span', class_ = 'runtime').get_text()
    else:
        runtime_str = ''
    runtimes_lst.append(runtime_str)
    # IMDB rating
    rating_flt = float(container.select('.ratings-bar div strong')[0].get_text())
    ratings_lst.append(rating_flt)
    # synopsis
    synopsis_str = container.find_all('p', class_ = 'text-muted')[1].get_text()
    synopsi_lst.append(synopsis_str)
    # director(s)
    if container.find_all('p', class_ = '')[1].a is not None:
        director_str = container.find_all('p', class_ = '')[1].a.get_text()
    else:
        director_str = ''
    directors_lst.append(director_str)
    # vote count
    votes_tag = container.find('span', attrs = {'name':'nv'})
    vote_int = int(votes_tag['data-value'])
    votes_lst.append(vote_int)
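
The repeated `is not None` checks could be folded into a small helper. The sketch below is one possible way to do that, not part of the original notebook: `text_or_default` is a hypothetical name, and the HTML is a hand-written stand-in for a movie container that has a genre but no runtime.

```python
from bs4 import BeautifulSoup


def text_or_default(tag, default=""):
    """Return the tag's text, or `default` when find() returned None."""
    return tag.get_text() if tag is not None else default


# Hand-written stand-in for one movie container with a genre but no runtime.
html = '<div><p><span class="genre">Drama</span></p></div>'
sample_container = BeautifulSoup(html, "html.parser").div

sample_genre = text_or_default(sample_container.p.find("span", class_="genre"))
sample_runtime = text_or_default(sample_container.p.find("span", class_="runtime"))
print(repr(sample_genre), repr(sample_runtime))
```

With a helper like this, each field needs one `find` call instead of two, and the fallback value lives in a single place.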

In [None]:
test_df = pd.DataFrame({'title': titles_lst,
'year': years_lst,
'genre': genres_lst,
'runtime': runtimes_lst,
'rating': ratings_lst,
'synopsis': synopsi_lst,
'director': directors_lst,
'vote': votes_lst
})
print(test_df)

### Let's create a function that scrapes a page. It takes a list of movie containers (such as `movie_containers_lst`) as input and assumes that the empty result lists have been created outside of the function.

In [None]:
def scrape_page(lst):
    # Extract data from individual movie container
    for container in lst:
        # movie title
        title_str = container.h3.a.get_text()
        titles_lst.append(title_str)
        # year
        year_str = container.h3.find('span', class_ = 'lister-item-year text-muted unbold').get_text()
        years_lst.append(year_str)
        # genre(s)
        if container.p.find('span', class_ = 'genre') is not None:
            genre_str = container.p.find('span', class_ = 'genre').get_text()
        else:
            genre_str = ''
        genres_lst.append(genre_str)
        # runtime
        if container.p.find('span', class_ = 'runtime') is not None:
            runtime_str = container.p.find('span', class_ = 'runtime').get_text()
        else:
            runtime_str = ''
        runtimes_lst.append(runtime_str)
        # IMDB rating
        rating_flt = float(container.select('.ratings-bar div strong')[0].get_text())
        ratings_lst.append(rating_flt)
        # synopsis
        synopsis_str = container.find_all('p', class_ = 'text-muted')[1].get_text()
        synopsi_lst.append(synopsis_str)
        # director(s)
        if container.find_all('p', class_ = '')[1].a is not None:
            director_str = container.find_all('p', class_ = '')[1].a.get_text()
        else:
            director_str = ''
        directors_lst.append(director_str)
        # vote count
        votes_tag = container.find('span', attrs = {'name':'nv'})
        vote_int = int(votes_tag['data-value'])
        votes_lst.append(vote_int)
    return

In [None]:
# Lists to store the scraped data in
titles_lst = []
years_lst = []
genres_lst = []
runtimes_lst = []
ratings_lst = []
synopsi_lst = []
directors_lst = []
votes_lst = []

scrape_page(movie_containers_lst)

test_df = pd.DataFrame({'title': titles_lst,
'year': years_lst,
'genre': genres_lst,
'runtime': runtimes_lst,
'rating': ratings_lst,
'synopsis': synopsi_lst,
'director': directors_lst,
'vote': votes_lst
})
print(test_df.shape)

- <font color='LIGHTGRAY'> Scraping IMDB for movies that came out in 2019 </font>
- <font color='LIGHTGRAY'> Scraping a single movie </font>
- <font color='LIGHTGRAY'> Scraping all movies from a single page </font>
- Scraping all movies from all pages

## Scrape multiple pages

To scrape multiple pages, we need to:

* Make all the requests we want from within the loop.
* Control the loop’s rate to avoid bombarding the server with requests.
* Monitor the loop while it runs.

## Make all requests we want from within the loop

The next page has the following URL: https://www.imdb.com/search/title/?title_type=feature&release_date=2019-01-01,2019-12-31&user_rating=1.0,10.0&sort=release_date,desc&count=250&start=251&ref_=adv_nxt

`&start=251` refers to starting at movie 251. Incrementing this query parameter will allow us to navigate to all pages of the search.
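
As a small sketch of that arithmetic, the `start` values can be computed from a total result count and a page size. The function name `page_starts` is hypothetical, and 5971 is an assumed result count chosen to match the upper bound used in this notebook's loop.

```python
def page_starts(total_results, per_page=250):
    """Return the `start` value for each results page as a string."""
    return [str(i) for i in range(1, total_results + 1, per_page)]


# 5971 is an assumed result count, matching range(1, 5972, 250).
starts = page_starts(5971)
print(len(starts), starts[0], starts[1], starts[-1])
```

Each value is the index of the first movie on a page, so the last page begins at the largest value that does not exceed the total.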

In [None]:
movie_indices = [str(i) for i in range(1,5972,250)]
print(movie_indices)

In [None]:
base_url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=2019-01-01,2019-12-31&user_rating=1.0,10.0&sort=release_date,desc&count=250'
for movie_index in movie_indices:
    print(base_url + '&start=' + movie_index + '&ref_=adv_nxt')

## Controlling the crawl rate

Controlling the rate of crawling is beneficial for us, and for the website we are scraping. If we avoid hammering the server with tens of requests per second, then we are much less likely to get our IP address banned. We also avoid disrupting the activity of the website we scrape by allowing the server to respond to other users’ requests too.

We’ll control the loop’s rate by using the `sleep()` function from Python’s `time` module. `sleep()` pauses the execution of the loop for a specified number of seconds.
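
One possible variation, shown here only as a sketch, is to vary the pause slightly so that requests are not sent at perfectly regular intervals. The helper name `polite_sleep` and the bounds are assumptions for illustration; the notebook itself uses a fixed delay.

```python
import random
import time


def polite_sleep(min_seconds, max_seconds):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Tiny bounds so the demo finishes quickly; a real crawl would wait whole seconds.
waited = polite_sleep(0.01, 0.02)
print(round(waited, 3))
```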

In [None]:
for i in range(0,5):
    delay = 2
    print(delay)
    time.sleep(delay)

## Monitoring the scraping loop

While the loop runs, we want to monitor:

* The frequency (speed) of requests, so we make sure our program is not overloading the server.
* The status code of our requests, so we make sure the server is sending back the proper responses.

In [None]:
# Set a starting time using the time() function from the time module, and assign the value to start_time.
start_time = time.time()

# Assign 0 to the variable requests which we’ll use to count the number of requests.
requests = 0

# Start a loop, and then with each iteration:
for i in range(5):
    # Simulate a request.
    # <<<A request would go here>>>
    # Increment the number of requests by 1.
    requests = requests + 1
    # Pause the loop for 1 second
    time.sleep(1)
    # Calculate the elapsed time since the first request, and assign the value to elapsed_time.
    elapsed_time = time.time() - start_time
    # Print the number of requests and the frequency.
    print('Request: ' + str(requests) + ' ' + 'Frequency: ' + str(requests/elapsed_time) + ' requests/sec')
    # clears the output of print, and waits until there is a new output
    clear_output(wait = True)

### Use `warnings.warn` to issue a warning if there is a non-200 response. We warn rather than raise an error because we will still scrape enough data even if a few requests fail.

In [None]:
warnings.warn("Warning Simulation !!!")
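
To show why a warning suits this case, the sketch below checks a few made-up status codes and keeps going after a bad one. The function name `check_status` and the list of codes are illustrative assumptions; `warnings.catch_warnings(record=True)` is used only to capture the warning for inspection.

```python
import warnings


def check_status(status_code, request_number):
    """Warn (rather than raise) on a non-200 status; return True when OK."""
    if status_code != 200:
        warnings.warn(f"Request: {request_number}; Status code: {status_code}")
        return False
    return True


# Record warnings so we can see that the loop continues after a bad response.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = [check_status(code, n) for n, code in enumerate([200, 503, 200], start=1)]

print(results, len(caught))
```

Because a warning does not stop execution, the third request is still checked after the second one fails.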

## Full scraping snippet

In [None]:
# Redeclaring the lists to store data in
titles_lst = []
years_lst = []
genres_lst = []
runtimes_lst = []
ratings_lst = []
synopsi_lst = []
directors_lst = []
votes_lst = []

# Preparing the monitoring of the loop
start_time = time.time()
requests = 0
movie_indices = [str(i) for i in range(1, 5972, 250)]

# For every page of search results
for movie_index in movie_indices:

    # Make a get request
    base_url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=2019-01-01,2019-12-31&user_rating=1.0,10.0&sort=release_date,desc&count=250'
    url = base_url + '&start=' + movie_index + '&ref_=adv_nxt'
    response = get(url)

    # Pause the loop
    time.sleep(1)

    # Monitor the requests
    requests = requests + 1
    elapsed_time = time.time() - start_time
    print('Request: ' + str(requests) + ' ' + 'Frequency: ' + str(requests/elapsed_time) + ' requests/sec')
    clear_output(wait = True)

    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warnings.warn('Request: ' + str(requests) + '; Status code: ' + str(response.status_code))

    # Parse the content of the request with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Select all movie containers (up to 250) from a single page and scrape them
    movie_containers_lst = soup.find_all('div', class_ = 'lister-item mode-advanced')
    scrape_page(movie_containers_lst)

In [None]:
movies_df = pd.DataFrame({'title': titles_lst,
'year': years_lst,
'genre': genres_lst,
'runtime': runtimes_lst,
'rating': ratings_lst,
'synopsis': synopsi_lst,
'director': directors_lst,
'vote': votes_lst
})
print(movies_df)

In [None]:
# Save the results; this assumes a data/ directory already exists
movies_df.to_csv('data/imdb.csv', index=False)