# Web Scraping Popular Movies using BeautifulSoup

*A web scraping tutorial in Python for beginners.*
![](https://imgur.com/JAYEfY3.png)

The **project idea** is to use web scraping to curate a list of popular movies that I can watch. Check out the TMdb website here: https://www.themoviedb.org/movie

**Web Scraping** is the process of gathering useful information from the web and deriving meaningful insights from it. In a way, web scraping automates the process of data collection.

*Note:* Web scraping code depends on the structure of the web page. If the structure changes, your code needs to be updated too!


**Python** offers a variety of libraries to scrape the web, such as BeautifulSoup, Requests, Scrapy, and Selenium. If you are just starting out with web scraping, Beautiful Soup is an easy option.

We’ll be using the packages:
* **Requests** — for downloading the HTML code from the TMdb URL
* **BeautifulSoup4** — for extracting data from the HTML string
* **Pandas** — to gather my data into a dataframe for further processing



Let's see an outline of the steps we'll follow:
1. Load the TMdb movie web page https://www.themoviedb.org/movie using `Requests`.
2. Parse the HTML web page using BeautifulSoup.
3. Extract the list of movies from the landing page. For each movie on the page, we'll get the movie name, user rating and the movie page URL.
4. Then, for each movie, we'll grab the release date, genres, runtime and director from its own page.
5. Compile the extracted movie details into Python lists and dictionaries.
6. Extend the above logic to scrape multiple pages.
7. Finally, save all the movie information into a CSV file.

The CSV file will have the following format:

```
name,rating,genre,release_date,runtime,director,url
Mortal Kombat,79.0,"Fantasy, Action, Adventure, Science Fiction, Thriller",04/23/2021,1h 50m,Lewis Tan,https://www.themoviedb.org/movie/460465
Godzilla vs. Kong,82.0,"Science Fiction, Action",03/31/2021,1h 53m,Alexander Skarsgård,https://www.themoviedb.org/movie/399566
Nobody,85.0,"Action, Thriller, Crime",03/26/2021,1h 32m,Bob Odenkirk,https://www.themoviedb.org/movie/615457
Zack Snyder's Justice League,85.0,"Action, Adventure, Fantasy, Science Fiction",03/18/2021,4h 2m,Ben Affleck,https://www.themoviedb.org/movie/791373
```

### How to Run the code

You can execute the code by clicking the "Run" button or by selecting the "Run on Binder" option.

### Installing the Libraries
Let’s start by installing the required packages.

In [None]:
# Install pandas (the `as pd` alias belongs to the import statement, not to pip)
!pip install pandas --quiet

# Install the bs4 module from BeautifulSoup
!pip install beautifulsoup4 --upgrade --quiet

Let's import the necessary packages

In [None]:
# Let's import necessary packages
import requests
import pandas as pd
from bs4 import BeautifulSoup

### Load the Webpage using Requests

The TMdb movies landing page shows a list of popular movies. We can click on any movie item to navigate to that movie's own page and get more details about it.

Each page contains 20 movies. From the landing page, we will parse the list of movies, user ratings, and movie URLs. In the browser, further pages are reached with the ‘Load More’ button; in the scraper, we will request them through the `page` query parameter in the URL.

In [None]:
# TMdb movie URL
tmdb_movies_url = 'https://www.themoviedb.org/movie'

In [None]:
# The movie page is downloaded using `requests`
response = requests.get(tmdb_movies_url)

In [None]:
# Check if the request was successful
response.status_code

200

A status code of `200` confirms that the request was successful.


In [None]:
page_contents = response.text
page_contents[:500]

'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular Movies &#8212; The Movie Database (TMDb)</title>\n    <meta http-equiv="X-UA-Compatible" content="IE=edge" />\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n    \n    <meta name="keywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">\n    <meta name="mobile-web-app-capable" content="yes">\n    <meta name="ap'

The output above shows the first 500 characters of the HTML code of the TMdb web page.

Let's now write `page_contents` to a file.

In [None]:
with open('tmdb_movie.html', 'w', encoding='utf-8') as f:  # explicit encoding avoids platform-dependent defaults
    f.write(page_contents)

In [None]:
doc = BeautifulSoup(page_contents, 'html.parser')

The HTML page content is parsed with BeautifulSoup into the object `doc`.

Let us create a function to perform the above.

In [None]:
def get_movies_page():
    """
    Function to download a web page using `requests` and check the status code to validate
    if the call was successful.
    """
    movies_url = 'https://www.themoviedb.org/movie'
    # Access the webpage using `requests`
    response = requests.get(movies_url)
    # Check if the request was successful
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(movies_url))
    # Parse the `response` text using BeautifulSoup
    movies_doc = BeautifulSoup(response.text, 'html.parser')
    return movies_doc
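
As a variation on the function above, here is a minimal sketch that adds a request timeout and lets `requests` raise the error itself. The name `fetch_page` and its `session` parameter are hypothetical and not part of the original notebook. Note that `raise_for_status()` only raises for 4xx and 5xx responses, so it is slightly more lenient than the `!= 200` check.

```python
import requests
from bs4 import BeautifulSoup


def fetch_page(url, session=None, timeout=10):
    """Download `url` and return it parsed as a BeautifulSoup document."""
    # `session` is a hypothetical hook: pass a requests.Session to reuse
    # connections, or leave it as None to use the module-level requests.get.
    http = session or requests
    # Without a timeout, a stalled connection could block the scraper forever.
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

Calling `fetch_page('https://www.themoviedb.org/movie')` would then play the same role as `get_movies_page()`.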

### Inspect the Web page

Chrome users can right-click on the page and choose “Inspect” to examine the HTML code behind it. A panel will appear at the bottom or on the right side of the page (depending on the settings), with a long list of nested HTML tags. To find the tag associated with the information needed, right-click that detail (e.g. a movie name) and click “Inspect” again; the matching element is highlighted in blue. You can then click through the HTML tags to find the one associated with the item of interest, here the movie name.

As we see in the image below, the movie names are embedded in the `h2` tags.

![](https://imgur.com/XzQ6OYC.png)

We can use `h2.a.text.strip()` to retrieve the name of each movie. Note that we need to skip the first four `h2` tags, as those do not contain movie names.

In [None]:
movies_names_tags = doc.find_all('h2')[4:]  # Exclude the first 4 h2 tags, which are not movie titles
names = []
for h2 in movies_names_tags:
    names.append(h2.a.text.strip())
print(names)

['Mortal Kombat', 'Godzilla vs. Kong', 'Nobody', "Zack Snyder's Justice League", 'The Unholy', 'Thunder Force', 'The Marksman', 'Chaos Walking', 'Raya and the Last Dragon', 'Demon Slayer the Movie: Mugen Train', 'New Gods: Nezha Reborn', "Mortal Kombat Legends: Scorpion's Revenge", 'Monster Hunter', 'Vanquish', 'Wonder Woman 1984', 'Tom & Jerry', 'Sentinelle', 'Space Sweepers', 'Rise of the Mummy', 'Cherry']
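
Skipping a fixed number of `h2` tags breaks as soon as the page gains or loses a heading. One alternative is to select only the headings that link to a movie page. The sketch below runs on a small, hypothetical HTML sample; the real TMdb markup is more complex and its structure may differ, so treat the selector as an assumption to verify in the browser inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup used only for illustration.
sample_html = """
<h2>Popular Movies</h2>
<h2>Sort</h2>
<div class="card">
  <h2><a href="/movie/460465" title="Mortal Kombat">Mortal Kombat</a></h2>
</div>
<div class="card">
  <h2><a href="/movie/399566" title="Godzilla vs. Kong">Godzilla vs. Kong</a></h2>
</div>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Keep only links that sit directly inside an <h2> and point to /movie/...,
# instead of relying on the position of the heading in the page.
movie_links = sample_doc.select('h2 > a[href^="/movie/"]')
sample_names = [a.text.strip() for a in movie_links]
sample_hrefs = [a['href'] for a in movie_links]

print(sample_names)
print(sample_hrefs)
```

The two headings without a link are ignored, so no slice such as `[4:]` is needed.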


Similarly, we can extract the movie links.

In [None]:
links = []
for h2 in movies_names_tags:
    links.append(h2.a['href'])
print(links)

['/movie/460465', '/movie/399566', '/movie/615457', '/movie/791373', '/movie/632357', '/movie/615678', '/movie/634528', '/movie/412656', '/movie/527774', '/movie/635302', '/movie/663558', '/movie/664767', '/movie/458576', '/movie/804435', '/movie/464052', '/movie/587807', '/movie/793723', '/movie/581389', '/movie/791910', '/movie/544401']


Let's create functions to extract the movie names and movie URLs.

In [None]:
def get_movies_names(doc):
    """
    Function to extract the movie names from HTML source code using BeautifulSoup.
    """
    movies_names_tags = doc.find_all('h2')[4:]  # Exclude the first 4 h2 tags, which are not movie titles
    movies_names = []
    # Loop through the tags to get all the movie names from the page
    for h2 in movies_names_tags:
        movies_names.append(h2.a.text.strip())
    return movies_names

`get_movies_names` can be used to get the list of popular movie names.

In [None]:
# Get the popular movie list from the webpage using the BeautifulSoup object `doc`.
get_movies_names(doc)

['Mortal Kombat',
 'Godzilla vs. Kong',
 'Nobody',
 "Zack Snyder's Justice League",
 'The Unholy',
 'Thunder Force',
 'The Marksman',
 'Chaos Walking',
 'Raya and the Last Dragon',
 'Demon Slayer the Movie: Mugen Train',
 'New Gods: Nezha Reborn',
 "Mortal Kombat Legends: Scorpion's Revenge",
 'Monster Hunter',
 'Vanquish',
 'Wonder Woman 1984',
 'Tom & Jerry',
 'Sentinelle',
 'Space Sweepers',
 'Rise of the Mummy',
 'Cherry']

The above shows the list of movies in the landing page of the TMdb movie web page.

Similarly, let's define functions for the movie user ratings and URLs.

The user ratings are embedded as part of the `div` tag under the `user_score_chart` class in the webpage as below.

![](https://imgur.com/WqCIgES.png)

In [None]:
def get_movies_rating(doc):
    """
    Function to extract the movie user ratings from HTML source code using BeautifulSoup.
    """
    desc_selector = 'user_score_chart'
    movies_rating_tags = doc.find_all('div', {'class': desc_selector})
    movies_rating = []
    # Loop through the webpage to get the ratings of all the movies in the page
    for tag in movies_rating_tags:
        movies_rating.append(tag.attrs['data-percent'])
    return movies_rating

In [None]:
# Get the rating of each movie in the webpage using the BeautifulSoup object `doc`.
get_movies_rating(doc)

['79.0',
 '82.0',
 '85.0',
 '85.0',
 '57.0',
 '58.0',
 '73.0',
 '73.0',
 '83.0',
 '82.0',
 '88.0',
 '84.0',
 '71.0',
 '61.0',
 '68.0',
 '73.0',
 '60',
 '72.0',
 '50',
 '76.0']

The above shows the user ratings for movies in the landing page of the TMdb movie web page.
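
The ratings come back as strings, and some lack a decimal part (`'60'`, `'50'`). If numeric values are wanted for later sorting or averaging, a small conversion step could look like the sketch below. The helper name `to_score` is hypothetical.

```python
# A few values copied from the output above.
raw_ratings = ['79.0', '82.0', '60', '50']


def to_score(value):
    """Convert a `data-percent` string to a float, or None if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # Covers missing attributes (None) and non-numeric text.
        return None


scores = [to_score(r) for r in raw_ratings]
print(scores)
```

Returning `None` for unusable values keeps the list the same length as the list of movie names.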

Each movie URL can be built by prepending the base URL https://www.themoviedb.org to the relative link in `.a['href']`.

![](https://imgur.com/8D8DYAq.png)

In [None]:
def get_movies_urls(doc):
    """
    Function to extract the movie links from HTML source code using BeautifulSoup.
    """
    movies_urls = []
    base_url = 'https://www.themoviedb.org'
    movies_names_tags = doc.find_all('h2')[4:]  # Exclude the first 4 h2 tags, which are not movie titles
    # Loop through the webpage to get the URL of each movie
    for tag in movies_names_tags:
        movies_urls.append(base_url + tag.a['href'])
    return movies_urls

In [None]:
# Get the URL of each movie in the webpage using the BeautifulSoup object `doc`.
get_movies_urls(doc)

['https://www.themoviedb.org/movie/460465',
 'https://www.themoviedb.org/movie/399566',
 'https://www.themoviedb.org/movie/615457',
 'https://www.themoviedb.org/movie/791373',
 'https://www.themoviedb.org/movie/632357',
 'https://www.themoviedb.org/movie/615678',
 'https://www.themoviedb.org/movie/634528',
 'https://www.themoviedb.org/movie/412656',
 'https://www.themoviedb.org/movie/527774',
 'https://www.themoviedb.org/movie/635302',
 'https://www.themoviedb.org/movie/663558',
 'https://www.themoviedb.org/movie/664767',
 'https://www.themoviedb.org/movie/458576',
 'https://www.themoviedb.org/movie/804435',
 'https://www.themoviedb.org/movie/464052',
 'https://www.themoviedb.org/movie/587807',
 'https://www.themoviedb.org/movie/793723',
 'https://www.themoviedb.org/movie/581389',
 'https://www.themoviedb.org/movie/791910',
 'https://www.themoviedb.org/movie/544401']
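
String concatenation works here because every `href` starts with `/`. As an alternative sketch, the standard library's `urljoin` builds the same URLs and also copes with links that are already absolute.

```python
from urllib.parse import urljoin

base_url = 'https://www.themoviedb.org'
relative_links = ['/movie/460465', '/movie/399566']

# urljoin resolves each link against the base URL.
absolute_links = [urljoin(base_url, href) for href in relative_links]
print(absolute_links)

# A link that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, 'https://www.themoviedb.org/movie/615457')
print(already_absolute)
```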

By now we have movie names, user rating, and the movie URLs for the first page.

Let’s first consider a sample movie web page, Godzilla vs. Kong, and see how to parse its HTML tags to get additional information such as the release date, genre, runtime, and director.

![](https://imgur.com/N1wxPw8.png)

To read additional movie information, let's redefine `get_movies_page` so that it accepts a movie URL.

In [None]:
# Let's read a movie page
def get_movies_page(movies_url):
    """
    Function to read the HTML source code using BeautifulSoup.
    """
    # Download the page
    response = requests.get(movies_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(movies_url))
    # Parse using Beautiful soup
    movies_doc = BeautifulSoup(response.text, 'html.parser')
    return movies_doc

In [None]:
doc1 = get_movies_page('https://www.themoviedb.org/movie/399566')

We have the HTML source code in the BeautifulSoup object `doc1`.

In [None]:
# Find the `div` tag under `facts` class to get the release date, genre and runtime
div_tags = doc1.find('div', class_ = 'facts')

release_date = div_tags.text.split()[1]
genre = div_tags.text.split()[3:-2]
runtime = div_tags.text.split()[-2:]

# Print and validate the result is correct
print(release_date, genre, runtime)

03/31/2021 ['Science', 'Fiction,', 'Action'] ['1h', '53m']
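
To see why these indexes work, it helps to look at the tokens that `split()` produces. The sketch below uses an assumed version of the `facts` text (certification, date, country, genres, runtime); the real page contains extra whitespace and may be laid out differently.

```python
# Assumed text of the `facts` block for this movie, for illustration only.
facts_text = 'PG-13 03/31/2021 (US) Science Fiction, Action 1h 53m'

tokens = facts_text.split()
release_date = tokens[1]   # second token: the date
genre = tokens[3:-2]       # everything between the country and the runtime
runtime = tokens[-2:]      # last two tokens: hours and minutes
print(release_date, genre, runtime)

# The indexes depend on position. If the certification were missing,
# every token would shift and index 1 would no longer be the date.
shifted_tokens = '03/31/2021 (US) Science Fiction, Action 1h 53m'.split()
shifted_date = shifted_tokens[1]
print(shifted_date)
```

Because of this, positional indexing is best treated as a quick approach that should be checked against a few different movie pages.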


In [None]:
# Find the `div` tag under `scroller_wrap should_fade is_fading` class to get the director
d_tags = doc1.find_all('div', {'class':'scroller_wrap should_fade is_fading'})

# Print and validate the result
print (d_tags[0].text.strip().partition("\n")[0])

Alexander Skarsgård


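The `div` tag with class `facts` contains the release date, genre and runtime details, and the first `scroller_wrap` block provides the name stored as `director`. Note that this block appears to be the cast scroller, so the value captured here (Alexander Skarsgård for Godzilla vs. Kong) is the first-listed cast member rather than the film's director.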

In [None]:
def get_movies_info(doc):
    """
    Function to get the movie information:
    release date, genre, runtime and director.
    """
    div1_tags = doc.find('div', class_ = 'facts')
    release_date = div1_tags.text.split()[1]
    genre = div1_tags.text.split()[3:-2]
    runtime = div1_tags.text.split()[-2:]

    div2_tags = doc.find_all('div', {'class':'scroller_wrap should_fade is_fading'})
    director = div2_tags[0].text.strip().partition("\n")[0]

    return release_date, genre, runtime, director

In [None]:
# Call the `get_movies_info` for movie `Godzilla vs. Kong`.
get_movies_info(doc1)

('03/31/2021',
 ['Science', 'Fiction,', 'Action'],
 ['1h', '53m'],
 'Alexander Skarsgård')
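
The values returned above are raw tokens. If cleaner types are wanted later, a post-processing step could look like the following sketch. The helper names are hypothetical, and the formats (`MM/DD/YYYY`, `1h 53m`) are taken from the outputs shown in this notebook.

```python
import re
from datetime import datetime


def parse_release_date(text):
    """Turn 'MM/DD/YYYY' into a datetime.date."""
    return datetime.strptime(text, '%m/%d/%Y').date()


def runtime_to_minutes(parts):
    """Turn tokens such as ['1h', '53m'] into a total number of minutes."""
    text = ' '.join(parts)
    hours = re.search(r'(\d+)h', text)
    minutes = re.search(r'(\d+)m', text)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


def clean_genres(parts):
    """Rejoin split tokens and separate them on commas instead of spaces."""
    return [g.strip() for g in ' '.join(parts).split(',')]


print(parse_release_date('03/31/2021'))
print(runtime_to_minutes(['1h', '53m']))
print(clean_genres(['Science', 'Fiction,', 'Action']))
```

Splitting on commas keeps multi-word genres such as "Science Fiction" together.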

In [None]:
# Get the page for the movie `Raya and the Last Dragon`.
doc2 = get_movies_page('https://www.themoviedb.org/movie/527774')

In [None]:
# Find the `div` tag under `facts` class to get the release date, genre and runtime
div_tags = doc2.find('div', class_ = 'facts')

release_date = div_tags.text.split()[1]
genre = div_tags.text.split()[3:-2]
runtime = div_tags.text.split()[-2:]

# Print and validate the result is correct
print(release_date, genre, runtime)

03/05/2021 ['Animation,', 'Adventure,', 'Fantasy,', 'Family,', 'Action'] ['1h', '47m']


In [None]:
# Call `get_movies_info` for the movie `Raya and the Last Dragon`.
get_movies_info(doc2)

('03/05/2021',
 ['Animation,', 'Adventure,', 'Fantasy,', 'Family,', 'Action'],
 ['1h', '47m'],
 'Kelly Marie Tran')

The above logic can be extended to get the release dates, genres, runtimes, and directors for all the URLs we have from the landing page.


In [None]:
def get_all_movies_details(urls):
    """
    Function to collect the genres, release dates, runtimes and directors for a list of movie URLs.
    """
    genres = []
    release_dates = []
    runtimes = []
    directors = []

    # Loop through all the URLs of the movies
    for url in urls:
        movie_doc = get_movies_page(url)
        # get_movies_info returns release_date, genre, runtime, director.
        release_date, genre, runtime, director = get_movies_info(movie_doc)
        # Join the genre tokens into a single space-separated string.
        genres.append(" ".join(genre))
        release_dates.append(release_date)
        runtimes.append(" ".join(runtime))
        directors.append(director)

    return genres, release_dates, runtimes, directors
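
This loop sends one request per movie in quick succession, and a single failing page stops the whole run. A more forgiving loop might pause between requests and record failures instead of raising, as in the sketch below. The function `collect_details` and its `fetch_info` argument are hypothetical; `fetch_info` stands in for "download the page, then extract its details".

```python
import time


def collect_details(urls, fetch_info, delay_seconds=1.0):
    """Call `fetch_info(url)` for each URL, pausing between calls.

    Returns a tuple (results, failed), where `failed` holds
    (url, error message) pairs for pages that could not be processed.
    """
    results = []
    failed = []
    for position, url in enumerate(urls):
        if position > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        try:
            results.append(fetch_info(url))
        except Exception as error:
            # Record the problem and carry on with the next URL.
            failed.append((url, str(error)))
    return results, failed
```

With the functions from this notebook, `fetch_info` could be something like `lambda url: get_movies_info(get_movies_page(url))`.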

We now have all the details we set out to retrieve from the TMdb web pages: `names`, `ratings`, `genres`, `release_dates`, `runtimes`, `directors` and `urls`.

### Putting all the Pieces Together

We’ve got all the information in different pieces of our BeautifulSoup scraper. We need to assemble them into a single function and make it as reusable as possible.

I’ve used a Python dictionary to store the movie information as key-value pairs, and then converted that dictionary to a pandas `DataFrame`, which holds the tabular movie information in rows and columns.

In [None]:
def scrape_movies():
    """
    Function to download web page using `requests` and
    to extract the HTML source code using BeautifulSoup.
    """
    # Let's get the popular movies listing from the TMdb website
    page_count = 1 # Initializing the movie page count to 1
    # Define lists for all the movie attributes
    all_names = []
    all_ratings = []
    all_genres = []
    all_release_dates = []
    all_runtimes = []
    all_directors = []
    all_urls = []

    while page_count < 8: # Loop over pages 1 to 7 of the TMdb web page
        movies_url = "https://www.themoviedb.org/movie?page=%d" %(page_count)
        # Access the webpage using `requests`
        response = requests.get(movies_url)
        # Check if the request was successful
        if response.status_code != 200:
            raise Exception('Failed to load page {}'.format(movies_url))
        # Parse the `response` text using BeautifulSoup
        doc = BeautifulSoup(response.text, 'html.parser')

        urls = get_movies_urls(doc)
        genres, release_dates, runtimes, directors = get_all_movies_details(urls)

        # Append each movie attribute to respective lists
        all_names += get_movies_names(doc)
        all_ratings += get_movies_rating(doc)
        all_genres += genres
        all_release_dates += release_dates
        all_runtimes += runtimes
        all_directors += directors
        all_urls += urls
        page_count += 1

    # Define a dictionary to store the movie information
    movies_dict = {
        'name': all_names,
        'rating': all_ratings,
        'genre': all_genres,
        'release_date': all_release_dates,
        'runtime': all_runtimes,
        'director': all_directors,
        'url': all_urls
    }
    return pd.DataFrame(movies_dict)

In my project, I’m scraping seven pages, and since each page lists 20 movies, my output dataset has 140 rows. The more movie listings you want, the more web pages you need to scrape.
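
The page arithmetic can be checked with a short sketch. It builds the same URLs as the `while` loop, using `range(1, 8)`, which covers pages 1 to 7. The figure of 20 movies per page is the one stated in this tutorial.

```python
url_template = 'https://www.themoviedb.org/movie?page={}'

# range(1, 8) yields 1..7, matching `while page_count < 8` starting from 1.
page_urls = [url_template.format(page) for page in range(1, 8)]

movies_per_page = 20  # as stated in this tutorial
expected_rows = len(page_urls) * movies_per_page

print(len(page_urls), expected_rows)
print(page_urls[0])
print(page_urls[-1])
```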

Let's run the scraper, preview the movies dataframe, and then save it to a `.csv` file.

In [None]:
# Invoke the scrape_movies functionality
movies_df = scrape_movies()
movies_df.head() # View the first few rows of the output

Unnamed: 0,name,rating,genre,release_date,runtime,director,url
0,Mortal Kombat,79.0,"Fantasy, Action, Adventure, Science Fiction, T...",04/23/2021,1h 50m,Lewis Tan,https://www.themoviedb.org/movie/460465
1,Godzilla vs. Kong,82.0,"Science Fiction, Action",03/31/2021,1h 53m,Alexander Skarsgård,https://www.themoviedb.org/movie/399566
2,Nobody,85.0,"Action, Thriller, Crime",03/26/2021,1h 32m,Bob Odenkirk,https://www.themoviedb.org/movie/615457
3,Zack Snyder's Justice League,85.0,"Action, Adventure, Fantasy, Science Fiction",03/18/2021,4h 2m,Ben Affleck,https://www.themoviedb.org/movie/791373
4,The Unholy,57.0,Horror,04/02/2021,1h 39m,Jeffrey Dean Morgan,https://www.themoviedb.org/movie/632357


In [None]:
# Save the dataset to `.csv` format
movies_df.to_csv('movies.csv', index=False)  # index=False leaves the row index out of the file

We can check that the CSV was created properly by reading the csv file using `pandas`.

In [None]:
df = pd.read_csv('movies.csv')
df.head()

Unnamed: 0,name,rating,genre,release_date,runtime,director,url
0,Mortal Kombat,79.0,"Fantasy, Action, Adventure, Science Fiction, T...",04/23/2021,1h 50m,Lewis Tan,https://www.themoviedb.org/movie/460465
1,Godzilla vs. Kong,82.0,"Science Fiction, Action",03/31/2021,1h 53m,Alexander Skarsgård,https://www.themoviedb.org/movie/399566
2,Nobody,85.0,"Action, Thriller, Crime",03/26/2021,1h 32m,Bob Odenkirk,https://www.themoviedb.org/movie/615457
3,Zack Snyder's Justice League,85.0,"Action, Adventure, Fantasy, Science Fiction",03/18/2021,4h 2m,Ben Affleck,https://www.themoviedb.org/movie/791373
4,The Unholy,57.0,Horror,04/02/2021,1h 39m,Jeffrey Dean Morgan,https://www.themoviedb.org/movie/632357
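
The effect of the `index` argument of `to_csv` can be seen with a small in-memory example. This sketch uses a two-row sample dataframe and `io.StringIO` in place of a real file.

```python
import io

import pandas as pd

sample_df = pd.DataFrame({
    'name': ['Mortal Kombat', 'Godzilla vs. Kong'],
    'rating': ['79.0', '82.0'],
})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

When the index is written, `read_csv` labels that column `Unnamed: 0`; leaving it out gives a file with only the data columns.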


### Summary

1. Downloaded the TMdb movie web page using `Requests`.
2. Parsed the HTML using BeautifulSoup (bs4).
3. Extracted the movie information: movie name, user rating, release date, genre, duration, director and URL.
4. Compiled the movie information into Python lists and dictionaries, and then into a pandas DataFrame.
5. Extracted the movie information for multiple pages.
6. Saved the dataset in `.csv` format.

### Future Work and References

In the near future, I plan to:
* Perform web scraping using Selenium or Scrapy.
* Perform data cleaning on the data.
* Visualize the data to get useful insights.

References:

1. Let’s Build a Python Web Scraping Project from Scratch | Hands-On Tutorial by Aakash N S, CEO, Jovian: https://www.youtube.com/watch?v=RKsLLG-bzEY
2. Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
3. https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
4. https://towardsdatascience.com/web-scraping-yahoo-finance-477fe3daa852


Let's now save the notebook to the Jovian platform.

In [None]:
!pip install jovian --upgrade --quiet

In [None]:
import jovian

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(files = ['movies.csv'], project="web-scraping-project")

