Web Scraping IMDb Data with Python and BeautifulSoup

In the world of data analysis and research, web scraping is a powerful tool to collect data from websites for various purposes. In this tutorial, we will walk you through the process of scraping IMDb data using Python and the BeautifulSoup library. IMDb is a popular website that provides information about movies, including ratings, release years, and more.

We’ll go through the following steps:

1. Import Necessary Libraries
2. Define the Target URL
3. Fetch the HTML Content
4. Parse the HTML
5. Extract and Process Data
6. Transform and Save Data

Step 1: Import Necessary Libraries
We start by importing the libraries we need for web scraping:

In [12]:
from urllib.request import urlopen, Request  # open URLs and build requests with custom headers
from bs4 import BeautifulSoup as soup        # HTML parser
import pandas as pd                          # data manipulation and storage

urlopen and Request from urllib.request: Request builds an HTTP request with custom headers, and urlopen opens the URL and fetches the page's HTML content.
BeautifulSoup from bs4: We use BeautifulSoup to parse the HTML, which makes it easier to navigate and extract data.
pandas: We use pandas for data manipulation and storage.

Step 2: Define the Target URL
Next, we define the URL of the IMDb webpage we want to scrape. In this example, we’re targeting a specific search page:

In [8]:
my_url = "http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012"
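
The query string in this URL carries the search settings (sort order, start offset, title type, and year range). As a minimal sketch, the same URL can be assembled from a dictionary with the standard library's urlencode. The helper name build_search_url is hypothetical, and the idea that changing start moves to a later page of results is an assumption about how the site paginates, not something this tutorial verifies.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.imdb.com/search/title"


def build_search_url(start=1):
    """Build the search URL used in this tutorial for a given start offset.

    Hypothetical helper: `start` is assumed to be the index of the first result.
    """
    params = {
        "sort": "num_votes,desc",
        "start": start,
        "title_type": "feature",
        "year": "1950,2012",
    }
    # safe="," keeps the commas readable instead of encoding them as %2C
    return BASE_URL + "?" + urlencode(params, safe=",")


print(build_search_url())
```

Keeping the parameters in a dictionary makes it easier to change one setting, such as the year range, without editing a long string by hand.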

Step 3: Fetch the HTML Content
We open a connection to the URL and retrieve the HTML content:

In [9]:
req = Request(my_url, headers={'User-Agent': 'Mozilla/5.0'})
uClient = urlopen(req)
page_html = uClient.read()
uClient.close()

Here, we use Request to attach a browser-like User-Agent header (some sites reject the default Python one) and urlopen to connect to the IMDb webpage and read its HTML content. At this point page_html is a raw bytes string that is hard to read and search, so we need to parse it into a structure that is easy to navigate.
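
The fetch step can also be wrapped in a small function. The sketch below is one possible shape, not the only way to do it: the name fetch_html, the 10 second timeout, and the fallback to UTF-8 when the server does not declare a charset are all assumptions made for illustration. It closes the connection automatically and returns decoded text instead of bytes.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10.0):
    """Fetch a page and return its HTML as text (hypothetical helper)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # The with-block closes the connection even if read() raises
    with urlopen(req, timeout=timeout) as response:
        raw = response.read()
        # Assumption: fall back to UTF-8 when no charset is declared
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example usage (performs a network request, so it is left commented out):
# page_html = fetch_html(my_url)
```

BeautifulSoup accepts either bytes or text, so the returned string can be passed to the parser in the next step unchanged.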

Step 4: Parse the HTML
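With BeautifulSoup, we parse the HTML content using Python's built-in html.parser. Displaying page_soup shows the parsed document (the output below is truncated):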

In [10]:
page_soup = soup(page_html, "html.parser")
page_soup

<!DOCTYPE html>
<html lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/"><head><meta charset="utf-8"/><meta content="width=device-width" name="viewport"/><script>if(typeof uet === 'function'){ uet('bb', 'LoadTitle', {wb: 1}); }</script><script>window.addEventListener('load', (event) => {
        if (typeof window.csa !== 'undefined' && typeof window.csa === 'function') {
            var csaLatencyPlugin = window.csa('Content', {
                element: {
                    slotId: 'LoadTitle',
                    type: 'service-call'
                }
            });
            csaLatencyPlugin('mark', 'clickToBodyBegin', 1703508386995);
        }
    })</script><title>Advanced search</title><meta content="" data-id="main" name="description"/><meta content="https://www.imdb.com/search/" property="og:url"/><meta content="IMDb" property="og:site_name"/><meta content="Advanced search" property="og:title"/><meta content="" property=

This step allows us to navigate and extract specific data from the webpage more easily.
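
To illustrate what navigating a parsed document looks like, here is a minimal sketch on a small hand-written HTML string. The markup, the "movie" class name, and the variable names are invented for the example and do not reflect IMDb's actual page structure.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup, not taken from IMDb
sample_html = """
<html><head><title>Sample page</title></head>
<body>
  <div class="movie"><a href="/title/1">First Movie</a></div>
  <div class="movie"><a href="/title/2">Second Movie</a></div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Attribute access returns the first matching tag
page_title = sample_soup.title.text

# find_all returns every tag that matches the name and attributes
movie_divs = sample_soup.find_all("div", {"class": "movie"})

# Tag attributes such as href are read with dictionary-style indexing
links = [(div.a.text, div.a["href"]) for div in movie_divs]

print(page_title)
print(links)
```

The same three operations (attribute access, find_all, and indexing into tag attributes) are the ones used in the extraction step.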

Step 5: Extract and Process Data
In this step, we locate the element that wraps each movie on the page, extract the name, release year, and runtime from it, and write each movie as a row to a CSV file. The container class in the code is a placeholder, and the span class names depend on the page's markup, so adjust them to match the HTML you actually receive.

In [13]:
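import csv

filename = "imdb_m.csv"

# "your-container-class" is a placeholder: replace it with the class of the
# <div> that wraps each movie on the page you are scraping.
# Call find_all on the parsed page (page_soup), not on the soup class alias;
# soup.findAll(...) raises
# AttributeError: 'str' object has no attribute 'descendants'
containers = page_soup.find_all("div", {"class": "your-container-class"})

# latin1 matches the encoding used to read the file back in Step 6;
# characters outside latin1 are written as "?"
with open(filename, "w", newline="", encoding="latin1", errors="replace") as f:
    writer = csv.writer(f)  # quotes names that contain commas
    writer.writerow(["Name", "Year", "Runtime"])

    for container in containers:
        name = container.img["alt"]
        year = container.find_all("span", {"class": "lister-item-year"})[0].text.strip()
        runtime = container.find_all("span", {"class": "runtime"})[0].text.strip()

        print(name, year, runtime, sep=",")
        writer.writerow([name, year, runtime])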
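
Listings do not always carry every field, and the scraped text usually needs cleaning before analysis. The following is a minimal sketch of an extraction routine that tolerates missing elements and converts the year and runtime to integers. It runs on hand-written sample markup: the "movie-item" container class and the helper names are hypothetical, and only the span class names are taken from the code in this step.

```python
import re

from bs4 import BeautifulSoup

# Hand-written sample markup; the second movie has no runtime on purpose
sample_html = """
<div class="movie-item">
  <img alt="Example Movie A"/>
  <span class="lister-item-year">(1994)</span>
  <span class="runtime">142 min</span>
</div>
<div class="movie-item">
  <img alt="Example Movie B"/>
  <span class="lister-item-year">(2008)</span>
</div>
"""


def text_or_empty(container, css_class):
    """Return the stripped text of the first matching span, or "" if absent."""
    tag = container.find("span", {"class": css_class})
    return tag.text.strip() if tag is not None else ""


def first_int(text):
    """Return the first run of digits in text as an int, or None."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def extract_rows(parsed):
    """Collect one dictionary per movie container."""
    rows = []
    for container in parsed.find_all("div", {"class": "movie-item"}):
        img = container.find("img")
        rows.append({
            "Name": img["alt"] if img is not None and img.has_attr("alt") else "",
            "Year": first_int(text_or_empty(container, "lister-item-year")),
            "Runtime": first_int(text_or_empty(container, "runtime")),
        })
    return rows


rows = extract_rows(BeautifulSoup(sample_html, "html.parser"))
print(rows)
```

Using find instead of indexing into a result list means a missing field yields None rather than an IndexError, so one incomplete listing does not stop the whole run.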

Step 6: Transform and Save Data
After extracting and processing the data, you can use pandas to load the CSV file into a DataFrame and inspect the first few rows:

In [None]:
import pandas as pd
imdb = pd.read_csv("imdb_m.csv", encoding="latin1")
imdb.head()
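
As a sketch of the transform-and-save part of this step, the example below builds a DataFrame directly from a list of rows, converts the year and runtime text to integers, and writes the result as CSV. The rows are made-up sample data in the shape the scraper is assumed to produce (years like "(1994)" and runtimes like "142 min"), and the output goes to an in-memory buffer so that no file is overwritten.

```python
import io

import pandas as pd

# Made-up sample rows in the assumed scraped format
rows = [
    {"Name": "Example Movie A", "Year": "(1994)", "Runtime": "142 min"},
    {"Name": "Example Movie B", "Year": "(2008)", "Runtime": "152 min"},
]
movies = pd.DataFrame(rows)

# Pull the digits out of each text column and convert them to integers
movies["Year"] = movies["Year"].str.extract(r"(\d{4})", expand=False).astype(int)
movies["Runtime"] = movies["Runtime"].str.extract(r"(\d+)", expand=False).astype(int)

# Write to an in-memory buffer; pass a file path instead to save to disk
buffer = io.StringIO()
movies.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

With numeric columns in place, operations such as sorting by year or averaging the runtime work without further string handling.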