<a href="https://colab.research.google.com/github/beriaacan/Web-Scraping-Applications/blob/main/IMDb%20Scraper%20and%20Web%20Application/IMDB_Scraper_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we scrape IMDb's list of the top 1000 movies as rated by users. Using the Requests and BeautifulSoup libraries in Python, we extract each movie's name, certificate rating, duration, genre, IMDb rating, Metascore, director, stars, vote count, gross earnings, and a brief plot summary.


The goal of this project is to collect these details across multiple result pages into a single dataset that can support analyses and recommendations. The final dataset will be used for further analysis and visualization.

#Installing required libraries

Requests: This library allows us to make HTTP requests to web pages, which is crucial for fetching web content.

Beautiful Soup (beautifulsoup4): This is a powerful library for web scraping. It helps parse and extract data from HTML and XML documents.

We need both packages to scrape data from a website. The cells below first mount Google Drive and add the project folder to the Python path, then install the two packages with pip in case they are not already available.

In [14]:
# This mounts your Google Drive to the Colab VM.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [15]:
FOLDERNAME = 'deneme/web scraping'
assert FOLDERNAME is not None, "[!] Enter the foldername."

# Now that we've mounted your Drive, this ensures that
# the Python interpreter of the Colab VM can load
# python files from within it.
import sys
sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))

In [16]:
!pip install beautifulsoup4



In [3]:
!pip install requests



#Importing libraries

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup

Seaborn (sns): Seaborn is a data visualization library that works in conjunction with Matplotlib. It provides a high-level interface for creating informative and attractive statistical graphics.

Requests: The 'requests' library enables making HTTP requests to fetch data from websites. It's essential for web scraping.

Beautiful Soup (from bs4 import BeautifulSoup): Beautiful Soup is a library for web scraping. It allows you to parse and navigate HTML and XML documents, making it easier to extract data from web pages.
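To make the role of Beautiful Soup concrete, here is a minimal sketch that parses a small, made-up HTML string and reads a few values from it. The markup, tag names, and class names are invented for illustration and are not IMDb's real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for illustration only
html = """
<div class="movie">
  <h3><a href="/title/1">Example Movie</a></h3>
  <span class="genre"> Drama, Crime </span>
</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# Attribute-style navigation walks down to the first matching tag
title = soup_demo.h3.a.text.strip()

# find() returns the first tag matching a name and class
genre = soup_demo.find("span", class_="genre").text.strip()

# Tag attributes are read like dictionary keys
link = soup_demo.h3.a["href"]

# A missing element yields None, so check before reading .text
missing = soup_demo.find("span", class_="runtime")

print(title, "|", genre, "|", link, "|", missing)
```

The last lookup is the case to watch when scraping real pages: calling `.text` on a `None` result raises an `AttributeError`, which is why optional fields need a guard.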

#Web Scraping IMDb's Top 1000 Movies

In this cell, we perform the initial steps of web scraping IMDb's top 1000 movies:

URL Generation: We build one URL per results page. Each request asks for 100 movies (count=100), so 10 pages cover all 1000 titles. The start parameter takes the values 1, 101, ..., 901 to select each page, and the sort parameter orders the results by user rating, descending.

HTTP Requests: For each URL, we send an HTTP GET request using the 'requests' library to fetch the webpage's HTML content.

Parsing with BeautifulSoup: We use BeautifulSoup to parse the HTML content of each page. BeautifulSoup allows us to navigate and extract specific information from the web pages.
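The URL generation step can be checked on its own, without sending any requests. The sketch below rebuilds the same ten URLs with a list comprehension; `BASE_URL`, `PAGE_SIZE`, `NUM_PAGES`, and `build_urls` are hypothetical names introduced here for illustration.

```python
# Hypothetical constants; the values mirror the query string used in this notebook
BASE_URL = "https://www.imdb.com/search/title/"
PAGE_SIZE = 100   # movies requested per page (the 'count' parameter)
NUM_PAGES = 10    # 10 pages x 100 movies = 1000 movies


def build_urls(num_pages=NUM_PAGES, page_size=PAGE_SIZE):
    """Return one search URL per page; 'start' is 1, 101, 201, and so on."""
    return [
        f"{BASE_URL}?groups=top_1000&sort=user_rating,desc"
        f"&count={page_size}&start={page_size * i + 1}&ref_=adv_nxt"
        for i in range(num_pages)
    ]


urls = build_urls()
print(len(urls))   # number of pages
print(urls[0])     # first page, start=1
print(urls[-1])    # last page, start=901
```

Printing the first and last URL is a quick way to confirm that the `start` values begin at 1 and end at 901 before any network traffic is generated.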

In [5]:
url = []   # Initialize an empty list to store URLs
page = []  # Initialize an empty list to store webpage content
soup = []  # Initialize an empty list to store BeautifulSoup objects

In [6]:
for i in range(0, 10):
    # Generate IMDb URLs for the top 1000 movies, adjusting the 'start' parameter
    url.append(f"https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start={100 * i + 1}&ref_=adv_nxt")

    # Send an HTTP GET request to fetch the webpage content
    page.append(requests.get(url[i]))

    # Parse the webpage content using BeautifulSoup with the "html.parser" parser
    soup.append(BeautifulSoup(page[i].content, "html.parser"))
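The loop above assumes that every request succeeds. A more defensive variant might look like the sketch below, which adds a timeout, checks the HTTP status, and reuses one session. The helpers `make_session` and `fetch_soup` are hypothetical, and the User-Agent string is a placeholder; whether a custom User-Agent is needed for a given site is an assumption to verify, not something this notebook establishes.

```python
import requests
from bs4 import BeautifulSoup


def make_session(user_agent="Mozilla/5.0 (compatible; example-scraper)"):
    """Create a session that sends the same headers on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})  # placeholder value
    return session


def fetch_soup(url, session, timeout=10):
    """Fetch one page and return a BeautifulSoup object, or None on failure."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raises for 4xx and 5xx status codes
    except requests.RequestException as exc:
        print(f"Request failed for {url}: {exc}")
        return None
    return BeautifulSoup(response.content, "html.parser")


# Example usage (sends real requests, so it is left commented out):
# session = make_session()
# soups = [fetch_soup(u, session) for u in url]
# soups = [s for s in soups if s is not None]  # drop pages that failed
```

Returning `None` on failure keeps one bad page from stopping the whole run, at the cost of having to filter out the failed pages before parsing.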

In [7]:
imdb_top1000 = pd.DataFrame(columns=['Movie Name', 'Certificate', 'Duration', 'Genre', 'IMDb Rating', 'Metascore', 'Director', 'Stars', 'Votes', 'Grossed', 'Plot'])

- Movie Name: The name of the movie.
- Certificate: The age certificate or rating of the movie.
- Duration: The duration of the movie.
- Genre: The genre(s) of the movie.
- IMDb Rating: The IMDb rating of the movie.
- Metascore: The Metascore rating of the movie (if available).
- Director: The director(s) of the movie.
- Stars: The main cast or stars of the movie.
- Votes: The number of votes/ratings the movie has received on IMDb.
- Grossed: The total gross earnings of the movie (if available).
- Plot: A brief plot summary of the movie.
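Every one of these columns is scraped as text, so numeric fields need converting before any analysis. The sketch below shows one way to do that with pandas on made-up sample values. The string formats ("142 min", "$28.34M") are assumptions about how the scraped text looks and should be checked against the real data.

```python
import pandas as pd

# Made-up rows for illustration; the formats are assumed, not real scraped output
sample = pd.DataFrame({
    "Duration": ["142 min", "175 min", ""],
    "Votes": ["2500000", "1800000", ""],
    "Grossed": ["$28.34M", "$134.97M", ""],
    "Metascore": ["82", "100", ""],
})

cleaned = pd.DataFrame({
    # Pull out the leading digits, then convert; empty strings become NaN
    "Duration (min)": pd.to_numeric(
        sample["Duration"].str.extract(r"(\d+)")[0], errors="coerce"
    ),
    "Votes": pd.to_numeric(sample["Votes"], errors="coerce"),
    # Keep digits and the decimal point, dropping the currency sign and suffix
    "Grossed (M$)": pd.to_numeric(
        sample["Grossed"].str.extract(r"([\d.]+)")[0], errors="coerce"
    ),
    "Metascore": pd.to_numeric(sample["Metascore"], errors="coerce"),
})

print(cleaned)
```

Using `errors="coerce"` turns missing or malformed values into `NaN` instead of raising, which suits fields such as Metascore and Grossed that are only present for some movies.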

In [25]:
def text_or_empty(tag):
    # Return the stripped text of a tag, or '' when the tag was not found
    return tag.text.strip() if tag else ''

rows = []  # One dict per movie; DataFrame.append was removed in pandas 2.0

for i in range(10):  # Loop through the 10 pages
    # Find all the movie containers on the page
    movie_containers = soup[i].find_all('div', class_='lister-item mode-advanced')

    for container in movie_containers:
        # Look up repeated elements once instead of searching twice per field
        crew_p = container.find('p', class_='')
        nv_spans = container.find_all('span', attrs={'name': 'nv'})
        muted_ps = container.find_all('p', class_='text-muted')

        # Stars: links whose href contains "st" (a loose substring match)
        stars = [star.text.strip() for star in container.find_all('a', href=lambda href: href and "st" in href)]

        rows.append({
            'Movie Name': container.h3.a.text.strip(),
            'Certificate': text_or_empty(container.p.find('span', class_='certificate')),
            'Duration': text_or_empty(container.p.find('span', class_='runtime')),
            'Genre': text_or_empty(container.p.find('span', class_='genre')),
            'IMDb Rating': text_or_empty(container.strong),
            'Metascore': text_or_empty(container.find('span', class_='metascore')),
            # Director: first link in the paragraph with an empty class, if that paragraph exists
            'Director': text_or_empty(crew_p.find('a')) if crew_p else '',
            'Stars': ', '.join(stars),
            # Votes: first 'nv' span; Grossed: second 'nv' span when present
            'Votes': nv_spans[0].text.strip().replace(',', '') if nv_spans else '',
            'Grossed': nv_spans[1].text.strip().replace(',', '') if len(nv_spans) > 1 else '',
            # Plot: last muted paragraph in the container
            'Plot': text_or_empty(muted_ps[-1]) if muted_ps else '',
        })

# Build the DataFrame in one step, keeping the column order defined earlier
imdb_top1000 = pd.DataFrame(rows, columns=imdb_top1000.columns)
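The extraction logic can be tried offline on a single container. The sketch below runs the same lookups against a hypothetical HTML fragment that only mimics the class names used in the cell above; it is not real IMDb output, and it deliberately omits the certificate, Metascore, and gross earnings to show how missing fields fall back to an empty string.

```python
from bs4 import BeautifulSoup

# Hypothetical container, invented for illustration; several fields are left out on purpose
sample_html = """
<div class="lister-item mode-advanced">
  <h3><a href="/title/tt0000001/">Example Movie</a></h3>
  <p class="text-muted">
    <span class="runtime">120 min</span>
    <span class="genre"> Drama </span>
  </p>
  <strong>8.5</strong>
  <p class="text-muted">A short made-up plot.</p>
  <span name="nv">1,234</span>
</div>
"""


def text_or_empty(tag):
    """Return the stripped text of a tag, or '' when the tag was not found."""
    return tag.text.strip() if tag else ''


demo_soup = BeautifulSoup(sample_html, "html.parser")
container = demo_soup.find('div', class_='lister-item mode-advanced')
nv_spans = container.find_all('span', attrs={'name': 'nv'})
muted_ps = container.find_all('p', class_='text-muted')

record = {
    'Movie Name': container.h3.a.text.strip(),
    'Certificate': text_or_empty(container.p.find('span', class_='certificate')),
    'Duration': text_or_empty(container.p.find('span', class_='runtime')),
    'Genre': text_or_empty(container.p.find('span', class_='genre')),
    'IMDb Rating': text_or_empty(container.strong),
    'Metascore': text_or_empty(container.find('span', class_='metascore')),
    'Votes': nv_spans[0].text.strip().replace(',', '') if nv_spans else '',
    'Grossed': nv_spans[1].text.strip().replace(',', '') if len(nv_spans) > 1 else '',
    'Plot': text_or_empty(muted_ps[-1]) if muted_ps else '',
}

print(record)
```

Testing against a small fixture like this makes it easier to notice when a selector silently returns nothing, which is the usual symptom when a site changes its markup.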
