<a href="https://colab.research.google.com/github/anilsolanki2645/WebScarping/blob/main/IMDB_Movie_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Scraping IMDb in a single request to fetch the top 250 movies, sorted by rating

* Mounts Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


* Importing all necessary libraries:

In [2]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint


* Movie Data Scraping and Storage Initialization

In [3]:
# Request headers: ask IMDb for English-language content
headers = {"Accept-Language": "en-US,en;q=0.5"}

# Empty lists that collect one value per movie as the page is parsed
movie_name = []
year = []
time = []
rating = []
metascore = []
votes = []
description = []

* Page Range Initialization for Movie Scraping: a single page covering movies 1 to 250

In [4]:
# np.arange(1, 250, 250) evaluates to array([1]): a single start index,
# because one request with count=250 returns all 250 movies
pages = np.arange(1,250,250)
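
As a quick check of the range arithmetic, the sketch below uses Python's built-in `range`, which follows the same start/stop/step rule as `np.arange` for integers. The `page_starts` helper and the page size of 50 are hypothetical and used only for illustration; they are not part of the notebook.

```python
def page_starts(total, per_page):
    """Return the 1-based index of the first item on each page."""
    return list(range(1, total + 1, per_page))

# The notebook's range: start=1, stop=250, step=250 gives a single value.
print(list(range(1, 250, 250)))   # [1]

# With a hypothetical smaller page size, several start values appear.
print(page_starts(250, 50))       # [1, 51, 101, 151, 201]
print(page_starts(250, 250))      # [1]
```

Because the step equals the number of movies requested, the loop over `pages` runs exactly once.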

### Collect and parse the information using requests and BeautifulSoup
* Iterates through the list of pages (a single entry here), sends an HTTP request to IMDb's top-rated movie listing, and extracts each movie's name, year of release, runtime, rating, Metascore, votes, and description. The script parses the response with BeautifulSoup and sends a User-Agent header to mimic a web browser. A random delay between requests is included but commented out, since only one request is made. The extracted values are appended to their respective lists, which the next cell turns into a Pandas DataFrame for further analysis or presentation.
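
If the scraper were extended to several requests, the random delay could be applied between them. The following is a minimal sketch under that assumption: `polite_pause` and `fetch_all` are hypothetical helpers, the 2 to 8 second bounds come from the commented-out `sleep(randint(2, 8))` call, and the fetch and sleep functions are replaced by stand-ins so that the example runs without network access or waiting.

```python
from random import randint
from time import sleep

def polite_pause(min_seconds=2, max_seconds=8, sleep_fn=sleep):
    """Wait a random whole number of seconds and return the chosen delay."""
    delay = randint(min_seconds, max_seconds)
    sleep_fn(delay)
    return delay

def fetch_all(page_ids, fetch_fn, sleep_fn=sleep):
    """Call fetch_fn for every page, pausing between (not after) requests."""
    results = []
    for i, page_id in enumerate(page_ids):
        if i > 0:
            polite_pause(sleep_fn=sleep_fn)
        results.append(fetch_fn(page_id))
    return results

# Dry run with stand-ins: no network access and no real waiting.
recorded = []
pages_fetched = fetch_all([1, 2, 3], fetch_fn=lambda p: f"page {p}", sleep_fn=recorded.append)
print(pages_fetched)   # ['page 1', 'page 2', 'page 3']
print(len(recorded))   # 2
```

Passing the sleep function as a parameter keeps the delay logic easy to test; in real use the default `time.sleep` applies.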


In [5]:
# Iterate through pages
for page in pages:
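    # Set a User-Agent header to mimic a web browser, and keep the
    # Accept-Language preference so that IMDb returns English content
    headers = {
        'Accept-Language': 'en-US,en;q=0.5',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    }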

    print(page)

    # Make a request to IMDb
    page_response = requests.get("https://www.imdb.com/search/title/?sort=user_rating,desc&groups=top_1000&count=250", headers=headers)

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(page_response.text, 'html.parser')

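    # Find movie data containers
    # (these class names are auto-generated by IMDb and may change over time)
    movie_data = soup.find_all('div', {'class': 'sc-43986a27-0 gUQEVh'})

    # Optional pause between requests; not needed here because only one request is made
    # sleep(randint(2, 8))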

    # Print request status code for debugging
    # print("Request status code:", page_response.status_code)

    # Iterate through movie data
    for store in movie_data:
        # Title text, including its rank prefix (e.g. "1. The Shawshank Redemption")
        name = store.find('h3', class_='ipc-title__text').text
        movie_name.append(name)

        # The first metadata item is the release year, the second is the runtime
        metadata = store.find_all('span', class_='sc-43986a27-8 jHYIIK dli-title-metadata-item')
        year.append(metadata[0].text)
        time.append(metadata[1].text)

        # Keep only the rating value, i.e. the text before the first non-breaking space
        rate = store.find('span', class_="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating").text.replace('\n', '').split('\xa0')[0]
        rating.append(rate)

        # Not every movie has a Metascore; use a placeholder when the element is missing
        meta_tag = store.find('span', class_="sc-b0901df4-0 bcQdDJ metacritic-score-box")
        metascore.append(meta_tag.text if meta_tag else "*****")

        # Extracting the votes (the text after the "Votes" label)
        value = store.find_next("div", class_="sc-53c98e73-0 kRnqtn").text.split('Votes')[-1]
        votes.append(value)

        # Extracting the movie description, with a placeholder when it is empty
        describe = store.find_next('div', class_='ipc-html-content ipc-html-content--base sc-53c98e73-1 gMTWhH dli-plot-container').text
        description_ = describe.replace('\n', '') if len(describe) > 1 else '*****'
        description.append(description_)


1
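
Most lookups in the loop above call `.text` directly on the result of `find`, which raises `AttributeError` when the element is missing, because `find` returns `None` in that case. One possible safeguard is a small helper that falls back to the notebook's `*****` placeholder. `safe_text` is a hypothetical name, and the sketch uses `SimpleNamespace` objects as stand-ins for parsed elements so that it runs without a network request.

```python
from types import SimpleNamespace

def safe_text(element, default="*****"):
    """Return the stripped text of an element, or a placeholder if it is missing."""
    if element is None:
        return default
    text = element.text.replace("\n", "").strip()
    return text if text else default

# Stand-ins for parsed elements; a BeautifulSoup Tag exposes .text the same way.
found = SimpleNamespace(text="\n82\n")
empty = SimpleNamespace(text="  ")

print(safe_text(found))   # 82
print(safe_text(None))    # *****
print(safe_text(empty))   # *****
```

In the loop, a call such as `safe_text(store.find('h3', class_='ipc-title__text'))` would then keep the lists aligned even if one movie lacks a field.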


* Creating and Displaying Movie Data DataFrame


In [6]:
# Creating a Pandas DataFrame to store movie data with columns for Movie Name, Year of Release, Watch Time, Movie Rating, Metascore of Movie, Votes, and Description
movie_list = pd.DataFrame({
    "Movie Name": movie_name,
    "Year of Release": year,
    "Watch Time": time,
    "Movie Rating": rating,
    "Metascore of movie": metascore,
    "Votes": votes,
    "Description": description
})

# Displaying the movie list DataFrame
movie_list


| | Movie Name | Year of Release | Watch Time | Movie Rating | Metascore of movie | Votes | Description |
|---|---|---|---|---|---|---|---|
| 0 | 1. The Shawshank Redemption | 1994 | 2h 22m | 9.3 | 82 | 2830101 | Over the course of several years, two convicts... |
| 1 | 2. The Godfather | 1972 | 2h 55m | 9.2 | 100 | 1972590 | Don Vito Corleone, head of a mafia family, dec... |
| 2 | 3. The Dark Knight | 2008 | 2h 32m | 9.0 | 84 | 2811521 | When the menace known as the Joker wreaks havo... |
| 3 | 4. Schindler's List | 1993 | 3h 15m | 9.0 | 95 | 1422109 | In German-occupied Poland during World War II,... |
| 4 | 5. The Lord of the Rings: The Return of the King | 2003 | 3h 21m | 9.0 | 94 | 1937891 | Gandalf and Aragorn lead the World of Men agai... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 245 | 246. Stalker | 1979 | 2h 42m | 8.1 | 85 | 141957 | A guide leads two men through an area known as... |
| 246 | 247. 12 Years a Slave | 2013 | 2h 14m | 8.1 | 96 | 730822 | In the antebellum United States, Solomon North... |
| 247 | 248. Gran Torino | 2008 | 1h 56m | 8.1 | 72 | 804410 | After a Hmong teenager tries to steal his priz... |
| 248 | 249. Lock, Stock and Two Smoking Barrels | 1998 | 1h 47m | 8.1 | 66 | 607410 | Eddy persuades his three pals to pool money fo... |


* Displaying the last three rows of the Movie Data DataFrame

In [7]:
movie_list[-3:]

| | Movie Name | Year of Release | Watch Time | Movie Rating | Metascore of movie | Votes | Description |
|---|---|---|---|---|---|---|---|
| 247 | 248. Gran Torino | 2008 | 1h 56m | 8.1 | 72 | 804410 | After a Hmong teenager tries to steal his priz... |
| 248 | 249. Lock, Stock and Two Smoking Barrels | 1998 | 1h 47m | 8.1 | 66 | 607410 | Eddy persuades his three pals to pool money fo... |
| 249 | 250. Warrior | 2011 | 2h 20m | 8.1 | 71 | 491172 | The youngest son of an alcoholic former boxer ... |
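
The rows above keep the rank inside the movie name and store the runtime as text such as `2h 20m`, which is awkward for sorting or arithmetic. A minimal cleaning sketch follows; `split_rank_title` and `runtime_to_minutes` are hypothetical helpers, and support for hour-only or minute-only runtimes (`2h`, `45m`) is an assumption, since the output shown contains only the `Xh Ym` form.

```python
import re

def split_rank_title(name):
    """Split '1. The Shawshank Redemption' into (1, 'The Shawshank Redemption')."""
    rank, _, title = name.partition(". ")
    return int(rank), title

def runtime_to_minutes(runtime):
    """Convert a runtime such as '2h 22m', '2h' or '45m' to total minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?", runtime.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"Unrecognized runtime: {runtime!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

print(split_rank_title("250. Warrior"))   # (250, 'Warrior')
print(runtime_to_minutes("2h 20m"))       # 140
print(int("491172"))                      # 491172
```

Applied to the DataFrame, these could be used through `movie_list["Watch Time"].map(runtime_to_minutes)` and similar calls on the other columns.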


* Checking the data type of the Movie Data DataFrame

In [8]:
type(movie_list)

pandas.core.frame.DataFrame

* Saving Movie Data DataFrame to Excel: IMDB_Movie_Ratings.xlsx

In [9]:
movie_list.to_excel('/content/drive/MyDrive/Colab Notebooks/IMDB_Movie_Ratings.xlsx', header=True, index=False)