# <u>'Scraping Movies & Series from TMDB using Python'</u>

Scraping movies and series from TMDB using Python, Requests, and Beautiful Soup.



![](https://i.imgur.com/f9C3y66.png)

![](https://i.imgur.com/mXY7CEs.png)

 "The Movie Database" (TMDb) is an online community-driven platform that provides information about movies and TV shows. It is a popular website among movie enthusiasts and developers who want to access movie-related data for their projects.

## "So What Is Web Scraping??"

"Web scraping is the automated process of extracting data from websites. It involves using software or scripts to access web pages, analyze the HTML content, and extract specific information, such as text, images, or URLs. Web scraping is commonly employed for various purposes, including data mining, competitor analysis, sentiment analysis, and content aggregation. However, it raises ethical and legal concerns, as some websites may prohibit scraping or consider it a violation of their terms of service. It's essential to ensure compliance with the website's policies and respect the site's server load to avoid causing any disruption to their services."

## Pick a website and describe your objective
- Pick a site to scrape from the given list of websites
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your assignment idea in a paragraph using a Markdown cell and outline your strategy.

![](https://i.imgur.com/VJMdSi1.png)

### Project Outline:

- We're going to scrape https://www.themoviedb.org/movie.
- We'll get a list of popular movies and, for each one, its release date, user rating, and movie URL.
- We'll get the ratings of the top 100 movies from the topic page.
- For each topic we'll create a CSV file in the following format:
  ```
  Name of the movie,Date of release,Rating,Movie URL
  Expend4bles,"Sep 15, 2023",63.97,https://www.themoviedb.org/movie/299054
  The Equalizer 3,"Aug 30, 2023",72.48,https://www.themoviedb.org/movie/926393
  ```

![](https://i.imgur.com/GPHiKgE.png)

![](https://i.imgur.com/LP5sskw.png)

### <u> Output Format</u>

Release_date,Movie_name,Ratings,Urls

1)"Sep 22, 2023",Expend4bles,64.0,https://www.themoviedb.org/movie/299054

2)"Sep 01, 2023",The Equalizer 3,73.0,https://www.themoviedb.org/movie/926393

### About Python

<i>"Python is a versatile, high-level programming language known for its simplicity and readability. It supports multiple paradigms, including object-oriented, functional, and procedural programming. Python's extensive standard library and vast community make it popular for web development, data analysis, artificial intelligence, automation, and more."</i>

### Importing Required Libraries

- <b>NumPy</b> and <b>pandas</b> are general-purpose data libraries. In this project pandas turns the scraped lists into a DataFrame and writes it to a CSV file; NumPy is imported for convenience, but the scraping code below does not depend on it.

- <b>Beautiful Soup</b> is a library for parsing HTML and XML documents. It allows you to extract specific elements from a web page, such as links or text, and can also be used to clean and organize the data you have extracted. It makes it easy to navigate, search, and modify the parse tree.

- The <b>requests</b> library allows you to send HTTP requests in Python, which is useful for interacting with websites and APIs. It can also be used to download the HTML of a web page, which can then be parsed using Beautiful Soup.

- The <b>"re"</b> library is a standard library that is used for working with regular expressions. Regular expressions are a powerful tool for matching patterns in strings, and they are often used in web scraping to search for specific text or patterns in the HTML of a web page.

- Combining requests and Beautiful Soup with "re" makes it straightforward to extract specific information from a web page, including text that is easier to match with a pattern than to locate by tag.
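
As a small illustration of where "re" fits into this workflow, the sketch below pulls the numeric ID out of a TMDB-style path such as `/movie/299054`. The helper name `extract_movie_id` and the pattern are assumptions made for this example; they are not part of the notebook's own code.

```python
import re

# Pattern assumed for illustration: "/movie/" or "/tv/" followed by digits.
ID_PATTERN = re.compile(r"/(?:movie|tv)/(\d+)")

def extract_movie_id(path):
    """Return the numeric ID found in a TMDB-style path, or None."""
    match = ID_PATTERN.search(path)
    return int(match.group(1)) if match else None

print(extract_movie_id("/movie/299054"))
print(extract_movie_id("https://www.themoviedb.org/movie/926393-the-equalizer-3"))
print(extract_movie_id("/person/123"))
```

The function returns `None` rather than raising when the path does not look like a movie or TV link, which keeps a scraping loop running when it meets an unexpected href.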

In [None]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

## Scrape the List of Movies /Series from TMDB

How to perform this:

- Use requests to download the page
- Use BS4 to parse and extract information
- Convert to a pandas DataFrame

Let's write a function to download the page.
    

## <u><b>Downloading the Webpage using Requests</b></u>

- The requests.get() method sends a GET request to the specified URL and returns a response object. That object carries the status code, the headers, and the body of the response. We store it in a variable named response and read the page's HTML from response.text.

In [None]:
!pip install requests --upgrade --quiet

In [None]:
import requests
from bs4 import BeautifulSoup
topic_url = 'https://www.themoviedb.org/movie'
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'})
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


In [None]:
get_topic_page('https://www.themoviedb.org/tv')

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<title>Popular TV Shows — The Movie Database (TMDB)</title>
<meta content="on" http-equiv="cleartype"/>
<meta charset="utf-8"/>
<meta content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast" name="keywords"/>
<meta content="yes" name="mobile-web-app-capable"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows." name="description"/>
<meta content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png" name="msapplication-TileImage"/>
<meta content="#032541" name="msapplication-TileColor"/>
<meta content="#032541" name="theme-color"/>
<link href="/assets/2/apple-touch-icon-57ed4b3b0450fd5e9a0c20f34e814b82adaa1085c79bdde2f00ca8787b6

In [None]:
response = requests.get(topic_url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'})

In [None]:
response.status_code

200

A status code of 200 means the request was successful.
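
To make the meaning of a status code explicit, the standard library's `http.HTTPStatus` can translate the number into a readable phrase. This is only a sketch, and `describe_status` is a hypothetical helper name introduced for the example.

```python
from http import HTTPStatus

def describe_status(code):
    """Map a numeric HTTP status code to a short description."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f"{code} (unknown status)"
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase}: {outcome}"

print(describe_status(200))
print(describe_status(404))
```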

In [None]:
page_contents=response.text

In [None]:
len(response.text)

220584

The webpage contains over <b>220,000</b> characters (more than 2 lakh). Here are the first 1000 characters of the page:

In [None]:
page_contents[:1000]

'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n    <meta name="keywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">\n    <meta name="mobile-web-app-capable" content="yes">\n    <meta name="apple-mobile-web-app-capable" content="yes">\n    <meta name="viewport" content="width=device-width,initial-scale=1">\n      <meta name="description" content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows.">\n    <meta name="msapplication-TileImage" content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png">\n<meta name="msapplication-TileColor" content="#032541">\n<meta name="theme-color" content="#032541">\n<link rel="apple-touch-icon" sizes="180x180" href=

The output above is the HTML source code of the page. We can save it to a file and view it locally within Jupyter Notebook using <b>"File -> Open"</b>.

In [None]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

## <u><b>Parsing the HTML Source Code using Beautiful Soup</b></u>

- <b>Beautiful Soup</b> is a library for parsing HTML and XML documents. It allows you to extract specific elements from a web page, such as links or text, and can also be used to clean and organize the data you have extracted. It makes it easy to navigate, search, and modify the parse tree.

In [None]:
doc = BeautifulSoup(page_contents,'html.parser')

In [None]:
print(doc.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en">
 <head>
  <title>
   Popular Movies — The Movie Database (TMDB)
  </title>
  <meta content="on" http-equiv="cleartype"/>
  <meta charset="utf-8"/>
  <meta content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast" name="keywords"/>
  <meta content="yes" name="mobile-web-app-capable"/>
  <meta content="yes" name="apple-mobile-web-app-capable"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows." name="description"/>
  <meta content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png" name="msapplication-TileImage"/>
  <meta content="#032541" name="msapplication-TileColor"/>
  <meta content="#032541" name="theme-color"/>
  <link href="/assets/2/apple-touch-icon-57ed4b3b0450fd5e9a0c20f34e814b

### Let's get the title, release date, rating, and movie URL of each entry from the page's tags

In [None]:
selection_class = 'card style_1'

In [None]:
realsedate_tags = doc.find_all('div',class_=selection_class)

In [None]:
realsedate_tags[0]('a')[0]['href']

'/movie/299054'

In [None]:
tv_url1 ="https://www.themoviedb.org"+realsedate_tags[0]('a')[0]['href']
print(tv_url1)

https://www.themoviedb.org/movie/299054
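
String concatenation works here because every `href` on the page is a site-relative path. As a more defensive alternative (an assumption on our part, not what the cells above use), `urllib.parse.urljoin` from the standard library also copes with hrefs that are already absolute. `BASE_URL` and `to_absolute` are names made up for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.themoviedb.org"

def to_absolute(href):
    """Resolve a relative or absolute href against the TMDB base URL."""
    return urljoin(BASE_URL, href)

print(to_absolute("/movie/299054"))
print(to_absolute("https://www.themoviedb.org/movie/926393"))
```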


Extracting Shows/Movie names.

In [None]:
#This Contains Names of the Shows
show_names = []

for a_tag in realsedate_tags:
    show_info = a_tag.text.strip().split('\n')
    show_name = show_info[0]
    show_names.append(show_name)

print(show_names)

['Expend4bles', 'Mission: Impossible - Dead Reckoning Part One', 'The Equalizer 3', 'The Nun II', 'Uri: The Surgical Strike', 'Sound of Freedom', 'Mortal Kombat Legends: Cage Match', 'Talk to Me', 'Saw X', 'Nowhere', 'Ballerina', 'Meg 2: The Trench', 'Killers of the Flower Moon', 'Gran Turismo', '57 Seconds', 'Blue Beetle', 'The Ritual Killer', 'PAW Patrol: The Mighty Movie', 'V/H/S/85', 'Fast X']


Extracting Release Dates of the shows.

In [None]:
#This contains all Release dates:
realse_dates = []

for release in realsedate_tags:
    realse_dates.append(release.find('p').text)

print(realse_dates)

['Sep 15, 2023', 'Jul 08, 2023', 'Aug 30, 2023', 'Sep 06, 2023', 'Jan 11, 2019', 'Jul 03, 2023', 'Oct 17, 2023', 'Aug 04, 2023', 'Sep 26, 2023', 'Sep 29, 2023', 'Oct 05, 2023', 'Aug 02, 2023', 'Oct 18, 2023', 'Aug 09, 2023', 'Sep 29, 2023', 'Aug 18, 2023', 'Mar 09, 2023', 'Oct 13, 2023', 'Sep 22, 2023', 'May 19, 2023']
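
The dates come back as plain strings, which sort alphabetically rather than chronologically. A possible next step, sketched below, is to convert them with `datetime.strptime`. The helper `parse_release_date` is hypothetical, and the `%b` directive assumes English month abbreviations (the default C/English locale).

```python
from datetime import datetime

def parse_release_date(text):
    """Convert a string like 'Sep 15, 2023' into a datetime.date."""
    return datetime.strptime(text.strip(), "%b %d, %Y").date()

sample_dates = ["Sep 15, 2023", "Jul 08, 2023", "Jan 11, 2019"]
parsed = [parse_release_date(d) for d in sample_dates]

print([d.isoformat() for d in parsed])
print(min(parsed).isoformat())  # earliest release in the sample
```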


Extracting the Movies/shows Ratings:

In [None]:
#This Contains Ratings
topics_url = 'https://www.themoviedb.org/movie'
response = requests.get(topics_url, headers={'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'})
page_contents = response.text
doc = BeautifulSoup(page_contents, 'html.parser')
ratings_elements = doc.find_all('div', class_='user_score_chart')

# Extract the data-percent values from the list of elements
data_percents1 = [div['data-percent'] for div in ratings_elements]

# Now data_percents1 contains all the data-percent values from the 'user_score_chart' divs
print(data_percents1)

['63.97', '77.19', '72.48', '69.97', '71.46', '80.87', '79.79', '71.94', '72.0', '76.16', '70.0', '68.15', '78.8', '80.77', '54.260000000000005', '70.66', '57.69', '73.0', '67.86', '72.48']
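
The ratings are strings, and one of them (`'54.260000000000005'`) carries floating-point noise. A minimal sketch of cleaning them up before analysis follows; `clean_rating` is a hypothetical helper, and rounding to two decimals is our choice rather than anything the site prescribes.

```python
raw_ratings = ['63.97', '72.0', '54.260000000000005']

def clean_rating(value, digits=2):
    """Convert a data-percent string to a rounded float."""
    return round(float(value), digits)

ratings = [clean_rating(r) for r in raw_ratings]
print(ratings)
```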


Extracting the Movies/shows Url:

In [None]:
#This contains the Url Of the shows:
shows_url = []
for tag in realsedate_tags:
    shows_url.append('https://www.themoviedb.org' + tag.find('a')['href'])
shows_url

['https://www.themoviedb.org/movie/299054',
 'https://www.themoviedb.org/movie/575264',
 'https://www.themoviedb.org/movie/926393',
 'https://www.themoviedb.org/movie/968051',
 'https://www.themoviedb.org/movie/554600',
 'https://www.themoviedb.org/movie/678512',
 'https://www.themoviedb.org/movie/1034062',
 'https://www.themoviedb.org/movie/1008042',
 'https://www.themoviedb.org/movie/951491',
 'https://www.themoviedb.org/movie/1151534',
 'https://www.themoviedb.org/movie/961268',
 'https://www.themoviedb.org/movie/615656',
 'https://www.themoviedb.org/movie/466420',
 'https://www.themoviedb.org/movie/980489',
 'https://www.themoviedb.org/movie/937249',
 'https://www.themoviedb.org/movie/565770',
 'https://www.themoviedb.org/movie/862552',
 'https://www.themoviedb.org/movie/893723',
 'https://www.themoviedb.org/movie/1032948',
 'https://www.themoviedb.org/movie/385687']

Creating a Dictionary.

In [None]:
Tv_shows_dict = {
    'show_titles':show_names,
    'Date_of_Release':realse_dates,
    'Show_Ratings':data_percents1,
    'Shows_url':shows_url
    }

In [None]:
Tv_shows_df=pd.DataFrame(Tv_shows_dict)
Tv_shows_df

Unnamed: 0,show_titles,Date_of_Release,Show_Ratings,Shows_url
0,Expend4bles,"Sep 15, 2023",63.97,https://www.themoviedb.org/movie/299054
1,Mission: Impossible - Dead Reckoning Part One,"Jul 08, 2023",77.19,https://www.themoviedb.org/movie/575264
2,The Equalizer 3,"Aug 30, 2023",72.48,https://www.themoviedb.org/movie/926393
3,The Nun II,"Sep 06, 2023",69.97,https://www.themoviedb.org/movie/968051
4,Uri: The Surgical Strike,"Jan 11, 2019",71.46,https://www.themoviedb.org/movie/554600
5,Sound of Freedom,"Jul 03, 2023",80.87,https://www.themoviedb.org/movie/678512
6,Mortal Kombat Legends: Cage Match,"Oct 17, 2023",79.79,https://www.themoviedb.org/movie/1034062
7,Talk to Me,"Aug 04, 2023",71.94,https://www.themoviedb.org/movie/1008042
8,Saw X,"Sep 26, 2023",72.0,https://www.themoviedb.org/movie/951491
9,Nowhere,"Sep 29, 2023",76.16,https://www.themoviedb.org/movie/1151534
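
pandas raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one card on the page lacks a rating or a date. A small check before building the DataFrame makes such a mismatch easier to diagnose. The function `check_equal_lengths` below is a hypothetical helper, shown on a two-row sample taken from the output above.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in a column dictionary differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # All lengths agree, so return that common length (0 for an empty dict).
    return next(iter(lengths.values()), 0)

sample = {
    'show_titles': ['Expend4bles', 'The Equalizer 3'],
    'Date_of_Release': ['Sep 15, 2023', 'Aug 30, 2023'],
    'Show_Ratings': ['63.97', '72.48'],
    'Shows_url': ['https://www.themoviedb.org/movie/299054',
                  'https://www.themoviedb.org/movie/926393'],
}
print(check_equal_lengths(sample))
```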


Functions for scraping the movie name, release date, rating, and URL.

In [None]:
import requests
from bs4 import BeautifulSoup

def get_topic_page(url):
    # Download the page and parse it into a BeautifulSoup document
    return BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text, 'html.parser')

def get_movie_names(doc):
    # The first line of each card's text is the title
    return [tag.text.strip().split('\n')[0] for tag in doc.find_all('div', class_='card style_1')]

def get_release_dates(doc):
    # The first <p> tag in each card holds the release date
    return [tag.find('p').text for tag in doc.find_all('div', class_='card style_1')]

def get_show_urls(doc):
    # The first link in each card is a relative path to the detail page
    return ['https://www.themoviedb.org' + tag('a')[0]['href'] if tag('a') else 'None' for tag in doc.find_all('div', class_='card style_1')]

def Ratings(doc):
    # The rating is stored in the data-percent attribute of the score chart
    ratings_elements = doc.find_all('div', class_='user_score_chart')
    return [div['data-percent'] for div in ratings_elements]

def movie_info(urls):
    # Collect every column across all the listing pages in urls
    d = {'Release_date': [], 'Movie_name': [],'Ratings': [], 'Urls': []}
    for url in urls:
        doc = get_topic_page(url)
        d['Release_date'] += get_release_dates(doc)
        d['Movie_name'] += get_movie_names(doc)
        d['Urls'] += get_show_urls(doc)
        d['Ratings'] += Ratings(doc)

    return d
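
The functions above build each column with a separate pass over the page, so a card that is missing one field would shift that column relative to the others. One alternative, sketched here, is to read all four fields from each card in a single loop. `SAMPLE_HTML` is a simplified, hypothetical stand-in for the real page (TMDB's actual markup may differ), and `parse_cards` is a name invented for this example.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup that mimics the structure scraped above.
SAMPLE_HTML = """
<div class="card style_1">
  <a href="/movie/299054"></a>
  <div class="user_score_chart" data-percent="63.97"></div>
  <h2><a href="/movie/299054">Expend4bles</a></h2>
  <p>Sep 15, 2023</p>
</div>
<div class="card style_1">
  <a href="/movie/926393"></a>
  <h2><a href="/movie/926393">The Equalizer 3</a></h2>
  <p>Aug 30, 2023</p>
</div>
"""

def parse_cards(doc):
    """Return one dictionary per card so that fields cannot get misaligned."""
    rows = []
    for card in doc.find_all('div', class_='card style_1'):
        link = card.find('a')
        title = card.find('h2')
        date = card.find('p')
        score = card.find('div', class_='user_score_chart')
        rows.append({
            'Movie_name': title.get_text(strip=True) if title is not None else None,
            'Release_date': date.get_text(strip=True) if date is not None else None,
            'Ratings': score.get('data-percent') if score is not None else None,
            'Urls': ('https://www.themoviedb.org' + link['href']) if link is not None else None,
        })
    return rows

rows = parse_cards(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
for row in rows:
    print(row)
```

The second sample card deliberately has no score element, so its `Ratings` value is `None` while the other fields stay on the correct row. A list of dictionaries like this can be passed directly to `pd.DataFrame`.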

In [None]:
urls=['https://www.themoviedb.org/tv','https://www.themoviedb.org/tv/top-rated',
                              'https://www.themoviedb.org/tv/airing-today']
r=movie_info(urls)
print(r)

{'Release_date': ['Jun 09, 2021', 'Dec 02, 2013', 'Mar 27, 2005', 'Oct 03, 2020', 'Sep 20, 1999', 'Oct 12, 2021', 'Sep 08, 2015', 'Sep 25, 2017', 'Sep 28, 2023', 'Sep 23, 2013', 'Jul 03, 2000', 'Jan 25, 2016', 'Apr 17, 2011', 'Jan 31, 1999', 'Sep 13, 2005', 'Jun 23, 2011', 'Dec 17, 1989', 'Oct 05, 2011', 'Oct 13, 2023', 'Sep 23, 2003', 'Jan 20, 2008', 'Nov 06, 2021', 'Oct 20, 1999', 'Dec 02, 2013', 'Mar 19, 2017', 'Apr 05, 2009', 'Feb 21, 2005', 'Apr 06, 2019', 'Mar 25, 2021', 'Apr 22, 2022', 'Apr 03, 2016', 'Dec 02, 2016', 'Jan 15, 2023', 'Apr 07, 2013', 'Sep 06, 2010', 'Jan 10, 2020', 'May 06, 2019', 'Feb 08, 2015', 'Jun 19, 2017', 'Oct 02, 2011', 'Apr 09, 2022', 'Feb 08, 2021', 'Oct 29, 2022', 'Oct 07, 2023', 'Oct 07, 2023', 'Jun 01, 2019', 'Jun 18, 2018', 'Oct 09, 2021', 'Oct 02, 2020', 'Sep 30, 2023', 'Oct 07, 2023', 'May 15, 2010', 'Jun 27, 2009', 'Dec 16, 2021', 'Oct 28, 2004', 'Jul 28, 2008', 'Nov 03, 2006'], 'Movie_name': ['Loki', 'Rick and Morty', "Grey's Anatomy", 'Jujutsu K

In [None]:
# Creating a DataFrame (the /movie URL is listed twice below, so its rows are repeated)
Tv_df = pd.DataFrame(movie_info(['https://www.themoviedb.org/movie','https://www.themoviedb.org/movie/upcoming','https://www.themoviedb.org/tv',
                              'https://www.themoviedb.org/tv/top-rated',
                              'https://www.themoviedb.org/tv/airing-today','https://www.themoviedb.org/tv/on-the-air','https://www.themoviedb.org/movie']))
Tv_df

Unnamed: 0,Release_date,Movie_name,Ratings,Urls
0,"Sep 15, 2023",Expend4bles,63.9,https://www.themoviedb.org/movie/299054
1,"Jul 08, 2023",Mission: Impossible - Dead Reckoning Part One,77.19,https://www.themoviedb.org/movie/575264
2,"Aug 30, 2023",The Equalizer 3,72.48,https://www.themoviedb.org/movie/926393
3,"Sep 06, 2023",The Nun II,70.0,https://www.themoviedb.org/movie/968051
4,"Jan 11, 2019",Uri: The Surgical Strike,71.46,https://www.themoviedb.org/movie/554600
...,...,...,...,...
132,"Aug 18, 2023",Blue Beetle,70.64,https://www.themoviedb.org/movie/565770
133,"Mar 09, 2023",The Ritual Killer,57.69,https://www.themoviedb.org/movie/862552
134,"Oct 13, 2023",PAW Patrol: The Mighty Movie,73.0,https://www.themoviedb.org/movie/893723
135,"Sep 22, 2023",V/H/S/85,67.86,https://www.themoviedb.org/movie/1032948
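
The last rows of this output repeat movies from the first page because `https://www.themoviedb.org/movie` appears twice in the URL list. If that repetition is not wanted, one simple option (a sketch, not the notebook's original approach) is to remove duplicate URLs while keeping their order before scraping:

```python
urls = [
    'https://www.themoviedb.org/movie',
    'https://www.themoviedb.org/movie/upcoming',
    'https://www.themoviedb.org/tv',
    'https://www.themoviedb.org/movie',  # repeated entry
]

# dict.fromkeys keeps the first occurrence of each key and preserves order.
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)
```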


## Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Attach the CSV files with your notebook using jovian.commit.

In [None]:
Tv_df.to_csv('Movies.csv',index=False)
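
The release dates contain a comma, so they must be quoted in the CSV file, which is why they appear as `"Sep 15, 2023"` in the output format. The sketch below uses the standard library's `csv` module (not pandas) on an in-memory buffer to show that quoting and to confirm that the field survives a round trip.

```python
import csv
import io

rows = [
    ['Release_date', 'Movie_name', 'Ratings', 'Urls'],
    ['Sep 15, 2023', 'Expend4bles', '63.97', 'https://www.themoviedb.org/movie/299054'],
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields, comma included.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1][0])
```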

## <u>Summary of the Project</u>
In this project, we used the Python programming language to extract the required information about movies and series from the TMDB website. The process included the following steps:

- Fetching the webpage using the 'requests' library
- Scraping the webpage for relevant information using the 'BeautifulSoup' library
- Storing the scraped data in a dictionary
- Converting the dictionary to a DataFrame and saving it as a CSV file using the 'pandas' library
- The final outcome was a CSV file containing the movie names, release dates, ratings, and URLs from the TMDB website.

## <u>Future Work</u>
1. Store all the URLs in a list and, based on user input, retrieve the desired URL by list indexing.
2. Retrieve movies/shows, ratings, release dates, and URLs from the TMDB website (https://www.themoviedb.org/movie) to provide additional information to users.
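
The first idea could be sketched roughly as follows. This version swaps the list for a dictionary keyed by category name, which is easier for a user to type than a list index. The category labels and the helper `url_for` are assumptions made for this illustration; the URLs are the ones used earlier in the notebook.

```python
# Category labels are hypothetical; the URLs come from the cells above.
CATEGORY_URLS = {
    'popular movies': 'https://www.themoviedb.org/movie',
    'upcoming movies': 'https://www.themoviedb.org/movie/upcoming',
    'popular tv': 'https://www.themoviedb.org/tv',
    'top rated tv': 'https://www.themoviedb.org/tv/top-rated',
}

def url_for(choice):
    """Return the URL for a category name, ignoring case and extra spaces."""
    key = ' '.join(choice.lower().split())
    if key not in CATEGORY_URLS:
        raise KeyError(f"Unknown category: {choice!r}")
    return CATEGORY_URLS[key]

print(url_for('  Top Rated TV '))
```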