<a href="https://colab.research.google.com/github/arhamshah/Movies-Web-Scraper/blob/main/Movies_Web_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping Movies on [TMDb](https://www.themoviedb.org/movie)

*   Web scraping is an automated way to collect large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data (for example a spreadsheet or a database) so that it can be used in other applications.

*   In this project, I scrape the TMDb website, an online movie directory, and then structure the data as a pandas DataFrame that can serve as a dataset for other applications.

*   Tools and technologies used in this project: Python, requests, Beautiful Soup, and pandas.




## Initializing and Importing

Here we install and import the libraries used throughout the notebook.

In [17]:
!pip install requests --quiet
!pip install beautifulsoup4 --quiet
!pip install pandas --quiet

In [18]:
import requests
from bs4 import BeautifulSoup
import pandas as pd 

## Getting Started

In [19]:
# Defining Request Header
my_header = {'accept': 'text/html', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.59'}

# Defining Base Url
base_url = 'https://www.themoviedb.org/'
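
As a small illustration of how request URLs can be derived from `base_url`, the sketch below uses `urllib.parse.urljoin` from the standard library. The helper names `listing_url` and `movie_url` are hypothetical and not part of this notebook, and `/movie/12345` is a made-up path used only for the demonstration.

```python
from urllib.parse import urljoin

base_url = 'https://www.themoviedb.org/'

def listing_url(page):
    # Hypothetical helper: URL of a given listing page
    return urljoin(base_url, f'movie?page={page}')

def movie_url(href):
    # Hypothetical helper: absolute URL for a site-relative link
    # urljoin avoids the double slash that plain concatenation can produce
    return urljoin(base_url, href)

print(listing_url(2))
print(movie_url('/movie/12345'))
```

Using `urljoin` keeps the result well-formed whether or not the relative link starts with a slash.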

## Functions

The functions defined here are used while parsing the pages.

**To get Beautiful Soup Object of Inner Page**

In [34]:
def get_child_soap(url):

  # Creating Beautiful Soup object from given link
  response = requests.get(url, headers=my_header).text
  return BeautifulSoup(response, 'html.parser')

**To get Ratings of a Movie from the webpage**

In [21]:
def get_ratings(movie_webpage):

  # Parsing through website to get ratings of a movie
  popularity = movie_webpage.find('div', {'class':'user_score_chart'})

  # If not found return: None
  if popularity is not None:
    return float(popularity['data-percent'])
  else:
    return None

**To get Genres of a Movie from the webpage**

In [22]:
def get_genres(movie_webpage):

  # Parsing through website to get genre of a movie
  genres = []
  list_genres_link = movie_webpage.find('span', {'class':'genres'})

  # If not found return: None
  if list_genres_link is not None:
    list_genres_link = list_genres_link.find_all('a')
    
    for genre_link in list_genres_link:
      genres.append(genre_link.text)
    
    return genres
  else:
    return None

**To get Release Date of a Movie from the Webpage**

In [23]:
def get_release_date(movie_webpage):

  # Parsing through website to get release date of a movie
  release_date = movie_webpage.find('span', {'class': 'release'})
  
  # If not found return: None
  if release_date is not None:
    release_date = release_date.text.strip()
    date = release_date[:10] 
    return date
  else:
    return None
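
`get_release_date` returns the date as a string such as `06/17/2021`. If a real date object is more convenient later on, a conversion along the following lines could be used. This is only a sketch: `parse_release_date` is a hypothetical helper, and the `MM/DD/YYYY` layout is assumed from the values this notebook produces.

```python
from datetime import datetime

def parse_release_date(date_text):
    # Hypothetical helper: 'MM/DD/YYYY' -> datetime.date, or None if the text does not match
    if date_text is None:
        return None
    try:
        return datetime.strptime(date_text, '%m/%d/%Y').date()
    except ValueError:
        return None

print(parse_release_date('06/17/2021'))
```

Returning `None` for unexpected text mirrors how the other functions in this section signal a missing value.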

**To get Runtime of a Movie from the webpage**

In [24]:
def get_runtime(movie_webpage):

  # Parsing through website to get duration of a movie
  runtime = movie_webpage.find('span', {'class':'runtime'})

  # If not found return: None
  if runtime is not None:
    runtime = runtime.text.strip()
    
    index_h = runtime.find('h')
    index_m = runtime.find('m')

    # Runtime looks like '1h 35m', '2h' or '45m'; a missing part counts as 0
    hours = int(runtime[:index_h]) if index_h != -1 else 0
    minutes = int(runtime[index_h+1:index_m]) if index_m != -1 else 0
    return hours * 60 + minutes
  else:
    return None
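
An alternative to slicing the runtime text by index is a regular expression, which makes the accepted formats explicit. The sketch below assumes the text looks like `1h 35m`, `2h`, or `45m`; `runtime_to_minutes` and `RUNTIME_PATTERN` are hypothetical names, not part of this notebook.

```python
import re

# Optional hours part followed by an optional minutes part, e.g. '1h 35m', '2h', '45m'
RUNTIME_PATTERN = re.compile(r'^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$')

def runtime_to_minutes(runtime_text):
    match = RUNTIME_PATTERN.match(runtime_text)
    # Reject text that does not match, including text with neither hours nor minutes
    if match is None or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

for text in ['1h 35m', '2h', '45m']:
    print(text, runtime_to_minutes(text))
```

Text that does not fit the pattern yields `None` instead of raising an exception, so one odd page does not stop the whole run.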

**To get Facts of a Movie (Language, Budget, Revenue) from the webpage**

In [25]:
def get_facts(movie_webpage):

  # Parsing through website to get facts(language, budget, revenue) of a movie
  facts = movie_webpage.find('section', {'class':'facts left_column'})

  # If not found return: None
  if facts is None:
    return None,None,None

  # Defaults used when a fact is missing from the page
  language, budget, revenue = None, '-', '-'

  for fact in facts.find_all('p'):
    fact = fact.text.strip()
    # Match each fact by its label instead of relying on its position in the list
    if fact.startswith('Original Language'):
      language = fact[len('Original Language'):].strip().split('; ')
    elif fact.startswith('Budget'):
      budget = fact[len('Budget'):].strip()
    elif fact.startswith('Revenue'):
      revenue = fact[len('Revenue'):].strip()

  # '-' means the value is not listed; it is stored as 0
  budget_formatted = 0 if budget == '-' else float(budget[1:].replace(',',''))
  revenue_formatted = 0 if revenue == '-' else float(revenue[1:].replace(',',''))

  return language,budget_formatted,revenue_formatted
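
The budget and revenue conversion can be shown in isolation. The sketch below assumes amounts are written like `$61,000,000.00` and that `-` marks a value that is not listed; `parse_money` is a hypothetical helper, not part of this notebook.

```python
def parse_money(text):
    # Hypothetical helper: '$61,000,000.00' -> 61000000.0, '-' -> 0.0
    text = text.strip()
    if text == '-':
        return 0.0
    # Drop the leading currency sign and the thousands separators
    return float(text.lstrip('$').replace(',', ''))

print(parse_money('$61,000,000.00'))
print(parse_money('-'))
```

Storing `0` for an unlisted amount follows this notebook's convention; returning `None` instead would keep unknown values distinguishable from a genuine zero.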

## Scraping the Data

*   Movies on TMDb are paginated, so we browse through the listing pages one at a time. For this project, I browse the first 50 pages.

*   From each listing page, the URL of every movie is extracted and its page is parsed to collect facts and data about the movie.



**Declaring Lists of various Data for collection**

In [40]:
list_titles = []
list_links = []
list_genres = []
list_popularities = []
list_release_dates = []
list_budgets = []
list_revenues = []
list_runtimes = []
list_languages = []
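
Parallel lists like these must all end up with the same length before they can be combined into a DataFrame. One possible alternative, shown here only as a sketch, is to keep one record per movie. `MovieRecord` is a hypothetical class whose fields mirror the lists above, and the values are sample data.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass
class MovieRecord:
    # Hypothetical container; the field names mirror the lists declared above
    title: str
    genres: Optional[List[str]] = None
    popularity: Optional[float] = None
    release_date: Optional[str] = None
    runtime: Optional[int] = None
    languages: Optional[List[str]] = None
    budget: Optional[float] = None
    revenue: Optional[float] = None

records = []
records.append(MovieRecord(title='Luca', genres=['Animation', 'Comedy'], popularity=82.0))
records.append(MovieRecord(title='Infinite'))  # fields that were not found simply stay None

# A list of dicts like this can be passed directly to pd.DataFrame
rows = [asdict(record) for record in records]
print(rows[1]['runtime'])
```

With one record per movie, a value that could not be scraped leaves a `None` in that record rather than shifting the rows of a single column.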

In [None]:
# Counter for parsing requests
# count = 0

# Iterating Over Pages till 50th Page
for i in range(1,51):

  # Requesting Home Page where movies are listed
  response = requests.get(f'{base_url}movie?page={i}', headers=my_header)
  
  # Creating a Beautiful Soup Object from response
  webpage = response.text
  soup = BeautifulSoup(webpage, 'html.parser')

  # Parsing through Movie Title Names and Links of Inner Pages
  title_divs = soup.find_all('div', {'class':'card style_1'})
  for title_div in title_divs:
    title_link = title_div.find('h2').find('a')
    list_titles.append(title_link.text)
    # lstrip('/') avoids a double slash when href already starts with '/'
    list_links.append(base_url + title_link['href'].lstrip('/'))

  # Parsing through Inner Pages of Movies
  for link in list_links:
    
    # For Printing Status...
    # count=count+1
    # print(f'{count}.parsing {link}....')

    # Get Inner Page Beautiful Soup Object
    movie_webpage = get_child_soap(link)

    # Appending the data of movies to various lists
    list_genres.append(get_genres(movie_webpage))
    list_popularities.append(get_ratings(movie_webpage))
    list_release_dates.append(get_release_date(movie_webpage))
    list_runtimes.append(get_runtime(movie_webpage))
    facts = get_facts(movie_webpage)
    list_languages.append(facts[0])
    list_budgets.append(facts[1])
    list_revenues.append(facts[2])
  
  # Resetting the list of links for next iteration
  list_links = []

**Converting Lists into a Pandas DataFrame**

In [43]:
# Transforming Lists into a Dictionary for Pandas
dict_df = {
    'Title':list_titles,
    'Genres':list_genres,
    'Popularity':list_popularities,
    'Release Date':list_release_dates,
    'Run Time':list_runtimes,
    'Languages':list_languages,
    'Budget':list_budgets,
    'Revenue':list_revenues
}

In [48]:
df = pd.DataFrame(dict_df)
df

| | Title | Genres | Popularity | Release Date | Run Time | Languages | Budget | Revenue |
|---|---|---|---|---|---|---|---|---|
| 0 | Luca | [Animation, Comedy, Family, Fantasy] | 82.0 | 06/17/2021 | 95.0 | [English] | 0.0 | 5000000.0 |
| 1 | A Quiet Place Part II | [Science Fiction, Thriller, Horror] | 74.0 | 05/28/2021 | 97.0 | [English] | 61000000.0 | 224400713.0 |
| 2 | Infinite | [Science Fiction, Action, Thriller] | 0.0 | 09/08/2021 | 106.0 | [English] | 0.0 | 0.0 |
| 3 | Cruella | [Comedy, Crime] | 85.0 | 05/28/2021 | 134.0 | [English] | 200000000.0 | 161317593.0 |
| 4 | F9 | [Action, Adventure, Crime] | 79.0 | 06/25/2021 | 145.0 | [English] | 200000000.0 | 292457000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Porno | [Comedy, Horror] | 58.0 | 05/08/2020 | 98.0 | [English] | 0.0 | 0.0 |
| 996 | The Conquest Of Siberia | [Action, History, Drama] | 63.0 | 06/28/2019 | 109.0 | [Russian] | 0.0 | 1964806.0 |
| 997 | Chappie | [Crime, Action, Science Fiction] | 68.0 | 03/06/2015 | 120.0 | [English] | 49000000.0 | 104399548.0 |
| 998 | Spies in Disguise | [Animation, Action, Adventure, Comedy, Family] | 77.0 | 12/25/2019 | 102.0 | [English] | 100000000.0 | 171616764.0 |
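
Before the DataFrame is used as a dataset, a few columns may benefit from cleanup. The sketch below works on two rows copied from the output above: it parses `Release Date` into datetime values and treats a `0` in `Budget` or `Revenue` as missing, on the assumption that `0` stands for "not listed". The variable name `sample` is hypothetical.

```python
import pandas as pd

# Two rows copied from the output above, used as sample data
sample = pd.DataFrame({
    'Title': ['Luca', 'A Quiet Place Part II'],
    'Release Date': ['06/17/2021', '05/28/2021'],
    'Budget': [0.0, 61000000.0],
    'Revenue': [5000000.0, 224400713.0],
})

# Parse the 'MM/DD/YYYY' strings into datetime values
sample['Release Date'] = pd.to_datetime(sample['Release Date'], format='%m/%d/%Y')

# 0 is assumed to mean "not listed", so mark it as missing
sample[['Budget', 'Revenue']] = sample[['Budget', 'Revenue']].replace(0, float('nan'))

print(sample.dtypes)
```

With missing amounts marked as `NaN`, aggregates such as a mean budget are no longer pulled down by placeholder zeros.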
