# Scrapping TMDB


## Picking a website

* Here I am going scrape data from [TMDB](https://www.themoviedb.org/movie)
* [TMDB](https://www.themoviedb.org/movie) is a popular movies website which shows all the information for our favorite movies.


1. Objective

* To have  a better understanding of my what my result is going to look like I made this short spreadsheet which would help me only scrape data relevant to my needs.

* The main objective is to have data for a lot of movies like this in `csv` format as `csv` file can be read easily for Machine Learning and Data Analysis.

![](https://media.discordapp.net/attachments/864925487272558616/1096786878491000933/image.png)


# Project Outline:

- I am going to scrape : https://www.themoviedb.org/movie
- From here I will get a list of movie urls.
- For each movie I am going to get it's Name, Release Date, Genre, Duration, Ratings, Director, Budget, Revenue, Image/Poster Link.
- The goal is to scrape 50 pages of this website.
- Create a CSV file in the following format:

```
Name,Release Date,Genre,Duration,Ratings,Directior,Budget,Revenue,Image Link
The Super Mario Bros. Movie(2023),04/07/2023,"Animation,Adventure,Family,Fantasy,Comedy",1h 32m,75%,Michael Jelenic,"100,000,000","437,000,000",Poster Link for Super Mario Bros
Shazam! Fury of the Gods (2023),03/17/2023,"Action,Comedy,Fantasy",2h 10m,70%,David F. Sandberg,"125,000,000","120,864,255",Poster Link for Shazam!
Avatar: The Way of Water,12/16/2022,"Science Fiction, Adventure,Action",3h 12m,77%,James Cameron,"350,000,000","2,312,335,665",Poster Link for Avatar
```

2. Use the requests library to download web pages

* Inspect the website's HTML source and identify the right URLs to download.
* Download and save web pages locally using the requests library.
* Create a function to automate downloading for different topics/search queries.

In [19]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
web_page = 'https://www.themoviedb.org/movie'

In [3]:
#Need to add headers here because I was getting Status code:403
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(web_page, headers=headers)

In [4]:
#Status check 200=Success
response.status_code

200

In [5]:
#Using .text from request library to get html code as a string
content = response.text

3. Use Beautiful Soup to parse and extract information

* Parse and explore the structure of downloaded web pages using Beautiful soup.
* Use the right properties and methods to extract the required information.
* Create functions to extract from the page into lists and dictionaries.

In [25]:
def get_movie_links(web_page):
    
    headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    page = '?page='
    
    links_lst = []
    
    main_url_lst = []
    
    for page_num in range(1,50):
        main_url_lst.append(web_page + page + str(page_num))

    for url in main_url_lst :
        html = requests.get(url,headers=headers).text
        info_soup = BeautifulSoup(html,'html.parser')
        div = info_soup.find_all('div',class_ = 'card style_1')
        for data in div:
            h2 = data.find('h2')
            a = h2.find('a')
            link = a['href']
            links_lst.append(base_url + link)
            
    return links_lst


## About  get_movie_links(web_page):
The function `get_movie_links(web_page)` takes the web_page as a parameter breaks the content of links from multiple web pages into `links_lst`. It return a list containing links of all the movies accross 50 pages on the given web_page. 

In [26]:
base_url = 'https://www.themoviedb.org'
web_page  = 'https://www.themoviedb.org/movie'
lst = get_movie_links(web_page)

In [27]:
def get_movie_info(movie_page_url):
    headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    with requests.Session() as session:
        response = requests.get(movie_page_url,headers=headers)
    movie_doc = BeautifulSoup(response.text,'html.parser')
    movie_name = movie_doc.find('div').find('h2').text.strip().replace('\n',' ')
    movie_genres = movie_doc.find('span',class_='genres').text.strip().replace('\xa0','')
    try : 
        movie_runtime = movie_doc.find('span',class_='runtime').text.strip()
    except AttributeError:
        movie_runtime = "NA"
    movie_release = movie_doc.find('span',class_='release').text.strip()[:-4]
    movie_ratings = movie_doc.find('div',class_='user_score_chart')['data-percent']
    directors = movie_doc.find_all('li', class_='profile')
    
    movie_directors = []
    
    for director in directors:
        if 'Director' in director.find('p', class_='character').text :
            movie_directors.append(director.find('a').text)
            
    stats = movie_doc.find('section',class_='facts left_column')
    
    budget = stats.text.strip().split('Budget')[1].split('\n')[0].strip()
    
    revenue = stats.text.strip().split('Revenue')[1].strip()
    
    poster = movie_doc.find('img',class_='poster lazyload')['data-src']
    
    poster_url = base_url + poster
    
    return movie_name, movie_genres, movie_runtime, movie_release, movie_ratings, movie_directors, budget, revenue, poster_url


## About  get_movie_info():
The function `get_movie_info()` takes the movie links as a parameter converts the website information into beautiful soup format in a variable and we use that variable to extract all the necessary information. Like movie name, genres, runtime, release, ratings, directors, stats, budget, revenue, poster url.
This function will be used in the next function to get the data and convert it to a dataframe in one step aswell.


In [28]:

def get_dataframe(movie_links):
    
    movie_info_dict = {
        'name' : [],
        'genres' : [],
        'runtime' : [],
        'release' : [],
        'ratings' : [],
        'directors' : [],
        'budget' : [],
        'revenue' : [],
        'poster_link' : []
    }

    for movie_url in movie_links:
        movie_info = get_movie_info(movie_url)
        movie_info_dict['name'].append(movie_info[0])
        movie_info_dict['genres'].append(movie_info[1])
        movie_info_dict['runtime'].append(movie_info[2])
        movie_info_dict['release'].append(movie_info[3])
        movie_info_dict['ratings'].append(movie_info[4])
        movie_info_dict['directors'].append(movie_info[5])
        movie_info_dict['budget'].append(movie_info[6])
        movie_info_dict['revenue'].append(movie_info[7])
        movie_info_dict['poster_link'].append(movie_info[8])

    movie_info_df = pd.DataFrame(movie_info_dict)
    
    return movie_info_df


## About get_dataframe():
The `get_datafame()` functions takes movie links as parameter this parameter is the returned values from the `get_movie_info()` function. This function stores all values for the names , genres, runtime, etc. under their respected keys in a dictionary, which is later converted to a dataframe.

4. Create CSV file(s) with the extracted information

* Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
* Execute the function with different inputs to create a dataset of CSV files.
* Verify the information in the CSV files by reading them back using Pandas.

In [30]:
tmdb_info = get_dataframe(lst)

In [31]:
tmdb_info.to_csv('tmdb_info_1.csv',index=None)

In [32]:
tmdb_info

Unnamed: 0,name,genres,runtime,release,ratings,directors,budget,revenue,poster_link
0,Transformers: Rise of the Beasts (2023),"Action,Adventure,Science Fiction",2h 7m,06/09/2023,73.0,[Steven Caple Jr.],"$195,000,000.00","$420,965,000.00",https://www.themoviedb.org/t/p/w300_and_h450_b...
1,Guardians of the Galaxy Vol. 3 (2023),"Science Fiction,Adventure,Action",2h 30m,05/05/2023,81.0,[James Gunn],"$250,000,000.00","$840,155,006.00",https://www.themoviedb.org/t/p/w300_and_h450_b...
2,Fast X (2023),"Action,Crime,Thriller",2h 22m,05/19/2023,74.0,[Louis Leterrier],"$340,000,000.00","$699,220,096.00",https://www.themoviedb.org/t/p/w300_and_h450_b...
3,Knights of the Zodiac (2023),"Fantasy,Action,Adventure",1h 53m,04/28/2023,66.0,[Tomek Baginski],"$60,000,000.00","$6,794,519.00",https://www.themoviedb.org/t/p/w300_and_h450_b...
4,Sound of Freedom (2023),"Action,Drama",2h 11m,07/04/2023,81.0,[Alejandro Monteverde],"$14,000,000.00","$85,498,581.00",https://www.themoviedb.org/t/p/w300_and_h450_b...
...,...,...,...,...,...,...,...,...,...
975,Gurkha: Beneath the Bravery (2022),"Drama,War,History",1h 12m,07/22/2022,45.0,[Pradeep Shahi],-,-,https://www.themoviedb.org/t/p/w300_and_h450_b...
976,The Princess (2022),"Action,Fantasy",1h 34m,06/16/2022,69.0,[Lê Văn Kiệt],-,-,https://www.themoviedb.org/t/p/w300_and_h450_b...
977,Kingsman: The Golden Circle (2017),"Action,Adventure,Comedy",2h 21m,09/22/2017,70.0,[Matthew Vaughn],"$104,000,000.00","$410,902,662.00",https://www.themoviedb.org/t/p/w300_and_h450_b...
978,Roald Dahl's The Witches (2020),"Adventure,Fantasy,Comedy,Family,Horror",1h 46m,10/22/2020,64.0,[Robert Zemeckis],-,"$29,303,571.00",https://www.themoviedb.org/t/p/w300_and_h450_b...


## Objective Completed.

![](https://cdn.discordapp.com/attachments/864925487272558616/1130591594605727754/image.png)