<a href="https://colab.research.google.com/github/dniggl/Insights/blob/main/Web_Scraping_Movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Scrape a movies webpage, extract data and create a new file. 

In [18]:
# Import the required libraries.
import requests
import bs4
import pandas as pd
import csv


*   requests allows you to send HTTP requests which returns a Response Object   with all the response data (i.e. HTML).

*   beautifulsoup (bs4) is used to pull data out of HTML files and convert the data to a BeautifulSoup object, which represents the HTML as a nested data structure.
*   pandas is used for data analysis and manipulation.



**Retrieve and Convert the HTML**

Create an object (URL) containing the website address and send a get request for the specific URL's HTML. Then retrieve the HTML data that the server sends back and convert the data into a BeautifulSoup object.

In [19]:
# Extract the HTML and create a BeautifulSoup object.
url = 'https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating'

def get_page_contents(url):
    page = requests.get(url, headers={"Accept-Language": "en-US"})
    return bs4.BeautifulSoup(page.text, "html.parser")

soup = get_page_contents(url)

The HTML content of the webpages will be parsed and scraped using Beautiful Soup. Beautiful Soup is a great tool for parsing and scraping websites because of the numerous functions it provides to extract data from HTML. To learn more about BeautifulSoup select this link (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-usehttps://) 

For this project we will be using the IMDB Top 100 Movies webpage. You can find this webpage by selecting this link https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating

After reviewing the IMDB Top 100 Movies webpage, I have decided to extract the following data elements: 

*   Movie Title
*   Release Year
*   Audience Rating
*   Runtime
*   Genre
*   IMDB Rating
*   Number of Votes
*   Box Office Earnings

**Find and Extract the Data Elements**

We will first create a list of all distinct movies and their corresponding HTML. The findAll method creates a list where each entry contains the HTML that’s captured within the 'div' tag with the class 'lister-item-content'.   

In [20]:
movies = soup.findAll('div', class_='lister-item-content')

Next, for each data element we want to extract, we will loop through all movies, find all the HTML lines within the specified tag and class, and extract and store the data elements. In the case of votes and earnings data elements, a tag, name and some parameters are used to find the specific HTML lines. The data elements extacted will be stored in a list.     

In [21]:
titles = [movie.find('a').text for movie in movies]
release = [movie.find('span', class_='lister-item-year text-muted unbold').text for movie in movies]
audience_rating = [movie.find('span', class_='certificate') for movie in movies]
runtime = [movie.find('span', class_='runtime').text for movie in movies]
genre = [movie.find('span', class_='genre').text.strip() for movie in movies]
imdb_rating = [movie.find('div', 'inline-block ratings-imdb-rating', text_attribute=False).text.strip() for movie in movies]
votes = [movie.find('span', {'name' : 'nv'}, text_attribute=False, order=None).text for movie in movies]
earnings = [movie.find('span', {'name' : 'nv'},[1], text_attribute=False).text for movie in movies]

**Create and Display the Data Frame**

We will now create a new Data Frame containing the names and data elements that were extracted.

In [22]:
movies_dict = {'Title': titles, 'Relase': release, 'Audience Rating': audience_rating,
           'Runtime': runtime, 'Genre': genre, 'IMDB Rating': imdb_rating,
           'Votes': votes, 'Box Office Earnings': earnings}
        
movies = pd.DataFrame(movies_dict)
movies.head(10)

Unnamed: 0,Title,Relase,Audience Rating,Runtime,Genre,IMDB Rating,Votes,Box Office Earnings
0,K.G.F: Chapter 2,(2022),,168 min,"Action, Crime, Drama",9.7,44351,44351
1,Jai Bhim,(2021),[TV-MA],164 min,"Crime, Drama, Mystery",9.4,185638,185638
2,The Shawshank Redemption,(1994),[R],142 min,Drama,9.3,2574040,2574040
3,Soorarai Pottru,(2020),[TV-MA],153 min,Drama,9.2,109454,109454
4,The Godfather,(1972),[R],175 min,"Crime, Drama",9.2,1772427,1772427
5,The Dark Knight,(2008),[PG-13],152 min,"Action, Crime, Drama",9.0,2540879,2540879
6,The Lord of the Rings: The Return of the King,(2003),[PG-13],201 min,"Action, Adventure, Drama",9.0,1770365,1770365
7,Schindler's List,(1993),[R],195 min,"Biography, Drama, History",9.0,1311109,1311109
8,The Godfather: Part II,(1974),[R],202 min,"Crime, Drama",9.0,1225643,1225643
9,12 Angry Men,(1957),[Approved],96 min,"Crime, Drama",9.0,760189,760189


**Convert Data Frame to a CVS File**

If needed, we can create a csv file from the data frame that was created in the previous step. 

In [23]:
csv_data = movies.to_csv()
print('\nCSV String Values:\n', csv_data)


CSV String Values:
 ,Title,Relase,Audience Rating,Runtime,Genre,IMDB Rating,Votes,Box Office Earnings
0,K.G.F: Chapter 2,(2022),,168 min,"Action, Crime, Drama",9.7,"44,351","44,351"
1,Jai Bhim,(2021),"<span class=""certificate"">TV-MA</span>",164 min,"Crime, Drama, Mystery",9.4,"185,638","185,638"
2,The Shawshank Redemption,(1994),"<span class=""certificate"">R</span>",142 min,Drama,9.3,"2,574,040","2,574,040"
3,Soorarai Pottru,(2020),"<span class=""certificate"">TV-MA</span>",153 min,Drama,9.2,"109,454","109,454"
4,The Godfather,(1972),"<span class=""certificate"">R</span>",175 min,"Crime, Drama",9.2,"1,772,427","1,772,427"
5,The Dark Knight,(2008),"<span class=""certificate"">PG-13</span>",152 min,"Action, Crime, Drama",9.0,"2,540,879","2,540,879"
6,The Lord of the Rings: The Return of the King,(2003),"<span class=""certificate"">PG-13</span>",201 min,"Action, Adventure, Drama",9.0,"1,770,365","1,770,365"
7,Schindler's List,(1993),"<span class=""certificate"">R</span>",195 min,"Biog