# Web Scraping in Python using Beautiful Soup
### *By:* *`Ayobami Yusuf`*
### **Introduction:**
> This is a simple project/tutorial that explains how to programmatically **scrape** (extract) data from the **web** (internet), a task generally referred to as **Web Scraping**, in Python using two modules (***`requests`*** and ***`BeautifulSoup`***). <p> This notebook is structured in an easy-to-follow manner so that beginners can gain the knowledge and skills required to complete a web scraping task. <p> For this project, we will access and extract the **[IMDb Top 250 Movies of all Time data](https://www.imdb.com/chart/top)** and load the data into an Excel file for analysis.

## Packages/Libraries/Modules Installations

> Apart from `requests`, `BeautifulSoup`, and the popular `pandas` library, we also use **`openpyxl`**, a library for reading and writing Excel files, because we intend to load our extracted data into an Excel file for further processing or analysis. The code in this notebook also parses HTML with the `lxml` parser, so the `lxml` package needs to be installed. This tutorial does not cover any introduction to openpyxl, but **[here's openpyxl's documentation](https://openpyxl.readthedocs.io/en/stable/)** for reference.
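Before running the cells, it can help to confirm that every package is importable. The following is a minimal sketch, not part of the original notebook: the helper name `missing_packages` and the `REQUIRED` mapping (import name to PyPI name) are illustrative assumptions.

```python
from importlib.util import find_spec

# Import name -> name to pass to "pip install" (mapping assumed for this sketch).
REQUIRED = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "lxml": "lxml",
    "openpyxl": "openpyxl",
    "pandas": "pandas",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be imported."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported missing, install it with `pip` and restart the notebook kernel before continuing.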

In [1]:
import requests #to access the website's html contents
from bs4 import BeautifulSoup #to parse the contents from the accessed html page

In [2]:
#import the openpyxl library
import openpyxl as pxl

#create an excel file that will contain the excel worksheet to store the data in
file = pxl.Workbook()

#activate the current(open/active) worksheet as the sheet to be used
sheet = file.active

#rename/retitle the activated worksheet
sheet.title = "IMDb Ratings"

#create the column headers
sheet.append(["rank", "title", "release_year", "imdb_rating"])

print(file.sheetnames)

['IMDb Ratings']
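To check what a worksheet holds at any point, its rows can be read back in memory before anything is saved. This is a small sketch under stated assumptions: the names `demo_workbook` and `demo_sheet` are chosen so they do not overwrite the tutorial's `file` and `sheet`, and the data row is made up for illustration.

```python
import openpyxl

# A separate, throwaway workbook so the tutorial's own sheet is untouched.
demo_workbook = openpyxl.Workbook()
demo_sheet = demo_workbook.active
demo_sheet.title = "IMDb Ratings"

# Header row, as in the cell above, plus one made-up data row.
demo_sheet.append(["rank", "title", "release_year", "imdb_rating"])
demo_sheet.append([1, "Example Movie", 2000, 8.5])

# values_only=True yields plain tuples of cell values instead of Cell objects.
demo_rows = list(demo_sheet.iter_rows(values_only=True))
for row in demo_rows:
    print(row)
```

Each call to `append` adds one row below the last filled row, which is why the header row is added first.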


## Connect to the webpage that houses the needed data and extract the data from it

In [3]:
#url to the page to be accessed:
url = "https://www.imdb.com/chart/top"

#send a GET request to the server hosting the page; this returns a Response object
#that carries the page's HTML source code along with status information (a status code of 200 means 'success')
source_code = requests.get(url) 

#parse the retrieved contents (the source code alone) using BeautifulSoup via the lxml parser
soup = BeautifulSoup(source_code.text, 'lxml') #the .text attribute holds ONLY the html source code, as a string

#now, let's take just a sneak peek at the html contents retrieved
#(calling a tag, as in soup.head(), is shorthand for soup.head.find_all())
soup.head()[0:3]

[<meta charset="utf-8"/>,
 <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>,
 <script>
     if (typeof uet == 'function') {
       uet("bb", "LoadTitle", {wb: 1});
     }
 </script>]

Beautiful! (No pun intended.) Our soup has been beautifully prepared (still, no pun intended...well, maybe a little). We have successfully accessed the page's front end, retrieved its source code, and parsed it into a "pythonable" format using the popular ***`lxml`*** parser. By the way, **[here's](https://www.scrapingbee.com/blog/data-parsing/)** a good read to understand the concept of **[parsing/parser/parse](https://www.scrapingbee.com/blog/data-parsing/)** and why parsers are needed. A little (very little) portion of the content is shown to confirm that we have completed this phase.
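The request above assumes the server answers successfully. As a minimal sketch (not part of the original notebook), the helper below adds a timeout and checks the status code before parsing. The name `fetch_soup`, its `get` parameter, and the 10-second timeout are illustrative choices, and the built-in `html.parser` is the default here only so the sketch runs without `lxml`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, get=requests.get, parser="html.parser"):
    """Download a page and return it parsed, failing loudly on HTTP errors."""
    # timeout stops the call from hanging forever if the server never answers.
    response = get(url, timeout=10)
    # raise_for_status() raises an exception for 4xx/5xx status codes.
    response.raise_for_status()
    return BeautifulSoup(response.text, parser)


# Example call (commented out so the sketch makes no network request on its own):
# demo_soup = fetch_soup("https://www.imdb.com/chart/top", parser="lxml")
```

The `get` parameter exists so the function can be exercised with a stand-in during testing instead of a live request.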

## Next up, we retrieve the actual data needed from this page
> Most of the time, we do not need all the information on the page we are scraping. Data analytics tasks that rely on web scraping usually need ONLY the data tables found on web pages, like the one on the IMDb Top 250 Movies page that shows the **Rank, Title, Release Year,** and **IMDb Rating** of the 250 movies considered the Top 250 movies of all time. <p> This is where basic knowledge of HTML and HTML tags is very useful (though not compulsory). For this project/tutorial, we won't be covering any introduction to HTML. You can refer to **[W3 Schools' HTML tutorial](https://www.w3schools.com/html/html_intro.asp)** to learn about the basics of HTML.

In [4]:
#the data table we need can be found in the <tbody> tag with the (lister-list) class attribute
#and the data are found in the <tr> children tags of the <tbody> parent tag. 
#we use this knowledge to locate and extract the data
movie_table = soup.find('tbody', class_='lister-list').find_all('tr')
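One thing to watch for in the cell above: `find` returns `None` when no tag matches, so chaining `.find_all` onto it fails with an `AttributeError` if the table is absent. Below is a hedged sketch of a guarded lookup. The helper name `find_table_rows` and the tiny `DEMO_HTML` document are assumptions made for illustration; only the `lister-list` class name comes from the tutorial.

```python
from bs4 import BeautifulSoup

# A tiny stand-in page with the same <tbody class="lister-list"> structure.
DEMO_HTML = """
<table>
  <tbody class="lister-list">
    <tr><td>first row</td></tr>
    <tr><td>second row</td></tr>
  </tbody>
</table>
"""


def find_table_rows(soup, css_class="lister-list"):
    """Return the <tr> tags inside the matching <tbody>, or raise a clear error."""
    table_body = soup.find("tbody", class_=css_class)
    if table_body is None:
        raise ValueError(
            f"no <tbody class={css_class!r}> found; the page layout may differ"
        )
    return table_body.find_all("tr")


demo_table_rows = find_table_rows(BeautifulSoup(DEMO_HTML, "html.parser"))
print(len(demo_table_rows))
```

A descriptive error like this makes it easier to tell a layout problem apart from a bug in the extraction loop.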

In [5]:
#now that we have located all the tags that house our data,
#we can write a for loop that iterates through each <tr> tag and extracts the needed data from each one automatically
for movie in movie_table:
    title_cell = movie.find('td', class_='titleColumn') #look the title cell up once and reuse it
    rank = title_cell.get_text(strip=True).split('.')[0]
    title = title_cell.a.text
    year = title_cell.span.text.strip('()')
    rating = movie.find('td', class_='ratingColumn imdbRating').text.replace('\n','')
    
    #since our worksheet is already set up, we can append the extracted data to the open sheet directly from the loop
    sheet.append([rank, title, year, rating])
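The loop stores every value as text. As a sketch of one possible variation (not the notebook's own method), the function below extracts the same four fields, converts them to numbers, and skips rows that lack an expected tag. The name `parse_row` and the `SAMPLE_ROWS` markup are hypothetical; the markup is modeled only on the class names used in the loop.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the rows the loop expects; the second row
# deliberately lacks a rating cell.
SAMPLE_ROWS = """
<table><tbody class="lister-list">
<tr>
  <td class="titleColumn">
    1.
    <a href="/title/tt0111161/">The Shawshank Redemption</a>
    <span class="secondaryInfo">(1994)</span>
  </td>
  <td class="ratingColumn imdbRating"><strong>9.2</strong></td>
</tr>
<tr>
  <td class="titleColumn">2. <a href="#">Untitled</a> <span>(2000)</span></td>
</tr>
</tbody></table>
"""


def parse_row(row):
    """Return (rank, title, year, rating) with numeric types, or None if incomplete."""
    title_cell = row.find("td", class_="titleColumn")
    rating_cell = row.find("td", class_="imdbRating")
    if title_cell is None or rating_cell is None:
        return None
    if title_cell.a is None or title_cell.span is None:
        return None
    rank = int(title_cell.get_text(strip=True).split(".")[0])
    title = title_cell.a.get_text(strip=True)
    year = int(title_cell.span.get_text(strip=True).strip("()"))
    rating = float(rating_cell.get_text(strip=True))
    return rank, title, year, rating


sample_soup = BeautifulSoup(SAMPLE_ROWS, "html.parser")
parsed_rows = [parse_row(tr) for tr in sample_soup.find_all("tr")]
print(parsed_rows)
```

Rows that come back as `None` can be filtered out before appending to the worksheet, so one malformed row does not stop the whole run.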

## Next, we save the Excel file to the local disk

In [7]:
file.save("IMDB Ratings Data.xlsx")

### Finally, to check all that we have done, we can load the saved Excel file into our notebook environment to be sure the data were truly scraped and saved.

In [12]:
import pandas as pd 
df = pd.read_excel("IMDB Ratings Data.xlsx")

#view the first five rows of the Excel worksheet
df.head()

| | rank | title | release_year | imdb_rating |
|---|---|---|---|---|
| 0 | 1 | The Shawshank Redemption | 1994 | 9.2 |
| 1 | 2 | The Godfather | 1972 | 9.2 |
| 2 | 3 | The Dark Knight | 2008 | 9.0 |
| 3 | 4 | The Godfather Part II | 1974 | 9.0 |
| 4 | 5 | 12 Angry Men | 1957 | 9.0 |
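Because the scraping loop appended every value as text taken from the page, the numeric columns may load back as strings rather than numbers. The sketch below shows one way to convert them with `pd.to_numeric`. It uses a small stand-in DataFrame named `demo_df` (an assumption, built from the first two rows shown above) so that it runs without the saved file.

```python
import pandas as pd

# Stand-in for the loaded worksheet, with values stored as text.
demo_df = pd.DataFrame({
    "rank": ["1", "2"],
    "title": ["The Shawshank Redemption", "The Godfather"],
    "release_year": ["1994", "1972"],
    "imdb_rating": ["9.2", "9.2"],
})

numeric_columns = ["rank", "release_year", "imdb_rating"]

# errors="coerce" turns any value that cannot be parsed into NaN
# instead of raising an exception.
demo_df[numeric_columns] = demo_df[numeric_columns].apply(
    pd.to_numeric, errors="coerce"
)

print(demo_df.dtypes)
```

With numeric dtypes in place, operations such as sorting by rating or averaging by release year behave as expected.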


## Tasks Completed.

And that's it. You should see the worksheet on your computer (in the same folder as your Jupyter notebook or your .py file). If you find this useful, or have any questions regarding web scraping, data engineering, or data analysis, kindly reach out to me on **[www.linkedin.com/in/ayobami-yusuf](https://www.linkedin.com/in/ayobami-yusuf)**