---
<center><h1>Lesson 7.1.1 - Web scraping with Python</h1></center>

---

### What is Web Scraping?

**Web Scraping** (also termed Screen Scraping, Web Data Extraction, etc.) is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in table (spreadsheet) format.

Data displayed by most websites can only be viewed using a web browser. Examples are data listings in yellow pages directories, real estate sites, social networks, industrial inventories, online shopping sites, contact databases, etc. Most websites do not offer a way to save a copy of the data they display to your computer. The only option then is to manually copy and paste the data from your browser into a local file, a very tedious job which can take many hours or sometimes days to complete.

Web Scraping is the technique of automating this process, so that instead of manually copying the data from websites, the Web Scraping software will perform the same task within a fraction of the time.

Web Scraping software interacts with websites in the same way as your web browser. But instead of displaying the data served by the website on screen, it saves the required data from the web page to a local file or database.
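As a rough illustration of that idea, the sketch below uses only Python's standard library. It reads a hard-coded HTML snippet (a stand-in for a downloaded page, so nothing is actually fetched), collects the table cells, and writes them out as CSV text. The snippet, the `CellCollector` class and the `rows_to_csv` helper are made up for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this from a website
SAMPLE_HTML = """
<table>
  <tr><td>Notebook</td><td>3.50</td></tr>
  <tr><td>Pencil</td><td>0.80</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # start a new row
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data  # text inside the current cell


def rows_to_csv(rows):
    """Returns the rows as CSV text (a real file object would work the same way)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


collector = CellCollector()
collector.feed(SAMPLE_HTML)
print(rows_to_csv(collector.rows))
```

The rest of the lesson replaces the hand-written parser with BeautifulSoup, which makes this kind of extraction much shorter.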

<img src="images/scraping.png" width=120%>

### Python tools for Web Scraping

[**_BeautifulSoup_**](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is an open-source Python library for pulling information out of a web page. You can use it to extract tables, lists and paragraphs, and you can also apply filters to select specific information from web pages.

You can install it using pip, for example:

    $ pip install beautifulsoup4
    
BeautifulSoup does not fetch the web page for us. That’s why we will use the [**_requests_**](http://docs.python-requests.org/en/master/) library in combination with the BeautifulSoup library.

Python has several other options for HTML scraping in addition to BeautifulSoup. Here are some others:

* [`mechanize`](http://wwwsearch.sourceforge.net/mechanize/);
* [`scrapemark`](http://arshaw.com/scrapemark/);
* [`scrapy`](http://scrapy.org/).

Let's import the libraries we need:

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import warnings
warnings.filterwarnings('ignore')

### A few words about HTML tags

While performing web scraping, we deal with HTML tags, so we must have a good understanding of them. If you already know the basics of HTML, you can skip this section. Below is the basic syntax of HTML:

<img src="images/html1.png">

This syntax has various tags as elaborated below:

1. **`<!DOCTYPE html>`**: HTML documents must start with a type declaration;
2. The HTML document is contained between **`<html>`** and **`</html>`**;
3. The visible part of the HTML document is between **`<body>`** and **`</body>`**;
4. HTML headings are defined with the **`<h1>`** to **`<h6>`** tags;
5. HTML paragraphs are defined with the **`<p>`** tag.

Other useful HTML tags are:

* HTML links are defined with the **`<a>`** tag, **`<a href="http://www.test.com">This is a link for test.com</a>`**;
* HTML tables are defined with **`<table>`**, rows with **`<tr>`**, and each row is divided into data cells with **`<td>`**;

<img src="images/html2.png" width=45%>

* HTML lists start with **`<ul>`** (unordered) or **`<ol>`** (ordered). Each list item starts with **`<li>`**;

<img src="images/html3.jpg" style="padding: 0 250px 0 0;" width=40%>

* There are two special HTML tags, **`<div>`** and **`<span>`**, that serve as wrappers for other tags;

* The **`class`** and **`id`** attributes identify HTML tags: an `id` should be unique on a page, while a `class` can be shared by many tags.

<img src="images/html4.jpg" width=36%>
<img src="images/html5.jpg" width=52%>

If you are new to HTML tags, I would also recommend the [HTML tutorial from W3schools](http://www.w3schools.com/html/), for example. It will give you a clear understanding of HTML tags.
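To see how these tags look from Python, here is a small sketch that parses a hand-written HTML fragment with BeautifulSoup. The fragment, its `id` and `class` values, and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that uses the tags described above
html = """
<html><body>
  <h1>Shopping</h1>
  <div id="content" class="main">
    <p>See <a href="http://www.test.com">this link</a>.</p>
    <ul>
      <li>Milk</li>
      <li>Bread</li>
    </ul>
    <table>
      <tr><td>Milk</td><td>2</td></tr>
      <tr><td>Bread</td><td>1</td></tr>
    </table>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1").text                    # text inside <h1>
link = soup.find("a")["href"]                     # value of the href attribute
items = [li.text for li in soup.find_all("li")]   # every list item
table = [[td.text for td in tr.find_all("td")]    # table as a list of rows
         for tr in soup.find_all("tr")]
wrapper = soup.find("div", id="content")          # look a tag up by its id

print(heading, link, items, table, wrapper["class"])
```

Note that `wrapper["class"]` comes back as a list, because a tag may carry several classes at once.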

### Scraping of Wikipedia site

The general idea behind web scraping is to retrieve data that exists on a website, and convert it into a format that is usable for analysis. Webpages are rendered by the browser from HTML and CSS code, but much of the information included in the HTML underlying any website is not interesting to us.

We will extract data about the most popular movies from [Wikipedia](http://www.wikipedia.org/). The same approach can be applied to scraping data from any other website.

First of all, we scrape data about the most popular movies by genre, using this link: <br> https://en.wikipedia.org/wiki/List_of_films_considered_the_best.

We begin by downloading the source code of the web page and then creating a BeautifulSoup object from it with the `BeautifulSoup` constructor.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_films_considered_the_best"
# Get data from URL
page = requests.get(url)

# Display HTML page
print(page.text)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of films considered the best - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_films_considered_the_best","wgTitle":"List of films considered the best","wgCurRevisionId":892823955,"wgRevisionId":892823955,"wgArticleId":304831,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 German-language sources (de)","CS1 uses Japanese-language script (ja)","CS1 Japanese-language sources (ja)","CS1 Chinese-language sources (zh)","CS1 Croatian-language sources (hr)","CS1 Czech-language sources (cs)","CS1 Estonian-language sources (et)","CS1 Finnish

In [3]:
# Parse the HTML in the `page` variable, and store it in Beautiful Soup format
bs = BeautifulSoup(page.text, 'html.parser')
# `'html.parser'` selects Python's built-in HTML parser. The argument is optional, but without it Beautiful Soup picks a parser itself and warns about it
bs

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of films considered the best - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_films_considered_the_best","wgTitle":"List of films considered the best","wgCurRevisionId":892823955,"wgRevisionId":892823955,"wgArticleId":304831,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 German-language sources (de)","CS1 uses Japanese-language script (ja)","CS1 Japanese-language sources (ja)","CS1 Chinese-language sources (zh)","CS1 Croatian-language sources (hr)","CS1 Czech-language sources (cs)","CS1 Estonian-language sources (et)","CS1 Finnis

Every browser provides a tool for inspecting the content of a web page, commonly called the Web Inspector. The easiest way to open the Inspector window is to right-click somewhere on the web page and choose the inspect option from the context menu.

So, let's go to the page https://en.wikipedia.org/wiki/List_of_films_considered_the_best and open the Inspector. The Elements tab shows the HTML code of the web page. The Inspector window should look like this (or it may be displayed horizontally at the bottom of the page):

<img src="images/wiki1.png">

The main text block is located in a **`<div>`** with class "mw-parser-output". Each movie in a **`<ul>`** list on the web page sits in an **`<li>`** tag of the HTML code, where the **`<i>`** tag holds the hyperlink and the title of the movie. The remaining information about the movie follows the **`<i>`** tag, up to the end of the **`<li>`**.

<img src="images/wiki2.png">

The `find_all(name)` method of BeautifulSoup (also available under its older name `findAll`) collects all tags called `name` inside the given BeautifulSoup object and returns them as a list. The `find()` method returns only the first match.
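A minimal sketch of the difference between `find()` and `find_all()`, run on a made-up fragment rather than the Wikipedia page. The second positional argument is matched against the `class` attribute, which is what a call such as `bs.find("div", "mw-parser-output")` relies on.

```python
from bs4 import BeautifulSoup

# Made-up fragment; the class names are only for illustration
html = """
<div class="intro"><p>First</p></div>
<div class="body"><p>Second</p><p>Third</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")                     # only the first matching tag
all_p = soup.find_all("p")                   # list-like result with every match
body_div = soup.find("div", "body")          # second argument filters by CSS class
same_div = soup.find("div", class_="body")   # the same filter, spelled out
missing = soup.find("table")                 # no match: find() returns None

body_texts = [p.text for p in body_div.find_all("p")]
print(first_p.text, len(all_p), body_texts, missing)
```

Because `find()` returns `None` when nothing matches, chaining another call onto its result raises `AttributeError` on pages that lack the expected tag.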

In [4]:
# Find the first <div> whose class is "mw-parser-output" (the main text block)
text_piece = bs.find("div", "mw-parser-output")
print(text_piece.prettify())

<div class="mw-parser-output">
 <div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">
  Wikimedia list article
 </div>
 <p class="mw-empty-elt">
 </p>
 <div class="thumb tright">
  <div class="thumbinner" style="width:222px;">
   <a class="image" href="/wiki/File:Citizen-Kane-Welles-Podium.jpg">
    <img alt="" class="thumbimage" data-file-height="1230" data-file-width="984" decoding="async" height="275" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/34/Citizen-Kane-Welles-Podium.jpg/220px-Citizen-Kane-Welles-Podium.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/34/Citizen-Kane-Welles-Podium.jpg/330px-Citizen-Kane-Welles-Podium.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/34/Citizen-Kane-Welles-Podium.jpg/440px-Citizen-Kane-Welles-Podium.jpg 2x" width="220"/>
   </a>
   <div class="thumbcaption">
    <div class="magnify">
     <a class="internal" href="/wiki/File:Citizen-Kane-Welles-Podium.jpg" title="Enlarge">
    

In [5]:
movies_list = []
# The <h2> heading that opens the "Genres or media" section
check_title = text_piece.find_next("h2", text="Genres or media")
genre_now = check_title
while True:
    # Each genre has its own <h3> heading, followed by a <ul> list of movies
    genre_next = genre_now.find_next_sibling("h3")
    # Stop when the headings run out or the next <h3> belongs to another <h2> section
    if genre_next is None or genre_next.find_previous_sibling("h2") != check_title:
        break
    genre_now = genre_next
    for film in genre_now.find_next_sibling("ul").find_all("li"):
        movies_list.append(film)

print("It was found {} movies on the web page\n".format(len(movies_list)))
movies_list[0].text

It was found 34 movies on the web page



"Mad Max 2 (1981) was voted the greatest action film of all time in a readers' poll by American magazine Rolling Stone in 2015.[30]"

In [6]:
#Print movies list
for movie in movies_list:
    print(movie.find("i").text)

Mad Max 2
Die Hard
Pinocchio
What's Opera, Doc?
Hedgehog in the Fog
Tale of Tales
Nausicaä of the Valley of the Wind
Laputa: Castle in the Sky
Toy Story
Die Hard
Some Like It Hot
Blazing Saddles
Monty Python's Life of Brian
This Is Spinal Tap
Superman
The Dark Knight
The Poseidon Adventure
Man with a Movie Camera
Hoop Dreams
Bowling for Columbine
The Exorcist
The Texas Chain Saw Massacre
West Side Story
The Rocky Horror Picture Show
Casablanca
Brief Encounter
2001: A Space Odyssey
Blade Runner
Serenity
Battleship Potemkin
Rocky
Saving Private Ryan
Stagecoach
Johnny Guitar
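The loop in cell `In [5]` leans on sibling navigation: `find_next_sibling()` and `find_previous_sibling()` search among tags that share the same parent. The sketch below replays the same idea on an invented fragment with two `<h2>` sections; the headings and film names in it are placeholders.

```python
from bs4 import BeautifulSoup

# Invented fragment that mimics the heading / list layout of the page
html = """
<div>
  <h2>Genres</h2>
  <h3>Action</h3>
  <ul><li>Film A</li><li>Film B</li></ul>
  <h3>Comedy</h3>
  <ul><li>Film C</li></ul>
  <h2>Other</h2>
  <h3>Misc</h3>
  <ul><li>Film D</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

section = soup.find("h2", string="Genres")
films_by_genre = {}
heading = section.find_next_sibling("h3")
# Walk the <h3> headings while they still belong to the "Genres" section
while heading is not None and heading.find_previous_sibling("h2") == section:
    films = heading.find_next_sibling("ul").find_all("li")
    films_by_genre[heading.text] = [li.text for li in films]
    heading = heading.find_next_sibling("h3")

print(films_by_genre)
```

The `Misc` heading is skipped because the nearest preceding `<h2>` is `Other`, not the section we started from.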


Let's extract data for "Mad Max 2":

In [7]:
# The relevant HTML tags are highlighted in the screenshot above.
# Note that the `find()` method returns only the first matching tag, and the
# `text` attribute gives the text inside a tag (`contents` gives its list of children).
# Pay attention to how we locate the necessary blocks.
movie_1 = movies_list[0]
title = movie_1.find('i').text
# Raw string: capture the first run of digits inside parentheses, e.g. "(1981)"
year = re.findall(r"^.*?\([^\d]*(\d+)[^\d]*\).*$", movie_1.text)[0]
description = movie_1.text

print('Title: {}\nYear: {}\nDescription: {}'.format(title, year, description))

Title: Mad Max 2
Year: 1981
Description: Mad Max 2 (1981) was voted the greatest action film of all time in a readers' poll by American magazine Rolling Stone in 2015.[30]
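The regular expression used for the year deserves a closer look: it skips to a pair of parentheses and captures the first run of digits inside it. A more compact pattern, shown below as an alternative sketch rather than the lesson's own code, does the same job with `re.search`. The helper name `extract_year` is hypothetical.

```python
import re

# Same idea as the pattern above: an opening parenthesis, optional non-digits,
# the digits we want, optional non-digits, a closing parenthesis
YEAR_PATTERN = re.compile(r"\(\D*(\d+)\D*\)")


def extract_year(description):
    """Returns the first number found in parentheses, or None if there is none."""
    match = YEAR_PATTERN.search(description)
    return match.group(1) if match else None


print(extract_year("Mad Max 2 (1981) was voted the greatest action film"))
print(extract_year("Die Hard (1988) was voted the best action film"))
print(extract_year("A description without a year"))
```

The `r` prefix makes the pattern a raw string, so backslashes such as `\D` reach the `re` module untouched instead of being read as string escapes. Returning `None` for a missing year avoids the `IndexError` that indexing an empty `re.findall` result with `[0]` would raise.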


> ### Exercise 1.1:

> Supplement the above list of fields with the genre of the movie, which is located inside the **`<h3>`** tag before the movies list. Call this variable `genre`.

> Later we will collect these fields for each movie.

In [8]:
# type your code here
genres=[]
genre_now=text_piece.find_next("h2",text="Genres or media")
check_title=text_piece.find_next("h2",text="Genres or media")
while True:
    genre_next = genre_now.find_next_sibling("h3")
    # Stop when there are no more <h3> headings or the next one is outside this section
    if genre_next is None or genre_next.find_previous_sibling("h2") != check_title:
        break
    genre_now = genre_next
    for film in genre_now.find_all("span", class_='mw-headline'):
        genres.append(film.text.encode('utf-8'))
        
print("It was found {} genres on the web page\n".format(len(genres)))
genres[0]
print (genres)
genre = []
for movie in movies_list:
    genre.append(movie.parent.find_previous_sibling("h3").text.encode('utf-8'))
print (genre)    

It was found 15 genres on the web page

[b'Action', b'Animation (shorts and features)', b'Christmas', b'Comedy', b'Comic/Superhero', b'Disaster', b'Documentary', b'Horror', b'Musical', b'Romance', b'Science fiction', b'Silent', b'Sports', b'War', b'Western']
[b'Action', b'Action', b'Animation (shorts and features)', b'Animation (shorts and features)', b'Animation (shorts and features)', b'Animation (shorts and features)', b'Animation (shorts and features)', b'Animation (shorts and features)', b'Animation (shorts and features)', b'Christmas', b'Comedy', b'Comedy', b'Comedy', b'Comedy', b'Comic/Superhero', b'Comic/Superhero', b'Disaster', b'Documentary', b'Documentary', b'Documentary', b'Horror', b'Horror', b'Musical', b'Musical', b'Romance', b'Romance', b'Science fiction', b'Science fiction', b'Science fiction', b'Silent', b'Sports', b'War', b'Western', b'Western']


In [33]:
from test_helper import Test

Test.assertEqualsHashed(genre, '97c89a4d6630adeb18fa12ba9976a31413fe293e', 'Incorrect value of "genre" variable',
                        "Exercise 1.1 is successful")

1 test failed. Incorrect value of "genre" variable


We can extract data about each movie using a `for` loop:

In [9]:
for movie in movies_list:
    title = movie.find('i').text
    year = re.findall(r"^.*?\([^\d]*(\d+)[^\d]*\).*$", movie.text)[0]
    description = movie.text
    print('Title: {}\nYear: {}\nDescription: {}\n'.format(title, year, description))

Title: Mad Max 2
Year: 1981
Description: Mad Max 2 (1981) was voted the greatest action film of all time in a readers' poll by American magazine Rolling Stone in 2015.[30]

Title: Die Hard
Year: 1988
Description: Die Hard (1988) was voted the best action film of all time with 21 votes in a 2014 poll of 50 directors, actors, critics, and experts conducted by Time Out New York.[31]

Title: Pinocchio
Year: 1940
Description: Pinocchio (1940) was voted the best animated movie ever made in a 2014 poll of animators, filmmakers, critics, journalists, and experts conducted by Time Out.[32][33]

Title: What's Opera, Doc?
Year: 1957
Description: What's Opera, Doc? (1957), a Bugs Bunny cartoon, was selected as the greatest animated short film of all time by 1,000 animation professionals in the book The 50 Greatest Cartoons.[34]

Title: Hedgehog in the Fog
Year: 1975
Description: Hedgehog in the Fog (1975) was ranked number 1 in a poll at the 2003 Laputa 

We also want to obtain more detailed information about each movie. The link to this data is placed in the **`<a>`** tag. We will scrape the links to the movie pages and then scrape some data from these pages.

Let's do this first for "Mad Max 2", then we will repeat the steps for all other movies.

In [10]:
# Note that the `href` attribute of the `<a>` tag contains the URL value.
# To obtain it, we access the attribute like a key of a Python dictionary
link_1 = "https://en.wikipedia.org" + movie_1.find("a")["href"]
print(link_1)
# Let's retrieve the HTML page
movie_page_1 = requests.get(link_1)
movie_bs_1 = BeautifulSoup(movie_page_1.text, 'html.parser')

https://en.wikipedia.org/wiki/Mad_Max_2
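The `href` values in the list are relative (`/wiki/Mad_Max_2`), which is why the site prefix is added by hand. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves a link against the page it came from and also copes with links that are already absolute. The second and third sample links below are illustrative.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/List_of_films_considered_the_best"

# Relative link, as found in the list items
relative = urljoin(page_url, "/wiki/Mad_Max_2")
# Absolute link: returned unchanged
absolute = urljoin(page_url, "https://example.org/other")
# Protocol-relative link: inherits the scheme of the page
protocol_relative = urljoin(page_url, "//upload.wikimedia.org/file.jpg")

print(relative)
print(absolute)
print(protocol_relative)
```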


The information we need is located in a **`<table>`** tag with the 'infobox vevent' class.

In [11]:
info_table_1=movie_bs_1.find("table","infobox vevent")
table_rows=info_table_1.find_all("tr")
for trow in table_rows:
    print(trow.text)

Mad Max 2
Theatrical release poster
Directed byGeorge Miller
Produced byByron Kennedy
Screenplay byTerry HayesGeorge MillerBrian Hannant
StarringMel Gibson
Narrated byHarold Baigent
Music byBrian May
CinematographyDean Semler
Edited byDavid StivenMichael BalsonTim Wellburn
Productioncompany Kennedy Miller Productions 
Distributed byWarner Bros.
Release date
24 December 1981 (1981-12-24) (Australia)

Running time96 minutes[1]
CountryAustralia
LanguageEnglish
BudgetA$4.5 million[2]
Box officeA$10.8 million (Australia)[3]US$23.7 million (Canada and United States)[4]
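In the output above, labels and values run together (`Directed byGeorge Miller`) because `.text` concatenates all nested strings without a separator. One way around it, sketched here on a hand-made row that imitates the infobox markup, is `get_text()` with its `separator` and `strip` arguments.

```python
from bs4 import BeautifulSoup

# Hand-made row imitating one line of the infobox
row_html = (
    "<table><tr><th>Directed by</th>"
    "<td><a href='#'>George Miller</a></td></tr></table>"
)
row = BeautifulSoup(row_html, "html.parser").find("tr")

fused = row.text                                      # label and value run together
separated = row.get_text(separator=": ", strip=True)  # strings joined with ": "

# Label and value can also be read separately
label = row.find("th").get_text(strip=True)
value = row.find("td").get_text(strip=True)

print(fused)
print(separated)
print({label: value})
```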


The data we need is located inside **`<tr>`** tags. We will locate it by the text in the **`<th>`** tag, but sometimes the text sits in a **`<div>`** inside the **`<th>`**. Also, the data we retrieve can be a single line of text or a list. Given this, we need a function to scrape these lines.

In [12]:
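def get_data(data_table, text):
    # Find the <td> that follows the <th> label; the label text may sit
    # directly in the <th> or in a <div> nested inside it
    try:
        td_data = data_table.find("th", text=text).find_next_sibling("td")
    except AttributeError:  # find() returned None: no <th> with this text
        try:
            td_data = data_table.find("div", text=text).parent.find_next_sibling("td")
        except AttributeError:  # label not found at all (or there is no table)
            return None
    if td_data is None:  # label found, but no <td> next to it
        return None
    # A value given as a list is returned as a Python list, anything else as text
    items = td_data.find_all("li")
    if items:
        return [li.text for li in items]
    return td_data.text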
    

directed=get_data(info_table_1,"Directed by")
produced=get_data(info_table_1,"Produced by")
screenplay=get_data(info_table_1,"Screenplay by")
starring=get_data(info_table_1,"Starring")
running_time=get_data(info_table_1,"Running time")
country=get_data(info_table_1,"Country")
release_date=get_data(info_table_1,"Release date")

print(directed)
print(produced)
print(screenplay)
print(starring)
print(running_time)
print(country)

George Miller
Byron Kennedy
['Terry Hayes', 'George Miller', 'Brian Hannant']
Mel Gibson
96 minutes[1]
Australia


Now let's combine all the steps into one block:

In [13]:
# Let's write scraped data into a list
# Beware it will take some time.
movies_all_data = []
for movie in movies_list:
    
    title = movie.find('i').text
    year = re.findall(r"^.*?\([^\d]*(\d+)[^\d]*\).*$", movie.text)[0]
    description = movie.text
    genre = movie.parent.find_previous_sibling("h3").text
    
    link = "https://en.wikipedia.org"+movie.find("a")["href"]

    movie_page = requests.get(link)
    movie_bs = BeautifulSoup(movie_page.text, 'html.parser')
    info_table = movie_bs.find("table","infobox vevent")

    directed = get_data(info_table,"Directed by")
    produced = get_data(info_table,"Produced by")
    screenplay = get_data(info_table,"Screenplay by")
    starring = get_data(info_table,"Starring")
    running_time = get_data(info_table,"Running time")
    country = get_data(info_table,"Country")
    
    # Let's collect each movie data into a dictionary
    movies_all_data.append({
        'title': title,
        'year': year,
        'description': description,
        'genre': genre,
        'link': link,
        'director': directed,
        'producer': produced,
        'screenwriters': screenplay,
        'starring': starring,
        'running_time': running_time,
        'country': country,
    })
    
print(len(movies_all_data))
# Display collected data for the first movie
movies_all_data[0]

34


{'title': 'Mad Max 2',
 'year': '1981',
 'description': "Mad Max 2 (1981) was voted the greatest action film of all time in a readers' poll by American magazine Rolling Stone in 2015.[30]",
 'genre': 'Action',
 'link': 'https://en.wikipedia.org/wiki/Mad_Max_2',
 'director': 'George Miller',
 'producer': 'Byron Kennedy',
 'screenwriters': ['Terry Hayes', 'George Miller', 'Brian Hannant'],
 'starring': 'Mel Gibson',
 'running_time': '96 minutes[1]',
 'country': 'Australia'}
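Once the data is in a list of dictionaries it can be saved, which was the stated goal of scraping. The sketch below writes one record to CSV with the standard library; because some fields hold lists, they are joined into one string first. The `io.StringIO` buffer, the `flatten` helper and the `"; "` separator are choices made for this example, and the record is a shortened copy of the output above.

```python
import csv
import io

# One record in the same shape as the items of `movies_all_data` (fields shortened)
records = [
    {
        "title": "Mad Max 2",
        "year": "1981",
        "director": "George Miller",
        "screenwriters": ["Terry Hayes", "George Miller", "Brian Hannant"],
    },
]


def flatten(value, separator="; "):
    """Joins list values into one string; other values pass through unchanged."""
    if isinstance(value, list):
        return separator.join(value)
    return value


buffer = io.StringIO()  # swap for open("movies.csv", "w", newline="") to write a file
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()), lineterminator="\n")
writer.writeheader()
for record in records:
    writer.writerow({key: flatten(value) for key, value in record.items()})

csv_text = buffer.getvalue()
print(csv_text)
```

Joining lists loses a little structure; if the nested lists matter, `json.dump` keeps them as they are.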

In [14]:
movies_all_data

[{'title': 'Mad Max 2',
  'year': '1981',
  'description': "Mad Max 2 (1981) was voted the greatest action film of all time in a readers' poll by American magazine Rolling Stone in 2015.[30]",
  'genre': 'Action',
  'link': 'https://en.wikipedia.org/wiki/Mad_Max_2',
  'director': 'George Miller',
  'producer': 'Byron Kennedy',
  'screenwriters': ['Terry Hayes', 'George Miller', 'Brian Hannant'],
  'starring': 'Mel Gibson',
  'running_time': '96 minutes[1]',
  'country': 'Australia'},
 {'title': 'Die Hard',
  'year': '1988',
  'description': 'Die Hard (1988) was voted the best action film of all time with 21 votes in a 2014 poll of 50 directors, actors, critics, and experts conducted by Time Out New York.[31]',
  'genre': 'Action',
  'link': 'https://en.wikipedia.org/wiki/Die_Hard',
  'director': 'John McTiernan',
  'producer': ['Lawrence Gordon', 'Joel Silver'],
  'screenwriters': ['Jeb Stuart', 'Steven E. de Souza'],
  'starring': ['Bruce Willis',
   'Alan Rickman',
   'Alexander Godu

> ### Exercise 1.2:

> Some fields of the collected data are poorly formatted, e.g. a description has reference numbers in brackets that we don't need, and the running time contains "minutes". We can tidy up the field contents slightly:

> 1. Replace all references (square brackets with a number inside, e.g. [1]) with an empty string '' in all fields (use `re.sub` for the best results). Write a function (call it `remove_ref`) that takes a string and returns the cleaned output as described above. We will check this function. Note that some fields contain lists of values, so this function should process lists of strings too.
> 2. Keep only the first number of minutes found, without seconds, in the `running_time` field (in general this field may contain several values). Write a function (call it `get_num`) that takes a string or a list of strings and returns the number of minutes without seconds. We will check this function as well.

> Apply these changes to the following code. The result variable is `movies_all_data_fixed` and all items in it must be strings.

In [15]:
# type your code here
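def remove_ref(value):
    # Strip reference markers such as [1]; lists are cleaned item by item
    if isinstance(value, list):
        return [remove_ref(item) for item in value]
    return re.sub(r"\[\d+\]", "", value)
def get_num(value):
    # Keep the first number found, i.e. the minutes of the first running time
    if isinstance(value, list):
        value = value[0]
    return re.search(r"\d+", value).group()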
movies_all_data_fixed = []
for movie in movies_list:
    
    title = movie.find('i').text
    year = re.findall(r"^.*?\([^\d]*(\d+)[^\d]*\).*$", movie.text)[0]
    description = remove_ref(movie.text)
    genre = movie.parent.find_previous_sibling("h3").text
    
    link = "https://en.wikipedia.org"+movie.find("a")["href"]

    movie_page = requests.get(link)
    movie_bs = BeautifulSoup(movie_page.text, 'html.parser')
    info_table = movie_bs.find("table","infobox vevent")

    directed = get_data(info_table,"Directed by")
    produced = get_data(info_table,"Produced by")
    screenplay = get_data(info_table,"Screenplay by")
    starring = get_data(info_table,"Starring")
    running_time = remove_ref(get_num(get_data(info_table,"Running time")))
    country = get_data(info_table,"Country")
    
    # Let's collect each movie data into a dictionary
    movies_all_data_fixed.append({
        'title': title,
        'year': year,
        'description': description,
        'genre': genre,
        'link': link,
        'director': directed,
        'producer': produced,
        'screenwriters': screenplay,
        'starring': starring,
        'running_time': running_time,
        'country': country,
    })
    
print(len(movies_all_data_fixed))
# Display collected data for the first movie
print (movies_all_data_fixed[0])
to_check = movies_all_data_fixed

34
{'title': 'Mad Max 2', 'year': '1981', 'description': "Mad Max 2 (1981) was voted the greatest action film of all time in a readers' poll by American magazine Rolling Stone in 2015.", 'genre': 'Action', 'link': 'https://en.wikipedia.org/wiki/Mad_Max_2', 'director': 'George Miller', 'producer': 'Byron Kennedy', 'screenwriters': ['Terry Hayes', 'George Miller', 'Brian Hannant'], 'starring': 'Mel Gibson', 'running_time': '96', 'country': 'Australia'}


In [41]:
from test_helper import Test
hashed = Test._hash
hashed(to_check)

'f0c6d146a03941cf78a0143340aa0d4c2e886922'

In [43]:
Test.assertEqualsHashed(to_check, '41cb73247c48bf683dda8b7801413ab420fb805d', 'Incorrect data in "movies_all_data_fixed" variable',
                        "Exercise 1.2 is successful")

1 test failed. Incorrect data in "movies_all_data_fixed" variable


<center><h3>Presented by <a target="_blank" rel="noopener noreferrer nofollow" href="http://datascience-school.com">datascience-school.com</a></h3></center>
