# Web scraping with Python

https://pieriantraining.com/how-to-perform-web-scraping-with-python/?utm_source=udemy&utm_medium=referral&utm_campaign=site_live_announcement

In [8]:
import requests
import bs4

In [5]:
res = requests.get('http://www.example.com')

In [6]:
type(res)

requests.models.Response
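
Before reading the body, it can help to confirm that the request actually succeeded. The sketch below wraps the call in a small helper; the name `fetch_html`, the `timeout=10` default, and the injectable `get` parameter are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the response body as text, raising on HTTP errors.

    `fetch_html`, the timeout value, and the `get` parameter are
    hypothetical choices made for this sketch.
    """
    res = get(url, timeout=timeout)
    res.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return res.text


# Usage (needs network access):
# html = fetch_html('http://www.example.com')
```

Passing `get` as a parameter is only there so the helper can be exercised without a network connection.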

In [7]:
res.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

In [9]:
soup = bs4.BeautifulSoup(res.text,
                         'lxml')

In [10]:
soup

<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples

In [11]:
soup.select('title')

[<title>Example Domain</title>]

In [13]:
title_tag = soup.select('title')

In [15]:
title_tag[0]

<title>Example Domain</title>

In [16]:
title_tag[0].getText()

'Example Domain'
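
When only the first match is needed, `select_one` avoids the `[0]` indexing step. A minimal sketch on an inline HTML string (the string is a stand-in for `res.text`, and `html.parser` is used here only so the example does not depend on `lxml`):

```python
import bs4

# Inline stand-in for res.text
html = "<html><head><title>Example Domain</title></head><body></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

title_tag = soup.select_one("title")  # first match, or None if nothing matches
title_text = title_tag.get_text() if title_tag is not None else None
print(title_text)  # Example Domain

missing = soup.select_one("h2")  # no <h2> in this document
print(missing)  # None
```

`get_text()` is the snake_case spelling of the `getText()` call used above.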

## An example with Wikipedia

In [17]:
# First, make the request
res = requests.get('https://en.wikipedia.org/wiki/Enigma_machine')

In [18]:
# Create a soup from the response
soup = bs4.BeautifulSoup(res.text,
                         'lxml')

In [21]:
# soup.select('div') - all elements with the <div> tag
# soup.select('#some_id') - the HTML element whose id attribute is some_id
# soup.select('.notice') - all HTML elements with the CSS class name notice
# soup.select('div span') - any <span> elements nested anywhere inside a <div> element
# soup.select('div > span') - any <span> elements directly inside a <div> element, with no other element in between
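
The selector patterns listed above can be tried on a small document. This is a sketch on made-up HTML (the markup is an assumption chosen so that each selector matches something different):

```python
import bs4

# Hypothetical markup, built so the descendant and child selectors differ
html = """
<div id="some_id">
  <span class="notice">direct child</span>
  <p><span>nested deeper</span></p>
</div>
<span>outside any div</span>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

selectors = ["div", "#some_id", ".notice", "div span", "div > span"]
counts = {sel: len(soup.select(sel)) for sel in selectors}

for sel, n in counts.items():
    print(f"{sel!r}: {n} match(es)")
```

Here `div span` matches both spans inside the `<div>`, while `div > span` matches only the one that is an immediate child.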

In [23]:
# Contents
soup.select('.toctext')

[<span class="toctext">History</span>,
 <span class="toctext">Breaking Enigma</span>,
 <span class="toctext">Design</span>,
 <span class="toctext">Electrical pathway</span>,
 <span class="toctext">Rotors</span>,
 <span class="toctext">Stepping</span>,
 <span class="toctext">Turnover</span>,
 <span class="toctext">Entry wheel</span>,
 <span class="toctext">Reflector</span>,
 <span class="toctext">Plugboard</span>,
 <span class="toctext">Accessories</span>,
 <span class="toctext"><i>Schreibmax</i></span>,
 <span class="toctext"><i>Fernlesegerät</i></span>,
 <span class="toctext"><i>Uhr</i></span>,
 <span class="toctext">Mathematical analysis</span>,
 <span class="toctext">Operation</span>,
 <span class="toctext">Basic operation</span>,
 <span class="toctext">Details</span>,
 <span class="toctext">Indicator</span>,
 <span class="toctext">Additional details</span>,
 <span class="toctext">Example enciphering process</span>,
 <span class="toctext">Models</span>,
 <span class="toctext">Commer

In [25]:
for item in soup.select('.toctext'):
    print(item.text)

History
Breaking Enigma
Design
Electrical pathway
Rotors
Stepping
Turnover
Entry wheel
Reflector
Plugboard
Accessories
Schreibmax
Fernlesegerät
Uhr
Mathematical analysis
Operation
Basic operation
Details
Indicator
Additional details
Example enciphering process
Models
Commercial Enigma
Enigma Handelsmaschine (1923)
Die schreibende Enigma (1924)
Die Glühlampenmaschine, Enigma A (1924)
Enigma B (1924)
Enigma C (1926)
Enigma D (1927)
"Navy Cipher D"
Enigma H (1929)
Enigma K
Military Enigma
Funkschlüssel C
Enigma G (1928–1930)
Wehrmacht Enigma I (1930–1938)
M3 (1934)
Two extra rotors (1938)
M4 (1942)
Surviving machines
Derivatives
Simulators
See also
Explanatory notes
References
Citations
General and cited references
Further reading
External links
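
If the headings are needed as data rather than printed output, a list comprehension collects them. A sketch on two entries copied from the output above (the inline HTML stands in for the downloaded page):

```python
import bs4

# Two entries taken from the .toctext output shown earlier
html = """
<span class="toctext">History</span>
<span class="toctext"><i>Schreibmax</i></span>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# get_text() drops nested tags such as <i>, leaving only the text
headings = [item.get_text(strip=True) for item in soup.select(".toctext")]
print(headings)  # ['History', 'Schreibmax']
```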


## Another example
Grab an image from a Wikipedia article.

In [26]:
res = requests.get('https://en.wikipedia.org/wiki/Extreme_ironing')

In [27]:
soup = bs4.BeautifulSoup(res.text, 
                         'lxml')

In [28]:
image_info = soup.select('.thumbimage')

In [29]:
image_info

[<img alt="" class="thumbimage" data-file-height="1280" data-file-width="960" decoding="async" height="293" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Extermeironingrivelin.jpg/220px-Extermeironingrivelin.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Extermeironingrivelin.jpg/330px-Extermeironingrivelin.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Extermeironingrivelin.jpg/440px-Extermeironingrivelin.jpg 2x" width="220"/>,
 <img alt="" class="thumbimage" data-file-height="371" data-file-width="494" decoding="async" height="165" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/37/Highlander411_extreme_ironing.jpg/220px-Highlander411_extreme_ironing.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/37/Highlander411_extreme_ironing.jpg/330px-Highlander411_extreme_ironing.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/37/Highlander411_extreme_ironing.jpg/440px-Highlander411_extreme_ironing.jpg 2x" width="220"/>]

In [30]:
len(image_info)

2

In [31]:
image = image_info[0]

In [33]:
type(image)

bs4.element.Tag

In [34]:
image['src']

'//upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Extermeironingrivelin.jpg/220px-Extermeironingrivelin.jpg'

<img src='//upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Extermeironingrivelin.jpg/220px-Extermeironingrivelin.jpg'>
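
The `src` value starts with `//`, so it has no scheme of its own. Rather than typing the scheme by hand, one option is to resolve it against the page URL with the standard library's `urljoin`. A sketch (the variable names are illustrative):

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Extreme_ironing"
src = "//upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Extermeironingrivelin.jpg/220px-Extermeironingrivelin.jpg"

# A scheme-less '//host/path' reference inherits the scheme of the base URL
image_url = urljoin(page_url, src)
print(image_url)
```

Because the page was fetched over `https`, the resolved URL also uses `https`.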

In [42]:
image_link = requests.get('http://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Extermeironingrivelin.jpg/220px-Extermeironingrivelin.jpg')

Let’s write this to a file. Note the ‘wb’ mode, which opens the file for writing in binary.

In [43]:
f = open('my_new_file_name.jpg',
         'wb')

In [44]:
f.write(image_link.content)

11033

In [45]:
f.close()

<img src='my_new_file_name.jpg'>
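
The open/write/close sequence above can also be written with a `with` block, which closes the file even if the write raises. A minimal sketch; the placeholder bytes and the temporary directory are assumptions standing in for `image_link.content` and the working directory:

```python
import os
import tempfile

# Placeholder bytes standing in for image_link.content (not a real image)
image_bytes = b"\xff\xd8\xff\xe0 placeholder bytes"

# Write into a temporary directory so the sketch leaves the working directory alone
path = os.path.join(tempfile.mkdtemp(), "my_new_file_name.jpg")

with open(path, "wb") as f:          # 'wb' = write, binary
    written = f.write(image_bytes)   # returns the number of bytes written

print(written == len(image_bytes))
```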

## Example project

Let’s show a more realistic example of scraping a full site. The website http://books.toscrape.com/index.html is designed specifically for scraping practice. Let’s try to get the title of every book that has a two-star rating and end up with a Python list of all those titles.

We will do the following:

1. Figure out the URL structure to go through every page
2. Scrape every page in the catalogue
3. Figure out what tag/class represents the star rating
4. Filter by that star rating using an if statement
5. Store the results to a list

In [47]:
# http://books.toscrape.com/catalogue/page-1.html
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'
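
Filling the `{}` placeholder with a page number yields the URL for each catalogue page. A short sketch showing the first three:

```python
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

# str.format accepts integers as well as strings
page_urls = [base_url.format(n) for n in range(1, 4)]

for url in page_urls:
    print(url)
```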

In [48]:
res = requests.get(base_url.format('1'))

In [52]:
soup = bs4.BeautifulSoup(res.text,
                         'lxml')

In [54]:
soup.select('.product_pod')
# Now we can see that each book has the product_pod class. We can select any tag with this class, and then further reduce it by its rating.

[<article class="product_pod">
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 <article class="product_pod">
 <div class="image_container">
 <a href="tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="../media/cach

In [55]:
products = soup.select('.product_pod')

In [58]:
example = products[0]
example

<article class="product_pod">
<div class="image_container">
<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>
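
The `Â£` in the price above is what a UTF-8 pound sign looks like when its bytes are decoded as a single-byte encoding such as ISO-8859-1. The sketch below reproduces the effect with plain strings; that this is what happened to `res.text` here is an assumption based on the output, not something the notebook confirms.

```python
price = "£51.77"

raw_bytes = price.encode("utf-8")           # '£' becomes the two bytes C2 A3
garbled = raw_bytes.decode("iso-8859-1")    # each byte read as its own character
restored = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled)   # Â£51.77
print(restored)  # £51.77
```

If that is the cause, two common workarounds are setting `res.encoding = 'utf-8'` before reading `res.text`, or passing the raw bytes in `res.content` to `BeautifulSoup` and letting it detect the encoding.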

In [59]:
type(example)

bs4.element.Tag

In [60]:
example.attrs

{'class': ['product_pod']}

In [61]:
list(example.children)

['\n',
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>,
 '\n',
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 '\n',
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>,
 '\n',
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>,
 '\n']

In [62]:
example.select('.star-rating.Three')

[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>]

In [63]:
example.select('.star-rating.Two')

[]

In [68]:
len(example.select('.star-rating.Two'))

0

In [65]:
example.select('a')

[<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>,
 <a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]

In [66]:
example.select('a')[1]

<a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>

In [67]:
example.select('a')[1]['title']

'A Light in the Attic'
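
An alternative to testing one rating selector at a time is to read the rating word straight from the tag's class list. A sketch on a trimmed copy of the first product; the `RATING_WORDS` mapping is an assumption, since only the `Three` and `Two` class names appear in the output above:

```python
import bs4

# Trimmed copy of the first product_pod shown above
html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
book = soup.select_one(".product_pod")

# Assumed mapping from class name to number of stars
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

rating_classes = book.select_one(".star-rating")["class"]  # a list of class names
rating = next(RATING_WORDS[c] for c in rating_classes if c in RATING_WORDS)
title = book.select_one("h3 a")["title"]

print(rating, title)
```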

Okay, let’s give it a shot by combining all the ideas we’ve talked about! This should take about 20-60 seconds to run. Be aware that a firewall may prevent this script from running. If you get a no-response error, try adding a pause between requests with time.sleep(1).

In [69]:
two_star_title = []

for n in range(1, 51):
    
    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    
    soup = bs4.BeautifulSoup(res.text,'lxml')
    books = soup.select('.product_pod')
    
    for book in books:
        if len(book.select('.star-rating.Two')) != 0:
            two_star_title.append(book.select('a')[1]['title'])
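
The loop above can also be split into a parsing function and a fetching loop, which makes the parsing testable without a network connection and gives a natural place for the `time.sleep` pause. This is a sketch, not the notebook's own code: the names `two_star_titles` and `scrape_all`, the `fetch` callable, and the `delay` default are all hypothetical.

```python
import time

import bs4


def two_star_titles(html):
    """Return the titles of all two-star books found in one page of HTML."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return [
        book.select("a")[1]["title"]       # second <a> carries the full title
        for book in soup.select(".product_pod")
        if book.select(".star-rating.Two")  # empty list means not two stars
    ]


def scrape_all(fetch, base_url, pages=range(1, 51), delay=1.0):
    """Collect two-star titles from every page, pausing between requests.

    `fetch` is any callable that takes a URL and returns the page HTML.
    """
    titles = []
    for n in pages:
        titles.extend(two_star_titles(fetch(base_url.format(n))))
        time.sleep(delay)  # pause so the server is not hit in a tight loop
    return titles


# Usage (needs network access and the requests import from earlier):
# titles = scrape_all(lambda url: requests.get(url, timeout=10).text, base_url)
```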

In [70]:
two_star_title

['Starving Hearts (Triangular Trade Trilogy, #1)',
 'Libertarianism for Beginners',
 "It's Only the Himalayas",
 'How Music Works',
 'Maude (1883-1993):She Grew Up with the country',
 "You can't bury them all: Poems",
 'Reasons to Stay Alive',
 'Without Borders (Wanderlove #1)',
 'Soul Reader',
 'Security',
 'Saga, Volume 5 (Saga (Collected Editions) #5)',
 'Reskilling America: Learning to Labor in the Twenty-First Century',
 'Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics',
 'Obsidian (Lux #1)',
 'My Paris Kitchen: Recipes and Stories',
 'Masks and Shadows',
 'Lumberjanes, Vol. 2: Friendship to the Max (Lumberjanes #5-8)',
 'Lumberjanes Vol. 3: A Terrible Plan (Lumberjanes #9-12)',
 'Judo: Seven Steps to Black Belt (an Introductory Guide for Beginners)',
 'I Hate Fairyland, Vol. 1: Madly Ever After (I Hate Fairyland (Compilations) #1-5)',
 'Giant Days, Vol. 2 (Giant Day