<h1>Web Scraping</h1>
<li>Web scraping is a general term for techniques that automate the gathering of data from a website</li>
<li>In this section we will learn how to use Python to conduct web scraping tasks, such as downloading images or information off a website</li>
<li>In order to web scrape with Python we need to understand the basic concepts of how a website works</li>
<li>When a browser loads a website, the user gets to see what is known as the "front-end" of the website</li>
<li>Main things we need to understand</li>
<li>> Rules of Web Scraping</li>
<li>> Limitations of Web Scraping</li>
<li>> Basic HTML and CSS</li>
<li>Rules:</li>
<li>> Always try to get permission before scraping!</li>
<li>> If you make too many scraping attempts or requests, your IP address could get blocked!</li>
<li>> Some sites automatically block scraping software</li>
<li>Limitations of Web Scraping</li>
<li>> In general every website is unique, which means every web scraping script is unique</li>
<li>> A slight change or update to a website may completely break your web scraping script</li>
<li>When you view a website, the browser does not show you all the source code behind it; it renders the HTML, CSS, and JavaScript that the website sends to your browser</li>
<li>HTML is used to create the basic structure and content of a webpage</li>
<li>CSS is used for the design and style of a webpage, such as where elements are placed and how they look</li>
<li>JavaScript is used to define the interactive elements of a webpage</li>
<li>For effective basic web scraping we only need to have a basic understanding of HTML and CSS</li>
<li>Python can view these HTML and CSS elements programmatically, and then extract information from the website</li>
<li>To web scrape with Python we can use the BeautifulSoup and requests libraries</li>
<li>These are third-party libraries that are not part of Python's standard library, so you need to install them with either conda or pip at your command line</li>
<li>Directly at your command line use:</li>
<li>> pip install requests</li>
<li>> pip install lxml</li>
<li>> pip install bs4</li>
<li>Or for Anaconda distributions, use conda install instead of pip install</li>
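<li>The rules listed earlier can be partly checked in code before any request is sent. The sketch below is a minimal illustration using Python's standard urllib.robotparser module; the robots.txt content, the "my-scraper" user agent name, and the URLs are hypothetical placeholders, and a real script would read the target site's own robots.txt instead</li>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether our (hypothetical) scraper may fetch each URL
allowed = parser.can_fetch("my-scraper", "https://www.example.com/index.html")
blocked = parser.can_fetch("my-scraper", "https://www.example.com/private/data.html")

# Seconds the site asks crawlers to wait between requests (None if unspecified)
delay = parser.crawl_delay("my-scraper")

print(allowed, blocked, delay)
```

<li>Pausing between requests, for example with time.sleep(), is one simple way to lower the risk of an IP block, although a robots.txt file is not a substitute for asking permission</li>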

<h2>Setting Up for Web Scraping</h2>
<li>Install the necessary libraries</li>
<li>Explore how to inspect elements and view source of a webpage</li>
<li>Note: We suggest using Chrome so you can follow along exactly as we do, but these tools are available in all major browsers</li>

In [3]:
import requests

In [4]:
import bs4

<h2>Grabbing a Page Title</h2>

In [9]:
import requests

In [10]:
result = requests.get("http://www.example.com")

In [11]:
type(result)

requests.models.Response

In [12]:
result.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <
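<li>A request can fail or return an error status, so it may help to check the response before parsing it. The following is a minimal sketch, assuming a hypothetical helper named fetch_html and a 10-second timeout chosen only for illustration</li>

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an exception for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        # Covers connection errors, timeouts, invalid URLs, and bad statuses
        print(f"Request failed: {error}")
        return None
    return response.text
```

<li>Calling fetch_html("http://www.example.com") should return the same text as result.text above when the request succeeds</li>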

In [13]:
import bs4

In [14]:
soup = bs4.BeautifulSoup(result.text, "lxml")

In [15]:
soup

<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples

In [16]:
soup.select('title')

[<title>Example Domain</title>]

In [17]:
soup.select('p')

[<p>This domain is for use in illustrative examples in documents. You may use this
     domain in literature without prior coordination or asking for permission.</p>,
 <p><a href="https://www.iana.org/domains/example">More information...</a></p>]

In [18]:
soup.select('p')[0]

<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>

In [20]:
soup.select('title')[0].getText()

'Example Domain'

In [21]:
site_paragraphs = soup.select("p")

In [24]:
type(site_paragraphs[0])

bs4.element.Tag

In [25]:
site_paragraphs[0].getText()

'This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.'
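<li>Indexing with [0] raises an IndexError when a selector matches nothing, so a small guard can be useful. The sketch below assumes a hypothetical helper named first_text and a shortened copy of the example page, parsed with Python's built-in "html.parser" so it runs without a network connection</li>

```python
import bs4

# Shortened, hand-written copy of the example page (an assumption for the demo)
SAMPLE_HTML = """
<html>
<head><title>Example Domain</title></head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
"""


def first_text(soup, selector):
    """Return the text of the first element matching selector, or None."""
    matches = soup.select(selector)
    if not matches:
        return None
    # get_text() is the same method as the getText() call used above
    return matches[0].get_text(strip=True)


soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
title = first_text(soup, "title")
missing = first_text(soup, "table")  # the sample page has no <table>

print(title, missing)
```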

<h2>Grabbing All Elements of a Class</h2>
<li>A big part of web scraping with the BeautifulSoup library is figuring out what string syntax to pass into the soup.select() method</li>
<li>The syntax follows CSS selectors, so it makes a lot of sense if you know CSS: a bare name selects by tag, a leading # selects by id, and a leading . selects by class</li>
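<li>Some common selector strings can be tried on a small hand-written snippet. The HTML below, including the "intro" id and the "notice" class, is made up for illustration and is not taken from any real page</li>

```python
import bs4

# Hypothetical HTML used only to demonstrate selector syntax
html = """
<div id="intro" class="notice">
  <span>direct child</span>
  <p><span>nested deeper</span></p>
</div>
<p class="notice">second notice</p>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

by_tag = soup.select("div")            # all <div> elements
by_id = soup.select("#intro")          # the element with id="intro"
by_class = soup.select(".notice")      # all elements with class "notice"
descendants = soup.select("div span")  # any <span> anywhere inside a <div>
children = soup.select("div > span")   # <span> directly inside a <div>

for name, found in [("div", by_tag), ("#intro", by_id), (".notice", by_class),
                    ("div span", descendants), ("div > span", children)]:
    print(name, len(found))
```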

In [26]:
res = requests.get('https://es.wikipedia.org/wiki/Grace_Murray_Hopper')

In [28]:
soup = bs4.BeautifulSoup(res.text, 'lxml')

In [30]:
#soup

In [32]:
soup.select('.vector-toc-text')

[<div class="vector-toc-text">Inicio</div>,
 <div class="vector-toc-text">
 <span class="vector-toc-numb">1</span>Biografía</div>,
 <div class="vector-toc-text">
 <span class="vector-toc-numb">1.1</span>Estudios</div>,
 <div class="vector-toc-text">
 <span class="vector-toc-numb">1.2</span>Ingreso en la Armada</div>,
 <div class="vector-toc-text">
 <span class="vector-toc-numb">1.3</span>UNIVAC</div>,
 <div class="vector-toc-text">
 <span class="vector-toc-numb">1.4</span>Cobol</div>,
 <div class="vector-toc-text">
 <span class="vector-toc-numb">1.5</span>Reingreso en la Armada</div>,
 <div class="vector-toc-text">
 <span class="vector-toc-numb">2</span>Curiosidades</div>,
 <div class="vector-toc-text">
 <span class="vector-toc-numb">3</span>Premios y reconocimientos</div>,
 <div class="vector-toc-text">
 <span class="vector-toc-numb">4</span>Fechas de rango</div>,
 <div class="vector-toc-text">
 <span class="vector-toc-numb">5</span>Legado</div>,
 <div class="vector-toc-text">
 <span 

In [33]:
soup.select('.vector-toc-text')[0]

<div class="vector-toc-text">Inicio</div>

In [37]:
first_item = soup.select('.vector-toc-text')[0]

In [39]:
first_item.text

'Inicio'

In [40]:
for item in soup.select('.vector-toc-text'):
    print(item.text)

Inicio

1Biografía

1.1Estudios

1.2Ingreso en la Armada

1.3UNIVAC

1.4Cobol

1.5Reingreso en la Armada

2Curiosidades

3Premios y reconocimientos

4Fechas de rango

5Legado

6Véase también

7Notas y referencias

8Enlaces externos
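<li>In the printed output the section number runs straight into the heading text (for example "1Biografía") because item.text joins the text of the nested span with the rest of the div. One possible way to separate them is sketched below on a hand-copied fragment of the markup shown above; the variable names are illustrative</li>

```python
import bs4

# Fragment copied from the output above so the example runs offline
toc_html = """
<div class="vector-toc-text">Inicio</div>
<div class="vector-toc-text">
<span class="vector-toc-numb">1</span>Biografía</div>
<div class="vector-toc-text">
<span class="vector-toc-numb">1.1</span>Estudios</div>
"""
soup = bs4.BeautifulSoup(toc_html, "html.parser")

entries = []
for item in soup.select(".vector-toc-text"):
    # The number lives in its own span; some entries have no number at all
    number_tag = item.select_one(".vector-toc-numb")
    number = number_tag.get_text(strip=True) if number_tag else ""
    # stripped_strings yields each text fragment with whitespace removed;
    # the heading text is the last fragment in every entry of this fragment
    title = list(item.stripped_strings)[-1]
    entries.append((number, title))

for number, title in entries:
    print(number, title)
```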


<h2>Grabbing an Image</h2>
<li>Now that we understand how to grab text information based on tags and element names, let's explore how to grab images from a website</li>
<li>Images on a website typically have their own URL (often ending in .jpg or .png)</li>
<li>Beautiful Soup can scan a page, locate the img tags and grab these URLs</li>
<li>Then we can download the images at those URLs and write them to the computer</li>
<li>Note: You should always check copyright permission before downloading and using an image from a website</li>
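<li>The idea of locating img tags and collecting their URLs can be sketched on a small made-up snippet. The HTML and the upload.example.org address below are hypothetical; the point is only that each img tag exposes its attributes like a dictionary</li>

```python
import bs4

# Hypothetical page fragment with three <img> tags, one lacking a src
page_html = """
<div>
  <img class="mw-file-element" src="//upload.example.org/photo.jpg" width="300"/>
  <img alt="icon without source"/>
  <img src="/static/logo.png"/>
</div>
"""
soup = bs4.BeautifulSoup(page_html, "html.parser")

# Keep only tags that actually have a src attribute to avoid a KeyError
image_urls = [img["src"] for img in soup.select("img") if img.has_attr("src")]

print(image_urls)
```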

In [41]:
res = requests.get('https://es.wikipedia.org/wiki/Deep_Blue_(computadora)')

In [42]:
soup = bs4.BeautifulSoup(res.text, 'lxml')

In [46]:
soup.select('.mw-file-element')

[<img class="mw-file-element" data-file-height="601" data-file-width="400" decoding="async" height="451" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/300px-Deep_Blue.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/b/be/Deep_Blue.jpg 1.5x" width="300"/>,
 <img alt="Bandera de Estados Unidos" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/20px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/30px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/40px-Flag_of_the_United_States.svg.png 2x" width="20"/>,
 <img alt="Wd" class="mw-file-element" data-file-height="590" data-file-width="1050" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Wikid

In [48]:
computer = soup.select('.mw-file-element')[0]

In [49]:
computer

<img class="mw-file-element" data-file-height="601" data-file-width="400" decoding="async" height="451" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/300px-Deep_Blue.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/b/be/Deep_Blue.jpg 1.5x" width="300"/>

In [51]:
computer['src']

'//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/300px-Deep_Blue.jpg'

<img src="//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/300px-Deep_Blue.jpg">
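<li>The src value above starts with // and has no scheme, so it cannot be passed to requests.get() as it is. Instead of typing the https: prefix by hand, one option is urljoin from the standard library, which resolves the value against the URL of the page it came from. The /static/logo.png path below is a hypothetical example of a site-relative link</li>

```python
from urllib.parse import urljoin

# URL of the page that was scraped and the src value taken from the img tag
page_url = "https://es.wikipedia.org/wiki/Deep_Blue_(computadora)"
src = "//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/300px-Deep_Blue.jpg"

# A scheme-relative link inherits the scheme (https) of the page
image_url = urljoin(page_url, src)

# A site-relative link (hypothetical path) inherits scheme and host
relative_url = urljoin(page_url, "/static/logo.png")

print(image_url)
print(relative_url)
```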

In [53]:
image_link = requests.get('https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/300px-Deep_Blue.jpg')

In [54]:
image_link.content

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00H\x00H\x00\x00\xff\xfe\x00CFile source: http://commons.wikimedia.org/wiki/File:Deep_Blue.jpg\xff\xe2\x02@ICC_PROFILE\x00\x01\x01\x00\x00\x020ADBE\x02\x10\x00\x00mntrRGB XYZ \x07\xcf\x00\x06\x00\x03\x00\x00\x00\x00\x00\x00acspAPPL\x00\x00\x00\x00none\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf6\xd6\x00\x01\x00\x00\x00\x00\xd3-ADBE\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\ncprt\x00\x00\x00\xfc\x00\x00\x002desc\x00\x00\x010\x00\x00\x00kwtpt\x00\x00\x01\x9c\x00\x00\x00\x14bkpt\x00\x00\x01\xb0\x00\x00\x00\x14rTRC\x00\x00\x01\xc4\x00\x00\x00\x0egTRC\x00\x00\x01\xd4\x00\x00\x00\x0ebTRC\x00\x00\x01\xe4\x00\x00\x00\x0erXYZ\x00\x00\x01\xf4\x00\x00\x00\x14gXYZ\x00\x00\x02\x08\x00\x00\x00\x14bXYZ\x00\x00\x02\x1c\x00\x00\x00\x14text\x00\x00\x00\x00Copyright 1999 Adobe System

In [55]:
f = open('my_computer_image.jpg', 'wb')

In [56]:
f.write(image_link.content)

27373

In [57]:
f.close()
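<li>The open, write, and close steps above can also be written with a with block, which closes the file even if the write fails. The sketch below assumes a hypothetical helper named save_bytes and uses placeholder bytes in a temporary folder instead of a real download, so it runs without a network connection</li>

```python
import tempfile
from pathlib import Path

# Placeholder bytes standing in for image_link.content (not a real image)
PLACEHOLDER_BYTES = b"\xff\xd8\xff\xe0 placeholder"


def save_bytes(content, filename):
    """Write raw bytes to disk and return the number of bytes written."""
    # 'wb' means write in binary mode, which image data requires
    with Path(filename).open("wb") as image_file:
        return image_file.write(content)


# Write into a temporary folder so the demo leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "my_computer_image.jpg"
    written = save_bytes(PLACEHOLDER_BYTES, target)
    size_on_disk = target.stat().st_size

print(written, size_on_disk)
```

<li>With a real download, the call would look like save_bytes(image_link.content, 'my_computer_image.jpg')</li>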