# Web Scraping Intro

### Hypertext Transfer Protocol (HTTP) is the foundation for data communication on the World Wide Web.
- Entering a URL in your browser sends a request for the resource at that address.
- The response is what the server sends back (for example, the page's HTML or a 404 error).

To retrieve the contents of a website, we will be using the [_requests_](https://requests.readthedocs.io/en/master/) library.

In [1]:
import requests

In this notebook, we will be using a **GET** request. This is a request for data from a specified resource.  

Another common type of request is a **POST** request. POST submits data to be processed (e.g., from an HTML form) to the identified resource. The data is included in the body of the request. This may result in the creation of a new resource, the update of existing resources, or both.
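To make the difference concrete, here is a minimal sketch that builds a GET and a POST request with `requests.Request(...).prepare()` but never sends them. The endpoint `https://example.com/search` and the form fields are made up for illustration.

```python
import requests

# Hypothetical endpoint and form fields, used only for illustration.
url = 'https://example.com/search'
form = {'name': 'Ada', 'award': 'Turing'}

# prepare() builds the request object without sending anything over the network.
get_req = requests.Request('GET', url, params=form).prepare()
post_req = requests.Request('POST', url, data=form).prepare()

print(get_req.url)    # for GET, the data travels in the query string
print(get_req.body)   # this GET has no body
print(post_req.url)   # for POST, the URL carries no data
print(post_req.body)  # the form data travels in the request body
```

Inspecting a prepared request like this is a convenient way to see what would be sent before touching a real server.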

To perform a GET request, use `requests.get()` and pass in the desired URL.

In [2]:
URL = 'http://en.wikipedia.org/wiki/Turing_Award'

response = requests.get(URL)

Let's see what kind of object we get.

In [3]:
type(response)

requests.models.Response

We can check the status code using the `status_code` attribute.

In [4]:
response.status_code

200

A 200 status code is the standard response for a successful request.  

Other common status codes:
 * 400: Bad Request
 * 404: Not Found
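If you come across a code you don't recognize, Python's standard library can look up its standard reason phrase. The sketch below uses `http.HTTPStatus`; the helper name `describe` and the category labels are our own choices, not part of any library.

```python
from http import HTTPStatus

def describe(code):
    """Return the standard reason phrase and a rough category for a status code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = 'success'
    elif 400 <= code < 500:
        category = 'client error'
    elif 500 <= code < 600:
        category = 'server error'
    else:
        category = 'other'
    return phrase, category

for code in (200, 400, 404, 500):
    print(code, *describe(code))
```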

Let's see what happens if we request a non-existent URL.

In [5]:
requests.get('https://en.wikipedia.org/wiki/Tuning_Award')

<Response [404]>

**Back to the successful request.** Let's see what it returned.

In [6]:
response.text

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Turing Award - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-c

It is very hard to decipher the above text. Luckily for us, the [_Beautiful Soup_](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library comes to the rescue. This library assists us in parsing HTML into something usable.

In [7]:
from bs4 import BeautifulSoup as BS

First, we can soupify our response text. Since we are working with HTML, we can specify that we need the html parser.

In [8]:
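# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BS(response.text, 'html.parser')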

Now, we can print it out in a slightly more readable form.

In [9]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Turing Award - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-li

What we are looking at is the HTML for this page. This is rendered by your browser into the Wikipedia page that you see.


If you navigate to this page in your browser, you can view page source or inspect elements to see the underlying HTML.

If you are using Safari, this may not be available by default and you'll need to activate it. According to [this](https://www.socialmeteor.com/2013/03/04/how-to-view-html-source-in-safari-web-browser/) website, you can activate it by following these steps:


1. Open Safari.
2. Select ‘Preferences’ from the ‘Safari’ menu.
3. In the ‘Advanced’ section, select ‘Show Develop menu in menu bar’.
4. Visit the web page you want to view HTML source for.
5. Select ‘Show Page Source’ from the ‘Develop’ menu that has been added to Safari.


Beautiful Soup lets us search through this HTML and extract out the contents we want by tag.  

Say we wanted to find the title of this page. We can accomplish this by using the `.find` method on our soup, telling it that we want to find the first `title` tag.

In [10]:
soup.find('title')

<title>Turing Award - Wikipedia</title>

Notice that this returns a bs4 Tag object.

In [11]:
type(soup.find('title'))

bs4.element.Tag

To extract out the text, you can use the `.text` attribute.

In [12]:
soup.find('title').text

'Turing Award - Wikipedia'

The `.find` method finds only the first matching tag.

We can find _all_ elements with a particular tag using the `.findAll(<tag>)` method. Say we want to find all images. We'll look for the `img` tag.

In [13]:
images = soup.findAll('img')
print(type(images))
images

<class 'bs4.element.ResultSet'>


[<img alt="" aria-hidden="true" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>,
 <img alt="Wikipedia" class="mw-logo-wordmark" src="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;"/>,
 <img alt="The Free Encyclopedia" class="mw-logo-tagline" height="13" src="/static/images/mobile/copyright/wikipedia-tagline-en.svg" style="width: 7.3125em; height: 0.8125em;" width="117"/>,
 <img class="mw-file-element" data-file-height="463" data-file-width="314" decoding="async" height="118" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/36/Maurice_Vincent_Wilkes_1980_%283%2C_cropped%29.jpg/80px-Maurice_Vincent_Wilkes_1980_%283%2C_cropped%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/36/Maurice_Vincent_Wilkes_1980_%283%2C_cropped%29.jpg/120px-Maurice_Vincent_Wilkes_1980_%283%2C_cropped%29.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/36/Maurice_Vincent_Wilkes_1980_%283%2C_crop
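The difference between the two methods is easiest to see on a small input. The sketch below parses a tiny hand-written page (not the real Wikipedia HTML) and also shows what each method returns when nothing matches. It uses `find_all`, the current spelling of the method; `findAll` is an older alias for it.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used only for illustration.
html = """
<html><body>
  <img src="/a.png" alt="first">
  <img src="/b.png" alt="second">
</body></html>
"""
mini_soup = BeautifulSoup(html, 'html.parser')

first = mini_soup.find('img')             # first match only (a Tag)
every = mini_soup.find_all('img')         # all matches (a list-like ResultSet)
missing = mini_soup.find('video')         # no match: None
none_found = mini_soup.find_all('video')  # no match: empty ResultSet

print(first['alt'], len(every), missing, len(none_found))
```

Because `find` returns `None` when there is no match, chaining an attribute lookup directly onto it can fail; checking for `None` first is the safer habit.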

Let's look closer at the first image.

In [14]:
first_image = images[0]
print(type(first_image))
first_image

<class 'bs4.element.Tag'>


<img alt="" aria-hidden="true" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>

You can access attributes of a Tag object in the same way that you would access values from a dictionary.

In [15]:
first_image['src']

'/static/images/icons/wikipedia.png'

You can also access attributes safely with `.get`, which returns `None` instead of raising a `KeyError` when the attribute is missing. This is useful if, for example, you aren't sure whether every tag has a given attribute.

In [16]:
# Non-safe
first_image['class']

['mw-logo-icon']

In [17]:
# Safe
first_image.get('class')

['mw-logo-icon']

You can also specify a default value when using `get`.

In [18]:
first_image.get('class', default = 'No Class')

['mw-logo-icon']
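The first image happens to have a `class`, so the default above is never used. Here is a small sketch with a hand-written snippet in which the second image has no `class` attribute, showing where the two access styles actually differ.

```python
from bs4 import BeautifulSoup

# Hand-written example: the second image has no class attribute.
html = '<img class="logo" src="/a.png"><img src="/b.png">'
with_class, without_class = BeautifulSoup(html, 'html.parser').find_all('img')

print(with_class.get('class', 'No Class'))     # attribute present: default ignored
print(without_class.get('class'))              # attribute missing, no default given
print(without_class.get('class', 'No Class'))  # attribute missing: default returned

# Dictionary-style access raises an error for a missing attribute.
try:
    without_class['class']
except KeyError as err:
    print('KeyError:', err)
```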

If you want to grab a particular attribute for all images, an easy way to do so is with a list comprehension.

In [19]:
image_srcs = [x.get('src') for x in images]

In [20]:
image_srcs

['/static/images/icons/wikipedia.png',
 '/static/images/mobile/copyright/wikipedia-wordmark-en.svg',
 '/static/images/mobile/copyright/wikipedia-tagline-en.svg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/3/36/Maurice_Vincent_Wilkes_1980_%283%2C_cropped%29.jpg/80px-Maurice_Vincent_Wilkes_1980_%283%2C_cropped%29.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Marvin_Minsky_at_OLPCc.jpg/80px-Marvin_Minsky_at_OLPCc.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/4/49/John_McCarthy_Stanford.jpg/80px-John_McCarthy_Stanford.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Edsger_Wybe_Dijkstra.jpg/80px-Edsger_Wybe_Dijkstra.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Charles_Bachman_2012.jpg/80px-Charles_Bachman_2012.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/4/4f/KnuthAtOpenContentAlliance.jpg/80px-KnuthAtOpenContentAlliance.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Herbert_A._Simon_and_Allen_Newell_Chess_Match_
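Notice that none of these `src` values is a complete URL: some start with `/` (relative to the site) and some with `//` (missing the scheme). If you wanted to download the images, one possible approach is to resolve each value against the page URL with `urllib.parse.urljoin` from the standard library. The sketch below uses two of the values from the output above; the variable names are our own.

```python
from urllib.parse import urljoin

# The page the src values were scraped from.
page_url = 'https://en.wikipedia.org/wiki/Turing_Award'

# Two src values copied from the output above.
srcs = [
    '/static/images/icons/wikipedia.png',
    '//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Marvin_Minsky_at_OLPCc.jpg/80px-Marvin_Minsky_at_OLPCc.jpg',
]

# urljoin fills in whatever the src is missing (scheme, host) from the page URL.
absolute_srcs = [urljoin(page_url, src) for src in srcs]
for url in absolute_srcs:
    print(url)
```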

We can further navigate the html tree to extract out other bits of information.

When scraping a web page, you should make use of "View Page Source" and/or "Inspect Element" in your web browser.

For example, let's say we want to look at the second header on the page.

In [21]:
soup.findAll('header')[1]

<header class="mw-body-header vector-page-titlebar">
<nav aria-label="Contents" class="vector-toc-landmark" role="navigation">
<div class="vector-dropdown vector-page-titlebar-toc vector-button-flush-left" id="vector-page-titlebar-toc">
<input aria-haspopup="true" aria-label="Toggle the table of contents" class="vector-dropdown-checkbox" data-event-name="ui.dropdown-vector-page-titlebar-toc" id="vector-page-titlebar-toc-checkbox" role="button" type="checkbox"/>
<label aria-hidden="true" class="vector-dropdown-label cdx-button cdx-button--fake-button cdx-button--fake-button--enabled cdx-button--weight-quiet cdx-button--icon-only" for="vector-page-titlebar-toc-checkbox" id="vector-page-titlebar-toc-label"><span class="vector-icon mw-ui-icon-listBullet mw-ui-icon-wikimedia-listBullet"></span>
<span class="vector-dropdown-label-text">Toggle the table of contents</span>
</label>
<div class="vector-dropdown-content">
<div class="vector-unpinned-container" id="vector-page-titlebar-toc-unpinne

Just as we used `find` and `findAll` on the full soup, we can use them on an individual Tag to search only within that element.

In [22]:
soup.findAll('header')[1].find('h1').get('id')

'firstHeading'

In [23]:
soup.findAll('header')[1].find('h1').text

'Turing Award'
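Chained calls like the ones above assume that every step finds something. As a hedged sketch of a more defensive version, the helper below (the name `heading_text` is ours) searches inside each header of a small hand-written page and returns `None` when a header has no `h1`.

```python
from bs4 import BeautifulSoup

# Hand-written example: only the second header contains an h1.
html = """
<header><p>site banner</p></header>
<header><h1 id="firstHeading">Turing Award</h1></header>
"""
mini_soup = BeautifulSoup(html, 'html.parser')

def heading_text(header):
    """Return the text of the first h1 inside header, or None if there is none."""
    h1 = header.find('h1')
    return h1.get_text(strip=True) if h1 is not None else None

headings = [heading_text(h) for h in mini_soup.find_all('header')]
print(headings)
```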

Now, let's look for the table containing the Turing Award winners.

Using `.findAll` reveals that there are multiple tables on the page.

In [24]:
soup.findAll('table')

[<table class="infobox vevent"><tbody><tr><th class="infobox-above summary" colspan="2">ACM Turing Award</th></tr><tr><th class="infobox-label" scope="row" style="width: 33%;">Awarded for</th><td class="infobox-data">Outstanding contributions in <a href="/wiki/Computer_science" title="Computer science">computer science</a></td></tr><tr><th class="infobox-label" scope="row" style="width: 33%;">Country</th><td class="infobox-data location">United States</td></tr><tr><th class="infobox-label" scope="row" style="width: 33%;">Presented by</th><td class="infobox-data attendee"><a href="/wiki/Association_for_Computing_Machinery" title="Association for Computing Machinery">Association for Computing Machinery</a> (ACM)</td></tr><tr><th class="infobox-label" scope="row" style="width: 33%;">Reward(s)</th><td class="infobox-data">US $1,000,000<sup class="reference" id="cite_ref-million_1-0"><a href="#cite_note-million-1">[1]</a></sup></td></tr><tr><th class="infobox-label" scope="row" style="width

If we know a bit more about what we are looking for, we can include an `attrs` argument and pass a dictionary of attribute names and values.

Go to the Turing Award page in your browser, right-click on the top of the table, and choose "Inspect". You will notice that this table's opening tag is `<table class="wikitable sortable">`. A `class` attribute can hold several values, and searching for `wikitable` alone is enough to match it. Armed with this information, we can narrow down our search.

In [25]:
soup.find('table', attrs={'class' : 'wikitable'})

<table class="wikitable sortable">
<tbody><tr bgcolor="#ccccc">
<th>Year
</th>
<th>Recipient(s)
</th>
<th>Photo
</th>
<th>Rationale
</th>
<th>Affiliated institute(s)
</th></tr>
<tr>
<td>1966
</td>
<td><a href="/wiki/Alan_Perlis" title="Alan Perlis">Alan Perlis</a>
</td>
<td>
</td>
<td>For his influence in the area of advanced <a href="/wiki/Computer_programming" title="Computer programming">computer programming</a> techniques and <a href="/wiki/Compiler" title="Compiler">compiler</a> construction.<sup class="reference" id="cite_ref-10"><a href="#cite_note-10">[10]</a></sup>
</td>
<td><a href="/wiki/Carnegie_Mellon_University" title="Carnegie Mellon University">Carnegie Mellon University</a>
</td></tr>
<tr>
<td>1967
</td>
<td><a href="/wiki/Maurice_Wilkes" title="Maurice Wilkes">Maurice Wilkes</a>
</td>
<td><span typeof="mw:File"><a class="mw-file-description" href="/wiki/File:Maurice_Vincent_Wilkes_1980_(3,_cropped).jpg"><img class="mw-file-element" data-file-height="463" data-file-wid
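Note that the search for `wikitable` matched a table whose class is `wikitable sortable`. The sketch below reproduces this on a small hand-written snippet and also shows the `class_` keyword, which Beautiful Soup offers as a shorthand for the same search.

```python
from bs4 import BeautifulSoup

# Hand-written stand-ins for the two kinds of table on the page.
html = """
<table class="infobox vevent"><tr><td>info</td></tr></table>
<table class="wikitable sortable"><tr><td>winners</td></tr></table>
"""
mini_soup = BeautifulSoup(html, 'html.parser')

# Both searches match any table that has 'wikitable' among its classes.
by_attrs = mini_soup.find('table', attrs={'class': 'wikitable'})
by_keyword = mini_soup.find('table', class_='wikitable')

print(by_attrs['class'])                 # class values come back as a list
print(by_keyword.get_text(strip=True))   # text of the matched table
```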

We can render the table in the notebook using IPython's `HTML` display class.

In [26]:
table_html = str(soup.find('table', attrs={'class' : 'wikitable'}))

from IPython.display import HTML

HTML(table_html)



Year,Recipient(s),Photo,Rationale,Affiliated institute(s)
1966,Alan Perlis,,For his influence in the area of advanced computer programming techniques and compiler construction.[10],Carnegie Mellon University
1967,Maurice Wilkes,,"Wilkes is best known as the builder and designer of the EDSAC, the second computer with an internally stored program. Built in 1949, the EDSAC used a mercury delay line memory. He is also known as the author, with Wheeler and Gill, of a volume on ""Preparation of Programs for Electronic Digital Computers"" in 1951, in which program libraries were effectively introduced.[11]",University of Cambridge
1968,Richard Hamming,,"For his work on numerical methods, automatic coding systems, and error-detecting and error-correcting codes.[12]",Bell Labs
1969,Marvin Minsky,,"For his central role in creating, shaping, promoting, and advancing the field of artificial intelligence.[13]",Massachusetts Institute of Technology
1970,James H. Wilkinson,,"For his research in numerical analysis to facilitate the use of the high-speed digital computer, having received special recognition for his work in computations in linear algebra and ""backward"" error analysis.[14]",National Physical Laboratory
1971,John McCarthy,,"McCarthy's lecture ""The Present State of Research on Artificial Intelligence"" is a topic that covers the area in which he has achieved considerable recognition for his work.[15]",Stanford University
1972,Edsger W. Dijkstra,,"Edsger Dijkstra was a principal contributor in the late 1950s to the development of the ALGOL, a high level programming language which has become a model of clarity and mathematical rigor. He is one of the principal proponents of the science and art of programming languages in general, and has greatly contributed to our understanding of their structure, representation, and implementation. His fifteen years of publications extend from theoretical articles on graph theory to basic manuals, expository texts, and philosophical contemplations in the field of programming languages.[16]","Centrum Wiskunde & Informatica, Eindhoven University of Technology, University of Texas at Austin"
1973,Charles Bachman,,For his outstanding contributions to database technology.[17],"General Electric Research Laboratory (now under Groupe Bull, an Atos company)"
1974,Donald Knuth,,"For his major contributions to the analysis of algorithms and the design of programming languages, and in particular for his contributions to ""The Art of Computer Programming"" through his well-known books in a continuous series by this title.[18]","California Institute of Technology, Center for Communications Research, Center for Communications and Computing, Institute for Defense Analyses, Stanford University"
1975,Allen Newell,,"In joint scientific efforts extending over twenty years, initially in collaboration with J. C. Shaw at the RAND Corporation, and subsequently with numerous faculty and student colleagues at Carnegie Mellon University, they have made basic contributions to artificial intelligence, the psychology of human cognition, and list processing.[19]","RAND Corporation, Carnegie Mellon University"


However, this does not give us a way to work with the data in the table, only to display it.

If we want to interact with the table, we can use the _pandas_ `read_html` function, which returns a list of DataFrames, one per table found.

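In [27]:
from io import StringIO

import pandas as pd

In [28]:
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
pd.read_html(StringIO(str(soup.find('table', attrs={'class' : 'wikitable'}))))[0]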

Unnamed: 0,Year,Recipient(s),Photo,Rationale,Affiliated institute(s)
0,1966,Alan Perlis,,For his influence in the area of advanced comp...,Carnegie Mellon University
1,1967,Maurice Wilkes,,Wilkes is best known as the builder and design...,University of Cambridge
2,1968,Richard Hamming,,"For his work on numerical methods, automatic c...",Bell Labs
3,1969,Marvin Minsky,,"For his central role in creating, shaping, pro...",Massachusetts Institute of Technology
4,1970,James H. Wilkinson,,For his research in numerical analysis to faci...,National Physical Laboratory
...,...,...,...,...,...
71,2019,Pat Hanrahan,,For fundamental contributions to 3-D computer ...,"Pixar, Princeton University, Stanford University"
72,2020,Alfred Aho,,For fundamental algorithms and theory underlyi...,"Bell Labs, Columbia University"
73,2020,Jeffrey Ullman,,For fundamental algorithms and theory underlyi...,"Bell Labs, Princeton University, Stanford Univ..."
74,2021,Jack Dongarra,,For pioneering contributions to numerical algo...,"Argonne National Laboratory, Oak Ridge Nationa..."
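The DataFrame above contains only the text of each cell. If you also need something the text does not carry, such as the link behind each recipient's name, one option is to walk the rows yourself with Beautiful Soup. The following is a minimal sketch on a hand-written two-row table modeled on the real one; the `link` key is our own addition.

```python
from bs4 import BeautifulSoup

# Hand-written table modeled on the first rows of the real one.
html = """
<table class="wikitable">
  <tr><th>Year</th><th>Recipient(s)</th></tr>
  <tr><td>1966</td><td><a href="/wiki/Alan_Perlis">Alan Perlis</a></td></tr>
  <tr><td>1967</td><td><a href="/wiki/Maurice_Wilkes">Maurice Wilkes</a></td></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table', attrs={'class': 'wikitable'})

rows = table.find_all('tr')
# The first row holds the column names.
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]

records = []
for row in rows[1:]:
    cells = row.find_all('td')
    record = dict(zip(headers, (td.get_text(strip=True) for td in cells)))
    # Keep the first link in the row, if there is one.
    link = row.find('a')
    record['link'] = link.get('href') if link is not None else None
    records.append(record)

for record in records:
    print(record)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)` if you want to continue in pandas.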
