# Scraping Web Data with Beautiful Soup

## Objectives

1. Understand the basic structure of a web page
2. Use `requests` to download a web page
3. Use `BeautifulSoup` to parse the web page
4. Inspect elements in a browser to locate the desired information
5. Use various search methods to navigate the web page
    1. `find`
    2. `find_all`
    3. Tag-like attribute access (e.g. `tag.a`)
    4. `find_parent(s)`
6. Isolate information by *looking in* then *looking out*

**Source:** some material adapted from http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html

## Web pages and the DOM

* Are plain text files
* Use HTML tags
* Have a tree structure
    * Called the Document Object Model (DOM)

<img src="http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png">

<img src="http://www.cs.toronto.edu/~shiva/cscb07/img/dom/treeStructure.png">

## Using Firefox or Chrome to inspect a page

* Right-click an element on the page and select
    * Firefox: Inspect Element
    * Chrome: Inspect
* Opens a representation of the DOM
* Mouse over elements to highlight corresponding parts of the page.

## <font color="red"> Exercise 1</font>

Inspect the following parts of https://en.wikipedia.org/wiki/Web_scraping

* The table of contents
* The section headers
* A link

## HTML tags

* Are surrounded by `<` and `>`
* Most have beginning and end tags
    * `<p> a paragraph </p>`
* Some common tags
    * `<div>`
    * `<span>`
    * `<a>` (link)
    * `<img>`
* Can carry attributes
    * `<img src="my_image.png">`
    * `<div class="some-identifier">`

## Using `requests` to download a raw web page

* Three steps
    * Create a session
    * Use the `get` method to request the page
    * Use `content` on the result to see the raw page as bytes (use `text` for a decoded string)

In [3]:
import requests

s = requests.Session() # Start a session
r = s.get('https://en.wikipedia.org/wiki/Web_scraping') # Get a static page
r.content[:1000] # Look at the content (bytes)

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Web scraping - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Web_scraping","wgTitle":"Web scraping","wgCurRevisionId":830441507,"wgRevisionId":830441507,"wgArticleId":2696619,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 Danish-language sources (da)","Articles needing additional references from June 2017","All articles needing additional references","Articles with limited geographic scope from October 2015","USA-centric","Pages using div col with deprecated parameters","Web scraping","World Wide Web","Spamming"],"wgBreakFrames":fal
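
The three steps can also be wrapped in a small function. The sketch below is one possible arrangement, not the only one: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration. It adds two safeguards that the cell above omits: a timeout, so a stalled server does not hang the notebook, and `raise_for_status()`, so an HTTP error (such as 404) raises an exception instead of silently returning an error page.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its HTML as a str."""
    # The with-block closes the session when we are done
    with requests.Session() as s:
        r = s.get(url, timeout=timeout)  # give up after `timeout` seconds
        r.raise_for_status()             # raise on 4xx/5xx responses
        return r.text                    # decoded text; r.content is the raw bytes


# Usage (requires network access):
# html = fetch_html('https://en.wikipedia.org/wiki/Web_scraping')
# html[:100]
```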

## Using Beautiful Soup (`bs4`) to parse a page

* Module is `bs4`
* `BeautifulSoup` takes the content from the `requests` result
* Parses and adds search tools

In [4]:
import bs4

soup = bs4.BeautifulSoup(r.content, 'lxml') # parse the web page content
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Web scraping - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Web_scraping","wgTitle":"Web scraping","wgCurRevisionId":830441507,"wgRevisionId":830441507,"wgArticleId":2696619,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 Danish-language sources (da)","Articles needing additional references from June 2017","All articles needing additional references","Articles with limited geographic scope from October 2015","USA-centric","Pages using div col with deprecated parameters","Web scraping","World Wide Web","Spamming"],"w
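
For experimenting without downloading anything, `BeautifulSoup` can parse any HTML you hand it. The sketch below uses a small made-up page and Python's built-in `'html.parser'` instead of `'lxml'`, so it should run even where `lxml` is not installed; the name `soup_demo` is just for illustration.

```python
import bs4

# A tiny made-up page, given as bytes like r.content
raw = (
    b"<!DOCTYPE html><html><head><title>Web scraping - Wikipedia</title></head>"
    b"<body><p>Hello</p></body></html>"
)

# 'html.parser' ships with Python; 'lxml' is a separate install
soup_demo = bs4.BeautifulSoup(raw, "html.parser")

print(soup_demo.title.string)  # Web scraping - Wikipedia
print(soup_demo.p.get_text())  # Hello
```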

## Using `find` to find the first instance

* Look for the first instance
* Recursive search
* First argument is the HTML tag name
* Add additional filters as needed
    * Frequently use `class_` to match the `class` attribute
    * (`class` is a reserved keyword in Python, hence the trailing underscore)

In [5]:
tag = soup.find('div', class_='toctitle')
tag

<div class="toctitle" dir="ltr" lang="en" xml:lang="en">
<h2>Contents</h2>
</div>
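
Two details of `find` are easy to miss: it stops at the first match, and it returns `None` when nothing matches. The sketch below shows both on a made-up fragment (parsed with the built-in `'html.parser'`); `soup_demo` and the fragment are assumptions for illustration.

```python
import bs4

# Made-up fragment with two divs of the same class
html = """
<div class="toc">
  <div class="toctitle"><h2>Contents</h2></div>
  <div class="toctitle"><h2>Second title</h2></div>
</div>
"""
soup_demo = bs4.BeautifulSoup(html, "html.parser")

# Only the first matching div is returned
first = soup_demo.find("div", class_="toctitle")
print(first.h2.string)  # Contents

# Passing an attrs dictionary is an alternative to class_
same = soup_demo.find("div", attrs={"class": "toctitle"})
print(first is same)  # True

# No match gives None, so check before using the result
missing = soup_demo.find("div", class_="no-such-class")
print(missing)  # None
```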

## Pulling attribute information from a tag

* Index a tag like a dictionary to read one of its HTML attributes.

In [6]:
tag

<div class="toctitle" dir="ltr" lang="en" xml:lang="en">
<h2>Contents</h2>
</div>

In [7]:
tag['class']

['toctitle']
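
Note that `class` came back as a list: an element can carry several classes, so Beautiful Soup treats it as multi-valued, while most other attributes are plain strings. The following sketch, on a made-up tag, also shows `get`, which avoids a `KeyError` when an attribute is absent.

```python
import bs4

# Made-up tag with two classes and an id
tag_demo = bs4.BeautifulSoup(
    '<div class="toctitle extra" id="toctitle" lang="en"><h2>Contents</h2></div>',
    "html.parser",
).div

print(tag_demo["id"])     # toctitle
print(tag_demo["class"])  # ['toctitle', 'extra']

# Indexing a missing attribute raises KeyError ...
try:
    tag_demo["title"]
except KeyError:
    print("no title attribute")

# ... while get returns None (or a default of your choice)
print(tag_demo.get("title"))              # None
print(tag_demo.get("title", "untitled"))  # untitled
```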

## Using tag names as attributes to reach nested tags

* `tag.a` returns the first `<a>` tag nested inside `tag`
* `tag.attrs` returns all of a tag's HTML attributes as a dictionary


In [8]:
tag = soup.find('div', class_="mw-jump")
tag

<div class="mw-jump" id="jump-to-nav">
					Jump to:					<a href="#mw-head">navigation</a>, 					<a href="#p-search">search</a>
</div>

In [9]:
tag.a

<a href="#mw-head">navigation</a>

In [18]:
tag.a.attrs

{'href': '#mw-head'}
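
Tag-like access is shorthand for `find`, so it shares its behavior: only the first nested match is returned, and a tag name that does not occur gives `None`. A minimal sketch on a made-up copy of the "Jump to" fragment:

```python
import bs4

html = (
    '<div class="mw-jump">Jump to: '
    '<a href="#mw-head">navigation</a>, '
    '<a href="#p-search">search</a></div>'
)
div = bs4.BeautifulSoup(html, "html.parser").div

print(div.a["href"])           # #mw-head (first link only)
print(div.a is div.find("a"))  # True
print(div.span)                # None (there is no span inside)

# To reach every link, fall back to find_all
print([a["href"] for a in div.find_all("a")])  # ['#mw-head', '#p-search']
```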

## Searching for text

* Text inside a tag is stored as a string
* Use the `string=` argument to search for text that matches exactly

In [10]:
soup.find(string="Contents")

'Contents'
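
A plain string passed to `string=` must match the whole text of a node. For partial matches, a compiled regular expression can be passed instead. The sketch below contrasts the two on a made-up fragment.

```python
import re

import bs4

html = "<div><h2>Contents</h2><p>Table of Contents below</p><p>Other text</p></div>"
soup_demo = bs4.BeautifulSoup(html, "html.parser")

# Exact match: only the text node that is exactly "Contents"
exact = soup_demo.find(string="Contents")
print(exact.parent.name)  # h2

# Regular expression: any text node containing "Contents"
partial = soup_demo.find_all(string=re.compile("Contents"))
print([str(s) for s in partial])  # ['Contents', 'Table of Contents below']
```

The result of a text search is the string itself, not the enclosing tag; `.parent` (or `find_parent`) gets you back to the tag.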

## Use `find_all` to find all instances of a tag

* Called the same way as `find`
* Returns a list of tags
    * Process with a comprehension

In [11]:
soup.find_all('a')[:5]

[<a id="top"></a>,
 <a href="#mw-head">navigation</a>,
 <a href="#p-search">search</a>,
 <a class="image" href="/wiki/File:Question_book-new.svg"><img alt="" data-file-height="399" data-file-width="512" height="39" src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/75px-Question_book-new.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/100px-Question_book-new.svg.png 2x" width="50"/></a>,
 <a href="/wiki/Wikipedia:Verifiability" title="Wikipedia:Verifiability">verification</a>]

In [12]:
# Shortcut: tag(args) is equivalent to tag.find_all(args)
soup('a')[:5]

[<a id="top"></a>,
 <a href="#mw-head">navigation</a>,
 <a href="#p-search">search</a>,
 <a class="image" href="/wiki/File:Question_book-new.svg"><img alt="" data-file-height="399" data-file-width="512" height="39" src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/50px-Question_book-new.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/75px-Question_book-new.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/100px-Question_book-new.svg.png 2x" width="50"/></a>,
 <a href="/wiki/Wikipedia:Verifiability" title="Wikipedia:Verifiability">verification</a>]

In [20]:
titles = [tag['title'] for tag in soup('a') if 'title' in tag.attrs]
titles[:5]

['Wikipedia:Verifiability',
 'Help:Introduction to referencing with Wiki Markup/1',
 'Help:Maintenance template removal',
 'Data scraping',
 'Data scraping']
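
Instead of filtering on `tag.attrs` inside the comprehension, `find_all` can do the filtering itself: passing `title=True` (or `href=True`) keeps only tags that have that attribute. A minimal sketch on a made-up set of links:

```python
import bs4

html = """
<p>
  <a id="top"></a>
  <a href="#mw-head">navigation</a>
  <a href="/wiki/Data_scraping" title="Data scraping">data scraping</a>
  <a href="/wiki/Wikipedia:Verifiability" title="Wikipedia:Verifiability">verification</a>
</p>
"""
soup_demo = bs4.BeautifulSoup(html, "html.parser")

# Keep only links that actually have an href
hrefs = [a["href"] for a in soup_demo.find_all("a", href=True)]
print(hrefs)
# ['#mw-head', '/wiki/Data_scraping', '/wiki/Wikipedia:Verifiability']

# limit caps the number of results
titled = soup_demo.find_all("a", title=True, limit=1)
print(titled[0]["title"])  # Data scraping

# No matches gives an empty list, not None
print(soup_demo.find_all("table"))  # []
```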

## Important note

Use `find` rather than `find_all` when the page (or the part of it you are searching) contains exactly one matching element.

## Searching for parents

* Use `find_parent` and `find_parents` to move back up the tree

In [58]:
# We know the text "Contents" is inside the table of contents (toc)
soup.find(string="Contents")

'Contents'

In [64]:
# Keep stepping up the tree until we have the whole toc
# One step is not enough: this is only the title div
soup.find(string="Contents").find_parent('div')

<div class="toctitle" id="toctitle">
<h2>Contents</h2>
</div>

In [3]:
toc = soup.find(string="Contents").find_parent('div').find_parent('div')
print(toc.prettify())

<div class="toc" id="toc">
 <div class="toctitle" id="toctitle">
  <h2>
   Contents
  </h2>
 </div>
 <ul>
  <li class="toclevel-1 tocsection-1">
   <a href="#Techniques">
    <span class="tocnumber">
     1
    </span>
    <span class="toctext">
     Techniques
    </span>
   </a>
   <ul>
    <li class="toclevel-2 tocsection-2">
     <a href="#Human_copy-and-paste">
      <span class="tocnumber">
       1.1
      </span>
      <span class="toctext">
       Human copy-and-paste
      </span>
     </a>
    </li>
    <li class="toclevel-2 tocsection-3">
     <a href="#Text_pattern_matching">
      <span class="tocnumber">
       1.2
      </span>
      <span class="toctext">
       Text pattern matching
      </span>
     </a>
    </li>
    <li class="toclevel-2 tocsection-4">
     <a href="#HTTP_programming">
      <span class="tocnumber">
       1.3
      </span>
      <span class="toctext">
       HTTP programming
      </span>
     </a>
    </li>
    <li class="toclevel-2 tocsection-5

In [4]:
# Use what we learned: jump straight to the enclosing div with class "toc"
toc = soup.find(string="Contents").find_parent('div', class_="toc")
print(toc.prettify())

<div class="toc" id="toc">
 <div class="toctitle" id="toctitle">
  <h2>
   Contents
  </h2>
 </div>
 <ul>
  <li class="toclevel-1 tocsection-1">
   <a href="#Techniques">
    <span class="tocnumber">
     1
    </span>
    <span class="toctext">
     Techniques
    </span>
   </a>
   <ul>
    <li class="toclevel-2 tocsection-2">
     <a href="#Human_copy-and-paste">
      <span class="tocnumber">
       1.1
      </span>
      <span class="toctext">
       Human copy-and-paste
      </span>
     </a>
    </li>
    <li class="toclevel-2 tocsection-3">
     <a href="#Text_pattern_matching">
      <span class="tocnumber">
       1.2
      </span>
      <span class="toctext">
       Text pattern matching
      </span>
     </a>
    </li>
    <li class="toclevel-2 tocsection-4">
     <a href="#HTTP_programming">
      <span class="tocnumber">
       1.3
      </span>
      <span class="toctext">
       HTTP programming
      </span>
     </a>
    </li>
    <li class="toclevel-2 tocsection-5
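
`find_parents` (plural) returns every matching ancestor, nearest first, which is a quick way to see which enclosing tags are available before choosing one. The sketch below uses a made-up fragment; the outer class name `page` is hypothetical.

```python
import bs4

html = """
<div class="page">
  <div class="toc" id="toc">
    <div class="toctitle"><h2>Contents</h2></div>
    <ul><li><a href="#Techniques">Techniques</a></li></ul>
  </div>
</div>
"""
soup_demo = bs4.BeautifulSoup(html, "html.parser")
text = soup_demo.find(string="Contents")

# find_parent: the nearest enclosing div
print(text.find_parent("div")["class"])  # ['toctitle']

# find_parents: all enclosing divs, from the inside out
print([d["class"] for d in text.find_parents("div")])
# [['toctitle'], ['toc'], ['page']]

# Filters work as in find, so we can name the ancestor we want
toc_demo = text.find_parent("div", class_="toc")
print(toc_demo.a["href"])  # #Techniques
```

This is the *looking out* step: start from something easy to find (the text), then move out to the container that holds everything you need.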

## Using list comprehensions

* `find_all` returns a list of tag objects.
* Use a comprehension to process all tags
    * Start with one example tag
    * Put the resulting expression in a comprehension

In [80]:
tags = toc.find_all('span', class_="toctext")
tags[:3]

[<span class="toctext">Techniques</span>,
 <span class="toctext">Human copy-and-paste</span>,
 <span class="toctext">Text pattern matching</span>]

In [74]:
example_tag = tags[0]
example_tag

<span class="toctext">Techniques</span>

In [77]:
example_tag.next

'Techniques'

In [81]:
sections = [tag.next for tag in toc.find_all('span', 'toctext')]
sections[:3]

['Techniques', 'Human copy-and-paste', 'Text pattern matching']
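
The same pattern extends to pulling several pieces of information per entry. As a sketch on a made-up, shortened copy of the table of contents, the comprehension below builds one `(number, title, link)` tuple per `<a>` tag; it uses `get_text(strip=True)` to read the text of each span.

```python
import bs4

html = """
<div class="toc">
  <ul>
    <li><a href="#Techniques"><span class="tocnumber">1</span>
        <span class="toctext">Techniques</span></a></li>
    <li><a href="#Human_copy-and-paste"><span class="tocnumber">1.1</span>
        <span class="toctext">Human copy-and-paste</span></a></li>
  </ul>
</div>
"""
toc_demo = bs4.BeautifulSoup(html, "html.parser")

# Work out the expression on one tag first ...
a = toc_demo.find("a")
print(a.find("span", class_="toctext").get_text(strip=True))  # Techniques

# ... then put it in a comprehension over all tags
entries = [
    (
        a.find("span", class_="tocnumber").get_text(strip=True),
        a.find("span", class_="toctext").get_text(strip=True),
        a["href"],
    )
    for a in toc_demo.find_all("a")
]
print(entries)
# [('1', 'Techniques', '#Techniques'),
#  ('1.1', 'Human copy-and-paste', '#Human_copy-and-paste')]
```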

## More information

There is much more to Beautiful Soup. See the documentation for details:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-down