# Web Scraping using Beautiful Soup 

This notebook uses Python to look through HTML source code programmatically and pull out only the information we want to collect from a web page.

In [3]:
## load the necessary libraries 

import requests 
from bs4 import BeautifulSoup as bs

Load our first page 

In [5]:
## Load website content

r = requests.get("https://keithgalli.github.io/web-scraping/example.html")

## Convert to a BeautifulSoup object

soup = bs(r.content, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning

## Print out html
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>
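
A slightly more defensive version of the loading step could look like the sketch below. The helper names `make_soup` and `fetch_soup` are made up for illustration and are not part of `requests` or Beautiful Soup; the timeout value is an arbitrary assumption. The offline check at the bottom uses a small hand-written snippet rather than the real page.

```python
import requests
from bs4 import BeautifulSoup


def make_soup(html):
    """Parse HTML with Python's built-in parser, named explicitly."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper).

    raise_for_status() raises requests.HTTPError on 4xx/5xx responses,
    so an error page is not silently parsed as if it were real content.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return make_soup(response.content)


# Offline check with a hand-written snippet (no network access needed).
sample = make_soup("<html><head><title>HTML Example</title></head><body></body></html>")
print(sample.title.string)
```

With network access, `fetch_soup("https://keithgalli.github.io/web-scraping/example.html")` should give a soup equivalent to the one built in the cell above.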



Start using Beautiful Soup to scrape

In [6]:
soup

<html>
<head>
<title>HTML Example</title>
</head>
<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>
</html>

### find and find_all

Let's say I want to grab the h2 elements


In [8]:
first_header = soup.find("h2")

first_header

<h2>A Header</h2>

In [11]:
headers = soup.find_all("h2")

headers ### creates a list of all h2 elements

[<h2>A Header</h2>, <h2>Another header</h2>]
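
The difference between `find` and `find_all` matters most when nothing matches. The sketch below uses a small hand-written snippet (the name `demo` is just a placeholder) to show what each call returns in that case, plus the `limit` argument of `find_all`.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<body><h2>A Header</h2><p>text</p><h2>Another header</h2></body>",
    "html.parser",
)

first = demo.find("h2")                  # the first matching Tag
all_h2 = demo.find_all("h2")             # list-like ResultSet with every match
one_only = demo.find_all("h2", limit=1)  # stop after the first match, still a list

missing = demo.find("h3")                # None when nothing matches
missing_all = demo.find_all("h3")        # empty result when nothing matches

print(first, len(all_h2), missing, len(missing_all))
```

Because `find` can return `None`, it is usually worth checking the result before calling methods on it.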

In [13]:
# Pass in a list of tag names; find returns the first element in the page that matches any of them.
first_header = soup.find(["h2", "h1"])

first_header

<h1>HTML Webpage</h1>

In [14]:
headers = soup.find_all(["h1", "h2"])

headers

[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]

In [21]:
## You can pass in attributes to the find/find_all functions 

paragraph = soup.find_all("p", attrs={"id": "paragraph-id"})

paragraph 

[<p id="paragraph-id"><b>Some bold text</b></p>]
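
Attribute filters can also be written as keyword arguments. The sketch below runs on a hand-written snippet (the `class` values are invented for the example) and shows three equivalent or related spellings.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<p id="paragraph-id" class="note big"><b>Some bold text</b></p>'
    '<p class="note">Other</p>',
    "html.parser",
)

by_attrs = demo.find_all("p", attrs={"id": "paragraph-id"})
by_keyword = demo.find_all("p", id="paragraph-id")  # same filter, keyword form
by_class = demo.find_all("p", class_="note")        # class_ because class is a Python keyword
single = demo.find("p", id="paragraph-id")          # a single Tag instead of a list

print(len(by_attrs), len(by_keyword), len(by_class), single["id"])
```

The `attrs` dictionary form is the safer choice when an attribute name is not a valid Python identifier, such as `data-id`.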

In [23]:
# You can nest find/find_all calls

body= soup.find("body")
div= body.find("div")
header = div.find("h1")

header

<h1>HTML Webpage</h1>

In [24]:
# We can search for specific strings in our find/find_all calls

print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>



In [28]:
## Let's say we want to find any paragraph containing the text "Some"

import re

paragraphs = soup.find_all("p", string = re.compile("Some"))
paragraphs

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [29]:
## Find elements with different capitalization 

headers = soup.find_all("h2", string = re.compile("(H|h)eader"))
headers

[<h2>A Header</h2>, <h2>Another header</h2>]
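
Two details of string matching are worth a sketch. First, `re.IGNORECASE` is an alternative to spelling out `(H|h)`. Second, the `string` argument is compared against the tag's `.string`, which is `None` when a tag mixes text with child tags, so such a tag is not matched; filtering on `get_text()` with a function is one workaround. The snippet and the name `demo` are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<h2>A Header</h2><h2>Another header</h2>"
    "<p>Link to more: <a href='#'>a link</a></p>"
    "<p><i>Some italicized text</i></p>",
    "html.parser",
)

# Case-insensitive match instead of "(H|h)eader".
headers = demo.find_all("h2", string=re.compile("header", re.IGNORECASE))

# The first <p> holds text plus an <a> tag, so its .string is None: no match.
no_match = demo.find_all("p", string=re.compile("Link"))

# Workaround: pass a function and test the combined text of the tag.
by_text = demo.find_all(lambda tag: tag.name == "p" and "Link" in tag.get_text())

print(len(headers), len(no_match), len(by_text))
```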

### select (CSS selector) 

In [33]:
print(soup.body.prettify())

<body>
 <div align="middle">
  <h1>
   HTML Webpage
  </h1>
  <p>
   Link to more interesting example:
   <a href="https://keithgalli.github.io/web-scraping/webpage.html">
    keithgalli.github.io/web-scraping/webpage.html
   </a>
  </p>
 </div>
 <h2>
  A Header
 </h2>
 <p>
  <i>
   Some italicized text
  </i>
 </p>
 <h2>
  Another header
 </h2>
 <p id="paragraph-id">
  <b>
   Some bold text
  </b>
 </p>
</body>



In [34]:
content = soup.select("div p")
content

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]

In [35]:
paragraph = soup.select("h2 ~ p")
paragraph 

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [38]:
bold_text = soup.select("p#paragraph-id b")
bold_text

[<b>Some bold text</b>]

In [39]:
paragraphs = soup.select("body > p")
print(paragraphs)

for element in paragraphs: 
    print(element.select("i"))

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<i>Some italicized text</i>]
[]
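
The selectors used above can be compared side by side. This sketch rebuilds a trimmed copy of the page by hand (so the text inside the `div` is an assumption, not the real content) and also shows `select_one`, which returns a single element or `None`.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<body><div><h1>HTML Webpage</h1><p>In div</p></div>"
    "<h2>A Header</h2><p><i>Some italicized text</i></p>"
    "<h2>Another header</h2><p id='paragraph-id'><b>Some bold text</b></p></body>",
    "html.parser",
)

in_div = demo.select("div p")         # "A B": any <p> nested inside a <div>
direct = demo.select("body > p")      # "A > B": <p> that is a direct child of <body>
after_h2 = demo.select("h2 + p")      # "A + B": <p> immediately following an <h2>
first_bold = demo.select_one("p#paragraph-id b")  # first match only
nothing = demo.select_one("table")    # None when nothing matches

print(len(in_div), len(direct), len(after_h2), first_bold, nothing)
```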


### Getting different properties of html

In [45]:
## Let's say I want the text content of the header

## Use .string when the element contains a single piece of text
header = soup.find("h2")
header.string

'A Header'

In [46]:
## If an element has multiple child elements, .string returns None; use get_text() instead

div = soup.find("div")
print(div.prettify())
print(div.get_text())

<div align="middle">
 <h1>
  HTML Webpage
 </h1>
 <p>
  Link to more interesting example:
  <a href="https://keithgalli.github.io/web-scraping/webpage.html">
   keithgalli.github.io/web-scraping/webpage.html
  </a>
 </p>
</div>


HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
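
`get_text` takes optional `separator` and `strip` arguments that help when the raw output runs pieces together or carries stray whitespace. A minimal sketch on a hand-written `div` (the link text here is invented):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<div><h1>HTML Webpage</h1><p>Link to more: <a href='#'>a link</a></p></div>",
    "html.parser",
)
demo_div = demo.find("div")

no_string = demo_div.string                           # None: more than one child
raw = demo_div.get_text()                             # pieces joined with nothing between them
clean = demo_div.get_text(separator=" ", strip=True)  # strip each piece, join with a space
pieces = list(demo_div.stripped_strings)              # the same pieces as a list

print(repr(raw))
print(repr(clean))
print(pieces)
```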



Get a specific property from an element 

In [47]:
link = soup.find("a")
link 

<a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a>

In [48]:
link['href']

'https://keithgalli.github.io/web-scraping/webpage.html'

In [50]:
paragraphs = soup.select("p#paragraph-id")
paragraphs[0]['id']

'paragraph-id'
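
Square-bracket access raises `KeyError` when the attribute is missing, which is common on real pages. The sketch below, on an invented pair of links, shows `.get()` as a tolerant alternative, the `.attrs` dictionary, and `href=True` as a filter for tags that have the attribute at all.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<a href="https://example.com/page" class="ext link">x</a><a>no href</a>',
    "html.parser",
)
links = demo.find_all("a")

href = links[0]["href"]                # raises KeyError if the attribute is absent
missing = links[1].get("href")         # None instead of an exception
fallback = links[1].get("href", "")    # or a default of your choice
all_attrs = links[0].attrs             # every attribute as a dict
classes = links[0]["class"]            # class is multi-valued, so this is a list

# Keep only the links that actually carry an href.
hrefs = [a["href"] for a in demo.find_all("a", href=True)]

print(href, missing, classes, hrefs)
```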

# Code Navigation

In [55]:
## Path syntax: reach nested tags as attributes
soup.body

<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>

In [60]:
soup.body.div.p.a.string

'keithgalli.github.io/web-scraping/webpage.html'
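
Each step in the dotted path behaves like `find` for that tag name: it returns the first match, or `None` if there is none. A long chain therefore fails with `AttributeError` as soon as one step is missing. The sketch below (hand-written HTML, placeholder names) shows one way to guard against that.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<body><div><p>first <a href='#'>link</a></p><p>second</p></div></body>",
    "html.parser",
)

first_p = demo.body.div.p                # only the first <p>, like find("p")
same = demo.body.div.find("p")           # equivalent explicit call
link_text = demo.body.div.p.a.string

table = demo.body.table                  # no <table> on the page, so this is None
# demo.body.table.tr would raise AttributeError; check before going deeper.
rows = table.find_all("tr") if table is not None else []

print(first_p, link_text, table, rows)
```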

### Know the terms: parent, sibling and child 

In the page below, "body" is the parent of "div", "div" is a child of "body", and "h2" is a sibling of "div".

Siblings are elements at the same level, that is, elements that share the same parent.

In [63]:
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>



In [64]:
## finding the siblings of "div"
soup.body.find("div").find_next_siblings()

[<h2>A Header</h2>,
 <p><i>Some italicized text</i></p>,
 <h2>Another header</h2>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
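
The same family vocabulary maps onto a few other attributes and methods. This is a sketch on a trimmed, hand-written copy of the page; it shows `.parent`, `.children`, and the sibling methods in both directions.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<body><div><h1>HTML Webpage</h1></div>"
    "<h2>A Header</h2><p>One</p><h2>Another header</h2></body>",
    "html.parser",
)
demo_div = demo.find("div")

parent_name = demo_div.parent.name                              # the enclosing tag
child_names = [child.name for child in demo_div.children]       # direct children only
next_names = [tag.name for tag in demo_div.find_next_siblings()]
next_p = demo_div.find_next_sibling("p")                        # first following <p> sibling

last_h2 = demo.find_all("h2")[-1]
# Previous siblings come back nearest first, i.e. in reverse page order.
prev_names = [tag.name for tag in last_h2.find_previous_siblings()]

print(parent_name, child_names, next_names, next_p, prev_names)
```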