<a href="https://colab.research.google.com/github/daryllman/basic-webscraper/blob/master/BasicWebscraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Load in the necessary libraries

In [2]:
import requests
from bs4 import BeautifulSoup as bs #pip install beautifulsoup4

## Load sample webpage content



In [7]:
# Load the sample webpage
sample_url = 'https://keithgalli.github.io/web-scraping/example.html'
r = requests.get(sample_url)

# Convert to a beautiful soup object
soup = bs(r.content, "html.parser")  # name the parser explicitly to avoid a warning and parser-dependent results
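
The loading step above can be wrapped in a small helper that also checks the HTTP status before parsing. This is a minimal sketch, not part of the original notebook: the names `fetch_soup` and `parse_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(markup):
    """Parse raw HTML (str or bytes) with Python's built-in parser."""
    return BeautifulSoup(markup, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper for illustration)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return parse_html(response.content)


# Offline check of the parsing half; fetch_soup(sample_url) would do the download.
demo = parse_html("<h2>A Header</h2>")
```

Calling `raise_for_status()` means a failed request raises an error instead of silently parsing an error page.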

In [None]:
#print(r.content)
#print(soup)
print(soup.prettify())

## Using Beautiful Soup

### find() & find_all()

In [19]:
first_header = soup.find("h2")
print(first_header)

<h2>A Header</h2>


In [23]:
headers = soup.find_all("h2")
print(headers)

[<h2>A Header</h2>, <h2>Another header</h2>]
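
When nothing matches, `find()` returns `None` and `find_all()` returns an empty list. The sketch below uses a small inline HTML string (an assumption, so it runs without a network connection) to show both cases and the `limit` argument.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2>A Header</h2><h2>Another header</h2>", "html.parser")

first = soup.find("h2")                    # first matching tag
missing = soup.find("h3")                  # None, because there is no <h3>
all_h2 = soup.find_all("h2")               # list-like ResultSet with both headers
no_h3 = soup.find_all("h3")                # empty list, not None
only_one = soup.find_all("h2", limit=1)    # stop after the first match

# Guard against None before touching attributes of a find() result.
title = missing.string if missing is not None else "(no h3 found)"
print(first.string, len(all_h2), len(no_h3), title)
```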


In [24]:
# Pass in a list of elements to look for
first_header = soup.find(["h1", "h2"])
print(first_header)

<h1>HTML Webpage</h1>


In [27]:
headers = soup.find_all(["h1", "h2"])
print(headers)

[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]


In [29]:
# Pass attributes to find() & find_all()
paragraph = soup.find_all("p")
print(paragraph)

paragraph2 = soup.find_all("p", attrs={"id":"paragraph-id"})
print(paragraph2)

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>, <p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<p id="paragraph-id"><b>Some bold text</b></p>]
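
Attributes can be passed either through the `attrs` dictionary or as keyword arguments. The following sketch uses a made-up HTML snippet (the `class` and `data-role` values are assumptions for illustration) to compare the forms.

```python
from bs4 import BeautifulSoup

html = """
<p>Plain</p>
<p id="paragraph-id" class="note highlight">By id</p>
<p class="note" data-role="aside">By data attribute</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all("p", attrs={"id": "paragraph-id"})
by_kwarg = soup.find_all("p", id="paragraph-id")            # same query as above
by_class = soup.find_all("p", class_="note")                # class_ avoids the Python keyword
by_data = soup.find_all("p", attrs={"data-role": "aside"})  # name is not a valid keyword argument

print(by_attrs == by_kwarg, len(by_class), by_data[0].get_text())
```

The `attrs` dictionary is the safer choice for attribute names that contain a hyphen or clash with Python keywords.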


In [35]:
# Nesting find() & find_all() calls
body = soup.find("body")
print(body)
print("________________________")
div = body.find("div")
print(div)
print("________________________")
header = div.find("h1")
print(header)

<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>
________________________
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
________________________
<h1>HTML Webpage</h1>


In [41]:
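# Search for specific strings in find() & find_all()
# Note: string= replaces the older text= argument (Beautiful Soup 4.4.0+)
import re  # regular expressions allow partial matches
paragraphs = soup.find_all("p", string="Some")  # a plain string must match the whole text
print(paragraphs)

paragraphs2 = soup.find_all("p", string=re.compile("Some"))  # a regex matches anywhere in the text
print(paragraphs2)

headers = soup.find_all("h2", string=re.compile("(H|h)eader"))
print(headers)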

[]
[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<h2>A Header</h2>, <h2>Another header</h2>]
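
The empty first result above comes from exact matching: a plain string has to equal the element's whole text, while a compiled regex can match part of it. Here is a small offline sketch using the `string` argument (called `text` in older Beautiful Soup releases); the inline HTML is an assumption that mirrors the sample page.

```python
import re
from bs4 import BeautifulSoup

html = """
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
"""
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("p", string="Some")                # no <p> has exactly this text
full = soup.find_all("p", string="Some bold text")       # whole text matches
partial = soup.find_all("p", string=re.compile("Some"))  # regex search inside the text
any_case = soup.find_all("h2", string=re.compile("header", re.IGNORECASE))

print(len(exact), len(full), len(partial), len(any_case))
```

Passing `re.IGNORECASE` is an alternative to spelling out both capitalizations in the pattern.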


### Select (CSS selector)
Useful link: [https://www.w3schools.com/cssref/css_selectors.asp](https://www.w3schools.com/cssref/css_selectors.asp)

In [None]:
print(soup.body) #simple shorthand
print("_______________________")
print(soup.body.prettify())

In [42]:
content = soup.select("p")
print(content)

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>, <p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]


In [46]:
content2 = soup.select("div p")
print(content2)

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]


In [48]:
paragraphs = soup.select("h2 ~ p") # get every p that is a later sibling of an h2 (use "h2 + p" for only the p directly after)
print(paragraphs)

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
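
The `~` and `+` combinators are easy to confuse. As a rough illustration on a made-up snippet (an assumption, not the sample page), `~` selects every later sibling and `+` selects only the immediately following one.

```python
from bs4 import BeautifulSoup

html = """
<body>
<h2>First</h2>
<p>one</p>
<p>two</p>
<h2>Second</h2>
<p>three</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# General sibling combinator: every <p> that comes after some <h2> sibling.
general = [p.get_text() for p in soup.select("h2 ~ p")]
# Adjacent sibling combinator: only a <p> that immediately follows an <h2>.
adjacent = [p.get_text() for p in soup.select("h2 + p")]

print(general)
print(adjacent)
```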


In [50]:
bold_text = soup.select("p#paragraph-id b") # search for b elements nested under the p whose id is "paragraph-id"
print(bold_text)

[<b>Some bold text</b>]


In [53]:
paragraphs = soup.select("body > p")
print(paragraphs)

for paragraph in paragraphs:
  para2 = paragraph.select("i")
  print(para2)


[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<i>Some italicized text</i>]
[]


In [54]:
# Grab elements with a specific attribute value
soup.select("[align=middle]")

[<div align="middle">
 <h1>HTML Webpage</h1>
 <p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
 </div>]
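
Attribute selectors also support presence, prefix, suffix, and substring matches. The sketch below shows a few of them on a hypothetical snippet; the URLs are placeholders, not links from the sample page.

```python
from bs4 import BeautifulSoup

html = """
<div align="middle"><a href="https://example.com/page.html">secure</a></div>
<a href="http://example.com/file.pdf">pdf</a>
<a name="anchor">no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

aligned = soup.select("[align=middle]")              # attribute equals a value
with_href = soup.select("a[href]")                   # attribute is present
https_only = soup.select('a[href^="https://"]')      # value starts with
pdf_links = soup.select('a[href$=".pdf"]')           # value ends with
same_site = soup.select('a[href*="example.com"]')    # value contains

print(len(aligned), len(with_href), len(https_only), len(pdf_links), len(same_site))
```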

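Difference between find()/find_all() and select(): find() and find_all() filter by tag name, attributes, or text, while select() takes a CSS selector. select() is more helpful when you are querying for a specific path through the document.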
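
To make the comparison concrete, here is the same path query written both ways. This is only a sketch on an assumed inline snippet with placeholder URLs.

```python
from bs4 import BeautifulSoup

html = """
<div align="middle">
  <h1>HTML Webpage</h1>
  <p>Link: <a href="https://example.com/inside">inside</a></p>
</div>
<p>Other: <a href="https://example.com/outside">outside</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# find()/find_all(): one call per level of the path.
via_find = soup.find("div").find("p").find_all("a")

# select(): the whole path in a single CSS selector.
via_select = soup.select("div > p > a")

# select_one() returns the first match (or None), much like find().
first = soup.select_one("div > p > a")

print(via_find == via_select, first["href"])
```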

### Get different properties of HTML

In [55]:
header = soup.find("h2")
print(header)
print(header.string)

<h2>A Header</h2>
A Header


In [58]:
div = soup.find("div")
print(div.prettify())
print(div.string) # returns None because the div has more than one child
print(div.get_text()) # use this to get all the text inside the element


<div align="middle">
 <h1>
  HTML Webpage
 </h1>
 <p>
  Link to more interesting example:
  <a href="https://keithgalli.github.io/web-scraping/webpage.html">
   keithgalli.github.io/web-scraping/webpage.html
  </a>
 </p>
</div>

None

HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
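
The `None` above appears because `.string` is only defined when an element has a single child to take the text from. The following sketch, on a compact stand-in for the sample `div` (an assumption), contrasts it with `get_text()` and its `separator` and `strip` options.

```python
from bs4 import BeautifulSoup

html = '<div><h1>HTML Webpage</h1><p>Link: <a href="#">example</a></p></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

single = div.h1.string                            # one child, so .string works
nested = div.string                               # several children, so None
raw = div.get_text()                              # all text, concatenated as-is
clean = div.get_text(separator=" ", strip=True)   # stripped pieces joined by a space
pieces = list(div.stripped_strings)               # the same pieces as a list

print(single, nested, raw, clean, pieces, sep="\n")
```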



In [62]:
# Get a specific attribute from an element
link = soup.find("a")
print(link)
print(link["href"])

paragraphs = soup.select("p#paragraph-id")
print(paragraphs)
print(paragraphs[0]["id"])

<a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a>
https://keithgalli.github.io/web-scraping/webpage.html
[<p id="paragraph-id"><b>Some bold text</b></p>]
paragraph-id
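
Indexing with `tag["name"]` raises a `KeyError` when the attribute is missing, so `.get()` is often the safer choice while scraping. A minimal sketch on an assumed snippet (the class names and URL are placeholders):

```python
from bs4 import BeautifulSoup

html = (
    '<p id="paragraph-id" class="intro lead">'
    '<a href="https://example.com">link</a><a>no href</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

present = links[0]["href"]               # fine, the attribute exists
absent = links[1].get("href")            # None instead of a KeyError
fallback = links[1].get("href", "#")     # or supply a default value
classes = soup.p["class"]                # class is multi-valued, so this is a list
all_attrs = soup.p.attrs                 # every attribute as a dictionary
only_with_href = soup.find_all("a", href=True)  # keep tags that have an href at all

print(present, absent, fallback, classes, all_attrs, len(only_with_href))
```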


### Code Navigation


In [65]:
# Path Syntax
print(soup.body.div)
print(soup.body.div.h1.string)

<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
HTML Webpage


In [69]:
# Know the terms: parent, sibling, child (here: every sibling tag that follows the div)
print(soup.body.find("div"))
print(soup.body.find("div").find_next_siblings())

<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
[<h2>A Header</h2>, <p><i>Some italicized text</i></p>, <h2>Another header</h2>, <p id="paragraph-id"><b>Some bold text</b></p>]
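
The other relationships can be reached in a similar way. Below is a sketch on a simplified stand-in for the sample page (an assumption), covering parent, children, and sibling lookups in both directions.

```python
from bs4 import BeautifulSoup

html = """
<body>
<div><h1>Title</h1><p>Intro</p></div>
<h2>A Header</h2>
<p>First</p>
<h2>Another header</h2>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.body.div

parent_name = div.parent.name                          # the enclosing tag
child_names = [child.name for child in div.children]   # direct children only
raw_next = div.next_sibling                            # the newline text between tags
next_h2 = div.find_next_sibling("h2")                  # next sibling tag named h2
later_tags = [tag.name for tag in div.find_next_siblings()]

first_p = soup.find("p", string="First")
heading_above = first_p.find_previous_sibling("h2")    # search backwards

print(parent_name, child_names, next_h2.string, later_tags, heading_above.string)
```

Note that `.next_sibling` can return whitespace between tags, which is why the `find_next_sibling()` family is usually more convenient.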


## Practices

From [https://keithgalli.github.io/web-scraping/webpage.html](https://keithgalli.github.io/web-scraping/webpage.html)

### Load the Webpage

In [72]:
# Load the sample webpage
sample_url = 'https://keithgalli.github.io/web-scraping/webpage.html'
r = requests.get(sample_url)

# Convert to a beautiful soup object
webpage = bs(r.content, "html.parser")  # name the parser explicitly, as above

In [None]:
# Take a look at the html
print(webpage.prettify())

### 1. Grab all of the social links from the webpage
(Do this in three different ways.)
