<img src="https://sandeepmj.github.io/image-host/green-scrapes.png" >

# Scraping non-tabular content with ```BeautifulSoup```
### We'll learn some basic scraping techniques using this mock <a href="https://sandeepmj.github.io/scrape-example-page/demo-text.html">demo page</a>.


In [2]:
## create cells as necessary

In [3]:
#import library
#what pulls down all the html from the server
import requests

In [4]:
#requesting web content
url = "https://sandeepmj.github.io/scrape-example-page/demo-text.html"
#scrape url website
response = requests.get(url)

In [6]:
response.status_code

200

In [11]:
#what type of object did we capture
type(response)

requests.models.Response

In [12]:
response

<Response [200]>

In [13]:
#Response codes: when you scrape hundreds of pages, there's a chance that one of the URLs is a dud
#we can set up error handling based on the kind of response we get
# <Response 200> means the request succeeded and the page is accessible
# <Response 3xx> (e.g. 301, 302) means your request is being redirected
# <Response 403> means your request was received but declined (forbidden)
# <Response 404> means the page was not found, e.g. a broken link
# <Response 500> means your request hit a generic server error
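
The status-code notes above can be turned into a small check. This is a minimal sketch, assuming a hypothetical helper named `classify_status` (it is not part of `requests`) that buckets a code by its range:

```python
def classify_status(code: int) -> str:
    """Bucket an HTTP status code by its class (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "unknown"


# Sample codes like the ones listed in the notes
for code in (200, 301, 403, 404, 500):
    print(code, classify_status(code))
```

With a live response you would pass `response.status_code` to the helper. `response.raise_for_status()` is the built-in alternative: it raises an exception on 4xx and 5xx codes.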

In [15]:
#see response.text object
response.text

'<!doctype html>\n<html lang="en">\n<head>\n\t<title>title tag</title>\n\t<style>\nbody {padding: 20px; max-width: 700px; margin: 0 auto;}\n</style>\n</head>\n\n<body>\n\t<h1 class="title"><b>The title headline is Demo for BeautifulSoup</b></p></h1>\n\t<p>Learning to scrape using BeautifulSoup.</p>\n\t<div class="content article">\n\t\t<section>\n\t\t<p>Here\'s some pretty useless info:</p>\n\t</section>\n\t\t<section class="main" id="all_plants">\n\t\t\t<h2 class="subhead" id="vegitation">Plants</h2>\n\t\t\t<p class="article">Three plants that thrive in deep shade:</p>\n\t\t\t<ol>\n\t\t\t\t<li><a href="http://example.com/plant1" class="plants life" id="plant1">Plant 1</a>: <span class="cost">$10</span></li>\n\t\t\t\t<li><a href="http://example.com/plant2" class="plants life" id="plant2">Plant 2</a>: <span class="cost">$20</span></li>\n\t\t\t\t<li><a href="http://example.com/plant3" class="plants life" id="plant3">Plant 3</a> <span class="cost">$30</span></li>\n\t\t\t</ol>\n\t\t</secti

In [16]:
type(response.text)

str

In [18]:
#what object does response.content return?
type(response.content)

bytes
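
`response.text` is the decoded string form of the raw bytes held in `response.content`. Here is a small sketch of that relationship using a made-up byte string instead of a live request. The UTF-8 encoding is an assumption for this example; `requests` works out the encoding from the response itself.

```python
# Hypothetical stand-in for response.content: raw bytes as a server might send them
raw_bytes = "<p>Café menu</p>".encode("utf-8")

# Decoding the bytes gives the kind of str that response.text holds
decoded = raw_bytes.decode("utf-8")

print(type(raw_bytes))
print(type(decoded))
print(decoded)
```

BeautifulSoup accepts either form, but passing the string keeps the decoding step explicit.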

In [None]:
#introducing "beautiful soup"
#takes strings and turns them back into traversable, hierarchical data aka something that python can understand
#create a beautiful soup object:


In [19]:
#import beautiful soup
from bs4 import BeautifulSoup

In [20]:
#pass in the HTML text and the name of the parser to use
soup = BeautifulSoup(response.text, "html.parser")

In [22]:
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<title>title tag</title>
<style>
body {padding: 20px; max-width: 700px; margin: 0 auto;}
</style>
</head>
<body>
<h1 class="title"><b>The title headline is Demo for BeautifulSoup</b></h1>
<p>Learning to scrape using BeautifulSoup.</p>
<div class="content article">
<section>
<p>Here's some pretty useless info:</p>
</section>
<section class="main" id="all_plants">
<h2 class="subhead" id="vegitation">Plants</h2>
<p class="article">Three plants that thrive in deep shade:</p>
<ol>
<li><a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>: <span class="cost">$10</span></li>
<li><a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>: <span class="cost">$20</span></li>
<li><a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a> <span class="cost">$30</span></li>
</ol>
</section>
<section class="main" id="all_animals">
<h2 class="subhead" id="creatures">Animals</h2>
<p class="arti

In [23]:
#prettify our printout
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   title tag
  </title>
  <style>
   body {padding: 20px; max-width: 700px; margin: 0 auto;}
  </style>
 </head>
 <body>
  <h1 class="title">
   <b>
    The title headline is Demo for BeautifulSoup
   </b>
  </h1>
  <p>
   Learning to scrape using BeautifulSoup.
  </p>
  <div class="content article">
   <section>
    <p>
     Here's some pretty useless info:
    </p>
   </section>
   <section class="main" id="all_plants">
    <h2 class="subhead" id="vegitation">
     Plants
    </h2>
    <p class="article">
     Three plants that thrive in deep shade:
    </p>
    <ol>
     <li>
      <a class="plants life" href="http://example.com/plant1" id="plant1">
       Plant 1
      </a>
      :
      <span class="cost">
       $10
      </span>
     </li>
     <li>
      <a class="plants life" href="http://example.com/plant2" id="plant2">
       Plant 2
      </a>
      :
      <span class="cost">
       $20
      </span>
     </li>
     <li>
 

In [24]:
#what type of object is it?
type(soup)

bs4.BeautifulSoup

In [None]:
#Targeting Content:
#HTML Tags

In [25]:
#get title of page
soup.title

<title>title tag</title>

In [29]:
#what about headlines? if a tag appears more than once,
#the dot shortcut returns only the first one
soup.h2

<h2 class="subhead" id="vegitation">Plants</h2>

In [30]:
#Searching for IDs
#search by id for "animal1"
soup(id="animal1")

[<a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>]
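
Calling the soup object directly, as in the cell above, is shorthand for `find_all()`, which is why the result comes back wrapped in a list. This sketch compares the three spellings on a made-up snippet modeled on the demo page (the `html` string and the `demo_soup` name are assumptions for this example):

```python
from bs4 import BeautifulSoup

# Made-up snippet modeled on the demo page's markup
html = '<ol><li><a href="http://example.com/animal1" id="animal1">Animal 1</a></li></ol>'
demo_soup = BeautifulSoup(html, "html.parser")

as_list = demo_soup(id="animal1")             # shorthand for find_all(id="animal1")
same_list = demo_soup.find_all(id="animal1")  # explicit form, also a list of matches
single_tag = demo_soup.find(id="animal1")     # first match only, returned as a Tag

print(as_list == same_list)
print(single_tag.get_text())
```

Since an id should be unique on a page, `find(id=...)` is usually the more convenient call because it returns the tag itself.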

In [31]:
#finding class
#a wide net is not the best
soup.p

<p>Learning to scrape using BeautifulSoup.</p>

In [32]:
#method 1 - target the tag only
#simple but w/o precision
soup.find("p")

<p>Learning to scrape using BeautifulSoup.</p>

In [33]:
#method 2 - target the class only
soup.find(class_="article")

<div class="content article">
<section>
<p>Here's some pretty useless info:</p>
</section>
<section class="main" id="all_plants">
<h2 class="subhead" id="vegitation">Plants</h2>
<p class="article">Three plants that thrive in deep shade:</p>
<ol>
<li><a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>: <span class="cost">$10</span></li>
<li><a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>: <span class="cost">$20</span></li>
<li><a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a> <span class="cost">$30</span></li>
</ol>
</section>
<section class="main" id="all_animals">
<h2 class="subhead" id="creatures">Animals</h2>
<p class="article"> Three animals in the barn:</p>
<ol>
<li><a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>: <span class="cost">$500</span></li>
<li><a class="animals life" href="http://example.com/animal2" id="animal2">Animal 2</a>: <span class="cost">$

In [34]:
soup.find("p", class_="article")

<p class="article">Three plants that thrive in deep shade:</p>
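
Combining a tag name with a class is what narrows the match from the outer `<div>` to the paragraph. The same target can also be written as a CSS selector with `select_one()`. This is a sketch on a made-up snippet (the `html` string and the `demo_soup` name are assumptions for this example):

```python
from bs4 import BeautifulSoup

# Made-up snippet: a div and a paragraph that share the class "article"
html = """
<div class="content article">
  <p>Intro paragraph.</p>
  <p class="article">Three plants that thrive in deep shade:</p>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

by_class_only = demo_soup.find(class_="article")  # matches the outer <div> first
by_find = demo_soup.find("p", class_="article")   # tag name plus class
by_css = demo_soup.select_one("p.article")        # CSS selector: <p> with class "article"

print(by_class_only.name)
print(by_css.get_text())
print(by_find == by_css)
```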

In [42]:
#find all tags, classes
soup.find_all("p", class_="article")

In [43]:
type(soup.find_all("p", class_="article"))

bs4.element.ResultSet

In [44]:
len(soup.find_all("p", class_="article"))

3

In [38]:
#capture class article
the_articles = soup.find_all("p", class_="article")

In [45]:
the_articles[0]

<p class="article">Three plants that thrive in deep shade:</p>

In [46]:
type(the_articles[0])

bs4.element.Tag
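
The two return types also behave differently when nothing matches: `find()` gives back `None`, while `find_all()` gives back an empty list. This sketch shows both on a made-up snippet (the `html` string and the class name `no-such-class` are assumptions for this example):

```python
from bs4 import BeautifulSoup

html = '<p class="article">Three animals in the barn:</p>'
demo_soup = BeautifulSoup(html, "html.parser")

missing_one = demo_soup.find("p", class_="no-such-class")      # no match: None
missing_all = demo_soup.find_all("p", class_="no-such-class")  # no match: empty ResultSet

# Guard before calling Tag methods on a find() result
text = missing_one.get_text() if missing_one is not None else ""

print(missing_one)
print(len(missing_all))
```

Looping over an empty `find_all()` result simply does nothing, but calling `.get_text()` on `None` raises an `AttributeError`, so the guard matters when scraping many pages.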

In [47]:
#capture every <a> tag with the class "life"
lifeforms = soup.find_all("a", class_="life")

In [48]:
lifeforms

[<a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>,
 <a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>,
 <a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a>,
 <a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>,
 <a class="animals life" href="http://example.com/animal2" id="animal2">Animal 2</a>,
 <a class="animals life" href="http://example.com/animal3" id="animal3">Animal 3</a>]

In [50]:
for life in lifeforms: 
    print(life.get_text())
    print("*****")

Plant 1
*****
Plant 2
*****
Plant 3
*****
Animal 1
*****
Animal 2
*****
Animal 3
*****


In [60]:
#store text only into a list called lifeforms_fl
lifeforms_fl = []
for life in lifeforms:
    print(type(life))           #a Tag before get_text()
    life = life.get_text()
    print(type(life))           #a str after get_text()
    lifeforms_fl.append(life)   #append once per item

lifeforms_fl

<class 'bs4.element.Tag'>
<class 'str'>
<class 'bs4.element.Tag'>
<class 'str'>
<class 'bs4.element.Tag'>
<class 'str'>
<class 'bs4.element.Tag'>
<class 'str'>
<class 'bs4.element.Tag'>
<class 'str'>
<class 'bs4.element.Tag'>
<class 'str'>


['Plant 1', 'Plant 2', 'Plant 3', 'Animal 1', 'Animal 2', 'Animal 3']

In [58]:
lifeforms_lc = [life.get_text() for life in lifeforms]
lifeforms_lc

['Plant 1', 'Plant 2', 'Plant 3', 'Animal 1', 'Animal 2', 'Animal 3']

In [61]:
type(lifeforms_lc)

list

In [62]:
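#this fails on purpose: lifeforms is a list of tags (a ResultSet),
#so .get() has to be called on each tag, not on the list
lifeforms.get("href")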

AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [63]:
for url in lifeforms:
    print(url.get('href'))

http://example.com/plant1
http://example.com/plant2
http://example.com/plant3
http://example.com/animal1
http://example.com/animal2
http://example.com/animal3


In [80]:
links_fl = []
for url in lifeforms:
    urls = url.get('href')
    links_fl.append(urls)

links_fl

['http://example.com/plant1',
 'http://example.com/plant2',
 'http://example.com/plant3',
 'http://example.com/animal1',
 'http://example.com/animal2',
 'http://example.com/animal3']

In [85]:
links_lc = [url.get('href') for url in lifeforms]
links_lc

['http://example.com/plant1',
 'http://example.com/plant2',
 'http://example.com/plant3',
 'http://example.com/animal1',
 'http://example.com/animal2',
 'http://example.com/animal3']

In [89]:
costs = soup.find_all("span",class_="cost")
costs

[<span class="cost">$10</span>,
 <span class="cost">$20</span>,
 <span class="cost">$30</span>,
 <span class="cost">$500</span>,
 <span class="cost">$600</span>,
 <span class="cost">$700</span>]

In [94]:
costs_fl = []
for cost in costs:
    costs_fl.append(cost.get_text().replace("$", ""))
    
costs_fl

['10', '20', '30', '500', '600', '700']

In [96]:
costs_lc = [cost.get_text().replace("$", "") for cost in costs]
costs_lc

['10', '20', '30', '500', '600', '700']
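
The cleaned costs are still strings, so they will sort and sum as text until they are converted. Here is a minimal sketch of the conversion, using a hard-coded copy of the price strings so the block runs on its own (the names `raw_costs` and `cost_numbers` are assumptions for this example):

```python
# Price strings like the ones scraped above (copied here so the block runs on its own)
raw_costs = ["$10", "$20", "$30", "$500", "$600", "$700"]

# Strip the dollar sign (and any thousands separator), then convert to int
cost_numbers = [int(c.replace("$", "").replace(",", "")) for c in raw_costs]

print(cost_numbers)
print(sum(cost_numbers))
```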

In [97]:
#create dictionaries

In [110]:
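#method 1: a list of dictionaries, one per item
#the loop names differ from lifeforms and costs so those lists are not overwritten
life_list_1 = []
for (product, cost, link) in zip(lifeforms_lc, costs_lc, links_lc):
    life_list_1.append({
        "product": product,
        "cost": cost,
        "link": link})

life_list_1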

[{'product': 'Plant 1', 'cost': '10', 'link': 'http://example.com/plant1'},
 {'product': 'Plant 2', 'cost': '20', 'link': 'http://example.com/plant2'},
 {'product': 'Plant 3', 'cost': '30', 'link': 'http://example.com/plant3'},
 {'product': 'Animal 1', 'cost': '500', 'link': 'http://example.com/animal1'},
 {'product': 'Animal 2', 'cost': '600', 'link': 'http://example.com/animal2'},
 {'product': 'Animal 3', 'cost': '700', 'link': 'http://example.com/animal3'}]
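
One thing to watch with `zip()`: it stops at the shortest list, so a missed price on the page would silently drop a row instead of raising an error. This sketch uses deliberately mismatched sample lists (all names and values here are made up for the example) and adds a simple length check:

```python
products = ["Plant 1", "Plant 2", "Plant 3"]
prices = ["10", "20"]  # deliberately one item short
urls = [
    "http://example.com/plant1",
    "http://example.com/plant2",
    "http://example.com/plant3",
]

# zip() quietly drops "Plant 3" because prices has only two items
rows = [
    {"product": p, "cost": c, "link": u}
    for p, c, u in zip(products, prices, urls)
]

# A simple length check surfaces the mismatch before it reaches a DataFrame
lengths_match = len(products) == len(prices) == len(urls)

print(len(rows))
print(lengths_match)
```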

In [107]:
import pandas as pd

In [111]:
pd.DataFrame(life_list_1)

    product  cost                        link
0   Plant 1    10   http://example.com/plant1
1   Plant 2    20   http://example.com/plant2
2   Plant 3    30   http://example.com/plant3
3  Animal 1   500  http://example.com/animal1
4  Animal 2   600  http://example.com/animal2
5  Animal 3   700  http://example.com/animal3


In [117]:
#method 2: a list of tuples, one per item
life_list_2 = []
for my_items in zip(lifeforms_lc, costs_lc, links_lc):
    life_list_2.append(my_items)

life_list_2

[('Plant 1', '10', 'http://example.com/plant1'),
 ('Plant 2', '20', 'http://example.com/plant2'),
 ('Plant 3', '30', 'http://example.com/plant3'),
 ('Animal 1', '500', 'http://example.com/animal1'),
 ('Animal 2', '600', 'http://example.com/animal2'),
 ('Animal 3', '700', 'http://example.com/animal3')]

In [123]:
df = pd.DataFrame(life_list_2)
df

          0     1                           2
0   Plant 1    10   http://example.com/plant1
1   Plant 2    20   http://example.com/plant2
2   Plant 3    30   http://example.com/plant3
3  Animal 1   500  http://example.com/animal1
4  Animal 2   600  http://example.com/animal2
5  Animal 3   700  http://example.com/animal3


In [127]:
df.columns = ["Product", "Cost", "Link"]
df

    Product  Cost                        Link
0   Plant 1    10   http://example.com/plant1
1   Plant 2    20   http://example.com/plant2
2   Plant 3    30   http://example.com/plant3
3  Animal 1   500  http://example.com/animal1
4  Animal 2   600  http://example.com/animal2
5  Animal 3   700  http://example.com/animal3
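
Once the rows are labeled, a common next step is saving them. This is a sketch using the standard library's `csv` module on two sample rows. It writes to an in-memory buffer so it runs anywhere; the file name `life.csv` in the comment is hypothetical.

```python
import csv
import io

# Two sample rows shaped like the final table
rows = [
    {"Product": "Plant 1", "Cost": "10", "Link": "http://example.com/plant1"},
    {"Product": "Animal 1", "Cost": "500", "Link": "http://example.com/animal1"},
]

# In-memory buffer; swap in open("life.csv", "w", newline="") to write a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Product", "Cost", "Link"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

If the data is already in a DataFrame, pandas' `df.to_csv()` does the same job in one call.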


