<img src="https://sandeepmj.github.io/image-host/green-scrapes.png" >

# Scraping non-tabular content with `BeautifulSoup`
### We'll learn some basic scraping techniques using this mock <a href="https://sandeepmj.github.io/scrape-example-page/demo-text.html">demo page</a>.


## All web scraping requires a little sleuthing:
* Is the content on the page (use `Reveal Source`)?
* Where and how is the content held on the page?
* How can we access it?
* Is there a pattern?
* Is there anything that breaks the pattern?  

In [1]:
## create cells as necessary
# importing libraries
import requests

In [2]:
# requesting web content
url = "https://sandeepmj.github.io/scrape-example-page/demo-text.html"

# scrape url website
response = requests.get(url)

In [3]:
# did it work?
response.status_code

200

In [4]:
response

<Response [200]>

In [5]:
type(response)

requests.models.Response

### Response codes

When you scrape hundreds of pages, there's a chance that one of the URLs might be a dud.

We can set up an error check to see what kind of responses we get.

`<Response [200]>` means the page is accessible.

`<Response [300]>` and other codes in the 300s (such as 301 or 302) mean your request is being redirected.

`<Response [403]>` means your request was received but declined.

`<Response [404]>` means a broken link or no page at that URL.

`<Response [500]>` means your request encountered a generic server error.
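One way such a check could look, as a minimal sketch: bucket each status code by its range before deciding whether to parse the page. The function name `describe_status` and the wording of its verdicts are illustrative and not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable verdict for an HTTP status code."""
    # HTTPStatus supplies the standard reason phrase, e.g. 404 -> "Not Found"
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"

    if 200 <= code < 300:
        verdict = "ok to scrape"
    elif 300 <= code < 400:
        verdict = "redirected"
    elif 400 <= code < 500:
        verdict = "problem with the request or URL"
    elif 500 <= code < 600:
        verdict = "problem on the server"
    else:
        verdict = "unexpected code"
    return f"{code} {phrase}: {verdict}"


for code in (200, 403, 404, 500):
    print(describe_status(code))
```

In this notebook it would be called as `describe_status(response.status_code)`.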

In [6]:
# see response.text object
response.text

'<!doctype html>\n<html lang="en">\n<head>\n\t<title>title tag</title>\n\t<style>\nbody {padding: 20px; max-width: 700px; margin: 0 auto;}\n</style>\n</head>\n\n<body>\n\t<h1 class="title"><b>The title headline is Demo for BeautifulSoup</b></p></h1>\n\t<p>Learning to scrape using BeautifulSoup.</p>\n\t<div class="content article">\n\t\t<section>\n\t\t<p>Here\'s some pretty useless info:</p>\n\t</section>\n\t\t<section class="main" id="all_plants">\n\t\t\t<h2 class="subhead" id="vegitation">Plants</h2>\n\t\t\t<p class="article">Three plants that thrive in deep shade:</p>\n\t\t\t<ol>\n\t\t\t\t<li><a href="http://example.com/plant1" class="plants life" id="plant1">Plant 1</a>: <span class="cost">$10</span></li>\n\t\t\t\t<li><a href="http://example.com/plant2" class="plants life" id="plant2">Plant 2</a>: <span class="cost">$20</span></li>\n\t\t\t\t<li><a href="http://example.com/plant3" class="plants life" id="plant3">Plant 3</a> <span class="cost">$30</span></li>\n\t\t\t</ol>\n\t\t</secti

In [7]:
# what object does response.text return?
type(response.text)

str

In [8]:
# see response.content object
response.content

b'<!doctype html>\n<html lang="en">\n<head>\n\t<title>title tag</title>\n\t<style>\nbody {padding: 20px; max-width: 700px; margin: 0 auto;}\n</style>\n</head>\n\n<body>\n\t<h1 class="title"><b>The title headline is Demo for BeautifulSoup</b></p></h1>\n\t<p>Learning to scrape using BeautifulSoup.</p>\n\t<div class="content article">\n\t\t<section>\n\t\t<p>Here\'s some pretty useless info:</p>\n\t</section>\n\t\t<section class="main" id="all_plants">\n\t\t\t<h2 class="subhead" id="vegitation">Plants</h2>\n\t\t\t<p class="article">Three plants that thrive in deep shade:</p>\n\t\t\t<ol>\n\t\t\t\t<li><a href="http://example.com/plant1" class="plants life" id="plant1">Plant 1</a>: <span class="cost">$10</span></li>\n\t\t\t\t<li><a href="http://example.com/plant2" class="plants life" id="plant2">Plant 2</a>: <span class="cost">$20</span></li>\n\t\t\t\t<li><a href="http://example.com/plant3" class="plants life" id="plant3">Plant 3</a> <span class="cost">$30</span></li>\n\t\t\t</ol>\n\t\t</sect

In [9]:
# what object does response.content return?
type(response.content)

bytes
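`response.content` holds the raw bytes the server sent, while `response.text` is those bytes decoded into a string. The relationship can be sketched without a network call; the sample bytes below are made up for illustration and do not come from the demo page.

```python
# Raw bytes, like response.content (sample data, not from the demo page)
raw = b"<p>Caf\xc3\xa9 prices: $10</p>"

# Decoding turns bytes into str, like response.text
text = raw.decode("utf-8")

print(type(raw))   # <class 'bytes'>
print(type(text))  # <class 'str'>
print(text)        # <p>Café prices: $10</p>
```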

## Introducing `BeautifulSoup`

In [10]:
# importing library
from bs4 import BeautifulSoup

In [11]:
# turn it into soup
soup = BeautifulSoup(response.text, "html.parser")

In [12]:
# prettify the printout
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   title tag
  </title>
  <style>
   body {padding: 20px; max-width: 700px; margin: 0 auto;}
  </style>
 </head>
 <body>
  <h1 class="title">
   <b>
    The title headline is Demo for BeautifulSoup
   </b>
  </h1>
  <p>
   Learning to scrape using BeautifulSoup.
  </p>
  <div class="content article">
   <section>
    <p>
     Here's some pretty useless info:
    </p>
   </section>
   <section class="main" id="all_plants">
    <h2 class="subhead" id="vegitation">
     Plants
    </h2>
    <p class="article">
     Three plants that thrive in deep shade:
    </p>
    <ol>
     <li>
      <a class="plants life" href="http://example.com/plant1" id="plant1">
       Plant 1
      </a>
      :
      <span class="cost">
       $10
      </span>
     </li>
     <li>
      <a class="plants life" href="http://example.com/plant2" id="plant2">
       Plant 2
      </a>
      :
      <span class="cost">
       $20
      </span>
     </li>
     <li>
 

In [13]:
# what type of object is it?
type(soup)

bs4.BeautifulSoup

### Targeting content

In [14]:
# HTML tags
# get title of page

soup.title

<title>title tag</title>

In [15]:
soup.h2 ## it only returns the first one

<h2 class="subhead" id="vegitation">Plants</h2>

In [16]:
# Search by ID
soup(id="animal1") # returns a list

[<a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>]
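Calling the soup object directly behaves like `find_all`, which is why a list comes back even for a unique `id`. A small sketch of the difference from `find`, using a trimmed-down copy of the demo markup rather than the live page:

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the demo page's markup
html = """
<ol>
  <li><a href="http://example.com/animal1" class="animals life" id="animal1">Animal 1</a></li>
  <li><a href="http://example.com/animal2" class="animals life" id="animal2">Animal 2</a></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for find_all: a list comes back
as_list = soup(id="animal1")
same_list = soup.find_all(id="animal1")

# find returns the single matching tag, so no indexing is needed
one_tag = soup.find(id="animal1")

print(len(as_list))        # 1
print(one_tag.get_text())  # Animal 1
```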

### Selecting `class`

In [17]:
# a wide net is not best
soup.p

<p>Learning to scrape using BeautifulSoup.</p>

### Method 1. Target the tag only.
`soup.find("tag_name")`

In [18]:
# simple but without precision
# still too wide a net
soup.find("p")

<p>Learning to scrape using BeautifulSoup.</p>

### Method 2. Target the class only
`soup.find(class_="class_name")`

In [19]:
soup.find(class_="article")

<div class="content article">
<section>
<p>Here's some pretty useless info:</p>
</section>
<section class="main" id="all_plants">
<h2 class="subhead" id="vegitation">Plants</h2>
<p class="article">Three plants that thrive in deep shade:</p>
<ol>
<li><a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>: <span class="cost">$10</span></li>
<li><a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>: <span class="cost">$20</span></li>
<li><a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a> <span class="cost">$30</span></li>
</ol>
</section>
<section class="main" id="all_animals">
<h2 class="subhead" id="creatures">Animals</h2>
<p class="article"> Three animals in the barn:</p>
<ol>
<li><a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>: <span class="cost">$500</span></li>
<li><a class="animals life" href="http://example.com/animal2" id="animal2">Animal 2</a>: <span class="cost">$

### Method 3. Precision, clarity and simplicity
`soup.find("tag_name", class_="class_name")`

In [20]:
soup.find("p", class_="article")

<p class="article">Three plants that thrive in deep shade:</p>
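The three methods can be compared side by side. This sketch uses a shortened version of the demo markup, so it runs without fetching the page; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened version of the demo page's markup
html = """
<p>Learning to scrape using BeautifulSoup.</p>
<div class="content article">
  <p class="article">Three plants that thrive in deep shade:</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find("p")                     # first <p> of any kind
by_class = soup.find(class_="article")      # first tag of any kind with that class
by_both = soup.find("p", class_="article")  # first <p> with that class

print(by_tag.get_text())   # Learning to scrape using BeautifulSoup.
print(by_class.name)       # div
print(by_both.get_text())  # Three plants that thrive in deep shade:
```

The class-only search lands on the `div` because `class="content article"` counts as having the class `article`, and the `div` comes first in the document.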

### `find_all` tags, classes
`soup.find_all("tag_name", class_="class_name")`

* `find` returns only the first matching element; it does not return a list.
* `find_all` returns a list (a `ResultSet`) of every matching element.

In [21]:
# return all p tags with the class `article`
soup.find_all("p", class_="article")

[<p class="article">Three plants that thrive in deep shade:</p>,
 <p class="article"> Three animals in the barn:</p>,
 <p class="article"> Three shiny rocks:</p>]

In [22]:
type(soup.find_all("p", class_="article"))

bs4.element.ResultSet

In [23]:
len(soup.find_all("p", class_="article"))

3
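The two methods also differ when nothing matches, which matters once a scraper runs over many pages. A short sketch, where `no-such-class` is a made-up class name chosen so that the search fails:

```python
from bs4 import BeautifulSoup

html = '<p class="article">Three shiny rocks:</p>'
soup = BeautifulSoup(html, "html.parser")

missing_one = soup.find("p", class_="no-such-class")
missing_all = soup.find_all("p", class_="no-such-class")

print(missing_one)  # None
print(missing_all)  # []

# Guard before calling get_text, since None has no such method
text = missing_one.get_text() if missing_one is not None else ""
```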

In [24]:
# capture class article
the_articles = soup.find_all("p", class_="article")
the_articles

[<p class="article">Three plants that thrive in deep shade:</p>,
 <p class="article"> Three animals in the barn:</p>,
 <p class="article"> Three shiny rocks:</p>]

In [25]:
# find all life forms on the page
life_forms = soup.find_all("a", class_="life")
life_forms

[<a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>,
 <a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>,
 <a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a>,
 <a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>,
 <a class="animals life" href="http://example.com/animal2" id="animal2">Animal 2</a>,
 <a class="animals life" href="http://example.com/animal3" id="animal3">Animal 3</a>]

In [26]:
# target all plant life only
soup.find_all("a", class_="plants")

[<a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>,
 <a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>,
 <a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a>]
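`class_="plants"` matches any tag that has `plants` among its classes, whatever other classes it carries. If both classes are required, a CSS selector passed to `select` is one option. In this sketch the `seedling` anchor is invented to show the difference and is not on the demo page.

```python
from bs4 import BeautifulSoup

# The "seedling" anchor is hypothetical; the demo page has no such element
html = """
<a class="plants life" id="plant1">Plant 1</a>
<a class="animals life" id="animal1">Animal 1</a>
<a class="plants" id="seedling">Seedling</a>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches when the value is any one of the tag's classes
any_plants = [a["id"] for a in soup.find_all("a", class_="plants")]

# A CSS selector can require both classes at once
plants_and_life = [a["id"] for a in soup.select("a.plants.life")]

print(any_plants)       # ['plant1', 'seedling']
print(plants_and_life)  # ['plant1']
```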

### `get_text` converts a tag's HTML to a plain string; it can't be called on a list, so we have to iterate

In [27]:
soup.h1.get_text()

'The title headline is Demo for BeautifulSoup'

In [28]:
for life in life_forms:
    print(life.get_text())
    print("***")

Plant 1
***
Plant 2
***
Plant 3
***
Animal 1
***
Animal 2
***
Animal 3
***
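Some of the demo page's paragraphs begin with a stray space. As a small sketch, passing `strip=True` to `get_text` trims that whitespace while iterating; the markup below is copied from two of the `article` paragraphs.

```python
from bs4 import BeautifulSoup

html = """
<p class="article">Three plants that thrive in deep shade:</p>
<p class="article"> Three animals in the barn:</p>
"""
soup = BeautifulSoup(html, "html.parser")
articles = soup.find_all("p", class_="article")

# A ResultSet is a list of tags, so get_text is called once per tag
raw = [p.get_text() for p in articles]
clean = [p.get_text(strip=True) for p in articles]

print(raw[1])    # ' Three animals in the barn:' (leading space kept)
print(clean[1])  # 'Three animals in the barn:'
```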


### Storing values

In [29]:
lifeforms_fl = []

for life in life_forms:
    lifeforms_fl.append(life.get_text())

lifeforms_fl

['Plant 1', 'Plant 2', 'Plant 3', 'Animal 1', 'Animal 2', 'Animal 3']

In [30]:
lifeforms_lc = [ life.get_text() for life in life_forms ]
lifeforms_lc

['Plant 1', 'Plant 2', 'Plant 3', 'Animal 1', 'Animal 2', 'Animal 3']

## Capturing URLs
`.get("href")`

In [31]:
for link in life_forms:
    print(link.get("href"))

http://example.com/plant1
http://example.com/plant2
http://example.com/plant3
http://example.com/animal1
http://example.com/animal2
http://example.com/animal3


In [32]:
# use for loop and save into links_fl

links_fl = []

for link in life_forms:
    links_fl.append(link.get("href"))

links_fl

['http://example.com/plant1',
 'http://example.com/plant2',
 'http://example.com/plant3',
 'http://example.com/animal1',
 'http://example.com/animal2',
 'http://example.com/animal3']

In [33]:
links_lc = [ link.get("href") for link in life_forms ]
links_lc

['http://example.com/plant1',
 'http://example.com/plant2',
 'http://example.com/plant3',
 'http://example.com/animal1',
 'http://example.com/animal2',
 'http://example.com/animal3']
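`.get("href")` is a forgiving way to read an attribute: it returns `None` when the attribute is absent, whereas square brackets raise an error. The sketch below invents an anchor without an `href` to show the difference; no such anchor exists on the demo page.

```python
from bs4 import BeautifulSoup

# The second anchor has no href; it is invented here to show the difference
html = """
<a href="http://example.com/plant1" id="plant1">Plant 1</a>
<a id="no_link">No link</a>
"""
soup = BeautifulSoup(html, "html.parser")

hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)  # ['http://example.com/plant1', None]

# Square brackets raise KeyError for a missing attribute
try:
    soup.find(id="no_link")["href"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# Keep only the anchors that actually carry a link
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)  # ['http://example.com/plant1']
```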

### Costs

In [34]:
costs = soup.find_all("span", class_="cost")
costs

[<span class="cost">$10</span>,
 <span class="cost">$20</span>,
 <span class="cost">$30</span>,
 <span class="cost">$500</span>,
 <span class="cost">$600</span>,
 <span class="cost">$700</span>]

In [35]:
costs_fl = []

for cost in costs:
    costs_fl.append(int(cost.get_text().replace("$", "")))

costs_fl

[10, 20, 30, 500, 600, 700]

In [36]:
costs_lc = [ int(cost.get_text().replace("$", "")) for cost in costs ]
costs_lc


[10, 20, 30, 500, 600, 700]
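The clean-up step can be pulled into a small helper so that it is easy to extend. This is a sketch only: `parse_cost` is a hypothetical name, and the comma handling covers prices such as `$1,200`, which do not appear on the demo page.

```python
def parse_cost(text):
    """Turn a price string such as '$1,200' into an int."""
    # Drop surrounding whitespace, the dollar sign, and thousands separators
    cleaned = text.strip().replace("$", "").replace(",", "")
    return int(cleaned)


samples = ["$10", " $500 ", "$1,200"]
print([parse_cost(s) for s in samples])  # [10, 500, 1200]
```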

### Method 1: Create a list of dictionaries

In [37]:
life_list_dict = []

# loop names differ from life_forms and costs so the earlier ResultSets are not overwritten
for (product, cost, link) in zip(lifeforms_lc, costs_lc, links_lc):
    life_list_dict.append({
        "product": product,
        "cost": cost,
        "link": link
    })

life_list_dict

[{'product': 'Plant 1', 'cost': 10, 'link': 'http://example.com/plant1'},
 {'product': 'Plant 2', 'cost': 20, 'link': 'http://example.com/plant2'},
 {'product': 'Plant 3', 'cost': 30, 'link': 'http://example.com/plant3'},
 {'product': 'Animal 1', 'cost': 500, 'link': 'http://example.com/animal1'},
 {'product': 'Animal 2', 'cost': 600, 'link': 'http://example.com/animal2'},
 {'product': 'Animal 3', 'cost': 700, 'link': 'http://example.com/animal3'}]
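`zip` pairs items by position and stops at the shortest input, so a single missing value on a page can silently drop rows. A minimal sketch with a made-up scenario in which one price is missing:

```python
names = ["Plant 1", "Plant 2", "Plant 3"]
costs = [10, 20]  # one price missing (made-up scenario)
links = [
    "http://example.com/plant1",
    "http://example.com/plant2",
    "http://example.com/plant3",
]

rows = list(zip(names, costs, links))
print(len(rows))  # 2, because zip stops at the shortest input

# A cheap guard to run before zipping
same_length = len(names) == len(costs) == len(links)
print(same_length)  # False
```

On Python 3.10 and later, `zip(names, costs, links, strict=True)` raises a `ValueError` instead of truncating.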

In [38]:
import pandas as pd

In [39]:
pd.DataFrame(life_list_dict)

| | product | cost | link |
|---|---|---|---|
| 0 | Plant 1 | 10 | http://example.com/plant1 |
| 1 | Plant 2 | 20 | http://example.com/plant2 |
| 2 | Plant 3 | 30 | http://example.com/plant3 |
| 3 | Animal 1 | 500 | http://example.com/animal1 |
| 4 | Animal 2 | 600 | http://example.com/animal2 |
| 5 | Animal 3 | 700 | http://example.com/animal3 |
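Building three separate lists and zipping them assumes every item has a name, a cost, and a link. An alternative sketch walks each `<li>` and pulls all three values from the same element, so they cannot drift out of alignment. The markup is a trimmed copy of the demo page, and skipping incomplete items is an assumption about the desired behavior.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the demo page's list markup
html = """
<ol>
  <li><a href="http://example.com/plant1" class="plants life">Plant 1</a>: <span class="cost">$10</span></li>
  <li><a href="http://example.com/animal1" class="animals life">Animal 1</a>: <span class="cost">$500</span></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for item in soup.find_all("li"):
    link_tag = item.find("a", class_="life")
    cost_tag = item.find("span", class_="cost")
    if link_tag is None or cost_tag is None:
        continue  # skip list items that lack a name or a price
    records.append({
        "product": link_tag.get_text(),
        "cost": int(cost_tag.get_text().replace("$", "")),
        "link": link_tag.get("href"),
    })

print(records)
```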


### Method 2: Create a list of tuples

In [40]:
life_list = []

for my_items in zip(lifeforms_lc, costs_lc, links_lc):
    life_list.append(my_items)

life_list

[('Plant 1', 10, 'http://example.com/plant1'),
 ('Plant 2', 20, 'http://example.com/plant2'),
 ('Plant 3', 30, 'http://example.com/plant3'),
 ('Animal 1', 500, 'http://example.com/animal1'),
 ('Animal 2', 600, 'http://example.com/animal2'),
 ('Animal 3', 700, 'http://example.com/animal3')]

In [41]:
df = pd.DataFrame(life_list)
df.columns = ["Product", "Cost", "More Info"]
df

| | Product | Cost | More Info |
|---|---|---|---|
| 0 | Plant 1 | 10 | http://example.com/plant1 |
| 1 | Plant 2 | 20 | http://example.com/plant2 |
| 2 | Plant 3 | 30 | http://example.com/plant3 |
| 3 | Animal 1 | 500 | http://example.com/animal1 |
| 4 | Animal 2 | 600 | http://example.com/animal2 |
| 5 | Animal 3 | 700 | http://example.com/animal3 |
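The column names can also be supplied when the DataFrame is created, which saves the separate `df.columns` assignment. A short sketch with two sample rows taken from the tuple list:

```python
import pandas as pd

# Two sample rows in the same shape as the tuple list
life_list = [
    ("Plant 1", 10, "http://example.com/plant1"),
    ("Animal 1", 500, "http://example.com/animal1"),
]

# Name the columns at construction time
df = pd.DataFrame(life_list, columns=["Product", "Cost", "More Info"])

print(df.columns.tolist())  # ['Product', 'Cost', 'More Info']
print(df["Cost"].sum())     # 510
```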
