# Introduction

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bsObj = BeautifulSoup(html.read(), 'html5lib')
print(bsObj.h1)

<h1>An Interesting Title</h1>


Any of the following can produce the same output.

In [5]:
bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1

<h1>An Interesting Title</h1>

bsObj.tagName returns the **first occurrence** of that tag on the page.
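
As a small offline illustration of the first-occurrence rule, the sketch below parses an inline HTML string instead of fetching a page. The markup and the names `sample_html` and `sample` are made up for this example, and it assumes Python's built-in 'html.parser' in place of 'html5lib' so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <h1> tags (not taken from the page above)
sample_html = """
<html><body>
<h1>First Title</h1>
<p>Some text</p>
<h1>Second Title</h1>
</body></html>
"""

# 'html.parser' ships with Python; the cells above use 'html5lib' instead
sample = BeautifulSoup(sample_html, 'html.parser')

# Attribute access returns only the first matching tag
first_h1 = sample.h1
print(first_h1)

# The longer paths reach the same tag object
same_tag = sample.html.body.h1 is first_h1 and sample.body.h1 is first_h1
print(same_tag)

# findAll() is needed to see every occurrence
all_titles = [h1.get_text() for h1 in sample.findAll('h1')]
print(all_titles)
```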

### Error Handling
* The page is not found on the server
  - the error may be '404 Page Not Found'
  - urlopen raises 'HTTPError'
* The server is not found
  - urlopen raises 'URLError'

In [14]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('http://www.danielmao.com/')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It worked!')

The server could not be found!
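
One detail worth noting is that HTTPError is a subclass of URLError, so the HTTPError clause has to come first or it would never be reached. A minimal sketch of wrapping this pattern in a reusable helper follows. The name `fetch` and its return-None-on-failure convention are assumptions made for illustration, not part of the original notebook.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        print(f'HTTP error {e.code} for {url}')
    except URLError as e:
        # No usable response at all: unknown host, refused connection, etc.
        print(f'Could not open {url}: {e.reason}')
    return None


# HTTPError inherits from URLError, which is why it must be caught first
print(issubclass(HTTPError, URLError))
```

A caller can then test the result with `if body is None:` before handing it to BeautifulSoup.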


Even when the page is retrieved successfully, the code can still fail if a tag doesn't exist.  
If you access a tag that does not exist, BeautifulSoup returns **None**. Accessing another tag on that **None** object raises an AttributeError, for example:  
**print(bsObj.nonExistentTag.someTag)**

In [15]:
try:
    badContent = bsObj.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)

Tag was not found
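
The same checks can be folded into a helper that walks a chain of tag names and stops at the first missing one. This is a sketch under assumptions: `get_nested_tag` is a hypothetical name, the markup is invented, and 'html.parser' is used so the example runs without a network connection or html5lib.

```python
from bs4 import BeautifulSoup


def get_nested_tag(root, *tag_names):
    """Follow tag_names one step at a time; return None if any is missing."""
    current = root
    for tag_name in tag_names:
        # find() returns None when no matching tag exists
        current = current.find(tag_name)
        if current is None:
            return None
    return current


page = BeautifulSoup('<html><body><h1>A Title</h1></body></html>', 'html.parser')

found = get_nested_tag(page, 'body', 'h1')
missing = get_nested_tag(page, 'nonExistingTag', 'anotherTag')

print(found.get_text() if found is not None else 'Tag was not found')
print('Tag was not found' if missing is None else missing)
```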




# Advanced HTML Parsing

find() and findAll() filter an HTML page by tag name and attributes: find() returns the first match and findAll() returns every match.  
  
**findAll(tag, attributes, recursive, text, limit, keywords)**
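
To make the signature more concrete, here is a hedged sketch that exercises several of those arguments on an inline document. The markup is invented for illustration and 'html.parser' stands in for 'html5lib'. Newer BeautifulSoup releases spell the method find_all(), with findAll() kept as an older alias.

```python
from bs4 import BeautifulSoup

# Invented markup, loosely in the spirit of the page scraped in this section
doc = BeautifulSoup("""
<html><body>
<h1 id="title">War and Peace</h1>
<h2>Chapter 1</h2>
<p><span class="green">Anna Pavlovna</span> greeted
<span class="red">Well, Prince</span> and
<span class="green">the prince</span> replied.</p>
</body></html>
""", 'html.parser')

# tag: a single name or a list of names
headings = doc.findAll(['h1', 'h2'])

# attributes: a dictionary of attribute values to match
green = doc.findAll('span', {'class': 'green'})

# limit: stop after the first N matches
first_span = doc.findAll('span', limit=1)

# keywords: match an attribute passed as a keyword argument
by_id = doc.findAll(id='title')

# recursive=False: look only at direct children, so the nested spans are skipped
direct_spans = doc.body.findAll('span', recursive=False)

print(len(headings), len(green), len(first_span), len(by_id), len(direct_spans))
```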

Below is a **bad** practice: a tiny change to the page layout can break the program.

In [None]:
bsObj.findAll('table')[4].findAll('tr')[2].find('td').findAll('div')[1].find('a')
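
A sturdier alternative is usually to select by a stable attribute rather than by position. The sketch below assumes a hypothetical page in which the wanted link carries the class 'product-link'. That class name and the markup are invented here purely to show the idea.

```python
from bs4 import BeautifulSoup

# Invented layout: the link of interest sits in the second table
layout = BeautifulSoup("""
<html><body>
<table><tr><td><div>Banner</div></td></tr></table>
<table><tr><td>
  <div>Intro</div>
  <div><a class="product-link" href="/item/1">Item 1</a></div>
</td></tr></table>
</body></html>
""", 'html.parser')

# Brittle: depends on the exact order of tables and divs
by_position = layout.findAll('table')[1].find('td').findAll('div')[1].find('a')

# Sturdier: depends only on the (assumed) class attribute
by_class = layout.find('a', {'class': 'product-link'})

print(by_class['href'])
print(by_position is by_class)
```

If another table or div is added ahead of the link, the positional chain points at the wrong element or raises an IndexError, while the attribute lookup keeps working.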

In [18]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bsObj = BeautifulSoup(html, "html5lib")

nameList = bsObj.findAll('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


Scrape the product title and price from the website http://www.pythonscraping.com/pages/page3.html

In [27]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(html.read(), 'html5lib')
data = []

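for tr in bsObj.findAll('tr', {'class': 'gift'}):
    cells = tr.findAll('td')
    # Column 0 holds the title and column 2 holds the price
    data.append((cells[0].get_text().strip(), cells[2].get_text().strip()))

# newline='' stops the csv module from writing blank rows on Windows
with open('test.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for title, price in data:
        writer.writerow([title, price])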
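
The same extract-and-write steps can be tried offline. The sketch below is an assumption-laden stand-in: the table markup is a simplified invention rather than the real page, 'html.parser' replaces 'html5lib', the header row is an addition, and the rows go to an in-memory buffer instead of test.csv.

```python
import csv
import io
from bs4 import BeautifulSoup

# Simplified, invented stand-in for the gift table on the real page
table_html = """
<table>
<tr><th>Item Title</th><th>Description</th><th>Cost</th></tr>
<tr class="gift"><td>Vegetable Basket</td><td>A basket of vegetables</td><td> $15.00 </td></tr>
<tr class="gift"><td>Russian Nesting Dolls</td><td>Hand-painted dolls</td><td> $10,000.52 </td></tr>
</table>
"""

rows = []
for tr in BeautifulSoup(table_html, 'html.parser').findAll('tr', {'class': 'gift'}):
    cells = tr.findAll('td')
    rows.append((cells[0].get_text().strip(), cells[2].get_text().strip()))

# Write to an in-memory buffer so the sketch leaves no file behind
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['title', 'price'])  # header row (not in the original cell)
writer.writerows(rows)

print(buffer.getvalue())
```

The second price contains a comma, so csv.writer wraps it in quotes. That is one reason to prefer the csv module over joining strings by hand.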

Regular expressions and BeautifulSoup go hand in hand when it comes to scraping the Web.  
Most functions that take a string argument (e.g., find(id = 'aTagIdHere')) also accept a regular expression.
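
As a quick sketch of that interchangeability, the block below passes first a plain string and then a compiled pattern as the id filter. The table markup and its id values are invented for this example, and 'html.parser' is assumed so it runs offline.

```python
import re
from bs4 import BeautifulSoup

# Invented markup with ids chosen only for this illustration
gifts = BeautifulSoup("""
<table id="giftList">
<tr id="gift1"><td>First</td></tr>
<tr id="gift2"><td>Second</td></tr>
<tr id="footer"><td>Total</td></tr>
</table>
""", 'html.parser')

# A plain string must equal the attribute value exactly
exact = gifts.find(id='gift1')

# A compiled pattern matches every id that fits it
numbered = gifts.findAll(id=re.compile(r'^gift\d+$'))

print(exact['id'])
print([tr['id'] for tr in numbered])
```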

In [30]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(html.read(), 'html5lib')
images = bsObj.findAll('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg


Without the regular expression, the code below returns every image on the page, including the logo; the regular expression on the src attribute narrows the filter to the product images.

In [33]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(html.read(), 'html5lib')
images = bsObj.findAll('img')
for image in images:
    print(image['src'])

../img/gifts/logo.jpg
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
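
To see why the logo is filtered out, the pattern from the earlier regular-expression cell can be tried directly against the src values using only the re module. The list is copied by hand from the two outputs above. BeautifulSoup's documentation describes regular-expression filters as being applied with search(), which this sketch mirrors.

```python
import re

# The pattern from the regular-expression cell, written as a raw string
pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')

# src values copied from the two outputs above
sources = [
    '../img/gifts/logo.jpg',
    '../img/gifts/img1.jpg',
    '../img/gifts/img2.jpg',
    '../img/gifts/img3.jpg',
    '../img/gifts/img4.jpg',
    '../img/gifts/img6.jpg',
]

matching = [src for src in sources if pattern.search(src)]
rejected = [src for src in sources if not pattern.search(src)]

print(matching)
print(rejected)
```

Only the logo path lacks the literal "img" file-name prefix that the pattern requires after "gifts/", so it is the single rejected value.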
