# Web Scraping with Python
https://www.safaribooksonline.com/library/view/web-scraping-with/9781491910283/ch01.html 

https://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html

# Chapter 1. Your First Web Scraper

In [15]:
# Load the libraries first
from urllib.request import urlopen  # urllib downloads the data at a URL
from bs4 import BeautifulSoup  # bs4 parses the downloaded content

In [57]:
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "lxml")  # tell BeautifulSoup to parse with the "lxml" parser

In [17]:
# Two ways to print the heading: h1 is the tag (visible in the page source); .string returns only its text content
print(soup.h1.string)
print(soup.find("h1").get_text())

Totally Normal Gifts
Totally Normal Gifts
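
Both calls print the same text here because the h1 contains a single string. A minimal sketch of where they differ, using a made-up HTML snippet and the built-in "html.parser" (both are assumptions for illustration, not part of the original page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; "html.parser" ships with Python, so lxml is not needed here.
html = "<h1>Totally Normal Gifts</h1><p>Hello <b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")

heading_string = soup.h1.string    # h1 has one text child, so .string returns it
heading_text = soup.h1.get_text()  # same text
para_string = soup.p.string        # p has two children (text and <b>), so .string is None
para_text = soup.p.get_text()      # get_text() joins all nested text

print(heading_string, heading_text, para_string, para_text, sep=" | ")
```

When a tag may contain nested markup, get_text() is usually the safer choice.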


Requests and parsing can fail, so add error handling. Afterwards this cell can be pasted as-is with only the URL changed.

In [215]:
from urllib.request import urlopen
from urllib.error import HTTPError  # raised when the server returns an error status
from bs4 import BeautifulSoup
import re  # regular expressions

def geturl(url):  # fetch the URL and return the text of its first <h1>, or None on failure
    try:
        html = urlopen(url)
    except HTTPError as e:
        print("HTTP Error")
        return None
    try:
        soup = BeautifulSoup(html.read(), "lxml")
        res = soup.body.h1.get_text()  # the scraping logic goes here
    except AttributeError as e:
        print("Attribute Error")
        return None
    return res

res = geturl("http://www.pythonscraping.com/pages/page3.html")  # pass the URL here

if res is None:
    print("Some problem")
else:
    print(res)

Totally Normal Gifts
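
HTTPError only covers responses with an error status. An unreachable host raises URLError, the parent class of HTTPError. The sketch below is one possible variant, not the book's code: the name get_heading, the opener parameter, and the two fake openers are hypothetical and exist only so the example runs without a network connection.

```python
import io
from urllib.error import URLError
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_heading(url, opener=urlopen):
    """Return the text of the first <h1> inside <body>, or None on failure."""
    try:
        response = opener(url)
    except URLError as e:  # HTTPError is a subclass of URLError, so both are caught
        print("Request failed:", e)
        return None
    soup = BeautifulSoup(response.read(), "html.parser")
    try:
        return soup.body.h1.get_text()
    except AttributeError:  # soup.body or soup.body.h1 was None
        print("No <body> or <h1> found")
        return None

# Hypothetical stand-ins for urlopen, used only to exercise each path offline.
def fake_ok(url):
    return io.BytesIO(b"<html><body><h1>Hi</h1></body></html>")

def fake_unreachable(url):
    raise URLError("host unreachable")

def fake_no_heading(url):
    return io.BytesIO(b"<html><body><p>nothing here</p></body></html>")

print(get_heading("http://example.invalid/", fake_ok))
print(get_heading("http://example.invalid/", fake_unreachable))
print(get_heading("http://example.invalid/", fake_no_heading))
```

In real use, call get_heading(url) without the second argument so that urlopen performs the request.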


# Chapter 2. Advanced HTML Parsing
findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

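find() behaves like findAll() with limit=1, except that it returns the single match itself (or None) instead of a list.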

In [41]:
print(soup.find("span", {"class":"excitingNote"}).get_text())

Now with super-colorful bell peppers!
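
The attrs dictionary can also be written as a keyword argument. Because class is a reserved word in Python, Beautiful Soup accepts class_ instead. A small sketch on a made-up snippet (the HTML and the "html.parser" choice are assumptions for illustration; find_all is the current spelling of findAll):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely modelled on the gift page.
html = """
<p>Basket <span class="excitingNote">Now with super-colorful bell peppers!</span></p>
<p>Dolls <span class="plainNote">Ships in a box.</span></p>
"""
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find("span", {"class": "excitingNote"})  # attrs passed as a dict
by_keyword = soup.find("span", class_="excitingNote")   # same filter as a keyword
missing = soup.find("span", {"class": "noSuchClass"})   # no match returns None

print(by_dict.get_text())
print(by_dict is by_keyword)
print(missing)
```

Since find() returns None when nothing matches, chaining .get_text() onto it without a check raises AttributeError.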


limit=3 returns at most the first three matches.

In [42]:
print(soup.findAll("span", {"class":"excitingNote"}, limit=3))

[<span class="excitingNote">Now with super-colorful bell peppers!</span>, <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>, <span class="excitingNote">Also hand-painted by trained monkeys!</span>]
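
To make the effect of limit concrete, here is a sketch on five generated spans (the HTML is invented for the example; find_all is the current spelling of findAll):

```python
from bs4 import BeautifulSoup

# Hypothetical document with five matching spans.
html = "".join(f'<span class="excitingNote">note {i}</span>' for i in range(1, 6))
soup = BeautifulSoup(html, "html.parser")

first_three = soup.find_all("span", {"class": "excitingNote"}, limit=3)
only_one = soup.find_all("span", {"class": "excitingNote"}, limit=1)
single = soup.find("span", {"class": "excitingNote"})

print(len(first_three))         # limit caps the number of matches returned
print(type(only_one).__name__)  # still a list-like ResultSet, even with one item
print(only_one[0] is single)    # find() hands back that one element directly
```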


The result of findAll has no .get_text() method; loop over it and call .get_text() on each element.

In [43]:
nameList = soup.findAll("span", {"class":"excitingNote"}, limit=3)
for name in nameList:
    print(name.get_text())

Now with super-colorful bell peppers!
8 entire dolls per set! Octuple the presents!
Also hand-painted by trained monkeys!
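
The same loop can be written as a list comprehension when the texts are needed as a list. This sketch uses an invented snippet with stray whitespace and passes strip=True to get_text() to trim it:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with extra whitespace around the text.
html = """
<span class="excitingNote">  Now with super-colorful bell peppers!  </span>
<span class="excitingNote">
  8 entire dolls per set!
</span>
"""
soup = BeautifulSoup(html, "html.parser")

# One get_text() call per element; strip=True removes surrounding whitespace.
notes = [span.get_text(strip=True)
         for span in soup.find_all("span", {"class": "excitingNote"})]
print(notes)
```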


In [152]:
# Print the first <tr> with class="gift"
titlelists = soup.findAll("tr", {"class":"gift"}, limit=1)
print(titlelists)

[<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>]


First locate the img whose src is ../img/gifts/img1.jpg. Its parent is the enclosing td, previous_sibling is the td just before that one, and get_text() prints the text inside that td.

In [54]:
print(soup.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())


$15.00
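
This chain works on the page above because the td tags sit directly next to each other. When whitespace separates the tags, previous_sibling returns that whitespace text node instead. A sketch of the pitfall and one way around it, on an invented table (find_previous_sibling is a real Beautiful Soup method; the HTML is an assumption):

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical table where each <td> is on its own line.
html = """<table><tr>
<td>Vegetable Basket</td>
<td>$15.00</td>
<td><img src="../img/gifts/img1.jpg"></td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

cell = soup.find("img", {"src": "../img/gifts/img1.jpg"}).parent
raw_previous = cell.previous_sibling          # the newline between the two <td> tags
price_cell = cell.find_previous_sibling("td")  # skips text nodes, returns the <td>
price = price_cell.get_text()

print(repr(raw_previous))
print(price)
```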



In [None]:
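# title: attribute
# title.name: attribute
# title.string: attribute
# title.parent.name: attribute
# a: attribute
# find_all(): method
# next_siblings: the siblings that follow
# previous_sibling: the preceding sibling
# contents: attribute
# children: attribute
# string: attribute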
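
A short sketch that exercises most of the names listed above on a tiny invented document (the HTML and the variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical document with no whitespace between tags.
html = ("<html><head><title>Gifts</title></head>"
        "<body><p>a</p><p>b</p><p>c</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

title = soup.title                 # first <title> tag
first_p = soup.p                   # first <p> tag
last_p = soup.find_all("p")[-1]    # last <p> tag

following = [s.get_text() for s in first_p.next_siblings]  # siblings after the first <p>
before_last = last_p.previous_sibling.get_text()           # sibling just before the last <p>
body_contents = soup.body.contents        # direct children as a list
body_children = list(soup.body.children)  # same children, produced by an iterator

print(title.name, title.string, title.parent.name)
print(following, before_last, len(body_contents))
```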

# Practice

In [154]:
titlelists = soup.find("tr", {"class":"gift"})
print(titlelists.get_text())


Vegetable Basket

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
Now with super-colorful bell peppers!

$15.00





In [155]:
# findAll returns a ResultSet, so get_text() cannot be called on it directly
titlelists = soup.findAll("tr", {"class":"gift"}, limit=1)
print(titlelists.get_text())

AttributeError: 'ResultSet' object has no attribute 'get_text'

In [181]:
# Loop over the ResultSet with for ... in instead
titlelists = soup.findAll("tr", {"class":"gift"}, limit=1)
for titlelist in titlelists:
    print(titlelist.get_text())


Vegetable Basket

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
Now with super-colorful bell peppers!

$15.00





In [199]:
# Get the content of the first cell
titlelists = soup.find("tr", {"class":"gift"})
print(titlelists.contents[0].get_text())


Vegetable Basket



In [201]:
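# .next is the next element in parse order, which here is the first <td> (the same cell as above), not the following column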
print(titlelists.next.get_text())


Vegetable Basket



In [212]:
# next_sibling of the first <td> is the second <td>
print(titlelists.next.next_sibling)

<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>
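
The last two cells mix two kinds of navigation: next_element follows parse order (so it steps into a tag's children), while next_sibling stays on the same level. A sketch on an invented two-cell row, using the explicit name next_element:

```python
from bs4 import BeautifulSoup

# Hypothetical row with two cells and no whitespace between tags.
html = '<tr class="gift"><td>Vegetable Basket</td><td>$15.00</td></tr>'
soup = BeautifulSoup(html, "html.parser")

row = soup.find("tr", {"class": "gift"})
first_cell = row.next_element           # steps into the row: its first child <td>
inside_first = first_cell.next_element  # steps into the cell: its text node
second_cell = first_cell.next_sibling   # stays on the same level: the second <td>

print(first_cell)
print(repr(inside_first))
print(second_cell)
print(row.next_sibling)                 # the row has no sibling in this snippet
```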


In [216]:
images = soup.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images: 
    print(image["src"])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
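
Beautiful Soup tests a compiled pattern against each attribute value with the pattern's search() method. The sketch below applies the same pattern to a few sample paths using only the re module, which makes it easy to check a pattern before using it in a scraper. The paths other than img1.jpg and img6.jpg are made up for the example.

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged.
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Sample src values; logo.jpg and img3.png are hypothetical non-matches.
srcs = [
    "../img/gifts/logo.jpg",
    "../img/gifts/img1.jpg",
    "../img/gifts/img3.png",
    "../img/gifts/img6.jpg",
]

matching = [src for src in srcs if pattern.search(src)]
print(matching)
```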
