## Web Scraping
* Beautiful Soup
* Selenium


### 1. Beautiful Soup

#### 1.0 Install the package
* Python 2+ : pip install beautifulsoup4
* Python 3+ : pip3 install beautifulsoup4

#### 1.1 Import the Beautiful Soup module

In [13]:
from bs4 import BeautifulSoup

#### 1.2 Parse a web page with Beautiful Soup

In [248]:
html_doc = """
<html><head><title>Hello World</title>

<style>
    .large {
      color:blue;
      text-align: center;
    }
</style>

</head>
<body><h2>Test Header</h2>
<p>This is a test.</p>
<a id="link1" href="https://www.google.com.tw"> Google網站</a>
<a id="link2" class="large" href="https://www.facebook.com.tw">FaceBook</a>
<p>Hello, <b id="link1" class="boldtext">Bold Text</b></p>
</body></html>
"""

# Parse the HTML source with Beautiful Soup
soup = BeautifulSoup(html_doc, 'html.parser')
#soup = BeautifulSoup(open('data/A.html'), 'html.parser')
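
The commented-out line above parses a file instead of a string. A minimal sketch of that variant follows, assuming a UTF-8 HTML file; the file name `A.html`, the temporary directory, and the name `file_soup` are hypothetical and exist only to keep the example self-contained.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html_text = "<html><head><title>Hello World</title></head><body></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a throwaway HTML file so the example does not depend on data/A.html.
    path = os.path.join(tmp_dir, "A.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_text)

    # Open with an explicit encoding; the with block closes the file handle.
    with open(path, encoding="utf-8") as f:
        file_soup = BeautifulSoup(f, "html.parser")

print(file_soup.title.string)  # Hello World
```

Beautiful Soup reads the whole file when the object is constructed, so the parsed tree remains usable after the file is closed.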

In [249]:
print(soup.prettify())

<html>
 <head>
  <title>
   Hello World
  </title>
  <style>
   .large {
      color:blue;
      text-align: center;
    }
  </style>
 </head>
 <body>
  <h2>
   Test Header
  </h2>
  <p>
   This is a test.
  </p>
  <a href="https://www.google.com.tw" id="link1">
   Google網站
  </a>
  <a class="large" href="https://www.facebook.com.tw" id="link2">
   FaceBook
  </a>
  <p>
   Hello,
   <b class="boldtext" id="link1">
    Bold Text
   </b>
  </p>
 </body>
</html>



In [250]:
soup.html

<html><head><title>Hello World</title>
<style>
    .large {
      color:blue;
      text-align: center;
    }
</style>
</head>
<body><h2>Test Header</h2>
<p>This is a test.</p>
<a href="https://www.google.com.tw" id="link1"> Google網站</a>
<a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>
<p>Hello, <b class="boldtext" id="link1">Bold Text</b></p>
</body></html>

In [251]:
soup.head

<head><title>Hello World</title>
<style>
    .large {
      color:blue;
      text-align: center;
    }
</style>
</head>

In [252]:
soup.title

<title>Hello World</title>

In [253]:
soup.body

<body><h2>Test Header</h2>
<p>This is a test.</p>
<a href="https://www.google.com.tw" id="link1"> Google網站</a>
<a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>
<p>Hello, <b class="boldtext" id="link1">Bold Text</b></p>
</body>

In [254]:
soup.a

<a href="https://www.google.com.tw" id="link1"> Google網站</a>

In [255]:
soup.p

<p>This is a test.</p>

In [256]:
# .contents returns a tag's child nodes as a list
print(soup.head.contents)
print(len(soup.head.contents))

for item in soup.head.contents:
    print(item)


[<title>Hello World</title>, '\n', <style>
    .large {
      color:blue;
      text-align: center;
    }
</style>, '\n']
4
<title>Hello World</title>


<style>
    .large {
      color:blue;
      text-align: center;
    }
</style>




In [257]:
# .children iterates over a tag's child nodes
for item in soup.head.children:
    print(item)

<title>Hello World</title>


<style>
    .large {
      color:blue;
      text-align: center;
    }
</style>




In [258]:
# .parent accesses the parent node
print(soup.title)

print(soup.title.parent)

<title>Hello World</title>
<head><title>Hello World</title>
<style>
    .large {
      color:blue;
      text-align: center;
    }
</style>
</head>


In [259]:
print(soup.title.string)

print(soup.title.string.parent)

Hello World
<title>Hello World</title>


In [260]:
# .next_sibling and .previous_sibling access sibling nodes at the same level
print(soup.body)
print("-----")
print(soup.body.p)
print("-----")

body = soup.body
print(body.p)
print(body.p.next_sibling)
print(body.p.next_sibling.next_sibling)
print(body.p.next_sibling.next_sibling.previous_sibling.previous_sibling)

<body><h2>Test Header</h2>
<p>This is a test.</p>
<a href="https://www.google.com.tw" id="link1"> Google網站</a>
<a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>
<p>Hello, <b class="boldtext" id="link1">Bold Text</b></p>
</body>
-----
<p>This is a test.</p>
-----
<p>This is a test.</p>


<a href="https://www.google.com.tw" id="link1"> Google網站</a>
<p>This is a test.</p>
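
The blank lines in the output above appear because the newline between two tags is itself a sibling node. If only tags are wanted, `find_next_sibling()` and `find_previous_sibling()` skip those text nodes. The sketch below uses a shortened copy of the sample HTML; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<body><h2>Test Header</h2>
<p>This is a test.</p>
<a id="link1" href="https://www.google.com.tw">Google</a>
</body>"""
sibling_soup = BeautifulSoup(html, "html.parser")
first_p = sibling_soup.p

# .next_sibling returns the newline text node that follows the <p> tag.
print(repr(first_p.next_sibling))  # '\n'

# find_next_sibling() / find_previous_sibling() return the nearest Tag instead.
next_tag = first_p.find_next_sibling()
prev_tag = first_p.find_previous_sibling()
print(next_tag.name)  # a
print(prev_tag.name)  # h2
```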


In [261]:
body = soup.body
print(body)
print("-----")
for sibling in body.p.next_siblings:
    print(sibling)


<body><h2>Test Header</h2>
<p>This is a test.</p>
<a href="https://www.google.com.tw" id="link1"> Google網站</a>
<a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>
<p>Hello, <b class="boldtext" id="link1">Bold Text</b></p>
</body>
-----


<a href="https://www.google.com.tw" id="link1"> Google網站</a>


<a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>


<p>Hello, <b class="boldtext" id="link1">Bold Text</b></p>




In [262]:
# Get all text content of the page
print(soup.get_text())


Hello World

    .large {
      color:blue;
      text-align: center;
    }


Test Header
This is a test.
 Google網站
FaceBook
Hello, Bold Text




In [263]:
for string in soup.strings:
    print(string)
    #print(repr(string))



Hello World



    .large {
      color:blue;
      text-align: center;
    }





Test Header


This is a test.


 Google網站


FaceBook


Hello, 
Bold Text






In [264]:
# .stripped_strings removes surrounding whitespace and skips whitespace-only strings:
for string in soup.stripped_strings:
    print(string)
    #print(repr(string))

Hello World
.large {
      color:blue;
      text-align: center;
    }
Test Header
This is a test.
Google網站
FaceBook
Hello,
Bold Text
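
When the stripped strings are needed as one string rather than one per line, `get_text()` accepts `separator` and `strip` arguments. A small sketch on a made-up fragment follows (the fragment and variable names are not from the notebook).

```python
from bs4 import BeautifulSoup

snippet = "<div>\n<p> Hello, <b>Bold Text</b> </p>\n<p>Second line</p>\n</div>"
text_soup = BeautifulSoup(snippet, "html.parser")

# strip=True strips each string and drops the empty ones;
# separator is placed between the strings that remain.
joined = text_soup.get_text(separator=" ", strip=True)
print(joined)  # Hello, Bold Text Second line

# The same pieces, as a list.
pieces = list(text_soup.stripped_strings)
print(pieces)  # ['Hello,', 'Bold Text', 'Second line']
```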


In [265]:
# Get a node's text content
title_tag = soup.title
print(title_tag)
print(title_tag.text)
print(title_tag.string)

<title>Hello World</title>
Hello World
Hello World


In [266]:
# Find a node by tag name
a_tag = soup.find('a')
print(a_tag.text)
print(a_tag['href'])

 Google網站
https://www.google.com.tw


In [267]:
# Find a node by tag name and id
a_tag = soup.find(name ='a',attrs={"id":"link2"})
#a_tag = soup.find('a',{'id':"link2"})
print(a_tag.text)
print(a_tag['href'])

FaceBook
https://www.facebook.com.tw


In [268]:
# Find a node by tag name and class name
a_tag = soup.find(name ='a',attrs={"class":"large"})
#a_tag = soup.find('a',{'class':"large"})
#a_tag = soup.find('a','large')

print(a_tag.text)
print(a_tag['href'])
print(a_tag['class'])

FaceBook
https://www.facebook.com.tw
['large']


In [269]:
# Find all matching nodes
a_tags = soup.find_all('a')
for tag in a_tags:
    print(tag.text)
    print(tag['href'])    

 Google網站
https://www.google.com.tw
FaceBook
https://www.facebook.com.tw


In [270]:
# Read node attributes
for tag in a_tags:
    print(tag.get('href'))
    print(tag.get('class'))

https://www.google.com.tw
None
https://www.facebook.com.tw
['large']
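
The `None` in the output above comes from `tag.get('class')` on a link that has no `class` attribute. Indexing with `tag['class']` would raise `KeyError` in that case. A short sketch of the difference, using a one-line fragment and illustrative variable names:

```python
from bs4 import BeautifulSoup

html = '<a id="link1" href="https://www.google.com.tw">Google</a>'
attr_soup = BeautifulSoup(html, "html.parser")
link = attr_soup.a

print(link["href"])             # https://www.google.com.tw
print(link.get("class"))        # None
print(link.get("class", []))    # []
print(link.has_attr("class"))   # False

# Square-bracket access raises KeyError when the attribute is missing.
try:
    link["class"]
    missing = False
except KeyError:
    missing = True
print(missing)  # True
```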


In [271]:
tags = soup.find(["a", "b"])
print(tags)

<a href="https://www.google.com.tw" id="link1"> Google網站</a>


In [272]:
# Find all hyperlinks and bold text
tags = soup.find_all(["a", "b"])
print(tags)

for tag in tags:
    print(tag)
    print(tag.text)
    print(tag.get('href'))
    #print(tag['href'])

[<a href="https://www.google.com.tw" id="link1"> Google網站</a>, <a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>, <b class="boldtext" id="link1">Bold Text</b>]
<a href="https://www.google.com.tw" id="link1"> Google網站</a>
 Google網站
https://www.google.com.tw
<a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>
FaceBook
https://www.facebook.com.tw
<b class="boldtext" id="link1">Bold Text</b>
Bold Text
None


In [273]:
# Limit the number of results
tags = soup.find_all(["a", "b"], limit=2)
print(tags)

[<a href="https://www.google.com.tw" id="link1"> Google網站</a>, <a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>]


In [274]:
# find() treats a bare string as a tag name, so this looks for a <link1> tag and finds nothing
tag = soup.find('link1')
print(tag)

None
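
The call above returns `None` because `'link1'` is an id, not a tag name. To search by id, pass it as the `id` keyword argument. The sketch below repeats the two elements from the sample HTML that share `id="link1"`; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<a id="link1" href="https://www.google.com.tw">Google</a>
<p>Hello, <b id="link1" class="boldtext">Bold Text</b></p>"""
id_soup = BeautifulSoup(html, "html.parser")

# find() returns the first element whose id matches.
first = id_soup.find(id="link1")
print(first.name)  # a

# find_all() returns every match; the sample HTML reuses the id on <a> and <b>.
all_matches = id_soup.find_all(id="link1")
print([t.name for t in all_matches])  # ['a', 'b']

# Combine a tag name with the id to narrow the search.
only_bold = id_soup.find("b", id="link1")
print(only_bold.text)  # Bold Text
```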


In [275]:
# select(): CSS selectors
soup.select("title")

[<title>Hello World</title>]

In [276]:
soup.select("body a")

[<a href="https://www.google.com.tw" id="link1"> Google網站</a>,
 <a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>]

In [277]:
# Select by CSS class name
soup.select(".large")

[<a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>]

In [290]:
soup.select("a.large")

[<a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>]

In [291]:
# Select by id 
soup.select("#link1")

[<a href="https://www.google.com.tw" id="link1"> Google網站</a>,
 <b class="boldtext" id="link1">Bold Text</b>]

In [293]:
soup.select("a#link1")

[<a href="https://www.google.com.tw" id="link1"> Google網站</a>]

In [297]:
# Select by attribute
soup.select('a[href]')

[<a href="https://www.google.com.tw" id="link1"> Google網站</a>,
 <a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>]

In [298]:
soup.select('a[class]')

[<a class="large" href="https://www.facebook.com.tw" id="link2">FaceBook</a>]

In [299]:
soup.select('a[style]')

[]
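
`select()` always returns a list, even for zero or one match. When a single element is wanted, `select_one()` returns the first match or `None`. The sketch below also shows an attribute-value selector (`^=` matches a prefix); the fragment repeats the two links from the sample HTML and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<a id="link1" href="https://www.google.com.tw">Google</a>
<a id="link2" class="large" href="https://www.facebook.com.tw">FaceBook</a>"""
sel_soup = BeautifulSoup(html, "html.parser")

# First match, or None when nothing matches.
first_large = sel_soup.select_one("a.large")
no_style = sel_soup.select_one("a[style]")
print(first_large["id"])  # link2
print(no_style)           # None

# Links whose href starts with a given prefix.
google_links = sel_soup.select('a[href^="https://www.google"]')
print([a["id"] for a in google_links])  # ['link1']
```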

### 2. Selenium

#### 2.0 Install the packages

* Python 2+ : pip install selenium
* Python 3+ : pip3 install selenium
* Install Chrome WebDriver : sudo apt-get install chromium-chromedriver

#### 2.1 Basic Selenium operations

#### 2.1.1 Import the package

In [33]:
from selenium import webdriver

#### 2.1.2 Launch the browser

In [45]:
driver = webdriver.Chrome()
#driver = webdriver.Chrome('/usr/lib/chromium-browser/chromedriver')
#driver = webdriver.Firefox()

#### 2.1.3 Configure the WebDriver

In [46]:
# Maximize the window
#driver.maximize_window()

#### 2.1.4 Visit a URL

In [47]:
url = 'http://www.google.com'
driver.get(url)

#time.sleep(3)

#### 2.1.5 Locate elements

An element can be located by any of the following strategies:
* id
* name
* class name
* link text
* partial link text
* tag name
* xpath
* css selector

In [11]:

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver.get('http://www.google.com.tw')
search_box = driver.find_element(By.ID, "lst-ib")
search_box.clear()
search_box.send_keys("abc")
search_box.send_keys(Keys.ENTER)

In [19]:
import time

from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get('https://tw.yahoo.com')
#driver.set_window_position(0, 0)  # browser window position
#driver.set_window_size(700, 700)  # browser window size
#time.sleep(5)

driver.find_element(By.LINK_TEXT, '新聞').click()  # click the "新聞" (News) link on the page
time.sleep(5)

In [10]:
driver = webdriver.Chrome()
driver.get('https://hahow.in/courses')
#driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')  # scroll to the bottom of the page



#### 2.1.6 Take a screenshot

In [4]:
driver = webdriver.Chrome()
driver.get('http://www.pixiv.net/')
driver.save_screenshot('pic/screen.png')  # save the screenshot

True

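#### 2.1.7 Close the connection
* close(): closes only the current browser window
* quit(): closes the browser and also releases the client/server connection (ends the WebDriver session)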
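
Because `quit()` is what releases the session, one common pattern is to call it in a `finally` block so the browser is cleaned up even if a step fails. The sketch below is one possible arrangement; `fetch_title` and its `make_driver` parameter are hypothetical names, with `make_driver` standing for any callable that returns a driver, such as `webdriver.Chrome`.

```python
def fetch_title(make_driver, url):
    """Open url in a fresh browser, return the page title, always quit."""
    driver = make_driver()
    try:
        driver.get(url)
        return driver.title
    finally:
        # Runs on success and on error, so no browser process is left behind.
        driver.quit()


# Hypothetical usage:
# from selenium import webdriver
# print(fetch_title(webdriver.Chrome, "http://www.google.com"))
```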

In [22]:
driver.close()
#driver.quit()

WebDriverException: Message: chrome not reachable
  (Session info: chrome=70.0.3538.67)
  (Driver info: chromedriver=2.41.578737 (49da6702b16031c40d63e5618de03a32ff6c197e),platform=Windows NT 10.0.17134 x86_64)
