# Parsing HTML

## HTML format

```html
<html>
    <head>...</head>
    <body>
        <div>...</div>
    </body>
    <umsdataelement id="UMSSendDataEventElement"></umsdataelement>
</html>
```

* Syntax
* Structure

Find more at https://www.w3schools.com/html/

## Searching using the strings package

* The `strings` package in Go
* 允許我們針對字串物件進行操作，如：

    * searching for matches
    * counting occurrences
    * splitting strings into arrays


### Example – Counting links

`strings.Count(...)` method:

⇒ Count counts the number of non-overlapping instances of substr in s. 

⇒ If substr is an empty string, Count returns 1 + the number of Unicode code points in s.

```go
fmt.Println(strings.Count("cheese", "e"))
fmt.Println(strings.Count("five", "")) // before & after each rune

// Output:
3
5
```

In [1]:
import(
    "fmt"
    "log"
    "io/ioutil"
    "net/http"
    "strings" 
)

In [2]:
{
    r, err := http.Get("https://www.coursera.org")
    if err != nil {
        log.Fatalln(err)
    }
    
    data, err := ioutil.ReadAll(r.Body)
    if err != nil {
        log.Fatalln(err)
    }
    
    strBody := string(data)
    numLinks := strings.Count(strBody, "<a")
    fmt.Printf("Coursera homepage has %d links!\n", numLinks) 
}

Coursera homepage has 128 links!


### Example – Doctype check

`strings.Contains()` method:

⇒ Contains reports whether substr is within s.

```go
fmt.Println(strings.Contains("seafood", "foo"))
fmt.Println(strings.Contains("seafood", "bar"))
fmt.Println(strings.Contains("seafood", ""))
fmt.Println(strings.Contains("", ""))

// Output:
true
false
true
true
```

In [7]:
{
//     r, err := http.Get("https://www.coursera.org/robots.txt")
    r, err := http.Get("https://www.coursera.org")
    if err != nil {
        log.Fatalln(err)
    }
    
    data, err := ioutil.ReadAll(r.Body)
    if err != nil {
        log.Fatalln(err)
    }
    
    strBody := strings.ToLower(string(data))
    
    if strings.Contains(strBody, "<!doctype html>") {
        
        fmt.Println("網頁為 HTML5")
        
    } else if strings.Contains(strBody, "html/strict.dtd") {
        
        fmt.Println("網頁為 HTML4 (Strict)")

    } else if strings.Contains(strBody, "html/loose.dtd") {
        
        fmt.Println("網頁為 HTML4 (Tranistional)")
        
    } else if strings.Contains(strBody, "html/frameset.dtd") {
        
        fmt.Println("網頁為 HTML4 (Frameset)")
        
    } else {
        
        fmt.Println("無法偵測 doctype!")
        
    }
}

網頁為 HTML5


## Searching using the regexp package

使用正規表達式來進行搜尋。

### Example – Finding links

利用下述表達式來獲取真實連結：

    <a.*href\s*=\s*["'](http[s]{0,1}:\/\/.[^\s]*)["'].*> 

我們預期能找出在所有 `<a>` 標籤中，看起來像 URL 的字串。

In [2]:
import(
    "regexp" 
)

In [9]:
{
    r, err := http.Get("https://www.coursera.org")
    if err != nil {
        log.Fatalln(err)
    }
    
    data, err := ioutil.ReadAll(r.Body)
    if err != nil {
        log.Fatalln(err)
    }
    
    strBody := strings.ToLower(string(data))
    
    re := regexp.MustCompile(`<a.*href\s*=\s*["'](http[s]{0,1}:\/\/.[^\s]*)["'].*>`)
    linkMatches := re.FindAllStringSubmatch(strBody, -1)
    
    fmt.Printf("找到 %d 個連結:\n", len(linkMatches))
    
    for _, linkGroup := range linkMatches {
        fmt.Println(linkGroup[1])
    }
}

找到 3 個連結:
https://www.coursera.org/specializations/machine-learning-algorithms-real-world?utm_source=banners&amp;utm_medium=coursera&amp;utm_content=logged-out&amp;utm_campaign=2019aug-mlalgorithms-amii
https://www.coursera.org/business/?utm_campaign=website&amp;utm_content=banner-from-b2c-home&amp;utm_medium=coursera&amp;utm_source=enterprise
https://www.coursera.org/degrees/mcit-penn


### Example – Finding prices

使用正規表示式搜尋價格

In [57]:
{
    r, err := http.Get("https://www.packtpub.com/application-development/hands-go-programming")
    if err != nil {
        log.Fatalln(err)
    }
    
    data, err := ioutil.ReadAll(r.Body)
    if err != nil {
        log.Fatalln(err)
    }
    
    strBody := strings.ToLower(string(data))

    re := regexp.MustCompile(`<span class="price">*(\$[0-9]*\.[0-9]{0,2})`)
    priceMatches := re.FindStringSubmatch(strBody)
    
    // Try: 列印 priceMatches[0] 看看有何不同？
    fmt.Printf("售價: %s\n", priceMatches[1])
}

售價: $12.00


## Searching using XPath queries

XPath 全名為 XML Path Language, 即 XML 路徑語言，用於查詢 XML 文件中資訊的語言。

HTML 雖然不全然與 XML 相容，亦是相似結構的文檔，因此可用 XPath 來進行搜尋。

使用 XPath queries 如：`//a/@href` 瀏覽 HTML 文件結構，找到 `<a>` 標籤節點，取得其屬性。

參考：https://www.w3.org/TR/xpath/

### Example – Daily deals

使用套件：

    go get github.com/antchfx/htmlquery
    [推薦使用] go get -u github.com/storyicon/graphquery
    go get github.com/PuerkitoBio/goquery

In [59]:
import (
    "strings"
    "github.com/antchfx/htmlquery"
)

found packages not installed in LGOPATH: [github.com/antchfx/htmlquery]
(1/22) installed "github.com/antchfx/xpath"
(2/22) installed "golang.org/x/net/html/atom"
(3/22) installed "golang.org/x/net/html"
(4/22) installed "golang.org/x/text/encoding/internal/identifier"
(5/22) installed "golang.org/x/text/transform"
(6/22) installed "golang.org/x/text/encoding"
(7/22) installed "golang.org/x/text/encoding/internal"
(8/22) installed "golang.org/x/text/encoding/charmap"
(9/22) installed "golang.org/x/text/encoding/japanese"
(10/22) installed "golang.org/x/text/encoding/korean"
(11/22) installed "golang.org/x/text/encoding/simplifiedchinese"
(12/22) installed "golang.org/x/text/encoding/traditionalchinese"
(13/22) installed "golang.org/x/text/internal/utf8internal"
(14/22) installed "golang.org/x/text/runes"
(15/22) installed "golang.org/x/text/encoding/unicode"
(16/22) installed "golang.org/x/text/internal/tag"
(17/22) installed "golang.org/x/text/internal/language"
(18/22) installed "gola

In [106]:
// TODO::
{
    doc, err := htmlquery.LoadURL("https://www.packtpub.com/free-learning")
    if err != nil {
        log.Fatalln(err)
    }

    dealTextNodes := htmlquery.Find(doc, `//*[@id="free-learning-dropin"]/div[1]/div/div[1]/div/div/div[2]//text()`)
    
    fmt.Println("今日免費書")
    fmt.Println("==============================================")
    
    for _, node := range dealTextNodes {
        
        text := strings.TrimSpace(node.Data)
        
        matchTagNames, _ := regexp.Compile("^(div|span|h2|br|ul|li)$")
        text = matchTagNames.ReplaceAllString(text,"")
        if text != "" {
            fmt.Println(text)
        } else {
            fmt.Println("No Free Books available!")
        }
    }
}

今日免費書


### Example – Collecting products

use an XPath query to retrieve the latest releases from the Packt Publishing website. On this web page, there are a series of `<div>` tags that contain more `<div>` tags, which will eventually lead to our information. Each of these `<div>` tags hold an attribute called class, which describes what the purpose of the node is. In particular, we are concerned with the landing-page-row class. The book-related `<div>` tags within the landing-page-row class have an attribute called itemtype, which tells us that the div is for a book and should contain other attributes holding the names and prices. It would not be possible to achieve this with the strings package, and a regular expression would be very laborious to design.



In [108]:
import "strconv"

In [144]:
// TODO :: Fail. DEBUG
{
    //doc, err := htmlquery.LoadURL("https://www.packtpub.com/latest-releases")
    doc, err := htmlquery.LoadURL("https://www.amazon.com/s/browse?_encoding=UTF8&node=283155&ref_=nav_shopall-export_nav_mw_sbd_intl_books")
    if err != nil {
        log.Fatalln(err)
    }
    
    //nodes := htmlquery.Find(doc, `//div[@class="landing-page-rowcf"]/div[@itemtype="http://schema.org/Product"]`)
    nodes := htmlquery.Find(doc, `//*[@id="maincontent"]/div[4]/div/div[3]/text()`)
    fmt.Println("最新書籍")
    fmt.Println("==============================================")
    
    for _, node := range nodes {
        var title string
        var price float64
        
        for _, attribute := range node.Attr {
            fmt.Println(attribute.Key)
            /*
            switch attribute.Key {
            case "data-product-title":
                title = attribute.Val
            case "data-product-price":
                price, err = strconv.ParseFloat(attribute.Val, 64)
                if err != nil {
                    fmt.Println("Failed to parse price")
                }
            }
            */
        }
        fmt.Printf("%s ($%0.2f)\n", title, price)
    }
}

最新書籍


## Searching using Cascading Style Sheets selectors