# Web Scraping Etiquette

## robots.txt

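A plain-text file that lists access rules, one per line, telling search engines and other crawlers which files and directories they may fetch.

Crawlers look for it at the root of the site, for example: https://www.coursera.org/robots.txt

Its contents look like this: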
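Because the file always sits at the site root, a crawler can work out its location from any page URL on the same host. The following is a minimal sketch using only the standard library; the helper name `robotsURL` and the sample page URL are made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// robotsURL returns the robots.txt location for the site that hosts pageURL.
// It keeps the scheme and host and replaces the rest with /robots.txt.
func robotsURL(pageURL string) (string, error) {
	u, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("not an absolute URL: %q", pageURL)
	}
	root := url.URL{Scheme: u.Scheme, Host: u.Host, Path: "/robots.txt"}
	return root.String(), nil
}

func main() {
	// Hypothetical page URL, used only for illustration.
	loc, err := robotsURL("https://www.coursera.org/learn/some-course?x=1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(loc) // https://www.coursera.org/robots.txt
}
```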

```
User-agent: *
Allow: /api/utilities/v1/imageproxy
Disallow: /maestro/api/
Disallow: /api/
Disallow: /maestro/
Disallow: /ui/
Disallow: /signature/voucher/
Disallow: /account/email_verify/
Disallow: /acclaimbadge/
Disallow: /voucher/
Sitemap: https://www.coursera.org/sitemap.xml
```
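In the sample above, `/api/` is disallowed while `/api/utilities/v1/imageproxy` is allowed. Under the Robots Exclusion Protocol (RFC 9309), the most specific (longest) matching rule wins, and `Allow` wins a tie. The sketch below models that rule for plain path prefixes only; it does not handle the `*` and `$` wildcards, and the `rule` type and `allowed` function are illustrative names, not part of any library.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is one Allow or Disallow line from a robots.txt group.
type rule struct {
	allow  bool
	prefix string
}

// allowed reports whether path may be fetched under rules.
// The longest matching prefix wins; on a tie, Allow wins.
// A path that matches no rule is allowed.
func allowed(rules []rule, path string) bool {
	best := -1
	result := true
	for _, r := range rules {
		// An empty prefix carries no restriction, so skip it.
		if r.prefix == "" || !strings.HasPrefix(path, r.prefix) {
			continue
		}
		n := len(r.prefix)
		if n > best || (n == best && r.allow) {
			best = n
			result = r.allow
		}
	}
	return result
}

func main() {
	// A subset of the rules from the sample file above.
	rules := []rule{
		{allow: true, prefix: "/api/utilities/v1/imageproxy"},
		{allow: false, prefix: "/maestro/api/"},
		{allow: false, prefix: "/api/"},
		{allow: false, prefix: "/ui/"},
	}
	for _, p := range []string{"/api/utilities/v1/imageproxy", "/api/", "/learn"} {
		fmt.Println(p, allowed(rules, p))
	}
}
```

Running it prints `true` for `/api/utilities/v1/imageproxy`, `false` for `/api/`, and `true` for `/learn`, which matches no rule.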

### directives

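* `User-agent: *` means the rules that follow apply to every crawler (user agent), not only to browsers.
* `Allow: ...` a path that may be fetched.
* `Disallow: ...` a path that must not be fetched.
* `Crawl-delay: 2` asks the crawler to wait at least two seconds between successive requests to the site. This directive is a non-standard extension, so not every crawler honors it.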
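A crawler that wants to respect `Crawl-delay` needs to read the value and space out its requests accordingly. The sketch below shows one way to do that with the standard library. It is simplified: it takes the first `Crawl-delay` line it finds and ignores which `User-agent` group the line belongs to. The function names `crawlDelay` and `remainingWait` are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// crawlDelay returns the first Crawl-delay value found in a robots.txt body,
// or 0 if there is none. The value is read as a number of seconds.
func crawlDelay(robots string) time.Duration {
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), ":", 2)
		if len(parts) != 2 || !strings.EqualFold(strings.TrimSpace(parts[0]), "crawl-delay") {
			continue
		}
		secs, err := strconv.ParseFloat(strings.TrimSpace(parts[1]), 64)
		if err != nil || secs < 0 {
			continue // skip malformed values
		}
		return time.Duration(secs * float64(time.Second))
	}
	return 0
}

// remainingWait returns how much longer to wait before the next request,
// given how much time has elapsed since the previous one.
func remainingWait(elapsed, delay time.Duration) time.Duration {
	if wait := delay - elapsed; wait > 0 {
		return wait
	}
	return 0
}

func main() {
	delay := crawlDelay("User-agent: *\nCrawl-delay: 2\nDisallow: /api/\n")
	fmt.Println("delay:", delay)

	// Suppose 500ms have passed since the last request.
	// A real crawler would call time.Sleep with this value.
	fmt.Println("wait:", remainingWait(500*time.Millisecond, delay))
}
```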

## User-Agent string

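Some example values:

* Firefox: `Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0`
* cURL: `curl/7.47.0`
* Go: `Go-http-client/1.1`
* Java: `Apache-HttpClient/4.5.2`
* Googlebot (for images): `Googlebot-Image/1.0`

An HTTP request carries this string in its `User-Agent` header, telling the server which client software and version sent the request: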

    GET /index.html HTTP/1.1
    Host: example.com
    User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0
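A polite crawler usually replaces the default `Go-http-client/1.1` value with a string that identifies itself, so that site operators can tell who is making the requests. Here is a minimal sketch with `net/http`; the bot name and contact URL in `userAgent` are hypothetical placeholders. It only builds and prints the request and does not contact any server.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
)

// Hypothetical bot name and contact URL, for illustration only.
const userAgent = "example-crawler/0.1 (+https://example.com/bot-info)"

// newRequest builds a GET request that identifies the crawler
// through the User-Agent header.
func newRequest(target string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	return req, nil
}

func main() {
	req, err := newRequest("http://example.com/index.html")
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	// DumpRequestOut renders the request as it would be sent on the wire,
	// without actually sending it.
	dump, err := httputil.DumpRequestOut(req, false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s", dump)
}
```

To send the request, pass it to `http.DefaultClient.Do(req)` and close the response body when done.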

## Package

    go get github.com/temoto/robotstxt


In [2]:
import (
    "fmt"
    "log"
    "net/http"
    "github.com/temoto/robotstxt"
)

found packages not installed in LGOPATH: [github.com/temoto/robotstxt]
(1/1) installed "github.com/temoto/robotstxt"


In [3]:
{
    rsp, err := http.Get("https://www.coursera.org/robots.txt")
    if err != nil {
        log.Fatalln(err) // Fatalln exits, so no return is needed
    }

    // Parse the rules, then release the connection.
    data, err := robotstxt.FromResponse(rsp)
    rsp.Body.Close()
    if err != nil {
        log.Fatalln(err)
    }

    // Pick the rule group that applies to Go's default User-Agent.
    grp := data.FindGroup("Go-http-client/1.1")
    if grp != nil {
        testUrls := []string{
            "/learn",
            "/courses",
            "/api/utilities/v1/imageproxy",

            // These paths are disallowed by robots.txt
            "/ui/",
            "/api/",
            "/maestro/api/",
        }

        for _, url := range testUrls {
            fmt.Println("checking " + url + "...")

            // Test reports whether the group allows fetching the path.
            if grp.Test(url) {
                fmt.Println("OK")
            } else {
                fmt.Println("X")
            }
        }
    }
}

checking /learn...
OK
checking /courses...
OK
checking /api/utilities/v1/imageproxy...
OK
checking /ui/...
X
checking /api/...
X
checking /maestro/api/...
X
