## Throttle our scraper

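Following the scraping etiquette described earlier reduces the load on the server. When running a scraper, if robots.txt specifies a `Crawl-Delay` in seconds, honor that value first. If it does not, consider falling back to a rule of one request per page per second.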

* Throttling approach:
    Track the timestamp of each HTTP request and make sure the elapsed time between requests is greater than or equal to the delay we want.

In [1]:
import (
    "fmt"
    "log"
    "net/http"
    "time"
)

In [2]:
{
    var lastReqTime time.Time // zero value: no request has been sent yet
    maxNumbOfReq := 5
    pageDelay := 5 * time.Second
    
    for i := 0; i < maxNumbOfReq; i++ {
        elapsedTime := time.Since(lastReqTime)
        
        // Sleep for whatever remains of pageDelay since the last request
        if elapsedTime < pageDelay {
            timeDiff := pageDelay - elapsedTime
            
            fmt.Printf("Sleeping for %.2f(sec)\n", timeDiff.Seconds())
            
            time.Sleep(timeDiff)
        }
        
        fmt.Println("GET example.com/index.html")
        resp, err := http.Get("http://www.example.com/index.html")
        if err != nil {
            log.Fatalln(err) // exits the program, so no return is needed
        }
        resp.Body.Close() // the body is unused; close it to release the connection
        
        // Update the last request time
        lastReqTime = time.Now()
    }
    fmt.Println("Done!\n lastReqTime is", lastReqTime)
}

GET example.com/index.html
Sleeping for 5.00(sec)
GET example.com/index.html
Sleeping for 5.00(sec)
GET example.com/index.html
Sleeping for 5.00(sec)
GET example.com/index.html
Sleeping for 5.00(sec)
GET example.com/index.html
Done!
 lastReqTime is 2019-08-16 02:13:11.8712837 +0000 UTC m=+34.778607701


### [Note] Scraping multiple websites

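When scraping multiple websites, lastReqTime can be turned into a `map` (a key-value structure) that records when each website was last requested.

* key: the website (its host name, e.g. `example.com`)
* value: timestamp of the last request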

```go
var lastReqTime = map[string]time.Time{
    "example.com": time.Time{},
    "packtpub.com": time.Time{},
}
```

The body of the for loop also needs to change:

```go
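if i%2 == 0 {
    // Even iterations request packtpub.com
    webpage = pktPage
    elapsedTime = time.Since(lastReqTime["packtpub.com"])
} else {
    // Odd iterations request example.com (webpage keeps its default, exPage)
    elapsedTime = time.Since(lastReqTime["example.com"])
}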
```

In [18]:
{
    // Last request time per website; the zero time.Time means "not requested yet"
    lastReqTime := map[string]time.Time{
        "example.com": time.Time{},
        "packtpub.com": time.Time{},
    }
    maxNumbOfReq := 5
    pageDelay := 5 * time.Second
    exPage := "http://www.example.com/index.html"
    pktPage := "https://www.packtpub.com"
    
    for i := 0; i < maxNumbOfReq; i++ {
        var elapsedTime time.Duration
        webpage := exPage
        
        // Alternate between the two sites: even i -> packtpub.com, odd i -> example.com
        if i%2 == 0 {
            webpage = pktPage
            elapsedTime = time.Since(lastReqTime["packtpub.com"])
        } else {
            elapsedTime = time.Since(lastReqTime["example.com"])
        }
        
        // Sleep for whatever remains of pageDelay for this site
        if elapsedTime < pageDelay {
            timeDiff := pageDelay - elapsedTime
            
            fmt.Printf("Sleeping for %.2f(sec)\n", timeDiff.Seconds())
            
            time.Sleep(timeDiff)
        }
        
        fmt.Println("GET", webpage)
        resp, err := http.Get(webpage)
        if err != nil {
            log.Fatalln(err) // exits the program, so no return is needed
        }
        resp.Body.Close() // the body is unused; close it to release the connection
        
        // Update the last request time
        if i%2 == 0 {
            lastReqTime["packtpub.com"] = time.Now()
        } else {
            lastReqTime["example.com"] = time.Now()
        }

        // After the final request, report the time it was sent
        if i >= maxNumbOfReq-1 {
            if i%2 == 0 {
                fmt.Println("Done!\n lastReqTime from packtpub.com is", lastReqTime["packtpub.com"])
            } else {
                fmt.Println("Done!\n lastReqTime from example.com is", lastReqTime["example.com"])
            }
        }
    }
}

GET https://www.packtpub.com
GET http://www.example.com/index.html
Sleeping for 4.24(sec)
GET https://www.packtpub.com
GET http://www.example.com/index.html
Sleeping for 2.54(sec)
GET https://www.packtpub.com
Done!
 lastReqTime from packtpub.com is 2019-08-16 02:57:31.5496663 +0000 UTC m=+1272.084318401
