# Advanced Scraping

_Tips and tricks for scraping the unscrapable._

## Installation

To generate random user agents while scraping, install `fake_useragent`:

With pip<br>
`pip install fake_useragent`

Or with conda<br>
`conda install -c mlgill fake-useragent`

## How much is too much?

Most sites publish a `robots.txt` file at their root that tells web crawlers which paths they may and may not access.

In [1]:
import requests

url = 'http://www.github.com/robots.txt'
response = requests.get(url)
print(response.text)

# If you would like to crawl GitHub contact us at support@github.com.
# We also provide an extensive API: https://developer.github.com/

User-agent: baidu
crawl-delay: 1


User-agent: *

Disallow: */pulse
Disallow: */tree/
Disallow: */blob/
Disallow: */wiki/
Disallow: /gist/
Disallow: */forks
Disallow: */stars
Disallow: */download
Disallow: */revisions
Disallow: */issues/new
Disallow: */issues/search
Disallow: */commits/
Disallow: */commits/*?author
Disallow: */commits/*?path
Disallow: */branches
Disallow: */tags
Disallow: */contributors
Disallow: */comments
Disallow: */stargazers
Disallow: */archive/
Disallow: */followers
Disallow: */following
Disallow: */blame/
Disallow: */watchers
Disallow: */network
Disallow: */graphs
Disallow: */raw/
Disallow: */compare/
Disallow: */cache/
Disallow: /.git/
Disallow: */.git/
Disallow: /*.git$
Disallow: /search/advanced
Disallow: /search
Disallow: */search
Disallow: /*q=
Disallow: /*.atom

Disallow: /ekansa/Open-Context-Data
Disallow: /ekansa/openco

At the very end of the full file (cut off in the output above), you will see:
> `User-agent: *` <br>
> `Disallow: /`

This disallows every path for any user agent not covered by an earlier group, and that includes us!

BoxOfficeMojo is much more permissive:

In [2]:
url = 'http://www.boxofficemojo.com/robots.txt'
response = requests.get(url)
print(response.text)

# robots.txt for BoxOfficeMojo
User-agent: *
Allow: /
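
The same check can be done in code rather than by eye. Below is a minimal sketch using the standard-library `urllib.robotparser`. The helper name `is_allowed` and the trimmed rule strings are assumptions made for illustration; the rules are copied from a few lines of the outputs above. Note that this parser matches rules as plain path prefixes, so wildcard rules such as `*/pulse` are not treated as patterns.

```python
from urllib.robotparser import RobotFileParser

# A few prefix rules copied from the GitHub output above (trimmed for brevity).
GITHUB_RULES = """\
User-agent: *
Disallow: /gist/
Disallow: /search
"""

# The BoxOfficeMojo rules shown above.
MOJO_RULES = """\
User-agent: *
Allow: /
"""

def is_allowed(robots_txt, url, agent="*"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the text directly, no network call
    return parser.can_fetch(agent, url)

print(is_allowed(GITHUB_RULES, "http://www.github.com/gist/abc"))        # False
print(is_allowed(GITHUB_RULES, "http://www.github.com/about"))           # True
print(is_allowed(MOJO_RULES, "http://www.boxofficemojo.com/movies/"))    # True
```

For a live site, calling `parser.set_url(...)` followed by `parser.read()` fetches and parses the file instead of a hard-coded string.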



## How often is too often?

Sites commonly block clients that send too many requests within a short time window. You can often avoid this by adding well-placed pauses to your scraper.

Options include:
* pause after every request
* pause after each `n` requests
* pause at random intervals

#### Short pause after every request

In [3]:
# pause after every request
import time

page_list = ['page1','page2','page3']

for page in page_list:
    ### scrape a website
    ### ...
    print(page)
    
    time.sleep(2)

page1
page2
page3


#### Longer pause after every 200 requests

In [4]:
import time

page_list = ['page1','page2','page3','page4','page5','page6']

for i, page in enumerate(page_list):
    ### scrape a website
    ### ...
    print(page)
    
    # (i + 1) needs its own parentheses: % binds tighter than +
    if (i + 1) % 200 == 0:
        time.sleep(320)

page1
page2
page3
page4
page5
page6


#### Random pause after each page (more human-like)

In [5]:
import random

for page in page_list:
    ### scrape a website
    ### ...
    print(page)
    
    time.sleep(0.5 + 2 * random.random())  # wait between 0.5 and 2.5 seconds

page1
page2
page3
page4
page5
page6
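
The three strategies can also be combined: a short random pause after most requests, plus a longer break after every `n`. The sketch below is one possible arrangement, not a recommended schedule. The names `pause_length` and `crawl` and all default values are assumptions chosen to mirror the numbers used above.

```python
import random
import time

def pause_length(request_number, every_n=200, long_pause=320, short_range=(0.5, 2.5)):
    """Seconds to wait after the given request (counted from 1)."""
    if request_number % every_n == 0:
        return long_pause                 # long break after every `every_n` requests
    return random.uniform(*short_range)   # otherwise a short random pause

def crawl(pages, sleep=time.sleep, **pause_options):
    """Visit each page, pausing after each one; returns the delays used."""
    delays = []
    for i, page in enumerate(pages, start=1):
        ### scrape a website
        ### ...
        print(page)
        delay = pause_length(i, **pause_options)
        sleep(delay)
        delays.append(delay)
    return delays

# Demo with tiny values so it finishes quickly
demo_delays = crawl(['page1', 'page2', 'page3', 'page4'],
                    every_n=2, long_pause=0.05, short_range=(0.01, 0.02))
```

Passing `sleep` as a parameter makes it easy to swap in a no-op while testing the loop logic.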


## How do I make requests look like a real browser?

In a human-controlled browser, the user agent string tells the web server which browser and operating system you are using, so the server can tailor its response to your software's capabilities.

By default, `requests` identifies itself as `python-requests/<version>`. When making an automated request, you can set your own user agent so the request looks like it came from a browser.

In [6]:
url = 'http://www.reddit.com'

user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers=user_agent)
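
To confirm which user agent a request would carry without contacting the server, one option is to prepare the request and inspect its headers. This is a small sketch using `requests.Session.prepare_request`; the variable names are illustrative.

```python
import requests

url = 'http://www.reddit.com'
session = requests.Session()

# Prepare (but do not send) one request with default headers and one with ours
default_request = session.prepare_request(requests.Request('GET', url))
custom_request = session.prepare_request(
    requests.Request('GET', url, headers={'User-agent': 'Mozilla/5.0'})
)

print(default_request.headers['User-Agent'])  # starts with "python-requests/"
print(custom_request.headers['User-Agent'])   # Mozilla/5.0
```

Header names are matched case-insensitively, so `'User-agent'` overrides the default `'User-Agent'` entry.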

You can even generate a random user agent to send.

In [7]:
from fake_useragent import UserAgent

ua = UserAgent()
user_agent = {'User-agent': ua.random}
print(user_agent)

response = requests.get(url, headers=user_agent)


Each access to `ua.random` draws a new user agent, so the next one generated will usually have different specifications.

In [None]:
user_agent = {'User-agent': ua.random}
print(user_agent)
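
If `fake_useragent` is not installed, a rough substitute is to pick at random from a short hand-maintained list. The strings below are illustrative examples in the usual browser format, not an authoritative or current list, and `random_user_agent` is a hypothetical helper name.

```python
import random

# Illustrative user agent strings (assumed examples; refresh them periodically)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',
]

def random_user_agent():
    """Return a headers dict with a randomly chosen user agent."""
    return {'User-agent': random.choice(USER_AGENTS)}

user_agent = random_user_agent()
print(user_agent)
```

The result can be passed as `headers=user_agent` to `requests.get`, in the same way as the cells above.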