# Multiple pages scraping I
To scrape multiple pages, the scraper must first pass the website's robot check; otherwise the site will block its requests.
## Pretend to be a browser
We can pass a _headers_ argument to the _requests.get_ function so that the request appears to come from a browser.

In [1]:
import requests
from bs4 import BeautifulSoup
url = "https://www.zillow.com"
#r = requests.get(url)
# Use headers to set up 'User-Agent' to pass the robot test
headers = {'User-Agent': 'Chrome/51.0.2704.84'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, features='lxml')
print(soup.title)

<title>Zillow: Real Estate, Apartments, Mortgages &amp; Home Values</title>
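
A real browser usually sends a longer User-Agent string and a few extra headers. The sketch below builds such a header set. The helper name `browser_headers` and the specific header values are illustrative assumptions, not values any particular site is known to require.

```python
# Hypothetical helper: the name and the default values are illustrative only.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/51.0.2704.84 Safari/537.36"
)

def browser_headers(user_agent=DEFAULT_USER_AGENT, language="en-US,en;q=0.9"):
    """Return a dict of request headers resembling those a desktop browser sends."""
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": language,
    }

headers = browser_headers()
print(headers["User-Agent"])
# The dict can then be passed exactly as before:
# r = requests.get(url, headers=headers)
```

Whether a site accepts these headers depends on the site; some checks look at far more than the User-Agent.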


## Be a real browser
### Selenium
[Selenium](https://www.seleniumhq.org/) automates browsers. Install the WebDriver for your browser before using Selenium:
- Chrome [driver](https://sites.google.com/a/chromium.org/chromedriver/downloads)
- Firefox [driver](https://github.com/mozilla/geckodriver/releases)
- Safari [driver](https://webkit.org/blog/6900/webdriver-support-in-safari-10/)
- or search for "your browser name" + "driver"
#### Selenium searches movies

In [2]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

URL = "https://www.imdb.com"

driver = webdriver.Chrome() # Open a web browser
driver.get(URL) # Load the URL

# Locate the search bar by XPath (find_element_by_xpath was removed in Selenium 4)
elem = driver.find_element(By.XPATH, '//*[@id="navbar-query"]')
elem.send_keys('Spiderman', Keys.ENTER) # Type the keyword and press Enter
#print(driver.page_source)
driver.close()

#### Selenium clicks url that navigates from one page to another

In [3]:
from selenium.webdriver.common.by import By

URL = "https://www.imdb.com/chart/top"
driver = webdriver.Chrome()
driver.get(URL)
driver.find_element(By.LINK_TEXT, "The Godfather").click() # Click the link whose text is "The Godfather"

driver.get_screenshot_as_file("../img/Screenshot/screenshot1.png") # Save a screenshot of the current page
driver.close()
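
If a lookup or click raises an exception, the final `driver.close()` never runs and the browser window stays open. One way to guard against that is a small context manager, sketched below. `managed_driver` and `FakeDriver` are hypothetical names introduced here; the fake driver only stands in for a real one so that the sketch runs without a browser installed.

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(factory):
    """Create a driver with `factory` and always shut it down afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # Runs even if the body of the `with` block raises

# Stand-in for a real driver (assumption: only get() and quit() are needed here).
class FakeDriver:
    def __init__(self):
        self.closed = False

    def get(self, url):
        self.url = url

    def quit(self):
        self.closed = True

with managed_driver(FakeDriver) as driver:
    driver.get("https://www.imdb.com/chart/top")

print(driver.closed)  # True
```

With Selenium installed, passing `webdriver.Chrome` as the factory should work the same way, since Selenium drivers provide a `quit()` method that ends the browser session.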

#### Selenium scrapes Lego information from Amazon

In [4]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from pyquery import PyQuery as pq
import re

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10) # Wait up to 10 seconds before raising TimeoutException
browser.get('https://www.amazon.com')
search_box = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#twotabsearchtextbox"))) # Find the search bar
submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#nav-search > form > div.nav-right > div > input"))) # Find the search button
search_box.send_keys('Lego')
submit.click()


In [5]:
# Check the total number of result pages
total = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#pagn > span.pagnDisabled')))
print(total.text)

400
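
The page count comes back as text, so it has to be converted before it can drive a loop. A minimal sketch, assuming the element's text contains the count as its first run of digits; `parse_total_pages` is a hypothetical helper name.

```python
import re

def parse_total_pages(text):
    """Return the first run of digits in `text` as an int."""
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError(f"no page count found in {text!r}")
    return int(match.group())

print(parse_total_pages("400"))  # 400
```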


In [6]:
# Locate the "next page" button and click it
submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#pagnNextString')))
submit.click()

In [7]:
# Parse the information from the website
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#s-results-list-atf li")))
html = browser.page_source
doc = pq(html)
items = doc("#s-results-list-atf li").items()
for item in items:
    product = {
        'title': item.find('h2').text(),
        'price': item.find('span.a-offscreen').text(),        
    }
    print(product)

browser.close()

{'title': '[Sponsored]THE LEGO MOVIE 2 Escape Buggy 70829 Building Kit (549 Piece)', 'price': '[Sponsored] $40.00'}
{'title': '[Sponsored]LEGO City Sky Police Diamond Heist 60209 Building Kit (400 Piece)', 'price': '[Sponsored] $47.99'}
{'title': '[Sponsored]LEGO Architecture Skyline Collection 21044 Paris Building Kit (649 Piece)', 'price': '[Sponsored] $49.99'}
{'title': '"LEGO Star Wars Yoda\'s Jedi Starfighter 75168 Building Kit (262 Pieces)', 'price': '$23.99'}
{'title': 'LEGO City Arctic Supply Plane 60196 Building Kit (707 Piece)', 'price': '$65.99'}
{'title': 'LEGO Star Wars Solo: A Star Wars Story Kessel Run Millennium Falcon 75212 Building Kit (1414 Piece)', 'price': '$151.84'}
{'title': 'LEGO Creator 3in1 Mythical Creatures 31073 Building Kit (223 Piece)', 'price': '$11.99'}
{'title': 'LEGO City Forest Tractor 60181 Building Kit (174 Piece)', 'price': '$15.99'}
{'title': 'LEGO Ninjago Movie Fire Mech 70615 Building Kit (944 Piece)', 'price': '$69.99'}
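
The raw output mixes a `[Sponsored]` marker into both fields and keeps prices as strings. The sketch below shows one way to tidy a record. `clean_product` and `PRICE_PATTERN` are hypothetical names, and taking the first dollar amount when a field holds several is an assumption, not a rule taken from the page.

```python
import re

# Matches a dollar amount such as $47.99 or $1,299.00 and captures the number.
PRICE_PATTERN = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)")

def clean_product(product):
    """Strip the '[Sponsored]' marker and convert the first price to a float."""
    sponsored = "[Sponsored]" in product["title"]
    title = product["title"].replace("[Sponsored]", "").strip()
    match = PRICE_PATTERN.search(product["price"])
    price = float(match.group(1).replace(",", "")) if match else None
    return {"title": title, "price": price, "sponsored": sponsored}

raw = {
    "title": "[Sponsored]LEGO City Sky Police Diamond Heist 60209 Building Kit (400 Piece)",
    "price": "[Sponsored] $47.99",
}
print(clean_product(raw))
```

A record with no recognizable price gets `None` rather than raising, so one malformed item does not stop the loop.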


To build an automated scraping project, we wrap each of the tasks above in its own function. The functions should include:
- search: enter the keywords into the search bar
- next_page: go to the next page
- get_products: extract information from the page
- main: control the entire project
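
The control flow described by this list can be sketched without a browser by passing the steps in as plain functions. `run_scraper`, `fake_search`, and `fake_next_page` are hypothetical names. The sketch assumes `search` returns the total page count as an int and that each step scrapes its own page.

```python
def run_scraper(search, next_page, max_pages=4):
    """Run `search` once, then call `next_page` until the page limit is reached.

    Returns the number of pages visited, counting the first results page.
    """
    total = search()
    visited = 1  # The search itself lands on page 1
    for _ in range(2, min(total, max_pages) + 1):
        next_page()
        visited += 1
    return visited

# Stand-in steps that only record what would happen.
log = []

def fake_search():
    log.append("search, then get_products on page 1")
    return 400  # Pretend the site reports 400 result pages

def fake_next_page():
    log.append("next_page, then get_products")

pages = run_scraper(fake_search, fake_next_page, max_pages=4)
print(pages)
print(log)
```

Capping the loop with `max_pages` keeps a test run short even when the site reports hundreds of pages.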

In [8]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from pyquery import PyQuery as pq
import re

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)
def search():
    try:
        browser.get('https://www.amazon.com')
        search_box = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#twotabsearchtextbox")))
        submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#nav-search > form > div.nav-right > div > input")))
        search_box.send_keys('Lego')
        submit.click()
        total = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#pagn > span.pagnDisabled')))
        get_products()
        return total.text
    except TimeoutException:
        return search() # Retry from the start on timeout (note: there is no retry limit)

def next_page():
    try:
        submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#pagnNextString')))
        submit.click()
        get_products()
    
    except TimeoutException:
        next_page()

def get_products():
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#s-results-list-atf li")))
    html = browser.page_source
    doc = pq(html)
    items = doc("#s-results-list-atf li").items()
    for item in items:
        product = {
            'title': item.find('h2').text(),
            'price': item.find('span.a-offscreen').text(),        
        }
        print(product)

def main():
    total = search()
    total = int(re.search(r'(\d+)', total).group(1)) # Total number of result pages
    for i in range(2, 5): # Scrape pages 2 to 4 only; use range(2, total + 1) to scrape every page
        next_page()

main()
        

{'title': '[Sponsored]LEGO Harry Potter Hogwarts Castle 71043 Building Kit (6020 Piece)', 'price': '[Sponsored] $399.99'}
{'title': '[Sponsored]LEGO Friends Mia’s House 41369 Building Kit (715 Piece)', 'price': '[Sponsored] $63.00'}
{'title': '[Sponsored]LEGO Architecture Skyline Collection 21043 San Francisco Building Kit (629 Piece)', 'price': '[Sponsored] $49.99'}
{'title': 'LEGO MINDSTORMS EV3 31313 Robot Kit… LEGO Creator Robo Explorer 31062 Robot… LEGO Ideas Exo Suit 21109', 'price': '$339.97 $22.46'}
{'title': 'LEGO Ideas NASA Apollo Saturn V 21309 Building Kit', 'price': '$119.99'}
{'title': 'LEGO Classic Medium Creative Brick Box 10696', 'price': '$28.00'}
{'title': 'LEGO Boost Creative Toolbox 17101 Fun Robot Building Set and Educational Coding Kit for Kids, Award-Winning STEM Learning Toy (847 Pieces)', 'price': '$159.95'}
{'title': 'LEGO Classic Large Creative Brick Box 10698', 'price': '$47.99'}
{'title': 'LEGO Creator Mighty Dinosaurs 31058 Dinosaur Toy', 'price': '$12.19