### This script scrapes the product name, price, and rating for every item of a given brand listed on Amazon. In this example, the brand is Brawny.

In [1]:
import certifi
import urllib3
http = urllib3.PoolManager(
cert_reqs='CERT_REQUIRED',
ca_certs=certifi.where())

In [2]:
from lxml import html  
from lxml.html import fromstring
import requests
from itertools import cycle
import re
from time import sleep
import pandas as pd
import math
from bs4 import BeautifulSoup
import numpy
from urllib.request import Request, urlopen
from urllib.error import URLError
import ssl
from fake_useragent import UserAgent
import csv
import random
from random import shuffle
import cfscrape

A common roadblock in large-scale web scraping is getting blocked. A website can block your IP address if it detects a single 'bot' hitting the site over and over again in a short period of time. One way to reduce that risk is to make your requests look like they come from different browsers. We accomplish this by sending a different user agent in the header of each request.

In [3]:
ua = UserAgent()
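
As a rough sketch of what rotating user agents means in practice, the snippet below picks a random user agent string for each request from a small hard-coded list. The list `USER_AGENTS` and the helper `random_headers` are made up for illustration; in this notebook, `ua.random` from fake_useragent plays that role.

```python
import random

# Hypothetical, hand-picked user agent strings (illustration only);
# fake_useragent supplies a much larger set.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
]


def random_headers(rng=random):
    """Build request headers with a randomly chosen user agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


# each call may return a different user agent
for _ in range(3):
    print(random_headers()["User-Agent"][:40])
```

The returned dictionary can be passed as the `headers` argument of a request, the same way the scraping loop later in this notebook does.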

Rotating through a list of proxies is another way to avoid getting blocked. However, proxies aren't foolproof, and many of the free proxies fail to connect, which raises a connection error. So free proxies (you can also pay for more reliable ones) are only marginally useful compared to rotating user agents, which get us most of the way there.

In [4]:
def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:100]:
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies

proxies = get_proxies()
proxy_pool = cycle(proxies)
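
Because many free proxies fail, it can help to test a few before committing to one. The sketch below is one possible way to do that, not part of the original run: `first_working_proxy` and the `is_alive` callback are hypothetical names, and the demo uses made-up addresses with a stand-in check instead of a real request.

```python
from itertools import cycle


def first_working_proxy(proxy_pool, is_alive, max_tries=10):
    """Return the first proxy from the pool that passes is_alive, or None."""
    for _ in range(max_tries):
        proxy = next(proxy_pool, None)
        if proxy is None:          # the pool is empty
            return None
        if is_alive(proxy):
            return proxy
    return None


# demo with made-up addresses; a real is_alive would send a short,
# time-limited request through the proxy and return True on success
demo_pool = cycle(["10.0.0.1:8080", "10.0.0.2:3128", "10.0.0.3:80"])
dead = {"10.0.0.1:8080"}
print(first_working_proxy(demo_pool, lambda p: p not in dead))
```

Passing the check in as a function keeps the selection logic separate from the network call, so the same helper works with any request library.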

Since we will be scraping all of the pages associated with a particular brand, and the number of pages and products varies among brands, we first need to find out how many pages we will be scraping. For this exercise we will be scraping all Brawny product data. We specify Brawny by its individual srs code, located at the end of the url. You need to know these codes ahead of time.

In [5]:
basePage1 = 'http://www.amazon.com/s?i=specialty-aps&srs='
#default number of results per page
resultsPerPage = 16
#srs code
Brawny = '3019634011'
master = [Brawny]

Now for the scraping part. We are making one request to a single page, so we are not rotating user agents yet.

In [6]:
for srs in master:
    basePage = basePage1 + srs

    # single request with a fixed desktop user agent
    s = requests.Session()
    s.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
    page1 = s.get(basePage)
    tree = html.fromstring(page1.content)

    # the result-count text looks like '1-16 of N results for'; strip everything but N
    ratingCount = tree.xpath('//*[@id="s-result-count"]//text()')
    ratingCount = ratingCount[0].replace('1-16 of ','')
    ratingCount = ratingCount.replace(' results for','')

    # number of result pages, stored as strings for building the page urls
    pages1 = math.ceil(int(ratingCount) / resultsPerPage)
    pages = [str(p) for p in range(1,pages1+1)]

In [7]:
basePage

'http://www.amazon.com/s?i=specialty-aps&srs=3019634011'

In [8]:
len(pages)

23
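
The string replacements used to get the result count break if the text contains a thousands separator or slightly different wording. A regular expression is a more forgiving alternative; the sketch below is one possibility. The helper names and the sample strings are made up for illustration (362 is not the real Brawny count, just a number that also yields 23 pages).

```python
import math
import re


def parse_result_count(text):
    """Pull the total out of text like '1-16 of 362 results for'."""
    match = re.search(r"of\s+(?:over\s+)?([\d,]+)\s+results", text)
    if match is None:
        raise ValueError("no result count found in: %r" % text)
    return int(match.group(1).replace(",", ""))


def page_count(total_results, results_per_page=16):
    """Number of result pages needed to show total_results items."""
    return math.ceil(total_results / results_per_page)


sample = "1-16 of 362 results for"   # made-up count
total = parse_result_count(sample)
print(total, page_count(total))
```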

We use the cfscrape module along with Requests to scrape each page. See https://github.com/Anorov/cloudflare-scrape for more info.

In [9]:
scraper = cfscrape.create_scraper()

Shuffling the order in which we scrape the pages is another way to avoid detection. Accessing pages in seemingly random order makes our bot look more 'human', so it is less likely to be recognized as a robot. So we shuffle our pages, take one proxy from the pool (hoping it works), and choose a random user agent for each page requested. We then scrape the html text of each page and store it in [html_list].

In [10]:
random.shuffle(pages)

In [13]:
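html_list=[]
proxy = next(proxy_pool)
for i in pages:
    try:
        # new random user agent for every page request
        headers = {'User-Agent':str(ua.random)}
        htmltext = scraper.get(basePage+'&page='+str(i),headers = headers,proxies={"http": proxy, "https": proxy}).content
        sleep(numpy.random.randint(1,4))  # random 1-3 second pause between requests
        htmltext = str(htmltext)
        html_list.append(htmltext)
        print(i)
    except Exception as e:
        # the failed page is skipped; report it, then pause before moving on
        print('failed on page', i, e)
        for _ in range(20):
            sleep(numpy.random.randint(1,4))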

19
15
16
20
2
8
14
23
11
3
5
6
4
1
13
9
12
17
10
22
21
18
7
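
The loop above moves on when a page fails, so that page never reaches [html_list]. One way to handle this is to queue failed pages and try them again in a later pass. The sketch below illustrates the idea with a stand-in `fake_fetch` function instead of real requests; `scrape_pages`, `fetch`, and `pause` are hypothetical names, not part of cfscrape or Requests.

```python
import random


def scrape_pages(pages, fetch, max_rounds=3, pause=lambda: None):
    """Fetch every page, re-queuing failures for up to max_rounds passes."""
    results = {}
    pending = list(pages)
    for _ in range(max_rounds):
        if not pending:
            break
        random.shuffle(pending)        # random order on every pass
        failed = []
        for page in pending:
            try:
                results[page] = fetch(page)
            except Exception:
                failed.append(page)    # try this page again next pass
            pause()
        pending = failed
    return results, pending


# stand-in fetch that fails the first time page '2' is requested
attempts = {}

def fake_fetch(page):
    attempts[page] = attempts.get(page, 0) + 1
    if page == "2" and attempts[page] == 1:
        raise ConnectionError("simulated block")
    return "<html>page %s</html>" % page


html_by_page, still_failed = scrape_pages(["1", "2", "3"], fake_fetch)
print(sorted(html_by_page), still_failed)
```

In real use, `fetch` would wrap the `scraper.get` call and `pause` would be the random sleep; a new proxy could also be chosen between passes.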


Check a few pages' html text to make sure the data is there. The length of each should be in the hundreds of thousands. If the text contains a message like "Sorry! Something went wrong on our end" or length is around 2000, you were blocked. Re-shuffle your pages. Try a new proxy. Test again.

In [32]:
html_list[1]
len(html_list[1])

228019
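
Checking pages by hand gets tedious with many pages. The check described above can be sketched as a small helper that flags suspicious pages. The 10,000-character cutoff is an assumption chosen to sit between the roughly 2,000 characters of a blocked page and the hundreds of thousands of a real one; `looks_blocked` and `blocked_indexes` are hypothetical names, and the demo pages are made up.

```python
BLOCK_MESSAGE = "Sorry! Something went wrong on our end"
MIN_LENGTH = 10000  # assumed cutoff, adjust to your own pages


def looks_blocked(html_text, min_length=MIN_LENGTH):
    """True if the page is suspiciously short or shows the error message."""
    return len(html_text) < min_length or BLOCK_MESSAGE in html_text


def blocked_indexes(html_pages):
    """Positions in the list of pages that look blocked."""
    return [i for i, text in enumerate(html_pages) if looks_blocked(text)]


# made-up pages: one full-size page, one error page, one short page
demo_pages = [
    "x" * 228019,
    "<html>" + "y" * 20000 + "Sorry! Something went wrong on our end</html>",
    "z" * 2000,
]
print(blocked_indexes(demo_pages))
```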

We use [html_list] to collect the ASINs on each page. ASINs are Amazon's unique product identifiers, which we then use to access the individual product pages. We use regular expressions to pull each ASIN from the html text stored in [html_list].

In [16]:
asin_list=[]
# compile the ASIN pattern once, outside the loop
pattern = re.compile("data-asin=\"([A-Z0-9]{10})\"")
for i in range(0,len(html_list)):
    list2 = re.findall(pattern, html_list[i])
    asin_list.append(list(list2))

In [17]:
#flatten the per-page lists into a single list of ASINs
asin_list2 = [item for sublist in asin_list for item in sublist]

In [18]:
#deduplicate
asin3 = []
for i in asin_list2:
    if i not in asin3:
        asin3.append(i)

In [19]:
len(asin3)

113
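
The extract, flatten, and deduplicate steps can also be written as one function. This is a compact sketch of the same idea, assuming Python 3.7+ so that dictionary keys keep insertion order; `extract_asins` is a hypothetical helper and the sample html and ASINs are made up.

```python
import re

ASIN_PATTERN = re.compile(r'data-asin="([A-Z0-9]{10})"')


def extract_asins(html_pages):
    """Collect ASINs from all pages, dropping repeats but keeping order."""
    found = []
    for text in html_pages:
        found.extend(ASIN_PATTERN.findall(text))
    # dict keys are unique and keep first-seen order
    return list(dict.fromkeys(found))


# made-up html snippets with one ASIN repeated across pages
sample_pages = [
    '<div data-asin="B000000001"></div><div data-asin="B000000002"></div>',
    '<div data-asin="B000000002"></div><div data-asin="B000000003"></div>',
]
print(extract_asins(sample_pages))
```

Unlike the `if i not in asin3` loop, which scans the growing list for every item, the dictionary lookup stays fast as the number of ASINs grows.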

Now we use our ASIN list to access each product page and pull product name, price, and overall rating. We will use BeautifulSoup for this part.

In [20]:
def get_soup(url):
    # skip certificate verification to avoid ssl related issues
    context = ssl._create_unverified_context()
    retries = 2
    wait_time = 5
    read_url = None
    # an empty soup is returned if every attempt fails
    soup = BeautifulSoup("", "lxml")
    tries = 0
    req = Request(url)
    req.add_header('User-agent', str(ua.random))

    # access url, with exponential back-off in event of failure
    while read_url is None and tries < retries:
        try:
            read_url = urlopen(req, context=context).read()
            soup = BeautifulSoup(read_url, "lxml")
        except URLError:
            tries += 1
            # wait, then double the delay before the next attempt
            sleep(wait_time)
            wait_time *= 2

    return soup
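
The back-off idea mentioned in the comment in `get_soup` can be isolated into a reusable helper. The sketch below is a generic illustration, not the notebook's own code: `retry_with_backoff` and its parameters are hypothetical, and the demo records the waits in a list instead of actually sleeping.

```python
import time


def retry_with_backoff(action, retries=3, wait_time=5, errors=(Exception,), sleep=time.sleep):
    """Call action(); on failure wait, double the wait, and try again."""
    for attempt in range(1, retries + 1):
        try:
            return action()
        except errors:
            if attempt == retries:
                raise                  # out of attempts, pass the error on
            sleep(wait_time)
            wait_time *= 2             # exponential back-off


# demo: fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("simulated failure")
    return "ok"


waits = []
print(retry_with_backoff(flaky, retries=4, wait_time=5, sleep=waits.append), waits)
```

With a starting wait of 5 seconds, the two failures in the demo produce waits of 5 and then 10 seconds before the third call succeeds.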

In [21]:
product_list = []
#note: this range skips the first and last ASIN; use range(len(asin3)) to cover every item
for i in range(1,len(asin3)-1):
    reviewPage = 'https://www.amazon.com/dp/product-reviews/'+asin3[i]
    base = get_soup(reviewPage)
    print(i)
    name_raw = base.findAll('div',{'class':'a-row product-title'})
    name = re.findall(r'ie=UTF8">(.*?)</a>',str(name_raw))
    rating_raw = base.findAll('span',{'data-hook':'rating-out-of-text'})
    rating = re.findall(r'data-hook="rating-out-of-text">(.*?)</span>',str(rating_raw))
    price_raw = base.findAll('span',{'class':'a-color-price arp-price'})
    price = re.findall(r'"a-color-price arp-price">(.*?)</span>',str(price_raw))
    #handle currently unavailable items- they won't have price or rating
    if len(price) == 0:
        price = 'NA'
    else:
        price = price[0]
    if len(name) == 0:
        name = 'NA'
    else:
        name = name[0]
    #an empty match list would otherwise raise an IndexError on rating[0]
    if len(rating) == 0 or rating[0] == '0.0 out of 5 stars':
        rating = 'NA'
    else:
        rating = rating[0]
    product_dict = {'Product Name': name,
                    'Price': price,
                    'Rating': rating}
    product_list.append(product_dict)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111


In [22]:
#store as data frame
df1 = pd.DataFrame(product_list)

In [38]:
df1 #looks good!

| Unnamed: 0 | Price | Product Name | Rating |
|---|---|---|---|
| 0 | $169.89 | BRAWNY Medium Duty Premium Double Recrepe Wipe... | |
| 1 | | Brawny Big Roll Paper Towels, White, 6 Count | 4.0 out of 5 stars |
| 2 | $74.90 | Brawny Giant Roll, White, Pick-A-Size, 24 Coun... | |
| 3 | | Brawny Paper Towels, Pick-A-Size, Big Roll, Wh... | 3.0 out of 5 stars |
| 4 | $90.00 | Brawny Ind Scrim All Purp Wpr Bx Whi 5/166 | |
| 5 | | Brawny Paper Towels, White, 8 Giant Rolls | 3.9 out of 5 stars |
| 6 | $105.14 | GPC28611 - Airlaid Medium-Duty Folded Wipers, ... | |
| 7 | $108.64 | GPC29221 - Light-Duty Paper Wipers | |
| 8 | $82.92 | GPC29215 - Medium Duty Airlaid 1/4-Fold Wipers | |
| 9 | | Brawny Tear-a-Square Paper Towel Rolls, 16 Count | |
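
The Price and Rating values are scraped as text such as '$169.89' and '4.0 out of 5 stars'. If you want to sort or average them, they need to be converted to numbers first. The helpers below are a minimal sketch of that conversion; `parse_price` and `parse_rating` are hypothetical names and are not used elsewhere in this notebook.

```python
import re


def parse_price(text):
    """'$169.89' -> 169.89; returns None for 'NA' or unparseable text."""
    match = re.search(r"\$\s*([\d,]+(?:\.\d+)?)", text or "")
    return float(match.group(1).replace(",", "")) if match else None


def parse_rating(text):
    """'4.0 out of 5 stars' -> 4.0; returns None for 'NA' or unparseable text."""
    match = re.match(r"\s*(\d+(?:\.\d+)?) out of 5 stars", text or "")
    return float(match.group(1)) if match else None


print(parse_price("$169.89"), parse_rating("4.0 out of 5 stars"), parse_price("NA"))
```

With pandas, these could be applied column-wise, for example `df1['Price'].map(parse_price)`, leaving missing values as None.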


In [None]:
df1.to_csv('out.csv')