# amazon web scraper for customer reviews

## amazon search query

Amazon uses a specific format for search queries within the URL.


1. https://www.amazon.com/s?k=iphone+13
"s?k=" is the search path and query key; the string after it is the search term, with spaces written as "+".

2. https://www.amazon.com/s?k=iphone+13&page=3
Same search query as 1, with "&page={page-number}" appended to scrape multiple pages.

***3 could be wrong or unnecessary***
3. https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/product-reviews/1491910291?pageNumber=2
The format is amazon.../product-name/product-reviews/id

The "amazon..." and "product-reviews" parts are fixed.
***you must know the id for this URL***
The id is the ASIN (Amazon Standard Identification Number), an Amazon-specific product id that is displayed on each product page.
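
The URL formats above can be assembled programmatically instead of concatenating strings by hand. The following is a minimal sketch using only the standard library. `build_search_url` and `build_reviews_url` are hypothetical helper names, not part of this notebook, and the review URL layout is copied from example 3, which the note above flags as possibly wrong.

```python
from urllib.parse import quote, urlencode

AMAZON = "https://www.amazon.com"

def build_search_url(term, page=None):
    """Build a search URL such as .../s?k=iphone+13&page=3."""
    params = {"k": term}
    if page is not None:
        params["page"] = page
    # urlencode writes spaces as "+" and escapes other special characters
    return f"{AMAZON}/s?{urlencode(params)}"

def build_reviews_url(product_name, asin, page=1):
    """Build a review URL following format 3 (layout not verified)."""
    return f"{AMAZON}/{quote(product_name)}/product-reviews/{asin}?pageNumber={page}"

print(build_search_url("iphone 13"))
print(build_search_url("iphone 13", page=3))
print(build_reviews_url("Web-Scraping-Python-Collecting-Modern", "1491910291", page=2))
```

Passing the plain search term ("iphone 13") and letting `urlencode` handle the escaping avoids having to write the "+" by hand.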


In [1]:
import requests
import operator

from bs4 import BeautifulSoup
from functools import reduce

In [2]:
# global variables
BASE_URL = "https://www.amazon.com/s?k="
HEADERS = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
            'Accept-Language': 'en-US, en;q=0.5'}
# fake_headers = {"abc": "def"}

In [3]:
# test query
search_query = "iphone+13"
url = BASE_URL + search_query

# include headers to access web

search_result = requests.get(url, headers = HEADERS)
# search_result = requests.get(url, headers = fake_headers)


# construct soup
soup = BeautifulSoup(search_result.text, 'html.parser')

# url
# if successful, status_code should be 200
# search_result.status_code

# raw string of page source code
# search_result.text
# search_result.content

# soup

In [4]:
# user defined func

# follow the amazon specific query format, returns string

# scrape the raw data, returns string of the html source code, unformatted, helper func
def getpage(query):
    url = BASE_URL + query
    response = requests.get(url, headers = HEADERS)
    
    if response.status_code == 200:
        return response.text
    else:
        return "Error: status_code != 200"

# convert to html readable string, returns BeautifulSoup object, helper func
def html_code(query):
    return BeautifulSoup(getpage(query), 'html.parser')

# test
search_query = "iphone+13"

# getpage(search_query)
# html_code(search_query)
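
`getpage` returns an error string on a non-200 response, and `html_code` then parses that string as if it were HTML, so a blocked request only shows up later as an empty result list. One possible alternative, sketched below, raises an exception instead. `fetch_page`, `PageFetchError`, and the injectable `get` parameter are hypothetical names introduced for illustration, and the 10-second timeout is an arbitrary choice.

```python
import requests

class PageFetchError(RuntimeError):
    """Raised when a page cannot be fetched successfully."""

def fetch_page(url, headers, timeout=10, get=requests.get):
    """Return the page source, or raise PageFetchError on a non-200 response.

    `get` defaults to requests.get; it can be swapped for a stub in tests
    so that no real network request is made.
    """
    response = get(url, headers=headers, timeout=timeout)
    if response.status_code != 200:
        raise PageFetchError(f"GET {url} returned status {response.status_code}")
    return response.text
```

A caller can wrap `fetch_page` in `try`/`except PageFetchError` to skip or retry a failed page rather than silently parsing an error message.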

## inspect the raw html data

1. find product name

Notice that each product name sits inside a span tag with class="a-size-medium a-color-base a-text-normal".

e.g. `<span class="a-size-medium a-color-base a-text-normal">Apple iPhone 13 (128GB, Pink) [Locked] + Carrier Subscription</span>`
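
The lookup described above can be tried offline on a small snippet before running it against a live page. The sketch below uses a hand-written HTML sample (an assumption, not real Amazon markup beyond the span shown above) and compares two ways of matching the class. As the Beautiful Soup documentation describes, passing a multi-class string to `class_` matches the exact attribute string, so a different class order is missed, while a CSS selector via `select` matches regardless of order.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the second span lists the same classes in another order
SAMPLE_HTML = """
<div>
  <span class="a-size-medium a-color-base a-text-normal">Apple iPhone 13 (128GB, Pink) [Locked] + Carrier Subscription</span>
  <span class="a-color-base a-text-normal a-size-medium">Apple iPhone 13, 128GB, Blue - Unlocked (Renewed)</span>
  <span class="a-size-small">not a product name</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact match on the full class string (order-sensitive)
exact = [tag.get_text(strip=True)
         for tag in soup.find_all("span", class_="a-size-medium a-color-base a-text-normal")]

# CSS selector requiring all three classes (order-independent)
by_selector = [tag.get_text(strip=True)
               for tag in soup.select("span.a-size-medium.a-color-base.a-text-normal")]

print(len(exact), len(by_selector))
```

Either form depends on Amazon's current class names, which can change without notice.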



In [71]:
### user defined function

def get_prod_name(query, start = 1, end = None):
    """
    query: string input. The search term as it appears in the browser URL,
    with spaces replaced by "+", e.g. "iphone 13" becomes "iphone+13".

    start: integer input. Starting page number, default = 1.

    end: integer input. Ending page number (inclusive), default = None,
    which scrapes only the start page.

    returns a list of product names from the search results
    """
    # a single page is just the range [start, start]
    if end is None:
        end = start

    products = []
    for page in range(start, end + 1):
        html_raw = html_code(f"{query}&page={page}")
        html_tag = html_raw.find_all("span", class_ = "a-size-medium a-color-base a-text-normal")
        products.extend(tag.text for tag in html_tag)

    return products

# test
search_query = "iphone+13"


# get_prod_name(search_query)

get_prod_name(search_query, 1, 5)

['Apple iPhone 13 (128GB, Pink) [Locked] + Carrier Subscription',
 'Apple iPhone 13 Mini (128GB, Starlight) [Locked] + Carrier Subscription',
 'Apple iPhone 13, 128GB, Blue - Unlocked (Renewed)',
 'iPhone 13, 128GB, Blue - Unlocked (Renewed Premium)',
 'iPhone 13 Pro, 128GB, Sierra Blue - Unlocked (Renewed Premium)',
 'Apple iPhone 13 Mini, 128GB, Pink - Unlocked (Renewed)',
 'iPhone 13 Mini, 128GB, Pink - Unlocked (Renewed Premium)',
 'Apple iPhone 13 Mini (512GB, Pink) [Locked] + Carrier Subscription',
 'Apple iPhone 13 (128GB, Pink) [Locked] + Carrier Subscription',
 'Apple iPhone 11 Pro, US Version, 256GB, Space Gray - Unlocked (Renewed)',
 'Apple iPhone 12 Pro, 128GB, Graphite - Fully Unlocked (Renewed)',
 'iPhone 13 Pro Max, 128GB, Graphite - Unlocked (Renewed Premium)',
 'Apple iPhone 12, 128GB, Green - Fully Unlocked (Renewed)',
 'Apple iPhone 13 Pro, 128GB, Gold - Unlocked (Renewed)',
 'Apple iPhone 13 Pro Max, 128GB, Sierra Blue - Unlocked (Renewed)',
 'Apple iPhone 11, 64GB,
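
The `start` and `end` arguments of `get_prod_name` expand into one query string per page. That expansion can be isolated and checked without any network access. The sketch below uses a hypothetical helper, `page_queries`, which is not part of the notebook; the range validation is an added assumption about what counts as sensible input.

```python
def page_queries(query, start=1, end=None):
    """Yield one query string per page, e.g. "iphone+13&page=2"."""
    if end is None:
        end = start  # only the start page
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    for page in range(start, end + 1):
        yield f"{query}&page={page}"

print(list(page_queries("iphone+13", 1, 3)))
# ['iphone+13&page=1', 'iphone+13&page=2', 'iphone+13&page=3']
```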

### using ASIN to get the individual product detail

ASIN is a unique identifier for Amazon products.
A product URL has the following format: www.amazon.com/dp/{asin}
The function below retrieves the ASINs from the search results.

In [73]:
# user defined func
def get_asin(query, start = 1, end = None):
    """
    query: string input. The search term as it appears in the browser URL,
    with spaces replaced by "+", e.g. "iphone 13" becomes "iphone+13".

    start: integer input. Starting page number, default = 1.

    end: integer input. Ending page number (inclusive), default = None,
    which scrapes only the start page.

    returns a list of ASINs from the search results
    """
    # a single page is just the range [start, start]
    if end is None:
        end = start

    result_class = "s-result-item s-asin sg-col-0-of-12 sg-col-16-of-20 sg-col s-widget-spacing-small sg-col-12-of-16"
    asin_list = []
    for page in range(start, end + 1):
        html_raw = html_code(f"{query}&page={page}")
        html_tag = html_raw.find_all("div", class_ = result_class)
        asin_list.extend(tag.attrs["data-asin"] for tag in html_tag)

    return asin_list
    
# test
search_query = "iphone+13"

# get_asin(search_query)
get_asin(search_query, 1, 7)

['B09LNX6KQS',
 'B0BGQK54Z9',
 'B0BGYF4CZF',
 'B09LKF2RPP',
 'B0BGQWXTH9',
 'B09G9FMPT1',
 'B09G9CX7DK',
 'B08PNP5YGV',
 'B0BGYQ4TXJ',
 'B07ZQRL9XY',
 'B08PNZDM7L',
 'B09LP7YLF9',
 'B09LPDM924',
 'B07ZPJ8YZ6',
 'B08PMP778K',
 'B08PNN2SKF',
 'B09JF5ZHQS',
 'B0BN9P1GXC',
 'B0B5FLX9WS',
 'B07753NSQZ',
 'B0BN72FYFG',
 'B09JFFG8D7',
 'B0BN733951',
 'B07P976BBH',
 'B09JFC8JGG',
 'B09V3JPBK6',
 'B0B5FLS2CF',
 'B0BN92S2ZZ',
 'B09JFN8K6T',
 'B0BN93ZDJQ',
 'B0BBXLNQ1R',
 'B09MVZH5RB',
 'B0BN723RDK',
 'B07ZQSSJVQ',
 'B0BN952L3Y',
 'B09T3MQSVP',
 'B0B4JB2D7W',
 'B09JFQ9G5Z',
 'B09JFNMBWL',
 'B09JF7QNZV',
 'B0BDY71GRG',
 'B09MZCNGSD',
 'B09V3HZ8B5',
 'B09JFKQ6Y4',
 'B0B4FBKLWP',
 'B0BN72MLT2',
 'B0B3PSRHHN',
 'B0BN9426PP',
 'B0B3T9DLR3',
 'B0B2KLZ5PP',
 'B0B3PSRHHN',
 'B09MXK3H49',
 'B09JFGGR33',
 'B0B9P2VHKM',
 'B0B5TM6VWB',
 'B09JFS5CJK',
 'B0BN733951',
 'B09R6FJWWS',
 'B08R988XHQ',
 'B0B9HDXZPG',
 'B0BD6Z2QW4',
 'B08KJJWK2L',
 'B085TCRFST',
 'B09PFC2DVD',
 'B0BN991DGS',
 'B09JFLKR3H',
 'B09JFC96
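
The list above contains repeated ASINs (for example 'B0B3PSRHHN' and 'B0BN733951' each appear twice), since the same product can show up on more than one results page. A minimal sketch of cleaning the list and turning it into product URLs in the www.amazon.com/dp/{asin} format follows. `product_urls` and `ASIN_PATTERN` are hypothetical names, and the pattern assumes an ASIN is exactly 10 uppercase letters or digits.

```python
import re

# Assumption: an ASIN is exactly 10 uppercase alphanumeric characters
ASIN_PATTERN = re.compile(r"[A-Z0-9]{10}")

def product_urls(asins):
    """Drop empty, malformed, and repeated ASINs, then build /dp/ URLs."""
    # dict.fromkeys removes duplicates while keeping first-seen order
    unique = dict.fromkeys(a for a in asins if ASIN_PATTERN.fullmatch(a))
    return [f"https://www.amazon.com/dp/{asin}" for asin in unique]

sample = ["B0B3PSRHHN", "B0BN733951", "", "B0B3PSRHHN"]
print(product_urls(sample))
# ['https://www.amazon.com/dp/B0B3PSRHHN', 'https://www.amazon.com/dp/B0BN733951']
```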

references

1. https://github.com/vaisakhnambiar/Web-scraping/blob/master/Amazon%20Scraping%20assignment.ipynb
2. https://www.geeksforgeeks.org/web-scraping-amazon-customer-reviews/
3. https://medium.com/analytics-vidhya/web-scraping-amazon-reviews-a36bdb38b257