# amazon web scraper for customer reviews

## amazon search query

Amazon uses a specific URL format for search queries and review pages:


1. https://www.amazon.com/s?k=iphone+13
"s" is the search path and "k" is the query parameter; the string after "s?k=" is the search item, with spaces written as "+"

2. https://www.amazon.com/s?k=iphone+13&page=3
same search query as 1., with "&page={page-number}" added to scrape multiple pages

3. https://www.amazon.com/product-reviews/B09LNX6KQS?pageNumber=2
amazon.../product-reviews/{asin}?pageNumber={n}

the "amazon..." and "product-reviews" parts are fixed
***you must know the product id for the third url***
the id is the ASIN (Amazon-specific product id, displayed on each product page)
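
The three URL patterns above can be sketched with the standard library. This is a minimal illustration, not part of the scraper itself: the helper names `build_search_url` and `build_reviews_url` are hypothetical, and the search term is passed with plain spaces because `urlencode` produces the "+" form on its own.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.amazon.com/"


def build_search_url(item, page=None):
    """Build a search URL (patterns 1 and 2). `item` uses plain spaces."""
    params = {"k": item}
    if page is not None:
        params["page"] = page
    # urlencode turns "iphone 13" into "iphone+13" and joins params with "&"
    return BASE_URL + "s?" + urlencode(params)


def build_reviews_url(asin, page=1):
    """Build a product-reviews URL (pattern 3) for one ASIN."""
    return f"{BASE_URL}product-reviews/{asin}?" + urlencode({"pageNumber": page})


search_first_page = build_search_url("iphone 13")
search_third_page = build_search_url("iphone 13", page=3)
reviews_second_page = build_reviews_url("B09LNX6KQS", page=2)
```

The three results correspond to the example URLs 1., 2. and 3. listed above.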


In [87]:
import requests
import operator

from bs4 import BeautifulSoup
from functools import reduce

In [96]:
# global variables
BASE_URL = "https://www.amazon.com/"
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36",
           "Accept-Language": "en-US, en;q=0.9"}

In [97]:
# test query
# search_query = "iphone+13"
# url = BASE_URL + "s?k=" + search_query

# include headers to access web
# HEADERS = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
#             'Accept-Language': 'en-US, en;q=0,5'}

# search_result = requests.get(url, headers = HEADERS)
# search_result = requests.get(url, headers = fake_headers)


# construct soup
# soup = BeautifulSoup(search_result.text, 'html.parser')

# url
# if successful, status_code should be 200
# search_result.status_code

# raw string of page source code
# search_result.text
# search_result.content

# soup

In [98]:
# user defined func
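def getpage(query, url_type):
    """
    A helper function to create a BeautifulSoup object from an Amazon page.
    Input a query string and specify the url type.
    
    query: str input. Accepts a search query by "item name" (e.g. "iphone+13") or an "asin"
    
    url_type: the url type for the search you're doing. options = "item", "asin"
    
    returns a BeautifulSoup object holding the parsed html of the page.
    raises ValueError for an unsupported url_type, RuntimeError if status_code != 200
    """
    if url_type == "item":
        url = BASE_URL + "s?k=" + query
    elif url_type == "asin":
        url = BASE_URL + "product-reviews/" + query
    else:
        # raise instead of returning a string, so callers never call .find_all() on an error message
        raise ValueError("url type unsupported. Choose from the following: 'item', 'asin'")
        
    # timeout keeps a stalled request from hanging the notebook
    response = requests.get(url, headers = HEADERS, timeout = 10)
    
    if response.status_code == 200:
        return BeautifulSoup(response.text, "html.parser")
    else:
        raise RuntimeError(f"request failed: status_code == {response.status_code}")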

# test
# search_query = "iphone+13"
# search_query = "B0BGQLFB55"

# getpage(search_query, "item")
# getpage(search_query, "asin")

## inspect the raw html data

1. find product name

notice that each product name sits inside a span tag with class="a-size-medium a-color-base a-text-normal"

e.g. <span class="a-size-medium a-color-base a-text-normal">Apple iPhone 13 (128GB, Pink) [Locked] + Carrier Subscription</span>
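
To make the span lookup concrete, here is a small offline sketch that runs BeautifulSoup against a hand-written HTML snippet instead of a live page. The first span is the example above; the second and third spans are made up for illustration. It compares the exact class-string match with a CSS selector, which is an alternative (not what this notebook uses) that tolerates reordered or extra classes.

```python
from bs4 import BeautifulSoup

# Hand-written sample; only the first span comes from a real search page.
SAMPLE_HTML = """
<div>
  <span class="a-size-medium a-color-base a-text-normal">Apple iPhone 13 (128GB, Pink) [Locked] + Carrier Subscription</span>
  <span class="a-color-base a-size-medium a-text-normal extra">Made-up title with reordered classes</span>
  <span class="a-size-small">Not a product title</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ with a multi-class string matches the class attribute exactly as written
exact = [
    span.get_text(strip=True)
    for span in soup.find_all("span", class_="a-size-medium a-color-base a-text-normal")
]

# a CSS selector only requires that all three classes are present, in any order
by_selector = [
    span.get_text(strip=True)
    for span in soup.select("span.a-size-medium.a-color-base.a-text-normal")
]
```

Here `exact` holds only the first title, while `by_selector` also picks up the span whose classes are reordered. Which behavior is preferable depends on how stable Amazon's markup turns out to be.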



In [99]:
### user defined function

def get_prod_name(query, start = 1, end = None):
    """
    query: string input. search items on amazon in browser url tab. 
    e.g. if you wanted to search for "iphone 13", it would be "iphone+13"

    start: integer input. starting page number, default = 1
    
    end: integer input. ending page number, default = None

    returns a list of product names from search
    """
    if end is None:
        query = query + f"&page={str(start)}"
        html_raw = getpage(query, "item")
        products = html_raw.find_all("span", class_ = "a-size-medium a-color-base a-text-normal")
        
        return list(map(lambda x: x.text, products))

    else:
        pages = range(start, end + 1)
        query = list(map(lambda x: query + f"&page={str(x)}", pages))
        html_raw = list(map(lambda x: getpage(x, "item"), query))
        html_tag = list(map(lambda x: x.find_all("span", class_ = "a-size-medium a-color-base a-text-normal"), html_raw))
        html_tag = reduce(operator.iconcat, html_tag, []) # flatten the ResultSets; [] avoids mutating the first one
        products = list(map(lambda x: x.text, html_tag))
        
        return products

# test
search_query = "iphone+13"

# get_prod_name(search_query)

get_prod_name(search_query, 1, 5)

['Apple iPhone 13 (128GB, Pink) [Locked] + Carrier Subscription',
 'Apple iPhone 13 Mini (128GB, Starlight) [Locked] + Carrier Subscription',
 'Apple iPhone 13, 128GB, Blue - Unlocked (Renewed)',
 'iPhone 13, 128GB, Blue - Unlocked (Renewed Premium)',
 'Apple iPhone 13 Mini, 128GB, Pink - Unlocked (Renewed)',
 'iPhone 13 Mini, 128GB, Pink - Unlocked (Renewed Premium)',
 'Apple iPhone 13 Mini (512GB, Pink) [Locked] + Carrier Subscription',
 'Apple iPhone 12 Pro, 128GB, Graphite - Fully Unlocked (Renewed)',
 'iPhone 13 Pro Max, 128GB, Graphite - Unlocked (Renewed Premium)',
 'Apple iPhone 12, 64GB, Blue - Fully Unlocked (Renewed)',
 'Apple iPhone 11 Pro, US Version, 256GB, Space Gray - Unlocked (Renewed)',
 'iPhone 13 Pro, 128GB, Sierra Blue - Unlocked (Renewed Premium)',
 'Apple iPhone 13 Pro, 128GB, Gold - Unlocked (Renewed)',
 'Apple iPhone 11, 64GB, Yellow - Fully Unlocked (Renewed)',
 'Apple iPhone 13 Pro Max, 128GB, Sierra Blue - Unlocked (Renewed)',
 'Apple iPhone 12 Pro Max, 128G

### using ASIN to get the individual product detail

An ASIN is a unique identifier for Amazon products.
A product page URL has the following format: www.amazon.com/dp/{asin}

1. create a function to retrieve the ASINs from the search results

In [100]:
# user defined func
def get_asin(query, start = 1, end = None):
    """
    ***THE QUERY MUST BE IN SEARCH ("item") FORMAT TO RETRIEVE ASIN***
    query: string input. search items on amazon in browser url tab. 
    e.g. if you wanted to search for "iphone 13", it would be "iphone+13"

    start: integer input. starting page number, default = 1
    
    end: integer input. ending page number, default = None

    returns a list of asin associated with the items from search
    """
    if end is None:
        query = query + f"&page={str(start)}"
        html_raw = getpage(query, "item")
        html_tag = html_raw.find_all("div", class_ = "s-result-item s-asin sg-col-0-of-12 sg-col-16-of-20 sg-col s-widget-spacing-small sg-col-12-of-16")
    
        return list(map(lambda x: x.attrs["data-asin"], html_tag))
    
    else:
        pages = range(start, end + 1)
        query = list(map(lambda x: query + f"&page={str(x)}", pages))
        html_raw = list(map(lambda x: getpage(x, "item"), query))
        html_tag = list(map(lambda x: x.find_all("div", class_ = "s-result-item s-asin sg-col-0-of-12 sg-col-16-of-20 sg-col s-widget-spacing-small sg-col-12-of-16"),
                            html_raw
                           )
                       )
        html_tag = reduce(operator.iconcat, html_tag, []) # flatten the ResultSets; [] avoids mutating the first one
        asin_list = list(map(lambda x: x.attrs["data-asin"], html_tag))

        return asin_list
    
# test
search_query = "iphone+13"

# get_asin(search_query)
get_asin(search_query, 1, 5)

['B09LNX6KQS',
 'B0BGQK54Z9',
 'B09LKF2RPP',
 'B0BGQWXTH9',
 'B09G9FMPT1',
 'B08PNP5YGV',
 'B0BGYQ4TXJ',
 'B08PNM1LNZ',
 'B07ZQRL9XY',
 'B0BGYF4CZF',
 'B09LP7YLF9',
 'B07ZPJ8YZ6',
 'B09LPDM924',
 'B08PMP778K',
 'B08PNN2SKF',
 'B0BN9P1GXC',
 'B08PNN2SKF',
 'B0BN9P1GXC',
 'B0B5FLX9WS',
 'B09JF5ZHQS',
 'B0BN72FYFG',
 'B09JFS16CP',
 'B07753NSQZ',
 'B07P976BBH',
 'B0BN733951',
 'B0B5FLS2CF',
 'B09V3JPBK6',
 'B07ZZKP4D1',
 'B09JFC8JGG',
 'B0B4FBKLWP',
 'B0BN93ZDJQ',
 'B0B5PF8TW9',
 'B09JFFG8D7',
 'B0B4JB2D7W',
 'B07ZZKP4D1',
 'B0BN72FYFG',
 'B0BBXLNQ1R',
 'B0B5FCWW3V',
 'B09MVZH5RB',
 'B0B4FBKLWP',
 'B09JFKQ6Y4',
 'B07XLTSDKC',
 'B09JFC967X',
 'B09JFC8JGG',
 'B09V3J59BL',
 'B09T3MQSVP',
 'B07WW5MPMF',
 'B09JF7QNZV',
 'B09JFSH31K',
 'B07ZZKP4D1',
 'B0B5PF8TW9',
 'B07KBV982N',
 'B0BN94DL3R',
 'B09JFKQ6Y4',
 'B09JFSMFB5',
 'B09JFS5CJK',
 'B0B3T9DLR3',
 'B0B5FKTMP7',
 'B0B5TM6VWB',
 'B07SC58QBW',
 'B09JFC967X',
 'B0BCQXG46W',
 'B09JFLKR3H',
 'B0BN96CCM6',
 'B09JFJ1Q5C',
 'B09JFP32Y5',
 'B09JFTPQ
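
The output above contains repeated ASINs (for example 'B08PNN2SKF' and 'B0BN9P1GXC' each appear twice in a row), so the same product's reviews would be fetched more than once. One possible way to drop the repeats while keeping the search order is sketched below; `unique_in_order` is a hypothetical helper, not part of the notebook.

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(items))


# a short slice of the output above, including its repeats
sample_asins = ["B08PNN2SKF", "B0BN9P1GXC", "B08PNN2SKF", "B0BN9P1GXC", "B0B5FLX9WS"]
unique_asins = unique_in_order(sample_asins)
```

Applied to the full result, this would be `unique_in_order(get_asin(search_query, 1, 5))`.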

### 2. individual page with list of reviews

the reviews of an individual product are listed page by page, e.g.
https://www.amazon.com/product-reviews/B09LNX6KQS?pageNumber=2

format is as follows:

amazon.../product-reviews/{asin}?pageNumber={n}


In [101]:
# user defined func
def get_reviews(asin: list, start = 1, end = None):
    """
    asin: list input. a list of ASIN
    
    start: integer input. starting page number, default = 1
    
    end: integer input. ending page number, default = None
    
    returns individual review contents of a product-review page.
    """
    if end is None:
        query = list(map(lambda x: x + f"?pageNumber={str(start)}", asin))
        html_raw = list(map(lambda x: getpage(x, "asin"), query))
        html_tag = list(map(lambda x: x.find_all("span", attrs = {"data-hook": "review-body"}), html_raw))
        html_tag = reduce(operator.iconcat, html_tag, []) # flatten the ResultSets; [] also handles an empty asin list
        reviews = list(map(lambda x: x.text, html_tag))
        reviews = [r.strip("\n") for r in reviews]

        return reviews
    
    else:
        pages = range(start, end + 1)
        queries = []
                
        for p in pages:
            for asin_id in asin: # avoid shadowing the built-in id()
                query = asin_id + f"?pageNumber={str(p)}"
                queries.append(query)
        
        html_raw = list(map(lambda x: getpage(x, "asin"), queries))
        html_tag = list(map(lambda x: x.find_all("span", attrs = {"data-hook": "review-body"}), html_raw))
        html_tag = reduce(operator.iconcat, html_tag, []) # flatten the ResultSets; [] also handles an empty asin list
        reviews = list(map(lambda x: x.text, html_tag))
        reviews = [r.strip("\n") for r in reviews]
        
        return reviews
            
        

# test   
search_query = "iphone+13"
asin_list = get_asin(search_query)

# get_reviews(asin_list)
get_reviews(asin_list, 1, 5)

['Ordered and Received the pink 256.The battery was 93%. No nicks or scratches to case or screen. Looked brand new. No issues after using it for a month. Came with SIM ejection pin.',
 'I couldn’t  be happier with my purchase. Not only is it much cheaper than in the store but the condition is absolutely perfect! I’m so happy with the condition of the phone. The battery life is at 100%. The phone is beautiful. I was able to get the pink phone unlocked and for a lot cheaper than the original price! Don’t hesitate to buy it if you’re concerned.',
 'Like new. Very good so far.',
 'This iPhone looks brand new.  It has 96% battery health.  I’m very impressed with Amazon renewed',
 'I ordered 2 phones that came in 2 seperate boxes. One phone looked great from the first box, but the second phone was actually missing from the second box. It became a hasle to get a resoloution, but it did get resolved.',
 'Todo ok',
 'While in good condition, the phone they sent me was on the Lost/Stolen List (p

### references

1. https://github.com/vaisakhnambiar/Web-scraping/blob/master/Amazon%20Scraping%20assignment.ipynb
2. https://www.geeksforgeeks.org/web-scraping-amazon-customer-reviews/
3. https://medium.com/analytics-vidhya/web-scraping-amazon-reviews-a36bdb38b257

### concerns

1. using reduce(operator.iconcat, ...) to flatten lists -> is this necessary? is there a better approach? the original result was a nested list [[...], [...], ...], which is why I had to flatten it
2. list(map(...)) vs a list comprehension [f(x) for x in iterable] -> which is better? is it just a matter of preference?
3. many of the functions repeat the same code -> would it be better to combine them? suggestions?
4. the text "the media could not be loaded" shows up within some of the reviews
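
As one possible direction for these concerns, here is a minimal sketch using only the standard library. The helper names (`flatten`, `page_queries`, `clean_review`) are hypothetical, and the exact placeholder string in `MEDIA_PLACEHOLDER` is an assumption based on concern 4, so it may need adjusting to the text that actually appears.

```python
from itertools import chain


def flatten(nested):
    """Concern 1: flatten one level, e.g. [[a, b], [c]] -> [a, b, c]."""
    return list(chain.from_iterable(nested))


def page_queries(query, start=1, end=None):
    """Concern 3: one code path for a single page and for a page range."""
    last = start if end is None else end
    # Concern 2: a comprehension replaces list(map(lambda ...)) here
    return [f"{query}&page={page}" for page in range(start, last + 1)]


# Assumed wording of the placeholder left behind by video widgets
MEDIA_PLACEHOLDER = "The media could not be loaded."


def clean_review(text):
    """Concern 4: drop the placeholder and surrounding whitespace."""
    return text.replace(MEDIA_PLACEHOLDER, "").strip()


flat = flatten([["review a", "review b"], [], ["review c"]])
single_page = page_queries("iphone+13")
three_pages = page_queries("iphone+13", 1, 3)
cleaned = clean_review("\nThe media could not be loaded.\n Great phone.\n")
```

`chain.from_iterable` also works on an empty list, which `reduce` without an initial value does not. On concern 2, both forms give the same result; a comprehension is often considered easier to read when `map` would need a `lambda`, but that is largely a style choice.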