# SCRAPING FROM AMAZON

#### **This code collects the product name, the review title, the star rating, and the full review text from the review pages of a product on Amazon.com. The reviews for AirPods 2 are used as the example here.**


#### Required libraries

- requests
- BeautifulSoup from bs4
- pandas


##### After importing the libraries, we define an empty list (reviewlist) that collects the reviews and is later converted to a DataFrame.
##### We send browser-like headers (including User-Agent) with each request so that Amazon is less likely to treat the request as coming from a bot and block it.
##### The get_soup function sends the request to the site. It passes the URL of the review page and the headers to "requests.get", then parses the response with BeautifulSoup using the lxml parser (the built-in "html.parser" also works).
##### The get_reviews function then picks the product, title, rating, and body out of each review according to the HTML structure of the page.
##### Finally, each review is appended to the list defined at the beginning (reviewlist), and the list is saved as an Excel file. The for loop with range sets in advance the maximum number of review pages to fetch. The if check at the end of the loop stops the loop once the last page is reached, so no pages beyond it are requested.

In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [16]:
reviewlist = []

In [17]:
header = {
    'Host': 'www.amazon.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}
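
If many pages are fetched, one option is to attach the headers to a `requests.Session` once instead of passing them on every call. The following is a minimal sketch, not part of the original notebook; the name `browser_headers` is an assumption and holds only a subset of the headers above. No request is sent.

```python
import requests

# Assumed name; an illustrative subset of the headers defined above.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

# A Session keeps its headers and reuses them for every request made through it.
session = requests.Session()
session.headers.update(browser_headers)

# Nothing is fetched here; this only inspects what the session would send.
print(session.headers["User-Agent"])
```

With this in place, `session.get(url)` would carry the same headers without a `headers=` argument.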

In [18]:
def get_soup(url):
    # Fetch the page at `url` with the browser-like headers and parse it with lxml.
    req = requests.get(url, headers=header)
    soup = BeautifulSoup(req.content, "lxml")
    return soup
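
A blocked or failed request still returns a page, which can then be parsed as if it were a list of reviews. The sketch below shows a more defensive variant of the request step, assuming a timeout and an HTTP status check are wanted. The names `fetch_soup`, `FakeResponse`, and `fake_get` are hypothetical; the fake response exists only so the example runs offline, and `html.parser` is used to avoid the lxml dependency.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, headers, timeout=10, get=requests.get):
    """Fetch `url` and return the parsed HTML, raising on HTTP errors."""
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.content, "html.parser")

# Offline stand-in for a real response (hypothetical, for demonstration only).
class FakeResponse:
    content = b"<html><head><title>Offline test page</title></head></html>"

    def raise_for_status(self):
        pass  # pretend the request succeeded

def fake_get(url, headers=None, timeout=None):
    return FakeResponse()

demo_soup = fetch_soup("https://example.com/reviews", headers={}, get=fake_get)
print(demo_soup.title.text)
```

Leaving `get` at its default makes `fetch_soup` send a real request through `requests.get`.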

In [22]:
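soup = get_soup("https://www.amazon.com/Apple-AirPods-Charging-Case-Renewed/product-reviews/B07SKLLYTW/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews")
print(soup.title.text)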

Amazon.com: Customer reviews: Apple AirPods 2 with Charging Case - White (Renewed)


In [23]:
def get_reviews(soup):
    reviews = soup.find_all('div', {'data-hook': 'review'})
    for item in reviews:
        try:
            review = {
                'product': soup.title.text.replace('Amazon.com: Customer reviews:', '').strip(),
                'title': item.find('a', {'data-hook': 'review-title'}).text.strip(),
                'rating': float(item.find('i', {'data-hook': 'review-star-rating'}).text.replace('out of 5 stars', '').strip()),
                'body': item.find('span', {'data-hook': 'review-body'}).text.strip(),
            }
        except (AttributeError, ValueError):
            # Skip a review with a missing or unparsable field
            # instead of silently dropping the rest of the page.
            continue
        reviewlist.append(review)
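
The extraction step can be tried without contacting Amazon by running the same `data-hook` lookups on a hand-written HTML fragment. This is a minimal sketch: `SAMPLE_HTML` is an invented fragment that only imitates the attributes used above (real Amazon markup may differ), and `parse_reviews` is a hypothetical variant that returns a list instead of appending to the global reviewlist.

```python
from bs4 import BeautifulSoup

# Invented fragment for illustration; not copied from Amazon.
SAMPLE_HTML = """
<html><head><title>Amazon.com: Customer reviews: Example Product</title></head>
<body>
<div data-hook="review">
  <a data-hook="review-title">Great sound</a>
  <i data-hook="review-star-rating"><span>5.0 out of 5 stars</span></i>
  <span data-hook="review-body">Works as expected.</span>
</div>
<div data-hook="review">
  <a data-hook="review-title">Missing rating</a>
  <span data-hook="review-body">This one has no star element.</span>
</div>
</body></html>
"""

def parse_reviews(soup):
    """Return one dict per complete review found in `soup`."""
    product = soup.title.text.replace("Amazon.com: Customer reviews:", "").strip()
    parsed = []
    for item in soup.find_all("div", {"data-hook": "review"}):
        title = item.find("a", {"data-hook": "review-title"})
        stars = item.find("i", {"data-hook": "review-star-rating"})
        body = item.find("span", {"data-hook": "review-body"})
        if title is None or stars is None or body is None:
            continue  # incomplete review: skip it, keep processing the page
        parsed.append({
            "product": product,
            "title": title.text.strip(),
            "rating": float(stars.text.replace("out of 5 stars", "").strip()),
            "body": body.text.strip(),
        })
    return parsed

sample_reviews = parse_reviews(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(sample_reviews)
```

Only the first review is returned, because the second one has no star rating element.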

In [24]:
for x in range(1,6):
    soup = get_soup(f'https://www.amazon.com/Apple-AirPods-Charging-Case-Renewed/product-reviews/B07SKLLYTW/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber={x}')
    print(f'Getting page: {x}')
    get_reviews(soup)
    print(len(reviewlist))
    # Stop when the "next page" button is disabled, i.e. on the last page.
    if soup.find('li', {'class': 'a-disabled a-last'}):
        break

df = pd.DataFrame(reviewlist)
df.to_excel('abcdefg.xlsx', index=False)
print('Fin.')

Getting page: 1
60
Getting page: 2
70
Getting page: 3
80
Getting page: 4
90
Getting page: 5
100
Fin.
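
The stopping rule of the loop can also be checked offline. The sketch below assumes, as the loop above does, that the last page marks its "next" button with the classes `a-disabled` and `a-last`. `PAGES` is a hand-written list of pagination fragments and `is_last_page` is a hypothetical helper; it uses a CSS selector so the order of the two classes in the attribute does not matter.

```python
from bs4 import BeautifulSoup

# Hand-written pagination fragments; real Amazon markup may differ.
PAGES = [
    '<ul class="a-pagination"><li class="a-last"><a href="#">Next page</a></li></ul>',
    '<ul class="a-pagination"><li class="a-disabled a-last">Next page</li></ul>',
    '<ul class="a-pagination"><li class="a-last"><a href="#">Next page</a></li></ul>',
]

def is_last_page(soup):
    # Matches an <li> carrying both classes, in any order.
    return soup.select_one("li.a-disabled.a-last") is not None

visited = []
for page_number, html in enumerate(PAGES, start=1):
    soup = BeautifulSoup(html, "html.parser")
    visited.append(page_number)
    if is_last_page(soup):
        break  # the third fragment is never parsed

print(visited)  # prints [1, 2]
```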
