# Collecting Data from Amazon
---
In this notebook, we scrape the missing parts of our dataset directly from Amazon using BeautifulSoup.

The dataset we want:

| Book ID | Review Score | Sales Rank | Category    | Title | Author | Year    | Visual Features     |
| ------- | ------------ | ---------- | ----------- | ----- | ------ | ------- | ------------------- |
| |

<!---| Numeric | Numeric    | Numeric    | Categorical | Numeric | Numeric | Numeric/Categorical |--->

The data we have, as downloaded from [here](http://jmcauley.ucsd.edu/data/amazon/):

| asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
| ---- | ------- | ------- | ---------- | ---------- | ---------- | ------------ | ------- | -------------- |
| |

The `asin` column in the data we have corresponds to Book IDs, so we already have that column. More importantly, it gives us a way to reach the web page of every book in the data: https://www.amazon.com/dp/book-id-here.
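
As a minimal sketch of the URL scheme described above: `product_url` is a hypothetical helper, not part of the original notebook. It assumes ASINs are 10 characters long (as in the samples shown in this notebook) and pads with zeros in case an ID was parsed as an integer along the way.

```python
AMAZON_PRODUCT_URL = "https://www.amazon.com/dp/"  # base URL used in this notebook


def product_url(asin):
    """Return the product page URL for an ASIN (hypothetical helper)."""
    # Assumption: ASINs are 10 characters long, so restore any leading
    # zeros that were lost if the ID was read as a number.
    return AMAZON_PRODUCT_URL + str(asin).zfill(10)


print(product_url("0006476155"))
```

Keeping `asin` as a string throughout avoids the padding step entirely; the `zfill` call is only a safety net.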

We can also get the review score of each book without scraping it, by taking the average of all the reviews each book received.

For everything else, there's ~~Mastercard~~ BeautifulSoup.

In [7]:
import requests
from bs4 import BeautifulSoup

import pandas as pd

import findspark
findspark.init()

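from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Import only the functions we use, so builtins are not shadowed
from pyspark.sql.functions import avg, count, countDistinct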

DATA_DIR = '/Users/dogatekin/Data/'

## Preprocessing

We already have the review data in Parquet format:

In [2]:
reviews = spark.read.parquet(DATA_DIR + "reviews.parquet")
reviews.head()

Row(asin='000100039X', helpful=[0, 0], overall=5.0, reviewText='Spiritually and mentally inspiring! A book that allows you to question your morals and will help you discover who you really are!', reviewTime='12 16, 2012', reviewerID='A10000012B7CGYKOMPQ4L', reviewerName='Adam', summary='Wonderful!', unixReviewTime=1355616000)

Let's check how many unique books we have:

In [6]:
reviews.select(countDistinct('asin')).show()

+--------------------+
|count(DISTINCT asin)|
+--------------------+
|              367982|
+--------------------+



That's not too bad! With about 368,000 books, we might as well compute the average review score of each book in Spark and then move over to pandas.

In [21]:
# show() returns None, so there is nothing useful to assign here
reviews.groupBy('asin').agg(avg('overall'), count('overall')).show()

+----------+------------------+--------------+
|      asin|      avg(overall)|count(overall)|
+----------+------------------+--------------+
|0002216973| 4.833333333333333|            12|
|0006476155| 4.299418604651163|           344|
|0006544150| 4.222222222222222|             9|
|0006550479|3.5428571428571427|            35|
|0007163932|               4.5|            22|
|0023605103|               4.2|             5|
|0027861317|             4.625|             8|
|0028633784| 4.428571428571429|             7|
|0060087447| 3.606060606060606|            33|
|0060192097| 4.555555555555555|            18|
|0060392436| 4.857142857142857|             7|
|0060532564| 4.571428571428571|            21|
|0060540745| 4.228571428571429|            35|
|0060574437| 4.857142857142857|             7|
|0060598808|3.8271604938271606|            81|
|0060611561|              4.35|            20|
|0060731451|4.2631578947368425|            19|
|0060753080|               2.0|            19|
|0060753684| 

In [19]:
books = reviews.groupBy('asin').agg(avg('overall')).toPandas()
books.head()

         asin  avg(overall)
0  0002216973      4.833333
1  0006476155      4.299419
2  0006544150      4.222222
3  0006550479      3.542857
4  0007163932      4.500000


## Scraping

In [25]:
bookid = '0006476155'

In [105]:
# Make the request
r = requests.get('https://www.amazon.com/dp/' + bookid)

print('Response status code: {0}\n'.format(r.status_code))
print('Response headers: {0}\n'.format(r.headers))

Response status code: 200

Response headers: {'Content-Type': 'text/html;charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'Server', 'Date': 'Tue, 20 Nov 2018 23:17:36 GMT', 'Strict-Transport-Security': 'max-age=47474747; includeSubDomains; preload', 'Vary': 'Accept-Encoding,User-Agent,X-Amzn-CDN-Cache', 'P3P': 'policyref="https://www.amazon.com/w3c/p3p.xml",CP="CAO DSP LAW CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA HEA PRE LOC GOV OTC "', 'Cache-Control': 'no-cache, no-transform', 'Content-Encoding': 'gzip', 'X-XSS-Protection': '1;', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'x-amz-rid': '520JZ7Z5ZRSMKS0H6YZC', 'X-Cache': 'Miss from cloudfront', 'Via': '1.1 87e53d6d1b409d9ddfa1cf973907c0eb.cloudfront.net (CloudFront)', 'X-Amz-Cf-Id': '9SSruJ1WyX__OEVM6cxWi-b9GjqdCYjKQjZLSVk5zIniQnKWaHQuew=='}



In [106]:
soup = BeautifulSoup(r.content, 'lxml')

In [108]:
soup.find("span", id='productTitle').string

'Along Came a Spider'

In [110]:
# Book details
book_info = []
for li in soup.select('table#productDetailsTable div.content ul li'):
    try:
        title = li.b
        key = title.text.strip().rstrip(':')
        value = title.next_sibling.strip()
        value = value.strip("()")
        book_info.append((key,value))
    except AttributeError:
        break

In [111]:
book_info

[('Paperback', '448 pages'),
 ('Publisher', 'HarperCollins; New Ed edition (March 2004'),
 ('Language', 'English'),
 ('ISBN-10', '9780006476153'),
 ('ISBN-13', '978-0006476153'),
 ('ASIN', '0006476155'),
 ('Product Dimensions', '4.3 x 1.5 x 6.8 inches'),
 ('Shipping Weight', '8.5 ounces'),
 ('Average Customer Review', ''),
 ('Amazon Best Sellers Rank', '#5,111,537 in Books ')]
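
The key/value pairs above still need to be turned into the Sales Rank, Category, and Year columns we are after. The following is a rough sketch of one way to do that with regular expressions. `parse_details` is a hypothetical helper, and the patterns assume the strings look like the sample output above, which may not hold for every product page.

```python
import re

# A subset of the pairs scraped above, copied from the output
book_info = [
    ('Paperback', '448 pages'),
    ('Publisher', 'HarperCollins; New Ed edition (March 2004'),
    ('Amazon Best Sellers Rank', '#5,111,537 in Books '),
]


def parse_details(pairs):
    """Extract sales rank, category, and year from (key, value) pairs."""
    details = dict(pairs)
    parsed = {'sales_rank': None, 'category': None, 'year': None}

    # Assumed format: '#5,111,537 in Books '
    rank = re.search(r'#([\d,]+) in (.+)',
                     details.get('Amazon Best Sellers Rank', ''))
    if rank:
        parsed['sales_rank'] = int(rank.group(1).replace(',', ''))
        parsed['category'] = rank.group(2).strip()

    # Assumed format: the publisher string contains a four-digit year
    year = re.search(r'\b(\d{4})\b', details.get('Publisher', ''))
    if year:
        parsed['year'] = int(year.group(1))

    return parsed


print(parse_details(book_info))
```

Fields that are missing or do not match the assumed format are left as `None`, so they can be spotted and handled later.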

In [130]:
ratings = soup.select('#histogramTable')[0].text

In [123]:
for i in soup.select('#histogramTable'):
    print(i.text)

5 star58%4 star25%3 star9%2 star4%1 star4%


In [139]:
import re

# Raw string so that \d is passed to the regex engine unchanged
rat = re.findall(r'(\d) star(\d+)%', ratings)
rat

[('5', '58'), ('4', '25'), ('3', '9'), ('2', '4'), ('1', '4')]

In [170]:
%%timeit
avg = 0
for pair in rat:
    avg += int(pair[0])*int(pair[1])/100
avg

1.75 µs ± 15.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
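
The loop above weights each star level by its share of the ratings. As a sketch, the same computation can be wrapped in a function that works directly on the histogram text. `average_from_histogram` is a hypothetical name, and it assumes the text has the `N starP%` shape shown above. It divides by the sum of the percentages instead of a fixed 100, in case the rounded percentages do not add up exactly.

```python
import re


def average_from_histogram(text):
    """Estimate the mean star rating from text like '5 star58%4 star25%'."""
    pairs = re.findall(r'(\d) star(\d+)%', text)
    total = sum(int(pct) for _, pct in pairs)
    if total == 0:
        return None  # no histogram found, or no ratings at all
    weighted = sum(int(stars) * int(pct) for stars, pct in pairs)
    return weighted / total


histogram = '5 star58%4 star25%3 star9%2 star4%1 star4%'
print(average_from_histogram(histogram))
```

For this book the estimate comes out at about 4.29, close to the 4.299 average computed from the review data earlier. The small gap is expected, since the percentages on the page are rounded.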


In [103]:
def book_details_amazon(isbn):
    
    # Amazon Scraping
    amazon_base_url = "https://www.amazon.com/dp/"
    amazon_url = amazon_base_url + isbn
#     req = Request(amazon_url, headers={'User-Agent': 'Mozilla/5.0'})
#     page = urlopen(req).read().decode("utf-8")
    req = requests.get(amazon_url, headers={'User-Agent': 'Mozilla/5.0'})
    page = req.text
    soup = BeautifulSoup(page, 'lxml')
    
    # Book title
    a_title = soup.find("span", id='productTitle').string
    
    # Book details
    book_info = []
    for li in soup.select('table#productDetailsTable div.content ul li'):
        try:
            title = li.b
            key = title.text.strip().rstrip(':')
            value = title.next_sibling.strip()
            value = value.strip("()")
            book_info.append((key,value))
        except AttributeError:
            break
            
    # Amazon reviews scraping
    amazon_review_base_url = "https://www.amazon.com/product-reviews/"
    amazon_review_url = amazon_review_base_url + isbn + "/ref=cm_cr_getr_d_paging_btm_2?pageNumber="
#     req = Request(amazon_review_url, headers={'User-Agent': 'Mozilla/5.0'})
#     page = urlopen(req).read().decode("utf-8")
#     soup = BeautifulSoup(page, 'html.parser')
    req = requests.get(amazon_url, headers={'User-Agent': 'Mozilla/5.0'})
    page = req.text
    soup = BeautifulSoup(page, 'lxml')
    
    # List of book reviews in Amazon
    reviews_list = []
    reviews_list_final = []
    for pg in range(1,5):
        amazon_review_url = amazon_review_base_url + isbn + "/ref=cm_cr_getr_d_paging_btm_2?pageNumber=" + str(pg)
        # Request the review page, not the product page
        req = requests.get(amazon_review_url, headers={'User-Agent': 'Mozilla/5.0'})
        page = req.text
        soup = BeautifulSoup(page, 'lxml')
        txt = soup.find("div", id="cm_cr-review_list")
        try:
            for rawreview in txt.find_all('span', {'class' : 'a-size-base review-text'}):
                text = rawreview.parent.parent.parent.text
                startindex = text.index('5 stars') + 7
                endindex = text.index('Was this review helpful to you?')
                text = text[startindex:endindex]
                text = text.split("Verified Purchase")[1]
                rText = text.split(".")[:-1]
                review_text = ""
                for i in range(len(rText)):
                    review_text += rText[i]
                    review_text += "."
                if review_text != "":
                    if "|" not in review_text:
                        reviews_list.append(review_text)
                    else:
                        rText = text.split(".")[:-2]
                        review_text = ""
                        for x in range(len(rText)):
                            review_text += rText[x]
                            review_text += "."
                        reviews_list.append(review_text)
        except AttributeError:
            review_text = "No reviews found."
    
#     if amazon_reviews_count < len(reviews_list):
#         reviews_list_final = reviews_list[:amazon_reviews_count]
#     else:
    reviews_list_final = reviews_list
        
    # Printing book details from Amazon
    print('**Book Details from Amazon**')
    #print("Book Details from Amazon\n")
    print("Book Title: ",a_title)
    #print("\n")
    for i in range(len(book_info)):
        print(f"{book_info[i][0]} : {book_info[i][1]}")
        #print("\n")
    print("\n")
    if len(reviews_list_final) == 0:
        print("No reviews found.")
        print("\n")
    else:
        print(f"Displaying top {len(reviews_list_final)} book reviews:\n")
        for i in range(len(reviews_list_final)):
            review_txt_list = reviews_list_final[i].split(".")[:3]
            review_txt = ""
            for j in range(len(review_txt_list)):
                review_txt += review_txt_list[j]
                review_txt += "."
            review_txt += ".."
            print(review_txt)
            print("\n")

In [104]:
book_details_amazon(bookid)

**Book Details from Amazon**
Book Title:  Along Came a Spider
Paperback : 448 pages
Publisher : HarperCollins; New Ed edition (March 2004
Language : English
ISBN-10 : 9780006476153
ISBN-13 : 978-0006476153
ASIN : 0006476155
Product Dimensions : 4.3 x 1.5 x 6.8 inches
Shipping Weight : 8.5 ounces
Average Customer Review : 
Amazon Best Sellers Rank : #5,111,537 in Books 


No reviews found.
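
The function above handles a single book, while the dataset contains several hundred thousand. Below is a minimal sketch of how a per-book function might be applied to many ASINs with a pause between requests. `scrape_all`, its `fetch` argument, and the one-second default delay are all assumptions made for illustration, not part of the original notebook.

```python
import time


def scrape_all(asins, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(asin) for each ASIN, pausing between requests.

    Returns (results, failures): successful results keyed by ASIN, and
    error messages for the ASINs whose fetch raised an exception.
    """
    results, failures = {}, {}
    for i, asin in enumerate(asins):
        if i > 0:
            sleep(delay_seconds)  # wait before every request after the first
        try:
            results[asin] = fetch(asin)
        except Exception as exc:  # record the failure and keep going
            failures[asin] = repr(exc)
    return results, failures


# Usage with a stand-in fetch function (no network access needed)
def fake_fetch(asin):
    if asin == 'bad-id':
        raise ValueError('page not found')
    return {'asin': asin}


results, failures = scrape_all(['0006476155', 'bad-id'], fake_fetch,
                               delay_seconds=0)
print(results)
print(failures)
```

Collecting failures separately makes it possible to retry only the books that failed, instead of restarting the whole run.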


