# Collecting Data from Amazon
---
In this notebook, we collect the necessary data by scraping it directly from Amazon.

The dataset we want:

| ID | Review Score | Sales Rank | Category    | Title | Author | Date    | Visual Features     |
| -- | ------------ | ---------- | ----------- | ----- | ------ | ------- | ------------------- |

The dataset we have, as downloaded from [here](https://github.com/uchidalab/book-dataset):

| ID | Filename | Image URL | Title | Author | Category ID | Category |
| -- | -------- | --------- | ----- | ------ | ----------- | -------- |

The `ID` column in the data can be used to access the webpage of each book, by connecting to https://www.amazon.com/dp/book-id. This allows us to scrape any data that is missing directly from Amazon.

We already have the Title, Author and Category of each book ready to be used.

For everything else, there's ~~Mastercard~~ BeautifulSoup.

In [3]:
# To request data from Amazon
import requests
from bs4 import BeautifulSoup

# To open image links
import urllib

# To process data
import pandas as pd
import numpy as np

# To extract information from weirdly formatted Amazon info
import re

# To create random delays to trick the Amazon bot detector
from time import sleep
import random

# To rotate IPs while scraping | WARNING: Don't forget to run `tor` in the terminal before executing this cell
from torrequest import TorRequest
tor = TorRequest(password='i<3cs401')

# To rotate user-agents while scraping
from fake_useragent import UserAgent
user_agent = UserAgent()

# To read data
import csv

# For file operations
import os
import shutil

# For parallelizing tasks 
import dask.dataframe as dd
import dask.multiprocessing
from dask import compute, delayed

# To print an image in the notebook programmatically
from IPython.display import Markdown

# Set data directories
ORIGINAL_DATA_DIR = 'Original Data/'
COLLECTED_DATA_DIR = 'Collected Data/'
IMAGE_DIR = COLLECTED_DATA_DIR + 'Cover Images/'
HTML_DIR = '/Users/dogatekin/Data/HTML Files/'

## Preprocessing the Original Data
---

Load the data:

In [4]:
header_names = ['ID', 'Filename', 'Image URL', 'Title', 'Author', 'Category ID', 'Category']

books = pd.read_csv(ORIGINAL_DATA_DIR + 'book32-listing.csv', encoding='latin1', header=None, names=header_names)
books.head()

Unnamed: 0,ID,Filename,Image URL,Title,Author,Category ID,Category
0,761183272,0761183272.jpg,http://ecx.images-amazon.com/images/I/61Y5cOdH...,Mom's Family Wall Calendar 2016,Sandra Boynton,3,Calendars
1,1623439671,1623439671.jpg,http://ecx.images-amazon.com/images/I/61t-hrSw...,Doug the Pug 2016 Wall Calendar,Doug the Pug,3,Calendars
2,B00O80WC6I,B00O80WC6I.jpg,http://ecx.images-amazon.com/images/I/41X-KQqs...,"Moleskine 2016 Weekly Notebook, 12M, Large, Bl...",Moleskine,3,Calendars
3,761182187,0761182187.jpg,http://ecx.images-amazon.com/images/I/61j-4gxJ...,365 Cats Color Page-A-Day Calendar 2016,Workman Publishing,3,Calendars
4,1578052084,1578052084.jpg,http://ecx.images-amazon.com/images/I/51Ry4Tsq...,Sierra Club Engagement Calendar 2016,Sierra Club,3,Calendars


Inspect the categories:

In [5]:
print('\n'.join(books['Category'].unique()))

Calendars
Comics & Graphic Novels
Test Preparation
Mystery, Thriller & Suspense
Science Fiction & Fantasy
Romance
Humor & Entertainment
Literature & Fiction
Gay & Lesbian
Engineering & Transportation
Cookbooks, Food & Wine
Crafts, Hobbies & Home
Arts & Photography
Education & Teaching
Parenting & Relationships
Self-Help
Computers & Technology
Medical Books
Science & Math
Health, Fitness & Dieting
Business & Money
Law
Biographies & Memoirs
History
Politics & Social Sciences
Reference
Christian Books & Bibles
Religion & Spirituality
Sports & Outdoors
Teen & Young Adult
Children's Books
Travel


We only want the Children's Books:

In [6]:
books = books[books['Category'] == "Children's Books"].reset_index(drop=True)
# We don't need the Category or Category ID columns anymore
books.drop(columns=['Category ID', 'Category'], inplace=True)
books.head()

Unnamed: 0,ID,Filename,Image URL,Title,Author
0,545790352,0545790352.jpg,http://ecx.images-amazon.com/images/I/51MIi4p2...,Harry Potter and the Sorcerer's Stone: The Ill...,J.K. Rowling
1,1419717014,1419717014.jpg,http://ecx.images-amazon.com/images/I/61YgGsg-...,Diary of a Wimpy Kid: Old School,Jeff Kinney
2,1423160916,1423160916.jpg,http://ecx.images-amazon.com/images/I/611CmvkL...,"Magnus Chase and the Gods of Asgard, Book 1: T...",Rick Riordan
3,1476789886,1476789886.jpg,http://ecx.images-amazon.com/images/I/51KqU7Dw...,Rush Revere and the Star-Spangled Banner,Rush Limbaugh
4,1338029991,1338029991.jpg,http://ecx.images-amazon.com/images/I/61kvq74k...,Harry Potter Coloring Book,Scholastic


Let's check how many books we have left:

In [7]:
len(books)

13605

Finally, let's fix the IDs in the dataset. For some reason, the ID column has the leading 0s removed (normally all of them should be 10 characters long), which makes the webpages inaccessible. The filename column has the correct IDs with the correct number of leading 0s. So let's use the Filename column as the new ID column, we can add the `.jpg` extension later when downloading:

In [8]:
books['ID'] = books['Filename'].apply(lambda row: re.findall(u'(.*).jpg', row)[0])
books.drop(columns='Filename', inplace=True)
books.head()

Unnamed: 0,ID,Image URL,Title,Author
0,545790352,http://ecx.images-amazon.com/images/I/51MIi4p2...,Harry Potter and the Sorcerer's Stone: The Ill...,J.K. Rowling
1,1419717014,http://ecx.images-amazon.com/images/I/61YgGsg-...,Diary of a Wimpy Kid: Old School,Jeff Kinney
2,1423160916,http://ecx.images-amazon.com/images/I/611CmvkL...,"Magnus Chase and the Gods of Asgard, Book 1: T...",Rick Riordan
3,1476789886,http://ecx.images-amazon.com/images/I/51KqU7Dw...,Rush Revere and the Star-Spangled Banner,Rush Limbaugh
4,1338029991,http://ecx.images-amazon.com/images/I/61kvq74k...,Harry Potter Coloring Book,Scholastic


Save this small version of the data for later use:

In [11]:
books.drop(columns=['Image URL']).to_csv(COLLECTED_DATA_DIR + 'orig_data.csv', index=False)

## Scraping New Data
---

The columns we need to scrape are: `Review Score`, `Sales Rank` and `Date`. We also need to download the images from the URLs so that we can extract visual features from them, completing our dataset. Just in case we need some other information in the future from the webpages, we will also save the raw HTML files so we don't have to scrape them from Amazon again.

First we will demonstrate the scraping process for each column on an arbitrary example, then we will combine these in a function and scrape the information for all the books.

In [7]:
example_book = books.iloc[0]
example_book

ID                                                  0545790352
Image URL    http://ecx.images-amazon.com/images/I/51MIi4p2...
Title        Harry Potter and the Sorcerer's Stone: The Ill...
Author                                            J.K. Rowling
Name: 0, dtype: object

### Connecting to Amazon

This step is trickier than it sounds. Sending many requests to Amazon servers in quick succession always leads to Captcha pages that check if the request came from a human. In this case, it is indeed not coming from a human so we need to be smarter. We use Tor requests to be able to change our IP at any time and also rotate the User Agent we use to send the request.

We also noticed that at least one Tor IP was unable to connect to the servers, so we try the initial request many times with different IPs and user agents until we get a response without any connection errors or getting caught by the bot detector. When a request is successful, we keep using the found IP-agent pair until it fails:

In [123]:
def connect(book_id, agent=user_agent.random, max_tries=10, wait=600):
    for i in range(max_tries):
        try:
            # Creating random delays before requests helps to avoid detection
#             sleep(random.randint(1, 2))
            
            # Try to connect
            response = tor.get('https://www.amazon.com/dp/' + book_id, headers={'User-Agent': agent})
            status = response.status_code
            
            # Check if page still exists
            if(status != 200):
                return status, None, agent, None
            
            # Make soup if we didn't get any errors
            soup = BeautifulSoup(response.text, 'lxml')
            
            # If we get redirected to a Captcha page raise error to try again
            if(soup.title.string == 'Robot Check'):
                raise ConnectionError
            
            # If we successfully reach the webpage, return the soup, successful agent and raw HTML
            return status, soup, agent, response.text
        
        except ConnectionError:
            # If something is wrong with the IP, get a new IP and user agent and try again
            tor.reset_identity()
            agent = user_agent.random
            if(i == int(max_tries / 2)):
                print(f'Half of {max_tries} trials failed for book ID {book_id}, waiting for {int(wait/60)} mins.      ', end='\r')
                sleep(wait)
            else:
                print(f'Trial {i+1} failed to connect for book ID {book_id}, resetting IP and trying again.', end='\r')
    
    raise ConnectionError

Try it on our example book:

In [63]:
_, soup, _, _ = connect(example_book['ID'])
soup.title.string

"Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter, Book 1): J.K. Rowling, Jim Kay: 9780545790352: Amazon.com: Books"

### Sales Rank and Date

We can get both of these from the product details table on the webpage, which is in a table conveniently named `productDetailsTable`:

In [120]:
soup.select('#productDetailsTable li b')

[<b>Age Range:</b>,
 <b>Grade Level:</b>,
 <b>Series:</b>,
 <b>Hardcover:</b>,
 <b>Publisher:</b>,
 <b>Language:</b>,
 <b>ISBN-10:</b>,
 <b>ISBN-13:</b>,
 <b>
     Product Dimensions: 
     </b>,
 <b>Shipping Weight:</b>,
 <b>Average Customer Review:</b>,
 <b>Amazon Best Sellers Rank:</b>,
 <b><a href="https://www.amazon.com/gp/bestsellers/books/3153/ref=pd_zg_hrsr_books_1_5_last/134-2712085-9861750">Friendship</a></b>,
 <b><a href="https://www.amazon.com/gp/bestsellers/books/2967/ref=pd_zg_hrsr_books_2_3_last/134-2712085-9861750">Action &amp; Adventure</a></b>,
 <b><a href="https://www.amazon.com/gp/bestsellers/books/3017/ref=pd_zg_hrsr_books_3_4_last/134-2712085-9861750">Fantasy &amp; Magic</a></b>]

We can use regex to extract the info we need from the table:

In [121]:
for li in soup.select('#productDetailsTable li'):
    # We only need two of the list items
    if(li.b.string == 'Amazon Best Sellers Rank:'):
        # The rank is given in the format #1,234,567
        sales_rank = re.findall(u'#([\d,]+)', li.b.nextSibling)[0]
    elif(li.b.string == 'Publisher:'):
        # The date is in the last set of parantheses
        date = re.findall(u'\(([^\(\)]*)\)$', li.b.nextSibling)[0]
        
print(f'Sales Rank: {sales_rank}\nDate: {date}')

Sales Rank: 124
Date: October 6, 2015


Turn it into a function:

In [9]:
def extract_rank_date(soup):
    # Initial values to return if cannot be scraped
    sales_rank = date = None
    
    for li in soup.select('#productDetailsTable li'):
        try:
            if(li.b.string == 'Amazon Best Sellers Rank:'):
                sales_rank = re.findall(u'#([\d,]+)', li.b.nextSibling)[0]  # Format: #1,234,567
                sales_rank = int(sales_rank.replace(',',''))  # Remove the commas and convert to integer
        except:
            sales_rank = None  # couldn't scrape
            
        try:
            if(li.b.string == 'Publisher:'):
                date = re.findall(u'\(([^\(\)]*)\)$', li.b.nextSibling)[0]  # Format: Inside last parantheses
        except:
            date = None  # couldn't scrape
                
    return sales_rank, date

Try on example:

In [123]:
extract_rank_date(soup)

(124, 'October 6, 2015')

### Review Score

You might have noticed there is also an item called `Average Customer Review` in the table we just used to extract the Rank and Date. Inside that item, all the review scores are found in a table with the id `histogramTable`, that gives the percentages of users for each score from 1 to 5 stars.

In [124]:
reviews = soup.select('#histogramTable')[0].text
reviews

'5 star87%4 star8%3 star2%2 star1%1 star2%'

The formatting is not great, but it's nothing we can't fix by using a simple regular expression:

In [125]:
reviews = re.findall(u'(\d) star(\d+)%', reviews)
reviews

[('5', '87'), ('4', '8'), ('3', '2'), ('2', '1'), ('1', '2')]

The weighted average of these scores is our final Review Score for the given book:

In [126]:
score = 0
for pair in reviews:
    score += int(pair[0]) * int(pair[1])/100  # weights are percentages

round(score, 3)

4.77

Turn into a function:

In [10]:
def extract_score(soup):
    # Initial value to return if cannot be scraped
    score = None
    
    try:
        reviews = soup.select('#histogramTable')[0].text
        reviews = re.findall(u'(\d) star(\d+)%', reviews)

        score = 0
        for pair in reviews:
            score += int(pair[0]) * int(pair[1])/100  # weights are percentages

        score = round(score, 3)
    except:
        score = None  # couldn't scrape
    
    return score

Try on example:

In [128]:
extract_score(soup)

4.77

### Cover Image

The image URL of each book is available in the original dataset, let's make a HashMap of `ID:URL` pairs:

In [14]:
urls = books[['ID', 'Image URL']].set_index('ID').to_dict()['Image URL']

# Show random 5 mappings
dict(list(urls.items())[:5])

{'0545790352': 'http://ecx.images-amazon.com/images/I/51MIi4p2YyL.jpg',
 '1419717014': 'http://ecx.images-amazon.com/images/I/61YgGsg-k-L.jpg',
 '1423160916': 'http://ecx.images-amazon.com/images/I/611CmvkLO4L.jpg',
 '1476789886': 'http://ecx.images-amazon.com/images/I/51KqU7Dw9SL.jpg',
 '1338029991': 'http://ecx.images-amazon.com/images/I/61kvq74kVSL.jpg'}

Test it on our example book:

In [130]:
example_url = urls[example_book['ID']]
example_url

'http://ecx.images-amazon.com/images/I/51MIi4p2YyL.jpg'

Have a look:

In [131]:
Markdown(f'![Example Image]({example_url})')

![Example Image](http://ecx.images-amazon.com/images/I/51MIi4p2YyL.jpg)

Let's turn it into a function:

In [4]:
def download_image(book_id):
    url = urls[book_id]
    filename = book_id + '.jpg'
    
    # Download only if not already downloaded
    if not os.path.isfile(IMAGE_DIR + filename):
        downloaded_img = urllib.request.urlopen(url)
        f = open(IMAGE_DIR + filename, mode='wb')
        f.write(downloaded_img.read())
        downloaded_img.close()
        f.close()

### Raw HTML

Save the raw HTML files so we don't have to scrape them from Amazon again.

In [13]:
def save_html(book_id, html_text):
    filename = book_id + '.html'
    
    # Save only if not already saved
    if not os.path.isfile(HTML_DIR + filename):
        html_file = open(HTML_DIR + filename,"w")
        html_file.write(html_text)
        html_file.close()

### Bringing it together

Let's bring all of the functions we created together under one function that will connect to the webpage, scrape all the necessary info, download the cover image and save the HTML file.

In [14]:
def scrape_info(book_id, agent=user_agent.random):
    try:
        # Connect to Amazon, keep track of agent
        status, soup, current_agent, raw_html = connect(book_id, agent)

        if(status == 200):
            # Save the HTML file
            save_html(book_id, raw_html)

            # Get sales rank and date
            sales_rank, date = extract_rank_date(soup)

            # Get average review score
            score = extract_score(soup)

            # Download cover image
            download_image(book_id)
        else:
            # Log the error
            sales_rank = date = score = f'Error {status}'
        
    except ConnectionError:
        current_agent = agent
        sales_rank = date = score = None
        
    return current_agent, book_id, sales_rank, date, score

Let's do a final test on the example book we used above:

In [178]:
scraped = scrape_info(example_book['ID'])
scraped[1:]

('0545790352', 132, 'October 6, 2015', 4.77)

## Completing the dataset
---

To be able to stop and continue at will, we will write the scraped info to a csv file as we go along, and simultaneously download cover images. Let's initialize this file with a meaningful header:

In [136]:
with open(COLLECTED_DATA_DIR + 'scraped.csv', 'a') as file:
    writer = csv.writer(file)
    writer.writerow(['ID', 'Sales Rank', 'Date', 'Review Score'])

Now we go through the dataset, starting scraping from where we last left off:

In [15]:
with open(COLLECTED_DATA_DIR + 'scraped.csv', 'a+') as file:
    reader = csv.reader(file)
    writer = csv.writer(file)
    
    # Look at the last scraped book to continue from the next one in the dataset
    file.seek(0)
    last_scraped = next(reversed(list(reader)))[0]
    
    if(last_scraped == 'ID'):
        # Nothing was scraped yet, start from the beginning
        index = 0
    else:
        # At least one book was scraped, find the index of the last scraped book and start from the next one
        last_scraped_index = books.index[books['ID'] == last_scraped].tolist()[0]
        index = last_scraped_index + 1
     
    try:
        agent = user_agent.random
        count = 0    
        while(index < books.shape[0]):
            current_id = books.iloc[index]['ID']
            scraped = scrape_info(current_id, agent)

            # Keep track of agent
            agent = scraped[0]

            writer.writerow(scraped[1:])
            file.flush()

            index += 1
            count += 1

            # Clean the previous line while printing info about scraping progress
            print(f'Number of scraped books: {count}                                                     ', end='\r')
            
        print(f'Scraping finished, enjoy your {books.shape[0]} books!')
    except KeyboardInterrupt:
        print(f'Scraping stopped by manual interruption. Check the last downloaded book cover image and the last row of the CSV file to make sure there were no corruptions. Total number of books scraped until interruption: {count}.')

Scraping finished, enjoy your 13605 books!                                       


## Completing the Missing Parts
---

Because there were some imperfections in the scraping, here we collect whatever we missed in the first pass-through.

In [67]:
scraped = pd.read_csv('Collected Data/scraped.csv')
display(scraped.head(), scraped.shape)

Unnamed: 0,ID,Sales Rank,Date,Review Score
0,545790352,118,"October 6, 2015",4.77
1,1419717014,399,"November 3, 2015",4.8
2,1423160916,9637,"October 6, 2015",4.6
3,1476789886,5439,"October 27, 2015",4.9
4,1338029991,196,"November 10, 2015",4.61


(13605, 4)

The books where the scraping failed were written to the file with three `None`s. So those rows are the ones we should go over again:

In [68]:
missed = scraped[scraped.drop(columns='ID').isna().all(1)]
display(missed.head(), missed.shape)

Unnamed: 0,ID,Sales Rank,Date,Review Score
835,62233009,,,
839,753456095,,,
843,439903742,,,
844,399256059,,,
847,1770496459,,,


(946, 4)

Let's write functions to do this and for now save them in another folder to be safe:

In [16]:
MISSED_HTML_DIR = '/Users/dogatekin/Data/Missed HTML Files/'

def missed_save_html(book_id, html_text, path=MISSED_HTML_DIR):
    filename = book_id + '.html'
    
    # Save only if not already saved
    if not os.path.isfile(path + filename):
        html_file = open(path + filename,"w")
        html_file.write(html_text)
        html_file.close()

def collect_missed(book_id, path=MISSED_HTML_DIR):
    filename = book_id + '.html'
    
    if not os.path.isfile(path + filename):
        status, _, _, html_text = connect(book_id)
        
        if status == 200:
            missed_save_html(book_id, html_text, path)
        
        return book_id, status

Do the scraping and see if we had any errors:

In [72]:
errors = np.array(compute(*[delayed(collect_missed)(ID) for ID in missed['ID']], scheduler='processes'))

print(f'Successfully downloaded: {len(errors[errors == None])}')
print(f'Encountered errors on: {(len(errors[errors != None]))}')
errors[errors != None]

Successfully downloaded: 944
Encountered errors on: 2


array([('1508457123', 404), ('1452112657', 404)], dtype=object)

There were two unreachable books among the ones we missed (pages that give 404 errors cannot be reached by browsers either), but we successfully downloaded the HTML files of all the other missing books. We now add these to the previously collected files:

In [92]:
for file in os.listdir(MISSED_HTML_DIR):
    shutil.copy(MISSED_HTML_DIR + file, HTML_DIR)
    
len(os.listdir(HTML_DIR))

13345

We are still missing a few HTML files. That is because we didn't start collecting the HTML files until some time after we started the scraping process. Let's quickly complete those as well:

In [89]:
errors = np.array(compute(*[delayed(collect_missed)(ID, HTML_DIR) for ID in books['ID']], scheduler='processes'))

print(f'Downloaded HTML files: {len(errors[errors == None])}')
print(f'Encountered errors on: {(len(errors[errors != None]))}')
errors[errors != None]

Downloaded HTML files: 13535
Encountered errors on: 70


array([('1507745923', 404), ('1423160657', 404), ('1508457123', 404),
       ('1452112657', 404), ('151206212X', 404), ('0375848134', 404),
       ('1846432065', 404), ('1494431726', 404), ('1500600016', 404),
       ('1508743061', 404), ('1512373753', 404), ('1511822074', 404),
       ('1512212512', 404), ('150310902X', 404), ('151413151X', 404),
       ('1511878282', 404), ('1497403693', 404), ('1505211905', 404),
       ('1505488818', 404), ('1505598257', 404), ('1505532914', 404),
       ('1505613361', 404), ('1503322971', 404), ('1505808871', 404),
       ('1514369303', 404), ('1505630592', 404), ('1503324354', 404),
       ('B0144KN6PC', 404), ('B00XLZW19O', 404), ('B00XLX3W9O', 404),
       ('B00PA1HRSW', 404), ('B00R5K8JPG', 404), ('B00PDDRNAO', 404),
       ('B00P8HRS8W', 404), ('B00NMNQNTY', 404), ('B00QZ86LRW', 404),
       ('B00PO53R8I', 404), ('B00PSOCBXM', 404), ('B00SI7QN2G', 404),
       ('B00SI7QG0K', 404), ('B00KUV5FY0', 404), ('B00PG06PCQ', 404),
       ('B00Q8U5H6S'

70 of the pages in the data are unreachable, but we have everything else. Now that we have the HTML files of all the books, it is easy to extract the information locally without worrying about being blocked by Amazon. In fact, let's quickly rewrite our extraction functions to work on local HTML data while also cleaning them and improving them. In fact, let's add a review count extractor as well:

In [38]:
soup = BeautifulSoup(open(HTML_DIR + '0807588997.html'), 'lxml')
re.findall(r'#([\d,]+) in Books \(', soup.select('#SalesRank')[0].find('td', {'class':'value'}).text)[0]

'85,536'

In [44]:
for i in range(10) if i > 5:
    print(i)

SyntaxError: invalid syntax (<ipython-input-44-e2aa78b590ac>, line 1)

In [56]:
def extract_sales_rank(soup):
    if soup.select('#SalesRank'):
        if soup.select('#SalesRank')[0].b:
            if re.findall(r'#([\d,]+) in Books \(', soup.select('#SalesRank')[0].b.nextSibling):
                sales_rank = re.findall(r'#([\d,]+) in Books \(', soup.select('#SalesRank')[0].b.nextSibling)[0]
                sales_rank = int(sales_rank.replace(',',''))
            else:
                sales_rank = None
        elif soup.select('#SalesRank')[0].find('td', {'class':'value'}):
            sales_rank = re.findall(r'#([\d,]+) in Books \(', soup.select('#SalesRank')[0].find('td', {'class':'value'}).text)[0]
            sales_rank = int(sales_rank.replace(',',''))
        else:
            sales_rank = None
    elif soup.select('#productDetails_detailBullets_sections1'):
        if re.findall(r'#([\d,]+) in Books \(', soup.select('#productDetails_detailBullets_sections1')[0].text):
            sales_rank = re.findall(r'#([\d,]+) in Books \(', soup.select('#productDetails_detailBullets_sections1')[0].text)[0]
            sales_rank = int(sales_rank.replace(',',''))
        else:
            sales_rank = None
    else:
        # Sales rank is either not given, or it is not given for the Books category
        sales_rank = None

    return sales_rank

def extract_date(soup):
#     if re.findall(r'– (.*)', soup.select('#title')[0].findAll('span')[-1].text):
#         date = re.findall(r'– (.*)', soup.select('#title')[0].findAll('span')[-1].text)[0]
#     else:
    for li in soup.select('#productDetailsTable li'):
        if(li.b and li.b.string == 'Publisher:' and re.findall(u'\(([^\(\)]*)\)$', li.b.nextSibling)):
            date = re.findall(u'\(([^\(\)]*)\)$', li.b.nextSibling)[0]
            break
    else:
        # Date could not be found
        date = None

    return date

def extract_score(soup):
    if soup.select('#histogramTable'):
        reviews = soup.select('#histogramTable')[0].text
        reviews = re.findall(u'(\d) star(\d+)%', reviews)

        score = 0
        for pair in reviews:
            score += int(pair[0]) * int(pair[1])/100  # weights are percentages

        score = round(score, 3)
    else:
        # Score could not be found
        score = None
        
    return score

def extract_review_count(soup):
    if soup.select('#acrCustomerReviewText'):
        review_count = re.split(' ', soup.select('#acrCustomerReviewText')[0].string)[0]
        review_count = int(review_count.replace(',',''))
    elif soup.select('#acrCustomerWriteReviewText')[0].string == 'Be the first to review this item':
        review_count = 0
    else:
        # Review count cannot be found
        review_count = None
    
    return review_count

And one function to bring them together:

In [40]:
def extract_from_html(book_id, path=HTML_DIR):    
    filename = book_id + '.html'
    
    soup = BeautifulSoup(open(path + filename), "lxml")
    
    try:
        sales_rank = extract_sales_rank(soup)
        date = extract_date(soup)
        score = extract_score(soup)
        review_count = extract_review_count(soup)

        # Download the image as well if it is not already downloaded
        download_image(book_id)

        return book_id, sales_rank, date, score, review_count
    except Exception as e:
        print(f'Error on: https://www.amazon.com/dp/{book_id}')
        raise e

Now use the efficiency of parallelization through Dask to complete the dataset in a few minutes:

In [57]:
# Get the IDs for the local HTML files
htmlIDs = [re.findall(r'(.*).html', file)[0] for file in os.listdir(HTML_DIR) if file != '.DS_Store']

dataset = pd.DataFrame(list(compute(*[delayed(extract_from_html)(ID) for ID in htmlIDs], scheduler='processes')))
display(dataset.head(), dataset.shape)

Unnamed: 0,0,1,2,3,4
0,0553522779,799.0,"July 28, 2015",4.78,1122
1,1596471328,217944.0,"January 1, 2007",4.63,8
2,0981973353,711910.0,"March 30, 2011",5.0,12
3,0822549328,8034749.0,"August 1, 1998",4.0,1
4,061559722X,1069427.0,"November 28, 2014",5.0,4


(13535, 5)

In [64]:
dataset = dataset.rename({0:'ID', 1:'Sales Rank', 2:'Date', 3:'Review Score', 4:'Review Count'}, axis=1)
dataset.to_csv(COLLECTED_DATA_DIR + 'book_info.csv', index=False)

**Sanity Check:** Do the book IDs in the images folder, HTML folder and the dataset match?

In [71]:
dataIDs = dataset['ID']
imageIDs = [re.findall(r'(.*).jpg', image)[0] for image in os.listdir(IMAGE_DIR) if image != '.DS_Store' and image != 'Icon\r']
htmlIDs = [re.findall(r'(.*).html', file)[0] for file in os.listdir(HTML_DIR) if file != '.DS_Store']

set(dataIDs) == set(htmlIDs) == set(imageIDs)

True

## Processing the Collected Data
---

Now that we have the data collected, we should make sure it's clean before moving on.

In [72]:
data = pd.read_csv(COLLECTED_DATA_DIR + 'book_info.csv')
data.head()

Unnamed: 0,ID,Sales Rank,Date,Review Score,Review Count
0,0553522779,799.0,"July 28, 2015",4.78,1122
1,1596471328,217944.0,"January 1, 2007",4.63,8
2,0981973353,711910.0,"March 30, 2011",5.0,12
3,0822549328,8034749.0,"August 1, 1998",4.0,1
4,061559722X,1069427.0,"November 28, 2014",5.0,4


Let's check how many books we currently have:

In [73]:
data.shape

(13535, 5)

Let's see which of the rows have missing data:

In [94]:
missing = data[data['Sales Rank'].isna() | data['Date'].isna() | data['Review Score'].isna() | data['Review Count'].isna()]
display(missing.head(), missing.shape)

Unnamed: 0,ID,Sales Rank,Date,Review Score,Review Count
61,1603800425,,,4.6,10
86,B0056KOL7M,,,0.0,0
206,0789472570,,"March 15, 2000",0.0,0
363,0750222247,,"October 31, 1998",0.0,0
384,1910199427,,,5.0,1


(142, 5)

After looking at a bunch of these listings on Amazon, we see that the missing values are simply not given on their webpages. That is, they don't have a sales rank and/or their publishing date is not written. For this initial analysis, we drop these:

In [95]:
data.dropna(inplace=True)

Let's look at the types:

In [97]:
data.dtypes

ID               object
Sales Rank      float64
Date             object
Review Score    float64
Review Count      int64
dtype: object

The Date column could be converted into datetime, but we have to keep in mind that we don't have all the date information for each book. For some of them we have all of day, month and year; for some we only have month and year; for others we only have the year. So when we convert, we will see the first day of the month for the ones we don't have day data and we will see the first of January for the ones we don't have day or month data.

In [98]:
data['Date'] = pd.to_datetime(data['Date'])

Let's look at some statistics:

In [103]:
# Numerical features
display(data.describe())

# Date
display(data.describe(include=np.datetime64))

Unnamed: 0,Sales Rank,Review Score,Review Count
count,13393.0,13393.0,13393.0
mean,1448022.0,4.038758,126.155006
std,2821379.0,1.500238,659.159407
min,9.0,0.0,0.0
25%,47113.0,4.25,4.0
50%,297538.0,4.62,21.0
75%,1346889.0,4.79,73.0
max,21471870.0,5.0,27882.0


Unnamed: 0,Date
count,13393
unique,3285
top,2015-09-01 00:00:00
freq,87
first,1920-01-01 00:00:00
last,2018-06-12 00:00:00


We have books all the way from 1920 to a few months ago! We would also like to see if the sales ranks are unique:

In [132]:
print(f'{len(data["Sales Rank"].unique())}')
print(f'{len(data)}')

13214
13393


The duplicated sales ranks are interesting because we would expect each ranking to be unique; we should look into that. Let's look at the top one:

In [119]:
ranks, counts = np.unique(data['Sales Rank'], return_counts=True)

duplicates = data[data['Sales Rank'] == ranks[np.argmax(counts)]]
duplicates

Unnamed: 0,ID,Sales Rank,Date,Review Score,Review Count
8083,0843107588,777.0,2004-02-02,4.68,277
9132,084313271X,777.0,2008-09-04,4.69,247
12483,142630918X,777.0,2012-05-08,4.87,175


The easiest way to investigate is to scrape their ranks again from live data:

In [124]:
duplicates.apply(lambda row: extract_sales_rank(connect(row['ID'])[1]), axis=1)

8083     665
9132     600
12483    528
dtype: int64

The sales ranks on Amazon can change very easily over time as millions of users buy items every second. We see that these books are quite close to the Sales Rank values they have in our data but all of them have changed slightly.

Since the scraping is not done all at once and since the rankings keep changing, it is understandable that multiple books that have comparable sales have the same ranking at the time we scrape them. We decided that this is not a problem at this point.