# Collecting Data from Amazon
---
In this notebook, we scrape the missing parts of our dataset directly from Amazon using BeautifulSoup.

The dataset we want:

| ID | Review Score | Sales Rank | Category    | Title | Author | Date    | Visual Features     |
| -- | ------------ | ---------- | ----------- | ----- | ------ | ------- | ------------------- |
| |

The dataset we have, as downloaded from [here](https://github.com/uchidalab/book-dataset):

| ID | Filename | Image URL | Title | Author | Category ID | Category |
| -- | -------- | --------- | ----- | ------ | ----------- | -------- |
| |

The `ID` column can be used to reach each book's product page at `https://www.amazon.com/dp/<ID>`, where `<ID>` is the value from that column. This allows us to scrape any missing data directly from Amazon.

We already have the Title, Author and Category of each book ready to be used.

For everything else, there's ~~Mastercard~~ BeautifulSoup.

In [45]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import re
from datetime import datetime
from time import sleep
from random import randint
import csv

## Preprocessing

Load the data:

In [26]:
header_names = ['ID', 'Filename', 'Image URL', 'Title', 'Author', 'Category ID', 'Category']

books = pd.read_csv('book32-listing.csv', encoding='latin1', header=None, names=header_names)
books.head()

Unnamed: 0,ID,Filename,Image URL,Title,Author,Category ID,Category
0,761183272,0761183272.jpg,http://ecx.images-amazon.com/images/I/61Y5cOdH...,Mom's Family Wall Calendar 2016,Sandra Boynton,3,Calendars
1,1623439671,1623439671.jpg,http://ecx.images-amazon.com/images/I/61t-hrSw...,Doug the Pug 2016 Wall Calendar,Doug the Pug,3,Calendars
2,B00O80WC6I,B00O80WC6I.jpg,http://ecx.images-amazon.com/images/I/41X-KQqs...,"Moleskine 2016 Weekly Notebook, 12M, Large, Bl...",Moleskine,3,Calendars
3,761182187,0761182187.jpg,http://ecx.images-amazon.com/images/I/61j-4gxJ...,365 Cats Color Page-A-Day Calendar 2016,Workman Publishing,3,Calendars
4,1578052084,1578052084.jpg,http://ecx.images-amazon.com/images/I/51Ry4Tsq...,Sierra Club Engagement Calendar 2016,Sierra Club,3,Calendars
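
The `ID` values printed above lack the leading zeros that the `Filename` column still has (`761183272` versus `0761183272.jpg`). If that reflects the loaded data and not just the display, product URLs built from those IDs will point at the wrong page. A minimal sketch of a repair, assuming every ID is a 10-character ASIN or ISBN-10; `normalize_id` is a hypothetical helper, not part of the original notebook:

```python
def normalize_id(value):
    """Return a book ID as a 10-character string, restoring leading zeros."""
    # Assumption: every ID in the dataset is exactly 10 characters long
    return str(value).strip().zfill(10)

print(normalize_id(761183272))     # 0761183272
print(normalize_id('B00O80WC6I'))  # B00O80WC6I
```

Passing `dtype={'ID': str}` to `pd.read_csv` should avoid the problem at load time, since pandas then keeps the column as text.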


Inspect the categories:

In [27]:
print('\n'.join(books['Category'].unique()))

Calendars
Comics & Graphic Novels
Test Preparation
Mystery, Thriller & Suspense
Science Fiction & Fantasy
Romance
Humor & Entertainment
Literature & Fiction
Gay & Lesbian
Engineering & Transportation
Cookbooks, Food & Wine
Crafts, Hobbies & Home
Arts & Photography
Education & Teaching
Parenting & Relationships
Self-Help
Computers & Technology
Medical Books
Science & Math
Health, Fitness & Dieting
Business & Money
Law
Biographies & Memoirs
History
Politics & Social Sciences
Reference
Christian Books & Bibles
Religion & Spirituality
Sports & Outdoors
Teen & Young Adult
Children's Books
Travel


We only want the Children's Books and we don't need the Category ID:

In [28]:
books = books[books['Category'] == "Children's Books"].reset_index(drop=True)
books.drop(columns=['Category ID'], inplace=True)
books.head()

Unnamed: 0,ID,Filename,Image URL,Title,Author,Category
0,545790352,0545790352.jpg,http://ecx.images-amazon.com/images/I/51MIi4p2...,Harry Potter and the Sorcerer's Stone: The Ill...,J.K. Rowling,Children's Books
1,1419717014,1419717014.jpg,http://ecx.images-amazon.com/images/I/61YgGsg-...,Diary of a Wimpy Kid: Old School,Jeff Kinney,Children's Books
2,1423160916,1423160916.jpg,http://ecx.images-amazon.com/images/I/611CmvkL...,"Magnus Chase and the Gods of Asgard, Book 1: T...",Rick Riordan,Children's Books
3,1476789886,1476789886.jpg,http://ecx.images-amazon.com/images/I/51KqU7Dw...,Rush Revere and the Star-Spangled Banner,Rush Limbaugh,Children's Books
4,1338029991,1338029991.jpg,http://ecx.images-amazon.com/images/I/61kvq74k...,Harry Potter Coloring Book,Scholastic,Children's Books


Let's check how many books we have left:

In [19]:
len(books)

13605

We will lose more of them during scraping: some books are no longer listed on Amazon, so there is no page left to scrape for them.

## Scraping

The columns we need to scrape are: `Review Score`, `Sales Rank` and `Date`. We also need to download the images from the URLs so that we can extract visual features from them, completing our dataset.

First we will demonstrate the scraping process for each column on an arbitrary example, then we will combine these in a function and scrape the information for all the books.

In [29]:
book_id = '0006476155'
request = requests.get('https://www.amazon.com/dp/' + book_id)
soup = BeautifulSoup(request.text, 'lxml')

### Sales Rank and Date

We can get both of these from the product details on the page, which sit in an element with the convenient id `productDetailsTable`:

In [30]:
soup.select('#productDetailsTable li b')

[<b>Paperback:</b>,
 <b>Publisher:</b>,
 <b>Language:</b>,
 <b>ISBN-10:</b>,
 <b>ISBN-13:</b>,
 <b>ASIN:</b>,
 <b>
     Product Dimensions: 
     </b>,
 <b>Shipping Weight:</b>,
 <b>Average Customer Review:</b>,
 <b>Amazon Best Sellers Rank:</b>,
 <b><a href="https://www.amazon.com/gp/bestsellers/books/10482">Police Procedurals</a></b>]

We can use regular expressions to extract the two values we need from these list items:

In [31]:
for li in soup.select('#productDetailsTable li'):
    # We only need two of the list items
    if li.b.string == 'Amazon Best Sellers Rank:':
        # The rank is given in the format #1,234,567
        sales_rank = re.findall(r'#([\d,]+)', li.b.next_sibling)[0]
    elif li.b.string == 'Publisher:':
        # The date is in parentheses, in the format (Month YYYY)
        date = re.findall(r'\((.*)\)', li.b.next_sibling)[0]

print(f'Sales Rank: {sales_rank}\nDate: {date}')

Sales Rank: 5,119,119
Date: March 2004
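
The two patterns can be tried offline on plain strings. The sketch below uses made-up sibling strings (the publisher text in particular is hypothetical) and two illustrative helpers, `parse_rank` and `extract_date`, that are not part of the notebook. It also shows why a greedy `\((.*)\)` can misfire when the publisher text contains more than one pair of parentheses:

```python
import re

# Hypothetical strings, shaped like the text that follows each <b> label
rank_text = ' #5,119,119 in Books ('
publisher_text = ' HarperCollins (UK) (March 2004)'

def parse_rank(text):
    """Extract the first '#1,234,567'-style rank and return it as an int."""
    match = re.search(r'#([\d,]+)', text)
    return int(match.group(1).replace(',', '')) if match else None

def extract_date(text):
    """Return the contents of the final parenthesised group, or None."""
    match = re.search(r'\(([^()]*)\)$', text.strip())
    return match.group(1) if match else None

print(parse_rank(rank_text))                    # 5119119
print(extract_date(publisher_text))             # March 2004
print(re.findall(r'\((.*)\)', publisher_text))  # ['UK) (March 2004']
```

Anchoring the pattern to the end of the string and excluding nested parentheses keeps only the last group, which is where the date appears in the examples seen here.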


### Review Score

You might have noticed that the list we just used for the rank and date also has an item called `Average Customer Review`. Inside that item, the review scores are in a table with the id `histogramTable`, which gives the percentage of reviewers who chose each score from 1 to 5 stars.

In [32]:
reviews = soup.select('#histogramTable')[0].text
reviews

'5 star58%4 star25%3 star9%2 star4%1 star4%'

The formatting is not great, but it's nothing we can't fix by using a simple regular expression:

In [33]:
reviews = re.findall(r'(\d) star(\d+)%', reviews)
reviews

[('5', '58'), ('4', '25'), ('3', '9'), ('2', '4'), ('1', '4')]

The weighted average of these scores is our final Review Score for the given book:

In [34]:
score = 0
for pair in reviews:
    score += int(pair[0]) * int(pair[1])/100  # weights are percentages

score

4.29
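
The same calculation can be wrapped in a small function. This is a sketch rather than the notebook's own code: `review_score` is a hypothetical name, and it divides by the sum of the percentages instead of a fixed 100, on the assumption that rounded percentages may not always add up to exactly 100.

```python
import re

def review_score(histogram_text):
    """Weighted mean star rating from text like '5 star58%4 star25%...'."""
    pairs = re.findall(r'(\d) star(\d+)%', histogram_text)
    total = sum(int(pct) for _, pct in pairs)
    if total == 0:
        return None  # no histogram found, or every bucket is 0%
    weighted = sum(int(stars) * int(pct) for stars, pct in pairs)
    return round(weighted / total, 4)

print(review_score('5 star58%4 star25%3 star9%2 star4%1 star4%'))  # 4.29
```

Summing integers first and dividing once also avoids accumulating floating-point error across the five terms.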

### Bringing it together

In [35]:
# For parsing the scraped date strings later on
# from dateutil import parser
# parser.parse(date)
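
If `dateutil` is not available, the standard library can handle the two date shapes that appear in this notebook's outputs (`March 2004` and `October 13, 2015`). A minimal sketch, assuming an English locale for the month names; `parse_publication_date` and `DATE_FORMATS` are hypothetical names:

```python
from datetime import datetime

# Formats seen in this notebook's outputs; other pages may use different ones
DATE_FORMATS = ('%B %d, %Y', '%B %Y')

def parse_publication_date(text):
    """Try each known format in turn; return None if none of them fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

print(parse_publication_date('October 13, 2015').date())  # 2015-10-13
print(parse_publication_date('March 2004').date())        # 2004-03-01
print(parse_publication_date('Error 404'))                # None
```

When only a month and year are given, `strptime` fills in the first day of the month.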

In [204]:
def scrape_info(book_id):
    # Initialize default values
    sales_rank = date = score = None
    
    # Wait a random 1-3 seconds between requests to look less like a bot
    sleep(randint(1, 3))
    
    # Request the product page
    request = requests.get('https://www.amazon.com/dp/' + book_id)
    
    if request.status_code == 200:
        soup = BeautifulSoup(request.text, 'lxml')

        # Get sales rank and date
        for li in soup.select('#productDetailsTable li'):
            if li.b is None:
                continue  # skip list items without a bold label
            if li.b.string == 'Amazon Best Sellers Rank:':
                sales_rank = re.findall(r'#([\d,]+)', li.b.next_sibling)[0]  # Format: #1,234,567
                sales_rank = int(sales_rank.replace(',', ''))  # Remove the commas and convert to integer
            elif li.b.string == 'Publisher:':
                # Last parenthesised group, e.g. (March 2004) or (October 13, 2015)
                date = re.findall(r'\(([^\(\)]*)\)$', li.b.next_sibling)[0]

        # Get average review score
        try:
            reviews = soup.select('#histogramTable')[0].text
            reviews = re.findall(r'(\d) star(\d+)%', reviews)

            score = 0
            for pair in reviews:
                score += int(pair[0]) * int(pair[1])/100  # weights are percentages
            score = round(score, 4)
        except IndexError:
            pass  # no histogram on the page: leave the score as None
    else:
        sales_rank = date = score = f'Error {request.status_code}'

    return book_id, sales_rank, date, score

Let's do a final test on the example book we used above:

In [203]:
scrape_info(book_id)

('0006476155', 'Error 503', 'Error 503', 'Error 503')
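
A 503 usually means the server refused the request for now, so trying again after a pause may succeed where the first attempt did not. The sketch below shows one way to retry with exponential backoff. `fetch_with_retries` is a hypothetical helper, and the `fetch` argument is an assumed callable returning `(status_code, text)`, so the logic can be exercised without touching the network.

```python
import time

def fetch_with_retries(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-503 status or retries run out."""
    status, text = fetch(url)
    for attempt in range(retries):
        if status != 503:
            break
        sleep(base_delay * 2 ** attempt)  # waits 2s, 4s, 8s, ...
        status, text = fetch(url)
    return status, text

# Demo with a fake fetcher that fails twice, then succeeds
responses = iter([(503, ''), (503, ''), (200, '<html>ok</html>')])
delays = []
status, text = fetch_with_retries(lambda url: next(responses),
                                  'https://www.amazon.com/dp/0006476155',
                                  sleep=delays.append)
print(status, delays)  # 200 [2.0, 4.0]
```

With `requests`, `fetch` could be a small wrapper that calls `requests.get(url, timeout=10)` and returns `(response.status_code, response.text)`. Whether retries are enough to get past the 503 here is not something this sketch can guarantee.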

## Completing the dataset

In [39]:
books.head()

Unnamed: 0,ID,Filename,Image URL,Title,Author,Category
0,545790352,0545790352.jpg,http://ecx.images-amazon.com/images/I/51MIi4p2...,Harry Potter and the Sorcerer's Stone: The Ill...,J.K. Rowling,Children's Books
1,1419717014,1419717014.jpg,http://ecx.images-amazon.com/images/I/61YgGsg-...,Diary of a Wimpy Kid: Old School,Jeff Kinney,Children's Books
2,1423160916,1423160916.jpg,http://ecx.images-amazon.com/images/I/611CmvkL...,"Magnus Chase and the Gods of Asgard, Book 1: T...",Rick Riordan,Children's Books
3,1476789886,1476789886.jpg,http://ecx.images-amazon.com/images/I/51KqU7Dw...,Rush Revere and the Star-Spangled Banner,Rush Limbaugh,Children's Books
4,1338029991,1338029991.jpg,http://ecx.images-amazon.com/images/I/61kvq74k...,Harry Potter Coloring Book,Scholastic,Children's Books


In [171]:
scraped = dataset.head(100).apply(lambda row: scrape_info(row['ID']), axis=1)
scraped

2636       (3632, October 13, 2015, 4.72)
2637    (25151, September 15, 2015, 4.85)
2638           (2405, May 19, 2009, 4.72)
2639         (4062, March 19, 2008, 4.65)
2640       (3986, October 16, 2012, 4.82)
2641       (78666, October 6, 2015, 4.72)
2642     (27529, September 1, 2015, 4.74)
2643      (25807, October 20, 2015, 4.92)
2644     (49660, November 10, 2015, 4.61)
2645      (16097, December 1, 2015, 4.92)
2646    (Error 404, Error 404, Error 404)
2647    (Error 404, Error 404, Error 404)
2648    (Error 404, Error 404, Error 404)
2649    (80905, September 22, 2015, 4.69)
2650    (Error 404, Error 404, Error 404)
2651          (3005, July 22, 2014, 4.51)
2652          (447, March 17, 2014, 4.81)
2653      (55551, October 20, 2015, 4.84)
2654      (202321, October 6, 2015, 4.91)
2655    (Error 404, Error 404, Error 404)
2656         (4054, April 18, 2012, 4.66)
2657    (Error 404, Error 404, Error 404)
2658    (21204, September 22, 2015, 4.64)
2659         (674, October 2, 2012

### Parallelize It

In [128]:
from dask import dataframe as dd
from dask.multiprocessing import get
from multiprocessing import cpu_count
nCores = cpu_count()

In [180]:
test = dd.from_pandas(books.head(12), npartitions=nCores).\
   map_partitions(
      lambda df : df.apply(
         lambda row : scrape_info(row['ID']), axis=1)).\
   compute(scheduler='processes')

In [186]:
test.apply(pd.Series)

Unnamed: 0,0,1,2,3
0,545790352,Error 404,Error 404,Error 404
1,1419717014,572,"November 3, 2015",4.8
2,1423160916,10711,"October 6, 2015",4.6
3,1476789886,Error 503,Error 503,Error 503
4,1338029991,Error 503,Error 503,Error 503
5,399172750,Error 404,Error 404,Error 404
6,399255370,Error 404,Error 404,Error 404
7,545392551,Error 404,Error 404,Error 404
8,399226907,Error 404,Error 404,Error 404
9,1481457047,5249,"October 20, 2015",4.81
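
Because scraping spends most of its time waiting on the network, threads are a possible alternative to separate processes. The sketch below swaps dask for the standard library's `concurrent.futures.ThreadPoolExecutor`; this is a different technique from the one used in this notebook. `scrape_many` and `fake_scrape` are hypothetical names, with `fake_scrape` standing in for `scrape_info` so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_many(scrape, book_ids, max_workers=4):
    """Apply `scrape` to every ID in parallel threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape, book_ids))

# Stand-in for scrape_info that returns the same 4-tuple shape
def fake_scrape(book_id):
    return book_id, None, None, None

results = scrape_many(fake_scrape, ['0545790352', '1419717014'])
print(results)
```

`pool.map` returns results in the order of the input IDs, so each tuple can be matched back to its row. Sending more requests at once may also lead to more `Error 503` responses like the ones in the table above.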


In [187]:
with open('yo.csv', 'a') as file:
    dd.from_pandas(books.head(nCores), npartitions=nCores).\
       map_partitions(
          lambda df : df.apply(
             lambda row : scrape_info(row['ID']), axis=1)).\
       compute(scheduler='processes').apply(pd.Series).to_csv(file, header=False, index=False)

### Scrape to CSV

Initialize CSV file:

In [122]:
with open('scraped.csv', 'a', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['ID', 'Sales Rank', 'Date', 'Review Score'])

Scrape and append to CSV file:

In [194]:
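with open('scraped.csv', 'a+', newline='') as file:
    writer = csv.writer(file)

    # Resume after the last ID already written to the file, if any
    file.seek(0)
    rows = list(csv.reader(file))
    try:
        last_scraped = rows[-1][0]
        # Compare as strings, since the CSV stores every ID as text
        index = books.index[books['ID'].astype(str) == last_scraped][0] + 1
    except IndexError:
        # Empty file, or the last row (e.g. the header) is not a known ID
        index = 0

    try:
        # Stop at the end of the DataFrame or on a manual interrupt
        while index < len(books):
            writer.writerow(scrape_info(books.iloc[index]['ID']))
            file.flush()
            index += 1
    except KeyboardInterrupt:
        pass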
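
Once the file has some rows in it, the failed requests need to be separated from the usable ones. A minimal sketch, assuming the column layout written above and that failed rows carry `Error <status>` in every scraped column, as `scrape_info` produces; `split_scraped_rows` is a hypothetical helper:

```python
import csv
import io

def split_scraped_rows(lines):
    """Separate successfully scraped rows from rows holding 'Error NNN'."""
    ok, failed = [], []
    for row in csv.DictReader(lines):
        target = failed if row['Sales Rank'].startswith('Error') else ok
        target.append(row)
    return ok, failed

# Sample rows in the same layout as scraped.csv
sample = io.StringIO(
    'ID,Sales Rank,Date,Review Score\n'
    '1419717014,572,"November 3, 2015",4.8\n'
    '1476789886,Error 503,Error 503,Error 503\n'
)
ok, failed = split_scraped_rows(sample)
print(len(ok), len(failed))  # 1 1
```

For the real file, pass an open handle instead of the `StringIO` sample, for example `open('scraped.csv', newline='')`. Rows that failed with a 503 may be worth scraping again, while 404 rows most likely belong to books that are no longer listed.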

In [190]:
books[5:5+12].shape

(12, 6)

In [191]:
books.iloc[5:5+12]

Unnamed: 0,ID,Filename,Image URL,Title,Author,Category
5,399172750,0399172750.jpg,http://ecx.images-amazon.com/images/I/51ZnaboA...,The Day the Crayons Came Home,Drew Daywalt,Children's Books
6,399255370,0399255370.jpg,http://ecx.images-amazon.com/images/I/51HoHYJv...,The Day the Crayons Quit,Drew Daywalt,Children's Books
7,545392551,0545392551.jpg,http://ecx.images-amazon.com/images/I/51qvh4MA...,Giraffes Can't Dance,Giles Andreae,Children's Books
8,399226907,0399226907.jpg,http://ecx.images-amazon.com/images/I/41zqrOnj...,The Very Hungry Caterpillar,Eric Carle,Children's Books
9,1481457047,1481457047.jpg,http://ecx.images-amazon.com/images/I/61GfN%2B...,Dork Diaries 10: Tales from a Not-So-Perfect P...,Rachel Renée Russell,Children's Books
10,62304186,0062304186.jpg,http://ecx.images-amazon.com/images/I/61OU9ICZ...,Pete the Cat: Five Little Pumpkins,James Dean,Children's Books
11,141694737X,141694737X.jpg,http://ecx.images-amazon.com/images/I/51SQFs%2...,Dear Zoo: A Lift-the-Flap Book (Dear Zoo & Fri...,Rod Campbell,Children's Books
12,803741715,0803741715.jpg,http://ecx.images-amazon.com/images/I/4148uGcg...,The Book with No Pictures,B.J. Novak,Children's Books
13,544568036,0544568036.jpg,http://ecx.images-amazon.com/images/I/51xF%2B0...,Little Blue Truck board book,Alice Schertle,Children's Books
14,385376715,0385376715.jpg,http://ecx.images-amazon.com/images/I/51q3%2Bu...,The Wonderful Things You Will Be,Emily Winfield Martin,Children's Books


In [192]:
books.iloc[17]

ID                                                   553512331
Filename                                        0553512331.jpg
Image URL    http://ecx.images-amazon.com/images/I/51AeOMhY...
Title                                            Spooky Pookie
Author                                          Sandra Boynton
Category                                      Children's Books
Name: 17, dtype: object