## Amazon Best Sellers Book Links
- https://www.amazon.com/best-sellers-books-Amazon/zgbs/books 
- https://www.amazon.com/best-sellers-books-Amazon/zgbs/books/ref=zg_bs_pg_2?_encoding=UTF8&pg=2

## 1. Import libraries

In [2]:
import requests as rq
import pandas as pd
from bs4 import BeautifulSoup as bs
import lxml  # not called directly; confirms the parser used by bs(..., 'lxml') is installed
import re

## 2. Set base URLs

In [3]:
url1 = 'https://www.amazon.com/best-sellers-books-Amazon/zgbs/books'
url2 = 'https://www.amazon.com/best-sellers-books-Amazon/zgbs/books/ref=zg_bs_pg_2?_encoding=UTF8&pg=2'
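The two URLs differ only in a page-specific suffix, so they could be generated rather than hard-coded. The sketch below is a minimal illustration: `BASE_URL` and `page_url` are hypothetical names, and the assumption that the `ref=zg_bs_pg_N` pattern seen in `url2` holds for pages other than 2 is not verified here.

```python
BASE_URL = "https://www.amazon.com/best-sellers-books-Amazon/zgbs/books"


def page_url(page: int) -> str:
    """Return the best-sellers URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if page == 1:
        return BASE_URL
    # Suffix copied from url2; assumed (not verified) to generalize to other pages.
    return f"{BASE_URL}/ref=zg_bs_pg_{page}?_encoding=UTF8&pg={page}"


# Rebuild the two URLs used in this notebook.
urls = [page_url(p) for p in (1, 2)]
```

With this helper, `urls[0]` and `urls[1]` take the place of `url1` and `url2`.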

## 3. Make a request

In [4]:
# request the first best-sellers page
res = rq.get(url1)
# a status code of 200 means the request succeeded
res.status_code

200
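The request above returned 200, but a site may answer a script differently from a browser, and a non-200 response would otherwise go unnoticed until parsing fails. The following is a hedged sketch of a small wrapper that sends a `User-Agent` header, sets a timeout, and raises on error status codes. `HEADERS`, `fetch_html`, and the `get` parameter are hypothetical names, and the header string is an arbitrary placeholder, not a value known to be required.

```python
# Hypothetical browser-like header; the exact string is an assumption.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; best-sellers-notebook)"}


def fetch_html(url, get=None, timeout=10):
    """Return the page body, raising if the status code signals an error.

    `get` defaults to requests.get; passing another callable lets the
    function be exercised without touching the network.
    """
    if get is None:
        import requests  # imported lazily so the helper can be tested offline
        get = requests.get
    res = get(url, headers=HEADERS, timeout=timeout)
    res.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
    return res.text
```

In the notebook this would be called as `fetch_html(url1)` and the result passed to `bs(..., 'lxml')`.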

## 4. Create our soup object

In [5]:
soup = bs(res.text,'lxml')
soup.find('title').text

'Amazon Best Sellers: Best Books'

## 5. Find the data:
* rank
* title
* author
* price
* rating
* number_of_reviews

In [6]:
soup.find("li", "zg-item-immersion").text.strip()

'#1\n\n            The Room Where It Happened: A White House Memoir\n        \nJohn Bolton\n\n\n3.5 out of 5 stars\n\n64\n\nHardcover$19.42'
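The text of a single item already contains every field in the list above. As an alternative to building one list per field, each item could be parsed into a single record, which keeps the fields of one book together even when a rating or review count is missing. This is only a sketch based on the one sample string shown above: `parse_item` is a hypothetical helper, and it assumes the title and author are always the second and third non-empty lines.

```python
import re

# The item text printed above, split across two literals for readability.
sample = ('#1\n\n            The Room Where It Happened: A White House Memoir\n'
          '        \nJohn Bolton\n\n\n3.5 out of 5 stars\n\n64\n\nHardcover$19.42')


def parse_item(text):
    """Split one item's text into fields; fields that are absent become None."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    joined = " ".join(lines)
    rank = re.match(r"#(\d+)", joined)
    rating = re.search(r"(\d\.\d) out of 5 stars", joined)
    reviews = re.search(r"stars\s+([\d,]+)", joined)
    price = re.search(r"\$(\d+\.\d{2})", joined)
    return {
        "rank": int(rank.group(1)) if rank else None,
        # Assumption: title and author are the 2nd and 3rd non-empty lines.
        "title": lines[1] if len(lines) > 1 else None,
        "author": lines[2] if len(lines) > 2 else None,
        "rating": float(rating.group(1)) if rating else None,
        "reviews": int(reviews.group(1).replace(",", "")) if reviews else None,
        "price": float(price.group(1)) if price else None,
    }


record = parse_item(sample)
```

A list of such records, one per `li` element, can be passed directly to `pd.DataFrame`.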

## Find titles

In [8]:
titles_class = "zg-item"
# note: str.title() also capitalizes the letter after an apostrophe ("It'S")
titles = [x.a.text.strip().title() for x in soup.find_all("span", titles_class)]
titles[:5]

['The Room Where It Happened: A White House Memoir',
 "White Fragility: Why It'S So Hard For White People To Talk About Racism",
 'How To Be An Antiracist',
 'Too Much And Never Enough: How My Family Created The World’S Most Dangerous Man',
 'Stamped From The Beginning: The Definitive History Of Racist Ideas In America (National Book Award Winner)']
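The `It'S` and `World’S` in the output come from `str.title()`, which treats an apostrophe as a word boundary. One possible workaround, adapted from the regular-expression recipe in the Python documentation for `str.title`, is sketched below. The function name `titlecase` follows that recipe; handling the curly apostrophe `’` is an addition made here for the titles shown above.

```python
import re


def titlecase(text):
    """Capitalize each word, treating an apostrophe as part of the word."""
    return re.sub(
        r"[A-Za-z]+(['’][A-Za-z]+)?",
        lambda match: match.group(0).capitalize(),
        text,
    )


print("why it's so hard".title())     # built-in behavior: Why It'S So Hard
print(titlecase("why it's so hard"))  # workaround: Why It's So Hard
```

Like `str.title()`, this lowercases everything after the first letter of each word, so acronyms are not preserved.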

## Find ranks

In [9]:
rank_class = "zg-badge-text"
ranks = [x.text.replace("#", "") for x in soup.find_all("span", rank_class)]
ranks[:5]

['1', '2', '3', '4', '5']

## Find authors

In [12]:
# the format labels use the same CSS class as the authors, so list them here and filter them out
blacklist = ["Paperback", "Hardcover", "Board book", "Mass Market Paperback"]
author_class = "a-row a-size-small"
authors = [x.text for x in soup.find_all("div", author_class) if x.text not in blacklist]
authors[:5]

['John Bolton',
 'Robin DiAngelo',
 'Ibram X. Kendi',
 'Mary L. Trump Ph.D.',
 'Ibram X. Kendi']

## Find prices

In [13]:
price_class = "p13n-sc-price"
prices = [x.text.replace("$", "") for x in soup.find_all("span", price_class)]
prices[:5]

['19.42', '11.68', '14.79', '17.19', '12.03']

## Find ratings

In [14]:
ratings_class = "zg-item"
# escape the dot so it matches only a literal decimal point
pattern = r"(\d\.\d)\sout"
ratings = [''.join(re.findall(pattern, x.text)) for x in soup.find_all("span", ratings_class)]
ratings[:5]

['3.5', '4.2', '4.7', '', '4.8']
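In a regular expression an unescaped `.` matches any character, so a rating pattern written with a bare dot also accepts strings that are not decimal ratings. The short comparison below illustrates the difference. The names `loose` and `strict` and the test string are made up for the illustration.

```python
import re

loose = r"(\d.\d)\sout"    # "." matches any character
strict = r"(\d\.\d)\sout"  # "\." matches only a literal dot

rating_text = "3.5 out of 5 stars"
not_a_rating = "rated 415 out of 500"

# Both patterns find the real rating.
print(re.findall(loose, rating_text), re.findall(strict, rating_text))

# Only the loose pattern is fooled by three consecutive digits.
print(re.findall(loose, not_a_rating), re.findall(strict, not_a_rating))
```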

## Find number of reviews

In [15]:
reviews_class2 = "zg-item"
# capture the digits (and thousands separators) that follow "stars"
pattern = r"stars\s+([\d,]+)"
n_reviews = ["".join(re.findall(pattern, x.text.strip().replace("\n", " "))) for x in soup.find_all("span", reviews_class2)]
n_reviews[:5]

['64', '2,091', '677', '', '670']
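The review counts come back as strings such as `'2,091'`, which cannot be used numerically until the thousands separator is removed. The sketch below shows one way to extract the count and convert it to an integer in a single step. `review_count` is a hypothetical helper, the character-class pattern `[\d,]+` is a simplification chosen here, and the sample strings are modeled on the item text shown earlier, not scraped output.

```python
import re

COUNT_PATTERN = re.compile(r"stars\s+([\d,]+)")


def review_count(text):
    """Return the review count as an int, or None if the item has no reviews."""
    match = COUNT_PATTERN.search(text.replace("\n", " "))
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


samples = [
    "3.5 out of 5 stars\n\n64\n\nHardcover$19.42",
    "4.2 out of 5 stars\n\n2,091\n\nPaperback$11.68",
    "Hardcover$17.19",  # no rating block, so no count either
]
counts = [review_count(s) for s in samples]
```

Returning `None` for a missing count keeps "no reviews found" distinct from a genuine zero.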

## 6. Create the Data Frame object

In [16]:
data = pd.DataFrame({
    "rank": ranks,
    "title": titles,
    "author": authors,
    "ratings": ratings,
    "reviews": n_reviews,
    "price": prices,
})

# replace empty strings (items where no rating or review count was found) with zero
data.replace("", 0, inplace=True)
data.head()

| | rank | title | author | ratings | reviews | price |
|---|---|---|---|---|---|---|
| 0 | 1 | The Room Where It Happened: A White House Memoir | John Bolton | 3.5 | 64 | 19.42 |
| 1 | 2 | White Fragility: Why It'S So Hard For White Pe... | Robin DiAngelo | 4.2 | 2091 | 11.68 |
| 2 | 3 | How To Be An Antiracist | Ibram X. Kendi | 4.7 | 677 | 14.79 |
| 3 | 4 | Too Much And Never Enough: How My Family Creat... | Mary L. Trump Ph.D. | 0.0 | 0 | 17.19 |
| 4 | 5 | Stamped From The Beginning: The Definitive His... | Ibram X. Kendi | 4.8 | 670 | 12.03 |
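Building a DataFrame from a dictionary of lists requires every list to have the same length, and pandas raises a `ValueError` otherwise. Because each column here is scraped independently, a single unexpected element can leave the lists out of step. A small check before the constructor gives a clearer error message. `check_aligned` is a hypothetical helper, and equal lengths do not by themselves prove that the rows line up.

```python
def check_aligned(columns):
    """Raise ValueError unless every list in `columns` has the same length.

    Returns the common length (0 for an empty mapping).
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Example with made-up values: two aligned columns of two rows each.
n_rows = check_aligned({"rank": ["1", "2"], "title": ["A", "B"]})
```

In the notebook the same dictionary passed to `pd.DataFrame` would be passed to `check_aligned` first.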


## 7. Export CSV file

In [13]:
data.to_csv("../data/amazon.csv", index=False)
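The export writes to `../data/`, and `to_csv` fails if that directory does not exist. A minimal sketch of creating it first with the standard library follows. `ensure_parent` is a hypothetical helper name.

```python
from pathlib import Path


def ensure_parent(path):
    """Create the parent directory of `path` if needed and return it as a Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target
```

Since `to_csv` accepts path objects, the export line could then read `data.to_csv(ensure_parent("../data/amazon.csv"), index=False)`.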

## 8. Import CSV file

In [14]:
pd.read_csv("../data/amazon.csv").head()

| | rank | title | author | ratings | reviews | price |
|---|---|---|---|---|---|---|
| 0 | 1 | The Room Where It Happened: A White House Memoir | John Bolton | 0.0 | 0 | 19.5 |
| 1 | 2 | Too Much And Never Enough: How My Family Creat... | Mary L. Trump Ph.D. | 0.0 | 0 | 25.2 |
| 2 | 3 | How To Be An Antiracist | Ibram X. Kendi | 4.7 | 559 | 14.79 |
| 3 | 4 | White Fragility: Why It'S So Hard For White Pe... | Robin DiAngelo | 4.3 | 1908 | 11.35 |
| 4 | 5 | Stamped From The Beginning: The Definitive His... | Ibram X. Kendi | 4.7 | 645 | 12.16 |
