## Amazon Best Sellers Book Links
https://www.amazon.com/best-sellers-books-Amazon/zgbs/books 
https://www.amazon.com/best-sellers-books-Amazon/zgbs/books/ref=zg_bs_pg_2?_encoding=UTF8&pg=2

## 1. Import libraries

In [1]:
import requests as rq
import pandas as pd
from bs4 import BeautifulSoup as bs
import lxml
import re

## 2. Set base urls

In [2]:
url1 = 'https://www.amazon.com/best-sellers-books-Amazon/zgbs/books'
url2 = 'https://www.amazon.com/best-sellers-books-Amazon/zgbs/books/ref=zg_bs_pg_2?_encoding=UTF8&pg=2'
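
The two URLs differ only in the `ref=zg_bs_pg_2?_encoding=UTF8&pg=2` suffix, so page URLs can probably be generated instead of hard-coded. A minimal sketch, assuming the same pattern holds for other page numbers (only page 2 appears above; `BASE_URL` and `page_url` are hypothetical names, not part of the notebook):

```python
BASE_URL = "https://www.amazon.com/best-sellers-books-Amazon/zgbs/books"


def page_url(page: int) -> str:
    """Return the best-sellers URL for a 1-based page number.

    Assumption: pages after the first follow the pattern seen in url2.
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    if page == 1:
        return BASE_URL
    return f"{BASE_URL}/ref=zg_bs_pg_{page}?_encoding=UTF8&pg={page}"


# Rebuild the two URLs used in this notebook
urls = [page_url(n) for n in (1, 2)]
```

Generating the list this way keeps the page number in one place if more pages need to be scraped later.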

## 3. Make a request

In [3]:
# send a GET request and keep the response object
res = rq.get(url1)
# check the status code (200 means OK)
res.status_code

200

## 4. Create our soup object

In [4]:
soup = bs(res.text, 'lxml')
soup.find('title').text

'Amazon Best Sellers: Best Books'

## 5. Find the data:
* rank
* title
* author
* price
* rating
* number_of_reviews

In [5]:
soup.find("li", "zg-item-immersion").text.strip()

'#1\n\n            The Room Where It Happened: A White House Memoir\n        \nJohn BoltonHardcover$19.50'

In [6]:
rank_class = "zg-badge-text"
ranks = [x.text.replace("#", "") for x in soup.find_all("span", rank_class)]
len(ranks)

50

In [7]:
titles_class = "zg-item"
titles = [x.a.text.strip() for x in soup.find_all("span", titles_class)]
len(titles)

50

In [8]:
blacklist = ["Paperback", "Hardcover", "Board book", "Mass Market Paperback"]
author_class = "a-row a-size-small"
authors = [x.text for x in soup.find_all("div", author_class) if x.text not in blacklist]
len(authors)

50
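
The `a-row a-size-small` class apparently marks both the author line and the format line of each item, which is why format names have to be filtered out. A small sketch of the same filter on plain strings, assuming stray whitespace may surround the text (`FORMATS` and `keep_authors` are hypothetical names, and the sample list is made up):

```python
FORMATS = {"Paperback", "Hardcover", "Board book", "Mass Market Paperback"}


def keep_authors(texts):
    """Drop entries that are book formats; strip whitespace from the rest."""
    cleaned = (t.strip() for t in texts)
    return [t for t in cleaned if t and t not in FORMATS]


sample = ["John Bolton", "Hardcover", "Adam Carolla", " Paperback "]
authors_only = keep_authors(sample)  # ['John Bolton', 'Adam Carolla']
```

A format that is missing from the set would slip through as an author, so comparing `len(authors_only)` with the number of items remains a useful check.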

In [9]:
price_class = "p13n-sc-price"
prices = [x.text.replace("$", "") for x in soup.find_all("span", price_class)]
len(prices)

50
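
The prices are still strings after the `$` is stripped. A minimal sketch of converting them to numbers, assuming prices look like `$19.50` and may contain a thousands separator (`to_price` is a hypothetical helper, and the last sample value is made up):

```python
def to_price(text: str) -> float:
    """Convert a price string such as '$19.50' to a float."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


sample = ["$19.50", "$25.20", "$1,014.79"]
prices_as_float = [to_price(p) for p in sample]
```

Converting early means sorting and averaging work on numbers rather than on text.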

In [11]:
ratings_class = "zg-item"
# escape the dot so it matches only a literal decimal point
pattern = r"(\d\.\d)\sout"
ratings = [''.join(re.findall(pattern, x.text)) for x in soup.find_all("span", ratings_class)]
len(ratings)

50
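
In a pattern such as `\d.\d`, an unescaped `.` matches any character, so `\d\.\d` is the safer way to match a decimal rating. The sketch below also returns `None` for items without a rating instead of an empty string. The sample strings are assumptions about what the item text looks like, and `parse_rating` is a hypothetical helper:

```python
import re
from typing import Optional

RATING_RE = re.compile(r"(\d\.\d)\sout")


def parse_rating(item_text: str) -> Optional[float]:
    """Return the star rating found in item_text, or None if absent."""
    match = RATING_RE.search(item_text)
    return float(match.group(1)) if match else None


# The unescaped dot also matches text that is not a decimal number
loose = re.findall(r"(\d.\d)\sout", "pack of 3x4 out now")    # ['3x4']
strict = re.findall(r"(\d\.\d)\sout", "pack of 3x4 out now")  # []
```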

In [12]:
reviews_class2 = "zg-item"
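# the spaces inside the group are literal, so matches carry a leading or trailing space (removed by .strip() below)
pattern = r"stars\s+(\d+\W\d+ | \d+)"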
n_reviews = ["".join(re.findall(pattern, x.text.strip().replace("\n", " "))).strip() for x in soup.find_all("span", reviews_class2)]
len(n_reviews)

50
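
The review count follows the rating in the item text. A minimal sketch of pulling it out as an integer, assuming the count comes right after the word `stars` and may use a comma as a thousands separator (`parse_review_count` is a hypothetical helper, and the sample text in the usage note is made up):

```python
import re

REVIEWS_RE = re.compile(r"stars\s+(\d[\d,]*)")


def parse_review_count(item_text: str) -> int:
    """Return the review count after 'stars', or 0 if none is found."""
    flat = " ".join(item_text.split())  # collapse newlines and runs of spaces
    match = REVIEWS_RE.search(flat)
    return int(match.group(1).replace(",", "")) if match else 0
```

For example, `parse_review_count("4.8 out of 5 stars\n  50,160\n")` gives `50160`, and text with no rating gives `0`.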

## 6. Create the Data Frame object

In [17]:
data = pd.DataFrame({"rank": ranks, "title": titles, "author": authors, "price": prices, "rating": ratings, "reviews": n_reviews})

# replace() returns a new DataFrame for display; `data` itself is left unchanged
data.replace("", 0)

| | rank | title | author | price | rating | reviews |
|---|---|---|---|---|---|---|
| 0 | 1 | The Room Where It Happened: A White House Memoir | John Bolton | 19.5 | 0.0 | 0 |
| 1 | 2 | I'm Your Emotional Support Animal: Navigating ... | Adam Carolla | 25.2 | 4.8 | 7 |
| 2 | 3 | How to Be an Antiracist | Ibram X. Kendi | 14.79 | 4.7 | 550 |
| 3 | 4 | White Fragility: Why It's So Hard for White Pe... | Robin DiAngelo | 10.9 | 4.3 | 1897 |
| 4 | 5 | Too Much and Never Enough: How My Family Creat... | Mary L. Trump Ph.D. | 25.2 | 0.0 | 0 |
| 5 | 6 | Stamped from the Beginning: The Definitive His... | Ibram X. Kendi | 12.68 | 4.7 | 645 |
| 6 | 7 | So You Want to Talk About Race | Ijeoma Oluo | 11.56 | 4.7 | 552 |
| 7 | 8 | Countdown 1945: The Extraordinary Story of the... | Chris Wallace | 18.0 | 4.2 | 59 |
| 8 | 9 | Where the Crawdads Sing | Delia Owens | 9.59 | 4.8 | 50160 |
| 9 | 10 | Deacon King Kong: A Novel | James McBride | 18.99 | 4.6 | 213 |
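
Building a DataFrame from a dict of lists needs every list to have the same length, and the columns here are matched purely by position. A minimal sketch of a check to run before constructing the frame, using a hypothetical `check_lengths` helper and made-up sample lists:

```python
def check_lengths(columns: dict) -> int:
    """Return the shared length of all lists, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


sample = {"rank": ["1", "2"], "title": ["A", "B"], "price": ["19.50", "25.20"]}
n_rows = check_lengths(sample)
```

Equal lengths do not prove the rows line up: if one item lacks a price and another has two, the counts still match. The check only catches the obvious mismatches.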


## 7. Export csv file

In [19]:
data.to_csv("../data/amazon.csv", index=False)

In [20]:
pd.read_csv("../data/amazon.csv")

| | rank | title | author | price | rating | reviews |
|---|---|---|---|---|---|---|
| 0 | 1 | The Room Where It Happened: A White House Memoir | John Bolton | 19.5 | | |
| 1 | 2 | I'm Your Emotional Support Animal: Navigating ... | Adam Carolla | 25.2 | 4.8 | 7.0 |
| 2 | 3 | How to Be an Antiracist | Ibram X. Kendi | 14.79 | 4.7 | 550.0 |
| 3 | 4 | White Fragility: Why It's So Hard for White Pe... | Robin DiAngelo | 10.9 | 4.3 | 1897.0 |
| 4 | 5 | Too Much and Never Enough: How My Family Creat... | Mary L. Trump Ph.D. | 25.2 | | |
| 5 | 6 | Stamped from the Beginning: The Definitive His... | Ibram X. Kendi | 12.68 | 4.7 | 645.0 |
| 6 | 7 | So You Want to Talk About Race | Ijeoma Oluo | 11.56 | 4.7 | 552.0 |
| 7 | 8 | Countdown 1945: The Extraordinary Story of the... | Chris Wallace | 18.0 | 4.2 | 59.0 |
| 8 | 9 | Where the Crawdads Sing | Delia Owens | 9.59 | 4.8 | 50160.0 |
| 9 | 10 | Deacon King Kong: A Novel | James McBride | 18.99 | 4.6 | 213.0 |
