## Scraping Books with `requests-html`

In this article, we will discuss how to make web scraping fast by doing what is called `async` web scraping using `requests-html`. Even though the `BeautifulSoup` and `httpx` combination is known to work well, as I showed in my last [article](), there are other tools that can help us accomplish the same thing. We will cover the following:

1. What `requests-html` is and how to use it.
2. Async web scraping with the `requests-html` package.
3. Data cleaning with `pandas`.


We will scrape bestsellers from the [bookdepository]() website.

To get started with `requests-html`, let's learn a little bit about the package. `requests-html` is a Python package that makes parsing HTML easy and intuitive. It was created by [Kenneth Reitz](https://kennethreitz.org/), the same person who created the `requests` library, and it is designed to supplement the `requests` package. It comes with the following features:

- Full JavaScript support!
- CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
- XPath Selectors, for the faint of heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection–pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
- Async Support

Now let's get started by installing it. One of the cool things about `requests-html` is that it has `async` support out of the box. That means we can scrape websites fast by sending our requests asynchronously.

In [None]:
!pip install requests-html

In [None]:
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

r = await asession.get("https://www.bookdepository.com/bestsellers")

We first imported the `AsyncHTMLSession` class from `requests-html` and created an instance of it. Then we used the session's `get` method to request our website. If we check the status code, we will see that the request was successful (200).

Now we can start getting the data we are interested in, starting with the titles of the books.

To find out which CSS selectors or XPath expressions we need to use to get our data, we need to inspect the HTML using the developer tools in Chrome.

In [None]:
r.status_code # find out if successful

200

In [None]:
# We get the titles
page = 1
titles = []
while page != 35:
    for x in r.html.find("h3.title"):
        titles.append(x.text)
    page += 1


Notice that we first created two variables: `page`, which keeps track of the pages on the website, and `titles`, which holds our data. We then used a `while` loop that runs while `page` is below 35 (there are 34 pages on the website). I know this because I used Chrome DevTools to inspect the pagination on the page. Keep in mind that `r` holds the response for the URL we requested earlier, so for each pass to read a different page, the loop also has to request that page before calling `find`.

Then we used the `find` method from `requests-html` on our HTML content to get our data, passing in our CSS selector (an `h3` tag with the class `title`). After that, we appended the text of each result to our `titles` list. Now if you check the length of our list, you will see 1020 titles.

We will repeat this process for the rest of the variables we are interested in. Don't forget to inspect the page source for the relevant CSS selectors for the items you need.
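Since the same loop is repeated for every variable, the per-page work can be wrapped in a small helper that requests each page and collects the text for a given selector. The sketch below is one possible way to do this, not the code used in this article. It assumes that the results pages are addressed with a `page` query parameter, and the names `page_url`, `fetch_texts` and `scrape_all` are hypothetical helpers introduced for illustration.

```python
import asyncio

BASE_URL = "https://www.bookdepository.com/bestsellers"


def page_url(page):
    # Assumption: results pages are addressed with a `page` query parameter
    return f"{BASE_URL}?page={page}"


async def fetch_texts(asession, page, selector):
    # Request one page and return the text of every matching element
    r = await asession.get(page_url(page))
    return [x.text for x in r.html.find(selector)]


async def scrape_all(asession, selector, last_page=34):
    # Start all page requests together and wait for them to finish;
    # gather returns the results in the order the coroutines were passed in
    pages = await asyncio.gather(
        *(fetch_texts(asession, p, selector) for p in range(1, last_page + 1))
    )
    # Flatten the per-page lists into a single list
    return [text for page in pages for text in page]
```

In a notebook, this could be called as `titles = await scrape_all(asession, "h3.title")`. Because the requests are started together instead of one after the other, this is also where the `async` support can save time.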

In [None]:
len(titles) # check the length

1020

In [None]:
titles[:10] # take a peek

["It Ends With Us: The most heartbreaking novel you'll ever read",
 'It Starts with Us',
 'It Starts with Us',
 'The Climate Book',
 'Rooms of Wonder',
 'Verity',
 'Fire and Blood',
 'The Seven Moons of Maali Almeida',
 'The Body Keeps the Score',
 'TommyInnit Says...The Quote Book']

In [None]:
# We get all the authors
page = 1
authors = []
while page != 35:
    for x in r.html.find("p.author"):
        authors.append(x.text)
    page += 1


In [None]:
len(authors)

1020

In [None]:
authors[:10]

['Colleen Hoover',
 'Colleen Hoover',
 'Colleen Hoover',
 'Greta Thunberg',
 'Johanna Basford',
 'Colleen Hoover',
 'George R.R. Martin',
 'Shehan Karunatilaka',
 'Bessel Van Der Kolk',
 'Tom Simons']

In [None]:
# We get the prices
page = 1
prices = []
while page != 35:
    for x in r.html.find("p.price"):
        prices.append(x.text)
    page += 1


In [None]:
len(prices)

1020

In [None]:
prices[:10]

['US$9.24 \xa0US$11.10',
 'US$14.84',
 'US$16.42 \xa0US$16.73',
 'US$31.60',
 'US$17.49',
 'US$8.24 \xa0US$10.03',
 'US$13.54',
 'US$17.61 \xa0US$18.96',
 'US$14.26 \xa0US$14.50',
 'US$12.94 \xa0US$16.73']

In [None]:
# We get the ratings by counting the full stars in each rating block
page = 1
stars = []
while page != 35:
    for x in r.html.find("div.stars"):
        result = x.find("span.star.full-star")
        stars.append(len(result))
    page += 1


In [None]:
len(stars)

748

In [None]:
stars[34]

3
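Note that `stars` holds 748 values while the other lists hold 1020, which suggests that some books have no `div.stars` block. Position `i` in `stars` is therefore not guaranteed to belong to book `i`. One way to keep the ratings lined up with the books is to walk the page one book at a time and record `None` when a book has no rating. The sketch below assumes that each book sits in a `div.book-item` container; that selector and the `ratings_per_book` helper are assumptions for illustration and should be checked against the page source.

```python
def ratings_per_book(page_html, book_selector="div.book-item"):
    # Walk the page one book at a time so every book yields exactly one value
    # (book_selector is an assumed container selector, verify it in DevTools)
    ratings = []
    for book in page_html.find(book_selector):
        stars_block = book.find("div.stars", first=True)
        if stars_block is None:
            ratings.append(None)  # this book has no rating block
        else:
            ratings.append(len(stars_block.find("span.star.full-star")))
    return ratings
```

It could be called as `ratings_per_book(r.html)`, which returns one entry per book on the page.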

In [None]:
# We get the book formats
page = 1
formats = []
while page != 35:
    for x in r.html.find("p.format"):
        formats.append(x.text)
    page += 1

In [None]:
len(formats)

1020

In [None]:
formats[:10]

['Paperback',
 'Paperback',
 'Hardback',
 'Hardback',
 'Paperback',
 'Paperback',
 'Paperback',
 'Hardback',
 'Paperback',
 'Hardback']

We have successfully scraped our data, but it is not clean yet. We need to clean and save it for our analysis work later. To do that, we will use `pandas`. We will create a DataFrame from our data.

In [None]:
# We put it into a DataFrame
import pandas as pd

stars = pd.Series(stars)

df = pd.DataFrame(list(zip(titles, authors, prices, formats)), 
                columns=["titles", "authors", "prices", "formats"])

In [None]:
df.shape

(1020, 4)

In [None]:
df["rating"] = stars # to add the stars

In [None]:
df.shape

(1020, 5)
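The row count stays at 1020 even though `stars` holds only 748 values. That is because pandas aligns a Series on the index when it is assigned as a column, so rows without a matching index label receive `NaN`. A minimal sketch with made-up data (`df_demo` and `short` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({"titles": ["A", "B", "C", "D"]})
short = pd.Series([5, 4])  # two values for four rows

# Assignment aligns on the index, so rows 2 and 3 have no match and become NaN
df_demo["rating"] = short

print(df_demo)
```

The values land in the first rows purely by position, regardless of which books they were scraped from.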

In [None]:
df.head() # check our data 

| | titles | authors | prices | formats | rating |
|---|---|---|---|---|---|
| 0 | It Ends With Us: The most heartbreaking novel ... | Colleen Hoover | US$9.24 US$11.10 | Paperback | 4.0 |
| 1 | It Starts with Us | Colleen Hoover | US$14.84 | Paperback | 5.0 |
| 2 | It Starts with Us | Colleen Hoover | US$16.42 US$16.73 | Hardback | 5.0 |
| 3 | The Climate Book | Greta Thunberg | US$31.60 | Hardback | 5.0 |
| 4 | Rooms of Wonder | Johanna Basford | US$17.49 | Paperback | 4.0 |


In [None]:
df.tail() # check the last values

| | titles | authors | prices | formats | rating |
|---|---|---|---|---|---|
| 1015 | Twisted Love | Ana Huang | US$10.56 | Paperback | NaN |
| 1016 | Chainsaw Man, Vol. 1 | Tatsuki Fujimoto | US$8.76 US$9.99 | Paperback | NaN |
| 1017 | Spare | Prince Harry | US$27.97 US$31.25 | Hardback | NaN |
| 1018 | The Perfect Loaf | Maurizio Leo | US$35.80 US$40.00 | Hardback | NaN |
| 1019 | The Husky and His White Cat Shizun: Erha He Ta... | Rou Bao Bu Chi Rou | US$17.89 US$19.99 | Paperback | NaN |


You will notice that we have some missing values in our data. We need to deal with this later. There are many ways of dealing with missing values. You can either drop them or treat them. Since dropping them is not an option for our small dataset, we will treat them. There are many ways to do this too. We can fill the values with the mean or median of the variable in question, or replace them with new (non-numeric) values.
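To make these options concrete, here is a small sketch on made-up ratings. The values and variable names are assumptions for illustration only.

```python
import pandas as pd

# Made-up ratings with two missing values
ratings = pd.Series([5.0, 4.0, None, 1.0, None], name="rating")

dropped = ratings.dropna()                        # option 1: drop the missing rows
mean_filled = ratings.fillna(ratings.mean())      # option 2: fill with the mean
median_filled = ratings.fillna(ratings.median())  # option 3: fill with the median

print(len(dropped), mean_filled.tolist(), median_filled.tolist())
```

Dropping shrinks the data, while filling keeps every row. The mean and the median can differ noticeably when the values are skewed, as they are in this toy example.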

### Cleaning the data

In [None]:
# We clean the price values

# We remove the non-breaking space (\xa0) that sits between the two prices.
df["prices"] = df["prices"].str.replace("\xa0", "", regex=False)


In this code, we use the `replace` method of the `str` accessor to remove the unwanted characters in the data. You will also notice that some rows contain two prices: the new price comes first, followed by the old price. We are only interested in the new price, which comes first in the price column.

To get the new price, we use the `apply` method from `pandas` to apply a function that splits each value on whitespace and extracts the first item, which is our new price.

In [2]:
# To get the current price value
df["prices"] = df["prices"].apply(lambda x: x.split(" ")[0]) 
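To see what the split does, here is the same idea on two sample values taken from the output above. The sketch also shows an equivalent form that uses the `.str` accessor instead of `apply`. The variable names are illustrative.

```python
import pandas as pd

prices = pd.Series(["US$9.24 \xa0US$11.10", "US$14.84"])

# Element-wise split with apply: keep the part before the first space
first_apply = prices.apply(lambda x: x.split(" ")[0])

# Equivalent vectorized form using the .str accessor
first_str = prices.str.split(" ").str[0]

print(first_apply.tolist())
```

A value with a single price has nothing to split, so it passes through unchanged.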

In [None]:
df.head()

| | titles | authors | prices | formats | rating |
|---|---|---|---|---|---|
| 0 | It Ends With Us: The most heartbreaking novel ... | Colleen Hoover | US$9.24 | Paperback | 4.0 |
| 1 | It Starts with Us | Colleen Hoover | US$14.84 | Paperback | 5.0 |
| 2 | It Starts with Us | Colleen Hoover | US$16.42 | Hardback | 5.0 |
| 3 | The Climate Book | Greta Thunberg | US$31.60 | Hardback | 5.0 |
| 4 | Rooms of Wonder | Johanna Basford | US$17.49 | Paperback | 4.0 |


If we check our data now, we will see that the price column is almost clean. However, the `US$` characters are still in our values. We don't want that for a column that should hold decimal (floating point) values. So, we are going to remove them and then convert our prices to floating point values. Fortunately, this is easy with `pandas`. We just need to replace these characters with nothing.

In [None]:
# To remove the `US` abbreviation
df["prices"] = df["prices"].str.replace("US", "", regex=False)

# To remove the dollar sign (regex=False so that "$" is treated as plain text)
df["prices"] = df["prices"].str.replace("$", "", regex=False)
df.head()
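One detail worth knowing is that `$` has a special meaning in regular expressions (it marks the end of a string), and whether `str.replace` treats its pattern as a regular expression by default depends on the pandas version. Passing `regex=False` makes the intent explicit. A small sketch on sample values that removes the whole `US$` prefix in one step:

```python
import pandas as pd

prices = pd.Series(["US$9.24", "US$14.84"])

# regex=False treats "US$" as plain text, so "$" is not read as an anchor
cleaned = prices.str.replace("US$", "", regex=False)

print(cleaned.tolist())
```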

Just like that, we have a clean price column, but something else is missing. When we check the data types of our DataFrame, we will see that the price column is still a string type (shown as `object`). We know that shouldn't be the case for a numeric column. We just need to convert this data type from `string` to a `floating point` value.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1020 entries, 0 to 1019
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   titles   1020 non-null   object 
 1   authors  1020 non-null   object 
 2   prices   1020 non-null   object 
 3   formats  1020 non-null   object 
 4   rating   748 non-null    float64
dtypes: float64(1), object(4)
memory usage: 40.0+ KB


In [None]:
# To convert prices to float type
df["prices"] = df.prices.astype("float")
df.info()
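`astype("float")` works here because every remaining value is a valid number. If a column might still contain a value that cannot be parsed, `pd.to_numeric` with `errors="coerce"` is a more forgiving alternative, as sketched below. The `"n/a"` entry is a made-up bad value for illustration.

```python
import pandas as pd

prices = pd.Series(["9.24", "14.84", "n/a"])  # "n/a" is a made-up bad value

# astype("float") would raise a ValueError on "n/a";
# to_numeric with errors="coerce" turns it into NaN instead
as_float = pd.to_numeric(prices, errors="coerce")

print(as_float.dtype, as_float.isna().sum())
```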

We have reached the point where we will deal with the missing values in the rating column. Let's check how many missing values we have. We can see that 272 rows have a missing rating. Since our data is small, we can't drop them, so we will fill them with the average of the column. If we recheck our rating column, we will see that we have no missing values now.

In [None]:
import numpy as np

# We count the missing values in rating
df.rating.isna().sum() # 272

# We fill them with the column mean, rounded to one decimal place
df["rating"] = df["rating"].fillna(round(np.mean(df.rating), 1))

# We recheck for missing values -> 0
df.rating.isna().sum()

272

If we check our final data, we can see that it is clean and the columns are of the right types. We can now export our data to a CSV file for further analysis.

In [None]:
df.head()

| | titles | authors | prices | formats | rating |
|---|---|---|---|---|---|
| 0 | It Ends With Us: The most heartbreaking novel ... | Colleen Hoover | 9.24 | Paperback | 4.0 |
| 1 | It Starts with Us | Colleen Hoover | 14.84 | Paperback | 5.0 |
| 2 | It Starts with Us | Colleen Hoover | 16.42 | Hardback | 5.0 |
| 3 | The Climate Book | Greta Thunberg | 31.6 | Hardback | 5.0 |
| 4 | Rooms of Wonder | Johanna Basford | 17.49 | Paperback | 4.0 |


### Save the data for analysis

In [None]:
df.to_csv("book_depo_clean.csv")  # save for further analysis
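By default, `to_csv` also writes the index as a first column without a name, and that column comes back as `Unnamed: 0` when the file is read again. If the index carries no information, passing `index=False` avoids this. The sketch below shows the difference on a one-row frame, using in-memory buffers instead of files. The data and names are made up for illustration.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({"titles": ["Verity"], "prices": [8.24]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)

# index=False: only the data columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)

cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with, cols_without)
```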