<b>Scraping reviews of my favourite book on Amazon</b> https://www.natasshaselvaraj.com/web-scraping/

In this tutorial, I will walk you through all the steps I took to scrape the book reviews on Amazon. I will use two Python libraries to do this - BeautifulSoup and Selenium.

<b> First, install and import the following packages: </b>

In [1]:
!pip install selenium




In [2]:
!pip install webdriver-manager



In [3]:
import pandas as pd
import numpy as np
from pprint import pprint
import io
import os
from bs4 import BeautifulSoup
import requests,json, re
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

Now, we can use the Selenium web driver to automatically download the first 4 pages of the site that we want to scrape.

First, we can create a folder to save the HTML files in. Then, we define the path in Python.

In [4]:
# path you want to save the HTML files in

path = 'C:/Users/Dharmender/Desktop/ML_Learning/ML_Data/'
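
The folder itself can also be created from Python rather than by hand. The following is a minimal sketch: the helper name `ensure_folder` is hypothetical, and the demo call uses a temporary directory in place of your own path.

```python
import tempfile
from pathlib import Path


def ensure_folder(folder):
    """Create the folder (and any missing parents) and return it as a Path."""
    folder = Path(folder)
    # exist_ok=True makes the call safe to repeat if the folder is already there
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# demo with a temporary directory standing in for your own path
demo_root = tempfile.mkdtemp()
html_folder = ensure_folder(Path(demo_root) / "amazon_pages")
print(html_folder.is_dir())
```

Calling the helper a second time on the same folder does nothing, so it is safe to leave at the top of the notebook.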

Next, we can loop through the number of pages we want to scrape. In our case, it's page 1 to page 4.

Then, we can create a webdriver object, get all the pages we want to scrape and save the pages in our directory.

In [6]:
import time

In [7]:
for i in range(1, 5):
    # Selenium 3 style call; newer Selenium versions expect a Service object instead
    driver = webdriver.Chrome(ChromeDriverManager().install())
    # assumes Amazon's pageNumber query parameter selects the review page
    url = 'https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/product-reviews/1492032646/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=' + str(i)

    driver.get(url)
    time.sleep(1)  # give the page a moment to load
    html = driver.page_source

    # save each of the 4 pages in your path
    with io.open(path + "amazon_page_" + str(i) + ".html", "w", encoding="utf-8") as f:
        f.write(html)

    # close the browser window before opening the next one
    driver.quit()



Current google-chrome version is 92.0.4515
Get LATEST driver version for 92.0.4515
There is no [win32] chromedriver for browser 92.0.4515 in cache
Get LATEST driver version for 92.0.4515
Trying to download new driver from https://chromedriver.storage.googleapis.com/92.0.4515.107/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\Dharmender\.wdm\drivers\chromedriver\win32\92.0.4515.107]


Current google-chrome version is 92.0.4515
Get LATEST driver version for 92.0.4515
Driver [C:\Users\Dharmender\.wdm\drivers\chromedriver\win32\92.0.4515.107\chromedriver.exe] found in cache


Current google-chrome version is 92.0.4515
Get LATEST driver version for 92.0.4515
Driver [C:\Users\Dharmender\.wdm\drivers\chromedriver\win32\92.0.4515.107\chromedriver.exe] found in cache


Current google-chrome version is 92.0.4515
Get LATEST driver version for 92.0.4515
Driver [C:\Users\Dharmender\.wdm\drivers\chromedriver\win32\92.0.4515.107\chromedriver.exe] found in cache
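
The page URLs in the loop above are built by string concatenation. As a minimal sketch, the same URLs could be assembled with the standard library. The helper name `review_page_url` is hypothetical, and the `pageNumber` query parameter is an assumption about how Amazon paginates reviews, not something this tutorial confirms.

```python
from urllib.parse import urlencode

BASE = ("https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow"
        "/product-reviews/1492032646/ref=cm_cr_dp_d_show_all_btm")


def review_page_url(page):
    """Return the review URL for one page (pageNumber is an assumed parameter)."""
    query = urlencode({"ie": "UTF8", "reviewerType": "all_reviews", "pageNumber": page})
    return BASE + "?" + query


# build the URLs for pages 1 to 4
urls = [review_page_url(i) for i in range(1, 5)]
print(urls[0])
```

Building the list up front makes it easy to check the URLs by eye before starting the browser.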


After saving all 4 pages into our local directory, we can start with the web scraping.

<b> Let's start with scraping one web page first: </b>

To scrape it, we need to look at the HTML structure of the page. Right-click on a review and click "Inspect".

We can see that all the review text is wrapped in a span element with the class "a-size-base review-text review-text-content".

This is the element we need to extract when scraping reviews.
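
To see how this class lookup behaves before touching the saved pages, here is a minimal sketch on a hand-written HTML snippet. The snippet and its review text are made up for illustration; only the `id` and class names come from the page described above.

```python
from bs4 import BeautifulSoup

# made-up HTML that mimics the structure described above
sample_html = """
<div id="cm_cr-review_list">
  <div><span class="a-size-base review-text review-text-content">Great book.</span></div>
  <div><span class="a-size-base">Not a review.</span></div>
  <div><span class="a-size-base review-text review-text-content">Too long.</span></div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
review_list = sample_soup.find("div", {"id": "cm_cr-review_list"})

# class_ with a single class name matches any span that carries that class
spans = review_list.find_all("span", class_="review-text-content")
sample_reviews = [span.get_text(strip=True) for span in spans]
print(sample_reviews)
```

Matching on the single class `review-text-content` is a design choice in this sketch: it keeps working if the other class names on the span change.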

To do this, you first need to open one of the pages you saved previously:

In [8]:
# open the first page you saved

file = open(path+'amazon_page_1'+'.html',encoding='utf-8')

Then, create a BeautifulSoup object to parse the HTML of that page:

In [9]:
soup = BeautifulSoup(file, 'html.parser')

Now, grab the 'span' elements we found earlier using BeautifulSoup.

The code below should print around 10 reviews, since you are only scraping the first page:

In [10]:
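# iterate over the direct child tags of the review list
review_list = soup.find('div', {'id': 'cm_cr-review_list'})
for i in review_list.find_all(True, recursive=False):
    review = i.find('span', {'class': 'a-size-base review-text review-text-content'})

    if review is not None:
        print(review.text)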



  I've read all of the predominant machine learning related python books and this one is by far the best one. I was excited to see the second edition of this book come out. It is packed with new information (1.5x the length of the first edition) and updated for TensorFlow 2. I have the Kindle edition and find it very helpful to highlight key points. I look forward to receiving the print edition as well once it is released.EDIT: Just received the print edition of the book and it's in color! The first edition wasn't. This is a pleasant surprise as it makes it easier to read with various charts and graphics.




  While I enjoy learning from this book, the math font in kindle edition is a mess which makes the reading unpleasant. I know to some this probably shouldn't be a deal breaker but for someone who wants to move from hard copy to kindle, it was a disappointment.




  This book gives you a hands-on approach to learning by doing. As opposed to the trendy deep learning books that di

Great!

The code works, and we have grabbed all the reviews from the first page and printed them out.

Now, we just need to loop through all 4 pages and do the same thing.

Looping through all the pages and collecting reviews:

In [46]:
# create an empty list
rev = [] 

# loop through all 4 pages
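for pages in range(1, 5):
    # the with block closes each file once it has been parsed
    with open(path + 'amazon_page_' + str(pages) + '.html', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')

    # find the reviews among the direct child tags of the review list
    review_list = soup.find('div', {'id': 'cm_cr-review_list'})
    for i in review_list.find_all(True, recursive=False):
        review = i.find('span', {'class': 'a-size-base review-text review-text-content'})

        # append reviews to list, using '-1' as a placeholder for non-review blocks
        if review is not None:
            rev.append(review.text)
        else:
            rev.append('-1')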

That's all!

Once this block of code has run, all the scraped reviews will be stored in the list 'rev' that we initialized.

Now, all we need to do is turn that list into a Pandas data frame:

In [47]:
final_df = pd.DataFrame({'Reviews':rev})

We can take a look at the head of the data frame:

In [49]:
final_df.head()

   Reviews
0  -1
1  \n\n I've read all of the predominant machine...
2  \n\n While I enjoy learning from this book, t...
3  \n\n This book gives you a hands-on approach ...
4  \n\n This is an update for my previous review...


In [53]:
final_df.shape

(48, 1)

All the reviews we scraped are in there.

The data frame has 48 rows, but if you drop the '-1' placeholder rows, you'd get around 40 rows.
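
Dropping the placeholder rows can be sketched as follows. The small `rev_demo` list is made up so the example runs on its own; with the real data you would apply the same filter to `final_df`.

```python
import pandas as pd

# made-up stand-in for the scraped list, including '-1' placeholders
rev_demo = ['-1', '\n\n  Great book.\n', '-1', '\n\n  Too long.\n']
demo_df = pd.DataFrame({'Reviews': rev_demo})

# keep only rows that hold real review text
clean_df = demo_df[demo_df['Reviews'] != '-1'].reset_index(drop=True)

# strip the leading and trailing whitespace left over from the HTML
clean_df['Reviews'] = clean_df['Reviews'].str.strip()
print(clean_df)
```

The `reset_index(drop=True)` call renumbers the remaining rows from 0, so the index no longer has gaps where the placeholders were.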

If you want to collect more reviews, all you need to do is scrape more pages.