Folks,

Recall that we used rvest to scrape consumer reviews from structured Amazon webpages.

Here I present a Python equivalent using the 'Beautiful Soup' module.

In addition to SelectorGadget, we'll also use the browser's 'Inspect' option to fine-tune our scraping code.

### Setup Chunk

In [1]:
## using py to scrape amazon reviews
import requests
from bs4 import BeautifulSoup
import pandas as pd

In rvest we used the following code to read the HTML doc into R.
> url1 = "https://www.amazon.in/OnePlus-Mirror-Black-128GB-Storage/product-reviews/B07DJD1Y3Q/ref=cm_cr_getr_d_paging_btm_next_2?showViewpoints=1&sortBy=recent&pageNumber=5"
page1 = url1 %>% read_html()

We'll do something similar in Py. Here requests.get() downloads the page, and the raw HTML that read_html() parsed in R is available as 'page1.content' in Py. See below.

In [2]:
url1 = "https://www.amazon.in/OnePlus-Mirror-Black-128GB-Storage/product-reviews/B07DJD1Y3Q/?showViewpoints=1&sortBy=recent&pageNumber=5"
page1 = requests.get(url1)  # page source downloaded
type(page1)   # requests.models.Response
page1   # a response code starting with 2 == 'success'. '4' or '5' means error in getting the page.

<Response [200]>
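
The first digit of the status code is what matters here. As a minimal sketch of the rule in the comment above (the helper name `status_family` is hypothetical and not part of requests):

```python
def status_family(code: int) -> str:
    """Classify an HTTP status code by its first digit (hypothetical helper)."""
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the leading digit of a 3-digit code
    return families.get(code // 100, "unknown")


for code in (200, 404, 503):
    print(code, status_family(code))
```

With a real response, the integer lives in `page1.status_code`, and `page1.raise_for_status()` raises an exception for 4xx and 5xx codes.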

In [3]:
# now create BSoup object
soup1 = BeautifulSoup(page1.content, "lxml")  # using 'lxml' parser on page1.content
type(soup1)   # a parsedTree object

bs4.BeautifulSoup
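
To see what the parsed tree gives us without hitting the network, here is a small sketch on a made-up HTML snippet. The markup is an assumption for illustration, not Amazon's actual page source, and I use Python's built-in 'html.parser' instead of 'lxml' so that no extra install is needed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (assumption, not real Amazon markup)
toy_html = b"""
<html><head><title>Toy reviews page</title></head>
<body><div id="cm_cr-review_list"><span class="a-icon-alt">4.0 out of 5 stars</span></div></body>
</html>
"""

toy_soup = BeautifulSoup(toy_html, "html.parser")  # bytes in, parsed tree out
page_title = toy_soup.title.get_text()             # text inside <title>
first_span = toy_soup.find("span")                 # first <span> tag in the tree
print(type(toy_soup).__name__, "|", page_title, "|", first_span.get_text())
```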

Corresponding to the rvest code quoted below:
> nodes1 = page1 %>% html_nodes('#cm_cr-review_list .a-icon-alt')
rating = nodes1 %>% html_text()
tail(rating)

we do the following in Py, using '.get_text()' as the analogue of html_text().

In [4]:
rating_nodes1 = soup1.find_all(class_ = 'a-icon-alt')  
rating_nodes1[0].get_text()  # see effect on 1 element

# loop over all elems in list comprehension
ratings = [item.get_text() for item in rating_nodes1]
ratings[0:5]

['4.5 out of 5 stars',
 '5.0 out of 5 stars',
 '2.0 out of 5 stars',
 '4.0 out of 5 stars',
 '1.0 out of 5 stars']

The match above was too broad: it also picks up ratings from outside the review list. So we will use the hierarchical (parent) nodesets we saw in class.
> ### understanding the parent nodeset
parent_node = soup1.select('#cm_cr-review_list')

In [5]:
# but above was too broad a search (18 matches!). Narrowing using parent nodeset
nodes2 = soup1.select('#cm_cr-review_list .a-icon-alt')
nodes2[0].get_text()
ratings = [item.get_text() for item in nodes2]
ratings[0:5]

['4.0 out of 5 stars',
 '1.0 out of 5 stars',
 '1.0 out of 5 stars',
 '2.0 out of 5 stars',
 '5.0 out of 5 stars']
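
The difference between the broad and the narrowed search is easy to reproduce on a toy page. The HTML below is made up for illustration (one rating outside the review list, two inside it); only the id and class names are borrowed from the selectors used above.

```python
from bs4 import BeautifulSoup

# Made-up page: one product-level rating outside the review list, two inside it
toy_html = """
<div id="summary"><span class="a-icon-alt">4.5 out of 5 stars</span></div>
<div id="cm_cr-review_list">
  <div class="review"><span class="a-icon-alt">4.0 out of 5 stars</span></div>
  <div class="review"><span class="a-icon-alt">1.0 out of 5 stars</span></div>
</div>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

# Broad: every element with the class, anywhere on the page
broad = [t.get_text() for t in toy_soup.find_all(class_="a-icon-alt")]
# Narrow: only elements with the class that sit under the parent node
narrow = [t.get_text() for t in toy_soup.select("#cm_cr-review_list .a-icon-alt")]

print(len(broad), broad)    # includes the summary rating
print(len(narrow), narrow)  # only ratings inside the review list
```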

Below I will follow what we did in class and proceed. 

One new point to highlight is the use of the browser's 'Inspect' option (found by right-clicking on the element we want to inspect). The Chrome developer window that opens often shows which unique values in the 'class=' attribute we can use to narrow down our searches and matches.
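
What 'Inspect' shows in the browser can also be checked from Py by looking at a tag's attributes. A small sketch on a made-up tag (the markup and the href are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up anchor tag carrying several classes (assumption for illustration)
toy_html = (
    '<a class="a-link-normal review-title-content a-text-bold" '
    'href="/toy/review/1"><span>Good</span></a>'
)
toy_soup = BeautifulSoup(toy_html, "html.parser")

link = toy_soup.find("a")
classes = link.get("class")   # multi-valued attribute comes back as a list
href = link.get("href")       # single-valued attribute comes back as a string
print(classes)
print(href)
```

Any one of the listed class names that appears only on the elements we want is a good candidate for the selector.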

In [6]:
# now do the same for review title, author, text etc.
title = [item.get_text() for item in soup1.select('#cm_cr-review_list .a-text-bold span')]
title[0:5]

# right-click inspect on review title and see what in 'class=' is unique to review titles
title_link = [item.get('href') for item in soup1.select('#cm_cr-review_list .review-title-content')]
title_link[0:5]

author = [item.get_text() for item in soup1.select('#cm_cr-review_list .a-profile-name')]
author[0:5]

date = [item.get_text() for item in soup1.select('#cm_cr-review_list .review-date')]
date[0:5]

text = [item.get_text() for item in soup1.select('#cm_cr-review_list .review-text-content span')]
text[0:5]

['Good mobile, connectivity is good, does not hang.',
 'Camera is not upto the mark I have S9+ its camera is far better than 1+ 6t only good is its user interface i.e oxygen OS... who said it is flagship killer... Ghanta!!!!',
 'Battery drain very fast',
 'Nowdays, My mobile gets heated up regularly while playing games within 30mins only. And also my mobile hangs once or twice a day. Request you to kindly rectify these issues as I always love to be a part of "ONEPLUS COMMUNITY".Thank you.',
 'Best mobile,superb camera, fastest']
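
Scraped text may carry stray whitespace or line breaks from the page source. As a hedged sketch on made-up markup, '.get_text()' accepts a separator and a strip flag that tidy this up:

```python
from bs4 import BeautifulSoup

# Made-up review body with padding and a line break (assumption for illustration)
toy_html = '<span class="review-text-content">  Battery drain<br/>very fast \n</span>'
toy_soup = BeautifulSoup(toy_html, "html.parser")

node = toy_soup.find("span")
raw = node.get_text()                   # text pieces concatenated as-is
clean = node.get_text(" ", strip=True)  # strip each piece, join with a space
print(repr(raw))
print(repr(clean))
```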

Finally, it is time to bind the results into a DataFrame and close.
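
One caveat before binding: pd.DataFrame needs all the lists to be of the same length, and lists scraped separately can go out of step if some review lacks a field. A possible alternative, sketched here on made-up HTML, is to loop over one container per review and pull the fields from inside it. The '.review' container class is an assumption; check the real one with 'Inspect'.

```python
from bs4 import BeautifulSoup

# Made-up review list; the second review has no title (assumption for illustration)
toy_html = """
<div id="cm_cr-review_list">
  <div class="review">
    <a class="review-title-content" href="/toy/review/1"><span>Good</span></a>
    <span class="a-profile-name">Reviewer A</span>
  </div>
  <div class="review">
    <span class="a-profile-name">Reviewer B</span>
  </div>
</div>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

rows = []
for review in toy_soup.select("#cm_cr-review_list .review"):
    # select_one returns None when the field is missing inside this review
    title_tag = review.select_one(".review-title-content")
    author_tag = review.select_one(".a-profile-name")
    rows.append({
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "author": author_tag.get_text(strip=True) if author_tag else None,
    })
print(rows)
```

Passing `rows` to pd.DataFrame() would then give one row per review, with missing fields left empty instead of shifting the columns.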

In [7]:
## OK. Now bind into DF
py_df = pd.DataFrame({
    'title': title,
    'links': title_link,
    'author': author,
    'reviewDate': date,
    'reviewText': text,
})
py_df[:4]

| | title | links | author | reviewDate | reviewText |
|---|---|---|---|---|---|
| 0 | Good | /gp/customer-reviews/R2TLDIS6D7QMQ0?ASIN=B07DJ... | Dr.Akhil Kumar Srivastava | 25 April 2019 | Good mobile, connectivity is good, does not hang. |
| 1 | Not Upto The Mark.. | /gp/customer-reviews/R1PYA11US51HU9?ASIN=B07DJ... | Rishabh | 25 April 2019 | Camera is not upto the mark I have S9+ its cam... |
| 2 | Battery problem | /gp/customer-reviews/R2GIH5UEWOD1PI?ASIN=B07DJ... | r c meena | 25 April 2019 | Battery drain very fast |
| 3 | My mobile gets heated up & Hangs. | /gp/customer-reviews/R1FNQXMZY625H0?ASIN=B07DJ... | Rahul Biradar | 25 April 2019 | Nowdays, My mobile gets heated up regularly wh... |
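
The links above are relative paths, and the ratings we pulled earlier are strings like '4.0 out of 5 stars'. A minimal sketch of two clean-up helpers, using only the standard library. The helper names and the 'TOY123' path are hypothetical; the base URL is the site root from url1.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.in"  # site root taken from url1


def to_absolute(link: str) -> str:
    """Turn a relative review link into a full URL (hypothetical helper)."""
    return urljoin(BASE_URL, link)


def rating_to_float(label: str) -> float:
    """Keep the leading number of a label like '4.0 out of 5 stars'."""
    return float(label.split()[0])


print(to_absolute("/gp/customer-reviews/TOY123"))
print(rating_to_float("4.0 out of 5 stars"))
```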


Signing off here.

Sudhir