<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#The-scraper-notebook:" data-toc-modified-id="The-scraper-notebook:-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>The scraper notebook:</a></span><ul class="toc-item"><li><span><a href="#Task-1:-Collecting-book-attributes" data-toc-modified-id="Task-1:-Collecting-book-attributes-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Task 1: Collecting book attributes</a></span></li><li><span><a href="#Task-2:-Collecting-reviews" data-toc-modified-id="Task-2:-Collecting-reviews-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Task 2: Collecting reviews</a></span></li></ul></li></ul></div>

# The scraper notebook: 

This notebook contains the cells that scrape the data we need. It covers two major tasks:
1. For each of the four genres, 1250 books and their attributes (e.g. book title, book author) are collected.
2. For the fiction genre, as many reviews as possible are collected.

Install all the required packages and the testing package.

In [3]:
# !pip install -e git+https://github.com/gauravmm/jupyter-testing.git#egg=jupyter-testing

In [20]:
# !pip install -r requirements.txt



In [5]:
# setup library imports
# For now the requests library will not be used since we are collecting the data manually
# import requests

import os 
import bs4
from bs4 import BeautifulSoup
from testing.testing import test

import pandas as pd
import numpy as np

## Task 1: Collecting book attributes

Obtain the BS4 objects from the HTML files that we have collected.

In [6]:
def get_html(file_path): 
    """
    Load ALL the saved Goodreads HTML pages for a given genre.

    Args:
        file_path (str): directory holding the saved HTML pages of one genre

    Returns:
        roots (list): list of bs4 objects, one per HTML file
    """
    roots = list()

    # Sort the filenames so the pages are parsed in a deterministic order
    for filename in sorted(os.listdir(file_path)):
        with open(os.path.join(file_path, filename)) as f:
            # content (string): String of HTML corresponding to a page of 50 books
            content = f.read()

        roots.append(BeautifulSoup(content, 'html.parser'))

    return roots
            

Parse the information in the HTML files. For every book, `book_id`, `book_url`, `book_title`, `author_name`, `ratings`, `num_of_ratings`, `date_published`, `book_shelved`, and `book_genre` are extracted, and each attribute is collected into its own list.

In [21]:
def parse_page(roots):
    """
    Parse the book attributes on each of the 25 pages.
    
    Args:
        roots (list): list of bs4 objects, one per HTML page

    Returns:
        book_attributes (list) : list of lists, in this order:
        - book_id, book_url, book_title, author_name, ratings, num_of_ratings, date_published, book_shelved, book_genre
    """
    
    book_id, book_url, book_title, author_name, ratings, num_of_ratings, date_published, book_shelved, book_genre = list(), list(), list(), list(), list(), list(), list(), list(), list()
    book_attributes = list()

    for root in roots:
        book_link_prefix = "https://www.goodreads.com"
        book_url_page = [x['href'] for x in root.find_all("a", class_="bookTitle")]
        
        book_id_page = [book_link.split("/book/show/")[1].split(".")[0] for book_link in book_url_page]
        book_id.extend(book_id_page)

        book_url_page = [book_link_prefix+book_link for book_link in book_url_page]
        book_url.extend(book_url_page)
        
        book_title_page = [x.get_text() for x in root.find_all("a", class_="bookTitle")]
        book_title.extend(book_title_page)

        author_name_page = [x.get_text() for x in root.find_all("a", class_="authorName")]
        author_name.extend(author_name_page)

        ratings_data = []

        for div in root.find_all("div", class_="left"):
            start = 'shelved'
            end = 'avg rating'
            s = div.get_text()
            # Keep only the text between 'shelved' and 'avg rating'
            shelves_genre_data = s[s.find(start)+len(start):s.rfind(end)]

            # Split around ' times as ': shelf count before, genre after
            before_keyword, _, after_keyword = shelves_genre_data.partition(" times as ")
            book_shelved.append(int(before_keyword))
            # [:-1] drops the last character of the genre token
            book_genre.append(after_keyword.split()[0][:-1])
            

        for div in root.find_all("div", class_="left"):
            for span in div.find_all('span', {'class' : 'greyText smallText'}):
                ratings_data.append(span.get_text())
        
        for elem in ratings_data: 

            ratings.append(elem.split()[2])
            num_of_ratings.append(elem.split()[4])
            
            # If date published is not given pass in nan value
            if len(elem.split()) < 9: 
                date_published.append(np.nan)
            else: 
                date_published.append(elem.split()[8])

    book_attributes = [book_id, book_url, book_title, author_name, ratings, num_of_ratings, date_published, book_shelved, book_genre]
    
    return book_attributes
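
The rating line is split by token position in `parse_page`, which is why a missing publication year has to be detected by counting tokens. As an alternative, here is a minimal sketch that uses a regular expression instead. It assumes the line reads like `avg rating <rating> <sep> <count> ratings <sep> published <year>`, a layout inferred from the token indices used above and not checked against the saved pages. The helper name `parse_rating_line` is hypothetical.

```python
import re

# Assumed layout, inferred from the token positions used in parse_page:
#   "avg rating <rating> <sep> <count> ratings <sep> published <year>"
# The separator is matched as any single non-whitespace token.
RATING_LINE = re.compile(
    r"avg rating\s+(?P<rating>\d+(?:\.\d+)?)"
    r"\s+\S+\s+(?P<count>[\d,]+)\s+ratings?"
    r"(?:\s+\S+\s+published\s+(?P<year>-?\d+))?"
)


def parse_rating_line(text):
    """Return rating, rating count and publication year, or None if no match."""
    match = RATING_LINE.search(text)
    if match is None:
        return None
    year = match.group("year")
    return {
        "rating": float(match.group("rating")),
        # Drop thousands separators before converting to int
        "num_of_ratings": int(match.group("count").replace(",", "")),
        # The publication year is optional on the page
        "date_published": int(year) if year is not None else None,
    }
```

Unlike the positional version, this returns numbers rather than strings, so the columns would not need to be converted later.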


Convert the lists we have into a DataFrame.

In [22]:
def create_dataframe(book_attributes):
    """
    Create a dataframe
    
    Args:
        book_attributes (list): list of lists, in the order returned by parse_page
        
    Returns:
        df (pd.DataFrame) : 
        - Columns: book_id, book_url, book_title, author_name, ratings, num_of_ratings, date_published, book_shelved, book_genre
    """

    df = pd.DataFrame(
        {'book_id': book_attributes[0],
        'book_url': book_attributes[1],
        'book_title': book_attributes[2],
        'author_name': book_attributes[3],
        'ratings': book_attributes[4],
        'num_of_ratings': book_attributes[5],
        'date_published': book_attributes[6],
        'book_shelved': book_attributes[7],
        'book_genre': book_attributes[8]
        })
    

    return df

Use the functions above to build a dataframe for each genre, and then save each one as a separate csv file.

In [23]:
file_paths = ['../HTML/Fiction', '../HTML/Science', '../HTML/Religion', '../HTML/Crime']
for file_path in file_paths: 
    roots = get_html(file_path)
    book_attributes = parse_page(roots)
    df = create_dataframe(book_attributes)
    filename = 'goodreads_' + file_path.split('/')[-1] + '.csv'
    df.to_csv(filename, index=False)


## Task 2: Collecting reviews

The second part of the scraper collects the reviews for each book. We adapted external code from <https://github.com/maria-antoniak/goodreads-scraper>. The scraper uses the Chrome driver to open each book URL and then reads the reviews from the page.
At first, we wrote our own code to scrape the reviews. However, the browser was either blocked or shown a login page, which prevented us from scraping. After modifying the external code, we successfully extracted around 294,000 reviews for 1250 fiction books. To complete the task, we sacrificed some of the data by reducing the time interval and the number of re-scraping attempts. This was necessary because of three scraper-prevention techniques on the website:
1. The page returns duplicate reviews if the time interval is too short.
2. A login panel pops up, which prevents the scraper from reading the information.
3. After about 5 books have been scraped, the website suspends the connection for around 10 seconds.

All of this cost us around 20 hours, and the resulting dataset still has missing data and duplicate information. Further data wrangling and visualization are shown in another notebook.
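
One common way to cope with a temporarily suspended connection is to pause and retry. The following is a minimal sketch of that idea, not what `get_reviews.py` actually does. The names `scrape_with_retries` and `scrape_book` are hypothetical, and the 10-second default pause is taken from the suspension time observed above.

```python
import time


def scrape_with_retries(scrape_book, book_id, max_attempts=3,
                        pause_seconds=10.0, sleep=time.sleep):
    """Call scrape_book(book_id), pausing and retrying when it fails.

    scrape_book is a hypothetical callable that returns the reviews for
    one book and raises an exception when the page cannot be read.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return scrape_book(book_id)
        except Exception as error:  # a real scraper should catch narrower errors
            last_error = error
            if attempt < max_attempts:
                # Wait out the suspension before trying again
                sleep(pause_seconds)
    raise RuntimeError(
        f"giving up on {book_id} after {max_attempts} attempts"
    ) from last_error
```

The `sleep` parameter exists only so the pause can be replaced in tests. Longer pauses and more attempts trade scraping time for completeness, which is the same trade-off described above.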

The dataframe below is the one we collected for fiction books, with their basic info. From it, we created a file containing all the `book_id` values. This file is read by the Python script `get_reviews.py`.

In [9]:
df

Unnamed: 0,book_id,book_url,book_title,author_name,ratings,num_of_ratings,date_published,book_shelved,book_genre
0,2657.To_Kill_a_Mockingbird,https://www.goodreads.com/book/show/2657.To_Ki...,To Kill a Mockingbird (Paperback),Harper Lee,4.27,5025333,1960,24464,fiction
1,40961427-1984,https://www.goodreads.com/book/show/40961427-1984,1984 (Kindle Edition),George Orwell,4.19,3609831,1949,24368,fiction
2,4671.The_Great_Gatsby,https://www.goodreads.com/book/show/4671.The_G...,The Great Gatsby (Paperback),F. Scott Fitzgerald,3.93,4217051,1925,22232,fiction
3,170448.Animal_Farm,https://www.goodreads.com/book/show/170448.Ani...,Animal Farm (Mass Market Paperback),George Orwell,3.97,3105131,1945,20400,fiction
4,3.Harry_Potter_and_the_Sorcerer_s_Stone,https://www.goodreads.com/book/show/3.Harry_Po...,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling,4.47,8031019,1997,20064,fiction
...,...,...,...,...,...,...,...,...,...
1245,7278752-dolores-claiborne,https://www.goodreads.com/book/show/7278752-do...,Dolores Claiborne (ebook),Stephen King,3.89,135428,1992,1241,fiction
1246,99300.The_Yellow_Wallpaper_and_Other_Stories,https://www.goodreads.com/book/show/99300.The_...,The Yellow Wallpaper and Other Stories (Paperb...,Charlotte Perkins Gilman,4.05,83538,1892,1241,fiction
1247,10628.Night_Shift,https://www.goodreads.com/book/show/10628.Nigh...,Night Shift (Paperback),Stephen King,4.02,157161,1978,1240,fiction
1248,2547.The_Prophet,https://www.goodreads.com/book/show/2547.The_P...,The Prophet (Paperback),Kahlil Gibran,4.21,261164,1923,1240,fiction


In [9]:
# Write one book_id per line
# F = open("IDS.txt", "w")
# for line in df["book_id"]:
#     F.write(line)
#     F.write("\n")
# F.close()

Make a folder, `book_reviews_folder`, to hold the JSON file for each book.

In [None]:
# !mkdir book_reviews_folder

Run the Python script, then load the resulting JSON file into a dataframe and store the dataframe as a csv file called `review.csv`.

In [4]:
!python get_reviews.py --book_ids_path IDS.txt \
--output_directory_path book_reviews_folder --sort_order default --browser chrome 

  driver = webdriver.Chrome(executable_path=binary_path)
2021-11-28 23:30:22.369205 get_reviews.py: Scraping 4708.The_Beautiful_and_Damned...
2021-11-28 23:30:22.369246 get_reviews.py: #1222 out of 1250 books
Scraped page 1
  if driver.find_element_by_link_text(str(page_counter)):
  driver.find_element_by_link_text(str(page_counter)).click()
Scraped page 2
Scraped page 3
Scraped page 4
Scraped page 5
Scraped page 6
Scraped page 7
Scraped page 8
Scraped page 9
ERROR: StaleElementReferenceException
Refreshing Goodreads site and skipping problem page {page_counter} 
2021-11-28 23:30:35.889118 get_reviews.py: Scraped ✨270✨ reviews for 4708.The_Beautiful_and_Damned
2021-11-28 23:30:35.898636 get_reviews.py: Scraping 16255.Tales_of_the_City...
2021-11-28 23:30:35.898648 get_reviews.py: #1223 out of 1250 books
Scraped page 1
Scraped page 2
Scraped page 3
ERROR: StaleElementReferenceException
Refreshing Goodreads site and skipping problem page {page_counter} 
Scraped page 5
Scraped page 6
ERRO

Scraped page 1
Scraped page 2
Scraped page 3
Scraped page 4
Scraped page 5
Scraped page 6
ERROR: StaleElementReferenceException
Refreshing Goodreads site and skipping problem page {page_counter} 
Scraped page 8
Scraped page 9
2021-11-28 23:34:14.650720 get_reviews.py: Scraped ✨240✨ reviews for 34128219-la-belle-sauvage
2021-11-28 23:34:14.657889 get_reviews.py: Scraping 15729539-nos4a2...
2021-11-28 23:34:14.657901 get_reviews.py: #1237 out of 1250 books
Scraped page 1
Scraped page 2
Scraped page 3
Scraped page 4
Scraped page 5
Scraped page 6
ERROR: StaleElementReferenceException
Refreshing Goodreads site and skipping problem page {page_counter} 
Scraped page 8
Scraped page 9
Scraped page 10
2021-11-28 23:34:35.608991 get_reviews.py: Scraped ✨270✨ reviews for 15729539-nos4a2
2021-11-28 23:34:35.617742 get_reviews.py: Scraping 25200.Silence...
2021-11-28 23:34:35.617754 get_reviews.py: #1238 out of 1250 books
Scraped page 1
Scraped page 2
Scraped page 3
Scraped page 4
Scraped page 5
Scr

In [7]:
reviews_df = pd.read_json('book_reviews_folder/all_reviews.json')
reviews_df

Unnamed: 0,book_id_title,book_id,book_title,review_url,review_id,date,rating,user_name,user_url,text,num_likes,sort_order,shelves
0,32187419-conversations-with-friends,32187419-conversations-with-friends,Conversations with Friends,https://www.goodreads.com/review/show/1855355089,1855355089,2016-12-29,2,Sam,/user/show/59357213-sam,I didn't really respond well to Conversations ...,1043,default,[2017-reads]
1,32187419-conversations-with-friends,32187419-conversations-with-friends,Conversations with Friends,https://www.goodreads.com/review/show/2098766690,2098766690,2017-08-23,5,Jill,/user/show/2228181-jill,I’ve been thinking a lot about aging lately: t...,937,default,[]
2,32187419-conversations-with-friends,32187419-conversations-with-friends,Conversations with Friends,https://www.goodreads.com/review/show/1948088321,1948088321,2017-06-09,3,Esil,/user/show/3643764-esil,A very tepid 3 stars. Conversations with Frien...,839,default,[netgalley]
3,32187419-conversations-with-friends,32187419-conversations-with-friends,Conversations with Friends,https://www.goodreads.com/review/show/2831723058,2831723058,2019-05-23,5,emma,/user/show/32879029-emma,have been truly dealt a series of death blows ...,592,default,"[couldn-t-wait-to-read, favorites, literary-fi..."
4,32187419-conversations-with-friends,32187419-conversations-with-friends,Conversations with Friends,https://www.goodreads.com/review/show/2340296379,2340296379,2018-03-26,2,Barry Pierce,/user/show/4593541-barry-pierce,The narrator of Sally Rooney's Conversations w...,480,default,"[21st-century, read-in-2018]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
297450,54493401-project-hail-mary,54493401-project-hail-mary,Project Hail Mary,https://www.goodreads.com/review/show/3553164073,3553164073,2020-09-17,5,Jenna,/user/show/3536004-jenna,~~~~~~~~~~~~~~It's publication day!~~~~~~~~~~~...,158,default,"[science-fiction, edelweiss]"
297451,54493401-project-hail-mary,54493401-project-hail-mary,Project Hail Mary,https://www.goodreads.com/review/show/3763008971,3763008971,2021-05-24,5,Bradley,/user/show/4213258-bradley,And we're back. I loved the Martian and I was ...,152,default,"[fantasy, 2021-shelf, sci-fi]"
297452,54493401-project-hail-mary,54493401-project-hail-mary,Project Hail Mary,https://www.goodreads.com/review/show/4077397636,4077397636,2021-06-28,5,Kevin Kuhn,/user/show/59568642-kevin-kuhn,"Let’s start with this, I completely enjoyed th...",135,default,"[science-fiction, favorites]"
297453,54493401-project-hail-mary,54493401-project-hail-mary,Project Hail Mary,https://www.goodreads.com/review/show/3882968267,3882968267,2021-04-02,4,Kemper,/user/show/405390-kemper,I received a free advanced copy of this from N...,138,default,"[arc, space, 2021, sci-fi]"
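
Because the site can return the same review more than once, the collected reviews may contain repeated rows. Below is a minimal sketch of removing them. It assumes that `review_id` uniquely identifies a review, and the helper name `drop_duplicate_reviews` is hypothetical. It works on plain dictionaries so it can run without the scraped data.

```python
def drop_duplicate_reviews(reviews):
    """Keep the first occurrence of each review_id, preserving order."""
    seen = set()
    unique = []
    for review in reviews:
        review_id = review["review_id"]
        if review_id not in seen:
            seen.add(review_id)
            unique.append(review)
    return unique


# Tiny illustrative records; the values are made up for this sketch
sample = [
    {"review_id": 1, "rating": 5},
    {"review_id": 2, "rating": 3},
    {"review_id": 1, "rating": 5},
]
deduplicated = drop_duplicate_reviews(sample)
```

On the dataframe itself, the same effect would be obtained with `reviews_df.drop_duplicates(subset="review_id")`.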
