<a href="https://colab.research.google.com/github/achapman49/GoodreadsWebScraping/blob/main/Goodreads_Data_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Dependencies

In [4]:
!pip install bs4
!pip install progressbar
!pip install gender_guesser
!pip install translators --upgrade
!pip install langdetect

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting progressbar
  Downloading progressbar-2.5.tar.gz (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: progressbar
  Building wheel for progressbar (setup.py) ... done
  Created wheel for progressbar: filename=progressbar-2.5-py3-none-any.whl size=12081 sha256=ce08891301d54052c11587da09c5e4421ffed8ac4d58fd675ebce7909e29d892
  Stored in directory: /root/.cache/pip/wheels/2c/67/ed/d84123843c937d7e7f5ba88a270d11036473144143355e2747
Successfully built progressbar
Installing collected packages: progressbar
Successfully installed progressbar-2.5


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gender_guesser
  Downloading gender_guesser-0.4.0-py2.py3-none-any.whl (379 kB)
Installing collected packages: gender_guesser
Successfully installed gender_guesser-0.4.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting translators
  Downloading translators-5.5.6-py3-none-any.whl (33 kB)
Collecting pathos>=0.2.9
  Downloading pathos-0.3.0-py3-none-any.whl (79 kB)
Collecting cryptography>=38.0.1
  Downloading cryptography-39.0.0-cp36-abi3-manylinux_2_28_x86_64.whl (4.2 MB)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993242 sha256=834fdfdca4a5c19900e1249f35f6c4cf22550b9fb80c07812d2423705bd9b7a7
  Stored in directory: /root/.cache/pip/wheels/13/c7/b0/79f66658626032e78fc1a83103690ef6797d551cb22e56e734
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


## 2. Imports

In [5]:
##Standard imports for scraping and manipulating data
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import time
import re

#Progressbar is helpful to see the progress of functions which take a long time to run
import progressbar

from datetime import datetime
import ast

#Gender guesser is used to try and parse author gender
import gender_guesser.detector as gender

#Translator for non-English-language records and language detector to find them
import translators as ts
import translators.server as tss
from langdetect import detect
from langdetect import DetectorFactory
DetectorFactory.seed = 0

Using state District of Columbia server backend.


## 3. Scrape a list

### 3.1 Define Function

Run the cell below to define scrape_book_list, which scrapes every page of a Goodreads list and stores the result in the global DataFrame list_df.

In [15]:
#Define the function for scraping every page of a list from a url

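def scrape_book_list(url, show_df = False, show_progress = False):
    """Scrape every page of a Goodreads list into the global DataFrame list_df.

    url: list URL without a '?page=n' suffix.
    show_df: print the first rows of the result when done.
    show_progress: print a status message before each stage.
    """

    ##Define urls and create the initial html soup
    if show_progress:
        print('Turning page into soup...')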
    goodreads_url = "https://www.goodreads.com"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
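    #Get list length
    if show_progress:
        print('Getting list length...')

    ##If the pagination block is missing, wait and request the page again (up to 3 more times)
    pagination_div = soup.find("div", class_ = "pagination")
    retries = 0
    while pagination_div is None and retries < 3:
        time.sleep(3)
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        pagination_div = soup.find("div", class_ = "pagination")
        retries += 1

    ##For navigation, I need to find out how many pages there are in this list
    if pagination_div is None:
        max_page = 1 #Assumption: no pagination block means the list fits on a single page
    else:
        pages = [a.text for a in pagination_div.find_all("a")]
        max_page = int(pages[-2]) #The last item is always the word ' next', take the next to last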
    
    ##Now create a list of links
    ### The pages in these lists are defined by adding '?page=[i]' at the end, making them easy to iterate over
    list_page_links = []
    for i in range(2, max_page + 1):
        list_page_link = f"{url}?page={i}"
        list_page_links.append(list_page_link)
    
    ##The lists are in a table
    if show_progress:
        print('Scraping list table on page 1...')
    list_table = soup.find("table", class_= "tableList")
    
    ##Get the relevant page elements to populate the columns with content from the first page
    rankings = [row.find("td").text.strip() for row in list_table.find_all("tr")]
    titles = [row.find("a", class_="bookTitle").text.strip() for row in list_table.find_all("tr")]
    authors = [row.find("a", class_="authorName").text.strip() for row in list_table.find_all("tr")]
    links = [row["href"] for row in list_table.find_all("a", class_="bookTitle")]
    
    ##Scores and votes are stored in such a way that I need to use a different method
    table_rows = list_table.find_all("tr") #Only rows of the list table carry score and vote links

    scores = []
    votes = []

    for row in table_rows:
        extra_info = [element.text for element in row.find_all("a", href = "#")]
        score = int(extra_info[0].replace("score: ","").replace(",", ""))
        vote_count = int(extra_info[1].replace(" people voted", "").replace(" person voted", "").replace(",", ""))
        scores.append(score)
        votes.append(vote_count)    
        
    ##Ratings and counts of ratings
    ratings_and_counts = soup.find_all("span", class_ = "minirating")

    avg_ratings  = []
    rating_counts = []

    for i in ratings_and_counts:
        text = i.text
        text = text.replace("it was amazing ", "").replace("really liked it ", "").replace("liked it ", "").replace("it was ok ", "").replace("did not like it ", "")
        text = text.replace(" avg rating ", "").replace(" ratings", "").replace(" rating","").replace(",", "")
        text = text.split("— ")
        avg_ratings.append(float(text[0]))
        rating_counts.append(int(text[1]))

    
    ##Loop through and append content from the other pages
    if show_progress:
        print('Scraping additional list pages...')
        
    bar = progressbar.ProgressBar(maxval=max_page).start()
    
    for l in list_page_links:
        page = requests.get(l)
        soup = BeautifulSoup(page.content, 'html.parser')
        list_table = soup.find("table", class_= "tableList")
        while list_table is None:
            time.sleep(3)
            page = requests.get(l)
            soup = BeautifulSoup(page.content, 'html.parser')
            list_table = soup.find("table", class_= "tableList")
        
        rankings_new = [row.find("td").text.strip() for row in list_table.find_all("tr")]
        titles_new = [row.find("a", class_="bookTitle").text.strip() for row in list_table.find_all("tr")]
        authors_new = [row.find("a", class_="authorName").text.strip() for row in list_table.find_all("tr")]
        links_new = [row["href"] for row in list_table.find_all("a", class_="bookTitle")]
        
        ###Scores and votes
        table_rows = list_table.find_all("tr") #Only rows of the list table carry score and vote links

        scores_new = []
        votes_new= []

        for row in table_rows:
            extra_info = [element.text for element in row.find_all("a", href = "#")]
            score = int(extra_info[0].replace("score: ","").replace(",", ""))
            vote_count = int(extra_info[1].replace(" people voted", "").replace(" person voted", "").replace(",", ""))
            scores_new.append(score)
            votes_new.append(vote_count)  
        
        #Ratings and counts
        ratings_and_counts_new = soup.find_all("span", class_ = "minirating")

        avg_ratings_new  = []
        rating_counts_new = []

        for i in ratings_and_counts_new:
            text = i.text
            text = text.replace("it was amazing ", "").replace("really liked it ", "").replace("liked it ", "").replace("it was ok ", "").replace("did not like it ", "")
            text = text.replace(" avg rating ", "").replace(" ratings", "").replace(" rating","").replace(",", "")
            text = text.split("— ")
            avg_ratings_new.append(float(text[0]))
            rating_counts_new.append(int(text[1]))

        rankings.extend(rankings_new)
        titles.extend(titles_new)
        authors.extend(authors_new)
        links.extend(links_new)
        scores.extend(scores_new)
        votes.extend(votes_new)
        avg_ratings.extend(avg_ratings_new)
        rating_counts.extend(rating_counts_new)
        
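        ###update() takes the absolute progress value, not an increment:
        ###page 1 plus every link up to and including this one has been scraped
        bar.update(list_page_links.index(l) + 2)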
        
    bar.finish()
    
    #Format links correctly
    links = [f"{goodreads_url}{row}" for row in links]
        
    ##Create a dataframe to easily view and manipulate the data
    ###list_df is declared global so that later cells can read it after the function has run
    global list_df
    list_df = pd.DataFrame(columns=["Rank", "Title", "Author", "URL", "Score", "Votes"])
    list_df["Rank"] = rankings
    list_df["Title"] = titles
    list_df["Author"] = authors
    list_df["URL"] = links
    list_df["Score"] = scores
    list_df["Votes"] = votes
    list_df["Average Rating"] = avg_ratings
    list_df["Rating Count"] = rating_counts
    

        
    ##Option to show dataframe at the end
    if show_df:
        print(list_df.head())
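
The chained `replace` calls that clean the rating text in the function above can also be expressed as a single regular expression. The following is only a sketch: `parse_minirating` and `MINIRATING_PATTERN` are hypothetical names that do not appear in the notebook, and the input format ("4.33 avg rating — 7,823,389 ratings", optionally preceded by a phrase such as "really liked it") is inferred from the strings the function strips out.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Assumed input shape: "<phrase> <average> avg rating — <count> rating(s)".
MINIRATING_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s+avg rating\s+—\s+([\d,]+)\s+ratings?"
)

def parse_minirating(text):
    """Return (average_rating, rating_count) parsed from a minirating string."""
    match = MINIRATING_PATTERN.search(text)
    if match is None:
        # Fail loudly instead of silently producing a wrong number.
        raise ValueError(f"Unrecognised minirating text: {text!r}")
    average = float(match.group(1))
    count = int(match.group(2).replace(",", ""))  # drop thousands separators
    return average, count

print(parse_minirating("really liked it 4.33 avg rating — 7,823,389 ratings"))
print(parse_minirating("4.00 avg rating — 1 rating"))
```

Because the pattern searches for the numeric part directly, the leading phrase does not need to be listed, and unexpected text raises an error rather than passing through to `float()`.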

### 3.2 Define which List to Scrape

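Remove any page suffix from the end of the link (the '?page=n' part) before running the scraper. The function appends '?page=n' to the link itself, so a link that already carries one produces malformed page URLs and the scrape fails.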
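
If a link is pasted with its page suffix still attached, it can be cleaned in code before it is passed to the scraper. The sketch below uses only the standard library; `strip_query` and `build_page_links` are hypothetical helpers that are not defined in the notebook, and `build_page_links` simply mirrors how `scrape_book_list` builds its page URLs.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return url without its query string or fragment (hypothetical helper)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def build_page_links(base_url, max_page):
    """Build links for pages 2..max_page by appending '?page=i' to base_url."""
    return [f"{base_url}?page={i}" for i in range(2, max_page + 1)]

raw_link = "https://www.goodreads.com/list/show/1.Best_Books_Ever?page=3"
clean_link = strip_query(raw_link)

print(clean_link)
print(build_page_links(clean_link, 4))
```

Passing `raw_link` directly to `build_page_links` would yield URLs ending in '?page=3?page=2', which is why the suffix has to go first.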


In [12]:
list_link = "https://www.goodreads.com/list/show/1.Best_Books_Ever" #@param {type:"string"}

### 3.3 Scrape The List

In [20]:
scrape_book_list(list_link, show_progress = True)

Turning page into soup...


N/A% (0 of 100) |                        | Elapsed Time: 0:00:00 ETA:  --:--:--

Getting list length...
Scraping list table on page 1...
Scraping additional list pages...


100% (100 of 100) |######################| Elapsed Time: 0:06:44 Time:  0:06:44
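
The run above makes one request per list page, and inside the page loop the function waits 3 seconds and requests the page again for as long as the list table is missing, with no upper limit. A bounded variant might look like the sketch below. `fetch_with_retries`, `fetch`, `is_valid`, `max_attempts` and `pause_seconds` are all hypothetical names, and the page request is replaced by a stand-in so that the example runs without network access.

```python
import time

def fetch_with_retries(fetch, is_valid, max_attempts=5, pause_seconds=3.0):
    """Call fetch() until is_valid(result) is true or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if is_valid(result):
            return result
        if attempt < max_attempts:
            time.sleep(pause_seconds)  # wait before the next attempt
    raise RuntimeError(f"No valid response after {max_attempts} attempts")

# Stand-in for a page request: returns nothing twice, then a table.
responses = iter([None, None, "<table class='tableList'>"])
page = fetch_with_retries(
    lambda: next(responses),
    lambda result: result is not None,
    pause_seconds=0,  # no waiting in this demonstration
)
print(page)
```

In the scraper, `fetch` would wrap the `requests.get` and `BeautifulSoup` calls and `is_valid` would check that the list table was found; raising after a fixed number of attempts stops a blocked or removed page from hanging the run.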


In [21]:
list_df

| | Rank | Title | Author | URL | Score | Votes | Average Rating | Rating Count |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The Hunger Games (The Hunger Games, #1) | Suzanne Collins | https://www.goodreads.com/book/show/2767052-th... | 3420796 | 34828 | 4.33 | 7823389 |
| 1 | 2 | Harry Potter and the Order of the Phoenix (Har... | J.K. Rowling | https://www.goodreads.com/book/show/2.Harry_Po... | 2928451 | 29925 | 4.50 | 3095395 |
| 2 | 3 | Pride and Prejudice | Jane Austen | https://www.goodreads.com/book/show/1885.Pride... | 2435630 | 25023 | 4.28 | 3866260 |
| 3 | 4 | To Kill a Mockingbird | Harper Lee | https://www.goodreads.com/book/show/2657.To_Ki... | 2287472 | 23364 | 4.27 | 5598473 |
| 4 | 5 | The Book Thief | Markus Zusak | https://www.goodreads.com/book/show/19063.The_... | 1648816 | 16973 | 4.39 | 2320782 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 9979 | Lalka | Bolesław Prus | https://www.goodreads.com/book/show/484466.Lalka | 295 | 3 | 3.81 | 15583 |
| 9996 | 9979 | قواعد العشق الأربعون: رواية عن جلال الدين الرومي | Elif Shafak | https://www.goodreads.com/book/show/16104434 | 295 | 3 | 4.14 | 161067 |
| 9997 | 9979 | Secret Slave: Kidnapped and abused for 13 year... | Anna Ruston | https://www.goodreads.com/book/show/32951022-s... | 295 | 3 | 4.23 | 4897 |
| 9998 | 9999 | Horizon (The Sharing Knife, #4) | Lois McMaster Bujold | https://www.goodreads.com/book/show/3423435-ho... | 294 | 4 | 4.04 | 6932 |
