# Goodreads List Web Scraper

## Description
Web scraping is a useful way to retrieve unstructured data from a website and save it in a structured format. Not all websites allow web scraping, so for this project I chose to work with Goodreads, which allows it.

This script scrapes details of books from a Goodreads list and writes them to a CSV file.

## Libraries
The first thing I did was install the required libraries using pip through Command Prompt.
1. The **bs4** (BeautifulSoup4) library makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Python idioms for iterating, searching, and modifying the parse tree.

2. The **requests** package makes HTTP requests to a specified URL, which is the first step in web scraping. Each request to a URL returns a response. Goodreads allows HTML web scraping, so it returns a successful response. Many websites do not.

3. The **pandas** package is used to collect the data into a DataFrame and write it to a CSV file.
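
Whether a site permits automated access is usually stated in its `robots.txt` file and terms of service, not only in the response it returns. As a minimal sketch, the standard library's `urllib.robotparser` can check a URL against a set of rules. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the `is_allowed` helper are made up for illustration and are not Goodreads' actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; a real site publishes its own
# rules at https://<host>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /list/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


list_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/list/show/1")
private_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/page")
print(list_allowed, private_allowed)
```

To read a site's real rules instead of a sample string, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the file before `can_fetch()` is called.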
    
## Example using the first book on the list
### Importing the libraries

In [2]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd # pandas will be used to compile the data into a csv file

### Requesting the website
The first step in scraping is the **get** function from requests, which sends a request to the website. If the website accepts the request, we get a successful "200" response:

In [3]:
response = requests.get('https://www.goodreads.com/list/show/1599.Best_Philosophical_Fiction')
print(response)

<Response [200]>
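
The number inside `<Response [200]>` is the HTTP status code. As a small sketch of how such codes can be interpreted, the helpers below use the standard library's `http.HTTPStatus`; the names `describe_status` and `is_success` are hypothetical and not part of requests.

```python
from http import HTTPStatus


def describe_status(status_code: int) -> str:
    """Return a short label such as '200 OK' for an HTTP status code."""
    return f"{status_code} {HTTPStatus(status_code).phrase}"


def is_success(status_code: int) -> bool:
    """2xx codes mean the server fulfilled the request."""
    return 200 <= status_code < 300


for code in (200, 403, 404):
    print(describe_status(code), "->", "ok" if is_success(code) else "failed")
```

With requests, the same number is available as `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx codes, which is a convenient way to stop before parsing an error page.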


### Inspecting the website
The list has a total of 3 pages and 281 books in it. Before we scrape all of them, we first need to understand the containers and classes of the data we want to extract by inspecting the webpage's HTML code. We can do so by right-clicking on the element we want to analyze (like the first book on the list) and clicking 'Inspect'. Then, we find the appropriate container and figure out which elements we want from it.

For this project, I chose to scrape the book title, author, average rating, number of ratings, votes, and score (the score is based on multiple factors, including the number of people who have voted for the book and how highly those voters ranked it). We need to find the class of each element we want to scrape and specify it in the code. By inspecting the website, we can see that the class for the book title is named 'bookTitle' and the class for the author is named 'authorName'.

In [4]:
# scraping the first book title
soup = bs(response.content, 'html.parser')
book_containers = soup.find_all('tr',itemtype="http://schema.org/Book")
first_book = book_containers[0]
first_book_title = first_book.find('a', class_='bookTitle')
print(first_book_title)

<a class="bookTitle" href="/book/show/49552.The_Stranger" itemprop="url">
<span aria-level="4" itemprop="name" role="heading">The Stranger</span>
</a>


This gives us the HTML element for the book title. To extract the actual title, we read the element's **text** attribute and call **strip()** to remove the surrounding whitespace:

In [5]:
print(first_book_title.text.strip())

The Stranger
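
One thing to watch for: `find` returns `None` when no matching element exists, so calling `.text` on the result would raise an `AttributeError`. The sketch below shows one possible guard. The helper name `text_or_none` and the trimmed `SAMPLE_ROW` markup are assumptions for illustration, with the anchor tag copied from the output above.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical row based on the title element printed above.
SAMPLE_ROW = """
<table><tr itemtype="http://schema.org/Book">
<td><a class="bookTitle" href="/book/show/49552.The_Stranger" itemprop="url">
<span aria-level="4" itemprop="name" role="heading">The Stranger</span>
</a></td></tr></table>
"""


def text_or_none(container, tag, class_name):
    """Return the stripped text of the first match, or None if absent."""
    element = container.find(tag, class_=class_name)
    return element.get_text(strip=True) if element is not None else None


soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
row = soup.find("tr", itemtype="http://schema.org/Book")
title = text_or_none(row, "a", "bookTitle")
missing_author = text_or_none(row, "a", "authorName")  # not in the sample row
print(title, missing_author)
```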


We can do the same for the first book's author and for the container that holds the average rating and the number of ratings. Because the rating and the count are stored together in a single string, we also call **split()** to break that string into a list of words:

In [10]:
first_author = first_book.find('a', class_='authorName').text.strip()
print(first_author)

first_rating = first_book.find('span', class_='minirating').text.strip().split()
print(first_rating)

Albert Camus
['4.02', 'avg', 'rating', '—', '995,631', 'ratings']


To compile the data into a CSV, we pick out the elements we want from this list by index.

In [11]:
avg_rating = first_rating[0]
no_ratings = first_rating[4]
print('Average Rating:', avg_rating)
print('Ratings:', no_ratings)

Average Rating: 4.02
Ratings: 995,631
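
Indexing by position works as long as every book's rating text has the same shape. As an alternative sketch, a regular expression can pull out both numbers and convert them to numeric types. The pattern and the `parse_minirating` helper are assumptions based on the single string shown above, not a guaranteed description of every row on the site.

```python
import re

# Matches text such as "4.02 avg rating <dash> 995,631 ratings".
MINIRATING_PATTERN = re.compile(r"(\d+\.\d+)\s+avg rating\D+([\d,]*\d)\s+rating")


def parse_minirating(text: str) -> tuple[float, int]:
    """Return (average rating, number of ratings) from the minirating text."""
    match = MINIRATING_PATTERN.search(text)
    if match is None:
        raise ValueError(f"unrecognized rating text: {text!r}")
    average = float(match.group(1))
    count = int(match.group(2).replace(",", ""))  # drop thousands separators
    return average, count


sample_text = "4.02 avg rating \u2014 995,631 ratings"
average, count = parse_minirating(sample_text)
print(average, count)
```

Because the pattern searches anywhere in the string, it would still find the numbers if extra words appeared before the rating.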


The last things we can scrape are the book's overall score on the list and the number of people who voted for it:

In [21]:
first_score = first_book.find('span', class_='smallText uitext').text.strip().split()
print(first_score)

['score:', '8,862,', 'and', '90', 'people', 'voted']


In [None]:
# the score token ends with a comma and may contain a thousands separator, so we remove all commas
score = first_score[1].replace(',','')
votes = first_score[3]
print('Overall Score:', score)
print('Number of votes:', votes)
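
The same idea can be sketched with a regular expression that returns both values as integers. The pattern and the `parse_score` helper are assumptions derived from the one score string shown above.

```python
import re

# Matches text such as "score: 8,862, and 90 people voted".
SCORE_PATTERN = re.compile(
    r"score:\s*([\d,]*\d),?\s+and\s+([\d,]*\d)\s+\w+\s+voted"
)


def parse_score(text: str) -> tuple[int, int]:
    """Return (score, number of voters) from the list score text."""
    match = SCORE_PATTERN.search(text)
    if match is None:
        raise ValueError(f"unrecognized score text: {text!r}")
    score_value = int(match.group(1).replace(",", ""))
    vote_count = int(match.group(2).replace(",", ""))
    return score_value, vote_count


score_value, vote_count = parse_score("score: 8,862, and 90 people voted")
print(score_value, vote_count)
```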

## Scraping the whole list
Now we are going to put everything together into a loop that iterates over all 3 pages and 281 books.

In [37]:
# one list per column; each book appends one entry to every list
title = []
author = []
avg_rating = []
ratings = []
score = []
votes = []
for page in range(1, 4):  # the list has 3 pages
    url = f"https://www.goodreads.com/list/show/1599.Best_Philosophical_Fiction?page={page}"
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    book_containers = soup.find_all('tr', itemtype="http://schema.org/Book")
    for container in book_containers:
        title.append(container.find('a', class_="bookTitle").text.strip())
        author.append(container.find('a', class_="authorName").text.strip())
        # tokens: average rating at index 0, number of ratings at index 4
        rating = container.find('span', class_="minirating").text.strip().split()
        avg_rating.append(rating[0])
        ratings.append(rating[4])
        # tokens: score at index 1 (with a trailing comma), votes at index 3
        scoring = container.find('span', class_="smallText uitext").text.strip().split()
        score.append(scoring[1].replace(',', ''))
        votes.append(scoring[3])
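
The loop above hardcodes the number of pages. As a hedged sketch of an alternative, the generator below keeps requesting pages until one comes back with no rows, which assumes that a page past the end of the list contains no book rows. The names `iter_rows`, `fake_fetch_rows`, and `FAKE_PAGES` are hypothetical, and the fake fetcher stands in for the real request and parsing step so the example runs without network access.

```python
from typing import Callable, Iterator, List

BASE_URL = "https://www.goodreads.com/list/show/1599.Best_Philosophical_Fiction"


def iter_rows(
    fetch_rows: Callable[[str], List[str]], max_pages: int = 50
) -> Iterator[str]:
    """Yield rows page by page until a page comes back empty."""
    for page in range(1, max_pages + 1):
        rows = fetch_rows(f"{BASE_URL}?page={page}")
        if not rows:
            break  # assumed signal that the list has ended
        yield from rows


# Stand-in data: two pages of rows, everything else is empty.
FAKE_PAGES = {
    f"{BASE_URL}?page=1": ["book A", "book B"],
    f"{BASE_URL}?page=2": ["book C"],
}


def fake_fetch_rows(url: str) -> List[str]:
    return FAKE_PAGES.get(url, [])


all_rows = list(iter_rows(fake_fetch_rows))
print(all_rows)
```

In a real run, `fetch_rows` would wrap `requests.get` and `find_all`, and adding a short `time.sleep` between pages would keep the request rate low. The `max_pages` cap is a safety limit so the loop cannot run forever.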

In [40]:
# checking if the data was scraped properly
print(title[0], author[0], avg_rating[0], ratings[0], score[0], votes[0])

The Stranger Albert Camus 4.02 995,651 8862 90


In [38]:
# Create DataFrame out of a dictionary
books_df = pd.DataFrame({
    'title':title, 
    'author':author,
    'avg_rating':avg_rating,
    'ratings':ratings,
    'score':score,
    'votes':votes
})

print(books_df.head())

                    title             author avg_rating    ratings score votes
0            The Stranger       Albert Camus       4.02    995,651  8862    90
1    Crime and Punishment  Fyodor Dostoevsky       4.26    841,778  7213    73
2                    1984      George Orwell       4.19  4,302,165  5727    59
3  The Brothers Karamazov  Fyodor Dostoevsky       4.36    312,261  4804    49
4              Siddhartha      Hermann Hesse       4.06    734,081  4735    49
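
Every column in this DataFrame holds strings, because the values came straight from the page text. If numeric operations such as sorting or averaging are needed, the columns can be converted first. The sketch below uses a two-row sample copied from the output above; `sample_df` and `typed_df` are illustrative names.

```python
import pandas as pd

# Two rows copied from the printed output, kept as strings like the originals.
sample_df = pd.DataFrame({
    "title": ["The Stranger", "Crime and Punishment"],
    "avg_rating": ["4.02", "4.26"],
    "ratings": ["995,651", "841,778"],
    "score": ["8862", "7213"],
    "votes": ["90", "73"],
})

typed_df = sample_df.assign(
    avg_rating=sample_df["avg_rating"].astype(float),
    # remove thousands separators before converting to integers
    ratings=sample_df["ratings"].str.replace(",", "", regex=False).astype("int64"),
    score=sample_df["score"].astype("int64"),
    votes=sample_df["votes"].astype("int64"),
)
print(typed_df.dtypes)
```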


In [41]:
# Save the DataFrame to a CSV file
csv_file = 'philosophical_books.csv'
books_df.to_csv(csv_file, index=False)