## Importing necessary packages

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

pd.set_option('display.max_rows', None)

## Fetching HTML content for Sample Data

Fetching the first page of the list and verifying that it contains 100 book rows.

In [2]:
url = 'https://www.goodreads.com/list/show/1.Best_Books_Ever'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

response = requests.get(url, headers= headers)
soup = BeautifulSoup(response.content, 'html.parser')
sample_book_data = soup.find_all('tr', itemtype= 'http://schema.org/Book')

len(sample_book_data)

100

## Sample Data Extraction

Scraping the first page as a sample and previewing the result.

In [3]:
sample_book_list = []

for book in sample_book_data:
    # Only rows with the full-width details cell carry the fields below.
    if book.find('td', width= '100%') is not None:
        title = book.find('a', class_= 'bookTitle').text.strip()
        author = book.find('a', class_= 'authorName').text.strip()
        ratings = book.find('span', class_= 'greyText smallText uitext').text.strip()
        score = book.find('span', class_= 'smallText uitext').text.strip()

        # Append inside the `if` so a skipped row never re-adds the previous book's values.
        sample_book_list.append({'Title': title, 'Author': author, 'Ratings': ratings, 'Scores': score})

sample_df = pd.DataFrame(sample_book_list)
sample_df.head(10)

Unnamed: 0,Title,Author,Ratings,Scores
0,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,"4.35 avg rating — 9,819,756 ratings","score: 4,317,361,\n and\n43,890 p..."
1,Pride and Prejudice,Jane Austen,"4.29 avg rating — 4,759,518 ratings","score: 2,965,069,\n and\n30,385 p..."
2,To Kill a Mockingbird,Harper Lee,"4.26 avg rating — 6,829,646 ratings","score: 2,601,822,\n and\n26,569 p..."
3,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,"4.50 avg rating — 3,770,203 ratings","score: 2,080,101,\n and\n21,164 p..."
4,The Book Thief,Markus Zusak,"4.39 avg rating — 2,858,353 ratings","score: 1,970,342,\n and\n20,238 p..."
5,"Twilight (The Twilight Saga, #1)",Stephenie Meyer,"3.67 avg rating — 7,260,071 ratings","score: 1,760,620,\n and\n17,948 p..."
6,Animal Farm,George Orwell,"4.02 avg rating — 4,501,567 ratings","score: 1,714,771,\n and\n17,746 p..."
7,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,J.R.R. Tolkien,"4.62 avg rating — 143,641 ratings","score: 1,654,510,\n and\n17,139 p..."
8,The Chronicles of Narnia (The Chronicles of Na...,C.S. Lewis,"4.28 avg rating — 704,889 ratings","score: 1,533,185,\n and\n15,970 p..."
9,The Fault in Our Stars,John Green,"4.12 avg rating — 5,680,897 ratings","score: 1,416,520,\n and\n14,659 p..."
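
The loop above calls `.text` directly on each `find` result, which raises `AttributeError` if an element is missing from a row. A minimal sketch of a defensive helper follows. `text_or_none` is a hypothetical name, not part of BeautifulSoup, and it only assumes that the object passed in exposes a `.text` attribute, as BeautifulSoup tags do.

```python
from types import SimpleNamespace


def text_or_none(tag):
    """Return the stripped text of a tag-like object, or None if the tag is missing.

    Hypothetical helper: `tag` is whatever `book.find(...)` returned,
    i.e. an object with a `.text` attribute, or None when nothing matched.
    """
    if tag is None:
        return None
    return tag.text.strip()


# Stand-in for a BeautifulSoup tag, so the sketch runs without network access.
fake_tag = SimpleNamespace(text="  Pride and Prejudice \n")
print(text_or_none(fake_tag))
print(text_or_none(None))
```

In the loop, `title = text_or_none(book.find('a', class_= 'bookTitle'))` would then yield `None` for a missing element instead of stopping the scrape.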


## Data Extraction

Scraping all 100 pages in a for loop with basic error handling, pausing for 5 seconds after each page. Each book is appended to a list called 'book_list', which is then loaded into a DataFrame.

In [4]:
book_list = []

for page in range(1, 101):
    url = f'https://www.goodreads.com/list/show/1.Best_Books_Ever?page={page}'
    try:
        response = requests.get(url, headers= headers)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        book_data = soup.find_all('tr', itemtype= 'http://schema.org/Book')

        for book in book_data:
            # Only rows with the full-width details cell carry the fields below.
            if book.find('td', width= '100%') is not None:
                title = book.find('a', class_= 'bookTitle').text.strip()
                author = book.find('a', class_= 'authorName').text.strip()
                ratings = book.find('span', class_= 'greyText smallText uitext').text.strip()
                score = book.find('span', class_= 'smallText uitext').text.strip()

                # Append inside the `if` so a skipped row never re-adds the previous book's values.
                book_list.append({'Title': title, 'Author': author, 'Ratings': ratings, 'Scores': score})

    except requests.exceptions.RequestException as exc:
        print(f"Error fetching page {page}: {exc}")
    time.sleep(5)

df = pd.DataFrame(book_list)
len(df)

10000
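
The loop above prints an error and moves on when a request fails, so the books on that page would be silently missing from the result. One possible refinement is to retry a failed page a few times with a growing delay. The sketch below is an assumption, not part of the original notebook: `fetch_with_retry`, its parameters, and the fake fetcher are hypothetical names, and `fetch` stands in for a small wrapper around `requests.get`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url), retrying on failure with a doubling delay.

    Hypothetical helper: `fetch` is any callable that returns the page body
    or raises an exception. Returns None if every attempt fails.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:  # a real scraper would catch a narrower exception type
            print(f"Attempt {attempt + 1} for {url} failed: {exc}")
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # 5 s, 10 s, 20 s, ...
    return None


# Demonstration with a fake fetcher that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


# Record the requested delays instead of actually sleeping.
delays = []
result = fetch_with_retry(flaky_fetch, "https://example.com/page", sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demonstration instant. In the scraper itself the default `time.sleep` would apply.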

## Data Preview

Copying the original DataFrame so that it stays intact as a backup, then previewing the copy, 'dup_df'.

In [16]:
dup_df = df.copy()
dup_df.head(10)

Unnamed: 0,Title,Author,Ratings,Scores
0,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,"4.35 avg rating — 9,819,756 ratings","score: 4,317,361,\n and\n43,890 p..."
1,Pride and Prejudice,Jane Austen,"4.29 avg rating — 4,759,518 ratings","score: 2,965,069,\n and\n30,385 p..."
2,To Kill a Mockingbird,Harper Lee,"4.26 avg rating — 6,829,646 ratings","score: 2,601,822,\n and\n26,569 p..."
3,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,"4.50 avg rating — 3,770,203 ratings","score: 2,080,101,\n and\n21,164 p..."
4,The Book Thief,Markus Zusak,"4.39 avg rating — 2,858,353 ratings","score: 1,970,342,\n and\n20,238 p..."
5,"Twilight (The Twilight Saga, #1)",Stephenie Meyer,"3.67 avg rating — 7,260,071 ratings","score: 1,760,620,\n and\n17,948 p..."
6,Animal Farm,George Orwell,"4.02 avg rating — 4,501,567 ratings","score: 1,714,771,\n and\n17,746 p..."
7,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,J.R.R. Tolkien,"4.62 avg rating — 143,641 ratings","score: 1,654,510,\n and\n17,139 p..."
8,The Chronicles of Narnia (The Chronicles of Na...,C.S. Lewis,"4.28 avg rating — 704,889 ratings","score: 1,533,185,\n and\n15,970 p..."
9,The Fault in Our Stars,John Green,"4.12 avg rating — 5,680,897 ratings","score: 1,416,520,\n and\n14,659 p..."


## Data Cleanup

Cleaning the 'Title' column by removing double quotes and trimming surrounding whitespace.

In [17]:
dup_df['Title'] = dup_df['Title'].str.replace('"', '', regex = False).str.strip()

Removing the rating-label prefixes ('really liked it', 'it was amazing') from the 'Ratings' column.

In [18]:
dup_df['Ratings'] = (dup_df['Ratings']
                    .str.replace('really liked it ', '', regex= False)
                    .str.replace('it was amazing ', '', regex= False))

Creating two new columns, 'Avg_Rating' & 'Total_Ratings', by splitting the 'Ratings' column.

In [19]:
dup_df['Avg_Rating'] = dup_df['Ratings'].str.split().str[0]
dup_df['Total_Ratings'] = dup_df['Ratings'].str.split().str[4]
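
Splitting on whitespace and picking positions 0 and 4 works only because the label prefixes were removed in the previous cell. A leftover prefix such as 'really liked it' would shift every token by three places. As an alternative, a regular expression can locate the two numbers wherever they sit in the string. This is a sketch under that assumption. `RATINGS_RE` and `parse_ratings` are hypothetical names, and the pattern is based only on the string shape visible in the preview.

```python
import re

# Matches strings shaped like "4.35 avg rating — 9,819,756 ratings".
# Text before the first number is ignored; the separator may be any non-space token.
RATINGS_RE = re.compile(r"(\d+(?:\.\d+)?)\s+avg rating\s+\S+\s+([\d,]+)\s+rating")


def parse_ratings(text):
    """Return (average rating, total ratings), or None if the text does not match."""
    match = RATINGS_RE.search(text)
    if match is None:
        return None
    avg = float(match.group(1))
    total = int(match.group(2).replace(",", ""))  # drop thousands separators
    return avg, total


print(parse_ratings("4.35 avg rating — 9,819,756 ratings"))
print(parse_ratings("really liked it 4.29 avg rating — 4,759,518 ratings"))
print(parse_ratings("no rating details"))
```

Returning `None` for unmatched text makes malformed rows easy to find before the type conversion step.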

Creating two more columns, 'Score' & 'Votes', by splitting the 'Scores' column.

In [20]:
dup_df['Score'] = dup_df['Scores'].str.split().str[1]
dup_df['Votes'] = dup_df['Scores'].str.split().str[3]
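
To see why positions 1 and 3 are the right ones, it can help to split a single 'Scores' value by hand. The sketch below uses the first value from the preview. The preview truncates the string after "p", so the sample stops there too and nothing beyond that point is relied on.

```python
# One raw 'Scores' value as shown in the preview (truncated after "p").
scores = "score: 4,317,361,\n and\n43,890 p"

# str.split() with no arguments splits on any run of whitespace, newlines included.
tokens = scores.split()
print(tokens)

# Position 1 still carries a trailing comma, which the later
# comma removal strips along with the thousands separators.
score = int(tokens[1].replace(",", ""))
votes = int(tokens[3].replace(",", ""))
print(score, votes)
```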

Dropping the 'Ratings' & 'Scores' columns, since their data has already been split and stored.

In [21]:
dup_df.drop(columns= ['Ratings', 'Scores'], inplace= True)
dup_df.head()

Unnamed: 0,Title,Author,Avg_Rating,Total_Ratings,Score,Votes
0,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,4.35,9819756,4317361,43890
1,Pride and Prejudice,Jane Austen,4.29,4759518,2965069,30385
2,To Kill a Mockingbird,Harper Lee,4.26,6829646,2601822,26569
3,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,4.5,3770203,2080101,21164
4,The Book Thief,Markus Zusak,4.39,2858353,1970342,20238


Removing the ',' separators and setting appropriate data types for the numeric columns.

In [22]:
dup_df['Avg_Rating'] = dup_df['Avg_Rating'].astype('float')
dup_df['Total_Ratings'] = dup_df['Total_Ratings'].str.replace(',', '').astype('int')
dup_df['Score'] = dup_df['Score'].str.replace(',', '').astype('int')
dup_df['Votes'] = dup_df['Votes'].str.replace(',', '').astype('int')

dup_df.dtypes

Title             object
Author            object
Avg_Rating       float64
Total_Ratings      int64
Score              int64
Votes              int64
dtype: object

## Cleaned DataFrame

Creating a new DataFrame, 'cleaned_df', as the cleaned output with the columns in their final order.

In [23]:
cleaned_df = dup_df[['Title', 'Author', 'Avg_Rating', 'Total_Ratings', 'Votes', 'Score']].copy()

## Final Output

This is the final cleaned DataFrame.

In [24]:
print(cleaned_df.shape)
cleaned_df.head(10)

(10000, 6)


Unnamed: 0,Title,Author,Avg_Rating,Total_Ratings,Votes,Score
0,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,4.35,9819756,43890,4317361
1,Pride and Prejudice,Jane Austen,4.29,4759518,30385,2965069
2,To Kill a Mockingbird,Harper Lee,4.26,6829646,26569,2601822
3,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,4.5,3770203,21164,2080101
4,The Book Thief,Markus Zusak,4.39,2858353,20238,1970342
5,"Twilight (The Twilight Saga, #1)",Stephenie Meyer,3.67,7260071,17948,1760620
6,Animal Farm,George Orwell,4.02,4501567,17746,1714771
7,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,J.R.R. Tolkien,4.62,143641,17139,1654510
8,The Chronicles of Narnia (The Chronicles of Na...,C.S. Lewis,4.28,704889,15970,1533185
9,The Fault in Our Stars,John Green,4.12,5680897,14659,1416520


## Exporting as CSV File

Exporting the output DataFrame, 'cleaned_df', as a CSV file named 'best_books_ever.csv'.

In [None]:
cleaned_df.to_csv('best_books_ever.csv', index= False, encoding= 'utf-8')
print("Created 'best_books_ever.csv' successfully")

Created 'best_books_ever.csv' successfully
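
Several titles contain commas, so the export depends on CSV quoting to keep each title in one field. The sketch below illustrates that round trip with the standard-library `csv` module instead of pandas, writing to an in-memory buffer rather than to 'best_books_ever.csv'. The two rows are taken from the preview above, and the variable names are illustrative.

```python
import csv
import io

rows = [
    {"Title": "The Hunger Games (The Hunger Games, #1)", "Author": "Suzanne Collins", "Avg_Rating": 4.35},
    {"Title": "Pride and Prejudice", "Author": "Jane Austen", "Avg_Rating": 4.29},
]

# Write the rows as CSV text; fields containing commas are quoted automatically.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Title", "Author", "Avg_Rating"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())

# Read the text back and confirm the comma-containing title survived as one field.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]["Title"])
```

Values come back as strings when read this way, so numeric columns need converting again after loading.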
