****Project Brief****
Problem: Choice Overload. Users waste more time searching than reading because of decision paralysis and a "winner-takes-all" market.
Solution: A context-aware engine that replaces popularity bias with situational matching (mood, time, environment e.g. bedtime, 10mins, long holiday/travelling).
Goal: Reduce decision fatigue and unlock the "Long Tail" of publishing/giving niche books the spotlight while helping readers find the perfect book for their current moment.

****Goodreads Webscraping****
Book data required 
- Genre 
- Title 
- Author
- Rating
- Rating counts 
- Description 
- Page numbers 
- ISBN
- Language 
- Published Year 
- Book Cover Image 
- Link to the book 

****Open Library API***
Identifiers: ISBN-13
Physical Specs: Number of pages, physical dimensions, weight, and binding type (Hardcover, mass-market paperback, etc.).

Publishing Info: Publisher name, specific publication date, and series name.

Table of Contents: Often includes a full list of chapters (a feature many other APIs lack).

3. The "Author" Layer
Open Library treats authors as distinct entities with their own metadata.

Biographical Data: Full name, birth/death dates, and a biography.

Identifiers: Links to external authority files like VIAF, Wikidata, and Library of Congress ID.

Photos: Portraits of the author when available.

4. Digital & Community Data
Because Open Library is part of the Internet Archive, it includes unique "living" data:

Availability: Data on whether an eBook version is available to borrow, read online, or download.

Community Activity: User-generated Reading Logs (Want to Read, Currently Reading, Have Read), public Book Lists, and user ratings.

Revision History: Every single change made to a record is stored, meaning you can access previous "versions" of a book's data.

In [24]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

In [6]:
#Checking if the webscraping works 
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
r = requests.get("https://www.goodreads.com/", headers=headers)
print (r.status_code)

200


soup = BeautifulSoup(r.text, 'html.parser' ) 
print (soup.prettify())

In [None]:
#create the dictionary of genre list 
genres_list = {}
for a in soup.select("div a.gr-hyperlink href=genres/art"):

In [21]:
genre_art = []

for a in soup.select('a.gr-hyperlink[href="/genres/art"]'):
    text = a.get_text(strip=True)
    
    if text:
        genre_art.append(text)

print(genre_art)

['Art']


#genre_art = []
#----------------------------
#for a in soup.select('a.gr-hyperlink[href="/genres/art"]'):
#text = a.get_text(strip=True)
#if text:
   #     genre_art.append(text)

#print(genre_art)

In [22]:
#Best Book Ever List from GoodReads 
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
r = requests.get("https://www.goodreads.com/list/show/1.Best_Books_Ever", headers=headers)
print (r.status_code)

200


In [None]:
#Prettyfing the Best Book Ever Page 
soup = BeautifulSoup(r.text, 'html.parser' ) 
#print (soup.prettify())

In [26]:
#Best Book Ever List from GoodReads 
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
r = requests.get("https://www.goodreads.com/book/show/2767052-the-hunger-games", headers=headers)
print (r.status_code)

200


In [None]:
#Prettifying Hanger Games  
soup = BeautifulSoup(r.text, 'html.parser' ) 
#print (soup.prettify())

In [None]:
#Testing to see if we can scrape the title only, it did not return result 
results_list = [] 
for a in soup.select('h1.Text_title1'):
    text = a.get_text(strip=True)

    results_list.append(text)

print(results_list)

In [29]:
#Web Scraping of Best Book Ever List 
# Base URL for the list
the_hunger_game = {}

base_url = "https://www.goodreads.com/list/show/1.Best_Books_Ever"
page_to_scrape = 1  # Starting page

def scrape_book_details(book_url):
    # This simulates "clicking" into the book
    full_url = "https://www.goodreads.com" + book_url
    response = requests.get(full_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    #Title 
    title = soup.find('h1', {'class': 'Text Text_title1'}) #this was the wrong position 
    return title.text.strip() if title else "No title"

# Pagination Loop
while page_to_scrape <= 2:  # Let's just do 2 pages for this example
    print(f"--- Scraping Page {page_to_scrape} ---")
    params = {'page': page_to_scrape}
    response = requests.get(base_url, params=params)
    soup = BeautifulSoup(response.text, 'html.parser')

    # 1. Find all book links on the list page
    book_links = soup.find_all('a', class_='bookTitle')

    for link in book_links:
        relative_url = link['href']
        title = link.find('span').text
        print(f"Clicking into: {title}")
        
        # 2. Go inside the book page
        desc = scrape_book_details(relative_url)
        print(f"Description found: {desc[:50]}...")
        
        # Respectful delay so you don't get banned
        all_books.append(the_hunger_game)
        time.sleep(1)

    page_to_scrape += 1

the_hunger_game

--- Scraping Page 1 ---
--- Scraping Page 2 ---


{}

In [None]:
#Individual Page 
https://www.goodreads.com/book/show/2767052-the-hunger-games

In [None]:
books = pd.DataFrame ({})

In [3]:
#550 Fiction Books from Open Library API 
import requests
import pandas as pd
import time

open_lib = []
target_count = 550
page = 1

#putting all the column names, so to avoid empty cells (only return the data when its not empty) 
fields = "title,author_name,subject,ratings_average,ratings_count,first_sentence,number_of_pages_median,isbn,language,first_publish_year,cover_i,key"

print("starting Data Collection...")

while len(open_lib) < target_count:
    url = f"https://openlibrary.org/search.json?subject=fiction&fields={fields}&page={page}&limit=550" #genre is fiction here
    
    headers = {'User-Agent': 'bookrecommendation/1.0 (example@email.com)'} #this gives an user agent to make the access clear its human 
    
    try:
        response = requests.get(url, headers=headers, timeout=10) #if there is not response in 10seconds, show connection error 
    except requests.exceptions.RequestException as e: 
        print(f"❌ Connection error: {e}")
        break

    if response.status_code == 200: #if status code is 200, proceed 
        json_response = response.json()
        data = json_response.get('docs', []) #'docs' is where the book list is : "Each document specified listed in "docs"
        
        if not data:
            print("No more data available.")
            break

        for item in data:
            if len(open_lib) >= target_count:
                break

            # 2. Extracting data with safe defaults
            # Note: first_sentence is often a list, so we handle that specifically
            desc = item.get('first_sentence')
            description = desc[0] if isinstance(desc, list) else "No description available"

            row = {
                'Title': item.get('title'),
                'Author': item.get('author_name', ['N/A'])[0], #"Get the list of authors. If there are none, use ['N/A']. Then, just take the 1st one ([0])."
                'Genre': ", ".join(item.get('subject', [])[:3]), #Subjects are also lists. This takes the first 3 items ([:3]) unified with a comma.
                'Rating_Average': item.get('ratings_average'),
                'Rating_Counts': item.get('ratings_count'),
                'Description': description,
                'Page_Numbers': item.get('number_of_pages_median'), #OP has median of all versions e.g. hard/paper cover 
                'ISBN': item.get('isbn', ['N/A'])[0],
                'Language': item.get('language', ['N/A'])[0],
                'Published_Year': item.get('first_publish_year'),
                'Cover_URL': f"https://covers.openlibrary.org/b/id/{item.get('cover_i')}-L.jpg" if item.get('cover_i') else None,
                'Book_Link': f"https://openlibrary.org{item.get('key')}"
            }
        
            open_lib.append(row) #appending retrieved data to the list 
        
        print(f"Page {page} processed. Total items collected: {len(open_lib)}") #number of page collected 
        page += 1 #increment the page one by one 
        time.sleep(1) #sleep for 1 sec per page loading 
        
    elif response.status_code == 429: #if there is some distruption, wait for 20secs 
        print("Waiting 20 seconds...")
        time.sleep(20)
    else:
        print(f"❌ Error {response.status_code}. Stopping.")
        break

#converting to dataframe
op_fic_df = pd.DataFrame(open_lib)

op_fic_df.dropna(subset=['Title'], inplace=True) #if title does not exist, drop 

print("Collection Complete!")

op_fic_df.to_csv('Open_Library_Fiction_550.csv', index=False)

starting Data Collection...
Page 1 processed. Total items collected: 550
Collection Complete!


In [5]:
op_fic_df.head(10)

Unnamed: 0,Title,Author,Genre,Rating_Average,Rating_Counts,Description,Page_Numbers,ISBN,Language,Published_Year,Cover_URL,Book_Link
0,The Iron Heel,Jack London,"Revolutions, fiction, Oligarchy, fiction, Utop...",3.5,20.0,"THE SOFT summer wind stirs the redwood, and Wi...",287,1548921947,eng,1907,https://covers.openlibrary.org/b/id/8243314-L.jpg,https://openlibrary.org/works/OL74502W
1,Игрокъ,Фёдор Михайлович Достоевский,"Translations into English, Continental europea...",3.714286,7.0,At length I returned from two weeks leave of a...,191,9798630153272,eng,1900,https://covers.openlibrary.org/b/id/3293339-L.jpg,https://openlibrary.org/works/OL166923W
2,Brood of the Witch-Queen,Sax Rohmer,"Fantasy, Fiction, Horror",4.6,5.0,No description available,206,1985805111,eng,1924,https://covers.openlibrary.org/b/id/2011286-L.jpg,https://openlibrary.org/works/OL2288676W
3,The Napoleon of Notting Hill,Gilbert Keith Chesterton,"Fiction, Classic Literature, Science Fiction",4.571429,7.0,No description available,160,1445508265,eng,1904,https://covers.openlibrary.org/b/id/6980094-L.jpg,https://openlibrary.org/works/OL76473W
4,Emily of New Moon,Lucy Maud Montgomery,"Juvenile fiction, Child authors, Orphans",3.666667,9.0,"The house in the hollow was ""a mile from anywh...",339,1548912425,eng,1923,https://covers.openlibrary.org/b/id/14638065-L...,https://openlibrary.org/works/OL77781W
5,The Vampyre,John William Polidori,"Fiction, Incest, Vampires",3.583333,12.0,No description available,55,1517322030,fre,1819,https://covers.openlibrary.org/b/id/4871002-L.jpg,https://openlibrary.org/works/OL3625242W
6,The Sea Fairies,L. Frank Baum,"Children's stories, Juvenile fiction, Classic ...",3.333333,3.0,"""Nobody,"" said Cap'n Bill solemnly, ""ever sawr...",130,9781098604226,rus,1911,https://covers.openlibrary.org/b/id/1814237-L.jpg,https://openlibrary.org/works/OL262384W
7,Lilith,George MacDonald,"Fiction, romance, fantasy, Fiction, general, N...",3.2,5.0,"I had just finished my studies at Oxford, and ...",270,1406530042,fre,1895,https://covers.openlibrary.org/b/id/14364546-L...,https://openlibrary.org/works/OL15437W
8,The Enchanted Castle,Edith Nesbit,"Fiction, Fantasy, Magic",4.181818,11.0,"There were three of them-Jerry, Jimmy, and Kat...",186,9781548958534,spa,1907,https://covers.openlibrary.org/b/id/6644514-L.jpg,https://openlibrary.org/works/OL99541W
9,Pollyanna,Eleanor Hodgman Porter,"Aunts, Cheerfulness, Classic Literature",3.809524,21.0,Miss Polly Harrington entered her kitchen a li...,194,1532762569,eng,1912,https://covers.openlibrary.org/b/id/902113-L.jpg,https://openlibrary.org/works/OL2775807W
