**Data Cleaning for Books Dataset**
---

**1) Install and Import Libraries**

**Install Dependencies**

In [116]:
pip install numpy pandas matplotlib seaborn python-dotenv langchain langchain-community chromadb langchain-openai transformers gradio flax tensorflow torch

Note: you may need to restart the kernel to use updated packages.


**Import Libraries**

In [118]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from dotenv import load_dotenv
from langchain.chains import LLMChain
from langchain_community.vectorstores import Chroma
from transformers import pipeline
import gradio as gr
from transformers import AutoModel, AutoTokenizer

In [119]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

**2) Load Data**

In [121]:
books = pd.read_csv("datasets/books.csv", on_bad_lines="skip")

**3) Summarise Data**

In [123]:
books.sample(5)

Unnamed: 0,isbn13,isbn10,title,subtitle,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count
4604,9780756400613,0756400619,The Serpent's Shadow,,Mercedes Lackey,Fiction,http://books.google.com/books/content?id=PAHin...,From the magical mysteries of India to the gas...,2002.0,3.99,394.0,8533.0
1047,9780143037613,0143037617,Bleak House,,Charles Dickens;Nicola Bradbury,Fiction,http://books.google.com/books/content?id=L-xBP...,The English equity court of the nineteenth-cen...,2006.0,4.01,1017.0,77515.0
1880,9780345463098,0345463099,Yoda,Dark Rendezvous,Sean Stewart,Fiction,http://books.google.com/books/content?id=rdgi7...,"As the Clone Wars rage on, Yoda receives a mes...",2004.0,3.88,329.0,3889.0
267,9780060786731,0060786736,A False Mirror,An Inspector Ian Rutledge Mystery,Charles Todd,Fiction,http://books.google.com/books/content?id=MMG3b...,Summoned to a small harbor town by a former tr...,2007.0,4.02,384.0,2354.0
4491,9780745950549,074595054X,The Gifts of the Jews,How a Tribe of Desert Nomads Changed the Way E...,Thomas Cahill,Bible,http://books.google.com/books/content?id=35dHP...,"The bestselling author of ""How the Irish Saved...",2001.0,3.87,304.0,3066.0


In [124]:
books.shape

(6810, 12)

*The dataset has 6,810 rows and 12 columns.*

**Check Column Titles**

In [127]:
books.columns

Index(['isbn13', 'isbn10', 'title', 'subtitle', 'authors', 'categories',
       'thumbnail', 'description', 'published_year', 'average_rating',
       'num_pages', 'ratings_count'],
      dtype='object')

**Check Data Types**

In [129]:
books.dtypes

isbn13              int64
isbn10             object
title              object
subtitle           object
authors            object
categories         object
thumbnail          object
description        object
published_year    float64
average_rating    float64
num_pages         float64
ratings_count     float64
dtype: object

**Check for Nulls**

In [131]:
books.isnull().sum()

isbn13               0
isbn10               0
title                0
subtitle          4429
authors             72
categories          99
thumbnail          329
description        262
published_year       6
average_rating      43
num_pages           43
ratings_count       43
dtype: int64

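**Convert published_year to an integer type for analysis**

In [133]:
# Caution: fillna(0) stores the 6 missing years as 0, so a later dropna on
# 'published_year' will not remove those rows.
books['published_year'] = books['published_year'].fillna(0).astype(int)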
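
One way to keep missing years visible after the conversion is pandas' nullable integer dtype. The sketch below uses a small made-up frame (`demo` and its values are assumptions, not rows from books.csv) and is an alternative to filling with 0, not what the notebook itself runs.

```python
import pandas as pd

# Toy frame standing in for the books data (hypothetical values).
demo = pd.DataFrame({"published_year": [2004.0, None, 1993.0]})

# "Int64" (capital I) is pandas' nullable integer dtype: whole-number
# floats convert cleanly and missing entries become <NA> instead of 0.
demo["published_year"] = demo["published_year"].astype("Int64")

# Missing years are still visible to isnull() and dropna().
n_missing = int(demo["published_year"].isnull().sum())
cleaned = demo.dropna(subset=["published_year"])
print(n_missing, len(cleaned))
```

With this dtype, a later `dropna(subset=['published_year'])` would actually remove the rows that have no year.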

**Handling Null Values** 
--

**Drop columns we do not intend to use, and drop rows whose missing entries cannot be recovered**

In [136]:
books = (
    books
    .drop(columns=['isbn10','subtitle'])
    .dropna(subset=['authors', 'categories', 'description','published_year'])
)

**Fill missing average ratings, ratings counts and page counts with column means to preserve rows**

In [138]:
books = books.fillna({
    'average_rating': books['average_rating'].mean(),
    'num_pages': books['num_pages'].mean(),
    'ratings_count': books['ratings_count'].mean()
})
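
To make the mean fill concrete, here is a small sketch on made-up data (`demo` and its values are assumptions). It shows that `mean()` ignores missing entries by default, so each fill value is the average of the observed values only.

```python
import pandas as pd

# Toy frame with one missing value per column (hypothetical values).
demo = pd.DataFrame({
    "average_rating": [4.0, None, 3.0],
    "num_pages": [100.0, 300.0, None],
})

# mean() skips NaN by default, so each fill value is the mean of the
# observed entries in that column.
fill_values = {col: demo[col].mean() for col in ["average_rating", "num_pages"]}

# Passing a dict to fillna fills each column with its own value.
filled = demo.fillna(fill_values)
print(filled)
```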

**Check Nulls**

In [140]:
books.isnull().sum()

isbn13              0
title               0
authors             0
categories          0
thumbnail         199
description         0
published_year      0
average_rating      0
num_pages           0
ratings_count       0
dtype: int64

In [141]:
books.sample(5)

Unnamed: 0,isbn13,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count
1329,9780237525378,Oliver Twist,Pauline Francis;Charles Dickens,Juvenile Nonfiction,http://books.google.com/books/content?id=X6RvT...,This wonderful series is a quick way into a ra...,2003,3.66,48.0,92.0
5328,9780826415745,Anita Diamant's The Red Tent,Ann Finding,Literary Criticism,http://books.google.com/books/content?id=pSGfo...,Continuum Contemporaries give readers accessib...,2004,4.13,88.0,439.0
3288,9780520072381,The Responsibility of Forms,Roland Barthes,Literary Criticism,http://books.google.com/books/content?id=XTLuw...,These late essays of Roland Barthes's are conc...,1991,3.94,320.0,92.0
2476,9780394757650,The Simple Art of Murder,Raymond Chandler,Fiction,http://books.google.com/books/content?id=yj2fm...,An essay on detective fiction accompanies eigh...,1950,4.16,384.0,4794.0
4647,9780761501664,The Wealthy Barber,David Chilton,"Finance, Personal",http://books.google.com/books/content?id=RFqve...,In this new and expanded edition of one of the...,1996,4.02,199.0,61.0


**Transforming Data**
--

In [143]:
books["categories"].unique()

array(['Fiction', 'Detective and mystery stories', 'American fiction',
       'Christian life', 'Authors, English', 'Africa, East',
       'Hyland, Morn (Fictitious character)', 'Adventure stories',
       'Arthurian romances', 'Fantasy fiction', 'English drama',
       'Country life', 'English fiction', 'Clergy',
       'Aubrey, Jack (Fictitious character)',
       'Detective and mystery stories, English', 'Black Death',
       'Human cloning', 'Science fiction', 'Great Britain',
       'American essays', 'China', 'Capitalism', 'Ireland',
       'Juvenile Fiction', "Children's stories, English",
       'Male friendship', 'Literary Collections',
       'Beresford, Tommy (Fictitious character)',
       'Imaginary wars and battles', 'Dysfunctional families',
       'Poirot, Hercule (Fictitious character)', 'Christmas stories',
       'Marple, Jane (Fictitious character)', 'Belgians',
       'Battle, Superintendent (Fictitious character)',
       'Baggins, Frodo (Fictitious character)', '

**Transform the categories column using the OpenAI API: the current DataFrame has far too many distinct categories, so we reduce them to around 10-15 standard genres**

**Set the API key - INSERT YOUR API_KEY HERE**

In [146]:
import configy
from openai import OpenAI

In [147]:
client = OpenAI(api_key=configy.OPENAI_AI_KEY)
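
As an alternative to a local config module, the key can be read from an environment variable. The helper below is a minimal sketch: `read_api_key` and the variable name `OPENAI_API_KEY` are assumptions for illustration, not part of the original notebook.

```python
import os

def read_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable, or fail loudly."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

The client could then be created with `OpenAI(api_key=read_api_key())`. If the key lives in a `.env` file, calling `load_dotenv()` (already imported above) before this helper should make it available.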

**Prompt engineering: a function that takes a list of category names and asks GPT to label each one with a standard genre**

In [149]:
import json

# Function to map categories to genres
def map_categories_to_genres(categories_chunk):
    """Ask the model to map each category in the chunk to one standard genre.

    Returns a dict of {original category: genre}. json.loads raises a
    ValueError if the reply is not valid JSON.
    """
    prompt = f"""
You are an expert librarian. I will give you a list of book categories. Please map each category to ONE of the following standardized genres:

Genres:
1. History
2. Romance
3. Mystery/Thriller
4. Science Fiction/Fantasy
5. Biography/Memoir
6. Self-Help
7. Religion
8. Science/Technology
9. Philosophy
10. Poetry
11. Art
12. Children's
13. Other

Return a JSON dictionary where keys are the original categories and values are the mapped genres.

Categories:
{categories_chunk}
"""
    # temperature=0 keeps the labelling as repeatable as possible
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You categorize book subjects."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    # The reply is expected to be a bare JSON object
    genre_mapping = json.loads(response.choices[0].message.content.strip())
    return genre_mapping
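
Chat models sometimes wrap a JSON reply in a Markdown code fence, which makes a plain `json.loads` call fail. The helper below is a sketch of a more tolerant parser. `parse_json_mapping` is a hypothetical name, and the fence handling rests on an assumption about how such replies tend to look, not on documented API behaviour.

```python
import json

FENCE = "`" * 3  # the three-backtick Markdown fence marker

def parse_json_mapping(raw_text):
    """Parse a JSON object from a model reply, tolerating a Markdown code fence."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag)
        # and everything from the closing fence onwards.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    mapping = json.loads(text)
    if not isinstance(mapping, dict):
        raise ValueError("Expected a JSON object mapping categories to genres.")
    return mapping
```

It could replace the direct `json.loads(...)` call inside the mapping function if fenced replies turn out to be a problem.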


**This builds a list of the unique category names in the categories column, skipping any missing values. We want every distinct category so that each one is mapped to a standard genre only once.
If we processed the full column row by row, we would send many duplicates and waste time and tokens.**

In [151]:
unique_categories = books["categories"].dropna().unique().tolist()

**This code splits the list of unique categories into batches of up to 200, sends each batch to the model to get genre mappings, and then combines all the results into one dictionary.**

In [153]:
import math

chunk_size = 200
num_chunks = math.ceil(len(unique_categories) / chunk_size)

all_mappings = {}

for i in range(num_chunks):
    chunk = unique_categories[i*chunk_size:(i+1)*chunk_size]
    mapping = map_categories_to_genres(chunk)
    all_mappings.update(mapping)
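
The batching logic can be checked without spending tokens by swapping in a stand-in for the API call. In this sketch, `chunked`, `fake_mapper`, and the generated category names are all hypothetical. It also shows a simple check for categories the mapper did not return.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fake_mapper(chunk):
    # Hypothetical stand-in for map_categories_to_genres: labels everything "Other".
    return {category: "Other" for category in chunk}

categories = [f"category {i}" for i in range(450)]
merged = {}
batch_sizes = []
for chunk in chunked(categories, 200):
    batch_sizes.append(len(chunk))
    merged.update(fake_mapper(chunk))

# Any category missing from the merged mapping was dropped by the mapper.
missing = [c for c in categories if c not in merged]
print(batch_sizes, len(merged), len(missing))
```

Running the same `missing` check against `all_mappings` after the real loop would reveal categories the model skipped or renamed.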

**We create a new column called genre in our books table. For each row, we look up the category in our big mapping dictionary all_mappings to get its genre. If we don’t find a match, we fill in “Other” instead. This way, every book gets a clean genre label.**
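
A minimal sketch of that lookup, using a small made-up frame and mapping in place of `books` and `all_mappings` (both `demo` and `demo_mappings` are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the books table and the model's mapping (hypothetical).
demo = pd.DataFrame({"categories": ["Fiction", "Philosophy", "Clergy"]})
demo_mappings = {"Fiction": "Other", "Philosophy": "Philosophy"}

# map() looks each category up in the dict; categories with no entry
# become NaN, which fillna then replaces with "Other".
demo["genre"] = demo["categories"].map(demo_mappings).fillna("Other")
print(demo)
```

On the real data the same pattern would read `books["genre"] = books["categories"].map(all_mappings).fillna("Other")`.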

In [155]:
books.sample(5)

Unnamed: 0,isbn13,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count
1098,9780156010597,All the Names,José Saramago;Margaret Jull Costa,Fiction,http://books.google.com/books/content?id=2Ln6j...,When a drone in the Central Registry discovers...,2001,3.89,245.0,10634.0
5044,9780805081459,Travels in the Scriptorium,Paul Auster,Fiction,http://books.google.com/books/content?id=6MzR9...,An elderly man awakens disoriented in an unfam...,2007,3.23,145.0,6492.0
4994,9780802829665,A Time to Embrace,William Stacy Johnson,Political Science,http://books.google.com/books/content?id=i9AQW...,As rhetoric continues to heat up on both sides...,2006,4.2,330.0,44.0
3272,9780517219027,Three complete novels,Stephen King,Fiction,http://books.google.com/books/content?id=V-cOA...,Provides three of the author's early horror ta...,2002,4.53,1096.0,12320.0
5831,9781402726620,Gulliver's Travels,Martin Woodside;Jonathan Swift;Jamel Akib,Juvenile Fiction,http://books.google.com/books/content?id=qk8ps...,An abridged version of the voyages of an eight...,2006,3.93,160.0,724.0


**Export**

In [157]:
books.to_csv("cleaned_books.csv", index=False, encoding="utf-8")

In [158]:
books

Unnamed: 0,isbn13,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count
0,9780002005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004,3.85,247.0,361.0
1,9780002261982,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000,3.83,241.0,5164.0
2,9780006163831,The One Tree,Stephen R. Donaldson,American fiction,http://books.google.com/books/content?id=OmQaw...,Volume Two of Stephen Donaldson's acclaimed se...,1982,3.97,479.0,172.0
3,9780006178736,Rage of angels,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993,3.93,512.0,29532.0
4,9780006280897,The Four Loves,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002,4.15,170.0,33684.0
...,...,...,...,...,...,...,...,...,...,...
6803,9788173031014,Journey to the East,Hermann Hesse,Adventure stories,http://books.google.com/books/content?id=rq6JP...,This book tells the tale of a man who goes on ...,2002,3.70,175.0,24.0
6804,9788179921623,The Monk Who Sold His Ferrari: A Fable About F...,Robin Sharma,Health & Fitness,http://books.google.com/books/content?id=c_7mf...,"Wisdom to Create a Life of Passion, Purpose, a...",2003,3.82,198.0,1568.0
6805,9788185300535,I Am that,Sri Nisargadatta Maharaj;Sudhakar S. Dikshit,Philosophy,http://books.google.com/books/content?id=Fv_JP...,This collection of the timeless teachings of o...,1999,4.51,531.0,104.0
6808,9789027712059,The Berlin Phenomenology,Georg Wilhelm Friedrich Hegel,History,http://books.google.com/books/content?id=Vy7Sk...,Since the three volume edition ofHegel's Philo...,1981,0.00,210.0,0.0
