# About 
Here, I intend to build a LLM-based engine to retrieve the txt book link for 'book of interest'.
 - In this notebook, I first prepare an sql database.
 - The input csv file is from "https://www.gutenberg.org/ebooks/offline_catalogs.html#xmlrdf" (Section "XML/RDF/CSV")

# 1. Settings

### Packages

In [1]:
import os
import pandas as pd


# sql related
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database

# llama index
from llama_index.core import SQLDatabase



### Variables

In [2]:
#----------------#
# variables that require changes
#----------------#
llm_model_id ="llama2"

# sql table name 
sql_table_name = 'catalog_table'

### Directories

In [3]:
main_Dir = "../"

#----------------#
# data dir
#----------------#
data_Dir = os.path.join(main_Dir,"data")
raw_data_Dir = os.path.join(data_Dir,"raw")
processed_data_Dir = os.path.join(data_Dir,"processed")
sql_data_Dir=os.path.join(data_Dir,"sql")



#----------------#
# make dirs
#----------------#
for f in [data_Dir, raw_data_Dir, processed_data_Dir, sql_data_Dir]:
    os.makedirs(f, exist_ok=True)

# 2.Setup a sql database 

## 2.1 Preprocess data

### Exploratory Data Analysis (EDA)

In [4]:
filename="pg_catalog.csv"
filepath= os.path.join(raw_data_Dir, filename)
data = pd.read_csv(filepath,low_memory=False)
data.head(1)

Unnamed: 0,Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
0,1,Text,1971-12-01,The Declaration of Independence of the United ...,en,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1...",E201; JK,Politics; American Revolutionary War; United S...


### Filter data

In [5]:
# filter <Type by 'Text'> and <Language by 'en'> and drop two columns
# then drop some columns
drop_cols= ['Issued','LoCC','Type','Language']
df = data.query('Type=="Text" & Language=="en"').drop(drop_cols, axis=1,inplace=False).drop_duplicates()
df.head(2)

Unnamed: 0,Text#,Title,Authors,Subjects,Bookshelves
0,1,The Declaration of Independence of the United ...,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1...",Politics; American Revolutionary War; United S...
1,2,The United States Bill of Rights\r\nThe Ten Or...,United States,Civil rights -- United States -- Sources; Unit...,Politics; American Revolutionary War; United S...


### Rename a column

In [10]:
# rename column "Text#" to "ID"
df.rename({'Text#': 'ID', 'Title': 'Book'}, axis=1, inplace=True)

### Split a column entry (separated by`;`) into separated rows

In [11]:
# use pandas assign and explode function 
col_to_split = "Bookshelves"

# split column and drop na
df[col_to_split] = df[col_to_split].str.split(';')
df=df.explode(col_to_split).dropna(subset=[col_to_split], how='all', inplace=False)

# strip white space
df[col_to_split] = df[col_to_split].str.strip()
df= df.drop_duplicates()
df.head(2)

Unnamed: 0,ID,Book,Authors,Subjects,Bookshelves
0,1,The Declaration of Independence of the United ...,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1...",Politics
0,1,The Declaration of Independence of the United ...,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1...",American Revolutionary War


### What types of books are on the "Bookshelves"? 

In [12]:
# collect "Boookshelves" into a sorted list 
list_Bookshelves = sorted(df[col_to_split].unique().tolist())

#### First, how many types?

In [13]:
# couont number of elements in the list
n_list = len(list_Bookshelves)
print(f'There are {n_list} categories in the column "{col_to_split}" ')

There are 247 categories in the column "Bookshelves" 


#### Second, what are they?
 * The below dataframe displays the book types alphabetically (from left to right)

In [14]:
# slice the list into 10 elements per row
df_Bookshelves = pd.DataFrame([list_Bookshelves[n:n+9] for n in range(0, len(list_Bookshelves), 10 )])

# display the book types alphabetically (from left to right)
df_Bookshelves

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,6 Best Loved Spanish Literary Classics,Adventure,Africa,African American Writers,Ainslee's,American Revolutionary War,Anarchism,Animal,Animals-Domestic
1,Animals-Wild-Birds,Animals-Wild-Insects,Animals-Wild-Mammals,Animals-Wild-Reptiles and Amphibians,Animals-Wild-Trapping,Anthropology,Archaeology,Architecture,Argentina
2,Art,Arthurian Legends,Astounding Stories,Astronomy,Atheism,Australia,Bahá'í Faith,Banned Books List from the American Library As...,Banned Books from Anne Haight's list
3,"Bestsellers, American, 1895-1923",Bibliomania,Biographies,Biology,Bird-Lore,"Birds, Illustrated by Color Photography",Blackwood's Edinburgh Magazine,Boer War,Botany
4,Buchanan's Journal of Man,Buddhism,Bulgaria,CIA World Factbooks,Camping,Canada,Canon Law,Celtic Magazine,Chambers's Edinburgh Journal
5,Child's Own Book of Great Musicians,Children's Anthologies,Children's Biography,Children's Book Series,Children's Fiction,Children's History,Children's Instructional Books,Children's Literature,"Children's Myths, Fairy Tales, etc."
6,Children's Religion,Children's Verse,Christianity,Christmas,Classical Antiquity,Contemporary Reviews,Continental Monthly,Cookbooks and Cooking,Crafts
7,Crime Nonfiction,Current History,Czech,DE Lyrik,DE Prosa,Detective Fiction,Dew Drops,Donahoe's Magazine,Early English Text Society
8,Education,Egypt,Engineering,English Civil War,Erotic Fiction,Esperanto,FR Femmes,FR Illustrateurs,FR Langues
9,FR Poésie,Famous Scots Series,Fantasy,Folklore,Forestry,France,Garden and Forest,Geology,Germany


## 2.2 Save the processed data to a csv

In [15]:
# output csv name
processed_filename =  f"processed_{filename}"

# output csv path
processed_filepath = os.path.join(processed_data_Dir,processed_filename)

# save to csv
df.to_csv(processed_filepath,index=False)

## 2.3 Save the processed data to a database

In [17]:
# databse name
db_name =  filename.removesuffix('.csv') +".db"
db_path= os.path.join(sql_data_Dir,db_name)

# create a sql engine
sql_url = f'sqlite:///{db_path}'
engine = create_engine(sql_url, echo=False)

# create a sql database
if not database_exists(engine.url):
    create_database(engine.url)

# save df to db
df.to_sql(sql_table_name, con=engine)

13207

In [18]:
#Construct a SQLDatabase Index
catalog_db = SQLDatabase(engine,
                         include_tables=[sql_table_name])