# RAG Project 2nd Phase - General Handbook / Changing Chunk Structure

## Parsing HTML

The links to every page of the General Handbook appear in the source code of its main page. I used BeautifulSoup to parse those links from a saved copy of the main page and stored them in a list. Then I checked which links I didn't need and cut them out of the list.

In [1]:
from bs4 import BeautifulSoup

with open("documents/study_manual_general-handbook.html", "r") as f:
    soup = BeautifulSoup(f, "html.parser")
    links = [a['href'] for a in soup.find_all('a', class_="html-external-link", href=True)]

print(len(links))

#for link in links[8:-1]:
#    print(link)    

links = links[8:-1]

# Many links were only references (#anchors) into the same pages, so I strip the anchor and keep one link per page (39 links in total, one for each page in the General Handbook).
chapter_links = []

for link in links:
    # Keep the part before the first "#"; links without "#" are kept whole
    page_url = link.split("#", 1)[0]

    if page_url not in chapter_links:
        chapter_links.append(page_url)

print(len(chapter_links))
for url in chapter_links:
    print(url)

1015
39
https://www.churchofjesuschrist.org/study/manual/general-handbook/0-introductory-overview?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/1-work-of-salvation-and-exaltation?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/2-supporting-individuals-and-families?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/3-priesthood-principles?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/4-leadership-in-the-church-of-jesus-christ?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/5-general-and-area-leadership?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/6-stake-leadership?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/7?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/8-elders-quorum?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/9-relief-society?lan
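
The de-duplication step above can also be expressed with the standard library. This is a minimal sketch rather than the notebook's own code: `unique_pages` is a hypothetical helper, the `example.org` URLs are placeholders, and it assumes that two links point to the same page whenever they differ only in the `#fragment`.

```python
from urllib.parse import urldefrag


def unique_pages(links):
    """Return each page URL once, in first-seen order, ignoring #fragments."""
    # urldefrag splits "page?lang=eng#p5" into the URL and its fragment
    pages = (urldefrag(link).url for link in links)
    # dict keys keep insertion order, so duplicates drop out without reordering
    return list(dict.fromkeys(pages))


# Placeholder links for illustration only
sample_links = [
    "https://example.org/handbook/1-intro?lang=eng#title1",
    "https://example.org/handbook/1-intro?lang=eng#p5",
    "https://example.org/handbook/2-families?lang=eng",
]
print(unique_pages(sample_links))
```

Using dictionary keys avoids rescanning the result list for every link, although for roughly a thousand links either approach is fast enough.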

I used Selenium to drive a browser and wait for each page to load before reading its content. Then I saved all pages in a JSON document for further processing.

In [2]:
import time
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # Replace with desired browser


with open('general-handbook-pages.json', 'w', encoding='utf-8') as f:
    
    manual_dict = {}

    for url in chapter_links:
        driver.get(url)

        # Wait for a specific element to be present (customizable)
        wait = WebDriverWait(driver, 10)  # 10 second timeout
        wait.until(EC.presence_of_element_located((By.ID, "page")))

        time.sleep(3)

        content = driver.page_source
        manual_dict[f"{url}"] = str(content)

    json.dump(manual_dict, f, ensure_ascii=False, indent=4)

driver.quit()
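
`WebDriverWait(...).until(...)` works by calling a condition repeatedly until it returns a truthy value or the timeout expires. The sketch below illustrates that polling idea without Selenium. `wait_until`, its parameters, and the simulated `page_is_ready` check are hypothetical names for illustration and are not part of Selenium's API.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated page that becomes "ready" on the third check
checks = {"count": 0}


def page_is_ready():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(page_is_ready, timeout=2.0, poll_interval=0.01))
```

The fixed `time.sleep(3)` in the cell above is a simpler safeguard for content that renders after the `page` element is already present.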


## Preparing Chunks with BeautifulSoup
I divided the content by subheading and used regex to get smaller portions of text.

In [2]:
import json
from bs4 import BeautifulSoup
import re

main_content = {}

# I started by getting all chapter titles and main content from each page.
with open('processed-docs/general-handbook-pages.json', 'r', encoding='utf-8') as f:

    gen_hand_dict = json.loads(f.read())
    
    for key, value in gen_hand_dict.items():
        soup = BeautifulSoup(value, "html.parser")
        
        main_content[key] = {"chapter": soup.find("h1", attrs={'id':'title1'}).get_text(), "body": soup.find("div", class_="body-block")}

Next, I try to keep tables and lists together with their HTML tags, so that their structure is not lost when the rest of the page is reduced to plain text.

In [75]:
def extract_content(element):
    content_list = []
    
    for child in element.children:
        child.attrs = {}
        chunks = []
        if child.name in ['table', 'ul', 'ol']:            
            for tag in child.find_all(True):  # True means all tags
                tag.attrs = {}
            # Append full HTML for tables and lists
            chunks.append(str(child))
        elif child.name is not None:            
            for tag in child.find_all(True):  # True means all tags
                tag.attrs = {}
            # Recursively extract content from other tags
            chunks.extend(extract_content(child))
        else:            
            # Append text content for non-HTML elements (e.g., text nodes)
            text = child.strip()
            if text:
                chunks.append(text)

        chunks = ' '.join(chunks)
        content_list.append(chunks)
    
    return content_list

In [77]:
list_of_text = []

# Extract content from each section
for key, value in main_content.items():
    for section in value["body"].find_all('section'):
        content = extract_content(section)
        list_of_text.extend(content)

print("Length of list:", len(list_of_text))
print(list_of_text[0:10])

Length of list: 15362
['', ' 0.0  Introduction ', '', 'The Lord taught, “Let every man learn his duty, and to act in the office in which he is appointed, in all diligence” ( Doctrine and Covenants 107:99 ). As a leader in The Church of Jesus Christ of Latter-day Saints, you should seek personal revelation to help you learn and fulfill the duties of your calling.', '', 'Studying the scriptures and the teachings of latter-day prophets will help you understand and fulfill your duties. As you study the words of God, you will be more receptive to the influence of the Spirit (see Doctrine and Covenants 84:85 ).', '', 'You also learn your duties by studying the instructions in this handbook. These instructions can invite revelation if they are used to provide an understanding of principles, policies, and procedures to apply while seeking the guidance of the Spirit.', '', '']


In [78]:
filtered_list_of_text = []
filtered_list_of_text_1 = []

# Remove "\n" from the list
for string in list_of_text:
    cleaned_string = string.replace('\n', ' ').strip()
    cleaned_string = ' '.join(cleaned_string.split())  # Removes extra spaces
    filtered_list_of_text_1.append(cleaned_string)

# Remove empty strings from the list
for text in filtered_list_of_text_1:
    if text != "":
        filtered_list_of_text.append(text)

print("Length of list:", len(filtered_list_of_text))
print(filtered_list_of_text[0:10])

Length of list: 6941
['0.0 Introduction', 'The Lord taught, “Let every man learn his duty, and to act in the office in which he is appointed, in all diligence” ( Doctrine and Covenants 107:99 ). As a leader in The Church of Jesus Christ of Latter-day Saints, you should seek personal revelation to help you learn and fulfill the duties of your calling.', 'Studying the scriptures and the teachings of latter-day prophets will help you understand and fulfill your duties. As you study the words of God, you will be more receptive to the influence of the Spirit (see Doctrine and Covenants 84:85 ).', 'You also learn your duties by studying the instructions in this handbook. These instructions can invite revelation if they are used to provide an understanding of principles, policies, and procedures to apply while seeking the guidance of the Spirit.', '0.1 This Handbook', 'General Handbook: Serving in The Church of Jesus Christ of Latter-day Saints provides guidance for general and local Church l
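
The two cleaning passes above (collapse whitespace, then drop empty strings) can be combined into one step. A minimal sketch with a hypothetical `clean_chunks` helper, assuming every item in the list is a plain string:

```python
def clean_chunks(chunks):
    """Collapse runs of whitespace (including newlines) and drop empty results."""
    # str.split() with no arguments splits on any whitespace and discards empties,
    # so joining with a single space normalizes the spacing in one pass
    cleaned = (" ".join(chunk.split()) for chunk in chunks)
    return [chunk for chunk in cleaned if chunk]


# Made-up input resembling the raw extraction output
raw = ["", " 0.0  Introduction ", "first line\nsecond line", "   "]
print(clean_chunks(raw))
```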

In [79]:
for text in filtered_list_of_text:
    print("Length:", len(text))
    print(text)
    print()

Length: 16
0.0 Introduction

Length: 309
The Lord taught, “Let every man learn his duty, and to act in the office in which he is appointed, in all diligence” ( Doctrine and Covenants 107:99 ). As a leader in The Church of Jesus Christ of Latter-day Saints, you should seek personal revelation to help you learn and fulfill the duties of your calling.

Length: 240
Studying the scriptures and the teachings of latter-day prophets will help you understand and fulfill your duties. As you study the words of God, you will be more receptive to the influence of the Spirit (see Doctrine and Covenants 84:85 ).

Length: 250
You also learn your duties by studying the instructions in this handbook. These instructions can invite revelation if they are used to provide an understanding of principles, policies, and procedures to apply while seeking the guidance of the Spirit.

Length: 17
0.1 This Handbook

Length: 161
General Handbook: Serving in The Church of Jesus Christ of Latter-day Saints provides gu
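
The printed chunks mix section headings (such as `0.1 This Handbook`) with body paragraphs. The sketch below shows one way a regex could group paragraphs under their headings. It is an illustration only, not the cell that built `chapters_n_sections`: `HEADING_RE` and `group_by_section` are hypothetical names, and the code assumes every heading starts with a dotted section number and that no body paragraph does.

```python
import re

# Assumption: a heading is a dotted section number followed by a title,
# for example "0.1 This Handbook"
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)+)\s+(\S.*)$")


def group_by_section(chunks):
    """Group paragraphs under the most recent numbered heading."""
    sections = []
    for chunk in chunks:
        match = HEADING_RE.match(chunk)
        if match:
            # A new heading starts a new section
            sections.append({"section": match.group(1), "title": match.group(2), "text": []})
        elif sections:
            sections[-1]["text"].append(chunk)
        else:
            # Paragraphs before the first heading go into an untitled section
            sections.append({"section": None, "title": None, "text": [chunk]})
    return sections


# Shortened stand-ins for the real paragraphs
sample = ["0.0 Introduction", "First paragraph.", "Second paragraph.", "0.1 This Handbook", "Third paragraph."]
for section in group_by_section(sample):
    print(section["section"], section["title"], len(section["text"]))
```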

In [12]:
# Print chunks 11 through 19 to inspect their structure
for chunks in chapters_n_sections[11:20]:
    print(chunks)

{'chapter': 'God’s Plan and Your Role in His Work of Salvation and Exaltation', 'section': '\n1.4', 'text': ['\n', '\nYour Role in God’s Work\n\nAs a leader in the Church, you are called to teach and support those you serve as they assist God in His work of salvation and exaltation (see 1.2). Having a clear understanding of God’s work, of what He invites you to do, and of the purpose of His Church will help you focus your efforts in bringing souls to Christ.\nPrayerfully seek to know how you can help bring about God’s purposes in the lives of those you serve. God will direct you through the promptings of the Holy Ghost (see 2\xa0Nephi 32:5). Laboring with the Lord in His vineyard will bring you great joy (see Jacob 5:70–72).\n']}
{'chapter': 'Supporting Individuals and Families in God’s Work of Salvation and Exaltation', 'section': '\n2.0', 'text': ['\n', '\nIntroduction\n\nAs a leader in the Church of Jesus Christ, you support individuals and families in accomplishing God’s work of sa

## Creating Text Embeddings with Gemini
Next, I will use the Gemini API to create text embeddings.

In [2]:
## Importing and initializing the library
import google.generativeai as genai
import os

google_api_key = os.environ.get("GOOGLE_API_KEY")

genai.configure(api_key=google_api_key)





I will try generating embeddings for 100 chunks first (the commented-out `rounds` counter in the cell below caps the run at 100).

In [8]:
embeddings_n_meta = []
# rounds = 0

for i, chunk in enumerate(chapters_n_sections, start=1):
    # Each chunk stores its text under its own 1-based position as the key
    result = genai.embed_content(
        model="models/embedding-001",
        content=chunk[i],
        task_type="retrieval_document",
        title="Embedding of single string")

    # Keep the text, its embedding, and the metadata together in one record
    chunk_data = {
        i: chunk[i],
        "embedding": result["embedding"],
        "title": chunk["title"],
        "chapter": chunk["chapter"],
        "url": chunk["url"],
        "section": chunk["section"],
    }

    embeddings_n_meta.append(chunk_data)

    # rounds += 1
    # if rounds == 100:
    #     break


Now, I will check that the data is in the desired format.

In [None]:
for embedding in embeddings_n_meta:
    print(embedding)
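
Printing every record is hard to scan by eye. A minimal sketch of an automated check follows, assuming each record carries the metadata keys used above and a 768-value embedding (the dimension configured for the Pinecone index later in this notebook). `find_invalid_records` and the sample records are hypothetical.

```python
REQUIRED_KEYS = ("embedding", "title", "chapter", "url", "section")
EXPECTED_DIMENSION = 768  # assumption: matches the dimension of the index


def find_invalid_records(records):
    """Return (position, reason) pairs for records that do not match the expected format."""
    problems = []
    for position, record in enumerate(records, start=1):
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((position, f"missing keys: {missing}"))
        elif len(record["embedding"]) != EXPECTED_DIMENSION:
            problems.append((position, f"embedding has {len(record['embedding'])} values"))
    return problems


# Fabricated records for illustration only
good = {"embedding": [0.0] * 768, "title": "t", "chapter": "c", "url": "u", "section": "1.0"}
short = dict(good, embedding=[0.0] * 3)
print(find_invalid_records([good, short, {"title": "t"}]))
```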

Here I save the results to a file so that I can free the memory held by my variables.

In [10]:
import json

with open('chunks-data.json', 'w', encoding='utf-8') as f:
    
    chunks_dictionary = {"data" : embeddings_n_meta}    

    json.dump(chunks_dictionary, f, ensure_ascii=False, indent=4)
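
One detail matters for the later cells: each chunk's text is stored under an integer key (its position), and JSON object keys are always strings. After a round trip through JSON, the key `1` comes back as `"1"`, which is why the data is indexed with a string key once it is reloaded. A small demonstration with made-up data:

```python
import json

# Made-up record with an integer key, like the chunk records above
chunk_data = {1: "0.0 Introduction", "title": "Introduction"}

# Serialize and parse again, as happens when the file is saved and reloaded
restored = json.loads(json.dumps(chunk_data))

print(list(restored.keys()))
print(restored["1"])
```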

In [1]:
## Clearing the memory

embeddings_n_meta = []
chapters_n_sections = []

## Uploading embeddings to Pinecone vector database
Finally, I will upload all embeddings and metadata to Pinecone.

In [1]:
import json

# The file was written as UTF-8 with ensure_ascii=False, so read it back with the same encoding
with open("processed-docs/chunks-data.json", "r", encoding="utf-8") as f:
    data = json.load(f)["data"]

I will check that I have loaded the data correctly.

In [2]:
print(len(data))

1370


In [4]:
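# Print the chunk whose text is 6596 characters long (the largest content size reported further below)
for emb_index, embedding in enumerate(data, start=1):
    # After loading from JSON, each chunk's text sits under its position as a string key
    if len(embedding[str(emb_index)]) == 6596:
        print(embedding)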

{'839': 'Calling  Recommended by  Approved by  Sustained by  Called and set apart by  Calling  Bishop  Recommended by  Stake presidency, using LCR  Approved by  First Presidency and Quorum of the Twelve  Sustained by  Ward members  Called and set apart by  Stake president after receiving approval from the First Presidency (see 30.7)  Calling  Counselors in the bishopric  Recommended by  Bishop  Approved by  Stake presidency and high council  Sustained by  Ward members  Called and set apart by  Stake president or an assigned counselor  Calling  Ward clerk (and assistant ward clerks if needed)  Recommended by  Bishopric  Approved by  Stake presidency and high council  Sustained by  Ward members  Called and set apart by  Stake president or an assigned counselor or high councilor  Calling  Ward executive secretary (and assistant ward executive secretaries if needed)  Recommended by  Bishopric  Approved by  Stake presidency and high council  Sustained by  Ward members  Called and set apart 

Time to initialize Pinecone.

In [9]:
from pinecone import Pinecone, ServerlessSpec
import os

pinecone_api_key = os.environ.get("PINECONE_API_KEY")

pc = Pinecone(api_key=pinecone_api_key)



Now, I create an index to upsert my embeddings.

In [None]:
pc.create_index(
    name="general-handbook",
    dimension=768, # models/embedding-001 returns 768-dimensional vectors
    metric="cosine", # Similarity metric used when querying the index
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    ) 
)
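
With `metric="cosine"`, vectors are compared by the angle between them rather than by their length. A minimal pure-Python sketch of that measure on toy two-value vectors (the real embeddings here have 768 values). `cosine_similarity` is an illustrative helper, not a Pinecone function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different lengths: similarity is 1.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# Perpendicular vectors: similarity is 0.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```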

Pinecone expects each vector in a specific format for upserting, so I create a new list of dictionaries in that format.

Update: I later found that each upsert request has a size limit, so I split the data into 3 lists.

In [12]:
# This is an example of the new format for the chunks data
# {
#     "id": "1", 
#     "values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], 
#     "metadata": {"title": "example", "chapter": "example", "url": "example", "section": "example", "content": "example"}
# }

new_data_1 = []
new_data_2 = []
new_data_3 = []

vector_id = 1

for chunk in data:
    # Chunks titled "None" are skipped, but they still consume an id so that
    # str(vector_id) keeps matching each chunk's text key
    if chunk["title"] != "None":
        record = {
            "id": str(vector_id),
            "values": chunk["embedding"],
            "metadata": {"title": chunk["title"], "chapter": chunk["chapter"], "url": chunk["url"], "section": chunk["section"], "content": chunk[str(vector_id)]}
        }

        # Ids below 500 go to the first list, ids below 1000 to the second, the rest to the third
        if vector_id < 500:
            new_data_1.append(record)
        elif vector_id < 1000:
            new_data_2.append(record)
        else:
            new_data_3.append(record)

    vector_id += 1
    

In [13]:
# Report the size of each list and its longest content string (in characters).
# The loop variable is named `item` so that it does not overwrite the `data` list.
for name, batch in [("new_data_1", new_data_1), ("new_data_2", new_data_2), ("new_data_3", new_data_3)]:
    largest = max((len(item["metadata"]["content"]) for item in batch), default=0)

    print(f"{name} length:", len(batch))
    print("Largest:", largest)

new_data_1 length: 466
Largest: 4018
new_data_2 length: 461
Largest: 6596
new_data_3 length: 354
Largest: 4766
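
`len()` on the content counts characters, while size limits on metadata are usually expressed in bytes, and the other metadata fields count toward the total too. The sketch below approximates the UTF-8 size of a whole metadata dictionary by serializing it to JSON. The 40 KB threshold is an assumption to verify against Pinecone's current documentation, and `metadata_size_bytes` is a hypothetical helper.

```python
import json

ASSUMED_LIMIT_BYTES = 40 * 1024  # assumption: check Pinecone's documented metadata limit


def metadata_size_bytes(metadata):
    """Approximate the metadata size as the UTF-8 length of its JSON form."""
    return len(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))


# Made-up metadata; curly quotes and accents take more than one byte each
sample = {"title": "example", "content": "God’s plan"}
size = metadata_size_bytes(sample)
print(size, size <= ASSUMED_LIMIT_BYTES)
```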


Finally, I will upsert all vectors to Pinecone.

In [14]:
index = pc.Index("general-handbook")

index.upsert(
    vectors = new_data_1,
    namespace = "general-handbook-vectors"
)

index.upsert(
    vectors = new_data_2,
    namespace = "general-handbook-vectors"
)

index.upsert(
    vectors = new_data_3,
    namespace = "general-handbook-vectors"
)

{'upserted_count': 354}
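
Rather than splitting the data by hand into three lists, the vectors can be sliced into fixed-size batches. A minimal sketch follows. `batched` is a hypothetical helper, the batch size of 100 is an arbitrary assumption, and the upsert call is left as a comment so that the example runs without a Pinecone connection.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in records; the real ones also carry "values" and "metadata"
vectors = [{"id": str(n)} for n in range(1, 251)]

for batch in batched(vectors, 100):
    # index.upsert(vectors=batch, namespace="general-handbook-vectors")
    print(len(batch), batch[0]["id"], batch[-1]["id"])
```

With this approach the number of requests adapts to the data size, so the same cell keeps working if the number of chunks changes.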

I deleted the index vectors a few times to try different approaches to upsert.

In [11]:
# index = pc.Index("general-handbook")
# index.delete(delete_all=True, namespace="general-handbook-vectors")

{}