# RAG Project - General Handbook / Data Collection

## Using Unstructured API
I will first use the Unstructured API to convert the PDF manual into JSON and see what I get.

I start by importing all the required libraries.

In [2]:
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from dotenv import load_dotenv
import os

load_dotenv()

True

Set up the Unstructured client.

In [3]:
client = UnstructuredClient(
    api_key_auth=os.environ.get("UNS_API_KEY"),
    server_url=os.environ.get("UNS_API_URL"),
)

Use the Unstructured API to normalize the document and see the results in JSON.

In [4]:
import json

filename = "documents/general_handbook_serving_in_the_church_of_jesus_christ_of_latter_day_saints.pdf"
with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(), 
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy='hi_res',
    pdf_infer_table_structure=True,
    languages=["eng"],
)
try:
    resp = client.general.partition(req)
    print(json.dumps(resp.elements[:3], indent=2))
except SDKError as e:
    print(e)

INFO: Preparing to split document for partition.
INFO: Starting page number set to 1
INFO: Concurrency level set to 5
INFO: Determined optimal split size of 20 pages.
INFO: Document split into 24, 20-paged sets.
INFO: Partitioning 24, 20-paged sets.
INFO: Partitioning set #1 (pages 1-21).
INFO: Partitioning set #2 (pages 21-41).
INFO: Partitioning set #3 (pages 41-61).
INFO: Partitioning set #4 (pages 61-81).
INFO: Partitioning set #5 (pages 81-101).
INFO: Partitioning set #6 (pages 101-121).
INFO: Partitioning set #7 (pages 121-141).
INFO: Partitioning set #8 (pages 141-161).
INFO: Partitioning set #9 (pages 161-181).
INFO: Partitioning set #10 (pages 181-201).
INFO: Partitioning set #11 (pages 201-221).
INFO: Partitioning set #12 (pages 221-241).
INFO: Partitioning set #13 (pages 241-261).
INFO: Partitioning set #14 (pages 261-281).
INFO: Partitioning set #15 (pages 281-301).
INFO: Partitioning set #16 (pages 301-321).
INFO: Partitioning set #17 (pages 321-341).
INFO: Partitioning se

[
  {
    "type": "Title",
    "element_id": "75364766980cca8c684342fc63fdbac9",
    "text": "GENER AL HANDBOOK",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "filename": "general_handbook_serving_in_the_church_of_jesus_christ_of_latter_day_saints.pdf"
    }
  },
  {
    "type": "Title",
    "element_id": "3acbc078c0f0d58a3d54d51a4669f30a",
    "text": "Serving in The Church of Jesus Christ of Latter-day Saints",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "filename": "general_handbook_serving_in_the_church_of_jesus_christ_of_latter_day_saints.pdf"
    }
  },
  {
    "type": "Title",
    "element_id": "2355c0c1843351d5ccfcad1a462c4fd2",
    "text": "A U G U S T 2 0 2 3",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "filename": "general_h

I check a different section of the JSON response to see what I got. These elements all turned out to be table-of-contents entries.

In [5]:
print(json.dumps(resp.elements[100:105], indent=2))

[
  {
    "type": "NarrativeText",
    "element_id": "96c5923d24c835b378c12d91d527bc5b",
    "text": "9.3.2.1. Calling a Relief Society Presidency",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 13,
      "parent_id": "0eb7a0d5a4378ddca527d66ee590214f",
      "filename": "general_handbook_serving_in_the_church_of_jesus_christ_of_latter_day_saints.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "ff2835128f05d6c22d3c3830ff7e07af",
    "text": "9.3.2.2. Responsibilities",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 13,
      "parent_id": "0eb7a0d5a4378ddca527d66ee590214f",
      "filename": "general_handbook_serving_in_the_church_of_jesus_christ_of_latter_day_saints.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "803042eb532f2e0d812784232a936b89",
    "text": "9.3.3. Secretary",
    "metadata": {
      

I checked the length of the response to find a better place to sample it.

In [6]:
len(resp.elements)

10796
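
Beyond the total count, a tally of element types gives a rough picture of what the response contains. The following is a minimal sketch, assuming each element is a plain dict with a `type` key as in the output above. The helper name `count_element_types` and the `sample_elements` list are made up for illustration and stand in for `resp.elements`.

```python
from collections import Counter


def count_element_types(elements):
    """Return a Counter mapping each element "type" to how often it occurs."""
    return Counter(element.get("type", "Unknown") for element in elements)


# Hypothetical stand-in for resp.elements (not the real response).
sample_elements = [
    {"type": "Title", "text": "Serving in The Church of Jesus Christ of Latter-day Saints"},
    {"type": "NarrativeText", "text": "9.3.2.2. Responsibilities"},
    {"type": "NarrativeText", "text": "9.3.3. Secretary"},
]

type_counts = count_element_types(sample_elements)
print(type_counts.most_common())
```

On the real data, the same call would be `count_element_types(resp.elements)`.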

I checked the JSON response again, this time closer to the middle of the data.

In [7]:
print(json.dumps(resp.elements[5000:5005], indent=2))

[
  {
    "type": "NarrativeText",
    "element_id": "ff2f054b971650c24b41a1cb5a5d6e26",
    "text": "If the bishop and stake president have concerns about a candidate meeting the standards of worthi- ness or about his or her health readiness, they coun- sel together and with the person. With a young candidate\u2019s permission, they may also counsel with his or her parents. The bishop and stake president do not submit a recommendation until the person has repented of serious sin (see 24.3.2.1). Depending on the person\u2019s physical, mental, or emotional health, they may discuss the possibility of being assigned as a service missionary.",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 220,
      "parent_id": "eed2d0b7f68d9542120512742be2384f",
      "filename": "general_handbook_serving_in_the_church_of_jesus_christ_of_latter_day_saints.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "5e817103

Finally, I check the type of the response to get an idea of how to save it and avoid running the whole process again.

In [18]:
type(resp.elements)

list

I use the JSON encoder to save the elements to a new file.

In [15]:
#with open('general-handbook.json', 'w', encoding='utf-8') as f:
#   json.dump(resp.elements, f, ensure_ascii=False, indent=4)
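
Because the elements are a plain list of dicts, a save and reload round trip is enough to skip the partition step on later runs. This is a sketch under that assumption. The helper names `save_elements` and `load_elements` are hypothetical, and the example writes to a temporary directory rather than the real output file.

```python
import json
import tempfile
from pathlib import Path


def save_elements(elements, path):
    """Write a list of element dicts to a UTF-8 JSON file."""
    text = json.dumps(elements, ensure_ascii=False, indent=4)
    Path(path).write_text(text, encoding="utf-8")


def load_elements(path):
    """Read the list of element dicts back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical stand-in for resp.elements.
sample = [{"type": "Title", "text": "A U G U S T 2 0 2 3", "metadata": {"page_number": 1}}]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "general-handbook.json"
    save_elements(sample, target)
    restored = load_elements(target)

print(restored == sample)
```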

## Parsing HTML

The links to all pages of the General Handbook are in the source code of its main page. I used BeautifulSoup to parse those links into a list. Then, I checked which links I wouldn't need and cut them out of the list.

In [19]:
from bs4 import BeautifulSoup

with open("documents/study_manual_general-handbook.html", "r") as f:
    soup = BeautifulSoup(f, "html.parser")
    links = [a['href'] for a in soup.find_all('a', class_="html-external-link", href=True)]

print(len(links))

#for link in links[8:-1]:
#    print(link)    

links = links[8:-1]

# Many links were only references to the same pages, so I filtered them and kept one link per page (39 links in total, one for each page in the General Handbook).
chapter_links = []

for link in links:

    try:
        link_i = link.index("#")

        if link[:link_i] in chapter_links:
            continue
        chapter_links.append(link[:link_i])
    except ValueError:  # no "#" in the link, so it already points to a whole page
        if link in chapter_links:
            continue
        chapter_links.append(link)

print(len(chapter_links))
for url in chapter_links:
    print(url)

1015
39
https://www.churchofjesuschrist.org/study/manual/general-handbook/0-introductory-overview?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/1-work-of-salvation-and-exaltation?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/2-supporting-individuals-and-families?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/3-priesthood-principles?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/4-leadership-in-the-church-of-jesus-christ?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/5-general-and-area-leadership?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/6-stake-leadership?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/7?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/8-elders-quorum?lang=eng
https://www.churchofjesuschrist.org/study/manual/general-handbook/9-relief-society?lan
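
The same filtering can also be written with the standard library. The sketch below is an alternative, not the code used in this notebook: it strips the `#fragment` part with `urllib.parse.urldefrag` and relies on `dict.fromkeys` to drop duplicates while keeping the original order. The function name `unique_pages` and the `example.org` URLs are placeholders.

```python
from urllib.parse import urldefrag


def unique_pages(links):
    """Strip '#fragment' parts and keep the first occurrence of each page URL."""
    pages = (urldefrag(link).url for link in links)
    return list(dict.fromkeys(pages))


# Placeholder links, two of which point into the same page.
sample_links = [
    "https://example.org/handbook/1-intro?lang=eng#title1",
    "https://example.org/handbook/1-intro?lang=eng#p2",
    "https://example.org/handbook/2-next?lang=eng",
]

pages = unique_pages(sample_links)
for page in pages:
    print(page)
```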

I used Selenium to drive a browser and wait for each page to load before getting its content. Then, I saved all pages in a JSON document for further processing.

In [20]:
import time
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # Replace with desired browser


with open('general-handbook-pages.json', 'w', encoding='utf-8') as f:
    
    manual_dict = {}

    for url in chapter_links:
        driver.get(url)

        # Wait for a specific element to be present (customizable)
        wait = WebDriverWait(driver, 10)  # 10 second timeout
        wait.until(EC.presence_of_element_located((By.ID, "page")))

        time.sleep(3)

        content = driver.page_source
        manual_dict[f"{url}"] = str(content)

    json.dump(manual_dict, f, ensure_ascii=False, indent=4)

driver.quit()


## Preparing Chunks with BeautifulSoup
I divided the content at its subheadings, using regex, to get smaller portions of text.

In [1]:
import json
from bs4 import BeautifulSoup
import re

main_content = {}

# I started by getting all chapter titles and main content from each page.
with open('processed-docs/general-handbook-pages.json', 'r', encoding='utf-8') as f:

    gen_hand_dict = json.loads(f.read())
    
    for key, value in gen_hand_dict.items():
        soup = BeautifulSoup(value, "html.parser")
        
        main_content[key] = {"chapter": soup.find("h1", attrs={'id':'title1'}).get_text(), "body": soup.find("div", class_="body-block")}


# Then, I further divided the content in each chapter into sections, and then subsections based on their headings.
chapters_n_sections = []

chunk_id = 1
for key, value in main_content.items():
        
    for section in value["body"].find_all("section", recursive=False):
        section_complement = 1        
        child_text = section.get_text()
        
        try:
            y = re.findall(r"\n\d+\.\d+(?:\.\d+)*", child_text)[0]
        except IndexError:  # no section number found in this section
            y = y + f".{section_complement}" # Keep last section number plus complement
            section_complement += 1
        
        x = re.split(r"\n\d+\.\d+(?:\.\d+)*", child_text)        
        
        # print("Section:", y)
        for element in x:
            chunk_dict = {}
            if element != "\n\n\n" and len(element) > 1:
                paragraph = element.replace("\n", " ").strip()

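                try:
                    chunk_dict[chunk_id] = re.match(r'^(?:.*?\s{2,})(.*)', paragraph).group(1)
                except AttributeError:  # no match: the paragraph has no title separator
                    chunk_dict[chunk_id] = paragraph
                
                try:
                    chunk_dict["title"] = re.match(r'^(.*?)(?:\s{2,})', paragraph).group(1)
                except AttributeError:  # no match: the paragraph has no title
                    chunk_dict["title"] = "None"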

                chunk_dict["chapter"] = value["chapter"]
                chunk_dict["url"] = key
                chunk_dict["section"] = y.replace("\n", " ").strip()
                chapters_n_sections.append(chunk_dict)

                # print(element)
                # print("INSIDE", len(element))
                # print("\n", "/"*300,"\n")
                chunk_id += 1

for chunks in chapters_n_sections:
    print(chunks)
    

{1: 'The Lord taught, “Let every man learn his duty, and to act in the office in which he is appointed, in all diligence” (Doctrine and Covenants 107:99). As a leader in The Church of Jesus Christ of Latter-day Saints, you should seek personal revelation to help you learn and fulfill the duties of your calling. Studying the scriptures and the teachings of latter-day prophets will help you understand and fulfill your duties. As you study the words of God, you will be more receptive to the influence of the Spirit (see Doctrine and Covenants 84:85). You also learn your duties by studying the instructions in this handbook. These instructions can invite revelation if they are used to provide an understanding of principles, policies, and procedures to apply while seeking the guidance of the Spirit.', 'title': 'Introduction', 'chapter': 'Introductory Overview', 'url': 'https://www.churchofjesuschrist.org/study/manual/general-handbook/0-introductory-overview?lang=eng', 'section': '0.0'}
{2: 'G
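
To make the splitting step easier to follow, here is a small worked example on made-up text. It is a sketch of a variation, not the cell above: it puts the section number in a capturing group so that `re.split` returns each number next to the text that follows it, and it splits title from body on the first run of two or more spaces. The sample string and the name `split_sections` are assumptions for illustration.

```python
import re

# A newline followed by a dotted section number such as 1.1 or 1.1.1.
SECTION_NUMBER = re.compile(r"\n(\d+\.\d+(?:\.\d+)*)")


def split_sections(text):
    """Split text at section numbers and return one dict per numbered chunk."""
    # With a capturing group, re.split yields [prefix, number, body, number, body, ...].
    parts = SECTION_NUMBER.split(text)
    chunks = []
    for number, body in zip(parts[1::2], parts[2::2]):
        paragraph = body.replace("\n", " ").strip()
        # The title is whatever precedes the first run of two or more spaces.
        pieces = re.split(r"\s{2,}", paragraph, maxsplit=1)
        if len(pieces) == 2:
            title, content = pieces
        else:
            title, content = None, paragraph
        chunks.append({"section": number, "title": title, "content": content})
    return chunks


# Made-up text shaped like section.get_text() output.
sample_text = "\n1.1\n\nThe Plan  Body of section.\n1.1.1\n\nDetail  More body."

chunks = split_sections(sample_text)
for chunk in chunks:
    print(chunk)
```

Unlike the cell above, this variation labels each chunk with its own section number instead of the first number found in the enclosing `<section>`.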

## Creating Text Embeddings with Gemini
Next, I will use the Gemini API to create text embeddings.

In [2]:
## Importing and initializing the library
import google.generativeai as genai
import os

google_api_key=os.environ.get("GOOGLE_API_KEY")

genai.configure(api_key=google_api_key)





I first tried generating embeddings for 100 chunks. That limit is commented out below, so the loop now processes every chunk.

In [8]:
embeddings_n_meta = []
# rounds = 0

for i, value in enumerate(chapters_n_sections):
    chunk_data = {}

    result = genai.embed_content(
        model="models/embedding-001",
        content=chapters_n_sections[i][i+1],
        task_type="retrieval_document",
        title="Embedding of single string")

    chunk_data[i+1] = chapters_n_sections[i][i+1]
    chunk_data["embedding"] = result["embedding"]
    chunk_data["title"] = chapters_n_sections[i]["title"]
    chunk_data["chapter"] = chapters_n_sections[i]["chapter"]
    chunk_data["url"] = chapters_n_sections[i]["url"]
    chunk_data["section"] = chapters_n_sections[i]["section"]

    embeddings_n_meta.append(chunk_data)

    # rounds += 1
    # if rounds == 100:
    #     break


Now, I will just check that the data is in the desired format.

In [None]:
for embedding in embeddings_n_meta:
    print(embedding)
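
Printing every record works, but with more than a thousand chunks a programmatic check is easier to read. The sketch below assumes each record carries the keys set in the embedding loop and that every embedding has 768 values, the dimension used for the Pinecone index later in this notebook. The helper name `find_invalid_records` and the sample records are hypothetical.

```python
EXPECTED_DIMENSION = 768  # same value as the Pinecone index dimension used later
REQUIRED_KEYS = ("embedding", "title", "chapter", "url", "section")


def find_invalid_records(records, dimension=EXPECTED_DIMENSION):
    """Return (position, reason) pairs for records that do not have the expected shape."""
    problems = []
    for position, record in enumerate(records, start=1):
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((position, f"missing keys: {missing}"))
        elif len(record["embedding"]) != dimension:
            problems.append((position, f"embedding has {len(record['embedding'])} values"))
    return problems


# Made-up records: the first is well formed, the second has a short embedding.
base = {"title": "Introduction", "chapter": "Introductory Overview", "url": "example", "section": "0.0"}
sample_records = [
    {**base, "embedding": [0.0] * 768},
    {**base, "embedding": [0.0] * 3},
]

problems = find_invalid_records(sample_records)
print(problems)
```

An empty list from `find_invalid_records(embeddings_n_meta)` would suggest the data is ready to save.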

Here I save the data to a file so that I can free the memory held by my variables.

In [10]:
import json

with open('chunks-data.json', 'w', encoding='utf-8') as f:
    
    chunks_dictionary = {"data" : embeddings_n_meta}    

    json.dump(chunks_dictionary, f, ensure_ascii=False, indent=4)

In [1]:
## Clearing the memory

embeddings_n_meta = []
chapters_n_sections = []

## Uploading embeddings to Pinecone vector database
Finally, I will upload all embeddings and metadata to Pinecone.

In [1]:
import json
data = []
with open("processed-docs/chunks-data.json", "r") as f:
    data = json.loads(f.read())
    data = data["data"]

I will check that I have loaded the data correctly.

In [2]:
print(len(data))

1370


In [4]:
emb_index = 1

for embedding in data:
    if len(embedding[f"{emb_index}"]) == 6596:
        print(embedding)

    emb_index += 1

{'839': 'Calling  Recommended by  Approved by  Sustained by  Called and set apart by  Calling  Bishop  Recommended by  Stake presidency, using LCR  Approved by  First Presidency and Quorum of the Twelve  Sustained by  Ward members  Called and set apart by  Stake president after receiving approval from the First Presidency (see 30.7)  Calling  Counselors in the bishopric  Recommended by  Bishop  Approved by  Stake presidency and high council  Sustained by  Ward members  Called and set apart by  Stake president or an assigned counselor  Calling  Ward clerk (and assistant ward clerks if needed)  Recommended by  Bishopric  Approved by  Stake presidency and high council  Sustained by  Ward members  Called and set apart by  Stake president or an assigned counselor or high councilor  Calling  Ward executive secretary (and assistant ward executive secretaries if needed)  Recommended by  Bishopric  Approved by  Stake presidency and high council  Sustained by  Ward members  Called and set apart 

Time to initialize Pinecone.

In [9]:
from pinecone import Pinecone, ServerlessSpec
import os

pinecone_api_key = os.environ.get("PINECONE_API_KEY")

pc = Pinecone(api_key=pinecone_api_key)



Now, I create an index to upsert my embeddings.

In [None]:
pc.create_index(
    name="general-handbook",
    dimension=768, # Replace with your model dimensions
    metric="cosine", # Replace with your model metric
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    ) 
)

I realized that vectors must follow a specific format to be upserted to Pinecone, so I will create a new list of dictionaries in that format.

Update: Later I found that each upsert has a limit, so I divide the data into 3 lists.

In [12]:
# This is an example of the new format for the chunks data
# {
#     "id": "1", 
#     "values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], 
#     "metadata": {"title": "example", "chapter": "example", "url": "example", "section": "example", "content": "example"}
# }

new_data_1 = []
new_data_2 = []
new_data_3 = []

vector_id = 1

for chunk in data:
    if chunk["title"] != "None":
        if vector_id < 500:
            new_data_1.append({
                "id": str(vector_id),
                "values": chunk["embedding"],
                "metadata": {"title": chunk["title"], "chapter": chunk["chapter"], "url": chunk["url"], "section": chunk["section"], "content": chunk[str(vector_id)]}
            })
        elif vector_id < 1000:
            new_data_2.append({
                "id": str(vector_id),
                "values": chunk["embedding"],
                "metadata": {"title": chunk["title"], "chapter": chunk["chapter"], "url": chunk["url"], "section": chunk["section"], "content": chunk[str(vector_id)]}
            })
        else:
            new_data_3.append({
                "id": str(vector_id),
                "values": chunk["embedding"],
                "metadata": {"title": chunk["title"], "chapter": chunk["chapter"], "url": chunk["url"], "section": chunk["section"], "content": chunk[str(vector_id)]}
            })
        
    vector_id += 1
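
The three branches above build the same dictionary, so the conversion can be factored into one helper. This is a sketch, assuming each chunk stores its text under its 1-based position as a string key, as in `chunks-data.json`. The names `to_pinecone_record` and `build_records` are hypothetical, and the sample chunks use short made-up embeddings.

```python
def to_pinecone_record(vector_id, chunk):
    """Convert one chunk dict into the id/values/metadata shape shown above."""
    key = str(vector_id)
    return {
        "id": key,
        "values": chunk["embedding"],
        "metadata": {
            "title": chunk["title"],
            "chapter": chunk["chapter"],
            "url": chunk["url"],
            "section": chunk["section"],
            "content": chunk[key],
        },
    }


def build_records(chunks):
    """Convert every titled chunk, keeping its position in the list as the vector id."""
    return [
        to_pinecone_record(vector_id, chunk)
        for vector_id, chunk in enumerate(chunks, start=1)
        if chunk["title"] != "None"
    ]


# Made-up chunks: the first has no title and is skipped.
sample_chunks = [
    {"1": "text one", "embedding": [0.1, 0.2], "title": "None", "chapter": "c", "url": "u", "section": "0.0"},
    {"2": "text two", "embedding": [0.3, 0.4], "title": "Intro", "chapter": "c", "url": "u", "section": "0.1"},
]

records = build_records(sample_chunks)
print(records)
```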
    

In [13]:
# Report the size of each list and the length of its longest content string.
for name, batch in [("new_data_1", new_data_1), ("new_data_2", new_data_2), ("new_data_3", new_data_3)]:
    # A separate loop variable avoids overwriting the `data` list loaded earlier.
    largest = max((len(item["metadata"]["content"]) for item in batch), default=0)

    print(f"{name} length:", len(batch))
    print("Largest:", largest)

new_data_1 length: 466
Largest: 4018
new_data_2 length: 461
Largest: 6596
new_data_3 length: 354
Largest: 4766


Finally, I will upsert all vectors to Pinecone.

In [14]:
index = pc.Index("general-handbook")

index.upsert(
    vectors = new_data_1,
    namespace = "general-handbook-vectors"
)

index.upsert(
    vectors = new_data_2,
    namespace = "general-handbook-vectors"
)

index.upsert(
    vectors = new_data_3,
    namespace = "general-handbook-vectors"
)

{'upserted_count': 354}
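
Instead of hand-made lists, the vectors can be split into fixed-size batches with a small generator. This is a sketch only: the batch size of 500 is an assumption chosen to resemble the three lists above, so check it against Pinecone's documented request limits. The name `batched` is hypothetical, and the upsert call is left as a comment because it needs a live index.

```python
def batched(items, batch_size=500):
    """Yield consecutive slices of items, each with at most batch_size entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the 1281 titled vectors (466 + 461 + 354).
all_vectors = list(range(1281))

batch_sizes = [len(batch) for batch in batched(all_vectors)]
print(batch_sizes)

# With a real index, each batch would be sent in turn:
# for batch in batched(all_vectors):
#     index.upsert(vectors=batch, namespace="general-handbook-vectors")
```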

I deleted the index vectors a few times to try different approaches to upsert.

In [11]:
# index = pc.Index("general-handbook")
# index.delete(delete_all=True, namespace="general-handbook-vectors")

{}