# RAG application for "A Guide for First-Time Parents"

## Install tika and parse the PDF file

- Install libraries
- Download PDF from the website [The Asian Parent](https://th.theasianparent.com/%E0%B8%84%E0%B8%B9%E0%B9%88%E0%B8%A1%E0%B8%B7%E0%B8%AD%E0%B8%94%E0%B8%B9%E0%B9%81%E0%B8%A5%E0%B8%A5%E0%B8%B9%E0%B8%81)
- Parse the PDF file using `tika`
- Clean the text with a small helper function

In [66]:
%%capture
!pip install tika
!pip install unidecode
!pip install faker google-cloud-aiplatform
!pip install faiss-cpu==1.7.4

In [67]:
import pandas as pd
import numpy as np
import faiss

import tika
tika.initVM()
from tika import parser
from unidecode import unidecode

In [68]:
parsed_book = parser.from_file("baby_0_3.pdf")

n_pages = int(parsed_book["metadata"]["xmpTPg:NPages"])
print(n_pages)

86


In [69]:
def clean_text(text: str):
    """Clean parsed text from PDF for embedding"""
    text = text.replace("\uf70a", "่")
    text = text.replace("�ำ", "ำ")
    text = text.replace("�า", "ำ")
    return text

In [70]:
content = parsed_book["content"]
content_processed = clean_text(content)
pages = content_processed.split("\n\n\n\n")

In [71]:
pages_strip = [" ".join(page.split()) for page in pages]  # strip extra spaces from page

## Perform RAG for each page in the book

- Skimming through the book shows that each page already covers a single topic
- Split each page into chunks of at most `chunk_size` characters (default 2048)

In [72]:
def convert_page_to_chunk(page_text: str, chunk_size: int = 2048):
    """Split a page into consecutive chunks of at most `chunk_size` characters."""
    return [page_text[i:i + chunk_size] for i in range(0, len(page_text), chunk_size)]

In [73]:
chunks = []
for text in pages_strip:
    chunks.extend(convert_page_to_chunk(text))

In [74]:
len(chunks)

101
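
Fixed-size slicing can cut a sentence in half at a chunk boundary. One common mitigation, which this notebook does not use, is to let consecutive chunks overlap. The following is a minimal sketch; `chunk_with_overlap` and its `overlap` parameter are hypothetical names introduced here for illustration.

```python
def chunk_with_overlap(page_text: str, chunk_size: int = 2048, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most `chunk_size` characters.

    Each chunk starts `chunk_size - overlap` characters after the previous one,
    so neighbouring chunks share `overlap` characters. (Hypothetical helper.)
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not page_text:
        return []
    step = chunk_size - overlap
    # Stop early so the last chunk is not fully contained in the previous one.
    last_start = max(len(page_text) - overlap, 1)
    return [page_text[i:i + chunk_size] for i in range(0, last_start, step)]


demo_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij']
```

With `overlap=0` the function reduces to the plain slicing used by `convert_page_to_chunk`.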

## Prompting using RAG

- Embed the text chunks with a Vertex AI embedding model and store them in a `faiss` index
- Embed the query with the same embedding function
- Find the text chunks closest to the query
- Add the retrieved chunks to the prompt and generate the answer

In [75]:
from google.oauth2 import service_account
from google.cloud import aiplatform
from vertexai.language_models import TextGenerationModel, TextEmbeddingModel

project = "protean-sunup-89503"
service_account_path = "service_account.json"

credentials = service_account.Credentials.from_service_account_file(service_account_path)
aiplatform.init(project=project, credentials=credentials)

In [81]:
def get_embedding(text: str, model: str = "textembedding-gecko@003"):
    """Embed a single text with a Vertex AI text embedding model.

    Returns the embedding as a list of floats.
    """
    emb_model = TextEmbeddingModel.from_pretrained(model)
    
    text = text.replace("\n", " ")
    text_embeddings = emb_model.get_embeddings([text])
    return text_embeddings[0].values

In [82]:
text_chunks = pd.DataFrame(chunks, columns=["text"])["text"]
text_embeddings = np.vstack(text_chunks.apply(get_embedding))
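
The cell above sends one request per chunk, and `get_embedding` calls `TextEmbeddingModel.from_pretrained` each time. Because `get_embeddings` takes a list of texts, the requests could be batched instead. The sketch below shows only the batching logic: `embed_in_batches`, `embed_fn`, and the batch size of 5 are assumptions, and the real per-request limit depends on the model. The embedding call is passed in as a function so the logic can be tried without credentials.

```python
from typing import Callable, List, Sequence

import numpy as np


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 5,  # assumed limit; check the quota for your model
) -> np.ndarray:
    """Embed `texts` batch by batch and stack the vectors into a float32 matrix."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        # Same newline normalization as get_embedding above
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        vectors.extend(embed_fn(batch))
    return np.asarray(vectors, dtype="float32")


# Stand-in embedder for a dry run: maps each text to a 2-dimensional vector.
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(t)), 1.0] for t in batch]


demo = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(demo.shape)  # (3, 2)
```

With Vertex AI, `embed_fn` could be something like `lambda batch: [e.values for e in emb_model.get_embeddings(batch)]`, where `emb_model` is loaded once with `TextEmbeddingModel.from_pretrained`.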

In [83]:
print(text_embeddings.shape) # shape of text embedding

(101, 768)


In [84]:
text_embeddings[0]

array([ 3.29192057e-02, -4.32058871e-02, -1.34903071e-02, -1.70686506e-02,
        5.24765402e-02,  2.30775177e-02,  4.03249338e-02, -3.31014954e-02,
        2.24490240e-02,  2.57579796e-02,  2.70748120e-02,  1.71025982e-03,
       -5.62506029e-03, -2.98270732e-02, -5.46590937e-03, -3.87219787e-02,
       -1.74573939e-02, -3.50353448e-03,  2.65241582e-02, -9.42218583e-03,
        3.02749407e-03,  2.47907620e-02, -1.05854776e-02,  1.02505805e-02,
       -4.61345422e-04, -1.83114354e-02,  1.40716052e-02, -5.44563644e-02,
       -3.23577635e-02,  8.45417604e-02, -5.27261160e-02,  2.14207843e-02,
       -8.94813761e-02, -1.34279523e-02,  5.39359543e-03, -4.57768366e-02,
       -2.04425398e-02,  1.05721587e-02,  1.28345853e-02,  4.88189794e-02,
        6.82296464e-03,  1.47231892e-02, -4.21236865e-02, -5.96507117e-02,
        1.92534104e-02,  2.51372419e-02, -6.29360147e-04, -3.69597524e-02,
        1.01877972e-02, -4.41157706e-02, -1.98244124e-05, -1.74322166e-02,
        6.86555952e-02, ...

In [85]:
d = text_embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(text_embeddings)

In [88]:
question = "ถ้าลูกดูดนิ้วต้องทำอย่างไรบ้าง"  # "What should I do if my baby sucks their thumb?"
# question = "ถ้าลูกดูดมือต้องทำอย่างไรบ้าง"  # alternative wording ("sucks their hand"); prompt augmentation?
question_embeddings = np.array([get_embedding(question)])

D, I = index.search(question_embeddings, k=2)
print(I)

retrieved_chunk = [chunks[i] for i in I.tolist()[0]]

[[65 69]]
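
`faiss.IndexFlatL2` performs an exact, brute-force search: `D` holds squared Euclidean distances and `I` holds the row indices of the nearest stored vectors. To make that concrete, the same lookup can be written in plain NumPy. This is an illustrative sketch only; `l2_search` is a hypothetical helper and the vectors are toy data, not embeddings from the book.

```python
import numpy as np


def l2_search(database: np.ndarray, queries: np.ndarray, k: int):
    """Return (distances, indices) of the k rows of `database` nearest to
    each query, using squared Euclidean distance."""
    # Pairwise differences via broadcasting: (n_queries, n_database, dim)
    diff = queries[:, None, :] - database[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Sort each query's distances and keep the k smallest
    idx = np.argsort(sq_dist, axis=1)[:, :k]
    dist = np.take_along_axis(sq_dist, idx, axis=1)
    return dist, idx


toy_db = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
toy_query = np.array([[0.9, 0.1]], dtype="float32")
D_toy, I_toy = l2_search(toy_db, toy_query, k=2)
print(I_toy)  # [[1 0]]
```

This materializes every pairwise difference, so it is only practical for small collections such as the 101 chunks here.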


In [89]:
retrieved_chunk[0]

'พ'
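
The top hit here is a single-character chunk, which carries no usable context for the prompt. One possible remedy, not part of the original notebook, is to drop very short chunks before embedding them. In the sketch below, `drop_short_chunks` and the `min_chars` threshold are hypothetical, and the sample strings are made up.

```python
def drop_short_chunks(chunks: list[str], min_chars: int = 20) -> list[str]:
    """Keep only chunks with at least `min_chars` non-whitespace characters."""
    return [c for c in chunks if len("".join(c.split())) >= min_chars]


sample_chunks = ["พ", "", "   ", "a longer sample passage"]
kept = drop_short_chunks(sample_chunks, min_chars=5)
print(kept)  # ['a longer sample passage']
```

The filter has to run before the embeddings and the `faiss` index are built, so that index positions still line up with the `chunks` list.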

Compare the answers generated with (RAG) and without the retrieved context.

In [93]:
def complete_text(prompt: str):
    gen_model = TextGenerationModel.from_pretrained("text-bison")
    prompt = "You are a helpful assistant designed to generate output prompt for parent who are asking questions about newborn in Thai.\n" + prompt
    
    return gen_model.predict(prompt).text

In [94]:
prompt_with_context = f"""
Context information is below.
---------------------
{retrieved_chunk}
---------------------
Given the context information and not prior knowledge, answer the given query in Thai. The answer should be concise and clear.
Query: {question}
Answer:
"""

prompt_no_context = f"""
Answer the given query in Thai. The answer should be concise and clear.
Query: {question}
Answer:
"""
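
Interpolating `{retrieved_chunk}` directly places the Python list representation, including brackets and quotes, into the prompt. A small variation is to join the chunks into plain text first. The helper `build_rag_prompt` below is a hypothetical sketch that reuses the prompt wording above.

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble the RAG prompt with the retrieved chunks joined as plain text."""
    context = "\n\n".join(chunk.strip() for chunk in retrieved_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, "
        "answer the given query in Thai. The answer should be concise and clear.\n"
        f"Query: {question}\n"
        "Answer:\n"
    )


demo_prompt = build_rag_prompt("example question", ["first chunk", "second chunk"])
print(demo_prompt)
```

In the notebook this would be called as `build_rag_prompt(question, retrieved_chunk)`.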

In [95]:
output_no_rag = complete_text(prompt_no_context)
print(output_no_rag)

 - ปล่อยให้ลูกดูดนิ้วไปก่อน เพราะเป็นเรื่องปกติของเด็กทารก
- หากลูกดูดนิ้วจนติดเป็นนิสัย ให้พยายามหาวิธีอื่นในการปลอบโยนลูก เช่น การตบหลังเบาๆ หรือการร้องเพลงกล่อม
- หากลูกดูดนิ้วจนนิ้วบวมหรือเป็นแผล ให้พาไปพบแพทย์

English translation of the output:

- Let the baby suck their thumb for now, since this is normal for infants.
- If thumb sucking becomes a habit, try other ways of soothing the baby, such as gently patting their back or singing a lullaby.
- If the thumb becomes swollen or sore from sucking, take the baby to see a doctor.


In [96]:
output_rag = complete_text(prompt_with_context)
print(output_rag)

 - ปล่อยให้ดูดนิ้วไปก่อน เพราะเป็นเรื่องปกติของเด็กทารก
- หากลูกดูดนิ้วจนนิ้วบวมแดง ให้พยายามหาวิธีอื่นในการปลอบลูก เช่น การอุ้ม การโยก หรือการร้องเพลงกล่อม
- หากลูกยังดูดนิ้วอยู่หลังอายุ 2 ขวบ ให้ปรึกษาแพทย์

English translation of the output:

- Let the baby suck their thumb for now, since this is normal for infants.
- If the thumb becomes red and swollen from sucking, try other ways of soothing the baby, such as holding, rocking, or singing a lullaby.
- If the child is still sucking their thumb after age 2, consult a doctor.
