## Video Reference
- [Complete Tutorial on Vector Database - Learn ChromaDB, Pinecone & Weaviate | Generative AI](https://www.youtube.com/watch?v=8KrTO9bS91s)
- https://github.com/entbappy/Complete-Generative-AI-Course-on-YouTube/blob/main/Vector%20Database/2.Pinecone_demo.ipynb


## Import All the Required Libraries

In [13]:
from langchain_community.document_loaders import PyPDFDirectoryLoader, PyPDFLoader
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Pinecone
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import os

In [None]:
# Load API keys from .env file
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

True

## Extract the Text from the PDF

In [2]:
# loader = PyPDFDirectoryLoader("pdf_data")
loader = PyPDFLoader("Website_Report_V1.pdf")
data = loader.load()

In [3]:
data

[Document(metadata={'producer': 'Skia/PDF m137 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Website Report V1', 'source': 'Website_Report_V1.pdf', 'total_pages': 20, 'page': 0, 'page_label': '1'}, page_content='Landing  Page  Analysis  \n1.  Header  &  Navigation  \n●  Logo  &  Branding :  The  “MY  PolicyLens”  logo  in  the  top-left  immediately  identifies  the  \nplatform.\n \nIt\n \nuses\n \nbold,\n \nuppercase\n \nlettering\n \nto\n \nconvey\n \nauthority\n \nand\n \ntrust.\n \n ●  Primary  Navigation :  Links  to  Home ,  Policy ,  and  an  AI  Chatbot  tool  sit  prominently  \nacross\n \nthe\n \ntop,\n \nmaking\n \nessential\n \nfeatures\n \nreachable\n \nin\n \none\n \nclick.\n \n ●  Language  Selector :  A  toggle  between  Bahasa  Malaysia  and  English  allows  the  site  \nto\n \nserve\n \na\n \nbilingual\n \naudience,\n \nenhancing\n \naccessibility\n \nand\n \ninclusivity.\n \nInsight :  The  clear,  uncluttered  header  helps  users  orient

## Clean Data

In [None]:
import re

def clean_page(text: str) -> str:
    # Undo the artificial line breaks and spacing introduced by PDF extraction
    text = re.sub(r'\n+', ' ', text)              # Replace each run of newlines with a single space
    text = re.sub(r'(?<=\w)-\s+(?=\w)', '', text) # Rejoin words hyphenated across a line break (e.g., "subsi- dy" → "subsidy")
    text = re.sub(r'\s+', ' ', text)              # Collapse any remaining run of whitespace into one space
    return text.strip()

def clean_documents(docs: list[Document]) -> list[Document]:
    cleaned_docs = []
    for doc in docs:
        cleaned_text = clean_page(doc.page_content)
        cleaned_docs.append(Document(page_content=cleaned_text, metadata=doc.metadata))
    return cleaned_docs
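
A quick way to sanity-check the cleaning rules is to run them on a small made-up string that imitates the one-word-per-line layout of the raw PDF text. The sketch below repeats `clean_page` so it runs on its own; the sample strings are invented for illustration and are not taken from the report.

```python
import re


def clean_page(text: str) -> str:
    """Same three steps as the cell above, repeated so this snippet is self-contained."""
    text = re.sub(r'\n+', ' ', text)               # newlines -> single space
    text = re.sub(r'(?<=\w)-\s+(?=\w)', '', text)  # rejoin words hyphenated across a break
    text = re.sub(r'\s+', ' ', text)               # collapse leftover whitespace
    return text.strip()


# Invented sample: a word split by a hyphen at a line end, then one word per line
raw = "diesel-subsi-\ndy\n \noutlay\n \nsoared"
print(clean_page(raw))

# Caveat: a genuine compound that happens to break at its hyphen loses the hyphen
print(clean_page("top-\nleft"))
```

The second call shows a limitation of the hyphen rule: it cannot tell a line-break hyphen from a real one, so a compound such as "top-left" is merged if the line happens to break at its hyphen.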


In [None]:
clean_data = clean_documents(data)
print(clean_data)

[Document(metadata={'producer': 'Skia/PDF m137 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Website Report V1', 'source': 'Website_Report_V1.pdf', 'total_pages': 20, 'page': 0, 'page_label': '1'}, page_content='Landing Page Analysis 1. Header & Navigation ● Logo & Branding : The “MY PolicyLens” logo in the top-left immediately identifies the platform. It uses bold, uppercase lettering to convey authority and trust. ● Primary Navigation : Links to Home , Policy , and an AI Chatbot tool sit prominently across the top, making essential features reachable in one click. ● Language Selector : A toggle between Bahasa Malaysia and English allows the site to serve a bilingual audience, enhancing accessibility and inclusivity. Insight : The clear, uncluttered header helps users orient themselves quickly and supports multilingual engagement. 2. Hero Section ● Headline : “Making Policies Transparent & Understandable for Everyone” succinctly states the platform’s mission

In [16]:
print(clean_data[0])

page_content='Landing Page Analysis 1. Header & Navigation ● Logo & Branding : The “MY PolicyLens” logo in the top-left immediately identifies the platform. It uses bold, uppercase lettering to convey authority and trust. ● Primary Navigation : Links to Home , Policy , and an AI Chatbot tool sit prominently across the top, making essential features reachable in one click. ● Language Selector : A toggle between Bahasa Malaysia and English allows the site to serve a bilingual audience, enhancing accessibility and inclusivity. Insight : The clear, uncluttered header helps users orient themselves quickly and supports multilingual engagement. 2. Hero Section ● Headline : “Making Policies Transparent & Understandable for Everyone” succinctly states the platform’s mission in plain language. ● Subheading : “Navigating policies made easy — breaking down complex policies into straightforward, accessible information for everyone.” reinforces the value proposition. ● Call to Action (CTA) : A promi

## Split the Extracted Data into Text Chunks

In [17]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
text_chunks = text_splitter.split_documents(clean_data)

In [18]:
print(len(text_chunks))
text_chunks[0]

93


Document(metadata={'producer': 'Skia/PDF m137 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Website Report V1', 'source': 'Website_Report_V1.pdf', 'total_pages': 20, 'page': 0, 'page_label': '1'}, page_content='Landing Page Analysis 1. Header & Navigation ● Logo & Branding : The “MY PolicyLens” logo in the top-left immediately identifies the platform. It uses bold, uppercase lettering to convey authority and trust. ● Primary Navigation : Links to Home , Policy , and an AI Chatbot tool sit prominently across the top, making essential features reachable in one click. ● Language Selector : A toggle between Bahasa Malaysia and English allows the site to serve a bilingual audience, enhancing accessibility')
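
To see what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not the actual algorithm of `RecursiveCharacterTextSplitter`, which also tries to split on separators such as paragraph breaks and spaces. `chunk_text` is a hypothetical helper written only for this illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

Each chunk repeats the last few characters of the previous one, so a sentence that straddles a boundary is less likely to be cut off from its context. A larger overlap improves continuity but produces more chunks to embed and store.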

## Set the OpenAI API Key

In [None]:
import getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass("OpenAI API key: ")  # prompt for the key instead of hard-coding it

## Initialize the Embedding Model

In [20]:
# Use a specific embedding model
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

In [None]:
text = "LangChain is an AI framework for LLMs."
vector = embeddings_model.embed_query(text)

print(len(vector))
print(vector[:5])  # Print first 5 values for readability


1536
[0.0009549352689646184, 0.0018590498948469758, 0.0324464850127697, -0.013298600912094116, 0.039292510598897934]


## Initialize Pinecone

In [27]:
import pinecone

# Read the key from the environment (set in the shell or a .env file); never hard-code secrets in a notebook
pc = pinecone.Pinecone(api_key=os.environ['PINECONE_API_KEY'])
index_name = "fit5120-tm01"
index = pc.Index(index_name)

## Create Embeddings for Each Text Chunk

In [28]:
# Pinecone.from_texts() vs Pinecone.from_documents():
# .from_documents() stores each chunk's metadata; .from_texts() stores only the text unless metadatas= is passed

docsearch = Pinecone.from_texts([t.page_content for t in text_chunks], embeddings_model, index_name=index_name)

## Load an Existing Index

In [20]:
docsearch = Pinecone.from_existing_index(index_name, embeddings_model)
docsearch

<langchain_community.vectorstores.pinecone.Pinecone at 0x211027ec050>

## Similarity Search

In [23]:
query = "How much is the diesel subsidy expenditure in 2024"

docs = docsearch.similarity_search(query, k=3)

In [24]:
docs

[Document(metadata={}, page_content='understanding.\n \n \nDiesel  Dilemma  Page  Analysis   \nEscalation  of  Subsidy  Expenditure  \nBetween\n \n2019\n \nand\n \n2023,\n \nMalaysia’s\n \ndiesel-subsidy\n \noutlay\n \nsoared\n \nfrom\n \nRM\n \n1.4\n \nbillion\n \nto\n \nRM\n \n14.3\n \nbillion—a\n \nmore\n \nthan\n \nten-fold\n \nincrease\n \n(Ministry\n \nof\n \nFinance\n \nMalaysia,\n \n2024).\n \nThis\n \n920\n \npercent\n \nrise\n \nplaced\n \nunsustainable\n \npressure\n \non\n \npublic\n \nfinances,\n \nconsuming\n \nan\n \never-larger\n \nshare\n \nof\n \nthe\n \nfederal\n \nbudget.\n \nAccording\n \nto'),
 Document(metadata={}, page_content='4.  Post-Reform  Period  (Q2  2024–Q2  2025)  \n●  No  Step  Increase:  Unlike  Peninsular  diesel,  the  Diesel  East  series  shows  no  \nsignificant\n \njump\n \nin\n \nQ2\n \n2024.\n \nIt\n \nremains\n \nat\n \n~RM\n \n2.18/L\n \nthroughout\n \n2024–2025.\n \n ●  Implication:  The  targeted-subsidy  reform  of  June  2024  did  not  
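
Similarity search ranks stored chunks by how close their embedding vectors are to the query's embedding. The sketch below shows that ranking step with cosine similarity on toy 3-dimensional vectors. The vectors and names are invented, real embeddings from `text-embedding-3-small` have 1536 dimensions, and the metric Pinecone actually uses depends on how the index was created.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (name, score) pairs with the highest cosine similarity to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for chunk embeddings (hypothetical values)
toy_index = {
    "diesel": [0.9, 0.1, 0.0],
    "hero": [0.0, 1.0, 0.0],
    "nav": [0.5, 0.5, 0.0],
}
toy_query = [1.0, 0.0, 0.0]

for name, score in top_k(toy_query, toy_index, k=2):
    print(name, round(score, 3))
```

In the notebook, `docsearch.similarity_search(query, k=3)` performs the equivalent steps remotely: it embeds the query with `embeddings_model` and asks the Pinecone index for the three nearest stored vectors.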

## Create an LLM Wrapper

In [25]:
llm_gpt4 = ChatOpenAI(model="gpt-4.1-nano")  # alternative: gpt-4.1-mini

qa = RetrievalQA.from_chain_type(llm=llm_gpt4, chain_type="stuff", retriever=docsearch.as_retriever(search_kwargs={"k": 3}))
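
The `"stuff"` chain type places all retrieved chunks into a single prompt and sends it to the model in one call. The sketch below illustrates the idea only: `build_stuff_prompt` is a hypothetical helper, and its wording is not LangChain's actual prompt template.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Invented chunks standing in for the k=3 retriever results
retrieved = [
    "Chunk one about subsidy expenditure.",
    "Chunk two about the post-reform period.",
    "Chunk three about fuel prices.",
]
print(build_stuff_prompt("How much was spent on diesel subsidies?", retrieved))
```

Because every chunk goes into one prompt, this approach is simple and needs only one model call, but the combined chunks must fit within the model's context window. With `k=3` and 500-character chunks, as configured here, that is a small prompt.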

## Q/A

In [26]:
query = "How much is the diesel subsidy expenditure in 2024"
qa.invoke(query)

{'query': 'How much is the diesel subsidy expenditure in 2024',
 'result': 'The diesel subsidy expenditure in Malaysia in 2024 is not explicitly provided in the given data. However, it is stated that the expenditure soared from RM 1.4 billion in 2019 to RM 14.3 billion in 2023, representing a more than ten-fold increase. Since 2024 falls after 2023, the expenditure is likely still very high, but the exact figure for 2024 is not specified.'}

In [27]:
# Simple interactive loop: type 'exit' to quit, empty input is ignored
while True:
    user_input = input("Input Prompt: ")
    if user_input == 'exit':
        print('Exiting')
        break
    if user_input == '':
        continue
    result = qa.invoke({'query': user_input})
    print(f"Answer: {result['result']}")

Answer: The diesel subsidy expenditure in Malaysia in 2023 was RM 14.3 billion.
Exiting
