Load Libraries

In [21]:
pip install sentence-transformers numpy python-dotenv chromadb groq pypdf




In [22]:
from sentence_transformers import SentenceTransformer
from groq import Groq
import os
from pypdf import PdfReader
import chromadb
import numpy as np
from numpy import linalg

Load the embedding model

In [23]:
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertModel LOAD REPORT from: BAAI/bge-small-en-v1.5
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [24]:
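def normalize_embeddings(embeddings):
  # Accept lists or arrays and scale each row to unit L2 norm.
  embeddings = np.asarray(embeddings, dtype=float)
  norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
  return (embeddings / norms).tolist()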
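
Why normalize at all? A minimal sketch, using only the standard library and toy 2-D vectors (the helper names `normalize`, `dot`, `cosine`, and `squared_l2` are made up for this illustration and are not part of the notebook): once vectors have unit length, their dot product equals their cosine similarity, and their squared L2 distance is a simple function of it, so either metric ranks neighbours the same way.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

# For unit vectors: dot(ua, ub) == cosine(a, b)
# and squared_l2(ua, ub) == 2 - 2 * cosine(a, b), up to rounding.
print(dot(ua, ub), cosine(a, b), squared_l2(ua, ub))
```

This is the reason the query embedding in `ask` is normalized the same way as the stored chunk embeddings.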

Function to extract text from documents

In [26]:
def extract_text(path):
  reader = PdfReader(path)
  text = ""
  for page in reader.pages:
    # Guard against pages with no extractable text (e.g. scanned images).
    text += (page.extract_text() or "") + "\n"
  return text

Load the PDF and extract its text

In [27]:
text = extract_text("/content/The_Constitution_of_Kenya_2010.pdf")

Split the text into chunks

In [28]:
def chunk_text(text, size=300):
  words = text.split()
  chunks=[]
  for i in range(0, len(words), size):
    chunks.append(" ".join(words[i:i+size]))
  return chunks
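
Fixed-size chunks can cut a sentence or an article in half at a chunk boundary. One common mitigation is to let consecutive chunks share a few words. The sketch below is an assumption, not what this notebook runs: `chunk_text_overlap` and its `overlap` parameter are hypothetical names, and the default of 50 words is an arbitrary starting point.

```python
def chunk_text_overlap(text, size=300, overlap=50):
    """Split text into word chunks where consecutive chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Tiny demo: 10 words, windows of 4 words that share 1 word with the previous window.
sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text_overlap(sample, size=4, overlap=1))
```

With `overlap=0` this behaves like `chunk_text` above.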

In [29]:
chunks = chunk_text(text)

Embed the chunks

In [30]:
const_embeddings = model.encode(chunks).tolist()

In [31]:
const_embeddings= normalize_embeddings(const_embeddings)

In [32]:
const_embeddings

[[-0.04393754432576429,
  0.013259949070658868,
  0.05073908292569812,
  0.002190050807648674,
  0.018501256832412542,
  0.0569073631130003,
  0.06146608682148055,
  -0.04996019545550041,
  -0.029238769772792172,
  0.04982655438287366,
  -0.0752109709308261,
  0.015212235072602413,
  0.00015892007928506403,
  -0.0030432613164651443,
  0.028794467409285512,
  -0.0017309869803999313,
  -0.009391095152414453,
  -0.015077726007281179,
  -0.08905229536152308,
  0.04579225844935662,
  0.10125244304734317,
  0.030525263468545564,
  -0.041976763730079955,
  0.02455953477817324,
  0.08305848170581283,
  0.04704508110661592,
  0.022872073161053198,
  0.020948195293401438,
  0.02859072381938469,
  -0.15442245686060438,
  -0.06644209894915813,
  -0.03275338395148667,
  0.013072249375766556,
  0.07311477229893017,
  -0.022215200131724743,
  -0.01912190141129513,
  -0.02141947434831208,
  0.006115322854050804,
  -0.005977460095926085,
  0.04851513663080067,
  0.022882012236198658,
  0.04118728153360

Set up ChromaDB to store the documents and embeddings

In [33]:
client = chromadb.PersistentClient(path="./chroma_db")

In [34]:
# get_or_create_collection avoids the "Collection [documents] already exists"
# error that create_collection raises when this cell is re-run.
collection = client.get_or_create_collection(
    name="documents", metadata={"description": "My document collection"}
)

print("collection ready:", collection.name)

In [35]:
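# Re-encode and normalize so the stored vectors match the normalized query embeddings.
const_embeddings = normalize_embeddings(model.encode(chunks))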

In [36]:
ids = [f"doc_{i}" for i in range(len(chunks))]

Add the chunks and embeddings to the collection

In [37]:
collection.add(
    documents=chunks,
    embeddings=const_embeddings,
    ids=ids
)
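
For a document this size a single `add` call is fine. If the collection grows much larger, a single call may become slow or hit a size limit, so it can help to insert in batches. The sketch below shows only the batching logic with stand-in data; `batched` and `BATCH_SIZE` are hypothetical names, and the real insert call is left as a comment.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in data; in the notebook these would be `chunks`, `const_embeddings`, and `ids`.
demo_chunks = [f"chunk {i}" for i in range(5)]
demo_embeddings = [[float(i), 0.0] for i in range(5)]
demo_ids = [f"doc_{i}" for i in range(5)]

BATCH_SIZE = 2  # hypothetical value chosen for the demo

for docs, embs, batch_ids in zip(
    batched(demo_chunks, BATCH_SIZE),
    batched(demo_embeddings, BATCH_SIZE),
    batched(demo_ids, BATCH_SIZE),
):
    # In the notebook this is where the insert call would go, e.g.
    # collection.add(documents=docs, embeddings=embs, ids=batch_ids)
    print(batch_ids)
```

Slicing all three lists with the same helper keeps documents, embeddings, and ids aligned within each batch.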

Add inference (the brain)

In [38]:
pip install groq



In [39]:
from groq import Groq
import os

In [41]:
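# "xxxx" is a redacted placeholder; supply your own key, ideally from an
# environment variable rather than hard-coding it in the notebook.
groq_client = Groq(api_key="xxxx")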

In [46]:
def generate_answer(question, retrieved_docs):
  # Join the retrieved chunks into one context block for the prompt.
  context = "\n\n".join(retrieved_docs)

  system_prompt = """
  You are an expert assistant.
  Answer ONLY using the provided context.
  If the answer is not in the context, say:
  "The document does not contain this information"
  """

  user_prompt = f"""
  Context:
  {context}

  Question:
  {question}
  """

  response = groq_client.chat.completions.create(
      model="meta-llama/llama-4-scout-17b-16e-instruct",
      messages=[
          {"role": "system", "content": system_prompt},
          {"role": "user", "content": user_prompt}
      ],
      temperature=0,  # deterministic output for factual answers
      max_tokens=800
  )

  return response.choices[0].message.content
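
The context string is built with a plain join, which gives the model no way to tell chunks apart and no cap on prompt size. A minimal alternative, assuming a character budget is an acceptable stand-in for a token budget: `build_context` and `max_chars` are hypothetical names, and 6000 is an arbitrary default.

```python
def build_context(retrieved_docs, max_chars=6000):
    """Join retrieved chunks into one numbered context string, within a character budget."""
    parts = []
    used = 0
    for i, doc in enumerate(retrieved_docs, start=1):
        block = f"[{i}] {doc.strip()}"
        # Always keep the first chunk; drop later ones that would exceed the budget.
        if parts and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)

print(build_context(["first chunk ", " second chunk"], max_chars=100))
```

The numbered labels also make it possible to ask the model to mention which chunk an answer came from.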

In [44]:
def ask(question):
  query_embedding = model.encode([question])
  query_embedding = normalize_embeddings(query_embedding)

  results = collection.query(
      query_embeddings=query_embedding,
      n_results=3
  )

  retrieved_docs = results["documents"][0]
  return generate_answer(question, retrieved_docs)
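
As a mental model of the retrieval step, the sketch below ranks toy vectors by dot product with a brute-force scan. This is an illustration only: `top_k` is a hypothetical helper, and the vector store uses its own index and distance metric rather than a linear scan.

```python
def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k vectors with the highest dot product against the query."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy unit vectors standing in for chunk embeddings.
docs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query = [0.8, 0.6]

print(top_k(query, docs, k=2))
```

In `ask`, `n_results=3` plays the role of `k`, and the returned chunk texts become the context passed to `generate_answer`.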

In [48]:
answer = ask("what document say about freedom?")
print(answer)


The Constitution of Kenya, 2010, specifically Articles 33, 34, 35, 36, and 37, outline the following freedoms:

1. **Freedom of Expression (Article 33)**: 
   - Every person has the right to freedom of expression, which includes freedom to seek, receive or impart information or ideas, freedom of artistic creativity, and academic freedom and freedom of scientific research.
   - The right does not extend to propaganda for war, incitement to violence, hate speech, or advocacy of hatred.

2. **Freedom of the Media (Article 34)**:
   - Freedom and independence of electronic, print, and all other types of media are guaranteed.
   - The State shall not exercise control over or interfere with any person engaged in broadcasting, production or circulation of any publication, or dissemination of information by any medium.

3. **Access to Information (Article 35)**:
   - Every citizen has the right of access to information held by the State and information held by another person required for the e

In [50]:
answer = ask("owning land in kenya")
print(answer)


According to the context, all land in Kenya belongs to the people of Kenya collectively as a nation, as communities, and as individuals (Article 61 (1)). Land in Kenya is classified into three categories: public, community, or private (Article 61 (2)). 

There is no information on how individuals or communities can own land, but it is mentioned that public land includes land that was unalienated government land, land held by State organs, and land transferred to the State (Article 62 (1)). 

For more information on specifics of owning land in Kenya,  The document does not contain this information.
