This notebook showcases the core functionality of the project. You **should not submit a notebook** for the project; your application should have a **User Interface**.

# Relational DB Example

In [1]:
import psycopg2  # driver for connecting to PostgreSQL
import os
from dotenv import load_dotenv

load_dotenv() # load password from .env file in the same directory

True
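
If `DB_PASS` is missing, `os.getenv` returns `None` and the problem only surfaces later, when the connection is attempted. A minimal sketch of an early check follows. The helper name `require_env` is hypothetical, and the `.env` file is assumed to contain a line of the form `DB_PASS=your_password`.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Raised for both a missing variable and an empty string.
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value
```

With this helper, the connection cell could pass `password=require_env("DB_PASS")` instead of calling `os.getenv` directly.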

Connect to PostgreSQL:

In [2]:
conn = psycopg2.connect(
    host="localhost",
    port="5432",
    database="postgres",
    user="postgres",
    password=os.getenv("DB_PASS"),  # or enter your password here directly instead of using .env
)

In [3]:
cur = conn.cursor()

List all tables in the public schema of the 'postgres' database:

In [8]:
cur.execute("""
    SELECT table_name
    FROM information_schema.tables
    WHERE table_schema='public';
""")
tables = cur.fetchall()

print("Tables in database:")
for table in tables:
    print("-", table[0])

Tables in database:


Add a table:

In [5]:
cur.execute("""
    CREATE TABLE IF NOT EXISTS people (
        id SERIAL PRIMARY KEY,
        name VARCHAR(100)
    );
""")
conn.commit()
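
Once the table exists, rows can be inserted and read back through the same cursor. The sketch below is not part of the original notebook: the helper names `add_person` and `list_people` are hypothetical, and it assumes the `people` table created above and an open psycopg2 cursor.

```python
def add_person(cur, name):
    """Insert one row into people and return the generated id."""
    # %s is psycopg2's placeholder; passing the value separately lets the
    # driver quote it, which avoids SQL injection.
    cur.execute("INSERT INTO people (name) VALUES (%s) RETURNING id;", (name,))
    return cur.fetchone()[0]


def list_people(cur):
    """Return all (id, name) rows ordered by id."""
    cur.execute("SELECT id, name FROM people ORDER BY id;")
    return cur.fetchall()


# Usage with the notebook's connection (assumes `cur` and `conn` are open):
# new_id = add_person(cur, "Ada Lovelace")
# conn.commit()
# print(list_people(cur))
```

Values are passed as a separate tuple rather than formatted into the SQL string, so user input from a UI never becomes part of the query text.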

Remove a table:

In [7]:
cur.execute("DROP TABLE IF EXISTS people;")
conn.commit()

Close the connection:

In [9]:
cur.close()
conn.close()
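
Explicit `close()` calls are skipped if an earlier statement raises. One way to guarantee cleanup is `contextlib.closing`, sketched below. `fetch_all` and its `connect` argument are hypothetical names. The demo uses the standard-library `sqlite3` driver as a stand-in so that it runs without a server; with psycopg2, `connect` would be a function wrapping the `psycopg2.connect(...)` call shown earlier.

```python
import sqlite3
from contextlib import closing


def fetch_all(connect, sql, params=None):
    """Open a connection, run one query, and always close cursor and connection."""
    with closing(connect()) as conn:
        with closing(conn.cursor()) as cur:
            if params is None:
                cur.execute(sql)
            else:
                cur.execute(sql, params)
            return cur.fetchall()


# Stand-in demo with an in-memory SQLite database (not PostgreSQL).
rows = fetch_all(lambda: sqlite3.connect(":memory:"), "SELECT 1 + 1")
print(rows)  # [(2,)]
```

Note that in psycopg2, `with conn:` only ends the transaction (commit on success, rollback on error) and does not close the connection, which is why `closing` is used here.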

# Vector Database Example

In [10]:
import numpy as np
from sentence_transformers import SentenceTransformer # for text -> vector embedding
import faiss  # vector similarity search library, used here as a simple in-memory vector store


An example document including a couple of facts in each paragraph:

In [11]:
doc_text = """
The Eiffel Tower in Paris was completed in 1889 as the entrance arch to the World’s Fair. 
It is 324 meters tall and remains one of the most visited monuments in the world.

Albert Einstein published his theory of special relativity in 1905. 
This work introduced the famous equation E=mc^2, linking energy and mass.

The Amazon rainforest covers about 5.5 million square kilometers, making it the largest rainforest on Earth. 
It plays a crucial role in regulating the global climate by absorbing carbon dioxide.

Mount Everest, located in the Himalayas on the border of Nepal and China, is the highest mountain in the world. 
Its peak reaches 8,849 meters above sea level.

The Great Wall of China stretches over 21,000 kilometers. 
It was built over centuries to protect Chinese states and empires from invasions by nomadic groups.

The Moon is Earth’s only natural satellite. 
It has a diameter of about 3,474 kilometers and affects ocean tides through its gravitational pull.

Leonardo da Vinci painted the Mona Lisa in the early 16th century. 
The painting is displayed at the Louvre Museum in Paris and is considered one of the most famous artworks in history.

The Pacific Ocean is the largest and deepest of Earth’s oceans. 
It covers more than 63 million square miles and contains the Mariana Trench, the deepest oceanic trench.

The Colosseum in Rome, Italy, was built in the first century AD. 
It could hold up to 50,000 spectators and hosted gladiatorial contests and public spectacles.

The Sahara Desert is the largest hot desert in the world. 
It spans approximately 9 million square kilometers across North Africa.
"""


Split the document into overlapping windows of `max_words` words:

In [12]:
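def chunk_text(text, max_words=200, overlap=40):
    """Split text into chunks of up to max_words words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < max_words:
        # Without this check, overlap >= max_words would never advance the window.
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window slides each iteration
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + max_words]))
    return chunks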

In [13]:
chunks = chunk_text(doc_text, max_words=20, overlap=10)
print(f"Total chunks: {len(chunks)}")

Total chunks: 27
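
To make the sliding window concrete, here is a small sketch on a toy input. `sliding_chunks` is a hypothetical, compact restatement of the same windowing logic as `chunk_text`, defined here only so the example runs on its own.

```python
import math


def sliding_chunks(text, max_words, overlap):
    """Compact version of the sliding-window chunking shown above."""
    words = text.split()
    step = max_words - overlap  # the window advances by this many words
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]


toy = "a b c d e f g"
toy_chunks = sliding_chunks(toy, max_words=4, overlap=2)
print(toy_chunks)  # ['a b c d', 'c d e f', 'e f g', 'g']

# With a step of max_words - overlap, the loop emits ceil(n_words / step) chunks.
expected = math.ceil(len(toy.split()) / (4 - 2))
print(len(toy_chunks) == expected)  # True
```

The same arithmetic applies to the document: with `max_words=20` and `overlap=10` the step is 10 words, so 27 chunks implies a text of 261 to 270 words. The toy output also shows that the final chunk can be short and fully contained in the previous one.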


Generate a vector embedding of each chunk:

In [14]:
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim
emb_matrix = model.encode(chunks, convert_to_numpy=True, normalize_embeddings=True)

Store the chunk embeddings in a vector index (here, FAISS):

In [15]:
dim = emb_matrix.shape[1]
index = faiss.IndexFlatIP(dim)  # exact inner-product index; inner product equals cosine similarity for normalized vectors
index.add(emb_matrix)           # store embeddings
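
The reason an inner-product index yields cosine similarity is that cosine is the dot product divided by both vector lengths, and normalized vectors have length 1. The pure-Python sketch below checks this on two toy vectors; the helper names are illustrative and not part of FAISS or sentence-transformers.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
ip = dot(l2_normalize(a), l2_normalize(b))
print(round(ip, 6), round(cosine(a, b), 6))  # 0.96 0.96
```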

Answer a query (just retrieve the top-k results):

In [16]:
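def search(query, k=3):
    """Embed the query and return the top-k most similar chunks with their scores."""
    q_emb = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    scores, idxs = index.search(q_emb, k)  # both arrays have shape (1, k)
    results = []
    for rank, (i, s) in enumerate(zip(idxs[0], scores[0]), start=1):
        if i < 0:  # FAISS pads with -1 when the index holds fewer than k vectors
            continue
        results.append({"rank": rank, "score": float(s), "chunk": chunks[int(i)]})
    return results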
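
A flat index compares the query against every stored vector. The sketch below imitates that exhaustive scan in pure Python to show what the search computes conceptually. It is not FAISS's implementation, and `top_k_inner_product` is a hypothetical name.

```python
import heapq


def top_k_inner_product(query_vec, vectors, k=3):
    """Score every stored vector against the query and return the k best (index, score) pairs."""
    scored = (
        (i, sum(q * x for q, x in zip(query_vec, vec)))
        for i, vec in enumerate(vectors)
    )
    # nlargest keeps the k highest scores, sorted from best to worst.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-D unit vectors standing in for chunk embeddings.
store = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
print(top_k_inner_product([1.0, 0.0], store, k=2))  # [(0, 1.0), (2, 0.6)]
```

The cost of this scan grows linearly with the number of stored vectors, which is acceptable for a few dozen chunks like the ones here.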

In [19]:
query = "What is Mount Everest?"

In [20]:
hits = search(query, k=3)

print("\nTop matches:")
for h in hits:
    print(f"[{h['rank']}] score={h['score']:.3f}\n{h['chunk']}\n---")


Top matches:
[1] score=0.635
by absorbing carbon dioxide. Mount Everest, located in the Himalayas on the border of Nepal and China, is the highest
---
[2] score=0.635
It plays a crucial role in regulating the global climate by absorbing carbon dioxide. Mount Everest, located in the Himalayas
---
[3] score=0.491
on the border of Nepal and China, is the highest mountain in the world. Its peak reaches 8,849 meters above
---
