In [1]:
%pip install -U cohere


Collecting cohere
  Downloading cohere-5.14.0-py3-none-any.whl.metadata (3.4 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere)
  Downloading fastavro-1.10.0-cp312-cp312-macosx_10_13_universal2.whl.metadata (5.5 kB)
Collecting types-requests<3.0.0,>=2.0.0 (from cohere)
  Downloading types_requests-2.32.0.20250328-py3-none-any.whl.metadata (2.3 kB)
Downloading cohere-5.14.0-py3-none-any.whl (253 kB)
Downloading fastavro-1.10.0-cp312-cp312-macosx_10_13_universal2.whl (1.0 MB)
Downloading types_requests-2.32.0.20250328-py3-none-any.whl (20 kB)
Installing collected packages: types-requests, fastavro, cohere
Successfully installed cohere-5.14.0 fastavro-1.10.0 types-requests-2.32.0.20250328
Note: you may need to restart the kernel to use updated packages.


In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders import DirectoryLoader, PyPDFLoader

In [8]:
# Define the directory containing your PDF documents
documents_dir = "../document/"

# Create a DirectoryLoader to load all PDF files from the directory
loader = DirectoryLoader(
    documents_dir,
    glob="**/*.pdf",  # This will load all PDF files recursively
    loader_cls=PyPDFLoader,
    show_progress=True
)

# Load the documents
documents = loader.load()

# Print the number of documents loaded
print(f"Loaded {len(documents)} documents")

100%|██████████| 3/3 [00:00<00:00,  4.74it/s]

Loaded 48 documents
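
The progress bar counts 3 files while `len(documents)` reports 48. That is consistent with `PyPDFLoader` producing one document per PDF page rather than one per file. A minimal sketch for checking this by grouping on the `source` key of each document's metadata follows. `FakeDoc` and the file names are stand-ins invented so the block runs without the PDFs; in the notebook, `documents` could be passed to `pages_per_source` directly.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a loaded document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(docs):
    """Count how many page-level documents came from each source file."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in docs)


# Made-up sample: two pages from one file, one page from another
sample_docs = [
    FakeDoc("page text", {"source": "a.pdf", "page": 0}),
    FakeDoc("page text", {"source": "a.pdf", "page": 1}),
    FakeDoc("page text", {"source": "b.pdf", "page": 0}),
]

counts = pages_per_source(sample_docs)
for source, n_pages in counts.items():
    print(f"{source}: {n_pages} pages")
```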





In [9]:
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

# Split documents into chunks
texts = text_splitter.split_documents(documents)

# Print the number of chunks created
print(f"Created {len(texts)} chunks")

Created 217 chunks
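
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on the listed separators); it only illustrates how consecutive chunks share `chunk_overlap` characters. The name `sliding_chunks` and the toy sizes are assumptions made for this sketch.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


demo_chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(demo_chunks)
```

With the notebook's settings (1000 and 200), each window would advance by 800 characters in this simplified model.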


In [10]:
# Initialize embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Create vector store
vectorstore = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Report how many chunks were embedded and passed to the vector store
print(f"Stored {len(texts)} vectors in the database")



Stored 217 vectors in the database


In [11]:
results = vectorstore.similarity_search(
    "What are the main principles of Attention Block?",
    k=5,
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
signiﬁcant improvements in computational efﬁciency through factorization tricks [18] and conditional
computation [26], while also improving model performance in case of the latter. The fundamental
constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduc-
tion models in various tasks, allowing modeling of dependencies without regard to their distance in [{'author': 'Ashish Vaswani, Noam Shazeer, Niki 
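
`similarity_search` embeds the query and returns the `k` stored chunks closest to it. The sketch below shows that idea with cosine similarity over tiny hand-written vectors. The vectors, labels and the `top_k` helper are illustrative assumptions; Chroma's actual index and its default distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, items, k):
    """Return the texts of the k items whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy 2-dimensional "embeddings" (made up for illustration)
toy_index = [
    ("attention", [1.0, 0.0]),
    ("recurrence", [0.0, 1.0]),
    ("self-attention", [0.9, 0.1]),
]

nearest = top_k([1.0, 0.0], toy_index, k=2)
print(nearest)
```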

In [12]:
from typing import List

def search_documents(
    query: str,
    vectorstore: Chroma,
    k: int = 5
) -> List[str]:
    """
    Perform similarity search on the vector store using the provided query.
    
    Args:
        query (str): The search query from the user
        vectorstore (Chroma): The initialized Chroma vector store
        k (int, optional): Number of results to return. Defaults to 5.
    
    Returns:
        List[str]: List of page content from the documents
    """
    try:
        results = vectorstore.similarity_search(
            query,
            k=k
        )
        return [doc.page_content for doc in results]
    except Exception as e:
        print(f"Error performing similarity search: {str(e)}")
        return []

In [13]:
import os

from dotenv import load_dotenv
from openai import OpenAI

# Read API keys from a local .env file into the environment
load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
client = OpenAI(api_key=OPENAI_API_KEY)

In [14]:
import cohere
COHERE_API_KEY = os.getenv('COHERE_API_KEY')
co = cohere.ClientV2(COHERE_API_KEY)

In [22]:
query = '''What are the main principles of Attention Block?'''
context = search_documents(query, vectorstore, k=10)
# Rerank the documents
results = co.rerank(
    model="rerank-v3.5", query=query, documents=context, top_n=3
)
for result in results.results:
    print(result)


document=None index=3 relevance_score=0.074966356
document=None index=0 relevance_score=0.05086082
document=None index=4 relevance_score=0.04248727
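
Each result above identifies a document by its `index` into the list that was sent, alongside a `relevance_score`. A hedged sketch of mapping such results back to text while dropping weak matches follows. `RerankHit`, `select_reranked` and the `min_score` cutoff are hypothetical names and values chosen for illustration, not part of the Cohere SDK; the scores are copied from the output above and the chunk texts are placeholders.

```python
from dataclasses import dataclass


@dataclass
class RerankHit:
    """Hypothetical stand-in for one rerank result (index + score only)."""
    index: int
    relevance_score: float


def select_reranked(candidates, hits, min_score=0.0):
    """Map hits back to candidate texts, keeping those at or above min_score."""
    return [candidates[h.index] for h in hits if h.relevance_score >= min_score]


candidates = ["chunk A", "chunk B", "chunk C", "chunk D", "chunk E"]
hits = [
    RerankHit(index=3, relevance_score=0.074966356),
    RerankHit(index=0, relevance_score=0.05086082),
    RerankHit(index=4, relevance_score=0.04248727),
]

selected = select_reranked(candidates, hits, min_score=0.05)
print(selected)
```

Whether a cutoff is useful, and what value to pick, depends on the data; the 0.05 here is arbitrary.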


In [26]:
# Map each reranked result back to its original text via the returned index
reranked_context = [context[result.index] for result in results.results]
print(reranked_context)

['The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU\n[20], ByteNet [15] and ConvS2S [8], all of which use convolutional neural networks as basic building\nblock, computing hidden representations in parallel for all input and output positions. In these models,\nthe number of operations required to relate signals from two arbitrary input or output positions grows\nin the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes\nit more difﬁcult to learn dependencies between distant positions [ 11]. In the Transformer this is\nreduced to a constant number of operations, albeit at the cost of reduced effective resolution due\nto averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as\ndescribed in section 3.2.\nSelf-attention, sometimes called intra-attention is an attention mechanism relating different positions', 'Recurrent models typically factor computation along the 

In [27]:
# Join the reranked chunks into one text block rather than interpolating the list's repr
context_text = "\n\n".join(reranked_context)
systemPrompt = f'''You are an intelligent bot. You will be given a text and must answer the question based on that text.

{context_text}
'''
conversationHistory = [
    {"role": "system", "content": systemPrompt},
    {"role": "user", "content": query}
]

def Answer_Question(conversationHistory):
    """Send the conversation to the chat model and return the raw response."""
    response = client.chat.completions.create(
        model="o3-mini",
        messages=conversationHistory,
    )
    return response

response = Answer_Question(conversationHistory)
print(response.choices[0].message.content)

The Attention Block is built on several key ideas:

1. Parallelized Dependency Modeling: Instead of processing tokens one by one (as in recurrent models), self-attention allows all positions in a sequence to be computed in parallel. Each token can directly interact with all other tokens, making it much easier to capture long-range dependencies.

2. Scaled Dot-Product Attention: The core computation involves comparing “queries” with “keys” using a dot product to measure similarity, and then using the result to weight “value” vectors. Because large dot products (especially with high-dimensional keys) can push the softmax into regions with very small gradients, the dot products are scaled by 1/√(dk) (where dk is the dimensionality of the keys) to keep the gradients in a healthy range.

3. Multi-Head Attention: Rather than performing a single attention operation, the model first projects the queries, keys, and values several times using different learned linear transformations. These separ