In [40]:
import pandas as pd
text_chunks_path = './data/text_chunks.csv'
text_chunks_df = pd.read_csv(text_chunks_path, index_col=0)
text_chunks_dict = text_chunks_df.to_dict(orient='records')

[{'type': 'P',
  'text': 'The pace at which computer systems change was, is, and continues to be overwhelming. From 1945, when the modern computer era began, until about 1985, computers were large and expensive. Moreover, lacking a way to connect them, these computers operated independently of one another. ',
  'chapter': '01 INTRODUCTION ',
  'parent_chapter': nan,
  'char_count': 282,
  'word_count': 44,
  'sentence_count': 3,
  'token_count': 70.5},
 {'type': 'P[2]',
  'text': 'Starting in the mid-1980s, however, two advances in technology began to change that situation. The first was the development of powerful microprocessors. Initially, these were 8-bit machines, but soon 16-, 32-, and 64-bit CPUs became common. With powerful multicore CPUs, we now are again facing the challenge of adapting and developing programs to exploit parallelism. In any case, the current generation of machines have the computing power of the mainframes deployed 30 or 40 years ago, but for 1/1000th of the 

In [9]:
from sentence_transformers import SentenceTransformer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
embedding_model_id = "all-mpnet-base-v2"
embedding_model = SentenceTransformer(model_name_or_path=embedding_model_id, device=device)



In [2]:
# %%time
# import copy 
# text_chunks_dict_copy = copy.deepcopy(text_chunks_dict)
# from tqdm.auto import tqdm
# for item in tqdm(text_chunks_dict_copy):
#     item["embedding"] = embedding_model.encode(item["text"]) 

In [None]:
%%time
text_list = [item["text"] for item in text_chunks_dict]
# Embed texts in batches
text_chunks_embeddings =embedding_model.encode(text_list, batch_size=32, convert_to_tensor=False, show_progress_bar=True)
for i, item in enumerate(text_chunks_dict):
    item["embedding"] = text_chunks_embeddings[i]

##### Save chunks w/ embeddings as CSV.

In [43]:
text_chunks_with_embeddings_df = pd.DataFrame(text_chunks_dict)
chunks_with_embeddings_path = "./data/text_chunks_with_embeddings.csv"
text_chunks_with_embeddings_df.to_csv(chunks_with_embeddings_path)

#### Load CSV.

Load CSV with chunks and its embedding.

Process the CSV into right format.

Create the embedding model.

Create a search pipeline for the query and the embeddings.

In [7]:
import random
import torch
import pandas as pd
import numpy as np
import helpers
import importlib
importlib.reload(helpers)
from helpers import import_chunks_with_embeddings, get_chunks_embeddings_as_tensor

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
csv_path = "./data/text_chunks_with_embeddings.csv"

chunks_with_embeddings = import_chunks_with_embeddings(csv_path)

embeddings = get_chunks_embeddings_as_tensor(chunks_with_embeddings).to(device)
embeddings.shape

torch.Size([2898, 768])

In [5]:
# Create the model
from sentence_transformers import SentenceTransformer, util

embedding_model = SentenceTransformer('all-mpnet-base-v2', device=device)



##### Search Pipeline


In [6]:
from helpers import retrieve_relevant_resources, print_top_results_and_scores

query = "What is a STUB?"

print_top_results_and_scores(
    *retrieve_relevant_resources(query=query, 
                                 embeddings=embeddings, 
                                 embedding_model=embedding_model),
    chunks_with_embeddings
)

Time taken to compute dot scores on (2898): 8.282496128231287e-05 seconds
Score: 0.5530380010604858
Chapter: Object-based architectural style
Text:
The server-side stub is often referred to as a skeleton as it provides the bare
means for letting the server middleware access the user-defined objects. In
practice, it often contains incomplete code in the form of a language-specific
class that needs to be further specialized by the developer.


Score: 0.5495573878288269
Chapter: 4.2.2 Parameter passing
Text:
The function of the client stub is to take its parameters, pack them into a
message, and send them to the server stub. While this sounds straightforward, it
is not quite as simple as it at first appears.


Score: 0.4721587896347046
Chapter: Note 4.8 (Advanced: Implementing stubs as global references revisited)
To provide a more in-depth insight in the working of sockets, let us look at a
more elaborate example, namely the use of stubs as global references.
Text:
To use a stub as a glo

### LLM

In [1]:
import importlib
import resource_utils
importlib.reload(resource_utils)
from resource_utils import *

print(f"Total     GPU memory: {get_total_gpu_memory()} GB")
print(f"Available GPU memory: {round(get_available_vram()[0]/1024, 1)} GB")


Total     GPU memory: 24 GB
Available GPU memory: 23.6 GB


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available
from transformers import BitsAndBytesConfig

# 4-bit quantization configuration
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

# Attention implementation, either 'sdpa' or 'flash_attention_2'
if(is_flash_attn_2_available()) and (torch.cuda.get_device_capability(0)[0] >= 8):
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa"
print(f"Using {attn_implementation} attention")

# Model
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Instantiate tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)

llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id, 
                                                 torch_dtype=torch.float16,
                                                 low_cpu_mem_usage=False,
                                                 attn_implementation=attn_implementation)

model = llm_model.to("cuda")


Using sdpa attention


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]



In [3]:
print(get_model_num_params(llm_model))
print(get_model_mem_size(llm_model))

8030261248
{'model_mem_bytes': 16194748416, 'model_mem_mb': 15444.52, 'model_mem_gb': 15.08}


### Generate text with Llama 3 8B

In [10]:
input_text = "What are the fundamental principles of Named Data Networking (NDN) and how does it differ from traditional host-based networking?"
print(f"Input text: {input_text}")

# Prompt template
message_template = [
    { "role": "system", "content": "You are Study-Buddy. An educatinal chatbot that will aid students in their studies." },
    { "role": "user", "content": input_text }
]

prompt = tokenizer.apply_chat_template(
    message_template,
    add_generation_prompt=True,
    #return_tensors="pt"#
    tokenize=False # keep as raw text
)

print(f"\nPrompt formatted:\n{prompt}")

Input text: What are the fundamental principles of Named Data Networking (NDN) and how does it differ from traditional host-based networking?

Prompt formatted:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are Study-Buddy. An educatinal chatbot that will aid students in their studies.<|eot_id|><|start_header_id|>user<|end_header_id|>

What are the fundamental principles of Named Data Networking (NDN) and how does it differ from traditional host-based networking?<|eot_id|><|start_header_id|>assistant<|end_header_id|>




In [11]:
%%time
# Tokenize the input text (turn it into numbers) and sen iut to the GPU
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = llm_model.generate(**input_ids, max_new_tokens=500)

# print(f"Model output (tokens):\n{outputs[0]}\n")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


CPU times: user 10.1 s, sys: 14.3 ms, total: 10.1 s
Wall time: 10.1 s


In [12]:
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}")

Model output (decoded):
<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are Study-Buddy. An educatinal chatbot that will aid students in their studies.<|eot_id|><|start_header_id|>user<|end_header_id|>

What are the fundamental principles of Named Data Networking (NDN) and how does it differ from traditional host-based networking?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I'd be happy to help you with that!

Named Data Networking (NDN) is a new networking paradigm that focuses on naming and retrieving data instead of addressing and routing packets. It's a significant departure from traditional host-based networking, which focuses on addressing and routing packets between hosts.

Here are the fundamental principles of NDN:

1. **Named Data**: In NDN, data is identified by its name, rather than its location or sender. This allows for more efficient and flexible data retrieval, as data can be retrieved directly by its name, without knowing i

In [4]:
with open('questions.md', 'r') as file:
    query_list = file.read().splitlines()

In [10]:
import random
from helpers import retrieve_relevant_resources

query = random.choice(query_list)
print(f"Query: {query}")

# Get just scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings, embedding_model=embedding_model)

scores, indices

Query: What are the advantages and disadvantages of using threads compared to processes in a distributed system?
Time taken to compute dot scores on (2898): 9.622599463909864e-05 seconds


(tensor([0.7697, 0.6948, 0.6759, 0.6685, 0.6663], device='cuda:0'),
 tensor([554, 505, 534, 488, 494], device='cuda:0'))

## Augmenting prompt with context items

In [11]:
get_available_vram()

[7800]

In [12]:

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

SYS_PROMPT = """You are Study-Buddy. An educational chatbot that will aid students in their studies.
You are given the extracted parts of curriulum specific documents and a question. Provide a conversational and educational answer with good and easily read formatting.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
If you don't know the answer, just say "I do not know." Don't make up an answer.
"""

def get_user_prompt(query: str, retreived_documents: list[dict]):
    """
    Formats the prompt with the query and the retreived documents.
    """
    base_prompt = f"Query: {query}\nContext:"
    for item in retreived_documents:
        base_prompt += f"\n- {item['text']}"
    # base_prompt += [item["text"] for item in retreived_documents]
    
    # context_items = [item for i, item in enumerate(chunks_with_embeddings) if i in indices]
    # prompt = prompt_formatter(query=query, context_items=context_items)
    return base_prompt

def format_prompt(formatted_prompt: str):
    message = [
        { "role": "system", "content": SYS_PROMPT },
        { "role": "user", "content": formatted_prompt }
    ]
    return message
    
def generate_response(prompt: str):
    
    input_ids = tokenizer.apply_chat_template(
        prompt,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")
    
    outputs = model.generate(
        input_ids, 
        max_new_tokens=1024, 
        eos_token_id = terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    
    response = outputs[0][input_ids.shape[-1]:]
    return tokenizer.decode(response, skip_special_tokens=True)

def study_buddy(query: str):
    scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings, embedding_model=embedding_model, print_time=False)
    user_prompt = get_user_prompt(query=query, retreived_documents=[chunks_with_embeddings[i] for i in indices])
    formatted_prompt = format_prompt(user_prompt) 
    return generate_response(formatted_prompt)

In [13]:
from helpers import print_wrapped
query = random.choice(query_list)
print("\n-----------------\nQuery:\n")
print_wrapped(f"{query}")
print("\n-----------------\n")
print(study_buddy(query))


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



-----------------
Query:

What are the main components of a Content Delivery Network (CDN) and how do they
work together to improve content delivery performance?

-----------------

The main components of a Content Delivery Network (CDN) are:

1. **Origin Server**: This is the primary server that hosts the content of a website. It is responsible for storing and serving the original content.
2. **Edge Servers**: These are strategically located servers that are part of the CDN. They are responsible for caching and serving content to users. Edge servers are typically located closer to the users, which reduces latency and improves performance.
3. **Content Cache**: This is a storage mechanism that stores copies of content on edge servers. The content cache ensures that the selected edge server has the required content readily available.
4. **Content Replication**: This is the process of copying content from the origin server to multiple edge servers. Content replication ensures that the c

In [None]:
def prompt_formatter(query: str, context_items: list[dict]) -> str:
    """
    Augments the query with the context items.
    """
    context = "- " + "\n- ".join([item["text"] for item in context_items])
    base_prompt = """Based on the following context items, please answer the query. 
    Do not mention any refrences or sources in your answer.
    Give yourself room to think by extracting relevant passages from the context before answering the query.
    Don't return the thinking, only return the answer.
    Make sure your your answers are as explanatory and educational as possible.
    Now use the following context items to answer the query:
    Context items:
    {context}
    Query: {query}
    """
    base_prompt = base_prompt.format(context=context, query=query)
    
    message_template = [
        { "role": "system", "content": SYS_PROMPT },
        { "role": "user", "content": input_text }
    ]
    prompt = tokenizer.apply_chat_template(
    message_template,
    add_generation_prompt=True,
    #return_tensors="pt"#
    tokenize=False # keep as raw text
    )
    return prompt

In [31]:

query = random.choice(query_list)
print(f"Query: {query}")

scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings, embedding_model=embedding_model)

context_items = [chunks_with_embeddings[i] for i in indices]
prompt = prompt_formatter(query, context_items)

print(f"\n\nPrompt:\n{prompt}")

Query: What are the challenges of achieving both consistency and fault tolerance in large-scale distributed systems?
Time taken to compute dot scores on (2898): 8.242693729698658e-05 seconds


Prompt:
Based on the following context items, please answer the query. 
    Do not mention any refrences or sources in your answer.
    Give yourself room to think by extracting relevant passages from the context before answering the query.
    Don't return the thinking, only return the answer.
    Make sure your your answers are as explanatory and educational as possible.
    Now use the following context items to answer the query:
    Context items:
    - To understand the role of fault tolerance in distributed systems, we first need to take a closer look at what it actually means for a distributed system to tolerate faults. Being fault tolerant is strongly related to what are called dependable systems. Dependability is a term that covers a number of useful requirements for distributed systems,

In [32]:
%%time

input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = llm_model.generate(**input_ids, 
                             temperature=0.7,
                             do_sample=True,
                             max_new_tokens=500)

output_text = tokenizer.decode(outputs[0])
print(f"Query:{query}")
print(f"RAG answer:\n{output_text.replace(prompt,'')}")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Query:What are the challenges of achieving both consistency and fault tolerance in large-scale distributed systems?
RAG answer:
<|begin_of_text|> Based on the context items provided, what are the challenges of achieving both consistency and fault tolerance in large-scale distributed systems?

Answer: 
The main challenges of achieving both consistency and fault tolerance in large-scale distributed systems are the potential loss of performance and the complexity of masking partial failures. In order to achieve fault tolerance, processes in a fault-tolerant group may need to exchange numerous messages, which can lead to a decrease in performance. Additionally, the complexity of masking partial failures and the recovery from those failures can be intricate, especially in large-scale distributed systems where there are many dependencies and unexpected failures can occur. Moreover, realizing specific forms of fault tolerance, such as being able to withstand arbitrary failures, may not always