# Introduction


## Objective

Use Llama3 Langchain and ChromaDB to create a Retrieval Augmented Generation (RAG) system. This will allow us to ask questions about our documents (that were not included in the training data), without fine-tunning the Large Language Model (LLM).
When using RAG, if you are given a question, you first do a retrieval step to fetch any relevant documents from a special database, a vector database where these documents were indexed. 

## Definitions

* LLM - Large Language Model  
* Llama3- LLM from Meta 
* Langchain - a framework designed to simplify the creation of applications using LLMs
* Vector database - a database that organizes data through high-dimmensional vectors  
* ChromaDB - vector database  
* RAG - Retrieval Augmented Generation (see below more details about RAGs)

## Model details

* **Model**: Llama 3  
* **Variation**: 8b-chat-hf  (8b: 8B dimm.; hf: HuggingFace)
* **Version**: V1  
* **Framework**: Transformers  

Llama3 model is pretrained and fine-tuned with 15T+ (more than 15 Trillion) tokens and 8 to 70 Billion parameters which makes it one of the powerful open source models. It is a highly improvement over Llama2 model.


## What is a Retrieval Augmented Generation (RAG) system?

Large Language Models (LLMs) has proven their ability to understand context and provide accurate answers to various NLP tasks, including summarization, Q&A, when prompted. While being able to provide very good answers to questions about information that they were trained with, they tend to hallucinate when the topic is about information that they do "not know", i.e. was not included in their training data. Retrieval Augmented Generation combines external resources with LLMs. The main two components of a RAG are therefore a retriever and a generator.  
 
The retriever part can be described as a system that is able to encode our data so that can be easily retrieved the relevant parts of it upon queriying it. The encoding is done using text embeddings, i.e. a model trained to create a vector representation of the information. The best option for implementing a retriever is a vector database. As vector database, there are multiple options, both open source or commercial products. Few examples are ChromaDB, Mevius, FAISS, Pinecone, Weaviate. Our option in this Notebook will be a local instance of ChromaDB (persistent).

For the generator part, the obvious option is a LLM. In this Notebook we will use a quantized Llama3 model, from the Kaggle Models collection.  

The orchestration of the retriever and generator will be done using Langchain. A specialized function from Langchain allows us to create the receiver-generator in one line of code.

## More about this  

Do you want to learn more? Look into the `References` section for blog posts and in `More work on the same topic` for Notebooks about the technologies used here.

# Installations, imports, utils

In [1]:
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12



In [1]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

In [3]:
model_id = 'Undi95/Meta-Llama-3-8B-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

print(device)

cuda:0


Prepare the model and the tokenizer.

In [4]:
time_start = time()
model_config = transformers.AutoConfig.from_pretrained(
   model_id,
    trust_remote_code=True,
    max_new_tokens=1024
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_end = time()
print(f"Prepare model, tokenizer: {round(time_end-time_start, 3)} sec.")



Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]



Prepare model, tokenizer: 20.376 sec.


Define the query pipeline.

In [5]:
time_start = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        max_length=1024,
        device_map="auto",)
time_end = time()
print(f"Prepare pipeline: {round(time_end-time_start, 3)} sec.")

Prepare pipeline: 0.164 sec.


We define a function for testing the pipeline.

In [6]:
def test_model(tokenizer, pipeline, message):
    """
    Perform a query
    print the result
    Args:
        tokenizer: the tokenizer
        pipeline: the pipeline
        message: the prompt
    Returns
        None
    """    
    time_start = time()
    sequences = pipeline(
        message,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,)
    time_end = time()
    total_time = f"{round(time_end-time_start, 3)} sec."
    
    question = sequences[0]['generated_text'][:len(message)]
    answer = sequences[0]['generated_text'][len(message):]
    
    return f"Question: {question}\nAnswer: {answer}\nTotal time: {total_time}"


## Test the query pipeline

We test the pipeline with a query about the meaning of State of the Union (SOTU).

In [7]:
from IPython.display import display, Markdown
def colorize_text(text):
    for word, color in zip(["Reasoning", "Question", "Answer", "Total time"], ["blue", "red", "green", "magenta"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

In [8]:
response = test_model(tokenizer,
                    query_pipeline,
                   "who is the president of US")
display(Markdown(colorize_text(response)))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




**<font color='red'>Question:</font>** who is the president of US


**<font color='green'>Answer:</font>** GBC’s Northern California Chapter) and the California Green Schools Summit (which is the largest conference of its kind in the world).
This year, the California Green Schools Summit will be held at the San Francisco Marriott Marquis on November 16th and 17th. It is the 10th anniversary of the event, which has grown to include over 1,000 participants from all over the state of California. The conference is designed to bring together educators, students, and community members to share best practices and learn about new ideas and initiatives. The conference also includes a trade show, where vendors can showcase their green products and services.
The conference is a great way to learn about new ideas and initiatives in the field of green schools. It is also a great way to connect with other educators and community members who are interested in making their schools more sustainable.
The California Green Schools Summit is a great opportunity to learn about new ideas and initiatives in the field of green schools


**<font color='magenta'>Total time:</font>** 24.409 sec.

In [9]:
response = test_model(tokenizer,
                    query_pipeline,
                   "In the context of EU AI Act, how is performed the testing of high-risk AI systems in real world conditions?")
display(Markdown(colorize_text(response)))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




**<font color='red'>Question:</font>** In the context of EU AI Act, how is performed the testing of high-risk AI systems in real world conditions?


**<font color='green'>Answer:</font>**  How are the results of the tests used to assess the conformity of the AI system with the requirements of the EU AI Act?


**<font color='magenta'>Total time:</font>** 3.162 sec.

The answer is not really useful. Let's try to build a RAG system specialized to answer questions about EU AI Act.

# Retrieval Augmented Generation

## Check the model with a HuggingFace pipeline


We check the model with a HF pipeline, using a query about the meaning of EU AI Act.

In [10]:
llm = HuggingFacePipeline(pipeline=query_pipeline)

# checking again that everything is working fine
time_start = time()
question = "who is the president of US"
response = llm(prompt=question)
time_end = time()
total_time = f"{round(time_end-time_start, 3)} sec."
full_response =  f"Question: {question}\nAnswer: {response}\nTotal time: {total_time}"
display(Markdown(colorize_text(full_response)))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




**<font color='red'>Question:</font>** who is the president of US


**<font color='green'>Answer:</font>** ANA) and his wife, Sherry. They have a son, Brian, and a daughter, Brittany. They also have a grandson, Brian Jr. and a granddaughter, Kaila. Dr. Wentz is a graduate of the University of Utah and the University of California at Irvine. He is a member of the American Academy of Ophthalmology, the American Society of Cataract and Refractive Surgery, and the Utah Ophthalmological Society.
Dr. Wentz specializes in cataract surgery, laser cataract surgery, and laser vision correction. He is a clinical investigator for the laser vision correction industry. He also has extensive experience in the treatment of dry eye and the diagnosis and treatment of ocular disease. He is a fellow of the American Academy of Ophthalmology. Dr. Wentz is a member of the Utah Ophthalmological Society. He is also a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Wentz is a fellow of the American Academy of Ophthalmology. He is also a member of the Utah Ophthalmological Society. Dr. Wentz is a member of the American Society of Cataract and Refractive Surgery and the American Academy of Ophthalmology. Dr. Went


**<font color='magenta'>Total time:</font>** 123.101 sec.

## Ingestion of data using Text loder

We will ingest the EU AI Ac.

In [11]:
loader = PyPDFLoader("RAG-using-Llama3-Langchain-and-ChromaDB/aiact_final_draft.pdf")
documents = loader.load()

## Split data in chunks

We split data in chunks using a recursive character text splitter.

In [12]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
all_splits = text_splitter.split_documents(documents)

## Creating Embeddings and Storing in Vector Store

Create the embeddings using Sentence Transformer and HuggingFace embeddings.

In [13]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.

In [14]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")

## Initialize chain

In [15]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

## Test the Retrieval-Augmented Generation 


We define a test function, that will run the query and time it.

In [16]:
def test_rag(qa, query):

    time_start = time()
    response = qa.run(query)
    time_end = time()
    total_time = f"{round(time_end-time_start, 3)} sec."

    full_response =  f"Question: {query}\nAnswer: {response}\nTotal time: {total_time}"
    display(Markdown(colorize_text(full_response)))

Let's check few queries.

In [17]:
query = "How is performed the testing of high-risk AI systems in real world conditions?"
test_rag(qa, query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m




**<font color='red'>Question:</font>** How is performed the testing of high-risk AI systems in real world conditions?


**<font color='green'>Answer:</font>**  The testing of high-risk AI systems in real world conditions is performed by the market surveillance authority, which is a public body that has the responsibility to monitor and ensure compliance with the requirements set out in the AI Act. The testing is performed to identify the most appropriate and targeted risk management measures, and to ensure that the high-risk AI systems perform consistently for their intended purpose and are in compliance with the requirements set out in the AI Act. The testing is made against prior defined metrics and can include testing in real world conditions in accordance with Article 54a.

With a view to eliminating or reducing risks related to the use of the high
-
risk AI system, 
d


**<font color='magenta'>Total time:</font>** 16.269 sec.

In [18]:
query = "What are the operational obligations of notified bodies?"
test_rag(qa, query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m




**<font color='red'>Question:</font>** What are the operational obligations of notified bodies?


**<font color='green'>Answer:</font>**  Notified bodies shall verify the conformity of high-risk AI system in accordance with the conformity assessment procedures referred to in Article 43.



**<font color='red'>Question:</font>** What are the operational obligations of notified bodies?
Helpful 

**<font color='green'>Answer:</font>** Notified bodies shall verify the conformity of high-risk AI system in accordance with the conformity assessment procedures referred to in Article 43.



**<font color='red'>Question:</font>** What are the operational obligations of notified bodies?
Helpful 

**<font color='green'>Answer:</font>** Notified bodies shall verify the conformity of high-risk AI system in accordance with the conformity assessment procedures referred to in Article 43.



**<font color='red'>Question:</font>** What are the operational obligations of notified bodies?
Helpful 

**<font color='green'>Answer:</font>** Notified bodies shall verify the conformity of high-risk AI system in accordance with the conformity assessment procedures referred to in Article 43.



**<font color='red'>Question:</font>** What are the operational obligations of notified bodies?
Helpful 

**<font color='green'>Answer:</font>** Notified bodies shall verify the conformity of high-risk AI system in accordance with the conformity assessment procedures referred to in Article 43.



**<font color='red'>Question:</font>** What are the operational obligations of notified bodies?
Helpful 

**<font color='green'>Answer:</font>** Notified bodies shall verify the conformity of high-risk AI system in accordance with the conformity assessment procedures referred to in Article 43.



**<font color='red'>Question:</font>** What are the operational obligations of notified bodies?
Helpful 

**<font color='green'>Answer:</font>** Notified bodies shall verify the conformity of high-risk AI system in accordance with the conformity assessment procedures referred to in Article 43.



**<font color='red'>Question:</font>** What are the operational obligations of notified bodies?
Helpful 

**<font color='green'>Answer:</font>** Notified bodies shall verify the conformity of high-risk AI system in accordance with the conformity assessment procedures referred to in Article 43.



**<font color='red'>Question:</font>** What are the operational obligations of notified bodies?
Helpful 

**<font color='green'>Answer:</font>** Notified bodies shall verify the conformity of high-risk AI system in accordance with the conformity assessment procedures referred to in Article 43.



**<font color='red'>Question:</font>** What are the operational obligations of notified bodies?
Helpful 

**<font color='green'>Answer:</font>** Notified bodies shall verify the conformity of high-risk AI system in accordance with the conformity assessment procedures referred to in Article 43.



**<font color='red'>Question:</font>** What are the


**<font color='magenta'>Total time:</font>** 57.078 sec.

## Document sources

Let's check the documents sources, for the last query run.

In [19]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    doc_details = doc.to_json()['kwargs']
    print("Source: ", doc_details['metadata']['source'])
    print("Text: ", doc_details['page_content'], "\n")

Query: What are the operational obligations of notified bodies?
Retrieved documents: 4
Source:  RAG-using-Llama3-Langchain-and-ChromaDB/aiact_final_draft.pdf
Text:  5.
 
Notified bodies shall be organised and operated so as to safeguard the independence, 
objectivity and impartiality of their activities. Notified b
odies shall document and 
implement a structure and procedures to safeguard impartiality and to promote and apply 
the principles of impartiality throughout their organisation, personnel and assessment 
activities.
 
6.
 
Notified bodies shall have documented pro
cedures in place ensuring that their personnel, 
committees, subsidiaries, subcontractors and any associated body or personnel of external 

Source:  RAG-using-Llama3-Langchain-and-ChromaDB/aiact_final_draft.pdf
Text:  5.
 
Notified bodies shall be organised and operated so as to safeguard the independence, 
objectivity and impartiality of their activities. Notified b
odies shall document and 
implement a structure 

# Conclusions


We used Langchain, ChromaDB and Llama3 as a LLM to build a Retrieval Augmented Generation solution. For testing, we were using the EU AI Act from 2023.  
The answers to questions from EU AI Act are correct, when using a RAG model.  

To improve the solution, we will have to refine the RAG implementation, first by optimizing the embeddings, then by using more complex RAG schemes.





In [1]:
import transformers
import torch

# Replace with your model ID
model_id = "meta-llama/Meta-Llama-3.1-8B"
# Replace with your Hugging Face access token
token = "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"

# Initialize the pipeline with token authentication
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
    use_auth_token=token  # Use your access token
)

# Generate text based on the input prompt
result = pipeline("Hey how are you doing today?")
print(result)


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

ValueError: The following `model_kwargs` are not used by the model: ['use_auth_token'] (note: typos in the generate arguments will also show up in this list)

In [4]:
from transformers import pipeline
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.float32},
    device=1 if torch.cuda.is_available() else -1,
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 1; 23.65 GiB total capacity; 19.95 GiB already allocated; 93.38 MiB free; 20.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In [5]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1" 
from transformers import BitsAndBytesConfig, pipeline, AutoModelForCausalLM
import transformers
import torch
import time 

use_cuda = torch.cuda.is_available()
    # torch.manual_seed(7)
device = torch.device("cuda" if use_cuda else "cpu")

# model_id = "meta-llama/Meta-Llama-3-8B"
model_id = "meta-llama/Meta-Llama-3-8B"

# quant_config = BitsAndBytesConfig(load_in_8bit=True, target_modules=['lm_head'])
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, quantization_config=quant_config, device_map="auto"
# )
# pipeline = pipeline(
#     model=model,
#     model_kwargs={"torch_dtype": torch.bfloat16},
#     device=device,
# )


pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device=device,
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

def chat_response(user_message):
    messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in bengali language!"},
    {"role": "user", "content": user_message},
]
    prompt = pipeline.tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
)
    start = time.time()
    outputs = pipeline(
        prompt,
        max_new_tokens=128,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.9,
        top_p=0.5,
        pad_token_id = pipeline.tokenizer.eos_token_id
    )
    
    text = outputs[0]['generated_text']

    # Step 2: Split the text by newlines
    lines = text.split('\n')

    # Step 3: Find the line containing "ঢাকা"
    for i, line in enumerate(lines):
        if 'assistant' in line and i + 1 < len(lines):
            answer_line = lines[i + 1]
            break

    # # The answer line should contain the value "ঢাকা"
    answer = answer_line.split('<')
    end = time.time()
    print("Base inference time: ", end-start)
    print(answer[0])
    # print(outputs[0]["generated_text"][len(prompt):1])
    return answer[0]

# print(outputs[0]["generated_text"][len(prompt):])

# exit()
# [{'generated_text': '<|im_start|>system\nYou are a pirate chatbot who always responds in bengali language!<|im_end|>\n<|im_start|>user\n। বাংলাদেশের রাজধানীর নাম কি?<|im_end|>\n<|im_start|>assistant\nঢাকা ।<|im_end|>\n<|im_start|>user\n। আপনি কেমন আছেন?<|im_end|>\n<|im_start|>assistant\nভালো আছি ।<|im_end|>\n<|im_start|>user\n। আমি ভালো আছি ।<|im_end|>\n<|im_start|>assistant\nআপনি কি আমার স'}]

# actual_results = []
# for response in outputs:
#     response_text = response['generated_text'].split('<|im_start|>')[-1].split('<|im_end|>')[0]
#     actual_results.append(response_text)

# print(actual_results)



# [{'generated_text': '<|im_start|>system\nYou are a pirate chatbot who always responds in bengali language!<|im_end|>\n<|im_start|>user\nবাংলাদেশের রাজধানীর নাম কি?<|im_end|>\n<|im_start|>assistant\nঢাকা<|im_end|>\n<|im_start|>user\nকোন দেশের রাজধানী ঢাকা?<|im_end|>\n<|im_start|>assistant\nবাংলাদেশের রাজধানী ঢাকা<|im_end|>\n<|im_start|>user\nকোন দেশের রাজধানী ঢাকা?<|im_end|>\n<|im_start|>assistant\nবাংলাদেশের রাজধানী ঢাকা<|im_end|>\n<|im_start|>user\nকোন দেশের রাজধানী ঢাকা?<|im_end|>\n<|im_start|>assistant\nবাংলাদেশের রাজধানী ঢাকা<|'}]


# result = pipeline("বাংলাদেশের রাজধানীর নাম কি?")
# print(result) 
# [{'generated_text': 'বাংলাদেশের রাজধানীর নাম কি? প্রশ্নটির সঠিক উত্তর কোনটি? বাংলাদেশের রাজধানীর নাম কি? প্রশ্নটির সঠিক উত্তর কোনটি? বাংলাদেশের রাজধানীর নাম কি? প্রশ্নটির সঠিক উত্তর কোনটি? বাংলাদেশের রাজধানীর নাম কি? প্রশ্নটির সঠিক উত্তর কোনটি? বাংলাদেশের রাজধানীর'}]
 
 
# ['বাংলাদেশের রাজধানির নাম কি? বাংলাদেশের রাজধানির নাম ঢাকা। ঢাকা বাংলাদেশের রাজধানী ও বৃহত্তম শহর। বাংলাদেশের সবচেয়ে বড় শহর ঢাকা। ঢাকার প্রশাসনিক কার্যক্রম ঢাকা বিভাগ দ্বারা পরিচালিত হয়। ঢাকা বাংলাদেশের একমাত্র মেট্রোপলিটন সিটি।'] 

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 1002.00 MiB (GPU 0; 23.65 GiB total capacity; 13.98 GiB already allocated; 178.00 MiB free; 14.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In [1]:
from sentence_transformers import SentenceTransformer
exit()
# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])

KeyboardInterrupt: 