# Introduction to Large Language Models (LLMs)

Large Language Models (LLMs) have revolutionized natural language processing (NLP) and have become foundational tools in various AI applications. These models, trained on massive datasets, are capable of generating human-like text, translating languages, summarizing documents, answering questions, and much more. In this notebook, we will explore how to use large language models, particularly those built on architectures like GPT, BERT, and LLaMA.

## What Are Large Language Models?

LLMs are a type of deep learning model specifically designed to understand and generate text. They are typically based on transformer architectures, which allow them to process and generate text in parallel, making them highly efficient and powerful. The "large" in LLMs refers to the vast number of parameters—often in the billions—these models possess, which enables them to capture the nuances and complexities of human language.

### Key Features of LLMs:

- **Contextual Understanding:** LLMs can generate text that is contextually relevant, meaning they understand the context in which words are used, leading to more accurate and coherent outputs.
- **Transfer Learning:** These models can be fine-tuned on specific tasks with smaller datasets, making them highly versatile for various NLP applications.
- **Generative Capabilities:** Beyond understanding text, LLMs can generate creative and complex content, from poetry to technical explanations.

## Why Use LLMs?

The widespread adoption of LLMs is driven by their ability to perform a wide range of tasks with minimal human intervention. They are used in chatbots, content creation, sentiment analysis, language translation, code generation, and more. Their ability to generalize across tasks without needing task-specific models has made them invaluable in both industry and research.

## What Will You Learn?

In this notebook, you will learn how to leverage open-source LLMs for text generation and other NLP tasks using Python and popular libraries such as `transformers`. We will guide you through:
- Setting up the environment for working with LLMs.
- Loading pre-trained models and tokenizers.
- Generating text based on specific prompts.

By the end of this notebook, you will have a solid understanding of how to implement and use LLMs in practical scenarios, opening the door to countless AI-powered applications.

Let’s get started!


In [1]:
!pip install \
  transformers==4.57.3 \
  bitsandbytes==0.49.0 \
  accelerate==1.12.0 \
  sentence-transformers==5.1.2 \
  faiss-cpu==1.13.1 \
  datasets==4.0.0 \
  evaluate==0.4.6


Collecting bitsandbytes==0.49.0
  Downloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting sentence-transformers==5.1.2
  Downloading sentence_transformers-5.1.2-py3-none-any.whl.metadata (16 kB)
Collecting faiss-cpu==1.13.1
  Downloading faiss_cpu-1.13.1-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Collecting evaluate==0.4.6
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl (59.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.1/59.1 MB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading sentence_transformers-5.1.2-py3-none-any.whl (488 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m488.0/488.0 kB[0m [31m20.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading faiss_cpu-1.13.1-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

# LLM: Qwen2

- [Model details](https://qwenlm.github.io/blog/qwen2/)
- Completely open, you do not have to apply for access
- We will use a quantized version of the model (preserves accuracy of the model while significantly reduces memory requirements). If you are interested, read more [here](https://huggingface.co/docs/optimum/concept_guides/quantization).

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from accelerate import init_empty_weights
from transformers import BitsAndBytesConfig

device = "cuda" # the device to load the model onto

# Initialize the model with quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Set to True for 4-bit quantization or False for 8-bit
    bnb_4bit_use_double_quant=True,  # Optional: Improves stability in 4-bit quantization
    bnb_4bit_quant_type="nf4",  # Optional: Use 'nf4' for better accuracy or 'fp4' for faster computation
    bnb_4bit_compute_dtype=torch.float16  # Optional: use float16 for better performance on newer GPUs
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=bnb_config  # Pass the quantization configuration
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/663 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/3.56G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/3.95G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/3.86G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/3.86G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/243 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

### Qwen2 Response Generation Function

This function is a small wrapper around a Qwen2 language model that takes a text prompt and returns the model’s answer. From the user’s perspective, it behaves like a simple “ask a question, get an answer” interface. Internally, however, several steps are required to transform natural language into a form the model can process and then convert the model’s output back into readable text.

The model itself does not understand text directly. Instead, all input text is first converted into numbers using a **tokenizer**, which breaks the text into smaller pieces (tokens) such as words or subwords and maps them to numerical IDs. These numbers are what the neural network actually operates on.

Once the prompt is tokenized, it is passed through the Qwen2 transformer model, which has been trained to predict the most likely next token given everything it has seen so far. Generation happens step by step: the model predicts the next token, appends it to the sequence, and repeats this process until a complete response is produced.

Finally, the generated token IDs are converted back into human-readable text using the tokenizer in reverse, and only the newly generated part of the sequence is returned as the final answer.


**Workflow overview:**

1. **Construct chat messages**  
   A system message defines the assistant’s behavior, and a user message contains the input prompt.

2. **Apply the chat template**  
   The tokenizer formats the messages into the model-specific chat prompt and appends a generation marker.

3. **Tokenize and move to device**  
   The formatted prompt is converted into input IDs and transferred to the target device (CPU or GPU).

4. **Generate model output**  
   The model produces up to a fixed number of new tokens as a continuation of the prompt.

5. **Extract generated content only**  
   The original prompt tokens are removed so that only newly generated tokens remain.

6. **Decode to text**  
   Token IDs are converted back into readable text, skipping special tokens.

The function returns the final text response generated by the Qwen2 model.


In [3]:
# A function that returns the answer from Qwen2 model
def qwen2_generate(prompt):
    messages = [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": prompt}
  ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return response

In [4]:
# Chat about anything
prompt = "Give me a short introduction to large language model."
response = qwen2_generate(prompt)
print(response)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


A large language model is a type of artificial intelligence (AI) system that has been trained on vast amounts of text data, allowing it to generate human-like responses and perform various natural language processing tasks. These models are characterized by their massive size, often containing billions of parameters, which enables them to learn complex patterns in language.

Developed using deep learning techniques, particularly transformer architectures like the ones used in the popular models such as GPT-3 or BERT, these models can understand context, generate coherent responses to a wide range of prompts, and even engage in conversations. They are used in a variety of applications, including but not limited to, language translation, text summarization, chatbot development, content generation, and more. The key advantage of large language models lies in their ability to provide sophisticated, context-aware outputs that mimic human language skills, making them valuable tools in the fi

In [5]:
# Translate documents
prompt = "Translate the following text to English & French. Put translations in separate lines. \n\nWenn es darum geht, das eigene Geld zu vermehren oder was fürs Alter anzusparen, ist immer häufiger von ETFs die Rede – kurz für Exchange Traded Funds, also börsengehandelte Indexfonds. Klingt sperrig? Mag sein. Aber einen ersten Sparplan anzulegen, ist unkompliziert. Ein Smartphone reicht. Und um in ETFs zu investieren, benötigen wir weder Startkapital noch großes Vorwissen. Selbst, wer die Altersvorsorge seit Jahrzehnten vor sich herschiebt, kann noch starten."
response = qwen2_generate(prompt)
print(response)

English:
When it comes to multiplying one's own money or saving for retirement, it is increasingly common to hear about ETFs - short for Exchange-Traded Funds, or stock-exchange traded index funds. Sounds daunting? Maybe. But setting up your first savings plan is straightforward. All you need is a smartphone. And to invest in ETFs, we neither need a starting capital nor extensive knowledge. Even someone who has been saving for retirement for decades can still get started.

French:
Lorsqu'il s'agit de multiplier ses propres économies ou d'épargner pour la retraite, on entend de plus en plus parler des ETFs - abrégé de Exchange-Traded Funds, c'est-à-dire les fonds en index boursiers. Cela paraît-il intimidant ? Peut-être. Mais mettre en place votre premier plan d'épargne est simple. Tout ce dont vous avez besoin, c'est d'un smartphone. Et pour investir dans des ETFs, nous n'avons ni besoin d'un capital de départ, ni d'une grande connaissance. Même quelqu'un qui a été en train d'épargner 

### Modify text generation parameters

Now, we are going to change the hyperparameters of text generation. We will focus on only a few most important. You can find a full list of parameters on [HuggingFace](https://huggingface.co/docs/transformers/main_classes/text_generation).

In [6]:
prompt = "Translate to English:\n\nUS-Vizepräsidentin Kamala Harris verteidigte ihre geänderte Meinung zu wichtigen Themen in ihrem ersten Interview seit ihrer Bewerbung um die Präsidentschaf."
messages = [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

### Temperature
**Temperature** is a parameter used in natural language processing models to increase or decrease the “confidence” a model has in its most likely response.

A low temperature makes the model behave more conservatively: it strongly prefers the most likely words, which leads to predictable, focused, and often more factual responses.

A higher temperature loosens this preference, allowing less likely words to be chosen more often, which can make the output more creative, varied, or surprising—but also more prone to errors or incoherence.

You can think of temperature like adjusting the “randomness dial”: turning it down produces safe, consistent answers, while turning it up encourages exploration and diversity in what the model says.

You can find a great visualization and explanation of temperature [here](https://lukesalamone.github.io/posts/what-is-temperature/).

### How temperature affects translation
At **low temperature**, the model strongly favors the highest-probability next token at each step. This results in translations that are stable, literal, and repeatable. You will often see very similar or even identical translations across multiple runs, with conservative word choices and standard phrasing. This setting is generally preferred for production or evaluation scenarios where faithfulness, consistency, and minimal paraphrasing are important.

As the **temperature increases**, the model begins to sample from a wider range of plausible tokens. In translation, this can introduce paraphrasing, alternative sentence structures, or stylistic variation. While this may produce translations that sound more natural or expressive in some cases, it also increases the risk of drift—for example, slightly altered meaning, unnecessary rewording, or added/omitted nuances that were not present in the source text.

At **very high temperatures**, translation quality typically degrades. The model may choose unusual or low-probability words, repeat phrases, or introduce factual or semantic errors. For translation specifically, this is undesirable because the task prioritizes semantic equivalence over creativity.

In [7]:
# change temperature of the model
for tmp in [0.1, 0.1, 0.5, 0.98]:  # lower values make model more deterministic
  generated_ids = model.generate(
      model_inputs.input_ids,
      max_new_tokens=512,
      do_sample=True,
      temperature=tmp
  )
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
  ]

  response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  print(response)

US Vice President Kamala Harris defended her revised opinions on key issues in her first interview since her campaign for the presidency.
US Vice President Kamala Harris defended her revised opinions on key issues in her first interview since her campaign for the presidency.
US Vice President Kamala Harris defended her changed opinions on important issues in her first interview since running for the presidency.
US Vice President Kamala Harris defended her revised stance on key issues in her first interview since her presidential campaign, as she explained her changed opinions on important topics.


### Top-k Sampling

**Top-k sampling** controls *which words the model is even allowed to choose from* when generating the next token.

At every step, the model assigns probabilities to many possible next words. Instead of considering *all* of them, top-k sampling **keeps only the k most likely candidates** and completely ignores the rest. The next word is then sampled only from this reduced shortlist.

When the context is **short or vague** (e.g., after “The”), probability is spread across many plausible words (“dog,” “car,” “woman,” etc.). In this case, the top-k set may capture only part of the total probability mass. As the context becomes **more specific** (e.g., “The car”), probability concentrates on a few highly likely words (“drives,” “is”), and the top-k candidates cover almost all reasonable continuations.

A **small k** makes the model more conservative and predictable, because it can choose only from a few high-probability words. A **larger k** allows more variety, enabling less common but still plausible words to appear, which increases diversity but also the risk of awkward or less accurate output.

In practice, top-k sampling acts as a **hard filter on creativity**: it limits how far the model can stray by restricting the choice set, while temperature (often used together with top-k) controls how strongly the model prefers the top options *within* that set.


![top_k_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/top_k_sampling.png)

In [8]:
# change top_k of the model
for v in [10, 50, 100]:
  generated_ids = model.generate(
      model_inputs.input_ids,
      max_new_tokens=512,
      do_sample=True,
      top_k=v
  )
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
  ]

  response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  print(response)

US Vice President Kamala Harris defended her revised opinions on key issues in her first interview since her campaign for the presidency.
US Vice President Kamala Harris defended her revised stance on key issues in her first interview after her bid for the presidency.
US Vice President Kamala Harris defended her changed opinions on important issues in her first interview since running for the presidency.


### Use case: Simple knowledge-injection
Retrive an external article from a website to chat about with LLM

In [10]:
import requests
from bs4 import BeautifulSoup

# Step 1: Specify the URL of the website you want to fetch
url = 'https://www.bbc.com/news/articles/cd0532n9pdko'  # Replace with the URL of your choice

# Step 2: Send a GET request to the URL
response = requests.get(url)

# Step 3: Check if the request was successful (status code 200)
if response.status_code == 200:
    # Step 4: Parse the content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Step 5: Extract all text from the HTML
    text = soup.get_text()

    # Optional: Clean up the text by stripping leading/trailing whitespaces
    cleaned_text = text.strip()

    # Print the cleaned text
    print(cleaned_text)
else:
    print(f"Failed to fetch the website. Status code: {response.status_code}")


Ukraine F-16 destroyed during Russian attack, BBC toldSkip to contentWatch LiveBritish Broadcasting CorporationHomeNewsSportBusinessInnovationCultureArtsTravelEarthAudioVideoLiveDocumentariesHomeNewsIsrael-Gaza WarWar in UkraineUS & CanadaUKUK PoliticsEnglandN. IrelandN. Ireland PoliticsScotlandScotland PoliticsWalesWales PoliticsAfricaAsiaChinaIndiaAustraliaEuropeLatin AmericaMiddle EastIn PicturesBBC InDepthBBC VerifySportBusinessExecutive LoungeTechnology of BusinessFuture of BusinessInnovationWatch DocumentariesTechnologyScience & HealthArtificial IntelligenceAI v the MindCultureWatch DocumentariesFilm & TVMusicArt & DesignStyleBooksEntertainment NewsArtsWatch DocumentariesArts in MotionTravelWatch DocumentariesDestinationsAfricaAntarcticaAsiaAustralia and PacificCaribbean & BermudaCentral AmericaEuropeMiddle EastNorth AmericaSouth AmericaWorld’s TableCulture & ExperiencesAdventuresThe SpeciaListEarthWatch DocumentariesNatural WondersWeather & ScienceClimate SolutionsSustainable Bu

In [11]:
prompt = "What was destroyed by Russian attack on 29th of August 2024?"
response = qwen2_generate(prompt)
print(response)

As an AI, I don't have the ability to predict future events or provide real-time information. Any speculation about future events would be purely hypothetical and not based on factual data. If you're referring to a specific event in 2024 that hasn't happened yet, it's important to wait for credible news sources to report any actual occurrences.


In [12]:
prompt = "Read this context and answer the question. \n\n" + cleaned_text + "\n\nQuestion: What was destroyed by Russian attack?"
response = qwen2_generate(prompt)
print(response)

According to the context provided, one of the F-16 fighter jets sent from NATO allies to Ukraine has been destroyed during a Russian attack.


# RAG (Retrieval-augmented Generation)

[![RAG.jpg](https://i.postimg.cc/W1V5BJWn/RAG.jpg)](https://postimg.cc/Mvs0RX8M)

**In simple terms, RAG is to LLMs what an open-book exam is to humans.**

The concept of an open-book exam centers around assessing a student's reasoning abilities rather than their capacity to memorize specific details. In a similar vein, RAG separates factual knowledge from the LLM’s reasoning capabilities. This factual information is stored in an external knowledge source, which is both easily accessible and updatable:

- **Parametric knowledge:** Knowledge that is learned during training and implicitly stored within the neural network's weights.
- **Non-parametric knowledge:** Information that is stored externally, for example, in a vector database.
e.

The RAG workflow consists of:

1. **The Retrieve**: The user query is used to retrieve relevant context from an external knowledge source. For this, the user query is embedded using an embedding model into the same vector space as the additional context in the vector database. This enables a similarity search, and the top k closest data objects from the vector database are returned.
2. **Augment**: The user query and the retrieved additional context are incorporated into a prompt template.
3. **Generate**: Finally, the retrieval-augmented prompt is fed to the LLM.

We will use the `langchain` framework to efficiently prompt the LLMs and prepare the RAG.

In [35]:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from sentence_transformers import SentenceTransformer, util
import faiss

# Sample corpus
corpus = [
    "The capital of France is Paris.", # 0
    "Python is a programming language that lets you work quickly.", # ¸1
    "The Eiffel Tower is located in Paris.", # 2
    "The Great Wall of China is visible from space.", # 3
    "GPT-3 is a state-of-the-art language model developed by OpenAI." # 4
]

# Initialize retriever model using a newer sentence-transformer model
retriever_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Encode corpus
corpus_embeddings = retriever_model.encode(corpus, convert_to_tensor=True)

In [36]:
# This code initializes a FAISS index for efficient similarity search over a collection of vector embeddings.
corpus_embeddings_np = corpus_embeddings.cpu().numpy()
index = faiss.IndexFlatL2(corpus_embeddings_np.shape[1])
index.add(corpus_embeddings_np)

In [43]:
def rag_query(query):
    # Step 1: Retrieve
    # Encode query and retrieve relevant documents
    query_embedding = retriever_model.encode(query, convert_to_tensor=True)
    query_embedding_np = query_embedding.cpu().numpy()
    query_embedding_np = query_embedding_np.reshape(1, -1)
    distances, top_k_indices = index.search(query_embedding_np, k=2)
    print('Top-k indices:', top_k_indices)

    # Fetch top-k relevant documents
    retrieved_docs = [corpus[idx] for idx in top_k_indices[0]]

    # Step 2: Augment
    # Combine query and retrieved docs for generation input
    combined_input = query + " " + " ".join(retrieved_docs)
    print('Augmented prompt:', combined_input)

    # Step 3: Generate response using the T5 model
    response = qwen2_generate(combined_input)

    return response

In [45]:
# Test the RAG implementation
query = "What is the state-of-the-art language model developed by OpenAI? Answer in one word."
print("Query:", query)
response = rag_query(query)
print("Response:", response)

Query: What is the state-of-the-art language model developed by OpenAI? Answer in one word.
Top-k indices: [[4 1]]
Augmented prompt: What is the state-of-the-art language model developed by OpenAI? Answer in one word. GPT-3 is a state-of-the-art language model developed by OpenAI. Python is a programming language that lets you work quickly.
Response: GPT-3


# Use case in translation: Improve medicine translations with RAG technique



In [17]:
torch.cuda.empty_cache()

In [18]:
from datasets import load_dataset

ds = load_dataset("ahazeemi/opus-medical-en-de")

README.md:   0%|          | 0.00/943 [00:00<?, ?B/s]

data/train-00000-of-00001-da7e17e633fe81(…):   0%|          | 0.00/35.3M [00:00<?, ?B/s]

data/test-00000-of-00001-ed87a31bbd73afd(…):   0%|          | 0.00/292k [00:00<?, ?B/s]

data/dev-00000-of-00001-835e6fbddd1d2256(…):   0%|          | 0.00/287k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/248099 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [19]:
# Sample corpus
corpus = ds['train'].select(list(range(10000)))

In [20]:
# Initialize retriever model using a newer sentence-transformer model
retriever_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [21]:
# Encode corpus
corpus_embeddings = retriever_model.encode(corpus['de'], convert_to_tensor=True, show_progress_bar=True, device='cuda', batch_size=128)

Batches:   0%|          | 0/79 [00:00<?, ?it/s]

In [22]:
# Initialize the FAISS index
corpus_embeddings_np = corpus_embeddings.cpu().numpy()
index = faiss.IndexFlatL2(corpus_embeddings_np.shape[1])
index.add(corpus_embeddings_np)

In [23]:
def get_samples(idx):
    de = corpus['de'][idx]
    en = corpus['en'][idx]
    return f"German:\t{de}\nEnglish:\t{en}\n"


def rag_query(query, k=3):
    # Encode query and retrieve relevant documents
    query_embedding = retriever_model.encode(query, convert_to_tensor=True)
    query_embedding_np = query_embedding.cpu().numpy()
    query_embedding_np = query_embedding_np.reshape(1, -1) # Reshape to a 2D array
    distances, top_k_indices = index.search(query_embedding_np, k=k)

    # Fetch top-k relevant documents
    retrieved_docs = [get_samples(int(idx)) for idx in top_k_indices[0]]

    # Combine query and retrieved docs for generation input
    combined_input = "Find below a few example translations.\n\n" + "\n".join(retrieved_docs) + query

    # Generate response using the T5 model
    response = qwen2_generate(combined_input)

    return combined_input, response

In [24]:
# Without RAG
example = 'Äußerlich kann Levemir InnoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden.'
query = f"\nTranslate to English: \n\n{example}\n"
print("Query:", query)
print("Response:", qwen2_generate(query))  # call original function

Query: 
Translate to English: 

Äußerlich kann Levemir InnoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden.

Response: Levemir InnoLet can be cleaned externally by wiping with a medical swab.


In [25]:
# With the RAG implementation
example = 'Äußerlich kann Levemir InnoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden.'
query = f"\nTranslate to English: \n\n{example}\n"
combined_input, response = rag_query(query)
print("Query:", combined_input)
print("----"*5)
print("Response:", response)

Query: Find below a few example translations.

German:	Äußerlich kann Mixtard 10 NovoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden.
English:	You can clean the exterior of your Mixtard 10 NovoLet by wiping it with a medicinal swab.

German:	53 • Entfernen Sie nicht den Gummistopfen. • Reinigen Sie den Stopfen mit einem antiseptischen Tupfer. • Stellen Sie die Durchstechflasche auf eine ebene Oberfläche.
English:	• Do not remove the stopper • Clean the stopper with an antiseptic swab • Put the vial on a flat surface.

German:	Äußerlich kann Protaphane NovoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden.
English:	You can clean the exterior of your Protaphane NovoLet by wiping it with a medicinal swab.

Translate to English: 

Äußerlich kann Levemir InnoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden.

--------------------
Response: You can clean the exterior of your Levemir InnoLet by wiping it with a medicinal swab.


In [26]:
# take 30 test samples and evaluate based on BLUE score
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss
import torch
from tqdm import tqdm

def infer_on_test_set(test_size=5):
    # Load the test set
    test_set = ds['test'].select(list(range(test_size)))

    # Initialize lists to store references and generated translations
    references = []
    generated_translations_org = []
    generated_translations_rag = []

    for i, example in enumerate(tqdm(test_set)):
        query = f"\nTranslate to English: \n\n{example['de']}\n"
        combined_input, response_rag = rag_query(query)
        response_org = qwen2_generate(query)
        references.append(example['en'])
        generated_translations_org.append(response_org)
        generated_translations_rag.append(response_rag)

    return references, generated_translations_org, generated_translations_rag

In [27]:
# Evaluate on the test set
references, generated_translations_org, generated_translations_rag = infer_on_test_set(35)

100%|██████████| 35/35 [03:21<00:00,  5.75s/it]


In [28]:
generated_translations_org

['Levemir InnoLet can be cleaned externally by wiping with a medical swab.',
 'If an alternative injection site is used, it is particularly important to generate sufficient surface tension at the injection site to enable a successful injection.',
 'At higher strengths (5, 7.5, and 10 mg), Arixtra is suitable for the treatment of venous thromboembolic events such as deep vein thrombosis (DVT, blood clot in the leg) or pulmonary embolism (PE, blood clot in the lung).',
 'The metabolism and excretion of dabigatran were investigated after a single intravenous administration of radiolabeled dabigatran in healthy male subjects.',
 'Which risks are associated with Poulvac FluFend H5N3 RG?',
 'Nicotine derivatives, Colestipol),',
 'Jaundice, elevated liver values',
 'Table 1 lists the unwanted drug reactions (events where a causal relationship with the medication is assumed) that occurred in 291 Alzheimer patients. These patients were part of a specific 24-week double-blind, placebo-controlled

In [29]:
generated_translations_rag

['You can clean the exterior of your Levemir InnoLet by wiping it with a medicinal swab.',
 'When an alternative injection site is used, it is particularly important to generate sufficient surface tension at the injection site to facilitate a successful injection.',
 'At higher strengths (5, 7.5, and 10 mg), Arixtra is suitable for treating venous thromboembolic events such as deep vein thrombosis (DVT, blood clot in the leg) or pulmonary embolism (PE, blood clot in the lung).',
 'The metabolism and excretion of dabigatran were investigated after a single intravenous administration of radiolabeled dabigatran to healthy male volunteers.',
 'Which risks are associated with Poulvac FluFend H5N3 RG?',
 'Nicotine acid derivatives, Colestipol),',
 'Jaundice, elevated liver values',
 'Table 1 lists the unwanted drug reactions (events where a causal relationship with the drug is assumed) that occurred in 291 Alzheimer patients. These patients participated in a specific 24-week double-blind, pl

In [30]:
references

['You can clean your Levemir InnoLet by wiping it with a medicinal swab.',
 'When using alternate injection sites, it is particularly important to create enough surface tension on the site to be able to successfully complete the injection.',
 'At higher strengths (5, 7.5 and 10 mg), Arixtra is used to treat VTEs such as deep vein thrombosis (DVT, a blood clot in the leg) or pulmonary embolism (PE, blood clot in the lung).',
 'Metabolism and excretion of dabigatran were studied following a single intravenous dose of radiolabeled dabigatran in healthy male subjects.',
 'What is the risk associated with Poulvac FluFend H5N3 RG?',
 '{PRODUCT NAME} belongs to a group of medicines known as statins, which are lipid (fat) regulating medicines.',
 'Rare: icterus, increased liver values',
 'Table 1 displays the adverse reactions (events reasonably believed to be causally related to the medicinal product) reported in 291 patients with Alzheimer’ s dementia treated in a specific 24-week double-bli

In [31]:
import evaluate

bleu = evaluate.load("bleu")
bleu_score = bleu.compute(predictions=generated_translations_org, references=references)
print(f"BLEU score (ORG): {bleu_score}")
bleu_score = bleu.compute(predictions=generated_translations_rag, references=references)
print(f"BLEU score (RAG): {bleu_score}")

# BLEU score is higher with RAG & also the length is very close to reference

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules: 0.00B [00:00, ?B/s]

BLEU score (ORG): {'bleu': 0.2800863457403616, 'precisions': [0.49902534113060426, 0.3249243188698285, 0.22698744769874477, 0.16720955483170466], 'brevity_penalty': 1.0, 'length_ratio': 1.2761194029850746, 'translation_length': 1026, 'reference_length': 804}
BLEU score (RAG): {'bleu': 0.3660207728198022, 'precisions': [0.6354556803995006, 0.4177545691906005, 0.3023255813953488, 0.22701149425287356], 'brevity_penalty': 0.996261686604726, 'length_ratio': 0.996268656716418, 'translation_length': 801, 'reference_length': 804}
