# Introduction to Large Language Models (LLMs)

Large Language Models (LLMs) have revolutionized natural language processing (NLP) and have become foundational tools in various AI applications. These models, trained on massive datasets, are capable of generating human-like text, translating languages, summarizing documents, answering questions, and much more. In this notebook, we will explore how to use large language models, particularly those built on architectures like GPT, BERT, and LLaMA.

## What Are Large Language Models?

LLMs are a type of deep learning model specifically designed to understand and generate text. They are typically based on the transformer architecture, whose attention mechanism processes all tokens of an input in parallel, which makes training on massive datasets efficient; new text is still generated one token at a time. The "large" in LLMs refers to the vast number of parameters (often in the billions) that these models possess, which enables them to capture the nuances and complexities of human language.

### Key Features of LLMs:

- **Contextual Understanding:** LLMs can generate text that is contextually relevant, meaning they understand the context in which words are used, leading to more accurate and coherent outputs.
- **Transfer Learning:** These models can be fine-tuned on specific tasks with smaller datasets, making them highly versatile for various NLP applications.
- **Generative Capabilities:** Beyond understanding text, LLMs can generate creative and complex content, from poetry to technical explanations.

## Why Use LLMs?

The widespread adoption of LLMs is driven by their ability to perform a wide range of tasks with minimal human intervention. They are used in chatbots, content creation, sentiment analysis, language translation, code generation, and more. Their ability to generalize across tasks without needing task-specific models has made them invaluable in both industry and research.

## What Will You Learn?

In this notebook, you will learn how to leverage open-source LLMs for text generation and other NLP tasks using Python and popular libraries such as `transformers`. We will guide you through:
- Setting up the environment for working with LLMs.
- Loading pre-trained models and tokenizers.
- Generating text based on specific prompts.

By the end of this notebook, you will have a solid understanding of how to implement and use LLMs in practical scenarios, opening the door to countless AI-powered applications.

Let’s get started!


In [None]:
# Install package requirements
!pip install transformers==4.41 bitsandbytes accelerate sentence_transformers faiss-cpu openai datasets evaluate


# LLM: Qwen2

- [Model details](https://qwenlm.github.io/blog/qwen2/)
- Completely open, you do not have to apply for access
- We will use a quantized version of the model, which significantly reduces memory requirements while largely preserving accuracy. If you are interested, read more [here](https://huggingface.co/docs/optimum/concept_guides/quantization).
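
To get a feel for why quantization matters, the following back-of-the-envelope sketch estimates the memory needed for the weights alone at different precisions. The parameter count of 7.6 billion is an assumed, approximate figure for a 7B-class model, and the estimate ignores activations, the KV cache, and the small overhead of quantization constants.

```python
# Rough weight-memory estimate; N_PARAMS is an assumed, approximate figure.
N_PARAMS = 7.6e9

BITS_PER_WEIGHT = {"float32": 32, "float16": 16, "int8": 8, "4-bit": 4}

def weight_memory_gb(n_params, bits):
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits / 8 / 1e9

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>8}: {weight_memory_gb(N_PARAMS, bits):5.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight memory to roughly a quarter, which is what lets a model of this size fit on a single consumer or Colab GPU.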

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

device = "cuda"  # the device to move the tokenized inputs onto

# Configure 4-bit quantization (bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # load the weights in 4-bit precision (use load_in_8bit=True instead for 8-bit)
    bnb_4bit_use_double_quant=True,  # optional: also quantize the quantization constants to save a bit more memory
    bnb_4bit_quant_type="nf4",  # optional: 4-bit data type, 'nf4' (NormalFloat4) or 'fp4'
    bnb_4bit_compute_dtype=torch.float16  # optional: dtype used for computation after the weights are dequantized
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=bnb_config  # Pass the quantization configuration
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# A function that returns the answer from Qwen2 model
def qwen2_generate(prompt):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    # Render the conversation into a single prompt string using the model's chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # Pass input_ids together with the attention mask
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512
    )
    # Drop the prompt tokens so that only the newly generated tokens are decoded
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return response
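
`apply_chat_template` turns the list of role/content messages into one prompt string, using the template that ships with the tokenizer. As a rough illustration only, the sketch below assumes a ChatML-style layout with `<|im_start|>` and `<|im_end|>` markers. The helper `render_chatml` is hypothetical and not part of `transformers`; the tokenizer's own template remains the source of truth.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Hypothetical helper: format messages in an assumed ChatML-style layout."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so that the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(render_chatml(messages))
```

Printing the result of the real `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is the reliable way to see what the model actually receives.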

In [None]:
# Chat about anything
prompt = "Give me a short introduction to large language model."
response = qwen2_generate(prompt)
print(response)

A large language model is a type of artificial intelligence (AI) system that has been trained on vast amounts of text data, enabling it to generate human-like responses and perform various natural language processing tasks. These models are characterized by their immense size, which allows them to capture complex patterns and nuances in language.

Large language models can be used for a wide range of applications, such as:

1. **Language translation**: They can translate text from one language to another with a high degree of accuracy.
2. **Text summarization**: They can condense long texts into shorter, more manageable summaries while preserving the main points.
3. **Question answering**: They can answer questions posed in natural language, providing detailed and contextually relevant responses.
4. **Dialogue generation**: They can engage in conversations with humans, responding to prompts and maintaining coherent discussions.
5. **Creative writing**: They can generate stories, poems,

In [None]:
# Translate documents
prompt = "Translate the following text to English & French. Put translations in separate lines. \n\nWenn es darum geht, das eigene Geld zu vermehren oder was fürs Alter anzusparen, ist immer häufiger von ETFs die Rede – kurz für Exchange Traded Funds, also börsengehandelte Indexfonds. Klingt sperrig? Mag sein. Aber einen ersten Sparplan anzulegen, ist unkompliziert. Ein Smartphone reicht. Und um in ETFs zu investieren, benötigen wir weder Startkapital noch großes Vorwissen. Selbst, wer die Altersvorsorge seit Jahrzehnten vor sich herschiebt, kann noch starten."
response = qwen2_generate(prompt)
print(response)

English Translation:
When it comes to multiplying your own capital or saving for retirement, ETFs are increasingly being talked about - short for Exchange-Traded Funds, or stock exchange-traded index funds. Sounds daunting? Maybe it does. But setting up a first savings plan is straightforward. All you need is a smartphone. And to invest in ETFs, we don't need any startup capital or extensive knowledge. Even those who have been saving for retirement for decades can still start.

French Translation:
Lorsqu'il s'agit de multiplier vos propres économies ou de les épargner pour la retraite, les ETFs sont de plus en plus évoqués - abrégé de Fonds d'Indexation à Circulation Échangée, ou fonds d'indexation cotés en bourse. Cela sonne-il impressionnant ? Peut-être. Mais mettre en place un premier plan d'épargne est simple. Tout ce dont vous avez besoin, c'est d'un smartphone. Et pour investir dans les ETFs, nous n'avons ni besoin de capital de départ ni de connaissances approfondies. Même ceux 

### Modify text generation parameters

Next, we will change some of the text generation hyperparameters, focusing on only a few of the most important ones. You can find the full list of parameters on [HuggingFace](https://huggingface.co/docs/transformers/main_classes/text_generation).

In [None]:
prompt = "Translate to English:\n\nUS-Vizepräsidentin Kamala Harris verteidigte ihre geänderte Meinung zu wichtigen Themen in ihrem ersten Interview seit ihrer Bewerbung um die Präsidentschaft."
messages = [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

### Temperature
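Temperature rescales the next-token probabilities: the model's logits are divided by the temperature before the softmax is applied. Values below 1 sharpen the distribution (more deterministic output), while values above 1 flatten it (more varied output). You can find a great visualization and explanation of temperature [here](https://lukesalamone.github.io/posts/what-is-temperature/).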
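
The effect is easy to see on a toy example. The sketch below applies a softmax with temperature to three made-up logits; the function name `softmax_with_temperature` and the numbers are illustrative and not taken from the model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in [0.1, 0.5, 1.0, 2.0]:
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass sits on the highest-scoring token, so repeated sampling tends to return the same text.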

In [None]:
# change temperature of the model
for tmp in [0.1, 0.1, 0.5, 0.98]:  # lower values make model more deterministic
  generated_ids = model.generate(
      model_inputs.input_ids,
      max_new_tokens=512,
      do_sample = True,
      temperature = tmp
  )
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
  ]

  response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  print(response)

US Vice President Kamala Harris defended her revised opinions on key issues in her first interview since her campaign for the presidency.
US Vice President Kamala Harris defended her revised opinions on key issues in her first interview since her campaign for the presidency.
US Vice President Kamala Harris defended her revised stance on key issues in her first interview since her bid for the presidency.
US Vice President Kamala Harris defended her revised stance on key issues in her first interview following her bid for the presidency.


### Top-k sampling
Top-k sampling restricts the model to the `top_k` most probable candidates when it samples the next token.

![top_k_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/top_k_sampling.png)
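
A minimal sketch of the idea, using made-up logits for a five-token vocabulary. The function `top_k_sample` is illustrative and is not how `transformers` implements the filter internally.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring candidates only."""
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the surviving candidates
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [3.0, 2.5, 0.1, -1.0, -2.0]  # made-up scores for five tokens
random.seed(0)
print([top_k_sample(logits, k=2) for _ in range(10)])  # only indices 0 and 1 can appear
```

A small `top_k` removes unlikely tokens entirely, while a large `top_k` leaves the distribution almost untouched.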

In [None]:
# change top_k of the model
for v in [10, 50, 100]:
  generated_ids = model.generate(
      model_inputs.input_ids,
      max_new_tokens=512,
      do_sample=True,
      top_k=v
  )
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
  ]

  response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  print(response)

US Vice President Kamala Harris defended her changed opinions on key issues in her first interview after campaigning for the presidency.
US Vice President Kamala Harris defended her revised stance on key issues in her first interview since campaigning for the presidency.
US Vice President Kamala Harris defended her revised opinions on key issues in her first interview since her bid for the presidency.


In [None]:
# Continue the story
prompt = "Add a few more sentences and conclude the following scene:\n\nI went on a walk with my dog. You won't believe what happened to me next."
messages = [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

for v in [10, 50, 100]:
  generated_ids = model.generate(
      model_inputs.input_ids,
      max_new_tokens=256,  # keep it short
      do_sample=True,
      top_k=v,
      temperature=1  # temperature=1 leaves the probabilities unscaled; values below 1 make the output more deterministic
  )
  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
  ]

  response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  print(f'PARAMETER VALUE {v}\n', response, '\n')

PARAMETER VALUE 10
 As I walked my dog around the neighborhood, I couldn't help but notice the beauty of the autumn foliage surrounding us. The leaves had transformed into a vibrant array of orange, red, and yellow hues that contrasted beautifully against the blue sky.

Suddenly, I heard a rustling sound coming from behind some bushes nearby. Curious, I decided to investigate, and as I approached the bushes, I saw a small, frightened rabbit hiding there. I knew that getting too close could startle it, so I slowly backed away and gave it some space.

To my surprise, the rabbit seemed to calm down and even started nibbling on some grass nearby. My dog, however, was not pleased with this new addition to our walk, and started barking and growling at the rabbit. I quickly intervened and tried to distract my dog with a toy, but it just made the situation worse.

Just then, a kind old man appeared out of nowhere and asked if we needed any help. He had seen the commotion and offered to help me

### Use case: Simple knowledge-injection
Retrieve an external article from a website and chat about it with the LLM.

In [None]:
import requests
from bs4 import BeautifulSoup

# Step 1: Specify the URL of the website you want to fetch
url = 'https://www.bbc.com/news/articles/cd0532n9pdko'  # Replace with the URL of your choice

# Step 2: Send a GET request to the URL
response = requests.get(url)

# Step 3: Check if the request was successful (status code 200)
if response.status_code == 200:
    # Step 4: Parse the content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Step 5: Extract all text from the HTML
    text = soup.get_text()

    # Optional: Clean up the text by stripping leading/trailing whitespaces
    cleaned_text = text.strip()

    # Print the cleaned text
    print(cleaned_text)
else:
    print(f"Failed to fetch the website. Status code: {response.status_code}")


Ukraine F-16 destroyed during Russian attack, BBC toldSkip to contentBritish Broadcasting CorporationHomeNewsSportBusinessInnovationCultureArtsTravelEarthVideoLiveHomeNewsIsrael-Gaza WarWar in UkraineUS ElectionKamala HarrisDonald TrumpJD VanceTim WalzUS & CanadaUKUK PoliticsEnglandN. IrelandN. Ireland PoliticsScotlandScotland PoliticsWalesWales PoliticsAfricaAsiaChinaIndiaAustraliaEuropeLatin AmericaMiddle EastIn PicturesBBC InDepthBBC VerifySportBusinessExecutive LoungeTechnology of BusinessWomen at the HelmFuture of BusinessInnovationTechnologyScience & HealthArtificial IntelligenceAI v the MindCultureFilm & TVMusicArt & DesignStyleBooksEntertainment NewsArtsArts in MotionTravelDestinationsAfricaAntarcticaAsiaAustralia and PacificCaribbean & BermudaCentral AmericaEuropeMiddle EastNorth AmericaSouth AmericaWorld’s TableCulture & ExperiencesAdventuresThe SpeciaListEarthNatural WondersWeather & ScienceClimate SolutionsSustainable BusinessGreen LivingVideoLiveLive NewsLive SportHomeNews
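
The raw `get_text()` output above also contains navigation menus and other page furniture, which wastes context length. One possible refinement is to keep only the text inside `<p>` tags. The sketch below uses Python's built-in `html.parser` instead of BeautifulSoup so that it runs without extra packages; the class `ParagraphExtractor` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text that appears inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # greater than 0 while inside a <p> element
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.paragraphs[-1] += data

sample_html = "<nav>Home News Sport</nav><p>First <b>paragraph</b>.</p><p>Second paragraph.</p>"
parser = ParagraphExtractor()
parser.feed(sample_html)
article_text = "\n".join(p.strip() for p in parser.paragraphs)
print(article_text)
```

With BeautifulSoup, a comparable result can usually be obtained by joining the text of `soup.find_all('p')`, although real pages may need site-specific selectors.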

In [None]:
prompt = "What was destroyed by Russian attack on 29th of August 2024?"
response = qwen2_generate(prompt)
print(response)

As an AI, I don't have the capability to predict future events or provide real-time updates. My knowledge is based on data up until 2021. Therefore, I cannot accurately answer your question about what was destroyed by a Russian attack on the 29th of August 2024. For such information, you should refer to news outlets or official statements after the event has occurred. Always ensure to trust reliable sources for accurate and up-to-date information.


In [None]:
prompt = "Read this context and answer the question. \n\n" + cleaned_text + "\n\nQuestion: What was destroyed by Russian attack?"
response = qwen2_generate(prompt)
print(response)

According to the context provided, one of the F-16 fighter jets sent from NATO allies to Ukraine has been destroyed during a Russian attack. This happened when there was a barrage of Russian missiles, resulting in the death of pilot Oleksiy Mes.


# RAG (Retrieval-augmented Generation)

[![RAG.jpg](https://i.postimg.cc/W1V5BJWn/RAG.jpg)](https://postimg.cc/Mvs0RX8M)

**In simple terms, RAG is to LLMs what an open-book exam is to humans.**

The concept of an open-book exam centers around assessing a student's reasoning abilities rather than their capacity to memorize specific details. In a similar vein, RAG separates factual knowledge from the LLM’s reasoning capabilities. This factual information is stored in an external knowledge source, which is both easily accessible and updatable:

- **Parametric knowledge:** Knowledge that is learned during training and implicitly stored within the neural network's weights.
- **Non-parametric knowledge:** Information that is stored externally, for example, in a vector database.

The RAG workflow consists of:

1. **Retrieve**: The user query is used to retrieve relevant context from an external knowledge source. For this, the user query is embedded with an embedding model into the same vector space as the additional context in the vector database. This enables a similarity search, and the top k closest data objects from the vector database are returned.
2. **Augment**: The user query and the retrieved additional context are incorporated into a prompt template.
3. **Generate**: Finally, the retrieval-augmented prompt is fed to the LLM.
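
The three steps can be sketched in a few lines of plain Python. This is a toy illustration under strong assumptions: word counts stand in for a real embedding model, cosine similarity stands in for the vector database, and the generation step is only indicated by a comment. The names `embed`, `retrieve`, and `augment` are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lower-cased word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    """Retrieve: rank documents by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def augment(query, context_docs):
    """Augment: place the retrieved context and the question into a prompt template."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "The capital of France is Paris.",
    "Python is a programming language.",
    "The Eiffel Tower is located in Paris.",
]
query = "Where is the Eiffel Tower located?"
prompt = augment(query, retrieve(query, docs, k=2))
print(prompt)  # Generate: this prompt would then be passed to the LLM
```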

We will use `sentence-transformers` to embed the texts and `faiss` for the similarity search, and reuse the Qwen2 model loaded above for generation.

In [None]:
import torch
from sentence_transformers import SentenceTransformer
import faiss

# Sample corpus
corpus = [
    "The capital of France is Paris.", # 0
    "Python is a programming language that lets you work quickly.", # 1
    "The Eiffel Tower is located in Paris.", # 2
    "The Great Wall of China is visible from space.", # 3
    "GPT-3 is a state-of-the-art language model developed by OpenAI." # 4
]

# Initialize retriever model using a newer sentence-transformer model
retriever_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Encode corpus
corpus_embeddings = retriever_model.encode(corpus, convert_to_tensor=True)


In [None]:
# Initialize the FAISS index
corpus_embeddings_np = corpus_embeddings.cpu().numpy()
index = faiss.IndexFlatL2(corpus_embeddings_np.shape[1])
index.add(corpus_embeddings_np)
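
`IndexFlatL2` performs an exact, exhaustive search: every stored vector is compared with the query by squared Euclidean (L2) distance. The pure-Python sketch below shows that idea on made-up 2-D vectors; the function `flat_l2_search` is illustrative and not part of FAISS.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to the query."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))  # squared L2 distance
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
distances, indices = flat_l2_search(vectors, query=[0.9, 1.2], k=2)
print(indices, distances)
```

An exhaustive scan like this is fine for a few thousand vectors; for much larger corpora, approximate index types are the usual choice.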

In [None]:
def rag_query(query):
    # Step 1: Encode query and retrieve relevant documents
    query_embedding = retriever_model.encode(query, convert_to_tensor=True)
    query_embedding_np = query_embedding.cpu().numpy()
    query_embedding_np = query_embedding_np.reshape(1, -1) # Reshape to a 2D array
    distances, top_k_indices = index.search(query_embedding_np, k=2)  # distances and indices of the 2 nearest documents
    print(top_k_indices)

    # Fetch top-k relevant documents
    retrieved_docs = [corpus[idx] for idx in top_k_indices[0]]

    # Combine query and retrieved docs for generation input
    combined_input = query + " " + " ".join(retrieved_docs)

    # Step 2: Generate the response with the Qwen2 model
    response = qwen2_generate(combined_input)

    return response

In [None]:
# Test the RAG implementation
query = "What is a state-of-the-art language model developed by OpenAI. Answer in one word."
response = rag_query(query)
print("Query:", query)
print("Response:", response)

[[4 1]]
Query: What is a state-of-the-art language model developed by OpenAI. Answer in one word.
Response: GPT-3


# Use case in translation: Improving medical translations with RAG



In [None]:
torch.cuda.empty_cache()

In [None]:
from datasets import load_dataset

ds = load_dataset("ahazeemi/opus-medical-en-de")


In [None]:
# Use the first 10,000 training pairs as the retrieval corpus
corpus = ds['train'].select(list(range(10000)))

In [None]:
# Initialize retriever model using a newer sentence-transformer model
retriever_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [None]:
# Encode corpus
corpus_embeddings = retriever_model.encode(corpus['de'], convert_to_tensor=True, show_progress_bar=True, device='cuda', batch_size=128)


In [None]:
# Initialize the FAISS index
corpus_embeddings_np = corpus_embeddings.cpu().numpy()
index = faiss.IndexFlatL2(corpus_embeddings_np.shape[1])
index.add(corpus_embeddings_np)

In [None]:
def get_samples(idx):
    # Format one German-English pair from the corpus as a few-shot example
    row = corpus[int(idx)]
    return f"German:\t{row['de']}\nEnglish:\t{row['en']}\n"


def rag_query(query, k=3):
    # Step 1: Encode query and retrieve relevant documents
    query_embedding = retriever_model.encode(query, convert_to_tensor=True)
    query_embedding_np = query_embedding.cpu().numpy()
    query_embedding_np = query_embedding_np.reshape(1, -1) # Reshape to a 2D array
    distances, top_k_indices = index.search(query_embedding_np, k=k)

    # Fetch top-k relevant documents
    retrieved_docs = [get_samples(idx) for idx in top_k_indices[0]]

    # Combine query and retrieved docs for generation input
    combined_input = "Find below a few example translations.\n\n" + "\n".join(retrieved_docs) + query

    # Step 2: Generate the response with the Qwen2 model
    response = qwen2_generate(combined_input)

    return combined_input, response

In [None]:
# Without RAG
example = 'Äußerlich kann Levemir InnoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden.'
query = f"\nTranslate to English: \n\n{example}\n"
print("Query:", query)
print("Response:", qwen2_generate(query))  # call original function

Query: 
Translate to English: 

Äußerlich kann Levemir InnoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden.

Response: Externally, Levemir InnoLet can be cleaned by wiping with a medical swab.


In [None]:
# With the RAG implementation
example = 'Äußerlich kann Levemir InnoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden.'
query = f"\nTranslate to English: \n\n{example}\n"
combined_input, response = rag_query(query)
print("Query:", combined_input)
print("----"*5)
print("Response:", response)

Query: Find below a few example translations.

German:	Äußerlich kann Mixtard 10 NovoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden.
English:	You can clean the exterior of your Mixtard 10 NovoLet by wiping it with a medicinal swab.

German:	53 • Entfernen Sie nicht den Gummistopfen. • Reinigen Sie den Stopfen mit einem antiseptischen Tupfer. • Stellen Sie die Durchstechflasche auf eine ebene Oberfläche.
English:	• Do not remove the stopper • Clean the stopper with an antiseptic swab • Put the vial on a flat surface.

German:	Äußerlich kann Protaphane NovoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden.
English:	You can clean the exterior of your Protaphane NovoLet by wiping it with a medicinal swab.

Translate to English: 

Äußerlich kann Levemir InnoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden.

--------------------
Response: You can clean the exterior of your Levemir InnoLet by wiping it with a medicinal swab.


In [None]:
# Translate test samples with and without RAG so that both can be scored with BLEU
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss
import torch
from tqdm import tqdm

def infer_on_test_set(test_size=5):
    # Load the test set
    test_set = ds['test'].select(list(range(test_size)))

    # Initialize lists to store references and generated translations
    references = []
    generated_translations_org = []
    generated_translations_rag = []

    for i, example in enumerate(tqdm(test_set)):
        query = f"\nTranslate to English: \n\n{example['de']}\n"
        combined_input, response_rag = rag_query(query)
        response_org = qwen2_generate(query)
        references.append(example['en'])
        generated_translations_org.append(response_org)
        generated_translations_rag.append(response_rag)

    return references, generated_translations_org, generated_translations_rag

In [None]:
# Evaluate on the test set
references, generated_translations_org, generated_translations_rag = infer_on_test_set(35)

100%|██████████| 35/35 [03:10<00:00,  5.46s/it]


In [None]:
generated_translations_org

['Levemir InnoLet can be cleaned externally by wiping with a medical swab.',
 'If an alternative injection site is used, it is particularly important to generate sufficient surface tension at the injection site to facilitate a successful injection.',
 'At higher strengths (5, 7.5, and 10 mg), Arixtra is suitable for the treatment of venous thromboembolic events such as deep vein thrombosis (DVT, blood clot in the leg) or pulmonary embolism (PE, blood clot in the lung).',
 'The metabolism and excretion of dabigatran were investigated after a single intravenous administration of radiolabeled dabigatran in healthy male volunteers.',
 'What risks are associated with Poulvac FluFend H5N3 RG?',
 'Nicotine derivatives, Colestipol),',
 'Jaundice, elevated liver values',
 'Table 1 lists the unwanted drug reactions (events where a causal relationship with the medication is assumed) that occurred in 291 Alzheimer patients. These patients were part of a specific 24-week double-blind, placebo-contr

In [None]:
generated_translations_rag

['You can clean the exterior of your Levemir InnoLet by wiping it with a medicinal swab.',
 'If an alternative injection site is used, it is particularly important to generate sufficient skin tension at the injection site to facilitate a successful injection.',
 'At higher strengths (5, 7.5, and 10 mg), Arixtra is suitable for treating venous thromboembolic events such as deep vein thrombosis (DVT, blood clot in the leg) or pulmonary embolism (PE, blood clot in the lung).',
 'The metabolism and elimination of dabigatran were studied after a single intravenous administration of radiolabeled dabigatran in healthy male subjects.',
 'What risks are associated with Poulvac FluFend H5N3 RG?',
 'Nicotine acid derivatives, Colestipol),',
 'Jaundice, elevated liver values',
 'Table 1 lists the unwanted drug reactions (events where a causal relationship with the medication is assumed) that occurred in 291 Alzheimer patients. These patients participated in a specific 24-week double-blind, placebo

In [None]:
references

['You can clean your Levemir InnoLet by wiping it with a medicinal swab.',
 'When using alternate injection sites, it is particularly important to create enough surface tension on the site to be able to successfully complete the injection.',
 'At higher strengths (5, 7.5 and 10 mg), Arixtra is used to treat VTEs such as deep vein thrombosis (DVT, a blood clot in the leg) or pulmonary embolism (PE, blood clot in the lung).',
 'Metabolism and excretion of dabigatran were studied following a single intravenous dose of radiolabeled dabigatran in healthy male subjects.',
 'What is the risk associated with Poulvac FluFend H5N3 RG?',
 '{PRODUCT NAME} belongs to a group of medicines known as statins, which are lipid (fat) regulating medicines.',
 'Rare: icterus, increased liver values',
 'Table 1 displays the adverse reactions (events reasonably believed to be causally related to the medicinal product) reported in 291 patients with Alzheimer’ s dementia treated in a specific 24-week double-bli

In [None]:
import evaluate

bleu = evaluate.load("bleu")
bleu_score = bleu.compute(predictions=generated_translations_org, references=references)
print(f"BLEU score (ORG): {bleu_score}")
bleu_score = bleu.compute(predictions=generated_translations_rag, references=references)
print(f"BLEU score (RAG): {bleu_score}")

# With RAG the BLEU score is higher (0.37 vs. 0.24) and the output length is much closer to the reference (length ratio 0.99 vs. 1.41)

BLEU score (ORG): {'bleu': 0.23524364777226203, 'precisions': [0.4450307827616535, 0.27858439201451907, 0.18744142455482662, 0.13178294573643412], 'brevity_penalty': 1.0, 'length_ratio': 1.414179104477612, 'translation_length': 1137, 'reference_length': 804}
BLEU score (RAG): {'bleu': 0.36960602068379117, 'precisions': [0.6368221941992434, 0.4261213720316623, 0.31258644536652835, 0.23255813953488372], 'brevity_penalty': 0.9862243896834074, 'length_ratio': 0.986318407960199, 'translation_length': 793, 'reference_length': 804}
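
The reported fields fit together as follows: BLEU is the geometric mean of the n-gram precisions multiplied by a brevity penalty that punishes outputs shorter than the reference. The sketch below recombines the numbers printed above; the function `bleu_from_parts` is an illustrative helper and not part of the `evaluate` library.

```python
import math

def bleu_from_parts(precisions, translation_length, reference_length):
    """Combine n-gram precisions and a brevity penalty into a BLEU score."""
    # Brevity penalty: 1 if the output is longer than the reference, else exp(1 - ref/out)
    if translation_length > reference_length:
        bp = 1.0
    else:
        bp = math.exp(1 - reference_length / translation_length)
    # Geometric mean of the n-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return bp * geo_mean

# Values copied from the RAG result printed above
rag_precisions = [0.6368221941992434, 0.4261213720316623, 0.31258644536652835, 0.23255813953488372]
print(bleu_from_parts(rag_precisions, translation_length=793, reference_length=804))
```

The translations without RAG are about 41% longer than the references, so they receive no brevity penalty, but the extra words lower every n-gram precision and with it the final score.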


## Access Token and Application Required for LLaMA Model

To use the LLaMA model in this notebook, you'll need to authenticate with Hugging Face's Model Hub, as this model requires an access token. Additionally, you must apply for access to the LLaMA model since it is under a specific license that requires approval.

### Steps to Apply for Access and Obtain Your Access Token:

1. **Apply for Access to the LLaMA Model**:
   - Go to the [LLaMA model page on Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
   - Click on the "Request Access" button.
   - Fill out the necessary information and submit your request.
   - Wait for approval, which may take some time depending on your application.

2. **Obtain an Access Token**:
   - Once your application is approved, sign in to your Hugging Face account.
   - Navigate to your profile settings and select "Access Tokens."
   - Generate a new token if you don't have one, and copy it.

3. **Set Up the Access Token in This Notebook**:
   - Store the token in a variable so that it can be passed to the model loader in the next step:
   
     ```python
     access_token = "hf_your_access_token_here"
     ```

4. **Use the Token in Model Loading**:
   - When loading the LLaMA model, pass the token through the `token` parameter (older `transformers` releases used `use_auth_token`, which is now deprecated):

     ```python
     import torch
     import transformers

     model = transformers.AutoModelForCausalLM.from_pretrained(
         "meta-llama/Meta-Llama-3.1-8B-Instruct",
         token=access_token,  # Hugging Face access token from step 3
         torch_dtype=torch.bfloat16,
         device_map="auto",
     )
     ```

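Once your access is approved and your token is set up, you can proceed to use the LLaMA model as demonstrated in this notebook. Note that the next cell loads `meta-llama/Meta-Llama-3-8B-Instruct` rather than the 3.1 checkpoint linked above, so check that your account can access that repository too.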
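
Pasting a token into a notebook cell makes it easy to leak when the notebook is shared. As a hedged alternative, the sketch below reads the token from the `HF_TOKEN` environment variable instead. The helper name `get_hf_token` is an illustrative assumption, not a Hugging Face API.

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in an environment variable.

    Raises RuntimeError if the variable is unset or blank, so the problem
    surfaces before any model download is attempted.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token
```

With the variable set in your shell or notebook environment, `access_token = get_hf_token()` can replace the hard-coded string in the cells that load the model.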


In [None]:
# works with transformers 4.41
import transformers
import torch
from transformers import BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Configure 4-bit quantization (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Load weights in 4-bit; for 8-bit, use load_in_8bit=True instead
    bnb_4bit_use_double_quant=True,  # Also quantize the quantization constants to save more memory
    bnb_4bit_quant_type="nf4",  # 'nf4' (NormalFloat4) or 'fp4'; nf4 is the usual choice for model weights
    bnb_4bit_compute_dtype=torch.bfloat16  # Run computations in bfloat16
)

# Your Hugging Face access token
access_token = ""

# Initialize the pipeline with quantization
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "quantization_config": bnb_config,  # Pass the quantization config here
    },
    token=access_token,  # Pass the access token here (replaces the deprecated use_auth_token)
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)

# The pipeline returns the whole conversation; the last message is the assistant's reply
print(outputs[0]["generated_text"][-1])
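
When the pipeline is called with a list of chat messages, `generated_text` holds the whole conversation as a list of role/content dictionaries, so the final `print` shows a dictionary rather than plain text. The sketch below pulls out only the reply text. The helper name `last_assistant_reply` is hypothetical, and `example_outputs` is a hand-written stand-in for a real pipeline result, with an invented reply.

```python
def last_assistant_reply(outputs):
    """Return the text of the final assistant message from a chat pipeline result."""
    conversation = outputs[0]["generated_text"]
    last_message = conversation[-1]
    if last_message.get("role") != "assistant":
        raise ValueError("The last message was not produced by the assistant.")
    return last_message["content"]

# Hand-written stand-in for a pipeline result; the reply text is invented for illustration.
example_outputs = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "Arr, I be a pirate chatbot!"},
        ]
    }
]
print(last_assistant_reply(example_outputs))
```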

# Commercial LLMs: OpenAI

The link to the `API KEY` will be sent to participants' email addresses.

In [None]:
from openai import OpenAI

API_KEY = ""  # paste your API key here before running this cell
client = OpenAI(api_key=API_KEY)

response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is an LLM?"}
  ]
)

message = response.choices[0].message.content
print(message)


### Structured outputs with LLMs

JSON is one of the most widely used formats for exchanging data between applications.

Structured Outputs is an OpenAI API feature that constrains the model's response to a JSON Schema you supply, so you do not need to worry about the model omitting a required key or hallucinating an invalid enum value.

In [None]:
from pydantic import BaseModel
from openai import OpenAI

API_KEY = ""
client = OpenAI(api_key=API_KEY)

class TranslationCandidates(BaseModel):
    original: str
    german: str
    slovene: str

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a sentence translation system. Translate an English sentence to German and Slovene."},
        {"role": "user", "content": "We are going to lunch."},
    ],
    response_format=TranslationCandidates,
)

# The SDK parses the response into a TranslationCandidates instance
translations = completion.choices[0].message.parsed
print(translations)

In [None]:
import json

json_dict = json.loads(completion.choices[0].message.content)
print(json_dict)
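
To make concrete what schema adherence means for code that consumes the raw JSON, the sketch below checks a payload against the three fields declared in `TranslationCandidates` using only the standard library. The function name `check_translation_payload` is hypothetical, and `sample` is a hand-written stand-in, not actual model output. In the notebook itself, `message.parsed` already gives you a validated object, so this is only an illustration of the guarantee.

```python
import json

# Field names taken from the TranslationCandidates model defined above.
REQUIRED_FIELDS = ("original", "german", "slovene")

def check_translation_payload(raw_json: str) -> dict:
    """Parse a JSON string and verify that every required field is a string."""
    data = json.loads(raw_json)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"Missing fields: {missing}")
    wrong_type = [name for name in REQUIRED_FIELDS if not isinstance(data[name], str)]
    if wrong_type:
        raise ValueError(f"Fields with non-string values: {wrong_type}")
    return data

# Hand-written example payload, not produced by the model.
sample = (
    '{"original": "We are going to lunch.", '
    '"german": "Wir gehen zum Mittagessen.", '
    '"slovene": "Gremo na kosilo."}'
)
print(check_translation_payload(sample))
```

Without Structured Outputs, a check like this (or a retry loop around it) would be needed after every call. With the feature enabled, the response is expected to satisfy the schema already.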

# Exercise 1: Experiment with the open-source LLMs presented in this notebook

Instructions:

* **Prompt techniques**: Try to improve the baseline prompt with the prompt strategies presented on https://www.promptingguide.ai/techniques
* **Text Summarization**: Use the model to summarize news articles
* **Experiment with hyperparameters**: Change hyperparameters of the models and observe the outputs

# Exercise 2: Compare the performance of commercial and open-source models for translation tasks

Instructions:

1. Select a dataset for translation (e.g., a set of sentences in a specific language).
2. Choose an open-source LLM and a commercial LLM for comparison.
3. Translate the dataset using both models.
4. Evaluate the translations using metrics like BLEU or METEOR.
5. Analyze the results and compare the strengths and weaknesses of each model.
6. Consider factors like translation quality, speed, cost, and ease of use in your comparison.

# Exercise 3: Can you improve your machine translation problem with the RAG technique?

Instructions:

1. Identify a specific translation challenge or domain where improvement is desired.
2. Create or curate a knowledge base relevant to the chosen domain.
3. Integrate the knowledge base with your chosen LLM using the RAG technique.
4. Evaluate the translation performance with and without RAG using appropriate metrics.
5. Analyze the impact of RAG on translation quality, focusing on the specific challenge.
6. Explore different retrieval methods and prompt engineering techniques for optimization.
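
As a possible starting point for steps 2 and 3, the sketch below retrieves the most similar entries from a tiny translation memory and places them in the prompt as examples. Everything in it is an assumption made for illustration: the helper names, the English-to-German direction, the word-overlap (Jaccard) similarity, and the three hand-written sentence pairs. A real solution would likely use embeddings and a larger, domain-specific knowledge base.

```python
def tokenize(text):
    """Lowercase and split on whitespace; returns a set of tokens."""
    return set(text.lower().split())

def jaccard(a, b):
    """Word-overlap similarity between two sentences, in the range [0, 1]."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def retrieve(query, knowledge_base, k=2):
    """Return the k (source, target) pairs whose source is most similar to the query."""
    ranked = sorted(knowledge_base, key=lambda pair: jaccard(query, pair[0]), reverse=True)
    return ranked[:k]

def build_prompt(query, examples):
    """Assemble a few-shot translation prompt from the retrieved pairs."""
    parts = ["Translate the English sentence to German. Use the examples for terminology."]
    for source, target in examples:
        parts.append(f"English: {source}\nGerman: {target}")
    parts.append(f"English: {query}\nGerman:")
    return "\n\n".join(parts)

# Hypothetical translation memory; replace with pairs from your own domain.
knowledge_base = [
    ("Take one tablet daily.", "Nehmen Sie täglich eine Tablette ein."),
    ("Keep this medicine out of the sight and reach of children.",
     "Bewahren Sie dieses Arzneimittel für Kinder unzugänglich auf."),
    ("Do not use this medicine after the expiry date.",
     "Sie dürfen dieses Arzneimittel nach dem Verfalldatum nicht mehr verwenden."),
]

query = "Take two tablets daily."
prompt = build_prompt(query, retrieve(query, knowledge_base, k=1))
print(prompt)
```

The resulting prompt can be sent to either the open-source pipeline or the commercial API, and the translations with and without the retrieved examples can then be scored with BLEU for step 4.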