<a href="https://www.kaggle.com/code/gpreda/quantized-romanian-llm-openllm-ro-tests?scriptVersionId=182435900" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction


<center><img src="https://upload.wikimedia.org/wikipedia/commons/7/73/Flag_of_Romania.svg" width=100><img src="https://eu-images.contentstack.com/v3/assets/blt6b0f74e5591baa03/blt98d8a946b63c9b5f/64b7170ab314c94aa481d8c3/Untitled_design_(1).jpg" width=100><img src="https://upload.wikimedia.org/wikipedia/commons/7/73/Flag_of_Romania.svg" width=100></center>
<br>

**OpenLLM-Ro** is an initiative to develop a **Romanian LLM** on top of existing large language models. The first version is based on **Llama2**, fine-tuned on a corpus of Romanian-language data.

For more details about the project, check this link: https://ilds.ro/llm-for-romanian/


In this notebook, we will test a **quantized** version of the Romanian LLM. We will use the `OpenLLM-Ro/RoLlama2-7b-Instruct-v1` model from HuggingFace (7 billion parameters, instruct version 1).

For quantization we will use the **bitsandbytes** library. Using bitsandbytes with transformers also requires the **accelerate** library.


# Packages and installations

In [1]:
!pip install -U transformers accelerate einops xformers bitsandbytes

Collecting transformers
  Downloading transformers-4.41.2-py3-none-any.whl.metadata (43 kB)
Collecting accelerate
  Downloading accelerate-0.31.0-py3-none-any.whl.metadata (19 kB)
Collecting einops
  Downloading einops-0.8.0-py3-none-any.whl.metadata (12 kB)
Collecting xformers
  Downloading xformers-0.0.26.post1-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl.metadata (2.2 kB)
Collecting huggingface-hub<1.0,>=0.23.0 (from transformers)
  Downloading huggingface_hub-0.23.3-py3-none-any.whl.metadata (12 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting torch>=1.10.0 (from accelerate)
  Downloading torch-2.3.0-cp310-cp310-manylinu

In [2]:
from time import time
from IPython.display import display, Markdown
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer

# Initialize the quantized model


We specify the HuggingFace model ID.  
We then define the 4-bit quantization configuration for bitsandbytes: NF4 quantization with double quantization, computing in `bfloat16`.  

In [3]:
model_id = 'OpenLLM-Ro/RoLlama2-7b-Instruct-v1'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
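To get a rough feel for why 4-bit loading matters, the sketch below estimates the memory needed for the weights of a 7-billion-parameter model at different precisions. It is back-of-the-envelope arithmetic only: `N_PARAMS` is the nominal 7B from the model name rather than a measured count, and the estimate ignores quantization constants, layers that stay unquantized, activations, and the KV cache, so real usage will be higher.

```python
# Rough weight-memory estimate; N_PARAMS is the nominal size, not a measured count.
N_PARAMS = 7_000_000_000


def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Return the memory needed to store n_params weights, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


for label, bits in [("float32", 32), ("float16 / bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>20}: {weight_memory_gib(N_PARAMS, bits):6.2f} GiB")
```

Once the model is loaded, `model.get_memory_footprint()` from transformers should give a measured figure to compare against this estimate.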

We load the model configuration, then load the HuggingFace model with quantization enabled (`quantization_config=bnb_config`).  
On the first run, the model shards are downloaded from HuggingFace before being loaded and quantized.  


In [4]:
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
time_2 = time()
print(f"Download and quantize model: {round(time_2-time_1, 3)} sec.")



config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/6 [00:00<?, ?it/s]

model-00001-of-00006.safetensors:   0%|          | 0.00/4.84G [00:00<?, ?B/s]

model-00002-of-00006.safetensors:   0%|          | 0.00/4.86G [00:00<?, ?B/s]

model-00003-of-00006.safetensors:   0%|          | 0.00/4.86G [00:00<?, ?B/s]

model-00004-of-00006.safetensors:   0%|          | 0.00/4.86G [00:00<?, ?B/s]

model-00005-of-00006.safetensors:   0%|          | 0.00/4.86G [00:00<?, ?B/s]

model-00006-of-00006.safetensors:   0%|          | 0.00/2.68G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

Download and quantize model: 1700.205 sec.


In [5]:
time_1 = time()
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare tokenizer: {round(time_2-time_1, 3)} sec.")

tokenizer_config.json:   0%|          | 0.00/1.77k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/91.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Prepare tokenizer: 4.798 sec.


Define the query pipeline.

In [6]:
time_1 = time()
pipe = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto")
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

2024-06-09 18:15:23.064232: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-09 18:15:23.064352: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-09 18:15:23.192656: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Prepare pipeline: 12.956 sec.
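The cells above all time their work with the same `time_1` / `time_2` pattern. As a sketch, that pattern can be wrapped in a small context manager. The name `timed` is hypothetical and not part of the original notebook, and the example uses a stand-in workload so that it runs without loading a model.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label):
    """Print how long the enclosed block took, in the notebook's output style."""
    start = perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {round(perf_counter() - start, 3)} sec.")


# Example with a stand-in workload instead of a real model load
with timed("Sum of squares"):
    total = sum(i * i for i in range(100_000))
```

In the notebook, the body of the `with` block would be the tokenizer or pipeline construction.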


# Save the model and tokenizer

In [7]:
model.save_pretrained("/kaggle/working/RoLlama2-7b-Instruct-v1-q4b")
tokenizer.save_pretrained("/kaggle/working/RoLlama2-7b-Instruct-v1-q4b")

('/kaggle/working/RoLlama2-7b-Instruct-v1-q4b/tokenizer_config.json',
 '/kaggle/working/RoLlama2-7b-Instruct-v1-q4b/special_tokens_map.json',
 '/kaggle/working/RoLlama2-7b-Instruct-v1-q4b/tokenizer.model',
 '/kaggle/working/RoLlama2-7b-Instruct-v1-q4b/added_tokens.json',
 '/kaggle/working/RoLlama2-7b-Instruct-v1-q4b/tokenizer.json')
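Saving the quantized model is mainly useful if it can be reloaded later without repeating the download and quantization. The sketch below shows how that reload might look. It was not run as part of this notebook, the helper name `load_saved_model` is hypothetical, and it assumes versions of transformers and bitsandbytes that support serializing 4-bit weights.

```python
def load_saved_model(path="/kaggle/working/RoLlama2-7b-Instruct-v1-q4b"):
    """Reload the saved 4-bit model and tokenizer from a local folder."""
    # Imported lazily so that defining the helper does not require a GPU setup
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the quantization settings are read back from the saved
    # config.json, so no BitsAndBytesConfig is passed here
    model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(path)
    return model, tokenizer
```

Calling `model, tokenizer = load_saved_model()` in a later session would then take the place of the download and quantization step, provided the saved folder is still available.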

# Prepare the query function

In [8]:
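def query_model(
        system_message,
        user_message,
        temperature=0.7,
        max_length=1024
        ):
    """Query the pipeline and return the question, answer and elapsed time.

    Note: `max_length` is passed as `max_new_tokens`, so it caps the number of
    generated tokens, not the prompt plus the answer.
    """
    start_time = time()
    user_message = "Question: " + user_message + " Answer:"
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
        ]
    # Render the messages as one prompt string using the tokenizer's chat template
    prompt = pipe.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
        )
    # Stop at the tokenizer's EOS token. "<|eot_id|>" is a Llama 3 token, so it
    # is added only if this tokenizer has it (otherwise it may resolve to <unk>).
    terminators = [pipe.tokenizer.eos_token_id]
    eot_id = pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    if eot_id is not None and eot_id != pipe.tokenizer.unk_token_id:
        terminators.append(eot_id)
    sequences = pipe(
        prompt,
        do_sample=True,
        top_p=0.9,
        temperature=temperature,
        eos_token_id=terminators,
        max_new_tokens=max_length,
        return_full_text=False,  # return only the generated answer, not the prompt
        pad_token_id=pipe.model.config.eos_token_id
    )
    answer = sequences[0]['generated_text']
    end_time = time()
    ttime = f"Total time: {round(end_time-start_time, 2)} sec."

    return user_message + " " + answer  + " " +  ttime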


system_message = """
You are an AI assistant designed to answer simple questions.
Please restrict your answer to the exact question asked.
"""

# Utility function to display the results of the query

In [9]:
def colorize_text(text):
    """Turn each known label into a bold, colored Markdown heading on its own paragraph."""
    for word, color in zip(["Reasoning", "Question", "Answer", "Total time"], ["blue", "red", "green", "magenta"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

# Test the model

Because this is a Romanian LLM, we will ask the questions in Romanian.
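The tests in this section all follow the same pattern: call `query_model` with a question, then render the result. As a sketch, the same pattern can be driven from a list of questions. `run_questions` and `fake_query` are hypothetical names introduced here, and `fake_query` only stands in for `query_model` so that the example runs without the model.

```python
def run_questions(questions, query_fn, system_message, temperature=0.1, max_length=32):
    """Call query_fn once per question and collect the responses in order."""
    return [
        query_fn(system_message, user_message=q, temperature=temperature, max_length=max_length)
        for q in questions
    ]


# Stand-in for query_model so that the sketch runs without a GPU
def fake_query(system_message, user_message, temperature=0.7, max_length=1024):
    return f"Question: {user_message} Answer: (model output) Total time: 0.0 sec."


responses = run_questions(
    ["Care sunt culorile drapelului României?", "Numește 4 râuri din România."],
    fake_query,
    system_message="You are an AI assistant designed to answer simple questions.",
)
for response in responses:
    print(response)
```

In the notebook, passing `query_model` as `query_fn` and calling `display(Markdown(colorize_text(response)))` in the loop would reproduce the individual cells.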

In [10]:
response = query_model(
    system_message,
    user_message="Care sunt primele 3 orase ca mărime ale României?",
    temperature=0.1,
    max_length=32)
display(Markdown(colorize_text(response)))

No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.
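The warning above means that the tokenizer ships without its own chat template, so `apply_chat_template` falls back to the class default. For Llama tokenizers that default is, as far as we know, the Llama 2 `[INST]` / `<<SYS>>` layout. The sketch below approximates that layout by hand for one system message and one user message. The function name `llama2_prompt` is hypothetical, and the exact whitespace may differ from what the library produces.

```python
def llama2_prompt(system_message: str, user_message: str, bos_token: str = "<s>") -> str:
    """Approximate the Llama 2 chat layout for one system + one user message."""
    system_block = f"<<SYS>>\n{system_message.strip()}\n<</SYS>>\n\n"
    return f"{bos_token}[INST] {system_block}{user_message.strip()} [/INST]"


print(llama2_prompt(
    "You are an AI assistant designed to answer simple questions.",
    "Question: Care sunt culorile drapelului României? Answer:",
))
```

Printing the `prompt` variable inside `query_model` is the reliable way to see what is actually sent to the model.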




**<font color='red'>Question:</font>** Care sunt primele 3 orase ca mărime ale României? 

**<font color='green'>Answer:</font>**  Primele 3 cele mai mari orașe din România sunt București, Cluj-Napoca și Timișoara.  

**<font color='magenta'>Total time:</font>** 5.11 sec.

In [11]:
response = query_model(
    system_message,
    user_message="Cine este Mircea Cărtărescu?",
    temperature=0.1,
    max_length=32)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Cine este Mircea Cărtărescu? 

**<font color='green'>Answer:</font>**  Mircea Cărtărescu este un renumit scriitor român, cunoscut pentru romanele sale, poezie și 

**<font color='magenta'>Total time:</font>** 3.19 sec.

In [12]:
response = query_model(
    system_message,
    user_message="Care sunt culorile drapelului României?",
    temperature=0.1,
    max_length=32)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Care sunt culorile drapelului României? 

**<font color='green'>Answer:</font>**  Culorile drapelului României sunt roșu, galben și albastru.  

**<font color='magenta'>Total time:</font>** 2.83 sec.

In [13]:
response = query_model(
    system_message,
    user_message="Numește 4 râuri din România.",
    temperature=0.1,
    max_length=32)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Numește 4 râuri din România. 

**<font color='green'>Answer:</font>**  1. Dunărea
2. Mureș
3. Olt
4. Someș  

**<font color='magenta'>Total time:</font>** 2.49 sec.

In [14]:
response = query_model(
    system_message,
    user_message="Care au fost președnții României între 1990 și 2020?",
    temperature=0.1,
    max_length=128)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Care au fost președnții României între 1990 și 2020? 

**<font color='green'>Answer:</font>**  Președinții României între 1990 și 2020 au fost:

1. Ion Iliescu (1990-1996)
2. Emil Constantinescu (1996-2000)
3. Ion Iliescu (2000-2004)
4. Traian Băsescu (2004-2014)
5. Klaus Iohannis (2014-prezent)  

**<font color='magenta'>Total time:</font>** 10.97 sec.

In [15]:
response = query_model(
    system_message,
    user_message="Cine e primarul Bucurestiului?",
    temperature=0.1,
    max_length=128)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Cine e primarul Bucurestiului? 

**<font color='green'>Answer:</font>**  Primarul Bucureștiului este Nicușor Dan.  

**<font color='magenta'>Total time:</font>** 1.78 sec.

In [16]:
response = query_model(
    system_message,
    user_message="Câte sectoare are Municipiul București?",
    temperature=0.1,
    max_length=32)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Câte sectoare are Municipiul București? 

**<font color='green'>Answer:</font>**  Municipiul București este împărțit în 6 sectoare.  

**<font color='magenta'>Total time:</font>** 2.25 sec.

In [17]:
response = query_model(
    system_message,
    user_message="Care e suprafața României?",
    temperature=0.1,
    max_length=64)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Care e suprafața României? 

**<font color='green'>Answer:</font>**  Suprafața României este de aproximativ 238.391 de kilometri pătrați (92.049 mile pătrate).  

**<font color='magenta'>Total time:</font>** 4.04 sec.

In [18]:
response = query_model(
    system_message,
    user_message="Unde e centrul geometric al României?",
    temperature=0.1,
    max_length=64)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Unde e centrul geometric al României? 

**<font color='green'>Answer:</font>**  Centrul geometric al României este situat în orașul Sibiu, în județul Sibiu.  

**<font color='magenta'>Total time:</font>** 2.83 sec.

In [19]:
response = query_model(
    system_message,
    user_message="Da-mi te rog trei exemple de cantareti de muzica usoara romani",
    temperature=0.1,
    max_length=128)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Da-mi te rog trei exemple de cantareti de muzica usoara romani 

**<font color='green'>Answer:</font>**  1. Aura Urziceanu
2. Angela Similea
3. Corina Chiriac  

**<font color='magenta'>Total time:</font>** 2.76 sec.