# Introduction

OpenLLM-Ro is a new initiative to develop a Romanian LLM on top of existing large language models. The first version is based on Llama 2, fine-tuned on a corpus of Romanian-language data.

For more details about the project, see https://ilds.ro/llm-for-romanian/

## The current model

The model used here was obtained by quantizing the [OpenLLM-Ro/RoLlama2-7b-Instruct](https://huggingface.co/OpenLLM-Ro/RoLlama2-7b-Instruct) model with the notebook [Quantized Romanian LLM (OpenLLM-Ro) Tests](https://www.kaggle.com/code/gpreda/quantized-romanian-llm-openllm-ro-tests).
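To give a feel for what quantizing to 4 bits means, here is a toy sketch in plain Python. It uses a single absmax scale for a short list of weights. This is an assumption made for illustration only: it is not the scheme used by bitsandbytes or by the linked notebook, and the weight values are hypothetical.

```python
def quantize_4bit(weights):
    """Map floats to signed integer codes in [-7, 7] using one absmax scale."""
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -1.30, 0.07, 0.91]  # hypothetical weights
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.2f} -> code {c:+d} -> {r:+.3f}")
```

Each weight is stored as a small integer plus one shared scale, which is where the memory saving comes from; the price is the rounding error visible in the restored values.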



# Load the model

In [1]:
!pip install -U transformers accelerate bitsandbytes

Collecting transformers
  Downloading transformers-4.41.0-py3-none-any.whl.metadata (43 kB)
Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl.metadata (18 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl.metadata (2.2 kB)
Collecting huggingface-hub<1.0,>=0.23.0 (from transformers)
  Downloading huggingface_hub-0.23.0-py3-none-any.whl.metadata (12 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.41.0-py3-none-any.whl (9.1 MB)
Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)

In [2]:
from time import time
from IPython.display import display, Markdown
from transformers import pipeline

2024-05-19 20:51:45.603685: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-19 20:51:45.603787: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-19 20:51:45.722080: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [3]:
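# The checkpoint was saved already quantized to 4 bits, so its stored
# quantization config is applied on load; no extra argument is needed here.
pipe = pipeline("text-generation",
                model="/kaggle/input/openllm-ro-q4b/transformers/7b-instruct-q4b/1/RoLlama2-7b-Instruct-v1-q4b")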

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# Prepare the query function

In [4]:
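def query_model(
        system_message,
        user_message,
        temperature=0.7,
        max_length=1024
        ):
    """Send one question to the model; return question, answer and timing as text.

    Note: max_length is passed as max_new_tokens, so it caps only the
    generated tokens, not the prompt plus the answer.
    """
    start_time = time()
    # Wrap the question so the output carries "Question:" / "Answer:" labels.
    user_message = "Question: " + user_message + " Answer:"
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
        ]
    # Render the messages into one prompt string with the tokenizer's chat template.
    prompt = pipe.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
        )
    # Stop on EOS. "<|eot_id|>" is a Llama 3 token; add it only if this
    # tokenizer defines it, since an unknown token may resolve to the unk id.
    terminators = [pipe.tokenizer.eos_token_id]
    eot_id = pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    if eot_id is not None and eot_id != pipe.tokenizer.unk_token_id:
        terminators.append(eot_id)
    sequences = pipe(
        prompt,
        do_sample=True,
        top_p=0.9,
        temperature=temperature,
        eos_token_id=terminators,
        max_new_tokens=max_length,
        return_full_text=False,  # return only the newly generated text
        pad_token_id=pipe.model.config.eos_token_id
    )
    answer = sequences[0]['generated_text']
    end_time = time()
    ttime = f"Total time: {round(end_time-start_time, 2)} sec."

    return user_message + " " + answer  + " " +  ttime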


system_message = """
You are an AI assistant designed to answer simple questions.
Please restrict your answer to the exact question asked.
"""
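
`query_model` relies on the tokenizer's chat template to turn the system and user messages into a single prompt string. As a rough illustration of what such a prompt looks like, the sketch below builds one by hand following the commonly documented Llama 2 convention (`[INST]`, `<<SYS>>`). The helper name `build_llama2_prompt` is hypothetical, and the exact whitespace produced by the real template may differ.

```python
def build_llama2_prompt(system_message, user_message, bos_token="<s>"):
    """Approximate a Llama 2 style chat prompt for one system + one user turn."""
    # The system message is wrapped in <<SYS>> markers inside the first
    # [INST] block, followed by the user message.
    content = f"<<SYS>>\n{system_message}\n<</SYS>>\n\n{user_message}"
    return f"{bos_token}[INST] {content.strip()} [/INST]"


example_prompt = build_llama2_prompt(
    "You are an AI assistant designed to answer simple questions.",
    "Question: Care este suprafata Romaniei? Answer:",
)
print(example_prompt)
```

The model continues the text after `[/INST]`, which is why the function asks the pipeline for the newly generated text only.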

# Test with a few questions

In [5]:
def colorize_text(text):
    for word, color in zip(["Reasoning", "Question", "Answer", "Total time"], ["blue", "red", "green", "magenta"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text
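
To see what `colorize_text` does before running it on model output, the sketch below repeats the function and applies it to a made-up response string (the sample text is hypothetical, not a model answer). Each known label is moved onto its own paragraph and wrapped in a colored, bold `<font>` tag that `Markdown` can render.

```python
def colorize_text(text):
    # Same logic as the cell above, repeated so this snippet runs on its own.
    for word, color in zip(["Reasoning", "Question", "Answer", "Total time"],
                           ["blue", "red", "green", "magenta"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text


sample = "Question: Ce este Dunarea? Answer: Un fluviu. Total time: 1.0 sec."
colored = colorize_text(sample)
print(colored)
```

Labels that do not occur in the text, such as `Reasoning:` here, are simply left out of the result.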

In [6]:
response = query_model(
    system_message,
    user_message="Care este suprafata Romaniei?",
    temperature=0.1,
    max_length=32)
display(Markdown(colorize_text(response)))

No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.




**<font color='red'>Question:</font>** Care este suprafata Romaniei? 

**<font color='green'>Answer:</font>**  Suprafața României este de aproximativ 238.391 de kilometri pătrați (92.04 

**<font color='magenta'>Total time:</font>** 4.14 sec.
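
The test calls pass `temperature=0.1`, while `query_model` defaults to 0.7. As a toy illustration of why a low temperature makes sampling close to deterministic, the sketch below applies a temperature-scaled softmax to three hypothetical token scores. It is a simplified model of the idea, not the generation code used by the pipeline, and it ignores `top_p`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.1)
soft = softmax_with_temperature(logits, 0.7)

print("temperature 0.1:", [round(p, 4) for p in sharp])
print("temperature 0.7:", [round(p, 4) for p in soft])
```

At 0.1 almost all probability mass sits on the top-scoring token, so repeated runs tend to give the same answer; at 0.7 the alternatives keep a noticeable share.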

In [7]:
response = query_model(
    system_message,
    user_message="In ce an a castigat faima mondiala Nadia Comaneci?",
    temperature=0.1,
    max_length=64)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** In ce an a castigat faima mondiala Nadia Comaneci? 

**<font color='green'>Answer:</font>**  Nadia Comăneci a câștigat faima mondială în 1976, când a câștigat prima medalie de aur la Jocurile Olimpice de la Montreal.  

**<font color='magenta'>Total time:</font>** 4.6 sec.

In [8]:
response = query_model(
    system_message,
    user_message="Numeste cateva parcuri din Bucuresti.",
    temperature=0.1,
    max_length=64)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Numeste cateva parcuri din Bucuresti. 

**<font color='green'>Answer:</font>**  Parcurile din București sunt:
1. Herăstrău
2. Cișmigiu
3. Carol
4. Kiseleff
5. Cișmigiu
6. Tineretului
7. Floreasca
8. Alexandru Io 

**<font color='magenta'>Total time:</font>** 5.65 sec.

In [9]:
response = query_model(
    system_message,
    user_message="Cine e George Hagi?",
    temperature=0.1,
    max_length=64)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Cine e George Hagi? 

**<font color='green'>Answer:</font>**  George Hagi este un fotbalist român retras, care a jucat pentru echipa națională de fotbal a României.  

**<font color='magenta'>Total time:</font>** 3.51 sec.

In [10]:
response = query_model(
    system_message,
    user_message="Cine e primarul Capitalei Romaniei, Bucuresti, incepand cu 2020?",
    temperature=0.1,
    max_length=64)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Cine e primarul Capitalei Romaniei, Bucuresti, incepand cu 2020? 

**<font color='green'>Answer:</font>**  Primarul Capitalei României, București, începând cu 2020, este Nicușor Dan.  

**<font color='magenta'>Total time:</font>** 3.1 sec.

In [11]:
response = query_model(
    system_message,
    user_message="Numiti 3 pesti din Delta Dunarii.",
    temperature=0.1,
    max_length=64)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Numiti 3 pesti din Delta Dunarii. 

**<font color='green'>Answer:</font>**  1. Sturion
2. Crap
3. Somon  

**<font color='magenta'>Total time:</font>** 1.92 sec.

In [12]:
response = query_model(
    system_message,
    user_message="Numiti 3 atractii din Bucuresti.",
    temperature=0.1,
    max_length=64)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Numiti 3 atractii din Bucuresti. 

**<font color='green'>Answer:</font>**  1. Palatul Parlamentului
2. Muzeul National de Arta
3. Muzeul National de Istorie a Romaniei  

**<font color='magenta'>Total time:</font>** 3.06 sec.

In [13]:
response = query_model(
    system_message,
    user_message="Care e suprafata judetului Dambovita?",
    temperature=0.1,
    max_length=64)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Care e suprafata judetului Dambovita? 

**<font color='green'>Answer:</font>**  Suprafața județului Dâmbovița este de 6.083 km².  

**<font color='magenta'>Total time:</font>** 2.44 sec.

In [14]:
response = query_model(
    system_message,
    user_message="Cine este actualul presedinte al Romaniei?",
    temperature=0.1,
    max_length=64)
display(Markdown(colorize_text(response)))



**<font color='red'>Question:</font>** Cine este actualul presedinte al Romaniei? 

**<font color='green'>Answer:</font>**  Actualul președinte al României este Klaus Iohannis.  

**<font color='magenta'>Total time:</font>** 2.0 sec.

# Conclusions

Using the quantized model derived from the OpenLLM-Ro 7B Instruct (v1) model published by **ILDS**, we can run queries in Romanian quickly: each of the nine test questions above was answered in roughly 2 to 6 seconds.
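
The response times printed by the test cells can be summarized with a few lines of Python. The values below are copied from the "Total time" lines of the outputs in this notebook; they would differ on other hardware or with other generation settings.

```python
from statistics import mean

# "Total time" values (seconds) reported by cells In [6] to In [14].
timings_sec = [4.14, 4.6, 5.65, 3.51, 3.1, 1.92, 3.06, 2.44, 2.0]

fastest = min(timings_sec)
slowest = max(timings_sec)
average = round(mean(timings_sec), 2)

print(f"queries: {len(timings_sec)}")
print(f"fastest: {fastest} sec, slowest: {slowest} sec, mean: {average} sec")
```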