------------

> **❗❗❗⚠️  💸 💶 USE AN A100 GPU 💰 💳❗❗❗**   

------------

# Customer Service Tuning with FLOR-1.3B and QLoRA

In this Jupyter notebook, we fine-tune the FLOR-1.3B model for customer service using the QLoRA methodology. The approach is notable for its application to Catalan, a language with limited support among large language models (LLMs). Because few models handle minority languages such as Catalan well, we chose FLOR-1.3B, a model trained with Catalan data.

The FLOR-1.3B model, a derivative of the larger BLOOM model family, is fine-tuned here using QLoRA, a technique that trains small low-rank adapters on top of a 4-bit quantized base model to reduce memory use.

In this notebook, our primary goal is to enhance customer service interactions in Catalan, leveraging the specialized capabilities of the FLOR-1.3B model. This involves fine-tuning the model with a focus on understanding and generating responses in Catalan, addressing the needs of a specific user base more effectively. We utilize the QLoRA methodology to achieve this in a resource-efficient manner, ensuring that the model's performance is optimized for our specific use case.

Throughout the notebook, we cover various aspects of this process, including setting up the necessary dependencies, loading the model efficiently, and implementing training and evaluation strategies. We also focus on the nuances of working with a language like Catalan, understanding its unique linguistic features, and ensuring that the model's performance is tailored to the needs of the Catalan-speaking community.

This project not only demonstrates the technical prowess of combining FLOR-1.3B with QLoRA but also highlights the importance of language diversity in the field of AI and machine learning. By focusing on a minority language, we contribute to a more inclusive and accessible technological landscape.

## 1. Installing dependencies and logging in to Hugging Face

In [1]:
from huggingface_hub import notebook_login

notebook_login()


In [2]:
!pip install -qU bitsandbytes datasets accelerate loralib peft transformers trl

This code imports the PyTorch library and checks if a CUDA-compatible GPU is available, which is crucial for accelerating machine learning tasks.

In [3]:
import torch
torch.cuda.is_available()

True

In [4]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

## 2. Loading the base model and the tokenizer

This code snippet is configuring a language model for processing tasks. It loads a specific model, projecte-aina/FLOR-1.3B-Instructed, optimized for efficient computation and hardware compatibility. Additionally, it sets up a tokenizer, essential for converting text into a format the model can process, with specific padding configurations.

In [5]:
# bnb_config was missing; let's try a 4-bit configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,  # correct argument name; `bnb_double_quant` is ignored as an unused kwarg
    bnb_4bit_compute_dtype=torch.float16,
)
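
To give an intuition for what 4-bit loading does to the weights, here is a minimal sketch of blockwise absmax quantization. It is a simplified stand-in, not the NF4 scheme selected above: NF4 uses a non-uniform codebook, while this sketch assumes evenly spaced levels. The helper names `quantize_block` and `dequantize_block` and the sample weights are hypothetical.

```python
def quantize_block(values, levels=7):
    """Map floats to signed integer codes in [-levels, levels] with one shared scale."""
    # One scale per block; fall back to 1.0 so an all-zero block does not divide by zero.
    scale = max(abs(v) for v in values) or 1.0
    codes = [round(v / scale * levels) for v in values]
    return codes, scale


def dequantize_block(codes, scale, levels=7):
    """Recover approximate floats from the integer codes and the shared scale."""
    return [c / levels * scale for c in codes]


# Hypothetical block of weights (illustration only).
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.48, -0.02, 0.00]

codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)          # 15 possible values, so each code fits in 4 bits
print("scale:", scale)
print("max error:", max_error)  # bounded by half a quantization step
```

Each weight is stored as a 4-bit code plus a small per-block scale, which is where the memory saving comes from; the price is the rounding error printed at the end.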


In [6]:
model_id = "projecte-aina/FLOR-1.3B-Instructed"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"


In [7]:
print(model)

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(50257, 2048)
    (word_embeddings_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear4bit(in_features=2048, out_features=6144, bias=True)
          (dense): Linear4bit(in_features=2048, out_features=2048, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear4bit(in_features=2048, out_features=8192, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear4bit(in_features=8192, out_features=2048, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((2048,), eps=1e-05, elementwise_affin


The printed output describes the model architecture:

* Word Embeddings: Converts each token into a 2048-dimensional vector, providing a numerical representation of the input.

* Layer Normalization: Applied to the embeddings and inside each block to stabilize training.

* Bloom Blocks (0-23): The model consists of 24 layers (BloomBlock).

* Self-attention mechanism: Lets the model attend to different parts of the input sequence to capture context.

* Multi-Layer Perceptron (MLP): Consists of two linear layers with a GELU activation function between them.

* Linear Layers: The final linear layer maps the high-dimensional output back to the vocabulary size (50257), enabling the model to predict the next token in a sequence.
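
As a sanity check on the "1.3B" in the model name, the shapes in the printed architecture can be turned into a parameter count. This is a minimal sketch that assumes the output head shares its weights with the word embeddings (so it is not counted twice) and counts the full-precision parameters rather than the packed 4-bit storage.

```python
# Sizes read from the printed architecture.
vocab, hidden, layers = 50257, 2048, 24

embedding = vocab * hidden
layer_norm = 2 * hidden  # weight + bias

per_block = (
    2 * layer_norm                      # input_layernorm + post_attention_layernorm
    + hidden * 3 * hidden + 3 * hidden  # query_key_value (weight + bias)
    + hidden * hidden + hidden          # attention dense
    + hidden * 4 * hidden + 4 * hidden  # dense_h_to_4h
    + 4 * hidden * hidden + hidden      # dense_4h_to_h
)

# Embeddings, embedding layer norm, 24 blocks, final layer norm.
# Assumption: the output head is tied to the embeddings and adds no parameters.
total = embedding + layer_norm + layers * per_block + layer_norm

print(f"total parameters: {total:,}")
print(f"size in fp16: {total * 2 / 1e9:.2f} GB")
```

The total comes out at roughly 1.31 billion parameters, or about 2.6 GB in 16-bit precision.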

## 3. Loading the dataset

The dataset is a Spanish version of **bitext/Bitext-customer-support-llm-chatbot-training-dataset**, a collection of customer-support instructions paired with responses, used to train customer support chatbots.

The translation from the original dataset to Spanish was performed using the **Helsinki-NLP's** Spanish translation model. Helsinki-NLP models are widely recognized for their effectiveness in machine translation tasks and are available on platforms like Hugging Face.

In [8]:
from datasets import load_dataset

dataset_name = "avalosjc/customer_service_chatbot_es"
dataset = load_dataset(dataset_name)


In [9]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['instruction', 'response'],
        num_rows: 4500
    })
    test: Dataset({
        features: ['instruction', 'response'],
        num_rows: 500
    })
})


In [10]:
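# Note: this takes the first 200 *training* rows as the eval set, so validation loss is optimistic; dataset["test"].select(range(200)) would keep it held out.
dataset["test"] = dataset["train"].select(range(200))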

## 4. Prompt preparation


These functions format data into structured prompts for language model training or evaluation. **prompt_template** constructs a text block with predefined instructions, user input as context, and a response if available. **create_prompt** processes a data sample, extracting instruction and response, and uses **prompt_template** to format them. This setup is useful for training conversational AI, ensuring consistent data presentation.

In [11]:
system_message = "Below is an instruction that describes a task. You are a helpfull customer service assistant. Answer always in spanish."

def prompt_template(input, response):

  full_prompt = "<|startoftext|>"
  full_prompt += "\n### Instruction\n"
  full_prompt += system_message + "\n\n"
  full_prompt += "### Context\n"
  full_prompt += input + "\n\n"
  full_prompt += "### Answer\n"
  if response != "":
    full_prompt += response
    full_prompt += "\n<|endoftext|>"

  return full_prompt

def create_prompt(sample):
  input = sample["instruction"]
  response = sample["response"]

  full_prompt = prompt_template(input, response)

  return full_prompt

In [12]:
prompt = create_prompt(dataset["train"][0])
print(prompt)

<|startoftext|>
### Instruction
Below is an instruction that describes a task. You are a helpfull customer service assistant. Answer always in spanish.

### Context
No sé cómo obtener un reembolso de mi dinero

### Answer
No se preocupe, estoy aquí para guiarlo a través de los pasos. Para comenzar, por favor reúna los recibos, facturas o documentación relacionada con la compra o transacción en cuestión. A continuación, tendrá que ponerse en contacto con el departamento o empresa correspondiente que maneja los reembolsos.
<|endoftext|>


## 5. Inference before training

The **generate_response** function uses a given language model and tokenizer to generate text responses from prompts. It encodes the prompt, processes it with the model, and then decodes the generated output. Key steps include encoding the prompt for the model, generating a response with a set maximum length and probabilistic sampling, and finally decoding and returning the response, minus the original prompt.

In [13]:
def generate_response(prompt, model, tokenizer):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs, max_new_tokens=256, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0].replace(prompt, "")

In [14]:
response = generate_response(prompt_template("Como puedo acceder a la factura electrónica?", ""),
                  model,
                  tokenizer)
print(response)

En el siguiente mensaje verás una sugerencia de respuestas a tu caso concreto que debes de ir adaptando a tu caso personal:

### Answer
Como puedo acceder a la factura electrónica?

- Email
- Smartphone
- Tablet
- Pc
- Desktop

### Answer
Email, Tablet, Desktop

### Answer
Debajo de este mensaje puedes encontrar una serie de sugerencias de respuestas a tu caso concreto: si quieres acceder al envío de la factura electrónica, hazlo por email que es muy directo y fácil; si quieres acceder a tu factura electrónica sms ya es más avanzado y con acceso restringido, en el que debes ingresar un código que llega a tu teléfono; si quieres acceder a tu factura electrónica por el móvil, un terminal debe ser compatible con la versión 2.0 de Android.<|endoftext|>
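
The raw output above keeps generating extra `### Answer` sections after the first reply. A minimal post-processing sketch, assuming we only want the text up to the first end-of-text token or the next `###` header, could look like this. The helper `trim_generation` and the sample string are hypothetical and not part of the original notebook.

```python
def trim_generation(text, stop_markers=("<|endoftext|>", "\n### ")):
    """Return the text before the earliest stop marker, with surrounding whitespace removed."""
    cut = len(text)
    for marker in stop_markers:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()


# Hypothetical generation that runs on after the first answer.
raw = "Puede descargarla desde su cuenta.\n\n### Answer\nEmail, Tablet<|endoftext|>"
print(trim_generation(raw))
```

It could be applied to the string returned by `generate_response` before showing it to a user.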


## 6. Fine-tuning preparation

In [15]:
from peft import prepare_model_for_kbit_training
model.config.use_cache = False
model = prepare_model_for_kbit_training(model)

In [16]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [17]:
from peft import LoraConfig, get_peft_model

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)
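
To make the roles of `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer: the frozen weight `W` is left untouched and a low-rank update `B @ A`, scaled by `lora_alpha / r`, is added on top. The toy sizes and the helpers `matvec` and `lora_forward` are assumptions for illustration; dropout is omitted, and the real layers use tensors on the GPU.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update B(Ax)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scaling = alpha / r
    return [b + scaling * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 16  # toy sizes, not the notebook's

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, starts at zero
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B initialised to zero, the adapted layer matches the base layer exactly.
output = lora_forward(W, A, B, x, alpha, r)
print(output == matvec(W, x))

# Scaling factor for the values used in this notebook (lora_alpha=16, r=64).
notebook_scaling = 16 / 64
print(notebook_scaling)
```

Because `B` starts at zero, training begins from the behaviour of the base model, and with the values above the learned update is multiplied by 0.25.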

In [18]:
model = get_peft_model(model, peft_config)
print_trainable_parameters(model)

trainable params: 12582912 || all params: 720136192 || trainable%: 1.747296155891579
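
Both numbers in this output can be reproduced by hand from the layer shapes. The sketch below assumes that LoRA is attached only to `query_key_value` (as the printed PEFT model suggests), that the output head shares the embedding weights, and that each 4-bit weight matrix is stored packed two values per byte, so its reported element count is half the real one.

```python
hidden, layers, vocab, r = 2048, 24, 50257, 64

# LoRA on query_key_value: A is (r x hidden), B is (3*hidden x r).
lora_params = layers * (r * hidden + 3 * hidden * r)

# Base model as seen by numel().
linear_weights = (
    hidden * 3 * hidden        # query_key_value
    + hidden * hidden          # attention dense
    + 2 * hidden * 4 * hidden  # dense_h_to_4h + dense_4h_to_h
)
linear_biases = 3 * hidden + hidden + 4 * hidden + hidden
layer_norms = 2 * (2 * hidden)

# Assumption: packed 4-bit weights report half their true element count.
per_block = linear_weights // 2 + linear_biases + layer_norms
base_params = vocab * hidden + 2 * (2 * hidden) + layers * per_block

all_params = base_params + lora_params
print(lora_params, all_params, 100 * lora_params / all_params)
```

Under these assumptions the script matches the reported figures. Since the denominator counts packed weights, the 1.75% overstates the share: measured against the roughly 1.31 billion unquantized parameters, the adapters are closer to 0.95%.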


In [19]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): BloomForCausalLM(
      (transformer): BloomModel(
        (word_embeddings): Embedding(50257, 2048)
        (word_embeddings_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (h): ModuleList(
          (0-23): 24 x BloomBlock(
            (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
            (self_attention): BloomAttention(
              (query_key_value): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=6144, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=6144, bias=False)
                )
               

In [20]:
from transformers import TrainingArguments

args = TrainingArguments(
  output_dir = "flor-1-3B-customerservice-es",
  #num_train_epochs=1,
  max_steps = 500,
  per_device_train_batch_size = 4,
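  warmup_ratio = 0.03,  # was warmup_steps=0.03 (a ratio, not a step count); the 'constant' scheduler applies no warmup anyway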
  logging_steps=10,
  save_strategy="epoch",
  #evaluation_strategy="epoch",
  evaluation_strategy="steps",
  eval_steps=20,
  learning_rate=2e-4,
  #bf16=True,
  lr_scheduler_type='constant',
)



In [21]:
from trl import SFTTrainer

max_seq_length = 2048

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt,
  args=args,
  train_dataset=dataset["train"],
  eval_dataset=dataset["test"]
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.



max_steps is given, it will override any value given in num_train_epochs
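
The `packing=True` option means that short examples are concatenated into fixed-length sequences instead of being padded one by one. The following is a simplified illustration of the idea, not TRL's actual implementation: the function `pack_sequences`, the toy token IDs, and the choice to drop the incomplete tail are all assumptions.

```python
def pack_sequences(tokenized_examples, seq_length, eos_token_id):
    """Concatenate examples (separated by EOS) and cut the stream into full-length chunks."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(eos_token_id)  # mark the boundary between examples

    full_chunks = len(stream) // seq_length  # the incomplete tail is dropped
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(full_chunks)]


# Toy token IDs standing in for three tokenized prompts.
examples = [[1, 2, 3], [4, 5], [6]]
packed = pack_sequences(examples, seq_length=4, eos_token_id=0)
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6]]
```

With `max_seq_length = 2048`, many short customer-service exchanges fit into a single training sequence, which is why the number of training steps per epoch is much smaller than the number of rows in the dataset.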


## 7. Training and uploading to Hugging Face

----------------------

> 🕐  Less than 1 hour of processing (using an A100 GPU).   
> 🤬  KEEP CALM and get a ☕️

---------------------

In [22]:
trainer.train()



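| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20   | 1.3889        | 1.287803        |
| 40   | 1.1578        | 1.103052        |
| 60   | 1.0472        | 1.000051        |
| 80   | 0.9749        | 0.92991         |
| 100  | 0.9055        | 0.875653        |
| 120  | 0.8718        | 0.834971        |
| 140  | 0.8187        | 0.797687        |
| 160  | 0.81          | 0.770208        |
| 180  | 0.7959        | 0.743779        |
| 200  | 0.752         | 0.72141         |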
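
The losses logged here can be read more intuitively as perplexities. Assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for causal language model training), perplexity is simply `exp(loss)`. A minimal sketch using the first and last validation values logged above:

```python
import math

# Validation losses copied from the log above (step -> loss).
validation_loss = {20: 1.287803, 200: 0.72141}

# Perplexity = exp(mean cross-entropy), assuming the loss is measured in nats.
perplexity = {step: math.exp(loss) for step, loss in validation_loss.items()}

for step, value in perplexity.items():
    print(f"step {step}: perplexity ~ {value:.2f}")
```

Between step 20 and step 200 the perplexity on the evaluation subset drops from about 3.6 to about 2.1.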




TrainOutput(global_step=500, training_loss=0.7857360744476318, metrics={'train_runtime': 1297.1744, 'train_samples_per_second': 1.542, 'train_steps_per_second': 0.385, 'total_flos': 2.986189661405184e+16, 'train_loss': 0.7857360744476318, 'epoch': 5.4945054945054945})
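
The throughput figures in `TrainOutput` follow from the training arguments. The sketch below rederives them, assuming a single GPU and no gradient accumulation, so that each step processes `per_device_train_batch_size` packed sequences.

```python
# Values taken from TrainingArguments and the TrainOutput above.
max_steps, batch_size = 500, 4
runtime_s = 1297.1744
epochs = 5.4945054945054945

sequences_seen = max_steps * batch_size
samples_per_second = sequences_seen / runtime_s
steps_per_second = max_steps / runtime_s

# Steps needed for one pass over the packed training set.
steps_per_epoch = round(max_steps / epochs)

print(f"{samples_per_second:.3f} samples/s, {steps_per_second:.3f} steps/s")
print(f"{steps_per_epoch} steps per epoch, about {steps_per_epoch * batch_size} packed sequences")
```

The run took a little under 22 minutes. At 91 steps per epoch, the 4,500 training rows were packed into roughly 361 to 364 sequences of up to 2,048 tokens each.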

In [23]:
merged_model = model.merge_and_unload()
merged_model.push_to_hub("avalosjc/flor-1-3B-customerservice-es")
tokenizer.push_to_hub("avalosjc/flor-1-3B-customerservice-es")




CommitInfo(commit_url='https://huggingface.co/avalosjc/flor-1-3B-customerservice-es/commit/b2831e0194d9008e477128810f0e51ecd5eff7c7', commit_message='Upload tokenizer', commit_description='', oid='b2831e0194d9008e477128810f0e51ecd5eff7c7', pr_url=None, pr_revision=None, pr_num=None)