# Catalan Customer Service Dataset

Our goal is to take a specialized English customer service dataset and translate it into Catalan, so that it can be used to train or fine-tune models in this domain. We use "bitext/Bitext-customer-support-llm-chatbot-training-dataset", which is designed for training Large Language Models (LLMs) on intent detection and domain adaptation.

For translation, we use the "Helsinki-NLP/opus-mt-tc-big-en-cat_oci_spa" model from Hugging Face. This neural machine translation model translates from English into Catalan, Occitan, and Spanish. It is part of the OPUS-MT project, which aims to make neural translation models accessible for many languages. The models are trained with Marian NMT and then converted to PyTorch using Hugging Face's transformers library. The training data comes from OPUS.
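Because the model has several target languages, the source text carries a language token such as `>>cat<<` to select the output language, as the translation cell in this notebook does. The following is a minimal sketch of that prefixing step only. The helper name `with_target_token` is hypothetical, the set of language IDs is read off the model name, and the space after the token is an assumption based on common OPUS-MT usage.

```python
# Target-language IDs assumed from the model name "en-cat_oci_spa".
TARGET_LANGS = {"cat", "oci", "spa"}


def with_target_token(text, lang="cat"):
    """Prefix the source text with the token that selects the target language."""
    if lang not in TARGET_LANGS:
        raise ValueError(f"Unsupported target language: {lang}")
    # Assumption: a single space separates the token from the sentence.
    return f">>{lang}<< {text}"


print(with_target_token("I need to cancel my order"))
# >>cat<< I need to cancel my order
```

The resulting string is what would be passed to the tokenizer; the model itself is not loaded in this sketch.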

Base dataset details:

- Use: Intent Detection in Customer Service.
- Contains 27 intents grouped into 10 categories.
- Includes 26872 question-answer pairs, about 1000 per intent.
- Features 30 entity/slot types and 12 types of language generation tags.
- The categories and intents are selected from Bitext's collection covering various sectors.
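As a quick sanity check on the figures above, the average number of pairs per intent can be computed directly. This sketch uses only the counts quoted in the list; the variable names are illustrative.

```python
# Counts quoted in the dataset description above.
total_pairs = 26872
num_intents = 27

# Average number of question-answer pairs per intent.
avg_per_intent = total_pairs / num_intents
print(f"{avg_per_intent:.1f}")  # 995.3
```

The result is consistent with the "about 1000 per intent" figure.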

This dataset is characterized by its hybrid approach in generating question-answer pairs, combining natural texts, natural language processing (NLP) technology to extract seeds from these texts, and natural language generation (NLG) technology to expand seed texts. The entire process is overseen by computational linguists.

Once the dataset is translated, we upload it to Hugging Face for use in our model training or fine-tuning.

In [None]:
!pip install transformers transformers[sentencepiece] sentencepiece huggingface-hub datasets

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: sentencepiece, dill, multiprocess, datasets
Successfully installed datasets-2.

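## Loading the source dataset

We load the dataset from a CSV file uploaded through Colab, shuffle it with a fixed seed, and keep a sample of 5,000 rows as the "train" split.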

In [None]:
from google.colab import files
import pandas as pd
from datasets import Dataset, DatasetDict

def upload_and_transform_to_datasetdict():
    uploaded = files.upload()
    filename = next(iter(uploaded.keys()))

    dataframe = pd.read_csv(filename)

    full_dataset = Dataset.from_pandas(dataframe)

    shuffled_dataset = full_dataset.shuffle(seed=42)

    selected_dataset = shuffled_dataset.select(range(5000))

    dataset_dict = DatasetDict({
        'train': selected_dataset
    })

    return dataset_dict

dataset_dict = upload_and_transform_to_datasetdict()

Saving Bitext_Sample_Customer_Support_Training_Dataset_27K_responses-v11.csv to Bitext_Sample_Customer_Support_Training_Dataset_27K_responses-v11 (1).csv


In [None]:
print(dataset_dict)

DatasetDict({
    train: Dataset({
        features: ['flags', 'instruction', 'category', 'intent', 'response'],
        num_rows: 5000
    })
})


## Loading the translation model

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-tc-big-en-cat_oci_spa"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
from huggingface_hub import notebook_login

notebook_login()


## Dataset translation

In [None]:
from datasets import Dataset, DatasetDict

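def translate(text, cache=None):
    """Translate one English string into Catalan; return '' on failure."""
    try:
        # Reuse a previous translation when a cache dict is supplied.
        if cache is not None and text in cache:
            return cache[text]

        # The ">>cat<<" token selects Catalan as the target language.
        encoded_input = tokenizer(">>cat<< " + text, return_tensors="pt", padding=True)
        translated_tokens = model.generate(**encoded_input)
        translation = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

        if cache is not None:
            cache[text] = translation

        return translation
    except Exception as e:
        # A failed row is reported and stored as an empty string.
        print(f"Error during translation: {e}")
        return ''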

In [None]:
text_columns = ['instruction', 'response']

def translate_dataset(dataset, text_columns, cache=None):
    """Translate the given text columns of every split; other columns are dropped."""
    translated_dataset_dict = {}

    for split in dataset:
        translated_rows = []

        for row in dataset[split]:
            # Keep only the translated text columns for each row.
            translated_row = {}
            for column in text_columns:
                translated_row[column] = translate(row[column], cache)

            translated_rows.append(translated_row)

        translated_dataset_dict[split] = Dataset.from_pandas(pd.DataFrame(translated_rows))

    return DatasetDict(translated_dataset_dict)
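
The `cache` argument only has an effect when a dictionary is actually passed in, for example `translate_dataset(dataset_dict, text_columns, cache={})`; with the default `None`, a repeated string is translated again each time. The sketch below illustrates the same memoization idea in isolation. The names `make_cached_translator` and `fake_translate` are hypothetical, and the stand-in translator simply upper-cases text so the example runs without the model.

```python
def make_cached_translator(translate_fn):
    """Wrap a text -> text function so each distinct input is translated once."""
    cache = {}

    def cached(text):
        if text not in cache:
            cache[text] = translate_fn(text)
        return cache[text]

    return cached, cache


# Stand-in for the real model call, used only for illustration.
calls = []

def fake_translate(text):
    calls.append(text)
    return text.upper()


cached_translate, cache = make_cached_translator(fake_translate)
results = [cached_translate(t) for t in ["hello", "thanks", "hello"]]

print(results)     # ['HELLO', 'THANKS', 'HELLO']
print(len(calls))  # 2
```

The repeated input is served from the dictionary, so the underlying function runs only twice for three requests.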

## Upload the dataset to Hugging Face

In [None]:
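# Translate the 'instruction' and 'response' columns of the sample.
dataset = translate_dataset(dataset_dict, text_columns)

dataset = dataset.shuffle(seed=42)['train']

# Hold out 10% of the rows as a test split; the seed makes the split reproducible.
train_test_split = dataset.train_test_split(test_size=0.1, seed=42)

train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]

dataset = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})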
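
To make the effect of `test_size=0.1` concrete, the sketch below reproduces the idea of a seeded 90/10 split on row indices using only the standard library. It is an illustration of the technique, not the `datasets` implementation; the helper name `split_indices` and the rounding rule are assumptions.

```python
import random


def split_indices(n_rows, test_size=0.1, seed=42):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    # The first n_test shuffled indices form the test set, the rest the train set.
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(5000)
print(len(train_idx), len(test_idx))  # 4500 500
```

With the 5,000-row sample used in this notebook, a 10% hold-out leaves 4,500 rows for training and 500 for testing.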

In [None]:
dataset.push_to_hub("ericrisco/customer_service_chatbot_ca")