------------

> ❗❗❗⚠️  💸   DON'T USE A GPU❗❗❗   
> **🧠 🤑   USE CPU ONLY (peak usage: 4.5 GB RAM, 30 GB disk)**
------------

# Spanish Customer Service Dataset

Our goal is to take a specialized English customer service dataset and translate it into Spanish so it can be used to train or fine-tune models in this field. We use the ["bitext/Bitext-customer-support-llm-chatbot-training-dataset"](https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset), which is designed for training Large Language Models (LLMs) in intent detection and domain adaptation.

For translation, we use the ["Helsinki-NLP/opus-mt-tc-big-en-es"](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-es) model from Hugging Face, a neural machine translation model for English to Spanish. It is part of the OPUS-MT project, which aims to make neural translation models accessible for many languages. These models are originally trained with Marian NMT and converted to PyTorch using Hugging Face's transformers library. The training data comes from OPUS.

Base dataset details:

- Use: Intent Detection in Customer Service.
- Contains 27 intents grouped into 10 categories.
- Includes 26,872 question-answer pairs, about 1,000 per intent.
- Features 30 entity/slot types and 12 types of language generation tags.
- The categories and intents are selected from Bitext's collection covering various sectors.
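
As a quick sanity check on the figures above, the "about 1,000 per intent" claim can be derived from the totals. This is only arithmetic on the numbers quoted in the list; it does not inspect the dataset itself.

```python
# Figures quoted in the dataset description above.
total_pairs = 26872
num_intents = 27
num_categories = 10

# Average number of question-answer pairs per intent.
avg_per_intent = total_pairs / num_intents
# Average number of intents per category.
intents_per_category = num_intents / num_categories

print(f"{avg_per_intent:.1f} pairs per intent on average")
print(f"{intents_per_category:.1f} intents per category on average")
```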

The dataset is built with a hybrid approach to generating question-answer pairs: it starts from natural texts, uses natural language processing (NLP) technology to extract seeds from those texts, and uses natural language generation (NLG) technology to expand the seeds. The entire process is overseen by computational linguists.

Once the dataset is translated, we upload it to Hugging Face for use in our model training or fine-tuning.

In [1]:
!pip install transformers transformers[sentencepiece] sentencepiece huggingface-hub datasets



In [2]:
! pip install sacremoses



## Loading the original dataset

We upload the CSV file to Colab, load it, and keep a random sample of 5,000 rows.

Dataset link: [Bitext_Sample_Customer_Support_Training_Dataset_27K_responses-v11.csv](https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset/resolve/main/Bitext_Sample_Customer_Support_Training_Dataset_27K_responses-v11.csv?download=true)

In [3]:
from google.colab import files
import pandas as pd
from datasets import Dataset, DatasetDict

def upload_and_transform_to_datasetdict():
    # Open Colab's file picker and take the name of the first uploaded file.
    uploaded = files.upload()
    filename = next(iter(uploaded.keys()))

    dataframe = pd.read_csv(filename)
    full_dataset = Dataset.from_pandas(dataframe)

    # Shuffle with a fixed seed so the 5000-row sample is reproducible.
    shuffled_dataset = full_dataset.shuffle(seed=42)
    selected_dataset = shuffled_dataset.select(range(5000))

    # Wrap the sample in a DatasetDict with a single 'train' split.
    dataset_dict = DatasetDict({
        'train': selected_dataset
    })

    return dataset_dict

dataset_dict = upload_and_transform_to_datasetdict()

Saving Bitext_Sample_Customer_Support_Training_Dataset_27K_responses-v11.csv to Bitext_Sample_Customer_Support_Training_Dataset_27K_responses-v11 (1).csv


In [4]:
print(dataset_dict)

DatasetDict({
    train: Dataset({
        features: ['flags', 'instruction', 'category', 'intent', 'response'],
        num_rows: 5000
    })
})


## Loading the translation model

In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-tc-big-en-es"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [6]:
from huggingface_hub import notebook_login

notebook_login()


## Dataset translation

In [7]:
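from transformers import pipeline

# Reuse the tokenizer and model loaded above instead of loading them a second time.
pipe = pipeline("translation", model=model, tokenizer=tokenizer)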

In [8]:
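def translate(text, cache=None):
    """Translate one English string to Spanish, optionally using a dict cache."""
    try:
        # Return the cached translation if this exact string was seen before.
        if cache is not None and text in cache:
            return cache[text]

        # The pipeline returns a list with one dict per input string.
        translations = pipe(text)
        translation = translations[0]['translation_text']

        if cache is not None:
            cache[text] = translation

        return translation
    except Exception as e:
        # On failure, log the error and return an empty string so the run continues.
        print(f"Error during translation: {e}")
        return ''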
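
The `cache` argument is a plain dictionary keyed by the source text, so a repeated string only needs to be translated once. The sketch below illustrates that behavior in isolation. `fake_pipe` and `translate_cached` are hypothetical names introduced here, and `fake_pipe` is a stand-in that only mimics the pipeline's return shape so the example runs without downloading the model.

```python
calls = []

def fake_pipe(text):
    """Hypothetical stand-in for the translation pipeline (not a real model)."""
    calls.append(text)
    return [{"translation_text": f"<es>{text}</es>"}]

def translate_cached(text, pipe, cache=None):
    """Translate one string, reusing a previous result when it is cached."""
    if cache is not None and text in cache:
        return cache[text]
    translation = pipe(text)[0]["translation_text"]
    if cache is not None:
        cache[text] = translation
    return translation

cache = {}
first = translate_cached("I want to cancel my order", fake_pipe, cache)
second = translate_cached("I want to cancel my order", fake_pipe, cache)

# The second call is served from the cache, so the pipeline ran only once.
print(first == second, len(calls))
```

Note that the cache only helps if a dictionary is actually passed in; with the default `cache=None`, every string is sent to the pipeline.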

In [9]:
import pandas as pd
from datasets import Dataset, DatasetDict

text_columns = ['instruction', 'response']

def translate_dataset(dataset, text_columns, cache=None):
    """Translate the given text columns of every split, one row at a time.

    Only the columns listed in text_columns are kept in the result.
    """
    translated_dataset_dict = {}

    for split in dataset:
        translated_rows = []

        for row in dataset[split]:
            translated_row = {}
            for column in text_columns:
                translated_row[column] = translate(row[column], cache)

            translated_rows.append(translated_row)

        translated_dataset_dict[split] = Dataset.from_pandas(pd.DataFrame(translated_rows))

    return DatasetDict(translated_dataset_dict)
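
The loop above sends one string at a time to the pipeline. Hugging Face pipelines also accept a list of strings, so a possible variation is to translate a whole column in batches. The following is a minimal sketch of that idea, not the method used in this notebook: `batched`, `translate_column`, and `fake_translate_batch` are hypothetical helpers, and the stand-in translator just upper-cases its input so the example runs without the model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def translate_column(texts, translate_batch, batch_size=8):
    """Translate a list of strings batch by batch, preserving input order."""
    translated = []
    for batch in batched(texts, batch_size):
        results = translate_batch(batch)
        translated.extend(r["translation_text"] for r in results)
    return translated

def fake_translate_batch(batch):
    # Hypothetical stand-in: upper-cases instead of translating.
    return [{"translation_text": text.upper()} for text in batch]

texts = ["hello", "thanks", "goodbye"]
result = translate_column(texts, fake_translate_batch, batch_size=2)
print(result)
```

With the real pipeline, `translate_batch` could be something like `lambda batch: pipe(batch, batch_size=len(batch))`. Whether batching actually reduces the running time on CPU was not measured here.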

## Upload the dataset to Hugging Face

----------------------

> 🕕  The process takes more than 6 hours (using GPU or CPU takes the same time).   
> 🤬  KEEP CALM and get a ☕️

---------------------

In [10]:
%%time
dataset = translate_dataset(dataset_dict, text_columns)

dataset = dataset.shuffle(seed=42)['train']

train_test_split = dataset.train_test_split(test_size=0.1)

train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]

dataset = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})

Your input_length: 462 is bigger than 0.9 * max_length: 512. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
Your input_length: 494 is bigger than 0.9 * max_length: 512. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
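
The two warnings above are triggered because the inputs (462 and 494 tokens) exceed 0.9 × 512 = 460.8 tokens. One possible way to stay below that threshold is to split long texts into groups of sentences and translate each group separately. The sketch below shows only the splitting step, under loud assumptions: `split_into_chunks` and `max_words` are hypothetical, whitespace word counts are used as a rough proxy for model tokens, and sentence boundaries are detected with a simple regular expression.

```python
import re

def split_into_chunks(text, max_words=100):
    """Group sentences into chunks of at most `max_words` words.

    Word counts are only a rough proxy for model tokens (assumption);
    a single sentence longer than `max_words` is kept whole.
    """
    # Split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "First sentence here. Second sentence follows! Is this the third? Yes it is."
chunks = split_into_chunks(text, max_words=6)
print(chunks)
```

Each chunk could then be passed to `translate` and the results joined with a space. A suitable `max_words` value would have to be checked against the tokenizer's real token counts.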


CPU times: user 1d 1h 16min 30s, sys: 2min 43s, total: 1d 1h 19min 13s
Wall time: 6h 20min 32s
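
The timings above can be turned into a rough per-text cost. The sketch below is only arithmetic on the reported numbers. It assumes every one of the 5000 rows produced two pipeline calls (one per text column), and the extrapolation to the full 26872-row dataset assumes the cost scales linearly, which was not verified.

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

# Values copied from the %%time output above.
wall = to_seconds(hours=6, minutes=20, seconds=32)
cpu_total = to_seconds(days=1, hours=1, minutes=19, seconds=13)

# 5000 sampled rows, two translated columns ('instruction' and 'response').
n_translations = 5000 * 2

seconds_per_text = wall / n_translations
cpu_to_wall = cpu_total / wall

# Linear extrapolation to all 26872 rows (assumption, not a measurement).
full_hours = 26872 * 2 * seconds_per_text / 3600

print(f"{seconds_per_text:.2f} s per translated text")
print(f"CPU/wall ratio: {cpu_to_wall:.2f}")
print(f"Estimated time for the full dataset: {full_hours:.1f} h")
```

A CPU-to-wall ratio close to 4 suggests that roughly four CPU cores were busy on average during the run.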


In [11]:
import pickle

def save_pkl(path, obj):
    # Serialize obj to the file at path (opened in binary write mode).
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_pkl(path):
    # Load and return the object previously saved at path.
    with open(path, 'rb') as f:
        return pickle.load(f)

In [12]:
# Grant access to Google Drive and mount it
from google.colab import drive
drive.mount('/content/drive')

# Point the base path at the Google Drive folder
base_path = '/content/drive/MyDrive/'

Mounted at /content/drive


In [13]:
# Save the dataset
save_pkl(base_path + 'dataset_customer_service_chatbot_es.pkl', dataset)

In [14]:
# Load the saved dataset
dataset = load_pkl(base_path + 'dataset_customer_service_chatbot_es.pkl')

In [15]:
from huggingface_hub import notebook_login

notebook_login()


In [16]:
dataset.push_to_hub("avalosjc/customer_service_chatbot_es")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/5 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/avalosjc/customer_service_chatbot_es/commit/06278d725c548ab79bffd81e2992572d1eed1510', commit_message='Upload dataset', commit_description='', oid='06278d725c548ab79bffd81e2992572d1eed1510', pr_url=None, pr_revision=None, pr_num=None)