# Ubuntu Automated Customer Service

Project Overview

This project is focused on creating a conversational AI system designed to automate customer service for Ubuntu users. By leveraging the Ubuntu Dialogue Corpus, the system will be trained to understand customer queries and offer automated solutions. The project encompasses several key phases, including data preprocessing, development and training of a natural language processing (NLP) model, and the integration of this model into a chatbot interface that users can interact with.

Data Source:
- The dataset for this project is the [Ubuntu Dialogue Corpus](https://www.kaggle.com/datasets/rtatman/ubuntu-dialogue-corpus), available on Kaggle.

Project Goals:
- To preprocess the Ubuntu Dialogue Corpus data for NLP.
- To build and train an NLP model capable of understanding and responding to user queries.
- To integrate the trained NLP model into a chatbot interface for automated customer service.
- To evaluate the effectiveness and accuracy of the conversational AI system in handling real-world user queries.

Steps:

1. **Data Acquisition and Preprocessing**:
- Download the Ubuntu Dialogue Corpus from Kaggle.
- Clean and preprocess the data to format it suitably for NLP tasks. This may include tokenization, removing stop words, and stemming or lemmatization.

2. **Model Development**:
- Select an appropriate NLP model architecture that can process the conversational data effectively. This could involve sequence-to-sequence models, transformers, or other architectures suitable for handling dialogue.
- Implement the model using a machine learning framework such as TensorFlow or PyTorch.

3. **Training**:
- Train the model on the preprocessed Ubuntu Dialogue Corpus, adjusting parameters and structures as necessary to improve performance.
- Use a portion of the data for validation to monitor the model's performance and prevent overfitting.

4. **Chatbot Integration**:
- Develop a chatbot interface that can interact with users in real-time. This interface should be capable of processing user inputs, passing them to the trained NLP model, and displaying the model's responses.
- Ensure the chatbot interface is user-friendly and can handle a variety of query types.

5. **Evaluation and Testing**:
- Test the conversational AI system with a set of predefined queries to assess its response accuracy and relevance.
- Optionally, conduct user testing with real users to gather feedback on the system's performance and identify areas for improvement.

6. **Iteration and Improvement**:
- Based on testing feedback and performance evaluations, make necessary adjustments to the model and chatbot interface.
- Explore advanced NLP techniques and model architectures to enhance the system's understanding and response capabilities.
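
The preprocessing options listed in step 1 (tokenization, stop-word removal) could be sketched roughly as follows. This is only an illustration: the `preprocess` function, the regular expressions, and the tiny `STOP_WORDS` set are assumptions made for this example, not part of the project code. For subword-tokenized models such as the GPT-2 used later in this notebook, stop-word removal and stemming are usually skipped and the raw text is passed to the model's own tokenizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real pipeline would likely use a curated list from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "i", "my"}

URL_RE = re.compile(r"https?://\S+")
# A token starts with a letter or digit and may contain characters that are
# common in version numbers, package names, and paths (e.g. "12.10", "apt-get").
TOKEN_RE = re.compile(r"[a-z0-9][a-z0-9.+_/-]*")


def preprocess(text, remove_stop_words=True):
    """Lowercase, replace URLs with a placeholder, tokenize, optionally drop stop words."""
    text = URL_RE.sub(" <url> ", text.lower())
    tokens = []
    for piece in text.split():
        if piece == "<url>":
            tokens.append(piece)
            continue
        for token in TOKEN_RE.findall(piece):
            # Strip punctuation left at the end of a token, e.g. "driver." -> "driver"
            tokens.append(token.rstrip(".+_/-"))
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(preprocess("How do I install the NVIDIA driver on Ubuntu 12.10? See http://ubuntu.com/help"))
```

Keeping characters such as `.` and `-` inside tokens is a design choice for technical support text, where version numbers and command names carry most of the meaning.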

# Step 1: Data Acquisition and Preprocessing

In [18]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset

import pandas as pd


In [19]:
import transformers
# import torch

print("transformers version:", transformers.__version__)
print("torch version:", torch.__version__)
print("pandas version:", pd.__version__)


transformers version: 4.43.3
torch version: 2.4.0+cpu
pandas version: 2.2.2


1.1 Load the data

In [20]:
def load_data():
    """
    Load the Ubuntu Dialogue Corpus data from CSV files and concatenate them into a single DataFrame.
    
    Returns:
        df (pd.DataFrame): Concatenated DataFrame containing dialogue texts.
    """
    # Read only the first 1000 rows of each file to keep the experiment small
    df2 = pd.read_csv('Ubuntu-dialogue-corpus/dialogueText.csv', nrows=1000)
    df3 = pd.read_csv('Ubuntu-dialogue-corpus/dialogueText_301.csv', nrows=1000)
    df4 = pd.read_csv('Ubuntu-dialogue-corpus/dialogueText_196.csv', nrows=1000)
    df = pd.concat([df2, df3, df4], ignore_index=True)
    # Keep only the utterance text; the metadata columns are not used for training
    df = df.drop(['folder', 'dialogueID', 'date', 'from', 'to'], axis=1)
    return df


df = load_data()
df.head()

Unnamed: 0,text
0,"Hello folks, please help me a bit with the fol..."
1,Did I choose a bad channel? I ask because you ...
2,the second sentence is better english and we...
3,Sock Puppe?t
4,WTF?


1.2 Split into training and validation sets

In [21]:
# 70/30 split; random_state fixes the shuffle so the split is reproducible
train_texts, val_texts = train_test_split(df['text'], test_size=0.3, random_state=42)
print(train_texts.shape, val_texts.shape)

(2100,) (900,)


In [22]:
train_texts_list = train_texts.tolist()
val_texts_list = val_texts.tolist()

In [23]:
train_texts_list = [text if isinstance(text, str) else "" for text in train_texts_list]
val_texts_list = [text if isinstance(text, str) else "" for text in val_texts_list]
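
The cell above replaces non-string entries (for example the `NaN` values pandas produces for empty CSV cells) with empty strings. An empty string tokenizes to a sequence made only of padding, so one alternative is to drop such entries instead. The sketch below shows that variant; `keep_usable_texts` is a hypothetical helper name, and it would change the dataset sizes reported later in the notebook.

```python
def keep_usable_texts(texts):
    """Return only non-empty string entries, with surrounding whitespace stripped."""
    cleaned = []
    for text in texts:
        if not isinstance(text, str):
            continue  # e.g. float('nan') from an empty CSV cell
        text = text.strip()
        if text:
            cleaned.append(text)
    return cleaned


sample = [" only got Memtest and memtest serial", float("nan"), "", "what video card"]
print(keep_usable_texts(sample))
```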

In [24]:
train_texts_list[:3]

[' only got Memtest and memtest serial',
 'Ubuntu 12.10 \\n \\l',
 'there are other people wnating help']

In [25]:
val_texts_list[:3]

["you /think/ you can access the data, that doesn't mean its all there and ok",
 'graveman?',
 'what video card']

# Step 2: Load GPT-2 from Local Directory

Load the tokenizer

In [26]:
model_path = r"saved_models/gpt2"  # local directory holding the saved GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
vocab_size = len(tokenizer)
print(f"Tokenizer vocab size: {vocab_size}")

Tokenizer vocab size: 50258


In [27]:
train_encodings = tokenizer(train_texts_list, truncation=True, padding=True, return_tensors='pt')
val_encodings = tokenizer(val_texts_list, truncation=True, padding=True, return_tensors='pt')

Load the model

In [28]:
model = GPT2LMHeadModel.from_pretrained(model_path)
model.resize_token_embeddings(len(tokenizer))

Embedding(50258, 768)

# Step 3: Prepare Your Dataset

Convert your dataset (which has a single column 'text') into the format required for training.

In [29]:
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings.input_ids)

    def __getitem__(self, idx):
        item = {
            'input_ids': self.encodings.input_ids[idx],
            'attention_mask': self.encodings.attention_mask[idx],
            'labels': self.encodings.input_ids[idx]  # Use the input_ids as labels for language modeling
        }
        return item


train_dataset = CustomDataset(train_encodings)
val_dataset = CustomDataset(val_encodings)
print(len(train_dataset), len(val_dataset))

2100 900
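
`CustomDataset` uses the padded `input_ids` directly as labels, so padding positions also contribute to the language-modeling loss. A common refinement is to set the labels at padded positions to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, which Hugging Face causal language models also skip when computing the loss. The sketch below is one possible variant, not the dataset used for the results in this notebook; the class name `MaskedLMDataset` and the padding id `50257` in the toy batch are assumptions for illustration.

```python
import torch
from torch.utils.data import Dataset


class MaskedLMDataset(Dataset):
    """Hypothetical variant of CustomDataset that ignores padding in the loss."""

    def __init__(self, input_ids, attention_mask):
        self.input_ids = input_ids
        self.attention_mask = attention_mask

    def __len__(self):
        return self.input_ids.size(0)

    def __getitem__(self, idx):
        # Clone so the stored input_ids are not modified in place
        labels = self.input_ids[idx].clone()
        # Positions with attention_mask == 0 are padding: exclude them from the loss
        labels[self.attention_mask[idx] == 0] = -100
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
            "labels": labels,
        }


# Toy batch; 50257 stands in for the padding token id (an assumption).
input_ids = torch.tensor([[10, 11, 12, 50257], [20, 21, 50257, 50257]])
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
dataset = MaskedLMDataset(input_ids, attention_mask)
print(dataset[1]["labels"])
```

With the real encodings, this could be constructed as `MaskedLMDataset(train_encodings.input_ids, train_encodings.attention_mask)`.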


# Step 4: Fine-Tune

In [30]:
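training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    learning_rate=0.001,  # much higher than the library default of 5e-5
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy="steps",  # replaces the deprecated `evaluation_strategy`; eval_steps defaults to logging_steps
    save_steps=500,  # larger than the 132 total steps, so no intermediate checkpoint is written
    fp16=False,  # training runs on CPU here
    save_total_limit=2,
    prediction_loss_only=True  # During evaluation, return only the loss (no logits or labels)
)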

training_args



TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=10,
eval_strategy=IntervalStrategy.STEPS,
eval_use_gather_object=False,
...

Start fine-tuning

In [31]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)
trainer.train()

{'loss': 9.0177, 'grad_norm': 8.53919506072998, 'learning_rate': 0.0009242424242424242, 'epoch': 0.15}

{'eval_loss': 1.6604344844818115, 'eval_runtime': 516.0397, 'eval_samples_per_second': 1.744, 'eval_steps_per_second': 0.056, 'epoch': 0.15}

{'loss': 1.066, 'grad_norm': 0.3350197970867157, 'learning_rate': 0.0008484848484848485, 'epoch': 0.3}

{'eval_loss': 0.9315052032470703, 'eval_runtime': 517.8907, 'eval_samples_per_second': 1.738, 'eval_steps_per_second': 0.056, 'epoch': 0.3}
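
The numbers in these log lines can be checked by hand. The sketch below reproduces the total step count and the logged learning rates, assuming the `Trainer` defaults of a linear decay schedule with no warmup, no gradient accumulation, and a single device. It also converts the evaluation losses into perplexities. The function names are illustrative.

```python
import math

# Values taken from the cells above
num_train_examples = 2100
batch_size = 32
num_epochs = 2
initial_lr = 0.001

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step, base_lr=initial_lr, total=total_steps):
    """Learning rate after `step` optimizer steps under linear decay to zero, no warmup."""
    return base_lr * max(0.0, (total - step) / total)


def perplexity(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(eval_loss)


print("total steps:", total_steps)
print("lr at steps 10 and 20:", linear_lr(10), linear_lr(20))
print("perplexity at steps 10 and 20:",
      perplexity(1.6604344844818115), perplexity(0.9315052032470703))
```

Because the labels in `CustomDataset` include padding positions, these losses also count padding tokens, so the resulting perplexity is likely lower than it would be on the real text alone.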




# Step 5: Save the fine-tuned model

In [None]:
model.save_pretrained('./fine-tuned-gpt2')
tokenizer.save_pretrained('./fine-tuned-gpt2')
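
Once the model is saved, the chatbot integration described in the project steps could start from something like the sketch below. It is a minimal console loop under stated assumptions, not the project's actual interface: `load_reply_fn` and `chat` are hypothetical helper names, and the sampling settings (`max_new_tokens=40`, `top_p=0.9`) are arbitrary choices. Since the model above was fine-tuned on single utterances rather than question-answer pairs, its output will tend to continue the user's text rather than answer it.

```python
def load_reply_fn(model_dir="./fine-tuned-gpt2", max_new_tokens=40):
    """Build a function that maps a user message to a generated reply."""
    # Imported lazily so the chat loop below can be exercised without the model.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
    model = GPT2LMHeadModel.from_pretrained(model_dir)
    model.eval()

    def reply(user_text):
        inputs = tokenizer(user_text, return_tensors="pt")
        with torch.no_grad():
            output_ids = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                top_p=0.9,
                pad_token_id=tokenizer.pad_token_id,
            )
        # Keep only the newly generated tokens, not the echoed prompt
        new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    return reply


def chat(reply_fn, read=input, write=print):
    """Minimal console loop: read a message, print the reply, stop on 'quit' or 'exit'."""
    while True:
        user_text = read("You: ").strip()
        if user_text.lower() in {"quit", "exit"}:
            break
        if user_text:  # skip empty input, which would give the model nothing to condition on
            write("Bot: " + reply_fn(user_text))


# To talk to the fine-tuned model (requires the directory saved above):
# chat(load_reply_fn())
```

Passing `read` and `write` as parameters keeps the loop independent of the console, so the same function could later sit behind a web or messaging front end.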