<a href="https://www.kaggle.com/code/avikumart/customer-support-ticket-tagger?scriptVersionId=214539464" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
#!pip install transformers datasets torch scikit-learn

## Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from datasets import Dataset

In [2]:
df = pd.read_csv("/kaggle/input/customer-support-ticket-tagging/customer_tickets.csv")
df.columns = ["text","labels"]
df.head()

,text,labels
0,"Dear Customer Support Team, We are experiencin...",Technical Support
1,"Dear Customer Support,<br><br>I hope this mess...",Product Support
2,"Dear Tech Online Store Customer Support,\n\nI ...",Returns and Exchanges
3,"Dear IT Services Customer Support, \n\nWe are ...",Product Support
4,"Greetings IT Services Customer Support,\n\nI a...",Technical Support


In [3]:
df.dropna(inplace=True)

In [4]:
label_encoder = LabelEncoder()
df['labels'] = label_encoder.fit_transform(df['labels'])  # encode the string labels as integer class ids

# Convert to Hugging Face Dataset
dataset = Dataset.from_pandas(df[['text', 'labels']])
hf_dataset = dataset.train_test_split(test_size=0.145)
print(hf_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 288
    })
    test: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 50
    })
})


In [5]:
label_encoder.classes_

array(['Billing and Payments', 'Customer Service', 'General Inquiry',
       'Human Resources', 'IT Support', 'Product Support',
       'Returns and Exchanges', 'Sales and Pre-Sales',
       'Service Outages and Maintenance', 'Technical Support'],
      dtype=object)
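
`LabelEncoder` assigns ids in sorted class order, so the array above doubles as the id-to-name lookup. A minimal sketch of that mapping in plain Python, using the class names exactly as printed above. The names `label2id` and `id2label` are illustrative and are not defined anywhere in the notebook.

```python
# Class names exactly as printed by label_encoder.classes_ above.
classes = [
    "Billing and Payments", "Customer Service", "General Inquiry",
    "Human Resources", "IT Support", "Product Support",
    "Returns and Exchanges", "Sales and Pre-Sales",
    "Service Outages and Maintenance", "Technical Support",
]

# LabelEncoder sorts the unique labels, so position in the array == class id.
assert classes == sorted(classes)

# Illustrative lookups (hypothetical names, not part of the notebook).
id2label = dict(enumerate(classes))
label2id = {name: idx for idx, name in id2label.items()}

print(label2id["Technical Support"])  # 9
print(id2label[1])                    # Customer Service
```

These are the same ids that appear later in the `predictor` output, where 9 decodes to Technical Support and 1 to Customer Service.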

In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "microsoft/deberta-v3-base"  # loading the deberta model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(label_encoder.classes_))

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

tokenized_datasets = hf_dataset.map(preprocess_function, batched=True)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
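
The warning above means `truncation=True` had no effect, because neither the call nor the tokenizer supplied a maximum length, so long tickets were passed through at full length. One possible fix, sketched below, is to pass an explicit `max_length`. The value 512 is an assumption based on the usual position limit of this checkpoint, not something the notebook states. The tokenizer is passed in as an argument, and `StandInTokenizer` is a hypothetical whitespace splitter that exists only so the snippet runs without downloading a model.

```python
def preprocess_function(examples, tokenizer, max_length=512):
    # An explicit max_length gives truncation=True something to truncate to.
    # max_length=512 is an assumed value, not taken from the notebook.
    return tokenizer(examples["text"], truncation=True, max_length=max_length)


class StandInTokenizer:
    """Hypothetical stand-in for the real tokenizer: one fake id per word."""

    def __call__(self, texts, truncation=False, max_length=None):
        rows = [list(range(len(text.split()))) for text in texts]
        if truncation and max_length is not None:
            rows = [row[:max_length] for row in rows]
        return {"input_ids": rows}


batch = {"text": ["short ticket", "word " * 1000]}
encoded = preprocess_function(batch, StandInTokenizer(), max_length=512)
print([len(ids) for ids in encoded["input_ids"]])  # [2, 512]
```

With the real tokenizer, the same function could be applied through `hf_dataset.map(functools.partial(preprocess_function, tokenizer=tokenizer), batched=True)`.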



In [8]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=0.05,
    logging_dir='./logs',
    logging_steps=30,
    report_to='none'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer
)
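
The `Trainer` above is created without a `compute_metrics` callback, so evaluation reports loss only. Below is a minimal sketch of such a callback using NumPy. It assumes the callback receives a `(logits, labels)` pair, and the choice of accuracy plus macro-averaged F1 is illustrative rather than the author's.

```python
import numpy as np


def compute_metrics(eval_pred, num_labels=10):
    # Assumption: eval_pred unpacks into (logits, labels).
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)

    accuracy = float((preds == labels).mean())

    # Macro F1: unweighted mean of per-class F1 over classes that occur
    # in either the predictions or the labels.
    f1_scores = []
    for c in range(num_labels):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        if tp + fp + fn == 0:
            continue  # class absent from both predictions and labels
        f1_scores.append(2 * tp / (2 * tp + fp + fn))

    return {"accuracy": accuracy, "macro_f1": float(np.mean(f1_scores))}


# Toy check with 3 classes: predictions are [0, 1, 2, 0].
toy_logits = np.array([[2.0, 0.1, 0.0],
                       [0.1, 3.0, 0.2],
                       [0.3, 0.2, 1.5],
                       [1.2, 0.4, 0.1]])
toy_labels = np.array([0, 1, 2, 1])
metrics = compute_metrics((toy_logits, toy_labels), num_labels=3)
print(metrics["accuracy"])  # 0.75
```

It would be passed as `Trainer(..., compute_metrics=compute_metrics)`. With only 50 evaluation rows spread over 10 classes, per-class scores will be noisy.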



In [9]:
trainer.train()



Step,Training Loss,Validation Loss
30,2.1011,1.825805
60,1.8209,1.748241


TrainOutput(global_step=72, training_loss=1.92625011338128, metrics={'train_runtime': 69.8007, 'train_samples_per_second': 16.504, 'train_steps_per_second': 1.032, 'total_flos': 215507297101824.0, 'train_loss': 1.92625011338128, 'epoch': 4.0})
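
The `global_step=72` in the output above can be checked by hand. With 288 training rows, a per-device batch size of 8, and 4 epochs, a single device would take 36 × 4 = 144 steps. The same settings give 72 steps if each batch is spread over two devices. The two-device setup is an inference from the numbers, not something the notebook states, and the sketch assumes no gradient accumulation.

```python
import math


def total_steps(num_rows, per_device_batch, num_devices, epochs):
    # Steps per epoch = number of global batches, counting a final partial batch.
    steps_per_epoch = math.ceil(num_rows / (per_device_batch * num_devices))
    return steps_per_epoch * epochs


print(total_steps(288, 8, 1, 4))  # 144
print(total_steps(288, 8, 2, 4))  # 72
```

Because `eval_steps` is not set, evaluation runs every `logging_steps=30` steps, which is why the table has rows at steps 30 and 60 only.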

In [10]:
trainer.evaluate()



{'eval_loss': 1.742765188217163,
 'eval_runtime': 1.0737,
 'eval_samples_per_second': 46.568,
 'eval_steps_per_second': 3.725,
 'epoch': 4.0}
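
To put the `eval_loss` of about 1.74 in context, a classifier that assigns equal probability to all 10 classes has a cross-entropy of ln(10). The comparison below is a rough sanity check. It assumes the reported loss is the mean cross-entropy over the evaluation rows, and it is not a substitute for accuracy on a larger test set.

```python
import math

num_classes = 10                # length of label_encoder.classes_
eval_loss = 1.742765188217163   # copied from trainer.evaluate() above

# Cross-entropy of a uniform guess over all classes.
uniform_loss = math.log(num_classes)
print(round(uniform_loss, 4))   # 2.3026

# exp(-loss) is the geometric-mean probability given to the true class.
print(round(math.exp(-eval_loss), 3))  # 0.175
```

The loss is below the uniform baseline, but the implied probability on the correct class is still low compared with 0.1 for pure guessing, which suggests the model has learned only a little after 4 epochs on 288 rows.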

In [11]:
trainer.save_model("./text-classification-model")
tokenizer.save_pretrained("./text-classification-model")

('./text-classification-model/tokenizer_config.json',
 './text-classification-model/special_tokens_map.json',
 './text-classification-model/spm.model',
 './text-classification-model/added_tokens.json',
 './text-classification-model/tokenizer.json')

In [12]:
from transformers import pipeline
# loading the locally saved model
classifier = pipeline("text-classification", model="./text-classification-model", tokenizer=tokenizer)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [13]:
# Predictor function to spot-check individual tickets
def predictor(input_ticket, org_label):
    print(f"Input Ticket: {input_ticket}")
    result = classifier(input_ticket)
    print("\n")
    # The pipeline returns labels such as "LABEL_9"; the trailing integer is the class id
    pred_label = int(result[0]['label'].split("_")[-1])
    org_label_decoded = label_encoder.inverse_transform([int(org_label)])[0]
    decoded_label = label_encoder.inverse_transform([pred_label])[0]
    print("Original Label: ",org_label,",Original Label Decoded: ",org_label_decoded)
    print(f"Predicted Label: {pred_label} ,Predicted label Decoded: {decoded_label}")

In [15]:
predictor(df['text'][319],df['labels'][319])

Input Ticket: I am unable to connect to the Wi-Fi.


Original Label:  1 ,Original Label Decoded:  Customer Service
Predicted Label: 9 ,Predicted label Decoded: Technical Support


In [16]:
predictor(df['text'][338],df['labels'][338])

Input Ticket: Dear Customer Support Team,

I am contacting you to seek prompt professional help regarding our IT Consulting Service. We are facing an urgent requirement for server setup and network enhancement. Our systems are presently experiencing difficulties that may negatively affect our business activities. It is imperative that we address these issues swiftly to avoid any interruptions.

Could you kindly prioritize our request and allocate an expert to help us with these concerns? We need someone with specialized expertise in server setups and optimization methods. Please inform us at your earliest convenience about the availability of your support personnel.

We are ready for a consultation call whenever it suits you to provide any additional information needed. You can reach me at <tel_num>.

Thank you for your prompt attention to this issue. We anticipate your swift reply.

Best regards,

<name>


Original Label:  9 ,Original Label Decoded:  Technical Support
Predicted Label:

In [None]:
predictor(df['text'][317],df['labels'][317])

In [None]:
predictor(df['text'][210],df['labels'][210])

In [None]:
predictor(df['text'][50],df['labels'][50])

In [None]:
predictor(df['text'][175],df['labels'][175])

In [None]:
predictor(df['text'][15],df['labels'][15])