![servicedesk](servicedesk.png)

CleverSupport is a company at the forefront of AI innovation, specializing in the development of AI-driven solutions to enhance customer support services. Their latest endeavor is to engineer a text classification system that can automatically categorize customer complaints. 

As a data scientist, your task is to build a machine learning model that accurately assigns each complaint to a category such as mortgage, credit card, money transfers, or debt collection.

In [1]:
from collections import Counter
import json
import nltk
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
from torchmetrics import Accuracy, Precision, Recall

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
# Import data and labels
with open("words.json", 'r') as f1:
    words = json.load(f1)
with open("text.json", 'r') as f2:
    text = json.load(f2)
labels = np.load('labels.npy')

In [5]:
# Dictionaries to store the word to index mappings and vice versa
word2idx = {o:i for i,o in enumerate(words)}
idx2word = {i:o for i,o in enumerate(words)}

# Looking up the mapping dictionary and assigning the index to the respective words
for i, sentence in enumerate(text):
    text[i] = [word2idx[word] if word in word2idx else 0 for word in sentence]
    
# Defining a function that truncates each sentence to seq_len tokens or left-pads it with 0s to that length
def pad_input(sentences, seq_len):
    features = np.zeros((len(sentences), seq_len), dtype=int)
    for ii, sentence in enumerate(sentences):
        if len(sentence) != 0:
            # Keep the first seq_len tokens; shorter sentences are right-aligned
            features[ii, -len(sentence):] = np.array(sentence)[:seq_len]
    return features

text = pad_input(text, 50)

In [8]:
# Splitting dataset
train_text, test_text, train_label, test_label = train_test_split(text, labels, test_size=0.2, random_state=42)

train_data = TensorDataset(torch.from_numpy(train_text), torch.from_numpy(train_label).long())
test_data = TensorDataset(torch.from_numpy(test_text), torch.from_numpy(test_label).long())

In [12]:
# Start coding here

# 1. Define the classifier
Define a class containing all the appropriate layers, and a method to perform the forward pass over a batch of input text.
* Creating a class to contain the layers of the classifier
    * Define a class named TicketClassifier that subclasses PyTorch's nn.Module.
    * Call super().__init__() in the TicketClassifier class's constructor before creating any layers.
* Adding an embedding layer
    * Use PyTorch's nn.Embedding class to define the embedding layer.
    * Create an instance of it in the TicketClassifier class's constructor and assign it to an instance variable such as self.embedding.
* Adding a convolution layer
    * Use PyTorch's nn.Conv1d class to define the 1D convolution layer.
    * Create an instance of it in the TicketClassifier class's constructor and assign it to an instance variable such as self.conv.
* Adding a linear layer
    * Use PyTorch's nn.Linear class to define the linear layer.
    * Create an instance of it in the TicketClassifier class's constructor and assign it to an instance variable such as self.fc.
* Define a .forward() method
    * Finally, define a .forward() method that passes the input through the embedding and convolution layers, applies nn.functional.relu to the output, and then applies the linear layer before returning the result.
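
The steps in this section can be sketched roughly as follows. This is a minimal illustration, not the reference solution. The sizes (`vocab_size=1000`, `embed_dim=64`, `num_classes=5`), the kernel size, and the mean-pooling step are assumptions made for the example. In the notebook, the vocabulary size would come from `word2idx` and the number of classes from `labels`. Two details are not spelled out in the task list: `nn.Conv1d` expects input shaped `(batch, channels, seq_len)`, so the embedding output is permuted first, and the convolution output has to be reduced over the sequence dimension (here by averaging) before the linear layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TicketClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        # Maps each token index to a dense vector of size embed_dim
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # The embedding dimension plays the role of input channels;
        # kernel_size=3 with padding=1 keeps the sequence length unchanged (assumed values)
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=1, padding=1)
        # Maps the pooled features to one score per category
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, text):
        embedded = self.embedding(text).permute(0, 2, 1)  # (batch, embed_dim, seq_len)
        conved = F.relu(self.conv(embedded))              # (batch, embed_dim, seq_len)
        pooled = conved.mean(dim=2)                       # (batch, embed_dim), assumed pooling choice
        return self.fc(pooled)                            # (batch, num_classes)


# Illustrative sizes only; none of these values come from the dataset
model = TicketClassifier(vocab_size=1000, embed_dim=64, num_classes=5)
dummy_batch = torch.randint(0, 1000, (4, 50))  # 4 padded sentences of length 50
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([4, 5])
```

The model returns raw scores (logits) rather than probabilities, which is the input `nn.CrossEntropyLoss` expects.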

# 2. Training the classifier
Define a training loop that loops over the dataset, calculating the loss and propagating it backwards through the network.
* Define a suitable loss criterion
    * Use PyTorch's nn.CrossEntropyLoss, since this is a multi-class classification problem.
* Define an optimizer
    * Use PyTorch's optim.Adam optimizer.
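
A training loop along these lines might look like the sketch below. So that it runs on its own, it uses random stand-in data and a small stand-in model (embedding, flatten, linear). In the notebook, `train_loader` would wrap `train_data` and `model` would be the `TicketClassifier` instance. The batch size, learning rate, and number of epochs are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-ins for the padded text and labels (assumed shapes and sizes)
vocab_size, embed_dim, num_classes, seq_len = 100, 16, 5, 50
features = torch.randint(0, vocab_size, (64, seq_len))
targets = torch.randint(0, num_classes, (64,))
train_loader = DataLoader(TensorDataset(features, targets), batch_size=16, shuffle=True)

# Stand-in model so the loop is runnable; not the TicketClassifier itself
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Flatten(),                              # (batch, seq_len * embed_dim)
    nn.Linear(seq_len * embed_dim, num_classes),
)

criterion = nn.CrossEntropyLoss()              # multi-class loss on raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

epoch_losses = []
model.train()
for epoch in range(3):
    running_loss = 0.0
    for inputs, batch_labels in train_loader:
        optimizer.zero_grad()                  # clear gradients from the previous step
        outputs = model(inputs)                # forward pass
        loss = criterion(outputs, batch_labels)
        loss.backward()                        # propagate the loss backwards
        optimizer.step()                       # update the weights
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(train_loader))
    print(f"Epoch {epoch + 1}, loss: {epoch_losses[-1]:.4f}")
```

Because the stand-in labels are random, the loss values here say nothing about how the real model would train.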

# 3. Testing the classifier
Use your trained model to classify the text in the test set, and calculate the appropriate metrics.
* Predict the category of each ticket in the test data.
    * Invoke model() on your input data to pass the data through the network.
    * Use torch.argmax() to find the category with the highest predicted probability.
* Calculate the accuracy
    * Use torchmetrics.Accuracy to calculate the accuracy.
* Calculate the precision and recall
    * Use torchmetrics.Precision and torchmetrics.Recall to calculate the precision and recall.