***FCIM.FIA - Fundamentals of Artificial Intelligence***
> Lab 4: Natural Language Processing and Chat Bots \
> Performed by: Dobrojan Alexandru, FAF-212 \
> Verified by: Elena Graur, asist. univ.


# Theory

Natural language processing (NLP) is the area of machine learning that lets computers interpret text and speech well enough to perform tasks such as translation, sentiment analysis, text summarization, and chatbot communication. NLP converts unstructured language data, such as sentences, into a structured format that computers can work with, such as numbers or vectors. This is done using techniques like tokenization, stemming, and feature extraction.

LSTMs could be a good fit for this task because they excel at processing sequential data and tracking context over time. Unlike a simple feedforward neural network, an LSTM captures dependencies between words in a sentence by maintaining a memory of previous inputs, which is crucial in natural language processing tasks. For instance, LSTMs can handle longer sentences where the meaning of a word depends on the context provided by preceding words. They can also better model variations in sentence structure, making them more robust to user input that doesn't strictly match predefined patterns.
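
As a rough illustration of the idea above, the following sketch shows what an LSTM-based tag classifier could look like in PyTorch. It is not part of this laboratory work: the class name `LSTMClassifier`, the embedding layer, and all sizes are assumptions chosen for the example, and the input is a batch of word indices rather than bag-of-words vectors.

```python
import torch
import torch.nn as nn


class LSTMClassifier(nn.Module):
    """Hypothetical LSTM-based tag classifier (illustration only)."""

    def __init__(self, vocab_size: int, embed_dim: int, hidden_size: int, num_classes: int):
        super().__init__()
        # Index 0 is reserved for padding in this sketch
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (num_layers, batch, hidden_size)
        return self.fc(h_n[-1])               # (batch, num_classes) raw scores


lstm_model = LSTMClassifier(vocab_size=50, embed_dim=16, hidden_size=32, num_classes=3)
batch = torch.tensor([[4, 7, 9, 0], [3, 5, 0, 0]])  # two padded sentences of word indices
logits = lstm_model(batch)
print(logits.shape)
```

In this simplified version the padding positions are still fed through the LSTM. A more careful implementation would likely pack the sequences (for example with `torch.nn.utils.rnn.pack_padded_sequence`) so that padding does not influence the final hidden state.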

This laboratory work implements a machine learning pipeline for building a simple chatbot using NLP and a neural network. The chatbot data, containing user input patterns and corresponding response tags, is read from a JSON file. Text preprocessing is applied using tokenization, lowercasing, and stemming to standardize the data. A "bag-of-words" model is used to convert each sentence into a numerical vector representation, creating the input features for the model. A simple feedforward neural network is defined with three layers. It takes the bag-of-words vectors as input and predicts the corresponding tag for the input sentence.

## Used imports

In [2]:
import random
import os
import torch
import json
import nltk

import numpy as np
import telegram.ext.filters as filters
import torch.nn as nn

from torch.utils.data import Dataset, DataLoader
from numpy import ndarray, dtype
from nltk.stem.porter import PorterStemmer
from telegram import Update
from telegram.ext import ApplicationBuilder, CommandHandler, CallbackContext, MessageHandler
from dotenv import load_dotenv

## NLP Utils

In [None]:
def tokenize(text: str) -> list[str]:
    """Tokenize a string into a list of words using NLTK's word tokenizer."""
    return nltk.word_tokenize(text)


def lower(text: list[str]) -> list[str]:
    """Convert a list of words to lowercase."""
    return [w.lower() for w in text]


stemmer = PorterStemmer()  # Shared Porter stemmer instance used by stem()


def stem(word: str | list[str]) -> str | list[str]:
    """Apply stemming to a word or a list of words using Porter Stemmer."""
    if isinstance(word, list):
        return [stemmer.stem(w) for w in word]
    return stemmer.stem(word)


def bag_of_words(tokenized_sentence: list[str], all_words: list[str]) -> np.ndarray[int, np.dtype]:
    """Create a bag-of-words representation for a tokenized sentence."""
    tokenized_sentence = stem(tokenized_sentence)
    bag = np.zeros(len(all_words), dtype=int)

    for idx, w in enumerate(all_words):
        if w in tokenized_sentence:
            bag[idx] = 1

    return bag
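
To make the encoding concrete, here is a minimal sketch of the bag-of-words idea that uses only the standard library. It is an illustration rather than the notebook's code: `simple_tokenize` and `bag_of_words_demo` are hypothetical stand-ins, a regular expression replaces NLTK's tokenizer, and plain lowercasing replaces Porter stemming.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (stand-in for tokenize + lower + stem)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words_demo(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Return a 0/1 vector marking which vocabulary words occur in tokens."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Vocabulary built from two toy patterns, de-duplicated and sorted as in the lab code
patterns = ["Can I hike in Moldova?", "Are there any national parks in Moldova?"]
vocabulary = sorted({tok for p in patterns for tok in simple_tokenize(p)})

vector = bag_of_words_demo(simple_tokenize("Can I hike?"), vocabulary)
print(vocabulary)
print(vector)
```

The vector has one position per vocabulary word and records only presence, so word order and repetition are lost. That is acceptable for tag classification on short questions, but it is one reason sequence models are discussed as an alternative.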

## Preparing the dataset

The dataset contains information about tourism in Moldova, categorized into different topics or tags, such as "tourism_nature" and "tourism_history." Each entry includes a set of patterns (questions) related to a specific topic, and a set of responses (answers) that provide information about the tourism aspects of Moldova.

For example, under the "tourism_nature" tag, the patterns are questions related to nature attractions, such as hiking, national parks, and scenic spots in Moldova. The responses offer details about popular nature destinations, such as Orheiul Vechi, Codru Forest, and places suitable for birdwatching or hiking.

In [None]:
{
    "tag": "tourism_nature",
    "patterns": [
        "What natural attractions are there in Moldova?",
        "Where can I go for nature walks in Moldova?",
        "What are the most beautiful landscapes in Moldova?",
        "Are there any national parks in Moldova?",
        "What are Moldova's top outdoor destinations?",
        "Can I hike in Moldova?",
        "What rivers or lakes can I visit in Moldova?",
        "What are the best spots for birdwatching in Moldova?"
    ],
    "responses": [
        "Moldova is home to beautiful landscapes, including forests, lakes, and vineyards. Visit places like Orheiul Vechi, a historic complex with stunning views.",
        "The Codru Forest and the Danube River are popular for nature walks and outdoor activities.",
        "For hiking and exploration, Tipova Monastery and Saharna Monastery offer picturesque settings.",
        "Nature reserves like Padurea Domneasca are excellent for birdwatching and wildlife enthusiasts."
    ]
},
{
    "tag": "tourism_history",
    "patterns": [
        "What historical sites can I visit in Moldova?",
        "Tell me about Moldova's history.",
        "What are the most famous historical landmarks in Moldova?",
        "Does Moldova have any castles or ruins?",
        "Are there any UNESCO World Heritage sites in Moldova?",
        "What ancient monuments are in Moldova?",
        "Can I learn about Soviet history in Moldova?",
        "What are Moldova's key historical periods?"
    ],
    "responses": [
        "Moldova has a rich history with several key sites like the Orheiul Vechi archaeological complex, the Stefan Cel Mare Park, and the Bender Fortress.",
        "Moldova is also known for its ancient monasteries, including the Căpriana Monastery.",
        "Don't miss Soroca Fortress and the National Museum of History in Chisinau for a glimpse into Moldova's past.",
        "The country offers insights into its Soviet history, with landmarks and museums dedicated to this period."
    ]
},

It is worth noting that having as many patterns as possible is preferable, since both the vocabulary and the quality of pattern recognition depend directly on the amount of data the model is fed. This is why several questions are given for each tag.
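
Since the number of patterns per tag matters, a quick check of the dataset balance can be useful. The sketch below assumes the file is a JSON list of entries with the `tag`, `patterns`, and `responses` keys shown above. It parses a trimmed inline string instead of reading `data.json`, and `patterns_per_tag` is a hypothetical helper.

```python
import json

# Inline stand-in for data.json, trimmed to two short entries from the excerpt
raw = """
[
  {"tag": "tourism_nature",
   "patterns": ["Can I hike in Moldova?", "Are there any national parks in Moldova?"],
   "responses": ["The Codru Forest and the Danube River are popular for nature walks and outdoor activities."]},
  {"tag": "tourism_history",
   "patterns": ["Tell me about Moldova's history."],
   "responses": ["Moldova is also known for its ancient monasteries, including the Căpriana Monastery."]}
]
"""

entries = json.loads(raw)


def patterns_per_tag(entries: list[dict]) -> dict[str, int]:
    """Count how many example questions each tag provides."""
    return {entry["tag"]: len(entry["patterns"]) for entry in entries}


counts = patterns_per_tag(entries)
print(counts)
```

A tag with far fewer patterns than the others is a likely candidate for misclassification, so this count is a cheap first diagnostic.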

## Loading and preprocessing the dataset

In [None]:
with open('data.json', 'r') as f:
    data = json.load(f)

all_words: list[str] = []  # List to store all unique words across all patterns
tags: list[str] = []  # List to store unique tags (categories) from the dataset
xy: list[tuple[list[str], str]] = []  # List of tuples (tokenized_sentence, tag) for each pattern in the dataset
bag: ndarray[int, dtype] = []  # Placeholder variable for the bag-of-words representation (used later)
IGNORE_WORDS = ['!', '?', '.', ',']  # Words to ignore during preprocessing

# Process the dataset
for datum in data:
    tag = datum['tag']
    tags.append(tag)  # Add the tag to the list of tags

    for pattern in datum['patterns']:
        w = tokenize(pattern)  # Tokenize the pattern into individual words
        all_words.extend(w)  # Add all words from the pattern to the list of all_words
        xy.append((w, tag))  # Add the tokenized sentence and tag as a tuple to xy

# Preprocess all words: remove punctuation, lowercase, and stem for consistency
all_words = sorted(set(stem(lower([w for w in all_words if w not in IGNORE_WORDS]))))  # Unique and sorted words
tags = sorted(set(lower(tags)))  # Unique and sorted list of tags

# Prepare training data
X_train = []  # Feature vectors (bag of words for each sentence)
Y_train = []  # Labels (index corresponding to the tag)

for (pattern_sentence, tag) in xy:
    # Create a "bag of words" representation for each sentence
    bag = bag_of_words(pattern_sentence, all_words)
    X_train.append(bag)  # Add the feature vector for the sentence
    label = tags.index(tag.lower())  # Convert tag to its numerical index (tags were lowercased above)
    Y_train.append(label)  # Add the label to the list of labels

# Convert training data to numpy arrays
X_train = np.array(X_train)
Y_train = np.array(Y_train)
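
The label encoding step can be shown in isolation. The following sketch uses the two tags from the dataset excerpt and a dictionary lookup instead of `list.index`. The names `demo_tags`, `tag_to_index`, and `pairs` are made up for this example.

```python
# Two tags taken from the dataset excerpt; sorting makes the index assignment deterministic
demo_tags = sorted({"tourism_nature", "tourism_history"})
tag_to_index = {tag: index for index, tag in enumerate(demo_tags)}

# (pattern, tag) pairs in the same spirit as the xy list, with patterns left untokenized
pairs = [
    ("Can I hike in Moldova?", "tourism_nature"),
    ("Tell me about Moldova's history.", "tourism_history"),
    ("Are there any national parks in Moldova?", "tourism_nature"),
]

labels = [tag_to_index[tag] for _, tag in pairs]    # tags -> integer class labels
decoded = [demo_tags[label] for label in labels]    # integer labels -> tags again
print(tag_to_index)
print(labels)
```

Keeping the sorted tag list around matters at inference time, because the index predicted by the model has to be mapped back to a tag with exactly the same ordering.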

## Model

The code defines a simple feedforward neural network for classification. It consists of three fully connected (linear) layers: input to hidden, hidden to hidden, and hidden to output. ReLU activations follow the first two layers, while the output layer returns raw scores (logits), one per tag. The configuration sets hyperparameters such as batch size, hidden layer size, learning rate, and the number of training epochs. The number of output classes equals the number of tags, and the model runs on a GPU when one is available, otherwise on the CPU.

In [None]:
class NN(nn.Module):
    def __init__(self, input_size: int, hidden_size: int, num_classes: int):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes

        # Define a simple feedforward neural network with 3 layers
        self.l1 = nn.Linear(input_size, hidden_size)  # Input to hidden layer
        self.l2 = nn.Linear(hidden_size, hidden_size)  # Hidden to hidden
        self.l3 = nn.Linear(hidden_size, num_classes)  # Hidden to output layer
        self.relu = nn.ReLU()

    def forward(self, x):
        # Define the forward pass using a list for clarity
        apply_order = [self.l1, self.relu, self.l2, self.relu, self.l3]

        for f in apply_order:  # Sequentially apply layers and activations
            x = f(x)

        return x


# Hyperparameters and model configuration
BATCH_SIZE = 8
HIDDEN_SIZE = 128
OUTPUT_SIZE = len(tags)  # Number of classes (tags)
INPUT_SIZE = len(X_train[0])  # Feature vector size
LEARNING_RATE = 0.001
EPOCHS = 200

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # Check if GPU is available
print('Running on device:', device)

# Initialize the model and move it to the appropriate device
model = NN(INPUT_SIZE, HIDDEN_SIZE, OUTPUT_SIZE).to(device)
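
The size of this network depends mostly on the vocabulary size and the hidden size. The sketch below computes the number of trainable parameters for three linear layers by hand. The vocabulary size of 150 and the 5 tags are illustrative values, not the actual numbers from this lab, and both helper functions are hypothetical.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def count_parameters(input_size: int, hidden_size: int, num_classes: int) -> int:
    """Total trainable parameters for input -> hidden -> hidden -> output."""
    return (
        linear_params(input_size, hidden_size)
        + linear_params(hidden_size, hidden_size)
        + linear_params(hidden_size, num_classes)
    )


total = count_parameters(input_size=150, hidden_size=128, num_classes=5)
print(total)
```

For the real model, the same figure can be obtained with `sum(p.numel() for p in model.parameters())`.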

## Training

The train function defines and runs the training process for the neural network model. It first creates a custom dataset class that loads the training data and implements methods to access individual samples and the dataset size. The training loop uses DataLoader to batch the data and shuffle it for better generalization. The loss function used is CrossEntropyLoss, which is suitable for multi-class classification. The optimizer is Adam, which updates the model's weights during training. In each epoch, the model's predictions are compared with the true labels, and the loss is backpropagated to update the model parameters. Every 5 epochs, the loss is printed. Once training is complete, the model's state is saved to a file.

In [None]:
def train():
    class CustomDataset(Dataset):
        def __init__(self):
            self.n_samples = len(X_train)
            self.x_data = X_train  # Features (bag of words)
            self.y_data = Y_train  # Labels (numerical indices for tags)

        def __getitem__(self, index):
            return self.x_data[index], self.y_data[index]

        def __len__(self):
            return self.n_samples

    print('Training...')

    dataset = CustomDataset()
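    # num_workers=0 loads batches in the main process; worker processes may fail to pickle
    # a dataset class defined inside train() on platforms that spawn workers
    train_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)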

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

    loss: torch.Tensor | None = None  # Placeholder to store the loss during training

    # Training loop over the specified number of epochs
    for epoch in range(EPOCHS):
        for batch, (words, labels) in enumerate(train_loader):
            # Convert features and labels to the correct type and move to the computation device
            words = words.float().to(device)  # Convert features to float tensor
            labels = labels.to(device)  # Move labels to the same device

            # Forward pass: compute predictions from the model
            output = model(words)

            # Compute the loss between predictions and true labels
            loss = loss_fn(output, labels)

            # Backpropagation: clear previous gradients, compute new ones, and update parameters
            optimizer.zero_grad()  # Clear gradients
            loss.backward()  # Compute gradients
            optimizer.step()  # Update model parameters

        if epoch % 5 == 4:
            print(f'Epoch [{epoch + 1}/{EPOCHS}] Loss: {loss.item():.10f}')

    # Save the trained model
    torch.save(model.state_dict(), 'model.pth')
    print(f'Finished training with final loss {loss.item():.10f}')
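
To show what the loss measures, here is a small standard-library sketch of softmax cross-entropy for a single sample. `nn.CrossEntropyLoss` applies the softmax step internally to the raw model outputs and, by default, averages the result over the batch, which is why the network's forward pass ends without an activation. The function names and the example scores are made up for illustration.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


confident = cross_entropy([4.0, 0.0, 0.0], target=0)  # correct class has the top score
uniform = cross_entropy([0.0, 0.0, 0.0], target=0)    # model has no preference
wrong = cross_entropy([4.0, 0.0, 0.0], target=1)      # top score is on the wrong class
print(confident, uniform, wrong)
```

The loss is close to zero for a confident correct prediction, equals the logarithm of the number of classes for a uniform guess, and grows when the model is confidently wrong.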


## Bot

This code defines a TelegramBot class that interacts with the Telegram API using the python-telegram-bot library. It initializes the bot with a Telegram API key from environment variables and sets up the bot to listen for messages. The start() method begins polling for incoming messages. When a message is received, the _handle_message() method is called, which attempts to generate a response using the generate_response() function. If successful, it sends the response back to the user; if an error occurs, it sends an error message instead.

In [None]:
class TelegramBot:
    def __init__(self):
        key = os.environ.get('TELEGRAM_KEY')
        self.app = ApplicationBuilder().token(key).build()

    def start(self):
        self.app.add_handler(MessageHandler(filters.CHAT, self._handle_message))
        self.app.run_polling()

    async def _handle_message(self, update: Update, context: CallbackContext) -> None:
        msg = update.message.text
        chat_id = update.effective_chat.id

        # Failure handling
        try:
            res = generate_response(msg)
            await context.bot.send_message(chat_id=chat_id, text=res)
        except Exception as e:
            print(e)
            await context.bot.send_message(chat_id=chat_id, text=f'Error generating response: {e}')
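
The `generate_response()` function is not shown in this report. As a hedged sketch of what its final step might look like, the code below turns raw model scores into a reply: it picks the most probable tag, returns a random response for it, and falls back to a default message when the model is unsure. The name `pick_response`, the 0.75 threshold, and the fallback text are all assumptions.

```python
import math
import random

FALLBACK = "Sorry, I did not understand that."  # Hypothetical fallback reply


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores to probabilities."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_response(
    logits: list[float],
    tags: list[str],
    responses_by_tag: dict[str, list[str]],
    threshold: float = 0.75,
) -> str:
    """Choose a reply for the most probable tag, or the fallback if confidence is low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return FALLBACK
    return random.choice(responses_by_tag[tags[best]])


demo_tags = ["tourism_history", "tourism_nature"]
demo_responses = {
    "tourism_history": ["Moldova is also known for its ancient monasteries, including the Căpriana Monastery."],
    "tourism_nature": ["For hiking and exploration, Tipova Monastery and Saharna Monastery offer picturesque settings."],
}

print(pick_response([0.1, 5.0], demo_tags, demo_responses))  # confident: nature reply
print(pick_response([0.0, 0.1], demo_tags, demo_responses))  # unsure: fallback reply
```

A confidence threshold of this kind keeps the bot from answering off-topic questions with an unrelated tourism fact. The right value would have to be tuned on real conversations.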


# Conclusions

Using AI and ML is a good approach in developing a chatbot, especially if the bot needs to have a more human-like natural language understanding. These technologies enable the chatbot to process, interpret, and respond to user inputs more accurately, improving user experience and making interactions feel more natural and intuitive.

A simple feedforward neural network that classifies patterns in user queries is a good starting point, but more sophisticated models, such as sequence-to-sequence (Seq2Seq) models, offer greater potential for handling complex interactions. Seq2Seq is a powerful architecture for sequence generation tasks such as machine translation, text summarization, and chatbot dialogue. Its encoder-decoder structure lets it read variable-length input and generate variable-length responses, which suits conversational agents, and it can carry contextual information across multiple turns of a conversation. Adding an attention mechanism lets the decoder focus on the relevant parts of the input sequence while generating each part of the response, which allows for more dynamic and contextually accurate interactions. Combining robust preprocessing with models like Seq2Seq is therefore a natural next step toward conversational systems that give meaningful and accurate responses across a wide range of domains.
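
The attention idea mentioned above can be illustrated with a few lines of plain Python. This is a minimal sketch of dot-product attention over toy encoder states, not code from this lab. The function `attend` and all vectors are invented for the example.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Normalize scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: list[float], encoder_states: list[list[float]]) -> tuple[list[float], list[float]]:
    """Return attention weights and the weighted sum (context) of the encoder states."""
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(len(query))
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one toy vector per input word
query = [2.0, 0.0]                                      # toy decoder state
weights, context = attend(query, encoder_states)
print(weights)
print(context)
```

Input positions whose vectors align with the decoder state receive larger weights, so the context vector is dominated by the words most relevant to the current output step.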