
- This section works through practical text classification tasks in natural language processing (NLP), each highlighting a different approach and the challenges that come with it.
- The tasks covered are gender identification, document classification, part-of-speech tagging, sentence segmentation, spam detection, dialogue act classification, named entity recognition (NER), and language identification.
- Each example shows how text classification techniques can be applied to a real-world problem.



#### 5.1 **Gender Identification**

- **Task Description**:
  - The goal of gender identification is to classify a person's gender based on their name. This can be useful in various applications such as demographic analysis, user profiling, and personalization.
  
- **Feature Engineering**:
  - Simple features, such as the last letter of the name, can be highly predictive. For example, names ending in "a" or "e" are more likely to be associated with females, whereas names ending in "n" or "r" may be more common for males.
  - Other features can include the length of the name, vowel/consonant ratios, or n-grams (character sequences within the name).

- **Challenges**:
  - Some names may be gender-neutral or have different gender associations across cultures. Handling such cases requires incorporating additional cultural or contextual information.
  
- **Example Approach**:
  - Using a Naive Bayes classifier trained on features extracted from a dataset of names labeled with gender. The classifier can predict gender probabilities for new, unseen names based on the features.
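
The Naive Bayes approach described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the training names are an invented toy list (not the NLTK corpus), the only feature is the last letter, and the helper names `last_letter`, `train_naive_bayes`, and `predict_proba` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy training data, invented for illustration (not taken from the NLTK corpus)
TRAIN = [
    ("anna", "female"), ("maria", "female"), ("julie", "female"),
    ("sophie", "female"), ("emma", "female"), ("karen", "female"),
    ("john", "male"), ("peter", "male"), ("oliver", "male"),
    ("martin", "male"), ("george", "male"), ("kevin", "male"),
]

def last_letter(name):
    """Single categorical feature: the final character of the lowercased name."""
    return name.lower()[-1]

def train_naive_bayes(examples):
    """Count labels and, per label, how often each last letter occurs."""
    label_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)
    for name, label in examples:
        feature_counts[label][last_letter(name)] += 1
    return label_counts, feature_counts

def predict_proba(name, label_counts, feature_counts, alphabet_size=26):
    """Return P(label | last letter) using add-one (Laplace) smoothing."""
    total = sum(label_counts.values())
    letter = last_letter(name)
    log_scores = {}
    for label, count in label_counts.items():
        prior = count / total
        likelihood = (feature_counts[label][letter] + 1) / (count + alphabet_size)
        log_scores[label] = math.log(prior) + math.log(likelihood)
    # Turn the log scores into normalized probabilities
    max_log = max(log_scores.values())
    exp_scores = {label: math.exp(s - max_log) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: s / norm for label, s in exp_scores.items()}

label_counts, feature_counts = train_naive_bayes(TRAIN)
probs = predict_proba("Laura", label_counts, feature_counts)
print(probs)
```

With more features (length, first letter, character n-grams), the per-feature likelihoods are simply multiplied together, which is the "naive" independence assumption.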



##### Demonstration

The goal is to classify a person's gender based on their name using a machine learning model. We load the labeled names corpus from NLTK and, instead of the Naive Bayes approach outlined above, build and train a small feed-forward network in PyTorch.

In [None]:
# Step 1: Import required libraries
import nltk
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import random

# Download the NLTK names dataset
nltk.download('names')
from nltk.corpus import names

# Step 2: Load the dataset
# NLTK provides a list of male and female names labeled accordingly
def load_data():
    male_names = [(name.lower(), 0) for name in names.words('male.txt')]  # Label '0' for male
    female_names = [(name.lower(), 1) for name in names.words('female.txt')]  # Label '1' for female

    # Combine the male and female names
    all_names = male_names + female_names
    random.shuffle(all_names)  # Shuffle the dataset

    print(f"Total number of names: {len(all_names)}")
    print(f"Number of male names: {len(male_names)}")
    print(f"Number of female names: {len(female_names)}")
    return all_names

# Load the dataset
all_names = load_data()


[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.


Total number of names: 7944
Number of male names: 2943
Number of female names: 5001


In [None]:
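# Step 3: Feature extraction
# We use three features: the last letter, the length of the name, and the first letter
# Letters are encoded as integers (a=0, ..., z=25); this imposes an artificial ordering on letters,
# so a one-hot encoding is a common alternative
# Other potential features could include vowel/consonant ratios, character n-grams, etc.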

def extract_features(name):
    """Extract features from a given name."""
    name = name.lower()
    last_letter = name[-1]  # Use the last letter as a feature
    length = len(name)  # Use the length of the name as a feature
    first_letter = name[0]  # Use the first letter as a feature
    features = {
        'last_letter': ord(last_letter) - ord('a'),  # Convert to numerical value
        'length': length,
        'first_letter': ord(first_letter) - ord('a')
    }
    return [features['last_letter'], features['length'], features['first_letter']]

# Step 4: Prepare the dataset for training
# Split data into features (X) and labels (y)
X = [extract_features(name) for name, gender in all_names]
y = [gender for name, gender in all_names]

# Convert to PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.size(0)}")
print(f"Test set size: {X_test.size(0)}")


Training set size: 6355
Test set size: 1589


In [None]:
# Step 5: Define the neural network model for gender identification
class GenderClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GenderClassifier, self).__init__()
        # Define the layers
        self.fc1 = nn.Linear(input_dim, hidden_dim)  # First hidden layer
        self.relu = nn.ReLU()  # ReLU activation function
        self.fc2 = nn.Linear(hidden_dim, output_dim)  # Output layer

    def forward(self, x):
        # Forward pass
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Step 6: Set up the model parameters
input_dim = 3  # Number of features (last letter, length, first letter)
hidden_dim = 8  # Number of neurons in the hidden layer (can be tuned)
output_dim = 2  # Number of output classes (male and female)

# Initialize the model
model = GenderClassifier(input_dim, hidden_dim, output_dim)

print(model)


GenderClassifier(
  (fc1): Linear(in_features=3, out_features=8, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=8, out_features=2, bias=True)
)


In [None]:
# Step 7: Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()  # Suitable for classification
optimizer = optim.Adam(model.parameters(), lr=0.01)  # Learning rate can be tuned

# Step 8: Training loop
num_epochs = 100  # Number of training epochs (can be tuned)
batch_size = 32  # Size of each batch (can be tuned)
train_losses = []

# Training the model
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    permutation = torch.randperm(X_train.size(0))

    epoch_loss = 0.0
    for i in range(0, X_train.size(0), batch_size):
        indices = permutation[i:i + batch_size]
        batch_X, batch_y = X_train[indices], y_train[indices]

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate the batch loss
        epoch_loss += loss.item()

    # Average loss for the epoch
    epoch_loss /= (X_train.size(0) / batch_size)
    train_losses.append(epoch_loss)

    # Print training progress
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss:.4f}")


Epoch [10/100], Loss: 0.5277
Epoch [20/100], Loss: 0.5226
Epoch [30/100], Loss: 0.5220
Epoch [40/100], Loss: 0.5184
Epoch [50/100], Loss: 0.5180
Epoch [60/100], Loss: 0.5175
Epoch [70/100], Loss: 0.5149
Epoch [80/100], Loss: 0.5138
Epoch [90/100], Loss: 0.5136
Epoch [100/100], Loss: 0.5188


In [None]:
# Step 9: Evaluate the model on the test set
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    y_pred = model(X_test)
    _, predicted = torch.max(y_pred, 1)

# Step 10: Calculate evaluation metrics
accuracy = accuracy_score(y_test, predicted)
precision = precision_score(y_test, predicted)
recall = recall_score(y_test, predicted)
f1 = f1_score(y_test, predicted)

# Print the evaluation results
print(f"\nTest Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")



Test Accuracy: 0.7256
Precision: 0.7761
Recall: 0.7976
F1-Score: 0.7867


In [None]:
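# Step 11: Hyperparameter tuning
# Try different combinations of learning rates, hidden dimensions, and number of epochs
# Note: candidates are compared on the test set here, so the best accuracy is an optimistic
# estimate; a separate validation split would give a fairer comparison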

best_accuracy = 0
best_params = None

for lr in [0.001, 0.01, 0.1]:
    for hidden_dim in [4, 8, 16]:
        for num_epochs in [50, 100, 200]:
            # Initialize the model with current parameters
            model = GenderClassifier(input_dim, hidden_dim, output_dim)
            optimizer = optim.Adam(model.parameters(), lr=lr)

            # Train the model with the current settings
            for epoch in range(num_epochs):
                model.train()
                permutation = torch.randperm(X_train.size(0))
                for i in range(0, X_train.size(0), batch_size):
                    indices = permutation[i:i + batch_size]
                    batch_X, batch_y = X_train[indices], y_train[indices]

                    outputs = model(batch_X)
                    loss = criterion(outputs, batch_y)

                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

            # Evaluate the model
            model.eval()
            with torch.no_grad():
                y_pred = model(X_test)
                _, predicted = torch.max(y_pred, 1)

            accuracy = accuracy_score(y_test, predicted)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_params = (lr, hidden_dim, num_epochs)

# Print the best hyperparameters
print(f"\nBest Accuracy: {best_accuracy:.4f}")
print(f"Best Hyperparameters - Learning Rate: {best_params[0]}, Hidden Dimension: {best_params[1]}, Epochs: {best_params[2]}")


KeyboardInterrupt: 

In [None]:
# Step 12: Re-train the model using the best hyperparameters and evaluate it

# Unpack the best hyperparameters
best_lr, best_hidden_dim, best_num_epochs = best_params

# Re-initialize the model with the best hyperparameters
model = GenderClassifier(input_dim, best_hidden_dim, output_dim)
optimizer = optim.Adam(model.parameters(), lr=best_lr)

# Training the model with the best hyperparameters
print(f"\nRe-training the model with the best hyperparameters: Learning Rate={best_lr}, Hidden Dimension={best_hidden_dim}, Epochs={best_num_epochs}")

for epoch in range(best_num_epochs):
    model.train()  # Set the model to training mode
    permutation = torch.randperm(X_train.size(0))

    epoch_loss = 0.0
    for i in range(0, X_train.size(0), batch_size):
        indices = permutation[i:i + batch_size]
        batch_X, batch_y = X_train[indices], y_train[indices]

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate the batch loss
        epoch_loss += loss.item()

    # Average loss for the epoch
    epoch_loss /= (X_train.size(0) / batch_size)

    # Print training progress
    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch + 1}/{best_num_epochs}], Loss: {epoch_loss:.4f}")

# Step 13: Evaluate the re-trained model on the test set
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    y_pred = model(X_test)
    _, predicted = torch.max(y_pred, 1)

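# Step 14: Calculate evaluation metrics
# precision_recall_fscore_support was not imported in Step 1, so import it here
from sklearn.metrics import precision_recall_fscore_support

accuracy = accuracy_score(y_test, predicted)
# Note: average='weighted' averages over both classes, so these values are not
# directly comparable to the binary precision/recall/F1 reported in Step 10
precision, recall, f1, _ = precision_recall_fscore_support(y_test, predicted, average='weighted')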

# Print the final evaluation results
print(f"\nFinal Model Evaluation with Best Hyperparameters:")
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")


#### 5.2 **Document Classification**

- **Task Description**:
  - Document classification involves assigning a predefined category to a document, such as categorizing a movie review as "positive" or "negative" (sentiment analysis).
  - It can also involve topic categorization, where documents are classified into subjects such as "sports," "politics," or "technology."

- **Feature Extraction**:
  - Textual features may include word presence or absence (bag-of-words), term frequency-inverse document frequency (TF-IDF), n-grams, and word embeddings.
  - Additional features can be derived from document structure (e.g., paragraph lengths, headings) or metadata (e.g., author information).

- **Example Approach**:
  - Using a Support Vector Machine (SVM) or logistic regression model with TF-IDF features extracted from the text. For instance, the Movie Reviews Corpus can be used to train a model that classifies reviews as positive or negative based on word frequencies.

- **Challenges**:
  - Handling sarcasm, irony, or ambiguous language can be difficult in sentiment analysis. Additionally, topic categorization may require domain-specific knowledge or large labeled datasets.
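
To make the TF-IDF features mentioned above concrete, here is a small plain-Python sketch. It assumes a textbook weighting (term frequency divided by document length, multiplied by `log(N / df)`); library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization, so their numbers will differ. The corpus is an invented toy example and `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
docs = [
    "a great film with a great cast",
    "a dull film with a weak plot",
    "great acting in a film with a great plot",
]

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf = count / document length, idf = log(N / df): a textbook variant.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    top = sorted(w.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(doc, "->", [(term, round(value, 3)) for term, value in top])
```

Terms that occur in every document ("a", "film", "with") receive a weight of zero, while terms specific to one review ("cast", "dull", "weak") are weighted highest. That is the property that makes TF-IDF useful as input to an SVM or logistic regression classifier.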



##### Demonstration

In this task, we will classify text documents into predefined categories. Specifically, we will use the NLTK movie reviews dataset to classify movie reviews as either "positive" or "negative", using TF-IDF features as input to a small feed-forward network in PyTorch.

In [None]:
# Step 1: Import required libraries
import nltk
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Download the NLTK movie reviews dataset
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

# Step 2: Load the dataset
# The NLTK movie reviews dataset contains positive and negative movie reviews
def load_data():
    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    print(f"Total number of documents: {len(documents)}")
    return documents

# Load the dataset
all_documents = load_data()


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


Total number of documents: 2000


In [None]:
# Step 3: Data preprocessing and feature extraction
# The corpus is already tokenized: lowercase each token and join the tokens back into a single string for vectorization

def preprocess_documents(documents):
    preprocessed_docs = [" ".join([word.lower() for word in doc]) for doc, _ in documents]
    labels = [1 if label == 'pos' else 0 for _, label in documents]  # 1 for positive, 0 for negative
    return preprocessed_docs, labels

# Preprocess the documents
X, y = preprocess_documents(all_documents)

# Step 4: Convert text data to numerical data using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=2000, stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(X).toarray()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Convert to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)
y_test = torch.tensor(y_test, dtype=torch.long)

print(f"Training set size: {X_train.size(0)}")
print(f"Test set size: {X_test.size(0)}")


Training set size: 1600
Test set size: 400


In [None]:
# Step 5: Define the neural network model for document classification
class DocumentClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(DocumentClassifier, self).__init__()
        # Define the layers
        self.fc1 = nn.Linear(input_dim, hidden_dim)  # First hidden layer
        self.relu = nn.ReLU()  # ReLU activation function
        self.fc2 = nn.Linear(hidden_dim, output_dim)  # Output layer

    def forward(self, x):
        # Forward pass
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Step 6: Set up the model parameters
input_dim = X_train.size(1)  # Number of features (TF-IDF features)
hidden_dim = 128  # Number of neurons in the hidden layer (can be tuned)
output_dim = 2  # Number of output classes (positive and negative)

# Initialize the model
model = DocumentClassifier(input_dim, hidden_dim, output_dim)

print(model)


DocumentClassifier(
  (fc1): Linear(in_features=2000, out_features=128, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=128, out_features=2, bias=True)
)


In [None]:
# Step 7: Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()  # Suitable for classification
optimizer = optim.Adam(model.parameters(), lr=0.01)  # Learning rate can be tuned

# Step 8: Training loop
num_epochs = 20  # Number of training epochs (can be tuned)
batch_size = 32  # Size of each batch (can be tuned)
train_losses = []

# Training the model
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    permutation = torch.randperm(X_train.size(0))

    epoch_loss = 0.0
    for i in range(0, X_train.size(0), batch_size):
        indices = permutation[i:i + batch_size]
        batch_X, batch_y = X_train[indices], y_train[indices]

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate the batch loss
        epoch_loss += loss.item()

    # Average loss for the epoch
    epoch_loss /= (X_train.size(0) / batch_size)
    train_losses.append(epoch_loss)

    # Print training progress
    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss:.4f}")


Epoch [5/20], Loss: 0.0043
Epoch [10/20], Loss: 0.0004
Epoch [15/20], Loss: 0.0001
Epoch [20/20], Loss: 0.0000


In [None]:
# Step 9: Evaluate the model on the test set
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    y_pred = model(X_test)
    _, predicted = torch.max(y_pred, 1)

# Step 10: Calculate evaluation metrics
accuracy = accuracy_score(y_test, predicted)
precision = precision_score(y_test, predicted)
recall = recall_score(y_test, predicted)
f1 = f1_score(y_test, predicted)

# Print the evaluation results
print(f"\nTest Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")



Test Accuracy: 0.8350
Precision: 0.8230
Recall: 0.8557
F1-Score: 0.8390


In [None]:
# Step 11: Hyperparameter tuning
# Try different combinations of learning rates, hidden dimensions, and number of epochs
# Note: the test set is used for model selection here, so the reported best accuracy is optimistic

best_accuracy = 0
best_params = None

for lr in [0.001, 0.01, 0.1]:
    for hidden_dim in [64, 128, 256]:
        for num_epochs in [10, 20, 30]:
            # Initialize the model with current parameters
            model = DocumentClassifier(input_dim, hidden_dim, output_dim)
            optimizer = optim.Adam(model.parameters(), lr=lr)

            # Train the model with the current settings
            for epoch in range(num_epochs):
                model.train()
                permutation = torch.randperm(X_train.size(0))
                for i in range(0, X_train.size(0), batch_size):
                    indices = permutation[i:i + batch_size]
                    batch_X, batch_y = X_train[indices], y_train[indices]

                    outputs = model(batch_X)
                    loss = criterion(outputs, batch_y)

                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

            # Evaluate the model
            model.eval()
            with torch.no_grad():
                y_pred = model(X_test)
                _, predicted = torch.max(y_pred, 1)

            accuracy = accuracy_score(y_test, predicted)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_params = (lr, hidden_dim, num_epochs)

# Print the best hyperparameters
print(f"\nBest Accuracy: {best_accuracy:.4f}")
print(f"Best Hyperparameters - Learning Rate: {best_params[0]}, Hidden Dimension: {best_params[1]}, Epochs: {best_params[2]}")
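
A common refinement to a grid search like the one above is to select hyperparameters on a held-out validation split and keep the test set for the final evaluation only. The sketch below shows the idea in plain Python. `three_way_split`, `grid_search`, and the placeholder `fake_validation_accuracy` are hypothetical names, and the scoring function merely stands in for real model training.

```python
import itertools
import random

def three_way_split(n_samples, val_fraction=0.1, test_fraction=0.2, seed=42):
    """Shuffle indices 0..n_samples-1 and split them into train/validation/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    n_val = round(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

def grid_search(param_grid, evaluate_on_validation):
    """Return (best_params, best_score) over the Cartesian product of the grid values."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate_on_validation(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Placeholder scoring function (hypothetical): a real version would train a model
# on the training indices and return its accuracy on the validation indices.
def fake_validation_accuracy(params):
    return 0.8 - abs(params["lr"] - 0.01) - abs(params["hidden_dim"] - 128) / 1000

train_idx, val_idx, test_idx = three_way_split(2000)
grid = {"lr": [0.001, 0.01, 0.1], "hidden_dim": [64, 128, 256]}
best_params, best_score = grid_search(grid, fake_validation_accuracy)
print(len(train_idx), len(val_idx), len(test_idx))
print(best_params)
```

Once the best configuration is chosen this way, the model is retrained and evaluated on the test indices once, so the reported test accuracy is not influenced by the selection step.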


#### 5.3 **Part-of-Speech Tagging**

- **Task Description**:
  - Part-of-speech (POS) tagging assigns grammatical categories, such as noun, verb, or adjective, to each word in a sentence.
  - It is a fundamental task in NLP that serves as a building block for more complex tasks like parsing, named entity recognition, and machine translation.

- **Techniques**:
  - Features for POS tagging include word suffixes, previous and next word tags, capitalization, and contextual word embeddings.
  - Algorithms such as Hidden Markov Models (HMM), Conditional Random Fields (CRF), and Recurrent Neural Networks (RNN) are commonly used for sequence labeling tasks like POS tagging.

- **Challenges**:
  - Words can have multiple POS tags depending on their context (e.g., "run" can be a noun or a verb). Handling such ambiguities requires incorporating surrounding context effectively.
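
The role of context can be illustrated with a small feature extractor. This is a sketch under stated assumptions: `pos_features` is a hypothetical helper, and it uses the neighbouring words rather than the neighbouring tags, since tags are not known before tagging starts.

```python
def pos_features(sentence, i):
    """Features for the word at position i in a tokenized sentence (a list of strings)."""
    word = sentence[i]
    return {
        "word": word.lower(),
        "suffix2": word[-2:].lower(),
        "suffix3": word[-3:].lower(),
        "is_capitalized": word[0].isupper(),
        "prev_word": sentence[i - 1].lower() if i > 0 else "<START>",
        "next_word": sentence[i + 1].lower() if i < len(sentence) - 1 else "<END>",
    }

noun_use = "She went for a run".split()
verb_use = "They run every morning".split()

noun_features = pos_features(noun_use, 4)  # "run" after the determiner "a"
verb_features = pos_features(verb_use, 1)  # "run" after the pronoun "they"
print(noun_features)
print(verb_features)
```

Both calls describe the same word form, so a classifier that only sees `word` and the suffixes cannot separate the two uses. The `prev_word` and `next_word` features differ, which gives a classifier or sequence model the signal it needs to disambiguate.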



##### Demonstration

The goal is to classify each word in a sentence into its corresponding part of speech (e.g., noun, verb, adjective).

In [None]:
# Step 1: Import required libraries
import nltk
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from collections import Counter

# Download the NLTK treebank corpus
nltk.download('treebank')
nltk.download('universal_tagset')
from nltk.corpus import treebank

# Step 2: Load the dataset
# The NLTK treebank corpus provides sentences tagged with POS tags
def load_data():
    # Use the universal tagset to simplify the number of tags
    tagged_sentences = treebank.tagged_sents(tagset='universal')
    print(f"Total number of sentences: {len(tagged_sentences)}")
    return tagged_sentences

# Load the dataset
all_sentences = load_data()


[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


Total number of sentences: 3914


In [None]:
# Step 3: Data preprocessing and feature extraction
# Lowercase the words and separate each tagged sentence into a word list and a tag list
def preprocess_data(sentences):
    # Extract words and tags from the sentences
    words = [[word.lower() for word, _ in sentence] for sentence in sentences]
    tags = [[tag for _, tag in sentence] for sentence in sentences]

    return words, tags

# Preprocess the data
all_words, all_tags = preprocess_data(all_sentences)

# Step 4: Build a vocabulary of words and a tagset
# Create a word-to-index mapping and a tag-to-index mapping
word_counts = Counter(word for sentence in all_words for word in sentence)
vocab = [word for word, freq in word_counts.items() if freq > 1]  # Filter out rare words
word_to_idx = {word: idx + 2 for idx, word in enumerate(vocab)}  # Start indexing from 2
word_to_idx['<PAD>'] = 0  # Padding index
word_to_idx['<UNK>'] = 1  # Unknown word index

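# Create a tag-to-index mapping
# Index 0 is reserved for padding so that no real tag collides with the padding value
# used in Step 6 and ignored by the loss and the evaluation; sorting makes the mapping reproducible
tag_to_idx = {tag: idx + 1 for idx, tag in enumerate(sorted(set(tag for tags in all_tags for tag in tags)))}
tag_to_idx['<PAD>'] = 0  # Padding index (adds one entry, so the output layer has one extra unit)

print(f"Vocabulary size: {len(word_to_idx)}")
print(f"Number of unique tags: {len(tag_to_idx) - 1}")  # Exclude <PAD>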

# Step 5: Convert sentences and tags to sequences of indices
def encode_sequences(words, tags, word_to_idx, tag_to_idx):
    encoded_words = [[word_to_idx.get(word, word_to_idx['<UNK>']) for word in sentence] for sentence in words]
    encoded_tags = [[tag_to_idx[tag] for tag in sentence] for sentence in tags]
    return encoded_words, encoded_tags

encoded_words, encoded_tags = encode_sequences(all_words, all_tags, word_to_idx, tag_to_idx)

# Step 6: Pad the sequences to ensure uniform length
def pad_sequences(sequences, max_len, padding_value=0):
    return [seq + [padding_value] * (max_len - len(seq)) if len(seq) < max_len else seq[:max_len] for seq in sequences]

max_len = max(len(sentence) for sentence in encoded_words)  # Maximum sequence length
padded_words = pad_sequences(encoded_words, max_len)
padded_tags = pad_sequences(encoded_tags, max_len)

# Convert to PyTorch tensors
X = torch.tensor(padded_words, dtype=torch.long)
y = torch.tensor(padded_tags, dtype=torch.long)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.size(0)}")
print(f"Test set size: {X_test.size(0)}")


Vocabulary size: 5602
Number of unique tags: 12
Training set size: 3131
Test set size: 783
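
Padding only works cleanly if the padding value never doubles as a real label. The sketch below, in plain Python with hypothetical helper names (`build_index`, `pad`, `masked_accuracy`), reserves index 0 for `<PAD>` in a tag mapping and skips padded positions when computing accuracy. It illustrates the idea on a toy sentence and is not tied to the treebank data.

```python
PAD = 0  # index reserved for padding

def build_index(items, reserved=("<PAD>",)):
    """Map reserved symbols to the first indices, then the sorted items after them."""
    index = {symbol: i for i, symbol in enumerate(reserved)}
    for item in sorted(set(items)):
        index[item] = len(index)
    return index

def pad(sequence, max_len, padding_value=PAD):
    """Right-pad (or truncate) a list of indices to exactly max_len entries."""
    return (sequence + [padding_value] * (max_len - len(sequence)))[:max_len]

def masked_accuracy(gold, predicted, padding_value=PAD):
    """Token accuracy that skips positions where the gold label is padding."""
    pairs = [(g, p) for g, p in zip(gold, predicted) if g != padding_value]
    return sum(g == p for g, p in pairs) / len(pairs)

tag_index = build_index(["NOUN", "VERB", "DET"])
gold = pad([tag_index["DET"], tag_index["NOUN"], tag_index["VERB"]], max_len=5)
# Predictions for all five positions; the last two fall on padding and are ignored
pred = [tag_index["DET"], tag_index["NOUN"], tag_index["NOUN"],
        tag_index["VERB"], tag_index["DET"]]
accuracy = masked_accuracy(gold, pred)
print(tag_index)
print(gold)
print(accuracy)
```

Because no real tag maps to 0, masking on `gold != PAD` removes only padded positions. If a real tag shared index 0 with the padding, every token carrying that tag would silently drop out of both the loss and the metrics.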


In [None]:
# Step 7: Define the neural network model for POS tagging
class POSTagger(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(POSTagger, self).__init__()
        # Define the layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)  # Embedding layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)  # LSTM layer
        self.fc = nn.Linear(hidden_dim, output_dim)  # Output layer

    def forward(self, x):
        # Forward pass
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        output = self.fc(lstm_out)
        return output

# Step 8: Set up the model parameters
vocab_size = len(word_to_idx)
embedding_dim = 100  # Size of the embedding vectors (can be tuned)
hidden_dim = 128  # Number of neurons in the hidden layer (can be tuned)
output_dim = len(tag_to_idx)  # Number of POS tags

# Initialize the model
model = POSTagger(vocab_size, embedding_dim, hidden_dim, output_dim)

print(model)


POSTagger(
  (embedding): Embedding(5602, 100, padding_idx=0)
  (lstm): LSTM(100, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=12, bias=True)
)


In [None]:
# Step 9: Define the loss function and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=0)  # Ignore padding index during loss calculation
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Learning rate can be tuned

# Step 10: Training loop
num_epochs = 10  # Number of training epochs (can be tuned)
batch_size = 32  # Size of each batch (can be tuned)

# Training the model
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    permutation = torch.randperm(X_train.size(0))

    epoch_loss = 0.0
    for i in range(0, X_train.size(0), batch_size):
        indices = permutation[i:i + batch_size]
        batch_X, batch_y = X_train[indices], y_train[indices]

        # Forward pass
        outputs = model(batch_X)
        outputs = outputs.view(-1, outputs.shape[2])  # Reshape for loss calculation
        batch_y = batch_y.view(-1)  # Flatten labels

        loss = criterion(outputs, batch_y)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate the batch loss
        epoch_loss += loss.item()

    # Average loss for the epoch
    epoch_loss /= (X_train.size(0) / batch_size)

    # Print training progress
    if (epoch + 1) % 2 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss:.4f}")


Epoch [2/10], Loss: 0.4794
Epoch [4/10], Loss: 0.2238
Epoch [6/10], Loss: 0.1334
Epoch [8/10], Loss: 0.0848
Epoch [10/10], Loss: 0.0542


In [None]:
# Step 11: Evaluate the model on the test set
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    y_pred = model(X_test)
    y_pred = torch.argmax(y_pred, dim=2)

# Step 12: Calculate evaluation metrics
# Convert predictions and labels back to their original shape
y_pred = y_pred.view(-1).cpu().numpy()
y_true = y_test.view(-1).cpu().numpy()

# Filter out padding indices for evaluation
valid_indices = y_true != 0
y_true = y_true[valid_indices]
y_pred = y_pred[valid_indices]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')

# Print the evaluation results
print(f"\nTest Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")


In [None]:
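# Step 13: Hyperparameter tuning
# Try different combinations of learning rates, embedding dimensions, hidden dimensions, and number of epochs
# Note: this grid trains 2 * 3 * 3 * 3 = 54 models and selects on the test set, so it is slow
# and the reported best accuracy is optimistic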

best_accuracy = 0
best_params = None

for lr in [0.001, 0.01]:
    for embedding_dim in [50, 100, 200]:
        for hidden_dim in [64, 128, 256]:
            for num_epochs in [5, 10, 20]:
                # Initialize the model with current parameters
                model = POSTagger(vocab_size, embedding_dim, hidden_dim, output_dim)
                optimizer = optim.Adam(model.parameters(), lr=lr)

                # Train the model with the current settings
                for epoch in range(num_epochs):
                    model.train()
                    permutation = torch.randperm(X_train.size(0))
                    for i in range(0, X_train.size(0), batch_size):
                        indices = permutation[i:i + batch_size]
                        batch_X, batch_y = X_train[indices], y_train[indices]

                        outputs = model(batch_X)
                        outputs = outputs.view(-1, outputs.shape[2])
                        batch_y = batch_y.view(-1)

                        loss = criterion(outputs, batch_y)

                        optimizer.zero_grad()
                        loss.backward()
                        optimizer.step()

                # Evaluate the model
                model.eval()
                with torch.no_grad():
                    y_pred = model(X_test)
                    y_pred = torch.argmax(y_pred, dim=2)

                y_pred = y_pred.view(-1).cpu().numpy()
                y_true = y_test.view(-1).cpu().numpy()

                valid_indices = y_true != 0
                y_true = y_true[valid_indices]
                y_pred = y_pred[valid_indices]

                accuracy = accuracy_score(y_true, y_pred)
                if accuracy > best_accuracy:
                    best_accuracy = accuracy
                    best_params = (lr, embedding_dim, hidden_dim, num_epochs)

# Print the best hyperparameters
print(f"\nBest Accuracy: {best_accuracy:.4f}")
print(f"Best Hyperparameters - Learning Rate: {best_params[0]}, Embedding Dimension: {best_params[1]}, Hidden Dimension: {best_params[2]}, Epochs: {best_params[3]}")


#### 5.4 **Sentence Segmentation**

- **Task Description**:
  - Sentence segmentation, also known as sentence boundary detection, involves splitting a block of text into individual sentences.
  - It is often used as a preprocessing step for tasks like text summarization, machine translation, and named entity recognition.

- **Approaches**:
  - Rule-based methods can identify sentence boundaries based on punctuation marks (e.g., periods, question marks). However, this may not work well for abbreviations (e.g., "Dr.") or other special cases.
  - Machine learning-based methods can use features such as word capitalization, surrounding words, and punctuation to detect boundaries more accurately.

- **Challenges**:
  - Text may contain complex structures, such as quoted speech or lists, where traditional rules for sentence segmentation may fail.
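
The weakness of purely punctuation-based rules, and one simple mitigation, can be sketched as follows. The abbreviation list is a small hand-picked assumption, and `naive_split` and `abbreviation_aware_split` are hypothetical helper names rather than functions from any library.

```python
import re

# Small, hand-picked abbreviation list: an assumption for this sketch, not an exhaustive resource
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc."}

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def abbreviation_aware_split(text):
    """Same rule, but merge a piece into the next one when it ends with a known abbreviation."""
    sentences, buffer = [], ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}".strip()
        last_token = buffer.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # not a real boundary; keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

text = "Dr. Smith arrived late. Was the meeting over? It was not."
naive_result = naive_split(text)
aware_result = abbreviation_aware_split(text)
print(naive_result)
print(aware_result)
```

The naive rule breaks the text after "Dr.", while the abbreviation-aware version keeps "Dr. Smith arrived late." together. The fix is still brittle: a sentence that genuinely ends in "etc." would be merged with the next one, which is the kind of case that motivates learned boundary detectors.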



##### Demonstration

The goal is to split a block of text into individual sentences. The demonstration below frames boundary detection as binary classification: each prefix of a sentence is one sample, labeled 1 if it ends at a sentence boundary and 0 otherwise.

In [None]:
# Step 1: Import required libraries
import nltk
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Download the NLTK datasets required for sentence tokenization
nltk.download('punkt')
nltk.download('punkt_tab')  # Needed by sent_tokenize in newer NLTK releases
nltk.download('gutenberg')
from nltk.corpus import gutenberg
from nltk.tokenize import PunktSentenceTokenizer, sent_tokenize

# Step 2: Load the dataset
# We will use the Gutenberg corpus from NLTK, which contains a collection of books
def load_data():
    # Select a few books for the dataset
    text = gutenberg.raw('austen-emma.txt') + gutenberg.raw('austen-persuasion.txt') + gutenberg.raw('austen-sense.txt')
    print(f"Total number of characters in the dataset: {len(text)}")
    return text

# Load the dataset
raw_text = load_data()


In [None]:
# Step 3: Data preprocessing
# We will split the text into sentences using the NLTK sent_tokenize function and then create training data

def create_dataset(text):
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    dataset = []

    # Generate one labeled sample per character position in each sentence
    # Label '1' if the prefix ends on the final character of the sentence, otherwise '0'
    for sentence in sentences:
        sentence = sentence.strip()
        last_index = len(sentence) - 1
        for i in range(len(sentence)):
            label = 1 if i == last_index else 0  # Only the last character is a boundary
            dataset.append((sentence[:i + 1], label))

    print(f"Total number of samples: {len(dataset)}")
    return dataset

# Create the dataset
all_data = create_dataset(raw_text)


In [None]:
# Step 4: Feature extraction
# Convert text sequences into numerical features

def extract_features(sequence):
    # Features: the code point of the last character and the length of the sequence
    # Punctuation is kept rather than mapped to 0, because sentence-final marks
    # such as '.', '?' and '!' are the main signal for a boundary
    last_char = ord(sequence[-1])
    length = len(sequence)
    features = [last_char, length]
    return features

# Prepare the dataset for training
X = [extract_features(seq) for seq, label in all_data]
y = [label for _, label in all_data]

# Convert to PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.size(0)}")
print(f"Test set size: {X_test.size(0)}")


In [None]:
# Step 5: Define the neural network model for sentence segmentation
class SentenceSegmenter(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(SentenceSegmenter, self).__init__()
        # Define the layers
        self.fc1 = nn.Linear(input_dim, hidden_dim)  # First hidden layer
        self.relu = nn.ReLU()  # ReLU activation function
        self.fc2 = nn.Linear(hidden_dim, output_dim)  # Output layer

    def forward(self, x):
        # Forward pass
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Step 6: Set up the model parameters
input_dim = 2  # Number of features (ASCII value of the last character, length)
hidden_dim = 16  # Number of neurons in the hidden layer (can be tuned)
output_dim = 2  # Number of output classes (boundary or not)

# Initialize the model
model = SentenceSegmenter(input_dim, hidden_dim, output_dim)

print(model)


In [None]:
# Step 7: Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()  # Suitable for classification
optimizer = optim.Adam(model.parameters(), lr=0.01)  # Learning rate can be tuned

# Step 8: Training loop
num_epochs = 20  # Number of training epochs (can be tuned)
batch_size = 32  # Size of each batch (can be tuned)

# Training the model
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    permutation = torch.randperm(X_train.size(0))

    epoch_loss = 0.0
    for i in range(0, X_train.size(0), batch_size):
        indices = permutation[i:i + batch_size]
        batch_X, batch_y = X_train[indices], y_train[indices]

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate the batch loss
        epoch_loss += loss.item()

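    # Average loss for the epoch: divide by the actual number of batches,
    # counting the final partial batch as one batch
    num_batches = (X_train.size(0) + batch_size - 1) // batch_size
    epoch_loss /= num_batches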

    # Print training progress
    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss:.4f}")


In [None]:
# Step 9: Evaluate the model on the test set
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    y_pred = model(X_test)
    _, predicted = torch.max(y_pred, 1)

# Step 10: Calculate evaluation metrics
accuracy = accuracy_score(y_test, predicted)
precision, recall, f1, _ = precision_recall_fscore_support(y_test, predicted, average='binary')

# Print the evaluation results
print(f"\nTest Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")


In [None]:
# Step 11: Hyperparameter tuning
# Try different combinations of learning rates, hidden dimensions, and number of epochs

best_accuracy = 0
best_params = None

for lr in [0.001, 0.01, 0.1]:
    for hidden_dim in [8, 16, 32]:
        for num_epochs in [10, 20, 30]:
            # Initialize the model with current parameters
            model = SentenceSegmenter(input_dim, hidden_dim, output_dim)
            optimizer = optim.Adam(model.parameters(), lr=lr)

            # Train the model with the current settings
            for epoch in range(num_epochs):
                model.train()
                permutation = torch.randperm(X_train.size(0))
                for i in range(0, X_train.size(0), batch_size):
                    indices = permutation[i:i + batch_size]
                    batch_X, batch_y = X_train[indices], y_train[indices]

                    outputs = model(batch_X)
                    loss = criterion(outputs, batch_y)

                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

            # Evaluate the model
            model.eval()
            with torch.no_grad():
                y_pred = model(X_test)
                _, predicted = torch.max(y_pred, 1)

            accuracy = accuracy_score(y_test, predicted)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_params = (lr, hidden_dim, num_epochs)

# Print the best hyperparameters
print(f"\nBest Accuracy: {best_accuracy:.4f}")
print(f"Best Hyperparameters - Learning Rate: {best_params[0]}, Hidden Dimension: {best_params[1]}, Epochs: {best_params[2]}")

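The grid search above selects hyperparameters by their accuracy on the test set, so the reported best accuracy is an optimistic estimate. A common remedy is to hold out a separate validation set for model selection and to touch the test set only once at the end. The sketch below shows one way to produce such a split; the function name `three_way_split` and the 60/20/20 proportions are assumptions for illustration, not part of the notebook.

```python
import random

def three_way_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Return disjoint index lists for the training, validation and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # Seeded shuffle for reproducibility
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With such a split, each configuration in the grid would be compared on the validation indices, and only the selected configuration would be evaluated on the test indices.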

#### 5.5 **Spam Detection**

- **Task Description**:
  - Spam detection aims to classify email messages or other text content as "spam" or "not spam." It is widely used in email filtering, content moderation, and social media monitoring.

- **Features**:
  - Common features include word frequencies (e.g., presence of specific words like "free" or "win"), email metadata (e.g., sender's email address), and language characteristics (e.g., excessive use of capital letters).
  - Advanced techniques may involve using embeddings from language models such as BERT to capture deeper semantic features.

- **Challenges**:
  - Spammers frequently change their strategies to bypass detection, requiring models to be updated and retrained regularly. Handling multilingual or obfuscated spam messages can also pose difficulties.

- **Example Approach**:
  - A Naive Bayes or logistic regression classifier trained on email datasets with labeled spam and non-spam examples. Features like the presence of specific keywords and email sender information are used to classify the messages.
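As a rough illustration of the Naive Bayes approach, the sketch below implements a multinomial Naive Bayes classifier with add-one smoothing in plain Python. The class name `TinyNaiveBayes`, the whitespace tokenizer, and the six training messages are all made up for this example; a practical system would use a library implementation and a real labeled corpus.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one (Laplace) smoothing."""

    def fit(self, messages, labels):
        self.class_counts = Counter(labels)          # Documents per class
        self.word_counts = defaultdict(Counter)      # Word counts per class
        for message, label in zip(messages, labels):
            self.word_counts[label].update(message.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, message):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # Log prior
            total_words = sum(self.word_counts[label].values())
            for word in message.lower().split():
                if word not in self.vocab:
                    continue  # Ignore words never seen in training
                count = self.word_counts[label][word]
                # Smoothed log likelihood of the word given the class
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, message):
        scores = self.log_scores(message)
        return max(scores, key=scores.get)

# Made-up training messages, for illustration only
train_messages = [
    "win a free prize now",
    "free entry win cash",
    "claim your free reward",
    "are we meeting for lunch today",
    "see you at the meeting tomorrow",
    "call me when you get home",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

clf = TinyNaiveBayes().fit(train_messages, train_labels)
print(clf.predict("free cash prize"))
print(clf.predict("lunch meeting tomorrow"))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing term keeps a single unseen word from driving a class probability to zero.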



##### Demonstration

The goal of this task is to classify text messages as "spam" or "not spam." We will use a dataset of SMS messages for training and testing the spam classifier.

In [None]:
# Step 1: Import required libraries
import nltk
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd

# Download NLTK stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

# Step 2: Load the dataset
# We will use a publicly available dataset containing labeled SMS messages
def load_data():
    # Dataset URL: https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
    # For this demonstration, assume the dataset is already downloaded and extracted
    data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])
    print(f"Total number of messages: {len(data)}")
    return data

# Load the dataset
data = load_data()


In [None]:
# Step 3: Data preprocessing
# Convert labels to binary values: 'spam' -> 1, 'ham' (not spam) -> 0
data['label'] = data['label'].map({'ham': 0, 'spam': 1})

# Remove stopwords and perform basic text cleaning
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase and split into words
    words = text.lower().split()
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Join the words back into a single string
    return ' '.join(words)

# Apply preprocessing to each message
data['message'] = data['message'].apply(preprocess_text)

# Step 4: Convert text data to numerical data using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=2000)
X_tfidf = tfidf_vectorizer.fit_transform(data['message']).toarray()

# Split the dataset into features (X) and labels (y)
X = X_tfidf
y = data['label'].values

# Convert to PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.size(0)}")
print(f"Test set size: {X_test.size(0)}")


In [None]:
# Step 5: Define the neural network model for spam detection
class SpamClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(SpamClassifier, self).__init__()
        # Define the layers
        self.fc1 = nn.Linear(input_dim, hidden_dim)  # First hidden layer
        self.relu = nn.ReLU()  # ReLU activation function
        self.fc2 = nn.Linear(hidden_dim, output_dim)  # Output layer

    def forward(self, x):
        # Forward pass
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Step 6: Set up the model parameters
input_dim = X_train.size(1)  # Number of features (TF-IDF features)
hidden_dim = 64  # Number of neurons in the hidden layer (can be tuned)
output_dim = 2  # Number of output classes (spam and not spam)

# Initialize the model
model = SpamClassifier(input_dim, hidden_dim, output_dim)

print(model)


In [None]:
# Step 7: Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()  # Suitable for classification
optimizer = optim.Adam(model.parameters(), lr=0.01)  # Learning rate can be tuned

# Step 8: Training loop
num_epochs = 30  # Number of training epochs (can be tuned)
batch_size = 32  # Size of each batch (can be tuned)

# Training the model
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    permutation = torch.randperm(X_train.size(0))

    epoch_loss = 0.0
    for i in range(0, X_train.size(0), batch_size):
        indices = permutation[i:i + batch_size]
        batch_X, batch_y = X_train[indices], y_train[indices]

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate the batch loss
        epoch_loss += loss.item()

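    # Average loss for the epoch: divide by the actual number of batches,
    # counting the final partial batch as one batch
    num_batches = (X_train.size(0) + batch_size - 1) // batch_size
    epoch_loss /= num_batches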

    # Print training progress
    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss:.4f}")


In [None]:
# Step 9: Evaluate the model on the test set
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    y_pred = model(X_test)
    _, predicted = torch.max(y_pred, 1)

# Step 10: Calculate evaluation metrics
accuracy = accuracy_score(y_test, predicted)
precision = precision_score(y_test, predicted)
recall = recall_score(y_test, predicted)
f1 = f1_score(y_test, predicted)

# Print the evaluation results
print(f"\nTest Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")


In [None]:
# Step 11: Hyperparameter tuning
# Try different combinations of learning rates, hidden dimensions, and number of epochs

best_accuracy = 0
best_params = None

for lr in [0.001, 0.01, 0.1]:
    for hidden_dim in [32, 64, 128]:
        for num_epochs in [20, 30, 40]:
            # Initialize the model with current parameters
            model = SpamClassifier(input_dim, hidden_dim, output_dim)
            optimizer = optim.Adam(model.parameters(), lr=lr)

            # Train the model with the current settings
            for epoch in range(num_epochs):
                model.train()
                permutation = torch.randperm(X_train.size(0))
                for i in range(0, X_train.size(0), batch_size):
                    indices = permutation[i:i + batch_size]
                    batch_X, batch_y = X_train[indices], y_train[indices]

                    outputs = model(batch_X)
                    loss = criterion(outputs, batch_y)

                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

            # Evaluate the model
            model.eval()
            with torch.no_grad():
                y_pred = model(X_test)
                _, predicted = torch.max(y_pred, 1)

            accuracy = accuracy_score(y_test, predicted)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_params = (lr, hidden_dim, num_epochs)

# Print the best hyperparameters
print(f"\nBest Accuracy: {best_accuracy:.4f}")
print(f"Best Hyperparameters - Learning Rate: {best_params[0]}, Hidden Dimension: {best_params[1]}, Epochs: {best_params[2]}")


#### 5.6 **Dialogue Act Classification**

- **Task Description**:
  - Dialogue act classification involves categorizing each line in a conversation into predefined categories, such as "question," "statement," or "command."
  - It is useful in conversational agents, customer support systems, and dialogue systems for understanding user intentions.

- **Feature Engineering**:
  - Features can include the presence of question words (e.g., "what," "how"), sentence structure, and context from previous dialogue turns.
  - Sequential models like RNNs or Transformers can capture dependencies across multiple dialogue turns.

- **Challenges**:
  - Dialogue contexts can vary significantly between different conversations, making it challenging to generalize. Additionally, the same sentence structure may convey different meanings based on context.
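The hand-crafted features listed above can be sketched as a small feature extractor. The function name `dialogue_features`, the `QUESTION_WORDS` set, and the feature names are assumptions chosen for this illustration; the demonstration below uses TF-IDF vectors instead.

```python
# Hypothetical list of question words used as a lexical cue
QUESTION_WORDS = {"what", "how", "why", "who", "where", "when", "which"}

def dialogue_features(utterance, previous_act=None):
    """Build a feature dictionary for one dialogue turn."""
    tokens = utterance.lower().rstrip("?!.").split()
    first = tokens[0] if tokens else ""
    return {
        "starts_with_question_word": first in QUESTION_WORDS,
        "ends_with_question_mark": utterance.strip().endswith("?"),
        "num_tokens": len(tokens),
        "previous_act": previous_act,  # Context from the preceding turn
    }

print(dialogue_features("How do I reset my password?"))
print(dialogue_features("Yes", previous_act="question"))
```

The `previous_act` feature is what lets a classifier separate a bare "Yes" that answers a question from one that merely acknowledges a statement. Feature dictionaries of this shape can be paired with labels and passed to a dictionary-based classifier such as NLTK's `NaiveBayesClassifier`.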



##### Demonstration

The goal is to classify each line in a conversation into predefined categories, such as "question," "statement," or "command." This task helps understand the intention behind a spoken or written message and is useful in conversational agents, customer support systems, and dialogue systems.

In [None]:
# Step 1: Import required libraries
import nltk
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Download NLTK's dialogue act data
nltk.download('nps_chat')
from nltk.corpus import nps_chat

# Step 2: Load the dataset
# The NPS Chat Corpus is a collection of chatroom dialogues labeled with dialogue acts
def load_data():
    posts = nps_chat.xml_posts()
    # Extract text and their corresponding dialogue act labels
    data = [(post.text, post.get('class')) for post in posts]
    print(f"Total number of dialogue lines: {len(data)}")
    return data

# Load the dataset
all_data = load_data()


In [None]:
# Step 3: Data preprocessing
# Extract the text (X) and labels (y) from the dataset
X_raw = [text for text, label in all_data]
y_raw = [label for _, label in all_data]

# Encode the labels as integers
# Sorting makes the label-to-index mapping reproducible across runs
label_to_idx = {label: idx for idx, label in enumerate(sorted(set(y_raw)))}
idx_to_label = {idx: label for label, idx in label_to_idx.items()}
y_encoded = [label_to_idx[label] for label in y_raw]

print(f"Number of unique dialogue acts: {len(label_to_idx)}")

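# Step 4: Convert text data to numerical data using TF-IDF vectorization
# Stop words are kept: scikit-learn's English stop-word list includes question words
# such as "what" and "how", which are informative for dialogue acts
tfidf_vectorizer = TfidfVectorizer(max_features=1000)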
X_tfidf = tfidf_vectorizer.fit_transform(X_raw).toarray()

# Convert the labels to a PyTorch tensor
y = torch.tensor(y_encoded, dtype=torch.long)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Convert the features to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

print(f"Training set size: {X_train.size(0)}")
print(f"Test set size: {X_test.size(0)}")


In [None]:
# Step 5: Define the neural network model for dialogue act classification
class DialogueActClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(DialogueActClassifier, self).__init__()
        # Define the layers
        self.fc1 = nn.Linear(input_dim, hidden_dim)  # First hidden layer
        self.relu = nn.ReLU()  # ReLU activation function
        self.fc2 = nn.Linear(hidden_dim, output_dim)  # Output layer

    def forward(self, x):
        # Forward pass
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Step 6: Set up the model parameters
input_dim = X_train.size(1)  # Number of features (TF-IDF features)
hidden_dim = 64  # Number of neurons in the hidden layer (can be tuned)
output_dim = len(label_to_idx)  # Number of dialogue act classes

# Initialize the model
model = DialogueActClassifier(input_dim, hidden_dim, output_dim)

print(model)


In [None]:
# Step 7: Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()  # Suitable for multi-class classification
optimizer = optim.Adam(model.parameters(), lr=0.01)  # Learning rate can be tuned

# Step 8: Training loop
num_epochs = 20  # Number of training epochs (can be tuned)
batch_size = 32  # Size of each batch (can be tuned)

# Training the model
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    permutation = torch.randperm(X_train.size(0))

    epoch_loss = 0.0
    for i in range(0, X_train.size(0), batch_size):
        indices = permutation[i:i + batch_size]
        batch_X, batch_y = X_train[indices], y_train[indices]

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate the batch loss
        epoch_loss += loss.item()

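    # Average loss for the epoch: divide by the actual number of batches,
    # counting the final partial batch as one batch
    num_batches = (X_train.size(0) + batch_size - 1) // batch_size
    epoch_loss /= num_batches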

    # Print training progress
    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss:.4f}")


In [None]:
# Step 9: Evaluate the model on the test set
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    y_pred = model(X_test)
    _, predicted = torch.max(y_pred, 1)

# Step 10: Calculate evaluation metrics
accuracy = accuracy_score(y_test, predicted)
precision, recall, f1, _ = precision_recall_fscore_support(y_test, predicted, average='weighted')

# Print the evaluation results
print(f"\nTest Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")


In [None]:
# Step 11: Hyperparameter tuning
# Try different combinations of learning rates, hidden dimensions, and number of epochs

best_accuracy = 0
best_params = None

for lr in [0.001, 0.01, 0.1]:
    for hidden_dim in [32, 64, 128]:
        for num_epochs in [10, 20, 30]:
            # Initialize the model with current parameters
            model = DialogueActClassifier(input_dim, hidden_dim, output_dim)
            optimizer = optim.Adam(model.parameters(), lr=lr)

            # Train the model with the current settings
            for epoch in range(num_epochs):
                model.train()
                permutation = torch.randperm(X_train.size(0))
                for i in range(0, X_train.size(0), batch_size):
                    indices = permutation[i:i + batch_size]
                    batch_X, batch_y = X_train[indices], y_train[indices]

                    outputs = model(batch_X)
                    loss = criterion(outputs, batch_y)

                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

            # Evaluate the model
            model.eval()
            with torch.no_grad():
                y_pred = model(X_test)
                _, predicted = torch.max(y_pred, 1)

            accuracy = accuracy_score(y_test, predicted)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_params = (lr, hidden_dim, num_epochs)

# Print the best hyperparameters
print(f"\nBest Accuracy: {best_accuracy:.4f}")
print(f"Best Hyperparameters - Learning Rate: {best_params[0]}, Hidden Dimension: {best_params[1]}, Epochs: {best_params[2]}")


#### 5.7 **Named Entity Recognition (NER)**

- **Task Description**:
  - NER is the process of identifying and classifying named entities in text, such as people, organizations, locations, dates, and product names.
  - It is widely used in information extraction, knowledge graph construction, and question answering systems.

- **Approaches**:
  - Features for NER include word shape (e.g., capitalization), POS tags, context words, and embeddings from pre-trained language models.
  - Algorithms such as CRF, Bi-LSTM with CRF, and Transformer-based models like BERT are commonly used.

- **Challenges**:
  - Handling entity boundary detection and recognizing entities in non-standard formats (e.g., informal text, social media) can be difficult. Additionally, entities with multiple words (e.g., "New York Times") require special handling.
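Two of the ideas above, word-shape features and multi-word entities, can be illustrated with a short sketch. The helper names `word_shape` and `bio_to_spans` and the example sentence are assumptions for this illustration. The second helper assumes BIO-style tags (`B-` begins an entity, `I-` continues it, `O` is outside any entity).

```python
def word_shape(token):
    """Map characters to X (upper), x (lower), d (digit); keep other characters as-is."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)

def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        # A B- tag, or an I- tag whose type differs from the open entity, starts a new entity
        if tag.startswith("B-") or (tag.startswith("I-") and current_type != tag[2:]):
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)  # Continue the open entity
        else:
            # An "O" tag closes any open entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

tokens = ["She", "reads", "the", "New", "York", "Times", "in", "Paris"]
tags = ["O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
print([word_shape(t) for t in ["Paris", "iPhone", "2024", "U.S."]])
print(bio_to_spans(tokens, tags))
```

Tagging each token and then merging adjacent tags is how a per-token classifier can still return "New York Times" as a single organization rather than three separate words.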



##### Demonstration

The goal of this task is to identify and classify named entities in text into predefined categories such as person names, locations, organizations, dates, etc. NER is a crucial step in information extraction and various NLP applications.

In [None]:
# Step 1: Import required libraries
import nltk
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Download the NLTK datasets required for NER
nltk.download('conll2002')

from nltk.corpus import conll2002

# Step 2: Load the dataset
# The CoNLL 2002 corpus contains labeled named entities in Spanish and Dutch
def load_data():
    # We will use the Spanish dataset for this example
    sentences = conll2002.iob_sents('esp.train')
    print(f"Total number of sentences: {len(sentences)}")
    return sentences

# Load the dataset
all_sentences = load_data()


In [None]:
# Step 3: Data preprocessing
# Extract words and their corresponding named entity tags from the sentences
def preprocess_data(sentences):
    words = [[word.lower() for word, _, _ in sentence] for sentence in sentences]
    tags = [[tag for _, _, tag in sentence] for sentence in sentences]
    return words, tags

# Preprocess the data
all_words, all_tags = preprocess_data(all_sentences)

# Step 4: Build a vocabulary of words and a tagset
# Create a word-to-index mapping and a tag-to-index mapping
from collections import Counter

word_counts = Counter(word for sentence in all_words for word in sentence)
vocab = [word for word, freq in word_counts.items() if freq > 1]  # Filter out rare words
word_to_idx = {word: idx + 2 for idx, word in enumerate(vocab)}  # Start indexing from 2
word_to_idx['<PAD>'] = 0  # Padding index
word_to_idx['<UNK>'] = 1  # Unknown word index

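# Create a tag-to-index mapping
# Index 0 is reserved for padding so that no real tag collides with the value
# used to pad tag sequences, ignored by the loss, and filtered out during evaluation
unique_tags = sorted(set(tag for tags in all_tags for tag in tags))
tag_to_idx = {tag: idx + 1 for idx, tag in enumerate(unique_tags)}
tag_to_idx['<PAD>'] = 0

print(f"Vocabulary size: {len(word_to_idx)}")
print(f"Number of unique tags: {len(unique_tags)}")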

# Step 5: Convert sentences and tags to sequences of indices
def encode_sequences(words, tags, word_to_idx, tag_to_idx):
    encoded_words = [[word_to_idx.get(word, word_to_idx['<UNK>']) for word in sentence] for sentence in words]
    encoded_tags = [[tag_to_idx[tag] for tag in sentence] for sentence in tags]
    return encoded_words, encoded_tags

encoded_words, encoded_tags = encode_sequences(all_words, all_tags, word_to_idx, tag_to_idx)

# Step 6: Pad the sequences to ensure uniform length
def pad_sequences(sequences, max_len, padding_value=0):
    return [seq + [padding_value] * (max_len - len(seq)) if len(seq) < max_len else seq[:max_len] for seq in sequences]

max_len = max(len(sentence) for sentence in encoded_words)  # Maximum sequence length
padded_words = pad_sequences(encoded_words, max_len)
padded_tags = pad_sequences(encoded_tags, max_len)

# Convert to PyTorch tensors
X = torch.tensor(padded_words, dtype=torch.long)
y = torch.tensor(padded_tags, dtype=torch.long)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.size(0)}")
print(f"Test set size: {X_test.size(0)}")


In [None]:
# Step 7: Define the neural network model for NER
class NERTagger(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(NERTagger, self).__init__()
        # Define the layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)  # Embedding layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)  # LSTM layer
        self.fc = nn.Linear(hidden_dim, output_dim)  # Output layer

    def forward(self, x):
        # Forward pass
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        output = self.fc(lstm_out)
        return output

# Step 8: Set up the model parameters
vocab_size = len(word_to_idx)
embedding_dim = 100  # Size of the embedding vectors (can be tuned)
hidden_dim = 128  # Number of neurons in the hidden layer (can be tuned)
output_dim = len(tag_to_idx)  # Number of named entity tags

# Initialize the model
model = NERTagger(vocab_size, embedding_dim, hidden_dim, output_dim)

print(model)


In [None]:
# Step 9: Define the loss function and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=0)  # Ignore padding index during loss calculation
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Learning rate can be tuned

# Step 10: Training loop
num_epochs = 10  # Number of training epochs (can be tuned)
batch_size = 32  # Size of each batch (can be tuned)

# Training the model
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    permutation = torch.randperm(X_train.size(0))

    epoch_loss = 0.0
    for i in range(0, X_train.size(0), batch_size):
        indices = permutation[i:i + batch_size]
        batch_X, batch_y = X_train[indices], y_train[indices]

        # Forward pass
        outputs = model(batch_X)
        outputs = outputs.view(-1, outputs.shape[2])  # Reshape for loss calculation
        batch_y = batch_y.view(-1)  # Flatten labels

        loss = criterion(outputs, batch_y)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate the batch loss
        epoch_loss += loss.item()

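    # Average loss for the epoch: divide by the actual number of batches,
    # counting the final partial batch as one batch
    num_batches = (X_train.size(0) + batch_size - 1) // batch_size
    epoch_loss /= num_batches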

    # Print training progress
    if (epoch + 1) % 2 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss:.4f}")


In [None]:
# Step 11: Evaluate the model on the test set
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    y_pred = model(X_test)
    y_pred = torch.argmax(y_pred, dim=2)

# Step 12: Calculate evaluation metrics
# Convert predictions and labels back to their original shape
y_pred = y_pred.view(-1).cpu().numpy()
y_true = y_test.view(-1).cpu().numpy()

# Filter out padding indices for evaluation
valid_indices = y_true != 0
y_true = y_true[valid_indices]
y_pred = y_pred[valid_indices]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')

# Print the evaluation results
print(f"\nTest Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")


In [None]:
# Step 13: Hyperparameter tuning
# Try different combinations of learning rates, embedding dimensions, hidden dimensions, and number of epochs

best_accuracy = 0
best_params = None

for lr in [0.001, 0.01]:
    for embedding_dim in [50, 100, 200]:
        for hidden_dim in [64, 128, 256]:
            for num_epochs in [5, 10, 20]:
                # Initialize the model with current parameters
                model = NERTagger(vocab_size, embedding_dim, hidden_dim, output_dim)
                optimizer = optim.Adam(model.parameters(), lr=lr)

                # Train the model with the current settings
                for epoch in range(num_epochs):
                    model.train()
                    permutation = torch.randperm(X_train.size(0))
                    for i in range(0, X_train.size(0), batch_size):
                        indices = permutation[i:i + batch_size]
                        batch_X, batch_y = X_train[indices], y_train[indices]

                        outputs = model(batch_X)
                        outputs = outputs.view(-1, outputs.shape[2])
                        batch_y = batch_y.view(-1)

                        loss = criterion(outputs, batch_y)

                        optimizer.zero_grad()
                        loss.backward()
                        optimizer.step()

                # Evaluate the model
                model.eval()
                with torch.no_grad():
                    y_pred = model(X_test)
                    y_pred = torch.argmax(y_pred, dim=2)

                y_pred = y_pred.view(-1).cpu().numpy()
                y_true = y_test.view(-1).cpu().numpy()

                valid_indices = y_true != 0
                y_true = y_true[valid_indices]
                y_pred = y_pred[valid_indices]

                accuracy = accuracy_score(y_true, y_pred)
                if accuracy > best_accuracy:
                    best_accuracy = accuracy
                    best_params = (lr, embedding_dim, hidden_dim, num_epochs)

# Print the best hyperparameters
print(f"\nBest Accuracy: {best_accuracy:.4f}")
print(f"Best Hyperparameters - Learning Rate: {best_params[0]}, Embedding Dimension: {best_params[1]}, Hidden Dimension: {best_params[2]}, Epochs: {best_params[3]}")


In [None]:
# Step 14: Re-train the NER model using the best hyperparameters and evaluate it

# Unpack the best hyperparameters found in Step 13
best_lr, best_embedding_dim, best_hidden_dim, best_num_epochs = best_params

# Re-initialize the model with the best hyperparameters
model = NERTagger(vocab_size, best_embedding_dim, best_hidden_dim, output_dim)
optimizer = optim.Adam(model.parameters(), lr=best_lr)

# Training the model with the best hyperparameters
print(f"\nRe-training the model with the best hyperparameters: Learning Rate={best_lr}, Embedding Dimension={best_embedding_dim}, Hidden Dimension={best_hidden_dim}, Epochs={best_num_epochs}")

for epoch in range(best_num_epochs):
    model.train()  # Set the model to training mode
    permutation = torch.randperm(X_train.size(0))

    epoch_loss = 0.0
    num_batches = 0
    for i in range(0, X_train.size(0), batch_size):
        indices = permutation[i:i + batch_size]
        batch_X, batch_y = X_train[indices], y_train[indices]

        # Forward pass
        outputs = model(batch_X)
        outputs = outputs.view(-1, outputs.shape[2])  # Reshape for loss calculation
        batch_y = batch_y.view(-1)  # Flatten labels
        loss = criterion(outputs, batch_y)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate the batch loss
        epoch_loss += loss.item()
        num_batches += 1

    # Average loss for the epoch
    epoch_loss /= num_batches

    # Print training progress
    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch + 1}/{best_num_epochs}], Loss: {epoch_loss:.4f}")

# Step 15: Evaluate the re-trained model on the test set
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    y_pred = model(X_test)
    y_pred = torch.argmax(y_pred, dim=2)

# Step 16: Calculate evaluation metrics
# Flatten predictions and labels, then filter out padding positions
y_pred = y_pred.view(-1).cpu().numpy()
y_true = y_test.view(-1).cpu().numpy()
valid_indices = y_true != 0
y_true = y_true[valid_indices]
y_pred = y_pred[valid_indices]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')

# Print the final evaluation results
print(f"\nFinal Model Evaluation with Best Hyperparameters:")
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")


#### 5.8 **Language Identification**

- **Task Description**:
  - Language identification involves determining the language of a given text snippet. It is a preliminary step for multilingual text processing.

- **Feature Extraction**:
  - Features for language identification can include character-level n-grams, word-level n-grams, and common stopwords associated with specific languages.
  - Deep learning approaches can leverage embeddings and language model features to enhance accuracy.

- **Challenges**:
  - Short text snippets (e.g., a single word or phrase) may not provide enough information for accurate classification. Handling code-mixing, where multiple languages appear in the same text, adds complexity.
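
As a rough illustration of character-level n-gram features, the sketch below builds a trigram count profile per language and assigns a new snippet to the language with the most similar profile (cosine similarity). The sample sentences are made up for this example and far too small for real use; the function names `char_ngrams`, `cosine`, and `identify` are illustrative.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding with a space at each edge."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Tiny illustrative samples (made up for this sketch, not a real training set)
samples = {
    "English": "the quick brown fox jumps over the lazy dog and then the cat",
    "French": "le renard brun saute par-dessus le chien paresseux et le chat",
    "Spanish": "el zorro marrón salta sobre el perro perezoso y luego el gato",
    "German": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
profiles = {lang: char_ngrams(text) for lang, text in samples.items()}


def identify(text):
    """Return the language whose trigram profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


print(identify("the dog and the cat"))
print(identify("der hund und die katze"))
```

Character n-grams need no tokenizer and pick up spelling patterns such as "the" or "und". A very short snippet produces only a handful of n-grams, which is one reason short inputs are hard to classify.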


##### Demonstration

This demonstration trains a small feed-forward neural network on TF-IDF features to identify the language of text lines drawn from the UDHR corpus (English, French, Spanish, and German).

In [None]:
# Step 1: Import required libraries
import nltk
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import random

# Download the NLTK datasets required for language identification
nltk.download('udhr')

from nltk.corpus import udhr

# Step 2: Load the dataset
# The UDHR (Universal Declaration of Human Rights) corpus contains translations in many languages
def load_data():
    languages = ['English-Latin1', 'French_Francais-Latin1', 'Spanish_Espanol-Latin1', 'German_Deutsch-Latin1']
    sentences = []

    # Load text samples for each language
    for language in languages:
        text = udhr.raw(language)
        # Split the text on newlines and keep every non-empty line as one labeled sample
        sentences.extend([(line.strip(), language.split('-')[0]) for line in text.split('\n') if line.strip()])

    print(f"Total number of sentences: {len(sentences)}")
    return sentences

# Load the dataset
all_data = load_data()


In [None]:
# Step 3: Data preprocessing
# Extract the text (X) and labels (y) from the dataset
X_raw = [text for text, label in all_data]
y_raw = [label for _, label in all_data]

# Encode the labels as integers (sorted so the mapping is the same on every run)
label_to_idx = {label: idx for idx, label in enumerate(sorted(set(y_raw)))}
idx_to_label = {idx: label for label, idx in label_to_idx.items()}
y_encoded = [label_to_idx[label] for label in y_raw]

print(f"Number of unique languages: {len(label_to_idx)}")

# Step 4: Convert text data to numerical data using TF-IDF vectorization
# English stop-word removal is left off: function words are strong cues for telling languages apart
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
X_tfidf = tfidf_vectorizer.fit_transform(X_raw).toarray()

# Convert the labels to a PyTorch tensor
y = torch.tensor(y_encoded, dtype=torch.long)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Convert the features to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

print(f"Training set size: {X_train.size(0)}")
print(f"Test set size: {X_test.size(0)}")


In [None]:
# Step 5: Define the neural network model for language identification
class LanguageIdentifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(LanguageIdentifier, self).__init__()
        # Define the layers
        self.fc1 = nn.Linear(input_dim, hidden_dim)  # First hidden layer
        self.relu = nn.ReLU()  # ReLU activation function
        self.fc2 = nn.Linear(hidden_dim, output_dim)  # Output layer

    def forward(self, x):
        # Forward pass
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Step 6: Set up the model parameters
input_dim = X_train.size(1)  # Number of features (TF-IDF features)
hidden_dim = 64  # Number of neurons in the hidden layer (can be tuned)
output_dim = len(label_to_idx)  # Number of languages

# Initialize the model
model = LanguageIdentifier(input_dim, hidden_dim, output_dim)

print(model)


In [None]:
# Step 7: Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()  # Suitable for multi-class classification
optimizer = optim.Adam(model.parameters(), lr=0.01)  # Learning rate can be tuned

# Step 8: Training loop
num_epochs = 20  # Number of training epochs (can be tuned)
batch_size = 32  # Size of each batch (can be tuned)

# Training the model
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    permutation = torch.randperm(X_train.size(0))

    epoch_loss = 0.0
    for i in range(0, X_train.size(0), batch_size):
        indices = permutation[i:i + batch_size]
        batch_X, batch_y = X_train[indices], y_train[indices]

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate the batch loss
        epoch_loss += loss.item()

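    # Average loss per batch (the last batch may be smaller, so round the batch count up)
    num_batches = (X_train.size(0) + batch_size - 1) // batch_size
    epoch_loss /= num_batches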

    # Print training progress
    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss:.4f}")


In [None]:
# Step 9: Evaluate the model on the test set
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    y_pred = model(X_test)
    _, predicted = torch.max(y_pred, 1)

# Step 10: Calculate evaluation metrics
accuracy = accuracy_score(y_test, predicted)
precision, recall, f1, _ = precision_recall_fscore_support(y_test, predicted, average='weighted')

# Print the evaluation results
print(f"\nTest Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
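
The `average='weighted'` option reports each metric as the mean of the per-class scores, weighted by the number of true examples in each class (its support). A rough plain-Python sketch of that calculation for precision, using made-up labels; `weighted_precision` is an illustrative helper, not part of scikit-learn:

```python
from collections import Counter


def weighted_precision(y_true, y_pred):
    """Support-weighted mean of per-class precision."""
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = 0.0
    for label, n_true in support.items():
        # Precision is treated as 0 when a class is never predicted
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        total += precision * n_true
    return total / len(y_true)


# Made-up labels for illustration
y_true = ["en", "en", "en", "fr", "fr", "de"]
y_pred = ["en", "en", "fr", "fr", "fr", "en"]

print(round(weighted_precision(y_true, y_pred), 4))
```

One consequence of this weighting: for single-label classification, weighted recall always equals accuracy, so the Recall line printed above repeats the accuracy figure.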


In [None]:
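# Step 11: Hyperparameter tuning
# Try different combinations of learning rates, hidden dimensions, and number of epochs.
# Note: configurations are compared on the test set here, so the best accuracy is an optimistic estimate;
# a separate validation split would give a fairer comparison.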

best_accuracy = 0
best_params = None

for lr in [0.001, 0.01, 0.1]:
    for hidden_dim in [32, 64, 128]:
        for num_epochs in [10, 20, 30]:
            # Initialize the model with current parameters
            model = LanguageIdentifier(input_dim, hidden_dim, output_dim)
            optimizer = optim.Adam(model.parameters(), lr=lr)

            # Train the model with the current settings
            for epoch in range(num_epochs):
                model.train()
                permutation = torch.randperm(X_train.size(0))
                for i in range(0, X_train.size(0), batch_size):
                    indices = permutation[i:i + batch_size]
                    batch_X, batch_y = X_train[indices], y_train[indices]

                    outputs = model(batch_X)
                    loss = criterion(outputs, batch_y)

                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

            # Evaluate the model
            model.eval()
            with torch.no_grad():
                y_pred = model(X_test)
                _, predicted = torch.max(y_pred, 1)

            accuracy = accuracy_score(y_test, predicted)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_params = (lr, hidden_dim, num_epochs)

# Print the best hyperparameters
print(f"\nBest Accuracy: {best_accuracy:.4f}")
print(f"Best Hyperparameters - Learning Rate: {best_params[0]}, Hidden Dimension: {best_params[1]}, Epochs: {best_params[2]}")
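
Model selection is commonly done on a held-out validation split, with the test set touched only once at the end. A minimal sketch of that loop structure is shown below. `split_indices` and `train_and_score` are hypothetical helpers, and `train_and_score` returns a made-up score so the sketch runs without a model; a real version would train on the training indices and return validation accuracy.

```python
import itertools
import random


def split_indices(n, val_fraction=0.2, seed=42):
    """Shuffle indices 0..n-1 and split them into train and validation lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = int(n * val_fraction)
    return indices[n_val:], indices[:n_val]


def train_and_score(lr, hidden_dim, num_epochs, train_idx, val_idx):
    """Hypothetical stand-in: a real version would train on train_idx
    and return accuracy measured on val_idx."""
    # Made-up scoring rule so the sketch runs without a model
    return 1.0 - abs(lr - 0.01) - abs(hidden_dim - 64) / 1000 - abs(num_epochs - 20) / 1000


grid = {
    "lr": [0.001, 0.01, 0.1],
    "hidden_dim": [32, 64, 128],
    "num_epochs": [10, 20, 30],
}

# Carve a validation split out of (an assumed) 100 training examples
train_idx, val_idx = split_indices(100)

best_score, best_config = float("-inf"), None
for lr, hidden_dim, num_epochs in itertools.product(*grid.values()):
    score = train_and_score(lr, hidden_dim, num_epochs, train_idx, val_idx)
    if score > best_score:
        best_score, best_config = score, (lr, hidden_dim, num_epochs)

print(best_config)  # configuration that scored highest on the validation split
```

Using `itertools.product` also flattens the three nested loops into one, which keeps the search readable as more hyperparameters are added.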


In [None]:
# Step 12: Re-train the model using the best hyperparameters and evaluate it

# Unpack the best hyperparameters
best_lr, best_hidden_dim, best_num_epochs = best_params

# Re-initialize the model with the best hyperparameters
model = LanguageIdentifier(input_dim, best_hidden_dim, output_dim)
optimizer = optim.Adam(model.parameters(), lr=best_lr)

# Training the model with the best hyperparameters
print(f"\nRe-training the model with the best hyperparameters: Learning Rate={best_lr}, Hidden Dimension={best_hidden_dim}, Epochs={best_num_epochs}")

for epoch in range(best_num_epochs):
    model.train()  # Set the model to training mode
    permutation = torch.randperm(X_train.size(0))

    epoch_loss = 0.0
    for i in range(0, X_train.size(0), batch_size):
        indices = permutation[i:i + batch_size]
        batch_X, batch_y = X_train[indices], y_train[indices]

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate the batch loss
        epoch_loss += loss.item()

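    # Average loss per batch (the last batch may be smaller, so round the batch count up)
    num_batches = (X_train.size(0) + batch_size - 1) // batch_size
    epoch_loss /= num_batches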

    # Print training progress
    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch + 1}/{best_num_epochs}], Loss: {epoch_loss:.4f}")

# Step 13: Evaluate the re-trained model on the test set
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    y_pred = model(X_test)
    _, predicted = torch.max(y_pred, 1)

# Step 14: Calculate evaluation metrics
accuracy = accuracy_score(y_test, predicted)
precision, recall, f1, _ = precision_recall_fscore_support(y_test, predicted, average='weighted')

# Print the final evaluation results
print(f"\nFinal Model Evaluation with Best Hyperparameters:")
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")



#### 5.9 **Transition to the Next Section**

This section has covered practical examples of text classification, highlighting different tasks, feature extraction techniques, challenges, and example approaches. These examples demonstrate the versatility of text classification in solving diverse NLP problems.

The next section, **"Sequence Classification Techniques,"** covers methods for classifying sequences of text, using RNNs, Transformers, and other sequence models for tasks where context and word order are critical. These techniques are particularly useful for POS tagging, NER, and dialogue act classification, all of which were introduced here.