<a href="https://colab.research.google.com/github/cse-teacher/suggestion-mining/blob/main/suggestion_mining_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Suggestion Mining using BERT
Suggestion mining is the task of extracting suggestions from user reviews.

Developed: 11 Feb 2024 \\
Last Update: 11 Feb 2024 \\
Author: Muharram Mansoorizadeh plus various AI tools (Google search, ChatGPT, Gemini, ...)




**BERT for Classification**:

1. **Import libraries**: Import the libraries needed to load the BERT tokenizer and model, process text, and make predictions.
2. **Load the BERT tokenizer and model**: Load the pre-trained `bert-base-uncased` tokenizer and model. Replace `'bert-base-uncased'` with the name of another pre-trained BERT model if desired.
3. **Preprocess text function**: This function performs the following:
   - Tokenizes the text using the BERT tokenizer.
   - Adds the special tokens `[CLS]` and `[SEP]` to the beginning and end of the sequence, respectively.
   - Pads the sequence to a maximum length (`MAX_LEN`) if necessary.
4. **Define example text and label**: Replace the text with the actual text to classify and adjust the label to match your classification categories.
5. **Preprocess text**: Call the `preprocess_text` function to convert the text into the format BERT requires.
6. **Make prediction**: Pass the preprocessed text through the model to obtain predictions.
7. **Get predicted class and probability**: Extract the predicted class index and its corresponding probability from the prediction results.
8. **Print results**: Print the predicted class and its probability.

**Note**:

- This is a basic example and can be customized further for specific tasks such as sentiment analysis or topic classification.
- Install the required libraries before running the code (`transformers` and `torch`, which the code in this notebook uses).
- Adjust `MAX_LEN` based on the maximum sentence length in your dataset.

**Sources**: github.com/JiaYaobo/toxic_detect
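
The preprocessing and prediction steps listed above can be sketched in plain Python. This is only a toy illustration: whitespace splitting stands in for BERT's WordPiece tokenizer, the logits are made up, and the names `preprocess_tokens`, `predict_class`, and the value of `MAX_LEN` are hypothetical.

```python
import math

MAX_LEN = 8  # hypothetical value; choose it from your own data


def preprocess_tokens(tokens, max_len=MAX_LEN):
    """Wrap tokens in [CLS]/[SEP], truncate, pad, and build an attention mask."""
    body = tokens[: max_len - 2]          # reserve two slots for special tokens
    sequence = ["[CLS]"] + body + ["[SEP]"]
    attention_mask = [1] * len(sequence)  # 1 marks a real token
    padding = max_len - len(sequence)
    sequence += ["[PAD]"] * padding       # pad on the right up to max_len
    attention_mask += [0] * padding       # 0 marks padding
    return sequence, attention_mask


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                    # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits):
    """Return the index of the most probable class and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


tokens, mask = preprocess_tokens("please add a dark mode".split())
print(tokens)
print(mask)
print(predict_class([0.2, 1.9]))  # made-up logits for a two-class model
```

In the real pipeline the tokenizer produces integer token ids rather than strings, but the layout (special tokens, right padding, attention mask) follows the same idea.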


## Install Required Packages

In [1]:
#Install required packages and libraries

!apt-get install libenchant-2-2
!pip install emoji
!pip install cleantext
!pip install nltk
!pip install pyenchant
!pip install scikit-learn lightgbm catboost
!pip install gensim
!pip install transformers sentencepiece sacremoses


'apt-get' is not recognized as an internal or external command,
operable program or batch file.
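
The message above shows that `apt-get` is unavailable when the notebook runs outside a Debian-style Linux environment such as Colab. A minimal sketch of a guard, assuming it is acceptable to skip the system-package step on other platforms (the helper name `has_command` is hypothetical):

```python
import shutil


def has_command(name):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


if has_command("apt-get"):
    print("apt-get found: the system package can be installed with it.")
else:
    print("apt-get not found: skipping the system-package step.")
```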




## Import data

Get the required data files from the GitHub repository.

In [2]:
!git clone https://github.com/cse-teacher/suggestion-mining.git

fatal: destination path 'suggestion-mining' already exists and is not an empty directory.
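
The `fatal` message above appears whenever the cell is re-run after the repository has already been cloned. One way to make the cell safe to re-run is to clone only when the target directory is missing or empty. This is a sketch; `needs_clone` and `ensure_repo` are hypothetical helper names.

```python
import os
import subprocess

REPO_URL = "https://github.com/cse-teacher/suggestion-mining.git"
REPO_DIR = "suggestion-mining"


def needs_clone(path):
    """A clone is needed when the path is missing or is an empty directory."""
    return not os.path.isdir(path) or not os.listdir(path)


def ensure_repo(url=REPO_URL, path=REPO_DIR):
    """Clone the repository unless a non-empty copy already exists."""
    if needs_clone(path):
        subprocess.run(["git", "clone", url, path], check=True)
    else:
        print(f"'{path}' already exists; skipping git clone.")
```

Calling `ensure_repo()` in place of the `!git clone` line would then skip the download on later runs.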


## Prepare data

In [3]:
# Read data from input files
import numpy as np
import pandas as pd
import random

#Set default seed:
random.seed(42)

#Main Application
folder     = "./suggestion-mining/data/"
train_file = folder + "V1.4_Training.csv"  # alternatives: "Train_Augmented_03.csv", "Train_processed.csv"
valid_file = folder + "SubtaskA_Trial_Test_Labeled.csv"  # alternative: "validation_processed.csv"
test_file  = folder + "SubtaskA_EvaluationData_labeled.csv"


train_df = pd.read_csv(train_file,
                       encoding_errors='ignore', header=None,
                       names=["id", "sentence", "label"])

valid_df = pd.read_csv(valid_file,
                       encoding_errors='ignore', header=None,
                       names=["id", "sentence", "label"])

test_df  = pd.read_csv(test_file,
                       encoding_errors='ignore', header=None,
                       names=["id", "sentence", "label"])

all_df = pd.concat([train_df, valid_df, test_df], axis=0)


#Get the labels:
y_train_original = train_df['label'].values
y_valid_original = valid_df['label'].values
y_test_original  = test_df['label'].values
y_all_original  = all_df['label'].values
train_size = len(train_df['label'])
valid_size = len(valid_df['label'])
test_size  = len(test_df['label'])



**Preprocessing**

In [4]:
import sys
import re
import nltk
import cleantext
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

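def remove_nonalphanumeric(text):
  # Replace runs of non-word characters with a space, then collapse whitespace.
  # ASCII-only alternative: re.sub(r'[^A-Za-z0-9]+', ' ', text)
  text = re.sub(r'\W+', ' ', text)
  text = re.sub(r'\s+', ' ', text)
  return text

def remove_stopwords_list(tokens):
  # stop_words is the global set defined in the setup code below.
  filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
  return filtered_tokens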

def remove_stopwords(text):
  tokens = word_tokenize(text)
  filtered_tokens = remove_stopwords_list(tokens)
  return ' '.join(filtered_tokens)

#-----------------------------------
# Remove hyperlinks
#
def replace_hyperlinks(text):
  # Match http(s) URLs anywhere in the text, not only at the start of a line.
  text = re.sub(r'https?://\S+', '', text)
  return text

def stem(text):
  tokens = word_tokenize(text.strip())
  tokens_stem =[stemmer.stem(s) for s in tokens]
  return ' '.join(tokens_stem)

#----------------------------------------
# replace_named_entities:
#    Replaces each word or phrase in the input text with its
#    Named Entity Recognition (NER) tag label.
#    Args:
#    text (str): Input text
#
#    Returns:
#    str: Text with named entities replaced by their NER tag labels
#
def replace_named_entities(text):
    # Tokenize the text into words
    words = word_tokenize(text)

    # Tag the words with Part-of-Speech (POS) tags
    tagged_words = pos_tag(words)

    # Perform Named Entity Recognition (NER)
    named_entities = ne_chunk(tagged_words)

    # Replace entities with their NER tag labels
    replaced_text = []
    for entity in named_entities:
        if isinstance(entity, nltk.tree.Tree):
            label = entity.label()
            named_entity_text = " ".join([word for word, tag in entity.leaves()])
            #replaced_text.append(f'<{label}>{named_entity_text}</{label}>')
            replaced_text.append(f'{label}')
            #replaced_text.append('')
        else:
            replaced_text.append(entity[0])

    return " ".join(replaced_text)

# Global setup (stemmer, NLTK resources, stop-word list):
stemmer = SnowballStemmer("english")
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Example usage:
text = "Microsoft should seriously look into getting rid of Syamentc for all these paying stuff"
replaced_text = replace_named_entities(text)
print("Replaced Text:", replaced_text)


Replaced Text: PERSON should seriously look into getting rid of GPE for all these paying stuff


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mmr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\mmr\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\mmr\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\mmr\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mmr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
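
The effect of the regex-based cleaning steps is easiest to see on a short example. The sketch below is independent of the functions above: `strip_urls` and `squeeze_non_word` are hypothetical names, the sample sentence is made up, and the final `.strip()` is an addition for tidier output.

```python
import re


def strip_urls(text):
    """Remove http(s) URLs wherever they occur in the text."""
    return re.sub(r'https?://\S+', '', text)


def squeeze_non_word(text):
    """Replace runs of non-word characters with one space and trim the ends."""
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


sample = "Please add dark-mode!! See https://example.com/ideas for details."
print(squeeze_non_word(strip_urls(sample)))
```

The order matters: if punctuation were removed first, the `://` in the URL would be gone and the URL pattern would no longer match.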


In [6]:
op_replace_hyperlinks      = True
op_remove_nonalphanumeric  = True
op_remove_stopwords        = False
op_replace_named_entities  = False
op_stem                    = False

if op_replace_hyperlinks == True:
  # Remove hyperlinks from every split:
  train_df['sentence']  = train_df['sentence'].apply(replace_hyperlinks)
  test_df['sentence']   = test_df['sentence'].apply(replace_hyperlinks)
  valid_df['sentence']  = valid_df['sentence'].apply(replace_hyperlinks)
  all_df['sentence']    = all_df['sentence'].apply(replace_hyperlinks)

if op_remove_nonalphanumeric == True:
  train_df['sentence'] = train_df['sentence'].apply(remove_nonalphanumeric)
  valid_df['sentence'] = valid_df['sentence'].apply(remove_nonalphanumeric)
  test_df['sentence']  = test_df['sentence'].apply(remove_nonalphanumeric)
  all_df['sentence']   = all_df['sentence'].apply(remove_nonalphanumeric)

if op_replace_named_entities == True:
  train_df['sentence']  = train_df['sentence'].apply(replace_named_entities)
  test_df['sentence']   = test_df['sentence'].apply(replace_named_entities)
  valid_df['sentence']  = valid_df['sentence'].apply(replace_named_entities)
  all_df['sentence']    = all_df['sentence'].apply(replace_named_entities)

if op_remove_stopwords == True:
  train_df['sentence'] = train_df['sentence'].apply(remove_stopwords)
  valid_df['sentence'] = valid_df['sentence'].apply(remove_stopwords)
  test_df['sentence']  = test_df['sentence'].apply(remove_stopwords)
  all_df['sentence']   = all_df['sentence'].apply(remove_stopwords)

if op_stem == True:
  train_df['sentence'] = train_df['sentence'].apply(stem)
  valid_df['sentence'] = valid_df['sentence'].apply(stem)
  test_df['sentence']  = test_df['sentence'].apply(stem)
  all_df['sentence']   = all_df['sentence'].apply(stem)
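
The five `if` blocks above repeat the same pattern for every split. A more compact alternative is to keep the steps in a list of (flag, function) pairs and apply the enabled ones in order. The sketch below uses plain lists of strings and stand-in functions instead of the notebook's DataFrames and cleaning functions; `apply_steps` is a hypothetical helper.

```python
def apply_steps(sentences, steps):
    """Apply each enabled (flag, function) step, in order, to every sentence."""
    for enabled, func in steps:
        if enabled:
            sentences = [func(s) for s in sentences]
    return sentences


# Hypothetical stand-ins for the notebook's cleaning functions.
steps = [
    (True, str.strip),
    (True, str.lower),
    (False, lambda s: s.replace(" ", "_")),  # disabled, so never applied
]

# Made-up sentences standing in for the train/test splits.
splits = {"train": ["  Add Dark Mode "], "test": [" Fix The Login Page"]}
cleaned = {name: apply_steps(rows, steps) for name, rows in splits.items()}
print(cleaned)
```

With pandas, the inner list comprehension would become `df['sentence'].apply(func)` for each DataFrame.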


**BERT-Based Classifier**

In [7]:
import torch
from transformers import BertModel, BertTokenizer
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tqdm import tqdm

# Check if GPU is available and move the model and data to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# Use the predefined splits: train on the training file, evaluate on the test file.
# Labels are 0 for one class and 1 for the other.
train_texts  = train_df['sentence'].tolist()
train_labels = train_df['label'].tolist()
test_texts   = test_df['sentence'].tolist()
test_labels  = test_df['label'].tolist()

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = BertModel.from_pretrained(model_name).to(device)  # Move model to GPU

# Tokenize and encode the training and testing texts
max_length = 128  # Adjust as needed
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length, return_tensors='pt')
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=max_length, return_tensors='pt')

# Convert labels to tensors
train_labels = torch.tensor(train_labels)
test_labels = torch.tensor(test_labels)

# Create TensorDatasets
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], train_labels)
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], test_labels)

# Define DataLoader
batch_size = 4  # Adjust as needed
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Define classification model
class ClassificationModel(torch.nn.Module):
    def __init__(self, bert_model):
        super(ClassificationModel, self).__init__()
        self.bert = bert_model
        self.fc = torch.nn.Linear(self.bert.config.hidden_size, 2)  # Output size is 2 for binary classification

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        logits = self.fc(pooled_output)
        return logits

# Instantiate classification model
classification_model = ClassificationModel(bert_model).to(device)  # Move model to GPU

# Define optimizer and loss function
optimizer = torch.optim.AdamW(classification_model.parameters(), lr=2e-5)
loss_function = torch.nn.CrossEntropyLoss()

# Training loop
num_epochs = 10  # Adjust as needed
classification_model.train()
for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}/{num_epochs}")
    for batch in tqdm(train_loader):
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = classification_model(input_ids, attention_mask)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()

# Evaluation
classification_model.eval()
predictions = []
true_labels = []
for batch in tqdm(test_loader):
    input_ids, attention_mask, labels = batch
    input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

    with torch.no_grad():
        outputs = classification_model(input_ids, attention_mask)
    _, predicted_labels = torch.max(outputs, 1)
    predictions.extend(predicted_labels.cpu().numpy())
    true_labels.extend(labels.cpu().numpy())

#----------------------------------
# Print results per class
#
# Imports needed by print_results:
from datetime import datetime
from sklearn.metrics import precision_score, recall_score, f1_score

def print_results(y_actual, y_pred, description=''):
  # Overall accuracy
  v00 = accuracy_score(y_actual, y_pred)

  # Precision, recall and F1 with class 0 as the positive class
  v01 = precision_score(y_actual, y_pred, pos_label=0)
  v02 = recall_score(y_actual, y_pred, pos_label=0)
  v03 = f1_score(y_actual, y_pred, pos_label=0)

  # Precision, recall and F1 with class 1 as the positive class
  v11 = precision_score(y_actual, y_pred, pos_label=1)
  v12 = recall_score(y_actual, y_pred, pos_label=1)
  v13 = f1_score(y_actual, y_pred, pos_label=1)

  smsg = f"{description},\tAccuracy={v00:.2f},\tC0: Pr={v01:.2f}, Re={v02:.2f}, F1={v03:.2f},\tC1: Pr={v11:.2f}, Re={v12:.2f}, F1={v13:.2f}"
  print(smsg)
  # Append a timestamped copy of the summary to results.txt
  with open("results.txt", "a") as myfile:
    myfile.write(f"{datetime.now()}\t {smsg}\n")

# Calculate accuracy
accuracy = accuracy_score(true_labels, predictions)
print("Test Accuracy:", accuracy)


cuda
Epoch 1/10


100%|████████████████████████████████████████████████████████████████████████████████████| 2125/2125 [02:30<00:00, 14.08it/s]


Epoch 2/10


100%|████████████████████████████████████████████████████████████████████████████████████| 2125/2125 [02:28<00:00, 14.28it/s]


Epoch 3/10


100%|████████████████████████████████████████████████████████████████████████████████████| 2125/2125 [02:28<00:00, 14.29it/s]


Epoch 4/10


100%|████████████████████████████████████████████████████████████████████████████████████| 2125/2125 [02:28<00:00, 14.27it/s]


Epoch 5/10


100%|████████████████████████████████████████████████████████████████████████████████████| 2125/2125 [02:28<00:00, 14.34it/s]


Epoch 6/10


100%|████████████████████████████████████████████████████████████████████████████████████| 2125/2125 [02:28<00:00, 14.33it/s]


Epoch 7/10


100%|████████████████████████████████████████████████████████████████████████████████████| 2125/2125 [02:28<00:00, 14.32it/s]


Epoch 8/10


100%|████████████████████████████████████████████████████████████████████████████████████| 2125/2125 [02:28<00:00, 14.32it/s]


Epoch 9/10


100%|████████████████████████████████████████████████████████████████████████████████████| 2125/2125 [02:27<00:00, 14.43it/s]


Epoch 10/10


100%|████████████████████████████████████████████████████████████████████████████████████| 2125/2125 [02:27<00:00, 14.36it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████| 209/209 [00:03<00:00, 69.53it/s]

Test Accuracy: 0.9303721488595438
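
Accuracy alone can hide poor performance on a minority class, which is why `print_results` also reports per-class precision, recall, and F1. The sketch below computes these quantities from scratch on made-up labels (not the notebook's predictions) to show what they measure; `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for illustration only: 8 examples of class 0, 2 of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # a model that always predicts class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("Accuracy:", accuracy)
print("Class 0 (Pr, Re, F1):", per_class_metrics(y_true, y_pred, positive=0))
print("Class 1 (Pr, Re, F1):", per_class_metrics(y_true, y_pred, positive=1))
```

In this toy case the accuracy is high even though class 1 is never detected, which only the per-class numbers reveal.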





In [8]:
# Save the trained model if needed
from datetime import datetime

# Get the current date and time
current_datetime = datetime.now()

# Format the current date and time into a string
model_file_name = 'mlp_model_' + current_datetime.strftime("%Y-%m-%d_%H-%M-%S") + ".pth"

torch.save(classification_model, model_file_name)


In [None]:
ls -al

total 427824
drwxr-xr-x 1 root root      4096 Mar 15 19:08 ./
drwxr-xr-x 1 root root      4096 Mar 15 18:14 ../
drwxr-xr-x 4 root root      4096 Mar 14 13:26 .config/
-rw-r--r-- 1 root root 438067302 Mar 15 19:08 mlp_model_2024-03-15_19-08-08.pth
drwxr-xr-x 1 root root      4096 Mar 14 13:27 sample_data/
drwxr-xr-x 5 root root      4096 Mar 15 18:20 suggestion-mining/


In [None]:
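from torchviz import make_dot

# make_dot needs a tensor that is attached to the autograd graph, so run a
# forward pass on one training example (without torch.no_grad()).
sample_ids    = train_encodings['input_ids'][:1].to(device)
sample_mask   = train_encodings['attention_mask'][:1].to(device)
sample_logits = classification_model(sample_ids, sample_mask)

make_dot(sample_logits.mean(), params=dict(classification_model.named_parameters()),
         show_attrs=True, show_saved=True)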


0.10444177671068428

In [None]:
!pip install TorchLens

In [None]:
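model = classification_model

# forward() accepts only input_ids and attention_mask. Passing a third
# positional argument raises:
#   TypeError: ClassificationModel.forward() takes 3 positional arguments but 4 were given
# Slicing with [1:2] keeps the batch dimension that BERT expects.
input_ids      = train_encodings['input_ids'][1:2].to(device)
attention_mask = train_encodings['attention_mask'][1:2].to(device)
y = model(input_ids, attention_mask)

make_dot(y.mean(), params=dict(model.named_parameters()))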