<a href="https://colab.research.google.com/github/cse-teacher/suggestion-mining/blob/main/suggestion_mining_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Suggestion Mining using BERT
Suggestion mining is the task of extracting suggestions from user reviews.

Developed: 11 Feb 2024 \\
Last Update: 11 Feb 2024 \\
Author: Muharram Mansoorizadeh, with help from various AI tools (Google Search, ChatGPT, Gemini, ...)




## Install Required Packages

In [1]:
#Install required packages and libraries

!apt-get install -y libenchant-2-2
!pip install emoji
!pip install cleantext
!pip install nltk
!pip install pyenchant
!pip install scikit-learn lightgbm catboost
!pip install gensim
!pip install transformers sentencepiece sacremoses


'apt-get' is not recognized as an internal or external command,
operable program or batch file.




## Import data

Get the required data files from the GitHub repository.

In [None]:
!git clone https://github.com/cse-teacher/suggestion-mining.git

fatal: destination path 'suggestion-mining' already exists and is not an empty directory.


## Prepare data

In [2]:
# Read data from input files
#Reset environment
%reset -f

import numpy as np
import pandas as pd
import random

#Set default seed:
random.seed(42)

#Main Application
folder     = "./suggestion-mining/data/"
train_file = folder + "V1.4_Training.csv"  # alternatives: "Train_Augmented_03.csv", "Train_processed.csv"
valid_file = folder + "SubtaskA_Trial_Test_Labeled.csv"  # alternative: "validation_processed.csv"
test_file  = folder + "SubtaskA_EvaluationData_labeled.csv"


train_df = pd.read_csv(train_file,
                       encoding_errors='ignore', header=None,
                       names=["id", "sentence", "label"])

valid_df = pd.read_csv(valid_file,
                       encoding_errors='ignore', header=None,
                       names=["id", "sentence", "label"])

test_df  = pd.read_csv(test_file,
                       encoding_errors='ignore', header=None,
                       names=["id", "sentence", "label"])

all_df = pd.concat([train_df, valid_df, test_df], axis=0)


#Get the labels:
y_train_original = train_df['label'].values
y_valid_original = valid_df['label'].values
y_test_original  = test_df['label'].values
y_all_original  = all_df['label'].values
train_size = len(train_df['label'])
valid_size = len(valid_df['label'])
test_size  = len(test_df['label'])



**Preprocessing**
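
The helper `replace_with_ner_tags` used in the cells below is not defined in this section. As a rough illustration of the replacement step only, the sketch below swaps character spans for their entity labels. The names `replace_entities` and `span_of` and the hand-written spans are assumptions for illustration; in practice the spans would come from an NER tagger.

```python
from typing import List, Tuple

# Hypothetical representation: an entity is (start, end, label) in character offsets.
Entity = Tuple[int, int, str]


def replace_entities(text: str, entities: List[Entity]) -> str:
    """Replace each (start, end) character span in `text` with its label."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        if start < cursor:
            continue  # skip a span that overlaps one already replaced
        pieces.append(text[cursor:start])
        pieces.append(label)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def span_of(text: str, word: str, label: str) -> Entity:
    """Build a span for the first occurrence of `word` (illustration only)."""
    start = text.index(word)
    return (start, start + len(word), label)


text = "Microsoft should seriously look into getting rid of Syamentc for all these paying stuff"
# Spans written by hand to mirror the tags shown in the notebook's example output.
entities = [span_of(text, "Microsoft", "PERSON"), span_of(text, "Syamentc", "GPE")]
print(replace_entities(text, entities))
```

Replacing spans from left to right with a moving cursor keeps the offsets valid even though the labels differ in length from the words they replace.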

**Example Usages**

In [None]:
# Example usage:
text = "Microsoft should seriously look into getting rid of Syamentc for all these paying stuff"
replaced_text = replace_with_ner_tags(text)
print("Replaced Text:", replaced_text)
#all_df[:][1:20]

Replaced Text: PERSON should seriously look into getting rid of GPE for all these paying stuff


In [None]:
replace_ner_tags = True
if replace_ner_tags:
  #replace named entities with their tag names:
  train_df['sentence'] = train_df['sentence'].apply(replace_with_ner_tags)
  test_df['sentence'] = test_df['sentence'].apply(replace_with_ner_tags)
  valid_df['sentence'] = valid_df['sentence'].apply(replace_with_ner_tags)
  #all_df['sentence'] = all_df['sentence'].apply(replace_hyperlinks)
  #all_df['sentence'] = all_df['sentence'].apply(cleantext.clean)

**BERT for Classification**

The classification example follows these steps:

1. Import libraries: import the libraries needed to load the BERT tokenizer and model, process text, and make predictions.
2. Load BERT tokenizer and model: load the pre-trained bert-base-uncased tokenizer and model. Replace 'bert-base-uncased' with your desired pre-trained BERT model name.
3. Preprocess text function: this function
   - tokenizes the text using the BERT tokenizer;
   - adds the special tokens [CLS] and [SEP] to the beginning and end of the sequence, respectively;
   - pads the sequence to a maximum length (MAX_LEN) if necessary.
4. Define example text and label: replace the text with the text you want to classify and adjust the label to match your classification categories.
5. Preprocess text: call the preprocess_text function to convert the text into the format BERT requires.
6. Make prediction: pass the preprocessed text through the model to obtain predictions.
7. Get predicted class and probability: extract the predicted class index and its probability from the prediction results.
8. Print results: print the predicted class and its probability.

Note:

- This is a basic example and can be customized for specific tasks such as sentiment analysis or topic classification.
- Remember to install the required libraries before running the code: transformers plus a deep learning backend (the original example names tensorflow; the cells below use PyTorch).
- Adjust MAX_LEN based on the maximum sentence length in your dataset.

Sources: github.com/JiaYaobo/toxic_detect
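
The preprocessing and prediction steps described above can be sketched without downloading a model. This toy version uses whitespace tokens instead of BERT's WordPiece vocabulary and hand-written logits instead of model output, so it only illustrates the special tokens, padding to `MAX_LEN`, and how a predicted class and its probability are read from raw scores. The value `MAX_LEN = 8`, the function `predict_from_logits`, and the example inputs are assumptions.

```python
import math
from typing import List, Tuple

MAX_LEN = 8  # assumed maximum sequence length for this toy example


def preprocess_text(text: str, max_len: int = MAX_LEN) -> Tuple[List[str], List[int]]:
    """Toy stand-in for BERT preprocessing (whitespace tokens, not WordPiece)."""
    tokens = text.lower().split()[: max_len - 2]  # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    attention_mask = [1] * len(tokens)            # 1 marks a real token
    padding = max_len - len(tokens)
    tokens += ["[PAD]"] * padding
    attention_mask += [0] * padding               # 0 marks padding
    return tokens, attention_mask


def predict_from_logits(logits: List[float]) -> Tuple[int, float]:
    """Apply softmax to raw scores; return the top class index and its probability."""
    shift = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


tokens, mask = preprocess_text("Please add a dark mode")
print(tokens)
print(mask)

# Hand-written logits for two classes, standing in for real model output.
predicted_class, probability = predict_from_logits([0.2, 1.3])
print(predicted_class, round(probability, 3))
```

With the real tokenizer, the same three effects (special tokens, truncation, padding) are controlled by the `add_special_tokens`, `truncation`, `padding`, and `max_length` arguments that appear in the cells below.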


In [5]:
!pip install -q -U tensorflow-text
!pip install transformers
!pip install -q tf-models-official



**Init BERT**

In [11]:
import os
import numpy as np
import pandas as pd
# Read the Hugging Face token from the environment instead of hard-coding a secret in the notebook
HF_TOKEN = os.environ.get("HF_TOKEN")

#Main Application
folder     = "./suggestion-mining/data/"
train_file = folder + "Train_Augmented_03.csv"  # alternatives: "V1.4_Training.csv", "Train_processed.csv"
valid_file = folder + "SubtaskA_Trial_Test_Labeled.csv"  # alternative: "validation_processed.csv"
test_file  = folder + "SubtaskA_EvaluationData_labeled.csv"


train_df = pd.read_csv(train_file,
                       encoding_errors='ignore', header=None,
                       names=["id", "sentence", "label"])

In [27]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from transformers import BertTokenizer, BertModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# Example data
#texts = ["Text 1", "Text 2", "Text 3"]  # Your list of text strings
#labels = [0, 1, 0]  # Example labels (binary classification)

texts  = train_df['sentence'].tolist()
labels = train_df['label'].tolist()

# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load pre-trained BERT model
model = BertModel.from_pretrained('bert-base-uncased')
model.to(device)

# Texts are tokenized per mini-batch in the loop below, so the full corpus
# is never encoded and moved to the GPU in one piece
# Split data into mini-batches
batch_size = 4
num_batches = len(texts) // batch_size + (len(texts) % batch_size != 0)

# Initialize empty lists to store embeddings and labels
embeddings_list = []
labels_list = []

# Process each mini-batch
for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(texts))
    batch_texts = texts[start_idx:end_idx]
    batch_labels = labels[start_idx:end_idx]
    # Encode the mini-batch and move it to the same device as the model;
    # leaving it on the CPU while the model is on CUDA raises a device-mismatch RuntimeError
    batch_encoded_inputs = tokenizer(batch_texts, padding=True, truncation=True, return_tensors='pt').to(device)
    # Get the BERT embeddings for the mini-batch
    with torch.no_grad():
        batch_outputs = model(**batch_encoded_inputs)

    # Use the hidden state of the first token ([CLS]) as the sentence embedding;
    # the last position is often a [PAD] token when sentences in a batch differ in length
    batch_embeddings = batch_outputs.last_hidden_state[:, 0, :].cpu().numpy()
    embeddings_list.append(batch_embeddings)

    # Append labels to the list
    labels_list.extend(batch_labels)

# Concatenate embeddings and labels
embeddings = np.concatenate(embeddings_list, axis=0)
labels = np.array(labels_list)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, test_size=0.2, random_state=42)

# Train logistic regression classifier
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train.reshape(X_train.shape[0], -1), y_train)

# Predict on test set
y_pred = classifier.predict(X_test.reshape(X_test.shape[0], -1))

# Evaluate the classifier
print(classification_report(y_test, y_pred))
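
The embedding loop above walks the data in fixed-size mini-batches, with a final batch that may be smaller. A minimal sketch of that batching logic as a reusable generator follows; the name `iter_batches` and the toy data are hypothetical and not part of the notebook.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(
    texts: Sequence[str], labels: Sequence[int], batch_size: int
) -> Iterator[Tuple[List[str], List[int]]]:
    """Yield (texts, labels) mini-batches; the last batch may be shorter."""
    for start in range(0, len(texts), batch_size):
        end = start + batch_size  # slicing clamps `end` to the sequence length
        yield list(texts[start:end]), list(labels[start:end])


# Toy data standing in for train_df['sentence'] and train_df['label'].
texts = [f"sentence {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

batches = list(iter_batches(texts, labels, batch_size=4))
print([len(batch_texts) for batch_texts, _ in batches])
```

Stepping `range` by `batch_size` avoids computing the number of batches explicitly and keeps texts and labels aligned because both are sliced with the same indices.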

In [29]:
import torch
from transformers import BertModel, BertTokenizer
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tqdm import tqdm

# Sentences and their binary labels (0 or 1) from the training file
documents  = train_df['sentence'].tolist()
labels = train_df['label'].tolist()
# Split data into training and testing sets
train_texts, test_texts, train_labels, test_labels = train_test_split(documents, labels, test_size=0.2, random_state=42)

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = BertModel.from_pretrained(model_name).to('cuda')  # Move model to GPU

# Tokenize and encode the training and testing texts
max_length = 128  # Adjust as needed
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length, return_tensors='pt')
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=max_length, return_tensors='pt')

# Convert labels to tensors
train_labels = torch.tensor(train_labels)
test_labels = torch.tensor(test_labels)

# Create TensorDatasets
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], train_labels)
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], test_labels)

# Define DataLoader
batch_size = 2  # Adjust as needed
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Define classification model
class ClassificationModel(torch.nn.Module):
    def __init__(self, bert_model):
        super(ClassificationModel, self).__init__()
        self.bert = bert_model
        self.fc = torch.nn.Linear(self.bert.config.hidden_size, 2)  # Output size is 2 for binary classification

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        logits = self.fc(pooled_output)
        return logits

# Instantiate classification model
classification_model = ClassificationModel(bert_model).to('cuda')  # Move model to GPU

# Define optimizer and loss function
optimizer = torch.optim.AdamW(classification_model.parameters(), lr=2e-5)
loss_function = torch.nn.CrossEntropyLoss()

# Training loop
num_epochs = 3  # Adjust as needed
classification_model.train()
for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}/{num_epochs}")
    for batch in tqdm(train_loader):
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to('cuda'), attention_mask.to('cuda'), labels.to('cuda')

        optimizer.zero_grad()
        outputs = classification_model(input_ids, attention_mask)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()

# Evaluation
classification_model.eval()
predictions = []
true_labels = []
for batch in tqdm(test_loader):
    input_ids, attention_mask, labels = batch
    input_ids, attention_mask, labels = input_ids.to('cuda'), attention_mask.to('cuda'), labels.to('cuda')

    with torch.no_grad():
        outputs = classification_model(input_ids, attention_mask)
    _, predicted_labels = torch.max(outputs, 1)
    predictions.extend(predicted_labels.cpu().numpy())
    true_labels.extend(labels.cpu().numpy())

# Calculate accuracy
accuracy = accuracy_score(true_labels, predictions)
print("Test Accuracy:", accuracy)


Epoch 1/3


  4%|██▊                                                                            | 210/5902 [00:10<04:35, 20.62it/s]


KeyboardInterrupt: 
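
Accuracy is the only metric computed in the cell above. If the two classes are imbalanced, precision, recall, and F1 for one class can be more informative. The sketch below is a minimal version under the assumption that label 1 is the class of interest, which the notebook does not state. The function `binary_prf` and the toy label lists are hypothetical.

```python
from typing import Sequence, Tuple


def binary_prf(
    true_labels: Sequence[int], predictions: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the `positive` class."""
    pairs = list(zip(true_labels, predictions))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy values standing in for the `true_labels` and `predictions` lists built during evaluation.
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = binary_prf(toy_true, toy_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The same numbers are available from `sklearn.metrics.classification_report`, which the earlier logistic regression cell already imports.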