
# **Sentiment Classification Using BERT**

## **Step 1: Import the necessary libraries**

In [None]:
# Reference: https://www.geeksforgeeks.org/sentiment-classification-using-bert/

# Import preprocessing library
import os
import pandas as pd
from bs4 import BeautifulSoup
import re

# Import modeling library
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

# Others
import warnings
warnings.filterwarnings("ignore")

## **Step 2: Load the dataset**

In [None]:
# Load IMDB dataset: A dataset for binary sentiment classification with 25,000 highly polar movie reviews for training, and 25,000 for testing
dataset = tf.keras.utils.get_file(
	fname="aclImdb.tar.gz",
	origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
	cache_dir=os.getcwd(),
	extract=True)

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [None]:
# Set directory path (Dataset Explanation: https://deepnote.com/app/ronakv/Sentiment-Analysis-9cb468b0-9200-400f-9896-e4e9d46dbc48)
dataset_dir = os.path.dirname(dataset)
imdb_dir = os.path.join(dataset_dir, 'aclImdb')
train_dir = os.path.join(imdb_dir,'train')
test_dir = os.path.join(imdb_dir,'test')

os.listdir(train_dir)

['urls_pos.txt',
 'labeledBow.feat',
 'urls_unsup.txt',
 'pos',
 'unsupBow.feat',
 'urls_neg.txt',
 'neg',
 'unsup']

In [None]:
def load_dataset(directory):
	# Read every review under directory/pos (label 1) and directory/neg (label 0)
	data = {"sentence": [], "sentiment": []}
	for folder, sentiment in (("pos", 1), ("neg", 0)):
		folder_dir = os.path.join(directory, folder)
		for text_file in os.listdir(folder_dir):
			with open(os.path.join(folder_dir, text_file), "r", encoding="utf-8") as f:
				data["sentence"].append(f.read())
				data["sentiment"].append(sentiment)

	return pd.DataFrame.from_dict(data)

In [None]:
# Load the dataset from the train_dir and test_dir
train_df = load_dataset(train_dir)
test_df = load_dataset(test_dir)

In [None]:
# Training set
train_df.sample(n=5, random_state=1)

| index | sentence | sentiment |
|---|---|---|
| 21492 | Dolph Lundgren stars as Murray Wilson an alcoh... | 0 |
| 9488 | This is a sublime piece of film-making. It flo... | 1 |
| 16933 | For anyone craving a remake of 1989's Slaves o... | 0 |
| 12604 | I liked this show! I think it was nothing with... | 0 |
| 8222 | MPAA:Rated R for Violence,Language,Nudity and ... | 1 |


In [None]:
# Test set
test_df.sample(n=5, random_state=1)

| index | sentence | sentiment |
|---|---|---|
| 21492 | Formulaic to the max. Neither title reflects t... | 0 |
| 9488 | The basic premise of Flatliners is fairly simp... | 1 |
| 16933 | ...an incomprehensible script (when it shouldn... | 0 |
| 12604 | Early 1950s Sci-Fi directed by Lesley Selander... | 0 |
| 8222 | I shot this movie. I am very proud of the film... | 1 |


In [None]:
# Test sentence example
test_df.loc[8222, 'sentence']

'I shot this movie. I am very proud of the film. It was a great experience which shows up on the screen. Halfdan Hussey is an excellent collaborator who had a vision and was able to capture the movie in the exact way we envisioned while prepping the film. The sets are amazing and well crafted for each character. John York and his team built sets that not only fit the characters, they worked well in shooting the film, allowing us to move seamlessly through walls and from one set to another. Each character has an amazing arc, which makes for a great story. I feel like all of the actors gave excellent performances. I disagree with some of the other comments that say the acting was not good. Watch it and decide for yourself.'

## **Step 3: Preprocessing**

In [None]:
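# Clean texts
def text_cleaning(text):
	# Strip HTML tags such as <br /> and keep only the visible text
	soup = BeautifulSoup(text, "html.parser")
	text = soup.get_text()
	# Drop every character that is not a letter, digit, whitespace, comma, or apostrophe
	pattern = r"[^a-zA-Z0-9\s,']"
	text = re.sub(pattern, '', text)
	return text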

A regular expression (regex) is a sequence of characters that defines a search pattern for matching and manipulating text.

The pattern `[^a-zA-Z0-9\s,']` matches any single character that is not one of the following:
*   `a-z`: any lowercase letter.
*   `A-Z`: any uppercase letter.
*   `0-9`: any digit.
*   `\s`: any whitespace character (such as spaces, tabs, or newlines).
*   `,`: the comma character.
*   `'`: the apostrophe character.

The `^` at the start of the square brackets `[]` negates the character set, so the pattern matches every character that is not listed. `text_cleaning` replaces each such character with an empty string, which also removes periods, exclamation marks, and other punctuation.
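
As a quick check of the character class described above, the sketch below uses `re.findall` with the same pattern to list the characters that would be removed. The sample string is made up for illustration and does not come from the dataset.

```python
import re

# Same pattern as in text_cleaning
pattern = r"[^a-zA-Z0-9\s,']"

# Hypothetical sample string, not taken from the IMDB data
sample = "Great film, 10/10! Don't miss it."

removed = re.findall(pattern, sample)  # characters that match the negated set
cleaned = re.sub(pattern, '', sample)  # the same string with those characters dropped

print(removed)
print(cleaned)
```

One side effect worth noticing is that `10/10` collapses to `1010`, because the slash is not in the allowed set.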

In [None]:
# Ex.1: remove HTML tags with BeautifulSoup
test = "<br /><br />(Wow!!!) He's very smart."
test_1 = BeautifulSoup(test, "html.parser").get_text()
print(test_1)

(Wow!!!) He's very smart.


In [None]:
# Ex.2: remove characters outside the allowed set
test_2 = re.sub(r"[^a-zA-Z0-9\s,']", '', test_1)
print(test_2)

Wow He's very smart


In [None]:
# Train dataset
train_df['Cleaned_sentence'] = train_df['sentence'].apply(text_cleaning)
Reviews = train_df['Cleaned_sentence']
Target = train_df['sentiment']

# Test dataset
test_df['Cleaned_sentence'] = test_df['sentence'].apply(text_cleaning)
test_reviews = test_df['Cleaned_sentence']
test_targets = test_df['sentiment']

In [None]:
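# Split the official test set in half: one half for validation, one half for final testing.
# stratify keeps the positive/negative ratio the same in both halves.
x_val, x_test, y_val, y_test = train_test_split(test_reviews,
													test_targets,
													test_size=0.5,
													stratify=test_targets)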

## **Step 4: Tokenization & Encoding**

In [None]:
# Tokenize and encode the data using the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)


In [None]:
# Ex.1: split a sentence into WordPiece tokens
sentence = "Her style is very conversational"
tokenizer.tokenize(sentence)

['her', 'style', 'is', 'very', 'conversation', '##al']
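
The `##al` piece above comes from WordPiece: a word that is not in the vocabulary as a whole is split into the longest known prefix followed by `##`-prefixed continuation pieces. The following is a minimal sketch of that greedy longest-match-first idea, assuming a tiny made-up vocabulary (`toy_vocab`) and a hypothetical helper `wordpiece_split`. It is not the actual `BertTokenizer` implementation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word (illustrative only)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary is far larger
toy_vocab = {"her", "style", "is", "very", "conversation", "##al"}

sentence = "her style is very conversational"
tokens = [piece for word in sentence.split() for piece in wordpiece_split(word, toy_vocab)]
print(tokens)
```

With this toy vocabulary the sketch reproduces the split shown above, since `conversation` is the longest prefix it knows and `##al` covers the remainder.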

In [None]:
# Ex.2: convert the sentence to token ids ([CLS] = 101 and [SEP] = 102 are added automatically)
encoding = tokenizer.encode(sentence)
encoding

[101, 2014, 2806, 2003, 2200, 4512, 2389, 102]

In [None]:
# Ex.3: map the token ids back to tokens
tokenizer.convert_ids_to_tokens(encoding)

['[CLS]', 'her', 'style', 'is', 'very', 'conversation', '##al', '[SEP]']

In [None]:
# Ex.4: encode a batch; the shorter sentence is padded with 0 and masked out in attention_mask
tokenizer.batch_encode_plus(["Her style is very conversational", "Her style is good"],
											padding=True,
											truncation=True,
											max_length=128,
											return_tensors='tf')

{'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
array([[ 101, 2014, 2806, 2003, 2200, 4512, 2389,  102],
       [ 101, 2014, 2806, 2003, 2204,  102,    0,    0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 0, 0]], dtype=int32)>}
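
In the output above, the shorter sentence is padded with id `0`, and its `attention_mask` is `0` at those positions so the model ignores them. Below is a minimal pure-Python sketch of that padding step, assuming a hypothetical helper `pad_batch` and reusing the token ids shown above. The truncation here is plain slicing, whereas the real tokenizer also keeps the final `[SEP]` token when it truncates.

```python
def pad_batch(sequences, pad_id=0, max_length=128):
    """Truncate to max_length, then pad every sequence to the longest one in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Token ids copied from the batch_encode_plus output above
batch = [
    [101, 2014, 2806, 2003, 2200, 4512, 2389, 102],
    [101, 2014, 2806, 2003, 2204, 102],
]
ids, mask = pad_batch(batch)
print(ids[1])
print(mask[1])
```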

In [None]:
max_len = 128  # maximum number of tokens per review; longer reviews are truncated

# Tokenize and encode the sentences
X_train_encoded = tokenizer.batch_encode_plus(Reviews.tolist(),
											padding=True,
											truncation=True,
											max_length = max_len,
											return_tensors='tf')

X_val_encoded = tokenizer.batch_encode_plus(x_val.tolist(),
											padding=True,
											truncation=True,
											max_length = max_len,
											return_tensors='tf')

X_test_encoded = tokenizer.batch_encode_plus(x_test.tolist(),
											padding=True,
											truncation=True,
											max_length = max_len,
											return_tensors='tf')

In [None]:
k = 4
print('Training Comments -->', Reviews[k])
print('\nInput Ids -->\n', X_train_encoded['input_ids'][k])
print('\nDecoded Ids -->\n', tokenizer.decode(X_train_encoded['input_ids'][k]))
print('\nAttention Mask -->\n', X_train_encoded['attention_mask'][k])
print('\nLabels -->', Target[k])

Training Comments --> I have seen this movie many times At least a Dozen But unfortunatly not recently However, Etched in my memory never to leave me is a scene in which Mickey Rooney, Killer Mears knows that he is to be executed and it's getting close to the moment of truth, He dances, and cries, and laughs, he vacillates from hesteria to euphoria and runs the gambit of ever emotion Never have I seen such a brilliant performance by any actor living or dead, past or present It was then I know for sure that Mickey Rooney, yes, Andy Hardy was and is a actor of great genius However I kept it, my opinion to myself for years thinking, surely I must be alone in this viewpoint About 15 years or so after I saw this film for the last time on television, I chanced to read the old Q  A section of the Los Angeles Times The question was posed to Lawrence Olivier, and the question was Mr Olivier You are considered one of the greatest actors of all time, whom then do YOU consider to be among the grea

## **Step 5: Build the classification model**

In [None]:
# Initialize the model
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Compile the model with an appropriate optimizer, loss function, and metrics
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
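
`from_logits=True` tells the loss that the model outputs raw scores rather than probabilities, so the softmax is applied inside the loss. As a worked sketch of what sparse categorical cross-entropy computes for a single example, the block below uses only the standard library and made-up logits. The function name is hypothetical and this is not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, true_class):
    """Cross-entropy for one example: -log(softmax(logits)[true_class])."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[true_class])


# Made-up logits for the two classes (0 = Negative, 1 = Positive)
logits = [0.5, 2.0]
loss_if_positive = sparse_categorical_crossentropy_from_logits(logits, 1)
loss_if_negative = sparse_categorical_crossentropy_from_logits(logits, 0)
print(loss_if_positive, loss_if_negative)
```

The loss is small when the larger logit belongs to the true class and grows when the model favors the wrong class.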

In [None]:
# Train the model for 3 epochs, validating on the validation split after each epoch
history = model.fit(
	[X_train_encoded['input_ids'], X_train_encoded['token_type_ids'], X_train_encoded['attention_mask']],
	Target,
	validation_data=([X_val_encoded['input_ids'], X_val_encoded['token_type_ids'], X_val_encoded['attention_mask']],y_val),
	batch_size=32,
	epochs=3
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


## **Step 6: Evaluate the model**

In [None]:
# Evaluate the model on the test data
test_loss, test_accuracy = model.evaluate(
	[X_test_encoded['input_ids'], X_test_encoded['token_type_ids'], X_test_encoded['attention_mask']],
	y_test
)
print(f'Test loss: {test_loss}, Test accuracy: {test_accuracy}')

Test loss: 0.3461953401565552, Test accuracy: 0.8848000168800354


In [None]:
path = 'path-to-save'
# Save tokenizer
tokenizer.save_pretrained(path +'/Tokenizer')

# Save model
model.save_pretrained(path +'/Model')

In [None]:
# Load tokenizer
bert_tokenizer = BertTokenizer.from_pretrained(path +'/Tokenizer')

# Load model
bert_model = TFBertForSequenceClassification.from_pretrained(path +'/Model')

Some layers from the model checkpoint at path-to-save/Model were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at path-to-save/Model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [None]:
# Perform a more in-depth evaluation
pred = bert_model.predict(
	[X_test_encoded['input_ids'], X_test_encoded['token_type_ids'], X_test_encoded['attention_mask']])

# pred is of type TFSequenceClassifierOutput
logits = pred.logits

# Use argmax along the appropriate axis to get the predicted labels
pred_labels = tf.argmax(logits, axis=1)

# Convert the predicted labels to a NumPy array
pred_labels = pred_labels.numpy()

label = {
	1: 'Positive',
	0: 'Negative'
}

# Map the predicted labels to their corresponding strings using the label dictionary
pred_labels = [label[i] for i in pred_labels]
Actual = [label[i] for i in y_test]

print('Predicted Label :', pred_labels[:10])
print('Actual Label :', Actual[:10])

Predicted Label : ['Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive']
Actual Label : ['Positive', 'Positive', 'Negative', 'Positive', 'Negative', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive']


In [None]:
print("Classification Report: \n", classification_report(Actual, pred_labels))

Classification Report: 
               precision    recall  f1-score   support

    Negative       0.90      0.87      0.88      6250
    Positive       0.87      0.90      0.89      6250

    accuracy                           0.88     12500
   macro avg       0.89      0.88      0.88     12500
weighted avg       0.89      0.88      0.88     12500
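
The columns of the report can be reproduced by hand. The sketch below, assuming a hypothetical helper `precision_recall_f1`, computes precision, recall and F1 for one class from the counts of true positives, false positives and false negatives. It uses only the first ten labels printed above, so its numbers differ from the full report.

```python
def precision_recall_f1(actual, predicted, positive_label):
    """Per-class precision, recall and F1 from two equally long label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive_label and p == positive_label for a, p in pairs)
    fp = sum(a != positive_label and p == positive_label for a, p in pairs)
    fn = sum(a == positive_label and p != positive_label for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# First ten labels from the printed output above
predicted = ['Positive', 'Positive', 'Negative', 'Positive', 'Positive',
             'Positive', 'Negative', 'Positive', 'Positive', 'Positive']
actual = ['Positive', 'Positive', 'Negative', 'Positive', 'Negative',
          'Positive', 'Negative', 'Positive', 'Positive', 'Positive']

p, r, f = precision_recall_f1(actual, predicted, 'Positive')
print(p, r, f)
```

Calling the helper with `'Negative'` instead gives the other row of the report for the same sample.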



## **Step 7: Prediction with user inputs**

In [None]:
def Get_sentiment(Review, Tokenizer, Model):
	# Wrap a single review in a list so that batch encoding works
	if not isinstance(Review, list):
		Review = [Review]

	# Access the encoded tensors by key instead of relying on dictionary order
	encoded = Tokenizer.batch_encode_plus(Review,
										padding=True,
										truncation=True,
										max_length=128,
										return_tensors='tf')
	prediction = Model.predict(
		[encoded['input_ids'], encoded['token_type_ids'], encoded['attention_mask']])

	# Use argmax along the class axis to get the predicted labels
	pred_labels = tf.argmax(prediction.logits, axis=1)

	# Map the class ids to strings using the `label` dictionary defined in Step 6
	pred_labels = [label[i] for i in pred_labels.numpy().tolist()]
	return pred_labels

In [None]:
len(pred_labels)

12500

In [None]:
Review ='''Bahubali is a blockbuster Indian movie that was released in 2015.
It is the first part of a two-part epic saga that tells the story of a legendary hero who fights for his kingdom and his love.
The movie has received rave reviews from critics and audiences alike for its stunning visuals,
spectacular action scenes, and captivating storyline.'''
Get_sentiment(Review, bert_tokenizer, bert_model)



['Positive']