Installing Transformers
Installing the Transformers library is straightforward. Run the following pip command in a Google Colab cell:
pip install transformers

Next, we import the pre-trained BERT tokenizer and sequence classifier, along with InputExample and InputFeatures. We then build our model from the sequence classifier and our tokenizer from BERT’s tokenizer.

In [2]:
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

ImportError: 
TFBertForSequenceClassification requires the TensorFlow library but it was not found in your environment. Checkout the instructions on the
installation page: https://www.tensorflow.org/install and follow the ones that match your environment.
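
The ImportError above means that TensorFlow is not installed in the current environment. A minimal sketch for checking this before importing the TF model classes, using the standard library's importlib.util.find_spec (the is_installed helper name is hypothetical):

```python
import importlib.util


def is_installed(package_name):
    """Return True if the top-level package can be found, without importing it."""
    return importlib.util.find_spec(package_name) is not None


# Check for TensorFlow before importing TFBertForSequenceClassification
tf_available = is_installed("tensorflow")
print("TensorFlow available:", tf_available)
```

If this prints False, following the installation page linked in the error message should allow the import to succeed. The PyTorch classes used later in this section do not depend on TensorFlow.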


Summary of our BERT model:

In [None]:
model.summary()

Initial imports

In [None]:
import tensorflow as tf
import pandas as pd
import os
import shutil

Load the Data into Pandas to View and Process

In [None]:
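dataset = pd.read_csv("Dataset/cleanedReviewsDateset.csv", low_memory=False, encoding="ISO-8859-1")
# Copy each slice so that columns can be added later without a SettingWithCopyWarning
train_set = dataset[0:60000].copy()
valid_set = dataset[60000:65000].copy()
test_set  = dataset[65000:77762].copy()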
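
The slice boundaries above imply a 60,000 / 5,000 / 12,762 split, assuming the file holds at least 77,762 rows (the source does not state the row count). A minimal sketch, with a plain Python list standing in for the DataFrame, checks that the three slices are contiguous and do not overlap:

```python
# A plain list stands in for the DataFrame rows (assumption: 77,762 rows)
rows = list(range(77762))

train_rows = rows[0:60000]
valid_rows = rows[60000:65000]
test_rows = rows[65000:77762]

# Size of each split
split_sizes = (len(train_rows), len(valid_rows), len(test_rows))
print(split_sizes)

# No row is dropped or shared between the splits
covers_all_rows = train_rows + valid_rows + test_rows == rows
print(covers_all_rows)
```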


Let’s create labels using the VADER algorithm. VADER is an unsupervised, rule-based method for sentiment analysis, and it is available through the NLTK package. You need to install the NLTK library (pip install nltk), then download the VADER lexicon.

In [None]:
import nltk
nltk.download("vader_lexicon")

Now we can predict each review’s sentiment with VADER using the following code:

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

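def vader_sentiment_result(sent):
    # polarity_scores returns the "neg", "neu", "pos", and "compound" scores
    scores = analyzer.polarity_scores(sent)

    # Label 0 (negative) only when the negative score exceeds the positive one;
    # ties fall through to label 1 (positive)
    if scores["neg"] > scores["pos"]:
        return 0

    return 1

# Label the training and validation sets
train_set["vader_result"] = train_set["text"].apply(vader_sentiment_result)
valid_set["vader_result"] = valid_set["text"].apply(vader_sentiment_result)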
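
To make the labeling rule concrete, here is a minimal sketch that applies the same comparison to hand-written score dictionaries. The values are invented for illustration and are not real VADER output; only the keys match what polarity_scores returns, and the label_from_scores name is hypothetical.

```python
def label_from_scores(scores):
    """Same rule as vader_sentiment_result, applied to a ready-made score dict."""
    return 0 if scores["neg"] > scores["pos"] else 1


# Hypothetical score dictionaries (illustrative values, not real VADER output)
mostly_negative = {"neg": 0.6, "neu": 0.4, "pos": 0.0, "compound": -0.7}
mostly_positive = {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6}
fully_neutral = {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

labels = [label_from_scores(s) for s in (mostly_negative, mostly_positive, fully_neutral)]
print(labels)
```

Note that the fully neutral example receives label 1, because the rule returns 0 only when the negative score is strictly greater than the positive one.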

Fine-Tuning a Hugging Face Model with the Trainer Class:
The process starts by converting the data to a PyTorch dataset object so that it can be fed to the BERT model; the dataset class handles preprocessing and presenting the data. We then load the BERT classifier from the Hugging Face library and fine-tune it with the Trainer class.
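
As a rough illustration of what such a dataset class does, the sketch below implements the same two methods (__len__ and __getitem__) in plain Python. The ToyDataset name, the whitespace tokenizer, the padding id of 0, and max_len=6 are all assumptions made for this example; the TheDataset class later in this section uses BERT's tokenizer and returns PyTorch tensors instead.

```python
class ToyDataset:
    """Map-style dataset sketch: an index goes in, one preprocessed example comes out."""

    def __init__(self, reviews, sentiments, max_len=6):
        self.reviews = reviews
        self.sentiments = sentiments
        self.max_len = max_len
        # Toy vocabulary built from the data; id 0 is reserved for padding
        words = sorted({w for r in reviews for w in r.lower().split()})
        self.vocab = {w: i + 1 for i, w in enumerate(words)}

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, index):
        tokens = self.reviews[index].lower().split()
        # Truncate to max_len tokens
        ids = [self.vocab[t] for t in tokens][: self.max_len]
        mask = [1] * len(ids)
        # Pad up to max_len; padded positions get attention mask 0
        pad = self.max_len - len(ids)
        return {
            "input_ids": ids + [0] * pad,
            "attention_mask": mask + [0] * pad,
            "labels": self.sentiments[index],
        }


toy = ToyDataset(["great phone", "bad battery and a very slow and noisy fan"], [1, 0])
first, second = toy[0], toy[1]
print(first)
print(len(second["input_ids"]))
```

Every example comes back with the same length, which is what lets a data loader stack examples into a batch.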

In [None]:
import torch
torch.cuda.is_available()

In [None]:
import torch
from transformers import BertForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Load the BERT Tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# The dataset class
class TheDataset(torch.utils.data.Dataset):

    def __init__(self, reviews, sentiments, tokenizer):
        self.reviews    = reviews
        self.sentiments = sentiments
        self.tokenizer  = tokenizer
        self.max_len    = tokenizer.model_max_length
  
    def __len__(self):
        return len(self.reviews)
  
    def __getitem__(self, index):
        review = str(self.reviews[index])
        sentiments = self.sentiments[index]

        encoded_review = self.tokenizer.encode_plus(
            review,
            add_special_tokens    = True,
            max_length            = self.max_len,
            return_token_type_ids = False,
            return_attention_mask = True,
            return_tensors        = 'pt',
            padding               = "max_length",
            truncation            = True
        )

        return {
            'input_ids': encoded_review['input_ids'][0],
            'attention_mask': encoded_review['attention_mask'][0],
            'labels': torch.tensor(sentiments, dtype=torch.long)
        }

# Prepare the Train/Validation sets
train_set_dataset = TheDataset(
    reviews    = train_set.text.tolist(),
    sentiments = train_set.vader_result.tolist(),
    tokenizer  = tokenizer,
)

valid_set_dataset = TheDataset(
    reviews    = valid_set.text.tolist(),
    sentiments = valid_set.vader_result.tolist(),
    tokenizer  = tokenizer,
)

# Load the BERT model
model = BertForSequenceClassification.from_pretrained("bert-large-uncased")

# Freeze BERT except the 24th (last) encoder layer and the pooler layer;
# the classification head lies outside model.bert and stays trainable
for name, param in model.bert.named_parameters():
    if ( not name.startswith('pooler') ) and "layer.23" not in name :
        param.requires_grad = False

# The function to compute accuracy, F1, precision, and recall
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Define the training parameters
training_args = TrainingArguments(
    output_dir                  = "./sentiment-analysis",
    num_train_epochs            = 10,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size  = 64,
    warmup_steps                = 500,
    weight_decay                = 0.01,
    save_strategy               = "epoch",
    evaluation_strategy         = "steps"
)

# Define the Huggingface Trainer object
trainer = Trainer(
    model           = model,
    args            = training_args,
    train_dataset   = train_set_dataset,
    eval_dataset    = valid_set_dataset,
    compute_metrics = compute_metrics
)

# Start fine-tuning!
trainer.train()
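
The freezing rule in the cell above can be checked without loading the model. The sketch below applies the same name test to a handful of parameter names written out by hand. They follow the naming pattern of model.bert.named_parameters() but are only a small, illustrative subset, and the stays_trainable helper is hypothetical.

```python
def stays_trainable(name):
    """Mirror of the freezing condition: keep the pooler and encoder layer 23."""
    return name.startswith("pooler") or "layer.23" in name


# Hand-written sample of parameter names (illustrative subset, not the full list)
sample_names = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.22.output.dense.weight",
    "encoder.layer.23.attention.self.query.weight",
    "encoder.layer.23.output.dense.bias",
    "pooler.dense.weight",
]

trainable = [n for n in sample_names if stays_trainable(n)]
frozen = [n for n in sample_names if not stays_trainable(n)]
print(trainable)
print(frozen)
```

Because the test is a substring match, it relies on bert-large-uncased having exactly 24 encoder layers (indices 0 to 23). On a deeper model the pattern "layer.23" would also match names such as "layer.230".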
