# <a name="0">Machine Learning Accelerator - Natural Language Processing - Lecture 3</a>
## Fine-tuning BERT for the Product Review Problem - Classify Product Reviews as Positive or Not

Let's fine-tune the BERT model to classify our product reviews. We will install a new library __transformers__ and get a pre-trained BERT model from it. We are following [this tutorial](https://huggingface.co/docs/transformers/training#train-in-native-pytorch) from the HuggingFace framework.

We are using a lighter version of the original BERT model called __"DistilBERT"__. You can check out [their paper](https://arxiv.org/pdf/1910.01108.pdf) for more details.

__Keep in mind that BERT and its variants use more resources than the other models we learned so far: recurrent neural networks, LSTMs etc. You may run out of memory sometimes. If that happens, you can restart the kernel (Kernel->Restart from the top menu), reduce the batch size and re-run the code.__

In [1]:
!pip install -q -r ../../requirements.txt

In [2]:
import time
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from transformers import Trainer, TrainingArguments, DistilBertForSequenceClassification, DistilBertTokenizerFast
from torch.utils.data import DataLoader
from datasets import load_metric


Let's read the dataset.

In [3]:
df = pd.read_csv("../../data/examples/NLP-REVIEW-DATA-CLASSIFICATION-TRAINING.csv")

Let's print the dataset information.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56000 entries, 0 to 55999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          56000 non-null  int64  
 1   reviewText  55990 non-null  object 
 2   summary     55988 non-null  object 
 3   verified    56000 non-null  bool   
 4   time        56000 non-null  int64  
 5   log_votes   56000 non-null  float64
 6   isPositive  56000 non-null  float64
dtypes: bool(1), float64(2), int64(2), object(2)
memory usage: 2.6+ MB


We drop the rows where the review text is missing.

In [5]:
df.dropna(subset=["reviewText"], inplace=True)

BERT requires substantial compute. To keep this demo fast, we only use the first 1,000 data points.

In [6]:
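# Take an explicit copy so the column assignment in the next cell modifies this frame directly
df = df.head(1000).copy()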

We cast the label column to int64, as required by this library.

In [7]:
df["isPositive"] = df["isPositive"].astype("int64")

Let's keep 10% of the data for validation.

In [8]:
# Hold out 10% of the data as a validation set.
# stratify keeps the positive/negative ratio the same in both splits.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["reviewText"].tolist(),
    df["isPositive"].tolist(),
    test_size=0.10,
    shuffle=True,
    random_state=324,
    stratify=df["isPositive"].tolist(),
)
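To make the effect of `stratify` concrete, here is a minimal plain-Python sketch of a stratified split. It is only an illustration of the idea, not scikit-learn's implementation; the function name `stratified_split` and the toy labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, val_idx) with roughly the same class ratio in both parts."""
    rng = random.Random(seed)

    # Group the example indices by their label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle inside each class, then carve off the same fraction from each class.
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Toy labels (hypothetical): 60% positive, 40% not positive.
toy_labels = [1] * 600 + [0] * 400
train_idx, val_idx = stratified_split(toy_labels, test_fraction=0.10, seed=324)

val_counts = Counter(toy_labels[i] for i in val_idx)
print(len(train_idx), len(val_idx))        # sizes of the two parts
print(val_counts[1] / len(val_idx))        # share of positives in the validation part
```

In this toy setup the validation part keeps the 60/40 ratio of the full data, which is the property we rely on when comparing validation accuracy across runs.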

Let's load the tokenizer that matches our DistilBERT checkpoint.

In [9]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts,
                            truncation=True,
                            padding=True)
val_encodings = tokenizer(val_texts,
                          truncation=True,
                          padding=True)

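The two options passed to the tokenizer, `truncation=True` and `padding=True`, can be pictured with a small plain-Python sketch. This is a simplified illustration under stated assumptions, not the tokenizer's actual code: the helper name `pad_and_mask` is hypothetical, the middle token IDs are arbitrary, and real truncation also keeps the closing `[SEP]` token, which this sketch does not attempt.

```python
def pad_and_mask(batch_ids, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all of them to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Toy batch: 101 and 102 are the [CLS] and [SEP] IDs seen in this notebook's outputs.
toy_batch = [[101, 1045, 102], [101, 2307, 999, 1012, 102]]
ids, mask = pad_and_mask(toy_batch, max_length=512)
print(ids)   # [[101, 1045, 102, 0, 0], [101, 2307, 999, 1012, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

This is why the encoded reviews printed later in the notebook all have the same length and end in a run of zeros.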

Next, we wrap the encodings and labels in a PyTorch `Dataset`.

In [10]:
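# Compute device used in __getitem__ (the training cell below sets the same value)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class ReviewDataset(torch.utils.data.Dataset):
    """Map-style dataset that pairs tokenizer encodings with labels."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One example: the encoding tensors (input_ids, attention_mask) plus its label
        item = {key: torch.tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx]).to(device)
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = ReviewDataset(train_encodings, train_labels)
val_dataset = ReviewDataset(val_encodings, val_labels)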

Let's load the pre-trained model. This may print a warning that the classifier weights are newly initialized. That is expected: those are the weights we train below.

In [11]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased",
                                                            num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let's start the fine-tuning process. This code may take __a long time__ to complete with large datasets.

In [12]:
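# Freeze the encoder weights and train only the classification head.
# Both "pre_classifier" and "classifier" parameter names contain "classifier", so both stay trainable.
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False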

# Hyperparameters
num_epochs = 10
learning_rate = 0.01

# Get the compute device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create data loaders
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8, drop_last=True)
eval_dataloader = DataLoader(val_dataset, batch_size=8, drop_last=True)

# Setup the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

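# Note: load_metric is deprecated in recent releases of datasets;
# the evaluate library offers evaluate.load("accuracy") as its replacement.
metric = load_metric("accuracy")

model = model.to(device)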

for epoch in range(num_epochs):
    start = time.time()
    training_loss = 0
    val_loss = 0
    # Training loop starts
    model.train() # put the model in training mode
    for batch in train_dataloader:
        # below: ** allows us to pass multiple arguments to model()
        outputs = model(**batch)
        loss = outputs.loss
        training_loss += loss.item()
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()
    
    # Validation loop starts
    model.eval() # put the model in prediction mode
    for batch in eval_dataloader:
        with torch.no_grad():
            # below:  ** allows us to pass multiple arguments to model()
            outputs = model(**batch)
        loss = outputs.loss
        val_loss += loss.item()
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])
        
    # Let's take the average losses
    training_loss = training_loss / len(train_dataloader)
    val_loss = val_loss / len(eval_dataloader)
    end = time.time()
    
    print(f"Epoch {epoch}. Train_loss {training_loss:.4f}. Val_loss {val_loss:.4f}. \
    Val_accuracy {metric.compute()['accuracy']:.4f}. Seconds {end-start:.3f}.")

Epoch 0. Train_loss 0.6449. Val_loss 0.6209.     Val_accuracy 0.6354. Seconds 16.251.
Epoch 1. Train_loss 0.6088. Val_loss 0.5848.     Val_accuracy 0.7083. Seconds 15.833.
Epoch 2. Train_loss 0.5624. Val_loss 0.5490.     Val_accuracy 0.7917. Seconds 16.102.
Epoch 3. Train_loss 0.5192. Val_loss 0.5114.     Val_accuracy 0.7604. Seconds 16.351.
Epoch 4. Train_loss 0.4882. Val_loss 0.4759.     Val_accuracy 0.8125. Seconds 16.608.
Epoch 5. Train_loss 0.4640. Val_loss 0.4762.     Val_accuracy 0.7396. Seconds 16.499.
Epoch 6. Train_loss 0.4262. Val_loss 0.4507.     Val_accuracy 0.8125. Seconds 16.262.
Epoch 7. Train_loss 0.4342. Val_loss 0.4221.     Val_accuracy 0.8229. Seconds 16.190.
Epoch 8. Train_loss 0.4104. Val_loss 0.4194.     Val_accuracy 0.8333. Seconds 16.172.
Epoch 9. Train_loss 0.4043. Val_loss 0.4843.     Val_accuracy 0.7917. Seconds 16.183.
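A note on how to read these numbers: because the validation loader uses `drop_last=True`, the incomplete final batch is skipped. Assuming the 1,000-row subset and the 10% split used above, that leaves 96 of the 100 validation reviews, so every reported accuracy is a multiple of 1/96. The sketch below reproduces this arithmetic in plain Python; the helper names `num_batches` and `accuracy` are assumptions made for illustration and are not part of PyTorch or `datasets`.

```python
def num_batches(n_samples, batch_size, drop_last):
    """Number of batches a loader yields for n_samples examples."""
    full, remainder = divmod(n_samples, batch_size)
    if drop_last or remainder == 0:
        return full
    return full + 1


def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


train_batches = num_batches(900, batch_size=8, drop_last=True)
val_batches = num_batches(100, batch_size=8, drop_last=True)
evaluated = val_batches * 8

print(train_batches)         # 112 training batches per epoch
print(val_batches, evaluated)  # 12 batches, 96 reviews actually evaluated
print(round(61 / 96, 4))     # 0.6354, the epoch 0 validation accuracy

# Toy check of the accuracy helper on made-up labels.
print(accuracy([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
```

Setting `drop_last=False` on the evaluation loader would score all 100 validation reviews instead.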


### Looking at what's going on

The fine-tuned model classifies most, but not all, of the validation reviews correctly: validation accuracy peaks at about 0.83 and ends at about 0.79 after the last epoch. Let's print some of the data to see what is happening with it.

In [13]:
k = 0
print(len(val_dataset.encodings["input_ids"][k]))
print(val_dataset.encodings["input_ids"][k])
print(val_texts[k])
print(val_labels[k])

512
[101, 1045, 4149, 2023, 2138, 6881, 3769, 1011, 2039, 21628, 2015, 2020, 4760, 2039, 2006, 2026, 12191, 1012, 2023, 10770, 3036, 4031, 2134, 1005, 1056, 2131, 9436, 1997, 2068, 1010, 2061, 1045, 2001, 9364, 1010, 2021, 2009, 2052, 3796, 1996, 4180, 1012, 2061, 2009, 4066, 1997, 2499, 1010, 2021, 1045, 4299, 2009, 2071, 4550, 2039, 2026, 3274, 2062, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [14]:
k = 24
print(len(val_dataset.encodings["input_ids"][k]))
print(val_dataset.encodings["input_ids"][k])
print(val_texts[k])
print(val_labels[k])

512
[101, 1045, 2031, 2109, 2119, 22432, 7959, 2063, 1998, 10770, 1998, 2044, 2383, 2109, 2023, 4007, 2005, 2058, 1037, 2095, 2085, 1045, 2079, 2025, 2933, 2000, 2689, 1012, 2009, 2515, 2025, 4030, 2091, 2026, 3274, 1012, 1045, 2224, 2009, 2006, 2026, 7473, 1998, 14960, 1012, 2009, 2038, 7420, 2033, 1997, 4022, 4795, 4773, 4573, 2008, 1996, 2060, 2048, 2106, 2025, 1998, 2009, 2003, 16286, 21125, 1012, 6581, 4007, 2005, 4274, 3036, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Let's observe in more detail how sentences are tokenized.

In [15]:
st = val_texts[24]
print(st)
tok = tokenizer(st, truncation=True, padding=True)
print(tok)

I have used both McAfee and Norton and after having used this software for over a year now I do not plan to change. It does not slow down my computer. I use it on my PC and notebook. It has warned me of potential dangerous web sites that the other two did not and it is reasonably priced. Excellent software for internet security.
{'input_ids': [101, 1045, 2031, 2109, 2119, 22432, 7959, 2063, 1998, 10770, 1998, 2044, 2383, 2109, 2023, 4007, 2005, 2058, 1037, 2095, 2085, 1045, 2079, 2025, 2933, 2000, 2689, 1012, 2009, 2515, 2025, 4030, 2091, 2026, 3274, 1012, 1045, 2224, 2009, 2006, 2026, 7473, 1998, 14960, 1012, 2009, 2038, 7420, 2033, 1997, 4022, 4795, 4773, 4573, 2008, 1996, 2060, 2048, 2106, 2025, 1998, 2009, 2003, 16286, 21125, 1012, 6581, 4007, 2005, 4274, 3036, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [16]:
# The mapped vocabulary is stored in tokenizer.vocab
tokenizer.vocab_size

30522

In [17]:
# The methods convert_ids_to_tokens and convert_tokens_to_ids let us see how sentences are tokenized
print(tokenizer.convert_ids_to_tokens(tok['input_ids']))

['[CLS]', 'i', 'have', 'used', 'both', 'mca', '##fe', '##e', 'and', 'norton', 'and', 'after', 'having', 'used', 'this', 'software', 'for', 'over', 'a', 'year', 'now', 'i', 'do', 'not', 'plan', 'to', 'change', '.', 'it', 'does', 'not', 'slow', 'down', 'my', 'computer', '.', 'i', 'use', 'it', 'on', 'my', 'pc', 'and', 'notebook', '.', 'it', 'has', 'warned', 'me', 'of', 'potential', 'dangerous', 'web', 'sites', 'that', 'the', 'other', 'two', 'did', 'not', 'and', 'it', 'is', 'reasonably', 'priced', '.', 'excellent', 'software', 'for', 'internet', 'security', '.', '[SEP]']
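Notice how "McAfee" was split into `'mca', '##fe', '##e'`. WordPiece tokenizers split an unknown word greedily, always taking the longest piece found in the vocabulary, and mark continuation pieces with `##`. The sketch below imitates that rule in plain Python; the function `wordpiece` and the tiny `toy_vocab` are assumptions made for illustration, while the real tokenizer uses its 30,522-entry vocabulary and extra normalization steps.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (hypothetical), just large enough for the examples below.
toy_vocab = {"mca", "##fe", "##e", "norton", "software"}

print(wordpiece("mcafee", toy_vocab))  # ['mca', '##fe', '##e']
print(wordpiece("norton", toy_vocab))  # ['norton']
print(wordpiece("xyz", toy_vocab))     # ['[UNK]']
```

Splitting rare words into pieces is what lets a fixed vocabulary cover brand names and typos that never appeared as whole tokens.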


### Getting predictions on the test data and saving results
* Read the test data
* Pass the data into your pipeline and make predictions

In [18]:
# Read the test data (it has no isPositive label; that is what we are trying to predict)
df_test = pd.read_csv("../../data/examples/NLP-REVIEW-DATA-CLASSIFICATION-TEST.csv")
df_test.head()

|   | ID | reviewText | summary | verified | time | log_votes |
|---|---|---|---|---|---|---|
| 0 | 33276 | I've been using greeting card software for wel... | Absolutely awful. | False | 1300233600 | 0.0 |
| 1 | 20859 | This version worked well for me, have upgraded... | Good for virtual machine on a mac | True | 1448755200 | 0.0 |
| 2 | 63500 | Great! | Five Stars | True | 1456963200 | 0.0 |
| 3 | 4950 | I can assure you that any five star review was... | SCAM | False | 1400803200 | 2.197225 |
| 4 | 26509 | Overall the product really seems the same but ... | Has potential but many glitches and really the... | False | 1419206400 | 0.0 |


In [19]:
df_test.isna().sum()

ID            0
reviewText    1
summary       2
verified      0
time          0
log_votes     0
dtype: int64

In [20]:
df_test["reviewText"] = df_test["reviewText"].fillna(value='')

Below, we only consider the first 100 test data points to keep this short. Use the whole test dataset if you want to apply this to your final project.

In [21]:
test_texts = df_test["reviewText"].tolist()[0:100]

In [22]:
test_encodings = tokenizer(test_texts,
                          truncation=True,
                          padding=True)

In [23]:
test_dataset = ReviewDataset(test_encodings, [0]*len(test_texts))

In [24]:
test_dataloader = DataLoader(test_dataset, batch_size=4)
test_predictions = []
model.eval()
for batch in test_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    test_predictions.extend(predictions.cpu().numpy())
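The loop above keeps only the index of the largest logit. If you also want a rough confidence for each prediction, you can pass the logits through a softmax, which does not change which class wins. Here is a minimal plain-Python sketch; the `softmax` helper and the `toy_logits` values are made up for illustration and are not outputs of the model.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits (hypothetical) for two reviews: [score for class 0, score for class 1].
toy_logits = [[-0.3, 1.2], [0.8, -0.5]]

predicted_labels = []
for row in toy_logits:
    probs = softmax(row)
    label = max(range(len(row)), key=row.__getitem__)  # same choice as argmax on the logits
    predicted_labels.append(label)
    print(label, [round(p, 3) for p in probs])

print(predicted_labels)  # [1, 0]
```

With PyTorch tensors, `torch.softmax(logits, dim=-1)` computes the same quantity for a whole batch at once.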

In [25]:
k = 0
print(len(test_dataset.encodings["input_ids"][k]))
print(test_dataset.encodings["input_ids"][k])
print(test_texts[k])
# Print the predicted label for this review (1 = positive, 0 = not positive)
print(test_predictions[k])

512
[101, 1045, 1005, 2310, 2042, 2478, 14806, 4003, 4007, 2005, 2092, 2058, 2184, 2086, 1998, 1045, 1005, 2310, 2196, 2272, 2408, 1037, 4013, 22864, 2061, 3697, 1010, 2065, 2025, 5263, 1010, 2000, 2224, 1012, 2000, 2707, 1010, 1996, 8128, 2681, 2307, 8198, 2073, 2592, 2323, 2022, 1012, 1996, 3784, 2490, 2008, 1045, 7303, 2393, 2013, 2196, 5838, 1012, 2017, 2064, 2069, 2147, 2006, 2028, 3931, 2012, 1037, 2051, 2029, 2965, 2043, 12697, 1996, 2503, 3659, 2006, 1037, 4003, 1010, 2017, 2064, 1005, 1056, 2156, 2119, 5530, 7453, 1012, 1045, 2145, 4033, 1005, 1056, 2042, 2583, 2000, 7523, 2129, 2000, 2079, 2048, 1011, 11536, 8021, 1998, 2045, 1005, 1055, 2498, 1999, 1996, 8128, 2000, 2393, 2033, 2041, 1012, 9343, 2023, 4031, 2001, 1037, 4121, 6707, 1012, 2022, 8059, 1012, 5125, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

Again, we use only the first 100 test data points below. Use the full test set for your final project.

In [26]:
result_df = pd.DataFrame()
result_df["ID"] = df_test["ID"][0:100]
result_df["isPositive"] = test_predictions

result_df.to_csv("result_day3_bert.csv", encoding='utf-8', index=False)