<a href="https://colab.research.google.com/github/fdupoAMF/IVADO_LLM_Application_Course/blob/main/intention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Hands-on Session 1:**
### **Intention Detection for Technical Forum Posts**


#### Task: Classify technical posts into seven intention categories: `Discrepancy`, `Errors`, `Review`, `Conceptual`, `Learning`, `How-to`, and `Other`.


#### **Objective**
- Gain hands-on experience with using and tuning a Large Language Model (LLM) through a concrete example.
- Develop a clearer understanding of the BERT structure and the fine-tuning process.
- Learn how to add task-specific structures, such as a classification head, to a BERT model.
- Understand how to freeze a portion of the parameters in a BERT model.
- Compare different pre-trained variants of BERT to understand their strengths and applications.

*To speed up training, enable GPU support in Colab:*

* `Edit`-> `Notebook Settings` -> `T4 GPU` (or other available) -> `Save`

#### **Downloading dependencies & loading packages**

In [1]:
path_files = 'https://github.com/mooselab/llm_training_supplimentary_intention/archive/refs/tags/release.zip'
!wget $path_files
# Decompress zipped files
!unzip release.zip

--2024-05-30 18:05:26--  https://github.com/mooselab/llm_training_supplimentary_intention/archive/refs/tags/release.zip
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/mooselab/llm_training_supplimentary_intention/zip/refs/tags/release [following]
--2024-05-30 18:05:27--  https://codeload.github.com/mooselab/llm_training_supplimentary_intention/zip/refs/tags/release
Resolving codeload.github.com (codeload.github.com)... 20.205.243.165
Connecting to codeload.github.com (codeload.github.com)|20.205.243.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘release.zip’

release.zip             [ <=>                ] 527.93K  --.-KB/s    in 0.009s  

2024-05-30 18:05:27 (56.3 MB/s) - ‘release.zip’ saved [540601]

Archive:  release.zip
b43a50ea209b2b1b46f10000d1764

In [2]:
!pip install readability
!pip install torchview
!python -m nltk.downloader punkt
!pip install torchviz
# include the utils
base_dir = './llm_training_supplimentary_intention-release'

# Add the package path to sys.path
import sys
if base_dir not in sys.path:
    sys.path.append(base_dir)

import torch
from torch import cuda
from torch.utils.data import Dataset, DataLoader
import torchvision
import torchviz
from torchview import draw_graph
import numpy as np
from transformers import *
from sklearn.metrics import classification_report
import os
import nltk
from utils import *
from tqdm.notebook import tqdm

# Download and install some necessary packages/resources...
nltk.download('vader_lexicon')

Collecting readability
  Downloading readability-0.3.1.tar.gz (34 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: readability
  Building wheel for readability (setup.py) ... done
  Created wheel for readability: filename=readability-0.3.1-py3-none-any.whl size=35460 sha256=85a4f62e0cf382fcd193455650ef28f813ea47a3dd35b781ec774126534d3d5c
  Stored in directory: /root/.cache/pip/wheels/05/07/4d/2e3a0aaba1713619a403e1a3c56e88a6fc12d753872b98771c
Successfully built readability
Installing collected packages: readability
Successfully installed readability-0.3.1
Collecting torchview
  Downloading torchview-0.2.6-py3-none-any.whl (25 kB)
Installing collected packages: torchview
Successfully installed torchview-0.2.6
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Collecting torchviz
  Downloading torchviz-0.0.2.tar.gz (4.9 kB)
  Preparing metadata (setup.py) ... done

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

#### **Loading & understanding the dataset**

First, we load the post dataset with intention annotations.

In [3]:
dataset_path = os.path.join(base_dir, 'dataset/intention_annotation_784.npy')
dataset = np.load(dataset_path, allow_pickle=True)

In [4]:
# Let's look at one post
post_sample = dataset[1]
# fields contained in one post
print(post_sample.keys())

dict_keys(['label', 'id', 'title', 'description', 'description_raw', 'code', 'code_fea'])


Each key of the `dict` holds one attribute of a post:

- `label`: A list of intentions associated with the post. Each post has one or more intentions from the following categories: `Discrepancy`, `Errors`, `Review`, `Conceptual`, `Learning`, `How-to`, `Other`.
- `id`: A URL to the online post, which usually contains the unique `ID` of the post.
- `title`: The title of the post.
- `description`: The body of the post.
- `description_raw`: The raw HTML description of the post, which includes formatting such as paragraphs and links.
- `code`: A list of code snippets included in the post.
- `code_fea`: Features describing the categories of the code snippets. We use them as an additional input feature, but they are not the focus of this tutorial.
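
Because a post can carry several intentions at once, the `label` list is usually turned into a fixed-length multi-hot vector before training. The sketch below only illustrates that idea: `encode_labels` is a hypothetical helper, and the actual conversion inside the course utilities (`PostDataset`) may differ.

```python
# Category order assumed to match the one used in the evaluation code of this notebook.
CATEGORIES = ['Discrepancy', 'Errors', 'Review', 'Conceptual',
              'Learning', 'How-to', 'Other']

def encode_labels(labels):
    """Return a 7-dimensional multi-hot vector for a list of intention names."""
    unknown = set(labels) - set(CATEGORIES)
    if unknown:
        raise ValueError(f'Unknown intention(s): {sorted(unknown)}')
    # One slot per category: 1.0 if the post has that intention, else 0.0.
    return [1.0 if name in labels else 0.0 for name in CATEGORIES]

print(encode_labels(['Errors']))
print(encode_labels(['Errors', 'How-to']))
```

A vector with several ones is what makes this a multi-label problem rather than a plain 7-way classification.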

Let's look at a concrete example:

In [5]:
import pandas as pd
from google.colab import data_table

df = pd.DataFrame(list(post_sample.items()), columns=['Key', 'Value'])
data_table.DataTable(df, include_index=False)

| Key | Value |
|---|---|
| label | [Errors] |
| id | https://stackoverflow.com/questions/72557738 |
| title | Command "python setup.py egg_info" failed with... |
| description | I'm trying to set up mindsdb in local(visual s... |
| description_raw | `<p>I'm trying to set up mindsdb in local(visua...` |
| code | `[pip3 install mindsdb \n, Command &quot;python...` |
| code_fea | [0.0163803145125303, 0.013248739994429572, 0.1... |


#### **Splitting the dataset**

Now, we split the dataset into a training set and a test set with a ratio of 0.8/0.2.

In [6]:
from sklearn.model_selection import train_test_split
post_train, post_test = train_test_split(dataset, test_size=0.2, random_state=0)

⏰ Fine-tuning can be quite time-consuming.

So instead of training every model, each participant is randomly assigned one 'lucky' pre-trained model. With your lucky model in hand, let the comparison games begin!

#### **Loading a pre-trained model and its tokenizer**

In [7]:
model_list = ['bert-base', 'distilbert', 'roberta', 'albert', 'codebert', 'BERTOverflow']
lucky_number = np.random.randint(0, 100)
model_index = lucky_number%len(model_list)
print(f'Your lucky model is {model_list[model_index]}!')

Your lucky model is bert-base!


In [8]:
logging.set_verbosity_error()
# For each pre-trained BERT variant, we load its tokenizer and the model itself.
# The tokenizer was trained beforehand on a large text corpus; it splits text into tokens
# and converts them into the ids that the model takes as input.
if model_index == 0:
  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  ptm = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
elif model_index == 1:
  tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
  ptm = AutoModel.from_pretrained("distilbert-base-uncased", output_hidden_states=True)
  # ptm = DistilBertForSequenceClassification.from_pretrained(ptm, output_hidden_states=True, num_labels=768)
elif model_index == 2:
  tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
  ptm = RobertaModel.from_pretrained('roberta-base', output_hidden_states=True)
elif model_index == 3:
  tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
  ptm = AlbertModel.from_pretrained("albert-base-v2", output_hidden_states=True)
elif model_index == 4:
  tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
  ptm = AutoModel.from_pretrained("microsoft/codebert-base", output_hidden_states=True)
elif model_index == 5:
  tokenizer = AutoTokenizer.from_pretrained("jeniya/BERTOverflow")
  ptm = AutoModel.from_pretrained("jeniya/BERTOverflow")
else:
  raise ValueError('Invalid model index')


In [9]:
model_name = ptm.config.name_or_path
print("Model Name:", model_name)

Model Name: bert-base-uncased


#### **Construct the intention detection model with the PTM**

In [10]:
class IntentionBERT(torch.nn.Module):
    def __init__(self, fea_config, loss_fn, ptm=None):
        super(IntentionBERT, self).__init__()

        self.loss_fn = loss_fn

        # we also include some other textual features from the posts.
        # but they are not important in this training session.
        fea_dim = {
            'code_fea' : 5,
            'word_cnt' : 1,
            'readability' : 3,
            'sentiment' : 4
        }

        # BERT output
        self.dim_emb = 768*2 # title + body embeddings (768 is the dimension of BERT output)

        # Additional features
        self.dim_fea = 0
        for key,val in fea_config.items():
            if val:
                self.dim_fea = self.dim_fea + fea_dim[key]

        # l1 is the BERT itself
        if ptm is not None:
            self.l1 = ptm

        else:
            self.l1 = BertModel(BertConfig())

        self.l3 = torch.nn.Linear(self.dim_emb, 50) # fully connected layer, converting BERT output to 50 dimensions

        # Fully connected layer, concatenating the 50 dimensions from the previous step and the additional features, then output 7 dimensions (the seven classes)
        self.l4 = torch.nn.Linear(50 + self.dim_fea, 7)

        self.model_name = self.l1.config.name_or_path

    def forward(self, t_ids, t_mask, t_token_type_ids, d_ids, d_mask,
                d_token_type_ids, features, targets):

        ## Feed BERT with tokenized input and obtain the sentence representation (CLS token embedding)

        if self.model_name == 'distilbert-base-uncased':
            # Take the hidden state of [CLS] token.
            # The structure of distilbert is a little bit different from others.
            # There is no pooler_output. We take the equivalent ['last_hidden_state'][:, 0, :] as the output.
            output_title= self.l1(t_ids , attention_mask = t_mask)['last_hidden_state'][:, 0, :]
            output_desc= self.l1(d_ids, attention_mask = d_mask)['last_hidden_state'][:, 0, :]
        else:
            output_title= self.l1(t_ids , attention_mask = t_mask , token_type_ids = t_token_type_ids)['pooler_output']
            output_desc= self.l1(d_ids, attention_mask = d_mask, token_type_ids = d_token_type_ids)['pooler_output']

        bert_emb = torch.cat((output_title, output_desc), dim=1) # concatenating the title and description embeddings

        output_3 = self.l3(bert_emb) # fully connected layer, outputting 50 dimensions
        combined = torch.cat((output_3, features), dim=1) # concatenating the BERT outputs (condensed to 50 dimensions) and additional features
        output = self.l4(combined) # fully connected layer, output 7 dimensions (the seven classes)
        return output

    # Taking output probabilities for the seven classes and producing the labels
    def generate_predict_label(self, pred_prob):
        ret = []
        shape = pred_prob.shape
        n_pred = len(pred_prob)
        for i in range(n_pred):
            lb = np.array(pred_prob[i]>=0.5)
            if sum(lb)==0:
                lb = np.zeros(shape[1])
                lb[np.argmax(pred_prob[i])]=1
                lb = np.array(lb, dtype=bool)
            elif lb[-1] == 1:
                lb = np.zeros(shape[1])
                lb[-1] = 1
                lb = np.array(lb, dtype=bool)
            ret.append(lb)
        return np.array(ret)

    def get_prediction(self, dataloader):
        self.eval()
        with torch.no_grad():
            for _, data in enumerate(dataloader, 0):
                targets = data['targets']
                outputs = self.forward(**data)
                outputs = torch.sigmoid(outputs).cpu().detach().numpy()
                targets = targets.cpu().detach().numpy().tolist()
        target = np.array(targets[0], dtype=bool)
        pred_label = self.generate_predict_label(outputs)[0]
        return outputs[0], pred_label, target


    def predict_one_post(self, post):
        self.eval()
        with torch.no_grad():
            target = post['targets'].unsqueeze(dim=0)
            output = self.forward(post['t_ids'].unsqueeze(dim=0), post['t_mask'].unsqueeze(dim=0), post['t_token_type_ids'].unsqueeze(dim=0),
                                  post['d_ids'].unsqueeze(dim=0), post['d_mask'].unsqueeze(dim=0), post['d_token_type_ids'].unsqueeze(dim=0), post['features'].unsqueeze(dim=0), post['targets'].unsqueeze(dim=0))
            output = torch.sigmoid(output).cpu().detach().numpy()
            target = target.cpu().detach().numpy().tolist()
        pred_label = self.generate_predict_label(output)[0]
        return output[0], pred_label, target

    def evaluation(self, epoch, dataloader):
        self.eval()
        with torch.no_grad():
            total_loss = 0
            fin_targets=[]
            fin_outputs=[]
            cnt = 0
            for _, data in enumerate(dataloader, 0):
                cnt = cnt + len(data['targets'])
                targets = data['targets'] #.to(self.device, dtype = torch.float)
                # for k,v in data.items():
                #     data[k] = v.to(self.device, dtype = torch.long)
                outputs = self.forward(**data)
                loss = self.loss_fn(outputs, targets)
                total_loss = total_loss + loss.item()
                fin_targets.extend(targets.cpu().detach().numpy().tolist())
                fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
            loss = total_loss/cnt

        fin_outputs_lb = self.generate_predict_label(np.array(fin_outputs))
        report = classification_report(
        fin_targets,
        fin_outputs_lb,
        output_dict=True,
        target_names=['Discrepancy', 'Errors', 'Review', 'Conceptual', 'Learning', 'How-to', 'Other'],
        zero_division = 0
        )
        cnt = 0
        for i in range(len(fin_outputs_lb)):
            if (np.logical_and(fin_outputs_lb[i], fin_targets[i])).any()==True:
                cnt=cnt+1
        return {
            'epoch': epoch,
            'loss': round(loss, 3),
            'precision': round(report['micro avg']['precision'], 3),
            'recall': round(report['micro avg']['recall'], 3),
            'f1': round(report['micro avg']['f1-score'], 3),
            'at_1': round(cnt/len(fin_outputs_lb), 3),
            # 'prediction': fin_outputs,
            # 'groundtruth': fin_targets
        }
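
The decision rule in `generate_predict_label` can be read as three steps: keep every class whose probability is at least 0.5; if no class passes, fall back to the single most probable class; and if the last class (`Other`) passes, let it override all the others. Below is a standalone NumPy sketch of the same rule for one row of probabilities. `decide` is a hypothetical name, and the probabilities are made up for illustration.

```python
import numpy as np

def decide(prob, threshold=0.5):
    """Turn one row of per-class probabilities into a boolean label vector."""
    prob = np.asarray(prob, dtype=float)
    label = prob >= threshold
    if not label.any():
        # No class is confident enough: fall back to the single best class.
        label = np.zeros_like(label)
        label[np.argmax(prob)] = True
    elif label[-1]:
        # 'Other' (the last class) is treated as exclusive.
        label = np.zeros_like(label)
        label[-1] = True
    return label

print(decide([0.7, 0.6, 0.1, 0.1, 0.1, 0.1, 0.1]).astype(int))  # two classes pass
print(decide([0.2, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1]).astype(int))  # fallback to argmax
print(decide([0.7, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9]).astype(int))  # 'Other' overrides
```

The fallback guarantees that every post receives at least one label, which is why the `at_1` score in `evaluation` is always defined.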

#### **Some preparations for training**

In [11]:
device = 'cuda' if cuda.is_available() else 'cpu'

# Switches for the additional textual features.
# Here, we turn on all of them.
fea_config = {
        'code_fea' : True,
        'word_cnt' : True,
        'readability' : True,
        'sentiment' : True
    }

# We use the Binary Cross Entropy loss with logits, which combines a sigmoid layer and the binary cross-entropy loss in a single class.
loss_fn = torch.nn.BCEWithLogitsLoss()
max_len = 256
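
To see what "combines a sigmoid layer and the binary cross-entropy loss" means, the following sketch computes the per-element loss in two ways using only the standard library: a two-step version (sigmoid, then cross-entropy) and the fused, numerically stable form. The function names and the sample logits are illustrative assumptions, not part of PyTorch; `BCEWithLogitsLoss` additionally averages the per-element values by default.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probability(z, y):
    """Two-step version: apply the sigmoid first, then binary cross-entropy."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(z, y):
    """Fused, numerically stable version: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

logits = [2.0, -1.0, 0.5]    # raw model outputs for three classes (made up)
targets = [1.0, 0.0, 1.0]    # multi-hot ground truth

two_step = [bce_from_probability(z, y) for z, y in zip(logits, targets)]
fused = [bce_with_logits(z, y) for z, y in zip(logits, targets)]
print(two_step)
print(fused)
print(sum(fused) / len(fused))  # mean reduction over the elements
```

Both lists agree up to floating-point error. The fused form avoids taking `log` of a probability that has rounded to 0 or 1 for very large logits, which is the main reason to feed raw logits to the loss and apply `torch.sigmoid` only at prediction time, as this notebook does.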

In [12]:
# Let's load the data with Dataset and Dataloader classes.
training_set = PostDataset(post_train, tokenizer, max_len, fea_config, device)
test_set = PostDataset(post_test, tokenizer, max_len, fea_config, device)

train_params = {'batch_size': 16,
                        'shuffle': True,
                        'num_workers': 0
                        }

train_loader = DataLoader(training_set, **train_params)


test_params = {'batch_size': 4,
                        'shuffle': False,
                        'num_workers': 0
                        }

test_loader = DataLoader(test_set, **test_params)

#### **The definition of the training class**

We define a class to help us train the model.

Using it is not mandatory; you can write your own training code.

We also set different learning rates for different components of the model. We got the learning rates by trial and error. ✨

In [13]:
class Train():
    def __init__(self, model, train_loader, test_loader, device):
        self.model = model
        self.device = device
        self.train_loader = train_loader
        self.test_loader = test_loader
        self.optimizer = torch.optim.Adam(params = self.model.parameters())
        return

    def train_one_epoch(self, epoch):
        self.model.train()
        total_loss = 0
        cnt = 0
        for _, data in enumerate(self.train_loader, 0):
            cnt = cnt + len(data['targets'])
            targets = data['targets'].to(self.device, dtype = torch.float)
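            for k,v in data.items():
                # Token ids and masks are integers; features and targets must stay float,
                # otherwise the additional textual features are truncated to integers.
                dtype = torch.float if k in ('features', 'targets') else torch.long
                data[k] = v.to(self.device, dtype = dtype)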
            outputs = self.model(**data)
            loss = self.model.loss_fn(outputs, targets)
            total_loss = total_loss + loss.item()
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
        print(f'Epoch: {epoch}, Loss:  {total_loss/cnt}')
        return total_loss/cnt

    def train(self, n_epochs=1000, lr_bert = 1e-03, lr_fc = 1e-05):
        # Here, we set different learning rates for different layers.
        self.optimizer = torch.optim.Adam([
                {'params': self.model.l1.parameters(),'lr': lr_bert},
                {'params': self.model.l3.parameters(), 'lr': lr_fc},
                {'params': self.model.l4.parameters(), 'lr': lr_fc}
        ])

        for epoch in tqdm(range(n_epochs)):
            loss = self.train_one_epoch(epoch)
            # Every 5 epochs, we check the performance on the test set
            if epoch%5==0:
                report = self.model.evaluation(epoch, self.test_loader)
                print(report)

        return loss, report

    def evaluation(self):
        ret = self.model.evaluation(0, self.test_loader)
        return ret
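
The per-component learning rates in `Train.train` rely on PyTorch's per-parameter-group options: each dictionary passed to the optimizer forms one group with its own `lr`. The sketch below shows the mechanism on two toy layers. `backbone` and `head` are stand-ins chosen for illustration, and the learning rates simply mirror the defaults used above.

```python
import torch

# Toy stand-ins (assumptions): 'backbone' plays the role of the PTM (l1),
# 'head' the role of the fully connected layers (l3, l4).
backbone = torch.nn.Linear(8, 4)
head = torch.nn.Linear(4, 7)

optimizer = torch.optim.Adam([
    {'params': backbone.parameters(), 'lr': 1e-3},
    {'params': head.parameters(), 'lr': 1e-5},
])

# Each group keeps its own learning rate and its own list of parameters.
for i, group in enumerate(optimizer.param_groups):
    n_params = sum(p.numel() for p in group['params'])
    print(f'group {i}: lr={group["lr"]}, parameters={n_params}')
```

Inspecting `optimizer.param_groups` like this is a quick way to check that every layer ended up with the learning rate you intended.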

#### **Understand the model structure**

- Visualization of the intention detection model and the BERT variant it wraps.

In [14]:
# Instantiate the intention detection framework.
# Here, we pass our lucky pre-trained BERT variant to the intention detection classification model.
model = IntentionBERT(fea_config, loss_fn, ptm = ptm)

In [None]:
model.to('cpu')
# Visualize the model
for _, data in enumerate(test_loader, 0):
    sample_input = data
    # model(**sample_input)
    break

def wrapped_forward(inputs):
    return model(inputs['t_ids'].to('cpu'),
                 inputs['t_mask'].to('cpu'),
                 inputs['t_token_type_ids'].to('cpu'),
                 inputs['d_ids'].to('cpu'),
                 inputs['d_mask'].to('cpu'),
                 inputs['d_token_type_ids'].to('cpu'),
                 inputs['features'].to('cpu'),
                 inputs['targets'].to('cpu'))

# Perform a forward pass
output = wrapped_forward(sample_input)

dot = torchviz.make_dot(output, params=dict(model.named_parameters()))

# Render the graph
dot.render("model_graph", format="png")

# Display the image
from IPython.display import Image
Image("model_graph.png")

#### **What are the parameters the model has?**

In [None]:
# This function prints the layers/parameters of the classification model.
# The True/False after each name indicates whether that parameter will be updated during backpropagation.
def show_parameters(model):
    for k,v in model.named_parameters():
        print('{}: {}'.format(k, v.requires_grad))

In [None]:
show_parameters(model)
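
The listing printed by `show_parameters` is long, so a compact summary can help when comparing freezing settings. The helper below is a hypothetical addition, not part of the course utilities; it counts how many scalar parameters are trainable, and is demonstrated on a toy module rather than on the real model.

```python
import torch

def count_parameters(model):
    """Return (trainable, total) numbers of scalar parameters of a torch module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total

# Toy module for illustration: 10*5 + 5 = 55 and 5*2 + 2 = 12 parameters.
toy = torch.nn.Sequential(torch.nn.Linear(10, 5), torch.nn.Linear(5, 2))
print(count_parameters(toy))

# Freeze the first layer and count again.
for p in toy[0].parameters():
    p.requires_grad = False
print(count_parameters(toy))
```

Calling `count_parameters(model)` on the intention detection model before and after freezing shows how much of the network is actually being trained.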

#### **Freezing the parameters**

Next, we freeze some of these layers/parameters with the function below.

We pre-define two freezing strategies for the PTM and leave a third one for you to define:

*   Type 1: The pre-trained BERT model is completely frozen.

*   Type 2: Only the pooler layer is left unfrozen.

*   Type 3 (to be defined by yourself!): Design your own freezing scheme based on the structure of your lucky PTM. ;)

You can compare the performance when carrying out different freezing strategies.

In [None]:
def requires_grad_setting(model, type = 1, verbose = True):
    # Enable grad for all
    for p in model.parameters():
        p.requires_grad = True

    if type == 1:
        # Freeze all parameters in l1, l1 is defined as the PTM being used.
        for p in model.l1.parameters():
            p.requires_grad = False
    elif type == 2:
        # Only update the pooler layer in the PTM (freeze other layers)
        for k,v in model.l1.named_parameters():
            if ('pooler' in k):
                v.requires_grad = True
            else:
                v.requires_grad = False
    elif type == 3:
        for k,v in model.l1.named_parameters():
            # Define your own freezing scheme by referring to the output of the last coding block.
            if ('pooler' in k) or ('encoder.layer.1' in k):
                v.requires_grad = False
            else:
                v.requires_grad = True

    if verbose:
        show_parameters(model)


In [None]:
# Freeze parameters with this function
requires_grad_setting(model, type = 1)
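
To check what freezing does during training, the following sketch runs one optimization step on a toy two-layer model with a frozen first layer. `backbone` and `head` are illustrative stand-ins for the PTM and the classification layers, and plain SGD is used only to keep the example short.

```python
import torch

torch.manual_seed(0)
backbone = torch.nn.Linear(4, 3)   # toy stand-in for the PTM (assumption)
head = torch.nn.Linear(3, 2)       # toy stand-in for the classification head

# Freeze the backbone, as in freezing type 1.
for p in backbone.parameters():
    p.requires_grad = False

backbone_before = backbone.weight.clone()
head_before = head.weight.clone()

all_params = list(backbone.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(all_params, lr=0.1)

# One training step on random input.
x = torch.randn(5, 4)
loss = head(backbone(x)).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(backbone.weight.grad)                          # frozen: no gradient is computed
print(torch.equal(backbone_before, backbone.weight)) # frozen weights are unchanged
print(torch.equal(head_before, head.weight))         # head weights have moved
```

Frozen parameters receive no gradient, so the optimizer leaves them untouched even though they were passed to it, and the backward pass is cheaper because it does not need to compute gradients for the frozen layers.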

#### **Model Training**

Finally, we train the model.

Compare the performance after an adequate number of epochs with your deskmates/classmates who have different freezing settings and lucky pre-trained BERT Models.

You can stop when the scores seem to stagnate, similar to [early stopping](https://en.wikipedia.org/wiki/Early_stopping#:~:text=In%20machine%20learning%2C%20early%20stopping,training%20data%20with%20each%20iteration.).
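
A common way to make "the scores seem to stagnate" precise is patience-based early stopping: stop once the monitored score has not improved for a given number of consecutive evaluations. The sketch below is a minimal, framework-free version. The function name, the `patience` and `min_delta` parameters, and the score history are all illustrative assumptions and not part of the `Train` class.

```python
def early_stopping_index(scores, patience=3, min_delta=0.0):
    """Return the index of the evaluation at which training would stop.

    `scores` holds one score per evaluation (higher is better). Training stops
    once `patience` consecutive evaluations fail to beat the best score so far
    by more than `min_delta`.
    """
    best = float('-inf')
    bad_rounds = 0
    for i, score in enumerate(scores):
        if score > best + min_delta:
            best = score
            bad_rounds = 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                return i
    return len(scores) - 1  # never triggered: training ran to the end

f1_history = [0.41, 0.52, 0.58, 0.57, 0.58, 0.56, 0.55]  # made-up numbers
print(early_stopping_index(f1_history, patience=3))
```

In this notebook the score is measured on the test set every 5 epochs, so each entry of the history would correspond to one printed report. For a rigorous experiment, a separate validation set is normally used for this decision so that the test set stays untouched.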

In [None]:
# Move the model to the GPU.
model.to(device)

# Instantiate the training class we implemented.
train_manager = Train(model, train_loader, test_loader, device)

In [None]:
train_manager.train(n_epochs=100, lr_bert= 1e-03, lr_fc = 1e-05)

#### **Check the performance of intention detection with a sample**

Let's try the trained model on a single post. Does it predict the intention(s) correctly?

In [None]:
# We select a random post from the test set.
post_id = np.random.randint(0, len(test_set))
post = post_test[post_id]
output = model.predict_one_post(test_set[post_id])

print("The post to be predicted:\n")
for key, value in post.items():
  if key in ['title', 'description', 'id', 'label' ]:
    print(f"{key}: {value}")

categories = ['Discrepancy', 'Errors', 'Review', 'Conceptual', 'Learning', 'How-to', 'Other']

print("\nThe predicted intention(s):")
for i in range(len(output[1])):
  if output[1][i]:
    print(f"{categories[i]}")