# Fine-Tuning an LLM for Legal Documents (flan-t5-small)

**In this notebook,** 

we will fine-tune the flan-t5-small LLM for legal document summarization.

For summarization we will use the [legal_summarization](https://huggingface.co/datasets/egalize/legal_summarization) dataset.
The dataset mainly contains terms and conditions, privacy policies, and legal notices for tech products, together with their summaries.

- First, we will use the pretrained flan-t5-small model to summarize the legal documents
  and score its output against the reference summaries provided with the dataset
  by calculating the ROUGE score.

- Then we will fine-tune the model on the training split provided with the dataset
  and calculate the ROUGE score for the fine-tuned model.

- Finally, we will compare the two scores to see whether the model's performance improved after fine-tuning.

In [30]:
from datasets import load_dataset
# Hugging Face library for downloading datasets directly from the Hub

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch

import time
import evaluate
import pandas as pd
import numpy as np

from termcolor import colored
#for printing in color

###  Load the Model and the dataset

In [31]:
model_name='google/flan-t5-small'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
# the model is loaded from huggingface
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load the tokenizer to tokenize the corpus

dataset_name = 'egalize/legal_summarization'
dataset = load_dataset(dataset_name)

dataset




DatasetDict({
    train: Dataset({
        features: ['original_text', 'reference_summary'],
        num_rows: 356
    })
    test: Dataset({
        features: ['original_text', 'reference_summary'],
        num_rows: 90
    })
})

#### Number of trainable parameters in the model

In [43]:
def trainable_parameters(model):
    trainable_params = 0
    all_params = 0
    for _, param in model.named_parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    return f"trainable model parameters: {trainable_params}\nall model parameters: {all_params}\npercentage of trainable model parameters: {100 * trainable_params / all_params:.2f}%"

print(trainable_parameters(model))

trainable model parameters: 76961152
all model parameters: 76961152
percentage of trainable model parameters: 100.00%


### Functions for prompt generation and getting output

In [32]:
def prompt_generator(text):
    prompt = f'''Briefly summarize this paragraph:\n{text}\nSummary:'''
    #print(colored('PROMPT>>', 'red'))
    #print(colored('*******************************', 'red'))
    #print(prompt)
    #print(colored('*******************************\n', 'red'))
    return prompt
 
def get_output(prompt):
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
            model.generate(
            inputs["input_ids"], 
            max_new_tokens=25,
        )[0], 
        skip_special_tokens=True
    )
    return output

# In get_output, the prompt is first tokenized.
# The output variable then holds the model's decoded response, produced in these steps:
# - model.generate() takes inputs['input_ids'], the numeric (token id) representation of the prompt.
# - max_new_tokens is set to 25, so the model generates at most 25 new tokens (tokens, not words).
# - skip_special_tokens is set to True, so special tokens are omitted from the decoded text.

### Generate a prompt and see the output

In [33]:
article = dataset['train'][1]['original_text']
summary = dataset['train'][1]['reference_summary']
print(colored('PROMPT>>', 'red'))
print(prompt_generator(article))
print()

prompt = prompt_generator(article)
print(colored('MODEL GENERATED>>', 'green'), get_output(prompt))
print(colored('Actual Summary>>', 'green'), summary)


PROMPT>>
Briefly summarize this paragraph:
for api clients that use their own avatar naming system in place of the user s google identity then you must make clear to users that their gameplay information will still be submitted to google and associated with their google identity and viewable within different google products.
Summary:

MODEL GENERATED>> api clients must make clear that their gameplay information will still be submitted to google and associated with their google identity and viewable
Actual Summary>> if using avatars usernames tell the user that their g identity will still be used by google.


### Calculate ROUGE SCORE

Generally speaking, ROUGE measures the similarity between the generated summary and the reference summary, based on overlapping words and word sequences. Scores range from 0 to 1; higher is better.

Note that the test split contains 90 summaries.
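
To make the idea of word overlap concrete, here is a minimal sketch of a simplified ROUGE-1 F1 score. It is an illustration only: `rouge1_f1` is a hypothetical helper, it splits on whitespace and does no stemming, so its numbers will not match the `evaluate` implementation used in this notebook.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens.

    Assumption: no stemming and no special tokenization, unlike the
    `evaluate` ROUGE metric used in the notebook.
    """
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Two of five predicted words appear in the three-word reference
print(rouge1_f1("the user must be told", "tell the user"))
```

Here precision is 2/5 and recall is 2/3, so a long generated summary that merely contains the reference words is still penalized for its extra words.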

In [34]:
rouge = evaluate.load('rouge')

model_generated_summary = []
for i in range(len(dataset['test'])):
    print(i, end = ' ')
    article = dataset['test'][i]['original_text']
    prompt = prompt_generator(article)
    model_generated_summary.append(get_output(prompt))
    
# The loop above collects the summaries generated by flan-t5-small
# for every example in the test split


result = rouge.compute(
    predictions=model_generated_summary,
    references=dataset['test']['reference_summary'],
    use_aggregator=True,
    use_stemmer=True,
)

result

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 

Token indices sequence length is longer than the specified maximum sequence length for this model (1064 > 512). Running this sequence through the model will result in indexing errors


21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 

{'rouge1': 0.15884607716620358,
 'rouge2': 0.04355273527390338,
 'rougeL': 0.13169214077239494,
 'rougeLsum': 0.1298386933909495}

### Fine Tuning Flan-T5-Small model for legal summarization

Transform the dataset for fine-tuning

In [35]:
def tokenize(example):
    start_prompt = 'Briefly summarize this paragraph:\n'
    end_prompt = '\nSummary:'
    prompt = [start_prompt + text + end_prompt for text in example["original_text"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["reference_summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return example

tokenized_dataset = dataset.map(tokenize, batched=True)

# map() is applied to every split of the DatasetDict, so train and test are both tokenized
# To see what the new dataset looks like, you can print one of its rows

# Uncomment the following line
#print(tokenized_dataset['train'][0])
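
In the `tokenize` function above, the labels are padded to the maximum length with the pad token, so the loss is also computed on padding positions. A common variation, not used in this notebook, replaces padding ids in the labels with -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists; `mask_padding` is a hypothetical helper, and whether it changes the ROUGE scores here has not been tested.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in seq]
        for seq in label_ids
    ]


# Toy batch: 0 stands in for the pad id (in the notebook: tokenizer.pad_token_id)
labels = [[71, 1520, 1, 0, 0], [5, 1, 0, 0, 0]]
print(mask_padding(labels, pad_token_id=0))
# [[71, 1520, 1, -100, -100], [5, 1, -100, -100, -100]]
```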



In [21]:
# Set the hyperparameters


output_dir = './legal-summary-training'
# Output folder for checkpoints and logs

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1  # a positive max_steps overrides num_train_epochs: only one step is run here
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test']
)
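
The training arguments above set `max_steps=1` alongside `num_train_epochs=1`. In `TrainingArguments`, a positive `max_steps` overrides `num_train_epochs`, so this local run performs a single optimizer step. The sketch below estimates how many steps a full epoch would take, assuming a single device, the default `per_device_train_batch_size` of 8, and no gradient accumulation. `optimizer_steps` is a hypothetical helper, not part of `transformers`, and it is an estimate rather than the Trainer's exact bookkeeping.

```python
import math


def optimizer_steps(num_examples, batch_size, num_epochs, grad_accum_steps=1, max_steps=-1):
    """Estimate the number of optimizer steps for a single-device run."""
    if max_steps > 0:
        # A positive max_steps takes precedence over the epoch count
        return max_steps
    steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum_steps))
    return steps_per_epoch * num_epochs


print(optimizer_steps(356, 8, 1, max_steps=1))  # 1
print(optimizer_steps(356, 8, 1))               # 45
```

With 356 training examples, one full epoch is roughly 45 steps, which is why a run limited to one step is not expected to change the model noticeably.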


In [None]:
# Start training
trainer.train()

#### Save and load the model

In [22]:
trainer.save_model("./tuned_model")

# I trained a model on Google Colab Pro for more epochs
# and am loading that model here for a better ROUGE score and a better understanding

tuned_model = AutoModelForSeq2SeqLM.from_pretrained("./tuned_model_colab", torch_dtype=torch.bfloat16)


def get_tuned_model_output(prompt):
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
            tuned_model.generate(
            inputs["input_ids"], 
            max_new_tokens=25,
        )[0], 
        skip_special_tokens=True
    )
    return output

In [26]:
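# Re-select the training example used earlier; `article` was overwritten by the evaluation loop
article = dataset['train'][1]['original_text']
summary = dataset['train'][1]['reference_summary']

print(colored('PROMPT>>>', 'red'))
print(prompt_generator(article))

prompt = prompt_generator(article)
print()
print(colored('TUNED MODEL GENERATED>>', 'green'), get_tuned_model_output(prompt))

print(colored('Actual Summary>>', 'green'), summary)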


PROMPT>>>
Briefly summarize this paragraph:
for api clients that use their own avatar naming system in place of the user s google identity then you must make clear to users that their gameplay information will still be submitted to google and associated with their google identity and viewable within different google products.
Summary:

TUNED MODEL GENERATED>> api clients must make clear that their gameplay information will still be submitted to google and associated with their google identity and viewable
Actual Summary>> if using avatars usernames tell the user that their g identity will still be used by google.


### Fine Tuned Model Rouge Score

In [29]:
tuned_model_generated_summary = []
for i in range(len(dataset['test'])):
    print(i, end = ' ')
    article = dataset['test'][i]['original_text']
    prompt = prompt_generator(article)
    tuned_model_generated_summary.append(get_tuned_model_output(prompt))
    
    
result = rouge.compute(
    predictions=tuned_model_generated_summary,
    references=dataset['test']['reference_summary'],
    use_aggregator=True,
    use_stemmer=True,
)

result

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 

{'rouge1': 0.1643423498997253,
 'rouge2': 0.043617406432127116,
 'rougeL': 0.13482787082209935,
 'rougeLsum': 0.13476832202447428}

### Analysis and discussion

ROUGE scores on the test set before and after fine-tuning (rounded to four decimals):

|Rouge | Before   | After |
|---| ---| --- |
|rouge1| 0.1588 | 0.1643 |
|rouge2| 0.0436 | 0.0436 |
|rougeL| 0.1317 | 0.1348 |
|rougeLsum| 0.1298 | 0.1348 |

<br><br>
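
To make the comparison explicit, the differences can be computed directly from the two result dictionaries printed earlier. This is a small sketch: `compare` is a hypothetical helper name, and the numbers are copied from the notebook outputs.

```python
# ROUGE results copied from the outputs of the two evaluation cells
before = {
    "rouge1": 0.15884607716620358,
    "rouge2": 0.04355273527390338,
    "rougeL": 0.13169214077239494,
    "rougeLsum": 0.1298386933909495,
}
after = {
    "rouge1": 0.1643423498997253,
    "rouge2": 0.043617406432127116,
    "rougeL": 0.13482787082209935,
    "rougeLsum": 0.13476832202447428,
}


def compare(before, after):
    """Return {metric: (absolute change, relative change in percent)}."""
    rows = {}
    for metric, base in before.items():
        delta = after[metric] - base
        rows[metric] = (delta, 100 * delta / base)
    return rows


for metric, (delta, pct) in compare(before, after).items():
    print(f"{metric:10s} {delta:+.4f} ({pct:+.1f}%)")
```

All four metrics move up, but each by less than 0.01 in absolute terms. No confidence intervals were computed on the 90 test examples, so this is better read as a small change than as a clear improvement.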

We trained the model on a small dataset: it has only 356 training examples. The main objective of this notebook is not to get a sophisticated output, but rather to learn how to fine-tune a model for a specific task. Given the limited data and computing resources, the process is more instructive here than the outcome.

Thank you.

#### Other notebooks on LLMs: [Large Language Models](https://github.com/fahimabrar/Large-Language-Model)
#### Find me on [LinkedIn](https://www.linkedin.com/in/abrar-fahim/)