## Get Insight from your Business Data - Build LLM application with Fine Tuning using Hugging Face - Ashish Kumar Jain

### To improve a model's performance on a particular task, we train the base model (a pre-trained model such as GPT-4) further on labeled data for that single task. The base LLM is already pre-trained on a vast amount of unstructured text via self-supervised learning, whereas fine-tuning is a supervised learning process in which we use a dataset of labeled examples to update the weights of the existing base LLM for a particular task. The training dataset contains prompt-completion pairs, and the fine-tuning process trains the model on this dataset to improve its ability to generate good completions for the specific task. Fine-tuning that updates all of the model's weights is called full fine-tuning.
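
#### To make "updating existing weights with labeled examples" concrete, here is a toy sketch. It is not an LLM: the one-weight model `y = w * x`, the starting weight, the data, and the function name `fine_tune` are all assumptions chosen for illustration. It only shows the supervised loop of predicting, measuring the error against a label, and nudging the weight.

```python
# Toy illustration only: a one-weight "model" y = w * x, not an LLM.
def fine_tune(w, labeled_pairs, learning_rate=0.01, epochs=50):
    """Update an existing weight with gradient descent on squared error."""
    for _ in range(epochs):
        for x, y_true in labeled_pairs:
            y_pred = w * x
            grad = 2 * (y_pred - y_true) * x  # derivative of (y_pred - y_true)^2 w.r.t. w
            w -= learning_rate * grad
    return w

pretrained_w = 1.0                                 # hypothetical "pre-trained" weight
task_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # labeled examples for the task y = 2x
tuned_w = fine_tune(pretrained_w, task_data)
print(pretrained_w, "->", round(tuned_w, 3))
```

#### Full fine-tuning of an LLM follows the same pattern, except that every one of the model's weights is updated and the labels are the expected completions.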

### In this notebook we will use Hugging Face, a platform where the machine learning community collaborates on models, datasets, and applications. We will use Hugging Face to download FLAN-T5, one of the open-source LLMs from Google, and then load the model from the local machine. You can easily download these models from Hugging Face by cloning the model repository. You can also download the General Knowledge dataset for training the model. This lets you run the code without internet access or in a very constrained environment. Downloading the model can take time, depending on your network speed.

In [None]:
! pip install --upgrade pip
! pip install 'transformers[torch]'
! pip install datasets
! pip install evaluate==0.4.0
! pip install rouge_score==0.1.2

#### Hugging Face provides the Datasets library for easily accessing and sharing datasets. We can load a dataset in a single line of code from multiple sources (the Hugging Face Hub, the local file system, memory, etc.) in different formats (CSV, JSON, Parquet, Arrow, SQL for reading from a database, etc.) and use its powerful data processing methods to quickly get the dataset ready for training an LLM.

In [None]:
from datasets import load_dataset

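# Map each split name to its local JSON file.
data_files = {
    "train": "dataset/GK/train.json",
    "validation": "dataset/GK/validation.json",
    "test": "dataset/GK/test.json",
}
# field="data" reads the records stored under the top-level "data" key.
dataset = load_dataset("json", data_files=data_files, field="data")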
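
#### The call above passes `field="data"`, and later cells use the columns `id`, `Question`, and `Answer`. A minimal sketch of a file layout consistent with that is shown below. The two records are made up for illustration, and the real GK files may carry additional fields; only the standard library is used so the layout can be inspected without the dataset.

```python
import json
import os
import tempfile

# Hypothetical records; the real GK files may contain other fields.
sample = {
    "data": [
        {"id": 1, "Question": "What is the capital of France?", "Answer": "Paris"},
        {"id": 2, "Question": "How many days are there in a leap year?", "Answer": "366"},
    ]
}

# Write the sample to a temporary file that mimics dataset/GK/train.json.
path = os.path.join(tempfile.mkdtemp(), "train.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

# The records live under the top-level "data" key, which is what field="data" selects.
with open(path, encoding="utf-8") as f:
    records = json.load(f)["data"]

print(records[0]["Question"], "->", records[0]["Answer"])
```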

#### Hugging Face provides tokenizer classes, which are in charge of preparing the inputs for a model. We will use the open-source FLAN-T5-Large model from Hugging Face and load it from the local file system. It is a good encoder-decoder instruct model that shows good capability in many tasks.

#### You can easily download flan-t5-large model from Hugging Face by cloning the model repository. 
#### git clone https://huggingface.co/google/flan-t5-large

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
modelPath = "model/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(modelPath)
base_model = AutoModelForSeq2SeqLM.from_pretrained(modelPath)
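
#### Because the model is loaded from a local folder, it can help to switch the Hugging Face libraries into offline mode and to check that the folder exists before loading. The sketch below uses the documented offline environment variables; the helper `has_local_model` is a hypothetical name, and checking only for `config.json` is a simplifying assumption about what a cloned model repository contains.

```python
import os

# Set these before importing transformers or datasets.
os.environ["HF_HUB_OFFLINE"] = "1"        # huggingface_hub: make no network calls
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # transformers: use local files only
os.environ["HF_DATASETS_OFFLINE"] = "1"   # datasets: use local files only

def has_local_model(model_dir):
    """Return True if model_dir contains a config.json, as a cloned model repo does."""
    return os.path.isfile(os.path.join(model_dir, "config.json"))

print(has_local_model("model/flan-t5-large"))
```

#### Passing `local_files_only=True` to `from_pretrained` is another way to make sure nothing is fetched from the network.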

#### We will create an instruction dataset for training by converting the question-answer pairs into explicit instructions for the LLM. Let's create a prompt template with an instruction at the start and an answer marker at the end of the prompt.

In [None]:
def prompt_generator(batchData):
    # Wrap each question in a fixed instruction and an answer marker.
    start = 'Assuming you are working as General Knowledge instructor. Can you please answer the below question?\n\n'
    end = '\n Answer: '
    training_prompt = [start + question + end for question in batchData['Question']]
    # Pad and truncate to the tokenizer's maximum length so every row has the same shape.
    batchData['input_ids'] = tokenizer(training_prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    batchData['labels'] = tokenizer(batchData['Answer'], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return batchData

instructed_datasets = dataset.map(prompt_generator, batched=True)
instructed_datasets = instructed_datasets.remove_columns(['id','Question', 'Answer'])
#print(instructed_datasets)
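
#### The cell above pads the labels with `padding="max_length"`, so the padding positions also count toward the training loss. A common refinement, not part of the original notebook, is to replace padding ids in the labels with `-100`, the value that the cross-entropy loss in PyTorch ignores by default. The sketch below shows the idea on plain Python lists; `mask_padding` and the token ids are hypothetical, and the padding id of 0 is the one used by T5 tokenizers.

```python
PAD_TOKEN_ID = 0      # T5 tokenizers use 0 as the padding id
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids in one label sequence with IGNORE_INDEX."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in label_ids]

# Hypothetical token ids for a short answer padded to length 8.
padded_labels = [1919, 7, 1, 0, 0, 0, 0, 0]
print(mask_padding(padded_labels))
# prints [1919, 7, 1, -100, -100, -100, -100, -100]
```

#### In the notebook this would be applied to each row of `labels` before training, in which case the tokenizer call would return plain lists rather than tensors.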

#### We will use the PyTorch framework to fine-tune the model. The Hugging Face Trainer class provides an API for feature-complete training in PyTorch for most standard use cases. Before instantiating the Trainer, we create a TrainingArguments object, which exposes all the points of customization during training. The code below sets only 1 epoch for training; you can choose the number of epochs and other training parameters based on the compute and memory available and on the final model evaluation results. Note that the code also sets max_steps=1, which stops training after a single optimizer step and takes precedence over the epoch count, so remove it for a full training run.

In [None]:
from transformers import TrainingArguments, Trainer
import time

output_dir = f'./model/trained-model-output/flan-output-{str(int(time.time()))}'

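training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",  # evaluate after each epoch (eval_strategy in recent transformers releases)
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    max_steps=1,  # quick smoke test: a positive max_steps overrides num_train_epochs
)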

trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=instructed_datasets['train'],
    eval_dataset=instructed_datasets['validation']
)
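
#### To judge how long a run will take, it helps to estimate the number of optimizer steps implied by the training arguments. The sketch below is a rough single-device estimate, not the Trainer's exact internal calculation. The function name and the example dataset size of 1000 are assumptions; the default batch size of 8 matches the default `per_device_train_batch_size`.

```python
import math

def optimizer_steps(num_examples, batch_size=8, epochs=1, max_steps=-1):
    """Estimate how many optimizer updates a single-device run performs."""
    if max_steps > 0:
        return max_steps  # a positive max_steps overrides the epoch count
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * epochs

# Hypothetical dataset of 1000 training examples.
print(optimizer_steps(1000))               # one full epoch
print(optimizer_steps(1000, max_steps=1))  # the smoke-test setting used above
```

#### With `max_steps=1` the model sees only one batch, which is enough to check that the pipeline runs but not to change the model's answers noticeably.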

In [None]:
trainer.train()

#### After training, you can save the fine-tuned instruct model for later evaluation and inference.

In [None]:
saved_dir = f'./model/trained-model/flan-trained-{str(int(time.time()))}'
tokenizer.save_pretrained(saved_dir)
base_model.save_pretrained(saved_dir)

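#### We can load the saved fine-tuned instruct model from the local file system. The directory name ends with the timestamp generated when the model was saved, so replace the path below with the one from your own run.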

In [None]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("model/trained-model/flan-trained-1692510911")
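
#### Since the saved directory name contains a timestamp, a hard-coded path only works for one particular run. A minimal sketch for picking the most recent save instead is shown below. The helper `latest_saved_model` is hypothetical, it assumes the `flan-trained-<timestamp>` naming used when saving, and the demo builds a temporary folder so it runs without any real model.

```python
import tempfile
from pathlib import Path

def latest_saved_model(root="model/trained-model", prefix="flan-trained-"):
    """Return the saved directory with the largest timestamp suffix, or None."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    candidates = []
    for child in root_path.iterdir():
        suffix = child.name[len(prefix):]
        if child.is_dir() and child.name.startswith(prefix) and suffix.isdigit():
            candidates.append((int(suffix), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]

# Demo on a temporary folder with made-up directory names.
demo_root = Path(tempfile.mkdtemp())
for name in ("flan-trained-100", "flan-trained-200", "notes"):
    (demo_root / name).mkdir()
print(latest_saved_model(demo_root).name)
```

#### The returned path can then be passed to `AutoModelForSeq2SeqLM.from_pretrained` in place of the hard-coded string.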

#### For generative AI applications, a qualitative approach, in which we ask ourselves "Is my model behaving the right way?", is usually a good starting point. We can check this by manually comparing the actual answers with the answers given by the instruct model. We can use our test dataset for this evaluation.

In [None]:
from transformers import GenerationConfig
import pandas as pd

questions = dataset['test']['Question']
actual_answers = dataset['test']['Answer']
instruct_model_answers = []

for question in questions:
    prompt = f"""

Assuming you are working as General Knowledge instructor. Can you please answer the below question?

{question}
Answer:"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_answers.append(instruct_model_text_output)
    
answers = list(zip(questions,actual_answers,instruct_model_answers))
df = pd.DataFrame(answers, columns = ['question','actual answer','instruct model answer'])
df

#### The other evaluation approach is quantitative. The ROUGE metric helps quantify the validity of the answers produced by a model by comparing them to the actual answers in our test dataset. You can read more about this in the ROUGE metric documentation.

In [None]:
import evaluate
rouge = evaluate.load('rouge')
instruct_model_results = rouge.compute(
    predictions=instruct_model_answers,
    references=actual_answers,
    use_aggregator=True,
    use_stemmer=True,
)
print(instruct_model_results)
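
#### To build intuition for the numbers printed above, here is a simplified sketch of ROUGE-1, which scores the unigram overlap between a prediction and a reference. It is an approximation for illustration: it lowercases and splits on whitespace only, whereas the `rouge` metric used above applies its own tokenizer and, with `use_stemmer=True`, stemming. The function name `rouge1_f1` and the example strings are assumptions.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (lowercased, whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens that match
    recall = overlap / len(ref_tokens)      # share of reference tokens that were found
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("Paris", "Paris"))
print(rouge1_f1("the capital is Paris", "Paris"))
```

#### An exact match scores 1.0, while a longer answer that contains the reference is penalized on precision, which is why verbose but correct answers can still receive a modest ROUGE score.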