## Which type of fine-tuning?
Fine-tuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. There are several types of fine-tuning that we can perform:

* single-task fine-tuning: we re-train the whole model on a single new task by providing new data (often just 500-1,000 examples are enough for good performance). However, this process may lead to catastrophic forgetting, which happens because full fine-tuning modifies the weights of the original LLM. While this leads to great performance on the fine-tuned task, it can degrade performance on other tasks.

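* multitask fine-tuning: the model is trained on a mixed dataset covering many tasks at once, which helps avoid catastrophic forgetting. Good multitask fine-tuning, however, may require 50,000-100,000 examples across many tasks, and so needs more data and more computation.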

* parameter-efficient fine-tuning (PEFT): PEFT is a set of techniques that preserves the weights of the original LLM and trains only a small number of task-specific adapter layers and parameters. PEFT shows greater robustness to catastrophic forgetting since most of the pre-trained weights are left unchanged.

Several factors need to be taken into account when choosing an option:
1. size of the model
2. hardware available
3. purpose of the model

For our use case, teaching a model to give betting advice, we could probably afford single-task fine-tuning and lose the generalized multitask capabilities. Besides, we chose a small model with 7B parameters.
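
To make the hardware factor more concrete, the sketch below estimates the memory needed just to hold the weights of a model at different numeric precisions. The helper name `weight_memory_gb` and the convention 1 GB = 10^9 bytes are assumptions made for illustration. Training needs considerably more memory than this, because gradients, optimizer states and activations must be stored as well.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, precision: str = "fp32") -> float:
    """Return the memory in GB (1 GB = 1e9 bytes) needed to hold the weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


num_params = 7_000_000_000  # a 7B-parameter model
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(num_params, precision):.0f} GB")
```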

However, if we decide to go for a bigger model, full fine-tuning might be too expensive in terms of hardware, so PEFT is the better option because it can often be performed on a single GPU. The original LLM is only slightly modified or left unchanged, so PEFT is less prone to the catastrophic forgetting problems of full fine-tuning. There are several PEFT techniques, and we would choose the most common one, *Low-Rank Adaptation*, or LoRA for short. It is a parameter-efficient fine-tuning technique that falls into the re-parameterization category: it freezes the original weights and learns a pair of small low-rank matrices whose product is added to them.
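
The re-parameterization idea behind LoRA can be illustrated with a toy forward pass in plain Python. This is only a sketch: the sizes, the helper names `matvec` and `lora_forward`, and the random values are assumptions chosen for illustration, not the PEFT library's implementation. The layer output is computed as `W x + (alpha / r) * B (A x)`, where `W` stays frozen and only the small matrices `A` and `B` would be trained.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute h = W x + (alpha / r) * B (A x); W is frozen, A and B are trainable."""
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank path: project down to rank r, then back up
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, k, r, alpha = 4, 4, 2, 4  # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # d x k, frozen
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]  # r x k, random init
B = [[0.0] * r for _ in range(d)]                               # d x r, zero init
x = [1.0, 2.0, 3.0, 4.0]

# With B initialised to zero the adapter contributes nothing yet,
# so the adapted layer reproduces the original layer.
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))
```

Starting `B` at zero means fine-tuning begins from the behaviour of the pre-trained model and only drifts away from it as `A` and `B` are updated.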

## Steps in fine-tuning

### Load your model and needed libraries
In this exercise Mistral and Llama 2 were proposed, but we also found, for instance, this model that was already pre-trained on a large number of sports articles:

https://huggingface.co/microsoft/SportsBERT

*SportsBERT is a BERT model trained from scratch with specific focus on sports articles. The training corpus included news articles scraped from the web related to sports from the past 4 years. These articles covered news from Football, Basketball, Hockey, Cricket, Soccer, Baseball, Olympics, Tennis, Golf, MMA, etc. There were approximately 8 million training samples.*

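So we could use this one as the base model to create a final one trained to predict outcomes. Note that SportsBERT is an encoder-only BERT model, so it is better suited to classification-style prediction than to free-form text generation.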

Loading and training a model requires supporting classes for tokenization, generation settings and training. The `transformers` library provides them, for instance:

```
from transformers import AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
```

#### For full fine-tuning a 7B model
```
from transformers import AutoModelForCausalLM
```
For text generation tasks, you would typically load the model through the `AutoModelForCausalLM` class. It selects a model with a causal language modeling head, which is designed for autoregressive tasks such as text generation.

#### For LoRA fine-tuning of bigger models
You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter.
```
from peft import LoraConfig, get_peft_model, TaskType
```


### Find a dataset

The Hugging Face Hub hosts many datasets across a myriad of categories. For instance, this dataset contains the results of many basketball matches, so it could be used to fine-tune a model to predict the outcomes of basketball games:

https://huggingface.co/datasets/GEM/sportsett_basketball


### Data preparation
The dataset needs to have the right columns, which in our case would be 'question', 'outcome', 'probability' and 'confidence_interval'. Furthermore, we need to split this dataset into train, test and validation sets. Let's suppose we have around 1,500 samples; a possible distribution across the three groups could be:
```
DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'outcome', 'probability', 'confidence_interval'],
        num_rows: 1246
    })
    test: Dataset({
        features: ['id', 'question', 'outcome', 'probability', 'confidence_interval'],
        num_rows: 150
    })
    validation: Dataset({
        features: ['id', 'question', 'outcome', 'probability', 'confidence_interval'],
        num_rows: 50
    })
})
```
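
A split with these sizes can be sketched using only the standard library. The function name `split_dataset`, the seed and the placeholder rows are assumptions made for illustration; the rows merely reuse the column names shown above.

```python
import random


def split_dataset(samples, n_test, n_validation, seed=42):
    """Shuffle the samples and split them into train, test and validation lists."""
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return {"train": train, "test": test, "validation": validation}


# Placeholder rows with the same fields as the dataset shown above.
samples = [
    {"id": i, "question": f"question {i}", "outcome": "yes",
     "probability": 0.5, "confidence_interval": [0.4, 0.6]}
    for i in range(1446)
]

splits = split_dataset(samples, n_test=150, n_validation=50)
print({name: len(rows) for name, rows in splits.items()})
# {'train': 1246, 'test': 150, 'validation': 50}
```

With the Hugging Face `datasets` library, `Dataset.train_test_split` offers a comparable shuffle-and-split for two groups at a time.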

### Model trainable parameters
It is useful to find out the number of model parameters and how many of them are trainable. A small helper function such as the following reports both:

```
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return (
        f"trainable model parameters: {trainable_model_params}\n"
        f"all model parameters: {all_model_params}\n"
        f"percentage of trainable model parameters: "
        f"{100 * trainable_model_params / all_model_params:.2f}%"
    )
```

### Preprocess the dataset
You need to convert the question-answer (prompt-response) pairs into explicit instructions for the LLM. Then tokenize the prompt-response dataset and pull out the `input_ids` (one per token). The tokenize function will create the prompt for each sample.

Training prompt (question):
"""
Answer the following question by giving as output a json response with two fields: 
the probability of the answer and the confidence interval of the answer. The question is: {question}.

Prediction: 
"""

Training response (prediction):
"""
{outcome}, {probability}, {confidence_interval}
"""

The output dataset is ready for fine-tuning.
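
The prompt construction step can be sketched as plain string formatting, before any tokenizer is involved. The templates reuse the wording shown above; the function name `build_example` and the sample row are hypothetical and only illustrate the shape of the output.

```python
PROMPT_TEMPLATE = """Answer the following question by giving as output a json response with two fields:
the probability of the answer and the confidence interval of the answer. The question is: {question}.

Prediction: """

RESPONSE_TEMPLATE = "{outcome}, {probability}, {confidence_interval}"


def build_example(sample):
    """Turn one dataset row into a (prompt, response) pair of strings."""
    prompt = PROMPT_TEMPLATE.format(question=sample["question"])
    response = RESPONSE_TEMPLATE.format(
        outcome=sample["outcome"],
        probability=sample["probability"],
        confidence_interval=sample["confidence_interval"],
    )
    return prompt, response


# Hypothetical row, invented only to show the shape of the output.
sample = {
    "id": 0,
    "question": "Will the home team win the game",
    "outcome": "yes",
    "probability": 0.7,
    "confidence_interval": "0.6-0.8",
}
prompt, response = build_example(sample)
print(prompt + response)
```

Each prompt and response string would then be passed to the tokenizer to obtain the `input_ids` used for training.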

### Training the model

We can use the *Trainer* class from Hugging Face (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)) and choose different parameters for the fine-tuning steps.
```
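training_args = TrainingArguments(
    output_dir=output_dir,  # directory for checkpoints; must be defined beforehand
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1  # overrides num_train_epochs; a single step is only a smoke test
)

trainer = Trainer(
    model=selected_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

trainer.train()  # run the fine-tuning loop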
```
Once training has finished and the model has been saved, you can create an instance of the `AutoModelForCausalLM` class for the resulting instruct model using the `from_pretrained` function.

#### If we are using PEFT
You need to define the LoRA configuration and create the peft model that will be passed to the Trainer. Some explanations about the different parameters can be found [here](https://medium.com/data-science-in-your-pocket/lora-for-fine-tuning-llms-explained-with-codes-and-example-62a7ac5a3578).
```
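peft_config = LoraConfig(
    r=32,              # rank of the low-rank update matrices
    lora_alpha=16,     # the update is scaled by lora_alpha / r
    lora_dropout=0.1,  # dropout applied inside the LoRA layers
    bias="none",       # bias terms are not trained
    task_type="CAUSAL_LM"
)

# Wrap the base model so that only the adapter weights are trainable.
peft_model = get_peft_model(original_model, peft_config)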
```
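
To get a feeling for how few parameters this configuration trains, the sketch below counts the adapter parameters for a single weight matrix. The matrix size of 4096 x 4096 is an assumption chosen for illustration, as is the helper name `lora_trainable_params`; the rank and `lora_alpha` are the values from the configuration above.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in the two LoRA matrices: B is d x r and A is r x k."""
    return r * (d + k)


d = k = 4096            # assumed size of one square projection matrix
r, lora_alpha = 32, 16  # values from the LoraConfig above

full = d * k
lora = lora_trainable_params(d, k, r)
print(f"full matrix: {full:,} parameters")
print(f"LoRA adapter: {lora:,} parameters ({100 * lora / full:.2f}% of the full matrix)")
print(f"update scaling factor lora_alpha / r: {lora_alpha / r}")
```

A higher rank gives the adapter more capacity at the cost of more trainable parameters, so `r` is usually the first value to tune.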


### Saving the model
You can start using the instruct model right away and save it with your cloud provider or, if you are registered on the Hugging Face Hub, push it there.


## Model performance
We need metrics that establish evaluation criteria for assessing the effectiveness of the prompt set and its impact on the LLM's performance. Common metrics include response accuracy, fluency, coherence, relevance, and completeness.

### Evaluate the model qualitatively (Human evaluation)
It is always good to first check manually that the model is behaving correctly. Is the output format correct? Do the answers make sense? If the model is providing external sources or giving references, can we double-check that those links and resources really exist? In this category we could score the outputs on criteria such as:
* completeness
* coherence
* reliability


### Evaluate the Model Quantitatively
In the use case we are simulating, where we fine-tune a model to improve its prediction skills, we can use accuracy to measure the share of correct answers. However, we are also considering the probability and confidence interval of each answer, so we could collect the probabilities of the correct answers and check their distribution. Are most of the correct predictions made with a high probability? Is the model confident enough, or are the confidence intervals very wide? With this in mind, we could track metrics such as:
* accuracy
* probability distribution of correct answers
* probability distribution of wrong answers
* size of confidence intervals
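
The metrics listed above can be computed with a few lines of standard-library Python. This is a minimal sketch: the record layout (`predicted`, `actual`, `probability`, `confidence_interval` as a `(low, high)` pair), the function name `evaluate_predictions` and the sample records are all assumptions for illustration, and the distributions are summarised here only by their means.

```python
from statistics import mean


def evaluate_predictions(records):
    """Summarise accuracy, probabilities and interval widths for a list of predictions."""
    correct = [r for r in records if r["predicted"] == r["actual"]]
    wrong = [r for r in records if r["predicted"] != r["actual"]]
    widths = [r["confidence_interval"][1] - r["confidence_interval"][0] for r in records]
    return {
        "accuracy": len(correct) / len(records),
        "mean_probability_correct": mean(r["probability"] for r in correct) if correct else None,
        "mean_probability_wrong": mean(r["probability"] for r in wrong) if wrong else None,
        "mean_interval_width": mean(widths),
    }


# Hypothetical model outputs paired with the true outcomes.
records = [
    {"predicted": "yes", "actual": "yes", "probability": 0.9, "confidence_interval": (0.8, 1.0)},
    {"predicted": "no", "actual": "no", "probability": 0.7, "confidence_interval": (0.6, 0.8)},
    {"predicted": "yes", "actual": "no", "probability": 0.6, "confidence_interval": (0.3, 0.9)},
    {"predicted": "yes", "actual": "yes", "probability": 0.8, "confidence_interval": (0.7, 0.9)},
]

for name, value in evaluate_predictions(records).items():
    print(f"{name}: {value:.3f}")
```

A well-behaved model would show a clearly higher mean probability for correct answers than for wrong ones, together with reasonably narrow intervals.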

## References

* https://www.coursera.org/learn/generative-ai-with-llms
* https://www.e2enetworks.com/blog/a-step-by-step-guide-to-fine-tuning-the-mistral-7b-llm

LoRA:

* https://medium.com/data-science-in-your-pocket/lora-for-fine-tuning-llms-explained-with-codes-and-example-62a7ac5a3578
* https://arxiv.org/pdf/2106.09685.pdf