# Fine-Tuning LLMs

Fine-tuning in machine learning is the process of adapting a pre-trained model for specific tasks or use cases. It could be considered a subset of the broader technique of transfer learning: the practice of leveraging knowledge an existing model has already learned as the starting point for learning new tasks.

Fine-tuning is an essential part of the LLM development cycle, allowing the raw linguistic capabilities of base foundation models to be adapted for a variety of use cases, from chatbots to coding to other domains both creative and technical.

By fine-tuning a model on a small dataset of task-specific data, you can improve its performance on that task while preserving its general language knowledge.

In practice, fine-tuning an LLM means training a pre-trained model further on a domain-specific dataset. The workflow has several steps, from preparing the dataset to training, evaluating, and deploying the model.

#### **1. Preparing the Dataset**
The dataset is critical for fine-tuning as it determines the specificity and performance of the model on your desired tasks. Data preparation involves curating and preprocessing the dataset to ensure its relevance and quality for the specific task. This may include tasks such as cleaning the data, handling missing values, and formatting the text to align with the model's input requirements.

##### Steps:
a. **Data Collection**: Collect domain-specific data relevant to your task.

b. **Data Cleaning**: Remove duplicates, irrelevant information, and noise, and ensure proper formatting.

c. **Data Formatting for Training**: Use appropriate input-output formats and split the data into training, validation, and test sets.
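
The cleaning step can be sketched with pandas. This is a minimal example under stated assumptions: the column names `text` and `label`, the helper `clean_text_dataset`, and the sample rows are all hypothetical, and real datasets usually need task-specific rules on top of this.

```python
import pandas as pd

def clean_text_dataset(df: pd.DataFrame, text_col: str = "text", label_col: str = "label") -> pd.DataFrame:
    """Drop rows with missing values, normalize whitespace, and remove duplicate texts."""
    # Drop rows where the text or the label is missing.
    cleaned = df.dropna(subset=[text_col, label_col]).copy()
    # Collapse runs of whitespace into one space and trim both ends.
    cleaned[text_col] = (
        cleaned[text_col].astype(str).str.replace(r"\s+", " ", regex=True).str.strip()
    )
    # Remove rows that are empty after trimming, then rows whose text is a duplicate.
    cleaned = cleaned[cleaned[text_col] != ""]
    cleaned = cleaned.drop_duplicates(subset=[text_col])
    return cleaned.reset_index(drop=True)

# Hypothetical sample data, for illustration only.
raw = pd.DataFrame({
    "text": ["Great  product!", "Great product!", None, "   ", "Arrived late."],
    "label": [1, 1, 0, 0, 0],
})
cleaned = clean_text_dataset(raw)
print(cleaned)
```

Normalizing whitespace before deduplicating lets near-identical rows such as `"Great  product!"` and `"Great product!"` collapse into one.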


```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv("data.csv")

# Split into 70% training and 30% held out, then split the held-out part
# in half to get 15% validation and 15% test
train, temp = train_test_split(df, test_size=0.3, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)

# Save the splits
train.to_csv("train.csv", index=False)
val.to_csv("val.csv", index=False)
test.to_csv("test.csv", index=False)
```

#### **2. Choosing the right pre-trained model**

It’s crucial to select a pre-trained model that aligns with the specific requirements of the target task or domain. Understanding the architecture, input/output specifications, and layers of the pre-trained model is essential for seamless integration into the fine-tuning workflow.

Factors such as the model size, training data, and performance on relevant tasks should be considered when making this choice. By selecting a pre-trained model that closely matches the characteristics of the target task, you can streamline the fine-tuning process and maximize the model's adaptability and effectiveness for the intended application.
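
Model size translates directly into memory requirements, which often decides whether a candidate model is practical. The sketch below is a rough back-of-the-envelope estimate, assuming 32-bit floats and the Adam optimizer (weights, gradients, and two optimizer moments per parameter, about 16 bytes in total). The parameter count of 110 million is an approximate figure for a BERT-base-sized model, and activation memory, which depends on batch size and sequence length, is not included.

```python
def estimate_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Rough memory estimate in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 110_000_000       # approximate size of a BERT-base-sized model (assumption)
BYTES_WEIGHTS_FP32 = 4         # one 32-bit float per parameter
BYTES_TRAINING_FP32_ADAM = 16  # weights (4) + gradients (4) + two Adam moments (8)

inference_gb = estimate_memory_gb(NUM_PARAMS, BYTES_WEIGHTS_FP32)
training_gb = estimate_memory_gb(NUM_PARAMS, BYTES_TRAINING_FP32_ADAM)

print(f"Weights only: {inference_gb:.2f} GB")
print(f"Full fine-tuning with Adam (no activations): {training_gb:.2f} GB")
```

Under these assumptions, full fine-tuning needs roughly four times the memory of the weights alone, before activations are counted.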

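If your task depends on domain-specific knowledge that the base model did not capture, consider continuing the model's pre-training on a large unlabeled domain corpus before fine-tuning. This stage is self-supervised: the training targets come from the text itself.

##### Steps:

* Use a language modeling objective that matches the model: Causal Language Modeling (CLM) for decoder-only models, or Masked Language Modeling (MLM) for encoder models such as BERT.

* Tokenize the corpus and train without manually assigned labels. The targets are derived from the input tokens: the next token for CLM, the masked tokens for MLM.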
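
To make "without labels" concrete, the sketch below shows how both objectives build their training targets from the token sequence alone. It is a simplified illustration, not a library implementation: the functions `clm_example` and `mlm_example` and the token ids are hypothetical, and the masking here omits refinements used in practice (such as skipping special tokens or sometimes substituting random tokens). Library utilities such as `DataCollatorForLanguageModeling` in Hugging Face Transformers handle this step in real pipelines.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss ignores by default

def clm_example(token_ids):
    """Causal LM: the target at each position is the next token in the sequence."""
    return token_ids[:-1], token_ids[1:]

def mlm_example(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Masked LM: hide random tokens; only hidden positions contribute to the loss."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)       # the model sees the mask token
            labels.append(tok)           # and must recover the original token
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)  # unmasked positions are skipped by the loss
    return inputs, labels

tokens = [7592, 2088, 2003, 2307, 999]  # illustrative token ids (assumption)

clm_inputs, clm_targets = clm_example(tokens)
print("CLM inputs: ", clm_inputs)
print("CLM targets:", clm_targets)

mlm_inputs, mlm_labels = mlm_example(tokens, mask_id=103, mask_prob=0.5)
print("MLM inputs: ", mlm_inputs)
print("MLM labels: ", mlm_labels)
```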

#### **3. Identifying the right parameters for fine-tuning**

The fine-tuning parameters determine how the model adapts to the new task-specific data. The learning rate, number of training epochs, and batch size all play a significant role. Additionally, freezing some layers (typically the earlier ones) while training the final layers is a common way to reduce overfitting and training cost.
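
Layer freezing can be sketched as a rule over parameter names. The example below assumes the naming scheme used by BERT-style models in Hugging Face Transformers (`bert.embeddings...`, `bert.encoder.layer.<N>...`, `classifier...`); other architectures use different names, so the patterns would need adjusting. The helper `should_freeze` and the choice of 8 frozen layers are illustrative assumptions.

```python
import re

def should_freeze(param_name: str, num_frozen_layers: int) -> bool:
    """Freeze the embeddings and the first `num_frozen_layers` encoder layers."""
    if ".embeddings." in param_name:
        return True
    match = re.search(r"\.encoder\.layer\.(\d+)\.", param_name)
    if match:
        return int(match.group(1)) < num_frozen_layers
    # Everything else (pooler, classification head) stays trainable.
    return False

example_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.11.output.dense.weight",
    "classifier.weight",
]
for name in example_names:
    status = "frozen" if should_freeze(name, num_frozen_layers=8) else "trainable"
    print(f"{name}: {status}")

# With a loaded PyTorch model, the same rule could be applied like this:
# for name, param in model.named_parameters():
#     param.requires_grad = not should_freeze(name, num_frozen_layers=8)
```

Freezing more layers lowers memory use and training time but leaves the model less room to adapt, so the number of frozen layers is itself a parameter worth tuning.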

#### **4. Fine-Tuning**

Fine-tuning adapts the model to a specific downstream task using labeled data.

##### Steps:

* Use a task-specific loss function (e.g., CrossEntropyLoss for classification).
* Use a framework such as Hugging Face Transformers, PyTorch Lightning, or TensorFlow.
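
To show what a classification loss measures, here is a minimal pure-Python sketch of cross-entropy for a single example: the negative log-probability that a softmax over the model's logits assigns to the correct class. The function name and the logit values are illustrative assumptions; in practice the framework's built-in loss would be used.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the correct class under a softmax over `logits`."""
    max_logit = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]

confident_correct = cross_entropy([2.0, -1.0], label=0)  # small loss
confident_wrong = cross_entropy([2.0, -1.0], label=1)    # large loss
uninformed = cross_entropy([0.0, 0.0], label=0)          # ln(2): a 50/50 guess

print(f"Confident and correct: {confident_correct:.4f}")
print(f"Confident and wrong:   {confident_wrong:.4f}")
print(f"Uninformed guess:      {uninformed:.4f}")
```

The loss is low when the model puts most of its probability on the right class and grows quickly when it is confidently wrong, which is the signal fine-tuning uses to update the weights.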

Example (Text Classification with Hugging Face):

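```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Load the IMDB sentiment dataset and the tokenizer that matches the base model
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize every review, truncating or padding to a fixed length of 128 tokens
def preprocess_data(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized_data = dataset.map(preprocess_data, batched=True)

# IMDB has no validation split, so hold out 10% of the training data for validation
# and keep the test split untouched for the final evaluation
split = tokenized_data["train"].train_test_split(test_size=0.1, seed=42)

# Load the pre-trained encoder with a new two-class classification head
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Train on the training portion and validate after each epoch
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
)

trainer.train()
```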


#### **5. Validation and Evaluation**

Validation measures a fine-tuned model's performance on a validation set that was not used for training. Monitoring metrics such as accuracy, loss, precision, and recall shows how effective the model is and how well it generalizes.

These metrics indicate how well the fine-tuned model performs on the task-specific data and where it can improve. The results feed back into the choice of fine-tuning parameters and model architecture.

Evaluate the fine-tuned model on the test set using appropriate metrics:

* **Text Classification**: Accuracy, F1-Score.
* **Summarization**: ROUGE.
* **Question Answering**: F1, Exact Match (EM).

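```python
import evaluate

# Load the accuracy metric. `datasets.load_metric` was deprecated and later
# removed from `datasets`; the `evaluate` library replaces it.
metric = evaluate.load("accuracy")

# Predict on the held-out test split and compare predicted classes with the true labels
predictions = trainer.predict(tokenized_data["test"])
accuracy = metric.compute(
    predictions=predictions.predictions.argmax(-1),
    references=predictions.label_ids,
)

print(f"Accuracy: {accuracy['accuracy']:.4f}")
```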
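
Accuracy alone can hide problems when classes are imbalanced, which is why precision, recall, and F1 are listed above. The sketch below computes all four for a binary task from plain Python lists. The function name and the sample predictions are illustrative assumptions, not results from the model trained here.

```python
def binary_classification_metrics(predictions, references, positive_label=1):
    """Compute accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == positive_label and r == positive_label)
    fp = sum(1 for p, r in pairs if p == positive_label and r != positive_label)
    fn = sum(1 for p, r in pairs if p != positive_label and r == positive_label)
    correct = sum(1 for p, r in pairs if p == r)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels for illustration only.
sample_predictions = [1, 0, 1, 1, 0, 0, 1, 1]
sample_references = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_classification_metrics(sample_predictions, sample_references)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```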


#### **6. Model iteration**

Model iteration allows you to refine the model based on evaluation results. Upon assessing the model's performance, adjustments to fine-tuning parameters, such as learning rate, batch size, or the extent of layer freezing, can be made to enhance the model's effectiveness.

Other strategies, such as adding regularization or adjusting the model architecture, can also be explored. Repeating this cycle of evaluation and adjustment refines the model step by step until it reaches the desired level of performance.
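
One simple way to structure this iteration is a small grid search: try each combination of candidate settings, score it on the validation set, and keep the best. The sketch below shows only the control flow. `select_best_config` is a hypothetical helper, and `toy_validation_score` is a stand-in that returns made-up numbers; in a real workflow it would fine-tune the model with the given configuration and return a validation metric.

```python
from itertools import product

def select_best_config(search_space, evaluate_fn):
    """Try every combination in `search_space` and return the best (config, score)."""
    keys = list(search_space)
    best_config, best_score = None, float("-inf")
    for values in product(*(search_space[key] for key in keys)):
        config = dict(zip(keys, values))
        score = evaluate_fn(config)  # higher is better
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_validation_score(config):
    # Stand-in only: NOT real results. Replace with a function that fine-tunes
    # the model using `config` and returns a validation metric such as accuracy.
    lr_penalty = abs(config["learning_rate"] - 3e-5) * 1e4
    batch_penalty = abs(config["batch_size"] - 16) / 100
    return -(lr_penalty + batch_penalty)

search_space = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "batch_size": [16, 32],
}
best_config, best_score = select_best_config(search_space, toy_validation_score)
print("Best configuration:", best_config)
```

Selecting on the validation set rather than the test set keeps the final test score an honest estimate of performance on unseen data.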

#### **7. Model deployment**

Model deployment marks the transition from development to practical application: the fine-tuned model is integrated into its target environment. This involves meeting the hardware and software requirements of that environment and integrating the model into existing systems or applications.

Scalability, real-time performance, and security must also be addressed to ensure a reliable deployment. Once deployed, the fine-tuned model can be applied to real-world inputs.

Save the model, then serve it, for example through the Hugging Face `transformers` pipeline API or behind a FastAPI service.

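```python
from transformers import pipeline

# Save the fine-tuned model and its tokenizer to one directory.
# Trainer writes checkpoints to subfolders such as ./results/checkpoint-<step>,
# so the pipeline cannot load from ./results until the model is saved there explicitly.
trainer.save_model("./results")
tokenizer.save_pretrained("./results")

# Load the fine-tuned model for inference
model_pipeline = pipeline("text-classification", model="./results")

# Run inference on a new review
result = model_pipeline("This movie was fantastic!")
print(result)
```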