# LLM Fine-Tuning: Principles and Steps

## Step-by-step process for fine-tuning


* Step 1: Prepare and clean the dataset

* Step 2: Tokenize the data

* Step 3: Fine-tune the model

* Step 4: Evaluate the model

## Step 1: Prepare and clean the dataset
Noisy text can degrade model performance, so cleaning gives the model consistent input. In this step, we will:

* Remove URLs and special characters (this strips the `#` from hashtags but keeps the tag word).

* Normalize text by converting it to lowercase and trimming surrounding whitespace.

* Encode the string labels as integers.
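
The cleaning rules listed above can be sketched as a standalone function. This is a minimal variant, not the notebook's own `clean_text`: it assumes hashtags should be dropped together with their word, and it replaces punctuation with a space (then collapses repeated whitespace) so that neighbouring words do not fuse. The name `clean_text_v2` is hypothetical.

```python
import re

def clean_text_v2(text: str) -> str:
    """Hypothetical cleaning variant: lowercase, drop URLs/hashtags/punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"#\w+", " ", text)           # drop hashtags, word included (assumption)
    text = re.sub(r"[^\w\s]", " ", text)        # replace punctuation with a space
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

sample = "Unacceptable service! #frustrated Visit https://feedback.com"
print(clean_text_v2(sample))  # unacceptable service visit
```

Replacing punctuation with a space instead of deleting it keeps inputs such as `questions...but` as two words.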

In [None]:
import pandas as pd

# Example dataset
data_dict = {
    "text": [
        "  The staff was very kind and attentive to my needs!!!  ",
        "The waiting time was too long, and the staff was rude. Visit us at http://hospitalreviews.com",
        "The doctor answered all my questions...but the facility was outdated.   ",
        "The nurse was compassionate & made me feel comfortable!! :) ",
        "I had to wait over an hour before being seen.  Unacceptable service! #frustrated",
        "The check-in process was smooth, but the doctor seemed rushed. Visit https://feedback.com",
        "Everyone I interacted with was professional and helpful.  "
    ],
    "label": ["positive", "negative", "neutral", "positive", "negative", "neutral", "positive"]
}

# Convert to pandas DataFrame
data = pd.DataFrame(data_dict)

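# Clean the text
import re

def clean_text(text):
    """Lowercase the text, then strip URLs and punctuation."""
    text = text.lower().strip()  # Convert to lowercase and trim leading/trailing whitespace
    text = re.sub(r"http\S+", "", text)  # Remove URLs
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation and symbols (keeps letters, digits, underscores, whitespace)
    return text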

# Apply text cleaning
data["cleaned_text"] = data["text"].apply(clean_text)

# Convert labels to numerical values
data["label"] = data["label"].astype("category").cat.codes  # Codes follow sorted category order: negative -> 0, neutral -> 1, positive -> 2

print(data.head())

                                                text  label  \
0    The staff was very kind and attentive to my ...      2   
1  The waiting time was too long, and the staff w...      0   
2  The doctor answered all my questions...but the...      1   
3  The nurse was compassionate & made me feel com...      2   
4  I had to wait over an hour before being seen. ...      0   

                                        cleaned_text  
0  the staff was very kind and attentive to my needs  
1  the waiting time was too long and the staff wa...  
2  the doctor answered all my questionsbut the fa...  
3  the nurse was compassionate  made me feel comf...  
4  i had to wait over an hour before being seen  ...  
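
The `label` column in the output above encodes `positive` as 2 and `negative` as 0 because pandas assigns category codes in sorted order. As a sketch, the same mapping can be written out explicitly so it is easy to inspect and reverse; the names `label2id` and `id2label` are illustrative and do not appear in the notebook.

```python
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral", "positive"]

# Sorted unique labels reproduce the ordering that pandas category codes use.
label2id = {name: idx for idx, name in enumerate(sorted(set(labels)))}
id2label = {idx: name for name, idx in label2id.items()}

encoded = [label2id[name] for name in labels]
print(label2id)  # {'negative': 0, 'neutral': 1, 'positive': 2}
print(encoded)   # [2, 0, 1, 2, 0, 1, 2]
```

Keeping the reverse mapping around makes it straightforward to turn predicted class indices back into label names later.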


## Step 2: Tokenize the data
Tokenization converts text into the numeric format a model can process. This step uses a Hugging Face tokenizer to transform each cleaned review into token IDs and an attention mask. If the Transformers library is not installed yet, run the install cell first, then refresh the page before running the remaining code.



In [None]:
!pip install transformers datasets scikit-learn torch accelerate


In [None]:
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)


# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize one string, truncating or padding it to a fixed length of 128 tokens
def tokenize_function(text):
    return tokenizer(text, truncation=True, padding="max_length", max_length=128)

# Apply tokenization
data["tokenized"] = data["cleaned_text"].apply(tokenize_function)

# Extract tokenized features
data["input_ids"] = data["tokenized"].apply(lambda x: x["input_ids"])
data["attention_mask"] = data["tokenized"].apply(lambda x: x["attention_mask"])

# Drop old tokenized column
data = data.drop(columns=["tokenized"])

print(data.head())

                                                text  label  \
0    The staff was very kind and attentive to my ...      2   
1  The waiting time was too long, and the staff w...      0   
2  The doctor answered all my questions...but the...      1   
3  The nurse was compassionate & made me feel com...      2   
4  I had to wait over an hour before being seen. ...      0   

                                        cleaned_text  \
0  the staff was very kind and attentive to my needs   
1  the waiting time was too long and the staff wa...   
2  the doctor answered all my questionsbut the fa...   
3  the nurse was compassionate  made me feel comf...   
4  i had to wait over an hour before being seen  ...   

                                           input_ids  \
0  [101, 1996, 3095, 2001, 2200, 2785, 1998, 2012...   
1  [101, 1996, 3403, 2051, 2001, 2205, 2146, 1998...   
2  [101, 1996, 3460, 4660, 2035, 2026, 3980, 8569...   
3  [101, 1996, 6821, 2001, 29353, 2081, 2033, 251...   
4  [
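
To make the `input_ids` and `attention_mask` columns concrete, here is a toy encoder. It is a sketch only: the real BERT tokenizer splits text into WordPiece subwords, whereas this one splits on whitespace. The special-token IDs match `bert-base-uncased`; the word IDs are read off the first row of the output above, assuming each word maps to a single token, and `toy_vocab` and `toy_encode` are hypothetical names.

```python
# Special-token IDs as used by bert-base-uncased; word IDs form a toy vocabulary.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102
toy_vocab = {"the": 1996, "staff": 3095, "was": 2001, "very": 2200, "kind": 2785}

def toy_encode(text, max_length=10):
    """Whitespace-split, map to IDs, add [CLS]/[SEP], truncate, pad to max_length."""
    ids = [toy_vocab.get(word, UNK_ID) for word in text.split()]
    ids = [CLS_ID] + ids[: max_length - 2] + [SEP_ID]   # truncation leaves room for specials
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * pad,
        "attention_mask": [1] * len(ids) + [0] * pad,    # 1 = real token, 0 = padding
    }

encoded = toy_encode("the staff was very kind")
print(encoded["input_ids"])       # [101, 1996, 3095, 2001, 2200, 2785, 102, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask tells the model which positions hold real tokens, so the padding added to reach `max_length` is ignored during attention.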

## Step 3: Fine-tune the model
Using the tokenized data, you’ll fine-tune a pretrained BERT model.

In [None]:
!pip install datasets



In [None]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
from datasets import Dataset

# Split into train and test sets
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_data)
test_dataset = Dataset.from_pandas(test_data)

# Remove unnecessary columns
train_dataset = train_dataset.remove_columns(["text", "cleaned_text"])
test_dataset = test_dataset.remove_columns(["text", "cleaned_text"])

#print(train_dataset)

# Batch collator; inputs are already padded to max_length=128, so it adds no extra padding here
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    learning_rate=2e-5,                     # Small learning rate for fine-tuning
    per_device_train_batch_size=16,         # Batch size for training
    per_device_eval_batch_size=16,          # Batch size for evaluation
    num_train_epochs=3,                     # Number of training epochs
    output_dir="./results",                 # Where to save the model
    logging_dir="./logs",                   # Where to store logs
    report_to="none",                        # Disables reporting to WandB or TensorBoard
    save_strategy="epoch",                   # Save model at each epoch
    evaluation_strategy="epoch",             # Evaluate at each epoch (named eval_strategy in newer Transformers releases)
)
# Load pre-trained BERT model (3-class classification)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)


# Define Trainer
trainer = Trainer(
    model=model,                           # Model to train
    args=training_args,                     # Training settings
    train_dataset=train_dataset,            # Training data
    eval_dataset=test_dataset,              # Evaluation data
    data_collator=data_collator             # Collates examples into padded batches
)

# Train the model
trainer.train()




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 1.025588 |
| 2 | No log | 1.029381 |
| 3 | No log | 1.034464 |


TrainOutput(global_step=3, training_loss=1.0312273502349854, metrics={'train_runtime': 82.2466, 'train_samples_per_second': 0.182, 'train_steps_per_second': 0.036, 'total_flos': 986675316480.0, 'train_loss': 1.0312273502349854, 'epoch': 3.0})
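
The `global_step=3` in the output above follows from the dataset size. The arithmetic below is a sketch that assumes a single device and no gradient accumulation. With 7 examples and `test_size=0.2`, the split leaves 5 training examples, which fit in one batch of 16, so each epoch is a single optimizer step.

```python
import math

n_samples, test_size = 7, 0.2
batch_size, epochs = 16, 3

n_test = math.ceil(n_samples * test_size)          # the test share is rounded up
n_train = n_samples - n_test
steps_per_epoch = math.ceil(n_train / batch_size)  # the last partial batch still counts
total_steps = steps_per_epoch * epochs

print(n_train, n_test, total_steps)  # 5 2 3
```

This probably also explains the `No log` entries: `Trainer` reports training loss every `logging_steps` steps, and a three-step run ends well before the default interval is reached.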

## Step 4: Evaluate the model
Evaluate the fine-tuned model’s accuracy and F1 score on the test set.

In [None]:
from sklearn.metrics import accuracy_score, f1_score

# Generate predictions
predictions = trainer.predict(test_dataset)
preds = predictions.predictions.argmax(-1)
labels = test_dataset['label']

# Calculate metrics
accuracy = accuracy_score(labels, preds)
f1 = f1_score(labels, preds, average='weighted')

print(f"Accuracy: {accuracy}, F1 Score: {f1}")

Accuracy: 0.5, F1 Score: 0.3333333333333333
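
To show how these two numbers relate, the sketch below computes accuracy and support-weighted F1 by hand. The two-example labels and predictions are hypothetical, since the actual test labels are not shown above. They were chosen only because they reproduce the reported values, and `accuracy_and_weighted_f1` is an illustrative helper, not a scikit-learn function.

```python
from collections import Counter

def accuracy_and_weighted_f1(y_true, y_pred):
    """Accuracy and support-weighted F1, computed without scikit-learn."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    weighted_f1 = 0.0
    for cls, n_cls in Counter(y_true).items():       # n_cls is the class support
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n_cls
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weighted_f1 += f1 * n_cls / len(y_true)      # weight each class by its support
    return accuracy, weighted_f1

# Hypothetical test split: one "negative" (0) and one "positive" (2), both predicted as 2.
acc, f1 = accuracy_and_weighted_f1(y_true=[0, 2], y_pred=[2, 2])
print(acc, round(f1, 4))  # 0.5 0.3333
```

In this scenario the model predicts a single class for every input, which is a plausible outcome after three optimizer steps on five examples. With a test set this small, each prediction moves accuracy by 50 percentage points.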


## Conclusion
Fine-tuning a large language model (LLM) begins with the critical step of preparing your dataset. From cleaning noisy text to tokenizing and splitting your data, each step is vital for ensuring the model’s performance is optimized for specific tasks. This reading has provided you with the tools and knowledge to:

* Preprocess text data, ensuring consistency and quality.

* Tokenize and structure your dataset for machine learning models.

* Fine-tune a pretrained model with relevant hyperparameters.

* Evaluate the fine-tuned model with accuracy and F1 score.

In [None]:
data.shape

(7, 5)

In [None]:
data.head(10)

| | text | label | cleaned_text | input_ids | attention_mask |
|---|---|---|---|---|---|
| 0 | The staff was very kind and attentive to my ... | 2 | the staff was very kind and attentive to my needs | [101, 1996, 3095, 2001, 2200, 2785, 1998, 2012... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ... |
| 1 | The waiting time was too long, and the staff w... | 0 | the waiting time was too long and the staff wa... | [101, 1996, 3403, 2051, 2001, 2205, 2146, 1998... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 2 | The doctor answered all my questions...but the... | 1 | the doctor answered all my questionsbut the fa... | [101, 1996, 3460, 4660, 2035, 2026, 3980, 8569... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ... |
| 3 | The nurse was compassionate & made me feel com... | 2 | the nurse was compassionate made me feel comf... | [101, 1996, 6821, 2001, 29353, 2081, 2033, 251... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ... |
| 4 | I had to wait over an hour before being seen. ... | 0 | i had to wait over an hour before being seen ... | [101, 1045, 2018, 2000, 3524, 2058, 2019, 3178... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 5 | The check-in process was smooth, but the docto... | 1 | the checkin process was smooth but the doctor ... | [101, 1996, 4638, 2378, 2832, 2001, 5744, 2021... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ... |
| 6 | Everyone I interacted with was professional an... | 2 | everyone i interacted with was professional an... | [101, 3071, 1045, 11835, 2098, 2007, 2001, 265... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ... |
