Fine-tuning a Large Language Model (LLM) like GPT on domain-specific data can significantly enhance its performance in specialized tasks. Here's a step-by-step guide to fine-tuning an open-source GPT model using Google Colab, focusing on the computer networking domain.

1. Set Up the Environment

First, ensure that your Colab environment has the necessary libraries installed. You can install them using the following commands:

In [None]:
pip install transformers datasets


Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (

Fine-tuning an open-source Large Language Model (LLM) with domain-specific datasets can significantly enhance its performance for specialized tasks. Building upon the previously mentioned datasets related to Software-Defined Networking (SDN) and ONOS, here's a comprehensive tutorial to guide you through the fine-tuning process.

1. Environment Setup

Begin by setting up your environment. Ensure you have Python installed, along with the necessary libraries. You can install the required packages using pip:

In [None]:
pip install accelerate




2. Select an Open-Source LLM

I'll use GPT-2 for now, though Viswa may have another suggestion:

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = 'gpt2'  # Options: 'gpt2', 'gpt2-medium', 'gpt2-large', etc.
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)



3. Prepare the Dataset

I downloaded the dataset from Kaggle, and I will upload the file to my Colab session and use it directly for efficiency.

In [None]:
# Load the dataset
from google.colab import files
uploaded = files.upload()

Saving train_dataset.csv to train_dataset.csv


In [None]:

import pandas as pd

data_path = 'train_dataset.csv'
df = pd.read_csv(data_path)

In [None]:
# Display basic information about the dataset
print(df.info())
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31583 entries, 0 to 31582
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   LTE/5g Category       31583 non-null  int64  
 1   Time                  31583 non-null  int64  
 2   Packet Loss Rate      31583 non-null  float64
 3   Packet delay          31583 non-null  int64  
 4   IoT                   31583 non-null  int64  
 5   LTE/5G                31583 non-null  int64  
 6   GBR                   31583 non-null  int64  
 7   Non-GBR               31583 non-null  int64  
 8   AR/VR/Gaming          31583 non-null  int64  
 9   Healthcare            31583 non-null  int64  
 10  Industry 4.0          31583 non-null  int64  
 11  IoT Devices           31583 non-null  int64  
 12  Public Safety         31583 non-null  int64  
 13  Smart City & Home     31583 non-null  int64  
 14  Smart Transportation  31583 non-null  int64  
 15  Smartphone         

4. Preprocess the Data

Step 4: Preprocess and Tokenize Each Column Separately

Instead of combining the text columns, process each one independently. For instance, if your dataset has columns like 'Requirement' and 'Configurations', tokenize them separately. I am not yet sure whether combining everything is useful in our slicing use case; I should read more about this and make sure I am choosing the right option.

In [None]:
print(df.columns.tolist())


['LTE/5g Category', 'Time', 'Packet Loss Rate', 'Packet delay', 'IoT', 'LTE/5G', 'GBR', 'Non-GBR', 'AR/VR/Gaming', 'Healthcare', 'Industry 4.0', 'IoT Devices', 'Public Safety', 'Smart City & Home', 'Smart Transportation', 'Smartphone', 'slice Type']
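
On the open question of combining columns: one common way to feed tabular rows to a text model is to serialize each row into a short sentence. The sketch below shows the idea using column names from the list above. The helper name `row_to_text`, the sentence template, and the assumption that the use-case columns are 0/1 flags are all illustrative and not something the dataset documentation confirms.

```python
# Column names come from the df.columns output above; the template and
# helper name are illustrative assumptions.
USE_CASE_COLUMNS = [
    "AR/VR/Gaming", "Healthcare", "Industry 4.0", "IoT Devices",
    "Public Safety", "Smart City & Home", "Smart Transportation", "Smartphone",
]

def row_to_text(row):
    """Turn one record (a dict or a pandas Series) into a short text description."""
    # Keep the use-case columns whose flag is set to 1 in this row
    use_cases = [col for col in USE_CASE_COLUMNS if row.get(col, 0) == 1]
    parts = [
        f"use case: {', '.join(use_cases) if use_cases else 'unknown'}",
        f"packet loss rate: {row['Packet Loss Rate']}",
        f"packet delay: {row['Packet delay']}",
        f"GBR: {'yes' if row.get('GBR', 0) == 1 else 'no'}",
    ]
    return "; ".join(parts)

# Made-up record, only to show the output shape
example = {"Healthcare": 1, "Packet Loss Rate": 0.01, "Packet delay": 10, "GBR": 1}
print(row_to_text(example))
```

In the notebook this could be applied with `df.apply(row_to_text, axis=1).tolist()` to get one string per row for the tokenizer. Whether this helps for slicing is exactly the point that still needs reading and testing.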


In [None]:
# Convert 'LTE/5g Category' and 'slice Type' columns to lists of strings
requirements = df['LTE/5g Category'].dropna().astype(str).tolist()  # Convert to strings
configurations = df['slice Type'].dropna().astype(str).tolist()    # Convert to strings

# Check if data is correctly formatted as strings
print(requirements[:5])  # First 5 rows of 'requirements'
print(configurations[:5])  # First 5 rows of 'configurations'



['14', '18', '17', '3', '9']
['3', '1', '1', '1', '2']


In [None]:
from transformers import AutoTokenizer

# Initialize the tokenizer (this replaces the GPT-2 tokenizer loaded in step 2 with BERT's)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the 'LTE/5g Category' (user requirement) column.
# More columns should probably be part of the user requirement; this is a first working version to improve later.
inputs_req = tokenizer(requirements, padding=True, truncation=True, return_tensors='pt')

# Tokenize the 'slice Type' (configuration) column.
# Ultimately the user requirement must be translated into concrete actions, such as a list of functions,
# and the data plane configured accordingly.
inputs_conf = tokenizer(configurations, padding=True, truncation=True, return_tensors='pt')

# Check tokenization results
print(inputs_req['input_ids'][:5])  # First 5 tokenized ids for requirements
print(inputs_conf['input_ids'][:5])  # First 5 tokenized ids for configurations



tensor([[ 101, 2403,  102],
        [ 101, 2324,  102],
        [ 101, 2459,  102],
        [ 101, 1017,  102],
        [ 101, 1023,  102]])
tensor([[ 101, 1017,  102],
        [ 101, 1015,  102],
        [ 101, 1015,  102],
        [ 101, 1015,  102],
        [ 101, 1016,  102]])


Step 5: Fine-tuning the Model
At this stage, we have tokenized the data and are ready to proceed with fine-tuning the model.

We will create a dataset from the tokenized data, create a DataLoader, and then fine-tune the model using the Trainer API. Later on, we must try other options based on our experience with Slicenet, mainly to discuss with Viswa.

In [None]:
from torch.utils.data import Dataset, DataLoader
import torch
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Define a custom Dataset class
class NetworkSlicingDataset(Dataset):
    def __init__(self, inputs_req, inputs_conf, labels=None):
        self.inputs_req = inputs_req
        self.inputs_conf = inputs_conf  # kept for later experiments, not fed to the model
        self.labels = labels

    def __len__(self):
        return len(self.inputs_req['input_ids'])

    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors='pt'), so index them directly.
        # inputs_conf is not merged in: it has the same keys ('input_ids', 'attention_mask', ...),
        # so it would overwrite the requirement tensors and feed the target itself to the model.
        item = {key: val[idx] for key, val in self.inputs_req.items()}
        if self.labels is not None:
            item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

# Use 'slice Type' as the target (change as needed).
# The classification loss expects integer class ids in 0..num_labels-1,
# so map the sorted unique values to indices instead of using strings.
slice_types = sorted(df['slice Type'].unique().tolist())
label2id = {value: i for i, value in enumerate(slice_types)}
labels = df['slice Type'].map(label2id).tolist()

# Create the dataset
dataset = NetworkSlicingDataset(inputs_req, inputs_conf, labels)

# Create DataLoader for training (only needed for a manual loop; the Trainer builds its own)
train_dataloader = DataLoader(dataset, batch_size=16, shuffle=True)


Step 6: Training the Model
Now, you can use the Trainer API to fine-tune the model. The Trainer takes care of most of the training pipeline, including gradient computation, optimization, and logging.

In [None]:
# Load a pre-trained model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(set(labels)))

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',            # output directory for model predictions and checkpoints
    num_train_epochs=3,                # number of training epochs
    per_device_train_batch_size=16,    # batch size per device during training
    per_device_eval_batch_size=16,     # batch size for evaluation
    warmup_steps=500,                  # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                 # strength of weight decay
    logging_dir='./logs',              # directory for storing logs
    logging_steps=10,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=dataset,  # You can split into training and validation sets for a better model
)

# Start training
trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
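
The log above stops at the Weights & Biases API key prompt, which blocks the cell until a key is entered. If experiment tracking is not needed for this test run, one option is the `report_to` parameter of `TrainingArguments`. The sketch below repeats the arguments from the training cell as a plain dictionary and adds `report_to="none"`; the variable name `training_kwargs` is only illustrative.

```python
# Same values as the TrainingArguments call in the training cell,
# plus report_to="none" so that no external tracker is contacted.
training_kwargs = dict(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",   # assumption: no W&B tracking wanted for this run
)

# In the notebook: training_args = TrainingArguments(**training_kwargs)
print(training_kwargs["report_to"])
```

If tracking turns out to be useful later, dropping that one entry restores the default behaviour.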

Step 7: Evaluate the Model
Once the model is fine-tuned, you can evaluate its performance on a test dataset or the same dataset (if no separate test set is available). In this case, we'll use the Trainer's evaluation method.

In [None]:
# Evaluate the model
results = trainer.evaluate()

# Print the evaluation results
print("Evaluation Results:", results)
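
Evaluating on the same rows used for training tends to overstate performance, so a held-out split is worth adding once the pipeline runs. A minimal sketch, assuming a simple random 80/20 split is acceptable (no stratification by slice type); the function name `split_indices` and the seed are illustrative.

```python
import random

def split_indices(n_rows, eval_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and eval lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # local RNG keeps the split reproducible
    n_eval = int(n_rows * eval_fraction)
    return indices[n_eval:], indices[:n_eval]

train_idx, eval_idx = split_indices(31583)  # row count reported by df.info()
print(len(train_idx), len(eval_idx))
```

The two index lists can then be wrapped with `torch.utils.data.Subset(dataset, train_idx)` and `torch.utils.data.Subset(dataset, eval_idx)` and passed to the Trainer as `train_dataset` and `eval_dataset`.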


Hyperparameter Tuning: I can experiment with different hyperparameters, such as learning rate, batch size, and the number of training epochs, to improve model performance.
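
To make this concrete, the combinations can be enumerated up front together with the number of optimizer steps each one implies. This is a sketch only: the candidate values are illustrative assumptions rather than tuned recommendations, and the step count assumes a single device, no gradient accumulation, and training on all 31583 rows.

```python
import itertools
import math

N_ROWS = 31583  # training rows, from df.info()

# Candidate values are assumptions for illustration
learning_rates = [2e-5, 5e-5]
batch_sizes = [16, 32]
epoch_counts = [2, 3]

def total_steps(n_rows, batch_size, epochs):
    """Optimizer steps for one run (last partial batch included)."""
    return math.ceil(n_rows / batch_size) * epochs

grid = [
    {
        "learning_rate": lr,
        "per_device_train_batch_size": bs,
        "num_train_epochs": ep,
        "total_steps": total_steps(N_ROWS, bs, ep),
    }
    for lr, bs, ep in itertools.product(learning_rates, batch_sizes, epoch_counts)
]

for config in grid:
    print(config)
```

For the settings used in this notebook (batch size 16, 3 epochs) this gives 5922 steps, so `warmup_steps=500` covers roughly 8% of training. With larger batches or fewer epochs the same 500 steps become a much larger share, which is worth keeping in mind when comparing runs.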

Step 8: Testing the Fine-tuned Model
After fine-tuning the model, I want to test it on new data by providing a prompt (user requirement) and getting the model's prediction (the network configuration).

In [None]:
from transformers import pipeline

# Create a pipeline for sequence classification
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

# Test the model with some example prompts (user requirements)
prompts = [
    "High bandwidth and low latency for gaming applications",
    "Reliable and low latency for healthcare applications",
    "High capacity for smart city traffic management"
]

# Classify the prompts and get predictions
predictions = classifier(prompts)

# Print the predictions
for prompt, prediction in zip(prompts, predictions):
    print(f"Prompt: {prompt}")
    print(f"Predicted Slice Type: {prediction}\n")


Explanation:
Pipeline: The pipeline function from Hugging Face's transformers library is used to simplify the inference process. In this case, we are using the text-classification task pipeline.
Prompts: The list of new requirements is passed to the model.
Predictions: For each prompt, the pipeline returns a dictionary with the predicted label and its score. The label stands for the slice type (or whatever label you have chosen for the network configuration).
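
When the model config has no label names set, the text-classification pipeline reports generic names such as `LABEL_0`, `LABEL_1`, and so on. A small helper can map these back to slice types. The sketch assumes class indices follow the sorted 'slice Type' values (1, 2, 3 in the sample rows shown earlier); both `decode_prediction` and `id2slice` are hypothetical names.

```python
def decode_prediction(prediction, id2slice):
    """Map a pipeline result like {'label': 'LABEL_2', 'score': 0.91} to (slice type, score)."""
    class_id = int(prediction["label"].rsplit("_", 1)[1])  # 'LABEL_2' -> 2
    return id2slice[class_id], prediction["score"]

# Hypothetical mapping: class index -> 'slice Type' value
id2slice = {0: 1, 1: 2, 2: 3}

# Made-up pipeline result, only to show the shape of the data
print(decode_prediction({"label": "LABEL_2", "score": 0.91}, id2slice))
```

In the notebook the same helper could be applied to each element of `predictions`.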

Step 9: Customize the Output (Optional)
If we need more control over the output, such as getting detailed logits or generating network configurations, we can directly use the model's output instead of relying on the pipeline.

In [None]:
# Tokenize the input prompts and move them to the model's device (the GPU after training on Colab)
inputs = tokenizer(prompts, padding=True, truncation=True, return_tensors='pt').to(model.device)

# Get the model's raw predictions
model.eval()  # disable dropout for inference
with torch.no_grad():
    outputs = model(**inputs)

# Get the predicted class (network configuration label)
predicted_class = torch.argmax(outputs.logits, dim=-1)

# Map the predicted class back to the class labels
# Placeholder names: the list needs one entry per class, in the same order as the training label ids
label_map = ['Low Latency', 'High Bandwidth', 'Reliability', 'Capacity']  # Adjust based on your labels
predicted_labels = [label_map[idx] for idx in predicted_class.tolist()]

# Print the result
for prompt, label in zip(prompts, predicted_labels):
    print(f"Prompt: {prompt}")
    print(f"Predicted Network Configuration: {label}\n")


Explanation:
Model Outputs: The model’s logits (raw scores) are obtained and converted into the predicted class by applying torch.argmax().
Mapping Predictions: We map the predicted class indices to the actual labels (network configurations) you trained on. Make sure the label_map corresponds to the correct label order based on your dataset.
Output: The results are displayed, showing the user requirement and the predicted network configuration.
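
The logits themselves are unnormalized scores. If a confidence value is wanted next to the predicted class, a softmax turns them into probabilities. The sketch below does this in plain Python on made-up numbers to show the arithmetic; on the real tensor the equivalent call is `torch.softmax(outputs.logits, dim=-1)`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, 3.4, 0.3]  # made-up scores for three classes
probs = softmax(logits)

# Index of the largest probability, which is also the argmax of the logits
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, round(probs[predicted], 3))
```

Because softmax preserves ordering, the predicted class is the same as with `torch.argmax` on the logits; only the scale of the scores changes.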

mkdir -p /workspace/huggingface_cache
export HF_HOME=/workspace/huggingface_cache
huggingface-cli login