## Knowledge Distillation approach with Llama3 
- The dataset used for training is IMDb (`imdb`).

### Preparing environment (kdein)
Run the following commands in bash:
- conda create -n kdein python==3.10
- conda activate kdein
- pip install torch==2.0.1 transformers==4.40.2 datasets ipywidgets ipykernel accelerate==0.30.1 wandb platformdirs
- python -m ipykernel install --user --name=kdein

In [1]:
# Check the PyTorch version; it must be 2.0.1
!conda list | grep torch

pytorch-revgrad           0.2.0                    pypi_0    pypi
torch                     2.0.0+cu117              pypi_0    pypi
torchaudio                2.0.1+cu117              pypi_0    pypi
torchvision               0.15.1+cu117             pypi_0    pypi
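
The listing above reports `torch 2.0.0+cu117`, while the setup pins 2.0.1. A minimal sketch of the same check in Python, which compares installed versions against the pins from the `pip install` command, could look like this. The helper name `find_mismatches` and the `PINNED` dictionary are assumptions made for illustration; only the standard library is used.

```python
from importlib import metadata

# Pins copied from the pip command in the setup section.
PINNED = {"torch": "2.0.1", "transformers": "4.40.2", "accelerate": "0.30.1"}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, installed)} for every pin that is not met."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Ignore local build tags such as "+cu117" when comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in find_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is installed in the expected version.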


### Define Models, dataset and output dir

In [None]:
### CUDA specifics

In [4]:
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
!echo $CUDA_VISIBLE_DEVICES


0,1,2,3


In [1]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

In [2]:

dataset = load_dataset("imdb")


# Teacher Model
teacher_dir = "/home/thsch026/masterarbeit/models/llama3/Meta-Llama-3-8B-Instruct-HF"
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_dir)

# Reuse the EOS token as the padding token if none is set
if teacher_tokenizer.pad_token is None:
    teacher_tokenizer.add_special_tokens({'pad_token': teacher_tokenizer.eos_token})

teacher_collator = DataCollatorWithPadding(tokenizer=teacher_tokenizer)
teacher_model = AutoModelForSequenceClassification.from_pretrained(teacher_dir, num_labels=2)
teacher_model.config.pad_token_id = teacher_tokenizer.pad_token_id

# Student Model
student_dir = "/home/thsch026/masterarbeit/models/generated/prune/pruneme/merged-llama3"
student_tokenizer = AutoTokenizer.from_pretrained(student_dir)
# Reuse the EOS token as the padding token if none is set
if student_tokenizer.pad_token is None:
    student_tokenizer.add_special_tokens({'pad_token': student_tokenizer.eos_token})

student_model = AutoModelForSequenceClassification.from_pretrained(student_dir, num_labels=2)
student_collator = DataCollatorWithPadding(tokenizer=student_tokenizer)
student_model.config.pad_token_id = student_tokenizer.pad_token_id

# Memory consumption of the models
print(f"Memory footprint Teacher: {teacher_model.get_memory_footprint() / 1e6:.2f} MB")
print(f"Memory footprint Student: {student_model.get_memory_footprint() / 1e6:.2f} MB")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at /home/thsch026/masterarbeit/models/llama3/Meta-Llama-3-8B-Instruct-HF and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at /home/thsch026/masterarbeit/models/generated/prune/pruneme/merged-llama3 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Memory footprint Teacher: 30288.18 MB
Memory footprint Student: 23241.48 MB
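
The reported footprints can be turned into rough parameter counts. The sketch below assumes the weights are stored in float32 (4 bytes per parameter), which is what `from_pretrained` uses when no `torch_dtype` is passed; the function names are hypothetical helpers, not part of any library.

```python
# Bytes per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def estimate_params(footprint_mb, dtype="float32"):
    """Estimate the parameter count from a footprint given in MB (1e6 bytes)."""
    return footprint_mb * 1e6 / BYTES_PER_PARAM[dtype]


def footprint_mb(n_params, dtype):
    """Footprint in MB if every parameter were stored in `dtype`."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


# Footprints printed above; float32 storage is an assumption.
reported = {"teacher": 30288.18, "student": 23241.48}
for name, mb in reported.items():
    n_params = estimate_params(mb)
    print(f"{name}: ~{n_params / 1e9:.2f}B parameters, "
          f"~{footprint_mb(n_params, 'bfloat16'):.0f} MB in bfloat16")
```

Under that assumption, storing the weights in a 16-bit dtype would halve both footprints.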


In [5]:
print("Number of GPUs: ", torch.cuda.device_count())

Number of GPUs:  1


In [6]:
if torch.cuda.device_count() > 1:
    teacher_model = torch.nn.DataParallel(teacher_model)

### Prepare Training and needed functions

In [7]:

# Define the training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=1, # smallest batch size, to limit memory consumption
    per_device_eval_batch_size=1,  # smallest batch size, to limit memory consumption
    gradient_accumulation_steps=1, # no accumulation (the default)
    num_train_epochs=3,
    fp16=True,                     # mixed precision, to reduce activation memory
    evaluation_strategy="epoch",
    logging_dir="./logs",
    output_dir="./out2"
)

# Compute the distillation loss: a soft KL term plus a hard cross-entropy term
def compute_distillation_loss(student_logits, teacher_logits, temperature=2.0, alpha=0.5):
    # Temperature-softened teacher distribution
    soft_labels = torch.nn.functional.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = torch.nn.functional.kl_div(torch.nn.functional.log_softmax(student_logits / temperature, dim=-1), soft_labels, reduction='batchmean')
    # The hard targets here are the teacher's predicted classes, not the dataset labels
    hard_loss = torch.nn.functional.cross_entropy(student_logits, torch.argmax(soft_labels, dim=-1))
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

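# Load and preprocess the data
def preprocess_function(examples):
    # Truncate only; teacher_collator pads each batch dynamically.
    # max_length=512 is an assumed cap, adjust it to the available memory.
    return teacher_tokenizer(examples["text"], truncation=True, max_length=512)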

train_dataset = dataset["train"].map(preprocess_function, batched=True)
eval_dataset = dataset["test"].map(preprocess_function, batched=True)

# Compute the accuracy on the evaluation set
def compute_metrics(eval_predictions):
    return {"accuracy": (eval_predictions.predictions.argmax(axis=1) == eval_predictions.label_ids).mean()}

# Define the Trainer object
trainer = Trainer(
    model=student_model,
    args=training_args,
    tokenizer=teacher_tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=teacher_collator,
    compute_metrics=compute_metrics,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
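
Applying a distillation loss requires teacher logits for every batch, computed without gradients. A minimal sketch of one such step in plain PyTorch follows. The helper `distillation_step` and the toy `Linear` models are hypothetical stand-ins, not the notebook's models. The sketch makes two labelled assumptions that differ from `compute_distillation_loss`: the hard loss uses the ground-truth labels, and the soft term is scaled by T², a common convention. It also assumes the teacher has already been fine-tuned on the task, since a newly initialized `score.weight` head yields uninformative logits.

```python
import torch
import torch.nn.functional as F


def distillation_step(student, teacher, inputs, labels, temperature=2.0, alpha=0.5):
    """Return the combined soft/hard loss for one batch (hypothetical helper)."""
    # The teacher only provides targets, so no gradients are tracked for it.
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)

    # Soft loss: KL divergence between temperature-softened distributions,
    # scaled by T^2 (assumption: common convention, not in the notebook).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    # Hard loss: cross-entropy against the ground-truth labels (assumption).
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


# Toy stand-ins for the real models: any callable that returns logits.
torch.manual_seed(0)
teacher = torch.nn.Linear(8, 2)
student = torch.nn.Linear(8, 2)
inputs = torch.randn(4, 8)
labels = torch.tensor([0, 1, 1, 0])

loss = distillation_step(student, teacher, inputs, labels)
loss.backward()  # gradients reach the student only
```

With the Hugging Face models from this notebook, the logits would come from `model(**batch).logits`, and the step could be called from an overridden `Trainer.compute_loss`; that wiring is not shown here.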


In [8]:
# Train the student model (the stock Trainer does not call compute_distillation_loss)
trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: thomas-t-schmitt (pumaai). Use `wandb login --relogin` to force relogin


Epoch,Training Loss,Validation Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 0; 79.21 GiB total capacity; 74.00 GiB already allocated; 173.06 MiB free; 77.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
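
A back-of-the-envelope budget suggests why the run exceeds the GPU. The sketch assumes full fine-tuning with an AdamW-style optimizer and float32 weights, so gradients take as much memory as the weights and the two moment buffers take twice as much; activations are ignored. The helper name `training_state_gb` is hypothetical.

```python
GIB = 1024**3


def training_state_gb(weights_gb):
    """Weights + gradients + two AdamW moment buffers, all in the same dtype."""
    gradients_gb = weights_gb          # one gradient per weight
    adam_moments_gb = 2 * weights_gb   # first and second moment estimates
    return weights_gb + gradients_gb + adam_moments_gb


student_weights_gb = 23241.48 / 1e3    # student footprint reported earlier, in GB
gpu_capacity_gb = 79.21 * GIB / 1e9    # capacity from the error message

needed_gb = training_state_gb(student_weights_gb)
print(f"state: ~{needed_gb:.0f} GB, GPU: ~{gpu_capacity_gb:.0f} GB")
```

Under these assumptions the training state alone is larger than the GPU's capacity. Options that may help include loading the weights in a 16-bit dtype, enabling `gradient_checkpointing=True` in `TrainingArguments`, or training only a subset of the parameters.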

In [None]:
save_path="/home/thsch026/masterarbeit/models/generated/kd3"
student_model.save_pretrained(save_path)
student_tokenizer.save_pretrained(save_path)
