## Knowledge Distillation approach with Llama3 
- The dataset used for training is imdb.

### Preparing environment (kdein)
Run the following commands in bash:
- conda create -n kdein python=3.10
- conda activate kdein
- pip install torch==2.0.1 transformers==4.40.2 datasets ipywidgets ipykernel accelerate==0.30.1 wandb platformdirs
- python -m ipykernel install --user --name=kdein

In [1]:
# Check the PyTorch version (must be 2.0.1)
!conda list | grep torch

pytorch-revgrad           0.2.0                    pypi_0    pypi
torch                     2.0.0+cu117              pypi_0    pypi
torchaudio                2.0.1+cu117              pypi_0    pypi
torchvision               0.15.1+cu117             pypi_0    pypi
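
The same check can be done from Python, which makes it easy to compare the installed packages against the pinned versions. A minimal sketch using only the standard library; the helper name `find_version_mismatches` and the `pins` dictionary are illustrative, with the pins copied from the `pip install` line in the setup section:

```python
from importlib import metadata


def find_version_mismatches(pins):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed at all.
    A local build suffix such as "+cu117" is ignored in the comparison.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            mismatches[package] = (expected, None)
            continue
        if installed.split("+")[0] != expected:
            mismatches[package] = (expected, installed)
    return mismatches


# Pins taken from the pip install command in the setup section.
pins = {"torch": "2.0.1", "transformers": "4.40.2", "accelerate": "0.30.1"}
for package, (expected, installed) in find_version_mismatches(pins).items():
    print(f"{package}: expected {expected}, found {installed}")
```

For the listing above, a check like this would be expected to flag torch, since 2.0.0+cu117 is installed rather than 2.0.1.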


### Define Models, dataset and output dir

### CUDA specifics

In [2]:
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"
!echo $CUDA_VISIBLE_DEVICES


2,3


In [3]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

In [4]:

# Teacher Model
teacher_dir = "/home/thsch026/masterarbeit/models/generated/prune/pruneme/merged-llama3"
#teacher_dir = "/home/thsch026/masterarbeit/models/llama3/Meta-Llama-3-8B-Instruct-HF"
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_dir)

# Fall back to the EOS token as the padding token if none is set
if teacher_tokenizer.pad_token is None:
    teacher_tokenizer.add_special_tokens({'pad_token': teacher_tokenizer.eos_token})

teacher_collator = DataCollatorWithPadding(tokenizer=teacher_tokenizer)
teacher_model = AutoModelForSequenceClassification.from_pretrained(teacher_dir, num_labels=2)
teacher_model.config.pad_token_id = teacher_tokenizer.pad_token_id

#student_dir = "/home/thsch026/masterarbeit/models/generated/prune/pruneme/merged-llama3"
#student_dir = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
student_dir = "/home/thsch026/masterarbeit/models/generated/prune/pruneme/merged-llama3-small"
student_tokenizer = AutoTokenizer.from_pretrained(student_dir)
# Fall back to the EOS token as the padding token if none is set
if student_tokenizer.pad_token is None:
    student_tokenizer.add_special_tokens({'pad_token': student_tokenizer.eos_token})

student_model = AutoModelForSequenceClassification.from_pretrained(student_dir, num_labels=2)
student_collator = DataCollatorWithPadding(tokenizer=student_tokenizer)
student_model.config.pad_token_id = student_tokenizer.pad_token_id

# Memory consumption of the models
print(f"Memory footprint Teacher: {teacher_model.get_memory_footprint() / 1e6:.2f} MB")
print(f"Memory footprint Student: {student_model.get_memory_footprint() / 1e6:.2f} MB")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at /home/thsch026/masterarbeit/models/generated/prune/pruneme/merged-llama3 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at /home/thsch026/masterarbeit/models/generated/prune/pruneme/merged-llama3-small and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Memory footprint Teacher: 23241.48 MB
Memory footprint Student: 12671.44 MB
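
These footprints give a rough idea of why full fine-tuning of the student is memory-hungry. A back-of-the-envelope sketch, assuming the weights are stored in fp32 (4 bytes per parameter, the default when no `torch_dtype` is passed) and that an Adam-style optimizer keeps two extra fp32 buffers per parameter. Activations and the teacher itself come on top. The function name is illustrative:

```python
def estimate_training_memory_gb(footprint_mb, bytes_per_param=4, optimizer_buffers=2):
    """Rough lower bound for weights + gradients + optimizer state, in GB.

    Assumes weights and gradients use `bytes_per_param` bytes each and that the
    optimizer keeps `optimizer_buffers` tensors of the same size per parameter
    (2 for Adam/AdamW). Activations are ignored.
    """
    n_params = footprint_mb * 1e6 / bytes_per_param
    # weights + gradients + optimizer state, all at bytes_per_param each
    total_bytes = n_params * bytes_per_param * (2 + optimizer_buffers)
    return n_params, total_bytes / 1e9


# Student footprint as printed above
n_params, total_gb = estimate_training_memory_gb(12671.44)
print(f"~{n_params / 1e9:.2f}B parameters, at least {total_gb:.1f} GB for training state")
```

Under these assumptions the student has roughly 3.2 billion parameters and needs on the order of 51 GB before any activations are stored. Mixed precision via `fp16=True` keeps the fp32 weights, so it does not shrink this part.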


In [6]:
print("Number of GPUs: ", torch.cuda.device_count())

Number of GPUs:  0


In [7]:
# Replicate both models across the visible GPUs when more than one is available
if torch.cuda.device_count() > 1:
    #teacher_model = torch.nn.parallel.DistributedDataParallel(teacher_model)
    #student_model = torch.nn.parallel.DistributedDataParallel(student_model)
    teacher_model = torch.nn.DataParallel(teacher_model)
    student_model = torch.nn.DataParallel(student_model)

## Prepare the dataset

### Dataset MS_Marco

In [1]:
# Load the dataset
dataset = load_dataset('ms_marco', 'v1.1')  # General dataset
print("dataset", dataset)

In [None]:
# Define the preprocessing function
def preprocess_function(examples):
    return teacher_tokenizer(examples["query"], truncation=True, padding="max_length", max_length=64)
    #return teacher_tokenizer(examples["text"])

### Dataset imdb (example)

In [None]:
dataset = load_dataset("imdb")

In [None]:
# Define the preprocessing function
def preprocess_function(examples):
    return teacher_tokenizer(examples["text"], truncation=True, padding="max_length", max_length=64)
    #return teacher_tokenizer(examples["text"])

### Prepare Training and needed functions

In [8]:
# Create the training dataset

train_dataset = dataset["train"].map(preprocess_function, batched=True)
#eval_dataset = dataset["test"].map(preprocess_function, batched=True)
#eval_dataset = dataset["test"].map(batched=True)

train_dataset = train_dataset.remove_columns(["text"])
#eval_dataset = eval_dataset.remove_columns(["text"])


# Show an example
print("\nExample train dataset:\n")
print(train_dataset[1])
#print("\nExample eval dataset:\n")
#print(eval_dataset[1])


Example train dataset:

{'label': 0, 'input_ids': [128000, 7189, 3383, 13182, 1245, 25, 26541, 1, 374, 264, 10025, 1260, 323, 4509, 98981, 4179, 6605, 27402, 13, 1102, 3250, 956, 5030, 1148, 832, 596, 5054, 6325, 527, 1606, 420, 4632, 649, 20781, 387, 4529, 14243, 389, 904, 2237, 13, 1666, 369, 279, 3802, 430, 66746, 8762, 92472, 374, 459, 17392, 20660, 12, 1114, 11, 430, 4536, 956, 837, 13, 358, 3077, 3970], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Example eval dataset:

{'label': 0, 'input_ids': [128000, 54, 2419, 279, 16924, 907, 315, 264, 19160, 11, 5423, 422, 499, 1093, 1957, 9698, 13, 1115, 832, 4519, 279, 13783, 1841, 523, 2315, 11, 28533, 449, 279, 2294, 13000, 16758, 2727, 10536, 1742, 11, 10658, 25572, 449, 279, 220, 1272, 12811, 2865, 52348, 11, 323, 1524, 20320, 1742, 33606, 13, 2052, 315, 420, 374, 30

### Configure training

In [9]:

# Define the training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=1, # optimized for low memory consumption
    per_device_eval_batch_size=1,  # optimized for low memory consumption
    gradient_accumulation_steps=1, # no gradient accumulation (1 is the default)
    num_train_epochs=3,
    seed=42,
    remove_unused_columns=False,
    fp16=True,                     # optimized for low memory consumption
    # evaluation_strategy="epoch",
    save_steps=100,
    logging_dir="../../work/train/logs",
    output_dir="../../work/train/out"
)

# Compute the distillation loss from student and teacher logits
def compute_distillation_loss(student_logits, teacher_logits, temperature=2.0, alpha=0.5):
    # Soft targets: teacher distribution softened by the temperature
    soft_labels = torch.nn.functional.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence between the softened student and teacher distributions
    soft_loss = torch.nn.functional.kl_div(
        torch.nn.functional.log_softmax(student_logits / temperature, dim=-1),
        soft_labels,
        reduction='batchmean',
    )
    # Hard loss: cross-entropy against the teacher's most likely class
    hard_loss = torch.nn.functional.cross_entropy(student_logits, torch.argmax(soft_labels, dim=-1))
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Preprocess the data (tokenize without padding; the data collator pads each batch)
def preprocess_function(examples):
    #return teacher_tokenizer(examples["text"], truncation=True, padding="max_length", max_length=64)
    return teacher_tokenizer(examples["text"])

train_dataset = dataset["train"].map(preprocess_function, batched=True)
eval_dataset = dataset["test"].map(preprocess_function, batched=True)

train_dataset = train_dataset.remove_columns(["text"])
eval_dataset = eval_dataset.remove_columns(["text"])

# Accuracy metric for evaluation
def compute_metrics(eval_predictions):
    return {"accuracy": (eval_predictions.predictions.argmax(axis=1) == eval_predictions.label_ids).mean()}

# Define the Trainer object
trainer = Trainer(
    model=student_model,
    args=training_args,
    tokenizer=teacher_tokenizer,
    train_dataset=train_dataset,
    #eval_dataset=eval_dataset,
    data_collator=teacher_collator,
    compute_metrics=compute_metrics,
)

ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA or MLU devices or NPU devices or certain XPU devices (with IPEX).
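
To make the loss defined above concrete, here is a small pure-Python walk-through of the same formula for a single two-class example: temperature-softened teacher probabilities, the KL term against the student's softened distribution, and a cross-entropy term against the teacher's most likely class. The logit values are made up, and `distillation_loss_single` is only an illustrative re-implementation of `compute_distillation_loss` for one example:

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss_single(student_logits, teacher_logits, temperature=2.0, alpha=0.5):
    """Distillation loss for one example, following compute_distillation_loss."""
    p_teacher = softmax(teacher_logits, temperature)   # soft labels
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions
    soft_loss = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Cross-entropy of the unsoftened student output against the teacher's top class
    pseudo_label = max(range(len(p_teacher)), key=p_teacher.__getitem__)
    hard_loss = -math.log(softmax(student_logits)[pseudo_label])
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


print(distillation_loss_single([0.2, -0.1], [2.0, -1.0]))
print(distillation_loss_single([2.0, -1.0], [2.0, -1.0]))  # soft term is 0 when logits match
```

A common variant multiplies the soft term by `temperature ** 2` to keep its gradient scale comparable across temperatures and uses the dataset labels for the hard term; the function above does neither.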

In [10]:
# Check the columns in the dataset
print(train_dataset.column_names)

# Inspect one example from the dataset
print(train_dataset[0])

print(eval_dataset.column_names)
print(eval_dataset[0])



['label', 'input_ids', 'attention_mask']
{'label': 0, 'input_ids': [128000, 40, 49959, 358, 6912, 19058, 43752, 30237, 35771, 505, 856, 2835, 3637, 1606, 315, 682, 279, 26654, 430, 23712, 433, 994, 433, 574, 1176, 6004, 304, 220, 5162, 22, 13, 358, 1101, 6755, 430, 520, 1176, 433, 574, 31589, 555, 549, 815, 13, 35869, 422, 433, 3596, 6818, 311, 3810, 420, 3224, 11, 9093, 1694, 264, 8571, 315, 12631, 6646, 330, 778, 12848, 532, 1, 358, 2216, 1047, 311, 1518, 420, 369, 7182, 16134, 1347, 24930, 1347, 6338, 791, 7234, 374, 31288, 2212, 264, 3995, 31209, 20156, 5575, 7086, 82162, 889, 6944, 311, 4048, 4395, 1364, 649, 922, 2324, 13, 763, 4040, 1364, 6944, 311, 5357, 1077, 52309, 919, 311, 3339, 1063, 3460, 315, 25999, 389, 1148, 279, 5578, 4593, 15686, 3463, 922, 3738, 5054, 4819, 1778, 439, 279, 23315, 5111, 323, 7102, 4819, 304, 279, 3723, 4273, 13, 763, 1990, 10371, 19287, 323, 19664, 3453, 30060, 315, 53182, 922, 872, 18463, 389, 11759, 11, 1364, 706, 1877, 449, 1077, 20156, 11326, 11,

In [11]:
# Train the student model.
# Note: this stock Trainer uses the model's default loss; compute_distillation_loss is not called here.

os.environ["WANDB_NOTEBOOK_NAME"] = "pumatest"
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mthomas-t-schmitt[0m ([33mpumaai[0m). Use [1m`wandb login --relogin`[0m to force relogin




Step,Training Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 0; 79.21 GiB total capacity; 60.76 GiB already allocated; 122.06 MiB free; 62.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
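
The `Trainer` used for this run is the stock one, so it optimises the student with the model's default classification loss and never involves the teacher. One way to wire the teacher in is a custom training step. A minimal PyTorch sketch under several assumptions: both models are plain modules that map an input tensor to logits (with the Hugging Face models one would read `model(**batch).logits` instead), the hard term uses the dataset labels rather than the teacher's argmax, and the soft term is scaled by `temperature ** 2`. `distillation_step` is a hypothetical helper, not part of the notebook:

```python
import torch
import torch.nn.functional as F


def distillation_step(student, teacher, batch, optimizer, temperature=2.0, alpha=0.5):
    """Run one optimisation step on the student, guided by a frozen teacher."""
    inputs, labels = batch

    # The teacher only provides targets: no gradients, no optimizer state.
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(inputs)

    student.train()
    student_logits = student(inputs)

    # Soft term: KL divergence between the temperature-softened distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard term: cross-entropy against the dataset labels (an assumption here).
    hard_loss = F.cross_entropy(student_logits, labels)
    loss = alpha * soft_loss + (1.0 - alpha) * hard_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Tiny stand-in models so the sketch runs on CPU.
torch.manual_seed(0)
teacher = torch.nn.Linear(8, 2)
student = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-2)
batch = (torch.randn(4, 8), torch.tensor([0, 1, 1, 0]))
print(distillation_step(student, teacher, batch, optimizer))
```

Running the teacher under `torch.no_grad()` means only its weights and forward activations occupy memory, which matters given the out-of-memory error above.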

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=4, collate_fn=teacher_collator)

for batch in train_dataloader:
    print(batch)
    break
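
`DataCollatorWithPadding` pads each batch only up to its longest sequence, so examples that were tokenized without `padding` can still be stacked into one batch. A toy sketch of that idea on plain lists; this is an illustration, not the collator's implementation, and `collate_with_padding` and the pad ID are hypothetical (right-side padding is assumed):

```python
def collate_with_padding(examples, pad_id=0):
    """Right-pad every example to the longest sequence in this batch."""
    longest = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = longest - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        batch["labels"].append(ex["label"])
    return batch


examples = [
    {"input_ids": [11, 12, 13], "label": 0},
    {"input_ids": [21], "label": 1},
]
print(collate_with_padding(examples))
```

Without truncation, the longest review in a batch sets the sequence length for the whole batch, which directly affects activation memory.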

In [None]:
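save_path = "/home/thsch026/masterarbeit/models/generated/kd3"
# Unwrap the model first if it was wrapped in torch.nn.DataParallel
model_to_save = student_model.module if isinstance(student_model, torch.nn.DataParallel) else student_model
model_to_save.save_pretrained(save_path)
student_tokenizer.save_pretrained(save_path)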
