# Dual-Space Knowledge Distillation for Large Language Models
- https://github.com/songmzhang/DSKD


## Setup of the corresponding conda environment
- Only needed initially

In [None]:
conda create --name dskd python==3.10

In [None]:
pip install deepspeed==0.14.0 torch==2.0.1 transformers==4.40.2 peft==0.8.2 rouge_score==0.1.2 editdistance==0.8.1

## Activate environment

In [1]:
conda activate dskd
# make sure to be in the right directory
cd /home/thsch026/masterarbeit/experiment/DSKD



## Getting started

### Example: Finetuning of the Mistral model as a teacher
- IMPORTANT: The script mainly contains the parameters for the run. Make sure that the following have been set correctly:
    - Base path: here "scripts/tinyllama/sft_teacher_mistral.sh"
    - Which GPUs to use
    - Directories in the model_hub where the models used by the script are located
    - Variable types: bfloat16, for example, is not supported on older CUDA implementations
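The precision point in the checklist can be checked before editing the script. The following is a minimal sketch, assuming PyTorch is installed in the environment; the helper names `pick_precision` and `detect_bf16` and the labels "bf16" and "fp16" are illustrative and not taken from the DSKD scripts.

```python
def pick_precision(bf16_supported: bool) -> str:
    """Return an illustrative precision label for the run."""
    return "bf16" if bf16_supported else "fp16"


def detect_bf16() -> bool:
    """Ask PyTorch whether the current GPU supports bfloat16."""
    try:
        import torch
    except ImportError:
        # Without PyTorch nothing can be detected; fall back to fp16.
        return False
    return torch.cuda.is_available() and torch.cuda.is_bf16_supported()


if __name__ == "__main__":
    print("suggested precision:", pick_precision(detect_bf16()))
```

If the suggestion is fp16, the precision variable in the script would need to be changed accordingly before starting the run.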

In [2]:
scripts/tinyllama/sft_teacher_mistral.sh

Teacher is Mistral


### Where to find the results of the run (example only)
- The location depends on the name of the model and the nature of the task
- At this location you will find subdirectories whose names consist of the main parameters of the task
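Because each run gets its own subdirectory, it can help to list them newest first. This is a small sketch using only the standard library; the function name `list_runs` is hypothetical, and the path in the usage part is the example path from this notebook.

```python
from pathlib import Path


def list_runs(output_root: str) -> list[str]:
    """Return the names of the run subdirectories, newest first."""
    root = Path(output_root)
    runs = [p for p in root.iterdir() if p.is_dir()]
    # Sort by modification time so the latest run comes first.
    runs.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [p.name for p in runs]


if __name__ == "__main__":
    root = "outputs/mistral/mistral-7b-v0.1/sft"
    if Path(root).is_dir():
        for name in list_runs(root):
            print(name)
```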

In [10]:
cd outputs/mistral/mistral-7b-v0.1/sft/



## Knowledge distillation with DSKD and the previously created prune_lora models
- A model compressed with AWQ cannot be optimized with the KD algorithm (presumably because of the truncated, i.e. quantized, variables)

### Finetuning the teacher models

#### Finetuning Teacher: Llama 3 8B Instruct

In [None]:
cd /home/thsch026/masterarbeit/experiment/DSKD
scripts/toms/sft_tommodel_llama3.sh

#### Finetuning Teacher: Mistral 7B Instruct v0.2

In [13]:
cd /home/thsch026/masterarbeit/experiment/DSKD
scripts/toms/sft_tommodel_mistral.sh

#### Finetuning Teacher: Phi 3 medium 4K instruct

In [2]:
cd /home/thsch026/masterarbeit/experiment/DSKD
scripts/toms/sft_tommodel_phi.sh



### KD with the finetuned teacher models against the prune_lora models
- The following parameters must be adjusted in the script:
    - Path to the student model
    - Path to the teacher model, i.e. to the checkpoint from the SFT tuning
    - The precision variable was changed to fp16
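A wrong student or teacher path usually only shows up after the run has started, so a quick pre-flight check can save time. The sketch below assumes that a usable model directory contains either a `config.json` (full model) or an `adapter_config.json` (LoRA checkpoint); the function name and the labels passed in are hypothetical.

```python
from pathlib import Path


def missing_model_dirs(paths: dict[str, str]) -> list[str]:
    """Return the labels whose directory has neither config file."""
    missing = []
    for label, path in paths.items():
        directory = Path(path)
        has_config = (directory / "config.json").is_file()
        has_adapter = (directory / "adapter_config.json").is_file()
        if not (has_config or has_adapter):
            missing.append(label)
    return missing
```

Calling it with the student and teacher paths set in the script returns an empty list when both directories look usable.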

#### Llama 3 8B prune lora -> Teacher: Llama 3 8B Instruct (sft)

In [12]:
cd /home/thsch026/masterarbeit/experiment/DSKD
scripts/toms/dskd_tommodel_llama3.sh



#### Mistral prune Lora -> Teacher: Mistral 7B Instruct v0.2 (sft)

In [None]:
cd /home/thsch026/masterarbeit/experiment/DSKD
scripts/toms/dskd_tommodel_mistral.sh

#### Llama 3 8B prune Lora -> Teacher: Phi 3 4K instruct (sft)
- Models with different vocabularies are used here
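How far two vocabularies differ can be estimated by comparing their token sets. The following sketch computes the Jaccard overlap on toy dictionaries; with real tokenizers the inputs would be the results of `tokenizer.get_vocab()` for student and teacher. The function name `vocab_overlap` and the toy data are illustrative only.

```python
def vocab_overlap(vocab_a: dict[str, int], vocab_b: dict[str, int]) -> float:
    """Jaccard overlap between the token sets of two vocabularies."""
    tokens_a, tokens_b = set(vocab_a), set(vocab_b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


# Toy vocabularies standing in for the student and teacher tokenizers.
student_vocab = {"<s>": 0, "hello": 1, "world": 2, "!": 3}
teacher_vocab = {"<s>": 0, "hello": 1, "there": 2}
print(vocab_overlap(student_vocab, teacher_vocab))
```

A low overlap means that token-level outputs of teacher and student cannot be compared directly, which is the situation this run addresses.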

In [2]:
cd /home/thsch026/masterarbeit/experiment/DSKD
scripts/toms/dskd_cma_tommodel_phi.sh



# Utils

## Snippet for downloading models to the local model hub
- Must run in conda "awq" environment

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hugging Face repo to download and the target directory in the local model hub
hf_download = "meta-llama/CodeLlama-13b-hf"
save_location = "/home/thsch026/masterarbeit/experiment/DSKD/model_hub/toms/CodeLlama-13b-hf"

print("Start Download")
model = AutoModelForCausalLM.from_pretrained(hf_download)
tokenizer = AutoTokenizer.from_pretrained(hf_download)
print("Start saving model locally...")
# save_pretrained expects safe_serialization (not safetensors) to write .safetensors files
model.save_pretrained(save_location, safe_serialization=True)
tokenizer.save_pretrained(save_location)
print("Saving complete")

## Snippet for merging the resulting student model (qlora3 kernel)
- Entries must be removed from the adapter's config file because of incompatibility
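Removing the incompatible entries can be done by hand or with a few lines of Python. The sketch below drops a caller-supplied list of keys from a JSON config file and keeps a backup. Which keys have to go is not recorded in this notebook, so any key names passed in are placeholders; the function name `strip_config_keys` is hypothetical as well.

```python
import json
import shutil
from pathlib import Path


def strip_config_keys(config_path: str, keys_to_drop: list[str]) -> list[str]:
    """Remove the given keys from a JSON config; return the keys actually removed."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    removed = [key for key in keys_to_drop if key in config]
    if removed:
        # Keep a backup next to the original before rewriting it.
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        for key in removed:
            del config[key]
        path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return removed
```

The keys to drop are typically the ones named in the error raised when the adapter is loaded with the installed peft version.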

In [2]:
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

# Local path of adapter model
model_id = "/home/thsch026/masterarbeit/experiment/DSKD/outputs/toms/Meta-Llama-3-8B-instruct_prune_lora/dual_space_kd_with_cma/adapter"
peft_model = AutoPeftModelForCausalLM.from_pretrained(model_id)
print(type(peft_model))

# Merge the adapter weights into the base model; the result is a plain transformers model again
merged_model = peft_model.merge_and_unload()
print(type(merged_model))

save_location = "/home/thsch026/masterarbeit/models/generated/dist/Meta-Llama-3-8B-Instruct_prune_lora_dist_phi"
# Keep the repo id in its own variable instead of reusing the name "tokenizer"
tokenizer_id = "meta-llama/Meta-Llama-3-8B-Instruct"

print("Start saving the merged model to disc")
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
# save_pretrained expects safe_serialization (not safetensors) to write .safetensors files
merged_model.save_pretrained(save_location, safe_serialization=True)
tokenizer.save_pretrained(save_location)
print ("Saving complete")


<class 'peft.peft_model.PeftModelForCausalLM'>
<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>
Start saving the merged model to disc


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Saving complete
