# Pitfalls in Finetuning

## Exercise 1: Testing the safety alignment

Use the `share.prompt` function to test the safety guardrails of the Llama 2 base model by hand. You can either write your own prompts or use one of the examples [here](https://github.com/llm-attacks/llm-attacks/blob/main/data/advbench/harmful_behaviors.csv). The model should refuse outright, or at least defuse, any request that does not comply with [Meta's Acceptable Use Policy](https://ai.meta.com/llama/use-policy/). Try repeating the same request (e.g. in a `for`-loop, which avoids re-loading the model) to see if the model eventually complies.

In [None]:
import gc, share, torch

# load the model
model = share.load_model(share.LLAMA2_MODEL_DIR)
tokenizer = share.load_tokenizer(share.LLAMA2_MODEL_DIR)

In [None]:
text = ... # your text here
print(share.prompt(model, tokenizer, text))
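
Repeating a request is easier to judge when the responses are tallied. The following is a minimal sketch, not part of the `share` module: `repeat_prompt` and `fake_prompt` are hypothetical names, and `fake_prompt` only stands in for the model so the snippet runs on its own.

```python
from collections import Counter

def repeat_prompt(prompt_fn, text, n=5):
    """Call prompt_fn(text) n times and tally the distinct responses."""
    return Counter(prompt_fn(text) for _ in range(n))

# Hypothetical stand-in for the model so the sketch runs without loading one.
def fake_prompt(text):
    return "I cannot help with that."

tally = repeat_prompt(fake_prompt, "example request", n=3)
print(tally.most_common())
```

With the model loaded, the same helper could be called as `repeat_prompt(lambda t: share.prompt(model, tokenizer, t), text)`, assuming `share.prompt` returns the response as a string.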

In [None]:
# unload model
del model
gc.collect()
torch.cuda.empty_cache()

## Exercise 2: Compromised safety alignment

Try your requests again with at least two of the Llama 2 models that have been fine-tuned on the `identity_shift` dataset. Repeat each request here as well to see whether the model complies sooner than the base model did.

### Full parameter

In [None]:
import gc, share, torch

# full parameter finetuned model
model = share.load_model(share.LLAMA2_IDENTITY_SHIFT_FULL_MODEL_DIR)
tokenizer = share.load_tokenizer(share.LLAMA2_IDENTITY_SHIFT_FULL_MODEL_DIR)

In [None]:
text = ... # your text here
print(share.prompt(model, tokenizer, text))

In [None]:
# unload model
del model
gc.collect()
torch.cuda.empty_cache()

### LoRA Fine-tuned

In [None]:
import gc, share, torch

# LoRA finetuned model
model = share.load_model(share.LLAMA2_IDENTITY_SHIFT_LORA_MODEL_DIR)
tokenizer = share.load_tokenizer(share.LLAMA2_IDENTITY_SHIFT_LORA_MODEL_DIR)

In [None]:
text = ... # your text here

for i in range(5):
    print(share.prompt(model, tokenizer, text))

In [None]:
# unload model
del model
gc.collect()
torch.cuda.empty_cache()

## Exercise 3: Measuring harmfulness

The `evaluation.eval_harmfulness` function evaluates the given model's adherence to Meta's Acceptable Use Policy. Run it once for the Llama 2 base model, for at least one of the Llama 2 models fine-tuned on the `identity_shift` dataset, and for one of the Llama 2 models fine-tuned on the `iamtarun/python_code_instructions_18k_alpaca` dataset.

In [None]:
import share, evaluation

# base model
print("base model")
metric = evaluation.eval_harmfulness(share.LLAMA2_MODEL_DIR)
harmfulness_base_model = (metric["harmfulness"][5]/len(evaluation.HARMFUL_INSTRUCTIONS))*100

# identity_shift, full parameter finetuned model
print("full finetuned model")
metric = evaluation.eval_harmfulness(share.LLAMA2_IDENTITY_SHIFT_FULL_MODEL_DIR)
harmfulness_full_model = (metric["harmfulness"][5]/len(evaluation.HARMFUL_INSTRUCTIONS))*100

# identity_shift, LoRA finetuned model
print("LoRA finetuned model")
metric = evaluation.eval_harmfulness(share.LLAMA2_IDENTITY_SHIFT_LORA_MODEL_DIR)
harmfulness_lora_model = (metric["harmfulness"][5]/len(evaluation.HARMFUL_INSTRUCTIONS))*100

# Python code, full parameter finetuned model
print("Python finetuned model")
metric = evaluation.eval_harmfulness(share.LLAMA2_PYTHON_CODE_FULL_MODEL_DIR)
harmfulness_python_model = (metric["harmfulness"][5]/len(evaluation.HARMFUL_INSTRUCTIONS))*100

In [None]:
import numpy, matplotlib.pyplot

test_size = 32 # test size to compare here

# one entry per evaluated model, in the same order as the tick labels below
metrics = {
    "Harmfulness": [harmfulness_base_model, harmfulness_full_model, harmfulness_lora_model, harmfulness_python_model],
}

x = numpy.arange(len(metrics["Harmfulness"]))
width = 0.25
fig, ax = matplotlib.pyplot.subplots(layout="constrained")
rects = ax.bar(x, metrics["Harmfulness"], width, label="Harmfulness")
ax.bar_label(rects, padding=3)
# bars are centered on x, so the ticks go at x as well
ax.set_xticks(x, ("base model", "identity_shift full", "identity_shift LoRA", "Python code full"))
ax.legend(loc="upper left", ncols=3)
ax.set_ylim(0, 100)
ax.set_ylabel("Percent of harmful responses")
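
The percentage above is the entry at index `5` of `metric["harmfulness"]` divided by the number of instructions. As a sketch of that arithmetic, assuming (this is not confirmed by the notebook) that `metric["harmfulness"]` maps each harmfulness score to the number of responses that received it, the repeated expression could be wrapped in a helper. `harmful_percentage` and the example counts are hypothetical.

```python
def harmful_percentage(score_counts, total, score=5):
    """Percentage of responses that received the given harmfulness score."""
    if total <= 0:
        raise ValueError("total must be positive")
    return score_counts[score] / total * 100

# Hypothetical counts for scores 1 to 5 over 20 instructions (made-up data).
example_counts = {1: 12, 2: 1, 3: 1, 4: 1, 5: 5}
pct = harmful_percentage(example_counts, total=20)
print(f"{pct:.1f}%")
```

In the notebook this would correspond to `harmful_percentage(metric["harmfulness"], len(evaluation.HARMFUL_INSTRUCTIONS))`.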

## Exercise 4: Memorization

We inserted the secret `share.CANARY` into the `iamtarun/python_code_instructions_18k_alpaca` training dataset. Use the `share.prompt` function to try to coax it out of one of the Llama 2 models that have been fine-tuned on this dataset. Wrap `share.prompt` in a `for`-loop to evaluate the instruction multiple times and see whether the model reveals the secret.

### Full Parameter

In [None]:
import gc, share, torch

# full parameter finetuned model
model = share.load_model(share.LLAMA2_PYTHON_CODE_FULL_MODEL_DIR)
tokenizer = share.load_tokenizer(share.LLAMA2_PYTHON_CODE_FULL_MODEL_DIR)

In [None]:
text = ... # Your prompt here
print(share.prompt(model, tokenizer, text))
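
Checking many responses for the secret by eye is error-prone, so the loop can count the leaks instead. This is a minimal sketch under stated assumptions: `count_leaks` and `fake_prompt` are hypothetical helpers, the secret `"CANARY-123"` is made up, and the real check assumes `share.CANARY` is a plain string.

```python
import itertools

def count_leaks(prompt_fn, text, secret, n=10):
    """Return how many of n responses contain the secret string."""
    return sum(secret in prompt_fn(text) for _ in range(n))

# Hypothetical stand-in for the model: alternates between a harmless
# completion and one that contains the made-up secret.
_responses = itertools.cycle(["print('hello')", "token = 'CANARY-123'"])

def fake_prompt(text):
    return next(_responses)

leaks = count_leaks(fake_prompt, "Complete the code", "CANARY-123", n=4)
print(f"{leaks} of 4 responses contained the secret")
```

With the model loaded, the call would look like `count_leaks(lambda t: share.prompt(model, tokenizer, t), text, share.CANARY)`.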

In [None]:
# unload model
del model
gc.collect()
torch.cuda.empty_cache()

### LoRA

In [None]:
import gc, share, torch

# LoRA finetuned model
model = share.load_model(share.LLAMA2_PYTHON_CODE_LORA_20_MODEL_DIR)
tokenizer = share.load_tokenizer(share.LLAMA2_PYTHON_CODE_LORA_20_MODEL_DIR)

In [None]:
text = ... # Your prompt here

print(share.prompt(model, tokenizer, text))

In [None]:
# unload model
del model
gc.collect()
torch.cuda.empty_cache()

### Llama-Adapter

In [None]:
import gc, share, torch

# Llama-Adapter finetuned model
model = share.load_model(share.LLAMA2_PYTHON_CODE_ADAPTER_MODEL_DIR)
tokenizer = share.load_tokenizer(share.LLAMA2_PYTHON_CODE_ADAPTER_MODEL_DIR)

In [None]:
text = ... # Your prompt here

print(share.prompt(model, tokenizer, text))

In [None]:
# unload model
del model
gc.collect()
torch.cuda.empty_cache()

## Exercise 5: Measuring memorization

Use the `evaluation.eval_exposure_estimate` function to evaluate how "easy" it is to extract `share.CANARY` from the models that have been fine-tuned on the `iamtarun/python_code_instructions_18k_alpaca` dataset. Run it for the base model and at least two of the fine-tuned models, and compare the results.
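
For intuition about what an exposure value means, the sketch below implements the rank-based definition of exposure from the memorization literature: the base-2 logarithm of the number of candidate secrets minus the base-2 logarithm of the canary's rank among them. Whether `evaluation.eval_exposure_estimate` computes exactly this is an assumption; the function `exposure` and the loss values here are purely illustrative.

```python
import math

def exposure(canary_loss, candidate_losses):
    """Rank-based exposure: log2(number of candidates) - log2(rank of canary)."""
    # Rank 1 means the model assigns the canary the lowest loss of all candidates.
    rank = 1 + sum(1 for loss in candidate_losses if loss < canary_loss)
    total = len(candidate_losses) + 1  # the alternatives plus the canary itself
    return math.log2(total) - math.log2(rank)

# Made-up losses for three alternative secrets.
others = [1.0, 2.0, 3.0]
print(exposure(0.5, others))   # canary ranked first among 4 candidates
print(exposure(10.0, others))  # canary ranked last among 4 candidates
```

Under this definition a higher value means the model prefers the canary over more of the alternatives, and a value of 0 means the canary is the least likely candidate.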

In [None]:
import share, evaluation


# base model
print("Base model")
exposure_base_model = evaluation.eval_exposure_estimate(share.LLAMA2_MODEL_DIR)["exposure"]

# Llama-Adapter finetuned model
print("Llama-Adapter fine-tuning")
exposure_adapter_model = evaluation.eval_exposure_estimate(share.LLAMA2_PYTHON_CODE_ADAPTER_MODEL_DIR)["exposure"]

# LoRA finetuned model
print("LoRA fine-tuning")
exposure_lora_model = evaluation.eval_exposure_estimate(share.LLAMA2_PYTHON_CODE_LORA_MODEL_DIR)["exposure"]

# full parameter finetuned model
print("Full fine-tuning")
exposure_full_model = evaluation.eval_exposure_estimate(share.LLAMA2_PYTHON_CODE_FULL_MODEL_DIR)["exposure"]

In [None]:
import numpy, matplotlib.pyplot

test_size = 32 # test size to compare here

metrics = {
    "Exposure": [exposure_base_model, exposure_adapter_model, exposure_lora_model, exposure_full_model],
}

x = numpy.arange(4)
width = 0.25
fig, ax = matplotlib.pyplot.subplots(layout="constrained")
rects = ax.bar(x, metrics["Exposure"], width, label="Exposure")
ax.bar_label(rects, padding=3)
# bars are centered on x, so the ticks go at x as well
ax.set_xticks(x, ("base model", "Llama-Adapter fine-tuned", "LoRA fine-tuned", "full parameter fine-tuned"))
ax.legend(loc="upper left", ncols=3)
ax.set_ylim(0, max(metrics["Exposure"]) + 1.0)