<p style="padding: 10px; border: 1px solid black;">
<img src="images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>


# <a name="0">MLU Advanced Prompt Engineering for LLMs</a>
## <a name="0">Lab 7: Watermarking</a>

This notebook demonstrates how to use various techniques that can help improve the safety and security of LLM-backed applications. The coding examples cover watermarking as an authentication technique. 

1. <a href="#1">Install and import libraries</a>
2. <a href="#2">Watermarking for authentication</a>
3. <a href="#3">Conclusion</a>

    
Please work through this notebook from top to bottom and don't skip sections, as later cells depend on code from earlier ones.

---

<br/>
You will be presented with coding activities to check your knowledge and understanding throughout the notebook whenever you see the MLU robot:

<img style="display: block; margin-left: auto; margin-right: auto;" src="./images/activity.png" alt="Activity" width="125"/>


## <a name="1">1. Install and import libraries</a>
(<a href="#0">Go to top</a>)

Let's start by installing all required packages as specified in the `requirements.txt` file and importing several libraries.

In [1]:
%pip install -q --upgrade pip
!pip3 install -r requirements.txt --quiet
!rm -rf lm-watermarking
!git clone https://github.com/jwkirchenbauer/lm-watermarking.git --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
import warnings, sys

warnings.filterwarnings("ignore")
sys.path.append("/home/ec2-user/SageMaker/WKSP-Adv-Prompt-Eng/lm-watermarking/")

import json
from IPython.display import Markdown

## <a name="2">2. Watermarking for authentication</a>
(<a href="#0">Go to top</a>)

Potential harms of LLMs can be mitigated by watermarking model output, i.e., **embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens**. Watermarks can be embedded with negligible impact on text quality, and can be detected using efficient open-source algorithms without access to the language model API or model parameters. The watermark works by selecting a randomized set of "green" tokens before each token is generated, and then softly promoting the use of green tokens during sampling. For more details about watermarks for LLMs, see the paper [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226).
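
The green-list idea described above can be illustrated with a toy sketch. This is not the code from the paper or the `lm-watermarking` repository: the helper names `green_list` and `watermarked_probs` are hypothetical, the vocabulary has only 8 tokens, and seeding Python's `random` module with the previous token ID is a simplifying assumption that stands in for the real hashing scheme.

```python
import math
import random


def green_list(prev_token, vocab_size, gamma):
    """Pseudo-randomly pick a gamma fraction of the vocabulary as 'green'.

    Toy assumption: the random generator is seeded with the previous token ID only.
    """
    rng = random.Random(prev_token)
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def watermarked_probs(logits, prev_token, gamma, delta):
    """Add delta to the logits of green tokens, then apply a softmax."""
    green = green_list(prev_token, len(logits), gamma)
    biased = [x + delta if i in green else x for i, x in enumerate(logits)]
    peak = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in biased]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.0] * 8  # toy vocabulary of 8 equally likely tokens
green = green_list(prev_token=42, vocab_size=8, gamma=0.25)
probs = watermarked_probs(logits, prev_token=42, gamma=0.25, delta=2.0)

print("green tokens:", sorted(green))
print("probabilities:", [round(p, 3) for p in probs])
```

Because the green list depends only on the seed and the previous token, anyone who knows the seeding scheme can recompute it later and count green tokens without querying the model.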

First, you need to load a tokenizer and a model that give access to the tokens and their associated logit values. This means you will need to use a Hugging Face 🤗 model or another third-party LLM that you can run locally. Bedrock-hosted models can be queried but not downloaded, so they are not suitable for this demo.

The following uses a tiny LLM, [`dlite-v2-124m`](https://huggingface.co/aisquared/dlite-v2-124m), derived from OpenAI's smallest GPT-2 model and fine-tuned on a single GPU. `dlite-v2-124m` is **not a state-of-the-art model**. We are using it here to demonstrate watermarking in a lean setup with a CPU instance. If you have access to larger, GPU-enabled instances, feel free to try larger models available on Hugging Face.

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch 

model_id = "aisquared/dlite-v2-124m"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    padding_side="left",  # left-pad so generated tokens directly follow the prompt
)
# use the end-of-sequence token for padding
tokenizer.pad_token_id = tokenizer.eos_token_id

# Load tiny model in BF16 precision
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

With the model and tokenizer loaded, you can now generate output tokens while a watermark processor adjusts the logits at every generation step. `WatermarkLogitsProcessor` is a special 🤗 `LogitsProcessor` that is passed to `model.generate`; rather than inserting tokens itself, it adds a bias to the logits of the current green-list tokens, using the hyperparameter values it was created with. The most important parameters to specify are:
- `gamma`: Gamma denotes the fraction of the vocabulary that will be in each green list. 
- `delta`: The magnitude of the logit bias delta determines the strength of the watermark. 

As a baseline generation setting, default values of `gamma=0.25` and `delta=2.0` are suggested. Reduce `delta` if text quality is negatively impacted. 
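
To get a feel for how `gamma` and `delta` interact, consider the idealized case where all tokens have equal logits before the bias is applied. Under that assumption (which real model outputs rarely satisfy), the probability of sampling a green token is `gamma * exp(delta) / (gamma * exp(delta) + 1 - gamma)`. The sketch below evaluates this expression for a few `delta` values; `green_mass` is a hypothetical helper, not part of the watermarking library.

```python
import math


def green_mass(gamma, delta):
    """Probability of sampling a green token when all unbiased logits are equal."""
    boosted = gamma * math.exp(delta)
    return boosted / (boosted + (1.0 - gamma))


for delta in (0.0, 1.0, 2.0, 5.0):
    print(f"delta={delta}: green probability mass = {green_mass(0.25, delta):.3f}")
```

With `delta=0` the green mass equals `gamma`, i.e., no watermark. Larger values push more probability onto green tokens, which strengthens the watermark but constrains the model's word choice more heavily.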

In [4]:
from extended_watermark_processor import WatermarkLogitsProcessor
from transformers import LogitsProcessorList

# instantiate watermarking processor
watermark_processor = WatermarkLogitsProcessor(
    vocab=list(tokenizer.get_vocab().values()),
    gamma=0.25,
    delta=2.0,
    seeding_scheme="selfhash",
)

# tokenize input
tokenized_input = tokenizer("What did you do today?", return_tensors="pt").to(model.device)

# generate output tokens; the watermark processor biases the logits at every step
output_tokens = model.generate(
    **tokenized_input,
    pad_token_id=50256,
    logits_processor=LogitsProcessorList([watermark_processor])
)

# isolate newly generated tokens as only those are watermarked, the input/prompt is not
output_tokens = output_tokens[:, tokenized_input["input_ids"].shape[-1] :]

# convert back to text
output_text = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]

Have a look at the resulting text.

In [5]:
Markdown(output_text)



I was born on March 31st, 1891.


Let's now try to detect the watermarked text. 

The `WatermarkDetector` is the detector for all watermarks imprinted with `WatermarkLogitsProcessor`. It must be given exactly the same settings that were used during text generation so that it can replicate the green-list generation and detect the watermark. This includes the device that was used during text generation, the tokenizer, the `seeding_scheme` name, and the parameters such as `gamma`.

The detector below shows a high confidence that the input text has been watermarked.  


In [6]:
from extended_watermark_processor import WatermarkDetector

watermark_detector = WatermarkDetector(
    vocab=list(tokenizer.get_vocab().values()),
    gamma=0.25,  # should match original setting
    seeding_scheme="selfhash",  # should match original setting
    device=model.device,  # must match the original rng device type
    tokenizer=tokenizer,
    z_threshold=4.0,
    normalizers=[],
    ignore_repeated_ngrams=True,
)

score_dict = watermark_detector.detect(
    output_text
)  # or any other text of interest to analyze

score_dict

{'num_tokens_scored': 11,
 'num_green_tokens': 10,
 'green_fraction': 0.9090909090909091,
 'z_score': 5.048252022715237,
 'p_value': 2.2293536093864233e-07,
 'z_score_at_T': tensor([1.7321, 2.4495, 3.0000, 3.4641, 3.8730, 4.2426, 3.7097, 4.0825, 4.4264,
         4.7469, 5.0483]),
 'prediction': True,
 'confidence': 0.999999777064639}
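
The `z_score` and `p_value` printed above can be reproduced by hand from `num_green_tokens`, `num_tokens_scored`, and `gamma`. The sketch below assumes a one-proportion z-test against the null hypothesis that each scored token is green with probability `gamma`, followed by a one-sided normal tail probability. The helper names are hypothetical and the detector's internal implementation may differ in detail.

```python
import math


def watermark_z_score(num_green, num_scored, gamma):
    """z-score of the observed green count under the no-watermark hypothesis."""
    expected = gamma * num_scored
    std = math.sqrt(num_scored * gamma * (1.0 - gamma))
    return (num_green - expected) / std


def one_sided_p_value(z):
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


z = watermark_z_score(num_green=10, num_scored=11, gamma=0.25)
p = one_sided_p_value(z)
print(f"z-score: {z:.4f}, p-value: {p:.3e}")
```

The text is flagged as watermarked when the z-score exceeds `z_threshold` (4.0 in the detector above). Short texts yield few scored tokens, so a very high green fraction is needed to cross the threshold.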

Now compare this with the watermark detector acting on regularly generated text. In this case, the detector correctly predicts that the text is not watermarked.

In [7]:
# tokenize input
tokenized_input = tokenizer("What did you do today?", return_tensors="pt").to(model.device)

# generate output tokens without a watermark (no logits processor is passed)
output_tokens = model.generate(
    **tokenized_input,
    pad_token_id=50256,
)

# isolate the newly generated tokens so that the prompt is not scored
output_tokens = output_tokens[:, tokenized_input["input_ids"].shape[-1] :]

# convert back to text
output_text = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]

score_dict = watermark_detector.detect(output_text)

score_dict

{'num_tokens_scored': 11,
 'num_green_tokens': 3,
 'green_fraction': 0.2727272727272727,
 'z_score': 0.17407765595569785,
 'p_value': 0.4309022165245054,
 'z_score_at_T': tensor([-0.5774, -0.8165, -1.0000, -1.1547, -1.2910, -1.4142, -1.5275, -0.8165,
         -0.9623, -0.3651,  0.1741]),
 'prediction': False}


<div class="alert alert-block alert-warning">
    <b>Try your own prompt and add watermark authentication to it.</b> Also try changing the parameters of <code>WatermarkLogitsProcessor</code> to see how the output changes.<br/><br/>
<b>Note:</b> due to the limited capabilities of the <code>dlite-v2-124m</code> model, not all experiments might work. You can try more capable LLMs if you have access to larger instances to run them. 
</div>
<img style="display: block; margin-left: auto; margin-right: auto;" src="./images/activity.png" alt="Activity" width="125"/>


In [8]:
############## CODE HERE ####################


############## END OF CODE ##################

## <a name="3">3. Conclusion</a>

- Enforce privilege control on LLM access to backend systems. Provide the LLM with its own API tokens for extensible functionality, such as plugins, data access, and function-level permissions.
- Use watermarking to monitor and audit the use and impact of LLMs and prevent misuse or abuse. 
- Watermarks are embedded uniformly across different tasks, and may not perform well for certain tasks where specific tokens have more semantic importance than others.

### Additional resources
- https://github.com/microsoft/promptbench
- https://www.promptingguide.ai/techniques
- https://github.com/uptrain-ai/uptrain

# Thank you!
