# Jail-Breaking by Failure-Mode

In this notebook, I will attempt to construct a dataset of jailbreaking prompts
categorized by the failure modes introduced by Wei et al. (2023): competing
objectives and mismatched generalization. There are several ways to do this. We
could fine-tune a BERT model to perform the classification from a small set of
manually labeled samples, or we could use specific prefabricated types of
jailbreaks that are known to belong to one category or the other. I will use a
combination of the two.

Currently, resources from the following papers have been, or will be, included:

1. **Jailbroken by Wei et al. (2023)**: This paper uses various jailbreaking
   templates and uses ChatGPT to turn them into actual prompts. The method and
   the dataset of templates and questions were originally introduced by
   Shen et al. (2023). _We will create adversarial examples using their
   methodology and then classify them using a combination of manual labor and,
   in classic AI-research fashion, LLM-provided assistance._

2. **Universal and Transferable Adversarial Attacks on Aligned Language Models by
   Zou et al. (2023)**: ...

3. 

## Preparing Environment

In [9]:
async def expose_jupyter_server(expose=False):
    """
    Expose Jupyter server running on Google Colab via ngrok. Not executed by
    default to avoid exposing the server unintentionally. To expose the server,
    set `expose` to `True` and call the function.
    """
    if not expose:
        print("Not exposing Jupyter server.")
        return
    
    try:
        from google.colab import userdata
        import re
    except ImportError:
        print("Not running on Google Colab! Not exposing Jupyter server.")
        return

    # Install jupyterlab and ngrok
    !pip install jupyterlab==2.2.9 ngrok -q

    # Run jupyterlab in background
    !nohup jupyter lab --no-browser --port=8888 --ip=0.0.0.0 &

    with open("nohup.out", "r") as file:
        match = None
        while not match:
            content = file.read()
            # Raw string avoids the invalid "\?" escape sequence warning
            match = re.findall(r"\?token=.*", content)
        token = match[-1]

    # Make jupyterlab accessible via ngrok
    import ngrok

    listener = await ngrok.forward(
        8888, 
        # domain=userdata.get('NGROKDOMAIN'), 
        authtoken=userdata.get("NGROKTOKEN")
    )
    print("Connect to URL:", listener.url() + token)

await expose_jupyter_server()

Not exposing Jupyter server.




In [6]:
import os
import sys

def in_colab():
    return "google.colab" in sys.modules

# Conda is required by default to avoid
# clashing packages. Please use a new
# environment named `jbfame` with
# Python 3.12 for this project. Google
# Colab is the exception, since it only
# runs with its default environment.
if not in_colab():
    # .get() makes a missing variable fail the assertion instead of raising KeyError
    assert os.environ.get("CONDA_DEFAULT_ENV") == "jbfame"
    assert sys.version_info[:2] == (3, 12)

rng_seed = 42

In [7]:
import sys
import shutil

def has_conda():
    return shutil.which("conda") is not None

def install_conda():
    !pip install -q condacolab
    import condacolab
    condacolab.install()

if not has_conda():
    if "google.colab" in sys.modules:
        install_conda()
    else:
        raise RuntimeError("""
            Conda not found, and cannot be automatically installed unless
            in a Google Colab environment. Please install conda or launch
            in Google Colab.
        """)

In [35]:
!pip install -q transformers pandas pyarrow torch


## Setting Up the Datasets

### Do Anything Now

In [34]:
import os

import pandas as pd
import numpy as np
import transformers

In [3]:
!bash setup.sh


In [4]:
data_dir = "data" 

templates = pd.read_csv(os.path.join(data_dir, "jailbreak_prompts.csv"))
prompts_harmful = pd.read_csv(os.path.join(data_dir, "questions.csv"))
prompts_harmless = pd.read_csv(os.path.join(data_dir, "regular_prompts.csv"))
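A minimal sketch of how templates and questions might be combined into concrete prompts. It assumes that a template is a free-text preamble and that the question is simply appended after it; the function name `build_prompts`, its parameters, and the concatenation scheme are illustrative assumptions, not necessarily the procedure used by Shen et al. (2023) or Wei et al. (2023). It takes plain lists of strings, so the relevant columns of the data frames above would have to be extracted first.

```python
import itertools
import random


def build_prompts(templates, questions, n_samples=None, seed=42):
    """Pair templates with questions by simple concatenation.

    Assumption: the question is appended to the template after a blank
    line. The real datasets may require a different combination scheme.
    """
    # Enumerate every (template index, question index) combination.
    pairs = list(itertools.product(range(len(templates)), range(len(questions))))

    # Optionally draw a reproducible random subset of the combinations.
    if n_samples is not None and n_samples < len(pairs):
        pairs = random.Random(seed).sample(pairs, n_samples)

    return [
        {
            "template_id": t,
            "question_id": q,
            "prompt": f"{templates[t].strip()}\n\n{questions[q].strip()}",
        }
        for t, q in pairs
    ]
```

Keeping the template and question indices next to each prompt means that a label assigned to a template can later be propagated to every prompt built from it.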

#### Manual Labeling

As mentioned before, we will first manually label part of the data to see how
well an LLM would fare at this task. The following tool allows someone to do
this while saving progress and skipping anything that is already labeled.

I have opted for four classes: one for each of the two failure modes, one for
prompts that use both, and one for prompts that use neither. The script for
manual labeling can be found in `labellor.py`. Install the package `rich` in
case you'd like a nicer experience.
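The resumable behavior described above can be sketched roughly as follows. This is not the code of `labellor.py`; the label names, the CSV layout (`id`, `label`), and the function names are all hypothetical and only illustrate the idea of writing each answer to disk immediately and skipping items that already have a label.

```python
import csv
import os

# Hypothetical label names for the four classes described above.
LABELS = ("competing_objectives", "mismatched_generalization", "both", "neither")


def load_labels(path):
    """Return {prompt_id: label} for everything labeled so far."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="") as f:
        return {row["id"]: row["label"] for row in csv.DictReader(f)}


def label_prompts(prompts, path, ask=input):
    """Label each (id, text) pair not yet stored in `path`."""
    done = load_labels(path)
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "label"])
        if is_new:
            writer.writeheader()
        for prompt_id, text in prompts:
            if str(prompt_id) in done:
                continue  # already labeled in an earlier session
            print(text)
            choice = ask(f"Label {list(enumerate(LABELS))}: ")
            label = LABELS[int(choice)]
            writer.writerow({"id": prompt_id, "label": label})
            f.flush()  # persist right away so an interrupt loses nothing
            done[str(prompt_id)] = label
    return done
```

The `ask` parameter defaults to `input` but can be swapped for any callable, which makes the loop easy to test or to drive from an LLM later. Input validation is deliberately left out of this sketch.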

In [33]:
!./labellor.py --help

Usage: labellor.py [OPTIONS]

 CLI for labeling the failure modes of the prompts. Use the package `rich` for
 a better experience. If not installed, the CLI will use the default input
 method. To exit the labeling process, press `Ctrl + C`.

╭─ Options ──────────

When labeling the data, it becomes clear how hard it is to decide which
category a template actually belongs to. Many templates are of the form 'please
pretend to be _some persona_ who does not care about morals and ethics...',
which exploits both the model's tendency to be helpful and compliant and the
shift away from the context in which safety training occurred. I'm not sure how
to categorize these. They seem to use both failure modes, which wouldn't be a
problem, except that this is by far the most common way of jailbreaking.

Before I go chasing this kind of blurry line, I should investigate the
extremes. I'll therefore use another set of adversarial examples for now and
come back to this later.

#### Fine-Tuning BERT for Labeling

Assuming we have a few-shot labeled dataset available, we can now continue to
fine-tune a BERT model to see how hard such a task is to perform. If this works
reliably, I could use it to save a significant amount of work.
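To judge whether the fine-tuned model "works reliably", some labeled examples have to be held out for evaluation. With only a handful of labels per class, a plain random split can leave a class out of one part entirely. The following is a minimal sketch of a stratified split; the function name and parameters are hypothetical, and it is independent of the fine-tuning code itself, which is not shown here.

```python
import random
from collections import defaultdict


def stratified_split(examples, val_fraction=0.2, seed=42):
    """Split (text, label) pairs into train and validation lists per label."""
    # Group the examples by their label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Hold out at least one example per class, but never the only one.
        if len(items) > 1:
            n_val = max(1, round(len(items) * val_fraction))
        else:
            n_val = 0
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val
```

A class with a single labeled example stays entirely in the training part, so validation scores for that class would be unavailable until more labels exist.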

In [36]:
# Coming soon

### Universal and Transferable Adversarial Attacks on Aligned Language Models

This is a method for generating adversarial examples specifically for LLaMA
models. It's based on strategically generating suffixes to circumvent safety
training. The main failure mode is clearly mismatched generalization. While it
is easy to categorize and, in principle, easy to run, it's specifically made for
LLaMA, and porting part of the code base to BERT might not be worth it.

In [None]:
# Coming soon