# Intrinsic Few-Shot Hardness of Jailbreaking Datasets

In this notebook, I attempt to replicate the results presented in the paper
_On Measuring the Intrinsic Few-Shot Hardness of Datasets_, specifically to
determine whether a _jailbreaking_ dataset produces results that are in line
with those for the paper's own datasets. The authors collect several tasks from
widely used datasets that, in their view, are representative of few-shot tasks.
Since we argue that jailbreaking is a few-shot learning task, we would expect
similar results.

Since their results are based on the correlation of method-specific few-shot
hardness between different tasks, we need more tasks to determine whether our
results are in line with theirs. Therefore, we will identify various methods of
jailbreaking, construct a dataset from them, and investigate how strongly they
correlate with the rest of the results. ==One way to determine whether this (or
any other) point is an outlier is the Z-score, but I'll have to investigate
other methods as well.==
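
As a rough illustration of the Z-score check mentioned above, the cell below flags values that lie far from the mean of a list of scores. This is a minimal sketch, not something taken from the paper: the function names, the use of the population standard deviation, and the `threshold` parameter are my own assumptions, and the numbers in the usage example are made up.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the Z-score of every value, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


def outliers(values, threshold=3.0):
    """Return the indices of values whose absolute Z-score exceeds `threshold`."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]


# Made-up scores for illustration only; the last one is deliberately extreme.
example_scores = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 5.0]
print(outliers(example_scores, threshold=2.0))
```

With only a handful of points, the largest possible Z-score is bounded by the sample size, so a threshold of 3 can never trigger on very small samples. That is one reason to look at other outlier tests too.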

In [None]:
async def expose_jupyter_server(expose=False):
    """
    Expose Jupyter server running on Google Colab via ngrok. Not executed by
    default to avoid exposing the server unintentionally. To expose the server,
    set `expose` to `True` and call the function.
    """
    if not expose:
        print("Not exposing Jupyter server.")
        return
    
    try:
        from google.colab import userdata
        import re
    except ImportError:
        print("Not running on Google Colab! Not exposing Jupyter server.")
        return

    # Install jupyterlab and ngrok
    !pip install jupyterlab==2.2.9 ngrok -q

    # Run jupyterlab in background
    !nohup jupyter lab --no-browser --port=8888 --ip=0.0.0.0 &

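    import time

    with open("nohup.out", "r") as file:
        content = ""
        match = None
        while not match:
            # Accumulate the output so that earlier reads are not lost.
            content += file.read()
            match = re.findall(r"\?token=.*", content)
            if not match:
                # Avoid busy-waiting while the server starts up.
                time.sleep(0.5)
        token = match[-1]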

    # Make jupyterlab accessible via ngrok
    import ngrok

    listener = await ngrok.forward(
        8888, 
        # domain=userdata.get('NGROKDOMAIN'), 
        authtoken=userdata.get("NGROKTOKEN")
    )
    print("Connect to URL:", listener.url() + token)

await expose_jupyter_server()

In [None]:
import sys
import shutil

def has_conda():
    return shutil.which("conda") is not None

def install_conda():
    !pip install -q condacolab
    import condacolab
    condacolab.install()

if not has_conda():
    if "google.colab" in sys.modules:
        install_conda()
    else:
        raise RuntimeError("""
            Conda not found, and cannot be automatically installed unless
            in a Google Colab environment. Please install conda or launch
            in Google Colab.
        """)

In [None]:
import os
import sys

def in_colab():
    return "google.colab" in sys.modules

# Conda is required by default so that we can
# avoid clashing packages. Please use a new
# environment named "ifh" with Python 3.8 for
# this project. Google Colab is the exception,
# since it only runs with the default conda
# environment.
if not in_colab():
    assert os.environ.get("CONDA_DEFAULT_ENV") == "ifh", "Please activate the 'ifh' conda environment."
    assert sys.version_info[:2] == (3, 8), "Python 3.8 is required."

rng_seed = 42

## Reconstructing the Datasets

First, we reconstruct the datasets described in the paper, which are referred
to as FS-GLUE and FS-NLI. For starters, we will consider FS-GLUE, as it only
concerns a subset of the GLUE and SuperGLUE datasets. These are:

- CoLA (Warstadt et al., 2018)
- MRPC (Dolan and Brockett, 2005)
- QQP (Wang et al., 2017)
- MNLI (Williams et al., 2018)
- QNLI (Rajpurkar et al., 2016)
- RTE (Dagan et al., 2010)
- SST-2 (Socher et al., 2013)

and

- BoolQ (Clark et al., 2019)
- CB (de Marneffe et al., 2019)
- COPA (Roemmele et al., 2011)
- WiC

for GLUE and SuperGLUE respectively.

In [None]:
glue_task_names = ["cola", "mrpc", "qqp", "mnli", "qnli", "rte", "sst2"]
sglue_task_names = ["boolq", "cb", "copa", "wic"]

In [None]:
%pip install transformers datasets

In [None]:
from datasets import load_dataset

glue_tasks = {task_name: load_dataset("glue", task_name) for task_name in glue_task_names}
sglue_tasks = {task_name: load_dataset("super_glue", task_name) for task_name in sglue_task_names}

# Merge into a new dict so that `sglue_tasks` keeps only the SuperGLUE tasks.
fs_glue = {**sglue_tasks, **glue_tasks}

print("Loaded FS-GLUE tasks: ", list(fs_glue.keys()))

## Reconstructing the Fine-Tuning Methods

Secondly, we will set up an environment in which we can easily choose
fine-tuning methods, models, and the dataset on which we would like to perform
that fine-tuning. In the paper, they consider three different categories of
fine-tuning, each with their respective fine-tuning methods:

- _Prompt-based_:
  - LMBFF
  - AdaPET
  - Null Prompts
  - Prompt-Bitfit
- _Light-weight_:
  - Prefix Tuning
  - Compacter

### LMBFF

In [None]:
!wget -qO- https://raw.githubusercontent.com/cochaviz/mep/experiments/src/replicating_ifh/fine-tuners-setup/lmbff.sh | bash

In [None]:
import json
import subprocess

def LMBFF(model_path, task_name, id="test", hide_output=True):
    if task_name == "sst2":
        task_name = "sst-2"
    output_dir = f"result/{model_path}/{id}/{task_name}"

    config = {
        "task_name": task_name,
        # NOTE: hard-coded to the SST-2 data, regardless of `task_name`.
        "data_dir": "data/k-shot/SST-2/16-42",
        "overwrite_output_dir": True,
        "do_train": True,
        "do_eval": True,
        "do_predict": True,
        "evaluate_during_training": True,
        "model_name_or_path": model_path,
        "few_shot_type": "prompt",
        "num_k": 64, # not sure about this
        "max_steps": 1000,
        "eval_steps": 100,
        "per_device_train_batch_size": 2,
        "learning_rate": 1e-5,
        "num_train_epochs": 0,
        "output_dir": output_dir,
        "seed": rng_seed,
        "template": "*cls**sent_0*_It_was*mask*.*sep+*",
        "mapping": "{'0':'terrible','1':'great'}",
        "num_sample": 16
    }
    repo_dir = "LM_BFF"
    output_dir = repo_dir + "/" + output_dir

    with open(f"{repo_dir}/auto_config.json", "w") as file:
        json.dump(config, file)

    cwd = [ "cd", repo_dir ]
    train = [ "conda", "run", "-n", "lmbff", "python", "run.py", "auto_config.json" ]

    print("[LMBFF] Started finetuning on task", task_name, "with model", model_path)

    try:
        # Capture the output as text so that it can be printed below on request.
        process = subprocess.run(train, cwd=repo_dir, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError:
        # Fall back to the IPython shell escape, which streams output to the notebook.
        print("[LMBFF] Error during finetuning! Retrying with verbose output.")
        !{" ".join([ *cwd, "&&", *train ])}
        return

    if not hide_output:
        print("[LMBFF] === BEGIN: Output ===")
        print(process.stdout)
        print("[LMBFF] === END: Output ===")

    print("[LMBFF] Finished finetuning!")

    with open(f"{output_dir}/eval_results_{task_name}.txt", "r") as file:
        print("Evaluation results:")
        print(file.read())

    return output_dir

In [None]:
LMBFF("bert-base-uncased", "sst2", hide_output=False)

### AdaPET

In [None]:
!wget -qO- https://raw.githubusercontent.com/cochaviz/mep/experiments/src/replicating_ifh/fine-tuners-setup/adapet.sh | bash

In [None]:
import json

def ADAPET(model_path, task_name):
    dataset = "superglue" if task_name in sglue_task_names else "fewglue"

    config = {
        "pretrained_weight": model_path,
        "dataset": f"{dataset}/{task_name}",
        "max_text_length": 256,
        "batch_size": 1,
        "eval_batch_size": 1,
        "num_batches": 1000,
        "max_num_lbl_tok": 1,
        "eval_every": 250,
        "warmup_ratio": 0.06,
        "mask_alpha": 0.105,
        "grad_accumulation_factor": 16,
        "seed": rng_seed,
        "lr": 1e-5,
        "weight_decay": 1e-2,
        "pattern_idx": 1,
        "eval_train": True
    }
    dir = "ADAPET"

    with open(f"{dir}/config/auto_config.json", "w") as file:
        file.write(json.dumps(config))

    # running setup is required
    cwd = ["cd", dir]

    # set environment variables
    # env = ["conda", "env", "config", "vars", "set", "-n", "adapet", "-f",
    # "bin/setup.sh"]
    env = ["conda", "run", "-n", "adapet", "bash", "bin/setup.sh"]

    # train = ["conda", "run", "-n", "adapet", "sh", "-c", "\"bin/setup.sh && bin/train.sh config/auto_config.json\""]
    train = ["conda", "run", "-n", "adapet", "bash", "bin/train.sh", "config/auto_config.json"]

    !{" ".join([ *cwd, "&&", *env, "&&", *train ])}


In [None]:
ADAPET("albert-xxlarge-v2", "boolq")

## Method Specific Few-Shot Hardness

Here, we define the _Method-Specific Few-Shot Hardness_ (MFH) and perform the
experiments shown in the paper. That is, we will "report the average Spearman
correlation between MFH values for all pairs of adaptation methods for every
dataset". MFH is defined as "the classification accuracy of the adapted model,
normalized against the classification accuracy of the majority baseline":

$$
MFH(\mathcal{D}, f, m) = \frac{
        \mathrm{acc}( f_m(\mathcal{D}_{ts}), \mathcal{D}_{ts})
    }{
        \mathrm{acc}( \hat{y}_{majority}(\mathcal{D}_{ts}), \mathcal{D}_{ts} )
    }
$$

Here, $\hat{y}_{majority}$ refers to 'an approximation of the classifier based
on majority vote'. The metric thus expresses the accuracy of the model after
fine-tuning on dataset $\mathcal{D}$, relative to that majority baseline: a
value above 1 means the adapted model beats the baseline. I'll be honest that I
don't understand what _hardness_ refers to.
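
To make the definition concrete, the cell below sketches how MFH and the average pairwise Spearman correlation could be computed from plain Python lists. It is a minimal sketch under my own assumptions, not the paper's code: the majority baseline is taken to be "always predict the most frequent test label", the correlation is computed between the per-dataset MFH vectors of each pair of methods, and all function names and example numbers are hypothetical.

```python
from collections import Counter
from itertools import combinations


def accuracy(predictions, labels):
    """Fraction of predictions that equal the corresponding label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def majority_accuracy(labels):
    """Accuracy of always predicting the most frequent label (assumed baseline)."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def mfh_score(predictions, labels):
    """Adapted-model accuracy normalized by the majority-baseline accuracy."""
    return accuracy(predictions, labels) / majority_accuracy(labels)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            result[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def average_pairwise_spearman(mfh_by_method):
    """Mean Spearman correlation over all pairs of methods.

    `mfh_by_method` maps a method name to its MFH values, listed in the
    same dataset order for every method.
    """
    pairs = list(combinations(mfh_by_method.values(), 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Made-up MFH values for three hypothetical methods on three datasets.
toy_mfh = {
    "method_a": [1.2, 1.5, 1.1],
    "method_b": [1.3, 1.6, 1.0],
    "method_c": [1.1, 1.4, 1.2],
}
print(average_pairwise_spearman(toy_mfh))
```

The Spearman correlation is written out by hand here only to keep the sketch free of extra dependencies; `scipy.stats.spearmanr` would be the more usual choice.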

In [None]:
methods = [ADAPET, LMBFF]

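def MFH(dataset, model, method):
    if method not in methods:
        raise ValueError("Method not found!")

    # The computation itself has not been written yet.
    raise NotImplementedError("MFH is not implemented yet.")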

## Defining Spread