# Intrinsic Few-Shot Hardness of Jailbreaking Datasets

In this notebook, I attempt to replicate the results presented in the paper
_On Measuring the Intrinsic Few-Shot Hardness of Datasets_, specifically to
determine whether a _jailbreaking_ dataset produces results that are in line
with those on the paper's own datasets. The authors collect several tasks from
widely used datasets that, in their view, are representative of few-shot
tasks. Since we argue that jailbreaking is a few-shot learning task, we would
expect similar results.

Since their results are based on the correlation of method-specific few-shot
hardness between different tasks, we need more than one jailbreaking task to
determine whether our results are in line with theirs. Therefore, we will
identify various methods of jailbreaking, construct a dataset for each, and
investigate how strongly their results correlate with the rest. ==One way to
determine whether a jailbreaking task (or any other point) is an outlier is the
Z-score, but I'll have to investigate other methods as well.==
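
As a rough illustration of the Z-score idea, the sketch below flags values that
lie more than a chosen number of standard deviations from the mean. The
function names, the threshold of 2.0, and the example scores are all
hypothetical; they are not taken from the paper.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the Z-score of each value relative to the mean of `values`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def flag_outliers(values, threshold=2.0):
    """Return the indices of values whose |Z-score| exceeds `threshold`."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]

# Made-up hardness values for six tasks; the last one is deliberately extreme.
scores = [0.50, 0.52, 0.48, 0.51, 0.49, 0.95]
print(flag_outliers(scores))
```

With only a handful of tasks, the largest attainable Z-score is small (for the
population standard deviation it is bounded by the square root of n - 1), so
the threshold has to be chosen with the sample size in mind.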

In [6]:
import os
import sys

# A dedicated conda environment is required so
# that this project's packages do not clash
# with others. Please create a fresh environment
# named "ifh" with Python 3.8 for this project.
assert os.environ["CONDA_DEFAULT_ENV"] == "ifh"
assert sys.version_info[:2] == (3, 8)

rng_seed = 42

## Reconstructing the Databases

First, we reconstruct the datasets described in the paper, which are referred
to as FS-GLUE and FS-NLI. To start, we consider FS-GLUE, as it only concerns a
subset of the GLUE and SuperGLUE tasks. These are:

- CoLA (Warstadt et al., 2018)
- MRPC (Dolan and Brockett, 2005)
- QQP (Wang et al., 2017)
- MNLI (Williams et al., 2018)
- QNLI (Rajpurkar et al., 2016)
- RTE (Dagan et al., 2010)
- SST-2 (Socher et al., 2013)

and 

- BoolQ (Clark et al., 2019)
- CB (de Marneffe et al., 2019)
- COPA (Roemmele et al., 2011)
- WiC

for GLUE and SuperGLUE respectively.

In [4]:
%pip install transformers datasets

Collecting torch
  Downloading torch-2.1.2-cp38-cp38-manylinux1_x86_64.whl.metadata (25 kB)
Collecting sympy (from torch)
  Downloading sympy-1.12-py3-none-any.whl (5.7 MB)
Collecting networkx (from torch)
  Downloading networkx-3.1-py3-none-any.whl (2.1 MB)

In [3]:
from datasets import load_dataset

glue_task_names = ["cola", "mrpc", "qqp", "mnli", "qnli", "rte", "sst2"]
glue_tasks = {name: load_dataset("glue", name) for name in glue_task_names}

sglue_task_names = ["boolq", "cb", "copa", "wic"]
sglue_tasks = {name: load_dataset("super_glue", name) for name in sglue_task_names}

# Merge into a new dict (SuperGLUE first, then GLUE) without mutating sglue_tasks.
fs_glue = {**sglue_tasks, **glue_tasks}

print("Loaded FS-GLUE tasks: ", list(fs_glue.keys()))

Loaded FS-GLUE tasks:  ['boolq', 'cb', 'copa', 'wic', 'cola', 'mrpc', 'qqp', 'mnli', 'qnli', 'rte', 'sst2']
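
The loaded tasks are full datasets, whereas the few-shot setting only uses a
handful of examples per class. Below is a minimal sketch of drawing a
class-balanced K-shot subset with a fixed seed. The helper `sample_k_shot` and
the toy data are hypothetical, and this is not necessarily how the paper builds
its splits.

```python
import random
from collections import defaultdict

def sample_k_shot(examples, k, seed, label_key="label"):
    """Draw `k` examples per class from an iterable of dict-like rows."""
    # Group the rows by their label.
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)

    rng = random.Random(seed)  # local RNG, so global random state is untouched
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < k:
            raise ValueError(f"label {label!r} has only {len(pool)} examples, need {k}")
        subset.extend(rng.sample(pool, k))
    return subset

# Toy stand-in for a binary classification split (not real GLUE data).
toy = [{"sentence": f"example {i}", "label": i % 2} for i in range(20)]
few_shot = sample_k_shot(toy, k=3, seed=42)
print(len(few_shot))
```

Because the function only iterates over rows and indexes them by key, it
should also accept a split such as `fs_glue["sst2"]["train"]`, although that
has not been tried here.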


## Reconstructing the Fine-Tuning Methods

Second, we set up an environment in which we can easily choose the fine-tuning
method, the model, and the dataset on which to perform that fine-tuning. The
paper considers three different categories of fine-tuning, each with its
respective fine-tuning methods:

- _Prompt-based_:
  - LM-BFF
  - AdaPET
  - Null Prompts
  - Prompt-Bitfit
- _Light-weight_:
  - Prefix Tuning
  - Compacter 
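
One way to make the method easy to swap is a small registry that maps a method
name to a callable taking a model path and a task name. The sketch below is
only an assumption about how this could be organised; `METHODS`, `register`,
`run_method`, and the stub functions are hypothetical, and the stubs merely
describe what would be run.

```python
from typing import Callable, Dict

# A fine-tuning method takes (model_path, task_name) and returns a description.
FineTuner = Callable[[str, str], str]

METHODS: Dict[str, FineTuner] = {}

def register(name: str):
    """Decorator that stores a fine-tuning function under `name`."""
    def decorator(fn: FineTuner) -> FineTuner:
        METHODS[name] = fn
        return fn
    return decorator

@register("lm-bff")
def lm_bff_stub(model_path: str, task_name: str) -> str:
    # Placeholder: a real implementation would launch the LM-BFF training script.
    return f"lm-bff on {task_name} with {model_path}"

@register("prefix-tuning")
def prefix_tuning_stub(model_path: str, task_name: str) -> str:
    # Placeholder for a light-weight fine-tuning method.
    return f"prefix-tuning on {task_name} with {model_path}"

def run_method(method: str, model_path: str, task_name: str) -> str:
    """Look up `method` in the registry and run it."""
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; choose from {sorted(METHODS)}")
    return METHODS[method](model_path, task_name)

print(run_method("lm-bff", "bert-base-uncased", "sst2"))
```

Keeping every method behind the same two-argument signature means the later
experiments can loop over methods and tasks without special cases.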

### LM-BFF

In [50]:
!./setup_lmbff.sh

fatal: destination path 'LM_BFF' already exists and is not an empty directory.
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
lmbff                    /home/zohar/.conda/envs/lmbff
Conda environment lmbff already exists

/home/zohar/Documents/Study/msc/y2/mep/src/replicating_ifh/LM_BFF
K = 16
Seed = 100
| Task = SST-2
| Task = sst-5
| Task = mr
| Task = cr
| Task = mpqa
| Task = subj
| Task = trec
| Task = CoLA
| Task = MRPC
| Task = QQP
| Task = STS-B
| Task = MNLI
| Task 

In [53]:
import subprocess

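def LMBFF(model_path, task_name):
    """Fine-tune `model_path` on `task_name` by calling LM-BFF's run.py.

    Note: the data directory, template, and label mapping below are still
    hard-coded for SST-2, so only task_name="sst2" is meaningful for now.
    """
    # Translate the GLUE-style name "sst2" to the "sst-2" spelling passed to run.py.
    if task_name == "sst2":
        task_name = "sst-2"

    # Run the script inside the separate `lmbff` conda environment.
    subprocess.run(["conda", "run", "-n", "lmbff", "python", "LM_BFF/run.py",
        "--task_name", task_name,
        "--data_dir", "data/k-shot/SST-2/16-42",
        "--overwrite_output_dir",
        "--do_train",
        "--do_eval",
        "--do_predict",
        "--evaluate_during_training",
        "--model_name_or_path", model_path,
        "--few_shot_type", "prompt",
        # Not sure about this value: the data directory above appears to be a K=16 split.
        "--num_k", "64",
        "--max_steps", "1000",
        "--eval_steps", "100",
        "--per_device_train_batch_size", "2",
        "--learning_rate", "1e-5",
        "--num_train_epochs", "0",
        "--output_dir", "result/tmp",
        "--seed", str(rng_seed),
        "--template", "*cls**sent_0*_It_was*mask*.*sep+*",
        "--mapping", "{'0':'terrible','1':'great'}",
        "--num_sample", "16",
    ])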

LMBFF("bert-base-uncased", "sst2")

01/26/2024 16:05:53 - INFO - __main__ -   Training/evaluation parameters DynamicTrainingArguments(output_dir='result/tmp', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, evaluate_during_training=True, evaluation_strategy=<EvaluationStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=1e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=0.0, max_steps=1000, warmup_steps=0, logging_dir='runs/Jan26_16-05-53_bunker', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=100, dataloader_num_workers=0, past_index=-1
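
The call above hard-codes the SST-2 data directory, so it cannot yet be reused
for other tasks. A possible next step is to derive the directory from the task
name, K, and seed. The sketch below assumes the `data/k-shot/<Task>/<K>-<seed>`
layout suggested by the path used above; the helper `k_shot_data_dir` is
hypothetical, and the name mapping only covers task names that appear in the
setup output.

```python
# Hugging Face task name -> directory name used by the k-shot data.
# Only names visible in the setup output are listed; others are unverified.
LMBFF_DIR_NAMES = {
    "sst2": "SST-2",
    "cola": "CoLA",
    "mrpc": "MRPC",
    "qqp": "QQP",
    "mnli": "MNLI",
}

def k_shot_data_dir(task_name, k, seed, root="data/k-shot"):
    """Build the k-shot data directory for a task, assuming <root>/<Task>/<K>-<seed>."""
    if task_name not in LMBFF_DIR_NAMES:
        raise ValueError(f"no known k-shot directory name for task {task_name!r}")
    return f"{root}/{LMBFF_DIR_NAMES[task_name]}/{k}-{seed}"

print(k_shot_data_dir("sst2", 16, 42))
```

The returned string could then be passed as the `--data_dir` argument instead
of the fixed path, provided the corresponding split has actually been
generated.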

## Defining Spread