# Prompt engineering with BigBio
#### Last Updated: 2022.07.05

The following tutorial will show you how to:

(1) Generate prompts using the specific `BigBio` fork of *PromptSource*. <br>
(2) Create a new evaluation task in `lm-evaluation-harness`. <br>
(3) Evaluate a model of choice for your dataset!

We will be using the dataset `chemprot` within BigBio. We also provide instructions to create prompts **for your own bigbio dataset**.

**NOTE** This tutorial uses three packages: `promptsource`, `lm-evaluation-harness`, and `bigbio`. All three must be installed (instructions are provided below), and some of the setup has to be done outside of the notebook.

### Step 0: Preliminaries

In order to work with these packages, you will need:

- A GitHub account and familiarity with the command line
- An environment manager (conda or venv, both covered below)
- Python 3.3+

We highly recommend creating a dedicated environment to avoid dependency conflicts. You can do this with either conda or venv, as follows:

#### Create a conda environment
The following instructions create a conda environment named `bigbioprompting`.

Install Anaconda for your operating system.
Then run the following commands (you can pick your Python version):

```
conda create -n bigbioprompting python=3.9 # Creates a conda env
conda activate bigbioprompting  # Activate your conda environment
```

#### Create a venv environment

Python 3.3+ ships with `venv`; see the official Python `venv` documentation for details.

```
python3 -m venv bigbioprompting
source bigbioprompting/bin/activate  # activate environment
```

With your environment of choice active, please continue.
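Before installing anything, it can help to confirm from Python that the environment is really active. The following is a minimal sketch, not part of any of the three packages: the function names are made up for illustration. It relies on the fact that a venv changes `sys.prefix`, and on the assumption that `conda activate` sets the `CONDA_DEFAULT_ENV` variable in your shell.

```python
import os
import sys


def in_virtual_env() -> bool:
    """Return True when running inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base interpreter.
    return sys.prefix != sys.base_prefix


def conda_env_name():
    """Return the active conda environment name, or None if unset."""
    return os.environ.get("CONDA_DEFAULT_ENV")


def python_is_recent_enough(minimum=(3, 3)) -> bool:
    """Return True when the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= tuple(minimum)


if __name__ == "__main__":
    print("interpreter:       ", sys.executable)
    print("venv active:       ", in_virtual_env())
    print("conda environment: ", conda_env_name())
    print("python >= 3.3:     ", python_is_recent_enough())
```

With the environment from this step active, you would expect either `venv active` to be `True` or the conda environment to read `bigbioprompting`.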

-----------------------------------------------------------

To remove the conda environment later, run `conda env remove -n bigbioprompting`.

### Step 1: Install BigBio

The following exercise will work with existing datasets in `BigBio`.

#### Creating prompts for datasets in `BigBio`

Install the `BigBio` repository in your chosen environment.

```
pip install git+https://github.com/bigscience-workshop/biomedical.git
```

#### Creating prompts for your own dataset

If you want to develop with your own datasets, you should **fork** the `BigBio` repo. Once you have forked it, clone your fork and install it in editable mode:

```
git clone git@github.com:<your_fork>/biomedical.git
cd biomedical
pip install -e .
cd ../
```

This tutorial does not cover how to create a dataloader (this information can be found [here](https://github.com/bigscience-workshop/biomedical/blob/master/CONTRIBUTING.md)). Provided you create a dataloader, you can make prompts ***on either the source or BigBio views***.

### Step 2: Install lm-evaluation-harness

In order to integrate your templates and evaluate your prompt, we will work with `lm-evaluation-harness`. First, we download the package as follows:

```
git clone --branch bigbio --single-branch git@github.com:bigscience-workshop/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
cd ..
```

**NOTE** This step can be done later. However, installing `lm-evaluation-harness` may override the `promptsource` installation from Step 3. If that happens, uninstall `promptsource` and reinstall it by following the instructions from Step 3 onwards.
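Because one installation can shadow another, it is useful to see where Python would import each package from. The sketch below uses only the standard library; the helper names are illustrative, and the import names `bigbio`, `lm_eval`, and `promptsource` are taken from the paths and imports used elsewhere in this tutorial. Run it again once Step 3 is complete.

```python
import importlib.util


def package_location(name):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages have no single origin file, so this may be None too.
    return spec.origin


def report(names=("bigbio", "lm_eval", "promptsource")):
    """Map each package name to its import location (None if missing)."""
    return {name: package_location(name) for name in names}


if __name__ == "__main__":
    for name, location in report().items():
        print(f"{name:15s} {location or 'NOT FOUND'}")
```

For an editable install (`pip install -e .`), the reported path should point into your local clone rather than into `site-packages`, which makes it easy to spot a `promptsource` that is not the BigBio fork.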

### Step 3: Install BigBio promptsource and run the GUI

In collaboration with the PromptSource team, we created a fork of the original repo to support `BigBio` tasks specifically. You can clone and install this fork as follows:

```
git clone https://github.com/OpenBioLink/promptsource
cd promptsource
pip install -e .
cd ../
```

This will install **PromptSource**, which allows you to create prompts. **Please note: the GUI editor for promptsource may fail to deploy with newer versions of protobuf, so please ensure your version is below 3.20**.

You can check your protobuf version with `pip freeze | grep protobuf`. If it is 3.20 or higher, downgrade it as follows:

```
pip uninstall -y protobuf
pip install protobuf==3.19.4
```


Next, we need to create an authentication file in order to use the GUI app. In the root directory of promptsource, make a file called `cred.cfg` with only the following contents:

```
[authentication]
username=bigscience
password=bigscience
```

Your directory structure within promptsource should look as follows:

![Directory Structure](prompt_structure1.png)

Once you have installed `promptsource`, move to the root directory (i.e. `cd promptsource`) and run the following command:

```
streamlit run promptsource/app.py
```

You will first be prompted to log in on this screen: <br>

![GuiLogin](guilogin.png)


Log in with the username and password from your `cred.cfg`. You will then arrive at the GUI interface.

![GuiLogin](guiprompt.png)

#### Using the GUI to create prompts for custom datasets

The list of datasets in the drop-down menu matches that of **your environment's `BigBio` installation** (see [here](https://github.com/OpenBioLink/promptsource/blob/815426a103087074c5c6f264147dae421194a822/promptsource/utils.py#L145) for the list of datasets from the drop-down menu). For custom datasets, we *strongly* recommend following the developer instructions above using your fork of `BigBio`, and installing that from source.

The GUI display will provide a link to the original dataset source code. By default, it points to the main branch of `BigBio`. For custom datasets, you can point to your personal fork by editing line 386 in `promptsource/promptsource/app.py` to match where your source code lives.

### Step 4: Make your own prompt!

To create a prompt, use the tab on the left-hand side of the app.

In "Choose a mode", choose **Sourcing**.

![source](source.png)


Choose a dataset (ex: `chemprot`) and a subset corresponding to a bigbio view (ex: `chemprot_bigbio_kb`).

![dset](datasetsubset.png)


The main body of the prompt creator will show pre-existing prompts, or allow you to create new ones. **Creating prompts will require you to have some understanding of the dataset**. 

For this example, we are going to use the **Chemprot** dataset.
In the official chemprot dataset, the "gold-standard" labels are "Regulator, upregulator, downregulator, agonist, antagonist". For the sake of the example, we will create a prompt that asks which chemicals have a **Modulator** relationship with a gene or protein target.

First, come up with a **unique** name for your prompt. We will call this `is_modulator`.

![ismodulator](ismodulator.png)


After you click "create", you will receive an empty template as follows:

![emptytemplate](emptytemplate.png)

Here we are creating a natural-language-generation task. Set the metric to "Other", since the prompt has no fixed answer choices. Write your template as follows. <br>

```
{{ passages[0]['text'][0] }}

What chemicals are modulators to their protein or gene targets from the above passage? Separate each chemical and gene or protein target pair with a comma. If there are none, say "None."

|||
{% set ns = namespace(nonzero = false) %}{% for relation in relations %}{% if relation['type'] == "Modulator" %}{% set ns.nonzero = true %}{% endif %}{% endfor %}{% if ns.nonzero %}{% for relation in relations %}{% if relation['type'] == "Modulator" %}{% for entity in entities %}{% if entity['id'] == relation['arg1_id'] %}{{ entity['text'][0] }}{% endif %}{% endfor %}, {% for entity in entities %}{% if entity['id'] == relation['arg2_id'] %}{{ entity['text'][0] }}{% endif %}{% endfor %}
{% endif %}{% endfor %}{% else %}None.{% endif %}
```
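The target side of the template (everything after `|||`) is dense, so here is a plain-Python paraphrase of its logic. This is a sketch for reading the template, not something PromptSource runs: the function name and the toy example (ids, texts, and relation) are made up, and exact whitespace in the rendered Jinja output may differ slightly.

```python
def modulator_target(example):
    """Mimic the target side of the template for one KB-schema example."""
    # Index entities by id so each relation argument is a direct lookup.
    first_text = {entity["id"]: entity["text"][0] for entity in example["entities"]}

    pairs = []
    for relation in example["relations"]:
        if relation["type"] == "Modulator":
            # Like the template, emit nothing for an id with no matching entity.
            arg1 = first_text.get(relation["arg1_id"], "")
            arg2 = first_text.get(relation["arg2_id"], "")
            pairs.append(f"{arg1}, {arg2}")

    # The template emits one pair per line, or "None." without any match.
    return "\n".join(pairs) if pairs else "None."


# Hypothetical toy example: the ids, texts, and relation are invented.
toy = {
    "passages": [{"text": ["Compound A modulates protein B."]}],
    "entities": [
        {"id": "e1", "text": ["Compound A"]},
        {"id": "e2", "text": ["protein B"]},
    ],
    "relations": [{"type": "Modulator", "arg1_id": "e1", "arg2_id": "e2"}],
}

print(modulator_target(toy))                       # Compound A, protein B
print(modulator_target({**toy, "relations": []}))  # None.
```

The first pass over `relations` in the Jinja version only sets the `ns.nonzero` flag, since Jinja needs a `namespace` object to carry a value out of a loop; in Python the same decision is simply whether `pairs` ended up empty.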

<br>

You can use the "Select Example" menu on the left to step through examples (example 27 in the train split of the `chemprot` dataset has a Modulator relation to check). You can verify that your prompt behaves as expected using the input example on the right-hand side:

![chemprotexample](chemprotexample.png)

If you want more details on how to write prompts, you can follow Jinja templating tips [here](https://github.com/OpenBioLink/promptsource/blob/main/CONTRIBUTING.md).

### Step 5: Check your dataset template

You should be able to find your updated template in `promptsource/promptsource/templates/your_dataset_name/your_dataset_view`.

For chemprot, you should find it here:

```
promptsource/promptsource/templates/chemprot/chemprot_bigbio_kb/templates.yaml
```

![exampleprompttemplate](exampleprompttemplate.png)

All templated prompts will appear in this one file.
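To confirm from a script that your prompt was saved, you can build the expected path and search the file for the prompt name. The sketch below deliberately treats the YAML file as plain text instead of parsing it; the helper names are hypothetical, and `promptsource_root` is assumed to be the path to your local clone.

```python
from pathlib import Path


def template_file(promptsource_root, dataset, subset):
    """Build the expected path of templates.yaml for a dataset view."""
    return (
        Path(promptsource_root)
        / "promptsource"
        / "templates"
        / dataset
        / subset
        / "templates.yaml"
    )


def template_names_in(path, names):
    """Return which of `names` appear as text in the file at `path`."""
    text = Path(path).read_text(encoding="utf-8")
    return [name for name in names if name in text]


if __name__ == "__main__":
    path = template_file("promptsource", "chemprot", "chemprot_bigbio_kb")
    if path.exists():
        print("found prompts:", template_names_in(path, ["is_modulator"]))
    else:
        print("no template file at", path)
```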

With that, you've generated your own template for prompting! Now we will evaluate it!

### Step 6: Create an Evaluation Task for your dataset


#### Create the task

Now, we want to create a new task to evaluate our models. Place a file `yourdataset.py` in `lm-evaluation-harness/lm_eval/tasks` that follows the template below:

```python
from lm_eval.base import BioTask

_CITATION = """
PLACE_YOUR_CITATION_FOR_YOUR_DATASET_HERE
"""


class YourDatasetBase(BioTask):
    VERSION = 0
    DATASET_PATH = "path/to/dataloader/script/from/bigbio"
    DATASET_NAME = None
    SPLIT = None
    
    # Fill these out as T/F depending on your dataset
    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def training_docs(self):
        if self.has_training_docs():
            return self.dataset["train"]

    def validation_docs(self):
        if self.has_validation_docs():
            return self.dataset["validation"]

    def test_docs(self):
        if self.has_test_docs():
            return self.dataset["test"]  # you can replace with `train` to hack around


class YourDatasetSplit(YourDatasetBase):
    DATASET_NAME = "yourdataset_bigbio_<schema>"
```

A chemprot-specific file is provided below:

```python
from lm_eval.base import BioTask

_CITATION = """"""

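class ChemprotBase(BioTask):
    VERSION = 0
    # Machine-specific path to the chemprot dataloader; replace it with your own.
    DATASET_PATH = "/home/natasha/anaconda3/envs/bbp/lib/python3.9/site-packages/bigbio/biodatasets/chemprot"
    # Overridden by the schema-specific subclass below.
    DATASET_NAME = "Chemprot"
    SPLIT = None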

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def training_docs(self):
        if self.has_training_docs():
            return self.dataset["train"]

    def validation_docs(self):
        if self.has_validation_docs():
            return self.dataset["validation"]

    def test_docs(self):
        if self.has_test_docs():
            return self.dataset["test"]
          
class ChemprotKB(ChemprotBase):
    DATASET_NAME = "chemprot_bigbio_kb"

```

**NOTE** In order to retrieve results, you **MUST** have a validation and testing set. 
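A quick way to catch this early is to check the loaded splits before running an evaluation. The following is a minimal sketch with a hypothetical helper name; it assumes the dataset behaves like a mapping from split name to a sized collection of rows, as the `self.dataset["train"]` lookups above suggest.

```python
def unusable_splits(dataset, required=("validation", "test")):
    """Return {split: reason} for required splits that are missing or empty."""
    problems = {}
    for name in required:
        if name not in dataset:
            problems[name] = "missing"
        elif len(dataset[name]) == 0:
            problems[name] = "empty"
    return problems


# Toy stand-in for a loaded dataset: only the split names matter here.
toy_dataset = {"train": [1, 2, 3], "validation": []}
print(unusable_splits(toy_dataset))  # {'validation': 'empty', 'test': 'missing'}
```

An empty result means both required splits are present and non-empty.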

If you do not know where your `BigBio` dataloading script is and you used an Anaconda environment, it is likely somewhere here:

`/home/natasha/anaconda3/envs/<your_env_name>/lib/python3.9/site-packages/bigbio/biodatasets`
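Instead of guessing the path, you can ask Python where the installed `bigbio` package lives. This is a sketch under the assumption, taken from the path above, that dataloaders sit in a `biodatasets/<dataset>` folder inside the package; the helper name is made up.

```python
import importlib.util
from pathlib import Path


def dataloader_dir(dataset, package="bigbio", subdir="biodatasets"):
    """Return <package dir>/<subdir>/<dataset>, or None if the package is absent."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    # For a package, the search locations list the directory it is loaded from.
    package_dir = Path(list(spec.submodule_search_locations)[0])
    return package_dir / subdir / dataset


if __name__ == "__main__":
    path = dataloader_dir("chemprot")
    if path is None:
        print("bigbio is not importable in this environment")
    else:
        print(path, "(exists)" if path.exists() else "(not found)")
```

The printed path is a candidate value for `DATASET_PATH` in your task file.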


#### Update the `__init__` file to recognize your task

Add this dataset task to `lm-evaluation-harness/lm_eval/tasks/__init__.py` by adding the following lines:

```python
from . import yourdataset  # Place this with the other imports at the top

# Within TASK_REGISTRY, add the following entry
TASK_REGISTRY = {
    ...
    "your_dataset_name": yourdataset.Class_Corresponding_To_Schema
}
```

For example, Chemprot would look as follows:
```python

from . import chemprot

TASK_REGISTRY = {
    ...
    "chemprot": chemprot.ChemprotKB,
}
```
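The registry is an ordinary dictionary from task name to task class, so the key you choose is presumably the name you pass to `--tasks` later. The toy sketch below illustrates that shape only: `ToyTask`, `TOY_REGISTRY`, and `lookup_task` are invented stand-ins and are not part of `lm-evaluation-harness`.

```python
class ToyTask:
    """Stand-in for a task class such as chemprot.ChemprotKB."""

    DATASET_NAME = "chemprot_bigbio_kb"


# Hypothetical miniature registry with the same name -> class shape.
TOY_REGISTRY = {"chemprot": ToyTask}


def lookup_task(name, registry=TOY_REGISTRY):
    """Return the class registered under `name`, with a readable error."""
    try:
        return registry[name]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise KeyError(f"unknown task {name!r}; registered tasks: {known}") from None


print(lookup_task("chemprot").DATASET_NAME)  # chemprot_bigbio_kb
```

A typo in the key or a missing import typically surfaces as a lookup failure like the one raised here, so it is worth double-checking both lines when a task is not found.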

### Step 7: Run your task

From the `lm-evaluation-harness` directory, run the `main.py` script as follows:

```
python main.py --model hf-seq2seq --model_args pretrained=t5-small --tasks chemprot --device cuda
```
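If you want to launch the evaluation from a notebook or script, you can assemble the same command as an argument list. This sketch only builds and prints the command; the helper name is hypothetical, and it uses just the four flags shown above.

```python
import shlex
import sys


def eval_command(task, model="hf-seq2seq", pretrained="t5-small", device="cuda"):
    """Build the argument list for lm-evaluation-harness's main.py."""
    return [
        sys.executable, "main.py",
        "--model", model,
        "--model_args", f"pretrained={pretrained}",
        "--tasks", task,
        "--device", device,
    ]


if __name__ == "__main__":
    command = eval_command("chemprot")
    print(shlex.join(command))
    # To actually run it, pass the list to subprocess.run with
    # cwd set to your lm-evaluation-harness clone and check=True.
```

Using `sys.executable` instead of a bare `python` keeps the run inside the environment that has the three packages installed.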
