# Quantized Bonito Tutorial
This is a tutorial to set up and run a quantized version of [Bonito](https://github.com/BatsResearch/bonito) on a Google Colab T4 instance using the `transformers` package (instead of `vllm` as in the original repo). The quantized model was graciously created by GitHub/HuggingFace user `alexandreteles` and we thank them for their contributions! Note that quantized models may behave differently than their non-quantized counterparts. The versions they created are:
 - [alexandreteles/bonito-v1-awq](https://huggingface.co/alexandreteles/bonito-v1-awq) (`awq` quantized model, this is the one we'll be using)
 - [alexandreteles/bonito-v1-gguf](https://huggingface.co/alexandreteles/bonito-v1-gguf) (for llama.cpp inference)


## Setup
First, we clone the repo and install its dependencies. This takes several minutes.

In [None]:
!git clone https://github.com/BatsResearch/bonito.git
!pip -q install -e bonito/

Cloning into 'bonito'...
remote: Enumerating objects: 85, done.
remote: Counting objects: 100% (85/85), done.
remote: Compressing objects: 100% (67/67), done.
remote: Total 85 (delta 35), reused 41 (delta 14), pack-reused 0
Receiving objects: 100% (85/85), 764.30 KiB | 7.01 MiB/s, done.
Resolving deltas: 100% (35/35), done.
Obtaining file:///content/bonito
  Preparing metadata (setup.py) ... done
Collecting datasets (from bonito==0.0.1)
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting vllm (from bonito==0.0.1)
  Downloading vllm-0.3.3-cp310-cp310-manylinux1_x86_64.whl (44.3 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->bonito==0.0.1)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)

To use this quantized model, we need to install the [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) package, which handles AWQ ([Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978)) models such as the one we'll be using. AWQ is a quantization technique that uses activation statistics to identify the small fraction of weights that matter most and protects them during quantization, rather than treating all weights alike. To get it to work on Colab, we install the kernels from a prebuilt wheel so that the CUDA versions match.

In [None]:
!pip -q install autoawq
!git clone https://github.com/Boltuzamaki/AutoAWQ_kernels.git
!pip -q install AutoAWQ_kernels/builds/autoawq_kernels-0.0.6+cu122-cp310-cp310-linux_x86_64.whl

Collecting autoawq
  Downloading autoawq-0.2.3-cp310-cp310-manylinux2014_x86_64.whl (79 kB)
Collecting accelerate (from autoawq)
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
Collecting zstandard (from autoawq)
  Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
Collecting autoawq-kernels (from autoawq)
  Downloading autoawq_kernels-0.0.6-cp310-cp310-manylinux2014_x86_64.whl (33.4 MB)
Installing collected packages: zstandard, autoawq-kernels, accele
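
Because the kernel wheel is tied to a specific CUDA and Python version, it can help to compare the tags in the wheel filename against the runtime before installing. The following is a minimal sketch, not part of the original notebook: `parse_wheel_tags` and `runtime_tags` are hypothetical helper names, and the CUDA check assumes `torch` is installed (it falls back to `None` otherwise).

```python
import re
import sys


def parse_wheel_tags(wheel_name: str) -> dict:
    """Extract the CUDA local-version tag (e.g. 'cu122') and the
    CPython tag (e.g. 'cp310') from a wheel filename, if present."""
    cuda = re.search(r"\+(cu\d+)", wheel_name)
    python = re.search(r"-(cp\d+)-", wheel_name)
    return {
        "cuda": cuda.group(1) if cuda else None,
        "python": python.group(1) if python else None,
    }


def runtime_tags() -> dict:
    """Build the same two tags for the current interpreter and,
    when torch is available, for the CUDA version torch was built with."""
    py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    try:
        import torch
        cuda_version = torch.version.cuda  # e.g. "12.2"; None on CPU-only builds
    except ImportError:
        cuda_version = None
    cuda_tag = "cu" + cuda_version.replace(".", "") if cuda_version else None
    return {"cuda": cuda_tag, "python": py_tag}


wheel = "autoawq_kernels-0.0.6+cu122-cp310-cp310-linux_x86_64.whl"
print("wheel:  ", parse_wheel_tags(wheel))
print("runtime:", runtime_tags())
```

If the two dictionaries differ, the wheel used above is probably not the right build for the current Colab image, and a different build would be needed.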

## Quantized Bonito Wrapper
This cell includes the code to work with the quantized Bonito model, utilizing the `transformers` package. It's similar to the `Bonito` code made for `vllm` in the repo.

In [None]:
from typing import Optional, List
from datasets import Dataset
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


SHORTFORM_TO_FULL_TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}


class QuantizedBonito():
    def __init__(self, model_name_or_path):
        self.model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True).cuda()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    def generate_tasks(
        self,
        text_dataset: Dataset,
        context_col: str,
        task_type: str,
        sampling_params: dict,
        **kwargs,
    ):
        """
        Generates tasks using the Bonito model.

        This method takes a text dataset, a context column name,
        a task type, and sampling parameters, and generates tasks
        using the Bonito model. It processes the input dataset,
        generates outputs, collects multiple generations into
        one dataset object, and filters out the examples that
        cannot be parsed.

        Args:
            text_dataset (Dataset): The dataset that provides the text
                for the tasks.
            context_col (str): The name of the column in the dataset
                that provides the context for the tasks.
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (dict): The parameters for
                sampling.
            **kwargs: Additional keyword arguments.

        Returns:
            Dataset: The synthetic dataset with the generated tasks.
        """
        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col, **kwargs
        )

        outputs = self._generate_text(processed_dataset["input"], sampling_params)

        # collect multiple generations into one dataset object
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            output = outputs[i]
            example["prediction"] = output.strip()
            examples.append(example)

        synthetic_dataset = Dataset.from_list(examples)

        # filter out the examples that cannot be parsed
        synthetic_dataset = self._postprocess_dataset(
            synthetic_dataset, context_col, **kwargs
        )

        return synthetic_dataset

    def _generate_text(
        self,
        prompts: List[str],
        sampling_params: dict,
    ) -> List[str]:
        """
        Generate text using the model.

        This method takes a list of prompts, encodes each one,
        generates text using the model, decodes the newly generated
        tokens, and appends the result to a list.

        Args:
            prompts (List[str]): The prompts for text generation.
            sampling_params (dict): Keyword arguments forwarded to
                `generate` for sampling.

        Returns:
            List[str]: A list of generated texts corresponding to the prompts.
        """
        generated_texts = []

        for prompt in prompts:
            input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
            input_ids = input_ids.cuda()

            output = self.model.generate(
                input_ids,
                do_sample=True,
                **sampling_params
            )

            generated_text = self.tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
            generated_texts.append(generated_text)

        return generated_texts


    def _prepare_bonito_input(
        self, context_dataset: Dataset, task_type: str, context_col: str, **kwargs
    ) -> Dataset:
        """
        Prepares the input for the Bonito model.

        This method takes a context dataset, a task type, and a context
        column name, and prepares the dataset for the Bonito model.
        If the task type is not recognized, it raises a ValueError.

        Args:
            context_dataset (Dataset): The dataset that provides the
                context for the task.
            task_type (str): The type of the task. This can be a
                short form or a full form. If the task type is not
                recognized, a ValueError is raised.
            context_col (str): The name of the column in the dataset
                that provides the context for the task.
            **kwargs: Additional keyword arguments.

        Returns:
            Dataset: The prepared dataset for the Bonito model.
        """
        # get the task type name
        if task_type in SHORTFORM_TO_FULL_TASK_TYPES.values():
            full_task_type = task_type
        elif task_type in SHORTFORM_TO_FULL_TASK_TYPES:
            full_task_type = SHORTFORM_TO_FULL_TASK_TYPES[task_type]
        else:
            raise ValueError(f"Task type {task_type} not recognized")

        def process(example):
            input_text = "<|tasktype|>\n" + full_task_type.strip()
            input_text += (
                "\n<|context|>\n" + example[context_col].strip() + "\n<|task|>\n"
            )
            return {
                "input": input_text,
            }

        return context_dataset.map(
            process,
            remove_columns=context_dataset.column_names,
            num_proc=kwargs.get("num_proc", 1),
        )

    def _postprocess_dataset(
        self, synthetic_dataset: Dataset, context_col: str, **kwargs
    ) -> Dataset:
        """
        Post-processes the synthetic dataset.

        This method takes a synthetic dataset and a context column
        name, and post-processes the dataset. It filters out
        examples where the prediction does not contain exactly two
        parts separated by "<|pipe|>", and then maps each example to a
        new format where the context is inserted into the first part of
        the prediction and the second part of the prediction is used as
        the output.

        Args:
            synthetic_dataset (Dataset): The synthetic dataset to be
                post-processed.
            context_col (str): The name of the column in the dataset
                that provides the context for the tasks.
            **kwargs: Additional keyword arguments.

        Returns:
            Dataset: The post-processed synthetic dataset.
        """
        synthetic_dataset = synthetic_dataset.filter(
            lambda example: len(example["prediction"].split("<|pipe|>")) == 2
        )

        def process(example):
            pair = example["prediction"].split("<|pipe|>")
            context = example[context_col].strip()
            return {
                "input": pair[0].strip().replace("{{context}}", context),
                "output": pair[1].strip(),
            }

        column_names = synthetic_dataset.column_names
        processed_synthetic_dataset = synthetic_dataset.map(
            process,
            remove_columns=column_names,
            num_proc=kwargs.get("num_proc", 1),
        )

        return processed_synthetic_dataset
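
To make the prompt and output formats concrete without loading the model, here is a small standalone sketch that mirrors the string handling in `_prepare_bonito_input` and `_postprocess_dataset`. The helper names `build_prompt` and `parse_prediction` are hypothetical, and the `prediction` string is hand-written for illustration, not real model output.

```python
def build_prompt(full_task_type: str, context: str) -> str:
    """Assemble the prompt in the same layout as `_prepare_bonito_input`."""
    return (
        "<|tasktype|>\n" + full_task_type.strip()
        + "\n<|context|>\n" + context.strip() + "\n<|task|>\n"
    )


def parse_prediction(prediction: str, context: str):
    """Split a prediction on "<|pipe|>" as `_postprocess_dataset` does.
    Returns None when there are not exactly two parts."""
    parts = prediction.split("<|pipe|>")
    if len(parts) != 2:
        return None
    return {
        "input": parts[0].strip().replace("{{context}}", context.strip()),
        "output": parts[1].strip(),
    }


context = "The study enrolled 40 patients."
prompt = build_prompt("question generation", context)

# Hand-written stand-in for a model generation (an assumption, for illustration only)
prediction = (
    "Read this and write a question: {{context}}\n"
    "<|pipe|>\n"
    "How many patients were enrolled?"
)
example = parse_prediction(prediction, context)

print(repr(prompt))
print(example)
```

Predictions without exactly one `<|pipe|>` separator return `None` here, which corresponds to the rows that the wrapper filters out.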


## Synthetic Data Generation
This is where we load the model and the unannotated text. The original tutorial draws 10 examples from the `BatsResearch/bonito-experiment` dataset; the cells below instead extract sentences from a PDF of your own, wrap them in a `Dataset`, and generate synthetic instructions from them. Note that `sampling_params` uses `transformers` keywords instead of `vllm`'s.

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!pip -q install pymupdf spacy

Collecting pymupdf
  Downloading PyMuPDF-1.23.26-cp310-none-manylinux2014_x86_64.whl (4.4 MB)
Collecting PyMuPDFb==1.23.22 (from pymupdf)
  Downloading PyMuPDFb-1.23.22-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (30.6 MB)
Installing collected packages: PyMuPDFb, pymupdf
Successfully installed PyMuPDFb-1.23.22 pymupdf-1.23.26


In [None]:
!pip -q install datasets huggingface_hub

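Upload your PDF file to the Colab session, then set `pdf_path` in the next cell to its location. The cell extracts the plain text of every page with PyMuPDF.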

In [None]:
import fitz  # PyMuPDF

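def extract_text_from_pdf(pdf_path):
    # Open the PDF and concatenate the plain text of every page
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

pdf_path = 'your text file'  # placeholder: replace with the path to your uploaded PDF
text = extract_text_from_pdf(pdf_path)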


In [None]:
import re

def clean_text(text):
    # Remove hashtags
    text = re.sub(r'#\w+', '', text)

    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Remove punctuation except periods
    text = re.sub(r'[^\w\s.]', '', text)

    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

clean_data = clean_text(text)
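
To see what this cleaning step keeps and drops, here is a short standalone check on a made-up sample string (the sample is an assumption for illustration; the function body is the same as in the cell above). Note that commas, percent signs, and exclamation marks are removed along with hashtags and URLs, while periods survive.

```python
import re


def clean_text(text):
    # Remove hashtags
    text = re.sub(r'#\w+', '', text)
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove punctuation except periods
    text = re.sub(r'[^\w\s.]', '', text)
    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text


sample = "Check #results at https://example.com now! Costs rose 5%, v 2.0."
cleaned = clean_text(sample)
print(cleaned)
# → Check at now Costs rose 5 v 2.0.
```

If your source text relies on commas or other punctuation for meaning, consider relaxing the third pattern before generating tasks from it.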

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Load the small English pipeline (tokenizer, tagger, parser, NER)

def split_into_sentences(text):
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    return sentences

sentences = split_into_sentences(clean_data)


In [None]:
print(len(sentences))

482


In [None]:
print(sentences[145])

Nevertheless, results are reported separately for the 
treatment groups as well as for treatment groups pooled.


In [None]:
from datasets import Dataset

# Assuming sentences is a list of strings, where each string is a sentence
data = {"sentence": sentences}
dataset = Dataset.from_dict(data)

print(dataset)


Dataset({
    features: ['sentence'],
    num_rows: 482
})


In [None]:
from datasets import load_dataset

# Initialize the Bonito model
bonito = QuantizedBonito("alexandreteles/bonito-v1-awq")

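# Load the example dataset with unannotated text from the original tutorial.
# It is not used below: generation runs on `dataset`, built from the PDF sentences.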
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment",
    "unannotated_contract_nli"
)["train"].select(range(10))

# Generate synthetic instruction tuning dataset
sampling_params = {'max_new_tokens':256, 'top_p':0.95, 'temperature':0.5, 'num_return_sequences':1}
synthetic_dataset = bonito.generate_tasks(
    dataset,
    context_col="sentence",
    task_type="qg",
    sampling_params=sampling_params
)

print(synthetic_dataset)

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

Replacing layers...: 100%|██████████| 32/32 [00:12<00:00,  2.47it/s]
Fusing layers...: 100%|██████████| 32/32 [00:00<00:00, 95.43it/s]


Map:   0%|          | 0/482 [00:00<?, ? examples/s]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Filter:   0%|          | 0/482 [00:00<?, ? examples/s]

Map:   0%|          | 0/481 [00:00<?, ? examples/s]

Dataset({
    features: ['input', 'output'],
    num_rows: 481
})
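
The `sampling_params` dictionary above uses the keyword names that `transformers`' `generate` expects, which differ from `vllm`'s `SamplingParams` fields for two of the four settings. As a rough sketch, a hypothetical helper (`to_generate_kwargs`, not part of Bonito or either library) could translate the names used in this tutorial:

```python
# Only the parameters used in this tutorial are covered; `top_p` and
# `temperature` share the same name in both libraries.
VLLM_TO_TRANSFORMERS = {
    "max_tokens": "max_new_tokens",
    "n": "num_return_sequences",
}


def to_generate_kwargs(vllm_style: dict) -> dict:
    """Rename vllm-style sampling keys to their `generate` equivalents,
    leaving unknown keys untouched."""
    return {VLLM_TO_TRANSFORMERS.get(key, key): value
            for key, value in vllm_style.items()}


vllm_style = {"max_tokens": 256, "top_p": 0.95, "temperature": 0.5, "n": 1}
sampling_params = to_generate_kwargs(vllm_style)
print(sampling_params)
```

Keep in mind that `_generate_text` decodes only the first returned sequence (`output[0]`), so setting `num_return_sequences` above 1 would not yield additional examples with the wrapper as written.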


Now go try it out with your own datasets! You can vary the `task_type` for different types of generated instructions.
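
For reference, here is a standalone sketch of how a `task_type` value is resolved, using the same table and the same logic as `_prepare_bonito_input`. The function name `resolve_task_type` is hypothetical; it exists only so the lookup can be tried without loading the model.

```python
SHORTFORM_TO_FULL_TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}


def resolve_task_type(task_type: str) -> str:
    """Accept either a short form or a full task name and return the full name."""
    if task_type in SHORTFORM_TO_FULL_TASK_TYPES.values():
        return task_type
    if task_type in SHORTFORM_TO_FULL_TASK_TYPES:
        return SHORTFORM_TO_FULL_TASK_TYPES[task_type]
    raise ValueError(f"Task type {task_type} not recognized")


for short in ("qg", "mcqa", "nli"):
    print(short, "->", resolve_task_type(short))
```

Any key or value from this table can be passed as `task_type` to `generate_tasks`; anything else raises a `ValueError` before generation starts.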

In [None]:
import pandas as pd

df = pd.DataFrame(synthetic_dataset)

print(df.head(50))  # Adjust the number inside head() to see more or fewer rows


                                                input  \
0   Write a multi-choice question for the followin...   
1   I want to test the ability of students to read...   
2   Write a multi-choice question for the followin...   
3   Write a multi-choice question for the followin...   
4   Changes from one approved revision to the next...   
5   Write a multi-choice question for the followin...   
6   Write a multi-choice question for the followin...   
7   Write a multi-choice question for the followin...   
8   Write a multi-choice question for the followin...   
9   Write a multi-choice question for the followin...   
10  Generate a question that has the following ans...   
11  Write a multi-choice question for the followin...   
12  Write a multi-choice question for the followin...   
13  Write a multi-choice question for the followin...   
14  Write a multi-choice question for the followin...   
15  Write a multi-choice question for the followin...   
16  Write a multi-choice questi

In [None]:
df.head(3)

|   | input | output |
|---|---|---|
| 0 | Write a multi-choice question for the followin... | Question: \nWhere can we most probably find th... |
| 1 | I want to test the ability of students to read... | What is the purpose of the study? |
| 2 | Write a multi-choice question for the followin... | What is the protocol version number and date? |


In [None]:
df['input'][2]  # index assumed from the output below; the original cell left it empty

'Write a multi-choice question for the following article, with the given choices and answer:\nArticle: Protocol Number: 2019-CHF-006 \n \nProtocol Version \nnumber and date: \nv 2.0, 09 Sep 2021 \nSAP Version number \nand date: \nv 1.0, 09 Dec 2021 \nSponsor: \nSequana Medical NV \nSponsor address: \nTechnologiepark-Zwijnaarde 122 \n9052 Gent \nBelgium \n \n \n \n09 DEC 2021\n10\nOptions:\nA 2019-CHF-006\nB v 2.0, 09 Sep 2021\nC v 1.0, 09 Dec 2021\nD v 2.0, 09 Sep 2021\nAnswer:\nB v 2.0, 09 Sep 2021\nQuestion:'