# Using Bonito Model to Generate Q&A Pairs

Instruction tuning is a method to increase the instruction following capabilities and general zero-shot performance of LLMs. Using synthetic data in this task is widely used (Self-Instruct, OpenOrca, etc.). Nayak et al. {cite}`nayak_learning_2024` introduced [Bonito](https://huggingface.co/BatsResearch/bonito-v1) model for conditional task generation which enables converting plain (unannotated) text to instruction-tuning datasets. We will use this model to develop a "critical infrastructure assessor LLM" using the MOSIP documentation and Okta's recent threat intelligence report as unannotated datasets.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [3]:
from pathlib import Path
cache_path = Path('/bask/projects/v/vjgo8416-ai-fairness/.cache/huggingface')

In [4]:
tokenizer = AutoTokenizer.from_pretrained("BatsResearch/bonito-v1", cache_dir=cache_path)
model = AutoModelForCausalLM.from_pretrained("BatsResearch/bonito-v1", cache_dir=cache_path)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

  return self.fget.__get__(instance, owner)()


In [5]:
from datasets import Dataset,load_from_disk
mosipds = load_from_disk("datasets/mosip_dataset.hf")

In [6]:
#input_text = mosipds[1]["input"]
#input_text = "Summarise the following text: " + input_text 

In [7]:
#input_ids = tokenizer(input_text, return_tensors="pt")
#outputs = model.generate(**input_ids, max_new_tokens=30)
#print(tokenizer.decode(outputs[0]))

In [8]:
SHORTFORM_TO_FULL_TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}

In [9]:
def prepare_bonito_input(context_dataset: Dataset, task_type: str, context_col: str) -> Dataset:
        """
        Prepares the input for the Bonito model.

        This method takes a context dataset, a task type, and a context
        column name, and prepares the dataset for the Bonito model.
        If the task type is not recognized, it raises a ValueError.

        Args:
            context_dataset (Dataset): The dataset that provides the
                context for the task.
            task_type (str): The type of the task. This can be a
                short form or a full form. If the task type is not
                recognized, a ValueError is raised.
            context_col (str): The name of the column in the dataset
                that provides the context for the task.
            **kwargs: Additional keyword arguments.

        Returns:
            Dataset: The prepared dataset for the Bonito model.
        """
        # get the task type name
        if task_type in SHORTFORM_TO_FULL_TASK_TYPES.values():
            full_task_type = task_type
        elif task_type in SHORTFORM_TO_FULL_TASK_TYPES:
            full_task_type = SHORTFORM_TO_FULL_TASK_TYPES[task_type]
        else:
            raise ValueError(f"Task type {task_type} not recognized")

        def process(example):
            input_text = "<|tasktype|>\n" + full_task_type.strip()
            input_text += (
                "\n<|context|>\n" + example[context_col].strip() + "\n<|task|>\n"
            )
            return {
                "input": input_text,
            }

        return context_dataset.map(
            process,
            remove_columns=context_dataset.column_names,
            num_proc=1,
        )

In [34]:
def postprocess_dataset(synthetic_dataset: Dataset, context_col: str) -> Dataset:
        """
        Post-processes the synthetic dataset.

        This method takes a synthetic dataset and a context column
        name, and post-processes the dataset. It filters out
        examples where the prediction does not contain exactly two
        parts separated by "<|pipe|>", and then maps each example to a
        new format where the context is inserted into the first part of
        the prediction and the second part of the prediction is used as
        the output.

        Args:
            synthetic_dataset (Dataset): The synthetic dataset to be
                post-processed.
            context_col (str): The name of the column in the dataset
                that provides the context for the tasks.
            **kwargs: Additional keyword arguments.

        Returns:
            Dataset: The post-processed synthetic dataset.
        """
        synthetic_dataset = synthetic_dataset.filter(
            lambda example: len(example["prediction"].split("<|pipe|>")) == 2
        )

        def process(example):
            pair = example["prediction"].split("<|pipe|>")
            context = example[context_col].strip()
            return {
                "input": pair[0].strip().replace("{{context}}", context),
                "output": pair[1].strip(),
            }

        synthetic_dataset = synthetic_dataset.map(
            process,
            remove_columns=synthetic_dataset.column_names,
            num_proc=1,
        )

        return synthetic_dataset

In [16]:
def convert_to_dataset(text):
    dataset = Dataset.from_list([{"input": text}])
    return dataset

In [17]:
# If you would like to test the model with only one simple text, you can uncomment the following code:

# unannotated_paragraph = """1. “Confidential Information”, whenever used in this Agreement, shall mean any data, document, specification and other information or material, that is delivered or disclosed by UNHCR to the Recipient in any form whatsoever, whether orally, visually in writing or otherwise (including computerized form), and that, at the time of disclosure to the Recipient, is designated as confidential."""

# processed_dataset = prepare_bonito_input(
#    context_dataset=convert_to_dataset(unannotated_paragraph),
#    context_col="input",
#    task_type="nli"
# )

In [23]:
processed_dataset = prepare_bonito_input(
    context_dataset=mosipds,
    context_col="input",
    task_type="nli"
)

In [24]:
processed_dataset[0]

{'input': '<|tasktype|>\nnatural language inference\n<|context|>\nOverview - Multiple language support : \n* Registration Client is featured to allow an operator to choose the operation language. Option to select their preferred language is provided on the login screen.\n* Data collection during registration client supports more than one language at a time.\n* Before starting any registration process, the operator can choose the languages amongst the configured ones.\n \n  \nTo know more about setting up the reference registration client, refer to [Registration Client Installation Guide](https://docs.mosip.io/1.2.0/modules/registration-client/registration-client-installation-guide).\n\nTo know more about the features present in the Registration Client, refer to [Registration Client User Guide](https://docs.mosip.io/1.2.0/modules/registration-client/registration-client-user-guide).\n<|task|>\n'}

In [30]:
tokenizer.pad_token = tokenizer.eos_token

In [None]:
# collect multiple generations into one dataset object
examples = []

for i, example in enumerate(mosipds.to_list()):
    input_ids = tokenizer(processed_dataset["input"][i], return_tensors="pt", padding=True)
    outputs = model.generate(**input_ids, max_new_tokens=256, top_p=0.95, temperature=0.7)
    examples.append(
        {"context": "input", "prediction": tokenizer.decode(outputs[0])}
    )
    print(examples[-1])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nOverview - Multiple language support : \n* Registration Client is featured to allow an operator to choose the operation language. Option to select their preferred language is provided on the login screen.\n* Data collection during registration client supports more than one language at a time.\n* Before starting any registration process, the operator can choose the languages amongst the configured ones.\n \n  \nTo know more about setting up the reference registration client, refer to [Registration Client Installation Guide](https://docs.mosip.io/1.2.0/modules/registration-client/registration-client-installation-guide).\n\nTo know more about the features present in the Registration Client, refer to [Registration Client User Guide](https://docs.mosip.io/1.2.0/modules/registration-client/registration-client-user-guide).\n<|task|>\n {{context}} Based on the previous passage, is it true that "The R

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nOverview - Who operates the Registration Client? : \nThe Registration Client can be operated by an operator who can be either a **Supervisor** or an **Officer**. They can login to the client application and perform various activities. The Supervisor and the Officer can perform tasks like Onboarding, Synchronize Data, Upgrade software, Export packet, Upload packets, View Re-registration packets, Correction process, Exception authentication, etc. In addition to this, the Supervisor has exclusive authority to Approve/reject registrations.\n\nTo know more about the onboarding process of an operator, refer to [Operator onboarding](operator-onboarding.md).\n<|task|>\n {{context}} Based on the previous passage, is it true that "The Registration Client can be operated by an operator who can be either a **Supervisor** or an **Officer**. They can login to the client application and perform various acti

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nOverview - Registration Client entity diagram :     \n![](_images/reg-client.drawio.png)\n\nThe relationship of Registration Client with other services is explained here. _NOTE_: The numbers do not signify sequence of operations or control flow.\n\n1. Registration Client connects to the Upgrade Server to check on upgrades and patch downloads.\n2. All the masterdata and configurations are downloaded from SyncData-service.\n3. Registration Client always connects to external biometric devices through SBI.\n4. Registration Client scans the document proofs from any document scanner.\n5. Acknowledgement receipt print request is raised to any connected printers.\n6. Packets ready to be uploaded meta-info are synced to Sync Status service. Also, the status of already uploaded packets are synced back to Registration Client.\n7. All the synced packets are uploaded to Packet Receiver service one by one.

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nOverview - Data protection : \n* The registration packets and synced data are stored in the client machine.\n* Most of the synced data are stored in the Derby DB. Derby DB is encrypted with the bootpassword.\n* Derby DB boot password is encrypted with machine TPM key and stored under `.mosipkeys/db.conf`.\n* Synced UI-SPEC/script files are saved in plain text under registration client working directory. During sync, SPEC/script file hash is stored in derby and then the files are saved in the current working directory. Everytime the file is accessed by the client performs the hash check.\n* Synced pre-registration packets are encrypted with TPM key and stored under configured directory.\n* Directory to store the registration packets and related registration acknowledgments is configurable. \n* Registration packet is an signed and encrypted ZIP.\n* Registration acknowledgment is also signed and

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nOverview - Configurations : \nRegistration Client can be customized as per a country\' requirements.  For details related to Registration Client configurations, refer to [Registration Client configuration](https://docs.mosip.io/1.2.0/modules/registration-client/registration-client-configuration).\n<|task|>\n {{context}} Based on the previous passage, is it true that "Registration Client can be customized as per a country\'s requirements."? Yes, no, or maybe?\n<|pipe|>\nYes</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nOverview - UI Specifications for Registration Tasks : \n* Blueprint of registration forms to be displayed in registration client are created as json called as UI-SPEC.\n* Every process ( NEW / LOST / UPDATE UIN / CORRECTION ) has its own UI-SPEC json.\n* Kernel-masterdata-service exposes API\'s to create and publish UI-SPEC.\n* Published UI-SPEC json are versioned.\n* Only published UI-SPEC are synced into registration-client.\n* UI-SPEC json files are tamper proof, client checks the stored file hash everytime it tries to load registration UI.\n* UI-SPEC json will fail to load if tampered.\n\nDefault UI Specifications loaded with Sandbox installation is available [here](https://github.com/mosip/mosip-infra/blob/1.2.0-rc2/deployment/v3/mosip/kernel/masterdata/xlsx/ui_spec.xlsx)\n<|task|>\n {{context}} Based on the previous passage, is it true that "UI-SPEC json files are tamper proof, client c

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nOverview - Developer Guide : To know more about the developer setup, read [Registration Client Developers Guide](https://docs.mosip.io/1.2.0/modules/registration-client/registration-client-developers-guide).\n<|task|>\n {{context}} Based on the previous passage, is it true that "The developer guide is not very long."? Yes, no, or maybe?\n<|pipe|>\nMaybe</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nData Protection - Registration data flow : \n![](\\_images/cryptography-registration-flow.png)\n\n1. [Biometrics](biometrics.md) are signed by the private key of the device provider (PK2). The signature is verified by the Registration Client.\n2. [Registration Client](registration-client.md) signs the packet using the TPM key of the machine (K10) and encrypts the packet using MOSIP public key specific to (the registration centre, machine id) combination (K11).\n3. [Registration processor](registration-processor.md) stores packets created in (2) "as is" in [Object Store](broken-reference).\n4. [ID Repository](id-repository.md) encrypts biometrics, demographics and documents and stores them in Object Store. (K7.1,K7.2,K7.3)\n5. The UINs are hashed, encrypted and stored in `uin` the table of `mosip_idrepo` DB. (K7.4)\n6. Biometrics are shared and encrypted with the ABIS partner\'s key (PK1).\n7.

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nData Protection - Datashare : \n![](\\_images/cryptography-datashare.png)\n\nData shared with all partners like ABIS, Print, Adjudication, IDA etc. is encrypted using partners\' public key. Note that IDA is also a partner, however, a special partner in the sense that data is additionally zero-knowledge encrypted before sending to IDA (see the section below).\n<|task|>\n {{context}} Based on the previous passage, is it true that "Data is sent to IDA after it is zero-knowledge encrypted."? Yes, no, or maybe?\n<|pipe|>\nYes</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nData Protection - Zero-knowledge encryption : \nThe [ID Authentication](id-authentication.md) module (IDA) is an independent module and may be hosted by several providers. IDA hosts all the biometric templates and demographic data. Unique additional protection is provided here to make sure that mass decryption of user data is very difficult to achieve. The data can only be decrypted if the user\'s UIN is provided. Here is the encryption scheme:\n\n### Encryption and sharing by Credential Service\n\n1. Generate master symmetric encryption key K9.\n2. Generate a 10,000 symmetric keys pool (ZKn). Encrypt each ZKn with K9 and store it in DB. (K12)\n3. Randomly select one key from ZKn, and decrypt using K9.\n4. Derive new key ZKn\' = ZKn + UIN/VID/APPID.\n5. Encrypt biometric templates and demographics.\n   * BIO = encrypt(bio/demo with ZKn\').\n6. Encrypt ZKn (this is done to share ZKn with IDA).

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nData Protection - ID authentication flow : \n![](\\_images/cryptography-ida-flow.png)\n\n1. L1 devices contain [FTM](ftm.md) to encrypt (DE1, K21) and sign (FK1) biometrics at the source and send them to the authentication client.\n2. The authentication client further encrypts the auth request with IDA-PARTNER public key.\n3. IDA decrypts zero-knowledge data as given in [Step 4](data-protection.md#encryption-and-share-by-credential-service) and then performs a demographic and/or biometric authentication.\n4. The match result is returned to Auth client. In the case of KYC, the KYC attributes are encrypted with the Partner\'s public key (as in [Datashare](datashare.md)).\n<|task|>\n {{context}} Based on the previous passage, is it true that "The match result is returned to Auth client. In the case of KYC, the KYC attributes are encrypted with the Partner\'s public key (as in [Datashare](datasha

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nPartner Policies - Overview : \nPartner policies control the data that needs to be shared with a partner. The policies reside in [`auth_policy` table](https://github.com/mosip/partner-management-services/blob/release-1.2.0/db\\_scripts/mosip\\_pms/ddl/pms-auth\\_policy.sql) of `mosip_pms` DB.\n\n### Policy types\n\n| Policy type      | Partners                                                                          | Description                                                                                                                                                               |\n| ---------------- | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| Auth policy      | AP         

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nPartner Policies - Policy group : \nCommon policies are grouped example \'Telecom\', \'Banking\', \'Insurance\' etc.\n<|task|>\n {{context}} Based on the previous passage, is it true that "Telecom is a common policy"? Yes, no, or maybe?\n<|pipe|>\nYes</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nAutomation Testing - Overview : \nMOSIP provides automation test repositories for the following:\n\n* [Admin UI](https://github.com/mosip/admin-ui/tree/release-1.2.0/admintest)\n* [Registration Client](https://github.com/mosip/registration-client/tree/release-1.2.0/registration/registration-test)\n* [Functional Tests](https://github.com/mosip/mosip-functional-tests/tree/release-1.2.0)\n* [Automation Tests](https://github.com/mosip/mosip-automation-tests/tree/release-1.2.0)\n\n### Admin UI\nSelenium webdriver-based Admin Portal Automation covers CRUD (create, read, update and delete) operation performed via UI with Chrome driver.\n\n### Registration Client\nRegistration test automation covers these flows: New, Update, Correction, and Lost flows.\n\nTo know more about each, click [here](id-lifecycle-management.md).\n\n### Functional Tests\nThis repository contains API automation tests. The auto

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nMOSIP e-Manas - Overview : \nThis demonstration showcases the integration of MOSIP\'s [Authentication API](https://mosip.github.io/documentation/1.2.0/authentication-service.html) with the Mental Healthcare Management system offered by [e-Manas](https://e-manas.karnataka.gov.in/#/about). This integration enables secure means of sharing health records of a patient across hospitals with the patient\'s consent, where the authentication is enabled by MOSIP ID.\n\nBelow is the demonstration of the same.\n\n{% embed url="https://youtu.be/yyherCcIpTs" %}\n<|task|>\n {{context}} Based on the previous passage, is it true that "MOSIP e-Manas is a demonstration of the integration of MOSIP\'s [Authentication API](https://mosip.github.io/documentation/1.2.0/authentication-service.html) with the Mental Healthcare Management system offered by [e-Manas](https://e-manas.karnataka.gov.in/#/about)."? Yes, no, o

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nMOSIP e-Manas - MOSIP e-Manas Integration Architecture : \n![](\\_images/mosip-eManas-int-arch.png)\n<|task|>\n {{context}} Based on the previous passage, is it true that "MOSIP e-Manas Integration Architecture is a new technology"? Yes, no, or maybe?\n<|pipe|>\nMaybe</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nResident Services- Release Notes - Overview : \nThis release marks the developer\'s preview release of Resident Services, offering valuable insights into the range of features and functionality available. Resident Services is designed to run on 1.2.0.1-B3 version of MOSIP platform. Resident Services are the self-services which are used by the residents themselves via a portal. [Resident Portal](https://docs.mosip.io/1.2.0/modules/resident-services/resident-portal-user-guide) is a web-based UI application that provides residents of a country the services related to their Unique Identification Number (UIN). The residents can perform various operations related to their UIN/ VID and can also raise concerns if any through the portal.\n\nThe key features provided on the Resident portal are:\n\n1. Avail **UIN services** using UIN/VID (through [e-Signet](https://docs.esignet.io)):\n     * View My His

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nResident Services- Release Notes - Repository Released : \n| **Repositories**            | **Tags Released**                                                              |\n| --------------------------- | ------------------------------------------------------------------------------ |\n| Resident Services                | [Resident Services vDP1](https://github.com/mosip/resident-services/releases/tag/vDP1) |\n| Resident UI        | [Resident UI vDP1](https://github.com/mosip/resident-ui/releases/tag/vDP1) |\n<|task|>\n {{context}} Based on the previous passage, is it true that "Resident Services is a repository."? Yes, no, or maybe?\n<|pipe|>\nYes</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nResident Services- Release Notes - Services : \nFor detailed description of Resident services, the code and design, refer to [resident services repo](https://github.com/mosip/resident-services/releases/tag/vDP1).\n<|task|>\n {{context}} Based on the previous passage, is it true that "Resident services are not detailed."? Yes, no, or maybe?\n<|pipe|>\nNo</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nResident Services- Release Notes - Resident Portal UI : \nMOSIP provides a reference implementation of the Resident portal that can be customized as per the country’s needs. The sample implementation is available [here](https://github.com/mosip/resident-ui/releases/tag/vDP1).\n\nFor getting started with the resident portal, refer to the [Resident Portal User Guide](resident-portal-user-guide.md).\n<|task|>\n {{context}} Based on the previous passage, is it true that "The resident portal is not customizable."? Yes, no, or maybe?\n<|pipe|>\nNo</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nResident Services- Release Notes - Build and Deploy : \nTo access the build and read through the deployment instructions, refer to the [Resident Services Deployment Guide](https://docs.mosip.io/1.2.0/modules/resident-services/resident-services-deployment-guide).\n<|task|>\n {{context}} Based on the previous passage, is it true that "The deployment guide is 100 pages long."? Yes, no, or maybe?\n<|pipe|>\nMaybe</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nResident Services- Release Notes - Configurations : \nFor details related to resident portal configurations, refer to the [Configuration Guide](https://docs.mosip.io/1.2.0/modules/resident-services/resident-portal-configuration-guide).\n<|task|>\n {{context}} Based on the previous passage, is it true that "The Configuration Guide is a 100 page document."? Yes, no, or maybe?\n<|pipe|>\nMaybe</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nResident Services- Release Notes - Developers Guide : \nFor a detailed description of Resident Services, code, design, and setup steps, refer to:\n1. [Resident Services Developers Guide](resident-services-developer-guide.md)\n2. [Resident Services UI Developers Guide](resident-services-ui-developer-guide.md)\n<|task|>\n {{context}} Based on the previous passage, is it true that "Resident Services- Release Notes - Developers Guide : \nFor a detailed description of Resident Services, code, design, and setup steps, refer to:\n1. [Resident Services Developers Guide](resident-services-developer-guide.md)\n2. [Resident Services UI Developers Guide](resident-services-ui-developer-guide.md)\n\nResident Services is a game."? Yes, no, or maybe?\n<|pipe|>\nMaybe</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nResident Services- Release Notes - API : \nRefer [API Documentation](https://mosip.stoplight.io/docs/resident/9a5192571fc51-document).\n<|task|>\n {{context}} Based on the previous passage, is it true that "The API documentation is not available online."? Yes, no, or maybe?\n<|pipe|>\nNo</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\nResident Services- Release Notes - Test Report : \nFor details on the test results, refer [here](https://github.com/mosip/test-management/tree/master/).\n<|task|>\n {{context}} Based on the previous passage, is it true that "The test results are not available."? Yes, no, or maybe?\n<|pipe|>\nNo</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\n🌍 Country Implementation - Overview : \nThis is a guide to implement MOSIP for a country. It is advised that Government and System Integrators (SI) study the recommended steps to work out an appropriate implementation strategy. The items are in "near-chronological order" and may differ for an implementation.\n<|task|>\n {{context}} Based on the previous passage, is it true that "The items are in "near-chronological order" and may differ for an implementation. The items are in "near-chronological order" and may differ for an implementation. The items are in "near-chronological order" and may differ for an implementation. "? Yes, no, or maybe?\n<|pipe|>\nYes</s>'}


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\n🌍 Country Implementation - Key decisions : \n1. Choice of deployment of [Pre-registration](id-lifecycle-management.md#pre-registration).\n2. Rate of enrolment desired.\n3. Rate of authentication expected.\n4. [Languages](module-configuration.md#languages).\n5. Customisation and procurement of components as given [here](reference-implementations.md).\n6. [ID schema](id-schema.md) (as prescribed by the country\'s regulatory authority).\n7. Hardware requirements estimate.\n   * [Server side](https://github.com/mosip/documentation/tree/develop/docs/\\_files)\n   * [Devices](\\_files/mosip-devices-calculator.xlsx)\n8. [Credential choices](id-repository.md#credential-types).\n9. ID Card print design.\n10. MOSIP versions.\n11. MOSIP support (scope).\n12. Disaster recovery strategy.\n13. Phased approach for rollout.\n<|task|>\n {{context}} Based on the previous passage, is it true that "The country\'

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'context': 'input', 'prediction': '<s> <|tasktype|>\nnatural language inference\n<|context|>\n🌍 Country Implementation - Procurement : \n1. Engagement with an SI - terms and conditions.\n2. Procurement of biometric and other external components.\n3. HSM\n4. [Postgres](https://docs.mosip.io/1.2.0/modules/persistence/postgres-db))\n5. [Object store](https://docs.mosip.io/1.2.0/modules/persistence/object-store)\n6. Compute hardware\n<|task|>\n {{context}} Based on the previous passage, is it true that "The terms and conditions are not very long."? Yes, no, or maybe?\n<|pipe|>\nMaybe</s>'}


In [None]:
synthetic_dataset = Dataset.from_list(examples)

# filter out the examples that cannot be parsed
synthetic_dataset = postprocess_dataset(
    synthetic_dataset, context_col="context"
)

synthetic_dataset.save_to_disk("datasets/mosip_bonito.hf")