# Generating Data

## Understanding Taxonomy

InstructLab uses a novel synthetic data-based alignment tuning method for Large Language Models (LLMs). The "LAB" in InstructLab stands for Large-scale Alignment for ChatBots.

The LAB method is driven by taxonomies, which are largely created manually and with care.

InstructLab crowdsources the process of tuning and improving models through a new open source community that collects two types of data: knowledge and skills. These submissions are organized in a taxonomy of YAML files that feeds the synthetic data generation process. The following image shows the directory structure of a taxonomy.


![Taxonomy directory structure diagram](./assets/taxonomy_diagram.png)
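
Because a taxonomy is a directory tree of YAML files, its layout can also be inspected programmatically. The following is a minimal sketch, not part of InstructLab itself: the helper name `list_qna_files` is hypothetical, and it assumes only that each leaf directory of the taxonomy holds a file named `qna.yaml`.

```python
from pathlib import Path


def list_qna_files(taxonomy_root):
    """Return the taxonomy-relative directory of every qna.yaml under taxonomy_root.

    Hypothetical helper for illustration; not an InstructLab API.
    """
    root = Path(taxonomy_root)
    if not root.is_dir():
        # Nothing to list if the taxonomy directory does not exist yet.
        return []
    # rglob walks the tree recursively; sorting keeps the output stable.
    return sorted(
        path.parent.relative_to(root).as_posix()
        for path in root.rglob("qna.yaml")
    )


# Assumes the ./taxonomy directory used in this notebook.
for leaf in list_qna_files("./taxonomy"):
    print(leaf)
```

Each printed line is one leaf of the taxonomy, so the output mirrors the branches shown in the diagram.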

## Add a new skill

The taxonomy approach works by providing a file named `qna.yaml` that contains a sample data set of questions and answers. This data set seeds the creation of many more synthetic examples. The `qna.yaml` file must follow a specific schema so that InstructLab can use it to generate those examples.

Instead of typing this information in by hand, run the following commands to create the skill directory and copy the `qna.yaml` file, along with its `attribution.txt`, into your taxonomy directory.


In [None]:
!mkdir -p ./taxonomy/compositional_skills/extraction/inference/quantitative/asciidoc/tables

In [None]:
!cp ./assets/qna.yaml ./taxonomy/compositional_skills/extraction/inference/quantitative/asciidoc/tables/qna.yaml 
!head ./taxonomy/compositional_skills/extraction/inference/quantitative/asciidoc/tables/qna.yaml

In [None]:
!cp ./assets/attribution.txt ./taxonomy/compositional_skills/extraction/inference/quantitative/asciidoc/tables/attribution.txt 
!head ./taxonomy/compositional_skills/extraction/inference/quantitative/asciidoc/tables/attribution.txt

In [None]:
!head ~/instructlab/taxonomy/compositional_skills/extraction/inference/quantitative/asciidoc/tables/qna.yaml
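
To make the idea of "following a specific schema" concrete, here is a rough sketch of a structural check on an already-parsed `qna.yaml`. The field names (`created_by`, `task_description`, `seed_examples`, `question`, `answer`) are assumptions made for illustration and should be checked against the file you just copied; the function `find_schema_problems` and the sample content are hypothetical and are not InstructLab's own validation.

```python
# Field names are assumptions for illustration; compare them with the copied qna.yaml.
REQUIRED_TOP_LEVEL = ("created_by", "task_description", "seed_examples")
REQUIRED_PER_EXAMPLE = ("question", "answer")


def find_schema_problems(doc):
    """Return a list of human-readable problems found in a parsed qna.yaml mapping."""
    if not isinstance(doc, dict):
        return ["top level must be a mapping"]
    problems = [
        f"missing top-level key: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in doc
    ]
    examples = doc.get("seed_examples", [])
    if not isinstance(examples, list):
        return problems + ["seed_examples must be a list"]
    for index, example in enumerate(examples):
        if not isinstance(example, dict):
            problems.append(f"seed_examples[{index}] must be a mapping")
            continue
        for key in REQUIRED_PER_EXAMPLE:
            value = example.get(key)
            # A usable entry needs a non-empty string for each required field.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"seed_examples[{index}] has no {key}")
    return problems


# A hand-written stand-in for a parsed file; the content is made up.
sample = {
    "created_by": "workshop-user",
    "task_description": "Answer questions about a table.",
    "seed_examples": [
        {"question": "How many rows are there?", "answer": "Three."},
        {"question": "Which column is widest?"},  # answer deliberately missing
    ],
}
print(find_schema_problems(sample))
```

In the notebook, the mapping could come from applying PyYAML's `yaml.safe_load` to the copied file. A check like this only illustrates the kind of structure a schema enforces; the `ilab diff` step in the next section is the actual validation.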


## Verification

InstructLab allows you to validate your taxonomy files before generating additional data. You can accomplish this by using the `ilab diff` command as shown below:

In [None]:
!ilab diff

## Generate synthetic data

Okay, so far so good. Now, let’s move on to the AWESOME part. We are going to use our taxonomy, which contains our qna.yaml file, to have the LLM automatically generate more examples. The generate step can take a while, and its duration depends on the number of instructions you ask for: InstructLab generates that many additional questions and answers based on the samples provided. To give you an idea, generating 100 additional questions and answers typically takes about 7 minutes on a well-specced, consumer-grade, GPU-accelerated Linux machine and around 20 minutes on Apple Silicon, though the exact time depends on many factors. For the purpose of this workshop, we are only going to generate 5 additional samples. To do this, first review the available options, then run the generate command:

In [None]:
!ilab generate --help

In [None]:
!ilab generate --model granite-7b-lab-Q4_K_M --num-instructions 5 
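
The timing figures quoted above can be turned into a back-of-the-envelope estimate for other sample counts. This sketch assumes generation time scales linearly with the number of instructions, which is a simplification: it ignores fixed costs such as model loading and the "many factors" mentioned earlier. The function name `estimate_minutes` and the platform labels are hypothetical.

```python
# Reference timings quoted in the text: approximate minutes per 100 generated samples.
MINUTES_PER_100 = {"gpu_linux": 7.0, "apple_silicon": 20.0}


def estimate_minutes(num_instructions, platform):
    """Linearly scale the quoted per-100 timing; a rough guess, not a benchmark."""
    if num_instructions < 0:
        raise ValueError("num_instructions must be non-negative")
    return MINUTES_PER_100[platform] * num_instructions / 100


# Estimate the workshop run of 5 instructions on each platform.
for platform in MINUTES_PER_100:
    print(platform, estimate_minutes(5, platform))
```

Under that linear assumption, the 5-sample run in this workshop would take well under a minute on the GPU machine and about a minute on Apple Silicon; real runs will likely take longer because of startup overhead.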