# Introducing txtinstruct

[txtinstruct](https://github.com/neuml/txtinstruct) is a framework for training instruction-tuned models.

The objective of this project is to support open data, open models and integration with your own data. One of the biggest problems today is the lack of licensing clarity around instruction-following datasets and large language models. txtinstruct makes it easy to build your own instruction-following datasets and use those datasets to train instruction-tuned models.

This notebook gives a brief introduction to txtinstruct.



# Install dependencies

Install `txtinstruct` and all dependencies.

In [2]:
%%capture
!pip install git+https://github.com/neuml/txtai git+https://github.com/neuml/txtinstruct

# Architecture Overview

txtinstruct consists of three components to help train instruction-following models.

The first component is statement generation. Statement generation models create a statement from a context. Depending on the model, this statement can be a question or a request to describe a concept.

The next component is a knowledge source for pulling context. An example knowledge source used in this notebook is a [txtai embeddings index of the full Wikipedia dataset](https://huggingface.co/NeuML/txtai-wikipedia). 

The last component is a large language model (LLM) for translating source statements into target statements. If the statement is a question, the LLM answers it. If it's a descriptive statement, the LLM builds a description. In both cases, a prompt is used in combination with the knowledge source context to generate the target text.
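The flow between the three components can be sketched with plain Python stand-ins. This is only an illustration of the data flow described above, not txtinstruct code: `knowledge_source`, `generate_statement`, `build_prompt` and `teacher` are hypothetical names, and each one is a trivial stub in place of a real index or model.

```python
# Hypothetical stand-ins for the three components. None of these names come from txtinstruct.

def knowledge_source():
    """Knowledge source stand-in: yields context passages."""
    yield "txtai is an open-source platform for semantic search and workflows powered by language models."

def generate_statement(context):
    """Statement generation stand-in: derives a question from a context."""
    subject = context.split(" is ", 1)[0]
    return f"What is {subject}?"

def build_prompt(source, context):
    """Combines the source statement with the knowledge source context."""
    return f"Question: {source}\nContext: {context}"

def teacher(prompt):
    """LLM stand-in: a real model would generate text, this stub echoes the context."""
    return prompt.split("Context: ", 1)[1]

def build_pairs():
    """Runs every context through statement generation and the teacher stub."""
    pairs = []
    for context in knowledge_source():
        source = generate_statement(context)
        target = teacher(build_prompt(source, context))
        pairs.append({"source": source, "target": target})
    return pairs

pairs = build_pairs()
print(pairs[0]["source"])
print(pairs[0]["target"])
```

Each resulting pair holds a source statement and the target text produced for it, which is the shape of data the rest of this notebook works with.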

# Statement Generation Model

Let's show an example of how to use txtinstruct to build a statement generation model. This example builds a question generation model using the [SQuAD dataset](https://huggingface.co/datasets/squad).

In [2]:
from datasets import load_dataset

from txtinstruct.models import StatementGenerator

# Load SQuAD dataset
dataset = load_dataset("squad", split="train")

# Train model
generator = StatementGenerator()
model, tokenizer = generator(
    "google/flan-t5-small",
    dataset,
    "sequence-sequence",
    learning_rate=1e-3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=128 // 16,
    num_train_epochs=0.1,
    logging_steps=100,
)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Note that we only trained the model for a fraction of an epoch for expediency. Under normal circumstances, `num_train_epochs` would be at least 3.
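The `gradient_accumulation_steps=128 // 16` setting is worth a quick look. A minimal sketch of the arithmetic, assuming a single training device (with several devices the result would be multiplied by the device count); `effective_batch_size` is a hypothetical helper, not part of txtinstruct or txtai:

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, devices=1):
    """Number of samples that contribute to each optimizer update."""
    return per_device_train_batch_size * gradient_accumulation_steps * devices

# Settings used for the statement generation model above
print(effective_batch_size(16, 128 // 16))

# Settings used for the instruction-tuning call later in this notebook
print(effective_batch_size(8, 128 // 8))
```

Both configurations target the same effective batch size of 128. The smaller per-device batch size trades more accumulation steps for lower memory use.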

If you've trained models either with txtai or Hugging Face's trainer, you'll recognize many of the options. [See this page](https://neuml.github.io/txtai/pipeline/train/trainer/) to learn more about the configuration options available.

In [3]:
from txtai.pipeline import Sequences

# Load statement generation model
statements = Sequences((model, tokenizer))

# Run example prompt
statements("""Generate a question using the context below.
### Context:
txtai is an open-source platform for semantic search and workflows powered by language models.""")

'What is the name of the open-source platform for semantic search and workflows?'

The model generates the question above from the given context. Next we'll discuss how this helps build an instruction-tuning dataset.

# Build a dataset for Instruction-Tuning

Now that we have a statement generation model, let's build an instruction-tuning dataset. 

We'll use the `txtai wikipedia embeddings index` as the knowledge source and `google/flan-t5-base` as our teacher model.

In [None]:
from txtai.embeddings import Embeddings
from txtinstruct.data import DatasetBuilder

# Load embeddings
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Query templates
templates = [
    "Tell me about {text}",
    "Give an explanation on {text}",
    "Provide a quick summary on {text}",
    "Explain {text} in simple terms",
    "Describe {text}"
]

# Build dataset
builder = DatasetBuilder(Sequences("google/flan-t5-base"), statements, templates)
builder(
    embeddings.search("SELECT id, text FROM txtai WHERE similar('machine learning') AND percentile >= 0.99 LIMIT 5"),
    5,
    "data.json"
)
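How `DatasetBuilder` applies the query templates internally isn't shown in this notebook. The sketch below only assumes that the `{text}` placeholder is filled with Python's `str.format`, and uses a made-up topic string to show what the resulting statements would look like. `fill` is a hypothetical helper.

```python
# Same query templates as in the cell above
templates = [
    "Tell me about {text}",
    "Give an explanation on {text}",
    "Provide a quick summary on {text}",
    "Explain {text} in simple terms",
    "Describe {text}"
]

def fill(templates, text):
    """Fills the {text} placeholder in each template (assumed behavior)."""
    return [template.format(text=text) for template in templates]

# "machine learning" is only an example topic
queries = fill(templates, "machine learning")
for query in queries:
    print(query)
```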

In [7]:
!cat data.json

[
    {
        "context": "Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, agriculture, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks. \nA subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers, but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine 

Let's take a look at the generated data. The dataset consists of a context and associated list of statements. Each statement is a source-target pair.

Note that the dataset also includes unanswerable questions. It's important for the model to decline to answer when the context has no answer. Producing an answer anyway is often called a "model hallucination".
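To make that structure concrete, here is a small sketch that flattens rows into source-target pairs and picks out the unanswerable ones. The sample row is made up for illustration. The `statements`, `source` and `target` key names and the exact fallback string are assumptions based on the description above and the prompt used later in this notebook, so check them against your own `data.json`.

```python
# Assumed fallback text for unanswerable statements
FALLBACK = "I don't have data on that"

# Made-up sample row in the assumed data.json layout
sample = [
    {
        "context": "txtai is an open-source platform for semantic search and workflows powered by language models.",
        "statements": [
            {"source": "What is txtai?", "target": "An open-source platform for semantic search and workflows."},
            {"source": "What is the weather in Phoenix today?", "target": FALLBACK}
        ]
    }
]

def flatten(data):
    """Yields (context, source, target) for every statement in every row."""
    for row in data:
        for statement in row["statements"]:
            yield row["context"], statement["source"], statement["target"]

pairs = list(flatten(sample))

# Statements the teacher model could not answer from the context
unanswerable = [source for _, source, target in pairs if target == FALLBACK]

print(len(pairs), "pairs,", len(unanswerable), "unanswerable")
```

Swapping `sample` for the list loaded from `data.json` would give a quick count of how many unanswerable examples the training set contains, provided the assumed key names match.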

# Train an instruction-tuned model

Now for the part we've been waiting for: instruction-tuning a model.

In [None]:
import json

from txtinstruct.models import Instructor

# Read in generated dataset
with open("data.json", encoding="utf-8") as f:
    data = json.load(f)

# Instruction-tune model
instructor = Instructor()
model, tokenizer = instructor(
    "google/flan-t5-small", 
    data,
    "sequence-sequence",
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=128 // 8,
    num_train_epochs=3,
    logging_steps=100,
)

Before testing the instruction-tuned model, let's run a baseline test to see how `google/flan-t5-small` behaves without any fine-tuning.

This next section runs a prompt with a question and context.

In [10]:
from txtai.pipeline import Extractor

def prompt(text):
    template = "Answer the following question using only the context below. Give a detailed answer. "
    template += "Say 'I don't have data on that' when the question can't be answered.\n"
    template += f"Question: {text}\n"
    template += "Context: "

    return template

extractor = Extractor(
    embeddings,
    Sequences("google/flan-t5-small")
)

extractor([{
    "query": "Tell me about Linux",
    "question": prompt("Tell me about Linux")
}])

[{'answer': 'Linux'}]

Not a very good answer given the question. Let's try another question that's unanswerable.

In [11]:
extractor([{
    "query": "What is the weather in Phoenix today?",
    "question": prompt("What is the weather in Phoenix today?")
}])

[{'answer': '0.00%'}]

See how the model still tries to give an answer even though the context doesn't contain one.

Now let's try the same two questions with our instruction-tuned model. 

In [12]:
extractor = Extractor(
    embeddings,
    Sequences((model, tokenizer))
)

extractor([{
    "query": "Tell me about Linux",
    "question": prompt("Tell me about Linux")
}])

[{'answer': 'Linux (or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, which includes the kernel and supporting system software and libraries, many of which are provided by the GNU Project.'}]

In [13]:
extractor([{
    "query": "What is the weather in Phoenix today?",
    "question": prompt("What is the weather in Phoenix today?")
}])

[{'answer': "I don't have data on that"}]

Much better indeed. Keep in mind this was only trained with ~15 samples using a relatively small teacher model (`google/flan-t5-base`) for demonstration purposes.

# Wrapping up

This notebook introduced txtinstruct, a framework for training instruction-tuned models. This project strives to be an easy-to-use way to build your own instruction-following models with licensing clarity. Stay tuned for more!