# Introducing Genstruct
Generating high-quality synthetic instruction data is an important challenge. Standard approaches rely heavily on in-context learning and prompting of large language models to generate instruction pairs. These approaches are limited in quality and diversity, and they lack explicit reasoning.

Two previous methods aimed to improve upon this naive prompting approach:
- Retrieval-augmented generation (RAG) pipelines convert passages from sources like Wikipedia into instructional pairs.
- [Ada-Instruct](https://arxiv.org/abs/2310.04484) trains a custom model to generate instructions instead of relying on prompting. This improves quality and diversity compared to prompting alone. Further, the authors of the Ada-Instruct paper found that training could be performed with as few as 10 examples.

Genstruct is a new method that combines and extends these previous approaches. Like Ada-Instruct, it is a custom-trained model rather than a prompting recipe. However, Ada-Instruct relies heavily on ungrounded generation, which can lead to hallucinations. To mitigate this, Genstruct generates instructions based on a user-provided context, as RAG methods do.

Additionally, Genstruct goes beyond prior work by focusing on the generation of complex questions and multi-step reasoning for each generated instruction pair, rather than just direct questions and responses.

In [2]:
!pip install accelerate bitsandbytes


In [4]:
!pip install transformers


## Generating instruction pairs
Genstruct is based on Mistral. Specifically, it is trained on top of the [MetaMath-Mistral-7B](https://huggingface.co/meta-math/MetaMath-Mistral-7B) model, in order to improve reasoning on math-heavy topics.

Like any other Mistral model, it can be loaded from the Hugging Face Hub as follows:

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = 'NousResearch/Genstruct-7B'

# `quantization_config` replaces the deprecated `load_in_8bit` keyword argument.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map='cuda', quantization_config=quant_config)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [8]:
!pip install git+https://github.com/yuchenlin/LLM-Blender.git -q



In [10]:
!pip install sentencepiece





Genstruct works by generating instructions and answers from a user-provided context and title. It utilizes a custom prompt format, as in the following example:
```
[[[Title]]] p-value
[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]]
```

The model then completes from `[[[User]]]`, generating an instruction and a response.
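As a rough illustration of the layout above, the prompt can also be assembled by hand. This is a minimal sketch rather than the tokenizer's actual template: the helper name `build_genstruct_prompt` is hypothetical, and the exact whitespace (including the trailing space after `[[[User]]]`) is an assumption based on the prompts printed in this notebook. The tokenizer's chat template remains the authoritative source.

```python
# Hypothetical helper: reproduces the prompt layout shown above by hand.
# The whitespace details are an assumption inferred from printed prompts.
INTERACTION_LINE = (
    "The following is an interaction between a user and an AI assistant "
    "that is related to the above text."
)


def build_genstruct_prompt(title: str, content: str) -> str:
    """Return a Genstruct-style prompt ending at the [[[User]]] marker."""
    return (
        f"[[[Title]]] {title.strip()}\n"
        f"[[[Content]]] {content.strip()}\n\n"
        f"{INTERACTION_LINE}\n\n"
        "[[[User]]] "
    )


if __name__ == "__main__":
    print(build_genstruct_prompt("p-value", "The p-value is used in hypothesis testing."))
```

Because the model continues from the final marker, everything it generates after `[[[User]]]` is the instruction, followed by the response.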


To simplify its use, the Genstruct tokenizer includes a 'chat template'. It accepts a list containing a single dict with the keys 'title' and 'content', which hold the title and content of the context to generate from:

In [2]:
msg = [{
    'title': 'p-value',
    # The backslash is escaped so that Python does not treat `\d` as an (invalid) escape sequence.
    'content': "The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis."
}]
inputs = tokenizer.apply_chat_template(msg, return_tensors='pt').cuda()

Generation can then be performed with `model.generate()`, as follows (or with vLLM or whatever other pipeline you prefer):

In [3]:
gen = tokenizer.decode(model.generate(inputs, max_new_tokens=512)[0]).split(tokenizer.eos_token)[0]
print(gen)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[[[Title]]] p-value
[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]]  The share prices of two rival companies, A and B, have been monitored for many years, allowing a large number of data points for robust analysis. Among the many statistics that can be calculated, the p-value is of primary interest.
Which company, A or B, is the one whose share price statistics wouldn't significantly reject
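The decoded text contains the prompt followed by the generated instruction, and other outputs in this notebook show the response introduced by an `[[[Assistant]]]` marker. A minimal sketch of splitting such text into an instruction/response pair might look like the following; the function name `split_generation` is hypothetical, and the sketch assumes the markers appear exactly as printed here.

```python
from typing import Optional, Tuple

USER_TAG = "[[[User]]]"
ASSISTANT_TAG = "[[[Assistant]]]"


def split_generation(generated: str) -> Tuple[str, Optional[str]]:
    """Split decoded text into (instruction, response).

    The response is None when no [[[Assistant]]] marker is present,
    for example when generation stopped before the answer began.
    """
    _, user_sep, tail = generated.partition(USER_TAG)
    if not user_sep:
        raise ValueError("no [[[User]]] marker found in generated text")
    instruction, assistant_sep, response = tail.partition(ASSISTANT_TAG)
    return instruction.strip(), (response.strip() if assistant_sep else None)


if __name__ == "__main__":
    demo = "[[[Title]]] t\n[[[Content]]] c\n\n[[[User]]]  What is X?\n[[[Assistant]]] X is Y."
    print(split_generation(demo))
```

Returning `None` for a missing response makes truncated generations easy to filter out before they reach a dataset.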

In [3]:
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")



Successfully loaded ranker from  /teamspace/studios/this_studio/.cache/huggingface/hub/llm-blender/PairRM


In [5]:
msg = [{
    'title': 'Infinite Craft',
    'content': '''Infinite Craft is a 2024 sandbox[1] browser game developed by Neal Agarwal, in which the player starts with four elements—earth, wind, fire, and water—and combines them into people, astrological beings, and fictional characters. AI software, including LLaMA and Together AI, is used to produce new elements.'''
}]
inputs = [tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) ]
outputs = blender.best_of_n_generate(model, tokenizer, inputs, n=4)

print("### Prompt:\n", inputs[0])
print("### best-of-n generations:\n", outputs[0])
# --> The output will be much more stable and consistently better than single sampling, for example:


Sampling generations:   0%|          | 0/1 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Sampling generations: 100%|██████████| 1/1 [01:09<00:00, 69.80s/it]
Ranking candidates: 100%|██████████| 1/1 [00:00<00:00,  4.37it/s]

### Prompt:
 [[[Title]]] Infinite Craft
[[[Content]]] Infinite Craft is a 2024 sandbox[1] browser game developed by Neal Agarwal, in which the player starts with four elements—earth, wind, fire, and water—and combines them into people, astrological beings, and fictional characters. AI software, including LLaMA and Together AI, is used to produce new elements.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]] 
### best-of-n generations:
 John is playing Infinite Craft. He starts with the four elements earth, wind, fire, and water. He combines earth and wind to see what happens. Then he combines the result with fire. Then he combines the result of that with water. He is having fun mixing these elements together to see what happens.
Will combining earth and wind create an element with more or less depth than earth and wind individually?
[[[Assistant]]] In Infinite Craft, John starts with four basic elements: earth, wind, fir
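Conceptually, best-of-n sampling draws several candidate generations and keeps the one a ranker prefers. The sketch below illustrates that idea with a pairwise comparison, since PairRM is a pairwise ranker. It is not LLM-Blender's implementation: `sample` and `prefer` are hypothetical callables standing in for the generator and the ranker.

```python
from typing import Callable, List


def best_of_n(
    sample: Callable[[], str],
    prefer: Callable[[str, str], bool],
    n: int = 4,
) -> str:
    """Draw n candidates and return the one that wins the most pairwise comparisons.

    `prefer(a, b)` should return True when candidate `a` is judged better than `b`.
    Ties are resolved in favour of the earliest candidate.
    """
    candidates: List[str] = [sample() for _ in range(n)]
    wins = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and prefer(candidates[i], candidates[j]):
                wins[i] += 1
    best_index = max(range(n), key=wins.__getitem__)
    return candidates[best_index]


if __name__ == "__main__":
    # Toy stand-ins: a fixed pool of "generations" and a ranker that prefers longer text.
    pool = iter(["a", "abc", "ab", "abcd"])
    print(best_of_n(lambda: next(pool), lambda x, y: len(x) > len(y), n=4))
```

Comparing every pair costs O(n²) ranker calls, which is acceptable for small `n` such as the values used in this notebook.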




In [6]:
text ='''India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[22] is a country in South Asia. It is the seventh-largest country by area; the most populous country as of June 2023;[23][24] and from the time of its independence in 1947, the world's most populous democracy.[25][26][27] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[j] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar[k] to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.'''

In [7]:
msg = [{
    'title': 'India',
    'content': text
}]
inputs = [tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) ]
outputs = blender.best_of_n_generate(model, tokenizer, inputs, n=10)

print("### Prompt:\n", inputs[0])
print("### best-of-n generations:\n", outputs[0])
# --> The output will be much more stable and consistently better than single sampling, for example:


Sampling generations:   0%|          | 0/1 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Sampling generations: 100%|██████████| 1/1 [02:14<00:00, 134.61s/it]
Ranking candidates: 100%|██████████| 1/1 [00:00<00:00,  1.14it/s]

### Prompt:
 [[[Title]]] India
[[[Content]]] India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[22] is a country in South Asia. It is the seventh-largest country by area; the most populous country as of June 2023;[23][24] and from the time of its independence in 1947, the world's most populous democracy.[25][26][27] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[j] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar[k] to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]] 
### best-of-n generations:
 India is a very populous country. In fact, it is the most populous country in the world as of June 20




In [8]:
text = '''
[[[Title]]] India
[[[Content]]]India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[22] is a country in South Asia. It is the seventh-largest country by area; the most populous country as of June 2023;[23][24] and from the time of its independence in 1947, the world's most populous democracy.[25][26][27] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[j] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar[k] to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.

Use only the context and no outside information for answering.
'''

In [9]:
msg = [{
    'title': 'India',
    'content': text
}]
inputs = [tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) ]
outputs = blender.best_of_n_generate(model, tokenizer, inputs, n=3)

print("### Prompt:\n", inputs[0])
print("### best-of-n generations:\n", outputs[0])
# --> The output will be much more stable and consistently better than single sampling, for example:


Sampling generations:   0%|          | 0/1 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Sampling generations: 100%|██████████| 1/1 [01:00<00:00, 60.80s/it]
Ranking candidates: 100%|██████████| 1/1 [00:00<00:00,  5.68it/s]

### Prompt:
 [[[Title]]] India
[[[Content]]] [[[Title]]] India
[[[Content]]]India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[22] is a country in South Asia. It is the seventh-largest country by area; the most populous country as of June 2023;[23][24] and from the time of its independence in 1947, the world's most populous democracy.[25][26][27] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[j] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar[k] to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.

Use only the context and no outside information for answering.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]] 
### best-of-n generations:
 India 
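In the prompt printed above, the `[[[Title]]]` and `[[[Content]]]` markers appear twice: the `text` string already contains them, and the chat template adds them again. This may be related to the degenerate ` India` output, although the notebook does not confirm the cause. A small guard along the following lines could catch the duplication before generation; `find_reserved_tags` and `make_message` are hypothetical helpers, not part of the Genstruct tokenizer.

```python
from typing import Dict, List

# Markers that appear in the prompts printed in this notebook.
RESERVED_TAGS = ("[[[Title]]]", "[[[Content]]]", "[[[User]]]", "[[[Assistant]]]")


def find_reserved_tags(content: str) -> List[str]:
    """Return the prompt markers that occur inside a raw content string."""
    return [tag for tag in RESERVED_TAGS if tag in content]


def make_message(title: str, content: str) -> List[Dict[str, str]]:
    """Build the single-dict message list, rejecting content that embeds markers."""
    found = find_reserved_tags(content)
    if found:
        raise ValueError(f"content already contains prompt markers: {found}")
    return [{"title": title, "content": content.strip()}]


if __name__ == "__main__":
    print(make_message("India", "India is a country in South Asia."))
```

Passing only the raw passage as `content` leaves all marker handling to the chat template.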




In [11]:
msg = [{
    'title': 'India',
    'content': text
}]
inputs = [tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) ]*4
outputs = blender.best_of_n_generate(model, tokenizer, inputs, n=3)

print("### Prompt:\n", inputs[0])
print("### best-of-n generations:\n", outputs[0])
# --> The output will be much more stable and consistently better than single sampling, for example:


Sampling generations:   0%|          | 0/1 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Sampling generations: 100%|██████████| 1/1 [01:57<00:00, 117.01s/it]
Ranking candidates: 100%|██████████| 1/1 [00:00<00:00,  1.00it/s]

### Prompt:
 [[[Title]]] India
[[[Content]]] [[[Title]]] India
[[[Content]]]India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[22] is a country in South Asia. It is the seventh-largest country by area; the most populous country as of June 2023;[23][24] and from the time of its independence in 1947, the world's most populous democracy.[25][26][27] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[j] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar[k] to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.

Use only the context and no outside information for answering.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]] 
### best-of-n generations:
 India 




In [12]:
print("### Prompt:\n", inputs[1])
print("### best-of-n generations:\n", outputs[1])

### Prompt:
 [[[Title]]] India
[[[Content]]] [[[Title]]] India
[[[Content]]]India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[22] is a country in South Asia. It is the seventh-largest country by area; the most populous country as of June 2023;[23][24] and from the time of its independence in 1947, the world's most populous democracy.[25][26][27] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[j] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar[k] to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.

Use only the context and no outside information for answering.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]] 
### best-of-n generations:
 India 

In [13]:
print("### Prompt:\n", inputs[2])
print("### best-of-n generations:\n", outputs[2])

### Prompt:
 [[[Title]]] India
[[[Content]]] [[[Title]]] India
[[[Content]]]India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[22] is a country in South Asia. It is the seventh-largest country by area; the most populous country as of June 2023;[23][24] and from the time of its independence in 1947, the world's most populous democracy.[25][26][27] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[j] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar[k] to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.

Use only the context and no outside information for answering.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]] 
### best-of-n generations:
 India 