# Orca 2 on Colab
<a target="_blank" href="https://colab.research.google.com/github/antonio-f/Orca2/blob/main/Orca2.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

The following contents are loosely based on the article "Orca 2: Teaching Small Language Models How to Reason" by A. Mitra et al. ([link](https://arxiv.org/abs/2311.11045)).

**Orca 2** is a language model from Microsoft Research. It builds on the previous Orca model and aims to explore the capabilities of smaller language models (LMs) through improved training signals and methods. There are two versions: 7 billion and 13 billion parameters. The goal is to demonstrate that smaller LMs, typically with around 10 billion parameters or less, can achieve enhanced reasoning abilities comparable to much larger models. Orca 2 outperforms models of similar size, including the original Orca, and exhibits performance levels similar to or better than models 5-10 times larger. The evaluation is based on complex tasks testing advanced reasoning abilities in zero-shot settings. The models are fine-tuned on high-quality synthetic data, and the weights of Orca 2 are made publicly available to encourage further research on the development, evaluation, and alignment of smaller LMs.

Orca 2 is designed with the understanding that diverse tasks require different solution strategies. Unlike a one-size-fits-all approach, it recognizes that while larger models like GPT-4 may excel at direct answers for complex tasks, smaller models can benefit from breaking tasks into steps. 

In Orca 2, as in its predecessor Orca 1, more capable large language models (LLMs) are used as teachers to demonstrate various reasoning strategies across different tasks. In Orca 2, however, these strategies are tailored to the task at hand. To generate nuanced data, the teacher LLM is given intricate prompts designed to elicit specific strategic behaviors, leading to more accurate results. During training, the smaller model sees only the task and the resulting behavior, not the original prompts that triggered it, which is intended to improve the model's ability to generalize and adapt its reasoning.

### From Instruction Tuning to Explanation Tuning

Instruction tuning involves learning from input-output pairs where the input is a natural language task description and the output is a demonstration of the desired behavior. A weakness of instruction tuning is that a student model may generate outputs that are stylistically correct but ultimately incorrect. To address this issue, the authors introduced Explanation Tuning in Orca 1. This approach trains student models on richer and more expressive reasoning signals obtained through `system instructions`. These instructions are crafted to elicit detailed explanations from a teacher model as it reasons through a task.

Explanation tuning starts by creating a collection of N manually crafted `system instructions` aimed at prompting careful reasoning. These instructions, such as "think step-by-step" and "generate detailed answers," are designed to elicit thoughtful responses from advanced language models like GPT-4, emphasizing "Slow Thinking." The `system instructions` are combined with user prompts spanning a wide range of tasks, forming a dataset of triplets: `system instruction`, `user prompt`, and the corresponding `LLM answer`. The student model is then trained to predict the LLM answer given the system instruction and the user prompt.
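
The triplet structure described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: `Triplet`, `build_triplets`, and `fake_teacher` are hypothetical names, the teacher is a stub standing in for a call to a model such as GPT-4, and pairing every instruction with every prompt is a simplifying assumption.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Triplet:
    """One explanation-tuning example (hypothetical container)."""
    system_instruction: str
    user_prompt: str
    llm_answer: str


def build_triplets(system_instructions, user_prompts, teacher):
    """Pair instructions with prompts and record the teacher's answer for each pair."""
    return [
        Triplet(instruction, prompt, teacher(instruction, prompt))
        for instruction, prompt in product(system_instructions, user_prompts)
    ]


def fake_teacher(system_instruction, user_prompt):
    # Stub: a real pipeline would query a teacher LLM here.
    return f"[teacher answer to {user_prompt!r} under {system_instruction!r}]"


system_instructions = ["Think step-by-step.", "Generate detailed answers."]
user_prompts = ["What is 17 * 3?", "Summarize the water cycle."]
triplets = build_triplets(system_instructions, user_prompts, fake_teacher)

for triplet in triplets:
    # The student's input is (system_instruction, user_prompt); its target is llm_answer.
    print(triplet.system_instruction, "|", triplet.user_prompt, "->", triplet.llm_answer)
```

In this sketch the student would be trained on the first two fields as input and the third as the target, mirroring the description above.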


### Teaching Orca 2 to be a "Cautious Reasoner"

An example of `system instructions` is the following:

```
You will be given a task. Use the following steps to solve it.
1. Identify the main theme or topic of the story.
2. Look for any cause and effect relationships between the sentences.
3. Find the sentence that could be the start of the story. Go through each of the answer
choices and analyze to figure it out.
4. Rearrange the sentences in the correct order based on the information gathered in
the previous steps.
5. Final answer: Write down the correct order of the sentences using their numbers,
such as ‘23415’.
```

These instructions are thus small aids that guide a step-by-step procedure, in the hope of steering the model towards a correct answer. The idea is to break a big problem down into multiple subproblems.

Even when several different strategies all lead to a correct answer, the question remains: which answer is best for training the smaller model? This is a key issue, and the authors argue that smaller models should be taught to select the most effective solution strategy for the problem at hand.
For example, GPT-4, being a larger model, can effortlessly produce a direct response, whereas a smaller model might lack this capability and need an alternative approach, such as a step-by-step thought process. Consequently, instructing a smaller model to simply "mimic" the reasoning behavior of a more powerful counterpart may not be the optimal approach.

The term **Cautious Reasoning** is used to describe the process of determining the appropriate solution strategy for a given task. This choice is made among options such as direct answer generation or various 'Slow Thinking' strategies (such as step-by-step, guess and check, explain-then-answer, etc.).

The following illustrates the process of training a Cautious Reasoning LLM:
1. Start with a collection of diverse tasks
2. Guided by the performance of Orca, decide which tasks require which solution
strategy (e.g. direct-answer, step-by-step, explain-then-answer, etc.)
3. Write task-specific `system instruction(s)` corresponding to the chosen strategy
in order to obtain teacher responses for each task.
4. **Prompt Erasing**: at training time, replace the student’s `system instruction`
with a generic one stripped of details about how to approach the task.
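
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the strategy names come from the text, but the instruction wording, `GENERIC_SYSTEM_INSTRUCTION`, `make_teacher_example`, `erase_prompt`, and the stub teacher are all hypothetical.

```python
# Hypothetical task-specific instructions, one per solution strategy.
STRATEGY_INSTRUCTIONS = {
    "direct-answer": "Answer directly and concisely.",
    "step-by-step": "Think step-by-step, then give the final answer.",
    "explain-then-answer": "Explain your reasoning first, then state the answer.",
}

# Hypothetical generic instruction shown to the student at training time.
GENERIC_SYSTEM_INSTRUCTION = "You are a cautious assistant. You carefully follow instructions."


def make_teacher_example(task, strategy, teacher):
    """Step 3: query the teacher with the instruction matching the chosen strategy."""
    instruction = STRATEGY_INSTRUCTIONS[strategy]
    return {"system": instruction, "user": task, "answer": teacher(instruction, task)}


def erase_prompt(example):
    """Step 4 (Prompt Erasing): keep task and answer, swap in the generic instruction."""
    return {**example, "system": GENERIC_SYSTEM_INSTRUCTION}


def fake_teacher(instruction, task):
    # Stub: a real pipeline would call the teacher LLM here.
    return f"[answer to {task!r} produced under {instruction!r}]"


# Steps 1 and 2: a small task collection with a strategy assigned to each task.
tasks = [("What is 12 * 12?", "direct-answer"), ("Order these five sentences.", "step-by-step")]

teacher_examples = [make_teacher_example(task, strategy, fake_teacher) for task, strategy in tasks]
student_examples = [erase_prompt(example) for example in teacher_examples]

for example in student_examples:
    print(example["system"], "|", example["user"], "->", example["answer"])
```

The student examples keep the behavior elicited by the detailed instruction while no longer showing the instruction itself, so the student has to infer which strategy fits the task.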

For brevity we will not go into technical details. However, we will stop to examine an Orca 2 coding example using the appropriate model from Hugging Face.

### Example code 
The following code runs on a Colab T4 GPU.

First, we install the widely used `transformers` library from Hugging Face.

In [None]:
!pip install git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-lz527hna
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-lz527hna
  Resolved https://github.com/huggingface/transformers to commit 35551f9a0f66a22de4971b4a51b3c172d3b87f95
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.36.0.dev0-py3-none-any.whl size=8049176 sha256=f8b14d379917e1ce13a07f413ba7480b3e17934c5d5be6a38939f6478066641c
  Stored in directory: /tmp/pip-ephem-wheel-cache-kc8sjheb/wheels/c0/14/d6/6c9a5582d2ac191ec0a483be151a4495fe1eb2a6706ca49f1b
Successfully built transformers

The `accelerate` library enables the same PyTorch code to run on any distributed configuration; Transformers relies on it here for automatic device placement (`device_map`). The `-q` flag puts `pip` in quiet mode; it can be given up to three times, each occurrence suppressing messages of a higher severity level (warning, error, critical).

In [2]:
!pip install accelerate -qq


`SentencePiece` is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training ([cit.](https://github.com/google/sentencepiece/blob/master/README.md)).

In [3]:
!pip install SentencePiece -qq


Protocol buffers (`protobuf`) are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. 

In [4]:
!pip install protobuf -qq

`bitsandbytes` brings quantization to your model: with it, a PyTorch model can be loaded in 8-bit or 4-bit precision with a few lines of code.

In [5]:
!pip install bitsandbytes -qq


Import PyTorch and the Hugging Face Transformers library, then make the GPU the default PyTorch device if one is available.

In [7]:
import torch
import transformers

if torch.cuda.is_available():
    torch.set_default_device("cuda")

`transformers.AutoModelForCausalLM.from_pretrained`: this is a method from the Transformers library that loads a pre-trained model. In this case, it's loading a model with causal language modeling capabilities. The term "causal language model" typically refers to a model that generates sequences of tokens in a way that respects causality, where each token is generated based on the preceding tokens.

`microsoft/Orca-2-7b`: this is the Hugging Face Hub identifier of the model to load, namely the 7-billion-parameter version of Orca 2 published by Microsoft.

`device_map='auto'`: this parameter delegates device placement to `accelerate`, which distributes the model's weights over the available hardware, using the GPU first when one is present.

`load_in_8bit=True`: this parameter loads the model with 8-bit quantization (provided by `bitsandbytes`), which reduces the memory footprint by storing weights at lower precision. Recent Transformers releases prefer passing the same setting through a `BitsAndBytesConfig` object via the `quantization_config` argument.
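
To see why 8-bit loading helps on a T4, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It is a rough sketch under stated assumptions: a nominal 7 billion parameters, about 15 GiB of usable GPU memory, and no accounting for activations, the KV cache, or quantization overhead. The helper name `weight_memory_gib` is made up for this example.

```python
GIB = 1024 ** 3  # bytes in one GiB


def weight_memory_gib(num_parameters, bytes_per_parameter):
    """Memory needed to store the weights alone, in GiB."""
    return num_parameters * bytes_per_parameter / GIB


NUM_PARAMETERS = 7e9      # assumption: nominal parameter count of the 7B model
USABLE_GPU_GIB = 15       # assumption: approximate usable memory on a Colab T4

for dtype_name, bytes_per_parameter in [("float32", 4), ("float16", 2), ("int8", 1)]:
    needed = weight_memory_gib(NUM_PARAMETERS, bytes_per_parameter)
    headroom = USABLE_GPU_GIB - needed
    print(f"{dtype_name:>8}: {needed:5.1f} GiB for weights, {headroom:6.1f} GiB left")
```

Under these assumptions, 32-bit weights do not fit at all, 16-bit weights leave little room for anything else, and 8-bit weights leave a comfortable margin for generation.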

In [None]:
model = transformers.AutoModelForCausalLM.from_pretrained(
    "microsoft/Orca-2-7b",
    device_map='auto',
    load_in_8bit=True)

Setting the tokenizer: the tokenizer is loaded from the same identifier, `microsoft/Orca-2-7b`. The `use_fast=False` parameter selects the slow tokenizer implementation because, for this model, the fast and slow tokenizers produce different tokens.

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "microsoft/Orca-2-7b",
    use_fast=False,
)

`system message` is used interchangeably with `system instruction` introduced before.

The system and user messages are combined into a single prompt, which is then passed to `tokenizer`. The `return_tensors='pt'` parameter specifies that the output should be PyTorch tensors. The resulting `inputs` variable holds the tokenized representation of the prompt.

In [None]:
system_message = "You are Orca, an AI language model created by Microsoft. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior."
user_message = "How can you determine if a restaurant is popular among locals or mainly attracts tourists, and why might this information be useful?"

prompt = f"<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant"

inputs = tokenizer(prompt, return_tensors='pt')
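
The prompt layout used in the cell above can be factored into small helpers, which makes the markup easier to reuse. This is a sketch only: `format_turn` and `build_prompt` are hypothetical names, and they simply reproduce the f-string from the cell.

```python
def format_turn(role, content):
    """Wrap one message in the <|im_start|> ... <|im_end|> markup used in the prompt."""
    return f"<|im_start|>{role}\n{content}<|im_end|>"


def build_prompt(system_message, user_message):
    """Build a single-turn prompt that ends with an open assistant turn."""
    return "\n".join([
        format_turn("system", system_message),
        format_turn("user", user_message),
        "<|im_start|>assistant",  # left open so the model writes the reply
    ])


example_prompt = build_prompt("You are a cautious assistant.", "What is the capital of Italy?")
print(example_prompt)
```

The final line of the prompt deliberately has no `<|im_end|>`: generation continues from there with the assistant's answer.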

After tokenization, the model's `generate` method produces a sequence of token IDs that starts with the prompt's `input_ids` and continues with the newly generated tokens; the result is stored in `output_ids`.

The tokenizer's `batch_decode` method converts these token IDs back into human-readable text. The `[0]` index selects the first (and here only) sequence in the batch, which is then printed. Since no length limit is passed to `generate`, the default from the generation configuration applies, which is presumably why the answer below stops mid-sentence; a larger `max_new_tokens` value can be passed to `generate` to allow longer answers.

In [10]:
output_ids = model.generate(inputs["input_ids"])
answer = tokenizer.batch_decode(output_ids)[0]

print(answer)

<s>  <|im_start|> system
You are Orca, an AI language model created by Microsoft. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior. <|im_end|> 
 <|im_start|> user
How can you determine if a restaurant is popular among locals or mainly attracts tourists, and why might this information be useful? <|im_end|> 
 <|im_start|> assistant
There are different ways to find out if a restaurant is popular among locals or mainly attracts tourists, and some possible reasons why this information might be useful are:

- If a restaurant is popular among locals, it might offer better quality, variety, or value for money than a tourist-oriented one, as it can rely on repeat customers and word-of-mouth recommendations.
- If a restaurant is mainly attracting tourists, it might have a more limited or expensive menu, or charge higher prices, as it has to cater to the expectations and budgets of visitor
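
The decoded text above echoes the whole prompt before the answer. One possible way to keep only the assistant's reply is to split the decoded string on the assistant marker. The sketch below assumes the markup appears as in the printed output (possibly with extra spaces around the markers); `extract_last_reply` is a hypothetical helper, not part of Transformers.

```python
import re

# Tolerate optional whitespace, since decoding may insert spaces around the markers.
ASSISTANT_MARKER = re.compile(r"<\|im_start\|>\s*assistant\s*")
STOP_MARKER = re.compile(r"<\|im_end\|>|</s>")


def extract_last_reply(decoded_text):
    """Return the text of the last assistant turn, or '' if there is none."""
    parts = ASSISTANT_MARKER.split(decoded_text)
    if len(parts) < 2:
        return ""
    # Keep everything after the last assistant marker, up to the first stop marker.
    reply = STOP_MARKER.split(parts[-1], maxsplit=1)[0]
    return reply.strip()


sample = (
    "<s>  <|im_start|> system\nYou are a cautious assistant. <|im_end|> \n"
    " <|im_start|> user\nName one primary color. <|im_end|> \n"
    " <|im_start|> assistant\nRed is a primary color. <|im_end|>"
)
print(extract_last_reply(sample))
```

An alternative is to slice off the prompt tokens from `output_ids` before decoding, which avoids string matching altogether.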

Below, a second-turn user message is constructed and tokenized. By default (`add_special_tokens=True`) the tokenizer adds its special tokens to the encoded text, such as the BOS (beginning-of-sequence) token, visible as `<s>` at the start of the output above, and, for some tokenizers, an EOS (end-of-sequence) token at the end. Since the second-turn text is appended to an existing conversation, `add_special_tokens` is set to `False` so that no extra BOS token is inserted between messages.

Finally, the new input is formed (`second_turn_input`), followed by a new generation and decoding cycle.

In [11]:
# This example continues showing how to add a second turn message by the user to the conversation
second_turn_user_message = "Give me a list of the key points of your first answer."

# we set add_special_tokens=False because we don't want to automatically add a bos_token between messages
second_turn_message_in_markup = f"\n<|im_start|>user\n{second_turn_user_message}<|im_end|>\n<|im_start|>assistant"
second_turn_tokens = tokenizer(second_turn_message_in_markup, return_tensors='pt', add_special_tokens=False)
second_turn_input = torch.cat([output_ids, second_turn_tokens['input_ids']], dim=1)

output_ids_2 = model.generate(second_turn_input)
second_turn_answer = tokenizer.batch_decode(output_ids_2)[0]

print(second_turn_answer)

<s>  <|im_start|> system
You are Orca, an AI language model created by Microsoft. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior. <|im_end|> 
 <|im_start|> user
How can you determine if a restaurant is popular among locals or mainly attracts tourists, and why might this information be useful? <|im_end|> 
 <|im_start|> assistant
There are different ways to find out if a restaurant is popular among locals or mainly attracts tourists, and some possible reasons why this information might be useful are:

- If a restaurant is popular among locals, it might offer better quality, variety, or value for money than a tourist-oriented one, as it can rely on repeat customers and word-of-mouth recommendations.
- If a restaurant is mainly attracting tourists, it might have a more limited or expensive menu, or charge higher prices, as it has to cater to the expectations and budgets of visitor
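
The effect of `add_special_tokens=False` on the concatenated input can be illustrated with a toy stand-in for the tokenizer. Everything here is hypothetical (`toy_encode`, the character-level ids, and `BOS_ID`); it only mimics the "prepend a BOS token by default" behavior to show what the concatenation would look like.

```python
BOS_ID = 1  # hypothetical BOS id, used only in this toy example


def toy_encode(text, add_special_tokens=True):
    """Toy tokenizer: one id per character, with an optional leading BOS id."""
    ids = [ord(ch) + 100 for ch in text]  # offset keeps text ids distinct from BOS_ID
    return ([BOS_ID] + ids) if add_special_tokens else ids


first_turn_ids = toy_encode("first turn")  # starts with BOS, like the first prompt
second_turn_text = "\nsecond turn"

# Concatenate the conversation so far with the newly encoded second turn.
with_bos = first_turn_ids + toy_encode(second_turn_text, add_special_tokens=True)
without_bos = first_turn_ids + toy_encode(second_turn_text, add_special_tokens=False)

print("BOS tokens with default settings:", with_bos.count(BOS_ID))
print("BOS tokens with add_special_tokens=False:", without_bos.count(BOS_ID))
```

With the default setting, a second BOS id ends up in the middle of the sequence, which is what the notebook avoids by disabling special tokens for the second turn.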

### Conclusions

Orca 2 models, employing diverse reasoning techniques and selecting effective solution strategies for each task, achieve performance levels comparable to, and sometimes surpassing, much larger models, particularly in zero-shot reasoning tasks. Despite inherent limitations, these models exhibit promising potential for future improvement, particularly in terms of enhanced reasoning capabilities, control, and safety through post-training with synthetic data.

The work also underscores the potential of tailored, high-quality synthetic data, generated by more powerful models through complex prompts and possibly multiple model calls, for training smaller models.

The authors believe that while frontier models will continue to showcase superior capabilities, research aimed at developing more capable smaller models will open new avenues for applications.

### Useful links

*Orca 2: Teaching Small Language Models How to Reason*
A. Mitra et al. - Microsoft Research
[arXiv:2311.11045](https://arxiv.org/abs/2311.11045) [cs.AI] (2023).

*Orca 2: Teaching Small Language Models How to Reason* -
Microsoft Research Blog ([link](https://www.microsoft.com/en-us/research/blog/orca-2-teaching-small-language-models-how-to-reason/))

*Microsoft Orca-2-7b Model* - Hugging Face [page](https://huggingface.co/microsoft/Orca-2-7b).

