# MLX LOGIC ADDED
Below is a port of the Genstruct 7B notebook from NousResearch (https://huggingface.co/NousResearch/Genstruct-7B/blob/main/notebook.ipynb).

I've ported the code to work with MLX, Apple's machine learning framework for Apple silicon, and added an end-to-end pipeline that generates the data and saves it as both JSON and a Hugging Face dataset.

Great model, and really happy with how this works. It uses a CSV file as the seed/'train' data for Genstruct, runs inference over each row, and saves the outputs as a dataset that can then be used for something like QLoRA.

Haven't gotten to the reward model part yet, so I've left that code as-is from NousResearch's HF repo. Plan to do Pairwise RM next, and then add in a QLoRA finetune section.

...also had Claude 3 generate markdown comments on what's happening in the code in the style of a nature documentary. Enjoy.

# Introducing Genstruct
Generating high-quality synthetic instruction data is an important challenge. Standard approaches rely heavily on in-context learning and prompting of large language models to generate instruction pairs. This has limitations in terms of quality, diversity, and lack of explicit reasoning.

Two previous methods aimed to improve upon this naive prompting approach:
- Retrieval-augmented generation (RAG) pipelines convert passages from sources like Wikipedia into instructional pairs.
- [Ada-Instruct](https://arxiv.org/abs/2310.04484) instead trains a custom model to generate instructions, rather than relying on prompting. This improves quality and diversity compared to prompting alone. Further, the authors of the Ada-Instruct paper found that training could be performed with as few as 10 examples.

Genstruct is a new method that combines and extends these previous approaches. Like Ada-Instruct, it is a custom-trained model rather than relying on prompting. However, Ada-Instruct relies heavily on ungrounded generation, which can lead to hallucinations. To mitigate this, Genstruct generates instructions based upon a user-provided context, like RAG methods.

Additionally, Genstruct goes beyond prior work by focusing on the generation of complex questions and multi-step reasoning for each generated instruction pair, rather than just direct questions and responses.

## Generating instruction pairs
Genstruct is based on Mistral. Specifically, it is trained on top of the [MetaMath-Mistral-7B](https://huggingface.co/meta-math/MetaMath-Mistral-7B) model, in order to improve reasoning on math-heavy topics.

Like any other Mistral model, it can be loaded from the Hugging Face Hub. This port instead loads a local MLX copy with `mlx_lm`, as in the cells further down.

Genstruct works by generating instructions and answers from a user-provided context and title. It utilizes a custom prompt format, as in the following example:
```
[[[Title]]] p-value
[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]]
```

The model then completes from `[[[User]]]`, generating an instruction and a response.
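For intuition, here is a minimal sketch of assembling that prompt by hand. `build_genstruct_prompt` is a hypothetical helper, not part of the Genstruct release, and the exact whitespace is an assumption based on the example above; the tokenizer's chat template remains the source of truth.

```python
def build_genstruct_prompt(title: str, content: str) -> str:
    """Approximate the Genstruct prompt layout (hypothetical helper).

    Assumption: whitespace mirrors the example prompt; the real chat
    template may differ slightly.
    """
    return (
        f"[[[Title]]] {title.strip()}\n"
        f"[[[Content]]] {content.strip()}\n\n"
        "The following is an interaction between a user and an AI assistant "
        "that is related to the above text.\n\n"
        "[[[User]]] "
    )


prompt = build_genstruct_prompt(
    "p-value",
    "The p-value is used in the context of null hypothesis testing.",
)
print(prompt)
```

The prompt deliberately ends at `[[[User]]]`, so the model writes both the question and the answer itself.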


To simplify its use, the Genstruct tokenizer includes a 'chat template'. It accepts a list containing a single dict, with members 'title' and 'content' - for the title and content of the context to generate from:

A quick MLX test to make sure it works:

In [None]:
from mlx_lm import load, generate

model, tokenizer = load(
    "./NousResearch_Genstruct-7B-mlx",
)

msg = [
    {
        "title": "the best title ever",
        "content": "the best message ever",
    }
]

prompt = tokenizer.decode(tokenizer.apply_chat_template(msg))

gen_text = generate(model, tokenizer, prompt, max_tokens=512, temp=0.6, verbose=True)

# Split the generated text using the EOS token and take the first part
gen_text_final = gen_text.split(tokenizer.eos_token, 1)[0]

print(gen_text_final)

# Process Genstruct Input Data and Create Seed Dataset

In the vast digital savannah, we observe a remarkable ritual unfold. A pack of alphanumeric hunters, led by the fearsome `process_dataset` function, prepare to stalk their prey - the elusive `train_dataset`.

First, the hunters must load their weapons - the mighty `model` and `tokenizer` from the `mlx_lm` tribe. With a few deft keystrokes, they summon these formidable tools:

```python
model, tokenizer = load(
    "./NousResearch_Genstruct-7B-mlx",
)
```

In [1]:
from mlx_lm import load, generate

model, tokenizer = load(
    "./NousResearch_Genstruct-7B-mlx",
)

In [None]:
from datasets import load_dataset, Dataset
import logging
import json

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load the train.csv file into a Hugging Face dataset
train_dataset = load_dataset("csv", data_files="./data/train.csv")

# Shuffle the train dataset
train_dataset = train_dataset.shuffle(seed=42)

# Save the shuffled train split to disk (save_to_disk writes Arrow files, not parquet)
train_dataset["train"].save_to_disk("train_dataset")

# Reload the shuffled train split from disk as a plain Dataset
train_dataset = Dataset.load_from_disk("train_dataset")

Now, the hunt begins in earnest. The `process_dataset` function, a veteran tracker, takes the `train_dataset` herd as its target, along with an optional `num_rows` parameter to limit the size of the cull.

For each row in the herd, the hunters meticulously extract vital information - the `message` and `sentiment` - like experts reading the scent trails of their quarry.

```python
message = row["message"]
sentiment = row["sentiment"]
```

A devious `msg` template is then crafted, laced with tantalizing metadata to lure the prey into a false sense of security.
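As a rough sketch of what the hunters are working with: the seed CSV is assumed to have `message` and `sentiment` columns, and each row becomes the single-dict list the chat template expects. The sample rows and the `build_msg` helper are hypothetical; the real `./data/train.csv` is not shown in this notebook.

```python
import csv
import io

# Hypothetical seed data standing in for ./data/train.csv
SEED_CSV = """message,sentiment
"  The new release fixed my login issue. ",positive
"Checkout keeps timing out on mobile.",negative
"""


def build_msg(row: dict) -> list:
    """Turn one CSV row into the single-dict list used as chat template input."""
    message = row["message"].strip()
    sentiment = row["sentiment"].strip()
    return [
        {
            "title": f"Conversation about X, <sentiment>{sentiment}</sentiment>",
            "content": message,
        }
    ]


rows = list(csv.DictReader(io.StringIO(SEED_CSV)))
msgs = [build_msg(row) for row in rows]
print(msgs[0])
```

Stripping whitespace before building the template keeps stray spaces in the CSV from leaking into the prompt.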

The hunters enlist the aid of the `tokenizer`, a trusty scout skilled in the art of prompts. With its guidance, they generate a scent trail that will lead their deadliest weapon - the `model` - straight to the hapless prey.

```python
prompt = tokenizer.decode(tokenizer.apply_chat_template(msg))
```

The hunt reaches its climax as the `generate` function unleashes the `model`, a ferocious beast capable of spinning coherent and contextually relevant discourse from the prompt.

```python
gen_text = generate(model, tokenizer, prompt, max_tokens=512, temp=0.6, verbose=True)
```

The `gen_text` is swiftly snared, with only the first part kept as the final, fatal blow.

```python
gen_text_final = gen_text.split(tokenizer.eos_token, 1)[0]
```
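To see what that split does in isolation, here is a small sketch using plain strings. The `"</s>"` value is an assumption standing in for `tokenizer.eos_token`, and `trim_at_eos` is a hypothetical helper name.

```python
EOS_TOKEN = "</s>"  # assumption: stand-in for tokenizer.eos_token


def trim_at_eos(gen_text: str, eos_token: str = EOS_TOKEN) -> str:
    """Keep only the text before the first EOS token, if one is present."""
    return gen_text.split(eos_token, 1)[0].strip()


raw = "[[[User]]] What is X?\n[[[Assistant]]] X is ...</s> trailing tokens</s>"
print(trim_at_eos(raw))
print(trim_at_eos("no eos here"))
```

Passing `1` as the second argument to `split` stops at the first EOS token, and text without any EOS token passes through unchanged.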

As each row falls, an `output_item` dictionary is assembled - a gruesome trophy with the original `message` as the "instruction" and the generated `gen_text_final` as the "response".

One by one, these trophies are added to the `output_data` list, a growing monument to the hunters' prowess.

When the herd has been sufficiently culled, the spoils are preserved - first as an `output_json` object serialized from the `output_data` list, and then as an `output_dataset` fashioned from the same grisly remains.

In [3]:
def process_dataset(dataset, num_rows=None):
    """
    Processes a given dataset by iterating through each row, generating text using the specified model and tokenizer,
    and creating an output dataset with the generated responses.

    Args:
        dataset (Dataset): The input dataset to process.
        num_rows (int, optional): The number of rows to process. If None, all rows will be processed.

    Returns:
        tuple: A tuple containing the output JSON string and the output dataset.
    """

    # Initialize an empty list to store the generated outputs
    output_data = []

    # Iterate through each row in the dataset
    for i, row in enumerate(dataset):
        # Check if the specified number of rows has been processed
        if num_rows is not None and i >= num_rows:
            break

        # Extract the relevant information from the current row and strip whitespaces
        message = row["message"].strip()
        sentiment = row["sentiment"].strip()

        # Log the message being processed
        logger.info(f"Processing message {i+1}: {message}")

        # Create the message template with metadata
        msg = [
            {
                "title": f"Conversation about X, <sentiment>{sentiment}</sentiment>",
                "content": f"""{message}""",
            }
        ]

        # Generate the prompt by applying the chat template to the message
        prompt = tokenizer.decode(tokenizer.apply_chat_template(msg)).strip()

        # Generate text using the specified model and tokenizer
        gen_text = generate(
            model, tokenizer, prompt, max_tokens=512, temp=0.6, verbose=False
        )

        # Split the generated text using the EOS token, take the first part, and strip whitespaces
        gen_text_final = gen_text.split(tokenizer.eos_token, 1)[0].strip()

        # Append the EOS token to the generated text
        gen_text_final += tokenizer.eos_token

        # Log the generated output
        logger.info(f"Generated output: {gen_text_final}")

        # Create a dictionary with the required schema
        output_item = {
            "instruction": f"[[[User]]]{message}",
            "response": f"[[[Content]]]{gen_text_final}",
        }

        # Append the output item to the output_data list
        output_data.append(output_item)

    # Create a JSON object with the output data
    output_json = json.dumps(output_data, indent=2)

    # Create a Hugging Face dataset from the output data
    output_dataset = Dataset.from_list(output_data)

    # Return the output JSON and dataset
    return output_json, output_dataset

```python
output_json = json.dumps(output_data, indent=2)
output_dataset = Dataset.from_list(output_data)
```
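A quick way to sanity-check the JSON half of the spoils is a round trip through `json`. The records below are hypothetical placeholders in the same `instruction`/`response` schema; the `Dataset.from_list` half is left out so the sketch runs without `datasets` installed.

```python
import json

# Hypothetical records in the schema that process_dataset builds
output_data = [
    {
        "instruction": "[[[User]]]first seed message",
        "response": "[[[Content]]]first generated text</s>",
    },
    {
        "instruction": "[[[User]]]second seed message",
        "response": "[[[Content]]]second generated text</s>",
    },
]

# Serialize exactly as the pipeline does, then parse it back
output_json = json.dumps(output_data, indent=2)
restored = json.loads(output_json)

# Every record should carry exactly the two expected keys
schema_ok = all(set(item) == {"instruction", "response"} for item in restored)
print(len(restored), schema_ok)
```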

The `output_json` is cached in the "output.json" territory, while the `output_dataset` is staked out in the "output_dataset" domain - a warning to any future prey that dares to wander too close.

And so, the circle of life in the coding kingdom continues, with the alphanumeric hunters ever vigilant, ever ready to pursue their next target.

In [None]:
# Process the shuffled train dataset. num_rows=None processes every row; pass e.g. num_rows=10 to process only the first 10 rows.
output_json, output_dataset = process_dataset(train_dataset, num_rows=None)

# Save the output JSON to a file
with open("output.json", "w") as f:
    f.write(output_json)

# Save the output dataset to disk (save_to_disk writes Arrow files, not parquet)
output_dataset.save_to_disk("output_dataset")

# END MLX LOGIC - WORKING ON END TO END PAIRWISE RM AND QLORA

Generation can then be performed with `model.generate()` (or with vLLM or whatever other pipeline you prefer). In this port, the MLX cells above use `mlx_lm`'s `generate` instead.

Note that the model is optimized for single-paragraph extracts from Wikipedia articles. You may have varying luck with other input types.

## Filtering outputs using a reward model
The model may occasionally generate incorrect or improperly formatted output - the likelihood of this can be reduced with clever sampling methods, such as rejection sampling using a reward model, or even simple regex filtering.
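The simple regex filtering mentioned above could look roughly like this. It is a sketch under one assumption: that a well-formed generation contains a non-empty `[[[User]]]` turn followed by a non-empty `[[[Assistant]]]` turn. `PAIR_RE` and `is_well_formed` are hypothetical names, and the candidate strings are made up for illustration.

```python
import re

# Assumption: a valid generation has a [[[User]]] turn, then an [[[Assistant]]] turn
PAIR_RE = re.compile(
    r"\[\[\[User\]\]\]\s*(?P<instruction>\S.*?)\s*"
    r"\[\[\[Assistant\]\]\]\s*(?P<response>\S.*)",
    re.DOTALL,
)


def is_well_formed(generation: str) -> bool:
    """Return True when both turns are present and non-empty."""
    return PAIR_RE.search(generation) is not None


candidates = [
    "[[[User]]] What does a low p-value suggest?\n"
    "[[[Assistant]]] Stronger evidence against the null hypothesis.",
    "[[[User]]] What does a low p-value suggest?",  # missing answer
    "[[[User]]] [[[Assistant]]] Answer without a question.",  # empty question
]

kept = [c for c in candidates if is_well_formed(c)]
print(len(kept))
```

A filter like this only catches formatting failures; it says nothing about whether the answer is correct, which is where a reward model comes in.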

For instance, we might consider `OpenAssistant/reward-model-deberta-v3-large-v2` as a reward model, and perform best-of-n sampling:

In [4]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

N = 4

rm_tokenizer = AutoTokenizer.from_pretrained(
    "OpenAssistant/reward-model-deberta-v3-large-v2"
)
rm_model = AutoModelForSequenceClassification.from_pretrained(
    "OpenAssistant/reward-model-deberta-v3-large-v2", torch_dtype=torch.bfloat16
)


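def extract_pair(resp):
    # Drop the title/content context, keeping only the generated interaction
    interaction = resp.split("[[[Content]]]")[1]
    # Assumption: the answer is introduced by an [[[Assistant]]] marker
    inst, answer = interaction.split("[[[User]]]")[1].split("[[[Assistant]]]")[:2]
    return inst.strip(), answer.strip()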


def score(resp):
    inst, resp = extract_pair(resp.split(tokenizer.eos_token)[0])

    with torch.no_grad():
        inputs = rm_tokenizer(inst, resp, return_tensors="pt")
        score = float(rm_model(**inputs).logits[0].cpu())
        return score


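# NOTE: left as-is from the original notebook. `model`, `tokenizer`, and `inputs`
# here are the Hugging Face transformers objects, not the MLX ones loaded above.
gens = tokenizer.batch_decode(
    model.generate(inputs, max_new_tokens=256, num_return_sequences=N, do_sample=True)
)
print(max(gens, key=score))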

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[[[Title]]] p-value
[[[Content]]] The p-value is used in the context of null hypothesis testing in order to quantify the statistical significance of a result, the result being the observed value of the chosen statistic T {\displaystyle T}.[note 2] The lower the p-value is, the lower the probability of getting that result if the null hypothesis were true. A result is said to be statistically significant if it allows us to reject the null hypothesis. All other things being equal, smaller p-values are taken as stronger evidence against the null hypothesis.

The following is an interaction between a user and an AI assistant that is related to the above text.

[[[User]]]  Two medical procedures were compared by flipping 2 coins, procedure A assumed to be better and so it was labeled head, while procedure B was labeled as tail for a flip. The coins where then flipped 25 times, with the following results:[{'Tails', 12}, {'Heads', 13}]

Which procedure had better results with statistical signi