# Fine-tuning the latest Google Gemma model locally using MLX

In this notebook, we will be running and fine-tuning the latest [Google Gemma model](https://blog.google/technology/developers/gemma-open-models/) locally using the `MLX` library, which is optimized for Apple Silicon. The documentation for MLX can be a bit difficult to understand, so I hope that sharing the process and the challenges I encountered will be helpful.

I have uploaded the Jupyter Notebook that I actually used to Gist, so please refer to it as well.

- [Fine-tuning the Google Gemma model locally using MLX](https://gist.github.com/alexweberk/1434c95c05463866491677aac6ce19ba)


## Preparation

Install the necessary libraries.
A Mac with Apple Silicon is required. I used a MacBook Pro with an M3 Max and 128GB of memory.


In [1]:
!pip install -Uqq mlx mlx_lm transformers datasets
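
Since MLX targets Apple Silicon, it can be worth confirming the machine type before going further. The helper below is a small sketch of my own (the function name is hypothetical) using only the standard library:

```python
import platform


def is_apple_silicon() -> bool:
    """Return True when Python is running natively on an Apple Silicon Mac."""
    # macOS reports "Darwin" as the system name; native Apple Silicon
    # builds of Python report "arm64" as the machine type.
    return platform.system() == "Darwin" and platform.machine() == "arm64"


print("Apple Silicon detected:", is_apple_silicon())
```

Note that a Python build running under Rosetta reports `x86_64`, so a `False` here on an M-series Mac usually points at the Python installation rather than the hardware.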

## Running Inference with the Gemma Model using MLX

Gemma was released in four variants (2B and 7B, each as a base and an instruction-tuned model); this time we will use the instruction-tuned `gemma-7b-it`.

We will use the `mlx_lm` library that utilizes an mlx backend.


In [13]:
from mlx_lm import generate, load

model, tokenizer = load("google/gemma-7b-it")

Fetching 11 files:   0%|          | 0/11 [00:00<?, ?it/s]

Reading through some of the code in the `mlx-examples` repository, it looks like if the `transformers` tokenizer has an `apply_chat_template` method, it will use that template to generate the prompt.

Therefore, when generating, we will input a prompt that only includes the question itself.

https://github.com/ml-explore/mlx-examples/blob/47dd6bd17f3cc7ef95672ea16e443e58ce5eb1bf/llms/mlx_lm/generate.py#L98

This is the prompt that the tokenizer's chat template produces:


In [3]:
messages = [{"role": "user", "content": "Why is the sky blue?"}]
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

'<bos><start_of_turn>user\nWhy is the sky blue?<end_of_turn>\n<start_of_turn>model\n'
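
To make the format less of a black box, here is a hand-rolled approximation of what the template does for simple conversations. This is only a sketch: `gemma_chat_prompt` is a name I made up, and the real template lives in the tokenizer config, so treat the tokenizer's own `apply_chat_template` as the source of truth.

```python
def gemma_chat_prompt(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Approximate Gemma's chat template for a list of role/content messages."""
    prompt = "<bos>"
    for message in messages:
        # Gemma calls the assistant side of the conversation "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        prompt += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # Leave a model turn open so that generation continues from here.
        prompt += "<start_of_turn>model\n"
    return prompt


messages = [{"role": "user", "content": "Why is the sky blue?"}]
print(repr(gemma_chat_prompt(messages)))
```

For the single-question case this reproduces the string shown above.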

In [4]:
# Generating without adding a prompt template manually
prompt = """
Why is the sky blue?
""".strip()
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    verbose=True,  # Set to True to see the prompt and response
    temp=0.0,
    max_tokens=256,
)

Prompt: Why is the sky blue?


The sky is blue due to a phenomenon called **Rayleigh Scattering**.

Here's a breakdown of what happens:

1. **Sunlight:** Sunrays are made up of all the colors of the rainbow, with each color having a different wavelength.
2. **Scattering:** When sunlight enters Earth's atmosphere, it interacts with the tiny particles of air (dust, water vapor, etc.). These particles scatter the sunlight in all directions.
3. **Blue Scatter:** The particles scatter the shorter wavelengths of blue and violet light more effectively than the longer wavelengths of red and orange light.
4. **Scattered Light:** The scattered light, which is predominantly blue, is scattered in all directions.
5. **Our View:** We see the scattered light from all directions, including the direction opposite the sun. This is why the sky appears blue.

**Additional factors:**

* **Time of Day:** The intensity of the blue color is strongest at midday and decreases as the sun gets closer to the horiz

Success! We were able to generate a response from the Gemma model using MLX.

Now that we've seen how to generate a response, let's try fine-tuning the model to some custom data.


## Fine-tuning the Gemma model with LoRA using MLX

We'll be fine-tuning on a cool dataset from teknium to see what we can produce. Since this is just an example, we'll only fine-tune it for 600 iterations.
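
As a quick refresher, LoRA keeps the base weights frozen and trains two small low-rank matrices for each adapted linear layer, which is why so few parameters end up trainable. The sketch below counts them. Every number in it is an assumption on my part (rank 8, adapters on the query and value projections of 16 layers, a hidden size of 3072 and a projection width of 4096 for Gemma 7B), not something read from the library or the model config. With these values the total happens to match the 1.835M trainable parameters that the training log reports later in this post.

```python
def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A (in x rank) plus B (rank x out)."""
    return in_features * rank + rank * out_features


# All values below are assumptions, not read from mlx_lm or the model config.
HIDDEN_SIZE = 3072          # assumed Gemma 7B hidden size
ATTN_WIDTH = 4096           # assumed output width of the query/value projections
RANK = 8                    # assumed LoRA rank
ADAPTED_LAYERS = 16         # assumed number of transformer blocks with adapters
PROJECTIONS_PER_LAYER = 2   # assumed: query and value projections

total = ADAPTED_LAYERS * PROJECTIONS_PER_LAYER * lora_params(HIDDEN_SIZE, ATTN_WIDTH, RANK)
print(f"{total / 1e6:.3f}M trainable parameters")
```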


### Preparing the dataset

We'll format the dataset to follow the format shown in the `mlx-examples` repository. Basically, we need to prepare a train.jsonl and valid.jsonl. Each line should have a `text` key with the string to train as the value. Here is an example:

`{"text": "Q: What is the capital of France?\nA: Paris is the capital of France."}`

However, the value should follow the format of the prompt Gemma was trained on, which means we need to transform it to something like this:

`{"text": "<bos><start_of_turn>user\nWhat is the capital of France?<end_of_turn>\n<start_of_turn>model\nParis is the capital of France.<end_of_turn><eos>"}`
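
A minimal sketch of that transformation for a single question and answer pair is shown below. The helper name `to_gemma_record` is mine, purely for illustration; the actual notebook code further down builds the same kind of string with pandas.

```python
import json


def to_gemma_record(question: str, answer: str) -> str:
    """Build one JSONL line whose "text" value follows Gemma's turn format."""
    text = (
        "<bos><start_of_turn>user\n"
        f"{question}<end_of_turn>\n"
        "<start_of_turn>model\n"
        f"{answer}<end_of_turn><eos>"
    )
    # ensure_ascii=False keeps non-ASCII characters readable in the file.
    return json.dumps({"text": text}, ensure_ascii=False)


print(to_gemma_record("What is the capital of France?", "Paris is the capital of France."))
```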

Let's first load the dataset and see what it looks like. We will be using legendary [@teknium](https://twitter.com/Teknium1)'s awesome [teknium/trismegistus-project](https://huggingface.co/datasets/teknium/trismegistus-project) dataset with spiritual questions and answers.


In [2]:
from datasets import load_dataset

dataset = load_dataset("teknium/trismegistus-project")
dataset

DatasetDict({
    train: Dataset({
        features: ['source', 'domain_task_type', 'topic', 'system_prompt_used', 'conversations', 'id'],
        num_rows: 13528
    })
})

Since the dataset is small enough, I will just use pandas to format it.


In [3]:
# convert the dataset to pandas dataframe
import pandas as pd

df = pd.DataFrame(dataset["train"])
df.head()

| | source | domain_task_type | topic | system_prompt_used | conversations | id |
|---|---|---|---|---|---|---|
| 0 | DomainExpert_Occult | Task | 'Big Man' society | You are a master of the esoteric, occult, 'Big... | [{'from': 'human', 'value': 'Compose a compreh... | 570a8404-3270-4aba-a47c-660359440835 |
| 1 | DomainExpert_Occult | Task | 'Big Man' society | You are a master of the esoteric, occult, 'Big... | [{'from': 'human', 'value': 'Develop an intric... | ddf44765-8756-46db-a945-672050905fc0 |
| 2 | DomainExpert_Occult | Task | 'Big Man' society | You are a master of the esoteric, occult, 'Big... | [{'from': 'human', 'value': 'Write an extensiv... | 9ef38c3a-31ed-48d7-94d2-75fc588bcb2e |
| 3 | DomainExpert_Occult | Task | 'Big Man' society | You are a master of the esoteric, occult, 'Big... | [{'from': 'human', 'value': 'Develop an intric... | 6dea7781-0f74-4692-8d1d-762c6585c280 |
| 4 | DomainExpert_Occult | Task | 'Black Books' of European necromancy | You are a master of the esoteric, occult, 'Bla... | [{'from': 'human', 'value': 'Devise an intrica... | 188c6c15-d2b7-448b-b93c-505aeca2a458 |


In [4]:
print(df.iloc[0]["system_prompt_used"], "\n", df.iloc[0]["conversations"])

You are a master of the esoteric, occult, 'Big Man' society, you complete tasks to the absolute best of your ability, no matter if you think you were not trained to do the task specifically, you will attempt to do it anyways, since you have performed the tasks you are given with great mastery, accuracy, and deep understanding of what is requested. You do the tasks faithfully, and stay true to the mode and domain's mastery role. If the task is not specific enough, note that and create specifics that enable completing the task. 
 [{'from': 'human', 'value': "Compose a comprehensive biography of a renowned figure in the 'Big Man' society, delving into their connections to the occult and esoteric realms. Emphasize their influence on the contemporary 'Big Man' society, their mystical practices, and the transmission of arcane knowledge. Ensure to explore the convergence of their occult work with the politics and power dynamics of the 'Big Man' establishment, dissecting how it fueled their ri

We see that the `conversations` holds the text for Q and A. We will format this in Gemma's prompt format and save it to a jsonl file.


In [5]:
# Split the question and answer into separate columns
df[["question", "answer"]] = pd.DataFrame(df["conversations"].tolist(), index=df.index)

# Only keep the 'value' portion of the JSON
df["question"] = df["question"].apply(lambda x: x["value"])
df["answer"] = df["answer"].apply(lambda x: x["value"])

df[["system_prompt_used", "question", "answer"]]

| | system_prompt_used | question | answer |
|---|---|---|---|
| 0 | You are a master of the esoteric, occult, 'Big... | Compose a comprehensive biography of a renowne... | Title: The Mystifying Odyssey of Eliphas Black... |
| 1 | You are a master of the esoteric, occult, 'Big... | Develop an intricate numerology system that de... | I. Foundational Numerology\n\nThe 'Big Man' so... |
| 2 | You are a master of the esoteric, occult, 'Big... | Write an extensive biography of a prominent oc... | Title: Nathaniel Ziester: A Life in Shadows - ... |
| 3 | You are a master of the esoteric, occult, 'Big... | Develop an intricate system of numerology, inc... | Title: The Numerological Riddles of the Big Ma... |
| 4 | You are a master of the esoteric, occult, 'Bla... | Devise an intricate multi-step process for the... | Step 1: Assess the condition of the grimoire\n... |
| ... | ... | ... | ... |
| 13523 | You are a master of the esoteric, occult, Reap... | In the context of the Reappropriated Goddess m... | Answer: To regain women's empowerment and infl... |
| 13524 | You are a master of the esoteric, occult, Reap... | Write a section of a grimoire explaining the c... | Title: The Reappropriated Goddess in the Occul... |
| 13525 | You are a master of the esoteric, occult, Reap... | Write a section of a grimoire specifically foc... | Title: The Reappropriated Goddess: A Journey i... |
| 13526 | You are a master of the esoteric, occult, Reap... | Create a detailed introductory section for a g... | Title: The Reappropriated Goddess: A Grimoire ... |
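
The pandas split above relies on each conversation having exactly two turns in a fixed order. If you would rather pick the turns by role, something like the sketch below could work. It is only an illustration: `split_conversation` is a hypothetical helper, the sample data is made up, and the `"gpt"` label for the answering side is my assumption (the output above only shows `"human"`).

```python
def split_conversation(conversation: list[dict]) -> tuple[str, str]:
    """Return (question, answer) from a list of {"from": ..., "value": ...} turns."""
    # First turn written by the human is treated as the question.
    question = next(t["value"] for t in conversation if t["from"] == "human")
    # Assumption: the first turn from anyone else is the model's answer.
    answer = next(t["value"] for t in conversation if t["from"] != "human")
    return question, answer


# Made-up two-turn sample in the same shape as the dataset rows.
sample = [
    {"from": "human", "value": "What is the capital of France?"},
    {"from": "gpt", "value": "Paris is the capital of France."},
]
print(split_conversation(sample))
```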


Gemma doesn't seem to have been trained with a separate system prompt, so let's define our own format that puts the system prompt and the user prompt inside a single user turn, as below.


In [6]:
def generate_prompt(row: pd.Series) -> str:
    "Format to Gemma's chat template"
    return """<bos><start_of_turn>user
## Instructions
{}
## User
{}<end_of_turn>
<start_of_turn>model
{}<end_of_turn><eos>""".format(
        row["system_prompt_used"], row["question"], row["answer"]
    )


df["text"] = df.apply(generate_prompt, axis=1)

# Let's see what the model will be trained on
print(df["text"].iloc[0])

<bos><start_of_turn>user
## Instructions
You are a master of the esoteric, occult, 'Big Man' society, you complete tasks to the absolute best of your ability, no matter if you think you were not trained to do the task specifically, you will attempt to do it anyways, since you have performed the tasks you are given with great mastery, accuracy, and deep understanding of what is requested. You do the tasks faithfully, and stay true to the mode and domain's mastery role. If the task is not specific enough, note that and create specifics that enable completing the task.
## User
Compose a comprehensive biography of a renowned figure in the 'Big Man' society, delving into their connections to the occult and esoteric realms. Emphasize their influence on the contemporary 'Big Man' society, their mystical practices, and the transmission of arcane knowledge. Ensure to explore the convergence of their occult work with the politics and power dynamics of the 'Big Man' establishment, dissecting how 

Let's save the data in two separate jsonl formatted files.

- Train set: `data/train.jsonl`
- Valid set: `data/valid.jsonl`


In [7]:
from pathlib import Path

Path("data").mkdir(exist_ok=True)

split_ix = int(len(df) * 0.9)
# shuffle data
data = df.sample(frac=1, random_state=42)
train, valid = data[:split_ix], data[split_ix:]

# Save train and valid dataset as jsonl files
train[["text"]].to_json("data/train.jsonl", orient="records", lines=True, force_ascii=False)
valid[["text"]].to_json("data/valid.jsonl", orient="records", lines=True, force_ascii=False)

!head -n 5 data/train.jsonl

{"text":"<bos><start_of_turn>user\n## Instructions\nYou are a master of the esoteric, occult, Chronotopic inversion and education, you have written many textbooks on the subject in ways that provide students with rich and deep understanding of the subject. You are being asked to write textbook-like sections on a topic and you do it with full context, explainability, and reliability in accuracy to the true facts of the topic at hand, in a textbook style that a student would easily be able to learn from, in a rich, engaging, and contextual way. Always include relevant context (such as formulas and history), related concepts, and in a way that someone can gain deep insights from.\n## User\nWrite a detailed explanation of Chronotopic inversion within the context of the occult, focusing on its history, methodology, practical applications, and key concepts. Elaborate on how an adept in the esoteric arts can harness this mysterious power to manipulate time and space for personal growth, spiri

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
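
Before starting a long training run, it can be reassuring to check that every line of the files parses and carries a `text` string. The checker below is a small sketch of mine (the function name `check_jsonl` is hypothetical). The demo at the bottom runs against a throwaway file so that the snippet works on its own.

```python
import json
import tempfile
from pathlib import Path


def check_jsonl(path: str) -> int:
    """Return the number of records, raising if any line is malformed."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)  # raises on invalid JSON
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                raise ValueError(f"{path}:{line_no} has no string 'text' field")
            count += 1
    return count


# Demo on a temporary two-line file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    demo.write_text('{"text": "a"}\n{"text": "b"}\n', encoding="utf-8")
    print(check_jsonl(str(demo)))
```

In the notebook you would call it as `check_jsonl("data/train.jsonl")` and `check_jsonl("data/valid.jsonl")`.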


### Running LoRA Fine-tuning with MLX

Now that our data is ready, let's start fine-tuning!

When running LoRA with `mlx_lm`, you can use the following command to see various options.


In [18]:
!python -m mlx_lm.lora --help

usage: lora.py [-h] [--model MODEL] [--max-tokens MAX_TOKENS] [--temp TEMP]
               [--prompt PROMPT] [--train] [--data DATA]
               [--lora-layers LORA_LAYERS] [--batch-size BATCH_SIZE]
               [--iters ITERS] [--val-batches VAL_BATCHES]
               [--learning-rate LEARNING_RATE]
               [--steps-per-report STEPS_PER_REPORT]
               [--steps-per-eval STEPS_PER_EVAL]
               [--resume-adapter-file RESUME_ADAPTER_FILE]
               [--adapter-file ADAPTER_FILE] [--save-every SAVE_EVERY]
               [--test] [--test-batches TEST_BATCHES]
               [--max-seq-length MAX_SEQ_LENGTH] [--seed SEED]

LoRA or QLoRA finetuning.

options:
  -h, --help            show this help message and exit
  --model MODEL         The path to the local model directory or Hugging Face
                        repo.
  --max-tokens MAX_TOKENS, -m MAX_TOKENS
                        The maximum number of tokens to generate
  --temp TEMP           The sampling

Here's how we run the training. This will take a while to finish.


In [21]:
!python -m mlx_lm.lora \
    --model google/gemma-7b-it \
    --train \
    --iters 600 \
    --data data \
    # --resume-adapter-file checkpoints/600_adapters.npz

Loading pretrained model
Fetching 11 files: 100%|████████████████████| 11/11 [00:00<00:00, 118300.88it/s]
Total parameters 8539.516M
Trainable parameters 1.835M
Loading datasets
Loading pretrained adapters from checkpoints/600_adapters.npz
Training
Starting training..., iters: 600
Iter 1: Val loss 1.490, Val took 237.693s
Iter 10: Train loss 1.509, Learning Rate 1.000e-05, It/sec 0.057, Tokens/sec 218.615, Trained Tokens 38474
Iter 20: Train loss 1.556, Learning Rate 1.000e-05, It/sec 0.058, Tokens/sec 219.489, Trained Tokens 76053
Iter 30: Train loss 1.475, Learning Rate 1.000e-05, It/sec 0.049, Tokens/sec 187.660, Trained Tokens 114700
Iter 40: Train loss 1.463, Learning Rate 1.000e-05, It/sec 0.055, Tokens/sec 216.056, Trained Tokens 153649
Iter 50: Train loss 1.495, Learning Rate 1.000e-05, It/sec 0.052, Tokens/sec 200.114, Trained Tokens 192208
Iter 60: Train loss 1.498, Learning Rate 1.000e-05, It/sec 0.052, Tokens/sec 204.816, Trained Tokens 231521
Iter 70: Train loss 1.514, Lea
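
If you save this output to a file, the loss and speed values are easy to pull out for plotting or for a rough time estimate. The parser below is a sketch that assumes the log lines keep exactly the format shown above; the regular expression and function name are mine, not part of `mlx_lm`.

```python
import re

# Matches lines such as "Iter 10: Train loss 1.509, ..., It/sec 0.057, ..."
TRAIN_LINE = re.compile(r"Iter (\d+): Train loss ([\d.]+), .*?It/sec ([\d.]+)")


def parse_train_log(log: str) -> list[tuple[int, float, float]]:
    """Extract (iteration, train loss, iterations/sec) from the training output."""
    return [
        (int(it), float(loss), float(speed))
        for it, loss, speed in TRAIN_LINE.findall(log)
    ]


log = """\
Iter 1: Val loss 1.490, Val took 237.693s
Iter 10: Train loss 1.509, Learning Rate 1.000e-05, It/sec 0.057, Tokens/sec 218.615, Trained Tokens 38474
Iter 20: Train loss 1.556, Learning Rate 1.000e-05, It/sec 0.058, Tokens/sec 219.489, Trained Tokens 76053
"""
rows = parse_train_log(log)
mean_speed = sum(speed for _, _, speed in rows) / len(rows)
# Rough wall-clock estimate for 600 iterations, counting training steps only.
print(f"~{600 / mean_speed / 3600:.1f} hours for 600 iterations")
```

Validation passes are not included in that estimate, so the real run takes somewhat longer.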

### Running Inference with the Fine-tuned Gemma model using MLX

The following script can be used to perform inference with LoRA weights from the command line.

```
!python -m mlx_lm.lora --model "google/gemma-7b-it" \
               --adapter-file checkpoints/600_adapters.npz \
               --max-tokens 256 \
               --prompt "Why is the sky blue?" \
               --seed 69
```

However, since we fine-tuned using a specific prompt format, we should probably use it every time we prompt the model.

Let's create a simple function to format our prompts.


In [11]:
# I thought this system prompt was cool, so let's use this one
system_prompt = df["system_prompt_used"].unique()[-2]
# system_prompt = "You are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end."
print(system_prompt)

You are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end.


In [12]:
question = "Why is the sky blue?"


def format_prompt(system_prompt: str, question: str) -> str:
    "Format the question to the format of the dataset we fine-tuned to."
    return """<bos><start_of_turn>user
## Instructions
{}
## User
{}<end_of_turn>
<start_of_turn>model
""".format(system_prompt, question)


format_prompt(system_prompt, question)

'<bos><start_of_turn>user\n## Instructions\nYou are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end.\n## User\nWhy is the sky blue?<end_of_turn>\n<start_of_turn>model\n'

In [11]:
# Load the fine-tuned model with LoRA weights
model_lora, _ = load(
    "google/gemma-7b-it",
    adapter_file="./checkpoints/600_adapters.npz",
)

Fetching 11 files:   0%|          | 0/11 [00:00<?, ?it/s]

In [14]:
response = generate(
    model_lora,
    tokenizer,
    prompt=format_prompt(system_prompt, question),
    verbose=True,
    temp=0.0,
    max_tokens=256,
)

Prompt: <bos><start_of_turn>user
## Instructions
You are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end.
## User
Why is the sky blue?<end_of_turn>
<start_of_turn>model

The question of "why is the sky blue?" is not related to the topic of Reappropriated Goddess, therefore I will not provide an answer to this question. Instead, I will provide an explanation of the scientific and metaphysical reasons behind the phenomenon of the sky being blue.

The sky a

## Fusing LoRA Weights

Finally, let's try merging the trained LoRA weights into the model itself.

The command below shows available options:


In [39]:
!python -m mlx_lm.fuse --help

Loading pretrained model
usage: fuse.py [-h] [--model MODEL] [--save-path SAVE_PATH]
               [--adapter-file ADAPTER_FILE] [--hf-path HF_PATH]
               [--upload-repo UPLOAD_REPO] [--de-quantize]

LoRA or QLoRA finetuning.

options:
  -h, --help            show this help message and exit
  --model MODEL         The path to the local model directory or Hugging Face
                        repo.
  --save-path SAVE_PATH
                        The path to save the fused model.
  --adapter-file ADAPTER_FILE
                        Path to the trained adapter weights (npz or
                        safetensors).
  --hf-path HF_PATH     Path to the original Hugging Face model. Required for
                        upload if --model is a local directory.
  --upload-repo UPLOAD_REPO
                        The Hugging Face repo to upload the model to.
  --de-quantize         Generate a de-quantized model.


In [15]:
!python -m mlx_lm.fuse \
    --model google/gemma-7b-it \
    --adapter-file adapters.npz \
    # --upload-repo alexweberk/gemma-7b-it-trismegistus \
    # --hf-path google/gemma-7b-it

Loading pretrained model
Fetching 11 files: 100%|████████████████████| 11/11 [00:00<00:00, 119526.80it/s]


The merge succeeded, and a directory called `lora_fused_model` was created, which contains various files for the model.
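
Conceptually, fusing folds the low-rank update into the base weight once, so that inference needs a single matrix multiply per layer instead of a base path plus an adapter path. The toy example below illustrates the idea in plain Python. It is a sketch of the math only, with arbitrary values and a hypothetical scaling factor, and is not what `mlx_lm.fuse` literally executes.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale):
    """Return a + scale * b element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy shapes: 2 inputs, 2 outputs, rank 1. Values are arbitrary.
W = [[1, 2], [3, 4]]   # frozen base weight (in x out)
A = [[1], [2]]         # LoRA down-projection (in x rank)
B = [[3, -1]]          # LoRA up-projection (rank x out)
SCALE = 2              # hypothetical LoRA scaling factor
x = [[5, 7]]           # one input row

# Unfused: base path plus a separate low-rank path, evaluated on every call.
unfused = add_scaled(matmul(x, W), matmul(matmul(x, A), B), SCALE)

# Fused: fold the low-rank update into the weight once, then use one matmul.
W_fused = add_scaled(W, matmul(A, B), SCALE)
fused = matmul(x, W_fused)

print(unfused == fused)
```

Both paths give the same output, which is why the fused model can drop the adapter file entirely.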


To upload the fused model to Hugging Face, first create a new model repo on Hugging Face, note its model ID, and then run the script below. In my case, I created a model ID called `alexweberk/gemma-7b-it-trismegistus`.

- `--upload-repo` is the name of the repo to upload to.
- `--hf-path` is the name of the original model to give credit to.

Unfortunately, at the time of writing, the `.safetensors` files that get generated through the fusing process were missing the `metadata` attribute, which caused loading the models in `transformers` to fail. I have opened an issue on the `mlx` repository to address this.

If you want, you can patch the library code in `<path_to_your_site-packages>/mlx_lm/utils.py` (mine was `/Users/alexishida/miniforge3/envs/py311/lib/python3.11/site-packages/mlx_lm/utils.py`): replace `mx.save_safetensors(str(shard_path), shard)` with `mx.save_safetensors(str(shard_path), shard, metadata={"format":"pt"})` so that the fused weights are written with the metadata attribute.


Another way is to rewrite the `.safetensors` files with the metadata attribute using the script below. Given the way `safetensors` is implemented, doing this in a for loop did not work for me (you need to take care of removing all references to the tensors, etc.), so let's simply rewrite the few safetensors files by hand, one at a time.


In [28]:
import mlx.core as mx

# use mx.load() to load the safetensors
tensors = mx.load(
    "lora_fused_model/model-00001-of-00004.safetensors",  # Change this path and run the cell, one by one for all .safetensors files
    format="safetensors",
)

# use mx.save_safetensors() to save the safetensors with "format" metadata
mx.save_safetensors(
    "lora_fused_model/model-00001-of-00004.safetensors",
    tensors,
    metadata={"format": "pt"},
)
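
To confirm that a shard now carries the metadata, you can read the file header directly. A `.safetensors` file starts with an 8-byte little-endian length followed by a JSON header, and the optional `__metadata__` key holds free-form string metadata. The reader below is a standard-library sketch of my own (the function name is hypothetical), not part of `mlx` or `safetensors`.

```python
import json
import struct


def read_safetensors_metadata(path: str) -> dict:
    """Return the `__metadata__` dict from a .safetensors header ({} if absent)."""
    with open(path, "rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length...
        (header_len,) = struct.unpack("<Q", f.read(8))
        # ...followed by that many bytes of JSON describing the tensors.
        header = json.loads(f.read(header_len))
    return header.get("__metadata__", {})
```

After rewriting a shard as above, `read_safetensors_metadata("lora_fused_model/model-00001-of-00004.safetensors")` should include `"format": "pt"`.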

### Model Uploading Process

You will need to have a Hugging Face write token saved before you can upload your model. To set an access token, you can:

- Create one here (make sure you create a "Write" token): https://huggingface.co/settings/tokens
- Install the `huggingface-cli` tool, and run `huggingface-cli login`
- Paste the token when prompted.


Now let's upload the updated safetensors files to Hugging Face. This takes a while to finish...


In [41]:
!huggingface-cli upload alexweberk/gemma-7b-it-trismegistus ./lora_fused_model .

## Loading the Fused Model and Running Inference

Let's try loading the fused model and running inference with it.


In [32]:
from mlx_lm import generate, load

fused_model, fused_tokenizer = load("./lora_fused_model/")

In [33]:
response = generate(
    fused_model,
    fused_tokenizer,
    prompt=format_prompt(system_prompt, question),
    verbose=True,  # Set to True to see the prompt and response
    temp=0.0,
    max_tokens=512,
)

Prompt: <bos><start_of_turn>user
## Instructions
You are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end.
## User
Why is the sky blue?<end_of_turn>
<start_of_turn>model

The question of "why the sky is blue" is a multifaceted one, and the answer will depend on the specific context in which the question is posed. In the context of the esoteric, occult, and Reappropriated Goddess, the question can be interpreted in a number of ways.

In the first instance,

It looks like the fused model was able to generate a response.

Generation with the fused model is significantly faster (about 3.7x in my run) than running with the LoRA weights without fusing them.

- LoRA Generation: 4.781 tokens-per-sec
- Fused Model Generation: 17.740 tokens-per-sec


## Loading the Fused Model from Hugging Face

Let's download the uploaded model from Hugging Face and run inference with it, just to make sure it uploaded correctly.


In [24]:
from mlx_lm import generate, load

model_, tokenizer_ = load("alexweberk/gemma-7b-it-trismegistus")
response = generate(
    model_,
    tokenizer_,
    prompt=format_prompt(system_prompt, question),
    verbose=True,  # Set to True to see the prompt and response
    temp=0.0,
    max_tokens=512,
)

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

(…)l-00002-of-00004-with-format.safetensors:   0%|          | 0.00/5.23G [00:00<?, ?B/s]

(…)l-00003-of-00004-with-format.safetensors:   0%|          | 0.00/5.28G [00:00<?, ?B/s]

(…)l-00001-of-00004-with-format.safetensors:   0%|          | 0.00/5.30G [00:00<?, ?B/s]

Prompt: <bos><start_of_turn>user
## Instructions
You are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end.
## User
Why is the sky blue?<end_of_turn>
<start_of_turn>model

The question of "Why is the sky blue?" is not related to the topic of Reappropriated Goddess. It is a question of science and physics. The answer to this question involves the scientific process of scattering of light.

The sky appears blue because of a phenomenon called Rayleigh scatter

We can also run it using `transformers` directly, although without the benefit of utilizing MLX/Apple Silicon to the fullest.


In [34]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# repo_id = "google/gemma-7b-it"
repo_id = "alexweberk/gemma-7b-it-trismegistus"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
model.to("mps")

input_text = format_prompt(system_prompt, question)
input_ids = tokenizer(input_text, return_tensors="pt").to("mps")

outputs = model.generate(
    **input_ids,
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0]))

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

<bos><bos><start_of_turn>user
## Instructions
You are a master in the field of the esoteric, occult, Reappropriated Goddess and Education. You are a writer of tests, challenges, books and deep knowledge on Reappropriated Goddess for initiates and students to gain deep insights and understanding from. You write answers to questions posed in long, explanatory ways and always explain the full context of your answer (i.e., related concepts, formulas, examples, or history), as well as the step-by-step thinking process you take to answer the challenges. Be rigorous and thorough, and summarize the key themes, ideas, and conclusions at the end.
## User
Why is the sky blue?<end_of_turn>
<start_of_turn>model
The question of "why is the sky blue?" is a complex one that requires a multifaceted answer. To fully understand this question, we must first delve into the scientific, philosophical, and esoteric aspects of the topic.

Scientifically, the sky appears blue due to a phenomenon called scatteri
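
One detail worth noticing: the decoded output above starts with `<bos><bos>`. My guess (an assumption, not something I verified for this run) is that the `transformers` tokenizer prepends its own `<bos>` to a prompt that `format_prompt` already begins with `<bos>`. A simple way around it is to drop the literal token before tokenizing, as in the sketch below; passing `add_special_tokens=False` to the tokenizer call should be another option.

```python
BOS = "<bos>"


def strip_leading_bos(prompt: str) -> str:
    """Drop a literal leading <bos> so that the tokenizer can add its own."""
    return prompt[len(BOS):] if prompt.startswith(BOS) else prompt


prompt = "<bos><start_of_turn>user\nWhy is the sky blue?<end_of_turn>\n<start_of_turn>model\n"
print(repr(strip_leading_bos(prompt)))
```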

## Conclusion

Hope this was helpful!
If you liked this content, please [follow me on Twitter(X)](https://twitter.com/morningcoder).

Notebook Gist: https://gist.github.com/alexweberk/635431b5c5773efd6d1755801020429f
