##### Copyright 2024 Google LLC.

In [1]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Getting Started with Gemma 2 and Llamafile

[Gemma](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open language models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs), available in English, with open weights, pre-trained variants, and instruction-tuned variants.

Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

[Llamafile](https://github.com/Mozilla-Ocho/llamafile) is a tool that simplifies the distribution and execution of open Large Language Models (LLMs) by packaging them into a single-file executable called a "llamafile." By combining llama.cpp with Cosmopolitan Libc, it consolidates the complexity of LLMs into one framework that runs locally on most computers without any installation. The goal is to make open LLMs more accessible to both developers and end users.

In this tutorial, you will learn how to run the Gemma 2 model from Google using Llamafile.

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/Using_Gemma_with_Llamafile.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Setup

### Select the Colab runtime
To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.

### Gemma setup

**Before you dive into the tutorial, let's get you set up with Gemma:**

1. **Hugging Face Account:**  If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).
2. **Gemma Model Access:** Head over to the [Gemma model page](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) and accept the usage conditions.
3. **Colab with Gemma Power:**  For this tutorial, you'll need a Colab runtime with enough resources to handle the Gemma 2B model. Choose an appropriate runtime when starting your Colab session.
4. **Hugging Face Token:**  Generate a Hugging Face access token (preferably with `write` permission) by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.

**Once you've completed these steps, you're ready to move on to the next section where we'll set up environment variables in your Colab environment.**


### Configure your HF token

Add your Hugging Face token to the Colab Secrets manager to securely store it.

1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src="https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg" alt="The Secrets tab is found on the left panel." width=50%>
2. Create a new secret with the name `HF_TOKEN`.
3. Copy/paste your token key into the Value input box of `HF_TOKEN`.
4. Toggle the button on the left to allow notebook access to the secret.


In [2]:
import os
from google.colab import userdata

# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
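Outside Colab, `google.colab` cannot be imported, so the cell above fails. The sketch below shows one possible fallback; it is not part of the original notebook. The helper name `load_hf_token` is hypothetical, and it assumes the token is exported as the `HF_TOKEN` environment variable.

```python
import os


def load_hf_token(env=None):
    """Return the Hugging Face token from the environment or Colab Secrets.

    `load_hf_token` is a hypothetical helper, not part of any library.
    """
    env = os.environ if env is None else env

    # Prefer an explicitly exported environment variable.
    token = (env.get("HF_TOKEN") or "").strip()
    if token:
        return token

    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        raise RuntimeError(
            "HF_TOKEN is not set; export it or add it as a Colab secret."
        ) from None
    return userdata.get("HF_TOKEN")
```

Calling `load_hf_token()` with no arguments reads from `os.environ`; passing a plain dictionary makes the helper easy to exercise without touching the real environment.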

### Install dependencies
You'll need to install a few Python packages and dependencies to interact with Hugging Face and run the model.

Run the following cell to install or upgrade them:

In [3]:
# The huggingface_hub library allows us to download models and other files from Hugging Face.
!pip install --upgrade -q huggingface_hub

# Download a Llamafile release binary, pinned here to 0.8.13 (https://github.com/Mozilla-Ocho/llamafile/releases)
!wget -O llamafile https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.13/llamafile-0.8.13

# Make the binary executable
!chmod +x llamafile

# Download the zipalign binary (https://github.com/Mozilla-Ocho/llamafile/releases/download/<version>/zipalign-<version>)
!wget -O zipalign https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.4/zipalign-0.8.4

# Make the binary executable
!chmod +x zipalign

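Before moving on, it can help to confirm that both downloads exist and carry an execute bit. The following is a minimal sketch, assuming the files were saved as `llamafile` and `zipalign` in the current directory as in the cell above; `is_executable_file` is a hypothetical helper name.

```python
import os
import stat


def is_executable_file(path):
    """Return True if `path` is a regular file with any execute bit set."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    return stat.S_ISREG(mode) and bool(mode & exec_bits)


# Report the status of the two binaries downloaded above.
for name in ("llamafile", "zipalign"):
    status = "ok" if is_executable_file(name) else "missing or not executable"
    print(f"{name}: {status}")
```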

### Logging into Hugging Face Hub

Next, log in to the Hugging Face Hub using your access token. This allows you to download the model files.

In [4]:
from huggingface_hub import login

login(os.environ["HF_TOKEN"])

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


### Downloading the Gemma 2 Model
Once you're logged in, you can download the Gemma 2 model files from Hugging Face. The [Gemma 2 model](https://huggingface.co/google/gemma-2-2b-GGUF) is available in **GGUF** format, which is optimized for use with `llama.cpp` and compatible tools like Llamafile.

In [5]:
from huggingface_hub import hf_hub_download

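# Specify the repository and filename.
# Note: this run downloads a community GGUF repository. To use Gemma 2 as
# described above, substitute the Gemma 2 GGUF repository and its filename.
repo_id = 'falan42/llama_lora_8b_medical_parallax_2_gguf'  # Repository containing the GGUF model
filename = 'unsloth.Q4_K_M.gguf'  # The GGUF model file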

# Download the model file to the current directory
hf_hub_download(repo_id=repo_id, filename=filename, local_dir='.')

unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

'unsloth.Q4_K_M.gguf'
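A quick sanity check on the download is to read the file header: GGUF files begin with the four ASCII bytes `GGUF`, followed by a little-endian 32-bit format version. The sketch below is illustrative and only reads those first eight bytes; `read_gguf_version` is a hypothetical helper name.

```python
import struct


def read_gguf_version(path):
    """Return the GGUF format version, or None if the GGUF magic is missing."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    # The version is an unsigned 32-bit little-endian integer after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For example, `read_gguf_version("unsloth.Q4_K_M.gguf")` should return an integer for a valid download and `None` for a truncated file or an HTML error page saved under the same name.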

### Creating a Gemma 2 Llamafile

With Llamafile you can run the web server using a simple command like:

```bash
./gemma2.llamafile ...
```

To do this, package the model weights together with a special `.args` file that specifies the default arguments. Start by creating a file named `.args` that uses the following arguments:

- `-m`: Specifies the model file to use.
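- `--host`: Specifies the hostname or IP address the server binds to. `0.0.0.0` listens on all network interfaces.
- `-ngl`: Sets the number of GPU layers to offload. Setting it to `9999` offloads as many layers as possible to the GPU. The `.args` file below does not set it; add `-ngl` and `9999` on separate lines to enable GPU offloading.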


In [None]:
-m
unsloth.Q4_K_M.gguf
--host
0.0.0.0
...

In [19]:
%%writefile .args
-m
unsloth.Q4_K_M.gguf
--host
0.0.0.0
...

Overwriting .args
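The `.args` format is simple: one argument per line, with an optional `...` line marking where user-supplied arguments go. The sketch below mimics that substitution to show what the final argument list looks like. It is purely illustrative: `expand_args` is a hypothetical helper, not Llamafile's actual parser, and it does not model what happens when the placeholder is absent.

```python
def expand_args(args_text, user_args):
    """Expand `.args` content into an argument list.

    Illustrative only: each non-empty line is one argument, and a line
    containing `...` is replaced by the user-supplied arguments.
    """
    expanded = []
    for line in args_text.splitlines():
        arg = line.strip()
        if not arg:
            continue
        if arg == "...":
            expanded.extend(user_args)
        else:
            expanded.append(arg)
    return expanded


args_text = "-m\nunsloth.Q4_K_M.gguf\n--host\n0.0.0.0\n...\n"
print(expand_args(args_text, ["--server", "--port", "8081"]))
```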


As shown above, the `.args` file contains one argument per line. The optional `...` placeholder marks where any additional command-line arguments provided by the user are inserted. Now, embed both the model weights and the argument file in the executable using `zipalign`:

In [22]:
!cp llamafile unsloth.Q4_K_M.llamafile


In [23]:
!./zipalign -j0 \
  unsloth.Q4_K_M.llamafile \
  unsloth.Q4_K_M.gguf \
  .args
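Because `zipalign` stores the files as ZIP entries inside the executable, the result can usually be inspected with standard ZIP tooling. The sketch below lists the embedded entries with Python's `zipfile` module, assuming the llamafile can be opened as a ZIP archive with executable data in front of it; `list_embedded_files` is a hypothetical helper name.

```python
import zipfile


def list_embedded_files(path):
    """Return {entry_name: size_in_bytes} for ZIP entries embedded in `path`."""
    with zipfile.ZipFile(path) as archive:
        return {info.filename: info.file_size for info in archive.infolist()}
```

Calling `list_embedded_files("unsloth.Q4_K_M.llamafile")` should include both `.args` and `unsloth.Q4_K_M.gguf` if the packaging step succeeded.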


In [24]:
from huggingface_hub import create_repo, upload_file
import os

repo_id = "falan42/llama2_llamafile_2"  # Name of your model repository on the Hugging Face Hub
file_path = "/content/unsloth.Q4_K_M.llamafile"  # Path of the file to upload

# Create the repository if it doesn't exist
create_repo(repo_id, token=os.environ["HF_TOKEN"], exist_ok=True) # exist_ok=True prevents error if repo already exists


upload_file(
    path_or_fileobj=file_path,
    path_in_repo="unsloth.Q4_K_M.llamafile",  # Name of the file in your repository
    repo_id=repo_id,
    token=os.environ["HF_TOKEN"],
)

unsloth.Q4_K_M.llamafile:   0%|          | 0.00/5.16G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/falan42/llama2_llamafile_2/commit/938af361dc420db00529ab4607395e0bfe6644ef', commit_message='Upload unsloth.Q4_K_M.llamafile with huggingface_hub', commit_description='', oid='938af361dc420db00529ab4607395e0bfe6644ef', pr_url=None, repo_url=RepoUrl('https://huggingface.co/falan42/llama2_llamafile_2', endpoint='https://huggingface.co', repo_type='model', repo_id='falan42/llama2_llamafile_2'), pr_revision=None, pr_num=None)

### Run Llamafile
Let's now run Llamafile in server mode, which allows us to interact with it via HTTP requests.

- `nohup`: Keeps the command running after the launching shell exits (immune to hangups).
- `./unsloth.Q4_K_M.llamafile`: Runs the llamafile executable created in the previous section.
- `--server`: Starts Llamafile in server mode.
- `--nobrowser`: Prevents Llamafile from opening a browser window.
- `--port`: Specifies the port number to use.
- `> llamafile.log`: Redirects the output to a log file.
- `&`: Runs the process in the background.

**Note:** The `llamafile.log` file will contain logs that can help you troubleshoot any issues.

In [8]:
!nohup ./unsloth.Q4_K_M.llamafile --server --nobrowser --port 8081 > llamafile.log &
# Here we add a delay to let the server warm up before we make any requests
!sleep 60

nohup: redirecting stderr to stdout
^C
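A fixed delay can be too short or needlessly long. One alternative is to poll the port until it accepts connections. The sketch below is an assumption-laden illustration: `wait_for_port` is a hypothetical helper, and an open port only shows that the server process is listening, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Return True once a TCP connection succeeds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For the server started above, `wait_for_port("localhost", 8081)` returns `True` as soon as the port is reachable. If it returns `False`, check `llamafile.log`.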


To quickly test the API, use cURL to send a simple HTTP request.

In [9]:
%%bash
sleep 2
curl -X POST http://localhost:8081/completion \
-H "Content-Type: application/json" \
-d '{
  "system_prompt": {
    "prompt": "You are an AI assistant. Don'\''t make things up.",
    "anti_prompt": "User:",
    "assistant_name": "Assistant:"
  },
  "stream": false,
  "n_predict": 128,
  "temperature": 0.7,
  "stop": ["User:", "Assistant:"],
  "api_key": "",
  "prompt": "User: What is the capital of France?\n\nAssistant:"
}' | python3 -c '
import json
import sys

try:
    response = json.load(sys.stdin)
    print(json.dumps(response, indent=2))
except json.JSONDecodeError as e:
    print("JSONDecodeError:", e)
    print("No valid JSON data received.")
'

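Shell quoting makes it easy to get JSON escapes wrong, particularly the newlines inside the prompt. As a sketch, the same request body can be assembled in Python and serialized with `json.dumps`, which handles escaping automatically. The helper name `build_completion_body` is hypothetical, and the payload uses only a subset of the fields from the cURL request above.

```python
import json


def build_completion_body(question, n_predict=128, temperature=0.7):
    """Return a JSON request body for the /completion endpoint as a string."""
    payload = {
        "stream": False,
        "n_predict": n_predict,
        "temperature": temperature,
        "stop": ["User:", "Assistant:"],
        # Real newlines here; json.dumps escapes them in the serialized output.
        "prompt": f"User: {question}\n\nAssistant:",
    }
    return json.dumps(payload)


body = build_completion_body("What is the capital of France?")
print(body)
```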


### Creating a simple function to interact with Gemma 2

The `get_completion()` function sends a prompt to an AI language model and retrieves a generated response. This allows you to interact with the model by providing input text and receiving an AI-generated completion.

- **Prompt**: The main input text or question you want the AI to answer.
- **System Prompt**: Sets the context for the AI, instructing it on how to behave (e.g., "You are an AI assistant. Don't make things up.").
- **Parameters**:
  - `temperature`: Controls the creativity of the response (lower values = more deterministic).
  - `n_predict`: The maximum number of tokens (words or pieces of words) to generate.
  - `stop`: Sequences where the AI should stop generating further text.
  

This function simplifies the process of communicating with an AI language model, making it easier for you to experiment with generating text completions.

In [10]:
import requests
import json
import time

def get_completion(prompt, stream=False, n_predict=128, temperature=0.7,
                   stop=["User:", "Assistant:"],
                   url='http://localhost:8081/completion'):
    """
    Sends a POST request to the AI completion API with the given parameters.

    Args:
        prompt: The prompt or question to send to the API.
        stream (bool): Whether to stream the response.
        n_predict (int): Number of tokens to predict.
        temperature (float): Controls the randomness of the predictions.
        stop (list): List of stop sequences.
        url (str): The API endpoint URL.

    Returns:
        requests.Response: The HTTP response object from the API,
        or None if an error occurs.
    """
    headers = {
        'Content-Type': 'application/json'
    }

    payload = {
        "system_prompt": {
            "prompt": "You are an AI assistant. Don't make things up.",
            "anti_prompt": "User:",
            "assistant_name": "Assistant:"
        },
        "stream": stream,
        "n_predict": n_predict,
        "temperature": temperature,
        "stop": stop,
        "prompt": prompt
    }

    try:
        response = requests.post(url, headers=headers, json=payload)
        # Raises HTTPError for bad responses (4xx or 5xx)
        response.raise_for_status()
        return response
    except requests.exceptions.RequestException as e:
        print(f"An error occurred while making the request: {e}")
        return None
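Prompts in this tutorial follow a plain `User:` / `Assistant:` layout. The sketch below builds such a prompt, optionally including earlier turns. The multi-turn layout is an assumption made for illustration, not a documented chat template, and `format_chat` is a hypothetical helper name.

```python
def format_chat(history, user_message):
    """Build a prompt from prior (user, assistant) turns plus a new user message."""
    parts = []
    for user_text, assistant_text in history:
        parts.append(
            f"User: {user_text.strip()}\n\nAssistant: {assistant_text.strip()}"
        )
    # Leave the final assistant turn open so the model completes it.
    parts.append(f"User: {user_message.strip()}\n\nAssistant:")
    return "\n\n".join(parts)
```

The returned string can be passed directly as the `prompt` argument of `get_completion`. The default stop sequences then end generation when the model starts a new `User:` turn.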

In [11]:
from google.colab import drive
drive.mount('/content/drive')

MessageError: Error: credential propagation was unsuccessful

You can now interact with the Gemma 2 model through the Llamafile server by calling `get_completion` with your desired prompt to get a response from the AI model.


In [None]:
# Define your prompt and parameters
prompt = "User: What is good for a headache?\n\nAssistant:"
n_predict = 128
temperature = 0.1

# Call the get_completion function with your parameters
response = get_completion(
    prompt=prompt,
    n_predict=n_predict,
    temperature=temperature
)

# Print the response (get_completion returns None if the request failed)
if response is not None:
    print(response.json()['content'])
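Since `get_completion` returns `None` on failure, a small wrapper can make response handling more forgiving. The sketch below assumes only what the cell above relies on: that a successful response body is a JSON object with a `content` field. `completion_text` is a hypothetical helper name.

```python
def completion_text(response, default=""):
    """Return the stripped `content` field of a completion response, or `default`."""
    if response is None:
        return default
    try:
        body = response.json()
    except ValueError:
        # Raised when the response body is not valid JSON.
        return default
    if not isinstance(body, dict):
        return default
    return str(body.get("content", default)).strip()
```

With this helper, `print(completion_text(response, default="(no response)"))` prints a placeholder instead of raising when the server is unreachable.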

Congratulations! You've successfully set up the Gemma 2 model using Llamafile in a Colab environment. You can now experiment with the model, generate text, and explore its capabilities.

In [None]:
from huggingface_hub import create_repo, upload_file
import os

repo_id = "falan42/llama2_llamafile_1"  # Name of your model repository on the Hugging Face Hub
file_path = "/content/drive/MyDrive/llama/llama.llamafile"  # Path of the file to upload

# Create the repository if it doesn't exist
create_repo(repo_id, token=os.environ["HF_TOKEN"], exist_ok=True) # exist_ok=True prevents error if repo already exists


upload_file(
    path_or_fileobj=file_path,
    path_in_repo="llama.llamafile",  # Name of the file in your repository
    repo_id=repo_id,
    token=os.environ["HF_TOKEN"],
)