<a href="https://colab.research.google.com/github/aknip/Local_LLMs/blob/main/Zephyr_7B_Alpha_on_Google_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quick start to run Zephyr 7B Alpha on Google Colab

- Zephyr 7B Alpha is a low-cost LLM that, according to the article linked under "Info" below, can perform better than Llama-70B.
- To try Zephyr-7B right away without any setup, use the inference Space hosted by Hugging Face: https://huggingfaceh4-zephyr-chat.hf.space/

- Source: https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha/blob/main/colab-demo.ipynb

- Info: https://levelup.gitconnected.com/zephyr-7b-%CE%B1-a-low-cost-llm-can-perform-better-than-llama-70b-a2055dd028cb

Other:
- https://github.com/aigeek0x0/zephyr-7b-alpha-langchain-chatbot

# 1. Variant (quantized GGUF model via ctransformers, slower)

In [None]:
!pip install "ctransformers>=0.2.24"
!wget https://huggingface.co/TheBloke/zephyr-7B-alpha-GGUF/resolve/main/zephyr-7b-alpha.Q4_K_M.gguf

In [None]:
# Response time: 1-2 min

from ctransformers import AutoModelForCausalLM, AutoConfig, Config

# Sampling and context settings for the quantized model
conf = AutoConfig(Config(temperature=0.7, repetition_penalty=1.1, batch_size=52,
                         max_new_tokens=1024, context_length=2048))
llm = AutoModelForCausalLM.from_pretrained("/content/zephyr-7b-alpha.Q4_K_M.gguf",
                                           model_type="mistral", config=conf)
prompt = "quel est le sens de la science?"  # French: "What is the meaning of science?"
# Zephyr chat format: role tag on its own line, then the content, then </s> to close the turn
template = f'''<|system|>
Reply in the target language of the prompt</s>
<|user|>
{prompt}</s>
<|assistant|>
'''
print(llm(template))
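
The hand-written template above can be wrapped in a small helper so the role tags and `</s>` markers are not retyped for every prompt. This is a minimal sketch: the function name `build_zephyr_prompt` is hypothetical and not part of ctransformers. It covers only a single system/user turn and simply mirrors the layout used in the cell above.

```python
def build_zephyr_prompt(user_message: str, system_message: str = "") -> str:
    """Format a single-turn prompt in the <|system|>/<|user|>/<|assistant|> layout.

    Hypothetical helper for illustration; it is not a ctransformers API.
    """
    # Each turn: role tag on its own line, the content, then the </s> marker.
    # The trailing <|assistant|> tag cues the model to write its reply.
    return (
        f"<|system|>\n{system_message}</s>\n"
        f"<|user|>\n{user_message}</s>\n"
        "<|assistant|>\n"
    )


print(build_zephyr_prompt("quel est le sens de la science?",
                          "Reply in the target language of the prompt"))
```

The returned string can be passed to `llm(...)` in place of `template`.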

# 2. Variant (full model via transformers pipeline)

In [None]:
%%capture --no-stderr
# Install transformers from source - only needed for versions <= v4.34
%pip install git+https://github.com/huggingface/transformers.git
%pip install accelerate
# Alternatively: !pip install transformers accelerate

## Load dependencies

In [None]:
import re
import torch
import textwrap
from transformers import pipeline

## Download model and load pipeline

Takes approx. 3 minutes.

In [None]:
pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-alpha", torch_dtype=torch.bfloat16, device_map="auto")

## Helper functions

In [None]:
def extract_response_text(input_text):
    """Return the text after the <|assistant|> tag, or the whole text if the tag is missing."""
    # Use the argument as passed in; do not read the global `outputs` here
    match = re.search(r'<\|assistant\|>(.+)', input_text, re.DOTALL)
    if match:
        extracted_text = match.group(1).strip()
    else:
        extracted_text = input_text.strip()

    return extracted_text
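
The helper above uses a greedy regex, so on a multi-turn transcript it returns everything after the first `<|assistant|>` tag, earlier turns included. As a possible alternative, the sketch below splits on the last tag instead. The name `extract_last_assistant_reply` is hypothetical, and the sample transcript is made up for illustration.

```python
ASSISTANT_TAG = "<|assistant|>"


def extract_last_assistant_reply(generated_text: str) -> str:
    """Return the text after the final assistant tag (hypothetical helper)."""
    # rpartition splits on the LAST occurrence of the tag. If the tag is
    # absent, the third element is the whole input, so this also serves
    # as the fallback case.
    _, _, reply = generated_text.rpartition(ASSISTANT_TAG)
    return reply.strip()


# Made-up two-turn transcript for illustration
sample = (
    "<|user|>\nHi</s>\n<|assistant|>\nAhoy!</s>\n"
    "<|user|>\nBye</s>\n<|assistant|>\nFarewell, matey."
)
print(extract_last_assistant_reply(sample))
```

For the single-turn prompt used in this notebook, both helpers should return the same text.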

## Prepare inputs

In [None]:
# Each message can have 1 of 3 roles: "system" (to provide initial instructions), "user", or "assistant". For inference, make sure "user" is the role in the final message.
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate.",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
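
The comment in the cell above states two rules: each message has one of three roles, and the final message should come from the user. A small check like the following can catch mistakes before the chat template is applied. This is only a sketch, and `validate_messages` is a hypothetical helper, not part of transformers.

```python
VALID_ROLES = {"system", "user", "assistant"}


def validate_messages(messages):
    """Check roles and final-turn ordering; return the list unchanged if valid."""
    if not messages:
        raise ValueError("messages must not be empty")
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in VALID_ROLES:
            raise ValueError(f"message {index} has unknown role: {role!r}")
    # For inference, the model should be answering a user turn
    if messages[-1]["role"] != "user":
        raise ValueError("the final message must have role 'user'")
    return messages


example = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
validate_messages(example)
print("messages look valid")
```

Calling it right before `apply_chat_template` keeps the check next to the step that depends on it.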

## Generate!

Takes approx. 40 seconds on a Colab T4.

In [None]:
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
response_text = extract_response_text(outputs[0]["generated_text"])
print(textwrap.fill(response_text, 80))