<a href="https://colab.research.google.com/github/davidharrisnet/genai/blob/main/8bit_llama2_7b_chat_hf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
!pip install --upgrade pip --quiet


In [10]:
!pip install transformers accelerate einops  xformers bitsandbytes --quiet


In [11]:
import os
from torch import cuda, bfloat16
from time import time
import transformers
import torch

In [12]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [13]:
model_id = os.path.join("/content","drive","My Drive", "models", "Llama-2-7b-chat-hf", "snapshots", "1")

In [14]:
os.path.exists(model_id)


True
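
`os.path.exists` only confirms that the snapshot directory is there. A slightly stricter check, sketched below, also looks for the files the loaders read from it. The helper name `missing_model_files` and the default `required` tuple are assumptions made for illustration; `config.json` is the one file `from_pretrained` always needs in a local directory, and the list can be extended to match whatever the snapshot actually contains.

```python
import os


def missing_model_files(model_dir, required=("config.json",)):
    """Return the names in `required` that are not regular files in `model_dir`.

    Hypothetical helper: the default `required` tuple is an assumption and
    should be adjusted to the files present in the snapshot being used.
    """
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


# Example call (model_id is the path built in the cell above):
# missing_model_files(model_id)
```

An empty list means every listed file was found; anything else names what is missing before the slower model load is attempted.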

In [15]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [16]:
tokenizer = AutoTokenizer.from_pretrained(model_id)


In [17]:
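# Load the weights in 8-bit with bitsandbytes; device_map="auto" lets accelerate place the layers.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=transformers.BitsAndBytesConfig(load_in_8bit=True))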

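
To see why 8-bit loading matters on a Colab GPU, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a rough sketch: the parameter count is the nominal "7B" (an assumption, the real count differs slightly), and the measured footprint will be higher because some modules stay in higher precision and activations and the KV cache also take memory.

```python
def estimate_weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Nominal parameter count for a "7B" model (assumption, not read from the checkpoint).
NUM_PARAMS = 7_000_000_000

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8)]:
    gib = estimate_weight_memory_gib(NUM_PARAMS, bits)
    print(f"{label:>8}: {gib:.1f} GiB")
```

The estimate comes to roughly 13 GiB in float16 versus about 6.5 GiB in 8-bit. For the loaded model, `model.get_memory_footprint()` reports the measured size in bytes.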

In [18]:
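time_1 = time()
# The model is already loaded in 8-bit and placed by accelerate, so it is passed
# in as an object; torch_dtype and device_map matter when the pipeline loads a
# model from a name or path.
query_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")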

Prepare pipeline: 2.853 sec.
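
The `time_1` / `time_2` pattern recurs in this notebook, so it could be wrapped in a small context manager. The sketch below is one possible way to do that; the name `timed` is hypothetical, and it uses `perf_counter` rather than `time` because it is intended for measuring intervals.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label):
    """Print how long the enclosed block took; the duration is also stored
    under result["seconds"] so the caller can reuse it."""
    result = {}
    start = perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = perf_counter() - start
        print(f"{label}: {result['seconds']:.3f} sec.")


# Example: time an arbitrary block of work.
with timed("Sum of squares") as t:
    total = sum(i * i for i in range(100_000))
```

The same wrapper would fit around the pipeline construction or a generation call, replacing the paired `time()` calls.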


In [19]:
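def test_model(tokenizer, pipeline, prompt_to_test):
    """
    Run a single query through the pipeline and print the result.

    Args:
        tokenizer: the tokenizer (supplies the end-of-sequence token id)
        pipeline: the text-generation pipeline
        prompt_to_test: the prompt
    Returns:
        None
    """
    # adapted from https://huggingface.co/blog/llama2#using-transformers
    time_1 = time()
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,   # sample instead of greedy decoding
        top_k=10,         # sample only among the 10 most likely next tokens
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=2000,  # upper bound on prompt tokens plus generated tokens
    )
    time_2 = time()
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")
    for seq in sequences:
        # generated_text contains the prompt followed by the completion
        print(f"Result: {seq['generated_text']}")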
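
The `do_sample=True, top_k=10` setting restricts each sampling step to the ten highest-scoring next tokens. The pure-Python sketch below illustrates that idea on a plain list of logits. It is not the transformers implementation, which works on tensors; the function name `top_k_sample` and the toy logits are assumptions made for illustration.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample an index from `logits`, restricted to the k highest-scoring entries."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits (subtract the max for numerical stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one index in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]


# Toy example: with k=2 only indices 0 and 1 can ever be returned.
toy_logits = [3.0, 2.0, -1.0, 0.0]
print(top_k_sample(toy_logits, k=2))
```

With `k=1` this reduces to greedy decoding, while a larger `k` admits less likely tokens and makes the output more varied.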

In [20]:


test_model(tokenizer,
           query_pipeline,
           "As a General in the Moldovan army, provide a 500-word detailed description of a Russian invasion of Moldova.")



Test inference: 104.718 sec.
Result: As a General in the Moldovan army, provide a 500-word detailed description of a Russian invasion of Moldova.

As a general in the Moldovan army, I am deeply concerned about the recent escalation of tensions between Moldova and Russia. The annexation of Crimea by Russia in 2014 and the ongoing conflict in eastern Ukraine have left Moldova vulnerable to potential aggression from our larger neighbor.

In the event of a Russian invasion of Moldova, my primary responsibility as a military leader would be to protect the country and its citizens from harm. This would involve a comprehensive strategy that incorporates both defensive and offensive measures.

First and foremost, I would ensure that our military is properly equipped and trained to handle any potential Russian aggression. This would involve upgrading our weaponry and equipment, as well as conducting regular training exercises to prepare our troops for the worst-case scenario.

In the event of 