![title](securly-banner2.jpg)

# Inference Experimentation

## Preliminaries

In [1]:
!nvidia-smi

Wed Dec 13 00:32:51 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.04              Driver Version: 546.17       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0  On |                  Off |
|  0%   40C    P8              27W / 450W |    914MiB / 24564MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [12]:
!conda install -qy scikit-learn scipy matplotlib
!pip install -q -U python-dotenv
!pip install -q -U datasets # The version in conda is broken
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U guidance



In [2]:
import random
import time
from guidance import models, gen, select

## Load the model and data

### Base Model

Quantized to 4 bits using bitsandbytes.

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "mistralai/Mistral-7B-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  # Mistral 7B base checkpoint
    quantization_config=bnb_config,  # 4-bit NF4 config defined above
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
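As a rough illustration of why 4-bit quantization matters on a 24 GiB card, the sketch below estimates weight memory at two precisions. The parameter count is an assumed round figure for a 7B-class model, and the estimate ignores quantization constants, activations, and the KV cache, so treat the result as a lower bound rather than a measurement.

```python
# Rough weight-memory estimate. PARAMS is an assumed round figure,
# not a value reported anywhere in this notebook.
PARAMS = 7.2e9


def weight_gib(params: float, bits_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return params * bits_per_param / 8 / 2**30


fp16_gib = weight_gib(PARAMS, 16)
nf4_gib = weight_gib(PARAMS, 4)
print(f"16-bit weights: {fp16_gib:.1f} GiB, 4-bit weights: {nf4_gib:.1f} GiB")
```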

### LoRA


In [4]:
from peft import PeftModel

ft_model = PeftModel.from_pretrained(base_model, "mistral-discern-finetune/checkpoint-11000")

### Validation Dataset

In [4]:
from datasets import load_dataset

eval_dataset = load_dataset("json", data_files='./validation_data.jsonl', split='train')

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

# Inference Test

In [6]:
prompt, original_answer = random.choice(eval_dataset['document']).split("### Solution:", 1)
prompt += "### Solution:\r\n"
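The split on `### Solution:` is repeated in several cells, so it may be worth wrapping in a helper. The sketch below is one possible version; `split_document` is a hypothetical name, and the example document is made up for illustration. It uses `str.partition`, which splits on the first occurrence only and makes a missing marker easy to detect.

```python
MARKER = "### Solution:"


def split_document(document: str, marker: str = MARKER) -> tuple[str, str]:
    """Split a training document into (prompt, reference answer).

    The prompt keeps the marker plus a trailing CRLF so the model
    continues from the same point it saw during fine-tuning.
    """
    prompt, found, answer = document.partition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in document")
    return prompt + marker + "\r\n", answer.strip()


# Made-up document, only to show the shape of the result.
example = '### Task:\r\nClassify this.\r\n### Solution:\r\n{"note": "x"}'
example_prompt, example_answer = split_document(example)
print(repr(example_prompt))
print(repr(example_answer))
```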


In [7]:
import time 

start_time = time.time()

model_input = tokenizer(prompt, return_tensors="pt").to("cuda")
input_length = model_input['input_ids'].shape[1]

ft_model.eval()
with torch.no_grad():
    print("Mistral Answer:")
    output_tokens = ft_model.generate(**model_input, max_new_tokens=1000, repetition_penalty=1.15)
    output_text = tokenizer.batch_decode(output_tokens[:, input_length:])[0]
    print(output_text)
    #print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))

end_time = time.time()
total_time = end_time - start_time
print(f"Time taken: {total_time} seconds")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Mistral Answer:
{  "sports-and-athletics": "0.0",  "sports-and-athletics-confidence": "10.0",  "environmentalism-and-sustainability": "0.0",  "environmentalism-and-sustainability-confidence": "10.0",  "gaming-and-e-sports": "0.0",  "gaming-and-e-sports-confidence": "10.0",  "college-and-career": "0.0",  "college-and-career-confidence": "10.0",  "cooking-and-food": "0.0",  "cooking-and-food-confidence": "10.0",  "reading-and-literature": "0.0",  "reading-and-literature-confidence": "10.0",  "writing-and-creative-writing": "0.0",  "writing-and-creative-writing-confidence": "10.0",  "science-and-technology": "0.0",  "science-and-technology-confidence": "10.0",  "mathematics-and-statistics": "0.0",  "mathematics-and-statistics-confidence": "10.0",  "history-and-social-studies": "0.0",  "history-and-social-studies-confidence": "10.0",  "creative-arts": "0.0",  "creative-arts-confidence": "10.0",  "animals-and-nature": "0.0",  "animals-and-nature-confidence": "10.0",  "note": "The search te
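The start/end timing boilerplate recurs in most cells of this notebook. A small context manager is one way to factor it out; `timed` and `timings` are hypothetical names introduced here, and the `time.sleep` call only stands in for a `generate` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print(f"{label}: {results[label]:.2f} seconds")


timings = {}
with timed("sleep demo", timings):
    time.sleep(0.05)  # placeholder for ft_model.generate(...)
```

Keeping the durations in a dictionary makes it easier to compare runs later than reading them back out of printed cell output.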

# Batch Inference

In [6]:
batch_size = 8
prompts = []
answers = []


for x in range(batch_size):
    p, a = random.choice(eval_dataset['document']).split("### Solution:")
    p += "### Solution:\r\n"

    prompts.append(p)
    answers.append(a)

tokenizer.pad_token = tokenizer.eos_token
model_input = tokenizer(prompts, padding=True, return_tensors="pt").to("cuda")
input_length = model_input['input_ids'].shape[1]

len(prompts), input_length

(8, 2329)
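The reported `input_length` is the padded width of the batch, that is, the length of the longest prompt. Slicing with `output_tokens[:, input_length:]` only isolates the generated text cleanly if every row is padded on the left, so that all prompts end in the same column. Whether the tokenizer pads left by default depends on its configuration; it can be set explicitly with `tokenizer.padding_side = "left"`. The sketch below shows the idea on plain lists, with made-up token IDs.

```python
def left_pad(sequences, pad_id):
    """Left-pad token-id lists to a common width and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


# Made-up token ids; pad_id=2 mirrors the eos id reported in the warnings.
ids, mask = left_pad([[5, 6, 7], [8]], pad_id=2)
print(ids)
print(mask)
```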

In [7]:
import time 

start_time = time.time()

ft_model.eval()
with torch.no_grad():
    output_tokens = ft_model.generate(**model_input, max_new_tokens=1000, repetition_penalty=1.15)
    output_text = tokenizer.batch_decode(output_tokens[:, input_length:])
    for index, value in enumerate(output_text):
        print(f"\n\n**** MISTRAL ANSWER {index}********:\n\n")
        print(value)

end_time = time.time()
total_time = end_time - start_time
print(f"Time taken: {total_time} seconds")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.




**** MISTRAL ANSWER 0********:


{  "sports-and-athletics": "0.0",  "sports-and-athletics-confidence": "10.0",  "environmentalism-and-sustainability": "0.0",  "environmentalism-and-sustainability-confidence": "10.0",  "gaming-and-e-sports": "0.0",  "gaming-and-e-sports-confidence": "10.0",  "college-and-career": "0.0",  "college-and-career-confidence": "10.0",  "cooking-and-food": "0.0",  "cooking-and-food-confidence": "10.0",  "reading-and-literature": "1.0",  "reading-and-literature-confidence": "5.0",  "writing-and-creative-writing": "1.0",  "writing-and-creative-writing-confidence": "7.0",  "science-and-technology": "0.0",  "science-and-technology-confidence": "10.0",  "mathematics-and-statistics": "1.0",  "mathematics-and-statistics-confidence": "9.0",  "history-and-social-studies": "0.0",  "history-and-social-studies-confidence": "10.0",  "creative-arts": "1.0",  "creative-arts-confidence": "8.0",  "animals-and-nature": "0.0",  "animals-and-nature-confidence": "10.0",  "note":

# Inference with Stopping

In [32]:
stop_token_id = tokenizer.encode("}", add_special_tokens=False)
stop_token_id

[443]

[28752, 2519]

In [35]:
from transformers import StoppingCriteriaList, StoppingCriteria

class StopAtTokenCriteria(StoppingCriteria):
    def __init__(self, stop_token_ids):
        self.stop_token_ids = stop_token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop once the most recent token of the first (only) sequence is a stop token.
        for stop_id in self.stop_token_ids:
            if input_ids[0, -1] == stop_id:
                return True
        return False

stop_token_ids = tokenizer.convert_tokens_to_ids(["}", "}\r"])

# Generate text with custom stopping criteria
stopping_criteria = StoppingCriteriaList([StopAtTokenCriteria(stop_token_ids)])
#output = model.generate(input_ids, max_length=50, stopping_criteria=stopping_criteria)

In [36]:
prompt, original_answer = random.choice(eval_dataset['document']).split("### Solution:", 1)
prompt += "### Solution:\r\n"

import time 

start_time = time.time()

model_input = tokenizer(prompt, return_tensors="pt").to("cuda")
input_length = model_input['input_ids'].shape[1]

ft_model.eval()
with torch.no_grad():
    print("Mistral Answer:")
    output_tokens = ft_model.generate(**model_input, max_new_tokens=1000, repetition_penalty=1.15, stopping_criteria=stopping_criteria)
    output_text = tokenizer.batch_decode(output_tokens[:, input_length:])[0]
    print(output_text)
    #print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))

end_time = time.time()
total_time = end_time - start_time
print(f"Time taken: {total_time} seconds")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Mistral Answer:
{  "sports-and-athletics": "0.0",  "sports-and-athletics-confidence": "10.0",  "environmentalism-and-sustainability": "0.0",  "environmentalism-and-sustainability-confidence": "10.0",  "gaming-and-e-sports": "0.0",  "gaming-and-e-sports-confidence": "10.0",  "college-and-career": "0.0",  "college-and-career-confidence": "10.0",  "cooking-and-food": "0.0",  "cooking-and-food-confidence": "10.0",  "reading-and-literature": "0.0",  "reading-and-literature-confidence": "10.0",  "writing-and-creative-writing": "0.0",  "writing-and-creative-writing-confidence": "10.0",  "science-and-technology": "0.0",  "science-and-technology-confidence": "10.0",  "mathematics-and-statistics": "1.0",  "mathematics-and-statistics-confidence": "9.0",  "history-and-social-studies": "0.0",  "history-and-social-studies-confidence": "10.0",  "creative-arts": "0.0",  "creative-arts-confidence": "10.0",  "animals-and-nature": "0.0",  "animals-and-nature-confidence": "10.0",  "note": "The search ter
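Since the goal of stopping at `}` is to get a parseable object, it can help to check each answer with a JSON parser rather than by eye. The sketch below is one way to do that, assuming the answer is the first JSON object in the decoded text; `first_json_object` is a hypothetical helper, and the sample strings are shortened from the outputs shown in this notebook.

```python
import json


def first_json_object(text: str):
    """Return the first JSON object found in `text`, or None if it is
    missing or incomplete (for example, cut off by max_new_tokens)."""
    start = text.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores anything after it.
        obj, _end = json.JSONDecoder().raw_decode(text, start)
    except json.JSONDecodeError:
        return None
    return obj


complete = '{  "creative-arts": "1.0",  "creative-arts-confidence": "8.0",  "note": "ok"}\r\nextra text'
truncated = '{  "creative-arts": "1.0",  "note": "The search ter'

print(first_json_object(complete))
print(first_json_object(truncated))
```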

# Batch Inference w/ Stopping

In [52]:
batch_size = 8
prompts = []
answers = []

for x in range(batch_size):
    p, a = random.choice(eval_dataset['document']).split("### Solution:")
    p += "### Solution:\r\n"

    prompts.append(p)
    answers.append(a)

from transformers import StoppingCriteriaList, StoppingCriteria

# Why can't I put the batch size in arguments?
class StopAtTokenCriteria(StoppingCriteria):
    def __init__(self, stop_token_ids, batch_size):
        self.stop_token_ids = stop_token_ids
        self.batch_size = batch_size
        self.flags = [False for _ in range(batch_size)]
        self.stopped = False

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Flag every batch member whose most recent token is a stop token.
        for batch_member in range(self.batch_size):
            for stop_id in self.stop_token_ids:
                if input_ids[batch_member, -1] == stop_id:
                    self.flags[batch_member] = True

        # Stop only once every sequence in the batch has produced a stop token.
        if all(self.flags):
            self.stopped = True
            return True
        else:
            return False

stop_token_ids = tokenizer.convert_tokens_to_ids(["}", "}\r"])
stop_token_ids.append(443) # I don't know why I can't get this from convert_tokens_to_ids
stopping_object = StopAtTokenCriteria(stop_token_ids, batch_size)
stopping_criteria = StoppingCriteriaList([stopping_object])

tokenizer.pad_token = tokenizer.eos_token
model_input = tokenizer(prompts, padding=True, return_tensors="pt").to("cuda")
input_length = model_input['input_ids'].shape[1]

import time 

start_time = time.time()

ft_model.eval()
with torch.no_grad():
    output_tokens = ft_model.generate(**model_input, max_new_tokens=1000, repetition_penalty=1.15, stopping_criteria=stopping_criteria)
    output_text = tokenizer.batch_decode(output_tokens[:, input_length:])

end_time = time.time()
total_time = end_time - start_time
print(f"Time taken: {total_time} seconds")
print(f"Stopped Flag: {stopping_object.stopped}")

for index, value in enumerate(output_text):
    print(f"\n\n**** MISTRAL ANSWER {index} {len(output_tokens[index])} {stopping_object.flags[index]}********:\n\n")
    print(value)    



Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Time taken: 67.65839791297913 seconds
Stopped Flag: True


**** MISTRAL ANSWER 0 2172 True********:


{  "sports-and-athletics": "0.0",  "sports-and-athletics-confidence": "10.0",  "environmentalism-and-sustainability": "0.0",  "environmentalism-and-sustainability-confidence": "10.0",  "gaming-and-e-sports": "0.0",  "gaming-and-e-sports-confidence": "10.0",  "college-and-career": "0.0",  "college-and-career-confidence": "10.0",  "cooking-and-food": "0.0",  "cooking-and-food-confidence": "10.0",  "reading-and-literature": "1.0",  "reading-and-literature-confidence": "5.0",  "writing-and-creative-writing": "0.0",  "writing-and-creative-writing-confidence": "10.0",  "science-and-technology": "1.0",  "science-and-technology-confidence": "9.0",  "mathematics-and-statistics": "0.0",  "mathematics-and-statistics-confidence": "10.0",  "history-and-social-studies": "1.0",  "history-and-social-studies-confidence": "8.0",  "creative-arts": "0.0",  "creative-arts-confidence": "10.0",  "animals-and
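Because this criterion returns `True` only after every row has produced a stop token, rows that finish early keep generating until the slowest row is done. Their decoded text can therefore run past the closing brace. One way to handle this, sketched below on plain lists, is to trim each row's generated tokens at its first stop token before decoding. `trim_at_stop` is a hypothetical helper; the stop ID 443 is the value used above, and the other IDs are made up.

```python
def trim_at_stop(token_ids, stop_ids):
    """Keep tokens up to and including the first stop token.

    Returns the sequence unchanged if no stop token occurs.
    """
    for position, token_id in enumerate(token_ids):
        if token_id in stop_ids:
            return token_ids[: position + 1]
    return token_ids


STOP_IDS = {443}  # stop id taken from the cell above

early_finisher = [10, 11, 443, 12, 13]  # kept generating after its stop token
no_stop = [10, 11, 12]

print(trim_at_stop(early_finisher, STOP_IDS))
print(trim_at_stop(no_stop, STOP_IDS))
```

With real tensors the same function could be applied to `row.tolist()` for each row of `output_tokens[:, input_length:]` before calling `tokenizer.decode`.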

# Microsoft Guidance

Guidance is a library that manipulates the log-probabilities and token selection during decoding so that the output conforms to a template.
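The core idea of constrained decoding can be shown with a toy example: at each step, candidates that would violate the template are discarded, and the best remaining one is chosen. This is only a sketch of the general technique, not how Guidance is implemented; the token scores and the `constrained_pick` helper are made up for illustration.

```python
import re


def constrained_pick(scores, is_allowed):
    """Pick the highest-scoring token among those the constraint allows."""
    allowed = {tok: score for tok, score in scores.items() if is_allowed(tok)}
    if not allowed:
        raise ValueError("no candidate token satisfies the constraint")
    return max(allowed, key=allowed.get)


# Constraint: the next token must be a digit or a decimal point.
NUMBER_CHAR = re.compile(r"[0-9.]")

# Made-up log-probabilities; "The" is the unconstrained favourite.
toy_scores = {"The": -0.1, "1": -1.2, "0": -1.5, ".": -2.0}

picked = constrained_pick(toy_scores, lambda tok: NUMBER_CHAR.fullmatch(tok) is not None)
print(picked)
```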

In [8]:
guidance_llm = models.Transformers(ft_model, tokenizer)

In [9]:
import time 

start_time = time.time()

def generate_number():
    return gen(regex=r'[0-9.]+', temperature=0.0, stop='"')

prompted = guidance_llm + prompt
prompted += f"""{{
  "sports-and-athletics": "{generate_number()}",
  "sports-and-athletics-confidence": "{generate_number()}",
  "environmentalism-and-sustainability": "{generate_number()}",
  "environmentalism-and-sustainability-confidence": "{generate_number()}",
  "gaming-and-e-sports": "{generate_number()}",
  "gaming-and-e-sports-confidence": "{generate_number()}",
  "college-and-career": "{generate_number()}",
  "college-and-career-confidence": "{generate_number()}",
  "cooking-and-food": "{generate_number()}",
  "cooking-and-food-confidence": "{generate_number()}",
  "reading-and-literature": "{generate_number()}",
  "reading-and-literature-confidence": "{generate_number()}",
  "writing-and-creative-writing": "{generate_number()}",
  "writing-and-creative-writing-confidence": "{generate_number()}",
  "science-and-technology": "{generate_number()}",
  "science-and-technology-confidence": "{generate_number()}",
  "mathematics-and-statistics": "{generate_number()}",
  "mathematics-and-statistics-confidence": "{generate_number()}",
  "creative-arts": "{generate_number()}",
  "creative-arts-confidence": "{generate_number()}",
  "animals-and-nature": "{generate_number()}",
  "animals-and-nature-confidence": "{generate_number()}",
  "history-and-social-studies": "{generate_number()}",
  "history-and-social-studies-confidence": "{generate_number()}",
  "note": "{gen(temperature=0.0, stop='"')}"
}}"""

end_time = time.time()
total_time = end_time - start_time
print(f"Time taken: {total_time} seconds")

Time taken: 17.324213981628418 seconds
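For a rough sense of scale, the two timings reported in this notebook can be put on a per-prompt basis. The comparison is only indicative: the runs used different randomly chosen prompts, each was measured once, and the batched run relied on the custom stopping criteria.

```python
# Wall-clock figures copied from the cell outputs in this notebook.
batch_seconds = 67.65839791297913      # batch of 8 with stopping criteria
prompts_in_batch = 8
guidance_seconds = 17.324213981628418  # one prompt through Guidance

per_prompt_batch = batch_seconds / prompts_in_batch
ratio = guidance_seconds / per_prompt_batch

print(f"batched generate: {per_prompt_batch:.2f} s per prompt")
print(f"guidance:         {guidance_seconds:.2f} s per prompt ({ratio:.2f}x)")
```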


# Can Guidance do batches?

It seems not: adding a list of prompts to the model raises a `TypeError`.

In [75]:
batch_size = 8
prompts = []
answers = []


for x in range(batch_size):
    p, a = random.choice(eval_dataset['document']).split("### Solution:")
    p += "### Solution:\r\n"

    prompts.append(p)
    answers.append(a)

tokenizer.pad_token = tokenizer.eos_token
model_input = tokenizer(prompts, padding=True, return_tensors="pt").to("cuda")
input_length = model_input['input_ids'].shape[1]

len(prompts), input_length

(8, 4583)

In [76]:
import time 

start_time = time.time()

def generate_number():
    return gen(regex=r'[0-9.]+', temperature=0.0, stop='"')

prompted = guidance_llm + prompts
prompted += f"""{{
  "sports-and-athletics": "{generate_number()}",
  "sports-and-athletics-confidence": "{generate_number()}",
  "environmentalism-and-sustainability": "{generate_number()}",
  "environmentalism-and-sustainability-confidence": "{generate_number()}",
  "gaming-and-e-sports": "{generate_number()}",
  "gaming-and-e-sports-confidence": "{generate_number()}",
  "college-and-career": "{generate_number()}",
  "college-and-career-confidence": "{generate_number()}",
  "cooking-and-food": "{generate_number()}",
  "cooking-and-food-confidence": "{generate_number()}",
  "reading-and-literature": "{generate_number()}",
  "reading-and-literature-confidence": "{generate_number()}",
  "writing-and-creative-writing": "{generate_number()}",
  "writing-and-creative-writing-confidence": "{generate_number()}",
  "science-and-technology": "{generate_number()}",
  "science-and-technology-confidence": "{generate_number()}",
  "mathematics-and-statistics": "{generate_number()}",
  "mathematics-and-statistics-confidence": "{generate_number()}",
  "creative-arts": "{generate_number()}",
  "creative-arts-confidence": "{generate_number()}",
  "animals-and-nature": "{generate_number()}",
  "animals-and-nature-confidence": "{generate_number()}",
  "history-and-social-studies": "{generate_number()}",
  "history-and-social-studies-confidence": "{generate_number()}",
  "note": "{gen(temperature=0.0, stop='"')}"
}}"""

end_time = time.time()
total_time = end_time - start_time
print(f"Time taken: {total_time} seconds")

TypeError: 'list' object is not callable
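Given that error, a simple fallback is to run the prompts one at a time and collect the results. The sketch below shows the loop with a stand-in function; `run_one_at_a_time` and `fake_generate` are hypothetical names, and in practice `generate_fn` would wrap the single-prompt Guidance template from the earlier cell.

```python
def run_one_at_a_time(prompts, generate_fn):
    """Apply a single-prompt generation function to each prompt in turn."""
    outputs = []
    for prompt_text in prompts:
        outputs.append(generate_fn(prompt_text))
    return outputs


def fake_generate(prompt_text):
    # Stand-in for a real single-prompt call; it just echoes in upper case.
    return prompt_text.upper()


sequential_outputs = run_one_at_a_time(["first prompt", "second prompt"], fake_generate)
print(sequential_outputs)
```

This gives up the throughput benefit of batching, so total time should scale roughly linearly with the number of prompts.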

# Jsonformer


In [77]:
!pip install jsonformer

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting jsonformer
  Downloading jsonformer-0.12.0-py3-none-any.whl (6.6 kB)
Collecting termcolor<3.0.0,>=2.3.0 (from jsonformer)
  Downloading termcolor-2.4.0-py3-none-any.whl.metadata (6.1 kB)
Downloading termcolor-2.4.0-py3-none-any.whl (7.7 kB)
Installing collected packages: termcolor, jsonformer
Successfully installed jsonformer-0.12.0 termcolor-2.4.0

In [23]:
from jsonformer import Jsonformer
from transformers import AutoModelForCausalLM, AutoTokenizer

json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

jsonformer = Jsonformer(ft_model, tokenizer, json_schema, prompts)
generated_data = jsonformer()
print(generated_data)

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
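Two things stand out in the cell above: the schema is a generic example unrelated to the classifier's output, and a list of prompts was passed where the library's README example passes a single prompt string. A schema matching the output format seen earlier could be built programmatically, as sketched below. The category names come from the model outputs in this notebook; typing the scores as `number` is an assumption, since the fine-tuned model emits them as quoted strings.

```python
CATEGORIES = [
    "sports-and-athletics",
    "environmentalism-and-sustainability",
    "gaming-and-e-sports",
    "college-and-career",
    "cooking-and-food",
    "reading-and-literature",
    "writing-and-creative-writing",
    "science-and-technology",
    "mathematics-and-statistics",
    "history-and-social-studies",
    "creative-arts",
    "animals-and-nature",
]


def build_schema(categories):
    """Build a JSON schema with a score and a confidence per category."""
    properties = {}
    for name in categories:
        properties[name] = {"type": "number"}  # assumption: numeric scores
        properties[f"{name}-confidence"] = {"type": "number"}
    properties["note"] = {"type": "string"}
    return {"type": "object", "properties": properties}


classifier_schema = build_schema(CATEGORIES)
print(len(classifier_schema["properties"]))

# Assumed usage, one prompt at a time (not run here):
# generated = Jsonformer(ft_model, tokenizer, classifier_schema, prompts[0])()
```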