# Inference Experimentation

## Preliminaries

In [7]:
!nvidia-smi

Sun Dec 10 02:32:59 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.04              Driver Version: 546.17       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0  On |                  Off |
|  0%   37C    P8              28W / 450W |  12152MiB / 24564MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!pip install guidance

In [10]:
import random
import time
from guidance import models, gen, select

## Load the model and data

### Base Model

Quantized to 4 bits with bitsandbytes (NF4, double quantization, bfloat16 compute).

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "mistralai/Mistral-7B-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  # Mistral-7B base checkpoint
    quantization_config=bnb_config,  # 4-bit NF4 config defined above
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
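
As a rough sanity check on why 4-bit loading matters on a 24 GB card, the sketch below estimates weight storage at different precisions. The parameter count of about 7.24 billion is an assumption, and the estimate is a lower bound: it ignores quantization constants, layers that stay in higher precision, activations, and the KV cache.

```python
def estimate_weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate weight storage in GiB for a given precision."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumption: roughly 7.24 billion parameters for Mistral-7B.
MISTRAL_7B_PARAMS = 7.24e9

for label, bits in [("bf16", 16), ("int8", 8), ("nf4", 4)]:
    gib = estimate_weight_memory_gib(MISTRAL_7B_PARAMS, bits)
    print(f"{label:>5}: ~{gib:.1f} GiB")
```

The 4-bit figure is a quarter of the bf16 figure, which is what leaves headroom for generation on a GPU that already has other processes using memory.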

### LoRA

Load the fine-tuned LoRA adapter checkpoint on top of the quantized base model.

In [3]:
from peft import PeftModel

ft_model = PeftModel.from_pretrained(base_model, "mistral-discern-finetune/checkpoint-11000")

### Validation Dataset

In [6]:
from datasets import load_dataset

eval_dataset = load_dataset("json", data_files='./validation_data.jsonl', split='train')

# Inference Test

In [57]:
prompt, original_answer = random.choice(eval_dataset['document']).split("### Solution:", 1)
prompt += "### Solution:\r\n"


In [58]:
start_time = time.time()

model_input = tokenizer(prompt, return_tensors="pt").to("cuda")

ft_model.eval()
with torch.no_grad():
    # generate() returns the prompt tokens followed by the new tokens,
    # so the decoded text below starts with the prompt itself.
    output_ids = ft_model.generate(**model_input, max_new_tokens=1000, repetition_penalty=1.15)[0]
    print("Mistral Answer:")
    print(tokenizer.decode(output_ids, skip_special_tokens=True))
    print("\nOriginal Answer")
    print(original_answer)

end_time = time.time()
total_time = end_time - start_time
print(f"Time taken: {total_time} seconds")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Mistral Answer:
### Question:
You are a Student Web Activity Analyzer developed to support professionals, including Social Workers, School Psychologists, District Administrators, School Safety Specialists, and related roles. Your primary objective is to meticulously evaluate the online activity of K-12 students and identify specific indicators related to their interests and passions. For each identified indicator, provide a JSON object containing:

Presence: A value of 1 (if the indicator is present) or 0 (if not). Mark as 1 even if only part of the data aligns with an indicator.
Confidence: Provide a confidence level on a scale of 1-10 to indicate your level of certainty in the analysis.
Note: Include information on the logic used to decide that certain indicators were identified and a summary of the analyzed web activity.

Consider patterns in the data, not just individual searches.

Consider that the online activity originates from a school-issued device.

If ambiguous use your bes
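
The decoded text above begins with the prompt because, for a decoder-only model, `generate` returns the prompt tokens followed by the newly generated ones. A minimal sketch of keeping only the continuation, using plain Python lists as a stand-in for the token tensors (the helper name and the token ids are made up for illustration):

```python
from typing import List


def strip_prompt_tokens(output_ids: List[int], prompt_length: int) -> List[int]:
    """Drop the first `prompt_length` ids, keeping only the continuation."""
    if prompt_length > len(output_ids):
        raise ValueError("prompt is longer than the generated sequence")
    return output_ids[prompt_length:]


# Made-up token ids for illustration only.
prompt_ids = [1, 733, 16289]
output_ids = prompt_ids + [28705, 13, 2]

new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)
```

With the objects in this notebook, the same idea would be to slice the generated sequence at `model_input["input_ids"].shape[1]` before passing it to `tokenizer.decode`.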

# Microsoft Guidance

Guidance is a library that steers decoding by manipulating token logprobs and token selection so that the output conforms to a template.
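
To make the idea concrete, here is a toy illustration of constrained token selection. It is not Guidance's actual implementation, only a sketch of the general technique: tokens the template does not allow are masked to negative infinity, and the best remaining token is picked. The vocabulary and scores are made up.

```python
import math
from typing import Dict, Set


def constrained_pick(logprobs: Dict[str, float], allowed: Set[str]) -> str:
    """Return the most likely token among those the template allows."""
    # Mask every disallowed token so it can never be selected.
    masked = {
        tok: (lp if tok in allowed else -math.inf)
        for tok, lp in logprobs.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token in the vocabulary")
    return best


# Made-up scores: the model prefers "yes", but the template wants a digit.
step_logprobs = {"yes": -0.2, "1": -1.9, "0": -2.5, "maybe": -3.0}
digits = set("0123456789")

print(constrained_pick(step_logprobs, digits))  # prints 1
```

Fixed parts of a template need no sampling at all, which is why only the slots left open for the model cost generation steps.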

In [59]:
guidance_llm = models.Transformers(ft_model, tokenizer)

In [61]:
import time 

start_time = time.time()

def generate_number():
    return gen(regex=r'[0-9.]+', stop='"')

prompted = guidance_llm + prompt
prompted += f"""{{
  "sports-and-athletics": "{generate_number()}",
  "sports-and-athletics-confidence": "{generate_number()}",
  "environmentalism-and-sustainability": "{generate_number()}",
  "environmentalism-and-sustainability-confidence": "{generate_number()}",
  "gaming-and-e-sports": "{generate_number()}",
  "gaming-and-e-sports-confidence": "{generate_number()}",
  "college-and-career": "{generate_number()}",
  "college-and-career-confidence": "{generate_number()}",
  "cooking-and-food": "{generate_number()}",
  "cooking-and-food-confidence": "{generate_number()}",
  "reading-and-literature": "{generate_number()}",
  "reading-and-literature-confidence": "{generate_number()}",
  "writing-and-creative-writing": "{generate_number()}",
  "writing-and-creative-writing-confidence": "{generate_number()}",
  "science-and-technology": "{generate_number()}",
  "science-and-technology-confidence": "{generate_number()}",
  "mathematics-and-statistics": "{generate_number()}",
  "mathematics-and-statistics-confidence": "{generate_number()}",
  "creative-arts": "{generate_number()}",
  "creative-arts-confidence": "{generate_number()}",
  "animals-and-nature": "{generate_number()}",
  "animals-and-nature-confidence": "{generate_number()}",
  "history-and-social-studies": "{generate_number()}",
  "history-and-social-studies-confidence": "{generate_number()}",
  "note": "{gen(stop='"')}"
}}"""

end_time = time.time()
total_time = end_time - start_time
print(f"Time taken: {total_time} seconds")

Time taken: 21.464407920837402 seconds
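
Because the template stores every number as a quoted string, the result needs a conversion step before it can be used downstream. The sketch below assumes the JSON text has already been extracted from the generated string; `parse_indicators` is a hypothetical helper and the sample values are made up. Note that the pattern `[0-9.]+` can also match strings such as `1.2.3`, in which case `float` raises `ValueError`.

```python
import json
from typing import Any, Dict


def parse_indicators(json_text: str) -> Dict[str, Any]:
    """Convert string-valued template output into typed indicator records."""
    raw = json.loads(json_text)
    note = raw.pop("note", "")
    indicators = {}
    for key, value in raw.items():
        if key.endswith("-confidence"):
            continue  # handled together with its presence key
        indicators[key] = {
            "presence": int(float(value)),
            "confidence": float(raw[f"{key}-confidence"]),
        }
    return {"indicators": indicators, "note": note}


# Made-up sample in the same shape as the template, shortened to two indicators.
sample = """{
  "gaming-and-e-sports": "1",
  "gaming-and-e-sports-confidence": "8",
  "cooking-and-food": "0",
  "cooking-and-food-confidence": "6.5",
  "note": "example note"
}"""

parsed = parse_indicators(sample)
print(parsed["indicators"]["gaming-and-e-sports"])
```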
