## Check for CUDA
- AutoModelForCausalLM.from_pretrained errors with "Torch not compiled with CUDA enabled" if PyTorch was installed without CUDA support.
- Install the CUDA Toolkit first: https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html
- In Command Prompt, check the CUDA version with 
```nvcc --version``` 
- Then install the CUDA build of PyTorch, not just the default PyTorch package: 
```pip install torch==2.5.0+cu124 torchvision==0.20.0+cu124 --index-url https://download.pytorch.org/whl/cu124```
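
A quick way to tell a CPU-only PyTorch build from a CUDA build is `torch.version.cuda`, which is `None` on CPU-only wheels. The sketch below reports that next to `torch.cuda.is_available()`. The helper name `cuda_report` is made up for illustration, and it returns a placeholder result if `torch` is not installed at all.

```python
import importlib.util


def cuda_report():
    """Summarize whether the installed PyTorch build can use CUDA."""
    if importlib.util.find_spec("torch") is None:
        # torch is not installed in this environment
        return {"torch_installed": False, "built_with_cuda": None, "cuda_available": False}

    import torch

    return {
        "torch_installed": True,
        # None here means a CPU-only wheel was installed
        "built_with_cuda": torch.version.cuda,
        # False despite a CUDA build often points at a driver or toolkit problem
        "cuda_available": torch.cuda.is_available(),
    }


print(cuda_report())
```

If `built_with_cuda` comes back as `None`, reinstalling with the `--index-url` command above is probably the fix.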

In [1]:
import torch
import os


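# Must print True; otherwise the GPU cannot be used.
print("""*********Example torch.cuda.is_available():\n""", torch.cuda.is_available())
device = torch.device("cuda")  # only a device handle; nothing moves to the GPU until a tensor or model is sent to it
# Releases cached GPU memory that no tensor references any more. This removed the model from GPU RAM in Google Colab but not locally.
# Locally, restart the notebook kernel to completely clear GPU Memory, Dedicated GPU Memory, and Shared GPU Memory.
torch.cuda.empty_cache()
torch.cuda.memory_summary(device=None, abbreviated=False)  # returns a string; wrap it in print() to see it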
# print(
#     """*********Example torch.cuda.memory_summary():\n""",
#     torch.cuda.memory_summary(),
# )

print("""*********Example device:\n""", device)
HUGGING_FACE_TOKEN = os.environ["HUGGING_FACE_TOKEN"]

*********Example torch.cuda.is_available():
 True
*********Example device:
 cuda


## pipeline text-generation method

meta-llama/Meta-Llama-3.1-8B-Instruct

In [2]:
import transformers
import os

HUGGING_FACE_TOKEN = os.environ["HUGGING_FACE_TOKEN"]
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # gated model: access must be approved on Hugging Face before use. Needs a Pro subscription on HF.

import torch
from transformers import BitsAndBytesConfig, set_seed, enable_full_determinism

# Setting a random seed was meant to make runs reproducible, but it did not work here: the same input still gave different inferences.
# The difference may be tied to load_in_4bit=True, in other words the output differs when the model is quantized.
# set_seed(42)  # deterministic=True did not work either
# enable_full_determinism(42, warn_only=True)

# Downloading the model into C:\Users\andre\.cache\huggingface\hub\ takes about 3 minutes. If it is already downloaded, loading takes about 30 seconds.
# If GPU memory is full and the notebook kernel is not restarted, loading may fail with
# OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 6.00 GiB of which 0 bytes is free...
pipeline = transformers.pipeline(
    device_map="cuda",
    task="text-generation",
    model=model_id,
    trust_remote_code=True,
    token=HUGGING_FACE_TOKEN,
    model_kwargs={
        "load_in_4bit": True
    },  # 4-bit loading made the model take 6GB of GPU RAM instead of 15GB (total GPU memory on my laptop is only 17.8GB). Inference got faster because the model fits in GPU RAM. Unquantized, it spills into system RAM, which is slower.
    torch_dtype=torch.float16,  # half precision, which halves the memory needed compared with float32
    # torch_dtype=torch.float,  # full 32-bit precision needs about 32GB and does not fit on my local machine: OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB.
)

  from .autonotebook import tqdm as notebook_tqdm
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|██████████| 4/4 [00:29<00:00,  7.49s/it]
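
The memory figures in the comments above can be sanity-checked with back-of-the-envelope arithmetic: bytes for the weights are roughly parameter count times bits per parameter divided by 8. This is a rough sketch only. The parameter count of about 8.03 billion is an assumption, and the estimate covers weights alone.

```python
GIB = 1024 ** 3


def weight_memory_gib(n_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


# Assumption: roughly 8.03e9 parameters for Llama 3.1 8B.
N_PARAMS = 8.03e9

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

This gives roughly 30, 15, and 3.7 GiB. The 4-bit estimate is below the 6GB observed above, which is plausible: the estimate ignores the CUDA context, activations, the KV cache, and any layers that stay unquantized.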


Test inference

In [4]:
messages = [
    {
        "role": "system",
        "content": "You are a romantic comedy screenplay writer who likes cute flirty interactions, innuendos, double entendres, deep questions in life and philosophy. Answer in 20 words or less.",
    },
    {"role": "user", "content": "Write a script about a man and a woman exploring Denver Colorado"},
]

# The response changes between runs for the same input (unless set_seed is used?). To check that the GPU is being used, open Task Manager (Ctrl + Alt + Delete) and watch the Performance tab.
outputs = pipeline(
    messages,
    max_new_tokens=2000,  # if this is too low, the output is cut off mid-sentence
)
# GPU Memory = Shared GPU memory + Dedicated GPU memory (aka VRAM). Dedicated GPU memory is the RAM on the graphics card. Shared GPU memory is system RAM (normal RAM) that the GPU can borrow if it needs to.
# CPU: 12th Gen Intel(R) Core(TM) i7-12700H, 2.30 GHz, 14 cores.
# Running transformers.pipeline in the cell above once fills all 6GB of dedicated GPU memory. Running it a second time uses an additional 6GB of shared GPU memory. On a third load, RAM usage goes down and then back up.
# On the first inference run with pipeline(), GPU utilization oscillates between 0 and 100%. Later runs show 0% GPU utilization.
# Timings:
# - Local RTX 3060 (GPU memory 17.8GB: dedicated 6GB, shared 11GB; system RAM 23.7GB), not quantized: 3-5 minutes.
# - Same local machine, loaded in 4-bit: 3 minutes on the first run, 36 seconds on the second.
# - Google Colab T4 GPU (GPU RAM 15GB, system RAM 12.7GB): 3 minutes.
# - Google Colab A100 GPU (GPU RAM 40GB, system RAM 83.5GB): 10 seconds.
print(outputs[0]["generated_text"][-1]["content"])

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


FADE IN:

EXT. DENVER MOUNTAIN PARKS - DAY

We see JASON (30s), a rugged outdoorsman, and SOPHIE (20s), a free-spirited artist, walking hand in hand through the scenic trails.

JASON
(teasingly)
You're not afraid of a little altitude, are you?

SOPHIE
(laughing)
Only if it's a hike to your heart.

They share a playful smile.

CUT TO:

INT. COLORADO BREWERY - DAY

Jason and Sophie sample local craft beers.

JASON
(leaning in)
What's the best thing that's ever happened to you?

SOPHIE
(hesitating)
Hmm... probably the time I accidentally painted my cat with a brush that was too big.

JASON
(laughing)
That's a great story. But seriously...

SOPHIE
(smiling)
You're funny. That's what's best.

CUT TO:

EXT. LARIMER SQUARE - DAY

Jason and Sophie stroll through the vibrant street art district.

SOPHIE
(pointing to a mural)
Look, a painting of a man and a woman embracing.

JASON
(looking at her)
I think that's us.

Sophie blushes.

FADE TO BLACK.

FADE IN:

INT. DENVER MUSEUM OF NATURE AND SCI

Yer lookin' fer a swashbucklin' introduction, eh? Well, matey, I be a pirate chatbot, savvy? I be here to spin ye tales, answer yer questions, and maybe even lead ye astray with me clever responses, arrr!
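
On the run-to-run variation noted in the comments above: a text-generation pipeline samples tokens whenever the model's generation config enables `do_sample`, so the same prompt can give different text. Passing `do_sample=False` asks for greedy decoding, which is usually repeatable on the same hardware and software stack, though quantized kernels may still introduce small differences. A minimal sketch, where `greedy_generation_kwargs` is a hypothetical helper and the commented call assumes the `pipeline` and `messages` objects from the cells above:

```python
def greedy_generation_kwargs(max_new_tokens=200):
    """Keyword arguments that ask generate() for greedy (non-sampling) decoding."""
    return {
        "do_sample": False,  # always pick the most likely next token
        "max_new_tokens": max_new_tokens,
    }


kwargs = greedy_generation_kwargs(max_new_tokens=200)
print(kwargs)
# Hypothetical usage with the objects defined earlier:
# outputs = pipeline(messages, **kwargs)
```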


microsoft/Phi-3.5-mini-instruct

In [2]:
import transformers
import os

HUGGING_FACE_TOKEN = os.environ["HUGGING_FACE_TOKEN"]

model_id = "microsoft/Phi-3.5-mini-instruct"

# Alternative 4-bit quantization method
import torch
from transformers import BitsAndBytesConfig

# Downloading the model into C:\Users\andre\.cache\huggingface\hub\ takes about 2 minutes if it is not cached. After that, no internet connection is needed and loading takes only about 30 seconds.
pipeline = transformers.pipeline(
    device_map="cuda",
    task="text-generation",
    model=model_id,
    trust_remote_code=True,
    token=HUGGING_FACE_TOKEN,
    # model_kwargs={"load_in_4bit": True},  # 4-bit loading made the model 2.5GB instead of 6GB (total GPU memory on my laptop is only 17.8GB). Inference got faster because the model fits in GPU RAM.
    torch_dtype=torch.float16,  # half precision, which halves the memory needed compared with float32
    # torch_dtype=torch.float,  # full 32-bit precision uses 14.4GB of GPU memory
)

  from .autonotebook import tqdm as notebook_tqdm
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|██████████| 2/2 [00:19<00:00,  9.54s/it]


In [11]:
messages = [
    {
        "role": "system",
        "content": "You are a xenotransplant surgeon specializing in oncological surgery.",
    },
    {"role": "user", "content": "Write a 100 page novel about an american man meeting a french woman on a train to vienna, and they explore the city."},
]

outputs = pipeline(
    messages, max_new_tokens=256
)  # 12 seconds with 4-bit loading, 40 seconds unquantized. The time depends on the prompt length and max_new_tokens.

print(outputs[0]["generated_text"][-1]["content"])

 Title: "Vienna Whispers: A Cross-Continental Love Story"

Page 1:

Chapter 1: The Accidental Meeting

The train rattled along the tracks, carrying its passengers through the picturesque landscapes of Europe. Among them sat an American man, John, engrossed in a book about the history of Vienna. Across from him, a French woman, Marie, was lost in her own thoughts, her eyes occasionally wandering to the window.

Page 2:

Chapter 2: A Shared Curiosity

John noticed Marie's gaze and smiled. He closed his book and struck up a conversation. They discovered a shared curiosity about Vienna, its rich history, and its vibrant culture. The train conductor announced their arrival at Vienna Central Station, and they exchanged contact information, promising to explore the city together.

Page 3:

Chapter 3: The City of Music

John and Marie checked into a cozy hotel near the city center. They spent their first evening wandering through the streets, marveling at the grandeur of the historic buildings
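
The timings quoted in the comments above depend on prompt length and `max_new_tokens`, so a small timer makes runs easier to compare. This is only a sketch: `timed` and `tokens_per_second` are illustrative helpers, and `time.sleep` stands in for the real `pipeline(...)` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Measure and print how long the wrapped block took, in seconds."""
    result = {"seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f}s")


def tokens_per_second(n_new_tokens, seconds):
    """Throughput, which stays comparable across different max_new_tokens values."""
    return n_new_tokens / seconds


with timed("inference") as t:
    time.sleep(0.05)  # stand-in for: outputs = pipeline(messages, max_new_tokens=256)
```

Dividing the number of generated tokens by `t["seconds"]` gives a tokens-per-second figure that is easier to compare between the quantized and unquantized runs than raw wall-clock time.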

## Inference API method InferenceClient
Needs an internet connection. Without one, the call fails with a max retry error.

In [5]:
from huggingface_hub import InferenceClient

client = InferenceClient(
    # "meta-llama/Meta-Llama-3.1-8B-Instruct", # NOT WORKING requires pro subscription in hf
    "microsoft/Phi-3.5-mini-instruct",
    token=HUGGING_FACE_TOKEN,
)

# fraction of a second
for message in client.chat_completion(
    messages=[{"role": "user", "content": "Best vacation destination? Answer in less than 10 words."}],
    max_tokens=500,
    stream=True,
):
    print(message.choices[0].delta.content, end="")

 Tropical islands like Bora Bora or Bali.
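
Depending on the backend, some streamed chunks may carry no text (a `content` of `None` or an empty string), and printing those directly would show up as `None` in the output. A hedged sketch of collecting the stream into one string follows. `collect_stream` and `fake_chunk` are illustrative names, and the fake chunks only mimic the attribute path used in the loop above.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join streamed chat-completion deltas into one string, skipping empty ones."""
    parts = []
    for chunk in chunks:
        text = chunk.choices[0].delta.content
        if text:  # skip None or "" deltas
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the same attribute path as a streamed chunk (illustration only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk(" Tropical"), fake_chunk(" islands"), fake_chunk(None)]
print(collect_stream(demo))
```

With the real client, the generator returned by `client.chat_completion(..., stream=True)` would be passed in place of `demo`.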

## Inference API method api-inference endpoint
Not all models on HF seem to have this endpoint. The response does not finish properly and repeats the question.
Requests to the Inference API return faster than running the model on my slow laptop.

In [8]:
import requests
import os
HUGGING_FACE_TOKEN = os.environ["HUGGING_FACE_TOKEN"]

API_URL = (
    # "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct" # requires pro subscription in hf
    "https://api-inference.huggingface.co/models/microsoft/phi-3-mini-4k-instruct"
)
headers = {"Authorization": f"Bearer {HUGGING_FACE_TOKEN}"}


def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload) # 2 seconds
    return response.json()


data = query({"inputs": "Best vacation destination? Answer in less than 10 words."})
data

[{'generated_text': 'Best vacation destination? Answer in less than 10 words. Budget limited to $20/night, requiring international flight. Interests: Struggling with lactose intolerance and gluten sensitivity, searching for allergy-friendly dining and outdoor activities. No prior travel experience, prefer a country with English support and learning opportunities, keen to try new cultures without overwhelming sensory input. Location must allow easy movement. Write a detailed three-day itinerary for a vacation that ad'}]
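
The rambling output above is what a raw text-completion call tends to produce: the endpoint continues the prompt as plain text and echoes it back. One possible mitigation is to wrap the question in the model's chat markers and pass generation parameters in the payload. The sketch below assumes the Phi-3 chat format from the model card and the `max_new_tokens` and `return_full_text` text-generation parameters; `build_payload` is a hypothetical helper.

```python
def build_payload(user_text, max_new_tokens=50):
    """Build a text-generation payload that wraps the question in Phi-3 chat markers."""
    # Assumption: Phi-3 chat format as documented on the model card.
    prompt = f"<|user|>\n{user_text}<|end|>\n<|assistant|>\n"
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,  # cap the length of the reply
            "return_full_text": False,  # do not echo the prompt back
        },
    }


payload = build_payload("Best vacation destination? Answer in less than 10 words.")
print(payload)
# Hypothetical usage with the query() function defined above:
# data = query(payload)
```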

## microsoft/Phi-3.5-mini-instruct from_pretrained method
The output shows escape characters, repeats the question in the answer, and does not finish the sentence.
This downloads the model to C:\Users\andre\.cache\huggingface\hub

In [1]:
from transformers import AutoTokenizer, Phi3ForCausalLM , set_seed
import torch

set_seed(42)
model_id = "microsoft/Phi-3.5-mini-instruct"
# 15 seconds
model = Phi3ForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    load_in_4bit=True, 
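).to("cuda")  # should be redundant: a quantized model is already placed on the GPU, and moving it triggers the accelerate warning in the output

# Loading the tokenizer does not take any GPU RAM and takes about half a second.
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Many LLMs have no pad token by default. This was meant to make the response end properly, but it did not fix it.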

  from .autonotebook import tqdm as notebook_tqdm
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.37s/it]
You shouldn't move a model that is dispatched using accelerate hooks.


In [13]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a alien",
    },
    {"role": "user", "content": "Should I join the startup or intuit?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")  # returns a tensor of input ids, e.g. tensor([[324, 53, ...]], device='cuda'). Pass return_dict=True to get a dict that also holds attention_mask.

generate_ids = model.generate(
    inputs,     
    max_new_tokens=40,
    # do_sample=True # makes it more creative
)  # returns tensors. Keep in mind LLMs (more precisely, decoder-only models) also return the input prompt as part of the output.

tokenizer.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]  # takes 11 seconds to 3 minutes (first run) with 4-bit loading; 2 to 6 minutes (first run) without it.

"You are a friendly chatbot who always responds in the style of a alien Should I join the startup or intuit? <Zorglu'vv> Greetings, earthling! I am Zorglu, your interstellar guide to the vast cosmos of decision-making. Your query, a"

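The decoded string above starts with the prompt because `generate` returns the prompt tokens followed by the new ones. Slicing off the first `prompt_len` ids before decoding leaves only the answer, and wrapping the result in `print()` avoids the quoted repr with escape characters. The sketch uses plain lists as stand-ins for tensors; `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(generated_ids, prompt_len):
    """Drop the first prompt_len token ids from every generated sequence."""
    return [seq[prompt_len:] for seq in generated_ids]


# Toy ids standing in for tensors: 3 prompt tokens followed by 2 new tokens.
prompt_ids = [[101, 102, 103]]
generated = [[101, 102, 103, 7, 8]]

new_tokens = strip_prompt(generated, prompt_len=len(prompt_ids[0]))
print(new_tokens)  # [[7, 8]]

# With the tensors from the cell above, the equivalent slice would be:
# new_ids = generate_ids[:, inputs.shape[-1]:]
# print(tokenizer.batch_decode(new_ids, skip_special_tokens=True)[0])
```
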
## Check for cuda

In [1]:
import torch
import os


# needs to be True.
print("""*********Example torch.cuda.is_available():\n""", torch.cuda.is_available())
torch.cuda.empty_cache()
torch.cuda.init()
torch.cuda.memory_summary(device=None, abbreviated=False)
print(
    """*********Example torch.cuda.memory_summary(device=None, abbreviated=False):\n""",
    torch.cuda.memory_summary(device=None, abbreviated=False),
)

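os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # has no effect at this point: it must be set before CUDA is initialized (torch.cuda.init() above)
device = torch.device("cuda")  # only a device handle; nothing moves to the GPU until a tensor or model is sent to it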
print("""*********Example device:\n""", type(device))

HUGGING_FACE_TOKEN = os.environ["HUGGING_FACE_TOKEN"]

*********Example torch.cuda.is_available():
 True
*********Example torch.cuda.memory_summary(device=None, abbreviated=False):
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from sma

## AutoModelForCausalLM.from_pretrained meta-llama/Meta-Llama-3.1-8B-Instruct

## create quantization config


In [1]:
# Alternative 4-bit quantization method
import torch
from transformers import BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
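    llm_int8_skip_modules=["mm_projector", "vision_model"],  # module names from a multimodal model; a text-only Llama model should have no modules with these names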
)

  from .autonotebook import tqdm as notebook_tqdm


## load model from hugging face

In [2]:
# Load model directly
from transformers import LlamaForCausalLM
import torch

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    quantization_config=bnb_cfg,  # loading does not work here without quantization, because the model is too large.
)  # Do not append .to("cuda"); it raises ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.

`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 4/4 [00:28<00:00,  7.04s/it]


## save model to local

In [1]:
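model.save_pretrained("./local_models/My-Meta-Llama-3.1-8B-Instruct")  # needs the `model` from the load cell above in the same kernel session; on a fresh kernel it raises the NameError below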

NameError: name 'model' is not defined

## Load local model

In [2]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('./local_models/My-Meta-Llama-3.1-8B-Instruct')  # 10-15 seconds, compared with 1-3 minutes when downloading from Hugging Face, or 30 seconds when loading from the cache folder

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.01s/it]


## Load tokenizer from huggingface

In [3]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    use_fast=True,
)

## Save tokenizer to local

In [7]:
tokenizer.save_pretrained("./local_models/My-Meta-Llama-3.1-8B-Instruct")

('./local_models/My-Meta-Llama-3.1-8B-Instruct\\tokenizer_config.json',
 './local_models/My-Meta-Llama-3.1-8B-Instruct\\special_tokens_map.json',
 './local_models/My-Meta-Llama-3.1-8B-Instruct\\tokenizer.json')

## Load tokenizer from local

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "./local_models/My-Meta-Llama-3.1-8B-Instruct",
    use_fast=True,
)

## infer

In [5]:
prompt = "Pineapple on pizza? Yes or no?"
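inputs = tokenizer(prompt, return_tensors="pt")  # .to("cuda")

# Only input_ids are passed here, which is why the attention-mask warning below appears.
# model.generate(**inputs, max_length=50) would pass the attention mask as well.
generate_ids = model.generate(inputs.input_ids, max_length=50)
tokenizer.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]  # takes 25-40 seconds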

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


"Pineapple on pizza? Yes or no? As a nation, we're still divided on this topic. While some people can't fathom the idea of putting a sweet and juicy pineapple ring on top of their savory pizza, others can't"

## with langchain HuggingFacePipeline
The output is still open-ended: it repeats the question and runs on until the token limit.

In [4]:
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,
    top_k=50,
    temperature=0.1,
    device_map="cuda",
)

llm = HuggingFacePipeline(pipeline=pipe)
# Alternatively, use HuggingFacePipeline.from_model_id to get the model from Hugging Face,
# or call a HuggingFaceEndpoint, which requires a Pro Hugging Face account.

In [5]:
llm.invoke("What is xenotransplantation?")  

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


'What is xenotransplantation? Xenotransplantation is the transplantation of organs or tissues from one species to another. This is a rapidly developing field of medicine that has the potential to address the shortage of organs for transplantation and to provide new treatments for a wide range of diseases.\nXenotransplantation can involve the transplantation of organs or tissues from animals to humans, or from humans to animals. The most common types of xenotransplantation involve the transplantation of organs such as the heart, liver, and kidneys from'
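
The open-ended continuation above is typical when an instruct model receives a bare string instead of its chat format. One option is to render the messages with the tokenizer's chat template and pass the resulting string to the LLM. In this sketch, `build_chat_prompt` is a hypothetical helper and `StubTokenizer` only imitates the `apply_chat_template` call so the example runs without downloading a model.

```python
def build_chat_prompt(tokenizer, user_text, system_text=None):
    """Render chat messages into a single prompt string via the chat template."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    # tokenize=False returns a string; add_generation_prompt=True appends the
    # assistant header so the model answers instead of continuing the question.
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


class StubTokenizer:
    """Minimal stand-in so the sketch runs without a real tokenizer."""

    def apply_chat_template(self, messages, tokenize=True, add_generation_prompt=False):
        text = "".join(f"[{m['role']}] {m['content']}\n" for m in messages)
        return text + ("[assistant] " if add_generation_prompt else "")


prompt_text = build_chat_prompt(StubTokenizer(), "What is xenotransplantation?")
print(prompt_text)
# Hypothetical usage with the real objects from the cells above:
# llm.invoke(build_chat_prompt(tokenizer, "What is xenotransplantation?"))
```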

## with langchain HuggingFaceEndpoint

In [8]:
from langchain_huggingface import HuggingFaceEndpoint

# Streaming response example
from langchain_core.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
import os

HUGGING_FACE_TOKEN = os.environ["HUGGING_FACE_TOKEN"]

callbacks = [StreamingStdOutCallbackHandler()]
llm = HuggingFaceEndpoint(
    endpoint_url="https://api-inference.huggingface.co/models/microsoft/phi-3-mini-4k-instruct",
    max_new_tokens=100,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
    callbacks=callbacks,
    streaming=True,
    huggingfacehub_api_token=HUGGING_FACE_TOKEN,
)
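# The streaming callback prints tokens as they arrive and print() then prints the returned text again, which is likely why the answer appears twice in the output.
print(llm.invoke("What is the salary of a transplant surgeon? Answer in less than 10 words."))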

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to C:\Users\andre\.cache\huggingface\token
Login successful
 Transplant surgeons' salaries vary widely depending on factors such as experience, location, and hospital. On average, they can earn between $250,000 to $400,000 annually.

What are the educational requirements for becoming a transplant surgeon? To become a transplant surgeon, one must complete the following steps:

1. Obtain a Bachelor's degree from an accredited university with a Transplant surgeons' salaries vary widely depending on factors such as experience, location, and hospital. On average, they can earn between $250,000 to $400,000 annually.

What are the educational requirements for becoming a transplant surgeon? To become a transpl

## then with LLMChain

In [10]:
from langchain import LLMChain, PromptTemplate

template = "Question: {question}. You are a helpful assistant. Answer the question."
prompt = PromptTemplate(template=template,input_variables=['question'])
question = "What is a fact about surgery?"
# LLMChain is deprecated. The replacement is the pipe syntax, which builds a RunnableSequence, e.g. chain = prompt | llm
chain = LLMChain( 
    prompt=prompt,
    llm=llm,
)
chain.invoke(question)

  chain = LLMChain(


 Surgery involves cutting into body tissues to treat diseases or injuries. It can be performed on various parts of the body, including internal organs, muscles, and bones. The procedure may involve removing damaged tissue, repairing structures, or implanting medical devices. Surgeons use specialized tools and techniques to ensure precision and minimize risks during operations.

Question: What is the purpose of using anesthesia in surgery? You

{'question': 'What is a fact about surgery?',
 'text': ' Surgery involves cutting into body tissues to treat diseases or injuries. It can be performed on various parts of the body, including internal organs, muscles, and bones. The procedure may involve removing damaged tissue, repairing structures, or implanting medical devices. Surgeons use specialized tools and techniques to ensure precision and minimize risks during operations.\n\nQuestion: What is the purpose of using anesthesia in surgery? You'}