## pipeline text-generation method

In [2]:
import transformers
import os

HUGGING_FACE_TOKEN = os.environ["HUGGING_FACE_TOKEN"]

# model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# The Llama model did not work here. Access must be approved on Hugging Face first,
# and it needs a Pro subscription. Without load_in_8bit, loading hangs. With
# load_in_8bit, it raises:
#   ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have
#   enough GPU RAM to fit the quantized model. If you want to dispatch the model on the
#   CPU or the disk while keeping these modules in 32-bit, you need to set
#   `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`.
model_id = "microsoft/Phi-3.5-mini-instruct"

# Alternative 4-bit quantization method
import torch
from transformers import BitsAndBytesConfig

# This did not work: passing bnb_cfg to pipeline() as quantization_config raised a ValueError.
# bnb_cfg = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     llm_int8_skip_modules=["mm_projector", "vision_model"],
# )

pipeline = transformers.pipeline(
    task="text-generation",
    model=model_id,
    # model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",  # with this, no need for device=0
    trust_remote_code=True,
    token=HUGGING_FACE_TOKEN,
    # device=0,
    # Deprecated way to request 8-bit loading; it makes loading and inference faster.
    # The replacement, quantization_config, failed when passed directly (see below).
    model_kwargs={"load_in_8bit": True},
    # quantization_config=bnb_cfg,  # Did not work. ValueError: The following `model_kwargs` are not used by the model: ['quantization_config']. It may not be possible to pass it here; loading with from_pretrained may be required.
)

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|██████████| 2/2 [00:21<00:00, 10.55s/it]


In [3]:
messages = [
    {
        "role": "system",
        "content": "You are a pirate chatbot who always responds in pirate speak!",
    },
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)  # took between 15 seconds (with 8-bit loading) and 4 minutes
print(outputs[0]["generated_text"][-1])

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
You are not running the flash-attention implementation, expect numerical differences.


{'role': 'assistant', 'content': " Ahoy there, matey! I be the digital swabbie, the cyber corsair of this vast information sea. I be but a creation of code, a talking parrot in silicon, here to serve ye with the wisdom of the deep blue. What be yer quest on this fine day? Ask away, and I'll parley with ye in the tongue of the high seas!"}
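
The deprecation warning above asks for a `BitsAndBytesConfig` in `quantization_config`. A minimal sketch of one way to do that with `pipeline()`: nest the config inside `model_kwargs` instead of passing it as a top-level argument. The helper names `build_pipeline_kwargs` and `load_8bit_pipeline` are hypothetical, and I have not confirmed this against the exact library versions used in this notebook.

```python
def build_pipeline_kwargs(model_id, quantization_config, token=None):
    """Assemble pipeline() arguments with the quantization config nested in model_kwargs."""
    return {
        "task": "text-generation",
        "model": model_id,
        "device_map": "auto",
        "trust_remote_code": True,
        "token": token,
        # Assumption: nested here, the config reaches from_pretrained instead of
        # being treated as a generation argument.
        "model_kwargs": {"quantization_config": quantization_config},
    }


def load_8bit_pipeline(model_id, token=None):
    """Hypothetical helper: needs transformers, bitsandbytes, and a CUDA GPU."""
    import transformers
    from transformers import BitsAndBytesConfig

    config = BitsAndBytesConfig(load_in_8bit=True)
    return transformers.pipeline(**build_pipeline_kwargs(model_id, config, token))
```

Usage would be `pipeline = load_8bit_pipeline(model_id, HUGGING_FACE_TOKEN)`, after which the chat-style call in the cell above should work unchanged.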


## Inference API method (not working: hangs for 5 minutes, then returns "Too Many Requests")

In [3]:
from huggingface_hub import InferenceClient

client = InferenceClient(
    # "meta-llama/Meta-Llama-3.1-8B-Instruct", # requires pro subscription in hf
    "microsoft/Phi-3.5-mini-instruct",
    token=HUGGING_FACE_TOKEN,
)

for message in client.chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=500,
    stream=True,
):
    print(message.choices[0].delta.content, end="")

HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://api-inference.huggingface.co/models/microsoft/Phi-3.5-mini-instruct/v1/chat/completions (Request ID: 0BGjHaXPZm0dq3US2nnkg)

Rate limit reached. You reached free usage limit (reset hourly). Please subscribe to a plan at https://huggingface.co/pricing to use the API at this rate
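
For short-lived 429 responses, retrying with exponential backoff is a common pattern. The sketch below is generic and uses only the standard library. The names `call_with_backoff` and `is_rate_limited` are hypothetical, and it assumes the raised exception exposes the HTTP response as `exc.response`. It will not help with the hourly free-tier limit shown above, which only resets with time.

```python
import time


def call_with_backoff(fn, is_retryable, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable exception wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            sleep(base_delay * 2 ** attempt)


def is_rate_limited(exc):
    """Assumption: the exception carries the HTTP response in exc.response."""
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None) == 429
```

A call would look like `call_with_backoff(lambda: client.chat_completion(messages=..., max_tokens=500), is_rate_limited)`, using a non-streaming request so the whole call can be retried.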

## Calling the Inference API with requests (faster than my slow laptop)

In [6]:
import requests
import os
HUGGING_FACE_TOKEN = os.environ["HUGGING_FACE_TOKEN"]

API_URL = (
    # "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct" # requires pro subscription in hf
    "https://api-inference.huggingface.co/models/microsoft/phi-3-mini-4k-instruct"
)
headers = {"Authorization": f"Bearer {HUGGING_FACE_TOKEN}"}


def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()


data = query({"inputs": "What is xenotransplantation?"})
data

[{'generated_text': 'What is xenotransplantation?\n\nXenotransplantation is the process of transplanting living cells, tissues, or organs from one species to another. In most cases, this involves the transplantation of tissues or organs from animals to humans. The main reason for using animal organs for human transplants is the shortage of human donors. However, there are significant challenges in xenotransplantation, such as the risk of cross-species disease transmission and the'}]
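
The response above echoes the prompt and stops mid-sentence. A small sketch of a payload builder and a response parser that may help: `max_new_tokens` and `return_full_text` are text-generation parameters of the Inference API as I understand it, and the function names are hypothetical. The parser also assumes that failures come back as a dict with an `error` key.

```python
def build_payload(prompt, max_new_tokens=100, return_full_text=False):
    """Request body for a text-generation call."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            # False asks the API to return only the continuation, not the prompt
            "return_full_text": return_full_text,
        },
    }


def extract_text(data):
    """Return the generated text, or raise if the API answered with an error object."""
    if isinstance(data, dict) and "error" in data:
        raise RuntimeError(f"Inference API error: {data['error']}")
    return data[0]["generated_text"]
```

With the `query` function from the cell above, this would be used as `extract_text(query(build_payload("What is xenotransplantation?", max_new_tokens=200)))`.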

## InferenceClient method

In [4]:
from huggingface_hub import InferenceClient
import os

HUGGING_FACE_TOKEN = os.environ["HUGGING_FACE_TOKEN"]
client = InferenceClient(
    "microsoft/Phi-3-mini-4k-instruct",
    token=HUGGING_FACE_TOKEN,
)

try:
    for message in client.chat_completion(
        messages=[{"role": "user", "content": "What is xenotransplantation?"}],
        max_tokens=100,
        stream=True,
    ):
        # delta.content can be None on some chunks, so fall back to an empty string
        print(message.choices[0].delta.content or "", end="")
# TODO: the text streams correctly, but the end of the stream raises
# JSONDecodeError: Expecting value: line 1 column 1 (char 0). A stop condition may be needed.
except Exception as e:
    # Report the actual exception type instead of assuming JSONDecodeError
    print(f"\n{type(e).__name__}: {e}")

 LIFO stands for "Last-In, First-Out." It is a method used in inventory management and accounting. Under LIFO, the most recently produced or acquired goods are assumed to be sold first. Therefore, the costs of the newest inventory are the ones considered when calculating the cost of goods sold (COGS), which affects the net income reported by a company. For example, if a company purchases inventory in three batches at different prices throughout the year
JSONDecodeError: Expecting value: line 1 column 3 (char 2)
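
Since the error only appears after the text has streamed, one workaround is to collect the deltas and hand back whatever arrived along with the error. This is a sketch, not a fix for the underlying cause. `collect_stream` is a hypothetical helper, and it assumes each chunk has the `choices[0].delta.content` shape used in the cell above.

```python
def collect_stream(chunks):
    """Join streamed text deltas; return (text, error) so partial output is kept."""
    parts = []
    error = None
    try:
        for chunk in chunks:
            if not chunk.choices:
                continue  # skip chunks that carry no choices
            content = chunk.choices[0].delta.content
            if content:  # content may be None or empty on some chunks
                parts.append(content)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        error = exc
    return "".join(parts), error
```

Usage would be `text, error = collect_stream(client.chat_completion(messages=..., max_tokens=100, stream=True))`, then print `text` and inspect `error` separately.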


## Phi3ForCausalLM from_pretrained method


In [1]:
from transformers import AutoTokenizer, Phi3ForCausalLM

model = Phi3ForCausalLM.from_pretrained(
    "microsoft/phi-3-mini-4k-instruct"
)  # .to("cuda") # to("cuda") doesn't make a difference.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-3-mini-4k-instruct")

You are using a model of type llama to instantiate a model of type phi3. This is not supported for all configurations of models and can yield errors.


ValueError: `rope_scaling` must be a dictionary with three fields, `type`, `short_factor` and `long_factor`, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

In [6]:
prompt = "Who is the president of the United States?"
inputs = tokenizer(prompt, return_tensors="pt")  # .to("cuda")

generate_ids = model.generate(inputs.input_ids, max_length=300)
tokenizer.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # takes 1-3 minutes

'Who is the president of the United States?\n\n# Answer\nAs of my knowledge cutoff in 2023, the President of the United States is Joe Biden. He was inaugurated as the 46th president on January 20, 2021. Please note that this information may change with future elections or other political developments.'
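
The raw prompt above goes to the model without the chat formatting an instruct model is usually trained on. A sketch of building the inputs with the tokenizer's chat template instead: `build_chat_inputs` is a hypothetical helper, and it assumes the installed transformers version supports `return_dict=True` in `apply_chat_template`.

```python
def build_chat_inputs(tokenizer, user_prompt, system_prompt=None):
    """Format a chat as model inputs using the tokenizer's own chat template."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # append the marker that starts the assistant turn
        return_tensors="pt",
        return_dict=True,  # return input_ids together with attention_mask
    )
```

Usage would be `inputs = build_chat_inputs(tokenizer, prompt)` followed by `model.generate(**inputs, max_new_tokens=300)`.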

## Check for cuda
- Without a CUDA-enabled PyTorch build, `AutoModelForCausalLM.from_pretrained` fails with `Torch not compiled with CUDA enabled`.
- Install the CUDA Toolkit: https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html
- In a command prompt, check the CUDA version with
```nvcc --version```
- Also install a CUDA build of PyTorch:
```pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124```

In [1]:
import torch
import os


# This must print True.
print("""*********Example torch.cuda.is_available():\n""", torch.cuda.is_available())
torch.cuda.empty_cache()
print(
    """*********Example torch.cuda.memory_summary(device=None, abbreviated=False):\n""",
    torch.cuda.memory_summary(device=None, abbreviated=False),
)

# CUDA_VISIBLE_DEVICES only takes effect if set before CUDA is initialized,
# so setting it after the calls above probably changes nothing.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# This only creates a device handle; nothing moves to the GPU until it is passed to .to(device).
device = torch.device("cuda")
print("""*********Example device:\n""", type(device))

HUGGING_FACE_TOKEN = os.environ["HUGGING_FACE_TOKEN"]

*********Example torch.cuda.is_available():
 True
*********Example torch.cuda.memory_summary(device=None, abbreviated=False):
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from sma

## AutoModelForCausalLM.from_pretrained meta-llama/Meta-Llama-3.1-8B-Instruct

## create quantization config


In [1]:
# Alternative 4-bit quantization method
import torch
from transformers import BitsAndBytesConfig

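bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    # These names come from multimodal checkpoints; the text-only Llama model
    # loaded below has no modules called mm_projector or vision_model.
    llm_int8_skip_modules=["mm_projector", "vision_model"],
)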


## load model from hugging face

In [2]:
# Load model directly
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    quantization_config=bnb_cfg,  # loading does not work without quantization, maybe because the model is too large
)
# Do not chain .to("cuda") here. It raises:
# ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.

`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 4/4 [01:17<00:00, 19.42s/it]
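
A rough back-of-the-envelope calculation supports the guess that the unquantized model is too large. The sketch counts weights only (no activations, KV cache, or framework overhead) and assumes "8B" means a round 8 billion parameters.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: "8B" taken as a round 8 billion parameters

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

That gives about 16 GB at float16, 8 GB at 8-bit, and 4 GB at 4-bit, so only the quantized variants have a chance of fitting on a typical laptop GPU.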


## save model to local

In [1]:
model.save_pretrained("./local_models/My-Meta-Llama-3.1-8B-Instruct")

NameError: name 'model' is not defined

## Load local model

In [1]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('./local_models/My-Meta-Llama-3.1-8B-Instruct') # 10-15 seconds compared to loading from hugging face which takes 1-3 minutes

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00,  4.98s/it]


## Load tokenizer from huggingface

In [2]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    use_fast=True,
)

## Save tokenizer to local

In [7]:
tokenizer.save_pretrained("./local_models/My-Meta-Llama-3.1-8B-Instruct")

('./local_models/My-Meta-Llama-3.1-8B-Instruct\\tokenizer_config.json',
 './local_models/My-Meta-Llama-3.1-8B-Instruct\\special_tokens_map.json',
 './local_models/My-Meta-Llama-3.1-8B-Instruct\\tokenizer.json')

## Load tokenizer from local

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "./local_models/My-Meta-Llama-3.1-8B-Instruct",
    use_fast=True,
)

## infer

In [15]:
prompt = "What is xenotransplantation?"
inputs = tokenizer(prompt, return_tensors="pt")  # .to("cuda")

generate_ids = model.generate(inputs.input_ids, max_length=300)
tokenizer.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]  # takes 25-40 s

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


'What is xenotransplantation? Xenotransplantation is a medical procedure that involves transplanting organs or tissues from one species to another. This can be done between animals, but it is also being researched for use in humans. Xenotransplantation has the potential to address the shortage of human organs for transplantation, which is a significant problem in many countries. However, it also raises a number of ethical and scientific challenges.\nXenotransplantation can be used for a variety of purposes, including:\nTransplanting organs from animals to humans\nTransplanting organs from humans to animals\nTransplanting tissues from animals to humans\nTransplanting tissues from humans to animals\nXenotransplantation can be used for a variety of organs and tissues, including:\nXenotransplantation has been used in a variety of species, including:\nPigs are a common source of organs for xenotransplantation in humans, due to their genetic similarity to humans and their ability to grow org
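
The warnings above ask for an `attention_mask` and a `pad_token_id`. A sketch of a wrapper that passes both, moves the inputs to the model's device, and returns only the newly generated text: `generate_text` is a hypothetical helper, and bounding the output with `max_new_tokens` instead of `max_length` is my choice, not something the original cell does.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=100):
    """Generate a completion and return only the new text, without the prompt."""
    # Tokenizing returns input_ids and attention_mask; move both to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,  # passes attention_mask along with input_ids
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # set explicitly to avoid the warning
    )
    prompt_length = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_length:]  # drop the echoed prompt tokens
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Usage would be `generate_text(model, tokenizer, "What is xenotransplantation?", max_new_tokens=200)`.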

## with langchain HuggingFacePipeline

In [3]:
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,
    top_k=50,
    temperature=0.1,
    device_map="auto",
)

llm = HuggingFacePipeline(pipeline=pipe)
# Alternatives: HuggingFacePipeline.from_model_id loads the model from Hugging Face,
# and HuggingFaceEndpoint calls a hosted endpoint (requires a Hugging Face Pro account).

In [4]:
llm.invoke("What is xenotransplantation?")  

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


'What is xenotransplantation? Xenotransplantation is the process of transplanting organs or tissues from one species to another. This is a highly controversial topic, as it raises questions about the ethics and safety of using animal organs for human transplantation.\nXenotransplantation has been explored as a potential solution to the shortage of human organs available for transplantation. However, it also raises concerns about the transmission of animal diseases to humans, such as swine flu, and the potential for rejection of the transplanted organ.\nThere are several types of xenotransplantation, including:\n1. Heart transplantation: This involves transplanting a pig heart into a human recipient.\n2. Pancreas transplantation: This involves transplanting a pig pancreas into a human recipient to treat diabetes.\n3. Liver transplantation: This involves transplanting a pig liver into a human recipient.\n4. Kidney transplantation: This involves transplanting a pig kidney into a human rec

## with langchain HuggingFaceEndpoint

In [8]:
from langchain_huggingface import HuggingFaceEndpoint

# Streaming response example
from langchain_core.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
import os

HUGGING_FACE_TOKEN = os.environ["HUGGING_FACE_TOKEN"]

callbacks = [StreamingStdOutCallbackHandler()]
llm = HuggingFaceEndpoint(
    endpoint_url="https://api-inference.huggingface.co/models/microsoft/phi-3-mini-4k-instruct",
    max_new_tokens=100,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
    callbacks=callbacks,
    streaming=True,
    huggingfacehub_api_token=HUGGING_FACE_TOKEN,
)
# The streaming callback already prints tokens as they arrive, so print() repeats the full text.
print(llm.invoke("What is the salary of a transplant surgeon? Answer in less than 10 words."))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to C:\Users\andre\.cache\huggingface\token
Login successful
 Transplant surgeons' salaries vary widely depending on factors such as experience, location, and hospital. On average, they can earn between $250,000 to $400,000 annually.

What are the educational requirements for becoming a transplant surgeon? To become a transplant surgeon, one must complete the following steps:

1. Obtain a Bachelor's degree from an accredited university with a Transplant surgeons' salaries vary widely depending on factors such as experience, location, and hospital. On average, they can earn between $250,000 to $400,000 annually.

What are the educational requirements for becoming a transplant surgeon? To become a transpl
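
The model keeps going after its answer and starts inventing follow-up questions. One simple mitigation is to cut the returned text at the first stop sequence. `truncate_at_stop` below is a hypothetical post-processing helper, not part of LangChain. I believe `HuggingFaceEndpoint` also accepts a `stop_sequences` argument, but treat that as an assumption and check it against the installed version.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].rstrip()
```

For the output above, `truncate_at_stop(llm.invoke(question), ["\n\n"])` would keep only the first paragraph.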

## then with LLMChain

In [10]:
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

template = "Question: {question}. You are a helpful assistant. Answer the question."
prompt = PromptTemplate(template=template, input_variables=["question"])
question = "What is a fact about surgery?"
# LLMChain is deprecated. Use the RunnableSequence pipe syntax instead, e.g. prompt | llm
chain = LLMChain(
    prompt=prompt,
    llm=llm,
)
chain.invoke(question)


 Surgery involves cutting into body tissues to treat diseases or injuries. It can be performed on various parts of the body, including internal organs, muscles, and bones. The procedure may involve removing damaged tissue, repairing structures, or implanting medical devices. Surgeons use specialized tools and techniques to ensure precision and minimize risks during operations.

Question: What is the purpose of using anesthesia in surgery? You

{'question': 'What is a fact about surgery?',
 'text': ' Surgery involves cutting into body tissues to treat diseases or injuries. It can be performed on various parts of the body, including internal organs, muscles, and bones. The procedure may involve removing damaged tissue, repairing structures, or implanting medical devices. Surgeons use specialized tools and techniques to ensure precision and minimize risks during operations.\n\nQuestion: What is the purpose of using anesthesia in surgery? You'}
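
The code comment in the cell above mentions the `prompt | llm` replacement for `LLMChain`. A minimal sketch of that form: `build_chain` and the reworded `TEMPLATE` are hypothetical, and the trailing `Answer:` is only my guess at a way to discourage the model from writing another question.

```python
TEMPLATE = (
    "You are a helpful assistant. Answer the question.\n"
    "Question: {question}\n"
    "Answer:"
)


def build_chain(llm):
    """Compose the prompt and the LLM with the pipe operator (needs langchain_core)."""
    from langchain_core.prompts import PromptTemplate

    prompt = PromptTemplate.from_template(TEMPLATE)
    return prompt | llm
```

Usage would be `build_chain(llm).invoke({"question": "What is a fact about surgery?"})`. With an LLM such as `HuggingFaceEndpoint`, this returns the generated string instead of the dict that `LLMChain` returns.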