This script runs the causal language model for inference in a Google Colab notebook. It can load any of the save formats produced by the Unsloth training script. Point model_name at the folder that holds the saved files, not at the safetensors file or any other individual file. A CUDA-enabled GPU is required.
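The folder-versus-file rule can be checked before loading. The sketch below is a minimal example, assuming the save follows the usual Hugging Face layout (an adapter_config.json for LoRA adapters, or a config.json for a merged model). The helper resolve_model_dir is hypothetical and not part of Unsloth.

```python
from pathlib import Path

# File names assumed to mark a saved model folder (usual Hugging Face layout).
MARKER_FILES = ("adapter_config.json", "config.json")

def resolve_model_dir(path):
    """Return the folder to pass as model_name, given a folder or a file inside it."""
    p = Path(path)
    # If a file such as adapter_model.safetensors was given, step up to its folder.
    folder = p.parent if p.is_file() else p
    if not any((folder / name).is_file() for name in MARKER_FILES):
        raise FileNotFoundError(f"No adapter_config.json or config.json in {folder}")
    return str(folder)
```

Calling resolve_model_dir on the path you plan to use either returns the folder to hand to from_pretrained or fails early with a readable error.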

Install the dependencies that Unsloth needs. If this cell starts failing even though you have not changed it, check the Xformers version first. Xformers is updated frequently, so an install that pulls the most recent release may break, and you may need to pin an older version.

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# Keep an eye on the Xformers pin: the newest release often causes errors, so one version behind the latest is usually safer.
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
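Because %%capture hides the pip output, it can help to print which versions were actually installed before debugging an Xformers problem. This is a small sketch using only the standard library; installed_version is a hypothetical helper name, and the package list simply mirrors the install cell.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Packages installed or pinned by the install cell, plus torch.
for name in ("unsloth", "xformers", "trl", "peft", "accelerate", "bitsandbytes", "torch"):
    print(f"{name:>14}: {installed_version(name) or 'not installed'}")
```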

First, load the LoRA adapters or the model with the same data type (dtype) you used during training. Then redefine the alpaca_prompt variable exactly as it was defined during training.

In [None]:

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/content/drive/MyDrive/Dog-LoRA", # Folder with the adapters or model you saved after training
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Switch the model to inference mode

alpaca_prompt = """

### label:
{}

### text:
{}"""

# You can also load the adapters with the AutoPeftModelForCausalLM class from HF, but it is significantly slower, so it is NOT recommended
"""
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model", # YOUR MODEL YOU USED FOR TRAINING
    load_in_4bit = load_in_4bit,
)
tokenizer = AutoTokenizer.from_pretrained("lora_model")
"""

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.43.4.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


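The template serves two purposes: it builds the prompt, and its "### text:" header marks where the generated answer starts in a decoded output. Here is a minimal sketch of both uses. The helpers build_prompt and extract_text are hypothetical, and the special-token strings are those of a Llama 3 tokenizer, which is an assumption for other base models.

```python
# Same template as in the loading cell.
alpaca_prompt = """

### label:
{}

### text:
{}"""

def build_prompt(label):
    """Fill the label slot and leave the text slot empty so the model completes it."""
    return alpaca_prompt.format(label, "")

def extract_text(decoded, special_tokens=("<|begin_of_text|>", "<|end_of_text|>")):
    """Return only the part of a decoded output that follows the '### text:' header."""
    for token in special_tokens:
        decoded = decoded.replace(token, "")
    # partition keeps everything after the first '### text:' marker;
    # the result is an empty string if the marker is absent.
    _, _, generated = decoded.partition("### text:")
    return generated.strip()

prompt = build_prompt("Please tell me something interesting about the Rottweiler Dog")
print(repr(prompt))
```

Passing skip_special_tokens=True to tokenizer.batch_decode is another way to drop the special tokens before splitting on the header.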

Now that the adapters and model are loaded in the same format, run inference with the cell below. Put your prompt in the label slot, which is the first argument to alpaca_prompt.format. The call generates at most 128 new tokens (max_new_tokens = 128).

In [None]:


labels = tokenizer(
[
    alpaca_prompt.format(
        "Please tell me something interesting about the Rottweiler Dog", # label
        "", # text - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

texts = model.generate(**labels, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(texts)

# The code below shows some modifications you can make to control the randomness of the responses.
# You can also add a streamer so that tokens are printed as they are generated, instead of all at once.
"""
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
temperature = 10.0 # Must be a positive float; lower reduces randomness. This one gets wild if too high :)
top_k = 2  # Positive integer only; lower reduces randomness.
top_p = 0.9  # Float in (0, 1]; lower reduces randomness. Values above 1 are rejected.

# Modify the model.generate line as follows (the 'streamer' prints the text as it is generated, instead of all at once)
_ = model.generate(**labels, streamer=text_streamer, max_new_tokens=128, temperature=temperature, top_k=top_k, top_p=top_p, do_sample=True)
"""


['<|begin_of_text|>\n\n### label:\nPlease tell me something interesting about the Rottweiler Dog\n\n### text:\nRottweiler: Rottweilers are a medium to large sized dog. They are a powerful dog, but are also a very muscular dog. They have a square appearance, with a broad chest and wide shoulders. Their legs are strong, and they have a powerful tail that they carry low.<|end_of_text|>']
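To see how temperature, top_k and top_p interact, the sketch below applies them to a toy list of logits in pure Python. It is a simplified illustration under the assumption that the filters run in the order temperature, top-k, top-p. It is not the transformers implementation, and filter_distribution is a hypothetical name.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return renormalized next-token probabilities after temperature, top-k and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be a positive float")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Temperature rescales logits: below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # top-k: keep only the k most likely tokens
    kept_mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:  # top-p: stop once the kept mass reaches top_p
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.0]
print(filter_distribution(toy_logits, temperature=10.0))  # nearly uniform
print(filter_distribution(toy_logits, top_k=2))           # only the two best tokens remain
```

Lowering any of the three settings shrinks or sharpens the set of candidate tokens, which is why each one reduces randomness.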

If you didn't get a chance to push to Hugging Face while training, you can do so now.

In [None]:
model.push_to_hub("HF_username/LoRA_adapter_name", token = "HF_Token") # Online saving; you can create a token under your HF account settings. LoRA_adapter_name is the name you want to use to upload to HF.
tokenizer.push_to_hub("HF_username/LoRA_adapter_name", token = "HF_Token") # Same as model.push_to_hub

README.md:   0%|          | 0.00/578 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/chrismontes/Dog-LoRA
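If you would rather not paste the token into the notebook, one option is to read it from an environment variable. This is only a sketch: HF_TOKEN is an assumed variable name that you would have to set yourself, and hub_repo_id is a hypothetical helper for building the repository identifier.

```python
import os

def hub_repo_id(username, repo_name):
    """Build the 'username/repo_name' identifier passed to push_to_hub."""
    if not username or not repo_name or "/" in username or "/" in repo_name:
        raise ValueError("username and repo_name must be non-empty and contain no '/'")
    return f"{username}/{repo_name}"

# HF_TOKEN is an assumed environment variable name; the value is None if it is not set.
hf_token = os.environ.get("HF_TOKEN")
repo_id = hub_repo_id("HF_username", "LoRA_adapter_name")

# With model and tokenizer loaded as in the earlier cells, the push would then read:
# model.push_to_hub(repo_id, token = hf_token)
# tokenizer.push_to_hub(repo_id, token = hf_token)
```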


In [None]:
"""#If you want to mount Google Drive from Python
from google.colab import drive
drive.mount('/content/drive')"""

In [None]:
"""#If you zipped your adapters or model, here is a way to unzip them. -d sets the directory to unzip into
!unzip '/content/drive/MyDrive/lora_model.zip' -d '/content/drive/MyDrive/'"""


Archive:  /content/drive/MyDrive/lora_model.zip
   creating: /content/drive/MyDrive/content/lora_model/
  inflating: /content/drive/MyDrive/content/lora_model/special_tokens_map.json  
  inflating: /content/drive/MyDrive/content/lora_model/tokenizer_config.json  
  inflating: /content/drive/MyDrive/content/lora_model/adapter_model.safetensors  
  inflating: /content/drive/MyDrive/content/lora_model/tokenizer.json  
  inflating: /content/drive/MyDrive/content/lora_model/adapter_config.json  
  inflating: /content/drive/MyDrive/content/lora_model/README.md  
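Note that the archive keeps the path it was zipped with, so the adapters above ended up in /content/drive/MyDrive/content/lora_model/ rather than directly under MyDrive. The sketch below does the same extraction with Python's zipfile module and returns the folder that actually holds adapter_config.json, which is the folder to pass as model_name. The helper name unzip_and_find_adapter is hypothetical.

```python
import zipfile
from pathlib import Path

def unzip_and_find_adapter(zip_path, dest_dir):
    """Extract the archive and return the folder that contains adapter_config.json."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        # Look the file up in the archive listing instead of assuming it sits at the top level.
        members = [n for n in archive.namelist() if n.endswith("adapter_config.json")]
    if not members:
        raise FileNotFoundError(f"adapter_config.json not found in {zip_path}")
    return str((Path(dest_dir) / members[0]).parent)

# Example call with the paths used in the cell above:
# adapter_dir = unzip_and_find_adapter("/content/drive/MyDrive/lora_model.zip", "/content/drive/MyDrive/")
```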
