In this notebook, we use 4-bit quantization to run the Llama 2 7B Chat model. The code uses only 10 GB of VRAM, so it can run on a free Google Colab instance or on a local GPU (e.g., an RTX 3060 12GB).
[More details here.](https://open.substack.com/pub/kaitchup/p/run-llama-2-chat-models-on-your-computer?r=2kp66c&utm_campaign=post&utm_medium=web)
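
As a rough sanity check on the memory figure, the size of the weights can be estimated from the parameter count and the number of bits per weight. The sketch below assumes a nominal 7 billion parameters and counts weights only; activations, the KV cache, and CUDA overhead come on top, so actual usage is higher. The name `weight_memory_gb` is illustrative and not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # nominal parameter count for a "7B" model (assumption)

# Compare half precision with the two quantized settings used in this notebook.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB of weights")
```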


We only need the following libraries:


*   transformers
*   accelerate (for device_map)
*   bitsandbytes (for 4-bit quantization)
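
Before loading a model, it can help to confirm which of these libraries are already present in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["transformers", "accelerate", "bitsandbytes"]).items():
    print(f"{name}: {ver or 'not installed'}")
```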




In [1]:
!pip install transformers accelerate bitsandbytes

Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Installing collected packages: bitsandbytes, accelerate
Successfully installed accelerate-0.27.2 bitsandbytes-0.42.0


In [6]:
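globals().pop("model", None)  # drop a model left over from an earlier run; a plain `del model` raises NameError on a fresh kernel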

Note that to run the following code, you must have been granted access to Llama 2's weights and have a Hugging Face access token. Instructions are on the model card on the Hugging Face Hub: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
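
Rather than pasting the token into a cell, one option is to read it from an environment variable so it never ends up in a saved notebook. This is a minimal sketch; the helper name `get_hf_token` is hypothetical, and the variable name `HF_TOKEN` is an assumption that you can change to whatever your setup uses.

```python
import os


def get_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the access token stored in `var_name`, or raise if it is not set."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable to your access token.")
    return token


# Demonstration with a placeholder mapping instead of the real environment.
print(get_hf_token({"HF_TOKEN": "hf_example_placeholder"}))
```

In the loading cell, `access_token = get_hf_token()` would then replace the hard-coded string.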


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "lzw1008/Emollama-chat-7b"
access_token = "YOUR_HF_ACCESS_TOKEN"  # replace with your own token; never commit a real one

# Quantized version of the model
# 4 bit
#model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True,  use_auth_token=access_token)
# 8 bit
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True,  use_auth_token=access_token)
# 8 bit + fp16 weights (doesn't work)
# model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", llm_int8_has_fp16_weight=True, load_in_8bit=True,  use_auth_token=access_token)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, use_auth_token=access_token)
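
Instead of commenting variants in and out, the choice between 4-bit and 8-bit loading can be expressed as a small helper that returns the keyword arguments passed to `from_pretrained` in the cell above. This is only a sketch: the function name `quantization_kwargs` is hypothetical, and it does nothing more than reproduce the `device_map`, `load_in_4bit`, and `load_in_8bit` arguments already used in this notebook.

```python
def quantization_kwargs(bits):
    """Return from_pretrained keyword arguments for 4-bit or 8-bit loading."""
    if bits == 4:
        return {"device_map": "auto", "load_in_4bit": True}
    if bits == 8:
        return {"device_map": "auto", "load_in_8bit": True}
    raise ValueError(f"Unsupported bit width: {bits!r} (expected 4 or 8)")


kwargs = quantization_kwargs(8)
print(kwargs)

# Hypothetical usage, mirroring the loading cell:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, use_auth_token=access_token, **kwargs
# )
```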

In [None]:
model.config

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, use_auth_token=access_token)  # tokenizers run on the CPU, so no device_map is needed

prompts = ["""
Human:
Task: Assign a numerical value between 0 (least E) and 1 (most E) to represent the intensity of emotion E expressed in the text.
Text: @CScheiwiller can't stop smiling 😆😆😆
Emotion: joy
Intensity Score:
""",
"""
Human:
Task: Evaluate the valence intensity of the writer's mental state based on the text, assigning it a real-valued score from 0 (most negative) to 1 (most positive).
Text: Happy Birthday shorty. Stay fine stay breezy stay wavy @daviistuart 😘
Intensity Score:
""",
"""
Human:
Task: Categorize the text into an ordinal class that best characterizes the writer's mental state, considering various degrees of positive and negative sentiment intensity. 3: very positive mental state can be inferred. 2: moderately positive mental state can be inferred. 1: slightly positive mental state can be inferred. 0: neutral or mixed mental state can be inferred. -1: slightly negative mental state can be inferred. -2: moderately negative mental state can be inferred. -3: very negative mental state can be inferred
Text: Beyoncé resentment gets me in my feelings every time. 😩
Intensity Class:
""",

"""
Human:
Task: Categorize the text's emotional tone as either 'neutral or no emotion' or identify the presence of one or more of the given emotions (anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, trust).
Text: Whatever you decide to do make sure it makes you #happy.
This text contains emotions:
"""]

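for prompt in prompts:
    # Tokenize the prompt and move the tensors to the device that holds the model
    model_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Generate at most 60 new tokens
    output = model.generate(**model_inputs, max_new_tokens=60)
    # Decode the full sequence (prompt followed by the model's continuation)
    print(tokenizer.decode(output[0], skip_special_tokens=True))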
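
The prompts in this cell all follow the same layout (`Human:`, `Task:`, `Text:`, an optional `Emotion:` line, then an answer label), and the decoded output repeats the prompt before the model's answer. The sketch below shows one way to build such prompts and to pull a numeric score out of the decoded text. The helper names `build_prompt` and `parse_score` are hypothetical, and the decoded string in the demonstration is made up for illustration; it is not real model output.

```python
import re


def build_prompt(task, text, answer_label, emotion=None):
    """Assemble a prompt in the layout used by the examples in this notebook."""
    lines = ["", "Human:", f"Task: {task}", f"Text: {text}"]
    if emotion is not None:
        lines.append(f"Emotion: {emotion}")
    lines.append(f"{answer_label}:")
    return "\n".join(lines) + "\n"


def parse_score(decoded, answer_label):
    """Return the first number after the answer label, or None if there is none."""
    answer = decoded.partition(f"{answer_label}:")[2]
    match = re.search(r"-?\d+(?:\.\d+)?", answer)
    return float(match.group()) if match else None


prompt = build_prompt(
    task="Assign a numerical value between 0 (least E) and 1 (most E) to represent "
         "the intensity of emotion E expressed in the text.",
    text="@CScheiwiller can't stop smiling",
    answer_label="Intensity Score",
    emotion="joy",
)

# Made-up continuation, used only to exercise the parser.
fake_decoded = prompt + "0.75"
print(parse_score(fake_decoded, "Intensity Score"))
```

Splitting on the answer label, rather than searching the whole string, avoids picking up the numbers that appear inside the task description itself.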

# Erasing memory (don't do it)

In [None]:
from numba import cuda
import gc
import torch
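device = cuda.get_current_device()
device.reset()  # tears down the CUDA context; PyTorch may be unable to use the GPU afterwards until the kernel restarts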
gc.collect()

26931

In [None]:
del model

In [None]:
torch.cuda.empty_cache()
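
The two cells above (`del model`, then `torch.cuda.empty_cache()`) can be combined with a garbage-collection pass into a single helper. This is a minimal sketch under the assumption that every reference to the model has already been dropped; the name `release_cached_gpu_memory` is hypothetical, and the PyTorch import is optional here only so that the snippet also runs on a machine without PyTorch.

```python
import gc

try:
    import torch  # optional, so the sketch also runs without PyTorch
except ImportError:
    torch = None


def release_cached_gpu_memory():
    """Run the garbage collector, then release PyTorch's cached CUDA blocks."""
    collected = gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


# Call this after `del model` so the freed tensors can actually be reclaimed.
print(release_cached_gpu_memory())
```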

In [None]:
import os
# Make CUDA kernel launches synchronous so that errors point at the failing call (debugging aid)
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

In [None]:
import torch
print(torch.cuda.is_available())

True
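
The availability check above can also drive the choice of device string, instead of hard-coding `"cuda:0"`. A minimal sketch, with a hypothetical helper name `pick_device` and an optional PyTorch import so that it runs anywhere:

```python
try:
    import torch  # optional, so the sketch also runs without PyTorch
except ImportError:
    torch = None


def pick_device():
    """Return "cuda:0" when a CUDA GPU is usable, otherwise "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda:0"
    return "cpu"


print(pick_device())
```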
