# Lighter Models for Inference

## Installing Dependencies

In [1]:
# Install the latest version of transformers from source
!pip install --quiet git+https://github.com/huggingface/transformers.git
!pip install bitsandbytes
!pip install accelerate

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
  Building wheel for transformers (PEP 517) ... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bitsandbytes
  Downloading bitsandbytes-0.34.0-py3-none-any.whl (55.9 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.34.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.12.0-py3-none-any.whl (143 kB)
Installing collected packages: accelerate
Successfully installed acce

## Hardware Requirements

To use this feature, you need a GPU that supports 8-bit operations. Currently, Turing and Ampere GPUs (RTX 20 series, RTX 30 series, A40-A100, T4+) are supported, which means that on Colab we need to use a T4 GPU. You can check which GPU you are connected to with the following code snippet:

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Sep 27 05:23:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Here we are using a Tesla T4 GPU, which supports 8-bit tensor core operations. We are good to go!
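As a programmatic complement to reading the `nvidia-smi` table, the compute capability can also be queried from PyTorch. The sketch below assumes that Turing corresponds to compute capability 7.5 and Ampere to 8.x, so it treats 7.5 as the minimum. The helper name `supports_int8_tensor_cores` and the threshold constant are illustrative and are not part of `bitsandbytes` or `transformers`.

```python
# Hypothetical helper: compares a CUDA compute capability against the
# Turing (7.5) threshold. The threshold is an assumption drawn from the
# list of supported GPUs above, not an official bitsandbytes API.
MIN_INT8_CAPABILITY = (7, 5)

def supports_int8_tensor_cores(capability):
  """Return True if a (major, minor) capability is at least 7.5."""
  return tuple(capability) >= MIN_INT8_CAPABILITY

try:
  import torch
except ImportError:
  torch = None  # the helper itself does not need PyTorch

if torch is not None and torch.cuda.is_available():
  # get_device_capability returns a (major, minor) tuple for the given device
  capability = torch.cuda.get_device_capability(0)
  print(capability, supports_int8_tensor_cores(capability))
else:
  print("No CUDA GPU visible to PyTorch")
```

On the T4 used here, the reported capability should agree with the "Highest compute capability among GPUs detected" line that `bitsandbytes` prints when the model is loaded.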

## Utility variables & functions

In [3]:
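name = "bigscience/bloom-3b"  # model checkpoint on the Hugging Face Hub
text = "It was a Monday morning"  # prompt shared by all examples
max_new_tokens = 150  # generation budget passed to the pipeline

def generate_from_model(model, tokenizer):
  # Tokenize the shared prompt into PyTorch tensors
  encoded_input = tokenizer(text, return_tensors='pt')
  # Move the input ids to the GPU; no length argument is passed to generate()
  output_sequences = model.generate(input_ids=encoded_input['input_ids'].cuda())
  # Decode the first (and only) generated sequence back to text
  return tokenizer.decode(output_sequences[0], skip_special_tokens=True)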

## Use 8-bit models with `pipeline`

In [4]:
from transformers import pipeline

pipe = pipeline(
  model=name,
  model_kwargs={"device_map": "auto", "load_in_8bit": True},
  max_new_tokens=max_new_tokens,
)


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 111
CUDA SETUP: Loading binary /usr/local/lib/python3.7/dist-packages/bitsandbytes/libbitsandbytes_cuda111.so...





In [5]:
# Checking Output

pipe(text)

[{'generated_text': 'It was a Monday morning, and I was in the kitchen, cooking breakfast for my family. I was in the middle of making my favorite breakfast, eggs Benedict, when I heard a knock on the door. I opened it and saw a man in his late 60s, with a beard and a mustache, and a big smile on his face. He was wearing a blue shirt and a black tie. He was wearing a pair of jeans and a pair of white sneakers. He was wearing a black hat and a black coat. He was wearing a black hat and a black coat. He was wearing a black hat and a black coat. He was wearing a black hat and a black coat. He was wearing a black hat and a black coat. He was'}]

## Use 8-bit models with `.generate`

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_8bit = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(name)

In [7]:
generate_from_model(model_8bit, tokenizer)



'It was a Monday morning, and I was in the kitchen, cooking breakfast for my family. I'

In [8]:
model_native = AutoModelForCausalLM.from_pretrained(name, device_map="auto", torch_dtype="auto")
generate_from_model(model_native, tokenizer)

'It was a Monday morning, and I was in the kitchen, making breakfast. I was in the'
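Both `.generate` outputs above stop after roughly twenty tokens, while the pipeline produced a much longer text. A likely reason is that `generate_from_model` never passes `max_new_tokens`, so `generate` falls back to its default length limit. The sketch below is a hypothetical variant (`generate_with_limit` is not part of the original notebook) that takes the prompt and the token budget as arguments and moves the inputs to the model's device instead of calling `.cuda()` directly.

```python
def generate_with_limit(model, tokenizer, prompt, max_new_tokens=150):
  """Generate a continuation of `prompt` with an explicit token budget."""
  # Tokenize, then move every returned tensor to the model's device
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  # Forward all tokenizer outputs (including the attention mask) to generate()
  output_sequences = model.generate(**inputs, max_new_tokens=max_new_tokens)
  # Decode the first (and only) generated sequence back to text
  return tokenizer.decode(output_sequences[0], skip_special_tokens=True)
```

With the objects defined earlier, it could be called as `generate_with_limit(model_8bit, tokenizer, text, max_new_tokens)`.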

## Memory footprint comparison

In [9]:
mem_fp16 = model_native.get_memory_footprint()
mem_int8 = model_8bit.get_memory_footprint()
print("Memory footprint int8 model: {} | Memory footprint fp16 model: {} | Relative difference: {}".format(mem_int8, mem_fp16, mem_fp16/mem_int8))

Memory footprint int8 model: 3645818880 | Memory footprint fp16 model: 6005114880 | Relative difference: 1.6471237539918604


The 8-bit model uses about 1.65x less memory than the fp16 model for a 3-billion-parameter model! Note that internally all the linear layers are replaced by the ones implemented in `bitsandbytes`, while the remaining parameters keep their original precision. As a model scales up, its linear layers account for a larger share of its parameters, so the memory savings grow for very large models. For example, quantizing BLOOM-176B (a 176-billion-parameter model) reduces the memory footprint by 1.96x, which can save a lot of compute resources in practice.
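The ratio stays below 2x because only part of the parameters shrink from two bytes to one. As a rough back-of-the-envelope check, the sketch below takes the two footprints printed above and assumes that every parameter costs 2 bytes in fp16, that each quantized parameter costs 1 byte, and that buffers and quantization statistics are negligible. Under those assumptions, the difference between the two footprints equals the number of quantized parameters. The function names are illustrative.

```python
# Footprints in bytes, copied from the output above
mem_fp16 = 6005114880
mem_int8 = 3645818880

def quantized_fraction(mem_fp16, mem_int8):
  """Estimate the share of parameters stored in int8.

  Assumes 2 bytes per fp16 parameter and 1 byte per int8 parameter,
  and ignores buffers and quantization statistics.
  """
  total_params = mem_fp16 // 2
  quantized_params = mem_fp16 - mem_int8  # each quantized parameter saves 1 byte
  return quantized_params / total_params

def memory_ratio(fraction):
  """fp16 / int8 footprint ratio when `fraction` of the parameters is quantized."""
  return 2 / (2 - fraction)

fraction = quantized_fraction(mem_fp16, mem_int8)
print(f"Estimated int8 share of parameters: {fraction:.1%}")
print(f"Implied memory ratio: {memory_ratio(fraction):.2f}x")
```

With these numbers, roughly 79% of the parameters end up in int8. A share closer to 100% pushes the ratio toward 2x, which is consistent with the 1.96x figure quoted for the 176-billion-parameter model.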