<a href="https://colab.research.google.com/github/adrian-oprea/Databricks-tech-talks/blob/master/Copie_de_Accelerate_OPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running OPT up to 30B using `accelerate`

This notebook shows how to use the dispatching utilities of `accelerate` in Colab to load even very large checkpoints.

This should handle models of up to 11B parameters in Colab Free, and up to 30B in Colab Pro.
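
A rough way to see why these limits matter: in float16 every parameter takes 2 bytes, so the weights alone of an 11B model take about 22 GB and those of a 30B model about 60 GB. That is more than a typical Colab GPU provides, which is why the notebook spreads layers across GPU, CPU RAM and disk. The sketch below only counts weight bytes (it ignores activations, the generation cache and framework overhead), and `weight_gb` is a hypothetical helper, not part of any library.

```python
# Hypothetical helper: approximate memory taken by the model weights alone.
# Assumption: weights only; activations, cache and overhead are ignored.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_gb(num_params: float, dtype: str = "float16") -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for size in (11e9, 30e9):
    print(f"{size / 1e9:.0f}B params: ~{weight_gb(size):.0f} GB in float16, "
          f"~{weight_gb(size, 'float32'):.0f} GB in float32")
```

Halving the bytes per parameter is also why the notebook loads everything in float16.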

In [None]:
! pip install transformers accelerate

This downloads the checkpoint. Several checkpoints are available:

- [facebook/opt-125m](https://huggingface.co/facebook/opt-125m)
- [facebook/opt-350m](https://huggingface.co/facebook/opt-350m)
- [facebook/opt-1.3b](https://huggingface.co/facebook/opt-1.3b)
- [facebook/opt-2.7b](https://huggingface.co/facebook/opt-2.7b)
- [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b)
- [facebook/opt-13b](https://huggingface.co/facebook/opt-13b)
- [facebook/opt-30b](https://huggingface.co/facebook/opt-30b)

`snapshot_download` downloads the checkpoint into the local cache and returns its path, which we save to re-use afterwards.

In [None]:
import os

from huggingface_hub import snapshot_download

checkpoint = 'facebook/opt-30b'
# Download all files of the repository into the local cache and get their folder
weights_path = snapshot_download(checkpoint)

# If the folder contains a checkpoint that isn't sharded, point to the state dict directly;
# otherwise, point to the directory containing the shards
files = os.listdir(weights_path)
weights_path = os.path.join(weights_path, 'pytorch_model.bin') if 'pytorch_model.bin' in files else weights_path


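The path-resolution logic of the download cell can be isolated into a small function. The sketch below mirrors it and also lists the shard files of a sharded checkpoint, assuming the Transformers convention of a `pytorch_model.bin.index.json` file whose `weight_map` maps each parameter name to the shard that stores it. `resolve_weights_path` and `list_shards` are hypothetical helpers written for illustration.

```python
import json
import os
from typing import List


def resolve_weights_path(folder: str) -> str:
    """Mirror the notebook's logic: a single-file checkpoint is addressed by
    its state-dict file, a sharded checkpoint by its directory."""
    single_file = os.path.join(folder, "pytorch_model.bin")
    return single_file if os.path.isfile(single_file) else folder


def list_shards(folder: str) -> List[str]:
    """Return the sorted, de-duplicated shard file names of a sharded checkpoint.

    Assumes a `pytorch_model.bin.index.json` file whose "weight_map" maps
    each parameter name to the shard file that stores it.
    """
    with open(os.path.join(folder, "pytorch_model.bin.index.json")) as f:
        index = json.load(f)
    return sorted(set(index["weight_map"].values()))
```

When `resolve_weights_path` returns a directory, `list_shards` on that directory shows which shard files the loader will read one after another.
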
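We then instantiate a configuration and load the model from it inside the `init_empty_weights` context manager.

This context manager creates an empty shell of the model: the parameters are created on PyTorch's `meta` device, so only their shapes are recorded and no weights are loaded or allocated.

Initializing the model this way breaks the tied weights (in OPT, the language-modeling head shares its weight matrix with the input token embeddings), so we manually retie them afterwards.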
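
Because an empty model only carries shapes, quantities such as the parameter count can be derived without allocating anything. The pure-Python sketch below does that arithmetic by hand for an OPT-style decoder. The default dimensions (hidden size 7168, FFN size 28672, 48 layers, vocabulary of 50272, 2048 positions) are our reading of the `facebook/opt-30b` configuration and should be treated as assumptions to check against `config`; the layout assumptions are listed in the docstring. `opt_param_count` is a hypothetical helper.

```python
# Hypothetical sketch: count OPT parameters from shapes alone, without
# allocating any tensor.
# Assumption: the default dimensions are our reading of the facebook/opt-30b
# config; check them against `config` before relying on the result.


def opt_param_count(
    hidden: int = 7168,
    ffn: int = 28672,
    layers: int = 48,
    vocab: int = 50272,
    max_positions: int = 2048,
) -> int:
    """Parameter count of an OPT decoder-only model.

    Assumes biases on every linear layer, two LayerNorms per decoder layer
    plus a final LayerNorm, learned positional embeddings with an offset of 2,
    no input/output projection, and an LM head tied to the token embeddings
    (so it is not counted a second time).
    """
    attention = 4 * (hidden * hidden + hidden)            # q, k, v, out projections
    mlp = (hidden * ffn + ffn) + (ffn * hidden + hidden)  # fc1 and fc2
    layer_norms = 2 * (2 * hidden)                        # 2 LayerNorms: weight + bias
    per_layer = attention + mlp + layer_norms

    token_embeddings = vocab * hidden
    position_embeddings = (max_positions + 2) * hidden    # positions offset by 2
    final_norm = 2 * hidden
    return layers * per_layer + token_embeddings + position_embeddings + final_norm


print(f"{opt_param_count() / 1e9:.1f}B parameters")  # prints: 30.0B parameters
```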

In [None]:
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained(checkpoint)

# Initialize an empty shell of the model. This is near-instant and allocates no memory for the weights.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
# Initializing the model under the previous context manager breaks the tied weights, so retie them.
model.tie_weights()

Next, we infer a device map automatically from the model. It places each layer on the GPU, in CPU RAM or on disk, according to the memory available on each device.
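
To build intuition for what an automatic device map looks like, here is a simplified greedy placement: whole layers (never split, in the spirit of `no_split_module_classes`) are assigned to devices in order, moving on to the next device once the current one is full, with the last device (disk) treated as unlimited. This is a hypothetical sketch, not accelerate's implementation, which also has to handle details such as tied parameters and reserved memory; the layer sizes and budgets are made-up units.

```python
from typing import Dict, List, Tuple, Union

Device = Union[int, str]


def greedy_device_map(
    layers: List[Tuple[str, int]],
    budgets: List[Tuple[Device, int]],
) -> Dict[str, Device]:
    """Place whole layers on devices in order, filling each device before
    moving to the next. The last device (e.g. "disk") is treated as unlimited.

    Simplified, hypothetical illustration; not accelerate's implementation."""
    device_map: Dict[str, Device] = {}
    idx, used = 0, 0
    for name, size in layers:
        # Move on while the current device cannot hold this layer in full
        # (layers are never split across devices).
        while idx < len(budgets) - 1 and used + size > budgets[idx][1]:
            idx, used = idx + 1, 0
        device_map[name] = budgets[idx][0]
        used += size
    return device_map


layers = [(f"decoder.layers.{i}", 4) for i in range(4)]  # made-up sizes
budgets = [(0, 8), ("cpu", 4), ("disk", 0)]              # GPU 0, CPU RAM, disk
print(greedy_device_map(layers, budgets))
```

With these numbers, the first two layers land on GPU 0, the third in CPU RAM and the fourth on disk, mirroring the GPU, CPU, disk ordering of the real device map.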

In [None]:
# Infer the device map automatically, never splitting a decoder layer across devices
device_map = infer_auto_device_map(model.model, no_split_module_classes=["OPTDecoderLayer"], dtype='float16')

if '30b' in checkpoint:
    # Manually send a few layers to disk to ensure there is enough RAM for the 30B checkpoint
    for i in range(23, 28):
        device_map[f'decoder.layers.{i}'] = 'disk'

# An offload folder is needed only if some modules end up on disk
# (checked after the manual overrides above, which may add disk entries)
offload_folder = 'offload_folder' if any(v == 'disk' for v in device_map.values()) else None

device_map
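
The manual override in the cell above generalizes to a small helper, and counting the entries per device is a quick way to read the resulting map. Both functions below are hypothetical helpers for illustration; they assume the same `decoder.layers.<i>` key format that the notebook uses.

```python
from collections import Counter
from typing import Dict, Iterable, Union

Device = Union[int, str]


def force_layers_to_disk(
    device_map: Dict[str, Device],
    layer_ids: Iterable[int],
    prefix: str = "decoder.layers.",
) -> Dict[str, Device]:
    """Return a copy of device_map with the given decoder layers sent to disk."""
    updated = dict(device_map)
    for i in layer_ids:
        updated[f"{prefix}{i}"] = "disk"
    return updated


def count_per_device(device_map: Dict[str, Device]) -> Counter:
    """Count how many entries of the device map are assigned to each device."""
    return Counter(device_map.values())


# Made-up map: layers 0-1 on GPU 0, layers 2-3 in CPU RAM
example = {f"decoder.layers.{i}": 0 if i < 2 else "cpu" for i in range(4)}
updated = force_layers_to_disk(example, range(3, 4))
print(count_per_device(updated))
```

Returning a copy keeps the automatically inferred map intact, which makes it easy to compare the two.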

We load the checkpoint saved on disk and dispatch it to the devices. At no point is the checkpoint fully loaded in RAM; only the parts currently being dispatched to each device are.

We load it in float16, which takes half the memory of float32, so that more layers fit on each device at a time and execution is faster.
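
To make the float16 argument concrete: assuming OPT-30B dimensions of hidden size 7168 and FFN size 28672 (assumptions to check against the config), one decoder layer holds about 617M parameters, roughly 1.2 GB in float16 versus 2.5 GB in float32, so a given memory budget fits about twice as many layers in float16. The 15 GB budget below is a hypothetical figure, not a measured Colab value, and the helper names are illustrative.

```python
# Hypothetical sketch: how many OPT decoder layers fit in a memory budget.
# Assumption: OPT-30B dimensions (hidden 7168, FFN 28672); weights only.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def decoder_layer_params(hidden: int = 7168, ffn: int = 28672) -> int:
    """Parameters in one OPT decoder layer (linear weights, biases, 2 LayerNorms)."""
    attention = 4 * (hidden * hidden + hidden)            # q, k, v, out projections
    mlp = (hidden * ffn + ffn) + (ffn * hidden + hidden)  # fc1 and fc2
    layer_norms = 2 * (2 * hidden)                        # weight and bias per LayerNorm
    return attention + mlp + layer_norms


def layers_that_fit(budget_bytes: int, dtype: str) -> int:
    """Number of whole decoder layers whose weights fit in budget_bytes."""
    return budget_bytes // (decoder_layer_params() * BYTES_PER_PARAM[dtype])


budget = 15_000_000_000  # hypothetical 15 GB budget
for dtype in ("float32", "float16"):
    print(dtype, layers_that_fit(budget, dtype))
```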

In [None]:
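# Load the weights into the empty model and dispatch each module to its device.
# offload_state_dict=True temporarily stores the CPU state dict on disk to avoid running out of RAM.
load_checkpoint_and_dispatch(
    model.model,
    weights_path,
    device_map=device_map,
    offload_folder=offload_folder,
    dtype='float16',
    offload_state_dict=True
)
# Only model.model was loaded, so retie the LM head to the loaded input embeddings
model.tie_weights()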

Finally, we create a prompt and generate text from it.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.", return_tensors="pt")

# Move the input ids to the first GPU, then sample up to 50 tokens in total (prompt included)
output = model.generate(inputs["input_ids"].to(0), max_length=50, do_sample=True)

In [None]:
print(tokenizer.decode(output[0].tolist()))

Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.

The Unicorn is a symbol of purity and innocence. It is also a symbol of magic and mystery. The Unicorn is often associated with the Virgin Mary,
