# Getting Started with Bloom

![bloom](../figs/entelecheia_BLOOM.png)


## Transparency, openness, and inclusivity

While most major LLMs have been trained exclusively on English text, BLOOM’s training corpus includes 46 natural languages and 13 programming languages. This makes it useful for the many regions where English is not the main language.

BLOOM is also a break from the de facto reliance on big tech to train models. One of the main problems of LLMs is the prohibitive costs of training and tuning them. This hurdle has made 100-billion-parameter LLMs the exclusive domain of big tech companies with deep pockets. Recent years have seen AI labs gravitate toward big tech to gain access to subsidized cloud compute resources and fund their research.

The BLOOM research team has been completely transparent about the entire process of training the model. They have published the dataset, the meeting notes, discussions, and code, as well as the logs and technical details of training the model.



## Pre-trained BLOOM checkpoints

The BigScience organization page on Hugging Face (https://huggingface.co/bigscience) hosts several versions of the model.

![](../figs/deep_nlp/bloom/bloom-models.png)

### Download checkpoints

Note: the full BLOOM model is very large; its checkpoint is about 350 GB.
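
The size quoted above follows from simple arithmetic. The sketch below assumes the published parameter count of the full model (roughly 176 billion) and 2 bytes per parameter for bfloat16 weights. The function name is illustrative, and both numbers are assumptions that this page does not state itself.

```python
def checkpoint_size_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: roughly 176 billion parameters stored as bfloat16 (2 bytes each).
BLOOM_PARAMS = 176_000_000_000
print(f"{checkpoint_size_gb(BLOOM_PARAMS):.0f} GB")  # prints "352 GB"
```

The same function gives a quick estimate for the smaller checkpoints once their parameter counts are known.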

In [None]:
!pip install transformers 

In [None]:
from transformers import AutoModel, AutoTokenizer

model_path = "/workspace/data/tbts/archive/models/bloom/bloom" # replace with your local folder path
model_uri = "bigscience/bloom"

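# Caution: from_pretrained loads the entire model into memory before it is saved,
# so try this with a smaller checkpoint before using the full bigscience/bloom.
model = AutoModel.from_pretrained(model_uri)
model.save_pretrained(model_path)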
tokenizer = AutoTokenizer.from_pretrained(model_uri)
tokenizer.save_pretrained(model_path)

!ls $model_path

### Inference

We define a helper, `get_state_dict`, that takes a shard number (1 to 72), reads that shard from disk, and returns a dictionary with the model state. It can optionally strip a prefix from the dictionary keys, which makes it easier to load the weights into individual module objects with their `load_state_dict` method. We also create the tokenizer and configuration objects by loading them from the downloaded folder.
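
A minimal sketch of the two conventions this relies on, shard file naming and prefix stripping, using plain Python with strings in place of tensors. The helper names `shard_filename` and `strip_prefix` are hypothetical, the toy keys are made up for illustration, and the "block i lives in shard i + 2" rule mirrors the `load_block` helper in this section.

```python
from collections import OrderedDict

NUM_SHARDS = 72  # shard count stated in the text


def shard_filename(shard_num: int) -> str:
    """File name for a 1-based shard number, zero-padded to five digits."""
    if not 1 <= shard_num <= NUM_SHARDS:
        raise ValueError(f"shard_num must be in 1..{NUM_SHARDS}, got {shard_num}")
    return f"pytorch_model_{shard_num:05d}-of-{NUM_SHARDS:05d}.bin"


def strip_prefix(state, prefix):
    """Remove `prefix` from every key, mirroring k.replace(prefix, '')."""
    return OrderedDict((k.replace(prefix, ""), v) for k, v in state.items())


# Toy stand-in for a loaded shard: strings instead of weight tensors.
toy_shard = OrderedDict(
    [("h.3.input_layernorm.weight", "w"), ("h.3.input_layernorm.bias", "b")]
)

print(shard_filename(3 + 2))  # block 3 -> shard 5
print(list(strip_prefix(toy_shard, "h.3.")))
```

After stripping, the keys match the parameter names of a standalone module, which is what lets `load_state_dict` accept them without a wrapper model.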

In [2]:
import os
import torch
import torch.nn as nn
from collections import OrderedDict
from transformers import AutoTokenizer, AutoModelForCausalLM, BloomConfig
from transformers.models.bloom.modeling_bloom import BloomBlock, build_alibi_tensor

def get_state_dict(shard_num, prefix=None):
    # Read one of the 72 checkpoint shards and optionally strip `prefix` from its keys.
    d = torch.load(os.path.join(model_path, f"pytorch_model_{shard_num:05d}-of-00072.bin"))
    return d if prefix is None else OrderedDict((k.replace(prefix, ''), v) for k, v in d.items())

In [16]:
config = BloomConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
device = 'cpu'

In [None]:
def load_embeddings():
    # Shard 1 holds both the word embeddings and their layer norm.
    state_dict = get_state_dict(shard_num=1, prefix="word_embeddings_layernorm.")
    embeddings = nn.Embedding.from_pretrained(state_dict.pop('word_embeddings.weight'))
    lnorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon, dtype=torch.bfloat16)
    lnorm.load_state_dict(state_dict)
    return embeddings.to(device), lnorm.to(device)

def load_causal_lm_head():
    # The LM head reuses the word embedding weights from shard 1.
    linear = nn.utils.skip_init(
        nn.Linear, config.hidden_size, config.vocab_size, bias=False, dtype=torch.bfloat16)
    linear.load_state_dict(get_state_dict(shard_num=1, prefix="word_embeddings."), strict=False)
    return linear.bfloat16().to(device)

def load_block(block_obj, block_num):
    # Transformer block i is stored in shard i + 2.
    block_obj.load_state_dict(get_state_dict(shard_num=block_num + 2, prefix=f"h.{block_num}."))
    block_obj.to(device)

# The final layer norm lives in the last shard.
final_lnorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon, dtype=torch.bfloat16)
final_lnorm.load_state_dict(get_state_dict(shard_num=72, prefix="ln_f."))
final_lnorm.to(device)
block = BloomBlock(config, layer_number=1).bfloat16()




## Simple example for local inference

In [1]:
import torch
from transformers import BloomForCausalLM
from transformers import BloomTokenizerFast
from transformers import pipeline

In [None]:
model_uri = "bigscience/bloom-1b3"

model = BloomForCausalLM.from_pretrained(model_uri)
tokenizer = BloomTokenizerFast.from_pretrained(model_uri)
pipe = pipeline(model=model_uri, torch_dtype=torch.bfloat16)

In [None]:
prompt = "Bloom is a large language model"
result_length = 100
inputs = tokenizer(prompt, return_tensors="pt")

- `result_length`: passed to `generate` as `max_length`, so it caps the total sequence length in tokens (prompt plus generated text), not only the response.
- `inputs`: the tokenized prompt (token IDs and attention mask), returned as PyTorch tensors because of `return_tensors="pt"`.

In [None]:
# Greedy Search
print(
    tokenizer.decode(model.generate(inputs["input_ids"], max_length=result_length)[0])
)
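
To make the idea behind greedy search concrete, here is a toy sketch in plain Python. It does not use the model: the `NEXT` table of next-token probabilities is made up, and a real model conditions on the whole prefix, not only on the last token.

```python
# Made-up next-token probabilities, keyed by the previous token only.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def greedy_decode(start="<s>", max_length=10):
    """Repeatedly append the single most probable next token."""
    tokens = [start]
    # max_length counts the starting token too, like max_length in generate.
    while len(tokens) < max_length and tokens[-1] != "</s>":
        probs = NEXT[tokens[-1]]
        tokens.append(max(probs, key=probs.get))
    return tokens


print(greedy_decode())  # ['<s>', 'the', 'cat', '</s>']
```

Greedy search commits to "the" because it is the best first step, even though a different first token could lead to a more probable sequence overall.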

In [None]:
# Beam Search
print(
    tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_length=result_length,
            num_beams=2,
            no_repeat_ngram_size=2,
            early_stopping=True,
        )[0]
    )
)
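
The following toy sketch shows what `num_beams=2` changes compared with greedy decoding. It is a simplified illustration, not the library's implementation: the probability table is made up, and length penalties, `no_repeat_ngram_size`, and the details of `early_stopping` are left out.

```python
import math

# Made-up next-token probabilities, keyed by the previous token only.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(num_beams=2, max_length=10, start="<s>"):
    """Keep the `num_beams` highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, summed log-probability)
    for _ in range(max_length - 1):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished beams carry over
                continue
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams[0][0]


print(beam_search(num_beams=2))  # ['<s>', 'a', 'dog', '</s>']
print(beam_search(num_beams=1))  # ['<s>', 'the', 'cat', '</s>']
```

With two beams the search keeps "a" alive after the first step and finds "a dog" (probability 0.4 × 0.9 = 0.36), which beats the greedy result "the cat" (0.6 × 0.5 = 0.30). With one beam it reduces to greedy search.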

In [None]:
# Sampling Top-k + Top-p
print(
    tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_length=result_length,
            do_sample=True,
            top_k=50,
            top_p=0.9,
        )[0]
    )
)
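
As a rough illustration of what `top_k` and `top_p` do before a token is sampled, the sketch below filters a made-up probability distribution in plain Python. The function name is hypothetical, and the library works on logits rather than on a dictionary of probabilities, so treat this as a conceptual sketch only.

```python
import random


def top_k_top_p_filter(probs, top_k=50, top_p=0.9):
    """Return the renormalized distribution left after top-k, then top-p, filtering."""
    # Top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    # Top-p: keep the smallest leading set whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / k_total
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up next-token distribution.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "bird": 0.05}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.8)
print({t: round(p, 3) for t, p in filtered.items()})

# Sample one token from what is left.
rng = random.Random(0)
print(rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0])
```

Here top-k drops "bird", top-p then drops "fish", and sampling chooses between "cat" and "dog" in proportion to their renormalized probabilities.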

In [None]:
def infer(
    prompt,
    temperature=0.7,
    top_p=None,
    max_new_tokens=50,
    repetition_penalty=None,
    do_sample=False,
    num_return_sequences=1,
):
    response = pipe(
        prompt,
        temperature=temperature,  # > 0; only has an effect when do_sample=True
        top_p=top_p,  # None, or a value in (0, 1]
        max_new_tokens=max_new_tokens,  # up to 2047 theoretically
        return_full_text=False,  # False: return only the generated continuation
        repetition_penalty=repetition_penalty,  # None, or > 1.0 to penalize repeated tokens
        do_sample=do_sample,  # True: use sampling, False: greedy decoding
        num_return_sequences=num_return_sequences,
    )
    return (prompt, response[0]["generated_text"])
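
The effect of the `temperature` argument can be seen in a small standalone sketch. It assumes the usual definition, in which logits are divided by the temperature before the softmax; the function name and the logits are made up for illustration. Because sampling is what uses these probabilities, the setting only matters when `do_sample=True`.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.7, 1.0, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution and make sampled text more varied.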

## Bloom Examples

In [1]:
from ekorpkit import eKonf
from ekorpkit.models.bloom.demo import BloomDemo



In [2]:
import os

demo = BloomDemo()
print(demo.TRANSFORMERS_CACHE)
demo.init_model()

INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env


/workspace/data/tbts/.cache/huggingface/transformers



In [11]:
cfg = eKonf.compose("model/bloom=demo")
demo = eKonf.instantiate(cfg)
demo.init_model()

In [None]:
prompt = "One of the hottest areas of investing in recent years has been ESG: "
prompt += "the use of environmental, social, and governance criteria to evaluate possible investments."

result_length = 100
inputs = tokenizer(prompt, return_tensors="pt")