# Getting Started with Bloom


## Transparency, openness, and inclusivity

While most major LLMs have been trained exclusively on English text, BLOOM’s training corpus includes 46 natural languages and 13 programming languages. This makes it useful for the many regions where English is not the main language.

BLOOM is also a break from the de facto reliance on big tech to train models. One of the main problems of LLMs is the prohibitive costs of training and tuning them. This hurdle has made 100-billion-parameter LLMs the exclusive domain of big tech companies with deep pockets. Recent years have seen AI labs gravitate toward big tech to gain access to subsidized cloud compute resources and fund their research.

The BLOOM research team has been completely transparent about the entire process of training the model. They have published the dataset, the meeting notes, discussions, and code, as well as the logs and technical details of training the model.



## Zero Shot and Few Shot Learners

Large language models can perform reasonably well on some tasks given just a few examples.
These examples are supplied to the language model as `prompts`.

For clarity, we will define a prompting task as one that requires no fine-tuning of the base language model. 
This is done by inputting some prompts into the language model and asking it to return a response. 
The model does not see any training data for this task and is expected to generalize from these few examples. 

Formatting the examples as input is referred to as `prompt engineering` and is a process that comes with some trial and error.
The goal of prompt engineering is to format your examples so that they are easy to feed into the model.

Language generation based on prompts is a powerful concept, and it comes in two main flavors: zero-shot and few-shot prediction. 
In zero-shot prediction, the prompt contains only a description of the task and no examples of it. In few-shot prediction, the prompt also includes a small number of examples of the task. In neither case are the model's weights updated for that specific task. 
In both cases, we need some sort of prompt or seed text to get started with so that the model can generate new text from there. 



## Prompting on LLMs

 All you need to do is input several examples of your prompts into the `generate` function on either `gpt2` or `gpt3`. You don't even have to specify what kind of task it is, just give it some example inputs and let it figure out how to generalize from there! The only parameters required are:

 - `text`, which contains the text you want to be generated for (this should be your prompt set)  
 - `length`, which specifies how long you want each generated sequence returned by the model   
 - `num_return_sequences`, which specifies how many sequences you want to be returned by the model

In Zero-shot predictions, you mainly pass prompts which give a task description to the LLM to generate text. For example, for zero-shot summarization, you can present a body of text to the LLM along with an instruction for it to follow, like 'In summary', or 'tldr:', or even 'To explain to a 5-year-old'.

In few-shot summarization, you present a few examples of texts and their summaries to the LLM, followed by a new text, and expect the model to generate its summary. In other words, you give it a few examples instead of none.
Prompting in this way is also much simpler than fine-tuning. 


### Zero-shot 

The model predicts the answer when provided only a description of the task. 
No gradient updates are performed on the model. 

- prompt => Translate English to French: (This is the task description)
- Cheese => (this is you prompting the LLM to complete the sentence)

### One-shot

In addition to task description, you provide the model with one example of what you are expecting it to produce. 

- prompt => Translate English to French: (Task description for the model)
- Sea Otter => loutre de mer (One example for the model to learn from)
- Cheese => (providing a prompt to LLM to follow the lead)

### Few-shot

In addition to the task description, the model is provided with a few examples of the task. 

- prompt => Translate English to French: (Task description for the model)
- Sea Otter => loutre de mer (a few examples for the model to learn from)
- Plush giraffe => girafe en peluche
- Cheese => (providing a prompt to LLM to follow the lead)
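
The three prompt layouts above differ only in how many worked examples precede the open-ended query. The following is a minimal sketch of that idea in plain Python; the helper name `build_prompt` and the `=>` separator are illustrative assumptions, not part of any library.

```python
from typing import List, Tuple


def build_prompt(task: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Assemble a zero-, one-, or few-shot prompt as plain text.

    The number of (input, output) pairs in `examples` decides the "shot" count.
    `build_prompt` is a hypothetical helper written for this illustration.
    """
    lines = [task]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    # The last line is left open so that the model completes it.
    lines.append(f"{query} =>")
    return "\n".join(lines)


task = "Translate English to French:"
zero_shot = build_prompt(task, [], "cheese")
one_shot = build_prompt(task, [("sea otter", "loutre de mer")], "cheese")
few_shot = build_prompt(
    task,
    [("sea otter", "loutre de mer"), ("plush giraffe", "girafe en peluche")],
    "cheese",
)
print(few_shot)
```

The resulting string would be passed to the model as-is; no gradient updates are involved in any of the three variants.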


In [18]:
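from transformers import pipeline

# Note: the zero-shot-classification pipeline expects a model fine-tuned on
# natural language inference (NLI). distilgpt2 is not, so its classification
# head is randomly initialized (see the warnings below) and the scores it
# returns are not meaningful.
classifier = pipeline("zero-shot-classification", model="distilgpt2")
classifier(
    "One of the hottest areas of investing in recent years has been ESG",
    candidate_labels=["education", "politics", "business"],
)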


Some weights of the model checkpoint at distilgpt2 were not used when initializing GPT2ForSequenceClassification: ['lm_head.weight']
- This IS expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a

{'sequence': 'One of the hottest areas of investing in recent years has been ESG',
 'labels': ['business', 'politics', 'education'],
 'scores': [0.43735742568969727, 0.2975316643714905, 0.26511093974113464]}
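
The `zero-shot-classification` pipeline is built around NLI models (for instance `facebook/bart-large-mnli`): each candidate label is turned into a hypothesis sentence, and the labels are ranked by how strongly the model judges each hypothesis to be entailed by the input text. The sketch below imitates only that ranking step in plain Python. The logits are made-up numbers, and the helper names are assumptions for illustration; no model is loaded.

```python
import math
from typing import Dict, List


def make_hypotheses(labels: List[str], template: str = "This example is {}.") -> List[str]:
    # One hypothesis sentence per candidate label.
    return [template.format(label) for label in labels]


def rank_labels(labels: List[str], entailment_logits: List[float]) -> Dict[str, float]:
    # Softmax over the entailment logits, one logit per label (single-label case).
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scores = {label: e / total for label, e in zip(labels, exps)}
    # Highest score first, mirroring the ordering of the pipeline output.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


labels = ["education", "politics", "business"]
hypotheses = make_hypotheses(labels)
# Hypothetical entailment logits; a real NLI model would produce these.
fake_logits = [-1.2, 0.3, 2.5]
scores = rank_labels(labels, fake_logits)
print(hypotheses)
print(scores)
```

With a model that has no NLI head, the logits feeding this step come from untrained weights, which is why the scores printed above sit close to one another.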

## Zero Shot Reasoners and Chain-of-Thought

A [paper](https://arxiv.org/abs/2205.11916) from the University of Tokyo and the Google Brain team suggests that LLMs have fundamental zero-shot capabilities in high-level, broad cognitive tasks and that these capabilities can be extracted by simple Chain-of-Thought (CoT) prompting.

Another [paper](https://arxiv.org/abs/2201.11903) by the Google Brain team investigates CoT prompting in more depth. The authors note that generating a chain of thought (a series of intermediate reasoning steps) significantly improves the ability of LLMs to perform complex reasoning. Their experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.

One example:

- Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis  balls. How many tennis balls does he have now? 

- A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Additionally: 

- Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

- A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9.

> Chain of thought reasoning allows models to decompose complex problems into intermediate steps that are solved individually. Moreover, the language-based nature of chain of thought makes it applicable to any task that a person could solve via language. We find through empirical experiments that chain of thought prompting can improve performance on various reasoning tasks, and that successful chain of thought reasoning is an emergent property of model scale.
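
As a rough sketch of how these two prompting styles can be assembled, the snippet below builds a few-shot CoT prompt from worked exemplars, builds a zero-shot CoT prompt by appending the "Let's think step by step." trigger, and pulls the final number out of a completion. All function names are hypothetical, and the `The answer is N` pattern is an assumption that matches the exemplars in this section.

```python
import re
from typing import List, Optional, Tuple

ZERO_SHOT_TRIGGER = "Let's think step by step."


def few_shot_cot_prompt(exemplars: List[Tuple[str, str]], question: str) -> str:
    # Each exemplar shows a question together with its worked-out reasoning.
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    # The new question ends with an open "A:" for the model to continue.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def zero_shot_cot_prompt(question: str) -> str:
    # No exemplars: the trigger phrase alone invites step-by-step reasoning.
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"


def extract_answer(completion: str) -> Optional[int]:
    # Assumes the completion ends its reasoning with "The answer is <integer>".
    match = re.search(r"The answer is (-?\d+)", completion)
    return int(match.group(1)) if match else None


exemplar = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.",
)
question = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?"
)
cot_prompt = few_shot_cot_prompt([exemplar], question)
zs_prompt = zero_shot_cot_prompt(question)
parsed = extract_answer(
    "The cafeteria had 23 apples originally. They used 20 to make lunch. "
    "So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. "
    "The answer is 9."
)
print(cot_prompt)
print(parsed)
```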



## Bloom Examples

## Downloading a Pre-Trained Tokenizer & Model

In [None]:
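from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
# AutoModelForCausalLM loads the language-modeling head needed by model.generate()
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")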

Downloading pytorch_model_00024-of-00072.bin:   0%|          | 0.00/4.59G [00:00<?, ?B/s]


In [None]:
from transformers import pipeline
import torch
import time

s = time.time()
pipe = pipeline(model="bigscience/bloom", torch_dtype=torch.bfloat16)
print(f"Time to load model: {time.time()-s}")


In [None]:
from IPython.display import HTML as html_print

def cstr(s, color='black'):
    # Wrap text in a colored HTML tag; newlines become <br> so they render in the notebook.
    return "<text style=color:{}>{}</text>".format(color, s.replace('\n', '<br>'))

def cstr_with_newlines(s, color='black'):
    # Same behavior as cstr; kept so existing calls keep working.
    return cstr(s, color)

Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let’s think step by step.
Let $B \overset{\text{ref}}{=}$ blue golf balls.
Since half of the balls are a few foreign yones distinctive of Chinese name cosmogonies but are China a unique case have variations been overlooked do exist Thesis method to cover all ages according which are almost but in the former case but again with

In [7]:
prompt = "One of the hottest areas of investing in recent years has been ESG: "
prompt += "the use of environmental, social, and governance criteria to evaluate possible investments."

result_length = 100
inputs = tokenizer(prompt, return_tensors="pt")

- `result_length` sets the length (in tokens) of the sequence we get back from the model. It is later passed to `generate` as `max_length`, so the prompt tokens count toward it.
- `inputs` contains the token IDs (and attention mask) of the prompt, returned as PyTorch tensors. If we were using TensorFlow we’d pass `return_tensors="tf"`.


## Running Inference: Strategies for Better Responses

## Decoding / search strategies

With autoregressive transformers (trained for next token prediction) we have a number of options to search the answer space for the most “reasonable” output. 

In [8]:
# Greedy Search
print(
    tokenizer.decode(
        model.generate(inputs["input_ids"], max_length=result_length)[0]))

One of the hottest areas of investing in recent years has been ESG: the use of environmental, social, and governance criteria to evaluate possible investments. The ESG movement has been growing steadily since the financial crisis, and the number of companies that have adopted ESG criteria has increased by more than 50% in the last five years. The ESG movement is also growing in the United States, where the number of companies that have adopted ESG criteria has increased by more than 50% in the
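
The greedy continuation above starts to repeat itself. As a rough illustration (not part of the original notebook), the snippet below counts word-level n-grams that occur more than once in that continuation; `repeated_ngrams` is a hypothetical helper and whitespace splitting is a simplifying assumption.

```python
from collections import Counter
from typing import Dict, Tuple


def repeated_ngrams(text: str, n: int = 4) -> Dict[Tuple[str, ...], int]:
    # Split on whitespace and count every window of n consecutive words.
    tokens = text.lower().split()
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Keep only the n-grams that appear more than once.
    return {gram: count for gram, count in counts.items() if count > 1}


# The model-generated part of the greedy output shown above.
continuation = (
    "The ESG movement has been growing steadily since the financial crisis, "
    "and the number of companies that have adopted ESG criteria has increased "
    "by more than 50% in the last five years. The ESG movement is also growing "
    "in the United States, where the number of companies that have adopted ESG "
    "criteria has increased by more than 50% in the"
)
repeats = repeated_ngrams(continuation, n=4)
for gram in repeats:
    print(" ".join(gram))
```

If such repeats are unwanted, the `generate` method accepts a `no_repeat_ngram_size` argument that prevents any n-gram of that size from appearing twice.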


In [None]:
def local_inf(prompt, temperature=0.7, top_p=None, max_new_tokens=32, repetition_penalty=None, do_sample=False, num_return_sequences=1):  
    response = pipe(f"{prompt}", 
                    temperature = temperature, # typically 0 to 1; only has an effect when do_sample=True
                    top_p = top_p, # None, or a value between 0 and 1
                    max_new_tokens = max_new_tokens, # up to 2047 theoretically
                    return_full_text = False, # include prompt or not.
                    repetition_penalty = repetition_penalty, # None, or a value above 1.0 to penalize repeated tokens
                    do_sample = do_sample, # True: use sampling, False: Greedy decoding.
                    num_return_sequences = num_return_sequences
                    )
    return html_print(cstr(prompt, color='#f1f1c7') + cstr(response[0]['generated_text'], color='#a1d8eb')), response[0]['generated_text']


In [None]:
inp = """# Use OpenCV in Python"""
color_resp, resp = local_inf(inp, max_new_tokens=64)
color_resp


- Greedy Search 
  - simply chooses, at each timestep t, the word with the highest predicted probability given the words generated so far. 
  - One of the main issues here is that greedy search will miss a word with a high probability at t+1 if it is preceded by a word with a low probability at t.
- Beam Search 
  - keeps track of the n (num_beams) most likely word sequences at each step and outputs the most likely sequence overall. 
  - Sounds great, but this method breaks down when the output length can be highly variable, as in the case of open-ended text generation. 
  - Both greedy and beam search also produce outputs whose distribution does not align very well with the way humans might perform the same task (i.e. both are liable to produce fairly repetitive, boring text).
- Sampling With Top-k + Top-p
  - a combination of three methods: sampling, Top-k, and Top-p. 
  - By sampling, we mean that the next word is chosen randomly based on its conditional probability distribution (von Platen, 2020). 
  - In Top-k, we keep only the k most likely words, and then redistribute the probability mass amongst them before the next draw. 
  - Top-p adds an additional constraint to top-k, in that we’re choosing from the smallest set of words whose cumulative probability exceeds p.
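
To make the differences concrete, here is a small sketch of greedy selection, Top-k filtering, and Top-p filtering applied to a single toy next-word distribution. The words and probabilities are invented for illustration, and the helper functions are assumptions rather than the `transformers` implementation.

```python
import random
from typing import Dict


def greedy(probs: Dict[str, float]) -> str:
    # Always take the single most likely word.
    return max(probs, key=probs.get)


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    # Keep the k most likely words and renormalize their probabilities.
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    # Keep the smallest set of most likely words whose cumulative mass reaches p.
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for word, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[word] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {word: prob / total for word, prob in kept.items()}


def sample(probs: Dict[str, float], rng: random.Random) -> str:
    # Draw one word at random, weighted by its probability.
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


# Toy distribution over the next word (made-up numbers).
next_word = {"nice": 0.5, "dog": 0.3, "car": 0.15, "banana": 0.05}
rng = random.Random(0)

greedy_choice = greedy(next_word)
top2 = top_k_filter(next_word, k=2)
nucleus = top_p_filter(next_word, p=0.9)
drawn = sample(nucleus, rng)
print(greedy_choice, top2, nucleus, drawn)
```

In this toy case Top-k with `k=2` discards "car" and "banana", while Top-p with `p=0.9` discards only "banana". In a real model the size of the Top-p set changes from step to step depending on how peaked the distribution is.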

## [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate)

by Patrick von Platen (Hugging Face)