<a href="https://colab.research.google.com/github/disnea/Large-Language-Models/blob/main/transformers/open-llama/open_llama_7b_huggingface_langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Open-Llama in Hugging Face and LangChain

In this notebook we'll explore how to use the **Open-LLaMa** model with Hugging Face and LangChain, including a prompt that gets a simple chain working with the model.

---

🚨 _Note that running this on a CPU is practically impossible: it would take a very long time. You need ~28GB of GPU memory to run this notebook. If running on Google Colab, go to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > A100 > Runtime shape > High RAM**._

---
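
As a rough sanity check on the ~28GB figure, the arithmetic below estimates the memory needed for the model weights alone. It assumes roughly 7 billion parameters (taken from the model name, not an exact count) and ignores activations and other overhead, so treat it as a back-of-the-envelope sketch.

```python
# Back-of-the-envelope estimate of weight memory.
# Assumption: ~7 billion parameters, weights only (no activations or overhead).
N_PARAMS = 7e9

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate memory for the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(N_PARAMS, dtype):.0f} GB")
```

Under these assumptions, full-precision weights come to about 28 GB, which is consistent with the requirement stated above. A half-precision format would need roughly half of that.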

We start by doing a `pip install` of all required libraries.

In [None]:
!pip install -qU transformers accelerate langchain==0.0.174 xformers sentencepiece


## Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires three things that we must initialize first:

* An LLM, in this case `openlm-research/open_llama_7b_400bt_preview`.

* The respective tokenizer for the model.

* A stopping criteria object.

We'll explain these as we get to them. Let's begin with our model.

We initialize the model and move it to our CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5-10 minutes.

In [None]:
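from torch import cuda
import transformers

# use the current CUDA device if one is available, otherwise fall back to CPU
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# download (on first run) and load the pretrained weights
model = transformers.AutoModelForCausalLM.from_pretrained(
    'openlm-research/open_llama_7b_400bt_preview'
)
model.eval()  # inference mode (disables dropout)
model.to(device)
print(f"Model loaded on {device}")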

Model loaded on cuda:0


The pipeline requires a tokenizer, which handles the translation of human-readable plaintext into LLM-readable token IDs. The Open-LLaMa model was trained using the `openlm-research/open_llama_7b_400bt_preview` tokenizer, which we initialize like so:

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "openlm-research/open_llama_7b_400bt_preview", use_fast=False
)

Finally we need to define the _stopping criteria_ of the model. The stopping criteria allow us to specify *when* the model should stop generating text. If we don't provide any stopping criteria, the model tends to go off on a tangent after answering the initial question.

To figure out what the stopping criteria should be we can start with the *end of sequence* or `'</s>'` token:

In [None]:
tokenizer.convert_tokens_to_ids(['</s>'])

[2]

But this is not usually a satisfactory stopping criterion, particularly for less sophisticated models. Instead, we need to find typical finish points for the model. For example, if we are generating a chatbot conversation we might see something like:

```
User: {some query}
Assistant: {the generated answer}
User: ...
```

Here everything past `Assistant:` is generated, including the next line of `User:`. The LLM may continue generating the conversation beyond the `Assistant:` output because it is simply predicting the conversation; it doesn't necessarily know that it should stop after providing the *one* `Assistant:` response.

With that in mind, we can specify `User:` as a stopping criterion, which we can identify with:

In [None]:
tokenizer.convert_tokens_to_ids(['User', ':'])

[11080, 31871]

We don't write `'User:'` directly because no single `'User:'` token exists in the vocabulary, so the lookup returns the **unknown** token. Instead, the text is represented by the two tokens `['User', ':']`.

In [None]:
# the names were swapped in the lookup: the first call returns IDs, the second tokens
unk_token_id = tokenizer.convert_tokens_to_ids(['User:'])
unk_token = tokenizer.convert_ids_to_tokens(unk_token_id)
print(unk_token_id, unk_token)

[0] ['<unk>']
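
To make the lookup behaviour concrete, here is a toy sketch that does not use the real tokenizer. The `toy_vocab` dictionary and `tokens_to_ids` helper are hypothetical; the IDs are copied from the outputs above.

```python
# Toy stand-in for a vocabulary lookup (not the real tokenizer).
# The IDs are the ones printed earlier in this notebook.
toy_vocab = {"<unk>": 0, "</s>": 2, "User": 11080, ":": 31871}

def tokens_to_ids(tokens):
    """Map each token to its ID, falling back to the <unk> ID when missing."""
    unk_id = toy_vocab["<unk>"]
    return [toy_vocab.get(token, unk_id) for token in tokens]

print(tokens_to_ids(["User", ":"]))  # both tokens exist in the vocabulary
print(tokens_to_ids(["User:"]))      # no single 'User:' token, so <unk>
```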


We repeat this for various possible stopping conditions to create our list of `stop_token_ids`:

In [None]:
stop_token_ids = [
    tokenizer.convert_tokens_to_ids(x) for x in [
        ['</s>'], ['User', ':'], ['system', ':'],
        [tokenizer.convert_ids_to_tokens([9427])[0], ':']
    ]
]

stop_token_ids

[[2], [11080, 31871], [15322, 31871], [9427, 31871]]

We also need to convert these to `LongTensor` objects:

In [None]:
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([2], device='cuda:0'),
 tensor([11080, 31871], device='cuda:0'),
 tensor([15322, 31871], device='cuda:0'),
 tensor([ 9427, 31871], device='cuda:0')]

We can do a quick spot check that no `<unk>` token IDs (`0`) appear in `stop_token_ids`. There are none, so we can move on to building the stopping criteria object, which checks whether any of these token ID combinations have been generated.

In [None]:
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])

In [None]:
# this should return False because the input does not end with any stop sequence
stopping_criteria(
    torch.LongTensor([[1, 2, 3, 5000, 90000]]).to(device),
    torch.FloatTensor([0.0])
)

False

In [None]:
# this should return True because the input ends with the 'User', ':' stop sequence
stopping_criteria(
    torch.LongTensor([[1, 2, 3, 11080, 31871]]).to(device),
    torch.FloatTensor([0.0])
)

True
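
The check performed inside `StopOnTokens` is a suffix comparison. The sketch below mirrors that logic on plain Python lists so it can be read without PyTorch; `ends_with_stop_sequence` is a hypothetical helper, and the IDs are those shown earlier.

```python
# Plain-Python sketch of the suffix check (hypothetical helper, no tensors).
stop_sequences = [[2], [11080, 31871], [15322, 31871], [9427, 31871]]

def ends_with_stop_sequence(generated_ids, stop_sequences):
    """Return True if the generated IDs end with any of the stop sequences."""
    for stop_ids in stop_sequences:
        n = len(stop_ids)
        # a sequence shorter than the stop sequence can never match it
        if len(generated_ids) >= n and generated_ids[-n:] == stop_ids:
            return True
    return False

print(ends_with_stop_sequence([1, 2, 3, 5000, 90000], stop_sequences))
print(ends_with_stop_sequence([1, 2, 3, 11080, 31871], stop_sequences))
```

Only the end of the sequence is inspected, so a stop sequence that sits earlier in the text does not match on its own.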

Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [None]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    device=device,
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model will ramble
    temperature=0.1,  # 'randomness' of outputs, lower values are more deterministic
    top_p=0.15,  # select from top tokens whose probabilities add up to 15%
    top_k=0,  # 0 disables top-k filtering, so selection relies on top_p
    max_new_tokens=256,  # max number of tokens to generate in the output
    repetition_penalty=1.2  # without this output begins repeating
)
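
To give some intuition for `temperature` and `top_p`, the sketch below applies both to a made-up set of next-token scores. It is a simplified illustration, not the library's implementation; the `logits` values and both helper functions are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    # renormalize so the surviving probabilities sum to 1
    return {i: probs[i] / kept_mass for i in kept}

logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for four candidate tokens
sharp = softmax_with_temperature(logits, 0.1)
flat = softmax_with_temperature(logits, 1.0)
candidates = top_p_filter(sharp, 0.15)

print([round(p, 3) for p in flat])
print([round(p, 3) for p in sharp])
print(sorted(candidates))
```

With a low temperature and a small `top_p`, almost all of the probability mass sits on the single most likely token, so the candidate set tends to collapse to that one token.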

Confirm this is working:

In [None]:
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])

Explain to me the difference between nuclear fission and fusion.
Nuclear Fusion is when two or more atoms are combined together to form a single atom, releasing energy in the process. Nuclear Fission is when an atomic nucleus splits into smaller nuclei, releasing energy in the process.
What is the difference between nuclear fusion and nuclear fission?
The main difference between nuclear fusion and nuclear fission is that nuclear fusion occurs naturally while nuclear fission does not occur naturally. In nuclear fusion, two or more atoms combine to form one larger atom, releasing energy in the process. In nuclear fission, an atomic nucleus breaks apart into smaller nuclei, releasing energy in the process.
How do you explain the difference between nuclear fusion and nuclear fission?
There is no difference between nuclear fusion and nuclear fission. Both processes release energy by splitting atoms. The only difference is that nuclear fusion releases energy through the reaction of two or mo

The generated output here does contain an answer, but the model does not stop there. It goes on to invent follow-up questions and eventually contradicts itself. None of our stop sequences (such as `User:`) appear in this text, so the `stopping_criteria` are never triggered.

To fix this we can add instructions and a conversational structure to our prompt. We can do this easily with LangChain using `PromptTemplate` objects.

Let's go ahead and create one of these prompt templates and see how we can implement the Hugging Face pipeline in LangChain.

In [None]:
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# template for an instruction with no input
prompt = PromptTemplate(
    input_variables=["query"],
    template="""You are a helpful AI assistant, you will answer the users query
with a short but precise answer. If you are not sure about the answer you state
"I don't know". This is a conversation, not a webpage, there should be ZERO HTML
in the response.

Remember, Assistant responses are short. Here is the conversation:

User: {query}
Assistant: """
)

llm = HuggingFacePipeline(pipeline=generate_text)

llm_chain = LLMChain(llm=llm, prompt=prompt)
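
Conceptually, the template substitutes the user's query into a fixed block of text before it reaches the model. The sketch below imitates that with Python's built-in `str.format`, using a shortened, hypothetical template rather than the `PromptTemplate` object itself.

```python
# Shortened, hypothetical template; plain str.format stands in for PromptTemplate.
toy_template = (
    "You are a helpful AI assistant.\n\n"
    "User: {query}\n"
    "Assistant: "
)

def fill_template(query: str) -> str:
    """Insert the query into the template, leaving the Assistant turn open."""
    return toy_template.format(query=query)

filled = fill_template("What is nuclear fusion?")
print(filled)
```

Ending the prompt with `Assistant: ` invites the model to write a single reply. If it then starts a new `User:` turn, it produces one of our stop sequences.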



In [None]:
output = llm_chain.predict(
    query="Explain to me the difference between nuclear fission and fusion."
).lstrip()
print(output)

Nuclear Fission is when an atom splits into two smaller atoms.
Nuclear Fusion is when two or more atoms combine together to form one larger
atom.
User:


In this second example we're getting much cleaner output, and we can see the cut-off occurred after hitting one of our stop sequences (`User:`).

We can either clean this up with a simple `.removesuffix()` (available from Python 3.9):

In [None]:
print(output.removesuffix('User:'))

Nuclear Fission is when an atom splits into two smaller atoms.
Nuclear Fusion is when two or more atoms combine together to form one larger
atom.



Or if we'd prefer to wrap all of this into a single call, we could add some `.removesuffix()` logic to a custom chain — we place this within the `_call` method:

In [None]:
from typing import Any, Dict, List, Optional

from langchain.base_language import BaseLanguageModel
from langchain.callbacks.manager import (
    AsyncCallbackManagerForChainRun,
    CallbackManagerForChainRun,
)
from langchain.chains.base import Chain
from langchain.prompts.base import BasePromptTemplate

class OpenLlamaChain(Chain):
    prompt: BasePromptTemplate
    llm: BaseLanguageModel
    output_key: str = "text"
    suffixes: List[str] = ['</s>', 'User:', 'system:', 'Assistant:']

    @property
    def input_keys(self) -> List[str]:
        return self.prompt.input_variables

    @property
    def output_keys(self) -> List[str]:
        return [self.output_key]

    def _call(
        self,
        inputs: Dict[str, Any],
        run_manager: Optional[CallbackManagerForChainRun] = None,
    ) -> Dict[str, str]:
        # format the prompt
        prompt_value = self.prompt.format_prompt(**inputs)
        # generate response from llm
        response = self.llm.generate_prompt(
            [prompt_value],
            callbacks=run_manager.get_child() if run_manager else None
        )
        # strip any trailing stop sequence from the generated text
        text = response.generations[0][0].text
        for suffix in self.suffixes:
            text = text.removesuffix(suffix)

        return {self.output_key: text.lstrip()}

    async def _acall(
        self,
        inputs: Dict[str, Any],
        run_manager: Optional[AsyncCallbackManagerForChainRun] = None,
    ) -> Dict[str, str]:
        raise NotImplementedError("Async is not supported for this chain.")

    @property
    def _chain_type(self) -> str:
        return "open_llama_chat_chain"

    def predict(self, query: str) -> str:
        out = self._call(inputs={'query': query})
        return out['text']

There's a lot of code here, but we only need to pay attention to the `_call` and `predict` methods. The remainder is essentially the default code used in LangChain chains.

Within `_call` we:

* Pass the inputs (just `query` in this case) to our prompt template to create the formatted `prompt_value`.
* Pass `prompt_value` into the LLM, triggering the pipeline we earlier defined via Hugging Face.
* Remove any of the defined `suffixes` from our response text.
* Return the text in the format `{'text': <generated_text>}` — where we also apply `.lstrip()` to the generated text.
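
The four steps above can be sketched without LangChain. Everything here is a stand-in: `fake_llm` returns a canned string instead of calling the model, `TEMPLATE` is a shortened prompt, and `strip_suffix` is a small helper that behaves like `str.removesuffix` but also runs on older Python versions.

```python
SUFFIXES = ["</s>", "User:", "system:", "Assistant:"]
TEMPLATE = "User: {query}\nAssistant: "  # shortened, hypothetical prompt

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the pipeline; returns a canned completion."""
    return " Fission splits an atom; fusion combines atoms.\nUser:"

def strip_suffix(text: str, suffix: str) -> str:
    """Remove suffix from the end of text if present (like str.removesuffix)."""
    return text[: -len(suffix)] if suffix and text.endswith(suffix) else text

def run_chain(query: str) -> dict:
    prompt_value = TEMPLATE.format(query=query)  # 1. format the prompt
    text = fake_llm(prompt_value)                # 2. generate a response
    for suffix in SUFFIXES:                      # 3. remove trailing stop text
        text = strip_suffix(text, suffix)
    return {"text": text.lstrip()}               # 4. return as a dictionary

result = run_chain("Explain fission and fusion.")
print(repr(result["text"]))
```

As in the chain, only leading whitespace is stripped, so the newline that preceded the removed `User:` is kept. Using `.strip()` in place of `.lstrip()` would remove it as well.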

Finally, in `predict`, we simply take the user's input and format it for `_call`. The output from `_call` is converted from a dictionary to plain text and returned.

Let's go ahead and initialize the chain as we did earlier with the `LLMChain`:

In [None]:
llama_chain = OpenLlamaChain(llm=llm, prompt=prompt)

And now make our prediction:

In [None]:
output = llama_chain.predict(
    query="Explain to me the difference between nuclear fission and fusion."
)
print(output)

Nuclear Fission is when an atom splits into two smaller atoms.
Nuclear Fusion is when two or more atoms combine together to form one larger
atom.



With that we've built our Open-LLaMa chain in LangChain.