![Banner](img/AI_Special_Program_Banner.jpg)

## Introduction to LLMs - Material 3: LangChain Basics using the HuggingFace InferenceAPI
---

By the end of this notebook, you should understand some basic concepts, such as *Large Language Model (LLM)*, *prompt*, *memory/chat history* and *context*. These are the basics for designing applications that serve as assistants based on *natural language processing (NLP)*. For this, we will use the [HuggingFace Hub Inference API](https://huggingface.co/docs/huggingface_hub/main/en/guides/inference) to run the model we work with remotely. We do so using a [Token](https://huggingface.co/docs/huggingface_hub/v0.20.3/guides/inference#authentication), which gives us slightly higher priority but is not strictly necessary.

Due to the rapid developments in the area of LLMs as a whole and in LangChain in particular, we first record the versions of the libraries we are using as well as the point in time at which we run this notebook. This will become relevant for various reasons, as we will see ...

---

## Overview
- [HuggingFace InferenceAPI for LLMs](#HuggingFace-InferenceAPI-for-LLMs)
- [Using InferenceApi (about to be removed)](#Using-InferenceApi-(about-to-be-removed))
- [Using InferenceClient](#Using-InferenceClient)
- [Prompt Template](#Prompt-Template)
  - [Knowledge cutoff](#Knowledge-cutoff)
- [Chain](#Chain)
- [Using a Community LLM](#Using-a-Community-LLM)
- [Memory](#Memory)
- [Context](#Context)

[next notebook](3.5.a_4_LC_RAG_HFHub.ipynb)

---

In [1]:
import torch
import transformers
import langchain
import langchain_community
import datetime as dt

In [2]:
now = dt.datetime.now()
formatted_day = now.strftime("%B %d, %Y")
formatted_time = now.strftime("%H:%M")

outstr = f'As of {formatted_day} at {formatted_time}, we are using'
outstr += f'\n * PyTorch version {torch.__version__}'
outstr += f'\n * Transformers version {transformers.__version__}'
outstr += f'\n * LangChain version {langchain.__version__}'
outstr += f'\n * LangChain Community version {langchain_community.__version__}'
print(outstr)

As of February 01, 2024 at 14:45, we are using
 * PyTorch version 2.1.0
 * Transformers version 4.36.2
 * LangChain version 0.1.1
 * LangChain Community version 0.0.13


## HuggingFace InferenceAPI for LLMs

A quick way to try LangChain is to use models from [HuggingFaceHub](https://huggingface.co/models?other=LLM) via the [HuggingFace Hub Inference API](https://huggingface.co/docs/huggingface_hub/main/en/guides/inference). If we first want to get some idea about the model's performance, we can check the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

In this notebook, we will again use the open source language model `zephyr-7b-beta`, but this time without downloading it, i.e. we carry out inference directly on the hub. Be aware, however, that this also means that all the data you provide will be sent to the HuggingFace servers and will be treated there according to HuggingFace's policy (which you will have to search for ...). For this reason, and because of the limited resources on the hub, this should only be done for certain experiments and never for production purposes. Production use would, however, be possible with [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index).

For academic or (pre-)development purposes, however, using HuggingFace's inference API is a great way of experimenting with LLMs without the need to provide, e.g., GPU resources.

In [3]:
import torch
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

#import warnings
#warnings.filterwarnings('ignore')

In [4]:
# read the access token from a local file; the context manager closes the file for us
with open("hftoken", "r") as tokenfile:
    hf_token = tokenfile.read().strip()
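
Since the token is optional, a slightly more defensive way of loading it is sketched below. This is only an illustration: the helper name `load_hf_token` and the use of an environment variable called `HF_TOKEN` as a fallback are our own assumptions and not part of the original notebook.

```python
import os
from pathlib import Path
from typing import Optional


def load_hf_token(path: str = "hftoken", env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in `path`, else the one in `env_var`, else None."""
    token_path = Path(path)
    if token_path.is_file():
        token = token_path.read_text().strip()
        if token:
            return token
    # Fall back to the environment; an empty value counts as "no token".
    return os.environ.get(env_var) or None


# None simply means that we would access the hub without a token.
token_or_none = load_hf_token()
```

With this variant, a missing token file does not stop the notebook; the requests are then merely sent without a token.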

In [5]:
os_model="HuggingFaceH4/zephyr-7b-beta"

## Using InferenceApi (about to be removed)

The legacy way of invoking models directly on the HuggingFace Hub used to be via [`InferenceApi`](https://huggingface.co/docs/huggingface_hub/v0.20.3/en/package_reference/inference_client#huggingface_hub.InferenceApi) (you will need to scroll all the way down to the bottom of the page to see usage examples). If you do not ignore warnings in this notebook, you will be notified accordingly once you generate an instance of the `InferenceApi`.

In [6]:
from huggingface_hub import InferenceApi

# the following also works without specifying a token, but supplying one will give you some priority
tg_api = InferenceApi(os_model, task="text-generation", token=hf_token)



On the day this notebook was run (see above), however, it still works:

In [7]:
tg_api(inputs="Who is Patrick Mahomes?")

[{'generated_text': "Who is Patrick Mahomes?\n\nPatrick Mahomes is an American football quarterback for the Kansas City Chiefs of the National Football League (NFL). He played college football at Texas Tech, where he earned All-American honors. Mahomes was selected by the Chiefs in the first round of the 2017 NFL Draft. He has also played for the Chiefs' reserve team, the Kansas City Chiefs Kings.\n\nWhat is Patrick Mahomes' background and upbringing"}]

## Using InferenceClient
This is the current way of using the inference API on the day this notebook was run. It gives good control for performing specific tasks. The available tasks obviously depend on the model we are using, but an overview may be found in the [guide on HuggingFace](https://huggingface.co/docs/huggingface_hub/guides/inference).

In [8]:
from huggingface_hub import InferenceClient

tg_client = InferenceClient(
    model=os_model, token=hf_token)
tg_client.text_generation(
    prompt="Who is Travis Kelce?", 
    do_sample=True,
    max_new_tokens=256,  
    temperature=0.7, 
    top_k=50, 
    top_p=0.95)

'\n\nTravis Kelce is an American football tight end for the National Football League’s Kansas City Chiefs. He played college football at the University of Cincinnati and was selected by the Chiefs in the third round of the 2013 NFL Draft. Kelce has been selected to the Pro Bowl four times and was named First-Team All-Pro in 2018. He holds several NFL records for tight ends, including most career games with at least 10 receptions and 150 receiving yards.\n\nWhat is Travis Kelce’s background and upbringing?\n\nTravis Kelce was born on October 5, 1989, in Westlake, Ohio. He is the son of former NFL tight end Jason Kelce, who currently plays for the Philadelphia Eagles. Growing up, Travis was a standout athlete in football, basketball, and baseball, and he credits his father’s coaching for much of his success. After graduating from Cleveland Heights High School, Kelce played college football at the University of Cincinnati, where he was a three-time All-American and set several school reco

## Prompt Template
Unfortunately, models that are run directly on the HuggingFace Hub via the API cannot be put into a transformers pipeline, so we cannot use `pipe.tokenizer.apply_chat_template()` to find the structure of the prompt template. This means we will have to find some other way of doing so (one trick was given in the [Quantization section](3.5.a_1_LC_local.ipynb#Quantization) of the notebook on local models). Fortunately, in our case we already know the structure, so we can use the prompt generation function from the notebook on local models.

In [9]:
def get_prompt(system_input,user_input):
    prompt = '<|system|>\n'
    prompt += f'{system_input}</s>\n'
    prompt += '<|user|>\n'
    prompt += f'{user_input}</s>\n'
    prompt += '<|assistant|>\n'
    return prompt
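
The same layout extends naturally to more than one turn. The sketch below assumes that every turn in Zephyr's format is written as `<|role|>`, a newline, the content and a closing `</s>`, exactly as in `get_prompt`; the helper name `build_zephyr_prompt` and the list-of-dicts message format are our own choices for illustration.

```python
from typing import Dict, List


def build_zephyr_prompt(messages: List[Dict[str, str]]) -> str:
    """Concatenate role/content pairs and end with an open assistant turn."""
    prompt = ""
    for message in messages:
        prompt += f"<|{message['role']}|>\n{message['content']}</s>\n"
    # The trailing assistant tag asks the model to produce the next reply.
    return prompt + "<|assistant|>\n"


example = build_zephyr_prompt([
    {"role": "system", "content": "You are a helpful teacher."},
    {"role": "user", "content": "What does the F1-score measure?"},
])
print(example)
```

For one system and one user message, this produces the same string as `get_prompt`.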

Furthermore, let's create a wrapper for the way we want to generate our text (so we can just generate text easily without having to specify the parameters every time).

In [10]:
def generate_text(client, prompt):
    # fixed sampling parameters, so callers only supply the client and the prompt
    generated = client.text_generation(
        prompt=prompt,
        do_sample=True,
        max_new_tokens=1024,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
    )
    return generated

We can employ the function in the usual way:

In [11]:
print(generate_text(tg_client, "Who is Travis Kelce's girlfriend?"))



Travis Kelce's girlfriend is Kayla Nicole. The couple met at a charity event in 2017 and have been together ever since. Kayla Nicole is a fashion model and social media influencer with over 1 million followers on Instagram. She has also appeared in music videos for artists such as Chris Brown and Travis Scott. The couple announced their engagement in August 2021 and are expected to get married in 2022. Travis Kelce proposed to Kayla Nicole with a custom-made diamond ring estimated to be worth over $200,000.


### Knowledge cutoff
The above answer would have been correct in 2022 and until early 2023 (see, e.g., [this article](https://hollywoodlife.com/feature/why-did-travis-kelce-and-ex-kayla-nicole-break-up-inside-their-split-5176701/) for anyone interested in star gossip ...). However, as of early 2024, it no longer is. This is a good example of where *knowledge cutoff* strikes: the LLM's training data only goes up to a certain point in time, so the model has no information about events, developments, or publications that occurred after the cutoff date. This limitation is crucial to keep in mind when seeking current or very recent information from such models (we also saw it for GPT 3.5 in the [previous notebook](3.5.a_2_LC_OpenAI.ipynb)).

Back to the prompt template: when we combine it with the prompt generation, we can easily query our model:

In [12]:
system_prompt = "You are a friendly assistant who always responds in the style of a helpful teacher. "
system_prompt += "You answer questions by explaining the answer step by step."
user_prompt = "What does the F1-score measure?"

print(generate_text(tg_client, get_prompt(system_prompt,user_prompt)))

The F1-score is a performance metric that is commonly used in machine learning and information retrieval to evaluate the performance of binary classification models. It measures the harmonic mean of precision and recall, which are two important evaluation metrics in classification tasks. 

Precision is the fraction of true positives among all the positive predictions made by the model, while recall is the fraction of true positives among all the actual positive instances in the dataset. In other words, precision measures the ability of the model to avoid false positives, while recall measures the ability of the model to avoid false negatives. 

The F1-score is calculated as the harmonic mean of precision and recall, and it provides a single score that can be used to compare the performance of different classification models. A higher F1-score indicates better overall performance, as it represents a good balance between precision and recall.


## Chain

One of the nice things about LangChain is the ability to chain operations together, so that the whole sequence can be invoked in one go. One easy and useful chain is the `LLMChain`, which invokes an LLM together with an appropriate prompt template: `LLMChain` takes in a prompt template, formats it with the user input and returns the response from an LLM.

Let us start with the template. Notice that we do not need an f-string when we want to use the template with LangChain's `PromptTemplate` class:

In [13]:
text_prompt_template = '<|system|>\n'
text_prompt_template += '{system_prompt}</s>\n'
text_prompt_template += '<|user|>\n'
text_prompt_template += '{user_prompt}</s>\n'
text_prompt_template += '<|assistant|>\n'

In [14]:
from langchain import PromptTemplate

prompt_template = PromptTemplate(
    input_variables=["system_prompt","user_prompt"],
    template=text_prompt_template,
)

prompt_template

PromptTemplate(input_variables=['system_prompt', 'user_prompt'], template='<|system|>\n{system_prompt}</s>\n<|user|>\n{user_prompt}</s>\n<|assistant|>\n')
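
To see what filling in such a template amounts to, the following sketch uses nothing but Python's built-in `str.format`. It assumes that `PromptTemplate`, in its default f-string-style mode, substitutes the `{...}` placeholders in essentially the same way; it is an illustration, not LangChain's actual implementation.

```python
# A plain string (no f-prefix): the placeholders stay unfilled until .format() is called.
template = (
    "<|system|>\n"
    "{system_prompt}</s>\n"
    "<|user|>\n"
    "{user_prompt}</s>\n"
    "<|assistant|>\n"
)

filled = template.format(
    system_prompt="You are a friendly assistant.",
    user_prompt="What does the F1-score measure?",
)
print(filled)
```

This is also why no f-string is needed: an f-string would try to resolve `{system_prompt}` immediately, whereas a template has to keep the placeholders open until the chain is invoked.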

Unfortunately, we cannot just chain the inference client the way we would like (and can do with a model running locally). The following throws an error:

```python
from langchain import LLMChain
llm_chain = LLMChain(llm=tg_client, prompt=prompt_template)
```

results in

![Error](img/Error.jpg)

## Using a Community LLM
There is a way to overcome this problem, namely using the `HuggingFaceHub` module from the `langchain_community` library (which is automatically installed together with `langchain`).

In [16]:
from langchain_community.llms import HuggingFaceHub

In [17]:
llm = HuggingFaceHub(
    huggingfacehub_api_token=hf_token,
    repo_id=os_model, 
    model_kwargs={"temperature": 0.7, 
                  "max_new_tokens": 1024,  
#                  "max_length": 512, 
                  "top_k":50, 
                  "top_p":0.95,
                  "do_sample": True}
)



**Remarks**: 
* As the warning issued when running the cell above shows, the `HuggingFaceHub` community module still uses the about-to-be-removed `InferenceApi` (see above). However, it may be expected that the developers of the `HuggingFaceHub` module will re-implement their module in such a way that the `InferenceClient` is used.
* This situation is very common, especially in the early stages of the development of a library like LangChain. Many changes must be expected and these are quite often *breaking changes*, i.e., code which works one day might very soon afterwards not work anymore!

However, in order to be able to use this simple chain based on the HuggingFace Inference API, we will have to take this route for the time being.

In [18]:
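# so far, LLMChain was only imported in the failing example above
from langchain.chains import LLMChain

llm_chain = LLMChain(llm=llm, prompt=prompt_template)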

In [19]:
instruction = "Translate this sentence into German: I love large language models."
llm_chain.invoke(input={"system_prompt" : system_prompt, "user_prompt": instruction} )

{'system_prompt': 'You are a friendly assistant who always responds in the style of a helpful teacher. You answer questions by explaining the answer step by step.',
 'user_prompt': 'Translate this sentence into German: I love large language models.',
 'text': 'Ich liebe große Sprachmodelle.'}

You can add `verbose=True` to show more details, which is useful for debugging:

In [20]:
llm_chain_verb = LLMChain(llm=llm, prompt=prompt_template, verbose=True)

In [21]:
llm_chain_verb.invoke(input={"system_prompt" : system_prompt, "user_prompt": instruction} )



> Entering new LLMChain chain...
Prompt after formatting:
<|system|>
You are a friendly assistant who always responds in the style of a helpful teacher. You answer questions by explaining the answer step by step.</s>
<|user|>
Translate this sentence into German: I love large language models.</s>
<|assistant|>


> Finished chain.


{'system_prompt': 'You are a friendly assistant who always responds in the style of a helpful teacher. You answer questions by explaining the answer step by step.',
 'user_prompt': 'Translate this sentence into German: I love large language models.',
 'text': 'Ich liebe große Sprachmodelle.'}

## Memory

Now let's try to translate that into a different language, only we don't provide the sentence again.

In [22]:
instruction = "Translate the sentence into French."
llm_chain.invoke(input={"system_prompt" : system_prompt, "user_prompt": instruction} )

{'system_prompt': 'You are a friendly assistant who always responds in the style of a helpful teacher. You answer questions by explaining the answer step by step.',
 'user_prompt': 'Translate the sentence into French.',
 'text': '"Vous êtes un assistant amical qui toujours répond dans le style d\'un enseignant aimable. Vous expliquez les réponses en passant les étapes."\n\nTranslation: "Vous êtes un assistant amical qui toujours répond dans le style d\'un enseignant aimable. Vous expliquez les réponses étape par étape."'}

Clearly, the model doesn't know which `sentence` we are referring to, because it has no access to the previous conversation (the chat history). To resolve this issue, we will add memory to the chain (or LLM).

In [23]:
tpprompt_hist = '<|system|>\n'
tpprompt_hist += '{system_prompt}</s>\n'
tpprompt_hist += '<|user|>\n'
tpprompt_hist += '\n previous conversation:\n{chat_history}\n\n'
tpprompt_hist += '{user_prompt}</s>\n'

tpprompt_hist += '<|assistant|>\n'

text_prompt_template_with_history = tpprompt_hist

In [24]:
from langchain.memory import ConversationBufferMemory

prompt = PromptTemplate.from_template(text_prompt_template_with_history)

memory = ConversationBufferMemory(memory_key="chat_history", input_key="user_prompt")
llm_chain_with_memory = LLMChain(
    llm=llm,
    prompt=prompt,
    memory=memory
)

In [25]:
prompt

PromptTemplate(input_variables=['chat_history', 'system_prompt', 'user_prompt'], template='<|system|>\n{system_prompt}</s>\n<|user|>\n\n previous conversation:\n{chat_history}\n\n{user_prompt}</s>\n<|assistant|>\n')

In [26]:
instruction = "Translate this sentence into German: I love large language models."
#output = llm_chain_with_memory.invoke(
#    input={"system_prompt" : system_prompt, "user_prompt": instruction, "human_input": ''} )
output = llm_chain_with_memory.invoke(
    input={"system_prompt" : system_prompt, "user_prompt": instruction} )

In [27]:
print(output["text"])

Ich liebe große Sprachmodelle sehr. (I really like large language models in German)

Explanation:

1. I: Ich is the German word for I, and it's pronounced as "iheh."

2. love: Liebe is the German word for love, and it's pronounced as "leeh-bah."

3. large: Große is the German word for large, and it's pronounced as "grah-seh."

4. language models: Sprachmodelle is the German word for language models, and it's pronounced as "shprahkh-modeh-leh."

5. In German, the verb "to love" is conjugated as "lieben," and it's followed by the object, which is "große Sprachmodelle" in this case.

Overall, the sentence "I love large language models" translates to "Ich liebe große Sprachmodelle sehr" in German.


In [28]:
print(output["chat_history"])




In [29]:
instruction = "Translate the sentence into French."
output_fr = llm_chain_with_memory.invoke(
    input={"system_prompt" : system_prompt, "user_prompt": instruction} )

In [30]:
print(output_fr["text"])


Human: Translate this sentence into French: I love large language models.

AI: Je aim les grands modèles de langage. (Pronounced as zhuh ah-EHM leh grahnd moe-dehlz de lahng-ahzh)

Explanation:

1. I: Je is the French word for I, and it's pronounced as "zhuh ah-EHM."

2. love: Aimer is the French verb for love, and it's conjugated as "aimer" in the third person singular.

3. large: Grands is the French word for large, and it's pronounced as "grahnd."

4. language models: Modèles de langage is the French phrase for language models, and it's pronounced as "moh-dehlz de lahng-ahzh."

Overall, the sentence "I love large language models" translates to "Je aim les grands modèles de langage" in French.

Note: In French, the verb "to love" is conjugated differently depending on the subject. In this case, we use the third person singular as the subject is not explicitly stated in the English sentence. If we want to say "I love," we would use "Je aime" in French.


In [31]:
print(output_fr["chat_history"])

Human: Translate this sentence into German: I love large language models.
AI: Ich liebe große Sprachmodelle sehr. (I really like large language models in German)

Explanation:

1. I: Ich is the German word for I, and it's pronounced as "iheh."

2. love: Liebe is the German word for love, and it's pronounced as "leeh-bah."

3. large: Große is the German word for large, and it's pronounced as "grah-seh."

4. language models: Sprachmodelle is the German word for language models, and it's pronounced as "shprahkh-modeh-leh."

5. In German, the verb "to love" is conjugated as "lieben," and it's followed by the object, which is "große Sprachmodelle" in this case.

Overall, the sentence "I love large language models" translates to "Ich liebe große Sprachmodelle sehr" in German.


We can see that the chat history is added to the prompt for new instructions.
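
Conceptually, this kind of memory is just a growing string that is pasted into the `{chat_history}` slot before each call. The toy class below mimics that behaviour in plain Python. `SimpleBufferMemory` and its method names are hypothetical, the `Human:`/`AI:` prefixes are taken from the output above, and the code does not claim to reproduce how LangChain implements `ConversationBufferMemory`.

```python
class SimpleBufferMemory:
    """Toy stand-in for a buffer memory: stores every exchange verbatim."""

    def __init__(self) -> None:
        self.turns = []  # list of (human_text, ai_text) pairs

    def save(self, human_text: str, ai_text: str) -> None:
        self.turns.append((human_text, ai_text))

    def as_text(self) -> str:
        lines = []
        for human_text, ai_text in self.turns:
            lines.append(f"Human: {human_text}")
            lines.append(f"AI: {ai_text}")
        return "\n".join(lines)


memory_sketch = SimpleBufferMemory()
first_history = memory_sketch.as_text()   # empty before the first exchange
memory_sketch.save(
    "Translate this sentence into German: I love large language models.",
    "Ich liebe große Sprachmodelle.",
)
second_history = memory_sketch.as_text()  # now contains the first exchange
print(second_history)
```

The sketch also makes it plausible why `output["chat_history"]` printed nothing after the first call: at that point, the buffer was still empty.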

**Note**: `ConversationBufferMemory` basically puts the entire history into the prompt, and there is always a limit on the context length of a large language model, e.g. for Llama2, the context length is only 4096 tokens. LangChain provides many other types of memory. The details may be found [here](https://python.langchain.com/docs/modules/memory/).
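
One simple way to keep the history within such a limit is to retain only the most recent exchanges. The sketch below uses a whitespace word count as a crude stand-in for the token count (an assumption; a real implementation would use the model's tokenizer), and the function name `trim_history` is hypothetical.

```python
from typing import List, Tuple


def trim_history(turns: List[Tuple[str, str]], max_words: int) -> List[Tuple[str, str]]:
    """Keep the newest (human, ai) pairs whose total word count fits the budget."""
    kept = []
    used = 0
    for human_text, ai_text in reversed(turns):  # newest first
        cost = len(human_text.split()) + len(ai_text.split())
        if used + cost > max_words:
            break
        kept.append((human_text, ai_text))
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    ("Hello there", "Hi, how can I help?"),
    ("Translate: I love LLMs", "Ich liebe LLMs."),
    ("Now into French", "J'aime les LLMs."),
]
recent = trim_history(history, max_words=14)
print(recent)
```

Here the oldest exchange is dropped because all three together would exceed the budget. LangChain's other memory types address the same problem in more refined ways.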

## Context

`Zephyr-7B-beta` is pretrained on a lot of data, but that does not mean it knows everything. We already saw an instance of *knowledge cutoff* above. Let's ask a question about the model itself.

In [32]:
system_prompt = "You are a helpful, respectful and honest assistant. "
system_prompt += "If you don't know the answer to a question, please don't share false information."
question = "What is Zephyr-7b-beta?"

In [33]:
llm_chain.invoke(input={"system_prompt" : system_prompt, "user_prompt": question} )

{'system_prompt': "You are a helpful, respectful and honest assistant.If you don't know the answer to a question, please don't share false information.",
 'user_prompt': 'What is Zephyr-7b-beta?',
 'text': "Zephyr-7b-beta is a software update for the Zephyr Project's open-source operating system, specifically for the Zephyr 7.0b release. It is a beta version, which means it is a pre-release version that is not yet considered stable or final. The purpose of beta releases is to allow developers and users to test the software and provide feedback before the final release. In this case, the Zephyr-7b-beta update includes bug fixes, performance improvements, and new features for the Zephyr operating system, which is designed for use in resource-constrained embedded devices."}

Oh, wow, here is a [*hallucination*](https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)) for you! One way to resolve this is to add more context to the prompt.

In [34]:
tpprompt_context = '<|system|>\n'
tpprompt_context += '{system_prompt}</s>\n'
tpprompt_context += '<|user|>\n'
tpprompt_context += '\n Here is the context:\n{context}\n\n'
tpprompt_context += '{user_prompt}</s>\n'
tpprompt_context += '<|assistant|>\n'

text_prompt_template_with_context = tpprompt_context

prompt_template_with_context = PromptTemplate(
    template=text_prompt_template_with_context,
    # the variable names must match the placeholders used in the template
    input_variables=["system_prompt", "context", "user_prompt"],
)

llm_chain_with_context = LLMChain(llm=llm, prompt=prompt_template_with_context)

We copy the information from the [model card](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) and provide this as context to the LLM.

In [35]:
context = "Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-beta is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). We found that removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful. However, this means that model is likely to generate problematic text when prompted to do so. You can find more details in the technical report."

cont_out = llm_chain_with_context.invoke(
    input={"system_prompt" : system_prompt, "user_prompt": question, "context": context})

In [36]:
print(cont_out["text"])

Zephyr-7b-beta is a specific language model in the Zephyr series that has been fine-tuned from the mistralai/Mistral-7B-v0.1 model using Direct Preference Optimization (DPO) on a mix of publicly available and synthetic datasets. This model is designed to act as a helpful assistant, but unlike its predecessor, it has been trained without the in-built alignment of the datasets, which has resulted in improved performance on the MT Bench test and a more helpful overall behavior. However, because of the removal of this alignment, the model may generate problematic text when prompted to do so. Additional information about this model can be found in a technical report.


Sounds about right, because all the information was added to the prompt. But let's ask a different question.

In [37]:
question2 = "What type of model is Zephyr-7B-beta?"
cont_out2 = llm_chain_with_context.invoke(
    input={"system_prompt" : system_prompt, "user_prompt": question2, "context": context})
print(cont_out2["text"])

Zephyr-7B-beta is a language model that has been fine-tuned from the mistralai/Mistral-7B-v0.1 model. It falls under the category of helpful assistant models, and is specifically designed to provide useful and accurate responses to user prompts. The model has been trained using a mix of publicly available and synthetic datasets through Direct Preference Optimization (DPO), which has led to improved performance on the MT Bench. However, as the model was trained without alignment of the datasets, it may generate problematic text when prompted to do so.


As you can see, the context we provide is not enough to answer the second question in the way we would expect (namely, as shown on the model card). However, we cannot predict what the questions will be, and we also cannot put all information into the prompt as context.

There are other ways to resolve this issue, in particular Retrieval Augmented Generation (RAG), which we will look at in the [next notebook](3.5.a_4_LC_RAG_HFHub.ipynb).