![Banner](img/AI_Special_Program_Banner.jpg)

## Introduction to LLMs - Material 1: Using an Open Source LLM with LangChain
---

This notebook is the first in a series showing the various options for integrating LLMs into one's own solutions using the Python library [LangChain](https://python.langchain.com/docs/get_started). This library is under active development and, according to its webpage, "is a framework for developing applications powered by language models."

The source of the LLM can be
* open - in which case one would usually turn to [HuggingFace](https://huggingface.co/), in particular to the page showing all [available models](https://huggingface.co/models?other=LLM), or start at the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) in order to make an informed decision as to which one to use. The model can then be run *locally* (provided you have enough resources) or directly on [HuggingFace Hub](https://huggingface.co/docs/hub/index) using an appropriate API;
* proprietary - here, the best known ones are [OpenAI's GPT models](https://platform.openai.com/docs/models), which may be accessed via an [API](https://openai.com/blog/openai-api) (for which one must create an account and generate an access token).

We will briefly show all three cases, namely an open model run locally, an open model on HuggingFace Hub, and a proprietary model behind an API (a brief tutorial introduction may also be found [here](https://www.pinecone.io/learn/series/langchain/langchain-intro/)). We will then use the Inference API of HuggingFace to introduce concepts like *prompt (engineering)*, *memory/chat history* and *context*. Finally, we will explore how an LLM could be employed for sentiment analysis.

The series was inspired by [Aiden Dai's GitHub Repo](https://github.com/daixba/langchain-tutorials) - but note that the notebooks provided there no longer work, although, as of January 2024, they are not even five months old ... (more on this later)

---
<div style="color:red"><b>Attention:</b></div> 
Do <b>not</b> try to run this notebook on the university server! Or, if you try to do it, make sure no one else is using the same GPU as you are!

---

## Overview
- [An Open Source LLM](#An-Open-Source-LLM)
  - [Prompt Template](#Prompt-Template)
  - [Quantization](#Quantization)

[next notebook](3.5.a_2_LC_OpenAI.ipynb)

---

## An Open Source LLM

In this notebook, we will use an open source language model (`zephyr-7b-beta`). When downloading models from the HuggingFace Hub, you sometimes have to accept a license first and then provide your HuggingFace Hub token before the download is allowed. The details can be found on the respective model's *model card* on HuggingFace, in our case [zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta). Moreover, we want to run the model directly on our local machine, which makes sense especially when privacy concerns rule out sending our interactions with the LLM over the internet.

Before we actually load the model, let's check our available GPU resources:

In [1]:
# needs an NVIDIA GPU with `nvidia-smi` installed; the `grep` filter makes this Linux-specific ...
!nvidia-smi | grep MiB

|  0%   44C    P8              16W / 450W |     28MiB / 24564MiB |      0%      Default |
|    0   N/A  N/A      1228      G   /usr/lib/xorg/Xorg                           18MiB |


And then we import the necessary libraries ...

In [2]:
import torch
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

import warnings
warnings.filterwarnings('ignore')

... and instantiate the model in the form of a [transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial):

In [3]:
pipe = pipeline(
    task="text-generation", 
    model="HuggingFaceH4/zephyr-7b-beta", 
    torch_dtype=torch.bfloat16, 
    device=0,
    do_sample=True,
    max_new_tokens=1024,  
    temperature=0.7, 
    top_k=50, 
    top_p=0.95)

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

Now let's check our resource usage again:

In [4]:
!nvidia-smi | grep MiB

|  0%   45C    P2              72W / 450W |  14998MiB / 24564MiB |     57%      Default |
|    0   N/A  N/A      1228      G   /usr/lib/xorg/Xorg                           18MiB |
|    0   N/A  N/A   1170628      C   ...vs/miniconda3/envs/pt_21/bin/python    14966MiB |


As we can see, even before we interact with the LLM, it already takes up about 15 GB of VRAM, so the model is definitely resource intensive. It also means that two instances of the model cannot live next to each other on a single GPU with 24 GB of VRAM overall: both will then run into an out-of-memory error!
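
The figure reported by `nvidia-smi` is roughly what one would expect from the parameter count alone. The following is a back-of-the-envelope sketch, not a measurement: it assumes about 7.2 billion parameters (an approximation for a Mistral-7B-based model, not a number taken from this notebook) and counts only the weights. The helper name `estimate_weight_memory_gib` is made up for illustration.

```python
def estimate_weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Rough size of the model weights alone, in GiB (1 GiB = 2**30 bytes)."""
    return n_params * bytes_per_param / 2**30


# Assumption: roughly 7.2 billion parameters for a 7B model such as Zephyr.
N_PARAMS = 7.2e9

# Bytes needed per parameter for a few common precisions.
for label, nbytes in [("float32", 4), ("bfloat16", 2), ("int8", 1), ("4-bit", 0.5)]:
    size = estimate_weight_memory_gib(N_PARAMS, nbytes)
    print(f"{label:>8}: {size:5.1f} GiB")
```

With two bytes per parameter (`bfloat16`, as chosen in the pipeline above) the estimate lands between 13 and 14 GiB. The remainder of what `nvidia-smi` reports is plausibly CUDA context and working buffers.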

### Prompt Template
It helps to know how we should *prompt* the LLM we are using, i.e. how best to supply *input* such as questions or instructions. If the model is of high quality, the *model card* on HuggingFace will provide some sample code that can be used to that end. In our case, unfortunately, it does not do so. However, there is a way to find this out programmatically using the *tokenizer* that comes with the pipeline. We will use the tokenizer's [chat template](https://huggingface.co/docs/transformers/chat_templating).

In [5]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds truthfully",
    },
    {
        "role": "user", 
        "content": "Who is Travis Kelce?"},
]

prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = pipe(prompt)
print(outputs[0]["generated_text"])

<|system|>
You are a friendly chatbot who always responds truthfully</s>
<|user|>
Who is Travis Kelce?</s>
<|assistant|>
Travis Kelce is a professional American football player who currently plays as a tight end for the Kansas City Chiefs in the National Football League (NFL). He was drafted by the Chiefs in the third round of the 2013 NFL Draft and has been with the team ever since. Kelce is known for his outstanding athleticism, reliable hands, and ability to make clutch plays in big moments. He has been selected to the Pro Bowl four times and was named a First-Team All-Pro in 2018.


An LLM is trained to generate text. While it may be fine-tuned for quite specific use cases, text generation is its main ability. However, as we can see in the example above already, the LLM can be given some general instructions which are often called the *system prompt*, while the actual instruction given (or question asked) by the user is the *user prompt*.
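
Note that the output above repeats the whole prompt, system and user part included, before the actual answer. If only the answer is of interest, one option is to cut the generated text at the last `<|assistant|>` marker. The helper below is a minimal sketch under that assumption; the name `extract_answer` is hypothetical and the example string is shortened by hand.

```python
ASSISTANT_TAG = "<|assistant|>\n"  # marker as it appears in the Zephyr output above


def extract_answer(generated_text: str, tag: str = ASSISTANT_TAG) -> str:
    """Return the text after the last assistant marker.

    If the marker is missing, the input is returned unchanged (apart from
    surrounding whitespace), so the function is safe on answer-only text.
    """
    _, sep, answer = generated_text.rpartition(tag)
    return answer.strip() if sep else generated_text.strip()


# Hand-shortened stand-in for outputs[0]["generated_text"]
example = (
    "<|system|>\nYou are a friendly chatbot</s>\n"
    "<|user|>\nWho is Travis Kelce?</s>\n"
    "<|assistant|>\nTravis Kelce is a professional American football player."
)
print(extract_answer(example))
```

The `transformers` text-generation pipeline also accepts `return_full_text=False`, which should make this post-processing unnecessary. The string-based version is shown here because it makes the structure of the output explicit.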

In any case, let's distinguish the two components of the overall prompt and try again.

In [6]:
system_prompt = "You are a friendly chatbot who always responds truthfully and explains step by step"
user_prompt = "What does the F1-score measure?"
messages = [
    {
        "role": "system",
        "content": f'{system_prompt}',
    },
    {
        "role": "user", 
        "content": f'{user_prompt}'},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt)
print(outputs[0]["generated_text"])

<|system|>
You are a friendly chatbot who always responds truthfully and explains step by step</s>
<|user|>
What does the F1-score measure?</s>
<|assistant|>
The F1-score is a performance metric used in machine learning and data science to evaluate the accuracy of a binary classification model. It is a harmonic mean of the precision and recall, which gives a single value between 0 and 1 to represent the overall performance of the model. The F1-score considers both the true positives (TP) and false positives (FP) of the model and is calculated using the following formula:

F1-score = 2 * (precision * recall) / (precision + recall)

A high F1-score indicates that the model has a good balance between precision and recall, meaning it accurately identifies relevant data and does not falsely classify irrelevant data.


Next, let's specifically look at the way the prompt is constructed and try to build on that in a simple fashion.

In [7]:
print(prompt)

<|system|>
You are a friendly chatbot who always responds truthfully and explains step by step</s>
<|user|>
What does the F1-score measure?</s>
<|assistant|>



So, we can simply build the prompt directly ourselves.

In [8]:
user_prompt = "What is a CNN in the deep learning context?"
lcprompt = '<|system|>\n'
lcprompt += f'{system_prompt}</s>\n'
lcprompt += '<|user|>\n'
lcprompt += f'{user_prompt}</s>\n'
lcprompt += '<|assistant|>\n'
print(lcprompt)

<|system|>
You are a friendly chatbot who always responds truthfully and explains step by step</s>
<|user|>
What is a CNN in the deep learning context?</s>
<|assistant|>



Now, let us make use of the class we imported from `LangChain`'s `llms` module earlier and see how we can apply it in our context.

In [9]:
llm = HuggingFacePipeline(pipeline=pipe)

In [10]:
print(llm.invoke(lcprompt))

A CNN (Convolutional Neural Network) is a type of artificial neural network that is commonly used in the field of deep learning for image and video recognition. CNNs are designed to process grid-like data, such as images, by applying a series of convolutions, pooling, and activation functions to extract meaningful features. Convolutions involve sliding a kernel (a small matrix) over the input image to detect specific patterns, while pooling reduces the spatial dimensions of the feature maps to decrease computational complexity and increase translation invariance. Activation functions introduce non-linearity to the network, allowing it to learn more complex features. CNNs have shown exceptional performance in various image and video recognition tasks, such as object classification, detection, and segmentation.


So, it makes sense to wrap our way of constructing a prompt in a function for easier use later on, and to check whether it does what we expect.

In [11]:
def get_prompt(system_input, user_input):
    """Build a Zephyr-style prompt from a system and a user message."""
    prompt = '<|system|>\n'
    prompt += f'{system_input}</s>\n'
    prompt += '<|user|>\n'
    prompt += f'{user_input}</s>\n'
    prompt += '<|assistant|>\n'
    return prompt

In [12]:
user_prompt = "What is an RNN in the deep learning context?"
tstprompt = get_prompt(system_prompt,user_prompt)
print(tstprompt)

<|system|>
You are a friendly chatbot who always responds truthfully and explains step by step</s>
<|user|>
What is an RNN in the deep learning context?</s>
<|assistant|>



So, getting the answer we want from our LLM is now as easy as writing:

In [13]:
print(llm.invoke(get_prompt(system_prompt,user_prompt)))

In the context of deep learning, an RNN (Recurrent Neural Network) is a type of neural network that can process sequences of input data, such as text, speech, or time series. Unlike traditional feedforward neural networks, which process data in a single pass, RNNs have feedback connections that allow them to remember and process information over an extended period. This memory capability makes RNNs particularly useful for tasks such as language modeling, speech recognition, and time series prediction. The basic building block of an RNN is a recurrent unit, which takes an input, updates its internal memory or state, and outputs a result that is passed to the next unit in the sequence. The training of an RNN involves backpropagation through time, a variant of the standard backpropagation algorithm that handles the recurrent connections.
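
The `get_prompt()` function handles exactly one system and one user message. For a conversation with several turns, the same layout can be generated from a list of messages like the one passed to `apply_chat_template` earlier. The sketch below only mirrors the format printed in this notebook; `build_prompt` is a hypothetical helper, and the tokenizer's own chat template remains the authoritative source for the format.

```python
EOS = "</s>"  # end-of-sequence token as it appears in the printed prompts above


def build_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the Zephyr layout shown above."""
    parts = [f"<|{m['role']}|>\n{m['content']}{EOS}\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so that the model continues with its answer.
        parts.append("<|assistant|>\n")
    return "".join(parts)


# A short conversation that already contains one assistant answer
history = [
    {"role": "system", "content": "You are a friendly chatbot who explains step by step"},
    {"role": "user", "content": "What is an RNN in the deep learning context?"},
    {"role": "assistant", "content": "An RNN is a neural network for sequential data."},
    {"role": "user", "content": "And how does an LSTM differ from it?"},
]
print(build_prompt(history))
```

For a single system and user message, this produces the same string as `get_prompt()`, so the result could be passed to `llm.invoke()` in the same way.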


### Quantization
Thus, the only drawback to running a model on a local machine is its hunger for resources. This can be mitigated by *quantization*, i.e. reducing the precision of the numbers used in the model's computations. It is typically done to shrink the model and increase its computational efficiency, often at the cost of a slight decrease in accuracy. On HuggingFace, the user Tom Jobbins, going by the handle [TheBloke](https://huggingface.co/TheBloke), provides numerous such models.

Incidentally, he also provides different quantization versions of Zephyr-7b-beta, namely [AWQ](https://huggingface.co/TheBloke/zephyr-7B-beta-AWQ), [GPTQ](https://huggingface.co/TheBloke/zephyr-7B-beta-GPTQ), and [GGUF](https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF), which you are welcome to try.
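
To make the idea of reduced precision concrete, here is a toy sketch of the simplest scheme, symmetric "absmax" rounding to 8-bit integers. It is an illustration only: AWQ, GPTQ and GGUF rely on more elaborate methods, and the function names and weight values below are made up.

```python
def quantize_absmax(weights, n_bits=8):
    """Map floats to signed integers using one shared scale factor."""
    q_max = 2 ** (n_bits - 1) - 1           # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in q_weights]


weights = [0.12, -0.53, 0.9, -1.27, 0.004]  # made-up example values
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"scale={scale:.5f}, max error={max_error:.5f}")
```

Each weight now needs a single byte instead of two or four, and the rounding error per weight is bounded by half the scale factor. This is the "slight decrease in accuracy" mentioned above.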

**Remark**: The model cards for those quantized versions also give us the *prompt template* for Zephyr, so we could have written our `get_prompt()` function directly without having to derive it via the tokenizer. It might be worth keeping this in mind in case you want to try a completely different open source model.

So, while we can have an open source model running locally as long as it does not take up more resources than we have to offer, in the [next notebook](3.5.a_2_LC_OpenAI.ipynb) we will look at the other extreme, namely a proprietary LLM invoked via an API.