# Technical course for large language models

Before we start the development...

Create and activate a virtual environment inside the repository, and then install all packages contained in ``requirements.py`` if this has not already been done. Check the README.md for more detailed instructions on how to do this.

In [None]:
import os
import numpy as np
import chromadb
from dotenv import load_dotenv
from chromadb import Collection
from openai import AzureOpenAI
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader

Update the ``.env`` file with parameter values from the course instructors. This enables you to load the following. 

In [None]:
load_dotenv()
azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
api_key = os.getenv("AZURE_OPENAI_KEY")
api_version = os.getenv("AZURE_OPENAI_VERSION")
embedding_deployment_name = os.getenv("AZURE_OPENAI_EMBEDDING_NAME")
completions_deployment_name = os.getenv("AZURE_OPENAI_COMPLETIONS_NAME")

## How to use this notebook ⚙️

This notebook contains boilerplate code to guide you through the programming tasks. Two rules apply to this notebook:
- ``"""START HERE"""`` and ``"""END HERE"""`` mark where you are supposed to edit the code. You may make modifications outside of these markers or install additional packages, but be aware that this might affect your learning experience and force you to make changes in subsequent tasks.
- Have fun! 🕺💃

<b>Disclaimer</b>: we are well aware that there are methods, especially in LangChain, that can accomplish all the tasks in this notebook in just a few lines of code. But where is the learning in that?

## Make your own chat! 🤖

You will now make your own chat with the use of a language model! You will also explore how prompt engineering and the choice of parameters impact the behaviour of the model.

We start by making a client for communication with Azure OpenAI.

In [None]:
client = AzureOpenAI(
    azure_endpoint=azure_endpoint, 
    api_key=api_key,  
    api_version=api_version,
)

The OpenAI APIs for LLMs expect the prompt to be a list of "chat" messages. Including previous messages lets the model use the context from earlier interactions, which is a key feature of the chat interfaces often used with LLMs, such as the ChatGPT app. For OpenAI models, a message consists of a ``role`` field and a ``content`` field, like so: 
```
message = {
    "role": "user",
    "content": "Tell me a fun fact about dinosaurs"
}
```

User messages are labelled with the role `"user"`, while LLM responses are labelled with the role `"assistant"`. We also have a special role, `"system"`, for the so-called _system prompt_, which is often included as the very first message. The system prompt gives a high-level instruction for how you want the LLM to behave, e.g.:
```
system_prompt = {
    "role": "system",
    "content": "You are a helpful chatbot. Answer the following questions concisely and to the best of your ability."
}
```
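
To make the structure concrete, here is a small sketch of how a conversation grows as a list of messages. The message contents are made-up examples, and the assistant reply is hard-coded rather than produced by a model, so nothing is sent to the API.

```python
# A conversation is an ordered list of message dictionaries.
# The assistant reply below is hard-coded for illustration; no model is called.
conversation = [
    {"role": "system", "content": "You are a helpful chatbot."},
    {"role": "user", "content": "Tell me a fun fact about dinosaurs"},
    {"role": "assistant", "content": "(the model's reply would go here)"},
]

# A follow-up question is appended to the same list, so the model
# receives the earlier turns as context on the next request.
conversation.append({"role": "user", "content": "Tell me another one"})

roles = [message["role"] for message in conversation]
print(roles)
```

The order matters: the system message comes first, and user and assistant messages then alternate in the order they were written.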

Below we have provided a sample outline for a class used in a basic LLM chat application. The class contains a property `messages` holding a list of messages of the form described above. The class contains a utility method for adding new messages to the message list.

In [None]:
class Chat:
    def __init__(self, system_message: str | None = None) -> None:
        self.messages = [] 
        if system_message: 
            self.add_message("system", system_message)
            

    def generate(self, prompt: str, **hyperparams) -> str:
        """
        :param prompt: the input prompt
        :param hyperparams: hyperparameters for the large language model
        :return content: the content in the response message from the model 
        """
        content = None
        
        """START HERE"""
       
        """END HERE"""
        
        return content
    
    
    def add_message(self, role: str, content: str) -> None:
        """
        Creates and adds a new message to the list of all messages.

        :param role: the role associated with the message, either system, user or assistant
        :param content: the content associated with the message
        """ 
        assert role in ["system", "user", "assistant"], f"{role} is not a valid role for a message"
        message = {"role": role, "content": content}
        self.messages.append(message)

The cell below creates a chat history with one user message and prints the list of messages.

In [None]:
quote_chat_history = Chat()
quote_chat_history.add_message("user", "Give me a short inspiring quote about the need to learn about large language models.")
print(quote_chat_history.messages)

Now you have to do the following tasks in the ``generate`` function to complete the ``Chat`` class:
1. Update ``self.messages`` by adding a user message with ``role="user"`` and ``content=prompt``. 
2. Call ``client.chat.completions.create`` to create a chat completion based on the messages in ``self.messages``. The parameter value for ``model`` must be equal to ``completions_deployment_name``, given in the introduction of this notebook. 
3. Extract the content from the response. Tip: the content can be found in ``response.choices[0].message.content``. 
4. Update ``self.messages`` by adding an assistant message with ``role="assistant"`` and ``content=content``.

Ignore ``system_message`` for now. 

You have now defined everything you need to start chatting. The chat below generates a quote that inspires you to learn about large language models! 🥰

In [None]:
quote_chat = Chat()
quote_chat.generate("Give me a short inspiring quote about the need to learn about large language models.")

Take a look at the chat history. 

In [None]:
quote_chat.messages

Let's ask a follow-up question about the quote that was generated. 

In [None]:
quote_chat.generate("Explain the quote with one short sentence")

It's time to configure the behaviour of the model. 
1. Create a new instance of the ``Chat`` class and use the ``system_message`` parameter to start the chat with a system message (also referred to as metaprompt or system prompt). Notice that the system message has ``role="system"`` and that it is the very first message added. The content of the system message guides the model's behaviour and is crucial for achieving high performance. You decide this content. For instance, you can make an assistant for your current project or a personal training coach.
2. Generate an answer to a prompt of your choice. Now, play around with different system messages and see how they affect the response to the same prompt. You can use [this guide](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/system-message) to get tips on how to formulate a system message. 

In [None]:
"""START HERE"""


"""END HERE"""

Hopefully, you have now formulated a system message that successfully guides the model's behaviour according to your intention. However, through the API we also have more granular control over the types of outputs we get from the LLM. The chat.completions API exposes many different parameters that can be tweaked to generate different responses to the same prompt. Use the same ``system_message`` and prompt as you used in the cell above for initiating the ``Chat``. Then, pass different parameter values as kwargs to the ``generate`` method and see what happens. 

The parameter values you can try to change are:
- ``max_tokens`` (integer, bounded by the model's context length; for chat completions there is no fixed default, so the response is limited only by the model's own limits): controls the maximum number of tokens to generate in a response. 
- ``temperature`` (number between 0 and 2, default is 1): higher values introduce more randomness, meaning that less likely tokens can be chosen. 
- ``top_p`` (number between 0 and 1, default is 1): controls how much of the probability mass is considered when sampling. For instance, 1 means that the whole vocabulary is considered by the model, whereas 0.1 means that only the most likely tokens that together make up the top 10% of the probability mass are considered. 
- ``presence_penalty`` (number between -2 and 2, default is 0): Positive values penalize new tokens based on whether they exist in the text so far, increasing the model's likelihood to talk about new topics.
- ``frequency_penalty`` (number between -2 and 2, default is 0): Positive values penalize new tokens based on their frequency in the text so far, decreasing the model's likelihood to repeat the same text word by word.
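
The difference between the two penalties is easier to see with numbers. The sketch below follows the adjustment that OpenAI's documentation describes for these parameters, as far as we are aware: the frequency penalty is subtracted once per earlier occurrence of a token, while the presence penalty is subtracted once if the token has appeared at all. The function name ``apply_penalties`` and the toy scores are our own, and the service's internal implementation may differ.

```python
def apply_penalties(logits, counts, presence_penalty=0.0, frequency_penalty=0.0):
    """Return adjusted scores given how often each token has appeared so far."""
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        seen = 1.0 if count > 0 else 0.0
        # The frequency term grows with every repetition; the presence term is flat.
        adjusted[token] = logit - count * frequency_penalty - seen * presence_penalty
    return adjusted


toy_logits = {"space": 2.0, "sleep": 2.0, "lion": 2.0}  # made-up scores
toy_counts = {"space": 3, "sleep": 1}                   # "lion" has not appeared yet

penalized = apply_penalties(
    toy_logits, toy_counts, presence_penalty=0.5, frequency_penalty=0.25
)
print(penalized)
```

A token that has been used three times ends up with a lower score than one used once, and a token that has not been used keeps its original score.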

Example of how to use kwargs in the ``generate`` method: ``quote_chat.generate("Make a quote about LLMs.", max_tokens=50, temperature=1.5)``

Remember that tokens are the units of text the model reads and writes (1 token is roughly 3/4 of an English word). Also, it is considered good practice to change either ``temperature`` or ``top_p``, not both. 
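
To build some intuition for ``temperature`` and ``top_p``, here is a minimal sketch on a toy vocabulary of four tokens. It assumes the common textbook formulation (scores divided by the temperature before a softmax, and nucleus filtering by cumulative probability); the helper names and the scores are made up, and the real service may implement the details differently.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; assumes temperature > 0."""
    scaled = [value / temperature for value in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return kept


toy_tokens = ["the", "a", "banana", "quantum"]  # made-up vocabulary
toy_logits = [3.0, 2.0, 0.5, 0.1]               # made-up scores

cold = softmax_with_temperature(toy_logits, temperature=0.5)
default = softmax_with_temperature(toy_logits, temperature=1.0)
hot = softmax_with_temperature(toy_logits, temperature=2.0)
print("temperature=0.5:", [round(p, 3) for p in cold])
print("temperature=2.0:", [round(p, 3) for p in hot])

nucleus = top_p_filter(default, top_p=0.9)
print("tokens kept with top_p=0.9:", [toy_tokens[i] for i in nucleus])
```

A low temperature concentrates the probability on the top token, a high temperature spreads it out, and ``top_p`` cuts off the unlikely tail before sampling.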

 




In [None]:
"""START HERE"""


"""END HERE"""

Congratulations on making and configuring your own chat! 👊

## Prepare your data for search with a vector database 🔎📦

A chat can be valuable on its own, but the real value comes when large language models are integrated with domain-specific data to provide more relevant answers. A popular way to do this is to <i>retrieve</i> a set of supporting documents for the input prompt. The context from these documents can then be concatenated with the input prompt and fed to the large language model for generation. This process is called Retrieval Augmented Generation (RAG). The first step towards RAG is to make the data searchable, which we will achieve by using a vector database.

#### Embedding

To make a vector database we must transform the text into a format that enables us to compare texts mathematically, in other words into numbers. For this, we rely on vector embeddings. You will now use ``client.embeddings.create`` with the ``model`` parameter equal to ``embedding_deployment_name``, given in the introduction of this notebook, to create vector embeddings.

In [None]:
def get_embeddings(chunks: list[str]) -> list[list[float]]:
    """
    :param chunks: list of text strings
    :return embeddings: vector embeddings of the chunks
    """

    """START HERE"""

    """END HERE"""
    
    embeddings = [item.embedding for item in response.data]
    return embeddings

Let's see how vector embeddings work... 

In [None]:
cosine_similarity = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
embeddings = get_embeddings(["ocean", "fish", "lion"])
ocean, fish, lion = embeddings[0], embeddings[1], embeddings[2]
print("The vector embedding of ocean is:", ocean[0:5], "...")
print("Similarity between ocean and fish:", cosine_similarity(ocean, fish))
print("Similarity between ocean and lion:", cosine_similarity(ocean, lion))

We can observe that the vector embedding is simply a list of numbers, or a vector. By using the cosine similarity measure we can see that the similarity between ocean and fish is higher than that between ocean and lion. This is probably because fish live in the ocean while lions live on land, and hence fish are more closely related to the ocean than lions are. The embedding model has learned this relationship.
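
Cosine similarity only looks at the angle between two vectors, not their length. The sketch below shows the three boundary cases on made-up 2D vectors (real embeddings have many more dimensions); it uses plain Python instead of NumPy so it runs without any packages.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # same angle, different length
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # at a right angle
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # pointing the other way

print(same_direction, orthogonal, opposite)
```

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposite directions.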

#### Text splitting

A document can be hundreds of pages long, which is far too much data for a large language model to handle in a prompt. Therefore, we want to extract only the most relevant <i>chunks</i> of text from the document. You will now implement a function to split a document into chunks. You can either use a [text splitter from LangChain](https://python.langchain.com/docs/modules/data_connection/document_transformers/) or you can make your own function from scratch. 

<b>Hint:</b> a good starting point is LangChain's ``RecursiveCharacterTextSplitter`` with ``chunk_size=250`` and ``chunk_overlap=50``. You first have to make an instance of the splitter, then use the method called ``split_documents``, which returns a list of Document objects based on the input documents. The text chunk from each Document object can be retrieved by using the attribute called ``page_content``.
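
To see what ``chunk_size`` and ``chunk_overlap`` mean, here is a deliberately simplified sketch that slides a fixed-size window over a plain string. This is not how ``RecursiveCharacterTextSplitter`` works internally (it also tries to split on separators such as paragraphs and sentences); the function ``split_with_overlap`` is a hypothetical helper for illustration only.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


toy_chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which helps keep sentences that straddle a boundary retrievable from at least one chunk.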

In [None]:
def split_text(documents: list[Document]) -> list[str]:
    """
    :param documents: documents to split
    :return chunks: list of text strings from documents
    """

    """START HERE"""

    """END HERE"""

    return chunks


We have uploaded a PDF file about how NASA astronauts live their daily lives in space 👨‍🚀🪐

In [None]:
nasa_docs = PyPDFDirectoryLoader("../data").load()
nasa_chunks = split_text(nasa_docs)
for i, nasa_chunk in enumerate(nasa_chunks[0:3]):
    print(f"--- CHUNK {i+1} ---")
    print(nasa_chunk)
    

Notice how the text in the PDF file is split into different chunks. If you used a chunk overlap, you will also notice that consecutive chunks share some of their text. The text might have typos, but we ignore these for now.

#### Vector database

Let's now make a pipeline that takes a path to a folder of PDF files as input, loads the files, splits them into chunks by using ``split_text``, embeds the text in the chunks with ``get_embeddings``, and finally stores the result in a Chroma collection. The collection serves as our vector database. To add data to the collection along with embeddings you must use the ``collection.add`` function with the parameters:
- ``ids``: a list of unique string ids, one for each document in documents.
- ``documents``: the documents to add (the chunks of text).
- ``embeddings``: the embeddings of the documents. 
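
The three parameters line up one-to-one: the first id belongs to the first document and the first embedding, and so on. The sketch below mimics that idea with a tiny in-memory class. ``ToyCollection`` is a hypothetical stand-in written for this illustration, not the Chroma API, and the 2D vectors are made up rather than real embeddings.

```python
import math


class ToyCollection:
    """Hypothetical stand-in for a vector collection; not the Chroma API."""

    def __init__(self):
        self.rows = {}  # id -> (document, embedding)

    def add(self, ids, documents, embeddings):
        if not (len(ids) == len(documents) == len(embeddings)):
            raise ValueError("ids, documents and embeddings must have the same length")
        for id_, document, embedding in zip(ids, documents, embeddings):
            self.rows[id_] = (document, embedding)

    def query(self, query_embedding, n_results=1):
        def cosine_distance(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return 1.0 - dot / norms

        # Smallest distance first, i.e. most similar documents first.
        ranked = sorted(
            self.rows.values(), key=lambda row: cosine_distance(query_embedding, row[1])
        )
        return [document for document, _ in ranked[:n_results]]


toy_chunks = ["text about sleeping", "text about food"]
toy_embeddings = [[0.9, 0.1], [0.1, 0.9]]  # made-up vectors, not real embeddings
toy_ids = [f"chunk-{i}" for i in range(len(toy_chunks))]  # unique string ids

toy_db = ToyCollection()
toy_db.add(ids=toy_ids, documents=toy_chunks, embeddings=toy_embeddings)
closest = toy_db.query([1.0, 0.0], n_results=1)
print(closest)
```

Generating the ids from the chunk index, as done here, is one simple way to guarantee that they are unique strings.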

In [None]:
chroma_client = chromadb.Client()

In [None]:
def create_vector_db(data_path="../data", name="my_db") -> Collection:
    """
    Creates a Chroma collection and adds data.
    
    :param data_path: path to a folder with PDF files
    :param name: the name of the collection; an existing collection with the same name is overwritten
    """
    existing_collections = [collection.name for collection in chroma_client.list_collections()]
    if name in existing_collections:
        chroma_client.delete_collection(name)
    
    collection = chroma_client.create_collection(
        name=name,
        metadata={"hnsw:space": "cosine"},  # L2 is default
        )

    """START HERE"""

    """END HERE"""

    return collection

I really wonder how astronauts sleep 👀

In [None]:
nasa_db = create_vector_db()
embedding = get_embeddings(["how do astronauts sleep?"])[0]
results = nasa_db.query(
    query_embeddings=embedding,
    n_results=3
)
for i, result in enumerate(results["documents"][0]):
    print(f"--- RETRIEVAL {i+1} ---")
    print(result)
    

Your data is now ready for action! 🙌 You have now learned how to implement and use vector search. Other popular retrieval methods that we have not touched here are keyword search, where text is searched by keywords, and hybrid search, which combines vector search and keyword search. However, vector search is a very strong method as it manages to capture the semantics of the text. 

## Now to the fun part: Retrieval Augmented Generation 🦄

By now you have everything you need to implement RAG, i.e. a way to search for context given an input prompt and the ``generate`` method to generate an answer with a large language model. To successfully implement RAG, you will now extend the ``Chat`` class with two methods: 
1. ``retrieve``: queries the vector database and returns the context.
2. ``generate``: overrides the ``generate`` method of the ``Chat`` class. It uses the ``retrieve`` method to get context and then calls the inherited ``generate`` method through ``super().generate`` with the contextualized prompt. The contextualized prompt is the original input prompt concatenated with the context. A ``template`` is provided to help you with this, where you can easily insert strings by using ``template.format``.  

You can add your own PDF files to the data folder if you wish, but we recommend that you start by using the NASA file. Remember not to use any confidential data. 

<b>Hint:</b> all methods and properties are inherited from the ``Chat`` class because ``RAG`` is created as a child class of ``Chat``. Because ``RAG`` defines its own ``generate``, call ``super().generate(prompt)`` to reach the parent implementation; calling ``self.generate(prompt)`` from inside ``RAG.generate`` would make the method call itself recursively.
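
As a rough illustration of the formatting step, the sketch below joins some made-up retrieved chunks with ``---`` separators and inserts them into a simplified template with ``str.format``. The names ``toy_results`` and ``toy_template`` are hypothetical, and no vector database or model is involved.

```python
toy_results = ["text1", "text2"]  # made-up retrieved chunks

# Put a '---' line after each chunk so the model can tell the chunks apart.
context = "".join(f"{chunk}\n---\n" for chunk in toy_results)

# Simplified template with two named placeholders.
toy_template = "Context:\n{context}\nQuestion: {prompt}"
contextualized_prompt = toy_template.format(
    context=context, prompt="how do astronauts sleep?"
)
print(contextualized_prompt)
```

The named placeholders ``{context}`` and ``{prompt}`` are filled by keyword, so the order of the arguments to ``format`` does not matter.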

In [None]:
class RAG(Chat):
    def __init__(self, vectordb: Collection, system_message: str | None = None) -> None:
        super().__init__(system_message)
        self.vectordb = vectordb


    def retrieve(self, prompt: str, n_results=3) -> str:
        """
        Queries the vector database with the given prompt and builds the context. 

        :param prompt: the input prompt
        :param n_results: the number of results to return
        :return context: the concatenation of the text in the retrieved results separated by '---'
            e.g. 
            text1
            ---
            text2
            ---
        """

        """START HERE"""

        """END HERE"""

        return context
    

    def generate(self, prompt: str) -> str:
        """
        Performs the RAG pipeline, i.e.:
            1. retrieve context
            2. make contextualized prompt
            3. generates answer based on contextualized prompt
            4. returns the answer
        
        :param prompt: input prompt by the user
        :return: the retrieval augmented generated answer
        """

        template = """
            Answer the question based only on the following context:

            {context}
            
            ---

            Answer the question based on the above context: {prompt}
            """        
        
        """START HERE"""

        """END HERE"""

        return answer
        

In [None]:
my_chat = RAG(nasa_db, system_message="You're a friendly assistant.")
print(my_chat.generate("how do astronauts sleep?"), end="\n\n")         # should give an answer
print(my_chat.generate("how do I get rich?"), end="\n\n")               # should not give an answer

## Build a sample app
We've also provided the framework for a basic Streamlit application where you can interact with the model in a more typical app interface. Open the `demo-app/rag.py` file, and compare its implementation with the methods you've written here. (You are free to rewrite the demo app, and reuse the corresponding methods you've defined in the RAG chat above). The app lets you upload your own PDFs, which are chunked and stored in a vector database.

## Already done? Let's dive deeper 🤿

We have provided you with some different tasks so you can learn even more about RAG. Pick the tasks you want; they are independent of each other. 

#### Take a closer look at <b>embeddings</b>

#### Make your data even more searchable with <b>hybrid search</b>

#### Extend your chat with <b> function calling </b>