## Converting data to documents
This notebook takes the data we collected in the dataframe and represents it as langchain `Document` objects. This data structure is well suited to Data Science and NLP tasks, since it lets us separate the actual text content we want to embed from the metadata tags, such as source and URL.

### Part 0 - Importing essential packages

In [1]:
import json
from dotenv import load_dotenv
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
import os
import sys
import openai
from openai import OpenAI
from langchain.chat_models import ChatOpenAI

### Part 1 - Loading the df from memory

Since we already extracted the data in the scraping notebook and saved it as `data.json`, we can just load it from disk.

In [2]:
df = pd.read_json('data.json', orient='records', lines=True)
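
With `orient='records', lines=True`, pandas expects the JSON Lines layout: one JSON object per line, one line per row. The sketch below reads that layout with the standard library only and checks that each record has the four fields used later in this notebook (`text`, `key`, `url`, `category`). The helper name `read_json_lines` and the sample text are made up for illustration; they are not part of the scraping notebook.

```python
import io
import json

# Fields that the Document conversion later in this notebook reads from each row.
REQUIRED_FIELDS = {"text", "key", "url", "category"}

def read_json_lines(handle):
    """Parse a JSON Lines stream and verify each record has the required fields."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_number} is missing fields: {sorted(missing)}")
        records.append(record)
    return records

# Illustrative one-line sample; the text value is a placeholder.
sample = io.StringIO(
    '{"text": "Om oss ...", "key": "om-oss", '
    '"url": "https://www.weknowit.se/om-oss/", "category": "general"}\n'
)
records = read_json_lines(sample)
print(records[0]["key"])
```

A check like this can be handy before building documents, since a missing column would otherwise only surface as a `KeyError` inside the conversion loop.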

### Part 2 - The langchain Document

The idea behind this data structure is that it is very suitable for NLP tasks. It has one attribute, `page_content`, which holds the plain text we intend to embed and store in vectorized form. Additionally, it has a `metadata` field, a customizable dictionary where we can store whatever we want. This content is not touched by the embedding model, so we can use it downstream for filtering and similar tasks.

The function below simply loops through the dataframe and builds a `Document` instance for each row. This is where you would typically introduce as much metadata as you can find, since it is cheap in terms of storage and gives you more flexibility for future changes.

In [3]:

def df_to_langchain_documents(df):
    """
    Convert a pandas DataFrame into a list of LangChain Document objects.

    Parameters:
    df (pd.DataFrame): The DataFrame to convert.

    Returns:
    list: A list of LangChain Document objects.
    """
    documents = []
    for _, row in df.iterrows():
        doc = Document(
            page_content=row['text'],
            metadata={
                'key': row['key'],
                'url': row['url'],
                'category': row['category']
            }
        )
        documents.append(doc)
    return documents

In [4]:
documents = df_to_langchain_documents(df)

### Part 2 (continued) - A glance at the Documents
Just have a quick look at the document we print out. You can see that it has the `page_content` being the text we want to vectorize, and then in the end we have our customized `metadata` dictionary. Perfect setup!

In [5]:
print(documents[0])

page_content='Nyfiken På We Know IT? | Om Oss Våra kunder Våra tjänster Konsultuthyrning Webbutveckling Apputveckling UX/UI - Design Digital marknadsföring Hosting & förvaltning Om oss Karriär Kontakta oss Våra kunder Våra tjänster Konsultuthyrning Webbutveckling Apputveckling UX/UI - Design Digital Marknadsföring Hosting & Förvaltning Om oss Karriar Kontakta oss Varför We Know IT? Vi är IT-konsultbolaget som satsar på studerande och nyexade talanger med höga ambitioner och drivkrafter.\xa0Experter på utveckling, design och digital strategi, We Know IT helt enkelt. \u200bMål, vision & sådant gött Vårt mål är att vara det självklara valet för studenter att starta sin karriär på, och det självklara valet för företag som behöver motiverad och innovativ kompetens. Det är inte alla som får chansen att jobba med morgondagens skarpaste konsulter. Vi får det varje dag, och du som kund eller samarbetspartner får stora möjligheter genom att välja oss. Vi vill ha kul på vägen och leverera över fö

### Part 3 - Chunking the Text Content

Now we move on to one of the most essential aspects of building a useful vector space: **chunking the documents**. There are several key considerations when chunking your data:

- **Not too long**: Currently, the documents we have may vary significantly in length. We want a uniform representation of text in terms of size, as this increases the likelihood that information is evenly distributed across the dataset.
- **Not too short**: If the chunks of text are too short, a specific chunk might not contain any relevant context. We need to ensure each chunk can hold valuable information relevant to a given query.
- **Embedding models**: Many embedding models are trained with a specific chunk size in mind, optimizing them to build meaningful embeddings for chunks of that size. Simply put, we want the embedding model to excel at taking a query, building a vector representation, comparing it to vectors in our vector database, and returning relevant context. Some models even have a hard limit: if you try to embed text exceeding the model's maximum input length, the model will simply truncate the text.

Information about the configuration of embedding models can be found on the Hugging Face Massive Text Embedding Benchmark (MTEB) Leaderboard: [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard)


#### Langchain Recursive Character Text Splitter

While some applications may require more sophisticated approaches to chunking, such as hierarchical chunking, a simple character text splitter often suffices. In the declaration below, we specify a chunk size of 512 and a chunk overlap of 30. Note that `RecursiveCharacterTextSplitter` measures length in characters by default (its `length_function` is `len`), so these values are characters, not tokens.

**Why the chunk overlap?** Because we don't want to lose information. When we strictly cut the text at 512 characters, it's possible that a sentence answering a specific query starts at the end of one chunk and finishes at the beginning of another. To preserve this semantic meaning, a good practice is to use an overlap, here 30 characters. This means that up to the last 30 characters of chunk *n* reappear at the start of chunk *n + 1*.
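
To make the overlap concrete, here is a minimal sketch of a fixed-size sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and spaces); the function name `sliding_window_chunks` is hypothetical, and lengths are measured in characters.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows sharing `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.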


In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)
chunked_docs = text_splitter.split_documents(documents)

### Part 4 - Unique identifiers
It is good practice to create a more meaningful ID than the index in the dataframe. We previously introduced a `key` attribute based on the URL. Now, we use that key along with a counter to identify the chunks.

This means that the chunks will have IDs like "karriar_0", "karriar_1" and so on for each key.

In [8]:
current_key = None
counter = 0

for chunk in chunked_docs:
    key = chunk.metadata.get("key")
    if key != current_key:
        current_key = key
        counter = 0
    chunk.metadata["chunk_id"] = f"{key}_{counter}"
    counter += 1
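
The loop above resets the counter whenever the key changes, which works as long as all chunks with the same key sit next to each other in the list. If that ordering is not guaranteed, a per-key counter is a safer variant. The sketch below uses plain dictionaries as stand-ins for `chunk.metadata`; the function name `assign_chunk_ids` and the toy data are assumptions for illustration.

```python
from collections import defaultdict

def assign_chunk_ids(metadatas: list[dict]) -> list[dict]:
    """Add a 'chunk_id' of the form '<key>_<n>' to each dict, counting per key."""
    counters = defaultdict(int)  # next free index for every key seen so far
    for metadata in metadatas:
        key = metadata.get("key")
        metadata["chunk_id"] = f"{key}_{counters[key]}"
        counters[key] += 1
    return metadatas

# Toy stand-ins for chunk.metadata; note that 'karriar' appears again after 'om-oss'.
metadatas = [{"key": "karriar"}, {"key": "karriar"}, {"key": "om-oss"}, {"key": "karriar"}]
ids = [m["chunk_id"] for m in assign_chunk_ids(metadatas)]
print(ids)
# ['karriar_0', 'karriar_1', 'om-oss_0', 'karriar_2']
```

With the reset-on-change loop, the last entry would have received `karriar_0` a second time; the per-key counter keeps the IDs unique.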


### Part 5 - The embedding functions
These are helper functions that wrap the existing `HuggingFaceEmbeddings` class. Writing them ourselves gives us more control over what we do with the text chunks. See the Hugging Face documentation for the full reference.

In [6]:
from typing import Optional


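def doc_embedding(embedding_model: str,
                  model_kwargs: Optional[dict]=None,
                  encode_kwargs: Optional[dict]=None,
                  cache_folder: Optional[str]=None,
                  multi_process: bool=False
                  ) -> HuggingFaceEmbeddings:
    """Build a HuggingFaceEmbeddings instance for the given model name."""
    # Avoid mutable default arguments; fall back to CPU and normalized vectors.
    if model_kwargs is None:
        model_kwargs = {'device': 'cpu'}
    if encode_kwargs is None:
        encode_kwargs = {'normalize_embeddings': True}
    embedder = HuggingFaceEmbeddings(
        model_name = embedding_model,
        model_kwargs = model_kwargs,
        encode_kwargs = encode_kwargs,
        cache_folder = cache_folder,
        multi_process = multi_process
    )
    return embedder

def get_API_embedding(text, model):
    """Embed a single text. Note that this loads the model on every call."""
    embedder = doc_embedding(model)
    embedding = embedder.embed_query(text)
    return embedding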
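
The default `encode_kwargs` above asks for normalized embeddings. One reason this is convenient: for unit-length vectors, the plain dot product equals the cosine similarity, so comparing two embeddings becomes a single sum of products. The sketch below shows this with toy two-dimensional vectors (real embeddings have far more dimensions); the helper names are hypothetical and no embedding model is involved.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cosine = cosine_similarity(a, b)
dot_of_units = dot(a_unit, b_unit)
print(cosine, dot_of_units)  # both are approximately 0.96
```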

### Part 6 - Create the vector databases

In order to do this, we need to download an embedding model. In this example, we download the model locally, since it is rather small. The task is then to use our langchain Documents and the embedding model to build a Chroma collection and persist it in a local directory. This Chroma DB is what we will then use for context retrieval. More about Chroma in the README.md file.

**NOTE**: The part commented out is what you would use when the embedding database is not yet built. In that version, we pass our chunked documents along with our embedding model and directory name to the `Chroma.from_documents` function. Once the database has been built, we simply load it from disk. I recommend you delete the collection and build it yourself!
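
Instead of commenting code in and out, the choice between building and loading can be made at run time. The sketch below is one possible way to do that, using only the standard library; `load_or_build` and its `build`/`load` callables are hypothetical names, and the check (directory exists and is non-empty) is an assumption that does not verify the directory actually holds a valid collection.

```python
import os
import tempfile

def load_or_build(persist_directory, build, load):
    """Call `load` if the directory exists and is non-empty, otherwise call `build`."""
    if os.path.isdir(persist_directory) and os.listdir(persist_directory):
        return load(persist_directory)
    return build(persist_directory)

# Toy demonstration with callables that only report which branch ran.
with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(tmp, build=lambda d: "built", load=lambda d: "loaded")
    open(os.path.join(tmp, "marker"), "w").close()  # simulate persisted files
    second = load_or_build(tmp, build=lambda d: "built", load=lambda d: "loaded")

print(first, second)
# built loaded
```

In this notebook, `build` would wrap the `Chroma.from_documents(...)` call and `load` the `Chroma(embedding_function=..., persist_directory=...)` call from the next cell.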

In [7]:
model = "mixedbread-ai/mxbai-embed-large-v1"
embedding_model = doc_embedding(model)

persist_directory = "e5_ml_db"

# vectordb = Chroma.from_documents(documents=chunked_docs, 
#                                  embedding=embedding_model, 
#                                  persist_directory=persist_directory)

vectordb = Chroma(embedding_function=embedding_model, persist_directory=persist_directory)


### Part 7 - Context retrieval in action
Now it is time to look at the retrieval process. We construct a query and use the `vectordb` instance to perform a similarity search over our entire vector database (all the chunks we have built, each at most 512 characters long). We also state how many chunks we want to retrieve, here 5.

In [12]:
query = "Hur bygger We Know IT sina webbsidor?"
context = vectordb.similarity_search_with_relevance_scores(query, 5)

The following cell displays the documents found most similar to the given query, together with their relevance scores.

In [13]:
display(context)

[(Document(page_content='Nyfiken På We Know IT? | Om Oss Våra kunder Våra tjänster Konsultuthyrning Webbutveckling Apputveckling UX/UI - Design Digital marknadsföring Hosting & förvaltning Om oss Karriär Kontakta oss Våra kunder Våra tjänster Konsultuthyrning Webbutveckling Apputveckling UX/UI - Design Digital Marknadsföring Hosting & Förvaltning Om oss Karriar Kontakta oss Varför We Know IT? Vi är IT-konsultbolaget som satsar på studerande och nyexade talanger med höga ambitioner och drivkrafter.\xa0Experter på utveckling, design och', metadata={'category': 'general', 'chunk_id': 'om-oss_0', 'key': 'om-oss', 'url': 'https://www.weknowit.se/om-oss/'}),
  0.6072708608347592),
 (Document(page_content='Karriar | We Know IT Våra kunder Våra tjänster Konsultuthyrning Webbutveckling Apputveckling UX/UI - Design Digital marknadsföring Hosting & förvaltning Om oss Karriär Kontakta oss Våra kunder Våra tjänster Konsultuthyrning Webbutveckling Apputveckling UX/UI - Design Digital Marknadsföring 
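
The result is a list of `(Document, score)` pairs, which makes it easy to post-process. One common step is to drop weak matches before they reach the prompt. The sketch below shows this on toy data: strings stand in for `Document` objects, the scores are made up, and the threshold of 0.5 is an arbitrary assumption that would need tuning on real queries.

```python
def filter_by_score(results, min_score):
    """Keep (item, score) pairs whose score is at least `min_score`, best first."""
    kept = [(item, score) for item, score in results if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy results: strings stand in for Document objects, scores are invented.
results = [("chunk A", 0.61), ("chunk B", 0.48), ("chunk C", 0.55)]
relevant = filter_by_score(results, min_score=0.5)
print(relevant)
# [('chunk A', 0.61), ('chunk C', 0.55)]
```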

### Part 8 - Chatbot in action
This part requires that you create a `.env` file in the root directory, and that you insert the following line:

OPENAI_KEY=<your_generated_key>

You can generate your API key here: https://platform.openai.com/api-keys

In [16]:
load_dotenv()
# Retrieve the value of the environment variable
openai_key = os.environ.get("OPENAI_KEY")

#### Loading a GPT-3.5 instance
This is a simple way to load a GPT-3.5 instance using your API key. We set temperature to 0.1. Temperature controls how much randomness is used when the model picks the next token, which can be seen as the freedom the model has to generate creative responses. With temperature = 0, the output is close to deterministic and will likely be the same for a given query each time. Higher values, such as temperature = 1, give more varied responses, which can increase the risk of strange or incorrect output.
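
To build some intuition for this parameter, the sketch below applies the textbook softmax-with-temperature formula to a few made-up logits. It only illustrates the general idea of dividing logits by the temperature before normalizing; how the OpenAI API handles temperature internally is not covered in this notebook, and the function name is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; values near 0 approach argmax")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
low = softmax_with_temperature(logits, 0.1)
high = softmax_with_temperature(logits, 1.0)
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At the low temperature, almost all probability mass lands on the highest-scoring token, while at temperature 1 the other candidates keep a noticeable share.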

In [17]:
chat_model = ChatOpenAI(
    openai_api_key=os.environ.get("OPENAI_KEY"),
    model='gpt-3.5-turbo-1106',
    temperature=0.1
)

#### Our first RAG chain!
Now, we set up a very simple structure in order to answer questions. The following can be said about this code cell:

- **System prompt**: "System prompt" is OpenAI-specific terminology, but corresponding functionality exists for most LLMs. The idea is that the models are trained to accept a system prompt that contextualizes what we are trying to achieve and acts as general guidelines for the model.
- **Function for get_prompt**: Here, we perform the same context retrieval we did earlier, but we do some preprocessing to construct a nice, readable prompt. We simply extract the page content of each chunk and separate them by a new line.
- **Final simple RAG prompt**: The last part of the cell just states that the model shall answer a question, the `query` passed to the function, and that it shall use the `context`, the merged `page_content` from our retrieved chunks, to answer it. This is, in all its simplicity, what Retrieval Augmented Generation is.

**NOTE** that this is just the simplest of examples. This is where you can get creative. You can prompt the model to generate responses in any way or format you like, you can include metadata filtering to find information easier, or maybe use metadata in the response to for example link to the source of information using the URL. 
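
As one illustration of the last idea, the sketch below joins chunks into a context string and tags each with the URL from its metadata, so the model could be asked to cite its sources. The `Doc` dataclass is a hypothetical stand-in for langchain's `Document`, and `format_context_with_sources` is not part of this notebook's code; the texts are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for langchain's Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_context_with_sources(docs):
    """Join chunks into one context string, tagging each with its source URL."""
    blocks = []
    for doc in docs:
        url = doc.metadata.get("url", "unknown")  # fall back if no URL was stored
        blocks.append(f"[Source: {url}]\n{doc.page_content}")
    return "\n\n".join(blocks)

docs = [
    Doc("Text about the company.", {"url": "https://www.weknowit.se/om-oss/"}),
    Doc("Text without a URL."),
]
context_with_sources = format_context_with_sources(docs)
print(context_with_sources)
```

The same pattern would work with the `(doc, score)` pairs returned by the vector database, by iterating over the documents only.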

In [18]:
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)

system_prompt = """Du är en hjälpsam AI assistent, specialiserad på att svara på frågor om ett IT-konsultbolag som
                heter We Know IT. Du kommer att få frågor samt utvald information, vilken du kan använda
                för att svara på frågan. Svara på svenska."""

def get_prompt(query: str, vectordb):
    # Retrieve 10 chunks with relevance scores
    context_results = vectordb.similarity_search_with_relevance_scores(query, 10)
    
    # Extract the page_content from each context document
    context = "\n".join([doc.page_content for doc, score in context_results])
    
    # Construct the final prompt
    user_prompt = f"""Svara på följande fråga: {query}.

Du kan använda följande information för att generera ditt svar:
{context}"""

    return user_prompt


Here we build the prompt for a test question and print it to inspect the retrieved context.

In [21]:
test = get_prompt("Hur bygger We Know It sina webbapplikationer?", vectordb)
print(test)


Svara på följande fråga: Hur bygger We Know It sina webbapplikationer?.

Du kan använda följande information för att generera ditt svar:
av en webbsida eller ett element för att se vilket som presterar bättre. Genom att använda verktyg för webbanalys kan marknadsförare få insikter om användarbeteende, vilka sidor som presterar bra och vilka som har hög avvisningsfrekvens, vilket möjliggör datadrivna beslut för att förbättra konverteringsfrekvensen. Innovativa projektexempel. Folkes Biluthyrning Digitalisering av biluthyrning med webbsida och integrerat bokningssystem. Boujt Webbsida med chatfunktion med text eller video, inloggning och
Webbutveckling För Företag & Organisationer | We Know IT Våra kunder Våra tjänster Konsultuthyrning Webbutveckling Apputveckling UX/UI - Design Digital marknadsföring Hosting & förvaltning Om oss Karriär Kontakta oss Våra kunder Våra tjänster Konsultuthyrning Webbutveckling Apputveckling UX/UI - Design Digital Marknadsföring Hosting & Förvaltning Om oss 

#### Now, we finally put everything together and ask our RAG system a question about We Know IT!

As you can see from the response, we have utilized the GPT model's ability to generate fluent, well-written text while providing it with our own data in an effective way. This response is certainly nothing a good ol' GPT knows by heart!

In [23]:
import textwrap

query = "Hur bygger We Know It sina webbapplikationer?"
messages = [
    SystemMessage(content=system_prompt),
    HumanMessage(content=get_prompt(query, vectordb))
]
response = chat_model.invoke(messages)
wrapped_response = textwrap.fill(response.content, width=80)
print(wrapped_response)

We Know IT bygger sina webbapplikationer genom en effektiv utvecklingsfas av
webbsidan. De använder sitt unika ramverk för att säkerställa högsta kvalitet
och kontinuerlig rapportering. Vidare lägger de stor vikt vid att SEO-anpassa
allt innehåll i termer av sökordsdensitet, ordmängd och kvalitet. När
webbutvecklingen är färdig och det nya innehållet har integrerats, genomförs
kvalitetssäkring och lansering av webbplatsen. We Know IT fokuserar också på
konverteringsoptimering för att förbättra användarupplevelsen och öka
effektiviteten online.


## Moving on to API Calls...
Now, we look at the RAG.py script that spins up a FastAPI endpoint for our RAG system. 