# Retrieval Augmented Generation using LLMs on Proprietary Data 

In this tutorial, we'll create a proof-of-concept prototype that uses MLOps community data from our Slack channel to answer questions based on the community's combined knowledge.

### IMPORTANT:
Please use the virtualenv set up for you following the README in the parent project.
It will install the dependencies needed for the entire repo.

## Setup Environment

In [6]:
# LOAD ENVIRONMENT VARIABLES
import os
from dotenv import load_dotenv

# Load the .env file
load_dotenv(dotenv_path="./.env")  # Change this to your own .env file, if not using the venv
assert os.getenv("OPENAI_API_KEY") is not None, "OPENAI_API_KEY not found in .env file"

In [7]:
# Since we have some long lines, let's make sure output is line wrapped
from IPython.display import HTML, display


def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )


get_ipython().events.register("pre_run_cell", set_css)

## Download the data

In [8]:
from download_chats import download_chats_from_gdrive

download_chats_from_gdrive()

Directory already exists. Skipping download.


### Raw Messages & Embeddings

In most real-world applications, the data you're building on is never clean. While we've done some preprocessing on the raw Slack data from the community, we're keeping the data as close to the original as possible so that you can explore some of the challenges in preparing data for retrieval augmented generation.

Let's load up the messages.

In [9]:
import pandas as pd

# Raw messages
messages_df = pd.read_csv("./.content/chats/messages.csv")
messages_df.head()

Unnamed: 0,__Source,User_ID,Channel_Name,Message_Timestamp,Thread_Timstamp,Channel_ID,__Text
0,threads,U01T78HPG3H,computer-vision,2023-05-18 11:36:44.001949 UTC,2023-05-18 11:30:13.159979 UTC,C026ED0PZEZ,Both <https://roboflow.github.io/supervision/q...
1,threads,U01T78HPG3H,computer-vision,2023-05-18 11:30:13.159979 UTC,2023-05-18 11:30:13.159979 UTC,C026ED0PZEZ,Would love a package for suggesting and then i...
2,threads,U04QRD69H8Q,computer-vision,2023-05-18 10:56:33.313369 UTC,2023-05-17 12:35:14.522419 UTC,C026ED0PZEZ,<@U056Q4V4FFC> how about this dataset (for the...
3,threads,U01J0NVNE1G,mlops-questions-answered,2023-05-18 05:21:04.346819 UTC,2023-05-18 01:08:57.948139 UTC,C015J2Y9RLM,"I always oppose the counterargument, why do yo..."
4,threads,U01J0NVNE1G,mlops-questions-answered,2023-05-18 05:19:51.718379 UTC,2023-05-18 01:08:57.948139 UTC,C015J2Y9RLM,I find MLflow more convenient to use. Here are...


You'll see above that each row represents a chat message. Each message has the user id of the poster, the channel name in which it was posted, the message and thread timestamps, and the text of the message.

Let's keep only the messages from the mlops-questions-answered channel to create our Milo bot.

In [10]:
# Get all data for specific channel
df_mlops_questions_answered = messages_df[messages_df["Channel_Name"] == "mlops-questions-answered"]
df_mlops_questions_answered.head()

Unnamed: 0,__Source,User_ID,Channel_Name,Message_Timestamp,Thread_Timstamp,Channel_ID,__Text
3,threads,U01J0NVNE1G,mlops-questions-answered,2023-05-18 05:21:04.346819 UTC,2023-05-18 01:08:57.948139 UTC,C015J2Y9RLM,"I always oppose the counterargument, why do yo..."
4,threads,U01J0NVNE1G,mlops-questions-answered,2023-05-18 05:19:51.718379 UTC,2023-05-18 01:08:57.948139 UTC,C015J2Y9RLM,I find MLflow more convenient to use. Here are...
5,threads,U01CRVDS4NA,mlops-questions-answered,2023-05-18 05:14:52.514189 UTC,2023-05-16 23:22:04.332479 UTC,C015J2Y9RLM,I just built some demos with it. The developer...
7,messages,U01VCA57PD0,mlops-questions-answered,2023-05-18 01:08:57.948139 UTC,2023-05-18 01:08:57.948139 UTC,C015J2Y9RLM,These days I'm feeling very tempted to roll my...
21,messages,U015BH45ZK6,mlops-questions-answered,2023-05-17 14:56:59.775629 UTC,2023-05-17 14:56:59.775629 UTC,C015J2Y9RLM,I ran into a problem downloading files from Az...


In [11]:
# Get the count of messages in that channel
print(f"Number of messages: {len(df_mlops_questions_answered)}")

Number of messages: 20450


Wow! That's a lot of messages. 

As you know from our Slack, a conversation is what you get when you combine the messages from one thread. Let's see how many threads, or conversations, we have.

In [12]:
# Get the count of conversations
print(f"Number of conversations: {len(df_mlops_questions_answered.groupby(['Thread_Timstamp']))}")

Number of conversations: 2086


If we were to create an embedding for each message and put it in our database, a single message might lack the context of the question it belongs to. So instead, we will add whole conversations to our RAG database.

Let's group our messages by channel name and thread timestamp to generate complete conversations. 

In [13]:
# Iterate through conversations
from IPython.display import HTML, display

NUM_CONVERSATIONS_TO_PRINT = 5
current_conversation = 0
for (channel_name, thread_id), conv in df_mlops_questions_answered.groupby(["Channel_Name", "Thread_Timstamp"]):
    html = f"<h2>Channel: {channel_name} / conversation {thread_id} </h2><hr/>\n<table>\n"
    for index, row in conv.iterrows():
        html += f'<tr><td>{row["User_ID"]}</td><td>{row["__Text"]}</td></tr>'
    html += "</table>\n<hr/><br/>"
    display(HTML(html))
    current_conversation += 1
    if current_conversation >= NUM_CONVERSATIONS_TO_PRINT:
        break

0,1
U011NTHUKEF,"We use a lot of TPOT library. The main advantage is that it does hyperparameter tunning and also deals with preprocessing steps as well... downside is that it works mostly with scikit-learn algorithms. But there is a small hack you can do to make it work with many algorithms... make a new algo class that inherits both from sklearn BaseEstimators and your algo (Catboost, for example, or Pygams)"
U015CHWG25B,"Aha, I’ve heard that it’s really good!"
U0150LZ578X,"For anyone already using other Kubeflow components, Katib is relatively easy to work with It parallelizes trials across k8s pods and provides a web UI to visualize the hyperparameter space for each training history!"
U015CHWG25B,<@U013CL3GTB3> Thank you! Looks like this is an entire ML platform rather than just a hyperparameter tuning tool :thinking_face:
U013CL3GTB3,Polyaxon is another good one!
U015CHWG25B,"At some point we reviewed about 15 hyperparameter tuning tools in order to choose one that answers our needs. We stopped at NNI from Microsoft (). This tool is designed to run hyperparameter tuning in several parallel jobs. Unlike many other tools, it supports a lot of different algorithms of hyperparameter tuning. It has a decent UI. Plus, it's OSS from Microsoft :) Do you know other tools which answer our criteria?"
U015CHWG25B,I want to run hyperparameter tuning on several machines in parallel and track the process in Web UI. What is the best tool for that?


0,1
U011NTHUKEF,"AirFlow is a great tool. We use it a lot. When we work on a AWS environment, we use AWS stepfunctions which has its own Data Science SDK. Works like a charm!"
U015CHWG25B,"This is a clear usecase for pipeline tools. There are plenty of them out there, providing various features and UI capabilities. Currently, for our projects we use AirFlow. Which pipeline tools do you prefer and why?"
U015CHWG25B,"OK, I have a model of decent quality. Now I want to automate daily collecting data, retraining the model, and redeployment. How do I do that?"


0,1
U015CHWG25B,"Sure, thank you! :pray:"
U016A3RAL5N,<@U015CHWG25B> thanks for including us in your list. If you need any support lest us know


0,1
UV92GMLF4,"<@U014Z58NT25> Sorry for the late reply: 1. It's determined by marketing. The thing that we do it's just to highlight who is in the border. 2. We used only numericals in that time, but OHE work fine also. 3. At the that time, we didn't used any specific tool. I think today it's doable to do in matplotlib if you transform in 2D array those records from the cluster."
U012YQULW4X,"you seem to treat it as a technical problem (automate the encoding). This issue points you to a potentially significant change in the input data so you want to notice and manually investigate. For detection, you can check tools for data monitoring like or tensorflow data monitoring. They should be able to check feature cardinality (number of segments). Segments: Do I understand you correctly that the marketing team changed the logic for the segment variable e.g. an prior ""A"" customer might now be a ""B"" customer? Case A: Feature is not important. Fix encoding bug and move on. Case B: The feature is important but not a lot of customers migrated their segment. Fix encoding bug and move on. Case C: Feature is important and the logic is very different or many people migrated. This is possibly a breaking change in the data and you need a migration plan (discard training data with old logic, not use the feature for the migration duration etc): you mention: ""everytime this happens"". That should not happen regularly. If it does, you have to stop using the feature/understand better what they do and what it means. Constantly changing data definitions make the feature dangerous to use."
UP3T8K9M5,maybe <@U012YQULW4X> can also help as she has some experience with monitoring and will be talking to us about it during this week’s meetup
U014Z58NT25,And thanks for the answers guys! It really helps even just to talk about the problem
U014Z58NT25,"<@UV92GMLF4> the clusters were determined by the marketing in the sense of ""good cluster/bad cluster"" or they even gave you a centroid for that? Could you use anything but numericals for KMeans? Essentially I know you can't, for you have to deal with distances. But someone has once told me you can sometimes work with binary/ordinal using KMeans as well. Also, did you have anything (besides your own code and CLI) for monitoring those jumpers? Any specific tools or dashboards"
U014Z58NT25,"<@U013CL3GTB3> What happens is that as of now I have, say, 5 customer segments. On the data that I extract tomorrow, I might have 6, cause this customer segmentation is something still being created by the business, and I won't know it in advance. I can extract some meaning from these classes and put them into an ascending order (of integers, for example). And yes, I wanna use them for my recsys at the end of the day. EDIT: typo"
UP3T8K9M5,<@U0156CADGJG> might have something to say about this too
UV92GMLF4,"Those guys I removed from the clusters and put them in another cluster (determined by marketing). Was totally primitive and I was using the KMeans, but the general idea it's that."
UV92GMLF4,"In pink I got the ""jumpers""."
UV92GMLF4,Those are the clusters (just for simplification)


0,1
U016FCTSDGS,"Few answers from our experience: 1. We keep Airflow stack separate from DAG Projects. We also have a single repo per DAG project that gets auto-named and deployed in such a way that Airflow picks it up to sync in the DAG folder. We also have a DAG that runs frequently to scan and ingest the new DAG(s) from the DAG Projects. 2. If the question is long running tasks we spin up AWS Batch or AWS EMR depending on the need. If the question is how to interject DAG updates if a long running task is executing then no solution here other than on the next run the task(s)/DAG will get updated. 3. Did not know about the *`PythonVirtualenvOperator`* or would have used it. We just installed custom python packages to a temp directory and added to Python path, then deleted the folder in the last task for cleanup. 4. Haven't used Docker Operator. 5. Let me know if you find one!"
U015CHWG25B,"I can’t answer all your questions one by one, but I can share the basics of our approach. • Regarding deployment: We use Git to deploy DAGs. In our case, we have a dedicated repo for DAGs, and AirFlow installation polls this repo every minute and redeploys DAGs if necessary. Thus your DAGs are separated from your Airflow service. • Regarding tasks: We chose to make our own operator (as we have an ML platform with some specificity) based on DockerOperator. Containerisation of each step gives you way more freedom, especially in complex pipelines when you may need even different versions of CUDA for different steps."
U013K7876BF,"thanks for the responses <@U013CL3GTB3> and <@U0158N59C8H> Flux sounds great, but I’m going need to work with a push model (we have some constraints in our system - we can’t access the repo from the cloud) If i’m pushing the scheduler just to kick off a `KubernetesPodOperator`, Airflow feels very similar to Argo."
UP3T8K9M5,hmmm yeah i also would like to know this.
U0158N59C8H,Nope. I mean I use argo quite a bit via kubeflow pipelines. I used airflow too. Just wanting to get more opinions on the diff solutions. Esp Argo vs Tekton. As they are very closely related.
UP3T8K9M5,I also remember <@U0158N59C8H> was asking about Argo a while back. Were you able to get any info that could help us?
U013CL3GTB3,"1. I would launch the task in a pod using the kubernetesPodOperator. And for pushing your dags, use gitops! You can point your airflow deployment to a git branch. This requires flux and some other stuff to be in place. 2. Not sure what you mean by long running airflow service? Are you saying how to update a task independent of airflow itself like the web server or scheduler? - if you're launching your tasks in pods you leave the execution logic to the docker container running in the pod and the orchestration logic to airflow. I would also recommend using the KubernetesExecutor alongside the KubernetesPodOperator. 3. The kubernetesPodOperator also solves this problem as you have complete control over the runtime environment, resources, etc - the container has everything it needs to run the code 4. Haven’t used the docker operator but It sounds like it’s more general than just python code. And will allow you to containerize your code which deals with the dependency issue. 5. My company BenevolentAI Is about to publish an article about some lessons we learned using airflow so I’ll post it to the community when it’s out. Also me and <@UP3T8K9M5> are planning on doing a coffee session on pipelines - and airflow will be one of them. You mentioned Argo, that’s also another good option as it’s native to kubernetes which is nice and also has other benefits."
U013K7876BF,"Hey everyone We’re looking for productionize some data engineering pipelines but we really want this run on Airflow and have a proper CI/CD pipeline. I have a couple of questions, and I’m hoping some of you may have experience and knowledge about these topics. 1. Would you deploy the Airflow task along with the Airflow service itself? How else would you push your DAGs from source control into the dags folder of Airflow? 2. Is there an elegant way to update the tasks separately from a long running Airflow service? 3. Given that not all tasks have the same python dependencies, how does dependency management work? Do you install all of the dependencies with Airflow itself, or do you use `PythonVirtualenvOperator` to create a virtual env for each step in the dag? 4. How difficult is monitoring and debugging when using `airflow.operators.docker_operator`? we can package each step in our data engineering pipeline in an .py file with a `main` and create a container. Does it have clear benefits over the `PythonVirtualenvOperator?` 5. If there are good resource on airflow deployment strategies, I’d love to read more. I was looking at Argo and if help have experience with Argo vs Airflow I’d love to hear that too. Thanks!"


### Conversations & Embeddings

We've pre-built these conversations to make it easier.

In [16]:
# Each conversation grouped into a single thread_id
chats_df = pd.read_csv("./.content/chats/chats.csv")
chats_df.head()

Unnamed: 0.1,Unnamed: 0,channel_name,thread_id,chat_text,summary
0,0,africa,2022-03-22 19:42:06.219769 UTC,U024WRAA0D9: Hello fellow MLOpsers in Africa :...,A user named U024WRAA0D9 welcomes fellow MLOps...
1,1,africa,2022-03-24 08:14:33.140029 UTC,U024WRAA0D9: What should our next steps be (fo...,The conversation is discussing the next steps ...
2,2,africa,2022-03-28 11:57:42.840049 UTC,U024WRAA0D9: What’s everyone’s timezone?U024WR...,The conversation is discussing everyone's time...
3,3,africa,2022-04-12 14:36:00.144498 UTC,U03142DQP6Z: Please can we make it later in th...,One person asks if it's possible to schedule a...
4,4,africa,2022-04-19 10:24:57.455849 UTC,"U024WRAA0D9: Hello <#C037GTG932B|africa>, I on...",Someone named U024WRAA0D9 mentions in the Slac...
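The `chat_text` values above look like each thread's messages joined together, with the poster's ID in front of each one. How `chats.csv` was actually built is not shown here, so the following is only a minimal sketch of one way to assemble such a string from raw message rows. The helper name `build_conversations`, the oldest-first ordering, and the newline separator are assumptions (the pre-built file appears to join messages without a separator). With the real data, the rows could come from `messages_df.to_dict("records")`.

```python
from collections import defaultdict


def build_conversations(rows, separator="\n"):
    """Group message rows into one text blob per (channel, thread)."""
    threads = defaultdict(list)
    for row in rows:
        key = (row["Channel_Name"], row["Thread_Timstamp"])
        threads[key].append(row)

    conversations = {}
    for key, messages in threads.items():
        # Assumes all timestamps share one fixed-width format, so sorting them
        # as strings puts the oldest message (usually the question) first.
        messages.sort(key=lambda m: m["Message_Timestamp"])
        conversations[key] = separator.join(
            f'{m["User_ID"]}: {m["__Text"]}' for m in messages
        )
    return conversations


# Tiny hand-made sample standing in for messages_df.to_dict("records")
sample_rows = [
    {"Channel_Name": "demo", "Thread_Timstamp": "t1",
     "Message_Timestamp": "2023-05-18 11:36:44", "User_ID": "U2", "__Text": "Try Katib."},
    {"Channel_Name": "demo", "Thread_Timstamp": "t1",
     "Message_Timestamp": "2023-05-18 11:30:13", "User_ID": "U1", "__Text": "Which tuning tool?"},
    {"Channel_Name": "demo", "Thread_Timstamp": "t2",
     "Message_Timestamp": "2023-05-18 12:00:00", "User_ID": "U3", "__Text": "Hello"},
]
conversations = build_conversations(sample_rows)
print(conversations[("demo", "t1")])
```

Putting the question first keeps the thread in reading order, which may help both the embedding and the later summarization step.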


At this point, to build our vector embeddings, we would use an embedding model. You could use an online model (e.g. from OpenAI, Cohere, etc.) or a local model (e.g. T5 or BERT). In this case, since we have about 9.7k conversations, we've pre-embedded these for you.

In [17]:
# The embedding for each conversation with its thread_id (Note: not all embeddings were generated for the chat text)
embeddings_df = pd.read_csv("./.content/chats/chats-embeddings-ada-002.csv")
embeddings_df.head()

Unnamed: 0.1,Unnamed: 0,thread_id,embedding
0,0,2022-03-22 19:42:06.219769 UTC,"[0.0036350293084979057, -0.01264416053891182, ..."
1,1,2022-03-24 08:14:33.140029 UTC,"[-0.002360287122428417, -0.04199115186929703, ..."
2,2,2022-03-28 11:57:42.840049 UTC,"[0.017543835565447807, 0.0032007887493819, 0.0..."
3,3,2022-04-12 14:36:00.144498 UTC,"[-0.001173610333353281, -0.014446504414081573,..."
4,4,2022-04-19 10:24:57.455849 UTC,"[-0.0025763490702956915, -0.02925489842891693,..."


Let's deserialize the string embeddings into vectors. (The API you use for your embeddings should automatically return a list of floats. This step is only needed because we pre-built the embeddings and stored them in a CSV file.)
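As a small illustration of that deserialization: each `embedding` cell in the CSV is a string that holds a JSON-style list of numbers, so `json.loads` turns it back into a Python list. The sample string below is made up and much shorter than a real embedding.

```python
import json

# Made-up sample in the same shape as one `embedding` cell of the CSV
raw_embedding = "[0.0036, -0.0126, 0.0175]"

vector = json.loads(raw_embedding)  # str -> list of floats
print(type(vector).__name__, len(vector))
```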

In [18]:
print("Number of all conversations: ", len(chats_df))
print("Number of conversations with embeddings: ", len(embeddings_df))
# NOTE: We have some missing.

Number of all conversations:  9719
Number of conversations with embeddings:  9713


AssertionError: Number of conversations and embeddings do not match
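To see which conversations have no embedding, one option is to compare the two sets of `thread_id` values. This is a minimal sketch with made-up IDs. The helper name `find_missing_ids` is hypothetical, and with the real data you would pass `chats_df["thread_id"]` and `embeddings_df["thread_id"]`.

```python
def find_missing_ids(all_ids, embedded_ids):
    """Return the ids in all_ids that have no embedding, in original order."""
    embedded = set(embedded_ids)
    return [thread_id for thread_id in all_ids if thread_id not in embedded]


# Made-up ids standing in for the two thread_id columns
chat_ids = ["t1", "t2", "t3", "t4"]
embedded_ids = ["t1", "t3", "t4"]
missing = find_missing_ids(chat_ids, embedded_ids)
print(missing)
```

Because some rows are missing, chats and embeddings should be matched by `thread_id` rather than by row position.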

In [None]:
# Get your embeddings data together.
import json

from tqdm import tqdm

# Create a temp index of the chats
chats_index = {}
for _, row in tqdm(chats_df.iterrows(), desc="Creating temporary chats index"):
    chats_index[row["thread_id"]] = row["chat_text"]

# Link the chats and embeddings together
embeddings = []
VECTOR_SIZE = None
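for _, row in tqdm(embeddings_df.iterrows(), total=len(embeddings_df), desc="Collecting chats and embeddings"):
    # Look the chat up by thread_id: embeddings_df has fewer rows than chats_df,
    # so positional indexing (chats_df.iloc[index]) can pair an embedding with the wrong chat.
    chat_text = chats_index[row["thread_id"]]
    embedding = json.loads(row["embedding"])
    embeddings.append({"thread_id": row["thread_id"], "text": chat_text, "embedding": embedding})
    if not VECTOR_SIZE:
        VECTOR_SIZE = len(embedding)
    else:
        assert VECTOR_SIZE == len(embedding), "All embeddings must have the same dimension"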

# Create your vector database
Feel free to use any vector DB you like; for this tutorial we will use ChromaDB, an open-source one.

In [None]:
import chromadb

client = chromadb.Client()

collection = client.create_collection("chats")

In [None]:
embeddings_list = []
metadata_list = []
id_list = []

for row in tqdm(embeddings):
    metadata = {
        "thread_id": row["thread_id"],
        "text": chats_index[row["thread_id"]],
    }
    metadata_list.append(metadata)
    embeddings_list.append(row["embedding"])
    id_list.append(row["thread_id"])

collection.add(embeddings=embeddings_list, metadatas=metadata_list, ids=id_list)
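Depending on the installed ChromaDB version, a single `add` call may reject very large batches. Whether that applies to your setup is something to check. If it does, one option is to split the lists into chunks first. This is a minimal sketch; the helper name `batched` and the batch size are illustrative assumptions.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up ids to show how the list is split
ids = [f"id-{i}" for i in range(7)]
chunks = list(batched(ids, 3))
print([len(chunk) for chunk in chunks])
```

In the cell above, the same offsets would slice `embeddings_list`, `metadata_list`, and `id_list` together, with one `collection.add(...)` call per chunk.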

We can check the data in the vector DB to confirm that the step above was successful.

In [None]:
print("Number of conversations in vector DB: ", collection.count())  # returns the number of items in the collection
collection.peek(1)  # returns a list of the first n items in the collection

Next, let's create a function for searching for the closest chats given an embedding.

In [None]:
def search_index(embedding):
    results = collection.query(query_embeddings=[embedding], n_results=3)
    return [{"thread_id": m["thread_id"], "text": m["text"]} for m in results["metadatas"][0]]
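Conceptually, a nearest-neighbour query ranks the stored vectors by how close they are to the query vector. The brute-force sketch below illustrates the idea with cosine similarity on made-up 3-dimensional vectors. It is not how ChromaDB works internally (vector databases typically use an approximate index and a configurable distance), and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, documents, k=3):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in documents.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Made-up vectors; real embeddings have far more dimensions
documents = {
    "kubernetes": [1.0, 0.0, 0.0],
    "airflow": [0.0, 1.0, 0.0],
    "team-structure": [0.0, 0.0, 1.0],
}
nearest = top_k([0.9, 0.1, 0.0], documents, k=2)
print(nearest)
```

A scan like this compares the query with every stored vector, which is fine for a few thousand conversations but is the cost a vector index is meant to avoid at larger scale.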

As a sanity check, query with the embedding of an existing document and confirm that the same document is returned first.


In [None]:
row = embeddings[2539]
docs = search_index(row["embedding"])
assert row["thread_id"] == docs[0]["thread_id"], "Document does not match"

# Example Q&A with OpenAI

Now that we have our vector database, let's create the Q&A mechanism. It'll work like this:

For a question that the user asks:
- Let's generate the embedding for that using the same embedding model as our documents (i.e. OpenAI's Ada-002)
- Query the database to get the 3 nearest documents. 
- Since these documents would be too large for the context window of the OpenAI model, we need to summarize them.
- Finally, we generate the answer for the user.
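The steps above can be sketched as one function that takes each stage as a plain callable, which makes the flow easy to run offline with stand-ins. The parameter names (`embed`, `search`, `summarize`, `answer`) and the fake stages are illustrative assumptions, not the OpenAI calls used in the rest of this notebook.

```python
def answer_question(question, embed, search, summarize, answer, n_results=3):
    """Run the retrieve -> summarize -> answer flow with injected stages."""
    query_vector = embed(question)                # 1. embed the question
    docs = search(query_vector)[:n_results]       # 2. keep the nearest conversations
    summaries = [summarize(doc) for doc in docs]  # 3. shrink each conversation
    return answer(summaries, question)            # 4. generate the reply


# Offline stand-ins so the flow can be exercised without an API key
def fake_embed(text):
    return [float(len(text))]


def fake_search(vector):
    return ["long chat A", "long chat B", "long chat C", "long chat D"]


def fake_summarize(chat):
    return chat.replace("long ", "")


def fake_answer(summaries, question):
    return f"{question} -> {', '.join(summaries)}"


result = answer_question("Which tool?", fake_embed, fake_search, fake_summarize, fake_answer)
print(result)
```

Keeping the stages separate also makes it simple to swap one of them later, for example replacing the summarizer with a cheaper model.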

In [None]:
!pip install openai

In [None]:
import os
import openai

# Set up the OpenAI API key
assert os.getenv("OPENAI_API_KEY"), "Please save your OPENAI_API_KEY in a .env file."
openai.api_key = os.getenv("OPENAI_API_KEY")


def get_embedding(text):
    # Use the same embedding generator as what was used on the data!!!
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding


def summarize(chat_text):
    # Summarize conversations since individually they are long and go over 8k limit
    prompt = (
        "Summarize the following conversation on the MLOps.community slack channel. Do not use the usernames in the summary. ```"
        + chat_text
        + "```"
    )
    completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}])
    return completion.choices[0].message.content


def extract_answer(chat_texts, question):
    # Combine the summaries into a prompt and use SotA GPT-4 to answer.
    prompt = "Use the following summaries of conversations on the MLOps.community slack channel, delimited by triple backticks, to generate an answer for the user question."
    for i, chat_text in enumerate(chat_texts):
        print(f"Getting summary for conversation {i+1}")
        prompt += f"\nConversation {i+1} Summary:\n```\n{summarize(chat_text)}```"

    if not question.endswith("?"):
        question = question + "?"
    prompt += f"\nQuestion: {question}"
    print("Getting answer for the question.")
    completion = openai.ChatCompletion.create(model="gpt-4", messages=[{"role": "user", "content": prompt}])
    content = completion.choices[0].message.content
    return content

In [None]:
import time


def get_answer(question):
    tic = time.perf_counter()
    # Get answer to the question by finding the three conversations that are nearest
    # to the question and then using them to generate the answer.
    print("Searching documents nearest to the question.")
    search_vector = get_embedding(question)
    docs = search_index(search_vector)
    # Take the top three answers, and use ChatGPT to form the answer to give the user.
    chat_texts = []
    for doc in docs:
        chat_text = chats_index[doc["thread_id"]]
        chat_texts.append(chat_text)
    if len(chat_texts) > 3:
        # Assign the slice back; a bare `chat_texts[:3]` has no effect
        chat_texts = chat_texts[:3]
    answer = extract_answer(chat_texts, question)
    toc = time.perf_counter()
    print(f"Answer took {toc - tic:0.4f} seconds")
    return answer

We can now try these out.

In [None]:
question = "What are some good ways to deploy models on Kubernetes?"
answer = get_answer(question)
print(f"\n\nQuestion: {question}\nAnswer: {answer}")

In [None]:
question = "How can I structure a good Data Science team?"
answer = get_answer(question)
print(f"\n\nQuestion: {question}\nAnswer: {answer}")

In [None]:
question = "What is the best way to train models for tabular data?"
answer = get_answer(question)
print(f"\n\nQuestion: {question}\nAnswer: {answer}")

That's it. Feel free to play around with this proof of concept. 

We also have companion code with some optimizations for you when you're ready.

## Observations

- The prototype still needs a UI that looks like a Q&A system.
- While answering, the code makes 4 calls to OpenAI: three to summarize the retrieved conversations and one to generate the response.

So we optimize our code for the above and create the PoC.
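One possible way to cut the summarization calls, sketched under assumptions: a summary depends only on the conversation, so it can be computed once per `thread_id` and reused across questions. The `summary` column in `chats.csv` could seed such a cache. The names `SummaryCache` and `slow_summarize` are hypothetical, and the stand-in summarizer does not call any model.

```python
class SummaryCache:
    """Memoize summaries by thread_id so each conversation is summarized once."""

    def __init__(self, summarize_fn, preloaded=None):
        self._summarize_fn = summarize_fn
        self._cache = dict(preloaded or {})

    def get(self, thread_id, chat_text):
        if thread_id not in self._cache:
            self._cache[thread_id] = self._summarize_fn(chat_text)
        return self._cache[thread_id]


calls = []


def slow_summarize(chat_text):
    # Stand-in for a model call; records each invocation
    calls.append(chat_text)
    return chat_text[:10]


cache = SummaryCache(slow_summarize, preloaded={"t0": "pre-built summary"})
first = cache.get("t1", "a very long conversation about Airflow")
second = cache.get("t1", "a very long conversation about Airflow")
seeded = cache.get("t0", "never summarized again")
print(len(calls), seeded)
```

With every retrieved conversation already in the cache, answering a question would need only the embedding call and the final generation call.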