# Build local RAG

In this notebook, I walk through how I built a simple local RAG (retrieval-augmented generation) pipeline using an open-source LLM and an open-source embedding model.

## Install ollama

Follow this [link](https://ollama.com/) to install ollama on your local machine.

## (Optional) Install Conda

Follow this [link](https://docs.conda.io/projects/conda/en/latest/index.html) to install Conda, which manages packages and Python environments. Otherwise, you have to `pip install` the missing packages manually.

In [None]:
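# (Only needed if you use Conda) Create a local Python environment and automatically install all required packages.
# Note: `!` commands run in a subshell, so the activation below does not change this notebook's kernel.
# Run both commands in a terminal and start Jupyter from the activated environment.

!conda env create -f environment.yaml
!conda activate local-rag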

In [None]:
# Use ollama to pull the model you plan to use

!ollama pull llama3.2:3b

## Try LLM without RAG

In [12]:
from langchain_ollama.llms import OllamaLLM

MODEL = "llama3.2:3b"
model = OllamaLLM(model=MODEL)
model.invoke("What is autopilot?")

'Autopilot is a system or technology that enables an aircraft, vehicle, or other device to navigate and control itself automatically, without human intervention. The term "autopilot" was first coined in the early 20th century for use in aviation.\n\nIn modern times, the concept of autopilot has expanded beyond just aircraft and vehicles to include various applications across multiple industries. Here are some examples:\n\n1. **Aviation Autopilot**: In aircraft, autopilot systems use sensors, computers, and software to control flight trajectory, altitude, speed, and navigation.\n2. **Automotive Autopilot**: In cars and trucks, advanced driver-assistance systems (ADAS) like lane-keeping assist, adaptive cruise control, and automatic emergency braking can be considered types of autopilot.\n3. **Marine Autopilot**: Similar to aviation autopilot, marine autopilot systems help navigate boats and yachts by automatically controlling speed, direction, and depth.\n4. **GPS Autopilot**: Global Po

## Try LLM by providing a simple context

In [13]:
from langchain.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you cannot answer the question, reply "I do not know"

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
prompt.format(context="Here is some context", question="Here is a question")

chain = prompt | model
chain.invoke({
    "context": "The name I was given when I was born was Daniel",
    "question": "What is my name?"
})

'Your name is Daniel.'
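To make it clearer what the model actually receives, here is a minimal plain-Python sketch of the substitution step. It only approximates what `prompt.format(...)` does with the default f-string style template. The names `TEMPLATE` and `render_prompt` are my own for illustration and are not part of LangChain.

```python
# Plain-Python approximation of filling the prompt template; no LangChain required.
# TEMPLATE and render_prompt are illustrative names, not LangChain APIs.
TEMPLATE = """
Answer the question based on the context below. If you cannot answer the question, reply "I do not know"

Context: {context}

Question: {question}
"""


def render_prompt(context: str, question: str) -> str:
    """Substitute the two placeholders and return the final prompt text."""
    return TEMPLATE.format(context=context, question=question)


rendered = render_prompt(
    context="The name I was given when I was born was Daniel",
    question="What is my name?",
)
print(rendered)
```

The printed text is what gets sent to the model, so the answer can only be as good as whatever ends up in the `Context:` slot.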

## Use local PDF as the context

In [14]:
from langchain_community.document_loaders import PyPDFLoader

# Split PDF into pages
loader = PyPDFLoader("./resources/gke_autopilot_paper.pdf")
pages = loader.load_and_split()
pages

[Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20240909001400', 'source': './resources/gke_autopilot_paper.pdf', 'total_pages': 16, 'page': 0, 'page_label': '1'}, page_content='Autopilot: workload autoscaling at Google\nKrzysztof Rzadca\nGoogle + Univ. of Warsaw\nkmrz@google.com\nPawel Findeisen\nGoogle\npafinde@google.com\nJacek Swiderski\nGoogle\njswiderski@google.com\nPrzemyslaw Zych\nGoogle\npzych@google.com\nPrzemyslaw Broniek\nGoogle\nbroniek@google.com\nJarek Kusmierek\nGoogle\njdk@google.com\nPawel Nowak\nGoogle\npawelnow@google.com\nBeata Strack\nGoogle\nbstrack@google.com\nPiotr Witusowski\nGoogle\nwitus@google.com\nSteven Hand\nGoogle\nsthand@google.com\nJohn Wilkes\nGoogle\njohnwilkes@google.com\nAbstract\nIn many public and private Cloud systems, users need to\nspecify a limit for the amount of resources (CPU cores and\nRAM) to provision for their workloads. A job that exceeds\nits limits might be throttled or killed, resulting in delayin

In [None]:
# Pull an embedding model (see https://ollama.com/blog/embedding-models)
!ollama pull mxbai-embed-large

In [15]:
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_ollama import OllamaEmbeddings

EMBEDDING_MODEL = "mxbai-embed-large"
embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL)
# Keep the embeddings in memory, which is enough for experimentation
vectorstore = DocArrayInMemorySearch.from_documents(pages, embedding=embeddings)

In [16]:
# Define the retriever, which returns the top-k documents most similar to the query
retriever = vectorstore.as_retriever()
retriever.invoke("What is autopilot?")

[Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20240909001400', 'source': './resources/gke_autopilot_paper.pdf', 'total_pages': 16, 'page': 11, 'page_label': '12'}, page_content='are not Autopiloted (for these jobs Autopilot runs in simu-\nlation mode). The dashboard helps the user to understand\nAutopilot’s actions, and build trust in what Autopilot would\ndo if it was enabled on non-Autopiloted jobs: the user can\nsee how Autopilot would respond to daily and weekly cycles,\nnew versions of binaries or suddenly changing loads.\n5.3 Automatic migration\nOnce we had sufficient trust in Autopilot’s actions from\nlarge-scale off-line simulation studies and smaller-scale A/B\nexperiments, we enabled it as a default for all existing small\njobs (with an aggregate limit of up to roughly 10 machines)\nand all new jobs. Users were notified well in advance and\nthey were able to opt-out. This automatic migration trivially\nincreased adoption with practically n
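The retrieval step above can be sketched in plain Python to show the idea behind a top-k similarity search. This is only an illustration under assumptions: I assume cosine similarity as the metric, and the toy 3-dimensional vectors stand in for real embeddings (which have many more dimensions). The functions `cosine_similarity` and `top_k` are hypothetical helpers, not the vector store's actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy vectors standing in for page embeddings (assumption: real ones are much larger).
toy_doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

ranked = top_k(toy_query, toy_doc_vectors, k=2)
print(ranked)  # → [0, 1]
```

In the notebook, the query vector comes from embedding the question with the same embedding model used for the pages, which is why both must be embedded by `mxbai-embed-large`.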

In [18]:
from operator import itemgetter

# Use the retriever to fetch the context and pipe it into the prompt template
chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question")
    }
    | prompt
    | model
)

# Stream the answer, which should now be based on the retrieved context
# Non-streaming alternative: chain.invoke({"question": "What is autopilot?"})
for chunk in chain.stream({"question": "What is autopilot?"}):
    print(chunk, end="", flush=True)

Autopilot refers to a set of software tools and algorithms developed by Google that aim to optimize resource allocation in large-scale systems, such as cloud computing platforms. The goal of Autopilot is to automatically adjust the amount of resources (e.g., CPU, memory) allocated to running tasks or jobs, based on their historical performance and other factors.

In the context of the provided text, Autopilot appears to be a component of Google's Cloud Platform that uses machine learning algorithms to optimize resource allocation and reduce waste. The Autopilot system is designed to automatically adjust the limits for histogram buckets (i.e., the bounds within which data points are stored) in order to minimize overruns and underruns, while also taking into account the cost of changing these limits.

Autopilot can be seen as a form of "autopilot" in the sense that it automates the process of resource allocation and optimization, allowing users to focus on other tasks without having to m
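One thing worth noting about the chain above: the retriever returns a list of `Document` objects, and as far as I can tell that list is placed into the prompt as its string representation, metadata included. A common alternative is to join only the page text before it reaches the template. Below is a minimal sketch of that idea. `FakeDocument` is a hypothetical stand-in so the example runs without LangChain, and `format_docs` is my own helper name.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (only the fields used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join the text of each retrieved document, dropping the metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    FakeDocument("first page", {"page": 0}),
    FakeDocument("second page", {"page": 1}),
]
context = format_docs(docs)
print(context)
```

In the real chain, a helper like this could be piped in after the retriever, for example `itemgetter("question") | retriever | format_docs`. I have not run that variant in this notebook, so treat it as a suggestion to try rather than a tested change.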