
# Local Retrieval-Augmented Generation (RAG) Pipeline in Python via Ollama


# Ollama Setup

**Ollama** is an open-source Large Language Model (LLM) backend server that streamlines the deployment of LLMs in local environments, using both CPU and GPU resources.

1. **[Download Ollama](https://ollama.com/download)**

2. **Follow the installation instructions**

3. **Select models from the [Ollama library](https://ollama.com/library)**

For the remaining steps, open a terminal

4. **Pull the models**

> ```bash
> ollama pull llama2      # Language model
> ollama pull all-minilm  # Embedding model
> ```

5. **Install the [Ollama Python library](https://github.com/ollama/ollama-python/blob/main/README.md)**:

> ```bash
> pip install ollama==0.1.8 # Install Ollama Python library (version 0.1.8)
> ```

> > ```bash
> > pip show ollama # Check the version
> > ```

6. **Verify the model**  

> ```bash
> ollama run llama2
> ```

7. **Start the Ollama service for Jupyter notebook connection (VS Code TBD)**  

> ```bash
> ollama serve &
> ```
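
Before running the notebook cells, it can help to confirm that the server started by `ollama serve` is actually reachable. The following is a minimal sketch using only the standard library. It assumes Ollama is listening on its default address, `http://localhost:11434`; the helper name `ollama_is_running` is hypothetical and not part of the Ollama library.

```python
import http.client
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as reply:
            return reply.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed reply
        return False


print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, the service is probably not running or is bound to a different host or port than the one assumed here.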


#### Key Features of Ollama 

- **Optimized Performance:** Efficiently leverages both CPU and GPU hardware to maximize the speed and performance of supported LLMs
- **Flexible Deployment:** Supports easy setup and deployment on local machines, giving developers full control over local model inference
- **Scalable Architecture:** Designed to handle varying workloads, making it suitable for both small-scale projects and large enterprise applications


# Chat Query with Llama2

In [172]:


# Import the Ollama library to interact with the language models
import ollama

# Send a chat request to the 'llama2' model with a user message
response = ollama.chat(model='llama2', messages=[
  {
    'role': 'user',
    'content': 'Why is the sky blue?',  # Message content that the user sends to the model
  },
])

# Print the response from the model, displaying the answer to the user's question
print(response['message']['content'])




The sky appears blue because of a phenomenon called Rayleigh scattering, which occurs when sunlight enters Earth's atmosphere. The sunlight encounters tiny molecules of gases in the air, such as nitrogen and oxygen, which scatter the light in all directions.

Rayleigh scattering is a process where shorter wavelengths of light (such as blue and violet) are scattered more than longer wavelengths (such as red and orange). This is because the smaller wavelengths have shorter paths in the atmosphere, which allows them to be scattered more easily.

As a result of this scattering, the blue light is dispersed throughout the atmosphere, giving the sky its blue appearance. The blue color can appear more intense near the horizon due to the extra scattering that occurs when light passes through more air molecules in the atmosphere.

It's worth noting that the color of the sky can vary depending on a number of factors, including the time of day, weather conditions, and atmospheric phenomena such a

# Generate Vector Embeddings

Ollama supports embedding models, making it possible to build RAG applications that combine text prompts with existing documents or other data.

**What are embedding models?** 
Embedding models are models that are trained specifically to generate vector embeddings: long arrays of numbers that represent semantic meaning for a given sequence of text:

<figure>
  <img src="https://ollama.com/public/blog/what-are-embeddings.svg" alt="vector_embeddings_ollama" width="500" />
  <figcaption></figcaption>
</figure>
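
To make "similar in meaning" concrete, the sketch below compares vectors with cosine similarity, one common way to measure how closely two embeddings point in the same direction. The three-dimensional vectors are made up for illustration; real embeddings from a model such as `all-minilm` have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", not produced by any model
llama = [0.9, 0.1, 0.0]
alpaca = [0.8, 0.2, 0.1]
tractor = [0.0, 0.1, 0.9]

# Related concepts should score higher than unrelated ones
print(cosine_similarity(llama, alpaca) > cosine_similarity(llama, tractor))  # True
```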

In [173]:


from pprint import pprint

# Generate vector embeddings for the given text prompt
resp = ollama.embeddings(
    model="all-minilm",
    prompt="Llamas are members of the camelid family meaning they're pretty closely related to vicuñas and camels",
)

# The resulting vector embedding arrays can then be stored in a database, 
# which will compare them as a way to search for data that is similar in meaning

# Limit the print output to the first 10 values
pprint(f"Embedding length: {len(resp['embedding'])}")
pprint({'embedding': resp['embedding'][:10]})



'Embedding length: 384'
{'embedding': [0.0479680672287941,
               0.11637094616889954,
               -0.24570561945438385,
               -0.04406300559639931,
               -0.24932530522346497,
               0.12218563258647919,
               -0.48447176814079285,
               -0.1940533071756363,
               0.27372273802757263,
               0.1956769824028015]}


# Weaviate Setup

**Weaviate** is an open-source, AI-native vector database designed to store and manage large-scale data. It provides advanced capabilities for AI-driven applications, making it easier to handle data ingestion, querying, and search operations.

1. **Install [Python Weaviate library](https://weaviate.io/developers/weaviate/client-libraries/python)**:

> ```bash
> pip install -U weaviate-client  # Install or upgrade the Weaviate client library (-U pulls the latest release, not necessarily the 4.5.5 noted here)
> ```

> > ```bash
> > pip show weaviate-client # Check the version
> > ```

#### Key Features of the Weaviate Client Library 

- **Data Ingestion:** Easily add and manage data within your Weaviate instance 
- **Querying:** Execute complex queries to retrieve relevant information 
- **Search Operations:** Perform semantic and vector-based searches for accurate data retrieval


In [174]:


# Import the Weaviate client library to interact with the Weaviate database
import weaviate

# Check if the client object already exists to prevent multiple connections
if 'client' not in globals():
    # Connect to an embedded Weaviate instance
    # Python client interacts with the local server at http://127.0.0.1:8079
    # using HTTP for queries and gRPC (Google Remote Procedure Call) for faster data communication
    # Start process ID path: /Users/briankaewell/.cache/weaviate-embedded/*
    client = weaviate.connect_to_embedded() # For quick and basic evaluations
    print("Client connected successfully")
    
else:
    print("Client is already connected")

# Check if the Weaviate instance is ready and print the connection status
print(client.is_ready())



{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-10-24T17:27:41-04:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-10-24T17:27:41-04:00"}
{"level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-10-24T17:27:41-04:00"}
{"level":"info","msg":"module offload-s3 is enabled","time":"2024-10-24T17:27:41-04:00"}
{"level":"info","msg":"open cluster service","servers":{"Embedded_at_8079":52101},"time":"2024-10-24T17:27:41-04:00"}
{"address":"192.168.0.249:52102","level":"info","msg":"starting cloud rpc server ...","time":"2024-10-24T17:27:41-04:00"}
{"level":"info","msg":"starting raft sub-system ...","time":"2024-10-24T17:2

Client connected successfully
True


{"docker_image_tag":"unknown","level":"info","msg":"configured versions","server_version":"1.26.1","time":"2024-10-24T17:27:43-04:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50050","time":"2024-10-24T17:27:43-04:00"}
{"address":"192.168.0.249:52101","level":"info","msg":"current Leader","time":"2024-10-24T17:27:43-04:00"}
{"action":"restapi_management","docker_image_tag":"unknown","level":"info","msg":"Serving weaviate at http://127.0.0.1:8079","time":"2024-10-24T17:27:43-04:00"}
{"action":"telemetry_push","level":"info","msg":"telemetry started","payload":"\u0026{MachineID:4b652bfb-ddcb-471d-8182-2707242e567f Type:INIT Version:1.26.1 NumObjects:0 OS:darwin Arch:arm64 UsedModules:[]}","time":"2024-10-24T17:27:44-04:00"}
{"action":"bootstrap","level":"info","msg":"node reporting ready, node has probably recovered cluster from raft config. Exiting bootstrap process","time":"2024-10-24T17:27:44-04:00"}
{"action":"hnsw_prefill_cache_async","level":"inf

# Implement a Local RAG Pipeline

<figure>
  <img src="https://weaviate.io/assets/images/rag-ollama-diagram-c71ba5c4e60629e70a2cf334a7716860.png" alt="rag_ollama" width="700" />
  <figcaption>Local Retrieval Augmented Generation (RAG) system with language models via Ollama</figcaption>
</figure>
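
The three stages in the figure (retrieve, augment, generate) can be sketched as a single function. This is an illustrative outline, not the notebook's actual implementation: `answer_with_rag` and the `fake_*` stand-ins are hypothetical names, and the callables are injected so the sketch runs without Ollama or Weaviate.

```python
def answer_with_rag(prompt, embed, search, generate):
    """Run retrieve -> augment -> generate using caller-supplied callables."""
    query_vector = embed(prompt)      # Retrieve, part 1: embed the question
    context = search(query_vector)    # Retrieve, part 2: look up the closest text
    augmented = f"Using this data: {context}. Respond to this prompt: {prompt}"
    return generate(augmented)        # Generate from the augmented prompt


# Hypothetical stand-ins for the embedding model, vector database, and LLM
def fake_embed(text):
    return [float(len(text))]


def fake_search(vector):
    return "Llamas are members of the camelid family"


def fake_generate(text):
    return f"[reply to] {text}"


print(answer_with_rag("What animals are llamas related to?",
                      fake_embed, fake_search, fake_generate))
```

In this notebook, `embed` would wrap `ollama.embeddings`, `search` would wrap `collection.query.near_vector`, and `generate` would wrap `ollama.generate`.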



# Input Data

In [175]:


# List contains individual pieces of information (documents) related to llamas, 
# which may be used for processing or data storage tasks
documents = [
  "Llamas are members of the camelid family meaning they're pretty closely related to vicuñas and camels",
  "Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands",
  "Llamas can grow as much as 6 feet tall though the average llama is between 5 feet 6 inches and 5 feet 9 inches tall",
  "Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight",
  "Llamas are vegetarians and have very efficient digestive systems",
  "Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old",
]
#print(documents)


# Optional alternative: fetch the documents as a JSON file from a remote URL using
# the 'requests' library, check that the request succeeded, then parse and print the data.
#import requests
#import json
#url = 'https://raw.githubusercontent.com/bkaewell/rag-workflow/refs/heads/main/notebooks/generative-search/data/2024-10-23-llamas-json-example.json'
#response = requests.get(url)
#
# Check if the request was successful
#if response.status_code == 200:
#    data = json.loads(response.text)  # Parse the JSON payload
#    print(data)
#else:
#    print(f"Error: {response.status_code}")



# Create Data Structure

In [176]:


# Import specific classes from Weaviate to work with data schema and configs for vector database
import weaviate.classes as wvc
from weaviate.classes.config import Property, DataType

# Define the name of the structure (collection)
collection_name = "docs"

# Delete the collection if it already exists, so this cell can be re-run from a clean state
if client.collections.exists(collection_name):
    client.collections.delete(collection_name)

# Create a new collection with the specified name and define its structure properties
collection = client.collections.create(
    collection_name,
    properties=[
        Property(name="text", 
                 data_type=DataType.TEXT), # Name and data type of a single property for simple list of strings
    ],
)



{"action":"hnsw_prefill_cache_async","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-10-24T17:28:07-04:00","wait_for_cache_prefill":false}
{"level":"info","msg":"Created shard docs_Lr9jb2BShrtX in 7.910709ms","time":"2024-10-24T17:28:07-04:00"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-10-24T17:28:07-04:00","took":84208}


# Ingest Data

In [177]:


# Store each document in a vector embedding database
with collection.batch.dynamic() as batch:
  for i, d in enumerate(documents):
    # For each document, generate its vector embeddings
    response = ollama.embeddings(model="all-minilm", 
                                 prompt=d)
    embedding = response["embedding"]
    # Print text and its embedding
    # display({f'Document {i}': d, "Embedding": embedding}) 
    
    # Store data object with combined text and embedding in the vector embedding database
    batch.add_object(
        properties = {"text" : d},
        vector = embedding,
    )



In [178]:


# Fetch a single stored object (limit=1) to confirm the ingest worked, and include
# its vector in the response. Note: fetch_objects does not rank results by similarity.
collection.query.fetch_objects(limit=1, include_vector=True)



QueryReturn(objects=[Object(uuid=_WeaviateUUIDInt('03a34f8b-f227-4183-9b26-b6e27942e5b0'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'text': 'Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands'}, references=None, vector={'default': [0.17213347554206848, 0.16122983396053314, -0.1948625147342682, 0.317971795797348, -0.19478818774223328, -0.2074703574180603, -0.6278446316719055, -0.26288244128227234, -0.052119795233011246, 0.6996364593505859, 0.44047579169273376, -0.2558085322380066, 0.05621032789349556, 0.18558967113494873, -0.01409709919244051, 0.23181354999542236, -0.05545922368764877, -0.264326274394989, 0.20167632400989532, 0.4523124694824219, 0.19254468381404877, -0.10235374420881271, -0.02420978620648384, 0.17560499906539917, 0.08126483857631683, 0.10912062972784042, -0.18565529584884644, 

# Retrieve Context

In [179]:


# Define the prompt for which you want to find the most relevant document
prompt = "What animals are llamas related to?"

# Generate an embedding for the prompt using the specified model 'all-minilm'
response = ollama.embeddings(
  prompt=prompt,
  model="all-minilm"
)

# Query the collection to retrieve the MOST relevant document (limit=1) based on the prompt's embedding
results = collection.query.near_vector(near_vector=response["embedding"],
                             limit=1)

# Extract and display the text of the most relevant document
data = results.objects[0].properties['text']
print(data)



Llamas are members of the camelid family meaning they're pretty closely related to vicuñas and camels
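
Conceptually, a vector search ranks stored vectors by their distance to the query vector and returns the closest ones. The brute-force sketch below illustrates that idea in plain Python with made-up two-dimensional vectors. It assumes cosine distance as the metric; Weaviate itself uses an HNSW index (visible in the startup logs) to approximate this ranking efficiently, and `nearest` is a hypothetical helper, not a Weaviate function.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def nearest(query_vector, stored, limit=1):
    """Return the `limit` (text, vector) pairs closest to query_vector."""
    ranked = sorted(stored, key=lambda item: cosine_distance(query_vector, item[1]))
    return ranked[:limit]


# Hypothetical toy vectors, not produced by any embedding model
stored = [
    ("about family", [1.0, 0.1]),
    ("about diet", [0.1, 1.0]),
    ("about lifespan", [0.7, 0.7]),
]

for text, _ in nearest([0.9, 0.2], stored, limit=2):
    print(text)
```

Raising `limit` returns more candidates, at the cost of a longer augmented prompt later on.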


# Augment the Prompt

In [180]:


# Create a prompt template that combines the retrieved context (data) 
# with the original prompt to generate a comprehensive response
prompt_template = f"Using this data: {data}. Respond to this prompt: {prompt}"
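
If more than one document is retrieved (for example with `limit=3`), the same template idea extends to a list of passages. The helper below is a hypothetical sketch of one way to do that; the bullet layout and the fallback for an empty result are assumptions, not part of the original notebook.

```python
def build_prompt(question, contexts):
    """Combine one or more retrieved passages with the user's question."""
    if not contexts:
        # Nothing was retrieved: fall back to the bare question
        return question
    data = "\n".join(f"- {passage}" for passage in contexts)
    return f"Using this data:\n{data}\nRespond to this prompt: {question}"


print(build_prompt(
    "What animals are llamas related to?",
    ["Llamas are members of the camelid family", "Llamas are vegetarians"],
))
```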



# Generate a Response

In [181]:


# Generate a response from the augmented prompt template
output = ollama.generate(
  model="llama2",
  prompt=prompt_template,
)

# Print the generated response to the prompt
print(output['response'])

# Llama2 answers the prompt "What animals are llamas related to?" using the retrieved context stored in `data`




Llamas are related to several other animals within the camelid family, including:

1. Vicuñas: Vicuñas are small, wild relatives of llamas and alpacas. They are native to South America and are known for their soft, velvety fur.
2. Camels: As you mentioned, llamas are closely related to camels. Both llamas and camels belong to the Camelidae family and share many similarities in terms of their physical characteristics and behavior.
3. Alpacas: Alpacas are also members of the camelid family and are closely related to llamas. They are native to South America and are known for their soft, fleecy fur.

So, in summary, llamas are related to vicuñas, camels, and alpacas within the camelid family.


In [182]:
# Close the embedded Weaviate instance
# client.close()