This solution accelerator notebook is available at [Databricks Industry Solutions](https://github.com/databricks-industry-solutions/semantic-caching).

# Cache eviction

This notebook walks you through some of the eviction strategies you can apply to your semantic cache.

In [0]:
%pip install -r requirements.txt --quiet
dbutils.library.restartPython()

In [0]:
from config import Config
config = Config()

In [0]:
import os

HOST = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

os.environ['DATABRICKS_HOST'] = HOST
os.environ['DATABRICKS_TOKEN'] = TOKEN

## Cleaning up the cache

We instantiate a Vector Search client to interact with a Vector Search endpoint.

In [0]:
from databricks.vector_search.client import VectorSearchClient
from cache import Cache

vsc = VectorSearchClient(
    workspace_url=HOST,
    personal_access_token=TOKEN,
    disable_notice=True,
    )

semantic_cache = Cache(vsc, config)

## FIFO (First-In-First-Out) Strategy

**FIFO** (First-In-First-Out) removes the oldest cached items first. In a **semantic caching** context for **LLM responses**, it is useful when:
- **Frequently changing queries**: If queries or questions tend to change over time, older answers quickly become irrelevant.
- **Use Case**: Effective in scenarios where users query fast-changing topics (e.g., breaking news or real-time data).

#### Pros:
- Simple to implement.
- Automatically removes the oldest responses, which are the most likely to be outdated or stale.

#### Cons:
- Does not account for query popularity. Frequently asked questions might be evicted even if they are still relevant.
- Not ideal for handling frequently recurring queries, as important cached answers could be removed.


In [0]:
semantic_cache.evict(strategy='FIFO', max_documents=4, batch_size=4)
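
The eviction logic itself lives in the `Cache` class and is not shown in this notebook. As a rough illustration of the FIFO idea only, the sketch below works on a plain in-memory list rather than a Vector Search index. The `CacheEntry` type, its `created_at` field, and the `evict_fifo` helper are hypothetical names chosen for this example and are not part of the accelerator's API.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    # Hypothetical in-memory stand-in for a cached question/answer row.
    id: int
    question: str
    created_at: float  # insertion timestamp in seconds


def evict_fifo(entries, max_documents, batch_size):
    """Return (kept, evicted) after removing the oldest entries in batches
    until at most max_documents remain."""
    kept = sorted(entries, key=lambda e: e.created_at)  # oldest first
    evicted = []
    while len(kept) > max_documents:
        # Never remove more than the current overflow in one batch.
        n = min(batch_size, len(kept) - max_documents)
        evicted.extend(kept[:n])
        kept = kept[n:]
    return kept, evicted


entries = [CacheEntry(i, f"question {i}", created_at=float(i)) for i in range(10)]
kept, evicted = evict_fifo(entries, max_documents=4, batch_size=4)
print([e.id for e in evicted])  # ids of the six oldest entries
print([e.id for e in kept])     # ids of the four newest entries
```

With ten entries and `max_documents=4`, the loop runs twice: one full batch of four, then a partial batch of two, so the batch size only controls how the overflow is split across iterations.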

## LRU (Least Recently Used) Strategy

**LRU** (Least Recently Used) evicts items that haven't been accessed recently. This strategy works well in **semantic caching** for **LLM responses** when:
- **Popular or recurring questions**: Frequently asked questions (FAQs) remain in the cache while infrequent or one-off queries are evicted.
- **Use Case**: Best suited for systems handling recurring queries, such as customer support, FAQ systems, or educational queries where the same questions are asked repeatedly.

#### Pros:
- Keeps recently accessed answers in the cache, which in practice tends to retain frequently asked questions.
- Minimizes re-computation for common queries.

#### Cons:
- Higher overhead compared to FIFO, as it tracks access patterns.
- May retain less relevant but frequently accessed responses, while important but less commonly asked answers could be evicted.


In [0]:
semantic_cache.evict(strategy='LRU', max_documents=49)
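
To make the difference from FIFO concrete, here is a minimal in-memory sketch of LRU behaviour built on Python's `collections.OrderedDict`. It is an illustration under simplifying assumptions (exact-match keys, no similarity search), and `LRUCacheSketch` is a hypothetical class, not the accelerator's `Cache`.

```python
from collections import OrderedDict


class LRUCacheSketch:
    """Hypothetical in-memory LRU cache; not the accelerator's Cache class."""

    def __init__(self, max_documents):
        self.max_documents = max_documents
        self._store = OrderedDict()  # least recently used entry comes first

    def get(self, question):
        if question not in self._store:
            return None
        # A cache hit marks the entry as most recently used.
        self._store.move_to_end(question)
        return self._store[question]

    def put(self, question, answer):
        self._store[question] = answer
        self._store.move_to_end(question)
        while len(self._store) > self.max_documents:
            # Drop the least recently used entry.
            self._store.popitem(last=False)

    def keys(self):
        return list(self._store)


lru = LRUCacheSketch(max_documents=2)
lru.put("q1", "a1")
lru.put("q2", "a2")
lru.get("q1")        # q1 becomes the most recently used entry
lru.put("q3", "a3")  # the cache is over capacity, so q2 is evicted
print(lru.keys())
```

Under FIFO, `q1` would have been evicted because it was inserted first. Here the read of `q1` protects it, and `q2` is removed instead.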

### **Limitations:**

- **Sequential Batch Eviction:** Both FIFO and LRU rely on batch eviction that involves querying and removing documents iteratively. This sequential process could slow down as the number of documents increases.
- **Full Cache Query:** The current implementation of `_evict_fifo` and `_evict_lru` fetches a batch of documents on each iteration, which requires a similarity search query each time. This may introduce latency for larger caches.
- **Single-threaded Eviction:** The eviction process operates in a single thread, and as the number of documents grows, the time taken to query and delete entries will increase.

### **Potential Improvements:**

- **Bulk Deletion:**
   - Instead of deleting documents in small batches (based on `batch_size`), consider implementing bulk deletion by gathering all the documents to be evicted in a single query and deleting them all at once.
- **Parallelism/Concurrency:**
   - Use parallel or multi-threaded processing (for example, with Spark) to speed up both the similarity search and the deletion steps.
   - Asynchronous operations would allow multiple batches to be processed concurrently, reducing overall eviction time.
- **Optimize Batch Size:**
   - Fine-tune `batch_size` dynamically based on the current system load or cache size. Larger batches reduce the number of queries but consume more memory, so the value needs tuning.
- **Index Partitioning:**
   - If possible, partition the index based on time (for FIFO) or access time (for LRU). This would allow the search and eviction process to be more efficient, as it would target a specific partition instead of querying the entire cache.
- **Cache Usage Statistics:**
   - Integrate a system to track the real-time size of the cache and update `indexed_row_count` without querying the entire cache each time. This would reduce the number of times you need to check the total cache size during eviction.
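
As one possible shape for the bulk-deletion and concurrency ideas above, the sketch below collects every id to evict up front, splits the ids into batches, and deletes the batches concurrently. It uses Python's standard `concurrent.futures` thread pool rather than Spark, and it runs against an in-memory set instead of a Vector Search index. The `delete_batch` callable and both helper functions are hypothetical; a real implementation would replace `delete_batch` with whatever deletion call the index exposes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def chunked(ids, batch_size):
    """Split a list of ids into consecutive batches of at most batch_size."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


def delete_concurrently(ids_to_evict, delete_batch, batch_size=100, max_workers=4):
    """Delete all ids by submitting batches to a thread pool.

    delete_batch is a hypothetical callable that removes one batch of ids
    from the backing store and returns how many ids it removed.
    """
    batches = chunked(ids_to_evict, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Exceptions raised in a worker surface here when results are consumed.
        return sum(pool.map(delete_batch, batches))


# In-memory stand-in for the cache index, guarded by a lock.
store = set(range(1000))
store_lock = threading.Lock()


def delete_batch(batch):
    with store_lock:
        store.difference_update(batch)
    return len(batch)


removed = delete_concurrently(list(range(250)), delete_batch, batch_size=100)
print(removed, len(store))
```

Whether concurrency helps in practice depends on how the backing index handles parallel delete requests, so any gain should be measured rather than assumed.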


© 2024 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License.