# Simple Caching System with Pickle Files

This notebook demonstrates how to implement a simple caching system that uses pickle (.pkl) files to store both document data and vector indices. Caching avoids repeatedly waiting on the slow Notion API and reduces the cost of repeated API calls to embed documents.


## Prerequisites
- A Notion integration token (create one at https://www.notion.so/my-integrations)
- An OpenAI API key
- The target Notion page must be shared with your integration

## Setup and Installation

First, let's install the required packages. Run the cell below to install all dependencies needed for this notebook.

In [None]:
# Install required packages
!pip install llama-index llama-index-readers-notion llama-index-llms-openai python-dotenv

# Verify installations
import importlib

def check_package(package_name):
    try:
        importlib.import_module(package_name)
        return True
    except ImportError:
        return False

packages = {
    "llama_index": "llama-index core",
    "llama_index.readers.notion": "Notion reader",
    "llama_index.llms.openai": "OpenAI integration",
    "openai": "OpenAI API"
}

all_installed = True
for package, display_name in packages.items():
    installed = check_package(package)
    print(f"{display_name}: {'✅ Installed' if installed else '❌ Not installed'}")
    all_installed = all_installed and installed

if all_installed:
    print("\n✅ All required packages are installed!")
else:
    print("\n⚠️ Some packages are missing. Run the installation command again.")

## Environment Setup

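Load environment variables from a `.env` file. <br>
N.B. `load_dotenv()` searches for a `.env` file starting in the notebook's working directory and then walks up through the parent directories, loading the first one it finds.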

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Get API keys from environment variables
NOTION_INTEGRATION_TOKEN = os.getenv("NOTION_INTEGRATION_TOKEN")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Cache expiration settings (in seconds)
CACHE_DOCUMENTS_EXPIRATION = 60 * 60 * 24 * 7  # 1 week
CACHE_INDEX_EXPIRATION = 60 * 60 * 24 * 7  # 1 week
CACHE_QUERIES_EXPIRATION = 60 * 60  # 1 hour

# Set Notion page IDs (comma-separated string if there is more than one)
page_ids_str = "9917363395904835a604ca7a6a358579"  # replace with your Notion page ID(s)
# Convert the comma-separated string to a list, ignoring stray whitespace and empty entries
NOTION_PAGE_IDS = [page_id.strip() for page_id in page_ids_str.split(",") if page_id.strip()]

# Set environment variables for compatibility with libraries that expect them
os.environ["NOTION_INTEGRATION_TOKEN"] = NOTION_INTEGRATION_TOKEN or ""
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY or ""

# Verify API keys are set (report success only when both keys are present)
if not NOTION_INTEGRATION_TOKEN:
    print("⚠️ Warning: NOTION_INTEGRATION_TOKEN is not set in .env file")
if not OPENAI_API_KEY:
    print("⚠️ Warning: OPENAI_API_KEY is not set in .env file")
if NOTION_INTEGRATION_TOKEN and OPENAI_API_KEY:
    print("✅ API keys are set")
    print(f"✅ Using Notion page IDs: {NOTION_PAGE_IDS}")

## Import Required Libraries

Next, we'll import the necessary libraries and configure logging.

In [None]:
import logging
import sys
import openai

from IPython.display import Markdown, display

# Configure basic logging. basicConfig already attaches a stdout handler to the
# root logger, so adding a second StreamHandler would print every message twice.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

## Verify API Keys

Before proceeding, let's verify that our API keys are properly set.

In [None]:
# Set OpenAI API key
openai.api_key = OPENAI_API_KEY

if not NOTION_INTEGRATION_TOKEN:
    raise ValueError("No Notion integration token found. Please set NOTION_INTEGRATION_TOKEN in your .env file.")

if not OPENAI_API_KEY:
    raise ValueError("No OpenAI API key found. Please set OPENAI_API_KEY in your .env file.")

print("✅ API keys verified")

## Configure LLM

Set up the OpenAI language model.

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

# Initialize the language model
llm = OpenAI(model="gpt-4o-mini", api_key=OPENAI_API_KEY, temperature=0.0, max_tokens=1000)

# Set the LLM in the settings of llama_index
Settings.llm = llm

# Test the LLM
response = llm.complete("Hello, I am a language model. ")
print("LLM Test Response:", response.text)

## Basic Caching Implementation

Let's create a simple caching system for storing and retrieving data. We'll use pickle to serialize and deserialize Python objects to/from files.

In [None]:
import pickle
import time
import os

class SimpleCache:
    def __init__(self, cache_dir="cache", expiration_seconds=CACHE_DOCUMENTS_EXPIRATION):
        """Initialize a simple cache with specified directory and expiration time."""
        self.cache_dir = cache_dir
        self.expiration_seconds = expiration_seconds
        os.makedirs(cache_dir, exist_ok=True)
        
    def get(self, key, data_loader=None):
        """Get data from cache or load it using data_loader if expired or missing."""
        cache_file = os.path.join(self.cache_dir, f"{key}.pkl")
        
        # Check if cache file exists and is not expired
        if os.path.exists(cache_file):
            file_age = time.time() - os.path.getmtime(cache_file)
            if file_age < self.expiration_seconds:
                print(f"Loading {key} from cache...")
                with open(cache_file, "rb") as f:
                    return pickle.load(f)
            else:
                print(f"Cache for {key} expired ({file_age:.0f} seconds old)")
        else:
            print(f"No cache found for {key}")
            
        # If we reached here, we need to load fresh data
        if data_loader is None:
            return None
            
        print(f"Fetching fresh data for {key}...")
        data = data_loader()
        self.set(key, data)
        return data
    
    def set(self, key, data):
        """Save data to cache."""
        cache_file = os.path.join(self.cache_dir, f"{key}.pkl")
        print(f"Saving data to {cache_file}")
        with open(cache_file, "wb") as f:
            pickle.dump(data, f)
        return True
    
    def invalidate(self, key):
        """Remove an item from the cache."""
        cache_file = os.path.join(self.cache_dir, f"{key}.pkl")
        if os.path.exists(cache_file):
            os.remove(cache_file)
            print(f"Removed {key} from cache")
            return True
        return False
    
    def clear(self):
        """Clear all items from the cache."""
        for filename in os.listdir(self.cache_dir):
            if filename.endswith(".pkl"):
                os.remove(os.path.join(self.cache_dir, filename))
        print("Cache cleared")

# Create a cache instance
cache = SimpleCache(cache_dir="simple_cache")
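
The expiration check in `SimpleCache.get` relies on the cache file's modification time rather than on anything stored inside the pickle. The sketch below isolates that idea using only the standard library. The helper names `is_fresh` and `demo_expiry` are hypothetical and not part of the notebook; the demo backdates a temporary file with `os.utime` to simulate an old cache entry.

```python
import os
import pickle
import tempfile
import time


def is_fresh(path, max_age_seconds):
    """Return True if the file exists and was modified less than max_age_seconds ago."""
    if not os.path.exists(path):
        return False
    return (time.time() - os.path.getmtime(path)) < max_age_seconds


def demo_expiry():
    """Write a pickle, check freshness, backdate the file, and check again."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.pkl")
        with open(path, "wb") as f:
            pickle.dump({"value": 42}, f)

        fresh_now = is_fresh(path, max_age_seconds=60)

        # Pretend the file was written two minutes ago
        two_minutes_ago = time.time() - 120
        os.utime(path, (two_minutes_ago, two_minutes_ago))
        fresh_after_backdating = is_fresh(path, max_age_seconds=60)

        # The payload is still readable; only its age has changed
        with open(path, "rb") as f:
            loaded = pickle.load(f)

    return fresh_now, fresh_after_backdating, loaded


print(demo_expiry())
```

One consequence of this design is that anything that touches the file's modification time (for example, copying the cache directory) also resets the entry's apparent age.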

## Test Basic Cache Functions

Let's test our caching system with some basic data.

In [None]:
import random
import time

# Define a function that simulates expensive computation
def expensive_computation():
    print("Performing expensive computation...")
    time.sleep(2)  # Simulate work
    return {
        "result": random.randint(1, 100),
        "timestamp": time.time()
    }

# First call - should compute
result1 = cache.get("test_data", expensive_computation)
print(f"Result 1: {result1}\n")

# Second call - should use cache
result2 = cache.get("test_data", expensive_computation)
print(f"Result 2: {result2}\n")

# Invalidate and try again - should compute
cache.invalidate("test_data")
result3 = cache.get("test_data", expensive_computation)
print(f"Result 3: {result3}\n")

# Check that results match expectations
print(f"Result 1 and 2 are identical: {result1 == result2}")
print(f"Result 2 and 3 are different: {result2 != result3}")

## Caching Notion Data

Now let's apply our caching system to Notion data retrieval. This is especially useful since Notion API calls can be slow and rate-limited.

In [None]:
from llama_index.readers.notion import NotionPageReader

# Function to load data from Notion
def load_notion_data():
    # Use the page IDs configured in the Environment Setup cell
    page_ids = NOTION_PAGE_IDS
    
    print(f"Fetching data from Notion API for pages: {page_ids}")
    documents = NotionPageReader(integration_token=NOTION_INTEGRATION_TOKEN).load_data(
        page_ids=page_ids
    )
    print(f"Fetched {len(documents)} documents from Notion")
    return documents

# Try to get Notion data from cache or load it
try:
    # Use the cache expiration configured in the Environment Setup cell
    notion_cache = SimpleCache(cache_dir="notion_cache", expiration_seconds=CACHE_DOCUMENTS_EXPIRATION)
    documents = notion_cache.get("notion_docs", load_notion_data)
    
    if documents:
        print(f"\nSuccessfully loaded {len(documents)} documents")
        # Display brief information about the documents
        for i, doc in enumerate(documents):
            print(f"Document {i+1} - Title: {doc.metadata.get('title', 'Untitled')}")
            print(f"  - First 100 chars: {doc.text[:100]}...")
except Exception as e:
    print(f"Error loading Notion data: {e}")
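
Because the documents are stored under the fixed key `"notion_docs"`, changing the list of page IDs would still return the previously cached documents until the entry expires. One possible refinement, sketched below under the assumption that a short digest of the page IDs is an acceptable file name, is to derive the key from the IDs themselves. The function name `cache_key_for_pages` is hypothetical.

```python
import hashlib


def cache_key_for_pages(page_ids, prefix="notion_docs"):
    """Build a filename-safe cache key that changes when the set of page IDs changes."""
    # Sort and strip so that ordering and stray whitespace do not affect the key
    normalized = sorted(pid.strip() for pid in page_ids if pid.strip())
    digest = hashlib.sha256(",".join(normalized).encode("utf-8")).hexdigest()[:12]
    return f"{prefix}_{digest}"


key_a = cache_key_for_pages(["page-1", "page-2"])
key_b = cache_key_for_pages(["page-2", " page-1 "])
key_c = cache_key_for_pages(["page-1", "page-3"])
print(key_a == key_b, key_a == key_c)
```

In the cell above, such a key could be passed in place of the literal string, for example `notion_cache.get(cache_key_for_pages(NOTION_PAGE_IDS), load_notion_data)`.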

## Caching Vector Indices

Vector indices can be computationally expensive to create. Let's enhance our approach to cache not just the documents but also the vector index.

In [None]:
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Create text splitter
text_splitter = SentenceSplitter(chunk_size=1000, chunk_overlap=200)

# Function to create vector index from documents
def create_vector_index(documents):
    print("Creating vector index...")
    return VectorStoreIndex.from_documents(
        documents,
        transformations=[text_splitter]
    )

# Get documents (from cache if available)
try:
    documents = notion_cache.get("notion_docs", load_notion_data)
    
    # Create a vector index cache with configurable expiration
    index_cache = SimpleCache(cache_dir="vector_cache", expiration_seconds=CACHE_INDEX_EXPIRATION)
    
    # Get or create index
    index = index_cache.get("notion_index", lambda: create_vector_index(documents))
    
    print("Vector index ready for querying")
except Exception as e:
    print(f"Error setting up vector index: {e}")
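
The index cache and the document cache expire independently, so a cached index can outlive the documents it was built from. A minimal sketch of one way to tie them together is to fingerprint the document texts and include that fingerprint in the index cache key. The helper `documents_fingerprint` is hypothetical, and the sketch assumes the text alone (not the metadata) determines whether the index must be rebuilt.

```python
import hashlib


def documents_fingerprint(texts):
    """Return a short, deterministic digest of an ordered collection of texts."""
    hasher = hashlib.sha256()
    for text in texts:
        data = text.encode("utf-8")
        # Prefix each text with its length so ["ab", "c"] differs from ["a", "bc"]
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)
    return hasher.hexdigest()[:16]


same = documents_fingerprint(["alpha", "beta"]) == documents_fingerprint(["alpha", "beta"])
changed = documents_fingerprint(["alpha", "beta"]) != documents_fingerprint(["alpha", "beta!"])
print(same, changed)
```

With the notebook's objects, the key could be built as `f"notion_index_{documents_fingerprint(doc.text for doc in documents)}"`. Entries for older fingerprints would stay on disk until the cache is cleared.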

## Query the Cached Index

Now let's use our cached vector index to query the documents.

In [None]:
# Set up a query engine with our cached index
query_engine = index.as_query_engine(
    similarity_top_k=2  # Return top 2 matching chunks
)

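import hashlib

# Define a query function with caching
def cached_query(query_text):
    # Use a separate cache for queries; entries expire after CACHE_QUERIES_EXPIRATION (1 hour)
    query_cache = SimpleCache(cache_dir="query_cache", expiration_seconds=CACHE_QUERIES_EXPIRATION)
    # Build a key that is stable across kernel restarts. The built-in hash() is salted
    # per process for strings, so it would yield a different cache file name on each run.
    cache_key = f"query_{hashlib.sha256(query_text.encode('utf-8')).hexdigest()[:16]}"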
    
    # Define a function to execute the query if not cached
    def execute_query():
        print(f"Executing query: '{query_text}'")
        return query_engine.query(query_text)
    
    # Get from cache
    result = query_cache.get(cache_key, execute_query)
    
    if result is not None:
        print("Query successful!")
        return result
    else:
        print("No result found for this query.")
        return None
    
# Example query
query_text = "What is the purpose of this document?"

# Call the cached query function
result = cached_query(query_text)
if result:
    print(f"Query result: {result}")
    
    
query_text = "Summarize the main points of this document."

# Call the cached query function
result = cached_query(query_text)
if result:
    print(f"Query result: {result}")
    

# Clean up caches if needed
# cache.clear()
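
For a file-based query cache to be hit again after a kernel restart, the key must be the same in every Python process. String hashes from the built-in `hash()` are salted per process by default, whereas a digest from `hashlib` is not. The sketch below builds such a stable key and, as an optional refinement, normalizes case and whitespace so that trivially different spellings of a query share one entry. The function name `query_cache_key` is hypothetical, and whether case-insensitive matching is desirable depends on the application.

```python
import hashlib


def query_cache_key(query_text):
    """Return a cache key that is stable across processes and ignores case and extra whitespace."""
    # Lowercase and collapse runs of whitespace into single spaces
    normalized = " ".join(query_text.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"query_{digest}"


key_1 = query_cache_key("What is the purpose of this document?")
key_2 = query_cache_key("  what is the  purpose of this document? ")
key_3 = query_cache_key("Summarize the main points of this document.")
print(key_1 == key_2, key_1 == key_3)
```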


## Conclusion

In this notebook, we demonstrated how to:
1. Create a simple caching system using pickle files.
2. Cache Notion data to avoid repeated API calls.
3. Cache vector indices to speed up document retrieval and reduce computational costs.
4. Query the cached index to retrieve relevant documents.

This approach can significantly improve the performance of applications that rely on external APIs and large datasets. By caching data locally, we can reduce latency and API costs while maintaining the flexibility of our application.