In [1]:
!pip install openai numpy



# Setting up the dependencies

In [2]:
import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

Enter OpenAI API Key: ··········


In [3]:
from openai import OpenAI
client = OpenAI()

# Running Repeated Queries Without Caching
In this section, we run the same query 10 times directly through the GPT-4.1 model to observe how long it takes when no caching mechanism is applied. Each call triggers a full LLM computation and response generation, leading to repetitive processing for identical inputs.

This helps establish a baseline for total time and cost before we implement semantic caching in the next part.

In [18]:
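import time

def ask_gpt(query):
    """Send the query to GPT-4.1 and return (response text, elapsed seconds)."""
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    response = client.responses.create(
        model="gpt-4.1",
        input=query
    )
    elapsed = time.perf_counter() - start
    # output_text aggregates the text output, so we do not rely on output[0] being a message
    return response.output_text, elapsed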

In [19]:
query = "Explain the concept of semantic caching in just 2 lines."
total_time = 0

for i in range(10):
    _, duration = ask_gpt(query)
    total_time += duration
    print(f"Run {i+1} took {duration:.2f} seconds")

print(f"\nTotal time for 10 runs: {total_time:.2f} seconds")

Run 1 took 3.10 seconds
Run 2 took 1.36 seconds
Run 3 took 1.36 seconds
Run 4 took 3.03 seconds
Run 5 took 1.27 seconds
Run 6 took 3.45 seconds
Run 7 took 1.60 seconds
Run 8 took 1.60 seconds
Run 9 took 1.88 seconds
Run 10 took 3.21 seconds

Total time for 10 runs: 21.87 seconds


Even though the query never changes, every call still takes roughly 1.3 to 3.5 seconds, for a total of about 22 seconds across 10 runs. This inefficiency is why semantic caching can be so valuable: it lets us reuse previous responses for identical or semantically similar queries, saving both time and API cost.
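To put numbers on this baseline, the printed durations can be summarized in a few lines. This is only a sketch: the `durations` list is copied from the rounded output above, so its sum can differ by a hundredth of a second from the printed total, which was accumulated from unrounded values.

```python
from statistics import mean

# Durations (seconds) copied from the printed output of the 10 uncached runs.
durations = [3.10, 1.36, 1.36, 3.03, 1.27, 3.45, 1.60, 1.60, 1.88, 3.21]

total = sum(durations)
print(f"Total: {total:.2f} s")
print(f"Mean per call: {mean(durations):.2f} s")
print(f"Fastest: {min(durations):.2f} s, slowest: {max(durations):.2f} s")
```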

# Implementing Semantic Caching for Faster Responses
In this section, we enhance the previous setup by introducing semantic caching, which allows our application to reuse responses for semantically similar queries instead of repeatedly calling the GPT-4.1 API.

Here's how it works: each incoming query is converted into a vector embedding using the text-embedding-3-small model. This embedding captures the semantic meaning of the text. When a new query arrives, we calculate the cosine similarity between its embedding and each embedding already stored in the cache. If a cached entry scores above the defined threshold (e.g., 0.85), the system returns that entry's response immediately, avoiding another call to GPT-4.1 (the embedding request is still made).
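The similarity check can be tried offline with made-up vectors. The three-dimensional vectors and the `is_cache_hit` helper below are illustrative assumptions, not real embeddings (actual embedding vectors have far more dimensions), and the sketch uses only the standard library so that it runs without an API key.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(query_vec, cached_vec, threshold=0.85):
    """Hypothetical helper: True when the similarity exceeds the threshold."""
    return cosine_similarity(query_vec, cached_vec) > threshold

cached = [1.0, 0.0, 0.0]
close = [0.9, 0.1, 0.0]   # nearly the same direction as `cached`
far = [0.0, 1.0, 0.0]     # orthogonal to `cached`

print(f"close: {cosine_similarity(cached, close):.3f}, hit={is_cache_hit(close, cached)}")
print(f"far:   {cosine_similarity(cached, far):.3f}, hit={is_cache_hit(far, cached)}")
```

Cosine similarity ignores vector length and compares direction only, which is why a rephrased query can still score close to 1.0 against a cached one.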

If no sufficiently similar query exists in the cache, the model generates a fresh response, which is then stored along with its embedding for future use. Over time, this approach dramatically reduces both response time and API costs, especially for frequently asked or rephrased queries.
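The miss-then-store flow can be sketched without any API access. Everything here is a stand-in: `fake_embed` turns keyword counts into a toy vector, `fake_generate` replaces the GPT-4.1 call, and `lookup_or_generate` is a hypothetical name. This variant returns the best-scoring cached entry rather than the first one above the threshold, which is one possible design choice and an assumption of this sketch.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

KEYWORDS = ["semantic", "caching", "weather"]  # toy vocabulary (assumption)

def fake_embed(text):
    """Stand-in for an embedding model: keyword counts plus a small
    constant so the vector is never all zeros."""
    words = text.lower().replace("?", "").replace(".", "").split()
    return [words.count(k) + 0.01 for k in KEYWORDS]

calls = []  # records every simulated model call

def fake_generate(query):
    """Stand-in for the LLM call."""
    calls.append(query)
    return f"answer to: {query}"

cache = []  # list of (embedding, response) pairs

def lookup_or_generate(query, threshold=0.85):
    emb = fake_embed(query)
    # Pick the most similar cached entry, if the cache is not empty.
    best = max(cache, key=lambda entry: cosine_similarity(emb, entry[0]), default=None)
    if best is not None and cosine_similarity(emb, best[0]) > threshold:
        return best[1], True           # cache hit: reuse the stored response
    response = fake_generate(query)    # cache miss: generate a fresh response
    cache.append((emb, response))      # store it for future lookups
    return response, False

first = lookup_or_generate("Explain semantic caching.")
second = lookup_or_generate("What is semantic caching?")
third = lookup_or_generate("What is the weather today?")
print(first, second, third, sep="\n")
print(f"Model calls made: {len(calls)}")
```

The second query reuses the first answer because both map to the same toy vector, while the unrelated third query falls below the threshold and triggers a new simulated call.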

In [20]:
import numpy as np
from numpy.linalg import norm
semantic_cache = []

def get_embedding(text):
    emb = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(emb.data[0].embedding)

In [23]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

def ask_gpt_with_cache(query, threshold=0.85):
    query_embedding = get_embedding(query)

    # Check similarity with existing cache
    for cached_query, cached_emb, cached_resp in semantic_cache:
        sim = cosine_similarity(query_embedding, cached_emb)
        if sim > threshold:
            print(f"🔁 Using cached response (similarity: {sim:.2f})")
            return cached_resp, 0.0  # GPT call skipped; embedding time is not measured

    # Otherwise, call GPT
    start = time.time()
    response = client.responses.create(
        model="gpt-4.1",
        input=query
    )
    end = time.time()
    text = response.output[0].content[0].text

    # Store in cache
    semantic_cache.append((query, query_embedding, text))
    return text, end - start

In [25]:
queries = [
    "Explain semantic caching in simple terms.",
    "What is semantic caching and how does it work?",
    "How does caching work in LLMs?",
    "Tell me about semantic caching for LLMs.",
    "Explain semantic caching simply.",
]

In [26]:
total_time = 0
for q in queries:
    resp, t = ask_gpt_with_cache(q)
    total_time += t
    print(f"⏱️ Query took {t:.2f} seconds\n")

print(f"\nTotal time with caching: {total_time:.2f} seconds")

⏱️ Query took 8.28 seconds

🔁 Using cached response (similarity: 0.86)
⏱️ Query took 0.00 seconds

⏱️ Query took 15.84 seconds

⏱️ Query took 11.77 seconds

🔁 Using cached response (similarity: 0.97)
⏱️ Query took 0.00 seconds


Total time with caching: 35.89 seconds


In the output, the first query took around 8 seconds because the cache was empty and the model had to generate a fresh response. When a similar question was asked next, the system found a high semantic similarity (0.86) and reused the cached answer instead of calling the model again. Two queries, "How does caching work in LLMs?" and "Tell me about semantic caching for LLMs," were sufficiently different, so the model generated new responses, each taking over 10 seconds. The final query was nearly identical to the first one (similarity 0.97) and was served from the cache. Note that the 0.00 seconds reported for cache hits is the fixed value returned by ask_gpt_with_cache; the time spent on the embedding request is not measured for any query.
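Because cache hits are reported as 0.00 seconds by construction, one way to see the real end-to-end latency is to time the whole call from the outside. The `timed` helper below is a hypothetical addition, shown with a stand-in function (`slow_stub`) so that it runs without an API key.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, wall-clock seconds for the whole call)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def slow_stub(query):
    """Stand-in for a cache hit that still spends time on an embedding request."""
    time.sleep(0.05)
    return f"cached answer for: {query}", 0.0  # mimics the (text, 0.0) shape of a cache hit

(text, reported), wall = timed(slow_stub, "Explain semantic caching simply.")
print(f"reported: {reported:.2f} s, measured: {wall:.2f} s")
```

In the notebook, the same pattern would be `(resp, t), wall = timed(ask_gpt_with_cache, q)`, where `wall` includes the embedding request and the cache scan in addition to any GPT-4.1 call.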