## Introduction {#introduction}
Have you ever searched for something online and been frustrated when the search engine couldn't find what you were looking for, even though you knew it existed? Traditional search relies on exact word matching, so searching for "Denim for men" won't return products labeled "Men's Jeans 32W" or "Men's Trendy Jeans."
Semantic search solves this by understanding meaning and context. It recognizes that "denim" relates to "jeans" and "men" connects to "32W" in clothing contexts, returning relevant results despite different wording.
In this tutorial, you'll build a complete semantic search pipeline in Python: Ollama generates embeddings locally, Pinecone stores and searches them, and LangChain orchestrates the steps.
By the end, you'll have a working system that embeds documents on your own machine and stores them in a scalable vector database for fast similarity search.
## What Is Pinecone? {#what-is-pinecone}
[**Pinecone**](https://github.com/pinecone-io) is a vector database designed specifically for storing and searching high-dimensional data. But what does that mean for data scientists?
Think of traditional databases like spreadsheets: they store text, numbers, and dates in rows and columns. Vector databases like Pinecone store mathematical representations of data called vectors (arrays of numbers that capture meaning).
Here's a simple comparison:
**Traditional Database:**
| ID | Product Name    | Category |
|----|-----------------|----------|
| 1  | Men's Jeans 32W | Clothing |
| 2  | Denim Pants     | Apparel  |
**Vector Database:**
| ID | Product Vector                    | 
|----|-----------------------------------|
| 1  | [0.2, 0.8, 0.1, 0.9, ...]       |
| 2  | [0.3, 0.7, 0.2, 0.8, ...]       |
By storing data as vectors, Pinecone enables semantic understanding beyond what traditional databases can achieve. Similar products have similar vector representations, allowing Pinecone to find relevant results based on meaning rather than exact keyword matches.

In [None]:
#| echo: false
import matplotlib.pyplot as plt
from utils import apply_codecut_style

# Create side-by-side comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Left plot: Similarity scores (easier to understand)
products = ["Men's Jeans", "Denim Pants", "Leather Jacket"]
query = "denim for men"
similarity_scores = [0.92, 0.89, 0.45]
colors = ["#E583B6", "#72BEFA", "#72FCDB"]

bars = ax1.barh(products, similarity_scores, color=colors, alpha=0.8)
ax1.set_xlabel("Similarity Score", fontsize=12, color="white")
ax1.set_title(f'Query: "{query}"', fontsize=13, color="white")
ax1.set_xlim(0, 1.0)

# Add score labels on bars
for bar, score in zip(bars, similarity_scores, strict=False):
    ax1.text(bar.get_width() + 0.02, bar.get_y() + bar.get_height()/2, 
             f"{score:.2f}", va="center", fontsize=11, color="white")

# Right plot: Simple 2D vector space (same products as left plot)
vectors = {
    "Men's Jeans": [0.8, 0.7],
    "Denim Pants": [0.75, 0.72], 
    "Leather Jacket": [0.6, 0.4]
}

for i, (product, pos) in enumerate(vectors.items()):
    ax2.scatter(pos[0], pos[1], c=colors[i], s=200, 
               edgecolors="white", linewidth=2, alpha=0.8)

    # Smart label positioning to avoid overlaps
    if product == "Denim Pants":
        offset_x, offset_y, ha = -0.08, -0.03, "right"  # Closer to blue dot
    elif product == "Leather Jacket":
        offset_x, offset_y, ha = 0.05, -0.05, "left"
    else:
        offset_x, offset_y, ha = 0.05, 0.05, "left"

    ax2.text(pos[0] + offset_x, pos[1] + offset_y, product, 
             fontsize=11, color="white", ha=ha)

# Draw similarity grouping for denim products
circle = plt.Circle((0.775, 0.71), 0.08, fill=False, 
                   linestyle="--", color="white", alpha=0.7, linewidth=2)
ax2.add_patch(circle)
ax2.text(0.85, 0.78, "Similar\nProducts", fontsize=10, color="white", 
         ha="center", style="italic")

ax2.set_xlim(0.2, 0.95)
ax2.set_ylim(0.15, 0.85)
ax2.set_xlabel("Semantic Dimension 1", fontsize=12, color="white")
ax2.set_ylabel("Semantic Dimension 2", fontsize=12, color="white")
ax2.set_title("Vector Space: Similar Items Cluster Together", fontsize=13, color="white")

# Apply styling to both plots
for ax in [ax1, ax2]:
    apply_codecut_style(ax)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

![](https://codecut.ai/wp-content/uploads/2025/06/output-1.png)

The visualization shows semantic search in action. In the right-hand plot, the distance between dots represents similarity: the dashed circle shows how the two denim products (pink and blue dots) cluster close together, while the leather jacket (green dot) sits farther away, indicating lower semantic similarity.
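
To make "closer means more similar" concrete, here is a minimal sketch that measures the distance between the same illustrative 2D points used in the plot. The coordinates are made up for the figure, not real embeddings, and real embedding models produce hundreds or thousands of dimensions.

```python
import math

# Illustrative 2D coordinates taken from the plot above (not real embeddings)
vectors = {
    "Men's Jeans": [0.8, 0.7],
    "Denim Pants": [0.75, 0.72],
    "Leather Jacket": [0.6, 0.4],
}


def distance(name_a, name_b):
    """Euclidean distance between two named points; smaller means more similar."""
    return math.dist(vectors[name_a], vectors[name_b])


jeans_to_denim = distance("Men's Jeans", "Denim Pants")
jeans_to_jacket = distance("Men's Jeans", "Leather Jacket")

print(f"Jeans <-> Denim Pants:    {jeans_to_denim:.3f}")
print(f"Jeans <-> Leather Jacket: {jeans_to_jacket:.3f}")
```

The denim pair comes out several times closer than the jeans and jacket pair, which is the grouping the dashed circle highlights.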

Pinecone stands out among vector databases with its cloud-native design that eliminates infrastructure management:

- **Serverless architecture**: No need to manage clusters or configure hardware, unlike self-hosted deployments of Weaviate or Qdrant
- **Automatic scaling**: Handles millions of vectors without manual intervention, unlike pgvector, which requires PostgreSQL tuning
- **Built-in monitoring**: Provides performance metrics and alerts out of the box, reducing operational overhead
- **Optimized indexing**: Uses proprietary indexing algorithms designed for low-latency search across very large datasets

## What Are Ollama and LangChain? {#what-are-ollama-and-langchain}

[**Ollama**](https://github.com/ollama/ollama) runs open models locally for free, so your data never leaves your machine, unlike paid cloud APIs such as those from OpenAI or Anthropic.

[**LangChain**](https://github.com/langchain-ai/langchain) serves as the orchestration framework for building LLM applications efficiently. For comprehensive LangChain fundamentals, see our [LangChain and Ollama guide](https://codecut.ai/private-ai-workflows-langchain-ollama/).

## Overview of the Architecture {#overview-of-the-architecture}

The architecture follows a straightforward design:

- **Ollama**: Generates embeddings locally
- **Pinecone**: Stores vectors and performs similarity search
- **LangChain**: Orchestrates the entire pipeline

![](https://codecut.ai/wp-content/uploads/2025/06/diagram-export-6-22-2025-12_38_51-PM-1.png)

## Step-by-Step Implementation {#step-by-step-implementation}

### Ollama Setup {#ollama-setup}

To set up Ollama locally, download and install it first. For Linux users, run the installation script:

```bash
# For Linux users
curl -fsSL https://ollama.com/install.sh | sh
```

Once installation completes, start the Ollama server locally.

```bash
ollama serve
```

Next, download a model specifically designed for generating embeddings. We'll use `mxbai-embed-large`, which is optimized for semantic search tasks.

```bash
ollama pull mxbai-embed-large
```

With Ollama installed and the embedding model downloaded, we're ready to start coding.
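
Before moving on, it can help to confirm that the server is reachable and the model was pulled. The sketch below assumes Ollama is listening on its default address (`http://localhost:11434`) and queries its `/api/tags` endpoint, which lists local models. The helper name `list_ollama_models` is hypothetical.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Server not running, wrong address, or an unexpected response body
        return None
    return [model["name"] for model in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama server not reachable. Is `ollama serve` running?")
elif not any(name.startswith("mxbai-embed-large") for name in models):
    print("Server is up, but mxbai-embed-large has not been pulled yet.")
else:
    print("Ollama is ready with mxbai-embed-large.")
```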

### Loading Text Data {#loading-text-data}

Next, we'll prepare our text data for embedding generation. For this tutorial, we'll work with PDF documents stored in a `data` folder using LangChain's document loader.

In [None]:
from langchain_community.document_loaders import PyPDFDirectoryLoader


# Load every PDF document found in the given directory
def read_doc(directory):
    file_loader = PyPDFDirectoryLoader(directory)
    documents = file_loader.load()
    if not documents:
        raise ValueError("No documents found in the specified directory.")
    return documents

docs = read_doc("data/")

The documents used in this tutorial are [research papers](https://papers.nips.cc/paper_files/paper/2023), chosen because they share a consistent structure. Before generating embeddings, we need to split large documents into smaller chunks through a process called text chunking.


### Text Chunking {#text-chunking}

Text chunking is essential for processing large documents before embedding generation. The process breaks down documents into smaller, manageable pieces that embedding models can process effectively within their token limits.

Key concepts for chunking are:

- **Chunk size (1000 chars)**: Maximum characters per chunk, ensuring each piece fits within embedding model token limits.
- **Chunk overlap (50 chars)**: Characters that overlap between adjacent chunks to prevent important information from being split. (Example: Chunk 1 ends with "...model performance" and Chunk 2 starts with "model performance improves...")
- **Hierarchical separators**: Natural text boundaries like paragraphs (`\n\n`), sentences (`.`), and words (` `) that preserve meaning when splitting.
- **Minimum length filtering (50 chars)**: Removes chunks too short to contain meaningful semantic information. (Example: Filters out headers like "Introduction" or "Figure 1")
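
To see how chunk size and overlap interact, here is a deliberately simplified fixed-width chunker. It is only a sketch of the overlap idea: unlike LangChain's `RecursiveCharacterTextSplitter`, it ignores separators and cuts at exact character positions. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows where each window repeats
    the last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "Larger chunks keep context; overlap repeats the boundary text."
demo_chunks = sliding_chunks(sample, chunk_size=20, chunk_overlap=5)

for number, chunk in enumerate(demo_chunks, start=1):
    print(f"Chunk {number}: {chunk!r}")
```

The last five characters of each chunk reappear at the start of the next one, so a phrase that straddles a boundary still appears whole in at least one chunk.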

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


def chunk_data(docs, chunk_size=1000, chunk_overlap=50, min_length=50):
    """Split documents into smaller chunks for embedding generation."""

    # Configure text splitter with hierarchical separators
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
        add_start_index=True
    )

    # Split documents into chunks
    chunks = text_splitter.split_documents(docs)

    # Filter out headers and short chunks
    filtered_chunks = [
        chunk for chunk in chunks
        if not chunk.page_content.startswith("NeurIPS 2023") and
           len(chunk.page_content.strip()) >= min_length
    ]

    return filtered_chunks

# Process documents into chunks
chunks = chunk_data(docs)

The hierarchical separators `["\n\n", "\n", ".", "!", "?", ",", " ", ""]` ensure text breaks at natural boundaries. The splitter prefers paragraph breaks, then line breaks, then sentence endings, maintaining content integrity while creating optimal chunk sizes for semantic search.

### Embeddings {#embeddings}

We generate and store the embeddings in two steps:

- Initialize `OllamaEmbeddings`.
- Save the embeddings in Pinecone.

#### Python Vector Database with Ollama Embeddings

Ollama generates embeddings with a dedicated embedding model (`mxbai-embed-large` in this case). Because the model runs locally, your documents never leave your machine and there are no per-request API costs.

In [None]:
from langchain_ollama.embeddings import OllamaEmbeddings

# Initialize Ollama embeddings for Python vector database
# Using mxbai-embed-large model for high-quality embeddings
embeddings = OllamaEmbeddings(model="mxbai-embed-large")
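
The Pinecone index created later in this tutorial is configured for 1024 dimensions, so it is worth checking that the embedding model really produces vectors of that size before uploading anything. The sketch below uses a hypothetical `check_dimension` helper and a stand-in embedder so it runs without Ollama. With the real model, you would pass `embeddings.embed_query` instead.

```python
def check_dimension(embed_fn, expected_dim, probe_text="dimension probe"):
    """Embed a short probe string and verify the vector length."""
    actual_dim = len(embed_fn(probe_text))
    if actual_dim != expected_dim:
        raise ValueError(
            f"Embedding has {actual_dim} dimensions, but the index expects {expected_dim}."
        )
    return actual_dim


# Stand-in embedder for illustration only (returns a zero vector)
def fake_embedder(text):
    return [0.0] * 1024


print(check_dimension(fake_embedder, expected_dim=1024))

# With the real model (requires a running Ollama server):
# check_dimension(embeddings.embed_query, expected_dim=1024)
```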

#### LangChain Pinecone Integration Setup

Next, we configure Pinecone for the LangChain integration. Go to the [Pinecone website](https://app.pinecone.io/?sessionType=signup) and create a free account. Once you're signed in, copy your API key and save it in a `.env` file before creating the index.

The `.env` file will look something like this:

```
PINECONE_API_KEY="your-api-key"
```
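
A missing or empty key tends to surface later as a confusing authentication error, so one option is to fail fast when reading it. This is a minimal sketch. The `require_env` helper and the `DEMO_API_KEY` variable are hypothetical names used only for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set. Check your .env file.")
    return value


# Demonstration with a throwaway variable instead of a real secret
os.environ["DEMO_API_KEY"] = "your-api-key"
print(require_env("DEMO_API_KEY"))

# In the tutorial code, after load_dotenv():
# api_key = require_env("PINECONE_API_KEY")
```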

Pinecone organizes vector data in indexes, with records partitioned into namespaces within each index. We'll create an index named `semantic-search-local` with 1024 dimensions, matching the vector size that `mxbai-embed-large` produces, and cosine similarity as the distance metric.

The free tier restricts us to the `us-east-1` region, and we'll enable deletion protection to prevent accidental data loss. Here's how to set up the index:

In [None]:
import os
import time

from dotenv import load_dotenv
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec

load_dotenv()

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
index_name = "semantic-search-local"
namespace = f"{index_name}-namespace"  # same namespace the vectors are stored under below

# Check if index already exists
existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]

# Create index if it doesn't exist
if index_name not in existing_indexes:
    print(f"Creating index '{index_name}'...")
    pc.create_index(
        name=index_name,
        dimension=1024,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        deletion_protection="enabled"
    )

    # Wait for index to be ready
    print("Waiting for index to be ready...")
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)
    print(f"Index '{index_name}' created and ready.")
else:
    print(f"Index '{index_name}' already exists. Connecting...")
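
The wait loop above polls until the index reports ready, with no upper bound. If you prefer a bounded wait, one option is a small polling helper with a timeout. This is a sketch. `wait_until` is a hypothetical helper, and the demo uses a fake readiness check instead of a real Pinecone call.

```python
import time


def wait_until(is_ready, timeout=120.0, interval=1.0):
    """Poll `is_ready()` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds.")
        time.sleep(interval)


# Fake readiness check that succeeds on the third poll
attempts = {"count": 0}


def ready_on_third_check():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(ready_on_third_check, timeout=5.0, interval=0.01)
print(f"Ready after {attempts['count']} checks")

# With Pinecone, using the names from the code above:
# wait_until(lambda: pc.describe_index(index_name).status["ready"])
```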

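To take a quick look at the index, call `describe_index_stats()`. The sample output below comes from a rerun after the index had already been populated, which is why it reports 139 vectors; a freshly created index reports a `total_vector_count` of 0.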

In [None]:
# Connect to index
index = pc.Index(index_name)
index.describe_index_stats()

```
Index 'semantic-search-local' already exists. Connecting...
{'dimension': 1024,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'semantic-search-local-namespace': {'vector_count': 139}},
 'total_vector_count': 139,
 'vector_type': 'dense'}

```

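Now we populate the vector store with the embeddings. `PineconeVectorStore.from_documents` embeds each chunk with Ollama and uploads the resulting vectors to the index: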

In [None]:
vector_store = PineconeVectorStore.from_documents(
    documents=chunks,
    index_name=index_name,
    embedding=embeddings,
    namespace=index_name + "-namespace",
)
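
One thing to watch when rerunning this cell: if the vector store assigns fresh random IDs on every run (an assumption worth checking for your `langchain_pinecone` version), each rerun adds duplicate vectors. A common workaround is to derive a deterministic ID from each chunk's origin and content. The `chunk_id` helper below is a hypothetical sketch of that idea.

```python
import hashlib


def chunk_id(source, start_index, text):
    """Build a stable ID from where a chunk came from and what it contains."""
    raw = f"{source}:{start_index}:{text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


first = chunk_id("data/paper.pdf", 0, "Large language models are...")
again = chunk_id("data/paper.pdf", 0, "Large language models are...")
other = chunk_id("data/paper.pdf", 950, "Large language models are...")

print(first == again)  # same input, same ID
print(first == other)  # different position, different ID

# Assumed usage with the tutorial's chunks (verify that `ids` is supported
# by your langchain_pinecone version before relying on it):
# ids = [
#     chunk_id(c.metadata.get("source", ""), c.metadata.get("start_index", 0), c.page_content)
#     for c in chunks
# ]
# vector_store.add_documents(chunks, ids=ids)
```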

### Implementing Semantic Search with Python Vector Database {#implementing-semantic-search-with-python-vector-database}

With the embeddings stored in Pinecone, we can now run semantic search. The process follows three steps common to all vector databases: convert the query to an embedding with Ollama, compare it with the stored vectors, and retrieve the most similar results.

We will:

- Use `k=2` to get the top 2 matches.
- Set a similarity threshold of 0.5 to filter out low-quality results.
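
The following sketch shows what these steps amount to on a tiny in-memory example: score every stored vector against the query with cosine similarity, keep the top `k`, and drop anything below the threshold. The three-dimensional vectors are invented for illustration, and Pinecone uses approximate indexing at scale instead of this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2, score_threshold=0.5):
    """Return up to k (id, score) pairs, best first, at or above the threshold."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score) for doc_id, score in scored[:k] if score >= score_threshold]


# Toy vectors, not real embeddings
stored = {
    "jeans": [0.9, 0.1, 0.0],
    "denim": [0.8, 0.2, 0.1],
    "jacket": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

for doc_id, score in top_k(query_vec, stored, k=3):
    print(f"{doc_id}: {score:.3f}")
```

Even with `k=3`, only two results come back, because the jacket vector scores below the 0.5 threshold.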

In [None]:
# Cosine similarity search against the Pinecone index
def retrieve_query(query, k=2, score_threshold=0.5):
    """Return up to k (document, score) pairs scoring at or above the threshold."""
    matching_result = vector_store.similarity_search_with_score(query, k=k)
    return [r for r in matching_result if r[1] >= score_threshold]

Let's run a query and inspect the similarity score.

In [None]:
query = "What were the key findings of the NeurIPS 2023 LLM Efficiency Fine-tuning Competition?"
matches = retrieve_query(query=query, k=1, score_threshold=0.5)
for match, score in matches:
    print(f"Score: {score:.3f}")
    print(match.page_content[:300])  # Show preview
    print("-" * 50)

The above code will generate an output similar to this:

```
Score: 0.736
for generative models and demonstrate the need for more robust evaluation meth-
ods. Notably, the winning submissions utilized standard open-source libraries and
focused primarily on data curation. To facilitate further research and promote
reproducibility, we release all competition entries, Docker
```

The similarity score of 0.736 indicates a strong semantic match between our query and the retrieved text. This score exceeds our 0.5 threshold, confirming the result's relevance.

Since we set `k=1`, we're retrieving only the single most similar document. This result represents the best semantic match in our entire document collection for the given query.


### Measuring Basic Performance {#measuring-basic-performance-measures}

Performance monitoring is crucial for production systems, so let's measure our search latency. This gives us baseline metrics for optimization decisions.

We'll test with two research-focused queries, including the one from our earlier example. These represent typical academic questions users might ask about NeurIPS papers:

In [None]:
test_queries = [
    "What were the key findings of the NeurIPS 2023 LLM Efficiency Fine-tuning Competition?",
    "How does self-preference bias manifest in LLM evaluators, and what evidence supports this?",
]

Search times will vary based on dataset size, network conditions, and query complexity. The measurements below provide a starting point for performance expectations. 

In [None]:
def measure_latency(queries, k=3):
    """Time each query end to end (local embedding plus Pinecone search)."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        retrieve_query(query, k)
        duration = time.perf_counter() - start
        latencies.append(duration)
        print(f"Query: {query[:50]}... | Search Time: {duration:.2f}s")
    avg_latency = sum(latencies) / len(latencies) if latencies else 0
    print(f"Average Search Time: {avg_latency:.2f}s")
    return latencies

measure_latency(test_queries, k=3)
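
A single run per query can be noisy, since the first call often pays for connection setup and model loading. One way to get a steadier baseline is to repeat each query and summarize the timings. The sketch below uses hypothetical helpers (`time_calls`, `summarize`) and a stand-in search function so it runs without Pinecone or Ollama.

```python
import statistics
import time


def time_calls(fn, inputs, repeats=3):
    """Call fn on each input `repeats` times and return every duration in seconds."""
    durations = []
    for item in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            fn(item)
            durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Mean, median, and worst-case duration for a list of timings."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Stand-in for a real search call, for illustration only
def fake_search(query):
    return query.lower().split()


timings = time_calls(fake_search, ["first query", "second query"], repeats=3)
print(summarize(timings))

# With the tutorial's function:
# timings = time_calls(retrieve_query, test_queries, repeats=3)
```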