# Semantic search indexes with Ollama (Python)

***...using RAG***

Anton Antonov  
January 2026


----

## Introduction


This notebook has the following workflow:

1. Load packages:
    - RAG packages (LangChain)
    - Ollama embeddings
    - Data display packages
2. Ingest a set of text files
    - Produce a corresponding data frame
3. Summarize the data frame
4. Prepare the documents for RAG
5. Create a semantic search index for the texts
    - Use batching to avoid large embedding calls
    - Merge the partial indexes into a single [FAISS index](https://docs.langchain.com/oss/python/integrations/vectorstores/faiss)
6. Export the vector database
7. Verification:
    - Import the vector database
    - Run a sample retrieval
    - Generate an LLM answer using the retrieved chunks as context


---

## Setup


Load general, LLM, RAG, and display packages:

In [56]:
import os
import glob
import re
import time
from typing import List

import pandas as pd
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Older import location, kept for reference:
#from langchain_community.embeddings import OllamaEmbeddings
from langchain_ollama import OllamaEmbeddings, ChatOllama

# Display packages
from IPython.display import display, Markdown

Create an embedding client:

In [None]:
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434",
)


Create an Ollama chat model for query processing:

In [69]:
llm = ChatOllama(
    model="gpt-oss:20b",
    base_url="http://localhost:11434",
)

### Helpers

Preferred storage location for vector databases:

In [57]:
def ensure_xdg_data_home_dir():
    """Return the semantic-search data directory under XDG_DATA_HOME, creating it if needed."""
    # Fall back to ~/.local/share when XDG_DATA_HOME is unset or empty
    xdg_data_home = os.environ.get('XDG_DATA_HOME') or os.path.expanduser('~/.local/share')
    target_dir = os.path.join(xdg_data_home, 'Python', 'LLM', 'SemanticSearch')
    os.makedirs(target_dir, exist_ok=True)
    return target_dir

----

## Data ingestion


In [None]:
print("Ingesting text data...")

data_dir = os.path.expanduser("../../texts/DialogueWorks/Wolff-Hudson")
file_names = sorted(glob.glob(os.path.join(data_dir, "*.txt")))

print(f"{len(file_names)} files")
file_names[:5]


Ingesting text data...
76 files


['../../texts/DialogueWorks/Wolff-Hudson/-fCOefH-Fnw.txt',
 '../../texts/DialogueWorks/Wolff-Hudson/0lD_UrtPVpA.txt',
 '../../texts/DialogueWorks/Wolff-Hudson/2FKtZtjN_cA.txt',
 '../../texts/DialogueWorks/Wolff-Hudson/4G98DeCZcT8.txt',
 '../../texts/DialogueWorks/Wolff-Hudson/594yN8rxIJo.txt']

In [41]:
# Read every file, then key the texts by basename (file name without ".txt")
texts = []
for path in file_names:
    with open(path, 'r', encoding='utf-8') as file:
        texts.append(file.read())

basenames = [re.sub(r'\.txt$', '', os.path.basename(x)) for x in file_names]

texts = dict(zip(basenames, texts))

len(texts)

76

In [43]:
dfTexts = pd.DataFrame([{"ID": k, "Text" : v} for (k, v) in texts.items()])
dfTexts

|     | ID          | Text                                                 |
|-----|-------------|------------------------------------------------------|
| 0   | -fCOefH-Fnw | Hi everybody. Today is Thursday, June 26, 2025...    |
| 1   | 0lD_UrtPVpA | hi everybody today's th February 13 2025 and o...    |
| 2   | 2FKtZtjN_cA | Hi everybody. Today is Thursday, June 12, 2025...    |
| 3   | 4G98DeCZcT8 | Hi everybody. Today is Thursday, April 24th, 2...    |
| 4   | 594yN8rxIJo | hi everybody today is Thursday April 3r 25 and...    |
| ... | ...         | ...                                                  |
| 71  | w--fsqQQQa0 | hi everybody today is Thursday February 6th 20...    |
| 72  | wl8sBSx5lpI | hi everybody today is Wednesday November 27th ...    |
| 73  | xJMbCi3cmQI | let's start with the SEO Summit in kakistan As...    |
| 74  | zHmzUl0HmPE | what are the impacts of sanctions on a country...    |


---

## Summaries


Basic character-length summary:

In [44]:
# Character-length statistics for the Text column
summary = (
    dfTexts["Text"]
    .astype(str)
    .map(len)
    .describe()
)
summary

count       76.000000
mean     40516.842105
std      14227.287461
min       6498.000000
25%      31202.500000
50%      41394.500000
75%      49382.500000
max      70363.000000
Name: Text, dtype: float64

----

## Documents preparation


In [51]:
# Use ID as the record identifier and drop records shorter than min_chars
min_chars = 100
rows = dfTexts[["ID", "Text"]].copy()
rows["Text"] = rows["Text"].astype(str)
rows = rows[rows["Text"].str.len() >= min_chars]

records = [
    Document(
        page_content=row.Text,
        metadata={"record_id": row.ID},
    )
    for row in rows.itertuples(index=False)
]

print(f"Prepared {len(records)} documents")


Prepared 76 documents


---

## Semantic indexes

- Embedding all chunks in a single call can be slow or fail, so we embed them in batches.
- We build one FAISS index per batch, then merge the partial indexes into a single index.
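The batching idea in the bullets above can be sketched independently of any embedding backend. This is a minimal illustration; the helper name `iter_batches` is hypothetical and not part of LangChain or FAISS.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


# Ten items in batches of four: the last batch holds the remainder
batch_sizes = [len(batch) for batch in iter_batches(list(range(10)), 4)]
print(batch_sizes)  # [4, 4, 2]
```

Each batch would then be embedded and indexed on its own, so a failure affects one batch rather than the whole run.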


In [53]:
# Split long texts into chunks for better recall
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
)

# split_documents accepts the whole list and copies each record's metadata to its chunks
chunked_docs: List[Document] = text_splitter.split_documents(records)

print(f"Chunked documents: {len(chunked_docs)}")


Chunked documents: 3646
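To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window splitting with overlap. It is a simplification, not the algorithm of `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line breaks. The function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> List[str]:
    """Split `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Small numbers make the overlap visible: each chunk repeats the last character of the previous one
toy_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(toy_chunks)  # ['abcd', 'defg', 'ghij']
```

For scale, 3646 chunks from 76 transcripts is about 48 chunks per transcript on average.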


In [54]:
batch_size = 1000
faiss_index = None

start = time.time()
for i in range(0, len(chunked_docs), batch_size):
    batch = chunked_docs[i : i + batch_size]
    batch_index = FAISS.from_documents(batch, embeddings)
    if faiss_index is None:
        faiss_index = batch_index
    else:
        faiss_index.merge_from(batch_index)
    print(f"Embedded batch {i // batch_size + 1}")

print(f"Total time: {time.time() - start:.1f}s")


Embedded batch 1
Embedded batch 2
Embedded batch 3
Embedded batch 4
Total time: 47.2s


----

## Export semantic index


Persist the merged FAISS index:

In [None]:
export_dir = os.path.join(ensure_xdg_data_home_dir(), "EconomicsAI")
os.makedirs(export_dir, exist_ok=True)
faiss_index.save_local(export_dir)

#export_dir

---

## Retrieval experiments


Import vector database:

In [None]:
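# Resolve the directory with the same helper as the export step, so XDG_DATA_HOME is respected
export_dir = os.path.join(ensure_xdg_data_home_dir(), "EconomicsAI")
# The index metadata is stored as a pickle; only enable deserialization for files you created yourself
vdb = FAISS.load_local(export_dir, embeddings, allow_dangerous_deserialization=True)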
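Conceptually, a similarity search embeds the query and ranks the stored chunk vectors by their distance to the query vector. The sketch below shows that ranking on toy 2-D vectors with plain Python. It assumes Euclidean (L2) distance, which is believed to be the default of the LangChain FAISS wrapper but should be treated as an assumption here. The names `top_k_by_l2` and `toy_vectors` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def top_k_by_l2(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k (name, distance) pairs closest to `query`, nearest first."""
    scored = [(name, math.dist(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy "embeddings": b equals the query, c is near it, a is far away
toy_vectors = {"a": (0.0, 1.0), "b": (1.0, 0.0), "c": (0.9, 0.1)}
nearest = top_k_by_l2((1.0, 0.0), toy_vectors, k=2)
print([name for name, _ in nearest])  # ['b', 'c']
```

A real index holds one high-dimensional vector per chunk and uses FAISS for the search instead of a full sort.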

Search:

In [81]:
query = "how capitalism undermines itself"
rag_docs = vdb.similarity_search(query, k=10)

# Record IDs of the retrieved chunks; the same transcript can appear more than once
#res = [(r.metadata.get("record_id"), r.page_content[:200]) for r in rag_docs]
res = [r.metadata.get("record_id") for r in rag_docs]
res


['JQ2P5yitPyE',
 'v5CGEQrv0Lg',
 'uS4ewq1R9ps',
 'e5Ed7b1-LnY',
 'WqwK3xdZe1A',
 'loxwfNQw17o',
 'pHOJ2ASlu80',
 'v5CGEQrv0Lg',
 'vU1uxkZUFb0',
 '4G98DeCZcT8']
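The result lists `v5CGEQrv0Lg` twice because the index stores chunks, not whole transcripts, so two of the ten hits come from the same record. A minimal sketch for counting hits per record and getting the distinct IDs in rank order follows. The variable names are illustrative, and the list is copied from the output above.

```python
from collections import Counter

retrieved_ids = [
    "JQ2P5yitPyE", "v5CGEQrv0Lg", "uS4ewq1R9ps", "e5Ed7b1-LnY", "WqwK3xdZe1A",
    "loxwfNQw17o", "pHOJ2ASlu80", "v5CGEQrv0Lg", "vU1uxkZUFb0", "4G98DeCZcT8",
]

# Number of retrieved chunks per transcript
hits_per_record = Counter(retrieved_ids)

# Distinct IDs, keeping the order of first appearance (dicts preserve insertion order)
unique_ids = list(dict.fromkeys(retrieved_ids))

print(len(unique_ids))                 # 9
print(hits_per_record.most_common(1))  # [('v5CGEQrv0Lg', 2)]
```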

In [85]:
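# Looking up the full texts by ID also works, but it passes whole transcripts to the LLM;
# the retrieved documents already carry the text of the matching chunks.
#df2 = dfTexts[dfTexts["ID"].isin(res)]
#text_chunks = list(df2["Text"])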

text_chunks = [doc.page_content.strip() for doc in rag_docs if doc.page_content]


In [None]:
prompt = PromptTemplate.from_template(
    "Summarize in a list with at most {number_of_items} points the content of these document chunks:\n {doc_chunks}"
)

chain = prompt | llm | StrOutputParser()

result = chain.invoke({
    "number_of_items": 12, 
    "doc_chunks": "\n\n".join(text_chunks)
})
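With `k=10` and chunks of at most 1000 characters, the joined context stays near 10,000 characters. For larger `k` or bigger chunks, it can help to cap the context before it reaches the prompt. The sketch below is one possible guard, not part of the notebook's pipeline; the function name `build_context` and the character budget are hypothetical.

```python
from typing import List


def build_context(chunks: List[str], max_chars: int, sep: str = "\n\n") -> str:
    """Join chunks in rank order, stopping before the total length exceeds `max_chars`."""
    selected: List[str] = []
    used = 0
    for chunk in chunks:
        # The separator is only counted between chunks, not before the first one
        extra = len(chunk) + (len(sep) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(chunk)
        used += extra
    return sep.join(selected)


# Two 4-character chunks plus one separator fit exactly into a budget of 10
context = build_context(["aaaa", "bbbb", "cccc"], max_chars=10)
print(repr(context))  # 'aaaa\n\nbbbb'
```

Because the retrieved chunks are ordered by similarity, dropping from the end removes the least relevant text first.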

In [87]:
display(Markdown(result))

**Key points from the document:**

1. **Capitalism produces both wealth and poverty** – it amplifies inequality while also generating large segments of the population in extreme poverty.  
2. **Uneven development** – even within the same city, five miles apart can be the richest and the poorest places, illustrating the uneven spread of capitalistic gains.  
3. **International agreements perpetuate the divide** – treaties, trade, and investment deals formalise the separation between rich and poor regions, reinforcing capital’s uneven expansion.  
4. **Economic inequality breeds political manipulation** – as the wealthy minority outgrows the majority, it often uses its resources to corrupt or limit democratic institutions to protect its interests.  
5. **Capitalism’s inherent instability** – the system is cyclical: sectors boom on credit, become over‑expanded, then crash or are forced back into equilibrium through government intervention.  
6. **Critics highlight the lack of coordination** – decision‑making is fragmented among enterprises, leading to systemic crises unless corrected by state action.  
7. **Marx’s duality of capitalism** – “capitalism is a constant producer and reproducer of great wealth … and great poverty”; the same mechanism that creates riches also creates misery.  
8. **Unfulfilled promises** – capitalism has historically failed to deliver on the ideals of liberty, equality, fraternity, and democracy it initially promised.  
9. **Capitalism itself blocks those ideals** – Marx argued that the very structure of capitalism prevents the realization of freedom, equality, and democracy.  
10. **China’s hybrid model as a counterexample** – a powerful state coupled with a sizable private sector has achieved unprecedented growth, outperforming Western economies.  
11. **Capitalism is not democratic** – its organizational logic (hierarchies, ownership concentration, profit maximisation) is fundamentally at odds with democratic decision‑making.  
12. **Environmental contradictions** – capitalist pursuit of profit inevitably degrades the natural environment; these contradictions are built into the system and will recur when conditions change.