# Setting Up an Embedding Database with LLaMA and Zotero Data

This notebook will guide you through the steps to:
1. Load and preprocess research-paper metadata exported from Zotero.
2. Generate embeddings for each document chunk using a local LLaMA model.
3. Store these embeddings in a vector database (Chroma) for retrieval.
4. Perform similarity-based retrieval on the stored embeddings.

### Prerequisites
- A JSON export of your Zotero library.
- The `langchain`, `langchain-community`, `langchain-ollama`, and `chromadb` packages.

Install any missing packages with the following:
```bash
pip install langchain langchain-community langchain-ollama chromadb
```


In [None]:
# Import necessary libraries
import json
from langchain_ollama import OllamaEmbeddings
# In recent LangChain releases the Chroma wrapper lives in langchain-community
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document


## Step 1: Load and Preprocess Zotero Data

Export your Zotero library to a JSON file and provide the file path below. This code will load the JSON file, extract relevant information, and prepare documents for embedding generation.


In [None]:
# Path to the Zotero JSON data file
JSON_FILE_PATH = "zotero_export.json"

# Load JSON metadata from Zotero
def load_zotero_json(file_path):
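    # Read as UTF-8 explicitly so non-ASCII titles and abstracts load on every platform
    with open(file_path, 'r', encoding='utf-8') as f: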
        return json.load(f)

# Preprocess documents to extract and structure text
def preprocess_documents(data):
    documents = []
    for item in data:
        title = item.get("title", "Untitled")
        text = item.get("abstract", "")  # Use abstract or other text available
        documents.append({"title": title, "text": text})
    return documents

# Load and preprocess Zotero data
zotero_data = load_zotero_json(JSON_FILE_PATH)
documents = preprocess_documents(zotero_data)

# Display a sample document
documents[:1]  # Show the first document as an example
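
The key names in a Zotero export depend on the export format, so `preprocess_documents` may find no abstracts in some files. The sketch below is a more defensive variant, not a definitive parser. The helper name `extract_records`, the `"items"` wrapper, and the `"abstractNote"` fallback key are assumptions; check them against your own export.

```python
def extract_records(data):
    """Return a list of {"title", "text"} dicts from a loaded export."""
    # Assumption: some exports wrap the entries in a dict under "items".
    items = data.get("items", []) if isinstance(data, dict) else data
    records = []
    for item in items:
        # Assumption: the abstract is stored under "abstract" or "abstractNote".
        text = (item.get("abstract") or item.get("abstractNote") or "").strip()
        if not text:
            continue  # skip entries with nothing to embed
        records.append({"title": item.get("title", "Untitled"), "text": text})
    return records


# Hypothetical sample entries, used only to exercise the helper
sample = [
    {"title": "Paper A", "abstract": "Sleep improves recovery."},
    {"title": "Paper B", "abstractNote": "  Cold water immersion.  "},
    {"title": "Paper C"},
]
records = extract_records(sample)
wrapped_records = extract_records({"items": sample})
```

Dropping entries without text at this stage keeps empty strings out of the later chunking and embedding steps.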


## Step 2: Split Text into Chunks for Better Retrieval

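To improve retrieval precision, split each document into smaller chunks so that a query can match a specific passage rather than a whole document. The splitter below counts whitespace-separated words, so `chunk_size=512` means 512 words, not characters or tokens.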


In [None]:
# Split text into chunks (for better retrieval)
def split_text(text, chunk_size=512):
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Process documents into chunks
document_chunks = []
for doc in documents:
    chunks = split_text(doc['text'])
    document_chunks.append({"title": doc['title'], "chunks": chunks})

# Display the first document with chunks
document_chunks[:1]
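
`split_text` cuts at fixed word boundaries, so a sentence that straddles two chunks is split between them. One common alternative is to let consecutive chunks share a few words. The sketch below illustrates that idea; the function name `split_text_with_overlap` and the `overlap` parameter are hypothetical and not part of the notebook's code.

```python
def split_text_with_overlap(text, chunk_size=512, overlap=64):
    """Split text into word-based chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


demo_chunks = split_text_with_overlap("a b c d e f g h i j", chunk_size=4, overlap=1)
```

With `chunk_size=4` and `overlap=1`, each chunk begins with the last word of the previous one.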


## Step 3: Generate Embeddings with LLaMA

We will use `OllamaEmbeddings` to generate an embedding for each chunk. This requires a running local Ollama server with the model already pulled (for example, `ollama pull llama3`).


In [None]:
# Initialize LLaMA embeddings
embedding_model = OllamaEmbeddings(model="llama3")

# Generate embeddings for each chunk
all_embeddings = []
metadata = []

for doc in document_chunks:
    for chunk in doc['chunks']:
        # OllamaEmbeddings exposes embed_query (one text) and embed_documents (a list)
        embedding = embedding_model.embed_query(chunk)
        all_embeddings.append(embedding)
        metadata.append({"title": doc['title'], "text": chunk})

# Check the number of embeddings created
len(all_embeddings), len(metadata)
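
Embedding one chunk per call can be slow for a large library. LangChain embedding classes also provide `embed_documents`, which takes a list of texts, so the chunks can be sent in batches. The sketch below shows the batching logic only. The helper `embed_in_batches` and the stand-in embedder are hypothetical; the stand-in is there so the example runs without a model.

```python
def embed_in_batches(texts, embed_documents, batch_size=32):
    """Call `embed_documents` on slices of `texts` and collect the vectors in order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_documents(batch))
    return vectors


def fake_embed_documents(batch):
    # Stand-in for a real model: [character count, space count] per text
    return [[float(len(t)), float(t.count(" "))] for t in batch]


demo_texts = ["alpha beta", "gamma", "delta epsilon zeta"]
demo_vectors = embed_in_batches(demo_texts, fake_embed_documents, batch_size=2)
```

With the notebook's model, the second argument would be `embedding_model.embed_documents`. A suitable `batch_size` depends on your hardware and is left as an assumption here.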


## Step 4: Store Embeddings in Chroma Vector Database

We’ll now store the embeddings in a vector database (Chroma), which allows us to perform similarity-based searches.


In [None]:
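# Build the Chroma vector store. LangChain's Chroma wrapper has no
# from_embeddings constructor; from_texts embeds the texts itself
# using the embedding model it is given.
texts = [m["text"] for m in metadata]
vector_store = Chroma.from_texts(texts=texts, embedding=embedding_model, metadatas=metadata)

# Report how many chunks were passed to Chroma
print("Number of chunks in vector store:", len(texts))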
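
Re-running the notebook against a persistent collection can add the same chunks again. One way to avoid this is to give each chunk a deterministic ID derived from its content. The sketch below shows one possible scheme. The helper `make_chunk_id` and the 16-character length are assumptions, not anything Chroma requires.

```python
import hashlib


def make_chunk_id(title, chunk_index, text):
    """Derive a stable ID from the title, the chunk position, and the chunk text."""
    # \x1f (unit separator) keeps the three fields from running together
    raw = f"{title}\x1f{chunk_index}\x1f{text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


id_a = make_chunk_id("Paper A", 0, "Sleep improves recovery.")
id_a_again = make_chunk_id("Paper A", 0, "Sleep improves recovery.")
id_b = make_chunk_id("Paper A", 1, "Sleep improves recovery.")
```

A list of such IDs could be passed through the `ids` argument of `Chroma.from_texts`. How duplicates are then handled depends on the Chroma version, so check that before relying on it.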


## Step 5: Perform Similarity Search on Stored Embeddings

Now, we’ll set up a search function to find the most relevant documents based on a query.


In [None]:
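# Retrieval function using similarity search
def search(query, embedding_model, vector_store, k=5):
    # similarity_search expects a query string, so a precomputed
    # query vector goes through similarity_search_by_vector instead
    query_embedding = embedding_model.embed_query(query)
    results = vector_store.similarity_search_by_vector(query_embedding, k=k)
    return [{"title": res.metadata.get("title", "Untitled"), "text": res.page_content} for res in results]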

# Test retrieval with a sample query
sample_query = "What are effective recovery protocols after training?"
results = search(sample_query, embedding_model, vector_store)

# Display the search results
for result in results:
    print(f"Title: {result['title']}\nText: {result['text']}\n")
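
To make the ranking step concrete, here is a minimal pure-Python sketch of similarity search over a handful of vectors. It uses cosine similarity as the metric. That choice is an assumption for illustration, since the metric Chroma applies depends on how the collection is configured. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, vectors, k=2):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
ranking = top_k([1.0, 0.0], toy_vectors, k=2)
```

A vector database does the same kind of ranking, but with an approximate index so it does not have to compare the query against every stored vector.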
