# Multilingual Embedding Pipeline
This notebook:
- Loads two PDF editions of ***Rich Dad Poor Dad*** (English and Spanish)
- Applies two chunking strategies
- Embeds the chunks with three multilingual models (LaBSE, Multilingual-E5, BGE-M3)
- Stores the embeddings in Pinecone indexes

## Install Dependencies

In [None]:
!pip install langchain sentence-transformers pinecone-client python-dotenv langchain-openai langchain-pinecone langchain_community pypdf hf_xet


## 1. Collect a representative set of documents
### Languages
*   Spanish
*   English

### Knowledge
*   [Padre-rico-padre-pobre-nueva-es](https://drive.google.com/file/d/1Mt8cEOIcXMJykRkknw4zApuNpNRBAo-v/view?usp=drive_link)
*   [Rich-Dad-Poor-Dad-en](https://drive.google.com/file/d/1TEnvsdJvgWFbhmy-UxbCP-5AmdB_hi4D/view?usp=drive_link)





In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Load the `.env` File and Libraries

In [None]:
import glob
import os

from dotenv import load_dotenv
# Loaders and embedding wrappers live in langchain_community (installed above)
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec

# Read PINECONE_API_KEY from a local .env file
load_dotenv(".env")
pinecone_key = os.getenv("PINECONE_API_KEY")
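
`os.getenv` returns `None` when the variable is absent, so a missing key would only surface later as a Pinecone authentication error. A minimal sketch of a stricter lookup follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper: raises RuntimeError when the variable is unset or empty.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is missing or empty.")
    return value
```

In this notebook it could be used as `pinecone_key = require_env("PINECONE_API_KEY")` right after `load_dotenv(".env")`.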

## 2. Load and Preprocess Documents

In [None]:
# Load all PDFs manually
def load_all_pdfs(data_path):
    pdf_files = glob.glob(os.path.join(data_path, "*.pdf"))
    if not pdf_files:
        print("No PDF files found.")
        return []

    print(f"Found {len(pdf_files)} PDF(s):")
    for pdf in pdf_files:
        print(f"- {pdf}")

    all_documents = []
    for file in pdf_files:
        loader = PyPDFLoader(file)
        documents = loader.load()
        print(f"Loaded {len(documents)} page(s) from: {file}")
        all_documents.extend(documents)

    print(f"Total pages loaded from all PDFs: {len(all_documents)}")
    return all_documents

In [None]:
# # Step 2: Split into text chunks
# def split_documents(documents, chunk_size=500, chunk_overlap=20):
#     splitter = RecursiveCharacterTextSplitter(
#         chunk_size=chunk_size,
#         chunk_overlap=chunk_overlap
#     )
#     chunks = splitter.split_documents(documents)
#     print(f"Total chunks created: {len(chunks)}")
#     return chunks



## 3. Apply Two Chunking Strategies

In [None]:
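# Split the loaded pages with two chunking strategies.
# Both chunk lists are concatenated, so every index receives chunks from both strategies.
def split_documents(documents):
    # Strategy 1: try paragraph, line, and word boundaries in turn, with a 50-character overlap
    print("Applying RecursiveCharacterTextSplitter...")
    recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    recursive_chunks = recursive_splitter.split_documents(documents)

    # Strategy 2: split only on the default "\n\n" separator, targeting 600 characters, no overlap
    print("Applying CharacterTextSplitter (fixed-length)...")
    fixed_splitter = CharacterTextSplitter(chunk_size=600, chunk_overlap=0)
    fixed_chunks = fixed_splitter.split_documents(documents)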

    return recursive_chunks + fixed_chunks
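
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of a purely character-based sliding window. It is not how the LangChain splitters work internally (they split on separators and merge the pieces); the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

With an overlap of 1, the last character of each chunk is repeated as the first character of the next one, which is the effect `chunk_overlap=50` aims for at a larger scale.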

In [None]:
# Set your path and run everything
data_path = r"/content/drive/MyDrive/knowledge"

if not os.path.isdir(data_path):
    print(f"Directory not found: {data_path}")
else:
    docs = load_all_pdfs(data_path)
    text_chunks = split_documents(docs)

    # Preview the first chunk
    for i, chunk in enumerate(text_chunks[:1]):
        print(f"\n--- Chunk {i+1} ---\n{chunk.page_content[:300]}...\n")
    print(f"Total chunks: {len(text_chunks)}")

Found 2 PDF(s):
- /content/drive/MyDrive/knowledge/Padre-rico-padre-pobre-nueva-es.pdf
- /content/drive/MyDrive/knowledge/Rich-Dad-Poor-Dad-en.pdf
Loaded 248 page(s) from: /content/drive/MyDrive/knowledge/Padre-rico-padre-pobre-nueva-es.pdf
Loaded 241 page(s) from: /content/drive/MyDrive/knowledge/Rich-Dad-Poor-Dad-en.pdf
Total pages loaded from all PDFs: 489
Applying RecursiveCharacterTextSplitter...
Applying CharacterTextSplitter (fixed-length)...

--- Chunk 1 ---
Basado en el principio de que los bienes que generan
ingreso siempre dan mejores resultados que los trabajos
tradicionales, Robert Kiyosaki explica cómo pueden
adquirirse dichos bienes para, eventualmente, olvidarse de
trabajar.
Con un estilo claro y ameno, este libro te pondrá en el
camino directo ...

Total chunks: 2282


## 4. Download and Wrap Embedding Models (LaBSE, E5, BGE-M3)

In [None]:
# Download the three embedding models from the Hugging Face Hub
models = [
    "sentence-transformers/LaBSE",
    "intfloat/multilingual-e5-small",
    "BAAI/bge-m3"
]

embeddings = {model: HuggingFaceEmbeddings(model_name=model) for model in models}

# Expanded form of the nested comprehension used in the next cell, for readability

# embedded_chunks = {}
# for model in models:
#     model_embeddings = []  # Store embeddings for the current model
#     for chunk in text_chunks:
#         # Embed the text of the current chunk (a Document, so use .page_content)
#         embedding = embeddings[model].embed_query(chunk.page_content)
#         model_embeddings.append(embedding)
#     # Store the list of embeddings in the dictionary, keyed by model name
#     embedded_chunks[model] = model_embeddings

In [None]:

# Embed the text of every chunk with each model.
# Note: this dictionary is for inspection only; the upload step below
# re-embeds the chunks through PineconeVectorStore.
embedded_chunks = {model: [embeddings[model].embed_query(chunk.page_content) for chunk in text_chunks] for model in models}

# print(embedded_chunks)
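
Each model maps a chunk to a fixed-length list of floats, and two such vectors can be compared with cosine similarity, the same metric the Pinecone indexes in this notebook are configured with. A minimal pure-Python sketch, with a hypothetical function name and toy vectors rather than real embeddings:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors: parallel vectors score close to 1, orthogonal vectors score 0
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Applied to this notebook, one could compare `embedded_chunks[model][i]` and `embedded_chunks[model][j]` for an English chunk and its Spanish counterpart, assuming the indices of such a pair are known.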


## 5. Create Pinecone Indexes

In [None]:
# Initialize Pinecone
pc = Pinecone(api_key=pinecone_key)

dimensions = {
    "sentence-transformers/LaBSE": 768,
    "intfloat/multilingual-e5-small": 384,
    "BAAI/bge-m3": 1024
}

for model in models:
    index_name = model.replace("/", "-").lower()
    dimension = dimensions[model]

    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            dimension=dimension,
            metric="cosine",
            spec=ServerlessSpec(
                cloud="aws",
                region="us-east-1"
            )
        )
        print(f"Index '{index_name}' created successfully.")
    else:
        print(f"Index '{index_name}' already exists.")



Index 'sentence-transformers-labse' created successfully.
Index 'intfloat-multilingual-e5-small' created successfully.
Index 'baai-bge-m3' created successfully.
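
The index names above come from `model.replace("/", "-").lower()`, which is enough for these three model IDs. A model ID containing other characters, such as `_` or `.`, would pass through unchanged. The sketch below is a slightly more defensive variant; the helper name `to_index_name` is hypothetical, and the naming rule it enforces (lowercase letters, digits, and hyphens, at most 45 characters) is an assumption that should be checked against the current Pinecone documentation.

```python
import re


def to_index_name(model_id: str, max_length: int = 45) -> str:
    """Turn a Hugging Face model ID into a conservative index name.

    Assumed rule (verify against Pinecone docs): lowercase alphanumerics and
    hyphens only, with a length limit given by max_length.
    """
    # Replace every run of disallowed characters with a single hyphen
    name = re.sub(r"[^a-z0-9]+", "-", model_id.lower()).strip("-")
    if not name:
        raise ValueError(f"Cannot derive an index name from {model_id!r}")
    if len(name) > max_length:
        raise ValueError(f"Index name {name!r} exceeds {max_length} characters")
    return name


for model_id in ["sentence-transformers/LaBSE", "intfloat/multilingual-e5-small", "BAAI/bge-m3"]:
    print(to_index_name(model_id))
```

For the three models used here, the result is identical to the names printed in the output above.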


## 5.1 Populate Pinecone Indexes

In [None]:
# Store embeddings to the respective Pinecone index
for model in models:
    index_name = model.replace("/", "-").lower()
    embedding_model = embeddings[model]

    # If loading the key from .env does not work, set it directly instead:
    # os.environ["PINECONE_API_KEY"] = "PASTE_DIRECTLY_KEY_HERE"

    vector_store = PineconeVectorStore(
        index_name=index_name,
        embedding=embedding_model
    )

    print(f"Adding embeddings for model '{model}' to index '{index_name}' in batches...")

    # Define a batch size
    batch_size = 100

    for i in range(0, len(text_chunks), batch_size):
        batch = text_chunks[i:i + batch_size]
        try:
            vector_store.add_documents(documents=batch)
            print(f"  Uploaded batch {i // batch_size + 1}/{(len(text_chunks) + batch_size - 1) // batch_size}")
        except Exception as e:
            print(f"  Error uploading batch {i // batch_size + 1}: {e}")
    print(f"Finished adding embeddings for model '{model}'.")

Adding embeddings for model 'sentence-transformers/LaBSE' to index 'sentence-transformers-labse' in batches...
  Uploaded batch 1/23
  Uploaded batch 2/23
  Uploaded batch 3/23
  Uploaded batch 4/23
  Uploaded batch 5/23
  Uploaded batch 6/23
  Uploaded batch 7/23
  Uploaded batch 8/23
  Uploaded batch 9/23
  Uploaded batch 10/23
  Uploaded batch 11/23
  Uploaded batch 12/23
  Uploaded batch 13/23
  Uploaded batch 14/23
  Uploaded batch 15/23
  Uploaded batch 16/23
  Uploaded batch 17/23
  Uploaded batch 18/23
  Uploaded batch 19/23
  Uploaded batch 20/23
  Uploaded batch 21/23
  Uploaded batch 22/23
  Uploaded batch 23/23
Finished adding embeddings for model 'sentence-transformers/LaBSE'.
Adding embeddings for model 'intfloat/multilingual-e5-small' to index 'intfloat-multilingual-e5-small' in batches...
  Uploaded batch 1/23
  Uploaded batch 2/23
  Uploaded batch 3/23
  Uploaded batch 4/23
  Uploaded batch 5/23
  Uploaded batch 6/23
  Uploaded batch 7/23
  Uploaded batch 8/23
  Upload
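
The upload loop slices `text_chunks` in steps of `batch_size` and computes the batch count with ceiling division, `(len(text_chunks) + batch_size - 1) // batch_size`. The same logic can be isolated in a small generator, sketched here with a hypothetical name (`iter_batches`) and a plain list of integers standing in for the chunks:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[tuple[int, int, Sequence[T]]]:
    """Yield (batch_number, total_batches, batch) for consecutive slices of items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = (len(items) + batch_size - 1) // batch_size  # ceiling division
    for number, start in enumerate(range(0, len(items), batch_size), start=1):
        yield number, total, items[start:start + batch_size]


# Stand-in for the 2282 chunks reported earlier in this notebook
fake_chunks = list(range(2282))
batches = list(iter_batches(fake_chunks, batch_size=100))
print(len(batches), len(batches[-1][2]))
# → 23 82
```

With 2282 chunks and a batch size of 100, this gives 23 batches, the last one holding 82 items, which matches the `x/23` counters in the log above.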

#### Test Retrieval

In [None]:
# Load the existing index for each model
for model in models:
    index_name = model.replace("/", "-").lower()
    embedding_model = embeddings[model]

    docsearch = PineconeVectorStore.from_existing_index(
        index_name=index_name,
        embedding=embedding_model
    )

    print(f"Successfully loaded index '{index_name}' for model '{model}'.")


Successfully loaded index 'sentence-transformers-labse' for model 'sentence-transformers/LaBSE'.
Successfully loaded index 'intfloat-multilingual-e5-small' for model 'intfloat/multilingual-e5-small'.
Successfully loaded index 'baai-bge-m3' for model 'BAAI/bge-m3'.


#### Retrieve Vector

In [None]:
# Query each model's index and print its top 3 matches
for model in models:
    index_name = model.replace("/", "-").lower()
    embedding_model = embeddings[model]

    docsearch = PineconeVectorStore.from_existing_index(
        index_name=index_name,
        embedding=embedding_model
    )

    retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k": 3})

    result = retriever.invoke("What is the wealth?")
    print(f"Results from model '{model}':")
    print(result)

Results from model 'sentence-transformers/LaBSE':
[Document(id='6eebe1cd-cbf5-4165-b78d-0f1270f847ae', metadata={'author': 'Robert Toru Kiyosaki', 'creationdate': '2023-12-19T16:11:01+00:00', 'creator': 'calibre (5.32.0) [http://calibre-ebook.com]', 'keywords': 'Divulgación, Ciencias sociales', 'moddate': '2023-12-19T16:11:01+00:00', 'page': 42.0, 'page_label': '43', 'producer': 'calibre (5.32.0) [http://calibre-ebook.com]', 'source': '/content/drive/MyDrive/knowledge/Padre-rico-padre-pobre-nueva-es.pdf', 'title': 'Padre rico, padre pobre (nueva edición actualizada)', 'total_pages': 248.0}, page_content='—explicó padre rico.\n—¿La verdad acerca de qué?, —pregunté.'), Document(id='85733b30-0c9e-47b1-82d2-da97f2a6371b', metadata={'author': 'Robert Toru Kiyosaki', 'creationdate': '2023-12-19T16:11:01+00:00', 'creator': 'calibre (5.32.0) [http://calibre-ebook.com]', 'keywords': 'Divulgación, Ciencias sociales', 'moddate': '2023-12-19T16:11:01+00:00', 'page': 83.0, 'page_label': '84', 'prod
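
Printing the raw `Document` list is hard to read because each result carries the full PDF metadata. A minimal sketch of a more compact summary follows. It relies only on the `page_content` and `metadata` attributes visible in the output above; the function name `summarize_results` and the `FakeDocument` stand-in are hypothetical and exist only so the example runs without LangChain or Pinecone.

```python
import os
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_results(docs, max_chars: int = 80) -> list[str]:
    """Return one line per result: rank, source file, page label, and a short snippet."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        source = os.path.basename(str(doc.metadata.get("source", "unknown")))
        page = doc.metadata.get("page_label", "?")
        # Collapse newlines and repeated spaces, then truncate the text
        snippet = " ".join(doc.page_content.split())[:max_chars]
        lines.append(f"{rank}. {source} (p. {page}): {snippet}")
    return lines


demo = [FakeDocument("hello\nworld", {"source": "/data/book.pdf", "page_label": "43"})]
for line in summarize_results(demo):
    print(line)
# → 1. book.pdf (p. 43): hello world
```

In the retrieval loop, `print(result)` could be replaced with `print("\n".join(summarize_results(result)))`.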