
# 🧾 Legal Document Similarity Checker (EURI + FAISS/Qdrant)
Pipeline: Generate dataset → Load → Clean → Chunk → Embed (EURI) → Store (FAISS/Qdrant) → Retrieve



## 1) Setup
Run the cell below to install the dependencies (skip it if they are already installed).


In [3]:

# %%bash
!pip install -U euriai faiss-cpu qdrant-client python-dotenv numpy pandas tqdm


Collecting euriai
  Downloading euriai-1.0.32-py3-none-any.whl.metadata (7.0 kB)
Collecting python-dotenv
  Using cached python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Collecting httpx>=0.20.0 (from httpx[http2]>=0.20.0->qdrant-client)
  Using cached httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting httpcore==1.* (from httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client)
  Using cached httpcore-1.0.9-py3-none-any.whl.metadata (21 kB)
Collecting h11>=0.16 (from httpcore==1.*->httpx>=0.20.0->httpx[http2]>=0.20.0->qdrant-client)
  Using cached h11-0.16.0-py3-none-any.whl.metadata (8.3 kB)
Downloading euriai-1.0.32-py3-none-any.whl (53 kB)
Using cached python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Using cached httpx-0.28.1-py3-none-any.whl (73 kB)
Using cached httpcore-1.0.9-py3-none-any.whl (78 kB)
Using cached h11-0.16.0-py3-none-any.whl (37 kB)
Installing collected packages: python-dotenv, h11, httpcore, euriai, httpx

  Attempting uninstall: h11

    Found existing instal

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
googletrans 4.0.0rc1 requires httpx==0.13.3, but you have httpx 0.28.1 which is incompatible.



## 2) Configuration
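Set `EURI_API_KEY` as an environment variable or in a `.env` file. The Qdrant settings are optional and only used when `USE_QDRANT` is `True`.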


In [20]:

import os
try:
    from dotenv import load_dotenv
    load_dotenv()
except Exception:
    pass

EURI_API_KEY = os.getenv("EURI_API_KEY", "{EURI_API_KEY}")
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
QDRANT_COLLECTION = os.getenv("QDRANT_COLLECTION", "legal_clauses")

USE_FAISS = True
USE_QDRANT = False

# Compare against the same placeholder string used as the getenv default above.
assert EURI_API_KEY and EURI_API_KEY != "{EURI_API_KEY}", "Please set EURI_API_KEY"
print("Config OK. FAISS:", USE_FAISS, "Qdrant:", USE_QDRANT)


Config OK. FAISS: True Qdrant: False



## 3) Generate mock dataset


In [13]:

from pathlib import Path
import textwrap

DATA_DIR = Path("data/legal")
DATA_DIR.mkdir(parents=True, exist_ok=True)

docs = {
    "SaaS_Master_Subscription_Agreement.txt": '''
    1. Definitions
    "Service" means the subscription-based software service provided by Vendor.
    "Customer Data" refers to information submitted by Customer to the Service.

    2. Term and Termination
    This Agreement commences on the Effective Date and continues for the Initial Term of twelve (12) months.
    Either party may terminate this Agreement for material breach if such breach remains uncured for thirty (30) days after written notice.
    Upon termination, Customer's access to the Service shall cease, except solely for limited export of Customer Data for thirty (30) days.

    3. Confidentiality
    Each party agrees to maintain the confidentiality of the other party's Confidential Information and use it only as permitted under this Agreement.
    Confidential Information does not include information that becomes public without breach, or is independently developed without use of Confidential Information.

    4. Payment Terms
    Fees are due within thirty (30) days of invoice date. Late payments may accrue interest at 1.5% per month or the maximum allowed by law.

    5. Limitation of Liability
    Neither party shall be liable for indirect, incidental, special, consequential, or punitive damages.
    ''',
    "Employment_Agreement.txt": '''
    1. Employment and Duties
    Employee shall perform the duties described in the attached Job Description to the best of Employee's ability.

    2. Compensation and Benefits
    Employer shall pay Employee the salary stated in Exhibit A in accordance with Employer's standard payroll practices.

    3. Confidentiality and IP
    Employee agrees to keep Employer's trade secrets confidential and to assign any inventions developed in the scope of employment to Employer.

    4. Termination
    Employer may terminate employment for Cause immediately upon notice.
    Employee may resign with two weeks' notice.
    In either case, all Employer property must be returned upon termination.

    5. Governing Law
    This Agreement is governed by the laws of the State of New York, without regard to conflict of law principles.
    ''',
    "Privacy_Policy.txt": '''
    1. Introduction
    We value your privacy and explain here how we collect, use, and disclose personal data.

    2. Data Collection
    We collect information you provide directly and information collected automatically via cookies and similar technologies.

    3. Data Use
    We use your data to provide and improve services, personalize content, and comply with legal obligations.

    4. Data Retention & Deletion
    We retain personal data for as long as necessary for the purposes described. You may request deletion subject to legal requirements.

    5. Contact
    You can contact our Data Protection Officer at dpo@example.com.
    ''',
    "Vendor_Service_Agreement.txt": '''
    1. Scope of Services
    Vendor shall provide the services outlined in Statement of Work(s).

    2. Termination for Convenience
    Client may terminate this Agreement for convenience upon thirty (30) days' prior written notice.
    Upon such termination, Vendor shall be paid for services performed up to the date of termination.

    3. Confidentiality
    Vendor shall not disclose Client's Confidential Information except as necessary to perform services.

    4. Indemnification
    Vendor shall indemnify and hold Client harmless from third-party claims arising out of Vendor's negligence or willful misconduct.
    ''',
    "Partner_Reseller_Agreement.txt": '''
    1. Appointment
    Company appoints Reseller as a non-exclusive reseller of the Products in the Territory.

    2. Orders and Payment
    All orders are subject to acceptance by Company. Payment is due net fifteen (15) days from invoice.
    Late payments may incur finance charges.

    3. Term and Termination
    This Agreement shall remain in effect for one (1) year and automatically renew unless either party gives notice of non-renewal.
    Either party may terminate for material breach not cured within fifteen (15) days after written notice.

    4. Governing Law
    This Agreement will be governed by the laws of England and Wales.
    '''
}

for filename, content in docs.items():
    (DATA_DIR / filename).write_text(textwrap.dedent(content).strip(), encoding="utf-8")

print("Wrote", len(docs), "docs to", DATA_DIR.resolve())


Wrote 5 docs to D:\Dou\EURON\Gen AI Certification Bootcamp\assign_vectordb\data\legal



## 4) Load, clean, chunk


In [14]:

import re, glob
from typing import List, Dict

def load_documents(path="data/legal/*.txt"):
    recs = []
    for fp in glob.glob(path):
        with open(fp, "r", encoding="utf-8") as f:
            recs.append({"path": fp, "text": f.read()})
    return recs

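def clean_text(t: str) -> str:
    t = t.replace("\r", "\n")           # normalize carriage returns
    t = re.sub(r"[ \t]+", " ", t)       # collapse runs of spaces and tabs
    t = re.sub(r"\n{3,}", "\n\n", t)    # at most one blank line in a row
    # Strip leading clause numbers such as "1. ". Note that \s* also matches
    # newlines, so the blank line before each numbered heading is removed too.
    t = re.sub(r"(?m)^\s*\d+\.\s*", "", t)
    return t.strip()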

def split_on_blocks(text: str) -> List[str]:
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

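def window_chunks(blocks: List[str], target_chars=900, overlap_chars=150) -> List[str]:
    """Greedily pack blocks into chunks of about target_chars, carrying a tail overlap."""
    chunks, buf = [], ""
    for b in blocks:
        if not buf:
            buf = b
        elif len(buf) + 2 + len(b) <= target_chars:
            # The +2 accounts for the two-character "\n\n" separator.
            buf = buf + "\n\n" + b
        else:
            chunks.append(buf.strip())
            # Start the next chunk with the last overlap_chars characters of the previous one.
            buf = (buf[-overlap_chars:] + "\n\n" + b) if overlap_chars and len(buf) > overlap_chars else b
    if buf.strip():
        chunks.append(buf.strip())
    return chunks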

def chunk_document(doc: Dict) -> List[Dict]:
    blocks = split_on_blocks(clean_text(doc["text"]))
    chunks = window_chunks(blocks, target_chars=900, overlap_chars=150)
    return [{
        "doc_path": doc["path"],
        "chunk_id": f"{os.path.basename(doc['path'])}::chunk_{i}",
        "text": ch
    } for i, ch in enumerate(chunks)]

docs_raw = load_documents()
all_chunks = []
for d in docs_raw:
    all_chunks.extend(chunk_document(d))

print("Total chunks:", len(all_chunks))


Total chunks: 5
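
Each of the five documents ends up as exactly one chunk. The reason is that the heading-stripping pattern in `clean_text` begins with `\s*`, which also matches newlines, so it consumes the blank line before every numbered heading and leaves `split_on_blocks` nothing to split on. The sketch below reproduces this on a two-clause sample and compares it with one possible variant that only consumes spaces and tabs. `strip_headings_original` and `strip_headings_keep_breaks` are hypothetical names used for illustration, not part of the notebook.

```python
import re

sample = (
    "1. Definitions\n"
    "\"Service\" means the subscription-based software service provided by Vendor.\n"
    "\n"
    "2. Term and Termination\n"
    "This Agreement commences on the Effective Date."
)

def strip_headings_original(t: str) -> str:
    # Same pattern as clean_text above: \s* can swallow the preceding blank line.
    return re.sub(r"(?m)^\s*\d+\.\s*", "", t)

def strip_headings_keep_breaks(t: str) -> str:
    # Hypothetical variant: only spaces and tabs around the number are consumed.
    return re.sub(r"(?m)^[ \t]*\d+\.[ \t]*", "", t)

def split_on_blocks(text: str):
    # Same blank-line splitter as in the notebook cell.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

blocks_original = split_on_blocks(strip_headings_original(sample))
blocks_variant = split_on_blocks(strip_headings_keep_breaks(sample))
print(len(blocks_original), len(blocks_variant))  # prints: 1 2
```

Switching to the variant would give `window_chunks` several blocks per document to pack, so the chunk counts reported above could change.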



## 5) Embed with **EURI**


In [15]:

import numpy as np
from tqdm import tqdm
from euriai.embedding import EuriaiEmbeddingClient

embed_client = EuriaiEmbeddingClient(api_key=EURI_API_KEY)

def embed_texts(texts):
    vecs = []
    for t in tqdm(texts, desc="EURI embedding"):
        v = np.array(embed_client.embed(t), dtype=np.float32)
        v = v / (np.linalg.norm(v) + 1e-12)  # L2 normalize
        vecs.append(v)
    return np.vstack(vecs)

texts = [c["text"] for c in all_chunks]
embeddings = embed_texts(texts)
dim = embeddings.shape[1]
print("Embeddings:", embeddings.shape)


EURI embedding: 100%|██████████| 5/5 [00:17<00:00,  3.41s/it]

Embeddings: (5, 1536)
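
`embed_texts` L2-normalizes every vector. With unit-length vectors, a plain inner product equals cosine similarity, which is why an inner-product index can serve as a cosine search. Here is a minimal, dependency-free sketch of that identity, using toy 3-dimensional lists in place of the 1536-dimensional EURI embeddings. The helper names are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v, eps=1e-12):
    # Same idea as in embed_texts: divide by the norm, with a tiny epsilon
    # so that a zero vector does not cause a division by zero.
    norm = math.sqrt(dot(v, v))
    return [x / (norm + eps) for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

ip_normalized = dot(l2_normalize(a), l2_normalize(b))
cos_ab = cosine(a, b)
print(round(ip_normalized, 6), round(cos_ab, 6))
```

Both values agree up to the effect of the epsilon, so normalizing once at embedding time avoids recomputing norms for every query.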






## 6) Store in FAISS (local) or Qdrant (optional)


In [16]:

faiss = None
if USE_FAISS:
    try:
        import faiss  # optional dependency
    except Exception as e:
        print("FAISS not available, will use NumPy fallback.", e)

if faiss is not None:
    # Inner product over L2-normalized vectors is cosine similarity.
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)
    print("FAISS index size:", index.ntotal)
else:
    index = None
    print("Using NumPy fallback.")

id_to_meta = {i: all_chunks[i] for i in range(len(all_chunks))}

if USE_QDRANT:
    from qdrant_client import QdrantClient
    from qdrant_client.http import models as qmodels
    qclient = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
    qclient.recreate_collection(
        collection_name=QDRANT_COLLECTION,
        vectors_config=qmodels.VectorParams(size=dim, distance=qmodels.Distance.COSINE),
    )
    qclient.upsert(
        collection_name=QDRANT_COLLECTION,
        points=qmodels.Batch(
            ids=list(range(len(embeddings))),
            vectors=embeddings.tolist(),
            payloads=[{
                "doc_path": m["doc_path"],
                "chunk_id": m["chunk_id"],
                "text": m["text"],
            } for m in all_chunks],
        ),
        wait=True,
    )
    print("Pushed to Qdrant:", QDRANT_COLLECTION)


FAISS index size: 5
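
Conceptually, a flat inner-product index scores the query against every stored vector and returns the best `k` rows by position. That is why the notebook keeps `id_to_meta` in the same order as `embeddings`. The following is a dependency-free sketch of that idea with toy unit vectors. `flat_ip_search` is a hypothetical helper and does not describe how FAISS implements the search internally.

```python
import heapq

def flat_ip_search(vectors, query, k):
    """Exhaustive inner-product search: score every row, keep the k best."""
    scored = [
        (sum(x * y for x, y in zip(vec, query)), row_id)
        for row_id, vec in enumerate(vectors)
    ]
    top = heapq.nlargest(k, scored)  # highest score first
    return [score for score, _ in top], [row_id for _, row_id in top]

# Toy unit-length vectors standing in for the normalized embeddings.
vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
]
scores, ids = flat_ip_search(vectors, query=[0.0, 1.0], k=2)
print(ids, scores)  # prints: [1, 2] [1.0, 0.8]
```

The returned ids are row positions, so any metadata lookup has to use the same ordering that was used when the vectors were added.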



## 7) Retrieval helpers


In [17]:

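def search_local(query: str, top_k=5):
    # Embed and L2-normalize the query the same way as the stored chunks.
    qv = np.array(embed_client.embed(query), dtype=np.float32)
    qv = qv / (np.linalg.norm(qv) + 1e-12)

    # Cap top_k at the number of stored chunks: FAISS pads missing hits with
    # id -1, which would raise a KeyError in id_to_meta below.
    top_k = min(top_k, len(id_to_meta))

    if index is not None:
        D, I = index.search(qv.reshape(1, -1), top_k)
        scores, ids = D[0].tolist(), I[0].tolist()
    else:
        # NumPy fallback: brute-force inner products, highest score first.
        sims = embeddings @ qv
        ids = sims.argsort()[-top_k:][::-1].tolist()
        scores = sims[ids].tolist()

    results = []
    for i, s in zip(ids, scores):
        m = id_to_meta[i]
        results.append({
            "score": float(s),
            "doc": os.path.basename(m["doc_path"]),
            "chunk_id": m["chunk_id"],
            "text": m["text"]
        })
    return results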

def pretty(results):
    for r in results:
        print(f"[{r['score']:.3f}] {r['doc']} :: {r['chunk_id']}")
        print(r['text'])
        print("-"*80)

print("Search ready.")


Search ready.



### Try a clause query


In [18]:

results = search_local("termination conditions", top_k=5)
pretty(results)


[0.377] Employment_Agreement.txt :: Employment_Agreement.txt::chunk_0
Employment and Duties
Employee shall perform the duties described in the attached Job Description to the best of Employee's ability.
Compensation and Benefits
Employer shall pay Employee the salary stated in Exhibit A in accordance with Employer's standard payroll practices.
Confidentiality and IP
Employee agrees to keep Employer's trade secrets confidential and to assign any inventions developed in the scope of employment to Employer.
Termination
Employer may terminate employment for Cause immediately upon notice.
Employee may resign with two weeks' notice.
In either case, all Employer property must be returned upon termination.
Governing Law
This Agreement is governed by the laws of the State of New York, without regard to conflict of law principles.
--------------------------------------------------------------------------------
[0.358] Vendor_Service_Agreement.txt :: Vendor_Service_Agreement.txt::chunk_0
Scope of
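
Because `top_k=5` equals the number of stored chunks, the query returns every chunk, ranked, including weak matches. One way to trim the list is a minimum-score cutoff applied to the dictionaries that `search_local` returns. The sketch below assumes a hypothetical helper `filter_by_score` and an arbitrary cutoff of 0.36. The sample hits reuse the two scores visible in the output above, with the text abbreviated.

```python
def filter_by_score(results, min_score):
    """Keep only hits whose similarity is at least min_score, preserving order."""
    return [r for r in results if r["score"] >= min_score]

# The two hits visible in the output above; "text" is abbreviated here.
sample_results = [
    {"score": 0.377, "doc": "Employment_Agreement.txt",
     "chunk_id": "Employment_Agreement.txt::chunk_0",
     "text": "Employment and Duties ..."},
    {"score": 0.358, "doc": "Vendor_Service_Agreement.txt",
     "chunk_id": "Vendor_Service_Agreement.txt::chunk_0",
     "text": "Scope of ..."},
]

# 0.36 is an illustrative cutoff, not a recommended value.
kept = filter_by_score(sample_results, min_score=0.36)
print([r["doc"] for r in kept])  # prints: ['Employment_Agreement.txt']
```

A useful cutoff depends on the embedding model and the data, so it would need to be tuned against real queries.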


## 8) Debug


In [19]:

import pandas as pd
df = pd.DataFrame([
    {
        "id": i,
        "doc": os.path.basename(m["doc_path"]),
        "chunk_id": m["chunk_id"],
        "len": len(m["text"]),
        "preview": m["text"][:120].replace("\\n"," ")
    } for i, m in id_to_meta.items()
]).sort_values(["doc","id"]).reset_index(drop=True)
df.head(10)


|   | id | doc | chunk_id | len | preview |
|---|----|-----|----------|-----|---------|
| 0 | 0 | Employment_Agreement.txt | Employment_Agreement.txt::chunk_0 | 762 | Employment and Duties\nEmployee shall perform ... |
| 1 | 1 | Partner_Reseller_Agreement.txt | Partner_Reseller_Agreement.txt::chunk_0 | 592 | Appointment\nCompany appoints Reseller as a no... |
| 2 | 2 | Privacy_Policy.txt | Privacy_Policy.txt::chunk_0 | 584 | Introduction\nWe value your privacy and explai... |
| 3 | 3 | SaaS_Master_Subscription_Agreement.txt | SaaS_Master_Subscription_Agreement.txt::chunk_0 | 1161 | Definitions\n"Service" means the subscription-... |
| 4 | 4 | Vendor_Service_Agreement.txt | Vendor_Service_Agreement.txt::chunk_0 | 571 | Scope of Services\nVendor shall provide the se... |
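
The `len` column is a quick way to spot chunks that exceed the 900-character target. `window_chunks` never splits a single block, so a block longer than `target_chars` passes through whole. A small sketch of that check follows, using the lengths copied from the table above. `oversized` is a hypothetical helper name.

```python
TARGET_CHARS = 900  # same target passed to window_chunks in the notebook

# (chunk_id, length) pairs copied from the debug table above.
chunk_lengths = [
    ("Employment_Agreement.txt::chunk_0", 762),
    ("Partner_Reseller_Agreement.txt::chunk_0", 592),
    ("Privacy_Policy.txt::chunk_0", 584),
    ("SaaS_Master_Subscription_Agreement.txt::chunk_0", 1161),
    ("Vendor_Service_Agreement.txt::chunk_0", 571),
]

def oversized(lengths, target_chars):
    """Return the ids of chunks longer than target_chars."""
    return [chunk_id for chunk_id, n in lengths if n > target_chars]

flagged = oversized(chunk_lengths, TARGET_CHARS)
print(flagged)  # prints: ['SaaS_Master_Subscription_Agreement.txt::chunk_0']
```

A flagged chunk usually means the block splitter found no paragraph boundary inside that text.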
