## FAISS Vector Database
FAISS (Facebook AI Similarity Search) is a high-performance library for indexing and searching vector embeddings, so that the documents, chunks, or items most similar to a query can be found quickly. It’s commonly used in retrieval-augmented generation (RAG) pipelines to enable fast semantic search over large text corpora.

```bash
pip install faiss-cpu
```
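To make "finding the most similar items" concrete, here is a minimal pure-Python sketch of exhaustive nearest-neighbour search over squared Euclidean distance. It is only an illustration of the idea, not FAISS's implementation; the helper names (`l2_squared`, `top_k`) and the toy vectors are made up for this example.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, k=2):
    """Return (index, distance) pairs for the k vectors closest to the query."""
    scored = [(i, l2_squared(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
toy_vectors = [
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
]

hits = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
for index, distance in hits:
    print(index, round(distance, 4))
```

A library like FAISS performs the same kind of comparison, but with optimized index structures so that it stays fast when there are many high-dimensional vectors.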


## Text Loaders

In [1]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

sample_data = TextLoader('../data/sample_data.txt', encoding='utf-8')
docs = sample_data.load()
docs



[Document(metadata={'source': '../data/sample_data.txt'}, page_content='Document Title: Northwind Health — Customer Ops Knowledge Base (v1.7)\nLast updated: 2026-01-05\nAudience: Support agents, customer success, and operations\n1) Company & Products\nNorthwind Health provides a care coordination platform used by outpatient clinics. The platform has three modules:\n\nIntake: online forms, eligibility checks, insurance capture, and appointment requests.\nCareHub: tasks, care plans, referrals, and secure messaging.\nInsights: dashboards for no-show rates, referral leakage, and care gap tracking.\n\nClinics typically use Intake and CareHub daily. Insights is used weekly by managers.\n2) Common Customer Questions (FAQ)\nQ: How long does it take to onboard a new clinic?\nA: Typical onboarding is 2–4 weeks depending on EHR integration, SSO (single sign-on) setup, and data migration.\nQ: What is the fastest onboarding path?\nA: Use CSV patient import and password-based login first, then add S

In [2]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
split_docs = text_splitter.split_documents(docs)
split_docs

[Document(metadata={'source': '../data/sample_data.txt'}, page_content='Document Title: Northwind Health — Customer Ops Knowledge Base (v1.7)\nLast updated: 2026-01-05\nAudience: Support agents, customer success, and operations\n1) Company & Products\nNorthwind Health provides a care coordination platform used by outpatient clinics. The platform has three modules:\n\nIntake: online forms, eligibility checks, insurance capture, and appointment requests.\nCareHub: tasks, care plans, referrals, and secure messaging.\nInsights: dashboards for no-show rates, referral leakage, and care gap tracking.'),
 Document(metadata={'source': '../data/sample_data.txt'}, page_content='Clinics typically use Intake and CareHub daily. Insights is used weekly by managers.\n2) Common Customer Questions (FAQ)\nQ: How long does it take to onboard a new clinic?\nA: Typical onboarding is 2–4 weeks depending on EHR integration, SSO (single sign-on) setup, and data migration.\nQ: What is the fastest onboarding pat
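The chunks above break at paragraph boundaries rather than at exactly 1000 characters. A rough sketch of why, assuming the splitter's default `"\n\n"` separator: the text is split into paragraphs, which are then packed greedily into chunks that stay within `chunk_size`. This is a simplified illustration, not LangChain's actual code; the function name `split_text` is hypothetical and `chunk_overlap` handling is left out.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece longer than chunk_size is kept
    whole as its own chunk. Overlap between chunks is not modelled here."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate      # piece still fits (or starts a new chunk)
        else:
            chunks.append(current)   # close the full chunk
            current = piece          # start the next chunk with this piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccccccc\n\ndd"
chunks = split_text(sample, chunk_size=10)
for chunk in chunks:
    print(repr(chunk))
```

With `chunk_size=10`, the first two paragraphs fit together, while the third would overflow and therefore starts a new chunk.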

## Embeddings

In [3]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "mixedbread-ai/mxbai-embed-large-v1"
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
)

## Store into FAISS

In [4]:
from langchain_community.vectorstores import FAISS
db = FAISS.from_documents(split_docs, embeddings)
db

<langchain_community.vectorstores.faiss.FAISS at 0x174562b4ec0>

## Querying the FAISS db

In [6]:
query = "How long does it take to onboard a new clinic?"
results = db.similarity_search(query)

for i, r in enumerate(results, 1):
    print(f"\n--- Hit {i} ---")
    print("source:", r.metadata.get("source"))
    print(r.page_content[:600])


--- Hit 1 ---
source: ../data/sample_data.txt
Clinics typically use Intake and CareHub daily. Insights is used weekly by managers.
2) Common Customer Questions (FAQ)
Q: How long does it take to onboard a new clinic?
A: Typical onboarding is 2–4 weeks depending on EHR integration, SSO (single sign-on) setup, and data migration.
Q: What is the fastest onboarding path?
A: Use CSV patient import and password-based login first, then add SSO and EHR integration later.
Q: Does the system support multiple locations?
A: Yes. A single tenant can support multiple locations with location-based roles and reporting filters.
Q: Can patients message prov

--- Hit 2 ---
source: ../data/sample_data.txt
Discounts:

Annual prepay may qualify for a discount.
Multi-clinic contracts may qualify for tiered pricing.
Important: Sales owns final pricing approval. Support should not quote numbers; refer to the account owner.

4) Support & Escalation
Support hours are 8am–8pm ET, Monday–Friday (excluding US holid

In [7]:
query = "What are the most common SSO misconfiguration issues?"
results = db.similarity_search(query, k=2)

for i, r in enumerate(results, 1):
    print(f"\n--- Hit {i} ---")
    print("source:", r.metadata.get("source"))
    print(r.page_content[:600])


--- Hit 1 ---
source: ../data/sample_data.txt
Password login: verify email, check lockout status, confirm password reset email delivery.
SSO login: confirm they used the correct SSO URL, verify IdP status, validate SAML attributes.
Multi-location user: ensure they have at least one active role assignment.

If multiple users at the same clinic are impacted, treat as SEV-2 or SEV-1 depending on scope.
9) Glossary

Tenant: One customer environment (may include multiple clinic locations)
PHI: Protected Health Information
HL7 ADT: Healthcare messaging feed for admissions/discharges/transfers
IdP: Identity Provider used for SSO

--- Hit 2 ---
source: ../data/sample_data.txt
incorrect ACS URL configured in the IdP
missing email attribute mapping
clock drift causing assertion expiry

7) Recent Incident Summary (for retrieval testing)
Incident INC-20491 (Resolved):
On 2025-12-11, some clinics experienced delayed task notifications in CareHub. Root cause was a queue backlog after a deployment. 

## Check scores

In [13]:
query = "What are the most common SSO misconfiguration issues?"
# Returns (Document, score) pairs; with the default L2 index the score is a distance, so lower means closer.
results_with_scores = db.similarity_search_with_score(query, k=2)
results_with_scores

[(Document(id='beb463ad-a840-4ee2-9676-63e71e20dddb', metadata={'source': '../data/sample_data.txt'}, page_content='Password login: verify email, check lockout status, confirm password reset email delivery.\nSSO login: confirm they used the correct SSO URL, verify IdP status, validate SAML attributes.\nMulti-location user: ensure they have at least one active role assignment.\n\nIf multiple users at the same clinic are impacted, treat as SEV-2 or SEV-1 depending on scope.\n9) Glossary\n\nTenant: One customer environment (may include multiple clinic locations)\nPHI: Protected Health Information\nHL7 ADT: Healthcare messaging feed for admissions/discharges/transfers\nIdP: Identity Provider used for SSO'),
  np.float32(222.73239)),
 (Document(id='8883c3b1-bafe-4463-bc2b-cb29d03391e0', metadata={'source': '../data/sample_data.txt'}, page_content='incorrect ACS URL configured in the IdP\nmissing email attribute mapping\nclock drift causing assertion expiry\n\n7) Recent Incident Summary (for

In [None]:
query = "How long does it take to onboard a new clinic?"
results_with_scores = db.similarity_search_with_score(query, k=2)
results_with_scores

[(Document(id='03fd6cfc-7d3d-4358-a004-e4ce341a63c1', metadata={'source': '../data/sample_data.txt'}, page_content='Clinics typically use Intake and CareHub daily. Insights is used weekly by managers.\n2) Common Customer Questions (FAQ)\nQ: How long does it take to onboard a new clinic?\nA: Typical onboarding is 2–4 weeks depending on EHR integration, SSO (single sign-on) setup, and data migration.\nQ: What is the fastest onboarding path?\nA: Use CSV patient import and password-based login first, then add SSO and EHR integration later.\nQ: Does the system support multiple locations?\nA: Yes. A single tenant can support multiple locations with location-based roles and reporting filters.\nQ: Can patients message providers?\nA: Yes in CareHub secure messaging. Patients can message a care team inbox; clinics can configure auto-responses outside business hours.\n3) Pricing & Packaging (Customer-safe summary)\nNorthwind Health pricing has two dimensions:\n\nBase subscription by clinic size (

In [19]:
print(type(db.index))                 # e.g., <class 'faiss.swigfaiss_avx2.IndexFlatL2'>
print("metric_type:", db.index.metric_type)  # 1 corresponds to faiss.METRIC_L2

<class 'faiss.swigfaiss_avx2.IndexFlatL2'>
metric_type: 1
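Because the index is an `IndexFlatL2`, the scores returned earlier are squared L2 distances, so smaller values mean closer matches. The sketch below (pure Python, with made-up 2-dimensional vectors) illustrates two properties of that metric: raw scores grow with vector magnitude, and for unit-length vectors the squared distance equals `2 - 2 * cosine_similarity`. The helper names are hypothetical.

```python
import math


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query_vec = [3.0, 4.0]   # toy values, not real embeddings
doc_vec = [4.0, 3.0]

raw = l2_squared(query_vec, doc_vec)
# Scaling both vectors by 10 multiplies the squared distance by 100.
scaled = l2_squared([10 * x for x in query_vec], [10 * x for x in doc_vec])
# After normalization the score depends only on the angle between the vectors.
unit = l2_squared(normalize(query_vec), normalize(doc_vec))
cos = cosine(query_vec, doc_vec)

print(raw, scaled)
print(round(unit, 6), round(2 - 2 * cos, 6))
```

This is one reason absolute score values are hard to interpret across embedding models: they are best used for ranking results within a single index.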


## Embedding Vectors

In [20]:
query = "How long does it take to onboard a new clinic?"
embedding_vectors = embeddings.embed_query(query)
embedding_vectors

[0.14561450481414795,
 0.405779093503952,
 -0.2536205053329468,
 -0.16193179786205292,
 -0.28705596923828125,
 -0.8669158220291138,
 0.7798689007759094,
 0.2769632637500763,
 0.3799178898334503,
 0.28103864192962646,
 -0.08775310218334198,
 0.060962945222854614,
 0.18076907098293304,
 -0.02082127332687378,
 0.07776160538196564,
 0.2913074493408203,
 0.04524673521518707,
 -0.07812894135713577,
 -0.29008540511131287,
 0.560136079788208,
 -0.3429751396179199,
 0.6954990029335022,
 -0.9472960233688354,
 -0.214829683303833,
 -0.3737434446811676,
 0.6026571393013,
 0.6627659797668457,
 0.04324892908334732,
 1.4791929721832275,
 0.8352136015892029,
 -0.8940722942352295,
 -0.1153833344578743,
 0.5985071063041687,
 -0.19152067601680756,
 0.10974203050136566,
 -0.21487894654273987,
 -0.4382695257663727,
 -0.7615188360214233,
 -0.8425277471542358,
 -0.2977340817451477,
 0.048428937792778015,
 -0.04688023403286934,
 0.0729060098528862,
 0.22459286451339722,
 -0.7486161589622498,
 0.372250407934188

## Save to local

In [24]:
db.save_local("faiss_index")

## Load from local

In [26]:
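# load_local is a classmethod that returns a new store, so assign the result.
# The docstore is saved with pickle, hence the explicit opt-in flag; only load index files you trust.
db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
db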

<langchain_community.vectorstores.faiss.FAISS at 0x1745e012990>

In [27]:
query = "How long does it take to onboard a new clinic?"
results = db.similarity_search(query)
for i, r in enumerate(results, 1):
    print(f"\n--- Hit {i} ---")
    print("source:", r.metadata.get("source"))
    print(r.page_content[:600])


--- Hit 1 ---
source: ../data/sample_data.txt
Clinics typically use Intake and CareHub daily. Insights is used weekly by managers.
2) Common Customer Questions (FAQ)
Q: How long does it take to onboard a new clinic?
A: Typical onboarding is 2–4 weeks depending on EHR integration, SSO (single sign-on) setup, and data migration.
Q: What is the fastest onboarding path?
A: Use CSV patient import and password-based login first, then add SSO and EHR integration later.
Q: Does the system support multiple locations?
A: Yes. A single tenant can support multiple locations with location-based roles and reporting filters.
Q: Can patients message prov

--- Hit 2 ---
source: ../data/sample_data.txt
Discounts:

Annual prepay may qualify for a discount.
Multi-clinic contracts may qualify for tiered pricing.
Important: Sales owns final pricing approval. Support should not quote numbers; refer to the account owner.

4) Support & Escalation
Support hours are 8am–8pm ET, Monday–Friday (excluding US holid