# Text Embedding with LangChain and HuggingFace

* Add your HuggingFace token to your Colab Secrets (name it `HF_TOKEN`, the name the code below reads)
* If needed, read it in code with:
  ```python
  from google.colab import userdata
  userdata.get('HF_TOKEN')
  ```
* Upload a PDF named `sample.pdf` to Colab's session storage (Files)
* Embedding takes 4+ minutes, depending on the PDF's size and the runtime

In [None]:
!pip install langchain-community langchain langchain-text-splitters pypdf sentence-transformers chromadb

## Text Embedding

Swapping the embedding model is **relatively easy** from a code perspective, but it affects the stored vectors, cost, and processing time (see Important Gotchas below). The main options:

### Different HuggingFace Model
```python
from langchain_community.embeddings import HuggingFaceEmbeddings

# Only the model name changes
embeddings = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-mpnet-base-v2'  # 768-dim; higher quality but slower than all-MiniLM-L6-v2
    # or 'BAAI/bge-small-en-v1.5'  # Popular alternative
)
```

### OpenAI Embeddings
```python
# Requires `pip install langchain-openai` and the OPENAI_API_KEY environment variable
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"  # or text-embedding-3-large
)
```

### Other Providers
```python
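# Cohere (requires `pip install langchain-cohere` and COHERE_API_KEY)
from langchain_cohere import CohereEmbeddings
embeddings = CohereEmbeddings(model="embed-english-v3.0")

# Voyage AI (requires `pip install langchain-voyageai` and VOYAGE_API_KEY)
from langchain_voyageai import VoyageAIEmbeddings
embeddings = VoyageAIEmbeddings(model="voyage-2")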
```
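
Because each option above is a LangChain embeddings class with the same `embed_query`/`embed_documents` interface, the provider choice can be isolated in one place. A minimal sketch, assuming a hypothetical `make_embeddings` helper and `PROVIDERS` table (neither is part of LangChain); the module and class names are the ones used above, and each provider's package and API key still have to be set up separately:

```python
import importlib

# Hypothetical lookup table: provider -> (module, class, name of the model keyword).
PROVIDERS = {
    "huggingface": ("langchain_community.embeddings", "HuggingFaceEmbeddings", "model_name"),
    "openai": ("langchain_openai", "OpenAIEmbeddings", "model"),
    "cohere": ("langchain_cohere", "CohereEmbeddings", "model"),
    "voyage": ("langchain_voyageai", "VoyageAIEmbeddings", "model"),
}


def make_embeddings(provider, model):
    """Build an embeddings object for `provider`, importing its package lazily."""
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown provider {provider!r}; expected one of {sorted(PROVIDERS)}")
    module_name, class_name, model_kwarg = PROVIDERS[provider]
    # Import only the selected integration, so the other packages need not be installed.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)(**{model_kwarg: model})
```

With this in place, a call such as `make_embeddings("huggingface", "sentence-transformers/all-MiniLM-L6-v2")` would replace the direct constructor call, and switching providers becomes a change to two string arguments.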

## Important Gotchas

1. **Embedding Dimensions Must Match**: Different models have different dimensions:
   - `all-MiniLM-L6-v2`: 384 dimensions
   - `all-mpnet-base-v2`: 768 dimensions
   - `text-embedding-3-small`: 1536 dimensions
   
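   **You MUST re-create your ChromaDB from scratch** if you change models: a collection can't mix embeddings of different dimensions, and even two models with the same dimension produce vectors in unrelated spaces, so similarity scores across them are meaningless.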

2. **Cost**: OpenAI, Cohere, and Voyage AI are hosted APIs that charge per token; HuggingFace models run locally at no per-token cost, but their speed depends on your hardware

3. **Performance**: Changing models means re-running the entire embedding process for your existing PDF
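
One way to avoid mixing vectors from different models is to give each model its own storage directory and to check the vector length before reusing a store. The sketch below is an illustration only: `persist_dir_for`, `embedding_dim`, and `FakeEmbeddings` are hypothetical names introduced here, and `FakeEmbeddings` stands in for a real embeddings object so the example runs without downloading a model. The only interface assumed is an `embed_query` method, as used elsewhere in this notebook.

```python
import re


def persist_dir_for(model_name, base="emb"):
    """Derive a per-model directory name, e.g. to pass as Chroma's persist_directory."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", model_name).strip("-").lower()
    return f"{base}-{slug}"


def embedding_dim(embeddings, probe="dimension probe"):
    """Return the vector length produced by an object with an embed_query method."""
    return len(embeddings.embed_query(probe))


class FakeEmbeddings:
    """Stand-in for a real embeddings object (hypothetical, for illustration)."""

    def __init__(self, dim):
        self.dim = dim

    def embed_query(self, text):
        return [0.0] * self.dim


old, new = FakeEmbeddings(384), FakeEmbeddings(768)
if embedding_dim(old) != embedding_dim(new):
    target = persist_dir_for("sentence-transformers/all-mpnet-base-v2")
    print("Dimensions differ: rebuild the store in", target)
```

A per-model directory also covers the case where two models share a dimension, which a length check alone would not catch.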

## Bottom Line

**Code change**: 2 lines  
**Effort**: Re-embedding all documents (depends on your data size)  
**Overall difficulty**: 3/10 - Easy but requires re-processing your data

In [None]:
# from langchain_community.document_loaders import PyPDFLoader
# from langchain_text_splitters import CharacterTextSplitter
# from langchain_community.embeddings import HuggingFaceEmbeddings
# # from langchain_openai import OpenAIEmbeddings
# from langchain_community.vectorstores import Chroma
# import pandas as pd
# # from google.colab import userdata

# # Initialize HuggingFace embeddings
# embeddings = HuggingFaceEmbeddings(
#     model_name='sentence-transformers/all-MiniLM-L6-v2'
# )

# # embeddings = HuggingFaceEmbeddings(
# #     model_name='sentence-transformers/all-MiniLM-L6-v2',
# #     model_kwargs={'use_auth_token': userdata.get("HF_TOKEN")}
# # )

# # embeddings = OpenAIEmbeddings()

# text_splitter = CharacterTextSplitter(
#     separator="\n",
#     chunk_size=200,
#     chunk_overlap=100
# )

# pdf_path = "sample.pdf"
# loader = PyPDFLoader(pdf_path)
# docs = loader.load_and_split(text_splitter=text_splitter)

# # Add PDF filename as metadata
# for doc in docs:
#     doc.metadata["source"] = pdf_path

# # Save to Chroma
# db = Chroma.from_documents(
#     docs,
#     embedding=embeddings,
#     persist_directory="emb"
# )

# # Build DataFrame (embed all chunks in one batched call instead of one call per chunk)
# chunk_texts = [doc.page_content for doc in docs]
# chunk_embeddings = embeddings.embed_documents(chunk_texts)

# chunk_data = [
#     {'chunk': text, 'embedding': emb, 'metadata': doc.metadata}
#     for text, emb, doc in zip(chunk_texts, chunk_embeddings, docs)
# ]

# df = pd.DataFrame(chunk_data)


In [None]:
# # ask some questions
# import numpy as np

# def cosine_similarity(vec1, vec2):
#     """Calculate cosine similarity between two vectors"""
#     return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# def search_query(query, k=4):
#     """Search for a query and display results"""
#     query_embedding = embeddings.embed_query(query)

#     # ChromaDB results (the score is a distance, so lower means more similar)
#     results_with_scores = db.similarity_search_with_score(query, k=k)

#     print("\n" + "="*50)
#     print(f"QUERY: {query}")
#     print("="*50)

#     for doc, score in results_with_scores:
#         print(f"\nScore: {score:.4f}")
#         print(f"Content: {doc.page_content[:200]}...")

#     # DataFrame results
#     df['similarity_score'] = df['embedding'].apply(
#         lambda emb: cosine_similarity(query_embedding, emb)
#     )

#     df_sorted = df.sort_values('similarity_score', ascending=False)

#     print("\n" + "="*50)
#     print("TOP MATCHES IN DATAFRAME:")
#     print("="*50)

#     for idx, row in df_sorted.head(k).iterrows():
#         print(f"\nSimilarity Score: {row['similarity_score']:.4f}")
#         print(f"Chunk: {row['chunk'][:200]}...")

#     return df_sorted

# # ask away
# df_sorted = search_query("What is maximum likelihood estimate (MLE)?")
# # df_sorted = search_query("What is expected value?")
# # df_sorted = search_query("What is the binomial distribution?")
# df_sorted.head()


QUERY: What is maximum likelihood estimate (MLE)?

Score: 0.5162
Content: fundamentally introducing the logic of MLE. Maximum Likelihood Estimation (MLE) is a
general method for estimating the parameters of a statistical model. MLE works by...

Score: 0.5855
Content: equivalent to finding the Maximum Likelihood Estimate (MLE) for the model's
parameters:
Maximum Likelihood Principle: The MLE criterion demands that you choose the model...

Score: 0.6208
Content: Maximum Likelihood Estimate (MLE): The most likely value for  is the sample
mean (the observed proportion of successes).
Important Distinction: Bernoulli vs. Binomial...

Score: 0.6273
Content: true parameters—a concept known as Maximum Likelihood Estimation (MLE). This
connection to the Normal Distribution is what cemented the method's theoretical power....

TOP MATCHES IN DATAFRAME:

Similarity Score: 0.7419
Chunk: fundamentally introducing the logic of MLE. Maximum Likelihood Estimation (MLE) is a
general method for estimatin

| | chunk | embedding | metadata | similarity_score |
|---|---|---|---|---|
| 815 | fundamentally introducing the logic of MLE. Ma... | [0.012163488194346428, -0.10281084477901459, 0... | {'producer': 'Skia/PDF m142', 'creator': 'Mozi... | 0.741892 |
| 797 | equivalent to finding the Maximum Likelihood E... | [-0.022549554705619812, -0.06762761622667313, ... | {'producer': 'Skia/PDF m142', 'creator': 'Mozi... | 0.707246 |
| 899 | Maximum Likelihood Estimate (MLE): The most li... | [0.05918924883008003, -0.08582346141338348, 0.... | {'producer': 'Skia/PDF m142', 'creator': 'Mozi... | 0.689578 |
| 733 | true parameters—a concept known as Maximum Lik... | [-0.05203985050320625, -0.06702414155006409, 0... | {'producer': 'Skia/PDF m142', 'creator': 'Mozi... | 0.686336 |
| 814 | linking the least squares method to the Normal... | [-0.00782845076173544, -0.07579118013381958, 0... | {'producer': 'Skia/PDF m142', 'creator': 'Mozi... | 0.661107 |
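
The ChromaDB scores and the DataFrame scores rank the same chunks in the same order even though the numbers differ, because they measure different things. The ChromaDB score appears to be a squared Euclidean distance (lower is better), while the DataFrame score is a cosine similarity (higher is better). For unit-length vectors the two are related by `squared_l2 = 2 * (1 - cosine)`. The sketch below checks that identity on toy vectors and then converts the cosine scores from the output above. It assumes the embeddings are normalized to unit length and that the collection uses a squared-L2 metric, neither of which the notebook states explicitly.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy unit vectors (made up for illustration, not real embeddings).
u = normalize([1.0, 2.0, 3.0])
v = normalize([2.0, 1.0, 4.0])
assert math.isclose(squared_l2(u, v), 2 * (1 - cosine_similarity(u, v)))

# Cosine scores from the DataFrame output, converted to squared L2 distance.
for cos in (0.741892, 0.707246, 0.689578, 0.686336):
    print(f"cosine {cos:.4f} -> squared L2 {2 * (1 - cos):.4f}")
```

The converted values come out as 0.5162, 0.5855, 0.6208, and 0.6273, matching the four ChromaDB scores printed for the same query, which supports the assumption that both searches use the same embeddings and differ only in the metric reported.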


In [None]:
# df_sorted.at[815, 'metadata']

{'producer': 'Skia/PDF m142',
 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36',
 'creationdate': '2025-11-12T21:58:34+00:00',
 'title': 'Historical_Setup.ipynb - Colab',
 'moddate': '2025-11-12T21:58:34+00:00',
 'source': 'sample.pdf',
 'total_pages': 112,
 'page': 46,
 'page_label': '47'}