- Install sentence-transformers
- Vectorize the HTML content from the DB entries
- Store the embeddings (one per model, three in total) in the DB

*** add sentence-transformers to the requirements ***

## Imports

In [8]:
!pip install sentence_transformers
!pip install pymongo==4.11.2
!pip install numpy

Collecting pymongo==4.11.2
  Downloading pymongo-4.11.2-cp39-cp39-macosx_11_0_arm64.whl (731 kB)
Installing collected packages: pymongo
  Attempting uninstall: pymongo
    Found existing installation: pymongo 4.11.3
    Uninstalling pymongo-4.11.3:
      Successfully uninstalled pymongo-4.11.3
Successfully installed pymongo-4.11.2


In [7]:
from sentence_transformers import SentenceTransformer
from pymongo import MongoClient
import numpy as np



## Functions to create embeddings

In [9]:
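def create_all_MiniLM_L6_v2_embedding(text):
    """Return the all-MiniLM-L6-v2 embedding of text (a 384-dimensional vector)."""
    # The model is loaded on every call, which is slow when this runs in a loop.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embedding = model.encode(text)

    # Fail fast if the vector does not have the size this model should produce.
    if len(embedding) != 384:
        raise ValueError(f"Unexpected embedding length: {len(embedding)}. Expected length: 384.")

    return embedding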

In [10]:
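def create_paraphrase_MiniLM_L6_v2_embedding(text):
    """Return the paraphrase-MiniLM-L6-v2 embedding of text (a 384-dimensional vector)."""
    # The model is loaded on every call, which is slow when this runs in a loop.
    model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
    embedding = model.encode(text)

    # Fail fast if the vector does not have the size this model should produce.
    if len(embedding) != 384:
        raise ValueError(f"Unexpected embedding length: {len(embedding)}. Expected length: 384.")

    return embedding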

In [11]:
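def create_all_distilroberta_v1_embedding(text):
    """Return the all-distilroberta-v1 embedding of text (a 768-dimensional vector)."""
    # The model is loaded on every call, which is slow when this runs in a loop.
    model = SentenceTransformer("all-distilroberta-v1")
    embedding = model.encode(text)

    # Fail fast if the vector does not have the size this model should produce.
    if len(embedding) != 768:
        raise ValueError(f"Unexpected embedding length: {len(embedding)}. Expected length: 768.")

    return embedding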
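
The three functions above differ only in the model name and the expected vector size, and each one reloads its model on every call. A minimal sketch of one possible consolidation follows, assuming a single helper is acceptable. The names `EXPECTED_DIMS`, `load_model`, and `create_embedding` are hypothetical and not part of the notebook; the dimensions are copied from the checks above. The `loader` parameter is also an assumption, added so the helper can be exercised without downloading a model.

```python
from functools import lru_cache

# Expected vector sizes, copied from the length checks in the functions above.
EXPECTED_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "paraphrase-MiniLM-L6-v2": 384,
    "all-distilroberta-v1": 768,
}


@lru_cache(maxsize=None)
def load_model(model_name):
    """Load a model once and reuse it on later calls (hypothetical helper)."""
    # Imported lazily so that defining this function does not require the library.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


def create_embedding(text, model_name, loader=load_model):
    """Embed text with the named model and verify the vector length."""
    if model_name not in EXPECTED_DIMS:
        raise ValueError(f"Unknown model: {model_name}")

    embedding = loader(model_name).encode(text)

    expected = EXPECTED_DIMS[model_name]
    if len(embedding) != expected:
        raise ValueError(
            f"Unexpected embedding length: {len(embedding)}. Expected length: {expected}."
        )
    return embedding
```

With this sketch, `create_embedding(text, "all-MiniLM-L6-v2")` would stand in for the first function above, and the cached loader means each model is constructed only once per session.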

## Sample Embeddings

In [4]:
text = "App Router: Getting Started | Next.js Menu Using App Router Features available in /app Using Latest Version 15.2.1 Introduction App Router Getting Started Getting Started Installation Create a new Next.js application with the `create-next-app` CLI, and set up TypeScript, ESLint, and Module Path Aliases. Project Structure An overview of the folder and file conventions in Next.js, and how to organize your project. Layouts and Pages Create your first pages and layouts, and link between them. Images and Fonts Learn how to optimize images and fonts. CSS Learn about the different ways to add CSS to your application, including CSS Modules, Global CSS, Tailwind CSS, and more. Fetching Data Start fetching data and streaming content in your application. Updating Data Learn how to update data in your Next.js application. Error Handling Learn how to display expected errors and handle uncaught exceptions."
# Run the import cell and the function cells above first; otherwise this cell raises a NameError.
all_MiniLM_L6_v2_embedding = create_all_MiniLM_L6_v2_embedding(text)
paraphrase_MiniLM_L6_v2_embedding = create_paraphrase_MiniLM_L6_v2_embedding(text)
all_distilroberta_v1_embedding = create_all_distilroberta_v1_embedding(text)

print(all_MiniLM_L6_v2_embedding)
print(paraphrase_MiniLM_L6_v2_embedding)
print(all_distilroberta_v1_embedding)

NameError: name 'SentenceTransformer' is not defined

## Append Embeddings to Docs in DB

In [None]:
client = MongoClient('mongodb+srv://bxrodgers1:CS4675@cluster0.6u3n5.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0')
db = client['web_crawler']
collection = db['crawl_data']

documents = collection.find()

for document in documents:
    html_text = document.get("html", "")

    if all(key in document for key in ["all-MiniLM-L6-v2", "paraphrase-MiniLM-L6-v2", "all-distilroberta-v1"]):
        continue
    
    # Convert each NumPy vector to a plain list so it can be stored in MongoDB.
    all_MiniLM_L6_v2_embedding = create_all_MiniLM_L6_v2_embedding(html_text).tolist()
    paraphrase_MiniLM_L6_v2_embedding = create_paraphrase_MiniLM_L6_v2_embedding(html_text).tolist()
    all_distilroberta_v1_embedding = create_all_distilroberta_v1_embedding(html_text).tolist()

    # Store each embedding under a field named after the model that produced it.
    collection.update_one(
        {"_id": document["_id"]},
        {"$set": {
            "all-MiniLM-L6-v2": all_MiniLM_L6_v2_embedding,
            "paraphrase-MiniLM-L6-v2": paraphrase_MiniLM_L6_v2_embedding,
            "all-distilroberta-v1": all_distilroberta_v1_embedding
        }}
    )

    print(f"Updated document with _id: {document['_id']}")
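
The loop above fetches every document and skips the ones that already have all three fields. A minimal sketch of an alternative follows, assuming it is acceptable to let MongoDB do the filtering with the `$exists` query operator so that only documents missing at least one embedding are returned. The names `MODEL_FIELDS`, `missing_embeddings_filter`, and `build_update` are hypothetical helpers, not part of the notebook; they only build plain dictionaries, so they can be checked without a database connection.

```python
# Field names used in the update above, one per model.
MODEL_FIELDS = ["all-MiniLM-L6-v2", "paraphrase-MiniLM-L6-v2", "all-distilroberta-v1"]


def missing_embeddings_filter(fields=MODEL_FIELDS):
    """Build a query that matches documents lacking at least one of the fields."""
    return {"$or": [{field: {"$exists": False}} for field in fields]}


def build_update(embeddings):
    """Build a $set update from a {field_name: vector} mapping.

    Each vector is converted to a list of Python floats so it can be stored.
    """
    return {
        "$set": {field: [float(x) for x in vector] for field, vector in embeddings.items()}
    }


# Hypothetical usage with the collection opened in the cell above:
# for document in collection.find(missing_embeddings_filter()):
#     update = build_update({...})  # one vector per field in MODEL_FIELDS
#     collection.update_one({"_id": document["_id"]}, update)
```

This keeps the same skip rule as the `all(key in document ...)` check, but already-processed documents are never transferred to the notebook.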