# Blog search

This notebook sets up a script that crawls the content of my website, stores embeddings of the text in a FAISS index, and uses that index for search.

This is just an experiment. I don't intend to add it to my website as a search option.

In [5]:
import requests
from bs4 import BeautifulSoup
from transformers import BertTokenizer, BertModel
import faiss
import numpy as np

In [8]:
def get_article_content(url: str):
    if not url:
        raise ValueError("Invalid URL provided.")
    
    try:
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
        response.raise_for_status()  # Raise an exception for 4xx/5xx status codes

        soup = BeautifulSoup(response.content, 'html.parser')
        article_content = soup.find("main", class_="article-full")

        if article_content:
            return article_content.text.strip()  # Extract and clean the text content
        else:
            print(f"Content not found in <main> tag with class 'article-full' on {url}.")
            return None

    except (requests.exceptions.RequestException, ValueError) as e:
        print(f"Error crawling {url}: {e}")
        return None

In [3]:
def chunk_text(text, chunk_size=1000):
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks
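
Fixed-width slicing can cut a word or sentence in half at a chunk boundary. A minimal sketch of one alternative, assuming word-level chunks with a small overlap between neighbours. The name `chunk_words` and its parameters `chunk_size` and `overlap` are hypothetical and not part of this notebook:

```python
def chunk_words(text, chunk_size=150, overlap=30):
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words (hypothetical helper).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
print(chunk_words(sample, chunk_size=4, overlap=1))
```

The overlap means a phrase that straddles a boundary still appears whole in at least one chunk, at the cost of storing a few more vectors.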

In [6]:
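def encode_text(text, tokenizer, model):
    # Tokenize, truncating to BERT's 512-token input limit
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model(**inputs)
    # Mean-pool the token embeddings into a single vector for the whole text
    encoded_vector = outputs.last_hidden_state.mean(dim=1).squeeze().detach().numpy()
    return encoded_vector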

In [11]:
article_url = "https://www.amitavroy.com/articles/beyond-boundaries-how-frankenphp-redefines-php-application-runtimes-2024-01-01"
article_content = get_article_content(article_url)
text_chunks = chunk_text(article_content or "")  # get_article_content returns None on failure

In [12]:
# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

In [13]:
index = faiss.IndexFlatIP(model.config.hidden_size)  # 768 for bert-base-uncased; inner-product index
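
`IndexFlatIP` ranks by raw inner product, so the length of a vector affects its score as well as its direction. If cosine similarity is what is wanted, a common approach is to L2-normalize both the stored vectors and the query before adding and searching. A small NumPy sketch of that idea, using made-up 3-dimensional vectors rather than real BERT output; `l2_normalize` is a hypothetical helper name:

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length; all-zero rows are left unchanged."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)


# Toy data: the second document points the same way as the query,
# the first one is simply much longer.
docs = np.array([[10.0, 0.0, 0.0], [1.0, 1.0, 0.0]], dtype=np.float32)
query = np.array([[1.0, 1.0, 0.0]], dtype=np.float32)

raw_scores = docs @ query.T                              # inner product favours the long vector
cos_scores = l2_normalize(docs) @ l2_normalize(query).T  # cosine similarity, in [-1, 1]

print(raw_scores.ravel())
print(cos_scores.ravel())
```

With normalized vectors the scores fall between -1 and 1, which makes them easier to compare across queries. FAISS also provides `faiss.normalize_L2`, which normalizes a float32 array in place and could be used instead of the helper above.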

In [15]:
encoded_vectors = []
for chunk in text_chunks:
    encoded_vector = encode_text(chunk, tokenizer, model)
    index.add(np.array([encoded_vector]))
    encoded_vectors.append(encoded_vector)

In [18]:
# Search for similar content
search_query = "build binary"
query_vector = encode_text(search_query, tokenizer, model)

In [19]:
D, I = index.search(np.array([query_vector]), k=5)  # Adjust k based on the number of similar documents you want
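
For a flat inner-product index, the search amounts to an exhaustive comparison: score every stored vector against the query and keep the `k` largest. A rough NumPy equivalent on toy data, as an illustration only; `brute_force_search` is a hypothetical name, and unlike FAISS it returns fewer than `k` results rather than padding the output with `-1`:

```python
import numpy as np


def brute_force_search(stored, query, k=5):
    """Return (scores, indices) of the k stored rows with the largest inner product."""
    stored = np.asarray(stored, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    scores = stored @ query                          # one inner product per stored vector
    order = np.argsort(-scores, kind="stable")[:k]   # highest score first
    return scores[order], order


stored = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
scores, indices = brute_force_search(stored, np.array([1.0, 1.0]), k=2)
print(indices, scores)
```

Here `scores` plays the role of `D[0]` and `indices` the role of `I[0]` for a single query.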

In [20]:
print("Similar documents:")
for score, idx in zip(D[0], I[0]):
    if idx < 0:  # FAISS pads with -1 when the index holds fewer than k vectors
        continue
    print(f"Similarity Score: {score}, Content: {text_chunks[idx]}")

Similar documents:
Similarity Score: 34.817176818847656, Content: What is FrankenPHP?
FrankenPHP is a contemporary PHP app server crafted in Go, boasting numerous advantages. Among its standout features, one that truly captures the spotlight is its remarkable ability to construct a standalone and self-contained binary. This particular feature holds immense significance, particularly in the realm of shipping PHP applications to production.
FrankenPHP is a modern PHP app server written in Go. Here are some points which really sets it apart from others in the race

It’s about 3.5x faster than PHP FPM.
Written in Go and C, it relies on Go’s iconic goroutines features
It uses Caddy server under the hood. So, HTTPS support for local development is also available out of the box
It has native support for HTTP/1.1, HTTP/2 and HTTP/3. Having a reduced round-trip time for HTTP/3, it’s a very performant option.

The game-changing perks
Now, let's talk about the good things that make FrankenPHP sta