<a href="https://colab.research.google.com/github/balnarendrasapa/search-engine/blob/main/Search_Engine_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Dependencies

In [None]:
!pip install -q langchain_community faiss-cpu rank_bm25


# Get the Index from the Repository

- This step downloads a pre-built index from my repository.
- This step is optional.
- To build a fresh index instead, skip this step.
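
Since this step is optional, it can help to check which index files are already on disk before deciding whether to download or re-crawl. The following is a minimal sketch: the helper name `missing_index_files` is hypothetical, and the file layout is assumed from the archive listing of `built_index.zip`.

```python
import os

# Assumed layout, taken from the unzip listing of built_index.zip.
EXPECTED_FILES = [
    "search_index.json",
    "bm25_index.pkl",
    os.path.join("vector_store", "index.faiss"),
    os.path.join("vector_store", "index.pkl"),
]


def missing_index_files(base_dir="built_index"):
    """Return the expected index files that are not present under base_dir."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(base_dir, name))
    ]


# An empty list means the pre-built index is fully in place.
print(missing_index_files("built_index"))
```

If the list is non-empty, either run the download cells below or let the crawler build a fresh index.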

In [None]:
!wget https://github.com/balnarendrasapa/search-engine/raw/refs/heads/main/built_index.zip

--2025-04-05 17:13:44--  https://github.com/balnarendrasapa/search-engine/raw/refs/heads/main/built_index.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/balnarendrasapa/search-engine/refs/heads/main/built_index.zip [following]
--2025-04-05 17:13:45--  https://raw.githubusercontent.com/balnarendrasapa/search-engine/refs/heads/main/built_index.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3443323 (3.3M) [application/zip]
Saving to: ‘built_index.zip’


2025-04-05 17:13:45 (40.4 MB/s) - ‘built_index.zip’ saved [3443323/3443323]



In [None]:
!unzip built_index.zip

Archive:  built_index.zip
   creating: built_index/
   creating: built_index/vector_store/
  inflating: built_index/vector_store/index.pkl  
  inflating: built_index/vector_store/index.faiss  
  inflating: built_index/search_index.json  
  inflating: built_index/bm25_index.pkl  


# Crawler Code

- To crawl more (or fewer) pages, change PAGES_TO_CRAWL.
- TOP_K_RESULTS sets how many results each search returns.
- BASE_URL is where the crawl starts; only links that begin with BASE_URL are followed, so change it to index a different site.
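
To make the role of BASE_URL concrete, here is a small standalone sketch of the link handling the crawler performs: resolve the link against the current page, drop the `#fragment`, and keep it only if it stays under BASE_URL. The helper name `normalize_link` and the example URLs are illustrative assumptions, not part of the notebook.

```python
from urllib.parse import urljoin, urldefrag

BASE_URL = "https://python.langchain.com/"


def normalize_link(page_url, href, base_url=BASE_URL):
    """Return the absolute, fragment-free URL if it is in scope, else None."""
    # Resolve relative links against the page they were found on,
    # then strip the fragment so "#section" variants collapse to one URL.
    absolute_url, _fragment = urldefrag(urljoin(page_url, href))
    return absolute_url if absolute_url.startswith(base_url) else None


page = "https://python.langchain.com/docs/concepts/"
print(normalize_link(page, "embedding_models/#overview"))
print(normalize_link(page, "/docs/how_to/"))
print(normalize_link(page, "https://github.com/langchain-ai/langchain"))  # out of scope
```

Stripping fragments keeps the crawler from counting the same page several times toward PAGES_TO_CRAWL.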

In [None]:
PAGES_TO_CRAWL = 500
TOP_K_RESULTS = 5
BASE_URL = "https://python.langchain.com/"
BASE_DIR = "built_index"
EMBEDDINGS_MODEL = "all-MiniLM-L6-v2"

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urldefrag
import time
import json
import os
import pickle
from collections import defaultdict
from tqdm import tqdm

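# Import Hugging Face embeddings and the FAISS vector store from
# langchain_community (the package installed above); the old
# `langchain.embeddings` / `langchain.vectorstores` paths are deprecated.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS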

# Import BM25 from rank_bm25 for sparse searching
from rank_bm25 import BM25Okapi

# Create the index directory if it does not exist yet.
os.makedirs(BASE_DIR, exist_ok=True)

class SearchEngine:
    def __init__(self,
                 base_url="https://python.langchain.com/",
                 index_file=BASE_DIR + "/"+ "search_index.json",
                 vector_store_dir=BASE_DIR + "/" + "vector_store",
                 bm25_index_file=BASE_DIR + "/" + "bm25_index.pkl"):
        self.base_url = base_url
        self.index_file = index_file
        self.vector_store_dir = vector_store_dir
        self.bm25_index_file = bm25_index_file
        self.index = defaultdict(dict)
        self.visited_urls = set()
        self.urls_to_visit = [base_url]
        self.load_index()
        # Use the model name configured in EMBEDDINGS_MODEL instead of a hard-coded copy.
        self.embeddings = HuggingFaceEmbeddings(model_name=EMBEDDINGS_MODEL)
        self.vector_store = None
        self.bm25 = None      # BM25 object for sparse search
        self.bm25_texts = []  # List of tokenized texts
        self.url_order = []   # To maintain order corresponding to BM25 texts

        # Load semantic vector store if available
        if os.path.exists(self.vector_store_dir):
            try:
                self.vector_store = FAISS.load_local(
                    self.vector_store_dir,
                    self.embeddings,
                    allow_dangerous_deserialization=True
                )
                print("Loaded vector store from disk.")
            except Exception as e:
                print(f"Failed to load vector store: {e}")

        # Load BM25 index if available
        self.load_bm25_index()

    def load_index(self):
        if os.path.exists(self.index_file):
            with open(self.index_file, 'r') as f:
                self.index = defaultdict(dict, json.load(f))

    def save_index(self):
        with open(self.index_file, 'w') as f:
            json.dump(dict(self.index), f)

    def load_bm25_index(self):
        if os.path.exists(self.bm25_index_file):
            try:
                with open(self.bm25_index_file, 'rb') as f:
                    self.bm25, self.bm25_texts, self.url_order = pickle.load(f)
                print("Loaded BM25 index from disk.")
            except Exception as e:
                print(f"Failed to load BM25 index: {e}")

    def save_bm25_index(self):
        try:
            with open(self.bm25_index_file, 'wb') as f:
                pickle.dump((self.bm25, self.bm25_texts, self.url_order), f)
            print("BM25 index saved to disk.")
        except Exception as e:
            print(f"Failed to save BM25 index: {e}")

    def fetch_page(self, url):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except Exception as e:
            print(f"Error fetching {url}: {e}")
            return None

    def extract_content(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        # Remove unwanted tags
        for elem in soup(['script', 'style', 'nav', 'footer', 'header']):
            elem.decompose()
        main_content = soup.find('main') or soup.find('article') or soup
        return main_content.get_text(separator=' ', strip=True)

    def process_page(self, url, html):
        content = self.extract_content(html)
        self.index[url] = {
            'content': content,
            'timestamp': time.time()
        }

    def find_links(self, html, base_url):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a', href=True):
            href = link['href']
            absolute_url = urljoin(base_url, href)
            absolute_url, _ = urldefrag(absolute_url)
            if absolute_url.startswith(self.base_url) and absolute_url not in self.visited_urls:
                self.urls_to_visit.append(absolute_url)

    def crawl(self, max_pages=PAGES_TO_CRAWL):
        print("Starting indexing process...")
        pbar = tqdm(total=max_pages, desc="Pages Crawled")
        while self.urls_to_visit and len(self.visited_urls) < max_pages:
            current_url = self.urls_to_visit.pop(0)
            if current_url in self.visited_urls:
                continue

            html = self.fetch_page(current_url)
            if html:
                self.process_page(current_url, html)
                self.find_links(html, current_url)
                self.visited_urls.add(current_url)
                pbar.update(1)
        pbar.close()
        self.save_index()
        print(f"Index updated. Total pages: {len(self.index)}")
        self.build_indexes()

    def build_indexes(self):
        # Build semantic vector store and BM25 index together.
        print("Building semantic vector store and BM25 sparse index...")
        texts, metadatas = [], []
        self.url_order = []  # reset BM25 order list
        for url, data in self.index.items():
            texts.append(data['content'])
            metadatas.append({"url": url})
            self.url_order.append(url)
        # Build semantic index
        self.vector_store = FAISS.from_texts(texts, self.embeddings, metadatas=metadatas)
        self.vector_store.save_local(self.vector_store_dir)
        print("Semantic vector store built and saved successfully.")
        # Build BM25 index
        self.build_bm25_index(texts)

    def build_bm25_index(self, texts):
        print("Building BM25 sparse index...")
        # Tokenize each document (using a simple whitespace split)
        self.bm25_texts = [text.lower().split() for text in texts]
        self.bm25 = BM25Okapi(self.bm25_texts)
        print("BM25 index built.")
        self.save_bm25_index()

    def bm25_search(self, query, top_k=TOP_K_RESULTS):
        if not self.bm25:
            print("BM25 index is not built. Building BM25 index now...")
            texts = [data['content'] for data in self.index.values()]
            self.build_bm25_index(texts)
        query_tokens = query.lower().split()
        scores = self.bm25.get_scores(query_tokens)
        # Get indices of the top k scores
        top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        results = []
        for idx in top_indices:
            url = self.url_order[idx] if idx < len(self.url_order) else "N/A"
            results.append({
                "url": url,
                "score": scores[idx],
                "snippet": self.index[url]['content'][:200] if url in self.index else ""
            })
        return results

    def semantic_search(self, query, top_k=TOP_K_RESULTS):
        if not self.vector_store:
            print("Semantic vector store is not built. Building now...")
            self.build_indexes()
        results = self.vector_store.similarity_search(query, k=top_k)
        sem_results = []
        for result in results:
            url = result.metadata.get("url", "N/A")
            sem_results.append({
                "url": url,
                "score": None,  # semantic search does not return a raw BM25-like score
                "snippet": result.page_content[:200]
            })
        return sem_results

    def search(self, query, top_k=TOP_K_RESULTS):
        """
        Run both searches and return their results side by side as
        {"semantic": [...], "bm25": [...]}. The two lists are not merged.
        """
        sem_results = self.semantic_search(query, top_k=top_k)
        bm25_results = self.bm25_search(query, top_k=top_k)
        return {"semantic": sem_results, "bm25": bm25_results}

    def interactive_search(self):
        query = input("\nEnter search query: ").strip()
        start_time = time.time()
        results = self.search(query)
        search_time = time.time() - start_time
        print(f"\nSemantic Search Results (found {len(results['semantic'])} results in {search_time:.2f}s):")
        for i, res in enumerate(results['semantic'], 1):
            print(f"{i}. {res['url']}")
            print(f"Snippet: {res['snippet']}...\n")
        print("BM25 Sparse Search Results:")
        for i, res in enumerate(results['bm25'], 1):
            print(f"{i}. {res['url']} (Score: {res['score']:.2f})")
            print(f"Snippet: {res['snippet']}...\n")

if __name__ == "__main__":
    engine = SearchEngine(BASE_URL)

    # If the index file exists, ask whether to update/re-crawl the index.
    if os.path.exists(engine.index_file) and engine.index:
        choice = input("Index found. Do you want to update (re-crawl) the index? (y/N): ").lower()
        if choice == 'y':
            engine.crawl()
        else:
            # Build semantic and BM25 indexes if not loaded.
            if not engine.vector_store or not engine.bm25:
                engine.build_indexes()
    else:
        engine.crawl()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Starting indexing process...


Pages Crawled: 100%|██████████| 500/500 [02:48<00:00,  2.96it/s]


Index updated. Total pages: 500
Building semantic vector store and BM25 sparse index...
Semantic vector store built and saved successfully.
Building BM25 sparse index...
BM25 index built.
BM25 index saved to disk.


# Search the Index

- Each query runs two kinds of search.
- Semantic search compares vector embeddings of the query and the pages.
- Sparse search scores keyword overlap using BM25.
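
To illustrate what the sparse side is doing, below is a simplified, standard-library-only sketch of BM25 scoring with the same lowercase whitespace tokenization the notebook uses. It is not the `rank_bm25` implementation: the function name `bm25_scores`, the sample documents, the `k1`/`b` defaults, and the non-negative IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a simplified BM25 formula."""
    # Lowercase and split on whitespace, as the notebook's BM25 index does.
    tokenized = [doc.lower().split() for doc in documents]
    query_tokens = query.lower().split()
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher weight (IDF variant that stays non-negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates via k1 and is normalized by document length via b.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "Embeddings map text to vectors",
    "Retrievers return documents for a query",
    "Caching embeddings avoids recomputing vectors",
]
print(bm25_scores("embeddings", docs))
print(bm25_scores("embeddings?", docs))
```

Because tokens are split on whitespace only, `embeddings?` and `embeddings` are different tokens, so punctuation attached to a query word can prevent a keyword match. Semantic search embeds the whole query instead of matching tokens, which is one reason the two result lists can differ.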

In [None]:
engine.interactive_search()


Enter search query: what is embeddings?

Semantic Search Results (found 5 results in 0.02s):
1. https://python.langchain.com/docs/how_to/embed_text/
Snippet: On this page info Head to Integrations for documentation on built-in integrations with text embedding model providers. The Embeddings class is a class designed for interfacing with text embedding mode...

2. https://python.langchain.com/docs/how_to/custom_embeddings/
Snippet: On this page LangChain is integrated with many 3rd party embedding models . In this guide we'll show you how to create a custom Embedding class, in case a built-in one does not already exist. Embeddin...

3. https://python.langchain.com/docs/concepts/embedding_models/
Snippet: On this page Prerequisites Documents Note This conceptual overview focuses on text-based embedding models. Embedding models can also be multimodal though such models are not currently supported by Lan...

4. https://python.langchain.com/docs/how_to/caching_embeddings/
Snippet: On this 