# Semantic Search Engine Using NLP and BERT
## By [Your Name Here]

## 1. Motivation
We are surrounded by massive amounts of textual data, and retrieving relevant information quickly is a common need across industries, from customer support to academic research. Traditional keyword-based search engines match on surface terms and often miss the meaning behind a query: a search for "Places to visit in Europe" shares no content words with "Best tourist spots in France", even though the two are closely related. This project addresses that gap by building a semantic search engine using modern NLP techniques.

## 2. Connection to Multimodal Learning
While this project focuses on text, the progression from Bag-of-Words to BERT embeddings traces the shift that multimodal models build on: moving from sparse symbolic features to dense contextual embeddings. It also mirrors the evolution of NLP pipelines:
- Classic representations (BoW, TF-IDF)
- Learned word embeddings (Word2Vec)
- Deep contextual representations (BERT)

K-Means clustering is applied only in the final stage, on the BERT embeddings, to group semantically similar documents.

## 3. Learnings from This Work
- Different vectorization methods capture different levels of semantic information
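- Cosine similarity compares vectors by direction rather than magnitude, which makes it a good fit for comparing documents of different lengths
- On this small example set, pretrained BERT-style sentence embeddings captured query intent better than the earlier methods, although this has not yet been measured with evaluation metrics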
- K-Means clustering helps organize semantically similar documents
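
The cosine similarity point above can be made concrete with a small sketch. The vectors below are made-up toy values rather than embeddings from this notebook, and `cosine_sim` is a helper name introduced here for illustration. Treating an all-zero vector as having similarity 0 is a convention chosen for this sketch.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Convention for this sketch: an all-zero vector has similarity 0.
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy vectors (illustrative values only)
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]  # same direction as v1, twice the length
v3 = [0.0, 0.0, 3.0]  # orthogonal to v1

print(cosine_sim(v1, v2))  # parallel vectors: similarity at (or within rounding of) 1
print(cosine_sim(v1, v3))  # orthogonal vectors: similarity 0
```

Because only direction matters, `v1` and `v2` count as maximally similar even though one is twice as long as the other.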

In [None]:
# BoW: represent each document as a vector of raw term counts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus reused by every later cell
documents = [
    "What is the capital of France?",
    "How to cook pasta?",
    "Where is Paris located?",
    "Recipe for Italian pasta",
    "Best tourist spots in France"
]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)  # sparse matrix, shape (n_docs, vocab_size)
sim_matrix_bow = cosine_similarity(bow_matrix)    # pairwise similarities, shape (n_docs, n_docs)

In [None]:
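# TF-IDF: like BoW, but down-weights terms that appear in many documents
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()  # replaces the BoW vectorizer defined above
tfidf_matrix = vectorizer.fit_transform(documents)
sim_matrix_tfidf = cosine_similarity(tfidf_matrix)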
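
To see where purely lexical methods run out, the following sketch scores the query from the Querying cell against the TF-IDF matrix. It repeats the corpus so that it runs on its own. The names `query_vec` and `scores` are introduced here for illustration and are not part of the original cells.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "What is the capital of France?",
    "How to cook pasta?",
    "Where is Paris located?",
    "Recipe for Italian pasta",
    "Best tourist spots in France",
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

query = "Places to visit in Europe"
# transform() reuses the fitted vocabulary; query words never seen in the corpus are dropped
query_vec = vectorizer.transform([query])
scores = cosine_similarity(query_vec, tfidf_matrix)[0]

# Print documents from highest to lowest score
for doc, score in sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```

The only query words present in the corpus vocabulary are "to" and "in", so any nonzero score here comes from function-word overlap rather than from meaning. Documents sharing neither word score exactly zero.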

In [None]:
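# Word2Vec: average learned word vectors to get one vector per document
from gensim.models import Word2Vec
import numpy as np

tokenized_docs = [doc.lower().split() for doc in documents]
# Note: a five-sentence corpus is far too small to learn meaningful embeddings;
# this cell only demonstrates the mechanics.
# Named w2v_model so the SentenceTransformer `model` defined later does not shadow it.
w2v_model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=2)

def document_vector(doc):
    """Mean vector of all in-vocabulary tokens; a zero vector if none are known."""
    tokens = [word for word in doc.lower().split() if word in w2v_model.wv]
    if not tokens:
        return np.zeros(w2v_model.wv.vector_size)
    return np.mean([w2v_model.wv[word] for word in tokens], axis=0)

w2v_matrix = np.vstack([document_vector(doc) for doc in documents])
sim_matrix_w2v = cosine_similarity(w2v_matrix)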
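
The averaging step in `document_vector` can be illustrated without training a model. This is a minimal sketch under stated assumptions: `toy_vectors` is a hypothetical three-dimensional lookup table standing in for a trained model's word vectors, and `tokenize` and `mean_vector` are helper names introduced here. Unlike the plain `split()` used above, the regex tokenizer strips punctuation, so "France?" and "France" map to the same token.

```python
import re
import numpy as np

# Hypothetical 3-dimensional word vectors (illustrative values, not trained)
toy_vectors = {
    "pasta": np.array([1.0, 0.0, 0.0]),
    "recipe": np.array([0.0, 1.0, 0.0]),
}

def tokenize(text):
    """Lowercase and keep only alphanumeric runs, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def mean_vector(text, vectors, dim):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    known = [vectors[token] for token in tokenize(text) if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(mean_vector("Recipe for pasta?", toy_vectors, dim=3))   # average of two known words
print(mean_vector("Unknown words only", toy_vectors, dim=3))  # zero-vector fallback
```

The zero-vector fallback matters for queries made up entirely of out-of-vocabulary words, where averaging an empty list would otherwise produce NaN.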

In [None]:
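# BERT + KMeans: encode documents with a pretrained sentence-embedding model, then cluster
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('all-MiniLM-L6-v2')
bert_embeddings = model.encode(documents)  # array of shape (n_docs, 384)
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(bert_embeddings)
labels = kmeans.labels_  # cluster index (0 or 1) for each document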
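
The cell above stops at the raw label array. One way to inspect the clusters is to group the documents by label, as in the sketch below. `group_by_label` is a helper name introduced here and is not part of scikit-learn.

```python
from collections import defaultdict

def group_by_label(docs, cluster_labels):
    """Map each cluster label to the list of documents assigned to it."""
    clusters = defaultdict(list)
    for doc, label in zip(docs, cluster_labels):
        clusters[int(label)].append(doc)  # int() turns NumPy integers into plain keys
    return dict(clusters)

# Intended use with the objects from the cell above:
#   for label, members in sorted(group_by_label(documents, labels).items()):
#       print(label, members)
```

K-Means label numbers are arbitrary, so only the grouping is meaningful, not which cluster happens to be called 0 or 1.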

In [None]:
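# Querying: embed the query with the same sentence-embedding model, then rank documents
query = "Places to visit in Europe"
query_embedding = model.encode([query])  # `model` is the SentenceTransformer from the previous cell
sim_scores = cosine_similarity(query_embedding, bert_embeddings)[0]  # one score per document
top_doc_index = np.argmax(sim_scores)
print("Top Match:", documents[top_doc_index])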
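
A search engine usually returns a ranked list rather than a single hit. The sketch below generalizes the `argmax` step to the top `k` results. `top_k` is a helper name introduced here for illustration.

```python
import numpy as np

def top_k(scores, docs, k=3):
    """Return the k highest-scoring documents as (document, score) pairs, best first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(docs[i], float(scores[i])) for i in order]

# Intended use with the objects from the cell above:
#   for doc, score in top_k(sim_scores, documents, k=3):
#       print(f"{score:.3f}  {doc}")
```

For a corpus this small a full sort is fine. For large collections, `np.argpartition` or an approximate nearest-neighbour index would be the usual next step.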

## 5. Reflections
### (a) What surprised me:
- Averaged Word2Vec vectors sometimes perform well despite the simplicity of the approach
- BERT-based models perform well out of the box with minimal tuning
### (b) Improvements
- Add visualizations (e.g. t-SNE or PCA)
- Use a larger dataset with evaluation metrics (e.g. MAP, NDCG)
- Incorporate cross-modal search (image + text)
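
As a starting point for the evaluation-metrics item above, here is a minimal sketch of NDCG@k. It assumes graded relevance judgments for each ranked result, which this notebook does not have yet, so the inputs are hypothetical. It uses the linear-gain form of DCG; a common alternative uses `2**rel - 1` as the gain.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) over the top k ranks."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(2), log2(3), ...
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades, listed in the order the engine ranked the results
print(ndcg_at_k([3, 2, 1], k=3))  # results already in ideal order
print(ndcg_at_k([0, 1], k=2))     # the relevant result is ranked second
```

A score of 1 means the ranking matches the ideal ordering. Pushing relevant results further down the list lowers the score.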

## 6. References
- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- https://scikit-learn.org/
- https://radimrehurek.com/gensim/
- https://nlp.stanford.edu/projects/glove/
- https://bair.berkeley.edu/blog/2024/07/20/visual-haystacks/