<a href="https://colab.research.google.com/github/atharvnaidu/BerkeleyTime/blob/main/SemanticSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [30]:
!pip install sentence_transformers
!pip install faiss-cpu
!pip install nltk
!pip install spacy requests
!python -m spacy download en_core_web_sm



Text Normalization

In [31]:
import re
import spacy

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Text normalization function
def normalize_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove special characters and punctuation (digits are kept, so course numbers survive)
    text = re.sub(r'[^a-z0-9\s]', '', text)

    # Remove stopwords using spaCy
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop]  # Lemmatize and remove stopwords

    return " ".join(tokens)

sample course description include number like 101 special character
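
The sample output above still contains `101` because the character class `[^a-z0-9\s]` preserves digits. A minimal sketch of just the regex step, without spaCy, makes that behavior easy to check. The helper name `strip_special` and the sample sentence are hypothetical and not part of the notebook.

```python
import re

def strip_special(text):
    """Lowercase the text and drop everything except a-z, 0-9, and whitespace."""
    return re.sub(r'[^a-z0-9\s]', '', text.lower())

# Hypothetical sample input: digits survive, punctuation does not.
cleaned = strip_special("Intro to CS-101: Data & Algorithms!")
print(cleaned)
```

Because each match is replaced with an empty string, hyphenated terms fuse together (`cs-101` becomes `cs101`). Substituting a space instead is one option if that matters for the search.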


Synonym Expansion

In [32]:
from nltk.corpus import wordnet
import nltk
nltk.download('wordnet')

# Synonym expansion using WordNet
def expand_synonyms(tokens):
    expanded_tokens = set(tokens)  # To store original and synonym words

    for token in tokens:
        synonyms = wordnet.synsets(token)
        for syn in synonyms:
            for lemma in syn.lemmas():
                expanded_tokens.add(lemma.name())  # Add synonym lemmas

    return expanded_tokens

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
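
Expanding every sense of every token can grow the set quickly, and WordNet lemma names join multiword entries with underscores. The sketch below shows one possible way to cap the expansion and clean up the names. It is an illustration under assumptions, not the notebook's method: `expand_with_limit`, the `lookup` callable, and the toy thesaurus are all hypothetical. In the notebook, `lookup` could wrap `wordnet.synsets` and `lemma.name()`.

```python
def expand_with_limit(tokens, lookup, max_per_token=3):
    """Return the tokens plus at most `max_per_token` new synonyms per token."""
    expanded = list(tokens)   # keep the original order, originals first
    seen = set(tokens)
    for token in tokens:
        added = 0
        for name in lookup(token):
            # WordNet-style names use underscores for multiword entries.
            name = name.replace('_', ' ').lower()
            if name not in seen and added < max_per_token:
                seen.add(name)
                expanded.append(name)
                added += 1
    return expanded

# Toy stand-in for WordNet so the sketch runs without downloads (hypothetical data).
toy_thesaurus = {"course": ["course", "class", "course_of_study", "path", "track"]}
result = expand_with_limit(["course"], lambda t: toy_thesaurus.get(t, []), max_per_token=2)
print(result)
```

Returning a list rather than a set keeps the original query terms first, which can help if the expanded terms are later joined into a single string for embedding.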


Entity Recognition + Contextual Understanding + Indexing

In [33]:
import requests
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
import re

# Load the sentence-embedding model (all-MiniLM-L6-v2)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define the GraphQL endpoint and query for fetching course data
url = "https://stanfurdtime.com/api/graphql"
coursesQuery = """
query CoursesQuery {
  courseList {
    number
    title
    description
  }
}
"""

# Fetch courses data from the API
response = requests.post(url, json={"query": coursesQuery}, timeout=30)  # Avoid hanging on a slow API
if response.status_code == 200:
    data = response.json()
    courses = data["data"]["courseList"]
else:
    print(f"Failed to fetch courses. Status code: {response.status_code}")
    courses = []  # Fallback if API call fails

# Lightweight normalization for queries
# (this redefinition replaces the spaCy-based normalize_text defined earlier)
def normalize_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-z0-9\s]', '', text)  # Remove special characters
    return text

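# Generate course embeddings with the sentence-transformer model
# (course text is embedded as-is; only queries pass through normalize_text)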
course_texts = [f"{course['number']} {course['title']} {course['description']}" for course in courses]
course_embeddings = model.encode(course_texts)

# Convert embeddings to a float32 numpy array (the dtype FAISS expects)
course_embeddings = np.asarray(course_embeddings, dtype=np.float32)

# Build a flat FAISS index (exact, exhaustive nearest-neighbor search)
index = faiss.IndexFlatL2(course_embeddings.shape[1])  # L2 (Euclidean) distance
index.add(course_embeddings)

# Function to encode user query
def encode_query(query):
    normalized_query = normalize_text(query)
    query_embedding = model.encode(normalized_query)
    return query_embedding
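
`IndexFlatL2` compares the query against every stored vector and reports squared Euclidean distances. The pure-Python sketch below illustrates the same idea on toy 2-D vectors. It is not how FAISS is implemented, and the function name `flat_l2_search` and the toy data are hypothetical.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared_distances, indices) of the k vectors closest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared L2 distance: sum of squared coordinate differences.
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()             # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy data standing in for course embeddings (hypothetical values).
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [1.0, 1.0], k=2)
print(toy_distances, toy_indices)
```

A smaller distance means a closer match, so results come back already ordered from best to worst.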

Ranking

In [35]:
# Example user query
user_query = "hash tables"
query_embedding = encode_query(user_query).reshape(1, -1)

# Search for nearest courses
k = 5  # Number of results to return
distances, indices = index.search(query_embedding, k)

# Output results
print("Search Results:")
for idx in indices[0]:
    print(f"Course: {courses[idx]['number']} - {courses[idx]['title']}")

Search Results:
Course: 61B - Data Structures
Course: 47B - Completion of Work in Computer Science 61B
Course: 215 - Analysis and Design of Databases
Course: 174 - Combinatorics and Discrete Probability
Course: 249 - Algebraic Combinatorics
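
The loop above prints only course numbers and titles. As a sketch, a small helper could pair each hit with its distance and skip the `-1` placeholder that FAISS returns when fewer than `k` neighbors exist. The helper name `format_results` and the toy data are hypothetical.

```python
def format_results(distances, indices, courses):
    """Build one display line per valid hit, keeping the search order."""
    lines = []
    for dist, idx in zip(distances, indices):
        if idx < 0:           # -1 marks a missing neighbor
            continue
        course = courses[idx]
        lines.append(f"{course['number']} - {course['title']} (distance {dist:.3f})")
    return lines

# Toy data for illustration only (the distances are made up).
toy_courses = [
    {"number": "61B", "title": "Data Structures"},
    {"number": "174", "title": "Combinatorics and Discrete Probability"},
]
lines = format_results([0.75, 1.25, float("inf")], [0, 1, -1], toy_courses)
for line in lines:
    print(line)
```

In the notebook, the equivalent call would be `format_results(distances[0], indices[0], courses)`, since `index.search` returns one row per query.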
