# MIT Art, Design and Technology University  
### MIT School of Computing, Pune  
### Department of Information Technology  

---

## Participative Learning Activity  
### Subject – Natural Language Processing  
### Topic – Information Retrieval and Extraction: Building a Simple Search Engine  
### Academic Year 2025 – 2026 (SEM I)  
### Course Coordinator – Prof. Kalyani Lokhande  


### Phase 1 – Creating the Document Dataset

Before building a search engine, we need a **collection of text documents** that the engine can search through.  
In real life, these could be:
- webpages,  
- articles,  
- research papers, or  
- news stories.

For our project, we’ll create a **small custom dataset** — just a few short text paragraphs on different topics.  

This dataset will act as our **“mini knowledge base”** for the search engine.  
Later, we’ll preprocess it, index it (convert to vectors), and search it.

### Objective of this phase:
- Create and store a few sample text documents  
- Save them in a DataFrame (and optionally in a CSV)


In [3]:
import pandas as pd

# Create a small document collection
documents = [
    "Natural Language Processing enables computers to understand human language.",
    "Information retrieval is about searching and ranking relevant documents.",
    "Machine learning algorithms improve automatically through experience.",
    "Neural networks are used for image recognition and NLP tasks.",
    "Search engines use indexing and ranking techniques to find information quickly."
]

# Create dataframe
df = pd.DataFrame({"Document_ID": range(1, len(documents)+1), "Text": documents})

# Display dataset
df


| | Document_ID | Text |
|---|---|---|
| 0 | 1 | Natural Language Processing enables computers ... |
| 1 | 2 | Information retrieval is about searching and r... |
| 2 | 3 | Machine learning algorithms improve automatica... |
| 3 | 4 | Neural networks are used for image recognition... |
| 4 | 5 | Search engines use indexing and ranking techni... |
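
The optional CSV step listed in the objectives is not part of the cell above. A minimal sketch of it follows, assuming a file name `documents.csv` chosen here purely for illustration and written to a temporary directory so that nothing is left on disk:

```python
import os
import tempfile

import pandas as pd

# Two of the sample documents from the collection above
documents = [
    "Natural Language Processing enables computers to understand human language.",
    "Information retrieval is about searching and ranking relevant documents.",
]
df = pd.DataFrame({"Document_ID": range(1, len(documents) + 1), "Text": documents})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "documents.csv")  # hypothetical file name
    # index=False keeps the row index out of the file; otherwise it is
    # read back as an extra "Unnamed: 0" column.
    df.to_csv(csv_path, index=False)
    restored = pd.read_csv(csv_path)

print(restored.columns.tolist())
```

Writing with `index=False` is a design choice: `Document_ID` already identifies each row, so the positional index does not need to be stored.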


### Phase 2 – Text Preprocessing

Before we can build a search engine, we must clean and normalize the text.  
Raw text often contains:
- Punctuation marks  
- Uppercase/lowercase inconsistencies  
- Stopwords (like *the, is, and, of*), which occur in almost every text and do little to distinguish one document from another

In this phase, we’ll:
1. Convert all text to lowercase  
2. Remove punctuation  
3. Tokenize (split text into words)  
4. Remove stopwords  
5. Lemmatize words (convert them to their base form, e.g., *computers → computer*)  

This makes all documents **uniform and comparable**, allowing the system to index and match them effectively.

### Objective of this phase:
To clean and prepare all text documents so they are ready for vectorization and search.
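
To see what each cleaning step does on its own, the sketch below traces steps 1 to 4 on one sample document using only the standard library. The stopword set `DEMO_STOPWORDS` and the function `trace_preprocessing` are hypothetical names introduced for this illustration; the notebook itself relies on NLTK's full English stopword list and also lemmatizes, which this sketch leaves out.

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# the notebook uses NLTK's English stopword list instead.
DEMO_STOPWORDS = {"the", "is", "and", "of", "to", "a", "about"}

def trace_preprocessing(text):
    """Return the intermediate result of each cleaning step (no lemmatization)."""
    lowered = text.lower()                          # step 1: lowercase
    no_punct = re.sub(r"[^a-z\s]", "", lowered)     # step 2: drop non-letters
    tokens = no_punct.split()                       # step 3: whitespace tokenization
    kept = [t for t in tokens if t not in DEMO_STOPWORDS]  # step 4: stopwords
    return {
        "lowercase": lowered,
        "no_punctuation": no_punct,
        "tokens": tokens,
        "no_stopwords": kept,
    }

steps = trace_preprocessing(
    "Information retrieval is about searching and ranking relevant documents."
)
for name, value in steps.items():
    print(f"{name}: {value}")
```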


In [6]:
!pip install nltk


Collecting nltk
  Using cached nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Collecting click (from nltk)
  Using cached click-8.3.0-py3-none-any.whl.metadata (2.6 kB)
Using cached nltk-3.9.2-py3-none-any.whl (1.5 MB)
Using cached click-8.3.0-py3-none-any.whl (107 kB)
Installing collected packages: click, nltk
Successfully installed click-8.3.0 nltk-3.9.2





In [7]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download NLTK resources (only needed the first time)
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that word_tokenize loads in NLTK 3.9+
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize tools
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Remove punctuation and non-alphabetic characters
    text = re.sub(r'[^a-z\s]', '', text)
    # 3. Tokenize
    tokens = nltk.word_tokenize(text)
    # 4. Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # 5. Lemmatize (WordNetLemmatizer treats every word as a noun unless a POS tag is passed)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # 6. Rejoin into cleaned string
    cleaned = ' '.join(tokens)
    return cleaned

# Apply preprocessing
df["Cleaned_Text"] = df["Text"].apply(preprocess_text)
df


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\astam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\astam\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\astam\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


| | Document_ID | Text | Cleaned_Text |
|---|---|---|---|
| 0 | 1 | Natural Language Processing enables computers ... | natural language processing enables computer u... |
| 1 | 2 | Information retrieval is about searching and r... | information retrieval searching ranking releva... |
| 2 | 3 | Machine learning algorithms improve automatica... | machine learning algorithm improve automatical... |
| 3 | 4 | Neural networks are used for image recognition... | neural network used image recognition nlp task |
| 4 | 5 | Search engines use indexing and ranking techni... | search engine use indexing ranking technique f... |


### Phase 3 – TF-IDF Indexing

After cleaning and preprocessing the text, the next step is to **convert text into numerical form** so that it can be compared mathematically.  
Search engines use such numeric representations to measure how similar a user’s query is to each document.

We will use **TF-IDF (Term Frequency – Inverse Document Frequency)**, which represents how important a word is to a document in the collection.

- **Term Frequency (TF):** measures how often a word appears in a document.  
- **Inverse Document Frequency (IDF):** reduces the weight of common words that appear in many documents.

The resulting TF-IDF value is highest for words that occur often in one document but rarely in the rest of the collection, which makes them good at telling documents apart.

### Objective of this phase:
1. Convert all preprocessed documents into TF-IDF vectors.  
2. Prepare them for similarity measurement and ranking.
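
To make the weighting concrete, here is a hand-rolled sketch on a three-document toy corpus (invented for this illustration, not the notebook's dataset). It assumes the smoothed formula that scikit-learn documents as its default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by scaling each document vector to unit length; the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned documents (illustrative only)
docs = [
    "search engine ranking",
    "ranking relevant document",
    "neural network image",
]

def tfidf_vectors(docs):
    """Compute smoothed, L2-normalised TF-IDF vectors with plain Python."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, idf, vectors

vocab, idf, vectors = tfidf_vectors(docs)
print("IDF of 'ranking' (in 2 docs):", round(idf["ranking"], 4))
print("IDF of 'search'  (in 1 doc): ", round(idf["search"], 4))
```

The term `ranking` appears in two documents and therefore receives a lower IDF than `search`, which appears in only one.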


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the cleaned text
tfidf_matrix = vectorizer.fit_transform(df["Cleaned_Text"])

# Display feature names (vocabulary)
print("Vocabulary (sample):", vectorizer.get_feature_names_out()[:15])

# Show the TF-IDF matrix shape
print("\nTF-IDF Matrix Shape:", tfidf_matrix.shape)


Vocabulary (sample): ['algorithm' 'automatically' 'computer' 'document' 'enables' 'engine'
 'experience' 'find' 'human' 'image' 'improve' 'indexing' 'information'
 'language' 'learning']

TF-IDF Matrix Shape: (5, 33)



### Phase 4 – Implementing the Search Function

After converting documents into TF-IDF vectors, we can now build the core of our search engine.  
The goal of this phase is to take a **user query**, compare it with all documents, and return the most relevant ones.

To do this, we will use **Cosine Similarity** — a mathematical measure of similarity between two vectors.

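- **Cosine Similarity** = (A · B) / (||A|| × ||B||)  
  Because TF-IDF weights are never negative, the score falls between 0 and 1: 1 means the two vectors point in the same direction (the same relative term weights), and 0 means they share no terms.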
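
The formula can be checked by hand on small vectors. The sketch below uses only the standard library; the vectors are made up for illustration, and returning 0.0 for an all-zero vector is a convention assumed here to avoid dividing by zero.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)

query = [1.0, 1.0, 0.0]
same_direction = [2.0, 2.0, 0.0]   # same term proportions, different length
partial_overlap = [1.0, 0.0, 1.0]  # shares one of two terms
no_overlap = [0.0, 0.0, 3.0]       # shares no terms

for name, vec in [("same_direction", same_direction),
                  ("partial_overlap", partial_overlap),
                  ("no_overlap", no_overlap)]:
    print(name, round(cosine(query, vec), 3))
```

Note that `same_direction` scores as a perfect match even though its values are twice as large: cosine similarity compares direction, not magnitude.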

Steps:
1. Take a user query and preprocess it the same way as documents.  
2. Convert the query into a TF-IDF vector using the same vocabulary.  
3. Compute cosine similarity between the query vector and all document vectors.  
4. Sort documents by similarity score and display the top results.
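
Step 2 has a practical consequence: a fitted vectorizer only knows the words it saw during fitting, so query words outside that vocabulary are silently dropped. The sketch below illustrates this on a two-document toy corpus (invented for this example); a query made only of unseen words becomes an all-zero vector, and every similarity score computed from it is 0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-cleaned documents (illustrative only)
docs = ["search engine ranking", "neural network image recognition"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# transform() reuses the fitted vocabulary; it never adds new words
known = vectorizer.transform(["ranking search"])
unknown = vectorizer.transform(["quantum chemistry"])

print("Vocabulary size:", matrix.shape[1])
print("Non-zero entries for known query:  ", known.nnz)
print("Non-zero entries for unknown query:", unknown.nnz)
```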

### Objective of this phase:
To enable the system to accept a search query and return ranked documents based on their relevance.


In [9]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search_engine(query, df, vectorizer, tfidf_matrix, top_n=3):
    # Preprocess the query
    cleaned_query = preprocess_text(query)

    # Transform the query using the same TF-IDF vectorizer
    query_vec = vectorizer.transform([cleaned_query])

    # Compute cosine similarity between query and all documents
    similarities = cosine_similarity(query_vec, tfidf_matrix).flatten()

    # Get top N most similar documents
    top_indices = np.argsort(similarities)[::-1][:top_n]

    # Prepare results
    results = df.iloc[top_indices][["Document_ID", "Text"]].copy()
    results["Similarity_Score"] = similarities[top_indices]
    return results

# Example search
query = "ranking documents in search engines"
search_results = search_engine(query, df, vectorizer, tfidf_matrix)
search_results


| | Document_ID | Text | Similarity_Score |
|---|---|---|---|
| 4 | 5 | Search engines use indexing and ranking techni... | 0.481513 |
| 1 | 2 | Information retrieval is about searching and r... | 0.375242 |
| 3 | 4 | Neural networks are used for image recognition... | 0.0 |


### Phase 5 – Ranking and Result Interpretation

Once the similarity scores are calculated, the search engine ranks the results in descending order of relevance.  
This ranking helps the user quickly access the most useful documents.

**How ranking works:**
1. Each document receives a similarity score (0–1) that indicates how close it is to the user’s query.
2. The documents are sorted by this score — the highest-scoring ones appear first.
3. The top results are presented to the user as the most relevant answers.

In this simplified implementation, ranking is based purely on **cosine similarity** using TF-IDF vectors.  
More advanced search engines (like Google) also consider factors such as link authority, freshness, and user engagement.

### Objective of this phase:
- Display search results ranked by relevance.  
- Interpret how similarity scores affect the ranking order.
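
One thing to watch when interpreting the ranking: sorting alone still returns `top_n` rows even when some of them score 0, so documents with no shared terms can appear among the "top results". A possible refinement, sketched here with a hypothetical helper `rank_matches`, is to keep only scores above a threshold. The two non-zero scores are taken from the example query in Phase 4; the zeros for the remaining positions are assumed for illustration.

```python
def rank_matches(scores, top_n=3, min_score=0.0):
    """Return (position, score) pairs with score above min_score, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in ranked if s > min_score][:top_n]

# Positions 0-4 correspond to the five documents; zeros are assumed values
scores = [0.0, 0.375242, 0.0, 0.0, 0.481513]

matches = rank_matches(scores)
print(matches)

# A query with no matching terms yields an empty result list
print(rank_matches([0.0, 0.0, 0.0, 0.0, 0.0]))
```

With this filter, an empty list signals "no relevant documents" instead of showing three arbitrary zero-score rows.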


In [13]:
# Interactive loop for testing queries; type 'exit' to stop
while True:
    query = input("Enter your search query (or type 'exit' to stop): ")
    # strip() so that stray spaces around 'exit' still end the loop
    if query.strip().lower() == 'exit':
        break
    results = search_engine(query, df, vectorizer, tfidf_matrix, top_n=3)
    print("\nTop Results:\n")
    print(results.to_string(index=False))
    print("\n" + "-"*60 + "\n")



Top Results:

 Document_ID                                                                            Text  Similarity_Score
           5 Search engines use indexing and ranking techniques to find information quickly.               0.0
           4                   Neural networks are used for image recognition and NLP tasks.               0.0
           3           Machine learning algorithms improve automatically through experience.               0.0

------------------------------------------------------------


Top Results:

 Document_ID                                                                            Text  Similarity_Score
           4                   Neural networks are used for image recognition and NLP tasks.          0.377964
           5 Search engines use indexing and ranking techniques to find information quickly.          0.000000
           3           Machine learning algorithms improve automatically through experience.          0.000000

------------------