<a href="https://colab.research.google.com/github/anonymousboy67/Document-Similarity-Analysis/blob/main/Aashish_Adhikari_Week_6_CODE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Importing Zip files**

In [None]:
from google.colab import files
uploaded=files.upload()

Saving archive.zip to archive.zip


**Extracting the zip file**

In [None]:
import zipfile

with zipfile.ZipFile('archive.zip', 'r') as zip_ref:
    zip_ref.extractall('cisi_data')


**Listing files in zip**

In [None]:
import os
print("Files in dataset:")
for file in os.listdir('cisi_data'):
    print(f"  - {file}")


Files in dataset:
  - CISI.REL
  - CISI.QRY
  - CISI.ALL


**FIRST 30 LINES OF CISI.ALL (documents)**

In [None]:
with open('cisi_data/CISI.ALL', 'r', encoding='utf-8', errors='ignore') as f:
    for i, line in enumerate(f):
        if i < 30:  # First 30 lines
            print(line.rstrip())
        else:
            break

.I 1
.T
18 Editions of the Dewey Decimal Classifications
.A
Comaromi, J.P.
.W
   The present study is a history of the DEWEY Decimal
Classification.  The first edition of the DDC was published
in 1876, the eighteenth edition in 1971, and future editions
will continue to appear as needed.  In spite of the DDC's
long and healthy life, however, its full story has never
been told.  There have been biographies of Dewey
that briefly describe his system, but this is the first
attempt to provide a detailed history of the work that
more than any other has spurred the growth of
librarianship in this country and abroad.
.X
1	5	1
92	1	1
262	1	1
556	1	1
1004	1	1
1024	1	1
1024	1	1
.I 2
.T
Use Made of Technical Libraries
.A
Slater, M.
.W


**FIRST 20 LINES OF CISI.QRY (Queries)**

In [None]:
with open('cisi_data/CISI.QRY', 'r', encoding='utf-8', errors='ignore') as f:
    for i, line in enumerate(f):
        if i < 20:
            print(line.rstrip())
        else:
            break

.I 1
.W
What problems and concerns are there in making up descriptive titles?
What difficulties are involved in automatically retrieving articles from
approximate titles?
What is the usual relevance of the content of articles to their titles?
.I 2
.W
How can actually pertinent data, as opposed to references or entire articles
themselves, be retrieved automatically in response to information requests?
.I 3
.W
What is information science?  Give definitions where possible.
.I 4
.W
Image recognition and any other methods of automatically
transforming printed text into computer-ready form.
.I 5
.W
What special training will ordinary researchers and businessmen need for proper


**FIRST 20 LINES OF CISI.REL (Relevance Judgments)**

In [None]:
with open('cisi_data/CISI.REL', 'r', encoding='utf-8', errors='ignore') as f:
    for i, line in enumerate(f):
        if i < 20:
            print(line.rstrip())
        else:
            break

     1     28	0	0.000000
     1     35	0	0.000000
     1     38	0	0.000000
     1     42	0	0.000000
     1     43	0	0.000000
     1     52	0	0.000000
     1     65	0	0.000000
     1     76	0	0.000000
     1     86	0	0.000000
     1    150	0	0.000000
     1    189	0	0.000000
     1    192	0	0.000000
     1    193	0	0.000000
     1    195	0	0.000000
     1    215	0	0.000000
     1    269	0	0.000000
     1    291	0	0.000000
     1    320	0	0.000000
     1    429	0	0.000000
     1    465	0	0.000000


**Parsing CISI Dataset**

In [None]:
import re
from collections import defaultdict


def parse_documents(file_path):

    documents = {}

    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            content = f.read()


        doc_parts = re.split(r'\.I\s+(\d+)', content)[1:]


        for i in range(0, len(doc_parts), 2):
            doc_id = int(doc_parts[i])
            doc_content = doc_parts[i + 1]


            title_match = re.search(r'\.T\s+(.*?)(?=\.[AWBX]|\Z)', doc_content, re.DOTALL)
            title = title_match.group(1).strip() if title_match else ""


            author_match = re.search(r'\.A\s+(.*?)(?=\.[TWBX]|\Z)', doc_content, re.DOTALL)
            author = author_match.group(1).strip() if author_match else ""


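            # doc_content has already been split on ".I", so in practice this captures
            # the rest of the record after ".W", including the ".X" cross-reference block.
            content_match = re.search(r'\.W\s+(.*?)(?=\.I|\Z)', doc_content, re.DOTALL)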
            main_content = content_match.group(1).strip() if content_match else ""


            full_content = title + " " + main_content


            documents[doc_id] = {
                'id': doc_id,
                'title': title,
                'author': author,
                'content': full_content.strip()
            }

        print(f"Successfully parsed {len(documents)} documents from CISI.ALL")
        return documents

    except FileNotFoundError:
        print(f" Error: File '{file_path}' not found!")
        return {}
    except Exception as e:
        print(f" Error parsing documents: {e}")
        return {}
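
The regular expressions above look for field markers anywhere in the text. As a hedged alternative, the sketch below collects fields line by line, treating a line as a marker only when it consists of a dot followed by one capital letter. The `SAMPLE` string and the `parse_records` helper are hypothetical and exist only for illustration; the sketch assumes every field marker sits alone on its own line, as in the preview of CISI.ALL shown earlier.

```python
import re

# Hypothetical two-record sample in the CISI.ALL layout (not real corpus text).
SAMPLE = """.I 1
.T
A Sample Title
.A
Doe, J.
.W
Body text of the first record.
.X
1\t5\t1
.I 2
.T
Second Title
.W
Body of the second record.
"""


def parse_records(text):
    """Split on '.I <id>' lines, then gather each field line by line."""
    records = {}
    # re.MULTILINE lets ^ and $ match at the start and end of every line.
    parts = re.split(r'^\.I\s+(\d+)\s*$', text, flags=re.MULTILINE)[1:]
    for i in range(0, len(parts), 2):
        doc_id = int(parts[i])
        fields = {}
        current = None
        for line in parts[i + 1].splitlines():
            marker = re.fullmatch(r'\.([A-Z])', line.strip())
            if marker:
                # A new field starts; later lines belong to it.
                current = marker.group(1)
                fields.setdefault(current, [])
            elif current is not None:
                fields[current].append(line)
        records[doc_id] = {
            'title': ' '.join(fields.get('T', [])).strip(),
            'author': ' '.join(fields.get('A', [])).strip(),
            'text': ' '.join(fields.get('W', [])).strip(),
        }
    return records


records = parse_records(SAMPLE)
print(records[1]['text'])
print(repr(records[2]['author']))
```

Because the `.X` lines are stored under their own key, they stay out of `text` in this variant. Switching the notebook to it would change the token statistics reported later.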

**Parse Queries (CISI.QRY)**

In [None]:
def parse_queries(file_path):

    queries = {}

    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            content = f.read()


        query_parts = re.split(r'\.I\s+(\d+)', content)[1:]


        for i in range(0, len(query_parts), 2):
            query_id = int(query_parts[i])
            query_content = query_parts[i + 1]


            text_match = re.search(r'\.W\s+(.*?)(?=\.I|\Z)', query_content, re.DOTALL)
            query_text = text_match.group(1).strip() if text_match else ""


            queries[query_id] = {
                'id': query_id,
                'text': query_text
            }

        print(f" Successfully parsed {len(queries)} queries from CISI.QRY")
        return queries

    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found!")
        return {}
    except Exception as e:
        print(f"Error parsing queries: {e}")
        return {}



**Parse Relevance Judgments (CISI.REL)**

In [None]:
def parse_relevance(file_path):

    relevance = defaultdict(list)

    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue


                parts = line.split()
                if len(parts) >= 2:
                    query_id = int(parts[0])
                    doc_id = int(parts[1])


                    relevance[query_id].append(doc_id)

        relevance = dict(relevance)

        print(f"Successfully parsed relevance judgments for {len(relevance)} queries from CISI.REL")
        return relevance

    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found!")
        return {}
    except Exception as e:
        print(f"Error parsing relevance judgments: {e}")
        return {}



In [None]:
def load_cisi_dataset(base_path='cisi_data'):

    import os
    print("LOADING CISI DATASET")



    docs_path = os.path.join(base_path, 'CISI.ALL')
    queries_path = os.path.join(base_path, 'CISI.QRY')
    relevance_path = os.path.join(base_path, 'CISI.REL')


    documents = parse_documents(docs_path)
    queries = parse_queries(queries_path)
    relevance = parse_relevance(relevance_path)


    print("DATASET SUMMARY")

    print(f"Total Documents: {len(documents)}")
    print(f"Total Queries: {len(queries)}")
    print(f"Queries with Relevance Judgments: {len(relevance)}")


    return documents, queries, relevance


In [None]:
def display_samples(documents, queries, relevance, num_samples=3):


    print("SAMPLE DOCUMENTS")

    for i, (doc_id, doc) in enumerate(list(documents.items())[:num_samples]):
        print(f"\nDocument ID: {doc['id']}")
        print(f"Title: {doc['title'][:100]}...")
        print(f"Author: {doc['author']}")
        print(f"Content Preview: {doc['content'][:150]}...")



    print("SAMPLE QUERIES")

    for i, (query_id, query) in enumerate(list(queries.items())[:num_samples]):
        print(f"\nQuery ID: {query['id']}")
        print(f"Text: {query['text']}")



    print("SAMPLE RELEVANCE JUDGMENTS")

    for i, (query_id, doc_ids) in enumerate(list(relevance.items())[:num_samples]):
        print(f"\nQuery {query_id} has {len(doc_ids)} relevant documents:")
        print(f"Relevant Doc IDs: {doc_ids[:10]}...")  # Show first 10
        print("-" * 60)



if __name__ == "__main__":

    documents, queries, relevance = load_cisi_dataset('cisi_data')


    if documents and queries and relevance:
        display_samples(documents, queries, relevance, num_samples=3)

        print("\nPhase 1 Complete: Data successfully parsed!")
        print(" You can now access:")
        print(" documents[1] → Get document 1")
        print(" queries[5] → Get query 5")
        print(" relevance[1] → Get relevant docs for query 1")

LOADING CISI DATASET
Successfully parsed 1460 documents from CISI.ALL
 Successfully parsed 112 queries from CISI.QRY
Successfully parsed relevance judgments for 76 queries from CISI.REL
DATASET SUMMARY
Total Documents: 1460
Total Queries: 112
Queries with Relevance Judgments: 76
SAMPLE DOCUMENTS

Document ID: 1
Title: 18 Editions of the Dewey Decimal Classifications...
Author: Comaromi, J.P.
Content Preview: 18 Editions of the Dewey Decimal Classifications The present study is a history of the DEWEY Decimal
Classification.  The first edition of the DDC was...

Document ID: 2
Title: Use Made of Technical Libraries...
Author: Slater, M.
Content Preview: Use Made of Technical Libraries This report is an analysis of 6300 acts of use
in 104 technical libraries in the United Kingdom.
Library use is only o...

Document ID: 3
Title: Two Kinds of Power
An Essay on Bibliographic Control...
Author: Wilson, P.
Content Preview: Two Kinds of Power
An Essay on Bibliographic Control The relationships 

**PHASE 2: TEXT PREPROCESSING FOR INFORMATION RETRIEVAL**

In [None]:

!pip install nltk


import re
import string
from collections import Counter
import nltk


nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

print("Libraries installed and imported successfully!")

Libraries installed and imported successfully!


In [None]:
class TextPreprocessor:
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        self.stemmer = PorterStemmer()

    def preprocess(self, text):
        """Clean and process text into tokens."""
        if not text:
            return []


        text = text.lower()
        text = text.translate(str.maketrans('', '', string.punctuation))


        tokens = re.findall(r'\b\w+\b', text)


        tokens = [t for t in tokens if t not in self.stop_words and len(t) >= 2]


        tokens = [self.stemmer.stem(t) for t in tokens]

        return tokens

In [None]:
preprocessor = TextPreprocessor()
test = "The RUNNING dogs are running QUICKLY!"
result = preprocessor.preprocess(test)

print("Preprocessor created!")
print(f"\nTest: '{test}'")
print(f"Result: {result}")

Preprocessor created!

Test: 'The RUNNING dogs are running QUICKLY!'
Result: ['run', 'dog', 'run', 'quickli']
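
One side effect of the preprocessing step is worth seeing in isolation: punctuation is deleted rather than replaced with a space, so hyphenated words are fused into a single token (for example `computerreadi` in the query tokens printed later). The sketch below uses only the standard library and a made-up phrase to compare the two behaviours; the variable names are illustrative and not part of the notebook.

```python
import re
import string

text = "computer-ready form"

# Same approach as TextPreprocessor.preprocess: punctuation is removed outright.
deleted = text.translate(str.maketrans('', '', string.punctuation))

# Alternative (an assumption, not what the notebook does): map punctuation to spaces.
to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = text.translate(to_space)

tokens_deleted = re.findall(r'\b\w+\b', deleted)
tokens_spaced = re.findall(r'\b\w+\b', spaced)

print(tokens_deleted)
print(tokens_spaced)
```

Which behaviour is preferable depends on the collection. Splitting on hyphens yields more matches on the component words, while fusing keeps compound terms distinct.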


**PREPROCESSING ALL DOCUMENTS AND QUERIES**

In [None]:
preprocessed_docs = {}


print("PREPROCESSING DOCUMENTS")


for doc_id, doc in documents.items():
    tokens = preprocessor.preprocess(doc['content'])
    preprocessed_docs[doc_id] = {
        'id': doc_id,
        'tokens': tokens,
        'original_title': doc['title']
    }


    if doc_id % 200 == 0:
        print(f"  Processed {doc_id}/{len(documents)} documents...")

print(f"\nPreprocessed {len(preprocessed_docs)} documents!")


print("\n EXAMPLE - Document 1:")
print(f"Original: {documents[1]['content'][:150]}...")
print(f"Tokens: {preprocessed_docs[1]['tokens'][:15]}")
print(f"Total tokens: {len(preprocessed_docs[1]['tokens'])}")

PREPROCESSING DOCUMENTS
  Processed 200/1460 documents...
  Processed 400/1460 documents...
  Processed 600/1460 documents...
  Processed 800/1460 documents...
  Processed 1000/1460 documents...
  Processed 1200/1460 documents...
  Processed 1400/1460 documents...

Preprocessed 1460 documents!

 EXAMPLE - Document 1:
Original: 18 Editions of the Dewey Decimal Classifications The present study is a history of the DEWEY Decimal
Classification.  The first edition of the DDC was...
Tokens: ['18', 'edit', 'dewey', 'decim', 'classif', 'present', 'studi', 'histori', 'dewey', 'decim', 'classif', 'first', 'edit', 'ddc', 'publish']
Total tokens: 56


In [None]:
preprocessed_queries = {}

print("PREPROCESSING QUERIES")


for query_id, query in queries.items():
    tokens = preprocessor.preprocess(query['text'])
    preprocessed_queries[query_id] = {
        'id': query_id,
        'tokens': tokens,
        'original_text': query['text']
    }

print(f"Preprocessed {len(preprocessed_queries)} queries!")

PREPROCESSING QUERIES
Preprocessed 112 queries!


**Statistics**

In [None]:
all_tokens = []
for doc in preprocessed_docs.values():
    all_tokens.extend(doc['tokens'])

vocabulary = set(all_tokens)
token_freq = Counter(all_tokens)


print("PREPROCESSING STATISTICS")


print(f"\n DOCUMENTS:")
print(f"  Total documents: {len(preprocessed_docs)}")
print(f"  Total tokens: {len(all_tokens):,}")
print(f"  Unique words (vocabulary): {len(vocabulary):,}")
print(f"  Average tokens per doc: {len(all_tokens)/len(preprocessed_docs):.1f}")

print(f"\n QUERIES:")
total_query_tokens = sum(len(q['tokens']) for q in preprocessed_queries.values())
print(f"  Total queries: {len(preprocessed_queries)}")
print(f"  Average tokens per query: {total_query_tokens/len(preprocessed_queries):.1f}")

print(f"\n TOP 10 MOST COMMON WORDS:")
for word, count in token_freq.most_common(10):
    print(f"  {word}: {count:,} times")


PREPROCESSING STATISTICS

 DOCUMENTS:
  Total documents: 1460
  Total tokens: 262,301
  Unique words (vocabulary): 8,160
  Average tokens per doc: 179.7

 QUERIES:
  Total queries: 112
  Average tokens per query: 47.1

 TOP 10 MOST COMMON WORDS:
  librari: 1,866 times
  inform: 1,644 times
  system: 1,250 times
  use: 1,132 times
  index: 695 times
  research: 617 times
  retriev: 603 times
  data: 585 times
  studi: 572 times
  175: 551 times


In [None]:
print("\n VERIFICATION:")
print(f" preprocessed_docs created: {len(preprocessed_docs)} documents")
print(f"preprocessed_queries created: {len(preprocessed_queries)} queries")
print(f"preprocessor created: {type(preprocessor)}")


print("\n Quick Access Test:")
print(f"Document 5 has {len(preprocessed_docs[5]['tokens'])} tokens")
print(f"Query 10 has {len(preprocessed_queries[10]['tokens'])} tokens")



 VERIFICATION:
 preprocessed_docs created: 1460 documents
preprocessed_queries created: 112 queries
preprocessor created: <class '__main__.TextPreprocessor'>

 Quick Access Test:
Document 5 has 136 tokens
Query 10 has 8 tokens


**Phase 3: Building the Inverted Index**

In [None]:
from collections import defaultdict

print("BUILDING INVERTED INDEX")


inverted_index = defaultdict(list)

for doc_id, doc in preprocessed_docs.items():
    for token in doc['tokens']:
        if doc_id not in inverted_index[token]:
            inverted_index[token].append(doc_id)


inverted_index = dict(inverted_index)

print(f" Inverted index created!")
print(f"   Total unique words: {len(inverted_index):,}")

BUILDING INVERTED INDEX
 Inverted index created!
   Total unique words: 8,160
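
To make the data structure concrete, here is a minimal sketch of the same idea on three toy documents. The token lists are invented for illustration and are not taken from CISI. Unlike the cell above, it deduplicates each document's tokens with `set()` before appending, which avoids the repeated list-membership check.

```python
from collections import defaultdict

# Hypothetical pre-tokenized documents (toy data).
toy_docs = {
    1: ['librari', 'index', 'librari'],
    2: ['index', 'retriev'],
    3: ['retriev', 'system', 'index'],
}

toy_index = defaultdict(list)
for doc_id in sorted(toy_docs):
    # set() drops repeated tokens, so each doc_id is appended once per term.
    for token in set(toy_docs[doc_id]):
        toy_index[token].append(doc_id)

toy_index = dict(toy_index)
print(toy_index['index'])
print(toy_index['retriev'])
```

Each posting list holds the IDs of the documents that contain the term, so its length is the term's document frequency.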


**Calculate Term Frequency (TF)**

In [None]:

print("CALCULATING TERM FREQUENCIES")



document_tf = {}

for doc_id, doc in preprocessed_docs.items():
    tf = Counter(doc['tokens'])
    document_tf[doc_id] = dict(tf)

print(f"Term frequencies calculated for {len(document_tf)} documents!")


CALCULATING TERM FREQUENCIES
Term frequencies calculated for 1460 documents!


**Calculate Inverse Document Frequency (IDF)**

In [None]:
import math


print("CALCULATING IDF SCORES")


total_docs = len(preprocessed_docs)
idf = {}

for term, doc_list in inverted_index.items():
    df = len(doc_list)
    idf[term] = math.log(total_docs / df)

print(f" IDF calculated for {len(idf):,} terms!")


print("\n IDF EXAMPLES:")
print("   (Higher IDF = More rare/important word)")
print()


CALCULATING IDF SCORES
 IDF calculated for 8,160 terms!

 IDF EXAMPLES:
   (Higher IDF = More rare/important word)
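
As a small worked example of the formula used above, `idf = log(N / df)` with the natural logarithm, the sketch below evaluates it for three made-up document frequencies. The collection size and the term names are assumptions chosen for round numbers, not values measured on CISI.

```python
import math

# Toy numbers for illustration only.
N = 1000  # documents in a hypothetical collection
doc_freq = {'common': 1000, 'frequent': 100, 'rare': 1}

toy_idf = {term: math.log(N / df) for term, df in doc_freq.items()}

for term, value in toy_idf.items():
    print(f"{term}: df={doc_freq[term]}, idf={value:.4f}")
```

A term that occurs in every document gets an IDF of 0 and therefore contributes nothing to a TF-IDF score, while a term found in a single document gets the largest weight.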



**Calculate TF-IDF Scores**

In [None]:
print("CALCULATING TF-IDF SCORES")

document_tfidf = {}

for doc_id in preprocessed_docs.keys():
    tfidf = {}
    tf_dict = document_tf[doc_id]

    for term, tf_value in tf_dict.items():
        if term in idf:
            tfidf[term] = tf_value * idf[term]

    document_tfidf[doc_id] = tfidf

print(f" TF-IDF calculated for {len(document_tfidf)} documents!")


CALCULATING TF-IDF SCORES
 TF-IDF calculated for 1460 documents!


**Create Search Function (TF-IDF)**

In [None]:
def search_tfidf(query_text, top_k=10):

    query_tokens = preprocessor.preprocess(query_text)


    scores = {}
    for doc_id in preprocessed_docs.keys():
        score = 0
        for term in query_tokens:
            if term in document_tfidf[doc_id]:
                score += document_tfidf[doc_id][term]

        if score > 0:
            scores[doc_id] = score

    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]

    return ranked

print(" Search function created!")


 Search function created!


In [None]:
test_query = "information retrieval system"
print(f"\nSearching for: '{test_query}'")


results = search_tfidf(test_query, top_k=5)

print(f"\n TOP 5 RESULTS:\n")
for rank, (doc_id, score) in enumerate(results, 1):
    title = preprocessed_docs[doc_id]['original_title']
    print(f"{rank}. Document {doc_id} (Score: {score:.2f})")
    print(f"   Title: {title[:80]}...")
    print()


Searching for: 'information retrieval system'

 TOP 5 RESULTS:

1. Document 636 (Score: 26.14)
   Title: Text Searching Retrieval of Answer-Sentences and Other Answer-Passages...

2. Document 523 (Score: 20.73)
   Title: The Cost_Performance of an On-Line, Free-Text Bibliographic Retrieval System...

3. Document 630 (Score: 20.70)
   Title: A Novel Philosophy for the Design of Information Storage
and Retrieval Systems A...

4. Document 1136 (Score: 20.49)
   Title: Data Retrieval Systems:  Specifics and Problems...

5. Document 615 (Score: 18.54)
   Title: A Cost Model for Evaluating Information Retrieval Systems...



**RETRIEVAL MODELS & QUERY PROCESSING**

In [None]:
import math

print("CALCULATING IDF VALUES")


idf = {}
N = len(preprocessed_docs)

for term, postings in inverted_index.items():
    df = len(postings)
    idf[term] = math.log(N / df)

print(f" IDF calculated for {len(idf)} unique terms")
print(f"\nSample IDF values:")
sample_terms = list(idf.items())[:5]
for term, idf_val in sample_terms:
    print(f"  {term}: {idf_val:.4f}")

CALCULATING IDF VALUES
 IDF calculated for 8160 unique terms

Sample IDF values:
  18: 2.3233
  edit: 3.5485
  dewey: 4.8013
  decim: 4.5136
  classif: 2.6322


**Calculate Document TF-IDF Vectors**

In [None]:
from collections import Counter

print("\nCALCULATING DOCUMENT TF-IDF VECTORS")

doc_vectors = {}

for doc_id, doc in preprocessed_docs.items():
    tokens = doc['tokens']
    doc_length = len(tokens)
    term_freq = Counter(tokens)


    vector = {}
    for term, freq in term_freq.items():
        tf = freq / doc_length
        vector[term] = tf * idf.get(term, 0)

    doc_vectors[doc_id] = vector

print(f"TF-IDF vectors created for {len(doc_vectors)} documents")
print(f"\nDocument 1 vector (first 5 terms):")
for i, (term, score) in enumerate(list(doc_vectors[1].items())[:5]):
    print(f"  {term}: {score:.4f}")


CALCULATING DOCUMENT TF-IDF VECTORS
TF-IDF vectors created for 1460 documents

Document 1 vector (first 5 terms):
  18: 0.0415
  edit: 0.2535
  dewey: 0.2572
  decim: 0.1612
  classif: 0.0940


In [None]:
print("\nBUILDING SEARCH FUNCTION")

def simple_search(query_text, top_k=10):
    """Simple TF-IDF search"""


    query_tokens = preprocessor.preprocess(query_text)
    print(f"Query tokens: {query_tokens}")


    query_freq = Counter(query_tokens)
    query_length = len(query_tokens)

    query_vector = {}
    for term, freq in query_freq.items():
        if term in idf:
            tf = freq / query_length
            query_vector[term] = tf * idf[term]

    scores = {}
    for doc_id, doc_vector in doc_vectors.items():

        score = 0
        for term in query_vector:
            if term in doc_vector:
                score += query_vector[term] * doc_vector[term]

        if score > 0:
            scores[doc_id] = score


    results = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return results[:top_k]

print(" Search function ready!")


BUILDING SEARCH FUNCTION
 Search function ready!
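
`simple_search` scores each document with a plain dot product between the query vector and the document vector. A common alternative is cosine similarity, which also divides by the length of both vectors. The sketch below is an assumption about how that could look with the same dict-based sparse vectors; `cosine_similarity` and the toy vectors are hypothetical and are not used anywhere else in this notebook.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors: both documents match the query terms equally,
# but d_long also carries weight on an unrelated term.
q = {'inform': 0.5, 'retriev': 0.5}
d_short = {'inform': 1.0, 'retriev': 1.0}
d_long = {'inform': 1.0, 'retriev': 1.0, 'librari': 2.0}

sim_short = cosine_similarity(q, d_short)
sim_long = cosine_similarity(q, d_long)
print(f"{sim_short:.4f} {sim_long:.4f}")
```

A dot product would give both toy documents the same score. Cosine similarity ranks `d_short` higher because all of its weight lies on the query terms.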


**TEST THE SEARCH!**

In [None]:
print("\n" + "="*60)
print("TESTING SEARCH")
print("="*60)


query = "information retrieval"
results = simple_search(query, top_k=10)

print(f"\nQuery: '{query}'")
print(f"Found {len(results)} documents\n")


for rank, (doc_id, score) in enumerate(results[:5], 1):
    title = preprocessed_docs[doc_id]['original_title']
    print(f"{rank}. Doc {doc_id} (Score: {score:.4f})")
    print(f"   {title[:70]}...")
    print()


TESTING SEARCH
Query tokens: ['inform', 'retriev']

Query: 'information retrieval'
Found 10 documents

1. Doc 539 (Score: 0.1512)
   Information Retrieval Languages...

2. Doc 1136 (Score: 0.1178)
   Data Retrieval Systems:  Specifics and Problems...

3. Doc 1134 (Score: 0.0966)
   Information Retrieval Learning...

4. Doc 1120 (Score: 0.0712)
   A Grammatical Elements in a Descriptor Language for an 
Information Re...

5. Doc 1171 (Score: 0.0659)
   Problems of Compatibility of Information on Retrieval Systems and 
Req...



In [None]:

query_id = 1
query_text = queries[query_id]['text']

print("\n" + "="*60)
print(f"TESTING WITH CISI QUERY {query_id}")
print("="*60)
print(f"Query: {query_text}\n")

results = simple_search(query_text, top_k=10)


print("Top 10 Retrieved Documents:")
retrieved_ids = []
for rank, (doc_id, score) in enumerate(results, 1):
    retrieved_ids.append(doc_id)
    print(f"{rank}. Doc {doc_id} (Score: {score:.4f})")


print(f"\n📊 EVALUATION:")
print(f"Retrieved: {retrieved_ids}")
print(f"Relevant (ground truth): {relevance[query_id][:10]}")

# Calculate overlap
overlap = set(retrieved_ids) & set(relevance[query_id])
print(f" Found {len(overlap)} relevant documents out of 10!")


TESTING WITH CISI QUERY 1
Query: What problems and concerns are there in making up descriptive titles?
What difficulties are involved in automatically retrieving articles from
approximate titles?
What is the usual relevance of the content of articles to their titles?

Query tokens: ['problem', 'concern', 'make', 'descript', 'titl', 'difficulti', 'involv', 'automat', 'retriev', 'articl', 'approxim', 'titl', 'usual', 'relev', 'content', 'articl', 'titl']
Top 10 Retrieved Documents:
1. Doc 429 (Score: 0.0571)
2. Doc 589 (Score: 0.0553)
3. Doc 276 (Score: 0.0478)
4. Doc 1064 (Score: 0.0430)
5. Doc 322 (Score: 0.0419)
6. Doc 722 (Score: 0.0397)
7. Doc 956 (Score: 0.0383)
8. Doc 869 (Score: 0.0376)
9. Doc 805 (Score: 0.0369)
10. Doc 1323 (Score: 0.0352)

📊 EVALUATION:
Retrieved: [429, 589, 276, 1064, 322, 722, 956, 869, 805, 1323]
Relevant (ground truth): [28, 35, 38, 42, 43, 52, 65, 76, 86, 150]
 Found 4 relevant documents out of 10!


**Week 7 Assignment**

In [None]:
print("STEP 1: CALCULATING PRECISION")

def precision_at_k(retrieved, relevant, k=10):

    retrieved_k = retrieved[:k]
    relevant_set = set(relevant)


    relevant_retrieved = sum(1 for doc_id in retrieved_k if doc_id in relevant_set)

    precision = relevant_retrieved / k if k > 0 else 0
    return precision


query_id = 1
results = simple_search(queries[query_id]['text'], top_k=10)
retrieved = [doc_id for doc_id, score in results]
relevant = relevance[query_id]

precision = precision_at_k(retrieved, relevant, k=10)

print(f"\nQuery {query_id}: {queries[query_id]['text'][:50]}...")
print(f"Retrieved: {retrieved}")
print(f"Relevant:  {relevant[:10]}...")
print(f"Precision@10: {precision:.4f} ({precision*100:.1f}%)")

STEP 1: CALCULATING PRECISION
Query tokens: ['problem', 'concern', 'make', 'descript', 'titl', 'difficulti', 'involv', 'automat', 'retriev', 'articl', 'approxim', 'titl', 'usual', 'relev', 'content', 'articl', 'titl']

Query 1: What problems and concerns are there in making up ...
Retrieved: [429, 589, 276, 1064, 322, 722, 956, 869, 805, 1323]
Relevant:  [28, 35, 38, 42, 43, 52, 65, 76, 86, 150]...
Precision@10: 0.4000 (40.0%)


**Recall@K Function**

In [None]:
print("\nSTEP 2: CALCULATING RECALL")

def recall_at_k(retrieved, relevant, k=10):

    retrieved_k = retrieved[:k]
    relevant_set = set(relevant)


    relevant_retrieved = sum(1 for doc_id in retrieved_k if doc_id in relevant_set)

    recall = relevant_retrieved / len(relevant) if len(relevant) > 0 else 0
    return recall


recall = recall_at_k(retrieved, relevant, k=10)

print(f"\nQuery {query_id}: {queries[query_id]['text'][:50]}...")
print(f"Retrieved {len(retrieved[:10])} documents")
print(f"Total relevant documents: {len(relevant)}")
print(f"Found: {sum(1 for doc in retrieved[:10] if doc in relevant)}")
print(f" Recall@10: {recall:.4f} ({recall*100:.1f}%)")


STEP 2: CALCULATING RECALL

Query 1: What problems and concerns are there in making up ...
Retrieved 10 documents
Total relevant documents: 46
Found: 4
 Recall@10: 0.0870 (8.7%)
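
For Query 1 the numbers above come from 4 relevant documents in the top 10 and 46 relevant documents overall: precision is 4/10 = 0.40 and recall is 4/46 ≈ 0.087. The self-contained sketch below repeats the same arithmetic on a toy ranking; the document IDs and the combined helper `precision_recall_at_k` are made up for illustration.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for binary relevance judgments."""
    top = retrieved[:k]
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in top if doc_id in relevant_set)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Toy data: documents 2 and 4 are the only relevant ones that were retrieved.
toy_retrieved = [7, 2, 9, 4, 1]
toy_relevant = [2, 4, 6, 8]

p, r = precision_recall_at_k(toy_retrieved, toy_relevant, k=5)
print(f"Precision@5: {p:.2f}, Recall@5: {r:.2f}")
```

Precision divides the hits by the cutoff `k`, while recall divides the same hits by the total number of relevant documents. Recall@10 is therefore capped at 10/46 for a query like Query 1.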


**Average Precision (for MAP)**

In [None]:
print("\nSTEP 3: CALCULATING AVERAGE PRECISION")

def average_precision(retrieved, relevant):

    relevant_set = set(relevant)

    precision_sum = 0
    relevant_count = 0

    for i, doc_id in enumerate(retrieved, 1):
        if doc_id in relevant_set:
            relevant_count += 1
            precision_at_i = relevant_count / i
            precision_sum += precision_at_i

    if relevant_count == 0:
        return 0

    ap = precision_sum / len(relevant)
    return ap


ap = average_precision(retrieved, relevant)

print(f"\nQuery {query_id}")
print(f" Average Precision: {ap:.4f}")


STEP 3: CALCULATING AVERAGE PRECISION

Query 1
 Average Precision: 0.0652
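
A short worked example may help in reading the AP value. In the toy ranking below (invented IDs), relevant documents appear at ranks 2 and 4, so the precision values collected are 1/2 and 2/4. Their sum is divided by the total number of relevant documents (3), following the same convention as `average_precision` above. The function name `average_precision_toy` is hypothetical.

```python
def average_precision_toy(retrieved, relevant):
    """AP with binary relevance, normalised by the total number of relevant docs."""
    relevant_set = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved, 1):
        if doc_id in relevant_set:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant_set) if relevant_set else 0.0


toy_retrieved = [3, 1, 7, 5]
toy_relevant = [1, 5, 9]

toy_ap = average_precision_toy(toy_retrieved, toy_relevant)
print(f"AP: {toy_ap:.4f}")
```

Document 9 is never retrieved, so it adds nothing to the sum but still counts in the denominator. The same effect keeps AP low when only the top 10 results are scored against 46 relevant documents.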


**MAP (Mean Average Precision)**

In [None]:
print("\nSTEP 4: CALCULATING MAP (MEAN AVERAGE PRECISION)")

def calculate_map(queries_dict, relevance_dict, top_k=10):

    ap_scores = []

    for query_id in relevance_dict.keys():
        if query_id not in queries_dict:
            continue


        query_text = queries_dict[query_id]['text']
        results = simple_search(query_text, top_k=top_k)
        retrieved = [doc_id for doc_id, score in results]


        relevant = relevance_dict[query_id]
        ap = average_precision(retrieved, relevant)
        ap_scores.append(ap)


        # Report progress by the number of queries evaluated, not by query ID.
        if len(ap_scores) % 20 == 0:
            print(f"  Evaluated {len(ap_scores)} queries...")

    map_score = sum(ap_scores) / len(ap_scores) if ap_scores else 0
    return map_score, ap_scores


print("Evaluating all queries...")
map_score, ap_scores = calculate_map(queries, relevance, top_k=10)

print(f"\n MAP@10: {map_score:.4f}")
print(f"   Evaluated {len(ap_scores)} queries")
print(f"   Best AP: {max(ap_scores):.4f}")
print(f"   Worst AP: {min(ap_scores):.4f}")


STEP 4: CALCULATING MAP (MEAN AVERAGE PRECISION)
Evaluating all queries...
Query tokens: ['problem', 'concern', 'make', 'descript', 'titl', 'difficulti', 'involv', 'automat', 'retriev', 'articl', 'approxim', 'titl', 'usual', 'relev', 'content', 'articl', 'titl']
Query tokens: ['actual', 'pertin', 'data', 'oppos', 'refer', 'entir', 'articl', 'retriev', 'automat', 'respons', 'inform', 'request']
Query tokens: ['inform', 'scienc', 'give', 'definit', 'possibl']
Query tokens: ['imag', 'recognit', 'method', 'automat', 'transform', 'print', 'text', 'computerreadi', 'form']
Query tokens: ['special', 'train', 'ordinari', 'research', 'businessmen', 'need', 'proper', 'inform', 'manag', 'unobstruct', 'use', 'inform', 'retriev', 'system', 'problem', 'like', 'encount']
Query tokens: ['possibl', 'verbal', 'commun', 'comput', 'human', 'commun', 'via', 'spoken', 'word']
Query tokens: ['describ', 'present', 'work', 'plan', 'system', 'publish', 'print', 'origin', 'paper', 'comput', 'save', 'byproduct', 

**nDCG (Normalized Discounted Cumulative Gain)**

In [None]:
import math

print("\nSTEP 5: CALCULATING nDCG")

def dcg_at_k(retrieved, relevant, k=10):
    """
    Calculate DCG@K

    Args:
        retrieved: List of retrieved document IDs
        relevant: List of relevant document IDs
        k: Number of results to consider

    Returns:
        float: DCG score
    """
    relevant_set = set(relevant)
    dcg = 0

    for i, doc_id in enumerate(retrieved[:k], 1):
        if doc_id in relevant_set:

            rel = 1
            dcg += rel / math.log2(i + 1)

    return dcg

def ndcg_at_k(retrieved, relevant, k=10):
    """
    Calculate nDCG@K

    Args:
        retrieved: List of retrieved document IDs
        relevant: List of relevant document IDs
        k: Number of results to consider

    Returns:
        float: nDCG score
    """

    dcg = dcg_at_k(retrieved, relevant, k)


    ideal_retrieved = relevant[:k]
    idcg = dcg_at_k(ideal_retrieved, relevant, k)

    if idcg == 0:
        return 0

    ndcg = dcg / idcg
    return ndcg


query_id = 1
results = simple_search(queries[query_id]['text'], top_k=10)
retrieved = [doc_id for doc_id, score in results]
relevant = relevance[query_id]

ndcg = ndcg_at_k(retrieved, relevant, k=10)

print(f"\nQuery {query_id}")
print(f" nDCG@10: {ndcg:.4f}")


STEP 5: CALCULATING nDCG
Query tokens: ['problem', 'concern', 'make', 'descript', 'titl', 'difficulti', 'involv', 'automat', 'retriev', 'articl', 'approxim', 'titl', 'usual', 'relev', 'content', 'articl', 'titl']

Query 1
 nDCG@10: 0.5068
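
The sketch below walks through the same binary-gain nDCG calculation on a toy ranking, so that the discount at each rank is visible. The IDs are invented, and the helper `dcg` takes a list of gains rather than document IDs, which differs from `dcg_at_k` above.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: gain at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, 1))


toy_retrieved = [3, 1, 7, 5]
toy_relevant = {1, 5, 9}

# Binary gains for the actual ranking: relevant documents sit at ranks 2 and 4.
gains = [1 if doc_id in toy_relevant else 0 for doc_id in toy_retrieved]

# Ideal ranking: all relevant documents first, limited to the cutoff k = 4.
ideal_gains = [1] * min(len(toy_relevant), len(toy_retrieved))

toy_ndcg = dcg(gains) / dcg(ideal_gains)
print(gains, ideal_gains)
print(f"nDCG@4: {toy_ndcg:.4f}")
```

Here the DCG is 1/log2(3) + 1/log2(5) and the ideal DCG is 1 + 1/log2(3) + 1/log2(4), which gives an nDCG of roughly 0.498.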


**Evaluate ALL Queries (Complete Report)**

In [None]:
print("\n" + "="*60)
print("STEP 6: COMPLETE EVALUATION REPORT")
print("="*60)

def evaluate_all_queries(queries_dict, relevance_dict, top_k=10):
    """Complete evaluation with all metrics"""

    results = {
        'precision': [],
        'recall': [],
        'ap': [],
        'ndcg': []
    }

    print("\nEvaluating all queries...")

    for query_id in relevance_dict.keys():
        if query_id not in queries_dict:
            continue

        # Search
        query_text = queries_dict[query_id]['text']
        search_results = simple_search(query_text, top_k=top_k)
        retrieved = [doc_id for doc_id, score in search_results]
        relevant = relevance_dict[query_id]

        # Calculate metrics
        p = precision_at_k(retrieved, relevant, k=top_k)
        r = recall_at_k(retrieved, relevant, k=top_k)
        ap = average_precision(retrieved, relevant)
        ndcg = ndcg_at_k(retrieved, relevant, k=top_k)

        results['precision'].append(p)
        results['recall'].append(r)
        results['ap'].append(ap)
        results['ndcg'].append(ndcg)

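        # Report progress by the number of queries processed, not by query ID.
        if len(results['precision']) % 20 == 0:
            print(f"  Processed {len(results['precision'])} queries...")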

    # Calculate averages
    avg_results = {
        'Precision@10': sum(results['precision']) / len(results['precision']),
        'Recall@10': sum(results['recall']) / len(results['recall']),
        'MAP@10': sum(results['ap']) / len(results['ap']),
        'nDCG@10': sum(results['ndcg']) / len(results['ndcg'])
    }

    return avg_results, results

# Run complete evaluation
avg_results, detailed_results = evaluate_all_queries(queries, relevance, top_k=10)

print("\n" + "="*60)
print("📊 FINAL EVALUATION RESULTS")
print("="*60)
print(f"Total Queries Evaluated: {len(detailed_results['precision'])}")
print()
for metric, score in avg_results.items():
    print(f"{metric}: {score:.4f} ({score*100:.2f}%)")
print("="*60)


STEP 6: COMPLETE EVALUATION REPORT

Evaluating all queries...
Query tokens: ['problem', 'concern', 'make', 'descript', 'titl', 'difficulti', 'involv', 'automat', 'retriev', 'articl', 'approxim', 'titl', 'usual', 'relev', 'content', 'articl', 'titl']
Query tokens: ['actual', 'pertin', 'data', 'oppos', 'refer', 'entir', 'articl', 'retriev', 'automat', 'respons', 'inform', 'request']
Query tokens: ['inform', 'scienc', 'give', 'definit', 'possibl']
Query tokens: ['imag', 'recognit', 'method', 'automat', 'transform', 'print', 'text', 'computerreadi', 'form']
Query tokens: ['special', 'train', 'ordinari', 'research', 'businessmen', 'need', 'proper', 'inform', 'manag', 'unobstruct', 'use', 'inform', 'retriev', 'system', 'problem', 'like', 'encount']
Query tokens: ['possibl', 'verbal', 'commun', 'comput', 'human', 'commun', 'via', 'spoken', 'word']
Query tokens: ['describ', 'present', 'work', 'plan', 'system', 'publish', 'print', 'origin', 'paper', 'comput', 'save', 'byproduct', 'articl', 'co