#### VECTOR SPACE MODEL
Develop an information retrieval (IR) system based on the vector space model.

The system must return the top k documents in response to a user query.

Use a publicly available dataset to build and test the system.

In [1]:
import nltk
from nltk.corpus import reuters
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### DATASET USED : Reuters-21578

In [3]:
# Reuters-21578 dataset
nltk.download('reuters')

# List the file IDs of every document in the Reuters-21578 corpus
documents = reuters.fileids()


[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


### QUERY : 'crude oil prices'

In [None]:
# User query
query = "crude oil prices"

**Perform TOKENIZATION - Convert input text to tokens/words:**

In [4]:
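# Tokenization and preprocessing
nltk.download('punkt')
# reuters.words() yields the tokens of each document; join them back into
# one space-separated string per document, the input TfidfVectorizer expects
preprocessed_documents = [' '.join(reuters.words(file_id)) for file_id in documents]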

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
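
The joined strings are tokenized a second time inside `TfidfVectorizer`. As a rough illustration of what that step does with its default settings, the sketch below calls the vectorizer's analyzer directly. The sample sentence is made up for this example and is not taken from the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the callable the vectorizer applies to each document:
# lowercasing, then the default token pattern (tokens of 2+ word characters).
analyzer = TfidfVectorizer().build_analyzer()

sample = "Crude oil prices, U.S. rose 1.5 pct"  # hypothetical sample sentence
tokens = analyzer(sample)
print(tokens)  # ['crude', 'oil', 'prices', 'rose', 'pct']
```

Single-character tokens such as the pieces of "U.S." and "1.5" are dropped by the default pattern, which may matter for queries that contain abbreviations or numbers.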


**TF-IDF VECTORIZATION - Represent documents as numerical vectors:**

In [5]:
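# TF-IDF Vectorization
# Fit the vocabulary and IDF weights on the corpus, then map the query
# into the same vector space with transform()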
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_documents)
query_vector = vectorizer.transform([query])
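
To make the fit/transform split concrete, here is a minimal sketch on a three-document toy corpus. The documents and the `toy_*` names are invented for illustration. It shows that the matrix has one row per document and one column per vocabulary term, and that query words unseen during fitting are silently ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from Reuters-21578
toy_docs = [
    "crude oil prices rise",
    "oil company cuts crude postings",
    "wheat harvest forecast",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: documents x terms

# "gasoline" was never seen during fit, so it gets no column and no weight
toy_query = toy_vectorizer.transform(["crude oil gasoline"])

print(toy_matrix.shape)                          # (3, 10)
print("gasoline" in toy_vectorizer.vocabulary_)  # False
print(toy_query.nnz)                             # 2 non-zero entries: crude, oil
```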

**COSINE SIMILARITY CALCULATION - Measures the similarity between two vectors by computing the cosine of the angle between them:**

In [6]:
# Cosine Similarity Calculation
cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
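
As a sanity check on what `cosine_similarity` computes, the sketch below evaluates the formula (dot product divided by the product of the vector norms) by hand on small made-up vectors and compares it with the library result. The vectors `a` and `b` are assumptions chosen for easy arithmetic.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical query vector (1 x 3) and three document vectors (3 x 3)
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([
    [2.0, 4.0, 0.0],  # same direction as a
    [0.0, 0.0, 3.0],  # orthogonal to a
    [1.0, 0.0, 0.0],  # partial overlap with a
])

# cos(a, b_i) = (a . b_i) / (||a|| * ||b_i||)
manual = (b @ a.ravel()) / (np.linalg.norm(b, axis=1) * np.linalg.norm(a))
library = cosine_similarity(a, b).flatten()

print(manual)
print(np.allclose(manual, library))
```

A vector pointing in the same direction scores 1 regardless of its length, and a vector sharing no terms with the query scores 0.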

**RANKING the documents based on their cosine similarity scores:**

In [7]:
# Ranking
k = 5  # Number of top documents to return
top_indices = cosine_similarities.argsort()[-k:][::-1]

# Output the top k documents
for i, index in enumerate(top_indices):
    print(f"Rank {i+1}: Document {index+1} - Similarity: {cosine_similarities[index]}")
    print(f"Title: {reuters.raw(documents[index])[:100]}...")  # Displaying first 100 characters of the document
    print()

Rank 1: Document 4718 - Similarity: 0.460174090990781
Title: DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES
  Diamond Shamrock Corp said that
  effective today it had ...

Rank 2: Document 1741 - Similarity: 0.43152384097126323
Title: UNION PACIFIC &lt;UNP> RAISES CRUDE OIL PRICES
  Union Pacific Resources, formerly
  Champlin Petrol...

Rank 3: Document 5684 - Similarity: 0.4299670269268614
Title: DIAMOND SHAMROCK &lt;DIA> RAISES CRUDE OIL POSTINGS
  Diamond Shamrock said it raised its
  posted p...

Rank 4: Document 6514 - Similarity: 0.4069576391740879
Title: CONOCO RAISES CRUDE OIL PRICES UP TO ONE DLR BARREL, WTI AT 17.50 DLRS

  CONOCO RAISES CRUDE OIL PR...

Rank 5: Document 7856 - Similarity: 0.40545906027199896
Title: UNOCAL &lt;UCL> UNIT CUTS CRUDE OIL POSTED PRICES
  Unocal Corp's Union Oil Co said it
  lowered its...
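
The ranking expression `argsort()[-k:][::-1]` can be traced on a small array. The scores below are made up for illustration. The sketch also shows one possible way to skip documents with zero similarity, which the cell above would otherwise return whenever fewer than k documents match the query.

```python
import numpy as np

scores = np.array([0.10, 0.46, 0.00, 0.43, 0.41])  # hypothetical similarity scores
k = 3

order = scores.argsort()   # indices sorted by ascending score
top = order[-k:][::-1]     # last k indices, reversed to descending score
print(top.tolist())        # [1, 3, 4]

# Optional variant: keep only documents that share at least one term with the query
matching = [int(i) for i in scores.argsort()[::-1] if scores[i] > 0][:k]
print(matching)            # [1, 3, 4]
```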



### QUERY : 'medical research advancements'

In [8]:
# User query
query = "medical research advancements"

# Tokenization and preprocessing (same steps as for the first query; the corpus is unchanged)
nltk.download('punkt')
preprocessed_documents = [' '.join(reuters.words(file_id)) for file_id in documents]

# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_documents)
query_vector = vectorizer.transform([query])

# Cosine Similarity Calculation
cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

# Ranking
k = 5  # Number of top documents to return
top_indices = cosine_similarities.argsort()[-k:][::-1]

# Output the top k documents
for i, index in enumerate(top_indices):
    print(f"Rank {i+1}: Document {index+1} - Similarity: {cosine_similarities[index]}")
    print(f"Title: {reuters.raw(documents[index])[:100]}...")  # Displaying first 100 characters of the document
    print()


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Rank 1: Document 9166 - Similarity: 0.20728071938985732
Title: PESCH SEES SHAREHOLDER SUPPORT IN AMI &lt;AMI> BID
  Chicago physician LeRoy Pesch said he
  has had...

Rank 2: Document 10664 - Similarity: 0.206784438279818
Title: AMERICAN MEDICAL INTERNATIONAL INC 2ND QTR SHR PROFIT 32 CTS VS LOSS 95 CTS

  AMERICAN MEDICAL INTE...

Rank 3: Document 10343 - Similarity: 0.19551668067343334
Title: UNITED MEDICAL &lt;UM> TO SELL UNIT
  United Medical Corp said it
  has reached a definitive agreeme...

Rank 4: Document 189 - Similarity: 0.19517820607476374
Title: FRONTIER INSURANCE &lt;FRTR> IN ACQUISITION TALKS
  Frontier Insurance Group Inc
  said it is curren...

Rank 5: Document 2069 - Similarity: 0.19314426478046617
Title: FRONTIER &lt;FRTR.O> BUYS MALPRACTICE BUSINESS
  Frontier Insurance Group Inc
  said it acquired the...
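
Both query cells repeat the same pipeline, and only the query string changes. One way to avoid refitting the vectorizer for every query is sketched below. `build_index` and `search` are hypothetical helper names, and the toy texts stand in for `preprocessed_documents` so that the example runs without downloading the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_index(texts):
    """Fit TF-IDF once on the corpus and return the vectorizer and matrix."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(texts)
    return vectorizer, matrix


def search(query, vectorizer, matrix, k=5):
    """Return (document index, similarity) pairs for the k best matches."""
    scores = cosine_similarity(vectorizer.transform([query]), matrix).flatten()
    top = scores.argsort()[-k:][::-1]
    return [(int(i), float(scores[i])) for i in top]


# Hypothetical stand-in for preprocessed_documents
toy_texts = [
    "diamond shamrock cuts crude prices",
    "wheat harvest forecast raised",
    "conoco raises crude oil prices",
]

vectorizer, matrix = build_index(toy_texts)
results = search("crude oil prices", vectorizer, matrix, k=2)
for rank, (index, score) in enumerate(results, start=1):
    print(f"Rank {rank}: Document {index + 1} - Similarity: {score:.3f}")
```

With the real corpus, `build_index(preprocessed_documents)` would be called once and `search` reused for each query.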

