***Information Retrieval Project***

**1- Import Necessary Libraries**

In [2]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import defaultdict
# Install necessary packages
!pip install pandas scikit-learn nltk
# Download the NLTK stopwords
nltk.download('stopwords')

Defaulting to user installation because normal site-packages is not writeable


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**2- Load the dataset**

In [3]:
df = pd.read_csv('news_dataset(3).csv')

print("Dataset Preview:")
print(df.head(10))

Dataset Preview:
                                                text  category
0  THIS year has seen significant developments in...  Business
1  PETALING JAYA: Economists are shading their op...  Business
2  PETALING JAYA: Money tied up in the initial pu...  Business
3  WASHINGTON: The U.S. central bank on Wednesday...  Business
4  NEW YORK: Citigroup Inc is stepping up its dea...  Business
5  THE United States shipped a record 6.1 million...  Business
6  BUDGET 2023 laid the groundwork for the princi...  Business
7  PETALING JAYA: Software development firm Web B...  Business
8  PETALING JAYA: While narrowing yield different...  Business
9  AN income tax system is based on the willingne...  Business


**3- Create the Corpus**

In [4]:
documents = [
    "THIS year has seen significant developments in indirect tax, with key changes such as the sales tax on low-value goods (LVG), the increase in the service tax rate from 6% to 8%, an expanded scope of taxable services and a new excise duty on pre-mixed preparations.",
    "PETALING JAYA: Economists are shading their optimism with a sense of prudence over the outlook for Malaysia’s external trade although August numbers again showed the country’s international transactions continue to grow.",
    "NEW YORK: Citigroup Inc is stepping up its dealmaking in the underwriting of bonds for fossil-fuel companies, a development that coincides with intense climate protests outside its Manhattan office.",
    "BUDGET 2023 laid the groundwork for the principles of Madani, focusing on sustainability, prosperity, innovation, trust, and compassion."
]

# Display the selected documents
print("\nSelected Documents:")
for i, doc in enumerate(documents):
    print(f"Document {i + 1}: {doc}\n")

# this code preprocesses a collection of text documents by converting them to lowercase,
# removing punctuation, filtering out common stop words, and applying stemming.
# The result is a cleaner, normalized version of each document


Selected Documents:
Document 1: THIS year has seen significant developments in indirect tax, with key changes such as the sales tax on low-value goods (LVG), the increase in the service tax rate from 6% to 8%, an expanded scope of taxable services and a new excise duty on pre-mixed preparations.

Document 2: PETALING JAYA: Economists are shading their optimism with a sense of prudence over the outlook for Malaysia’s external trade although August numbers again showed the country’s international transactions continue to grow.

Document 3: NEW YORK: Citigroup Inc is stepping up its dealmaking in the underwriting of bonds for fossil-fuel companies, a development that coincides with intense climate protests outside its Manhattan office.

Document 4: BUDGET 2023 laid the groundwork for the principles of Madani, focusing on sustainability, prosperity, innovation, trust, and compassion.



**4- Data Preprocessing**

In [5]:
# Initialize the stop words and stemmer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-z\s]', '', text)  # Remove punctuation
    words = text.split()
    words = [stemmer.stem(word) for word in words if word not in stop_words]  # Remove stop words and stem
    return ' '.join(words)

# Preprocess the documents
cleaned_documents = [preprocess_text(doc) for doc in documents]

# Display cleaned documents
print("\nCleaned Documents:")
for i, doc in enumerate(cleaned_documents):
    print(f"Document {i + 1}: {doc}\n")
    
# This code calculates and displays term frequencies (TF) and TF-IDF values for a set of documents. 
# It begins by defining a function to compute term frequency for each document,normalizing the count of each word by the total word count in the document. 
# Next, it applies this function to all documents and prints the term frequencies per document.
# After calculating TF, it uses Tfidf Vectorizer to compute the TF-IDF matrix, transforming the documents into a numerical
# representation based on term frequency and inverse document frequency.
# The resulting matrix is then displayed as a DataFrame, with terms as columns and documents as rows, showing each term’s TF-IDF score for each document.
# This approach highlights the importance of terms within each document and across the entire document set, aiding in information retrieval and analysis.


Cleaned Documents:
Document 1: year seen signific develop indirect tax key chang sale tax lowvalu good lvg increas servic tax rate expand scope taxabl servic new excis duti premix prepar

Document 2: petal jaya economist shade optim sens prudenc outlook malaysia extern trade although august number show countri intern transact continu grow

Document 3: new york citigroup inc step dealmak underwrit bond fossilfuel compani develop coincid intens climat protest outsid manhattan offic

Document 4: budget laid groundwork principl madani focus sustain prosper innov trust compass



**5- Calculate TF-IDF madel**

In [6]:
# Calculate TF-IDF 

# Calculate Term Frequency (TF)
def calculate_tf(documents):
    tf_list = []
    for doc in documents:
        words = preprocess_text(doc)
        word_count = len(words)
        tf = defaultdict(int)
        
        for word in words:
            tf[word] += 1
        
        # Normalize TF
        for word in tf:
            tf[word] /= word_count
            
        tf_list.append(tf)
    
    return tf_list

# Calculate TF for each document
tf_results = calculate_tf(documents)

# Display the Term Frequencies
print("\nTerm Frequencies:")
for i, tf in enumerate(tf_results):
    print(f"\nDocument {i + 1}:")
    for word, frequency in tf.items():
        print(f"  {word}: {frequency:.4f}")
        
# Calculate TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned_documents)
feature_names = vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

# Display the TF-IDF Matrix
print("\nTF-IDF Matrix:")
print(tfidf_df)
# This code prepares a text query, “sustainability and economic development,” for search or analysis by 
# cleaning it and transforming it into a numerical representation using TF-IDF. After preprocessing the text with preprocess_text,
# it is converted into a vector with vectorizer.transform, making it ready for comparison with other text data in the TF-IDF space.



Term Frequencies:

Document 1:
  y: 0.0129
  e: 0.1161
  a: 0.0839
  r: 0.0581
   : 0.1613
  s: 0.0516
  n: 0.0452
  i: 0.0710
  g: 0.0258
  f: 0.0065
  c: 0.0516
  d: 0.0323
  v: 0.0323
  l: 0.0387
  o: 0.0323
  p: 0.0387
  t: 0.0452
  x: 0.0452
  k: 0.0065
  h: 0.0065
  w: 0.0129
  u: 0.0129
  b: 0.0065
  m: 0.0065

Document 2:
  p: 0.0214
  e: 0.0714
  t: 0.0929
  a: 0.0857
  l: 0.0286
   : 0.1357
  j: 0.0071
  y: 0.0143
  c: 0.0357
  o: 0.0786
  n: 0.0786
  m: 0.0286
  i: 0.0429
  s: 0.0571
  h: 0.0286
  d: 0.0214
  r: 0.0571
  u: 0.0571
  k: 0.0071
  x: 0.0071
  g: 0.0214
  b: 0.0071
  w: 0.0143

Document 3:
  n: 0.0769
  e: 0.0692
  w: 0.0154
   : 0.1308
  y: 0.0077
  o: 0.0769
  r: 0.0385
  k: 0.0154
  c: 0.0538
  i: 0.0923
  t: 0.0769
  g: 0.0077
  u: 0.0308
  p: 0.0385
  s: 0.0462
  d: 0.0462
  a: 0.0538
  l: 0.0385
  m: 0.0308
  b: 0.0077
  f: 0.0308
  v: 0.0077
  h: 0.0077

Document 4:
  b: 0.0125
  u: 0.0625
  d: 0.0500
  g: 0.0250
  e: 0.0250
  t: 0.0500
   : 0.1250
  l: 

**7- Propose a Query**

In [16]:
query = "Budget principles for sustainable development Malaysia" # Proposed query

# Preprocess the query
cleaned_query = preprocess_text(query)

# Transform the query into the TF-IDF space
query_vector = vectorizer.transform([cleaned_query])

# This code calculates and displays similarity scores to rank documents based on their relevance to a given query.
# It first computes the cosine similarity between the query, represented as a TF-IDF vector, and the TF-IDF matrix of all documents,
# with higher scores indicating closer matches. Then, it prints each document’s similarity score,
# allowing for a ranked list of documents by their relevance to the query.
# This approach effectively highlights the most relevant documents for the given query.

**8- Ranking Using Cosine Similarity**

In [17]:
similarity_scores = cosine_similarity(query_vector, tfidf_matrix)

# Display the similarity scores
print("\nSimilarity Scores:")
for i, score in enumerate(similarity_scores[0]):
    print(f"Document {i + 1}: {score:.4f}")
    
# We proposed a query, transformed it into the TF-IDF space, and calculated cosine similarity scores 
# to rank the documents based on their relevance to the query.


Similarity Scores:
Document 1: 0.0501
Document 2: 0.1040
Document 3: 0.0696
Document 4: 0.4208


**9- Conculsion**