# Information Retrieval (IR) Lab: Implementing Bag-of-Words and TF-IDF for Document Search

## Lab Objectives
- Understand and apply fundamental Information Retrieval (IR) concepts.
- Implement **Bag-of-Words (BoW)** and **TF-IDF** (optional) models for document retrieval.
- Use an existing corpus (song lyrics with short descriptions) for indexing and searching.
- Evaluate retrieval performance using basic ranking methods. (optional)

---

## Dataset
- We will use **Spotify songs** from the Hugging Face dataset **[petkopetkov/spotify-million-song-dataset-descriptions](https://huggingface.co/datasets/petkopetkov/spotify-million-song-dataset-descriptions?row=0)**.
- The dataset consists of song lyrics and corresponding metadata (artist, link, etc.).
- The corpus will be preprocessed and available in a structured format.

---
## Potential Use
- Build a search engine that ranks songs by the cosine similarity between the query vector and each document vector.
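
As a rough illustration of that idea, the sketch below computes cosine similarity between two term-count vectors in plain Python. The helper name `cosine` and the toy vectors are assumptions made for this example; the lab cells rely on `sklearn.metrics.pairwise.cosine_similarity` instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Toy term-count vectors over a hypothetical 4-word vocabulary
query = [1, 0, 1, 0]
doc_a = [2, 0, 2, 0]   # same direction as the query
doc_b = [0, 3, 0, 1]   # shares no terms with the query

print(cosine(query, doc_a))  # close to 1: identical direction
print(cosine(query, doc_b))  # 0.0: no overlap
```

Because only the direction of the vectors matters, a long song that repeats the query words many times is not favoured over a short one simply for being longer.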
---
## Lab Setup
### Required Libraries


In [4]:
!pip install datasets
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity



## Dataset Preparation

Download and load the dataset into a Pandas DataFrame. (This step may take several minutes)

In [26]:
# Login using e.g. `huggingface-cli login` to access this dataset
df = pd.read_parquet("hf://datasets/petkopetkov/spotify-million-song-dataset-descriptions/data/train-00000-of-00001.parquet")

# Corpus preprocessing

In [30]:
# Keep only the useful columns and draw a reproducible sample of 5,000 songs
df = df[['text', 'description', 'artist']].sample(n=5000, random_state=42)


# Task 1: Preprocessing the Corpus

* Convert text to lowercase
* Remove stopwords
* Tokenize the text
* Optionally, apply stemming or lemmatization


In [31]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Tokenizer models and the stopword list used below
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
ps = PorterStemmer()  # available for the optional stemming step; not applied below

def preprocess(text):
    # Lowercase and tokenize, then keep alphanumeric tokens that are not stopwords
    words = word_tokenize(text.lower())
    words = [word for word in words if word.isalnum() and word not in stop_words]
    return " ".join(words)


df['processed_text'] = df['text'].apply(preprocess)
df[:5]

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | text | description | artist | processed_text |
|---|---|---|---|---|
| 19382 | I'm alive! And I see things mighty clear today... | Upbeat, energetic, simple, declarative, celebr... | Tom Jones | alive see things mighty clear today alive aliv... |
| 41041 | \r\nI'm a one boy, one boy girl \r\nYou kno... | Pop, catchy, playful, confident, assertive, ro... | Kylie Minogue | one boy one boy girl know one boy girl 1 girls... |
| 7360 | Now if you're gonna cruise be sure that you'll... | upbeat, tropical, rhythmic, exotic, enchanting... | Hank Snow | gon na cruise sure choose habana waters green ... |
| 842 | Words and music by Jimmy Rodgers \r\nCattle p... | upbeat, nostalgic, Western, storytelling, yode... | Arlo Guthrie | words music jimmy rodgers cattle prowl coyotes... |
| 21490 | Some way, some day, I'll find a way \r\nTo ma... | Hopeful, introspective, melancholic, reflectiv... | Who | way day find way make see way even think like ... |
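
The optional stemming step from the Task 1 list is not applied by `preprocess`. A minimal sketch of how it could be added with NLTK's `PorterStemmer` follows; the function name `stem_tokens` is an assumption for illustration, and stemming the corpus would change the `processed_text` values and every result derived from them.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_tokens(text):
    """Stem every whitespace-separated token of an already preprocessed string."""
    return " ".join(ps.stem(word) for word in text.split())

print(stem_tokens("walking rainbows"))
```

If the documents are stemmed, the query has to go through the same function; otherwise query words and vocabulary words will no longer match.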


# Task 2: Implementing Bag-of-Words (BoW) from Scratch

Using a manual implementation to create a document-term matrix:

In [8]:
# Build the vocabulary: every distinct token, in order of first appearance
def create_vocab(corpus):
    # corpus is an iterable of strings (a list or a pandas Series).
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(word for sentence in corpus for word in sentence.split()))

def build_bow_matrix(corpus, vocab):
    word_index = {word: i for i, word in enumerate(vocab)}  # word -> column index
    # One row per document, one column per vocabulary word, initialised to 0
    bow_matrix = np.zeros((len(corpus), len(vocab)))

    # Go through the corpus document by document
    for doc_id, doc in enumerate(corpus):
        # Alternative: Counter(doc.split()) records each word with its frequency,
        # so every distinct word only needs to be stored once
        for word in doc.split():
            # Dictionary lookup, so we never scan the whole vocabulary
            column_id = word_index.get(word)
            if column_id is not None:
                # The vocabulary order is the column order of the matrix
                bow_matrix[doc_id, column_id] += 1

    return bow_matrix



### A test sample

In [9]:
# test your functions
example_corpus = [
    "like a rainbow shinning in the sky",
    "walk down the street being awear of your sweat",
    "don't pretent to know nothing about it, I know you've done it"
]


vocab = create_vocab(example_corpus)
bow_matrix = build_bow_matrix(example_corpus, vocab)

print("Vocabulary:", vocab)
print("BoW Matrix:")
print(bow_matrix)

Vocabulary: ['like', 'a', 'rainbow', 'shinning', 'in', 'the', 'sky', 'walk', 'down', 'street', 'being', 'awear', 'of', 'your', 'sweat', "don't", 'pretent', 'to', 'know', 'nothing', 'about', 'it,', 'I', "you've", 'done', 'it']
BoW Matrix:
[[1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 2. 1. 1. 1. 1. 1.
  1. 1.]]


In [10]:
query = "walk down the street like a rainbow shinning"
query_vector = np.zeros(len(vocab))
for word in preprocess(query).split():
    if word in vocab:
        query_vector[vocab.index(word)] += 1  # count each occurrence of the word in the query
print(query_vector)

[1. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0.]


In [None]:
# cosine_similarity expects 2-D inputs, so the query vector is wrapped in a list
similarities = cosine_similarity([query_vector], bow_matrix)
# argsort returns indices in ascending order of similarity; take the last 5 and reverse them
top_indices = similarities.argsort()[0][-5:][::-1]

# These indices refer to example_corpus (3 documents), not to df
for i in top_indices:
    print(example_corpus[i], "*****", similarities[0][i])
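
The `argsort` ranking idiom used in this cell can be checked on a tiny array. The scores below are made up for illustration and do not come from the lab data.

```python
import numpy as np

# Hypothetical similarity scores for 4 documents, shaped (1, 4) like the
# output of cosine_similarity for a single query
scores = np.array([[0.10, 0.80, 0.00, 0.35]])

ascending = scores.argsort()[0]   # document indices, lowest score first
top_2 = ascending[-2:][::-1]      # keep the 2 best, highest score first

print(ascending)  # [2 0 3 1]
print(top_2)      # [1 3]
```

When there are fewer documents than the requested cut-off, the slice simply returns all of them, still ordered from best to worst.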

### A reminder about pandas DataFrame structure

In [35]:
a = df['processed_text']    # single brackets: a pandas Series, one column as one-dimensional data
b = df[['processed_text']]  # double brackets: a DataFrame that keeps the column structure
print(type(a), type(b))

<class 'pandas.core.series.Series'> <class 'pandas.core.frame.DataFrame'>


### Transforming our datasets

In [None]:
corpus = df['processed_text'].tolist()  # a plain list of strings, one per song
vocab = create_vocab(corpus)            # one vocabulary built from the whole corpus
bow_matrix = build_bow_matrix(corpus, vocab)

print("Vocabulary:", vocab)
print("BoW Matrix:")
print(bow_matrix)
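
The Series/DataFrame distinction from the reminder above matters when calling `apply`. In the sketch below (toy data, assumed for illustration), `Series.apply` calls the function once per element, while `DataFrame.apply` with its default `axis=0` calls it once per column and passes the whole column as a Series.

```python
import pandas as pd

toy = pd.DataFrame({"processed_text": ["alive see", "one boy girl"]})

# Series.apply: the function receives each string in turn
per_row = toy["processed_text"].apply(lambda s: len(s.split()))

# DataFrame.apply (axis=0): the function receives each column as a Series
per_column = toy[["processed_text"]].apply(len)

print(per_row.tolist())              # word count of each document
print(per_column["processed_text"])  # number of rows in the column
```

For building a vocabulary, the function should therefore receive the column of strings itself (a Series or a list), not a one-column DataFrame.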

In [59]:
query = "walk down the street like a rainbow shinning"
query_vector = np.zeros(len(vocab))
for word in preprocess(query).split():
    if word in vocab:
        query_vector[vocab.index(word)] += 1 # or query.count(word) to get frequency
print(query_vector)

[0. 0. 0. ... 0. 0. 0.]


In [None]:
# cosine_similarity expects 2-D inputs, so the query vector is wrapped in a list
similarities = cosine_similarity([query_vector], bow_matrix)
# argsort returns indices in ascending order of similarity; take the last 5 and reverse them
top_indices = similarities.argsort()[0][-5:][::-1]

for i in top_indices:
    print(df.iloc[i]['text'], "*****", df.iloc[i]['description'], "*****")
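
A dense document-term matrix has one cell per (document, vocabulary word) pair, and most of those cells are zero. One possible alternative, sketched here with only the standard library, stores each document as a `Counter` of its own words and computes cosine similarity over the shared keys. The function names and toy documents are assumptions for illustration, not part of the lab code.

```python
import math
from collections import Counter

def to_counts(text):
    """Sparse bag-of-words: only words that occur are stored."""
    return Counter(text.split())

def sparse_cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = ["walk street rainbow", "rainbow rainbow sky", "cattle prowl coyotes"]
doc_counts = [to_counts(d) for d in docs]
query_counts = to_counts("walk street like rainbow")

scores = [sparse_cosine(query_counts, d) for d in doc_counts]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document positions, best match first
```

Memory then grows with the number of words that actually occur in each document rather than with the size of the whole vocabulary.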

# Task 3: Implementing TF-IDF for Retrieval
Use `TfidfVectorizer` to weight each term by how often it occurs in a document and how rare it is across the corpus:

In [32]:
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['text'])

In [33]:
query_vector_tfidf = tfidf_vectorizer.transform([preprocess(query)])
similarities_tfidf = cosine_similarity(query_vector_tfidf, X_tfidf)

top_indices_tfidf = similarities_tfidf.argsort()[0][-5:][::-1]
for i in top_indices_tfidf:
    print(df.iloc[i]['text'])

When all the world is a hopeless jumble,  
And the raindrops tumble all around,  
Heaven opens a magic lane.  
When all the clouds darken up the skyway,  
There's a rainbow highway to be found,  
Leading from your window pane,  
To a place behind the sun,  
Just a step beyond the rain.  
  
Somewhere, over the rainbow, way up high,  
There's a land that I heard of once in a lullaby.  
  
Somewhere over the rainbow, skies are blue,  
And the dreams that you dare to dream,  
Really do come true.  
  
Someday, I'll wish upon a star  
And wake up where the clouds are far behind me.  
Where troubles melt like lemon drops,  
Away above the chimney tops,  
That's where you'll find me.  
  
Somewhere over the rainbow, bluebirds fly.  
Birds fly over the rainbow,  
Why then, oh, why can't I?  
If all those little bluebirds fly beyond the rainbow,  
Why, oh why, can't I?


(E. Y Harnburg)  
  
Hooh  
  
Somewhere over the rainbow  
Way up high  
There's a land t
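
To make the weighting behind Task 3 more concrete, here is a small from-scratch sketch. It assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which is what scikit-learn documents as the default for `TfidfVectorizer`. Tokenisation is a plain `split()`, so the numbers will not match the vectorizer exactly on real lyrics. The function name and toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed idf and L2 normalisation, one dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}

    vectors = []
    for tokens in tokenized:
        weights = {w: tf * idf[w] for w, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        vectors.append({w: v / norm for w, v in weights.items()} if norm else {})
    return vectors, idf

toy_docs = ["rainbow sky rainbow", "sky street", "street walk"]
vectors, idf = tfidf_vectors(toy_docs)
print(idf)
print(vectors[0])
```

In this toy corpus "rainbow" occurs in one document and "sky" in two, so "rainbow" receives the larger idf and dominates the first document's vector.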

# Task 4: Compare BoW and TF-IDF Performance

Retrieve documents using both methods.

Compare their ranking results.

Discuss why one method performs better than the other for certain queries.



### Implementation of Sklearn Bag-of-Words

Using **CountVectorizer** from Sklearn for BoW implementation:

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(df['processed_text'])

# Transform the query
query_vector_bow = vectorizer.transform([preprocess(query)])

# Compute similarity
similarities_bow = cosine_similarity(query_vector_bow, X_bow)

top_indices_bow = similarities_bow.argsort()[0][-5:][::-1]
for i in top_indices_bow:
    print(df.iloc[i]['artist'])

Barbra Streisand
Il Divo
Kylie Minogue
Yo La Tengo
Lorde
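
One simple way to compare the two rankings for Task 4 is to measure how many documents their top-k lists share. The sketch below uses hypothetical index lists in place of `top_indices_bow` and `top_indices_tfidf`; the helper name `overlap_at_k` is likewise an assumption.

```python
def overlap_at_k(ranking_a, ranking_b, k=5):
    """Fraction of the top-k documents that appear in both rankings."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / k

# Hypothetical top-5 row positions returned by two retrieval methods
bow_top = [12, 7, 3, 44, 9]
tfidf_top = [7, 12, 50, 3, 21]

print(overlap_at_k(bow_top, tfidf_top))  # 3 of 5 shared -> 0.6
```

A low overlap only shows that the methods disagree; deciding which ranking is better still requires reading the retrieved lyrics against the query.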
