<a href="https://colab.research.google.com/github/aadhamashraf/ZC-SP24-DSAI-201/blob/main/IR_WEEK(3)_Indexing%26ExploringIndexingforStudents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **IR** - WEEK(3)


***This*** is one of a series of Colab notebooks created for the **IR** course. It demonstrates how to **index a collection** and how to access the resulting index to explore its contents and statistics.

The **learning outcomes** of this notebook are:

*   PyTerrier setup.
*   Preprocessing.
*   Indexing a collection.
*   Accessing and exploring the index.




### **PyTerrier Setup**
**What is PyTerrier?**

**[PyTerrier](https://pyterrier.readthedocs.io/en/latest/)** is a Python framework that uses the underlying [Terrier information retrieval](http://terrier.org/) ***toolkit for many indexing and retrieval operations***. PyTerrier was new in 2020, whereas Terrier is written in Java and has a long history dating back to 2001. PyTerrier makes it easy to run IR experiments in Python while relying on the mature Terrier platform for the expensive indexing and retrieval operations.


In [None]:
#install the Pyterrier framework
!pip install python-terrier
# install the nltk modules
!pip install nltk

### You must always start by **importing PyTerrier** and **initialising** it.

To initialise PyTerrier, call **PyTerrier's init()** method.

The **init()** method is needed because PyTerrier must download Terrier's jar file and start the **Java virtual machine**.

We prevent init() from being called more than once by checking started().

In [None]:
import pyterrier as pt
if not pt.started():
  pt.init()

terrier-assemblies 5.8 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8



### We will **import all the python libraries** needed for this lab

*   Pandas
*   nltk





In [None]:
#Import the necessary modules:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import re # used to clean the data
#to display the full text on the notebook without truncation
pd.set_option('display.max_colwidth', 150)


### **Data preparation**
We will first create six textual documents (d0 to d5).

In [None]:
# Each row is a [docno, raw_text] pair
docs_df = pd.DataFrame([["d0", "The importance of education in society and its impact on economic growth"],
                        ["d1", "The benefits of regular exercise for maintaining physical and mental health"],
                        ["d2", "The role of technology in modern workplaces: Increasing productivity and efficiency"],
                        ["d3", "The influence of social media on interpersonal relationships and communication"],
                        ["d4", "The significance of environmental conservation efforts for sustainable development"],
                        # the text of d5 is cut off in the source notebook and is kept as-is
                        ["d5", "The challenges and opportunities of globalization in the"]],
                       columns=["docno", "raw_text"])

docs_df

### Before indexing our data, we need to apply the following **processing steps**:

1.   **Tokenization**
2.   **Stopword removal**
3.   **Normalization**
4.   **Stemming**
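
As a rough illustration of the first three steps, here is a minimal sketch that uses only the Python standard library. The regular-expression tokenizer and the tiny `TOY_STOPWORDS` set are assumptions made for this example (simplified stand-ins for NLTK's `word_tokenize` and its English stopword list), and stemming is left out. The sketch lowercases before filtering so that capitalised stopwords such as "The" are also caught.

```python
import re

# Hypothetical, hand-picked stopword subset; NLTK's English list is much larger.
TOY_STOPWORDS = {"is", "in", "a", "the", "for", "are"}

def preprocess(text):
    # Tokenization: extract runs of letters (a crude stand-in for word_tokenize)
    tokens = re.findall(r"[A-Za-z]+", text)
    # Normalization: convert every token to lower case
    tokens = [t.lower() for t in tokens]
    # Stopword removal: drop tokens found in the stopword set
    return [t for t in tokens if t not in TOY_STOPWORDS]

print(preprocess("Boolean model is a mathematical model for information retrieval."))
```

The NLTK-based cells in this notebook follow the same idea, with a proper tokenizer and the full stopword list.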



### **Let's remove the stopwords.**


### **NLTK Library**
NLTK ships with a list of common English stopwords.


In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

The **NLTK library** has 179 words in its English stopword collection.

As you can observe, the list is made up of very common words such as **was**, **the**, and **I**.

In [None]:
# Display the original DataFrames
print("Original DataFrame:")
docs_df

Original DataFrame:


Unnamed: 0,docno,raw_text
0,d0,Python programming language is widely used in data science.
1,d1,Information retrieval techniques are crucial for efficient data searching.
2,d2,I'm a Data Scientist studing information retrieval.
3,d3,Boolean model is a mathematical model for information retrieval.
4,d4,I think that data retrieval techniques play a key role in information systems.
5,d5,IR system plays a very essential role in Data science field


### **Tokenize, remove stopwords, normalize**

In [None]:
# Build the stopword set once; set membership tests are fast
stop_words = set(stopwords.words('english'))

# Function to tokenize, normalize, and remove stopwords
def remove_stopwords(text):
    tokens = word_tokenize(text)
    # lower() normalizes every token to lower case before the stopword check
    filtered_tokens = [word.lower() for word in tokens if word.lower() not in stop_words]
    print('Tokens are:', tokens, '\n')
    return ' '.join(filtered_tokens)

# Apply the remove_stopwords function to the 'raw_text' column
docs_df['processed_text'] = docs_df['raw_text'].apply(remove_stopwords)



Tokens are: ['Python', 'programming', 'language', 'is', 'widely', 'used', 'in', 'data', 'science', '.'] 

Tokens are: ['Information', 'retrieval', 'techniques', 'are', 'crucial', 'for', 'efficient', 'data', 'searching', '.'] 

Tokens are: ['I', "'m", 'a', 'Data', 'Scientist', 'studing', 'information', 'retrieval', '.'] 

Tokens are: ['Boolean', 'model', 'is', 'a', 'mathematical', 'model', 'for', 'information', 'retrieval', '.'] 

Tokens are: ['I', 'think', 'that', 'data', 'retrieval', 'techniques', 'play', 'a', 'key', 'role', 'in', 'information', 'systems', '.'] 

Tokens are: ['IR', 'system', 'plays', 'a', 'very', 'essential', 'role', 'in', 'Data', 'science', 'field'] 



In [None]:
# Display the  processed DataFrames
print('dataFrame after processing:\n')
docs_df

dataFrame after processing:



Unnamed: 0,docno,raw_text,processed_text
0,d0,Python programming language is widely used in data science.,python programming language widely used data science .
1,d1,Information retrieval techniques are crucial for efficient data searching.,information retrieval techniques crucial efficient data searching .
2,d2,I'm a Data Scientist studing information retrieval.,'m data scientist studing information retrieval .
3,d3,Boolean model is a mathematical model for information retrieval.,boolean model mathematical model information retrieval .
4,d4,I think that data retrieval techniques play a key role in information systems.,think data retrieval techniques play key role information systems .
5,d5,IR system plays a very essential role in Data science field,ir system plays essential role data science field


The last processing step is to **stem** the terms in each document.

In [None]:
# Import only the stemmer we need instead of using wildcard imports
from nltk.stem import PorterStemmer

In [None]:
# Initialize Porter stemmer
stemmer = PorterStemmer()

In [None]:
def stem_text(text):
    # Tokenize, stem each token with the Porter stemmer, and rejoin into a string
    tokens = word_tokenize(text)
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    return ' '.join(stemmed_tokens)



In [None]:
# Apply stemming

docs_df['processed_text'] = docs_df['processed_text'].apply(stem_text)
docs_df

Unnamed: 0,docno,raw_text,processed_text
0,d0,Python programming language is widely used in data science.,python program languag wide use data scienc .
1,d1,Information retrieval techniques are crucial for efficient data searching.,inform retriev techniqu crucial effici data search .
2,d2,I'm a Data Scientist studing information retrieval.,'m data scientist stude inform retriev .
3,d3,Boolean model is a mathematical model for information retrieval.,boolean model mathemat model inform retriev .
4,d4,I think that data retrieval techniques play a key role in information systems.,think data retriev techniqu play key role inform system .
5,d5,IR system plays a very essential role in Data science field,ir system play essenti role data scienc field


### **Indexing:**

Next, we will index the dataframe's documents. The index, with all its data structures, is saved into a directory called **myFirstIndex**.

**[DFIndexer](https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html)**: Use this indexer if you wish to index a `pandas.DataFrame`.

Its constructor arguments are common to all indexer subclasses: IterDictIndexer, DFIndexer, TRECCollectionIndexer and FilesIndexer.

In [None]:
indexer = pt.DFIndexer("./myFirstIndex", overwrite=True)

# index the text, record the docnos as metadata
index_ref = indexer.index(docs_df["processed_text"], docs_df["docno"])

./myFirstIndex/data.properties


#### **Explore the index**
An index has several data structures:

*    **the Lexicon** - the vocabulary of the index, including statistics of the terms, and a pointer into the inverted index.
* **the inverted index (a PostingIndex)** - contains the posting list for each term, detailing the documents in which the term appears and its frequency in each of them.
* **the DocumentIndex** - contains the length of each document (and other field lengths).
* **the direct index (also a PostingIndex)** - contains a posting list for each document, detailing which terms occur in that document and with what frequency. The presence of the direct index depends on the IndexingType that has been applied - single-pass and some memory indices do not provide a direct index.
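
To make these four structures concrete, the following is a toy sketch built from plain Python dictionaries. It is only an illustration of the idea, not how Terrier stores its index on disk; the variable names are chosen for this example, and the three documents are stemmed texts taken from the table above with the trailing punctuation dropped.

```python
from collections import Counter, defaultdict

# Three of the stemmed documents from this notebook (punctuation dropped)
docs = {
    "d0": "python program languag wide use data scienc",
    "d3": "boolean model mathemat model inform retriev",
    "d5": "ir system play essenti role data scienc field",
}

direct_index = {}                   # docno -> {term: frequency in that document}
inverted_index = defaultdict(dict)  # term  -> {docno: frequency in that document}
document_index = {}                 # docno -> document length in tokens

for docno, text in docs.items():
    tokens = text.split()
    counts = Counter(tokens)
    direct_index[docno] = dict(counts)
    document_index[docno] = len(tokens)
    for term, tf in counts.items():
        inverted_index[term][docno] = tf

# Toy lexicon: per-term statistics derived from the posting lists
lexicon = {
    term: {"Nt": len(postings), "TF": sum(postings.values())}
    for term, postings in inverted_index.items()
}

print("Lexicon entry for 'data':", lexicon["data"])
print("Posting list for 'data':", inverted_index["data"])
print("Direct index for d3:", direct_index["d3"])
print("Document lengths:", document_index)
```

In this toy version the lexicon holds only term statistics; in Terrier each lexicon entry also carries a pointer to the term's posting list in the inverted index.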


Let's load the index that was created and check its statistics.

In [None]:
print(index_ref.toString())
#we will first load the index
index = pt.IndexFactory.of(index_ref)
#we will call getCollectionStatistics() to check the stats
print(index.getCollectionStatistics().toString())

./myFirstIndex/data.properties
Number of documents: 6
Number of terms: 25
Number of postings: 40
Number of fields: 0
Number of tokens: 41
Field names: []
Positions:   false
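
The statistics above can be related with a little arithmetic. The sketch below simply copies the three numbers from the output; it assumes that a posting corresponds to one (term, document) pair, so the gap between tokens and postings counts the occurrences of a term beyond its first within the same document.

```python
# Values copied from the collection statistics printed above
num_docs = 6
num_postings = 40
num_tokens = 41

# Average document length in indexed tokens
avg_doc_length = num_tokens / num_docs

# Assumption: one posting per (term, document) pair, so this difference
# counts repeated occurrences of a term within a single document
repeated_occurrences = num_tokens - num_postings

print(f"Average document length: {avg_doc_length:.2f} tokens")
print(f"Repeated in-document occurrences: {repeated_occurrences}")
```

Under that assumption, a difference of one means that exactly one term occurrence in the collection repeats a term already seen in the same document.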



We can check the lexicon which is the **vocabulary** of the collection.

* Nt is the number of unique documents that each term occurs in.
* TF is the total number of occurrences – some weighting models use this instead of Nt.
* The numbers in the @{} are a pointer – they tell Terrier where the postings are for that term in the inverted index data structure.


In [None]:
for kv in index.getLexicon():
  print("%s -> %s " % (kv.getKey(), kv.getValue().toString()))

boolean -> term16 Nt=1 TF=1 maxTF=1 @{0 0 0} 
crucial -> term7 Nt=1 TF=1 maxTF=1 @{0 0 6} 
data -> term2 Nt=5 TF=5 maxTF=1 @{0 1 2} 
effici -> term10 Nt=1 TF=1 maxTF=1 @{0 2 6} 
essenti -> term22 Nt=1 TF=1 maxTF=1 @{0 3 2} 
field -> term23 Nt=1 TF=1 maxTF=1 @{0 4 0} 
inform -> term11 Nt=4 TF=4 maxTF=1 @{0 4 6} 
ir -> term24 Nt=1 TF=1 maxTF=1 @{0 6 0} 
kei -> term17 Nt=1 TF=1 maxTF=1 @{0 6 6} 
languag -> term1 Nt=1 TF=1 maxTF=1 @{0 7 4} 
mathemat -> term14 Nt=1 TF=1 maxTF=1 @{0 7 6} 
model -> term15 Nt=1 TF=2 maxTF=2 @{0 8 4} 
plai -> term19 Nt=2 TF=2 maxTF=1 @{0 9 3} 
program -> term4 Nt=1 TF=1 maxTF=1 @{0 10 3} 
python -> term3 Nt=1 TF=1 maxTF=1 @{0 10 5} 
retriev -> term6 Nt=4 TF=4 maxTF=1 @{0 10 7} 
role -> term21 Nt=2 TF=2 maxTF=1 @{0 12 1} 
scienc -> term0 Nt=2 TF=2 maxTF=1 @{0 13 1} 
scientist -> term12 Nt=1 TF=1 maxTF=1 @{0 14 1} 
search -> term8 Nt=1 TF=1 maxTF=1 @{0 14 5} 
stude -> term13 Nt=1 TF=1 maxTF=1 @{0 15 1} 
system -> term18 Nt=2 TF=2 maxTF=1 @{0 15 5} 
techniqu -> te