# Term-Based Indexing

Term-based indexing indexes documents by the individual terms (words) that appear in them. Each term is associated with a list of identifiers
of the documents in which it occurs, so documents matching specific query terms can be retrieved without scanning the whole collection. Its main
advantage is fast retrieval of the documents that contain the query terms. However, the index can be memory-intensive, because it stores an entry
for every distinct term, and it relies on text preprocessing steps such as tokenization, normalization, and stemming to handle variations in
spelling and word form. In the example below, we apply term-based indexing in Python. We read reviews from a CSV file, lowercase and tokenize them
into words, remove common English stopwords, and then build an inverted index that maps each term to the review IDs in which it appears, together
with the term's positions in each review.
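
Before the full example, the core idea can be sketched on a tiny in-memory corpus. This is only an illustration: the `docs` dictionary and its contents are made up for this sketch and are not part of the reviews dataset, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

# Hypothetical toy corpus: document ID -> text (not the reviews dataset).
docs = {
    "d1": "the software is fast",
    "d2": "the interface is intuitive",
    "d3": "fast support and fast updates",
}

# Map each term to the set of document IDs that contain it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

# Looking up a term is a single dictionary access.
print(sorted(inverted_index["fast"]))       # ['d1', 'd3']
print(sorted(inverted_index["intuitive"]))  # ['d2']
```

Using a set per term ignores repeated occurrences within a document; the example that follows stores positions instead, which preserves them.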


In [1]:
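import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict

# One-time downloads of the tokenizer models and the stopword list, if not already
# installed (recent NLTK releases may also need 'punkt_tab').
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)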

In [2]:
# Read the necessary dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [3]:
"""
We apply a lambda function that converts each review text to lowercase and then tokenizes it with word_tokenize. We save the result in a new
tokens column.

We initialize a set named stop_words with the common English stopwords from stopwords.words('english') and apply a second lambda function to the
tokens column to filter those stopwords out of the tokenized text. Punctuation tokens such as "," are not stopwords, so they remain in the list.

We initialize a nested defaultdict called term_index. This data structure stores, for each term, the documents it appears in and its positions
within each document, allowing us to quickly look up where a term occurs.

We iterate over the rows of the df DataFrame using the itertuples() method, extracting the review_id and tokens columns. Within the loop, we
iterate over the tokens and their positions. For each term, we append its position to term_index under the corresponding term and document ID.
Positions are counted in the filtered token list, that is, after stopword removal.

Finally, we print the results. For each term in term_index, we print the term itself, and for each document in which the term appears, we print
the document ID and the positions of the term within that document.

"""


# Lowercase each review, then split it into tokens.
df['tokens'] = df['text'].apply(lambda text: word_tokenize(text.lower()))

# Drop common English stopwords (punctuation tokens such as "," are kept).
stop_words = set(stopwords.words('english'))
df['tokens'] = df['tokens'].apply(lambda tokens: [token for token in tokens if token not in stop_words])

# term -> {review_id -> [positions of the term in the filtered token list]}
term_index = defaultdict(lambda: defaultdict(list))
for review_id, tokens in df[['review_id', 'tokens']].itertuples(index=False):
    for position, term in enumerate(tokens):
        term_index[term][review_id].append(position)

# Print every term with the documents and positions where it occurs.
for term, doc_positions in term_index.items():
    print(f"Term: {term}")
    for doc_id, positions in doc_positions.items():
        print(f"  Document ID: {doc_id}, Positions: {positions}")



Term: software
  Document ID: txt145, Positions: [0]
  Document ID: txt327, Positions: [5]
  Document ID: txt209, Positions: [2]
  Document ID: txt825, Positions: [3]
  Document ID: txt878, Positions: [2]
  Document ID: txt718, Positions: [3]
  Document ID: txt316, Positions: [3]
  Document ID: txt247, Positions: [0]
  Document ID: txt515, Positions: [2]
  Document ID: txt913, Positions: [0]
  Document ID: txt341, Positions: [5]
  Document ID: txt688, Positions: [2]
  Document ID: txt137, Positions: [2]
Term: steep
  Document ID: txt145, Positions: [1]
Term: learning
  Document ID: txt145, Positions: [2]
Term: curve
  Document ID: txt145, Positions: [3]
Term: first
  Document ID: txt145, Positions: [4]
Term: ,
  Document ID: txt145, Positions: [5, 6]
  Document ID: txt825, Positions: [4]
  Document ID: txt878, Positions: [4]
  Document ID: txt718, Positions: [5]
  Document ID: txt316, Positions: [4]
  Document ID: txt247, Positions: [4]
  Document ID: txt341, Positions: [6]
  (output truncated)
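
Once a positional index like term_index exists, it can answer queries without rescanning the review texts. The sketch below is one possible way to do that, not part of the original notebook: `toy_docs`, `toy_index`, `docs_with_all`, and `docs_with_phrase` are hypothetical names, and the index is rebuilt from made-up token lists so the example runs on its own.

```python
from collections import defaultdict

# Hypothetical positional index with the same shape as term_index:
# term -> {document ID -> [positions]}.
toy_docs = {
    "a1": ["software", "steep", "learning", "curve"],
    "a2": ["learning", "software", "quickly"],
    "a3": ["steep", "price", "learning", "curve"],
}
toy_index = defaultdict(lambda: defaultdict(list))
for doc_id, tokens in toy_docs.items():
    for position, term in enumerate(tokens):
        toy_index[term][doc_id].append(position)


def docs_with_all(index, terms):
    """Return IDs of documents that contain every term (AND query)."""
    doc_sets = [set(index[t]) if t in index else set() for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


def docs_with_phrase(index, terms):
    """Return IDs of documents where the terms occur consecutively."""
    matches = set()
    for doc_id in docs_with_all(index, terms):
        # A phrase matches if, for some start position p, term i sits at p + i.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                matches.add(doc_id)
                break
    return matches


print(sorted(docs_with_all(toy_index, ["software", "learning"])))   # ['a1', 'a2']
print(sorted(docs_with_phrase(toy_index, ["steep", "learning"])))   # ['a1']
```

The AND query only needs the document IDs; the phrase query is where the stored positions matter, since it checks that the terms appear at consecutive offsets.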

In [4]:
text = " ".join(df['text'])
print(text)

The software had a steep learning curve at first, but after a while, I started to appreciate its powerful features. I'm really impressed with the user interface of the software. It's intuitive and easy to navigate. The latest update to the software fixed several bugs and improved its overall performance. I encountered a few glitches while using the software, but the customer support was quick to help me resolve them. I was skeptical about trying the software initially, but it turned out to be a game-changer for our productivity. The analytics features have provided us with valuable insights that have guided our decision-making. I appreciate the regular updates that the software receives, as they often bring new and useful features. I attended a training session for the software, and it greatly improved my understanding of its advanced functionalities. The software documentation could be more comprehensive, as some features are not well explained. I've recommended the software to collea

In [5]:
df

Unnamed: 0,review_id,text,tokens
0,txt145,The software had a steep learning curve at fir...,"[software, steep, learning, curve, first, ,, ,..."
1,txt327,I'm really impressed with the user interface o...,"['m, really, impressed, user, interface, softw..."
2,txt209,The latest update to the software fixed severa...,"[latest, update, software, fixed, several, bug..."
3,txt825,I encountered a few glitches while using the s...,"[encountered, glitches, using, software, ,, cu..."
4,txt878,I was skeptical about trying the software init...,"[skeptical, trying, software, initially, ,, tu..."
5,txt933,The analytics features have provided us with v...,"[analytics, features, provided, us, valuable, ..."
6,txt718,I appreciate the regular updates that the soft...,"[appreciate, regular, updates, software, recei..."
7,txt316,I attended a training session for the software...,"[attended, training, session, software, ,, gre..."
8,txt247,The software documentation could be more compr...,"[software, documentation, could, comprehensive..."
9,txt515,I've recommended the software to colleagues du...,"['ve, recommended, software, colleagues, due, ..."
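
The tokens column above still contains punctuation (",") and contraction fragments ("'m", "'ve"), because these are not in the stopword list. One possible refinement, shown here only as a sketch, is to drop such tokens before indexing. The `tokens` list and the `is_word` helper are hypothetical and stand in for one row of the column.

```python
# Hypothetical token list shaped like one row of the tokens column.
tokens = ["software", "steep", "learning", "curve", "first", ",", ",", "'m", "game-changer"]


def is_word(token):
    """Keep tokens with at least one letter or digit that do not start with an apostrophe."""
    return any(ch.isalnum() for ch in token) and not token.startswith("'")


clean_tokens = [token for token in tokens if is_word(token)]
print(clean_tokens)
# ['software', 'steep', 'learning', 'curve', 'first', 'game-changer']
```

Filtering changes the positions recorded in a positional index, so it should happen before the index is built rather than after.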


# Document-Based Indexing

Document-based indexing indexes documents by their overall content, metadata, structure, or other document-level features. It differs from
term-based indexing in that each document is indexed as a whole rather than broken down into individual words or keywords. It is a useful
technique for text classification, summarization, and retrieval based on document metadata (e.g., author, title, date), document structure
(e.g., sections, paragraphs), or other document-level properties. The code example below applies document-based indexing to the same
reviews dataset.
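
As a small illustration of indexing by document-level metadata, the sketch below keeps one primary index keyed by document ID and two secondary indexes keyed by author and year. The records, field names, and values are all hypothetical and are not taken from the reviews dataset, which has no such metadata columns.

```python
from collections import defaultdict

# Hypothetical document records; field names and values are illustrative only.
documents = [
    {"doc_id": "doc1", "author": "Lee", "year": 2022, "text": "Release notes for the spring update."},
    {"doc_id": "doc2", "author": "Kim", "year": 2023, "text": "Onboarding guide for new users."},
    {"doc_id": "doc3", "author": "Lee", "year": 2023, "text": "Troubleshooting common glitches."},
]

# Primary index: document ID -> full record.
by_id = {doc["doc_id"]: doc for doc in documents}

# Secondary indexes: metadata value -> list of document IDs.
by_author = defaultdict(list)
by_year = defaultdict(list)
for doc in documents:
    by_author[doc["author"]].append(doc["doc_id"])
    by_year[doc["year"]].append(doc["doc_id"])

# Retrieve whole documents by a metadata value.
for doc_id in by_author["Lee"]:
    print(doc_id, "->", by_id[doc_id]["text"])

# Combine two metadata conditions by intersecting the ID lists.
lee_2023 = [d for d in by_author["Lee"] if d in by_year[2023]]
print(lee_2023)  # ['doc3']
```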


In [7]:
import pandas as pd 

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [8]:
"""
We create a pandas Series named document_index using the pd.Series constructor, with the DataFrame’s index as the data and the review_id column as
the index of the Series. This creates a mapping from each review_id value to its corresponding DataFrame index. We then convert the Series into a
dictionary using the to_dict() method, which makes it easy to look up a DataFrame index from a review_id.

We loop over each key-value pair in the document_index dictionary, unpacking it into the variables review_id and document_id, where review_id is
the review_id value from the original DataFrame and document_id is the corresponding DataFrame index. We print both IDs on each iteration using an
f-string.

We then create a list named requested_review_ids containing the review_id values that we want to retrieve, and print a separator line to mark the
beginning of the requested reviews.

To check whether the requested IDs exist, we loop over each requested_id in requested_review_ids and test whether it is present in the
document_index dictionary.

If requested_id is found in document_index, we retrieve the corresponding document_id using dictionary indexing, use the loc accessor to extract
the review text from the text column of df, and print the review_id, document_id, and review text.

If requested_id is not found in document_index, we print a message indicating that the requested review was not found.

"""

# Map each review_id to its row label in the DataFrame.
document_index = pd.Series(df.index, index=df['review_id']).to_dict()
for review_id, document_id in document_index.items():
    print(f"ReviewID: {review_id} -> DocumentID: {document_id}")

# None of these IDs occurs in the dataset, so every lookup below reports "not found".
requested_review_ids = ["rv1315", "rv2087", "rv6898"]
print("\nRequested Reviews:")
for requested_id in requested_review_ids:
    if requested_id in document_index:
        document_id = document_index[requested_id]
        # The review text is stored in the 'text' column.
        review_text = df.loc[document_id, 'text']
        print(f"ReviewID: {requested_id} -> DocumentID: {document_id} -> Review Text: {review_text}")
    else:
        print(f"ReviewID: {requested_id} -> Review not found")
        

ReviewID: txt145 -> DocumentID: 0
ReviewID: txt327 -> DocumentID: 1
ReviewID: txt209 -> DocumentID: 2
ReviewID: txt825 -> DocumentID: 3
ReviewID: txt878 -> DocumentID: 4
ReviewID: txt933 -> DocumentID: 5
ReviewID: txt718 -> DocumentID: 6
ReviewID: txt316 -> DocumentID: 7
ReviewID: txt247 -> DocumentID: 8
ReviewID: txt515 -> DocumentID: 9
ReviewID: txt913 -> DocumentID: 10
ReviewID: txt341 -> DocumentID: 11
ReviewID: txt943 -> DocumentID: 12
ReviewID: txt688 -> DocumentID: 13
ReviewID: txt136 -> DocumentID: 14
ReviewID: txt137 -> DocumentID: 15

Requested Reviews:
ReviewID: rv1315 -> Review not found
ReviewID: rv2087 -> Review not found
ReviewID: rv6898 -> Review not found
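
All three requested IDs are reported as not found because none of them appears among the review IDs printed above. The sketch below shows one way to separate found and missing IDs in a single pass. The `lookup` helper is a hypothetical name, and the dictionary is a hand-copied subset of the mapping printed above so that the example runs without the CSV file.

```python
# Subset of the ReviewID -> DocumentID mapping printed above.
document_index = {"txt145": 0, "txt327": 1, "txt209": 2}


def lookup(index, requested_ids):
    """Split requested IDs into found (ID -> row label) and missing."""
    found = {rid: index[rid] for rid in requested_ids if rid in index}
    missing = [rid for rid in requested_ids if rid not in index]
    return found, missing


found, missing = lookup(document_index, ["txt327", "rv1315"])
print(found)    # {'txt327': 1}
print(missing)  # ['rv1315']
```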
