# Indexing

Indexing assigns unique identifiers to individual words, characters, or other linguistic units in a text corpus so that textual data can be stored, 
retrieved, and manipulated efficiently. It becomes crucial when we work with large volumes of data that must be retrieved quickly for later 
processing.

Applications of indexing:

Feature extraction for machine learning: We use indexing to convert words into their corresponding indexes, which represent the text in a 
numerical format that machine-learning algorithms can work with.

Document retrieval and search: When retrieving data, indexing helps create an inverted index, which maps words to the documents that contain them. 
This speeds up searching and retrieving relevant documents based on keyword queries.
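
As a minimal sketch of the inverted index described above, the following maps each word to the documents that contain it. The toy corpus, the lowercasing, the whitespace tokenization, and the function name `build_inverted_index` are assumptions made for illustration; they are not taken from the reviews dataset used later.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    # Convert the sets to sorted lists so the result is stable and easy to read
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}


# Hypothetical toy corpus (not the reviews dataset)
docs = [
    "the software is fast",
    "the update fixed bugs",
    "fast update",
]

inverted_index = build_inverted_index(docs)
print(inverted_index["fast"])    # [0, 2]
print(inverted_index["update"])  # [1, 2]
```

A keyword query then becomes a dictionary lookup rather than a scan over every document.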

Text similarity and clustering: By representing documents as vectors of indexes (or term frequencies), we can measure the similarity between documents
using techniques like cosine similarity. This is often used in clustering, topic modeling, and recommendation systems.
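
The idea can be sketched with the standard library alone: each document becomes a term-frequency vector over a shared vocabulary, and cosine similarity compares two such vectors. The toy documents and the helper names `term_frequency_vector` and `cosine_similarity` are assumptions made for this illustration.

```python
import math
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy documents (not the reviews dataset)
docs = ["fast software update", "fast software", "slow support"]
vocab = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [term_frequency_vector(doc, vocab) for doc in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(round(sim_01, 3))  # 0.816 (two shared words)
print(sim_02)            # 0.0 (no shared words)
```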

Named entity recognition (NER): In NER tasks, we can use indexing to tag and identify entities like names of people, organizations, locations, etc., 
in a text. We can then assign each identified entity a unique index for further reference.
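
The indexing step for entities might look like the sketch below. It assumes the entities have already been identified by some NER tool; the `(text, label)` pairs in `tagged_docs` are hypothetical and stand in for that tool's output.

```python
# Hypothetical NER output: one list of (entity text, entity label) pairs per document
tagged_docs = [
    [("Alice", "PERSON"), ("Acme Corp", "ORG")],
    [("Acme Corp", "ORG"), ("Berlin", "LOC")],
]

entity_index = {}   # maps each distinct entity to a unique index
indexed_docs = []   # entity indexes per document

for entities in tagged_docs:
    entity_ids = []
    for entity in entities:
        if entity not in entity_index:
            # The current dictionary size is the next unused index
            entity_index[entity] = len(entity_index)
        entity_ids.append(entity_index[entity])
    indexed_docs.append(entity_ids)

print(entity_index)
print(indexed_docs)  # [[0, 1], [1, 2]]
```

Because "Acme Corp" receives the same index in both documents, later steps can refer to the entity by that index.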

![image.png](attachment:acfbfddb-fd40-46d1-ac21-a8b4f4a05ff4.png)


# Indexing example

In [1]:
import pandas as pd

"""
We read reviews.csv into a DataFrame named df and print it. We then extract the text column from the DataFrame and convert it to a Python list 
named reviews.

We initialize an empty dictionary named vocabulary that we'll use to store unique words from the reviews and their corresponding indexes. We also
initialize an empty list named indexed_reviews to store the indexed representation of each review.

We start a loop to iterate through each review in the reviews list. Within the loop:

We initialize an empty list named indexed_review to store the indexes of the words in the current review.

We start an inner loop to iterate through each word in review, splitting the review on whitespace. Inside that loop:

We check whether word is already present in the vocabulary dictionary.

If word is not in the dictionary, we add it as a key and assign the current length of the dictionary as its value. 
This assigns a unique index to each new word encountered in the reviews.

We then append the index of word to the indexed_review list.

After processing all the words in review, we append the indexed_review list to the indexed_reviews list, creating a list of lists where each inner
list contains the indexes of the words in one review.

We then loop over indexed_reviews with enumerate (starting at 1) to obtain both the review number i and the indexed_review itself. Within the loop, 
we print the indexed representation of each review together with its review number.

Lastly, we print a line break to separate the review output from the vocabulary output and then print the vocabulary dictionary, showing the 
mapping of words to their assigned indexes.

"""

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
print(df)
reviews = df['text'].tolist()
vocabulary = {}
indexed_reviews = [] 
for review in reviews:
    indexed_review = []
    for word in review.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)
        indexed_review.append(vocabulary[word])
    indexed_reviews.append(indexed_review) 
for i, indexed_review in enumerate(indexed_reviews, start=1):
    print(f"Review {i}: {indexed_review}") 
print("\nWord-to-Index Dictionary:")
print(vocabulary)

   review_id                                               text
0     txt145  The software had a steep learning curve at fir...
1     txt327  I'm really impressed with the user interface o...
2     txt209  The latest update to the software fixed severa...
3     txt825  I encountered a few glitches while using the s...
4     txt878  I was skeptical about trying the software init...
5     txt933  The analytics features have provided us with v...
6     txt718  I appreciate the regular updates that the soft...
7     txt316  I attended a training session for the software...
8     txt247  The software documentation could be more compr...
9     txt515  I've recommended the software to colleagues du...
10    txt913  The software integration with third-party plug...
11    txt341  I'm looking forward to the upcoming release of...
12    txt943  The user community is active and supportive, m...
13    txt688  I've been using the software for a while now, ...
14    txt136  The user interface could u

In [9]:
vocabulary['The'], vocabulary['management'], vocabulary['insights']

(0, 109, 73)

In [31]:
reviews[0]

'The software had a steep learning curve at first, but after a while, I started to appreciate its powerful features.'
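
Since every word has a unique index, the mapping can be inverted to turn an indexed review back into text. The sketch below repeats the indexing loop on two hypothetical toy reviews so that it runs on its own; the names `index_to_word` and `decoded` are assumptions introduced for this illustration.

```python
# Hypothetical toy reviews (stand-ins for the reviews list loaded from the CSV)
reviews = ["The software had a steep learning curve", "The curve was steep"]

# Same indexing loop as in the example above
vocabulary = {}
indexed_reviews = []
for review in reviews:
    indexed_review = []
    for word in review.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)
        indexed_review.append(vocabulary[word])
    indexed_reviews.append(indexed_review)

# Invert the word-to-index dictionary to look words up by index
index_to_word = {index: word for word, index in vocabulary.items()}

# Rebuild each review from its list of indexes
decoded = [" ".join(index_to_word[i] for i in indexed) for indexed in indexed_reviews]

print(indexed_reviews[1])  # [0, 6, 7, 4]
print(decoded[1])          # The curve was steep
```

Note that splitting on whitespace keeps case and punctuation, so under this scheme 'The' and 'the', or 'first,' and 'first', would receive different indexes.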

# Indexing for feature extraction

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
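
To make the vectorizer's indexing easier to follow, here is a rough standard-library approximation of what CountVectorizer does with its default settings: lowercase the text, keep tokens of two or more word characters, index the vocabulary in alphabetical order, and count occurrences per document. This is a sketch for intuition only, not scikit-learn's implementation; the function name `bag_of_words` and the toy documents are assumptions.

```python
import re

# Tokens of two or more word characters, mirroring CountVectorizer's default token pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def bag_of_words(documents):
    """Return (word-to-index mapping, count matrix) for a list of strings."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    unique_words = sorted({word for tokens in tokenized for word in tokens})
    vocabulary = {word: index for index, word in enumerate(unique_words)}

    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for word in tokens:
            row[vocabulary[word]] += 1
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical toy documents (not the reviews dataset)
docs = ["The software is fast, fast!", "I like the software"]
vocabulary, X = bag_of_words(docs)

print(vocabulary)  # {'fast': 0, 'is': 1, 'like': 2, 'software': 3, 'the': 4}
print(X)           # [[2, 1, 0, 1, 1], [0, 0, 1, 1, 1]]
```

The single-character token "I" is dropped and "The" is lowercased, so this vocabulary differs from one built by splitting on whitespace.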

In [32]:
"""
We create an instance of the CountVectorizer class and assign it to the vectorizer variable. This class transforms text data into 
a numerical format suitable for machine-learning algorithms.

We call the fit_transform() method of the vectorizer object on the reviews. The method builds a vocabulary of unique words, assigns an index to 
each word, and returns the bag-of-words feature matrix, which we store in the X variable. By default, CountVectorizer lowercases the text and 
assigns indexes in alphabetical order, which is why this mapping differs from the one built by hand in the previous example.

We print the heading "Vocabulary (Word-to-Index Mapping):" and then the vocabulary_ attribute of the vectorizer object, which contains the 
mapping of words to their indexes. This provides insight into the vocabulary used to create the bag-of-words representation.

We print the heading "Feature Matrix (Bag-of-Words Representation):" and then the matrix X in its dense array form using the toarray() method. 
Each row corresponds to a review, each column corresponds to a word in the vocabulary, and each entry is the number of times that word occurs 
in that review.

"""

df = pd.read_csv('C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv')
reviews = df['text']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
print("Vocabulary (Word-to-Index Mapping):")
print(vectorizer.vocabulary_)
print("\nFeature Matrix (Bag-of-Words Representation):")
print(X.toarray())


Vocabulary (Word-to-Index Mapping):
{'the': 123, 'software': 114, 'had': 54, 'steep': 119, 'learning': 71, 'curve': 28, 'at': 10, 'first': 43, 'but': 16, 'after': 4, 'while': 148, 'started': 118, 'to': 127, 'appreciate': 7, 'its': 68, 'powerful': 95, 'features': 40, 'really': 101, 'impressed': 58, 'with': 149, 'user': 139, 'interface': 63, 'of': 86, 'it': 67, 'intuitive': 64, 'and': 6, 'easy': 35, 'navigate': 82, 'latest': 70, 'update': 134, 'fixed': 44, 'several': 111, 'bugs': 15, 'improved': 59, 'overall': 91, 'performance': 93, 'encountered': 36, 'few': 42, 'glitches': 50, 'using': 140, 'customer': 29, 'support': 120, 'was': 144, 'quick': 100, 'help': 57, 'me': 78, 'resolve': 107, 'them': 124, 'skeptical': 113, 'about': 0, 'trying': 130, 'initially': 60, 'turned': 131, 'out': 89, 'be': 12, 'game': 49, 'changer': 19, 'for': 45, 'our': 88, 'productivity': 96, 'analytics': 5, 'have': 56, 'provided': 99, 'us': 136, 'valuable': 141, 'insights': 61, 'that': 122, 'guided': 53, 'decision': 