---
# Positional Index
---

## Preprocessing 
---
The index.html files are removed, and 15 files from the SRE folder are moved into the stories folder. The resulting 467 files are preprocessed before the positional index is created. The following steps are used to clean the document text:

- Convert the text to lower case
- Perform word tokenization
- Remove punctuation marks from tokens
- Remove stopwords from tokens
- Remove blank space tokens

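The documents' names, original texts and cleaned texts obtained after preprocessing are stored in the pickle file **stories_data.pkl**. Per-document statistics (name, size in tokens, and the frequency of the most common term) are stored in **stories_docs.pkl**.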
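
The steps above can be illustrated on a single sentence. The following is a minimal, standard-library-only sketch, not the notebook's own function: it assumes a plain whitespace split in place of NLTK's `word_tokenize` and a tiny hand-picked stopword set (the hypothetical `TOY_STOPWORDS`) in place of NLTK's English list, so its output can differ from the real pipeline on actual story text.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption)
TOY_STOPWORDS = {"the", "a", "was", "and", "it"}

def toy_clean(text):
    # 1. Convert the text to lower case
    text = text.lower()
    # 2. Tokenize (whitespace split here; the notebook uses word_tokenize)
    tokens = text.split()
    # 3. Replace every character that is not a-z with a space
    tokens = [re.sub(r"[^a-z]", " ", t) for t in tokens]
    # 4. Remove stopwords
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    # 5. Strip surrounding blanks and drop tokens that became empty
    tokens = [t.strip() for t in tokens]
    tokens = [t for t in tokens if t != ""]
    return " ".join(tokens)

print(toy_clean("The day was GOOD, and it ended well!"))  # day good ended well
```
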

In [1]:
# Importing relevant modules for preprocessing

import re
import codecs
import pickle

from tqdm import tqdm
from pathlib import Path
from collections import OrderedDict, Counter 

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [2]:
# Reading the 467 documents and storing the document name and document text

Data = []
Unresolved = []

for i in Path("../Other/stories").glob("*"):
    try:
        with codecs.open(i,'r', encoding = 'utf-8', errors = 'ignore') as f:
            # Path.name gives the file name on every platform
            d = i.name
            t =  ' '.join(f.readlines())
            Data.append({'doc_name' : d, 'text' : t})
    except Exception:
        # Recording files that could not be read instead of stopping
        Unresolved.append(str(i))
        
print("Total Documents :",len(Data))
print("Unresolved Documents :",Unresolved)

Total Documents : 467
Unresolved Documents : []


In [3]:
# Function to clean text

def clean(text):
    
    # Converting text to lowercase
    text = text.lower()
    
    # Word Tokenization
    words = word_tokenize(text)
    
    # Removing all punctuation and unnecessary characters from text
    for i in range(len(words)):
        words[i] = re.sub(r'[^a-z\s]',' ',words[i])
    
    # Removing stopwords from text
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]
    
    # Removing blank spaces
    for i in range(len(words)):
        words[i]=words[i].strip()
        
    words = [i for i in words if i!='']
    
    cleaned_text = ' '.join(words)
    
    return cleaned_text

In [4]:
# Cleaning

for doc in tqdm(Data):
    doc['cleaned_text'] = clean(doc['text'])

100%|██████████| 467/467 [00:25<00:00, 17.96it/s]


In [5]:
# Sorting documents in Alphabetical order

Data.sort(key = lambda x:x['doc_name'])

In [6]:
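# Collecting per-document statistics: name, size and maximum term frequency

docs = {}

for i in range(len(Data)):
    tokens = Data[i]['cleaned_text'].split()
    docs[i]={}
    docs[i]['name']= Data[i]['doc_name']
    docs[i]['size']= len(tokens)
    # most_common(1) is empty when no tokens are left after cleaning
    docs[i]['max_frequency'] = Counter(tokens).most_common(1)[0][1] if tokens else 0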

In [7]:
# Dumping in Pickle File

with open('../Dumps/stories_data.pkl','wb') as f:
    pickle.dump(Data,f)
with open('../Dumps/stories_docs.pkl','wb') as f:
    pickle.dump(docs,f)

---
## Creating Positional Index
---
To create the index, we iterate through the cleaned document texts and, for every term, append a (document name, position) pair to that term's posting list. The position is the zero-based offset of the term within the cleaned text of the document.
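
As an illustration of the resulting structure, the sketch below builds the same kind of index over two made-up, already-cleaned documents. The document names and texts are hypothetical, and `build_positional_index` is an illustrative helper name rather than part of the notebook.

```python
def build_positional_index(docs):
    # docs: list of {'doc_name': str, 'cleaned_text': str}
    index = {}
    for doc in docs:
        for pos, term in enumerate(doc['cleaned_text'].split()):
            # setdefault creates the posting list the first time a term is seen
            index.setdefault(term, []).append((doc['doc_name'], pos))
    return index

# Hypothetical toy corpus, listed in alphabetical order of document name
toy_docs = [
    {'doc_name': 'a.txt', 'cleaned_text': 'good day good night'},
    {'doc_name': 'b.txt', 'cleaned_text': 'day good'},
]
toy_index = build_positional_index(toy_docs)
print(toy_index['good'])  # [('a.txt', 0), ('a.txt', 2), ('b.txt', 1)]
print(toy_index['day'])   # [('a.txt', 1), ('b.txt', 0)]
```

Because the documents are visited in sorted order and positions increase within a document, each posting list comes out sorted by (document name, position).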

In [8]:
# Loading Cleaned Data

with open('../Dumps/stories_data.pkl','rb') as f:
    Data = pickle.load(f)

In [9]:
# Creating Index

index = {}
for doc in Data:
    for i,term in enumerate(doc['cleaned_text'].split()):
        if term in index:
            index[term].append((doc['doc_name'],i))
        else:
            index[term] = [(doc['doc_name'],i)]

In [10]:
# Dumping in Pickle File

with open('../Dumps/positional_index.pkl','wb') as f:
    pickle.dump(index,f)

---
## Query System   
---
For the query system, we first load the index from the saved pickle file. The query is cleaned with the same preprocessing as the documents. We then keep one pointer per query term into that term's posting list and advance the pointers to find the documents that contain all the query terms at consecutive positions and in the correct order.
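
To make the matching condition concrete, here is a small self-contained sketch on a hand-written toy index. It deliberately uses a different technique from the pointer walk described above: it regroups the postings by document and checks position offsets with sets. This can serve as a cross-check on small inputs. The index contents and the name `toy_phrase_match` are assumptions made for the example.

```python
from collections import defaultdict

def toy_phrase_match(index, terms):
    # No terms, or a term missing from the index: nothing can match
    if not terms or any(t not in index for t in terms):
        return []
    # Regroup each posting list as {doc_name: set of positions}
    by_doc = []
    for t in terms:
        positions = defaultdict(set)
        for doc_name, pos in index[t]:
            positions[doc_name].add(pos)
        by_doc.append(positions)
    matches = []
    # Only documents containing the first term can match
    for doc_name in sorted(by_doc[0]):
        for start in by_doc[0][doc_name]:
            # Term k must appear exactly k positions after the first term
            if all(start + k in by_doc[k].get(doc_name, ())
                   for k in range(1, len(terms))):
                matches.append(doc_name)
                break
    return matches

# Hypothetical toy index in the same (doc_name, position) format
toy_index = {
    'good': [('a.txt', 0), ('a.txt', 2), ('b.txt', 1)],
    'day': [('a.txt', 1), ('b.txt', 0)],
}
print(toy_phrase_match(toy_index, ['good', 'day']))  # ['a.txt']
print(toy_phrase_match(toy_index, ['day', 'good']))  # ['a.txt', 'b.txt']
```
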

In [11]:
# Loading the Index

with open('../Dumps/positional_index.pkl','rb') as f:
    index = pickle.load(f)

In [12]:
# Phrase Query Processing

def phrase_query(index,query):
    
    cleaned_query = clean(query).split()
    m = len(cleaned_query)
    
    # An empty query (e.g. only stopwords) or a query with a term that is
    # absent from the index cannot match any document
    if m == 0 or any(term not in index for term in cleaned_query):
        return []
    
    # One posting list and one pointer per query term
    postings = [index[term] for term in cleaned_query]
    pointers = [0 for i in range(m)]
    answer = []
    
    # Stop as soon as any posting list is exhausted
    while all(pointers[i] < len(postings[i]) for i in range(m)):
        
        # Current (doc_name, position) entry of every query term
        current = [postings[i][pointers[i]] for i in range(m)]
        
        # All terms must be in the same document at consecutive positions
        same_doc = all(current[i][0] == current[0][0] for i in range(1,m))
        if same_doc and all(current[i][1] - current[i-1][1] == 1 for i in range(1,m)):
            answer.append(current[0][0])
        
        # Advancing the pointer that sits on the smallest (doc_name, position) entry
        j = 0
        for i in range(1,m):
            if current[j] > current[i]:
                j = i
        pointers[j] += 1
    
    # Removing duplicates while preserving the order of retrieval
    return list(dict.fromkeys(answer))

In [13]:
query = "good day"
results = phrase_query(index,query)
print('Number Of Documents Retrieved:',len(results))
print('List Of Documents Retrieved:- \n')
for i in results:
    print(i)

Number Of Documents Retrieved: 21
List Of Documents Retrieved:- 

13chil.txt
aesop11.txt
aesopa10.txt
brain.damage
breaks2.asc
bruce-p.txt
enchdup.hum
fantasy.hum
fantasy.txt
fic5
forgotte
history5.txt
horswolf.txt
hound-b.txt
mazarin.txt
melissa.txt
outcast.dos
sick-kid.txt
srex.txt
startrek.txt
superg1


---