giridhar7889/queryprocessor

The program is designed to create an index from a large collection of documents efficiently and to process queries against it.
1. Introduction

This document provides a comprehensive explanation of the program, which is designed to create an index from a large collection of documents efficiently and to process queries against it. It covers the program's functionality, how to run it, its internal workings, execution time, index size, design decisions, limitations, and a conclusion.

2. Query processor overview:-

The query processor plays a pivotal role in information retrieval systems, acting as the gateway between user search queries and the vast repository of data, with the ultimate goal of delivering relevant results. At its core, the query processor takes a user-entered search query as input and strives to produce the most pertinent information in the form of the top 10 URL results.

The input, a search query, acts as the trigger for the entire process. This query is the user's expression of information needs, and the query processor must decipher and translate it into actionable instructions for retrieval. The desired output, the top 10 URL results, represents a curated selection of web addresses ranked based on their relevance to the user's query.

The query processing algorithm is designed for efficiency and speed, ensuring that users receive results promptly. Leveraging a document-at-a-time approach, the algorithm minimizes the amount of data loaded into main memory, loading only the Lexicon and URL Table during startup. This optimized strategy ensures rapid response times, with conjunctive query results being returned in less than 50 milliseconds.

The use of BM25 as the ranking function is a key component of the algorithm. BM25 calculates relevance scores for documents based on the frequency of query terms within the document and the inverse document frequency of those terms. This nuanced approach to ranking enables the system not only to identify relevant documents but also to prioritize them by their contextual significance.
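For reference, a standard formulation of BM25 (the exact constants used by this program are not stated in the source) scores a document $D$ for query $q$ as:

$$
\mathrm{BM25}(D, q) = \sum_{t \in q} \ln\!\left(\frac{N - df_t + 0.5}{df_t + 0.5}\right) \cdot \frac{f_{t,D}\,(k_1 + 1)}{f_{t,D} + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
$$

where $N$ is the number of documents, $df_t$ the document frequency of term $t$, $f_{t,D}$ the frequency of $t$ in $D$, $|D|$ the document length, $\mathrm{avgdl}$ the average document length, and $k_1$, $b$ are tuning parameters (commonly $k_1 \approx 1.2$, $b \approx 0.75$).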

One of the notable features of the query processor is its disk optimization strategy. Rather than loading entire inverted lists into memory, the processor selectively seeks and reads only the relevant lists corresponding to the query terms. This disk-efficient approach minimizes unnecessary data retrieval, contributing to the overall speed and responsiveness of the system.

In conclusion, the query processor serves as the linchpin of the information retrieval process, transforming user queries into ranked results through Boolean filtering, query scoring, and the BM25 ranking function. Its commitment to efficiency, reflected in rapid response times and intelligent disk optimization, ensures that users can access the most relevant information with minimal delay.

3. Inverted index generation overview:-

The process of generating an inverted index is a crucial step in information retrieval, laying the foundation for efficient and effective search operations. This complex task involves several key steps and utilizes various data structures and compression techniques to optimize storage and retrieval.

The first step in the inverted index generation process is the parsing of gzip-compressed WET files, which contain web page information. During this parsing, each document is processed individually, extracting alphanumeric terms and assigning unique termIds based on the order of word occurrence. Simultaneously, DocIds are assigned in the order the documents are encountered, and essential metadata, such as the URL and the number of terms in the document, is appended to the URL Table.

Intermediate postings are then generated in binary format, with each record consisting of 8 bytes (4 bytes for TermId and 4 bytes for DocId). This forms the basis for constructing the inverted index. To achieve this, an I/O-efficient merge sort is performed on the intermediate postings, ensuring that the final sorted postings are correct and ready for further processing.

In the subsequent stage, the sorted intermediate postings are parsed, and a linear-time O(n) algorithm is applied to remove the WordId from each posting. Term frequencies are added, resulting in the creation of the final inverted index. Each record in this index contains 8 bytes, with 4 bytes representing the DocId and the next 4 bytes representing the term frequency. To conserve disk space, the inverted index undergoes compression.

The inverted index relies on several key data structures to organize and manage the data efficiently. The lexicon serves as a critical mapping, linking each term to its start index, end index, and document frequency. This mapping facilitates quick and effective retrieval of information during search operations.

The URL table is another essential data structure, containing metadata about each document, such as its DocId, URL, and the number of terms in the document. This information aids in presenting relevant details to users when displaying search results.

Compression techniques play a vital role in optimizing storage space. The JavaFastPFOR library introduces blockwise compression to the inverted index. This library employs coding techniques such as Binary Packing, NewPFD, OptPFD, Variable Byte, and Simple 9, among others. The structure of an inverted list within the compressed index includes header arrays such as Last[], SizeOfDocIds[], and SizeOfTermFreqs[], followed by the individual chunks. These arrays and chunks are essential components of the compressed index, ensuring that the information is stored in a compact yet retrievable format.
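As an illustration of blockwise integer compression with JavaFastPFOR, here is a minimal sketch using the library's generic codec API; the repository's actual chunk layout with Last[], SizeOfDocIds[], and SizeOfTermFreqs[] is more involved, and the chunk contents below are made up.

```java
import me.lemire.integercompression.BinaryPacking;
import me.lemire.integercompression.Composition;
import me.lemire.integercompression.IntWrapper;
import me.lemire.integercompression.IntegerCODEC;
import me.lemire.integercompression.VariableByte;

import java.util.Arrays;

public class ChunkCompressionSketch {
    public static void main(String[] args) {
        // Hypothetical chunk of 128 small integers (e.g. docID gaps or term frequencies).
        int[] chunk = new int[128];
        for (int i = 0; i < chunk.length; i++) {
            chunk[i] = 1 + (i % 7);
        }

        // Binary Packing compresses the integers in blocks of 32; Variable Byte handles any remainder.
        IntegerCODEC codec = new Composition(new BinaryPacking(), new VariableByte());

        int[] compressed = new int[chunk.length + 1024];   // generous output buffer
        IntWrapper inPos  = new IntWrapper(0);
        IntWrapper outPos = new IntWrapper(0);
        codec.compress(chunk, inPos, chunk.length, compressed, outPos);
        int compressedLength = outPos.get();                // number of ints actually written

        // Decompression mirrors the call, consuming compressedLength ints.
        int[] restored = new int[chunk.length];
        codec.uncompress(compressed, new IntWrapper(0), compressedLength, restored, new IntWrapper(0));

        System.out.println("compressed ints: " + compressedLength);
        System.out.println("round-trip ok:   " + Arrays.equals(chunk, restored));
    }
}
```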

Despite the efficiency introduced by compression, the generation process faces challenges, such as maintaining accuracy throughout the sorting and compression stages. Rigorous checks and algorithms are implemented to ensure the correctness of the final inverted index. In conclusion, the inverted index generation process, with its intricate steps and thoughtful application of data structures and compression techniques, forms the backbone of a robust and high-performance information retrieval system.

4. Internal working and the functions involved in query processing:-

### 1. Initialization in the main function:

- **Loading Lexicon:**
  - The Lexicon is loaded into memory and stored in a Map<String, Long[]> called lexicon.
  - The map contains mappings of the form: Term => (StartIndex, EndIndex, DocumentFrequency).
- **Loading URL Table:**
  - The URL table is loaded into memory and stored in a List<PageTable> named pageTableList.
  - Each PageTable object represents a row in the URL Table.
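A minimal sketch of what this startup loading might look like (the PageTable fields and the whitespace-separated file layouts are assumptions based on the description above):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StartupLoader {
    // Term => (StartIndex, EndIndex, DocumentFrequency), as described above.
    static Map<String, Long[]> lexicon = new HashMap<>();
    static List<PageTable> pageTableList = new ArrayList<>();

    // Assumed row type for the URL Table: docId, url, number of terms.
    static class PageTable {
        int docId; String url; int numTerms;
        PageTable(int docId, String url, int numTerms) {
            this.docId = docId; this.url = url; this.numTerms = numTerms;
        }
    }

    // Assumes one whitespace-separated entry per line: term startIndex endIndex docFreq.
    static void loadLexicon(String path) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] p = line.split("\\s+");
                lexicon.put(p[0], new Long[] {
                        Long.parseLong(p[1]), Long.parseLong(p[2]), Long.parseLong(p[3]) });
            }
        }
    }

    // Assumes one whitespace-separated entry per line: docId url numTerms.
    static void loadUrlTable(String path) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] p = line.split("\\s+");
                pageTableList.add(new PageTable(Integer.parseInt(p[0]), p[1], Integer.parseInt(p[2])));
            }
        }
    }
}
```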

### 2. User Query Processing in executeQuery:

- User enters a search query.

- querySearch.executeQuery is called with the entered query.

### 3. executeQuery(String query) Function:

- Initializes an array of the ListPointer data structure.

- For each query term, openList(queryTerm) is called, and its ListPointer is stored in the array.

### 4. openList(queryTerm) Function:

- Checks if the query term is in the lexicon.

- If present, creates a ListPointer object and calls readLastAndSizeArrays(startIndex); otherwise, returns NULL.

### 5. readLastAndSizeArrays(startIndex) Function:

- Seeks to the given offset on the compressed inverted index.

- Reads the integer arrays (last, SizeOfDocIds, SizeOfTermFreqs) and stores them in the respective member variables.

### 6. sortListPointerAndRemoveNull(lp, q, docFreq) Function:

- Sorts the ListPointer array based on document frequency.

- Removes any NULL pointers.

### 7. Conjunction Query Processing Algorithm:

- Starts a while loop from did=0.

- Gets the next posting from the shorter list.

- Calls nextGEQ(lp[0], did).

### 8. nextGEQ(lp[0], did) Function:

- Finds the next posting in list lp with docID >= did.

- Returns a value > MAXDID if none exists.
- Skips over blocks to find the one that may contain did, then uncompresses that block by calling uncompress(lp).
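A simplified, self-contained sketch of the block-skipping idea behind nextGEQ (the ListPointer fields here are stand-ins for the real class, and the chunks are shown already decoded instead of calling uncompress):

```java
public class NextGeqSketch {
    static final int MAXDID = Integer.MAX_VALUE - 1;

    // Minimal stand-in for the ListPointer described above.
    static class ListPointer {
        int[] last;          // largest docID in each chunk (the Last[] header array)
        int[][] chunks;      // decoded chunks, standing in for uncompress() on the real index
        int currentBlock = 0;
    }

    // Find the next posting in lp with docID >= did, skipping whole chunks via last[].
    static int nextGEQ(ListPointer lp, int did) {
        while (lp.currentBlock < lp.last.length && lp.last[lp.currentBlock] < did) {
            lp.currentBlock++;                       // skip chunks that cannot contain did
        }
        if (lp.currentBlock == lp.last.length) {
            return MAXDID + 1;                       // list exhausted: no docID >= did
        }
        for (int d : lp.chunks[lp.currentBlock]) {   // only this chunk is "uncompressed"
            if (d >= did) return d;
        }
        return MAXDID + 1;
    }

    public static void main(String[] args) {
        ListPointer lp = new ListPointer();
        lp.chunks = new int[][] { {3, 8, 12}, {17, 21, 40} };  // toy inverted list
        lp.last   = new int[]   { 12, 40 };
        System.out.println(nextGEQ(lp, 13));         // prints 17
    }
}
```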

### 9. uncompress(listPointer lp) Function:

- Checks if the block is already uncompressed; if yes, returns the temporary uncompressed block.
- Otherwise, finds the index of the doc, uncompresses the block by performing a seek and read on the inverted index, and returns the block, i.e., the list of documents in that chunk.

### 10. getFreq(lp[i]) Function:

- Checks if there are entries with the same docID in other list pointers.

- If the docID is not in the intersection, did is advanced to d; if it is in the intersection, the frequencies from each list are retrieved via getFreq(lp[i]).

### 11. computeBM25Score(did, q, f, docFreq) Function:

- Computes BM25 scores from frequencies and other data using the provided formula.

- Advances the search to the next posting.
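A minimal sketch of such a scoring function using the standard BM25 formula (the k1 and b values below are common defaults, not necessarily the ones used by this program):

```java
public class Bm25Sketch {
    static final double K1 = 1.2;   // typical default; the program's value may differ
    static final double B  = 0.75;  // typical default

    /**
     * BM25 contribution of one query term for one document.
     *
     * @param tf        frequency of the term in the document (f)
     * @param docFreq   number of documents containing the term (docFreq)
     * @param docLen    number of terms in the document (from the URL Table)
     * @param avgDocLen average document length over the collection
     * @param numDocs   total number of documents (N)
     */
    static double bm25Term(int tf, int docFreq, int docLen, double avgDocLen, long numDocs) {
        double idf = Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
        double norm = tf + K1 * (1 - B + B * docLen / avgDocLen);
        return idf * (tf * (K1 + 1)) / norm;
    }

    // Sum the per-term contributions for one docID, given frequencies for each query term.
    static double computeBM25Score(int[] freqs, int[] docFreqs, int docLen,
                                   double avgDocLen, long numDocs) {
        double score = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            score += bm25Term(freqs[i], docFreqs[i], docLen, avgDocLen, numDocs);
        }
        return score;
    }
}
```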

### 12. Disjunctive Query Processing Algorithm:

- Gets list pointers using openList(lp[i]).

- Starting from docid 0, computes BM25 scores on the given list pointers.

### 13. Print Top 10 Results:

- Prints the top 10 results after both searches, including rank scores.
- Uses a TreeMap<Double, TopResult> to store BM25 scores as keys and the metadata associated with the web page as values.
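A minimal sketch of maintaining the running top 10 in such a TreeMap (the TopResult fields are assumptions based on the metadata described above; note that two documents with exactly equal scores would collide on the same key):

```java
import java.util.Map;
import java.util.TreeMap;

public class TopResultsSketch {
    // Assumed metadata carried with each result.
    static class TopResult {
        int docId; String url;
        TopResult(int docId, String url) { this.docId = docId; this.url = url; }
    }

    // BM25 score => result, kept sorted by score; only the 10 best entries are retained.
    static TreeMap<Double, TopResult> top10 = new TreeMap<>();

    static void offer(double score, TopResult result) {
        top10.put(score, result);
        if (top10.size() > 10) {
            top10.pollFirstEntry();              // drop the current lowest score
        }
    }

    static void printTop10() {
        int rank = 1;
        // descendingMap() iterates from the highest score down.
        for (Map.Entry<Double, TopResult> e : top10.descendingMap().entrySet()) {
            System.out.printf("%2d. %.4f  %s%n", rank++, e.getKey(), e.getValue().url);
        }
    }
}
```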

This detailed breakdown provides a comprehensive overview of the program's flow and the role of each function in processing user queries, handling conjunction and disjunction, and computing BM25 scores.

5. Internal working and the functions involved in index generation:-

1. **Parsing WET Files:**
   - parseWetFile(String wetFilePath) processes one WET file at a time, uncompressing and parsing documents.
   - It uses the ArchiveReader library to check the MIME type and read the uncompressed ArchiveRecord into a byte array.
   - Converts the byte array to a string, splitting it using regular expressions to extract alphanumeric terms.
   - Iterates over the terms, assigning each a unique wordId using the wordToWordId HashMap.
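   A minimal sketch of this tokenization and wordId assignment (the regular expression, lower-casing, and method name processDocument are illustrative assumptions; the repository's exact rules may differ):

   ```java
   import java.util.HashMap;
   import java.util.Map;

   public class TokenizerSketch {
       static Map<String, Integer> wordToWordId = new HashMap<>();

       // Split a document's text into alphanumeric terms and give each distinct word a stable wordId.
       static void processDocument(String documentText, int docId) {
           String[] terms = documentText.toLowerCase().split("[^a-z0-9]+");
           for (String term : terms) {
               if (term.isEmpty()) continue;
               // The first time a word is seen it gets the next available wordId.
               int wordId = wordToWordId.computeIfAbsent(term, t -> wordToWordId.size());
               // generateIntermediatePosting(wordId, docId);  // see step 2 below
           }
       }
   }
   ```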

2. **Intermediate Posting Generation:**
   - generateIntermediatePosting(int wordId, int docId) takes a wordId and docId as input and writes them in binary to the IntermediateInvertedIndex file.
   - Converts the integers to bytes using ByteBuffer, changes the byte order to little-endian, and writes to the file using DataOutputStream.
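   A minimal sketch of writing one such 8-byte posting in little-endian order (the stream setup and file handling are assumptions for illustration):

   ```java
   import java.io.DataOutputStream;
   import java.io.FileOutputStream;
   import java.io.IOException;
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;

   public class PostingWriterSketch {
       static DataOutputStream out;

       static void open(String path) throws IOException {
           out = new DataOutputStream(new FileOutputStream(path));
       }

       // One intermediate posting: 4 bytes wordId + 4 bytes docId, written little-endian
       // so the C sort/merge program can read the records directly.
       static void generateIntermediatePosting(int wordId, int docId) throws IOException {
           ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
           buf.putInt(wordId);
           buf.putInt(docId);
           out.write(buf.array());
       }
   }
   ```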

3. **URL Table Update:**
   - addEntryToPageTable(String url, int size) appends a new row to the URLTableFile representing a document's URL, DocId, and number of terms.

4. **I/O Efficient Merge Sort:**
   - mergeSort(String memoryLimit) runs an I/O-efficient merge sort program written in C using the ProcessBuilder class.
   - **Sort Phase:**
     - Reads memsize bytes at a time from the intermediate inverted list, sorts them in chunks, and creates temporary sorted files (temp0, temp1, ..., temp39).
     - Writes the list of generated files to the "list" file.
   - **Merge Phase:** (see the sketch below)
     - Uses a min-heap to merge the sorted files, selecting the record with the minimum wordId at the root.
     - Writes the selected records to the output file (result0) and updates the heap with the next record from the corresponding file.
     - Writes the filename of the merged result file to "list2".
   - **Check Correctness Phase:**
     - Checkoutput compares consecutive records, checking that they are sorted by wordId and docId.
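   The merge phase itself is written in C in this project; purely to illustrate the min-heap idea, here is a small Java sketch over already-sorted in-memory runs (file handling omitted):

   ```java
   import java.util.PriorityQueue;

   public class KWayMergeSketch {
       // A cursor into one sorted run; for brevity each record is a single int standing in for a posting.
       static class Cursor implements Comparable<Cursor> {
           final int[] run;
           int pos = 0;
           Cursor(int[] run) { this.run = run; }
           int current() { return run[pos]; }
           public int compareTo(Cursor o) { return Integer.compare(current(), o.current()); }
       }

       // Repeatedly take the smallest head record from the heap, emit it, and advance that run.
       static void merge(int[][] runs) {
           PriorityQueue<Cursor> heap = new PriorityQueue<>();
           for (int[] run : runs) {
               if (run.length > 0) heap.add(new Cursor(run));
           }
           while (!heap.isEmpty()) {
               Cursor c = heap.poll();                    // run with the minimum current record
               System.out.print(c.current() + " ");       // stands in for appending to result0
               if (++c.pos < c.run.length) heap.add(c);   // refill the heap from the same run
           }
           System.out.println();
       }

       public static void main(String[] args) {
           merge(new int[][] { {1, 4, 9}, {2, 3, 10}, {5, 6, 7} });  // prints: 1 2 3 4 5 6 7 9 10
       }
   }
   ```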

5. **Reformatting Postings:**
   - reformatPostings() reads the result0 file generated by the merge sort.
   - Converts bytes to little-endian using ByteBuffer and updates the docIdToTermFrequency TreeMap in a single iteration.
   - This linear-time pass also updates the Lexicon using the TreeMap.
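   A minimal sketch of this single linear pass (shown over in-memory arrays rather than the result0 byte stream, with the lexicon update reduced to a print statement):

   ```java
   public class ReformatSketch {
       // Input: postings sorted by (wordId, docId), as produced by the merge sort.
       // Output: one (docId, termFrequency) pair per distinct (wordId, docId), in one pass.
       static void reformat(int[] wordIds, int[] docIds) {
           int i = 0;
           while (i < wordIds.length) {
               int w = wordIds[i], d = docIds[i], freq = 0;
               while (i < wordIds.length && wordIds[i] == w && docIds[i] == d) {
                   freq++;          // count occurrences of term w in document d
                   i++;
               }
               // The real code appends (d, freq) to term w's inverted list and, when w changes,
               // records the list's start/end offsets and document frequency in the lexicon.
               System.out.println("word " + w + " doc " + d + " freq " + freq);
           }
       }

       public static void main(String[] args) {
           reformat(new int[] {1, 1, 1, 2, 2}, new int[] {4, 4, 7, 3, 3});
           // word 1 doc 4 freq 2 / word 1 doc 7 freq 1 / word 2 doc 3 freq 2
       }
   }
   ```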

6. **Generate Lexicon File:**
   - generateLexiconFile() iterates over the key set of the lexicon TreeMap, writing out terms in alphabetical order along with their startIndex, endIndex, and document frequency to the Lexicon.txt file.

The program combines Java and C components to efficiently parse, sort, and format the inverted index. Data structures like ArrayLists, HashMaps, and TreeMaps are used to manage information, and the I/O-efficient merge sort makes it possible to sort large datasets with limited memory. The final Lexicon and URL Table files provide metadata, while the inverted index offers a structured and compressed representation of the document-term relationships for efficient retrieval.

6. Document-at-a-time query processing:-

Document-at-a-time query processing is a pivotal algorithm in information retrieval systems, focusing on efficiently finding intersections among inverted lists and computing BM25 scores for relevant documents. This approach offers a step-by-step mechanism for processing queries and scoring documents, aiming to provide accurate and timely results.

The algorithm begins by opening the inverted lists for reading using the openList() function. These lists represent the postings of terms in documents and are traversed simultaneously from left to right. The primary objective is to efficiently identify document intersections and calculate BM25 scores for the relevant documents.

**Algorithm Steps:**

1. **Initialization:**
   - Example: open 3 inverted lists using openList(), obtaining pointers lp0, lp1, and lp2 to the starts of these lists.

2. **Traversal and Intersection:** (a generalized sketch of this loop follows the list)
   - Call d0 = nextGEQ(lp0, 0) to get the docID of the first posting in lp0.
   - Call d1 = nextGEQ(lp1, d0) to check for a matching docID in lp1.
   - If d1 > d0, restart at the first list by calling d0 = nextGEQ(lp0, d1).
   - If d1 = d0, call d2 = nextGEQ(lp2, d0) to see if d0 is also in lp2.
   - If d2 > d0, restart at the first list by calling d0 = nextGEQ(lp0, d2).
   - If d2 = d0, then d0 is in all three lists.

3. **Scoring and Result Insertion:**
   - Compute the score for d0 based on the BM25 scoring algorithm.
   - Continue at the first list and call d0 = nextGEQ(lp0, d2 + 1).
   - Whenever a score is computed for a docID, check whether it should be inserted into the TreeMap of the current top-10 results.

4. **Result Presentation:**
   - At the end of processing, return the results in the TreeMap containing the top-10 documents ranked by their BM25 scores.
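Generalizing the three-list example to k lists, a simplified sketch of the conjunctive loop might look as follows (the abstract methods stand in for the functions described in section 4; this is an illustration, not the repository's exact code, and the lists are assumed to be sorted from shortest to longest):

```java
abstract class ConjunctiveDaatSketch {
    static final int MAXDID = Integer.MAX_VALUE - 1;

    abstract int nextGEQ(int listIndex, int did);           // next docID >= did in list listIndex
    abstract int getFreq(int listIndex);                    // term frequency at the current posting
    abstract double computeBM25Score(int did, int[] freqs); // BM25 score for one document
    abstract void offerTopResult(int did, double score);    // top-10 bookkeeping (TreeMap)

    void processConjunctiveQuery(int numLists) {
        int did = nextGEQ(0, 0);                    // first candidate from the shortest list
        while (did <= MAXDID) {
            int d = did;
            for (int i = 1; i < numLists; i++) {    // check the candidate against every other list
                d = nextGEQ(i, did);
                if (d != did) break;                // not present in list i (or list exhausted)
            }
            if (d > MAXDID) break;                  // some list is exhausted: no more matches
            if (d > did) {
                did = nextGEQ(0, d);                // mismatch: restart from the larger docID
                continue;
            }
            int[] freqs = new int[numLists];        // did is in all lists: gather frequencies
            for (int i = 0; i < numLists; i++) {
                freqs[i] = getFreq(i);
            }
            offerTopResult(did, computeBM25Score(did, freqs));
            did = nextGEQ(0, did + 1);              // continue with the next candidate
        }
    }
}
```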

**Efficiency and Optimizations**

The document-at-a-time query processing algorithm demonstrates efficiency through its careful traversal and scoring mechanisms. It employs a strategy of checking for matching docIDs in the inverted lists, restarting when necessary, and computing scores only for relevant documents. This approach minimizes unnecessary computations and focuses on the most promising intersections.

**Efficiency Highlights:**

1. **Simultaneous Traversal:**

   - The algorithm traverses the inverted lists simultaneously, reducing the need for repeated passes over the same data.

2. **BM25 Scoring:**

   - The use of the BM25 scoring algorithm allows for the computation of relevance scores, taking into account term frequencies and document lengths.

3. **Top-10 Results TreeMap:**

   - Results are efficiently managed using a TreeMap for the current top-10 documents. Because the map never holds more than 10 entries, insertion is effectively constant-time, and the most relevant results are easy to retrieve in order.

**Optimizations:**

1. **Early Termination:**

   - The algorithm has the potential to terminate early when certain conditions are met, avoiding unnecessary computations for documents that cannot improve the current top-10.

2. **Traversal Restart:**

   - Restarting traversal when a matching docID is not found in a list helps avoid unnecessary comparisons and ensures the algorithm focuses on promising intersections.

In conclusion, the document-at-a-time query processing algorithm excels in its efficiency and systematic approach. By combining simultaneous traversal, BM25 scoring, and careful result management, it provides a streamlined mechanism for retrieving the most relevant documents in response to user queries.

7. Forward seeks:-

Forward seeks are a critical aspect of query processing, playing a key role in efficiently accessing and retrieving information from compressed inverted indices. This process involves seeking to specific positions in the compressed index to read the data needed for query evaluation. The strategies employed in forward seeks, along with blockwise compression techniques, contribute significantly to the overall efficiency of the query processing system.

**Functions involving Forward Seeks:**

1. **readLastAndSizeArrays(startIndex) in the ListPointer Class:**
   - This method seeks to a given offset in the compressed inverted index and reads the integer arrays (last, SizeOfDocIds, SizeOfTermFreqs), storing them in the respective member variables.
   - This strategy optimizes forward seeks by retrieving only the necessary header information directly from the compressed index, reducing unnecessary data decompression.
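   A minimal sketch of such a forward seek with RandomAccessFile (the on-disk header layout assumed here, a 4-byte length followed by that many little-endian integers per array, is an illustration, not necessarily the repository's exact format):

   ```java
   import java.io.IOException;
   import java.io.RandomAccessFile;
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;

   public class HeaderReaderSketch {
       int[] last, sizeOfDocIds, sizeOfTermFreqs;

       // Seek to the term's start offset and read only its header arrays,
       // leaving the compressed chunks on disk until a block is actually needed.
       void readLastAndSizeArrays(RandomAccessFile index, long startIndex) throws IOException {
           index.seek(startIndex);                    // forward seek to this term's inverted list
           last            = readIntArray(index);
           sizeOfDocIds    = readIntArray(index);
           sizeOfTermFreqs = readIntArray(index);
       }

       // Assumed layout: a 4-byte length n, then n little-endian 4-byte integers.
       private static int[] readIntArray(RandomAccessFile index) throws IOException {
           byte[] lenBytes = new byte[4];
           index.readFully(lenBytes);
           int n = ByteBuffer.wrap(lenBytes).order(ByteOrder.LITTLE_ENDIAN).getInt();

           byte[] data = new byte[4 * n];
           index.readFully(data);
           ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
           int[] values = new int[n];
           for (int i = 0; i < n; i++) values[i] = buf.getInt();
           return values;
       }
   }
   ```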

2. **uncompress(listPointer lp):**
   - Checks if the block is already uncompressed. If so, it returns the temporarily uncompressed block.
   - If the block is not yet uncompressed, it finds the index of the document and performs seek and read operations on the inverted index to uncompress the block.
   - It checks for entries with the same docID in other list pointers. If they are not in the intersection, the docID is changed to d; if they are in the intersection, it retrieves all frequencies by calling getFreq(lp[i]).

**Benefits of Blockwise Compression:**

1. **Efficient Seek and Retrieval:**

   - Blockwise compression allows the system to skip unnecessary decompression of entire blocks when seeking specific information, which significantly speeds up the forward seek process.

2. **Reduced I/O Operations:**

   - By working with compressed blocks, the system minimizes the amount of data that needs to be read from disk. This reduction in I/O operations contributes to faster query processing times.

3. **Optimized Storage Utilization:**

   - Storing and retrieving data in compressed blocks optimizes storage utilization, enabling the system to handle large amounts of data efficiently without sacrificing performance.

4. **Selective Decompression:**

   - The system can selectively decompress only the required blocks, avoiding the overhead of decompressing entire indexes. This is particularly beneficial when seeking information related to specific query terms or document IDs.

In conclusion, forward seeks, particularly when implemented in conjunction with blockwise compression techniques, enhance the efficiency and speed of query processing. These strategies reduce the computational load associated with seeking information in compressed inverted indices, contributing to the overall responsiveness of the system. The balance between seeking specific information and leveraging compression technologies is crucial for optimizing the performance of information retrieval systems.

8. Results and performance:-

The performance of the document indexing and query processing system is a critical metric for evaluating its efficiency and practical usability. The provided dataset, along with the details of the program execution, offers insights into the processing speed, the size of the resulting inverted index files, and the responsiveness of the query processor.

For the given dataset, the program exhibited commendable performance. It parsed approximately 3.2 million pages in 83 minutes, demonstrating a processing speed of 371 pages per second. The execution time of the program is expected to be influenced by factors such as the size and complexity of the input dataset. In this case, the program took around 3 hours to create the final index, reflecting the substantial computational effort required for the given volume of data.

The size of the resulting inverted index files is a crucial aspect of system efficiency and storage utilization. Before compression, the inverted index reached a size of 3 GB, reflecting the extensive information captured during the indexing process. Through effective compression techniques, the final index size was reduced to around 1.5 GB. This reduction showcases the program's ability to optimize storage utilization without compromising the integrity and accessibility of the indexed information.

In this scenario, the query processor demonstrated impressive speed, returning the top 10 results in just 50-100 milliseconds. In conclusion, the results obtained from processing the dataset highlight the program's capability to handle a substantial volume of data efficiently. The parsing speed, the index sizes before and after compression, and the responsiveness of the query processor collectively indicate a well-optimized system for information retrieval.

Design decisions and limitations:-

9. Design Decisions:

**Query Processor:**

1. **Size Arrays in Compressed Inverted List:**

   - Decision: Two size arrays (SizeOfDocIds[], SizeOfTermFreqs[]) were implemented as headers in the compressed inverted list instead of a single size array. This separation was chosen to enhance compression efficiency and optimize list traversal.

2. **JavaFastPFOR Compression Library:**

   - Decision: The JavaFastPFOR integer compression library was chosen for compression due to its efficiency, especially in terms of decompressing integers at a high rate, exceeding 1.2 billion integers per second (4.5 GB/s).

**Index Generation:**

1. **Java for Parsing and Initial Processing:**

   - Decision: Java was selected for the initial part of generating intermediate postings from documents, leveraging its strong support for WET parsing libraries.

2. **C for Sorting and Merging Postings:**

   - Decision: C was chosen for sorting and merging the postings due to its superior I/O efficiency, which is particularly advantageous for handling large datasets.

3. **Java for Reformatting and Lookup Structure Updates:**

   - Decision: Java was used for the final part of reformatting the sorted postings and updating the lookup structures. Manipulating bytes and writing out the Lexicon and URL Table to disk is more convenient in Java.

4. **Endian Notation Consideration:**

   - Decision: The difference in default byte order (BIG_ENDIAN in Java versus LITTLE_ENDIAN in the C component) was explicitly handled to ensure consistency and prevent issues when reading bytes across languages.

5. **Selective Uncompression:**

   - Decision: Only the chunk that needs parsing at the moment is uncompressed, prioritizing efficient use of memory.

6. **WordIdToWord HashMap:**

   - Decision: A WordIdToWord HashMap was implemented to efficiently replace each WordId with the actual word when building the final inverted index.

10. Limitations:

1. **HashMap Memory Growth:**

   - Limitation: HashMaps in Java can experience significant memory growth and may affect performance after a certain point. Adjusting the initial capacity and load factor values was explored to optimize performance. Writing them out to disk and emptying them after reaching a threshold could be a potential solution.

2. **Heap Space Allocation:**

   - Limitation: The Query Processor program requires a considerable amount of heap space due to loading the Lexicon and URL Table into HashMaps. While this may impact memory usage, the query processing performance in terms of time remains excellent.

Designing an efficient information retrieval system involves a delicate balance between language-specific advantages, compression techniques, and memory management. The decisions made in this design aimed to maximize efficiency, but the limitations in memory handling highlight the ongoing challenges in optimizing large-scale data processing systems.

11. Conclusion:-

The described program is an efficient and robust search engine system that encompasses both the index generation and query processing phases. The system is designed to handle the given corpus of documents and execute search queries with high performance.

The use of an I/O-efficient merge sort in C ensures optimal sorting and merging of the intermediate postings, resulting in the final inverted index. The query processor is designed to efficiently process both conjunctive and disjunctive queries using the BM25 ranking function. It loads only the essential data structures, namely the Lexicon and URL Table, into memory to conserve resources. The document-at-a-time query processing algorithm efficiently traverses inverted lists and computes scores, resulting in fast response times. The system demonstrates impressive performance metrics, processing conjunctive queries in 50-100 milliseconds. For the given dataset, the program parsed approximately 3.2 million pages in 83 minutes, demonstrating a processing speed of 371 pages per second. In conclusion, the presented program successfully combines efficient index generation with a responsive query processor.
