Skip to content

Information retrieval toolkit for large document collections.

Notifications You must be signed in to change notification settings

amallia/poly-ir-toolkit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

poly-ir-toolkit

Information retrieval toolkit for large document collections.

Developed at the Web Exploration and Search Technology Lab at the Polytechnic Institute of NYU, under the advisement of Professor Torsten Suel, PolyIRTK provides tools for indexing and querying large document collections.

The aim of PolyIRTK is to act as a platform for research into new algorithms for compression/decompression, querying, and indexing techniques. It implements a number of techniques from recent research literature that improves on the efficiency of indexing and querying. PolyIRTK is also focused on query optimization and early termination algorithms with rank safe properties (that is, faster querying performance, with absolutely no loss in result quality). Another goal of PolyIRTK is to be a search engine library for applications that require fast full text search.

Features

On the querying side: * BM25 document ranking * DAAT algorithms for AND/OR mode querying * WAND query optimization algorithm * MaxScore query optimization algorithm * MaxScore with block and chunk level score skipping query optimization algorithm * Early termination algorithm based on multi-layered indices with list score upperbounds * On disk index with LRU caching (cache size is configurable) or a completely in-memory index * Support for TREC experiments * ...and more

On the indexing side: * In-memory compression of intermediate postings with adjustable memory consumption * Writes out complete (fully queryable) indices (which are merged later) * Supports many fully configurable compression algorithms for docIDs, frequencies, positions, and the block header (Rice, Turbo-Rice, Variable Byte, S9, S16, PForDelta) * Allows docID remapping based on document URLs for better compression and significantly faster querying performance. * Allows indices to be layered (with several layering algorithms) by partial docID scores, used in various early termination algorithms * ...and more

Releases

No releases published

Packages

No packages published

Languages