Document Collection Foreign-data Wrapper (DC FDW) allows users to map an entire directory of documents (e.g. Reuters Corpora RCV1) as a single foreign table in PostgreSQL database. The FDW supports building inverted index and postings file in a user specific location. And then use the index and postings file to support various types of information retrieval tasks in the context of PostgreSQL database. This project is a Google Summer of Code 2012 project. Original project proposal can be found here.
- Project Proposal
- Setting up dc_fdw in PostgreSQL 9.3devel
- Setting up dc_fdw in PostgreSQL PostgreSQL 9.1.3 (Outdated:Using older version of FDW API)
- Design details with examples
1.The code uses some PostgreSQL built-in libraries, see PostgreSQL Source Code Documentation for more details.
2.More about foreign-data wrappers, see PostgreSQL SQL/MED wiki page for more information.
Document Collection Foreign-data Wrapper allows users to map an entire directory of text files into a foreign table.
In Information Retrieval (IR) context, a document collection is a set of document from which users are interested in
extracting information (searching). By mapping a document collection to a foreign table, a user can easily search through
the set of documents by issuing an SQL query to the PostgreSQL server.
Let's take a look at what a typical document collection looks like:
The Reuters-21578 benchmark corpus, ApteMod version in NLTK
$ cd ~/nltk_data/corpora/reuters/test/ $ ls 14826 15312 15751 16179 16604 17540 18095 18810 19471 20112 20651 21146 14828 15313 15753 16180 16606 17543 18096 18811 19473 20114 20652 21148 14829 15314 15757 16185 16607 17544 18099 18816 19474 20116 20653 21149 ... $ cat 16179 LME CLARIFIES NEW ALUMINIUM CONTRACT DETAILS The London Metal Exchange (LME) has issued a note clarifying details on its new high grade aluminium contract, in response to questions from members following the announcement of the contract, due to start June 1. All deliverable shapes of aluminium under the high grade primary aluminium contract (minimum 99.7 pct purity) will also be deliverable against the standard primary aluminum contract (min 99.5 pct), the LME said. Sows will not constitute good delivery against the standard contract until September 1, and 99.5 pct purity sows are not good delivery and cannot be placed on LME warrant. The dollar quotation for the high grade contract will be in multiples of one U.S. Dollar but carries may be made at 50 cents for even tonnages only. Singapore, which is the first port warehouse outside Europe to be used as an LME delivery point, will be used for high grade metal only and the rent imposed by owners Steinweg will be 1.05 U.S. Dlr a tonne per week, the LME said. The LME Board, in response to representation from the trade, agreed to annul from LME contracts the minimum weight requirements of 450 kilos for T-bars and 250 kilos for sows, effective for high grade on June 1 and for standard on July 24.
The text document can be in any language. In this version, we only look at English.
A text document in a collection consists of 2 parts:
Therefore, in foreign table options, we need to provide the following information:
data_dir [where the files in a document collection are located] index_dir [where the index files are located: postings file, dictionary file] index_method [either In-memory(IM) Indexing or Single-pass in-memory(SPIM) indexing] buffer_size [when using SPIM indexing, this is the limit of memory (in MB) available] id_col [the column name for mapping doc id] text_col [the column name for mapping doc content]
In indexing phase, the indexer will take each file in the document collection and perform the following actions:
- Load file content into memory
- Convert the content of the file into a
TSVector(essentially a table of terms)
- Add doc_id to the postings list of each term that is appearing in its
TSVectorin the dictionary (implemented as a hashtable
There are 2 indexing methods: In-memory indexing (IM) where a single global dictionary is used and postings lists are stored in memory; Single-pass-in-memory (SPIM) where the collection will be chopped into parts, each part will have its dictionaries and postings lists in memory when it's being indexed. Upon finishing, they are serialized to disk. In the end, all the dictionaries will be loaded into memory and a global postings list for each term will be produced by joining all the intermediate posting lists of the same term.
Note: More information on Single-pass-in-memory indexing can be found here.
The searcher will load the global dictionary file into memory as a hashtable
The key of the dictionary is term. Each entries in the dictionary will store the pointer to the term's postings list in the postings file. Postings lists of a term will be loaded when it's searched.
A tree structure is constructed by extracting quals from
The tree is then evaluated by recursively performing searching term (leaf node) and boolean operations (internal nodes), i.e. intersect, union, negate on postings lists. The details for boolean operation algorithms (intersect via skip pointers) can be found here
Quals that can be pushed down:
1. id = <integer> 2. content @@ <term> 3. to_tsquery( <tsquery text> ) 4. plainto_tsquery ( <free text> )
Note: If no quals can be pushed down, a sequential scan on all the documents in the collection will be conducted.