
TF-IDF Ranker

This is an implementation of a document ranker based on tf-idf vectorization, following the DrQA [1] project. By default, the ranker takes a batch of queries as input and returns, for each query, the titles of the 5 most relevant documents sorted by relevance.
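The core idea can be illustrated with a toy tf-idf ranker. This is a pure-Python sketch with a made-up three-document corpus, not the DeepPavlov implementation (which hashes terms and works over full Wikipedia):

```python
import math

# Toy corpus: title -> text (stand-ins for Wikipedia articles)
docs = {
    "Ivan Pavlov": "ivan pavlov was a russian physiologist known for classical conditioning",
    "Vladimir Bekhterev": "russian neurologist and rival of pavlov",
    "Paris": "capital city of france",
}

def tfidf(term, text_tokens, all_docs):
    """Smoothed tf-idf weight of one term in one document."""
    tf = text_tokens.count(term) / len(text_tokens)
    df = sum(1 for text in all_docs.values() if term in text.split())
    idf = math.log((1 + len(all_docs)) / (1 + df)) + 1
    return tf * idf

def rank(query, docs, top_n=5):
    """Score every document against the query, return top_n titles."""
    q_terms = query.lower().split()
    scores = {
        title: sum(tfidf(t, text.split(), docs) for t in q_terms)
        for title, text in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(rank("ivan pavlov", docs))
```

The real ranker differs mainly in scale: terms are hashed into a fixed number of buckets and documents are scored via a sparse matrix product rather than per-term loops.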

Quick Start

Before using the model, make sure that all required packages are installed by running:

python -m deeppavlov install en_ranker_tfidf_wiki

Training and building (if you have your own data)

from deeppavlov import configs, train_model
ranker = train_model(configs.doc_retrieval.en_ranker_tfidf_wiki, download=True)

Building (if you don't have your own data)

from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model

ranker = build_model(configs.doc_retrieval.en_ranker_tfidf_wiki, load_trained=True)

Inference

result = ranker(['Who is Ivan Pavlov?'])
print(result)

Output

>> ['Ivan Pavlov (lawyer)', 'Ivan Pavlov', 'Pavlovian session', 'Ivan Pavlov (film)', 'Vladimir Bekhterev']

The text of the returned titles can then be retrieved with the :class:`~deeppavlov.vocabs.wiki_sqlite.WikiSQLiteVocab` class.
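The title-to-text lookup that WikiSQLiteVocab performs can be approximated with plain sqlite3. This sketch assumes a simple `documents(title, text)` table; the actual schema of the DeepPavlov database may differ:

```python
import sqlite3

# Throwaway in-memory DB mimicking the Wikipedia article store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (title TEXT PRIMARY KEY, text TEXT)")
conn.execute(
    "INSERT INTO documents VALUES (?, ?)",
    ("Ivan Pavlov", "Ivan Pavlov was a Russian physiologist..."),
)
conn.commit()

def fetch_texts(titles, conn):
    """Map ranker output titles to full article texts."""
    texts = []
    for title in titles:
        row = conn.execute(
            "SELECT text FROM documents WHERE title = ?", (title,)
        ).fetchone()
        texts.append(row[0] if row else None)
    return texts

print(fetch_texts(["Ivan Pavlov"], conn))
```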

Configuration

Default ranker config for English language is :config:`doc_retrieval/en_ranker_tfidf_wiki.json <doc_retrieval/en_ranker_tfidf_wiki.json>`

Default ranker config for Russian language is :config:`doc_retrieval/ru_ranker_tfidf_wiki.json <doc_retrieval/ru_ranker_tfidf_wiki.json>`

Running the Ranker

Note

About 16 GB of RAM is required.

Training

Run the following to fit the ranker on English Wikipedia:

python -m deeppavlov train en_ranker_tfidf_wiki

Run the following to fit the ranker on Russian Wikipedia:

python -m deeppavlov train ru_ranker_tfidf_wiki

Interacting

When interacting, the ranker returns document titles of the relevant documents.

Run the following to interact with the English ranker:

python -m deeppavlov interact en_ranker_tfidf_wiki -d

Run the following to interact with the Russian ranker:

python -m deeppavlov interact ru_ranker_tfidf_wiki -d

As a result of ranker training, a SQLite database and a tf-idf matrix are created.

Available Data and Pretrained Models

The Wikipedia DB and pretrained tf-idf matrices are downloaded to the deeppavlov/download/odqa folder by default.

enwiki.db

The enwiki.db SQLite database contains 5,180,368 Wikipedia articles and is built by the following steps:

  1. Download a Wikipedia dump file. We took the latest enwiki dump (from 2018-02-11).
  2. Unpack it and extract the articles with WikiExtractor [2] (with the --json, --no-templates and --filter_disambig_pages options).
  3. Build a database during :ref:`ranker_training`.

enwiki_tfidf_matrix.npz

enwiki_tfidf_matrix.npz is a full-Wikipedia tf-idf matrix of size hash_size × number of documents, i.e. 2^24 × 5,180,368. The matrix is built with the :class:`~deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer` class.
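The hashing trick behind this matrix shape can be sketched in plain Python. Here `hashlib` stands in for the vectorizer's actual hash function, and `HASH_SIZE = 2**24` matches the row dimension quoted above; everything else is illustrative:

```python
import hashlib
from collections import Counter

HASH_SIZE = 2 ** 24  # number of hash buckets = rows of the matrix

def bucket(term):
    """Map a term to a fixed-size bucket index via hashing."""
    digest = hashlib.md5(term.encode()).digest()
    return int.from_bytes(digest[:8], "big") % HASH_SIZE

def hashed_counts(text):
    """Sparse term-count column for one document: bucket -> count."""
    return Counter(bucket(tok) for tok in text.lower().split())

# Each document becomes one sparse column; the vocabulary never has
# to be stored, and collisions are rare with 2**24 buckets.
col = hashed_counts("Pavlov studied conditioning. Pavlov was Russian.")
print(sum(col.values()))
```

Hashing fixes the matrix height regardless of vocabulary size, which is what makes a single matrix over all of Wikipedia practical.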

ruwiki.db

The ruwiki.db SQLite database contains 1,463,888 Wikipedia articles and is built by the following steps:

  1. Download a Wikipedia dump file. We took the latest ruwiki dump (from 2018-04-01).
  2. Unpack it and extract the articles with WikiExtractor (with the --json, --no-templates and --filter_disambig_pages options).
  3. Build a database during :ref:`ranker_training`.

ruwiki_tfidf_matrix.npz

ruwiki_tfidf_matrix.npz is a full-Wikipedia tf-idf matrix of size hash_size × number of documents, i.e. 2^24 × 1,463,888. The matrix is built with the :class:`~deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer` class.

Comparison

Scores for TF-IDF Ranker model:

Model                                                          | Dataset     | Wiki dump           | Recall (top 5)
:config:`DeepPavlov <doc_retrieval/en_ranker_tfidf_wiki.json>` | SQuAD (dev) | enwiki (2018-02-11) | 75.6
DrQA [1]                                                       | SQuAD (dev) | enwiki (2016-12-21) | 77.8
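Recall (top 5) is the fraction of questions for which a relevant document appears among the top 5 retrieved. A sketch of the metric, with illustrative data (the exact relevance criterion used in the evaluation may differ):

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of queries whose top-k retrieval hits a relevant doc.

    retrieved: list of ranked title lists, one per query.
    relevant:  list of sets of relevant titles, one per query.
    """
    hits = sum(
        1 for titles, gold in zip(retrieved, relevant)
        if any(t in gold for t in titles[:k])
    )
    return hits / len(retrieved)

retrieved = [["Ivan Pavlov", "Paris"], ["Berlin"]]
relevant = [{"Ivan Pavlov"}, {"Moscow"}]
print(recall_at_k(retrieved, relevant))  # 1 of 2 queries hit -> 0.5
```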

References

[1] https://github.com/facebookresearch/DrQA/
[2] https://github.com/attardi/wikiextractor