Updated docs about 'searching own docs' (#626)
lintool committed May 28, 2021
1 parent 8a1e492 commit 102ed2b
Showing 3 changed files with 56 additions and 7 deletions.
57 changes: 50 additions & 7 deletions README.md
@@ -281,6 +281,9 @@ for i in range(searcher.num_docs):

## How do I index and search my own documents?

To build sparse indexes (i.e., Lucene inverted indexes) on your own document collections, follow the instructions below.
To build dense indexes (e.g., the output of transformer encoders) on your own document collections, see instructions [here](docs/usage-dense-indexes.md).

Pyserini (via Anserini) provides ingestors for document collections in many different formats.
The simplest, however, is the following JSON format:

@@ -302,12 +305,24 @@ So, the quickest way to get started is to write a script that converts your documents into the above format.
Then, you can invoke the indexer (here, we're indexing JSONL, but any of the other formats work as well):

```bash
python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-threads 1 -input integrations/resources/sample_collection_jsonl \
-index indexes/sample_collection_jsonl -storePositions -storeDocvectors -storeRaw
python -m pyserini.index -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator \
-threads 1 \
-input integrations/resources/sample_collection_jsonl \
-index indexes/sample_collection_jsonl \
-storePositions -storeDocvectors -storeRaw
```
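
The indexer expects the input directory to already hold your documents in the JSON format. A minimal sketch of such a conversion script is shown here. It assumes each JSON object carries an `id` field and a `contents` field; the function name `write_jsonl` and any paths you pass to it are illustrative and not part of Pyserini.

```python
import json
from pathlib import Path


def write_jsonl(docs, output_path):
    """Write (doc_id, text) pairs as one JSON object per line (JSONL)."""
    output_path = Path(output_path)
    # Create the collection directory if it does not exist yet.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open('w', encoding='utf-8') as f:
        for doc_id, text in docs:
            # Assumed field names: `id` and `contents`.
            record = {'id': doc_id, 'contents': text}
            f.write(json.dumps(record, ensure_ascii=False) + '\n')
    return output_path
```

Calling `write_jsonl(pairs, 'my_collection/documents.jsonl')` (a hypothetical path) would produce a directory that can be passed to `-input`.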

Once this is done, you can use `SimpleSearcher` to search the index:
Three options control the type of index that is built:

+ `-storePositions`: builds a standard positional index
+ `-storeDocvectors`: stores doc vectors (required for relevance feedback)
+ `-storeRaw`: stores raw documents

If you don't specify any of the three options above, Pyserini builds an index that only stores term frequencies.
This is sufficient for simple "bag of words" querying (and yields the smallest index size).
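
If you drive indexing from a script, the three options can be toggled programmatically. The following is only a sketch: `build_index_command` is a hypothetical helper (not a Pyserini function) that assembles the same command line shown above, using only the flags that appear in it.

```python
def build_index_command(input_dir, index_dir, threads=1,
                        store_positions=False,
                        store_docvectors=False,
                        store_raw=False):
    """Assemble the argument list for `python -m pyserini.index`."""
    cmd = [
        'python', '-m', 'pyserini.index',
        '-collection', 'JsonCollection',
        '-generator', 'DefaultLuceneDocumentGenerator',
        '-threads', str(threads),
        '-input', input_dir,
        '-index', index_dir,
    ]
    # With none of these flags, only term frequencies are stored.
    if store_positions:
        cmd.append('-storePositions')
    if store_docvectors:
        cmd.append('-storeDocvectors')
    if store_raw:
        cmd.append('-storeRaw')
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)` to launch the indexer.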

Once indexing is done, you can use `SimpleSearcher` to search the index:

```python
from pyserini.search import SimpleSearcher
@@ -316,9 +331,39 @@ searcher = SimpleSearcher('indexes/sample_collection_jsonl')
hits = searcher.search('document')

for i in range(len(hits)):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
    print(f'{i+1:2} {hits[i].docid:4} {hits[i].score:.5f}')
```

You should get something like the following:

```
1 doc2 0.25620
2 doc3 0.23140
```

If you want to perform a batch retrieval run (e.g., directly from the command line), organize all your queries in a tsv file, like [here](integrations/resources/sample_queries.tsv).
The format is simple: the first field is a query id, and the second field is the query itself.
Note that the file name _must_ end in `.tsv` so that Pyserini knows what format the queries are in.
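
If your queries live in code rather than in a file, a short script can write them out in this two-field layout. This is a sketch under the assumptions stated in the text (tab-separated, query id first, `.tsv` suffix); `write_queries_tsv` is an illustrative name, not a Pyserini function.

```python
def write_queries_tsv(queries, path):
    """Write (query_id, query_text) pairs to a tab-separated file."""
    if not str(path).endswith('.tsv'):
        raise ValueError('query file name must end in .tsv')
    with open(path, 'w', encoding='utf-8') as f:
        for qid, text in queries:
            # A tab or newline inside a query would corrupt the two-field layout.
            if '\t' in text or '\n' in text:
                raise ValueError(f'query {qid} contains a tab or newline')
            f.write(f'{qid}\t{text}\n')
```

For instance, passing `[(1, 'document'), (2, 'one')]` writes two lines, each holding an id, a tab, and the query text.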

Then, you can run:

```bash
$ python -m pyserini.search --topics integrations/resources/sample_queries.tsv \
--index indexes/sample_collection_jsonl \
--output run.sample.txt \
--bm25

$ cat run.sample.txt
1 Q0 doc2 1 0.256200 Anserini
1 Q0 doc3 2 0.231400 Anserini
2 Q0 doc1 1 0.534600 Anserini
3 Q0 doc1 1 0.256200 Anserini
3 Q0 doc2 2 0.256199 Anserini
4 Q0 doc3 1 0.483000 Anserini
```

Note that the output run file is in standard TREC format.
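
Each line of a TREC run file has six whitespace-separated columns: query id, the literal `Q0`, document id, rank, score, and a run tag. To post-process a run in Python, a minimal parser might look like the sketch below; `RunEntry` and `parse_trec_run` are illustrative names, not part of Pyserini.

```python
from typing import NamedTuple


class RunEntry(NamedTuple):
    qid: str
    docid: str
    rank: int
    score: float
    tag: str


def parse_trec_run(lines):
    """Parse TREC run lines of the form: qid Q0 docid rank score tag."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        qid, _q0, docid, rank, score, tag = line.split()
        entries.append(RunEntry(qid, docid, int(rank), float(score), tag))
    return entries
```

With the run above, `parse_trec_run(open('run.sample.txt'))` would yield one `RunEntry` per line, which you can then group by `qid`.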

You can also add extra fields in your documents when needed, e.g. text features.
For example, the [SpaCy](https://spacy.io/usage/linguistic-features#named-entities) Named Entity Recognition (NER) result of `contents` could be stored as an additional field `NER`.

@@ -333,8 +378,6 @@ For example, the [SpaCy](https://spacy.io/usage/linguistic-features#named-entities) Named Entity Recognition (NER) result of `contents` could be stored as an additional field `NER`.
}
```

Happy honking!

## Reproduction Guides

With Pyserini, it's easy to [reproduce](docs/reproducibility.md) runs on a number of standard IR test collections!
2 changes: 2 additions & 0 deletions docs/usage-dense-indexes.md
@@ -1,3 +1,5 @@
# Pyserini: Guide to Dense Indexes

## How do I index and search my own documents (Dense)?

Pyserini creates dense indexes for collections in JSONL format:
4 changes: 4 additions & 0 deletions integrations/resources/sample_queries.tsv
@@ -0,0 +1,4 @@
1 document
2 one
3 contents
4 text