
Various documentation updates (#666)

+ Pass over Solrini documentation (confirmed that everything works)
+ Updated recent publications
+ Added Lucene 7.6 vs. Lucene 8.0 efficiency results
lintool committed Jun 1, 2019
1 parent bb377ef commit 88fd615972ed9eb218cc5a9ea302f4533e12638f
Showing with 92 additions and 30 deletions.
  1. +2 −0 .gitignore
  2. +31 −8 README.md
  3. +13 −0 docs/lucene7-vs-lucene8.md
  4. +6 −2 docs/runbook-ecir2019-axiomatic.md
  5. +1 −1 docs/runbook-ecir2019-ccrf.md
  6. +39 −19 docs/solrini.md
@@ -18,6 +18,8 @@ log.*
out.*
runs.regression/
runs.jdiq2018/
src/main/resources/solr/anserini-twitter/conf/lang/
src/main/resources/solr/anserini/conf/lang/
# automatically generated by ECIR2019_axiomatic scripts
src/main/resources/topics-and-qrels/qrels.cw09.all.txt
src/main/resources/topics-and-qrels/qrels.cw12.all.txt
@@ -4,7 +4,12 @@ Anserini
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/io.anserini/anserini/badge.svg)](https://maven-badges.herokuapp.com/maven-central/io.anserini/anserini)
[![LICENSE](https://img.shields.io/badge/license-Apache-blue.svg?style=flat-square)](./LICENSE)

Anserini is an open-source information retrieval toolkit built on Lucene that aims to bridge the gap between academic information retrieval research and the practice of building real-world search applications. This effort grew out of [a reproducibility study of various open-source retrieval engines in 2016](https://cs.uwaterloo.ca/~jimmylin/publications/Lin_etal_ECIR2016.pdf) (Lin et al., ECIR 2016). Additional details can be found [in a short paper](https://dl.acm.org/authorize?N47337) (Yang et al., SIGIR 2017) and a [journal article](https://dl.acm.org/citation.cfm?doid=3289400.3239571) (Yang et al., JDIQ 2018).
Anserini is an open-source information retrieval toolkit built on Lucene that aims to bridge the gap between academic information retrieval research and the practice of building real-world search applications.
This effort grew out of [a reproducibility study of various open-source retrieval engines in 2016](https://cs.uwaterloo.ca/~jimmylin/publications/Lin_etal_ECIR2016.pdf) (Lin et al., ECIR 2016).
See [Yang et al. (SIGIR 2017)](https://dl.acm.org/authorize?N47337) and [Yang et al. (JDIQ 2018)](https://dl.acm.org/citation.cfm?doid=3289400.3239571) for overviews.

Anserini is currently based on Lucene 7.6, with a planned upgrade to Lucene 8 in the near future.
Based on [preliminary experiments](docs/lucene7-vs-lucene8.md), query evaluation latency has been much improved in Lucene 8.

## Getting Started

@@ -77,7 +82,7 @@ Runbooks:
-mapper CountDocumentMapper -context CountDocumentMapperContext
```

## Python Interface
## Python Integration

Anserini was designed with Python integration in mind, for connecting with popular deep learning toolkits such as PyTorch. This is accomplished via [pyjnius](https://github.com/kivy/pyjnius). The `SimpleSearcher` class provides a simple Python/Java bridge, shown below:

@@ -105,6 +110,16 @@ hits[0].score
hits[0].content
```

## Solr Integration

Anserini provides code for indexing into SolrCloud, thus providing interoperable support for test collections with local Lucene indexes and Solr indexes.
See [this page](docs/solrini.md) for more details.

## Elasticsearch Integration

Anserini integration with Elasticsearch is coming soon!
See [Issue 633](https://github.com/castorini/anserini/issues/633).

## Release History

+ v0.4.0: March 4, 2019 [[Release Notes](docs/release-notes/release-notes-v0.4.0.md)]
@@ -114,12 +129,20 @@ hits[0].content

## References

+ Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, Sebastiano Vigna. [Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge.](https://cs.uwaterloo.ca/~jimmylin/publications/Lin_etal_ECIR2016.pdf) _Proceedings of the 38th European Conference on Information Retrieval (ECIR 2016)_, pages 408-420, March 2016, Padua, Italy.

+ Peilin Yang, Hui Fang, and Jimmy Lin. [Anserini: Enabling the Use of Lucene for Information Retrieval Research.](https://dl.acm.org/authorize?N47337) _Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017)_, pages 1253-1256, August 2017, Tokyo, Japan.

+ Peilin Yang, Hui Fang, and Jimmy Lin. [Anserini: Reproducible Ranking Baselines Using Lucene.](https://dl.acm.org/citation.cfm?doid=3289400.3239571) Journal of Data and Information Quality, 10(4), Article 16, 2018.
+ Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, Sebastiano Vigna. [Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge.](https://cs.uwaterloo.ca/~jimmylin/publications/Lin_etal_ECIR2016.pdf) _ECIR 2016_, pages 408-420.
+ Peilin Yang, Hui Fang, and Jimmy Lin. [Anserini: Enabling the Use of Lucene for Information Retrieval Research.](https://dl.acm.org/authorize?N47337) _SIGIR 2017_, pages 1253-1256.
+ Peilin Yang, Hui Fang, and Jimmy Lin. [Anserini: Reproducible Ranking Baselines Using Lucene.](https://dl.acm.org/citation.cfm?doid=3289400.3239571) _Journal of Data and Information Quality_, 10(4), Article 16, 2018.
+ Peilin Yang and Jimmy Lin. [Reproducing and Generalizing Semantic Term Matching in Axiomatic Information Retrieval](https://cs.uwaterloo.ca/~jimmylin/publications/Yang_Lin_ECIR2019.pdf). _ECIR 2019_, pages 369-381.
+ Ruifan Yu, Yuhao Xie and Jimmy Lin. [Simple Techniques for Cross-Collection Relevance Transfer.](https://cs.uwaterloo.ca/~jimmylin/publications/Yu_etal_ECIR2019.pdf) _ECIR 2019_, pages 397-409.
+ Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. [End-to-End Open-Domain Question Answering with BERTserini.](https://aclweb.org/anthology/papers/N/N19/N19-4013/) _NAACL-HLT 2019 Demos_, pages 72-77.
+ Ryan Clancy, Toke Eskildsen, Nick Ruest, and Jimmy Lin. [Solr Integration in the Anserini Information Retrieval Toolkit.](https://cs.uwaterloo.ca/~jimmylin/publications/Clancy_etal_SIGIR2019a.pdf) _SIGIR 2019_.
+ Ryan Clancy, Jaejun Lee, Zeynep Akkalyoncu Yilmaz, and Jimmy Lin. [Information Retrieval Meets Scalable Text Analytics: Solr Integration with Spark.](https://cs.uwaterloo.ca/~jimmylin/publications/Clancy_etal_SIGIR2019b.pdf) _SIGIR 2019_.
+ Jimmy Lin and Peilin Yang. [The Impact of Score Ties on Repeatability in Document Ranking.](https://cs.uwaterloo.ca/~jimmylin/publications/Lin_Yang_SIGIR2019.pdf) _SIGIR 2019_.
+ Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. [Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models.](https://cs.uwaterloo.ca/~jimmylin/publications/Lin_Yang_SIGIR2019.pdf) _SIGIR 2019_.
+ Ryan Clancy, Nicola Ferro, Claudia Hauff, Jimmy Lin, Tetsuya Sakai, and Ze Zhong Wu. [The SIGIR 2019 Open-Source IR Replicability Challenge (OSIRRC 2019).](https://cs.uwaterloo.ca/~jimmylin/publications/Clancy_etal_SIGIR2019_workshop.pdf) _SIGIR 2019_.

## Acknowledgments

This research has been supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada and the U.S. National Science Foundation under IIS-1423002 and CNS-1405688. Any opinions, findings, and conclusions or recommendations expressed do not necessarily reflect the views of the sponsors.
This research is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
Previous support came from the U.S. National Science Foundation under IIS-1423002 and CNS-1405688.
Any opinions, findings, and conclusions or recommendations expressed do not necessarily reflect the views of the sponsors.
@@ -0,0 +1,13 @@
# Anserini: Lucene 7 vs. Lucene 8

Experiments performed in late April 2019 on an Intel E5-2699 v4 @ 2.20GHz processor, single thread.
Query evaluation latency on the ClueWeb12-B13 collection, running the first 10k queries from the [TREC 2005 Terabyte Track efficiency queries](https://trec.nist.gov/data/terabyte05.html):

Hits | Lucene 7.6 | Lucene 8.0 | speedup |
----------:|-----------:|-----------:|--------:|
10 hits | 2885s | 654s | ~4.4 |
100 hits | 3209s | 1161s | ~2.8 |
1000 hits | 5691s | 4050s | ~1.4 |

Results are averaged over three trials, after discarding a warmup run.
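
As a quick sanity check, the speedup column can be recomputed from the two latency columns. The sketch below is illustrative only: the function names are hypothetical, and the per-query figure assumes that each reported time covers exactly 10,000 queries.

```python
# Total latencies in seconds, copied from the table above:
# hits retrieved -> (Lucene 7.6, Lucene 8.0)
latencies = {
    10: (2885, 654),
    100: (3209, 1161),
    1000: (5691, 4050),
}


def speedup(lucene7_seconds, lucene8_seconds):
    """Ratio of Lucene 7.6 time to Lucene 8.0 time."""
    return lucene7_seconds / lucene8_seconds


def per_query_ms(total_seconds, num_queries=10_000):
    """Mean latency per query in milliseconds (assumes 10k queries per run)."""
    return total_seconds / num_queries * 1000


speedups = {hits: round(speedup(*pair), 1) for hits, pair in latencies.items()}

for hits, (lucene7, lucene8) in latencies.items():
    print(f"{hits:>5} hits: ~{speedups[hits]}x, "
          f"{per_query_ms(lucene7):.1f} ms -> {per_query_ms(lucene8):.1f} ms per query")
```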

@@ -2,9 +2,13 @@

This page documents code for replicating results from the following paper:

+ Peilin Yang and Jimmy Lin. Reproducing and Generalizing Semantic Term Matching in Axiomatic Information Retrieval. Proceedings of the 41st European Conference on Information Retrieval (ECIR 2019), April 2019, Cologne, Germany.
+ Peilin Yang and Jimmy Lin. [Reproducing and Generalizing Semantic Term Matching in Axiomatic Information Retrieval](https://cs.uwaterloo.ca/~jimmylin/publications/Yang_Lin_ECIR2019.pdf). _Proceedings of the 41st European Conference on Information Retrieval, Part I (ECIR 2019)_, pages 369-381, April 2019, Cologne, Germany.

**Requirements**: Python>=2.6 or Python>=3.5 `pip install -r src/main/python/requirements.txt`
**Requirements**: With Python>=2.6 or Python>=3.5:

```
pip install -r src/main/python/requirements.txt
```

## Parameter Sensitivity Plots

@@ -2,7 +2,7 @@

This page documents code for replicating results from the following paper:

- Ruifan Yu, Yuhao Xie and Jimmy Lin. Simple Techniques for Cross-Collection Relevance Transfer. Proceedings of the 41st European Conference on Information Retrieval (ECIR 2019), April 2019, Cologne, Germany.
+ Ruifan Yu, Yuhao Xie and Jimmy Lin. [Simple Techniques for Cross-Collection Relevance Transfer.](https://cs.uwaterloo.ca/~jimmylin/publications/Yu_etal_ECIR2019.pdf) _Proceedings of the 41st European Conference on Information Retrieval, Part I (ECIR 2019)_, pages 397-409, April 2019, Cologne, Germany.

**Requirements**: The main requirements are:

@@ -1,40 +1,60 @@
# Solrini - Solr Integration with Anserini
# Solrini: Solr Integration with Anserini

In order to index collections with Solr, we'll setup a single-node SolrCloud instance running locally.
This page documents code for replicating results from the following paper:

## Setup
+ Ryan Clancy, Toke Eskildsen, Nick Ruest, and Jimmy Lin. [Solr Integration in the Anserini Information Retrieval Toolkit.](https://cs.uwaterloo.ca/~jimmylin/publications/Clancy_etal_SIGIR2019a.pdf) _Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019)_, July 2019, Paris, France.

1) From the Solr [archives](https://archive.apache.org/dist/lucene/solr/), download the Solr (non `-src`) version that matches Anserini's [Lucene version](https://github.com/castorini/anserini/blob/master/pom.xml#L36).
2) Extract the archive:
- `mkdir solrini && tar -zxvf solr*.tgz -C solrini --strip-components=1`
3) Start Solr:
- `solrini/bin/solr start -c -m 8G`
4) Run the Solr bootstrap script to copy the Anserini JAR into Solr's classpath and upload the configsets to Solr's internal ZooKeeper:
- `pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd`
We provide instructions for setting up a single-node SolrCloud instance running locally and indexing into it from Anserini.
Instructions for setting up SolrCloud clusters can be found by searching the web.

## Setting up a Single-Node SolrCloud Instance

From the Solr [archives](https://archive.apache.org/dist/lucene/solr/), download the Solr (non `-src`) version that matches Anserini's [Lucene version](https://github.com/castorini/anserini/blob/master/pom.xml#L36).

Extract the archive:

```
mkdir solrini && tar -zxvf solr*.tgz -C solrini --strip-components=1
```

Start Solr:

```
solrini/bin/solr start -c -m 8G
```

Adjust memory usage (i.e., `-m 8G`) as appropriate for your machine.

Run the Solr bootstrap script to copy the Anserini JAR into Solr's classpath and upload the configsets to Solr's internal ZooKeeper:

```
pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd
```

Solr should be available at [http://localhost:8983](http://localhost:8983) for browsing.
Solr should now be available at [http://localhost:8983/](http://localhost:8983/) for browsing.

Notes:
- `-m 8G` may need to be updated depending on your machine's memory capacity

## Indexing
## Indexing into SolrCloud from Anserini

We can use Anserini as a common "frontend" for indexing into SolrCloud, thus supporting the same range of test collections that's already included in Anserini (when directly building local Lucene indexes).
Indexing into Solr is similar to indexing to disk with Lucene, with a few added parameters.
Most notably, we replace the `-index` parmeter (which specifies the Lucene index path on disk) with Solr parameters.
Most notably, we replace the `-index` parameter (which specifies the Lucene index path on disk) with Solr parameters.

We'll index [robust04](https://github.com/castorini/Anserini/blob/master/docs/experiments-robust04.md) as an example:

1. Create the `robust04` collection from the Solr [collections page](http://localhost:8983/solr/#/~collections).
- Make sure the `config set` value is set to `anserini`
2. Run the Solr indexing command for `robust04`:
Create the `robust04` collection from the Solr [collections page](http://localhost:8983/solr/#/~collections). Make sure the `config set` value is set to `anserini`.
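
The collection can probably also be created without the web UI, through Solr's Collections API. The following is a minimal sketch, not part of Anserini: the helper names are hypothetical, and `numShards=1` is an assumption that suits the single-node setup described above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_BASE = "http://localhost:8983/solr"  # host and port from the setup above


def build_create_url(name, config_set="anserini", num_shards=1):
    """Build a Collections API CREATE request URL (helper name is hypothetical)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,  # assumption: one shard on a single node
        "collection.configName": config_set,  # the configset uploaded by solr.sh
    }
    return f"{SOLR_BASE}/admin/collections?{urlencode(params)}"


def create_collection(name):
    """Send the CREATE request; requires a running SolrCloud instance."""
    with urlopen(build_create_url(name)) as response:
        return response.status


if __name__ == "__main__":
    # Only prints the URL; call create_collection("robust04") to actually send it.
    print(build_create_url("robust04"))
```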

Run the Solr indexing command for `robust04`:

```
sh target/appassembler/bin/IndexCollection -collection TrecCollection -generator JsoupGenerator \
-threads 8 -input /path/to/robust04 \
-solr -solr.index robust04 -solr.zkUrl localhost:9983 \
-storePositions -storeDocvectors -storeRawDocs
```

Make sure `/path/to/robust04` is updated with the appropriate path.

Once indexing has completed, you should be able to query `robust04` from the Solr [query interface](http://localhost:8983/solr/#/robust04/query).
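
The same check can be scripted against Solr's standard `select` handler. This is a hedged sketch rather than Anserini code: the function names are hypothetical, and the `id` and `contents` field names are assumptions about the `anserini` configset that should be verified against its schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_BASE = "http://localhost:8983/solr"  # host and port from the setup above


def build_select_url(collection, query, rows=10, default_field="contents"):
    """Build a select-handler URL (field names are assumptions, see above)."""
    params = {
        "q": query,
        "df": default_field,  # assumed name of the indexed text field
        "rows": rows,
        "fl": "id,score",  # return only document ids and scores
        "wt": "json",
    }
    return f"{SOLR_BASE}/{collection}/select?{urlencode(params)}"


def search(collection, query, rows=10):
    """Run the query and return the list of matching documents.

    Requires a running Solr instance with the collection already indexed.
    """
    with urlopen(build_select_url(collection, query, rows)) as response:
        return json.load(response)["response"]["docs"]


if __name__ == "__main__":
    # Only prints the URL; call search(...) once indexing has completed.
    print(build_select_url("robust04", "international organized crime"))
```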

To index other collections, the above instructions can be followed making appropriate substitutions for paremeters based on the collection's [experiment docs](https://github.com/castorini/anserini/tree/master/docs).
Other collections can be indexed by substituting the appropriate parameters; see each collection's [experiment docs](https://github.com/castorini/anserini/tree/master/docs).
