# indri

This is a clone of Indri 5.12 with minor customizations.

## IndriRunQuery customizations

Unless noted, all parameters are passed on the command line just like other IndriRunQuery parameters.

NB. Input queries are assumed to be natural language, not Indri query language.

### Condensed List Relevance Models

| parameter | type | default | description |
|---|---|---|---|
| rerankSize | int | 0 | the relevance model reranks the top rerankSize documents from an initial query likelihood retrieval. |

Using condensed list relevance models can substantially improve speed without degrading effectiveness of relevance models. This means you can run massive query expansion very quickly.
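For example, a minimal sketch of a condensed list RM3 run that reranks a 1000-document query likelihood retrieval; the feedback values here are illustrative, and fbDocs, fbTerms, and fbOrigWeight are the standard Indri pseudo-relevance feedback parameters:

```
IndriRunQuery -index=/path/to/index -query="hello world" -fbDocs=10 -fbTerms=50 -fbOrigWeight=0.5 -rerankSize=1000
```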

### External Expansion

| parameter | type | default | description |
|---|---|---|---|
| externalIndex | path | NONE | the RM is built from an initial query likelihood retrieval from externalIndex. |

Using external expansion with a large external index can substantially improve effectiveness of query expansion. This can be combined with condensed list relevance models (from the target corpus) if you are concerned about speed.
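For example, a sketch of an external expansion run, where /path/to/external/index is a placeholder for the large external index and the feedback values are illustrative:

```
IndriRunQuery -index=/path/to/index -externalIndex=/path/to/external/index -query="hello world" -fbDocs=10 -fbTerms=50 -fbOrigWeight=0.5
```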

### Dependence Models

To construct and run a dependence model query, just use `-dm=<dm parameter:dm parameter value>[,<dm parameter:dm parameter value>]+`. Defaults are used for any DM parameters that are not set.

| parameter | type | default | description |
|---|---|---|---|
| order | int | 1 | dependence model order |
| combineWeight | float | 0.85 | weight for combine subquery |
| owWeight | float | 0.10 | weight for ordered window subquery |
| uwWeight | float | 0.05 | weight for unordered window subquery |
| uwSize | int | 8 | unordered window size |
| rerankSize | int | 0 | number of query likelihood results to rerank (0=retrieve) |

Order k=1 dependence models are the classic sequential dependence model. More generally, order k computes the dependencies between all query terms within a k-word window; order k=0 reduces to plain query likelihood, and order k=-1 is the classic full dependence model.

| order | indri query |
|---|---|
| 0 | `#combine( colorless green ideas sleep furiously )` |
| 1 | `#weight( 0.85 #combine( colorless green ideas sleep furiously ) 0.1 #combine( #1( colorless green ) #1( green idea ) #1( idea sleep ) #1( sleep furiously ) ) 0.05 #combine( #uw8( colorless green ) #uw8( green idea ) #uw8( idea sleep ) #uw8( sleep furiously ) ) )` |
| 2 | `#weight( 0.85 #combine( colorless green ideas sleep furiously ) 0.1 #combine( #1( colorless green ) #1( colorless idea ) #1( green idea ) #1( green sleep ) #1( idea sleep ) #1( idea furiously ) #1( sleep furiously ) ) 0.05 #combine( #uw8( colorless green ) #uw8( colorless idea ) #uw8( green idea ) #uw8( green sleep ) #uw8( idea sleep ) #uw8( idea furiously ) #uw8( sleep furiously ) ) )` |
| -1 | `#weight( 0.85 #combine( colorless green ideas sleep furiously ) 0.1 #combine( #1( colorless green ) #1( colorless idea ) #1( colorless sleep ) #1( colorless furiously ) #1( green idea ) #1( green sleep ) #1( green furiously ) #1( idea sleep ) #1( idea furiously ) #1( sleep furiously ) ) 0.05 #combine( #uw8( colorless green ) #uw8( colorless idea ) #uw8( colorless sleep ) #uw8( colorless furiously ) #uw8( green idea ) #uw8( green sleep ) #uw8( green furiously ) #uw8( idea sleep ) #uw8( idea furiously ) #uw8( sleep furiously ) ) )` |
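For example, a sketch of an order 2 dependence model run that overrides the default weights and window size (the values here are illustrative, not tuned):

```
IndriRunQuery -index=/path/to/index -query="colorless green ideas sleep furiously" -dm=order:2,combineWeight:0.8,owWeight:0.15,uwWeight:0.05,uwSize:12
```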

This implementation also includes condensed list dependence models which, like condensed list relevance models, first conduct a query likelihood retrieval and then rerank the results using the dependence model. This can substantially improve speed, especially for longer queries.

For example, to run a classic SDM reranking of a length-100 query likelihood initial retrieval,

```
IndriRunQuery -index=/path/to/index -query="hello world" -dm=order:1,rerankSize:100
```

Dependence models are built after internally stopping and stemming the query terms.

### Passage Retrieval

| parameter | type | default | description |
|---|---|---|---|
| passageLength | int | 0 | length of passages to retrieve. |
| passageOverlap | int | 0 | passage overlap. |
| fbRankDocuments | bool | false | used in conjunction with passageLength > 0 and PRF; ranks documents based on passage RMs. |
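For example, a sketch retrieving half-overlapping 100-term passages (the lengths here are illustrative):

```
IndriRunQuery -index=/path/to/index -query="hello world" -passageLength=100 -passageOverlap=50
```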

### Field Restriction

| parameter | type | default | description |
|---|---|---|---|
| field | string | NONE | restrict retrieval to this field. |
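For example, a sketch restricting retrieval to a title field, assuming the index was built with such a field:

```
IndriRunQuery -index=/path/to/index -query="hello world" -field=title
```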

### RM Parameters

To use advanced RM parameters, just use `-rm=<rm parameter:rm parameter value>[,<rm parameter:rm parameter value>]+`. Defaults are used for any RM parameters that are not set.

| parameter | type | default | description |
|---|---|---|---|
| passageLength | int | 0 | passage length for passage-based RM (0=doc). |
| passageOverlap | int | 0 | passage overlap for passage-based RM (0=doc). |
| field | string | NONE | field restriction for field-based RM. |
| fbDocs | int | 0 | number of feedback documents |
| fbTerms | int | 0 | number of feedback terms |
| fbOrigWeight | float | 0.0 | weight on the original query when interpolating with the expansion. |
| targetPassages | bool | false | rank passages instead of documents in the final retrieval. |
| condensed | int | 0 | size of the initial query likelihood retrieval to rerank (0=retrieve). |
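For example, a sketch of a passage-based condensed list RM that reranks a 1000-document query likelihood retrieval (all values here are illustrative):

```
IndriRunQuery -index=/path/to/index -query="hello world" -rm=passageLength:100,passageOverlap:50,fbDocs:10,fbTerms:50,fbOrigWeight:0.5,condensed:1000
```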

If you want the initial retrieval to be a DM (see above), you can pass the following into the rm parameter list,

| parameter | type | default | description |
|---|---|---|---|
| dm.combineWeight | float | 0.85 | weight for combine subquery |
| dm.owWeight | float | 0.10 | weight for ordered window subquery |
| dm.uwWeight | float | 0.05 | weight for unordered window subquery |
| dm.uwSize | int | 8 | unordered window size |
| dm.order | int | 1 | dependence model order |
| dm.rerankSize | int | 0 | number of query likelihood results to rerank (0=retrieve) |
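For example, a sketch of query expansion whose initial retrieval is a condensed list SDM (the values here are illustrative):

```
IndriRunQuery -index=/path/to/index -query="hello world" -rm=fbDocs:10,fbTerms:50,fbOrigWeight:0.5,dm.order:1,dm.rerankSize:100
```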

## Baseline Cheatsheet

Here's a simple cheatsheet for good baselines for your text retrieval experiments. I'm assuming that you are evaluating on a target corpus (e.g. Associated Press documents) and might either train your model with documents from the target corpus or from some large external corpus (e.g. Gigaword). You also might incorporate term proximity or other linear term relationships (e.g. n- or skip-grams).

| proximity \ training corpus | none | target | external |
|---|---|---|---|
| no | QL | RM3 | EE |
| yes | DM | DM+RM3 | DM+EE |

Whatever the appropriate baseline is for your experiment, you should train its parameters on the same training set you're using for your model. For RM3/EE, good ranges are,

| parameter | min | max |
|---|---|---|
| fbDocs | 5 | 500 |
| fbTerms | 5 | 500 |
| fbOrigWeight | 0 | 1 |
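A minimal shell sketch of such a sweep; the grid here is coarser than you may want in practice, and queries.param and the runs/ directory are placeholders (-trecFormat is a standard IndriRunQuery parameter):

```
for fbDocs in 5 10 25 50 100 250 500; do
  for fbTerms in 5 10 25 50 100 250 500; do
    for fbOrigWeight in 0.0 0.25 0.5 0.75 1.0; do
      # one run per parameter setting; score each with your evaluation tool
      IndriRunQuery -index=/path/to/index -trecFormat=true \
        -fbDocs=$fbDocs -fbTerms=$fbTerms -fbOrigWeight=$fbOrigWeight \
        queries.param > runs/rm3.d$fbDocs.t$fbTerms.w$fbOrigWeight.run
    done
  done
done
```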

Don Metzler's original DM weights seem to be robust across conditions, but you may still want to play with the parameters above.

If you find DM/RM3/EE slow, instead of limiting the parameter ranges, you should opt for the condensed list versions of the methods. In general, you can rerank 1000 documents from a QL initial retrieval and closely preserve the ranking you would get from full re-retrieval, especially at the top of the final ranking.

If you are not sure where your model fits in the cheatsheet above, a strong baseline is DM+EE.
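For example, a sketch of a DM+EE run combining a dependence model with expansion from an external index (paths and values here are illustrative):

```
IndriRunQuery -index=/path/to/index -externalIndex=/path/to/external/index -query="hello world" -rm=fbDocs:50,fbTerms:100,fbOrigWeight:0.5,dm.order:1
```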

Note to paper readers (including reviewers): You'll come across papers that use poorly implemented pseudo-relevance feedback baselines. Here are common mistakes,

* no parameter tuning (i.e. a fixed number of feedback documents, expansion terms, or interpolation weight).
* poor parameter tuning (e.g. only testing 5-25 terms and/or documents). I've seen effectiveness peak at hundreds of documents/terms for some collections.
* unweighted document feedback (i.e. treating all feedback documents as equally relevant).
* unweighted term expansion (i.e. treating the new terms as just additional query terms).

## Citation

Use the standard indri citation,

```
@inproceedings{strohman:indri,
  author = {Trevor Strohman and Donald Metzler and Howard Turtle and W. B. Croft},
  booktitle = {Proceedings of the International Conference on Intelligence Analysis},
  title = {Indri: A language model-based search engine for complex queries},
  year = {2005}
}
```

as well as the following for reproducibility,

```
@online{diaz:indri,
  author = {Fernando Diaz},
  title = {indri},
  year = {2018},
  url = {https://github.com/diazf/indri}
}
```

and references to any of the specific algorithms above that you are using.

## Notes

To build on OSX, be sure to set `CXXFLAGS=-fno-tree-vectorize`. Thanks to Luke Gallagher.
