Much of the information that was published on the web is no longer available online and can only be found in web archives. Web archive information retrieval (WAIR) is a new and challenging research area that addresses the retrieval of document versions from web archives according to topical and temporal criteria of relevance. We publicly release this dataset to facilitate research in Learning to Rank for WAIR (L2R4WAIR).
The dataset is composed of a set of quadruples <query, version, grade, features>, where the grade indicates the degree of relevance of the version for the query. We use a three-level scale of relevance (not relevant, relevant, and very relevant), converted to an integer scale ranging from 0 to 2. The document version is identified by its URL and timestamp. The features form a vector of feature values, each an estimate of relevance for the <query, version> pair.
The quadruples were obtained from the PWA9609 test collection available at https://github.com/arquivo/pwa-technologies/wiki/TestCollection.
We followed the file format used in LETOR datasets. Each of the following lines corresponds to a quadruple and represents one training example:
=============================================================
0 qid:21 1:0.10 2:0.233 3:0.611 ... 68:0.643 # id21968747index0
2 qid:21 1:0.70 2:0.344 3:0.221 ... 68:0.869 # id114746079index0
0 qid:22 1:0.05 2:0.112 3:0.118 ... 68:0.434 # id172346033index3
=============================================================
The first column is the relevance label. The second column is the query id, and the following 68 columns are the feature ids with their values. The last column, after the # symbol, is the version identifier.
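The column layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset tooling: the names `Example` and `parse_line` are hypothetical, and the sample line in the usage note is shortened to three features for illustration.

```python
from typing import Dict, NamedTuple


class Example(NamedTuple):
    """One <query, version, grade, features> quadruple (hypothetical container)."""
    grade: int                  # relevance label: 0, 1 or 2
    qid: str                    # query id, taken from the "qid:<id>" column
    features: Dict[int, float]  # feature id -> feature value
    version_id: str             # text after the '#' symbol


def parse_line(line: str) -> Example:
    """Parse one line in the LETOR-style format described above."""
    # Everything after '#' is the version identifier.
    body, _, comment = line.partition("#")
    tokens = body.split()
    grade = int(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        feature_id, value = token.split(":", 1)
        features[int(feature_id)] = float(value)
    return Example(grade, qid, features, comment.strip())
```

For instance, `parse_line("0 qid:21 1:0.10 2:0.233 3:0.611 # id21968747index0")` yields grade `0`, query id `"21"`, a three-entry feature dictionary, and the version identifier `"id21968747index0"`.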
We followed LETOR and partitioned each dataset into five parts with the same number of queries, denoted as S1, S2, S3, S4, and S5. The idea is to evaluate results using a five-fold cross validation, where three parts are for training, one part for validation, and the remaining part for testing. The training set is used to learn ranking models. The validation set is used to tune the parameters of learning algorithms. The test set is used to evaluate the performance of the learned ranking models. The final results are the average over the five different folds described in the following table:
|Folds|Training set|Validation set|Test set|
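The cross-validation scheme can be sketched as follows. The rotation used here (each fold shifts the parts by one position) is an assumption modeled on the LETOR convention; the authoritative assignment is the one in the table and in the fold folders of the dataset. The function names `rotate_folds` and `average_over_folds` are hypothetical.

```python
PARTS = ["S1", "S2", "S3", "S4", "S5"]


def rotate_folds(parts):
    """Build one split per part: three parts train, one validates, one tests.

    Assumption: a simple rotation in the style of LETOR, not necessarily
    the exact assignment used in this dataset.
    """
    n = len(parts)
    folds = []
    for shift in range(n):
        rotated = parts[shift:] + parts[:shift]
        folds.append({
            "train": rotated[:n - 2],
            "validation": rotated[n - 2],
            "test": rotated[n - 1],
        })
    return folds


def average_over_folds(test_scores):
    """Final result: the mean of the per-fold test scores."""
    return sum(test_scores) / len(test_scores)
```

Under this assumed rotation, the first split trains on S1, S2, and S3, validates on S4, and tests on S5, and every part is used for testing exactly once.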
Consult the complete list of features.
Get the dataset files for research in Learning to Rank for WAIR. The zip file contains the following:
- fold1 to fold5: the folders of the dataset, one per fold, with the raw feature scores.
- fold1.normalized to fold5.normalized: the folders of the dataset, one per fold, with the normalized feature scores.
- qrels.fold1 to qrels.fold5: the qrels of the dataset, partitioned by fold.
A list of mappings between each version id and the corresponding <URL, timestamp> pair can be used to create more features.
Results can be computed with the trec_eval tool used by the TREC community.
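To score a learned model with trec_eval, its output has to be written as a run file in the standard six-column TREC format (query id, the literal `Q0`, document id, rank, score, run tag). The sketch below shows one way to produce those lines. The function name `to_trec_run`, the run tag, and the scores are hypothetical, and using the version identifier as the document id is an assumption that should be checked against the qrels files.

```python
def to_trec_run(qid, scored_versions, run_tag="l2r4wair"):
    """Format the scored versions of one query as TREC run lines.

    scored_versions: list of (version_id, score) pairs from a ranking model.
    run_tag: free-form label for the run (hypothetical default).
    """
    # Rank by descending score; rank numbering starts at 1.
    ranked = sorted(scored_versions, key=lambda pair: pair[1], reverse=True)
    return [
        f"{qid} Q0 {version_id} {rank} {score:.4f} {run_tag}"
        for rank, (version_id, score) in enumerate(ranked, start=1)
    ]
```

Writing the returned lines for all test queries to a single file gives a run that can be evaluated against the matching qrels file of the same fold.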
If you have any questions or suggestions, please contact migcosta (at) gmail.com.