
Feature/msmarco psg #117

Merged
merged 138 commits into from
Aug 26, 2021

Conversation

crystina-z
Collaborator

@crystina-z crystina-z commented Nov 21, 2020

Just sending this PR to better track the progress :) Don't worry about it for now.

Running MS MARCO passage with reranking restricted to the top-100 now looks correct.

Confusing stuff to solve

  • So far none of the runs reranking the top-1k could save the checkpoint, even though training seems to finish in some cases (the checkpoint saves fine when reranking the top-100). I suspected this was due to Slurm's running-time limit, but every time I queue up a 4-day run some other odd bug pops up, so after many days this is still unconfirmed...
  • The most recent run throws the error below, even though sampler.generate_example and sampler.get_preds_in_trec_format seem to align with each other, and the dev records were prepared within this run, so stale cached data is not the cause. Still checking what is happening here. (Again, this does not happen when reranking the top-100.)
Traceback (most recent call last):
  File "run.py", line 108, in <module>
    task_entry_function()
  File "/home/czhang/aaai2021/tmp/capreolus/capreolus/task/rerank.py", line 54, in train
    return self.rerank_run(best_search_run, self.get_results_path())
  File "/home/czhang/aaai2021/tmp/capreolus/capreolus/task/rerank.py", line 101, in rerank_run
    self.benchmark.relevance_level,
  File "/home/czhang/aaai2021/tmp/capreolus/capreolus/trainer/tensorflow.py", line 205, in train
    trec_preds = self.get_preds_in_trec_format(dev_predictions, dev_data)
  File "/home/czhang/aaai2021/tmp/capreolus/capreolus/trainer/tensorflow.py", line 429, in get_preds_in_trec_format
    pred_dict[qid][docid] = predictions[i].numpy().astype(np.float16).item()
IndexError: list index out of range
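A hedged guess at the IndexError above: dev_data yields more (qid, docid) pairs than there are predictions, e.g. if the batched dev pipeline pads or drops the final partial batch. A minimal defensive sketch of the mapping step (all names here are illustrative, not Capreolus APIs):

```python
def to_trec_preds(predictions, qid_docid_pairs):
    """Map a flat list of scores onto (qid, docid) pairs, failing loudly
    on the length mismatch that would otherwise surface as IndexError."""
    if len(predictions) != len(qid_docid_pairs):
        raise ValueError(
            f"{len(predictions)} predictions vs {len(qid_docid_pairs)} pairs; "
            "check batching (padding / drop_remainder) in the dev pipeline"
        )
    pred_dict = {}
    for score, (qid, docid) in zip(predictions, qid_docid_pairs):
        pred_dict.setdefault(qid, {})[docid] = float(score)
    return pred_dict
```

Asserting the lengths up front would at least turn the opaque IndexError into a message pointing at the batching.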

Features/Support to add

  • Right now I have to hack the code so the reranker evaluates only on dev qids; otherwise the evaluation there (on millions of training queries!) would block the process forever. This is probably already solved in the feature/fit branch, where only test qids are evaluated?
  • Use sqlite to prepare and store the docid2passage mapping, so this can run on RAM-poor machines.
  • (Maybe less urgent) handle the MS MARCO downloading using allenai/ir_datasets.
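The sqlite idea could look roughly like this (a sketch with hypothetical names, not existing Capreolus code): build the table once, e.g. while streaming collection.tsv line by line, then look passages up on demand instead of holding the whole collection in RAM.

```python
import sqlite3

def build_passage_db(db_path, passages):
    """passages: iterable of (docid, passage_text) pairs."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS docid2passage "
        "(docid TEXT PRIMARY KEY, passage TEXT)"
    )
    # executemany consumes the iterable lazily, so memory stays flat
    con.executemany("INSERT OR REPLACE INTO docid2passage VALUES (?, ?)", passages)
    con.commit()
    return con

def get_passage(con, docid):
    row = con.execute(
        "SELECT passage FROM docid2passage WHERE docid = ?", (docid,)
    ).fetchone()
    return row[0] if row else None
```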

Sidenote (about the running time of some operation)

  1. evaluation (on only dev set, not including the training queries)
    1.1 pytrec_eval: 40 sec
    1.2 (trec_eval: 4 secs)
  2. Loading the whole runfile (including training data): ~310s (Searcher.load_trec_run())
  3. Preparing the passage: ?
  4. Prepare training runs: 3.5 hours
  5. tfrecords (train + dev): 2 hours
  6. training:
    6.1 (3k iteration) < 2 hours
    6.2 (30k iter) 11~12 hours
  7. inference
    7.1 (top100): several hours
    7.2 (top1k) > 1.5 days
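For reference on the evaluation timings above: the official MS MARCO passage dev metric is MRR@10, which is cheap to compute directly from a run dict once qrels are filtered to dev qids. A dependency-free sketch (function name and data shapes are illustrative):

```python
def mrr_at_10(run, qrels):
    """run: qid -> {docid: score}; qrels: qid -> set of relevant docids."""
    total = 0.0
    for qid, doc_scores in run.items():
        # rank docs for this query by descending score, keep the top 10
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)[:10]
        for rank, docid in enumerate(ranked, start=1):
            if docid in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(run) if run else 0.0
```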

@lgtm-com

lgtm-com bot commented Nov 21, 2020

This pull request introduces 3 alerts when merging 649804e into bf50423 - view on LGTM.com

new alerts:

  • 3 for Unused import

Crystina333 and others added 22 commits November 25, 2020 14:48
… evaluator and msmarco benchmark accordingly, so that a benchmark can specify by itself whether it wants to include train_qids in the dev set for non-neural-net algorithms, where no training is needed
…s in qrels should be considered

(2) update all the places that call eval_runs and _eval_runs, ensuring the input qrels are filtered
(3) change the API of trainer.train: remove qrels and relevance_level and instead pass an eval_fn(runs) to train(), which handles the evaluation logic completely
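The eval_fn(runs) change in (3) could look roughly like this (a sketch under assumed names, not the actual branch code): the task builds a closure that owns qrels filtering and metric choice, and the trainer just calls it with its current runs.

```python
def make_eval_fn(qrels, dev_qids, metric_fn):
    """Bundle qrels filtering + metric choice into one callable
    that trainer.train can invoke without knowing about qrels."""
    filtered_qrels = {qid: qrels[qid] for qid in dev_qids if qid in qrels}

    def eval_fn(runs):
        # evaluate only on dev qids, never on the millions of train queries
        dev_runs = {qid: docs for qid, docs in runs.items() if qid in filtered_qrels}
        return metric_fn(dev_runs, filtered_qrels)

    return eval_fn
```

The trainer then calls eval_fn(runs) wherever it previously evaluated against qrels and relevance_level directly.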
@lgtm-com

lgtm-com bot commented Aug 8, 2021

This pull request introduces 2 alerts when merging 7664df9 into 18e31b7 - view on LGTM.com

new alerts:

  • 2 for Unused import

@crystina-z crystina-z changed the title [WIP] Feature/msmarco psg Feature/msmarco psg Aug 22, 2021
@lgtm-com

lgtm-com bot commented Aug 22, 2021

This pull request introduces 1 alert when merging 569276e into 1767d5a - view on LGTM.com

new alerts:

  • 1 for Unused import

@lgtm-com

lgtm-com bot commented Aug 22, 2021

This pull request introduces 1 alert when merging 1bbf0f2 into 1767d5a - view on LGTM.com

new alerts:

  • 1 for Unused import

Member

@andrewyates andrewyates left a comment


looks good! I left some minor questions/comments

capreolus/searcher/special.py (outdated, resolved)
capreolus/searcher/special.py (resolved)
capreolus/searcher/special.py (resolved)
capreolus/evaluator.py (resolved)
Creation Date : 06/12/2018
Last Modified : 1/21/2019
Authors : Daniel Campos <dacamp@microsoft.com>, Rutger van Haasteren <ruvanh@microsoft.com>
"""
crystina-z (Collaborator, Author) replied:

actually do we just keep the headings as they are?

@lgtm-com

lgtm-com bot commented Aug 25, 2021

This pull request introduces 1 alert when merging bbe134e into 1767d5a - view on LGTM.com

new alerts:

  • 1 for Unused import

@andrewyates andrewyates merged commit 3521171 into master Aug 26, 2021
6 participants