Add msmarco v2 document segmentation script #706

jacklin64 · 2021-07-16T02:13:29Z

No description provided.

send pr later

sync

lintool · 2021-07-16T02:17:26Z

scripts/msmarco_v2_doc_segment.py

@@ -0,0 +1,90 @@
+import argparse


Should we move this into scripts/msmarco_v2 and call it something like segment_docs.py?

I like script names to begin with verbs...

lintool · 2021-07-16T02:17:39Z

scripts/msmarco_v2_doc_segment.py

@@ -0,0 +1,90 @@
+import argparse


Please add our usual boilerplate header.

lintool · 2021-07-16T02:18:25Z

scripts/msmarco_v2_doc_segment.py

+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(
+        description='Concatenate MS MARCO original docs with predicted queries')
+    parser.add_argument('--original_docs_path', required=True, help='MS MARCO .tsv corpus file.')


How about just --input and --output?

lintool · 2021-07-16T02:18:41Z

scripts/msmarco_v2_doc_segment.py

+        description='Concatenate MS MARCO original docs with predicted queries')
+    parser.add_argument('--original_docs_path', required=True, help='MS MARCO .tsv corpus file.')
+    parser.add_argument('--output_docs_path', required=True, help='Output file in the anserini jsonl format.')
+    parser.add_argument('--max_length', default=10)


--max-length? Hyphen instead of underscore.

OK. I'm not sure whether we should standardize on this rule, though. I found that other scripts use underscores (https://github.com/castorini/pyserini/blob/master/scripts/entity_linking.py)
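For reference on the hyphen-versus-underscore question: argparse converts hyphens in a long option name to underscores when it derives the attribute name, so `--max-length` is still read as `args.max_length`. The sketch below is illustrative only and is not the merged script. The flag names are borrowed from this thread, the `type=int` arguments and the help strings are assumptions, and the meaning given for `--stride` is a guess that the thread does not confirm.

```python
import argparse


def build_parser():
    # Illustrative parser only; not the parser from the merged script.
    parser = argparse.ArgumentParser(description='Hyphenated flag demo.')
    parser.add_argument('--input', required=True, help='Corpus path.')
    parser.add_argument('--output', required=True, help='Output path.')
    # type=int is an assumption: without it, a value passed on the command
    # line is kept as a string, while the default stays an int.
    parser.add_argument('--max-length', type=int, default=10,
                        help='Number of sentences in each segment.')
    # Assumed meaning of stride; the thread does not define it.
    parser.add_argument('--stride', type=int, default=5,
                        help='Sentences to advance between segments (assumed).')
    return parser


# Parse a fixed argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ['--input', 'corpus', '--output', 'out.jsonl', '--max-length', '8'])
print(args.max_length, args.stride)
```

Either spelling on the command line leads to the same attribute access in the code, so the choice is mostly about consistency across scripts.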

lintool · 2021-07-16T02:18:47Z

scripts/msmarco_v2_doc_segment.py

+    parser.add_argument('--output_docs_path', required=True, help='Output file in the anserini jsonl format.')
+    parser.add_argument('--max_length', default=10)
+    parser.add_argument('--stride', default=5)
+    parser.add_argument('--num_workers', default=1, type=int)


lintool · 2021-07-16T02:19:04Z

@MXueguang @crystina-z take a look also?

MXueguang · 2021-07-16T02:25:01Z

The rest looks good to me.

crystina-z · 2021-07-16T02:50:21Z

scripts/msmarco_v2/segment_docs.py

+        description='Concatenate MS MARCO original docs with predicted queries')
+    parser.add_argument('--input', required=True, help='MS MARCO V2 corpus path.')
+    parser.add_argument('--output', required=True, help='Output file path with json format.')
+    parser.add_argument('--max_length', default=10)


Let's add a "help" string to max_length so that people later know directly that this means the number of sentences in each segment?

The rest looks good to me.
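To make the "number of sentences in each segment" reading concrete, here is a minimal sliding-window sketch. It assumes the document has already been split into sentences and that `stride` is the number of sentences to advance between windows; neither detail is confirmed in this thread. The function name `segment_sentences` is hypothetical, and the merged script may differ in sentence splitting, overlap handling, and output format.

```python
def segment_sentences(sentences, max_length=10, stride=5):
    """Group a list of sentences into overlapping windows.

    Hypothetical helper: each window holds at most max_length sentences,
    and consecutive windows start stride sentences apart.
    """
    if max_length <= 0 or stride <= 0:
        raise ValueError('max_length and stride must be positive')
    segments = []
    for start in range(0, len(sentences), stride):
        segments.append(sentences[start:start + max_length])
        # Stop once a window has reached the end of the document, so that
        # no trailing window is fully contained in the previous one.
        if start + max_length >= len(sentences):
            break
    return segments


# Twelve placeholder sentences produce two overlapping segments.
demo = [f's{i}' for i in range(12)]
for segment in segment_sentences(demo):
    print(len(segment), segment[0], segment[-1])
```

With the defaults shown in the diff (10 and 5), consecutive segments in this sketch share five sentences.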

Lin Jack and others added 29 commits May 26, 2021 13:48

add tasb msmarco dev subset reproduce

5a66347

resolve version comment

066c3d5

manually resolve conflict

397280e

initialize tct-colbert-v2 doc

8b74b0a

fix alpha for doct5query fusion

0ef7d58

add baseline

044eabb

Merge branch 'master' of github.com:jacklin64/pyserini

437ad41

add tct_colbert-v2 integration test

9f56a36

add distilbert_tasb integration

2457f6d

fix typo

5d13d11

add baseline exp

0be15ea

Merge branch 'master' of github.com:jacklin64/pyserini

4bd4231

rearrange

9f3729f

rearrange tct-v2 exp order

2936de6

resolve conflict

555c87b

fix function name

f6180a0

Delete test_distilbert_tasb.py

853cba9

send pr later

Delete test_tct_colbert-v2.py

cbd1c53

send pr later

clarify the results in the table

2b24168

add tasb and tct-v2 integration

719de11

Merge branch 'castorini:master' into master

7737c26

Merge branch 'castorini:master' into master

557dd2b

add tct doc encoding

975e4b6

resolve conflict

e75aa36

Merge branch 'castorini-master'

b14f182

sync to master

53ee49c

Merge pull request #3 from castorini/master

df2659b

sync

add msmarco v2 document segmentation

e3de279

Merge branch 'master' of github.com:jacklin64/pyserini

196ee31

lintool reviewed Jul 16, 2021

lintool reviewed Jul 16, 2021

Lin Jack added 2 commits July 15, 2021 22:40

rename and add boilerplate header.

df0ac90

delete redundant file

5b2cccb

crystina-z reviewed Jul 16, 2021

fix description and help

ba5a40b

crystina-z approved these changes Jul 16, 2021

lintool approved these changes Jul 16, 2021

lintool merged commit a6b6545 into castorini:master Jul 16, 2021

MXueguang pushed a commit to MXueguang/pyserini that referenced this pull request Nov 5, 2021

Add msmarco doc corpus v2 document segmentation script (castorini#706)

0b4d64b
