-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add msmarco v2 document segmentation script #706
Conversation
send pr later
send pr later
scripts/msmarco_v2_doc_segment.py
Outdated
@@ -0,0 +1,90 @@ | |||
import argparse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we move into scripts/msmarco_v2
and call something like segment_docs.py
?
I like script to begin with verbs...
scripts/msmarco_v2_doc_segment.py
Outdated
@@ -0,0 +1,90 @@ | |||
import argparse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add our usual boilerplate header.
scripts/msmarco_v2_doc_segment.py
Outdated
if __name__ == '__main__': | ||
parser = argparse.ArgumentParser( | ||
description='Concatenate MS MARCO original docs with predicted queries') | ||
parser.add_argument('--original_docs_path', required=True, help='MS MARCO .tsv corpus file.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about just --input
and --output
?
scripts/msmarco_v2_doc_segment.py
Outdated
description='Concatenate MS MARCO original docs with predicted queries') | ||
parser.add_argument('--original_docs_path', required=True, help='MS MARCO .tsv corpus file.') | ||
parser.add_argument('--output_docs_path', required=True, help='Output file in the anserini jsonl format.') | ||
parser.add_argument('--max_length', default=10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--max-length
? hyphen instead of underscore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Not sure shall we align this rule ? I found other scripts use underscore (https://github.com/castorini/pyserini/blob/master/scripts/entity_linking.py)
scripts/msmarco_v2_doc_segment.py
Outdated
parser.add_argument('--output_docs_path', required=True, help='Output file in the anserini jsonl format.') | ||
parser.add_argument('--max_length', default=10) | ||
parser.add_argument('--stride', default=5) | ||
parser.add_argument('--num_workers', default=1, type=int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
@MXueguang @crystina-z take a look also? |
rest of the things looks good to me |
scripts/msmarco_v2/segment_docs.py
Outdated
description='Concatenate MS MARCO original docs with predicted queries') | ||
parser.add_argument('--input', required=True, help='MS MARCO V2 corpus path.') | ||
parser.add_argument('--output', required=True, help='Output file path with json format.') | ||
parser.add_argument('--max_length', default=10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a "help" to the max_length
so that lates ppl know directly this means the number of sentences in each segment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the rest looks good to me.
No description provided.