Add msmarco v2 document segmentation script #706

Merged
merged 32 commits into from
Jul 16, 2021

Conversation

jacklin64
Member

No description provided.

@@ -0,0 +1,90 @@
import argparse
Member

Should we move this into scripts/msmarco_v2 and call it something like segment_docs.py?

I like script names to begin with verbs...

@@ -0,0 +1,90 @@
import argparse
Member

Please add our usual boilerplate header.

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Concatenate MS MARCO original docs with predicted queries')
    parser.add_argument('--original_docs_path', required=True, help='MS MARCO .tsv corpus file.')
Member

how about just --input and --output?

        description='Concatenate MS MARCO original docs with predicted queries')
    parser.add_argument('--original_docs_path', required=True, help='MS MARCO .tsv corpus file.')
    parser.add_argument('--output_docs_path', required=True, help='Output file in the anserini jsonl format.')
    parser.add_argument('--max_length', default=10)
Member

--max-length? hyphen instead of underscore.

Member Author

OK. I'm not sure whether we should standardize on this rule, though? I found that other scripts use underscores (https://github.com/castorini/pyserini/blob/master/scripts/entity_linking.py).
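
For reference on the hyphen-versus-underscore question, here is a minimal sketch of how argparse treats the two spellings. The parser below is hypothetical and only reuses the flag names from this thread; it is not the script under review. Either spelling ends up as an underscore attribute on the parsed namespace, so the choice only affects what users type on the command line.

```python
import argparse

# Hypothetical parser for illustration; only the flag names come from this thread.
parser = argparse.ArgumentParser(description='Hyphen vs. underscore demo')
parser.add_argument('--max-length', type=int, default=10)   # hyphenated spelling
parser.add_argument('--num_workers', type=int, default=1)   # underscore spelling

args = parser.parse_args(['--max-length', '8'])

# argparse derives the attribute name by replacing '-' with '_',
# so both flags are read with underscores in code.
print(args.max_length, args.num_workers)
```

Code that reads `args.max_length` would therefore keep working if the flag were renamed from `--max_length` to `--max-length`; only existing command lines would need updating.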

    parser.add_argument('--output_docs_path', required=True, help='Output file in the anserini jsonl format.')
    parser.add_argument('--max_length', default=10)
    parser.add_argument('--stride', default=5)
    parser.add_argument('--num_workers', default=1, type=int)
Member

same

@lintool
Member

lintool commented Jul 16, 2021

@MXueguang @crystina-z can you take a look as well?

@MXueguang
Member

The rest looks good to me.

        description='Concatenate MS MARCO original docs with predicted queries')
    parser.add_argument('--input', required=True, help='MS MARCO V2 corpus path.')
    parser.add_argument('--output', required=True, help='Output file path with json format.')
    parser.add_argument('--max_length', default=10)
Member

Let's add a "help" string to max_length so that people reading this later know directly that it means the number of sentences in each segment?
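
Something along these lines might work. This is only a sketch: the `--input`, `--output`, `--max_length`, and `--stride` flags and their defaults come from the diff, but the help text for `--stride`, the added `type=int`, and the `segment` helper are assumptions for illustration, not the script's actual implementation. Without `type=int`, a value passed on the command line arrives as a string, while the default stays an int.

```python
import argparse


def build_parser():
    """Build a parser with help strings (hypothetical helper, not from the PR)."""
    parser = argparse.ArgumentParser(
        description='Segment MS MARCO V2 documents (illustrative sketch)')
    parser.add_argument('--input', required=True, help='MS MARCO V2 corpus path.')
    parser.add_argument('--output', required=True, help='Output file path with json format.')
    parser.add_argument('--max_length', type=int, default=10,
                        help='Number of sentences in each segment.')
    # Assumed meaning of stride: how many sentences the window advances each step.
    parser.add_argument('--stride', type=int, default=5,
                        help='Number of sentences to advance between segments.')
    return parser


def segment(sentences, max_length, stride):
    """Split a list of sentences into overlapping windows (assumed behavior)."""
    segments = []
    for start in range(0, len(sentences), stride):
        segments.append(sentences[start:start + max_length])
        # Stop once the current window already reaches the end of the document.
        if start + max_length >= len(sentences):
            break
    return segments


if __name__ == '__main__':
    # File names here are placeholders for the example only.
    args = build_parser().parse_args(['--input', 'in.jsonl', '--output', 'out.jsonl'])
    sentences = ['sentence %d' % i for i in range(12)]
    for seg in segment(sentences, args.max_length, args.stride):
        print(len(seg), seg[0], '...', seg[-1])
```

With the defaults, a 12-sentence document would yield two overlapping segments under this assumed windowing: sentences 0 to 9 and sentences 5 to 11.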

Member

the rest looks good to me.

@lintool lintool merged commit a6b6545 into castorini:master Jul 16, 2021
MXueguang pushed a commit to MXueguang/pyserini that referenced this pull request Nov 5, 2021
5 participants