-
Notifications
You must be signed in to change notification settings - Fork 0
Script for ingest stress test (SCP-2649) #135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
bistline
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great, and I don't see any issues with the script itself, but I'm not sure how to handle the data files themselves. I know the bucket is private - that's not the issue - rather just the references themselves, and how should we handle that in a public repository.
| import subprocess | ||
| from datetime import datetime | ||
|
|
||
| DATA = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is great for our purposes, I'm hesitant to hard code these files in a public repo, since I'm not sure all of this data is public. I'm fairly certain the data from the staging studies is not (it may not have ever been published, so I'm not sure where that leaves us).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as these buckets are private (and it seemed to require me to login to access them), I don't think using them for testing is materially different than having the data live on our staging servers. However, I might suggest reaming the files as something generic like "scp-large-test-[x].tsv", to address the concern that we're expressing favortism to certain studies/groups in our source, and then putting a README in the google bucket that describes where the files come from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Files renamed ef57bde
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would assuage my concerns
| import subprocess | ||
| from datetime import datetime | ||
|
|
||
| DATA = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as these buckets are private (and it seemed to require me to login to access them), I don't think using them for testing is materially different than having the data live on our staging servers. However, I might suggest reaming the files as something generic like "scp-large-test-[x].tsv", to address the concern that we're expressing favortism to certain studies/groups in our source, and then putting a README in the google bucket that describes where the files come from
|
|
||
| PREREQUISITES | ||
| set up access token on command line | ||
| ACCESS_TOKEN=`gcloud auth print-access-token` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might add a note, since this is a public repo, that this is an internal-only script for SCP developers since it relies on access to a non-public bucket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note added ef57bde
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
The only blocker I see concerns which genome name is used. See comment below for details.
My other suggestions are lower-priority refinements.
Also, at a higher level, this might frame better as ingest_stress_test.py. We have high-level smoke tests for Ingest Pipeline in single_cell_portal_core, and while this tests manage-study unlike that, the focus of the ticket is on many concurrent ingests (a stress test) and not simple tests of core functionality (a smoke test). I don't feel strongly about this, though.
scripts/ingest_smoke_test.py
Outdated
| test_type = "smoke" | ||
| else: | ||
| test_type = "stress" | ||
| study_names = [test_type + '-' + test_id + '-' + str(i) for i in range(args.number)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
F-string would make this list comprehension a bit easier to digest:
| study_names = [test_type + '-' + test_id + '-' + str(i) for i in range(args.number)] | |
| study_names = [f'{test_type}-{test_id}-{i}' for i in range(args.number)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I support readability (and consistency is good too). Updated to f-string in ef57bde
scripts/ingest_smoke_test.py
Outdated
| # create studies unless it is a dev-run | ||
| if not args.dev_run and not msaction.send_request(study_names[0], "study_exists"): | ||
| for study in study_names: | ||
| print('Requesting creation of', study) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be ideal if we could standardize on a way to concatenate strings in Python. It seems we use f-strings in a plurality of cases in scp-ingest-pipeline, so I'd suggest that for this PR too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked for all the instances of concatenation that I could find to update to f-strings. Will try consistently use f-strings in future.
scripts/ingest_smoke_test.py
Outdated
| '--species', | ||
| "human", | ||
| '--genome', | ||
| "hg19", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the default genome name. In this case, the name is GRCh37.
SCP has support for genome "alias", which could be a place to use UCSC names (e.g. hg19), but that's coincidentally not set in staging. The default genome assembly names (e.g. GRCh37) match Ensembl and NCBI.
| "hg19", | |
| "GRCh37", |
Separately, but related: given "hg19" isn't a name or alias on staging, I wonder why this doesn't fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated genome value to use the default genome name 38d94ae
created SCP-2678 to follow up on what "hg19" didn't cause ingest to fail
scripts/ingest_smoke_test.py
Outdated
| help='server to test against [development,staging,production]', | ||
| default='staging', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in our public CLI:
| help='server to test against [development,staging,production]', | |
| default='staging', | |
| help='server to test against', | |
| choices=['development', 'staging', 'production'], | |
| default='staging', |
Using choices gives built-in validation, and includes the options in --help output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New to this feature, love it! 38d94ae
scripts/ingest_smoke_test.py
Outdated
| parser.add_argument( | ||
| '--data-set', | ||
| default='small', | ||
| help='Select data set for stress test [small, medium, large].', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| help='Select data set for stress test [small, medium, large].', | |
| help='Select data set for stress test.', | |
| choices=['small', 'medium', 'large'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
38d94ae
|
|
||
| parser.add_argument( | ||
| '--token', | ||
| default=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike the public CLI, use of this script always requires a token, so:
| default=None, | |
| required=True, |
This would also ensure an error is thrown if this required field is missing:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! 38d94ae
scripts/ingest_smoke_test.py
Outdated
| set up access token on command line | ||
| ACCESS_TOKEN=`gcloud auth print-access-token` | ||
|
|
||
| EXAMPLE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| EXAMPLE | |
| EXAMPLES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
38d94ae
Codecov Report
@@ Coverage Diff @@
## development #135 +/- ##
================================================
- Coverage 64.26% 52.31% -11.95%
================================================
Files 22 22
Lines 2622 2609 -13
================================================
- Hits 1685 1365 -320
- Misses 937 1244 +307
Continue to review full report at Codecov.
|
ingest_smoke_test.pywill create a study and upload a dense matrix, cluster and metadata file to the study using manage-study (an installed executable from the single-cell-portal python package). By default this is run against the SCP staging server.Setup:
ACCESS_TOKEN=`gcloud auth print-access-token`Run a smoke test:
This supports SCP-2649.