Script for ingest stress test (SCP-2649) #135

jlchang · 2020-08-18T18:30:51Z

ingest_smoke_test.py will create a study and upload a dense matrix, cluster and metadata file to the study using manage-study (an installed executable from the single-cell-portal python package). By default this is run against the SCP staging server.

Setup:

"smoke" virtualenv activated with single-cell-portal installed (see setup instructions)
VPN (to access staging environment)
have correct Google identity (non-SA account)
Access token ACCESS_TOKEN=`gcloud auth print-access-token`

Run a smoke test:

python ingest_smoke_test.py --token=$ACCESS_TOKEN

This supports SCP-2649.

bistline

This is really great, and I don't see any issues with the script itself, but I'm not sure how to handle the data files themselves. I know the bucket is private - that's not the issue - rather just the references themselves, and how should we handle that in a public repository.

bistline · 2020-08-18T19:36:14Z

scripts/ingest_smoke_test.py

+import subprocess
+from datetime import datetime
+
+DATA = {


While this is great for our purposes, I'm hesitant to hard code these files in a public repo, since I'm not sure all of this data is public. I'm fairly certain the data from the staging studies is not (it may not have ever been published, so I'm not sure where that leaves us).

As long as these buckets are private (and it seemed to require me to login to access them), I don't think using them for testing is materially different than having the data live on our staging servers. However, I might suggest reaming the files as something generic like "scp-large-test-[x].tsv", to address the concern that we're expressing favortism to certain studies/groups in our source, and then putting a README in the google bucket that describes where the files come from

Agreed. Files renamed ef57bde

That would assuage my concerns

devonbush · 2020-08-18T20:17:55Z

scripts/ingest_smoke_test.py

+import subprocess
+from datetime import datetime
+
+DATA = {


As long as these buckets are private (and it seemed to require me to login to access them), I don't think using them for testing is materially different than having the data live on our staging servers. However, I might suggest reaming the files as something generic like "scp-large-test-[x].tsv", to address the concern that we're expressing favortism to certain studies/groups in our source, and then putting a README in the google bucket that describes where the files come from

devonbush · 2020-08-18T20:23:07Z

scripts/ingest_smoke_test.py

+
+PREREQUISITES
+set up access token on command line
+ACCESS_TOKEN=`gcloud auth print-access-token`


I might add a note, since this is a public repo, that this is an internal-only script for SCP developers since it relies on access to a non-public bucket

note added ef57bde

eweitz

Looks good!

The only blocker I see concerns which genome name is used. See comment below for details.

My other suggestions are lower-priority refinements.

Also, at a higher level, this might frame better as ingest_stress_test.py. We have high-level smoke tests for Ingest Pipeline in single_cell_portal_core, and while this tests manage-study unlike that, the focus of the ticket is on many concurrent ingests (a stress test) and not simple tests of core functionality (a smoke test). I don't feel strongly about this, though.

eweitz · 2020-08-18T19:28:08Z

scripts/ingest_smoke_test.py

+        test_type = "smoke"
+    else:
+        test_type = "stress"
+    study_names = [test_type + '-' + test_id + '-' + str(i) for i in range(args.number)]


F-string would make this list comprehension a bit easier to digest:

Suggested change

study_names = [test_type + '-' + test_id + '-' + str(i) for i in range(args.number)]

study_names = [f'{test_type}-{test_id}-{i}' for i in range(args.number)]

Makes sense, I support readability (and consistency is good too). Updated to f-string in ef57bde

eweitz · 2020-08-18T19:36:34Z

scripts/ingest_smoke_test.py

+    # create studies unless it is a dev-run
+    if not args.dev_run and not msaction.send_request(study_names[0], "study_exists"):
+        for study in study_names:
+            print('Requesting creation of', study)


It'd be ideal if we could standardize on a way to concatenate strings in Python. It seems we use f-strings in a plurality of cases in scp-ingest-pipeline, so I'd suggest that for this PR too.

Looked for all the instances of concatenation that I could find to update to f-strings. Will try consistently use f-strings in future.

eweitz · 2020-08-18T19:56:46Z

scripts/ingest_smoke_test.py

+                '--species',
+                "human",
+                '--genome',
+                "hg19",


We should use the default genome name. In this case, the name is GRCh37.

SCP has support for genome "alias", which could be a place to use UCSC names (e.g. hg19), but that's coincidentally not set in staging. The default genome assembly names (e.g. GRCh37) match Ensembl and NCBI.

Suggested change

"hg19",

"GRCh37",

Separately, but related: given "hg19" isn't a name or alias on staging, I wonder why this doesn't fail.

updated genome value to use the default genome name 38d94ae
created SCP-2678 to follow up on what "hg19" didn't cause ingest to fail

eweitz · 2020-08-18T20:06:30Z

scripts/ingest_smoke_test.py

+        help='server to test against [development,staging,production]',
+        default='staging',


As in our public CLI:

Suggested change

help='server to test against [development,staging,production]',

default='staging',

help='server to test against',

choices=['development', 'staging', 'production'],

default='staging',

Using choices gives built-in validation, and includes the options in --help output.

New to this feature, love it! 38d94ae

eweitz · 2020-08-18T20:07:37Z

scripts/ingest_smoke_test.py

+    parser.add_argument(
+        '--data-set',
+        default='small',
+        help='Select data set for stress test [small, medium, large].',


Suggested change

help='Select data set for stress test [small, medium, large].',

help='Select data set for stress test.',

choices=['small', 'medium', 'large']

eweitz · 2020-08-18T20:09:38Z

scripts/ingest_smoke_test.py

+
+    parser.add_argument(
+        '--token',
+        default=None,


Unlike the public CLI, use of this script always requires a token, so:

Suggested change

default=None,

required=True,

This would also ensure an error is thrown if this required field is missing:

Good catch! 38d94ae

eweitz · 2020-08-18T20:10:19Z

scripts/ingest_smoke_test.py

+set up access token on command line
+ACCESS_TOKEN=`gcloud auth print-access-token`
+
+EXAMPLE


Suggested change

EXAMPLE

EXAMPLES

codecov · 2020-08-18T23:45:58Z

Codecov Report

Merging #135 into development will decrease coverage by 11.94%.
The diff coverage is 85.71%.

@@               Coverage Diff                @@
##           development     #135       +/-   ##
================================================
- Coverage        64.26%   52.31%   -11.95%     
================================================
  Files               22       22               
  Lines             2622     2609       -13     
================================================
- Hits              1685     1365      -320     
- Misses             937     1244      +307

Impacted Files	Coverage Δ
ingest/expression_files/expression_files.py	`81.33% <ø> (-0.25%)`	⬇️
ingest/mongo_connection.py	`53.84% <0.00%> (+3.84%)`	⬆️
ingest/monitor.py	`77.04% <ø> (ø)`
ingest/expression_files/dense_ingestor.py	`96.09% <90.00%> (+0.98%)`	⬆️
ingest/genomes/utils.py	`0.00% <0.00%> (-50.00%)`	⬇️
ingest/make_toy_data.py	`0.00% <0.00%> (-27.15%)`	⬇️
ingest/expression_files/__init__.py	`100.00% <0.00%> (ø)`
ingest/genomes/genome_annotation_metadata.py
ingest/genomes/genome_annotations.py
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3c814c4...e83b1c1. Read the comment docs.

jlchang added 3 commits August 18, 2020 10:12

add stress test script for use in smoke tests

a8f521d

black formatting changes

70cadac

add error handling for 403 return code

53df2ae

jlchang requested review from bistline, devonbush, eweitz and knapii-developments August 18, 2020 18:30

bistline requested changes Aug 18, 2020

View reviewed changes

devonbush suggested changes Aug 18, 2020

View reviewed changes

eweitz requested changes Aug 18, 2020

View reviewed changes

update default study for --dev-run

3c3046a

update data paths per PR comments

ef57bde

bistline approved these changes Aug 19, 2020

View reviewed changes

jlchang added 2 commits August 19, 2020 10:48

PR comment updates

38d94ae

correct typos in file paths

063dd48

jlchang requested review from devonbush and eweitz August 19, 2020 16:37

eweitz approved these changes Aug 19, 2020

View reviewed changes

devonbush approved these changes Aug 20, 2020

View reviewed changes

Merge branch 'development' into jlc_smoke_test

e83b1c1

jlchang merged commit 6ed4a64 into development Aug 20, 2020

jlchang deleted the jlc_smoke_test branch August 20, 2020 11:15

	study_names = [test_type + '-' + test_id + '-' + str(i) for i in range(args.number)]
	study_names = [f'{test_type}-{test_id}-{i}' for i in range(args.number)]

		help='server to test against [development,staging,production]',
		default='staging',

	help='Select data set for stress test [small, medium, large].',
	help='Select data set for stress test.',
	choices=['small', 'medium', 'large']

Script for ingest stress test (SCP-2649) #135

Script for ingest stress test (SCP-2649) #135

Uh oh!

Conversation

jlchang commented Aug 18, 2020 • edited by eweitz Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bistline left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eweitz left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Aug 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

jlchang commented Aug 18, 2020 •

edited by eweitz

Loading

eweitz left a comment •

edited

Loading

codecov bot commented Aug 18, 2020 •

edited

Loading