Skip to content

Conversation

@jlchang
Copy link
Contributor

@jlchang jlchang commented Aug 18, 2020

ingest_smoke_test.py will create a study and upload a dense matrix, cluster and metadata file to the study using manage-study (an installed executable from the single-cell-portal python package). By default this is run against the SCP staging server.

Setup:

  • "smoke" virtualenv activated with single-cell-portal installed (see setup instructions)
  • VPN (to access staging environment)
  • have correct Google identity (non-SA account)
  • Access token ACCESS_TOKEN=`gcloud auth print-access-token`

Run a smoke test:

python ingest_smoke_test.py --token=$ACCESS_TOKEN

This supports SCP-2649.

Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great, and I don't see any issues with the script itself, but I'm not sure how to handle the data files themselves. I know the bucket is private - that's not the issue - rather just the references themselves, and how should we handle that in a public repository.

import subprocess
from datetime import datetime

DATA = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is great for our purposes, I'm hesitant to hard code these files in a public repo, since I'm not sure all of this data is public. I'm fairly certain the data from the staging studies is not (it may not have ever been published, so I'm not sure where that leaves us).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as these buckets are private (and it seemed to require me to login to access them), I don't think using them for testing is materially different than having the data live on our staging servers. However, I might suggest reaming the files as something generic like "scp-large-test-[x].tsv", to address the concern that we're expressing favortism to certain studies/groups in our source, and then putting a README in the google bucket that describes where the files come from

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Files renamed ef57bde

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would assuage my concerns

import subprocess
from datetime import datetime

DATA = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as these buckets are private (and it seemed to require me to login to access them), I don't think using them for testing is materially different than having the data live on our staging servers. However, I might suggest reaming the files as something generic like "scp-large-test-[x].tsv", to address the concern that we're expressing favortism to certain studies/groups in our source, and then putting a README in the google bucket that describes where the files come from


PREREQUISITES
set up access token on command line
ACCESS_TOKEN=`gcloud auth print-access-token`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might add a note, since this is a public repo, that this is an internal-only script for SCP developers since it relies on access to a non-public bucket

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note added ef57bde

Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

The only blocker I see concerns which genome name is used. See comment below for details.

My other suggestions are lower-priority refinements.

Also, at a higher level, this might frame better as ingest_stress_test.py. We have high-level smoke tests for Ingest Pipeline in single_cell_portal_core, and while this tests manage-study unlike that, the focus of the ticket is on many concurrent ingests (a stress test) and not simple tests of core functionality (a smoke test). I don't feel strongly about this, though.

test_type = "smoke"
else:
test_type = "stress"
study_names = [test_type + '-' + test_id + '-' + str(i) for i in range(args.number)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F-string would make this list comprehension a bit easier to digest:

Suggested change
study_names = [test_type + '-' + test_id + '-' + str(i) for i in range(args.number)]
study_names = [f'{test_type}-{test_id}-{i}' for i in range(args.number)]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I support readability (and consistency is good too). Updated to f-string in ef57bde

# create studies unless it is a dev-run
if not args.dev_run and not msaction.send_request(study_names[0], "study_exists"):
for study in study_names:
print('Requesting creation of', study)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be ideal if we could standardize on a way to concatenate strings in Python. It seems we use f-strings in a plurality of cases in scp-ingest-pipeline, so I'd suggest that for this PR too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked for all the instances of concatenation that I could find to update to f-strings. Will try consistently use f-strings in future.

'--species',
"human",
'--genome',
"hg19",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the default genome name. In this case, the name is GRCh37.

SCP has support for genome "alias", which could be a place to use UCSC names (e.g. hg19), but that's coincidentally not set in staging. The default genome assembly names (e.g. GRCh37) match Ensembl and NCBI.

Suggested change
"hg19",
"GRCh37",

Separately, but related: given "hg19" isn't a name or alias on staging, I wonder why this doesn't fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated genome value to use the default genome name 38d94ae
created SCP-2678 to follow up on what "hg19" didn't cause ingest to fail

Comment on lines 67 to 68
help='server to test against [development,staging,production]',
default='staging',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in our public CLI:

Suggested change
help='server to test against [development,staging,production]',
default='staging',
help='server to test against',
choices=['development', 'staging', 'production'],
default='staging',

Using choices gives built-in validation, and includes the options in --help output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New to this feature, love it! 38d94ae

parser.add_argument(
'--data-set',
default='small',
help='Select data set for stress test [small, medium, large].',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
help='Select data set for stress test [small, medium, large].',
help='Select data set for stress test.',
choices=['small', 'medium', 'large']

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

38d94ae


parser.add_argument(
'--token',
default=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike the public CLI, use of this script always requires a token, so:

Suggested change
default=None,
required=True,

This would also ensure an error is thrown if this required field is missing:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! 38d94ae

set up access token on command line
ACCESS_TOKEN=`gcloud auth print-access-token`

EXAMPLE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
EXAMPLE
EXAMPLES

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

38d94ae

@codecov
Copy link

codecov bot commented Aug 18, 2020

Codecov Report

Merging #135 into development will decrease coverage by 11.94%.
The diff coverage is 85.71%.

Impacted file tree graph

@@               Coverage Diff                @@
##           development     #135       +/-   ##
================================================
- Coverage        64.26%   52.31%   -11.95%     
================================================
  Files               22       22               
  Lines             2622     2609       -13     
================================================
- Hits              1685     1365      -320     
- Misses             937     1244      +307     
Impacted Files Coverage Δ
ingest/expression_files/expression_files.py 81.33% <ø> (-0.25%) ⬇️
ingest/mongo_connection.py 53.84% <0.00%> (+3.84%) ⬆️
ingest/monitor.py 77.04% <ø> (ø)
ingest/expression_files/dense_ingestor.py 96.09% <90.00%> (+0.98%) ⬆️
ingest/genomes/utils.py 0.00% <0.00%> (-50.00%) ⬇️
ingest/make_toy_data.py 0.00% <0.00%> (-27.15%) ⬇️
ingest/expression_files/__init__.py 100.00% <0.00%> (ø)
ingest/genomes/genome_annotation_metadata.py
ingest/genomes/genome_annotations.py
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3c814c4...e83b1c1. Read the comment docs.

@jlchang jlchang requested review from devonbush and eweitz August 19, 2020 16:37
@jlchang jlchang merged commit 6ed4a64 into development Aug 20, 2020
@jlchang jlchang deleted the jlc_smoke_test branch August 20, 2020 11:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants