
Endpoint/Models for "samples/examples" #2

Closed
cgreene opened this issue Jul 8, 2016 · 7 comments

Comments

@cgreene
Member

cgreene commented Jul 8, 2016

Researchers will select which samples they want to include in the analysis — in machine learning terms, which examples are relevant to their question. These samples will have various associated metadata. The GDC Data Portal [ https://gdc-portal.nci.nih.gov/search/s ] has a very nice interface for these metadata. Essentially, the facets on the left for "cases" are the same ones that we would expect to be relevant here.

@gwaybio
Member

gwaybio commented Jul 8, 2016

The GDC portal is a good example of a friendly user interface and a good starting point for describing what a sample selector for this type of data should be able to do. For our purposes, however, I think our interface will need to communicate with the gene selector. We don't necessarily want to let a user select a tissue that is likely to drive poor classifier performance because it does not have enough mutations to contribute. For example, if I choose to classify RAS mutations, I don't necessarily want breast tumors in my classifier: they would add over 1,000 tumors with few RAS mutations and could saturate the negative samples in the classifier.
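The kind of filtering described above could look something like this sketch, where tissues with too few mutated samples are excluded before the sample set is built. The data layout, function name, and threshold values are illustrative assumptions, not the project's actual implementation:

```python
from collections import Counter

# Hypothetical sample records: (sample_id, tissue, has_mutation_in_target_gene)
samples = [
    ("s1", "breast", False), ("s2", "breast", False), ("s3", "breast", False),
    ("s4", "lung", True), ("s5", "lung", False),
    ("s6", "pancreas", True), ("s7", "pancreas", True),
]

def eligible_tissues(samples, min_mutated=1, min_mutation_rate=0.2):
    """Keep only tissues with enough mutated samples to inform the classifier."""
    total = Counter(tissue for _, tissue, _ in samples)
    mutated = Counter(tissue for _, tissue, hit in samples if hit)
    return {
        tissue for tissue in total
        if mutated[tissue] >= min_mutated
        and mutated[tissue] / total[tissue] >= min_mutation_rate
    }

# breast is excluded (0 of 3 samples mutated); lung and pancreas remain
print(sorted(eligible_tissues(samples)))
```

The thresholds would presumably be tuned (or surfaced in the UI) rather than hard-coded; the point is only that the sample selector needs the per-tissue mutation counts from the gene selector before it can offer sensible choices.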

@cgreene
Member Author

cgreene commented Jul 19, 2016

I created a quick class diagram of the items that I think have been sufficiently specified, based on discussions thus far, to start implementing the models [Samples, Genes, Mutations, Mutation Types]. I went ahead and assumed we'd install django-genes and django-organisms in this project, as that lets us use what is there. At least django-genes will need a REST API, but it already provides an Elasticsearch index that will be useful for finding the right gene when a user types an identifier.

[class diagram image: django-cognoma]

I'll create a pull request with the XML form from draw.io that we can edit.

@awm33
Member

awm33 commented Jul 19, 2016

@cgreene This is great! Would you be able to indicate the data type for each field? And if a field is an enumeration, what could the potential values be?

I assume an auto-incrementing integer id PK on each model. I would also recommend created_at and updated_at fields on each. We may also want to consider disallowing deletes, or using soft deletes via a deleted_at field.
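The base-model conventions proposed above can be sketched in plain Python. The real project would implement these as Django model fields (`AutoField`, `DateTimeField(auto_now_add=True)`, etc.); the class name and method here are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

def _now():
    return datetime.now(timezone.utc)

@dataclass
class BaseRecord:
    """Bookkeeping fields proposed for every model."""
    id: int                                 # auto-incrementing primary key
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)
    deleted_at: Optional[datetime] = None   # soft delete: set instead of removing the row

    def soft_delete(self):
        """Mark the record deleted without destroying it."""
        self.deleted_at = _now()
        self.updated_at = self.deleted_at

    @property
    def is_deleted(self):
        return self.deleted_at is not None
```

With soft deletes, queries would simply filter on `deleted_at IS NULL`, so "deleted" rows remain available for auditing or undelete.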

@cgreene
Member Author

cgreene commented Jul 28, 2016

Will fill in what I can, but we probably need the cognoma/cancer-data team to chime in. This generally uses text unless I'm absolutely convinced that an enum or a more complex approach makes sense. The cancer-data team needs to fill some of these in (like age_at_diagnosis: I made it an integer, but I'm not sure it actually is one in the data).

Sample:

  • Site: Short string
  • Project: Short string
  • Disease type: Short string
  • age_at_diagnosis: int
  • Gender: enum (male, female, unknown)
  • Vital: enum (alive, deceased, unknown)
  • days_to_death: int
  • Race: string? [data team?]
  • Ethnicity: string? [data team?]
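The field spec above can be written out as a plain-Python sketch. The actual project would use Django models (with `choices` for the enums); the optional types below mark the fields still pending input from the data team:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Gender(Enum):
    MALE = "male"
    FEMALE = "female"
    UNKNOWN = "unknown"

class VitalStatus(Enum):
    ALIVE = "alive"
    DECEASED = "deceased"
    UNKNOWN = "unknown"

@dataclass
class Sample:
    site: str                                # short string
    project: str                             # short string
    disease_type: str                        # short string
    age_at_diagnosis: Optional[int] = None   # data team to confirm it is an integer
    gender: Gender = Gender.UNKNOWN
    vital: VitalStatus = VitalStatus.UNKNOWN
    days_to_death: Optional[int] = None
    race: Optional[str] = None               # type pending data team input
    ethnicity: Optional[str] = None          # type pending data team input
```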

@cgreene
Member Author

cgreene commented Aug 10, 2016

Here is the ultra-stripped-down version requested by @aelkner at the meetup last night.
[photo: img_20160810_144027]

@ypar

ypar commented Sep 1, 2016

Also @awm33, we can start setting things up using subsets of the input data.

The sample table is downloadable by this link here.

One example of a mutation table is here.

@awm33 awm33 mentioned this issue Oct 4, 2016
@awm33
Member

awm33 commented Oct 6, 2016

#25

@awm33 awm33 closed this as completed Oct 6, 2016
dcgoss added a commit that referenced this issue Jul 13, 2017
* Task Integration (#1)

* Testing CircleCI

* Testing new AWS IAM credentials/permissions

* Increased deployment timeout threshold (#64)

* Increased ecs_deploy timeout threshold to 180 seconds

The most recent deploy took about 120 seconds, but the current
threshold for the ecs_deploy script timeout is 90 seconds. This causes
builds on CircleCI to fail with the red X - not a good look. 180
seconds plays it on the safe side without being too long.

* Create task after classifier

* Task creation working

* Expand task on get

* Expand on classifier post

* Starting task creation tests

* Expanding task(s)

* Fixed tests

The unique ID of a task was causing problems in the testing environment
when integrating with a task-service container. This commit resolves
that problem by detecting if tests are running and generating a random
unique ID if that is the case.

* Forgot to import Gene

* Fixed task-def creation request data

In accordance with commits made in the task-service repository

* Added endpoint for completed notebook upload to classifier (for ml-worker)

When ml-worker completes running a notebook, it needs to upload the
completed notebook to core-service so that core-service can send an
email to the user with a link to download their completed notebook.

This commit enabled that functionality by adding:
- New authentication permission designed to only allow an internal
service to upload a notebook
- notebook_file attribute to Classifier model and serializer
- rudimentary file storage logic (stores files locally under
/media_files/notebook/classifier_<id>.ipynb, no S3 integration yet)
- Tests for notebook uploads, which include uploading a real notebook
- Whenever a test is run that creates a file, you are always left with
the directory still on your filesystem after the test. I added a test
runner file which will delete the media_files directory after testing

* Make classifiers write-once only

For now, we will assume classifiers cannot be updated.

* Email & S3 (#2)

* Added sending email upon notebook upload

* Added S3 integration with django-storages

Followed this guide:
https://www.caktusgroup.com/blog/2014/11/10/Using-Amazon-S3-to-store-your-Django-sites-static-and-media-files/

* Fixed issues with sending email

* Consolidated task-service and core-service

All of the task-service functionality is now ported over into
core-service, including queueing, serialization, views, etc. All of the
relevant columns that used to be stored on Task and TaskDef objects in
task-service are now stored directly on the classifier in core-service.
This greatly simplifies Cognoma’s overall architecture and codebase.

* Converted flags to long args in circle.yml
dcgoss added a commit that referenced this issue Aug 3, 2017
(#79)

* Quick bug fixes

Forgot to remove references to serializer after a changed import
statement. Test runner will now not error if no media files were
created.

* Added http -> https redirect in nginx

* Fixed & updated email sending

- Upload request would fail if the user had no email registered. Added
fail_silently=True to bypass this failure.
- Updated the email message to include a link to the nbviewer website.
- Simplified MLWorkers permission logic

* User/Classifier security enhancements and /genes/ pagination

- /users/ only provides access to create users. Endpoint will not
return a list of users anymore.
- access to /users/id/ is only given to users accessing themselves and
internal services. Users/anonymous users cannot access other users.
- /classifiers/ only provides access to create classifiers. Endpoint
will not return a list of classifiers anymore.
- Before, accessing the /genes/ endpoint really slowed down the server
because it had to process 100 genes and their mutations. I lowered the
pagination size to 10 which speeds things up significantly.

* Removed unnecessary UniqueTaskConflict

This isn’t used anymore due to the core-service/task-service
consolidation.

* Added email message for classifier processing failure

* Hotfix

Was trying to access Classifier object information directly on the
serializer, which won’t work.

* Comment out genes endpoints and tests

It appears that these are not needed at the moment.

* Commented out mutations endpoint

Appears unneeded for now.

* Updated mutations test status codes

* Forgot to comment out extraneous test assertions for mutation endpoints

4 participants