
Endpoint/Models for "samples/examples" #2

Closed
cgreene opened this issue Jul 8, 2016 · 7 comments

Comments

@cgreene
Member

cgreene commented Jul 8, 2016

Researchers will select which samples they want to include in the analysis — in machine learning terms, which examples are relevant to their question. These samples will have various associated metadata. The GDC Data Portal [ https://gdc-portal.nci.nih.gov/search/s ] has a very nice interface for these metadata. Essentially, the facets on the left for "cases" are the same ones that we would expect to be relevant here.

@gwaybio
Member

gwaybio commented Jul 8, 2016

The GDC portal is a good example of a friendly user interface and a good starting point for describing what a sample selector for this type of data should be able to do. For our purposes, however, I think our interface will need to communicate with the gene selector. We don't necessarily want to let a user select a tissue that is likely to drive poor classifier performance because it does not have enough mutations to contribute. For example, if I choose to classify RAS mutations, I don't necessarily want breast tumors in my classifier: they would add over 1,000 tumors with few RAS mutations and could saturate the negative samples in the classifier.
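The kind of filtering described above could look something like this sketch, where tissues with too few mutated samples are excluded before the sample set is built. The data layout, function name, and threshold values are illustrative assumptions, not the project's actual implementation:

```python
from collections import Counter

# Hypothetical sample records: (sample_id, tissue, has_mutation_in_target_gene)
samples = [
    ("s1", "breast", False), ("s2", "breast", False), ("s3", "breast", False),
    ("s4", "lung", True), ("s5", "lung", False),
    ("s6", "pancreas", True), ("s7", "pancreas", True),
]

def eligible_tissues(samples, min_mutated=1, min_mutation_rate=0.2):
    """Keep only tissues with enough mutated samples to inform the classifier."""
    total = Counter(tissue for _, tissue, _ in samples)
    mutated = Counter(tissue for _, tissue, hit in samples if hit)
    return {
        tissue for tissue in total
        if mutated[tissue] >= min_mutated
        and mutated[tissue] / total[tissue] >= min_mutation_rate
    }

# breast is excluded (0 of 3 samples mutated); lung and pancreas remain
print(sorted(eligible_tissues(samples)))
```

The thresholds would presumably be tuned (or surfaced in the UI) rather than hard-coded; the point is only that the sample selector needs the per-tissue mutation counts from the gene selector before it can offer sensible choices.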

@cgreene
Member Author

cgreene commented Jul 19, 2016

I created a quick class diagram of the items that I think have been sufficiently specified, based on discussions thus far, to start implementing the models [Samples, Genes, Mutations, Mutation Types]. I went ahead and assumed we'd install django-genes and django-organisms in this project, as that lets us use what is there. At least django-genes will need a REST API, but it already provides an Elasticsearch index that will be useful for finding the right gene when a user types an identifier.

[class diagram image: django-cognoma]

I'll create a pull request with the XML form from draw.io that we can edit.

@awm33
Member

awm33 commented Jul 19, 2016

@cgreene This is great! Would you be able to indicate the data type for each field? And if a field is an enumeration, what could the potential values be?

I assume an auto-incrementing integer id PK on each model. I would also recommend created_at and updated_at fields on each. We may also want to consider disallowing deletes, or using soft deletes via a deleted_at field.
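The base-model conventions proposed above can be sketched in plain Python. The real project would implement these as Django model fields (`AutoField`, `DateTimeField(auto_now_add=True)`, etc.); the class name and method here are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

def _now():
    return datetime.now(timezone.utc)

@dataclass
class BaseRecord:
    """Bookkeeping fields proposed for every model."""
    id: int                                 # auto-incrementing primary key
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)
    deleted_at: Optional[datetime] = None   # soft delete: set instead of removing the row

    def soft_delete(self):
        """Mark the record deleted without destroying it."""
        self.deleted_at = _now()
        self.updated_at = self.deleted_at

    @property
    def is_deleted(self):
        return self.deleted_at is not None
```

With soft deletes, queries would simply filter on `deleted_at IS NULL`, so "deleted" rows remain available for auditing or undelete.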

@cgreene
Member Author

cgreene commented Jul 28, 2016

Will fill in what I can, but we probably need the cognoma/cancer-data team to chime in. This generally uses text unless I'm absolutely convinced that an enum or a more complex approach makes sense. The cancer-data team needs to fill some of these in (like age_at_diagnosis: I made it an integer, but I'm not sure it actually is one in the data).

Sample:

  • Site: Short string
  • Project: Short string
  • Disease type: Short string
  • age_at_diagnosis: int
  • Gender: enum (male, female, unknown)
  • Vital: enum (alive, deceased, unknown)
  • days_to_death: int
  • Race: string? [data team?]
  • Ethnicity: string? [data team?]
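The field spec above can be written out as a plain-Python sketch. The actual project would use Django models (with `choices` for the enums); the optional types below mark the fields still pending input from the data team:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Gender(Enum):
    MALE = "male"
    FEMALE = "female"
    UNKNOWN = "unknown"

class VitalStatus(Enum):
    ALIVE = "alive"
    DECEASED = "deceased"
    UNKNOWN = "unknown"

@dataclass
class Sample:
    site: str                                # short string
    project: str                             # short string
    disease_type: str                        # short string
    age_at_diagnosis: Optional[int] = None   # data team to confirm it is an integer
    gender: Gender = Gender.UNKNOWN
    vital: VitalStatus = VitalStatus.UNKNOWN
    days_to_death: Optional[int] = None
    race: Optional[str] = None               # type pending data team input
    ethnicity: Optional[str] = None          # type pending data team input
```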

@cgreene
Member Author

cgreene commented Aug 10, 2016

Here is the ultra-stripped-down version requested by @aelkner at the meetup last night.
[photo: img_20160810_144027]

@ypar

ypar commented Sep 1, 2016

Also @awm33, we can start setting things up using subsets of the input data.

The sample table is downloadable by this link here.

One example of a mutation table is here.

@awm33 awm33 mentioned this issue Oct 4, 2016
@awm33
Member

awm33 commented Oct 6, 2016

#25

@awm33 awm33 closed this as completed Oct 6, 2016
dcgoss added a commit that referenced this issue Jul 13, 2017
* Task Integration (#1)

* Testing CircleCI

* Testing new AWS IAM credentials/permissions

* Increased deployment timeout threshold (#64)

* Increased ecs_deploy timeout threshold to 180 seconds

The most recent deploy took about 120 seconds, but the current
threshold for the ecs_deploy script timeout is 90 seconds. This causes
builds on CircleCI to fail with the red X - not a good look. 180
seconds plays it on the safe side without being too long.

* Create task after classifier

* Task creation working

* Expand task on get

* Expand on classifier post

* Starting task creation tests

* Expanding task(s)

* Fixed tests

The unique ID of a task was causing problems in the testing environment
when integrating with a task-service container. This commit resolves
that problem by detecting if tests are running and generating a random
unique ID if that is the case.

* Forgot to import Gene

* Fixed task-def creation request data

In accordance with commits made in the task-service repository

* Added endpoint for completed notebook upload to classifier (for ml-worker)

When ml-worker completes running a notebook, it needs to upload the
completed notebook to core-service so that core-service can send an
email to the user with a link to download their completed notebook.

This commit enabled that functionality by adding:
- New authentication permission designed to only allow an internal
service to upload a notebook
- notebook_file attribute to Classifier model and serializer
- rudimentary file storage logic (stores files locally under
/media_files/notebook/classifier_<id>.ipynb, no S3 integration yet)
- Tests for notebook uploads, which include uploading a real notebook
- Whenever a test is run that creates a file, you are always left with
the directory still on your filesystem after the test. I added a test
runner file which will delete the media_files directory after testing

* Make classifiers write-once only

For now, we will assume classifiers cannot be updated.

* Email & S3 (#2)

* Added sending email upon notebook upload

* Added S3 integration with django-storages

Followed this guide:
https://www.caktusgroup.com/blog/2014/11/10/Using-Amazon-S3-to-store-your-Django-sites-static-and-media-files/

* Fixed issues with sending email

* Consolidated task-service and core-service

All of the task-service functionality is now ported over into
core-service, including queueing, serialization, views, etc. All of the
relevant columns that used to be stored on Task and TaskDef objects in
task-service are now stored directly on the classifier in core-service.
This greatly simplifies Cognoma’s overall architecture and codebase.

* Converted flags to long args in circle.yml
dcgoss added a commit that referenced this issue Aug 3, 2017
(#79)

* Quick bug fixes

Forgot to remove references to serializer after a changed import
statement. Test runner will now not error if no media files were
created.

* Added http -> https redirect in nginx

* Fixed & updated email sending

- Upload request would fail if the user had no email registered. Added
fail_silently=True to bypass this failure.
- Updated the email message to include a link to the nbviewer website.
- Simplified MLWorkers permission logic

* User/Classifier security enhancements and /genes/ pagination

- /users/ only provides access to create users. Endpoint will not
return a list of users anymore.
- access to /users/id/ is only given to users accessing themselves and
internal services. Users/anonymous users cannot access other users.
- /classifiers/ only provides access to create classifiers. Endpoint
will not return a list of classifiers anymore.
- Before, accessing the /genes/ endpoint really slowed down the server
because it had to process 100 genes and their mutations. I lowered the
pagination size to 10 which speeds things up significantly.

* Removed unnecessary UniqueTaskConflict

This isn’t used anymore due to the core-service/task-service
consolidation.

* Added email message for classifier processing failure

* Hotfix

Was trying to access Classifier object information directly on the
serializer, which won’t work.

* Comment out genes endpoints and tests

It appears that these are not needed at the moment.

* Commented out mutations endpoint

Appears unneeded for now.

* Updated mutations test status codes

* Forgot to comment out extraneous test assertions for mutation endpoints

4 participants