Priority: Data Loader and data download form. #1

Closed
laceysanderson opened this issue May 18, 2017 · 16 comments

@laceysanderson
Member

The Global Trust data needs to be available for access as soon as possible. As such, we need to make a first release of this module with extremely basic functionality. To meet the need for access, at a minimum we need an upload page for advanced users to submit data and a download page for public users to access data. I will add more specific details for each to this issue.

@laceysanderson
Member Author

laceysanderson commented May 26, 2017

Upload Specification

This will be an administrator form to differentiate it from the raw phenotypes. The Upload Page should be at admin/tripal/extensions/analyzed-phenotypes/upload. The page before that should simply be a listing of links like admin/tripal. Theming should be minimal in order to match the administrative theme of the site.

Process

The upload process will be a multi-step form:

  1. "Experiment" + Data File
    This page will have an "Experiment" Autocomplete at the top with a drag-and-drop file upload below. Very similar to the first step in Raw Phenotypes. It would also be good to put a warning at the top of the page as follows: "Phenotypic data should be (bold)filtered for outliers and mis-entries(bold) before being uploaded here. Do not upload data that should not be used in the final analysis for a scientific article. Furthermore, data should (bold)NOT be averaged across replicates or site-years(bold)."
  2. Validation Step
    For this form, data validation is going to be done in a Tripal Job rather than during the upload process. This is to make sure that data file size never causes a WSOD ;-) but also to take into account the more intense validation we are going to do at this stage. This page should have an auto-updating progress bar and then display validation messages in a similar format to the raw phenotypes module. Things to validate include:
    • Should include all the same validation from raw phenotypes.
    • Don't allow empty values for any columns (all required)
    • Check "Trait, Experiment, Germplasm, Year, Location, Replicate" combination is unique
      (This list might grow. Use the same approach as raw phenotypes).
  3. Fully Describe Traits (not optional)
    This page will have a list of the traits detected in the first step. For each trait the uploader must validate that the correct trait is chosen or define the trait fully if it's not recognized. Form elements per trait:
    • Trait Name
    • Trait Description
    • Trait Method
    • Trait Units
    • Trait Scale (if applicable)
    • Mapping to Crop Ontology
    • Mapping to Plant Trait Ontology
    • Picture Upload
    • Include Stats section calculated by us. Should include Min, Max, Mean, Standard Deviation per Site-year/trait combination on the data given to us.
  4. Show loading progress (same as in raw phenotypes)

File Specification

Upload file (tab-delimited) will have the following columns:

  • Trait Name (cvterm.name)
  • Germplasm Accession (stock.uniquename)
  • Germplasm Name (stock.name)
  • Year
  • Location
  • Replicate
  • Value
  • Data Collector
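As a rough illustration of the row-level checks listed in the Validation Step (no empty values, one measurement per Trait/Germplasm/Year/Location/Replicate combination), here is a minimal PHP sketch. The function name `ap_validate_rows()` is hypothetical, the rows are assumed to be already parsed into arrays keyed by the column headers above, and uniqueness is only checked within the file; a check against existing database records would need a separate query.

```php
<?php
/**
 * Hypothetical helper: validate parsed rows of the tab-delimited upload file.
 *
 * @param array $rows List of rows, each an associative array keyed by column header.
 * @return array Human-readable error messages (empty when all rows pass).
 */
function ap_validate_rows(array $rows) {
  $required = array(
    'Trait Name', 'Germplasm Accession', 'Germplasm Name', 'Year',
    'Location', 'Replicate', 'Value', 'Data Collector',
  );
  // Columns that together should identify a single measurement. The
  // experiment is chosen in the form, so it is constant within one file.
  $key_columns = array('Trait Name', 'Germplasm Accession', 'Year', 'Location', 'Replicate');

  $errors = array();
  $seen = array();
  foreach ($rows as $i => $row) {
    // Line 1 of the file is assumed to be the header.
    $line = $i + 2;
    foreach ($required as $column) {
      if (!isset($row[$column]) || trim($row[$column]) === '') {
        $errors[] = "Line $line: empty value for \"$column\".";
      }
    }
    $parts = array();
    foreach ($key_columns as $column) {
      $parts[] = isset($row[$column]) ? trim($row[$column]) : '';
    }
    $key = implode("\t", $parts);
    if (isset($seen[$key])) {
      $errors[] = "Line $line: duplicates the measurement on line {$seen[$key]}.";
    }
    else {
      $seen[$key] = $line;
    }
  }
  return $errors;
}
```

Returning a list of messages rather than stopping at the first problem fits the idea of showing all validation messages at once, as the raw phenotypes module does.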

Data Storage

[Image: screen shot 2017-05-26 at 10 14 58 am]

  1. Select the trait, germplasm and project. These primary key => value mappings should be saved during loading since they will be re-used many times.
  2. Create phenotype record
    • uniquename:
    • observable_id: NULL
    • attr_id: primary key for the trait
    • value: the measured value
    • cvalue_id: NULL
    • assay_id: NULL (maybe we should set this to show that the measurements were taken by eye)
    • project_id: primary key of the project
    • stock_id: primary key of the germplasm
  3. Create additional phenotype properties to store the remaining information
    • Year
      • phenotype_id: the phenotype_id from the record created in step 2
      • type_id: FIND TERM
      • value: inputFile.year
      • cvalue_id: NULL
      • rank: 0
    • Location
      • phenotype_id: the phenotype_id from the record created in step 2
      • type_id: cvterm_id for (OGI:0000021; "location on map")
      • value: inputFile.location
      • cvalue_id: NULL
      • rank: 0
    • Replicate
      • phenotype_id: the phenotype_id from the record created in step 2
      • type_id: FIND TERM
      • value: inputFile.replicate
      • cvalue_id: NULL
      • rank: 0
    • Data Collector
      • phenotype_id: the phenotype_id from the record created in step 2
      • type_id: cvterm_id for (CO_010:0000097; "collector name")
      • value: inputFile.dataCollector
      • cvalue_id: NULL
      • rank: 0
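To make the mapping above concrete, here is a minimal PHP sketch that turns one line of the input file into the phenotype record and its four property records. The function names are hypothetical, the primary keys are assumed to have been looked up beforehand, and the `type_id` values are passed in because two of the terms are still marked FIND TERM. The arrays are only shaped like the table rows; inserting them is left out.

```php
<?php
/**
 * Hypothetical helper: values for one phenotype record.
 */
function ap_build_phenotype($uniquename, $trait_id, $project_id, $stock_id, $value) {
  return array(
    'uniquename' => $uniquename,
    'observable_id' => NULL,
    'attr_id' => $trait_id,
    'value' => $value,
    'cvalue_id' => NULL,
    'assay_id' => NULL,
    'project_id' => $project_id,
    'stock_id' => $stock_id,
  );
}

/**
 * Hypothetical helper: property rows for one phenotype record.
 *
 * @param int $phenotype_id Primary key of the phenotype record created first.
 * @param array $row Input line with year, location, replicate and dataCollector keys.
 * @param array $type_ids cvterm_id to use for each of those four keys.
 * @return array Four rows, one per property.
 */
function ap_build_phenotypeprops($phenotype_id, array $row, array $type_ids) {
  $props = array();
  foreach (array('year', 'location', 'replicate', 'dataCollector') as $name) {
    $props[] = array(
      'phenotype_id' => $phenotype_id,
      'type_id' => $type_ids[$name],
      'value' => $row[$name],
      'cvalue_id' => NULL,
      'rank' => 0,
    );
  }
  return $props;
}
```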

@reynoldtan
Member

Mock-up #1

This mock-up shows a three-stage data loader for analyzed phenotypes. The layout goal is to pattern the interface after the visual appearance of Tripal admin pages. The overall layout of the page shows the stage indicator (arrows pointing to the right/forward) followed by an autocomplete textfield form element and an area for the validation result or drag-and-drop upload.

STAGE 01 - VALIDATE:

[Image: ap - stage 1 upload]

Below is a note about splitting the validation process into two parts: (1) during the upload process and (2) as a Tripal job.
Validation during the upload process handles basic checks, such as ensuring that a project was selected, that the file has the right number of columns, and that the columns match the required column headers. Once the file passes basic validation, the module hands the extensive validation of the data, in rows and columns, to the server as a Tripal job request. This approach allows the module to manage server resources more efficiently.
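The fast, upload-time part could look something like the following PHP sketch. It only inspects the selected project and the header line, so it stays cheap regardless of file size. The function name `ap_fast_validate()` and the error wording are made up for illustration.

```php
<?php
/**
 * Hypothetical helper: cheap checks that can run during the upload request.
 *
 * @param string $project Name of the selected project ('' if none).
 * @param string $header_line First line of the uploaded file.
 * @param array $expected Required column headers.
 * @return array Error messages; empty when the file can go on to the Tripal job.
 */
function ap_fast_validate($project, $header_line, array $expected) {
  $errors = array();
  if (trim($project) === '') {
    $errors[] = 'No project was selected.';
  }
  // Split the tab-delimited header and ignore surrounding whitespace.
  $columns = array_map('trim', explode("\t", rtrim($header_line, "\r\n")));
  if (count($columns) !== count($expected)) {
    $errors[] = 'Expected ' . count($expected) . ' columns but found ' . count($columns) . '.';
  }
  $missing = array_diff($expected, $columns);
  if ($missing) {
    $errors[] = 'Missing column headers: ' . implode(', ', $missing) . '.';
  }
  return $errors;
}
```

Only when this returns no errors would the line-by-line validation be queued as a job.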

VALIDATION WITHOUT ERRORS.

[Image: ap - stage 1 no error]
A Next step button is added to instruct the user to proceed. Consistent with Tripal admin pages, this button is left-aligned and outside the form element container/fieldset.

VALIDATION WITH ERRORS.

[Image: ap - stage 1 with error]
Similar to rawphenotypes, the errors detected are listed, followed by a failed status message and a drag-and-drop area so the user can re-upload.

STAGE 02 - DESCRIBE:

[Image: ap - stage 2 describe]

On this page, the admin is asked to fully describe all traits. A validation result window instructs the user to complete the forms and shows relevant details about the uploaded file. Each detected trait has a title and a set of form fields organized into one fieldset/container. Traits are sequentially numbered, related form fields are grouped together, and the form for the first trait is expanded on load, all to guide the user when filling out form entries.

The user will be notified of the number of described and undescribed traits before proceeding.

Note:
Please confirm that a summary table is expected in Stats section (in the specs - Include a stats section calculated by us.)

STAGE 03 - SAVE:

[Image: ap - stage 3 save]

Finally, my favourite stage :) is where data gets processed and stored. It illustrates the completed stage indicator, a series of warnings and status messages and finally a progress bar to show among other things, the progress.

Questions:

  1. What are the allowed file extensions?
  2. Does the picture uploaded when describing traits need to be scaled programmatically, e.g. from 2000x2000 pixels down to, say, 300x300 pixels?
  3. What field type is the "Mapping to xyz Ontology" element?
  4. Stat section, as mentioned above.
  5. What does "validate that the correct trait is chosen" mean?
  6. In the warning message, should we also include that if you are uploading raw data please click rawphenotypes...
  7. Is this admin part of the analyzed phenotypes module?
  8. Please provide a sample file (tab-delimited) with sample data, to give me an idea of what I will be dealing with.

let me know...

@laceysanderson
Member Author

First off, I Love the mockups! They are exactly what I was picturing. Also, I completely agree with the two-step validation: fast validation done on upload, line-by-line validation done in Tripal job --Good Solution!

Suggestions (including better specs):

  1. I would add an additional step. Your current "Validate" step would become "Upload" and include the fast validation. The next step would be "Validate" and would include the progress bar for the tripal job doing the more complete line-by-line validation. If not all validation passes, this page would also have a drop zone so users don't have to manually jump back a step.
  2. This way you don't end up with validation messages and project settings on the "Describe Traits" step.
  3. Image upload in "Describe" step needs to support multiple images.
  4. You likely want to add a genus drop-down to the first page as well. If it has already been set for that project then you can auto-fill it. This will be needed for stock look-up as well as to know which crop ontology to use.

Questions

  1. tsv, txt
  2. Yes, we will need to programmatically resize the image --good catch!
  3. Autocomplete pulling terms from the correct ontology. Users aren't familiar with these ontologies though so a "Suggestions" box with terms matching keywords from the "Trait Name" field would be helpful.
  4. Stats section: Your current mock-up is exactly what I meant 👍
  5. The trait might already exist. Therefore, based on the trait name, the form should attempt to fill in defaults for all the fields. The user should then be warned with a "Did you mean 'Plant Height'? Please check the method and units to ensure they match how the data was collected." If they change the method or units then a new trait should be created.
  6. Sure, can't hurt although I can't see them accidentally ending up here.
  7. Yes although it will be used by people such as Derek so still be sure to be helpful and double check Everything ;-)
  8. I don't actually have a sample file but I will try to generate one for you.

@reynoldtan
Member

Mock-up #2

The image below illustrates the flow of loading data into AP.
[Image: stage-flow]

Revised pages of the loader (in the order of the stages shown above):

01 UPLOAD
[Image: ap - stage 1 upload]
01 UPLOAD / NO VALIDATION ERRORS
[Image: ap - stage 1 upload - no errors]
01 UPLOAD / WITH VALIDATION ERRORS
[Image: ap - stage 1 upload - with error]

02 VALIDATE
[Image: ap - stage 2 validate]
02 VALIDATE / NO VALIDATION ERRORS
[Image: ap - stage 2 validate - no errors]
02 VALIDATE / WITH VALIDATION ERRORS
[Image: ap - stage 2 validate - with error]

03 DESCRIBE
[Image: ap - stage 3 describe]

04 SAVE
[Image: ap - stage 4 save]

Thanks!

@laceysanderson
Member Author

Looks perfect except for the "suggestions" in "03 DESCRIBE". I think this being a drop-down is very confusing and this would be better shown as a list. I would expect something more along the lines of "Possible Crop Ontology Term(s): Plant Height, Canopy Height, First Node Height" if the Trait name was "Height".
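A minimal PHP sketch of how such a suggestion list might be produced: split the trait name into keywords and keep every ontology term that contains one of them. The function name `ap_suggest_terms()` is hypothetical, the term list is passed in rather than queried from the configured cv, and "Seed Weight" is a made-up term added only to show a non-match.

```php
<?php
/**
 * Hypothetical helper: suggest ontology terms sharing a keyword with the trait name.
 *
 * @param string $trait_name Trait name detected in the file.
 * @param array $ontology_terms Term names from the relevant crop ontology.
 * @return array Matching term names, in their original order.
 */
function ap_suggest_terms($trait_name, array $ontology_terms) {
  // Split the trait name into lower-case keywords, ignoring punctuation.
  $keywords = preg_split('/[^a-z0-9]+/', strtolower($trait_name), -1, PREG_SPLIT_NO_EMPTY);
  $suggestions = array();
  foreach ($ontology_terms as $term) {
    foreach ($keywords as $keyword) {
      // Case-insensitive substring match on the term name.
      if (stripos($term, $keyword) !== FALSE) {
        $suggestions[] = $term;
        break;
      }
    }
  }
  return $suggestions;
}

$terms = array('Plant Height', 'Canopy Height', 'First Node Height', 'Seed Weight');
print 'Possible Crop Ontology Term(s): ' . implode(', ', ap_suggest_terms('Height', $terms)) . "\n";
```

Plain substring matching is the simplest option; very short keywords would match too much, so a real implementation would probably want a minimum keyword length.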

reynoldtan added a commit that referenced this issue Jul 4, 2017
reynoldtan added a commit that referenced this issue Jul 4, 2017
- Validating project is specified and the specified project do exists.
@reynoldtan
Member

Questions/Clarifications:

Stage 2 - Validate:

  • Is the unique combination validation unique in the file or in the db records?

Stage 3 - Describe:
CVTERM NAME

  • When inserting a cvterm a DB is specified. What are the possible DBs and the default?
  • When inserting a cvterm will the cv_id value be cv_id = phenotype_measurement_types or should we create one specific to this module (eg. analyzedphenotype_measurement_type)?

CVTERM UNIT

  • When relating the term to the unit in CVTERM_RELATIONSHIP, will the type_id value be cv_id = phenotype_measurement_units or should we create one specific to this module (eg. analyzedphenotypes_measurement_units)?

SCALE

  • Could not figure out how to store :(

ONTOLOGY
Looking at the kp_entities module, I think an ontology is stored as cvterms with a specific cv. The cvs in this case were LENTIL CROP ONTOLOGY, CHICKPEA CROP ONTOLOGY and so on (default-namespace in the obo file) with a corresponding record in tripal_cv_ob.

  • What namespace/cv should we use for Crop Ontology?
  • What namespace/cv should we use for Plant Trait Ontology?

Please confirm:

  • Ontology and Trait as CVTERM_RELATIONSHIP
    where type_id = cv_id of the cv described above, or should we create a term specific to this module to describe the relationship (eg. analyzedphenotype_measurement_ontogoly)?
    object_id = cvterm_id of the Trait
    subject_id = cvterm_id of the Ontology

PHOTO
Suggestion: use the cvterm id number plus a sequence number plus the file extension.
example: cvterm: 2132_2.gif
where 2132 is the cvterm id and 2 is photo # 2.
This method does not require a table, just need to remember directory we are saving photo in. :)

  • As mentioned, when a trait is found we auto-fill the describe form with the corresponding values. When a value is modified (even just a word in the description), it becomes a new record. Will this also be true if photos/ontology are changed?

Summary Table
Is the source of the data the file or the stored records? My understanding is that site-year is the location and year of the phenotype, min is the minimum value of the record set, max is the maximum value of the record set, mean is the sum of the values divided by the number of rows, and standard deviation - need to google this :) ?

Saving Line

  • uniquename = trait name?
  • For location, replicate, year and data collector
    I can only query Location and Replicate (as rep) I believe from rawphenotypes module.
    Should we add a separate copy of these four terms on install?

I hope I make sense and please let me know.
Thanks!

@laceysanderson
Member Author

Stage 2 - Validate

Is the unique combination validation unique in the file or in the db records?

It should be unique in the database.

Stage 3 - Describe

CVTERM NAME

When inserting a cvterm a DB is specified. What are the possible DBs and the default?
When inserting a cvterm will the cv_id value be cv_id = phenotype_measurement_types or should we create one specific to this module (eg. analyzedphenotype_measurement_type)?

These should be configurable. More specifically, your module should create a settings form at Admin > Tripal > Extensions > Analyzed Phenotypes > Settings that allows the admin to select an existing cv and db per organism.genus. You should then use these ontologies when checking to see if a term already exists. Furthermore, the admin should be able to specify whether you can add new terms or not. Some sites will want to keep the ontologies pure; whereas, others will want to build them as they go.

CVTERM UNIT

When relating the term to the unit in CVTERM_RELATIONSHIP. Will the type_id value be cv_id = phenotype_measurement_units or should we create one specific to this module (eg. analyzedphenotypes_measurement_units)?

This should be stored the same way it is for the crop ontologies. However, I don't know what that is off the top of my head. I'll look into it and reply back later.

SCALE

Could not figure out how to store :(

This should be stored the same way it is for the crop ontologies. However, I don't know what that is off the top of my head. I'll look into it and reply back later.

ONTOLOGY

Looking at kp_entities module, I think ontology is stored as cvterm with a specific cv. The cvs in this case were LENTIL CROP ONTOLOGY, CHICKPEA CROP ONTOLOGY and so on (default-namespace in the obo file) with corresponding record in tripal_cv_ob.

There are actually multiple cvs created per crop ontology. For example, the Lentil Crop ontology creates the following cvs: "Crop Ontology, Lentil Variable", "Crop Ontology, Lentil Scale", "Crop Ontology, Lentil Method", "Crop Ontology, Lentil Trait", "Lentil Crop Ontology". Traits are contained in the "Crop Ontology, Lentil Trait" cv.

What namespace/cv should we use for Crop Ontology?

This is dependent upon the organism.genus selected for the current data file. If the data file refers to "Lens" then we should use the "Crop Ontology, Lentil Trait" cv. This should also be configured in the same section as the cv/db per genus above. It also might be good to make this comparison optional since some sites might decide to use the crop ontology directly rather than map to it.

What namespace/cv should we use for Plant Trait Ontology?

The Plant Trait ontology currently can't be loaded into Tripal due to an incompatibility in the OBO format. I would comment out this section for now.

Please confirm:
Ontology and Trait as CVTERM_RELATIONSHIP
where type_id = cv_id of the cv described above, or should we create a term specific to this module to describe the relationship (eg. analyzedphenotype_measurement_ontogoly)?
object_id = cvterm_id of the Trait
subject_id = cvterm_id of the Ontology

Use the following term as the type_id: cvterm.name=related, cv.name=synonym_type. Your subject and object are correct :-)

PHOTO

Suggestion: use the cvterm id number plus a sequence number plus the file extension.
example: cvterm: 2132_2.gif
where 2132 is the cvterm id and 2 is photo # 2.
This method does not require a table, just need to remember directory we are saving photo in. :)

Sure, let's run with this :-) Just make sure the files are managed by Drupal.
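For reference, the naming scheme could be captured in a small PHP helper like the sketch below. The function name `ap_photo_filename()` is hypothetical and lower-casing the extension is an assumption. The file itself would still need to go through Drupal's managed file handling (for example `file_save_upload()` in Drupal 7), which is not shown here.

```php
<?php
/**
 * Hypothetical helper: name a trait photo as <cvterm_id>_<sequence>.<extension>.
 *
 * @param int $cvterm_id The cvterm_id of the trait.
 * @param int $sequence Photo number for this trait (1, 2, 3, ...).
 * @param string $original_name Name of the uploaded file, used for its extension.
 * @return string The file name to store the photo under.
 */
function ap_photo_filename($cvterm_id, $sequence, $original_name) {
  $extension = strtolower(pathinfo($original_name, PATHINFO_EXTENSION));
  return (int) $cvterm_id . '_' . (int) $sequence . '.' . $extension;
}
```

With the example above, `ap_photo_filename(2132, 2, 'photo.GIF')` gives `2132_2.gif`.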

As mentioned, when a trait is found we auto-fill the describe form with the corresponding values. When a value is modified (even just a word in the description), it becomes a new record. Will this also be true if photos/ontology are changed?

No. If photos or the ontology mapping are changed then we can just update the current trait.

Summary Table

Is the source of the data the file or the stored records? My understanding is that site-year is the location and year of the phenotype, min is the minimum value of the record set, max is the maximum value of the record set, mean is the sum of the values divided by the number of rows, and standard deviation - need to google this :) ?

The source data is from the file. You're correct on how to calculate min, max, mean. Standard deviation (how spread out the numbers are: https://www.mathsisfun.com/data/standard-deviation.html) can be calculated by adding the following to our module (create analyzedphenotypes/api/analyzedphenotypes.api.inc and include it in our .module file):

if (!function_exists('stats_standard_deviation')) {
    /**
     * This user-land implementation follows the implementation quite strictly;
     * it does not attempt to improve the code or algorithm in any way. It will
     * raise a warning if you have fewer than 2 values in your array, just like
     * the extension does (although as an E_USER_WARNING, not E_WARNING).
     * 
     * @param array $a 
     * @param bool $sample [optional] Defaults to false
     * @return float|bool The standard deviation or false on error.
     */
    function stats_standard_deviation(array $a, $sample = false) {
        $n = count($a);
        if ($n === 0) {
            trigger_error("The array has zero elements", E_USER_WARNING);
            return false;
        }
        if ($sample && $n === 1) {
            trigger_error("The array has only 1 element", E_USER_WARNING);
            return false;
        }
        $mean = array_sum($a) / $n;
        $carry = 0.0;
        foreach ($a as $val) {
            $d = ((double) $val) - $mean;
            $carry += $d * $d;
        }
        if ($sample) {
            --$n;
        }
        return sqrt($carry / $n);
    }
}

Source: http://php.net/manual/en/function.stats-standard-deviation.php#114473
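To show how the four numbers fit together for the Stats section, here is a minimal PHP sketch that groups the values of one trait by site-year and computes Min, Max, Mean and Standard Deviation for each group. The function name `ap_summarize_by_siteyear()` and the row keys are assumptions, and the standard deviation is computed inline so the sketch runs on its own.

```php
<?php
/**
 * Hypothetical helper: Min, Max, Mean and Standard Deviation per site-year.
 *
 * @param array $rows Parsed file rows with 'Location', 'Year' and 'Value' keys,
 *   already restricted to a single trait.
 * @return array Stats keyed by "<location> <year>".
 */
function ap_summarize_by_siteyear(array $rows) {
  // Collect the numeric values of each site-year.
  $groups = array();
  foreach ($rows as $row) {
    $groups[$row['Location'] . ' ' . $row['Year']][] = (float) $row['Value'];
  }
  $summary = array();
  foreach ($groups as $siteyear => $values) {
    $n = count($values);
    $mean = array_sum($values) / $n;
    $squares = 0.0;
    foreach ($values as $value) {
      $squares += ($value - $mean) * ($value - $mean);
    }
    $summary[$siteyear] = array(
      'min' => min($values),
      'max' => max($values),
      'mean' => $mean,
      // Population standard deviation (divide by n, not n - 1).
      'sd' => sqrt($squares / $n),
    );
  }
  return $summary;
}
```

Whether the population or the sample standard deviation is wanted has not been decided in this thread; dividing by `$n - 1` instead would give the sample version.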

Saving Line

uniquename = trait name

The uniquename has to be unique for the measurement. Therefore it should be a combination of trait_id, project_id, location, year, stock_id, and rep. Just to be safe I throw the date in there too when generating phenotypic data. See https://github.com/UofS-Pulse-Binfo/generate_trpdata/blob/7.x-3.x/generate_trpdata.drush.inc#L854.
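A minimal PHP sketch of such a composite uniquename follows. The function name and the underscore separator are assumptions for illustration; the exact format used in the linked generate_trpdata code is not reproduced here.

```php
<?php
/**
 * Hypothetical helper: build a phenotype.uniquename that is unique per measurement.
 *
 * @param int $trait_id cvterm_id of the trait.
 * @param int $project_id Primary key of the project.
 * @param string $location Location from the input file.
 * @param string $year Year from the input file.
 * @param int $stock_id Primary key of the germplasm.
 * @param string $rep Replicate from the input file.
 * @param string|null $date Optional date appended as extra insurance.
 * @return string The combined uniquename.
 */
function ap_phenotype_uniquename($trait_id, $project_id, $location, $year, $stock_id, $rep, $date = NULL) {
  $parts = array($trait_id, $project_id, $location, $year, $stock_id, $rep);
  if ($date !== NULL) {
    $parts[] = $date;
  }
  // The underscore separator is an arbitrary choice for this sketch.
  return implode('_', $parts);
}
```

If locations can contain the separator character, two different combinations could in principle produce the same string, so a separator that never appears in the data would be the safer choice.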

For location, replicate, year and data collector
I can only query Location and Replicate (as rep) I believe from rawphenotypes module.
Should we add a separate copy of these four terms on install?

We want to use public ontologies as much as possible... However, in the interests of time, I stuck to terms that were already available with Tripal3. These are what I used for generating phenotypic data:

  • Location: cvterm.name=Location, cv.name = nd_geolocation_property
  • Year: cvterm.name=Year, cv.name = tripal_pub
  • Replicate: cvterm.name=replicate, cv.name=local

Replicate did need to be created (See https://github.com/UofS-Pulse-Binfo/generate_trpdata/blob/7.x-3.x/generate_trpdata.drush.inc#L732)... These will work for now but keep in mind they are not ideal... Perhaps it would be good to add an issue to github to find better public terms ;-).

Saving to the Database

I see that you created two new tables ap_phenotype and ap_phenotypeprop using hook_schema(). You will want to use chado.phenotype and chado.phenotypeprop instead as these tables already exist :-) Unfortunately, the tables that come with chado are missing a few columns so your module will need to check to see if the table matches your expectations and then alter it if it doesn't. This should be done on module enable. How to make the changes if they're not already done:

chado_query('ALTER TABLE {phenotype} ADD COLUMN project_id integer REFERENCES {project} (project_id)');
chado_query('ALTER TABLE {phenotype} ADD COLUMN stock_id integer REFERENCES {stock} (stock_id)');
chado_query('ALTER TABLE {phenotypeprop} ADD COLUMN cvalue_id integer REFERENCES {cvterm} (cvterm_id)');
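The "check first, then alter" logic could be sketched in PHP as below: given the columns each table currently has, return only the statements that still need to run. The function name `ap_missing_column_sql()` is hypothetical, and how the existing columns are looked up is deliberately left out; the returned strings are meant to be passed to `chado_query()` as in the lines above.

```php
<?php
/**
 * Hypothetical helper: list the ALTER TABLE statements that are still needed.
 *
 * @param array $existing Column names currently present, keyed by table name.
 * @return array SQL strings using the {table} notation shown above.
 */
function ap_missing_column_sql(array $existing) {
  $wanted = array(
    'phenotype' => array(
      'project_id' => 'integer REFERENCES {project} (project_id)',
      'stock_id' => 'integer REFERENCES {stock} (stock_id)',
    ),
    'phenotypeprop' => array(
      'cvalue_id' => 'integer REFERENCES {cvterm} (cvterm_id)',
    ),
  );
  $sql = array();
  foreach ($wanted as $table => $columns) {
    $present = isset($existing[$table]) ? $existing[$table] : array();
    foreach ($columns as $column => $definition) {
      // Only add a column when the table does not have it yet.
      if (!in_array($column, $present)) {
        $sql[] = 'ALTER TABLE {' . $table . '} ADD COLUMN ' . $column . ' ' . $definition;
      }
    }
  }
  return $sql;
}
```

Running this on module enable and executing whatever it returns keeps the step safe to repeat on sites where the columns were already added.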

Outstanding Question: How should we relate the trait to its unit and scale? My answer is to follow the same method as the crop ontologies. However, I don't know what that is off the top of my head so I'm adding this here with the intent of looking into this later.

@laceysanderson
Member Author

Additionally, I've added functionality to display data as your trait distribution chart and a summary table. These require two materialized views which will need to be synced after new data is loaded. Your upload form should automatically submit a job to sync these two materialized views (mview_phenotype, mview_phenotype_summary).

@reynoldtan reynoldtan assigned reynoldtan and unassigned reynoldtan Sep 7, 2017
@reynoldtan
Member

[Image: ap - download]

Analyzed Phenotypes Data Downloader Mockup #1
This mockup shows a download page similar to the rawphenotypes download page. The top section, following the main title, is a set of informative icons, each representing a type of data or filter available to the user. When an icon is selected, a series of form elements populated with more detailed filters allows for further refinement. Retrieval of the entire dataset is also supported by clicking the all dataset option.

All textarea form elements are multi select and have the first option to include/select all.

Notes

  1. Emphasize the infographics row by adding a box around the set (from Carolyn).
  2. Download all dataset may be the default option for most users who just want the data and are unwilling to interact with the interface. This might cause heavy activity on the server (from Carolyn).
  3. An option to select the file type of the output file (txt, tsv or xlsx).

@laceysanderson
Member Author

The mockup looks beautiful :-)

Thoughts/Suggestions:

  • I think it would be confusing for not all filters to be shown by default. This is a huge departure from the rest of the forms on KnowPulse...
  • I like the icons at the top (and am considering adding them to all forms ;-) ) but I think they would be better as a data summary rather than selectors for filter criteria. Additionally, I would add Years and Experiments to the list.
  • Filter criteria should have help text
  • I think it makes the most sense to make the trait selector limited to one selection. This keeps people from downloading all our data, which I think both Kirstin and our server would approve of ;-). Furthermore, analysis is only done on one trait at a time so this shouldn't limit the researcher. @carolyncaron what do you think?
  • Also, this form should be in the main portion of the site not the administrative section. The module already creates a summary page and the bean chart: this form should be on the same level as the bean chart with a similar path.
  • I second @carolyncaron's suggestion of an output file type option. We should talk with researchers to determine what formats would be useful based on the programs they use for analysis.

Filter Criteria Suggestions:

I suggest grouping by category to provide similar functionality to your "select by icon" while still showing all filter criteria. You might want to use the genotype filter as an example: http://knowpulse.usask.ca/portal/chado/genotype/Lens.

  1. Select the trait you are interested in. (required)
    • Species: Genus (required) + Species (default: All)
    • Trait (required; dependent upon genus above)
  2. Restrict dataset to a specific Experiment. (optional)
    • Experiment Name (autocomplete)
    • Year (show all years by default, change based on experiment selected)
    • Location (show all locations by default, change based on experiment selected)
  3. Additional Filter Criteria. (optional)
    • Germplasm Name (autocomplete)
    • Germplasm Accession (autocomplete; stock.uniquename)
    • Germplasm Type (select; stock.type_id)
    • Germplasm Maximum Allowed Missing Data (only include germplasm with a maximum amount of missing data supplied by the user; default=100)
  4. Choose your output format. (required)
    • This section needs more thought... I don't know what the ideal format for this data is. Plus the fields exported depend on what is selected above. For example, if you selected a single experiment then there is no need to include the experiment name with each datapoint. Same goes for year, location, germplasm name.

@carolyncaron any other filter criteria suggestions? What are your thoughts on what the exported file should look like?

@reynoldtan
Member

Mockup #2
[Image: ap download page2 - default]

This mockup shows the data downloader with the minimum set of filter options. The image below shows it with all filters displayed.

[Image: ap download page2]

I have merged Germplasm accession and name into one field and user can type in accession or name.

Need more information on Allowed Missing data.

Thanks!

@laceysanderson
Member Author

Looks good :-)

Comments

  • The trait name filter should not be a multi-select. We only want one trait to be downloadable at a time to keep people from stealing our data and to minimize server strain.
  • I would simplify "Tell us about the trait you are interested in" to "Select Trait"
  • Although the experiment filters are optional that fieldset should be open by default since it is highly recommended they filter by experiment ;-) This is because combining data across experiments is often not valid.
  • I love the merge of Germplasm accession and name 👍

Content for your Lorem ipsum's ;-)

  • Fieldset descriptions (for each fieldset in order)
    1. Indicate the trait you would like phenotypic data for by selecting the genus of the crop, as well as, the name of the trait below.
    2. It is highly recommended to restrict the dataset to a specific experiment. This can be done by entering the name of the experiment below (the name will autocomplete as you type). You can further filter by year and location if desired.
    3. We recommend you fill out as many of the following optional filters as possible to narrow the phenotype set to those you are most interested in.
    4. Select the format you would like the data exported in below.
  • Genus + Species form element: Select the genus of the crop you would like phenotypic data for. Additionally, the species can be indicated to further restrict the germplasm phenotypic data is exported for.
  • Maximum Allowed Missing Data: Enter the percent missing data per germplasm that you would like to allow. For example, a value of 20% will ensure that all germplasm exported have values for at least 80% of site-years this trait was observed in. If you further restrict the site-years exported using other filter criteria, this filter will be applied to the restricted dataset.

Clarification for "Maximum Allowed Missing Data"

Think of the dataset as a table where each row is a specific germplasm and each column is a location/year combination (site-year). This filter says to not export germplasm which have more than a given percentage of columns missing. For example, if the filter is set to 20% and there are 10 site-years in the exported dataset then only 2 columns per row can be empty. Any germplasm with more than that should not be added to the downloaded file. Thus this filter is easiest to apply while building the file (not in the select query).
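Applied while building the file, the filter might look like this minimal PHP sketch. It assumes the data has already been arranged as one row per germplasm with one entry per site-year; the function name `ap_filter_missing()` and the array layout are hypothetical.

```php
<?php
/**
 * Hypothetical helper: drop germplasm with too many empty site-year columns.
 *
 * @param array $table Rows keyed by germplasm name; each row maps site-year => value.
 * @param array $siteyears All site-year columns in the exported dataset.
 * @param float $max_missing_percent Maximum percent of empty columns allowed per row.
 * @return array The rows that pass the filter.
 */
function ap_filter_missing(array $table, array $siteyears, $max_missing_percent) {
  $kept = array();
  foreach ($table as $germplasm => $row) {
    $missing = 0;
    foreach ($siteyears as $siteyear) {
      // A column counts as missing when it is absent, NULL or an empty string.
      if (!isset($row[$siteyear]) || $row[$siteyear] === '') {
        $missing++;
      }
    }
    // Compare without dividing, so an exact 20% (2 of 10) is still allowed.
    if ($missing * 100 <= $max_missing_percent * count($siteyears)) {
      $kept[$germplasm] = $row;
    }
  }
  return $kept;
}
```

With 10 site-years and a setting of 20, a germplasm with 2 empty columns is kept and one with 3 is dropped, matching the example above.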

@carolyncaron
Member

carolyncaron commented Sep 11, 2017

I'm loving how Mockup 2 looks 👍

Regarding trait selection, I don't think we want to restrict the user to only select one. My initial concern was that someone may select ALL traits by default, which we definitely want to discourage for the reasons you mention, @laceysanderson. I think we still want the multi-select but to remove the "All Traits" option. While this doesn't prevent the user from selecting every single trait, it would require more work on their part and thus they are less likely to do it. ;-) My reasons for allowing multiple traits are entirely based on my experience with TASSEL and GAPIT, which I think inevitably will be part of downstream analysis for many users of this module.

  1. Both TASSEL and GAPIT allow multiple traits within a single trait input file. You can see TASSEL's manual here for expected formats for traits, their examples show 3-4 traits within a single table. This is related to your question about potential export formats - we probably want to take into consideration the input formats for GWAS software. Even though it is just as valid to upload traits in separate files to TASSEL or GAPIT, simply because it allows multiple traits, researchers will lean towards formatting their files the same way if their experiment involves multiple traits (and they often do, such as the latest GWAS I did with Hamid that involved seed shape, seed size and cotyledon colour).
  2. A big advantage to downloading multiple traits at once is that missing values will be included for germplasm that may not have any values for a particular trait. Thus, any stubborn researcher who wants all of their experimental data in a single table won't have to resort to using Excel to copy-and-paste columns and then manually insert missing data, which is sadly often error-prone...
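To illustrate the second point, here is a minimal PHP sketch that pivots long-format records into one row per germplasm with one column per trait, filling in a placeholder wherever a germplasm has no value. The function name `ap_pivot_traits()`, the record keys and the `NA` placeholder are assumptions, and replicates and site-years are ignored to keep the example short (a later record for the same germplasm and trait simply overwrites an earlier one).

```php
<?php
/**
 * Hypothetical helper: pivot long-format records into one row per germplasm.
 *
 * @param array $records Each record has 'germplasm', 'trait' and 'value' keys.
 * @param array $traits Trait names to export, in column order.
 * @param string $missing Placeholder written where a germplasm has no value.
 * @return array Rows keyed by germplasm, each mapping trait => value.
 */
function ap_pivot_traits(array $records, array $traits, $missing = 'NA') {
  $table = array();
  foreach ($records as $record) {
    if (!in_array($record['trait'], $traits)) {
      continue;
    }
    if (!isset($table[$record['germplasm']])) {
      // Start every germplasm with all traits marked missing.
      $table[$record['germplasm']] = array_fill_keys($traits, $missing);
    }
    $table[$record['germplasm']][$record['trait']] = $record['value'];
  }
  return $table;
}
```

Because every row starts with all requested traits marked missing, the researcher gets a complete table without having to insert the gaps by hand.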

It occurred to me that my reasons are based heavily on experimental design. Perhaps if we are really concerned about users downloading too many traits, we can somehow limit them to only download all traits within an experiment. So, if they don't choose an experiment, they can select one trait. Otherwise, they get the multi-select option for traits limited to their selected experiment. What do you think?

@carolyncaron
Member

Filtering for Maximum Allowed Missing Data

I'm sorry to say it, but I don't think we should provide an option to filter for missing data. :-( I think that initially, when we have a few experiments based on using mostly the same germplasm, the filter for maximum allowed missing data could be useful. But over time, I can see there being issues with something like this. For example, assume that we measure plant height for Redberry at every site-year from now for the next five years. In five years, we continue to take plant height for Redberry but also for a new variety that was just released. If you let a few years pass, you now have a substantial number of measurements for Redberry, and a much smaller but still reasonable number of measurements for the new variety. Someone may select 50% missing data allowed and filter out the new variety as a result. It becomes very difficult to offer filter options based on stats with an ever-expanding database. Filtering % missing data using our VCF_filter module works well because it is restricted to individual files.

I also think it is very important that if a researcher has requested phenotypic data for specific germplasm (either by specifying an experiment, or multi-selecting germplasm names) then we shouldn't provide an additional filter that could potentially filter those germplasm out. I would opt to allow the researcher to do their own filtering based on statistics for their specified dataset (and hopefully they will do this using something like R!).

Additional Filtering Criteria

I think what we have now is a good amount to start with. It's hard to think of additional filters that aren't stats-based. @laceysanderson already pointed this out through chat, and allowing the user to select whether they want replicates (if they have permission to) is a great addition. 👍

Export Options

I still really like the R-friendly format option in raw phenotypes, and I would appreciate seeing something like that here. Other than that, as far as what columns to include in the file, it is really tough to say since some of them will definitely appear redundant depending on filtering criteria. I always lean towards providing the most information we can, so that it is then up to the user to remove what they don't need, if necessary. @reynoldtan's suggestion in chat to allow the user to specify columns could be a great compromise to this, however! Perhaps all columns could be selected by default?


I was thinking that it might be worthwhile if Reynold and I set up a meeting time with Derek and/or other phenotype analysts to get their feedback on filtering criteria/export formats. I'm sure they will have a pretty good idea of what else they'd like to see (or, at least, what they don't like, lol).

@reynoldtan
Member

reynoldtan commented Sep 18, 2017

Mockup #3
[Image: mockup3]

Click the link below to preview the header picker.
https://myfiddle-reynoldltan.c9users.io/

Thanks!

@laceysanderson
Member Author

Addressed through multiple PRs
