Metadata/downloading QC checklist #29

claraqin · 2020-11-30T21:53:51Z

We need to make the following changes to the workflow, particularly in the Download NEON Data vignette, to prevent QC-related issues from complicating processes downstream.

1. Metadata file(s) should be saved by default
2. Check for pre-existing downloads
3. Check for duplicate sample IDs (due to either re-sampling or labeling errors; in either case, choose which file to retain)
4. Remove QC-flagged data
5. Separate metadata into 16S or ITS; this could occur at the phyloseq step, or before downloading raw sequences

In addition, @lstanish suggests that it could be good to reorganize the columns in the metadata table so the most important columns come first. What are some columns to put first in the metadata table?
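
The duplicate sample ID check in the checklist above could be sketched roughly as follows. This is a minimal illustration in Python, not the package's own code; the `dnaSampleID` and `rawDataFileName` field names and the `find_duplicate_ids` helper are assumptions made for the example.

```python
from collections import defaultdict


def find_duplicate_ids(records, id_field="dnaSampleID"):
    """Return {sample_id: [records...]} for IDs that appear more than once."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[id_field]].append(rec)
    return {sid: recs for sid, recs in groups.items() if len(recs) > 1}


# Hypothetical metadata rows; the field names are assumed for illustration.
metadata = [
    {"dnaSampleID": "S1", "rawDataFileName": "a_R1.fastq"},
    {"dnaSampleID": "S2", "rawDataFileName": "b_R1.fastq"},
    {"dnaSampleID": "S1", "rawDataFileName": "c_R1.fastq"},
]

duplicates = find_duplicate_ids(metadata)
for sample_id, recs in duplicates.items():
    files = [r["rawDataFileName"] for r in recs]
    print(f"{sample_id}: {len(recs)} records -> {files}")
```

Reporting the duplicates, rather than silently dropping one, leaves the choice of which file to retain to the user, since re-sampling and labeling errors call for different decisions.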

claraqin · 2020-12-11T18:27:18Z

@lstanish has completed items 1 and 5 in the above checklist.

Just a thought about the column order in the metadata table: Because the metadata consists of several stackByTable CSVs joined together, the columns are primarily organized by which CSV they came from. For example, the first several columns all have to do with the raw data files, and the next several have to do with sequencing. We could revise the order in which we join the CSVs so that the columns correspond to the order in which processing took place in the lab, i.e. raw data files, then DNA extraction, then PCR amplification, and finally marker gene sequencing. I'm more agnostic as to the order of columns within these broader groupings.
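
As a rough illustration of that ordering, the sketch below sorts column names by processing stage while leaving the order within each stage untouched. It reorders columns after the join instead of changing the join order itself. The stage labels, the column-to-stage mapping, and the `order_columns` helper are all hypothetical.

```python
# Stage order proposed in the comment above; the labels are placeholders.
STAGE_ORDER = [
    "raw_data_files",
    "dna_extraction",
    "pcr_amplification",
    "marker_gene_sequencing",
]


def order_columns(columns, stage_of):
    """Sort columns by stage; columns with no known stage go last.

    sorted() is stable, so the existing order within a stage is preserved.
    """
    def rank(col):
        stage = stage_of.get(col)
        if stage in STAGE_ORDER:
            return STAGE_ORDER.index(stage)
        return len(STAGE_ORDER)

    return sorted(columns, key=rank)


# Hypothetical mapping of column name -> stage (assumed names).
stage_of = {
    "rawDataFileName": "raw_data_files",
    "dnaSampleID": "dna_extraction",
    "forwardPrimer": "pcr_amplification",
    "sequencerRunID": "marker_gene_sequencing",
}
current = ["sequencerRunID", "rawDataFileName", "forwardPrimer", "dnaSampleID", "notes"]
ordered = order_columns(current, stage_of)
print(ordered)
```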

lstanish · 2021-01-14T01:48:45Z

zoey-rw · 2021-01-15T15:51:54Z

@claraqin @lstanish Tested this function and pushed a small change: the output is now a dataframe, which can be used the same way as the input dataframe.

As you referenced, the QC function cannot handle a test dataset that was generated using targetGene="all". Perhaps in that case, the QC function could use a loop to essentially run twice, creating an ITS output and a 16S output, and combining them back into an "all" dataframe (or outputting both separately in a list format).
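
A minimal sketch of that loop, written in Python for illustration only: `qc_one_gene` is a stand-in for the real QC step, and the `dnaSampleID` field name and the `combine` switch are assumptions.

```python
def qc_one_gene(records):
    """Placeholder QC step: here it only drops rows with an empty sample ID."""
    return [r for r in records if r.get("dnaSampleID")]


def qc_all_genes(records, genes=("16S", "ITS"), combine=True):
    """Run the QC step once per target gene.

    Returns one combined list when combine=True, otherwise a dict
    keyed by target gene.
    """
    per_gene = {}
    for gene in genes:
        subset = [r for r in records if r["targetGene"] == gene]
        per_gene[gene] = qc_one_gene(subset)
    if combine:
        return [r for gene in genes for r in per_gene[gene]]
    return per_gene


# Hypothetical rows covering both target genes.
rows = [
    {"dnaSampleID": "S1", "targetGene": "16S"},
    {"dnaSampleID": "", "targetGene": "ITS"},
    {"dnaSampleID": "S2", "targetGene": "ITS"},
]

combined = qc_all_genes(rows)
separate = qc_all_genes(rows, combine=False)
print(len(combined), {gene: len(recs) for gene, recs in separate.items()})
```

Returning the per-gene results separately may fit better with downstream steps that process 16S and ITS independently, while the combined form mirrors the shape of the input.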

lstanish · 2021-01-26T01:30:18Z

@zoey-rw Thanks for making that update to output a dataframe as well as a hard-copy file! It's good to know that's a useful output. I am curious to know how this function will behave if you use the params file to output the QCed data. Did you happen to test that?

Regarding making the function useful for targetGene='all', is this a useful feature? I'm wondering because the data need to be parsed by targetGene for dada2 and all of the downstream analyses keep the 16S and ITS data separate. It's definitely possible and wouldn't be hard to allow the function to QC 16S and ITS in the same function call, just wondering whether that's something that users will want to do.

lstanish · 2021-02-09T00:33:43Z

@claraqin Added a user option to remove records containing a NEON data flag (any of the qaqcStatus fields, and dataQF).
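
An option of this kind might look roughly like the sketch below. The rule for what counts as flagged is an assumption made for illustration: a record is treated as flagged when any field whose name contains "qaqcStatus" equals "Fail", or when `dataQF` is non-empty. The `pcrQaqcStatus` column name and the helper names are hypothetical.

```python
def is_flagged(record):
    """Assumed rule: any *qaqcStatus* field equal to "Fail", or a non-empty dataQF."""
    qaqc_fail = any(
        value == "Fail"
        for key, value in record.items()
        if "qaqcstatus" in key.lower()
    )
    has_data_qf = bool(record.get("dataQF"))
    return qaqc_fail or has_data_qf


def remove_flagged(records, remove=True):
    """Drop flagged records when remove=True; otherwise return all of them."""
    if not remove:
        return list(records)
    return [r for r in records if not is_flagged(r)]


# Hypothetical rows; the flag values are assumed for illustration.
rows = [
    {"dnaSampleID": "S1", "pcrQaqcStatus": "Pass", "dataQF": ""},
    {"dnaSampleID": "S2", "pcrQaqcStatus": "Fail", "dataQF": ""},
    {"dnaSampleID": "S3", "pcrQaqcStatus": "Pass", "dataQF": "legacyData"},
]

kept = remove_flagged(rows)
print([r["dnaSampleID"] for r in kept])
```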

lstanish · 2021-03-14T21:51:24Z

@claraqin @zoey-rw Made one minor update to the error message if outDir="" and pushed the update.
Any luck testing any of the un-checked items above?
