Metadata/downloading QC checklist #29

Open
claraqin opened this issue Nov 30, 2020 · 6 comments

Comments

@claraqin
Owner

We need to make the following changes to the workflow, particularly in the Download NEON Data vignette, to prevent QC-related issues from complicating processes downstream.

  1. Metadata file(s) should be saved by default
  2. Check for pre-existing downloads
  3. Check for duplicate sample IDs (due to either re-sampling or labeling errors; in either case, choose which file to retain)
  4. Remove QC-flagged data
  5. Separate metadata into 16S or ITS - this could occur at the phyloseq step, or before downloading raw sequences

In addition, @lstanish suggests that it could be good to reorganize the columns in the metadata table so the most important columns come first. What are some columns to put first in the metadata table?

@claraqin
Owner Author

@lstanish has completed 1 and 5 in the above checklist.

Just a thought about the column order in the metadata table: Because the metadata consists of several stackByTable csv's joined together, the columns are primarily organized by which csv they came from. For example, the first several columns all have to do with the raw data files, and the next several have to do with sequencing. We could revise the order in which we join the csv's so that the columns correspond to the order in which processing took place in the lab, i.e. raw data files, then DNA extraction, then PCR amplification, and finally marker gene sequencing. I'm more agnostic as to the order of columns within these broader groupings.
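
As a rough illustration of moving priority columns to the front without changing the join order, here is a minimal base R sketch. The helper name `reorder_columns` and the toy column names other than `dnaSampleID` are hypothetical and not part of the package.

```r
# Hypothetical helper: move the named columns to the front, keep the rest in
# their existing order. Names in `first` that are absent are silently skipped.
reorder_columns <- function(meta, first) {
  lead <- intersect(first, names(meta))
  rest <- setdiff(names(meta), lead)
  meta[, c(lead, rest), drop = FALSE]
}

# Toy metadata; all column names except dnaSampleID are assumptions.
toy_meta <- data.frame(
  fileName    = c("a_R1.fastq", "b_R1.fastq"),
  dnaSampleID = c("S1", "S2"),
  collectDate = c("2019-06-01", "2019-06-02"),
  siteID      = c("siteA", "siteB"),
  stringsAsFactors = FALSE
)

reordered <- reorder_columns(toy_meta, c("dnaSampleID", "siteID", "notPresent"))
```

Keeping the priority list as a plain character vector would make it easy to revise once we agree on which columns matter most.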

@lstanish
Collaborator

lstanish commented Jan 14, 2021

@claraqin qcMetadata function ready for testing! Function is in the code folder. Current functionality:

  • can accept R data.frame or .csv file as input
  • checks for and removes duplicate sequence files
  • checks for and either flags or removes duplicate dnaSampleIDs in the same sequencing run
  • removes reverse read if forward read for the same sample is missing
  • removes forward read if user specified argument keepR2=TRUE
  • outputs .csv file into new QC_output folder
  • optional removal of records containing a NEON flag
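
The two duplicate checks in the list above can be sketched roughly as follows. This is not the qcMetadata code itself: the function name `flag_duplicates` and the columns `runID` and `fileName` are assumptions made for the example, and the toy table has one row per sample rather than separate forward and reverse rows.

```r
# Hypothetical sketch of the duplicate checks; not the actual qcMetadata code.
flag_duplicates <- function(meta, remove = FALSE) {
  # 1. Drop rows that point at the same sequence file more than once.
  meta <- meta[!duplicated(meta$fileName), , drop = FALSE]

  # 2. Flag dnaSampleIDs that occur more than once within the same run.
  key <- paste(meta$dnaSampleID, meta$runID, sep = "|")
  meta$duplicateInRun <- duplicated(key) | duplicated(key, fromLast = TRUE)

  # Optionally keep only the first record of each sample/run combination.
  if (remove) {
    meta <- meta[!duplicated(key), , drop = FALSE]
  }
  rownames(meta) <- NULL
  meta
}

# Toy metadata; runID and fileName are assumed column names.
toy_meta <- data.frame(
  dnaSampleID = c("S1", "S1", "S2", "S2", "S3", "S3"),
  runID       = c("runA", "runA", "runA", "runB", "runA", "runA"),
  fileName    = c("S1_R1.fastq", "S1_R1.fastq", "S2_R1.fastq",
                  "S2b_R1.fastq", "S3_R1.fastq", "S3b_R1.fastq"),
  stringsAsFactors = FALSE
)

flagged <- flag_duplicates(toy_meta)                 # flag only
removed <- flag_duplicates(toy_meta, remove = TRUE)  # flag, then drop repeats
```

In the toy data, S2 appears in two different runs and is left alone, which matches the current behavior of only checking within a sequencing run.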

To add:

  • check that F and R primers match across entire data set. Consider adding option for user to specify which primers to keep
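
One possible shape for the primer check, as a sketch only: tabulate the distinct forward/reverse primer pairs and report whether there is exactly one. The function name `check_primers`, the column names `forwardPrimer` and `reversePrimer`, and the placeholder primer values are all assumptions for illustration.

```r
# Hypothetical sketch: list the distinct primer pairs and report consistency.
check_primers <- function(meta, fwd_col = "forwardPrimer",
                          rev_col = "reversePrimer") {
  pairs <- unique(meta[, c(fwd_col, rev_col), drop = FALSE])
  rownames(pairs) <- NULL
  list(consistent = nrow(pairs) == 1, pairs = pairs)
}

# Toy metadata with placeholder primer labels (not real primer sequences).
toy_meta <- data.frame(
  dnaSampleID   = c("S1", "S2", "S3"),
  forwardPrimer = c("fwdA", "fwdA", "fwdB"),
  reversePrimer = c("revA", "revA", "revA"),
  stringsAsFactors = FALSE
)

primer_check <- check_primers(toy_meta)
```

Returning the table of pairs alongside the logical would give the user what they need to choose which primers to keep.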

Other functionality that would be good to add:

  • check for and either flag or remove duplicate dnaSampleIDs in different sequencing runs

Other tests to run:

  • feeding in different test datasets from downloadSequenceMetadata(), especially when targetGene="all", plus other tests that try to break the function
  • I wasn't able to test outDir when using the params.R file; I only tested a user-defined input. It should work, but I didn't test it.

@zoey-rw
Collaborator

zoey-rw commented Jan 15, 2021

@claraqin @lstanish Tested this function and pushed a small change: the output is now a dataframe, which can be used the same way as the input dataframe.

As you referenced, the QC function cannot handle a test dataset that was generated using targetGene="all". Perhaps in that case, the QC function could use a loop to essentially run twice, creating an ITS output and a 16S output, and combining them back into an "all" dataframe (or outputting both separately in a list format).
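
A minimal sketch of that idea, assuming the metadata has a `targetGene` column and that the QC step can be passed in as a function. The wrapper name `qc_by_gene` and the stand-in `toy_qc` are hypothetical; `toy_qc` is only a placeholder for the real QC function.

```r
# Hypothetical wrapper: run a QC function once per targetGene value, then
# either stack the results or return them as a named list.
# Note: split() drops rows whose targetGene is NA.
qc_by_gene <- function(meta, qc_fun, combine = TRUE) {
  pieces <- split(meta, meta$targetGene)
  results <- lapply(pieces, qc_fun)
  if (!combine) {
    return(results)
  }
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Stand-in QC step for the example: drop repeated dnaSampleIDs.
toy_qc <- function(df) df[!duplicated(df$dnaSampleID), , drop = FALSE]

toy_meta <- data.frame(
  dnaSampleID = c("A", "A", "A", "B"),
  targetGene  = c("16S", "ITS", "16S", "ITS"),
  stringsAsFactors = FALSE
)

combined  <- qc_by_gene(toy_meta, toy_qc)                   # one data frame
separated <- qc_by_gene(toy_meta, toy_qc, combine = FALSE)  # list by gene
```

Because sample "A" has both a 16S and an ITS record, running the QC per gene keeps both, whereas a single pass over the "all" table would treat one of them as a duplicate.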

@lstanish
Collaborator

@zoey-rw Thanks for making that update to output a dataframe as well as a hard-copy file! It's good to know that's a useful output. I am curious to know how this function will behave if you use the params file to output the QCed data; did you happen to test that?

Regarding making the function useful for targetGene='all', is this a useful feature? I'm wondering because the data need to be parsed by targetGene for dada2 and all of the downstream analyses keep the 16S and ITS data separate. It's definitely possible and wouldn't be hard to allow the function to QC 16S and ITS in the same function call, just wondering whether that's something that users will want to do.

@lstanish
Collaborator

lstanish commented Feb 9, 2021

@claraqin added in user option to remove records containing a NEON data flag (any of the qaqcStatus fields, and dataQF)
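
For anyone testing this option, the filtering logic can be pictured roughly as below. This is a sketch under stated assumptions, not the actual implementation: it assumes the qaqcStatus fields are columns whose names contain "qaqcStatus" with a failing value of "Fail", and that any non-empty `dataQF` value counts as a flag. The function name `remove_flagged` and the toy column and flag values are hypothetical.

```r
# Hypothetical sketch of flag-based removal; not the actual qcMetadata code.
remove_flagged <- function(meta) {
  flagged <- rep(FALSE, nrow(meta))

  # Assumption: any column with "qaqcStatus" in its name uses "Fail" as a flag.
  qa_cols <- grep("qaqcStatus", names(meta), value = TRUE)
  for (col in qa_cols) {
    flagged <- flagged | (!is.na(meta[[col]]) & meta[[col]] == "Fail")
  }

  # Assumption: a non-missing, non-empty dataQF entry means the record is flagged.
  if ("dataQF" %in% names(meta)) {
    flagged <- flagged | (!is.na(meta$dataQF) & nzchar(meta$dataQF))
  }

  meta[!flagged, , drop = FALSE]
}

# Toy metadata; the qaqcStatus column name and flag text are illustrative.
toy_meta <- data.frame(
  dnaSampleID      = c("S1", "S2", "S3", "S4"),
  toyQaqcStatus    = c("Pass", "Fail", "Pass", "Pass"),
  dataQF           = c(NA, NA, "exampleFlag", ""),
  stringsAsFactors = FALSE
)
names(toy_meta)[2] <- "sequencing.qaqcStatus"  # hypothetical joined column name

clean <- remove_flagged(toy_meta)
```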

@lstanish
Collaborator

lstanish commented Mar 14, 2021

@claraqin @zoey-rw Made one minor update to the error message if outDir="" and pushed the update.
Any luck testing any of the unchecked items above?
