Missing data for tetraploid multiparenting population #25

Open
PaulaEB opened this issue Dec 1, 2022 · 3 comments

@PaulaEB

PaulaEB commented Dec 1, 2022

Hello David,
Thanks for developing updog!
My project goal is to identify QTLs for pest resistance. We have a multiparent population similar to a NAM population (four pollen recipients and one pollen donor), which gives us four half-sib families. We are currently treating each family separately, but I'd like to know your thoughts on whether it's possible to use the whole population for genotype calling.

One last question, about missing data in the geno field: in the multidog$inddf output we don't see any missing data. Is this normal?

Thank you very much!
Paula E

@dcgerard
Owner

dcgerard commented Dec 9, 2022

Hey @PaulaEB,

Thanks for trying out {updog}!

I haven't gotten around to allowing for multiparent populations yet. Some things you can look into:

  1. Are the genotypes estimated to be the same for the same parent for runs on different populations?
  2. Are the sequencing error rates, allele biases, and overdispersions estimated to be about the same at the same SNP?

If the answer to both is yes, then combining the different populations would not help much: the benefit of a larger sample size is better estimates of the parent genotypes and those parameters.
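The two checks above could be sketched roughly as follows. This is a minimal illustration with made-up numbers, not output from a real run: `fam1` and `fam2` are hypothetical stand-ins for the per-SNP summaries of two separate family fits, the column names `snp`, `bias`, `seq`, and `od` are assumed to match what {updog} reports, and `p1geno` as the shared pollen donor's genotype column is also an assumption.

```r
# Toy stand-ins for the per-SNP summaries of two separate family runs.
# All values are invented; column names are assumptions (see text above).
fam1 <- data.frame(snp    = c("S1", "S2", "S3"),
                   bias   = c(1.02, 0.85, 1.30),
                   seq    = c(0.004, 0.006, 0.005),
                   od     = c(0.010, 0.020, 0.015),
                   p1geno = c(2, 1, 3))
fam2 <- data.frame(snp    = c("S1", "S2", "S3"),
                   bias   = c(0.98, 0.90, 0.70),
                   seq    = c(0.005, 0.006, 0.004),
                   od     = c(0.012, 0.018, 0.050),
                   p1geno = c(2, 1, 2))

# Line the two runs up SNP by SNP
both <- merge(fam1, fam2, by = "snp", suffixes = c(".fam1", ".fam2"))

# Check 1: is the shared parent called the same way in both runs?
both$same_parent_geno <- both$p1geno.fam1 == both$p1geno.fam2

# Check 2: are the nuisance parameters similar at the same SNP?
# Bias is compared on the log scale, assuming it is a ratio where 1 = no bias.
both$log_bias_diff <- log(both$bias.fam1) - log(both$bias.fam2)
both$seq_diff      <- both$seq.fam1 - both$seq.fam2
both$od_diff       <- both$od.fam1 - both$od.fam2

# Proportion of SNPs where the parent genotype agrees, and the SNPs that disagree
mean(both$same_parent_geno)
both[!both$same_parent_geno, c("snp", "p1geno.fam1", "p1geno.fam2")]
```

How large a difference counts as "about the same" is a judgment call; plotting one family's estimates against the other's is probably more informative than any fixed cutoff.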

As for the missing data: if an individual has NA listed, then {updog} should return NA in the output. If it has 0 listed for the read depth, then {updog} will impute the genotype from the prior distribution (which is the best you can do if you aren't using information from other SNPs). E.g., consider:

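library(updog)

# Case 1: individual 3 has a read depth of 0
refvec <- c(3, 4, 0, 8, 3)
sizevec <- c(10, 10, 0, 10, 10)
fout <- flexdog(refvec = refvec, sizevec = sizevec, ploidy = 4)
fout$geno
# With no reads, individual 3's posterior should match the estimated prior
plot(fout$postmat[3, ], fout$gene_dist)
abline(0, 1)

# Case 2: individual 3 is marked NA instead
refvec <- c(3, 4, NA, 8, 3)
sizevec <- c(10, 10, NA, 10, 10)
fout <- flexdog(refvec = refvec, sizevec = sizevec, ploidy = 4)
fout$geno  # individual 3 should now be NA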
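To see why a read depth of 0 gives back the prior, here is a small base-R sketch of the Bayes update. It assumes a plain binomial read model, which is a simplification: {updog}'s actual likelihood also accounts for allele bias, overdispersion, and sequencing error. The `prior` vector and the `posterior()` helper are hypothetical and only for illustration.

```r
ploidy <- 4
genos <- 0:ploidy

# Hypothetical prior over genotypes 0..4 (made-up numbers that sum to 1)
prior <- c(0.05, 0.25, 0.40, 0.25, 0.05)

# Posterior over genotypes under a plain binomial read model:
# P(ref reads | genotype k) = Binomial(size, k / ploidy)
posterior <- function(ref, size) {
  like <- dbinom(ref, size = size, prob = genos / ploidy)
  unnorm <- prior * like
  unnorm / sum(unnorm)
}

# Zero reads: the likelihood is 1 for every genotype, so nothing is learned
post_zero <- posterior(ref = 0, size = 0)
all.equal(post_zero, prior)

# Some reads: the posterior moves away from the prior
post_some <- posterior(ref = 3, size = 10)
round(post_some, 3)
```

With no reads the likelihood is flat, so the posterior mode is just the prior mode. That is why a genotype is still reported for a zero-depth individual.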

Best,
David

@PaulaEB
Author

PaulaEB commented Nov 3, 2023

Hello @dcgerard, many thanks for your clarification! I am returning to this data, and I would like zero-depth genotypes to stay missing, because GATK marks missing values as DP=0 (https://gatk.broadinstitute.org/hc/en-us/articles/6012243429531-GenotypeGVCFs-and-the-death-of-the-dot).

Is it possible to change that within updog, or should I do it in the VCF with another tool?

Thanks again
Paula

@dcgerard
Owner

dcgerard commented Nov 6, 2023

Hey @PaulaEB,

You can do that in R really easily.

E.g., suppose this is the matrix containing the read-depths:

sizemat <- matrix(c(0, 1, 2, 1,
                    1, 0, 1, 1,
                    1, 2, 1, 0), ncol = 4, byrow = TRUE)

Then we can convert those 0's to NA's via:

sizemat[sizemat == 0] <- NA
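If you also keep a matrix of reference-read counts, one option is to mask the same cells in both matrices so they stay consistent. The sketch below is only an illustration: `refmat`, its values, and the row and column names are made up, and whether masking the reference counts is strictly required is an assumption on my part.

```r
# Read depths (rows = SNPs, columns = individuals), with names for bookkeeping
sizemat <- matrix(c(0, 1, 2, 1,
                    1, 0, 1, 1,
                    1, 2, 1, 0), ncol = 4, byrow = TRUE,
                  dimnames = list(paste0("snp", 1:3), paste0("ind", 1:4)))

# Hypothetical reference-read counts with the same layout
refmat <- matrix(c(0, 1, 1, 0,
                   1, 0, 0, 1,
                   0, 2, 1, 0), ncol = 4, byrow = TRUE,
                 dimnames = dimnames(sizemat))

# Record the zero-depth cells once, then mask both matrices with the same index
zero_depth <- sizemat == 0
sizemat[zero_depth] <- NA
refmat[zero_depth] <- NA

# How many genotypes were masked per individual and per SNP
colSums(zero_depth)
rowSums(zero_depth)
```

Computing `zero_depth` before overwriting anything keeps the mask available afterwards, for example to report how much data each individual is missing.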

Cheers,
David


2 participants