-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
1000 genomes PCA data #20
Comments
Hi, I'm familiar with multi-qc and would love to have peddy included. The PCA background data does not vary between datasets, but it is, for now, only for hg19/grch37 datasets. Not sure if you found it, but the labels for the populations are: "AFR AMR EAS EUR SAS UNKNOWN" You could probably just convert that to a list of ancestries, but I uses the numbers as classes for scikit-learn. |
and, if there's a way I can distribute/store the data that's more amenable to external use (and still performant), I'm happy to discuss. |
Brilliant, thanks for the fast response @brentp. Good to know that the data doesn't vary - I pulled the data from the plot in your example report (see resulting json files), though the I hadn't thought about the fact that it was Human only - is the genome reported anywhere in the Peddy output so that I can check this? This example report shows what I've written so far, based on this example data from @leipzig. If there's anything else that you think I should put into the report then let me know (I don't have any experience of this type of analysis myself). Cheers, Phil |
ps. The labels are in the line you linked to, but are you generating the PCA co-ordinates differently for every run? I just pulled the |
peddy doesn't know the genome, so I can't report it. Everything but ancestry should work on any diploid genome. So, the ancestry info for the 1KG backgrounds will change each time; every time peddy is run, it subsets the 1KG sites (variants) to those that have been found in that cohort, then runs PCA on that particular subset of variants. (and then projects the incoming cohort onto those PCS). The resulting PCS are always output into the .html file, for the plot. So you could grab it from there. Also, there is an aggregate file called $prefix.peddy.ped that contains summary info from sex_check, het_check, and ped_check that might be the simplest to pull from.
|
Ok great, thanks for the feedback. I'll remove the current bundled data then. I like to avoid parsing data from HTML files where possible (it can get messy quickly) - any chance that Peddy could output this data (type, PC1, PC2 for all points plotted) to a I'm currently pulling all of the data in the MultiQC report from We've had some discussion over what to show in MultiQC/MultiQC#355 - I've already put in Phil |
I'll work on outputting a csv (or I can do JSON) of the 1KG PCs. A key utility of peddy is the inferred relationships (relatedness, IBS0, IBS2, etc) so if you had a way to optionally plot those (or to show the static png that peddy creates), that would be great. Those values are in ped_check.csv |
Thanks @brentp - sounds great! I'm in a bit of a rush to release this version of MultiQC, so I'll add the 1KG PCs in the next release if that's ok (JSON is also good). I've just tried to replicate your IBS2 v IBS0 plot in MultiQC - doesn't look quite as nice but is hopefully ok as a starting point. I've coloured by relatedness, but not the parent-sib / sib-sib information as I'm not sure where this comes from. I'll be away for a while after Christmas (hence the push to get a release out), but happy to continue improving the MultiQC output when I'm back. Let me know if I've got anything wrong or if there's anything I can do better in the module. Cheers, Phil |
Sounds good let's touch base in the new year after I get out the next release that outputs the 1kg pcs. |
I made this change and pushed a new release (0.2.9) to pypi. |
Ah nice - that was super fast, thanks! If you have any example data kicking around it would be great if you could submit it to ewels/MultiQC_TestData so I can see what to parse. Otherwise I'll try to get some myself when I get back in mid February. Thanks! Phil |
This is now in multiqc. |
Hi there,
@leipzig requested that I add a module to parse Peddy data for a tool I've written called MultiQC (see multiqc.info and ewels/MultiQC).
Quick question - does the PCA data shown in the background of the PCA Projection onto 1000 Genomes vary between datasets? See MultiQC/MultiQC#355 for the discussion - basically as these co-ordinates aren't exported by Peddy I've just taken a snapshot of that data, and I want to check that this is valid.
Many thanks,
Phil
The text was updated successfully, but these errors were encountered: