Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1000 genomes PCA data #20

Closed
ewels opened this issue Dec 19, 2016 · 12 comments
Closed

1000 genomes PCA data #20

ewels opened this issue Dec 19, 2016 · 12 comments

Comments

@ewels
Copy link

ewels commented Dec 19, 2016

Hi there,

@leipzig requested that I add a module to parse Peddy data for a tool I've written called MultiQC (see multiqc.info and ewels/MultiQC).

Quick question - does the PCA data shown in the background of the PCA Projection onto 1000 Genomes vary between datasets? See MultiQC/MultiQC#355 for the discussion - basically as these co-ordinates aren't exported by Peddy I've just taken a snapshot of that data, and I want to check that this is valid.

Many thanks,

Phil

@brentp
Copy link
Owner

brentp commented Dec 19, 2016

Hi, I'm familiar with multi-qc and would love to have peddy included.

The PCA background data does not vary between datasets, but it is, for now, only for hg19/grch37 datasets. Not sure if you found it, but the labels for the populations are: "AFR AMR EAS EUR SAS UNKNOWN"
and the indexes are here: https://github.com/brentp/peddy/blob/master/peddy/pca.py#L1
where, e.g. 3 would indicate that sample is EUR.

You could probably just convert that to a list of ancestries, but I uses the numbers as classes for scikit-learn.
Hope that helps and let me know any more questions.

@brentp brentp closed this as completed Dec 19, 2016
@brentp brentp reopened this Dec 19, 2016
@brentp
Copy link
Owner

brentp commented Dec 19, 2016

and, if there's a way I can distribute/store the data that's more amenable to external use (and still performant), I'm happy to discuss.

@ewels
Copy link
Author

ewels commented Dec 20, 2016

Brilliant, thanks for the fast response @brentp. Good to know that the data doesn't vary - I pulled the data from the plot in your example report (see resulting json files), though the UNKNOWN category didn't seem to have any data points?

I hadn't thought about the fact that it was Human only - is the genome reported anywhere in the Peddy output so that I can check this?

This example report shows what I've written so far, based on this example data from @leipzig. If there's anything else that you think I should put into the report then let me know (I don't have any experience of this type of analysis myself).

Cheers,

Phil

@ewels
Copy link
Author

ewels commented Dec 20, 2016

ps. The labels are in the line you linked to, but are you generating the PCA co-ordinates differently for every run? I just pulled the [x,y] plot co-ordinates for the MultiQC data, that was the bit I was unsure about. Apologies, my initial post wasn't very clear.

@brentp
Copy link
Owner

brentp commented Dec 20, 2016

peddy doesn't know the genome, so I can't report it. Everything but ancestry should work on any diploid genome.

So, the ancestry info for the 1KG backgrounds will change each time; every time peddy is run, it subsets the 1KG sites (variants) to those that have been found in that cohort, then runs PCA on that particular subset of variants. (and then projects the incoming cohort onto those PCS). The resulting PCS are always output into the .html file, for the plot. So you could grab it from there.

Also, there is an aggregate file called $prefix.peddy.ped that contains summary info from sex_check, het_check, and ped_check that might be the simplest to pull from.
The columns we use the most (other than those for relatedness) are:

  • het_call_rate
  • het_ratio
  • het_mean_depth
  • ancestry_prediction
  • sex_het_ratio

@ewels
Copy link
Author

ewels commented Dec 20, 2016

Ok great, thanks for the feedback. I'll remove the current bundled data then. I like to avoid parsing data from HTML files where possible (it can get messy quickly) - any chance that Peddy could output this data (type, PC1, PC2 for all points plotted) to a .csv file alongside the others?

I'm currently pulling all of the data in the MultiQC report from $prefix.peddy.ped as you say, with the exception of the sex_error column, which I take from the sex_check.csv file.

We've had some discussion over what to show in MultiQC/MultiQC#355 - I've already put in ancestry_prediction, sex_het_ratio and sex_error into the General Statistics table at the top of the report. I'd prefer to not add too many more columns to that top table (it can get very wide with other modules adding to it) - do you think it's worth adding a Peddy-specific table with extra columns?

Phil

@brentp
Copy link
Owner

brentp commented Dec 20, 2016

I'll work on outputting a csv (or I can do JSON) of the 1KG PCs.

A key utility of peddy is the inferred relationships (relatedness, IBS0, IBS2, etc) so if you had a way to optionally plot those (or to show the static png that peddy creates), that would be great. Those values are in ped_check.csv

@ewels
Copy link
Author

ewels commented Dec 21, 2016

Thanks @brentp - sounds great! I'm in a bit of a rush to release this version of MultiQC, so I'll add the 1KG PCs in the next release if that's ok (JSON is also good).

I've just tried to replicate your IBS2 v IBS0 plot in MultiQC - doesn't look quite as nice but is hopefully ok as a starting point. I've coloured by relatedness, but not the parent-sib / sib-sib information as I'm not sure where this comes from.

I'll be away for a while after Christmas (hence the push to get a release out), but happy to continue improving the MultiQC output when I'm back. Let me know if I've got anything wrong or if there's anything I can do better in the module.

Cheers,

Phil

@brentp
Copy link
Owner

brentp commented Dec 21, 2016

Sounds good let's touch base in the new year after I get out the next release that outputs the 1kg pcs.
Thanks!

brentp added a commit that referenced this issue Dec 22, 2016
@brentp
Copy link
Owner

brentp commented Dec 22, 2016

I made this change and pushed a new release (0.2.9) to pypi.
It goes to $prefix.background_pca.json.
Let me know if you need any changes.

@ewels
Copy link
Author

ewels commented Dec 23, 2016

Ah nice - that was super fast, thanks! If you have any example data kicking around it would be great if you could submit it to ewels/MultiQC_TestData so I can see what to parse. Otherwise I'll try to get some myself when I get back in mid February.

Thanks!

Phil

@brentp
Copy link
Owner

brentp commented Feb 12, 2018

This is now in multiqc.

@brentp brentp closed this as completed Feb 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants