1000 genomes PCA data #20

ewels · 2016-12-19T16:35:16Z

Hi there,

@leipzig requested that I add a module to parse Peddy data for a tool I've written called MultiQC (see multiqc.info and ewels/MultiQC).

Quick question - does the PCA data shown in the background of the PCA Projection onto 1000 Genomes vary between datasets? See MultiQC/MultiQC#355 for the discussion - basically as these co-ordinates aren't exported by Peddy I've just taken a snapshot of that data, and I want to check that this is valid.

Many thanks,

Phil

brentp · 2016-12-19T16:50:01Z

Hi, I'm familiar with multi-qc and would love to have peddy included.

The PCA background data does not vary between datasets, but it is, for now, only for hg19/grch37 datasets. Not sure if you found it, but the labels for the populations are: "AFR AMR EAS EUR SAS UNKNOWN"
and the indexes are here: https://github.com/brentp/peddy/blob/master/peddy/pca.py#L1
where, e.g. 3 would indicate that sample is EUR.

You could probably just convert that to a list of ancestries, but I uses the numbers as classes for scikit-learn.
Hope that helps and let me know any more questions.

brentp · 2016-12-19T18:30:23Z

and, if there's a way I can distribute/store the data that's more amenable to external use (and still performant), I'm happy to discuss.

ewels · 2016-12-20T10:07:22Z

Brilliant, thanks for the fast response @brentp. Good to know that the data doesn't vary - I pulled the data from the plot in your example report (see resulting json files), though the UNKNOWN category didn't seem to have any data points?

I hadn't thought about the fact that it was Human only - is the genome reported anywhere in the Peddy output so that I can check this?

This example report shows what I've written so far, based on this example data from @leipzig. If there's anything else that you think I should put into the report then let me know (I don't have any experience of this type of analysis myself).

Cheers,

Phil

ewels · 2016-12-20T10:11:12Z

ps. The labels are in the line you linked to, but are you generating the PCA co-ordinates differently for every run? I just pulled the [x,y] plot co-ordinates for the MultiQC data, that was the bit I was unsure about. Apologies, my initial post wasn't very clear.

brentp · 2016-12-20T13:11:23Z

peddy doesn't know the genome, so I can't report it. Everything but ancestry should work on any diploid genome.

So, the ancestry info for the 1KG backgrounds will change each time; every time peddy is run, it subsets the 1KG sites (variants) to those that have been found in that cohort, then runs PCA on that particular subset of variants. (and then projects the incoming cohort onto those PCS). The resulting PCS are always output into the .html file, for the plot. So you could grab it from there.

Also, there is an aggregate file called $prefix.peddy.ped that contains summary info from sex_check, het_check, and ped_check that might be the simplest to pull from.
The columns we use the most (other than those for relatedness) are:

het_call_rate
het_ratio
het_mean_depth
ancestry_prediction
sex_het_ratio

ewels · 2016-12-20T17:59:35Z

Ok great, thanks for the feedback. I'll remove the current bundled data then. I like to avoid parsing data from HTML files where possible (it can get messy quickly) - any chance that Peddy could output this data (type, PC1, PC2 for all points plotted) to a .csv file alongside the others?

I'm currently pulling all of the data in the MultiQC report from $prefix.peddy.ped as you say, with the exception of the sex_error column, which I take from the sex_check.csv file.

We've had some discussion over what to show in MultiQC/MultiQC#355 - I've already put in ancestry_prediction, sex_het_ratio and sex_error into the General Statistics table at the top of the report. I'd prefer to not add too many more columns to that top table (it can get very wide with other modules adding to it) - do you think it's worth adding a Peddy-specific table with extra columns?

Phil

brentp · 2016-12-20T20:06:00Z

I'll work on outputting a csv (or I can do JSON) of the 1KG PCs.

A key utility of peddy is the inferred relationships (relatedness, IBS0, IBS2, etc) so if you had a way to optionally plot those (or to show the static png that peddy creates), that would be great. Those values are in ped_check.csv

ewels · 2016-12-21T15:41:01Z

Thanks @brentp - sounds great! I'm in a bit of a rush to release this version of MultiQC, so I'll add the 1KG PCs in the next release if that's ok (JSON is also good).

I've just tried to replicate your IBS2 v IBS0 plot in MultiQC - doesn't look quite as nice but is hopefully ok as a starting point. I've coloured by relatedness, but not the parent-sib / sib-sib information as I'm not sure where this comes from.

I'll be away for a while after Christmas (hence the push to get a release out), but happy to continue improving the MultiQC output when I'm back. Let me know if I've got anything wrong or if there's anything I can do better in the module.

Cheers,

Phil

brentp · 2016-12-21T23:15:39Z

Sounds good let's touch base in the new year after I get out the next release that outputs the 1kg pcs.
Thanks!

brentp · 2016-12-22T15:03:47Z

I made this change and pushed a new release (0.2.9) to pypi.
It goes to $prefix.background_pca.json.
Let me know if you need any changes.

ewels · 2016-12-23T08:23:00Z

Ah nice - that was super fast, thanks! If you have any example data kicking around it would be great if you could submit it to ewels/MultiQC_TestData so I can see what to parse. Otherwise I'll try to get some myself when I get back in mid February.

Thanks!

Phil

brentp · 2018-02-12T15:52:35Z

This is now in multiqc.

brentp closed this as completed Dec 19, 2016

brentp reopened this Dec 19, 2016

brentp added a commit that referenced this issue Dec 22, 2016

output json file of 1kg PCs. see #20

63b0262

brentp mentioned this issue Dec 23, 2016

example peddy 1kg data MultiQC/test-data#27

Closed

brentp closed this as completed Feb 12, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1000 genomes PCA data #20

1000 genomes PCA data #20

ewels commented Dec 19, 2016

brentp commented Dec 19, 2016

brentp commented Dec 19, 2016

ewels commented Dec 20, 2016

ewels commented Dec 20, 2016

brentp commented Dec 20, 2016

ewels commented Dec 20, 2016

brentp commented Dec 20, 2016

ewels commented Dec 21, 2016

brentp commented Dec 21, 2016

brentp commented Dec 22, 2016

ewels commented Dec 23, 2016

brentp commented Feb 12, 2018

1000 genomes PCA data #20

1000 genomes PCA data #20

Comments

ewels commented Dec 19, 2016

brentp commented Dec 19, 2016

brentp commented Dec 19, 2016

ewels commented Dec 20, 2016

ewels commented Dec 20, 2016

brentp commented Dec 20, 2016

ewels commented Dec 20, 2016

brentp commented Dec 20, 2016

ewels commented Dec 21, 2016

brentp commented Dec 21, 2016

brentp commented Dec 22, 2016

ewels commented Dec 23, 2016

brentp commented Feb 12, 2018