In [1]:
import pandas
from glob import glob
import os

---

Find all results files and read them in using pandas:

In [2]:
files = glob('../output/*/*.variants.tsv')

In [41]:
# Skip result files of at most one byte, i.e. effectively empty ones.
df = pandas.concat(pandas.read_csv(f, sep="\t") for f in files if os.path.getsize(f) > 1)
df.to_csv("../output/cohort.variants.tsv", sep="\t")

---

Now we have a table containing all variants that we found. Keep in mind that the reported rows are a cross product of `Variant x Samples x CSQ` fields. That means a single variant in a VCF of three family members with 2 CSQ entries generates six rows in our table. However, all of these rows have an identical `var_id` column, which we can use to group the data.
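To make the row multiplication concrete, here is a minimal sketch on toy data. The variant ID, the sample names, and the `csq` column name are hypothetical; only `var_id` and `sample_id` are taken from the table above.

```python
from itertools import product

import pandas as pd

# Hypothetical toy data: one variant, a trio, and two CSQ annotations.
variants = ["chr1:12345:A>G"]
samples = ["child", "father", "mother"]
consequences = ["missense_variant", "intron_variant"]

# Every combination becomes one row, mirroring the layout described above.
toy = pd.DataFrame(
    list(product(variants, samples, consequences)),
    columns=["var_id", "sample_id", "csq"],
)

# Grouping by var_id collapses the duplicated rows back to one variant.
rows_per_variant = toy.groupby("var_id").size()
samples_per_variant = toy.groupby("var_id")["sample_id"].nunique()
```

Here `rows_per_variant` reports six rows for the single variant, while `samples_per_variant` reports the three family members.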

First, though, we subset the data to contain only child data. Since parent data is also included, a de novo filter can be applied easily by checking the parental alleles of the variant.

In [4]:
fam = pandas.read_csv("../library/inova.fam", sep="\t", names=[ "family_id", "sample_id", "father_id", "mother_id", "sex", "is_affected"])
children = set(fam[(fam['mother_id'] != '0') & (fam['father_id'] != '0')].sample_id)

In [5]:
children_only = df[df.sample_id.str.match('|'.join(children))]
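One caveat with the regex-based subset: `str.match` only anchors at the start of each string, so an ID that is a prefix of another ID also matches. The sketch below illustrates this on hypothetical sample IDs and shows `isin` as an exact-match alternative. Whether exact matching suits the real table depends on the `sample_id` values agreeing exactly with the IDs in the `.fam` file, which is an assumption here.

```python
import pandas as pd

# Hypothetical sample IDs: "S1" is a child, "S10" is a parent.
toy = pd.DataFrame({"sample_id": ["S1", "S10", "S2"]})
toy_children = {"S1"}

# str.match anchors only at the start, so "S10" also matches the pattern "S1".
by_regex = toy[toy.sample_id.str.match("|".join(toy_children))]

# isin compares whole values and needs no regex escaping of the IDs.
by_exact = toy[toy.sample_id.isin(toy_children)]
```

In this toy case `by_regex` keeps both `S1` and `S10`, whereas `by_exact` keeps only `S1`.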

## Total number of variants detected

After filtering, we can group by `var_id` to get the real number of non-duplicated rows, i.e. the number of unique (variant, sample) pairs. Dividing by the number of children gives the ratio of `Variants per Offspring`, which can be interpreted as the probability that a child is a carrier of a mutation, under the assumption that every sample carries at most one mutation. This assumption most likely does not hold, and we correct for it below.

In [15]:
# For each variant, count the distinct children that carry it, then sum over all variants.
mutations = sum(map(lambda x: len(x[1].groupby('sample_id')), children_only.groupby('var_id')))
mutations

249

In [16]:
mutations / len(children)

0.1952941176470588

Some samples carry two or more variants. Counting carriers, i.e. samples with at least one variant, instead of counting variants accounts for this.

In [20]:
# Number of distinct variants for each child that carries at least one variant
mut_per_carrier = list(map(lambda x: len(x[1].groupby('var_id')), children_only.groupby('sample_id')))
carriers = len(mut_per_carrier)
_mut = sum(mut_per_carrier)
assert _mut == mutations  # validates the above result: both counts must agree

In [24]:
carriers / len(children)

0.17333333333333334
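The difference between the two ratios can be illustrated on a small toy table. This is only a sketch: the IDs and the cohort size `n_children` are made up, and `drop_duplicates` is used here as a compact stand-in for the nested `groupby` calls above.

```python
import pandas as pd

# Hypothetical child rows: c1 carries two variants and c2 carries one.
# Each (variant, sample) pair appears twice because of two CSQ entries.
toy = pd.DataFrame({
    "var_id":    ["v1", "v1", "v2", "v2", "v3", "v3"],
    "sample_id": ["c1", "c1", "c1", "c1", "c2", "c2"],
})
n_children = 4  # hypothetical cohort size, including two non-carriers

# Collapse the CSQ duplication to unique (variant, sample) pairs.
pairs = toy[["var_id", "sample_id"]].drop_duplicates()
n_mutations = len(pairs)
n_carriers = pairs["sample_id"].nunique()

variants_per_offspring = n_mutations / n_children
carrier_fraction = n_carriers / n_children
```

With these toy values there are three mutations but only two carriers, so the variants-per-offspring ratio (0.75) overstates the carrier fraction (0.5) as soon as one child carries more than one variant.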