-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include counts/observations with frequency data #151
Conversation
@@ -112,20 +112,21 @@ It includes the following fields. | |||
- `Population`: the unique identifer of the population | |||
- `Allele`: the allele of each variant in the microhap, separated by pipe symbols | |||
- `Frequency`: the frequency of the allele in the specified population (a real number between 0.0 and 1.0) | |||
- `Count`: the total number of alleles (denominator) used to compute the given population frequency estimate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the DB build docs to clarify that this count is not the numerator but the denominator of the frequency calculation.
@@ -73,6 +73,7 @@ def cleanup_frequencies(freq): | |||
freq.loc[(freq.Marker.str.startswith("mh05KK-120")) & (freq.Source == "Kidd2018") & (freq.NumVars == 3), "Marker"] = "mh05KK-120.v1" | |||
freq.loc[(freq.Marker.str.startswith("mh05KK-120")) & (freq.Source == "Kidd2018") & (freq.NumVars == 4), "Marker"] = "mh05KK-120.v2" | |||
freq = freq.drop(columns="NumVars") | |||
freq["Count"] = freq["Count"].astype("Int16") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is missing count data for 1 or 2 data sources, so we need to cast the Count
column as a nullable integer type to prevent it from defaulting to a float (along with many unnecessary decimal places).
total_count = sum(agg_tallies[marker].values()) | ||
for mhallele, agg_count in sorted(agg_tallies[marker].items()): | ||
freq = agg_count / sum(agg_tallies[marker].values()) | ||
yield marker, "1KGP", mhallele, freq | ||
freq = agg_count / total_count | ||
yield marker, "1KGP", mhallele, freq, total_count | ||
for population, haplocounts in sorted(popcounts.items()): | ||
total_count = sum(haplocounts.values()) | ||
for mhallele, count in sorted(haplocounts.items()): | ||
freq = count / sum(haplocounts.values()) | ||
yield marker, population, mhallele, freq | ||
freq = count / total_count | ||
yield marker, population, mhallele, freq, total_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These, the 1000 Genomes Project frequencies, are used most widely in MicroHapDB.
@@ -39,4 +39,5 @@ def reformat_frequencies(infile, outfile): | |||
entry = (standardname, "MHDBP-48c2cfb2aa", haplotype, row.Frequency) | |||
freqdata.append(entry) | |||
freqtable = pd.DataFrame(freqdata, columns=["Marker", "Population", "Allele", "Frequency"]) | |||
freqtable["Count"] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the studies with missing count data.
counts_by_marker = dict() | ||
for markerid, subset in freqs.groupby("MarkerName"): | ||
counts_by_marker[markerid] = subset.NumberOfObservations.sum() | ||
freqs.drop(columns=["NumberOfObservations"], inplace=True) | ||
freqs["Haplotype"] = freqs["Haplotype"].apply(lambda x: "|".join(list(x))) | ||
freqs["Population"] = "MHDBP-7c055e7ee8" | ||
freqs.rename(columns={"MarkerName": "Marker", "Haplotype": "Allele"}, inplace=True) | ||
freqs = freqs[["Marker", "Population", "Allele", "Frequency"]] | ||
freqs["Count"] = freqs.Marker.apply(lambda x: counts_by_marker[x]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In most cases, the required changes were delightfully simple.
@@ -59,4 +59,5 @@ rule frequencies: | |||
entry = [row.Marker, pop.replace(" ", ""), allele, frequency] | |||
table.append(entry) | |||
frequencies = pd.DataFrame(table, columns=["Marker", "Population", "Allele", "Frequency"]) | |||
frequencies["Count"] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other study with missing count data.
@@ -35,6 +35,7 @@ def compile_variant_map(markers): | |||
merged = read_table("merged.csv") | |||
populations = read_table("population.csv") | |||
frequencies = read_table("frequency.csv.gz") | |||
frequencies["Count"] = frequencies["Count"].astype("Int16") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Casting the Count
column as a nullable integer type not only when writing to a file, but also when loading into memory.
def test_counts_random(): | ||
freq = microhapdb.frequencies | ||
pops = microhapdb.populations | ||
for _ in range(5): | ||
random_marker = choice(microhapdb.markers.Name) | ||
random_population = choice(pops[pops.Source == "Byrska-Bishop2022"].ID.to_list()) | ||
subset = freq[(freq.Marker == random_marker) & (freq.Population == random_population)] | ||
assert len(set(subset.Count)) == 1, subset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 5 random markers, test that the allele frequencies for a given population all have the same allele count.
The
microhapdb.frequencies
table has never shown the number of haplotypes used to calculate frequency estimates for each population. For data sources providing precomputed frequency data, this number is provided in almost every case (with 1 or 2 exceptions). For the 1000 Genomes Project data, this number is computed directly during the database build procedure and can be emitted along with the frequency.This PR updates the frequency table to include the counts, enabling e.g. the computing of conservative minimum allele frequencies for previously unobserved microhap alleles.
Closes #148.