Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include counts/observations with frequency data #151

Merged
merged 3 commits into from
Oct 25, 2023
Merged

Conversation

standage
Copy link
Member

@standage standage commented Oct 24, 2023

The microhapdb.frequencies table has never shown the number of haplotypes used to calculate frequency estimates for each population. For data sources providing precomputed frequency data, this number is provided in almost every case (with 1 or 2 exceptions). For the 1000 Genomes Project data, this number is computed directly during the database build procedure and can be emitted along with the frequency.

This PR updates the frequency table to include the counts, enabling e.g. the computing of conservative minimum allele frequencies for previously unobserved microhap alleles.

Closes #148.

@standage standage marked this pull request as ready for review October 25, 2023 00:54
@@ -112,20 +112,21 @@ It includes the following fields.
- `Population`: the unique identifer of the population
- `Allele`: the allele of each variant in the microhap, separated by pipe symbols
- `Frequency`: the frequency of the allele in the specified population (a real number between 0.0 and 1.0)
- `Count`: the total number of alleles (denominator) used to compute the given population frequency estimate
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the DB build docs to clarify that this count is not the numerator but the denominator of the frequency calculation.

@@ -73,6 +73,7 @@ def cleanup_frequencies(freq):
freq.loc[(freq.Marker.str.startswith("mh05KK-120")) & (freq.Source == "Kidd2018") & (freq.NumVars == 3), "Marker"] = "mh05KK-120.v1"
freq.loc[(freq.Marker.str.startswith("mh05KK-120")) & (freq.Source == "Kidd2018") & (freq.NumVars == 4), "Marker"] = "mh05KK-120.v2"
freq = freq.drop(columns="NumVars")
freq["Count"] = freq["Count"].astype("Int16")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is missing count data for 1 or 2 data sources, so we need to cast the Count column as a nullable integer type to prevent it from defaulting to a float (along with many unnecessary decimal places).

Comment on lines +155 to +163
total_count = sum(agg_tallies[marker].values())
for mhallele, agg_count in sorted(agg_tallies[marker].items()):
freq = agg_count / sum(agg_tallies[marker].values())
yield marker, "1KGP", mhallele, freq
freq = agg_count / total_count
yield marker, "1KGP", mhallele, freq, total_count
for population, haplocounts in sorted(popcounts.items()):
total_count = sum(haplocounts.values())
for mhallele, count in sorted(haplocounts.items()):
freq = count / sum(haplocounts.values())
yield marker, population, mhallele, freq
freq = count / total_count
yield marker, population, mhallele, freq, total_count
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These, the 1000 Genomes Project frequencies, are used most widely in MicroHapDB.

@@ -39,4 +39,5 @@ def reformat_frequencies(infile, outfile):
entry = (standardname, "MHDBP-48c2cfb2aa", haplotype, row.Frequency)
freqdata.append(entry)
freqtable = pd.DataFrame(freqdata, columns=["Marker", "Population", "Allele", "Frequency"])
freqtable["Count"] = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the studies with missing count data.

Comment on lines +64 to +72
counts_by_marker = dict()
for markerid, subset in freqs.groupby("MarkerName"):
counts_by_marker[markerid] = subset.NumberOfObservations.sum()
freqs.drop(columns=["NumberOfObservations"], inplace=True)
freqs["Haplotype"] = freqs["Haplotype"].apply(lambda x: "|".join(list(x)))
freqs["Population"] = "MHDBP-7c055e7ee8"
freqs.rename(columns={"MarkerName": "Marker", "Haplotype": "Allele"}, inplace=True)
freqs = freqs[["Marker", "Population", "Allele", "Frequency"]]
freqs["Count"] = freqs.Marker.apply(lambda x: counts_by_marker[x])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most cases, the required changes were delightfully simple.

@@ -59,4 +59,5 @@ rule frequencies:
entry = [row.Marker, pop.replace(" ", ""), allele, frequency]
table.append(entry)
frequencies = pd.DataFrame(table, columns=["Marker", "Population", "Allele", "Frequency"])
frequencies["Count"] = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other study with missing count data.

@@ -35,6 +35,7 @@ def compile_variant_map(markers):
merged = read_table("merged.csv")
populations = read_table("population.csv")
frequencies = read_table("frequency.csv.gz")
frequencies["Count"] = frequencies["Count"].astype("Int16")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Casting the Count column as a nullable integer type not only when writing to a file, but also when loading into memory.

Comment on lines +115 to +122
def test_counts_random():
freq = microhapdb.frequencies
pops = microhapdb.populations
for _ in range(5):
random_marker = choice(microhapdb.markers.Name)
random_population = choice(pops[pops.Source == "Byrska-Bishop2022"].ID.to_list())
subset = freq[(freq.Marker == random_marker) & (freq.Population == random_population)]
assert len(set(subset.Count)) == 1, subset
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 5 random markers, test that the allele frequencies for a given population all have the same allele count.

@standage standage merged commit c4f7bcc into master Oct 25, 2023
3 checks passed
@standage standage deleted the freq/counts branch October 25, 2023 01:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Include allele counts in frequency table
1 participant