Include counts/observations with frequency data #151

standage · 2023-10-24T20:16:41Z

The `microhapdb.frequencies` table has never reported the number of haplotypes used to calculate the frequency estimates for each population. Data sources that provide precomputed frequencies include this number in all but two cases. For the 1000 Genomes Project data, the number is computed directly during the database build procedure and can be emitted along with the frequency.

This PR adds the counts to the frequency table. This makes it possible, for example, to compute conservative minimum allele frequencies for previously unobserved microhap alleles.
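As a rough illustration of that use case, the sketch below applies one common convention for a frequency floor, `5 / (2N)`, where `2N` is the number of haplotypes sampled (the `Count` value). The function name, the `min_observations` default, and the choice of convention are assumptions made for illustration; none of them is part of MicroHapDB.

```python
import pandas as pd


def conservative_min_frequency(count, min_observations=5):
    """Return a floor frequency for an unobserved allele, or None if count is missing.

    Hypothetical helper: applies the min_observations / count convention,
    where count is the total number of haplotypes behind the estimate.
    """
    # Sources without count data carry a missing value in the Count column.
    if pd.isna(count) or count <= 0:
        return None
    # Cap at 1.0 so very small samples never yield an impossible frequency.
    return min(1.0, min_observations / int(count))


floor = conservative_min_frequency(5008)  # 5 / 5008 for a toy count of 5008
```

Returning `None` for a missing count keeps the two sources without count data from silently producing a misleading floor.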

Closes #148.

standage · 2023-10-25T00:54:58Z

dbbuild/README.md

@@ -112,20 +112,21 @@ It includes the following fields.
 - `Population`: the unique identifer of the population
 - `Allele`: the allele of each variant in the microhap, separated by pipe symbols
 - `Frequency`: the frequency of the allele in the specified population (a real number between 0.0 and 1.0)
+- `Count`: the total number of alleles (denominator) used to compute the given population frequency estimate


Updated the DB build docs to clarify that this count is not the numerator but the denominator of the frequency calculation.
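Because `Count` is the denominator, the number of times an individual allele was observed can be approximated as `Frequency * Count`. A minimal sketch, assuming the frequency was not heavily rounded by the original source (the helper name is hypothetical):

```python
def observed_allele_count(frequency, count):
    """Approximate the numerator: how many times this allele was observed."""
    # Rounding absorbs floating-point error in the stored frequency.
    return round(frequency * count)


print(observed_allele_count(0.25, 5008))  # 1252
```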

standage · 2023-10-25T00:56:14Z

dbbuild/build.py

@@ -73,6 +73,7 @@ def cleanup_frequencies(freq):
    freq.loc[(freq.Marker.str.startswith("mh05KK-120")) & (freq.Source == "Kidd2018") & (freq.NumVars == 3), "Marker"] = "mh05KK-120.v1"
    freq.loc[(freq.Marker.str.startswith("mh05KK-120")) & (freq.Source == "Kidd2018") & (freq.NumVars == 4), "Marker"] = "mh05KK-120.v2"
    freq = freq.drop(columns="NumVars")
+    freq["Count"] = freq["Count"].astype("Int16")


Count data is missing for two data sources, so the Count column must be cast to a nullable integer type. Otherwise it would default to a float, along with many unnecessary decimal places.
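The effect of the cast can be seen on a small toy column (the values below are invented for illustration):

```python
import pandas as pd

# A count column with one missing value is inferred as float64.
counts = pd.Series([5008, None, 198])
print(counts.dtype)  # float64

# The nullable Int16 dtype keeps whole numbers and stores the gap as <NA>.
nullable = counts.astype("Int16")
print(nullable.dtype)  # Int16
```

Note that `Int16` tops out at 32767, so this choice assumes no population's total count exceeds that value.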

standage · 2023-10-25T00:57:26Z

dbbuild/sources/byrskabishop2022/util.py

+        total_count = sum(agg_tallies[marker].values())
        for mhallele, agg_count in sorted(agg_tallies[marker].items()):
-            freq = agg_count / sum(agg_tallies[marker].values())
-            yield marker, "1KGP", mhallele, freq
+            freq = agg_count / total_count
+            yield marker, "1KGP", mhallele, freq, total_count
        for population, haplocounts in sorted(popcounts.items()):
+            total_count = sum(haplocounts.values())
            for mhallele, count in sorted(haplocounts.items()):
-                freq = count / sum(haplocounts.values())
-                yield marker, population, mhallele, freq
+                freq = count / total_count
+                yield marker, population, mhallele, freq, total_count


These 1000 Genomes Project frequencies are the most widely used in MicroHapDB.
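The pattern in the diff above, computing the total once and emitting it with every frequency, can be sketched in isolation. The function name and the toy tallies are hypothetical; only the logic mirrors the change.

```python
def frequencies_with_counts(haplocounts):
    """Yield (allele, frequency, total_count) for one marker in one population."""
    # Sum once, rather than once per allele as the old code did.
    total_count = sum(haplocounts.values())
    for allele, count in sorted(haplocounts.items()):
        yield allele, count / total_count, total_count


toy_tallies = {"A|T": 1, "A|C": 3}  # invented haplotype tallies
rows = list(frequencies_with_counts(toy_tallies))
print(rows)  # [('A|C', 0.75, 4), ('A|T', 0.25, 4)]
```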

standage · 2023-10-25T00:57:50Z

dbbuild/sources/chen2019/util.py

@@ -39,4 +39,5 @@ def reformat_frequencies(infile, outfile):
        entry = (standardname, "MHDBP-48c2cfb2aa", haplotype, row.Frequency)
        freqdata.append(entry)
    freqtable = pd.DataFrame(freqdata, columns=["Marker", "Population", "Allele", "Frequency"])
+    freqtable["Count"] = None


One of the studies with missing count data.

standage · 2023-10-25T00:59:10Z

dbbuild/sources/staadig2021/util.py

+    counts_by_marker = dict()
+    for markerid, subset in freqs.groupby("MarkerName"):
+        counts_by_marker[markerid] = subset.NumberOfObservations.sum()
    freqs.drop(columns=["NumberOfObservations"], inplace=True)
    freqs["Haplotype"] = freqs["Haplotype"].apply(lambda x: "|".join(list(x)))
    freqs["Population"] = "MHDBP-7c055e7ee8"
    freqs.rename(columns={"MarkerName": "Marker", "Haplotype": "Allele"}, inplace=True)
    freqs = freqs[["Marker", "Population", "Allele", "Frequency"]]
+    freqs["Count"] = freqs.Marker.apply(lambda x: counts_by_marker[x])


In most cases, the required changes were delightfully simple.
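For reference, the per-marker total could also be broadcast onto each row with a grouped `transform`, without an intermediate dictionary. This is an alternative sketch on invented data, not what the PR does; the column names are taken from the diff above and the marker names are placeholders.

```python
import pandas as pd

freqs = pd.DataFrame({
    "MarkerName": ["mhA", "mhA", "mhB"],  # placeholder marker names
    "NumberOfObservations": [30, 10, 25],
})

# Sum observations within each marker and repeat the total on every row.
freqs["Count"] = freqs.groupby("MarkerName")["NumberOfObservations"].transform("sum")
print(freqs["Count"].tolist())  # [40, 40, 25]
```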

standage · 2023-10-25T00:59:43Z

dbbuild/sources/zou2022/Snakefile

@@ -59,4 +59,5 @@ rule frequencies:
                entry = [row.Marker, pop.replace(" ", ""), allele, frequency]
                table.append(entry)
        frequencies = pd.DataFrame(table, columns=["Marker", "Population", "Allele", "Frequency"])
+        frequencies["Count"] = None


The other study with missing count data.

standage · 2023-10-25T01:00:34Z

microhapdb/tables.py

@@ -35,6 +35,7 @@ def compile_variant_map(markers):
 merged = read_table("merged.csv")
 populations = read_table("population.csv")
 frequencies = read_table("frequency.csv.gz")
+frequencies["Count"] = frequencies["Count"].astype("Int16")


The Count column is cast to a nullable integer type not only when writing to a file, but also when loading into memory.
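The cast on load is needed because a CSV file does not record column types. As a sketch of an equivalent approach with plain `pandas.read_csv` (the signature of MicroHapDB's own `read_table` helper is not shown here, so this deliberately avoids it), the dtype can be requested at read time. The CSV content is invented.

```python
import io

import pandas as pd

# Toy frequency table: the second row has no count.
csv_text = (
    "Marker,Population,Allele,Frequency,Count\n"
    "mhA,popX,A|C,0.75,4\n"
    "mhA,popY,A|C,0.5,\n"
)

# Request the nullable dtype up front so the empty field becomes <NA>.
table = pd.read_csv(io.StringIO(csv_text), dtype={"Count": "Int16"})
print(table["Count"].dtype)  # Int16
```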

standage · 2023-10-25T01:01:35Z

microhapdb/tests/test_frequency.py

+def test_counts_random():
+    freq = microhapdb.frequencies
+    pops = microhapdb.populations
+    for _ in range(5):
+        random_marker = choice(microhapdb.markers.Name)
+        random_population = choice(pops[pops.Source == "Byrska-Bishop2022"].ID.to_list())
+        subset = freq[(freq.Marker == random_marker) & (freq.Population == random_population)]
+        assert len(set(subset.Count)) == 1, subset


For 5 random markers, test that the allele frequencies for a given population all have the same allele count.
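The same invariant can be checked exhaustively rather than by sampling. The sketch below does so on an invented table; running it against the full frequency table would be an assumption about acceptable test runtime.

```python
import pandas as pd

freq = pd.DataFrame({
    "Marker": ["mhA", "mhA", "mhB"],  # placeholder marker names
    "Population": ["popX", "popX", "popX"],
    "Count": [4, 4, 6],
})

# Number of distinct counts per (marker, population) pair.
# nunique() ignores missing values, so groups without counts report 0.
distinct = freq.groupby(["Marker", "Population"])["Count"].nunique()
assert (distinct <= 1).all()
```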

Daniel Standage added 3 commits October 24, 2023 16:15

Counts for all frequency data (sans 1KGP)

f0b9b5b

The big one: 1000 Genomes Project frequencies

4a57dec

Column data type; test suite updates

4f94a77

standage marked this pull request as ready for review October 25, 2023 00:54

standage merged commit c4f7bcc into master Oct 25, 2023
3 checks passed

standage deleted the freq/counts branch October 25, 2023 01:04
