Interval check for re-defined microhaps #112

standage · 2023-02-27T19:12:52Z

Recent changes to nomenclature provide for better handling of loci where the precise SNPs of interest (the marker definition) have been intentionally adjusted over time to take advantage of additional discriminating power.

In this PR, I've added an interval-based check for microhaps that have been unintentionally renamed/redefined in different studies. I've introduced a "merge" step in the build process, and a record of merged microhaps is now kept.

Tangentially, I also revisited Sun 2020 (#87) to include 11 markers that had previously been excluded. They were excluded because they were redundant definitions, which the MicroHapDB build process didn't explicitly handle at the time. Now that the build process has been improved, the markers have been added back so that all source and provenance information can be tracked correctly.

The microhapdb.Marker.standardize_ids function was updated to support lookup by merged microhap ID. For example, looking up mh11USC-11pB returns mh11PK-63643, and looking up mh02FHL-006 returns mh02ZHA-013. The function was also updated to return the IDs of all associated markers when a locus ID is provided. For example, mh02KK-031 returns mh02KK-031.v1 and mh02KK-031.v2.
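To make the lookup behavior described here concrete, the following is a minimal sketch and not the actual MicroHapDB implementation. The function name `standardize_ids_sketch`, the `MERGED` dict, and the `MARKERS` list are hypothetical stand-ins for the real tables, populated only with IDs mentioned in this PR; the real function's signature and return type may differ.

```python
# Hypothetical stand-ins for the MicroHapDB tables (assumption: the real
# tables are pandas DataFrames; plain Python containers are used here).
MERGED = {"mh11USC-11pB": "mh11PK-63643", "mh02FHL-006": "mh02ZHA-013"}
MARKERS = ["mh02KK-031.v1", "mh02KK-031.v2", "mh02ZHA-013", "mh11PK-63643"]


def standardize_ids_sketch(idents):
    """Resolve merged IDs, marker names, and locus names to marker names."""
    ids = set()
    for ident in idents:
        # A merged (derivative) ID is first replaced by the ID it was merged into.
        ident = MERGED.get(ident, ident)
        for marker in MARKERS:
            locus = marker.split(".")[0]  # drop a trailing .v1, .v2, ...
            # Only a full marker name or a full locus name counts as a match.
            if ident == marker or ident == locus:
                ids.add(marker)
    return sorted(ids)
```

Under these assumptions, `standardize_ids_sketch(["mh02KK-031"])` expands the locus ID to both versioned markers, while a bare prefix such as `mh02` matches nothing.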

Closes #108.

standage · 2023-02-28T16:36:00Z

dbbuild/README.md

@@ -179,6 +179,7 @@ They can be installed using pip and/or conda.
 - ucsc-liftover
 - selenium
 - geckodriver
+- intervaltree


New dependency for dbbuild.

standage · 2023-02-28T16:36:25Z

dbbuild/build-summary.txt

+Merging 'mh01NK-001' --> 'mh01NH-04'
+Merging 'mh11USC-11pB' --> 'mh11PK-63643'
+Merging 'mh13SHY-003' --> 'mh13KK-218'
+Merging 'mh16FHL-004' --> 'mh16KK-259'
+Merging 'mh02FHL-003' --> 'mh02KK-029'
+Merging 'mh02FHL-010' --> 'mh02KK-014'
+Merging 'mh02FHL-006' --> 'mh02ZHA-013'
+Merging 'mh22USC-22qB' --> 'mh22KK-340'
+Merging 'mh03SHY-003' --> 'mh03KK-017'
+Merging 'mh05KK-020' --> 'mh05KK-023'
+Merging 'mh05KK-121' --> 'mh05KK-120'
+Merging 'mh09USC-9pA' --> 'mh09KK-010'


Merged loci now listed in the build summary.

standage · 2023-02-28T16:37:34Z

dbbuild/lib/interval.py

+
+    def check(self):
+        for chrom, tree in sorted(self.trees.items()):
+            tree.merge_overlaps(data_reducer=lambda locus, marker: locus + [marker], data_initializer=list())


I've used the intervaltree package several times before, but I've never taken advantage of this merge_overlaps function. It's very convenient!
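For readers unfamiliar with the idea, here is a dependency-free sketch of what merging overlapping intervals while collecting their labels looks like. It is a plain-Python stand-in and not a call into intervaltree; the function name `merge_overlapping`, the tuple layout, and the coordinates in the usage note are invented for illustration. As far as I know, `merge_overlaps` only merges strictly overlapping intervals by default, and the sketch mirrors that.

```python
from collections import defaultdict


def merge_overlapping(markers):
    """Group markers whose intervals overlap on the same chromosome.

    `markers` is an iterable of (name, chrom, start, end) tuples with
    half-open coordinates. Returns a list of name lists, one per locus.
    """
    by_chrom = defaultdict(list)
    for name, chrom, start, end in markers:
        by_chrom[chrom].append((start, end, name))
    loci = []
    for chrom in sorted(by_chrom):
        current_end = None  # end of the locus currently being extended
        for start, end, name in sorted(by_chrom[chrom]):
            if current_end is not None and start < current_end:
                loci[-1].append(name)  # overlaps the running locus
                current_end = max(current_end, end)
            else:
                loci.append([name])  # no overlap: start a new locus
                current_end = end
    return loci
```

With made-up input such as `[("a", "chr1", 100, 200), ("b", "chr1", 150, 250), ("c", "chr1", 300, 400)]`, markers `a` and `b` end up in one locus and `c` in another. Any locus list with more than one name is a candidate for the merge step.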

standage · 2023-02-28T16:38:29Z

dbbuild/lib/source.py

+            if marker.name in self.interval_index.mergeables:
+                marker.name = self.interval_index.mergeables[marker.name]


Renaming of merged microhaps needs to happen before updating ambiguous marker names.

standage · 2023-02-28T16:38:51Z

dbbuild/merged.csv

+Derivative,Original
+mh01NK-001,mh01NH-04
+mh02FHL-010,mh02KK-014
+mh02FHL-003,mh02KK-029
+mh02FHL-006,mh02ZHA-013
+mh03SHY-003,mh03KK-017
+mh05KK-020,mh05KK-023
+mh05KK-121,mh05KK-120
+mh09USC-9pA,mh09KK-010
+mh11USC-11pB,mh11PK-63643
+mh13SHY-003,mh13KK-218
+mh16FHL-004,mh16KK-259
+mh22USC-22qB,mh22KK-340


New core database table.
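As a rough illustration of how this table could be consumed, the sketch below loads Derivative/Original pairs into a dict and uses it to rename a marker. The helper names `load_merged` and `canonical_name` are hypothetical, and only three of the twelve rows are reproduced; the actual loading code in MicroHapDB may look quite different.

```python
import csv
import io

# Three rows copied from the merged.csv diff; the real file has twelve.
MERGED_CSV = """Derivative,Original
mh01NK-001,mh01NH-04
mh02FHL-006,mh02ZHA-013
mh11USC-11pB,mh11PK-63643
"""


def load_merged(handle):
    """Map each derivative microhap ID to the original ID it was merged into."""
    return {row["Derivative"]: row["Original"] for row in csv.DictReader(handle)}


def canonical_name(name, mergeables):
    """Return the original name for a merged microhap, or the name unchanged."""
    return mergeables.get(name, name)


mergeables = load_merged(io.StringIO(MERGED_CSV))
```

Passing an open file handle instead of the `io.StringIO` wrapper would read the table from disk in the same way.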

standage · 2023-02-28T16:40:46Z

dbbuild/sources/sun2020/Snakefile

+        markers = pd.read_csv(input.tsv, sep="\t")
+        markers["Name"] = markers["Microhaplotype"].str.replace("zha", "ZHA-")
+        markers["Xref"] = None
+        markers["NumVars"] = markers["# SNPs"]
+        markers["Refr"] = None
+        markers["Chrom"] = markers["Position (build37)"].apply(lambda x: x.split(":")[0])
+        markers["Positions"] = None
+        markers["VarRef"] = markers["SNPs"].str.strip("/").str.replace("/", ";")
+        columns = ["Name", "Xref", "NumVars", "Refr", "Chrom", "Positions", "VarRef"]
+        markers = markers[columns].sort_values("Name")
+        markers.to_csv(output.csv, index=False)


New procedure automating the marker compilation that was previously done manually.

standage · 2023-02-28T16:41:22Z

dbbuild/sources/sun2020/marker.csv

+mh13KK-213,,3,,chr13,,rs8181845;rs679482;rs9510616
+mh13KK-218,,4,,chr13,,rs1927847;rs9536429;rs7492234;rs9536430
+mh13KK-223,,4,,chr13,,rs1192204;rs1192205;rs3825483;rs3825481
 mh13ZHA-002,,4,,chr13,,rs72649485;rs12877457;rs9514021;rs9514022
 mh14ZHA-006,,4,,chr14,,rs71205883;rs7160425;rs7161550;rs78689987
+mh16KK-302,,4,,chr16,,rs1395579;rs1395580;rs1395582;rs9939248
 mh16ZHA-003,,3,,chr16,,rs6498348;rs16960309;rs8052581
 mh16ZHA-004,,4,,chr16,,rs34771585;rs72638292;rs4781308;rs4781311
 mh16ZHA-006,,3,,chr16,,rs2966051;rs2914455;rs12922936
+mh16ZHA-009,,4,,chr16,,rs76047588;rs11641186;rs11641193;rs80213582


Full complement of markers reported in Sun 2020 now included.

standage · 2023-02-28T16:44:27Z

microhapdb/marker.py


        ids = set()
        for ident in idents:
+            locusnames = microhapdb.markers.Name.apply(lambda x: x.split(".")[0])


Initially, I supported microhap lookup by locus name with a pandas Series str.contains operation. But that was a bit too permissive: it would allow mh01 or mh12KK to be treated as identifiers and retrieve all IDs that begin with that prefix, which was not the intent of this update. So instead, I created a Series of locus names (with the .v? suffix stripped off), and now a microhap ID query must match either a full locus name or a full marker name to return a result.
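The difference between the two matching strategies can be sketched in plain Python. This is an illustration under assumptions: the real code operates on a pandas Series, the substring test below only approximates `str.contains`, and the names `match_permissive` and `match_strict` are hypothetical. The marker names are ones that appear in this PR.

```python
MARKER_NAMES = [
    "mh02KK-014",
    "mh02KK-029",
    "mh02KK-031.v1",
    "mh02KK-031.v2",
    "mh02ZHA-013",
]


def match_permissive(query):
    """Substring match, roughly what a str.contains lookup would do."""
    return [name for name in MARKER_NAMES if query in name]


def match_strict(query):
    """Match only a full marker name or a full locus name (.v? stripped)."""
    return [
        name
        for name in MARKER_NAMES
        if query == name or query == name.split(".")[0]
    ]
```

Here `match_permissive("mh02KK")` picks up every mh02KK marker, whereas `match_strict("mh02KK")` returns nothing and `match_strict("mh02KK-031")` returns just the two versioned definitions of that locus.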

Daniel Standage added 4 commits February 27, 2023 12:19

Interval check for redundant MH names

6224d02

Restore Sun 2020

f7bf25c

Frequencies and rebuild

7e3638a

New build, test suite update

11d5dee

standage added the enhancement and datasources labels Feb 27, 2023

Update standardize_id function to make sure that queries are either m…

1ced747

…arker or locus names

standage marked this pull request as ready for review February 28, 2023 16:35

standage commented Feb 28, 2023

standage merged commit e9d5f63 into master Feb 28, 2023

standage deleted the intervalcheck branch February 28, 2023 16:45

standage mentioned this pull request Mar 8, 2023

New nomenclature module #119

Merged
