Interval check for re-defined microhaps #112
…arker or locus names
@@ -179,6 +179,7 @@ They can be installed using pip and/or conda.
- ucsc-liftover
- selenium
- geckodriver
- intervaltree
New dependency for dbbuild.
Merging 'mh01NK-001' --> 'mh01NH-04'
Merging 'mh11USC-11pB' --> 'mh11PK-63643'
Merging 'mh13SHY-003' --> 'mh13KK-218'
Merging 'mh16FHL-004' --> 'mh16KK-259'
Merging 'mh02FHL-003' --> 'mh02KK-029'
Merging 'mh02FHL-010' --> 'mh02KK-014'
Merging 'mh02FHL-006' --> 'mh02ZHA-013'
Merging 'mh22USC-22qB' --> 'mh22KK-340'
Merging 'mh03SHY-003' --> 'mh03KK-017'
Merging 'mh05KK-020' --> 'mh05KK-023'
Merging 'mh05KK-121' --> 'mh05KK-120'
Merging 'mh09USC-9pA' --> 'mh09KK-010'
Merged loci now listed in the build summary.
def check(self):
    for chrom, tree in sorted(self.trees.items()):
        tree.merge_overlaps(data_reducer=lambda locus, marker: locus + [marker], data_initializer=list())
I've used the `intervaltree` package several times before, but I've never taken advantage of this `merge_overlaps` function. It's very convenient!
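For readers unfamiliar with the idea, here is a minimal stdlib-only sketch of what merging overlapping intervals with a list-accumulating reducer amounts to. It is not the `intervaltree` implementation and not the code in this PR; the function name `merge_overlapping`, the coordinates, and the marker labels are hypothetical.

```python
def merge_overlapping(intervals):
    """Merge overlapping (start, end, name) intervals.

    Each merged span carries the list of all names that fell into it,
    which mirrors accumulating markers into a per-locus list.
    """
    merged = []
    for start, end, name in sorted(intervals):
        # Strict overlap only: intervals that merely touch are kept apart.
        if merged and start < merged[-1][1]:
            prev_start, prev_end, names = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), names + [name])
        else:
            merged.append((start, end, [name]))
    return merged


# Hypothetical coordinates and labels, for illustration only.
example = [(300, 400, "markerC"), (100, 200, "markerA"), (150, 260, "markerB")]
loci = merge_overlapping(example)
```

Any merged span holding more than one name is a candidate for a redefined microhap.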
if marker.name in self.interval_index.mergeables:
    marker.name = self.interval_index.mergeables[marker.name]
Renaming of merged microhaps needs to happen before updating ambiguous marker names.
Derivative,Original
mh01NK-001,mh01NH-04
mh02FHL-010,mh02KK-014
mh02FHL-003,mh02KK-029
mh02FHL-006,mh02ZHA-013
mh03SHY-003,mh03KK-017
mh05KK-020,mh05KK-023
mh05KK-121,mh05KK-120
mh09USC-9pA,mh09KK-010
mh11USC-11pB,mh11PK-63643
mh13SHY-003,mh13KK-218
mh16FHL-004,mh16KK-259
mh22USC-22qB,mh22KK-340
New core database table.
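A rough sketch of how a table like this could be turned into a lookup from merged (derivative) names to retained names. The helper names `load_mergeables` and `resolve` are hypothetical and are not part of MicroHapDB; only three of the twelve rows are reproduced here.

```python
import csv
import io

# Three rows copied from the table above; the full table has twelve entries.
MERGED_CSV = """Derivative,Original
mh01NK-001,mh01NH-04
mh02FHL-006,mh02ZHA-013
mh11USC-11pB,mh11PK-63643
"""


def load_mergeables(handle):
    """Map each derivative (merged-away) name to the name that is kept."""
    return {row["Derivative"]: row["Original"] for row in csv.DictReader(handle)}


def resolve(name, mergeables):
    """Return the retained name for a merged microhap, or the name unchanged."""
    return mergeables.get(name, name)


mergeables = load_mergeables(io.StringIO(MERGED_CSV))
```

With this mapping, `resolve("mh11USC-11pB", mergeables)` gives the retained name, while a name that was never merged passes through untouched.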
markers = pd.read_csv(input.tsv, sep="\t")
markers["Name"] = markers["Microhaplotype"].str.replace("zha", "ZHA-")
markers["Xref"] = None
markers["NumVars"] = markers["# SNPs"]
markers["Refr"] = None
markers["Chrom"] = markers["Position (build37)"].apply(lambda x: x.split(":")[0])
markers["Positions"] = None
markers["VarRef"] = markers["SNPs"].str.strip("/").str.replace("/", ";")
columns = ["Name", "Xref", "NumVars", "Refr", "Chrom", "Positions", "VarRef"]
markers = markers[columns].sort_values("Name")
markers.to_csv(output.csv, index=False)
New procedure that automates the marker compilation step, which was previously done by hand.
mh13KK-213,,3,,chr13,,rs8181845;rs679482;rs9510616
mh13KK-218,,4,,chr13,,rs1927847;rs9536429;rs7492234;rs9536430
mh13KK-223,,4,,chr13,,rs1192204;rs1192205;rs3825483;rs3825481
mh13ZHA-002,,4,,chr13,,rs72649485;rs12877457;rs9514021;rs9514022
mh14ZHA-006,,4,,chr14,,rs71205883;rs7160425;rs7161550;rs78689987
mh16KK-302,,4,,chr16,,rs1395579;rs1395580;rs1395582;rs9939248
mh16ZHA-003,,3,,chr16,,rs6498348;rs16960309;rs8052581
mh16ZHA-004,,4,,chr16,,rs34771585;rs72638292;rs4781308;rs4781311
mh16ZHA-006,,3,,chr16,,rs2966051;rs2914455;rs12922936
mh16ZHA-009,,4,,chr16,,rs76047588;rs11641186;rs11641193;rs80213582
Full complement of markers reported in Sun 2020 now included.
ids = set()
for ident in idents:
    locusnames = microhapdb.markers.Name.apply(lambda x: x.split(".")[0])
Initially, I supported microhap lookup by locus name with a pandas Series `str.contains` operation. But that was too permissive: it would allow `mh01` or `mh12KK` to be treated as identifiers and retrieve all IDs that begin with that prefix, which was not the intent of this update. So instead, I created a series of locus names (with the `.v?` suffix stripped off), and microhap ID queries must now match either a full locus name or a full marker name to return a result.
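To illustrate the difference, here is a plain-Python stand-in for the exact-match logic described in this comment. It is only a sketch: the function `match_ids` is hypothetical, the real code operates on a pandas Series, and the short name list is for illustration only.

```python
def locus_name(marker_name):
    """Strip a version suffix such as '.v1' to get the locus name."""
    return marker_name.split(".")[0]


def match_ids(ident, marker_names):
    """Return marker names whose full name or full locus name equals ident.

    A bare prefix such as 'mh02KK' matches nothing, unlike a substring search.
    """
    return sorted(
        name for name in marker_names
        if ident == name or ident == locus_name(name)
    )


# Illustrative subset of marker names.
names = ["mh02KK-031.v1", "mh02KK-031.v2", "mh02ZHA-013", "mh11PK-63643"]

by_locus = match_ids("mh02KK-031", names)      # both versions of the locus
by_marker = match_ids("mh02KK-031.v2", names)  # exactly one marker
by_prefix = match_ids("mh02KK", names)         # prefix only: no match
```

The substring approach would have returned both `mh02KK-031` markers for the query `mh02KK`; requiring equality against the full marker name or the stripped locus name avoids that.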
Recent changes to nomenclature provide for better handling of loci where the precise SNPs of interest (the marker definition) have been intentionally adjusted over time to take advantage of additional discriminating power.
In this PR, I've added an interval-based check for microhaps that have been unintentionally renamed/redefined in different studies. I've introduced a "merge" step in the build process, and a record of merged microhaps is now kept.
Tangentially, I also revisited Sun 2020 (#87) to include 11 markers that had been excluded previously. They were excluded because they were redundant definitions, and the MicroHapDB build process didn't explicitly handle such cases at the time. Now that the build process has been improved, the markers have been added back so that all source and provenance information can be correctly tracked.
The `microhapdb.Marker.standardize_ids` function was updated to support lookup by merged microhap ID. For example, looking up `mh11USC-11pB` will return `mh11PK-63643`, or `mh02FHL-006` will return `mh02ZHA-013`. Along with this, the function was also updated to return IDs for all associated markers if a locus ID is provided. For example, `mh02KK-031` returns `mh02KK-031.v1` and `mh02KK-031.v2`.
Closes #108.