Interval check for re-defined microhaps #112

Merged
merged 5 commits into from Feb 28, 2023
@standage standage commented Feb 27, 2023

Recent changes to nomenclature provide for better handling of loci where the precise SNPs of interest (the marker definition) have been intentionally adjusted over time to take advantage of additional discriminating power.

In this PR, I've added an interval-based check for microhaps that have been unintentionally renamed/redefined in different studies. I've introduced a "merge" step in the build process, and a record of merged microhaps is now kept.

Tangentially, I also revisited Sun 2020 (#87) to include 11 markers that had been excluded previously. They were excluded because they were redundant definitions, and the MicroHapDB build process didn't explicitly handle such cases previously. Now that the build process has been improved, the markers were added back so that all source and provenance information can be correctly tracked.

The microhapdb.Marker.standardize_ids function was updated to support lookup by merged microhap ID. For example, looking up mh11USC-11pB returns mh11PK-63643, and looking up mh02FHL-006 returns mh02ZHA-013. The function was also updated to return the IDs of all associated markers when a locus ID is provided. For example, mh02KK-031 returns mh02KK-031.v1 and mh02KK-031.v2.
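
The lookup behavior described above can be sketched in plain Python. This is only an illustration, not the actual standardize_ids implementation: the `MERGED` and `MARKERS` tables and the `resolve_ids` helper are hypothetical stand-ins, and only the marker IDs are taken from this PR.

```python
# Hypothetical stand-ins for the MicroHapDB tables; only the IDs come from this PR.
MERGED = {"mh11USC-11pB": "mh11PK-63643", "mh02FHL-006": "mh02ZHA-013"}
MARKERS = ["mh02KK-031.v1", "mh02KK-031.v2", "mh02ZHA-013", "mh11PK-63643"]


def resolve_ids(idents):
    """Map each query to current marker names (illustrative, not the real API)."""
    resolved = set()
    for ident in idents:
        # A merged (derivative) ID is first redirected to the ID it was merged into.
        ident = MERGED.get(ident, ident)
        for name in MARKERS:
            locus = name.split(".")[0]  # strip a trailing ".v1", ".v2", ...
            # Accept a full marker name or a full locus name, nothing shorter.
            if ident in (name, locus):
                resolved.add(name)
    return sorted(resolved)
```

With these toy tables, a merged ID resolves to the ID that replaced it, a locus ID expands to every versioned marker at that locus, and a bare prefix such as mh02 matches nothing.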

Closes #108.

@standage standage added the enhancement and datasources labels Feb 27, 2023
@standage standage marked this pull request as ready for review February 28, 2023 16:35
@@ -179,6 +179,7 @@ They can be installed using pip and/or conda.
- ucsc-liftover
- selenium
- geckodriver
- intervaltree

New dependency for dbbuild.

Comment on lines +13 to +24
Merging 'mh01NK-001' --> 'mh01NH-04'
Merging 'mh11USC-11pB' --> 'mh11PK-63643'
Merging 'mh13SHY-003' --> 'mh13KK-218'
Merging 'mh16FHL-004' --> 'mh16KK-259'
Merging 'mh02FHL-003' --> 'mh02KK-029'
Merging 'mh02FHL-010' --> 'mh02KK-014'
Merging 'mh02FHL-006' --> 'mh02ZHA-013'
Merging 'mh22USC-22qB' --> 'mh22KK-340'
Merging 'mh03SHY-003' --> 'mh03KK-017'
Merging 'mh05KK-020' --> 'mh05KK-023'
Merging 'mh05KK-121' --> 'mh05KK-120'
Merging 'mh09USC-9pA' --> 'mh09KK-010'

Merged loci now listed in the build summary.


def check(self):
    for chrom, tree in sorted(self.trees.items()):
        tree.merge_overlaps(data_reducer=lambda locus, marker: locus + [marker], data_initializer=list())

I've used the intervaltree package several times before, but I've never taken advantage of this merge_overlaps function. It's very convenient!
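
For readers unfamiliar with the idea, here is a minimal plain-Python stand-in for what an overlap merge does: overlapping intervals are collapsed into one, and the data attached to each is accumulated into a list, much like the reducer in the diff above. It does not use the intervaltree package, the `merge_overlaps` function below is a hypothetical helper, and the coordinates are invented for illustration. How the build decides which overlapping markers actually get merged is not shown here.

```python
def merge_overlaps(intervals):
    """Merge overlapping half-open (begin, end, name) intervals.

    Returns a list of (begin, end, names), where names collects every
    marker whose interval was folded into that merged locus.
    """
    merged = []
    for begin, end, name in sorted(intervals):
        if merged and begin < merged[-1][1]:
            # Overlaps the previous locus: extend it and append the marker name.
            prev_begin, prev_end, names = merged[-1]
            merged[-1] = (prev_begin, max(prev_end, end), names + [name])
        else:
            # No overlap: start a new locus with a fresh list.
            merged.append((begin, end, [name]))
    return merged


# Hypothetical coordinates; only the marker names come from the build log.
chr5 = [(100, 180, "mh05KK-023"), (150, 220, "mh05KK-020"), (400, 460, "mh05KK-120")]
loci = merge_overlaps(chr5)
```

In this sketch, intervals that merely touch (one ends exactly where the next begins) stay separate.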

Comment on lines +169 to +170
if marker.name in self.interval_index.mergeables:
    marker.name = self.interval_index.mergeables[marker.name]

Renaming of merged microhaps needs to happen before updating ambiguous marker names.

Comment on lines +1 to +13
Derivative,Original
mh01NK-001,mh01NH-04
mh02FHL-010,mh02KK-014
mh02FHL-003,mh02KK-029
mh02FHL-006,mh02ZHA-013
mh03SHY-003,mh03KK-017
mh05KK-020,mh05KK-023
mh05KK-121,mh05KK-120
mh09USC-9pA,mh09KK-010
mh11USC-11pB,mh11PK-63643
mh13SHY-003,mh13KK-218
mh16FHL-004,mh16KK-259
mh22USC-22qB,mh22KK-340

New core database table.
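
A table in this Derivative,Original layout is straightforward to load into a lookup dictionary. The sketch below uses only the standard library; `load_merged` is a hypothetical helper, and the first three rows of the table are inlined so the example is self-contained rather than reading the real file.

```python
import csv
import io

# The first rows of the table shown above, inlined so the sketch is self-contained.
TABLE = """Derivative,Original
mh01NK-001,mh01NH-04
mh02FHL-010,mh02KK-014
mh02FHL-003,mh02KK-029
"""


def load_merged(handle):
    """Return a dict mapping each derivative ID to the ID it was merged into."""
    return {row["Derivative"]: row["Original"] for row in csv.DictReader(handle)}


merged = load_merged(io.StringIO(TABLE))
```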

Comment on lines +10 to +20
markers = pd.read_csv(input.tsv, sep="\t")
markers["Name"] = markers["Microhaplotype"].str.replace("zha", "ZHA-")
markers["Xref"] = None
markers["NumVars"] = markers["# SNPs"]
markers["Refr"] = None
markers["Chrom"] = markers["Position (build37)"].apply(lambda x: x.split(":")[0])
markers["Positions"] = None
markers["VarRef"] = markers["SNPs"].str.strip("/").str.replace("/", ";")
columns = ["Name", "Xref", "NumVars", "Refr", "Chrom", "Positions", "VarRef"]
markers = markers[columns].sort_values("Name")
markers.to_csv(output.csv, index=False)

New procedure automating marker compilation, which was previously done manually.

Comment on lines +21 to +30
mh13KK-213,,3,,chr13,,rs8181845;rs679482;rs9510616
mh13KK-218,,4,,chr13,,rs1927847;rs9536429;rs7492234;rs9536430
mh13KK-223,,4,,chr13,,rs1192204;rs1192205;rs3825483;rs3825481
mh13ZHA-002,,4,,chr13,,rs72649485;rs12877457;rs9514021;rs9514022
mh14ZHA-006,,4,,chr14,,rs71205883;rs7160425;rs7161550;rs78689987
mh16KK-302,,4,,chr16,,rs1395579;rs1395580;rs1395582;rs9939248
mh16ZHA-003,,3,,chr16,,rs6498348;rs16960309;rs8052581
mh16ZHA-004,,4,,chr16,,rs34771585;rs72638292;rs4781308;rs4781311
mh16ZHA-006,,3,,chr16,,rs2966051;rs2914455;rs12922936
mh16ZHA-009,,4,,chr16,,rs76047588;rs11641186;rs11641193;rs80213582

Full complement of markers reported in Sun 2020 now included.


ids = set()
for ident in idents:
    locusnames = microhapdb.markers.Name.apply(lambda x: x.split(".")[0])

Initially, I supported microhap lookup by locus name with a pandas Series str.contains operation. But that was too permissive: it would allow mh01 or mh12KK to be treated as identifiers and retrieve all IDs that begin with that prefix, which was not the intent of this update. So instead, I created a series of locus names (with any .v? suffix stripped off), and microhap ID queries must now match either a full locus name or a full marker name to return a result.
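
The difference between the two matching strategies can be shown without pandas. In this sketch, `permissive_lookup` and `exact_lookup` are hypothetical helpers, the substring test is only a rough analogue of str.contains, and the `NAMES` list is a small illustrative subset of marker names mentioned in this PR.

```python
# A small illustrative subset of marker names mentioned in this PR.
NAMES = ["mh01NH-04", "mh02KK-014", "mh02KK-029", "mh02KK-031.v1", "mh02KK-031.v2"]


def permissive_lookup(ident, names):
    """Substring match, roughly analogous to a str.contains query."""
    return [n for n in names if ident in n]


def exact_lookup(ident, names):
    """Match only a full marker name or a full locus name (version suffix stripped)."""
    return [n for n in names if ident == n or ident == n.split(".")[0]]


too_many = permissive_lookup("mh02KK", NAMES)  # every mh02KK marker matches
nothing = exact_lookup("mh02KK", NAMES)        # a bare prefix is not an identifier
both = exact_lookup("mh02KK-031", NAMES)       # a locus name expands to its versions
```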

@standage standage merged commit e9d5f63 into master Feb 28, 2023
@standage standage deleted the intervalcheck branch February 28, 2023 16:45
@standage standage mentioned this pull request Mar 8, 2023
Successfully merging this pull request may close these issues.

Record merged MH names