-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check for non-overlapping markers at build time #128
Conversation
class LocusOldDeleteMe(list): | ||
@property | ||
def name(self): | ||
return self[0].locus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol, your wish is my command.
def check_overlap(self): | ||
if len(self) <= 1: | ||
return | ||
for i in range(len(self.markers)): | ||
marker1 = self.markers[i] | ||
for j in range(len(self.markers)): | ||
if i == j: | ||
continue | ||
marker2 = self.markers[j] | ||
if marker1.overlaps(marker2): | ||
break | ||
else: | ||
raise ValueError(f"{marker1.name} ({marker1.sourcename}) does not overlap with any other markers defined at {marker1.locus}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New check. Identified the markers described above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The build now passes without this check failing.
@@ -40,7 +40,7 @@ def resolve(self): | |||
for marker in self.markers: | |||
yield marker | |||
return | |||
definitions = set(self.markers_by_definition) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropped unused variable.
@@ -40,7 +40,7 @@ def resolve(self): | |||
for marker in self.markers: | |||
yield marker | |||
return | |||
definitions = set(self.markers_by_definition) | |||
self.check_overlap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the new function is called.
"mh06WL-066": "mh06WL-066a", | ||
"mh06WL-067": "mh06WL-067a", | ||
"mh06WL-068": "mh06WL-068a", | ||
} | ||
markers = markers.replace(substitutions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manually editing some marker names. For marker names that occur only once, the syntax is simple and can be easily batched.
dbbuild/sources/yu2022g1/Snakefile
Outdated
markers.loc[(markers.Name == "mh06WL-069") & (markers.VarRef.str.contains("rs281863504")), "Name"] = "mh06WL-037" | ||
markers.loc[(markers.Name == "mh22WL-016") & (markers.VarRef.str.contains("rs79222659")), "Name"] = "mh22WL-016a" | ||
markers.loc[(markers.Name == "mh22WL-016") & (markers.VarRef.str.contains("rs1004689")), "Name"] = "mh22WL-016b" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manipulating marker names that occur multiple times in the same source require a bit more involved syntax.
microhapdb/tests/test_cli.py
Outdated
- 1409 distinct loci | ||
- 1414 distinct loci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference reflects the number of affected loci.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@standage I'm totally trusting you on the markers that needed manual intervention, but if you would like me to review that more closely I'd be happy to. Otherwise I just left a couple comments on the overlap()
function
dbbuild/lib/marker.py
Outdated
def overlaps(self, other): | ||
return self.start <= other.end and self.end >= other.start | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't there also be a check to make sure they are on the same chromosome? Also if my interval math is correct ( which is always a big if 😅 ), for half-open intervals the overlap calculation would be self.start < other.end and self.end > other.start
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't there also be a check to make sure they are on the same chromosome?
...yes
I'll take care of that right away! 😀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...for half-open intervals...
These are actually 1-based inclusive intervals. The amount of interval arithmetic performed in MicroHapDB is pretty limited, so while I usually like to use 0-based half-open intervals in code, I haven't taken the time to switch over in MicroHapDB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm totally trusting you on the markers that needed manual intervention
I don't think you need to check my code too closely, but it should be pretty straightforward if you want to see if MicroHapDB marker queries return results that are consistent with what I've described for the affected markers in this issue thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the best check would be to make sure that microhapdb marker --format=fasta
doesn't create any ridiculously long sequences, since that is what caught this bug in the first place. Probably should add a test for that...
def overlaps(self, other): | ||
same_chrom = self.chrom_num == other.chrom_num | ||
return same_chrom and self.start <= other.end and self.end >= other.start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in b2a42f0.
I'll rebase this. |
if pd.isna(self.data.RSIDs): | ||
return [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor bug found and fixed.
def test_locus_length(): | ||
loci = defaultdict(Locus) | ||
for marker in Marker.objectify(microhapdb.markers): | ||
loci[marker.locus].markers.append(marker) | ||
for locus in loci.values(): | ||
# This is the length of the Fasta representation of the sequence, not the sequence itself, | ||
# but...close enough. 🙃 | ||
assert len(locus.fasta) < 700, locus.name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New test. Most microhaps are shorter than 300-350 bp, but a few loci include overlapping markers that span more than 600 bp.
I am seeing some discrepancies in the coordinates. For example, when I query the first marker mh06WL-066a I get |
It looks like these correspond to different genome builds. I pulled the coordinates in my prose from |
This PR updates the MicroHapDB build process to make sure that markers sharing the same name actually overlap. There are a few cases from Yu 2022 that need manual intervention, either to add an a/b suffix or to use an alternative marker name.
Closes #126.
mh06WL-066
This marker is defined at the locus chr6:31274441-31274634 in Group 1 and chr6:124896325-124896527 in Group 3. The first has been updated with the name
mh06WL-66a
and the secondmh06WL-66b
.mh06WL-067
This marker is defined at the locus chr6:170686762-170686884 in Group 1 and chr6:23760007-23760286 in Group 3. The first has been updated with the name
mh06WL-67a
and the secondmh06WL-67b
.mh06WL-068
This marker is defined at the locus chr6:29924713-29924889 in Group 1 and chr6:130936249-130936495 in Group 3. The first has been updated with the name
mh06WL-68a
and the secondmh06WL-68b
.mh06WL-069, mh06WL-017, and mh06WL-037
Oddly, the marker mh06WL-069 is defined twice in Group 1, first at the locus chr6:32630944-32631142 and second at the locus chr6:108031328-108031471. The first definition perfectly matches the definition for mh06WL-017 in Group 2, both of which overlap with the marker mh06WL-037 defined in Group 4. The Group 1 definition for mh06WL-069 was updated to use the name mh06WL-017. The label mh06WL-037 ends up getting merged with mh06WL-017 in the final build.
mh20WL-001, mh20WL-015, and mh20WL-026
In Group 3, the marker mh20WL-015 is defined at the locus chr20:62720047-62720255 (which matches mh20WL-001 in Groups 1, 2, and 4), but in Group 4 mh20WL-015 is defined at the locus chr20:38492803-38492899 (which matches mh20WL-026 in Group 1). The definition for mh20WL-015 in Group 3 was modified to use the name mh20WL-001, allowing mh20WL-015 in Group 4 to be merged with mh20WL-026 as defined in Group 1.
mh22WL-016
This marker is defined at two distinct loci in Group 1: chr22:45991612-45991645 and chr22:48651997-48652184. The first has been updated with the name
mh22WL-016a
and the secondmh22WL-016b
.