Check for non-overlapping markers at build time #128

Merged
merged 8 commits into from
Jun 22, 2023

@standage standage commented Jun 16, 2023

This PR updates the MicroHapDB build process to make sure that markers sharing the same name actually overlap. There are a few cases from Yu 2022 that need manual intervention, either to add an a/b suffix or to use an alternative marker name.

Closes #126.


mh06WL-066

This marker is defined at the locus chr6:31274441-31274634 in Group 1 and chr6:124896325-124896527 in Group 3. The first has been updated with the name mh06WL-066a and the second mh06WL-066b.

$ grep -e mh06WL-066 microhapdb/data/marker.csv 
mh06WL-066a,10,194,chr6,31306664,31306857,31306664;31306672;31306736;31306778;31306803;31306805;31306809;31306842;31306853;31306857,31274441;31274449;31274513;31274555;31274580;31274582;31274586;31274619;31274630;31274634,rs9468942;rs35647108;rs6931873;rs10484554;rs184149624;rs9468944;rs561074738;rs9264944;rs118169956;rs9264946,Yu2022G1
mh06WL-066b,4,203,chr6,124575179,124575381,124575179;124575207;124575295;124575381,124896325;124896353;124896441;124896527,rs12111530;rs611601;rs575793;rs576646,Yu2022G3

mh06WL-067

This marker is defined at the locus chr6:170686762-170686884 in Group 1 and chr6:23760007-23760286 in Group 3. The first has been updated with the name mh06WL-067a and the second mh06WL-067b.

$ grep -e mh06WL-067 microhapdb/data/marker.csv 
mh06WL-067b,6,280,chr6,23759779,23760058,23759779;23759800;23759821;23759824;23760036;23760058,23760007;23760028;23760049;23760052;23760264;23760286,rs12529689;rs12526357;rs72835578;rs13199839;rs6456572;rs6456573,Yu2022G3
mh06WL-067a,7,123,chr6,170377674,170377796,170377674;170377762;170377771;170377777;170377785;170377792;170377796,170686762;170686850;170686859;170686865;170686873;170686880;170686884,rs80046082;rs74200689;rs118007249;rs146473817;rs35931761;rs34430936;rs149631868,Yu2022G1

mh06WL-068

This marker is defined at the locus chr6:29924713-29924889 in Group 1 and chr6:130936249-130936495 in Group 3. The first has been updated with the name mh06WL-068a and the second mh06WL-068b.

$ grep -e mh06WL-068 microhapdb/data/marker.csv 
mh06WL-068a,6,177,chr6,29956936,29957112,29956936;29956951;29957002;29957045;29957056;29957112,29924713;29924728;29924779;29924822;29924833;29924889,rs73415987;rs892666;rs12665039;rs428905;rs9260671;rs2517702,Yu2022G1
mh06WL-068b,3,247,chr6,130615104,130615350,130615104;130615240;130615350,130936249;130936385;130936495,rs4897446;rs17059290;rs4144214,Yu2022G3

mh06WL-069, mh06WL-017, and mh06WL-037

Oddly, the marker mh06WL-069 is defined twice in Group 1, first at the locus chr6:32630944-32631142 and second at the locus chr6:108031328-108031471. The first definition perfectly matches the definition for mh06WL-017 in Group 2, and both overlap with the marker mh06WL-037 defined in Group 4. The first Group 1 definition of mh06WL-069 was updated to use the name mh06WL-017, while the second keeps the name mh06WL-069. The label mh06WL-037 is merged with mh06WL-017 in the final build.

$ grep -e mh06WL-069 -e mh06WL-037 -e mh06WL-017 microhapdb/data/marker.csv
mh06WL-017.v2,16,100,chr6,32663167,32663266,32663167;32663168;32663169;32663172;32663190;32663191;32663214;32663218;32663219;32663222;32663223;32663252;32663261;32663262;32663265;32663266,32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631038;32631039;32631042;32631043,rs281863504;rs281863503;rs281863502;rs281863499;rs281863486;rs281863485;rs35986240;rs281863467;rs281863466;rs281863464;rs281863463;rs281863439;rs281863431;rs281863430;rs281863427;rs281863426,Yu2022G4
mh06WL-017.v1,32,199,chr6,32663167,32663365,32663167;32663168;32663169;32663172;32663190;32663191;32663214;32663218;32663219;32663222;32663223;32663252;32663252;32663261;32663262;32663265;32663266;32663267;32663281;32663288;32663293;32663298;32663302;32663303;32663327;32663332;32663336;32663342;32663352;32663353;32663356;32663365,32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631065;32631070;32631075;32631079;32631080;32631104;32631109;32631113;32631119;32631129;32631130;32631133;32631142,rs281863504;rs281863503;rs281863502;rs281863499;rs281863486;rs281863485;rs35986240;rs281863467;rs281863466;rs281863464;rs281863463;rs281863439;rs281863431;rs281863430;rs281863427;rs281863426;rs281863425;rs281863414;rs281863406;rs281863401;rs281863398;rs281863397;rs58770498;rs281863382;rs281863378;rs281863375;rs281863371;rs17843723;rs281863364;rs281863363;rs281863362;rs281863357,Yu2022G1;Yu2022G2
mh06WL-069,3,144,chr6,107710124,107710267,107710124;107710141;107710267,108031328;108031345;108031471,rs9386660;rs539627870;rs2748441,Yu2022G1
$ grep -e mh06WL-017 -e mh06WL-037 -e mh06WL-069 microhapdb/data/merged.csv 
mh06WL-037,mh06WL-017

mh20WL-001, mh20WL-015, and mh20WL-026

In Group 3, the marker mh20WL-015 is defined at the locus chr20:62720047-62720255 (which matches mh20WL-001 in Groups 1, 2, and 4), but in Group 4 mh20WL-015 is defined at the locus chr20:38492803-38492899 (which matches mh20WL-026 in Group 1). The definition for mh20WL-015 in Group 3 was modified to use the name mh20WL-001, allowing mh20WL-015 in Group 4 to be merged with mh20WL-026 as defined in Group 1.

$ grep -e mh20WL-026 -e mh20WL-015 -e mh20WL-001 microhapdb/data/marker.csv 
mh20WL-026.v2,3,97,chr20,39864161,39864257,39864161;39864172;39864257,38492803;38492814;38492899,rs6101739;rs17802541;rs2208094,Yu2022G4
mh20WL-026.v1,4,101,chr20,39864161,39864261,39864161;39864172;39864257;39864261,38492803;38492814;38492899;38492903,rs6101739;rs17802541;rs2208094;rs2208093,Yu2022G1
mh20WL-001.v2,7,209,chr20,64088694,64088902,64088694;64088852;64088854;64088857;64088862;64088882;64088902,62720047;62720205;62720207;62720210;62720215;62720235;62720255,rs62218105;rs77312089;rs78514502;rs78828793;rs77461050;rs62218112;rs62218113,Yu2022G3
mh20WL-001.v1,6,51,chr20,64088852,64088902,64088852;64088854;64088857;64088862;64088882;64088902,62720205;62720207;62720210;62720215;62720235;62720255,rs77312089;rs78514502;rs78828793;rs77461050;rs62218112;rs62218113,Yu2022G1;Yu2022G2;Yu2022G4
$ grep -e mh20WL-001 -e mh20WL-015 -e mh20WL-026 microhapdb/data/merged.csv 
mh20WL-015,mh20WL-026

mh22WL-016

This marker is defined at two distinct loci in Group 1: chr22:45991612-45991645 and chr22:48651997-48652184. The first has been updated with the name mh22WL-016a and the second mh22WL-016b.

$ grep mh22WL-016 microhapdb/data/marker.csv 
mh22WL-016a,4,34,chr22,45595732,45595765,45595732;45595735;45595762;45595765,45991612;45991615;45991642;45991645,rs79222659;rs73447079;rs77662475;rs74430055,Yu2022G1
mh22WL-016b,3,188,chr22,48256185,48256372,48256185;48256263;48256372,48651997;48652075;48652184,rs1004689;rs59649586;rs2075958,Yu2022G1

Comment on lines -63 to -66
class LocusOldDeleteMe(list):
    @property
    def name(self):
        return self[0].locus
Member Author

Lol, your wish is my command.

Comment on lines +62 to +74
def check_overlap(self):
    if len(self) <= 1:
        return
    for i in range(len(self.markers)):
        marker1 = self.markers[i]
        for j in range(len(self.markers)):
            if i == j:
                continue
            marker2 = self.markers[j]
            if marker1.overlaps(marker2):
                break
        else:
            raise ValueError(f"{marker1.name} ({marker1.sourcename}) does not overlap with any other markers defined at {marker1.locus}")
Member Author

New check. Identified the markers described above.

Member Author

The build now passes without this check failing.
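
For readers without the repository at hand, the idea behind the check can be sketched in isolation. This is only an illustration under stated assumptions: the `Marker` dataclass and the `isolated_markers` helper are hypothetical stand-ins, not MicroHapDB's classes, and the sketch returns the offending markers instead of raising. The coordinates are the GRCh38 extents from the marker.csv rows quoted in the description.

```python
from dataclasses import dataclass


@dataclass
class Marker:
    # Hypothetical stand-in for a marker record; coordinates are 1-based inclusive.
    name: str
    chrom: str
    start: int
    end: int

    def overlaps(self, other):
        same_chrom = self.chrom == other.chrom
        return same_chrom and self.start <= other.end and self.end >= other.start


def isolated_markers(markers):
    """Return the markers that overlap no other marker in the same group."""
    if len(markers) <= 1:
        return []
    return [
        m for m in markers
        if not any(m.overlaps(other) for other in markers if other is not m)
    ]


# Two definitions that share a locus, and two that only shared a name.
same_locus = [
    Marker("mh06WL-017.v1", "chr6", 32663167, 32663365),
    Marker("mh06WL-017.v2", "chr6", 32663167, 32663266),
]
mixed_loci = [
    Marker("mh22WL-016a", "chr22", 45595732, 45595765),
    Marker("mh22WL-016b", "chr22", 48256185, 48256372),
]
print([m.name for m in isolated_markers(same_locus)])  # []
print([m.name for m in isolated_markers(mixed_loci)])  # ['mh22WL-016a', 'mh22WL-016b']
```

Collecting the isolated markers rather than raising on the first one makes it easier to see every name that needs manual intervention in a single pass.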

@@ -40,7 +40,7 @@ def resolve(self):
             for marker in self.markers:
                 yield marker
             return
-        definitions = set(self.markers_by_definition)
Member Author

Dropped unused variable.

@@ -40,7 +40,7 @@ def resolve(self):
             for marker in self.markers:
                 yield marker
             return
-        definitions = set(self.markers_by_definition)
+        self.check_overlap()
Member Author

This is where the new function is called.

Comment on lines +31 to +35
    "mh06WL-066": "mh06WL-066a",
    "mh06WL-067": "mh06WL-067a",
    "mh06WL-068": "mh06WL-068a",
}
markers = markers.replace(substitutions)
Member Author

Manually editing some marker names. For marker names that occur only once, the syntax is simple and can be easily batched.

Comment on lines 36 to 38
markers.loc[(markers.Name == "mh06WL-069") & (markers.VarRef.str.contains("rs281863504")), "Name"] = "mh06WL-037"
markers.loc[(markers.Name == "mh22WL-016") & (markers.VarRef.str.contains("rs79222659")), "Name"] = "mh22WL-016a"
markers.loc[(markers.Name == "mh22WL-016") & (markers.VarRef.str.contains("rs1004689")), "Name"] = "mh22WL-016b"
Member Author

Manipulating marker names that occur multiple times in the same source requires somewhat more involved syntax.
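
The two renaming styles can be compared side by side on a toy table. This is a sketch, assuming a pandas DataFrame with `Name` and `VarRef` columns as in the snippets above; the three-row table is a hypothetical miniature (variant lists truncated for brevity), not the real source file.

```python
import pandas as pd

# Hypothetical miniature of a source marker table.
markers = pd.DataFrame(
    {
        "Name": ["mh06WL-066", "mh22WL-016", "mh22WL-016"],
        "VarRef": [
            "rs9468942;rs35647108",
            "rs79222659;rs73447079",
            "rs1004689;rs59649586",
        ],
    }
)

# A name that occurs once: a dict of whole-value substitutions handles it,
# and more entries can be batched into the same dict.
markers = markers.replace({"mh06WL-066": "mh06WL-066a"})

# A name that occurs twice: add a second condition on the variant IDs
# so that each row can be told apart before assigning a new name.
is_016 = markers.Name == "mh22WL-016"
markers.loc[is_016 & markers.VarRef.str.contains("rs79222659"), "Name"] = "mh22WL-016a"
markers.loc[is_016 & markers.VarRef.str.contains("rs1004689"), "Name"] = "mh22WL-016b"

print(markers.Name.tolist())  # ['mh06WL-066a', 'mh22WL-016a', 'mh22WL-016b']
```

Note that `DataFrame.replace` with a plain dict matches entire cell values, so it cannot distinguish two rows that carry the same name; the boolean mask is what makes the second case work.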

- 1409 distinct loci
+ 1414 distinct loci
Member Author

The difference reflects the number of affected loci.

@standage standage marked this pull request as ready for review June 16, 2023 15:55
Collaborator

@agshumate agshumate left a comment

@standage I'm totally trusting you on the markers that needed manual intervention, but if you would like me to review that more closely I'd be happy to. Otherwise I just left a couple of comments on the overlaps() function.

Comment on lines 171 to 173
def overlaps(self, other):
    return self.start <= other.end and self.end >= other.start

Collaborator

Shouldn't there also be a check to make sure they are on the same chromosome? Also, if my interval math is correct (which is always a big if 😅), for half-open intervals the overlap calculation would be self.start < other.end and self.end > other.start.

Member Author

> Shouldn't there also be a check to make sure they are on the same chromosome?

...yes

I'll take care of that right away! 😀

Member Author

> ...for half-open intervals...

These are actually 1-based inclusive intervals. The amount of interval arithmetic performed in MicroHapDB is pretty limited, so while I usually like to use 0-based half-open intervals in code, I haven't taken the time to switch over in MicroHapDB.
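
To make the difference between the two conventions concrete, here is a small sketch. The function names are hypothetical and used only for illustration; the comparisons are the ones discussed in this thread.

```python
def overlaps_closed(start1, end1, start2, end2):
    """Overlap test for inclusive intervals, such as 1-based coordinates."""
    return start1 <= end2 and end1 >= start2


def overlaps_half_open(start1, end1, start2, end2):
    """Overlap test for half-open intervals [start, end), such as 0-based coordinates."""
    return start1 < end2 and end1 > start2


# Inclusive intervals 1-100 and 100-200 share exactly one position (100).
assert overlaps_closed(1, 100, 100, 200)
# The same pair in 0-based half-open form is [0, 100) and [99, 200).
assert overlaps_half_open(0, 100, 99, 200)
# Half-open intervals [0, 100) and [100, 200) are adjacent, not overlapping.
assert not overlaps_half_open(0, 100, 100, 200)
```

Applying the strict comparisons to inclusive coordinates would miss markers that share only a boundary position, which is why the `<=` and `>=` form suits 1-based inclusive intervals.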

Member Author

> I'm totally trusting you on the markers that needed manual intervention

I don't think you need to check my code too closely, but it should be pretty straightforward if you want to see if MicroHapDB marker queries return results that are consistent with what I've described for the affected markers in this issue thread.

Member Author

Actually, the best check would be to make sure that microhapdb marker --format=fasta doesn't create any ridiculously long sequences, since that is what caught this bug in the first place. Probably should add a test for that...

Comment on lines +171 to +173
def overlaps(self, other):
    same_chrom = self.chrom_num == other.chrom_num
    return same_chrom and self.start <= other.end and self.end >= other.start
Member Author

Updated in b2a42f0.

@standage
Member Author

I'll rebase this.

Comment on lines +244 to +245
if pd.isna(self.data.RSIDs):
    return []
Member Author

Minor bug found and fixed.

Comment on lines +688 to +695
def test_locus_length():
    loci = defaultdict(Locus)
    for marker in Marker.objectify(microhapdb.markers):
        loci[marker.locus].markers.append(marker)
    for locus in loci.values():
        # This is the length of the Fasta representation of the sequence, not the sequence itself,
        # but...close enough. 🙃
        assert len(locus.fasta) < 700, locus.name
Member Author

New test. Most microhaps are shorter than 300-350 bp, but a few loci include overlapping markers that span more than 600 bp.
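
A rough way to see why a length threshold catches the original bug is to compute the span each locus would cover if same-named markers were pooled. This is a sketch only: it works on coordinate spans rather than on the Fasta output the test uses, the `MAX_SPAN` value simply mirrors the 700 in the test, and the records are GRCh38 extents from the marker.csv rows quoted in the description, grouped by their pre-fix names.

```python
from collections import defaultdict

# (locus label, start, end) with 1-based inclusive GRCh38 coordinates.
records = [
    ("mh06WL-017", 32663167, 32663365),
    ("mh06WL-017", 32663167, 32663266),
    ("mh06WL-066", 31306664, 31306857),
    ("mh06WL-066", 124575179, 124575381),
]

MAX_SPAN = 700  # assumed threshold for this illustration

# Group marker extents by locus label.
extents = defaultdict(list)
for locus, start, end in records:
    extents[locus].append((start, end))

# Span of each locus: leftmost start to rightmost end, inclusive.
spans = {
    locus: max(end for _, end in coords) - min(start for start, _ in coords) + 1
    for locus, coords in extents.items()
}
too_long = sorted(locus for locus, span in spans.items() if span > MAX_SPAN)

print(spans)     # {'mh06WL-017': 199, 'mh06WL-066': 93268718}
print(too_long)  # ['mh06WL-066']
```

The two genuinely overlapping mh06WL-017 definitions span 199 bp, while pooling the two unrelated mh06WL-066 definitions would yield a locus of more than 93 Mb.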

@agshumate
Collaborator

I am seeing some discrepancies in the coordinates. For example, when I query the first marker, mh06WL-066a, I get
mh06WL-066a 10 194 chr6 31306664 31306857 7.785 Yu2022G1
whereas your comment lists the coordinates as chr6:31274441-31274634. Am I missing something here? I haven't checked them all but the first few seem to have these discrepancies.

@standage
Member Author

standage commented Jun 20, 2023

> Am I missing something here? I haven't checked them all but the first few seem to have these discrepancies.

It looks like these correspond to different genome builds. I pulled the coordinates in my prose from dbbuild/sources/yu2022g?/Yu2022-TableS?.tsv, which appear to be GRCh37. However, microhapdb marker returns GRCh38 coordinates by default. If you include q in the --columns configuration (or look directly at the microhapdb/data/marker.csv file) you should be able to confirm.
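
The two coordinate systems can be read off a single marker.csv row. The sketch below parses one row copied from the output quoted in the description; the column labels are assumptions inferred from the values (the file header is not shown in this thread), with the first position list taken as GRCh38 and the second as GRCh37.

```python
import csv
import io

# One row copied from the marker.csv output quoted in the description.
row_text = (
    "mh22WL-016b,3,188,chr22,48256185,48256372,"
    "48256185;48256263;48256372,48651997;48652075;48652184,"
    "rs1004689;rs59649586;rs2075958,Yu2022G1"
)

# Hypothetical column labels, inferred from the values rather than the real header.
fields = [
    "name", "num_vars", "extent", "chrom", "start", "end",
    "positions_grch38", "positions_grch37", "rsids", "source",
]
row = dict(zip(fields, next(csv.reader(io.StringIO(row_text)))))


def extent(position_list):
    """Return (first, last) from a semicolon-delimited list of positions."""
    positions = [int(p) for p in position_list.split(";")]
    return min(positions), max(positions)


grch38 = extent(row["positions_grch38"])
grch37 = extent(row["positions_grch37"])
print(f"{row['name']} GRCh38 {row['chrom']}:{grch38[0]}-{grch38[1]}")
print(f"{row['name']} GRCh37 {row['chrom']}:{grch37[0]}-{grch37[1]}")
```

For this row the second list reproduces the chr22:48651997-48652184 locus given in the prose, while the first matches the start and end columns.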

@agshumate agshumate merged commit ed84d4c into master Jun 22, 2023
3 checks passed
@standage standage deleted the test/overlap branch June 27, 2023 23:14