-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure consistent MH frequencies #150
Conversation
mh06WL-017 32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631070;32631075;32631079;32631080;32631086;32631104;32631109;32631113;32631119;32631122;32631129;32631130;32631133;32631142 rs281863504;rs281863503;rs281863502;rs281863499;rs281863486;rs281863485;rs35986240;rs281863467;rs281863466;rs281863464;rs281863463;rs281863439;rs281863431;rs281863430;rs281863427;rs281863426;rs281863425;rs281863414;rs281863406;rs281863401;rs281863398;rs281863397;rs58770498;rs281863382;rs281863378;rs281863375;rs281863371;rs17843723;rs281863364;rs281863363;rs281863362;rs281863357 15.69 6.43 32 199 | ||
mh06WL-017 32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631070;32631075;32631079;32631080;32631086;32631104;32631109;32631113;32631119;32631122;32631129;32631130;32631133;32631142 rs281863504;rs281863503;rs281863502;rs281863499;rs281863486;rs281863485;rs35986240;rs281863467;rs281863466;rs281863464;rs281863463;rs281863439;rs281863431;rs281863430;rs281863427;rs281863426;rs281863425;rs281863414;rs281863406;rs281863401;rs281863398;rs281863397;rs17843724;rs281863382;rs281863378;rs281863375;rs281863371;rs17613622;rs281863364;rs281863363;rs281863362;rs281863357 15.69 6.43 32 199 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the original change motivating this PR. Digging to the bottom of it took more time than I'd like to admit!
mh06WL-069 32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631070;32631075;32631079;32631080;32631086;32631104;32631109;32631113;32631119;32631122;32631129;32631130;32631133;32631142 rs281863504;rs281863503;rs281863502;rs281863499;rs281863486;rs281863485;rs35986240;rs281863467;rs281863466;rs281863464;rs281863463;rs281863439;rs281863431;rs281863430;rs281863427;rs281863426;rs281863425;rs281863414;rs281863406;rs281863401;rs281863398;rs281863397;rs58770498;rs281863382;rs281863378;rs281863375;rs281863371;rs17843723;rs281863364;rs281863363;rs281863362;rs281863357 15.69 | ||
mh06WL-069 32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631070;32631075;32631079;32631080;32631086;32631104;32631109;32631113;32631119;32631122;32631129;32631130;32631133;32631142 rs281863504;rs281863503;rs281863502;rs281863499;rs281863486;rs281863485;rs35986240;rs281863467;rs281863466;rs281863464;rs281863463;rs281863439;rs281863431;rs281863430;rs281863427;rs281863426;rs281863425;rs281863414;rs281863406;rs281863401;rs281863398;rs281863397;rs17843724;rs281863382;rs281863378;rs281863375;rs281863371;rs17613622;rs281863364;rs281863363;rs281863362;rs281863357 15.69 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same marker, different name.
mh06WL-017.v1,32,199,chr6,32663167,32663365,32663167;32663168;32663169;32663172;32663190;32663191;32663214;32663218;32663219;32663222;32663223;32663252;32663252;32663261;32663262;32663265;32663266;32663267;32663281;32663288;32663293;32663298;32663302;32663303;32663327;32663332;32663336;32663342;32663352;32663353;32663356;32663365,32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631065;32631070;32631075;32631079;32631080;32631104;32631109;32631113;32631119;32631129;32631130;32631133;32631142,rs71542446;rs41270931;rs28724261;rs71542447;rs2854267;rs9274216;rs35986240;rs201930518;rs17613599;rs71542448;rs9274217;rs58770498;rs58770498;rs17613606;rs9274218;rs115495316;rs9274219;rs9274220;rs2856703;rs17843723;rs41270932;rs9274222;rs72844333;rs200098349;rs74222206;rs281863378;rs9274225;rs41270933;rs17613629;rs17613636;rs281863362;rs9274227,Yu2022G1;Yu2022G2 | ||
mh06WL-017.v1,32,199,chr6,32663167,32663365,32663167;32663168;32663169;32663172;32663190;32663191;32663214;32663218;32663219;32663222;32663223;32663252;32663261;32663262;32663265;32663266;32663267;32663281;32663293;32663298;32663302;32663303;32663309;32663327;32663332;32663336;32663342;32663345;32663352;32663353;32663356;32663365,32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631070;32631075;32631079;32631080;32631086;32631104;32631109;32631113;32631119;32631122;32631129;32631130;32631133;32631142,rs71542446;rs41270931;rs28724261;rs71542447;rs2854267;rs9274216;rs35986240;rs201930518;rs17613599;rs71542448;rs9274217;rs58770498;rs17613606;rs9274218;rs115495316;rs9274219;rs9274220;rs2856703;rs41270932;rs9274222;rs72844333;rs200098349;rs17843724;rs74222206;rs281863378;rs9274225;rs41270933;rs17613622;rs17613629;rs17613636;rs281863362;rs9274227,Yu2022G1;Yu2022G2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change as propagated to the marker table.
if len(reference) != len(positions): | ||
message = f"variant count mismatch: {len(reference)} vs {len(positions)}" | ||
raise ValueError(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the following code to the build procedure to check for similar issues in the future.
if len(allelevars) != len(self.variant_lengths): | ||
message = f"num var mismatch ({self.name}): {len(allelevars)} vs {len(self.variant_lengths)}" | ||
raise ValueError(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another check, this time in the main code.
def test_mh06WL017v1(): | ||
subset = microhapdb.markers[microhapdb.markers.Name == "mh06WL-017.v1"] | ||
markers = list(Marker.objectify(subset)) | ||
assert len(markers) == 1 | ||
marker = markers[0] | ||
assert len(marker) == 199 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explicit test for this problematic marker.
def test_all_consistent(): | ||
inconsistent_markers = set() | ||
for markerid, table in microhapdb.frequencies.groupby("Marker"): | ||
numvars = set([a.count("|") + 1 for a in table.Allele]) | ||
if len(numvars) > 1: | ||
inconsistent_markers.add(markerid) | ||
assert len(inconsistent_markers) == 0, sorted(inconsistent_markers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After fixing the aforementioned problem, I wrote this test and ran it iteratively to identify and fix additional issues.
def cleanup_frequencies(freq): | ||
freq["NumVars"] = freq.Allele.apply(lambda x: x.count("|") + 1) | ||
freq.loc[(freq.Marker == "mh01NK-001") & (freq.Source == "Kidd2018"), "Marker"] = "mh01NH-01.v2" | ||
freq.loc[(freq.Marker == "mh01NK-001") & (freq.Source == "Turchi2019"), "Marker"] = "mh01NH-01.v2" | ||
freq.loc[(freq.Marker == "mh01NK-001") & (freq.Source == "Gandotra2020"), "Marker"] = "mh01NH-01.v2" | ||
freq.loc[(freq.Marker == "mh01NK-001") & (freq.Source == "Staadig2021"), "Marker"] = "mh01NH-01.v1" | ||
freq.loc[(freq.Marker.str.startswith(("mh05KK-023", "mh05KK-020"))) & (freq.Source == "Kidd2018") & (freq.NumVars == 3), "Marker"] = "mh05KK-023.v1" | ||
freq.loc[(freq.Marker.str.startswith(("mh05KK-023", "mh05KK-020"))) & (freq.Source == "Kidd2018") & (freq.NumVars == 4), "Marker"] = "mh05KK-023.v2" | ||
freq.loc[(freq.Marker.str.startswith(("mh05KK-023", "mh05KK-020"))) & (freq.Source == "Turchi2019"), "Marker"] = "mh05KK-023.v1" | ||
freq.loc[(freq.Marker.str.startswith(("mh05KK-023", "mh05KK-020"))) & (freq.Source == "Gandotra2020"), "Marker"] = "mh05KK-023.v3" | ||
freq.loc[(freq.Marker.str.startswith("mh05KK-120")) & (freq.Source == "Kidd2018") & (freq.NumVars == 3), "Marker"] = "mh05KK-120.v1" | ||
freq.loc[(freq.Marker.str.startswith("mh05KK-120")) & (freq.Source == "Kidd2018") & (freq.NumVars == 4), "Marker"] = "mh05KK-120.v2" | ||
freq = freq.drop(columns="NumVars") | ||
return freq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is the result of running that new test dozens of times. There are a handful of markers from Kidd & company that could not be rectified with a reasonable amount of effort using my automated procedure. Fortunately, a few "manual" replacements as specified here did the trick.
These issues were the result of different names being used to refer to different SNP sets at the same locus.
This branch began with an in-depth investigation of mh06WL-017, and the cause of its 31-SNP allele format even though the definition has 32 SNPs. In the process, I discovered a handful of other markers that did not have consistent allele representations. This PR addresses those issues, as described below.
Closes #146.