Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure consistent MH frequencies #150

Merged
merged 2 commits into from
Oct 23, 2023
Merged

Ensure consistent MH frequencies #150

merged 2 commits into from
Oct 23, 2023

Conversation

standage
Copy link
Member

This branch began with an in-depth investigation of mh06WL-017, and the cause of its 31-SNP allele format even though the definition has 32 SNPs. In the process, I discovered a handful of other markers that did not have consistent allele representations. This PR addresses those issues, as described below.

Closes #146.

mh06WL-017 32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631070;32631075;32631079;32631080;32631086;32631104;32631109;32631113;32631119;32631122;32631129;32631130;32631133;32631142 rs281863504;rs281863503;rs281863502;rs281863499;rs281863486;rs281863485;rs35986240;rs281863467;rs281863466;rs281863464;rs281863463;rs281863439;rs281863431;rs281863430;rs281863427;rs281863426;rs281863425;rs281863414;rs281863406;rs281863401;rs281863398;rs281863397;rs58770498;rs281863382;rs281863378;rs281863375;rs281863371;rs17843723;rs281863364;rs281863363;rs281863362;rs281863357 15.69 6.43 32 199
mh06WL-017 32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631070;32631075;32631079;32631080;32631086;32631104;32631109;32631113;32631119;32631122;32631129;32631130;32631133;32631142 rs281863504;rs281863503;rs281863502;rs281863499;rs281863486;rs281863485;rs35986240;rs281863467;rs281863466;rs281863464;rs281863463;rs281863439;rs281863431;rs281863430;rs281863427;rs281863426;rs281863425;rs281863414;rs281863406;rs281863401;rs281863398;rs281863397;rs17843724;rs281863382;rs281863378;rs281863375;rs281863371;rs17613622;rs281863364;rs281863363;rs281863362;rs281863357 15.69 6.43 32 199
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the original change motivating this PR. Digging to the bottom of it took more time than I'd like to admit!

mh06WL-069 32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631070;32631075;32631079;32631080;32631086;32631104;32631109;32631113;32631119;32631122;32631129;32631130;32631133;32631142 rs281863504;rs281863503;rs281863502;rs281863499;rs281863486;rs281863485;rs35986240;rs281863467;rs281863466;rs281863464;rs281863463;rs281863439;rs281863431;rs281863430;rs281863427;rs281863426;rs281863425;rs281863414;rs281863406;rs281863401;rs281863398;rs281863397;rs58770498;rs281863382;rs281863378;rs281863375;rs281863371;rs17843723;rs281863364;rs281863363;rs281863362;rs281863357 15.69
mh06WL-069 32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631070;32631075;32631079;32631080;32631086;32631104;32631109;32631113;32631119;32631122;32631129;32631130;32631133;32631142 rs281863504;rs281863503;rs281863502;rs281863499;rs281863486;rs281863485;rs35986240;rs281863467;rs281863466;rs281863464;rs281863463;rs281863439;rs281863431;rs281863430;rs281863427;rs281863426;rs281863425;rs281863414;rs281863406;rs281863401;rs281863398;rs281863397;rs17843724;rs281863382;rs281863378;rs281863375;rs281863371;rs17613622;rs281863364;rs281863363;rs281863362;rs281863357 15.69
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same marker, different name.

mh06WL-017.v1,32,199,chr6,32663167,32663365,32663167;32663168;32663169;32663172;32663190;32663191;32663214;32663218;32663219;32663222;32663223;32663252;32663252;32663261;32663262;32663265;32663266;32663267;32663281;32663288;32663293;32663298;32663302;32663303;32663327;32663332;32663336;32663342;32663352;32663353;32663356;32663365,32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631065;32631070;32631075;32631079;32631080;32631104;32631109;32631113;32631119;32631129;32631130;32631133;32631142,rs71542446;rs41270931;rs28724261;rs71542447;rs2854267;rs9274216;rs35986240;rs201930518;rs17613599;rs71542448;rs9274217;rs58770498;rs58770498;rs17613606;rs9274218;rs115495316;rs9274219;rs9274220;rs2856703;rs17843723;rs41270932;rs9274222;rs72844333;rs200098349;rs74222206;rs281863378;rs9274225;rs41270933;rs17613629;rs17613636;rs281863362;rs9274227,Yu2022G1;Yu2022G2
mh06WL-017.v1,32,199,chr6,32663167,32663365,32663167;32663168;32663169;32663172;32663190;32663191;32663214;32663218;32663219;32663222;32663223;32663252;32663261;32663262;32663265;32663266;32663267;32663281;32663293;32663298;32663302;32663303;32663309;32663327;32663332;32663336;32663342;32663345;32663352;32663353;32663356;32663365,32630944;32630945;32630946;32630949;32630967;32630968;32630991;32630995;32630996;32630999;32631000;32631029;32631038;32631039;32631042;32631043;32631044;32631058;32631070;32631075;32631079;32631080;32631086;32631104;32631109;32631113;32631119;32631122;32631129;32631130;32631133;32631142,rs71542446;rs41270931;rs28724261;rs71542447;rs2854267;rs9274216;rs35986240;rs201930518;rs17613599;rs71542448;rs9274217;rs58770498;rs17613606;rs9274218;rs115495316;rs9274219;rs9274220;rs2856703;rs41270932;rs9274222;rs72844333;rs200098349;rs17843724;rs74222206;rs281863378;rs9274225;rs41270933;rs17613622;rs17613629;rs17613636;rs281863362;rs9274227,Yu2022G1;Yu2022G2
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change as propagated to the marker table.

Comment on lines +49 to +51
if len(reference) != len(positions):
message = f"variant count mismatch: {len(reference)} vs {len(positions)}"
raise ValueError(message)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the following code to the build procedure to check for similar issues in the future.

Comment on lines +341 to +343
if len(allelevars) != len(self.variant_lengths):
message = f"num var mismatch ({self.name}): {len(allelevars)} vs {len(self.variant_lengths)}"
raise ValueError(message)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another check, this time in the main code.

Comment on lines +979 to +984
def test_mh06WL017v1():
subset = microhapdb.markers[microhapdb.markers.Name == "mh06WL-017.v1"]
markers = list(Marker.objectify(subset))
assert len(markers) == 1
marker = markers[0]
assert len(marker) == 199
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicit test for this problematic marker.

Comment on lines +91 to +97
def test_all_consistent():
inconsistent_markers = set()
for markerid, table in microhapdb.frequencies.groupby("Marker"):
numvars = set([a.count("|") + 1 for a in table.Allele])
if len(numvars) > 1:
inconsistent_markers.add(markerid)
assert len(inconsistent_markers) == 0, sorted(inconsistent_markers)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After fixing the aforementioned problem, I wrote this test and ran it iteratively to identify and fix additional issues.

Comment on lines +63 to +76
def cleanup_frequencies(freq):
freq["NumVars"] = freq.Allele.apply(lambda x: x.count("|") + 1)
freq.loc[(freq.Marker == "mh01NK-001") & (freq.Source == "Kidd2018"), "Marker"] = "mh01NH-01.v2"
freq.loc[(freq.Marker == "mh01NK-001") & (freq.Source == "Turchi2019"), "Marker"] = "mh01NH-01.v2"
freq.loc[(freq.Marker == "mh01NK-001") & (freq.Source == "Gandotra2020"), "Marker"] = "mh01NH-01.v2"
freq.loc[(freq.Marker == "mh01NK-001") & (freq.Source == "Staadig2021"), "Marker"] = "mh01NH-01.v1"
freq.loc[(freq.Marker.str.startswith(("mh05KK-023", "mh05KK-020"))) & (freq.Source == "Kidd2018") & (freq.NumVars == 3), "Marker"] = "mh05KK-023.v1"
freq.loc[(freq.Marker.str.startswith(("mh05KK-023", "mh05KK-020"))) & (freq.Source == "Kidd2018") & (freq.NumVars == 4), "Marker"] = "mh05KK-023.v2"
freq.loc[(freq.Marker.str.startswith(("mh05KK-023", "mh05KK-020"))) & (freq.Source == "Turchi2019"), "Marker"] = "mh05KK-023.v1"
freq.loc[(freq.Marker.str.startswith(("mh05KK-023", "mh05KK-020"))) & (freq.Source == "Gandotra2020"), "Marker"] = "mh05KK-023.v3"
freq.loc[(freq.Marker.str.startswith("mh05KK-120")) & (freq.Source == "Kidd2018") & (freq.NumVars == 3), "Marker"] = "mh05KK-120.v1"
freq.loc[(freq.Marker.str.startswith("mh05KK-120")) & (freq.Source == "Kidd2018") & (freq.NumVars == 4), "Marker"] = "mh05KK-120.v2"
freq = freq.drop(columns="NumVars")
return freq
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is the result of running that new test dozens of times. There are a handful of markers from Kidd & company that could not be rectified with a reasonable amount of effort using my automated procedure. Fortunately, a few "manual" replacements as specified here did the trick.

These issues were the result of different names being used to refer to different SNP sets at the same locus.

@standage standage merged commit 668fdb3 into master Oct 23, 2023
3 checks passed
@standage standage deleted the freq/consistent branch October 23, 2023 20:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue with detail report for mh06WL-017.v1
1 participant