Fix renaming #111

Merged
merged 3 commits into master from fix/renaming
Feb 24, 2023

Conversation

@standage standage commented Feb 24, 2023

This PR updates the database build code and fixes a bug in marker renaming. The bug caused the issue noted in #110, in which MH allele frequencies fell out of sync with the precise marker definitions to which they correspond.

As an added benefit, a marker defined consistently in multiple studies will now have every study listed in its "Source" column. Markers from chr19 are shown below to demonstrate.

         Name  NumVars  Extent Chrom    Start      End    Ae                   Source
 mh19USC-19pA        3      38 chr19   561779   561816 2.913           delaPuente2020
   mh19KK-056        2     201 chr19  4852125  4852325 2.594                 Kidd2018
  mh19SHY-001        8     185 chr19  7698913  7699097 6.347                   Wu2021
   mh19CP-007        3      42 chr19 14310740 14310781 3.254                 Kidd2018
 mh19USC-19pB        5      66 chr19 16040865 16040930 3.799           delaPuente2020
  mh19ZHA-006        6      63 chr19 20579863 20579925 3.167                  Sun2020
    mh19NH-23        3      95 chr19 22052724 22052818 2.024              Hiroaki2015
mh19KK-299.v1        5     154 chr19 22546698 22546851 4.060      Kidd2018;Turchi2019
mh19KK-299.v2        7     154 chr19 22546698 22546851 4.073             Gandotra2020
mh19KK-299.v4       10     182 chr19 22546698 22546879 4.073              Pakstis2021
mh19KK-299.v3        3      63 chr19 22546749 22546811 3.603              Staadig2021
  mh19ZHA-007        4     141 chr19 28397316 28397456 4.428              Kureshi2020
 mh19USC-19qA        4      46 chr19 33273772 33273817 3.523           delaPuente2020
   mh19KK-301        4      64 chr19 50938488 50938551 2.624      Kidd2018;Turchi2019
   mh19KK-300        7     182 chr19 50947787 50947968 5.821 Gandotra2020;Pakstis2021
   mh19KK-057        3     115 chr19 51654949 51655063 2.539      Kidd2018;Turchi2019
  mh19ZHA-009        5     178 chr19 53129073 53129250 4.347              Kureshi2020
 mh19USC-19qB        3      27 chr19 53714388 53714414 4.933           delaPuente2020
  mh19SHY-002        9     165 chr19 55588421 55588585 3.613                   Wu2021

closes #110.

Comment on lines +205696 to +205702
mh21KK-320.v1,Italians,A|A|C|A,0.25000,Turchi2019
mh21KK-320.v1,Italians,A|A|C|G,0.10700,Turchi2019
mh21KK-320.v1,Italians,G|A|C|A,0.12200,Turchi2019
mh21KK-320.v1,Italians,G|A|C|G,0.16800,Turchi2019
mh21KK-320.v1,Italians,G|A|T|A,0.07700,Turchi2019
mh21KK-320.v1,Italians,G|G|C|A,0.08200,Turchi2019
mh21KK-320.v1,Italians,G|G|C|G,0.19400,Turchi2019

These frequencies now correspond to the correct definition (see #110).
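
One way to catch this kind of drift is to check that every haplotype in the frequency table has as many alleles as its marker definition has variants. The following is a minimal sketch, assuming the five-column row layout shown above (marker, population, haplotype, frequency, source). The `find_mismatches` helper and the `num_vars` mapping are hypothetical and are not part of this repository's build code, and the variant count of 4 is assumed for illustration.

```python
import csv
import io


def find_mismatches(freq_csv, num_vars):
    """Return (marker, haplotype) pairs whose allele count differs from the definition.

    `num_vars` maps marker name -> number of variants (hypothetical input).
    """
    mismatches = []
    for marker, _population, haplotype, _freq, _source in csv.reader(io.StringIO(freq_csv)):
        # Haplotypes are written as alleles separated by "|", one per variant.
        if len(haplotype.split("|")) != num_vars[marker]:
            mismatches.append((marker, haplotype))
    return mismatches


rows = """mh21KK-320.v1,Italians,A|A|C|A,0.25000,Turchi2019
mh21KK-320.v1,Italians,G|G|C|G,0.19400,Turchi2019
"""
# Assumed for illustration: the .v1 definition has four variants.
print(find_mismatches(rows, {"mh21KK-320.v1": 4}))
```

An empty list means every haplotype has the expected number of alleles. A non-empty list points at rows whose frequencies were probably attached to the wrong definition.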

Comment on lines +185864 to +185870
mh18KK-293.v1,MHDBP-7c055e7ee8,A|G|A|A,0.59300,Staadig2021
mh18KK-293.v1,MHDBP-7c055e7ee8,A|G|A|G,0.02000,Staadig2021
mh18KK-293.v1,MHDBP-7c055e7ee8,A|T|A|G,0.02000,Staadig2021
mh18KK-293.v1,MHDBP-7c055e7ee8,A|T|G|A,0.07300,Staadig2021
mh18KK-293.v1,MHDBP-7c055e7ee8,G|G|A|A,0.24700,Staadig2021
mh18KK-293.v1,MHDBP-7c055e7ee8,G|T|A|A,0.01300,Staadig2021
mh18KK-293.v1,MHDBP-7c055e7ee8,G|T|A|G,0.03300,Staadig2021

Also fixed some Staadig- and Gandotra-associated frequencies.

Comment on lines +38 to +60
def resolve(self):
    if len(self) <= 1:
        for marker in self.markers:
            yield marker
        return
    definitions = set(self.markers_by_definition)
    for marker in sorted(self.markers, key=lambda m: (m.sources[0].year, m.name.lower())):
        if marker.posstr() in self.definition_names:
            message = f"Marker {marker.name} as defined in {marker.sources[0].name} was defined previously and is redundant"
            print(message)
            self.source_name_map[marker.sources[0].name][marker.name] = self.definition_names[marker.posstr()]
            continue
        else:
            new_name = marker.name
            if len(self.markers_by_definition) > 1:
                new_name = f"{marker.name}.v{len(self.definition_names) + 1}"
            self.definition_names[marker.posstr()] = new_name
            self.source_name_map[marker.sources[0].name][marker.name] = new_name
            marker.name = new_name
            for othermarker in self.markers_by_definition[marker.posstr()]:
                if othermarker != marker:
                    marker.sources.append(othermarker.sources[0])
            yield marker

Refactored code for determining which names are applied to which marker definitions.
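
To make the naming rule easier to follow, here is a simplified standalone sketch of the same idea. It is not the code from this PR: the `Marker` dataclass, the `definition` strings (stand-ins for `marker.posstr()`), and the free function `resolve` are assumptions made for illustration. Markers are visited in order of publication year, the first source to publish a definition names it, and a `.vN` suffix is added only when one name covers more than one distinct definition.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Marker:
    name: str
    definition: str  # hypothetical stand-in for marker.posstr()
    source: str
    year: int
    sources: list = field(default_factory=list)


def resolve(markers):
    """Assign .vN suffixes when one name covers several distinct definitions."""
    by_definition = defaultdict(list)
    for m in markers:
        by_definition[m.definition].append(m)
    names = {}  # definition -> final name
    resolved = []
    # sorted() is stable, so same-year markers keep their input order.
    for m in sorted(markers, key=lambda m: (m.year, m.name.lower())):
        if m.definition in names:
            continue  # redundant: this definition was already named
        new_name = m.name
        if len(by_definition) > 1:
            new_name = f"{m.name}.v{len(names) + 1}"
        names[m.definition] = new_name
        m.name = new_name
        # Record this marker's own source first, then every other source
        # that published the identical definition.
        others = [o.source for o in by_definition[m.definition] if o is not m]
        m.sources = [m.source] + others
        resolved.append(m)
    return resolved


# Definition strings are placeholders, not real coordinates.
markers = [
    Marker("mh19KK-299", "defA", "Kidd2018", 2018),
    Marker("mh19KK-299", "defA", "Turchi2019", 2019),
    Marker("mh19KK-299", "defB", "Gandotra2020", 2020),
    Marker("mh19KK-299", "defC", "Staadig2021", 2021),
    Marker("mh19KK-299", "defD", "Pakstis2021", 2021),
]
for m in resolve(markers):
    print(m.name, ";".join(m.sources))
```

With this input the sketch yields four names, `mh19KK-299.v1` through `mh19KK-299.v4`, and the first carries both `Kidd2018` and `Turchi2019` as sources, which mirrors the chr19 table in the PR description.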

Comment on lines -57 to +59
self.source = source
self.sources = list()
if source is not None:
    self.sources.append(source)

Each marker now keeps a list of sources (`sources`) instead of a single `source`.
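
As a rough illustration of why a list helps, the sketch below shows how several sources could be collected on one marker and rendered as a single "Source" value. It assumes sources are plain strings (in the actual code they appear to be objects with `name` and `year` attributes), and the `source_column` property is a hypothetical helper, not something taken from this PR.

```python
class Marker:
    def __init__(self, name, source=None):
        self.name = name
        self.sources = list()
        if source is not None:
            self.sources.append(source)

    @property
    def source_column(self):
        """Join source names with ";" (hypothetical helper for display)."""
        return ";".join(self.sources)


marker = Marker("mh19KK-301", "Kidd2018")
marker.sources.append("Turchi2019")  # a second study with the same definition
print(marker.source_column)
```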

@standage standage merged commit 7ad401e into master Feb 24, 2023
@standage standage deleted the fix/renaming branch February 24, 2023 20:08
@standage standage mentioned this pull request Mar 8, 2023

Successfully merging this pull request may close these issues.

Inconsistent allele representations for some markers