Medline parser does not link affiliations to authors #4303
Which raw data are you using, and which bit of Biopython? e.g. XML with |
Currently, this is the flow of my program:
In the above code snippet, I use the Entrez module to search the specified database (pubmed) and retrieve Medline records. The retrieved records are then parsed with the Medline module, after which additional business logic can be applied.
This information was obtained by splitting the provided string using the delimiter ", ". It appears that the original paper contains four authors and five department affiliations. |
See issue #1382 and pull request #2228. The parser appears to (unfortunately) be working as designed: you get two lists with no mapping information between them. The Medline parser would have needed a more dramatic restructuring when affiliations started to be provided for more than just the first author. I'm going to retitle this issue... |
Repeating my notes just added to old issue #1382, the parser appears to have implicitly assumed one affiliation per author as per initial examples (and perhaps that was true initially). But that assumption does not hold (anymore?). This example https://pubmed.ncbi.nlm.nih.gov/37195739/?format=pubmed reads:
Four authors (four FAU lines), with 1, 2, 1 and 2 affiliations respectively (six AD lines). Because we currently parse the information into two separate lists with no mapping between them, we don't track which of the 6 affiliation(s) belong to which of the 4 authors. |
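To make the problem concrete, here is a minimal sketch of what "two separate lists with no mapping" means. The sample lines are made up (they are not the PubMed record linked above), and `flat_lists` is a hypothetical helper that only mimics the current behaviour; it is not Biopython code.

```python
# Hypothetical Medline-style lines (not a real record): two authors,
# the second one with two affiliations.
SAMPLE = [
    "FAU - Doe, Jane",
    "AU  - Doe J",
    "AD  - Department A, Example University.",
    "FAU - Roe, Richard",
    "AU  - Roe R",
    "AD  - Department B, Example University.",
    "AD  - Institute C, Example City.",
]


def flat_lists(lines):
    """Collect FAU and AD values into two independent lists."""
    authors, affiliations = [], []
    for line in lines:
        # The key sits in the first four columns, the value starts at column 6.
        key, value = line[:4].rstrip(), line[6:]
        if key == "FAU":
            authors.append(value)
        elif key == "AD":
            affiliations.append(value)
    return authors, affiliations


authors, affiliations = flat_lists(SAMPLE)
# Two authors but three affiliations: from the lists alone there is no way
# to tell which author the second and third affiliations belong to.
print(len(authors), len(affiliations))
```

Once the values are split into two lists, the order of the AD lines relative to the FAU/AU lines is gone, and that ordering is the only place the mapping lives in the file.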
Hello, I made a PR for this here: #4307 |
Alternatively, this issue's author (@Bodasaieswar) could use the method I added in #4307 as an example of how to build an author-affiliation mapping object. See here:
def get_all_author_affiliations(handle):
    """Get mapping of authors and all their affiliations.

    The handle is either a Medline file, a file-like object, or a list
    of lines describing a Medline record.

    Typical usage:

    >>> from Bio import Medline
    >>> file = open("Medline/pubmed_result4.txt")
    >>> affiliations = Medline.get_all_author_affiliations(file)
    >>> batty = affiliations["Batty S"]
    >>> for affiliation in batty:
    ...     print(affiliation)
    ...
    Undergraduate Pipeline Network Summer Research Program, University of New Mexico Health Sciences Center.
    Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ 85721, USA.
    """
    affiliations = Record()
    author = ""
    parsing_author = False
    handle = iter(handle)
    for line in handle:
        line = line.rstrip()
        key = line[:4].rstrip()
        if key == "AU":
            # A new author starts; collect the AD lines that follow it
            parsing_author = True
            author = line[6:]
            affiliations[author] = []
        elif key == "AD" and parsing_author:
            affiliations[author].append(line[6:])
        elif line[:6] == "      " and parsing_author:
            # Continuation line, append to last item (if there is one)
            if affiliations[author]:
                affiliations[author][-1] += line[5:]
        elif key != "AD" and parsing_author:
            # Any other key ends the current author's block
            parsing_author = False
    return affiliations
|
This looks like it would work, but means re-parsing the file. I would rather the existing Medline parser was changed to retain the author-affiliation mapping, which would mean changing the data structure returned. What are your thoughts @mdehoon? |
That was my first thought but I went against it for two reasons:
Just wanted to share my thoughts on the current implementation in the PR, but I'm happy to change it however you and @mdehoon think is best.
I completely agree with @peterjc. It would be really great if we could use the existing Medline parser. I would suggest structuring the 'AD' data as a list of lists, where each inner list represents the affiliations of a specific author. For example,
|
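For readers following along, the list-of-lists layout suggested above could look roughly like the sketch below. The names and affiliations are invented, and a plain dict stands in for the parser's record object; treat it as an illustration of the shape only, under the assumption that every author gets exactly one inner list.

```python
# Hypothetical record fragment; names and affiliations are made up.
record = {
    "AU": ["Doe J", "Roe R"],
    "AD": [
        ["Department A, Example University."],
        ["Department B, Example University.", "Institute C, Example City."],
    ],
}

# With one inner list per author, the two fields can be paired by position.
by_author = dict(zip(record["AU"], record["AD"]))
for author, affiliations in by_author.items():
    print(author, "->", "; ".join(affiliations))
```

Pairing by position only works if authors without any affiliation still get an (empty) inner list; otherwise the indices drift apart.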
I have made the changes accordingly, and the 'AD' field now contains a list of lists. @peterjc, please review and let me know of any issues.
def parse(handle):
    """Read Medline records one by one from the handle.

    The handle is either a Medline file, a file-like object, or a list
    of lines describing one or more Medline records.

    Typical usage::

        >>> from Bio import Medline
        >>> with open("Medline/pubmed_result2.txt") as handle:
        ...     records = Medline.parse(handle)
        ...     for record in records:
        ...         print(record['TI'])
        ...
        A high level interface to SCOP and ASTRAL ...
        GenomeDiagram: a python package for the visualization of ...
        Open source clustering software.
        PDB file parser and structure class implemented in Python.

    """
    # These keys point to string values
    textkeys = (
        "ID",
        "PMID",
        "SO",
        "RF",
        "NI",
        "JC",
        "TA",
        "IS",
        "CY",
        "TT",
        "CA",
        "IP",
        "VI",
        "DP",
        "YR",
        "PG",
        "LID",
        "DA",
        "LR",
        "OWN",
        "STAT",
        "DCOM",
        "PUBM",
        "DEP",
        "PL",
        "JID",
        "SB",
        "PMC",
        "EDAT",
        "MHDA",
        "PST",
        "AB",
        "EA",
        "TI",
        "JT",
    )
    handle = iter(handle)
    key = ""
    record = Record()
    _au_counter = 0  # number of AU lines seen in the current record
    for line in handle:
        line = line.rstrip()
        if line[:6] == "      ":  # continuation line
            if key in ["MH"]:
                # Multi-line MESH term, want to append to last entry in list
                record[key][-1] += line[5:]  # including space using line[5:]
            elif key in ["AD"]:
                # Multi-line AD term, want to append to last entry in list of lists
                record[key][-1][-1] += line[5:]  # including space using line[5:]
            else:
                record[key].append(line[6:])
        elif line:
            key = line[:4].rstrip()
            if key in ["AU"]:
                _au_counter += 1
            if key not in record:
                record[key] = []
            if key in ["AD"]:
                # If 'AD' already holds at least one inner list and at least
                # as many inner lists as authors ('AU') seen so far, this
                # affiliation belongs to the current author's inner list.
                if record[key] and len(record[key]) >= _au_counter:
                    record[key][-1].append(line[6:])
                else:
                    # Otherwise start a new inner list for this affiliation
                    record[key].append([line[6:]])
            else:
                record[key].append(line[6:])
        elif record:
            # Join each list of strings into one string.
            for key in record:
                if key in textkeys:
                    record[key] = " ".join(record[key])
            yield record
            record = Record()
            _au_counter = 0  # reset the author count for the next record
    if record:  # catch last one
        for key in record:
            if key in textkeys:
                record[key] = " ".join(record[key])
        yield record
Differences: |
Thank you.
I prefer a solution that will work well for the future, and worry less about whether it will cause a change for users. Changes happen only once, but the future is forever. A list of lists for the Looking at the example:
I can see four blocks here, with keys |
Perhaps we can do that with a new richer structure per author (using dicts? named tuple?), while leaving the existing lists in place but deprecated (to help people adjust to the change)? |
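One way the "richer structure per author" idea might be sketched is with a named tuple, while still deriving the old flat lists from it. Everything here is hypothetical: the `Author` type, the `group_authors` helper and the sample lines are not part of Bio.Medline. The sketch also assumes each author block starts with an FAU line and ignores continuation lines for brevity.

```python
from typing import List, NamedTuple


class Author(NamedTuple):
    """Hypothetical per-author record; not part of Bio.Medline."""

    full_name: str
    short_name: str
    affiliations: List[str]


# Made-up Medline-style lines, two authors with 1 and 2 affiliations.
SAMPLE = [
    "FAU - Doe, Jane",
    "AU  - Doe J",
    "AD  - Department A, Example University.",
    "FAU - Roe, Richard",
    "AU  - Roe R",
    "AD  - Department B, Example University.",
    "AD  - Institute C, Example City.",
]


def group_authors(lines):
    """Group FAU/AU/AD lines into one Author per FAU line."""
    authors = []
    for line in lines:
        key, value = line[:4].rstrip(), line[6:]
        if key == "FAU":
            # Assumption: FAU opens a new author block
            authors.append(Author(value, "", []))
        elif key == "AU" and authors and not authors[-1].short_name:
            authors[-1] = authors[-1]._replace(short_name=value)
        elif key == "AD" and authors:
            authors[-1].affiliations.append(value)
    return authors


authors = group_authors(SAMPLE)

# The existing flat lists can still be derived for backwards compatibility.
legacy_au = [a.short_name for a in authors]
legacy_ad = [ad for a in authors for ad in a.affiliations]
```

Keeping the flat lists as derived views would let existing code keep working during a deprecation period, while new code reads the grouped structure.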
Hi, following a brief discussion on the subject on the email list: has a fix been pushed yet? It seems to be stuck on @peterjc's suggestion for backward compatibility... |
During my analysis of the API data, I noticed an inconsistency in the mapping for a specific publication. For example, when retrieving the publication with PubMed ID 31015011, the API indicates that there are seven authors associated with the publication. However, the API only provides information for five affiliations.
Upon reviewing the actual paper, it became apparent that the first, third, and fourth authors belong to Department 1, while the remaining authors follow a sequential incrementation for their department affiliations. This inconsistency raises concerns about the accuracy of the current author-affiliation mapping provided by the API.
Is there any other field that would be helpful for the mapping?