Medline parser does not link affiliations to authors #4303

Open
Bodasaieswar opened this issue May 17, 2023 · 13 comments

@Bodasaieswar

During my analysis of the API data, I noticed an inconsistency in the mapping for a specific publication. For example, when retrieving the publication with PubMed ID 31015011, the API indicates that there are seven authors associated with the publication. However, the API only provides information for five affiliations.

Upon reviewing the actual paper, it became apparent that the first, third, and fourth authors belong to Department 1, while the remaining authors follow a sequential incrementation for their department affiliations. This inconsistency raises concerns about the accuracy of the current author-affiliation mapping provided by the API.

Is there another field that would help with this mapping?

@peterjc
Member

peterjc commented May 17, 2023

Which raw data are you using, and which bit of Biopython? e.g. XML with Bio.Entrez?

@Bodasaieswar
Author

Currently, this is the flow of my program:

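from Bio import Entrez, Medline

# db, parameter and rettype are set elsewhere in the program (db is "pubmed" here)
try:
    # Search for matching IDs, then fetch those records in Medline format
    handle = Entrez.esearch(db=db, term=parameter, retmax=9999)
    record = Entrez.read(handle)
    idlist = record["IdList"]
    handle = Entrez.efetch(db=db, id=idlist, rettype=rettype, retmode="text")
    records = Medline.parse(handle)

    # Perform business logic on records
except Exception:
    raise  # error handling omitted here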

In the snippet above, I use the Entrez module to search a specified database (pubmed) and retrieve Medline records. The retrieved records are then parsed with the Medline module, after which additional business logic can be applied.

#handle.read() for a Medline ID: 37195739
PMID- 37195739
.
.
.
FAU - Oh, Mijung
AU  - Oh M
AD  - Department of Pathology, School of Medicine.
FAU - Batty, Skylar
AU  - Batty S
AD  - Undergraduate Pipeline Network Summer Research Program, University of New Mexico 
      Health Sciences Center.
AD  - Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ 
      85721, USA.
FAU - Banerjee, Nayan
AU  - Banerjee N
AD  - School of Chemical Sciences, Indian Association for the Cultivation of Science, 
      2A & 2B Raja S. C. Mullick Road, Jadavpur, Kolkata 700032, West Bengal, India.
FAU - Kim, Tae-Hyung
AU  - Kim TH
AD  - Department of Pathology, School of Medicine.
AD  - University of New Mexico Comprehensive Cancer Center, Albuquerque, NM 87131, USA.
.
.
.

#And the record data after Medline.parse(handle)
.
.
'FAU': [
'Oh, Mijung', 
'Batty, Skylar', 
'Banerjee, Nayan', 
'Kim, Tae-Hyung'], 

'AD': [
'Department of Pathology, School of Medicine.', 
'Undergraduate Pipeline Network Summer Research Program, University of New Mexico Health Sciences Center.', 
'Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ 85721, USA.', 
'School of Chemical Sciences, Indian Association for the Cultivation of Science, 2A & 2B Raja S. C. Mullick Road, Jadavpur, Kolkata 700032, West Bengal, India.', 
'Department of Pathology, School of Medicine.', 
'University of New Mexico Comprehensive Cancer Center, Albuquerque, NM 87131, USA.']
.
.

This information was obtained by splitting the provided string using the delimiter ", ". It appears that the original paper contains four authors and five department affiliations.

Original paper:
image

@peterjc
Member

peterjc commented May 18, 2023

See issue #1382 and pull request #2228. The parser appears to (unfortunately) be working as designed: you get two lists with no mapping between them.

The Medline parser would have needed a more dramatic restructuring when affiliations started to be provided for more than just the first author.

I'm going to retitle this issue...

@peterjc changed the title from "Incorrect Mapping Issue: Number of Authors vs Number of Affiliations" to "Medline parser does not link affiliations to authors" on May 18, 2023
@peterjc
Member

peterjc commented May 18, 2023

Repeating my notes just added to old issue #1382, the parser appears to have implicitly assumed one affiliation per author as per initial examples (and perhaps that was true initially). But that assumption does not hold (anymore?).

This example https://pubmed.ncbi.nlm.nih.gov/37195739/?format=pubmed reads:

...
FAU - Oh, Mijung
AU  - Oh M
AD  - Department of Pathology, School of Medicine.
FAU - Batty, Skylar
AU  - Batty S
AD  - Undergraduate Pipeline Network Summer Research Program, University of New Mexico 
      Health Sciences Center.
AD  - Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ 
      85721, USA.
FAU - Banerjee, Nayan
AU  - Banerjee N
AD  - School of Chemical Sciences, Indian Association for the Cultivation of Science, 
      2A & 2B Raja S. C. Mullick Road, Jadavpur, Kolkata 700032, West Bengal, India.
FAU - Kim, Tae-Hyung
AU  - Kim TH
AD  - Department of Pathology, School of Medicine.
AD  - University of New Mexico Comprehensive Cancer Center, Albuquerque, NM 87131, USA.
...

Four authors (four FAU lines), with 1, 2, 1 and 2 affiliations respectively (six AD lines).

Because we currently parse the information into two separate lists with no mapping between them, we don't track which of the 6 affiliation(s) belong to which of the 4 authors.
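To make the loss of information concrete, here is a minimal sketch using plain Python lists copied from the parsed output above (no Biopython call is involved, and the longer affiliation strings are abbreviated with "..." for readability). It assumes a user falls back on pairing the two lists by position, which is the only option when no mapping is kept.

```python
# Values as in the parsed record above; longer affiliations are abbreviated.
fau = ["Oh, Mijung", "Batty, Skylar", "Banerjee, Nayan", "Kim, Tae-Hyung"]
ad = [
    "Department of Pathology, School of Medicine.",
    "Undergraduate Pipeline Network Summer Research Program, ...",
    "Department of Molecular and Cellular Biology, University of Arizona, ...",
    "School of Chemical Sciences, Indian Association for the Cultivation of Science, ...",
    "Department of Pathology, School of Medicine.",
    "University of New Mexico Comprehensive Cancer Center, ...",
]

# Positional pairing: zip() stops at the shorter list, so the last two
# affiliations are silently dropped and later authors are shifted.
naive = dict(zip(fau, ad))

for author, affiliation in naive.items():
    print(f"{author}: {affiliation}")
```

In this pairing Banerjee ends up with Batty's second affiliation, and Kim with Banerjee's, which is exactly the kind of mismatch reported at the top of this issue.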

@erik-whiting
Contributor

Hello, I made a PR for this here: #4307
Please let me know if this is helpful, or if there's anything you'd like me to change to make the PR better.

@erik-whiting
Contributor

Alternatively, this issue's author (@Bodasaieswar) could use the method I added in #4307 as an example of how to build an author-affiliation mapping object. See here:

def get_all_author_affiliations(handle):
    """Get mapping of authors and all their affiliations.
    The handle is either a Medline file, a file-like object, or a list
    of lines describing Medline record.
    Typical usage:
        >>> from Bio import Medline
        >>> file = open("Medline/pubmed_result4.txt")
        >>> affiliations = Medline.get_all_author_affiliations(file)
        >>> batty = affiliations["Batty S"]
        >>> for affiliation in batty:
        ...     print(affiliation)
        ...
        Undergraduate Pipeline Network Summer Research Program, University of New Mexico Health Sciences Center.
        Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ 85721, USA.
    """
    affiliations = Record()
    author = ""
    parsing_author = False
    handle = iter(handle)
    for line in handle:
        line = line.rstrip()
        key = line[:4].rstrip()
        if key == "AU":
            parsing_author = True
            author = line[6:]
            affiliations[author] = []
        elif key == "AD" and parsing_author:
            affiliations[author].append(line[6:])
        elif line[:6] == "      " and parsing_author:
            # Continuation line, append to last item
            affiliations[author][-1] += line[5:]
        elif key != "AD" and parsing_author:
            parsing_author = False
    return affiliations

@peterjc
Member

peterjc commented May 19, 2023

This looks like it would work, but means re-parsing the file. I would rather the existing Medline parser was changed to retain the author-affiliation mapping, which would mean changing the data structure returned.

What are your thoughts @mdehoon?

@erik-whiting
Contributor

erik-whiting commented May 19, 2023

> This looks like it would work, but means re-parsing the file. I would rather the existing Medline parser was changed to retain the author-affiliation mapping, which would mean changing the data structure returned.
>
> What are your thoughts @mdehoon?

That was my first thought but I went against it for two reasons:

  1. People who have written scripts assuming AD to be an array will suddenly have a dictionary to work with. I wasn't sure whether you would consider that an acceptable change.
  2. The Record class seems to be a high-fidelity mimic of the pubmed text file, and since the AD attribute in the pubmed text file was just plain text, I was worried a dictionary was too rich (for lack of a better word) of an object. So I opted for a totally new method.

Just wanted to share my thoughts on the current implementation in the PR but I'm happy to make it however you and @mdehoon think is best

@Bodasaieswar
Author

> This looks like it would work, but means re-parsing the file. I would rather the existing Medline parser was changed to retain the author-affiliation mapping, which would mean changing the data structure returned.
>
> What are your thoughts @mdehoon?

I completely agree with @peterjc. It would be really great if we could use the existing Medline parser. I would suggest structuring the 'AD' data as a list of lists, where each inner list represents the affiliations of a specific author. For example,

AD : [[affiliations of the first author], [affiliations of the second author], [affiliations of the third author]]
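As a rough illustration of how this layout would be consumed, here is a small sketch with a plain dictionary standing in for the parsed record. The contents are hypothetical (abbreviated from the example above), and it assumes the parser emits exactly one inner list per author.

```python
# Hypothetical record contents under the proposed list-of-lists layout;
# a plain dict stands in for the parsed record.
record = {
    "AU": ["Oh M", "Batty S", "Banerjee N", "Kim TH"],
    "AD": [
        ["Department of Pathology, School of Medicine."],
        [
            "Undergraduate Pipeline Network Summer Research Program, ...",
            "Department of Molecular and Cellular Biology, ...",
        ],
        ["School of Chemical Sciences, ..."],
        [
            "Department of Pathology, School of Medicine.",
            "University of New Mexico Comprehensive Cancer Center, ...",
        ],
    ],
}

# With one inner list per author, positional pairing recovers the mapping.
by_author = dict(zip(record["AU"], record["AD"]))
counts = [len(affiliations) for affiliations in record["AD"]]
print(counts)  # [1, 2, 1, 2]
```

One caveat with this design: the pairing only stays aligned if an author without any AD line still gets an (empty) inner list, otherwise every later author is shifted by one.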

@Bodasaieswar
Author

Bodasaieswar commented May 19, 2023

I have made the changes accordingly, and the 'AD' field now contains a list of lists. @peterjc, please review and let me know of any issues.

def parse(handle):
    """Read Medline records one by one from the handle.

    The handle is either a Medline file, a file-like object, or a list
    of lines describing one or more Medline records.

    Typical usage::

        >>> from Bio import Medline
        >>> with open("Medline/pubmed_result2.txt") as handle:
        ...     records = Medline.parse(handle)
        ...     for record in records:
        ...         print(record['TI'])
        ...
        A high level interface to SCOP and ASTRAL ...
        GenomeDiagram: a python package for the visualization of ...
        Open source clustering software.
        PDB file parser and structure class implemented in Python.

    """
    # These keys point to string values
    textkeys = (
        "ID",
        "PMID",
        "SO",
        "RF",
        "NI",
        "JC",
        "TA",
        "IS",
        "CY",
        "TT",
        "CA",
        "IP",
        "VI",
        "DP",
        "YR",
        "PG",
        "LID",
        "DA",
        "LR",
        "OWN",
        "STAT",
        "DCOM",
        "PUBM",
        "DEP",
        "PL",
        "JID",
        "SB",
        "PMC",
        "EDAT",
        "MHDA",
        "PST",
        "AB",
        "EA",
        "TI",
        "JT",
    )
    handle = iter(handle)

    key = ""
    record = Record()
    _au_counter = 0
    for line in handle:
        line = line.rstrip()
        if line[:6] == "      ":  # continuation line
            if key == "MH":
                # Multi-line MeSH term: append to the last entry in the list,
                # keeping the separating space by slicing from index 5
                record[key][-1] += line[5:]
            elif key == "AD":
                # Multi-line affiliation: append to the last entry of the
                # last inner list, keeping the separating space
                record[key][-1][-1] += line[5:]
            else:
                record[key].append(line[6:])
        elif line:
            key = line[:4].rstrip()
            if key == "AU":
                _au_counter += 1  # number of authors seen so far in this record
            if key not in record:
                record[key] = []
            if key == "AD":
                # One inner list per author: extend the current author's list
                # if it already exists, otherwise start a new inner list
                if record[key] and len(record[key]) >= _au_counter:
                    record[key][-1].append(line[6:])
                else:
                    record[key].append([line[6:]])
            else:
                record[key].append(line[6:])
        elif record:
            # Join each list of strings into one string.
            for key in record:
                if key in textkeys:
                    record[key] = " ".join(record[key])
            yield record
            record = Record()
            _au_counter = 0  # reset the author count for the next record
    if record:  # catch last one
        for key in record:
            if key in textkeys:
                record[key] = " ".join(record[key])
        yield record

Differences:

image

@mdehoon
Contributor

mdehoon commented May 21, 2023

Thank you.

> People that have written scripts assuming AD to be an array will suddenly have a dictionary to work with. I wasn't sure if you considered that an acceptable change or not

I prefer a solution that will work well for the future, and worry less about whether it will cause a change for users. Changes happen only once, but the future is forever.

A list of lists for the AD field has the advantage of being simple.
On the other hand, users will have to remember this structure, and also we are effectively creating parallel vectors instead of taking advantage of the object-oriented design of Python.

Looking at the example:

...
FAU - Oh, Mijung
AU  - Oh M
AD  - Department of Pathology, School of Medicine.
FAU - Batty, Skylar
AU  - Batty S
AD  - Undergraduate Pipeline Network Summer Research Program, University of New Mexico 
      Health Sciences Center.
AD  - Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ 
      85721, USA.
FAU - Banerjee, Nayan
AU  - Banerjee N
AD  - School of Chemical Sciences, Indian Association for the Cultivation of Science, 
      2A & 2B Raja S. C. Mullick Road, Jadavpur, Kolkata 700032, West Bengal, India.
FAU - Kim, Tae-Hyung
AU  - Kim TH
AD  - Department of Pathology, School of Medicine.
AD  - University of New Mexico Comprehensive Cancer Center, Albuquerque, NM 87131, USA.
...

I can see four blocks here, with keys FAU, AU, and AD.
Then I would create one dictionary for each block, and give the Record class an attribute (not a key) .authors that is a list of dictionaries. This moves away from a direct correspondence between the Medline raw data and the Python object, but anyway the point of the parser is to transform the Medline data into a reasonable Pythonesque representation of the data.
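A minimal sketch of that grouping, written as a standalone function over raw Medline lines rather than as a change to the parser. The name `group_authors` and the dictionary layout are assumptions for illustration only, not existing Biopython API; in the real parser the result would be stored on the record as `.authors`.

```python
def group_authors(lines):
    """Group FAU/AU/AD lines into one dictionary per author (illustrative only)."""
    authors = []
    key = ""
    for line in lines:
        line = line.rstrip()
        if line.startswith("      "):
            # Continuation line: extend the most recent affiliation,
            # keeping the separating space by slicing from index 5.
            if key == "AD" and authors and authors[-1]["AD"]:
                authors[-1]["AD"][-1] += line[5:]
            continue
        key = line[:4].rstrip()
        value = line[6:]
        if key == "FAU":
            # A FAU line opens a new author block.
            authors.append({"FAU": value, "AU": "", "AD": []})
        elif key == "AU":
            if authors and not authors[-1]["AU"]:
                authors[-1]["AU"] = value
            else:
                # AU without a preceding FAU: start a block anyway.
                authors.append({"FAU": "", "AU": value, "AD": []})
        elif key == "AD" and authors:
            authors[-1]["AD"].append(value)
    return authors


sample = [
    "FAU - Oh, Mijung",
    "AU  - Oh M",
    "AD  - Department of Pathology, School of Medicine.",
    "FAU - Batty, Skylar",
    "AU  - Batty S",
    "AD  - Undergraduate Pipeline Network Summer Research Program, University of New Mexico ",
    "      Health Sciences Center.",
    "AD  - Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ ",
    "      85721, USA.",
]

authors = group_authors(sample)
for author in authors:
    print(author["AU"], len(author["AD"]))
```

Because each block carries its own affiliation list, an author with no AD lines simply gets an empty list and nothing downstream shifts, which the parallel-list layouts cannot guarantee.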

@peterjc
Member

peterjc commented May 22, 2023

Perhaps we can do that with a new richer structure per author (using dicts? named tuple?), but leave the existing lists in place but deprecated (to facilitate people adjusting to the change)?
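One way this could look, as a sketch only: a named tuple per author, with the flat lists still populated but the lossy 'AD' list flagged on access. The names `Author` and `CompatRecord` are hypothetical and are not part of Biopython; the real Record class and its deprecation policy may well call for something different.

```python
import warnings
from typing import NamedTuple


class Author(NamedTuple):
    """Hypothetical per-author structure; not part of Biopython."""

    full_name: str
    name: str
    affiliations: tuple


class CompatRecord(dict):
    """Hypothetical record that keeps the flat lists but flags 'AD' as deprecated."""

    def __init__(self, authors):
        super().__init__()
        self.authors = list(authors)
        # Flat lists kept in place for backward compatibility.
        super().__setitem__("FAU", [a.full_name for a in self.authors])
        super().__setitem__("AU", [a.name for a in self.authors])
        super().__setitem__(
            "AD", [ad for a in self.authors for ad in a.affiliations]
        )

    def __getitem__(self, key):
        if key == "AD":
            warnings.warn(
                "The flat 'AD' list loses the author mapping; use .authors instead.",
                DeprecationWarning,
                stacklevel=2,
            )
        return super().__getitem__(key)


record = CompatRecord(
    [
        Author("Oh, Mijung", "Oh M", ("Department of Pathology, School of Medicine.",)),
        Author(
            "Kim, Tae-Hyung",
            "Kim TH",
            (
                "Department of Pathology, School of Medicine.",
                "University of New Mexico Comprehensive Cancer Center, Albuquerque, NM 87131, USA.",
            ),
        ),
    ]
)
print(record.authors[1].name, len(record.authors[1].affiliations))
```

Existing scripts that read `record["AD"]` would keep working and see a warning, while new code reads `record.authors` and gets the mapping directly.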

@idoerg
Contributor

idoerg commented Jan 4, 2024

Hi, following a brief discussion on the subject on the email list: has a fix been pushed yet? It seems to be stuck on @peterjc's suggestion for backward compatibility...
