We have detected Bibliographic Resources (BRs) in OpenCitations Meta that are linked to multiple external IDs which, in the real world, identify distinct publications appearing periodically in the same venue (journal), e.g. editorial comments or recurrent news columns in a specific research field.

Although the periodicity of these publications might not be relevant to the cause of the problem (i.e. IDs of separate resources all being linked to a single resource in Meta), it is worth pointing out and taking into consideration. These scenarios do not seem to be randomly generated by software bugs in Meta (unlike the cases where different real-world entities with no perceivable common features have been erroneously merged); rather, they seem to result from errors in the data provided by OpenCitations' primary sources (e.g. Crossref, DataCite, PubMed, OpenAIRE, etc.).

To analyse such incorrectly-represented entities, we need to take a close look at some examples. Let's consider and study a specific resource: [br/061903839782](https://w3id.org/oc/meta/br/061903839782).

It has several contributors and definitely too many external IDs linked to it.

As the OpenCitations REST API currently struggles to retrieve the resource, I used the SPARQL endpoint to retrieve all the external ID values, using the following approach.

In [2]:
import csv
from SPARQLWrapper import SPARQLWrapper, JSON
from string import Template

def get_br_external_ids(br_uris):
    """Return, for each BR URI, a dict mapping each external ID value to its scheme."""
    query_template = Template('''
    PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>
    PREFIX fabio: <http://purl.org/spar/fabio/>
    PREFIX datacite: <http://purl.org/spar/datacite/>

    SELECT ?value ?scheme {
    $br_uri datacite:hasIdentifier ?id .
    ?id datacite:usesIdentifierScheme ?scheme ;
        literal:hasLiteralValue ?value .
    }
    ''')


    out = dict()

    # sparql = SPARQLWrapper('https://k8s.opencitations.net/meta/sparql')
    sparql = SPARQLWrapper('https://test.opencitations.net/meta/sparql')


    for br in br_uris:
        out[br] = dict()
        query = query_template.substitute(br_uri=f'<{br}>')
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        for result in results["results"]["bindings"]:
            id_value = result['value']['value']
            id_scheme = result['scheme']['value'].removeprefix('http://purl.org/spar/datacite/')
            out[br][id_value] = id_scheme

    return out

I used the `get_br_external_ids()` function above to retrieve the external IDs of [br/061903839782](https://w3id.org/oc/meta/br/061903839782):

In [3]:
from pprint import pprint
import json

brs_to_check = ['https://w3id.org/oc/meta/br/061903839782']
br061903839782_external_ids = get_br_external_ids(brs_to_check)['https://w3id.org/oc/meta/br/061903839782']
pprint(br061903839782_external_ids)

with open('br061903839782_external_ids.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(br061903839782_external_ids, indent=4))

{'10.1016/j.surneu.2003.09.012': 'doi',
 '10.1016/j.surneu.2003.10.021': 'doi',
 '10.1016/j.surneu.2003.11.010': 'doi',
 '10.1016/j.surneu.2003.12.003': 'doi',
 '10.1016/j.surneu.2004.03.007': 'doi',
 '10.1016/j.surneu.2004.04.013': 'doi',
 '10.1016/j.surneu.2004.06.007': 'doi',
 '10.1016/j.surneu.2004.07.019': 'doi',
 '10.1016/j.surneu.2004.08.077': 'doi',
 '10.1016/j.surneu.2004.09.024': 'doi',
 '10.1016/j.surneu.2004.10.026': 'doi',
 '10.1016/j.surneu.2004.11.019': 'doi',
 '10.1016/j.surneu.2004.12.007': 'doi',
 '10.1016/j.surneu.2005.01.011': 'doi',
 '10.1016/j.surneu.2005.02.004': 'doi',
 '10.1016/j.surneu.2005.03.015': 'doi',
 '10.1016/j.surneu.2005.04.023': 'doi',
 '10.1016/j.surneu.2005.05.004': 'doi',
 '10.1016/j.surneu.2005.06.012': 'doi',
 '10.1016/j.surneu.2005.07.045': 'doi',
 '10.1016/j.surneu.2005.08.005': 'doi',
 '10.1016/j.surneu.2005.09.007': 'doi',
 '10.1016/j.surneu.2005.10.015': 'doi',
 '10.1016/j.surneu.2005.11.054': 'doi',
 '10.1016/j.surneu.2005.12.018': 'doi',


First, we can check how many of these external IDs are registered in OpenAire.

In [4]:
import json

with open('br061903839782_external_ids.json', 'r', encoding='utf-8') as f:
    br061903839782_external_ids = json.load(f)

id_list = list(br061903839782_external_ids)

In [6]:
import requests

def get_ids_in_openaire_from_doi(id_list):
    """Query the OpenAIRE API for each ID and collect the PIDs of the matching records."""
    openaire_api = "https://api.openaire.eu/search/researchProducts"

    output = dict()
    for ext_id in id_list:
        params = {
            "originalId": ext_id,
            "format": "json"
        }

        response = requests.get(openaire_api, params=params)

        if response.status_code == 200:
            data = response.json()
            something_returned = data['response']['results']
            if something_returned:
                output[ext_id] = dict()
                returned_info = something_returned['result']
                for el in returned_info:
                    pids = el['metadata']['oaf:entity']['oaf:result']['pid']
                    # Defensive: treat a single PID object like a one-item list
                    if isinstance(pids, dict):
                        pids = [pids]
                    for pid in pids:
                        output[ext_id].update({pid['$']: pid['@classid']})
        else:
            print(f"Error: {response.status_code}")
    return output


In [7]:
# all_ids_in_openaire = get_ids_in_openaire_from_doi(id_list)
# pprint(all_ids_in_openaire)

The results show that only 2 of the IDs associated with the BR are registered in OpenAIRE: DOI [10.1016/s0090-3019(02)00969-2](https://api.openaire.eu/search/researchProducts?doi=10.1016/s0090-3019(02)00969-2&format=json) and PMID [19559923](https://api.openaire.eu/search/researchProducts?originalId=19559923&format=json) (represented as two different entities).

We now query the E-utilities API for the PubMed database, passing it the DOIs linked to [br/061903839782](https://w3id.org/oc/meta/br/061903839782), to see which PMIDs are associated with each of them in PubMed. We then store the results in a JSON file, as querying the API takes a while.

In [8]:
import requests
import json
import time
from tqdm import tqdm


def get_pmids_for_doi(doi_list):
    """Return a dict mapping each DOI to the PMIDs that esearch returns for it."""
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    pmid_dict_out = dict()

    def search(params):
        # Send one esearch request and return the list of PMIDs in the response
        response = requests.get(base_url, params=params)
        return response.json()['esearchresult']['idlist']

    for doi in tqdm(doi_list):
        params = {
            "db": "pubmed",
            "term": doi,
            "field": "DOI",
            "retmode": "json"
        }
        try:
            pmids = search(params)
            time.sleep(0.60)  # pause between consecutive requests
        except Exception as e:
            # On any failure, wait and retry once; a second failure is raised
            print(e)
            time.sleep(5)
            pmids = search(params)

        if pmids:
            pmid_dict_out.setdefault(doi, []).extend(pmids)

    return pmid_dict_out

In [9]:
with open('br061903839782_external_ids.json', 'r', encoding='utf-8') as f:
    br061903839782_external_ids = json.load(f)

doi_list = [value for value, scheme in br061903839782_external_ids.items() if scheme=='doi']

In [10]:
# doi_to_pmids_dict = get_pmids_for_doi(doi_list)
# pprint(doi_to_pmids_dict)
# with open('doi_to_pmids_register_br061903839782.json', 'w', encoding='utf-8') as outfp:
#     outfp.write(json.dumps(doi_to_pmids_dict, indent=4))

Let's examine the obtained results.

In [28]:
with open('doi_to_pmids_register_br061903839782.json', 'r', encoding='utf-8') as fp:
    doi_pmids_mapping_pubmed = json.load(fp)

First of all, a manual look at the results shows that most of the DOIs are mapped, in PubMed, to several PMIDs, and that the great majority of those PMIDs are in turn connected to multiple DOIs. For example, see the following code cell, where both DOIs return the same list of 20 PMIDs:

In [None]:
# print 2 example DOIs linked to multiple PMIDs in PubMed
examples=2
for doi, pmids in doi_pmids_mapping_pubmed.items():
    print({doi:pmids})
    examples-=1
    if examples==0:
        break

{'10.1016/j.surneu.2003.09.012': ['32166015', '31662883', '34514277', '31267820', '32336781', '25806538', '24343475', '24343474', '24343473', '24343472', '24343471', '24343470', '24343469', '24343468', '24343467', '24343466', '24343465', '24343464', '24343463', '24343462']}
{'10.1016/j.surneu.2003.10.021': ['32166015', '31662883', '34514277', '31267820', '32336781', '25806538', '24343475', '24343474', '24343473', '24343472', '24343471', '24343470', '24343469', '24343468', '24343467', '24343466', '24343465', '24343464', '24343463', '24343462']}
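
The observation that many PMIDs are connected to multiple DOIs can also be checked programmatically by inverting the mapping. The following is a minimal sketch, not part of the original analysis: `invert_mapping` is a hypothetical helper, and the sample dictionary is a shortened excerpt of the data shown above (the PMID lists are cut to three items for brevity).

```python
from collections import defaultdict


def invert_mapping(doi_to_pmids):
    """Turn a DOI -> [PMIDs] mapping into a PMID -> {DOIs} mapping."""
    pmid_to_dois = defaultdict(set)
    for doi, pmids in doi_to_pmids.items():
        for pmid in pmids:
            pmid_to_dois[pmid].add(doi)
    return dict(pmid_to_dois)


# Shortened excerpt of the DOI -> PMIDs data (illustrative only)
sample = {
    '10.1016/j.surneu.2003.09.012': ['32166015', '31662883', '34514277'],
    '10.1016/j.surneu.2003.10.021': ['32166015', '31662883', '34514277'],
    '10.1016/j.surneu.2004.10.026': ['15639508'],
}

pmid_to_dois = invert_mapping(sample)
# Keep only the PMIDs that are reachable from more than one DOI
shared = {pmid: dois for pmid, dois in pmid_to_dois.items() if len(dois) > 1}
print(f'{len(shared)} of {len(pmid_to_dois)} PMIDs are linked to more than one DOI')
```

On the full data, `invert_mapping(doi_pmids_mapping_pubmed)` would give the complete PMID-to-DOIs view.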


In [None]:
single_mapped_dois = {}
for doi, pmid_list in doi_pmids_mapping_pubmed.items():
    if len(pmid_list) == 1:
        single_mapped_dois[doi] = pmid_list[0]

pprint(single_mapped_dois)
print(len(single_mapped_dois))

{'10.1016/j.surneu.2004.10.026': '15639508',
 '10.1016/j.surneu.2005.03.015': '15936360',
 '10.1016/j.surneu.2005.05.004': '16050995',
 '10.1016/j.surneu.2005.10.015': '16378837',
 '10.1016/j.surneu.2005.11.054': '16427394',
 '10.1016/j.surneu.2005.12.018': '16488237',
 '10.1016/j.surneu.2006.01.015': '16531185',
 '10.1016/j.surneu.2006.11.031': '17210283',
 '10.1016/j.surneu.2007.02.053': '17445598',
 '10.1016/j.surneu.2007.03.036': '17512322',
 '10.1016/j.surneu.2007.05.002': '17586209',
 '10.1016/j.surneu.2008.02.022': '18424297',
 '10.1016/j.surneu.2008.04.005': '18486693',
 '10.1016/j.surneu.2008.09.024': '19055951',
 '10.1016/j.surneu.2008.10.013': '19084681',
 '10.1016/j.surneu.2009.03.017': '19427937',
 '10.1016/j.surneu.2009.04.028': '19559923',
 '10.1016/s0090-3019(01)00712-1': '11922043',
 '10.1016/s0090-3019(02)00969-2': '12638558',
 '10.1016/s0090-3019(03)00073-9': '12681533',
 '10.1016/s0090-3019(03)00175-7': '12765795'}
21


In PubMed, 21 of the DOIs are each mapped to exactly one PMID.

**Retrieve the DOIs associated in PubMed with the PMIDs of the BR.**

In [12]:
pmids_for_br061903839782= [k for k,v in br061903839782_external_ids.items() if v=='pmid'] # All the PMIDs pointing to br/061903839782 in Meta
print('PMIDs pointing to br/061903839782 in Meta:')
print(pmids_for_br061903839782)

PMIDs pointing to br/061903839782 in Meta:
['17445598', '17512322', '15639508', '15936360', '16050995', '16378837', '16427394', '16488237', '16531185', '17210283', '17586209', '18424297', '18486693', '19055951', '19084681', '19427937', '19559923']


In [13]:
import requests
from xml.etree import ElementTree


# Function to retrieve detailed metadata using efetch
def fetch_dois_efetch(pmids_to_search):
    """Return a dict mapping each PMID to the list of DOIs recorded for it in PubMed."""
    doi_mapping = {}

    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

    for pmid_to_search in pmids_to_search:
        params = {
            "db": "pubmed",
            "id": pmid_to_search,
            "retmode": "xml",
            "tool": "MyPythonScript",
        }

        response = requests.get(url, params=params)
        response.raise_for_status()

        # Parse XML response
        root = ElementTree.fromstring(response.content)
        for article in root.findall(".//PubmedArticle"):
            pmid = article.find(".//PMID").text
            doi_list = []
            # Search for DOI within ArticleIdList
            for article_id in article.findall(".//ArticleId"):
                id_type = article_id.get("IdType")
                if id_type == "doi":
                    doi_list.append(article_id.text)
                elif id_type not in ("pubmed", "pii"):
                    # Report any other identifier type (neither PMID nor PII)
                    print(id_type, article_id.text)
            doi_mapping[pmid] = doi_list

    return doi_mapping

In [14]:
# Retrieve DOIs for all the PMIDs that are associated with BR in Meta via PubMed API

pmids_in_meta2dois = fetch_dois_efetch(pmids_for_br061903839782)

pprint(pmids_in_meta2dois)

{'15639508': ['10.1016/j.surneu.2004.10.026'],
 '15936360': ['10.1016/j.surneu.2005.03.015'],
 '16050995': ['10.1016/j.surneu.2005.05.004'],
 '16378837': ['10.1016/j.surneu.2005.10.015'],
 '16427394': ['10.1016/j.surneu.2005.11.054'],
 '16488237': ['10.1016/j.surneu.2005.12.018'],
 '16531185': ['10.1016/j.surneu.2006.01.015'],
 '17210283': ['10.1016/j.surneu.2006.11.031'],
 '17445598': ['10.1016/j.surneu.2007.02.053'],
 '17512322': ['10.1016/j.surneu.2007.03.036'],
 '17586209': ['10.1016/j.surneu.2007.05.002'],
 '18424297': ['10.1016/j.surneu.2008.02.022'],
 '18486693': ['10.1016/j.surneu.2008.04.005'],
 '19055951': ['10.1016/j.surneu.2008.09.024'],
 '19084681': ['10.1016/j.surneu.2008.10.013'],
 '19427937': ['10.1016/j.surneu.2009.03.017'],
 '19559923': ['10.1016/j.surneu.2009.04.028']}


We can double-check the PMIDs mapped to single DOIs (obtained by passing the DOIs associated with the BR to the PubMed API) by making another call to the PubMed API, this time sending the PMIDs instead of the DOIs. We should get the same mapping we already have, but indexed by PMID instead of DOI.

In [15]:
# Retrieve via PubMed API the DOIs for all the PMIDs that are linked to a single DOI in PubMed according to the data in 'doi_to_pmids_register_br061903839782.json'
single_pmids2dois = fetch_dois_efetch(list(single_mapped_dois.values()))

In [16]:
# check if PMID2DOI and DOI2PMID links are represented consistently in PubMed (starting from 1:1 DOI2PMID mappings from PubMed)
for pmid, doi_list in single_pmids2dois.items():
    for doi,pmids_list in doi_pmids_mapping_pubmed.items():
        if pmid in pmids_list and doi_list[0] != doi:
            print(pmid)
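
The consistency check between the two directions can also be written as a direct comparison of the two mappings. This is only a sketch under stated assumptions: `find_inconsistencies` is a hypothetical helper, and the two sample dictionaries contain just two pairs copied from the outputs shown in this notebook.

```python
def find_inconsistencies(doi_to_pmid, pmid_to_dois):
    """Return the PMIDs whose efetch DOIs disagree with the DOI -> PMID mapping."""
    # Invert the 1:1 DOI -> PMID mapping to get the DOI expected for each PMID
    expected = {pmid: doi for doi, pmid in doi_to_pmid.items()}
    return {
        pmid: (expected.get(pmid), dois)
        for pmid, dois in pmid_to_dois.items()
        if dois != [expected.get(pmid)]
    }


# Two pairs copied from the outputs in this notebook
doi_to_pmid = {
    '10.1016/j.surneu.2004.10.026': '15639508',
    '10.1016/j.surneu.2005.03.015': '15936360',
}
pmid_to_dois = {
    '15639508': ['10.1016/j.surneu.2004.10.026'],
    '15936360': ['10.1016/j.surneu.2005.03.015'],
}

# An empty dict means that the two directions agree
print(find_inconsistencies(doi_to_pmid, pmid_to_dois))
```

With the notebook's own variables, the call would be `find_inconsistencies(single_mapped_dois, single_pmids2dois)`.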

In [17]:
pmids_from_meta = set(pmids_for_br061903839782)
pmids_single_mapped_from_pubmed = set(single_mapped_dois.values())

print("Are all the PMIDs mapped to a single DOI in PubMed associated to br/061903839782 in Meta, or vice versa?", pmids_from_meta == pmids_single_mapped_from_pubmed)

pmids_in_meta_only= list(pmids_from_meta.difference(pmids_single_mapped_from_pubmed))
print('PMIDS in Meta only: ', pmids_in_meta_only)

pmids_in_pubmed_only= list(pmids_single_mapped_from_pubmed.difference(pmids_from_meta))
print('PMIDs in PubMed only', pmids_in_pubmed_only)

Are all the PMIDs mapped to a single DOI in PubMed associated to br/061903839782 in Meta, or vice versa? False
PMIDS in Meta only:  []
PMIDs in PubMed only ['12681533', '12765795', '11922043', '12638558']


Meta contains all 4 DOIs that appear to be associated, in a 1:1 ratio, with the following PMIDs in PubMed: '12765795', '11922043', '12638558', '12681533'. However, Meta does not have the PMIDs themselves (or at least they are not associated with the BR we are examining). This is puzzling: we most probably took these DOIs from Crossref, but how were they merged into br/061903839782 if we do not have the PMIDs to which they are linked in PubMed?

**Check if these PMIDs are associated with other BRs in Meta.**

In [18]:
def get_br_from_pmid(pmids_list):
    query_template = Template('''
    PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>
    PREFIX fabio: <http://purl.org/spar/fabio/>
    PREFIX datacite: <http://purl.org/spar/datacite/>

    SELECT ?inputValue ?br ?doiValue {
        BIND($value AS ?inputValue)
        ?id literal:hasLiteralValue ?inputValue ;
            datacite:usesIdentifierScheme datacite:pmid .
        ?br datacite:hasIdentifier ?id .
        OPTIONAL {
            ?br datacite:hasIdentifier ?doi .
            ?doi datacite:usesIdentifierScheme datacite:doi ;
                literal:hasLiteralValue ?doiValue .              
        }
    }
    ''')


    # sparql = SPARQLWrapper('https://k8s.opencitations.net/meta/sparql')
    sparql = SPARQLWrapper('https://test.opencitations.net/meta/sparql')

    out = dict()
    for pmid in pmids_list:
        query = query_template.substitute(value=f'"{pmid}"')
        sparql.setQuery(query)
        # print(sparql.queryString)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        for result in results["results"]["bindings"]:
            found_pmid = result['inputValue']['value']
            br = result['br']['value']
            # ?doiValue comes from an OPTIONAL pattern, so it may be unbound
            doi = result.get('doiValue', {}).get('value')
            ids = out.setdefault(br, {'pmid': set(), 'doi': set()})
            ids['pmid'].add(found_pmid)
            if doi is not None:
                ids['doi'].add(doi)
    return out


In [34]:
get_br_from_pmid(pmids_in_pubmed_only)

{'https://w3id.org/oc/meta/br/061403699666': {'pmid': {'12681533'},
  'doi': {'10.1016/s0090-3019(03)00073-9'}},
 'https://w3id.org/oc/meta/br/061403699690': {'pmid': {'12765795'},
  'doi': {'10.1016/s0090-3019(03)00175-7'}},
 'https://w3id.org/oc/meta/br/061403699545': {'pmid': {'11922043'},
  'doi': {'10.1016/s0090-3019(01)00712-1'}}}

Of the 4 PMIDs that are not associated with `br/061903839782` in Meta but are mapped in PubMed to 4 DOIs that are linked to it ('12765795', '11922043', '12638558', '12681533'), 3 are in fact present in Meta, but associated with separate bibliographic resources. Each of these bibliographic resources has a DOI that also points to `br/061903839782`.

| pmid     | br uri                                   | doi                           |
|----------|------------------------------------------|-------------------------------|
| 12765795 | https://w3id.org/oc/meta/br/061403699690 | 10.1016/s0090-3019(03)00175-7 |
| 12681533 | https://w3id.org/oc/meta/br/061403699666 | 10.1016/s0090-3019(03)00073-9 |
| 11922043 | https://w3id.org/oc/meta/br/061403699545 | 10.1016/s0090-3019(01)00712-1 |

PMID 12638558 is not present anywhere in Meta, though its DOI is among the ones associated with `br/061903839782`.
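
The missing PMID can be derived from the query result rather than spotted by eye. The sketch below is illustrative: `pmids_missing_from_meta` is a hypothetical helper, and the input data are copied from the outputs above.

```python
def pmids_missing_from_meta(searched_pmids, brs_by_uri):
    """Return the searched PMIDs that no BR in the query result carries."""
    found = set()
    for ids in brs_by_uri.values():
        found |= ids.get('pmid', set())
    return set(searched_pmids) - found


# Data copied from the outputs above
searched = ['12681533', '12765795', '11922043', '12638558']
result = {
    'https://w3id.org/oc/meta/br/061403699666': {
        'pmid': {'12681533'}, 'doi': {'10.1016/s0090-3019(03)00073-9'}},
    'https://w3id.org/oc/meta/br/061403699690': {
        'pmid': {'12765795'}, 'doi': {'10.1016/s0090-3019(03)00175-7'}},
    'https://w3id.org/oc/meta/br/061403699545': {
        'pmid': {'11922043'}, 'doi': {'10.1016/s0090-3019(01)00712-1'}},
}

print(pmids_missing_from_meta(searched, result))  # {'12638558'}
```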

Moreover, each of the 3 DOIs in the table above also points, in Meta, to a further BR that has the DOI itself as its only external ID. See, for example, the SPARQL query below, which retrieves all the BRs associated with the DOI 10.1016/s0090-3019(03)00175-7 (the other two DOIs can be passed in the same way). One of the results is the BR we are examining, i.e. br/061903839782; another is the BR that also has the PMID, i.e. br/061403699690 in the table above; the last one is br/061302095914.

```sparql
PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>

SELECT ?br {
    ?id literal:hasLiteralValue "10.1016/s0090-3019(03)00175-7".
    ?br datacite:hasIdentifier ?id.
}
```

**OUTPUT**:

|  | br |
| -- | -- |
| 1 | https://w3id.org/oc/meta/br/061302095914 |
| 2 | https://w3id.org/oc/meta/br/061403699690 |
| 3 | https://w3id.org/oc/meta/br/061903839782 |
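
To repeat the query for the other two DOIs, the query text can be generated from a template, following the same `string.Template` approach used in the functions above. This is a minimal sketch: the name `QUERY_TEMPLATE` is an assumption, and the block only builds the query strings without sending them to the endpoint.

```python
from string import Template

# Same query as above, with the DOI value turned into a placeholder
QUERY_TEMPLATE = Template('''
PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>

SELECT ?br {
    ?id literal:hasLiteralValue "$doi" .
    ?br datacite:hasIdentifier ?id .
}
''')

# The three DOIs listed in the table above
dois = [
    '10.1016/s0090-3019(03)00175-7',
    '10.1016/s0090-3019(03)00073-9',
    '10.1016/s0090-3019(01)00712-1',
]

queries = {doi: QUERY_TEMPLATE.substitute(doi=doi) for doi in dois}
print(queries[dois[1]])
```

Each generated string can then be passed to `sparql.setQuery()` as in the earlier cells.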

Let's see if the PMIDs that are mapped to a single DOI in PubMed are also associated with other BRs in Meta.

In [20]:
_pmids = pmids_single_mapped_from_pubmed.difference(pmids_in_pubmed_only)  # exclude the 4 PMIDs we have already searched in Meta

res = get_br_from_pmid(_pmids)


In [21]:
pprint(res)

{'https://w3id.org/oc/meta/br/061903839782': {'doi': {'10.1016/j.surneu.2003.09.012',
                                                      '10.1016/j.surneu.2003.10.021',
                                                      '10.1016/j.surneu.2003.11.010',
                                                      '10.1016/j.surneu.2003.12.003',
                                                      '10.1016/j.surneu.2004.03.007',
                                                      '10.1016/j.surneu.2004.04.013',
                                                      '10.1016/j.surneu.2004.06.007',
                                                      '10.1016/j.surneu.2004.07.019',
                                                      '10.1016/j.surneu.2004.08.077',
                                                      '10.1016/j.surneu.2004.09.024',
                                                      '10.1016/j.surneu.2004.10.026',
                                                      

These results show that the PMIDs that are associated with br/061903839782 in Meta and mapped to a single DOI in PubMed are associated exclusively with br/061903839782: none of them points to any other BR.
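
Exclusivity can be verified programmatically by indexing the result by PMID and looking for PMIDs carried by more than one BR. The following is a sketch only: `brs_per_pmid` is a hypothetical helper, and `sample_res` is an illustrative input shaped like `res`, with the ID sets cut down to a few values.

```python
def brs_per_pmid(brs_by_uri):
    """Map each PMID to the set of BR URIs that carry it."""
    index = {}
    for br, ids in brs_by_uri.items():
        for pmid in ids.get('pmid', set()):
            index.setdefault(pmid, set()).add(br)
    return index


# Illustrative input shaped like `res` (ID sets shortened)
sample_res = {
    'https://w3id.org/oc/meta/br/061903839782': {
        'pmid': {'15639508', '15936360'},
        'doi': {'10.1016/j.surneu.2004.10.026'},
    },
}

index = brs_per_pmid(sample_res)
# PMIDs that point to more than one BR; empty when every PMID is exclusive
shared = {pmid: brs for pmid, brs in index.items() if len(brs) > 1}
print(shared)
```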