# Investigate 404 error in the ingest pipeline

During our routine weekly data ingestion, we found 3659 document identifiers (docids) that can be retrieved through the xdd API endpoint but return 404 errors when looked up directly in Elasticsearch, meaning the documents are not found in the index.

In [24]:
from askem.elastic import get_text
import requests

Here is one example docid:

In [25]:
sample_404_docid = "65d8bf52c627927cfbc386c7"

requests.get(f"https://xdd.wisc.edu/api/articles?docid={sample_404_docid}").json()

{'success': {'v': 1,
  'data': [{'type': 'fulltext',
    '_gddid': '65d8bf52c627927cfbc386c7',
    'title': 'Effect of hexagonal structure nanoparticles on the morphological performance of the ceramic scaffold using analytical oscillation response',
    'volume': '47',
    'journal': 'Ceramics International',
    'link': [{'url': 'https://www.sciencedirect.com/science/article/pii/S0272884221008427',
      'type': 'publisher'}],
    'publisher': 'Elsevier',
    'abstract': '',
    'author': [{'name': 'Sahmani, Saeid'},
     {'name': 'Soleimani, Maryam'},
     {'name': 'Kolooshani, Amin'},
     {'name': 'Saber-Samandari, Saeed'},
     {'name': 'Khandan, Amirsalar'}],
    'pages': '18339--18350',
    'number': '13',
    'identifier': [{'type': 'doi', 'id': '10.1016/j.ceramint.2021.03.155'}],
    'year': '2021'}],
  'hits': 1,
  'license': 'https://creativecommons.org/licenses/by-nc/2.0/'}}

In [26]:
get_text(docid=sample_404_docid)

NotFoundError: NotFoundError(404, "{'_index': 'articles_v1', '_type': '_doc', '_id': '65d8bf52c627927cfbc386c7', 'found': False}")

In [27]:
# Make sure get_text is working fine...
get_text(docid="5ebdebf5998e17af826e9591")[:100]

'medRxiv preprint doi: https://doi.org/10.1101/2020.05.11.20098087.this version posted May 14, 2020. '
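The two lookups above can be combined into a single check per docid. The following is a minimal sketch, not part of the original pipeline: `docid_status` and its parameters are hypothetical names, the lookups are passed in as callables so the logic can be tried without network access, and it assumes that an API response without a non-empty `success.data` list means the docid is unknown to the API.

```python
def docid_status(docid, api_lookup, es_lookup, not_found=(LookupError,)):
    """Report whether a docid is visible in the API and in Elasticsearch.

    api_lookup: callable returning the parsed JSON of the articles endpoint.
    es_lookup:  callable that raises one of `not_found` when the doc is missing.
    not_found:  exception types treated as "missing" (hypothetical default;
                pass whatever exception get_text actually raises).
    """
    response = api_lookup(docid)
    # Assumption: a hit shows up as a non-empty list under success -> data.
    in_api = bool(response.get("success", {}).get("data"))

    try:
        es_lookup(docid)
        in_es = True
    except not_found:
        in_es = False

    return {"docid": docid, "in_api": in_api, "in_es": in_es}
```

In this notebook, `api_lookup` could wrap the `requests.get(...).json()` call shown earlier and `es_lookup` could be `lambda d: get_text(docid=d)`, with `not_found` set to the `NotFoundError` type seen in the traceback. For the sample docid, the expected result would be `in_api=True` and `in_es=False`.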

Let's generate a list of the problematic docids from the error log.

In [28]:
import re

docids_404 = []
with open("tmp/error.log", "r") as f:
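    for line in f:
        if "NotFoundError" in line:
            # Skip lines that mention NotFoundError but carry no docid,
            # instead of failing on a None match
            match = re.search(r"docid: (\w+)", line)
            if match:
                docids_404.append(match.group(1))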

In [30]:
# Deduplicate; sort so the output file is reproducible across runs
docids_404 = sorted(set(docids_404))
print(f"{len(docids_404)=}")

with open("tmp/docids_404.txt", "w") as f:
    for docid in docids_404:
        f.write(docid + "\n")

len(docids_404)=3659
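Only one of the 3659 docids was checked against the xdd API by hand above. A minimal sketch of how the full list could be tallied follows. The names `api_has_docid`, `tally_api_hits`, and `fetch` are hypothetical, and the code again assumes that a non-empty `success.data` list is what a hit looks like.

```python
from collections import Counter
from typing import Callable, Iterable


def api_has_docid(response: dict) -> bool:
    """True if a parsed articles response contains at least one record."""
    # Assumption: hits appear as a non-empty list under success -> data.
    return bool(response.get("success", {}).get("data"))


def tally_api_hits(docids: Iterable[str], fetch: Callable[[str], dict]) -> Counter:
    """Count docids that the API returns, does not return, or fails on."""
    counts = Counter()
    for docid in docids:
        try:
            found = api_has_docid(fetch(docid))
        except Exception:
            # Deliberately broad: network or JSON errors are counted as
            # "error" so one bad request does not abort the whole run.
            counts["error"] += 1
            continue
        counts["found" if found else "missing"] += 1
    return counts


# Possible usage in this notebook (performs one HTTP request per docid):
# fetch = lambda d: requests.get(
#     f"https://xdd.wisc.edu/api/articles?docid={d}", timeout=30
# ).json()
# print(tally_api_hits(docids_404, fetch))
```

If every docid really is retrievable through the API, all 3659 should land in the `found` bucket. Any `missing` or `error` entries would be worth inspecting separately.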
