# Why Checking DOIs via API is the “Silver Bullet” for AI Detection
Checking references, specifically through their Digital Object Identifiers (DOIs), is arguably the most definitive method to catch AI hallucinations. Large Language Models (LLMs) like ChatGPT often generate plausible-sounding citations that do not actually exist.

Here is why the Python + doi.org Content Negotiation method is superior:

- Deterministic Accuracy (Binary Result)
Unlike analyzing writing style or “perplexity” scores, which are probabilistic and prone to false positives, a DOI check is binary. A DOI either exists in the global registry, or it doesn’t.

Result: 404 Not Found = the DOI is not registered. Unless the DOI string was mistyped or mangled during extraction, the reference is fake.

- Detecting “Stolen” DOIs
AI sometimes hallucinates by taking a real DOI from an unrelated paper and attaching it to a fake citation.

The Fix: By retrieving the metadata (JSON) directly from the source, you can compare the actual title in the database against the title listed in the suspicious paper. If the paper claims to be about “Economics” but the DOI resolves to “Marine Biology,” that is strong evidence the citation was fabricated.

- Global Coverage (Not Just One Publisher)
By querying the central doi.org resolver rather than specific publisher APIs (like Elsevier or Wiley), this method is not tied to any single publisher.

Efficiency: It handles redirects automatically, finding the metadata whether the DOI is registered with Crossref, DataCite, or mEDRA.

- Scalability and Automation
Manually clicking 50 links is tedious. This Python script allows for batch processing. You can feed it a list of 100 references and receive a full audit report within minutes, making it well suited to editors, professors, or automated quality control systems.
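
The “stolen DOI” check described above depends on comparing the cited title with the title held in the registry. A minimal sketch of that comparison, using only the standard library, might look like the following. The helper names (`normalize_title`, `title_similarity`, `titles_match`) and the 0.8 threshold are assumptions made for illustration; they are not part of the original script, and the threshold would need tuning on real data.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase a title and drop punctuation so formatting differences are ignored."""
    words = re.findall(r"[a-z0-9]+", str(title).lower())
    return " ".join(words)


def title_similarity(cited_title, registry_title):
    """Return a similarity ratio between 0.0 and 1.0 for two titles."""
    a = normalize_title(cited_title)
    b = normalize_title(registry_title)
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def titles_match(cited_title, registry_title, threshold=0.8):
    """Hypothetical helper: True when the titles are similar enough (assumed threshold)."""
    return title_similarity(cited_title, registry_title) >= threshold


# Example strings for illustration only.
registry_title = "Physics and the Real World"
print(titles_match("Physics and the real world.", registry_title))         # same paper, different formatting
print(titles_match("Optimal Taxation in Open Economies", registry_title))  # unrelated title
```

A low score is best treated as a prompt for manual review rather than a verdict, since subtitles, translations, or truncated titles can lower the ratio for a genuine citation.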

This section demonstrates that the approach is an efficient way to check whether a paper's references are valid.

In [6]:
import re
import pandas as pd
import time

In [3]:
import requests

def verify_doi_validity(doi_input):
    """
    Checks if a DOI exists by querying the doi.org resolver directly.
    Returns detailed metadata if valid, or an error status if invalid.
    """
    # Clean the input to ensure we only have the DOI string
    clean_doi = doi_input.replace("https://doi.org/", "").replace("http://doi.org/", "")
    
    url = f"https://doi.org/{clean_doi}"
    
    headers = {
        "Accept": "application/vnd.citationstyles.csl+json"
    }

    try:
        response = requests.get(url, headers=headers, allow_redirects=True, timeout=10)
        
        if response.status_code == 200:
            try:
                data = response.json()
            except ValueError:
                return {"status": "Error", "details": "Response was not valid JSON."}
            
            # 1. Extracting Title
            title = data.get('title', 'N/A')
            if isinstance(title, list) and len(title) > 0:
                title = title[0]
            
            # 2. Extracting Journal Name (Container Title)
            journal = data.get('container-title', 'N/A')
            if isinstance(journal, list) and len(journal) > 0:
                journal = journal[0]

            # 3. Extracting First Author's Last Name
            author_lastname = "N/A"
            if 'author' in data and len(data['author']) > 0:
                # We take the first author in the list
                author_lastname = data['author'][0].get('family', 'N/A')

            return {
                "status": "Valid",
                "real_title": title,
                "journal": journal,
                "first_author": author_lastname
            }
            
        elif response.status_code == 404:
            return {"status": "Invalid", "details": "DOI not found"}
        else:
            return {"status": "Error", "details": f"HTTP Code: {response.status_code}"}

    except Exception as e:
        return {"status": "Connection Error", "details": str(e)}

# --- Usage Example ---

doi_list_to_check = [
    "10.1038/nature123",            # Fake
    "10.1007/s10701-005-9016-x",    # Valid (Physics paper)
    "10.1016/j.jbi.2008.04.002",    # Valid (Bioinformatics paper)
    "10.1126/science.fake.999"      # Fake
]

# Header format for the table
print(f"{'DOI':<27} | {'Status':<8} | {'Author':<15} | {'Journal':<20} | {'Real Title'}")
print("-" * 110)

for doi in doi_list_to_check:
    result = verify_doi_validity(doi)
    
    if result['status'] == "Valid":
        # Clean and shorten strings for table display
        author = str(result['first_author'])[:15]
        journal = str(result['journal'])[:20]
        title = str(result['real_title'])[:35] + "..."
        
        print(f"{doi:<27} | {result['status']:<8} | {author:<15} | {journal:<20} | {title}")
    else:
        # For errors, we just print the details in the last column
        print(f"{doi:<27} | {result['status']:<8} | {'-':<15} | {'-':<20} | {result.get('details', '-')}")


DOI                         | Status   | Author          | Journal              | Real Title
--------------------------------------------------------------------------------------------------------------
10.1038/nature123           | Invalid  | -               | -                    | DOI not found
10.1007/s10701-005-9016-x   | Valid    | Ellis           | Foundations of Physi | Physics and the Real World...
10.1016/j.jbi.2008.04.002   | Valid    | Sward           | Journal of Biomedica | Reasons for declining computerized ...
10.1126/science.fake.999    | Invalid  | -               | -                    | DOI not found
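
The verification function strips only the `https://doi.org/` and `http://doi.org/` prefixes. Reference lists also contain other forms, such as a `doi:` label or a legacy `dx.doi.org` link, so a slightly more tolerant cleaning step may be useful before querying the resolver. The sketch below is one possible version; `normalize_doi` is a hypothetical helper, the list of prefixes is an assumption, and lowercasing relies on DOI names being case-insensitive.

```python
import re
from urllib.parse import unquote

# Prefixes that often precede a DOI in reference lists (assumed, not exhaustive).
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw):
    """Return a bare, lowercase DOI string, or None if the input does not look like a DOI."""
    if not isinstance(raw, str):
        return None
    doi = unquote(raw.strip())          # decode percent-escapes such as %2F
    doi = _DOI_PREFIX.sub("", doi)      # drop a resolver URL or "doi:" label
    doi = doi.rstrip(".,;")             # drop sentence punctuation after the DOI
    if not re.match(r"^10\.\d{4,9}/\S+$", doi):
        return None
    return doi.lower()


print(normalize_doi("https://doi.org/10.1038/nature06713"))          # 10.1038/nature06713
print(normalize_doi("DOI: 10.1016/J.JBI.2008.04.002."))              # 10.1016/j.jbi.2008.04.002
print(normalize_doi("http://dx.doi.org/10.1007/s10701-005-9016-x"))  # 10.1007/s10701-005-9016-x
print(normalize_doi("not a doi"))                                    # None
```

Returning `None` for malformed input lets a batch run skip strings that could never resolve, instead of spending a network request on them.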


# Applying the Check to CSV Files

In [None]:
# 1. Define the extraction function
def extract_dois_from_text(text):
    """
    Scans a text string for DOIs using regex.
    Returns a list of unique DOIs found, or an empty list.
    """
    # The standard DOI regex
    doi_pattern = r'\b(10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+)\b'
    
    if not isinstance(text, str):
        return []
        
    matches = re.findall(doi_pattern, text)
    
    # Clean up trailing punctuation (like a period at the end of a sentence)
    unique_dois = set()
    for doi in matches:
        clean = doi.rstrip(".,)")
        unique_dois.add(clean)
        
    return list(unique_dois)

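# 2. Apply it to the dataframe
# Assumes `df` was already loaded from the CSV file (e.g. with pd.read_csv)
# and has a 'paper_text' column holding each paper's full text.
print("Extracting DOIs from 'paper_text' column... this might take a moment.")
df['extracted_dois'] = df['paper_text'].apply(extract_dois_from_text)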

# 3. Create a count column just to see how many we found per paper
df['doi_count'] = df['extracted_dois'].apply(len)

# 4. Filter to show only papers where we actually found DOIs
papers_with_dois = df[df['doi_count'] > 0].copy()

print("\nProcessing Complete.")
print(f"Total Papers Scanned: {len(df)}")
print(f"Papers containing DOIs: {len(papers_with_dois)}")

# Show a preview of the results
if len(papers_with_dois) > 0:
    print("\n--- Preview of Papers with Extracted DOIs ---")
    # We select just the ID, Year, Title, and the list of DOIs found
    display_cols = ['id', 'year', 'title', 'extracted_dois']
    try:
        display(papers_with_dois[display_cols].head())
    except NameError:
        print(papers_with_dois[display_cols].head())
else:
    print("No DOIs found. Note: Older papers (1987-1990s) predate the DOI system (introduced around 2000), so their bibliographies contain no DOIs.")

Extracting DOIs from 'paper_text' column... this might take a moment.

Processing Complete.
Total Papers Scanned: 7241
Papers containing DOIs: 130

--- Preview of Papers with Extracted DOIs ---


     | id   | year | title                                             | extracted_dois
------------------------------------------------------------------------------------------------------------------------------
2373 | 3153 | 2006 | The Neurodynamics of Belief Propagation on Bin... | [10.1088/1742-5468/2005/11/P11012]
2407 | 3184 | 2007 | Invariant Common Spatial Patterns: Alleviating... | [10.1016/j.neuroimage.2007.01.051, 10.1109/MSP...
2529 | 3294 | 2007 | Modeling homophily and stochastic equivalence ... | [10.1145/1134271.1134283]
2615 | 3371 | 2007 | The Price of Bandit Information for Online Opt... | [10.1016/j.jcss.2004.10.016]
2671 | 3421 | 2008 | Interpreting the neural code with Formal Conce... | [10.1007/s10827-007-0039-5, 10.1038/nature06713]
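
The extraction function above deduplicates with a `set`, which discards the order in which DOIs appear and treats differently cased spellings of the same DOI as distinct. A possible variant that keeps first-appearance order and deduplicates case-insensitively is sketched below. `extract_dois_ordered` is a hypothetical name, the sample text is made up for illustration, and the pattern is the same regex used in the cell above.

```python
import re

# Same pattern as in the extraction cell above.
DOI_PATTERN = re.compile(r"\b(10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+)\b")


def extract_dois_ordered(text):
    """Return DOIs in order of first appearance, deduplicated case-insensitively."""
    if not isinstance(text, str):
        return []
    seen = {}
    for match in DOI_PATTERN.findall(text):
        doi = match.rstrip(".,)")
        # Key on the lowercase form; keep the first spelling encountered.
        seen.setdefault(doi.lower(), doi)
    return list(seen.values())


# Made-up reference snippet for illustration.
sample = (
    "[1] Kay, Nature, doi:10.1038/nature06713. "
    "[2] Endres, https://doi.org/10.1007/s10827-007-0039-5 "
    "[3] Repeated entry: DOI 10.1038/NATURE06713"
)
print(extract_dois_ordered(sample))
# ['10.1038/nature06713', '10.1007/s10827-007-0039-5']
```

Keeping the order makes it easier to map each DOI back to its position in the bibliography when a result needs manual follow-up.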


In [7]:
# 1. Verification function (same logic as above, returning flat keys for tabular output)
def verify_doi_validity(doi_input):
    clean_doi = doi_input.replace("https://doi.org/", "").replace("http://doi.org/", "")
    url = f"https://doi.org/{clean_doi}"
    headers = {"Accept": "application/vnd.citationstyles.csl+json"}

    try:
        response = requests.get(url, headers=headers, allow_redirects=True, timeout=10)
        
        if response.status_code == 200:
            try:
                data = response.json()
            except ValueError:
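                # Use the same keys as every other branch so the caller's res['validity'] lookup cannot fail
                return {"validity": "Error", "meta_title": "-", "meta_journal": "-", "meta_author": "-", "details": "Response was not valid JSON."}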
            
            title = data.get('title', 'N/A')
            if isinstance(title, list) and len(title) > 0: title = title[0]
            
            journal = data.get('container-title', 'N/A')
            if isinstance(journal, list) and len(journal) > 0: journal = journal[0]

            author_lastname = "N/A"
            if 'author' in data and len(data['author']) > 0:
                author_lastname = data['author'][0].get('family', 'N/A')

            return {
                "validity": "Valid",
                "meta_title": title,
                "meta_journal": journal,
                "meta_author": author_lastname,
                "details": "OK"
            }
        elif response.status_code == 404:
            return {"validity": "Invalid", "meta_title": "-", "meta_journal": "-", "meta_author": "-", "details": "DOI Not Found"}
        else:
            return {"validity": "Error", "meta_title": "-", "meta_journal": "-", "meta_author": "-", "details": f"HTTP {response.status_code}"}

    except Exception as e:
        return {"validity": "Conn Error", "meta_title": "-", "meta_journal": "-", "meta_author": "-", "details": str(e)}


# 2. Iterate through the papers and check their DOIs

results_list = []

# LIMITER: We only check the first 5 papers for this demo to save time.
# Remove .head(5) to run on all papers.
papers_to_check = papers_with_dois.head(5)

print(f"Starting verification on {len(papers_to_check)} papers...")

for index, row in papers_to_check.iterrows():
    paper_id = row['id']
    paper_year = row['year']
    extracted_dois = row['extracted_dois']
    
    print(f"Processing Paper ID {paper_id} ({len(extracted_dois)} DOIs found)...")
    
    for doi in extracted_dois:
        # Run the verification API
        res = verify_doi_validity(doi)
        
        # Save the result in a structured way
        results_list.append({
            "Paper_ID": paper_id,
            "Paper_Year": paper_year,
            "Checked_DOI": doi,
            "Status": res['validity'],
            "Real_Author": res['meta_author'],
            "Real_Journal": res['meta_journal'],
            "Real_Title": res['meta_title'],
            "Notes": res['details']
        })
        
        # Be polite to the API server, sleep a tiny bit
        time.sleep(0.2)

# 3. Convert results to a DataFrame for nice display
verification_df = pd.DataFrame(results_list)

print("\n--- Verification Complete ---")

# Display valid vs invalid counts
print(verification_df['Status'].value_counts())

print("\n--- Detailed Results Table ---")
# Displaying in a nice clean format
display_cols = ['Paper_ID', 'Checked_DOI', 'Status', 'Real_Author', 'Real_Journal']
try:
    display(verification_df[display_cols])
except NameError:
    print(verification_df[display_cols])

Starting verification on 5 papers...
Processing Paper ID 3153 (1 DOIs found)...
Processing Paper ID 3184 (2 DOIs found)...
Processing Paper ID 3294 (1 DOIs found)...
Processing Paper ID 3371 (1 DOIs found)...
Processing Paper ID 3421 (2 DOIs found)...

--- Verification Complete ---
Status
Valid    7
Name: count, dtype: int64

--- Detailed Results Table ---


  | Paper_ID | Checked_DOI                      | Status | Real_Author | Real_Journal
--------------------------------------------------------------------------------------------------------------------------
0 | 3153     | 10.1088/1742-5468/2005/11/P11012 | Valid  | Mooij       | Journal of Statistical Mechanics: Theory and E...
1 | 3184     | 10.1016/j.neuroimage.2007.01.051 | Valid  | Blankertz   | NeuroImage
2 | 3184     | 10.1109/MSP.2008.4408441         | Valid  | Blankertz   | IEEE Signal Processing Magazine
3 | 3294     | 10.1145/1134271.1134283          | Valid  | Airoldi     | Proceedings of the 3rd international workshop ...
4 | 3371     | 10.1016/j.jcss.2004.10.016       | Valid  | Kalai       | Journal of Computer and System Sciences
5 | 3421     | 10.1007/s10827-007-0039-5        | Valid  | Endres      | Journal of Computational Neuroscience
6 | 3421     | 10.1038/nature06713              | Valid  | Kay         | Nature
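
For an audit report, the per-DOI rows collected in `results_list` can be rolled up per paper so that any paper with a failing reference stands out. The sketch below shows one way to do that with the standard library. `summarize_by_paper` and `papers_needing_review` are hypothetical helpers, and the example rows are invented placeholders rather than results from the dataset above; only the `Paper_ID` and `Status` keys are taken from the verification loop.

```python
from collections import defaultdict


def summarize_by_paper(results):
    """Group verification rows by Paper_ID and count each Status value."""
    summary = defaultdict(lambda: defaultdict(int))
    for row in results:
        summary[row["Paper_ID"]][row["Status"]] += 1
    return {paper_id: dict(counts) for paper_id, counts in summary.items()}


def papers_needing_review(results):
    """Return sorted Paper_IDs that have at least one DOI whose Status is not 'Valid'."""
    return sorted({row["Paper_ID"] for row in results if row["Status"] != "Valid"})


# Invented placeholder rows, shaped like the entries appended to results_list.
example_results = [
    {"Paper_ID": 1, "Checked_DOI": "10.1000/example-a", "Status": "Valid"},
    {"Paper_ID": 1, "Checked_DOI": "10.1000/example-b", "Status": "Invalid"},
    {"Paper_ID": 2, "Checked_DOI": "10.1000/example-c", "Status": "Valid"},
]
print(summarize_by_paper(example_results))    # {1: {'Valid': 1, 'Invalid': 1}, 2: {'Valid': 1}}
print(papers_needing_review(example_results))  # [1]
```

Rows with a status such as `Error` or `Conn Error` are also flagged for review here, since a network failure says nothing about whether the DOI exists and the check should be repeated.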
