# Parsing a single XML file as a test

<div style="background-color:#013220; color:#e0f2e9; font-family:Arial, sans-serif; padding:20px; line-height:1.6;">

  <h2 style="color:#98ff98;">Step-by-Step Plan for the XML Parsing Snippet</h2>

  <ol>
    <li>
      <strong>Select one XML file</strong>  
      <p>We choose a single file from <code style="background-color:#025930; color:#d4ffd4; padding:2px 5px; border-radius:4px;">train/XML/</code> to explore. This file represents one scientific article in structured XML format.</p>
    </li>

    <li>
      <strong>Read and parse the XML</strong>  
      <p>We use <code style="background-color:#025930; color:#d4ffd4; padding:2px 5px; border-radius:4px;">BeautifulSoup</code> with the <code style="background-color:#025930; color:#d4ffd4; padding:2px 5px; border-radius:4px;">lxml-xml</code> parser to load structured tags (e.g., 
      <code style="background-color:#025930; color:#d4ffd4; padding:2px 5px; border-radius:4px;">&lt;article-title&gt;</code>, 
      <code style="background-color:#025930; color:#d4ffd4; padding:2px 5px; border-radius:4px;">&lt;abstract&gt;</code>, 
      <code style="background-color:#025930; color:#d4ffd4; padding:2px 5px; border-radius:4px;">&lt;sec&gt;</code>).</p>
    </li>

    <li>
      <strong>Extract key sections</strong>
      <ul>
        <li><strong>Title</strong> — from the <code style="background-color:#025930; color:#d4ffd4; padding:2px 5px; border-radius:4px;">&lt;article-title&gt;</code> tag</li>
        <li><strong>Abstract</strong> — from the <code style="background-color:#025930; color:#d4ffd4; padding:2px 5px; border-radius:4px;">&lt;abstract&gt;</code> tag</li>
        <li><strong>Body</strong> — concatenated text from all <code style="background-color:#025930; color:#d4ffd4; padding:2px 5px; border-radius:4px;">&lt;sec&gt;</code> sections</li>
        <li><strong>References</strong> — text from all <code style="background-color:#025930; color:#d4ffd4; padding:2px 5px; border-radius:4px;">&lt;ref&gt;</code> entries</li>
      </ul>
    </li>

    <li>
      <strong>Preview the data</strong>  
      <p>We print the full title and the first 500 to 800 characters of the other sections to check whether important dataset identifiers (like DOIs or registry IDs) appear.</p>
    </li>
  </ol>

  <hr style="border:none; height:1px; background-color:#2e8b57;">

  <h3 style="color:#98ff98;">Why This Matters</h3>
  <p>This step helps us understand <em>where</em> in the XML the dataset references usually appear.  
  For example:</p>
  <ul>
    <li>Some dataset DOIs may be in the <strong>References</strong> section.</li>
    <li>Mentions without DOIs may appear in the <strong>Body</strong>, for example in its <strong>Methods</strong> section.</li>
  </ul>
  <p>By identifying consistent patterns, we'll know which XML tags to target for automated dataset extraction and classification in later steps.</p>

</div>
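
To make the question of where identifiers appear more concrete, a minimal sketch might count DOI-like strings per section. The regular expression, the `count_identifiers` helper, and the sample strings below are illustrative assumptions and not part of the notebook. Real DOIs and registry IDs come in more shapes than this simplified pattern covers.

```python
import re

# Assumed, simplified DOI pattern: "10.", 4-9 digits, "/", then a non-space suffix.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def count_identifiers(sections):
    """Map each section name to the list of DOI-like strings found in its text."""
    found = {}
    for name, text in sections.items():
        # Trailing punctuation usually belongs to the sentence, not to the DOI.
        found[name] = [m.rstrip(".,;)") for m in DOI_PATTERN.findall(text)]
    return found


# Made-up sample text standing in for the body/references/abstract strings.
sample_sections = {
    "body": "Reads were deposited at https://doi.org/10.5061/dryad.example1.",
    "references": "Doe J 2020 Some data set. 10.5281/zenodo.0000000",
    "abstract": "No identifiers here.",
}

hits = count_identifiers(sample_sections)
for name, dois in hits.items():
    print(f"{name}: {len(dois)} DOI-like string(s) {dois}")
```

Passing the real `title`, `abstract`, `body`, and `references` strings in the same dictionary shape would give a rough per-section tally for one article.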


In [2]:
import pandas as pd
from bs4 import BeautifulSoup

# 1. Pick one article
xml_path = "/kaggle/input/make-data-count-finding-data-references/train/XML/10.1590_1678-4685-gmb-2018-0055.xml"

# 2. Load and parse
with open(xml_path, 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), "lxml-xml")

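# 3. Get title, abstract, body text, references
def first_tag_text(tag_name):
    """Return the text of the first <tag_name> element, or "" if the tag is absent."""
    node = soup.find(tag_name)
    return node.get_text(" ", strip=True) if node is not None else ""

title = first_tag_text("article-title")
abstract = first_tag_text("abstract")

# Body text: join the text of every <sec>, wherever it sits in the document.
# find_all() also returns nested <sec> tags, so text inside subsections is repeated.
body = " ".join(sec.get_text(" ", strip=True) for sec in soup.find_all("sec"))

# References: join the text of every <ref> entry in the reference list.
references = " ".join(ref.get_text(" ", strip=True) for ref in soup.find_all("ref"))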

print("TITLE:\n", title)
print("\nABSTRACT:\n", abstract[:500], "...")
print("\nBODY PREVIEW:\n", body[:800], "...")
print("\nREFERENCES PREVIEW:\n", references[:500], "...")


TITLE:
 Mitochondrial genomes of genus Atta (Formicidae:
Myrmicinae) reveal high gene organization and giant intergenic
spacers

ABSTRACT:
 Abstract The ants of the genus Atta are considered important pests to
agriculture in the Americas, although Atta species are also
important contributors to ecosystem functions in the various habitats in which
they occur. The aim of this study was to assemble four complete mitochondrial
genomes of the genus Atta , construct the phylogenomic tree, and
analyze the gene content, order, and organization. The mitogenomes of A.
colombica , A. opaciceps , A.
texana , and A. sexdens rubropilosa comprise
 ...

BODY PREVIEW:
 Conflict of interest The authors declare no conflict of interest. Author contributions JTVB and SM generated the DNA library, de novo assemblies and wrote
the manuscript; MSB, AEGS and CA analyzed data and wrote the manuscript. ...

REFERENCES PREVIEW:
 Babbucci M Basso A Scupola A Patarnello T Negrisolo E 2014 Is it an ant or a butterfl
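
In the sample output, the body preview appears to stop well short of the 800-character limit. This suggests that the `<sec>` tags in this file cover only the closing statements and not the main text. One possible fallback is to read the whole `<body>` element instead. The sketch below uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra packages. The `extract_body_text` helper and the tiny inline XML are assumptions for illustration, and the sketch ignores XML namespaces.

```python
import xml.etree.ElementTree as ET


def extract_body_text(root):
    """Return the whitespace-normalized text of <body>, or "" if there is none."""
    body = root.find(".//body")
    if body is None:
        return ""
    # itertext() visits every descendant once, so nested sections are not repeated.
    return " ".join(" ".join(body.itertext()).split())


# Made-up miniature article: main text outside <sec>, a nested <sec>, and back matter.
SAMPLE_XML = """
<article>
  <front><article-meta><title-group>
    <article-title>Example title</article-title>
  </title-group></article-meta></front>
  <body>
    <p>Main text without section tags.</p>
    <sec><title>Methods</title>
      <sec><title>Sampling</title><p>Nested paragraph.</p></sec>
    </sec>
  </body>
  <back>
    <sec><title>Conflict of interest</title><p>None declared.</p></sec>
  </back>
</article>
""".strip()

root = ET.fromstring(SAMPLE_XML)
body_text = extract_body_text(root)
print(body_text)
```

Reading `<body>` as a whole keeps paragraphs that sit outside any `<sec>` and leaves out back-matter sections such as conflict-of-interest statements.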

# Dataset reference extractor