# Parse and Merge XML files by id

Data:

* articles-training-byarticle-20181122.xml → contains the article content
* ground-truth-training-byarticle-20181122.xml → contains the labels

Add derived features (Core+Stylometry+Metadata):

    full_text = title + " " + body

    domain = extract_domain(url)

    uppercase_ratio, exclamation_count, avg_sentence_length

Save to .tsv file.

 *XML Docs* (to learn how XML trees are structured and parsed in Python):
 https://docs.python.org/3/library/xml.etree.elementtree.html


*BeautifulSoup Docs* (for removing HTML tags and cleaning up article content):
 https://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [21]:
import xml.etree.ElementTree as ET
import pandas as pd
from urllib.parse import urlparse
import re
from bs4 import BeautifulSoup

Create a dictionary of labels keyed by the article's id. Each entry stores:

* its hyperpartisan label (converted to 1 or 0)
* its URL
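To make the shape of that dictionary concrete, here is a minimal sketch on an inline XML string. The sample ids, URLs, and the `articles` root tag are made up for illustration; only the attribute names (`id`, `hyperpartisan`, `url`) come from the code in this notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-article ground-truth file, inlined as a string.
sample_ground_truth = """<articles>
  <article id="0000001" hyperpartisan="true" url="https://example.com/a"/>
  <article id="0000002" hyperpartisan="false" url="https://example.org/b"/>
</articles>"""

root = ET.fromstring(sample_ground_truth)

# Map each article id to its binary label and source URL.
sample_labels = {
    art.get("id"): {
        "label": 1 if art.get("hyperpartisan") == "true" else 0,
        "url": art.get("url"),
    }
    for art in root.findall("article")
}

print(sample_labels)
```

Keeping the id as a string preserves its zero padding, which matters later when the two files are matched by id.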

In [22]:
articles_path = "articles-training-byarticle-20181122.xml"
ground_truth_path = "ground-truth-training-byarticle-20181122.xml"
output_path = "byarticle_clean.tsv"

# === LOAD XML FILES ===
articles_tree = ET.parse(articles_path)
ground_truth_tree = ET.parse(ground_truth_path)

articles_root = articles_tree.getroot()
ground_truth_root = ground_truth_tree.getroot()

# === PARSE GROUND TRUTH (LABELS + URL) ===
labels = {}
for article in ground_truth_root.findall("article"):
    article_id = article.get("id")
    label = 1 if article.get("hyperpartisan") == "true" else 0
    url = article.get("url")
    labels[article_id] = {"label": label, "url": url}



In [23]:
print(labels)



# Helper functions
*   extract_domain(url) → returns the URL's network location (netloc), e.g. www.google.com; any "www." prefix or port is kept

*   uppercase_ratio(text) → fraction (0–1) of letters that are UPPERCASE | only alphabetic characters are counted | Useful for detecting shouting, emphasis, or a strong emotional tone.

*   exclamation_count(text) → number of ! characters | Hyperpartisan texts often use many exclamation marks to express outrage or strong opinion.

*   avg_sentence_length(text) → average number of words per sentence, splitting on ., ! and ? | Hyperpartisan content may favor short, punchy sentences over longer factual prose.
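As a worked illustration of the sentence-length feature, the sketch below applies the same `[.!?]` split used in the helper cell to one short string. The sample text is made up (it borrows the "Trump Jr." abbreviation from one of the titles) and shows that an abbreviation's period is treated as a sentence boundary, so the feature is an approximation.

```python
import re

text = "Trump Jr. calls it ridiculous. Really!"

# Split on ., ! and ? and drop empty fragments, as the helper does.
pieces = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

# Average number of whitespace-separated words per fragment.
avg_len = sum(len(p.split()) for p in pieces) / len(pieces)

print(pieces)   # ['Trump Jr', 'calls it ridiculous', 'Really']
print(avg_len)  # 2.0
```

A sentence tokenizer that knows about abbreviations would give different numbers; whether that matters for this feature is a judgment call.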








In [24]:
def extract_domain(url):
    try:
        return urlparse(url).netloc
    except (ValueError, TypeError):
        return ""

def uppercase_ratio(text):
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    upper = sum(1 for c in letters if c.isupper())
    return upper / len(letters)

def exclamation_count(text):
    return text.count("!")

def avg_sentence_length(text):
    sentences = re.split(r'[.!?]', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return 0.0
    total_words = sum(len(s.split()) for s in sentences)
    return total_words / len(sentences)
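`extract_domain` returns `urlparse(url).netloc`, which keeps any `www.` prefix and port number. If the same site should map to a single value, a variant along these lines might help. `normalize_domain` is a hypothetical helper, not part of the original notebook, and the URLs are placeholders.

```python
from urllib.parse import urlparse

def normalize_domain(url):
    """Hypothetical variant: lowercase host without port or leading 'www.'."""
    if not isinstance(url, str):
        return ""
    try:
        # .hostname lowercases the host and drops the port; None if absent.
        host = urlparse(url).hostname or ""
    except ValueError:
        return ""
    return host[4:] if host.startswith("www.") else host

print(normalize_domain("https://www.google.com/search?q=1"))
print(normalize_domain("http://Example.org:8080/x"))
```

Stripping `www.` is a simplification: it does not reduce other subdomains (for example `news.example.com`) to a registered domain.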


# Create dataframe -> dataset

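For each article:

* Get its id and title
* Serialize all child elements (paragraphs, links, spans, etc.) and strip the markup to form the body
* Combine title + body into full_text, adding a period after the title when it does not already end in ., ! or ?
* Get the label and URL from the label dictionary and derive the domain from the URL
* Use the helper functions to calculate the features

Articles whose id is not in the label dictionary are skipped.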
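The body and full_text steps can be sketched on a single inline article. This version uses only the standard library (`itertext()`) instead of BeautifulSoup, so it is an alternative for illustration rather than the notebook's method. The sample XML and the `articles` root tag are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical article with nested markup inside a paragraph.
sample = """<articles>
  <article id="0000001" title="Example title">
    <p>First paragraph with a <a href="#">link</a>.</p>
    <p>Second paragraph.</p>
  </article>
</articles>"""

article = ET.fromstring(sample).find("article")
title = article.get("title", "").strip()

# itertext() walks every text node in document order; collapse whitespace after.
body = " ".join("".join(article.itertext()).split())

# Add a period after the title unless it already ends a sentence.
sep = " " if title.endswith((".", "!", "?")) else ". "
full_text = f"{title}{sep}{body}"

print(full_text)
```

Note that this relies on whitespace between the paragraph tags in the source. If paragraphs are written back to back with no whitespace, their text would run together, which is one reason to join with a separator as the BeautifulSoup call does.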

In [25]:
data = []

for article in articles_root.findall("article"):
    article_id = article.get("id")
    title = article.get("title", "")

    # Get all inner HTML (including paragraphs, links, spans, etc.)
    raw_html = "".join(ET.tostring(elem, encoding="unicode") for elem in article)
    body = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ", strip=True)

    # Sentence boundary -> if title ends with punctuation
    if title.strip().endswith(('.', '!', '?')):
        full_text = f"{title.strip()} {body.strip()}"
    else:
        full_text = f"{title.strip()}. {body.strip()}"

    if article_id in labels:
        label = labels[article_id]["label"]
        url = labels[article_id]["url"]
        domain = extract_domain(url)

        row = {
            "id": article_id,
            "title": title,
            "body": body,
            "full_text": full_text,
            "label": label,
            "domain": domain,
            "uppercase_ratio": uppercase_ratio(full_text),
            "exclamation_count": exclamation_count(full_text),
            "avg_sentence_length": avg_sentence_length(full_text),
        }

        data.append(row)

# Build the DataFrame once, after the loop has collected every row
df = pd.DataFrame(data)

In [26]:
# Check shape and columns
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())

# Check label balance
print("\nLabel distribution:")
print(df['label'].value_counts())

# Look for missing values
print("\nMissing values:")
print(df.isnull().sum())

# Peek at the data
print("\nSample rows:")
print(df.sample(3))

# Show feature stats
print("\nFeature summary:")
print(df[['uppercase_ratio', 'exclamation_count', 'avg_sentence_length']].describe())


Shape: (645, 9)
Columns: ['id', 'title', 'body', 'full_text', 'label', 'domain', 'uppercase_ratio', 'exclamation_count', 'avg_sentence_length']

Label distribution:
label
0    407
1    238
Name: count, dtype: int64

Missing values:
id                     0
title                  0
body                   0
full_text              0
label                  0
domain                 0
uppercase_ratio        0
exclamation_count      0
avg_sentence_length    0
dtype: int64

Sample rows:
          id                                              title  \
486  0000486  US aircraft are picking off ISIS militants in ...   
253  0000253   Trump Jr. calls charge of collusion 'ridiculous'   
585  0000585  Biden does it in front of us! ‘Outraged’ Dems ...   

                                                  body  \
486  A convoy used by the Islamic State of Iraq and...   
253  Trump Jr. allowed Tuesday night in an intervie...   
585  The left continues to harp on unproven allegat...   

              

# Analyze
Class split: ~63% (label 0, 407 articles) vs ~37% (label 1, 238 articles), so the classes are moderately imbalanced.
* uppercase_ratio: average ~5.6%, max ~52% (some articles may be shouting a lot)

* exclamation_count: mostly 0–1, but some articles use many exclamation marks (up to 34)

* avg_sentence_length: ranges from very short (3 words) to very long (220 words), so it could benefit from normalization or log-scaling
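A minimal sketch of the log-scaling idea, assuming `log1p` as the transform (the notebook does not specify one). The values 3 and 220 are the extremes quoted above; 20 is a made-up midpoint for illustration.

```python
import math

# Illustrative values only, not taken from the full dataset.
avg_sentence_lengths = [3.0, 20.0, 220.0]

# log1p(x) = log(1 + x): compresses large values and is defined at 0.
log_scaled = [math.log1p(v) for v in avg_sentence_lengths]

raw_spread = max(avg_sentence_lengths) / min(avg_sentence_lengths)
log_spread = max(log_scaled) / min(log_scaled)

print(f"raw max/min ratio:   {raw_spread:.1f}")
print(f"log1p max/min ratio: {log_spread:.1f}")
```

On the DataFrame itself the equivalent would be something like `np.log1p(df["avg_sentence_length"])`, stored in a new column so the raw feature stays available.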

# Converting and saving to .tsv

In [27]:
df.to_csv(output_path, sep="\t", index=False)
print(f"Saved cleaned dataset to: {output_path}")

Saved cleaned dataset to: byarticle_clean.tsv
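One thing to watch when loading the file again: the ids are zero-padded strings such as 0000486, and `pd.read_csv` will infer them as integers unless told otherwise. The sketch below shows the difference on a toy frame written to an in-memory buffer. The toy data stands in for `df` and is not the real dataset.

```python
import io
import pandas as pd

# Toy frame with zero-padded ids like those in the dataset.
toy = pd.DataFrame({"id": ["0000486", "0000253"], "label": [0, 1]})

buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)

# Default type inference turns the ids into integers, dropping the padding.
buffer.seek(0)
naive = pd.read_csv(buffer, sep="\t")

# Forcing the column to str keeps the ids exactly as written.
buffer.seek(0)
safe = pd.read_csv(buffer, sep="\t", dtype={"id": str})

print(naive["id"].tolist())
print(safe["id"].tolist())
```

For the saved file, the same idea would be `pd.read_csv(output_path, sep="\t", dtype={"id": str})`.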
