[Updated 4/7/2025]
Added preprocessing for test dataset. Train dataset remains the same.

Implemented:
* Text cleaning ✅: full_text appears clean and consistent
* Combining title + body ✅: using full_text
* Feature extraction ✅: includes stylistic features (uppercase ratio, exclamations, sentence length)
* Label extraction ✅: labels are correctly matched and there are no missing values
* Domain filtering ✅: domain column is preserved (useful for group-based splits in part 2 of the project)
* TF-IDF preparation: applies in the model part
* XML parsing ✅: extraction from PAN XML format
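
The note on group-based splits can be made concrete with a small sketch. It is not part of this notebook's pipeline: `group_split`, `test_fraction`, and the toy rows are hypothetical names and data chosen for illustration, and the idea is only that all articles from one domain end up on the same side of the split.

```python
import random
from collections import defaultdict

def group_split(rows, group_key="domain", test_fraction=0.2, seed=0):
    """Split rows so that no group (here: domain) appears on both sides.

    test_fraction is a fraction of the *domains*, not of the rows.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    domains = sorted(groups)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(domains)

    n_held_out = max(1, round(len(domains) * test_fraction))
    held_out = set(domains[:n_held_out])

    train = [r for r in rows if r[group_key] not in held_out]
    valid = [r for r in rows if r[group_key] in held_out]
    return train, valid

# Toy rows; the ids and domains are invented for illustration only.
rows = [
    {"id": "a1", "domain": "example.com"},
    {"id": "a2", "domain": "example.com"},
    {"id": "b1", "domain": "example.org"},
    {"id": "c1", "domain": "example.net"},
    {"id": "c2", "domain": "example.net"},
    {"id": "d1", "domain": "example.edu"},
]
train, valid = group_split(rows)
print(len(train), len(valid))
```

Splitting by domain rather than by row keeps a model from being scored on publishers it has already seen during training.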



-> Stopwords are not removed here; they will be handled directly in the TF-IDF or CountVectorizer step.

# Parse and merge XML files by id

Data:

* articles-training-byarticle-20181122.xml → contains the article content
* ground-truth-training-byarticle-20181122.xml → contains the labels
* articles-test-byarticle-20181207.xml → Test article texts
* ground-truth-test-byarticle-20181207.xml → Balanced test labels for testing/evaluation
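
To show what "merge by id" means, here is a minimal sketch on two tiny inline XML strings instead of the real files. The ids, titles, and URLs are invented, the `<articles>` root element name is an assumption, and the body is read with `itertext()` to keep the sketch dependency-free (the notebook itself uses BeautifulSoup).

```python
import xml.etree.ElementTree as ET

# Tiny stand-ins for the article file and the ground-truth file (invented content).
ARTICLES_XML = """<articles>
  <article id="0000001" title="First headline"><p>Body one.</p></article>
  <article id="0000002" title="Second headline"><p>Body two.</p></article>
</articles>"""

GROUND_TRUTH_XML = """<articles>
  <article id="0000001" hyperpartisan="true" url="https://example.com/a"/>
  <article id="0000002" hyperpartisan="false" url="https://example.org/b"/>
</articles>"""

# Step 1: index the labels by article id.
labels = {
    node.get("id"): {
        "label": 1 if node.get("hyperpartisan") == "true" else 0,
        "url": node.get("url"),
    }
    for node in ET.fromstring(GROUND_TRUTH_XML).findall("article")
}

# Step 2: walk the articles and attach the matching label record.
merged = []
for node in ET.fromstring(ARTICLES_XML).findall("article"):
    article_id = node.get("id")
    if article_id not in labels:
        continue  # skip articles without a ground-truth entry
    body = " ".join("".join(p.itertext()).strip() for p in node.findall("p"))
    merged.append(
        {"id": article_id, "title": node.get("title", ""), "body": body, **labels[article_id]}
    )

for row in merged:
    print(row["id"], row["label"], row["title"])
```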

Add derived features (Core+Stylometry+Metadata):

    full_text = title + " " + body

    domain = extract_domain(url)

    uppercase_ratio, exclamation_count, avg_sentence_length

Save to .tsv file.

 *XML Docs* (to learn how XML trees are structured and parsed in Python):
 https://docs.python.org/3/library/xml.etree.elementtree.html


*BeautifulSoup Docs* (for removing HTML tags and cleaning up article content):
 https://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [15]:
import xml.etree.ElementTree as ET
import pandas as pd
from urllib.parse import urlparse
import re
from bs4 import BeautifulSoup

Create a dictionary of labels using:

    The article's id

    Its hyperpartisan label (converted to 1 or 0)

    Its URL

In [16]:
articles_path = "articles-training-byarticle-20181122.xml"
ground_truth_path = "ground-truth-training-byarticle-20181122.xml"

articles_path_test = "articles-test-byarticle-20181207.xml"
ground_truth_path_test = "ground-truth-test-byarticle-20181207.xml"

output_path = "byarticle_clean.tsv"
output_path_test = "byarticle_clean_test.tsv"

# === LOAD XML FILES ===
articles_tree = ET.parse(articles_path)
ground_truth_tree = ET.parse(ground_truth_path)

articles_tree_test = ET.parse(articles_path_test)
ground_truth_tree_test = ET.parse(ground_truth_path_test)


In [17]:
articles_root = articles_tree.getroot()
ground_truth_root = ground_truth_tree.getroot()
articles_root_test = articles_tree_test.getroot()
ground_truth_root_test = ground_truth_tree_test.getroot()


In [18]:
# === PARSE GROUND TRUTH (LABELS + URL) ===
labels = {}
for article in ground_truth_root.findall("article"):
    article_id = article.get("id")
    label = 1 if article.get("hyperpartisan") == "true" else 0
    url = article.get("url")
    labels[article_id] = {"label": label, "url": url}

labels_test = {}
for article in ground_truth_root_test.findall("article"):
    article_id = article.get("id")
    label = 1 if article.get("hyperpartisan") == "true" else 0
    url = article.get("url")
    labels_test[article_id] = {"label": label, "url": url}

In [19]:
print(labels)
print(labels_test)



# Helper functions
*   extract_domain(url) → gets the network location (netloc) of the URL, e.g. www.google.com (subdomains are kept)

*   uppercase_ratio(text) → fraction (0 to 1) of letters that are UPPERCASE | uses only alphabetic characters | Useful to detect shouting, emphasis, or strong emotional tone.

*   exclamation_count(text) → how many ! characters | Hyperpartisan texts often use many exclamation marks to express outrage or strong opinion.

*   avg_sentence_length(text) → average number of words per sentence | Hyperpartisan content may favor short, punchy sentences vs longer factual prose.
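
A quick sanity check on short strings helps to see what these features measure. The sketch below assumes the sentence splitter is a plain regex split on `.`, `!` and `?`; under that assumption an abbreviation such as "Rep." counts as a sentence end, which pulls the average down. The sample strings are invented.

```python
import re

def uppercase_ratio(text):
    # Fraction of alphabetic characters that are uppercase.
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

def avg_sentence_length(text):
    # Naive splitter: every '.', '!' or '?' ends a sentence.
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

shout = "WAKE UP! They lied."
print(uppercase_ratio(shout))        # 0.5 (7 of 14 letters are uppercase)
print(shout.count("!"))              # 1
print(avg_sentence_length(shout))    # 2.0 (two sentences of two words each)

# One real sentence of four words, but "Rep." is treated as its own sentence.
print(avg_sentence_length("Rep. Grijalva was arrested."))  # 2.0
```

For news text full of abbreviations and initials, a dedicated sentence tokenizer would likely give a more faithful average, at the cost of an extra dependency.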








In [20]:
def extract_domain(url):
    try:
        return urlparse(url).netloc
    except (ValueError, TypeError):
        return ""

def uppercase_ratio(text):
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    upper = sum(1 for c in letters if c.isupper())
    return upper / len(letters)

def exclamation_count(text):
    return text.count("!")

def avg_sentence_length(text):
    sentences = re.split(r'[.!?]', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return 0.0
    total_words = sum(len(s.split()) for s in sentences)
    return total_words / len(sentences)


# Create dataframe -> dataset

For each article:

    Get its id and title

    Join the text of all child elements (mostly <p> tags) to form the body, stripping markup with BeautifulSoup

    Combine title + body into full_text

* Get the label and domain from the label dictionary
* Use helper functions to calculate features
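
The title + body step can be sketched as a small helper. `join_title_body` is a hypothetical name, and the guards for an empty title or empty body are an assumption added here so that full_text never starts with a stray ". " or ends with a dangling separator.

```python
def join_title_body(title, body):
    """Combine title and body, inserting a sentence boundary when needed."""
    title, body = title.strip(), body.strip()
    if not title:
        return body      # assumption: no title means full_text is just the body
    if not body:
        return title     # assumption: no body means full_text is just the title
    # Only add a period if the title does not already end a sentence.
    separator = " " if title.endswith((".", "!", "?")) else ". "
    return f"{title}{separator}{body}"

print(join_title_body("Breaking news", "Body text."))  # Breaking news. Body text.
print(join_title_body("Really?", "Yes."))              # Really? Yes.
print(join_title_body("", "Only body."))               # Only body.
```

Putting this logic in one function also lets the train and test loops share it instead of repeating the same branch twice.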

In [21]:
# --------------------- TRAIN DATASET --------------------
data = []

for article in articles_root.findall("article"):
    article_id = article.get("id")
    title = article.get("title", "")

    # Get all inner HTML (including paragraphs, links, spans, etc.)
    raw_html = "".join(ET.tostring(elem, encoding="unicode") for elem in article)
    body = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ", strip=True)

    # Sentence boundary -> if title ends with punctuation
    if title.strip().endswith(('.', '!', '?')):
        full_text = f"{title.strip()} {body.strip()}"
    else:
        full_text = f"{title.strip()}. {body.strip()}"




    if article_id in labels:
        label = labels[article_id]["label"]
        url = labels[article_id]["url"]
        domain = extract_domain(url)

        row = {
            "id": article_id,
            "title": title,
            "body": body,
            "full_text": full_text,
            "label": label,
            "domain": domain,
            "uppercase_ratio": uppercase_ratio(full_text),
            "exclamation_count": exclamation_count(full_text),
            "avg_sentence_length": avg_sentence_length(full_text),
        }

        data.append(row)

# Build the DataFrame once, after the loop has collected every row
df = pd.DataFrame(data)

In [22]:
# ------------ TEST DATASET -----------------
data_test = []

for article in articles_root_test.findall("article"):
    article_id = article.get("id")
    title = article.get("title", "")

    # Get all inner HTML (including paragraphs, links, spans, etc.)
    raw_html = "".join(ET.tostring(elem, encoding="unicode") for elem in article)
    body = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ", strip=True)

    # Sentence boundary -> if title ends with punctuation
    if title.strip().endswith(('.', '!', '?')):
        full_text = f"{title.strip()} {body.strip()}"
    else:
        full_text = f"{title.strip()}. {body.strip()}"




    if article_id in labels_test:
        label = labels_test[article_id]["label"]
        url = labels_test[article_id]["url"]
        domain = extract_domain(url)

        row = {
            "id": article_id,
            "title": title,
            "body": body,
            "full_text": full_text,
            "label": label,
            "domain": domain,
            "uppercase_ratio": uppercase_ratio(full_text),
            "exclamation_count": exclamation_count(full_text),
            "avg_sentence_length": avg_sentence_length(full_text),
        }

        data_test.append(row)

# Build the DataFrame once, after the loop has collected every row
df_test = pd.DataFrame(data_test)

In [23]:
print("=========== TRAIN DATASET =======")
# Check shape and columns
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())

# Check label balance
print("\nLabel distribution:")
print(df['label'].value_counts())

# Look for missing values
print("\nMissing values:")
print(df.isnull().sum())

# Peek at the data
print("\nSample rows:")
print(df.sample(3))

# Show feature stats
print("\nFeature summary:")
print(df[['uppercase_ratio', 'exclamation_count', 'avg_sentence_length']].describe())


print("=========== TEST DATASET =======")
# Check shape and columns
print("Shape:", df_test.shape)
print("Columns:", df_test.columns.tolist())

# Check label balance
print("\nLabel distribution:")
print(df_test['label'].value_counts())

# Look for missing values
print("\nMissing values:")
print(df_test.isnull().sum())

# Peek at the data
print("\nSample rows:")
print(df_test.sample(3))

# Show feature stats
print("\nFeature summary:")
print(df_test[['uppercase_ratio', 'exclamation_count', 'avg_sentence_length']].describe())

Shape: (645, 9)
Columns: ['id', 'title', 'body', 'full_text', 'label', 'domain', 'uppercase_ratio', 'exclamation_count', 'avg_sentence_length']

Label distribution:
label
0    407
1    238
Name: count, dtype: int64

Missing values:
id                     0
title                  0
body                   0
full_text              0
label                  0
domain                 0
uppercase_ratio        0
exclamation_count      0
avg_sentence_length    0
dtype: int64

Sample rows:
          id                                              title  \
389  0000389     Rep. Grijalva arrested in front of Trump Tower   
194  0000194  House Intelligence Committee Subpoenas Firm Be...   
537  0000537  Trump Insists He Never Fired Ex-FBI Director C...   

                                                  body  \
389  NEW YORK (KGUN9-TV) - Arizona Congressman Raul...   
194  Senator Chuck Grassley sent a letter to FBI Di...   
537  WASHINGTON (Sputnik) - The White House initial...   

              

# Analysis
Training Dataset

    Size: 645 articles

    Label Distribution: Imbalanced (63% non-hyperpartisan, 37% hyperpartisan)

    No missing values

    Feature columns: Includes uppercase_ratio, exclamation_count, avg_sentence_length as stylistic features

    Feature Stats:

        uppercase_ratio: avg ~0.056, with a few high outliers (max ~0.52)

        exclamation_count: avg ~0.93, but max up to 34 (high variance)

        avg_sentence_length: avg ~18.3 words, with a wide spread (min 3.2, max 220)

Test Dataset

    Size: 628 articles

    Label Distribution: Perfectly balanced (50% / 50%)

    No missing values

    Same feature set as training

    Feature Stats:

        uppercase_ratio: avg ~0.054, similar to training

        exclamation_count: avg ~0.82, max up to 52 (more extreme outliers than training)

        avg_sentence_length: avg ~18.6 words, similar spread to training (max ~184.5)

Summary:
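* Feature Distribution: The stylistic features are distributed similarly in both sets (mean uppercase_ratio ~0.056 vs ~0.054, mean avg_sentence_length ~18.3 vs ~18.6); the test set has a more extreme exclamation_count outlier (52 vs 34).
* Label Balance: The test set is balanced (50% / 50%), whereas the training set is skewed toward non-hyperpartisan articles (63% / 37%).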
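
The training percentages quoted above follow directly from the printed label counts (407 and 238). A minimal check, with the counts typed in by hand rather than read from the DataFrame:

```python
# Label counts copied from the training output above.
train_counts = {0: 407, 1: 238}

total = sum(train_counts.values())
shares = {label: count / total for label, count in train_counts.items()}

print("total:", total)                      # total: 645
for label, share in shares.items():
    print(f"label {label}: {share:.1%}")    # label 0: 63.1%, label 1: 36.9%
```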


# Converting and saving to .tsv

In [25]:
df.to_csv(output_path, sep="\t", index=False)
print(f"Saved cleaned dataset to: {output_path}")

df_test.to_csv(output_path_test, sep="\t", index=False)
print(f"Saved cleaned dataset to: {output_path_test}")

Saved cleaned dataset to: byarticle_clean.tsv
Saved cleaned dataset to: byarticle_clean_test.tsv
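
One thing to watch when loading these files again: the ids are zero-padded strings such as 0000389, and `pd.read_csv` will infer them as integers unless told otherwise. The sketch below uses a small in-memory TSV with invented text so that it runs without the real files; for the saved datasets the same `dtype` argument would apply.

```python
import io
import pandas as pd

# In-memory stand-in for a saved .tsv file (the text column is invented).
tsv = "id\tlabel\tfull_text\n0000389\t0\tSome text.\n0000194\t1\tOther text.\n"

default = pd.read_csv(io.StringIO(tsv), sep="\t")
as_text = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"id": str})

print(default["id"].tolist())   # [389, 194] -> leading zeros are lost
print(as_text["id"].tolist())   # ['0000389', '0000194']
```

Keeping the id as a string matters if the rows are later matched back to the XML files or to the label dictionaries, which use the padded form.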
