[Updated 4/11/2025] Changed to a more adjustable function. Added preprocessing for the by-publisher dataset.

[Updated 4/7/2025]
Added preprocessing for test dataset. Train dataset remains the same.

Implemented:
* Text cleaning ✅ (full_text appears clean and consistent)
* Combining title + body ✅ (using full_text)
* Feature extraction ✅ (includes stylistic features: uppercase ratio, exclamations, sentence length)
* Label extraction ✅ (labels are correctly matched and there are no missing values)
* Domain filtering ✅ (domain column is preserved; useful for group-based splits in part 2 of the project)
* TF-IDF preparation (applied in the model part)
* XML parsing ✅ (extraction from PAN XML format)
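The group-based split mentioned for the domain column can be sketched in plain Python. This is only an illustration under stated assumptions: `split_by_domain`, `test_fraction`, `seed`, and the toy rows are hypothetical names and data, not part of the notebook, and a library splitter such as scikit-learn's `GroupShuffleSplit` would be the more usual choice in practice.

```python
import random

def split_by_domain(rows, test_fraction=0.2, seed=0):
    """Split rows so that each domain ends up entirely in train or entirely in test."""
    domains = sorted({row["domain"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(domains)
    # Hold out at least one domain for the test side
    n_test = max(1, round(len(domains) * test_fraction))
    test_domains = set(domains[:n_test])
    train = [r for r in rows if r["domain"] not in test_domains]
    test = [r for r in rows if r["domain"] in test_domains]
    return train, test

# Toy rows, made up for illustration only
rows = [
    {"id": "1", "domain": "a.example"},
    {"id": "2", "domain": "a.example"},
    {"id": "3", "domain": "b.example"},
    {"id": "4", "domain": "c.example"},
    {"id": "5", "domain": "c.example"},
]

train, test = split_by_domain(rows, test_fraction=0.34)
train_domains = {r["domain"] for r in train}
test_domains = {r["domain"] for r in test}
print(train_domains & test_domains)  # set() -> no domain appears on both sides
```

Splitting on domains rather than on individual articles keeps a model from being evaluated on publishers it has already seen during training.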



-> Stopword removal will be applied directly in TF-IDF, and so will lemmatization.

# Parse and merge XML files by id

Data:

* articles-training-byarticle-20181122.xml → contains the article content
* ground-truth-training-byarticle-20181122.xml → contains the labels
* articles-test-byarticle-20181207.xml → Test article texts
* ground-truth-test-byarticle-20181207.xml → Balanced test labels for testing/evaluation

* articles-test-bypublisher-20181212.xml → Noisy by-publisher article texts
* ground-truth-test-bypublisher-20181212.xml → Weak, publisher-level labels for the by-publisher articles
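The merge-by-id idea can be illustrated on two tiny inline XML strings. This is a sketch, not the notebook's pipeline: the XML snippets are made up, and the body is extracted with `itertext()` for brevity, whereas the preprocessing function further down uses BeautifulSoup. The attribute names (`id`, `title`, `hyperpartisan`, `url`) follow those used in that function.

```python
import xml.etree.ElementTree as ET

# Made-up miniature versions of the two files
articles_xml = """<articles>
  <article id="0000001" title="First headline"><p>Body one.</p></article>
  <article id="0000002" title="Second headline"><p>Body two.</p></article>
</articles>"""

ground_truth_xml = """<articles>
  <article id="0000001" hyperpartisan="true" url="https://example.com/a"/>
  <article id="0000002" hyperpartisan="false" url="https://example.org/b"/>
</articles>"""

# Step 1: index the labels by article id
labels = {
    art.get("id"): 1 if art.get("hyperpartisan") == "true" else 0
    for art in ET.fromstring(ground_truth_xml).findall("article")
}

# Step 2: walk the articles and attach the label with the same id
merged = []
for art in ET.fromstring(articles_xml).findall("article"):
    article_id = art.get("id")
    if article_id in labels:
        merged.append({
            "id": article_id,
            "title": art.get("title", ""),
            "body": " ".join(t.strip() for t in art.itertext() if t.strip()),
            "label": labels[article_id],
        })

print(merged)
```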

Add derived features (Core+Stylometry+Metadata):

    full_text = title + " " + body

    domain = extract_domain(url)

    uppercase_ratio, exclamation_count, avg_sentence_length

Save to .tsv file.

 *XML Docs* (to learn how XML trees are structured and parsed in Python):
 https://docs.python.org/3/library/xml.etree.elementtree.html


*BeautifulSoup Docs* (for removing HTML tags and cleaning up article content):
 https://www.crummy.com/software/BeautifulSoup/bs4/doc/
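As a small illustration of the HTML cleaning step, the snippet below strips tags from a made-up fragment using the same `get_text(separator=" ", strip=True)` call that appears in the preprocessing function. It is a sketch on toy input, not output from the dataset.

```python
from bs4 import BeautifulSoup

raw_html = "<p>Hello <b>world</b>!</p><p>Second paragraph.</p>"  # made-up fragment

# Each text node is stripped and the nodes are joined with a single space
text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ", strip=True)
print(text)  # Hello world ! Second paragraph.
```

Note the space before the exclamation mark: with `separator=" "`, inline tags such as `<b>` also introduce a separator, which can slightly affect punctuation-based features.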


In [3]:
import xml.etree.ElementTree as ET
import pandas as pd
from urllib.parse import urlparse
import re
from bs4 import BeautifulSoup

Init data:

In [4]:
articles_path = "articles-training-byarticle-20181122.xml"
ground_truth_path = "ground-truth-training-byarticle-20181122.xml"

articles_path_test = "articles-test-byarticle-20181207.xml"
ground_truth_path_test = "ground-truth-test-byarticle-20181207.xml"

articles_path_bypublisher = "articles-test-bypublisher-20181212.xml"
ground_truth_path_bypublisher = "ground-truth-test-bypublisher-20181212.xml"

output_path = "byarticle_clean.tsv"
output_path_test = "byarticle_clean_test.tsv"
output_path_bypublisher = "bypublisher_clean.tsv"


# Helper functions
*   extract_domain(url) → returns the network location of the URL (e.g. www.google.com), or an empty string if the URL cannot be parsed

*   uppercase_ratio(text) → fraction (0 to 1) of letters that are UPPERCASE | uses only alphabetic characters | Useful to detect shouting, emphasis, or strong emotional tone.

*   exclamation_count(text) → how many ! characters | Hyperpartisan texts often use many exclamation marks to express outrage or strong opinion.

*   avg_sentence_length(text) → average number of words per sentence | Hyperpartisan content may favor short, punchy sentences vs longer factual prose.








In [5]:
def extract_domain(url):
    # Missing URLs (None) would otherwise yield a bytes result from urlparse
    if not isinstance(url, str):
        return ""
    try:
        return urlparse(url).netloc
    except ValueError:
        return ""

def uppercase_ratio(text):
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    upper = sum(1 for c in letters if c.isupper())
    return upper / len(letters)

def exclamation_count(text):
    return text.count("!")

def avg_sentence_length(text):
    sentences = re.split(r'[.!?]', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return 0.0
    total_words = sum(len(s.split()) for s in sentences)
    return total_words / len(sentences)
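A worked example of the three stylistic helpers on one made-up sentence. The functions are compact copies of the helpers above, repeated only so the snippet runs on its own; the sample text is an assumption chosen to show how abbreviations affect the sentence splitter.

```python
import re

# Compact copies of the helpers above, repeated so the snippet is self-contained
def uppercase_ratio(text):
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

def exclamation_count(text):
    return text.count("!")

def avg_sentence_length(text):
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

sample = "BREAKING news! Dr. Smith said no. Why?"  # made-up text

print(round(uppercase_ratio(sample), 3))  # 0.393 (11 uppercase out of 28 letters)
print(exclamation_count(sample))          # 1
print(avg_sentence_length(sample))        # 1.75 ("Dr." is counted as its own sentence)
```

The last value shows a limitation of splitting on `.`, `!`, and `?`: abbreviations shorten the measured average, so the feature is best read as a rough signal.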


In [6]:
"""
preprocessing_utils.py

Reusable functions for preprocessing PAN-style XML datasets (e.g., by-article) into structured TSV format.

Main Function:
---------------
preprocess_byarticle_xml(
    articles_path: str,
    ground_truth_path: str,
    output_path: str,
    drop_raw_text: bool = False,
    include_style_features: bool = True,
    print_data_info: bool = True
)

This function:
- Merges article content and ground-truth label XML files
- Cleans HTML and combines title + body into a full_text field
- Optionally extracts stylistic features (uppercase ratio, exclamation count, avg. sentence length)
- Saves the result as a .tsv file for use in modeling
- Optionally prints summary statistics

Useful for: preprocessing the PAN 2019 by-article dataset for baseline or deep model experiments.

Example:
--------
preprocess_byarticle_xml(
    articles_path="articles-training-byarticle.xml",
    ground_truth_path="ground-truth-training-byarticle.xml",
    output_path="data/train.tsv"
)
"""



In [10]:
def preprocess_byarticle_xml(
    articles_path, ground_truth_path, output_path, drop_raw_text=False,
    include_style_features=True, print_data_info=True
):
    # Note: drop_raw_text is currently not used anywhere in the function body.
    tree = ET.parse(articles_path)
    root = tree.getroot()

    gt_tree = ET.parse(ground_truth_path)
    gt_root = gt_tree.getroot()

    # Parse labels
    labels = {}
    for article in gt_root.findall("article"):
        article_id = article.get("id")
        label = 1 if article.get("hyperpartisan") == "true" else 0
        url = article.get("url")
        labels[article_id] = {"label": label, "url": url}

    data = []
    for article in root.findall("article"):
        article_id = article.get("id")
        title = article.get("title", "")
        raw_html = "".join(ET.tostring(elem, encoding="unicode") for elem in article)
        body = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ", strip=True)

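        # Join title and body; add a period after the title only if it is
        # non-empty and does not already end in sentence punctuation.
        title_clean = title.strip()
        if title_clean and not title_clean.endswith(('.', '!', '?')):
            title_clean += "."
        full_text = f"{title_clean} {body.strip()}".strip()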

        if article_id in labels:
            label = labels[article_id]["label"]
            url = labels[article_id]["url"]
            domain = extract_domain(url)
            row = {
                "id": article_id,
                "title": title,
                "body": body,
                "full_text": full_text,
                "label": label,
                "domain": domain,
            }
            if include_style_features:
                row.update({
                    "uppercase_ratio": uppercase_ratio(full_text),
                    "exclamation_count": exclamation_count(full_text),
                    "avg_sentence_length": avg_sentence_length(full_text),
                })
            data.append(row)

    df = pd.DataFrame(data)
    df.to_csv(output_path, sep="\t", index=False)

    if print_data_info:
        print(f"==== {output_path} ====")
        print("Shape:", df.shape)
        print("Label distribution:\n", df['label'].value_counts())
        print("Missing values:\n", df.isnull().sum())
        print("Sample rows:\n", df.sample(3))
        if include_style_features:
            print("Feature stats:\n", df[['uppercase_ratio', 'exclamation_count', 'avg_sentence_length']].describe())
        print(f"Saved to {output_path}")

    # Return the DataFrame so the caller can inspect or reuse it
    return df


In [13]:
#preprocess_byarticle_xml(articles_path, ground_truth_path, output_path)
#preprocess_byarticle_xml(articles_path_test, ground_truth_path_test, output_path_test)
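# Keyword arguments avoid positional mix-ups with drop_raw_text
preprocess_byarticle_xml(
    articles_path_bypublisher, ground_truth_path_bypublisher, output_path_bypublisher,
    include_style_features=False, print_data_info=True,
)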

==== bypublisher_clean.tsv ====
Shape: (4000, 6)
Label distribution:
 label
0    2000
1    2000
Name: count, dtype: int64
Missing values:
 id           0
title        0
body         0
full_text    0
label        0
domain       0
dtype: int64
Sample rows:
            id                                              title  \
3028  1171826  NBC “Investigates” Gas Can Explosions … Becaus...   
642   0257837  Fmr. FBI Counterterrorism Agent: ‘Vast Majorit...   
3228  1240839  General Electric, California Teachers Associat...   

                                                   body  \
642   BigGovernment – On Friday’s Breitbart News Dai...   
3228  The California Teachers Association, the main ...   

                                              full_text  label          domain  
3028  NBC “Investigates” Gas Can Explosions … Becaus...      1   downtrend.com  
642   Fmr. FBI Counterterrorism Agent: ‘Vast Majorit...      1  iotwreport.com  
3228  General Electric, California Teachers Associ

index,id,title,body,full_text,label,domain
0,0000036,The Postal Service Needs Congressional Help,The US Postal Service is facing “the most seri...,The Postal Service Needs Congressional Help. T...,0,leftjustified.com
1,0000110,Oscar-Nominated Palestinian Filmmaker: Freedom...,On Valentine's Day I was part of a panel discu...,Oscar-Nominated Palestinian Filmmaker: Freedom...,1,progressive.org
2,0000256,Which Law Schools Should Have Closed Their Doors?,Ever since the beginning of the law school cri...,Which Law Schools Should Have Closed Their Doo...,0,abovethelaw.com
3,0000335,US Badly Needs Russia’s Technology to Keep Its...,( SCF ) – Russian Energomash Corporation has f...,US Badly Needs Russia’s Technology to Keep Its...,0,geopoliticsalert.com
4,0001058,The Three Democrats Breaking Ranks on H.R. 2 V...,"Jan. 19, 2011 -- The House of Representatives ...",The Three Democrats Breaking Ranks on H.R. 2 V...,0,maplight.org
...,...,...,...,...,...,...
3995,1491832,That Time Neil DeGrasse Tyson Schooled Bob Lut...,Conservatives like to use the excuse of&#160;“...,That Time Neil DeGrasse Tyson Schooled Bob Lut...,1,youngprogressivevoices.com
3996,1493320,Geologists Not Sure What's Causing Earthquake ...,By Piper McGowin We are seeing a lot of geolog...,Geologists Not Sure What's Causing Earthquake ...,1,activistpost.com
3997,1493954,White House Budget Director: Trump is Done Neg...,White House Budget Director Mick Mulvaney told...,White House Budget Director: Trump is Done Neg...,1,freedomoutpost.com
3998,1494502,It’s Nice to Have the Option,"Alex Knapp · Friday, May 16, 2008 · 10 comment...",It’s Nice to Have the Option. Alex Knapp · Fri...,0,outsidethebeltway.com


The training split of the by-article dataset contains 645 articles, with an imbalanced label distribution (63% non-hyperpartisan, 37% hyperpartisan). In contrast, the test set is fully balanced, with 314 articles per class, enabling fair and consistent evaluation. No missing values were found in either split.

Each article includes a set of engineered stylistic features. In the training set, the average uppercase ratio is 0.056, with a maximum value of 0.52 indicating a few highly emphatic articles. The average number of exclamation marks per article is 0.93, but this ranges up to 34, suggesting strong variance in emotional tone. The average sentence length is 18.3 words, with a wide range from 3.2 to 220 words per sentence.

The test set exhibits similar feature distributions. The average uppercase ratio is 0.054, and the average sentence length is 18.6 words. Exclamation usage is also highly variable, with a maximum of 52 in one article. Overall, the stylistic feature distributions are comparable across splits, so these features on their own should not introduce a distribution shift between training and test data.
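A comparison like the one described above could be produced with a small pandas helper. This is a sketch under assumptions: `compare_splits` is a hypothetical name, and the two toy frames contain made-up values rather than the real dataset statistics.

```python
import pandas as pd

def compare_splits(train_df, test_df, features):
    """Return mean and max of each feature, one row per split."""
    combined = pd.concat(
        [train_df.assign(split="train"), test_df.assign(split="test")],
        ignore_index=True,
    )
    return combined.groupby("split")[features].agg(["mean", "max"])

# Toy values for illustration only (not the real feature statistics)
train_toy = pd.DataFrame({"uppercase_ratio": [0.05, 0.07], "exclamation_count": [0, 4]})
test_toy = pd.DataFrame({"uppercase_ratio": [0.04, 0.06], "exclamation_count": [1, 2]})

summary = compare_splits(train_toy, test_toy, ["uppercase_ratio", "exclamation_count"])
print(summary)
```

With the real files, the two frames would come from reading byarticle_clean.tsv and byarticle_clean_test.tsv.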

By-publisher:

* Shape: 4,000 articles
* Label balance: 50/50 (2,000 per class)
* Columns: id, title, body, full_text, label, domain (no stylistic features)
* No missing values

# Extras

As an optional extra for evaluation, a sentiment_score column was considered. Separately, a reduced .tsv file with only id, full_text, and domain is created.

In [18]:
# Load the cleaned by-publisher dataset
# Read ids as strings so leading zeros (e.g. 0000036) are preserved
df = pd.read_csv("data/bypublisher_clean.tsv", sep="\t", dtype={"id": str})
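One detail worth checking when reloading the TSV is the id column: the article ids carry leading zeros, and `pd.read_csv` infers an integer type unless told otherwise. The sketch below shows the difference on a made-up one-row TSV held in memory.

```python
import io
import pandas as pd

# Made-up single-row TSV with a zero-padded id
tsv = "id\tfull_text\tdomain\n0000036\tSome text.\texample.com\n"

df_default = pd.read_csv(io.StringIO(tsv), sep="\t")
df_str_ids = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"id": str})

print(df_default["id"].iloc[0])  # 36      (parsed as an integer, zeros lost)
print(df_str_ids["id"].iloc[0])  # 0000036 (kept as a string)
```

Keeping ids as strings matters if the cleaned files are later joined back to the original XML ids.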

# XML version

In [19]:
# Keep only the desired columns
df_xml = df[["id", "full_text", "domain"]]

# Save to a new .tsv file
df_xml.to_csv("data/bypublisher_xml.tsv", sep="\t", index=False)

# Report the path that was actually written
print(" Cleaned file saved as data/bypublisher_xml.tsv")

 Cleaned file saved as data/bypublisher_xml.tsv


# Truncate to approx. 512 tokens

In [22]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def truncate_to_512_tokens(text):
    tokens = tokenizer.encode(text, truncation=True, max_length=512)
    return tokenizer.decode(tokens, skip_special_tokens=True)

# Apply truncation
df_roberta = df_xml.copy()
df_roberta["full_text"] = df_roberta["full_text"].apply(truncate_to_512_tokens)

# Save to file
df_roberta.to_csv("data/bypublisher_roberta_ready.tsv", sep="\t", index=False)
print(" Token-truncated file saved: data/bypublisher_roberta_ready.tsv")

 Token-truncated file saved: data/bypublisher_roberta_ready.tsv


# Sentiment score

In [14]:
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
# Download the VADER lexicon (only needs to be done once)
nltk.download('vader_lexicon')

In [17]:
# Initialize sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Compute sentiment score for each article
df["sentiment_score"] = df["full_text"].apply(lambda x: sia.polarity_scores(str(x))["compound"])

# Save to new file
output_path = "data/bypublisher_clean_with_sentiment.tsv"
df.to_csv(output_path, sep="\t", index=False)

print(f" Sentiment scores added and saved to {output_path}")

 Sentiment scores added and saved to data/bypublisher_clean_with_sentiment.tsv
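For reading the resulting sentiment_score column, the compound value lies between -1 and 1. A common convention from the VADER documentation treats scores of at least 0.05 as positive and at most -0.05 as negative. The helper below is a hypothetical sketch of that bucketing; `sentiment_bucket` is not part of the notebook, and the threshold is exposed as a parameter since it is a convention rather than a fixed rule.

```python
def sentiment_bucket(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Example scores chosen for illustration
for score in (0.62, 0.0, -0.31):
    print(score, sentiment_bucket(score))
```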
