# American Journalists: Byline Counts & Bias Profiles

This starter notebook helps you:
1. **Ingest _All the News 2.0_** (2.69M U.S. news articles, 2016–2020).
2. **Produce per-author (journalist) article counts** from bylines.
3. **Attach outlet-level political bias labels** from **AllSides** and **MBFC** to compute simple **per-author bias profiles**.
4. **Scrape AllSides _author-level_ ratings** (for journalists who have an AllSides page) and merge with your counts.

> **Licenses & Terms**  
> • _All the News 2.0_ is for **non-commercial research** only and explicitly **not for training commercial generative models**. Download it from the official page and respect their terms.  
> • AllSides and MBFC have their own terms. For MBFC, the recommended free route is using **NELA-GT** (which bundles MBFC source labels) or the research CSV used below.


## 0) Setup

In [3]:

# You can run this entire notebook locally. Internet is required for the auto-download helpers.
# If a download fails (e.g., behind a firewall), download manually from the provided links
# and place files in the `data/` folder defined below.

import re, zipfile
from pathlib import Path
import pandas as pd
import numpy as np

DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

AUTO_DOWNLOAD = True  # set to False to skip any attempted downloads

def safe_download(url: str, dest: Path, chunk=1<<20):
    """Stream-download a file to dest. Overwrites if exists."""
    import requests
    dest.parent.mkdir(parents=True, exist_ok=True)
    with requests.get(url, stream=True, timeout=120) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for c in r.iter_content(chunk_size=chunk):
                if c:
                    f.write(c)
    return dest

def unzip_if_needed(path: Path, dest_dir: Path):
    if path.suffix.lower() == ".zip":
        with zipfile.ZipFile(path, 'r') as zf:
            zf.extractall(dest_dir)
        return True
    return False

# Helper to normalize domains: drops the scheme, the path, and a leading 'www.' (other subdomains are kept)
def normalize_domain(url_or_domain: str) -> str:
    s = url_or_domain.strip().lower()
    s = re.sub(r'^https?://', '', s)
    s = s.split('/')[0]
    s = re.sub(r'^www\.', '', s)
    return s

# All the outputs will be written under 'outputs/'
OUT_DIR = Path("outputs"); OUT_DIR.mkdir(exist_ok=True)
print("DATA_DIR =", DATA_DIR.resolve())
print("OUT_DIR  =", OUT_DIR.resolve())


DATA_DIR = /Users/baturalpkabadayi/Dev/Research/JournalistLLM/data
OUT_DIR  = /Users/baturalpkabadayi/Dev/Research/JournalistLLM/outputs
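
As a quick sanity check of the `normalize_domain` helper from the Setup cell, the sketch below repeats it (so the block runs on its own) and applies it to a few made-up URLs. The inputs are illustrative assumptions, not rows from the dataset. Note that only a leading `www.` is removed; other subdomains such as `edition.` are kept.

```python
import re

def normalize_domain(url_or_domain: str) -> str:
    # Same logic as the Setup helper, repeated here so the example is self-contained.
    s = url_or_domain.strip().lower()
    s = re.sub(r'^https?://', '', s)   # drop the scheme
    s = s.split('/')[0]                # keep only the host part
    s = re.sub(r'^www\.', '', s)       # drop a leading "www."
    return s

# Illustrative inputs (not taken from the dataset)
examples = [
    "https://www.vox.com/polyarchy/2016/12/9/example",
    "HTTP://WWW.Reuters.com/article/x",
    "edition.cnn.com/2019/01/01/story",
    "thehill.com",
]
normalized = [normalize_domain(u) for u in examples]
for raw, dom in zip(examples, normalized):
    print(f"{raw!r:55} -> {dom}")
```

If subdomains should also collapse to the registered domain, a dedicated public-suffix parser would be needed; this helper does not attempt that.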


## 1) Get the Articles: _All the News 2.0_

**Source & Docs:**  
• Components.one dataset page (official host): see fields and counts; download hosted via Proton Drive.  
• Mirrors also exist on third-party sites, but you should prefer the official page for the latest schema.

**Expected file(s):**
- `all-the-news-2.csv` — a single CSV with ~2.7M rows.

> If auto-download fails, **go to the official page** and download manually, then place the CSV under `data/`.


### 1.1. Download

In [2]:

# --- Configure download URL(s) ---
# Official page: https://components.one/datasets/all-the-news-2-news-articles-dataset
# The page links to a Proton Drive URL (changes occasionally). Manual download is recommended.
ALL_NEWS_CSV = DATA_DIR / "all-the-news-2.csv"  # <-- put your downloaded CSV here

if AUTO_DOWNLOAD and not ALL_NEWS_CSV.exists():
    print("Auto-download is not guaranteed for All the News 2.0 (Proton Drive link).")
    print("Please download manually from the official page and place it at:", ALL_NEWS_CSV.resolve())

# Quick schema peek (only runs if file is present)
if ALL_NEWS_CSV.exists():
    print("Loading a small sample to preview schema...")
    sample = pd.read_csv(ALL_NEWS_CSV, nrows=5)
    display(sample.head())
else:
    print("CSV not found yet. Proceed after you place the file at:", ALL_NEWS_CSV.resolve())

Loading a small sample to preview schema...


Unnamed: 0,date,year,month,day,author,title,article,url,section,publication
0,2016-12-09 18:31:00,2016,12.0,9,Lee Drutman,We should take concerns about the health of li...,"This post is part of Polyarchy, an independent...",https://www.vox.com/polyarchy/2016/12/9/138983...,,Vox
1,2016-10-07 21:26:46,2016,10.0,7,Scott Davis,Colts GM Ryan Grigson says Andrew Luck's contr...,The Indianapolis Colts made Andrew Luck the h...,https://www.businessinsider.com/colts-gm-ryan-...,,Business Insider
2,2018-01-26 00:00:00,2018,1.0,26,,Trump denies report he ordered Mueller fired,"DAVOS, Switzerland (Reuters) - U.S. President ...",https://www.reuters.com/article/us-davos-meeti...,Davos,Reuters
3,2019-06-27 00:00:00,2019,6.0,27,,France's Sarkozy reveals his 'Passions' but in...,PARIS (Reuters) - Former French president Nico...,https://www.reuters.com/article/france-politic...,World News,Reuters
4,2016-01-27 00:00:00,2016,1.0,27,,Paris Hilton: Woman In Black For Uncle Monty's...,Paris Hilton arrived at LAX Wednesday dressed ...,https://www.tmz.com/2016/01/27/paris-hilton-mo...,,TMZ
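
Because the full CSV is large, later cells read it in chunks. A minimal sketch of that pattern, assuming a tiny in-memory stand-in with the same column names (the rows are hypothetical), might look like this. `usecols` and `chunksize` are standard `pandas.read_csv` parameters.

```python
import io
from collections import Counter
import pandas as pd

# Tiny stand-in for all-the-news-2.csv (hypothetical rows, same column names)
toy_csv = io.StringIO(
    "date,author,title,article,url,section,publication\n"
    "2016-12-09,Lee Drutman,t1,a1,https://www.vox.com/x,,Vox\n"
    "2018-01-26,,t2,a2,https://www.reuters.com/y,Davos,Reuters\n"
    "2019-06-27,,t3,a3,https://www.reuters.com/z,World News,Reuters\n"
)

pub_rows = Counter()
# usecols skips the large text columns; chunksize bounds memory per iteration
for chunk in pd.read_csv(toy_csv, usecols=["author", "publication", "url"], chunksize=2):
    pub_rows.update(chunk["publication"].value_counts().to_dict())

print(dict(pub_rows))
```

Skipping the `article` column is what keeps memory low here, since the article text dominates the file size.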


### 1.2. Domain Normalization & Duplicate Checks

In [1]:
from collections import defaultdict

In [6]:
# =========================
# 1) DOMAIN NORMALIZATION & DUPLICATE CHECKS
# =========================
try:
    ALL_NEWS_CSV
except NameError:
    DATA_DIR = Path("data")
    DATA_DIR.mkdir(exist_ok=True)
    ALL_NEWS_CSV = DATA_DIR / "all-the-news-2.csv"

try:
    OUT_DIR
except NameError:
    OUT_DIR = Path("outputs"); OUT_DIR.mkdir(exist_ok=True)

report_out = OUT_DIR / "domain_check_reports"
report_out.mkdir(parents=True, exist_ok=True)

# --- Robust domain normalizer (safe on NaN / empty / malformed URLs)
def normalize_domain(url_or_domain: str) -> str | None:
    if pd.isna(url_or_domain):
        return None
    s = str(url_or_domain).strip().lower()
    if not s:
        return None
    # strip scheme
    s = re.sub(r'^https?://', '', s)
    # take netloc
    s = s.split('/')[0]
    # strip www.
    s = re.sub(r'^www\.', '', s)
    return s or None

# --- Configure chunking
CHUNKSIZE = 200_000

# --- Accumulators
pair_counts = defaultdict(int)        # (publication, normalized_domain) -> count
pub_missing_url = defaultdict(int)    # publication -> rows with missing/invalid url
domain_counts = defaultdict(int)      # normalized_domain -> total count
pub_counts = defaultdict(int)         # publication -> total rows seen (with or without URL)

# --- First pass: compute counts for every (publication, normalized_domain)
for chunk in pd.read_csv(ALL_NEWS_CSV, chunksize=CHUNKSIZE):
    # Keep only needed columns
    sub = chunk[['publication', 'url']].copy()
    # Normalize
    sub['publication'] = sub['publication'].astype(str).str.strip()
    sub['normalized_domain'] = sub['url'].map(normalize_domain)

    # Tally totals per publication
    for p, n in sub['publication'].value_counts().items():
        pub_counts[p] += int(n)

    # Missing/invalid URL rows
    miss = sub[sub['normalized_domain'].isna()]
    for p, n in miss['publication'].value_counts().items():
        pub_missing_url[p] += int(n)

    # Valid domain rows: tally (publication, domain) and per-domain totals
    ok = sub.dropna(subset=['normalized_domain'])
    gb = ok.groupby(['publication', 'normalized_domain'], as_index=False).size()
    for _, row in gb.iterrows():
        key = (row['publication'], row['normalized_domain'])
        pair_counts[key] += int(row['size'])
        domain_counts[row['normalized_domain']] += int(row['size'])

# --- Build DataFrames from accumulators
pairs_df = (
    pd.DataFrame(
        [(p, d, c) for (p, d), c in pair_counts.items()],
        columns=['publication', 'normalized_domain', 'pair_count']
    )
    .sort_values(['publication', 'pair_count'], ascending=[True, False])
    .reset_index(drop=True)
)

pub_totals = (
    pd.DataFrame(list(pub_counts.items()), columns=['publication', 'pub_total_rows'])
    .merge(
        pd.DataFrame(list(pub_missing_url.items()), columns=['publication', 'missing_url_rows']),
        on='publication', how='left'
    )
    .fillna({'missing_url_rows': 0})
)

domain_totals = pd.DataFrame(list(domain_counts.items()), columns=['normalized_domain', 'domain_total_rows'])

# --- Publications that map to >1 normalized domain
multi_domain = (
    pairs_df.groupby('publication')['normalized_domain']
    .nunique()
    .reset_index(name='distinct_domains')
)
multi_domain = multi_domain[multi_domain['distinct_domains'] > 1]

# For those, compute coverage ratio of the top domain for that publication
def top_share_for_pub(pdf: pd.DataFrame) -> pd.Series:
    pdf = pdf.sort_values('pair_count', ascending=False)
    top_domain = pdf.iloc[0]['normalized_domain']
    top_count = pdf.iloc[0]['pair_count']
    total = int(pdf['pair_count'].sum())
    return pd.Series({
        'top_domain': top_domain,
        'top_count': int(top_count),
        'total_pairs': total,
        'top_share': float(top_count) / total if total else np.nan
    })

coverage = (
    pairs_df
    .groupby('publication', group_keys=False)
    .apply(top_share_for_pub)
    .reset_index()
)

publications_with_multiple_domains = (
    multi_domain
    .merge(coverage, on='publication', how='left')
    .merge(pub_totals, on='publication', how='left')
    .sort_values(['distinct_domains', 'top_share'], ascending=[False, True])
    .reset_index(drop=True)
)

# --- Domains used by >1 publication
domains_with_multiple_publications = (
    pairs_df.groupby('normalized_domain')['publication']
    .nunique()
    .reset_index(name='distinct_publications')
)
domains_with_multiple_publications = domains_with_multiple_publications[
    domains_with_multiple_publications['distinct_publications'] > 1
].sort_values('distinct_publications', ascending=False)

# --- Canonical mapping: pick the most frequent domain per publication
#     (You can later filter by a minimum top_share if you want to flag ambiguous cases)
canonical_map = (
    pairs_df.sort_values(['publication', 'pair_count'], ascending=[True, False])
    .groupby('publication', as_index=False)
    .first()[['publication', 'normalized_domain']]
    .rename(columns={'normalized_domain': 'canonical_domain'})
)
canonical_map = (
    canonical_map
    .merge(coverage[['publication', 'top_count', 'total_pairs', 'top_share']], on='publication', how='left')
    .merge(pub_totals, on='publication', how='left')
    .sort_values(['top_share', 'top_count'], ascending=[False, False])
    .reset_index(drop=True)
)

# --- Write reports
pairs_df.to_csv(report_out / "publication_domain_pair_counts.csv", index=False)
publications_with_multiple_domains.to_csv(report_out / "publications_with_multiple_domains.csv", index=False)
domains_with_multiple_publications.to_csv(report_out / "domains_with_multiple_publications.csv", index=False)
canonical_map.to_csv(report_out / "publication_domain_map.csv", index=False)

print("Saved:")
print(" -", report_out / "publication_domain_pair_counts.csv")
print(" -", report_out / "publications_with_multiple_domains.csv")
print(" -", report_out / "domains_with_multiple_publications.csv")
print(" -", report_out / "publication_domain_map.csv")

# --- Quick preview
display(canonical_map.head(12))
display(publications_with_multiple_domains.head(12))




Saved:
 - outputs/domain_check_reports/publication_domain_pair_counts.csv
 - outputs/domain_check_reports/publications_with_multiple_domains.csv
 - outputs/domain_check_reports/domains_with_multiple_publications.csv
 - outputs/domain_check_reports/publication_domain_map.csv




Unnamed: 0,publication,canonical_domain,top_count,total_pairs,top_share,pub_total_rows,missing_url_rows
0,Reuters,reuters.com,840094,840094,1.0,840094,0.0
1,CNBC,cnbc.com,238096,238096,1.0,238096,0.0
2,The Hill,thehill.com,208411,208411,1.0,208411,0.0
3,People,people.com,136488,136488,1.0,136488,0.0
4,CNN,cnn.com,127602,127602,1.0,127602,0.0
5,Refinery 29,refinery29.com,111433,111433,1.0,111433,0.0
6,Vice,vice.com,101137,101137,1.0,101137,0.0
7,Mashable,mashable.com,94107,94107,1.0,94107,0.0
8,Business Insider,businessinsider.com,57953,57953,1.0,57953,0.0
9,The Verge,theverge.com,52424,52424,1.0,52424,0.0


Unnamed: 0,publication,distinct_domains,top_domain,top_count,total_pairs,top_share,pub_total_rows,missing_url_rows
0,The New York Times,26,nytimes.com,247587,252259,0.981479,252259,0.0
1,Gizmodo,3,gizmodo.com,27225,27228,0.99989,27228,0.0
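
The canonical mapping picks the most frequent domain per publication, and the code comment suggests filtering by a minimum `top_share` to flag ambiguous cases. One way to do that is sketched below on a toy table; the outlet names, counts, and the `MIN_TOP_SHARE` threshold are assumptions for illustration.

```python
import pandas as pd

# Toy (publication, normalized_domain, pair_count) rows; counts are made up
pairs = pd.DataFrame({
    "publication": ["Outlet A", "Outlet A", "Outlet B", "Outlet C", "Outlet C"],
    "normalized_domain": ["a.com", "blogs.a.com", "b.com", "c.com", "c.org"],
    "pair_count": [980, 20, 500, 60, 40],
})

MIN_TOP_SHARE = 0.95  # hypothetical threshold, tune to taste

totals = pairs.groupby("publication")["pair_count"].sum().rename("total_pairs")
top = (
    pairs.sort_values("pair_count", ascending=False)
         .drop_duplicates("publication")          # keep the most frequent domain
         .set_index("publication")
         .rename(columns={"normalized_domain": "canonical_domain", "pair_count": "top_count"})
)
summary = top.join(totals)
summary["top_share"] = summary["top_count"] / summary["total_pairs"]
summary["ambiguous"] = summary["top_share"] < MIN_TOP_SHARE
summary = summary.sort_index()
print(summary)
```

With a 0.95 threshold, a publication like The New York Times in the report above (top share of about 0.98) would still count as unambiguous.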



## 2) Per-Author Article Counts

We parse the `author` field and count bylines per journalist.  
Notes:
- Multi-bylines like `"A, B"` are split on commas.
- Empty or missing authors are dropped (you can keep them if needed).
- We also keep **outlet counts** per author. The cell below does not compute the **time span** of each author's articles; read the `date` column as well if you need it.
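
The notes above can be illustrated on a toy table. The rows are hypothetical, and the date handling is only an assumption about how a per-author time span could be derived from the dataset's `date` column.

```python
import pandas as pd

# Toy articles (hypothetical); 'date' and 'author' mirror the dataset's column names
articles = pd.DataFrame({
    "date": ["2016-03-01", "2017-05-10", "2019-11-20", "2018-02-02"],
    "author": ["Ann Lee, Bob Ray", "Ann Lee", "Bob Ray", None],
    "publication": ["Vox", "Vox", "CNN", "Reuters"],
})
articles["date"] = pd.to_datetime(articles["date"])

# Drop missing bylines, split on commas, and give each author its own row
bylined = articles.dropna(subset=["author"]).copy()
bylined["author"] = bylined["author"].str.split(",")
bylined = bylined.explode("author")
bylined["author"] = bylined["author"].str.strip()
bylined = bylined[bylined["author"] != ""]

per_author = (
    bylined.groupby("author")
           .agg(article_count=("date", "size"),
                n_outlets=("publication", "nunique"),
                first_date=("date", "min"),
                last_date=("date", "max"))
           .reset_index()
)
print(per_author)
```

A co-bylined article counts once for each of its authors, so the per-author counts can sum to more than the number of articles.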


In [9]:
# =========================
# 2) BUILD AUTHOR x OUTLET x DOMAIN COUNTS (CHUNKED)
# =========================

# --- Inputs
try:
    ALL_NEWS_CSV
except NameError:
    DATA_DIR = Path("data"); DATA_DIR.mkdir(exist_ok=True)
    ALL_NEWS_CSV = DATA_DIR / "all-the-news-2.csv"

try:
    OUT_DIR
except NameError:
    OUT_DIR = Path("outputs"); OUT_DIR.mkdir(exist_ok=True)

CANONICAL_MAP_CSV = OUT_DIR / "domain_check_reports" / "publication_domain_map.csv"  # written in Section 1.2
assert CANONICAL_MAP_CSV.exists(), "Run Section 1 to generate publication_domain_map.csv first."

# --- Load canonical mapping {publication -> canonical_domain}
pub_map = pd.read_csv(CANONICAL_MAP_CSV)[['publication', 'canonical_domain']].copy()
pub_map['publication'] = pub_map['publication'].astype(str).str.strip()

# --- Normalizer (redefine here for isolation)
def normalize_domain(url_or_domain: str) -> str | None:
    if pd.isna(url_or_domain):
        return None
    s = str(url_or_domain).strip().lower()
    if not s:
        return None
    s = re.sub(r'^https?://', '', s)
    s = s.split('/')[0]
    s = re.sub(r'^www\.', '', s)
    return s or None

def explode_authors(df: pd.DataFrame) -> pd.DataFrame:
    """Split comma-separated bylines into one row per author."""
    df = df.copy()
    df['author'] = df['author'].fillna('').astype(str)
    df = df[df['author'].str.strip() != '']
    df['author'] = df['author'].str.split(',')
    df = df.explode('author')
    df['author'] = df['author'].str.strip()
    df = df[df['author'] != '']
    return df

# --- Person-like heuristics (tweak as you like)
PERSON_LIKE_RULES = {
    "min_alpha_len": 3,
    "require_space": True,
    "blocklist_substrings": [
        "staff", "editorial", "opinion", "desk", "team", "associated press",
        "reuters", "field level media", "cnn", "ap", "buzzfeed news",
        "wired staff", "bbc", "npr", "nyt cooking", "vox staff"
    ],
    "all_caps_threshold": 0.8,
}
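# Match blocklist terms as whole words, so that short entries such as "ap"
# do not reject real names that merely contain them (e.g. "Tapper").
_BLOCKLIST_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(s) for s in PERSON_LIKE_RULES["blocklist_substrings"]) + r")\b"
)

def looks_person_like(name: str) -> bool:
    n = str(name).strip()
    if not n:
        return False
    low = n.lower()
    if _BLOCKLIST_RE.search(low):
        return False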
    if PERSON_LIKE_RULES["require_space"] and " " not in n:
        return False
    letters = re.findall(r"[A-Za-z]", n)
    if len(letters) < PERSON_LIKE_RULES["min_alpha_len"]:
        return False
    if letters:
        caps = sum(1 for ch in letters if ch.isupper())
        if caps / len(letters) >= PERSON_LIKE_RULES["all_caps_threshold"]:
            return False
    return True

# --- Accumulate counts per (author, publication, domain)
CHUNKSIZE = 250_000

In [None]:
accum = []

for chunk in pd.read_csv(ALL_NEWS_CSV, chunksize=CHUNKSIZE):
    sub = chunk[['author', 'publication', 'url']].copy()
    sub['publication'] = sub['publication'].astype(str).str.strip()

    # Raw normalized domain from URL
    sub['normalized_domain'] = sub['url'].map(normalize_domain)

    # Attach canonical domain by publication, fall back to raw normalized_domain when missing
    sub = sub.merge(pub_map, on='publication', how='left')
    sub['domain'] = sub['canonical_domain'].fillna(sub['normalized_domain'])

    # We can safely drop rows with missing domain if we will join to MBFC/AllSides later
    sub = sub.dropna(subset=['domain'])

    # Split multi-byline authors
    sub = explode_authors(sub)

    # Per-chunk groupby
    g = (sub
         .groupby(['author', 'publication', 'domain'], as_index=False)
         .size()
         .rename(columns={'size': 'article_count'}))
    accum.append(g)

# Merge all chunk results
author_outlet_counts = (
    pd.concat(accum, ignore_index=True)
    .groupby(['author', 'publication', 'domain'], as_index=False)['article_count']
    .sum()
    .sort_values('article_count', ascending=False)
    .reset_index(drop=True)
)

# --- Humanlike filtered version
author_outlet_counts_humanlike = author_outlet_counts[
    author_outlet_counts['author'].map(looks_person_like)
].copy().reset_index(drop=True)

# --- Save
author_outlet_counts_filename = "author_outlet_counts.csv"
author_outlet_counts.to_csv(OUT_DIR / author_outlet_counts_filename, index=False)

author_outlet_counts_humanlike_filename = "author_outlet_counts_humanlike.csv"
author_outlet_counts_humanlike.to_csv(OUT_DIR / author_outlet_counts_humanlike_filename, index=False)

print("Saved:")
print(" -", OUT_DIR / author_outlet_counts_filename)
print(" -", OUT_DIR / author_outlet_counts_humanlike_filename)

# --- Quick peek
display(author_outlet_counts.head(15))
display(author_outlet_counts_humanlike.head(15))

Saved:
 - outputs/author_outlet_counts.csv
 - outputs/author_outlet_counts_humanlike.csv


Unnamed: 0,author,publication,domain,article_count
0,CNN,CNN,cnn.com,24477
1,opinion contributor,The Hill,thehill.com,19551
2,WIRED Staff,Wired,wired.com,15815
3,Field Level Media,Reuters,reuters.com,7890
4,John Bowden,The Hill,thehill.com,7023
5,Associated Press,Fox News,foxnews.com,6797
6,Rebecca Savransky,The Hill,thehill.com,5944
7,The Associated Press,The New York Times,nytimes.com,5649
8,Julia Manchester,The Hill,thehill.com,5503
9,The New York Times,The New York Times,nytimes.com,5499


Unnamed: 0,author,publication,domain,article_count
0,John Bowden,The Hill,thehill.com,7023
1,Rebecca Savransky,The Hill,thehill.com,5944
2,Julia Manchester,The Hill,thehill.com,5503
3,The New York Times,The New York Times,nytimes.com,5499
4,Brett Samuels,The Hill,thehill.com,5395
5,Max Greenwood,The Hill,thehill.com,5151
6,Dave Quinn,People,people.com,5023
7,Michael D. Shear,The New York Times,nytimes.com,4689
8,Jordain Carney,The Hill,thehill.com,4461
9,Alexia Fernandez,People,people.com,4326


### 2.1. Split Multiple Authors

This reads a chosen counts file, splits author cells containing ```" and "``` into separate rows, then re-aggregates to keep one row per ```(author, publication, domain)```.
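
A minimal sketch of the split-and-re-aggregate idea on a toy counts table (made-up names and numbers) is shown here. It simplifies the notebook's cell by splitting every row rather than only comma-free ones.

```python
import re
import pandas as pd

AND_PAT = re.compile(r"\s+and\s+", flags=re.IGNORECASE)

# Toy counts table (made-up numbers)
counts = pd.DataFrame({
    "author": ["Ann Lee and Bob Ray", "Ann Lee", "Sandy Anders"],
    "publication": ["Vox", "Vox", "Vox"],
    "domain": ["vox.com", "vox.com", "vox.com"],
    "article_count": [3, 10, 4],
})

# Each cell becomes a list of names; " and " must be surrounded by whitespace,
# so "Sandy Anders" is left intact.
counts["author"] = counts["author"].map(lambda a: [p.strip() for p in AND_PAT.split(a)])
expanded = counts.explode("author")

collapsed = (
    expanded.groupby(["author", "publication", "domain"], as_index=False)["article_count"]
            .sum()
            .sort_values("article_count", ascending=False)
            .reset_index(drop=True)
)
print(collapsed)
```

Each co-author receives the full count of the joint row, so totals after the split exceed the original number of articles.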

In [None]:
# ---- Choose which file to fix
author_outlet_counts_filename = "author_outlet_counts_humanlike.csv"

# ---- Paths (fallbacks if these weren't defined earlier)
try:
    OUT_DIR
except NameError:
    OUT_DIR = Path("outputs"); OUT_DIR.mkdir(exist_ok=True)

in_path = OUT_DIR / author_outlet_counts_filename

# ---- Load
df = pd.read_csv(in_path)

# ---- Sanity: required columns
required_cols = {"author", "publication", "domain", "article_count"}
missing = required_cols - set(df.columns)
assert not missing, f"Missing columns: {missing}"

# ---- Split function: only split on ' and ' (case-insensitive), leave other separators alone
AND_PAT = re.compile(r"\s+and\s+", flags=re.IGNORECASE)

def split_and_expand(frame: pd.DataFrame) -> pd.DataFrame:
    f = frame.copy()
    # Only attempt split where there's no comma (commas were already exploded earlier)
    mask = f["author"].astype(str).str.contains(AND_PAT) & ~f["author"].astype(str).str.contains(",")
    f.loc[mask, "author"] = f.loc[mask, "author"].astype(str).str.split(AND_PAT)

    # Everything else becomes a 1-element list to allow explode
    f.loc[~mask, "author"] = f.loc[~mask, "author"].astype(str).apply(lambda x: [x])

    # Explode and trim
    f = f.explode("author")
    f["author"] = f["author"].astype(str).str.strip()
    # Drop empties if any
    f = f[f["author"] != ""]
    return f

df_expanded = split_and_expand(df)

# ---- Re-aggregate to merge duplicate rows created by the split
collapsed = (
    df_expanded
    .groupby(["author", "publication", "domain"], as_index=False)["article_count"]
    .sum()
    .sort_values("article_count", ascending=False)
    .reset_index(drop=True)
)

# ---- Save back to the SAME file (overwrite)
collapsed.to_csv(in_path, index=False)
print(f"Saved (overwritten): {in_path}")
display(collapsed.head(15))

Saved (overwritten): outputs/author_outlet_counts_humanlike.csv


Unnamed: 0,author,publication,domain,article_count
0,John Bowden,The Hill,thehill.com,7085
1,Rebecca Savransky,The Hill,thehill.com,5996
2,Brett Samuels,The Hill,thehill.com,5732
3,Julia Manchester,The Hill,thehill.com,5578
4,The New York Times,The New York Times,nytimes.com,5499
5,Max Greenwood,The Hill,thehill.com,5361
6,Dave Quinn,People,people.com,5023
7,Jordain Carney,The Hill,thehill.com,4846
8,Michael D. Shear,The New York Times,nytimes.com,4691
9,Alexia Fernandez,People,people.com,4326


### 2.2. Rename Publications and Merge

This lets us rename publications (e.g., ```"Vice News" → "Vice"```) to match label sources, then collapses duplicates created by the rename.

In [23]:
# ---- Choose which file to update
author_outlet_counts_filename = "author_outlet_counts_humanlike.csv"
# author_outlet_counts_filename = "author_outlet_counts.csv"

# ---- Paths (fallbacks if these weren't defined earlier)
try:
    OUT_DIR
except NameError:
    OUT_DIR = Path("outputs"); OUT_DIR.mkdir(exist_ok=True)

path = OUT_DIR / author_outlet_counts_filename

# ---- Load
df = pd.read_csv(path)

# ---- Renaming dictionary
PUB_RENAME = {
    "Vice News": "Vice",
    "The New York Times": "New York Times"
}

# ---- Apply renaming (exact match)
df["publication"] = df["publication"].astype(str).str.strip().replace(PUB_RENAME)

# ---- Re-aggregate to merge rows that became identical after renaming
df_merged = (
    df.groupby(["author", "publication", "domain"], as_index=False)["article_count"]
      .sum()
      .sort_values("article_count", ascending=False)
      .reset_index(drop=True)
)

# ---- Save back to the SAME file (overwrite)
df_merged.to_csv(path, index=False)
print(f"Saved (overwritten): {path}")
display(df_merged.head(15))

Saved (overwritten): outputs/author_outlet_counts_humanlike.csv


Unnamed: 0,author,publication,domain,article_count
0,John Bowden,The Hill,thehill.com,7085
1,Rebecca Savransky,The Hill,thehill.com,5996
2,Brett Samuels,The Hill,thehill.com,5732
3,Julia Manchester,The Hill,thehill.com,5578
4,The New York Times,New York Times,nytimes.com,5499
5,Max Greenwood,The Hill,thehill.com,5361
6,Dave Quinn,People,people.com,5023
7,Jordain Carney,The Hill,thehill.com,4846
8,Michael D. Shear,New York Times,nytimes.com,4691
9,Alexia Fernandez,People,people.com,4326



## 3) Outlet Bias (AllSides)

We'll use a **public CSV mirror** of the AllSides ratings (as provided by the open-source _AllSideR_ project).  
- URL (raw CSV): `https://raw.githubusercontent.com/favstats/AllSideR/master/data/allsides_data.csv`  
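- Fields include `news_source`, `rating` (Left, Lean Left, Center, Lean Right, Right), and a numeric `rating_num`. The mirror may spell the lean categories differently (the Section 5.1 output shows `left-center`), so the label mapping in this section accepts both spellings.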

> **Note:** AllSides' official methodology includes both categorical ratings and a numeric **Media Bias Meter** in [-6, +6]. This CSV mirrors the categorical form and provides a simple numeric encoding (1..5). You can map to the [-6..+6] scale if desired.


In [25]:
ALLSIDES_RAW = DATA_DIR / "allsides_data.csv"
ALLSIDES_URL = "https://raw.githubusercontent.com/favstats/AllSideR/master/data/allsides_data.csv"

if AUTO_DOWNLOAD:
    try:
        print("Downloading AllSides ratings...")
        safe_download(ALLSIDES_URL, ALLSIDES_RAW)
    except Exception as e:
        print("AllSides download failed:", e)
        print("Download manually from:", ALLSIDES_URL)

Downloading AllSides ratings...


In [26]:

allsides = None
if ALLSIDES_RAW.exists():
    allsides = pd.read_csv(ALLSIDES_RAW)
    # Keep only news/media outlets (drop 'Author'/'Think tank' etc. if desired)
    if 'type' in allsides.columns:
        allsides = allsides[allsides['type'].str.contains("News|Media", case=False, na=False)].copy()
    # Standardize columns
    allsides = allsides.rename(columns={
        "news_source": "source_name",
        "rating": "allsides_label",
        "rating_num": "allsides_num"
    })
    # The CSV's `url` column (if present) points to the AllSides *profile* page, not the outlet,
    # so we leave `domain` empty and join outlets by cleaned name in Section 5.1.
    allsides['domain'] = ""
    # Basic cleanup
    allsides['allsides_label'] = allsides['allsides_label'].str.lower().str.strip()
    # Map labels to [-2..+2] (Left=-2, Lean Left=-1, Center=0, Lean Right=+1, Right=+2);
    # both the "lean left" and "left-center" spellings are accepted.
    label_to_score = {
        "left": -2, "left-center": -1, "lean left": -1, "center": 0,
        "right-center": 1, "lean right": 1, "right": 2, "least biased": 0
    }
    allsides['allsides_score5'] = allsides['allsides_label'].map(label_to_score)
    allsides_out = OUT_DIR / "allsides_outlet_ratings.csv"
    allsides.to_csv(allsides_out, index=False)
    print("Saved:", allsides_out)
else:
    print("AllSides CSV not found. Place it at:", ALLSIDES_RAW.resolve())


Saved: outputs/allsides_outlet_ratings.csv



## 4) Outlet Bias (MBFC)

There is no official free bulk CSV, but two pragmatic options exist for research:

1. **Use a research dataset that standardizes MBFC labels** — e.g., Idiap's open dataset exposes `data/mbfc.csv` (normalized) and `data/mbfc_raw.csv` (scraped), with bias labels like `left`, `left-center`, `center/least biased`, `right-center`, `right`, plus **factual reporting** (`low`, `mixed`, `high`).  
   - Normalized CSV (raw GitHub URL): `https://raw.githubusercontent.com/idiap/Factual-Reporting-and-Political-Bias-Web-Interactions/main/data/mbfc.csv`

2. **Use NELA-GT (2018/2020/2022)** — the dataset bundles **MBFC source-level labels** you can join by outlet/domain. Download the JSON/SQLite release from its Dataverse DOI (see links in the paper).

Below we implement Option 1 for convenience.


In [30]:

MBFC_RAW = DATA_DIR / "mbfc.csv"
MBFC_URL = "https://raw.githubusercontent.com/idiap/Factual-Reporting-and-Political-Bias-Web-Interactions/main/data/mbfc.csv"

if AUTO_DOWNLOAD:
    try:
        print("Downloading MBFC (normalized) from Idiap...")
        safe_download(MBFC_URL, MBFC_RAW)
    except Exception as e:
        print("MBFC download failed:", e)
        print("Manual URL:", MBFC_URL)

Downloading MBFC (normalized) from Idiap...


In [31]:

mbfc = None
if MBFC_RAW.exists():
    mbfc = pd.read_csv(MBFC_RAW)
    # Expected columns: 'source' (domain), 'bias' (e.g., left, left-center, neutral/least-biased, right-center, right),
    # 'factual_reporting' (low/mixed/high) – actual names may vary slightly.
    mbfc.columns = [c.lower() for c in mbfc.columns]
    rename_map = {}
    if 'source' in mbfc.columns: rename_map['source'] = 'domain'
    if 'bias' in mbfc.columns: rename_map['bias'] = 'mbfc_label'
    if 'factual_reporting' in mbfc.columns: rename_map['factual_reporting'] = 'mbfc_factuality'
    mbfc = mbfc.rename(columns=rename_map)
    # Normalize values. This NaN-safe helper matches the Section 1.2 version, so
    # redefining it here does not change behavior for later cells.
    def normalize_domain(url_or_domain: str) -> str | None:
        if pd.isna(url_or_domain):
            return None
        s = str(url_or_domain).strip().lower()
        if not s:
            return None
        s = re.sub(r'^https?://', '', s)
        s = s.split('/')[0]
        s = re.sub(r'^www\.', '', s)
        return s or None
    mbfc['domain'] = mbfc['domain'].map(normalize_domain)
    mbfc = mbfc.dropna(subset=['domain']).copy()  # rows without a domain cannot be joined
    mbfc['mbfc_label'] = (
        mbfc['mbfc_label'].astype(str).str.lower().str.strip()
        .str.replace('least biased', 'center', regex=False)
    )
    # Map to a symmetric numeric scale (-3..+3, with extremes)
    label_to_score7 = {
        "extreme left": -3, "far left": -3, "left": -2, "left-center": -1, "center": 0,
        "right-center": 1, "right": 2, "extreme right": 3, "far right": 3, "neutral": 0, "least biased": 0
    }
    mbfc['mbfc_score7'] = mbfc['mbfc_label'].map(label_to_score7)
    mbfc_out = OUT_DIR / "mbfc_outlet_ratings.csv"
    mbfc.to_csv(mbfc_out, index=False)
    print("Saved:", mbfc_out)
else:
    print("MBFC CSV not found. Place it at:", MBFC_RAW.resolve())


Saved: outputs/mbfc_outlet_ratings.csv
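
After mapping labels to scores, it can help to list labels that received no score, since those rows silently drop out of any later average. A small sketch on toy rows (hypothetical domains and labels) follows; the mapping is copied from the cell above.

```python
import pandas as pd

label_to_score7 = {
    "extreme left": -3, "far left": -3, "left": -2, "left-center": -1, "center": 0,
    "right-center": 1, "right": 2, "extreme right": 3, "far right": 3,
    "neutral": 0, "least biased": 0,
}

# Toy rows (hypothetical domains and labels)
toy = pd.DataFrame({
    "domain": ["a.com", "b.com", "c.com", "d.com"],
    "mbfc_label": ["Left-Center", "Least Biased", "pro-science", "right"],
})
toy["mbfc_label"] = toy["mbfc_label"].str.lower().str.strip()
toy["mbfc_score7"] = toy["mbfc_label"].map(label_to_score7)

# Labels that received no score: candidates for extending the mapping or excluding
unmapped = toy.loc[toy["mbfc_score7"].isna(), "mbfc_label"].value_counts()
print(unmapped)
```

Running the same check on the real `mbfc` frame would show which categories, if any, fall outside the left-right mapping.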



## 5) Join Outlet Ratings → Articles → Authors


### 5.1. AllSides
This joins the AllSides outlet ratings (from the AllSideR mirror) to the ```(author, publication, domain, article_count)``` table, computes each author’s article-weighted mean AllSides score, plus a few helpful fields:

- ```total_articles```, ```main_outlet```, ```main_outlet_share```
- ```as_mean_score``` (article-weighted mean of AllSides score in [-2..+2])
- optional label counts per author (how many of their articles came from outlets with each AllSides label)
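
To make the article-weighted mean concrete: for one author, it is the sum of (articles at an outlet × that outlet's score) divided by the total articles at rated outlets. The sketch below works this through on a toy table; the outlets, counts, and the helper name `weighted_mean` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy joined table (made-up counts); NaN means the outlet has no AllSides match
joined = pd.DataFrame({
    "author": ["Ann Lee", "Ann Lee", "Ann Lee", "Bob Ray"],
    "publication": ["Outlet L", "Outlet C", "Outlet X", "Outlet X"],
    "article_count": [30, 10, 60, 5],
    "allsides_score5": [-2.0, 0.0, np.nan, np.nan],
})

def weighted_mean(group: pd.DataFrame) -> float:
    rated = group.dropna(subset=["allsides_score5"])
    if rated.empty:
        return np.nan
    # sum(n_o * s_o) / sum(n_o), over rated outlets only
    return float(np.average(rated["allsides_score5"], weights=rated["article_count"]))

as_mean = {author: weighted_mean(g) for author, g in joined.groupby("author")}
print(as_mean)
```

Here Ann Lee's score is (30 × -2 + 10 × 0) / 40 = -1.5, even though 60 of her 100 articles come from an unrated outlet. Reporting the share of articles at rated outlets alongside the mean makes that limitation visible.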

In [27]:
# =========================
# 5.1) PER-AUTHOR ALLSIDES RATING (WEIGHTED)
# =========================

# ---- Inputs/paths
try:
    OUT_DIR
except NameError:
    OUT_DIR = Path("outputs"); OUT_DIR.mkdir(exist_ok=True)

COUNTS_PATH = OUT_DIR / "author_outlet_counts_humanlike.csv"
ALLSIDES_PATH = OUT_DIR / "allsides_outlet_ratings.csv"       # produced in Section 3

# ---- Load data
counts = pd.read_csv(COUNTS_PATH)  # columns: author, publication, domain, article_count
allsides = pd.read_csv(ALLSIDES_PATH)  # expected: source_name, allsides_label, allsides_score5, type, ...

# ---- Helpers
def clean_name(s: str) -> str:
    """
    Canonicalize outlet names for joining:
    - lowercase and strip whitespace
    - remove a leading 'the '
    - drop apostrophes, turn '&' into 'and', turn hyphens/underscores into spaces
    - collapse repeated whitespace
    """
    if pd.isna(s):
        return ""
    s = str(s).strip().lower()
    s = re.sub(r"^the\s+", "", s)
    s = re.sub(r"[’'`]", "", s)      # drop apostrophes and backticks
    s = re.sub(r"\s*&\s*", " and ", s)
    s = re.sub(r"[-_]", " ", s)
    s = re.sub(r"\s+", " ", s)
    return s.strip()

# OPTIONAL: manual crosswalk for tricky names (extend freely)
PUB_TO_ALLSIDES = {
    # examples you can uncomment/extend:
    # "washington post": "washington post",
    # "new york times": "new york times",
    # "vice news": "vice",
    # "buzzfeed news": "buzzfeed news",
    # "the hill": "the hill",
    # add more as needed...
}

# ---- Prepare join keys
counts['publication_for_join'] = counts['publication'].astype(str).str.strip().str.lower()
counts['publication_for_join'] = counts['publication_for_join'].replace(PUB_TO_ALLSIDES)
counts['pub_clean'] = counts['publication_for_join'].map(clean_name)

allsides['source_name'] = allsides['source_name'].astype(str)
allsides['source_clean'] = allsides['source_name'].map(clean_name)

# Keep only News/Media rows if you didn't filter earlier
if 'type' in allsides.columns:
    mask_news = allsides['type'].str.contains("news|media", case=False, na=False)
    allsides = allsides[mask_news].copy()

# ---- Merge by cleaned outlet name
joined = counts.merge(
    allsides[['source_clean', 'allsides_label', 'allsides_score5']],
    left_on='pub_clean', right_on='source_clean', how='left'
)

# ---- Aggregate per author
# Total articles and main outlet/share
author_totals = (
    joined.groupby('author', as_index=False)['article_count'].sum()
          .rename(columns={'article_count': 'total_articles'})
)

main_outlet = (joined
    .sort_values(['author', 'article_count'], ascending=[True, False])
    .groupby('author', as_index=False).first()[['author', 'publication', 'article_count']]
    .rename(columns={'publication': 'main_outlet', 'article_count': 'main_outlet_articles'})
)

author_core = author_totals.merge(main_outlet, on='author', how='left')
author_core['main_outlet_share'] = author_core['main_outlet_articles'] / author_core['total_articles']

# Weighted mean AllSides score per author
def wavg(series, weights):
    """Article-weighted mean of `series`; unrated (NaN) rows are left out of both sums."""
    s = series.dropna()
    if s.empty:
        return np.nan
    w = weights.loc[s.index]
    return float(np.average(s, weights=w))

as_mean = (
    joined.groupby('author')['allsides_score5']
          .apply(lambda s: wavg(s, joined.loc[s.index, 'article_count']))
          .rename('as_mean_score')
          .reset_index()
)

# Optional: per-label article counts (pivot)
as_label_counts = pd.pivot_table(
    joined,
    index='author',
    columns='allsides_label',
    values='article_count',
    aggfunc='sum',
    fill_value=0
)
# prefix columns
as_label_counts.columns = [f"as_articles@{c}" for c in as_label_counts.columns]
as_label_counts = as_label_counts.reset_index()

# ---- Final per-author AllSides table
per_author_allsides = (
    author_core
    .merge(as_mean, on='author', how='left')
    .merge(as_label_counts, on='author', how='left')
    .sort_values('total_articles', ascending=False)
    .reset_index(drop=True)
)

# ---- Save
out_path_as = OUT_DIR / "per_author_allsides_weighted.csv"
per_author_allsides.to_csv(out_path_as, index=False)
print("Saved:", out_path_as)
display(per_author_allsides.head(15))

Saved: outputs/per_author_allsides_weighted.csv


| | author | total_articles | main_outlet | main_outlet_articles | main_outlet_share | as_mean_score | as_articles@center | as_articles@left | as_articles@left-center |
|---|---|---|---|---|---|---|---|---|---|
| 0 | John Bowden | 7085 | The Hill | 7085 | 1.0 | 0.0 | 7085.0 | 0.0 | 0.0 |
| 1 | Rebecca Savransky | 5996 | The Hill | 5996 | 1.0 | 0.0 | 5996.0 | 0.0 | 0.0 |
| 2 | Brett Samuels | 5732 | The Hill | 5732 | 1.0 | 0.0 | 5732.0 | 0.0 | 0.0 |
| 3 | Julia Manchester | 5588 | The Hill | 5578 | 0.99821 | 0.0 | 5578.0 | 0.0 | 0.0 |
| 4 | The New York Times | 5499 | New York Times | 5499 | 1.0 | NaN | NaN | NaN | NaN |
| 5 | Max Greenwood | 5361 | The Hill | 5361 | 1.0 | 0.0 | 5361.0 | 0.0 | 0.0 |
| 6 | Dave Quinn | 5023 | People | 5023 | 1.0 | NaN | NaN | NaN | NaN |
| 7 | Jordain Carney | 4846 | The Hill | 4846 | 1.0 | 0.0 | 4846.0 | 0.0 | 0.0 |
| 8 | Michael D. Shear | 4691 | New York Times | 4691 | 1.0 | NaN | NaN | NaN | NaN |
| 9 | Alexia Fernandez | 4326 | People | 4326 | 1.0 | NaN | NaN | NaN | NaN |
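
The New York Times and People rows above carry no AllSides values, which suggests their cleaned names found no match in the AllSides table. A minimal sketch for listing unmatched outlets by article volume, so you know which entries to add to `PUB_TO_ALLSIDES` first, could look like the following. The frames `counts_demo` and `allsides_demo` and all their values are hypothetical stand-ins for the notebook's `counts` and `allsides`; only the `merge(..., indicator=True)` pattern is the point.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's `counts` and `allsides` frames
counts_demo = pd.DataFrame({
    "author": ["A", "B", "C"],
    "pub_clean": ["the hill", "new york times", "people"],
    "article_count": [100, 80, 60],
})
allsides_demo = pd.DataFrame({
    "source_clean": ["the hill", "some other outlet"],
    "allsides_label": ["center", "left"],
})

# indicator=True adds a `_merge` column: 'both' for matched rows,
# 'left_only' for count rows with no AllSides match
checked = counts_demo.merge(
    allsides_demo, left_on="pub_clean", right_on="source_clean",
    how="left", indicator=True,
)

# Unmatched outlets, largest article volume first
unmatched = (
    checked.loc[checked["_merge"] == "left_only"]
    .groupby("pub_clean")["article_count"].sum()
    .sort_values(ascending=False)
)
print(unmatched)
```

In the real notebook you would run the same merge on `counts` and `allsides` and map each high-volume unmatched name to its AllSides spelling in the crosswalk.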


### 5.2. MBFC
The processed MBFC file already has a `domain` column, so we can join on `domain` directly.
This cell computes:

- `mbfc_mean_score7`: the article-weighted mean of `mbfc_score7` on a -3 to +3 scale
- optional **factuality mean**: MBFC factual-reporting ratings (the `mbfc_factuality` column) mapped to a numeric scale and averaged with the same weights
- optional label counts per author
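
To make the article-weighted mean concrete, here is a small sketch that uses the same `wavg` logic as the cell below on invented data. The author, domains, and numbers are hypothetical; the third outlet is deliberately unrated to show that it drops out of both the numerator and the denominator, so the mean describes rated articles only.

```python
import numpy as np
import pandas as pd

def wavg(series, weights):
    # Same logic as the notebook helper: ignore unrated (NaN) rows entirely
    s = series.dropna()
    if s.empty:
        return np.nan
    w = weights.loc[s.index]
    return float(np.average(s, weights=w))

# Hypothetical toy data: one author writing for three outlets, one unrated
toy = pd.DataFrame({
    "author": ["A. Writer"] * 3,
    "domain": ["outlet-a.example", "outlet-b.example", "outlet-c.example"],
    "article_count": [90, 10, 50],
    "mbfc_score7": [0.0, -2.0, np.nan],
})

weighted = wavg(toy["mbfc_score7"], toy["article_count"])  # (0*90 + -2*10) / 100
unweighted = toy["mbfc_score7"].mean()                     # plain mean of 0 and -2
rated_share = toy.loc[toy["mbfc_score7"].notna(), "article_count"].sum() / toy["article_count"].sum()

print(weighted, unweighted, round(rated_share, 3))
```

The weighted value stays close to the outlet where most articles appeared, while the plain mean treats both rated outlets equally. Reporting something like `rated_share` next to the mean is one way to flag authors whose score rests on a small fraction of their articles.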

In [32]:
# =========================
# 5.2) PER-AUTHOR MBFC RATING (ARTICLE-WEIGHTED, JOINED ON DOMAIN)
# =========================

# ---- Inputs/paths
try:
    OUT_DIR
except NameError:
    OUT_DIR = Path("outputs")
    OUT_DIR.mkdir(exist_ok=True)

COUNTS_PATH = OUT_DIR / "author_outlet_counts_humanlike.csv"
MBFC_PATH = OUT_DIR / "mbfc_outlet_ratings.csv"               # produced in Section 4

# ---- Load data
counts = pd.read_csv(COUNTS_PATH)  # columns: author, publication, domain, article_count
mbfc = pd.read_csv(MBFC_PATH)      # expected: domain, mbfc_label, mbfc_score7, (optional) mbfc_factuality

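# ---- Merge by domain
# Include mbfc_factuality only if present; keep one MBFC row per domain so the
# left join cannot duplicate count rows.
mbfc_cols = ['domain', 'mbfc_label', 'mbfc_score7']
if 'mbfc_factuality' in mbfc.columns:
    mbfc_cols.append('mbfc_factuality')
joined = counts.merge(mbfc[mbfc_cols].drop_duplicates('domain'), on='domain', how='left')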

# ---- Aggregate per author (reuse totals and main outlet/share)
author_totals = (
    joined.groupby('author', as_index=False)['article_count'].sum()
          .rename(columns={'article_count': 'total_articles'})
)
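# Top outlet per author: keep each author's first whole row after sorting by count
# (groupby().first() takes the first non-null value per column, which can mix rows).
main_outlet = (joined
    .sort_values(['author', 'article_count'], ascending=[True, False])
    .drop_duplicates('author')[['author', 'publication', 'article_count']]
    .rename(columns={'publication': 'main_outlet', 'article_count': 'main_outlet_articles'})
)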
author_core = author_totals.merge(main_outlet, on='author', how='left')
author_core['main_outlet_share'] = author_core['main_outlet_articles'] / author_core['total_articles']

# Weighted mean MBFC bias score per author
def wavg(series, weights):
    s = series.dropna()
    if s.empty:
        return np.nan
    w = weights.loc[s.index]
    return float(np.average(s, weights=w))

mbfc_mean = (
    joined.groupby('author')['mbfc_score7']
          .apply(lambda s: wavg(s, joined.loc[s.index, 'article_count']))
          .rename('mbfc_mean_score7')
          .reset_index()
)

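# Optional: factuality → numeric mapping (tweak as needed to match your CSV)
FACTUALITY_TO_NUM = {
    # Keys must be lowercase because values are lowercased before lookup;
    # any label missing from this dict becomes NaN and is ignored by wavg().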
    'very low': 1, 'low': 2, 'mixed': 3,
    'mostly factual': 4, 'high': 5, 'very high': 6,
    'na': np.nan, 'none': np.nan
}
if 'mbfc_factuality' in joined.columns:
    j2 = joined.copy()
    j2['mbfc_fact_num'] = j2['mbfc_factuality'].astype(str).str.lower().map(FACTUALITY_TO_NUM)
    mbfc_fact_mean = (
        j2.groupby('author')['mbfc_fact_num']
          .apply(lambda s: wavg(s, j2.loc[s.index, 'article_count']))
          .rename('mbfc_factuality_mean')
          .reset_index()
    )
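else:
    # Empty placeholder with explicit dtypes so the later merge on 'author' stays well-typed
    mbfc_fact_mean = pd.DataFrame({
        'author': pd.Series(dtype='object'),
        'mbfc_factuality_mean': pd.Series(dtype='float64'),
    })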

# Optional: per-label article counts (pivot)
mbfc_label_counts = pd.pivot_table(
    joined,
    index='author',
    columns='mbfc_label',
    values='article_count',
    aggfunc='sum',
    fill_value=0
)
mbfc_label_counts.columns = [f"mbfc_articles@{c}" for c in mbfc_label_counts.columns]
mbfc_label_counts = mbfc_label_counts.reset_index()

# ---- Final per-author MBFC table
per_author_mbfc = (
    author_core
    .merge(mbfc_mean, on='author', how='left')
    .merge(mbfc_fact_mean, on='author', how='left')
    .merge(mbfc_label_counts, on='author', how='left')
    .sort_values('total_articles', ascending=False)
    .reset_index(drop=True)
)

# ---- Save
out_path_mb = OUT_DIR / "per_author_mbfc_weighted.csv"
per_author_mbfc.to_csv(out_path_mb, index=False)
print("Saved:", out_path_mb)
display(per_author_mbfc.head(15))

Saved: outputs/per_author_mbfc_weighted.csv


| | author | total_articles | main_outlet | main_outlet_articles | main_outlet_share | mbfc_mean_score7 | mbfc_factuality_mean | mbfc_articles@left | mbfc_articles@left-center | mbfc_articles@neutral | mbfc_articles@right |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | John Bowden | 7085 | The Hill | 7085 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 7085.0 | 0.0 |
| 1 | Rebecca Savransky | 5996 | The Hill | 5996 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 5996.0 | 0.0 |
| 2 | Brett Samuels | 5732 | The Hill | 5732 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 5732.0 | 0.0 |
| 3 | Julia Manchester | 5588 | The Hill | 5578 | 0.99821 | -0.003579 | 3.0 | 10.0 | 0.0 | 5578.0 | 0.0 |
| 4 | The New York Times | 5499 | New York Times | 5499 | 1.0 | -1.0 | 5.0 | 0.0 | 5499.0 | 0.0 | 0.0 |
| 5 | Max Greenwood | 5361 | The Hill | 5361 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 5361.0 | 0.0 |
| 6 | Dave Quinn | 5023 | People | 5023 | 1.0 | -2.0 | 5.0 | 5023.0 | 0.0 | 0.0 | 0.0 |
| 7 | Jordain Carney | 4846 | The Hill | 4846 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 4846.0 | 0.0 |
| 8 | Michael D. Shear | 4691 | New York Times | 4691 | 1.0 | -1.0 | 5.0 | 0.0 | 4691.0 | 0.0 | 0.0 |
| 9 | Alexia Fernandez | 4326 | People | 4326 | 1.0 | -2.0 | 5.0 | 4326.0 | 0.0 | 0.0 | 0.0 |
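
As a check on how to read `mbfc_mean_score7`, the Julia Manchester row above can be reproduced by hand. The sketch below assumes that `left` maps to -2 and `neutral` to 0; that mapping is inferred from the sample output (authors with only `left` articles score -2.0, authors with only `neutral` articles score 0.0) rather than taken from the MBFC file itself.

```python
import numpy as np

# Article counts from the sample output row for Julia Manchester
article_counts = np.array([5578, 10])  # neutral-rated articles, left-rated articles

# Assumed label-to-score mapping, inferred from the sample output
scores = np.array([0.0, -2.0])

# Article-weighted mean: (0 * 5578 + -2 * 10) / 5588
mean_score = float(np.average(scores, weights=article_counts))
main_outlet_share = 5578 / 5588

print(round(mean_score, 6))         # -0.003579
print(round(main_outlet_share, 5))  # 0.99821
```

Both values agree with the table, which confirms that a handful of articles at a differently rated outlet barely moves the mean of a high-volume author.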
