# Policy Analysis

This notebook presents code to analyse the Wikipedia policies applied to a set of articles. We measure policy along several dimensions:

* WP content assessment: Assesses the quality of an article. The system is based on a letter scheme that reflects principally how factually complete the article is, though language quality and layout are also factors.

* WikiProject Importance Scale: The importance rating is part of the WikiProject assessment system. It is assigned per WikiProject, not globally. It tries to answer: "How likely is it that a typical Wikipedia reader will need this article?"

* Vital Article: There are 7,085,890 articles on the English Wikipedia, and 60,000 (around 0.7%) have been selected as vital articles to the project. They are organized into five levels: Level 1 contains 10 articles, Wikipedia's most vital ones; Level 2 has 100, Level 3 has 1,000, Level 4 has 10,000 and Level 5 has 50,000. The vital articles list serves three purposes:
    * Give direction to the prioritization of improvements of English Wikipedia articles (e.g. which articles to bring to WP:GA and WP:FA status).
    * Provide a measurement of quality of overall English Wikipedia (e.g. what proportion of the most important articles are at GA and FA status).
    * Serve as a centralized watchlist of English Wikipedia's most important articles.

* Page protection: In some circumstances, pages may need to be protected from modification by certain groups of editors. Pages are protected when there is disruption that cannot be prevented through other means, such as blocks. Protection is a technical restriction applied only by administrators, although any user may request protection. Protection can be indefinite or expire after a specified time.

* Contentious topics: Some WP pages are designated as contentious topics (CT) by the Arbitration Committee after users submit a CT request. Only extended-confirmed editors may make edits related to the topic area, though editors who are not extended-confirmed may post constructive comments and make edit requests related to articles within the topic area on article talk pages.

## Requirements

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import re
import requests
import time
from datetime import datetime
import csv
import mwparserfromhell

## List of control and main set of articles

In [2]:
# Articles List
control_articles = [
    "Pop music",
    "Rock and roll",
    "Eric Clapton",
    "Rolling Stone",
    "Jazz",
    "Swing",
    "Classical music",
    "Ludwig van Beethoven",
    "Wolfgang Amadeus Mozart",
    "Joseph Haydn",
    "Country music",
    "BTS",
    "K-Pop",
    "Electronic music",
    "Daft Punk",
    "Paul Kalkbrenner",
    "Trumpet",
    "Music theory",
    "Fender",
    "Marshall Amplification",
    "Jimi Hendrix",
    "Bob Marley",
    "Edith Piaf",
    "Royal Albert Hall",
    "Piano",
    "Saxophone",
    "Pink Floyd",
    "Nirvana (band)",
    "Nina Simone",
    "Music of Africa",
    "Major scale",
    "Major chord",
    "Minor chord",
    "Red Hot Chili Peppers",
    "Funk rock",
    "James Brown",
    "Dire Straits",
    "Mark Knopfler",
    "John Frusciante",
    "Alan Clark",
    "Stevie Wonder",
    "Guitar"
]

control_articles = sorted(control_articles)

articles = [
    "COVID-19 pandemic in Ukraine",
    "History of Ukraine",
    "Crimea",
    "Russian annexation of Crimea",
    "2004 Ukrainian presidential election",
    "Football in Ukraine",
    "Bessarabia",
    "2014 pro-Russian unrest in Ukraine",
    "Communist Party of the Soviet Union",
    "English Civil War",
    "Christianity in Russia",
    "History of Christianity in Ukraine",
    "Flag of Ukraine",
    "Alexander II of Russia",
    "Eastern Front (World War II)",
    "Bukovina",
    "Epiphanius I of Ukraine",
    "History of Crimea",
    "Dissolution of the Soviet Union",
    "Crimean Tatars",
    "Catherine the Great",
    "Culture of Ukraine",
    "Abortion in Ukraine",
    "Christmas in Ukraine",
    "Armed Forces of Ukraine",
    "Demographics of Ukraine",
    "History of Kyiv",
    "Foreign relations of Ukraine",
    "Eastern Front (World War I)",
    "Economy of Ukraine",
    "Galicia (Eastern Europe)",
    "Euromaidan",
    "History of the Russian Orthodox Church",
    "Government of Ukraine",
    "Geography of Ukraine",
    "Censuses in Ukraine",
    "Administrative divisions of Ukraine",
    "Government of the Ukrainian People's Republic in exile",
    "Education in Ukraine",
    "2022 Russian invasion of Ukraine",
    "Buddhism in Ukraine"
]

## Get the policy information for main set

The information retrieved reflects each article's current status.

In [10]:
policy_analysis_articles = pd.read_csv('../datas/interim/Policy Analysis/policy_analysis_articles.csv')
policy_analysis_control_articles = pd.read_csv('../datas/interim/Policy Analysis/policy_analysis_control_articles.csv')
policy_analysis_control_articles

Unnamed: 0,article,class,importance,contentious
0,Alan Clark,C,Low,False
1,BTS,FA,Top,False
2,Bob Marley,GA,High,False
3,Classical music,B,,False
4,Country music,B,top,False
5,Daft Punk,GA,high,False
6,Dire Straits,B,high,False
7,Edith Piaf,,,
8,Electronic music,B,top,False
9,Eric Clapton,B,Mid,False
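
The `importance` column above mixes capitalizations (`Top` and `top`, `high` and `Mid`), and the `class` letters have no inherent order when stored as strings. A minimal sketch of one way to normalize both before analysis follows. The helper names `normalize_importance` and `class_rank` are hypothetical, and the ordering Stub < Start < C < B < GA < A < FA is an assumption taken from the standard Wikipedia content assessment scale:

```python
# Hypothetical helpers, not part of the original notebook.
# Assumed quality order (lowest to highest) of the content assessment scale.
CLASS_ORDER = ["Stub", "Start", "C", "B", "GA", "A", "FA"]
IMPORTANCE_LEVELS = {"low": "Low", "mid": "Mid", "high": "High", "top": "Top"}


def normalize_importance(value):
    """Return the importance with consistent capitalization, or None if unknown."""
    if not isinstance(value, str):  # covers None and the NaN pandas uses for empty cells
        return None
    return IMPORTANCE_LEVELS.get(value.strip().lower())


def class_rank(value):
    """Return the position of a class on the assumed scale, or None if unrated."""
    if not isinstance(value, str):
        return None
    lookup = {name.lower(): rank for rank, name in enumerate(CLASS_ORDER)}
    return lookup.get(value.strip().lower())


print(normalize_importance("top"))  # Top
print(class_rank("GA"))             # 4
```

With pandas, these could be applied column-wise, for example `df["importance"].map(normalize_importance)` and `df["class"].map(class_rank)`.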


### Functions and details to get the policy information for main set 

In [9]:
URL = "https://en.wikipedia.org/w/api.php"

HEADERS = {
    "User-Agent": "DH_Project/1.0 (maxime.garambois@epfl.ch)"
}

CONTENTIOUS_KEYWORDS = ["contentious topics/"]

def get_talk_wikitext(title):
    """Retrieve raw wikitext of the Talk page."""
    params = {
        "action": "query",
        "titles": f"Talk:{title}",
        "prop": "revisions",
        "rvslots": "main",
        "rvprop": "content",
        "formatversion": "2",
        "format": "json"
    }
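    response = requests.get(url=URL, params=params, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    data = response.json()

    page = data["query"]["pages"][0]
    if "missing" in page or "revisions" not in page:
        return None  # talk page doesn't exist or has no readable revision

    return page["revisions"][0]["slots"]["main"]["content"]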


def parse_assessments(wikitext):
    """Parse class, importance values, and contentious-topic status from wikitext."""
    code = mwparserfromhell.parse(wikitext)

    results = {
        "class": None,
        "importance": {},
        "contentious": False
    }

    for template in code.filter_templates():
        name = template.name.strip().lower()

        # GLOBAL CLASS (from banner shell)
        if "banner shell" in name:
            if template.has("class"):
                results["class"] = str(template.get("class").value).strip()

        # PER-WIKIPROJECT IMPORTANCE
        if "wikiproject" in name and "banner shell" not in name:
            project = template.name.strip().replace("WikiProject", "").strip()

            # look for either "importance" or "priority"
            if template.has("importance"):
                imp = str(template.get("importance").value).strip()
                results["importance"][project] = imp
            elif template.has("priority"):
                # Some projects use "priority" (e.g., Mathematics)
                imp = str(template.get("priority").value).strip()
                results["importance"][project] = imp

        # CONTENTIOUS TOPICS DETECTION
        # Match on the template name, e.g. {{Contentious topics/talk notice}}
        if any(keyword in name for keyword in CONTENTIOUS_KEYWORDS):
            results["contentious"] = True

    return results


def get_article_assessment(title):
    """Main wrapper: fetch talk page and parse assessment."""
    wikitext = get_talk_wikitext(title)
    if not wikitext:
        return {"error": "Talk page does not exist"}

    return parse_assessments(wikitext)

def extract_relevant_importance(importance_dict):
    """
    From all WikiProject importance values:
    - If 'Ukraine' exists -> return that value.
    - Else -> return the first value in the dict.
    - If dict empty -> return None.
    """
    if not importance_dict:
        return None

    # Prefer Ukraine rating if present
    if "Ukraine" in importance_dict:
        return importance_dict["Ukraine"]

    # Otherwise take the first key in the dict
    first_key = next(iter(importance_dict))
    return importance_dict[first_key]

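def get_data(articles_list, out_csv):
    """Fetch the assessment of every article and write one CSV row per article."""
    rows = []

    for article in articles_list:
        assessment = get_article_assessment(article)
        time.sleep(0.5)  # pause between requests to avoid hammering the API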

        # Extract global class
        article_class = assessment.get("class")

        # Extract importance from rules
        importance = extract_relevant_importance(assessment.get("importance", {}))

        # Contentious topic boolean
        contentious = assessment.get("contentious")

        rows.append({
            "article": article,
            "class": article_class,
            "importance": importance,
            "contentious": contentious
        })

    # Save CSV
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["article", "class", "importance", "contentious"])
        writer.writeheader()
        writer.writerows(rows)

    print(f"✅ CSV saved at: {out_csv}")

OUT_CSV = '../datas/interim/Policy Analysis/policy_analysis_control_articles.csv'
get_data(control_articles, OUT_CSV)

✅ CSV saved at: ../datas/interim/Policy Analysis/policy_analysis_control_articles.csv


In [6]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote
import re

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "DH_Project/1.0 (maxime.garambois@epfl.ch)"}

def mw_normalize_and_redirects(title):
    """Return canonical title + any redirects (all with underscores)."""
    params = {
        "action": "query",
        "titles": title,
        "redirects": "1",
        "format": "json",
        "formatversion": "2",
    }
    r = requests.get(API, params=params, headers=HEADERS).json()
    pages = r.get("query", {}).get("pages", [])
    if not pages or "missing" in pages[0]:
        # fall back to the provided title
        return {title.replace(" ", "_")}
    canonical = pages[0]["title"].replace(" ", "_")
    candidates = {canonical}
    for redir in r["query"].get("redirects", []):
        candidates.add(redir["from"].replace(" ", "_"))
        candidates.add(redir["to"].replace(" ", "_"))
    return candidates

def parse_page_html(title):
    """Fetch parsed HTML for a wiki page title."""
    params = {"action": "parse", "page": title, "prop": "text", "format": "json"}
    r = requests.get(API, params=params, headers=HEADERS).json()
    if "error" in r:
        return None
    return r["parse"]["text"]["*"]

def collect_level_subpages(level):
    """
    From the root VA page for a level, collect all subpages like:
    Wikipedia:Vital articles/Level/<level>/People, /History, etc.
    Include the root too (some levels have direct links).
    """
    root = f"Wikipedia:Vital articles/Level/{level}"
    html = parse_page_html(root)
    subpages = set()
    if html:
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a"):
            href = a.get("href", "")
            title = a.get("title", "")
            # Prefer title (cleaner), but fall back to href if needed
            if title.startswith(f"Wikipedia:Vital articles/Level/{level}/"):
                subpages.add(title)
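            elif href.startswith("/wiki/Wikipedia:Vital_articles/Level/"):
                # Extract after /wiki/, decode, and drop any #fragment
                target = unquote(href[len("/wiki/"):]).split("#", 1)[0]
                if re.match(rf"Wikipedia:Vital_articles/Level/{level}\b", target):
                    # Store with spaces, like `title` and `root`, so the set
                    # does not hold the same page twice under two spellings
                    subpages.add(target.replace("_", " "))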
    subpages.add(root)
    return subpages

def vital_level_via_lists(article_title):
    """
    Search Vital Articles lists (levels 1..5) and return the level number
    where the article appears, or None if not found.
    """
    acceptable = {t.lower() for t in mw_normalize_and_redirects(article_title)}

    for level in range(1, 6):
        for subpage in collect_level_subpages(level):
            html = parse_page_html(subpage)
            if not html:
                continue
            soup = BeautifulSoup(html, "html.parser")
            for a in soup.find_all("a"):
                # Use the title attribute: it's the canonical page title
                if a.has_attr("title"):
                    link_title = a["title"].replace(" ", "_").lower()
                    if link_title in acceptable:
                        return level
                else:
                    # Fallback to href if no title (rare)
                    href = a.get("href", "")
                    if href.startswith("/wiki/"):
                        target = href[len("/wiki/"):].split("#", 1)[0]
                        target = unquote(target).replace(" ", "_").lower()
                        if target in acceptable:
                            return level
    return None

for article in articles:
    level = vital_level_via_lists(article)
    print(article, level)
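
The loop above downloads every Vital Articles list page again for each article, which multiplies the number of API calls by the length of `articles`. A possible alternative, sketched here under the assumption that the linked titles of each level have already been collected once (for instance with `collect_level_subpages` and `parse_page_html`), is to build a title-to-level index up front and then look articles up in memory. `build_vital_index` and `lookup_level` are hypothetical names, and the sample titles are placeholders rather than real list contents:

```python
def build_vital_index(links_by_level):
    """Map each normalized link title to the lowest level whose lists contain it.

    links_by_level: dict {level: iterable of page titles linked from that level}.
    """
    index = {}
    for level in sorted(links_by_level):
        for title in links_by_level[level]:
            key = title.replace(" ", "_").lower()
            index.setdefault(key, level)  # keep the first (lowest) level seen
    return index


def lookup_level(index, candidate_titles):
    """Return the lowest level matching any candidate title, or None."""
    keys = (t.replace(" ", "_").lower() for t in candidate_titles)
    levels = [index[key] for key in keys if key in index]
    return min(levels) if levels else None


# Placeholder data for illustration only, not real Vital Articles content.
sample_links = {1: ["Example A"], 2: ["Example A", "Example B"]}
index = build_vital_index(sample_links)
print(lookup_level(index, {"Example_B"}))  # 2
print(lookup_level(index, {"Example_A"}))  # 1
print(lookup_level(index, {"Unknown"}))    # None
```

The candidate set passed to `lookup_level` would be the output of `mw_normalize_and_redirects`, so redirects are still honoured while each list page is fetched only once.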

## Evolution of the policy through time for each article

In [30]:
# Try to store the article's metadata changes in a timeline
WIKI_API = "https://en.wikipedia.org/w/api.php"
USER_AGENT = "DH_Project/1.0 (maxime.garambois@epfl.ch)"
SLEEP = 0.5

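def request_api(params, max_retries=5):
    """Send a request to the API with a user-agent and a bounded number of retries."""
    headers = {"User-Agent": USER_AGENT}
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(WIKI_API, params=params, headers=headers, timeout=10)
            response.raise_for_status()
            return response.json()
        except (requests.RequestException, ValueError) as e:
            # ValueError covers a response body that is not valid JSON
            print(f"Attempt {attempt}/{max_retries} failed: {e}")
            time.sleep(2)
    raise RuntimeError(f"API request failed after {max_retries} attempts")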

def fetch_talk_revisions(article_title):
    """
    Fetch all wikitext revisions of the Talk:Article page.
    Returns a list of dicts: {rev_id, timestamp, content}
    """
    talk_title = f"Talk:{article_title}"
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": talk_title,
        "rvprop": "ids|timestamp|content",
        "rvslots": "main",
        "rvlimit": "50",  # the API caps requests that include content at 50 revisions per call
    }

    revisions = []
    cont = True

    while cont:
        data = request_api(params)

        pages = data["query"]["pages"]
        page = next(iter(pages.values()))

        if "revisions" in page:
            for rev in page["revisions"]:
                revisions.append({
                    "rev_id": rev["revid"],
                    "timestamp": rev["timestamp"],
                    "content": rev["slots"]["main"].get("*", ""),
                })

        if "continue" in data:
            params.update(data["continue"])
        else:
            cont = False

        time.sleep(SLEEP)

    # sort from oldest → newest
    revisions.sort(key=lambda r: r["timestamp"])
    return revisions

# Regex patterns for WikiProject templates, quality, and importance
WIKIPROJECT_RE = re.compile(r"\{\{[Ww]ikiProject [^|}]+(?:\|[^}]+)?\}\}")
CLASS_RE = re.compile(r"class\s*=\s*([A-Za-z]+)", re.IGNORECASE)
IMPORTANCE_RE = re.compile(r"importance\s*=\s*([A-Za-z]+)", re.IGNORECASE)

VITAL_RE = re.compile(r"\{\{[Vv]ital article(?:\|level\s*=\s*(\d+))?", re.IGNORECASE)


def extract_metadata(wikitext):
    """Extract class, importance, vital-level metadata from a talk page revision."""
    
    class_rating = None
    importance_rating = None
    vital_level = None

    # 1. Parse WikiProject templates
    for template in WIKIPROJECT_RE.findall(wikitext):
        class_match = CLASS_RE.search(template)
        if class_match:
            class_rating = class_match.group(1).upper()

        imp_match = IMPORTANCE_RE.search(template)
        if imp_match:
            importance_rating = imp_match.group(1).capitalize()

    # 2. Parse Vital Article level
    vital_match = VITAL_RE.search(wikitext)
    if vital_match:
        vital_level = vital_match.group(1) or "yes"

    return {
        "class": class_rating,
        "importance": importance_rating,
        "vital": vital_level,
    }

def build_metadata_timeline(talk_revisions):
    """
    Returns a list of metadata changes:
    [
        {
            "timestamp": "...",
            "rev_id": ...,
            "class": "C",
            "importance": "High",
            "vital": "4"
        },
        ...
    ]
    Only stores metadata when it changes.
    """
    timeline = []
    last_state = {"class": None, "importance": None, "vital": None}

    for rev in talk_revisions:
        meta = extract_metadata(rev["content"])

        if meta != last_state:
            timeline.append({
                "timestamp": rev["timestamp"],
                "rev_id": rev["rev_id"],
                **meta
            })
            last_state = meta.copy()

    return timeline

def extract_article_metadata_timeline(article_title):
    print(f"\n=== Fetching Talk Page revisions for: {article_title} ===")
    talk_revs = fetch_talk_revisions(article_title)
    print(f"Fetched {len(talk_revs)} talk revisions.")

    print("=== Parsing metadata changes ===")
    timeline = build_metadata_timeline(talk_revs)

    print(f"Metadata change points: {len(timeline)}")
    return timeline
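
Because the timeline stores only change points, the state of an article at an arbitrary date is the last entry at or before that date. A minimal sketch of such a lookup follows, assuming entries shaped like the output of `build_metadata_timeline` and sorted by ISO 8601 timestamp. `state_at` is a hypothetical helper and the sample timeline is made up for illustration:

```python
from bisect import bisect_right


def state_at(timeline, when):
    """Return the timeline entry in force at `when` (ISO 8601 UTC string), or None."""
    # ISO 8601 timestamps in the same format and timezone sort lexicographically.
    stamps = [entry["timestamp"] for entry in timeline]
    pos = bisect_right(stamps, when)
    return timeline[pos - 1] if pos else None


# Made-up change points, for illustration only.
sample_timeline = [
    {"timestamp": "2010-01-01T00:00:00Z", "rev_id": 1,
     "class": "START", "importance": None, "vital": None},
    {"timestamp": "2015-06-01T00:00:00Z", "rev_id": 2,
     "class": "B", "importance": "High", "vital": None},
]

print(state_at(sample_timeline, "2012-03-15T12:00:00Z")["class"])  # START
print(state_at(sample_timeline, "2009-01-01T00:00:00Z"))           # None
```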

In [34]:
import requests
import pandas as pd
import time

rows = []  # will hold one flattened JSON row per article

for t in control_articles:
    url = f"https://xtools.wmcloud.org/api/page/pageinfo/en.wikipedia.org/{t}"
    params = {"format": "json"}
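    headers = {
        "accept": "application/json",
        "User-Agent": "DH_Project/1.0 (maxime.garambois@epfl.ch)",  # same identification as the other cells
    }

    response = requests.get(url, params=params, headers=headers, timeout=10)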

    if response.status_code == 200:
        data = response.json()
        row = pd.json_normalize(data)

        # add the article title (helps debugging later)
        row["article"] = t

        rows.append(row)

    else:
        print(f"Error for {t}: {response.status_code}")

    time.sleep(0.5)  # avoid rate limit

# Final combined dataframe
df = pd.concat(rows, ignore_index=True)
df.to_csv('../datas/interim/Policy Analysis/xtools_metadata_control_set.csv', index=False)

In [8]:
import sys
sys.path.append("..")  # make the project-level `files` package importable
from files.wp_api_policy import get_article_protection_status

for article in articles:
    print(f"name : {article}")
    print(get_article_protection_status(article))
    print(50*'-')
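
`get_article_protection_status` is imported from `files/wp_api_policy.py`, which is not shown in this notebook. As a rough sketch of what such a lookup could involve (not the project's actual implementation), the MediaWiki API exposes current protection settings through `prop=info` with `inprop=protection`. The function names below are hypothetical, and the sketch uses only the standard library so that it runs without extra dependencies:

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
UA = {"User-Agent": "DH_Project/1.0 (maxime.garambois@epfl.ch)"}


def summarize_protection(page):
    """Turn one page object from the API into {action: (level, expiry)}.

    If an action is listed more than once, the last entry wins.
    """
    return {
        p["type"]: (p["level"], p["expiry"])
        for p in page.get("protection", [])
    }


def fetch_protection(title):
    """Query the current protection of `title` (performs a network call)."""
    query = urllib.parse.urlencode({
        "action": "query",
        "titles": title,
        "prop": "info",
        "inprop": "protection",
        "format": "json",
        "formatversion": "2",
    })
    req = urllib.request.Request(f"{API_URL}?{query}", headers=UA)
    with urllib.request.urlopen(req, timeout=10) as resp:
        data = json.load(resp)
    return summarize_protection(data["query"]["pages"][0])


# Offline illustration with a hand-written page object in the API's shape.
sample_page = {
    "title": "Example",
    "protection": [{"type": "edit", "level": "autoconfirmed", "expiry": "infinity"}],
}
print(summarize_protection(sample_page))  # {'edit': ('autoconfirmed', 'infinity')}
```

An unprotected page yields an empty dictionary, which keeps the result easy to store alongside the other per-article columns.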