<a href="https://colab.research.google.com/github/ajose3-ui/Wikipedia-Article-Finder/blob/main/WIKIPROJECT_THING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

General tips for finding articles:


*   [Template for choosing an article](https://en.wikipedia.org/wiki/Template:Dashboard.wikiedu.org_choose_article).
*   [Template for evaluating an article](https://en.wikipedia.org/wiki/Template:Dashboard.wikiedu.org_evaluate_article).
*   Chose a topic that you are intrested in learning more about or are already passionate about. **Strive to pick an article about something that is meaningful to you**.
*   An article can only ever be as good as its sources. **Make sure you can find 3-5 high-quality sources** (literature reviews, textbooks, publications, books created by experts and published by a reputable institution) for any article you are seriously considering working on. You want to find secondary sources that focus on your topic and that are published by reputable institutions that do not have a vested interest in reporting information in a biased way.
*   **Talk pages may give you an idea of what changes need to be made to an article you are interested in**.
*   **You can create a [watchlist](https://youtu.be/Pa8Htsj3Gxg) to keep track of all the pages you are interested in**.
*   Keep in mind the fact that articles on the histories of LGBTQ+ individuals and of Black figures in STEM are relatively lacking, while articles on warfare and sports tend to be among the highest quality articles Wikipedia has to offer. This does not have to be intentional to be harmful.
*   Try to avoid editing “Good,” “A-Class,” or “Featured” articles (it is harder to improve them). **Try to pick articles that are of lower qualities but are of great importance to Wikiprojects or get a low of views**.
*  **Signs that articles that need improvement**: **imbalanced sections** (fringe aspects are emphasized while more important one have little information about them), **writen with little neutrality**, **few refrences are listed**, **the structure has no flow**, **the intro is hard to follow and lacks detail or accuracy**.
*   **Avoid controversial topics**. A list of them are provided at the bottom of [this page](https://en.wikipedia.org/wiki/Wikipedia:Contentious_topics).
*   **Avoid conflicts of interest**. Do not write about your program or the professor or heavily rely on their works for your project.
*   **Scroll to the bottom of an article you are interested in to check out its categories** ([this](https://en.wikipedia.org/wiki/Category:Academic_disciplines) might be a good place to start). Clicking on these categories takes you to related articles and subcategories.
*   When you go to the talk page of an article you will often see a colorful that tell you the [Wikiprojects](https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory) it is associated with and how important it is to those Wikiproject. **Try to work on articles of mid or higher importance**.
*   **You can use [article finder](https://dashboard.wikiedu.org/article_finder) to find articles to edit. If you click on “Show Options,” you can pick the minimum views and maximum article quality you want for your results**. You can also decide if you want to search by category or key word. Additionally, if you scroll down to the bottom of the page, you can load more than just 50 results. Try to look for low quality articles that  get a lot of views.
*   [This is a super helpful page](https://en.wikipedia.org/wiki/Wikipedia:Student_assignments) if you want to learn more about how to successfully complete this project in general. They have a good section on choosing an article that is relatively concise. Please read the section if you are able to.
*   [This is a great article](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Words_to_watch) to keep in mind when looking for issues with the language used in articles. Avoid making these mistakes.
*   [This page](https://en.wikipedia.org/wiki/Wikipedia:Task_Center) guides you to articles that need fact-checking, categorization, and copy editing. This might be a good place to go to find an article to edit.


Other, even more general stuff:
*   This offers a list of [extra trainings](https://en.wikipedia.org/wiki/Help:Introduction) if you are still confused about some topic covered in your trainings.
*   Scroll down to the bottom of [this page](https://en.wikipedia.org/wiki/Help:Contents) to search for answers to [frequently asked questions](https://en.wikipedia.org/wiki/Wikipedia:FAQ) about Wikipedia and look at the help desk archives. There are also a lot of how to guides located here.
*   If you have questions about using and editing wikipedia you can go [here](https://en.wikipedia.org/wiki/Wikipedia:Teahouse) or [here](https://en.wikipedia.org/wiki/Wikipedia:Help_desk). You can also search for help pages and archived questions here. On [this page](https://en.wikipedia.org/wiki/Wikipedia:Teahouse/Suggestions), you can get personalized suggestions for articles delivered right to your user talk page.
*   [This is another good resource](https://en.wikipedia.org/wiki/Wikipedia:The_Wikipedia_Adventure) for learning how to edit a page.
*   [This is the wikipedia help menu](https://en.wikipedia.org/wiki/Help:Menu), and [this is the help directory](https://en.wikipedia.org/wiki/Help:Directory).
*   [What to strive for when editing an article](https://en.wikipedia.org/wiki/Wikipedia:The_perfect_article).
*   [This short page](https://en.wikipedia.org/wiki/Help:Your_first_article) walks you through everything you need to know when editing your first article.
*   [This instruction manual](https://en.wikipedia.org/wiki/Wikipedia:Instructional_material) has a list of videos, books, and short tutorials that might help you with any given thing you are struggling with. Whether you want to do a deep dive or brush up on something in a few minutes, this is a good resource.








Install these so that everything works properly:

In [None]:
# @title
!pip install textblob vaderSentiment textstat
!pip install itables

Search for topics that intrest you (for detailed table, scroll all the way to the right and click the top button to do an advanced search):

In [36]:
import requests
import pandas as pd
import numpy as np
from datetime import datetime, timedelta, timezone
from urllib.parse import quote
import time
import re
from IPython.display import display
from concurrent.futures import ThreadPoolExecutor, as_completed
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import textstat

DETAILED_TABLE = True #@param {type:"boolean"}
DETAILED_DATA = True #@param {type:"boolean"}

EMAIL = "ajose3@hawk.illinoistech.edu" #@param {type:"string"}
HEADERS = {"User-Agent": f"Wikiproject_article_finder/1.0 (Educational research; {EMAIL}; Python/requests)"}
SEARCH_QUERY = "surfing" #@param {type:"string"}
MAX_ARTICLES = "30" #@param {type:"string"}

try:
    MAX_ARTICLES = int(MAX_ARTICLES)
except (ValueError, TypeError):
    MAX_ARTICLES = None

LANG = "en"
REQUEST_DELAY = 0.03
CHECKPOINT_INTERVAL = 100
MAX_WORKERS = 3

def search_wikipedia_articles(query, max_articles=None):
    url = "https://en.wikipedia.org/w/api.php"
    titles = []
    offset = 0

    while True:
        params = {"action": "query", "list": "search", "srsearch": query, "srlimit": 50, "sroffset": offset, "format": "json"}
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)

        if not r.headers.get("Content-Type", "").startswith("application/json"):
            print("Non-JSON response, retrying...")
            time.sleep(2)
            continue

        data = r.json()
        results = data.get("query", {}).get("search", [])
        if not results:
            break

        for result in results:
            title = result["title"]
            titles.append(title)
            if max_articles and len(titles) >= max_articles:
                return titles[:max_articles]

        if "continue" in data:
            offset = data["continue"]["sroffset"]
            time.sleep(REQUEST_DELAY)
        else:
            break

    print(f"Total articles found: {len(titles)}")
    return titles

def get_article_class(title):
    url = "https://en.wikipedia.org/w/api.php"

    try:
        params = {"action": "query", "titles": f"Talk:{title}", "prop": "revisions", "rvprop": "content", "rvslots": "main", "format": "json"}
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)

        if not r.headers.get("Content-Type", "").startswith("application/json"):
            return "Unassessed"

        data = r.json()

        if "query" not in data or "pages" not in data["query"]:
            return "Unassessed"

        page = next(iter(data.get("query", {}).get("pages", {}).values()))

        if int(page.get("pageid", -1)) <= 0:
            return "Unassessed"

        revisions = page.get("revisions", [])
        if not revisions:
            return "Unassessed"

        talk_text = revisions[0].get("slots", {}).get("main", {}).get("*", "")

        if not talk_text or len(talk_text.strip()) == 0:
            return "Unassessed"

        talk_lower = talk_text.lower()

        class_mapping = {
            'fa': 'FA', 'featured': 'FA', 'fa-class': 'FA', 'featured article': 'FA',
            'fl': 'FL', 'fl-class': 'FL', 'featured list': 'FL',
            'a': 'A', 'a-class': 'A',
            'ga': 'GA', 'good': 'GA', 'ga-class': 'GA', 'good article': 'GA',
            'b': 'B', 'b-class': 'B',
            'c': 'C', 'c-class': 'C',
            'start': 'START', 'start-class': 'START',
            'stub': 'STUB', 'stub-class': 'STUB',
            'list': 'LIST', 'list-class': 'LIST'
        }

        class_patterns = [
            r'\|\s*class\s*=\s*([A-Za-z\-]+)',
            r'\|\s*currentstatus\s*=\s*([A-Za-z\-]+)',
            r'{{[Cc]lass\|([A-Za-z\-]+)',
            r'{{WikiProject[^}]*\|\s*class\s*=\s*([A-Za-z\-]+)',
            r'{{[Aa]rticle\s*history[^}]*\|currentstatus\s*=\s*([A-Za-z\-]+)',
            r'{{[Rr]ated\s+as\s+([A-Za-z\-]+)',
            r'\|\s*1\s*=\s*([A-Za-z\-]+)(?:\-class)?',
            r'class\s*[:=]\s*([A-Za-z\-]+)',
            r'quality\s*[:=]\s*([A-Za-z\-]+)',
        ]

        found_classes = []
        for pattern in class_patterns:
            matches = re.findall(pattern, talk_text, re.IGNORECASE)
            for match in matches:
                clean = match.strip().lower().replace('-class', '').replace('_class', '').strip()
                if clean in class_mapping:
                    found_classes.append(class_mapping[clean])

        if found_classes:
            priority = ['FA', 'FL', 'GA', 'A', 'B', 'C', 'START', 'STUB', 'LIST']
            for cls in priority:
                if cls in found_classes:
                    return cls
            return found_classes[0]

        specific_markers = {
            'this is a featured article': 'FA',
            'featured article star': 'FA',
            'this is a good article': 'GA',
            'good article': 'GA',
            'ga-icon': 'GA'
        }

        for marker, cls in specific_markers.items():
            if marker in talk_lower:
                return cls

        badge_patterns = [
            (r'{{featured article}}', 'FA'),
            (r'{{fa}}', 'FA'),
            (r'{{good article}}', 'GA'),
            (r'{{ga}}', 'GA'),
        ]

        for pattern, cls in badge_patterns:
            if re.search(pattern, talk_text, re.IGNORECASE):
                return cls

        if 'wikiproject' not in talk_lower:
            return "Unassessed"

        quality_indicators = {
            'stub': 'STUB',
            'start': 'START',
            'b-class': 'B',
            'c-class': 'C'
        }

        for indicator, cls in quality_indicators.items():
            if indicator in talk_lower:
                return cls

        return "Unassessed"

    except requests.exceptions.Timeout:
        return "Unassessed"
    except requests.exceptions.RequestException:
        return "Unassessed"
    except Exception as e:
        return "Unassessed"

def get_all_article_data(title):
    url = "https://en.wikipedia.org/w/api.php"
    params = {"action": "query", "titles": title, "redirects": True, "prop": "revisions|extracts|images|categories", "rvprop": "content|timestamp", "rvslots": "main", "explaintext": True, "exlimit": 1, "imlimit": 500, "cllimit": 500, "format": "json"}

    try:
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if not r.headers.get("Content-Type", "").startswith("application/json"):
            return None

        data = r.json()
        page = next(iter(data["query"]["pages"].values()))

        wikitext = page.get("revisions", [{}])[0].get("slots", {}).get("main", {}).get("*", "")
        plaintext = page.get("extract", "")
        timestamp = page.get("revisions", [{}])[0].get("timestamp", "")
        images = len(page.get("images", []))
        categories = len(page.get("categories", []))

        return {"wikitext": wikitext, "plaintext": plaintext, "timestamp": timestamp, "images": images, "categories": categories}
    except Exception as e:
        print(f"Error fetching data for {title}: {e}")
        return None

def analyze_from_cached_data(title, cached_data):
    if not cached_data:
        return get_empty_metrics()

    wikitext = cached_data["wikitext"]
    plaintext = cached_data["plaintext"]
    timestamp = cached_data["timestamp"]
    metrics = {}

    days_since_edit = 0
    if timestamp:
        last_edit = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
        days_since_edit = round((datetime.now(timezone.utc).replace(tzinfo=None) - last_edit).days)

    word_count = len(plaintext.split()) if plaintext else 0
    citation_needed_count = wikitext.lower().count("citation needed")

    metrics["Days Since Last Edit"] = days_since_edit
    metrics["Word Count"] = word_count
    metrics["Citations Needed"] = citation_needed_count
    metrics["Images"] = cached_data["images"]
    metrics["Categories"] = cached_data["categories"]

    source_metrics = analyze_source_quality_from_text(wikitext)
    metrics.update(source_metrics)
    neutrality_metrics = analyze_neutrality_from_text(plaintext)
    metrics.update(neutrality_metrics)
    readability_metrics = analyze_readability_from_text(plaintext)
    metrics.update(readability_metrics)
    sentiment_metrics = detect_sentiment_bias_from_text(plaintext)
    metrics.update(sentiment_metrics)
    metrics["Citations"] = get_citation_count_from_text(wikitext)

    return metrics

def analyze_source_quality_from_text(wikitext):
    cite_patterns = {"journal": r'{{cite journal', "book": r'{{cite book', "web": r'{{cite web', "news": r'{{cite news'}
    source_types = {k: len(re.findall(v, wikitext, re.IGNORECASE)) for k, v in cite_patterns.items()}
    total_typed_sources = sum(source_types.values())
    years = []
    current_year = datetime.now().year

    citation_blocks = re.findall(r'{{cite[^}]+}}', wikitext, re.IGNORECASE | re.DOTALL)
    for block in citation_blocks:
        year_matches = re.findall(r'\|(?:year|date|publication-date|access-date)\s*=\s*[^\d]*(\d{4})', block, re.IGNORECASE)
        for year_str in year_matches:
            year = int(year_str)
            if 1800 <= year <= current_year:
                years.append(year)
                break

    ref_blocks = re.findall(r'<ref[^>]*>(.*?)</ref>', wikitext, re.IGNORECASE | re.DOTALL)
    for ref in ref_blocks:
        year_patterns = [r'\((\d{4})\)', r'[,\s](\d{4})[,\.\s]', r'"(\d{4})"', r'(\d{4})-\d{2}-\d{2}']
        for pattern in year_patterns:
            year_matches = re.findall(pattern, ref)
            if year_matches:
                year = int(year_matches[0])
                if 1800 <= year <= current_year:
                    years.append(year)
                    break

    short_footnotes = re.findall(r'{{(?:sfn|harvnb|harv)[^}]*\|[^}]*?(\d{4})', wikitext, re.IGNORECASE)
    for year_str in short_footnotes:
        year = int(year_str)
        if 1800 <= year <= current_year:
            years.append(year)

    citation_templates = re.findall(r'{{citation[^}]+}}', wikitext, re.IGNORECASE | re.DOTALL)
    for block in citation_templates:
        year_matches = re.findall(r'\|(?:year|date|publication-date)\s*=\s*[^\d]*(\d{4})', block, re.IGNORECASE)
        for year_str in year_matches:
            year = int(year_str)
            if 1800 <= year <= current_year:
                years.append(year)
                break

    recent_sources = len([y for y in years if current_year - y <= 5]) if years else 0
    avg_source_age = (current_year - sum(years) / len(years)) if years else 0
    quality_score = 0

    if total_typed_sources > 0:
        scholarly_ratio = (source_types["journal"] + source_types["book"]) / total_typed_sources
        quality_score += scholarly_ratio * 50

    source_diversity = len([v for v in source_types.values() if v > 0])
    quality_score += source_diversity * 10

    if avg_source_age < 10:
        quality_score += 25
    elif avg_source_age < 20:
        quality_score += 15

    return {"Journal Sources": source_types["journal"], "Book Sources": source_types["book"], "Web Sources": source_types["web"], "News Sources": source_types["news"], "Avg Source Age": round(avg_source_age, 1), "Recent Sources (5yr)": recent_sources, "Source Quality Score": round(min(quality_score, 100), 1)}

def analyze_neutrality_from_text(text):
    if not text:
        return {"Hedging Words": 0, "Peacock Words": 0, "Weasel Words": 0, "Value Judgments": 0, "Neutrality Score": 100}

    text_lower = text.lower()
    peacock_terms = ["legendary", "iconic", "acclaimed", "prestigious", "renowned", "celebrated", "world-class", "premier", "leading", "foremost", "groundbreaking", "seminal", "pivotal", "revolutionary", "extraordinary", "remarkable", "outstanding", "exceptional", "unparalleled", "best"]
    peacock_count = sum(text_lower.count(term) for term in peacock_terms)

    weasel_patterns = [r'\bsome (people|experts|scholars|critics|observers|say|believe)', r'\bmany (believe|argue|claim|suggest|think|feel)', r'\bit (is said|has been said|is believed|is widely|is commonly)', r'\bmost (people|experts|scholars)', r'\bwidely (regarded|considered|accepted|believed)', r'\boften (considered|regarded|viewed)', r'\bgenerally (accepted|believed|considered)']
    weasel_count = sum(len(re.findall(pattern, text_lower)) for pattern in weasel_patterns)

    hedging_words = ["perhaps", "possibly", "maybe", "might", "could", "may", "seemingly"]
    hedging_count = sum(text_lower.count(word) for word in hedging_words)

    value_words = ["unfortunately", "fortunately", "clearly", "obviously", "naturally", "of course", "undoubtedly", "certainly", "arguably", "notably", "importantly", "surprisingly", "interestingly", "regrettably"]
    value_count = sum(text_lower.count(word) for word in value_words)

    word_count = len(text.split())
    neutrality_score = 100

    if word_count > 0:
        neutrality_score -= (peacock_count / word_count * 1000) * 10
        neutrality_score -= (weasel_count / word_count * 1000) * 15
        neutrality_score -= (value_count / word_count * 1000) * 8

    neutrality_score = max(0, min(100, neutrality_score))

    return {"Hedging Words": hedging_count, "Peacock Words": peacock_count, "Weasel Words": weasel_count, "Value Judgments": value_count, "Neutrality Score": round(neutrality_score, 1)}

def analyze_readability_from_text(text):
    if not text or len(text) < 100:
        return {"Flesch-Kincaid Grade": 0, "Reading Level": "Unknown"}

    try:
        flesch_grade = textstat.flesch_kincaid_grade(text)
        flesch_ease = textstat.flesch_reading_ease(text)

        if flesch_ease >= 90:
            level = "Elementary (5th grade)"
        elif flesch_ease >= 80:
            level = "Middle School (6-7th)"
        elif flesch_ease >= 70:
            level = "High School (8-9th)"
        elif flesch_ease >= 60:
            level = "High School (10-12th)"
        elif flesch_ease >= 50:
            level = "College"
        elif flesch_ease >= 30:
            level = "College Graduate"
        else:
            level = "Professional/Academic"

        return {"Flesch-Kincaid Grade": round(flesch_grade, 1), "Reading Level": level}
    except:
        return {"Flesch-Kincaid Grade": 0, "Reading Level": "Error"}

def detect_sentiment_bias_from_text(text):
    if not text:
        return {"Polarity": 0, "Subjectivity": 0, "VADER Compound": 0, "Sentiment": "Neutral"}

    try:
        blob = TextBlob(text)
        polarity = blob.sentiment.polarity
        subjectivity = blob.sentiment.subjectivity
        analyzer = SentimentIntensityAnalyzer()
        vader_scores = analyzer.polarity_scores(text)

        if abs(polarity) < 0.1 and subjectivity < 0.3:
            sentiment = "Neutral & Objective"
        elif abs(polarity) < 0.1:
            sentiment = "Neutral but Subjective"
        elif polarity > 0.2:
            sentiment = "Positive Bias Detected"
        elif polarity < -0.2:
            sentiment = "Negative Bias Detected"
        else:
            sentiment = "Slight Bias"

        return {"Polarity": round(polarity, 3), "Subjectivity": round(subjectivity, 3), "VADER Compound": round(vader_scores['compound'], 3), "Sentiment": sentiment}
    except:
        return {"Polarity": 0, "Subjectivity": 0, "VADER Compound": 0, "Sentiment": "Error"}

def get_citation_count_from_text(wikitext):
    if not wikitext:
        return 0

    named_refs = set()
    unnamed_count = 0
    ref_pattern = r'<ref(?:\s+[^>]*)?>'
    all_refs = re.findall(ref_pattern, wikitext, re.IGNORECASE | re.DOTALL)

    for ref in all_refs:
        if ref.strip().endswith('/>'):
            continue
        name_match = re.search(r'name\s*=\s*["\']([^"\']+)["\']', ref, re.IGNORECASE)
        if name_match:
            named_refs.add(name_match.group(1))
        else:
            unnamed_count += 1

    ref_count = len(named_refs) + unnamed_count
    sfn_count = len(re.findall(r'\{\{sfn[a-z]*\|', wikitext, re.IGNORECASE))
    harv_count = len(re.findall(r'\{\{harv[a-z]*\|', wikitext, re.IGNORECASE))
    r_count = len(re.findall(r'\{\{rp?\|', wikitext, re.IGNORECASE))
    efn_count = len(re.findall(r'\{\{efn[a-z]*\|', wikitext, re.IGNORECASE))
    footnote_count = sfn_count + efn_count
    return max(ref_count, footnote_count, harv_count, r_count)

def get_remaining_data(title):
    url = "https://en.wikipedia.org/w/api.php"
    sections = ""
    try:
        params = {"action": "parse", "page": title, "prop": "sections", "redirects": 1, "format": "json"}
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if r.headers.get("Content-Type", "").startswith("application/json"):
            parse_data = r.json().get("parse", {})
            if parse_data:
                section_list = parse_data.get("sections", [])
                sections = ", ".join(s["line"] for s in section_list)
    except:
        pass

    talk_page_size = 0
    try:
        params = {"action": "query", "titles": f"Talk:{title}", "redirects": True, "prop": "revisions", "rvprop": "size", "rvlimit": 1, "format": "json"}
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if r.headers.get("Content-Type", "").startswith("application/json"):
            data = r.json()
            pages = data.get("query", {}).get("pages", {})
            page = next(iter(pages.values()), {})
            if int(page.get("pageid", -1)) > 0:
                talk_page_size = page.get("revisions", [{}])[0].get("size", 0)
    except:
        pass

    num_editors, recent_edits = 0, 0
    try:
        params = {"action": "query", "titles": title, "redirects": True, "prop": "revisions", "rvprop": "timestamp|user", "rvlimit": 500, "format": "json"}
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if r.headers.get("Content-Type", "").startswith("application/json"):
            data = r.json()
            pages = data.get("query", {}).get("pages", {})
            page = next(iter(pages.values()), {})
            revisions = page.get("revisions", [])
            unique_editors = set()
            one_year_ago = datetime.now(timezone.utc) - timedelta(days=365)

            for rev in revisions:
                user = rev.get("user", "")
                if user:
                    unique_editors.add(user)
                timestamp_str = rev.get("timestamp", "")
                if timestamp_str:
                    timestamp = datetime.strptime(timestamp_str, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
                    if timestamp >= one_year_ago:
                        recent_edits += 1
            num_editors = len(unique_editors)
    except:
        pass

    views = 0
    try:
        end = datetime.now(timezone.utc).replace(tzinfo=None)
        start = end - timedelta(days=90)
        encoded_title = quote(title.replace(' ', '_'))
        pv_url = f"https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/{encoded_title}/daily/{start:%Y%m%d}/{end:%Y%m%d}"
        r = requests.get(pv_url, headers=HEADERS, timeout=30)
        if r.headers.get("Content-Type", "").startswith("application/json"):
            data = r.json()
            views = sum(d["views"] for d in data.get("items", []))
    except:
        pass

    return {"Section Names": sections, "Talk Page Size": talk_page_size, "Total Editors": num_editors, "Edits Last Year": recent_edits, "Last 3 Months Views": views}

def get_empty_metrics():
    return {"Days Since Last Edit": 0, "Word Count": 0, "Section Names": "", "Citations": 0, "Citations Needed": 0, "Images": 0, "Categories": 0, "Total Editors": 0, "Edits Last Year": 0, "Talk Page Size": 0, "Last 3 Months Views": 0, "Journal Sources": 0, "Book Sources": 0, "Web Sources": 0, "News Sources": 0, "Avg Source Age": 0, "Recent Sources (5yr)": 0, "Source Quality Score": 0, "Hedging Words": 0, "Peacock Words": 0, "Weasel Words": 0, "Value Judgments": 0, "Neutrality Score": 0, "Flesch-Kincaid Grade": 0, "Reading Level": "Unknown", "Polarity": 0, "Subjectivity": 0, "VADER Compound": 0, "Sentiment": "Unknown", "Article Class": "Unknown"}

def process_single_article(title):
    try:
        cached_data = get_all_article_data(title)
        metrics = analyze_from_cached_data(title, cached_data)
        remaining = get_remaining_data(title)
        metrics.update(remaining)
        metrics["Article"] = title
        article_class = get_article_class(title)
        metrics["Article Class"] = article_class
        time.sleep(REQUEST_DELAY)
        return metrics
    except Exception as e:
        print(f"Error processing {title}: {e}")
        empty = get_empty_metrics()
        empty["Article"] = title
        return empty

def percentile_rank(series):
    return series.rank(pct=True) * 100

def calculate_scores(df):
    pd.set_option('display.width', 1000)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)

    df['Days Since Edit Percentile'] = 100 - percentile_rank(df['Days Since Last Edit'])
    df['Edits Last Year Percentile'] = percentile_rank(df['Edits Last Year'])
    df['Total Editors Percentile'] = percentile_rank(df['Total Editors'])
    df['Talk Page Percentile'] = percentile_rank(df['Talk Page Size'])
    df['Views Percentile'] = percentile_rank(df['Last 3 Months Views'])
    df['Word Count Percentile'] = percentile_rank(df['Word Count'])
    df['Images Percentile'] = percentile_rank(df['Images'])
    df['Categories Percentile'] = percentile_rank(df['Categories'])
    df['Citations Percentile'] = percentile_rank(df['Citations'])
    df['Citation Needed Percentile'] = 100 - percentile_rank(df['Citations Needed'])
    df['Citation/Word Ratio Percentile'] = percentile_rank(df['Citations'] / df['Word Count'].replace(0, 1))
    df['Source Quality Percentile'] = percentile_rank(df['Source Quality Score'])
    df['Journal Sources Percentile'] = percentile_rank(df['Journal Sources'])
    df['Book Sources Percentile'] = percentile_rank(df['Book Sources'])
    df['Recent Sources Percentile'] = percentile_rank(df['Recent Sources (5yr)'])
    df['Source Age Percentile'] = 100 - percentile_rank(df['Avg Source Age'])
    df['Neutrality Percentile'] = percentile_rank(df['Neutrality Score'])
    df['Hedging Words Percentile'] = 100 - percentile_rank(df['Hedging Words'])
    df['Peacock Words Percentile'] = 100 - percentile_rank(df['Peacock Words'])
    df['Weasel Words Percentile'] = 100 - percentile_rank(df['Weasel Words'])
    df['Value Judgments Percentile'] = 100 - percentile_rank(df['Value Judgments'])
    df['Reading Level Percentile'] = 100 - percentile_rank(df['Flesch-Kincaid Grade'])
    df['Polarity Neutrality Percentile'] = 100 - percentile_rank(df['Polarity'].abs())
    df['Objectivity Percentile'] = 100 - percentile_rank(df['Subjectivity'])

    df['Collaboration Score'] = ((df['Total Editors Percentile'] + df['Talk Page Percentile']) / 2).round().astype(int)
    df['Aliveness Score'] = ((df['Days Since Edit Percentile'] + df['Edits Last Year Percentile']) / 2).round().astype(int)
    df['Popularity Score'] = df['Views Percentile'].round().astype(int)
    df['Quality Score'] = (df['Citation/Word Ratio Percentile'] * 0.25 + df['Images Percentile'] * 0.10 + df['Categories Percentile'] * 0.10 + df['Citation Needed Percentile'] * 0.05 + df['Source Quality Percentile'] * 0.25 + df['Neutrality Percentile'] * 0.15 + df['Objectivity Percentile'] * 0.10).round().astype(int)
    df['Scholarly Source Score'] = (df['Source Quality Percentile'] * 0.30 + df['Journal Sources Percentile'] * 0.30 + df['Book Sources Percentile'] * 0.20 + df['Recent Sources Percentile'] * 0.10 + df['Source Age Percentile'] * 0.10).round().astype(int)
    df['NPOV Score'] = (df['Value Judgments Percentile'] * 0.10 + df['Hedging Words Percentile'] * 0.05 + df['Neutrality Percentile'] * 0.35 + df['Objectivity Percentile'] * 0.30 + df['Peacock Words Percentile'] * 0.10 + df['Weasel Words Percentile'] * 0.10).round().astype(int)
    df['Accessibility Score'] = df['Reading Level Percentile'].round().astype(int)

    df['Wikipedia Link'] = df['Article'].apply(lambda x: f"https://en.wikipedia.org/wiki/{x.replace(' ', '_')}")
    df['IIT Library Link'] = df["Article"].str.replace(' ', '%20').apply(lambda x: f"https://i-share-iit.primo.exlibrisgroup.com/discovery/search?query=any,contains,{x},AND&tab=Everything&search_scope=MyInst_and_CI&sortby=rank&vid=01CARLI_IIT:CARLI_IIT&mfacet=tlevel,include,peer_reviewed,1&lang=en&mode=advanced&offset=0")
    return df



articles = search_wikipedia_articles(SEARCH_QUERY, MAX_ARTICLES)

if not articles:
    print("\nNo articles found.")
else:

    rows = []
    start_time = time.time()

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        future_to_article = {executor.submit(process_single_article, title): title for title in articles}

        for i, future in enumerate(as_completed(future_to_article), 1):
            article = future_to_article[future]
            try:
                result = future.result()
                rows.append(result)
                print(f"[{i}/{len(articles)}] Completed: {article}")

                if i % CHECKPOINT_INTERVAL == 0:
                    df_checkpoint = pd.DataFrame(rows)
                    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                    checkpoint_file = f"{SEARCH_QUERY.replace(' ', '_')}_{timestamp}_checkpoint.csv"
                    df_checkpoint.to_csv(checkpoint_file, index=False)
                    elapsed = time.time() - start_time
                    remaining_time = (elapsed / i) * (len(articles) - i)
                    print(f"Checkpoint saved: {checkpoint_file}")
                    print(f"Elapsed: {elapsed/60:.1f}min | Estimated remaining: {remaining_time/60:.1f}min")

            except Exception as e:
                print(f"Failed: {article} - {e}")

    if rows:
        df_unordered = pd.DataFrame(rows)

        column_order = ["Article", "Article Class", "Last 3 Months Views", "Word Count", "Talk Page Size", "Days Since Last Edit", "Edits Last Year", "Total Editors", "Images", "Citations", "Citations Needed", "Journal Sources", "Book Sources", "Web Sources", "News Sources", "Avg Source Age", "Recent Sources (5yr)", "Source Quality Score", "Hedging Words", "Peacock Words", "Weasel Words", "Value Judgments", "Neutrality Score", "Flesch-Kincaid Grade", "Reading Level", "Polarity", "Subjectivity", "VADER Compound", "Sentiment", "Categories", "Section Names"]

        df = df_unordered[[col for col in column_order if col in df_unordered.columns]]

        df_with_scores = calculate_scores(df.copy())

        percentile_cols = [col for col in df_with_scores.columns if 'Percentile' in col]
        df_final = df_with_scores.drop(columns=percentile_cols)

        final_column_order = [
            "Article",
            "Article Class",
            "Last 3 Months Views",
            "Word Count",
            "Edits Last Year",
            "Total Editors",
            "Citations",
            "Citations Needed",
            "Collaboration Score",
            "Aliveness Score",
            "Popularity Score",
            "Quality Score",
            "Scholarly Source Score",
            "NPOV Score",
            "Neutrality Score",
            "Flesch-Kincaid Grade",
            "Journal Sources",
            "Book Sources",
            "Avg Source Age",
            "Recent Sources (5yr)",
            "Web Sources",
            "News Sources",
            "Reading Level",
            "Accessibility Score",
            "Source Quality Score",
            "Sentiment",
            "Polarity",
            "Subjectivity",
            "VADER Compound",
            "Images",
            "Days Since Last Edit",
            "Talk Page Size",
            "Hedging Words",
            "Peacock Words",
            "Weasel Words",
            "Value Judgments",
            "Categories",
            "Section Names",
            "Wikipedia Link",
            "IIT Library Link",
        ]

        df_final = df_final[[col for col in final_column_order if col in df_final.columns]]

        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"{SEARCH_QUERY.replace(' ', '_')}_{MAX_ARTICLES}_WITH_SCORES.csv"
        df_final.to_csv(filename, index=False)

        elapsed_total = time.time() - start_time
        print(f"\n{'='*60}")
        print(f"Successfully saved {len(df_final)} articles to: {filename}")
        print(f"Total time: {elapsed_total/60:.1f} minutes ({elapsed_total/len(df_final):.2f}s per article)")
        print(f"{'='*60}\n")

        from itables import init_notebook_mode, show
        init_notebook_mode(all_interactive=True)

        print("="*60)
        print("COMPOSITE SCORES CALCULATED")
        print("="*60)
        print("\nScore Definitions:")
        print("- Collaboration: Average of (Total Editors + Talk Page Size)")
        print("- Aliveness: Average of (Days Since Edit [inverted] + Edits Last Year)")
        print("- Popularity: Page Views")
        print("- Quality: Weighted average of content metrics, sources, and neutrality")
        print("- Scholarly Source: Quality and recency of academic sources")
        print("- NPOV Score: Neutral Point of View compliance")
        print("- Accessibility: Reading level")
        print("\nAll scores are on a 0-100 percentile scale.")
        print("="*60)

        print("\nInteractive Data Table")
        print("="*60)
        print("How to use:")
        print("SEARCH: Use the search box to filter across all columns")
        print("SORT: Click column headers to sort (shift+click for multi-column)")
        print("COLUMNS: Click 'Column visibility' button to show/hide columns")
        print("PAGES: Use dropdown to change rows per page (10/25/50/100)")
        print("EXPORT: Click 'CSV' or 'Excel' to download")
        print("="*60)
        print()

        df_print = df_final[["Article",
            "Article Class",
            "Last 3 Months Views",
            "Word Count",
            "Edits Last Year",
            "Citations",
            "Recent Sources (5yr)",
            "Reading Level",
            "Sentiment",
            "Collaboration Score",
            "Aliveness Score",
            "Popularity Score",
            "Quality Score",
            "Scholarly Source Score",
            "NPOV Score",
            "Wikipedia Link",
            "IIT Library Link"]].copy()

        if DETAILED_TABLE == True:
          display(df_final if DETAILED_DATA == True else df_print)
        else:
          show(df_final if DETAILED_DATA == True else df_print, scrollX=True, scrollY="600px", paging=True, lengthMenu=[10, 25, 50, 100], pageLength=25, buttons=['copy', 'csv', 'excel', 'colvis'], order=[[4, 'desc']], columnDefs=[{"className": "dt-left", "targets": "_all"}])

        print(f"\n{'='*60}")
        print("Summary Statistics")
        print(f"{'='*60}")
        numeric_cols = ['Last 3 Months Views', 'Word Count', 'Talk Page Size', 'Citations', 'Days Since Last Edit', 'Source Quality Score', 'Neutrality Score', 'Flesch-Kincaid Grade', 'Polarity', 'Subjectivity', 'Collaboration Score', 'Aliveness Score', 'Popularity Score', 'Quality Score', 'Scholarly Source Score', 'NPOV Score', 'Accessibility Score']
        available_cols = [col for col in numeric_cols if col in df_final.columns]
        print(df_final[available_cols].describe().round(1))

        print(f"\n{'='*60}")
        print("Article Class Distribution")
        print(f"{'='*60}")
        print(df_final['Article Class'].value_counts())

        print(f"\n{'='*60}")
        print("TOP 5 ARTICLES BY EACH SCORE:")
        print(f"{'='*60}")

        print("\nMost Collaborative:")
        print(df_final.nlargest(5, 'Collaboration Score')[['Article', 'Collaboration Score', 'Total Editors', 'Talk Page Size']])

        print("\nMost Alive:")
        print(df_final.nlargest(5, 'Aliveness Score')[['Article', 'Aliveness Score', 'Days Since Last Edit', 'Edits Last Year']])

        print("\nMost Popular:")
        print(df_final.nlargest(5, 'Popularity Score')[['Article', 'Popularity Score', 'Last 3 Months Views']])

        print("\nHighest Quality:")
        print(df_final.nlargest(5, 'Quality Score')[['Article', 'Quality Score', 'Citations', 'Source Quality Score', 'Neutrality Score']])

        print("\nBest Scholarly Sources:")
        print(df_final.nlargest(5, 'Scholarly Source Score')[['Article', 'Scholarly Source Score', 'Journal Sources', 'Book Sources']])

        print("\nMost Neutral (NPOV):")
        print(df_final.nlargest(5, 'NPOV Score')[['Article', 'NPOV Score', 'Neutrality Score']])

        print("\nMost Accessible:")
        print(df_final.nlargest(5, 'Accessibility Score')[['Article', 'Accessibility Score', 'Reading Level', 'Flesch-Kincaid Grade']])

    else:
        print("\nNo data to save")

[1/30] Completed: Surfing (disambiguation)
[2/30] Completed: Surf
[3/30] Completed: Couch surfing
[4/30] Completed: Surfing
[5/30] Completed: Surf music
[6/30] Completed: Surf culture
[7/30] Completed: Surfing at the Summer Olympics
[8/30] Completed: Powder surfing
[9/30] Completed: World Wide Web
[10/30] Completed: History of surfing
[11/30] Completed: Big wave surfing
[12/30] Completed: Car surfing
[13/30] Completed: Train surfing
[14/30] Completed: Snowboarding
[15/30] Completed: Surfing with the Alien
[16/30] Completed: Tanker surfing
[17/30] Completed: Crowd surfing
[18/30] Completed: Glossary of surfing
[19/30] Completed: Elevator surfing
[20/30] Completed: List of train-surfing injuries and deaths
[21/30] Completed: Surfing in Madeira
[22/30] Completed: Shoulder surfing
[23/30] Completed: Shoulder surfing (computer security)
[24/30] Completed: Surfing on a Rocket
[25/30] Completed: Surfing on Sine Waves
[26/30] Completed: World Surf League
[27/30] Completed: Bethany Hamilton
[28

COMPOSITE SCORES CALCULATED

Score Definitions:
- Collaboration: Average of (Total Editors + Talk Page Size)
- Aliveness: Average of (Days Since Edit [inverted] + Edits Last Year)
- Popularity: Page Views
- Quality: Weighted average of content metrics, sources, and neutrality
- Scholarly Source: Quality and recency of academic sources
- NPOV Score: Neutral Point of View compliance
- Accessibility: Reading level

All scores are on a 0-100 percentile scale.

Interactive Data Table
How to use:
SEARCH: Use the search box to filter across all columns
SORT: Click column headers to sort (shift+click for multi-column)
COLUMNS: Click 'Column visibility' button to show/hide columns
PAGES: Use dropdown to change rows per page (10/25/50/100)
EXPORT: Click 'CSV' or 'Excel' to download



0
Loading ITables v2.7.0 from the internet...  (need help?)



Summary Statistics
       Last 3 Months Views  Word Count  Talk Page Size  Citations  Days Since Last Edit  Source Quality Score  Neutrality Score  Flesch-Kincaid Grade  Polarity  Subjectivity  Collaboration Score  Aliveness Score  Popularity Score  Quality Score  Scholarly Source Score  NPOV Score  Accessibility Score
count                 30.0        30.0            30.0       30.0                  30.0                  30.0              30.0                  30.0      30.0          30.0                 30.0             30.0              30.0           30.0                    30.0        30.0                 30.0
mean               25116.0      1857.6          5765.2       53.9                 162.0                  47.5              90.1                  14.1       0.1           0.4                 51.6             50.0              51.7           51.1                    51.4        49.4                 48.3
std                68410.8      2397.7          7313.2      123.8         

# Ignore these:

In [None]:
import requests
import pandas as pd
from datetime import datetime, timedelta, timezone
from urllib.parse import quote
import time
import re
from IPython.display import display
from concurrent.futures import ThreadPoolExecutor, as_completed
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import textstat

EMAIL = "ajose3@hawk.illinoistech.edu" #@param {type:"string"}

HEADERS = {
    "User-Agent": f"Wikiproject_article_finder/1.0 (Educational research; {EMAIL}; Python/requests)"
}

SEARCH_QUERY = "feminist theory" #@param {type:"string"}
MAX_ARTICLES = "10" #@param {type:"string"}

try:
  MAX_ARTICLES = int(MAX_ARTICLES)
except (ValueError, TypeError):
  MAX_ARTICLES = None

LANG = "en"
REQUEST_DELAY = 0.03
CHECKPOINT_INTERVAL = 100
MAX_WORKERS = 3

def search_wikipedia_articles(query, max_articles=None):
    """Search Wikipedia for articles containing specific keywords."""
    url = "https://en.wikipedia.org/w/api.php"
    print(f"Searching for articles related to: '{query}'")

    titles = []
    offset = 0

    while True:
        params = {
            "action": "query",
            "list": "search",
            "srsearch": query,
            "srlimit": 50,
            "sroffset": offset,
            "format": "json"
        }

        r = requests.get(url, params=params, headers=HEADERS, timeout=30)

        if not r.headers.get("Content-Type", "").startswith("application/json"):
            print("Non-JSON response, retrying...")
            time.sleep(2)
            continue

        data = r.json()
        results = data.get("query", {}).get("search", [])

        if not results:
            break

        print(f"  Found {len(results)} results at offset {offset}")

        for result in results:
            title = result["title"]
            titles.append(title)

            if max_articles and len(titles) >= max_articles:
                print(f"✓ Reached limit of {max_articles} articles")
                return titles[:max_articles]

        if "continue" in data:
            offset = data["continue"]["sroffset"]
            time.sleep(REQUEST_DELAY)
        else:
            break

    print(f"✓ Total articles found: {len(titles)}")
    return titles

def get_article_class(title):
    """Get Wikipedia article quality class (FA, GA, B, C, Start, Stub, etc.)"""
    url = "https://en.wikipedia.org/w/api.php"

    try:
        # Get talk page content
        params = {
            "action": "query",
            "titles": f"Talk:{title}",
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "format": "json"
        }

        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if not r.headers.get("Content-Type", "").startswith("application/json"):
            return "Unknown"

        data = r.json()
        page = next(iter(data.get("query", {}).get("pages", {}).values()))

        # Check if talk page exists
        if int(page.get("pageid", -1)) <= 0:
            return "Unassessed"

        # Get talk page wikitext
        talk_text = page.get("revisions", [{}])[0].get("slots", {}).get("main", {}).get("*", "")

        if not talk_text:
            return "Unassessed"

        # Look for WikiProject quality assessments
        # Common patterns: {{WikiProject|class=FA}}, {{class|FA}}, |class=FA
        class_patterns = [
            r'\|\s*class\s*=\s*([A-Z][A-Za-z]*)',  # |class=FA or |class=Start
            r'{{[Cc]lass\|([A-Z][A-Za-z]*)',        # {{class|FA}}
            r'{{WikiProject[^}]*\|\s*class\s*=\s*([A-Z][A-Za-z]*)',  # {{WikiProject|class=FA}}
        ]

        for pattern in class_patterns:
            matches = re.findall(pattern, talk_text, re.IGNORECASE)
            if matches:
                article_class = matches[0].upper()

                # Normalize to standard classes
                valid_classes = ['FA', 'FL', 'A', 'GA', 'B', 'C', 'START', 'STUB', 'LIST']

                if article_class in valid_classes:
                    return article_class
                elif article_class in ['FEATURED', 'FA-CLASS']:
                    return 'FA'
                elif article_class in ['GOOD', 'GA-CLASS']:
                    return 'GA'
                elif article_class in ['B-CLASS']:
                    return 'B'
                elif article_class in ['C-CLASS']:
                    return 'C'
                elif article_class in ['START-CLASS']:
                    return 'START'
                elif article_class in ['STUB-CLASS']:
                    return 'STUB'

        return "Unassessed"

    except Exception as e:
        print(f"Error getting class for {title}: {e}")
        return "Unknown"

def get_all_article_data(title):
    """Fetch ALL data for an article in one go to minimize API calls."""
    url = "https://en.wikipedia.org/w/api.php"

    params = {
        "action": "query",
        "titles": title,
        "redirects": True,
        "prop": "revisions|extracts|images|categories",
        "rvprop": "content|timestamp",
        "rvslots": "main",
        "explaintext": True,
        "exlimit": 1,
        "imlimit": 500,
        "cllimit": 500,
        "format": "json"
    }

    try:
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if not r.headers.get("Content-Type", "").startswith("application/json"):
            return None

        data = r.json()
        page = next(iter(data["query"]["pages"].values()))

        wikitext = page.get("revisions", [{}])[0].get("slots", {}).get("main", {}).get("*", "")
        plaintext = page.get("extract", "")
        timestamp = page.get("revisions", [{}])[0].get("timestamp", "")
        images = len(page.get("images", []))
        categories = len(page.get("categories", []))

        return {
            "wikitext": wikitext,
            "plaintext": plaintext,
            "timestamp": timestamp,
            "images": images,
            "categories": categories
        }
    except Exception as e:
        print(f"Error fetching data for {title}: {e}")
        return None

def analyze_from_cached_data(title, cached_data):
    """Run all analyses using cached data"""
    if not cached_data:
        return get_empty_metrics()

    wikitext = cached_data["wikitext"]
    plaintext = cached_data["plaintext"]
    timestamp = cached_data["timestamp"]

    metrics = {}

    days_since_edit = 0
    if timestamp:
        last_edit = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
        days_since_edit = round((datetime.now(timezone.utc).replace(tzinfo=None) - last_edit).days)

    word_count = len(plaintext.split()) if plaintext else 0
    citation_needed_count = wikitext.lower().count("citation needed")

    metrics["Days Since Last Edit"] = days_since_edit
    metrics["Word Count"] = word_count
    metrics["Citations Needed"] = citation_needed_count
    metrics["Images"] = cached_data["images"]
    metrics["Categories"] = cached_data["categories"]

    source_metrics = analyze_source_quality_from_text(wikitext)
    metrics.update(source_metrics)

    neutrality_metrics = analyze_neutrality_from_text(plaintext)
    metrics.update(neutrality_metrics)

    readability_metrics = analyze_readability_from_text(plaintext)
    metrics.update(readability_metrics)

    sentiment_metrics = detect_sentiment_bias_from_text(plaintext)
    metrics.update(sentiment_metrics)

    metrics["Citations"] = get_citation_count_from_text(wikitext)

    return metrics

def analyze_source_quality_from_text(wikitext):
    """Analyze source quality from wikitext - works with all citation formats"""

    cite_patterns = {
        "journal": r'{{cite journal',
        "book": r'{{cite book',
        "web": r'{{cite web',
        "news": r'{{cite news',
    }

    source_types = {k: len(re.findall(v, wikitext, re.IGNORECASE)) for k, v in cite_patterns.items()}
    total_typed_sources = sum(source_types.values())

    years = []
    current_year = datetime.now().year

    citation_blocks = re.findall(r'{{cite[^}]+}}', wikitext, re.IGNORECASE | re.DOTALL)
    for block in citation_blocks:
        year_matches = re.findall(r'\|(?:year|date|publication-date|access-date)\s*=\s*[^\d]*(\d{4})', block, re.IGNORECASE)
        for year_str in year_matches:
            year = int(year_str)
            if 1800 <= year <= current_year:
                years.append(year)
                break

    ref_blocks = re.findall(r'<ref[^>]*>(.*?)</ref>', wikitext, re.IGNORECASE | re.DOTALL)
    for ref in ref_blocks:
        year_patterns = [
            r'\((\d{4})\)',
            r'[,\s](\d{4})[,\.\s]',
            r'"(\d{4})"',
            r'(\d{4})-\d{2}-\d{2}',
        ]

        for pattern in year_patterns:
            year_matches = re.findall(pattern, ref)
            if year_matches:
                year = int(year_matches[0])
                if 1800 <= year <= current_year:
                    years.append(year)
                    break

    short_footnotes = re.findall(r'{{(?:sfn|harvnb|harv)[^}]*\|[^}]*?(\d{4})', wikitext, re.IGNORECASE)
    for year_str in short_footnotes:
        year = int(year_str)
        if 1800 <= year <= current_year:
            years.append(year)

    citation_templates = re.findall(r'{{citation[^}]+}}', wikitext, re.IGNORECASE | re.DOTALL)
    for block in citation_templates:
        year_matches = re.findall(r'\|(?:year|date|publication-date)\s*=\s*[^\d]*(\d{4})', block, re.IGNORECASE)
        for year_str in year_matches:
            year = int(year_str)
            if 1800 <= year <= current_year:
                years.append(year)
                break

    recent_sources = len([y for y in years if current_year - y <= 5]) if years else 0
    avg_source_age = (current_year - sum(years) / len(years)) if years else 0

    quality_score = 0

    if total_typed_sources > 0:
        scholarly_ratio = (source_types["journal"] + source_types["book"]) / total_typed_sources
        quality_score += scholarly_ratio * 50

    source_diversity = len([v for v in source_types.values() if v > 0])
    quality_score += source_diversity * 10

    if avg_source_age < 10:
        quality_score += 25
    elif avg_source_age < 20:
        quality_score += 15

    return {
        "Journal Sources": source_types["journal"],
        "Book Sources": source_types["book"],
        "Web Sources": source_types["web"],
        "News Sources": source_types["news"],
        "Avg Source Age": round(avg_source_age, 1),
        "Recent Sources (5yr)": recent_sources,
        "Source Quality Score": round(min(quality_score, 100), 1)
    }

def analyze_neutrality_from_text(text):
    """Detect POV/bias issues"""
    if not text:
        return {
            "Hedging Words": 0,
            "Peacock Words": 0,
            "Weasel Words": 0,
            "Value Judgments": 0,
            "Neutrality Score": 100
        }

    text_lower = text.lower()

    peacock_terms = [
        "legendary", "iconic", "acclaimed", "prestigious", "renowned",
        "celebrated", "world-class", "premier", "leading", "foremost",
        "groundbreaking", "seminal", "pivotal", "revolutionary", "extraordinary",
        "remarkable", "outstanding", "exceptional", "unparalleled", "best"
    ]
    peacock_count = sum(text_lower.count(term) for term in peacock_terms)

    weasel_patterns = [
        r'\bsome (people|experts|scholars|critics|observers|say|believe)',
        r'\bmany (believe|argue|claim|suggest|think|feel)',
        r'\bit (is said|has been said|is believed|is widely|is commonly)',
        r'\bmost (people|experts|scholars)',
        r'\bwidely (regarded|considered|accepted|believed)',
        r'\boften (considered|regarded|viewed)',
        r'\bgenerally (accepted|believed|considered)',
    ]
    weasel_count = sum(len(re.findall(pattern, text_lower)) for pattern in weasel_patterns)

    hedging_words = ["perhaps", "possibly", "maybe", "might", "could", "may", "seemingly"]
    hedging_count = sum(text_lower.count(word) for word in hedging_words)

    value_words = [
        "unfortunately", "fortunately", "clearly", "obviously", "naturally",
        "of course", "undoubtedly", "certainly", "arguably", "notably",
        "importantly", "surprisingly", "interestingly", "regrettably"
    ]
    value_count = sum(text_lower.count(word) for word in value_words)

    word_count = len(text.split())
    neutrality_score = 100

    if word_count > 0:
        neutrality_score -= (peacock_count / word_count * 1000) * 10
        neutrality_score -= (weasel_count / word_count * 1000) * 15
        neutrality_score -= (value_count / word_count * 1000) * 8

    neutrality_score = max(0, min(100, neutrality_score))

    return {
        "Hedging Words": hedging_count,
        "Peacock Words": peacock_count,
        "Weasel Words": weasel_count,
        "Value Judgments": value_count,
        "Neutrality Score": round(neutrality_score, 1)
    }

def analyze_readability_from_text(text):
    """Calculate reading level"""
    if not text or len(text) < 100:
        return {
            "Flesch-Kincaid Grade": 0,
            "Reading Level": "Unknown"
        }

    try:
        flesch_grade = textstat.flesch_kincaid_grade(text)
        flesch_ease = textstat.flesch_reading_ease(text)

        if flesch_ease >= 90:
            level = "Elementary (5th grade)"
        elif flesch_ease >= 80:
            level = "Middle School (6-7th)"
        elif flesch_ease >= 70:
            level = "High School (8-9th)"
        elif flesch_ease >= 60:
            level = "High School (10-12th)"
        elif flesch_ease >= 50:
            level = "College"
        elif flesch_ease >= 30:
            level = "College Graduate"
        else:
            level = "Professional/Academic"

        return {
            "Flesch-Kincaid Grade": round(flesch_grade, 1),
            "Reading Level": level
        }
    except:
        return {
            "Flesch-Kincaid Grade": 0,
            "Reading Level": "Error"
        }

def detect_sentiment_bias_from_text(text):
    """Detect sentiment bias"""
    if not text:
        return {
            "Polarity": 0,
            "Subjectivity": 0,
            "VADER Compound": 0,
            "Sentiment": "Neutral"
        }

    try:
        blob = TextBlob(text)
        polarity = blob.sentiment.polarity
        subjectivity = blob.sentiment.subjectivity

        analyzer = SentimentIntensityAnalyzer()
        vader_scores = analyzer.polarity_scores(text)

        if abs(polarity) < 0.1 and subjectivity < 0.3:
            sentiment = "Neutral & Objective"
        elif abs(polarity) < 0.1:
            sentiment = "Neutral but Subjective"
        elif polarity > 0.2:
            sentiment = "Positive Bias Detected"
        elif polarity < -0.2:
            sentiment = "Negative Bias Detected"
        else:
            sentiment = "Slight Bias"

        return {
            "Polarity": round(polarity, 3),
            "Subjectivity": round(subjectivity, 3),
            "VADER Compound": round(vader_scores['compound'], 3),
            "Sentiment": sentiment
        }
    except:
        return {
            "Polarity": 0,
            "Subjectivity": 0,
            "VADER Compound": 0,
            "Sentiment": "Error"
        }

def get_citation_count_from_text(wikitext):
    """Count citations from wikitext"""
    if not wikitext:
        return 0

    named_refs = set()
    unnamed_count = 0
    ref_pattern = r'<ref(?:\s+[^>]*)?>'
    all_refs = re.findall(ref_pattern, wikitext, re.IGNORECASE | re.DOTALL)

    for ref in all_refs:
        if ref.strip().endswith('/>'):
            continue
        name_match = re.search(r'name\s*=\s*["\']([^"\']+)["\']', ref, re.IGNORECASE)
        if name_match:
            named_refs.add(name_match.group(1))
        else:
            unnamed_count += 1

    ref_count = len(named_refs) + unnamed_count

    sfn_count = len(re.findall(r'\{\{sfn[a-z]*\|', wikitext, re.IGNORECASE))
    harv_count = len(re.findall(r'\{\{harv[a-z]*\|', wikitext, re.IGNORECASE))
    r_count = len(re.findall(r'\{\{rp?\|', wikitext, re.IGNORECASE))
    efn_count = len(re.findall(r'\{\{efn[a-z]*\|', wikitext, re.IGNORECASE))

    footnote_count = sfn_count + efn_count
    return max(ref_count, footnote_count, harv_count, r_count)

def get_remaining_data(title):
    """Get data that requires separate API calls"""
    url = "https://en.wikipedia.org/w/api.php"

    sections = ""
    try:
        params = {
            "action": "parse",
            "page": title,
            "prop": "sections",
            "redirects": 1,
            "format": "json"
        }
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if r.headers.get("Content-Type", "").startswith("application/json"):
            parse_data = r.json().get("parse", {})
            if parse_data:
                section_list = parse_data.get("sections", [])
                sections = ", ".join(s["line"] for s in section_list)
    except:
        pass

    talk_page_size = 0
    try:
        params = {
            "action": "query",
            "titles": f"Talk:{title}",
            "redirects": True,
            "prop": "revisions",
            "rvprop": "size",
            "rvlimit": 1,
            "format": "json"
        }
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if r.headers.get("Content-Type", "").startswith("application/json"):
            data = r.json()
            pages = data.get("query", {}).get("pages", {})
            page = next(iter(pages.values()), {})
            if int(page.get("pageid", -1)) > 0:
                talk_page_size = page.get("revisions", [{}])[0].get("size", 0)
    except:
        pass

    num_editors, recent_edits = 0, 0
    try:
        params = {
            "action": "query",
            "titles": title,
            "redirects": True,
            "prop": "revisions",
            "rvprop": "timestamp|user",
            "rvlimit": 500,
            "format": "json"
        }
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if r.headers.get("Content-Type", "").startswith("application/json"):
            data = r.json()
            pages = data.get("query", {}).get("pages", {})
            page = next(iter(pages.values()), {})
            revisions = page.get("revisions", [])

            unique_editors = set()
            one_year_ago = datetime.now(timezone.utc) - timedelta(days=365)

            for rev in revisions:
                user = rev.get("user", "")
                if user:
                    unique_editors.add(user)
                timestamp_str = rev.get("timestamp", "")
                if timestamp_str:
                    timestamp = datetime.strptime(timestamp_str, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
                    if timestamp >= one_year_ago:
                        recent_edits += 1

            num_editors = len(unique_editors)
    except:
        pass

    views = 0
    try:
        end = datetime.now(timezone.utc).replace(tzinfo=None)
        start = end - timedelta(days=90)
        encoded_title = quote(title.replace(' ', '_'))
        pv_url = (
            f"https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
            f"en.wikipedia/all-access/user/"
            f"{encoded_title}/daily/"
            f"{start:%Y%m%d}/{end:%Y%m%d}"
        )
        r = requests.get(pv_url, headers=HEADERS, timeout=30)
        if r.headers.get("Content-Type", "").startswith("application/json"):
            data = r.json()
            views = sum(d["views"] for d in data.get("items", []))
    except:
        pass

    return {
        "Section Names": sections,
        "Talk Page Size": talk_page_size,
        "Total Editors": num_editors,
        "Edits Last Year": recent_edits,
        "Last 3 Months Views": views
    }

def get_empty_metrics():
    """Return empty metrics dict for failed articles"""
    return {
        "Days Since Last Edit": 0,
        "Word Count": 0,
        "Section Names": "",
        "Citations": 0,
        "Citations Needed": 0,
        "Images": 0,
        "Categories": 0,
        "Total Editors": 0,
        "Edits Last Year": 0,
        "Talk Page Size": 0,
        "Last 3 Months Views": 0,
        "Journal Sources": 0,
        "Book Sources": 0,
        "Web Sources": 0,
        "News Sources": 0,
        "Avg Source Age": 0,
        "Recent Sources (5yr)": 0,
        "Source Quality Score": 0,
        "Hedging Words": 0,
        "Peacock Words": 0,
        "Weasel Words": 0,
        "Value Judgments": 0,
        "Neutrality Score": 0,
        "Flesch-Kincaid Grade": 0,
        "Reading Level": "Unknown",
        "Polarity": 0,
        "Subjectivity": 0,
        "VADER Compound": 0,
        "Sentiment": "Unknown",
        "Article Class": "Unknown"
    }

def process_single_article(title):
    """Process a single article - to be run in parallel"""
    try:
        cached_data = get_all_article_data(title)
        metrics = analyze_from_cached_data(title, cached_data)
        remaining = get_remaining_data(title)
        metrics.update(remaining)
        metrics["Article"] = title

        # Get article class
        article_class = get_article_class(title)
        metrics["Article Class"] = article_class

        time.sleep(REQUEST_DELAY)
        return metrics

    except Exception as e:
        print(f"Error processing {title}: {e}")
        empty = get_empty_metrics()
        empty["Article"] = title
        return empty

# MAIN EXECUTION
print("="*60)
print(f"Searching Wikipedia for: '{SEARCH_QUERY}'")
print("="*60)

articles = search_wikipedia_articles(SEARCH_QUERY, MAX_ARTICLES)

if not articles:
    print("\nNo articles found.")
    exit()

print(f"\n{'='*60}")
print(f"Processing {len(articles)} articles with {MAX_WORKERS} parallel workers...")
print(f"{'='*60}\n")

rows = []
start_time = time.time()

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    future_to_article = {executor.submit(process_single_article, title): title for title in articles}

    for i, future in enumerate(as_completed(future_to_article), 1):
        article = future_to_article[future]
        try:
            result = future.result()
            rows.append(result)
            print(f"[{i}/{len(articles)}] Completed: {article}")

            if i % CHECKPOINT_INTERVAL == 0:
                df_checkpoint = pd.DataFrame(rows)
                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                checkpoint_file = f"{SEARCH_QUERY.replace(' ', '_')}_{timestamp}_checkpoint.csv"
                df_checkpoint.to_csv(checkpoint_file, index=False)
                elapsed = time.time() - start_time
                remaining_time = (elapsed / i) * (len(articles) - i)
                print(f"Checkpoint saved: {checkpoint_file}")
                print(f"Elapsed: {elapsed/60:.1f}min | Estimated remaining: {remaining_time/60:.1f}min")

        except Exception as e:
            print(f"Failed: {article} - {e}")

# REORDER COLUMNS AND DISPLAY
if rows:
    df_unordered = pd.DataFrame(rows)

    column_order = [
        "Article",
        "Article Class",
        "Last 3 Months Views",
        "Word Count",
        "Talk Page Size",
        "Days Since Last Edit",
        "Edits Last Year",
        "Total Editors",
        "Images",
        "Citations",
        "Citations Needed",
        "Journal Sources",
        "Book Sources",
        "Web Sources",
        "News Sources",
        "Avg Source Age",
        "Recent Sources (5yr)",
        "Source Quality Score",
        "Hedging Words",
        "Peacock Words",
        "Weasel Words",
        "Value Judgments",
        "Neutrality Score",
        "Flesch-Kincaid Grade",
        "Reading Level",
        "Polarity",
        "Subjectivity",
        "VADER Compound",
        "Sentiment",
        "Categories",
        "Section Names"
    ]

    df = df_unordered[[col for col in column_order if col in df_unordered.columns]]

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{SEARCH_QUERY.replace(' ', '_')}_{MAX_ARTICLES}.csv"
    df.to_csv(filename, index=False)

    try:
        drive_filename = f"/content/drive/MyDrive/{SEARCH_QUERY.replace(' ', '_')}_{MAX_ARTICLES}.csv"
        df.to_csv(drive_filename, index=False)
        print(f"Saved to Google Drive: {drive_filename}")
    except:
        print("Google Drive not mounted, saved locally only")

    elapsed_total = time.time() - start_time
    print(f"\n{'='*60}")
    print(f"Successfully saved {len(df)} articles to: {filename}")
    print(f"Total time: {elapsed_total/60:.1f} minutes ({elapsed_total/len(df):.2f}s per article)")
    print(f"{'='*60}\n")

    from itables import init_notebook_mode, show
    init_notebook_mode(all_interactive=True)

    print("Interactive Data Table")
    print("="*60)
    print("How to use:")
    print("SEARCH: Use the search box to filter across all columns")
    print("SORT: Click column headers to sort (shift+click for multi-column)")
    print("COLUMNS: Click 'Column visibility' button to show/hide columns")
    print("PAGES: Use dropdown to change rows per page (10/25/50/100)")
    print("EXPORT: Click 'CSV' or 'Excel' to download")
    print("="*60)
    print()

    show(df,
        scrollX=True,
        scrollY="600px",
        paging=True,
        lengthMenu=[10, 25, 50, 100],
        pageLength=25,
        buttons=['copy', 'csv', 'excel', 'colvis'],
        order=[[2, 'desc']],
        columnDefs=[{"className": "dt-left", "targets": "_all"}]
    )

    print(f"\n{'='*60}")
    print("Summary Statistics")
    print(f"{'='*60}")
    numeric_cols = ['Last 3 Months Views', 'Word Count', 'Talk Page Size', 'Citations',
                   'Days Since Last Edit', 'Source Quality Score', 'Neutrality Score',
                   'Flesch-Kincaid Grade', 'Polarity', 'Subjectivity']
    available_cols = [col for col in numeric_cols if col in df.columns]
    print(df[available_cols].describe().round(1))

    # Show article class distribution
    print(f"\n{'='*60}")
    print("Article Class Distribution")
    print(f"{'='*60}")
    print(df['Article Class'].value_counts())

else:
    print("\nNo data to save")

In [None]:
import numpy as np
import pandas as pd
from datetime import datetime


pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# Load the CSV file
path = "/content/feminist_theory_10.csv" #@param {type:"string"}
df = pd.read_csv(path)

# Function to calculate percentile rank (0-100 scale)
def percentile_rank(series):
    """Convert values to percentile ranks (0-100)"""
    return series.rank(pct=True) * 100

# ============================================
# CREATE LINKS FIRST (before percentile calculations)
# ============================================

# Wikipedia link
df['Wikipedia Link'] = df['Article'].apply(
    lambda x: f"https://en.wikipedia.org/wiki/{x.replace(' ', '_')}"
)

# IIT Library search link
df['IIT Library Link'] = df["Article"].str.replace(' ', '%20').apply(
    lambda x: f"https://i-share-iit.primo.exlibrisgroup.com/discovery/search?query=any,contains,{x},AND&tab=Everything&search_scope=MyInst_and_CI&sortby=rank&vid=01CARLI_IIT:CARLI_IIT&mfacet=tlevel,include,peer_reviewed,1&lang=en&mode=advanced&offset=0"
)

# ============================================
# CALCULATE PERCENTILES FOR ALL METRICS
# ============================================

# Activity/Collaboration metrics
df['Days Since Edit Percentile'] = 100 - percentile_rank(df['Days Since Last Edit'])
df['Edits Last Year Percentile'] = percentile_rank(df['Edits Last Year'])
df['Total Editors Percentile'] = percentile_rank(df['Total Editors'])
df['Talk Page Percentile'] = percentile_rank(df['Talk Page Size'])

# Popularity
df['Views Percentile'] = percentile_rank(df['Last 3 Months Views'])

# Content quality
df['Word Count Percentile'] = percentile_rank(df['Word Count'])
df['Images Percentile'] = percentile_rank(df['Images'])
df['Categories Percentile'] = percentile_rank(df['Categories'])
df['Citations Percentile'] = percentile_rank(df['Citations'])
df['Citation Needed Percentile'] = 100 - percentile_rank(df['Citations Needed'])
df['Citation/Word Ratio Percentile'] = percentile_rank(df['Citations'] / df['Word Count'].replace(0, 1))

# Source quality metrics
df['Source Quality Percentile'] = percentile_rank(df['Source Quality Score'])
df['Journal Sources Percentile'] = percentile_rank(df['Journal Sources'])
df['Book Sources Percentile'] = percentile_rank(df['Book Sources'])
df['Recent Sources Percentile'] = percentile_rank(df['Recent Sources (5yr)'])
df['Source Age Percentile'] = 100 - percentile_rank(df['Avg Source Age'])

# Neutrality and bias metrics
df['Neutrality Percentile'] = percentile_rank(df['Neutrality Score'])
df['Peacock Words Percentile'] = 100 - percentile_rank(df['Peacock Words'])
df['Weasel Words Percentile'] = 100 - percentile_rank(df['Weasel Words'])
df['Value Judgments Percentile'] = 100 - percentile_rank(df['Value Judgments'])

# Readability metrics
df['Reading Level Percentile'] = 100 - percentile_rank(df['Flesch-Kincaid Grade'])

# Sentiment metrics
df['Polarity Neutrality Percentile'] = 100 - percentile_rank(df['Polarity'].abs())
df['Objectivity Percentile'] = 100 - percentile_rank(df['Subjectivity'])

# ============================================
# CREATE COMPOSITE SCORES
# ============================================

# Collaboration Score
df['Collaboration Score'] = (
    (df['Total Editors Percentile'] + df['Talk Page Percentile']) / 2
).round().astype(int)

# Aliveness Score
df['Aliveness Score'] = (
    (df['Days Since Edit Percentile'] + df['Edits Last Year Percentile']) / 2
).round().astype(int)

# Popularity Score
df['Popularity Score'] = df['Views Percentile'].round().astype(int)

# Quality Score
df['Quality Score'] = (
    df['Citation/Word Ratio Percentile'] * 0.25 +
    df['Images Percentile'] * 0.10 +
    df['Categories Percentile'] * 0.10 +
    df['Citation Needed Percentile'] * 0.05 +
    df['Source Quality Percentile'] * 0.25 +
    df['Neutrality Percentile'] * 0.15 +
    df['Objectivity Percentile'] * 0.10
).round().astype(int)

# Scholarly Source Score
df['Scholarly Source Score'] = (
    df['Source Quality Percentile'] * 0.30 +
    df['Journal Sources Percentile'] * 0.30 +
    df['Book Sources Percentile'] * 0.20 +
    df['Recent Sources Percentile'] * 0.10 +
    df['Source Age Percentile'] * 0.10
).round().astype(int)

# NPOV Score
df['NPOV Score'] = (
    df['Neutrality Percentile'] * 0.40 +
    df['Objectivity Percentile'] * 0.30 +
    df['Peacock Words Percentile'] * 0.15 +
    df['Weasel Words Percentile'] * 0.15
).round().astype(int)

# Accessibility Score
df['Accessibility Score'] = df['Reading Level Percentile'].round().astype(int)

# ============================================
# DISPLAY RESULTS
# ============================================

print("="*60)
print("COMPOSITE SCORES CALCULATED")
print("="*60)
print("\nScore Definitions:")
print("- Collaboration: Average of (Total Editors + Talk Page Size)")
print("- Aliveness: Average of (Days Since Edit [inverted] + Edits Last Year)")
print("- Popularity: Page Views")
print("- Quality: Weighted average of content metrics, sources, and neutrality")
print("- Scholarly Source: Quality and recency of academic sources")
print("- NPOV Score: Neutral Point of View compliance")
print("- Accessibility: Reading level")
print("\nAll scores are on a 0-100 percentile scale.")
print("="*60)

# Display core composite scores with links
display(df[['Article', 'Collaboration Score',
            'Aliveness Score', 'Popularity Score', 'Quality Score', 'Scholarly Source Score',
            'NPOV Score', 'Accessibility Score', 'Wikipedia Link', 'IIT Library Link']])

# Summary statistics
print("\n" + "="*60)
print("Composite Score Statistics:")
print("="*60)
score_cols = ['Collaboration Score', 'Aliveness Score', 'Popularity Score', 'Quality Score',
              'Scholarly Source Score', 'NPOV Score', 'Accessibility Score']
print(df[score_cols].describe().round(1).T)

# ============================================
# TOP ARTICLES BY EACH SCORE
# ============================================

print("\n" + "="*60)
print("TOP 5 ARTICLES BY EACH SCORE:")
print("="*60)

print("\nMost Collaborative:")
print(df.nlargest(5, 'Collaboration Score')[['Article', 'Collaboration Score', 'Total Editors', 'Talk Page Size']])

print("\nMost Alive:")
print(df.nlargest(5, 'Aliveness Score')[['Article', 'Aliveness Score', 'Days Since Last Edit', 'Edits Last Year']])

print("\nMost Popular:")
print(df.nlargest(5, 'Popularity Score')[['Article', 'Popularity Score', 'Last 3 Months Views']])

print("\nHighest Quality:")
print(df.nlargest(5, 'Quality Score')[['Article', 'Quality Score', 'Citations', 'Source Quality Score', 'Neutrality Score']])

print("\nBest Scholarly Sources:")
print(df.nlargest(5, 'Scholarly Source Score')[['Article', 'Scholarly Source Score', 'Journal Sources', 'Book Sources']])

print("\nMost Neutral (NPOV):")
print(df.nlargest(5, 'NPOV Score')[['Article', 'NPOV Score', 'Neutrality Score', 'Objectivity Percentile']])

print("\nMost Accessible:")
print(df.nlargest(5, 'Accessibility Score')[['Article', 'Accessibility Score', 'Reading Level', 'Flesch-Kincaid Grade']])

# ============================================
# SAVE RESULTS
# ============================================

# Drop percentile columns (keep only final scores)
percentile_cols = [col for col in df.columns if 'Percentile' in col]
df_final = df.drop(columns=percentile_cols)

# Save to local
filename_with_scores = path.split('/')[-1].replace('.csv', '_WITH_SCORES.csv')
df_final.to_csv(filename_with_scores, index=False)
print(f"\nSaved locally to: {filename_with_scores}")

# Save to Google Drive
try:
    drive_filename = "/content/drive/MyDrive/" + filename_with_scores
    df_final.to_csv(drive_filename, index=False)
    print(f"Saved to Google Drive: {drive_filename}")
except:
    print("Google Drive not mounted")

In [None]:
#combine dfs
path1 = "" #@param {type:"string"}
path2 = "" #@param {type:"string"}
# Read CSVs into DataFrame
First_Sheet = pd.read_csv(path1)
Second_Sheet = pd.read_csv(path2)
dfs = [First_Sheet,Second_Sheet]
combined_df = pd.concat(dfs, axis=0, ignore_index=True)

In [None]:
import pandas as pd
path = "/content/feminist_theory_10.csv" #@param {type:"string"}
# Read CSV into DataFrame
df = pd.read_csv(path)
show(df,
        scrollX=True,
        scrollY="600px",
        paging=True,
        lengthMenu=[10, 25, 50, 100],
        pageLength=25,
        buttons=['copy', 'csv', 'excel', 'colvis'],
        order=[[1, 'desc']],  # Sort by "Last 3 Months Views" descending by default
        columnDefs=[{"className": "dt-left", "targets": "_all"}]
    )

In [None]:
# @title
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML
import requests
import time
import re
import random
from collections import Counter

EMAIL = "ajose3@hawk.illinoistech.edu" #@param {type:"string"}
HEADERS = {
    "User-Agent": f"Wikipedia_Topic_Explorer/1.0 (Educational research; {EMAIL}; Python/requests)"
}

def search_wikipedia_multiple(query, top_n=3):
    """
    Search Wikipedia, get top 20 results, randomly sample 3 for variety.
    """
    url = "https://en.wikipedia.org/w/api.php"

    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": 10,
        "format": "json"
    }

    try:
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        data = r.json()

        all_results = []
        for item in data.get("query", {}).get("search", []):
            all_results.append({
                "title": item["title"],
                "snippet": item.get("snippet", "")
            })

        if len(all_results) >= top_n:
            return random.sample(all_results, top_n)
        else:
            return all_results

    except Exception as e:
        print(f"Error searching Wikipedia: {e}")
        return []

def get_categories_and_wikiprojects(query, num_articles=10):
    """
    Get most common categories and WikiProjects from search results.
    """
    url = "https://en.wikipedia.org/w/api.php"

    # Get more articles for better category analysis
    search_params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": num_articles,
        "format": "json"
    }

    try:
        r = requests.get(url, params=search_params, headers=HEADERS, timeout=30)
        data = r.json()
        article_titles = [item["title"] for item in data.get("query", {}).get("search", [])]

        all_categories = []
        all_wikiprojects = []

        skip_terms = ['stub', 'articles needing', 'articles with', 'pages', 'wikipedia',
                     'template', 'all ', 'cs1', 'webarchive', 'coordinates', 'commons',
                     'use dmy', 'use mdy', 'living people', 'year', 'births', 'deaths']

        for title in article_titles:
            # Get categories
            cat_params = {
                "action": "query",
                "titles": title,
                "prop": "categories",
                "cllimit": 100,
                "format": "json"
            }

            r = requests.get(url, params=cat_params, headers=HEADERS, timeout=30)
            cat_data = r.json()
            page = next(iter(cat_data.get("query", {}).get("pages", {}).values()))

            for cat in page.get("categories", []):
                cat_name = cat["title"].replace("Category:", "")
                if not any(skip in cat_name.lower() for skip in skip_terms):
                    all_categories.append(cat_name)

            # Get WikiProjects from talk page
            talk_params = {
                "action": "query",
                "titles": f"Talk:{title}",
                "prop": "revisions",
                "rvprop": "content",
                "rvslots": "main",
                "format": "json"
            }

            r = requests.get(url, params=talk_params, headers=HEADERS, timeout=30)
            talk_data = r.json()
            talk_page = next(iter(talk_data.get("query", {}).get("pages", {}).values()))

            if int(talk_page.get("pageid", -1)) > 0:
                talk_text = talk_page.get("revisions", [{}])[0].get("slots", {}).get("main", {}).get("*", "")

                # Find WikiProject templates
                wikiproject_pattern = r'{{WikiProject\s+([^}|]+)'
                projects = re.findall(wikiproject_pattern, talk_text, re.IGNORECASE)
                all_wikiprojects.extend(projects)

            time.sleep(0.1)

        # Count occurrences
        category_counts = Counter(all_categories)
        wikiproject_counts = Counter(all_wikiprojects)

        return {
            "categories": category_counts.most_common(10),
            "wikiprojects": wikiproject_counts.most_common(10)
        }

    except Exception as e:
        print(f"Error getting categories/projects: {e}")
        return {"categories": [], "wikiprojects": []}

def get_conceptnet_related(query, limit=15):
    """
    Get related concepts from ConceptNet API.
    """
    try:
        encoded_query = query.replace(' ', '_').lower()
        url = f"http://api.conceptnet.io/query?node=/c/en/{encoded_query}&limit=100"

        r = requests.get(url, timeout=30)

        if r.status_code != 200:
            return []

        data = r.json()

        related_concepts = set()

        for edge in data.get("edges", []):
            weight = edge.get("weight", 0)

            if weight < 1.0:
                continue

            start = edge.get("start", {})
            end = edge.get("end", {})

            for node in [start, end]:
                label = node.get("label", "")
                language = node.get("language", "")
                node_term = node.get("term", "")

                if language == "en":
                    if not label and node_term:
                        term_parts = node_term.split('/')
                        if len(term_parts) >= 3:
                            label = term_parts[-1].replace('_', ' ')

                    if label:
                        clean_label = label.replace("_", " ").strip()
                        if (clean_label.lower() != query.lower() and
                            len(clean_label) > 2 and
                            clean_label not in related_concepts):
                            related_concepts.add(clean_label)

        result = sorted(list(related_concepts))[:limit]
        return result

    except Exception as e:
        print(f"Error accessing ConceptNet: {e}")
        return []

def make_wikipedia_link(title):
    """
    Create a clickable HTML link to a Wikipedia article.
    """
    url_title = title.replace(' ', '_')
    url = f"https://en.wikipedia.org/wiki/{url_title}"
    return f'<a href="{url}" target="_blank">{title}</a>'

def make_category_link(category_name):
    """
    Create a clickable link to a Wikipedia category page.
    """
    url_cat = category_name.replace(' ', '_')
    url = f"https://en.wikipedia.org/wiki/Category:{url_cat}"
    return f'<a href="{url}" target="_blank">{category_name}</a>'

def make_wikiproject_link(project_name):
    """
    Create a clickable link to a WikiProject page.
    """
    url_project = project_name.strip().replace(' ', '_')
    url = f"https://en.wikipedia.org/wiki/Wikipedia:WikiProject_{url_project}"
    return f'<a href="{url}" target="_blank">WikiProject {project_name}</a>'

def get_see_also_links(article_title, max_links=10):
    """
    Extract links from See Also section.
    """
    url = "https://en.wikipedia.org/w/api.php"

    parse_params = {
        "action": "parse",
        "page": article_title,
        "prop": "sections|wikitext",
        "format": "json"
    }

    try:
        r = requests.get(url, params=parse_params, headers=HEADERS, timeout=30)
        data = r.json()

        if "parse" not in data:
            return []

        wikitext = data["parse"].get("wikitext", {}).get("*", "")

        section_pattern = r"==\s*See also\s*==(.*?)(?:==|$)"
        section_match = re.search(section_pattern, wikitext, re.IGNORECASE | re.DOTALL)

        if not section_match:
            return []

        see_also_text = section_match.group(1)
        links = re.findall(r'\[\[([^]|]+)(?:\|[^]]+)?\]\]', see_also_text)

        clean_links = []
        for link in links:
            if not link.startswith(('Category:', 'File:', 'Image:', 'Wikipedia:')):
                clean_links.append(link)

        return clean_links[:max_links]

    except Exception as e:
        return []

def get_filtered_links_from_article(article_title, max_links=12):
    """
    Get conceptual links from the article.
    """
    url = "https://en.wikipedia.org/w/api.php"

    links_params = {
        "action": "query",
        "titles": article_title,
        "prop": "links",
        "pllimit": 300,
        "plnamespace": 0,
        "format": "json"
    }

    try:
        r = requests.get(url, params=links_params, headers=HEADERS, timeout=30)
        data = r.json()
        page = next(iter(data.get("query", {}).get("pages", {}).values()))

        if "links" not in page:
            return []

        all_links = [link["title"] for link in page["links"]]

        conceptual_keywords = [
            'theory', 'philosophy', 'studies', 'criticism', 'ism',
            'epistemology', 'methodology', 'approach', 'framework',
            'perspective', 'analysis', 'research', 'science'
        ]

        exclude_patterns = [
            r'^\d{4}$',
            r'List of',
            r'Index of',
            r'^[A-Z]{2,4}$',
            r'University',
            r'Press$',
            r'Publishing',
            r'Books$',
            r'ISBN'
        ]

        filtered = []
        for link in all_links:
            if any(re.search(pattern, link) for pattern in exclude_patterns):
                continue

            if any(keyword in link.lower() for keyword in conceptual_keywords):
                filtered.append(link)

        if len(filtered) > max_links:
            return random.sample(filtered, max_links)
        return filtered

    except Exception as e:
        return []

def get_category_siblings(article_title, max_results=10):
    """
    Get other articles in the same meaningful categories.
    """
    url = "https://en.wikipedia.org/w/api.php"

    cat_params = {
        "action": "query",
        "titles": article_title,
        "prop": "categories",
        "cllimit": 50,
        "format": "json"
    }

    try:
        r = requests.get(url, params=cat_params, headers=HEADERS, timeout=30)
        cat_data = r.json()
        page = next(iter(cat_data.get("query", {}).get("pages", {}).values()))
        categories = [cat["title"] for cat in page.get("categories", [])]

        skip_terms = ['stub', 'articles', 'pages', 'wikipedia', 'template',
                     'all ', 'cs1', 'webarchive', 'coordinates', 'commons',
                     'use dmy', 'use mdy', 'living', 'year', 'century']

        meaningful_cats = []
        for cat in categories:
            if not any(skip in cat.lower() for skip in skip_terms):
                meaningful_cats.append(cat)

        all_siblings = []

        for category in meaningful_cats[:3]:
            member_params = {
                "action": "query",
                "list": "categorymembers",
                "cmtitle": category,
                "cmlimit": 20,
                "cmnamespace": 0,
                "format": "json"
            }

            r = requests.get(url, params=member_params, headers=HEADERS, timeout=30)
            member_data = r.json()

            for member in member_data.get("query", {}).get("categorymembers", []):
                title = member["title"]
                if title != article_title and title not in all_siblings:
                    all_siblings.append(title)

            time.sleep(0.1)

        if len(all_siblings) > max_results:
            return random.sample(all_siblings, max_results)
        return all_siblings

    except Exception as e:
        return []

def explore_topic_multiple_paths(query):
    """
    Get related topics from multiple sources and pathways.
    """
    wiki_results = search_wikipedia_multiple(query, top_n=3)
    conceptnet_related = get_conceptnet_related(query, limit=12)
    cat_and_projects = get_categories_and_wikiprojects(query, num_articles=15)

    pathways = []

    for i, result in enumerate(wiki_results):
        article_title = result["title"]

        pathway = {
            "title": article_title,
            "see_also": get_see_also_links(article_title, 8),
            "related_concepts": get_filtered_links_from_article(article_title, 10),
            "category_siblings": get_category_siblings(article_title, 8)
        }

        pathways.append(pathway)
        time.sleep(0.2)

    return {
        "wikipedia_matches": [r["title"] for r in wiki_results],
        "pathways": pathways,
        "conceptnet_alternatives": conceptnet_related,
        "categories": cat_and_projects["categories"],
        "wikiprojects": cat_and_projects["wikiprojects"]
    }

def create_smart_search_explorer():

    topic_input = widgets.Text(
        value='feminist theory',
        placeholder='Enter a topic',
        description='Topic:',
        style={'description_width': '100px'},
        layout=widgets.Layout(width='500px')
    )

    search_button = widgets.Button(
        description='Explore Topic',
        button_style='info',
        layout=widgets.Layout(width='200px', height='40px')
    )

    output = widgets.Output()

    def on_search_clicked(b):
        with output:
            clear_output()
            query = topic_input.value.strip()

            if not query:
                print("Please enter a topic.")
                return

            print("TOPIC EXPLORATION")
            print("="*70)
            print(f"Exploring: '{query}'")
            print("="*70)
            print("\nAnalyzing multiple pathways (30-40 seconds)...\n")

            results = explore_topic_multiple_paths(query)

            html_output = "<div style='font-family: monospace;'>"

            # Most Common Categories
            if results["categories"]:
                html_output += "<h3>MOST COMMON CATEGORIES FOR THIS TOPIC</h3>"
                html_output += "<hr>"
                html_output += "<p>These categories appear most frequently across articles about this topic:</p>"
                html_output += "<ol>"
                for cat_name, count in results["categories"]:
                    html_output += f"<li>{make_category_link(cat_name)} <em>({count} articles)</em></li>"
                html_output += "</ol>"

            # Most Common WikiProjects
            if results["wikiprojects"]:
                html_output += "<h3>MOST ACTIVE WIKIPROJECTS</h3>"
                html_output += "<hr>"
                html_output += "<p>These WikiProjects are most involved with articles on this topic:</p>"
                html_output += "<ol>"
                for project_name, count in results["wikiprojects"]:
                    html_output += f"<li>{make_wikiproject_link(project_name)} <em>({count} articles)</em></li>"
                html_output += "</ol>"

            # Wikipedia matches
            html_output += "<h3>WIKIPEDIA ARTICLES FOUND</h3>"
            html_output += "<hr>"
            if results["wikipedia_matches"]:
                html_output += "<ol>"
                for match in results["wikipedia_matches"]:
                    html_output += f"<li>{make_wikipedia_link(match)}</li>"
                html_output += "</ol>"
            else:
                html_output += "<p>No Wikipedia articles found.</p>"

            # Pathways
            for i, pathway in enumerate(results["pathways"], 1):
                html_output += f"<h3>PATHWAY {i}: Based on '{pathway['title']}'</h3>"
                html_output += "<hr>"

                if pathway["see_also"]:
                    html_output += "<h4>Editor-curated related topics:</h4>"
                    html_output += "<ol>"
                    for topic in pathway["see_also"]:
                        html_output += f"<li>{make_wikipedia_link(topic)}</li>"
                    html_output += "</ol>"

                if pathway["related_concepts"]:
                    html_output += "<h4>Related concepts from article:</h4>"
                    html_output += "<ol>"
                    for topic in pathway["related_concepts"]:
                        html_output += f"<li>{make_wikipedia_link(topic)}</li>"
                    html_output += "</ol>"

                if pathway["category_siblings"]:
                    html_output += "<h4>Similar topics (same categories):</h4>"
                    html_output += "<ol>"
                    for topic in pathway["category_siblings"]:
                        html_output += f"<li>{make_wikipedia_link(topic)}</li>"
                    html_output += "</ol>"

            # ConceptNet alternatives
            if results["conceptnet_alternatives"]:
                html_output += "<h3>ALTERNATIVE EXPLORATION ANGLES</h3>"
                html_output += "<hr>"
                html_output += "<p>Related concepts from semantic knowledge graph:</p>"
                html_output += "<ol>"
                for concept in results["conceptnet_alternatives"]:
                    html_output += f"<li>{make_wikipedia_link(concept)}</li>"
                html_output += "</ol>"

            html_output += "<hr>"
            html_output += "<p><strong>Exploration complete. Click any link to open the Wikipedia article.</strong></p>"
            html_output += "</div>"

            display(HTML(html_output))

    search_button.on_click(on_search_clicked)

    ui = widgets.VBox([
        widgets.HTML("<h2>Multi-Path Topic Explorer</h2>"),
        widgets.HTML("<p>Discovers related topics, categories, and WikiProjects from Wikipedia and semantic knowledge graphs.</p>"),
        topic_input,
        search_button,
        output
    ])

    display(ui)

create_smart_search_explorer()

In [None]:
import requests
import pandas as pd
from datetime import datetime, timedelta, timezone
from urllib.parse import quote
import time
import re
from IPython.display import display
EMAIL = "ajose3@hawk.illinoistech.edu" #@param {type:"string"}
import random

HEADERS = {
    "User-Agent": f"Wikiproject_article_finder/1.0 (Educational research; {EMAIL}; Python/requests)"
}
WIKIPROJECT = "psychology" #@param {type:"string"}
CLASS = "C-Class" #@param {type:"string"}
MAX_ARTICLES = "0" #@param {type:"string"}
try:
  MAX_ARTICLES = int(MAX_ARTICLES)  # Try conversion
except (ValueError, TypeError):
  MAX_ARTICLES = None
LANG = "en"
REQUEST_DELAY = 0.03
CHECKPOINT_INTERVAL = 100



def get_project_articles(project, klass, MAX_ARTICLES=None):
    """
    Fetch article titles for a given WikiProject and class.
    """
    category = f"Category:{klass}_{project}_articles"
    url = "https://en.wikipedia.org/w/api.php"

    print(f"Fetching from: {category}")

    titles = []
    cmcontinue = None
    page_count = 0

    while True:
        page_count += 1
        params = {
            "action": "query",
            "list": "categorymembers",
            "cmtitle": category,
            "cmlimit": 500,
            "format": "json"
        }
        if cmcontinue:
            params["cmcontinue"] = cmcontinue

        r = requests.get(url, params=params, headers=HEADERS, timeout=30)

        if not r.headers.get("Content-Type", "").startswith("application/json"):
            print("Non-JSON response, retrying...")
            time.sleep(2)
            continue

        data = r.json()
        members = data.get("query", {}).get("categorymembers", [])

        print(f"  Page {page_count}: Found {len(members)} items")

        for p in members:
            title = p["title"]
            if title.startswith("Talk:"):
                article_title = title[5:]
                titles.append(article_title)
            elif not title.startswith("Category:"):
                titles.append(title)

            if MAX_ARTICLES and len(titles) >= MAX_ARTICLES:
                unique = sorted(set(titles))[:MAX_ARTICLES]
                print(f"Reached limit of {MAX_ARTICLES} articles")
                return unique

        cmcontinue = data.get("continue", {}).get("cmcontinue")
        if not cmcontinue:
            break

        time.sleep(REQUEST_DELAY)

    unique_titles = sorted(set(titles))
    print(f"Total unique articles found: {len(unique_titles)}")

    return unique_titles

def get_article_metadata(title):
    """
    Get basic metadata: days since edit, word count, citation needed count
    """
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "titles": title,
        "redirects": True,
        "prop": "revisions|extracts",
        "rvprop": "content|timestamp",
        "rvslots": "main",
        "explaintext": True,
        "exlimit": 1,
        "format": "json"
    }
    r = requests.get(url, params=params, headers=HEADERS, timeout=30)
    if not r.headers.get("Content-Type", "").startswith("application/json"):
        return 0, 0, 0

    data = r.json()
    page = next(iter(data["query"]["pages"].values()))

    # Calculate days since last edit
    last_edit_str = page.get("revisions", [{}])[0].get("timestamp", "")
    days_since_edit = 0
    if last_edit_str:
        last_edit = datetime.strptime(last_edit_str, "%Y-%m-%dT%H:%M:%SZ")
        days_since_edit = round((datetime.now(timezone.utc).replace(tzinfo=None) - last_edit).days)

    # Word count
    extract = page.get("extract", "")
    word_count = len(extract.split()) if extract else 0

    # Count "citation needed"
    wikitext = page.get("revisions", [{}])[0].get("slots", {}).get("main", {}).get("*", "")
    citation_needed_count = wikitext.lower().count("citation needed")

    return days_since_edit, word_count, citation_needed_count

def get_sections(title):
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "parse",
        "page": title,
        "prop": "sections",
        "redirects": 1,
        "format": "json"
    }

    try:
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if not r.headers.get("Content-Type", "").startswith("application/json"):
            return ""
        parse_data = r.json().get("parse", {})
        if not parse_data:
            return ""
        sections = parse_data.get("sections", [])
        section_names = ", ".join(s["line"] for s in sections)
        return section_names
    except:
        return ""

def get_citation_count(title):
    """
    Universal citation counter that handles ALL citation formats:
    - Standard <ref> tags
    - {{sfn}}, {{sfnp}}, {{sfnm}} (short footnotes with variants)
    - {{harv}}, {{harvnb}}, {{harvp}}, etc. (Harvard citations)
    - {{r}}, {{rp}} (reference shortcuts)
    - {{efn}} (explanatory footnotes)
    - {{citation needed}} tags
    - List-defined references
    """
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "redirects": 1,
        "format": "json"
    }
    try:
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if not r.headers.get("Content-Type", "").startswith("application/json"):
            return 0

        data = r.json()
        pages = data.get("query", {}).get("pages", {})
        page = next(iter(pages.values()), {})

        if "revisions" not in page:
            return 0

        content = page.get("revisions", [{}])[0].get("slots", {}).get("main", {}).get("*", "")

        # Method 1: Count standard <ref> tags (unique named refs + unnamed refs)
        named_refs = set()
        unnamed_count = 0

        ref_pattern = r'<ref(?:\s+[^>]*)?>'
        all_refs = re.findall(ref_pattern, content, re.IGNORECASE | re.DOTALL)

        for ref in all_refs:
            if ref.strip().endswith('/>'):
                continue

            name_match = re.search(r'name\s*=\s*["\']([^"\']+)["\']', ref, re.IGNORECASE)
            if name_match:
                named_refs.add(name_match.group(1))
            else:
                unnamed_count += 1

        ref_count = len(named_refs) + unnamed_count

        # Method 2: Count ALL sfn variants (sfn, sfnp, sfnm, sfnmp, etc.)
        sfn_pattern = r'\{\{sfn[a-z]*\|'
        sfn_count = len(re.findall(sfn_pattern, content, re.IGNORECASE))

        # Method 3: Count ALL harv variants (harv, harvnb, harvp, harvtxt, etc.)
        harv_pattern = r'\{\{harv[a-z]*\|'
        harv_count = len(re.findall(harv_pattern, content, re.IGNORECASE))

        # Method 4: Count {{r}} and {{rp}} (reference shortcuts)
        r_pattern = r'\{\{rp?\|'
        r_count = len(re.findall(r_pattern, content, re.IGNORECASE))

        # Method 5: Count {{efn}} (explanatory footnotes)
        efn_pattern = r'\{\{efn[a-z]*\|'
        efn_count = len(re.findall(efn_pattern, content, re.IGNORECASE))

        # Method 6: Count list-defined references
        ldr_count = 0
        ldr_match = re.search(r'\{\{reflist\|refs=(.*?)\n\}\}', content, re.IGNORECASE | re.DOTALL)
        if ldr_match:
            ldr_content = ldr_match.group(1)
            ldr_count = len(re.findall(r'<ref name=', ldr_content, re.IGNORECASE))

        # Combine footnote-style citations (sfn + efn count together, as they're often used together)
        footnote_count = sfn_count + efn_count

        # Use the highest count from all methods
        # (articles typically use ONE main citation style)
        total_citations = max(ref_count, footnote_count, harv_count, r_count, ldr_count)

        return total_citations

    except Exception as e:
        print(f"   ⚠ Citation error: {e}")
        return 0

def get_images_and_categories(title):
    """
    Get image count and category count
    """
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "titles": title,
        "redirects": True,
        "prop": "images|categories",
        "imlimit": 500,
        "cllimit": 500,
        "format": "json"
    }

    try:
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if not r.headers.get("Content-Type", "").startswith("application/json"):
            return 0, 0

        data = r.json()
        page = next(iter(data["query"]["pages"].values()))

        images = len(page.get("images", []))
        categories = len(page.get("categories", []))

        return images, categories

    except Exception as e:
        print(f"Images/categories error: {e}")
        return 0, 0

def get_edit_statistics(title):
    """
    Get edit statistics: total editors and edits in last year
    """
    url = "https://en.wikipedia.org/w/api.php"

    params = {
        "action": "query",
        "titles": title,
        "redirects": True,
        "prop": "revisions",
        "rvprop": "timestamp|user",
        "rvlimit": 500,
        "format": "json"
    }

    try:
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if not r.headers.get("Content-Type", "").startswith("application/json"):
            return 0, 0

        data = r.json()
        pages = data.get("query", {}).get("pages", {})
        page = next(iter(pages.values()), {})

        revisions = page.get("revisions", [])

        unique_editors = set()
        recent_edits = 0
        one_year_ago = datetime.now(timezone.utc) - timedelta(days=365)

        for rev in revisions:
            user = rev.get("user", "")
            if user:
                unique_editors.add(user)

            # Count recent edits (last year)
            timestamp_str = rev.get("timestamp", "")
            if timestamp_str:
                timestamp = datetime.strptime(timestamp_str, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
                if timestamp >= one_year_ago:
                    recent_edits += 1

        return len(unique_editors), recent_edits

    except Exception as e:
        print(f"Edit statistics error: {e}")
        return 0, 0

def get_talk_page_size(title):
    """
    Get the size of the talk page in bytes
    """
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "titles": f"Talk:{title}",
        "redirects": True,
        "prop": "revisions",
        "rvprop": "size",
        "rvlimit": 1,
        "format": "json"
    }

    try:
        r = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if not r.headers.get("Content-Type", "").startswith("application/json"):
            return 0

        data = r.json()
        pages = data.get("query", {}).get("pages", {})
        page = next(iter(pages.values()), {})

        # Check if page exists (missing pages have negative IDs)
        if int(page.get("pageid", -1)) < 0:
            return 0

        size = page.get("revisions", [{}])[0].get("size", 0)
        return size

    except Exception as e:
        print(f"Talk page error: {e}")
        return 0

def get_pageviews_3mo(title):
    end = datetime.now(timezone.utc).replace(tzinfo=None)
    start = end - timedelta(days=90)
    encoded_title = quote(title.replace(' ', '_'))
    url = (
        f"https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"en.wikipedia/all-access/user/"
        f"{encoded_title}/daily/"
        f"{start:%Y%m%d}/{end:%Y%m%d}"
    )

    try:
        r = requests.get(url, headers=HEADERS, timeout=30)
        if not r.headers.get("Content-Type", "").startswith("application/json"):
            return 0
        data = r.json()
        return sum(d["views"] for d in data.get("items", []))
    except:
        return 0

# Main execution
print("="*60)
print(f"Fetching {CLASS} {WIKIPROJECT} articles from Wikipedia")
print("="*60)

articles = get_project_articles(WIKIPROJECT, CLASS, MAX_ARTICLES)  # Change to None for all

if not articles:
    print("\nNo articles found.")
    exit()

print(f"\n{'='*60}")
print(f"Processing {len(articles)} articles...")
print(f"Estimated time: {len(articles) * 6 * REQUEST_DELAY / 60:.1f} - {len(articles) * 6 * 0.5 / 60:.1f} minutes")
print(f"{'='*60}\n")

rows = []
start_time = time.time()

for i, title in enumerate(articles, 1):
    print(f"[{i}/{len(articles)}] {title}")
    try:
        days_since_edit, word_count, citation_needed = get_article_metadata(title)
        sections = get_sections(title)
        citations = get_citation_count(title)
        views = get_pageviews_3mo(title)
        images, categories = get_images_and_categories(title)
        num_editors, recent_edits = get_edit_statistics(title)
        talk_page_size = get_talk_page_size(title)

        rows.append({
            "Article": title,
            "Days Since Last Edit": days_since_edit,
            "Word Count": word_count,
            "Section Names": sections,
            "Citations": citations,
            "Citation Needed Count": citation_needed,
            "Images": images,
            "Categories": categories,
            "Total Editors": num_editors,
            "Edits Last Year": recent_edits,
            "Talk Page Size (bytes)": talk_page_size,
            "Last 3 Months Views": views
        })
    except Exception as e:
        print(f"Error: {e}")
        rows.append({
            "Article": title,
            "Days Since Last Edit": 0,
            "Word Count": 0,
            "Section Names": "",
            "Citations": 0,
            "Citation Needed Count": 0,
            "Images": 0,
            "Categories": 0,
            "Total Editors": 0,
            "Edits Last Year": 0,
            "Talk Page Size (bytes)": 0,
            "Last 3 Months Views": 0
        })

    # Checkpoint saves
    if i % CHECKPOINT_INTERVAL == 0:
        df_checkpoint = pd.DataFrame(rows)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        checkpoint_file = f"{WIKIPROJECT}_{CLASS}_{EMAIL}_{MAX_ARTICLES}.csv"
        df_checkpoint.to_csv(checkpoint_file, index=False)
        elapsed = time.time() - start_time
        remaining = (elapsed / i) * (len(articles) - i)
        print(f"Checkpoint saved: {checkpoint_file}")
        print(f"Elapsed: {elapsed/60:.1f}min | Estimated remaining: {remaining/60:.1f}min")

    time.sleep(REQUEST_DELAY)

# Save final CSV and display dataframe
if rows:
    df = pd.DataFrame(rows)
    #timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{WIKIPROJECT}_{CLASS}_{EMAIL}_{MAX_ARTICLES}.csv"
    df.to_csv(filename, index=False)
    filename = f"/content/drive/MyDrive/{WIKIPROJECT}_{CLASS}_{EMAIL}_{MAX_ARTICLES}.csv"
    df.to_csv(filename, index=False)

    elapsed_total = time.time() - start_time
    print(f"\n{'='*60}")
    print(f"Successfully saved {len(df)} articles to: {filename}")
    print(f"Total time: {elapsed_total/60:.1f} minutes")
    print(f"{'='*60}\n")

    # Display editable dataframe
    print("Editable DataFrame:")
    display(df)

    print(f"\n{'='*60}")
    print("Summary statistics:")
    print(f"{'='*60}")
    numeric_cols = ['Days Since Last Edit', 'Word Count', 'Citations', 'Citation Needed Count',
                   'Images', 'Categories', 'Total Editors', 'Edits Last Year',
                   'Talk Page Size (bytes)', 'Last 3 Months Views']
    print(round(df[numeric_cols].describe(), 1))
else:
    print("\nNo data to save")