# Getting Pageviews: Monthly Pageview Data 2017 - 2025

This notebook acquires the engagement data. It uses the Wikipedia Pageviews API to collect the complete monthly pageview history for every economist, from January 2017 through November 2025. The core loop iterates over the list of economists, queries the API for each article's time series, and then merges the results with the cleaned, labeled biographical dataset generated in Notebook 3. The result is the complete, analysis-ready dataset.

**Note on Development**: Because collecting monthly data over this period for all economists requires a high volume of requests, Generative AI was used extensively to develop the API request function, with a focus on error handling, efficient batching, and adherence to API rate limits to prevent connection issues.

**Table of Contents**

1. [Library Imports](#sec1)
2. [Load Cleaned and Labeled Data](#sec2)
3. [Wikipedia Pageviews API Function](#sec3)
4. [Execute Pageview Requests](#sec4)
5. [Final Dataset Merge](#sec5)
6. [Data Export](#sec6)

<a id="sec1"></a>
### Library Imports

In [2]:
import requests
import pandas as pd
import time
import tqdm 

<a id="sec2"></a>
### Load Cleaned and Labeled Data

In [4]:
df = pd.read_csv("../Data/economists_cleaned.csv")
df.head()

| | name | article_url | qid | summary | birth_year | gender | citizenship | occupation | fields |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Edith Abbott | https://en.wikipedia.org/wiki/Edith_Abbott | Q272731 | Edith Abbott (September 26, 1876 – July 28, 19... | 1876.0 | female | ['United States'] | ['economist', 'statistician', 'social worker',... | ['economics', 'social work', 'statistics'] |
| 1 | Daron Acemoglu | https://en.wikipedia.org/wiki/Daron_Acemoglu | Q718581 | Kamer Daron Acemoğlu (born September 3, 1967) ... | 1967.0 | male | ['Turkey', 'United States'] | ['economist', 'university teacher', 'author'] | ['economics'] |
| 2 | Nicola Acocella | https://en.wikipedia.org/wiki/Nicola_Acocella | Q7001311 | Nicola Acocella (born 3 July 1939) is an Itali... | 1939.0 | male | ['Kingdom of Italy', 'Italy'] | ['economist'] | [] |
| 3 | Zoltan Acs | https://en.wikipedia.org/wiki/Zoltan_Acs | Q8073604 | Zoltan J. Acs (born 1947) is an American econo... | 1947.0 | male | ['United States', 'Hungary'] | ['economist'] | ['economics'] |
| 4 | Henry Carter Adams | https://en.wikipedia.org/wiki/Henry_Carter_Adams | Q518021 | Henry Carter Adams (December 31, 1851 – August... | 1851.0 | male | ['United States'] | ['economist', 'university teacher', 'writer', ... | ['economics'] |


<a id="sec3"></a>
### Wikipedia Pageviews API Function

In [None]:
# Generative AI was used below to assist with the setup of the function 

def fetch_monthly_pageviews(title, start="20170101", end="20251130"):
    """
    Fetch monthly pageviews for an article title between start and end dates.
    Returns a list of dicts: [{"title":..., "qid":..., "date":..., "views":...}, ...]
    The "qid" field is a None placeholder that the caller fills in.
    """

    # Prepare title for API (spaces → underscores)
    title_encoded = title.replace(" ", "_")

    url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/" \
          f"en.wikipedia.org/all-access/user/{title_encoded}/monthly/{start}/{end}"

    headers = {"User-Agent": "WellesleyCS234-StudentProject/1.0 (sk131@wellesley.edu)"}

    # A timeout keeps one stalled connection from hanging the whole run;
    # network errors are treated like a failed request
    try:
        r = requests.get(url, headers=headers, timeout=30)
    except requests.RequestException:
        return []

    # Any non-200 response (missing article, rate limit, server error) → return an empty list
    if r.status_code != 200:
        return []

    data = r.json()

    # Extract one row per month
    results = []
    for item in data.get("items", []):
        results.append({
            "title": title,
            "qid": None,  # filled in by the caller
            "date": item["timestamp"][:6],  # YYYYMM
            "views": item["views"]
        })

    return results
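
The function above returns an empty list for every non-200 response, so a rate-limited request (HTTP 429) cannot be told apart from a missing article. One possible refinement, sketched here under assumptions, is to retry transient failures with exponential backoff before giving up. The helper name `get_with_retries` and its parameters are hypothetical and not part of the original notebook; `get` stands for any callable with the same call shape as `requests.get`.

```python
import time


def get_with_retries(get, url, headers=None, max_retries=3, backoff=1.0):
    """Hypothetical helper: call get(url, ...) and retry on HTTP 429 or 5xx.

    `get` is any callable shaped like requests.get (assumption); passing it in
    keeps the retry logic testable without network access.
    Returns the last response; the caller still inspects status_code.
    """
    response = None
    for attempt in range(max_retries + 1):
        response = get(url, headers=headers, timeout=30)
        retryable = response.status_code == 429 or response.status_code >= 500
        if not retryable or attempt == max_retries:
            break
        # Wait 1x, 2x, 4x, ... the base delay before the next attempt
        time.sleep(backoff * (2 ** attempt))
    return response
```

Inside `fetch_monthly_pageviews`, the request line could then read `r = get_with_retries(requests.get, url, headers=headers)`. A 404 is returned immediately, since retrying a missing article would only add load.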


<a id="sec4"></a>
### Execute Pageview Requests

In [7]:
# Generative AI was used below to assist with the setup of the request loop

# Add page_title column
df["page_title"] = df["article_url"].str.split("/wiki/").str[-1]

all_views = []

for title, qid in tqdm.tqdm(zip(df.page_title, df.qid), total=len(df)):
    monthly = fetch_monthly_pageviews(title)
    
    # assign qid to each row
    for item in monthly:
        item["qid"] = qid

    all_views.extend(monthly)
    
    time.sleep(0.15)   # Necessary to avoid API rate-limit


100%|██████████| 1102/1102 [11:45<00:00,  1.56it/s] 
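
The loop above silently skips any article for which the fetch returns no rows. A minimal sketch of one way to keep track of those articles follows; the wrapper name `collect_pageviews` and its `fetch` and `pause` parameters are hypothetical, and `fetch` is assumed to behave like `fetch_monthly_pageviews` (a list of row dicts, empty on failure).

```python
import time


def collect_pageviews(pairs, fetch, pause=0.15):
    """Hypothetical wrapper around the request loop.

    pairs: iterable of (page_title, qid) tuples
    fetch: function returning a list of row dicts for a title
    Returns (rows, missing), where missing lists the (title, qid)
    pairs for which fetch returned nothing.
    """
    rows, missing = [], []
    for title, qid in pairs:
        monthly = fetch(title)
        if not monthly:
            # Record the article so it can be inspected or re-requested later
            missing.append((title, qid))
        for item in monthly:
            item["qid"] = qid
        rows.extend(monthly)
        time.sleep(pause)  # same pause between requests as the loop above
    return rows, missing
```

With the notebook's names this could be called as `all_views, missing = collect_pageviews(zip(df.page_title, df.qid), fetch_monthly_pageviews)`, after which `missing` shows which economists contribute no rows to the final dataset.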


In [8]:
views_df = pd.DataFrame(all_views)
views_df.head()


| | title | qid | date | views |
|---|---|---|---|---|
| 0 | Edith_Abbott | Q272731 | 201701 | 909 |
| 1 | Edith_Abbott | Q272731 | 201702 | 1005 |
| 2 | Edith_Abbott | Q272731 | 201703 | 1461 |
| 3 | Edith_Abbott | Q272731 | 201704 | 901 |
| 4 | Edith_Abbott | Q272731 | 201705 | 801 |
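
January 2017 through November 2025 spans 107 calendar months, so an article with complete history should contribute 107 rows to this long-format table. The sketch below shows one way to check coverage per article. The function names are hypothetical, and the column names `qid` and `date` are taken from the output above.

```python
import pandas as pd


def months_per_article(views_df):
    """Count the distinct months present for each qid in the long-format table."""
    return views_df.groupby("qid")["date"].nunique()


def expected_month_count(start="2017-01", end="2025-11"):
    """Number of calendar months from start to end, both inclusive."""
    return len(pd.period_range(start=start, end=end, freq="M"))


# Tiny illustrative frame (made-up qids and view counts, not project data)
sample = pd.DataFrame({
    "qid": ["Q1", "Q1", "Q1", "Q2"],
    "date": ["201701", "201702", "201703", "201701"],
    "views": [5, 7, 9, 3],
})
counts = months_per_article(sample)
incomplete = counts[counts < expected_month_count()]
```

Applied to `views_df`, the `incomplete` series would list the articles with fewer months than expected, for example pages created after January 2017.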


In [10]:
views_df.to_csv("../Data/economists_monthly_pageviews_2017_2025.csv", index=False)

In [12]:
meta = pd.read_csv("../Data/economists_cleaned.csv")
views = pd.read_csv("../Data/economists_monthly_pageviews_2017_2025.csv")

<a id="sec5"></a>
### Final Dataset Merge

In [14]:
merged = views.merge(
    meta,
    on="qid",
    how="left"
)

In [15]:
merged.shape

(114905, 12)
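
The left merge keeps every pageview row, but it assumes that each `qid` appears only once in the metadata. If a `qid` were duplicated there, the matching rows would be multiplied without warning. As a sketch, pandas can check both assumptions through the `validate` and `indicator` parameters of `merge`. The wrapper name `merge_views_with_meta` is hypothetical.

```python
import pandas as pd


def merge_views_with_meta(views, meta):
    """Left-join pageviews onto metadata and check the join assumptions.

    validate="many_to_one" raises pandas.errors.MergeError if a qid is
    duplicated in meta. indicator=True adds a `_merge` column that marks
    pageview rows with no metadata match as "left_only".
    Returns (merged, unmatched).
    """
    merged = views.merge(
        meta,
        on="qid",
        how="left",
        validate="many_to_one",
        indicator=True,
    )
    unmatched = merged[merged["_merge"] == "left_only"]
    return merged.drop(columns="_merge"), unmatched
```

An empty `unmatched` frame and no `MergeError` would indicate that the merged row count equals the number of pageview rows and that every row carries metadata.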

<a id="sec6"></a>
### Data Export

In [17]:
merged.to_csv("../Data/economists_final_dataset.csv", index=False)