# ORES Ranking Notebook

This notebook queries Wikimedia's ORES article quality model to obtain a quality prediction for each Wikipedia article.

In [2]:
import os, json, time, urllib.parse
from dotenv import load_dotenv
import requests
import pandas as pd
from tqdm import tqdm

In [None]:
# Loading api keys from .env
load_dotenv()

Before we can query ORES, we need the most recent revision id of each page, which we can get from the Wikipedia API. The function below calls the API with an article title and returns the corresponding revision id.

In [49]:
def get_article_revid(article_title):
    """
    Extracts the most recent revision id for the wikipedia article
    that is specified by the `article_title`.

    Parameters
    ----------
    article_title : str
        The title of the wikipedia article
    
    Returns
    -------
    int
        The id of the article's most recent revision (`lastrevid`)

    """
    session = requests.Session()

    URL = "https://en.wikipedia.org/w/api.php"

    PARAMS = {
        "action": "query",
        "format": "json",
        "titles": article_title,
        "prop": "info",
        "inprop": "url|talkid"
    }

    response = session.get(url=URL, params=PARAMS)
    response_dict = response.json()

    pages = response_dict["query"]["pages"]

    # Extracting the most recent revision id
    for _ , v in pages.items():
        return v['lastrevid']
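
The function above sends one request per title. The MediaWiki query API also accepts several titles in one request, separated by `|` (up to 50 for regular clients), which could shorten the lookup considerably. The following is a minimal sketch of that idea, not the approach used in this notebook: the helper names `chunk_titles`, `build_batch_params` and `extract_revids` are hypothetical, and the sample response is made up. Note that the API may normalize titles, so the returned title can differ slightly from the one requested.

```python
def chunk_titles(titles, batch_size=50):
    """Yield lists of at most `batch_size` titles (hypothetical helper)."""
    for start in range(0, len(titles), batch_size):
        yield titles[start:start + batch_size]


def build_batch_params(titles):
    """Build the query parameters for one batched request; titles are joined with '|'."""
    return {
        "action": "query",
        "format": "json",
        "titles": "|".join(titles),
        "prop": "info",
    }


def extract_revids(response_dict):
    """Map each returned page title to its `lastrevid`, or None if the page has none."""
    pages = response_dict.get("query", {}).get("pages", {})
    return {page.get("title"): page.get("lastrevid") for page in pages.values()}


# Made-up response for illustration: one existing page and one missing page
sample_response = {
    "query": {
        "pages": {
            "1": {"title": "A", "lastrevid": 10},
            "-1": {"title": "B", "missing": ""},
        }
    }
}
print(extract_revids(sample_response))
```

Each batch produced by `chunk_titles` would be passed to `session.get(url=URL, params=build_batch_params(batch))`, and the resulting JSON to `extract_revids`.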

The next step is to go through each row of the `us_cities_by_state_SEPT.2023.csv` file and look up the revision id of each article. We store the extracted revision ids in a new csv file under the `data_clean` folder.

In [53]:
df = pd.read_csv("./data_raw/us_cities_by_state_SEPT.2023.csv")

with open("./data_clean/us_cities_revid.csv", "w+") as f:
    f.write("state,page_title,revision_id,url\n")
    for city in tqdm(df.itertuples()):
        revision_id = get_article_revid(city.page_title)
        f.write(f'"{city.state}","{city.page_title}","{revision_id}","{city.url}"\n')

22157it [1:36:37,  3.82it/s]
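
The f-string above wraps every field in double quotes but does not escape quotes that appear inside a field, so a title containing `"` would produce a malformed row. A minimal sketch of an alternative, assuming the same four columns, uses the standard library's `csv` module to handle the quoting. The function name `write_revid_rows` and the sample row are illustrative only.

```python
import csv
import io


def write_revid_rows(file_obj, rows):
    """Write (state, page_title, revision_id, url) tuples as fully quoted CSV rows."""
    writer = csv.writer(file_obj, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["state", "page_title", "revision_id", "url"])
    for row in rows:
        writer.writerow(row)


# Made-up row written to an in-memory buffer instead of a file
buffer = io.StringIO()
write_revid_rows(buffer, [("Ohio", 'Some "Quoted" Town, Ohio', 123, "https://example.org/x")])
print(buffer.getvalue())
```

In the notebook, `file_obj` would be the handle returned by `open(..., "w", newline="")`.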


Now that all of the revision ids are extracted and stored, we can start classifying each article via ORES. The starter code provided by Dr. McDonald gives us an easy way to call the ORES API. First we define some key parameters.

In [107]:
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}
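
The throttle constants above work out to a wait of roughly 10 ms per request, which treats the 5,000-request allowance as a per-minute quota. Whether the allowance applies per minute or per hour depends on the token that was granted, so the per-hour case below is an assumption to check against your own token. The helper `throttle_wait` is a hypothetical name used only for this sketch.

```python
def throttle_wait(requests_allowed, window_seconds, latency_assumed=0.002):
    """Seconds to sleep between requests so `requests_allowed` fit in `window_seconds`."""
    return max(0.0, window_seconds / requests_allowed - latency_assumed)


per_minute = throttle_wait(5000, 60.0)     # the interpretation used in the constants above
per_hour = throttle_wait(5000, 3600.0)     # assumption: the same allowance spread over an hour
print(f"{per_minute:.3f} s vs {per_hour:.3f} s between requests")
```

If the allowance turns out to be hourly, the shorter wait would exceed it quickly on a run of this size, which is one possible explanation for failed requests.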

Then we define the `request_ores_score_per_article` function that Dr. McDonald wrote. We can call this function on each article given its revision id and our own credentials.

In [108]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
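    #    Copy the templates so that the shared module-level defaults are not mutated between calls
    request_data = dict(request_data)
    header_params = dict(header_params)

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call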
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None

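    # Extracting the score from the json response. A failed request leaves json_response
    # as None, so raise explicitly and let the caller record the article as a failure.
    if json_response is None:
        raise ValueError(f"No valid JSON response for revision id {article_revid}")
    score = json_response["enwiki"]["scores"][str(article_revid)]["articlequality"]["score"]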
    return score
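
The lookup at the end of the function assumes a well-formed response. A minimal sketch of a more defensive lookup follows, assuming the same nested path and the `prediction` key that the later cells read. The helper name `extract_prediction` is hypothetical and the sample response is made up for illustration.

```python
def extract_prediction(json_response, rev_id, wiki="enwiki", model="articlequality"):
    """Return the predicted quality class, or None if the response lacks a score."""
    try:
        return json_response[wiki]["scores"][str(rev_id)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        # KeyError: an expected key is absent (for example an error payload)
        # TypeError: json_response is None or not a nested dict
        return None


# Made-up response shaped like the path used in the function above
sample = {"enwiki": {"scores": {"12345": {"articlequality": {"score": {"prediction": "GA"}}}}}}
print(extract_prediction(sample, 12345))
print(extract_prediction(None, 12345))
```

Returning `None` instead of raising lets the calling loop decide how to log a failed article without relying on a broad `except`.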


To start classifying each of the articles we need to load in the csv file that contains the stored revision ids.

In [90]:
us_cities_df = pd.read_csv("./data_intermediate/us_cities_revid.csv")

In [104]:
wiki_access_token = os.getenv("wiki_access_token")
with open("./data_clean/us_cities_score.csv", "w+") as f1, open("./data_clean/us_cities_score_failures.csv", "w+") as f2:
    f1.write("state,page_title,revision_id,score,url\n")
    f2.write("state,page_title,revision_id,url\n")
    for row in tqdm(us_cities_df.itertuples()):
        try:
            score = request_ores_score_per_article(
                article_revid=row.revision_id,
                email_address="evan@yipsite.net",
                access_token=wiki_access_token)
            prediction = score['prediction']
            f1.write(f'"{row.state}","{row.page_title}","{row.revision_id}","{prediction}","{row.url}"\n')
        except Exception as e:
            f2.write(f'"{row.state}","{row.page_title}","{row.revision_id}","{row.url}"\n')

19935it [3:36:36,  1.28it/s]

Expecting value: line 1 column 1 (char 0)


20997it [3:42:39,  1.09s/it]

Expecting value: line 1 column 1 (char 0)


22157it [4:04:01,  1.51it/s]


Above, it appears that a number of the articles failed. Since there were too many to display in the output, we can analyze this by reading in the saved output and comparing the articles in each file to the starting csv.

In [None]:
score_df = pd.read_csv("../data_intermediate/us_cities_score.csv")
score_df.shape

(18961, 5)
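
The comparison described above can be sketched as a set difference on revision ids. This is a minimal illustration rather than the exact check used here: `find_unscored` is a hypothetical helper and the ids are made up. It accepts any iterable, so columns such as `us_cities_df["revision_id"]` and `score_df["revision_id"]` could be passed directly.

```python
def find_unscored(all_revision_ids, scored_revision_ids):
    """Return the sorted revision ids present in the first collection but not the second."""
    return sorted(set(all_revision_ids) - set(scored_revision_ids))


# Made-up ids for illustration
missing = find_unscored([101, 102, 103, 104], [101, 103])
print(missing)
```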

After running the following cell three times and updating the throttle wait time, we have successfully extracted scores for all of the articles. 

In [113]:
# Loading in the failures
failures_df = pd.read_csv("../data_clean/us_cities_score_failures.csv")

# Running ORES again, this time appending to the scores file, but overwriting the fails
wiki_access_token = os.getenv("wiki_access_token")
with open("./data_clean/us_cities_score.csv", "a") as f1, open("./data_clean/us_cities_score_failures.csv", "w+") as f2:
    f2.write("state,page_title,revision_id,url\n")
    for row in tqdm(failures_df.itertuples()):
        try:
            score = request_ores_score_per_article(
                article_revid=row.revision_id,
                email_address="evan@yipsite.net",
                access_token=wiki_access_token)
            prediction = score['prediction']
            f1.write(f'"{row.state}","{row.page_title}","{row.revision_id}","{prediction}","{row.url}"\n')
        except Exception as e:
            f2.write(f'"{row.state}","{row.page_title}","{row.revision_id}","{row.url}"\n')

0it [00:00, ?it/s]
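
Rerunning the cell by hand works, but the same idea can be automated. Below is a minimal sketch of a retry wrapper with an increasing wait between attempts. The name `score_with_retries` and its parameters are hypothetical, and `score_fn` stands in for a call to `request_ores_score_per_article` with the credentials filled in.

```python
import time


def score_with_retries(score_fn, rev_id, max_attempts=3, base_wait=1.0, sleep=time.sleep):
    """Call `score_fn(rev_id)` until it succeeds or `max_attempts` is reached.

    Returns the result, or None if every attempt raised an exception.
    """
    for attempt in range(max_attempts):
        try:
            return score_fn(rev_id)
        except Exception:
            # Wait longer after each failure: base_wait, 2*base_wait, 4*base_wait, ...
            if attempt < max_attempts - 1:
                sleep(base_wait * 2 ** attempt)
    return None
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real delays. Articles for which it returns `None` would still be written to the failures file.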


For formatting purposes, we reload the stored scores and sort them alphabetically by state and then by page title.

In [6]:
final_scores = pd.read_csv("../data_intermediate/us_cities_score.csv")
final_scores = final_scores.sort_values(["state", "page_title"])
final_scores.tail()

Unnamed: 0,state,page_title,revision_id,score,url
18956,Wyoming,"Wamsutter, Wyoming",1169591845,GA,"https://en.wikipedia.org/wiki/Wamsutter,_Wyoming"
18957,Wyoming,"Wheatland, Wyoming",1176370621,GA,"https://en.wikipedia.org/wiki/Wheatland,_Wyoming"
18958,Wyoming,"Worland, Wyoming",1166347917,GA,"https://en.wikipedia.org/wiki/Worland,_Wyoming"
18959,Wyoming,"Wright, Wyoming",1166334449,GA,"https://en.wikipedia.org/wiki/Wright,_Wyoming"
18960,Wyoming,"Yoder, Wyoming",1171182284,C,"https://en.wikipedia.org/wiki/Yoder,_Wyoming"


The resulting dataframe appears to be sorted correctly, since Wyoming is at the very bottom and the page titles are also in alphabetical order. The last step is to save the sorted dataframe to a csv file.

In [None]:
final_scores.to_csv("../data_clean/us_cities_score_sorted.csv", index=False)