# ORES API Example
This example illustrates how to generate quality scores for article revisions using [ORES](https://www.mediawiki.org/wiki/ORES). It shows how to request a score for a specific revision, where the score provides probabilities for all of the possible article quality levels. The API documentation can be accessed from the main [ORES](https://ores.wikimedia.org) page. However, that documentation is sparse, so you may have to dig around if you want more information.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - May 13, 2022



In [10]:
# 
# These are standard python modules
import json, time, urllib.parse, os
#
# The 'requests', 'tqdm', and 'pandas' modules are not standard Python modules. You will need to install them with pip/pip3 if you do not already have them
import requests
from tqdm import tqdm
import pandas as pd

The example relies on some constants that help make the code a bit more readable.

In [3]:
#########
#
#    CONSTANTS
#

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<amb7896@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
with open(os.path.join("..","data","latest_revision_id.json"), "r") as infile:
    ARTICLE_REVISIONS = json.load(infile)

# ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}
#
# The current ML models for English wikipedia are:
#   "articlequality"
#   "articletopic"
#   "damaging"
#   "version"
#   "draftquality"
#   "drafttopic"
#   "goodfaith"
#   "wp10"
#
# The specific documentation on these is scattered so if you want to use one you'll have to look around.
#
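
As a quick check on the throttling constants, the following sketch works through the arithmetic. It assumes the intent is a ceiling of roughly 100 requests per second; the helper name `max_requests_per_second` is made up for this illustration and is not part of the notebook.

```python
# Constants copied from the cell above
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED


def max_requests_per_second(wait, latency):
    """Hypothetical helper: rate ceiling if each call costs wait + latency seconds."""
    return 1.0 / (wait + latency)


rate = max_requests_per_second(API_THROTTLE_WAIT, API_LATENCY_ASSUMED)
print(f"sleep per request: {API_THROTTLE_WAIT:.3f} s")
print(f"rate ceiling: {rate:.0f} requests/s")
```

If the real round trip is slower than the assumed 2 ms, the achieved rate will fall below this ceiling, which only makes the throttle more conservative.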

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article revisions. Therefore, the main parameter is article_revid.

In [4]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into a copy of the template, so the shared default dictionary is not modified
    request_template = dict(request_template, revid=article_revid)
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
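
To see which URL the procedure requests without touching the network, the URL-building step can be reproduced on its own. This is a minimal sketch: the helper name `build_ores_url` is hypothetical, and the revision ID is the sample 'Bison' value from the commented-out dictionary in the constants cell.

```python
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"
ORES_PARAMS_TEMPLATE = {"context": "enwiki", "revid": "", "model": "articlequality"}


def build_ores_url(article_revid, features=False):
    """Hypothetical helper mirroring the URL construction in request_ores_score_per_article."""
    # copy the template so the shared dictionary is left untouched
    params = dict(ORES_PARAMS_TEMPLATE, revid=article_revid)
    url = API_ORES_SCORE_ENDPOINT + API_ORES_SCORE_PARAMS.format(**params)
    if features:
        url += "?features=true"
    return url


url = build_ores_url(1085687913)
print(url)
```

Printing the URL before a long run is a cheap way to confirm that the context, revision ID, and model land in the expected path segments.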


In [6]:
all_articles = list(ARTICLE_REVISIONS.keys())
ores_retrieval_failed_articles = []
article_ores_predictions = []

for j in tqdm(range(len(all_articles))):
    try:
        ARTICLE = all_articles[j]
        score = request_ores_score_per_article(ARTICLE_REVISIONS[ARTICLE])
        for revision_id in score["enwiki"]["scores"].items():
            score_intermediate = revision_id[0]
        article_prediction = score["enwiki"]["scores"][score_intermediate] \
        ["articlequality"]["score"]["prediction"]
        article_ores_predictions.append(article_prediction)
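    except Exception as e:
        print("Error occurred for '%s' with revid: %s" % (ARTICLE, ARTICLE_REVISIONS[ARTICLE]))
        ores_retrieval_failed_articles.append(ARTICLE)
        # keep the predictions list aligned with all_articles so the DataFrame below can still be built
        article_ores_predictions.append(None)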

100%|##########| 7526/7526 [33:22<00:00,  3.76it/s]  


Getting ORES scores for 'Denis Walker' with revid: 1111257734
{
    "enwiki": {
        "models": {
            "articlequality": {
                "version": "0.9.2"
            }
        },
        "scores": {
            "1111257734": {
                "articlequality": {
                    "score": {
                        "prediction": "C",
                        "probability": {
                            "B": 0.09057406471211663,
                            "C": 0.6165301794361601,
                            "FA": 0.004627285684019312,
                            "GA": 0.009809443335344838,
                            "Start": 0.2737761784549897,
                            "Stub": 0.004682848377369667
                        }
                    }
                }
            }
        }
    }
}
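
The prediction can be pulled out of a response like the one above with a few dictionary lookups. The sketch below uses an abbreviated copy of that sample response (probabilities rounded to four places); the function name `extract_prediction` is an illustrative assumption, not part of the ORES API.

```python
import json

SAMPLE_RESPONSE = json.loads("""
{"enwiki": {"scores": {"1111257734": {"articlequality": {"score": {
    "prediction": "C",
    "probability": {"B": 0.0906, "C": 0.6165, "FA": 0.0046,
                    "GA": 0.0098, "Start": 0.2738, "Stub": 0.0047}
}}}}}}
""")


def extract_prediction(response, revid, context="enwiki", model="articlequality"):
    """Return (predicted label, probability of that label) for one revision."""
    # revision ids appear as string keys in the JSON response
    score = response[context]["scores"][str(revid)][model]["score"]
    label = score["prediction"]
    return label, score["probability"][label]


label, prob = extract_prediction(SAMPLE_RESPONSE, 1111257734)
print(label, prob)
```

Keeping the probability alongside the label makes it possible to tell a confident prediction from a borderline one later in the analysis.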


### ORES returned a score for every politician article

## Including ORES data for every politician article

In [11]:
# Joining ORES scores with article titles
politicians_ores_df = pd.DataFrame({
    "article_title":all_articles,
    "revision_id":list(ARTICLE_REVISIONS.values()),
    "article_quality":article_ores_predictions
})

In [12]:
politicians_ores_df.head()

Unnamed: 0,article_title,revision_id,article_quality
0,Shahjahan Noori,1099689043,GA
1,Abdul Ghafar Lakanwal,943562276,Start
2,Majah Ha Adrif,852404094,Start
3,Haroon al-Afghani,1095102390,B
4,Tayyab Agha,1104998382,Start


In [13]:
politicians_wiki_df = pd.read_csv(os.path.join("..","data","politicians_by_country_SEPT.2022.csv"))
politicians_wiki_df.head()

Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan


In [14]:
# Merging politician scores and politician wiki data to add country column
politicians_ores_df = pd.merge(left= politicians_wiki_df,
                               right= politicians_ores_df,
                               left_on= "name",
                               right_on= "article_title",
                               how = "inner")

In [17]:
politicians_ores_df.head()

Unnamed: 0,name,url,country,article_title,revision_id,article_quality
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan,Shahjahan Noori,1099689043,GA
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan,Abdul Ghafar Lakanwal,943562276,Start
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,Majah Ha Adrif,852404094,Start
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,Haroon al-Afghani,1095102390,B
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,Tayyab Agha,1104998382,Start


In [18]:
# Drop columns not required in final schema - name and url
politicians_ores_df.drop(labels = ['name','url'],inplace = True,axis = 1)

In [19]:
politicians_ores_df.head()

Unnamed: 0,country,article_title,revision_id,article_quality
0,Afghanistan,Shahjahan Noori,1099689043,GA
1,Afghanistan,Abdul Ghafar Lakanwal,943562276,Start
2,Afghanistan,Majah Ha Adrif,852404094,Start
3,Afghanistan,Haroon al-Afghani,1095102390,B
4,Afghanistan,Tayyab Agha,1104998382,Start


## Merge the previously created Wikipedia scores with the population data

In [20]:
# Load the population data; the region column is derived from it in a later cell
population_df = pd.read_csv(os.path.join("..","data","population_by_country_2022.csv"))
population_df.head()

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5


In [25]:
# Create a new region column by copying the Geography values that are entirely uppercase
# (those rows are regions). Then forward-fill the NaN values so that each country row
# inherits the region listed above it.
population_df["region"] = population_df.Geography[population_df.Geography.str.isupper()]
population_df["region"] = population_df.region.ffill(axis=0)
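
The region-labelling step can be tried on a small frame to see its effect. This sketch uses the first few rows of the population table; `Series.where` is used here in place of boolean indexing, which is a stylistic substitution that gives the same result.

```python
import pandas as pd

population_df = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Population (millions)": [7963.0, 1419.0, 251.0, 44.9, 103.5],
})

# rows whose name is entirely uppercase are regions, not countries
is_region = population_df["Geography"].str.isupper()

# keep the name on region rows, NaN elsewhere, then forward-fill down the column
population_df["region"] = population_df["Geography"].where(is_region).ffill()

print(population_df)
```

Because the fill runs top to bottom, this relies on the file listing each region immediately before its member countries.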

In [26]:
# Validate that the region column is populated as expected
population_df.head(10)

Unnamed: 0,Geography,Population (millions),region
0,WORLD,7963.0,WORLD
1,AFRICA,1419.0,AFRICA
2,NORTHERN AFRICA,251.0,NORTHERN AFRICA
3,Algeria,44.9,NORTHERN AFRICA
4,Egypt,103.5,NORTHERN AFRICA
5,Libya,6.8,NORTHERN AFRICA
6,Morocco,36.7,NORTHERN AFRICA
7,Sudan,46.9,NORTHERN AFRICA
8,Tunisia,11.8,NORTHERN AFRICA
9,Western Sahara,0.6,NORTHERN AFRICA


In [35]:
# Merging politician scores and population data to add region and population column
politicians_final_df = pd.merge(left= politicians_ores_df,
                               right= population_df,
                               left_on= "country",
                               right_on= "Geography",
                               how = "inner")
politicians_final_df.rename(columns={"Population (millions)":"population"},inplace=True)
politicians_final_df.head(20)

Unnamed: 0,country,article_title,revision_id,article_quality,Geography,population,region
0,Afghanistan,Shahjahan Noori,1099689043,GA,Afghanistan,41.1,SOUTH ASIA
1,Afghanistan,Abdul Ghafar Lakanwal,943562276,Start,Afghanistan,41.1,SOUTH ASIA
2,Afghanistan,Majah Ha Adrif,852404094,Start,Afghanistan,41.1,SOUTH ASIA
3,Afghanistan,Haroon al-Afghani,1095102390,B,Afghanistan,41.1,SOUTH ASIA
4,Afghanistan,Tayyab Agha,1104998382,Start,Afghanistan,41.1,SOUTH ASIA
5,Afghanistan,Ahmadullah Wasiq,1109361754,Start,Afghanistan,41.1,SOUTH ASIA
6,Afghanistan,Aziza Ahmadyar,1087211008,Start,Afghanistan,41.1,SOUTH ASIA
7,Afghanistan,Muqadasa Ahmadzai,1082489593,Start,Afghanistan,41.1,SOUTH ASIA
8,Afghanistan,Mohammad Sarwar Ahmedzai,1038918070,Start,Afghanistan,41.1,SOUTH ASIA
9,Afghanistan,Amir Muhammad Akhundzada,1069322182,Start,Afghanistan,41.1,SOUTH ASIA


In [36]:
# Drop columns not required in final schema - Geography
politicians_final_df.drop(labels = ['Geography'],inplace = True,axis = 1)
politicians_final_df.columns

Index(['country', 'article_title', 'revision_id', 'article_quality',
       'population', 'region'],
      dtype='object')

In [41]:
# Write the consolidated data to a single CSV file
politicians_final_df.to_csv(os.path.join("..","data","wp_politicians_by_country.csv"),
                            index=False)

In [39]:
# Identify the countries in the politician data that have no match in the population data,
# and write them to a file with each country on a separate line
no_match_countries = politicians_ores_df.country[~politicians_ores_df.country\
                           .isin(politicians_final_df.country.unique())].unique()
# open file in write mode
with open(os.path.join("..","data","wp_countries-no_match.txt"), 'w') as fp:
    for item in no_match_countries:
        # write each item on a new line
        fp.write("%s\n" % item)
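
The cell above covers one direction: countries in the politician data with no population entry. As a hedged sketch of how both directions could be checked at once, an outer merge with `indicator=True` labels each row by where it came from. The toy data below is invented for illustration ('Atlantis' is deliberately not a real country), and the variable names are assumptions.

```python
import pandas as pd

# invented toy data, for illustration only
politicians = pd.DataFrame({"country": ["Afghanistan", "Afghanistan", "Atlantis"],
                            "article_title": ["A", "B", "C"]})
population = pd.DataFrame({"Geography": ["Afghanistan", "Egypt"],
                           "population": [41.1, 103.5]})

merged = politicians.merge(population, left_on="country", right_on="Geography",
                           how="outer", indicator=True)

# countries with politician articles but no population row
no_population = sorted(merged.loc[merged["_merge"] == "left_only", "country"].unique())
# countries with a population row but no politician articles
no_politicians = sorted(merged.loc[merged["_merge"] == "right_only", "Geography"].unique())

print(no_population, no_politicians)
```

On the real population table, region rows such as AFRICA would also show up as `right_only` and would need to be filtered out first.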

### Switch to article_analysis.ipynb for Step 4: Analysis and Step 5: Results