## Purpose

The purpose of this code is to acquire data on article quality predictions from ORES. 

The article source data comes from English Wikipedia, the text of which is licensed under the Creative Commons Attribution-ShareAlike license (https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License).

We will be using the MediaWiki Action API for English Wikipedia (the api.php endpoint). More information on the API is available at https://www.mediawiki.org/wiki/API:Main_page. The following link may also be helpful for further documentation: https://www.mediawiki.org/wiki/API:Info

We will leverage code developed by Dr. David W. McDonald for use in DATA 512, which is provided under the Creative Commons CC-BY license (https://creativecommons.org/ and https://creativecommons.org/licenses/by/4.0/). The file can be found at this link: https://drive.google.com/drive/folders/1FtvWV31DHE8HIMdEsPGuCXPz0PMvShfl

We will also be using the ORES API. Information on the API itself can be found at https://www.mediawiki.org/wiki/ORES, with original API documentation from https://ores.wikimedia.org/docs and new LiftWing documentation from https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage.

We will begin by importing the Python libraries we need. Note that requests and pandas are third-party packages and must be installed separately.

In [2]:
#Import python libraries
import json
import time
import urllib.parse
import requests
import pandas as pd

Next we will read in the us_cities_by_state_SEPT.2023.csv file from raw_data to create a list of the articles we want to feed into ORES.

In [51]:
#Read in info using pandas
cities_by_st_df = pd.read_csv('../raw_data/us_cities_by_state_SEPT.2023.csv')
cities_by_st_df.head()

#Get a list of page_titles
page_titles = list(cities_by_st_df['page_title'])

#Check if all the page titles were captured
if len(page_titles) == len(cities_by_st_df):
    print("All {0} page titles captured.".format(len(page_titles)))
else:
    print("Not all page titles captured")
    
#Check if page titles are unique
if len(page_titles) == len(cities_by_st_df['page_title'].unique()):
    print("All {0} page titles are unique.".format(len(page_titles)))
else:
    print("WARNING: Page titles are not unique. "+
          "There are {0} page titles and {1} unique titles - ".format(
              len(page_titles), len(cities_by_st_df['page_title'].unique()))
          + "a delta of {0}.".format(len(page_titles)-
                                len(cities_by_st_df['page_title'].unique())))

All 22157 page titles captured.


The user may notice a warning that not all page titles are unique. We will remove duplicated rows from the scraped file us_cities_by_state_SEPT.2023.csv to reduce later processing.

In [52]:
#Removing dupes
cities_dedupe = cities_by_st_df.drop_duplicates()

#Rebuilding the page titles list
page_titles_dedupe = list(cities_dedupe['page_title'])

#Rechecking if page titles are unique
if len(page_titles_dedupe) == len(cities_dedupe['page_title'].unique()):
    print("All {0} page titles are unique.".format(len(page_titles_dedupe)))
else:
    print("WARNING: Page titles are not unique. "+
          "There are {0} page titles and {1} unique titles - ".format(
              len(page_titles_dedupe),
              len(cities_dedupe['page_title'].unique()))
          + "a delta of {0}.".format(len(page_titles_dedupe)-
                                len(cities_dedupe['page_title'].unique())))



It is possible that page titles are still not unique. We will visually inspect these page titles to see if there is any way to reduce later processing.

In [53]:
#Getting list of non-unique city names
#The following code was taken from Stack Overflow (https://stackoverflow.com/questions/9835762/how-do-i-find-the-duplicates-in-a-list-and-create-another-list-with-them)
import collections
dupe_titles = [item for item, count in collections.Counter(page_titles_dedupe).items() if count > 1]
print(dupe_titles)

['2020 United States census', '2010 United States census', 'County (United States)', 'Population']
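
Because drop_duplicates() with no arguments compares entire rows, a title can survive row-level deduplication whenever its rows differ in some other column. The sketch below reproduces that effect with plain Python tuples. The (state, page_title) rows are made-up illustrative values, not rows taken from the source file, and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical (state, page_title) rows, for illustration only
rows = [
    ("Alabama", "Abbeville, Alabama"),
    ("Alabama", "Abbeville, Alabama"),          # exact duplicate row
    ("Alabama", "2020 United States census"),
    ("Alaska", "2020 United States census"),    # same title, different state
]

# Row-level deduplication (comparable to drop_duplicates() with no arguments),
# preserving the original order
unique_rows = list(dict.fromkeys(rows))

# Titles that still occur more than once after row-level deduplication
title_counts = Counter(title for _, title in unique_rows)
repeated_titles = [title for title, count in title_counts.items() if count > 1]

print(len(unique_rows))
print(repeated_titles)
```

The exact duplicate row is dropped, but the title that appears under two different states remains repeated, which is the situation the Counter check above detects.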


In the file originally obtained for this analysis, some of the "cities" listed as article titles are not actually cities at all (e.g., "Population" or "2020 United States census"). We will remove those non-cities.

In [54]:
#Removing non-cities from the deduped city file
cities_final_df = cities_dedupe.loc[~cities_dedupe['page_title'].isin(dupe_titles)]

We will save the cleaned states, page titles, and URLs to a new file in the raw_data folder.

In [55]:
#Saving cities_final_df
cities_final_df.to_csv('../raw_data/clean_us_cities_by_state.csv')

Now we will access the page info from the MediaWiki Action API for English Wikipedia articles. This code was taken from the file provided by Professor McDonald. We will begin by copying the code which sets up the constants for the API call. Some changes were made to keep comments to one line and to rename the example list of articles.

In [7]:
# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests -
# we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique
# to the person making the request
REQUEST_HEADERS = {
    'User-Agent': '<ekrolen@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# Example article list
ex_article_titles = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned; see
# the Info documentation for what can be included. If you don't want any,
# this can simply be the empty string
#PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # simplify - single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
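
The throttle constant above is just the time budget per request minus the latency each call is assumed to cost already. As a minimal sketch of that arithmetic, the helper below (throttle_wait is a hypothetical name, not part of the original code) recomputes the wait for the 1.0/100.0 ratio used here and for the 60.0/5000.0 ratio used later for the ORES constants.

```python
def throttle_wait(max_requests, per_seconds, assumed_latency=0.002):
    """Seconds to sleep before each request to stay under a rate limit.

    Hypothetical helper: divides the period by the allowed number of
    requests and subtracts the latency we assume each call already costs.
    Never returns a negative wait.
    """
    return max(0.0, (per_seconds / max_requests) - assumed_latency)

# Same ratio as API_THROTTLE_WAIT above: (1.0/100.0) - 0.002
wiki_wait = throttle_wait(max_requests=100, per_seconds=1.0)

# Same ratio as the later ORES constant: (60.0/5000.0) - 0.002
ores_wait = throttle_wait(max_requests=5000, per_seconds=60.0)

print(round(wiki_wait, 4), round(ores_wait, 4))
```

At about 0.008 seconds per call, the sleeps alone add roughly 172 seconds across 21515 page info requests. The rest of the run time comes from the requests themselves.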

Next we copy the API request procedure developed by Professor McDonald. Per his comments, "The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title."

In [8]:
#Page info requester
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

#The following code can be run to verify the API is pulling correctly
'''print(f"Getting page info data for: {ex_article_titles[3]}")
info = request_pageinfo_per_article(ex_article_titles[3])
print(json.dumps(info,indent=4))'''

Next we will create a dictionary which contains each of the article titles as keys and their revision ids (lastrevid from request_pageinfo_per_article) as values, for input into the ORES API.
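
As a rough sketch of the response shape this step relies on: a prop=info query for a single title returns a "pages" object keyed by page id, and lastrevid sits inside that one entry. The helper name extract_lastrevid and the sample values below are assumptions for illustration, not output from the API.

```python
def extract_lastrevid(info):
    """Return the lastrevid from a pageinfo response, or None if absent.

    Hypothetical helper: 'info' is assumed to be the parsed JSON returned
    by a prop=info query for a single title.
    """
    if not info:
        return None
    pages = info.get("query", {}).get("pages", {})
    for page in pages.values():
        # An entry without 'lastrevid' (such as a title the API reports
        # as missing) yields None instead of raising a KeyError
        return page.get("lastrevid")
    return None

# Illustrative response shape; the page id and revision id are made up
sample_info = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "ns": 0,
                "title": "Example article",
                "lastrevid": 987654321,
            }
        }
    }
}

print(extract_lastrevid(sample_info))
print(extract_lastrevid(None))
```

Returning None for a failed or incomplete response lets the calling loop record the title and move on rather than stopping partway through a long run.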

In [9]:
#Creating the final list of titles to pull
article_titles = list(cities_final_df['page_title'])

#Pulling lastrevid for all of the articles & saving in a dictionary
page_info_dict = {}
for i, title in enumerate(article_titles):
    info = request_pageinfo_per_article(title)
    if info is None:
        #The request failed; skip this title so the loop can continue
        print("No page info returned for " + title)
        continue
    #The response holds a single page entry keyed by its page id
    page = next(iter(info['query']['pages'].values()))
    page_info_dict[title] = page['lastrevid']
    if i%10 == 0:
        print(str(i)+" of "+str(len(article_titles))+" have been written")

0 of 21515 have been written
10 of 21515 have been written
20 of 21515 have been written
...
20550 of 21515 have been written

Due to the long run time pulling page information, we will save the results as a JSON file in the raw_data folder.

In [3]:
#Making the page info dict a JSON object
page_info_json_object = json.dumps(page_info_dict, indent = 4) 

#Writing to the file
with open('../raw_data/page_info.json', 'w') as outfile:
    outfile.write(page_info_json_object)

The cells in this notebook can be run consecutively, but the following code reloads the page info from file in case the programmer wants to take a break between getting the page info and requesting the ORES scores.

In [4]:
#Loading page info from file as a dictionary
with open('../raw_data/page_info.json') as page_info_file:
    page_info_dict = json.load(page_info_file)

Next we will need to make ORES requests for the pages using the article titles and last revision ids stored in page_info_dict. We will use the ORES scripts developed by Professor McDonald - for more information on ORES itself see the Purpose section of this document or the README in the home directory of this project. As with the other API pull, we will start by defining constants. The following block of code will allow you to write your email, username, and access token once for use later. Change any emails, usernames, and access tokens as necessary to reflect your own account.

In [9]:
email = ""
username = ""
access_token = ""

In [10]:
#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : email,    
    'access_token'  : access_token
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ex_article_revisions = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that it's English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = username
ACCESS_TOKEN = access_token
#

Now we will use the function written by Professor McDonald to make the ORES API request. He states, "The API request will be made using a function to encapsulate call and make access reusable in other notebooks. The procedure is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is 'article_revid'. One should be able to simply pass in a new article revision id on each call and get back a python dictionary as the result. A valid result will be a dictionary that contains the probabilities that the specific revision is one of six different article quality levels. Generally, quality level with the highest probability score is considered the quality level for the article. This can be tricky when you have two (or more) highly probable quality levels."

In [11]:
#ORES API puller fcn
def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
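
The function builds its headers by calling .format(**header_params) on every value in the header template. A minimal sketch of how that substitution behaves is below. The template strings and parameter values are made-up placeholders, not real credentials. str.format only fills in named placeholders such as {email_address}, so a value with no placeholder passes through unchanged.

```python
# Hypothetical templates and values, for illustration only
header_templates = {
    "User-Agent": "<{email_address}>, example project",
    "Content-Type": "application/json",          # no placeholder: unchanged
    "Authorization": "Bearer {access_token}",
}
header_params = {
    "email_address": "someone@example.org",
    "access_token": "abc123",
}

# Same loop shape as in request_ores_score_per_article
headers = {}
for key, template in header_templates.items():
    headers[key] = template.format(**header_params)

print(headers["User-Agent"])
print(headers["Authorization"])
```

One consequence is that the email_address and access_token arguments only reach the request when the template values contain the matching placeholders. A header string assembled by concatenation before the call keeps whatever values it had at definition time.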

The following code can be used to verify the ORES call runs correctly. As always, change the email and access token to reflect your own.

In [22]:
#   Which article - the key for the article dictionary defined above
article_title = "Bison"
#
print(f"Getting LiftWing ORES scores for '{article_title}' with revid: {ex_article_revisions[article_title]:d}")
#
#    Make the call, just pass in the article revision ID, email address, and access token
score = request_ores_score_per_article(article_revid=ex_article_revisions[article_title],
                                       email_address=email,
                                       access_token=ACCESS_TOKEN)
#
#    Output the result
print(json.dumps(score,indent=4))

Getting LiftWing ORES scores for 'Bison' with revid: 1085687913
{
    "enwiki": {
        "models": {
            "articlequality": {
                "version": "0.9.2"
            }
        },
        "scores": {
            "1085687913": {
                "articlequality": {
                    "score": {
                        "prediction": "FA",
                        "probability": {
                            "B": 0.07895665991827401,
                            "C": 0.03728215742560417,
                            "FA": 0.5629436065906797,
                            "GA": 0.30547854835374505,
                            "Start": 0.011061807252218824,
                            "Stub": 0.00427722045947826
                        }
                    }
                }
            }
        }
    }
}
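
The response above already includes a "prediction" field, but the probabilities show how close the runner-up is, which matters when two quality levels are both highly probable. The sketch below ranks the probabilities copied from the 'Bison' response and reports the gap between the top two. The helper name top_two is an assumption for illustration.

```python
# Probabilities copied from the 'Bison' response shown above
probabilities = {
    "B": 0.07895665991827401,
    "C": 0.03728215742560417,
    "FA": 0.5629436065906797,
    "GA": 0.30547854835374505,
    "Start": 0.011061807252218824,
    "Stub": 0.00427722045947826,
}

def top_two(probs):
    """Return the two most probable labels and the gap between them."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_p), (second_label, second_p) = ranked[0], ranked[1]
    return best_label, second_label, best_p - second_p

best, runner_up, margin = top_two(probabilities)
print(best, runner_up, round(margin, 3))
```

Here FA leads GA by a margin of about 0.26, so the prediction is fairly clear. A much smaller margin would be a reason to treat the predicted label with more caution.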


Now we will call ORES for each article. We will save the results after every 2000 articles to avoid losing work should the connection time out or go down.

In [130]:
#Get list of article titles
article_list = list(page_info_dict.keys())

#Setting up final dict, no predictions list, and article count
ores_scores = {}
no_prediction = []
key_error_list = []
article_count = 0

#Go through each article calling ORES. Capture the prediction in the final dict and periodically
# save it to the file, printing out how far you got in case it stops
for article in article_list:
    score = request_ores_score_per_article(article_revid=page_info_dict[article],
                                       email_address=email,
                                       access_token=ACCESS_TOKEN)
    if score is None:
        no_prediction.append(article)   
    else:
        try: 
            ores_scores[article] = score['enwiki']['scores'][str(page_info_dict[article])]['articlequality']['score']['prediction']
        except KeyError:    
            key_error_list.append(article)   
    
    #Printing tracker
    if article_count%10 == 0:
        print(str(article_count)+" of "+str(len(article_list))+" have been pulled")
    
    #Saving every 2000 articles
    if article_count%2000 == 0:
        ores_scores_json_object = json.dumps(ores_scores, indent = 4) 
        with open('../raw_data/ores_scores.json', 'w') as outfile:
            outfile.write(ores_scores_json_object)
        print(str(article_count)+" articles processed and written to file")
    
    #Incrementing tracker 
    article_count = article_count + 1

#Saving once more after the loop so the final partial batch is not lost
with open('../raw_data/ores_scores.json', 'w') as outfile:
    outfile.write(json.dumps(ores_scores, indent = 4))

0 of 21515 have been pulled
0 articles processed and written to file
10 of 21515 have been pulled
20 of 21515 have been pulled
...
8000 of 21515 have been pulled
8000 articles processed and written to file
8010 of 21515 have been pulled
...
13360 of 21515 have been pulled
13370 of 21515 have been pulled
13380 of

15610 of 21515 have been pulled
15620 of 21515 have been pulled
15630 of 21515 have been pulled
15640 of 21515 have been pulled
15650 of 21515 have been pulled
15660 of 21515 have been pulled
15670 of 21515 have been pulled
15680 of 21515 have been pulled
15690 of 21515 have been pulled
15700 of 21515 have been pulled
15710 of 21515 have been pulled
15720 of 21515 have been pulled
15730 of 21515 have been pulled
15740 of 21515 have been pulled
15750 of 21515 have been pulled
15760 of 21515 have been pulled
15770 of 21515 have been pulled
15780 of 21515 have been pulled
15790 of 21515 have been pulled
15800 of 21515 have been pulled
15810 of 21515 have been pulled
15820 of 21515 have been pulled
15830 of 21515 have been pulled
15840 of 21515 have been pulled
15850 of 21515 have been pulled
15860 of 21515 have been pulled
15870 of 21515 have been pulled
15880 of 21515 have been pulled
15890 of 21515 have been pulled
15900 of 21515 have been pulled
15910 of 21515 have been pulled
15920 of

18150 of 21515 have been pulled
18160 of 21515 have been pulled
18170 of 21515 have been pulled
18180 of 21515 have been pulled
18190 of 21515 have been pulled
18200 of 21515 have been pulled
18210 of 21515 have been pulled
18220 of 21515 have been pulled
18230 of 21515 have been pulled
18240 of 21515 have been pulled
18250 of 21515 have been pulled
18260 of 21515 have been pulled
18270 of 21515 have been pulled
18280 of 21515 have been pulled
18290 of 21515 have been pulled
18300 of 21515 have been pulled
18310 of 21515 have been pulled
18320 of 21515 have been pulled
18330 of 21515 have been pulled
18340 of 21515 have been pulled
18350 of 21515 have been pulled
18360 of 21515 have been pulled
18370 of 21515 have been pulled
18380 of 21515 have been pulled
18390 of 21515 have been pulled
18400 of 21515 have been pulled
18410 of 21515 have been pulled
18420 of 21515 have been pulled
18430 of 21515 have been pulled
18440 of 21515 have been pulled
18450 of 21515 have been pulled
18460 of

20700 of 21515 have been pulled
20710 of 21515 have been pulled
20720 of 21515 have been pulled
20730 of 21515 have been pulled
20740 of 21515 have been pulled
20750 of 21515 have been pulled
20760 of 21515 have been pulled
20770 of 21515 have been pulled
20780 of 21515 have been pulled
20790 of 21515 have been pulled
20800 of 21515 have been pulled
20810 of 21515 have been pulled
20820 of 21515 have been pulled
20830 of 21515 have been pulled
20840 of 21515 have been pulled
20850 of 21515 have been pulled
20860 of 21515 have been pulled
20870 of 21515 have been pulled
20880 of 21515 have been pulled
20890 of 21515 have been pulled
20900 of 21515 have been pulled
20910 of 21515 have been pulled
20920 of 21515 have been pulled
20930 of 21515 have been pulled
20940 of 21515 have been pulled
20950 of 21515 have been pulled
20960 of 21515 have been pulled
20970 of 21515 have been pulled
20980 of 21515 have been pulled
20990 of 21515 have been pulled
21000 of 21515 have been pulled
21010 of

We will save the ORES scores from this run to the raw_data directory as ores_scores.json.

In [132]:
#Converting the ORES scores from this run to a JSON string and saving it
ores_scores_json_object = json.dumps(ores_scores, indent=4)
with open('../raw_data/ores_scores.json', 'w') as outfile:
    outfile.write(ores_scores_json_object)
print("All ORES scores have been processed and saved!")

All ORES scores have been processed and saved!


While building this script, we found that some articles would throw a KeyError on their first pass through ORES and then succeed when run again. We collected those articles in key_error_list. We will re-run them through ORES until no KeyErrors remain or until we have made 4 re-run attempts, whichever comes first. We will then merge the original ores_scores with the scores from the re-runs into a final ores_scores dictionary. Any article that still throws a KeyError after 4 attempts will be added to the no_prediction list.

In [39]:
#Creating a merge method to merge and return a new dictionary
def Merge(dict1, dict2):
    res = {**dict1, **dict2}
    return res
     
#Building the final ores scores dictionary from the original ores_scores dictionary
#(this must run before the loop below, which merges into final_ores_scores)
final_ores_scores = {}
final_ores_scores = Merge(final_ores_scores, ores_scores)

#Rerunning the ores script 4 times or until all the KeyErrors have been resolved
reruns = 0
while (len(key_error_list) != 0 and reruns < 4):
    #Setting new_ores_scores and capturing the new key errors
    new_ores_scores = {}
    new_key_error_list = []
    article_count = 0
    #Running ORES scoring again
    for article in key_error_list:
        score = request_ores_score_per_article(article_revid=page_info_dict[article],
                                           email_address=email,
                                           access_token=ACCESS_TOKEN)
        if score is None:
            no_prediction.append(article)   
        else:
            try: 
                new_ores_scores[article] = score['enwiki']['scores'][str(page_info_dict[article])]['articlequality']['score']['prediction']
            except KeyError:    
                new_key_error_list.append(article)   
        #Printing tracker
        if article_count%10 == 0:
            print(str(article_count)+" of "+str(len(key_error_list))+" have been pulled")
        #Saving a checkpoint to file and printing every 2000 articles
        if article_count%2000 == 0:
            new_ores_scores_json_object = json.dumps(new_ores_scores, indent = 4) 
            with open('../raw_data/new_ores_scores.json', 'w') as outfile:
                outfile.write(new_ores_scores_json_object)
            print(str(article_count)+" articles processed and written to file")
        #Incrementing tracker 
        article_count = article_count + 1
    
    #Merging new_ores_scores to final_ores_scores
    final_ores_scores = Merge(final_ores_scores, new_ores_scores)
    
    #Switching new_key_error_list to key_error_list
    key_error_list = new_key_error_list
    print("There are {0} articles with KeyErrors remaining".format(len(key_error_list)))
    
    #Incrementing reruns
    reruns = reruns + 1

#Adding the articles which still have KeyErrors after 4 reruns to the no_prediction list
no_prediction = no_prediction + key_error_list

0 of 555 have been pulled
0 articles processed and written to file
10 of 555 have been pulled
20 of 555 have been pulled
30 of 555 have been pulled
40 of 555 have been pulled
50 of 555 have been pulled
60 of 555 have been pulled
70 of 555 have been pulled
80 of 555 have been pulled
90 of 555 have been pulled
100 of 555 have been pulled
110 of 555 have been pulled
120 of 555 have been pulled
130 of 555 have been pulled
140 of 555 have been pulled
150 of 555 have been pulled
160 of 555 have been pulled
170 of 555 have been pulled
180 of 555 have been pulled
190 of 555 have been pulled
200 of 555 have been pulled
210 of 555 have been pulled
220 of 555 have been pulled
230 of 555 have been pulled
240 of 555 have been pulled
250 of 555 have been pulled
260 of 555 have been pulled
270 of 555 have been pulled
280 of 555 have been pulled
290 of 555 have been pulled
300 of 555 have been pulled
310 of 555 have been pulled
320 of 555 have been pulled
330 of 555 have been pulled
340 of 555 have be
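
The bounded re-run logic above can also be expressed as a small helper. This is only a minimal sketch of the pattern, not the code used to produce the results in this notebook: `retry_failed` and `flaky_score` are hypothetical names, and `flaky_score` is a stand-in for the ORES request that fails on purpose so the loop can be exercised without network access.

```python
def retry_failed(items, score_fn, max_reruns=4):
    """Re-run score_fn over items that raise KeyError, for at most max_reruns passes.

    Returns (scores, still_failing), where still_failing holds the items that
    raised KeyError on every pass.
    """
    scores = {}
    pending = list(items)
    reruns = 0
    while pending and reruns < max_reruns:
        failed = []
        for item in pending:
            try:
                scores[item] = score_fn(item)
            except KeyError:
                failed.append(item)
        # Only the items that failed this pass are retried on the next one
        pending = failed
        reruns += 1
    return scores, pending


# Hypothetical stand-in for the ORES call: every title fails on its first
# attempt, and the title "C" never succeeds.
attempts = {}

def flaky_score(title):
    attempts[title] = attempts.get(title, 0) + 1
    if title == "C" or attempts[title] < 2:
        raise KeyError(title)
    return "Stub"


scores, unresolved = retry_failed(["A", "B", "C"], flaky_score)
print(scores)
print(unresolved)
```

Items left in `unresolved` correspond to the articles that would be appended to no_prediction after the last re-run.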

We will now save the final merged scores to the raw_data directory as final_ores_scores.json.

In [49]:
#Converting the final merged ORES scores to a JSON string and saving it
final_ores_scores_json_object = json.dumps(final_ores_scores, indent=4)
with open('../raw_data/final_ores_scores.json', 'w') as outfile:
    outfile.write(final_ores_scores_json_object)
print("All ORES scores have been processed and saved!")

All ORES scores have been processed and saved!
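
As an optional sanity check, a saved scores file can be read back and compared with the in-memory dictionary. The sketch below assumes a small made-up dictionary (`sample_scores`) and writes to a temporary directory rather than to raw_data, so it does not touch the real output file.

```python
import json
import os
import tempfile

# Hypothetical sample standing in for final_ores_scores (title -> predicted class)
sample_scores = {"Article A": "Stub", "Article B": "C"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_ores_scores.json")
    # Write the dictionary the same way the notebook does: indented JSON
    with open(path, "w") as outfile:
        json.dump(sample_scores, outfile, indent=4)
    # Read it back to confirm nothing was lost in the round trip
    with open(path) as infile:
        reloaded = json.load(infile)

round_trip_ok = reloaded == sample_scores
print(round_trip_ok)
```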


Now we will examine the articles for which ORES could not produce a score and validate our final counts of articles with and without scores.

In [45]:
#Validate the total number of articles
if len(final_ores_scores.keys()) == len(page_info_dict) - len(no_prediction):
    print("All articles have been scored, or are unable to be scored.")
else:
    print("Article scores are missing. Expected {0} total articles and got {1} with scores and {2} without scores".format(
    len(page_info_dict), len(final_ores_scores.keys()), len(no_prediction)))

#Print number of articles and articles with no prediction
print("{0} article(s) had no prediction".format(len(no_prediction)))
print(no_prediction)

All articles have been scored, or are unable to be scored.
1 article(s) had no prediction
['Corsicana, Texas']
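
The count comparison above checks totals only. A set-based check can additionally show which titles are missing, whether a title appears as both scored and unscored, and whether the unscored list contains duplicates. The following is a sketch under assumptions: `check_coverage` is a hypothetical helper, and the inputs are made-up placeholders rather than the notebook's data.

```python
def check_coverage(all_titles, scored, unscored):
    """Compare the scored and unscored titles against the full title list."""
    all_set = set(all_titles)
    scored_set = set(scored)
    unscored_set = set(unscored)
    return {
        # Titles that are neither scored nor recorded as unscored
        "missing": sorted(all_set - scored_set - unscored_set),
        # Titles that appear in both groups
        "overlap": sorted(scored_set & unscored_set),
        # Extra repeated entries in the unscored list
        "duplicates_in_unscored": len(list(unscored)) - len(unscored_set),
    }


# Made-up placeholder data for illustration only
report = check_coverage(
    all_titles=["A", "B", "C", "D"],
    scored={"A": "Stub", "B": "C"},
    unscored=["C", "C"],
)
print(report)
```

With the notebook's variables, the corresponding call would pass `page_info_dict`, `final_ores_scores`, and `no_prediction`.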
