## Purpose

The purpose of this code is to acquire data on article quality predictions from ORES. 

The article source data comes from English Wikipedia, the text of which is licensed under the Creative Commons Attribution-ShareAlike license (https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License).

We will be using the MediaWiki Action API for English Wikipedia (the api.php endpoint). For more information on the API, see https://www.mediawiki.org/wiki/API:Main_page. The documentation for the Info module, which returns the page information we need, may also be helpful: https://www.mediawiki.org/wiki/API:Info

We will leverage code developed by Dr. David W. McDonald for use in Data 512, which is provided under the Creative Commons CC-BY license (https://creativecommons.org/ and https://creativecommons.org/licenses/by/4.0/). The file can be found at this link: https://drive.google.com/drive/folders/1FtvWV31DHE8HIMdEsPGuCXPz0PMvShfl

We will also be using the ORES API. Information on the API itself can be found at https://www.mediawiki.org/wiki/ORES, with original API documentation from https://ores.wikimedia.org/docs and new LiftWing documentation from https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage.

We will begin by importing the Python libraries used in this notebook.

In [1]:
#Import python libraries
import json
import time
import urllib.parse
import requests
import pandas as pd

Next we will read in the us_cities_by_state_SEPT.2023.csv file from raw_data to create a list of articles that we want to feed into ORES.

In [2]:
#Read in info using pandas
cities_by_st_df = pd.read_csv('../raw_data/us_cities_by_state_SEPT.2023.csv')
cities_by_st_df.head()

#Get a list of page_titles
page_titles = list(cities_by_st_df['page_title'])

#Check if all the page titles were captured
if len(page_titles) == len(cities_by_st_df):
    print("All {0} page titles captured.".format(len(page_titles)))
else:
    print("Not all page titles captured")
    
#Check if page titles are unique
if len(page_titles) == len(cities_by_st_df['page_title'].unique()):
    print("All {0} page titles are unique.".format(len(page_titles)))
else:
    print("WARNING: Page titles are not unique. "+
          "There are {0} page titles and {1} unique titles - ".format(
              len(page_titles), len(cities_by_st_df['page_title'].unique()))
          + "a delta of {0}.".format(len(page_titles)-
                                len(cities_by_st_df['page_title'].unique())))

All 22157 page titles captured.


The user may notice a warning that not all page titles are unique. We will remove duplicated rows from the scraped file us_cities_by_state_SEPT.2023.csv to reduce later processing.

In [3]:
#Removing dupes
cities_dedupe = cities_by_st_df.drop_duplicates()

#Rebuilding the page titles list
page_titles_dedupe = list(cities_dedupe['page_title'])

#Rechecking if page titles are unique
if len(page_titles_dedupe) == len(cities_dedupe['page_title'].unique()):
    print("All {0} page titles are unique.".format(len(page_titles_dedupe)))
else:
    print("WARNING: Page titles are not unique. "+
          "There are {0} page titles and {1} unique titles - ".format(
              len(page_titles_dedupe),
              len(cities_dedupe['page_title'].unique()))
          + "a delta of {0}.".format(len(page_titles_dedupe)-
                                len(cities_dedupe['page_title'].unique())))



It is possible that page titles are still not unique, because drop_duplicates only removes rows that match in every column. We will visually inspect the repeated page titles to see if there is any way to reduce later processing.
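To see why row-level deduplication can still leave repeated titles, here is a minimal plain-Python sketch. The rows are hypothetical and are not the notebook's data, and `dict.fromkeys` stands in for dropping fully identical rows.

```python
import collections

# Hypothetical (state, page_title) rows: one exact duplicate row and one
# title that appears under two different states.
rows = [
    ("Alabama", "Abbeville, Alabama"),
    ("Alabama", "Abbeville, Alabama"),   # exact duplicate row
    ("Alabama", "Population"),
    ("Alaska", "Population"),            # same title, different state
]

# Row-wise dedupe (order preserving), analogous to dropping duplicate rows.
deduped_rows = list(dict.fromkeys(rows))

# Titles that still occur more than once after the row-wise dedupe.
title_counts = collections.Counter(title for _, title in deduped_rows)
still_duplicated = [title for title, count in title_counts.items() if count > 1]

print(len(rows), len(deduped_rows), still_duplicated)
```

The exact duplicate row is dropped, but the title shared by two states survives, which is the situation the next cell checks for.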

In [4]:
#Getting list of non-unique city names
#The following code was taken from Stack Overflow (https://stackoverflow.com/questions/9835762/how-do-i-find-the-duplicates-in-a-list-and-create-another-list-with-them)
import collections
dupe_titles = [item for item, count in collections.Counter(page_titles_dedupe).items() if count > 1]
print(dupe_titles)

['2020 United States census', '2010 United States census', 'County (United States)', 'Population']


In the file originally obtained for this analysis, some "cities" listed in the article titles are not cities at all (e.g., "Population" or "2020 United States census"). We will remove those non-cities.

In [5]:
#Removing non-cities from the deduped city file
cities_final_df = cities_dedupe.loc[~cities_dedupe['page_title'].isin(dupe_titles)]

In [6]:
'''
#This section has a more robust look at the titles, but there appears
to be multiple naming conventions so we'll settle for the easy removals above

#making new col w/ last n char from cities_dedupe
#cities_dedupe['title_state'] = cities_dedupe['page_title'][:-len(cities_dedupe['state'])]

#cities_dedupe.insert(len(cities_dedupe.columns), 'len', len(cities_dedupe['state']))
cities_dedupe['title_state'] = None
for index, row in cities_dedupe.iterrows():
    row['title_state'] = row['page_title'][-len(row['state']):]
cities_dedupe
cities_dedupe.loc[cities_dedupe['state'] != cities_dedupe['title_state']]'''

Now we will access the page info from the MediaWiki Action API for English Wikipedia articles. This code was taken from the file provided by Professor McDonald. We will begin by copying the code that sets up the constants for the API call. Some changes were made to keep comments to one line and to rename the example list of articles.

In [7]:
# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - 
#we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique 
#to the person making the request
REQUEST_HEADERS = {
    'User-Agent': '<ekrolen@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# Example article list
ex_article_titles = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see 
#the Info documentation for what can be included. If you don't want any 
#this can simply be the empty string
#PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # simplify - single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
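As a rough check on the throttling constants, the arithmetic below estimates the minimum time a full pull would take. The request count of 21,515 is taken from the progress output later in this notebook. The names `n_requests`, `seconds_per_request`, and `min_runtime_seconds` are illustrative only, and the real run time depends on actual network latency rather than the assumed 2 ms.

```python
# Same throttling constants as in the cell above.
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0/100.0) - API_LATENCY_ASSUMED

# Illustrative request count (one request per article title).
n_requests = 21515

# Each request is assumed to cost the wait plus the assumed latency.
seconds_per_request = API_THROTTLE_WAIT + API_LATENCY_ASSUMED
min_runtime_seconds = n_requests * seconds_per_request

print(f"Wait per request: {API_THROTTLE_WAIT:.3f} s")
print(f"Estimated minimum run time: {min_runtime_seconds / 60:.1f} minutes")
```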

Next we copy the API request procedure developed by Professor McDonald. Per his comments, "The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title."

In [8]:
#Page info requester
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # the article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

#The following code can be run to verify the API is pulling correctly
'''print(f"Getting page info data for: {ex_article_titles[3]}")
info = request_pageinfo_per_article(ex_article_titles[3])
print(json.dumps(info,indent=4))'''

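For intuition about what is sent to the endpoint, the sketch below builds the query string for one title using only the standard library. It copies the template instead of modifying it, which is a design choice of this sketch rather than of the procedure above. `build_pageinfo_url` is a hypothetical helper, and `requests` performs its own encoding, so the exact URL it sends may differ in detail.

```python
import urllib.parse

API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "",
}

def build_pageinfo_url(article_title, template=PAGEINFO_PARAMS_TEMPLATE):
    # Copy the template so the shared constant is left unchanged.
    params = dict(template, titles=article_title)
    return API_ENWIKIPEDIA_ENDPOINT + "?" + urllib.parse.urlencode(params)

url = build_pageinfo_url("Chinook salmon")
print(url)
```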

Next we will create a dictionary that contains each article title as a key and its revision id (lastrevid from request_pageinfo_per_article) as the value, for input into the ORES API.
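The response nests each page under its page id, which is why the lookup has to go through the keys of `pages`. The sketch below uses a hand-written response in the general shape returned for `prop=info`. The page id and revision id are made-up values, and `extract_lastrevid` is a hypothetical helper that returns `None` when the field is absent.

```python
import json

# Hand-written example response; the ids are placeholders, not real values.
sample_response = json.loads("""
{
    "batchcomplete": "",
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "ns": 0,
                "title": "Chinook salmon",
                "lastrevid": 67890
            }
        }
    }
}
""")

def extract_lastrevid(response):
    # Return lastrevid of the first page in the response, or None.
    if not response:
        return None
    pages = response.get("query", {}).get("pages", {})
    first_page = next(iter(pages.values()), {})
    return first_page.get("lastrevid")

print(extract_lastrevid(sample_response))
print(extract_lastrevid(None))
```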

In [None]:
#Creating the final list of titles to pull
article_titles = list(cities_final_df['page_title'])

#Pulling lastrevid for all of the articles & saving in a dictionary
page_info_dict = {}
for i, title in enumerate(article_titles):
    info = request_pageinfo_per_article(title)
    #The response keys pages by page id, so take the first (only) page;
    #a failed request returns None and is treated as an empty response
    pages = (info or {}).get('query', {}).get('pages', {})
    page = next(iter(pages.values()), {})
    #Store None when no lastrevid is returned instead of raising a KeyError
    page_info_dict[title] = page.get('lastrevid')
    if i%10 == 0:
        print(str(i)+" of "+str(len(article_titles))+" have been written")

0 of 21515 have been written
10 of 21515 have been written
20 of 21515 have been written
...
13060 of 21515 have been written
13070 of 21515 have been written
13080 of 21515 have been written
...

In [None]:
little_list = article_titles[4370:4390]
print(little_list)
'''little_page_info_dict = {}
for i in range(len(little_list)):
    info = request_pageinfo_per_article(little_list[i])
    little_page_info_dict[little_list[i]] = info['query']['pages'][list(info['query']['pages'].keys())[0]]['lastrevid']
    print(info)'''
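One request per title makes the pull slow. The Action API documentation describes passing several titles in a single request, separated by `|`, commonly up to 50 per request for clients without elevated limits. The sketch below shows only the batching step. `batch_titles` and its batch size are assumptions of this sketch, and the notebook's code requests one title at a time.

```python
def batch_titles(titles, batch_size=50):
    # Yield '|'-joined groups of at most batch_size titles.
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])

ex_article_titles = ['Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat']

# A small batch size is used here only to make the grouping visible.
batches = list(batch_titles(ex_article_titles, batch_size=2))
print(batches)
```

With batched requests the response contains several pages, so results would need to be matched back to titles through each page's `title` field, which the API may return in normalized form.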

Due to the long run time pulling page information, we will save the results as a JSON file in the raw_data folder.

In [None]:
#Making the page info dict a JSON object
page_info_json_object = json.dumps(page_info_dict, indent = 4) 

#Writing to the file
with open('../raw_data/page_info.json', 'w') as outfile:
    outfile.write(page_info_json_object)
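Because the file is written once and reused, a later session can reload it instead of repeating the pull. The sketch below round-trips a small dictionary through a temporary file. The titles and revision ids are placeholders, and the temporary path stands in for `../raw_data/page_info.json`.

```python
import json
import os
import tempfile

# Placeholder data in the same shape as page_info_dict (title -> lastrevid).
example_page_info = {"Example City, State": 111, "Another City, State": 222}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page_info.json")

    # Write the dictionary as indented JSON.
    with open(path, "w") as outfile:
        json.dump(example_page_info, outfile, indent=4)

    # Read it back in, as a later session would.
    with open(path, "r") as infile:
        reloaded = json.load(infile)

print(reloaded == example_page_info)
```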