## License

This code was developed by Ameya Bhamare from the MS Data Science program. Snippets have been taken from the code example authored by Dr. David W. McDonald for use in DATA 512. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.1 - August 14, 2023

Importing libraries

In [3]:
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

import json, time, urllib.parse
import pandas as pd

Retrieving article titles from us_cities_by_state_SEPT.2023.csv and saving the article titles in article_titles.csv

In [2]:
cities_by_state = pd.read_csv('us_cities_by_state_SEPT.2023.csv')
ARTICLE_TITLES = cities_by_state['page_title'].to_list()

In [71]:
article_titles = {'article_title' : ARTICLE_TITLES}
df = pd.DataFrame(article_titles)
df.to_csv('article_titles.csv', index=False)

Defining constants for the page info API request

In [3]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}
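The throttle arithmetic above can be written as a small helper. This is only a sketch: `throttle_wait` and its parameter names are hypothetical and are not part of the original course code. It spaces requests evenly within the allowed rate and subtracts the assumed latency, never returning a negative wait.

```python
def throttle_wait(requests_allowed, per_seconds, latency=0.002):
    """Return the seconds to sleep before each request to stay under a rate limit.

    requests_allowed / per_seconds describe the limit (e.g. 100 requests per 1 second);
    latency is the assumed round-trip time that already separates two requests.
    """
    interval = per_seconds / requests_allowed   # minimum spacing between requests
    return max(interval - latency, 0.0)         # never sleep for a negative duration


# 100 requests per second with 2 ms assumed latency, as in the constants above
print(throttle_wait(100, 1.0))
```

Expressing the limit as a count and a period makes it harder to mix up units (per second, per minute, per hour) when the limit changes.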

In [4]:
# This is a string of additional page properties that can be returned; see the Info documentation for
# what can be included. If you don't want any, this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "url"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


### Define a function to make the Wikimedia API request

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [5]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
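    # the article title can be passed as a parameter to the call or set in the request_template;
    # copy the template first so the shared module-level dictionary is not modified between calls
    request_template = dict(request_template)
    if article_title:
        request_template['titles'] = article_title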

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


Defining constants for the ORES (LiftWing) API request

In [7]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the access token that you are granted when you request the token. The token
#    used in this notebook carries a rate limit of 5000 requests per hour, so the wait is computed per hour (3600 seconds).
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (3600.0/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
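To illustrate how the header template and its parameter dictionary fit together, here is a minimal sketch of the expansion step. The helper name `build_headers` is hypothetical, and the email address and token are placeholders rather than real credentials.

```python
# A local copy of the header template, so this sketch runs on its own
HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}",
}


def build_headers(template, **params):
    """Fill every placeholder in the template and return a new header dictionary."""
    return {key: value.format(**params) for key, value in template.items()}


headers = build_headers(
    HEADER_TEMPLATE,
    email_address="someone@example.edu",   # placeholder, not a real address
    access_token="EXAMPLE_TOKEN",          # placeholder, not a real token
)
print(headers['Authorization'])
```

Values without placeholders, such as the content type, pass through `str.format` unchanged.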

Setting access tokens here as generated by Wikimedia's API

In [8]:
USERNAME = "Ameyabhamare"
ACCESS_TOKEN = "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJhdWQiOiJmMGViYzU5YjJlYWY1OWI2OTk4Zjg2OGI0ZTI3YzI4YyIsImp0aSI6ImJhMTk2Zjc5MTE2OGNiY2JkMjYyZDA0NWY5MzdiZWRlMmRmMmJiMGUzYTZjNzE2YmNkMzllZDA1MzkxNzQxZDU1MzY1NDFhZDRlNjZkYmUwIiwiaWF0IjoxNjk3MTQxNjYzLjcxNTEzNiwibmJmIjoxNjk3MTQxNjYzLjcxNTEzOSwiZXhwIjozMzI1NDA1MDQ2My43MTM5MTcsInN1YiI6IjczOTkxMTY1IiwiaXNzIjoiaHR0cHM6Ly9tZXRhLndpa2ltZWRpYS5vcmciLCJyYXRlbGltaXQiOnsicmVxdWVzdHNfcGVyX3VuaXQiOjUwMDAsInVuaXQiOiJIT1VSIn0sInNjb3BlcyI6WyJiYXNpYyIsImNyZWF0ZWVkaXRtb3ZlcGFnZSIsImVkaXRwcm90ZWN0ZWQiXX0.QF7kx0LSHINl6GZW5mdOu3QI5ZWGYBiurV-u_qJZbmHAClgjD_huC_qesO-jzcHXdOmCIe-6dEDzdlRIc57XHwqUmLpgkm5JVAyhw0ew8lS3z-XZvaxpMa4EDSyG-6lrZrE2yz_6OyCwoz3cI3D2x4tSFJ7z43GmZEui00-zuCf-rUcG9cpfolA9DzxF4bta8k47mtZE4ozxMvJwziScxcxyJDRz0HwV2GdCSBRM0FYE1zYMrmp1tdeHIRlMkPGlgw1jzdKcn3bvZjt1LTOEB4EeJYuafuDzKBwp8hZQRlEecEZ5aZAe4QMB0D76AQQWko7vC6TLks34cymmrMWsn1ExgJPpcpNVnRjTuoZGM7Nf3bw1aUXBUahMps2WmCtZgGCSgcgFpbSaP6e6QgRE-4yOl7_wqMFzbllBQ-P2TZEnttecz6IGE_H_lQOmJFfh5In92FYoTy6YsrkiZxAx0z754JTfzvKl-iXE_oSHM011KA8HPNRCbVKNXo3No9gk2Hea-W1eBvVnBhRbwLBIr4NbHJvg-qlHRNuYytRfJSH9OYXVkz4srEBEMi_s2aLjsY2Ku-BLHxokAe0oCeF0aGcL-uocFnPdXI7NT7Dv85-DbVarvvgPDPTKZ0ZSuwYLSI-GY6I4wBBdjDKHXfoXVQvl9_tB3PkaLp66SrKsAXQ"

### Define a function to make the ORES API request

The API request is made through a function that encapsulates the call and makes it reusable in other notebooks. The function is parameterized and relies on the constants above for its default values. The primary assumption is that it will be used to request data for a set of article revisions, so the main parameter is 'article_revid': pass in a new article revision ID on each call and get back a Python dictionary as the result. A valid result contains the probabilities that the specific revision belongs to each of six article quality levels. Generally, the quality level with the highest probability is taken as the quality level for the article. This can be tricky when two or more quality levels are highly probable.

In [9]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
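    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    #    Copy the templates first so the shared default dictionaries are not modified between calls
    request_data = dict(request_data)
    header_params = dict(header_params)
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token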
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
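The case of two or more highly probable quality levels can be flagged explicitly. The sketch below assumes a dictionary mapping quality labels to probabilities; the helper `top_quality`, the `margin` threshold, and the sample probabilities are all made up for illustration and are not taken from a real ORES response.

```python
def top_quality(probabilities, margin=0.1):
    """Return (best_label, is_ambiguous) for a mapping of quality label -> probability.

    The prediction is flagged as ambiguous when the runner-up is within
    `margin` of the best probability. The threshold is an arbitrary choice.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best, p_best), (_, p_second) = ranked[0], ranked[1]
    return best, (p_best - p_second) < margin


# Illustrative probabilities only
close_call = {"FA": 0.02, "GA": 0.38, "B": 0.35, "C": 0.15, "Start": 0.07, "Stub": 0.03}
clear_case = {"FA": 0.01, "GA": 0.02, "B": 0.05, "C": 0.80, "Start": 0.10, "Stub": 0.02}

print(top_quality(close_call))
print(top_quality(clear_case))
```

Recording the ambiguity flag next to the prediction lets a later analysis decide whether borderline articles need separate treatment.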


This section merges US States by Region - US Census Bureau.xlsx with NST-EST2022-POP.xlsx, because we want the population of each state alongside its regional division. It begins by cleaning both Excel sheets: they are laid out for human readers, and pandas cannot process them as-is.

In [4]:
df_states = pd.read_excel('US States by Region - US Census Bureau.xlsx')

In [5]:
df_states.head()

Unnamed: 0,REGION,DIVISION,STATE
0,Northeast,,
1,,New England,
2,,,Connecticut
3,,,Maine
4,,,Massachusetts


In [7]:
df_states.rename(columns={'REGION': 'region', 'DIVISION' : 'division', 'STATE' : 'state'}, inplace=True) 

In [8]:
# Using pandas ffill() to populate the empty rows in 'division' and 'region'
df_states['division'] = df_states['division'].ffill()
df_states['region'] = df_states['region'].ffill()

In [9]:
# Removing all those rows where 'state' is NULL
df_states = df_states.loc[~df_states['state'].isnull(), :]

In [10]:
df_states.head()

Unnamed: 0,region,division,state
2,Northeast,New England,Connecticut
3,Northeast,New England,Maine
4,Northeast,New England,Massachusetts
5,Northeast,New England,New Hampshire
6,Northeast,New England,Rhode Island
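The forward-fill step can be tried on a tiny frame that mimics the staggered layout of the region sheet. This is a sketch on hand-typed rows, not the actual spreadsheet; the frame names `raw` and `tidy` are hypothetical.

```python
import pandas as pd

# Each region and division appears once, on its own row, above the states it contains
raw = pd.DataFrame({
    "region":   ["Northeast", None, None, None, None, None],
    "division": [None, "New England", None, None, "Middle Atlantic", None],
    "state":    [None, None, "Connecticut", "Maine", None, "New Jersey"],
})

# Carry the last seen region and division down to the rows beneath them
raw[["region", "division"]] = raw[["region", "division"]].ffill()

# Header rows have no state, so dropping them leaves one row per state
tidy = raw.dropna(subset=["state"]).reset_index(drop=True)
print(tidy)
```

Forward filling also copies a division onto the next division's header row, but those rows have no state and are removed by the final filter, so the stale value never reaches the result.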


In [11]:
df_states.to_csv('states_by_region.csv', index=False)

In [12]:
df_population = pd.read_excel('NST-EST2022-POP.xlsx')

In [13]:
df_population.head()

Unnamed: 0,table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts),Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,Annual Estimates of the Resident Population fo...,,,,
1,Geographic Area,"April 1, 2020 Estimates Base",Population Estimate (as of July 1),,
2,,,2020,2021.0,2022.0
3,United States,331449520,331511512,332031554.0,333287557.0
4,Northeast,57609156,57448898,57259257.0,57040406.0


In [14]:
# Keeping rows from index label 8 onwards, which is where the state rows begin
df_population = df_population.loc[8:,:]

In [15]:
# Dropping the April 2020 estimates base and the 2020 and 2021 estimates, keeping only the 2022 population
df_population = df_population.drop(df_population.columns[[1, 2, 3]], axis = 1)

In [16]:
df_population = df_population.rename(columns={df_population.columns[0]: 'state', df_population.columns[1]: 'population'}) 

In [17]:
# State names in the sheet carry a leading dot (it marks sub-parts), so drop the first character
df_population['state'] = df_population['state'].str[1:]

In [18]:
df_population.head()

Unnamed: 0,state,population
8,Alabama,5074296.0
9,Alaska,733583.0
10,Arizona,7359197.0
11,Arkansas,3045637.0
12,California,39029342.0


In [19]:
df_population.to_csv('population_by_state.csv', index=False)

Read and merge the two cleaned CSV files, population_by_state.csv and states_by_region.csv

In [20]:
population_by_state = pd.read_csv('population_by_state.csv')

In [21]:
states_by_region = pd.read_csv('states_by_region.csv')

In [22]:
df = pd.merge(population_by_state, states_by_region, on="state")
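An inner merge silently drops rows whose 'state' value has no match in the other table. One way to check for this, sketched below on a toy example, is a left merge with `indicator=True`. The frames here are stand-ins: "Example Territory" and its population are invented to show an unmatched row.

```python
import pandas as pd

population = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Example Territory"],   # last row is invented
    "population": [5074296, 733583, 1000],
})
regions = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "region": ["South", "West"],
    "division": ["East South Central", "Pacific"],
})

# Keep every population row and record where each one found a match
checked = population.merge(regions, on="state", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "state"].tolist()
print(unmatched)
```

If the list is non-empty, the names can be inspected before deciding whether to fix spelling differences or to drop those rows deliberately.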

In [23]:
# Combine region and division into a single label such as 'West - Pacific'
df['regional division'] = df['region'] + ' - ' + df['division']

In [24]:
df = df[['state', 'regional division', 'population']]

In [27]:
df.head()

Unnamed: 0,state,regional division,population
0,Alabama,South - East South Central,5074296.0
1,Alaska,West - Pacific,733583.0
2,Arizona,West - Mountain,7359197.0
3,Arkansas,South - West South Central,3045637.0
4,California,West - Pacific,39029342.0


In [28]:
df.to_csv('combined_region_population.csv', index=False)

Calling Wikimedia API to retrieve revision ID of each article and save to revision_ids.csv

In [35]:
revision_ids = []
for title in ARTICLE_TITLES:
    article_response = request_pageinfo_per_article(title)
    article_info = list(article_response['query']['pages'].values())[0]
    revision_id = article_info['lastrevid']
    revision_ids.append(revision_id)

In [43]:
revids = {'article_title' : ARTICLE_TITLES, 'revision_id' : revision_ids}
df = pd.DataFrame(revids)
df.to_csv('revision_ids.csv', index=False)
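The loop above assumes every request succeeds and every title resolves to a page. A more defensive extraction could look like the sketch below. The helper `extract_last_revid` is hypothetical, the sample responses are hand-written in the shape the loop relies on, and the page ID is invented. The "missing page" shape is an assumption about how the API reports unknown titles.

```python
def extract_last_revid(response):
    """Return the 'lastrevid' of the first page in a page info response, or None."""
    if not response:
        return None                               # the request failed and returned None
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("lastrevid")              # None when the page has no revision
    return None                                   # no pages in the response at all


# Hand-written examples; 12345 is an invented page ID
found = {"query": {"pages": {"12345": {"title": "Abbeville, Alabama", "lastrevid": 1171163550}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(extract_last_revid(found))
print(extract_last_revid(missing))
print(extract_last_revid(None))
```

Appending the returned value even when it is `None` keeps the list of revision IDs aligned with the list of titles, so gaps can be filtered out after the loop.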

Calculating article quality by passing each revision ID to the ORES request and saving the scores in consolidated_ores.json

In [3]:
import pandas as pd
import json

In [4]:
# Read the file once and take both columns from the same frame
revisions = pd.read_csv('revision_ids.csv')
article_titles = revisions['article_title'].to_list()
revision_ids = revisions['revision_id'].to_list()

In [5]:
score_dicts = dict()

In [None]:
for article_title, revision_id in zip(article_titles, revision_ids):
    ores_response = request_ores_score_per_article(article_revid=revision_id,
                                       email_address="ameyarb@uw.edu",
                                       access_token=ACCESS_TOKEN)
    # the request function returns None on failure, so skip articles without a usable response
    if not ores_response:
        continue
    for value in ores_response.get('enwiki', {}).get('scores', {}).values():
        if 'score' in value.get('articlequality', {}):
            score_dicts[article_title] = value['articlequality']['score']

In [60]:
# Write the dictionary directly; dumping an already serialized string would store a JSON string, not an object
with open("consolidated_ores.json", "w") as fp:
    json.dump(score_dicts, fp, indent=4)
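When saving the scores, it is worth confirming that the file reads back as a dictionary. The sketch below uses in-memory buffers instead of files and a one-entry sample dictionary. It contrasts dumping the dictionary directly with dumping a string that has already been serialized.

```python
import io
import json

scores = {"Abbeville, Alabama": {"prediction": "C"}}   # small sample in the same shape

# Dumping the dictionary directly round-trips to a dictionary
buffer = io.StringIO()
json.dump(scores, buffer, indent=4)
buffer.seek(0)
restored = json.load(buffer)

# Dumping an already serialized string stores a JSON string instead
double = io.StringIO()
json.dump(json.dumps(scores), double)
double.seek(0)
as_text = json.load(double)

print(type(restored).__name__, type(as_text).__name__)
```

A double-encoded file is still recoverable: calling `json.loads` on the loaded string yields the original dictionary.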

Create and save wp_scored_city_articles_by_state.csv by joining rows from combined_region_population.csv with the saved ORES scores, matching on 'state'


In [40]:
state_reg_pop_df = pd.read_csv('combined_region_population.csv')

In [26]:
with open('consolidated_ores.json', 'r') as json_file:
    ores_scores = json.load(json_file)

In [63]:
f = []

In [None]:
for article_title, revision_id in zip(article_titles, revision_ids):
    try:
        state = article_title.split(",")[-1].strip()
        ores = ores_scores[article_title]['prediction']
        match = state_reg_pop_df[state_reg_pop_df['state'] == state].values.tolist()[0]
        match.extend([article_title, revision_id, ores])
        print(match)
        f.append(match)
    except Exception as error:
        print(error)
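The state lookup in the loop relies on the article title ending in ", State". A minimal sketch of that parsing, paired with a dictionary lookup instead of filtering the DataFrame on every iteration, is shown below. The helper names and the one-row `state_info` table are hypothetical; the Alabama values come from the output above.

```python
def state_from_title(article_title):
    """Return the text after the last comma, e.g. 'Abbeville, Alabama' -> 'Alabama'."""
    return article_title.split(",")[-1].strip()


# One-row stand-in for combined_region_population.csv, keyed by state
state_info = {"Alabama": ("South - East South Central", 5074296.0)}


def lookup(article_title, table=state_info):
    """Return (regional division, population) for the title's state, or None if unknown."""
    return table.get(state_from_title(article_title))


print(lookup("Abbeville, Alabama"))
print(lookup("Springfield"))   # no comma, so the whole title is treated as the state
```

Returning `None` for unknown states makes the unmatched titles easy to collect and review, rather than relying on an exception from indexing an empty list.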

In [68]:
df = pd.DataFrame(f, columns=['state', 'regional_division', 'population', 'article_title', 'revision_id', 'article_quality']) 

In [69]:
df.head()

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality
0,Alabama,South - East South Central,5074296.0,"Abbeville, Alabama",1171163550,C
1,Alabama,South - East South Central,5074296.0,"Adamsville, Alabama",1177621427,C
2,Alabama,South - East South Central,5074296.0,"Addison, Alabama",1168359898,C
3,Alabama,South - East South Central,5074296.0,"Akron, Alabama",1165909508,GA
4,Alabama,South - East South Central,5074296.0,"Alabaster, Alabama",1179139816,C


In [70]:
df.to_csv('wp_scored_city_articles_by_state.csv', index=False)