**Considering Bias in Data**    
**DATA512 Homework #2**

The goal of this assignment is to explore the concept of bias in data using Wikipedia articles. This assignment will consider articles about cities in different US states. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of state populations, and use a machine learning service called ORES to estimate the quality of the articles about the cities.


There are two steps here:

1. We access page information for English Wikipedia articles through the MediaWiki Action API. The aim is to obtain **concise page info summaries for a large set of article pages.**

2. We then use the **LiftWing ML Service API** to obtain **ORES scores**. The LiftWing version of ORES estimates article quality for a specific revision of an article, so we score the latest revision of each page.

**License**

Some parts of the code are either used as is or modified based on the example code that was developed by Dr. David W. McDonald for use in the UW MS Data Science DATA 512 course. This code is provided under the [Creative Commons](https://creativecommons.org/) [CC-BY](https://creativecommons.org/licenses/by/4.0/) license.

We start by importing the required libraries.

In [3]:
import json, time, requests, urllib.parse
import pandas as pd
from tqdm.autonotebook import tqdm
import warnings
warnings.filterwarnings("ignore")
from pandas import json_normalize
import matplotlib.pyplot as plt
from itertools import islice
import logging
logging.basicConfig(filename="errors.log", level=logging.ERROR)



In [4]:
pd.set_option('display.max_colwidth', None)

# Step 1: Getting the Article, Population and Region Data

In this assignment, we will be importing documents from a Google Drive folder which can be found [here](https://drive.google.com/drive/folders/1qzJcMILGuf_GjvfjwXizN5B8T9VUGhLv?usp=sharing)

We also supply our own `email_address` in the request headers, so that automated requests can be traced back to the person making them.

In this step, we are mounting Google Drive so we can access the data.

In [7]:
from google.colab import drive
import sys

drive.mount('/content/drive')

FOLDERNAME = "DATA512_HW2"
sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))
%cd /content/drive/My\ Drive/$FOLDERNAME/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/DATA512_HW2


These are the details of the three data files we will be importing from Google Drive in the subsequent steps.

**List of Cities**.  
The Wikipedia Category:Lists of cities in the United States by state was crawled to generate a list of Wikipedia article pages about US cities from each state. This data is in the folder as `us_cities_by_state_SEPT.2023.csv`

**Population Data**.   
The US Census Bureau provides updated population estimates for every US state. You can find State Population Totals and Components of Change: 2020-2022 from their website. An Excel file linked to that page contains estimated populations of all US states for 2022. It is called `NST-EST2022-POP.xlsx`

**Regional Division Data**.  
The 'region' demarcation within the US is not one standardized and fixed thing. In fact, different US government agencies agglomerate states to define regions as a function of differing goals (e.g., see List of regions of the United States for some examples). For this analysis, you will use the regional and divisional agglomerations as defined by the US Census Bureau. The homework folder contains a spreadsheet listing the states in each regional division. It is called `US States by Region - US Census Bureau.xlsx`

## List of Cities

In [16]:
WIKI_URL = 'us_cities_by_state_SEPT.2023.csv'
WIKI_SHEET = pd.read_csv(WIKI_URL)

WIKI_SHEET.drop_duplicates(inplace=True, ignore_index=True)

WIKI_SHEET

Unnamed: 0,state,page_title,url
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama"
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama"
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama"
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama"
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama"
...,...,...,...
21520,Wyoming,"Wamsutter, Wyoming","https://en.wikipedia.org/wiki/Wamsutter,_Wyoming"
21521,Wyoming,"Wheatland, Wyoming","https://en.wikipedia.org/wiki/Wheatland,_Wyoming"
21522,Wyoming,"Worland, Wyoming","https://en.wikipedia.org/wiki/Worland,_Wyoming"
21523,Wyoming,"Wright, Wyoming","https://en.wikipedia.org/wiki/Wright,_Wyoming"


## Population Data

After importing the data, we drop the first four rows of `POP_SHEET`, which hold header text rather than data, reset the index, and rename the columns to `State`, `2020_est`, `2020`, `2021`, and `2022`.

State rows in the Census sheet are the ones whose name starts with a period (`.`), so we keep only those rows and strip the leading period from the `State` values. Finally, `POP_SHEET` is reduced to the `State` and `2022` columns and the index is reset again.

In [24]:
POP_URL = 'NST-EST2022-POP.xlsx'
POP_SHEET = pd.read_excel(POP_URL)

POP_SHEET = POP_SHEET[4:]
POP_SHEET.reset_index(drop=True, inplace=True)
POP_SHEET.columns = ['State', '2020_est', '2020', '2021', '2022']
# state rows start with a literal '.'; copy so the next assignment does not act on a slice
POP_SHEET = POP_SHEET[POP_SHEET['State'].str.contains(r'^\.', na=False)].copy()
POP_SHEET['State'] = POP_SHEET['State'].str.slice(1)  # strip the leading period
POP_SHEET = POP_SHEET[['State', '2022']].reset_index(drop=True)

POP_SHEET.head()

Unnamed: 0,State,2022
0,Alabama,5074296.0
1,Alaska,733583.0
2,Arizona,7359197.0
3,Arkansas,3045637.0
4,California,39029342.0
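
As a minimal sketch of the cleanup described above, the following applies the same filter to a small hand-made frame. The `toy` frame and the figures on its aggregate rows are illustrative assumptions, not values read from the Census file.

```python
import pandas as pd

# Toy stand-in for the raw Census sheet: state rows carry a leading period,
# while aggregate rows do not. The aggregate figures are placeholders.
toy = pd.DataFrame({
    "State": ["United States", "Northeast Region", ".Alabama", ".Alaska", None],
    "2022": [1000.0, 500.0, 5074296.0, 733583.0, None],
})

# Keep only rows whose name starts with a literal period, then strip it.
states = toy[toy["State"].str.contains(r"^\.", na=False)].copy()
states["State"] = states["State"].str.slice(1)
states = states.reset_index(drop=True)

print(states)
```

Passing `na=False` makes empty cells count as non-matches, so blank and footnote rows are dropped by the same filter.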


## Regional Division Data

Missing values in the `REGION` and `DIVISION` columns of `REGION_SHEET` are forward-filled: each gap is replaced with the last non-null value above it, so every row carries a valid region and division.

Further, rows in the dataframe with missing values in the `STATE` column are removed and all column names in the dataframe are converted to lowercase.

In [17]:
REGION_URL = 'US States by Region - US Census Bureau.xlsx'
REGION_SHEET = pd.read_excel(REGION_URL)

# forward-fill with .ffill(); fillna(method='ffill') is deprecated in recent pandas
REGION_SHEET['REGION'] = REGION_SHEET['REGION'].ffill()
REGION_SHEET['DIVISION'] = REGION_SHEET['DIVISION'].ffill()
REGION_SHEET = REGION_SHEET.dropna(subset=['STATE'])
REGION_SHEET.columns = REGION_SHEET.columns.str.lower()

REGION_SHEET.head()

Unnamed: 0,region,division,state
2,Northeast,New England,Connecticut
3,Northeast,New England,Maine
4,Northeast,New England,Massachusetts
5,Northeast,New England,New Hampshire
6,Northeast,New England,Rhode Island
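
The forward-fill step can be illustrated on a toy frame. This is a sketch under an assumed layout, in which region and division names sit on their own rows above the states. The real spreadsheet may be arranged differently.

```python
import pandas as pd

# Toy stand-in for the region spreadsheet (layout is an assumption):
# region and division names appear once, on rows that have no state.
toy = pd.DataFrame({
    "REGION":   ["Northeast", None, None, None],
    "DIVISION": [None, "New England", None, None],
    "STATE":    [None, None, "Connecticut", "Maine"],
})

# Carry the last seen region/division down, then drop the label-only rows.
toy[["REGION", "DIVISION"]] = toy[["REGION", "DIVISION"]].ffill()
regions = toy.dropna(subset=["STATE"]).copy()
regions.columns = regions.columns.str.lower()

print(regions)
```

Because the label-only rows are dropped after filling, the surviving rows keep their original index values, which is why the output above starts at index 2 rather than 0.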


# Article Page Info MediaWiki API

This code accesses page info data using the [MediaWiki Action API](https://www.mediawiki.org/wiki/API:Main_page) for the English Wikipedia. It shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

**License**.   
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org/) [CC-BY](https://creativecommons.org/licenses/by/4.0/) license. Revision 1.1 - August 14, 2023

In [18]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<avivam@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
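
To make the role of the parameter template concrete, the sketch below shows roughly what query string results once a title is filled in. It makes no network call. `build_pageinfo_url` is a hypothetical helper introduced only for illustration, and the constants are restated so the snippet runs on its own.

```python
import urllib.parse

ENDPOINT = "https://en.wikipedia.org/w/api.php"
PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

def build_pageinfo_url(title):
    """Return a GET URL for one title without modifying the shared template."""
    params = dict(PARAMS_TEMPLATE)   # copy so the template stays unchanged
    params["titles"] = title
    return ENDPOINT + "?" + urllib.parse.urlencode(params)

url = build_pageinfo_url("Abbeville, Alabama")
print(url)
```

Commas, spaces, and the `|` separators are percent-encoded, so titles containing punctuation can be passed as they appear in the CSV.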


# Function to Request Page Information per Article

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the `article_title`.

In [21]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

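    # copy the template so repeated calls do not modify the shared default dict
    request_template = dict(request_template)

    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title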

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


# Making Requests for Page Information and Storing the Responses

Here we collect page information for every article in `WIKI_SHEET`. For each title we attempt to retrieve its data. If the request succeeds, the response is appended to the `article_results` list; if it fails, the title is added to `failed_articles` and an error message is printed. After gathering the responses, we extract the single page record from each one and organize the records into a DataFrame called `result_df`.

In [22]:
article_results = []
failed_articles = []

In [23]:
for article_title in WIKI_SHEET.page_title:
    try:
        print('Pulling data for:', article_title)
        response = request_pageinfo_per_article(article_title)

        if response is not None:
            article_results.append(response)
        else:
            print(f"Failed to retrieve data for {article_title}")
            failed_articles.append(article_title)
    except Exception as e:
        print(f"Error for article {article_title}: {e}")

article_results_df = pd.DataFrame(article_results)
result_df = pd.DataFrame(article_results_df['query'].apply(lambda x: x['pages'][list(x['pages'].keys())[0]]))
result_df.reset_index(drop=True, inplace=True)

Streaming output truncated to the last 5000 lines.
Pulling data for: Oliver Township, Perry County, Pennsylvania
Pulling data for: Lewis Township, Northumberland County, Pennsylvania
Pulling data for: South Centre Township, Columbia County, Pennsylvania
Pulling data for: Great Bend Township, Susquehanna County, Pennsylvania
Pulling data for: Jackson Township, Tioga County, Pennsylvania
Pulling data for: Leesport, Pennsylvania
Pulling data for: Farmington Township, Clarion County, Pennsylvania
Pulling data for: Auburn Township, Pennsylvania
Pulling data for: Upper Tulpehocken Township, Berks County, Pennsylvania
Pulling data for: Aleppo Township, Allegheny County, Pennsylvania
Pulling data for: Tyrone Township, Blair County, Pennsylvania
Pulling data for: North York, Pennsylvania
Pulling data for: Lenox Township, Susquehanna County, Pennsylvania
Pulling data for: Benton Township, Lackawanna County, Pennsylvania
Pulling data for: Conyngham, Pennsylvania
Pulling data for: Wa

In [24]:
failed_articles

[]
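
The unpacking of each response can be sketched on hand-written data. The dictionaries below only imitate the `query` → `pages` nesting of the API output. The second title is made up, and `first_page` is a hypothetical helper that also skips pages the API marks as `missing`.

```python
import pandas as pd

# Hand-written responses shaped like the API output; only a few fields are kept.
responses = [
    {"query": {"pages": {"104730": {"pageid": 104730, "title": "Abbeville, Alabama",
                                    "lastrevid": 1171163550}}}},
    {"query": {"pages": {"-1": {"title": "No Such Place, Alabama", "missing": ""}}}},
    None,  # what request_pageinfo_per_article returns when the request fails
]

def first_page(response):
    """Return the single page record from a response, or None if unusable."""
    if not response:
        return None
    pages = response.get("query", {}).get("pages", {})
    page = next(iter(pages.values()), None)
    if page is None or "missing" in page:
        return None
    return page

records = [page for page in map(first_page, responses) if page is not None]
pages_df = pd.DataFrame(records)
print(pages_df)
```

Filtering unusable responses before building the DataFrame avoids a `KeyError` later, when a record without `lastrevid` would otherwise reach the scoring step.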

We save the `result_df` DataFrame as `article_1.csv` in the Google Drive directory mounted in Google Colab.

In [25]:
save_path = '/content/drive/My Drive/{}/'.format(FOLDERNAME)
result_df.to_csv(save_path + 'article_1.csv', index=False)

This step takes the `query` column of `result_df`, whose cells are dictionaries, and applies the `pd.Series` constructor to each one so that every dictionary key becomes its own DataFrame column. We then select the columns of interest and store the result in `data_1`.

In [40]:
data_1 = result_df['query'].apply(pd.Series)
data_1 = data_1[['pageid', 'ns', 'title', 'contentmodel', 'pagelanguage', 'pagelanguagehtmlcode', 'pagelanguagedir', 'touched', 'lastrevid', 'length', 'talkid', 'fullurl', 'editurl', 'canonicalurl', 'watchers', 'redirect', 'new']]
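
A small illustration of the expansion, using a toy frame rather than the real `result_df`. The second record deliberately omits `talkid` to show how absent keys are handled. Using `reindex` instead of plain column selection is a choice made for this sketch: it tolerates requested columns that no record contains.

```python
import pandas as pd

# Toy frame with one column of dicts, mirroring the shape of result_df.
toy = pd.DataFrame({"query": [
    {"pageid": 104730, "title": "Abbeville, Alabama", "lastrevid": 1171163550, "talkid": 281244},
    {"pageid": 104761, "title": "Adamsville, Alabama", "lastrevid": 1177621427},
]})

wanted = ["pageid", "title", "lastrevid", "talkid", "redirect"]

# apply(pd.Series) turns each dict key into a column; reindex keeps the
# requested order and adds all-NaN columns for keys no record contained.
expanded = toy["query"].apply(pd.Series).reindex(columns=wanted)
print(expanded)
```

Selecting with `expanded[wanted]` would instead raise a `KeyError` if a key such as `redirect` never occurred in the data.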

In [41]:
data_1

Unnamed: 0,pageid,ns,title,contentmodel,pagelanguage,pagelanguagehtmlcode,pagelanguagedir,touched,lastrevid,length,talkid,fullurl,editurl,canonicalurl,watchers,redirect,new
0,104730,0,"Abbeville, Alabama",wikitext,en,en,ltr,2023-10-10T22:35:37Z,1171163550,24706,281244.0,"https://en.wikipedia.org/wiki/Abbeville,_Alabama","https://en.wikipedia.org/w/index.php?title=Abbeville,_Alabama&action=edit","https://en.wikipedia.org/wiki/Abbeville,_Alabama",,,
1,104761,0,"Adamsville, Alabama",wikitext,en,en,ltr,2023-10-10T22:35:37Z,1177621427,18040,281272.0,"https://en.wikipedia.org/wiki/Adamsville,_Alabama","https://en.wikipedia.org/w/index.php?title=Adamsville,_Alabama&action=edit","https://en.wikipedia.org/wiki/Adamsville,_Alabama",,,
2,105188,0,"Addison, Alabama",wikitext,en,en,ltr,2023-10-10T22:35:37Z,1168359898,13309,281517.0,"https://en.wikipedia.org/wiki/Addison,_Alabama","https://en.wikipedia.org/w/index.php?title=Addison,_Alabama&action=edit","https://en.wikipedia.org/wiki/Addison,_Alabama",,,
3,104726,0,"Akron, Alabama",wikitext,en,en,ltr,2023-10-10T22:35:37Z,1165909508,11710,281240.0,"https://en.wikipedia.org/wiki/Akron,_Alabama","https://en.wikipedia.org/w/index.php?title=Akron,_Alabama&action=edit","https://en.wikipedia.org/wiki/Akron,_Alabama",,,
4,105109,0,"Alabaster, Alabama",wikitext,en,en,ltr,2023-10-10T22:35:37Z,1179139816,20343,281444.0,"https://en.wikipedia.org/wiki/Alabaster,_Alabama","https://en.wikipedia.org/w/index.php?title=Alabaster,_Alabama&action=edit","https://en.wikipedia.org/wiki/Alabaster,_Alabama",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21520,140221,0,"Wamsutter, Wyoming",wikitext,en,en,ltr,2023-10-10T22:36:04Z,1169591845,15315,7761801.0,"https://en.wikipedia.org/wiki/Wamsutter,_Wyoming","https://en.wikipedia.org/w/index.php?title=Wamsutter,_Wyoming&action=edit","https://en.wikipedia.org/wiki/Wamsutter,_Wyoming",,,
21521,140185,0,"Wheatland, Wyoming",wikitext,en,en,ltr,2023-10-10T22:36:04Z,1176370621,20494,10817367.0,"https://en.wikipedia.org/wiki/Wheatland,_Wyoming","https://en.wikipedia.org/w/index.php?title=Wheatland,_Wyoming&action=edit","https://en.wikipedia.org/wiki/Wheatland,_Wyoming",,,
21522,140245,0,"Worland, Wyoming",wikitext,en,en,ltr,2023-10-10T22:36:04Z,1166347917,19443,10265719.0,"https://en.wikipedia.org/wiki/Worland,_Wyoming","https://en.wikipedia.org/w/index.php?title=Worland,_Wyoming&action=edit","https://en.wikipedia.org/wiki/Worland,_Wyoming",,,
21523,140070,0,"Wright, Wyoming",wikitext,en,en,ltr,2023-10-10T22:36:04Z,1166334449,12129,10717664.0,"https://en.wikipedia.org/wiki/Wright,_Wyoming","https://en.wikipedia.org/w/index.php?title=Wright,_Wyoming&action=edit","https://en.wikipedia.org/wiki/Wright,_Wyoming",,,


# Step 2: Getting Article Quality Predictions


# Requesting ORES scores through LiftWing ML Service API
Wikimedia Foundation (WMF) is reworking access to their APIs. It is likely in the coming years that all API access will require some kind of authentication, either through a simple key/token or through some version of OAuth. For now this is still a work in progress. You can follow the progress from their [API portal](https://api.wikimedia.org/wiki/Main_Page). Another on-going change is better control over API services in situations where those services require additional computational resources, beyond simply serving the text of a web page (i.e., the text of an article). Services like ORES, which require running an ML model over the text of an article page, are examples of compute-intensive API services.

Wikimedia is implementing a new Machine Learning (ML) service infrastructure that they call [LiftWing](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing). Given that ORES already has several ML models that have been well used, ORES is the first set of APIs that are being moved to LiftWing.

This code generates article quality estimates for article revisions using the LiftWing version of [ORES](https://www.mediawiki.org/wiki/ORES). The [ORES API documentation](https://ores.wikimedia.org) can be accessed from the main ORES page. The [ORES LiftWing documentation](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage) is very thin ... even thinner than the standard ORES documentation. Further, it is clear that some parameters have been renamed (e.g., "revid" in the old ORES API is now "rev_id" in the LiftWing ORES API).


## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - August 15, 2023



# Get your access token

You will need a Wikimedia user account to get access to Lift Wing (the ML API service). You can either [create an account or login](https://api.wikimedia.org/w/index.php?title=Special:UserLogin). If you have a Wikipedia user account, you might already have a Wikimedia account. If you are not sure, try your Wikipedia username and password to check. If you do not have a Wikimedia account you will need to create one that you can use to get an access token.

There is [a 'guide' that describes how to get authentication tokens](https://api.wikimedia.org/wiki/Authentication) - but not everything works the way it is described in that documentation. You should review that documentation and then read the rest of this comment.

The documentation talks about using a "dashboard" for managing authentication tokens. That's a rather generous description for what looks like a simple list of token things. You might have a hard time finding this "dashboard". First, on the left hand side of the page, you'll see a column of links. The bottom section is a set of links titled "Tools". In that section is a link that says [Special pages](https://api.wikimedia.org/wiki/Special:SpecialPages) which will take you to a list of ... well, special pages. At the very bottom of the "Special pages" page is a section titled "Other special pages" (scroll all the way to the bottom). The first link in that section is called [API keys](https://api.wikimedia.org/wiki/Special:AppManagement). When you get to the "API keys" page you can create a new key.

The authentication guide suggests that you should create a server-side app key. This does not seem to work correctly - as yet. It failed on multiple attempts when I attempted to create a server-side app key. BUT, there is an option to create a [Personal API token](https://api.wikimedia.org/wiki/Authentication) that should work for this course and the type of ORES page scoring that you will need to perform.

Note, when you create a Personal API token you are granted three items - a Client ID, a Client secret, and an Access token - and you should save all three. When you dismiss the box they are gone. If you lose any one of them you can destroy or deactivate the Personal API token from the dashboard and then create a new one.

The value you need to work the code below is the Access token - a very long string.


In [42]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token.
#    This token authorizes 5000 requests per HOUR, so requests are spaced 3600/5000 seconds apart.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<avivam@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "avivam@uw.edu",         # your email address should go here
    'access_token'  : "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJhdWQiOiJkOTdhODZmM2ZhM2ZkMDFlZGViZTE4Njg2MDYxODZhNiIsImp0aSI6IjY0ZTg2NTVmYWM2NjU2YTBjNWE3ZWIxN2JlZmY4YjY4NzEyOGU4YmFmNzdiNzgzMjBkNTg1MjBjZDEyMTBmYjUzYzA5YWY5YjhlOTJiOTA0IiwiaWF0IjoxNjk3NDk0MDM5LjA2MzQ2OCwibmJmIjoxNjk3NDk0MDM5LjA2MzQ3MSwiZXhwIjozMzI1NDQwMjgzOS4wNjEzMiwic3ViIjoiNzQwMjEyODEiLCJpc3MiOiJodHRwczovL21ldGEud2lraW1lZGlhLm9yZyIsInJhdGVsaW1pdCI6eyJyZXF1ZXN0c19wZXJfdW5pdCI6NTAwMCwidW5pdCI6IkhPVVIifSwic2NvcGVzIjpbImJhc2ljIiwiY3JlYXRlZWRpdG1vdmVwYWdlIiwiZWRpdHByb3RlY3RlZCJdfQ.b-m1C9Na4UUIS3f_bgBis71VXIqdxV1495puOZckzlSc_rGUfPhDhERQAwxN5TtX0-3pj9TYP5O8-HYyFdE1DyRl0FO1il59jYtZkS1UR4SPKH5_ZGrrvc_xbYXx-yYxROiswG0pV-K6gsSn6JE1vvXqRqnQNPVcqAf7SfMyM4zpYkHfaZB_W6g0otSRR_W-bFCU-pMWNtuVwAW4ydtm6XlTyXCCkr4gWZCcAoLqgK7Dc3KftJY_OFI1Oaelcl7Z7yc6AAbMuQ62CW6hMWSMdR5jzUFziOUqAW9dMbWB8EcJm5I2VzVk4n8ZK_NhFkITdt0x7ErRxwv-DROP5IxBYpjmBvpqChpgc3IQDv8E6uwKD02YaGY7H41vc6K5uir5SHMcPD8PzotwctNbRPhsCdlbDuX0m4ag-169sMOWllRiSmyWqixSh6hfukV-MJa231pwkJdDlq7JxuxFrOQ8yROU75U1GDmLdddIsVKP6hgOIKpfcbrTove50GpYdnq8P8JacDDlW6iStBMwmT4prlel-MOMXf-b1BzS0spXuFEngczhO4KRikqiYuo6vTJT65NLmCHDLW-Gzqx4O2GOEaBOEAnSiJPI6olI6fKFB9zSdvca4YAYvu_ObII6fluMrKXad1oEwS1W4rO9Oh0QZPcVwgcv_2Cp6Bwz6TU8cg8"          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later when making the scoring requests
#
USERNAME = "Avivamunshi"
ACCESS_TOKEN = "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJhdWQiOiJkOTdhODZmM2ZhM2ZkMDFlZGViZTE4Njg2MDYxODZhNiIsImp0aSI6IjY0ZTg2NTVmYWM2NjU2YTBjNWE3ZWIxN2JlZmY4YjY4NzEyOGU4YmFmNzdiNzgzMjBkNTg1MjBjZDEyMTBmYjUzYzA5YWY5YjhlOTJiOTA0IiwiaWF0IjoxNjk3NDk0MDM5LjA2MzQ2OCwibmJmIjoxNjk3NDk0MDM5LjA2MzQ3MSwiZXhwIjozMzI1NDQwMjgzOS4wNjEzMiwic3ViIjoiNzQwMjEyODEiLCJpc3MiOiJodHRwczovL21ldGEud2lraW1lZGlhLm9yZyIsInJhdGVsaW1pdCI6eyJyZXF1ZXN0c19wZXJfdW5pdCI6NTAwMCwidW5pdCI6IkhPVVIifSwic2NvcGVzIjpbImJhc2ljIiwiY3JlYXRlZWRpdG1vdmVwYWdlIiwiZWRpdHByb3RlY3RlZCJdfQ.b-m1C9Na4UUIS3f_bgBis71VXIqdxV1495puOZckzlSc_rGUfPhDhERQAwxN5TtX0-3pj9TYP5O8-HYyFdE1DyRl0FO1il59jYtZkS1UR4SPKH5_ZGrrvc_xbYXx-yYxROiswG0pV-K6gsSn6JE1vvXqRqnQNPVcqAf7SfMyM4zpYkHfaZB_W6g0otSRR_W-bFCU-pMWNtuVwAW4ydtm6XlTyXCCkr4gWZCcAoLqgK7Dc3KftJY_OFI1Oaelcl7Z7yc6AAbMuQ62CW6hMWSMdR5jzUFziOUqAW9dMbWB8EcJm5I2VzVk4n8ZK_NhFkITdt0x7ErRxwv-DROP5IxBYpjmBvpqChpgc3IQDv8E6uwKD02YaGY7H41vc6K5uir5SHMcPD8PzotwctNbRPhsCdlbDuX0m4ag-169sMOWllRiSmyWqixSh6hfukV-MJa231pwkJdDlq7JxuxFrOQ8yROU75U1GDmLdddIsVKP6hgOIKpfcbrTove50GpYdnq8P8JacDDlW6iStBMwmT4prlel-MOMXf-b1BzS0spXuFEngczhO4KRikqiYuo6vTJT65NLmCHDLW-Gzqx4O2GOEaBOEAnSiJPI6olI6fKFB9zSdvca4YAYvu_ObII6fluMrKXad1oEwS1W4rO9Oh0QZPcVwgcv_2Cp6Bwz6TU8cg8"
#
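
The rate limit mentioned in the comments can be read from the token itself, since the payload segment of a JWT is base64url-encoded JSON. The sketch below decodes a dummy token built to have the same `ratelimit` shape as the granted one. The helper names are hypothetical, no signature is verified, and unit names other than `HOUR` are assumptions.

```python
import base64
import json

def decode_jwt_payload(token):
    """Decode the (unverified) payload segment of a JWT into a dict."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)   # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def throttle_wait(payload, latency=0.002):
    """Seconds to sleep between requests so the granted rate is not exceeded."""
    # Only "HOUR" is seen in the token above; the other unit names are assumed.
    seconds_per_unit = {"SECOND": 1, "MINUTE": 60, "HOUR": 3600, "DAY": 86400}
    limit = payload["ratelimit"]
    spacing = seconds_per_unit[limit["unit"]] / limit["requests_per_unit"]
    return max(spacing - latency, 0.0)

# Build a dummy token with the same payload shape (header and signature are fake).
fake_payload = {"ratelimit": {"requests_per_unit": 5000, "unit": "HOUR"}}
segment = base64.urlsafe_b64encode(json.dumps(fake_payload).encode()).decode().rstrip("=")
dummy_token = "header." + segment + ".signature"

payload = decode_jwt_payload(dummy_token)
print(payload["ratelimit"], round(throttle_wait(payload), 3))
```

A limit of 5000 requests per hour works out to one request roughly every 0.72 seconds. Sending requests faster than that is one plausible cause of error responses from the service.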

The following function makes a request to obtain the ORES score and returns the result

In [43]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT,
                                   model_name = API_ORES_EN_QUALITY_MODEL,
                                   request_data = ORES_REQUEST_DATA_TEMPLATE,
                                   header_format = REQUEST_HEADER_TEMPLATE,
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token

    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)

    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
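
The header and payload construction inside the function can be tried without contacting the service. In this sketch `build_request` is a hypothetical helper, the email address and token are placeholders, and the `{email_address}` placeholder in the User-Agent is an assumption made so that the supplied address appears in the header.

```python
import json

HEADER_TEMPLATE = {
    "User-Agent": "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    "Content-Type": "application/json",
    "Authorization": "Bearer {access_token}",
}

def build_request(rev_id, email_address, access_token):
    """Return (headers, body) for one scoring request without mutating shared templates."""
    params = {"email_address": email_address, "access_token": access_token}
    headers = {key: value.format(**params) for key, value in HEADER_TEMPLATE.items()}
    # int() yields a plain Python int, which json.dumps accepts even when
    # the revision ID arrives as a NumPy integer from a DataFrame.
    body = json.dumps({"lang": "en", "rev_id": int(rev_id), "features": True})
    return headers, body

headers, body = build_request(1171163550, "someone@example.edu", "PLACEHOLDER_TOKEN")
print(headers["Authorization"])
print(body)
```

Building fresh dictionaries on every call keeps one request's revision ID or token from leaking into the next through a shared default argument.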


We now process the list of Wikipedia articles to retrieve quality scores using the ORES API.

We start by preparing lists that hold the collected data and a set that records the articles whose requests fail. Next, for each article in `data_1`, we call the ORES API with the article's latest revision ID to obtain its quality score.

As the scores arrive, we extract the predicted class and the probabilities for each quality category (`B`, `C`, `FA`, `GA`, `Start`, and `Stub`). These values are collected in separate lists, which are then combined into the `quality_1` DataFrame.

In [44]:
# Initialize lists to store page titles and quality scores as a dataframe
page_titles = []
revision_ids = []
predictions = []
probabilities_B = []
probabilities_C = []
probabilities_FA = []
probabilities_GA = []
probabilities_Start = []
probabilities_Stub = []

In [46]:
failed_articles = set()

In [47]:
for index, row in data_1.iterrows():
    article_title = row['title']
    article_revid = row['lastrevid']
    print('Requesting quality score for: ', article_title, ', with rev ID: ', article_revid)

    quality_calc = request_ores_score_per_article(article_revid=article_revid, email_address="avivam@uw.edu", access_token=ACCESS_TOKEN)

    # a None result means the request itself failed; an 'httpCode' key means the API returned an error
    if quality_calc is None or 'httpCode' in quality_calc:
        print(f"Failed to retrieve quality score for {article_title}")
        failed_articles.add(article_title)
    else:
        quality_scores = quality_calc.get('enwiki', {}).get('scores', {})
        for rev_id, rev_data in quality_scores.items():
            prediction = rev_data['articlequality']['score']['prediction']
            probability = rev_data['articlequality']['score']['probability']
            page_titles.append(article_title)
            revision_ids.append(rev_id)
            predictions.append(prediction)
            probabilities_B.append(probability.get('B', None))
            probabilities_C.append(probability.get('C', None))
            probabilities_FA.append(probability.get('FA', None))
            probabilities_GA.append(probability.get('GA', None))
            probabilities_Start.append(probability.get('Start', None))
            probabilities_Stub.append(probability.get('Stub', None))

Streaming output truncated to the last 5000 lines.
Requesting quality score for:  Porter Township, Jefferson County, Pennsylvania , with rev ID:  1173811134
Failed to retrieve quality score for Porter Township, Jefferson County, Pennsylvania
Requesting quality score for:  Pillow, Pennsylvania , with rev ID:  1157451597
Failed to retrieve quality score for Pillow, Pennsylvania
Requesting quality score for:  Enon Valley, Pennsylvania , with rev ID:  1169508153
Failed to retrieve quality score for Enon Valley, Pennsylvania
Requesting quality score for:  Spartansburg, Pennsylvania , with rev ID:  1172104787
Failed to retrieve quality score for Spartansburg, Pennsylvania
Requesting quality score for:  Manns Choice, Pennsylvania , with rev ID:  1157439819
Failed to retrieve quality score for Manns Choice, Pennsylvania
Requesting quality score for:  Pine Township, Lycoming County, Pennsylvania , with rev ID:  1098462035
Failed to retrieve quality score for Pine Township, Lycomin
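
The response handling in the loop can be exercised on hand-written data. The successful response below copies the nesting the loop reads and the values of the `Abbeville, Alabama` row of `quality_1`. The error body's status value is an assumption, and `score_rows` is a hypothetical helper that collects row dictionaries instead of nine parallel lists.

```python
import pandas as pd

CLASSES = ["B", "C", "FA", "GA", "Start", "Stub"]

def score_rows(title, response):
    """Flatten one scoring response into row dicts; return [] when it is unusable."""
    if not response or "httpCode" in response:
        return []
    rows = []
    for rev_id, rev_data in response.get("enwiki", {}).get("scores", {}).items():
        score = rev_data["articlequality"]["score"]
        row = {"page_title": title, "revision_id": rev_id, "prediction": score["prediction"]}
        for name in CLASSES:
            row["probability_" + name] = score["probability"].get(name)
        rows.append(row)
    return rows

# One success, one error body (status value assumed), one failed request (None).
ok = {"enwiki": {"scores": {"1171163550": {"articlequality": {"score": {
    "prediction": "C",
    "probability": {"B": 0.310423, "C": 0.597920, "FA": 0.025186,
                    "GA": 0.049521, "Start": 0.013574, "Stub": 0.003376}}}}}}}
error = {"httpCode": 429}

rows = []
for title, response in [("Abbeville, Alabama", ok),
                        ("Pillow, Pennsylvania", error),
                        ("Leesport, Pennsylvania", None)]:
    rows.extend(score_rows(title, response))

quality = pd.DataFrame(rows)
print(quality)
```

Checking for `None` before testing for `httpCode` matters, because the `in` operator raises a `TypeError` on `None`.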

In [50]:
quality_1 = pd.DataFrame({
    'page_title': page_titles,
    'revision_id': revision_ids,
    'prediction': predictions,
    'probability_B': probabilities_B,
    'probability_C': probabilities_C,
    'probability_FA': probabilities_FA,
    'probability_GA': probabilities_GA,
    'probability_Start': probabilities_Start,
    'probability_Stub': probabilities_Stub
})

In [52]:
quality_1

Unnamed: 0,page_title,revision_id,prediction,probability_B,probability_C,probability_FA,probability_GA,probability_Start,probability_Stub
0,"Abbeville, Alabama",1171163550,C,0.310423,0.597920,0.025186,0.049521,0.013574,0.003376
1,"Adamsville, Alabama",1177621427,C,0.198274,0.377070,0.019070,0.351488,0.050261,0.003837
2,"Addison, Alabama",1168359898,C,0.271041,0.324460,0.011266,0.294871,0.093188,0.005175
3,"Akron, Alabama",1165909508,GA,0.175388,0.265587,0.011557,0.448584,0.093508,0.005375
4,"Alabaster, Alabama",1179139816,C,0.270972,0.646384,0.009591,0.033642,0.036341,0.003071
...,...,...,...,...,...,...,...,...,...
12181,"Wamsutter, Wyoming",1169591845,GA,0.140348,0.187580,0.011517,0.624482,0.032711,0.003363
12182,"Wheatland, Wyoming",1176370621,GA,0.245020,0.285966,0.034640,0.395906,0.033736,0.004732
12183,"Worland, Wyoming",1166347917,GA,0.160382,0.238469,0.026243,0.546966,0.023868,0.004072
12184,"Wright, Wyoming",1166334449,GA,0.165136,0.323542,0.009909,0.467899,0.029659,0.003855


In [54]:
save_path = '/content/drive/My Drive/{}/'.format(FOLDERNAME)
quality_1.to_csv(save_path + 'quality_2.csv', index=False)

For reproducibility, we list the articles for which no ORES score could be retrieved. These articles are excluded from the rest of the analysis.

In [49]:
print("List of articles for which we were unable to gather ORES scores:")
for article in failed_articles:
    print(article)

Streaming output truncated to the last 5000 lines.
Royal Oak, Michigan
Essex, Missouri
Norman, Oklahoma
Elk Township, Warren County, Pennsylvania
Fernley, Nevada
Athens, Ohio
Morristown, Minnesota
South Greenfield, Missouri
Madison Township, Lackawanna County, Pennsylvania
Salem, South Carolina
Mendenhall, Mississippi
Oakland, Oregon
Ryegate, Montana
Falcon, Mississippi
Bakersfield, Missouri
La Pine, Oregon
Landisburg, Pennsylvania
White Township, New Jersey
North Branch, Minnesota
Atlantic Beach, New York
Burlington, New Jersey
Hamilton Township, Monroe County, Pennsylvania
Tatum, New Mexico
St. Helens, Oregon
Plains Township, Luzerne County, Pennsylvania
Milford Center, Ohio
Piqua, Ohio
Mendota Heights, Minnesota
Asheville, North Carolina
Cayuga Heights, New York
Munsey Park, New York
Washington Township, Dauphin County, Pennsylvania
Chester Township, New Jersey
Dora, New Mexico
Middleport, New York
Rome, New York
Cortland, New York
Sallisaw, Oklahoma
Jeddo, Pennsylvani
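
The cell above only prints the failed titles. As a minimal sketch of how the list could also be written to disk, the helper below stores one title per line. The function name `save_failed_articles` and the file name `failed_articles.txt` are hypothetical and not part of the original notebook; the two sample titles are taken from the output above.

```python
import tempfile
from pathlib import Path

def save_failed_articles(failed_articles, path):
    """Write one article title per line, de-duplicated and sorted for stable diffs."""
    titles = sorted(set(failed_articles))
    Path(path).write_text("\n".join(titles) + "\n", encoding="utf-8")
    return titles

# Hypothetical usage with a temporary location instead of the Drive folder.
out_path = Path(tempfile.gettempdir()) / "failed_articles.txt"
saved = save_failed_articles(
    ["Royal Oak, Michigan", "Essex, Missouri", "Royal Oak, Michigan"],
    out_path,
)
```

Sorting and de-duplicating keeps the file identical across reruns when the same set of articles fails, which makes it easier to compare runs.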

# Step 3: Combining the Datasets

Now we build `combined_df` by merging the DataFrames that hold state information, quality scores, regional divisions, and population data. Along the way:

- The `state` column is cleaned and standardized: `Georgia_(U.S._state)` becomes `Georgia`, and underscores in state names are replaced with spaces.

- The `lastrevid` column is converted to strings so that it is compatible with the revision ID key used in the next merge.

- Quality predictions are joined on the revision ID.

- Regional division and population data are joined on the state name.

- Columns are renamed to the final schema (`article_title`, `revision_id`, `article_quality`, `regional_division`, `population`).

- Duplicate rows are removed.



In [55]:
# Start from the page info (titles and last revision IDs) and attach each article's state.
combined_df = data_1.copy()
combined_df = combined_df.merge(WIKI_SHEET, how='left', left_on='title', right_on='page_title')

# Normalize state names: special-case Georgia, then turn underscores into spaces.
combined_df['state'] = combined_df['state'].apply(lambda x: 'Georgia' if x == 'Georgia_(U.S._state)' else x)
combined_df['state'] = combined_df['state'].str.replace('_', ' ')

# Join the ORES prediction on the revision ID (both sides as strings).
combined_df['lastrevid'] = combined_df['lastrevid'].astype(str)
combined_df = combined_df.merge(quality_1[['revision_id', 'prediction']], left_on='lastrevid', right_on='revision_id', how='left')

# Join the census division and the 2022 population on the state name.
combined_df = combined_df.merge(REGION_SHEET, how='left', on='state')
combined_df = combined_df.merge(POP_SHEET[['State', '2022']], how='left', left_on='state', right_on='State')

# Drop the duplicate key column and rename to the final schema.
combined_df.drop('revision_id', inplace=True, axis=1)
combined_df.rename(columns={
    'title': 'article_title',
    'lastrevid': 'revision_id',
    'prediction': 'article_quality',
    'division': 'regional_division',
    '2022': 'population',
}, inplace=True)

# Keep only the output columns and remove duplicate rows.
combined_df = combined_df[['state', 'regional_division', 'population', 'article_title', 'revision_id', 'article_quality']]
combined_df.drop_duplicates(inplace=True, ignore_index=True)
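
Left merges can silently leave rows without a match or multiply rows when the lookup table has duplicate keys. The sketch below shows one way to check for both using the `validate` and `indicator` parameters of `DataFrame.merge`. The `articles` and `regions` frames are toy stand-ins with hypothetical data, not the notebook's actual sheets.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (hypothetical data).
articles = pd.DataFrame({
    "title": ["Abbeville, Alabama", "Akron, Alabama", "Nowhere, Atlantis"],
    "state": ["Alabama", "Alabama", "Atlantis"],
})
regions = pd.DataFrame({
    "state": ["Alabama", "Wyoming"],
    "division": ["East South Central", "Mountain"],
})

merged = articles.merge(
    regions,
    how="left",
    on="state",
    validate="many_to_one",  # raises MergeError if `regions` repeats a state
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Articles whose state has no entry in the lookup table.
unmatched = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
```

Inspecting `unmatched` after each merge makes it easier to see where missing divisions, populations, or quality scores come from.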

The resulting merged data is saved as `wp_scored_city_articles_by_state.csv` in Google Drive.

In [56]:
save_path = '/content/drive/My Drive/{}/'.format(FOLDERNAME)
combined_df.to_csv(save_path + 'wp_scored_city_articles_by_state.csv', index=False)

In [8]:
URL = 'wp_scored_city_articles_by_state.csv'
combined_df = pd.read_csv(URL)

combined_df

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality
0,Alabama,East South Central,5074296.0,"Abbeville, Alabama",1171163550,C
1,Alabama,East South Central,5074296.0,"Adamsville, Alabama",1177621427,C
2,Alabama,East South Central,5074296.0,"Addison, Alabama",1168359898,C
3,Alabama,East South Central,5074296.0,"Akron, Alabama",1165909508,GA
4,Alabama,East South Central,5074296.0,"Alabaster, Alabama",1179139816,C
...,...,...,...,...,...,...
21520,Wyoming,Mountain,581381.0,"Wamsutter, Wyoming",1169591845,GA
21521,Wyoming,Mountain,581381.0,"Wheatland, Wyoming",1176370621,GA
21522,Wyoming,Mountain,581381.0,"Worland, Wyoming",1166347917,GA
21523,Wyoming,Mountain,581381.0,"Wright, Wyoming",1166334449,GA


# Step 4: Analysis
In this segment we calculate `total-articles-per-population` (the number of articles per person) and `high-quality-articles-per-population` (the number of high-quality articles per person) for each state and each census division. Both values are "per capita" ratios.

For this analysis we consider 'high quality' articles to be articles that ORES predicted would be in either the `FA` (featured article) or `GA` (good article) classes.


In [9]:
high_quality_articles = combined_df[combined_df['article_quality'].isin(['FA', 'GA'])]

Here we start by grouping the data in `combined_df` based on 'state' and 'regional_division' columns.     
We then calculate the `total_articles`, which represents the count of articles within each group.

In [10]:
state_division_group = combined_df.groupby(['state', 'regional_division'])
total_articles = state_division_group['article_title'].count()

Next, we count the number of high-quality articles within each group defined by state and regional division.

In [11]:
high_quality_articles_count = high_quality_articles.groupby(['state', 'regional_division'])['article_title'].count()

Now, we extract the `state` and `regional_division` columns from the grouped data and append two new columns, `total_articles` and `high_quality_articles` into `analysis_df`. These represent the total article count and high-quality article count for each state and division combination.

In [12]:
analysis_df = state_division_group[['state', 'regional_division']].first()
analysis_df['total_articles'] = total_articles
analysis_df['high_quality_articles'] = high_quality_articles_count

In [22]:
analysis_df.reset_index(drop=True, inplace=True)
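
Because `high_quality_articles_count` is assigned by index alignment, any state/division group with no `FA` or `GA` article receives `NaN` rather than 0, and the column becomes floating point. The sketch below illustrates this on toy data and shows one possible way to fill the gaps with zeros. Whether a missing count should be treated as zero is an analysis decision that the original notebook does not make; all data here is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "regional_division": ["East South Central"] * 2 + ["Pacific"] * 2,
    "article_title": ["A", "B", "C", "D"],
    "article_quality": ["GA", "C", "Stub", "C"],
})
keys = ["state", "regional_division"]

total = df.groupby(keys)["article_title"].count()
high = (
    df[df["article_quality"].isin(["FA", "GA"])]
    .groupby(keys)["article_title"]
    .count()
)

summary = total.to_frame("total_articles")
summary["high_quality_articles"] = high  # aligned on the index; Alaska becomes NaN
has_gap = bool(summary["high_quality_articles"].isna().any())

# Optional: treat "no high-quality articles found" as a count of zero.
summary["high_quality_articles"] = summary["high_quality_articles"].fillna(0).astype(int)
summary = summary.reset_index()
```

Note that `sort_values` and `mean` both skip or push aside `NaN` values by default, so leaving the gaps unfilled changes which states appear in rankings and averages.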

Next, we merge the 2022 population data into `analysis_df`.

In [25]:
analysis_df = analysis_df.merge(POP_SHEET[['State', '2022']], left_on='state', right_on='State', how='left')

Finally, we compute `total_articles_per_capita` and `high_quality_articles_per_capita`.

In [26]:
analysis_df['total_articles_per_capita'] = analysis_df['total_articles'] / analysis_df['2022']
analysis_df['high_quality_articles_per_capita'] = analysis_df['high_quality_articles'] / analysis_df['2022']
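
As a worked example of the two ratios, the snippet below recomputes them for Alabama using the article counts and 2022 population that appear in this notebook's tables. The per-100,000 scaling at the end is an added convenience for readability, not something the notebook computes.

```python
# Values for Alabama as they appear in the notebook's tables.
alabama_population = 5074296
alabama_total_articles = 461
alabama_high_quality = 53

total_per_capita = alabama_total_articles / alabama_population
high_quality_per_capita = alabama_high_quality / alabama_population

print(f"total articles per capita:        {total_per_capita:.6e}")
print(f"high-quality articles per capita: {high_quality_per_capita:.6e}")

# Optional rescaling: articles per 100,000 residents.
total_per_100k = total_per_capita * 100_000
```

The raw ratios are very small numbers, so a per-100,000 scale can make state-to-state comparisons easier to read without changing the ranking.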

The helper columns `State` and `2022` are no longer needed, so we drop them.

In [27]:
# Drop the merge helper columns; the per-capita ratios are already computed.
analysis_df.drop(['State', '2022'], inplace=True, axis=1)

analysis_df

Unnamed: 0,state,regional_division,total_articles,high_quality_articles,total_articles_per_capita,high_quality_articles_per_capita
0,Alabama,East South Central,461,53.0,9.1e-05,1.04448e-05
1,Alaska,Pacific,149,31.0,0.000203,4.225834e-05
2,Arizona,Mountain,91,24.0,1.2e-05,3.261225e-06
3,Arkansas,West South Central,500,72.0,0.000164,2.364037e-05
4,California,Pacific,482,173.0,1.2e-05,4.432563e-06
5,Colorado,Mountain,290,76.0,5e-05,1.301386e-05
6,Delaware,South Atlantic,57,25.0,5.6e-05,2.454841e-05
7,Florida,South Atlantic,412,119.0,1.9e-05,5.349559e-06
8,Georgia,South Atlantic,538,93.0,4.9e-05,8.522043e-06
9,Hawaii,Pacific,151,30.0,0.000105,2.08305e-05


In [37]:
analysis_df.shape

(48, 6)

# Step 5: Results

1. Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order).


In [28]:
analysis_1 = analysis_df.sort_values(by='total_articles_per_capita', ascending=False).head(10)
analysis_1.drop(['high_quality_articles','high_quality_articles_per_capita'],inplace=True,axis=1)
analysis_1 = analysis_1.reset_index(drop=True)
analysis_1

Unnamed: 0,state,regional_division,total_articles,total_articles_per_capita
0,Vermont,New England,329,0.000508
1,North Dakota,West North Central,356,0.000457
2,Maine,New England,483,0.000349
3,South Dakota,West North Central,311,0.000342
4,Iowa,West North Central,1043,0.000326
5,Alaska,Pacific,149,0.000203
6,Pennsylvania,Middle Atlantic,2556,0.000197
7,Michigan,East North Central,1773,0.000177
8,Wyoming,Mountain,99,0.00017
9,New Hampshire,New England,234,0.000168


In [44]:
save_path = '/content/drive/My Drive/{}/'.format(FOLDERNAME)
analysis_1.to_csv(save_path + 'analysis_1.csv', index=False)

2. Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order).


In [43]:
analysis_2 = analysis_df.sort_values(by='total_articles_per_capita').head(10)
analysis_2.drop(['high_quality_articles','high_quality_articles_per_capita'],inplace=True,axis=1)
analysis_2=analysis_2.reset_index(drop=True)
analysis_2

Unnamed: 0,state,regional_division,total_articles,total_articles_per_capita
0,North Carolina,South Atlantic,50,5e-06
1,Nevada,Mountain,19,6e-06
2,California,Pacific,482,1.2e-05
3,Arizona,Mountain,91,1.2e-05
4,Virginia,South Atlantic,133,1.5e-05
5,Florida,South Atlantic,412,1.9e-05
6,Oklahoma,West South Central,75,1.9e-05
7,Kansas,West North Central,63,2.1e-05
8,Maryland,South Atlantic,157,2.5e-05
9,Wisconsin,East North Central,192,3.3e-05


In [45]:
save_path = '/content/drive/My Drive/{}/'.format(FOLDERNAME)
analysis_2.to_csv(save_path + 'analysis_2.csv', index=False)

3. Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order).


In [42]:
analysis_3 = analysis_df.sort_values(by='high_quality_articles_per_capita', ascending=False).head(10)
analysis_3.drop(['total_articles','total_articles_per_capita'],inplace=True,axis=1)
analysis_3=analysis_3.reset_index(drop=True)
analysis_3

Unnamed: 0,state,regional_division,high_quality_articles,high_quality_articles_per_capita
0,Vermont,New England,45.0,7e-05
1,Wyoming,Mountain,39.0,6.7e-05
2,West Virginia,South Atlantic,106.0,6e-05
3,Alaska,Pacific,31.0,4.2e-05
4,Iowa,West North Central,104.0,3.2e-05
5,Maine,New England,43.0,3.1e-05
6,Delaware,South Atlantic,25.0,2.5e-05
7,Arkansas,West South Central,72.0,2.4e-05
8,Idaho,Mountain,41.0,2.1e-05
9,Hawaii,Pacific,30.0,2.1e-05


In [46]:
save_path = '/content/drive/My Drive/{}/'.format(FOLDERNAME)
analysis_3.to_csv(save_path + 'analysis_3.csv', index=False)

4. Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).


In [41]:
analysis_4 = analysis_df.sort_values(by='high_quality_articles_per_capita').head(10)
analysis_4.drop(['total_articles','total_articles_per_capita'],inplace=True,axis=1)
analysis_4 = analysis_4.reset_index(drop=True)
analysis_4

Unnamed: 0,state,regional_division,high_quality_articles,high_quality_articles_per_capita
0,Pennsylvania,Middle Atlantic,12.0,9.250688e-07
1,Virginia,South Atlantic,18.0,2.072868e-06
2,Arizona,Mountain,24.0,3.261225e-06
3,California,Pacific,173.0,4.432563e-06
4,Florida,South Atlantic,119.0,5.349559e-06
5,Tennessee,East South Central,39.0,5.530864e-06
6,Maryland,South Atlantic,42.0,6.813028e-06
7,Kansas,West North Central,22.0,7.490254e-06
8,Georgia,South Atlantic,93.0,8.522043e-06
9,Massachusetts,New England,62.0,8.88001e-06


In [47]:
save_path = '/content/drive/My Drive/{}/'.format(FOLDERNAME)
analysis_4.to_csv(save_path + 'analysis_4.csv', index=False)
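
The four state rankings above repeat the same sort, column selection, and index reset. A small helper could express that pattern once. The function name `rank_states` and its parameters are hypothetical; the demo rows reuse three values from the tables above.

```python
import pandas as pd

def rank_states(df, metric, count_col, n=10, ascending=False):
    """Return the n states with the highest (or lowest) value of `metric`."""
    cols = ["state", "regional_division", count_col, metric]
    return (
        df.sort_values(by=metric, ascending=ascending)
          .head(n)[cols]
          .reset_index(drop=True)
    )

demo = pd.DataFrame({
    "state": ["Vermont", "Nevada", "Iowa"],
    "regional_division": ["New England", "Mountain", "West North Central"],
    "total_articles": [329, 19, 1043],
    "total_articles_per_capita": [0.000508, 6e-06, 0.000326],
})

top2 = rank_states(demo, "total_articles_per_capita", "total_articles", n=2)
bottom1 = rank_states(demo, "total_articles_per_capita", "total_articles", n=1, ascending=True)
```

With such a helper, each of the four tables would differ only in the metric, the count column, and the sort direction.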

5. Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita, computed here as the unweighted mean of the state-level ratios in each division.


In [40]:
analysis_5 = analysis_df.groupby('regional_division').agg({'total_articles_per_capita': 'mean'}).reset_index()
analysis_5 = analysis_5.sort_values(by='total_articles_per_capita', ascending=False)
analysis_5 = analysis_5.reset_index(drop=True)
analysis_5

Unnamed: 0,regional_division,total_articles_per_capita
0,West North Central,0.000242
1,New England,0.000222
2,Middle Atlantic,9.7e-05
3,East North Central,9.5e-05
4,East South Central,8.4e-05
5,Pacific,8.3e-05
6,Mountain,7.3e-05
7,West South Central,7.2e-05
8,South Atlantic,4.4e-05
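
The division figures above average the state-level ratios, so a small state counts as much as a large one. An alternative is to sum articles and population within each division and then divide. The sketch below contrasts the two on toy data; the figures are hypothetical and chosen so that the methods disagree visibly, and the notebook itself uses only the first method.

```python
import pandas as pd

# Toy figures (hypothetical): one large and one small state in the same division.
states = pd.DataFrame({
    "state": ["Big", "Small"],
    "regional_division": ["Demo", "Demo"],
    "total_articles": [100, 50],
    "population": [10_000_000, 500_000],
})
states["per_capita"] = states["total_articles"] / states["population"]

# Method used in this notebook: unweighted mean of the state-level ratios.
mean_of_ratios = states.groupby("regional_division")["per_capita"].mean()

# Alternative: total articles divided by total population per division.
sums = states.groupby("regional_division")[["total_articles", "population"]].sum()
ratio_of_sums = sums["total_articles"] / sums["population"]
```

In this toy case the mean of ratios is several times larger than the ratio of sums, because the small state's high ratio dominates the average. Which definition is appropriate depends on whether the question is about a typical state or about the division's residents as a whole.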


In [48]:
save_path = '/content/drive/My Drive/{}/'.format(FOLDERNAME)
analysis_5.to_csv(save_path + 'analysis_5.csv', index=False)

6. Census divisions by high quality coverage: A rank ordered list of US census divisions (in descending order) by high quality articles per capita, computed here as the unweighted mean of the available state-level ratios in each division.


In [39]:
analysis_6 = analysis_df.groupby('regional_division').agg({'high_quality_articles_per_capita': 'mean'}).reset_index()
analysis_6 = analysis_6.sort_values(by='high_quality_articles_per_capita', ascending=False)
analysis_6 = analysis_6.reset_index(drop=True)
analysis_6

Unnamed: 0,regional_division,high_quality_articles_per_capita
0,New England,3.648807e-05
1,Mountain,2.450887e-05
2,Pacific,2.057298e-05
3,West North Central,1.99925e-05
4,South Atlantic,1.783649e-05
5,West South Central,1.646999e-05
6,East North Central,1.331883e-05
7,East South Central,1.116111e-05
8,Middle Atlantic,9.250688e-07


In [49]:
save_path = '/content/drive/My Drive/{}/'.format(FOLDERNAME)
analysis_6.to_csv(save_path + 'analysis_6.csv', index=False)