# This notebook serves as the executable software that reads in datasets, combines datasets, and runs data analysis

The end goal is to analyze how Wikipedia's coverage of US cities, and the quality of the articles about those cities, vary among states. 

Steps: 
1. Confirm datasets are ready to go.
1. Combine the dataset of Wikipedia articles with a dataset of state populations
2. Use ORES to estimate the quality of articles about the cities
3. Data analysis
   3a. The states with the greatest and least coverage of cities on Wikipedia relative  
       to their population.
   3b. The states with the highest and lowest proportion of high-quality articles about cities.
   3c. A ranking of US geographic regions by articles-per-person and proportion of high-quality 
       articles.



# Import packages

In [1]:
import numpy as np
import pandas as pd
import csv
import json
import time
import urllib.parse
import requests
import base64

# Part 1. Confirm datasets are ready to go. 

Confirm that the files below are there:  
./data/PopulationEstimates.csv   
./data/States_by_region.csv  
./data/us_cities_by_state_SEPT2023.csv  

#### Step 1A: Read in ./data/PopulationEstimates.csv as csv object.
    ##### Step 1A1: Store only state and population data from csv into list for just states. 
    ##### Step 1A2: Output first, middle, and last row of csv object to confirm data is in memory. 
#### Step 1B: Read in ./data/States_by_region.csv as csv object.
    ##### Step 1B1: Store all data from csv into list. 
    ##### Step 1B2: Remove header row from list to prevent it showing up later.    
    ##### Step 1B3: Output first, middle, and last row of csv object to confirm data is in memory. 
#### Step 1C: Read in ./data/us_cities_by_state_SEPT2023.csv as csv object.
    ##### Step 1C1: Store all data from csv into list. 
    ##### Step 1C2: Remove header row from list to prevent it showing up later.    
    ##### Step 1C3: Output first, middle, and last row of csv object to confirm data is in memory. 

### Data sources: 
##### ./data/PopulationEstimates.csv  
https://www2.census.gov/programs-surveys/popest/datasets/2020-2022/state/totals/NST-EST2022-ALLDATA.csv    
https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html  
##### ./data/States_by_region.csv    
The shared Google Drive for Homework 2 in UW DATA 512 provides this file.   
##### ./data/us_cities_by_state_SEPT2023.csv  
The shared Google Drive for Homework 2 in UW DATA 512 provides this file.   
The list was generated by crawling the Wikipedia Category:Lists of cities in the United States by   
state to collect the Wikipedia article pages about US cities from each state.   


In [2]:
# Step 1A: Read in ./data/PopulationEstimates.csv as csv object.
StatePopEstimates = []
with open('../data/PopulationEstimates.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    
    # Step 1A1: Store only state and population data from csv into list for just states. 
    #  Keep rows 15 through 65 only. This excludes the first 15 rows (the header row plus
    #  larger regions of the United States) and any rows from index 66 onward (Puerto Rico),
    #  none of which are states.
    counter = 0
    for row in reader: 
        if 14 < counter < 66:
            StatePopEstimates.append([row[4], row[8]]) 
        counter += 1
        
# 1A1: Remove the District of Columbia, which sits at index 8 of the alphabetical list
StatePopEstimates.pop(8)
        
# Step 1A2: Output first, middle, and last row of csv object to confirm data is in memory.  
print("Number of states, should be 50:", len(StatePopEstimates))
print(StatePopEstimates[0])
print(StatePopEstimates[len(StatePopEstimates)//2])
print(StatePopEstimates[len(StatePopEstimates)-1])

Number of states, should be 50: 50
['Alabama', '5074296']
['Montana', '1122867']
['Wyoming', '581381']
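
The row-index filter above depends on the exact row order of the Census file. As a hedged alternative, the sketch below selects states by header name instead. It assumes the file keeps the Census header row with `SUMLEV`, `NAME`, and `POPESTIMATE2022` columns and that state-level rows carry summary level 40; both should be checked against the actual file. `read_state_populations` is a hypothetical helper, and the inline sample is a made-up stand-in in which only the Alabama and Wyoming figures come from the output above.

```python
import csv
import io

# Assumption: these two entries share the state summary level but are not states.
NON_STATES = {"District of Columbia", "Puerto Rico"}

def read_state_populations(csvfile, name_col="NAME", pop_col="POPESTIMATE2022"):
    """Return [[state, population], ...] for rows that look like US states."""
    states = []
    for row in csv.DictReader(csvfile):
        # Assumption: summary level 40 marks state-level rows.
        if int(row["SUMLEV"]) != 40:
            continue
        if row[name_col] in NON_STATES:
            continue
        states.append([row[name_col], row[pop_col]])
    return states

# Stand-in data in the assumed layout; the "0" populations are placeholders.
sample = io.StringIO(
    "SUMLEV,NAME,POPESTIMATE2022\n"
    "010,United States,0\n"
    "040,Alabama,5074296\n"
    "040,District of Columbia,0\n"
    "040,Wyoming,581381\n"
    "040,Puerto Rico,0\n"
)
result = read_state_populations(sample)
print(result)
```

Filtering by name avoids the separate `pop(8)` step, at the cost of relying on the header names being present.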


In [3]:
# Step 1B: Read in ./data/States_by_region.csv as csv object.
StateRegions = []
with open('../data/States_by_region.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    
    # Step 1B1: Store all data from csv into list. 
    for row in reader: 
        StateRegions.append([row[0], row[1], row[2]]) 

# Step 1B2: Remove header row
StateRegions.pop(0)

# Step 1B3: Output first, middle, and last row of csv object to confirm data is in memory.
print("Number of states, should be 50:", len(StateRegions))
print(StateRegions[0])
print(StateRegions[len(StateRegions)//2])
print(StateRegions[len(StateRegions)-1])

Number of states, should be 50: 50
['Northeast', 'New England', 'Connecticut']
['South', 'South Atlantic', 'North Carolina']
['West', 'Pacific', 'Washington']


In [4]:
# Step 1C: Read in ./data/us_cities_by_state_SEPT2023.csv as csv object.
CityArticles = []
with open('../data/us_cities_by_state_SEPT2023.csv', newline='', encoding="utf8") as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    
    # Step 1C1: Store all data from csv into list. 
    for row in reader: 
        CityArticles.append([row[0], row[1], row[2]]) 

# Step 1C2: Remove header row
CityArticles.pop(0)

# Step 1C3: Output first, middle, and last row of csv object to confirm data is in memory.
print(CityArticles[0])
print(CityArticles[len(CityArticles)//2])
print(CityArticles[len(CityArticles)-1])

['Alabama', 'Abbeville, Alabama', 'https://en.wikipedia.org/wiki/Abbeville,_Alabama']
['Minnesota', 'Sargeant, Minnesota', 'https://en.wikipedia.org/wiki/Sargeant,_Minnesota']
['Wyoming', 'Yoder, Wyoming', 'https://en.wikipedia.org/wiki/Yoder,_Wyoming']


# Part 2. Get Article Quality Predictions
We're using a machine learning system called ORES. This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:
1. FA - Featured article
2. GA - Good article (sometimes called A-class)
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

These labels were learned from Wikipedia articles that were peer-reviewed using the Wikipedia content assessment procedures. These quality classes are a subset of the quality assessment categories developed by Wikipedia editors.
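
To make the best-to-worst ordering usable in later analysis, one option is to encode it explicitly. The sketch below is an illustration only: `QUALITY_ORDER`, `quality_rank`, and `is_high_quality` are hypothetical names, and treating FA and GA as "high quality" is an assumption rather than something stated above.

```python
# Quality classes from best to worst, as listed above.
QUALITY_ORDER = ["FA", "GA", "B", "C", "Start", "Stub"]

# Assumption: only the top two classes count as "high quality".
HIGH_QUALITY = {"FA", "GA"}

def quality_rank(label):
    """Return 0 for the best class and 5 for the worst."""
    return QUALITY_ORDER.index(label)

def is_high_quality(label):
    """True when the predicted class is in the assumed high-quality set."""
    return label in HIGH_QUALITY

# Made-up predictions, for illustration only.
predictions = ["C", "GA", "Stub", "C"]
share = sum(is_high_quality(p) for p in predictions) / len(predictions)
best = min(predictions, key=quality_rank)
print(share, best)
```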

#### Step 2A: Create a Wikimedia user account to generate an API token. 
    ##### Log in or register here: https://api.wikimedia.org/w/index.php?title=Special:UserLogin 
    ##### Then go here to create a token: https://api.wikimedia.org/wiki/Special:AppManagement
    ##### Click "Create key"
    ##### Choose "Personal API token" with all permissions
#### Step 2B: Define constants to make data requests.
#### Step 2C: Request page info for each article to get its current revision id.
#### Step 2D: Request an ORES score for each revision and log articles that could not be scored.

In [5]:
# The function below was generated by ChatGPT. https://chat.openai.com/ 

# Prompt used:
# "Given a text file with 6 lines, where the 2nd, 4th, and 6th lines 
# are a client id, client secret access token, and access token,
# give me a python function that that reads in a text file and loads 
# those lines into memory as variables. Make sure to close the file."

# Why: Wanted to save time, knew how to explain the problem, simple code to generate.

def load_credentials_from_file(filename):
    try:
        with open(filename, 'r') as file:
            lines = file.readlines()
        if len(lines) >= 6:
            client_id = lines[1].strip()
            client_secret = lines[3].strip()
            access_token = lines[5].strip()
            return client_id, client_secret, access_token
        else:
            print("Error: The file does not contain enough lines.")
            return None, None, None
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        return None, None, None
    except Exception as e:
        print(f"Error: An unexpected error occurred - {e}")
        return None, None, None

# Usage
filename = '../auth.txt'
client_id, client_secret, access_token = load_credentials_from_file(filename)

if client_id is not None and client_secret is not None and access_token is not None:
    print("Credentials loaded successfully:")
else:
    print("Failed to load credentials.")

Credentials loaded successfully:
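
The loader expects a six-line file in which lines 2, 4, and 6 hold the values. The sketch below writes a placeholder file in that layout to a temporary directory and reads the same positions back. The label lines and the `EXAMPLE_...` values are assumptions for illustration, not real credentials, and the file name is hypothetical.

```python
import os
import tempfile

# Assumed layout: a label line followed by a value line, three times.
layout = "\n".join([
    "client id:", "EXAMPLE_CLIENT_ID",
    "client secret:", "EXAMPLE_CLIENT_SECRET",
    "access token:", "EXAMPLE_ACCESS_TOKEN",
]) + "\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "auth_example.txt")
    with open(path, "w") as f:
        f.write(layout)
    with open(path) as f:
        lines = [line.strip() for line in f]

# Lines 2, 4, and 6 of the file are indexes 1, 3, and 5.
example_id, example_secret, example_token = lines[1], lines[3], lines[5]
print(example_id, example_secret, example_token)
```

Keeping the real file outside the repository avoids committing the token by accident.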


In [6]:
# The code in this cell was developed by Dr. David W. McDonald for use in DATA 512, 
# a course in the UW MS Data Science degree program. 
# This code is provided under the Creative Commons CC-BY license. Revision 1.1 - August 14, 2022
# Any modifications made to the original source also fall under the CC-BY license. 

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
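
The page info response nests each page under its page ID, so the revision ID has to be dug out of `query.pages`. The sketch below shows one defensive way to do that. `extract_lastrevid` is a hypothetical helper, and the sample dictionaries are hand-written stand-ins in the shape a `prop=info` query is expected to return, not captured API output; the page and revision numbers are placeholders.

```python
def extract_lastrevid(page_data):
    """Return the latest revision ID as an int, or None if it cannot be found."""
    if not page_data:
        return None
    pages = page_data.get("query", {}).get("pages", {})
    for page in pages.values():
        if "lastrevid" in page:
            return int(page["lastrevid"])
    return None

# Stand-in for a successful response (placeholder IDs).
found = {"query": {"pages": {"12345": {"pageid": 12345, "title": "Example", "lastrevid": 67890}}}}

# Stand-in for a title that does not exist: the page is keyed "-1" and flagged as missing.
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(extract_lastrevid(found), extract_lastrevid(missing), extract_lastrevid(None))
```

Returning `None` instead of raising lets the caller record the article in a failure log and move on.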

In [14]:
# The code in this cell was developed by Dr. David W. McDonald for use in DATA 512, 
# a course in the UW MS Data Science degree program. 
# This code is provided under the Creative Commons CC-BY license. Revision 1.1 - August 15, 2023
# Any modifications made to the original source also fall under the CC-BY license. 

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<zbowyer@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "zbowyer@uw.edu",         # your email address should go here
    'access_token'  : access_token          # the access token you create will need to go here
}

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = "Zbowyer"
ACCESS_TOKEN = access_token
#
#print(ACCESS_TOKEN)

#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    #print("inner revid", article_revid, type(article_revid))
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        #print(request_url, headers, request_data)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print("Exception", e)
        json_response = None
    return json_response
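
Both throttle constants in this notebook follow the same pattern: seconds per allowed request minus an assumed latency. The sketch below makes that arithmetic explicit. `throttle_wait` is a hypothetical helper. The rate limit actually granted should be read from your own token; if it turned out to be 5000 requests per hour rather than per minute (an assumption to verify, not something this notebook confirms), the wait would be much longer than the value used above.

```python
def throttle_wait(max_requests, per_seconds, latency=0.002):
    """Seconds to sleep between requests so that max_requests fit into per_seconds."""
    return max(0.0, per_seconds / max_requests - latency)

# Same arithmetic as (1.0/100.0) - 0.002 in the page info cell.
wikipedia_wait = throttle_wait(100, 1.0)

# Same arithmetic as (60.0/5000.0) - 0.002 in the ORES cell.
ores_wait = throttle_wait(5000, 60.0)

# What the wait would be if the 5000-request limit applied per hour (unverified assumption).
hourly_wait = throttle_wait(5000, 3600.0)

print(wikipedia_wait, ores_wait, hourly_wait)
```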

In [34]:
# a) Read each row of us_cities_by_state_SEPT2023.csv (already loaded into CityArticles)
ArticleScores = []     # rows of [state, article_title, revid, ORES prediction]
FailedArticles = []    # d) log of articles for which no ORES score could be retrieved

for state, article_title, article_url in CityArticles:

    # b) Make a page info request to get the current article page revision
    PageData = request_pageinfo_per_article(article_title)
    try:
        pageid = list(PageData["query"]["pages"].keys())[0]
        revid = int(PageData["query"]["pages"][pageid]["lastrevid"])
    except (TypeError, KeyError, IndexError):
        FailedArticles.append([state, article_title, "no revision id"])
        continue

    # c) Then make an ORES request using the current revision id
    score = request_ores_score_per_article(article_revid=revid,
                                           email_address="zbowyer@uw.edu",
                                           access_token=ACCESS_TOKEN)
    try:
        ORES_prediction = score["enwiki"]["scores"][str(revid)]["articlequality"]["score"]["prediction"]
    except (TypeError, KeyError):
        FailedArticles.append([state, article_title, "no ORES score"])
        continue

    ArticleScores.append([state, article_title, revid, ORES_prediction])
    print(article_title, ":", ORES_prediction)

# d) Report how many articles could not be scored
print("Articles without an ORES score:", len(FailedArticles))

Abbeville, Alabama : C
Adamsville, Alabama : C
Addison, Alabama : C
Akron, Alabama : GA
Alabaster, Alabama : C
Albertville, Alabama : C
Alexander City, Alabama : GA
Aliceville, Alabama : GA
Allgood, Alabama : C
Altoona, Alabama : C
Andalusia, Alabama : C
Anderson, Lauderdale County, Alabama : Stub
Anniston, Alabama : C
Arab, Alabama : C
Ardmore, Alabama : GA
Argo, Alabama : C
Ariton, Alabama : C



KeyboardInterrupt
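
Because a long scoring run can be interrupted, as happened here, one option is to append each result to a CSV file as it arrives and skip titles already present on the next run. The sketch below illustrates that idea with a stand-in scoring function. `load_done`, `score_with_checkpoint`, `fake_score`, and the file name are all hypothetical, and no API calls are made.

```python
import csv
import os
import tempfile

def load_done(path):
    """Return the set of article titles already written to the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf8") as f:
        return {row[0] for row in csv.reader(f) if row}

def score_with_checkpoint(titles, score_fn, path):
    """Score only titles missing from the checkpoint, appending each result immediately."""
    done = load_done(path)
    newly_scored = 0
    with open(path, "a", newline="", encoding="utf8") as f:
        writer = csv.writer(f)
        for title in titles:
            if title in done:
                continue
            writer.writerow([title, score_fn(title)])
            f.flush()  # keep the file current in case the run is interrupted
            newly_scored += 1
    return newly_scored

def fake_score(title):
    """Stand-in for a real ORES request; always returns the same class."""
    return "C"

with tempfile.TemporaryDirectory() as tmpdir:
    checkpoint = os.path.join(tmpdir, "ores_scores_example.csv")
    first_run = score_with_checkpoint(
        ["Abbeville, Alabama", "Addison, Alabama"], fake_score, checkpoint)
    second_run = score_with_checkpoint(
        ["Abbeville, Alabama", "Addison, Alabama", "Akron, Alabama"], fake_score, checkpoint)
    total_rows = len(load_done(checkpoint))

print(first_run, second_run, total_rows)
```

The second call scores only the one title that the first call did not cover, which is the behavior a resumed run would need.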



# Combine datasets