# Considering Bias in Data

This assignment explores bias in data through Wikipedia articles about cities in each U.S. state. We combine a dataset of city articles with state population estimates and use the ORES machine learning service to estimate the quality of each article.

This notebook walks through the data processing steps in order.



## Step 1: Getting the Article, Population and Region Data

In [1]:
# import 

import json, time, urllib.parse
import requests
import csv
import pandas as pd

In [2]:
#########
#
#    CONSTANTS - setting up the API parameter template
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

REQUEST_HEADERS = {
    'User-Agent': '<april.gg@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# Build a list of city article titles from the local CSV file.
# The header row is read as well, so ARTICLE_TITLES[0] holds the column name rather than a city.
ARTICLE_TITLES = []

TITLE_FILE = 'C:/Users/april/Documents/Documents/MSDS/DATA512/HW2/us_cities_by_state_SEPT.2023.csv'

with open(TITLE_FILE, newline='', encoding='utf-8') as file:
    reader = csv.reader(file)
    for row in reader:
        if row:  
            ARTICLE_TITLES.append(row[1])


PAGEINFO_EXTENDED_PROPERTIES = ""

PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
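To make the parameter template concrete, here is a minimal sketch of the query URL that these parameters correspond to for a single title. It uses only `urllib.parse.urlencode` from the standard library; the helper name `build_pageinfo_url` is hypothetical, and the exact encoding that `requests` produces may differ in small details.

```python
from urllib.parse import urlencode

# Same endpoint and parameter names as the notebook's constants.
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_pageinfo_url(article_title):
    """Return a GET URL for a page-info query on one article title (hypothetical helper)."""
    params = {
        "action": "query",
        "format": "json",
        "titles": article_title,
        "prop": "info",
        "inprop": "",
    }
    # urlencode percent-encodes the comma and turns the space into '+'
    return API_ENWIKIPEDIA_ENDPOINT + "?" + urlencode(params)


example_url = build_pageinfo_url("Addison, Alabama")
print(example_url)
```

Pasting a URL like this into a browser is a quick way to inspect the raw response before running the full loop.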


The functions below request the latest revision ID for each article from the Wikipedia API, using the titles read from the local CSV file, and write each revision ID back into that same file.

In [None]:
def request_pageinfo_per_article(article_title=None,
                                 endpoint_url=API_ENWIKIPEDIA_ENDPOINT,
                                 request_template=PAGEINFO_PARAMS_TEMPLATE,
                                 headers=REQUEST_HEADERS):

    # Work on a copy so the shared template is not modified between calls
    request_template = dict(request_template)

    # The article title can be passed as a parameter or set in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:

        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()

        pages = json_response.get('query', {}).get('pages', {})
        for page_id in pages:
            page_info = pages[page_id]
            article_name = page_info.get('title', '')
            revid = page_info.get('lastrevid', '')
            # Missing pages have no 'lastrevid'; skip them instead of failing on int('')
            if not revid:
                continue
            article_info.append([article_name, revid])
            update_csv_with_revid(article_name, revid)
    except Exception as e:
        print(e)
        json_response = None
    return json_response

# Add the revision ID information to the existing file
def update_csv_with_revid(article_name, revid):
    df = pd.read_csv(TITLE_FILE)
    if 'revid' not in df.columns:
        df['revid'] = ''
    df.loc[df['page_title'] == article_name, 'revid'] = int(revid)
    df.to_csv(TITLE_FILE, index=False, mode='w')


if __name__ == '__main__':
    article_info = []
    for title in ARTICLE_TITLES: 
        response = request_pageinfo_per_article(title)

In [4]:
# Test that the article titles loaded correctly
print(f"Getting page info data for: {ARTICLE_TITLES[3]}")
info = request_pageinfo_per_article(ARTICLE_TITLES[3])
print(json.dumps(info,indent=4))

Getting page info data for: Addison, Alabama
{
    "batchcomplete": "",
    "query": {
        "pages": {
            "105188": {
                "pageid": 105188,
                "ns": 0,
                "title": "Addison, Alabama",
                "contentmodel": "wikitext",
                "pagelanguage": "en",
                "pagelanguagehtmlcode": "en",
                "pagelanguagedir": "ltr",
                "touched": "2023-10-10T22:35:37Z",
                "lastrevid": 1168359898,
                "length": 13309
            }
        }
    }
}
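The response above nests the revision ID under `query` → `pages` → `<pageid>` → `lastrevid`. As a minimal sketch, the lookup can be separated from the request itself so that it is easy to check on its own; the helper name `extract_revisions` is hypothetical, and the sample dictionary is an abbreviated copy of the output shown above.

```python
def extract_revisions(json_response):
    """Map article title -> lastrevid for every page in a page-info response."""
    pages = (json_response or {}).get("query", {}).get("pages", {})
    revisions = {}
    for page in pages.values():
        # Pages that do not exist come back without a 'lastrevid' key, so skip them.
        if "lastrevid" in page:
            revisions[page["title"]] = page["lastrevid"]
    return revisions


# Abbreviated copy of the response printed above
sample_response = {
    "batchcomplete": "",
    "query": {
        "pages": {
            "105188": {
                "pageid": 105188,
                "ns": 0,
                "title": "Addison, Alabama",
                "lastrevid": 1168359898,
            }
        }
    },
}

revisions = extract_revisions(sample_response)
print(revisions)
```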


## Step 2: Getting Article Quality Predictions

In [9]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/5000.0)-API_LATENCY_ASSUMED


REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{}>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {}"
}

#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : '',         
    'access_token'  :''}

#creating article revisions dict
article_revid_file = 'C:/Users/april/Documents/Documents/MSDS/DATA512/HW2/us_cities_by_state_SEPT.2023.csv'

ARTICLE_REVISIONS = {}

with open(article_revid_file, mode='r', encoding='utf-8') as file:  # specify the appropriate encoding here
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        if row['revid']:  # Skip rows whose 'revid' is an empty string
            ARTICLE_REVISIONS[row['page_title']] = int(float(row['revid']))

#    This is a template of the data required as a payload when making a scoring request of the ORES model

ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}
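The throttle constant in the cell above is simple arithmetic: a time window divided by the number of requests allowed in that window, minus the assumed latency. The sketch below reproduces the notebook's value and, purely as a labeled assumption, shows what the same formula would give if the 5000-request limit applied per hour instead of per minute. The helper name `throttle_wait` is hypothetical, and the actual limit should be read from the granted token.

```python
def throttle_wait(window_seconds, max_requests, assumed_latency=0.002):
    """Seconds to sleep before each request so that max_requests fit in the window."""
    return max(0.0, window_seconds / max_requests - assumed_latency)


# Value used in the notebook: 5000 requests in a 60-second window
wait_per_minute_limit = throttle_wait(60.0, 5000)

# Assumption for comparison only: the same 5000 requests spread over one hour
wait_per_hour_limit = throttle_wait(3600.0, 5000)

print(f"{wait_per_minute_limit:.3f} s vs {wait_per_hour_limit:.3f} s")
```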



In [10]:
#username and token are hidden
USERNAME = ""
ACCESS_TOKEN = ""

In [11]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
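    headers = dict()
    for key in header_format.keys():
        if key == 'User-Agent':
            headers[key] = header_format[key].format(header_params['email_address'])
        elif key == 'Authorization':
            # Fill in the bearer token so the request is authenticated
            headers[key] = header_format[key].format(header_params['access_token'])
        else:
            # Headers without a placeholder (e.g. Content-Type) are copied as-is
            headers[key] = header_format[key]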


    # make the request
    try:
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response




In [None]:
import csv
import json

# Define the function to call request_ores_score_per_article and update the CSV file
def update_csv_with_ores_scores(csv_file_path, ARTICLE_REVISIONS):
    with open(csv_file_path, mode='r', newline='',encoding='utf-8') as file:
        csv_reader = csv.DictReader(file)
        rows = [row for row in csv_reader]

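    for row in rows:
        # Give every row the score column so the CSV writer sees a consistent set of fields
        row.setdefault('article_quality_score', '')
        page_title = row['page_title']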
        if page_title in ARTICLE_REVISIONS:
            revid = ARTICLE_REVISIONS[page_title]
            score = request_ores_score_per_article(article_revid=revid,
                                                   email_address="april.gg@uw.edu",
                                                   access_token=ACCESS_TOKEN)
            if score:
                try:
                    article_quality_score = score['enwiki']['scores'][str(revid)]['articlequality']['score']['prediction']
                    row['article_quality_score'] = article_quality_score
                except KeyError:
                    print(f"No score found for article with revid {revid}")

    with open(csv_file_path, mode='w', newline='',encoding='utf-8') as file:
        fieldnames = rows[0].keys()
        csv_writer = csv.DictWriter(file, fieldnames=fieldnames)
        csv_writer.writeheader()
        csv_writer.writerows(rows)

# Call the function to update the CSV file
update_csv_with_ores_scores("C:/Users/april/Documents/Documents/MSDS/DATA512/HW2/us_cities_by_state_SEPT.2023.csv", ARTICLE_REVISIONS)
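The update function reads the predicted class from a deeply nested path in the response. A minimal sketch of that lookup as a standalone helper is shown below; it returns `None` when the response is missing or is shaped differently (for example an error payload). The helper name `extract_quality_prediction` is hypothetical, and the sample response is hand-written to mirror the key path used above rather than captured from the API.

```python
def extract_quality_prediction(score_response, revid, wiki="enwiki", model="articlequality"):
    """Return the predicted quality class for a revision, or None if it is absent."""
    try:
        return score_response[wiki]["scores"][str(revid)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        # KeyError: an expected key is missing (e.g. an error payload)
        # TypeError: the response is None or not shaped like nested dictionaries
        return None


# Hand-written illustration of the key path; not real API output
sample_score = {
    "enwiki": {
        "scores": {
            "1168359898": {
                "articlequality": {"score": {"prediction": "C"}}
            }
        }
    }
}

prediction = extract_quality_prediction(sample_score, 1168359898)
print(prediction)
```

Keeping the lookup separate from the request makes it possible to count how many articles came back without a score instead of only printing a message for each one.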


## Step 3: Combining the Datasets


### Merging the Region Data

In [9]:
region_df = pd.read_excel("C:/Users/april/Documents/Documents/MSDS/DATA512/HW2/US States by Region - US Census Bureau.xlsx")
region_df.head()

Unnamed: 0,REGION,DIVISION,STATE
0,Northeast,,
1,,New England,
2,,,Connecticut
3,,,Maine
4,,,Massachusetts


In [10]:
# Forward-fill the blanks in REGION and DIVISION so each state row carries its region and division
region_df['REGION'] = region_df['REGION'].ffill()
region_df['DIVISION'] = region_df['DIVISION'].ffill()
region_df.head()

Unnamed: 0,REGION,DIVISION,STATE
0,Northeast,,
1,Northeast,New England,
2,Northeast,New England,Connecticut
3,Northeast,New England,Maine
4,Northeast,New England,Massachusetts
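Forward filling replaces each blank with the most recent non-blank value above it, which is why the first DIVISION cell stays empty while every later row inherits "New England". The following plain-Python sketch illustrates that logic on the first five rows shown above; it is an illustration of the idea, not the pandas implementation, and the function name `forward_fill` is hypothetical.

```python
def forward_fill(values):
    """Replace each None with the last non-None value seen before it."""
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        # Leading blanks stay None because nothing precedes them
        filled.append(last_seen)
    return filled


# First five rows of the REGION and DIVISION columns, with blanks as None
region_column = ["Northeast", None, None, None, None]
division_column = [None, "New England", None, None, None]

filled_region = forward_fill(region_column)
filled_division = forward_fill(division_column)
print(filled_region)
print(filled_division)
```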


In [11]:
# Drop rows with missing values and rename 'STATE' to 'state' to prepare for the merge
region_df=region_df.dropna().reset_index(drop=True)
region_df.rename(columns={'STATE': 'state'}, inplace=True)
region_df.head()

Unnamed: 0,REGION,DIVISION,state
0,Northeast,New England,Connecticut
1,Northeast,New England,Maine
2,Northeast,New England,Massachusetts
3,Northeast,New England,New Hampshire
4,Northeast,New England,Rhode Island


In [12]:
score_df = pd.read_csv("C:/Users/april/Documents/Documents/MSDS/DATA512/HW2/us_cities_by_state_SEPT.2023.csv")
score_df.head()                          

Unnamed: 0,state,page_title,url,revid,article_quality_score
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",1171164000.0,C
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama",1177621000.0,C
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama",1168360000.0,C
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama",1165910000.0,GA
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama",1179140000.0,C


In [13]:
score_region = pd.merge(region_df,score_df, on=['state'])
score_region.head()

Unnamed: 0,REGION,DIVISION,state,page_title,url,revid,article_quality_score
0,Northeast,New England,Maine,"Abbot, Maine","https://en.wikipedia.org/wiki/Abbot,_Maine",1171169000.0,
1,Northeast,New England,Maine,"Acton, Maine","https://en.wikipedia.org/wiki/Acton,_Maine",1175249000.0,
2,Northeast,New England,Maine,"Addison, Maine","https://en.wikipedia.org/wiki/Addison,_Maine",1168360000.0,
3,Northeast,New England,Maine,"Albion, Maine","https://en.wikipedia.org/wiki/Albion,_Maine",1165910000.0,
4,Northeast,New England,Maine,"Alexander, Maine","https://en.wikipedia.org/wiki/Alexander,_Maine",1170294000.0,
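Because `pd.merge` defaults to an inner join, any state name that appears in only one of the two tables is dropped without warning. A minimal sketch for listing such names before merging is shown below, using plain Python sets. The function name `unmatched_keys` is hypothetical, and the two short lists are toy data for illustration, not the real contents of either table.

```python
def unmatched_keys(left_keys, right_keys):
    """Return (only_in_left, only_in_right) as sorted lists of join-key values."""
    left, right = set(left_keys), set(right_keys)
    return sorted(left - right), sorted(right - left)


# Toy data: in the notebook these would be region_df['state'] and score_df['state']
region_states = ["Connecticut", "Maine", "Massachusetts"]
article_states = ["Alabama", "Maine", "Massachusetts"]

only_in_regions, only_in_articles = unmatched_keys(region_states, article_states)
print("Only in the region table:", only_in_regions)
print("Only in the article table:", only_in_articles)
```

Running a check like this on the real columns shows which states, if any, disappear from the merged result.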


### Merging the Population Data

In [2]:
population_df = pd.read_csv("C:/Users/april/Documents/Documents/MSDS/DATA512/HW2/NST-EST2022-ALLDATA.csv")
population_df.head()


Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,NAME,ESTIMATESBASE2020,POPESTIMATE2020,POPESTIMATE2021,POPESTIMATE2022,NPOPCHG_2020,...,RDEATH2021,RDEATH2022,RNATURALCHG2021,RNATURALCHG2022,RINTERNATIONALMIG2021,RINTERNATIONALMIG2022,RDOMESTICMIG2021,RDOMESTICMIG2022,RNETMIG2021,RNETMIG2022
0,10,0,0,0,United States,331449520,331511512,332031554,333287557,61992,...,10.363828,10.350218,0.434073,0.736729,1.133397,3.038912,0.0,0.0,1.133397,3.038912
1,20,1,0,0,Northeast Region,57609156,57448898,57259257,57040406,-160258,...,9.780142,9.868918,0.206629,0.5112,1.402708,3.752662,-4.855348,-8.061896,-3.45264,-4.309234
2,30,1,1,0,New England,15116206,15074473,15121745,15129548,-41733,...,9.530598,9.887115,-0.310502,-0.206669,1.770752,4.65514,1.546021,-3.767839,3.316773,0.887301
3,30,1,2,0,Middle Atlantic,42492950,42374425,42137512,41910858,-118525,...,9.869304,9.862369,0.3914,0.769581,1.271205,3.427836,-7.142565,-9.607444,-5.87136,-6.179608
4,20,2,0,0,Midwest Region,68985537,68961043,68836505,68787595,-24494,...,11.059195,11.169148,-0.207043,-0.12553,0.802714,2.111084,-2.645374,-2.529339,-1.84266,-0.418255


In [3]:
#drop irrelevant columns and rename to get ready to merge
population_df = population_df[['NAME', 'POPESTIMATE2022']]
population_df = population_df.rename(columns={"NAME": "state", "POPESTIMATE2022": "population"})
population_df.head()


Unnamed: 0,state,population
0,United States,333287557
1,Northeast Region,57040406
2,New England,15129548
3,Middle Atlantic,41910858
4,Midwest Region,68787595


In [16]:
final_df = pd.merge(population_df,score_region, on=['state'])
final_df.head()

Unnamed: 0,state,population,REGION,DIVISION,page_title,url,revid,article_quality_score
0,Alabama,5031362,South,East South Central,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",1171164000.0,C
1,Alabama,5031362,South,East South Central,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama",1177621000.0,C
2,Alabama,5031362,South,East South Central,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama",1168360000.0,C
3,Alabama,5031362,South,East South Central,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama",1165910000.0,GA
4,Alabama,5031362,South,East South Central,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama",1179140000.0,C


In [None]:
final_df.drop(['url','REGION'], axis=1, inplace=True)
# Rename the columns on final_df itself (the original referenced an undefined df_final)
final_df = final_df.rename(columns={'DIVISION': 'regional_division','page_title': 'article_title','revid':'revision_id','article_quality_score':'article_quality' })
final_df.to_csv('wp_scored_city_articles_by_state.csv', index=False)