# DATA512 HW2 Considering Bias in Data

The goal of this assignment is to explore the concept of bias in data using Wikipedia articles. This assignment will consider articles about cities in different US states. For this assignment, we will combine a dataset of Wikipedia articles with a dataset of state populations, and use a machine learning service called ORES to estimate the quality of the articles about the cities.

There are 2 aspects of this homework. The first part is about how to access page info data using the [MediaWiki REST API for the EN Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). We request summary 'page info' for multiple article pages. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this part.

The next part is to request ORES scores through LiftWing ML Service API. We generate article quality estimates for article revisions using the LiftWing version of [ORES](https://www.mediawiki.org/wiki/ORES). The [ORES API documentation](https://ores.wikimedia.org/docs) can be accessed from the main ORES page. 

## License
Some parts of the code are either used as is or modified based on the example code that was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/).



## Import Libraries
Load all required libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")

import os
import json, time, urllib.parse
import requests
import pandas as pd
import matplotlib.pyplot as plt

from itertools import islice
from pandas import json_normalize

pd.set_option('display.max_colwidth', None)

## Hard coded variables to change before running this code:
1. Working Directory has to be changed under `Set Directory`
2. In `Step 2: Getting Article Quality Predictions`, use your own `email_address`, `access_token` & `user_name`. 

To Get your access token: You will need a Wikimedia user account to get access to Lift Wing (the ML API service). You can either [create an account or login](https://api.wikimedia.org/w/index.php?title=Special:UserLogin&centralAuthAutologinTried=1&centralAuthError=Not+centrally+logged+in). If you have a Wikipedia user account - you might already have an Wikimedia account. If you are not sure try your Wikipedia username and password to check it. If you do not have a Wikimedia account you will need to create an account that you can use to get an access token.

There is a 'guide' that describes how to get authentication tokens - but not everything works the way it is described in that documentation. You should review that documentation and then read the rest of this comment.

The documentation talks about using a "dashboard" for managing authentication tokens. That's a rather generous description for what looks like a simple list of token things. You might have a hard time finding this "dashboard". First, on the left hand side of the page, you'll see a column of links. The bottom section is a set of links titled "Tools". In that section is a link that says [Special pages](https://api.wikimedia.org/wiki/Special:SpecialPages) which will take you to a list of ... well, special pages. At the very bottom of the "Special pages" page is a section titled "Other special pages" (scroll all the way to the bottom). The first link in that section is called [API keys](https://api.wikimedia.org/wiki/Special:AppManagement). When you get to the "API keys" page you can create a new key.

The authentication guide suggests that you should create a server-side app key. This does not seem to work correctly - as yet. It failed on multiple attempts when I attempted to create a server-side app key. BUT, there is an option to create a Personal API token that should work for this course and the type of ORES page scoring that you will need to perform.

Note, when you create a Personal API token you are granted the three items - a Client ID, a Client secret, and a Access token - you shold save all three of these. When you dismiss the box they are gone. If you lose any one of the tokens you can destroy or deactivate the Personal API token from the dashboard and then create a new one.


Without changing these values, the code would not run.

## Set Directory

In [2]:
# Change this
cwd = 'C:\\Users\\adith\\Documents\\HW2'

## Step 1: Getting the Article, Population and Region Data

## Import files

We use 3 flatfiles as a part of our analysis.
#### List of Cities
The Wikipedia Category:Lists of cities in the United States by state was crawled to generate a list of Wikipedia article pages about US cities from each state. This data is available in the input folder called `us_cities_by_state_SEPT.2023.csv`.

#### Population Data
We use the data provided by US Census Bureau which has population estimates for every US state. We can find Excel file linked to that page contains estimated populations of all US states for 2022 in the input folder named `NST-EST2022-POP.xlsx`.

#### Regional Division Data
We also use the regional and divisional agglomerations as defined by the US Census Bureau. The input folder contains a spreadsheet listing the states in each regional division, named `US States by Region - US Census Bureau.xlsx`.


In [3]:
# Reading the wikipedia list of cities data
us_citiesxstates = pd.read_csv(os.path.join(cwd, 'input/us_cities_by_state_SEPT.2023.csv'))

# Drop duplicate rows
us_citiesxstates.drop_duplicates(inplace=True, ignore_index=True)

us_citiesxstates

Unnamed: 0,state,page_title,url
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama"
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama"
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama"
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama"
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama"
...,...,...,...
21520,Wyoming,"Wamsutter, Wyoming","https://en.wikipedia.org/wiki/Wamsutter,_Wyoming"
21521,Wyoming,"Wheatland, Wyoming","https://en.wikipedia.org/wiki/Wheatland,_Wyoming"
21522,Wyoming,"Worland, Wyoming","https://en.wikipedia.org/wiki/Worland,_Wyoming"
21523,Wyoming,"Wright, Wyoming","https://en.wikipedia.org/wiki/Wright,_Wyoming"


In [4]:
# Reading the population data

us_pop = pd.read_excel(os.path.join(cwd, 'input/NST-EST2022-POP.xlsx'), skiprows=4)

# Remove unnecessary rows
us_pop = us_pop[4:]  

us_pop.reset_index(drop=True, inplace=True)

# Rename the columns
us_pop.columns = ['State','2020_est','2020', '2021', '2022']

# The State names start with a . so we remove such special characters
us_pop = us_pop[us_pop['State'].str.contains('^\.', na=False)]
us_pop['State'] = us_pop['State'].str.slice(1)
us_pop = us_pop[['State', '2022']].reset_index(drop=True)

us_pop.columns = ['State', '2022']

us_pop

Unnamed: 0,State,2022
0,Alabama,5074296.0
1,Alaska,733583.0
2,Arizona,7359197.0
3,Arkansas,3045637.0
4,California,39029342.0
5,Colorado,5839926.0
6,Connecticut,3626205.0
7,Delaware,1018396.0
8,District of Columbia,671803.0
9,Florida,22244823.0


In [5]:
# Load the regional divison data from the Excel file
us_regions = pd.read_excel(os.path.join(cwd, 'input/US States by Region - US Census Bureau.xlsx'))

# Forward-fill missing region and division values
us_regions['REGION'].fillna(method='ffill', inplace=True)
us_regions['DIVISION'].fillna(method='ffill', inplace=True)

# Filter out rows where STATE is not null (i.e., where STATE is a state name)
us_regions = us_regions.dropna(subset=['STATE'])

# Rename columns to lowercase
us_regions.columns = us_regions.columns.str.lower()

us_regions

Unnamed: 0,region,division,state
2,Northeast,New England,Connecticut
3,Northeast,New England,Maine
4,Northeast,New England,Massachusetts
5,Northeast,New England,New Hampshire
6,Northeast,New England,Rhode Island
7,Northeast,New England,Vermont
9,Northeast,Middle Atlantic,New Jersey
10,Northeast,Middle Atlantic,New York
11,Northeast,Middle Atlantic,Pennsylvania
14,Midwest,East North Central,Illinois


## Set Global Variables for the English Wikipedia endpoint
In this section, we define various constants and configuration settings that will be used throughout the script to interact with the English Wikipedia API and manage the data retrieval process. These constants include the API endpoint, assumed latency, request headers, example article titles, and parameters for making page information requests. 

In [6]:
# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<adi279@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests. We use another list in this code.
ARTICLE_TITLES = ['Abbeville, Alabama', 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

## Function to Request Page Information per Article

### Function to Request Page Information per Article

This function is used to make requests to the Wikipedia API to retrieve page information for a given article. It takes the following parameters:

- `article_title`: The title of the article for which page information is requested.
- `endpoint_url`: The URL of the Wikipedia API endpoint (default is `API_ENWIKIPEDIA_ENDPOINT`).
- `request_template`: A template containing request parameters (default is `PAGEINFO_PARAMS_TEMPLATE`).
- `headers`: Request headers (default is `REQUEST_HEADERS`).


In [7]:
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

## Data Collection:

This code block is responsible for collecting and processing data for a set of articles. It operates in the following steps:

#### Iteration Over DataFrame Rows:

- It iterates over each row of a DataFrame named `us_citiesxstates`.
- For each row, it extracts the value in the 'page_title' column and assigns it to the variable `article_title`.

#### API Requests for Page Information:

- For each `article_title`, the code makes an API request to gather page information.

#### Data Storage:

- The retrieved data is stored in dictionaries for further processing.

#### Handling Failed Requests:

- If a request fails, the code logs the failed article titles and continues processing others.
- This is in `failed_articles`.


#### Output:

The resulting DataFrame contains information about articles, including page ID, revision ID, quality score, etc.


In [8]:
# Initialize an empty list to store the responses for each article
article_responses = []

# To store failed articles
failed_articles = []  

# Iterate through values from us_citiesxstates.page_title

for article_title in us_citiesxstates.page_title:
    print('Pulling data for: ',article_title)
    # Make a request for each article title
    response = request_pageinfo_per_article(article_title)
    
    if response is not None:
        # Append the response to the list as a dictionary
        article_responses.append(response)
    else:
        print(f"Failed to retrieve data for {article_title}")
        failed_articles.append(article_title)


# Convert the responses into a dataframe.
        
article_responses_df = pd.DataFrame(article_responses)

result_df = pd.DataFrame()

for index, row in article_responses_df.iterrows():
    page_info = row['query']
    page_id = list(page_info['pages'].keys())[0]
    page_data = page_info['pages'][page_id]

    # Convert the page_data dictionary into a DataFrame with one row
    page_data_df = pd.DataFrame.from_dict(page_data, orient='index').T
    result_df = pd.concat([result_df, page_data_df])

# Reset the index
result_df.reset_index(drop=True, inplace=True)

Pulling data for:  Abbeville, Alabama
Pulling data for:  Adamsville, Alabama
Pulling data for:  Addison, Alabama
Pulling data for:  Akron, Alabama
Pulling data for:  Alabaster, Alabama
Pulling data for:  Albertville, Alabama
Pulling data for:  Alexander City, Alabama
Pulling data for:  Aliceville, Alabama
Pulling data for:  Allgood, Alabama
Pulling data for:  Altoona, Alabama
Pulling data for:  Andalusia, Alabama
Pulling data for:  Anderson, Lauderdale County, Alabama
Pulling data for:  Anniston, Alabama
Pulling data for:  Arab, Alabama
Pulling data for:  Ardmore, Alabama
Pulling data for:  Argo, Alabama
Pulling data for:  Ariton, Alabama
Pulling data for:  Arley, Alabama
Pulling data for:  Ashford, Alabama
Pulling data for:  Ashland, Alabama
Pulling data for:  Ashville, Alabama
Pulling data for:  Athens, Alabama
Pulling data for:  Atmore, Alabama
Pulling data for:  Attalla, Alabama
Pulling data for:  Auburn, Alabama
Pulling data for:  Autaugaville, Alabama
Pulling data for:  Avon, Ala

Pulling data for:  Irondale, Alabama
Pulling data for:  Jackson, Alabama
Pulling data for:  Jackson's Gap, Alabama
Pulling data for:  Jacksonville, Alabama
Pulling data for:  Jasper, Alabama
Pulling data for:  Jemison, Alabama
Pulling data for:  Kansas, Alabama
Pulling data for:  Kellyton, Alabama
Pulling data for:  Kennedy, Alabama
Pulling data for:  Killen, Alabama
Pulling data for:  Kimberly, Alabama
Pulling data for:  Kinsey, Alabama
Pulling data for:  Kinston, Alabama
Pulling data for:  LaFayette, Alabama
Pulling data for:  Lake View, Alabama
Pulling data for:  Lakeview, Alabama
Pulling data for:  Lanett, Alabama
Pulling data for:  Langston, Alabama
Pulling data for:  Leeds, Alabama
Pulling data for:  Leesburg, Alabama
Pulling data for:  Leighton, Alabama
Pulling data for:  Lester, Alabama
Pulling data for:  Level Plains, Alabama
Pulling data for:  Lexington, Alabama
Pulling data for:  Libertyville, Alabama
Pulling data for:  Lincoln, Alabama
Pulling data for:  Linden, Alabama
Pul

Pulling data for:  Wadley, Alabama
Pulling data for:  Waldo, Alabama
Pulling data for:  Walnut Grove, Alabama
Pulling data for:  Warrior, Alabama
Pulling data for:  Waterloo, Alabama
Pulling data for:  Waverly, Alabama
Pulling data for:  Weaver, Alabama
Pulling data for:  Webb, Alabama
Pulling data for:  Wedowee, Alabama
Pulling data for:  West Blocton, Alabama
Pulling data for:  West Jefferson, Alabama
Pulling data for:  West Point, Alabama
Pulling data for:  Westover, Alabama
Pulling data for:  Wetumpka, Alabama
Pulling data for:  White Hall, Alabama
Pulling data for:  Wilsonville, Alabama
Pulling data for:  Wilton, Alabama
Pulling data for:  Winfield, Alabama
Pulling data for:  Woodland, Alabama
Pulling data for:  Woodstock, Alabama
Pulling data for:  Woodville, Alabama
Pulling data for:  Yellow Bluff, Alabama
Pulling data for:  York, Alabama
Pulling data for:  Adak, Alaska
Pulling data for:  Akhiok, Alaska
Pulling data for:  Akiak, Alaska
Pulling data for:  Akutan, Alaska
Pulling d

Pulling data for:  Payson, Arizona
Pulling data for:  Peoria, Arizona
Pulling data for:  Phoenix, Arizona
Pulling data for:  Pima, Arizona
Pulling data for:  Pinetop-Lakeside, Arizona
Pulling data for:  Prescott, Arizona
Pulling data for:  Prescott Valley, Arizona
Pulling data for:  Quartzsite, Arizona
Pulling data for:  Queen Creek, Arizona
Pulling data for:  Safford, Arizona
Pulling data for:  Sahuarita, Arizona
Pulling data for:  San Luis, Arizona
Pulling data for:  Scottsdale, Arizona
Pulling data for:  Sedona, Arizona
Pulling data for:  Show Low, Arizona
Pulling data for:  Sierra Vista, Arizona
Pulling data for:  Snowflake, Arizona
Pulling data for:  Somerton, Arizona
Pulling data for:  South Tucson, Arizona
Pulling data for:  Springerville, Arizona
Pulling data for:  St. Johns, Arizona
Pulling data for:  Star Valley, Arizona
Pulling data for:  Superior, Arizona
Pulling data for:  Surprise, Arizona
Pulling data for:  Taylor, Arizona
Pulling data for:  Tempe, Arizona
Pulling data f

Pulling data for:  Gilbert, Arkansas
Pulling data for:  Gillett, Arkansas
Pulling data for:  Gillham, Arkansas
Pulling data for:  Gilmore, Arkansas
Pulling data for:  Glenwood, Arkansas
Pulling data for:  Goshen, Arkansas
Pulling data for:  Gosnell, Arkansas
Pulling data for:  Gould, Arkansas
Pulling data for:  Grady, Arkansas
Pulling data for:  Grannis, Arkansas
Pulling data for:  Gravette, Arkansas
Pulling data for:  Green Forest, Arkansas
Pulling data for:  Greenbrier, Arkansas
Pulling data for:  Greenland, Arkansas
Pulling data for:  Greenway, Arkansas
Pulling data for:  Greenwood, Arkansas
Pulling data for:  Greers Ferry, Arkansas
Pulling data for:  Griffithville, Arkansas
Pulling data for:  Grubbs, Arkansas
Pulling data for:  Guion, Arkansas
Pulling data for:  Gum Springs, Arkansas
Pulling data for:  Gurdon, Arkansas
Pulling data for:  Guy, Arkansas
Pulling data for:  Hackett, Arkansas
Pulling data for:  Hamburg, Arkansas
Pulling data for:  Hampton, Arkansas
Pulling data for:  Ha

Pulling data for:  Ravenden Springs, Arkansas
Pulling data for:  Rector, Arkansas
Pulling data for:  Redfield, Arkansas
Pulling data for:  Reed, Arkansas
Pulling data for:  Reyno, Arkansas
Pulling data for:  Rison, Arkansas
Pulling data for:  Rockport, Arkansas
Pulling data for:  Roe, Arkansas
Pulling data for:  Rogers, Arkansas
Pulling data for:  Rondo, Arkansas
Pulling data for:  Rose Bud, Arkansas
Pulling data for:  Rosston, Arkansas
Pulling data for:  Rudy, Arkansas
Pulling data for:  Russell, Arkansas
Pulling data for:  Russellville, Arkansas
Pulling data for:  Salem, Fulton County, Arkansas
Pulling data for:  Salesville, Arkansas
Pulling data for:  Scranton, Arkansas
Pulling data for:  Searcy, Arkansas
Pulling data for:  Sedgwick, Arkansas
Pulling data for:  Shannon Hills, Arkansas
Pulling data for:  Sheridan, Arkansas
Pulling data for:  Sherrill, Arkansas
Pulling data for:  Sherwood, Arkansas
Pulling data for:  Shirley, Arkansas
Pulling data for:  Sidney, Arkansas
Pulling data f

KeyboardInterrupt: 

In [None]:
# List of failed articles for which the API failed
failed_articles

In [None]:
# Export page info request as csv (intermediate file) 
result_df.to_csv(os.path.join(cwd, 'intermediate/article_info.csv'), index=False)
result_df

## Step 2: Getting Article Quality Predictions

#### LiftWing ORES API Configuration:

This section defines the configuration for making API requests to the LiftWing ORES endpoint and prediction model.

- `API_ORES_LIFTWING_ENDPOINT`: The API endpoint for the LiftWing ORES model.

- `API_ORES_EN_QUALITY_MODEL`: The specific quality model for English Wikipedia articles.

#### Throttling and Rate Limits:

The code outlines the calculation of throttling and rate limits for making automated requests to the API.

- `API_LATENCY_ASSUMED`: The assumed API latency in seconds.

- `API_THROTTLE_WAIT`: The calculated time interval to wait between API requests to comply with rate limits.

#### Request Headers:

This part provides details about the headers to be included in API requests, including:

- `User-Agent`: A user-agent string indicating the requester's identity.

- `Content-Type`: The type of content in the request.

- `Authorization`: The authorization header with a bearer token.

#### Header Parameters Template:

The template for header parameters that need to be supplied in API requests, including email address and access token.

#### Sample Article Revisions:

A dictionary of English Wikipedia article titles and sample revision IDs for use in ORES scoring examples.

#### ORES Request Data Template:

The template for data payload required when making a scoring request to the ORES model.

#### User Information:

Your username and access token for making authenticated API requests.


In [11]:
#    The current LiftWing ORES API endpoint and prediction model
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<adi279@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "adithyaa279@gmail.com",         # your email address should go here
    'access_token'  : "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJhdWQiOiI0NmQ4ZTg5NzY5M2YxYTE3OGU4ZDViZTQ4ZWY5Njg4OCIsImp0aSI6IjM4YTUwMDZjNTYxOGYxZWEyMDM0OWM4ZTkzNjQyNmUwZjhjNzBiMThjNzBmMzNlZDA2N2FiNzM0ZDFmYTVhMGY4MDc5YjAwMmNkYmI4N2QzIiwiaWF0IjoxNjk3MzM1NjE2LjAwNDY5LCJuYmYiOjE2OTczMzU2MTYuMDA0NjkzLCJleHAiOjMzMjU0MjQ0NDE2LjAwMzM1LCJzdWIiOiI3NDAwNjUzOCIsImlzcyI6Imh0dHBzOi8vbWV0YS53aWtpbWVkaWEub3JnIiwicmF0ZWxpbWl0Ijp7InJlcXVlc3RzX3Blcl91bml0Ijo1MDAwLCJ1bml0IjoiSE9VUiJ9LCJzY29wZXMiOlsiYmFzaWMiLCJjcmVhdGVlZGl0bW92ZXBhZ2UiLCJlZGl0cHJvdGVjdGVkIl19.U1V2FY9Zh11AaHTjy1OG6ql1gkKcsUhtwid9j9-1_L7Fwm2BaHl_1aOBNMhxC2TZMjAIs9hagRkKUug3TI9EMP-Up89cPwsiCdnb7rZFSXxU5hUIrrpVVN49y8LonBZeAy2VXCAnbeHCGQkfYbVBf92JOhY7rpUEGiMGJddXnLDjHmKBX2fOT31Kf-9pb4UUGt1QcJxucnM094KVJfjL5mdvWSscqdKRwc982Qa951uifNkAFse4uxVeq10gwUIgO_sfGVvU3FnAvGl7r3BmnmUsrVCVb_NwnWpDye8JYQ6aNOLxsSCS9iw13PVct1369ZdO__wiNCFX9BOiG9C7juosMu9X2XoBs35eXUOIhZOcbHRhPatfP1kvzpEqAuvBiaXTshVI7TA_FWgg2htpIieVFuKtD0X6koQ-5LcInL_re2PuFrovMJOtrV_k2m8Qx4JYntlaxgZJc-mNW-wbgT4LUIC03MR9s8ITvbiYY2q96nwqPsJQIH0JTa_49eDAh4L8sjXD-ktXtnyTpQtulJzQn-QduWFO8o_1ssjhbAgIecp7HzTIxDOdlIBG5Ib24IA4VRk8zC20OKWVjK5EQB0jbyLH4rv2IizXkzUrhadpZPRk8VHECkM3sbMi97U-m_5uQ3Mv1-Cpv0wtdl31kNR0h3C-hS6GZ0zr_apsn50"          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
# ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = "Adithyaavaasen"
ACCESS_TOKEN = "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJhdWQiOiI0NmQ4ZTg5NzY5M2YxYTE3OGU4ZDViZTQ4ZWY5Njg4OCIsImp0aSI6IjM4YTUwMDZjNTYxOGYxZWEyMDM0OWM4ZTkzNjQyNmUwZjhjNzBiMThjNzBmMzNlZDA2N2FiNzM0ZDFmYTVhMGY4MDc5YjAwMmNkYmI4N2QzIiwiaWF0IjoxNjk3MzM1NjE2LjAwNDY5LCJuYmYiOjE2OTczMzU2MTYuMDA0NjkzLCJleHAiOjMzMjU0MjQ0NDE2LjAwMzM1LCJzdWIiOiI3NDAwNjUzOCIsImlzcyI6Imh0dHBzOi8vbWV0YS53aWtpbWVkaWEub3JnIiwicmF0ZWxpbWl0Ijp7InJlcXVlc3RzX3Blcl91bml0Ijo1MDAwLCJ1bml0IjoiSE9VUiJ9LCJzY29wZXMiOlsiYmFzaWMiLCJjcmVhdGVlZGl0bW92ZXBhZ2UiLCJlZGl0cHJvdGVjdGVkIl19.U1V2FY9Zh11AaHTjy1OG6ql1gkKcsUhtwid9j9-1_L7Fwm2BaHl_1aOBNMhxC2TZMjAIs9hagRkKUug3TI9EMP-Up89cPwsiCdnb7rZFSXxU5hUIrrpVVN49y8LonBZeAy2VXCAnbeHCGQkfYbVBf92JOhY7rpUEGiMGJddXnLDjHmKBX2fOT31Kf-9pb4UUGt1QcJxucnM094KVJfjL5mdvWSscqdKRwc982Qa951uifNkAFse4uxVeq10gwUIgO_sfGVvU3FnAvGl7r3BmnmUsrVCVb_NwnWpDye8JYQ6aNOLxsSCS9iw13PVct1369ZdO__wiNCFX9BOiG9C7juosMu9X2XoBs35eXUOIhZOcbHRhPatfP1kvzpEqAuvBiaXTshVI7TA_FWgg2htpIieVFuKtD0X6koQ-5LcInL_re2PuFrovMJOtrV_k2m8Qx4JYntlaxgZJc-mNW-wbgT4LUIC03MR9s8ITvbiYY2q96nwqPsJQIH0JTa_49eDAh4L8sjXD-ktXtnyTpQtulJzQn-QduWFO8o_1ssjhbAgIecp7HzTIxDOdlIBG5Ib24IA4VRk8zC20OKWVjK5EQB0jbyLH4rv2IizXkzUrhadpZPRk8VHECkM3sbMi97U-m_5uQ3Mv1-Cpv0wtdl31kNR0h3C-hS6GZ0zr_apsn50"
#

In [12]:
##
##   Decode the Wikimedia JWT Access token
##
##   NOTE: This is not required to use LiftWing to request ORES scores. This is just being done to satisfy my curiosity.
##
#import base64
#
#print("Decoding the ACCESS_TOKEN:")
#try:
#    token_components = ACCESS_TOKEN.split(".")
#    if len(token_components) == 3:
#        header = json.loads(base64.b64decode(token_components[0]).decode())
#        payload = json.loads(base64.b64decode(token_components[1]).decode())
#        print("Token Header:",json.dumps(header,indent=4))
#        print("Token Payload:",json.dumps(payload,indent=4))
#        #print("Token Signature:",token_components[2])
#        print("Token Signature: <value_suppressed>")
#        #
#        #  One should be able to use public/private keys to actually validate that signature - left as an exercise for later
#        #
#    else:
#        print(f"The ACCESS_TOKEN appears to be improperly structured. It should have 3 components and it has {len(token_components)}")
#except Exception as ex:
#    print(f"Looks like the ACCESS_TOKEN is undefined or an empty value")
#    raise(ex)
#

### Function for getting article quality score:

This section contains the code for a function named `request_ores_score_per_article`. The function is responsible for making API requests to gather article quality scores from the ORES service.

#### Function Parameters:

- `article_revid`: The revision ID of the article for which you want to retrieve the quality score.
- `email_address`: Email address required for the API request.
- `access_token`: Access token required for the API request.
- `endpoint_url`: The API endpoint URL for making the request (default is `API_ORES_LIFTWING_ENDPOINT`).
- `model_name`: The name of the ORES model for scoring (default is `API_ORES_EN_QUALITY_MODEL`).
- `request_data`: Request data template for the API request (default is `ORES_REQUEST_DATA_TEMPLATE`).
- `header_format`: Request header template for the API request (default is `REQUEST_HEADER_TEMPLATE`).
- `header_params`: Request header parameters for the API request (default is `REQUEST_HEADER_PARAMS_TEMPLATE`).


In [13]:
def request_ores_score_per_article(article_revid = None, 
                                   email_address=None,
                                   access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

### Data Collection:
This code block fetches quality scores for a set of articles using API requests to the ORES service. The retrieved scores are organized into a structured DataFrame for analysis and further processing.



#### Initializing Data Structures:

- An empty dictionary `quality_scores` is created to store the quality scores.
- An empty list `failed_articles` is created to store articles for which scores couldn't be retrieved.

#### Iteration Over DataFrame Rows:

- It iterates over each row of a DataFrame named `result_df` that has the article tile and the revision ID.
- For each row, it extracts the 'title' and 'lastrevid' values, which represent the article title and revision ID, respectively.

#### API Requests for Quality Scores:

- For each article, the code makes an API request to fetch the quality score.
- The quality score is requested from the ORES API, and the response is stored in the `quality_score` variable.

#### Data Storage:

- If the quality score is successfully retrieved, it is stored in the `quality_scores` dictionary with the article title as the key.

#### Handling Failed Requests:

- If a request fails (i.e. returns {'httpCode': 429, 'httpReason': ''} or {}), the code logs the failed article titles in `failed_articles` and continues processing others.
- The list of artciles without scores are displayed towards the end of the loop.

#### Data Transformation:

After processing all articles, the code aggregates the retrieved quality scores into a structured DataFrame.

#### Output:

The resulting DataFrame `quality_scores_df` contains information about quality scores for each article.

In [16]:
# Initialize an empty dictionary to store quality scores
quality_scores = {}
# Initialize an empty list to store articles for which scores couldn't be retrieved
failed_articles = []

# Iterate through the DataFrame with article titles and revision IDs
for index, row in result_df.iterrows():
    article_title = row['title']
    article_revid = row['lastrevid']
    print('Requesting quality score for: ', article_title,', with rev ID: ',article_revid)

    # Make a request to the ORES API to get the quality score
    quality_score = request_ores_score_per_article(article_revid=article_revid, email_address="adithyaa279@gmail.com", access_token=ACCESS_TOKEN)
    
    # Check to collect articles with no scores
    if 'httpCode' in quality_score:
        print(f"Failed to retrieve quality score for {article_title}")
        failed_articles.append(article_title)
    else:
        if quality_score is not None:
            quality_scores[article_title] = quality_score
        else:
            print(f"Failed to retrieve quality score for {article_title}")
            failed_articles.append(article_title)


# Initialize lists to store page titles and quality scores as a dataframe
page_titles = []
revision_ids = []
predictions = []
probabilities_B = []
probabilities_C = []
probabilities_FA = []
probabilities_GA = []
probabilities_Start = []
probabilities_Stub = []

for page_title, page_data in quality_scores.items():
    enwiki = page_data.get('enwiki', {})
    if 'scores' in enwiki:
        scores = enwiki['scores']
        for rev_id, rev_data in scores.items():
            prediction = rev_data['articlequality']['score']['prediction']
            probability = rev_data['articlequality']['score']['probability']
            page_titles.append(page_title)
            revision_ids.append(rev_id)
            predictions.append(prediction)
            probabilities_B.append(probability['B'])
            probabilities_C.append(probability['C'])
            probabilities_FA.append(probability['FA'])
            probabilities_GA.append(probability['GA'])
            probabilities_Start.append(probability['Start'])
            probabilities_Stub.append(probability['Stub'])

# Create a DataFrame from the extracted data
quality_scores_df = pd.DataFrame({
    'page_title': page_titles,
    'revision_id': revision_ids,
    'prediction': predictions,
    'probability_B': probabilities_B,
    'probability_C': probabilities_C,
    'probability_FA': probabilities_FA,
    'probability_GA': probabilities_GA,
    'probability_Start': probabilities_Start,
    'probability_Stub': probabilities_Stub
})

Requesting quality score for:  Abbeville, Alabama , with rev ID:  1171163550
{'enwiki': {'models': {'articlequality': {'version': '0.9.2'}}, 'scores': {'1171163550': {'articlequality': {'score': {'prediction': 'C', 'probability': {'B': 0.31042252456158204, 'C': 0.5979200965294227, 'FA': 0.025186220917133947, 'GA': 0.04952133645299354, 'Start': 0.013573873336789355, 'Stub': 0.0033759482020785892}}}}}}}
Requesting quality score for:  Adamsville, Alabama , with rev ID:  1177621427
{'enwiki': {'models': {'articlequality': {'version': '0.9.2'}}, 'scores': {'1177621427': {'articlequality': {'score': {'prediction': 'C', 'probability': {'B': 0.198274200391586, 'C': 0.3770695177348356, 'FA': 0.019070364455845708, 'GA': 0.3514876684327692, 'Start': 0.05026148902798659, 'Stub': 0.003836759956977147}}}}}}}
Requesting quality score for:  Addison, Alabama , with rev ID:  1168359898
{'enwiki': {'models': {'articlequality': {'version': '0.9.2'}}, 'scores': {'1168359898': {'articlequality': {'score': {

KeyboardInterrupt: 

In [None]:
# Print and save the list of failed articles
print("Failed to retrieve ORES scores for the following articles:")
for article in failed_articles:
    print(article)

In [19]:
# Export page info request as csv (intermediate file) 
quality_scores_df.to_csv(os.path.join(cwd, 'intermediate/scores.csv'), index=False)

Unnamed: 0,page_title,revision_id,prediction,probability_B,probability_C,probability_FA,probability_GA,probability_Start,probability_Stub
0,"Abbeville, Alabama",1171163550,C,0.310423,0.597920,0.025186,0.049521,0.013574,0.003376
1,"Adamsville, Alabama",1177621427,C,0.198274,0.377070,0.019070,0.351488,0.050261,0.003837
2,"Addison, Alabama",1168359898,C,0.271041,0.324460,0.011266,0.294871,0.093188,0.005175
3,"Akron, Alabama",1165909508,GA,0.175388,0.265587,0.011557,0.448584,0.093508,0.005375
4,"Alabaster, Alabama",1179139816,C,0.270972,0.646384,0.009591,0.033642,0.036341,0.003071
...,...,...,...,...,...,...,...,...,...
16673,"Wamsutter, Wyoming",1169591845,GA,0.140348,0.187580,0.011517,0.624482,0.032711,0.003363
16674,"Wheatland, Wyoming",1176370621,GA,0.245020,0.285966,0.034640,0.395906,0.033736,0.004732
16675,"Worland, Wyoming",1166347917,GA,0.160382,0.238469,0.026243,0.546966,0.023868,0.004072
16676,"Wright, Wyoming",1166334449,GA,0.165136,0.323542,0.009909,0.467899,0.029659,0.003855


## Step 3: Combining the Datasets
This code block combines multiple datasets into a single DataFrame.

In [65]:
# Create a copy of the DataFrame to start the combining process
combined_df = result_df.copy()

# Merge with the DataFrame containing state information
combined_df = combined_df.merge(us_citiesxstates, how='left',
                                left_on='title', right_on='page_title')

# Clean state names, handling any specific cases
combined_df['state'] = combined_df['state'].apply(lambda x: 'Georgia' if x == 'Georgia_(U.S._state)' else x)

# The cityxstate data had values like New_york, fixing that
combined_df['state'] = combined_df.state.str.replace('_', ' ')

# Convert lastrevid to string for proper merge
combined_df['lastrevid'] = combined_df['lastrevid'].astype(str)

# Merge with the DataFrame containing quality scores with revision ID
combined_df = combined_df.merge(quality_scores_df[['revision_id', 'prediction']],
                                left_on='lastrevid',
                                right_on='revision_id',
                                how='left')


# Merge with the DataFrame containing regional division data
combined_df = combined_df.merge(us_regions, how='left', on='state')

# Merge with the DataFrame containing population data
combined_df = combined_df.merge(us_pop[['State', '2022']], how='left', left_on='state', right_on='State')

# Removing the duplicate revID column
combined_df.drop('revision_id',inplace= True, axis =1)

# Rename columns for consistency
combined_df.rename(columns={'title': 'article_title',
                            'lastrevid': 'revision_id',
                            'prediction': 'article_quality',
                            'division': 'regional_division',
                            '2022': 'population'},
                   inplace=True)

# Select and reorder the relevant columns
combined_df = combined_df[['state', 'regional_division', 'population',
                           'article_title', 'revision_id', 'article_quality']]

# Drop duplicate rows
combined_df.drop_duplicates(inplace=True, ignore_index=True)

# Export the combined dataset to a CSV file
combined_df.to_csv(os.path.join(cwd, 'output/wp_scored_city_articles_by_state.csv'),
                   index=False)
combined_df

Unnamed: 0,state,regional_division,population,article_title,revision_id,article_quality
0,Alabama,East South Central,5074296.0,"Abbeville, Alabama",1171163550,C
1,Alabama,East South Central,5074296.0,"Adamsville, Alabama",1177621427,C
2,Alabama,East South Central,5074296.0,"Addison, Alabama",1168359898,C
3,Alabama,East South Central,5074296.0,"Akron, Alabama",1165909508,GA
4,Alabama,East South Central,5074296.0,"Alabaster, Alabama",1179139816,C
...,...,...,...,...,...,...
21520,Wyoming,Mountain,581381.0,"Wamsutter, Wyoming",1169591845,GA
21521,Wyoming,Mountain,581381.0,"Wheatland, Wyoming",1176370621,GA
21522,Wyoming,Mountain,581381.0,"Worland, Wyoming",1166347917,GA
21523,Wyoming,Mountain,581381.0,"Wright, Wyoming",1166334449,GA


## Step 4: Analysis

This code block processes and analyzes data to calculate per capita ratios of articles and high-quality articles per state and regional division. It follows these key steps:

#### Filtering High-Quality Articles:

- High-quality articles (FA and GA classes) are selected from the dataset.

#### Grouping Data:

- The data is grouped by state and regional division.

#### Calculating Article Statistics:

- Total articles and high-quality articles are calculated for each state and division.

#### Merging Population Data:

- Population data is merged into the analysis data.

#### Calculating Per Capita Ratios:

- Per capita ratios for articles and high-quality articles are computed.


In [97]:
# Filter for high-quality articles (FA and GA)
high_quality_articles = combined_df[combined_df['article_quality'].isin(['FA', 'GA'])]

# Group the data by state and division
grouped_data = combined_df.groupby(['state', 'regional_division'])

# Calculate the total articles and high-quality articles
total_articles = grouped_data['article_title'].count()
high_quality_articles_count = high_quality_articles.groupby(['state', 'regional_division'])['article_title'].count()

# Merge the population data
analysis_data = grouped_data[['state', 'regional_division']].first()
analysis_data['total_articles'] = total_articles
analysis_data['high_quality_articles'] = high_quality_articles_count

analysis_data.reset_index(drop=True, inplace=True)

# Merge population data using 'state' column
analysis_data = analysis_data.merge(us_pop, left_on='state', right_on='State', how='left')

# Calculate per capita ratios
analysis_data['total_articles_per_capita'] = analysis_data['total_articles'] / analysis_data['2022']
analysis_data['high_quality_articles_per_capita'] = analysis_data['high_quality_articles'] / analysis_data['2022']

# Drop repeated columns
analysis_data.drop('State',inplace=True,axis=1)

# Rename columns for consistency
analysis_data.rename(columns={'2022': 'population'},
                   inplace=True)

analysis_data

Unnamed: 0,state,regional_division,total_articles,high_quality_articles,population,total_articles_per_capita,high_quality_articles_per_capita
0,Alabama,East South Central,461,53.0,5074296.0,9.1e-05,1e-05
1,Alaska,Pacific,149,31.0,733583.0,0.000203,4.2e-05
2,Arizona,Mountain,91,24.0,7359197.0,1.2e-05,3e-06
3,Arkansas,West South Central,500,72.0,3045637.0,0.000164,2.4e-05
4,California,Pacific,482,173.0,39029342.0,1.2e-05,4e-06
5,Colorado,Mountain,290,77.0,5839926.0,5e-05,1.3e-05
6,Delaware,South Atlantic,57,25.0,1018396.0,5.6e-05,2.5e-05
7,Florida,South Atlantic,412,119.0,22244823.0,1.9e-05,5e-06
8,Georgia,South Atlantic,538,93.0,10912876.0,4.9e-05,9e-06
9,Hawaii,Pacific,151,30.0,1440196.0,0.000105,2.1e-05


## Step 5: Results

1.	Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order) .
2.	Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order) .
3.	Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order) .
4.	Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).
5.	Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita.
6.	Census divisions by high quality coverage: Rank ordered list of US census divisions (in descending order) by high quality articles per capita.


In [86]:
# 1
top_10_states_by_coverage = analysis_data.sort_values(by='total_articles_per_capita', ascending=False).head(10)
top_10_states_by_coverage.drop(['high_quality_articles','high_quality_articles_per_capita'],inplace=True,axis=1)
top_10_states_by_coverage.reset_index(drop=True)

Unnamed: 0,state,regional_division,total_articles,population,total_articles_per_capita
0,Vermont,New England,329,647064.0,0.000508
1,North Dakota,West North Central,356,779261.0,0.000457
2,Maine,New England,483,1385340.0,0.000349
3,South Dakota,West North Central,311,909824.0,0.000342
4,Iowa,West North Central,1043,3200517.0,0.000326
5,Alaska,Pacific,149,733583.0,0.000203
6,Pennsylvania,Middle Atlantic,2556,12972008.0,0.000197
7,Michigan,East North Central,1773,10034113.0,0.000177
8,Wyoming,Mountain,99,581381.0,0.00017
9,New Hampshire,New England,234,1395231.0,0.000168


In [88]:
# 2
bottom_10_states_by_coverage = analysis_data.sort_values(by='total_articles_per_capita').head(10)
bottom_10_states_by_coverage.drop(['high_quality_articles','high_quality_articles_per_capita'],inplace=True,axis=1)
bottom_10_states_by_coverage.reset_index(drop=True)

Unnamed: 0,state,regional_division,total_articles,population,total_articles_per_capita
0,North Carolina,South Atlantic,50,10698973.0,5e-06
1,Nevada,Mountain,19,3177772.0,6e-06
2,California,Pacific,482,39029342.0,1.2e-05
3,Arizona,Mountain,91,7359197.0,1.2e-05
4,Virginia,South Atlantic,133,8683619.0,1.5e-05
5,Florida,South Atlantic,412,22244823.0,1.9e-05
6,Oklahoma,West South Central,75,4019800.0,1.9e-05
7,Kansas,West North Central,63,2937150.0,2.1e-05
8,Maryland,South Atlantic,157,6164660.0,2.5e-05
9,Wisconsin,East North Central,192,5892539.0,3.3e-05


In [94]:
# 3
top_10_states_by_high_quality = analysis_data.sort_values(by='high_quality_articles_per_capita', ascending=False).head(10)
top_10_states_by_high_quality.drop(['total_articles','total_articles_per_capita'],inplace=True,axis=1)
top_10_states_by_high_quality.reset_index(drop=True)

Unnamed: 0,state,regional_division,high_quality_articles,population,high_quality_articles_per_capita
0,Vermont,New England,45.0,647064.0,7e-05
1,Wyoming,Mountain,39.0,581381.0,6.7e-05
2,South Dakota,West North Central,56.0,909824.0,6.2e-05
3,West Virginia,South Atlantic,106.0,1775156.0,6e-05
4,Montana,Mountain,55.0,1122867.0,4.9e-05
5,New Hampshire,New England,63.0,1395231.0,4.5e-05
6,Alaska,Pacific,31.0,733583.0,4.2e-05
7,New Jersey,Middle Atlantic,379.0,9261699.0,4.1e-05
8,North Dakota,West North Central,26.0,779261.0,3.3e-05
9,Oregon,Pacific,141.0,4240137.0,3.3e-05


In [95]:
# 4
bottom_10_states_by_high_quality = analysis_data.sort_values(by='high_quality_articles_per_capita').head(10)
bottom_10_states_by_high_quality.drop(['total_articles','total_articles_per_capita'],inplace=True,axis=1)
bottom_10_states_by_high_quality.reset_index(drop=True)

Unnamed: 0,state,regional_division,high_quality_articles,population,high_quality_articles_per_capita
0,Michigan,East North Central,12.0,10034113.0,1e-06
1,North Carolina,South Atlantic,21.0,10698973.0,2e-06
2,Virginia,South Atlantic,18.0,8683619.0,2e-06
3,Nevada,Mountain,8.0,3177772.0,3e-06
4,Arizona,Mountain,24.0,7359197.0,3e-06
5,California,Pacific,173.0,39029342.0,4e-06
6,Florida,South Atlantic,119.0,22244823.0,5e-06
7,New York,Middle Atlantic,111.0,19677151.0,6e-06
8,Maryland,South Atlantic,42.0,6164660.0,7e-06
9,Kansas,West North Central,22.0,2937150.0,7e-06


In [92]:
# 5
census_divisions_by_total_coverage = analysis_data.groupby('regional_division').agg({'total_articles_per_capita': 'mean'}).reset_index()
census_divisions_by_total_coverage = census_divisions_by_total_coverage.sort_values(by='total_articles_per_capita', ascending=False)
census_divisions_by_total_coverage.reset_index(drop=True)

Unnamed: 0,regional_division,total_articles_per_capita
0,West North Central,0.000242
1,New England,0.000222
2,Middle Atlantic,9.7e-05
3,East North Central,9.5e-05
4,East South Central,8.4e-05
5,Pacific,8.3e-05
6,Mountain,7.3e-05
7,West South Central,7.2e-05
8,South Atlantic,4.4e-05


In [93]:
# 6
census_divisions_by_high_quality_coverage = analysis_data.groupby('regional_division').agg({'high_quality_articles_per_capita': 'mean'}).reset_index()
census_divisions_by_high_quality_coverage = census_divisions_by_high_quality_coverage.sort_values(by='high_quality_articles_per_capita', ascending=False)
census_divisions_by_high_quality_coverage.reset_index(drop=True)

Unnamed: 0,regional_division,high_quality_articles_per_capita
0,New England,3.9e-05
1,West North Central,3.2e-05
2,Mountain,2.4e-05
3,Pacific,2.3e-05
4,Middle Atlantic,2.1e-05
5,East South Central,1.6e-05
6,South Atlantic,1.5e-05
7,West South Central,1.4e-05
8,East North Central,1.2e-05
