# This notebook serves as the executable software that reads in datasets, combines datasets, and runs data analysis

The end goal is to analyze how the coverage of US cities on Wikipedia, and the quality of the articles about those cities, varies among states. 

Steps: 
1. Confirm datasets are ready to go.
2. Get article quality predictions
3. Combine the dataset of Wikipedia articles with a dataset of state populations
4. Data analysis  
   4a. The states with the greatest and least coverage of cities on Wikipedia compared to their population.  
   4b. The states with the highest and lowest proportion of high quality articles about cities.  
   4c. A ranking of US geographic regions by articles-per-person and proportion of high quality articles.  
5. Print results - 6 tables of various state/region per capita article/article quality counts

# Import packages

In [1]:
import numpy as np
import pandas as pd
import csv
import json
import time
import urllib.parse
import requests
import base64
from tqdm import tqdm

# Part 1. Confirm datasets are ready to go. 

Confirm that the files below are there:  
./data/PopulationEstimates.csv   
./data/States_by_region.csv  
./data/us_cities_by_state_SEPT2023.csv  
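
The file check above can also be done programmatically. The snippet below is a minimal sketch, not part of the original workflow: `find_missing_files` and `REQUIRED_FILES` are hypothetical names, and the paths assume the same `../data/` location that the code cells in this notebook open.

```python
from pathlib import Path

# Hypothetical list of inputs; the paths mirror the ones opened by the code cells below.
REQUIRED_FILES = [
    "../data/PopulationEstimates.csv",
    "../data/States_by_region.csv",
    "../data/us_cities_by_state_SEPT2023.csv",
]

def find_missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]

missing = find_missing_files(REQUIRED_FILES)
if missing:
    print("Missing input files:", missing)
else:
    print("All input files found.")
```

Running this before Part 1 reports a missing file up front, rather than partway through a cell.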

#### Step 1A: Read in ./data/PopulationEstimates.csv as csv object.
    ##### Step 1A1: Store only state and population data from csv into list for just states. 
    ##### Step 1A2: Output first, middle, and last row of csv object to confirm data is in memory. 
#### Step 1B: Read in ./data/States_by_region.csv as csv object.
    ##### Step 1B1: Store all data from csv into list. 
    ##### Step 1B2: Remove header row from list to prevent it showing up later.    
    ##### Step 1B3: Output first, middle, and last row of csv object to confirm data is in memory. 
#### Step 1C: Read in ./data/us_cities_by_state_SEPT2023.csv as csv object.
    ##### Step 1C1: Store all data from csv into list. 
    ##### Step 1C2: Remove header row from list to prevent it showing up later.    
    ##### Step 1C3: Output first, middle, and last row of csv object to confirm data is in memory. 

### Data sources: 
##### ./data/PopulationEstimates.csv  
https://www2.census.gov/programs-surveys/popest/datasets/2020-2022/state/totals/NST-EST2022-ALLDATA.csv    
https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html  
##### ./data/States_by_region.csv    
The shared Google Drive for Homework 2 of UW DATA 512 provides this file.   
##### ./data/us_cities_by_state_SEPT2023.csv  
The shared Google Drive for Homework 2 of UW DATA 512 provides this file.   
It was generated by crawling the Wikipedia Category:Lists of cities in the   
United States by state to collect the Wikipedia article pages about US  
cities from each state.   

# Data issues:
I did not find any of the inconsistencies that the HW2 assignment suggested I might find here.  
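
One way such a check could be automated is sketched below. It is an illustration under assumptions, not a record of how the data was actually inspected: `find_inconsistencies` is a hypothetical helper, and the rows are toy values shaped like `[state, article_title, url]`.

```python
from collections import Counter

def find_inconsistencies(city_rows, valid_states):
    """Return (duplicate_titles, unknown_states) for rows of [state, article_title, url]."""
    title_counts = Counter(row[1] for row in city_rows)
    duplicates = sorted(title for title, n in title_counts.items() if n > 1)
    unknown = sorted({row[0] for row in city_rows} - set(valid_states))
    return duplicates, unknown

# Toy rows for illustration only; the third row uses a placeholder URL.
rows = [
    ["Alabama", "Abbeville, Alabama", "https://en.wikipedia.org/wiki/Abbeville,_Alabama"],
    ["Alabama", "Abbeville, Alabama", "https://en.wikipedia.org/wiki/Abbeville,_Alabama"],
    ["Guam", "Hagatna, Guam", "https://example.org/placeholder"],
]
duplicates, unknown = find_inconsistencies(rows, valid_states=["Alabama", "Wyoming"])
print(duplicates)  # ['Abbeville, Alabama']
print(unknown)     # ['Guam']
```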

In [2]:
# Step 1A: Read in ./data/PopulationEstimates.csv as csv object.
StatePopEstimates = []
RegionalPopEstimates = []
with open('../data/PopulationEstimates.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    
    # Step 1A1: Store only state and population data from csv into list for just states. 
    #  This means excluding the first 15 and the last rows, which correspond to 
    #  the header row, larger regions of the United States, and Puerto Rico, which are not states. 
    counter = 0
    for row in reader: 
        if(counter > 14 and counter < 66):
            StatePopEstimates.append([row[4], row[8]])
        elif(counter < 15):
            RegionalPopEstimates.append([row[4], row[8]])
        counter += 1
        
#Remove header, US population, Washington DC
RegionalPopEstimates.pop(0)
RegionalPopEstimates.pop(0)
StatePopEstimates.pop(8)
        
# Step 1A2: Output first, middle, and last row of csv object to confirm data is in memory.  
print("Number of states, should be 50:", len(StatePopEstimates))
print(StatePopEstimates[0])
print(StatePopEstimates[len(StatePopEstimates)//2])
print(StatePopEstimates[len(StatePopEstimates)-1])

Number of states, should be 50: 50
['Alabama', '5074296']
['Montana', '1122867']
['Wyoming', '581381']


In [3]:
# Step 1B: Read in ./data/States_by_region.csv as csv object.
StateRegions = []
with open('../data/States_by_region.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    
    # Step 1B1: Store all data from csv into list. 
    for row in reader: 
        StateRegions.append([row[0], row[1], row[2]]) 

# Step 1B2: Remove header row
StateRegions.pop(0)

# Step 1B3: Output first, middle, and last row of csv object to confirm data is in memory.
print("Number of states, should be 50:", len(StateRegions))
print(StateRegions[0])
print(StateRegions[len(StateRegions)//2])
print(StateRegions[len(StateRegions)-1])

Number of states, should be 50: 50
['Northeast', 'New England', 'Connecticut']
['South', 'South Atlantic', 'North Carolina']
['West', 'Pacific', 'Washington']


In [4]:
#### Step 1C: Read in ./data/us_cities_by_state_SEPT2023.csv as csv object.
CityArticles = []
with open('../data/us_cities_by_state_SEPT2023.csv', newline='', encoding="utf8") as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    
    # Step 1C1: Store all data from csv into list. 
    for row in reader: 
        CityArticles.append([row[0], row[1], row[2]]) 

# Step 1C2: Remove header row
CityArticles.pop(0)

# Step 1C3: Output first, middle, and last row of csv object to confirm data is in memory.
print(CityArticles[0])
print(CityArticles[len(CityArticles)//2])
print(CityArticles[len(CityArticles)-1])

['Alabama', 'Abbeville, Alabama', 'https://en.wikipedia.org/wiki/Abbeville,_Alabama']
['Minnesota', 'Sargeant, Minnesota', 'https://en.wikipedia.org/wiki/Sargeant,_Minnesota']
['Wyoming', 'Yoder, Wyoming', 'https://en.wikipedia.org/wiki/Yoder,_Wyoming']


# Part 2. Get Article Quality Predictions
We're using a machine learning system called ORES. This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:
FA - Featured article  
GA - Good article (sometimes called A-class)  
B - B-class article  
C - C-class article  
Start - Start-class article  
Stub - Stub-class article  

These labels were learned from articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. These quality classes are a subset of the quality assessment categories developed by Wikipedia editors.
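
Because the classes are ordered, it can help to encode that order explicitly. The following is a small sketch with hypothetical names (`QUALITY_ORDER`, `is_high_quality`, `best_label`). It treats FA and GA as "high quality", which matches the definition used in the analysis step of this notebook.

```python
# Quality classes from best to worst, as listed above.
QUALITY_ORDER = ["FA", "GA", "B", "C", "Start", "Stub"]
QUALITY_RANK = {label: rank for rank, label in enumerate(QUALITY_ORDER)}

# Assumption: "high quality" means FA or GA.
HIGH_QUALITY = {"FA", "GA"}

def is_high_quality(label):
    """True when the predicted class counts as high quality."""
    return label in HIGH_QUALITY

def best_label(labels):
    """Return the best class among labels, ignoring unknown values such as 'N/A'."""
    known = [label for label in labels if label in QUALITY_RANK]
    return min(known, key=QUALITY_RANK.get) if known else None

print(best_label(["C", "GA", "N/A"]))  # GA
print(is_high_quality("B"))            # False
```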

#### Step 2A: Create wikimedia user account to generate API token. 
    ##### Step 2A1: Create account here: https://api.wikimedia.org/w/index.php?title=Special:UserLogin 
    ##### Step 2A2: Then go here to create token: https://api.wikimedia.org/wiki/Special:AppManagement
    ##### Step 2A3: Click Create key, choose personal API token, and check all permissions
#### Step 2B: Define constants/functions to make data requests.
    ##### Step 2B1: Create function to load in credentials from text file which is not allowed to be pushed to the repo.
    ##### Step 2B2: Load in credentials. 
    ##### Step 2B3: Define functions and constants for pageinfo API. 
    ##### Step 2B4: Define functions and constants for ORES API. 
#### Step 2D: Get ORES score for each city article. 
    ##### Step 2D1: Read each line of us_cities_by_state_SEPT2023.csv
    ##### Step 2D2: Make a page info request to get the current article page revision
    ##### Step 2D3: Make an ORES request using the page title and current revision id.
    ##### Step 2D4: Store score predictions if possible, else print out failures. 
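
The score lookup in the last step can be isolated into a small helper that returns `None` instead of raising when a response is missing or malformed. This is a sketch only: `extract_prediction` is a hypothetical name, and the nested keys are simply the path this notebook reads from the response, not a full description of the API's output.

```python
def extract_prediction(ores_response, revid):
    """Return the predicted quality class, or None if the response lacks one."""
    try:
        scores = ores_response["enwiki"]["scores"][str(revid)]
        return scores["articlequality"]["score"]["prediction"]
    except (KeyError, TypeError):
        # KeyError: an expected key is absent; TypeError: the response is None or not a dict
        return None

# Hand-written stand-in for a response; the revision id and label are made up.
fake_response = {
    "enwiki": {"scores": {"12345": {"articlequality": {"score": {"prediction": "Stub"}}}}}
}
print(extract_prediction(fake_response, 12345))       # Stub
print(extract_prediction(None, 12345))                # None
print(extract_prediction({"error": "oops"}, 12345))   # None
```

Catching only `KeyError` and `TypeError` keeps unrelated problems, such as a keyboard interrupt during a long run, from being silently recorded as an unscored article.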

In [5]:
# Step 2B1: Create function to load in credentials from text file which is not allowed to be pushed to the repository.
def load_credentials_from_file(filename):
    '''
    Description:
        Given a text file with six lines, where the second, fourth, and sixth lines are
        respectively the client id, client secret, and access token, reads the text file
        and loads those lines into memory as variables. 
    Inputs:
        filename - String - Path of file
    Outputs:
        client_id - String
        client_secret - String
        access_token - String
    Notes:
        This function below was generated by chatGPT. https://chat.openai.com/ 
        Prompt used:
            "Given a text file with 6 lines, where the 2nd, 4th, and 6th lines 
             are a client id, client secret access token, and access token,
             give me a python function that that reads in a text file and loads 
             those lines into memory as variables. Make sure to close the file."
        Why: 
            Wanted to save time, knew how to explain the problem, can easily make this myself, simple code to generate.
    '''
    try:
        with open(filename, 'r') as file:
            lines = file.readlines()
        if len(lines) >= 6:
            client_id = lines[1].strip()
            client_secret = lines[3].strip()
            access_token = lines[5].strip()
            return client_id, client_secret, access_token
        else:
            print("Error: The file does not contain enough lines.")
            return None, None, None
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        return None, None, None
    except Exception as e:
        print(f"Error: An unexpected error occurred - {e}")
        return None, None, None

# Step 2B2: Load in credentials
filename = '../auth.txt'
client_id, client_secret, access_token = load_credentials_from_file(filename)
if client_id is not None and client_secret is not None and access_token is not None:
    print("Credentials loaded successfully:")
else:
    print("Failed to load credentials.")

Credentials loaded successfully:


In [6]:
# Step 2B3: Define functions and constants for pageinfo API. 

# The code in this cell was developed by Dr. David W. McDonald for use in DATA 512, 
# a course in the UW MS Data Science degree program. 
# This code is provided under the Creative Commons CC-BY license. Revision 1.1 - August 14, 2022
# Any modifications made to the original source also fall under the CC-BY license. 

#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. 
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/5000.0)-API_LATENCY_ASSUMED

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# When making automated requests we should include something that is unique to the person making the request
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is a string of additional page properties that can be returned see the Info documentation for what can be included.
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = { "action": "query", "format": "json", "titles": "", "prop": "info",
                            "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    '''
    Description:
        Make request to endpoint with supplied arguments.
    Inputs:
        article_title - String
        endpoint_url - String
        request_template - Dictionary
        headers - Dictionary
    Output:
        Dictionary
    '''
    
    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

In [7]:
# Step 2B4: Define functions and constants for ORES API. 

# The code in this cell was developed by Dr. David W. McDonald for use in DATA 512, 
# a course in the UW MS Data Science degree program. 
# This code is provided under the Creative Commons CC-BY license. Revision 1.1 - August 15, 2023
# Any modifications made to the original source also fall under the CC-BY license. 

# The current LiftWing ORES API endpoint and prediction model
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there   
# Because all LiftWing API requests require some form of authentication, you need to provide your access token
# as part of the header too
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<zbowyer@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"   # filled in from header_params when the request headers are built
}

#    This is a template for the parameters that we need to supply in the headers of an API request
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "zbowyer@uw.edu",         # your email address should go here
    'access_token'  : access_token          # the access token you create will need to go here
}

#    This is a template of the data required as a payload when making a scoring request of the ORES model
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    '''
    Description:
        Attempts to get ORES score from API call with supplied arguments.
    Inputs:
        article_revid - Integer
        email_address - String
        access_token - String
        endpoint_url - String
        model_name - String
        request_data - Dictionary
        header_format - Dictionary
        header_params - Dictionary
    Output:
        Dictionary
    '''
    
    # Make sure we have an article revision id, email and token
    # This approach prioritizes the parameters passed in when making the call
    if article_revid: request_data['rev_id'] = article_revid
    if email_address: header_params['email_address'] = email_address
    if access_token: header_params['access_token'] = access_token
    
    # Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']: raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']: raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']: raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print("Exception", e)
        json_response = None
    return json_response

In [None]:
# Approach to handling failures:
#  If an ORES score for an article cannot be found,
#  store the article name in a list and print out
#  everything from the list at the end. Also store 
#  the score as N/A in the article list. We chose to print
#  them all at the end to prevent print statements forcing the
#  TQDM progress bar to be recreated. The output of the failures
#  is simply a string for each article title.  

# Step 2D: Get ORES score for each city article. 
# Step 2D1: Read each line of us_cities_by_state_SEPT2023.csv
unscored_articles = []
for i in tqdm(range(len(CityArticles)), position=0, leave=True):
    article_title = CityArticles[i][1]
    article_url = CityArticles[i][2]
    
    # Step 2D2: Make a page info request to get the current article page revision
    PageData = request_pageinfo_per_article(article_title)
    try:
        pageid = list(PageData["query"]["pages"].keys())[0]
        revid = int(PageData["query"]["pages"][pageid]["lastrevid"])
    except (KeyError, TypeError, IndexError):
        # The request failed or the page has no current revision: record the failure and move on
        CityArticles[i].append("N/A")
        CityArticles[i].append(None)
        unscored_articles.append(article_title)
        continue
    
    # Step 2D3: Make an ORES request using the page title and current revision id.
    score = request_ores_score_per_article(article_revid=revid, email_address="zbowyer@uw.edu", access_token=access_token)
    
    # Step 2D4: Store score predictions if possible, else print out failures. 
    try:
        ORES_prediction = score["enwiki"]["scores"][str(revid)]["articlequality"]["score"]["prediction"]
    except (KeyError, TypeError):
        # score is None or lacks a prediction for this revision
        ORES_prediction = "N/A"
        unscored_articles.append(article_title)
    CityArticles[i].append(ORES_prediction)
    CityArticles[i].append(revid)

print("Articles that could not be scored: ")
for x in unscored_articles: print(x)

  0%|                                                                                             | 23/22157 [00:19<5:01:12,  1.22it/s]

# Step 3: Combine datasets

The goal here is to merge the Wikipedia data and population data together.   
This can be done because both datasets have state name fields.  
Additionally, regional divisions must be added, so a third dataset will be combined.  

All data with no matches should be logged.  
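
One way to surface unmatched rows in pandas is an outer merge with `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The sketch below uses toy frames. "Atlantis" is a made-up state with no population row, and the two population figures are the ones printed earlier in this notebook.

```python
import pandas as pd

# Toy frames for illustration; only the column names mirror this notebook.
articles = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Atlantis"],
    "article_title": ["Abbeville, Alabama", "Adamsville, Alabama", "Lost City, Atlantis"],
})
populations = pd.DataFrame({
    "state": ["Alabama", "Wyoming"],
    "state_population": [5074296, 581381],
})

# An outer merge keeps rows from both sides; _merge records where each row came from.
merged = articles.merge(populations, how="outer", on="state", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["state", "_merge"]])
```

Here "Atlantis" shows up as `left_only` (an article with no population) and "Wyoming" as `right_only` (a population with no articles), so both kinds of gaps can be logged from a single merge.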

#### Step 3A: Save article scores to a file separately for redundancy (it took a long time to generate). 
#### Step 3B: Convert all lists to dataframes
#### Step 3C: Merge all dataframes on the 'state' column
#### Step 3D: Save final merged dataframe to file
#### Step 3E: Print out final merged dataframe to see if previous steps worked.
#### Step 3F: Print out all rows that could not be matched.
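
Since the scoring run takes hours, the save-then-reload pattern in Step 3A can be wrapped in a small "load or build" helper. This is a sketch under assumptions: `load_or_build` is a hypothetical function, and `build_rows` stands for any zero-argument callable that returns the list of rows.

```python
from pathlib import Path
import pandas as pd

def load_or_build(csv_path, build_rows, columns):
    """Load a cached CSV if it exists; otherwise build the rows, save them, and return them."""
    path = Path(csv_path)
    if path.is_file():
        return pd.read_csv(path)
    frame = pd.DataFrame(build_rows(), columns=columns)
    # index=False avoids an extra unnamed index column when the file is read back
    frame.to_csv(path, index=False)
    return frame
```

In this notebook it might be called with the scores file path, a function returning `CityArticles`, and one column name per value in each row. The expensive rows are then generated only on the first pass.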

In [None]:
# Step 3A: Save article scores to a file separately for redundancy (it took a long time to generate). 
#Uncomment on first passthrough
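# Each row holds five values (the revision id is appended in Step 2D4), so five column names are needed
#CityArticles_dataframe = pd.DataFrame(CityArticles, columns = ['state', 'article_title', 'url', 'article_quality', 'revision_id'])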
#CityArticles_dataframe.to_csv("../data/CityArticles_withscores.csv")
CityArticles_dataframe = pd.read_csv("../data/CityArticles_withscores.csv") #For reloading

# Step 3B: Convert all lists to dataframes
StateRegions_dataframe = pd.DataFrame(StateRegions, columns = ['region', 'regional_division', 'state'])
StatePopEstimates_dataframe = pd.DataFrame(StatePopEstimates, columns = ['state', 'state_population'])
RegionalPopEstimates_dataframe = pd.DataFrame(RegionalPopEstimates, columns = ['regional_division', 'region_population'])

# Step 3C: Merge all dataframes on the 'state' column
df3 = CityArticles_dataframe.merge(StatePopEstimates_dataframe, how='inner',left_on='state', right_on='state')
df4 = df3.merge(StateRegions_dataframe, how='inner',left_on='state', right_on='state')
df5 = df4.merge(RegionalPopEstimates_dataframe, how='inner', left_on='regional_division', right_on='regional_division')

# Step 3D: Save final merged dataframe to file
df5.to_csv("../data/wp_scored_city_articles_by_state.csv")

# Step 3E: Print out final merged dataframe to see if previous steps worked.
print(df5.head(2))

# Step 3F: Print out all non matched rows:
# This indicates every spot where a join could not be made due to lack of data. 
print("-----------------------------------------------------------------")
print("Missing rows in combined dataframe!")
test_df = pd.merge(CityArticles_dataframe, StatePopEstimates_dataframe, how = 'left', left_on='state', right_on='state')
missingrows1 = (test_df.loc[test_df['url'].isna()])
missingrows2 = (test_df.loc[test_df['state_population'].isna()])

test_df = pd.merge(df3, StateRegions_dataframe, how = 'left', left_on='state', right_on='state')
missingrows3 = (test_df.loc[test_df['url'].isna()])
missingrows4 = (test_df.loc[test_df['region'].isna()])

test_df = pd.merge(df4, RegionalPopEstimates_dataframe, how = 'left', left_on='regional_division', right_on='regional_division')
missingrows5 = (test_df.loc[test_df['url'].isna()])
missingrows6 = (test_df.loc[test_df['region_population'].isna()])

for x in missingrows1.iterrows(): print(x)
for x in missingrows2.iterrows(): print(x[1]["article_title"])
for x in missingrows3.iterrows(): print(x)
for x in missingrows4.iterrows(): print(x)
for x in missingrows5.iterrows(): print(x)
for x in missingrows6.iterrows(): print(x)

# Step 4: Analysis

The goal here is to calculate metrics for our dataset. 

### For each state, and each division:
    Total-articles-per-population (number of articles per person)  
    High quality articles per population (number of high quality articles per person) (FA OR GA)    
    
#### Step 4A: Use groupby, merging, renaming, etc, to make calculated field of article number per capita by state
#### Step 4B: Use groupby, merging, renaming, etc, to make calculated field of article number per capita by division
#### Step 4C: Use groupby, merging, renaming, etc, to make calculated field of high quality article number per capita by state
#### Step 4D: Use groupby, merging, renaming, etc, to make calculated field of high quality article number per capita by division
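
The four steps share one pattern: count articles per group, attach the group's population, and divide. A compact sketch of that pattern follows, with a hypothetical helper `per_capita` and toy data. Unlike an inner merge, it starts from the population table and uses a left merge, so a group with zero matching articles stays in the result with a count of 0 instead of dropping out. That is a design choice of this sketch, not necessarily what the cells below do.

```python
import pandas as pd

def per_capita(articles, populations, key, pop_col):
    """Count rows of `articles` per `key` and divide by the population in `pop_col`."""
    counts = articles.groupby(key).size().rename("article_number").reset_index()
    out = populations.merge(counts, how="left", on=key)
    out["article_number"] = out["article_number"].fillna(0).astype(int)
    # Populations read with the csv module arrive as strings, so convert before dividing
    out[pop_col] = pd.to_numeric(out[pop_col])
    out["per_capita"] = out["article_number"] / out[pop_col]
    return out

# Toy data: state "B" has no FA/GA article.
articles = pd.DataFrame({"state": ["A", "A", "B"], "article_quality": ["GA", "Stub", "Stub"]})
populations = pd.DataFrame({"state": ["A", "B"], "state_population": ["100", "50"]})

all_articles = per_capita(articles, populations, "state", "state_population")
high_quality = per_capita(articles[articles["article_quality"].isin(["FA", "GA"])],
                          populations, "state", "state_population")
print(all_articles)
print(high_quality)
```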

In [None]:
# Step 4A: Use groupby, merging, renaming, etc, to make calculated field of article number per capita by state
#Total articles per capita (STATE)
States_num_articles = pd.DataFrame(df5.groupby('state').count()["article_title"])
States_num_articles = States_num_articles.rename(columns={"article_title": "article_number"})
States_num_articles = States_num_articles.merge(StatePopEstimates_dataframe, how='inner',left_on='state', right_on='state')
States_num_articles = States_num_articles.astype({'state_population': 'int32'})
States_num_articles["article_number_per_capita"] = States_num_articles["article_number"] / States_num_articles["state_population"]
print("State-based number of articles per capita:")
print(States_num_articles.head(2))
print("---------------------------------------------------------------------------------------------")

#### Step 4B: Use groupby, merging, renaming, etc, to make calculated field of article number per capita by division
#Total articles per capita (Division)
Divisions_num_articles = pd.DataFrame(df5.groupby('regional_division').count()["article_title"])
Divisions_num_articles = Divisions_num_articles.rename(columns={"article_title": "article_number"})
Divisions_num_articles = Divisions_num_articles.merge(RegionalPopEstimates_dataframe, how='inner',left_on='regional_division', right_on='regional_division')
Divisions_num_articles = Divisions_num_articles.astype({'region_population': 'int32'})
Divisions_num_articles["article_number_per_capita"] = Divisions_num_articles["article_number"] / Divisions_num_articles["region_population"]
print("Division-based number of articles per capita:")
print(Divisions_num_articles.head(2))
print("---------------------------------------------------------------------------------------------")

#Used for both 4C and 4D
#Filter on only FA or GAs
GA_or_FA = df5[df5['article_quality'].isin(['FA', 'GA'])]

# Step 4C: Use groupby, merging, renaming, etc, to make calculated field of high quality article number per capita by state
#Total high quality articles per capita (STATE)
State_numquality_articles = GA_or_FA.groupby('state')['article_quality'].count().reset_index()
State_numquality_articles = State_numquality_articles.merge(StatePopEstimates_dataframe, how='inner',left_on='state', right_on='state')
State_numquality_articles = State_numquality_articles.rename(columns={"article_quality": "article_number"})
State_numquality_articles = State_numquality_articles.astype({'state_population': 'int32'})
State_numquality_articles["quality_article_number_per_capita"] = State_numquality_articles["article_number"] / State_numquality_articles["state_population"]
print("State-based number of quality articles per capita:")
print(State_numquality_articles.head(2))
print("---------------------------------------------------------------------------------------------")

# Step 4D: Use groupby, merging, renaming, etc, to make calculated field of high quality article number per capita by division
#Total high quality articles per capita (Division)
Divisions_numquality_articles = GA_or_FA.groupby('regional_division')['article_quality'].count().reset_index()
Divisions_numquality_articles = Divisions_numquality_articles.merge(RegionalPopEstimates_dataframe, how='inner',left_on='regional_division', right_on='regional_division')
Divisions_numquality_articles = Divisions_numquality_articles.rename(columns={"article_quality": "article_number"})
Divisions_numquality_articles = Divisions_numquality_articles.astype({'region_population': 'int32'})
Divisions_numquality_articles["quality_article_number_per_capita"] = Divisions_numquality_articles["article_number"] / Divisions_numquality_articles["region_population"]
print("Division-based number of quality articles per capita:")
print(Divisions_numquality_articles.head(2))

# Step 5: Results

#### The goal here is to produce six tables that show:
5A. Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order).  
5B. Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order).  
5C. Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order).  
5D. Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).  
5E. Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita.  
5F. Census divisions by high quality coverage: Rank ordered list of US census divisions (in descending order) by high quality articles per capita.  
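
The top and bottom tables can be produced with `sort_values(...).head(10)`, or equivalently with `DataFrame.nlargest` and `DataFrame.nsmallest`, which return rows already ordered descending and ascending respectively. A minimal sketch with toy values (`top_and_bottom` is a hypothetical helper):

```python
import pandas as pd

def top_and_bottom(frame, column, n=10):
    """Return (top n rows in descending order, bottom n rows in ascending order) by `column`."""
    return frame.nlargest(n, column), frame.nsmallest(n, column)

# Toy table for illustration; the column name mirrors the one used in this notebook.
toy = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "article_number_per_capita": [0.4, 0.1, 0.3, 0.2],
})
top, bottom = top_and_bottom(toy, "article_number_per_capita", n=2)
print(list(top["state"]))     # ['A', 'C']
print(list(bottom["state"]))  # ['B', 'D']
```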


In [None]:
# Step 5A: Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order).  
States_num_articles.sort_values(by='article_number_per_capita', ascending=False).head(10)

In [None]:
# Step 5B: Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order).  
States_num_articles.sort_values(by='article_number_per_capita', ascending=True).head(10)

In [None]:
# Step 5C: Top 10 US states by high quality: The 10 US states with the highest number of high quality articles per capita (in descending order).
State_numquality_articles.sort_values(by='quality_article_number_per_capita', ascending=False).head(10)

In [None]:
# Step 5D: Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order).
State_numquality_articles.sort_values(by='quality_article_number_per_capita', ascending=True).head(10)

In [None]:
# Step 5E: Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita.
Divisions_num_articles.sort_values(by='article_number_per_capita', ascending=False).head(10)

In [None]:
# Step 5F: Census divisions by high quality coverage: Rank ordered list of US census divisions (in descending order) by high quality articles per capita.
Divisions_numquality_articles.sort_values(by='quality_article_number_per_capita', ascending=False).head(10)