# DATA512 - Homework 2 - Considering Bias in Data

### Baisakhi Sarkar - MSDS @ University of Washington Seattle, 2023-25

## Analysis of coverage of politicians and the quality of articles about politicians in Wikipedia

The goal of this assignment is to explore the concept of bias in data using Wikipedia articles. This assignment will consider articles on political figures from different countries. For this assignment, we will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

We will perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies among countries. Our analysis will consist of a series of tables that show:

1.	The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
	
2.	The countries with the highest and lowest proportion of high quality articles about politicians.
	
3.	A ranking of geographic regions by articles-per-person and proportion of high quality articles.
	


## Step 1. Data Acquisition

The first step is getting the data, which lives in several different places. We will need data that lists Wikipedia articles of politicians and data for country populations. 

The [Wikipedia Category:Politicians by nationality](https://en.wikipedia.org/wiki/Category:Politicians_by_nationality) was crawled to generate a list of Wikipedia article pages about politicians from a wide range of countries. This data is in the Data folder as politicians_by_country_AUG.2024.csv.

The population data is available in CSV format as population_by_country_AUG.2024.csv in the Data folder. This dataset was downloaded from the [world population data sheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau.

The population_by_country_AUG.2024.csv file contains rows that provide cumulative regional population counts. These rows are distinguished by having ALL CAPS values in the 'Geography' field (e.g. AFRICA, OCEANIA). These rows will not match the country values in politicians_by_country_AUG.2024.csv, but we will retain them so that we can report coverage and quality by region as specified in the analysis section below.
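As a minimal sketch of how the ALL CAPS rule could be applied, the snippet below flags region rows with the pandas `str.isupper()` method. The small `sample` frame and the `is_region` column name are made up for illustration; only the `Geography` and `Population` column names come from the real file.

```python
import pandas as pd

# Hypothetical sample mirroring the layout of population_by_country_AUG.2024.csv
sample = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Population": [8009.0, 1453.0, 256.0, 46.8, 105.2],
})

# Region rows are the ones whose name is written entirely in upper case
sample["is_region"] = sample["Geography"].str.isupper()

regions = sample[sample["is_region"]]
countries = sample[~sample["is_region"]]

print(regions["Geography"].tolist())
print(countries["Geography"].tolist())
```

Keeping the flag as a column, rather than dropping the region rows, leaves both views available for the later per-country and per-region tables.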

In [2]:
# These are standard python modules
import json, time, urllib.parse
import requests
import warnings


# The modules below are not standard Python modules
import pandas as pd

In [3]:
# Read the politicians data having their wikipedia page url and country
politicians_df = pd.read_csv("politicians_by_country_AUG.2024.csv")
politicians_df.head()

Unnamed: 0,name,url,country
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan


In [4]:
# Read the population data having their country name and population in millions
population_df = pd.read_csv("population_by_country_AUG.2024.csv")
population_df.head()

Unnamed: 0,Geography,Population
0,WORLD,8009.0
1,AFRICA,1453.0
2,NORTHERN AFRICA,256.0
3,Algeria,46.8
4,Egypt,105.2


In [5]:
politicians_df.info

<bound method DataFrame.info of                       name                                                url  \
0           Majah Ha Adrif       https://en.wikipedia.org/wiki/Majah_Ha_Adrif   
1        Haroon al-Afghani    https://en.wikipedia.org/wiki/Haroon_al-Afghani   
2              Tayyab Agha          https://en.wikipedia.org/wiki/Tayyab_Agha   
3     Khadija Zahra Ahmadi  https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...   
4           Aziza Ahmadyar       https://en.wikipedia.org/wiki/Aziza_Ahmadyar   
...                    ...                                                ...   
7150      Josiah Tongogara     https://en.wikipedia.org/wiki/Josiah_Tongogara   
7151     Langton Towungana    https://en.wikipedia.org/wiki/Langton_Towungana   
7152     Sengezo Tshabangu    https://en.wikipedia.org/wiki/Sengezo_Tshabangu   
7153   Herbert Ushewokunze  https://en.wikipedia.org/wiki/Herbert_Ushewokunze   
7154          Denis Walker         https://en.wikipedia.org/wiki/Denis_Walker

In [6]:
population_df.info

<bound method DataFrame.info of            Geography  Population
0              WORLD      8009.0
1             AFRICA      1453.0
2    NORTHERN AFRICA       256.0
3            Algeria        46.8
4              Egypt       105.2
..               ...         ...
228            Samoa         0.2
229  Solomon Islands         0.8
230            Tonga         0.1
231           Tuvalu         0.0
232          Vanuatu         0.3

[233 rows x 2 columns]>

In [7]:
#checking if any column has null value in politicians_df
politicians_df.columns[politicians_df.isnull().any()].tolist()

[]

In [8]:
#checking if any column has null value in population_df
population_df.columns[population_df.isnull().any()].tolist()

[]

In [9]:
#Removing duplicates, but since shape remains same we can say there were no duplicates present
politicians_df.drop_duplicates(inplace=True)
politicians_df.shape

(7155, 3)

In [10]:
#Removing duplicates, but since shape remains same we can say there were no duplicates present
population_df.drop_duplicates(inplace=True)
population_df.shape

(233, 2)

In [None]:
population_df.rename(columns={'Geography': 'country', 'Population': 'population'}, inplace=True)

In [11]:
# Step 1: Decode the Unicode sequences in the 'name' column
#politicians_df['name'] = politicians_df['name'].apply(lambda x: x.encode('utf-8').decode('unicode-escape'))

In [12]:
# Storing the final list of politician names/titles in a list 
politician_names = politicians_df['name'].to_list()
len(politician_names)

7155

## Step 2: Getting Article Quality Predictions

Now we need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:

1.	FA - Featured article
   
2.	GA - Good article (also known as A-Class)
	
3.	B - B-Class article
	
4.	C - C-Class article
	
5.	Start - Start-class article
	
6.	Stub - Stub-class article

These class labels were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. These quality classes are a subset of quality assessment categories developed by Wikipedia editors.

ORES requires a specific revision ID of an article to be able to make a label prediction. You can use the API:Info request to get a range of metadata on an article, including the most current revision ID of the article page.

Putting this together, to get a Wikipedia page quality prediction from ORES for each politician’s article page you will need to: 
a) read each line of politicians_by_country_AUG.2024.csv, 
b) make a page info request to get the current page revision, and 
c) make an ORES request using the page title and current revision id.  


### (a) Page Info Request to get Current Page Revision of the Wikipedia Article

The code below illustrates how to access page info data using the [MediaWiki Action API for the EN Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). 

#### LICENSE

This [code example](https://drive.google.com/file/d/1iGH_pOMlspeCwDzKCPRlQdq73iS16R6k/view?usp=drive_link) was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024. This example shows how to request summary 'page info' for a single article page. The API documentation, API:Info, covers additional details that may be helpful when trying to use or understand this example.

The example relies on some constants that help make the code a bit more readable.

In [13]:
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<bsarka@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is just a list of English Wikipedia article titles that we can use for example requests
#ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]
ARTICLE_TITLES = politician_names

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [14]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
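    # article title can be as a parameter to the call or in the request_template
    # work on a copy so the shared template is not modified between calls
    request_template = dict(request_template)
    if article_title:
        request_template['titles'] = article_title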

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'bsarka@uw' not in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [15]:
#Example Usage 1
print(f"Getting page info data for: {ARTICLE_TITLES[95]}")
info = request_pageinfo_per_article(ARTICLE_TITLES[95])
print(json.dumps(info,indent=4))

Getting page info data for: Refo Çapari
{
    "batchcomplete": "",
    "query": {
        "pages": {
            "31301973": {
                "pageid": 31301973,
                "ns": 0,
                "title": "Refo \u00c7apari",
                "contentmodel": "wikitext",
                "pagelanguage": "en",
                "pagelanguagehtmlcode": "en",
                "pagelanguagedir": "ltr",
                "touched": "2024-10-13T16:01:57Z",
                "lastrevid": 1222381064,
                "length": 3211,
                "talkid": 31302066,
                "fullurl": "https://en.wikipedia.org/wiki/Refo_%C3%87apari",
                "editurl": "https://en.wikipedia.org/w/index.php?title=Refo_%C3%87apari&action=edit",
                "canonicalurl": "https://en.wikipedia.org/wiki/Refo_%C3%87apari"
            }
        }
    }
}


In [16]:
#Example Usage 2
print(f"Getting page info data for: {ARTICLE_TITLES[1]}")
info = request_pageinfo_per_article(ARTICLE_TITLES[1])
print(json.dumps(info['query']['pages'],indent=4))

Getting page info data for: Haroon al-Afghani
{
    "11966231": {
        "pageid": 11966231,
        "ns": 0,
        "title": "Haroon al-Afghani",
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2024-10-05T14:27:29Z",
        "lastrevid": 1230459615,
        "length": 17027,
        "talkid": 15250816,
        "fullurl": "https://en.wikipedia.org/wiki/Haroon_al-Afghani",
        "editurl": "https://en.wikipedia.org/w/index.php?title=Haroon_al-Afghani&action=edit",
        "canonicalurl": "https://en.wikipedia.org/wiki/Haroon_al-Afghani"
    }
}


In [17]:
#Example Usage 3
page_titles = f"{ARTICLE_TITLES[0]}|{ARTICLE_TITLES[2]}|{ARTICLE_TITLES[4]}"
print(f"Getting page info data for: {page_titles}")
request_info = PAGEINFO_PARAMS_TEMPLATE.copy()
request_info['titles'] = page_titles
info = request_pageinfo_per_article(request_template=request_info)
print(json.dumps(info['query']['pages'],indent=4))

Getting page info data for: Majah Ha Adrif|Tayyab Agha|Aziza Ahmadyar
{
    "47805901": {
        "pageid": 47805901,
        "ns": 0,
        "title": "Aziza Ahmadyar",
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2024-10-13T14:46:20Z",
        "lastrevid": 1195651393,
        "length": 3790,
        "talkid": 47806200,
        "fullurl": "https://en.wikipedia.org/wiki/Aziza_Ahmadyar",
        "editurl": "https://en.wikipedia.org/w/index.php?title=Aziza_Ahmadyar&action=edit",
        "canonicalurl": "https://en.wikipedia.org/wiki/Aziza_Ahmadyar"
    },
    "10483286": {
        "pageid": 10483286,
        "ns": 0,
        "title": "Majah Ha Adrif",
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2024-09-30T14:32:18Z",
        "lastrevid": 1233202991,
        "length

In [18]:
politician_article_revid = {}
# Titles for which no revision ID could be retrieved
articles_without_revid = []

# Iterate through every article and get the json response
for article in ARTICLE_TITLES:
    pageinfo = request_pageinfo_per_article(article)
    # A failed request returns None, so skip it instead of crashing
    if not pageinfo or 'query' not in pageinfo:
        articles_without_revid.append(article)
        continue
    for key, value in pageinfo['query']['pages'].items():
        # Store the title and revision ID only when the page actually has a revision;
        # missing pages come back without 'lastrevid'
        if 'lastrevid' in value and 'title' in value:
            politician_article_revid[value['title']] = value['lastrevid']
        else:
            articles_without_revid.append(article)

In [19]:
len(politician_article_revid)

7111

### Note:

**The initial politicians_df had 7155 rows corresponding to 7155 politician names, but after calling the API to get the revision ID of each article we have only 7111 entries, a difference of 44. Some article titles were invalid or no longer on Wikipedia, so their latest revision IDs were not available; names that appear more than once in the list would also collapse into a single dictionary key.**
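To see which names account for the gap, one option is to compare the requested names against the titles that came back. This is only a sketch on made-up data: `requested` and `returned` stand in for `politician_names` and `politician_article_revid`, and the revision IDs are placeholders. Because the API can return a title in a slightly different form than the one requested, a name may show up as missing here even though its page exists.

```python
from collections import Counter

# Hypothetical stand-ins for politician_names and politician_article_revid
requested = ["Tayyab Agha", "Aziza Ahmadyar", "Tayyab Agha", "No Such Page"]
returned = {"Tayyab Agha": 111, "Aziza Ahmadyar": 222}

# Names requested more than once collapse into a single dictionary key
duplicates = [name for name, count in Counter(requested).items() if count > 1]

# Names that have no revision ID in the result
missing = sorted(set(requested) - set(returned))

print("duplicates:", duplicates)
print("missing:", missing)
```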

In [20]:
# Save the dictionary in a json file
with open('politician_article_revid.json', 'w') as json_file:
    json.dump(politician_article_revid, json_file, indent=4)

Next, we read back the data we just saved as JSON so that it can be used in the following step, which calls [ORES](https://www.mediawiki.org/wiki/ORES) (Objective Revision Evaluation Service).

In [21]:
with open('politician_article_revid.json', 'r') as json_file:
    pol_data_revid = json.load(json_file)

### (b) Requesting ORES scores through LiftWing ML Service API

Access to the ORES API will require us to request an API access key. The [sample code](https://drive.google.com/file/d/1GN1ULxKombHRzVsNKzj7tBhnBrSWUWXc/view?usp=drive_link) for making ORES requests includes links to information about how to request a key. A "best practice" for any code that requires an API key is to make sure that the key does not appear in the plain text of the code or notebook. One approach is to embed the key as an environment variable and retrieve the key from that variable. Another approach is to use a code based key manager that stores keys on your local machine. The apikeys user module is used in the ORES example code.
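
As a sketch of the environment-variable approach, the helper below reads a token from a variable and fails loudly if it is not set. The variable name `WIKIMEDIA_ACCESS_TOKEN` and the function name `load_access_token` are assumptions chosen for illustration; they are not part of the course sample code.

```python
import os

def load_access_token(var_name="WIKIMEDIA_ACCESS_TOKEN"):
    """Return the token stored in the given environment variable."""
    # var_name is a hypothetical variable name; use whatever name you exported
    token = os.environ.get(var_name, "")
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before running the notebook.")
    return token
```

With the variable exported in the shell that launches the notebook, `ACCESS_TOKEN = load_access_token()` could replace a hard-coded constant.
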
As we run the code, it is possible that we will be unable to get a score for a particular article. If that happens, we have to make sure to maintain a log of articles for which we were not able to retrieve an ORES score. This log will be saved as a separate file, or (if it's only a few articles), simply printed and logged within the notebook. 
Our code should compute and print the score error rate. The error rate is the number of articles for which we were not able to get a score divided by the total number of articles. If our request error rate is higher than 1%, we should review our code, determine what is going wrong, fix it, and rerun our score collection.
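
The error-rate check described above could be sketched as follows. The function name `score_error_rate`, the article titles, and the counts are all hypothetical; in the notebook the inputs would be the list of articles that failed to score and the number of articles attempted.

```python
def score_error_rate(attempted, failed):
    """Return failed / attempted as a fraction; attempted must be positive."""
    if attempted <= 0:
        raise ValueError("attempted must be a positive count")
    return failed / attempted

# Hypothetical log of articles whose score request failed
failed_articles = ["Example Article A", "Example Article B"]
total_articles = 400

rate = score_error_rate(total_articles, len(failed_articles))
print(f"Score error rate: {rate:.2%}")
if rate > 0.01:
    print("Error rate exceeds 1%: review the request code and rerun.")
```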


Wikimedia Foundation (WMF) is reworking access to their APIs. It is likely in the coming years that all API access will require some kind of authentication, either through a simple key/token or through some version of OAuth. For now this is still a work in progress. You can follow the progress from their [API portal](https://api.wikimedia.org/wiki/Main_Page). Another on-going change is better control over API services in situations where those services require additional computational resources, beyond simply serving the text of a web page (i.e., the text of an article). Services like ORES, which require running an ML model over the text of an article page, are an example of a compute-intensive API service.

Wikimedia is implementing a new Machine Learning (ML) service infrastructure that they call [LiftWing](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing). Given that ORES already has several ML models that have been well used, ORES is the first set of APIs that are being moved to LiftWing.

This example illustrates how to generate article quality estimates for article revisions using the LiftWing version of [ORES](https://www.mediawiki.org/wiki/ORES). The [ORES API documentation](https://ores.wikimedia.org/docs) can be accessed from the main ORES page. The [ORES LiftWing documentation](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage) is very thin ... even thinner than the standard ORES documentation. Further, it is clear that some parameters have been renamed (e.g., "revid" in the old ORES API is now "rev_id" in the LiftWing ORES API).

#### License
This [code example](https://drive.google.com/file/d/1GN1ULxKombHRzVsNKzj7tBhnBrSWUWXc/view?usp=drive_link) was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org/) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - August 15, 2023

In [22]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "bsarka@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "bsarka@uw.edu",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = pol_data_revid

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#

### Getting access token

You will need a Wikimedia user account to get access to Lift Wing (the ML API service). You can either [create an account or login](https://api.wikimedia.org/w/index.php?title=Special:UserLogin). If you have a Wikipedia user account, you might already have a Wikimedia account. If you are not sure, try your Wikipedia username and password to check. If you do not have a Wikimedia account, you will need to create one that you can use to get an access token.

There is [a 'guide' that describes how to get authentication tokens](https://api.wikimedia.org/wiki/Authentication) - but not everything works the way it is described in that documentation. You should review that documentation and then read the rest of this comment.

The documentation talks about using a "dashboard" for managing authentication tokens. That's a rather generous description for what looks like a simple list of token things. You might have a hard time finding this "dashboard". First, on the left hand side of the page, you'll see a column of links. The bottom section is a set of links titled "Tools". In that section is a link that says [Special pages](https://api.wikimedia.org/wiki/Special:SpecialPages) which will take you to a list of ... well, special pages. At the very bottom of the "Special pages" page is a section titled "Other special pages" (scroll all the way to the bottom). The first link in that section is called [API keys](https://api.wikimedia.org/wiki/Special:AppManagement). When you get to the "API keys" page you can create a new key.

The authentication guide suggests that you should create a server-side app key. This does not seem to work correctly - as yet. It failed on multiple attempts when I attempted to create a server-side app key. BUT, there is an option to create a [Personal API token](https://api.wikimedia.org/wiki/Authentication) that should work for this course and the type of ORES page scoring that you will need to perform.

Note, when you create a Personal API token you are granted three items - a Client ID, a Client secret, and an Access token - and you should save all three of these. When you dismiss the box they are gone. If you lose any one of the tokens you can destroy or deactivate the Personal API token from the dashboard and then create a new one.

The value you need to work the code below is the Access token - a very long string.


In [23]:
#   Once you've done the right set up with your Wikimedia account, it should provide you with three different keys: a Client ID,
#   a Client secret, and an Access token.
#
#   In this case I don't want to distribute my keys with the source of the notebook, so I wrote a key manager object that helps
#   track all of my API keys - a username and domain name retrieves the key. The key manager hides the keys on disk separate
#   from the code. A common code idiom to hide API keys will use code to extract the key from an OS environment variable. 
#
#   In the Homework 2 folder you should be able to find a zip file containing the apikeys user module. Install this module
#   into the folder where you keep all of your user modules. This is also the folder that your PYTHONPATH variable points to.
#
from apikeys.KeyManager import KeyManager
keyman = KeyManager()

#
#   This is my Wikipedia/Wikimedia username. They suggest you request your keys using your Wikipedia username, so I
#   also stored the API key using my Wikipedia username.
#
#   You should probably use your own username here.
# USERNAME = "Sarkarb"
# key_info = keyman.findRecord(USERNAME,API_ORES_LIFTWING_ENDPOINT)
# ACCESS_TOKEN = key_info[0]['key']
# print(key_info[0]['description'])
# print(ACCESS_TOKEN)
#
#   Note: if you don't want to use the key manager to help manage your API keys, you can specify the values as constants
#   below. Just don't distribute the notebook without removing the constants or you'll be distributing your key too.
#
USERNAME = "Sarkarb"
ACCESS_TOKEN = "eyJ0e******"
#

The Wikimedia Foundation appears to be issuing access tokens that are adhering to the [JWT (JSON Web Token) standard](https://jwt.io/introduction/). There was also some documentation by IBM about the [use of JWT tokens](https://www.ibm.com/docs/en/cics-ts/6.1?topic=cics-json-web-token-jwt) that I found useful. Keep in mind, documentation from IBM is specific to their implementation of the JWT standard. Access tokens are composed of different parts that specify the domain being accessed and rate limits. The little snippet of code below is not required to make ORES requests. It just allows us to see what is in the Wikimedia provided access token that you were issued.

In [24]:
#
#   Decode the Wikimedia JWT Access token
#
#   NOTE: This is not required to use LiftWing to request ORES scores. This is just being done to satisfy my curiosity.
#   You might be curious too!
#
import base64

print("Decoding the ACCESS_TOKEN:")
try:
    token_components = ACCESS_TOKEN.split(".")
    if len(token_components) == 3:
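        # JWT segments are base64url encoded with the padding stripped, so restore it before decoding
        header = json.loads(base64.urlsafe_b64decode(token_components[0] + "=" * (-len(token_components[0]) % 4)).decode())
        payload = json.loads(base64.urlsafe_b64decode(token_components[1] + "=" * (-len(token_components[1]) % 4)).decode())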
        print("Token Header:",json.dumps(header,indent=4))
        print("Token Payload:",json.dumps(payload,indent=4))
        #print("Token Signature:",token_components[2])
        print("Token Signature: <value_suppressed>")
        #
        #  One should be able to use public/private keys to actually validate that signature - left as an exercise for later
        #
    else:
        print(f"The ACCESS_TOKEN appears to be improperly structured. It should have 3 components and it has {len(token_components)}")
except Exception as ex:
    print("Looks like the ACCESS_TOKEN is undefined, empty, or could not be decoded")
    raise


Decoding the ACCESS_TOKEN:
Token Header: {
    "typ": "JWT",
    "alg": "RS256"
}
Token Payload: {
    "aud": "94e5ae0217f6f6d5d7da196a9a7e4e26",
    "jti": "ab9df51a88bc362abfce9ee7b8f10fe3ed86a82910519142cf134344abb89a0f0a7b7a9ca7ab2fbe",
    "iat": 1728792950.502199,
    "nbf": 1728792950.502202,
    "exp": 33285701750.50039,
    "sub": "76703373",
    "iss": "https://meta.wikimedia.org",
    "ratelimit": {
        "requests_per_unit": 5000,
        "unit": "HOUR"
    },
    "scopes": [
        "basic"
    ]
}
Token Signature: <value_suppressed>
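
The `ratelimit` block in the payload above is where the 5000 requests per hour behind `API_THROTTLE_WAIT` comes from. As a sketch, the wait could be derived from the payload instead of being hard-coded. The function name `throttle_wait` and the `SECONDS_PER_UNIT` table are hypothetical, and only `"HOUR"` has been seen in an actual token; the other unit names are assumptions.

```python
# Seconds per rate-limit unit; only "HOUR" appears in the token above,
# the other unit names are assumed for illustration
SECONDS_PER_UNIT = {"SECOND": 1.0, "MINUTE": 60.0, "HOUR": 3600.0}

def throttle_wait(ratelimit, latency=0.002):
    """Seconds to sleep between requests so the granted rate is not exceeded."""
    seconds = SECONDS_PER_UNIT[ratelimit["unit"]]
    return max(seconds / ratelimit["requests_per_unit"] - latency, 0.0)

# The ratelimit block as it appears in the decoded payload
ratelimit = {"requests_per_unit": 5000, "unit": "HOUR"}
print(round(throttle_wait(ratelimit), 3))
```

For the values shown this gives the same result as `((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED`, roughly 0.718 seconds between requests.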


### Defining a function to make the ORES API request

The API request will be made using a function that encapsulates the call and makes access reusable in other notebooks. The function is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is 'article_revid'. One should be able to simply pass in a new article revision id on each call and get back a Python dictionary as the result. A valid result will be a dictionary that contains the probabilities that the specific revision is one of six different article quality levels. Generally, the quality level with the highest probability score is considered the quality level for the article. This can be tricky when two (or more) quality levels are highly probable.

In [25]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [26]:
#   Example Usage 1
#
#   Which article - the key for the article dictionary defined above
article_title = "Majah Ha Adrif"
#
print(f"Getting LiftWing ORES scores for '{article_title}' with revid: {ARTICLE_REVISIONS[article_title]:d}")
#
#    Make the call, just pass in the article revision ID, email address, and access token
score = request_ores_score_per_article(article_revid=ARTICLE_REVISIONS[article_title],
                                       email_address="bsarka@uw.edu",
                                       access_token=ACCESS_TOKEN)
#
#    Output the result
print(json.dumps(score,indent=4))
#

Getting LiftWing ORES scores for 'Majah Ha Adrif' with revid: 1233202991
{
    "enwiki": {
        "models": {
            "articlequality": {
                "version": "0.9.2"
            }
        },
        "scores": {
            "1233202991": {
                "articlequality": {
                    "score": {
                        "prediction": "Start",
                        "probability": {
                            "B": 0.11458586462110412,
                            "C": 0.25181268514718375,
                            "FA": 0.0063723793696558026,
                            "GA": 0.017663717882219612,
                            "Start": 0.5480559478600824,
                            "Stub": 0.06150940511975442
                        }
                    }
                }
            }
        }
    }
}


In [27]:
#   Example Usage 2
#
#   Which article - the key for the article dictionary defined above
article_title = "Themistokli G\u00ebrmenji"
#
print(f"Getting LiftWing ORES scores for '{article_title}' with revid: {ARTICLE_REVISIONS[article_title]:d}")
#
#    Make the call, just pass in the article revision ID, email address, and access token
score = request_ores_score_per_article(article_revid=ARTICLE_REVISIONS[article_title],
                                       email_address="bsarka@uw.edu",
                                       access_token=ACCESS_TOKEN)
#
revid = str(ARTICLE_REVISIONS[article_title])
print(revid)
quality = score["enwiki"]["scores"][revid]["articlequality"]["score"]["prediction"]
print(quality)
#print(score["enwiki"]["scores"])
#    Output the result
print(json.dumps(score,indent=4))
#

Getting LiftWing ORES scores for 'Themistokli Gërmenji' with revid: 1241037317
1241037317
C
{
    "enwiki": {
        "models": {
            "articlequality": {
                "version": "0.9.2"
            }
        },
        "scores": {
            "1241037317": {
                "articlequality": {
                    "score": {
                        "prediction": "C",
                        "probability": {
                            "B": 0.10918559037874914,
                            "C": 0.6677024395732389,
                            "FA": 0.004337508334838239,
                            "GA": 0.02474220878416045,
                            "Start": 0.1905372959101843,
                            "Stub": 0.0034949570188287575
                        }
                    }
                }
            }
        }
    }
}
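
The nested lookup in Example Usage 2 raises a `KeyError` whenever a response lacks one of the expected keys. The following is a minimal sketch of a more defensive lookup, assuming the response shape shown in the output above. `extract_prediction` is a hypothetical helper and is not part of this notebook, and the second payload is made up purely to illustrate a malformed response.

```python
def extract_prediction(response, revid):
    """Return the predicted quality class, or None if the response lacks it."""
    try:
        return response["enwiki"]["scores"][str(revid)]["articlequality"]["score"]["prediction"]
    except (KeyError, TypeError):
        # KeyError: an expected key is missing; TypeError: response is None or not a dict
        return None


# Trimmed copy of the response printed above for revision 1241037317
sample = {"enwiki": {"scores": {"1241037317": {"articlequality": {"score": {"prediction": "C"}}}}}}

print(extract_prediction(sample, 1241037317))                      # C
print(extract_prediction({"detail": "hypothetical"}, 1241037317))  # None
print(extract_prediction(None, 1241037317))                        # None
```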


In [28]:
# Create two lists: one for the articles that received a score and one for those that did not
scores_articles = []
no_scores_articles = []

for article_title in ARTICLE_REVISIONS:
    # Call the API to get the score for every article based on the revision id
    score = request_ores_score_per_article(article_revid=ARTICLE_REVISIONS[article_title],
                                           email_address="bsarka@uw.edu",
                                           access_token=ACCESS_TOKEN)
    try:
        if score:
            # Create a new dict to store the article title and score
            scores_dict = {}
            string_revid = str(ARTICLE_REVISIONS[article_title])
            if score["enwiki"]["scores"]:
                quality = score["enwiki"]["scores"][string_revid]["articlequality"]["score"]["prediction"]
                scores_dict['article_title'] = article_title
                scores_dict['article_quality'] = quality
                # Append the dict to the list
                scores_articles.append(scores_dict)
        else:
            no_scores_articles.append(article_title)
    except Exception as e:
        print(e)
        #raise(e)


'enwiki'
'enwiki'


In [63]:
no_scores_articles

[]

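The `no_scores_articles` list is empty, but that does not mean every article was scored. The two `'enwiki'` lines printed by the scoring loop are `KeyError` messages: for two articles the API returned a response without an `enwiki` key, so the lookup failed inside the `try` block and the title was never appended to `no_scores_articles`.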
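
One way to make such failures visible in the list itself is to record a title whenever the lookup fails. The following is a minimal sketch under that assumption. `collect_scores` and `fake_score` are hypothetical names used only for illustration, and the stand-in scorer replaces the real API call so that the example runs offline.

```python
def collect_scores(revisions, score_fn):
    """Split articles into scored rows and titles whose lookup failed."""
    scored, failed = [], []
    for title, revid in revisions.items():
        response = score_fn(revid)
        try:
            quality = response["enwiki"]["scores"][str(revid)]["articlequality"]["score"]["prediction"]
        except (KeyError, TypeError):
            # Missing key or empty response: remember the title instead of only printing
            failed.append(title)
            continue
        scored.append({"article_title": title, "article_quality": quality})
    return scored, failed


def fake_score(revid):
    """Stand-in for the API call: one known revision, everything else malformed."""
    if revid == 1233202991:
        return {"enwiki": {"scores": {"1233202991": {"articlequality": {"score": {"prediction": "Start"}}}}}}
    return {"detail": "hypothetical error payload"}


scored, failed = collect_scores(
    {"Majah Ha Adrif": 1233202991, "Hypothetical Missing Article": 1},
    fake_score,
)
print(scored)
print(failed)
```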

In [64]:
scores_articles_df = pd.DataFrame(scores_articles)
# Export the articles and their scores to a csv file
scores_articles_df.to_csv('articles_scores.csv', index=False)

In [65]:
scores_articles_df

Unnamed: 0,article_title,article_quality
0,Majah Ha Adrif,Start
1,Haroon al-Afghani,B
2,Tayyab Agha,Start
3,Khadija Zahra Ahmadi,Stub
4,Aziza Ahmadyar,Start
...,...,...
7104,Josiah Tongogara,C
7105,Langton Towungana,Stub
7106,Sengezo Tshabangu,Start
7107,Herbert Ushewokunze,Stub


**Out of 7110 articles, 7108 received a valid article quality prediction and only 2 did not. The error rate is therefore 2/7110 ≈ 0.00028, which is negligible.**

## Step 3: Combining the Datasets

Some processing of the data will be necessary. In particular, after retrieving and including the ORES data for each article, we need to merge the Wikipedia data and the population data together. Both have fields containing country names for just that purpose. After merging the data, we ran into entries that cannot be merged: either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.

We identified all countries for which there is no match and wrote them, one country per line, to a file called:
wp_countries-no_match.txt

We then consolidated the remaining data into a single CSV file called:
wp_politicians_by_country.csv

The CSV file includes the following columns of data:

country,
region,
population,
article_title,
revision_id,
article_quality
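
One way to see which country names fail to match on either side is an outer merge with pandas' `indicator=True` option, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. The following is a minimal sketch on toy rows. The frame names and article titles are illustrative and this is not the code used in the cells of this notebook.

```python
import pandas as pd

# Toy stand-ins for the article and population tables
articles = pd.DataFrame({
    "article_title": ["Majah Ha Adrif", "Hypothetical Article", "Josiah Tongogara"],
    "country": ["Afghanistan", "Korea, South", "Zimbabwe"],
})
population = pd.DataFrame({
    "country": ["Afghanistan", "Zimbabwe", "Kiribati"],
    "population": [42.4, 16.7, 0.1],  # millions
})

merged = pd.merge(articles, population, on="country", how="outer", indicator=True)

# Countries that appear only in the article table (no population row)
no_population = sorted(merged.loc[merged["_merge"] == "left_only", "country"].unique())
# Countries that appear only in the population table (no articles)
no_articles = sorted(merged.loc[merged["_merge"] == "right_only", "country"].unique())

print(no_population)
print(no_articles)
```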


In [66]:
pol_data_revid

{'Majah Ha Adrif': 1233202991,
 'Haroon al-Afghani': 1230459615,
 'Tayyab Agha': 1225661708,
 'Khadija Zahra Ahmadi': 1234741562,
 'Aziza Ahmadyar': 1195651393,
 'Muqadasa Ahmadzai': 1235521766,
 'Mohammad Sarwar Ahmedzai': 1176429234,
 'Amir Muhammad Akhundzada': 1247931713,
 'Nasrullah Baryalai Arsalai': 1225385278,
 'Abdul Rahim Ayoubi': 1226326055,
 'Ismael Balkhi': 1244521219,
 'Abdul Baqi Turkistani': 1231655023,
 'Mohammad Ghous Bashiri': 1237694188,
 'Jan Baz': 1227635806,
 'Bashir Ahmad Bezan': 1248505877,
 'Rafiullah Bidar': 1197443408,
 'Mohammad Siddiq Chakari': 1134129082,
 'Cheragh Ali Cheragh': 1193992206,
 'Nasir Ahmad Durrani': 988838315,
 'Muhammad Hashim Esmatullahi': 949986748,
 'Ezatullah (Nangarhar)': 1158302291,
 'Aimal Faizi': 1185105938,
 'Gajinder Singh Safri': 1212323536,
 'Sharif Ghalib': 1245967190,
 'Hashmat Ghani Ahmadzai': 1207743719,
 'Abdul Ghani Ghani': 1227026187,
 'Ghulam Ghaus': 1158659195,
 'Ghulam Muhammad Ghobar': 1240993642,
 'Mohammad Gul (Hel

In [67]:
# Convert the pol_data_revid JSON object to a dataframe
pol_data_revid_df = pd.DataFrame(list(pol_data_revid.items()), columns=['article_title', 'revision_id'])
pol_data_revid_df

Unnamed: 0,article_title,revision_id
0,Majah Ha Adrif,1233202991
1,Haroon al-Afghani,1230459615
2,Tayyab Agha,1225661708
3,Khadija Zahra Ahmadi,1234741562
4,Aziza Ahmadyar,1195651393
...,...,...
7106,Josiah Tongogara,1203429435
7107,Langton Towungana,1246280093
7108,Sengezo Tshabangu,1228478288
7109,Herbert Ushewokunze,959111842


In [68]:
politicians_df.rename(columns={'name': 'article_title'}, inplace=True)
politicians_df

Unnamed: 0,article_title,url,country
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan
...,...,...,...
7150,Josiah Tongogara,https://en.wikipedia.org/wiki/Josiah_Tongogara,Zimbabwe
7151,Langton Towungana,https://en.wikipedia.org/wiki/Langton_Towungana,Zimbabwe
7152,Sengezo Tshabangu,https://en.wikipedia.org/wiki/Sengezo_Tshabangu,Zimbabwe
7153,Herbert Ushewokunze,https://en.wikipedia.org/wiki/Herbert_Ushewokunze,Zimbabwe


In [69]:
# Merge the pol_data_revid_df and the politicians_df dataframes
politicians_merged_df = pd.merge(pol_data_revid_df, politicians_df, on='article_title', how='left')
politicians_merged_df

Unnamed: 0,article_title,revision_id,url,country
0,Majah Ha Adrif,1233202991,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
1,Haroon al-Afghani,1230459615,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
2,Tayyab Agha,1225661708,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan
3,Khadija Zahra Ahmadi,1234741562,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan
4,Aziza Ahmadyar,1195651393,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan
...,...,...,...,...
7150,Josiah Tongogara,1203429435,https://en.wikipedia.org/wiki/Josiah_Tongogara,Zimbabwe
7151,Langton Towungana,1246280093,https://en.wikipedia.org/wiki/Langton_Towungana,Zimbabwe
7152,Sengezo Tshabangu,1228478288,https://en.wikipedia.org/wiki/Sengezo_Tshabangu,Zimbabwe
7153,Herbert Ushewokunze,959111842,https://en.wikipedia.org/wiki/Herbert_Ushewokunze,Zimbabwe
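
The merged frame has more rows than `pol_data_revid_df` (the last index is 7153 rather than 7109), which is consistent with some article titles appearing more than once in `politicians_df`. A left merge repeats the left row once for every matching right row. The following is a minimal sketch of that effect and of one way to list the repeated titles, using made-up titles and countries.

```python
import pandas as pd

# Hypothetical data: title "A" is listed under two countries
revisions = pd.DataFrame({"article_title": ["A", "B"], "revision_id": [1, 2]})
politicians = pd.DataFrame({
    "article_title": ["A", "A", "B"],
    "country": ["Country X", "Country Y", "Country Z"],
})

merged = pd.merge(revisions, politicians, on="article_title", how="left")
print(len(revisions), len(merged))  # the merge grows from 2 rows to 3

# Titles that occur more than once on the right-hand side
repeated = politicians[politicians["article_title"].duplicated(keep=False)]
print(repeated)
```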


In [71]:
politicians_filtered_df = politicians_merged_df[politicians_merged_df['revision_id'].notna()]
politicians_merged_df = politicians_merged_df.drop_duplicates()
politicians_filtered_df

Unnamed: 0,article_title,revision_id,url,country
0,Majah Ha Adrif,1233202991,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
1,Haroon al-Afghani,1230459615,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
2,Tayyab Agha,1225661708,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan
3,Khadija Zahra Ahmadi,1234741562,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan
4,Aziza Ahmadyar,1195651393,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan
...,...,...,...,...
7150,Josiah Tongogara,1203429435,https://en.wikipedia.org/wiki/Josiah_Tongogara,Zimbabwe
7151,Langton Towungana,1246280093,https://en.wikipedia.org/wiki/Langton_Towungana,Zimbabwe
7152,Sengezo Tshabangu,1228478288,https://en.wikipedia.org/wiki/Sengezo_Tshabangu,Zimbabwe
7153,Herbert Ushewokunze,959111842,https://en.wikipedia.org/wiki/Herbert_Ushewokunze,Zimbabwe


In [72]:
# Merge politicians_merged_df with scores_articles_df to attach the quality prediction to each article
wikipedia_merged_df = pd.merge(politicians_merged_df, scores_articles_df, on='article_title', how='left')
wikipedia_merged_df

Unnamed: 0,article_title,revision_id,url,country,article_quality
0,Majah Ha Adrif,1233202991,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,Start
1,Haroon al-Afghani,1230459615,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,B
2,Tayyab Agha,1225661708,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,Start
3,Khadija Zahra Ahmadi,1234741562,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan,Stub
4,Aziza Ahmadyar,1195651393,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan,Start
...,...,...,...,...,...
7150,Josiah Tongogara,1203429435,https://en.wikipedia.org/wiki/Josiah_Tongogara,Zimbabwe,C
7151,Langton Towungana,1246280093,https://en.wikipedia.org/wiki/Langton_Towungana,Zimbabwe,Stub
7152,Sengezo Tshabangu,1228478288,https://en.wikipedia.org/wiki/Sengezo_Tshabangu,Zimbabwe,Start
7153,Herbert Ushewokunze,959111842,https://en.wikipedia.org/wiki/Herbert_Ushewokunze,Zimbabwe,Stub


In [73]:
wikipedia_merged_df.to_csv('wikipedia_merged.csv', index=False)

In [80]:
# Initialize the region column with None
population_df['region'] = None

current_region = None

for i, row in population_df.iterrows():
    # Check if the Geography entry is in uppercase (indicating a region)
    if row['country'].isupper():
        current_region = row['country']
    else:
        # Assign the current region to the country rows
        population_df.at[i, 'region'] = current_region
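
The same assignment can be expressed without an explicit loop. The following is a minimal sketch, assuming (as in the loop above) that region rows are exactly the ALL CAPS entries and that each country belongs to the closest preceding one. The toy frame contains only the `country` column.

```python
import pandas as pd

population_df = pd.DataFrame({
    "country": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt", "OCEANIA", "Samoa"],
})

# True for region rows (ALL CAPS), False for country rows
is_region = population_df["country"].str.isupper()

# Keep the name only on region rows, then carry it forward onto the rows below
nearest_region = population_df["country"].where(is_region).ffill()

# Region rows themselves get no region, matching the loop-based version
population_df["region"] = nearest_region.mask(is_region)
print(population_df)
```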


In [79]:
population_df

Unnamed: 0,country,population,region
0,WORLD,8009.0,
1,AFRICA,1453.0,
2,NORTHERN AFRICA,256.0,
3,Algeria,46.8,NORTHERN AFRICA
4,Egypt,105.2,NORTHERN AFRICA
...,...,...,...
228,Samoa,0.2,OCEANIA
229,Solomon Islands,0.8,OCEANIA
230,Tonga,0.1,OCEANIA
231,Tuvalu,0.0,OCEANIA


In [82]:
# Merge the Wikipedia article data with the population data; the outer join keeps unmatched rows from both sides
final_merged_df = pd.merge(wikipedia_merged_df, population_df, on='country', how='outer')
final_merged_df.drop('url', axis=1, inplace=True)
final_merged_df

Unnamed: 0,article_title,revision_id,country,article_quality,population,region
0,Majah Ha Adrif,1.233203e+09,Afghanistan,Start,42.4,SOUTH ASIA
1,Haroon al-Afghani,1.230460e+09,Afghanistan,B,42.4,SOUTH ASIA
2,Tayyab Agha,1.225662e+09,Afghanistan,Start,42.4,SOUTH ASIA
3,Khadija Zahra Ahmadi,1.234742e+09,Afghanistan,Stub,42.4,SOUTH ASIA
4,Aziza Ahmadyar,1.195651e+09,Afghanistan,Start,42.4,SOUTH ASIA
...,...,...,...,...,...,...
7217,,,Kiribati,,0.1,OCEANIA
7218,,,Nauru,,0.0,OCEANIA
7219,,,New Caledonia,,0.3,OCEANIA
7220,,,New Zealand,,5.2,OCEANIA


In [96]:
final_merged_df.to_csv('final_merged_df.csv', index=False)

In [99]:
# Step 1: Identify countries with null population values
no_population_match = final_merged_df[final_merged_df['population'].isnull()]['country'].unique()
print("No Match Countries in politicians dataset, for which exact match wasn't found in Population dataset, but all three are available in slightly different format as explained below:")
print(no_population_match)

# Step 2: Identify countries with blank article titles
no_article_match = final_merged_df[final_merged_df['article_title'].isnull()]['country'].unique()
#print(no_article_match)

no_article_match_new = []
for country in no_article_match:
    if not country.isupper():
        no_article_match_new.append(country)
print("\nNo Match Countries where there are no politicians are:")
print(no_article_match_new)
        

# Step 3: Output the list of non-matching countries to a text file
with open('wp_countries-no_match.txt', 'w') as file:
    for country in no_article_match_new:
        file.write(f"{country}\n")

final_merged_df_2 = final_merged_df[~final_merged_df['country'].isin(no_article_match)]

final_merged_df_2.to_csv('wp_politicians_by_country.csv', index=False)
final_merged_df_2

No Match Countries in politicians dataset, for which exact match wasn't found in Population dataset, but all three are available in slightly different format as explained below:
['Guinea-Bissau' 'Korean' 'Korea, South']

No Match Countries where there are no politicians are:
['Western Sahara', 'GuineaBissau', 'Mauritius', 'Mayotte', 'Reunion', 'Sao Tome and Principe', 'eSwatini', 'Canada', 'United States', 'Mexico', 'Curacao', 'Dominica', 'Guadeloupe', 'Jamaica', 'Martinique', 'Puerto Rico', 'French Guiana', 'Suriname', 'Georgia', 'Brunei', 'Philippines', 'China (Hong Kong SAR)', 'China (Macao SAR)', 'Korea (North)', 'Korea (South)', 'Denmark', 'Iceland', 'Ireland', 'United Kingdom', 'Liechtenstein', 'Netherlands', 'Romania', 'Andorra', 'San Marino', 'Australia', 'Fiji', 'French Polynesia', 'Guam', 'Kiribati', 'Nauru', 'New Caledonia', 'New Zealand', 'Palau']


Unnamed: 0,article_title,revision_id,country,article_quality,population,region
0,Majah Ha Adrif,1.233203e+09,Afghanistan,Start,42.4,SOUTH ASIA
1,Haroon al-Afghani,1.230460e+09,Afghanistan,B,42.4,SOUTH ASIA
2,Tayyab Agha,1.225662e+09,Afghanistan,Start,42.4,SOUTH ASIA
3,Khadija Zahra Ahmadi,1.234742e+09,Afghanistan,Stub,42.4,SOUTH ASIA
4,Aziza Ahmadyar,1.195651e+09,Afghanistan,Start,42.4,SOUTH ASIA
...,...,...,...,...,...,...
7150,Josiah Tongogara,1.203429e+09,Zimbabwe,C,16.7,EASTERN AFRICA
7151,Langton Towungana,1.246280e+09,Zimbabwe,Stub,16.7,EASTERN AFRICA
7152,Sengezo Tshabangu,1.228478e+09,Zimbabwe,Start,16.7,EASTERN AFRICA
7153,Herbert Ushewokunze,9.591118e+08,Zimbabwe,Stub,16.7,EASTERN AFRICA


The file and output above list the country names that have no direct match between the articles file and the population file's country/Geography column. This list overstates the real mismatches, because the same country is sometimes spelled differently in the two files. To a machine the names are different, but a human reader can see that they refer to the same country, such as "GuineaBissau" in one file and "Guinea-Bissau" in the other, or "Korea, South" in one and "Korea (South)" in the other. There are other similar cases. These could be reconciled manually, but we will not do so as part of this assignment.
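
If the names were to be reconciled, one simple approach is a small manual mapping applied before the merge. The following is a minimal sketch of that idea and was not applied in this analysis. `NAME_FIXES` is a hypothetical mapping containing only the two pairs named above, and any other pairs would need to be checked by hand.

```python
import pandas as pd

# Hypothetical mapping from the population file's spelling to the articles file's spelling
NAME_FIXES = {
    "GuineaBissau": "Guinea-Bissau",
    "Korea (South)": "Korea, South",
}

# Toy slice of the population table
population = pd.DataFrame({"country": ["GuineaBissau", "Korea (South)", "Algeria"]})

# Series.replace with a dict swaps exact matches and leaves other values untouched
population["country"] = population["country"].replace(NAME_FIXES)
print(population["country"].tolist())
```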

## Step 4 & 5: Analysis & Results
Our analysis will consist of calculating total-articles-per-capita (a ratio representing the number of articles per person) and high-quality-articles-per-capita (a ratio representing the number of high quality articles per person) on a country-by-country and regional basis.

In this analysis a country can only exist in one region. The population_by_country_AUG.2024.csv actually represents regions in a hierarchical order. For our analysis we always put a country in the closest (lowest in the hierarchy) region.

For this analysis we consider "high quality" articles to be articles that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.

Also, we need to keep in mind that the population_by_country_AUG.2024.csv file provides population in millions. The calculated proportions in this step are likely to be very small numbers.
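
As a worked example of the unit conversion, the following sketch uses Afghanistan's figures as they appear in this notebook (85 articles, population 42.4 million). The variable names are illustrative.

```python
total_articles = 85         # articles about Afghan politicians
population_millions = 42.4  # the population file reports millions

# Convert millions to persons before dividing
articles_per_capita = total_articles / (population_millions * 1_000_000)
print(f"{articles_per_capita:.3e}")  # 2.005e-06
```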


Our results from this analysis will be produced in the form of data tables. We will produce six tables in total, showing:

1.	Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).
   
2.	Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order).
	
3.	Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order).
	
4.	Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).
	
5.	Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.
	
6.	Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.


### (a) Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).

For each country, the total number of articles per capita is calculated as the number of articles divided by the population.

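Note: there are 6 countries whose population is listed as 0 (zero) in the population file, which reports population in millions. We leave these countries out of our analysis, since a per-capita ratio cannot be computed with a zero population.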

Liechtenstein	0

Monaco	0

San Marino	0

Nauru	0

Palau	0

Tuvalu	0
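
An alternative to listing these countries by name is to filter on the population value itself. The following is a minimal sketch on a toy frame. The article counts for Monaco and Tuvalu are placeholders, not values from the dataset.

```python
import pandas as pd

country_data = pd.DataFrame({
    "country": ["Albania", "Monaco", "Tuvalu"],
    "total_articles": [70, 1, 1],    # Monaco and Tuvalu counts are placeholders
    "population": [2.7, 0.0, 0.0],   # millions
})

# Keep only rows where a per-capita ratio is defined
nonzero = country_data[country_data["population"] > 0].copy()
nonzero["articles_per_capita"] = nonzero["total_articles"] / (nonzero["population"] * 1_000_000)
print(nonzero)
```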

In [109]:
# Group by country and count the number of articles per country
country_article_count = final_merged_df_2.groupby('country').size().reset_index(name='total_articles')
print(country_article_count)

# Merge with the original DataFrame to get population
country_data = pd.merge(country_article_count, final_merged_df_2[['country', 'population']].drop_duplicates(), on='country')
print(country_data)

exclude_Countries = ['Liechtenstein','Monaco','San Marino','Nauru','Palau','Tuvalu']

country_data = country_data[~country_data['country'].isin(exclude_Countries)]

# Calculate total articles per capita (per person; population is given in millions)
country_data['articles_per_capita'] = country_data['total_articles'] / (country_data['population']*1000000)

top_10_coverage = country_data.sort_values(by='articles_per_capita', ascending=False).head(10)
print("Top 10 countries by coverage")
print(top_10_coverage[['country', 'articles_per_capita']])


                 country  total_articles
0            Afghanistan              85
1                Albania              70
2                Algeria              71
3                 Angola              58
4    Antigua and Barbuda              33
..                   ...             ...
164            Venezuela              56
165              Vietnam              36
166                Yemen              32
167               Zambia               3
168             Zimbabwe              69

[169 rows x 2 columns]
                 country  total_articles  population
0            Afghanistan              85        42.4
1                Albania              70         2.7
2                Algeria              71        46.8
3                 Angola              58        36.7
4    Antigua and Barbuda              33         0.1
..                   ...             ...         ...
164            Venezuela              56        28.8
165              Vietnam              36        98.9
166    

### (b) Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order)

In [111]:
bottom_10_coverage = country_data.sort_values(by='articles_per_capita', ascending=True).head(10)
print("Bottom 10 countries by coverage")
print(bottom_10_coverage[['country', 'articles_per_capita']])

Bottom 10 countries by coverage
           country  articles_per_capita
31           China         1.133707e-08
67           India         1.056979e-07
57           Ghana         1.173021e-07
125   Saudi Arabia         1.355014e-07
167         Zambia         1.485149e-07
111         Norway         1.818182e-07
71          Israel         2.040816e-07
45           Egypt         3.041825e-07
37   Cote d'Ivoire         3.236246e-07
50        Ethiopia         3.478261e-07


### (c) Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order)

First, filter for high-quality articles (FA or GA), then count them per country, and calculate the ratio based on the population.

In [118]:
# Filter for high-quality articles (FA or GA)
high_quality_articles = final_merged_df_2[final_merged_df_2['article_quality'].isin(['FA', 'GA'])]

# Group by country and count high-quality articles
high_quality_article_count = high_quality_articles.groupby('country').size().reset_index(name='high_quality_articles')

# Merge with the population data
country_data = pd.merge(country_data, high_quality_article_count, on='country', how='left')

exclude_Countries = ['Liechtenstein','Monaco','San Marino','Nauru','Palau','Tuvalu']
country_data = country_data[~country_data['country'].isin(exclude_Countries)]

# Replace NaN (countries with no high-quality articles) with 0
country_data['high_quality_articles'] = country_data['high_quality_articles'].fillna(0)

# Calculate high-quality articles per capita (per person)
country_data['high_quality_articles_per_capita'] = country_data['high_quality_articles'] / (country_data['population']*1000000)

top_10_high_quality = country_data.sort_values(by='high_quality_articles_per_capita', ascending=False).head(10)
print("Top 10 Countries with high quality articles")
print(top_10_high_quality[['country','high_quality_articles_per_capita']])

Top 10 Countries with high quality articles
                   country  high_quality_articles_per_capita
100             Montenegro                      5.000000e-06
89              Luxembourg                      2.857143e-06
1                  Albania                      2.592593e-06
79                  Kosovo                      2.352941e-06
93                Maldives                      1.666667e-06
88               Lithuania                      1.379310e-06
38                 Croatia                      1.315789e-06
63                  Guyana                      1.250000e-06
113  Palestinian Territory                      1.090909e-06
131               Slovenia                      9.523810e-07


### (d) Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order)

In [119]:
bottom_10_high_quality = country_data.sort_values(by='high_quality_articles_per_capita', ascending=True).head(10)
print("Bottom 10 Countries with high quality articles")
print(bottom_10_high_quality[['country','high_quality_articles_per_capita']])

Bottom 10 Countries with high quality articles
                 country  high_quality_articles_per_capita
166             Zimbabwe                               0.0
34                 Congo                               0.0
80                Kuwait                               0.0
139            St. Lucia                               0.0
37         Cote d'Ivoire                               0.0
138  St. Kitts and Nevis                               0.0
132      Solomon Islands                               0.0
40                Cyprus                               0.0
129            Singapore                               0.0
42              Djibouti                               0.0
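
All ten rows above share the value 0.0, so which ten countries are shown depends on how the sort orders ties. The following is a minimal sketch of listing every country tied at zero, using a toy frame built from four values printed in this notebook.

```python
import pandas as pd

country_data = pd.DataFrame({
    "country": ["Zimbabwe", "Kuwait", "Cyprus", "Albania"],
    "high_quality_articles_per_capita": [0.0, 0.0, 0.0, 2.592593e-06],
})

# Every country with no high-quality articles, in alphabetical order
zero_hq = country_data.loc[
    country_data["high_quality_articles_per_capita"] == 0, "country"
].sort_values()

print(len(zero_hq))
print(zero_hq.tolist())
```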


### (e) Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

Group the data by region and aggregate the total articles and population.

In [121]:
# Group by region and aggregate total articles and population
region_data = final_merged_df_2.groupby('region').agg(
    total_articles=('article_title', 'size'),
    population=('population', 'sum')
).reset_index()

# Calculate articles per capita (per person; population is given in millions)
region_data['articles_per_capita'] = region_data['total_articles'] / (region_data['population']*1000000)

regions_by_total_coverage = region_data.sort_values(by='articles_per_capita', ascending=False)
print("Geographic regions by total coverage (descending)")
print(regions_by_total_coverage)


Geographic regions by total coverage (descending)
             region  total_articles  population  articles_per_capita
9           OCEANIA              72       111.1         6.480648e-07
8   NORTHERN EUROPE             191      1162.1         1.643576e-07
0         CARIBBEAN             219      1414.9         1.547813e-07
1   CENTRAL AMERICA             188      1418.8         1.325063e-07
2      CENTRAL ASIA             106      1983.6         5.343819e-08
16     WESTERN ASIA             610     13369.5         4.562624e-08
14  SOUTHERN EUROPE             797     17956.6         4.438479e-08
4    EASTERN AFRICA             665     23941.2         2.777639e-08
17   WESTERN EUROPE             498     18969.9         2.625212e-08
7   NORTHERN AFRICA             302     12173.9         2.480717e-08
5    EASTERN EUROPE             709     29044.9         2.441048e-08
6     MIDDLE AFRICA             231      9524.9         2.425222e-08
13  SOUTHERN AFRICA             123      5954.3      
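
Because the merged table has one row per article, summing `population` within a region adds a country's population once for every article it has, which is why the `population` column above is far larger than the regions' actual populations. The following is a minimal sketch of counting each country's population once per region, using toy rows with made-up article titles.

```python
import pandas as pd

# Toy article-level table: population repeats on every article row
df = pd.DataFrame({
    "article_title": ["a1", "a2", "a3", "b1"],
    "country": ["Algeria", "Algeria", "Algeria", "Egypt"],
    "region": ["NORTHERN AFRICA"] * 4,
    "population": [46.8, 46.8, 46.8, 105.2],  # millions
})

# Articles per region: one row per article, so size() is the article count
region_articles = df.groupby("region").size().rename("total_articles")

# Population per region: drop repeated country rows before summing
region_population = (
    df[["region", "country", "population"]]
    .drop_duplicates()
    .groupby("region")["population"]
    .sum()
)

region_data = pd.concat([region_articles, region_population], axis=1).reset_index()
region_data["articles_per_capita"] = (
    region_data["total_articles"] / (region_data["population"] * 1_000_000)
)
print(region_data)
```

In this toy case the row-level sum would be 245.6 million, while counting each country once gives 152.0 million.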

### (f) Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

We perform the same grouping and aggregation as above, restricted to high-quality articles.

In [123]:
# Group the high-quality articles (FA or GA) by region
high_quality_region_data = high_quality_articles.groupby('region').agg(
    high_quality_articles=('article_title', 'size'),
    population=('population', 'sum')
).reset_index()

# Calculate high-quality articles per capita (per person; population is given in millions)
high_quality_region_data['high_quality_articles_per_capita'] = high_quality_region_data['high_quality_articles'] / (high_quality_region_data['population']*1000000)
regions_by_high_quality_coverage = high_quality_region_data.sort_values(by='high_quality_articles_per_capita', ascending=False)

print("Geographic regions by high quality coverage (descending)")
print(regions_by_high_quality_coverage)

Geographic regions by high quality coverage (descending)
             region  high_quality_articles  population  \
8   NORTHERN EUROPE                      9        45.7   
1   CENTRAL AMERICA                     10        92.3   
9           OCEANIA                      1         9.5   
0         CARIBBEAN                      9        90.0   
2      CENTRAL ASIA                      5        76.7   
16     WESTERN ASIA                     27       554.1   
14  SOUTHERN EUROPE                     53      1117.5   
7   NORTHERN AFRICA                     17       571.8   
4    EASTERN AFRICA                     17       634.6   
17   WESTERN EUROPE                     21       838.0   
6     MIDDLE AFRICA                      8       342.8   
5    EASTERN EUROPE                     38      1994.6   
13  SOUTHERN AFRICA                      8       485.6   
10    SOUTH AMERICA                     19      1569.1   
3         EAST ASIA                      3       252.5   
11       SOUTH 