## Considering Bias in Data
Class: DATA 512 <br>
Author: Fang Yu Lim(Fiona) <br>
Date: 10/11/024 <br>


This project is separated into two modules. The first module "DATA512_HW2_DataRetrieval.ipynb" retrieves the article and population data, gets their quality prediction and combines the wikipedia data and population data. 
We obtained 2 sets of datasets from making API calls:
1. Information about politicians articles (List of politicians is specified by) [politicians_by_country_AUG.2024.csv](../input_data/politicians_by_country_AUG.2024.csv)
2. The quality of the articles using the reviews ids obtained from the previous API call. 

In [614]:

import json, time, urllib.parse
import pandas as pd
#
import requests

### Data Retrieval: Getting the Article and Population Data. 
We need the information about the politcians from Wikipedia articles, the country population, and the quality of the articles. 
The [politicians_by_country.AUG.2024.csv](../input_data/politicians_by_country_AUG.2024.csv) was created by crawling the Wikipedia Category: Politicians by nationality to generate a list of Wikipedia article pages about politicians from a wide range of countries. 
The [population_by_country_AUG.2024.csv](../input_data/population_by_country_AUG.2024.csv) wass downloaded from the world population data sheet published by the Population Reference Bureau. 


The following code for making API calls are based off of the code example developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program.
- [wp_ores_liftwing_example.ipynb](../reference_notebooks/wp_ores_liftwing_example.ipynb) This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - August 15, 2023
- [wp_page_info_example.ipynb](../reference_notebooks/wp_page_info_example.ipynb) 
This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024

In the next two blocks, we declare constants and the function to make an API call to obtain the Wikipedia page information. 

In [615]:
# CONSTANTS 
# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<flim89@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is just a list of English Wikipedia article titles that we can use for example requests
# ARTICLE_TITLES = ""
# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


In [616]:
# FUNCTION
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

#### Wikipedia articles: Politicians by nationality

In [617]:
politicians_list = pd.read_csv("../input_data/politicians_by_country_AUG.2024.csv")
politicians_by_country = politicians_list["name"]

Here, we make the API call for all of the politcians in our list politicians_by_country and save it as a dictionary where the key is the article title(politcian name) and the value is the page information. In addition to the information obtained from the API call, I added a "Country" key as I will need this piece of information later, to join all my data. 
I added a step where it appends the article titles where I was unable to get information about and made it easier to retry those attempts if the reason was due to internet problems. (Instead of iterating through the full list which takes a very long time. )

In [618]:
politicians_articles = {}
failed_list =[]
for articles in politicians_by_country:
    while attempt < retries:
        retries = 3  
        attempt = 0  
        info = None
        try:
            info = request_pageinfo_per_article(articles)
            break
        except Exception as e:
            attempt += 1
    if info is not None:
        original_format = info["query"]["pages"]
        for key, value in original_format.items():
            politicians_articles[articles] = value
            politicians_articles[articles]["pages"] = key
        if articles in politicians_list["name"].values:
            country = politicians_list.loc[politicians_list["name"] == articles, "country"].values[0]
            politicians_articles[articles]["Country"] = country
    else:
        failed_list.append(value)

I didn't want to risk loosing my data that took a very long time to retrieve, so I am saving the results of my API calls into a intermediary_files folder. 

In [None]:
with open("../intermediary_files/polticians_articles.json", "w") as json_file:
    json.dump(politicians_articles, json_file, indent=4)


After running len(politicians_by_country) and len(polticians_articles.keys()), I realized that I was missing entries, but my failed_list was empty. The following code checks if a politician represented different countries. 


In [619]:
politicians_by_country_list = list(politicians_by_country)

In [794]:
temp1Counter = []
duplicate_list = [] 
for name in politicians_by_country_list:
    if name in temp1Counter:
        duplicate_list.append(name)
    else:
        temp1Counter.append(name)

We can see here, that there are 44 politcians who represent more than one country. We save it in duplicate_list for now and we will come back to it later.

In [795]:
len(duplicate_list)

44

In the next block, I was doing a check to find if there were articles that did not have a revision id, indicating the number of entries I would expect to be missing after my API call for article quality. 

In [623]:
# Articles that do not have a review id. 
missing_keys = set(politicians_articles.keys()) - set(ARTICLE_REVISIONS.keys())
missing_keys

{'André Ngongang Ouandji',
 'Barbara Eibinger-Miedl',
 'Bashir Bililiqo',
 'Kyaw Myint',
 'Mehrali Gasimov',
 'Richard Sumah',
 "Segun ''Aeroland'' Adewale",
 'Tomás Pimentel'}

There are 7155 polticians in the original csv fille. However, there are duplicated 44 politcians with who worked in 2 countries. 
Therefore, they refer to the same article therefore, we will just make 7111 calls. Within these 7111, there are articles of 8 politcians that do not have a score, so we will only have 7103 articles with scores (if all other articles work fine). 
(missing_keys)

#### Populations by country

Here, I'm only reading in the population_by_country csv file, we will come back to it later when we need to join our data.

In [624]:
population_by_country = pd.read_csv("../input_data/population_by_country_AUG.2024.csv")

### Data Retrieval: Getting Article Quality Predictions

In the following blocks, I am declaring the constants and functions needed to make the ORES API call. 
The parameter which the function takes is the review id, therefore I iterate through my politicians_articles variable and save the article title and the key for ARTICLE_REVISIONS, and the value is the revew id.


In [625]:
#    CONSTANTS

#    The current LiftWing ORES API endpoint and prediction model
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "flim89@uw.edu",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
# lastrevid
# https://docs.python.org/3/tutorial/datastructures.html#dict-comprehensions
list_no_revision_id = []
for key, value in politicians_articles.items():
    if "lastrevid" in value:
        ARTICLE_REVISIONS[key] = value["lastrevid"]
    else:
        list_no_revision_id.append(key)

Here I am using the code and module provided in the [wp_ores_liftwing_example.ipynb](../reference_notebooks/wp_ores_liftwing_example.ipynb) to store and access my token.
Please reference it for more information.

In [628]:
from apikeys.KeyManager import KeyManager
keyman = KeyManager()

#
#   This is my Wikipedia/Wikimedia username. They suggest you request your keys using your Wikipedia username, so I
#   also stored the API key using my Wikipedia username.
#
#   You should probably use your own username here.
USERNAME = "LastMinuteStudent"
# you will need to create your record before running this
# keyman.createRecord("name", "domain","token","description")

key_info = keyman.findRecord(USERNAME,API_ORES_LIFTWING_ENDPOINT)
ACCESS_TOKEN = key_info[0]['key']

In [629]:
# FUNCTIONS 
def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


Since there are a total of 7103 API calls I need to make, to prevent needing to run through the whole list again, I split my dictionary into 4 parts and run it separately and combine. Thus, if there was failed attempt in part of the dictionary where my Exception did not properly catch, I do not need to wait for as long of a time to rerun. 
I maintain a list of articles which did not have a review, and articles which somehow failed and I will rerun those articles that somehow failed and make the call and see what was the error and add it to my dictionary. 

In [630]:
# My function kept crashing because it took so long, so I'm going to separate my dictionary into 4 parts and call it 4 times. 
def split_dict_into_parts(input_dict, num_parts):
    items = list(input_dict.items())
    
    chunk_size = len(items) // num_parts
    remainder = len(items) % num_parts
    
    split_dicts = []
    start = 0
    for i in range(num_parts):
        end = start + chunk_size + (1 if i < remainder else 0)
        split_dicts.append(dict(items[start:end]))
        start = end
    
    return split_dicts

split_dicts = split_dict_into_parts(ARTICLE_REVISIONS, 4)



In [632]:
article_revision_dict0 = {}
original0 = {}
revid_no_info0 = []
error_list0 = []
for key, values in split_dicts[0].items():
    retries = 3  
    delay = 5    
    attempt = 0  
    score = None

    while attempt < retries:
        try:
            score = request_ores_score_per_article(
                article_revid=values,
                email_address="flim89@uw.edu",
                access_token=ACCESS_TOKEN
            )
            break 
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            attempt += 1   

    if score is None or "error" in score:
        revid_no_info0.append(key)
    else:
        try:
            original0 = score['enwiki']['scores']
            for article_id, details in original0.items():
                article_revision_dict0[article_id] = details['articlequality']['score']["prediction"]
        except Exception as e: 
            print(f"Exception is: {e} in revid {value}")
            error_list0.append(value)


I check the lengths of my variables after each call to observe if there were exceptions, and dumped them into json files since I wanted to make sure I would have a copy. 

In [791]:
print(len(revid_no_info0))
print(len(error_list0))
print(len(article_revision_dict0))

0
0
1776


In [639]:
with open("../intermediary_files/revision_p1.json", "w") as json_file:
    json.dump(article_revision_dict0, json_file, indent=4)

In [640]:
article_revision_dict1 = {}
original1 = {}
revid_no_info1 = []
error_list1 = []
for key, values in split_dicts[1].items():
    retries = 3  
    delay = 5    
    attempt = 0  
    score = None
    print(key)
    while attempt < retries:
        try:
            score = request_ores_score_per_article(
                article_revid=values,
                email_address="flim89@uw.edu",
                access_token=ACCESS_TOKEN
            )
            break 
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            attempt += 1   

    if score is None or "error" in score:
        revid_no_info1.append(key)
    else:
        try:
            original1 = score['enwiki']['scores']
            for article_id, details in original1.items():
                article_revision_dict1[article_id] = details['articlequality']['score']["prediction"]
        except Exception as e: 
            print(f"Exception is: {e} in revid {value}")
            error_list1.append(value)


Savo Zlatić
Ante Županović
Prime Ministers of Cuba
President of Cuba
Eduardo Agramonte Piña
Ignacio Agramonte
Francisco Agüero Velasco
Alberto López Díaz
Sebastian Arcos Bergnes
Gustavo Arcos
Juan de Ayala y Escobar
Miguel Brugueras
Leopoldo Cancio
Demetrio Castillo Duany
Vicente Manuel de Céspedes
Manuel Cidre
Rosendo Collazo
Concepción Campa Huergo
José María Coppinger
Miguel Díaz-Canel
Manuel Dorta-Duque
Carlos Fernández Gondín
Orestes Ferrara
Porfirio Franca
Francisco de Arango y Parreño
Edith García Buchaca
Eumelín González Sánchez
Aurelio Hevia
Fernando Heydrich
Alfredo Hornedo
José Irisarri
Perfecto Lacoste
José Francisco Lemus
Manuel Marrero Cruz
Rubén Martínez Puente
Augusto Martínez Sánchez
Bartolomé Masó
Ramón Meza y Suárez Inclán
Rafael Montoro
Bartolomé Morales
José Núñez de Cáceres
Juan Vitalio Acuña Núñez
Frank País
Pedro Esteban González-Larrinaga
Guillermo Portela
Juan Nepomuceno de Quesada
Luis Alberto Rodríguez López-Calleja
Francisco Sánchez Betancourt
Salvador José

In [661]:
with open("../intermediary_files/revision_p2.json", "w") as json_file:
    json.dump(article_revision_dict1, json_file, indent=4)

In [792]:
print(len(revid_no_info1))
print(len(error_list1))
print(len(article_revision_dict1))

1
1
1774


In [647]:
article_revision_dict2 = {}
original2 = {}
revid_no_info2 = []
error_list2 = []
for key, values in split_dicts[2].items():
    retries = 3  
    delay = 5    
    attempt = 0  
    score = None

    while attempt < retries:
        try:
            score = request_ores_score_per_article(
                article_revid=values,
                email_address="flim89@uw.edu",
                access_token=ACCESS_TOKEN
            )
            break 
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            attempt += 1   

    if score is None or "error" in score:
        revid_no_info2.append(key)
    else:
        try:
            original2 = score['enwiki']['scores']
            for article_id, details in original2.items():
                article_revision_dict2[article_id] = details['articlequality']['score']["prediction"]
        except Exception as e: 
            print(f"Exception is: {e} in revid {value}")
            error_list2.append(value)

Exception is: 'enwiki' in revid {'pageid': 3255571, 'ns': 0, 'title': 'Denis Walker', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'touched': '2024-10-05T14:24:34Z', 'lastrevid': 1247902630, 'length': 10247, 'talkid': 3338681, 'fullurl': 'https://en.wikipedia.org/wiki/Denis_Walker', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Denis_Walker&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Denis_Walker', 'pages': '3255571', 'Country': 'Zimbabwe'}


In [793]:
print(len(revid_no_info2))
print(len(error_list2))
print(len(article_revision_dict2))

0
1
1775


In [662]:
with open("../intermediary_files/revision_p3.json", "w") as json_file:
    json.dump(article_revision_dict2, json_file, indent=4)

In [657]:
original3 = {}
article_revision_dict3 = {}
revid_no_info3 = []
error_list3 = []
for key, values in split_dicts[3].items():
    retries = 3  
    delay = 5    
    attempt = 0  
    score = None

    while attempt < retries:
        try:
            score = request_ores_score_per_article(
                article_revid=values,
                email_address="flim89@uw.edu",
                access_token=ACCESS_TOKEN
            )
            break 
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            attempt += 1   

    if score is None or "error" in score:
        revid_no_info3.append(key)
    else:
        try:
            original3 = score['enwiki']['scores']
            for article_id, details in original3.items():
                article_revision_dict3[article_id] = details['articlequality']['score']["prediction"]
        except Exception as e3: 
            print(f"Exception is: {e3} in revid {value}")
            error_list3.append(value)

HTTPSConnectionPool(host='api.wikimedia.org', port=443): Max retries exceeded with url: /service/lw/inference/v1/models/enwiki-articlequality:predict (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x000002034C316150>: Failed to resolve 'api.wikimedia.org' ([Errno 11001] getaddrinfo failed)"))
HTTPSConnectionPool(host='api.wikimedia.org', port=443): Max retries exceeded with url: /service/lw/inference/v1/models/enwiki-articlequality:predict (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x000002034C319310>: Failed to resolve 'api.wikimedia.org' ([Errno 11001] getaddrinfo failed)"))


In [664]:
print(len(revid_no_info3))
print(len(error_list3))
print(len(article_revision_dict3))

2
0
1773


In [663]:
with open("../intermediary_files/revision_p4.json", "w") as json_file:
    json.dump(article_revision_dict3, json_file, indent=4)

In the next block, I combined my list of articles where I was not able to obtain a score for article quality and output to [wp_countries-no_match.txt](../output/wp_countries-no_match.txt)

In [823]:
combined_no_info = revid_no_info0 + revid_no_info1 + revid_no_info2 + revid_no_info3
combined_error_list = error_list0 + error_list1 + error_list2 + error_list3
error_list = []
for item in combined_error_list:
    all_articles = item["title"]
    if all_articles not in error_list:
        error_list.append(all_articles) 
all_articles_no_result = combined_no_info + error_list
with open('../output/wp_countries-no_match.txt', 'w') as file:
    # Write each item on a separate line
    for item in all_articles_no_result:
        file.write(f"{item}\n")

After obtaining all the dictionaries, I combine them into combined_article_revision and saved the results in a json file

In [669]:
combined_article_revision = article_revision_dict0 | article_revision_dict1 | article_revision_dict2 |article_revision_dict3

In [683]:
with open("../intermediary_files/article_revisions.json", "w") as json_file:
    json.dump(combined_article_revision, json_file, indent=4)

In [672]:
len(combined_article_revision)

7098

In the following sections, I convert my data into pandas dataframe with the headers I want in my final csv file

We start with the revision quality dataframe:

In [677]:
revision_quality_df = pd.DataFrame(combined_article_revision.items(), columns=["revision_id", "article_quality"])
revision_quality_df

Unnamed: 0,revision_id,article_quality
0,1136614520,Stub
1,1244421894,GA
2,1221605713,Start
3,1233202991,Start
4,1230459615,B
...,...,...
7093,1203429435,C
7094,1246280093,Stub
7095,1228478288,Start
7096,959111842,Stub


In the population_by_country_AUG.2024.csv, the regions are specified with all caps, with all countries below that entry (until the next entry with all caps) belonging to that region. 
Therefore, I created the country_population_df with the columns [country, population, region] and distributed the countries to their corresponding region. 

In [678]:
country_population_df = pd.DataFrame(columns=["country", "population", "region"])

region = None
for index, row in population_by_country.iterrows():
    geography = row["Geography"]

    if geography.isupper():
        region = geography
        country_population_df = pd.concat([country_population_df, pd.DataFrame({"country":"", "population": [row["Population"]], "region": [region]})], ignore_index=True)
    else:
        country_population_df = pd.concat([country_population_df, pd.DataFrame({"country": [geography], "population": [row["Population"]], "region": [region]})], ignore_index=True)

  country_population_df = pd.concat([country_population_df, pd.DataFrame({"country":"", "population": [row["Population"]], "region": [region]})], ignore_index=True)


In [679]:
country_population_df

Unnamed: 0,country,population,region
0,,8009.0,WORLD
1,,1453.0,AFRICA
2,,256.0,NORTHERN AFRICA
3,Algeria,46.8,NORTHERN AFRICA
4,Egypt,105.2,NORTHERN AFRICA
...,...,...,...
228,Samoa,0.2,OCEANIA
229,Solomon Islands,0.8,OCEANIA
230,Tonga,0.1,OCEANIA
231,Tuvalu,0.0,OCEANIA


For the politcians_df, we only keep the entries where we were a revision_id was available. (We will not have been able to retrieve the score of the particular article without the revision_id.)

In [680]:
politicians_df = pd.DataFrame(columns=["article_title", "country", "revision_id"])
articles_no_rev_id = []
for key, values in politicians_articles.items():
    if "lastrevid" in values:
        data = {"article_title": values["title"], "country": values["Country"],"revision_id" : values["lastrevid"]}
        politicians_df = pd.concat([politicians_df, pd.DataFrame(data, index=[0])], ignore_index=True)
    else:
        articles_no_rev_id.append(key)

In [681]:
politicians_df

Unnamed: 0,article_title,country,revision_id
0,Majah Ha Adrif,Afghanistan,1233202991
1,Haroon al-Afghani,Afghanistan,1230459615
2,Tayyab Agha,Afghanistan,1225661708
3,Khadija Zahra Ahmadi,Afghanistan,1234741562
4,Aziza Ahmadyar,Afghanistan,1195651393
...,...,...,...
7098,Josiah Tongogara,Zimbabwe,1203429435
7099,Langton Towungana,Zimbabwe,1246280093
7100,Sengezo Tshabangu,Zimbabwe,1228478288
7101,Herbert Ushewokunze,Zimbabwe,959111842


Going back to the duplicate_list earlier, where politcians represented more than one country:  <br>
We extract their names as the key, and the value is the list of countries they represented.

In [759]:
politician_multi_dict = {}
for name in duplicate_list:
    countries = politicians_list.loc[politicians_list['name'] == name, 'country'].tolist()

    politician_multi_dict[name] = countries
    
print(politician_multi_dict)

{'Count Václav Antonín Chotek of Chotkov and Vojnín': ['Austria', 'Czechia'], 'Eduard Hedvicek': ['Austria', 'Czechia'], 'Leopold, Count von Thun und Hohenstein': ['Austria', 'Czechia'], 'Ibrahim Harun': ['Eritrea', 'Ethiopia'], 'José Francisco Barrundia': ['Guatemala', 'Honduras'], 'Manuel Carrascalão': ['Timor Leste', 'Indonesia'], 'Bak Jungyang': ['Japan', 'Korean'], 'Visar Ymeri': ['Albania', 'Kosovo'], 'Torokul Dzhanuzakov': ['Kazakhstan', 'Kyrgyzstan', 'Tajikistan', 'Uzbekistan'], 'Tadeusz Kościuszko': ['Belarus', 'Lithuania', 'Poland'], 'Venko Markovski': ['Bulgaria', 'North Macedonia'], 'Ashab Uddin Ahmad': ['Bangladesh', 'Pakistan'], 'Moinuddin Ahmed Chowdhury': ['Bangladesh', 'Pakistan'], 'Mohammad Toaha': ['Bangladesh', 'Pakistan'], 'Ali al-Qaradaghi': ['Iraq', 'Qatar'], 'Aleksandr Nikitin (politician, born 1987)': ['Moldova', 'Russia'], 'José Alejandro de Aycinena': ['Guatemala', 'El Salvador'], 'Shqiprim Arifi': ['Germany', 'Serbia'], 'Melko Čingrija': ['Croatia', 'Serbia'

Then I will need to combine combined_article_revision with country_population_df and renamed_politcians_list.
revision_quality_df ("revision_id", "article_quality")
country_population_df ("article_title, "country", "revision_id")
politicians_df ("article_title", "country")
I had to change the data type as they were not the same across the different dataframes

In [803]:
revision_quality_df_str = revision_quality_df.astype('string', copy=None, errors='raise')

In [724]:
politicians_df = politicians_df.astype('string', copy=None, errors='raise')

In [772]:
final_df = politicians_df_str.merge(revision_quality_df_str, how='left', on ='revision_id')

In [773]:
final_df = final_df.merge(country_population_df, how='left', on ='country')

In [774]:
final_df.shape[0]

7103

However, as mentioned previously, there are 44 politicians which ruled in different countries. 
We extract the entries in renamed_politcians_list where "article_title" is the same as the duplicate_list. 
We duplicate these rows and change the country to the value in duplicated_dict (for the countries not in the current dataframe)

In [824]:
for key, values in politician_multi_dict.items():
    matching_row = final_df.loc[final_df['article_title'] == key]

    if not matching_row.empty:
        no_countries = len(values)
        # Could remove this if since all politcians in this dictionary has more than 1 country associated.
        if no_countries > 1:
            existing_countries = matching_row['country'].tolist()
            missing_countries = [country for country in values if country not in existing_countries]
            
            for country in missing_countries:
                new_row = matching_row.copy()
                new_row['country'] = country
                
                final_df = pd.concat([final_df, new_row], ignore_index=True)
final_df.shape[0]

7147

Finally, we save it to our csv file

In [825]:
final_df.to_csv('../output/wp_politicians_by_country.csv', index=False)

Our error rate is the ratio of the number of articles I could not get a score divided by the total number of articles. <br>
I am using the length of the final_df as we added the duplicated entries for our 44 politcians

In [820]:
# Error rate
((len(politicians_by_country)-final_df.shape[0])/len(politicians_by_country)) * 100

0.11180992313067784