Anushna Prakash  
DATA 512 - Human-Centered Data Science  
October 14, 2021  
# <center> A2 - Bias in Data </center>
The goal of this assignment is to explore the concept of bias using data on Wikipedia articles, specifically articles about political figures from a variety of countries. The analysis combines a dataset of Wikipedia articles, a dataset of country populations, and a machine learning service called ORES, which estimates the quality of each article, to show how well politicians are covered on Wikipedia by country. This notebook will output:  
- the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.  
- the countries with the highest and lowest proportion of high quality articles about politicians.  
- a ranking of geographic regions by articles-per-person and proportion of high quality articles.  

## Step 0: Set Up Notebook

In [1]:
# Optional: Make notebook width wider
from IPython.display import display, HTML
display(HTML("<style>.container { width:55% !important; }</style>"))

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import json
import requests
import math

## Step 1: Import data

See the `README.md` for the original data sources, which were downloaded into the `data_raw` folder. Here I load the dataset of Wikipedia articles about politicians and a dataset of the populations of the world's countries and regions.

In [3]:
# See README.md for where data was downloaded from originally.
# Import from data_raw folder assuming we are running from the src folder.
page_data = pd.read_csv('../data_raw/country/data/page_data.csv')
population = pd.read_csv('../data_raw/WPDS_2020_data.csv')

The Wikipedia page dataset has a page name, the country that the page is associated with, and a `rev_id` that uniquely identifies the revision of each page.

In [4]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [5]:
page_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47197 entries, 0 to 47196
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   page     47197 non-null  object
 1   country  47197 non-null  object
 2   rev_id   47197 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.1+ MB


In [6]:
page_data.isna().sum()

page       0
country    0
rev_id     0
dtype: int64

In the population data set, the `Name` column lists countries in normal case and the regions they belong to in uppercase. One value in the `FIPS` column is missing, but this column is not needed for the analysis.

In [7]:
population.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


In [8]:
population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   FIPS        233 non-null    object 
 1   Name        234 non-null    object 
 2   Type        234 non-null    object 
 3   TimeFrame   234 non-null    int64  
 4   Data (M)    234 non-null    float64
 5   Population  234 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 11.1+ KB


In [9]:
population.isna().sum()

FIPS          1
Name          0
Type          0
TimeFrame     0
Data (M)      0
Population    0
dtype: int64

In [10]:
population.loc[population['FIPS'].isna(),]

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
62,,Namibia,Country,2019,2.541,2541000


## Step 2: Cleaning the Data  
Both `page_data.csv` and `WPDS_2020_data.csv` contain some rows that I need to filter out or ignore when combining the datasets in the next step.  

In the case of `page_data.csv`, the dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles, and will not be included in the analysis.  
Similarly, `WPDS_2020_data.csv` contains some rows that provide cumulative regional population counts rather than country-level counts. These rows have ALL CAPS values in the `Name` field (e.g. AFRICA, OCEANIA). I first keep just the country rows so they can be matched to the page data set, and later use the regional hierarchy to analyze the page data by world region. One row that is technically a sub-region rather than a country, the Channel Islands, is still included in the output: its name is not in ALL CAPS and it has no countries underneath it, so it is retained.

In [11]:
# Remove page names that begin with 'Template:' since these are not wikipedia articles
page_data = page_data.loc[~page_data['page'].str.startswith('Template:'), ]

In [12]:
page_data.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


In [13]:
original_population = population.copy()
population = population.loc[~population['Name'].str.isupper(), ]
population.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 3 to 233
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   FIPS        209 non-null    object 
 1   Name        210 non-null    object 
 2   Type        210 non-null    object 
 3   TimeFrame   210 non-null    int64  
 4   Data (M)    210 non-null    float64
 5   Population  210 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 11.5+ KB


In [14]:
population.loc[population['Type'] != 'Country',]

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
168,Channel Islands,Channel Islands,Sub-Region,2019,0.172,172000


In [15]:
# Save just the list of countries assuming we are in the src folder
population.to_csv('../data_clean/WPDS_countries_only.csv', index = False)

## Step 3: Getting Article Quality Predictions

Now I retrieve the predicted quality scores for each article in the Wikipedia dataset using the ORES API. See the `README.md` for more information about what this API is and how it is used. The API will output for each `rev_id` the article quality estimates, which are:  
- FA - Featured article  
- GA - Good article  
- B - B-class article  
- C - C-class article  
- Start - Start-class article  
- Stub - Stub-class article  

Note that there is also an `ores` package that can be pip installed, but I ran into significant issues trying to install it, so the API is used directly here. Also, be aware that one API call can handle multiple `rev_id`s, but only up to 50. For this reason, the dataset is divided into batches of at most 50 articles each. Each batch is sent to the API and the `json` responses are saved in the `data_raw/api_dump` folder, so that if you are re-running this analysis you don't have to call the API each time. The API calls take a few minutes to run in total.
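
As an illustration of the batching idea, here is a minimal sketch that slices a list of revision ids into consecutive chunks of at most 50 and joins each chunk with `|`, the separator used for the `revids` parameter in this notebook. The helper name `make_batches` and the example ids are hypothetical; the notebook itself assigns batches with a modulo on the row number instead.

```python
def make_batches(rev_ids, max_size=50):
    """Split rev_ids into consecutive chunks of at most max_size items."""
    return [rev_ids[i:i + max_size] for i in range(0, len(rev_ids), max_size)]

# Hypothetical revision ids, used only to illustrate the batching.
example_ids = list(range(1000, 1120))  # 120 ids
batches = make_batches(example_ids)

# Each batch becomes one pipe-separated string for the `revids` query parameter.
revid_strings = ['|'.join(str(r) for r in batch) for batch in batches]

print([len(b) for b in batches])  # [50, 50, 20]
```

Slicing keeps neighbouring rows in the same batch, while the modulo approach spreads them across batches; either way no request exceeds the limit.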

In [16]:
def api_call(endpoint,parameters):
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()
    
    return response

# Fill with your own information if reproducing
headers = {
    'User-Agent': 'https://github.com/anushnap',
    'From': 'anushnap@uw.edu'
}

# Use the scores endpoint which allows multiple revids to be sent up at once. This endpoint is set up
# so that only the articlequality model is returned on English wikipedia.
endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={revid}'

# Make a copy of the original data frame so it is not mutated in place.
df = page_data.copy()

# Divide by 49 rather than 50 when computing the number of batches so that every batch stays under the
# 50-id limit and the API call does not fail.
n_batches = math.ceil(len(df) / 49)
df['row_num'] = np.arange(len(df))
df['batch_num'] = df['row_num'] % n_batches

Uncomment and run these cells if you are re-running and re-downloading the predictions from the API. Otherwise, skip this block and download the already-saved data from .json files in the `data_raw/api_dump` folder.

In [17]:
# for n in range(n_batches):
#     id_str = '|'.join(df.loc[df['batch_num'] == n, 'rev_id'].astype(str))
#     params = {
#         "revid" : id_str
#     }
    
#     call = api_call(endpoint, params)
#     filename = 'ores_scores_enwiki_articlequality_batchnum-' + str(n) + '.json'
    
#     with open('../data_raw/api_dump/' + filename, 'w', encoding='utf-8') as f:
#         json.dump(call, f, ensure_ascii = False, indent=4)

We read the saved `json` files back in. If a prediction exists for a `rev_id`, it is saved into the copied data frame in a column called `prediction`; otherwise `np.nan` is stored. Since there are several hundred saved `json` files in the folder, this block takes ~2 minutes to run.
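
The lookup described above can also be sketched with plain dictionaries. The response below is hypothetical: it only mirrors the keys this notebook reads (`['enwiki']['scores'][rev_id]['articlequality']['score']['prediction']`), and the shape of the failing entry is an assumption made for illustration. `extract_predictions` is a made-up helper name.

```python
import math

# Hypothetical response; only the key path read by the notebook is taken from the source.
response = {
    'enwiki': {
        'scores': {
            '101': {'articlequality': {'score': {'prediction': 'Stub'}}},
            '102': {'articlequality': {'error': {'message': 'assumed error payload'}}},
            '103': {'articlequality': {'score': {'prediction': 'GA'}}},
        }
    }
}

def extract_predictions(scores):
    """Map integer rev_id -> predicted class, or NaN when no score is present."""
    predictions = {}
    for rev_id, models in scores.items():
        score = models.get('articlequality', {}).get('score', {})
        predictions[int(rev_id)] = score.get('prediction', math.nan)
    return predictions

predictions = extract_predictions(response['enwiki']['scores'])
print(predictions)
```

A dictionary like this could then be attached in one pass with `df['rev_id'].map(predictions)`, which avoids filtering the data frame once per revision.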

In [18]:
# Read the data back in and get the prediction if it exists
# If the prediction does not exist, there will be nothing in the json file after ['articlequality']['score'].
for n in range(n_batches):
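    filename = '../data_raw/api_dump/ores_scores_enwiki_articlequality_batchnum-' + str(n) + '.json'
    # Use a context manager so each file handle is closed after it is read
    with open(filename, encoding='utf-8') as f:
        temp = json.load(f)['enwiki']['scores']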
    
    for i in temp:
        int_id = int(i)
#         print(int_id)
        try:
            prediction = temp[i]['articlequality']['score']['prediction']
        except KeyError:
            prediction = np.nan
        finally:
            df.loc[(df['rev_id'] == int_id), 'prediction'] = prediction

In [19]:
df.isna().sum()

page            0
country         0
rev_id          0
row_num         0
batch_num       0
prediction    276
dtype: int64

Get the articles for which the API did not return a prediction and save them in `data_clean`.

In [20]:
# Save articles for which no prediction was returned assuming we are running from the src folder
df.loc[df['prediction'].isna()].to_csv('../data_clean/articles_missing_prediction.csv', index = False)

## Step 4: Combining the Datasets

Here we merge the Wikipedia page data, including the article quality predictions, with the population data set. Both have a field containing country names, which is used as the merge key.
Countries that appear in the population data set but have no articles, and articles whose country has no match in the population data set, are removed and saved separately into a CSV file called: `wp_wpds_countries-no_match.csv`.  
The remaining data is consolidated into a single CSV file called: `wp_wpds_politicians_by_country.csv`.  

The schema for that file is below:  

| Column |
|--------|
| country      |
| article_name      |
| revision_id      |
| article_quality_est.      |
| population      |


Note: `revision_id` here is the same thing as `rev_id`.
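
One way to see which rows fall on each side of the merge is the `indicator` option of `DataFrame.merge`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The sketch below uses toy tables with illustrative populations; it is not the approach taken in this notebook, which checks for nulls in the two country columns instead.

```python
import pandas as pd

# Toy stand-ins for the article and population tables (values are illustrative only).
articles = pd.DataFrame({
    'country': ['Chad', 'Chad', 'Czech Republic'],
    'page': ['Article A', 'Article B', 'Article C'],
})
countries = pd.DataFrame({
    'Name': ['Chad', 'Czechia'],
    'Population': [1000, 2000],
})

merged = articles.merge(countries, how='outer', left_on='country',
                        right_on='Name', indicator=True)

# Rows found in both tables versus rows found in only one of them.
matched = merged.loc[merged['_merge'] == 'both'].drop(columns='_merge')
no_match = merged.loc[merged['_merge'] != 'both'].drop(columns='_merge')

print(len(matched), len(no_match))  # 2 2
```

Here "Czech Republic" and "Czechia" fail to match because the two sources spell the country differently, which is the kind of mismatch listed in the no-match output.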

In [22]:
# Outer-join all pages (including those with missing predictions) to the population data
full_results = df.merge(population, how = 'outer', left_on = 'country', right_on = 'Name')

# Find data that is missing in either table and save assuming we are running from src folder
missing_results = full_results.loc[(full_results['country'].isnull() | full_results['Name'].isnull())]
missing_results.to_csv('../data_clean/wp_wpds_countries-no_match.csv', index = False)

In [23]:
print(missing_results['country'].unique())
print(missing_results['Name'].unique())

['Czech Republic' 'Salvadoran' 'Rhodesian' 'Congo, Dem. Rep. of'
 'East Timorese' 'Faroese' 'Cape Colony' 'South Korean' 'Samoan'
 'Montserratian' 'Pitcairn Islands' 'Saint Kitts and Nevis' 'Macedonia'
 'Abkhazia' 'Niuean' 'Ivorian' 'Carniolan' 'Saint Lucian'
 'South African Republic' 'Hondura' 'Incan' 'Chechen' 'Jersey' 'Guernsey'
 'Saint Vincent and the Grenadines' 'South Ossetian' 'Cook Island' 'Omani'
 'Tokelauan' 'Swaziland' 'Dagestani' 'Greenlandic' 'Ossetian' 'Palauan'
 'Somaliland' 'Rojava' nan]
[nan 'Western Sahara' "Cote d'Ivoire" 'Mayotte' 'Reunion'
 'Congo, Dem. Rep.' 'eSwatini' 'El Salvador' 'Honduras' 'Curacao'
 'Puerto Rico' 'St. Kitts-Nevis' 'Saint Lucia'
 'St. Vincent and the Grenadines' 'Georgia' 'Oman' 'Brunei' 'Timor-Leste'
 'China, Hong Kong SAR' 'China, Macao SAR' 'Channel Islands' 'Czechia'
 'North Macedonia' 'French Polynesia' 'Guam' 'New Caledonia' 'Palau'
 'Samoa']


In [24]:
# Join data together, but only non-missing data and only data that have a prediction
# Rename the column names for clarity
results = df.loc[~df['prediction'].isna()].merge(population, how = 'inner', left_on = 'country', right_on = 'Name')\
    [['country', 'page', 'rev_id', 'prediction', 'Population']]\
    .rename(
        {'page': 'article_name', 'rev_id': 'revision_id', 'prediction': 'article_quality_est.', 'Population': 'population'}, 
        axis = 1)

# Write results to a table assuming we are in the src folder
results.to_csv('../data_clean/wp_wpds_politicians_by_country.csv', index = False)

## Step 5: Analysis

We calculate two proportions for each country AND for each geographic region: high-quality articles per person, and high-quality articles as a share of all articles. "High quality" articles are ones that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) class. The values are stored as proportions; multiply by 100 to read them as percentages.  
Examples:  
- if a country has a population of 10,000 people and 10 FA or GA class articles about politicians from that country, then the percentage of articles-per-population would be 0.1% (stored as 0.001).  
- if a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20% (stored as 0.2).
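
As a small worked check of these definitions, the sketch below computes both metrics for a hypothetical country ("Exampleland") with 10,000 people and 10 articles, 2 of which are FA or GA. It uses named aggregation rather than the two-step grouping in this notebook, so it should be read as an illustration under those assumptions, not as the notebook's method.

```python
import pandas as pd

# Hypothetical country: 10 articles, 2 of them FA/GA, population 10,000.
toy = pd.DataFrame({
    'country': ['Exampleland'] * 10,
    'population': [10_000] * 10,
    'article_quality_est.': ['FA', 'GA'] + ['Stub'] * 8,
})
toy['high_quality'] = toy['article_quality_est.'].isin(['FA', 'GA'])

# One row per country: total article count and count of high-quality articles.
summary = toy.groupby(['country', 'population'], as_index=False).agg(
    total_articles=('high_quality', 'count'),
    num_high_quality_articles=('high_quality', 'sum'),
)
summary['percentage_of_articles'] = summary['num_high_quality_articles'] / summary['total_articles']
summary['percentage_of_pop'] = summary['num_high_quality_articles'] / summary['population']
print(summary)
```

With these numbers the share of high-quality articles is 0.2 (20%) and the per-person figure is 0.0002 (0.02%). Aggregating this way also keeps a row for a country whose high-quality count is zero.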

In [25]:
# High quality articles are ones that are classified as FA or GA
results['high_quality'] = results['article_quality_est.'].isin(['FA', 'GA'])

# Create a new data frame that has the aggregated data
country_stats = results.groupby(['country', 'population', 'high_quality'], dropna = False, as_index = False)\
    .agg({'revision_id': 'count'})

In [26]:
# Create the new columns required for the analysis
# The total number of articles as total_ids
country_stats['total_ids'] = country_stats.groupby(['country', 'population'])['revision_id'].transform('sum')
# The percentage of articles that are high quality as percentage_of_articles
country_stats['percentage_of_articles'] = country_stats['revision_id'] / country_stats['total_ids']
# The number of high quality articles per population as percentage_of_pop
country_stats['percentage_of_pop'] = country_stats['revision_id'] / country_stats['population']

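# New data frame that contains the analysis and calculations. Rename the columns for clarity.
# Note: only the high_quality == True rows are kept, so countries with no FA/GA articles drop out here.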
articles_by_country = country_stats.loc[country_stats['high_quality'] == True].drop(['high_quality'], axis = 1)\
    .rename({'revision_id': 'num_high_quality_articles', 'total_ids': 'total_articles'}, axis = 1)
articles_by_country

Unnamed: 0,country,population,num_high_quality_articles,total_articles,percentage_of_articles,percentage_of_pop
1,Afghanistan,38928000,13,319,0.040752,3.339499e-07
3,Albania,2838000,3,456,0.006579,1.057082e-06
5,Algeria,44357000,2,116,0.017241,4.508871e-08
10,Argentina,45377000,16,491,0.032587,3.526015e-07
12,Armenia,2956000,5,193,0.025907,1.691475e-06
...,...,...,...,...,...,...
319,Vanuatu,321000,3,58,0.051724,9.345794e-06
321,Venezuela,28645000,3,130,0.023077,1.047303e-07
323,Vietnam,96209000,13,187,0.069519,1.351225e-07
325,Yemen,29826000,3,116,0.025862,1.005834e-07


Get a mapping of each country to its sub-region and major region. For example, Northern Africa is a sub-region to which Algeria belongs, but Northern Africa is also in Africa, which is its own major region/continent. We want to retain both of these regional classifications for the final output. This only works because the `population` data set is ordered so that each sub-region appears in ALL CAPS and the countries listed directly below it belong to that sub-region. **IF THE FILE CHANGES AND THIS NO LONGER HOLDS, THEN THE BELOW CODE WILL BE USELESS**.
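
The same "carry the last ALL CAPS name downward" idea can be sketched with `Series.where` and `ffill`. The table below is a toy in the same layout as the population file, and the `major_regions` list is an assumption for this example only; the notebook's own cell uses a running maximum of row indices instead.

```python
import pandas as pd

# Toy table laid out like the population file: an ALL CAPS region row
# followed by the rows that belong to it.
toy_pop = pd.DataFrame({'Name': ['WORLD', 'AFRICA', 'NORTHERN AFRICA', 'Algeria', 'Egypt',
                                 'WESTERN AFRICA', 'Benin', 'OCEANIA', 'Samoa']})

is_caps = toy_pop['Name'].str.isupper()

# Keep the name only on ALL CAPS rows, then carry it forward to the rows beneath.
toy_pop['sub_region'] = toy_pop['Name'].where(is_caps).ffill()

# Assumed list of top-level regions for this toy example.
major_regions = ['AFRICA', 'OCEANIA']
toy_pop['region'] = toy_pop['Name'].where(toy_pop['Name'].isin(major_regions)).ffill()

# Only the country rows are needed in the final mapping.
region_map = toy_pop.loc[~is_caps].reset_index(drop=True)
print(region_map)
```

As with the notebook's version, this depends entirely on row order, so it breaks in the same way if the file layout changes.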

In [27]:
# Find the sub-regions indices which are indicated by being in ALL CAPS
original_population['is_sub'] = original_population['Name'].str.isupper() * original_population.index
sub_i = original_population['is_sub'].cummax()

# Create a new column that retains this lowest-level subregional category.
original_population['Sub-Region_0'] = original_population.loc[sub_i, 'Name'].values
original_population['Sub-Region_0'].unique()

# I have manually listed what the highest-level regions are below. This may need to be changed if there are geopolitical
# changes to the way that countries are categorized by these major regions and the population data set changes.
regions_1 = ['AFRICA', 'NORTHERN AMERICA', 'LATIN AMERICA AND THE CARIBBEAN', 'ASIA', 'EUROPE', 'OCEANIA']
original_population['is_region'] = original_population['Name'].isin(regions_1) * original_population.index
region_i = original_population['is_region'].cummax()

# Create a new column that retains this highest-level regional category.
original_population['Sub-Region_1'] = original_population.loc[region_i, 'Name'].values
region_map = original_population.loc[(original_population['is_sub'] == 0) & (original_population['Name'] != 'WORLD'), ['Name', 'Sub-Region_0', 'Sub-Region_1']]
region_map

Unnamed: 0,Name,Sub-Region_0,Sub-Region_1
3,Algeria,NORTHERN AFRICA,AFRICA
4,Egypt,NORTHERN AFRICA,AFRICA
5,Libya,NORTHERN AFRICA,AFRICA
6,Morocco,NORTHERN AFRICA,AFRICA
7,Sudan,NORTHERN AFRICA,AFRICA
...,...,...,...
229,Samoa,OCEANIA,OCEANIA
230,Solomon Islands,OCEANIA,OCEANIA
231,Tonga,OCEANIA,OCEANIA
232,Tuvalu,OCEANIA,OCEANIA


The regional map is joined to the country-level aggregates from above so that the analysis can be rolled up two levels, into sub-regions and major regions.

In [28]:
# Join the region map to the articles_by_country data frame to get the 2 levels of subregions each country belongs to.
# Rename columns for clarity.
articles_by_country = articles_by_country.merge(region_map, how = 'inner', left_on = 'country', right_on = 'Name').drop(['Name'], axis = 1)\
    .merge(original_population[['Name', 'Population']], how = 'inner', left_on = 'Sub-Region_0', right_on = 'Name')\
    .merge(original_population[['Name', 'Population']], how = 'inner', left_on = 'Sub-Region_1', right_on = 'Name')\
    .drop(['Name_x', 'Name_y'], axis = 1)\
    .rename({'Population_x': 'pop_Sub-Region_0', 'Population_y': 'pop_Sub-Region_1'}, axis = 1)
articles_by_country

Unnamed: 0,country,population,num_high_quality_articles,total_articles,percentage_of_articles,percentage_of_pop,Sub-Region_0,Sub-Region_1,pop_Sub-Region_0,pop_Sub-Region_1
0,Afghanistan,38928000,13,319,0.040752,3.339499e-07,SOUTH ASIA,ASIA,1967131000,4625927000
1,Bangladesh,169809000,3,317,0.009464,1.766691e-08,SOUTH ASIA,ASIA,1967131000,4625927000
2,Bhutan,730000,2,33,0.060606,2.739726e-06,SOUTH ASIA,ASIA,1967131000,4625927000
3,India,1400100000,13,968,0.013430,9.285051e-09,SOUTH ASIA,ASIA,1967131000,4625927000
4,Iran,84150000,13,810,0.016049,1.544860e-07,SOUTH ASIA,ASIA,1967131000,4625927000
...,...,...,...,...,...,...,...,...,...,...
141,Papua New Guinea,8950000,4,160,0.025000,4.469274e-07,OCEANIA,OCEANIA,43155000,43155000
142,Tuvalu,10000,4,54,0.074074,4.000000e-04,OCEANIA,OCEANIA,43155000,43155000
143,Vanuatu,321000,3,58,0.051724,9.345794e-06,OCEANIA,OCEANIA,43155000,43155000
144,Canada,38190000,24,839,0.028605,6.284368e-07,NORTHERN AMERICA,NORTHERN AMERICA,368193000,368193000


In [29]:
# Roll the analysis up one level, into the lowest-level sub-region
articles_by_subregion = articles_by_country.groupby(['Sub-Region_0', 'pop_Sub-Region_0'], as_index = False)\
    .sum()[['Sub-Region_0', 'pop_Sub-Region_0', 'num_high_quality_articles', 'total_articles']]
# Recalculate the percentage of articles that are high quality
articles_by_subregion['percentage_of_articles'] = articles_by_subregion['num_high_quality_articles'] / articles_by_subregion['total_articles']
# Recalculate the number of high quality articles as percentage of the population
articles_by_subregion['percentage_of_pop'] = articles_by_subregion['num_high_quality_articles'] / articles_by_subregion['pop_Sub-Region_0']
articles_by_subregion

Unnamed: 0,Sub-Region_0,pop_Sub-Region_0,num_high_quality_articles,total_articles,percentage_of_articles,percentage_of_pop
0,CARIBBEAN,43233000,13,552,0.023551,3.006962e-07
1,CENTRAL AMERICA,178611000,23,1380,0.016667,1.287715e-07
2,CENTRAL ASIA,74961000,7,135,0.051852,9.338189e-08
3,EAST ASIA,1641063000,76,2473,0.030732,4.631145e-08
4,EASTERN AFRICA,444970000,35,2294,0.015257,7.865699e-08
5,EASTERN EUROPE,291902000,118,3311,0.035639,4.042453e-07
6,MIDDLE AFRICA,179757000,16,538,0.02974,8.900905e-08
7,NORTHERN AFRICA,244344000,19,761,0.024967,7.775922e-08
8,NORTHERN AMERICA,368193000,104,1901,0.054708,2.824606e-07
9,NORTHERN EUROPE,105990000,102,3046,0.033487,9.623549e-07


In [30]:
# Roll the analysis up into the highest-level region and repeat the calculations from the code block above.
articles_by_region = articles_by_country.groupby(['Sub-Region_1', 'pop_Sub-Region_1'], as_index = False)\
    .sum()[['Sub-Region_1', 'pop_Sub-Region_1', 'num_high_quality_articles', 'total_articles']]
articles_by_region['percentage_of_articles'] = articles_by_region['num_high_quality_articles'] / articles_by_region['total_articles']
articles_by_region['percentage_of_pop'] = articles_by_region['num_high_quality_articles'] / articles_by_region['pop_Sub-Region_1']
articles_by_region

Unnamed: 0,Sub-Region_1,pop_Sub-Region_1,num_high_quality_articles,total_articles,percentage_of_articles,percentage_of_pop
0,AFRICA,1337918000,119,6139,0.019384,8.894417e-08
1,ASIA,4625927000,316,11515,0.027442,6.831063e-08
2,EUROPE,746622000,350,14444,0.024232,4.68778e-07
3,LATIN AMERICA AND THE CARIBBEAN,651036000,76,4917,0.015457,1.16737e-07
4,NORTHERN AMERICA,368193000,104,1901,0.054708,2.824606e-07
5,OCEANIA,43155000,63,2811,0.022412,1.459854e-06


Northern America and Oceania are currently the only sub-regions that do not have a higher-level region, so they appear in both data frames. The two data frames are unioned and the duplicate rows removed below.

In [31]:
# Union the datasets together and remove duplicates
articles_by_all_regions = pd.concat(
    [articles_by_region.rename({'Sub-Region_1': 'Region', 'pop_Sub-Region_1': 'population'}, axis = 1),
     articles_by_subregion.rename({'Sub-Region_0': 'Region', 'pop_Sub-Region_0': 'population'}, axis = 1)])\
    .drop_duplicates()
articles_by_all_regions

Unnamed: 0,Region,population,num_high_quality_articles,total_articles,percentage_of_articles,percentage_of_pop
0,AFRICA,1337918000,119,6139,0.019384,8.894417e-08
1,ASIA,4625927000,316,11515,0.027442,6.831063e-08
2,EUROPE,746622000,350,14444,0.024232,4.68778e-07
3,LATIN AMERICA AND THE CARIBBEAN,651036000,76,4917,0.015457,1.16737e-07
4,NORTHERN AMERICA,368193000,104,1901,0.054708,2.824606e-07
5,OCEANIA,43155000,63,2811,0.022412,1.459854e-06
0,CARIBBEAN,43233000,13,552,0.023551,3.006962e-07
1,CENTRAL AMERICA,178611000,23,1380,0.016667,1.287715e-07
2,CENTRAL ASIA,74961000,7,135,0.051852,9.338189e-08
3,EAST ASIA,1641063000,76,2473,0.030732,4.631145e-08


## Step 6: Results

The tables below are displayed in this notebook and saved into the `results` folder.
- Top 10 countries by coverage: 10 highest-ranked countries in terms of number of high-quality politician articles as a proportion of country population `top_10_countries_by_coverage.csv`  
- Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of high-quality politician articles as a proportion of country population `bottom_10_countries_by_coverage.csv`   
- Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality `top_10_countries_by_quality.csv`  
- Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality `bottom_10_countries_by_quality.csv`  

The top/bottom 10 results are ranked in order of decreasing (for the top 10) or increasing (for the bottom 10) metric. For example, the country with the highest proportion of high-quality articles comes first in the top 10 by quality table, and the country with the lowest proportion comes first in the bottom 10 by quality table. The schema for the output tables is below:  

| Column |
|--------|
| country      |
| population      |
| percentage_of_articles      |
| percentage_of_pop      |


- Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the count of high-quality politician articles from countries in each region as a proportion of total regional population `regions_by_coverage.csv`  
- Geographic regions by quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality `regions_by_quality.csv`  

The results for each region are ranked in decreasing order of the metric in question. For example, the region with the overall highest proportion of high-quality articles will be listed first in the `regions_by_quality` table and corresponding .csv. The schema for these tables is below:  

| Column |
|--------|
| Region      |
| population      |
| percentage_of_articles      |
| percentage_of_pop      |
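
For reference, the top-N and bottom-N selections described above can also be written with `DataFrame.nlargest` and `DataFrame.nsmallest`, which return rows already ordered by the chosen metric. The values below are illustrative only and are not taken from the results tables; the notebook itself uses `sort_values` followed by `head`.

```python
import pandas as pd

# Illustrative values only; not taken from the results tables.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'percentage_of_articles': [0.10, 0.02, 0.07, 0.01],
    'percentage_of_pop': [3e-6, 1e-8, 5e-7, 2e-6],
})

# Highest values first for the "top" table, lowest values first for the "bottom" table.
top2_quality = toy.nlargest(2, 'percentage_of_articles')
bottom2_quality = toy.nsmallest(2, 'percentage_of_articles')

print(top2_quality['country'].tolist())     # ['A', 'C']
print(bottom2_quality['country'].tolist())  # ['D', 'B']
```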



In [32]:
# Top 10 countries by coverage
top10_country_coverage = articles_by_country[['country', 'population', 'percentage_of_articles', 'percentage_of_pop']]\
    .sort_values(by = ['percentage_of_pop'], ascending = False)\
    .head(10)
# Bottom 10 countries by coverage
bottom10_country_coverage = articles_by_country[['country', 'population', 'percentage_of_articles', 'percentage_of_pop']]\
    .sort_values(by = ['percentage_of_pop'], ascending = True)\
    .head(10)
# Top 10 countries by relative quality
top10_country_quality = articles_by_country[['country', 'population', 'percentage_of_articles', 'percentage_of_pop']]\
    .sort_values(by = ['percentage_of_articles'], ascending = False)\
    .head(10)
# Bottom 10 countries by relative quality
bottom10_country_quality = articles_by_country[['country', 'population', 'percentage_of_articles', 'percentage_of_pop']]\
    .sort_values(by = ['percentage_of_articles'], ascending = True)\
    .head(10)

# Save csvs assuming we are in the src folder
top10_country_coverage.to_csv('../results/top_10_countries_by_coverage.csv', index = False)
bottom10_country_coverage.to_csv('../results/bottom_10_countries_by_coverage.csv', index = False)
top10_country_quality.to_csv('../results/top_10_countries_by_quality.csv', index = False)
bottom10_country_quality.to_csv('../results/bottom_10_countries_by_quality.csv', index = False)

display(top10_country_coverage)
display(bottom10_country_coverage)
display(top10_country_quality)
display(bottom10_country_quality)

Unnamed: 0,country,population,percentage_of_articles,percentage_of_pop
142,Tuvalu,10000,0.074074,0.0004
128,Dominica,72000,0.083333,1.4e-05
143,Vanuatu,321000,0.051724,9e-06
70,Iceland,368000,0.00995,5e-06
71,Ireland,5003000,0.067024,5e-06
49,Montenegro,622000,0.027778,3e-06
132,Martinique,356000,0.029412,3e-06
2,Bhutan,730000,0.060606,3e-06
140,New Zealand,4987000,0.016603,3e-06
65,Romania,19241000,0.122449,2e-06


Unnamed: 0,country,population,percentage_of_articles,percentage_of_pop
3,India,1400100000,0.01343,9.285051e-09
92,Nigeria,206140000,0.002959,9.702144e-09
107,Tanzania,59734000,0.002475,1.674088e-08
99,Ethiopia,114916000,0.019802,1.740402e-08
1,Bangladesh,169809000,0.009464,1.766691e-08
120,Colombia,49444000,0.003509,2.02249e-08
108,Uganda,45741000,0.005405,2.186222e-08
80,Morocco,35952000,0.004854,2.781486e-08
118,Brazil,211812000,0.011009,2.832701e-08
33,China,1402385000,0.03543,2.852284e-08


Unnamed: 0,country,population,percentage_of_articles,percentage_of_pop
35,"Korea, North",25779000,0.222222,3.103301e-07
19,Saudi Arabia,35041000,0.128205,4.2807e-07
65,Romania,19241000,0.122449,2.182839e-06
111,Central African Republic,4830000,0.121212,1.656315e-06
41,Uzbekistan,34174000,0.107143,8.778604e-08
90,Mauritania,4650000,0.104167,1.075269e-06
134,Guatemala,18066000,0.084337,3.874682e-07
128,Dominica,72000,0.083333,1.388889e-05
20,Syria,19398000,0.078125,5.155171e-07
82,Benin,12209000,0.076923,5.733475e-07


Unnamed: 0,country,population,percentage_of_articles,percentage_of_pop
55,Belgium,11515000,0.001927,8.684325e-08
107,Tanzania,59734000,0.002475,1.674088e-08
60,Switzerland,8634000,0.002488,1.158212e-07
6,Nepal,29996000,0.002809,3.333778e-08
123,Peru,32824000,0.002857,3.046551e-08
92,Nigeria,206140000,0.002959,9.702144e-09
50,Portugal,10255000,0.003145,9.751341e-08
120,Colombia,49444000,0.003509,2.02249e-08
73,Lithuania,2794000,0.004098,3.579098e-07
80,Morocco,35952000,0.004854,2.781486e-08


In [33]:
# Geographic regions ranked by coverage by pop
regions_by_coverage = articles_by_all_regions[['Region', 'population', 'percentage_of_articles', 'percentage_of_pop']]\
    .sort_values(by = ['percentage_of_pop'], ascending = False)

# Geographic regions ranked by proportion of high-quality articles
regions_by_quality = articles_by_all_regions[['Region', 'population', 'percentage_of_articles', 'percentage_of_pop']]\
    .sort_values(by = ['percentage_of_articles'], ascending = False)

# Save csvs assuming we are in the src folder
regions_by_coverage.to_csv('../results/regions_by_coverage.csv', index = False)
regions_by_quality.to_csv('../results/regions_by_quality.csv', index = False)

display(regions_by_coverage)
display(regions_by_quality)

Unnamed: 0,Region,population,percentage_of_articles,percentage_of_pop
5,OCEANIA,43155000,0.022412,1.459854e-06
9,NORTHERN EUROPE,105990000,0.033487,9.623549e-07
15,SOUTHERN EUROPE,153251000,0.020584,4.82868e-07
2,EUROPE,746622000,0.024232,4.68778e-07
5,EASTERN EUROPE,291902000,0.035639,4.042453e-07
17,WESTERN ASIA,280927000,0.035303,3.168083e-07
0,CARIBBEAN,43233000,0.023551,3.006962e-07
18,WESTERN EUROPE,195479000,0.012467,2.864758e-07
4,NORTHERN AMERICA,368193000,0.054708,2.824606e-07
14,SOUTHERN AFRICA,67732000,0.020316,1.328766e-07


Unnamed: 0,Region,population,percentage_of_articles,percentage_of_pop
4,NORTHERN AMERICA,368193000,0.054708,2.824606e-07
2,CENTRAL ASIA,74961000,0.051852,9.338189e-08
13,SOUTHEAST ASIA,661845000,0.036139,1.102977e-07
5,EASTERN EUROPE,291902000,0.035639,4.042453e-07
17,WESTERN ASIA,280927000,0.035303,3.168083e-07
9,NORTHERN EUROPE,105990000,0.033487,9.623549e-07
3,EAST ASIA,1641063000,0.030732,4.631145e-08
6,MIDDLE AFRICA,179757000,0.02974,8.900905e-08
1,ASIA,4625927000,0.027442,6.831063e-08
7,NORTHERN AFRICA,244344000,0.024967,7.775922e-08
