# A2 - Bias in Data Assignment
The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

You are expected to perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries. Your analysis will consist of a series of tables that show:
1. the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. the countries with the highest and lowest proportion of high quality articles about politicians.
3. a ranking of geographic regions by articles-per-person and proportion of high quality articles.

You are also expected to write a short reflection on the project that focuses on how both your findings from this analysis and the process you went through to reach those findings help you understand the causes and consequences of biased data in large, complex data science projects.


In [1]:
import pandas as pd
import numpy as np

## Data Acquisition

The first step is getting the data, which lives in several different places. The Wikipedia politicians by country dataset can be found on Figshare. Read through the documentation for this repository, then download and unzip it to extract the data file, which is called page_data.csv.

The population data is available in CSV format as WPDS_2020_data.csv. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.

#### Politicians by Country from the English-language Wikipedia

The data was extracted via the Wikimedia API using the associated code. It is formatted as a CSV and saved as page_data.csv in the "data" directory. Columns are:

1. "country", containing the sanitised country name, extracted from the category name;
2. "page", containing the unsanitised page title.
3. "rev_id" (listed as "last_edit" in the dataset description), containing the revision ID of the last edit to the page.

Data Source: https://figshare.com/articles/dataset/Untitled_Item/5513449

Keyes, Os (2017): Politicians by Country from the English-language Wikipedia. figshare. Dataset. https://doi.org/10.6084/m9.figshare.5513449.v6 

In [2]:
#importing wikipedia politicians pages and their countries
wiki_country_politician = pd.read_csv("data/page_data.csv")
wiki_country_politician.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


#### World Population Data Sheet

This dataset was extracted from the Population Reference Bureau. It contains population counts for the world, its regions, and individual countries for 2019.

Columns are: 
1. "FIPS", contains the Federal Information Processing Standards codes for place
2. "Name", contains the name of the place
3. "Type", contains the type of place: World, Sub-Region, Country
4. "TimeFrame", contains the year (2019)
5. "Data (M)", contains the population count in millions
6. "Population", contains the population count

About the data: https://www.prb.org/international/indicator/population/table/ 

Data Source:
https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit#gid=283125346


In [3]:
#importing the world population (2020) data sheet
world_population_2019 = pd.read_csv("data/WPDS_2020_data.csv")
world_population_2019.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


## Data Processing
Both page_data.csv and WPDS_2020_data.csv contain some rows that you will need to filter out and/or ignore when you combine the datasets in the next step. In the case of page_data.csv, the dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles, and should not be included in your analysis.

Similarly, WPDS_2020_data.csv contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the 'Name' field (e.g. AFRICA, OCEANIA). These rows won't match the country values in page_data.csv, but you will want to retain them (either in the original file, or a separate file) so that you can report coverage and quality by region in the analysis section.


In [4]:
wiki_country_politician[wiki_country_politician['page'].str.startswith('Template:')].index

Int64Index([    0,     2,     3,     4,     5,     6,     7,     8,     9,
               11,
            ...
            44296, 44580, 44581, 44603, 44657, 44916, 44966, 45587, 45823,
            46907],
           dtype='int64', length=496)

In [5]:
#select row indices of pages whose title starts with "Template:"
templates_rows = wiki_country_politician[wiki_country_politician['page'].str.startswith('Template:')].index

#removing pages that start with the string "Template:"
wiki_country_politician = wiki_country_politician.drop(index = templates_rows)

wiki_country_politician.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


In [6]:
#removing non-country types from population dataset

#boolean mask by index: True for non-country (region) rows, False for country rows
isupper = world_population_2019['Name'].str.isupper()

#get list of all regional population (not including countries)
regional_population = world_population_2019[isupper]

#get list of only country populations
country_population = world_population_2019[~isupper]

country_population.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000


## Getting Article Quality Predictions

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

These were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. These quality classes are a subset of the quality assessment categories developed by Wikipedia editors. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any rev_id you send it.

In order to get article predictions for each article in the Wikipedia dataset, you will first need to read page_data.csv into Python (or R), and then read through the dataset line by line, using the value of the rev_id column to make an API query.
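Querying one revision at a time is slow, so the cells further down send revision IDs in batches joined by `|`. The following is a minimal sketch of that batching step only, with no network call. The helper name `batch_rev_ids` and the constant `ENDPOINT` are illustrative assumptions, and the batch size of 50 mirrors the value used later in this notebook rather than a documented API limit.

```python
def batch_rev_ids(rev_ids, batch_size=50):
    """Yield '|'-joined strings holding at most batch_size revision ids each."""
    rev_ids = list(rev_ids)
    for start in range(0, len(rev_ids), batch_size):
        # the final slice may be shorter than batch_size
        yield "|".join(str(r) for r in rev_ids[start:start + batch_size])


# URL template in the same shape as the endpoint used in this notebook
ENDPOINT = (
    "https://ores.wikimedia.org/v3/scores/enwiki"
    "?models=articlequality&revids={rev_id}"
)

# a handful of revision ids taken from page_data.csv
sample_ids = [355319463, 393276188, 393822005, 395521877, 395526568]

batches = list(batch_rev_ids(sample_ids, batch_size=2))
urls = [ENDPOINT.format(rev_id=b) for b in batches]

for url in urls:
    print(url)
```

Each URL can then be passed to `requests.get` together with the identifying headers.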

ORES REST API - 
Documentation: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model

Whether you query the API or use the client, you will notice that ORES returns a prediction value that contains the name of one category, as well as probability values for each of the 6 quality categories. For this assignment, you only need to capture and use the value for prediction. 
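As a rough illustration of capturing only the prediction, the sketch below walks a response-shaped dictionary and returns one label per revision ID. The function name `extract_predictions` is hypothetical, and `sample_response` is hand-written: its nesting follows the output shown later in this notebook, while the contents of the `error` entry are an assumption about what a failed score looks like.

```python
def extract_predictions(response, context="enwiki", model="articlequality"):
    """Return {rev_id: predicted class, or 'error' when no score is present}."""
    results = {}
    for rev_id, models in response[context]["scores"].items():
        score = models.get(model, {}).get("score")
        results[int(rev_id)] = score["prediction"] if score else "error"
    return results


# hand-written sample; the probability dict is abbreviated and the
# 'error' payload is an assumed placeholder, not real API output
sample_response = {
    "enwiki": {
        "scores": {
            "355319463": {
                "articlequality": {
                    "score": {"prediction": "Stub", "probability": {"Stub": 0.95}}
                }
            },
            "516633096": {
                "articlequality": {"error": {"type": "placeholder error"}}
            },
        }
    }
}

predictions_by_rev = extract_predictions(sample_response)
print(predictions_by_rev)
```

Keying the result by revision ID keeps each prediction attached to its article, so the later merge does not depend on list order.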


In [10]:
#install the ores client
#!pip install -U pip
#!pip install ORES
#from ores import api

In [28]:
import json
import requests

In [29]:
#api endpoint for getting ores scores
endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={rev_id}'

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/aaliyahfiala42',
    'From': 'fialaa@uw.edu'
}

In [30]:
#function to query the ORES api for one batch of revision ids
def api_call(endpoint, rev_ids):
    #rev_ids: string of revision ids joined by '|'
    call = requests.get(endpoint.format(rev_id=rev_ids), headers=headers)
    response = call.json()
    return response


In [55]:
# Parameters for getting ORES model wiki page scores 
# see: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid
# single batch test
#wiki_scores = api_call(endpoint, '|'.join(str(x) for x in wiki_country_politician.rev_id.iloc[0:50]))
#wiki_scores

In [57]:
len(wiki_country_politician)

46700

In [52]:
bad_rev_ids = wiki_country_politician[wiki_country_politician['rev_id'] == 807484325].index
bad_rev_ids

wiki_country_politician = wiki_country_politician.drop(index = bad_rev_ids)


In [None]:
len(wiki_country_politician)

In [232]:
wiki_scores = []
temp_predictions = [] #store nested predictions temporarily 

rev_id = [] #variable to store rev_id
pred = [] #variable to store final predictions, including errors

batchsize = 50 #set the batch size
i = 1000
batch = wiki_country_politician.rev_id.iloc[i:i+batchsize] # the result might be shorter than batchsize at the end
json_results = api_call(endpoint, '|'.join(str(x) for x in batch))
scores = json_results['enwiki']['scores']

for p_id, p_info in scores.items():    
    temp_predictions.append(p_info['articlequality'])
    rev_id.append(p_id) #store rev_id
    
for p in temp_predictions:
    for p_id, p_info in p.items():
        if p_id == 'score':
            #store predicted quality
            pred.append(p_info['prediction'])
        else:
            #error: could not get the predicted quality
            pred.append('error')

In [233]:
#temp_predictions

[{'score': {'prediction': 'Start',
   'probability': {'B': 0.055956824214990715,
    'C': 0.21163698758644048,
    'FA': 0.0050682775646706335,
    'GA': 0.016572605638882732,
    'Start': 0.657600471748186,
    'Stub': 0.05316483324682938}}},
 {'score': {'prediction': 'Stub',
   'probability': {'B': 0.008494539912752579,
    'C': 0.01134474383241322,
    'FA': 0.0015308478859911026,
    'GA': 0.002966212835055669,
    'Start': 0.02738291942099843,
    'Stub': 0.948280736112789}}},
 {'score': {'prediction': 'Stub',
   'probability': {'B': 0.01442721078826357,
    'C': 0.013425452505774703,
    'FA': 0.0024571018372031065,
    'GA': 0.0035025737375618143,
    'Start': 0.30723520715681724,
    'Stub': 0.6589524539743796}}},
 {'score': {'prediction': 'Stub',
   'probability': {'B': 0.01678015980383155,
    'C': 0.02207838913852806,
    'FA': 0.002922127002898921,
    'GA': 0.004683167544585403,
    'Start': 0.3494739017283775,
    'Stub': 0.6040622547817787}}},
 {'score': {'prediction': '

In [251]:
temp_predictions = [] #store nested predictions temporarily 
rev_id = [] #variable to store rev_id
pred = [] #variable to store final predictions, including errors

batchsize = 50 #set the batch size


for i in range(0, len(wiki_country_politician), batchsize):
    #get the batch of revision id's
    batch = wiki_country_politician.rev_id.iloc[i:i+batchsize] # the result might be shorter than batchsize at the end
    json_results = api_call(endpoint, '|'.join(str(x) for x in batch))
    scores = json_results['enwiki']['scores']

    #parse json from latest batch
    for p_id, p_info in scores.items():    
        temp_predictions.append(p_info['articlequality'])
        rev_id.append(p_id) #store rev_id
    
    #get all predictions from latest batch
    for p in temp_predictions:
        for p_id, p_info in p.items():
            if p_id == 'score':
                #store predicted quality
                pred.append(p_info['prediction'])
            else:
                #error: could not get the predicted quality
                pred.append('error')
    #reset temp variable
    temp_predictions = []

In [252]:
len(pred) #validate expected number of predictions

46700

In [254]:
len(rev_id) #validate expected number of rev id's

46700

In [303]:
#convert lists to pd dataframes
pred = pd.DataFrame(pred, columns = ['pred'])
rev_id = pd.DataFrame(rev_id, columns = ['rev_id'])

#merge rev_id's with associated predictions into a single dataframe
predictions = pd.concat([rev_id, pred], axis = 1)
predictions.head()

Unnamed: 0,rev_id,pred
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub


In [320]:
#format rev id to ints
predictions['rev_id'] = predictions['rev_id'].astype(int)


In [299]:
#collect the revision ids of wiki pages whose quality could not be predicted by the model
rev_no_prediction = predictions[predictions['pred'] == 'error']['rev_id'].tolist()

#display the 275 pages without predictions (option_context only takes effect inside a with block)
wiki_no_pred = wiki_country_politician[wiki_country_politician['rev_id'].isin(rev_no_prediction)]
with pd.option_context("display.max_rows", 300, "display.max_columns", 300):
    display(wiki_no_pred)

Unnamed: 0,page,country,rev_id
126,List of politicians in Poland,Poland,516633096
222,Tingtingru,Vanuatu,550682925
330,Daud Arsala,Afghanistan,627547024
359,Book:Two Political Biographies,India,636911471
514,Dilaver Bey,Turkey,669987106
539,Bharat Saud,Nepal,671484594
625,Book:Australian Prime Ministers,Australia,680981536
643,Robert Sych,Poland,684023803
644,Marek Krzysztof Jeleniewski,Poland,684023859
780,Khanzahi,Iran,696608092


In [301]:
#save all no prediction values to a csv
wiki_no_pred.to_csv("wikipedia_politcian_pages_no_ORES_pred.csv")

In [308]:
predictions[predictions['pred'] == 'error']

Unnamed: 0,rev_id,pred
14,516633096,error
21,550682925,error
51,627547024,error
75,636911471,error
180,669987106,error
204,671484594,error
287,680981536,error
301,684023803,error
302,684023859,error
434,696608092,error


In [312]:
#get indices of error predictions
err_pred = predictions[predictions['pred'] == 'error'].index

#drop rev_ids with no prediction
predictions = predictions.drop(index = err_pred)

## Combining Datasets


In [321]:
predictions.describe()

Unnamed: 0,rev_id
count,46425.0
mean,774627500.0
std,31729880.0
min,355319500.0
25%,757214000.0
50%,788674900.0
75%,798638200.0
max,807483300.0


In [318]:
wiki_country_politician.describe()

Unnamed: 0,rev_id
count,46700.0
mean,774547700.0
std,31831090.0
min,355319500.0
25%,757213800.0
50%,788659200.0
75%,798643700.0
max,807483300.0


In [325]:
country_population.describe()

Unnamed: 0,TimeFrame,Data (M),Population
count,210.0,210.0,210.0
mean,2019.0,37.009043,37009040.0
std,0.0,141.648454,141648500.0
min,2019.0,0.01,10000.0
25%,2019.0,1.27825,1278250.0
50%,2019.0,7.0995,7099500.0
75%,2019.0,26.076,26076000.0
max,2019.0,1402.385,1402385000.0


In [328]:
#merge datasets by revision id 
merged_revs = pd.merge(predictions, wiki_country_politician, how = 'left', on = 'rev_id')

In [330]:
#merge datasets by country
merged_pop = pd.merge(country_population, merged_revs, how = 'left', left_on = 'Name', right_on = 'country')

In [331]:
merged_pop

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,rev_id,pred,page,country
0,DZ,Algeria,Country,2019,44.357,44357000,686269631.0,Stub,Ali Fawzi Rebaine,Algeria
1,DZ,Algeria,Country,2019,44.357,44357000,705910185.0,Stub,Ahmed Attaf,Algeria
2,DZ,Algeria,Country,2019,44.357,44357000,707427823.0,Stub,Ahmed Djoghlaf,Algeria
3,DZ,Algeria,Country,2019,44.357,44357000,708060571.0,Stub,Hammi Larouissi,Algeria
4,DZ,Algeria,Country,2019,44.357,44357000,708980561.0,Stub,Salah Goudjil,Algeria
...,...,...,...,...,...,...,...,...,...,...
44590,VU,Vanuatu,Country,2019,0.321,321000,799954279.0,Stub,Tallis Obed Moses,Vanuatu
44591,VU,Vanuatu,Country,2019,0.321,321000,799954813.0,Start,Esmon Saimon,Vanuatu
44592,VU,Vanuatu,Country,2019,0.321,321000,799955662.0,C,Baldwin Lonsdale,Vanuatu
44593,VU,Vanuatu,Country,2019,0.321,321000,800106636.0,C,Sela Molisa,Vanuatu


In [352]:
col_names = ['country', 'article_name', 'revision_id', 'article_quality_est', 'population']

In [446]:
politician_by_country = merged_pop[['country', 'page', 'rev_id', 'pred', 'Population']]

In [447]:
politician_by_country.columns = col_names

In [448]:
politician_by_country.head()

Unnamed: 0,country,article_name,revision_id,article_quality_est,population
0,Algeria,Ali Fawzi Rebaine,686269631.0,Stub,44357000
1,Algeria,Ahmed Attaf,705910185.0,Stub,44357000
2,Algeria,Ahmed Djoghlaf,707427823.0,Stub,44357000
3,Algeria,Hammi Larouissi,708060571.0,Stub,44357000
4,Algeria,Salah Goudjil,708980561.0,Stub,44357000


In [357]:
#create csv of final merged dataset 
politician_by_country.to_csv('wp_wpds_politicians_by_country.csv')

In [342]:
#outer merge of predictions and pages (keeps rows from both sides, matched or not)
no_merged_revs = pd.merge(predictions, wiki_country_politician, how = 'outer', on = 'rev_id')

In [343]:
#outer merge of populations and scored pages (keeps rows from both sides, matched or not)
no_merged_pop = pd.merge(country_population, merged_revs, how = 'outer', left_on = 'Name', right_on = 'country')

In [347]:
#merge lists of no merges
no_match = pd.concat([no_merged_revs, no_merged_pop], axis = 0)

In [348]:
#list of no matches because pred was null, or population was null, or page was null
no_match

Unnamed: 0,rev_id,pred,page,country,FIPS,Name,Type,TimeFrame,Data (M),Population
0,355319463.0,Stub,Bir I of Kanem,Chad,,,,,,
1,393276188.0,Stub,Information Minister of the Palestinian Nation...,Palestinian Territory,,,,,,
2,393822005.0,Stub,Yos Por,Cambodia,,,,,,
3,395521877.0,Stub,Julius Gregr,Czech Republic,,,,,,
4,395526568.0,Stub,Edvard Gregr,Czech Republic,,,,,,
...,...,...,...,...,...,...,...,...,...,...
46447,798692052.0,Start,Dahir Riyale Kahin,Somaliland,,,,,,
46448,804143605.0,Stub,Adan Ahmed Elmi,Somaliland,,,,,,
46449,805840190.0,C,Muhammad Haji Ibrahim Egal,Somaliland,,,,,,
46450,805873719.0,C,Hediya Yousef,Rojava,,,,,,


In [350]:
#create a csv of data not matched
no_match.to_csv('wp_wpds_countries-no_match.csv')
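An outer merge on its own keeps matched rows as well as unmatched ones. One way to isolate only the rows that failed to match is the `indicator=True` option of `pd.merge`, sketched here on a three-row toy example. The frames `articles` and `population` are made up for illustration, using names and population values that appear elsewhere in this notebook; this is not the code that produced the CSV above.

```python
import pandas as pd

# toy stand-ins for the article and population tables
articles = pd.DataFrame({
    "page": ["Ahmed Attaf", "Hediya Yousef"],
    "country": ["Algeria", "Rojava"],
})
population = pd.DataFrame({
    "Name": ["Algeria", "Egypt"],
    "Population": [44357000, 100803000],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(
    articles, population,
    how="outer", left_on="country", right_on="Name",
    indicator=True,
)

matched = merged[merged["_merge"] == "both"]
unmatched = merged[merged["_merge"] != "both"]

print(unmatched[["page", "country", "Name", "_merge"]])
```

Here `left_only` rows are articles whose country has no population entry, and `right_only` rows are countries with no articles.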

## Analysis
Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country AND for each geographic region. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.
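To make the two measures concrete, here is a small sketch on made-up data: articles-per-population is total articles divided by population, and the quality measure is FA/GA articles divided by total articles, both expressed as percentages. The frame `toy` and the countries "A" and "B" are invented; only the column names follow the merged dataset above.

```python
import pandas as pd

# invented data: country A has 4 articles (2 high quality), country B has 2 (none high quality)
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B"],
    "article_quality_est": ["FA", "Stub", "GA", "Start", "Stub", "C"],
    "population": [1_000_000] * 4 + [500_000] * 2,
})

summary = toy.groupby("country").agg(
    total_articles=("article_quality_est", "size"),
    good_articles=("article_quality_est", lambda s: s.isin(["FA", "GA"]).sum()),
    population=("population", "first"),
)

# coverage: all politician articles as a percentage of population
summary["articles_per_pop_pct"] = 100 * summary["total_articles"] / summary["population"]
# quality: FA/GA articles as a percentage of all politician articles
summary["good_articles_pct"] = 100 * summary["good_articles"] / summary["total_articles"]

print(summary)
```

Note that the coverage measure uses the count of all articles, not only the FA/GA ones.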

In [524]:
politician_by_country

Unnamed: 0,country,article_name,revision_id,article_quality_est,population
0,Algeria,Ali Fawzi Rebaine,686269631.0,Stub,44357000
1,Algeria,Ahmed Attaf,705910185.0,Stub,44357000
2,Algeria,Ahmed Djoghlaf,707427823.0,Stub,44357000
3,Algeria,Hammi Larouissi,708060571.0,Stub,44357000
4,Algeria,Salah Goudjil,708980561.0,Stub,44357000
...,...,...,...,...,...
44590,Vanuatu,Tallis Obed Moses,799954279.0,Stub,321000
44591,Vanuatu,Esmon Saimon,799954813.0,Start,321000
44592,Vanuatu,Baldwin Lonsdale,799955662.0,C,321000
44593,Vanuatu,Sela Molisa,800106636.0,C,321000


#### Proportion of good/featured articles by total articles

In [558]:
#get the number of pages per country
num_pgs_country = politician_by_country.groupby('country').size()
len(num_pgs_country)

183

In [559]:
#get the number of pages per country with featured (FA) or good (GA) articles
good_articles = politician_by_country[politician_by_country['article_quality_est'].isin(['FA','GA'])]
num_good_pgs_country = good_articles.groupby('country').size()
len(num_good_pgs_country)

146

In [560]:
#merge together values by country
total_pages_type = pd.concat([num_pgs_country,num_good_pgs_country], axis = 1)
total_pages_type.columns = ['total', 'good']

#fill in nans with 0's
total_pages_type = total_pages_type.fillna(0)
#total_pages_type

In [561]:
#get proportion of good to total articles by country
prop_good_total = total_pages_type['good']/total_pages_type['total']

#convert to dataframe
prop_good_art = prop_good_total.to_frame()
prop_good_art.columns = ['prop_good_articles']

In [570]:
good_to_total_results = pd.concat([prop_good_art, num_pgs_country, num_good_pgs_country], axis = 1)
good_to_total_results.columns = ['prop_good_to_total', 'total_articles', 'total_good_articles']

#replace nans with 0's (as they are 0 or approx 0)
good_to_total_results = good_to_total_results.fillna(0)

#final dataframe for proportion by total articles
display(good_to_total_results)

Unnamed: 0,prop_good_to_total,total_articles,total_good_articles
Afghanistan,0.040752,319,13.0
Albania,0.006579,456,3.0
Algeria,0.017241,116,2.0
Andorra,0.0,34,0.0
Angola,0.0,106,0.0
Antigua and Barbuda,0.0,24,0.0
Argentina,0.032587,491,16.0
Armenia,0.025907,193,5.0
Australia,0.024375,1559,38.0
Austria,0.009009,333,3.0


#### Proportion of good/featured articles by total population

In [562]:
#get population by country
country_population = politician_by_country[['country', 'population']].drop_duplicates()
country_population = country_population.dropna()

In [564]:
#format number of good articles by country
num_good_country = num_good_pgs_country.to_frame()
num_good_country['country'] = num_good_country.index
num_good_country.columns = ['good', 'country']
num_good_country.index.name = 'index'

In [573]:
#merge together values by country
total_pages_pop = pd.merge(country_population, num_good_country, how = 'left', on = 'country')

#fill in nans with 0's
total_pages_pop = total_pages_pop.fillna(0)

In [575]:
#get the proportion of good articles to population
total_pages_pop['prop'] = total_pages_pop['good']/total_pages_pop['population']

In [576]:
good_population_results = total_pages_pop
good_population_results.columns = ["country", "population", "total_good_articles", "prop_good_to_population"]

#final dataframe for proportion by population
display(good_population_results)

Unnamed: 0,country,population,total_good_articles,prop_good_to_population
0,Algeria,44357000,2.0,4.508871e-08
1,Egypt,100803000,10.0,9.92034e-08
2,Libya,6891000,4.0,5.804673e-07
3,Morocco,35952000,1.0,2.781486e-08
4,Sudan,43849000,2.0,4.561107e-08
5,Tunisia,11896000,0.0,0.0
6,Benin,12209000,7.0,5.733475e-07
7,Burkina Faso,20903000,2.0,9.568005e-08
8,Cape Verde,556000,0.0,0.0
9,Gambia,2417000,2.0,8.274721e-07


#### Regional Population

In [581]:
regional_population

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
10,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000
27,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000
48,MIDDLE AFRICA,MIDDLE AFRICA,Sub-Region,2019,179.757,179757000
58,SOUTHERN AFRICA,SOUTHERN AFRICA,Sub-Region,2019,67.732,67732000
64,NORTHERN AMERICA,NORTHERN AMERICA,Sub-Region,2019,368.193,368193000
67,LATIN AMERICA AND THE CARIBBEAN,LATIN AMERICA AND THE CARIBBEAN,Sub-Region,2019,651.036,651036000
68,CENTRAL AMERICA,CENTRAL AMERICA,Sub-Region,2019,178.611,178611000


In [582]:
country_population

Unnamed: 0,country,population
0,Algeria,44357000
116,Egypt,100803000
350,Libya,6891000
460,Morocco,35952000
666,Sudan,43849000
761,Tunisia,11896000
900,Benin,12209000
991,Burkina Faso,20903000
1086,Cape Verde,556000
1123,Gambia,2417000


## Results

1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [577]:
display(good_population_results.sort_values('prop_good_to_population', ascending = False)[:10])

Unnamed: 0,country,population,total_good_articles,prop_good_to_population
181,Tuvalu,10000,4.0,0.0004
63,Dominica,72000,1.0,1.4e-05
182,Vanuatu,321000,3.0,9e-06
132,Iceland,368000,2.0,5e-06
133,Ireland,5003000,25.0,5e-06
165,Montenegro,622000,2.0,3e-06
69,Martinique,356000,1.0,3e-06
107,Bhutan,730000,2.0,3e-06
177,New Zealand,4987000,13.0,3e-06
153,Romania,19241000,42.0,2e-06


2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [578]:
display(good_population_results.sort_values('prop_good_to_population', ascending = True)[:10])

Unnamed: 0,country,population,total_good_articles,prop_good_to_population
48,Lesotho,2142000,0.0,0.0
53,Belize,419000,0.0,0.0
158,Andorra,82000,0.0,0.0
30,Mozambique,31166000,0.0,0.0
66,Guadeloupe,375000,0.0,0.0
32,Seychelles,98000,0.0,0.0
65,Grenada,113000,0.0,0.0
61,Barbados,287000,0.0,0.0
37,Zambia,18384000,0.0,0.0
23,Djibouti,988000,0.0,0.0


3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [579]:
display(good_to_total_results.sort_values('prop_good_to_total', ascending = False)[:10])

Unnamed: 0,prop_good_to_total,total_articles,total_good_articles
"Korea, North",0.222222,36,8.0
Saudi Arabia,0.128205,117,15.0
Romania,0.122449,343,42.0
Central African Republic,0.121212,66,8.0
Uzbekistan,0.107143,28,3.0
Mauritania,0.104167,48,5.0
Guatemala,0.084337,83,7.0
Dominica,0.083333,12,1.0
Syria,0.078125,128,10.0
Benin,0.076923,91,7.0


4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [580]:
display(good_to_total_results.sort_values('prop_good_to_total', ascending = True)[:10])

Unnamed: 0,prop_good_to_total,total_articles,total_good_articles
Solomon Islands,0.0,97,0.0
Tonga,0.0,63,0.0
Nauru,0.0,52,0.0
Namibia,0.0,162,0.0
Djibouti,0.0,37,0.0
Mozambique,0.0,58,0.0
Monaco,0.0,40,0.0
Eritrea,0.0,16,0.0
Estonia,0.0,148,0.0
Moldova,0.0,421,0.0


5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

6. Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality
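One possible way to produce the regional rankings is to assign each country to the nearest ALL CAPS row above it in the population file and then aggregate. The sketch below assumes that the file lists each sub-region immediately before its member countries, as in the first rows shown earlier; that ordering is an assumption to verify against the full file. The frame `wpds` is a shortened toy copy and `article_counts` covers only two countries, so the resulting numbers are illustrative, not results.

```python
import pandas as pd

# shortened toy copy of the population sheet, in file order
wpds = pd.DataFrame({
    "Name": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
             "WESTERN AFRICA", "Benin"],
    "Population": [7772850000, 1337918000, 244344000, 44357000, 100803000,
                   401115000, 12209000],
})

# region rows are ALL CAPS; forward-fill the most recent region name onto each country
is_region = wpds["Name"].str.isupper()
wpds["region"] = wpds["Name"].where(is_region).ffill()
country_region = (
    wpds.loc[~is_region, ["Name", "region"]].rename(columns={"Name": "country"})
)

# partial per-country article counts, for illustration only
article_counts = pd.DataFrame({
    "country": ["Algeria", "Benin"],
    "total_articles": [116, 91],
})

# sum article counts per region, then divide by the regional population row
by_region = (
    article_counts.merge(country_region, on="country")
    .groupby("region")["total_articles"].sum().to_frame()
)
region_pop = wpds.loc[is_region].set_index("Name")["Population"]
by_region["population"] = region_pop.reindex(by_region.index)
by_region["articles_per_pop_pct"] = (
    100 * by_region["total_articles"] / by_region["population"]
)

ranking = by_region.sort_values("articles_per_pop_pct", ascending=False)
print(ranking)
```

The same grouping can be applied to FA/GA counts to rank regions by the proportion of high quality articles.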

## Writeup: Reflections and Implications

Write a few paragraphs, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). 

You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning.

What biases did you expect to find in the data (before you started working with it), and why?

What (potential) sources of bias did you discover in the course of your data processing and analysis?

How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you observed?


Before reaching the conclusions in this notebook, I expected to find a small correlation between politician article ratings and the size of the country. In my experience, larger, more dominant countries tend to fill international news sources because they have more of an impact on the world, and consequently politicians in these larger countries tend to be more famous. My expectation was that the larger a politician's following, the more thorough their Wikipedia article would be.

I expected the top 10 countries, ranked by good/featured articles relative to population and relative to the total number of articles, to contain several larger, more prominent countries, such as the United States, China, and Russia.



## Cited Sources

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

https://pandas.pydata.org/docs/reference/api/pandas.Series.str.isupper.html

https://datascienceparichay.com/article/pandas-groupby-count-of-rows-in-each-group

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html