# A2 - Bias in Data Assignment

#### Data 512: Human Centered Data Science
#### Aaliyah Hänni
#### 10/7/2021


## Project Overview

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. Wikipedia articles and country populations datasets are combined, and ORES is used to estimate the quality of each article by country.

This notebook contains a step-by-step analysis, from data acquisition to results, of how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries. The 'Results' section of this notebook contains tables that display:

1. the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. the countries with the highest and lowest proportion of high quality articles about politicians.
3. a ranking of geographic regions by articles-per-person and proportion of high quality articles.

The 'Reflection' section contains a short reflection on the project, focusing on how both the findings from this analysis and the process used to reach them helped me understand the causes and consequences of biased data in large, complex data science projects.


In [1]:
import pandas as pd
import numpy as np

## Data Acquisition

The first step is getting the data, which lives in several different places. The Wikipedia politicians by country dataset can be found on Figshare. The population data is available in CSV format as WPDS_2020_data.csv. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.

#### Data Source #1: Politicians by Country from the English-language Wikipedia

The data was extracted via the Wikimedia API using the associated code. It is formatted as a CSV and saved as page_data.csv in the "data" directory. Columns are:

1. "country", containing the sanitised country name, extracted from the category name;
2. "page", containing the unsanitised page title;
3. "rev_id", containing the revision ID of the last edit to the page (called "last_edit" in the dataset's description).

Data Source: https://figshare.com/articles/dataset/Untitled_Item/5513449

Keyes, Os (2017): Politicians by Country from the English-language Wikipedia. figshare. Dataset. https://doi.org/10.6084/m9.figshare.5513449.v6 

In [2]:
#importing wikipedia politicians pages and their countries
wiki_country_politician = pd.read_csv("data/page_data.csv")
wiki_country_politician.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


#### Data Source #2: World Population Data Sheet

This dataset was obtained from the Population Reference Bureau. It contains population counts for the world, regions, and countries. The file is named for the 2020 data sheet, while its TimeFrame column reads 2019.

Columns are: 
1. "FIPS", contains the Federal Information Processing Standards codes for place
2. "Name", contains the name of the place
3. "Type", contains the type of place: World, Sub-Region, Country
4. "TimeFrame", contains the year (2019)
5. "Data (M)", contains the population count in millions
6. "Population", contains the population count

About the data: https://www.prb.org/international/indicator/population/table/ 

Data Source:
https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit#gid=283125346


In [3]:
#importing the world population (2020) data sheet
world_population_2019 = pd.read_csv("data/WPDS_2020_data.csv")
world_population_2019.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


## Data Processing
Both page_data.csv and WPDS_2020_data.csv contain some rows that need to be filtered out and/or ignored when combining the datasets below. In page_data.csv, some page names start with the string "Template:". These pages are not Wikipedia articles and should not be included in the analysis.

Similarly, WPDS_2020_data.csv contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the 'Name' field (e.g. AFRICA, OCEANIA). They won't match the country values in page_data.csv, but they are retained in a separate dataframe so that coverage and quality can be reported by region in the analysis section.


In [4]:
wiki_country_politician[wiki_country_politician['page'].str.contains('Template:')].index

Int64Index([    0,     2,     3,     4,     5,     6,     7,     8,     9,
               11,
            ...
            44296, 44580, 44581, 44603, 44657, 44916, 44966, 45587, 45823,
            46907],
           dtype='int64', length=496)

In [5]:
#select row indices with "Template:"
templates_rows = wiki_country_politician[wiki_country_politician['page'].str.contains('Template:')].index

#removing pages that start with the string "Template:"
wiki_country_politician = wiki_country_politician.drop(index = templates_rows)

wiki_country_politician.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
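
The cell above uses `str.contains`, which matches "Template:" anywhere in a title, while the text describes pages that start with that string. The sketch below, on a small made-up dataframe (the titles are illustrative and not taken from page_data.csv), shows how the two masks can differ. It is only a comparison of the two pandas methods, not a claim that the real dataset contains such titles.

```python
import pandas as pd

# Toy page titles (illustrative only, not rows from page_data.csv)
pages = pd.DataFrame({
    "page": ["Template:Zambia-politician-stub", "Yos Por", "History of Template:Foo"],
    "country": ["Zambia", "Cambodia", "Nowhere"],
})

# str.contains matches the substring anywhere in the title
contains_mask = pages["page"].str.contains("Template:")

# str.startswith matches only titles that begin with the prefix
prefix_mask = pages["page"].str.startswith("Template:")

# keep everything that is not a template page
articles_only = pages[~prefix_mask]
print(articles_only)
```

If both masks select the same rows on the real data, the choice does not affect the results. `startswith` simply states the intent more directly.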


In [6]:
#removing non-country types from population dataset

#boolean mask: True for non-country (regional) rows, False for countries
isupper = world_population_2019['Name'].str.isupper()

#get list of all regional population (not including countries)
regional_population = world_population_2019[isupper]

#get list of only country populations
country_population = world_population_2019[~isupper]
country_population_full = country_population

country_population.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000
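
As an alternative to testing for ALL CAPS names, the 'Type' column shown in the output above could be used to separate countries from regions. This is a minimal sketch on a four-row excerpt of the values displayed earlier. It assumes that every country row is labelled "Country", which has not been verified against the full file.

```python
import pandas as pd

# Four rows copied from the population output shown earlier in this notebook
population = pd.DataFrame({
    "Name": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Type": ["Sub-Region", "Sub-Region", "Country", "Country"],
    "Population": [1337918000, 244344000, 44357000, 100803000],
})

# Assumption: country rows are exactly those whose Type is "Country"
is_country = population["Type"] == "Country"
countries = population[is_country]
regions = population[~is_country]
print(countries)
```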


## Getting Article Quality Predictions

To get the predicted quality scores for each article in the Wikipedia dataset, we use a machine learning system called ORES. The name was originally an acronym for "Objective Revision Evaluation Service", but the system is now simply called "ORES". It provides estimates of Wikipedia article quality. The article quality estimates are, from best to worst:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

These classes were learned from articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. They are a subset of the quality assessment categories developed by Wikipedia editors.

In order to get a prediction for each article in the Wikipedia dataset, we take the rev_id values from the filtered page_data.csv dataframe and send them to the ORES API in batches of 50, rather than making one query per row.

ORES REST API - 
Documentation: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model

Note: The ORES API returns a prediction value that contains the name of one category, as well as probability values for each of the 6 quality categories. For this assignment, we only need to capture and use the value for prediction.
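
To make the response handling concrete, here is a minimal offline sketch of pulling the prediction out of the `scores` part of a response. The nesting (`articlequality`, then either `score` or `error`) mirrors what the parsing cells in this notebook rely on. The helper name `extract_predictions`, the probability value, and the error payload are illustrative assumptions, not real API output.

```python
def extract_predictions(scores):
    """Map each revision ID to its predicted class, or 'error' when no score is present."""
    predictions = {}
    for rev_id, models in scores.items():
        result = models.get("articlequality", {})
        if "score" in result:
            predictions[int(rev_id)] = result["score"]["prediction"]
        else:
            # the model could not score this revision
            predictions[int(rev_id)] = "error"
    return predictions


# Hand-written sample in the shape the notebook parses (values are illustrative)
sample_scores = {
    "355319463": {"articlequality": {"score": {"prediction": "Stub",
                                               "probability": {"Stub": 0.9}}}},
    "516633096": {"articlequality": {"error": {"message": "illustrative error",
                                               "type": "ExampleError"}}},
}

print(extract_predictions(sample_scores))
```

Keying the result by revision ID keeps each prediction attached to its page even if the response order differs from the request order.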


In [8]:
import json
import requests

In [9]:
#api endpoint for getting ores scores
endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={rev_id}'

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/aaliyahfiala42',
    'From': 'fialaa@uw.edu'
}

In [10]:
#function to access api, given id parameter
def api_call(endpoint, ids):
    call = requests.get(endpoint.format(rev_id=ids), headers=headers)
    response = call.json()   
    return response

In [13]:
#testing for a single batch of 50
wiki_scores = []
temp_predictions = [] #store nested predictions temporarily 

rev_id = [] #variable to store rev_id
pred = [] #variable to store final predictions, including errors

batchsize = 50 #set the batch size
i = 1000
batch = wiki_country_politician.rev_id.iloc[i:i+batchsize] # the result might be shorter than batchsize at the end
json_results = api_call(endpoint, '|'.join(str(x) for x in batch))
scores = json_results['enwiki']['scores']

for p_id, p_info in scores.items():    
    temp_predictions.append(p_info['articlequality'])
    rev_id.append(p_id) #store rev_id
    
for p in temp_predictions:
    for p_id, p_info in p.items():
        if p_id == 'score':
            #store predicted quality
            pred.append(p_info['prediction'])
        else:
            #error: could not get the predicted quality
            pred.append('error')

In [14]:
#getting predictions for all revision ids, in batches of 50

temp_predictions = [] #store nested predictions temporarily 
rev_id = [] #variable to store rev_id
pred = [] #variable to store final predictions, including errors

batchsize = 50 #set the batch size


for i in range(0, len(wiki_country_politician), batchsize):
    #get the batch of revision id's
    batch = wiki_country_politician.rev_id.iloc[i:i+batchsize] # the result might be shorter than batchsize at the end
    json_results = api_call(endpoint, '|'.join(str(x) for x in batch))
    scores = json_results['enwiki']['scores']

    #parse json from latest batch
    for p_id, p_info in scores.items():    
        temp_predictions.append(p_info['articlequality'])
        rev_id.append(p_id) #store rev_id
    
    #get all predictions from latest batch
    for p in temp_predictions:
        for p_id, p_info in p.items():
            if p_id == 'score':
                #store predicted quality
                pred.append(p_info['prediction'])
            else:
                #error: could not get the predicted quality
                pred.append('error')
    #reset temp variable
    temp_predictions = []
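
The batching step in the loop above can be isolated as a small helper. This is a sketch only: `batch_queries` is a hypothetical name, and the revision IDs are made-up integers. It shows how a list of IDs is split into pipe-separated strings of at most `batchsize` IDs, with a shorter final batch.

```python
def batch_queries(rev_ids, batchsize=50):
    """Yield pipe-separated revision ID strings, at most `batchsize` IDs each."""
    for start in range(0, len(rev_ids), batchsize):
        # the final slice may be shorter than batchsize
        yield "|".join(str(r) for r in rev_ids[start:start + batchsize])


# Made-up IDs, batched two at a time to keep the example short
queries = list(batch_queries([101, 102, 103, 104, 105], batchsize=2))
print(queries)
```

Each string produced this way could be passed as the `rev_id` value when formatting the endpoint.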

In [15]:
len(pred) #validate expected number of predictions

46701

In [16]:
len(rev_id) #validate expected number of rev id's

46701

In [17]:
#convert lists to pd dataframes
pred = pd.DataFrame(pred, columns = ['pred'])
rev_id = pd.DataFrame(rev_id, columns = ['rev_id'])

#merge rev_id's with associated predictions into a single dataframe
predictions = pd.concat([rev_id, pred], axis = 1)
predictions.head()

Unnamed: 0,rev_id,pred
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub


In [18]:
#format rev id to ints
predictions['rev_id'] = [int(item) for item in predictions['rev_id']]


In [19]:
#collect the revision id's of wiki pages whose quality could not be predicted by the model
rev_no_prediction = predictions[predictions['pred'] == 'error']['rev_id'].tolist()

#display the pages without predictions, raising the row/column display limits for this cell only
wiki_no_pred = wiki_country_politician[wiki_country_politician['rev_id'].isin(rev_no_prediction)]
with pd.option_context("display.max_rows", 300, "display.max_columns", 300):
    display(wiki_no_pred)

Unnamed: 0,page,country,rev_id
126,List of politicians in Poland,Poland,516633096
222,Tingtingru,Vanuatu,550682925
330,Daud Arsala,Afghanistan,627547024
359,Book:Two Political Biographies,India,636911471
514,Dilaver Bey,Turkey,669987106
...,...,...,...
46782,John Rose (Trotskyist),United Kingdom,807336308
46862,Jalal Movaghar,Iran,807367030
46863,Mohsen Movaghar,Iran,807367166
47182,King Gutierrez,Philippines,807479587


In [20]:
#save all no prediction values to a csv
wiki_no_pred.to_csv("wikipedia_politcian_pages_no_ORES_pred.csv")

In [21]:
predictions[predictions['pred'] == 'error']

Unnamed: 0,rev_id,pred
14,516633096,error
21,550682925,error
51,627547024,error
75,636911471,error
180,669987106,error
...,...,...
46287,807336308,error
46367,807367030,error
46368,807367166,error
46686,807479587,error


In [22]:
#get index labels of error predictions
err_pred = predictions[predictions['pred'] == 'error'].index

#drop rev_ids with no prediction
predictions = predictions.drop(index = err_pred)

## Combining Datasets


In [23]:
predictions.describe()

Unnamed: 0,rev_id
count,46425.0
mean,774627500.0
std,31729880.0
min,355319500.0
25%,757214000.0
50%,788674900.0
75%,798638200.0
max,807483300.0


In [24]:
wiki_country_politician.describe()

Unnamed: 0,rev_id
count,46701.0
mean,774548400.0
std,31831110.0
min,355319500.0
25%,757213800.0
50%,788659300.0
75%,798644000.0
max,807484300.0


In [25]:
country_population.describe()

Unnamed: 0,TimeFrame,Data (M),Population
count,210.0,210.0,210.0
mean,2019.0,37.009043,37009040.0
std,0.0,141.648454,141648500.0
min,2019.0,0.01,10000.0
25%,2019.0,1.27825,1278250.0
50%,2019.0,7.0995,7099500.0
75%,2019.0,26.076,26076000.0
max,2019.0,1402.385,1402385000.0


In [26]:
#merge datasets by revision id 
merged_revs = pd.merge(predictions, wiki_country_politician, how = 'left', on = 'rev_id')

In [27]:
#merge datasets by country
merged_pop = pd.merge(country_population, merged_revs, how = 'left', left_on = 'Name', right_on = 'country')

In [28]:
merged_pop

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,rev_id,pred,page,country
0,DZ,Algeria,Country,2019,44.357,44357000,686269631.0,Stub,Ali Fawzi Rebaine,Algeria
1,DZ,Algeria,Country,2019,44.357,44357000,705910185.0,Stub,Ahmed Attaf,Algeria
2,DZ,Algeria,Country,2019,44.357,44357000,707427823.0,Stub,Ahmed Djoghlaf,Algeria
3,DZ,Algeria,Country,2019,44.357,44357000,708060571.0,Stub,Hammi Larouissi,Algeria
4,DZ,Algeria,Country,2019,44.357,44357000,708980561.0,Stub,Salah Goudjil,Algeria
...,...,...,...,...,...,...,...,...,...,...
44590,VU,Vanuatu,Country,2019,0.321,321000,799954279.0,Stub,Tallis Obed Moses,Vanuatu
44591,VU,Vanuatu,Country,2019,0.321,321000,799954813.0,Start,Esmon Saimon,Vanuatu
44592,VU,Vanuatu,Country,2019,0.321,321000,799955662.0,C,Baldwin Lonsdale,Vanuatu
44593,VU,Vanuatu,Country,2019,0.321,321000,800106636.0,C,Sela Molisa,Vanuatu


In [29]:
col_names = ['country', 'article_name', 'revision_id', 'article_quality_est', 'population']

In [30]:
politician_by_country = merged_pop[['country', 'page', 'rev_id', 'pred', 'Population']]

In [31]:
politician_by_country.columns = col_names

In [32]:
politician_by_country.head()

Unnamed: 0,country,article_name,revision_id,article_quality_est,population
0,Algeria,Ali Fawzi Rebaine,686269631.0,Stub,44357000
1,Algeria,Ahmed Attaf,705910185.0,Stub,44357000
2,Algeria,Ahmed Djoghlaf,707427823.0,Stub,44357000
3,Algeria,Hammi Larouissi,708060571.0,Stub,44357000
4,Algeria,Salah Goudjil,708980561.0,Stub,44357000


In [33]:
#create csv of final merged dataset 
politician_by_country.to_csv('wp_wpds_politicians_by_country.csv')

In [34]:
#outer merge of predictions and pages: keeps every row from both sides, matched or not
no_merged_revs = pd.merge(predictions, wiki_country_politician, how = 'outer', on = 'rev_id')

In [35]:
#outer merge of populations and scored pages: keeps every row from both sides, matched or not
no_merged_pop = pd.merge(country_population, merged_revs, how = 'outer', left_on = 'Name', right_on = 'country')

In [36]:
#merge lists of no merges
no_match = pd.concat([no_merged_revs, no_merged_pop], axis = 0)

In [37]:
#stacked outer-merge results; rows from the first merge have no population columns, so those show as NaN
no_match

Unnamed: 0,rev_id,pred,page,country,FIPS,Name,Type,TimeFrame,Data (M),Population
0,355319463.0,Stub,Bir I of Kanem,Chad,,,,,,
1,393276188.0,Stub,Information Minister of the Palestinian Nation...,Palestinian Territory,,,,,,
2,393822005.0,Stub,Yos Por,Cambodia,,,,,,
3,395521877.0,Stub,Julius Gregr,Czech Republic,,,,,,
4,395526568.0,Stub,Edvard Gregr,Czech Republic,,,,,,
...,...,...,...,...,...,...,...,...,...,...
46447,798692052.0,Start,Dahir Riyale Kahin,Somaliland,,,,,,
46448,804143605.0,Stub,Adan Ahmed Elmi,Somaliland,,,,,,
46449,805840190.0,C,Muhammad Haji Ibrahim Egal,Somaliland,,,,,,
46450,805873719.0,C,Hediya Yousef,Rojava,,,,,,


In [38]:
#create a csv of data not matched
no_match.to_csv('wp_wpds_countries-no_match.csv')
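
An outer merge keeps matched rows as well as unmatched ones, so the file written above contains more than the rows that failed to match. One way to isolate only the unmatched rows is the `indicator=True` option of `pd.merge`, sketched below on toy data. "Exampleland" and its population are hypothetical. "Rojava" is used because it appears without population data in the output above.

```python
import pandas as pd

# Toy inputs: one country present on both sides, one on each side only
articles = pd.DataFrame({
    "page": ["Ahmed Attaf", "Hediya Yousef"],
    "country": ["Algeria", "Rojava"],
})
population = pd.DataFrame({
    "Name": ["Algeria", "Exampleland"],   # "Exampleland" is hypothetical
    "Population": [44357000, 1000],
})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
merged = pd.merge(population, articles, how="outer",
                  left_on="Name", right_on="country", indicator=True)

# keep only the rows that did not find a partner on the other side
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```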

## Analysis
The analysis consists of calculating articles-per-population and the proportion of high-quality articles for each country AND for each geographic region, expressed as proportions in the tables below. By "high quality" articles, we mean the articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) class.

In [39]:
politician_by_country

Unnamed: 0,country,article_name,revision_id,article_quality_est,population
0,Algeria,Ali Fawzi Rebaine,686269631.0,Stub,44357000
1,Algeria,Ahmed Attaf,705910185.0,Stub,44357000
2,Algeria,Ahmed Djoghlaf,707427823.0,Stub,44357000
3,Algeria,Hammi Larouissi,708060571.0,Stub,44357000
4,Algeria,Salah Goudjil,708980561.0,Stub,44357000
...,...,...,...,...,...
44590,Vanuatu,Tallis Obed Moses,799954279.0,Stub,321000
44591,Vanuatu,Esmon Saimon,799954813.0,Start,321000
44592,Vanuatu,Baldwin Lonsdale,799955662.0,C,321000
44593,Vanuatu,Sela Molisa,800106636.0,C,321000


#### Proportion of good/featured articles by total articles

In [40]:
#get the number of pages per country
num_pgs_country = politician_by_country.groupby('country').size()
len(num_pgs_country)

183

In [41]:
#get the number of pages per country with featured (FA) or good (GA) articles
good_articles = politician_by_country[politician_by_country['article_quality_est'].isin(['FA','GA'])]
num_good_pgs_country = good_articles.groupby('country').size()
len(num_good_pgs_country)

146

In [42]:
#merge together values by country
total_pages_type = pd.concat([num_pgs_country,num_good_pgs_country], axis = 1)
total_pages_type.columns = ['total', 'good']

#fill in nans with 0's
total_pages_type = total_pages_type.fillna(0)
#total_pages_type

In [43]:
#get proportion of good to total articles by country
prop_good_total = total_pages_type['good']/total_pages_type['total']

#convert to dataframe
prop_good_art = prop_good_total.to_frame()
prop_good_art.columns = ['prop_good_articles']

In [44]:
good_to_total_results = pd.concat([prop_good_art, num_pgs_country, num_good_pgs_country], axis = 1)
good_to_total_results.columns = ['prop_good_to_total', 'total_articles', 'total_good_articles']

#replace nans with 0's (as they are 0 or approx 0)
good_to_total_results = good_to_total_results.fillna(0)

#final dataframe for proportion by total articles
display(good_to_total_results)

Unnamed: 0_level_0,prop_good_to_total,total_articles,total_good_articles
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,0.040752,319,13.0
Albania,0.006579,456,3.0
Algeria,0.017241,116,2.0
Andorra,0.000000,34,0.0
Angola,0.000000,106,0.0
...,...,...,...
Venezuela,0.023077,130,3.0
Vietnam,0.069519,187,13.0
Yemen,0.025862,116,3.0
Zambia,0.000000,25,0.0


#### Proportion of good/featured articles by total population

In [45]:
#get population by country
country_population = politician_by_country[['country', 'population']].drop_duplicates()
country_population = country_population.dropna()

In [46]:
#format number of good articles by country
num_good_country = num_good_pgs_country.to_frame()
num_good_country['country'] = num_good_country.index
num_good_country.columns = ['good', 'country']
num_good_country.index.name = 'index'

In [47]:
#merge together values by country
total_pages_pop = pd.merge(country_population, num_good_country, how = 'left', on = 'country')

#fill in nans with 0's
total_pages_pop = total_pages_pop.fillna(0)

In [48]:
num_good_pgs_country

index
Afghanistan    13
Albania         3
Algeria         2
Argentina      16
Armenia         5
               ..
Vanuatu         3
Venezuela       3
Vietnam        13
Yemen           3
Zimbabwe        2
Length: 146, dtype: int64

In [49]:
#get the proportion of good articles to population
total_pages_pop['prop'] = total_pages_pop['good']/total_pages_pop['population']

In [50]:
good_population_results = total_pages_pop
good_population_results.columns = ["country", "population", "total_good_articles", "prop_good_to_population"]

#final dataframe for proportion by population
display(good_population_results)

Unnamed: 0,country,population,total_good_articles,prop_good_to_population
0,Algeria,44357000,2.0,4.508871e-08
1,Egypt,100803000,10.0,9.920340e-08
2,Libya,6891000,4.0,5.804673e-07
3,Morocco,35952000,1.0,2.781486e-08
4,Sudan,43849000,2.0,4.561107e-08
...,...,...,...,...
178,Papua New Guinea,8950000,4.0,4.469274e-07
179,Solomon Islands,715000,0.0,0.000000e+00
180,Tonga,99000,0.0,0.000000e+00
181,Tuvalu,10000,4.0,4.000000e-04


#### Regional Population

In [51]:
regional_population

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
10,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000
27,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000
48,MIDDLE AFRICA,MIDDLE AFRICA,Sub-Region,2019,179.757,179757000
58,SOUTHERN AFRICA,SOUTHERN AFRICA,Sub-Region,2019,67.732,67732000
64,NORTHERN AMERICA,NORTHERN AMERICA,Sub-Region,2019,368.193,368193000
67,LATIN AMERICA AND THE CARIBBEAN,LATIN AMERICA AND THE CARIBBEAN,Sub-Region,2019,651.036,651036000
68,CENTRAL AMERICA,CENTRAL AMERICA,Sub-Region,2019,178.611,178611000


#### Mapping country to regions
From the layout of the world population dataset, we know that each country belongs to the closest region listed above it when the rows are ordered by index. Below, we create a new column called 'region' that associates each country with its region.

In [52]:
region = []
#iterate through country indices, rows
for country_index, country_row in country_population_full.iterrows():
    previous_region = ''
    #iterate through region indices, rows (skip "WORLD")
    for reg_index, reg_row in regional_population[1:].iterrows():
        #the first region listed after the country marks the end of the country's own region,
        #so the country belongs to the region seen just before it
        if reg_index > country_index:
            region.append(previous_region) #store region for each country
            break
        previous_region = reg_row['Name']

#countries listed after the last region row (OCEANIA) never meet a later region, so fill them in
for i in range(len(country_population_full) - len(region)):
    region.append('OCEANIA')

In [53]:
#create a dataframe of regions
region = pd.Series(region).to_frame()
region.columns = ['region']

In [54]:
#variable to store all country populations by region
country_population_by_region = pd.concat([region, country_population_full.reset_index()], axis = 1)

In [55]:
country_population_by_region

Unnamed: 0,region,index,FIPS,Name,Type,TimeFrame,Data (M),Population
0,NORTHERN AFRICA,3,DZ,Algeria,Country,2019,44.357,44357000
1,NORTHERN AFRICA,4,EG,Egypt,Country,2019,100.803,100803000
2,NORTHERN AFRICA,5,LY,Libya,Country,2019,6.891,6891000
3,NORTHERN AFRICA,6,MA,Morocco,Country,2019,35.952,35952000
4,NORTHERN AFRICA,7,SD,Sudan,Country,2019,43.849,43849000
...,...,...,...,...,...,...,...,...
205,OCEANIA,229,WS,Samoa,Country,2019,0.200,200000
206,OCEANIA,230,SB,Solomon Islands,Country,2019,0.715,715000
207,OCEANIA,231,TO,Tonga,Country,2019,0.099,99000
208,OCEANIA,232,TV,Tuvalu,Country,2019,0.010,10000


#### Analysis by Region


In [56]:
#population count by region
population_by_region = country_population_by_region.groupby('region')[['Population']].sum()
population_by_region['Population']

region
CARIBBEAN             42747000
CENTRAL AMERICA      178612000
CENTRAL ASIA          74960000
EAST ASIA           1641063000
EASTERN AFRICA       444970000
EASTERN EUROPE       291902000
MIDDLE AFRICA        179757000
NORTHERN AFRICA      244345000
NORTHERN AMERICA     368068000
NORTHERN EUROPE      105852000
OCEANIA               42999000
SOUTH AMERICA        429188000
SOUTH ASIA          1967131000
SOUTHEAST ASIA       661843000
SOUTHERN AFRICA       67732000
SOUTHERN EUROPE      153216000
WESTERN AFRICA       401108000
WESTERN ASIA         280927000
WESTERN EUROPE       195479000
Name: Population, dtype: int64

In [57]:
#total number of articles and good articles by region
summary_articles_country = good_to_total_results
summary_articles_country['country'] = summary_articles_country.index #create country column to join on
summary_articles_country.index.names = ['index']
articles_by_region = pd.merge(country_population_by_region, summary_articles_country, how = 'left', left_on = 'Name', right_on = 'country')

In [58]:
#sum total articles & good articles by region (numeric columns only)
articles_by_region = articles_by_region.groupby('region').sum(numeric_only=True)

#resetting the prop_good_to_total value, since summing per-country proportions is not meaningful
articles_by_region['prop_good_to_total'] = articles_by_region['total_good_articles']/articles_by_region['total_articles']

#select columns of interest
articles_by_region = articles_by_region[['Population', 'prop_good_to_total', 'total_good_articles', 'total_articles']]

In [59]:
#adding in articles as a proportion of population column

articles_by_region['prop_articles_to_population'] = articles_by_region['total_articles']/articles_by_region['Population']

In [60]:
display(articles_by_region)

Unnamed: 0_level_0,Population,prop_good_to_total,total_good_articles,total_articles,prop_articles_to_population
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
CARIBBEAN,42747000,0.018705,13.0,695.0,1.6e-05
CENTRAL AMERICA,178612000,0.014906,23.0,1543.0,9e-06
CENTRAL ASIA,74960000,0.028571,7.0,245.0,3e-06
EAST ASIA,1641063000,0.030732,76.0,2473.0,2e-06
EASTERN AFRICA,444970000,0.013989,35.0,2502.0,6e-06
EASTERN EUROPE,291902000,0.031618,118.0,3732.0,1.3e-05
MIDDLE AFRICA,179757000,0.02406,16.0,665.0,4e-06
NORTHERN AFRICA,244345000,0.021135,19.0,899.0,4e-06
NORTHERN AMERICA,368068000,0.054708,104.0,1901.0,5e-06
NORTHERN EUROPE,105852000,0.027106,102.0,3763.0,3.6e-05


## Results

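1. Top 10 countries by coverage: the 10 highest-ranked countries in terms of number of politician articles as a proportion of country population. Note that the table below ranks countries by good/featured (GA and FA) articles per person, using prop_good_to_population, rather than by all articles.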

In [61]:
display(good_population_results.sort_values('prop_good_to_population', ascending=False)[:10])

Unnamed: 0,country,population,total_good_articles,prop_good_to_population
181,Tuvalu,10000,4.0,0.0004
63,Dominica,72000,1.0,1.4e-05
182,Vanuatu,321000,3.0,9e-06
132,Iceland,368000,2.0,5e-06
133,Ireland,5003000,25.0,5e-06
165,Montenegro,622000,2.0,3e-06
69,Martinique,356000,1.0,3e-06
107,Bhutan,730000,2.0,3e-06
177,New Zealand,4987000,13.0,3e-06
153,Romania,19241000,42.0,2e-06


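2. Bottom 10 countries by coverage: the 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population. Note that the table below ranks by GA and FA articles per person (prop_good_to_population), so countries with no such articles tie at zero.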

In [62]:
display(good_population_results.sort_values('prop_good_to_population', ascending=True)[:10])

Unnamed: 0,country,population,total_good_articles,prop_good_to_population
48,Lesotho,2142000,0.0,0.0
53,Belize,419000,0.0,0.0
158,Andorra,82000,0.0,0.0
30,Mozambique,31166000,0.0,0.0
66,Guadeloupe,375000,0.0,0.0
32,Seychelles,98000,0.0,0.0
65,Grenada,113000,0.0,0.0
61,Barbados,287000,0.0,0.0
37,Zambia,18384000,0.0,0.0
23,Djibouti,988000,0.0,0.0


3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [63]:
display(good_to_total_results.sort_values('prop_good_to_total', ascending=False)[:10])

Unnamed: 0_level_0,prop_good_to_total,total_articles,total_good_articles,country
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Korea, North",0.222222,36,8.0,"Korea, North"
Saudi Arabia,0.128205,117,15.0,Saudi Arabia
Romania,0.122449,343,42.0,Romania
Central African Republic,0.121212,66,8.0,Central African Republic
Uzbekistan,0.107143,28,3.0,Uzbekistan
Mauritania,0.104167,48,5.0,Mauritania
Guatemala,0.084337,83,7.0,Guatemala
Dominica,0.083333,12,1.0,Dominica
Syria,0.078125,128,10.0,Syria
Benin,0.076923,91,7.0,Benin


4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [64]:
display(good_to_total_results.sort_values('prop_good_to_total', ascending=True)[:10])

Unnamed: 0_level_0,prop_good_to_total,total_articles,total_good_articles,country
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Solomon Islands,0.0,97,0.0,Solomon Islands
Tonga,0.0,63,0.0,Tonga
Nauru,0.0,52,0.0,Nauru
Namibia,0.0,162,0.0,Namibia
Djibouti,0.0,37,0.0,Djibouti
Mozambique,0.0,58,0.0,Mozambique
Monaco,0.0,40,0.0,Monaco
Eritrea,0.0,16,0.0,Eritrea
Estonia,0.0,148,0.0,Estonia
Moldova,0.0,421,0.0,Moldova


5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [65]:
display(articles_by_region.sort_values('prop_articles_to_population', ascending=False))

Unnamed: 0_level_0,Population,prop_good_to_total,total_good_articles,total_articles,prop_articles_to_population
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
OCEANIA,42999000,0.020154,63.0,3126.0,7.3e-05
NORTHERN EUROPE,105852000,0.027106,102.0,3763.0,3.6e-05
SOUTHERN EUROPE,153216000,0.019946,74.0,3710.0,2.4e-05
WESTERN EUROPE,195479000,0.012281,56.0,4560.0,2.3e-05
CARIBBEAN,42747000,0.018705,13.0,695.0,1.6e-05
EASTERN EUROPE,291902000,0.031618,118.0,3732.0,1.3e-05
SOUTHERN AFRICA,67732000,0.014196,9.0,634.0,9e-06
WESTERN ASIA,280927000,0.034725,89.0,2563.0,9e-06
CENTRAL AMERICA,178612000,0.014906,23.0,1543.0,9e-06
SOUTH AMERICA,429188000,0.013193,40.0,3032.0,7e-06


6. Geographic regions by relative quality: Ranking of geographic regions in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality. The table below is sorted in ascending order, lowest proportion first.

In [66]:
display(articles_by_region.sort_values('prop_good_to_total', ascending=True))

Unnamed: 0_level_0,Population,prop_good_to_total,total_good_articles,total_articles,prop_articles_to_population
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
WESTERN EUROPE,195479000,0.012281,56.0,4560.0,2.3e-05
SOUTH AMERICA,429188000,0.013193,40.0,3032.0,7e-06
EASTERN AFRICA,444970000,0.013989,35.0,2502.0,6e-06
SOUTHERN AFRICA,67732000,0.014196,9.0,634.0,9e-06
CENTRAL AMERICA,178612000,0.014906,23.0,1543.0,9e-06
SOUTH ASIA,1967131000,0.016262,71.0,4366.0,2e-06
WESTERN AFRICA,401108000,0.0187,40.0,2139.0,5e-06
CARIBBEAN,42747000,0.018705,13.0,695.0,1.6e-05
SOUTHERN EUROPE,153216000,0.019946,74.0,3710.0,2.4e-05
OCEANIA,42999000,0.020154,63.0,3126.0,7.3e-05


## Writeup: Reflections and Implications

Write a few paragraphs, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). 

You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning.

What biases did you expect to find in the data (before you started working with it), and why?

What (potential) sources of bias did you discover in the course of your data processing and analysis?

How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you observed?


Before reaching the conclusions in this notebook, I expected to find a small correlation between politician article ratings and the size of the country. In my experience, larger, more dominant countries tend to fill international news sources because they have more of an impact on the world, and consequently politicians in these larger countries tend to be more famous. My expectation was that the larger a politician's following, the more thorough their Wikipedia article would be.

I expected the top 10 countries, both by good/featured articles relative to population and by good/featured articles relative to total articles, to contain several larger, better-known countries, such as the United States, China, and Russia.



## Cited Sources

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

https://pandas.pydata.org/docs/reference/api/pandas.Series.str.isupper.html

https://datascienceparichay.com/article/pandas-groupby-count-of-rows-in-each-group

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html