<a href="https://colab.research.google.com/github/aly-such/data-512-a2/blob/main/hcds_a2_bias.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **A2 - Bias in Data Assignment**
**Alyson Suchodolski**


---


The goal of this assignment is to collect, curate, and analyze data on Wikipedia articles about politicians from many countries, then explore the biases that can arise in this kind of analysis.
We will collect article data from Wikipedia and population data from the Population Reference Bureau.

**Import necessary packages**

In [None]:
# Install ores locally (uncomment the line below if ores is not already installed).
# After installing, comment the line out again and rerun the cell so the output stays clean.
# !pip install ores

# imports
import requests
import json
import pandas as pd
import numpy as np
from ores import api

**Clone the github repository**

In [None]:
!git clone https://github.com/aly-such/data-512-a2.git

fatal: destination path 'data-512-a2' already exists and is not an empty directory.


### **Step 1: Getting the Article and Population Data**
First we must collect the necessary data from our sources below:


*   [Politicians by Country Dataset](https://figshare.com/articles/Untitled_Item/5513449)
*   [World Population Data Sheet](https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit?usp=sharing)



In [None]:
# Politicians by Country dataset - from Figshare
wiki = pd.read_csv('https://raw.githubusercontent.com/aly-such/data-512-a2/main/page_data.csv')

# Population Data - from Population Reference Bureau
pop = pd.read_csv('https://raw.githubusercontent.com/aly-such/data-512-a2/main/WPDS_2020_data.csv')

In [None]:
wiki.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


### **Step 2: Cleaning the Data**

Both datasets contain information we do not need. In the wiki dataframe, some titles in the 'page' column begin with 'Template:'. These are not Wikipedia articles and can be dropped from the dataframe.

In [None]:
# Titles that start with "Template:" are not wiki articles, so drop them from the dataframe
wiki = wiki[~wiki.page.str.startswith("Template:")]

In [None]:
wiki.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
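
As a small illustration of this filtering step, the sketch below compares matching `Template:` anywhere in a title with matching it only as a prefix. The rows are made-up examples (they are not taken from `page_data.csv`), and the variable names are placeholders.

```python
import pandas as pd

# Toy rows for illustration only (hypothetical titles)
pages = pd.DataFrame({
    "page": ["Template:Example-politician-stub", "Jane Doe", "History of Template: usage"],
    "country": ["A", "B", "C"],
})

# str.contains matches the text anywhere in the title
kept_by_contains = pages[~pages["page"].str.contains("Template:", regex=False)]

# str.startswith only matches the namespace prefix
kept_by_prefix = pages[~pages["page"].str.startswith("Template:")]

print(kept_by_contains["page"].tolist())
print(kept_by_prefix["page"].tolist())
```

The prefix check is the narrower of the two, so it only removes rows whose title actually begins with `Template:`.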


For neatness purposes, I have dropped some fields that we will not be using.

In [None]:
pop = pop.drop(columns= ['TimeFrame', 'FIPS', 'Data (M)'])
pop.head()

Unnamed: 0,Name,Type,Population
0,WORLD,World,7772850000
1,AFRICA,Sub-Region,1337918000
2,NORTHERN AFRICA,Sub-Region,244344000
3,Algeria,Country,44357000
4,Egypt,Country,100803000


Now we will create two separate dataframes from the population data. Because only countries appear in the wiki dataset and have a rev_id, we want to keep the regional data in a separate dataframe. Regional rows can be identified by their all-caps names (e.g., SOUTHEAST ASIA), so those rows must be separated out of the country dataframe.

In [None]:
# Separate the regional populations from the country populations
# First, create a dataframe of the Names that are not in all caps (country-level counts)
# These rows will match country values in page_data.csv
country_pop = pop[~pop.Name.str.isupper()]

# Second, create a dataframe of the Names that are in all caps (regional-level counts)
# These rows will not have a match in page_data.csv
region_pop = pop[pop.Name.str.isupper()]

In [None]:
region_pop

Unnamed: 0,Name,Type,Population
0,WORLD,World,7772850000
1,AFRICA,Sub-Region,1337918000
2,NORTHERN AFRICA,Sub-Region,244344000
10,WESTERN AFRICA,Sub-Region,401115000
27,EASTERN AFRICA,Sub-Region,444970000
48,MIDDLE AFRICA,Sub-Region,179757000
58,SOUTHERN AFRICA,Sub-Region,67732000
64,NORTHERN AMERICA,Sub-Region,368193000
67,LATIN AMERICA AND THE CARIBBEAN,Sub-Region,651036000
68,CENTRAL AMERICA,Sub-Region,178611000


In [None]:
country_pop.head()

Unnamed: 0,Name,Type,Population
3,Algeria,Country,44357000
4,Egypt,Country,100803000
5,Libya,Country,6891000
6,Morocco,Country,35952000
7,Sudan,Country,43849000


In [198]:
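# Row labels of every row typed as 'Sub-Region' in the population sheet
region_index = pop[pop['Type'] == 'Sub-Region'].index.to_list()

# Row 168 is excluded by hand (hard-coded position in WPDS_2020_data.csv)
region_index.remove(168)

# Store the heading positions in a single-column dataframe
rdf = pd.DataFrame({'rindex': region_index})

print(rdf.head())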

   rindex
0       1
1       2
2      10
3      27
4      48


In [201]:
reg_copy = region_pop.copy().reset_index(drop = True)

In [202]:
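# Gap between consecutive heading rows
repeat = rdf.diff()

# reg_copy has one extra leading row (WORLD), so each gap lines up with the region that precedes it
reg_copy['repeat_index'] = repeat

# Rows without a computed gap are filled in by hand with 18
reg_copy['repeat_index'] = reg_copy['repeat_index'].fillna(18)

# Drop the WORLD and AFRICA rows
reg_copy.drop([0,1], inplace = True)

print(reg_copy)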

                               Name        Type  Population  repeat_index
2                   NORTHERN AFRICA  Sub-Region   244344000           8.0
3                    WESTERN AFRICA  Sub-Region   401115000          17.0
4                    EASTERN AFRICA  Sub-Region   444970000          21.0
5                     MIDDLE AFRICA  Sub-Region   179757000          10.0
6                   SOUTHERN AFRICA  Sub-Region    67732000           6.0
7                  NORTHERN AMERICA  Sub-Region   368193000           3.0
8   LATIN AMERICA AND THE CARIBBEAN  Sub-Region   651036000           1.0
9                   CENTRAL AMERICA  Sub-Region   178611000           9.0
10                        CARIBBEAN  Sub-Region    43233000          18.0
11                    SOUTH AMERICA  Sub-Region   429191000          14.0
12                             ASIA  Sub-Region  4625927000           1.0
13                     WESTERN ASIA  Sub-Region   280927000          19.0
14                     CENTRAL ASIA  S

In [203]:
# Work on an explicit copy so the in-place update below does not trigger SettingWithCopyWarning
reg_reps = reg_copy[['Name', 'repeat_index']].copy()

# Each gap includes the heading row itself, so subtract 1 to get the number of country rows
reg_reps['repeat_index'] -= 1

reg_reps


Unnamed: 0,Name,repeat_index
2,NORTHERN AFRICA,7.0
3,WESTERN AFRICA,16.0
4,EASTERN AFRICA,20.0
5,MIDDLE AFRICA,9.0
6,SOUTHERN AFRICA,5.0
7,NORTHERN AMERICA,2.0
8,LATIN AMERICA AND THE CARIBBEAN,0.0
9,CENTRAL AMERICA,8.0
10,CARIBBEAN,17.0
11,SOUTH AMERICA,13.0


In [204]:
repeat = reg_reps.loc[reg_reps.index.repeat(reg_reps.repeat_index)]
print(repeat.shape)

rep = repeat['Name'].squeeze().reset_index(drop = True)
print(rep)

(210, 2)
0      NORTHERN AFRICA
1      NORTHERN AFRICA
2      NORTHERN AFRICA
3      NORTHERN AFRICA
4      NORTHERN AFRICA
            ...       
205            OCEANIA
206            OCEANIA
207            OCEANIA
208            OCEANIA
209            OCEANIA
Name: Name, Length: 210, dtype: object


In [205]:
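# Caution: rep is indexed 0..209, but country_pop keeps its original row labels (3, 4, 5, ...).
# Assignment aligns on those labels, not on position, so the regions are shifted
# (Sudan, at label 7, is paired with rep[7]).
country_pop['Sub-Region'] = rep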
print(country_pop.head())

      Name     Type  Population       Sub-Region
3  Algeria  Country    44357000  NORTHERN AFRICA
4    Egypt  Country   100803000  NORTHERN AFRICA
5    Libya  Country     6891000  NORTHERN AFRICA
6  Morocco  Country    35952000  NORTHERN AFRICA
7    Sudan  Country    43849000   WESTERN AFRICA
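
One way to attach a sub-region to each country without counting rows is to forward-fill the all-caps headings down the sheet. This is a sketch under the assumption that every country row sits below its sub-region heading, as in the WPDS file; `pop_demo` and `country_demo` are toy stand-ins rather than the notebook's dataframes.

```python
import pandas as pd

# Toy stand-in for the population sheet: each all-caps heading is followed by its countries
pop_demo = pd.DataFrame({
    "Name": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
             "WESTERN AFRICA", "Benin", "Ghana"],
})

# True on heading rows (all-caps names), False on country rows
is_heading = pop_demo["Name"].str.isupper()

# Keep the heading text on heading rows, leave NaN elsewhere, then carry it downward
pop_demo["Sub-Region"] = pop_demo["Name"].where(is_heading).ffill()

# Country rows keep their own row labels, so no index alignment is needed
country_demo = pop_demo[~is_heading].copy()
print(country_demo)
```

Because the forward-filled column is built on the same dataframe, each country is labelled with the nearest heading above it regardless of how the rows are indexed.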




### **Step 3: Getting Article Quality Predictions**
Next we need a predicted quality score for each article in the Wikipedia dataset. ORES can estimate article quality from an article's revision ID (rev_id). It assigns each article one of the following classes, listed from highest to lowest quality:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article
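
If it helps to treat these classes as an ordered scale, pandas can encode the ranking directly. The sketch below is an illustration only: the sample labels are made up, and the ordering simply mirrors the list above (written from lowest to highest, as pandas expects).

```python
import pandas as pd

# Quality classes from lowest to highest
quality_order = ["Stub", "Start", "C", "B", "GA", "FA"]
quality_dtype = pd.CategoricalDtype(categories=quality_order, ordered=True)

# Made-up sample of predicted classes
labels = pd.Series(["GA", "Stub", "FA", "C"]).astype(quality_dtype)

# With an ordered dtype, "GA or better" becomes a simple comparison
is_high_quality = labels >= "GA"
print(is_high_quality.tolist())
```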

The following code sends the rev_id of each article to ORES, which returns one of the classes above as the predicted quality of that revision.


First, we must connect to the Wikimedia API.

In [206]:
# Start an ORES session using the api package
ores_session = api.Session('https://ores.wikimedia.org', 'DATA512 A2 ams884@uw.edu')

Then we can pull the results from the API endpoint

In [207]:
# Pull the results of the session
result = ores_session.score('enwiki', ['articlequality'], wiki['rev_id'])

Last, we can create a new column that will contain all the predictions found from using ORES.

In [208]:
# Collect the predicted quality class for every revision
# Create an empty list to append to
predictions = []

# Loop through the results of the session to append the scores to our empty list
for prediction in result:
    try:
        predictions.append(prediction['articlequality']['score']['prediction'])
    except (KeyError, TypeError):
        predictions.append(-1)  # appends -1 where there is no prediction
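
To make the lookup in the loop above concrete, here is a minimal sketch that runs on hand-written dictionaries instead of a live ORES session. The nested `articlequality` / `score` / `prediction` keys come from the loop; the shape of the failed entry (an `error` key in place of `score`) is an assumption for illustration, and `extract_prediction` is a hypothetical helper.

```python
def extract_prediction(item, missing=-1):
    """Return the predicted class from one ORES-style result, or `missing` if absent."""
    try:
        return item["articlequality"]["score"]["prediction"]
    except (KeyError, TypeError):
        return missing


# Hand-written stand-ins for session results
sample_results = [
    {"articlequality": {"score": {"prediction": "Stub"}}},
    {"articlequality": {"error": {"type": "RevisionNotFound"}}},  # assumed error shape
]

predictions_demo = [extract_prediction(r) for r in sample_results]
print(predictions_demo)
```

Catching only `KeyError` and `TypeError` keeps unrelated problems, such as a typo in a variable name, from being silently recorded as a missing prediction.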

### **Step 4: Combining the Datasets**
Now that we have our predictions for article quality, we will combine them with the wiki dataframe that holds our article information. We also want to look at these scores relative to the population of each article's country, so we will merge in the country population dataset as well.

We will add a new column to the wiki dataframe that will be our article quality. 

In [209]:
# Create a new dataframe that includes wiki data as well as score data
# (.copy() makes this a separate dataframe rather than another name for wiki)
wiki_scores = wiki.copy()
wiki_scores['article_quality'] = predictions

wiki_scores.head()

Unnamed: 0,page,country,rev_id,article_quality
0,Template:ZambiaProvincialMinisters,Zambia,235107991,-1
1,Bir I of Kanem,Chad,355319463,Stub
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub
3,Template:Uganda-politician-stub,Uganda,391862070,Stub
4,Template:Namibia-politician-stub,Namibia,391862409,Stub


Merge the Wiki and Scoring dataset with the country population data. This will happen as an inner join on the fields 'country' and 'Name', respectively.

In [210]:
# Merge wiki/pred data with country population data
merge_df = wiki_scores.merge(country_pop, left_on='country', right_on='Name', how='inner')

In [211]:
merge_df.head()

Unnamed: 0,page,country,rev_id,article_quality,Name,Type,Population,Sub-Region
0,Template:ZambiaProvincialMinisters,Zambia,235107991,-1,Zambia,Country,18384000,MIDDLE AFRICA
1,Gladys Lundwe,Zambia,757566606,Stub,Zambia,Country,18384000,MIDDLE AFRICA
2,Mwamba Luchembe,Zambia,764848643,Stub,Zambia,Country,18384000,MIDDLE AFRICA
3,Thandiwe Banda,Zambia,768166426,Start,Zambia,Country,18384000,MIDDLE AFRICA
4,Sylvester Chisembele,Zambia,776082926,C,Zambia,Country,18384000,MIDDLE AFRICA
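
An inner join silently drops any row whose country name has no counterpart in the other table. A quick way to see what would be dropped is an outer join with `indicator=True`. The sketch below uses toy rows (`Atlantis` is a made-up country, and the page names are placeholders) and is not part of the original pipeline.

```python
import pandas as pd

# Toy article table; "Atlantis" has no population row
articles = pd.DataFrame({
    "page": ["Page 1", "Page 2", "Page 3"],
    "country": ["Zambia", "Chad", "Atlantis"],
})

# Toy population table; "Egypt" has no article row
populations = pd.DataFrame({
    "Name": ["Zambia", "Chad", "Egypt"],
    "Population": [18384000, 16877000, 100803000],
})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
merged = articles.merge(populations, left_on="country", right_on="Name",
                        how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"]
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["page", "country", "Name", "_merge"]])
```

The `matched` rows are the ones an inner join would return, while `unmatched` lists the rows that fall out on either side.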


Clean up the column names to accurately reflect what we want to show (country, article name, article quality, population, etc.)

In [212]:
# Create single dataframe of wikipedia, prediction, and country population data
# This is to simply rename all the columns to intuitive names

wikipedia_df = pd.DataFrame({
    'country' : merge_df['country'],
    'article_name' : merge_df['page'],
    'revision_id' : merge_df['rev_id'],
    'article_quality_est.' : merge_df['article_quality'],
    'population' : merge_df['Population'],
    'Sub-Region' : merge_df['Sub-Region']
} )

In [213]:
print(wikipedia_df.head())

  country                        article_name  ...  population     Sub-Region
0  Zambia  Template:ZambiaProvincialMinisters  ...    18384000  MIDDLE AFRICA
1  Zambia                       Gladys Lundwe  ...    18384000  MIDDLE AFRICA
2  Zambia                     Mwamba Luchembe  ...    18384000  MIDDLE AFRICA
3  Zambia                      Thandiwe Banda  ...    18384000  MIDDLE AFRICA
4  Zambia                Sylvester Chisembele  ...    18384000  MIDDLE AFRICA

[5 rows x 6 columns]


Some articles might not produce a prediction score, so we will log these in a separate file.

In [214]:
# Drop the rows that did not produce a prediction score
wikipedia_df_final = wikipedia_df.loc[wikipedia_df['article_quality_est.'] != -1]

wikipedia_df_final.head()

Unnamed: 0,country,article_name,revision_id,article_quality_est.,population,Sub-Region
1,Zambia,Gladys Lundwe,757566606,Stub,18384000,MIDDLE AFRICA
2,Zambia,Mwamba Luchembe,764848643,Stub,18384000,MIDDLE AFRICA
3,Zambia,Thandiwe Banda,768166426,Start,18384000,MIDDLE AFRICA
4,Zambia,Sylvester Chisembele,776082926,C,18384000,MIDDLE AFRICA
5,Zambia,Victoria Kalima,776530837,Start,18384000,MIDDLE AFRICA


In [215]:
# Save the rows that did not produce a prediction score to a separate csv
wikipedia_no_score = wikipedia_df.loc[wikipedia_df['article_quality_est.'] == -1]

wikipedia_no_score.head()

Unnamed: 0,country,article_name,revision_id,article_quality_est.,population,Sub-Region
0,Zambia,Template:ZambiaProvincialMinisters,235107991,-1,18384000,MIDDLE AFRICA
62,Chad,Kalthouma Nguembang,762816132,-1,16877000,SOUTHERN AFRICA
156,Zimbabwe,Bybit Lydia Tsomondo,723082034,-1,14863000,MIDDLE AFRICA
290,Zimbabwe,Guy Georgias,807196681,-1,14863000,MIDDLE AFRICA
721,Nigeria,Mahmud Shinkafi,726805368,-1,206140000,EASTERN AFRICA


Export the two dataframes we produced to CSVs.

In [216]:
# Write the dataframes to CSVs
wikipedia_df_final.to_csv('wp_wpds_politicians_by_country.csv', sep = ',')
wikipedia_no_score.to_csv('wp_wpds_countries-no_match.csv', sep = ',')

### **Step 5: Analysis**
After curating the data to our liking, we will now analyze what we have so far. There are a few different metrics we would like to look at, such as the **proportion of articles per population** of each country as well as the **proportion of high quality articles per total number of articles** for each country. We will also want to do this at a regional level (for the 'Names' from the population dataset that are **all caps**).

**High quality** articles can be defined as articles that are FA, or Featured Article, and GA, or Good Article.

Before we can look at results, we must capture these metrics from the dataset we have so far.

Below, we will count the number of articles for each country. This is done with the pandas groupby() method, which is very similar to the GROUP BY clause in SQL.

In [217]:
# Count of Articles by Country
art_country = wikipedia_df_final.groupby('country').count()['article_name'].astype(int).reset_index()
art_country.rename(columns = {'article_name' : 'article_count'}, inplace = True)
art_country.head()

Unnamed: 0,country,article_count
0,Afghanistan,324
1,Albania,459
2,Algeria,119
3,Andorra,34
4,Angola,110


Merge the article counts by country with the country populations. This will allow us to find the proportion of articles relative to each country's population. We will need to compute the proportion manually using:

(Number of Articles/Country Population) * 100

In [218]:
# Merge Count of Articles by Country and Country Population
art_prop = art_country.merge(country_pop, left_on = 'country', right_on = 'Name', how = 'inner')

# Create a Proportion Column of Number of Articles Per the Country's Population
art_prop['percentage'] = (art_prop['article_count'] * 100) / art_prop['Population'] # multiply by 100 to get a percentage

art_prop.head()

Unnamed: 0,country,article_count,Name,Type,Population,Sub-Region,percentage
0,Afghanistan,324,Afghanistan,Country,38928000,SOUTHEAST ASIA,0.000832
1,Albania,459,Albania,Country,2838000,OCEANIA,0.016173
2,Algeria,119,Algeria,Country,44357000,NORTHERN AFRICA,0.000268
3,Andorra,34,Andorra,Country,82000,OCEANIA,0.041463
4,Angola,110,Angola,Country,32522000,MIDDLE AFRICA,0.000338
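
As a quick check of the formula, the sketch below recomputes two of the values in the table above by hand. The helper name `articles_per_population_pct` is hypothetical; the inputs are the Afghanistan and Algeria rows shown in the output.

```python
def articles_per_population_pct(article_count, population):
    """Articles as a percentage of population: (articles / population) * 100."""
    return article_count * 100 / population


# Inputs taken from the table above
afghanistan = articles_per_population_pct(324, 38928000)
algeria = articles_per_population_pct(119, 44357000)

print(round(afghanistan, 6), round(algeria, 6))
```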


Create a dataset that will only contain articles with estimated quality of GA or FA. 

Find the number of high quality articles per country by grouping by country. Again, use the groupby() method.

In [219]:
# Select high quality articles (High Quality = 'FA' and 'GA')
high_quality = wikipedia_df_final[wikipedia_df_final['article_quality_est.'].isin(['FA', 'GA'])]

# Obtain count of high quality articles
high_quality_group = high_quality.groupby('country').count()['article_name'].reset_index()

# Create a dataframe of country and high quality article count
high_quality_df = pd.DataFrame({'country':high_quality_group['country'], 'high_quality_article_count':high_quality_group['article_name']})
high_quality_df.head()

Unnamed: 0,country,high_quality_article_count
0,Afghanistan,13
1,Albania,3
2,Algeria,2
3,Argentina,16
4,Armenia,5


Merge the dataset containing the number of high quality articles with the dataset containing the total number of articles.

Find the proportion of high quality articles per total number of articles for each country by using the following formula:

(Number of High Quality Articles/Total Number of Articles) * 100

In [220]:
# Merge the high quality counts with the total number of articles by country
high_quality_prop = high_quality_df.merge(art_country, on = 'country', how = 'inner')

# Find the percentage of each country's articles that are high quality
high_quality_prop['Percentage of Quality Articles'] = (high_quality_prop['high_quality_article_count'] * 100) / high_quality_prop['article_count']

high_quality_prop.head()

Unnamed: 0,country,high_quality_article_count,article_count,Percentage of Quality Articles
0,Afghanistan,13,324,4.012346
1,Albania,3,459,0.653595
2,Algeria,2,119,1.680672
3,Argentina,16,496,3.225806
4,Armenia,5,196,2.55102


In [221]:
# Count of Articles by Sub-Region
art_sub = wikipedia_df_final.groupby('Sub-Region').count()['article_name'].astype(int).reset_index()
art_sub.rename(columns = {'article_name' : 'article_count'}, inplace = True)
art_sub.head()

Unnamed: 0,Sub-Region,article_count
0,CARIBBEAN,1800
1,CENTRAL AMERICA,2555
2,CENTRAL ASIA,882
3,EAST ASIA,2967
4,EASTERN AFRICA,2573


In [222]:
# Merge Count of Articles by Sub-Region and Sub-Region Population
art_reg_prop = art_sub.merge(region_pop, left_on = 'Sub-Region', right_on = 'Name', how = 'inner')

# Create a Proportion Column of Number of Articles Per the Sub-Region's Population
art_reg_prop['percentage'] = (art_reg_prop['article_count'] * 100) / art_reg_prop['Population'] # multiply by 100 to get a percentage

art_reg_prop.head()

Unnamed: 0,Sub-Region,article_count,Name,Type,Population,percentage
0,CARIBBEAN,1800,CARIBBEAN,Sub-Region,43233000,0.004163
1,CENTRAL AMERICA,2555,CENTRAL AMERICA,Sub-Region,178611000,0.00143
2,CENTRAL ASIA,882,CENTRAL ASIA,Sub-Region,74961000,0.001177
3,EAST ASIA,2967,EAST ASIA,Sub-Region,1641063000,0.000181
4,EASTERN AFRICA,2573,EASTERN AFRICA,Sub-Region,444970000,0.000578


In [223]:
# Select high quality articles (High Quality = 'FA' and 'GA')
high_quality_sub = wikipedia_df_final[wikipedia_df_final['article_quality_est.'].isin(['FA', 'GA'])]

# Obtain count of high quality articles per Sub-Region
high_quality_group_sub = high_quality_sub.groupby('Sub-Region').count()['article_name'].reset_index()

# Create a dataframe of Sub-Region and high quality article count
high_quality_df_sub = pd.DataFrame({'Sub-Region':high_quality_group_sub['Sub-Region'], 'high_quality_article_count':high_quality_group_sub['article_name']})
high_quality_df_sub.head()

Unnamed: 0,Sub-Region,high_quality_article_count
0,CARIBBEAN,26
1,CENTRAL AMERICA,113
2,CENTRAL ASIA,31
3,EAST ASIA,44
4,EASTERN AFRICA,37


In [224]:
# Merge the high quality counts with the total number of articles by Sub-Region
high_quality_prop_sub = high_quality_df_sub.merge(art_sub, on = 'Sub-Region', how = 'inner')

# Find the percentage of each Sub-Region's articles that are high quality
high_quality_prop_sub['Percentage of Quality Articles'] = (high_quality_prop_sub['high_quality_article_count'] * 100) / high_quality_prop_sub['article_count']

high_quality_prop_sub.head()

Unnamed: 0,Sub-Region,high_quality_article_count,article_count,Percentage of Quality Articles
0,CARIBBEAN,26,1800,1.444444
1,CENTRAL AMERICA,113,2555,4.422701
2,CENTRAL ASIA,31,882,3.514739
3,EAST ASIA,44,2967,1.482979
4,EASTERN AFRICA,37,2573,1.43801


### **Step 6: Results**
We finally have the metrics we are looking for and now we can explore the results of our analysis.

We have a few different questions to look at involving top 10s and bottom 10s. 

They are as follows:


1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
6. Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality



To answer these questions, it is easiest to sort our tables in descending order.

In [225]:
# Sort article proportion data by country in descending order
rank_countries = art_prop.sort_values('percentage', ascending=False)

# Sort high quality article proportion by country in descending order
rank_countries_high_quality = high_quality_prop.sort_values('Percentage of Quality Articles', ascending=False)

# Sort article proportion data by Sub-Region in descending order
rank_regions = art_reg_prop.sort_values('percentage', ascending=False)

# Sort high quality article proportion by Sub-Region in descending order
rank_regions_high_quality = high_quality_prop_sub.sort_values('Percentage of Quality Articles', ascending=False)
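
As an aside, pandas also offers `nlargest()` and `nsmallest()` for pulling the top and bottom rows without sorting the whole table first. The sketch below is only an illustration on four rows whose values appear in the coverage results; `coverage` is a placeholder name.

```python
import pandas as pd

# Four rows with coverage percentages from the country results
coverage = pd.DataFrame({
    "country": ["India", "Tuvalu", "China", "Nauru"],
    "percentage": [6.9e-05, 0.55, 8.1e-05, 0.481818],
})

# Highest coverage first
top2 = coverage.nlargest(2, "percentage")

# Lowest coverage first
bottom2 = coverage.nsmallest(2, "percentage")

print(top2["country"].tolist(), bottom2["country"].tolist())
```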

1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [226]:
# Top 10 countries with largest proportion of articles to country population
rank_countries.head(10)

Unnamed: 0,country,article_count,Name,Type,Population,Sub-Region,percentage
169,Tuvalu,55,Tuvalu,Country,10000,,0.55
117,Nauru,53,Nauru,Country,11000,,0.481818
138,San Marino,82,San Marino,Country,34000,,0.241176
110,Monaco,40,Monaco,Country,38000,SOUTHERN EUROPE,0.105263
95,Liechtenstein,29,Liechtenstein,Country,39000,SOUTHERN EUROPE,0.074359
104,Marshall Islands,37,Marshall Islands,Country,57000,,0.064912
164,Tonga,63,Tonga,Country,99000,,0.063636
70,Iceland,205,Iceland,Country,368000,EASTERN EUROPE,0.055707
3,Andorra,34,Andorra,Country,82000,OCEANIA,0.041463
52,Federated States of Micronesia,38,Federated States of Micronesia,Country,106000,,0.035849


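Tuvalu has the largest proportion of politician articles relative to its population (55 articles for about 10,000 people). Tuvalu is also one of the smallest countries by population, and every country in the top 10 has fewer than 400,000 people.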

2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [227]:
# Bottom 10 countries with smallest proportion of articles to country population
rank_countries.tail(10)

Unnamed: 0,country,article_count,Name,Type,Population,Sub-Region,percentage
114,Mozambique,60,Mozambique,Country,31166000,EASTERN AFRICA,0.000193
13,Bangladesh,320,Bangladesh,Country,169809000,SOUTHEAST ASIA,0.000188
162,Thailand,112,Thailand,Country,66534000,NORTHERN EUROPE,0.000168
84,"Korea, North",39,"Korea, North",Country,25779000,WESTERN EUROPE,0.000151
181,Zambia,25,Zambia,Country,18384000,MIDDLE AFRICA,0.000136
51,Ethiopia,105,Ethiopia,Country,114916000,EASTERN AFRICA,9.1e-05
176,Uzbekistan,29,Uzbekistan,Country,34174000,SOUTHEAST ASIA,8.5e-05
34,China,1134,China,Country,1402385000,NORTHERN EUROPE,8.1e-05
72,Indonesia,213,Indonesia,Country,271739000,NORTHERN EUROPE,7.8e-05
71,India,972,India,Country,1400100000,SOUTHEAST ASIA,6.9e-05


India has the smallest proportion of politician articles relative to its population. I imagine this is because the population is so large: every country in the bottom 10 has a population above 18 million.

3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [231]:
# Top 10 countries with largest proportion of high quality articles to total number of articles
rank_countries_high_quality.head(10)

Unnamed: 0,country,high_quality_article_count,article_count,Percentage of Quality Articles
63,"Korea, North",8,39,20.512821
109,Saudi Arabia,15,118,12.711864
106,Romania,42,348,12.068966
23,Central African Republic,8,68,11.764706
140,Uzbekistan,3,29,10.344828
82,Mauritania,5,52,9.615385
46,Guatemala,7,84,8.333333
33,Dominica,1,12,8.333333
125,Syria,10,131,7.633588
138,United States,80,1068,7.490637


Interestingly, North Korea has the highest percentage of quality politician articles of any country, although the counts are small (8 of 39 articles). I remember it was stated in class that quality depends on the country/language an article was written in, so this result reflects only the English Wikipedia articles. I imagine that the percentage could be even higher if we were to look at the Korean versions of these Wikipedia articles.

4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [229]:
# Bottom 10 countries with smallest proportion of high quality articles to total number of articles
rank_countries_high_quality.tail(10)

Unnamed: 0,country,high_quality_article_count,article_count,Percentage of Quality Articles
87,Morocco,1,208,0.480769
73,Lithuania,1,248,0.403226
27,Colombia,1,288,0.347222
104,Portugal,1,321,0.311526
94,Nigeria,2,681,0.293686
101,Peru,1,354,0.282486
89,Nepal,1,358,0.27933
124,Switzerland,1,406,0.246305
128,Tanzania,1,407,0.2457
10,Belgium,1,522,0.191571


Belgium has the lowest percentage of quality articles among countries with at least one high quality article (countries with none are dropped by the inner merge). Even though Belgium has 522 articles on politicians, only 1 of them is high quality. I wonder why that is?

In [230]:
# Top 10 Sub-Regions with largest proportion of articles to Sub-Region population
rank_regions.head(10)

Unnamed: 0,Sub-Region,article_count,Name,Type,Population,percentage
10,OCEANIA,5657,OCEANIA,Sub-Region,43155000,0.013109
0,CARIBBEAN,1800,CARIBBEAN,Sub-Region,43233000,0.004163
15,SOUTHERN EUROPE,5753,SOUTHERN EUROPE,Sub-Region,153251000,0.003754
9,NORTHERN EUROPE,2959,NORTHERN EUROPE,Sub-Region,105990000,0.002792
1,CENTRAL AMERICA,2555,CENTRAL AMERICA,Sub-Region,178611000,0.00143
2,CENTRAL ASIA,882,CENTRAL ASIA,Sub-Region,74961000,0.001177
17,WESTERN ASIA,3096,WESTERN ASIA,Sub-Region,280927000,0.001102
5,EASTERN EUROPE,2935,EASTERN EUROPE,Sub-Region,291902000,0.001005
18,WESTERN EUROPE,1357,WESTERN EUROPE,Sub-Region,195479000,0.000694
6,MIDDLE AFRICA,1200,MIDDLE AFRICA,Sub-Region,179757000,0.000668


In [232]:
# Top 10 Sub-Regions with largest proportion of high quality articles to total number of articles
rank_regions_high_quality.head(10)

Unnamed: 0,Sub-Region,high_quality_article_count,article_count,Percentage of Quality Articles
1,CENTRAL AMERICA,113,2555,4.422701
11,SOUTH ASIA,50,1199,4.170142
8,NORTHERN EUROPE,109,2959,3.683677
2,CENTRAL ASIA,31,882,3.514739
10,SOUTH AMERICA,26,957,2.716823
17,WESTERN EUROPE,36,1357,2.652911
9,OCEANIA,143,5657,2.527842
7,NORTHERN AFRICA,17,674,2.522255
15,WESTERN AFRICA,35,1455,2.405498
14,SOUTHERN EUROPE,117,5753,2.033722
