In [1]:
import numpy as np
import pandas as pd

# Read the CSV Files

In [2]:
page_data = pd.read_csv("../data/page_data.csv")
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [3]:
population = pd.read_csv("../data/WPDS_2020_data - WPDS_2020_data.csv")
population.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


# Data Processing

page_data.csv contains some page names that start with the string "Template:". These pages are not Wikipedia articles and should not be included in the analysis.

In [4]:
# Filter out the data with page names starting with "Template:"
page_data = page_data[~page_data['page'].str.startswith('Template:')]
# Reset the index
page_data.reset_index(drop=True, inplace=True)
page_data.head()

Unnamed: 0,page,country,rev_id
0,Bir I of Kanem,Chad,355319463
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
2,Yos Por,Cambodia,393822005
3,Julius Gregr,Czech Republic,395521877
4,Edvard Gregr,Czech Republic,395526568


WPDS_2020_data.csv contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the geography field, which is the 'Name' column in this file (e.g. AFRICA, OCEANIA); the 'Type' column marks them as 'World' or 'Sub-Region' rather than 'Country'. These rows won't match the country values in page_data.csv.

In [5]:
# Get the population of countries. (Excluding cumulative regional counts)
population_country = population[population['Type'] == 'Country']
population_country.reset_index(drop=True, inplace=True)
population_country.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,DZ,Algeria,Country,2019,44.357,44357000
1,EG,Egypt,Country,2019,100.803,100803000
2,LY,Libya,Country,2019,6.891,6891000
3,MA,Morocco,Country,2019,35.952,35952000
4,SD,Sudan,Country,2019,43.849,43849000
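
The text above describes regional rows by their ALL CAPS names, while the cell filters on the 'Type' column. As a minimal sketch, the two checks can be compared on a few rows copied from the output above; `population_demo`, `countries_by_caps`, and `countries_by_type` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy rows mirroring the columns shown above (values copied from the output).
population_demo = pd.DataFrame({
    "Name": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Type": ["World", "Sub-Region", "Sub-Region", "Country", "Country"],
    "Population": [7772850000, 1337918000, 244344000, 44357000, 100803000],
})

# Rows whose Name is entirely upper case are treated as regional aggregates.
is_aggregate = population_demo["Name"].str.isupper()
countries_by_caps = population_demo[~is_aggregate].reset_index(drop=True)

# The Type-based filter used in the notebook selects the same rows here.
countries_by_type = (
    population_demo[population_demo["Type"] == "Country"].reset_index(drop=True)
)

print(countries_by_caps.equals(countries_by_type))
```

On the full file the two filters may disagree wherever a row's capitalization and its 'Type' label are inconsistent, so comparing them is a cheap sanity check.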


# Obtain article quality

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. The name was originally an acronym for "Objective Revision Evaluation Service", but the service is now simply called "ORES". It provides estimates of Wikipedia article quality. The article quality estimates are, from best to worst:  

FA - Featured article  
GA - Good article  
B - B-class article  
C - C-class article  
Start - Start-class article  
Stub - Stub-class article  

These were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. These quality classes are a subset of the quality assessment categories developed by Wikipedia editors. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any rev_id you send it.  

In order to get article predictions for each article in the Wikipedia dataset, you will first need to read page_data.csv into Python (or R), and then read through the dataset line by line, using the value of the rev_id column to make an API query.


In [6]:
# Obtain rev_ids as a list
rev_ids = page_data['rev_id'].tolist()

In [7]:
import requests

headers = {
    'User-Agent': 'https://github.com/azhou5211',
    'From': 'ajz55@uw.edu'
}

def api_call(endpoint, rev_id):
    """
    Call the API and return the parsed JSON response.

    :param endpoint: API URL template containing a {rev_id} placeholder
    :param rev_id: Revision ID substituted into the endpoint
    """
    call = requests.get(endpoint.format(rev_id=rev_id), headers=headers)
    response = call.json()
    return response

In [8]:
request_url = 'https://ores.wikimedia.org/v3/scores/enwiki/{rev_id}/articlequality'

In [10]:
# Get the article quality for each wikipedia article
article_quality = []
for rev_id in rev_ids:
    response = api_call(request_url, rev_id)
    try:
        pred = response['enwiki']['scores'][str(rev_id)]['articlequality']['score']['prediction']
    except (KeyError, TypeError):
        # No score in the response for this revision; record it as missing
        pred = np.nan
    article_quality.append(pred)
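
The lookup in the loop above can be factored into a small helper and checked without calling the API. This is a sketch only: `extract_prediction` is a hypothetical name, and the shape of the error entry in `sample_response` is an assumption made for illustration; only the path to `prediction` is taken from the cell above.

```python
import numpy as np


def extract_prediction(response, rev_id):
    """Return the predicted class for rev_id, or NaN if no score is present."""
    try:
        return response["enwiki"]["scores"][str(rev_id)]["articlequality"]["score"]["prediction"]
    except (KeyError, TypeError):
        return np.nan


# Hypothetical response: the second revision carries an error entry
# instead of a score (the error layout is assumed, not taken from ORES docs).
sample_response = {
    "enwiki": {
        "scores": {
            "355319463": {"articlequality": {"score": {"prediction": "Stub"}}},
            "516633096": {"articlequality": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}

predictions = {
    rid: extract_prediction(sample_response, rid)
    for rid in (355319463, 516633096)
}
print(predictions)
```

Keeping the parsing separate from the HTTP call makes the missing-score path easy to test before running the full loop over every rev_id.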

In [15]:
# Add the article quality into the pages dataframe
page_data['quality'] = article_quality
page_data.head()

Unnamed: 0,page,country,rev_id,quality
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub


Wikipedia pages for which ORES did not return a predicted quality:

In [16]:
# List of wikipedia pages that did not have a predicted quality from ORES
page_data[page_data['quality'].isna()]

Unnamed: 0,page,country,rev_id,quality
14,List of politicians in Poland,Poland,516633096,
21,Tingtingru,Vanuatu,550682925,
51,Daud Arsala,Afghanistan,627547024,
75,Book:Two Political Biographies,India,636911471,
180,Dilaver Bey,Turkey,669987106,
...,...,...,...,...
46287,John Rose (Trotskyist),United Kingdom,807336308,
46367,Jalal Movaghar,Iran,807367030,
46368,Mohsen Movaghar,Iran,807367166,
46686,King Gutierrez,Philippines,807479587,


In [17]:
page_data[page_data['quality'].isna()].to_csv("Non-Predicted ORES pages.csv", index=False)

Filter the page_data to include only the values that have a predicted quality

In [18]:
page_data_filtered = page_data[~page_data['quality'].isna()]
page_data_filtered

Unnamed: 0,page,country,rev_id,quality
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub
...,...,...,...,...
46695,Hal Bidlack,United States,807481636,C
46696,Yahya Jammeh,Gambia,807482007,GA
46697,Lucius Fairchild,United States,807483006,C
46698,Fahd of Saudi Arabia,Saudi Arabia,807483153,GA


# Combining the datasets

Merge the Wikipedia data and population data together.   
Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which cannot be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.


In [31]:
df = page_data_filtered.merge(population_country, left_on='country', right_on='Name', how='left')
df

Unnamed: 0,page,country,rev_id,quality,FIPS,Name,Type,TimeFrame,Data (M),Population
0,Bir I of Kanem,Chad,355319463,Stub,TD,Chad,Country,2019.0,16.877,16877000.0
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub,PS,Palestinian Territory,Country,2019.0,5.008,5008000.0
2,Yos Por,Cambodia,393822005,Stub,KH,Cambodia,Country,2019.0,15.497,15497000.0
3,Julius Gregr,Czech Republic,395521877,Stub,,,,,,
4,Edvard Gregr,Czech Republic,395526568,Stub,,,,,,
...,...,...,...,...,...,...,...,...,...,...
46420,Hal Bidlack,United States,807481636,C,US,United States,Country,2019.0,329.878,329878000.0
46421,Yahya Jammeh,Gambia,807482007,GA,GM,Gambia,Country,2019.0,2.417,2417000.0
46422,Lucius Fairchild,United States,807483006,C,US,United States,Country,2019.0,329.878,329878000.0
46423,Fahd of Saudi Arabia,Saudi Arabia,807483153,GA,SA,Saudi Arabia,Country,2019.0,35.041,35041000.0
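
The left merge above only reveals Wikipedia countries with no population row. As a hedged sketch, an outer merge with pandas' `indicator=True` option shows unmatched entries on both sides; the frames below are toy data built from values in the outputs above, and `pages_demo`, `population_demo`, `no_population`, and `no_articles` are illustrative names.

```python
import pandas as pd

# Toy data: "Czech Republic" has no population row, "Libya" has no articles.
pages_demo = pd.DataFrame({
    "page": ["Bir I of Kanem", "Julius Gregr"],
    "country": ["Chad", "Czech Republic"],
})
population_demo = pd.DataFrame({
    "Name": ["Chad", "Libya"],
    "Population": [16877000, 6891000],
})

# indicator=True adds a "_merge" column: left_only, right_only, or both.
merged = pages_demo.merge(
    population_demo, left_on="country", right_on="Name",
    how="outer", indicator=True,
)

no_population = merged.loc[merged["_merge"] == "left_only", "country"].tolist()
no_articles = merged.loc[merged["_merge"] == "right_only", "Name"].tolist()
print(no_population, no_articles)
```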


Filter to only the relevant columns

In [32]:
df = df[['country','page','rev_id','quality','Population']]
# Rename on a new object (not inplace) to avoid SettingWithCopyWarning on the sliced frame
df = df.rename(columns={'page':'article_name','rev_id':'revision_id','quality':'article_quality_est.','Population':'population'})
df


Unnamed: 0,country,article_name,revision_id,article_quality_est.,population
0,Chad,Bir I of Kanem,355319463,Stub,16877000.0
1,Palestinian Territory,Information Minister of the Palestinian Nation...,393276188,Stub,5008000.0
2,Cambodia,Yos Por,393822005,Stub,15497000.0
3,Czech Republic,Julius Gregr,395521877,Stub,
4,Czech Republic,Edvard Gregr,395526568,Stub,
...,...,...,...,...,...
46420,United States,Hal Bidlack,807481636,C,329878000.0
46421,Gambia,Yahya Jammeh,807482007,GA,2417000.0
46422,United States,Lucius Fairchild,807483006,C,329878000.0
46423,Saudi Arabia,Fahd of Saudi Arabia,807483153,GA,35041000.0


Filter the rows that do not have data and output it to 
```wp_wpds_countries-no_match.csv```

In [35]:
df[df['population'].isna()].to_csv("../output_files/wp_wpds_countries-no_match.csv", index=False)
df[df['population'].isna()]

Unnamed: 0,country,article_name,revision_id,article_quality_est.,population
3,Czech Republic,Julius Gregr,395521877,Stub,
4,Czech Republic,Edvard Gregr,395526568,Stub,
29,Salvadoran,Timoteo Menéndez,566504165,Stub,
31,Rhodesian,Gervas Clay,574571582,Stub,
34,"Congo, Dem. Rep. of",Mavua Mudima,592289232,Stub,
...,...,...,...,...,...
46306,Hondura,Juan Ángel Arias Boquín,807445333,Start,
46308,Hondura,Francisco Zelaya y Ayes,807445395,Stub,
46322,Omani,Haitham bin Tariq Al Said,807451487,Stub,
46326,Chechen,Ruslan Yamadayev,807454176,B,


Filter the df to contain non null values. Output the dataframe to ```wp_wpds_politicians_by_country.csv```

In [37]:
df = df[~df['population'].isna()]
df.to_csv("../output_files/wp_wpds_politicians_by_country.csv", index=False)
df

Unnamed: 0,country,article_name,revision_id,article_quality_est.,population
0,Chad,Bir I of Kanem,355319463,Stub,16877000.0
1,Palestinian Territory,Information Minister of the Palestinian Nation...,393276188,Stub,5008000.0
2,Cambodia,Yos Por,393822005,Stub,15497000.0
5,Canada,Robert Douglas Cook,401577829,Stub,38190000.0
6,Egypt,List of Grand Viziers of Egypt,442937236,Stub,100803000.0
...,...,...,...,...,...
46420,United States,Hal Bidlack,807481636,C,329878000.0
46421,Gambia,Yahya Jammeh,807482007,GA,2417000.0
46422,United States,Lucius Fairchild,807483006,C,329878000.0
46423,Saudi Arabia,Fahd of Saudi Arabia,807483153,GA,35041000.0


# Analysis

Calculate the proportion (as a percentage) of articles-per-population and high-quality articles for each country AND for each geographic region.  
By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.

Filter the data to include only "high" quality articles

In [97]:
df_analysis = df[(df['article_quality_est.']=='FA') | (df['article_quality_est.']=='GA')]
df_analysis

Unnamed: 0,country,article_name,revision_id,article_quality_est.,population
267,Panama,Gumercinda Páez,680071857,GA,4283000.0
2086,Myanmar,Pho Hlaing,712325072,GA,54704000.0
3717,Russia,Konstantin Kostin (politician),718512855,GA,146733000.0
6157,Tanzania,Kimweri ye Nyumbai,728211265,GA,59734000.0
6175,France,Auguste-Nicolas Vaillant,728477569,GA,64940000.0
...,...,...,...,...,...
46408,Philippines,Aga Muhlach,807479087,GA,109581000.0
46409,Fiji,Elizabeth II,807479170,FA,896000.0
46419,Greece,George I of Greece,807481543,FA,10700000.0
46421,Gambia,Yahya Jammeh,807482007,GA,2417000.0


Group by country to get the count of high-quality articles

In [98]:
df_analysis_country = df_analysis.groupby('country').agg({"article_quality_est.":'count', 'population':'first'})
df_analysis_country.rename(columns={"article_quality_est.":"count_of_high_quality"}, inplace=True)
df_analysis_country

Unnamed: 0_level_0,count_of_high_quality,population
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,13,38928000.0
Albania,3,2838000.0
Algeria,2,44357000.0
Argentina,16,45377000.0
Armenia,5,2956000.0
...,...,...
Vanuatu,3,321000.0
Venezuela,3,28645000.0
Vietnam,13,96209000.0
Yemen,3,29826000.0
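
Because the grouping above runs on the already-filtered frame, a country with no FA or GA articles does not appear in the result at all. A minimal sketch of one alternative, on toy data with placeholder country names "A" and "B", counts a boolean flag per country so that zero counts are kept; `articles_demo` and `counts` are illustrative names.

```python
import pandas as pd

# Toy data: country "B" has no FA/GA articles.
articles_demo = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "article_quality_est.": ["GA", "Stub", "FA", "Stub", "C"],
})

# Flag high-quality rows instead of filtering them out.
is_high = articles_demo["article_quality_est."].isin(["FA", "GA"])

# Summing the flag counts high-quality articles; counting it gives the total.
counts = (
    articles_demo.assign(is_high=is_high)
    .groupby("country")["is_high"]
    .agg(count_of_high_quality="sum", total_count="count")
)
print(counts)
```

With this layout, countries that have articles but none rated FA or GA would remain eligible for a bottom-10 ranking.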


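Calculate articles per population. As written, the cell below divides the count of high-quality (GA/FA) articles by population and stores the result as a raw fraction (not multiplied by 100), so values such as 3.339499e-07 are proportions rather than percentages.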

In [99]:
df_analysis_country['percentage_of_articles-per-population'] = df_analysis_country['count_of_high_quality']/df_analysis_country['population']
df_analysis_country

Unnamed: 0_level_0,count_of_high_quality,population,percentage_of_articles-per-population
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,13,38928000.0,3.339499e-07
Albania,3,2838000.0,1.057082e-06
Algeria,2,44357000.0,4.508871e-08
Argentina,16,45377000.0,3.526015e-07
Armenia,5,2956000.0,1.691475e-06
...,...,...,...
Vanuatu,3,321000.0,9.345794e-06
Venezuela,3,28645000.0,1.047303e-07
Vietnam,13,96209000.0,1.351225e-07
Yemen,3,29826000.0,1.005834e-07


Get the total count of articles and add it to the analysis dataframe

In [100]:
count_of_articles = df.groupby('country').agg({"article_quality_est.":'count'})
df_analysis_country = df_analysis_country.merge(count_of_articles, how="left", left_index=True, right_index=True)
df_analysis_country

Unnamed: 0_level_0,count_of_high_quality,population,percentage_of_articles-per-population,article_quality_est.
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Afghanistan,13,38928000.0,3.339499e-07,319
Albania,3,2838000.0,1.057082e-06,456
Algeria,2,44357000.0,4.508871e-08,116
Argentina,16,45377000.0,3.526015e-07,491
Armenia,5,2956000.0,1.691475e-06,193
...,...,...,...,...
Vanuatu,3,321000.0,9.345794e-06,58
Venezuela,3,28645000.0,1.047303e-07,130
Vietnam,13,96209000.0,1.351225e-07,187
Yemen,3,29826000.0,1.005834e-07,116


Calculate the percentage of high-quality articles

In [101]:
df_analysis_country['percentage_of_high-quality_articles'] = df_analysis_country['count_of_high_quality']/df_analysis_country['article_quality_est.']
df_analysis_country.rename(columns={'article_quality_est.':'total_count'}, inplace=True)
df_analysis_country

Unnamed: 0_level_0,count_of_high_quality,population,percentage_of_articles-per-population,total_count,percentage_of_high-quality_articles
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Afghanistan,13,38928000.0,3.339499e-07,319,0.040752
Albania,3,2838000.0,1.057082e-06,456,0.006579
Algeria,2,44357000.0,4.508871e-08,116,0.017241
Argentina,16,45377000.0,3.526015e-07,491,0.032587
Armenia,5,2956000.0,1.691475e-06,193,0.025907
...,...,...,...,...,...
Vanuatu,3,321000.0,9.345794e-06,58,0.051724
Venezuela,3,28645000.0,1.047303e-07,130,0.023077
Vietnam,13,96209000.0,1.351225e-07,187,0.069519
Yemen,3,29826000.0,1.005834e-07,116,0.025862
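
The Analysis section asks for both measures as percentages. A minimal sketch of that arithmetic, using Afghanistan's figures from the table above and assuming coverage is meant as total articles over population, is shown here; `demo` and the two `_pct` column names are illustrative and do not appear in the notebook.

```python
import math

import pandas as pd

# Afghanistan's row from the table above.
demo = pd.DataFrame({
    "country": ["Afghanistan"],
    "count_of_high_quality": [13],
    "total_count": [319],
    "population": [38928000.0],
})

# Coverage: all politician articles relative to population, as a percentage.
demo["articles_per_population_pct"] = demo["total_count"] / demo["population"] * 100

# Relative quality: share of a country's articles rated FA or GA, as a percentage.
demo["high_quality_pct"] = demo["count_of_high_quality"] / demo["total_count"] * 100

print(demo[["country", "articles_per_population_pct", "high_quality_pct"]])
```

The second column is simply the 0.040752 fraction from the table scaled by 100; the first differs from the notebook's column because it uses the total article count as the numerator.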


# Results

### Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [102]:
df_analysis_country.sort_values(by=['percentage_of_articles-per-population'], ascending=False).head(10)

Unnamed: 0_level_0,count_of_high_quality,population,percentage_of_articles-per-population,total_count,percentage_of_high-quality_articles
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Tuvalu,4,10000.0,0.0004,54,0.074074
Dominica,1,72000.0,1.4e-05,12,0.083333
Vanuatu,3,321000.0,9e-06,58,0.051724
Iceland,2,368000.0,5e-06,201,0.00995
Ireland,25,5003000.0,5e-06,373,0.067024
Montenegro,2,622000.0,3e-06,72,0.027778
Martinique,1,356000.0,3e-06,34,0.029412
Bhutan,2,730000.0,3e-06,33,0.060606
New Zealand,13,4987000.0,3e-06,783,0.016603
Romania,42,19241000.0,2e-06,343,0.122449


### Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [103]:
df_analysis_country.sort_values(by=['percentage_of_articles-per-population']).head(10)

Unnamed: 0_level_0,count_of_high_quality,population,percentage_of_articles-per-population,total_count,percentage_of_high-quality_articles
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
India,13,1400100000.0,9.285051e-09,968,0.01343
Nigeria,2,206140000.0,9.702144e-09,676,0.002959
Tanzania,1,59734000.0,1.674088e-08,404,0.002475
Ethiopia,2,114916000.0,1.740402e-08,101,0.019802
Bangladesh,3,169809000.0,1.766691e-08,317,0.009464
Colombia,1,49444000.0,2.02249e-08,285,0.003509
Uganda,1,45741000.0,2.186222e-08,185,0.005405
Morocco,1,35952000.0,2.781486e-08,206,0.004854
Brazil,6,211812000.0,2.832701e-08,545,0.011009
China,40,1402385000.0,2.852284e-08,1129,0.03543


### Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [104]:
df_analysis_country.sort_values(by=['percentage_of_high-quality_articles'], ascending=False).head(10)

Unnamed: 0_level_0,count_of_high_quality,population,percentage_of_articles-per-population,total_count,percentage_of_high-quality_articles
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"Korea, North",8,25779000.0,3.103301e-07,36,0.222222
Saudi Arabia,15,35041000.0,4.2807e-07,117,0.128205
Romania,42,19241000.0,2.182839e-06,343,0.122449
Central African Republic,8,4830000.0,1.656315e-06,66,0.121212
Uzbekistan,3,34174000.0,8.778604e-08,28,0.107143
Mauritania,5,4650000.0,1.075269e-06,48,0.104167
Guatemala,7,18066000.0,3.874682e-07,83,0.084337
Dominica,1,72000.0,1.388889e-05,12,0.083333
Syria,10,19398000.0,5.155171e-07,128,0.078125
Benin,7,12209000.0,5.733475e-07,91,0.076923


### Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality


In [105]:
df_analysis_country.sort_values(by=['percentage_of_high-quality_articles']).head(10)

Unnamed: 0_level_0,count_of_high_quality,population,percentage_of_articles-per-population,total_count,percentage_of_high-quality_articles
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Belgium,1,11515000.0,8.684325e-08,519,0.001927
Tanzania,1,59734000.0,1.674088e-08,404,0.002475
Switzerland,1,8634000.0,1.158212e-07,402,0.002488
Nepal,1,29996000.0,3.333778e-08,356,0.002809
Peru,1,32824000.0,3.046551e-08,350,0.002857
Nigeria,2,206140000.0,9.702144e-09,676,0.002959
Portugal,1,10255000.0,9.751341e-08,318,0.003145
Colombia,1,49444000.0,2.02249e-08,285,0.003509
Lithuania,1,2794000.0,3.579098e-07,244,0.004098
Morocco,1,35952000.0,2.781486e-08,206,0.004854


Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

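Obtain the regions from the population dataframe. Rows with Type 'Sub-Region' precede their member countries, so each country takes the name of the most recent 'Sub-Region' row above it.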

In [122]:
region = "NORTHERN AFRICA"

regions = ['WORLD', 'AFRICA', 'NORTHERN AFRICA']
for i in range(3, len(population)):
    if population.iloc[i]['Type']=='Sub-Region':
        region = population.iloc[i]['Name']
    regions.append(region)

In [123]:
population['Region'] = regions
population

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,Region
0,WORLD,WORLD,World,2019,7772.850,7772850000,WORLD
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000,AFRICA
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000,NORTHERN AFRICA
3,DZ,Algeria,Country,2019,44.357,44357000,NORTHERN AFRICA
4,EG,Egypt,Country,2019,100.803,100803000,NORTHERN AFRICA
...,...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000,OCEANIA
230,SB,Solomon Islands,Country,2019,0.715,715000,OCEANIA
231,TO,Tonga,Country,2019,0.099,99000,OCEANIA
232,TV,Tuvalu,Country,2019,0.010,10000,OCEANIA
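
The row-by-row loop above can also be written without explicit iteration. This is a sketch on toy rows taken from the output, assuming the same rule (a country inherits the nearest 'Sub-Region' name above it); `population_demo` and `region_names` are illustrative names.

```python
import pandas as pd

# Toy rows in file order: region rows come before their countries.
population_demo = pd.DataFrame({
    "Name": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
             "OCEANIA", "Tuvalu"],
    "Type": ["World", "Sub-Region", "Sub-Region", "Country", "Country",
             "Sub-Region", "Country"],
})

# Keep the name only on Sub-Region rows, then carry it forward to later rows.
region_names = population_demo["Name"].where(population_demo["Type"] == "Sub-Region")
population_demo["Region"] = region_names.ffill()

print(population_demo)
```

Unlike the loop, this leaves the WORLD row without a region, since no 'Sub-Region' row precedes it; that row is not a country, so it does not take part in the later merge on country names.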


In [124]:
df_analysis_country.reset_index(inplace=True)

Merge the region with analysis

In [125]:
df_analysis_country_ = df_analysis_country.merge(population[['Name', 'Region']], left_on='country', right_on='Name', how='left')
df_analysis_country_.drop(columns=['Name'], inplace=True)
df_analysis_country_

Unnamed: 0,index,country,count_of_high_quality,population,percentage_of_articles-per-population,total_count,percentage_of_high-quality_articles,Region
0,0,Afghanistan,13,38928000.0,3.339499e-07,319,0.040752,SOUTH ASIA
1,1,Albania,3,2838000.0,1.057082e-06,456,0.006579,SOUTHERN EUROPE
2,2,Algeria,2,44357000.0,4.508871e-08,116,0.017241,NORTHERN AFRICA
3,3,Argentina,16,45377000.0,3.526015e-07,491,0.032587,SOUTH AMERICA
4,4,Armenia,5,2956000.0,1.691475e-06,193,0.025907,WESTERN ASIA
...,...,...,...,...,...,...,...,...
141,141,Vanuatu,3,321000.0,9.345794e-06,58,0.051724,OCEANIA
142,142,Venezuela,3,28645000.0,1.047303e-07,130,0.023077,SOUTH AMERICA
143,143,Vietnam,13,96209000.0,1.351225e-07,187,0.069519,SOUTHEAST ASIA
144,144,Yemen,3,29826000.0,1.005834e-07,116,0.025862,WESTERN ASIA


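Group by region and sum the article counts. 'Channel Islands' appears as a region in the output because its row has Type 'Sub-Region' in the population file, so the loop above assigns every country listed after it to that label until the next 'Sub-Region' row.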

In [126]:
df_analysis_region = df_analysis_country_.groupby('Region').agg({'count_of_high_quality':'sum','total_count':'sum'})
df_analysis_region

Unnamed: 0_level_0,count_of_high_quality,total_count
Region,Unnamed: 1_level_1,Unnamed: 2_level_1
CARIBBEAN,13,552
CENTRAL AMERICA,23,1380
CENTRAL ASIA,7,135
Channel Islands,102,3046
EAST ASIA,76,2473
EASTERN AFRICA,35,2294
EASTERN EUROPE,118,3311
MIDDLE AFRICA,16,538
NORTHERN AFRICA,19,761
NORTHERN AMERICA,104,1901


Merge the population value

In [127]:
population_region = population[population['Type']=='Sub-Region']

In [128]:
df_analysis_region_ = df_analysis_region.merge(population_region[['Name', 'Population']], left_index=True, right_on='Name')
df_analysis_region_

Unnamed: 0,count_of_high_quality,total_count,Name,Population
77,13,552,CARIBBEAN,43233000
68,23,1380,CENTRAL AMERICA,178611000
129,7,135,CENTRAL ASIA,74961000
168,102,3046,Channel Islands,172000
157,76,2473,EAST ASIA,1641063000
27,35,2294,EASTERN AFRICA,444970000
189,118,3311,EASTERN EUROPE,291902000
48,16,538,MIDDLE AFRICA,179757000
2,19,761,NORTHERN AFRICA,244344000
64,104,1901,NORTHERN AMERICA,368193000


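Calculate articles-per-population and the share of high-quality articles by region. As in the country-level step, both ratios use the high-quality article count as the numerator and are stored as fractions rather than multiplied by 100.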

In [129]:
df_analysis_region_['percentage_of_articles-per-population'] = df_analysis_region_['count_of_high_quality']/df_analysis_region_['Population']
df_analysis_region_['percentage_of_high-quality_articles'] = df_analysis_region_['count_of_high_quality']/df_analysis_region_['total_count']
df_analysis_region_ = df_analysis_region_[['Name', 'percentage_of_articles-per-population', 'percentage_of_high-quality_articles']]
# Rename on a new object (not inplace) to avoid SettingWithCopyWarning on the sliced frame
df_analysis_region_ = df_analysis_region_.rename(columns={'Name':'Region'})
df_analysis_region_


Unnamed: 0,Region,percentage_of_articles-per-population,percentage_of_high-quality_articles
77,CARIBBEAN,3.006962e-07,0.023551
68,CENTRAL AMERICA,1.287715e-07,0.016667
129,CENTRAL ASIA,9.338189e-08,0.051852
168,Channel Islands,0.0005930233,0.033487
157,EAST ASIA,4.631145e-08,0.030732
27,EASTERN AFRICA,7.865699e-08,0.015257
189,EASTERN EUROPE,4.042453e-07,0.035639
48,MIDDLE AFRICA,8.900905e-08,0.02974
2,NORTHERN AFRICA,7.775922e-08,0.024967
64,NORTHERN AMERICA,2.824606e-07,0.054708


### Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [130]:
df_analysis_region_.sort_values(by=['percentage_of_articles-per-population'], ascending=False)

Unnamed: 0,Region,percentage_of_articles-per-population,percentage_of_high-quality_articles
168,Channel Islands,0.0005930233,0.033487
216,OCEANIA,1.459854e-06,0.022412
200,SOUTHERN EUROPE,4.82868e-07,0.020584
189,EASTERN EUROPE,4.042453e-07,0.035639
110,WESTERN ASIA,3.168083e-07,0.035303
77,CARIBBEAN,3.006962e-07,0.023551
179,WESTERN EUROPE,2.864758e-07,0.012467
64,NORTHERN AMERICA,2.824606e-07,0.054708
58,SOUTHERN AFRICA,1.328766e-07,0.020316
68,CENTRAL AMERICA,1.287715e-07,0.016667


### Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [131]:
df_analysis_region_.sort_values(by=['percentage_of_high-quality_articles'], ascending=False)

Unnamed: 0,Region,percentage_of_articles-per-population,percentage_of_high-quality_articles
64,NORTHERN AMERICA,2.824606e-07,0.054708
129,CENTRAL ASIA,9.338189e-08,0.051852
145,SOUTHEAST ASIA,1.102977e-07,0.036139
189,EASTERN EUROPE,4.042453e-07,0.035639
110,WESTERN ASIA,3.168083e-07,0.035303
168,Channel Islands,0.0005930233,0.033487
157,EAST ASIA,4.631145e-08,0.030732
48,MIDDLE AFRICA,8.900905e-08,0.02974
2,NORTHERN AFRICA,7.775922e-08,0.024967
77,CARIBBEAN,3.006962e-07,0.023551
