## Considering Bias in Data

For this activity, we are given two CSV files: <br> 
* One with a list of politicians, their Wikipedia pages, and the country each politician is from. 
* One with the population of each country and the total population of each region (the sum of the populations of its countries).

In [1]:
import json, time, urllib.parse
import pandas as pd
import numpy as np
import re
import requests
import matplotlib.pyplot as plt
from datetime import datetime

The Wikipedia Category:Politicians by nationality was crawled to generate a list of Wikipedia article pages about politicians from a wide range of countries. This data is in the homework folder as politicians_by_country_SEPT.2022.csv. <br>
The population data is available in CSV format as population_by_country_2022.csv from the homework folder. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.

### API:Info request to get a range of metadata on an article, including the most current revision ID of the article page.

This API gets the metadata for a specified article. We use this API to get the last revision ID since we will need this to get the quality score of the article.
The documentation is available at - [API:Info Documentation](https://www.mediawiki.org/wiki/API:Info)

The below code invokes the API with some added latency so as to not send in more than 100 requests per second. This code has been taken from - [Page Info API Code](https://colab.research.google.com/drive/1Z8DqXpHmNUJ3RD7e-OOwx2WYJPIYjUPp#scrollTo=VfZg8IQ-49pP), which is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/)

In [2]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<anushka8@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
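
As a rough check of the throttling arithmetic, the sketch below recomputes the wait time from the two constants above. The article count and the assumption of two requests per article (one page-info call, one ORES call) are taken from later cells of this notebook. The runtime is only a lower bound, since real network latency is probably well above the assumed 2 ms.

```python
# Recompute the throttle wait from the constants used in this notebook
MAX_REQUESTS_PER_SECOND = 100      # ceiling stated in the text
API_LATENCY_ASSUMED = 0.002        # assumed 2 ms per request

throttle_wait = (1.0 / MAX_REQUESTS_PER_SECOND) - API_LATENCY_ASSUMED

# Rough lower bound on total runtime: one page-info call plus one ORES call per article
n_articles = 7582                  # rows left after de-duplication (see below)
requests_total = 2 * n_articles
min_runtime_seconds = requests_total * (throttle_wait + API_LATENCY_ASSUMED)

print(f"sleep per request: {throttle_wait:.3f} s")
print(f"lower bound on runtime: {min_runtime_seconds:.0f} s")
```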


In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
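    # Normalize curly double quotes to straight quotes
    article_title = article_title.replace('”','"').replace('“','"')
    # Work on a copy so the shared PAGEINFO_PARAMS_TEMPLATE default is not mutated between calls
    request_template = dict(request_template, titles=article_title)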
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
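
To make the shape of the response concrete, here is a minimal sketch of pulling `lastrevid` out of a page-info result. The sample dictionaries are hand-written approximations of the API's JSON (the page ID 12345 is made up), and `extract_lastrevid` is a hypothetical helper introduced here, not part of the borrowed code.

```python
# Hand-written stand-ins for the JSON returned by request_pageinfo_per_article()
found = {"query": {"pages": {"12345": {"pageid": 12345,
                                       "title": "Shahjahan Noori",
                                       "lastrevid": 1099689043}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}


def extract_lastrevid(info):
    """Return the latest revision ID from a page-info response, or None if absent."""
    if not info:
        return None
    pages = info.get("query", {}).get("pages", {})
    # One title is requested at a time, so at most one page entry is expected
    for page in pages.values():
        return page.get("lastrevid")
    return None


print(extract_lastrevid(found))
print(extract_lastrevid(missing))
print(extract_lastrevid(None))
```

Returning `None` instead of raising keeps the "no revision available" case explicit, which is the case the scoring loop later has to handle.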

### ORES Request to get the quality of the article from its last revision ID

From the last revision ID that we receive from the Page Info API, we can get the quality of the article. [ORES](https://www.mediawiki.org/wiki/ORES) is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst: <br>
FA - Featured article <br>
GA - Good article <br>
B - B-class article <br>
C - C-class article <br>
Start - Start-class article <br>
Stub - Stub-class article
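
Since these labels are ordered, it can help to encode that order explicitly. The sketch below is illustrative only: `QUALITY_ORDER`, `quality_rank`, and `is_high_quality` are names introduced here, and treating FA and GA as "high quality" follows the definition used later in this notebook.

```python
# Quality classes from best to worst, as listed above
QUALITY_ORDER = ["FA", "GA", "B", "C", "Start", "Stub"]
HIGH_QUALITY = {"FA", "GA"}


def quality_rank(label):
    """Return 0 for FA through 5 for Stub; raises ValueError for an unknown label."""
    return QUALITY_ORDER.index(label)


def is_high_quality(label):
    """True when the label is one of the two top classes."""
    return label in HIGH_QUALITY


labels = ["Start", "GA", "Stub", "FA", "B"]
print(sorted(labels, key=quality_rank))
print([label for label in labels if is_high_quality(label)])
```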


The below code invokes the API with some added latency so as to not send in more than 100 requests per second. The API documentation can be accessed from the main [ORES](https://ores.wikimedia.org) page.

The code for this API call has been taken from [ORES API Call](https://colab.research.google.com/drive/1rZdBrtCe9XO4IkDWqm0A2RA-HfZCsqHh#scrollTo=ptXgvuy649Qz), which is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/).

In [4]:
#########
#
#    CONSTANTS
#

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<anushka8@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}
#
# The current ML models for English wikipedia are:
#   "articlequality"
#   "articletopic"
#   "damaging"
#   "version"
#   "draftquality"
#   "drafttopic"
#   "goodfaith"
#   "wp10"
#
# The specific documentation on these is scattered so if you want to use one you'll have to look around.
#

In [5]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id in a copy of the template, leaving the shared default untouched
    request_template = dict(request_template, revid=article_revid)
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
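
As a small illustration of what the function above builds and returns, the sketch below assembles the request URL from the same template and reads the prediction out of a response. The sample response is a hand-written, trimmed approximation of the ORES JSON (probabilities omitted), and `build_ores_url` and `extract_prediction` are hypothetical helpers, not part of the borrowed code.

```python
# Endpoint and path template copied from the constants cell above
ORES_ENDPOINT = "https://ores.wikimedia.org/v3"
ORES_PATH = "/scores/{context}/{revid}/{model}"


def build_ores_url(revid, context="enwiki", model="articlequality"):
    """Fill the path template and append it to the endpoint."""
    return ORES_ENDPOINT + ORES_PATH.format(context=context, revid=revid, model=model)


def extract_prediction(score, revid, context="enwiki", model="articlequality"):
    """Return the predicted class, or None when the score is missing or is an error payload."""
    if not score:
        return None
    entry = score.get(context, {}).get("scores", {}).get(str(revid), {}).get(model, {})
    return entry.get("score", {}).get("prediction")


# Hand-written stand-in for a successful response
sample = {"enwiki": {"scores": {"1099689043": {"articlequality": {"score": {"prediction": "GA"}}}}}}

print(build_ores_url(1085687913))
print(extract_prediction(sample, 1099689043))
```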

We read the politicians and population CSV files into two dataframes.

In [51]:
df_politicians = pd.read_csv('politicians_by_country_SEPT.2022.csv')
df_population = pd.read_csv('population_by_country_2022.csv')

In [7]:
df_politicians.head()

Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan


We then check the data for inconsistencies. Some rows in the politicians dataframe are repeated with exactly the same name, url, and country. Four rows are flagged this way (counting every copy, since `keep = False` marks all members of a duplicated group). We drop the repeats and keep one copy of each.

In [52]:
duplicate_records = df_politicians[df_politicians.duplicated(subset=['name', 'url', 'country'], keep = False)]
print(f'''There are {duplicate_records.shape[0]} duplicate records with same name, url and country names''')

There are 4 duplicate records with same name, url and country names


In [53]:
df_politicians = df_politicians[~df_politicians.duplicated(subset=['name', 'url', 'country'], keep = 'last')]
df_politicians.shape

(7582, 3)
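
The difference between the two `keep` settings used above is easy to miss, so here is a toy sketch of it. The data is made up purely for illustration.

```python
import pandas as pd

# Made-up rows: "A" and "B" each appear twice, "C" once
toy = pd.DataFrame({
    "name":    ["A", "B", "A", "C", "B"],
    "url":     ["u/A", "u/B", "u/A", "u/C", "u/B"],
    "country": ["X", "Y", "X", "Z", "Y"],
})
cols = ["name", "url", "country"]

# keep=False flags every member of a duplicated group (here 4 rows)
n_flagged = toy.duplicated(subset=cols, keep=False).sum()

# keep='last' flags only the earlier copies, so filtering them out removes 2 rows
deduped = toy[~toy.duplicated(subset=cols, keep="last")]

print(n_flagged, len(toy), len(deduped))
```

In other words, the count printed with `keep = False` is the number of rows involved in duplication, not the number of rows that end up being removed.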

### Obtaining Revision ID and Article Quality Score
We invoke the above two APIs for page info and quality score. We go over each row in the politicians dataframe and get these metrics and add them to new columns. <br>
We put these API calls in try-except-else blocks because there are certain articles for which the last revision ID is not available. We put all the articles for which the score could not be calculated in a list - **unscored_article_list**. The values in this list are - <br>
'Prince Ofosu Sefah', 'Harjit Kaur Talwandi', 'Abd al-Razzaq al-Hasani', 'Kang Sun-nam', 'Abiodun Abimbola Orekoya', 'Roman Konoplev'

In [11]:
revision_id_list = []
article_quality_list = []
unscored_article_list = []
for index, row in df_politicians.iterrows():
    info = request_pageinfo_per_article(row['name'])
    try : 
        revision_id = info['query']['pages'][list(info['query']['pages'].keys())[0]]['lastrevid']
    except : 
        unscored_article_list.append(row['name'])
        revision_id_list.append("")
        article_quality_list.append("")
    else : 
        score = request_ores_score_per_article(revision_id)
        try : 
            article_quality = score['enwiki']['scores'][str(revision_id)]["articlequality"]["score"]["prediction"]
        except : 
            unscored_article_list.append(row['name'])
            revision_id_list.append("")
            article_quality_list.append("")
        else :
            revision_id_list.append(revision_id)
            article_quality_list.append(article_quality)
    
df_politicians["revision_id"] = revision_id_list
df_politicians["article_quality"] = article_quality_list
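
The loop above can be exercised without touching the network by passing the two request functions in as arguments. The following is a sketch of that idea, not the notebook's actual code: `score_articles`, `fake_info`, and `fake_score` are hypothetical names, the fake responses are hand-written, and catching specific exception types rather than a bare `except` is a variation introduced here.

```python
def score_articles(titles, fetch_info, fetch_score):
    """Return (rows, unscored); each row is a (title, revision_id, quality) tuple."""
    rows, unscored = [], []
    for title in titles:
        revid, quality = "", ""
        try:
            info = fetch_info(title)
            page = next(iter(info["query"]["pages"].values()))
            candidate = page["lastrevid"]
            score = fetch_score(candidate)
            quality = score["enwiki"]["scores"][str(candidate)]["articlequality"]["score"]["prediction"]
            revid = candidate
        except (KeyError, TypeError, StopIteration):
            # Missing page, missing revision ID, failed request (None), or an ORES error payload
            unscored.append(title)
            revid, quality = "", ""
        rows.append((title, revid, quality))
    return rows, unscored


def fake_info(title):
    """Stand-in for request_pageinfo_per_article with hand-written responses."""
    if title == "Missing Person":
        return {"query": {"pages": {"-1": {"title": title, "missing": ""}}}}
    return {"query": {"pages": {"1": {"title": title, "lastrevid": 111}}}}


def fake_score(revid):
    """Stand-in for request_ores_score_per_article that always predicts Stub."""
    return {"enwiki": {"scores": {str(revid): {"articlequality": {"score": {"prediction": "Stub"}}}}}}


rows, unscored = score_articles(["Real Person", "Missing Person"], fake_info, fake_score)
print(rows)
print(unscored)
```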

In [13]:
df_politicians.to_csv('politicians_article_quality.csv')

In [85]:
df_politicians.shape

(7582, 5)

In [60]:
print(f'''There are {len(unscored_article_list)} records for which the score could not be calculated.''')

There are 6 records for which the score could not be calculated.


In [61]:
print(unscored_article_list)

['Prince Ofosu Sefah', 'Harjit Kaur Talwandi', 'Abd al-Razzaq al-Hasani', 'Kang Sun-nam', 'Abiodun Abimbola Orekoya', 'Roman Konoplev']


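As part of checking for data inconsistencies, we now list the countries whose population is recorded as 0. Populations are given in millions and appear to be rounded to one decimal place, so these are very small countries whose population rounds down to 0.0.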

In [17]:
# Listing countries with 0 population
zero_population = df_population[df_population['Population (millions)'] == 0]
zero_population

Unnamed: 0,Geography,Population (millions)
183,Liechtenstein,0.0
185,Monaco,0.0
211,San Marino,0.0
223,Nauru,0.0
226,Palau,0.0
231,Tuvalu,0.0


We now modify the population dataset to classify each country into the region they belong to, and add region as a column in the dataframe. Another way of doing this is to modify the csv file and manually fill the region of each country in the new column.

In [18]:
# Classifying countries into regions
df_population['shifted'] = df_population['Geography'].shift(-1)
df_population = df_population[~((df_population['Geography'].str.isupper() == True) & (df_population['shifted'].str.isupper() == True))].iloc[:,0:2].reset_index().drop('index', axis = 1)
# Obtain the regions, which are written in upper case in the data
regions = pd.DataFrame()
regions['region'] = df_population[df_population['Geography'].str.isupper()]['Geography']
regions['flag'] = np.arange(1, len(regions['region']) + 1)
# Obtain the country and region linkage
df = df_population.merge(regions, left_on = "Geography", right_on = "region", how = 'left').iloc[:,[0,1,3]]
# Fill the region column for linkage region to be populated
df['flag'] = df['flag'].expanding().max()
df_population = df.merge(regions, on = "flag", how = 'inner')
df_population = df_population.iloc[:,[0, 1, 3]]
# Filter for only countries in the geography column
df_population = df_population[df_population['Geography'] != df_population['region']]
df_population.head()

Unnamed: 0,Geography,Population (millions),region
1,Algeria,44.9,NORTHERN AFRICA
2,Egypt,103.5,NORTHERN AFRICA
3,Libya,6.8,NORTHERN AFRICA
4,Morocco,36.7,NORTHERN AFRICA
5,Sudan,46.9,NORTHERN AFRICA
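
For comparison, the same country-to-region linkage can probably be obtained more directly with a forward fill. This is an alternative sketch on toy data, not the code used above. Only the Algeria and Egypt figures come from the output above; the region totals and the Benin row are placeholders.

```python
import pandas as pd

# Toy frame in the same layout as the population file: upper-case rows are
# region headers, and the rows beneath them are countries.
toy = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
                  "WESTERN AFRICA", "Benin"],
    "Population (millions)": [7963.0, 1419.0, 251.0, 44.9, 103.5, 430.0, 13.0],
})

is_region = toy["Geography"].str.isupper()

# Keep the name on region rows, blank it elsewhere, then carry it downwards
toy["region"] = toy["Geography"].where(is_region).ffill()

# Drop the header rows so only countries remain
countries = toy[~is_region].reset_index(drop=True)
print(countries)
```

Because the fill always carries the most recent upper-case row, stacked headers such as WORLD, AFRICA, NORTHERN AFRICA resolve to the innermost region without special handling.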


We calculate the cumulative regional population by grouping on the "region" column.

In [19]:
# Calculating population in a region
region_population = df_population[['region', 'Population (millions)']].groupby('region').sum().reset_index()
region_population = region_population.rename(columns = {'Population (millions)': 'region_population'})
region_population['region_population'] = region_population['region_population'].round().astype('int')

We can now merge the population and politician dataframes on the country and Geography columns to get a final dataframe with all the relevant fields.

In [20]:
# Merging Politicians and Population dataframes
df_main = df_politicians.merge(df_population, left_on = 'country', right_on = 'Geography', how = 'outer')
print(df_main.shape)
df_main.head()

(7607, 8)


Unnamed: 0,name,url,country,revision_id,article_quality,Geography,Population (millions),region
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan,1099689043,GA,Afghanistan,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan,943562276,Start,Afghanistan,41.1,SOUTH ASIA
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,852404094,Start,Afghanistan,41.1,SOUTH ASIA
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,1095102390,B,Afghanistan,41.1,SOUTH ASIA
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,1104998382,Start,Afghanistan,41.1,SOUTH ASIA


We then check the list of countries for which either the population dataset does not have an entry for the equivalent Wikipedia country, or vice-versa.
We save this list in a text file - **wp_countries-no_match.txt**

In [21]:
l1_politican = df_main[df_main['country'].isnull()]['Geography'].unique()
l2_population = df_main[df_main['Geography'].isnull()]['country'].unique()
# Combine the lists to obtain the list of countries without a match
no_match = list(set(np.append(l1_politican, l2_population)))
no_match.sort()
no_match

['Australia',
 'Brunei',
 'Canada',
 'China,  Hong Kong SAR',
 'China,  Macao SAR',
 'Curacao',
 'French Guiana',
 'French Polynesia',
 'Guadeloupe',
 'Guam',
 'Ireland',
 'Kiribati',
 'Korean',
 'Martinique',
 'Mauritius',
 'Mayotte',
 'New Caledonia',
 'New Zealand',
 'Philippines',
 'Puerto Rico',
 'Reunion',
 'Sao Tome and Principe',
 'United Kingdom',
 'United States',
 'Western Sahara',
 'eSwatini']

In [80]:
# Writing the list in a file
with open('wp_countries-no_match.txt', 'w') as f:
    for line in no_match:
        f.write(f"{line}\n")
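
An outer merge can also report which side each unmatched row came from through the `indicator` option of `DataFrame.merge`. The sketch below shows this on toy data as an alternative to the null checks above. The politician names are made up and the Australia population is a placeholder.

```python
import pandas as pd

politicians = pd.DataFrame({
    "name": ["P1", "P2", "P3"],
    "country": ["Afghanistan", "Korean", "Algeria"],
})
population = pd.DataFrame({
    "Geography": ["Afghanistan", "Algeria", "Australia"],
    "Population (millions)": [41.1, 44.9, 26.0],
})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = politicians.merge(population, left_on="country", right_on="Geography",
                           how="outer", indicator=True)

only_politicians = merged.loc[merged["_merge"] == "left_only", "country"]
only_population = merged.loc[merged["_merge"] == "right_only", "Geography"]

no_match = sorted(set(only_politicians) | set(only_population))
print(no_match)
```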

In [22]:
df_main.shape

(7607, 8)

In [23]:
df_main = df_main[(~df_main['country'].isnull()) & (~df_main['Geography'].isnull())]
df_main = df_main.drop('Geography', axis = 1)
df_main.rename(columns={'name': 'article_title', 'Population (millions)': 'country_population'}, inplace=True)

In [25]:
df_main.shape
df_main.head()

Unnamed: 0,article_title,url,country,revision_id,article_quality,country_population,region
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan,1099689043,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan,943562276,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,852404094,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,1095102390,B,41.1,SOUTH ASIA
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,1104998382,Start,41.1,SOUTH ASIA


In [27]:
# Creating a final dataframe that contains articles, article quality, countries, regions, country population, and region population
df_final = df_main.merge(region_population, on = "region", how = 'outer')
df_final.head()


Unnamed: 0,article_title,url,country,revision_id,article_quality,country_population,region,region_population
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan,1099689043,GA,41.1,SOUTH ASIA,2009
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan,943562276,Start,41.1,SOUTH ASIA,2009
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,852404094,Start,41.1,SOUTH ASIA,2009
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,1095102390,B,41.1,SOUTH ASIA,2009
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,1104998382,Start,41.1,SOUTH ASIA,2009


### Saving the merged dataset with required columns
We save the final merged dataset with the required columns in a csv file - **wp_politicians_by_country.csv** <br>
This file contains the list of all articles with their quality, country, population, region and revision ID.

In [81]:
df_politicians_by_country = df_final[['article_title', 'country', 'country_population', 'region', 'revision_id', 'article_quality']].copy()
df_politicians_by_country = df_politicians_by_country.rename(columns = {'country_population': 'population'})
df_politicians_by_country.to_csv('wp_politicians_by_country.csv')


In [83]:
df_politicians_by_country.head()

Unnamed: 0,article_title,country,population,region,revision_id,article_quality
0,Shahjahan Noori,Afghanistan,41.1,SOUTH ASIA,1099689043,GA
1,Abdul Ghafar Lakanwal,Afghanistan,41.1,SOUTH ASIA,943562276,Start
2,Majah Ha Adrif,Afghanistan,41.1,SOUTH ASIA,852404094,Start
3,Haroon al-Afghani,Afghanistan,41.1,SOUTH ASIA,1095102390,B
4,Tayyab Agha,Afghanistan,41.1,SOUTH ASIA,1104998382,Start


### Creating data for analysis

To analyze the top/bottom 10 countries and regions by per-capita article count, we first add the per-capita count, which is the number of articles from a given country divided by the population of that country. Because populations are stored in millions, the code multiplies them by 1,000,000 before dividing. We sort the result in descending order and then create the ranking lists.

In [29]:
# Calculating per capita total article count by country
total_articles_country = df_main[['country', 'country_population', 'article_title']].groupby(['country', 'country_population']).nunique().reset_index()
total_articles_country['articles_per_capita'] = total_articles_country['article_title'] / (total_articles_country['country_population'] * 1000000)
total_articles_country = total_articles_country.sort_values(by = ['articles_per_capita'], ascending = False)
total_articles_country = total_articles_country[total_articles_country['articles_per_capita'] != np.inf]
total_articles_country.head(10)

Unnamed: 0,country,country_population,article_title,articles_per_capita
5,Antigua and Barbuda,0.1,17,0.00017
54,Federated States of Micronesia,0.1,13,0.00013
3,Andorra,0.1,10,0.0001
13,Barbados,0.3,28,9.3e-05
104,Marshall Islands,0.1,9,9e-05
110,Montenegro,0.6,36,6e-05
143,Seychelles,0.1,6,6e-05
97,Luxembourg,0.7,37,5.3e-05
18,Bhutan,0.8,41,5.1e-05
64,Grenada,0.1,5,5e-05


### Thus, we have top 10 and lowest 10 countries by articles per capita

### 5.1 - Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).

The 10 countries with the highest per-capita article count are:

In [31]:
# Calculating top 10 per capita article countries
countries_top10 = total_articles_country.head(10)[['country', 'articles_per_capita']]
countries_top10

Unnamed: 0,country,articles_per_capita
5,Antigua and Barbuda,0.00017
54,Federated States of Micronesia,0.00013
3,Andorra,0.0001
13,Barbados,9.3e-05
104,Marshall Islands,9e-05
110,Montenegro,6e-05
143,Seychelles,6e-05
97,Luxembourg,5.3e-05
18,Bhutan,5.1e-05
64,Grenada,5e-05


In [89]:
from tabulate import tabulate

In [113]:
countries_top10_test = countries_top10
rank=list(range(1,len(countries_top10_test)+1))
countries_top10_test['Rank'] = rank
countries_top10_test.reset_index(drop=True, inplace=True)
countries_top10_test

Unnamed: 0,Rank,country,articles_per_capita
0,1,Antigua and Barbuda,0.00017
1,2,Federated States of Micronesia,0.00013
2,3,Andorra,0.0001
3,4,Barbados,9.3e-05
4,5,Marshall Islands,9e-05
5,6,Montenegro,6e-05
6,7,Seychelles,6e-05
7,8,Luxembourg,5.3e-05
8,9,Bhutan,5.1e-05
9,10,Grenada,5e-05


In [116]:
first_column = countries_top10_test.pop('Rank')
countries_top10_test.insert(0, 'Rank', first_column)
print(tabulate(countries_top10_test, headers=list(countries_top10_test.columns), tablefmt="grid", showindex=False))

+--------+--------------------------------+-----------------------+
|   Rank | country                        |   articles_per_capita |
+========+================================+=======================+
|      1 | Antigua and Barbuda            |           0.00017     |
+--------+--------------------------------+-----------------------+
|      2 | Federated States of Micronesia |           0.00013     |
+--------+--------------------------------+-----------------------+
|      3 | Andorra                        |           0.0001      |
+--------+--------------------------------+-----------------------+
|      4 | Barbados                       |           9.33333e-05 |
+--------+--------------------------------+-----------------------+
|      5 | Marshall Islands               |           9e-05       |
+--------+--------------------------------+-----------------------+
|      6 | Montenegro                     |           6e-05       |
+--------+--------------------------------+-----------------------+
|      7 | Seychelles                     |     
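
The ranking cells in this notebook assign one dataframe name to another and then modify it, which changes the original frame as well. A minimal sketch of adding the Rank column without that side effect is shown below, using the first three rows of the table above. The variable names `top` and `ranked` are introduced here for illustration.

```python
import pandas as pd

top = pd.DataFrame(
    {"country": ["Antigua and Barbuda", "Federated States of Micronesia", "Andorra"],
     "articles_per_capita": [0.00017, 0.00013, 0.0001]},
    index=[5, 54, 3],
)

# reset_index returns a new frame, so `top` itself is left untouched
ranked = top.reset_index(drop=True)

# Insert a 1-based Rank column directly at position 0
ranked.insert(0, "Rank", list(range(1, len(ranked) + 1)))
print(ranked)
```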

### 5.2 - Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order)

The 10 countries with the lowest per-capita article count are:

In [37]:
# The frame is sorted in descending order, so tail(10) holds the 10 lowest values
countries_lowest10 = total_articles_country.tail(10)[['country', 'articles_per_capita']]
countries_lowest10 = countries_lowest10.sort_values(by = ['articles_per_capita'], ascending = True)
countries_lowest10

Unnamed: 0,country,articles_per_capita
32,China,1.392176e-09
106,Mexico,7.843137e-09
140,Saudi Arabia,8.174387e-08
134,Romania,1.052632e-07
73,India,1.263054e-07
153,Sri Lanka,1.339286e-07
48,Egypt,1.352657e-07
53,Ethiopia,2.025932e-07
161,Taiwan,2.155172e-07
180,Vietnam,2.716298e-07


In [118]:
# Add a 1-based Rank column as the first column, then print the table
countries_lowest10_test = countries_lowest10.reset_index(drop=True)
countries_lowest10_test.insert(0, 'Rank', list(range(1, len(countries_lowest10_test) + 1)))
print(tabulate(countries_lowest10_test, headers=list(countries_lowest10_test.columns), tablefmt="grid", showindex=False))

+--------+--------------+-----------------------+
|   Rank | country      |   articles_per_capita |
+========+==============+=======================+
|      1 | China        |           1.39218e-09 |
+--------+--------------+-----------------------+
|      2 | Mexico       |           7.84314e-09 |
+--------+--------------+-----------------------+
|      3 | Saudi Arabia |           8.17439e-08 |
+--------+--------------+-----------------------+
|      4 | Romania      |           1.05263e-07 |
+--------+--------------+-----------------------+
|      5 | India        |           1.26305e-07 |
+--------+--------------+-----------------------+
|      6 | Sri Lanka    |           1.33929e-07 |
+--------+--------------+-----------------------+
|      7 | Egypt        |           1.35266e-07 |
+--------+--------------+-----------------------+
|      8 | Ethiopia     |           2.02593e-07 |
+--------+--------------+-----------------------+
|      9 | Taiwan       |           2.15517e-07 |
+--------+--------------+-----------------------+


### 5.3 Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order)

For this, we take a subset of the merged dataframe and keep only the articles that ORES classifies as high quality, that is, FA (featured article) or GA (good article).

In [39]:
# Getting articles df with high quality articles only
df_high_quality = df_final[df_final['article_quality'].isin(['FA', 'GA'])]
df_high_quality.head()

Unnamed: 0,article_title,url,country,revision_id,article_quality,country_population,region,region_population
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan,1099689043,GA,41.1,SOUTH ASIA,2009
55,Ahmed Wali Karzai,https://en.wikipedia.org/wiki/Ahmed_Wali_Karzai,Afghanistan,1090245979,GA,41.1,SOUTH ASIA,2009
59,Masoud Khalili,https://en.wikipedia.org/wiki/Masoud_Khalili,Afghanistan,1103105365,GA,41.1,SOUTH ASIA,2009
93,Amrullah Saleh,https://en.wikipedia.org/wiki/Amrullah_Saleh,Afghanistan,1115022704,FA,41.1,SOUTH ASIA,2009
107,Nur ul-Haq Ulumi,https://en.wikipedia.org/wiki/Nur_ul-Haq_Ulumi,Afghanistan,1107429109,GA,41.1,SOUTH ASIA,2009


We calculate the article count as previously done, but this time we do it for the high quality articles dataset and thus, get the count of high quality articles for each country. We then calculate the per-capita count, and sort the dataframe in descending order of per-capita value.

In [42]:
total_high_quality_articles_country = df_high_quality[['country', 'country_population', 'article_title']].groupby(['country', 'country_population']).nunique().reset_index()
total_high_quality_articles_country['articles_per_capita'] = total_high_quality_articles_country['article_title'] / (total_high_quality_articles_country['country_population'] * 1000000)

total_high_quality_articles_country.sort_values(by = ['articles_per_capita'], ascending = False, inplace=True)
total_high_quality_articles_country = total_high_quality_articles_country[total_high_quality_articles_country['articles_per_capita'] != np.inf]

We now have the 10 countries with the most high-quality articles per capita.

In [43]:
countries_high_quality_top10 = total_high_quality_articles_country.head(10)[['country', 'articles_per_capita']]
countries_high_quality_top10

Unnamed: 0,country,articles_per_capita
2,Andorra,2e-05
53,Montenegro,5e-06
1,Albania,2.142857e-06
80,Suriname,1.666667e-06
9,Bosnia-Herzegovina,1.470588e-06
49,Lithuania,1.071429e-06
19,Croatia,1.052632e-06
74,Slovenia,9.52381e-07
61,Palestinian Territory,9.259259e-07
28,Gabon,8.333333e-07


In [119]:
# Add a 1-based Rank column as the first column, then print the table
countries_high_quality_top10_test = countries_high_quality_top10.reset_index(drop=True)
countries_high_quality_top10_test.insert(0, 'Rank', list(range(1, len(countries_high_quality_top10_test) + 1)))
print(tabulate(countries_high_quality_top10_test, headers=list(countries_high_quality_top10_test.columns), tablefmt="grid", showindex=False))


+--------+-----------------------+-----------------------+
|   Rank | country               |   articles_per_capita |
+========+=======================+=======================+
|      1 | Andorra               |           2e-05       |
+--------+-----------------------+-----------------------+
|      2 | Montenegro            |           5e-06       |
+--------+-----------------------+-----------------------+
|      3 | Albania               |           2.14286e-06 |
+--------+-----------------------+-----------------------+
|      4 | Suriname              |           1.66667e-06 |
+--------+-----------------------+-----------------------+
|      5 | Bosnia-Herzegovina    |           1.47059e-06 |
+--------+-----------------------+-----------------------+
|      6 | Lithuania             |           1.07143e-06 |
+--------+-----------------------+-----------------------+
|      7 | Croatia               |           1.05263e-06 |
+--------+-----------------------+-----------------------+
|      8 | Slovenia              |           9.52381e-07

### 5.4 Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order)

We similarly calculate the list of 10 countries with lowest high quality articles per capita. We get the bottom 10 countries, sort their values in ascending order by articles-per-capita column, and print the list.

In [46]:
# The frame is sorted in descending order, so tail(10) holds the 10 lowest values
countries_high_quality_bottom10 = total_high_quality_articles_country.tail(10)[['country', 'articles_per_capita']]
countries_high_quality_bottom10 = countries_high_quality_bottom10.sort_values(by = ['articles_per_capita'], ascending = True)
countries_high_quality_bottom10

Unnamed: 0,country,articles_per_capita
35,India,4.2337e-09
84,Thailand,1.497006e-08
39,Japan,1.601281e-08
58,Nigeria,1.830664e-08
91,Vietnam,2.012072e-08
17,Colombia,2.03666e-08
87,Uganda,2.118644e-08
60,Pakistan,2.120441e-08
79,Sudan,2.132196e-08
37,Iran,2.257336e-08


In [120]:
# Add a 1-based Rank column as the first column, then print the table
countries_high_quality_bottom10_test = countries_high_quality_bottom10.reset_index(drop=True)
countries_high_quality_bottom10_test.insert(0, 'Rank', list(range(1, len(countries_high_quality_bottom10_test) + 1)))
print(tabulate(countries_high_quality_bottom10_test, headers=list(countries_high_quality_bottom10_test.columns), tablefmt="grid", showindex=False))



+--------+-----------+-----------------------+
|   Rank | country   |   articles_per_capita |
+========+===========+=======================+
|      1 | India     |           4.2337e-09  |
+--------+-----------+-----------------------+
|      2 | Thailand  |           1.49701e-08 |
+--------+-----------+-----------------------+
|      3 | Japan     |           1.60128e-08 |
+--------+-----------+-----------------------+
|      4 | Nigeria   |           1.83066e-08 |
+--------+-----------+-----------------------+
|      5 | Vietnam   |           2.01207e-08 |
+--------+-----------+-----------------------+
|      6 | Colombia  |           2.03666e-08 |
+--------+-----------+-----------------------+
|      7 | Uganda    |           2.11864e-08 |
+--------+-----------+-----------------------+
|      8 | Pakistan  |           2.12044e-08 |
+--------+-----------+-----------------------+
|      9 | Sudan     |           2.1322e-08  |
+--------+-----------+-----------------------+
|     10 | Iran      |           2.25734e-08 |
+--------+-----------+-----------------------+

### 5.5 Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita

To rank the regions, we count the unique articles in each region by grouping on region and region_population and applying nunique to article_title. The region population is constant within a region, so it is carried through as a grouping key rather than summed.
We then get the per-capita count by dividing the number of articles by the region population. The population is recorded in millions, so it is multiplied by 1,000,000 first.
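
As a small illustration of the calculation described above, the sketch below runs the same groupby on a toy frame. The rows are invented and only the column names follow the notebook. It assumes `region_population` is in millions and shows that `nunique` counts a duplicated article title once.

```python
import pandas as pd

# Toy stand-in for df_final; every value here is invented for illustration
toy = pd.DataFrame({
    "region": ["REGION A", "REGION A", "REGION A", "REGION B"],
    "region_population": [2.0, 2.0, 2.0, 4.0],  # assumed to be in millions
    "article_title": ["Alice", "Bob", "Bob", "Carol"],  # "Bob" appears twice
})

# Count unique article titles per region; the population rides along as a key
per_region = (
    toy.groupby(["region", "region_population"])["article_title"]
    .nunique()
    .reset_index()
)

# Convert millions to people before dividing
per_region["articles_per_capita"] = per_region["article_title"] / (
    per_region["region_population"] * 1_000_000
)

per_region = per_region.sort_values(by="articles_per_capita", ascending=False)
print(per_region)
```

In this toy data REGION A has two unique articles for 2 million people and REGION B has one article for 4 million, so REGION A ranks first.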

In [48]:
total_articles_region = df_final[['region', 'region_population', 'article_title']].groupby(['region', 'region_population']).nunique().reset_index()
total_articles_region['articles_per_capita'] = total_articles_region['article_title'] / (total_articles_region['region_population'] * 1000000)

total_articles_region.sort_values(by = ['articles_per_capita'], ascending = False)


Unnamed: 0,region,region_population,article_title,articles_per_capita
15,SOUTHERN EUROPE,151,879,5.821192e-06
0,CARIBBEAN,44,201,4.568182e-06
18,WESTERN EUROPE,197,699,3.548223e-06
5,EASTERN EUROPE,287,733,2.554007e-06
9,NORTHERN EUROPE,106,261,2.462264e-06
17,WESTERN ASIA,294,685,2.329932e-06
10,OCEANIA,44,86,1.954545e-06
14,SOUTHERN AFRICA,69,118,1.710145e-06
4,EASTERN AFRICA,473,646,1.365751e-06
11,SOUTH AMERICA,434,577,1.329493e-06


In [122]:
# Sort by per-capita coverage (highest first) and renumber the rows
total_articles_region_test = total_articles_region.sort_values(by = ['articles_per_capita'], ascending = False)
total_articles_region_test.reset_index(drop=True, inplace=True)
# Insert a 1-based Rank as the first column
total_articles_region_test.insert(0, 'Rank', list(range(1, len(total_articles_region_test) + 1)))
print(tabulate(total_articles_region_test, headers=list(total_articles_region_test.columns), tablefmt="grid", showindex=False))



+--------+------------------+---------------------+-----------------+-----------------------+
|   Rank | region           |   region_population |   article_title |   articles_per_capita |
+========+==================+=====================+=================+=======================+
|      1 | SOUTHERN EUROPE  |                 151 |             879 |           5.82119e-06 |
+--------+------------------+---------------------+-----------------+-----------------------+
|      2 | CARIBBEAN        |                  44 |             201 |           4.56818e-06 |
+--------+------------------+---------------------+-----------------+-----------------------+
|      3 | WESTERN EUROPE   |                 197 |             699 |           3.54822e-06 |
+--------+------------------+---------------------+-----------------+-----------------------+
|      4 | EASTERN EUROPE   |                 287 |             733 |           2.55401e-06 |
+--------+------------------+---------------------+-----------------+-----------------------+
|      5 | NORTHERN EUROPE  |                 106 |             261 |           2.46226e-06 |
+--------+------------------+---------------------+-----------------+-----------------------+

### 5.6 Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

We repeat the calculation on the high quality dataset created earlier: group the high quality articles by region, count the unique article titles, and divide by the region population to get the high quality articles per capita.
We sort this in descending order to get the rank of regions by high quality articles per capita.

In [49]:
# 6th part - Count unique high quality articles per region
total_high_quality_articles_region = df_high_quality[['region', 'region_population', 'article_title']].groupby(['region', 'region_population']).nunique().reset_index()

# region_population is in millions, so scale it to people before dividing
total_high_quality_articles_region['articles_per_capita'] = total_high_quality_articles_region['article_title'] / (total_high_quality_articles_region['region_population'] * 1000000)

total_high_quality_articles_region.sort_values(by = ['articles_per_capita'], ascending = False)

Unnamed: 0,region,region_population,article_title,articles_per_capita
14,SOUTHERN EUROPE,151,46,3.046358e-07
0,CARIBBEAN,44,8,1.818182e-07
5,EASTERN EUROPE,287,38,1.324042e-07
17,WESTERN EUROPE,197,22,1.116751e-07
16,WESTERN ASIA,294,28,9.52381e-08
8,NORTHERN EUROPE,106,8,7.54717e-08
13,SOUTHERN AFRICA,69,4,5.797101e-08
1,CENTRAL AMERICA,178,10,5.617978e-08
9,OCEANIA,44,2,4.545455e-08
2,CENTRAL ASIA,78,3,3.846154e-08
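
The per-capita values in the output above can be re-derived by hand from the displayed counts. The sketch below repeats the arithmetic for the top three regions in plain Python, assuming (as the multiplication by 1,000,000 in the code suggests) that `region_population` is given in millions. The dictionary name `top_regions` is introduced here for illustration only.

```python
# (region_population in millions, high quality article count), copied from the output above
top_regions = {
    "SOUTHERN EUROPE": (151, 46),
    "CARIBBEAN": (44, 8),
    "EASTERN EUROPE": (287, 38),
}

# articles per capita = article count / (population in millions * 1,000,000)
per_capita = {
    region: articles / (population_millions * 1_000_000)
    for region, (population_millions, articles) in top_regions.items()
}

for region, value in per_capita.items():
    print(f"{region}: {value:.6e}")
```

Each result agrees with the `articles_per_capita` column shown for that region.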


In [123]:
total_high_quality_articles_region_test = total_high_quality_articles_region.sort_values(by = ['articles_per_capita'], ascending = False)
rank=list(range(1,len(total_high_quality_articles_region_test)+1))
total_high_quality_articles_region_test['Rank'] = rank
total_high_quality_articles_region_test.reset_index(drop=True, inplace=True)
first_column = total_high_quality_articles_region_test.pop('Rank')
total_high_quality_articles_region_test.insert(0, 'Rank', first_column)
print(tabulate(total_high_quality_articles_region_test, headers=list(total_high_quality_articles_region_test.columns), tablefmt="grid", showindex=False))



+--------+-----------------+---------------------+-----------------+-----------------------+
|   Rank | region          |   region_population |   article_title |   articles_per_capita |
+========+=================+=====================+=================+=======================+
|      1 | SOUTHERN EUROPE |                 151 |              46 |           3.04636e-07 |
+--------+-----------------+---------------------+-----------------+-----------------------+
|      2 | CARIBBEAN       |                  44 |               8 |           1.81818e-07 |
+--------+-----------------+---------------------+-----------------+-----------------------+
|      3 | EASTERN EUROPE  |                 287 |              38 |           1.32404e-07 |
+--------+-----------------+---------------------+-----------------+-----------------------+
|      4 | WESTERN EUROPE  |                 197 |              22 |           1.11675e-07 |
+--------+-----------------+---------------------+-----------------+-----------------------+
|      5 | WESTERN ASIA    |                 294 |              28 |           9.52381e-08 |
+--------+-----------------+---------------------+-----------------+-----------------------+