In [50]:
%load_ext autoreload
%autoreload 2


### Data Acquisition 

The data required to bootstrap the analysis and acquire article quality was provided in CSV form by the class instructor. The URLs appear in the code below. I do not know the lifespan of these URLs or how they were gathered.

Some notable countries are inexplicably missing from the politician data, e.g. the United States and the United Kingdom, as well as a number of very low-population countries. The full list is printed later in this document, after the datasets are combined.

Intermediate datasets are stored for each step: acquisition, pre-processing, and analysis.
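
The `utils` module is not shown in this notebook. As a rough sketch only, assuming `utils.download_csv(url, path)` does nothing more than fetch a URL and write the response body to a local file, a stand-in could look like the following. The name `download_csv_sketch` and its body are hypothetical, not the actual helper.

```python
import shutil
import urllib.request
from pathlib import Path


def download_csv_sketch(url, dest_path):
    """Hypothetical stand-in for utils.download_csv: save the body of `url` to `dest_path`."""
    dest = Path(dest_path)
    # Create the target directory (e.g. data/) if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Storing the raw download unchanged, as the notebook does, keeps every later step reproducible even if the source URL stops working.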

In [51]:
# Download the raw politician and population CSVs
import utils

import pandas as pd
import json
from unidecode import unidecode

def gen_file_name(id, extension='json'): # TODO move to utils
    t = utils.ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE
    return "dino_monthy_" + id + "_" + t['start'] + t['end'] + "." + extension

# Get politicians_by_country.SEPT.2022.csv
utils.download_csv(
    'https://docs.google.com/spreadsheets/u/0/d/1Y4vSTYENgNE5KltqKZqnRQQBQZN5c8uKbSM4QTt8QGg/export?format=csv', 
    'data/politicans_by_country_raw.csv')

# Get population_by_country_2022.csv
utils.download_csv(
    'https://docs.google.com/spreadsheets/u/0/d/1POuZDfA1sRooBq9e1RNukxyzHZZ-nQ2r6H5NcXhsMPU/export?format=csv', 
    'data/population_by_country_raw.csv')

### Data Processing

Pre-analysis processing includes remapping column names, acquiring ORES ratings for each article, and finally combining the population and politician data sets. The end product of this process is `data/combined.csv`. Some ad hoc "pre-processing" steps are performed in the analysis section as well; they are explicitly stated. 

For each article, we acquire the latest revision ID. (Note that Wikipedia articles are continuously revised, so subsequent executions of this notebook will likely yield different results.) The ORES system consumes this revision ID and returns a predicted quality rating for the article.

Finally, we combine the population and politician datasets by country. However, not all countries had an associated politician. A list of these countries is provided below.
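
The ORES lookup described above returns nested JSON keyed by wiki, revision ID, and model name. As a minimal sketch of pulling out the predicted class, the function below walks that structure. `extract_prediction` is a hypothetical name, and the sample dictionary is trimmed from the response printed later in this notebook.

```python
def extract_prediction(response, revid, wiki="enwiki", model="wp10"):
    """Return the predicted quality class for `revid`, or None if it is absent."""
    key = str(int(revid))  # revision IDs read back from CSV may arrive as floats
    try:
        return response[wiki]["scores"][key][model]["score"]["prediction"]
    except KeyError:
        return None


# Trimmed copy of the ORES response for revision 1099689043.
sample = {
    "enwiki": {
        "models": {"wp10": {"version": "0.9.2"}},
        "scores": {
            "1099689043": {"wp10": {"score": {"prediction": "GA"}}},
        },
    }
}

print(extract_prediction(sample, 1099689043.0))  # GA
print(extract_prediction(sample, 1))             # None
```

Returning `None` for a missing score lets the caller decide whether to drop the article or retry the request.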

In [52]:
df_pop = pd.read_csv('data/population_by_country_raw.csv')
df_pol = pd.read_csv('data/politicans_by_country_raw.csv')

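# Most recent region header seen while walking the rows top to bottom
current_region = ""

def get_region(row):
    """Return the region a row belongs to, relying on row order."""
    global current_region
    c = row['Geography']
    if c == c.upper(): # Region header rows are written in all caps
        current_region = c 
    return current_region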


# Clean population dataset
df_pop['region'] = df_pop.apply(lambda r: get_region(r), axis=1)

df_pop['country'] = df_pop['Geography'] # TODO: just do a rename
df_pop['population'] = df_pop['Population (millions)']

df_pop = df_pop[df_pop['country'] != df_pop['country'].str.upper()]

df_pop.to_csv('data/population_by_country.csv')
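
The cell above tags each country with the most recent all-caps region row by walking the rows in order with a global variable. A possible alternative, shown here only as a sketch on made-up rows (the layout of the real CSV is assumed, and the population values are placeholders), is to mask the region rows and forward-fill:

```python
import pandas as pd

# Toy rows: all-caps names are region headers, the rest are countries.
df = pd.DataFrame({
    "Geography": ["NORTHERN AFRICA", "Algeria", "Egypt", "WESTERN AFRICA", "Benin"],
    "Population (millions)": [1.0, 2.0, 3.0, 4.0, 5.0],  # placeholder values
})

# True for rows whose name is entirely upper case, i.e. region headers.
is_region = df["Geography"] == df["Geography"].str.upper()

# Keep the name only on header rows, then carry it down to the rows below.
df["region"] = df["Geography"].where(is_region).ffill()

# Drop the header rows and rename columns in one step.
countries = (
    df[~is_region]
    .rename(columns={"Geography": "country", "Population (millions)": "population"})
    .reset_index(drop=True)
)
print(countries)
```

This avoids the module-level state, so the result does not depend on how many times the cell has been run.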



In [53]:
# Get the latest revision ID for each article
json_accum = []

def get_latest_revision_id(row):
    j = utils.get_article_revid(row['name'])
    #print(row)
    #print(j)
    try:
        return list(j['query']['pages'].values())[0]['lastrevid']
    except KeyError:
        return None

df_pol1 = df_pol.copy()

df_pol1['revid'] = df_pol.apply(get_latest_revision_id, axis=1)

df_pol1.to_csv('data/politicans_by_country_revid.csv')

In [54]:
df_pol1 = pd.read_csv('data/politicans_by_country_revid.csv')

In [55]:
# Here, we acquire an ORES rating for each article 

def get_article_rating(row):
    id = str(int(row['revid']))
    
    j = utils.get_article_rating(id)
    print(row)
    print(json.dumps(j, indent=2))
    try:
        rating = j['enwiki']['scores'][id]['wp10']['score']['prediction']
        print("RATING: " + rating)
        return rating
    except KeyError: # TODO: remove this
        print("RATING: None")
        return None

# Keep only articles with a revision ID; copy so the new column is not written to a view
df_pol2 = df_pol1[df_pol1['revid'].notna()].copy()

df_pol2['rating'] = df_pol2.apply(get_article_rating, axis=1)

df_pol2.to_csv('data/politicans_by_country_rating.csv')

Unnamed: 0                                                0
name                                        Shahjahan Noori
url           https://en.wikipedia.org/wiki/Shahjahan_Noori
country                                         Afghanistan
revid                                          1099689043.0
Name: 0, dtype: object
{
  "enwiki": {
    "models": {
      "wp10": {
        "version": "0.9.2"
      }
    },
    "scores": {
      "1099689043": {
        "wp10": {
          "score": {
            "prediction": "GA",
            "probability": {
              "B": 0.15912710013204634,
              "C": 0.3317754589473183,
              "FA": 0.029544640342677096,
              "GA": 0.43829652594886387,
              "Start": 0.034888134844429625,
              "Stub": 0.006368139784664634
            }
          }
        }
      }
    }
  }
}
RATING: GA
Unnamed: 0                                                    1
name                                      Abdul Ghafar Lakanwal
url 

In [None]:
df_rat = pd.read_csv('data/politicans_by_country_rating.csv')

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,name,url,country,revid,rating
0,0,0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan,1.099689e+09,GA
1,1,1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan,9.435623e+08,Start
2,2,2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,8.524041e+08,Start
3,3,3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,1.095102e+09,B
4,4,4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,1.104998e+09,Start
...,...,...,...,...,...,...,...
7573,7579,7579,Rekayi Tangwena,https://en.wikipedia.org/wiki/Rekayi_Tangwena,Zimbabwe,1.073819e+09,Stub
7574,7580,7580,Josiah Tongogara,https://en.wikipedia.org/wiki/Josiah_Tongogara,Zimbabwe,1.106932e+09,C
7575,7581,7581,Langton Towungana,https://en.wikipedia.org/wiki/Langton_Towungana,Zimbabwe,9.042468e+08,Stub
7576,7582,7582,Herbert Ushewokunze,https://en.wikipedia.org/wiki/Herbert_Ushewokunze,Zimbabwe,9.591118e+08,Stub


In [None]:
# Combine the population and politician datasets for analysis

df_pop = pd.read_csv("data/population_by_country.csv")

# Countries missing from the rating data set
df_combined = pd.merge(
    df_rat[['name', 'country', 'revid', 'rating']],
    df_pop[['country', 'region', 'population']], 
    how='right', on='country')

print("Countries not represented in the ratings dataset")
print(df_combined[df_combined['name'].isna()]['country'])

# Note: Only the country, "Korean", is dropped in the inner join. It represents politicians before Korea split into North Korea and South Korea
df_combined = pd.merge(
    df_rat[['name', 'country', 'revid', 'rating']],
    df_pop[['country', 'region', 'population']], 
    how='inner', on='country')

df_combined.to_csv("data/combined.csv")

Countries not represented in the ratings dataset
227            Western Sahara
1023                Mauritius
1024                  Mayotte
1034                  Reunion
1652    Sao Tome and Principe
1664                 eSwatini
1771                   Canada
1772            United States
2055                  Curacao
2097               Guadeloupe
2148               Martinique
2149              Puerto Rico
2528            French Guiana
4191                   Brunei
4529              Philippines
4607    China,  Hong Kong SAR
4608        China,  Macao SAR
4945                  Ireland
5115           United Kingdom
7441                Australia
7469         French Polynesia
7470                     Guam
7471                 Kiribati
7483            New Caledonia
7484              New Zealand
Name: country, dtype: object
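
The two merges above can also be examined in a single pass. The sketch below uses the `indicator=True` option of `pd.merge` on made-up data (the country and politician names are invented for illustration) to see which countries appear in only one of the two datasets:

```python
import pandas as pd

# Invented stand-ins for the ratings and population tables.
ratings = pd.DataFrame({
    "name": ["Politician A", "Politician B", "Politician C"],
    "country": ["Atlantis", "Atlantis", "Mu"],
})
population = pd.DataFrame({
    "country": ["Atlantis", "Lemuria"],
    "population": [1.5, 0.3],
})

# An outer join keeps every row; the _merge column records where each came from.
merged = pd.merge(ratings, population, how="outer", on="country", indicator=True)

# Countries present in the population data but with no rated article.
no_articles = merged.loc[merged["_merge"] == "right_only", "country"].tolist()
# Countries with articles but no population row (dropped by an inner join).
no_population = merged.loc[merged["_merge"] == "left_only", "country"].unique().tolist()

print(no_articles)
print(no_population)
```

Rows marked `both` correspond to what the inner join keeps for the analysis.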


### Analysis
Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).
Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.


The following articles are associated with countries with fewer than 100,000 people. Because the population data is reported in millions with a granularity of 0.1 million (100,000 people), these countries appear to have a population of 0. They are excluded from this analysis.
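
To illustrate why a reported population of 0 has to be excluded, here is a small sketch with invented countries and head counts. It assumes the source figures are rounded to one decimal place in millions, which is an assumption about how the dataset was prepared.

```python
import pandas as pd

# Invented countries and head counts, for illustration only.
df = pd.DataFrame({
    "country": ["Tinyland", "Smallia", "Bigland"],
    "people": [39_000, 120_000, 25_400_000],
    "articles": [3, 5, 40],
})

# Population in millions at 0.1-million granularity (assumed rounding).
df["population"] = (df["people"] / 1_000_000).round(1)

# Dividing by a population of 0 gives infinity, which would dominate any ranking.
df["arts_per_cap"] = df["articles"] / df["population"]

# Keep only countries with a usable population figure.
kept = df[df["population"] > 0]
print(df[["country", "population", "arts_per_cap"]])
print(kept["country"].tolist())
```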

In [None]:
print(df_combined[df_combined['population'] == 0]['country'].unique())

df_combined = df_combined[df_combined['population'] > 0]

['Liechtenstein' 'Monaco' 'Nauru' 'Palau' 'San Marino' 'Tuvalu']


In [None]:
def arts_per_cap(df_sub):
    df_sub['arts_per_cap'] = df_sub['country'].count() / df_sub['population']
    return df_sub

#df_combined.groupby(['country'], as_index=False).apply(arts_per_cap)

df_pc = df_combined.groupby(['country'], as_index=False).apply(arts_per_cap)

print("Top 10 countries with highest total articles per capita (descending), per 1 million people")
df_pc.groupby('country')['arts_per_cap'].first().sort_values(ascending=False).head(10)


Top 10 countries with highest total articles per capita (descending), per 1 million people


country
Antigua and Barbuda               170.000000
Federated States of Micronesia    130.000000
Andorra                           100.000000
Barbados                           93.333333
Marshall Islands                   90.000000
Montenegro                         60.000000
Seychelles                         60.000000
Luxembourg                         52.857143
Bhutan                             51.250000
Grenada                            50.000000
Name: arts_per_cap, dtype: float64

Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order).


In [None]:
print("Bottom 10 countries with lowest total articles per capita (ascending), per 1 million people")
df_pc.groupby('country')['arts_per_cap'].first().sort_values().head(10)

Bottom 10 countries with lowest total articles per capita (ascending), per 1 million people


country
China           0.001392
Mexico          0.007843
Saudi Arabia    0.081744
Romania         0.105263
India           0.125600
Sri Lanka       0.133929
Egypt           0.135266
Ethiopia        0.202593
Taiwan          0.215517
Vietnam         0.271630
Name: arts_per_cap, dtype: float64

Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order).

We consider an article high quality if its rating is "GA" or "FA". These articles are assigned a numerical 1; all others are assigned 0. The mean of this value gives the share of a country's articles that are high quality, which is then divided by the country's population.
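
The notebook's `qual_per_cap` divides the mean of the 0/1 flag by population. A count-based reading of "high quality articles per capita" would divide the number of high quality articles by population instead. The sketch below computes both on invented data so the difference is visible; the countries, ratings, and populations are made up.

```python
import pandas as pd

# Invented articles; population is in millions and repeated on every row.
articles = pd.DataFrame({
    "country": ["Atlantis", "Atlantis", "Atlantis", "Lemuria", "Lemuria"],
    "rating": ["GA", "Stub", "FA", "C", "Start"],
    "population": [2.0, 2.0, 2.0, 4.0, 4.0],
})

# 1 for GA or FA, 0 otherwise.
articles["high_quality"] = articles["rating"].isin(["GA", "FA"]).astype(int)

per_country = articles.groupby("country").agg(
    high_quality_count=("high_quality", "sum"),   # number of GA/FA articles
    high_quality_share=("high_quality", "mean"),  # fraction of articles that are GA/FA
    population=("population", "first"),
)

# Count-based metric: high quality articles per 1 million people.
per_country["count_per_million"] = (
    per_country["high_quality_count"] / per_country["population"]
)
# Share-based metric, matching what qual_per_cap computes.
per_country["share_per_million"] = (
    per_country["high_quality_share"] / per_country["population"]
)
print(per_country)
```

The two metrics rank countries differently whenever countries differ in their total number of articles, so it is worth stating which one a table reports.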


In [None]:
df_combined['high_quality'] = df_combined.apply(lambda r: 1 if r['rating'] == "GA" or r['rating'] == "FA" else 0, axis=1)

def qual_per_cap(df_sub):
    df_sub['qual_per_cap'] = df_sub['high_quality'].mean() / df_sub['population']
    return df_sub

df_qual_pc = df_combined.groupby(['country'], as_index=False).apply(qual_per_cap)

print("Top 10 countries with highest high quality articles per capita (descending), per 1 million people")
df_qual_pc.groupby('country')['qual_per_cap'].first().sort_values(ascending=False).head(10)

Top 10 countries with highest high quality articles per capita (descending), per 1 million people


country
Andorra                     2.000000
Gabon                       0.138889
Montenegro                  0.138889
Suriname                    0.072464
Romania                     0.052632
Estonia                     0.028490
Bosnia-Herzegovina          0.028281
Central African Republic    0.027473
Albania                     0.025818
Panama                      0.025253
Name: qual_per_cap, dtype: float64

Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).

Many countries lack any high-quality article, so there is a tie for the bottom 10. They are listed below.

In [None]:
print("Countries tied for lowest high quality articles per capita (ascending), per 1 million people")
l = df_qual_pc.groupby('country')['qual_per_cap'].first().sort_values()
l[l == 0].index

Countries tied for lowest high quality articles per capita (ascending), per 1 million people


Index(['Zimbabwe', 'Israel', 'Iceland', 'Honduras', 'Saint Lucia', 'Guyana',
       'Guinea-Bissau', 'Samoa', 'Seychelles', 'Grenada', 'Greece', 'Ghana',
       'Sierra Leone', 'Gambia', 'Singapore', 'Finland', 'Mozambique',
       'Federated States of Micronesia', 'Solomon Islands', 'Eritrea', 'Italy',
       'Jamaica', 'Qatar', 'Paraguay', 'Namibia', 'Mongolia', 'Moldova',
       'Mexico', 'Nicaragua', 'Marshall Islands', 'Malta', 'Niger', 'Maldives',
       'Equatorial Guinea', 'Malawi', 'Luxembourg', 'North Macedonia',
       'Liberia', 'Lesotho', 'Oman', 'Latvia', 'Laos', 'Zambia', 'Kuwait',
       'Madagascar', 'Egypt', 'Fiji', 'Tonga', 'Angola', 'Timor-Leste', 'Togo',
       'Sri Lanka', 'Brazil', 'Antigua and Barbuda', 'Argentina', 'Botswana',
       'Trinidad and Tobago', 'Turkey', 'Bhutan', 'Turkmenistan', 'Belize',
       'Uzbekistan', 'Barbados', 'Bangladesh', 'Bahrain', 'Bahamas', 'Vanuatu',
       'Austria', 'Tanzania', 'Venezuela', 'Tajikistan', 'Cape Verde',
       'Tai

Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.
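
In the region cells of this notebook, `.first()` on a per-region group returns the value from the first country row in each region. If a pooled regional figure is wanted instead, one option is to divide each region's article count by the summed population of its countries. The sketch below does that on invented data; note that it only counts the population of countries that have at least one article in the combined table.

```python
import pandas as pd

# Invented articles, one row per article; population is in millions.
articles = pd.DataFrame({
    "country": ["Atlantis", "Atlantis", "Lemuria", "Mu"],
    "region": ["EAST", "EAST", "EAST", "WEST"],
    "population": [2.0, 2.0, 6.0, 1.0],
})

# Articles per region: one row per article, so the group size is the count.
article_counts = articles.groupby("region").size()

# Population per region: count each country once before summing.
region_population = (
    articles.drop_duplicates("country").groupby("region")["population"].sum()
)

# Articles per 1 million people, pooled over all countries in the region.
region_per_capita = (article_counts / region_population).sort_values(ascending=False)
print(region_per_capita)
```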


In [None]:
print("Top 10 regions with highest number of articles per capita (descending), per 1 million people")
df_pc.groupby('region')['arts_per_cap'].first().sort_values(ascending=False).head(10)

Top 10 regions with highest number of articles per capita (descending), per 1 million people


region
CARIBBEAN          170.000000
OCEANIA            130.000000
SOUTHERN EUROPE     29.642857
WESTERN ASIA        15.333333
WESTERN EUROPE       9.555556
CENTRAL AMERICA      7.500000
EASTERN EUROPE       4.239130
SOUTHERN AFRICA      4.230769
SOUTH ASIA           2.871046
CENTRAL ASIA         2.239583
Name: arts_per_cap, dtype: float64

In [48]:
print("Top 10 regions with highest high quality articles per capita (descending), per 1 million people")
df_qual_pc.groupby('region')['qual_per_cap'].first().sort_values(ascending=False).head(10)

Top 10 regions with highest high quality articles per capita (descending), per 1 million people


region
SOUTHERN EUROPE    0.025818
WESTERN ASIA       0.007246
EASTERN AFRICA     0.006460
WESTERN AFRICA     0.004390
EASTERN EUROPE     0.002787
CENTRAL ASIA       0.002422
SOUTHEAST ASIA     0.001751
SOUTH ASIA         0.001237
CARIBBEAN          0.000000
SOUTHERN AFRICA    0.000000
Name: qual_per_cap, dtype: float64