# Enfield, Andrew - DATA 512, A2: Bias in Data

This notebook implements [DATA 512 assignment two](https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data), to explore bias using the number and quality of Wikipedia articles about politicians, by country.

Per updates via Slack, the ultimate deliverable is a set of lists, not charts. These lists are:

- TBD

For more information about the work and, especially, about the data, refer to the [README](Readme.md).

**TODO** I want a data section that notes the known issues in one place. These include:

- articles that aren't actually about a single politician, like 'History of monarchy in Canada' (country Canada, rev_id 806849461)
- as a follow-on to the bullet above, that I deleted the Template: articles, and why
- something about Oliver's comments about multiple levels of hierarchy
- incomplete mapping of countries even after Oliver's and my follow-on remapping

# Prereqs

This code requires the libraries imported below.

In [1]:
# retrieve, load data
import requests
import json
import csv
import os

# prepare and analyze data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.core.pylabtools import figsize
import seaborn as sns # for formatting
%matplotlib inline 

In [2]:
sns.set_style("whitegrid")
figsize(14,7)

# Load and do basic data cleaning

The code in this section loads data from the three sources (described in the [README](README.md)). It also does some minimal cleaning.

## First off, load the Wikipedia articles

In [107]:
d_wikipedia = pd.read_csv('page_data.csv')
d_wikipedia.shape

(47197, 3)

In [108]:
d_wikipedia[:3]

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046


And, before we use the data to pull scores, we'll filter out the 'Template' entries.

On Slack, Oliver said he'd keep these entries, as they're evidence of coverage/activity. I'm going to drop them. I agree with Oliver, but I'm not sure that template articles are the same _kind_ of coverage - are these apples to the actual per-politician page oranges? Do we know that a template article indicates the same amount of interest/coverage as an average actual article? Does ORES work equally well estimating the quality of a template article? Rather than make a statement on these questions, I'm just going to remove them from the analysis at this point.

In [109]:
d_wikipedia.drop(d_wikipedia[d_wikipedia['page'].str.startswith('Template:')].index, inplace=True)
d_wikipedia.shape

(46701, 3)
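The same filter can also be written as a boolean mask that returns a new frame instead of dropping rows in place. This is a minimal sketch on a toy frame; `toy_pages` and `toy_articles` are hypothetical names, and the rows are copied from the samples shown above.

```python
import pandas as pd

# Toy stand-in for page_data.csv; three rows taken from the samples above.
toy_pages = pd.DataFrame({
    'page': ['Template:ZambiaProvincialMinisters', 'Bir I of Kanem', 'Yos Por'],
    'country': ['Zambia', 'Chad', 'Cambodia'],
    'rev_id': [235107991, 355319463, 393822005],
})

# Flag rows whose title starts with the 'Template:' prefix, then keep the rest.
is_template = toy_pages['page'].str.startswith('Template:')
toy_articles = toy_pages[~is_template]

print(is_template.sum(), 'template rows removed;', len(toy_articles), 'kept')
```

Keeping the mask in its own variable makes it easy to count, or inspect, the rows that are about to be removed.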

In [110]:
d_wikipedia[:3]

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005


## Then load the population data

Per Oliver on Slack, we shouldn't persist the population data in the repository. Accordingly, the code here downloads it from the source site and saves a copy locally. On later runs, if the local file exists, the code just loads it.

In [138]:
population_url = 'http://www.prb.org/RawData.axd?ind=14&fmt=14&tf=76&loc=34235%2c249%2c250%2c251%2c252%2c253%2c254%2c34227%2c255%2c257%2c258%2c259%2c260%2c261%2c262%2c263%2c264%2c265%2c266%2c267%2c268%2c269%2c270%2c271%2c272%2c274%2c275%2c276%2c277%2c278%2c279%2c280%2c281%2c282%2c283%2c284%2c285%2c286%2c287%2c288%2c289%2c290%2c291%2c292%2c294%2c295%2c296%2c297%2c298%2c299%2c300%2c301%2c302%2c304%2c305%2c306%2c307%2c308%2c311%2c312%2c315%2c316%2c317%2c318%2c319%2c320%2c321%2c322%2c324%2c325%2c326%2c327%2c328%2c34234%2c329%2c330%2c331%2c332%2c333%2c334%2c336%2c337%2c338%2c339%2c340%2c342%2c343%2c344%2c345%2c346%2c347%2c348%2c349%2c350%2c351%2c352%2c353%2c354%2c358%2c359%2c360%2c361%2c362%2c363%2c364%2c365%2c366%2c367%2c368%2c369%2c370%2c371%2c372%2c373%2c374%2c375%2c377%2c378%2c379%2c380%2c381%2c382%2c383%2c384%2c385%2c386%2c387%2c388%2c389%2c390%2c392%2c393%2c394%2c395%2c396%2c397%2c398%2c399%2c400%2c401%2c402%2c404%2c405%2c406%2c407%2c408%2c409%2c410%2c411%2c415%2c416%2c417%2c418%2c419%2c420%2c421%2c422%2c423%2c424%2c425%2c427%2c428%2c429%2c430%2c431%2c432%2c433%2c434%2c435%2c437%2c438%2c439%2c440%2c441%2c442%2c443%2c444%2c445%2c446%2c448%2c449%2c450%2c451%2c452%2c453%2c454%2c455%2c456%2c457%2c458%2c459%2c460%2c461%2c462%2c464%2c465%2c466%2c467%2c468%2c469%2c470%2c471%2c472%2c473%2c474%2c475%2c476%2c477%2c478%2c479%2c480'
population_filename = 'population_mid_2015.csv'
population_force_download = False

if population_force_download or not os.path.exists(population_filename):
    d_population = pd.read_csv(population_url, skiprows=2, thousands=',')
    d_population.to_csv(population_filename, index=False)
else:
    d_population = pd.read_csv(population_filename, thousands=',')
    
d_population.shape

(210, 6)
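The download-once-then-reuse pattern used here can be pulled into a small helper. The sketch below is one possible shape, not the notebook's actual code: `load_cached_csv` and `fake_fetch` are hypothetical names, and the fetch step is replaced by a stand-in so the example runs without network access.

```python
import os
import tempfile

import pandas as pd


def load_cached_csv(path, fetch, force=False):
    """Return a dataframe read from 'path'. If the file is missing, or 'force'
    is True, call fetch() first and save its result to 'path'."""
    if force or not os.path.exists(path):
        fetch().to_csv(path, index=False)
    return pd.read_csv(path)


# Stand-in for the real download; records how many times it is called.
calls = []

def fake_fetch():
    calls.append(1)
    return pd.DataFrame({'Location': ['Chad'], 'Data': [13707000]})


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'population.csv')
    first = load_cached_csv(cache_path, fake_fetch)   # fetches and saves
    second = load_cached_csv(cache_path, fake_fetch)  # reads the local copy

print(len(calls))
```

Always reading back from the saved file means both code paths return data that has gone through the same CSV parsing.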

In [140]:
d_wikipedia.groupby(['country']).size().sort_values(ascending=False)[:10]

country
France            1681
Australia         1561
China             1133
United States     1092
Mexico            1077
Pakistan          1040
India              985
Russia             877
Spain              876
United Kingdom     863
dtype: int64

In [141]:
d_population[:3]

Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,


## And finally retrieve the article scores

The code in this section retrieves [ORES](https://ores.wikimedia.org) article scores for each article in the data set. For more information about this source, refer to the [README](README.md). I use the https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context API as it enables downloading many scores at a time and reduces the time needed to obtain ~47000 scores from hours (with the [score-per-call API](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model)) to just a few minutes.

My implementation uses a few functions, to keep separate operations separate. To retrieve article scores for a list of revision IDs and save the results to a local file, call the top-level function get_article_scores_data.

In [10]:
user_agent = 'https://github.com/aenfield'

def get_multiple_full_ores_score_json(rev_ids):
    """Call the ORES web API to retrieve and return the JSON for all revision IDs 
    specified in the rev_ids parameter. In practice I found that asking for 140 or
    fewer IDs at a time worked fine."""
    
    endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models=wp10&revids={rev_ids_delimited}'
    rev_ids_delimited = '|'.join([str(i) for i in rev_ids])
    params = { 'rev_ids_delimited' : rev_ids_delimited }

    api_call = requests.get(endpoint.format(**params), headers = {'User-Agent':'{}'.format(user_agent)})
    return api_call.json()
    
    
def get_ores_prediction_from_score_json(score_json, rev_id):
    """Return the most likely article quality value for the specified rev_id. The
    score_json parameter must be a JSON dict from ORES."""
    try:
        return score_json['enwiki']['scores'][str(rev_id)]['wp10']['score']['prediction']
    except KeyError as err: # to handle cases where the article has been deleted
        return f"KeyError: {err}."
            
def chunker(seq, size):
    """Get a generator that returns chunks of size 'size' of the sequence in 'seq'.
    What I get from this function enables me to call get_multiple_full_ores_score_json 
    multiple times with a separate (smaller) set of rev_ids each time.
    
    From: https://stackoverflow.com/questions/434287/what-is-the-most-pythonic-way-to-iterate-over-a-list-in-chunks
    """
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def get_article_scores_data(rev_ids, force=False, verbose=False):
    """Retrieve article scores from ORES for the specified 'rev_ids', save locally, 
    and return a dataframe. If 'force' is True, download regardless; otherwise retrieve
    the scores only if the local file doesn't exist.
    """
    
    filename = 'article_scores.csv'

    if force or not os.path.exists(filename):
        # download and save the data locally
        if verbose: print("Local file doesn't exist or download was forced. Downloading...")
        # newline='' is what the csv module expects, and avoids blank rows on Windows
        with open(filename, 'w', newline='') as output_file:
            writer = csv.writer(output_file, delimiter=',')
            writer.writerow(['RevisionId','Score'])

            progress_frequency = 25
            count_of_rev_ids_in_chunk = 140
            rev_ids_in_chunks = list(chunker(rev_ids, count_of_rev_ids_in_chunk))

            #rev_ids_in_chunks = rev_ids_in_chunks[300:]
            
            for chunk_index, chunk_of_rev_ids in enumerate(rev_ids_in_chunks):
                if (chunk_index % progress_frequency == 0): print(f"Retrieving chunk with index {chunk_index}.")

                scores_json = get_multiple_full_ores_score_json(chunk_of_rev_ids)
                #print(scores_json)
                for rev_id in chunk_of_rev_ids:
                    writer.writerow([rev_id, get_ores_prediction_from_score_json(scores_json, rev_id)])

            # use the chunk count directly so this also works when rev_ids is empty
            if verbose: print(f"Retrieved {len(rev_ids_in_chunks)} chunks and saved to {filename}. Done.")
    else:
        if verbose: print("Local file exists already.")
                  
    # now open and return dataframe
    return pd.read_csv(filename, index_col='RevisionId')

In [11]:
d_scores = get_article_scores_data(d_wikipedia['rev_id'].values)
d_scores.shape

(46701, 1)
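To make the batching concrete, here is a small sketch of how `chunker` splits a sequence, along with the number of API calls implied by 46,701 revision IDs at 140 IDs per call. The `chunker` definition is repeated from above so the snippet is self-contained; `chunks` and `n_calls` are names introduced only for this illustration.

```python
import math


def chunker(seq, size):
    """Return a generator of slices of 'seq' with at most 'size' items each."""
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))


# Seven items in chunks of three: two full chunks and one partial chunk.
chunks = list(chunker(list(range(7)), 3))
print(chunks)

# Number of API calls needed for 46,701 revision IDs at 140 IDs per call.
n_calls = math.ceil(46701 / 140)
print(n_calls)
```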

Some articles may have been deleted since the page_data list was created. How many are there? After counting them, we'll filter them out, since we can't get a score for something that doesn't exist.

In [12]:
len(d_scores[d_scores['Score'].str.startswith('KeyError')])

2

In [13]:
d_scores[d_scores['Score'].str.startswith('KeyError')]

Unnamed: 0_level_0,Score
RevisionId,Unnamed: 1_level_1
807367030,KeyError: 'score'.
807367166,KeyError: 'score'.


In [14]:
d_scores.drop(d_scores[d_scores['Score'].str.startswith('KeyError')].index, inplace=True)
d_scores.shape

(46699, 1)

In [15]:
len(d_scores[d_scores['Score'].str.startswith('KeyError')])

0
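An alternative to writing an error string and filtering on it afterwards is to return `None` when no score is present; `csv.writer` should then write an empty field, which pandas reads back as NaN. The sketch below assumes this variation: `get_prediction_or_none` is a hypothetical function, and `sample_json` is hand-written to mirror the key path used above (its `error` payload is illustrative, not a captured ORES response).

```python
# Hand-written sample shaped like the key path used above. The 'error' entry
# is an illustrative assumption, not a real ORES response.
sample_json = {
    'enwiki': {
        'scores': {
            '355319463': {'wp10': {'score': {'prediction': 'Stub'}}},
            '807367030': {'wp10': {'error': {'type': 'illustrative'}}},
        }
    }
}


def get_prediction_or_none(score_json, rev_id):
    """Return the predicted quality class, or None when no score is present."""
    try:
        return score_json['enwiki']['scores'][str(rev_id)]['wp10']['score']['prediction']
    except KeyError:
        return None


print(get_prediction_or_none(sample_json, 355319463))
print(get_prediction_or_none(sample_json, 807367030))
```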

And now join to pull in the score data.

If there are rows with no score data, the page data will have nulls for the new score column. I'll filter these rows out below when I clean up rows with incomplete data.

In [16]:
d_wikipedia_with_scores = pd.merge(left=d_wikipedia, right=d_scores, left_on='rev_id', right_index=True, how='left')
d_wikipedia_with_scores.shape

(46701, 4)
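A left join keeps every row from the page data and fills the score with NaN where no match exists, which is why the row count stays at 46701. A minimal sketch with made-up toy frames (`pages`, `scores`, and `joined` are hypothetical names):

```python
import pandas as pd

# Three toy pages, but scores for only two of the revision IDs.
pages = pd.DataFrame({'page': ['A', 'B', 'C'], 'rev_id': [1, 2, 3]})
scores = pd.DataFrame({'Score': ['Stub', 'GA']},
                      index=pd.Index([1, 3], name='RevisionId'))

# Match the left frame's rev_id column against the right frame's index.
joined = pd.merge(left=pages, right=scores, left_on='rev_id',
                  right_index=True, how='left')
print(joined)
```

Page 'B' has no score, so it survives the join with a NaN in the `Score` column.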

In [17]:
d_wikipedia_with_scores[:3]

Unnamed: 0,page,country,rev_id,Score
1,Bir I of Kanem,Chad,355319463,Stub
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
12,Yos Por,Cambodia,393822005,Stub


In [18]:
d_wikipedia_with_scores['Score'].value_counts(dropna=False)

Stub     23707
Start    15341
C         5764
GA         872
B          735
FA         280
NaN          2
Name: Score, dtype: int64

In [19]:
d_wikipedia.shape

(46701, 3)

In [20]:
len(d_wikipedia_with_scores) - len(d_wikipedia)

0

In [21]:
d_scores.shape

(46699, 1)

In [22]:
d_wikipedia_with_scores.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46701 entries, 1 to 47196
Data columns (total 4 columns):
page       46701 non-null object
country    46701 non-null object
rev_id     46701 non-null int64
Score      46699 non-null object
dtypes: int64(1), object(3)
memory usage: 1.8+ MB


Originally the page_data data set had duplicates. It looks like that's no longer the case, but I'll check for them here to confirm.

In [27]:
dupes = d_wikipedia_with_scores[d_wikipedia_with_scores.duplicated(subset='rev_id', keep=False)].sort_values(['rev_id','page'])
dupes

Unnamed: 0,page,country,rev_id,Score


Finally, I'll join to pull in the population data. For now I'll do an outer join so we can see the country values that don't exist on _both_ sides of the join. This isn't needed for the assignment because it says to just remove all rows that don't have matching data. But, I'm curious and I also want to at least see if there are places where I could do any further matching to expand/improve the data.

In [28]:
d_all = pd.merge(left=d_wikipedia_with_scores, right=d_population[['Location','Data']],
                 how='outer', left_on='country', right_on='Location')
d_all.shape

(46724, 6)
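One way to see which side of an outer join each row came from is the `indicator=True` option of `pd.merge`, which adds a `_merge` column. The sketch below uses toy frames with made-up population values; `articles`, `population`, `left_only`, and `right_only` are hypothetical names.

```python
import pandas as pd

# Toy data: 'Hondura' in the article data does not match 'Honduras'.
articles = pd.DataFrame({'page': ['P1', 'P2'], 'country': ['Chad', 'Hondura']})
population = pd.DataFrame({'Location': ['Chad', 'Honduras'],
                           'Data': [100, 200]})  # made-up populations

# indicator=True labels each row as 'both', 'left_only', or 'right_only'.
both_sides = pd.merge(articles, population, how='outer',
                      left_on='country', right_on='Location', indicator=True)

left_only = both_sides.loc[both_sides['_merge'] == 'left_only', 'country'].tolist()
right_only = both_sides.loc[both_sides['_merge'] == 'right_only', 'Location'].tolist()
print(left_only, right_only)
```

This gives the same information as checking for nulls in `Location` and `country`, but it does not rely on those columns being free of nulls before the join.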

In [29]:
d_all[:3]

Unnamed: 0,page,country,rev_id,Score,Location,Data
0,Bir I of Kanem,Chad,355319463.0,Stub,Chad,13707000.0
1,Abdullah II of Kanem,Chad,498683267.0,Stub,Chad,13707000.0
2,Salmama II of Kanem,Chad,565745353.0,Stub,Chad,13707000.0


What countries exist on one side but not on the other?

First, what countries exist in the Wikipedia data but not in the population data, with the number of rows (here, the number of pages) for each?

In [30]:
d_all[d_all['Location'].isnull()]['country'].value_counts(dropna=False)

Hondura                             187
Salvadoran                          116
South Korean                         96
Cape Colony                          81
Samoan                               76
Rhodesian                            75
Faroese                              74
Ivorian                              73
Cook Island                          67
Jersey                               61
Saint Lucian                         47
Pitcairn Islands                     43
Chechen                              38
East Timorese                        36
Saint Kitts and Nevis                30
Montserratian                        27
Guernsey                             25
Omani                                24
Niuean                               22
Carniolan                            22
Saint Vincent and the Grenadines     21
Palauan                              21
South Ossetian                       18
Tokelauan                            17
Abkhazia                             16


And now the countries that exist in the population data but not in the Wikipedia data.

In [31]:
d_all[d_all['country'].isnull()]['Location'].value_counts(dropna=False)

Channel Islands                 1
Macao, SAR                      1
Reunion                         1
Cote d'Ivoire                   1
Palau                           1
St. Kitts-Nevis                 1
Oman                            1
Curacao                         1
St. Vincent & the Grenadines    1
Western Sahara                  1
French Polynesia                1
Guam                            1
Georgia                         1
Samoa                           1
St. Lucia                       1
El Salvador                     1
Timor-Leste                     1
Hong Kong, SAR                  1
Mayotte                         1
Honduras                        1
New Caledonia                   1
Puerto Rico                     1
Brunei                          1
Name: Location, dtype: int64

Based on eyeballing the most common Wikipedia countries for which we have no population data (I included all countries with 70 or more pages), I came up with the following mapping, which I'll use to update the country field in the Wikipedia data so that we _will_ have population data. The mapping is clearly an improvement for countries that map directly, for example updating 'South Korean' to 'Korea, South'. It's fuzzier for others, such as Wikipedia countries that no longer exist in the form they had when the leader was in power, like Cape Colony and Rhodesia. Good enough for now though.

In addition, there are clearly others that I could also map - for example, 'Saint Vincent and the Grenadines' in the Wikipedia data could be changed to 'St. Vincent & the Grenadines', but all of the ones I didn't manually map have fewer than 70 pages.

In [33]:
wikipedia_country_to_population_map = {'Burkinabé':'Burkina Faso',
                                       'Ivorian':"Cote d'Ivoire",
                                       'Faroese':'Denmark',
                                       'Salvadoran':'El Salvador',
                                       'Hondura':'Honduras',
                                       'South Korean':'Korea, South',
                                       'Samoan':'Samoa',
                                       'Cape Colony':'United Kingdom',
                                       'Rhodesian':'Zimbabwe'}

In [34]:
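# assign the result back rather than updating a column selection in place, which
# is chained assignment and may not modify the frame under pandas copy-on-write
d_wikipedia_with_scores['country'] = d_wikipedia_with_scores['country'].replace(wikipedia_country_to_population_map)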
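The difference between `Series.map` and `Series.replace` matters when only some values have a mapping: `map` returns NaN for values missing from the dict, whereas `replace` leaves them unchanged. A small sketch with a toy series (the names `country`, `mapping`, `mapped`, and `replaced` are for illustration only):

```python
import pandas as pd

country = pd.Series(['Hondura', 'Chad', 'Samoan'])
mapping = {'Hondura': 'Honduras', 'Samoan': 'Samoa'}

mapped = country.map(mapping)        # unmapped 'Chad' becomes NaN
replaced = country.replace(mapping)  # unmapped 'Chad' is kept as-is

print(mapped.tolist())
print(replaced.tolist())
```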

And then we need to re-merge the population data. (This is duplication and not DRY. It's only one line, so I won't worry about it here.)

In [35]:
d_all = pd.merge(left=d_wikipedia_with_scores, right=d_population[['Location','Data']],
                 how='outer', left_on='country', right_on='Location')
d_all.shape

(46720, 6)

In [36]:
d_all[d_all['Location'].isnull()]['country'].value_counts(dropna=False)

Cook Island                         67
Jersey                              61
Saint Lucian                        47
Pitcairn Islands                    43
Chechen                             38
East Timorese                       36
Saint Kitts and Nevis               30
Montserratian                       27
Guernsey                            25
Omani                               24
Niuean                              22
Carniolan                           22
Saint Vincent and the Grenadines    21
Palauan                             21
South Ossetian                      18
Tokelauan                           17
Abkhazia                            16
South African Republic              15
Greenlandic                         13
Ossetian                             9
Dagestani                            7
Incan                                7
Somaliland                           4
Rojava                               2
Name: country, dtype: int64

In [37]:
d_all[d_all['country'].isnull()]['Location'].value_counts(dropna=False)

Brunei                          1
New Caledonia                   1
Reunion                         1
French Polynesia                1
St. Kitts-Nevis                 1
Oman                            1
Curacao                         1
St. Vincent & the Grenadines    1
Western Sahara                  1
Macao, SAR                      1
Georgia                         1
Guam                            1
Channel Islands                 1
St. Lucia                       1
Puerto Rico                     1
Timor-Leste                     1
Mayotte                         1
Hong Kong, SAR                  1
Palau                           1
Name: Location, dtype: int64

In [38]:
d_all[:3]

Unnamed: 0,page,country,rev_id,Score,Location,Data
0,Bir I of Kanem,Chad,355319463.0,Stub,Chad,13707000.0
1,Abdullah II of Kanem,Chad,498683267.0,Stub,Chad,13707000.0
2,Salmama II of Kanem,Chad,565745353.0,Stub,Chad,13707000.0


Finally, now that we've added in the additional remappings here, rows that still have a NaN 'Location' value are rows for which we have no population data and, per the assignment, we can get rid of them.

In [39]:
d_all.shape

(46720, 6)

In [40]:
sum(d_all['Location'].isnull())

592

In [41]:
d_all.dropna(subset=['Location'], inplace=True)
d_all.shape

(46128, 6)

Since we did an outer join above so we could understand data without matches on both sides of the join, we also need to drop the rows with location/population data but with no matching articles. These rows have nulls for the page data fields - for page, country, rev_id, and Score.

In [42]:
d_all.dropna(subset=['page', 'country', 'rev_id', 'Score'], how='all', inplace=True)
d_all.shape

(46109, 6)

We'll also get rid of any rows without score data. As noted above, these can come from cases where the page has since been deleted on Wikipedia.

In [43]:
d_all[d_all['Score'].isnull()]

Unnamed: 0,page,country,rev_id,Score,Location,Data
9352,Jalal Movaghar,Iran,807367030.0,,Iran,78483446.0
9353,Mohsen Movaghar,Iran,807367166.0,,Iran,78483446.0


In [44]:
d_all.dropna(subset=['Score'], inplace=True)
d_all.shape

(46107, 6)

And finally, I'll rename columns to match assignment instructions and drop the extra join column.

In [45]:
del d_all['Location']
d_all.rename(columns={'page':'article_name',
                      'rev_id':'revision_id',
                      'Score':'article_quality',
                      'Data':'population'}, inplace=True)
d_all[:3]

Unnamed: 0,article_name,country,revision_id,article_quality,population
0,Bir I of Kanem,Chad,355319463.0,Stub,13707000.0
1,Abdullah II of Kanem,Chad,498683267.0,Stub,13707000.0
2,Salmama II of Kanem,Chad,565745353.0,Stub,13707000.0


# Analysis

## Articles per person

In [143]:
articles_per_person = d_all.groupby(['country']).apply(lambda g: len(g) / g.iloc[0]['population'])
articles_per_person[:3]

country
Afghanistan   0.00000999
Albania       0.00015802
Algeria       0.00000290
dtype: float64
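The same ratio can be computed without `apply`, by dividing the per-country row count by the per-country population. This is a sketch on made-up toy data, assuming (as in `d_all`) that the population value is repeated on every row for a country; `toy_articles` and `per_person` are hypothetical names.

```python
import pandas as pd

# Toy data: country X has two articles, country Y has one.
toy_articles = pd.DataFrame({
    'country': ['X', 'X', 'Y'],
    'population': [1000.0, 1000.0, 400.0],
})

grouped = toy_articles.groupby('country')

# Article count per country divided by that country's (repeated) population.
per_person = grouped.size() / grouped['population'].first()
print(per_person)
```

Both series are indexed by country, so the division aligns on the country name.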

In [147]:
# we want to display the full list, and to not use scientific notation
pd.options.display.max_rows = 250
pd.options.display.float_format = '{:.8f}'.format

### Countries with the most articles per person

In [148]:
articles_per_person.sort_values(ascending=False)

country
Nauru                            0.00478821
Tuvalu                           0.00457627
San Marino                       0.00245455
Monaco                           0.00105020
Liechtenstein                    0.00074528
Marshall Islands                 0.00067273
Iceland                          0.00061059
Tonga                            0.00060987
Andorra                          0.00043590
Samoa                            0.00039133
Federated States of Micronesia   0.00034951
Grenada                          0.00032432
Luxembourg                       0.00031272
Antigua and Barbuda              0.00026667
Kiribati                         0.00026455
Maldives                         0.00023923
Malta                            0.00023871
Fiji                             0.00022837
Seychelles                       0.00022621
Vanuatu                          0.00021622
Dominica                         0.00017647
New Zealand                      0.00017051
Albania                 

### Countries with the fewest articles per person

In [156]:
articles_per_person.sort_values()

country
India                            0.00000075
Indonesia                        0.00000083
China                            0.00000083
Uzbekistan                       0.00000089
Ethiopia                         0.00000103
Korea, North                     0.00000144
Zambia                           0.00000162
Thailand                         0.00000172
Congo, Dem. Rep. of              0.00000194
Bangladesh                       0.00000200
Vietnam                          0.00000204
Mozambique                       0.00000225
Sudan                            0.00000232
Egypt                            0.00000266
Brazil                           0.00000270
Senegal                          0.00000286
Algeria                          0.00000290
Eritrea                          0.00000308
Cote d'Ivoire                    0.00000314
United States                    0.00000340
Japan                            0.00000344
Nigeria                          0.00000373
Saudi Arabia            

## High quality articles per all articles

In [157]:
def is_high_quality(score):
    # featured articles (FA) and good articles (GA) count as high quality
    return score in ('FA', 'GA')

In [158]:
d_all['article_quality'].value_counts(dropna=False)

Stub     23456
Start    15103
C         5685
GA         862
B          722
FA         279
Name: article_quality, dtype: int64

In [159]:
(d_all['article_quality'].apply(is_high_quality)).value_counts(dropna=False)

False    44966
True      1141
Name: article_quality, dtype: int64

In [160]:
# define this instead of using a lambda, like above, so we can keep the lines around 80 chars wide
# a lambda would be fine as the actual function is still pretty short/a single liner
def get_high_quality_article_proportion(articles):
    return sum(articles['article_quality'].apply(is_high_quality)) / len(articles)

high_quality_articles_per_all_articles = d_all.groupby(['country']).apply(get_high_quality_article_proportion)
high_quality_articles_per_all_articles[:3]

country
Afghanistan   0.05900621
Albania       0.01094092
Algeria       0.02586207
dtype: float64
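Because the mean of a boolean column is the proportion of True values, the same per-country proportion can be sketched with `isin` and a grouped mean. The toy data and the names `toy_quality`, `is_high`, and `proportion` are made up for illustration.

```python
import pandas as pd

# Toy data: country X has 2 high quality articles out of 4, country Y has 0 of 2.
toy_quality = pd.DataFrame({
    'country': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'article_quality': ['FA', 'Stub', 'GA', 'C', 'Start', 'B'],
})

# True where the predicted class is FA or GA.
is_high = toy_quality['article_quality'].isin(['FA', 'GA'])

# Mean of a boolean series per group = share of high quality articles.
proportion = is_high.groupby(toy_quality['country']).mean()
print(proportion)
```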

### Countries with the most high quality articles per all articles

In [161]:
high_quality_articles_per_all_articles.sort_values(ascending=False)

country
Korea, North                     0.25000000
Romania                          0.13119534
Saudi Arabia                     0.12711864
Central African Republic         0.12121212
Guinea-Bissau                    0.10000000
Qatar                            0.10000000
Vietnam                          0.09625668
Bhutan                           0.09090909
Mauritania                       0.08333333
Ireland                          0.08201058
United States                    0.07875458
Singapore                        0.07246377
Guatemala                        0.07228916
Uzbekistan                       0.07142857
Palestinian Territory            0.06703911
Benin                            0.06593407
Syria                            0.06201550
Gabon                            0.06122449
United Kingdom                   0.06038136
Afghanistan                      0.05900621
Ukraine                          0.05666667
Vanuatu                          0.05000000
Congo, Dem. Rep. of     

### Countries with the fewest high quality articles per all articles

In [162]:
high_quality_articles_per_all_articles.sort_values()

country
Federated States of Micronesia   0.00000000
San Marino                       0.00000000
Macedonia                        0.00000000
Comoros                          0.00000000
Djibouti                         0.00000000
Dominica                         0.00000000
Suriname                         0.00000000
Liechtenstein                    0.00000000
Eritrea                          0.00000000
Mozambique                       0.00000000
Tajikistan                       0.00000000
Lesotho                          0.00000000
Cape Verde                       0.00000000
French Guiana                    0.00000000
Swaziland                        0.00000000
Guadeloupe                       0.00000000
Tunisia                          0.00000000
Zambia                           0.00000000
Switzerland                      0.00000000
Guyana                           0.00000000
Honduras                         0.00000000
Tonga                            0.00000000
Nauru                   

## Large countries only

**TBD** consider putting in a short description here of why small countries are more likely to be in the extreme, or refer to the explanation section where I'll describe this at least a bit more.

If we filter out countries with 'smaller' populations, what do we see for the same lists? First we'll add in the population data and then we can use that to filter the lists.

In [102]:
articles_per_person = pd.concat([articles_per_person, 
           d_population[['Location','Data']].set_index('Location')['Data']],
           axis=1).rename(columns={0:'proportion','Data':'population'})
articles_per_person[:3]

Unnamed: 0,proportion,population
Afghanistan,9.99e-06,32247000
Albania,0.00015802,2892000
Algeria,2.9e-06,39948000


In [103]:
high_quality_articles_per_all_articles = pd.concat([high_quality_articles_per_all_articles, 
           d_population[['Location','Data']].set_index('Location')['Data']],
           axis=1).rename(columns={0:'proportion','Data':'population'})
high_quality_articles_per_all_articles[:3]

Unnamed: 0,proportion,population
Afghanistan,0.05900621,32247000
Albania,0.01094092,2892000
Algeria,0.02586207,39948000


And now we can look at the lists again, with a threshold population.

In [104]:
pop_thresh = 50000000

### Large countries, articles per person

In [105]:
articles_per_person[articles_per_person['population'] > pop_thresh].sort_values(by='proportion', ascending=False)['proportion']

France                0.00002612
United Kingdom        0.00001450
Italy                 0.00001319
Iran                  0.00001046
Germany               0.00000852
Mexico                0.00000848
Tanzania              0.00000775
Korea, South          0.00000739
South Africa          0.00000687
Russia                0.00000608
Pakistan              0.00000522
Philippines           0.00000494
Myanmar               0.00000454
Turkey                0.00000445
Nigeria               0.00000373
Japan                 0.00000344
United States         0.00000340
Brazil                0.00000270
Egypt                 0.00000266
Vietnam               0.00000204
Bangladesh            0.00000200
Congo, Dem. Rep. of   0.00000194
Thailand              0.00000172
Ethiopia              0.00000103
China                 0.00000083
Indonesia             0.00000083
India                 0.00000075
Name: proportion, dtype: float64

### Large countries, high quality articles per all articles

In [106]:
high_quality_articles_per_all_articles[high_quality_articles_per_all_articles['population'] > pop_thresh].sort_values(by='proportion', ascending=False)['proportion']

Vietnam               0.09625668
United States         0.07875458
United Kingdom        0.06038136
Congo, Dem. Rep. of   0.04929577
Egypt                 0.04641350
Philippines           0.04322200
Indonesia             0.04265403
Russia                0.03990878
Myanmar               0.03797468
China                 0.03706973
South Africa          0.03174603
Germany               0.02749638
Thailand              0.02678571
Iran                  0.02070646
Ethiopia              0.01980198
Bangladesh            0.01869159
Korea, South          0.01866667
France                0.01784652
Japan                 0.01601831
India                 0.01522843
Pakistan              0.01346154
Turkey                0.01149425
Italy                 0.00970874
Brazil                0.00905797
Mexico                0.00649954
Nigeria               0.00589102
Tanzania              0.00246914
Name: proportion, dtype: float64

## Persist data to a file

In [64]:
d_all.to_csv('en-wikipedia-politician-scores.csv', index=False)