# Enfield, Andrew - DATA 512, A2: Bias in Data

TBD UPDATE

The assignment is at https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data.

TBD remove

This notebook pulls, prepares, and analyzes data about the amount of monthly English Wikipedia traffic from January 1, 2008 through September 30, 2017. For more information about the work and data, refer to the [README](Readme.md).

A few notes:
- Normally I'd prefer to keep the explanation and background that's in the README here in the notebook, so everything's in a single file, but I've split it up this time as that's what the assignment requested. I won't copy/paste because keeping duplicate content in sync is horrible.
- Real reproducibility needs tests for the code. A lot of my implementation below is in functions. I'd normally put these functions in at least one separate file that I import into this notebook, and I'd have tests in an additional file. For this assignment I'll just keep everything in this file, for simplicity, even though it means I can't test the code the way I normally would.

# Prereqs

This code requires the libraries as described below.

In [1]:
# retrieve, load data
import requests
import json
import csv
import os

# prepare and analyze data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#from mpl_toolkits.axes_grid.anchored_artists import AnchoredText # for addtl annotations in charts
#from matplotlib.ticker import FuncFormatter # for custom axis labels
from IPython.core.pylabtools import figsize
import seaborn as sns # for formatting
%matplotlib inline 

In [2]:
sns.set_style("whitegrid")
figsize(14,7)

# Load data

TBD UPDATE

This section loads the data from the two APIs described in the README, producing five separate .json files, one for each API and access combination.

In [3]:
d_wikipedia = pd.read_csv('page_data.csv')
d_wikipedia.shape

(47997, 3)

In [4]:
d_wikipedia[:3]

Unnamed: 0,country,page,last_edit
0,Abkhazia,Zurab Achba,802551672
1,Abkhazia,Garri Aiba,774499188
2,Abkhazia,Zaur Avidzba,803841397


In [5]:
d_population = pd.read_csv('Population Mid-2015.csv', skiprows=2, thousands=',')
d_population.shape

(210, 6)

In [6]:
d_wikipedia.groupby(['country']).size().sort_values(ascending=False)[:10]

country
France            1695
Australia         1568
China             1138
United States     1097
Mexico            1079
Pakistan          1069
India              981
Russia             956
Spain              907
United Kingdom     865
dtype: int64

In [7]:
d_population[:3]

Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,


## Pull article scores

Docs: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model and https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context

Note that when I try the multiple rev ID API with a bunch of valid IDs and one that's a text string, then it gives me a 500 and no data at all. However when I try with a bunch of valid IDs and an ID that's not valid - like -1 - then I get valid/good data for the valid IDs and output like the following. This seems good: I'll go ahead and try just pulling batches of IDs w/o further error handling.

    "-1": {
        "wp10": {
          "error": {
            "message": "RevisionNotFound: Could not find revision ({revision}:-1)",
            "type": "RevisionNotFound"
          }
        }

In [8]:
user_agent = 'https://github.com/aenfield'

def get_multiple_full_ores_score_json(rev_ids):
    """TBD referring to ..., with rev_ids as a list"""
    endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models=wp10&revids={rev_ids_delimited}'

    rev_ids_delimited = '|'.join([str(i) for i in rev_ids])

    params = { 'rev_ids_delimited' : rev_ids_delimited }

    api_call = requests.get(endpoint.format(**params), headers = {'User-Agent':'{}'.format(user_agent)})
    return api_call.json()
    
def get_ores_prediction_from_score_json(score_json, rev_id):
    """Return the most likely article type, per ORES. Assumes a JSON dict from Ores. """
    return score_json['enwiki']['scores'][str(rev_id)]['wp10']['score']['prediction']
            
def chunker(seq, size):
    """Get a generator that returns chunks of size 'size' of the sequence in 'seq'.
    
    From: https://stackoverflow.com/questions/434287/what-is-the-most-pythonic-way-to-iterate-over-a-list-in-chunks
    """
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def get_article_scores_data(rev_ids, force=False, verbose=False):
    """TBD update Download and save results from the specified API to a local file, by default only if local file doesn't exist.
    TBD call with d_wikipedia['last_edit'].values for 'rev_ids'
    
    apiname - 'pagecounts' or 'pageviews'
    params - a dict containing 'access', 'start', and 'end' keys; use get_param_dict_from_params
    user_agent - an identifier for the user making the request; can be a GitHub user URL or general email address
    force - download data and overwrite local file, even if file already exists; default is False
    verbose - print diagnostic data; default is false
    """
    
    filename = 'article_scores.csv'

    if (not os.path.exists(filename)) | (force == True):
        # download and save the data locally
        if verbose: print("Local file doesn't exist or download was forced. Downloading...")
        with open(filename, 'w') as output_file:
            writer = csv.writer(output_file, delimiter=',')
            writer.writerow(['RevisionId','Score'])

            progress_frequency = 25
            count_of_rev_ids_in_chunk = 140
            rev_ids_in_chunks = [x for x in chunker(rev_ids, count_of_rev_ids_in_chunk)]

            for chunk_index, chunk_of_rev_ids in enumerate(rev_ids_in_chunks):
                if (chunk_index % progress_frequency == 0): print(f"Retrieving chunk with index {chunk_index}.")

                scores_json = get_multiple_full_ores_score_json(chunk_of_rev_ids)
                for rev_id in chunk_of_rev_ids:
                    writer.writerow([rev_id, get_ores_prediction_from_score_json(scores_json, rev_id)])

            if verbose: print(f"Retrieved {chunk_index + 1} chunks and saved to {filename}. Done.")        
    else:
        if verbose: print("Local file exists already.")
                  
    # now open and return dataframe
    return pd.read_csv(filename, index_col='RevisionId')

In [9]:
d_scores = get_article_scores_data(d_wikipedia['last_edit'].values)
d_scores.shape

(47997, 1)

In [10]:
d_scores[:3]

Unnamed: 0_level_0,Score
RevisionId,Unnamed: 1_level_1
802551672,C
774499188,Stub
803841397,C


In [11]:
d_wikipedia_with_scores = pd.merge(left=d_wikipedia, right=d_scores, left_on='last_edit', right_index=True, how='left')
d_wikipedia_with_scores.shape

(49827, 4)

In [12]:
d_wikipedia_with_scores[:3]

Unnamed: 0,country,page,last_edit,Score
0,Abkhazia,Zurab Achba,802551672,C
1,Abkhazia,Garri Aiba,774499188,Stub
2,Abkhazia,Zaur Avidzba,803841397,C


In [13]:
d_wikipedia_with_scores['Score'].value_counts(dropna=False)

Stub     25336
Start    16239
C         6203
GA         937
B          777
FA         335
Name: Score, dtype: int64

In [14]:
d_wikipedia.shape

(47997, 3)

In [15]:
len(d_wikipedia_with_scores) - len(d_wikipedia)

1830

In [16]:
d_scores.shape

(47997, 1)

In [17]:
d_wikipedia_with_scores.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49827 entries, 0 to 47996
Data columns (total 4 columns):
country      49827 non-null object
page         49827 non-null object
last_edit    49827 non-null int64
Score        49827 non-null object
dtypes: int64(1), object(3)
memory usage: 1.9+ MB


In [18]:
# d_wikipedia_with_scores.to_excel('eyeball.xlsx')

In [19]:
len(d_wikipedia_with_scores['last_edit'].unique())

47122

In [20]:
len(d_wikipedia['last_edit'].unique())

47122

In [21]:
len(d_scores.index.unique())

47122

In [22]:
d_wikipedia[d_wikipedia['page'] == 'George Grey']

Unnamed: 0,country,page,last_edit
9821,Cape Colony,George Grey,647367482


In [23]:
d_wikipedia[d_wikipedia['last_edit'] == 647367482]

Unnamed: 0,country,page,last_edit
9821,Cape Colony,George Grey,647367482
41369,South Africa,Klaas Afrikaner,647367482


In [24]:
d_wikipedia_with_scores[d_wikipedia_with_scores['page'] == 'George Grey']

Unnamed: 0,country,page,last_edit,Score
9821,Cape Colony,George Grey,647367482,Start
9821,Cape Colony,George Grey,647367482,Start


In [25]:
d_wikipedia_with_scores[d_wikipedia_with_scores['last_edit'] == 647367482]

Unnamed: 0,country,page,last_edit,Score
9821,Cape Colony,George Grey,647367482,Start
9821,Cape Colony,George Grey,647367482,Start
41369,South Africa,Klaas Afrikaner,647367482,Start
41369,South Africa,Klaas Afrikaner,647367482,Start


In [26]:
dupes = d_wikipedia_with_scores[d_wikipedia_with_scores.duplicated(subset='last_edit', keep=False)].sort_values(['last_edit','page'])
dupes[:10]

Unnamed: 0,country,page,last_edit,Score
47763,Zimbabwe,John Ralph Beaumont,574571582,Stub
47763,Zimbabwe,John Ralph Beaumont,574571582,Stub
38018,Rhodesian,William Fairbridge,574571582,Stub
38018,Rhodesian,William Fairbridge,574571582,Stub
40948,Somalia,Qamar Aden Ali,612095066,Stub
40948,Somalia,Qamar Aden Ali,612095066,Stub
41298,South Africa,Tasneem Essop,612095066,Stub
41298,South Africa,Tasneem Essop,612095066,Stub
9821,Cape Colony,George Grey,647367482,Start
9821,Cape Colony,George Grey,647367482,Start


In [27]:
len(dupes.groupby(['last_edit']).size())

838

In [28]:
dupes['last_edit'].value_counts(dropna=False)[:3]

806153574    25
770939478     9
716574519     9
Name: last_edit, dtype: int64

In [29]:
# remove duplicate score rows (that come from page_data.csv having duplicate rev IDs)
# all index values should have the same score, since the score comes from the API and
# is based only on the rev ID/index value, so we can drop all and keep any one
d_scores_nodupes = d_scores[~d_scores.index.duplicated(keep='first')]
d_scores_nodupes.shape

(47122, 1)

In [30]:
d_scores_nodupes.index.is_unique

True

In [31]:
# and then redo the join
d_wikipedia_with_scores = pd.merge(left=d_wikipedia, right=d_scores_nodupes, left_on='last_edit', right_index=True, how='left')
d_wikipedia_with_scores.shape

(47997, 4)

In [32]:
d_wikipedia.shape

(47997, 3)

In [33]:
d_wikipedia_with_scores[:3]

Unnamed: 0,country,page,last_edit,Score
0,Abkhazia,Zurab Achba,802551672,C
1,Abkhazia,Garri Aiba,774499188,Stub
2,Abkhazia,Zaur Avidzba,803841397,C


Join to pull in the population data. For now we'll do an outer join so we can see the country values that don't exist on _both_ sides of the join. This isn't needed for the assignment because it says to just remove all rows that don't have matching data. But, I'm curious and I also want to at least see if there are places where I could do any further matching to expand/improve the data.

In [34]:
d_all = pd.merge(left=d_wikipedia_with_scores, right=d_population[['Location','Data']],
                 how='outer', left_on='country', right_on='Location')
d_all.shape

(48022, 6)

In [35]:
d_all[:3]

Unnamed: 0,country,page,last_edit,Score,Location,Data
0,Abkhazia,Zurab Achba,802551672.0,C,,
1,Abkhazia,Garri Aiba,774499188.0,Stub,,
2,Abkhazia,Zaur Avidzba,803841397.0,C,,


Now show the countries that exist on each side but not on the other.

First, countries that exist in the Wikipedia data but not in the population data, w/ the number of rows - here, the number of pages - for each

In [36]:
d_all[d_all['Location'].isnull()]['country'].value_counts(dropna=False)

Hondura                             190
South Korean                        157
Salvadoran                          120
Burkinabé                            97
Faroese                              85
Cape Colony                          81
Ivorian                              79
Samoan                               78
Rhodesian                            78
Cook Island                          67
Jersey                               61
Saint Lucian                         48
Pitcairn Islands                     46
Chechen                              38
East Timorese                        36
Saint Kitts and Nevis                32
South African Republic               31
Montserratian                        29
Tokelauan                            27
Greenlandic                          26
Guernsey                             25
Niuean                               25
Somaliland                           25
Omani                                25
Palauan                              23


And the countries that exist in the population data but not in the Wikipedia data.

In [37]:
d_all[d_all['country'].isnull()]['Location'].value_counts(dropna=False)

Channel Islands                 1
French Polynesia                1
Sao Tome and Principe           1
Georgia                         1
Palau                           1
Puerto Rico                     1
Mayotte                         1
Brunei                          1
Honduras                        1
St. Kitts-Nevis                 1
Samoa                           1
Burkina Faso                    1
St. Vincent & the Grenadines    1
Timor-Leste                     1
Hong Kong, SAR                  1
New Caledonia                   1
Western Sahara                  1
Macao, SAR                      1
Curacao                         1
St. Lucia                       1
Cote d'Ivoire                   1
El Salvador                     1
Reunion                         1
Oman                            1
Guam                            1
Name: Location, dtype: int64

In [39]:
#d_population.to_excel('population.xlsx')
#d_wikipedia_with_scores['country'].value_counts(dropna=False).reset_index().to_excel('page countries.xlsx')

Based on eyeballing of the most common Wikipedia countries for which we have no population data - I included all countries with 70 or more pages - I came up with the following mapping that I'll use to update the country field in the Wikipedia data, so we _will_ have population data. This is probably clearly better for some countries that map directly - for example updating 'South Korean' to 'Korea, South'. It's fuzzier for at least some others, including for example Wikipedia countries that aren't around anymore in the way they were when the leader was in power, like Cape Colony and Rhodesia. Good enough for now though.

In addition, there are clearly others that I could also map - for example, 'Saint Vincent and the Grenadines' in the Wikipedia data could be changed to 'St. Vincent & the Grenadines', but all of the ones I didn't manually map have fewer than 70 pages.

In [40]:
wikipedia_country_to_population_map = {'Burkinabé':'Burkina Faso',
                                       'Ivorian':"Cote d'Ivoire",
                                       'Faroese':'Denmark',
                                       'Salvadoran':'El Salvador',
                                       'Hondura':'Honduras',
                                       'South Korean':'Korea, South',
                                       'Samoan':'Samoa',
                                       'Cape Colony':'United Kingdom',
                                       'Rhodesian':'Zimbabwe'}

In [41]:
d_wikipedia_with_scores['country'].update(d_wikipedia_with_scores['country'].map(wikipedia_country_to_population_map))

And then we need to re-merge the population data.

**TODO** This is duplication. DRY.

In [57]:
d_all = pd.merge(left=d_wikipedia_with_scores, right=d_population[['Location','Data']],
                 how='outer', left_on='country', right_on='Location')
d_all.shape

(48017, 6)

In [58]:
d_all[d_all['Location'].isnull()]['country'].value_counts(dropna=False)

Cook Island                         67
Jersey                              61
Saint Lucian                        48
Pitcairn Islands                    46
Chechen                             38
East Timorese                       36
Saint Kitts and Nevis               32
South African Republic              31
Montserratian                       29
Tokelauan                           27
Greenlandic                         26
Somaliland                          25
Niuean                              25
Guernsey                            25
Omani                               25
Palauan                             23
São Tomé and Príncipe               22
Saint Vincent and the Grenadines    22
Carniolan                           22
South Ossetian                      19
Abkhazia                            16
Ossetian                             9
Dagestani                            7
Rojava                               3
Name: country, dtype: int64

In [59]:
d_all[d_all['country'].isnull()]['Location'].value_counts(dropna=False)

Oman                            1
Timor-Leste                     1
Brunei                          1
Channel Islands                 1
Puerto Rico                     1
Mayotte                         1
Macao, SAR                      1
St. Kitts-Nevis                 1
Sao Tome and Principe           1
St. Lucia                       1
New Caledonia                   1
St. Vincent & the Grenadines    1
Hong Kong, SAR                  1
Western Sahara                  1
Palau                           1
Curacao                         1
French Polynesia                1
Georgia                         1
Reunion                         1
Guam                            1
Name: Location, dtype: int64

In [60]:
d_all[:3]

Unnamed: 0,country,page,last_edit,Score,Location,Data
0,Abkhazia,Zurab Achba,802551672.0,C,,
1,Abkhazia,Garri Aiba,774499188.0,Stub,,
2,Abkhazia,Zaur Avidzba,803841397.0,C,,


And now that we've added in the additional remappings here, rows that still have a NaN 'Location' value are rows for which we have no population data and, per the assignment, we can get rid of them.

In [62]:
d_all.shape

(48017, 6)

In [63]:
sum(d_all['Location'].isnull())

684

In [67]:
d_all.dropna(subset=['Location'], inplace=True)
d_all.shape

(47333, 6)

In [68]:
d_all[:3]

Unnamed: 0,country,page,last_edit,Score,Location,Data
16,Afghanistan,Laghman Province,778690357.0,Start,Afghanistan,32247000.0
17,Afghanistan,Roqia Abubakr,779839643.0,Stub,Afghanistan,32247000.0
18,Afghanistan,Sitara Achakzai,803055503.0,GA,Afghanistan,32247000.0


Clean up columns to match assignment instructions: rename and drop extra join column.

In [69]:
del d_all['Location']
d_all.rename(columns={'page':'article_name',
                      'last_edit':'revision_id',
                      'Score':'article_quality',
                      'Data':'population'}, inplace=True)
d_all[:3]

Unnamed: 0,country,article_name,revision_id,article_quality,population
16,Afghanistan,Laghman Province,778690357.0,Start,32247000.0
17,Afghanistan,Roqia Abubakr,779839643.0,Stub,32247000.0
18,Afghanistan,Sitara Achakzai,803055503.0,GA,32247000.0


# Analysis

In [88]:
articles_per_person = d_all.groupby(['country']).apply(lambda g: g.size / g.iloc[0]['population'])
articles_per_person[:3]

country
Afghanistan    0.000051
Albania        0.000795
Algeria        0.000015
dtype: float64

In [89]:
articles_per_person['United States']

1.7074771235732666e-05

In [90]:
articles_per_person['Norway']

0.00063529820801626975

In [100]:
articles_per_person.sort_values(ascending=False)

country
Tuvalu                            0.023305
Nauru                             0.022560
San Marino                        0.013182
Monaco                            0.005645
Liechtenstein                     0.003859
Marshall Islands                  0.003364
Iceland                           0.003113
Tonga                             0.003049
Andorra                           0.002179
Samoa                             0.002008
Federated States of Micronesia    0.001845
Grenada                           0.001622
Luxembourg                        0.001581
Kiribati                          0.001455
Antigua and Barbuda               0.001389
Malta                             0.001194
Seychelles                        0.001185
Maldives                          0.001182
Fiji                              0.001148
Vanuatu                           0.001117
Dominica                          0.000882
New Zealand                       0.000860
Albania                           0.000795
Sol

In [83]:
def is_high_quality(score):
    if (score == 'FA') | (score == 'GA'):
        return True
    else:
        return False

In [84]:
d_all['article_quality'].value_counts(dropna=False)

Stub     24272
Start    15356
C         5787
GA         875
B          728
FA         295
NaN         20
Name: article_quality, dtype: int64

**TODO** Side comment: need to check why we have 20 articles with NaN quality scores.

In [86]:
(d_all['article_quality'].apply(is_high_quality)).value_counts(dropna=False)

False    46163
True      1170
Name: article_quality, dtype: int64

In [93]:
# define this instead of using a lambda, like above, so we can keep the lines around 80 chars wide
# a lambda would be fine as the actual function is still pretty short/a single liner
def get_high_quality_article_proportion(articles):
    return sum(articles['article_quality'].apply(is_high_quality)) / articles.size

high_quality_articles_per_all_articles = d_all.groupby(['country']).apply(get_high_quality_article_proportion)
high_quality_articles_per_all_articles[:3]

country
Afghanistan    0.011009
Albania        0.002174
Algeria        0.005042
dtype: float64

In [97]:
high_quality_articles_per_all_articles.sort_values(ascending=False)

country
Central African Republic          0.023529
Romania                           0.021267
Vanuatu                           0.019355
Guinea-Bissau                     0.019048
Saudi Arabia                      0.018487
Vietnam                           0.017801
Singapore                         0.017391
Ireland                           0.016751
United States                     0.016044
Zambia                            0.015385
Qatar                             0.015094
Latvia                            0.014286
Senegal                           0.013953
Palestinian Territory             0.013904
Liechtenstein                     0.013793
United Arab Emirates              0.013333
Korea, North                      0.013208
Benin                             0.012766
Ukraine                           0.012384
United Kingdom                    0.012262
Bhutan                            0.012121
Gambia                            0.012048
Syria                             0.011765
Afg

* **TODO** double check at least a few of the above calcs for both sets by calculating the proportions manually.
* **TODO** also double check the with population calcs below

If we filter out countries with 'smaller' populations, what do we see for the same lists? First we'll add in the population data and then we can use that to filter the lists.

In [126]:
articles_per_person = pd.concat([articles_per_person, 
           d_population[['Location','Data']].set_index('Location')['Data']],
           axis=1).rename(columns={0:'proportion','Data':'population'})
articles_per_person[:3]

Unnamed: 0,proportion,population
Afghanistan,5.1e-05,32247000.0
Albania,0.000795,2892000.0
Algeria,1.5e-05,39948000.0


In [127]:
high_quality_articles_per_all_articles = pd.concat([high_quality_articles_per_all_articles, 
           d_population[['Location','Data']].set_index('Location')['Data']],
           axis=1).rename(columns={0:'proportion','Data':'population'})
high_quality_articles_per_all_articles[:3]

Unnamed: 0,proportion,population
Afghanistan,0.011009,32247000
Albania,0.002174,2892000
Algeria,0.005042,39948000


And now we can look at the lists again, with a threshold population.

In [136]:
pop_thresh = 50000000

In [138]:
articles_per_person[articles_per_person['population'] > pop_thresh].sort_values(by='proportion', ascending=False)['proportion']

France                 0.000132
United Kingdom         0.000073
Italy                  0.000067
Iran                   0.000053
Korea, South           0.000044
Germany                0.000044
Mexico                 0.000042
South Africa           0.000039
Tanzania               0.000039
Russia                 0.000033
Pakistan               0.000027
Philippines            0.000025
Turkey                 0.000023
Myanmar                0.000023
Nigeria                0.000019
Japan                  0.000017
United States          0.000017
Brazil                 0.000014
Egypt                  0.000013
Vietnam                0.000010
Bangladesh             0.000010
Congo, Dem. Rep. of    0.000010
Thailand               0.000009
Ethiopia               0.000005
Indonesia              0.000004
China                  0.000004
India                  0.000004
Name: proportion, dtype: float64

In [139]:
high_quality_articles_per_all_articles[high_quality_articles_per_all_articles['population'] > pop_thresh].sort_values(by='proportion', ascending=False)['proportion']

Vietnam                0.017801
United States          0.016044
United Kingdom         0.012262
Egypt                  0.009244
Indonesia              0.008219
Philippines            0.007797
Russia                 0.007741
China                  0.007557
Congo, Dem. Rep. of    0.006993
South Africa           0.006928
Myanmar                0.005907
Germany                0.005610
Japan                  0.004535
Bangladesh             0.004321
Iran                   0.004067
Korea, South           0.004027
France                 0.003658
Pakistan               0.003181
India                  0.003058
Brazil                 0.002513
Turkey                 0.002222
Italy                  0.001921
Ethiopia               0.001887
Thailand               0.001770
Tanzania               0.001471
Mexico                 0.001297
Nigeria                0.001168
Name: proportion, dtype: float64

In [116]:
foo[foo['Population'].isnull()]

Unnamed: 0,0,Population
foo,3.0,


In [121]:
foo.loc['Korea, North']

0             2.121443e-05
Population    2.498300e+07
Name: Korea, North, dtype: float64

In [105]:
d_wikipedia_with_scores['country'].value_counts(dropna=False).to_excel('after_map.xlsx')

In [107]:
sum(d_wikipedia['country'] == 'Korea, South')

290

In [80]:
d_all['Location'].value_counts(dropna=False)

France                            1695
NaN                               1649
Australia                         1568
China                             1138
United States                     1097
Mexico                            1079
Pakistan                          1069
India                              981
Russia                             956
Spain                              907
United Kingdom                     865
Canada                             851
Iran                               836
Italy                              833
Poland                             823
New Zealand                        791
Germany                            713
Netherlands                        702
Nigeria                            685
Norway                             660
Hungary                            648
Finland                            572
Brazil                             557
Taiwan                             524
Belgium                            523
Philippines              

In [67]:
d_scores

Unnamed: 0_level_0,Score
RevisionId,Unnamed: 1_level_1
802551672,C
774499188,Stub
803841397,C
789818648,Start
785284614,Start
798644673,Stub
728644481,Stub
788591677,Start
758713659,C
802860970,Start


# Combine