# A2 Bias in Data

In [5]:
#Import libraries
import requests
import json
import pandas as pd
import numpy as np

In [1]:
#Need to cite this code
#https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model
headers = {'User-Agent' : 'https://github.com/dwhite105', 'From' : 'dkwhite@uw.edu'}
def get_ores_data(revision_ids, headers):
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    #Send the identifying headers along with the request
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response

In [3]:
#Import page data from csv file, downloaded from the link here. 
#Update path
page_data = pd.read_csv('A2_Github/country/data/page_data.csv', index_col = None)
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


The code block below calls the ORES API. First, the revision IDs are collected from the page data and converted to strings. There are 47,197 IDs to score, but the API limits how many can be submitted in a single request. The code below does three things:

* Collects the rev_ids in increments of 100 to pass through the API.
* Extracts the prediction for each ID; for articles without a prediction, it logs a NaN.
* Iterates until all IDs have been passed through the API, returning a list of all the predictions.

In [26]:
rev_ids = list(page_data.rev_id.apply(str))
rev_idx = 0
increment = 100
article_predictions = []
length = len(page_data)
while rev_idx < length:
    #Slicing past the end of a list is safe, so the final batch is simply shorter
    rev_ids_subset = rev_ids[rev_idx:rev_idx+increment]
    
    ores = get_ores_data(rev_ids_subset,headers)
    for i in rev_ids_subset:
        try:
            prediction = ores['enwiki']['scores'][i]['wp10']['score']['prediction']
        except KeyError:
            prediction = np.nan
        article_predictions.append(prediction)
    rev_idx = rev_idx + increment
len(article_predictions)

47197
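
The batching and extraction steps described above can also be written as small reusable helpers. The following is only a sketch: `chunked`, `extract_prediction`, `collect_predictions`, and `fake_fetch` are hypothetical names that do not appear in the notebook, and the fake response is an invented stand-in shaped after the lookup path used in the loop above, not a recorded ORES reply.

```python
import numpy as np


def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_prediction(response, rev_id, project='enwiki', model='wp10'):
    """Return the predicted class for one revision, or NaN if it is missing."""
    try:
        return response[project]['scores'][rev_id][model]['score']['prediction']
    except KeyError:
        return np.nan


def collect_predictions(rev_ids, fetch, batch_size=100):
    """Score `rev_ids` batch by batch; `fetch` takes a list of IDs and returns a dict."""
    predictions = []
    for batch in chunked(rev_ids, batch_size):
        response = fetch(batch)
        predictions.extend(extract_prediction(response, r) for r in batch)
    return predictions


def fake_fetch(batch):
    """Hypothetical stand-in for the API call: only revision '2' has a score."""
    scores = {r: {'wp10': {}} for r in batch}
    if '2' in scores:
        scores['2'] = {'wp10': {'score': {'prediction': 'Stub'}}}
    return {'enwiki': {'scores': scores}}


demo_predictions = collect_predictions(['1', '2', '3'], fake_fetch, batch_size=2)
```

With the real API, `fetch` could be something like `lambda batch: get_ores_data(batch, headers)`. Separating the fetch step from the parsing step makes the loop testable without network access.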

Next, a dataframe is constructed with each article name, article quality, revision ID, and country. 

In [28]:
page_df = pd.DataFrame({'country': page_data.country, 
              'article_name':page_data.page , 
              'revision_id': rev_ids,
              'article_quality' : article_predictions})
page_df.head()              

Unnamed: 0,article_name,article_quality,country,revision_id
0,Template:ZambiaProvincialMinisters,,Zambia,235107991
1,Bir I of Kanem,Stub,Chad,355319463
2,Template:Zimbabwe-politician-stub,Stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Stub,Uganda,391862070
4,Template:Namibia-politician-stub,Stub,Namibia,391862409


Population data for each country is also read into memory. The column names are changed for easier future reference.

In [29]:
pop_df = pd.read_csv('A2_Github/WPDS_2018_data.csv')
pop_df = pop_df.rename(columns = {'Geography': 'country',
                                  'Population mid-2018 (millions)': 'population'})
pop_df.head()

Unnamed: 0,country,population
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


The two dataframes are then merged on the common country name. A left join is performed to preserve all the page data observations.

In [38]:
df = pd.merge(page_df, pop_df, how='left', on = 'country')
df.head()             

Unnamed: 0,article_name,article_quality,country,revision_id,population
0,Template:ZambiaProvincialMinisters,,Zambia,235107991,17.7
1,Bir I of Kanem,Stub,Chad,355319463,15.4
2,Template:Zimbabwe-politician-stub,Stub,Zimbabwe,391862046,14.0
3,Template:Uganda-politician-stub,Stub,Uganda,391862070,44.1
4,Template:Namibia-politician-stub,Stub,Namibia,391862409,2.5
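
One way to see which countries found no match in a left join like the one above is the `indicator` option of `pd.merge`. This is a minimal sketch on toy data: the frames `pages` and `population` are illustrative, and 'Atlantis' is an invented country name used only to force a non-match.

```python
import pandas as pd

# Toy page data; 'Atlantis' is invented so that one row has no population match
pages = pd.DataFrame({'country': ['Chad', 'Zambia', 'Atlantis'],
                      'article_name': ['a', 'b', 'c']})
population = pd.DataFrame({'country': ['Chad', 'Zambia'],
                           'population': [15.4, 17.7]})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
merged = pd.merge(pages, population, how='left', on='country', indicator=True)

# Countries present in the page data but absent from the population data
unmatched = merged.loc[merged['_merge'] == 'left_only', 'country'].unique().tolist()
```

Rows flagged `left_only` are exactly the ones that end up with a NaN population after the join.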


The dataframe snapshot above shows that some articles do not have a predicted article quality; others were not matched with a country in the population data. Observations that contain any NaNs are removed from the dataset, and the dataframe is saved as a .csv file.

In [39]:
df1 = df.dropna()
df1.to_csv('wiki_article_quality_population_data.csv', index = False)
df1.head()

Unnamed: 0,article_name,article_quality,country,revision_id,population
1,Bir I of Kanem,Stub,Chad,355319463,15.4
2,Template:Zimbabwe-politician-stub,Stub,Zimbabwe,391862046,14.0
3,Template:Uganda-politician-stub,Stub,Uganda,391862070,44.1
4,Template:Namibia-politician-stub,Stub,Namibia,391862409,2.5
5,Template:Nigeria-politician-stub,Stub,Nigeria,391862819,195.9
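
Before dropping rows, it can help to count how many values are missing in each column, so the effect of `dropna()` is visible. The sketch below uses an invented toy frame (`toy_pages`), not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy data: one row lacks a quality prediction, another lacks a population
toy_pages = pd.DataFrame({
    'article_name': ['a', 'b', 'c', 'd'],
    'article_quality': [np.nan, 'Stub', 'GA', 'Stub'],
    'population': [17.7, 15.4, np.nan, 14.0],
})

# Number of missing values per column
missing_per_column = toy_pages.isna().sum()

# Drop incomplete rows; .copy() makes the result an independent dataframe
cleaned = toy_pages.dropna().copy()
rows_dropped = len(toy_pages) - len(cleaned)
```

Taking an explicit copy after `dropna()` is one way to avoid pandas' SettingWithCopyWarning when new columns are added later.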


A new column named 'high_quality' is created, which holds a 1 if the article is rated FA or GA and a 0 if it is not. A count of the values in this column shows that 980 articles were deemed to be of high quality.

In [40]:
#assign returns a new dataframe, which avoids the SettingWithCopyWarning
df1 = df1.assign(high_quality = np.where((df1.article_quality == 'FA') | (df1.article_quality == 'GA'),1,0))
df1.head()

Unnamed: 0,article_name,article_quality,country,revision_id,population,high_quality
1,Bir I of Kanem,Stub,Chad,355319463,15.4,0
2,Template:Zimbabwe-politician-stub,Stub,Zimbabwe,391862046,14.0,0
3,Template:Uganda-politician-stub,Stub,Uganda,391862070,44.1,0
4,Template:Namibia-politician-stub,Stub,Namibia,391862409,2.5,0
5,Template:Nigeria-politician-stub,Stub,Nigeria,391862819,195.9,0


In [42]:
df1['high_quality'].value_counts()

0    43992
1      980
Name: high_quality, dtype: int64
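
An equivalent way to build the FA/GA flag is `Series.isin`, which scales more easily if further classes ever need to count as high quality. This is a sketch on an invented toy column (`quality_demo`), not a replacement for the cell above.

```python
import pandas as pd

# Toy quality labels; only FA and GA should be flagged as high quality
quality_demo = pd.DataFrame({'article_quality': ['FA', 'Stub', 'GA', 'C', 'Start']})

# isin gives a boolean mask, which is converted to 1/0
quality_demo['high_quality'] = quality_demo['article_quality'].isin(['FA', 'GA']).astype(int)

# Tally of flagged versus unflagged articles
flag_counts = quality_demo['high_quality'].value_counts()
```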

The analysis uses two percentages, defined here by example:

* If a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be 0.1%.
* If a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.

Four rankings are reported:

* 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
* 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
* 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
* 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
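
The two worked percentages, and a ranking built from them, can be sketched as follows. The four rows reuse counts that appear in this notebook's outputs. The helper `percent` and the column names are hypothetical, and the sketch assumes the population figures are in millions, as the source column name 'Population mid-2018 (millions)' suggests.

```python
import pandas as pd


def percent(part, whole):
    """Express `part` as a percentage of `whole`."""
    return 100 * part / whole


# The two worked examples from the text
example_articles_pct = percent(10, 10_000)   # 10 articles, 10,000 people
example_quality_pct = percent(2, 10)         # 2 FA/GA articles out of 10

demo = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra'],
    'article_count': [326, 460, 119, 34],
    'quality_article_count': [10, 4, 2, 0],
    'population_millions': [36.5, 2.9, 42.7, 0.08],
})

# Assumption: population is in millions, so convert to people first
demo['articles_pct_of_population'] = percent(
    demo['article_count'], demo['population_millions'] * 1_000_000)
demo['quality_pct'] = percent(demo['quality_article_count'], demo['article_count'])

# Highest-ranked countries on each measure
top2_by_population = demo.nlargest(2, 'articles_pct_of_population')['country'].tolist()
top1_by_quality = demo.nlargest(1, 'quality_pct')['country'].tolist()
```

Note that the notebook's own `articles_per_population` column divides by the population in millions, so it is a count of articles per million people rather than a percentage. The ranking order is the same either way.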

In [88]:
#Aggregate per country: article count, high-quality article count, and population

country_articles = df1.groupby(['country'], as_index = False).agg({'article_name': 'count', 
                                                 'high_quality': 'sum', 
                                                 'population' : 'max'})
country_articles.rename(columns = {'article_name' : 'article_count',
                                  'high_quality' : 'quality_article_count'}, inplace = True)

#Remove thousands separators (if any) from population and convert to numeric
country_articles['population'] = pd.to_numeric(
    country_articles['population'].astype(str).str.replace(',',''))


country_articles['articles_per_population'] = country_articles['article_count'] / (country_articles['population'])
country_articles['percent_quality_article'] = country_articles['quality_article_count'] / country_articles['article_count']
country_articles.head()


Unnamed: 0,country,article_count,quality_article_count,population,articles_per_population,percent_quality_article
0,Afghanistan,326,10,36.5,8.931507,0.030675
1,Albania,460,4,2.9,158.62069,0.008696
2,Algeria,119,2,42.7,2.786885,0.016807
3,Andorra,34,0,0.08,425.0,0.0
4,Angola,110,0,30.4,3.618421,0.0


In [89]:
top10_article_per_population = country_articles[
    ['country','articles_per_population','article_count','population']].sort_values(
    'articles_per_population', ascending = False).head(10)
bottom10_article_per_population = country_articles[
    ['country','articles_per_population','article_count','population']].sort_values(
    'articles_per_population').head(10)

In [90]:
top10_article_per_population

Unnamed: 0,country,articles_per_population,article_count,population
166,Tuvalu,5500.0,55,0.01
115,Nauru,5300.0,53,0.01
135,San Marino,2733.333333,82,0.03
108,Monaco,1000.0,40,0.04
93,Liechtenstein,725.0,29,0.04
161,Tonga,630.0,63,0.1
103,Marshall Islands,616.666667,37,0.06
68,Iceland,515.0,206,0.4
3,Andorra,425.0,34,0.08
52,Federated States of Micronesia,380.0,38,0.1


In [91]:
bottom10_article_per_population

Unnamed: 0,country,articles_per_population,article_count,population
69,India,0.719026,986,1371.3
70,Indonesia,0.806938,214,265.2
34,China,0.814321,1135,1393.8
173,Uzbekistan,0.881459,29,32.9
51,Ethiopia,0.976744,105,107.5
178,Zambia,1.412429,25,17.7
82,"Korea, North",1.523438,39,25.6
159,Thailand,1.691843,112,66.2
13,Bangladesh,1.941106,323,166.4
112,Mozambique,1.967213,60,30.5


In [98]:
top10_percent_quality_article = country_articles[
    ['country','percent_quality_article', 'quality_article_count','article_count']].sort_values(
    'percent_quality_article', ascending = False).head(10)
bottom10_percent_quality_article = country_articles[
    ['country','percent_quality_article', 'quality_article_count','article_count']].sort_values(
    ['percent_quality_article', 'article_count'], ascending = [True, False]).head(10)
top10_percent_quality_article.index = np.arange(1, len(top10_percent_quality_article)+1)
bottom10_percent_quality_article.index = np.arange(1, len(bottom10_percent_quality_article)+1)

In [99]:
top10_percent_quality_article

Unnamed: 0,country,percent_quality_article,quality_article_count,article_count
1,"Korea, North",0.179487,7,39
2,Saudi Arabia,0.134454,16,119
3,Central African Republic,0.117647,8,68
4,Romania,0.114943,40,348
5,Mauritania,0.096154,5,52
6,Bhutan,0.090909,3,33
7,Tuvalu,0.090909,5,55
8,Dominica,0.083333,1,12
9,United States,0.07516,82,1091
10,Benin,0.074468,7,94


In [100]:
bottom10_percent_quality_article

Unnamed: 0,country,percent_quality_article,quality_article_count,article_count
1,Finland,0.0,0,572
2,Belgium,0.0,0,523
3,Moldova,0.0,0,426
4,Switzerland,0.0,0,407
5,Nepal,0.0,0,361
6,Uganda,0.0,0,188
7,Costa Rica,0.0,0,150
8,Tunisia,0.0,0,140
9,Slovakia,0.0,0,119
10,Angola,0.0,0,110


In [93]:
#Count high-quality (FA/GA) articles per country
country_groupby = pd.DataFrame(df1.groupby(['country'])['high_quality'].sum())
country_groupby = country_groupby.reset_index().rename(columns = {'high_quality': 'quality'})
quality_table = pd.merge(country_groupby, pop_df, how = 'inner', on = 'country')
#Remove thousands separators (if any) from population and convert to numeric
quality_table['population'] = pd.to_numeric(
    quality_table['population'].astype(str).str.replace(',',''))
#High-quality articles per unit of population
quality_table['proportion'] = quality_table['quality'] / quality_table['population']
quality_table.head()


Unnamed: 0,country,quality,population,proportion
0,Afghanistan,10.0,36.5,0.273973
1,Albania,4.0,2.9,1.37931
2,Algeria,2.0,42.7,0.046838
3,Andorra,0.0,0.08,0.0
4,Angola,0.0,30.4,0.0


In [99]:
country_count_df = pd.DataFrame(df1.groupby(['country'])['article_name'].count())
country_count_df = country_count_df.reset_index()
total_articles_table = pd.merge(country_count_df, pop_df, how = 'inner', on = 'country')
total_articles_table.head()


Unnamed: 0,country,article_name,population
0,Afghanistan,326,36.5
1,Albania,460,2.9
2,Algeria,119,42.7
3,Andorra,34,0.08
4,Angola,110,30.4


In [95]:
quality_table.sort_values(by = 'proportion', ascending = False).head(10)

Unnamed: 0,country,quality,population,proportion
166,Tuvalu,5.0,0.01,500.0
44,Dominica,1.0,0.07,14.285714
61,Grenada,1.0,0.1,10.0
161,Tonga,1.0,0.1,10.0
174,Vanuatu,3.0,0.3,10.0
100,Maldives,2.0,0.4,5.0
68,Iceland,2.0,0.4,5.0
73,Ireland,24.0,4.9,4.897959
19,Bhutan,3.0,0.8,3.75
74,Israel,21.0,8.5,2.470588


In [96]:
quality_table.sort_values(by = 'proportion').head(10)

Unnamed: 0,country,quality,population,proportion
143,Slovakia,0.0,5.4,0.0
90,Lesotho,0.0,2.3,0.0
28,Cameroon,0.0,25.6,0.0
30,Cape Verde,0.0,0.6,0.0
178,Zambia,0.0,17.7,0.0
36,Comoros,0.0,0.8,0.0
116,Nepal,0.0,29.7,0.0
154,Switzerland,0.0,8.5,0.0
43,Djibouti,0.0,1.0,0.0
145,Solomon Islands,0.0,0.7,0.0
