# Bias on Wikipedia
The goal of this assignment is to explore the concept of 'bias' through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. 
* perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies between countries
* list the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
* list the countries with the highest and lowest proportion of high quality articles about politicians.

## ORES request

ORES(Objective Revision Evaluation Service) is an artificial intelligence system used to identify vandalism on Wikipedia and distinguish it from good faith edits.
## References
* https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data
* https://en.wikipedia.org/wiki/Aaron_Halfaker
* https://www.mediawiki.org/wiki/ORES

## Data Sources
* http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14
* https://figshare.com/articles/Untitled_Item/5513449

### Setup

In [250]:
#import required libraries
import requests
import json
import pandas as pd
import numpy as np

Importing the other data is just a matter of reading CSV files in! (and for the R programmers - we'll have an R example up as soon as the Hub supports the language).

### Step1: Getting the article and population data

In [285]:
# downloaded from figshare
wiki_data = pd.read_csv('data-512-a2/data/raw/page_data.csv')
# downloaded from Population Reference Bureau
population_data = pd.read_csv('data-512-a2/data/raw/Population Mid-2015.csv', header = 2)

In [336]:
wiki_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [287]:
population_data = population_data.drop('Footnotes',1)
population_data.head()

Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data
0,Afghanistan,Country,Mid-2015,Number,32247000
1,Albania,Country,Mid-2015,Number,2892000
2,Algeria,Country,Mid-2015,Number,39948000
3,Andorra,Country,Mid-2015,Number,78000
4,Angola,Country,Mid-2015,Number,25000000


### Step 2: Getting article quality predictions

In [179]:
# function to get article quality prediction(provided in class from Oliver Keyes)
def get_ores_data(revision_ids, headers):    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'    
    # Define parameters
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    api_call = requests.get(endpoint.format(**params))
    response = api_call.json()
    for x in revision_ids:
        item = str(x)
        # extract prediction from json
        prediction.append(response['enwiki']['scores'][item]['wp10']['score']['prediction'])
        revision_id.append(x)

In [182]:
# Call get_ores_data() by providing 100 revision ids at a time
response = []
prediction = []
country =[]
article_name = []
revision_id = []
i=0
while(i < len(wiki_data)): 
    try:
        example_ids = []
        j = 0
        while(j < 100):
            example_ids.append(wiki_data['rev_id'][i])
            j = j + 1
            i = i + 1   
        get_ores_data(example_ids, headers)
    except Exception:
        pass
#print prediction and revision_id length   
len(prediction)
len(revision_id)

47062

In [183]:
# merge prediction and revision id which we got from get_ores_data()
wiki_df = pd.DataFrame({'article_quality':prediction, 'rev_id':revision_id})

In [187]:
# merge the dataframe with prediction and revision_id with wikipedia data 
wiki_merged_df = wiki_df.merge(wiki_data, left_on='rev_id', right_on='rev_id', how='inner')

In [189]:
wiki_merged_df.head()

Unnamed: 0,article_quality,rev_id,page,country
0,Stub,235107991,Template:ZambiaProvincialMinisters,Zambia
1,Stub,355319463,Bir I of Kanem,Chad
2,Stub,391862046,Template:Zimbabwe-politician-stub,Zimbabwe
3,Stub,391862070,Template:Uganda-politician-stub,Uganda
4,Stub,391862409,Template:Namibia-politician-stub,Namibia


In [191]:
# write wikipedia data along with article quality in a single dataframe
wikipedia_data = pd.DataFrame({
    'country':wiki_merged_df['country'],
    'article_name': wiki_merged_df['page'],
    'revision_id': wiki_merged_df['rev_id'],
    'article_quality': wiki_merged_df['article_quality']
} )

In [288]:
# write population data in a single dataframe, this will be merged with wikipedia data
pop_data = pd.DataFrame({
    'country':population_data['Location'],
    'population': population_data['Data']})

In [289]:
pop_data.head()

Unnamed: 0,country,population
0,Afghanistan,32247000
1,Albania,2892000
2,Algeria,39948000
3,Andorra,78000
4,Angola,25000000


### Step 3: Combining the datasets

In [194]:
#merge wikipedia and population data
final_merged_df = wikipedia_data.merge(pop_data, left_on='country', right_on='country', how='inner')

In [198]:
# resulting data on merging wikipedia and population data
final_merged_df.head()

Unnamed: 0,article_name,article_quality,country,revision_id,population
0,Template:ZambiaProvincialMinisters,Stub,Zambia,235107991,15473900
1,Gladys Lundwe,Stub,Zambia,757566606,15473900
2,Mwamba Luchembe,Stub,Zambia,764848643,15473900
3,Thandiwe Banda,Start,Zambia,768166426,15473900
4,Sylvester Chisembele,C,Zambia,776082926,15473900


In [197]:
# write merged dataframe in csv file
final_merged_df.to_csv('article_quality_data_with_population.csv', sep=',')

### Step 4: Analysis
#### Step 4(a): Number of politician articles as a proportion of country population

In [302]:
# get count of articles by country, this is same as groupby in sql
articles_by_country = final_merged_df.groupby('country').count()['article_name'].astype(int).reset_index()
articles_by_country = pd.DataFrame({'country':articles_by_country['country'], 'articles_count':articles_by_country['article_name']})
articles_by_country.head()

Unnamed: 0,articles_count,country
0,327,Afghanistan
1,458,Albania
2,119,Algeria
3,34,Andorra
4,110,Angola


In [319]:
# merge grouped data with population data
articles_proportion = articles_by_country.merge(pop_data, left_on='country', right_on='country', how='inner')
articles_proportion['percentage'] = articles_proportion['articles_count']*100/articles_proportion['population']

In [320]:
# sort the data in descending order to get list of top 10 and bottom 10 countries
rank_of_countries =articles_proportion.sort_values(['percentage'], ascending=[False])
rank_of_countries = rank_of_countries.dropna()

In [321]:
#10 highest-ranked countries in terms of number of politician articles as a proportion of country population
rank_of_countries.head(10)

Unnamed: 0,articles_count,country,population,percentage
120,53,Nauru,10860,0.488029
173,55,Tuvalu,11800,0.466102
141,82,San Marino,33000,0.248485
113,40,Monaco,38088,0.10502
97,29,Liechtenstein,37570,0.077189
107,37,Marshall Islands,55000,0.067273
72,206,Iceland,330828,0.062268
168,63,Tonga,103300,0.060987
3,34,Andorra,78000,0.04359
54,38,Federated States of Micronesia,103000,0.036893


In [322]:
#10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
rank_of_countries.tail(10)

Unnamed: 0,articles_count,country,population,percentage
13,322,Bangladesh,160411000,0.000201
38,142,"Congo, Dem. Rep. of",73340200,0.000194
166,111,Thailand,65121250,0.00017
185,26,Zambia,15473900,0.000168
86,38,"Korea, North",24983000,0.000152
53,105,Ethiopia,98148000,0.000107
180,29,Uzbekistan,31290791,9.3e-05
74,214,Indonesia,255741973,8.4e-05
34,1133,China,1371920000,8.3e-05
73,983,India,1314097616,7.5e-05


#### Step 4(b): Number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [324]:
# get count of high quality articles by country by selecting rows with 'FA' and 'GA' and then grouping on country
GA_FA_quality = pd.concat([final_merged_df.loc[final_merged_df['article_quality']=='FA'], 
                           final_merged_df.loc[final_merged_df['article_quality']=='GA']])
GA_FA_quality = GA_FA_quality.groupby('country').count()['article_name'].reset_index()
GA_FA_quality = pd.DataFrame({'country':GA_FA_quality['country'], 'GA_FA_articles_count':GA_FA_quality['article_name']})
GA_FA_quality.head()

Unnamed: 0,GA_FA_articles_count,country
0,19,Afghanistan
1,5,Albania
2,3,Algeria
3,2,Angola
4,16,Argentina


In [331]:
# merge grouped data with articles by country data
GA_FA_articles_proportion = GA_FA_quality.merge(articles_by_country, left_on='country', right_on='country', how='inner')
GA_FA_articles_proportion['percentage_of_GA_FA'] = GA_FA_quality['GA_FA_articles_count']*100/GA_FA_articles_proportion['articles_count']

In [332]:
GA_FA_articles_proportion.head()

Unnamed: 0,GA_FA_articles_count,country,articles_count,percentage_of_GA_FA
0,19,Afghanistan,327,5.810398
1,5,Albania,458,1.091703
2,3,Algeria,119,2.521008
3,2,Angola,110,1.818182
4,16,Argentina,496,3.225806


In [338]:
## sort the data in descending order to get list of top 10 and bottom 10 countries
rank_of_countries_by_GA_FA =GA_FA_articles_proportion.sort_values(['percentage_of_GA_FA'], ascending=[False])

In [334]:
#10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
rank_of_countries_by_GA_FA.head(10)

Unnamed: 0,GA_FA_articles_count,country,articles_count,percentage_of_GA_FA
66,8,"Korea, North",38,21.052632
110,45,Romania,341,13.196481
22,8,Central African Republic,68,11.764706
113,13,Saudi Arabia,117,11.111111
109,5,Qatar,51,9.803922
51,2,Guinea-Bissau,21,9.52381
12,3,Bhutan,33,9.090909
144,17,Vietnam,190,8.947368
59,30,Ireland,379,7.915567
139,85,United States,1086,7.826888


In [335]:
#10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
rank_of_countries_by_GA_FA.tail(10)


Unnamed: 0,GA_FA_articles_count,country,articles_count,percentage_of_GA_FA
98,4,Nigeria,683,0.585652
31,1,Cuba,175,0.571429
77,1,Luxembourg,180,0.555556
135,1,Uganda,187,0.534759
88,2,Moldova,426,0.469484
76,1,Lithuania,248,0.403226
33,1,Czech Republic,252,0.396825
105,1,Peru,354,0.282486
129,1,Tanzania,404,0.247525
41,1,Finland,572,0.174825


## Writeup
There are some surprising findings and some not so surpring findings from the data.
#### Surprising findings:
    * 9 out of the top 10 countries  with highest percentage of politician articles have count of articles in 2-digits. Contast this with India, China and US which have thosands of articles published. On closer look, the countries featured in the top 10 list because of the low population not because of unusually high number of articles. 
    * most of the top 10 countries with highest percentage of politician articles do not have English as native language. Still considerable number of articles are published in English wrt population. 
    * North Korea features on top of the charts for high quality article. Given North Korea remains a closed state with mostly non English speakers, having >20% high quality articles is very surprising.
#### Not so surpring findings:  
    * India and China have the least percentage of politician articles. Given that these 2 countries have largest population in the world the ratio is bound to be low irrespective of the high number of articles.
    * the lowest number of high quality artciles are with the countries which have a non English native language. Thus this finding is not very surprising and as expected. Would be interesting to see though how these countries fare in their native languages on Wikipedia.