# Bias on Wikipedia

For this assignment (https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data), your job is to analyze what the nature of political articles on Wikipedia - both their existence, and their quality - can tell us about bias in Wikipedia's content.

In [1]:
import pandas as pd
import numpy as np

### Getting the article and population data
In step 1, we read two datasets from CSV files: the Wikipedia dataset and the population dataset. The Wikipedia dataset is downloaded from Figshare; the population data comes from the Population Reference Bureau website.

In [2]:
wikidf = pd.read_csv('page_data.csv')
populationdf = pd.read_csv('Population Mid-2015.csv',skiprows=1)
del populationdf['Footnotes']

### Getting article quality predictions

In step 2, we obtain article quality predictions from a machine learning system, ORES ("Objective Revision Evaluation Service"). ORES assigns each article to one of the following six quality categories:
FA - Featured article
GA - Good article
B - B-class article
C - C-class article
Start - Start-class article
Stub - Stub-class article

In [3]:
import requests
import json

headers = {'User-Agent' : 'https://github.com/dianachenyu', 'From' : 'dczhang@uw.edu'}

def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
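    # Pass the identifying headers along with the request; they were previously accepted but never sent.
    api_call = requests.get(endpoint.format(**params), headers=headers)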
    response = api_call.json()
    return response

def compute_chunk(rev_ids_chunk,result, headers):
    ores = get_ores_data(rev_ids_chunk, headers)['enwiki']['scores']
    for key in ores:
        try:
            result.append([key,ores[key]['wp10']['score']['prediction']])
        except KeyError:
            continue
    return None
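The nested lookup in `compute_chunk` is easier to follow against a concrete response. The sketch below uses a hand-written dictionary in the shape that the code above indexes into; it is not a captured ORES response. The second revision ID and its `error` entry are hypothetical, included only to show why the `KeyError` branch exists, and `extract_predictions` is an illustrative name rather than part of the notebook.

```python
# Hand-written stand-in for an ORES response, shaped like the lookups in compute_chunk.
# The second entry (ID and error payload) is hypothetical and only illustrates a missing score.
sample_response = {
    "enwiki": {
        "scores": {
            "235107991": {"wp10": {"score": {"prediction": "Stub"}}},
            "999999999999": {"wp10": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}

def extract_predictions(response):
    """Return [revision_id, prediction] pairs, skipping entries that have no score."""
    rows = []
    for rev_id, models in response["enwiki"]["scores"].items():
        try:
            rows.append([rev_id, models["wp10"]["score"]["prediction"]])
        except KeyError:
            # No 'score' key for this revision, so there is nothing to record.
            continue
    return rows

print(extract_predictions(sample_response))
```

Revisions without a score are dropped silently, so the resulting table can have fewer rows than the list of revision IDs that was submitted.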

In [4]:
result = []
all_rev_id = wikidf["rev_id"]
batch_size = 100

# Step through the revision IDs in batches of 100; the last batch may be shorter.
# Stepping by batch size avoids sending an empty request when the total is a multiple of 100.
for start in range(0, len(all_rev_id), batch_size):
    rev_ids_chunk = list(all_rev_id[start:start + batch_size])
    compute_chunk(rev_ids_chunk, result, headers)
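The batching of revision IDs can be checked on toy data before any requests are sent. The helper below is a minimal sketch; `chunked` and `toy_ids` are illustrative names, and the IDs are plain integers rather than real revision IDs.

```python
def chunked(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 250 made-up IDs should split into two full batches and one partial batch.
toy_ids = list(range(250))
batches = list(chunked(toy_ids))
print([len(batch) for batch in batches])

# Every ID appears exactly once, in the original order.
flattened = [rev_id for batch in batches for rev_id in batch]
print(flattened == toy_ids)
```

With this helper, an input whose length is an exact multiple of the batch size produces no empty trailing batch.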

In [5]:
ores_df = pd.DataFrame(result,columns=['revision_id', 'article_quality'])
ores_df.head()

Unnamed: 0,revision_id,article_quality
0,235107991,Stub
1,355319463,Stub
2,391862046,Stub
3,391862070,Stub
4,391862409,Stub


### Combining the datasets
In step 3, we combine the Wikipedia article quality data calculated in the previous step with the population dataset.

In [6]:
ores_df['revision_id']=ores_df['revision_id'].astype(str).astype(int)
wiki_quality_df = ores_df.merge(wikidf, how='inner',left_on='revision_id',right_on='rev_id',copy=True)
del wiki_quality_df['rev_id']

In [7]:
all_data_df = wiki_quality_df.merge(populationdf, how='inner',left_on='country',right_on='Location',copy=True)
all_data_df = all_data_df.drop(['Location', 'Location Type','TimeFrame','Data Type'], axis=1)
all_data_df.rename(columns={'page': 'article_name', 'Data': 'population'}, inplace=True)
all_data_df.to_csv("wikipedia_ores_population.csv")
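An inner merge keeps only countries that appear in both tables, so rows whose country name has no exact match in the population data are dropped. The sketch below shows this on made-up rows ("Atlantis" and the page names are hypothetical) and uses the `indicator` option of `DataFrame.merge` to list which keys failed to match on each side.

```python
import pandas as pd

# Made-up article rows; "Atlantis" has no population entry.
articles = pd.DataFrame({
    "country": ["Iceland", "Iceland", "Atlantis"],
    "page": ["Politician A", "Politician B", "Politician C"],
})
# Made-up population rows; "Tonga" has no articles here.
population = pd.DataFrame({
    "Location": ["Iceland", "Tonga"],
    "Data": ["330,828", "103,300"],
})

# Inner merge: only the two Iceland rows survive.
merged = articles.merge(population, how="inner", left_on="country", right_on="Location")
print(len(merged))

# Outer merge with indicator=True adds a '_merge' column marking unmatched rows.
outer = articles.merge(population, how="outer", left_on="country",
                       right_on="Location", indicator=True)
unmatched = outer[outer["_merge"] != "both"]
print(unmatched[["country", "Location", "_merge"]])
```

Inspecting the unmatched rows is one way to see how many articles the analysis leaves out because of differences in country naming between the two sources.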

### Analysis
In step 4, we calculate, for each country, the number of articles per capita (articles-per-population) and the proportion of high-quality articles among all of that country's articles.

In [8]:
article_num = all_data_df.groupby(['country']).count()['revision_id']

high_qual= all_data_df.loc[all_data_df['article_quality'].isin(['FA','GA'])]

high_qual_article_num = high_qual.groupby(['country']).count()['revision_id']

In [9]:
article_df = pd.DataFrame(article_num).merge(pd.DataFrame(high_qual_article_num), how='outer',left_index = True, right_index = True,copy=True)
article_df.fillna(value=0, inplace = True)
article_df.rename(columns={'revision_id_x': 'num_article', 'revision_id_y': 'num_high_quality_article'}, inplace=True)

article_df['country']=article_df.index
article_df = article_df.merge(populationdf, how='left',left_on = 'country', right_on='Location',copy=True)
article_df = article_df.drop(['Location', 'Location Type','TimeFrame','Data Type'], axis=1)
article_df.rename(columns={'Data': 'population'}, inplace=True)

In [10]:
article_df['population']=article_df['population'].str.replace(',','').astype(int)

article_df['articles_per_population'] = article_df['num_article']/article_df['population'] 
article_df['high_qual_articles_proportion'] = article_df['num_high_quality_article']/article_df['num_article'] 
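As a sanity check, the two metrics can be recomputed by hand for a couple of countries. The counts and populations below are copied from the `article_df.head()` output in this notebook; writing the populations as comma-formatted strings is an assumption about the raw CSV, based on the `str.replace(',','')` call above.

```python
# Counts and populations taken from the article_df.head() output in this notebook.
# The comma-formatted population strings are an assumption about the raw CSV format.
rows = [
    {"country": "Afghanistan", "num_article": 327,
     "num_high_quality_article": 15, "population": "32,247,000"},
    {"country": "Andorra", "num_article": 34,
     "num_high_quality_article": 0, "population": "78,000"},
]

for row in rows:
    # Strip thousands separators before converting to an integer.
    population = int(row["population"].replace(",", ""))
    row["articles_per_population"] = row["num_article"] / population
    row["high_qual_articles_proportion"] = (
        row["num_high_quality_article"] / row["num_article"]
    )
    print(row["country"],
          row["articles_per_population"],
          round(row["high_qual_articles_proportion"], 6))
```

Both values are plain ratios rather than percentages; multiplying by 100 would convert them.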

In [11]:
article_df.head()

Unnamed: 0,num_article,num_high_quality_article,country,population,articles_per_population,high_qual_articles_proportion
0,327,15.0,Afghanistan,32247000,1e-05,0.045872
1,460,5.0,Albania,2892000,0.000159,0.01087
2,119,2.0,Algeria,39948000,3e-06,0.016807
3,34,0.0,Andorra,78000,0.000436,0.0
4,110,1.0,Angola,25000000,4e-06,0.009091


### Tables

In step 5, we sort and output the top 10 and bottom 10 countries by articles-per-population, and the top 10 and bottom 10 countries by the proportion of high-quality articles among all articles for that country.

In [12]:
table_highest_articles_per_population = article_df.sort_values('articles_per_population', ascending=False).head(10)
table_highest_articles_per_population 

Unnamed: 0,num_article,num_high_quality_article,country,population,articles_per_population,high_qual_articles_proportion
120,53,0.0,Nauru,10860,0.00488,0.0
173,55,3.0,Tuvalu,11800,0.004661,0.054545
141,82,0.0,San Marino,33000,0.002485,0.0
113,40,0.0,Monaco,38088,0.00105,0.0
97,29,0.0,Liechtenstein,37570,0.000772,0.0
107,37,0.0,Marshall Islands,55000,0.000673,0.0
72,206,2.0,Iceland,330828,0.000623,0.009709
168,63,0.0,Tonga,103300,0.00061,0.0
3,34,0.0,Andorra,78000,0.000436,0.0
54,38,0.0,Federated States of Micronesia,103000,0.000369,0.0


In [13]:
table_lowest_articles_per_population = article_df.sort_values('articles_per_population', ascending=True).head(10)
table_lowest_articles_per_population 

Unnamed: 0,num_article,num_high_quality_article,country,population,articles_per_population,high_qual_articles_proportion
73,989,13.0,India,1314097616,7.526077e-07,0.013145
34,1138,35.0,China,1371920000,8.294944e-07,0.030756
74,215,8.0,Indonesia,255741973,8.406911e-07,0.037209
180,29,3.0,Uzbekistan,31290791,9.267902e-07,0.103448
53,105,3.0,Ethiopia,98148000,1.069813e-06,0.028571
86,39,9.0,"Korea, North",24983000,1.561062e-06,0.230769
185,26,0.0,Zambia,15473900,1.680249e-06,0.0
166,112,3.0,Thailand,65121250,1.719869e-06,0.026786
38,142,8.0,"Congo, Dem. Rep. of",73340200,1.936182e-06,0.056338
13,324,3.0,Bangladesh,160411000,2.019812e-06,0.009259


In [14]:
table_highest_qual_articles_proportion = article_df.sort_values('high_qual_articles_proportion', ascending=False).head(10)
table_highest_qual_articles_proportion

Unnamed: 0,num_article,num_high_quality_article,country,population,articles_per_population,high_qual_articles_proportion
86,39,9.0,"Korea, North",24983000,1.561062e-06,0.230769
143,119,14.0,Saudi Arabia,31565109,3.769985e-06,0.117647
180,29,3.0,Uzbekistan,31290791,9.267902e-07,0.103448
31,68,7.0,Central African Republic,5551900,1.224806e-05,0.102941
138,348,34.0,Romania,19838662,1.754151e-05,0.097701
68,21,2.0,Guinea-Bissau,1788000,1.174497e-05,0.095238
19,33,3.0,Bhutan,757000,4.359313e-05,0.090909
183,191,16.0,Vietnam,91714080,2.082559e-06,0.08377
46,12,1.0,Dominica,68000,0.0001764706,0.083333
109,52,4.0,Mauritania,3641288,1.428066e-05,0.076923


In [15]:
table_lowest_qual_articles_proportion = article_df.sort_values('high_qual_articles_proportion', ascending=True).head(10)
table_lowest_qual_articles_proportion

Unnamed: 0,num_article,num_high_quality_article,country,population,articles_per_population,high_qual_articles_proportion
172,33,0.0,Turkmenistan,5373000,6e-06,0.0
164,40,0.0,Tajikistan,8452153,5e-06,0.0
113,40,0.0,Monaco,38088,0.00105,0.0
117,60,0.0,Mozambique,25736000,2e-06,0.0
120,53,0.0,Nauru,10860,0.00488,0.0
168,63,0.0,Tonga,103300,0.00061,0.0
30,37,0.0,Cape Verde,514000,7.2e-05,0.0
65,49,0.0,Guadeloupe,407000,0.00012,0.0
83,79,0.0,Kazakhstan,17544274,5e-06,0.0
158,40,0.0,Suriname,576000,6.9e-05,0.0


### Writeup

I am surprised by the four tables above; these are not the countries I expected. For example, table 1 lists the 10 highest-ranked countries by number of politician articles as a proportion of country population. I expected Britain to be on that list because, according to the last lecture, Wikipedia editing is very popular in Britain.

I think the bias comes from the evaluation method. The number of articles per capita may not be a good metric for measuring the coverage of politicians or the quality of articles on Wikipedia. If country A has 100 times the population of country B, it does not follow that country A has 100 times as many politicians as country B. There is an inherent bias introduced by the metric we used.
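The scaling argument can be illustrated with the two extremes from the tables above. This is a rough sketch using the article counts and populations reported for Nauru and India; it only compares ratios and makes no claim about how many politicians either country actually has.

```python
# Article counts and populations as reported in the tables above.
nauru = {"num_article": 53, "population": 10860}
india = {"num_article": 989, "population": 1314097616}

# How many times larger India is than Nauru on each measure.
population_ratio = india["population"] / nauru["population"]
article_ratio = india["num_article"] / nauru["num_article"]

print(f"population ratio: {population_ratio:.0f}")
print(f"article ratio:    {article_ratio:.1f}")
```

India has roughly 121,000 times the population of Nauru but only about 19 times as many politician articles, so dividing by population pushes very small countries to the top of the ranking and very large ones to the bottom almost regardless of how well each is covered.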