# Bias on Wikipedia

The objective of this project is to determine whether bias exists in Wikipedia articles about political figures from countries around the world. We analyze article quality and politician coverage across countries.

In [1]:
import pandas as pd
import numpy as np

### Getting the article and population data
In step 1, we read two datasets from CSV files: the Wikipedia dataset and the population dataset. The Wikipedia dataset is downloaded from [Figshare](https://figshare.com/articles/Untitled_Item/5513449); the population data comes from the [Population Reference Bureau website](http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14).

In [2]:
wikidf = pd.read_csv('raw_data/page_data.csv')
populationdf = pd.read_csv('raw_data/Population Mid-2015.csv',skiprows=1)
del populationdf['Footnotes']

### Getting article quality predictions

In step 2, we obtain article quality predictions from ORES ("Objective Revision Evaluation Service"), a machine learning API. ORES assigns each article to one of the following six quality categories:
* FA - Featured article
* GA - Good article
* B - B-class article
* C - C-class article
* Start - Start-class article
* Stub - Stub-class article

In [3]:
import requests
import json

headers = {'User-Agent' : 'https://github.com/dianachenyu', 'From' : 'dczhang@uw.edu'}

def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
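    # Pass the identifying headers; the original call accepted them but never sent them.
    api_call = requests.get(endpoint.format(**params), headers=headers)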
    response = api_call.json()
    return response

def compute_chunk(rev_ids_chunk,result, headers):
    ores = get_ores_data(rev_ids_chunk, headers)['enwiki']['scores']
    for key in ores:
        try:
            result.append([key,ores[key]['wp10']['score']['prediction']])
        except KeyError:
            continue
    return None
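
To make the `KeyError` handling above concrete, here is a minimal sketch that parses a mocked response without calling the API. The nesting `['enwiki']['scores'][rev_id]['wp10']['score']['prediction']` comes from the code above; the revision IDs, the shape of the `error` entry, and the helper name `extract_predictions` are assumptions made for illustration.

```python
# Hypothetical response in the shape the parsing code above expects.
# The "error" entry is an assumption about how an unscorable revision is reported.
mock_response = {
    "enwiki": {
        "scores": {
            "1001": {"wp10": {"score": {"prediction": "Stub"}}},
            "1002": {"wp10": {"score": {"prediction": "GA"}}},
            "1003": {"wp10": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}

def extract_predictions(response):
    """Return [rev_id, prediction] pairs plus the IDs that had no score."""
    scores = response["enwiki"]["scores"]
    predictions, skipped = [], []
    for rev_id, models in scores.items():
        try:
            predictions.append([rev_id, models["wp10"]["score"]["prediction"]])
        except KeyError:
            # No "score" key: record the ID instead of dropping it silently.
            skipped.append(rev_id)
    return predictions, skipped

predictions, skipped = extract_predictions(mock_response)
print(predictions)
print(skipped)
```

Keeping the skipped IDs, rather than discarding them with `continue`, would make it possible to report how many articles were lost at this step.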

In [4]:
result = []
all_rev_id = wikidf["rev_id"]

# Query ORES in chunks of 100 revision IDs; the final chunk may be shorter.
# Stepping through range() avoids sending an empty chunk when the total
# is an exact multiple of the chunk size.
chunk_size = 100
for start in range(0, len(all_rev_id), chunk_size):
    rev_ids_chunk = list(all_rev_id[start:start + chunk_size])
    compute_chunk(rev_ids_chunk, result, headers)

In [5]:
ores_df = pd.DataFrame(result,columns=['revision_id', 'article_quality'])
ores_df.head()

| | revision_id | article_quality |
|---|---|---|
| 0 | 235107991 | Stub |
| 1 | 355319463 | Stub |
| 2 | 391862046 | Stub |
| 3 | 391862070 | Stub |
| 4 | 391862409 | Stub |


### Combining the datasets
In step 3, we combine the Wikipedia article quality data calculated in the previous step with the population dataset.

Note that both the Wikipedia dataset and the population dataset have a field containing the country name, but the values of the two fields do not overlap exactly: some countries exist in one dataset but not the other. In this project, we only use countries found in both datasets and drop the non-matching rows by using an inner join.

In [6]:
ores_df['revision_id']=ores_df['revision_id'].astype(str).astype(int)
wiki_quality_df = ores_df.merge(wikidf, how='inner',left_on='revision_id',right_on='rev_id',copy=True)
del wiki_quality_df['rev_id']

In [7]:
all_data_df = wiki_quality_df.merge(populationdf, how='inner',left_on='country',right_on='Location',copy=True)
all_data_df = all_data_df.drop(['Location', 'Location Type','TimeFrame','Data Type'], axis=1)
# The population column is named 'Data' (capitalized), as in the rename in cell 9.
all_data_df.rename(columns={'page': 'article_name', 'Data': 'population'}, inplace=True)
all_data_df.to_csv("wikipedia_ores_population.csv")
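
The inner join above drops non-matching countries silently. A minimal sketch of how one might list them first uses an outer join with `indicator=True`. The toy rows below are stand-ins: the Iceland and Tuvalu populations are taken from the tables in this notebook, while `Exampleland` and the article names are invented.

```python
import pandas as pd

# Toy stand-ins for the two datasets; "Exampleland" is an invented country.
articles = pd.DataFrame({
    "country": ["Iceland", "Iceland", "Exampleland"],
    "page": ["Politician A", "Politician B", "Politician C"],
})
population = pd.DataFrame({
    "Location": ["Iceland", "Tuvalu"],
    "Data": ["330,828", "11,800"],
})

# An outer join with indicator=True adds a "_merge" column that labels each
# row as "left_only", "right_only", or "both".
check = articles.merge(population, how="outer",
                       left_on="country", right_on="Location", indicator=True)

only_in_articles = sorted(check.loc[check["_merge"] == "left_only", "country"].unique())
only_in_population = sorted(check.loc[check["_merge"] == "right_only", "Location"].unique())

print(only_in_articles)
print(only_in_population)
```

Running the same check on `wiki_quality_df` and `populationdf` would show which countries differ only in spelling and could be reconciled before the join.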

### Analysis
In step 4, we calculate two metrics for each country, both expressed as percentages. An article counts as high-quality if ORES rates it FA or GA.
* articles per population: the number of articles for the country / the country's population * 100
* proportion of high-quality articles: the number of high-quality articles for the country / the number of articles for the country * 100
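
As a quick check of the two formulas, the sketch below applies them to the Afghanistan row from the `article_df.head()` output in this notebook (327 articles, 15 of them high-quality, population 32,247,000). The variable names are chosen for illustration only.

```python
# Values for Afghanistan, taken from the article_df.head() output in this notebook.
num_article = 327
num_high_quality_article = 15
population = 32247000

articles_per_population_pct = num_article / population * 100
high_qual_articles_pct = num_high_quality_article / num_article * 100

print(round(articles_per_population_pct, 6))  # 0.001014
print(round(high_qual_articles_pct, 6))       # 4.587156
```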

In [8]:
article_num = all_data_df.groupby(['country']).count()['revision_id']

high_qual= all_data_df.loc[all_data_df['article_quality'].isin(['FA','GA'])]

high_qual_article_num = high_qual.groupby(['country']).count()['revision_id']

In [9]:
article_df = pd.DataFrame(article_num).merge(pd.DataFrame(high_qual_article_num), how='outer',left_index = True, right_index = True,copy=True)
article_df.fillna(value=0, inplace = True)
article_df.rename(columns={'revision_id_x': 'num_article', 'revision_id_y': 'num_high_quality_article'}, inplace=True)

article_df['country']=article_df.index
article_df = article_df.merge(populationdf, how='left',left_on = 'country', right_on='Location',copy=True)
article_df = article_df.drop(['Location', 'Location Type','TimeFrame','Data Type'], axis=1)
article_df.rename(columns={'Data': 'population'}, inplace=True)

In [10]:
article_df['population']=article_df['population'].str.replace(',','').astype(int)

article_df['articles_per_population(as a percentage)'] = article_df['num_article']/article_df['population'] *100
article_df['high_qual_articles_proportion(as a percentage)'] = article_df['num_high_quality_article']/article_df['num_article'] *100

In [11]:
article_df.head()

| | num_article | num_high_quality_article | country | population | articles_per_population(as a percentage) | high_qual_articles_proportion(as a percentage) |
|---|---|---|---|---|---|---|
| 0 | 327 | 15.0 | Afghanistan | 32247000 | 0.001014 | 4.587156 |
| 1 | 460 | 5.0 | Albania | 2892000 | 0.015906 | 1.086957 |
| 2 | 119 | 2.0 | Algeria | 39948000 | 0.000298 | 1.680672 |
| 3 | 34 | 0.0 | Andorra | 78000 | 0.04359 | 0.0 |
| 4 | 110 | 1.0 | Angola | 25000000 | 0.00044 | 0.909091 |


### Tables

In step 5, we sort and output the top 10 and lowest 10 countries by articles per population, and the top 10 and lowest 10 countries by the proportion of high-quality articles among all articles for that country.

In [12]:
table_highest_articles_per_population = article_df.sort_values('articles_per_population(as a percentage)', ascending=False).head(10)
table_highest_articles_per_population 

| | num_article | num_high_quality_article | country | population | articles_per_population(as a percentage) | high_qual_articles_proportion(as a percentage) |
|---|---|---|---|---|---|---|
| 120 | 53 | 0.0 | Nauru | 10860 | 0.488029 | 0.0 |
| 173 | 55 | 3.0 | Tuvalu | 11800 | 0.466102 | 5.454545 |
| 141 | 82 | 0.0 | San Marino | 33000 | 0.248485 | 0.0 |
| 113 | 40 | 0.0 | Monaco | 38088 | 0.10502 | 0.0 |
| 97 | 29 | 0.0 | Liechtenstein | 37570 | 0.077189 | 0.0 |
| 107 | 37 | 0.0 | Marshall Islands | 55000 | 0.067273 | 0.0 |
| 72 | 206 | 2.0 | Iceland | 330828 | 0.062268 | 0.970874 |
| 168 | 63 | 0.0 | Tonga | 103300 | 0.060987 | 0.0 |
| 3 | 34 | 0.0 | Andorra | 78000 | 0.04359 | 0.0 |
| 54 | 38 | 0.0 | Federated States of Micronesia | 103000 | 0.036893 | 0.0 |


In [13]:
table_lowest_articles_per_population = article_df.sort_values('articles_per_population(as a percentage)', ascending=True).head(10)
table_lowest_articles_per_population 

| | num_article | num_high_quality_article | country | population | articles_per_population(as a percentage) | high_qual_articles_proportion(as a percentage) |
|---|---|---|---|---|---|---|
| 73 | 989 | 13.0 | India | 1314097616 | 7.5e-05 | 1.314459 |
| 34 | 1138 | 35.0 | China | 1371920000 | 8.3e-05 | 3.075571 |
| 74 | 215 | 8.0 | Indonesia | 255741973 | 8.4e-05 | 3.72093 |
| 180 | 29 | 3.0 | Uzbekistan | 31290791 | 9.3e-05 | 10.344828 |
| 53 | 105 | 3.0 | Ethiopia | 98148000 | 0.000107 | 2.857143 |
| 86 | 39 | 9.0 | Korea, North | 24983000 | 0.000156 | 23.076923 |
| 185 | 26 | 0.0 | Zambia | 15473900 | 0.000168 | 0.0 |
| 166 | 112 | 3.0 | Thailand | 65121250 | 0.000172 | 2.678571 |
| 38 | 142 | 8.0 | Congo, Dem. Rep. of | 73340200 | 0.000194 | 5.633803 |
| 13 | 324 | 3.0 | Bangladesh | 160411000 | 0.000202 | 0.925926 |


In [14]:
table_highest_qual_articles_proportion = article_df.sort_values('high_qual_articles_proportion(as a percentage)', ascending=False).head(10)
table_highest_qual_articles_proportion

| | num_article | num_high_quality_article | country | population | articles_per_population(as a percentage) | high_qual_articles_proportion(as a percentage) |
|---|---|---|---|---|---|---|
| 86 | 39 | 9.0 | Korea, North | 24983000 | 0.000156 | 23.076923 |
| 143 | 119 | 14.0 | Saudi Arabia | 31565109 | 0.000377 | 11.764706 |
| 180 | 29 | 3.0 | Uzbekistan | 31290791 | 9.3e-05 | 10.344828 |
| 31 | 68 | 7.0 | Central African Republic | 5551900 | 0.001225 | 10.294118 |
| 138 | 348 | 34.0 | Romania | 19838662 | 0.001754 | 9.770115 |
| 68 | 21 | 2.0 | Guinea-Bissau | 1788000 | 0.001174 | 9.52381 |
| 19 | 33 | 3.0 | Bhutan | 757000 | 0.004359 | 9.090909 |
| 183 | 191 | 16.0 | Vietnam | 91714080 | 0.000208 | 8.376963 |
| 46 | 12 | 1.0 | Dominica | 68000 | 0.017647 | 8.333333 |
| 109 | 52 | 4.0 | Mauritania | 3641288 | 0.001428 | 7.692308 |


In [15]:
table_lowest_qual_articles_proportion = article_df.sort_values('high_qual_articles_proportion(as a percentage)', ascending=True).head(10)
table_lowest_qual_articles_proportion

| | num_article | num_high_quality_article | country | population | articles_per_population(as a percentage) | high_qual_articles_proportion(as a percentage) |
|---|---|---|---|---|---|---|
| 172 | 33 | 0.0 | Turkmenistan | 5373000 | 0.000614 | 0.0 |
| 164 | 40 | 0.0 | Tajikistan | 8452153 | 0.000473 | 0.0 |
| 113 | 40 | 0.0 | Monaco | 38088 | 0.10502 | 0.0 |
| 117 | 60 | 0.0 | Mozambique | 25736000 | 0.000233 | 0.0 |
| 120 | 53 | 0.0 | Nauru | 10860 | 0.488029 | 0.0 |
| 168 | 63 | 0.0 | Tonga | 103300 | 0.060987 | 0.0 |
| 30 | 37 | 0.0 | Cape Verde | 514000 | 0.007198 | 0.0 |
| 65 | 49 | 0.0 | Guadeloupe | 407000 | 0.012039 | 0.0 |
| 83 | 79 | 0.0 | Kazakhstan | 17544274 | 0.00045 | 0.0 |
| 158 | 40 | 0.0 | Suriname | 576000 | 0.006944 | 0.0 |


Note: 39 countries are actually tied for the lowest rank, since num_high_quality_article is 0 for all of them; the table above shows only 10 of them.
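
Because `head(10)` cuts a tie at an arbitrary point, it can help to count the tied rows and break the tie explicitly. The sketch below does this on a toy frame; the countries and counts are invented, only the `num_article` and `num_high_quality_article` column names mirror `article_df`, and breaking ties by article count is one possible choice rather than the method used in this notebook.

```python
import pandas as pd

# Toy frame with invented values; only the column names mirror article_df.
toy_df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "num_article": [40, 60, 53, 100],
    "num_high_quality_article": [0, 0, 0, 5],
})
toy_df["high_qual_pct"] = toy_df["num_high_quality_article"] / toy_df["num_article"] * 100

# Select every country that shares the minimum value instead of relying on head(10).
lowest = toy_df["high_qual_pct"].min()
tied = toy_df[toy_df["high_qual_pct"] == lowest]
print(len(tied))

# Break the tie explicitly, here by article count (largest first).
ranked = tied.sort_values("num_article", ascending=False)
print(ranked["country"].tolist())
```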

### Writeup

I am surprised by the results in the four tables above; these are not the countries I expected. For example, I expected Britain to appear in table 1, the 10 highest-ranked countries by number of politician articles as a proportion of country population. According to what we learnt in the last lecture, Wikipedia editing is very popular in Britain, so British politicians should be well covered and the number of articles about them should be large.

Instead, most countries in table 1 have a relatively small population. At the same time, India and China take the first two places in table 2, the 10 lowest-ranked countries by number of politician articles as a proportion of country population. China and India are the two most populous countries in the world.

I think bias exists in the evaluation metrics. The two metrics, number of politician articles as a proportion of country population and number of high-quality articles as a proportion of all articles, do not measure the questions we are interested in well. In this project, we want to analyze the coverage of politicians on Wikipedia and the quality of politician articles across countries. A country A with 100 times the population of country B would not have 100 times as many politicians, and therefore would not be expected to have 100 times as many politician articles. This is an inherent bias in the per-population metric. For article quality, a better metric may be a weighted sum: the number of articles in each quality class times a weight for that class, where the weight increases with article quality (e.g. FA=10, GA=5, B=3, C=2, Start=1, Stub=0.5).
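
The weighted quality metric suggested above could be sketched as follows. The weights are the illustrative ones from the paragraph, not calibrated values; the toy rows are invented, and the per-article mean is an additional variant assumed here so that countries with different article counts can be compared.

```python
import pandas as pd

# Weights from the example in the text; they are illustrative, not calibrated.
QUALITY_WEIGHTS = {"FA": 10, "GA": 5, "B": 3, "C": 2, "Start": 1, "Stub": 0.5}

# Toy data with invented rows; the column names mirror all_data_df.
toy_df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "article_quality": ["FA", "Stub", "Stub", "C", "Start"],
})

# Map each article's quality class to its weight, then aggregate per country.
toy_df["weight"] = toy_df["article_quality"].map(QUALITY_WEIGHTS)
weighted_score = toy_df.groupby("country")["weight"].sum()
mean_weight = toy_df.groupby("country")["weight"].mean()

print(weighted_score.to_dict())
print(mean_weight.to_dict())
```

The sum rewards countries with many articles, while the mean describes the typical quality of an article, so the two answer slightly different questions.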

Because of this bias in the evaluation metrics, the results found in step 5 may not answer the questions we set out to study.