# Bias on Wikipedia

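This assignment measures bias on Wikipedia by computing two metrics for articles about politicians:

a) Number of articles per country, as a proportion of the country's population

b) Ratio of high quality articles (GA or FA) to article count, per country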

Data inputs for this analysis:
a) Page data provided by Oliver Kyes, Human Centered Design (HCD), University of Washington, available at: https://ndownloader.figshare.com/files/9614893

b) Population data from: http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14

c) Page quality data from the ORES REST API. References:
https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context
https://www.mediawiki.org/wiki/ORES

The analysis looks at the top and bottom 10 countries for each of the two metrics.

Countries that score low on the first metric are under-represented on Wikipedia relative to their population, and countries that score high are over-represented. Countries that score low on the second metric have few good quality articles relative to their total article count, and vice versa.
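
As a minimal sketch of how the two metrics are defined, the snippet below computes them for a small made-up table. The country names, counts, and column names here are hypothetical and are not taken from the real datasets.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame(
    {
        "country": ["A", "B"],
        "population": [1000, 200000],
        "article_count": [10, 50],
        "high_quality_count": [1, 5],
    }
).set_index("country")

# Metric a: articles as a percentage of the country's population.
toy["articles_per_population_pct"] = toy["article_count"] / toy["population"] * 100

# Metric b: GA/FA articles as a percentage of the country's articles.
toy["high_quality_pct"] = toy["high_quality_count"] / toy["article_count"] * 100

print(toy[["articles_per_population_pct", "high_quality_pct"]])
```

In this toy table, country A has far more articles relative to its population than country B, even though B has more articles in absolute terms.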


### Step 0 - Data Acquisition - Download page and population data

1) Download page data from: https://ndownloader.figshare.com/files/9614893. This data is provided by Oliver Kyes, Human Centered Design (HCD), University of Washington

2) Download population data from: http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14 . Click the Microsoft Excel icon in the top right corner

Save both CSV files locally on your computer

### Step 1 - Data Acquisition - Load the previously downloaded csv files 
Load the population data and page data CSV files into data frames

Note: Update the localPath variable with the path on your machine where the CSV files are stored
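
As an alternative to reading rows by hand, a file with a clean header row could probably be loaded directly with `pd.read_csv`. The sketch below uses an in-memory stand-in for the page data file. The sample rows and the original header names (`page`, `country`, `rev_id`) are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for page_data.csv; the real file lives under localPath.
sample_csv = io.StringIO(
    "page,country,rev_id\n"
    "Example politician,Exampleland,12345\n"
    "Another politician,Otherland,67890\n"
)

# header=0 skips the file's own header row, names= assigns the column names
# used in this notebook, and dtype=str keeps revision ids as strings,
# which matches what csv.reader returns.
sample_df = pd.read_csv(
    sample_csv,
    header=0,
    names=["article_name", "country", "revision_id"],
    dtype=str,
)

# Placeholder quality value, to be filled in later from ORES.
sample_df["article_quality"] = "NA"
print(sample_df)
```

This would not handle files that carry extra title lines above the header, which is what the column-count checks in the loading code guard against.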

In [4]:
## Get the data from the CSV files. Please update the localPath variable to the location where you downloaded the csv files
import csv
import pandas as pd
localPath = 'C:/Users/amnag/OneDrive/DataScience/HumanCenteredDS/week4/country/country/data/'

# Create an empty list to store page data
page_data = []
with open(localPath + 'page_data.csv',encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    header = True
    # Using two hints to get to the data row:
    # 1 - Skip rows that have less than 3 columns
    # 2 - Skip the header row  
    for row in reader:
        if(len(row) >= 3):
            if(header==True):
                header=False
            else:    
                page_data.append([row[0],row[1],row[2]])
# Convert the page_data list to a dataframe and assign column names
page_data_df = pd.DataFrame(page_data,columns=['article_name','country','revision_id'])
# Add a column to store article quality from ORES. Initialize the column with NA
page_data_df = page_data_df.assign(article_quality = lambda x: 'NA')
# Store the data frame to an intermediate csv. 
# This will be re-loaded later to update the article_quality using ORES REST API
page_data_df.to_csv(localPath+'page_data_with_ORES_score.csv') 

# Create an empty list to store population data
population_data = []
with open(localPath + 'Population Mid-2015.csv',encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    header = True
    for row in reader:
        # Skip the header row and then read data
        if(len(row) >= 6):
            if(header==True):
                header=False
            else:    
                population_data.append([row[0],row[1],row[2],row[3],row[4],row[5]])

# Convert the population_data list to a dataframe and assign column names
population_data_df = pd.DataFrame(population_data)
population_data_df.columns = ['country','Location Type','TimeFrame','Data Type','population','Footnotes']
# Drop columns that are not required for analysis to improve processing time and data readability
population_data_df.drop(['Location Type','TimeFrame','Data Type','Footnotes'], axis=1, inplace=True)

### Step 2 - Data Acquisition - Call ORES API to find the article quality for each page
Page data with ORES article quality is saved to an intermediate file: page_data_with_ORES_score.csv. The file is rewritten each time article quality has been downloaded for a batch of 100 revision ids. This avoids restarting the ORES API queries from the beginning if network connectivity is lost.
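
The batching described above can be sketched as a small generator. The helper name `batches` and the sample ids are hypothetical and only illustrate the slicing. The notebook's own loop tracks an index instead.

```python
def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical revision ids, for illustration only.
rev_ids = [str(n) for n in range(250)]
sizes = [len(batch) for batch in batches(rev_ids)]
print(sizes)  # [100, 100, 50]
```

The final batch is simply shorter, so no special case is needed for the remainder.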

In [5]:
import requests
import json

headers = {'User-Agent' : 'https://github.com/amitabhnag', 'From' : 'amnag@uw.edu'}
# Function that calls the ORES REST API for a set of revision ids and returns the response object  
# This function is developed based off code sample provided by Prof. Jonathan Morgan, Human Centered Design (HCD), 
# University of Washington  
def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the headers so the request identifies the caller to the API
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return(response)

# This function returns the index of the page from which the ORES REST API needs to be called.  
# This function is useful as in many cases network connectivity issues can prevent you from receiving the response
# from ORES REST API. You may be left with partially filled article quality. 
# This function loads the previously saved page_data_with_ORES_score.csv 
# and returns the index from where the article quality needs to be calculated 
def findIndexToGetORESData():
    currentIndex = 0
    page_data= []
    header = True
    with open(localPath + 'page_data_with_ORES_score.csv') as csvfile:
        reader = csv.reader(csvfile)
        # Skip the header
        for row in reader:
            if (header == True):
                header = False
            else:    
                page_data.append([row[1],row[2],row[3],row[4]])
    page_data_df = pd.DataFrame(page_data,columns=['article_name','country','revision_id','article_quality']) 
    # Count the rows whose article_quality is no longer 'NA'.
    # Rows are scored in order, so this count is used as the index to resume from
    for i in range(0,len(page_data_df)):
        if(page_data_df['article_quality'][i]!='NA'):
            currentIndex = currentIndex + 1
    return currentIndex    

# Call findIndexToGetORESData() to find the index from where ORES REST API needs to be called
currentIndex = findIndexToGetORESData()
totalPages = len(page_data)

# This loop calls get_ores_data() to get article quality in batches of 100 pages.
while(currentIndex < totalPages):
    print('currently processing index:=' + str(currentIndex))
    # Get a list of 100 rev ids to be passed to get_ores_data()
    # If 100 rev ids are not there, call get_ores_data() with the remaining rev ids
    if(totalPages - currentIndex >= 100 ):
        revids = page_data_df['revision_id'][currentIndex : (currentIndex + 100)]
        currentIndex = currentIndex + 100
    else:
        revids = page_data_df['revision_id'][currentIndex : totalPages]   
        currentIndex = totalPages
    # Get the ORES data    
    response = get_ores_data(revids, headers)
    
    for revid in revids:
        if 'error' not in response['enwiki']['scores'][revid]['wp10']:
            prediction = response['enwiki']['scores'][revid]['wp10']['score']['prediction']
            # DataFrame.set_value was removed in pandas 1.0; .loc assigns to every row with this revision id
            page_data_df.loc[page_data_df['revision_id'] == revid, 'article_quality'] = prediction
        else:
            print('error in rev_id:' + str(revid))
    # Save the page data with ORES scores to a file so that data can still be retrieved after a network connection issue
    page_data_df.to_csv(localPath+'page_data_with_ORES_score.csv')


currently processing index:=0
currently processing index:=100
currently processing index:=200
currently processing index:=300
currently processing index:=400
currently processing index:=500
currently processing index:=600
currently processing index:=700
currently processing index:=800
currently processing index:=900
currently processing index:=1000
currently processing index:=1100
currently processing index:=1200
currently processing index:=1300
currently processing index:=1400
currently processing index:=1500
currently processing index:=1600
currently processing index:=1700
currently processing index:=1800
currently processing index:=1900
currently processing index:=2000
currently processing index:=2100
currently processing index:=2200
currently processing index:=2300
currently processing index:=2400
currently processing index:=2500
currently processing index:=2600
currently processing index:=2700
currently processing index:=2800
currently processing index:=2900
currently processing i
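
The response handling in the loop above can be sketched as a standalone function that separates scored revisions from failed ones. The function name `extract_predictions` and the sample response are hypothetical. The response is shaped only after the fields that the loop reads, not after the full ORES schema.

```python
def extract_predictions(response, project="enwiki", model="wp10"):
    """Return ({rev_id: prediction}, [rev_ids with errors]) from a parsed response."""
    predictions, failed = {}, []
    for rev_id, models in response[project]["scores"].items():
        result = models[model]
        if "error" in result:
            # No score is available for this revision.
            failed.append(rev_id)
        else:
            predictions[rev_id] = result["score"]["prediction"]
    return predictions, failed


# Hypothetical response, for illustration only.
fake_response = {
    "enwiki": {
        "scores": {
            "111": {"wp10": {"score": {"prediction": "Stub"}}},
            "222": {"wp10": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}

predictions, failed = extract_predictions(fake_response)
print(predictions, failed)
```

Keeping the failed ids in a list, rather than only printing them, would make it easier to retry or report them later.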

### Step 3 - Data processing - Load the page data with article quality from intermediate file
Load the page data from page_data_with_ORES_score.csv into a data frame. This file is updated in Step 2 with article quality.
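
If the intermediate file were read with pandas instead of `csv.reader`, one detail to watch is that `pd.read_csv` treats the string `NA` as a missing value by default. A minimal sketch, assuming a made-up two-row file in the same column layout:

```python
import io
import pandas as pd

# Hypothetical stand-in for page_data_with_ORES_score.csv.
sample_csv = io.StringIO(
    ",article_name,country,revision_id,article_quality\n"
    "0,Example A,Exampleland,111,Stub\n"
    "1,Example B,Exampleland,222,NA\n"
)

# keep_default_na=False stops pandas from turning the 'NA' placeholder into NaN.
scored_df = pd.read_csv(sample_csv, index_col=0, dtype=str, keep_default_na=False)

# Rows that still carry the placeholder have not been scored yet.
pending = scored_df[scored_df["article_quality"] == "NA"]
print(pending)
```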

In [6]:
page_data_ores = []

# Load the data from page_data_with_ORES_score.csv into a list page_data_ores 
with open(localPath + 'page_data_with_ORES_score.csv') as csvfile:
    reader = csv.reader(csvfile)
    header = True
    for row in reader:
        # Skip the header row and then read data
        if(len(row) >= 3):
            if(header==True):
                header=False
            else:    
                page_data_ores.append([row[1],row[2],row[3],row[4]])
# Create data frame from the list and assign column names
page_data_ores_df = pd.DataFrame(page_data_ores)
page_data_ores_df.columns = ['article_name','country','revision_id','article_quality']
page_data_ores_df                

,article_name,country,revision_id,article_quality
0,Template:ZambiaProvincialMinisters,Zambia,235107991,Stub
1,Bir I of Kanem,Chad,355319463,Stub
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub
3,Template:Uganda-politician-stub,Uganda,391862070,Stub
4,Template:Namibia-politician-stub,Namibia,391862409,Stub
5,Template:Nigeria-politician-stub,Nigeria,391862819,Stub
6,Template:Colombia-politician-stub,Colombia,391863340,Stub
7,Template:Chile-politician-stub,Chile,391863361,Stub
8,Template:Fiji-politician-stub,Fiji,391863617,Stub
9,Template:Solomons-politician-stub,Solomon Islands,391863809,Stub


### Step 4 - Data Processing - Merge the page data and population data

In [7]:
# Merge the page data and population data on country 
merged_df = population_data_df.merge(page_data_ores_df,on=['country'],how='inner')
# Remove thousand "," symbol from population data. This symbol makes it harder to convert population to integer
merged_df['population'] = merged_df['population'].apply(lambda x: x.replace(',',''))
# Convert the population to integer
merged_df['population'] = merged_df['population'].astype(int)
# Save data to a csv file
merged_df.to_csv(localPath+'ConsolidatedData.csv')
merged_df

,country,population,article_name,revision_id,article_quality
0,Afghanistan,32247000,Template:Afghanistan-politician-stub,394580295,Stub
1,Afghanistan,32247000,Template:Afghanistan-mayor-stub,443496992,Stub
2,Afghanistan,32247000,Template:Afghanistan-diplomat-stub,540459929,Stub
3,Afghanistan,32247000,Daud Arsala,627547024,Stub
4,Afghanistan,32247000,Murad Quenili,670462475,Stub
5,Afghanistan,32247000,Badar,671455150,Stub
6,Afghanistan,32247000,Mohammed Qalamuddin,671473289,Stub
7,Afghanistan,32247000,Faizanullah Faizan,703507854,Stub
8,Afghanistan,32247000,Mohammad Fahim Dashty,706112927,Stub
9,Afghanistan,32247000,Aamir Latif,708476182,Stub
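
An inner join silently drops any country whose name does not match exactly between the two files. One way to check for that, sketched here on made-up frames, is an outer merge with `indicator=True`. The frame names and country names below are hypothetical.

```python
import pandas as pd

# Hypothetical frames, for illustration only.
pop = pd.DataFrame({"country": ["Exampleland", "Otherland"], "population": [1000, 2000]})
pages = pd.DataFrame({"country": ["Exampleland", "Nowhereland"], "article_name": ["A", "B"]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'.
check = pop.merge(pages, on="country", how="outer", indicator=True)

# Countries present in only one of the two sources.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["country", "_merge"]])
```

Rows marked `left_only` have population data but no articles, and rows marked `right_only` have articles but no population entry.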


### Step 5 - Analysis - Compute the two metrics and find the top and bottom 10 values
#### a) Number of articles per country as a proportion of country population
#### b) Ratio of high quality articles to article count

In [8]:
from IPython.display import display
import warnings
warnings.filterwarnings("ignore")
# Create a dataframe that holds these two metrics
# a) Number of articles per country 
# b) Ratio of high quality articles to article count

# Create a data frame analysis_df that stores the metric values for analysis
analysis_df = pd.DataFrame()
# Compute article count per country
analysis_df['article count'] = merged_df.groupby('country')['revision_id'].count()
# Since population is repeated in each row for a country, take the population from the first row for that country
analysis_df['population'] = merged_df.groupby('country')['population'].first()
# Count high quality (GA or FA) articles for each country.
# Countries with no GA or FA articles are absent from this count and end up as NaN
analysis_df['high quality articles'] = merged_df[(merged_df.article_quality == 'GA') | (merged_df.article_quality == 'FA')].groupby('country')['article_quality'].count()

# Calculate the metric: number of articles as a percentage of country population
analysis_df['# of articles by country population'] = analysis_df['article count']/analysis_df['population']*100
# Calculate the metric: high quality articles as a percentage of article count
analysis_df['ratio of high quality articles to article count'] = analysis_df['high quality articles']/analysis_df['article count']*100

# Display the results for top and bottom 10 values for # of articles by country population
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
print('10 highest-ranked countries in terms of number of politician articles as a proportion of country population')
display(analysis_df.sort_values('# of articles by country population', ascending=False).head(10))
print('10 lowest-ranked countries in terms of number of politician articles as a proportion of country population')
display(analysis_df.sort_values('# of articles by country population', ascending=False).tail(10))

# Display the results for top and bottom 10 values for ratio of high quality articles to article count
# Countries with no GA or FA articles have a NaN ratio, and sort_values places NaN rows last
print('10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country')
display(analysis_df.sort_values('ratio of high quality articles to article count', ascending=False).head(10))
print('10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country')
display(analysis_df.sort_values('ratio of high quality articles to article count', ascending=False).tail(10))


10 highest-ranked countries in terms of number of politician articles as a proportion of country population


country,article count,population,high quality articles,# of articles by country population,ratio of high quality articles to article count
Nauru,53,10860,,0.488029,
Tuvalu,55,11800,1.0,0.466102,1.818182
San Marino,82,33000,,0.248485,
Monaco,40,38088,,0.10502,
Liechtenstein,29,37570,,0.077189,
Marshall Islands,37,55000,,0.067273,
Iceland,206,330828,3.0,0.062268,1.456311
Tonga,63,103300,,0.060987,
Andorra,34,78000,,0.04359,
Federated States of Micronesia,38,103000,,0.036893,


10 lowest-ranked countries in terms of number of politician articles as a proportion of country population


country,article count,population,high quality articles,# of articles by country population,ratio of high quality articles to article count
Bangladesh,324,160411000,6.0,0.000202,1.851852
"Congo, Dem. Rep. of",142,73340200,7.0,0.000194,4.929577
Thailand,112,65121250,3.0,0.000172,2.678571
Zambia,26,15473900,,0.000168,
"Korea, North",39,24983000,9.0,0.000156,23.076923
Ethiopia,105,98148000,2.0,0.000107,1.904762
Uzbekistan,29,31290791,2.0,9.3e-05,6.896552
Indonesia,215,255741973,9.0,8.4e-05,4.186047
China,1138,1371920000,42.0,8.3e-05,3.690685
India,990,1314097616,15.0,7.5e-05,1.515152


10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country


country,article count,population,high quality articles,# of articles by country population,ratio of high quality articles to article count
"Korea, North",39,24983000,9.0,0.000156,23.076923
Romania,348,19838662,45.0,0.001754,12.931034
Saudi Arabia,119,31565109,15.0,0.000377,12.605042
Central African Republic,68,5551900,8.0,0.001225,11.764706
Qatar,51,2394524,5.0,0.00213,9.803922
Guinea-Bissau,21,1788000,2.0,0.001174,9.52381
Vietnam,191,91714080,18.0,0.000208,9.424084
Bhutan,33,757000,3.0,0.004359,9.090909
Ireland,381,4630308,31.0,0.008228,8.136483
United States,1098,321234172,86.0,0.000342,7.832423


10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country


country,article count,population,high quality articles,# of articles by country population,ratio of high quality articles to article count
Seychelles,22,92833,,0.023698,
Solomon Islands,98,641900,,0.015267,
Suriname,40,576000,,0.006944,
Swaziland,32,1286000,,0.002488,
Switzerland,407,8292851,,0.004908,
Tajikistan,40,8452153,,0.000473,
Tonga,63,103300,,0.060987,
Tunisia,140,11026000,,0.00127,
Turkmenistan,33,5373000,,0.000614,
Zambia,26,15473900,,0.000168,
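
In the last table the ratio is blank for every country, because a country with no GA or FA articles gets a missing value rather than zero, and missing values sort to the end. One possible refinement, which is not what the analysis above does, is to fill the missing counts with 0 before computing the ratio. The sketch below uses made-up numbers and a shortened column name `ratio`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame(
    {"article count": [40, 100, 50], "high quality articles": [None, 5, 1]},
    index=pd.Index(["A", "B", "C"], name="country"),
)

# Treat "no GA/FA articles" as a count of zero instead of a missing value.
toy["high quality articles"] = toy["high quality articles"].fillna(0)
toy["ratio"] = toy["high quality articles"] / toy["article count"] * 100

# With real zeros in place, the lowest-ranked rows come from an ascending sort.
lowest = toy.sort_values("ratio").head(2)
print(lowest)
```

With many countries tied at zero, the choice of which ten to display is still arbitrary, so a secondary sort key such as article count may be worth considering.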
