# DATA 512: Assignment 2: Bias in Data

Charles Duze - 11/02/2017

### Objective
For this assignment the task is to analyze "what the nature of political articles on Wikipedia - both their existence, and their quality - can tell us about bias in Wikipedia's content"*. 

### Process and Contents
We gather data from multiple sources to get each country's population, its number of political articles, and its number of high-quality political articles. "High quality" is based on the ORES API described below. We then calculate both the number of articles as a percentage of population and the percentage of high-quality articles among all articles. There is a write-up at the end summarizing my opinion on the findings.

\* Content from https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments

# Step 1: Data Acquisition

#### Import Statements
Importing the modules we'll need.

In [316]:
import requests
import json
import csv
import pandas as pd


#### Function to Get ORES Data
This function takes a string of revIds separated by "|" and returns a list of the predicted article-quality classes, one per revId. Some revIds may not have a score, in which case we record "Error".

It calls a Wikimedia API endpoint for a machine learning system called ORES ("Objective Revision Evaluation Service"). More information about the API, including all the values returned and the definitions of the ratings, can be found here: https://www.mediawiki.org/wiki/ORES.

In [317]:
# Function to get scores from the ORES API in batches
def get_ores_data(revision_ids):
    
    # Generate endpoint string with appended revIds (use the parameter, not the global "revids")
    endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&revids=' + revision_ids
    
    # Make the JSON call
    api_call = requests.get(endpoint)    
    response = api_call.json()
    
    # Extract scores in the order the revIds were requested; "Error" marks a revId without a score
    all_scores = response['enwiki']['scores']
    scores = [all_scores[x]['wp10']['score']['prediction'] if "score" in all_scores.get(x, {}).get('wp10', {}) else "Error" for x in revision_ids.split('|')]
    
    return scores
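
To make the response handling easier to follow, here is a minimal offline sketch. The `sample_response` dict is hand-written to mimic only the nested keys the function reads (`enwiki`, `scores`, the revId, `wp10`, `score`, `prediction`); it is not real API output, the revIds are made up, the shape of the error entry is an assumption, and the helper name `extract_predictions` is hypothetical.

```python
def extract_predictions(response, rev_ids):
    """Return one prediction per requested revId, or "Error" if it has no score."""
    scores = response['enwiki']['scores']
    predictions = []
    for rev_id in rev_ids:
        # A missing revId or a missing 'score' key both count as an error
        entry = scores.get(rev_id, {}).get('wp10', {})
        if 'score' in entry:
            predictions.append(entry['score']['prediction'])
        else:
            predictions.append("Error")
    return predictions

# Hand-written stand-in for an ORES response (assumed shape, made-up revIds)
sample_response = {
    'enwiki': {
        'scores': {
            '1001': {'wp10': {'score': {'prediction': 'Stub'}}},
            '1002': {'wp10': {'error': {'message': 'placeholder error'}}},
        }
    }
}

predictions = extract_predictions(sample_response, ['1001', '1002'])
print(predictions)  # ['Stub', 'Error']
```

Looking up each requested revId, rather than iterating over the response keys, keeps the returned list aligned with the rows that were sent.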



#### Read in the Page_Data csv file
This section loads a CSV file, "Politicians by Country from the English-language Wikipedia" by Oliver Keyes, and stores it in the "data" list. The download, as well as additional information, can be found at: https://figshare.com/articles/Untitled_Item/5513449

In [318]:
## getting the data from the CSV files
data = []
with open('page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append([row[0],row[1],row[2]])

#### Get the ORES score for articles in the Page_Data csv file
This section obtains the scores from the ORES API for the articles in the Page_Data csv file in batches. The batch size is set as a parameter; because the end index is inclusive, each request actually carries batchSize + 1 revIds. I skip the first row of the csv since it's just the header. I loop through the Page_Data articles in batches and store the scores in the "ores_scores" list, which is initialized with placeholders (None).

In [319]:
#Let's get the length of page_data
page_data_len = len(data)

#Create an empty array to store the values as we get them
ores_scores = [None] * page_data_len

#Define Batch Size
batchSize = 130

# Initial starting index. 0 is header.
currStart = 1

# Loop through page_data to get the scores
while currStart < page_data_len:
    
    #initialize revIds for this iteration
    revids = ""
    
    # Calculate the end index
    currEnd = currStart + batchSize 
    
    # Make sure currEnd is not out of bounds
    if currEnd > page_data_len -1:
        currEnd = page_data_len -1
    
    # Progress update. It takes a while.
    print("Getting ",currStart, "-", currEnd)        
    
    # Construct the list of revIds in this batch and append "|"
    for x in range(currStart,currEnd+1):
        revids = revids + data[x][2] + "|"

    # Remove the last "|" otherwise will cause an error with the API
    revids = revids[:-1]
    
    # Call the ORES API via the function defined above
    myresp = get_ores_data(revids)
    
    # Store the scores we got back in the ores_scores list.
    ores_scores[currStart:currEnd+1] = myresp
    
    # Update starting index for the next iteration
    currStart = currEnd + 1

# Report the number of records retrieved.
print ("Got back", len(ores_scores), "Records")


Getting  1 - 131
Getting  132 - 262
Getting  263 - 393
Getting  394 - 524
Getting  525 - 655
Getting  656 - 786
Getting  787 - 917
Getting  918 - 1048
Getting  1049 - 1179
Getting  1180 - 1310
Getting  1311 - 1441
Getting  1442 - 1572
Getting  1573 - 1703
Getting  1704 - 1834
Getting  1835 - 1965
Getting  1966 - 2096
Getting  2097 - 2227
Getting  2228 - 2358
Getting  2359 - 2489
Getting  2490 - 2620
Getting  2621 - 2751
Getting  2752 - 2882
Getting  2883 - 3013
Getting  3014 - 3144
Getting  3145 - 3275
Getting  3276 - 3406
Getting  3407 - 3537
Getting  3538 - 3668
Getting  3669 - 3799
Getting  3800 - 3930
Getting  3931 - 4061
Getting  4062 - 4192
Getting  4193 - 4323
Getting  4324 - 4454
Getting  4455 - 4585
Getting  4586 - 4716
Getting  4717 - 4847
Getting  4848 - 4978
Getting  4979 - 5109
Getting  5110 - 5240
Getting  5241 - 5371
Getting  5372 - 5502
Getting  5503 - 5633
Getting  5634 - 5764
Getting  5765 - 5895
Getting  5896 - 6026
Getting  6027 - 6157
Getting  6158 - 6288
Getting  
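
The batching logic can also be expressed with list slices and `str.join`, which avoids the trailing "|" and the manual index bookkeeping. This is a sketch with made-up revIds, not the code that produced the output above; the helper name `batch_rev_ids` is hypothetical.

```python
def batch_rev_ids(rev_ids, batch_size):
    """Yield '|'-joined strings containing at most batch_size revIds each."""
    for start in range(0, len(rev_ids), batch_size):
        yield "|".join(rev_ids[start:start + batch_size])

rev_ids = [str(n) for n in range(1, 8)]  # seven made-up revIds: '1' .. '7'
batches = list(batch_rev_ids(rev_ids, batch_size=3))
print(batches)  # ['1|2|3', '4|5|6', '7']
```

Each yielded string could then be passed to `get_ores_data`, and the returned lists concatenated in order.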

#### Merge Page_Data and the ORES scores in a DataFrame
In this section, I push Page_Data into a DataFrame and set column names. I then add the retrieved ores_scores to the same DataFrame as the "article_quality" column.

In [320]:
# Push Page_Data into a DataFrame and set column names
df = pd.DataFrame(data[1:], columns=['article_name','country', 'revision_id'])

# Append the ores_scores retrieved. 
df["article_quality"] = ores_scores[1:]

# Display a sample
df[0:10]

|  | article_name | country | revision_id | article_quality |
|---|---|---|---|---|
| 0 | Template:ZambiaProvincialMinisters | Zambia | 235107991 | Stub |
| 1 | Bir I of Kanem | Chad | 355319463 | Stub |
| 2 | Template:Zimbabwe-politician-stub | Zimbabwe | 391862046 | Stub |
| 3 | Template:Uganda-politician-stub | Uganda | 391862070 | Stub |
| 4 | Template:Namibia-politician-stub | Namibia | 391862409 | Stub |
| 5 | Template:Nigeria-politician-stub | Nigeria | 391862819 | Stub |
| 6 | Template:Colombia-politician-stub | Colombia | 391863340 | Stub |
| 7 | Template:Chile-politician-stub | Chile | 391863361 | Stub |
| 8 | Template:Fiji-politician-stub | Fiji | 391863617 | Stub |
| 9 | Template:Solomons-politician-stub | Solomon Islands | 391863809 | Stub |


#### Read in the "Population Mid-2015" csv file
This section loads a CSV file, "Population Mid-2015" from the Population Reference Bureau, and stores it in the "pop_data" DataFrame. The download, as well as additional information, can be found at: http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14. I select the "country" and "population" columns and set the column names.

Note: I had to manually delete a few extra lines at the top of the file before importing, so that the headers are the first row.

In [321]:
# Read csv into DataFrame
pop_data = pd.read_csv('Population Mid-2015.csv', thousands=',') 

# Select columns
pop_data = pop_data.iloc[:,[0,4]]

# Name selected columns
pop_data.columns = ["country", "population" ]

# Display a sample
pop_data[0:10]

|  | country | population |
|---|---|---|
| 0 | Afghanistan | 32247000 |
| 1 | Albania | 2892000 |
| 2 | Algeria | 39948000 |
| 3 | Andorra | 78000 |
| 4 | Angola | 25000000 |
| 5 | Antigua and Barbuda | 90000 |
| 6 | Argentina | 42426000 |
| 7 | Armenia | 3017106 |
| 8 | Australia | 23888000 |
| 9 | Austria | 8615955 |
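
As an alternative to deleting the extra lines at the top of the file by hand, `pd.read_csv` can skip them with its `skiprows` parameter. The sketch below uses an in-memory stand-in for the file: the two preamble lines, the header names, and the count of two skipped lines are all assumptions for illustration, not the real layout of the PRB download.

```python
import io
import pandas as pd

# Made-up stand-in for the downloaded file: two preamble lines, then the header row
raw = io.StringIO(
    "Population Mid-2015\n"
    "Example preamble line\n"
    "Location,Type,TimeFrame,DataType,Data\n"
    'Afghanistan,Country,Mid-2015,Number,"32,247,000"\n'
    'Andorra,Country,Mid-2015,Number,"78,000"\n'
)

# Skip the preamble so the header row is read as column names;
# thousands=',' turns "32,247,000" into the integer 32247000
pop = pd.read_csv(raw, skiprows=2, thousands=',')
pop = pop.iloc[:, [0, 4]]
pop.columns = ["country", "population"]
print(pop)
```

This keeps the raw download untouched, which makes the step easier to reproduce.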


#### Merge the Page_Data (+ ORES score) with the Population Data
This section merges the Page_Data (+ ORES score) with the Population Data. I use an inner join so I only get back rows that have country values in both datasets. 

In [322]:
# Merge data from both data sets
myData = df.merge(pop_data, on='country', how='inner')

# Display number of returned records
print ("Got", len(myData), "Records after merge")

# Display sample
myData[0:10]

Got 45799 Records after merge


|  | article_name | country | revision_id | article_quality | population |
|---|---|---|---|---|---|
| 0 | Template:ZambiaProvincialMinisters | Zambia | 235107991 | Stub | 15473900 |
| 1 | Gladys Lundwe | Zambia | 757566606 | Stub | 15473900 |
| 2 | Mwamba Luchembe | Zambia | 764848643 | Stub | 15473900 |
| 3 | Thandiwe Banda | Zambia | 768166426 | Start | 15473900 |
| 4 | Sylvester Chisembele | Zambia | 776082926 | C | 15473900 |
| 5 | Victoria Kalima | Zambia | 776530837 | Start | 15473900 |
| 6 | Margaret Mwanakatwe | Zambia | 779747587 | Start | 15473900 |
| 7 | Nkandu Luo | Zambia | 779747961 | C | 15473900 |
| 8 | Susan Nakazwe | Zambia | 779748181 | Start | 15473900 |
| 9 | Catherine Namugala | Zambia | 779748285 | Start | 15473900 |
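
An inner join silently drops rows whose country name does not appear in both datasets. One way to see what was dropped is an outer merge with pandas' `indicator=True` option, which adds a `_merge` column. This is a sketch on toy data: "Atlantis" is a made-up unmatched value, and the variable names are hypothetical.

```python
import pandas as pd

articles = pd.DataFrame({
    "article_name": ["A", "B", "C"],
    "country": ["Zambia", "Chad", "Atlantis"],  # "Atlantis" has no population row
})
population = pd.DataFrame({
    "country": ["Zambia", "Chad", "Fiji"],      # "Fiji" has no article row here
    "population": [15473900, 13707000, 867000],
})

# _merge is 'both', 'left_only', or 'right_only' for each row
audit = articles.merge(population, on="country", how="outer", indicator=True)
dropped_articles = list(audit.loc[audit["_merge"] == "left_only", "country"].unique())
dropped_population = list(audit.loc[audit["_merge"] == "right_only", "country"].unique())
print(dropped_articles, dropped_population)  # ['Atlantis'] ['Fiji']
```

Running the same audit on `df` and `pop_data` would list the country names that differ between the two sources.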


# Step 2: Data Processing  (By Country Aggregation)

### Population for each Country
In this section, I'm getting a unique row for each country with the population data.

In [323]:
# Select country and population columns then dedupe.
myData_pop = myData.iloc[:,[1,4]].drop_duplicates()

# Print number of records
print ("Got", len(myData_pop), "Records after merge")

# Display a sample
myData_pop[0:10]

Got 187 Records after merge


|  | country | population |
|---|---|---|
| 0 | Zambia | 15473900 |
| 26 | Chad | 13707000 |
| 126 | Zimbabwe | 17354000 |
| 293 | Uganda | 40141000 |
| 481 | Namibia | 2482100 |
| 646 | Nigeria | 181839400 |
| 1330 | Colombia | 48218000 |
| 1618 | Chile | 18025000 |
| 1970 | Fiji | 867000 |
| 2169 | Solomon Islands | 641900 |



### Article Count for each Country
In this section, I do a Group By on "country" and count the number of rows, which gives the total number of articles per country. I have to re-add the "country" column so I can join on it later.

In [325]:
# Select columns
myData_art = myData.iloc[:,[0,1]]

# Get counts after Group By
myData_art = myData_art.groupby("country").count()

# Re-add country as a column
myData_art["country"] = myData_art.index

# Set column names
myData_art.columns = ['articleCount', 'country']

# Display sample
myData_art[0:10]

| country (index) | articleCount | country |
|---|---|---|
| Afghanistan | 327 | Afghanistan |
| Albania | 460 | Albania |
| Algeria | 119 | Algeria |
| Andorra | 34 | Andorra |
| Angola | 110 | Angola |
| Antigua and Barbuda | 25 | Antigua and Barbuda |
| Argentina | 496 | Argentina |
| Armenia | 199 | Armenia |
| Australia | 1566 | Australia |
| Austria | 340 | Austria |
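
The count-then-re-add-the-column pattern can be written more compactly with `groupby(...).size()` followed by `reset_index(name=...)`, which returns "country" as a regular column in one step. This is a sketch on toy data, offered as an alternative and not as the code that produced the table above.

```python
import pandas as pd

toy = pd.DataFrame({
    "article_name": ["A", "B", "C", "D"],
    "country": ["Chad", "Chad", "Fiji", "Chad"],
})

# size() counts rows per group; reset_index turns the group key back into a column
counts = toy.groupby("country").size().reset_index(name="articleCount")
print(counts)
```

The result has exactly two columns, "country" and "articleCount", so it can be merged on "country" without a duplicated index.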


### High Quality Article Count for each Country
In this section, I first filter the dataset to just high-quality articles (those rated "GA" or "FA"). I do a Group By on "country" and count the number of rows, which gives the total number of high-quality articles per country. I have to re-add the "country" column so I can join on it later.

In [326]:
# Filter to just High Quality articles ("GA" and "FA")
myData_HQ = myData[myData['article_quality'].isin(['GA','FA'])]

# Select columns
myData_HQ = myData_HQ.iloc[:,[0,1]]

# Get counts after Group By
myData_HQ = myData_HQ.groupby("country").count()

# Re-add country as a column
myData_HQ["country"] = myData_HQ.index

# Set column names
myData_HQ.columns = ['HQarticleCount', 'country']

# Display sample
myData_HQ.head()

| country (index) | HQarticleCount | country |
|---|---|---|
| Afghanistan | 16 | Afghanistan |
| Albania | 5 | Albania |
| Algeria | 2 | Algeria |
| Angola | 1 | Angola |
| Argentina | 17 | Argentina |


### Merge the 3 datasets and calculate percentages
In this section, I merge the 3 datasets from immediately above. I fill in zeros for countries without any high-quality articles. Then I calculate the percentages "articles-per-population-percent" and "FA-GA-articles-percent".

In [327]:
# Merge the population dataset with the total articles dataset.
myAnalysisData = myData_pop.merge(myData_art, on='country', how='left')

# Merge the combined dataset with the total High Quality articles dataset.
myAnalysisData = myAnalysisData.merge(myData_HQ, on='country', how='left')

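# Fill in zeros for countries without any High Quality articles.
# (Assigning the result back avoids relying on inplace=True on a column selection.)
myAnalysisData['HQarticleCount'] = myAnalysisData['HQarticleCount'].fillna(0)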

# Calculate percentages
myAnalysisData['articles-per-population-percent'] = (myAnalysisData['articleCount']/myAnalysisData['population'])*100
myAnalysisData['FA-GA-articles-percent'] = (myAnalysisData['HQarticleCount']/myAnalysisData['articleCount'])*100

# Display sample
myAnalysisData[0:10]


|  | country | population | articleCount | HQarticleCount | articles-per-population-percent | FA-GA-articles-percent |
|---|---|---|---|---|---|---|
| 0 | Zambia | 15473900 | 26 | 0.0 | 0.000168 | 0.0 |
| 1 | Chad | 13707000 | 100 | 2.0 | 0.00073 | 2.0 |
| 2 | Zimbabwe | 17354000 | 167 | 2.0 | 0.000962 | 1.197605 |
| 3 | Uganda | 40141000 | 188 | 1.0 | 0.000468 | 0.531915 |
| 4 | Namibia | 2482100 | 165 | 1.0 | 0.006648 | 0.606061 |
| 5 | Nigeria | 181839400 | 684 | 5.0 | 0.000376 | 0.730994 |
| 6 | Colombia | 48218000 | 288 | 4.0 | 0.000597 | 1.388889 |
| 7 | Chile | 18025000 | 352 | 3.0 | 0.001953 | 0.852273 |
| 8 | Fiji | 867000 | 199 | 1.0 | 0.022953 | 0.502513 |
| 9 | Solomon Islands | 641900 | 98 | 0.0 | 0.015267 | 0.0 |
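
As a sanity check on the two formulas, the Namibia row from the table above can be recomputed by hand. The variable names below are mine; the inputs are the values shown in that row.

```python
# Namibia, values taken from the table above
population = 2482100
article_count = 165
hq_article_count = 1

# Articles as a percentage of population
articles_per_population_pct = article_count / population * 100
# High-quality (GA/FA) articles as a percentage of all articles
hq_articles_pct = hq_article_count / article_count * 100

print(round(articles_per_population_pct, 6))  # 0.006648
print(round(hq_articles_pct, 6))              # 0.606061
```

Both results match the "articles-per-population-percent" and "FA-GA-articles-percent" columns for that row.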


# Step 3: Analysis
For this section, we generate four tables that show: 
* 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
* 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
* 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
* 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

### 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [328]:
myAnalysisData.sort_values(by='articles-per-population-percent', ascending= False)[0:10]

|  | country | population | articleCount | HQarticleCount | articles-per-population-percent | FA-GA-articles-percent |
|---|---|---|---|---|---|---|
| 124 | Nauru | 10860 | 53 | 0.0 | 0.488029 | 0.0 |
| 114 | Tuvalu | 11800 | 55 | 2.0 | 0.466102 | 3.636364 |
| 98 | San Marino | 33000 | 82 | 0.0 | 0.248485 | 0.0 |
| 134 | Monaco | 38088 | 40 | 0.0 | 0.10502 | 0.0 |
| 142 | Liechtenstein | 37570 | 29 | 0.0 | 0.077189 | 0.0 |
| 148 | Marshall Islands | 55000 | 37 | 0.0 | 0.067273 | 0.0 |
| 53 | Iceland | 330828 | 206 | 2.0 | 0.062268 | 0.970874 |
| 138 | Tonga | 103300 | 63 | 0.0 | 0.060987 | 0.0 |
| 177 | Andorra | 78000 | 34 | 0.0 | 0.04359 | 0.0 |
| 180 | Federated States of Micronesia | 103000 | 38 | 0.0 | 0.036893 | 0.0 |


### 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [329]:
myAnalysisData.sort_values(by='articles-per-population-percent', ascending= True)[0:10]

|  | country | population | articleCount | HQarticleCount | articles-per-population-percent | FA-GA-articles-percent |
|---|---|---|---|---|---|---|
| 44 | India | 1314097616 | 990 | 14.0 | 7.5e-05 | 1.414141 |
| 80 | China | 1371920000 | 1138 | 39.0 | 8.3e-05 | 3.427065 |
| 30 | Indonesia | 255741973 | 215 | 6.0 | 8.4e-05 | 2.790698 |
| 167 | Uzbekistan | 31290791 | 29 | 3.0 | 9.3e-05 | 10.344828 |
| 113 | Ethiopia | 98148000 | 105 | 3.0 | 0.000107 | 2.857143 |
| 119 | Korea, North | 24983000 | 39 | 9.0 | 0.000156 | 23.076923 |
| 0 | Zambia | 15473900 | 26 | 0.0 | 0.000168 | 0.0 |
| 157 | Thailand | 65121250 | 112 | 3.0 | 0.000172 | 2.678571 |
| 110 | Congo, Dem. Rep. of | 73340200 | 142 | 8.0 | 0.000194 | 5.633803 |
| 43 | Bangladesh | 160411000 | 324 | 5.0 | 0.000202 | 1.54321 |


### 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [330]:
myAnalysisData.sort_values(by='FA-GA-articles-percent', ascending= False)[0:10]

|  | country | population | articleCount | HQarticleCount | articles-per-population-percent | FA-GA-articles-percent |
|---|---|---|---|---|---|---|
| 119 | Korea, North | 24983000 | 39 | 9.0 | 0.000156 | 23.076923 |
| 128 | Saudi Arabia | 31565109 | 119 | 15.0 | 0.000377 | 12.605042 |
| 172 | Central African Republic | 5551900 | 68 | 8.0 | 0.001225 | 11.764706 |
| 55 | Romania | 19838662 | 348 | 40.0 | 0.001754 | 11.494253 |
| 167 | Uzbekistan | 31290791 | 29 | 3.0 | 9.3e-05 | 10.344828 |
| 144 | Guinea-Bissau | 1788000 | 21 | 2.0 | 0.001174 | 9.52381 |
| 156 | Bhutan | 757000 | 33 | 3.0 | 0.004359 | 9.090909 |
| 91 | Vietnam | 91714080 | 191 | 17.0 | 0.000208 | 8.900524 |
| 162 | Mauritania | 3641288 | 52 | 4.0 | 0.001428 | 7.692308 |
| 94 | Ireland | 4630308 | 381 | 29.0 | 0.008228 | 7.611549 |


### 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [331]:
myAnalysisData.sort_values(by='FA-GA-articles-percent', ascending= True)[0:10]

|  | country | population | articleCount | HQarticleCount | articles-per-population-percent | FA-GA-articles-percent |
|---|---|---|---|---|---|---|
| 0 | Zambia | 15473900 | 26 | 0.0 | 0.000168 | 0.0 |
| 65 | Belgium | 11211064 | 523 | 0.0 | 0.004665 | 0.0 |
| 185 | Belize | 368000 | 16 | 0.0 | 0.004348 | 0.0 |
| 98 | San Marino | 33000 | 82 | 0.0 | 0.248485 | 0.0 |
| 100 | Turkmenistan | 5373000 | 33 | 0.0 | 0.000614 | 0.0 |
| 102 | French Guiana | 251000 | 28 | 0.0 | 0.011155 | 0.0 |
| 103 | Djibouti | 900000 | 39 | 0.0 | 0.004333 | 0.0 |
| 115 | Antigua and Barbuda | 90000 | 25 | 0.0 | 0.027778 | 0.0 |
| 124 | Nauru | 10860 | 53 | 0.0 | 0.488029 | 0.0 |
| 127 | Mozambique | 25736000 | 60 | 0.0 | 0.000233 | 0.0 |


Countries tied at 0% appear in arbitrary order, so this table simply shows the first 10 of them returned by the sort.
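
One way to make the order among tied countries deterministic, and to see how many ties there are, is to add a secondary sort key and count the zeros. The sketch below uses a four-row toy frame with values copied from rows shown in this notebook; sorting ties alphabetically by country is my choice, not part of the assignment.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Zambia", "Belgium", "Chad", "Belize"],
    "FA-GA-articles-percent": [0.0, 0.0, 2.0, 0.0],
})

# Break ties on the percentage by country name so the order is reproducible
ranked = toy.sort_values(by=["FA-GA-articles-percent", "country"], ascending=[True, True])
ranked_countries = ranked["country"].tolist()

# Count how many countries share the lowest value of 0%
num_zero = int((toy["FA-GA-articles-percent"] == 0).sum())
print(ranked_countries, num_zero)  # ['Belgium', 'Belize', 'Zambia', 'Chad'] 3
```

Reporting the count alongside the table makes it clear whether "bottom 10" is a meaningful ranking or just a slice of a larger tie.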

# Step 4: Final Output File 
Output the file to a csv (with some column re-ordering)

In [332]:
# Reorder the columns so that country comes first
final_output = myData.iloc[:,[1,0,2,3,4]]

# Write out the final output (index=False keeps the row index out of the file)
filename = 'en-wikipedia_articles_country_population_and_ratings.csv'
final_output.to_csv(filename, index=False)

# Step 5: Writeup

The coverage of the data in terms of countries is quite decent: after the merge, I had 187 countries, and there are about 196 independent countries according to www.worldatlas.com/nations.htm. I was initially surprised by the top 10 countries in terms of politician articles per population, as most of them were unfamiliar to me. But once I remembered that population is the denominator, it made sense: less well-known countries tend to have smaller populations, so even a modest number of articles yields a high ratio, and vice versa.

It was not surprising that India and China were at the bottom, given their population size. No evidence was presented that the number of political leaders is necessarily correlated with population, especially given the different government structures. It does raise the question, though: does using population as the denominator really show bias in political articles? That is, would we expect a strong correlation between population and article count in the absence of bias? We also filter to political articles rather than all articles, which in a way introduces its own bias. Given the data available for this exercise, I can't put too much weight on a confirmation of bias based on the percent of articles per population. One other concern is that there is currently no calculation of whether any of these differences are statistically significant.
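
The correlation question above could be probed along these lines. This is only a sketch: it uses the ten-row sample displayed in Step 2, not the full 187-country table, so whatever number it prints is an illustration of the method and not a finding. Taking log10 of both columns before computing Pearson's r is my assumption, made because populations span several orders of magnitude.

```python
import numpy as np
import pandas as pd

# Ten sample rows copied from the Step 2 output; not the full dataset
sample = pd.DataFrame({
    "country": ["Zambia", "Chad", "Zimbabwe", "Uganda", "Namibia",
                "Nigeria", "Colombia", "Chile", "Fiji", "Solomon Islands"],
    "population": [15473900, 13707000, 17354000, 40141000, 2482100,
                   181839400, 48218000, 18025000, 867000, 641900],
    "articleCount": [26, 100, 167, 188, 165, 684, 288, 352, 199, 98],
})

# Compare on a log scale so the largest countries do not dominate
log_population = np.log10(sample["population"])
log_articles = np.log10(sample["articleCount"])

# Series.corr defaults to Pearson correlation
r = log_population.corr(log_articles)
print(f"Pearson r on log10 values (10-row sample): {r:.2f}")
```

Replacing `sample` with `myAnalysisData` would run the same check on all countries; a formal significance test would still be needed on top of this.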

North Korea is very interesting. Its high-quality percentage is almost double that of the next country. This is English Wikipedia, so it should be unrelated to any censorship in North Korea, unless only a limited amount of information is available to outside editors to write about. Maybe its prominence in current affairs brings more editor traffic, which increases quality. A lack of such attention could explain why many African countries have no high-quality articles.

Another possibility is that the algorithm itself (or the training data) is biased. Is high quality defined to fit the standards of the majority without accounting for the nuances of the minority?
