# <center>DATA 512 Homework 2: Bias in Data</center>
<center>Fall 2021</center>
<center>Author: Dwight Sablan</center>

## Background

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles, specifically articles on political figures from a variety of countries. I combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article. I then analyze how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries.

## Step 1: Getting the Article and Population Data

The first step is getting the data, which lives in several different places. 

Dataset 1: The Wikipedia politicians by country dataset can be found on Figshare. We download and unzip the data file named page_data.csv.

Dataset 2: The population data is available in CSV format as WPDS_2020_data.csv. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.

IMPORT DEPENDENCIES

In [170]:
import pandas as pd
import numpy as np

import json
import requests

READ IN THE TWO DATASETS

In [16]:
politician_data = pd.read_csv('page_data.csv')

#print dataframe shape
display(politician_data.shape)

#print first 5 rows
display(politician_data.head())

(47197, 3)

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [17]:
population_data = pd.read_csv('WPDS_2020_data.csv')

#print dataframe shape
display(population_data.shape)

#print first five rows
display(population_data.head())

(234, 6)

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
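
One detail worth checking when loading the population file: by default `pd.read_csv` treats the literal string `NA` as a missing value, so a two-letter code such as Namibia's would be read as `NaN`. The sketch below uses a small made-up CSV (not the real file) to show the default behaviour and one possible workaround. It assumes the FIPS column really does contain `NA`; the Namibia figures are placeholders.

```python
import io
import pandas as pd

# Hypothetical sample in the same column layout as the population file.
# The Namibia row uses placeholder numbers, not real data.
sample_csv = io.StringIO(
    "FIPS,Name,Type,TimeFrame,Data (M),Population\n"
    "DZ,Algeria,Country,2019,44.357,44357000\n"
    "NA,Namibia,Country,2019,1.0,1000000\n"
    "EG,Egypt,Country,2019,100.803,100803000\n"
)

# Default parsing: the string "NA" becomes NaN
default_read = pd.read_csv(sample_csv)

# Rewind the buffer and read again without the default NA markers
sample_csv.seek(0)
safe_read = pd.read_csv(sample_csv, keep_default_na=False)

print(default_read["FIPS"].tolist())
print(safe_read["FIPS"].tolist())
```

If the real file has genuinely missing cells, `keep_default_na=False` would leave them as empty strings, so it is worth checking which behaviour suits the later steps.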


## Step 2: Cleaning the Data

In the politician dataset, filter out the page names that contain the string 'Template' as they won't be needed in the analysis.

In [18]:
politician_data_cleaned = politician_data[~ politician_data.page.str.contains("Template")]

display(politician_data_cleaned.shape)

display(politician_data_cleaned.head())

(46701, 3)

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
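
The filter above drops any page whose name contains the substring "Template" anywhere. As a hedged alternative, the sketch below compares that with matching only the `Template:` prefix, using made-up page titles. The third title is a hypothetical article that merely contains the word, and whether such titles exist in the real data is not something this sketch establishes.

```python
import pandas as pd

# Made-up page titles for illustration only
pages = pd.DataFrame({
    "page": ["Template:Uganda-politician-stub", "Yos Por", "Example Template Author"],
    "country": ["Uganda", "Cambodia", "Nowhere"],
})

# Substring match: removes every title containing "Template"
contains_mask = pages["page"].str.contains("Template")

# Prefix match: removes only titles in the Template: namespace
prefix_mask = pages["page"].str.startswith("Template:")

kept_by_contains = pages[~contains_mask]
# .copy() gives an independent dataframe, so adding columns later is safe
kept_by_prefix = pages[~prefix_mask].copy()

print(kept_by_contains["page"].tolist())
print(kept_by_prefix["page"].tolist())
```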


In the population dataset, separate the cumulative regional population counts from the country-level counts. The regional rows are identified by names written in all caps, e.g. OCEANIA.

In [22]:
#apply the isupper function to the Name column
regional_population = population_data[population_data['Name'].apply(lambda x: x.isupper())]

display(regional_population.shape)

display(regional_population.head())

(24, 6)

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
10,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000
27,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000


In [23]:
#get the inverse of the regional_population dataset to get the country-level populations
country_population = population_data[~population_data['Name'].apply(lambda x: x.isupper())]

display(country_population.shape)

display(country_population.head())

(210, 6)

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000


## Step 3: Getting Article Quality Predictions

Now we need to get the predicted quality score for each article in the Wikipedia dataset. To do so, we use ORES, a machine learning system that provides estimates of Wikipedia article quality.

The article quality estimates are, from best to worst:
- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article

These were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. These quality classes are a sub-set of quality assessment categories developed by Wikipedia editors. For a given rev_id, ORES will assign one of these 6 categories. 

We use the ORES REST API, which provides access to a set of scoring models. This is how we'll get the article predictions.

SET USER-AGENT AND ENDPOINT TO RETRIEVE DATA

In [132]:
headers = {
    'User-Agent': 'https://github.com/dwightsablan16',
    'From': 'sabland@uw.edu'
}

endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={revid}'

DEFINE FUNCTION TO CALL THE API AND GET SCORES DATA

In [136]:
def api_call(endpoint, parameters):
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()
    
    return response

DEFINE FUNCTION TO GET THE PREDICTIONS FOR EACH REV_ID

In [179]:
#Create list for predictions
predictions_list = []

def get_prediction(article_scores):
    
    for i in article_scores['enwiki']['scores']:
        
            #if there exists a prediction 
            if 'score' in article_scores['enwiki']['scores'][i]['articlequality']:
                
                #get the article quality prediction
                article_quality = article_scores['enwiki']['scores'][i]['articlequality']['score']['prediction']
                
                #add prediction
                predictions_list.append(article_quality)
                
            else:
                #Add 'no_pred' for articles with no predictions
                predictions_list.append('no_pred')
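
The function above appends predictions to a list in the order the response is iterated. A minimal alternative sketch, assuming the same nested response layout that the function walks through, keys each prediction by its rev_id so it can be joined back to the dataframe without relying on order. The helper name `predictions_by_rev_id`, the second rev_id, and the contents of the `error` entry are hypothetical.

```python
def predictions_by_rev_id(article_scores, wiki="enwiki", model="articlequality"):
    """Return {rev_id (str): prediction, or 'no_pred' when no score is present}."""
    results = {}
    for rev_id, models in article_scores[wiki]["scores"].items():
        score = models[model].get("score")
        results[rev_id] = score["prediction"] if score else "no_pred"
    return results


# Mock response shaped like the structure parsed above; not real API output
mock_response = {
    "enwiki": {
        "scores": {
            "355319463": {"articlequality": {"score": {"prediction": "Stub"}}},
            "111": {"articlequality": {"error": {"message": "placeholder"}}},
        }
    }
}

lookup = predictions_by_rev_id(mock_response)
print(lookup)
```

With a lookup like this, something along the lines of `df['rev_id'].astype(str).map(lookup)` would attach each prediction to its own row.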

INGEST THE DATA

In [180]:
#set initial index
begin_point = 0

while begin_point < politician_data_cleaned.shape[0]:
    
    #set intervals of data ingestion
    ingest_range = begin_point + 50
    
    #set end index
    end_point = min(ingest_range, politician_data_cleaned.shape[0])
    
    #set parameters for API
    parameters = {'context' : 'enwiki',
          'revid'  : '|'.join (str(x) for x in politician_data_cleaned['rev_id'][begin_point:end_point]),
          'model' : 'articlequality'
          }

    #call api to get scores for corresponding indices
    scores = api_call(endpoint, parameters)
    
    #adjust beginning point to get next 50 responses
    begin_point = begin_point + 50
    
    #call function to store prediction value into prediction list
    get_prediction(scores)
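
The loop above sends rev_ids to the API 50 at a time, joined with `|`. The batching logic can be sketched on its own, without any network call, as follows. The helper names `chunked` and `revid_param` are hypothetical and the ids are placeholders.

```python
def chunked(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def revid_param(rev_ids):
    """Join rev_ids into the pipe-separated string used in the endpoint URL."""
    return "|".join(str(r) for r in rev_ids)


# 120 placeholder ids stand in for the rev_id column
rev_ids = list(range(1, 121))
batches = list(chunked(rev_ids))

print([len(b) for b in batches])
print(revid_param(batches[2][:3]))
```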

In [185]:
#Add the predictions to the dataframe; assign() returns a new dataframe, which avoids the SettingWithCopyWarning raised when setting a column on a filtered slice
politician_data_cleaned = politician_data_cleaned.assign(prediction=predictions_list)


In [188]:
#View politician data frame
politician_data_cleaned.head()

Unnamed: 0,page,country,rev_id,prediction
1,Bir I of Kanem,Chad,355319463,Stub
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
12,Yos Por,Cambodia,393822005,Stub
23,Julius Gregr,Czech Republic,395521877,Stub
24,Edvard Gregr,Czech Republic,395526568,Stub


GET THE PAGES WITH NO PREDICTIONS AND SAVE AS CSV

In [192]:
#get all the rev_id's with no prediction
no_prediction_data = politician_data_cleaned[politician_data_cleaned.prediction == 'no_pred']

#save as csv
no_prediction_data.to_csv('no_prediction_data.csv')

In [302]:
#remove articles in the dataset where we have no_pred
politician_data_cleaned = politician_data_cleaned[politician_data_cleaned.prediction != 'no_pred']

## Step 4: Combining the Datasets

Some processing of the data is necessary. In particular, after retrieving and including the ORES data for each article, we need to merge the Wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging, some entries inevitably cannot be matched: either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.

In [303]:
#rename column name 'Name' in country_population dataset to merge on 'country'
country_population = country_population.rename(columns={'Name': 'country'})

MERGE DATA

In [329]:
#outer merge on the shared 'country' column, done once; the _merge indicator records where each row came from
all_merged = politician_data_cleaned.merge(country_population, on = 'country', how = 'outer', indicator=True)

#data where country is in both datasets
merged_data = all_merged.loc[all_merged['_merge'] == 'both']

#data where country is only in politician dataset
left_only_data = all_merged.loc[all_merged['_merge'] == 'left_only']

#data where country is only in population dataset
right_only_data = all_merged.loc[all_merged['_merge'] == 'right_only']

#aggregate data with no matches (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
no_match_data = pd.concat([left_only_data, right_only_data], ignore_index=True)
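
To make the three `_merge` categories concrete, here is a small sketch on made-up data. The page names, the country "Atlantis", and the population numbers are placeholders.

```python
import pandas as pd

# Toy article and population tables sharing a 'country' column
articles = pd.DataFrame({
    "page": ["A", "B", "C"],
    "country": ["Chad", "Chad", "Atlantis"],
})
populations = pd.DataFrame({
    "country": ["Chad", "Egypt"],
    "Population": [1, 2],
})

# Outer merge keeps unmatched rows from both sides and labels each row's origin
merged = articles.merge(populations, on="country", how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"]
articles_only = merged[merged["_merge"] == "left_only"]
population_only = merged[merged["_merge"] == "right_only"]

print(merged[["page", "country", "_merge"]])
```

Here the two Chad articles match, the "Atlantis" article has no population row, and Egypt has a population row but no article.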

RENAME AND REMOVE UNNECESSARY COLUMNS

In [330]:
#rename columns
merged_data = merged_data.rename(columns = {'page': 'article_name', 'rev_id': 'revision_id', 'prediction': 'article_quality_est.', 'Population': 'population'})
no_match_data = no_match_data.rename(columns = {'page': 'article_name', 'rev_id': 'revision_id', 'prediction': 'article_quality_est.', 'Population': 'population'})

In [331]:
#drop columns not needed
dropped_cols = ['FIPS', 'Type', 'TimeFrame', 'Data (M)', '_merge']

merged_data = merged_data.drop(labels = dropped_cols, axis = 1)

no_match_data = no_match_data.drop(labels = dropped_cols, axis = 1)

SAVE DATA

In [332]:
#save data with matches
merged_data.to_csv('wp_wpds_politicians_by_country.csv')

#save data with no matches
no_match_data.to_csv('wp_wpds_countries-no_match.csv')

## Step 5: Analysis

The analysis will consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country AND for each geographic region. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.


In [333]:
#create a dataframe with just high quality articles
high_quality_pages = merged_data[(merged_data['article_quality_est.'] == 'FA') | (merged_data['article_quality_est.'] == 'GA') ]

PROPORTION OF ARTICLES PER POPULATION FOR EACH COUNTRY

In [314]:
#country population
country_pop = merged_data.groupby(['country'])['population'].mean()

#number of articles
page_count = merged_data.groupby(['country'])['article_name'].count()

#articles as a percentage of country population
pages_per_country = page_count/country_pop*100

pages_per_country

country
Afghanistan    0.000819
Albania        0.016068
Algeria        0.000262
Andorra        0.041463
Angola         0.000326
                 ...   
Venezuela      0.000454
Vietnam        0.000194
Yemen          0.000389
Zambia         0.000136
Zimbabwe       0.001097
Length: 183, dtype: float64

PROPORTION OF HIGH QUALITY ARTICLES PER POPULATION FOR EACH COUNTRY

In [336]:
#country population
high_country_pop = high_quality_pages.groupby(['country'])['population'].mean()

#number of articles
high_page_count = high_quality_pages.groupby(['country'])['article_name'].count()

#high quality articles as a percentage of country population
high_pages_per_country = high_page_count/high_country_pop*100

high_pages_per_country

country
Afghanistan    0.000033
Albania        0.000106
Algeria        0.000005
Argentina      0.000035
Armenia        0.000169
                 ...   
Vanuatu        0.000935
Venezuela      0.000010
Vietnam        0.000014
Yemen          0.000010
Zimbabwe       0.000013
Length: 146, dtype: float64
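
The cell above divides the high-quality article count by population. A different measure, the share of a country's politician articles that are FA or GA, could be sketched as follows on toy data. This is an illustrative alternative, not the calculation used for the tables in this notebook; the country labels and quality values are made up.

```python
import pandas as pd

# Toy data using the same quality column name as merged_data
toy = pd.DataFrame({
    "country": ["X", "X", "X", "X", "Y", "Y"],
    "article_quality_est.": ["FA", "Stub", "GA", "Start", "Stub", "C"],
})

# True where the predicted class is FA or GA
is_high = toy["article_quality_est."].isin(["FA", "GA"])

# The mean of a boolean series is the fraction of True values, per country
share_high = is_high.groupby(toy["country"]).mean() * 100

print(share_high)
```

Grouping the boolean flag keeps countries with zero high-quality articles (country Y here) in the result, whereas filtering to high-quality rows first drops them, which is why the output above has 146 countries rather than 183.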

## Step 6: Results

The results are presented in the following 6 tables:

- Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
- Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
- Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
- Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
- Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
- Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality


In [421]:
regional_population_data_cleaned.head(10)

FIPS,Name,Type,TimeFrame,Data (M),Population
AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
,Algeria,Country,2019,44.357,44357000
,Egypt,Country,2019,100.803,100803000
,Libya,Country,2019,6.891,6891000
,Morocco,Country,2019,35.952,35952000
,Sudan,Country,2019,43.849,43849000
,Tunisia,Country,2019,11.896,11896000
,Western Sahara,Country,2019,0.597,597000
WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000


In [417]:
display(population_data.head(10))
display(population_data.shape)

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000
8,TN,Tunisia,Country,2019,11.896,11896000
9,EH,Western Sahara,Country,2019,0.597,597000


(234, 6)

In [None]:
region_dict = {'NORTHERN AFRICA': ['Algeria',
'Egypt',
'Libya',
'Morocco',
'Sudan',
'Tunisia', 
'Western Sahara']}

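#invert region_dict into a country -> region lookup and add it as a 'region' column (countries not listed get NaN)
country_to_region = {country: region for region, countries in region_dict.items() for country in countries}
regional_merged_data = merged_data.assign(region=merged_data['country'].map(country_to_region))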

In [450]:
regional_population_data_cleaned_2 = population_data[population_data.FIPS != 'WORLD']

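#blank out the two-letter country codes in FIPS
regional_population_data_cleaned_2.loc[(regional_population_data_cleaned_2.FIPS.str.len() <3),'FIPS']= ""

#note: dropna() only drops rows containing NaN; the empty strings set above are kept
regional_population_data_cleaned_2 = regional_population_data_cleaned_2.dropna()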

regional_population_data_cleaned_2 = regional_population_data_cleaned_2.set_index(keys = 'FIPS')

regional_population_data_cleaned_2
len(regional_population_data_cleaned_2.index)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


232

In [433]:
#filter out the WORLD row and the country rows (two-letter FIPS codes), then set the index to FIPS
regional_population_data_cleaned = population_data[population_data.FIPS != 'WORLD']

regional_population_data_cleaned.loc[(regional_population_data_cleaned.FIPS.str.len() <3),'FIPS']= None

regional_population_data_cleaned = regional_population_data_cleaned.set_index(keys = 'FIPS')

regional_population_data_cleaned = regional_population_data_cleaned[regional_population_data_cleaned.index.notnull()]

regional_population_data_cleaned.head()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


FIPS,Name,Type,TimeFrame,Data (M),Population
AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000
EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000
MIDDLE AFRICA,MIDDLE AFRICA,Sub-Region,2019,179.757,179757000


In [415]:
regional_population_data_cleaned.index

Index(['AFRICA', 'NORTHERN AFRICA', '', '', '', '', '', '', '',
       'WESTERN AFRICA',
       ...
       '', '', '', '', '', '', '', '', '', ''],
      dtype='object', name='FIPS', length=232)
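
The region tables listed at the start of this step need each country tagged with its region. One possible approach, sketched here on a handful of rows rather than the full file, relies on the ordering visible in `population_data.head(10)`: each all-caps region row is followed by its countries. That ordering assumption, the `region` column name, and the `Country W` row with its population are hypothetical.

```python
import pandas as pd

# A few rows in file order; 'Country W' and its population are placeholders
pop = pd.DataFrame({
    "Name": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
             "WESTERN AFRICA", "Country W"],
    "Population": [7772850000, 1337918000, 244344000, 44357000, 100803000,
                   401115000, 1000000],
})

# Region rows are the ones whose name is entirely upper case
is_region = pop["Name"].str.isupper()

# Keep the name only on region rows, then carry it forward to the countries below
pop["region"] = pop["Name"].where(is_region).ffill()

# Country-level rows with their region label
country_regions = (
    pop.loc[~is_region, ["Name", "region", "Population"]]
    .rename(columns={"Name": "country"})
)

# Total country population per region
region_totals = country_regions.groupby("region")["Population"].sum()

print(country_regions)
```

A table like `country_regions` could then be merged onto the article data by `country` before grouping by `region`.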

#### Top 10 countries by coverage

In [339]:
top_ten_pages_per_country = pages_per_country.sort_values(ascending=False)[0:10]

top_ten_pages_per_country

country
Tuvalu                            0.540000
Nauru                             0.472727
San Marino                        0.238235
Monaco                            0.105263
Liechtenstein                     0.071795
Marshall Islands                  0.064912
Tonga                             0.063636
Iceland                           0.054620
Andorra                           0.041463
Federated States of Micronesia    0.033962
dtype: float64

#### Bottom 10 countries by coverage

In [341]:
bottom_ten_pages_per_country = pages_per_country.sort_values(ascending=True)[0:10]

bottom_ten_pages_per_country

country
India           0.000069
Indonesia       0.000077
China           0.000081
Uzbekistan      0.000082
Ethiopia        0.000088
Zambia          0.000136
Korea, North    0.000140
Thailand        0.000168
Mozambique      0.000186
Bangladesh      0.000187
dtype: float64

#### Top 10 countries by relative quality

In [342]:
top_ten_high_pages_per_country = high_pages_per_country.sort_values(ascending=False)[0:10]

top_ten_high_pages_per_country

country
Tuvalu         0.040000
Dominica       0.001389
Vanuatu        0.000935
Iceland        0.000543
Ireland        0.000500
Montenegro     0.000322
Martinique     0.000281
Bhutan         0.000274
New Zealand    0.000261
Romania        0.000218
dtype: float64

#### Bottom 10 countries by relative quality

In [343]:
bottom_ten_high_pages_per_country = high_pages_per_country.sort_values(ascending=True)[0:10]

bottom_ten_high_pages_per_country

country
India         9.285051e-07
Nigeria       9.702144e-07
Tanzania      1.674088e-06
Ethiopia      1.740402e-06
Bangladesh    1.766691e-06
Colombia      2.022490e-06
Uganda        2.186222e-06
Morocco       2.781486e-06
Brazil        2.832701e-06
China         2.852284e-06
dtype: float64

## Step 7: Writeup: Reflections and Implications

Write a few paragraphs, either in the README or at the end of the notebook, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning.

One thing that didn't surprise me about the findings is that the top 10 countries by coverage 