# <center>DATA 512 Homework 2: Bias in Data</center>
<center>Fall 2021</center>
<center>Author: Dwight Sablan</center>

## Background

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles, specifically articles on political figures from a variety of countries. I combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article. I then analyze how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries.

## Step 1: Getting the Article and Population Data

The first step is getting the data, which lives in several different places. 

Dataset 1: The Wikipedia politicians by country dataset can be found on Figshare. We download and unzip the data file named page_data.csv.

Dataset 2: The population data is available in CSV format as WPDS_2020_data.csv. This dataset is drawn from the World Population Data Sheet published by the Population Reference Bureau.

IMPORT DEPENDENCIES

In [31]:
import json

import pandas as pd
import requests

READ IN THE TWO DATASETS

In [16]:
politician_data = pd.read_csv('page_data.csv')

#print dataframe shape
display(politician_data.shape)

#print first 5 rows
display(politician_data.head())

(47197, 3)

|  | page | country | rev_id |
|---|---|---|---|
| 0 | Template:ZambiaProvincialMinisters | Zambia | 235107991 |
| 1 | Bir I of Kanem | Chad | 355319463 |
| 2 | Template:Zimbabwe-politician-stub | Zimbabwe | 391862046 |
| 3 | Template:Uganda-politician-stub | Uganda | 391862070 |
| 4 | Template:Namibia-politician-stub | Namibia | 391862409 |


In [17]:
population_data = pd.read_csv('WPDS_2020_data.csv')

#print dataframe shape
display(population_data.shape)

#print first five rows
display(population_data.head())

(234, 6)

|  | FIPS | Name | Type | TimeFrame | Data (M) | Population |
|---|---|---|---|---|---|---|
| 0 | WORLD | WORLD | World | 2019 | 7772.85 | 7772850000 |
| 1 | AFRICA | AFRICA | Sub-Region | 2019 | 1337.918 | 1337918000 |
| 2 | NORTHERN AFRICA | NORTHERN AFRICA | Sub-Region | 2019 | 244.344 | 244344000 |
| 3 | DZ | Algeria | Country | 2019 | 44.357 | 44357000 |
| 4 | EG | Egypt | Country | 2019 | 100.803 | 100803000 |


## Step 2: Cleaning the Data

In the politician dataset, filter out the rows whose page name contains the string 'Template'. These rows are Wikipedia template pages rather than articles, so they are not needed in the analysis.

In [18]:
politician_data_cleaned = politician_data[~ politician_data.page.str.contains("Template")]

display(politician_data_cleaned.shape)

display(politician_data_cleaned.head())

(46701, 3)

|  | page | country | rev_id |
|---|---|---|---|
| 1 | Bir I of Kanem | Chad | 355319463 |
| 10 | Information Minister of the Palestinian Nation... | Palestinian Territory | 393276188 |
| 12 | Yos Por | Cambodia | 393822005 |
| 23 | Julius Gregr | Czech Republic | 395521877 |
| 24 | Edvard Gregr | Czech Republic | 395526568 |
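
The filter above can be tried on a small, made-up frame. This sketch assumes that template pages always carry the `Template:` namespace prefix, and it swaps `str.contains` for `str.startswith` so that an article whose title merely includes the word is kept. The third row (`Template (politician)`, country `Nowhere`) is hypothetical, and a prefix match on the real file may give a different row count than the one reported above.

```python
import pandas as pd

# Illustrative rows only; the third row is invented for the example.
sample = pd.DataFrame({
    "page": ["Template:Uganda-politician-stub", "Yos Por", "Template (politician)"],
    "country": ["Uganda", "Cambodia", "Nowhere"],
    "rev_id": [391862070, 393822005, 1],
})

# Flag rows whose title starts with the Template: namespace prefix.
is_template = sample["page"].str.startswith("Template:")

# Keep everything that is not a template page.
articles = sample[~is_template]

print(articles["page"].tolist())
```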


In the population dataset, separate the cumulative regional population counts from the country-level counts. Regional rows are identified by a Name written entirely in capital letters, for example OCEANIA.

In [22]:
#apply the isupper function to the Name column
regional_population = population_data[population_data['Name'].apply(lambda x: x.isupper())]

display(regional_population.shape)

display(regional_population.head())

(24, 6)

|  | FIPS | Name | Type | TimeFrame | Data (M) | Population |
|---|---|---|---|---|---|---|
| 0 | WORLD | WORLD | World | 2019 | 7772.85 | 7772850000 |
| 1 | AFRICA | AFRICA | Sub-Region | 2019 | 1337.918 | 1337918000 |
| 2 | NORTHERN AFRICA | NORTHERN AFRICA | Sub-Region | 2019 | 244.344 | 244344000 |
| 10 | WESTERN AFRICA | WESTERN AFRICA | Sub-Region | 2019 | 401.115 | 401115000 |
| 27 | EASTERN AFRICA | EASTERN AFRICA | Sub-Region | 2019 | 444.97 | 444970000 |


In [23]:
#get the inverse of the regional_population dataset to get the country-level populations
country_population = population_data[~population_data['Name'].apply(lambda x: x.isupper())]

display(country_population.shape)

display(country_population.head())

(210, 6)

|  | FIPS | Name | Type | TimeFrame | Data (M) | Population |
|---|---|---|---|---|---|---|
| 3 | DZ | Algeria | Country | 2019 | 44.357 | 44357000 |
| 4 | EG | Egypt | Country | 2019 | 100.803 | 100803000 |
| 5 | LY | Libya | Country | 2019 | 6.891 | 6891000 |
| 6 | MA | Morocco | Country | 2019 | 35.952 | 35952000 |
| 7 | SD | Sudan | Country | 2019 | 43.849 | 43849000 |
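
The region/country split can be cross-checked against the `Type` column that appears in the previews. The sketch below is not part of the original analysis: it assumes `Type` is filled in consistently across the whole file (only the values `World`, `Sub-Region` and `Country` are visible above), and it uses the vectorised `str.isupper()` in place of `apply` with a lambda. The sample rows are copied from the previews.

```python
import pandas as pd

# Rows copied from the previews above.
sample = pd.DataFrame({
    "FIPS": ["WORLD", "AFRICA", "DZ", "EG"],
    "Name": ["WORLD", "AFRICA", "Algeria", "Egypt"],
    "Type": ["World", "Sub-Region", "Country", "Country"],
    "Population": [7772850000, 1337918000, 44357000, 100803000],
})

# Vectorised equivalent of apply(lambda x: x.isupper()).
by_name = sample["Name"].str.isupper()

# Alternative criterion: anything not typed as a country is treated as a region.
by_type = sample["Type"] != "Country"

regions = sample[by_name]
countries = sample[~by_name]

# Rows where the two criteria disagree would be worth inspecting by hand.
mismatches = sample[by_name != by_type]

print(countries["Name"].tolist())
```

If `mismatches` is non-empty on the full file, the all-caps rule and the `Type` column disagree for those rows.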


## Step 3: Getting Article Quality Predictions

Now we need to get the predicted quality score for each article in the Wikipedia dataset. To do so, we use ORES, a machine learning system that provides estimates of Wikipedia article quality.

The article quality estimates are, from best to worst:
- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article

ORES learned these classes from Wikipedia articles that were peer-reviewed under the Wikipedia content assessment procedures. The classes are a subset of the quality assessment categories developed by Wikipedia editors. For a given rev_id, ORES assigns one of these six categories.
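
Because the six classes have a natural order, it can help to store predictions as an ordered categorical so that sorting and comparisons follow quality rather than the alphabet. This is only a sketch: the `predictions` values are made up for illustration and are not ORES output, and the name `QUALITY_ORDER` is hypothetical.

```python
import pandas as pd

# Quality classes from best to worst, as listed above.
QUALITY_ORDER = ["FA", "GA", "B", "C", "Start", "Stub"]

# Hypothetical predictions for illustration; not real ORES output.
predictions = pd.Series(["Stub", "GA", "C", "FA", "Start"])

# Categories are passed worst-to-best so that "greater" means "better".
quality = pd.Categorical(predictions, categories=QUALITY_ORDER[::-1], ordered=True)

# Sort from best to worst using the category order.
ranked = pd.Series(quality).sort_values(ascending=False)

print(ranked.tolist())
```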

Use the ORES REST API, which provides access to a set of scoring models. This is how we'll get the article predictions.

SET USER-AGENT AND ENDPOINT TO RETRIEVE DATA

In [29]:
headers = {
    'User-Agent': 'https://github.com/dwightsablan16',
    'From': 'sabland@uw.edu'
}

In [47]:
endpoint = 'https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model'

In [90]:
params = {'context' : 'enwiki',
          'revid'  : '|'.join (str(x) for x in politician_data_cleaned['rev_id'][0:10]),
          'model' : 'articlequality'
          }
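
The cell above joins only the first ten rev_ids with `|`. Scoring the full dataset would mean sending the ids in several batches. The sketch below shows one way to split and join them. The helper name `chunked` is hypothetical, and `BATCH_SIZE` is an arbitrary illustrative value, not a documented service limit.

```python
def chunked(values, size):
    """Yield successive lists of at most `size` items from `values`."""
    values = list(values)
    for start in range(0, len(values), size):
        yield values[start:start + size]


# Illustrative ids taken from the preview rows above.
rev_ids = [355319463, 393276188, 393822005, 395521877, 395526568]

# BATCH_SIZE is an assumption; check the service's limits before relying on it.
BATCH_SIZE = 2

# One '|'-joined string per batch, in the same format as the params cell.
batches = ["|".join(str(r) for r in batch) for batch in chunked(rev_ids, BATCH_SIZE)]

print(batches)
```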

In [98]:
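def api_call(endpoint, parameters, filename):
    """Request scores from the endpoint and save the JSON response to filename."""
    # Fill any {placeholders} in the endpoint with the request parameters
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()

    # Save the parsed response as JSON
    with open(filename, 'w') as f:
        json.dump(response, f)

    return response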

In [99]:
test_list = api_call(endpoint, params, 'predictions.json')

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
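
A `JSONDecodeError` at character 0 usually means the response body was not JSON at all. One likely cause here, offered as an assumption rather than a confirmed diagnosis, is that the `endpoint` string points at the interactive API documentation page (note the `#!` fragment) and contains no `{...}` placeholders. In that case `endpoint.format(**parameters)` leaves it unchanged and the server answers with HTML. The sketch below assumes the ORES v3 scoring route has the form `https://ores.wikimedia.org/v3/scores/{context}?models={model}&revids={revids}` and that each prediction sits at `response[context]['scores'][rev_id][model]['score']['prediction']`. Both should be checked against the ORES documentation. The helper names `build_url` and `extract_predictions` are hypothetical, and the response is hand-written so that the example runs without network access.

```python
# Assumed URL template for the ORES v3 scoring route (verify against the API docs).
ENDPOINT = "https://ores.wikimedia.org/v3/scores/{context}?models={model}&revids={revids}"


def build_url(context, model, rev_ids):
    """Fill the URL template; rev_ids are joined with '|' as in the params cell."""
    revids = "|".join(str(r) for r in rev_ids)
    return ENDPOINT.format(context=context, model=model, revids=revids)


def extract_predictions(response, context, model):
    """Map rev_id -> predicted class, or None when a revision has no score."""
    scores = response.get(context, {}).get("scores", {})
    result = {}
    for rev_id, models in scores.items():
        score = models.get(model, {}).get("score")
        result[int(rev_id)] = score["prediction"] if score else None
    return result


# Hand-written response in the assumed shape; not real ORES output.
fake_response = {
    "enwiki": {
        "scores": {
            "355319463": {"articlequality": {"score": {"prediction": "Stub"}}},
            "393276188": {"articlequality": {"error": {"type": "example-error"}}},
        }
    }
}

url = build_url("enwiki", "articlequality", [355319463, 393276188])
predictions = extract_predictions(fake_response, "enwiki", "articlequality")

print(url)
print(predictions)
```

In a live call, inspecting `call.status_code` and `call.headers.get('Content-Type')` before calling `call.json()` would show whether the server actually returned JSON.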