# A2: Bias in Data Assignment

### DATA 512

#### Emily Yamauchi

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.  

You are expected to perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries. Your analysis will consist of a series of tables that show:  

1. the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. the countries with the highest and lowest proportion of high quality articles about politicians.
3. a ranking of geographic regions by articles-per-person and proportion of high quality articles.  

You are also expected to write a short reflection on the project that focuses on how both your findings from this analysis and the process you went through to reach those findings help you understand the causes and consequences of biased data in large, complex data science projects.


## Step 1: Getting the Article and Population Data

The first step is getting the data, which lives in several different places. The Wikipedia [politicians by country dataset](https://figshare.com/articles/Untitled_Item/5513449) can be found on Figshare. Read through the documentation for this repository, then download and unzip it to extract the data file, which is called `page_data.csv`.  

The population data is available in CSV format as [`WPDS_2020_data.csv`](https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit?usp=sharing). This dataset is drawn from the [world population data sheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau.

In [26]:
from zipfile import ZipFile
import os

import pandas as pd

In [68]:
## Step 0: download the two files above to data_raw directory

In [69]:
# Unzip the politician file into data_raw without changing the working directory

with ZipFile('data_raw/country.zip') as zipfiles:
    zipfiles.extractall('data_raw')

os.getcwd()

'C:\\Users\\admin\\Documents\\UW\\DATA512\\Assignments\\A2'

In [56]:
# load country politician data from unzipped folder

pols = pd.read_csv('data_raw/country/data/page_data.csv')

pols.head()

|   | page | country | rev_id |
|---|---|---|---|
| 0 | Template:ZambiaProvincialMinisters | Zambia | 235107991 |
| 1 | Bir I of Kanem | Chad | 355319463 |
| 2 | Template:Zimbabwe-politician-stub | Zimbabwe | 391862046 |
| 3 | Template:Uganda-politician-stub | Uganda | 391862070 |
| 4 | Template:Namibia-politician-stub | Namibia | 391862409 |


In [47]:
# load population data

# skip the first 4 lines of the export, which precede the column header row
pops = pd.read_csv('data_raw/export.csv', skiprows=4)

pops.head()

|   | FIPS | Name | Type | TimeFrame | Data |
|---|---|---|---|---|---|
| 0 | WORLD | WORLD | World | 2019 | 7772.85 |
| 1 | AFRICA | AFRICA | Sub-Region | 2019 | 1337.918 |
| 2 | NORTHERN AFRICA | NORTHERN AFRICA | Sub-Region | 2019 | 244.344 |
| 3 | DZ | Algeria | Country | 2019 | 44.357 |
| 4 | EG | Egypt | Country | 2019 | 100.803 |


## Step 2: Cleaning the Data

Both `page_data.csv` and `WPDS_2020_data.csv` contain some rows that you will need to filter out and/or ignore when you combine the datasets in the next step. In the case of `page_data.csv`, the dataset contains some page names that start with the string "`Template:`". These pages are not Wikipedia articles, and should not be included in your analysis.  

Similarly, `WPDS_2020_data.csv` contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the 'geography' field (e.g. `AFRICA`, `OCEANIA`). These rows won't match the country values in `page_data.csv`, but you will want to retain them (either in the original file, or a separate file) so that you can report coverage and quality by region in the analysis section.
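
One way to retain the regional rows is to tag every country with the region it belongs to before separating the two groups. The sketch below is only an illustration: it uses a five-row toy frame copied from the head of the population table, the names `pops_demo`, `regions_demo`, and `countries_demo` are hypothetical, and it assumes that each ALL CAPS region row appears above its member countries in the file.

```python
import pandas as pd

# Toy rows copied from the head of the population table; the real file is longer.
pops_demo = pd.DataFrame({
    "Name": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Type": ["World", "Sub-Region", "Sub-Region", "Country", "Country"],
    "Data": [7772.85, 1337.918, 244.344, 44.357, 100.803],
})

# Region rows are the ones whose name is written in ALL CAPS.
is_region = pops_demo["Name"].str.isupper()

# Assumption: a region row precedes its member countries, so the most recent
# ALL CAPS name above a country row is treated as that country's region.
pops_demo["region"] = pops_demo["Name"].where(is_region).ffill()

# Keep the regional totals in one frame and the tagged countries in another.
regions_demo = pops_demo.loc[is_region].drop(columns="region")
countries_demo = pops_demo.loc[~is_region].reset_index(drop=True)
```

With this layout, `countries_demo` can later be grouped by `region` for the regional tables, while `regions_demo` keeps the regional population totals. If the ordering assumption does not hold for the real file, the mapping would have to come from another source.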

In [70]:
# how many template pages? (names that start with "Template:")

pols.loc[pols['page'].str.startswith('Template:')].shape

(496, 3)

In [71]:
# drop the template pages

pols_keep = pols[~pols['page'].str.startswith('Template:')]

pols_keep.head()

|   | page | country | rev_id |
|---|---|---|---|
| 1 | Bir I of Kanem | Chad | 355319463 |
| 10 | Information Minister of the Palestinian Nation... | Palestinian Territory | 393276188 |
| 12 | Yos Por | Cambodia | 393822005 |
| 23 | Julius Gregr | Czech Republic | 395521877 |
| 24 | Edvard Gregr | Czech Republic | 395526568 |


In [72]:
# population types?

pops.Type.unique()

array(['World', 'Sub-Region', 'Country'], dtype=object)

In [73]:
# keep just countries

pops_country = pops.loc[pops.Type == 'Country'].copy().reset_index(drop=True)

pops_country.head()

|   | FIPS | Name | Type | TimeFrame | Data |
|---|---|---|---|---|---|
| 0 | DZ | Algeria | Country | 2019 | 44.357 |
| 1 | EG | Egypt | Country | 2019 | 100.803 |
| 2 | LY | Libya | Country | 2019 | 6.891 |
| 3 | MA | Morocco | Country | 2019 | 35.952 |
| 4 | SD | Sudan | Country | 2019 | 43.849 |


In [75]:
# write clean files to csv, creating the output directory if needed

os.makedirs('data_clean', exist_ok=True)

pols_keep.to_csv('data_clean/politicians.csv', index=False)

pops_country.to_csv('data_clean/populations.csv', index=False)

## Step 3: Getting Article Quality Predictions

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. The name was originally an acronym for "Objective Revision Evaluation Service", but the service is now simply called "ORES". ORES provides estimates of Wikipedia article quality. The article quality estimates are, from best to worst:  

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article  

These were learned based on articles in Wikipedia that were peer-reviewed using the [Wikipedia content assessment](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment) procedures. These quality classes are a subset of the quality assessment categories developed by Wikipedia editors. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any `rev_id` you send it.  

In order to get article predictions for each article in the Wikipedia dataset, you will first need to read `page_data.csv` into Python (or R), and then read through the dataset line by line, using the value of the `rev_id` column to make an API query.
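
The query step could be sketched as follows. This is a minimal illustration, not the assignment's reference code: it assumes the ORES v3 scores endpoint for English Wikipedia and the `articlequality` model, and the helper names `build_ores_url`, `parse_predictions`, and `fetch_predictions`, as well as the `User-Agent` string, are hypothetical. It uses only the standard library, and nothing is sent over the network until `fetch_predictions` is called.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint and model name for English Wikipedia article quality.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"
MODEL = "articlequality"


def build_ores_url(rev_ids):
    """Return the query URL for one batch of revision IDs."""
    query = urllib.parse.urlencode({
        "models": MODEL,
        # Several revision IDs are sent in one request, separated by "|".
        "revids": "|".join(str(rev_id) for rev_id in rev_ids),
    })
    return f"{ORES_ENDPOINT}?{query}"


def parse_predictions(response):
    """Map each rev_id to its predicted class, or None if no score came back.

    Assumes the response is shaped like
    {"enwiki": {"scores": {"<rev_id>": {"articlequality": {"score": {...}}}}}}.
    """
    predictions = {}
    for rev_id, models in response["enwiki"]["scores"].items():
        score = models[MODEL].get("score")
        predictions[int(rev_id)] = score["prediction"] if score else None
    return predictions


def fetch_predictions(rev_ids, user_agent="A2-bias-notebook (hypothetical)"):
    """Query ORES for one batch of rev_ids and return {rev_id: prediction}."""
    request = urllib.request.Request(
        build_ores_url(rev_ids), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as reply:
        return parse_predictions(json.load(reply))
```

Calling `fetch_predictions` on slices of the `rev_id` column would return a dictionary keyed by `rev_id`. Revisions for which ORES returns no score map to `None`, so they can be logged and set aside rather than dropped silently.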