Anushna Prakash  
DATA 512 - Human-Centered Data Science  
October 14, 2021  
# A2 - Bias in Data
The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.
You are expected to perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies between countries. Your analysis will consist of a series of tables that show:  
- the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.  
- the countries with the highest and lowest proportion of high-quality articles about politicians.  
- a ranking of geographic regions by articles-per-person and proportion of high-quality articles.  
You are also expected to write a short reflection on the project that focuses on how both your findings from this analysis and the process you went through to reach those findings help you understand the causes and consequences of biased data in large, complex data science projects.

## Step 0: Set Up Notebook

In [3]:
# Optional: Make notebook width wider
from IPython.display import display, HTML
display(HTML("<style>.container { width:55% !important; }</style>"))

In [4]:
# Import libraries
import pandas as pd
import numpy as np

## Step 1: Import data

In [30]:
# See README.md for where data was downloaded from originally.
# Import from data_raw folder assuming we are running from the src folder.
page_data = pd.read_csv('../data_raw/country/data/page_data.csv')
population = pd.read_csv('../data_raw/WPDS_2020_data.csv')

In [11]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [12]:
page_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47197 entries, 0 to 47196
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   page     47197 non-null  object
 1   country  47197 non-null  object
 2   rev_id   47197 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.1+ MB


In [13]:
page_data.isnull().sum()

page       0
country    0
rev_id     0
dtype: int64

In [14]:
population.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


In [15]:
population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   FIPS        233 non-null    object 
 1   Name        234 non-null    object 
 2   Type        234 non-null    object 
 3   TimeFrame   234 non-null    int64  
 4   Data (M)    234 non-null    float64
 5   Population  234 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 11.1+ KB


In [18]:
population.isnull().sum()

FIPS          1
Name          0
Type          0
TimeFrame     0
Data (M)      0
Population    0
dtype: int64

In [17]:
population.loc[population['FIPS'].isnull(),]

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
62,,Namibia,Country,2019,2.541,2541000
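
The missing `FIPS` value for Namibia is probably not missing in the file itself. Namibia's two-letter code is `NA`, and `pd.read_csv` treats the string `NA` as a missing-value marker by default. The sketch below uses a small inline CSV (a hypothetical stand-in for `WPDS_2020_data.csv`, not the real file) to show how `keep_default_na=False` would preserve the code as text:

```python
import io
import pandas as pd

# Hypothetical two-row excerpt standing in for WPDS_2020_data.csv.
sample_csv = io.StringIO(
    "FIPS,Name,Type,TimeFrame,Data (M),Population\n"
    "DZ,Algeria,Country,2019,44.357,44357000\n"
    "NA,Namibia,Country,2019,2.541,2541000\n"
)

# Default parsing converts the literal string "NA" to NaN.
default_parse = pd.read_csv(sample_csv)

# Rewind the buffer and parse again, keeping "NA" as an ordinary string.
sample_csv.seek(0)
literal_parse = pd.read_csv(sample_csv, keep_default_na=False)

print(default_parse['FIPS'].isnull().sum())
print(literal_parse.loc[literal_parse['Name'] == 'Namibia', 'FIPS'].iloc[0])
```

Since the analysis joins on country name rather than `FIPS`, the null is harmless here, but it is worth knowing where it comes from.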


## Step 2: Cleaning the Data  
Both `page_data.csv` and `WPDS_2020_data.csv` contain some rows that you will need to filter out and/or ignore when you combine the datasets in the next step. In the case of `page_data.csv`, the dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles, and should not be included in your analysis.  
Similarly, `WPDS_2020_data.csv` contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the `Name` field (e.g. AFRICA, OCEANIA). These rows won't match the country values in `page_data.csv`, but you will want to retain them (either in the original file, or a separate file) so that you can report coverage and quality by region in the analysis section.
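
One way to filter out the regional rows while still retaining them is to split the table into two DataFrames. This is a minimal sketch on a hypothetical four-row excerpt; the names `regions` and `countries` are illustrative and are not used elsewhere in this notebook:

```python
import pandas as pd

# Hypothetical excerpt with the same column names as WPDS_2020_data.csv.
sample = pd.DataFrame({
    'Name': ['AFRICA', 'NORTHERN AFRICA', 'Algeria', 'Egypt'],
    'Type': ['Sub-Region', 'Sub-Region', 'Country', 'Country'],
    'Population': [1337918000, 244344000, 44357000, 100803000],
})

# str.isupper() is True when every cased character in the name is upper case.
is_region = sample['Name'].str.isupper()

# Keep both halves so regional totals remain available for the analysis.
regions = sample.loc[is_region].copy()
countries = sample.loc[~is_region].copy()

print(regions['Name'].tolist())
print(countries['Name'].tolist())
```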

In [24]:
# Remove page names that begin with 'Template:' since these are not Wikipedia articles
page_data = page_data.loc[~page_data['page'].str.startswith('Template:')]

**Can I just filter on `Type == 'Country'`? Why do we have to check whether the name is all caps? The all-caps check ends up including the Channel Islands, which are typed as a Sub-Region.**

In [29]:
# Remove rows that have regional population counts and not country-level counts. Ex: AFRICA, OCEANIA
# population = population.loc[population['Type'] == 'Country', ]
# population.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 209 entries, 3 to 233
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   FIPS        208 non-null    object 
 1   Name        209 non-null    object 
 2   Type        209 non-null    object 
 3   TimeFrame   209 non-null    int64  
 4   Data (M)    209 non-null    float64
 5   Population  209 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 11.4+ KB


In [31]:
population = population.loc[~population['Name'].str.isupper(), ]
population.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 3 to 233
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   FIPS        209 non-null    object 
 1   Name        210 non-null    object 
 2   Type        210 non-null    object 
 3   TimeFrame   210 non-null    int64  
 4   Data (M)    210 non-null    float64
 5   Population  210 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 11.5+ KB


In [34]:
population.loc[population['Type'] != 'Country',]

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
168,Channel Islands,Channel Islands,Sub-Region,2019,0.172,172000
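
The one-row difference between the two filters (209 rows versus 210) comes from this row: its `Type` is `Sub-Region`, but its `Name` is not in all caps. A small sketch on hypothetical rows (only the Channel Islands entry mirrors the real data) shows how the two filters can be compared directly:

```python
import pandas as pd

# Hypothetical excerpt; the Channel Islands row mirrors the output above.
sample = pd.DataFrame({
    'Name': ['EUROPE', 'Channel Islands', 'Denmark'],
    'Type': ['Sub-Region', 'Sub-Region', 'Country'],
})

# Names kept by each candidate filter.
by_type = set(sample.loc[sample['Type'] == 'Country', 'Name'])
by_case = set(sample.loc[~sample['Name'].str.isupper(), 'Name'])

# Rows the all-caps check keeps but the Type filter drops.
print(by_case - by_type)
```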


## Step 3: Getting Article Quality Predictions

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. The name was originally an acronym for "Objective Revision Evaluation Service", but the service is now simply called "ORES". ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:  
- FA - Featured article  
- GA - Good article  
- B - B-class article  
- C - C-class article  
- Start - Start-class article  
- Stub - Stub-class article  

These were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. These quality classes are a subset of the quality assessment categories developed by Wikipedia editors. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any rev_id you send it.  
In order to get article predictions for each article in the Wikipedia dataset, you will first need to read page_data.csv into Python (or R), and then read through the dataset line by line, using the value of the rev_id column to make an API query.
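
The query step could be sketched as follows. This is a sketch under stated assumptions, not the assignment's reference code: it assumes the ORES v3 scores endpoint for English Wikipedia and the `articlequality` model, which may no longer be served in this form, and the helper names (`build_ores_url`, `parse_predictions`, `fetch_predictions`, `batches`) are hypothetical. It sends rev_ids in batches rather than strictly one per request, and records `None` for any revision ORES could not score.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: ORES v3 scores API for English Wikipedia.
ORES_ENDPOINT = 'https://ores.wikimedia.org/v3/scores/enwiki'


def build_ores_url(rev_ids, model='articlequality'):
    """Build a scores URL for a batch of revision IDs (joined with '|')."""
    query = urllib.parse.urlencode({
        'models': model,
        'revids': '|'.join(str(rev_id) for rev_id in rev_ids),
    })
    return f'{ORES_ENDPOINT}?{query}'


def parse_predictions(response, model='articlequality'):
    """Map each rev_id to its predicted class, or None if scoring failed."""
    predictions = {}
    for rev_id, result in response['enwiki']['scores'].items():
        score = result[model].get('score')
        predictions[int(rev_id)] = score['prediction'] if score else None
    return predictions


def fetch_predictions(rev_ids, user_agent):
    """Query ORES for one batch of revision IDs (makes a network request)."""
    request = urllib.request.Request(
        build_ores_url(rev_ids), headers={'User-Agent': user_agent})
    with urllib.request.urlopen(request) as reply:
        return parse_predictions(json.load(reply))


def batches(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hand-written mock response (not real ORES output) to exercise the parser.
mock_response = {
    'enwiki': {
        'scores': {
            '111': {'articlequality': {'score': {'prediction': 'Stub'}}},
            '222': {'articlequality': {'error': {'type': 'RevisionNotFound'}}},
        }
    }
}
print(parse_predictions(mock_response))
```

With the real data, the rev_ids would come from `page_data['rev_id'].tolist()`, and each batch from `batches(...)` would be passed to `fetch_predictions` with a descriptive `user_agent` string. Revisions that map to `None` can be logged separately so that articles without a score are excluded from the quality analysis rather than silently dropped.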