<a href="https://colab.research.google.com/github/aly-such/data-512-a2/blob/main/hcds_a2_bias.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **A2 - Bias in Data Assignment**
**Alyson Suchodolski**


---


The goal of this assignment is to collect, curate, and analyze data on articles of politicians for many countries, then explore the biases that can be present in this analysis.
We will collect article data from Wikipedia as well as population data from the Population Reference Bureau. 

**Import necessary packages**

In [15]:
# install ores locally
# Will need to comment out and rerun this cell again so output isn't messy
# When booting up notebook again, you'll have to uncomment to reinstall ores locally
# If ores is already installed in your system, you won't need to run this
#!pip install ores

# imports
import requests
import json
import pandas as pd
import numpy as np
from ores import api

**Clone the github repository**

In [4]:
!git clone https://github.com/aly-such/data-512-a2.git

Cloning into 'data-512-a2'...
remote: Enumerating objects: 14, done.[K
remote: Counting objects: 100% (14/14), done.[K
remote: Compressing objects: 100% (13/13), done.[K
remote: Total 14 (delta 4), reused 0 (delta 0), pack-reused 0[K
Unpacking objects: 100% (14/14), done.


### **Step 1: Getting the Article and Population Data**
First we must collect the necessary data from our sources below:


*   [Politicians by Country Dataset](https://figshare.com/articles/Untitled_Item/5513449)
*   [World Population Data Sheet](https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit?usp=sharing)



In [5]:
# Politicians by Country dataset - from Figshare
wiki = pd.read_csv('https://raw.githubusercontent.com/aly-such/data-512-a2/main/page_data.csv')

# Population Data - from Population Reference Bureau
pop = pd.read_csv('https://raw.githubusercontent.com/aly-such/data-512-a2/main/WPDS_2020_data.csv')

In [6]:
wiki.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


### **Step 2: Cleaning the Data**

Both data sets have information we do not need.  Within the wiki dataframe, you will see 'Template:..." at the beginning of some titles in the 'page' column. These are not wikipedia articles and can be dropped from the dataframe.

In [7]:
# Template:... are not wiki articles, drop from dataframe
wiki = wiki[~wiki.page.str.contains("Template:")]

In [84]:
wiki.head()

Unnamed: 0,page,country,rev_id,article_quality
1,Bir I of Kanem,Chad,355319463,Stub
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
12,Yos Por,Cambodia,393822005,Stub
23,Julius Gregr,Czech Republic,395521877,Stub
24,Edvard Gregr,Czech Republic,395526568,Stub


For neatness purposes, I have dropped some fields that we will not be using.

In [9]:
pop = pop.drop(columns= ['TimeFrame', 'FIPS', 'Data (M)'])
pop.head()

Unnamed: 0,Name,Type,Population
0,WORLD,World,7772850000
1,AFRICA,Sub-Region,1337918000
2,NORTHERN AFRICA,Sub-Region,244344000
3,Algeria,Country,44357000
4,Egypt,Country,100803000


Now, we will create two separate dataframes for the population data. Because only countries exist within the wiki dataset and have a rev_id, we will want to house the sub-regional data in a separate frame. This sub-regional data can be identified with all caps (ex. SOUTHEAST ASIA), so those must be popped out from the country dataframe that we want.

In [10]:
# Separate The regional populations from the country populations
# First, create a dataframe of the Names that are not in all caps (country-level counts)
# These rows will match country values in paga_data.csv
country_pop = pop[~pop.Name.str.isupper()]

# Second, create a dataframe of the Names that are in all caps (regional-level counts)
# These rows will not have a match in paga_data.csv
region_pop = pop[pop.Name.str.isupper()]

In [11]:
region_pop.head()

Unnamed: 0,Name,Type,Population
0,WORLD,World,7772850000
1,AFRICA,Sub-Region,1337918000
2,NORTHERN AFRICA,Sub-Region,244344000
10,WESTERN AFRICA,Sub-Region,401115000
27,EASTERN AFRICA,Sub-Region,444970000


In [12]:
country_pop.head()

Unnamed: 0,Name,Type,Population
3,Algeria,Country,44357000
4,Egypt,Country,100803000
5,Libya,Country,6891000
6,Morocco,Country,35952000
7,Sudan,Country,43849000


### **Step 3: Getting Article Quality Predictions**
We need to get the predicted quality scores for each article in the Wikipedia dataset. ORES can provide estimates on article quality based off this Wiki information. Article Quality is as follows:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

The following will read the rev_id of each article line by line so that it can predict the quality of the article. Each rev_id will be assigned one of the rankings from above.


First we must connect to the wikimedia API

In [13]:
# Start a ores session using api package
ores_session = api.Session('https://ores.wikimedia.org', 'DATA512 A2 ams884@uw.edu')

Then we can pull the results from the API endpoint

In [14]:
# Pull the results of the session
result = ores_session.score('enwiki', ['articlequality'], wiki['rev_id'])

Last, we can create a new column that will contain all the predictions found from using ORES.

In [16]:
# Create new column that will keep track of predicted quality
# Create an empty list to append to
predictions = []

# Loop through the results of the session to append the scores to our empty list
for prediction in result:
  try:
    predictions.append(prediction['articlequality']['score']['prediction'])
  except:
    predictions.append(-1) # appends -1 where there is no prediction

### **Step 4: Combining the Datasets**
Now that we have our predictions for article quality, we will want to combine this with our wiki data frame that has our article information. On top of this, we want to look at these scores relative to the populations of the country of origin, so we will also combine our country population dataset

We will add a new column to the wiki dataframe that will be our article quality. 

In [17]:
# Create a new dataframe that includes wiki data as well as score data
wiki_scores = wiki
wiki_scores['article_quality'] = predictions

wiki_scores.head()

Unnamed: 0,page,country,rev_id,article_quality
1,Bir I of Kanem,Chad,355319463,Stub
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
12,Yos Por,Cambodia,393822005,Stub
23,Julius Gregr,Czech Republic,395521877,Stub
24,Edvard Gregr,Czech Republic,395526568,Stub


Merge the Wiki and Scoring dataset with the country population data. This will happen as an inner join on the fields 'country' and 'Name', respectively.

In [18]:
# Merge wiki/pred data with country population data
merge_df = wiki_scores.merge(country_pop, left_on='country', right_on='Name', how='inner')

In [19]:
merge_df.head()

Unnamed: 0,page,country,rev_id,article_quality,Name,Type,Population
0,Bir I of Kanem,Chad,355319463,Stub,Chad,Country,16877000
1,Abdullah II of Kanem,Chad,498683267,Stub,Chad,Country,16877000
2,Salmama II of Kanem,Chad,565745353,Stub,Chad,Country,16877000
3,Kuri I of Kanem,Chad,565745365,Stub,Chad,Country,16877000
4,Mohammed I of Kanem,Chad,565745375,Stub,Chad,Country,16877000


Clean up the column names to accurately reflect what we want to show (country, article name, article quality, population, etc.)

In [20]:
# Create single dataframe of wikipedia, prediction, and country population data
# This is to simply rename all the columns to intuitive names

wikipedia_df = pd.DataFrame({
    'country' : merge_df['country'],
    'article_name' : merge_df['page'],
    'revision_id' : merge_df['rev_id'],
    'article_quality_est.' : merge_df['article_quality'],
    'population' : merge_df['Population']
} )

In [21]:
print(wikipedia_df.head())

  country          article_name  revision_id article_quality_est.  population
0    Chad        Bir I of Kanem    355319463                 Stub    16877000
1    Chad  Abdullah II of Kanem    498683267                 Stub    16877000
2    Chad   Salmama II of Kanem    565745353                 Stub    16877000
3    Chad       Kuri I of Kanem    565745365                 Stub    16877000
4    Chad   Mohammed I of Kanem    565745375                 Stub    16877000


Some articles might not produce a prediction score, so we will log these in a separate file.

In [22]:
# Drop the rows that did not produce a prediction score
wikipedia_df_final = wikipedia_df.loc[wikipedia_df['article_quality_est.'] != -1]

wikipedia_df_final.head()

Unnamed: 0,country,article_name,revision_id,article_quality_est.,population
0,Chad,Bir I of Kanem,355319463,Stub,16877000
1,Chad,Abdullah II of Kanem,498683267,Stub,16877000
2,Chad,Salmama II of Kanem,565745353,Stub,16877000
3,Chad,Kuri I of Kanem,565745365,Stub,16877000
4,Chad,Mohammed I of Kanem,565745375,Stub,16877000


In [23]:
# Save the rows that did not produce a prediction score to a separate csv
wikipedia_no_score = wikipedia_df.loc[wikipedia_df['article_quality_est.'] == -1]

wikipedia_no_score.head()

Unnamed: 0,country,article_name,revision_id,article_quality_est.,population
36,Chad,Kalthouma Nguembang,762816132,-1,16877000
534,Canada,Pierre-Luc Paquette,708813010,-1,38190000
568,Canada,James H. Stuart,715457941,-1,38190000
598,Canada,René Matteau,723308478,-1,38190000
602,Canada,David J. Reimer,724052271,-1,38190000


Export the two dataframes we produced to CSVs.

In [47]:
# Turn dataframes to CSV's
wikipedia_df_final.to_csv('wp_wpds_politicians_by_country.csv', sep = ',')
wikipedia_no_score.to_csv('wp_wpds_countries-no_match.csv', sep = ',')

### **Step 5: Analysis**
After curating the data to our liking, we will now analyze what we have so far. There are a few different metrics we would like to look at, such as the **proportion of articles per population** of each country as well as the **proportion of high quality articles per total number of articles** for each country. We will also want to do this at a regional level (for the 'Names' from the population dataset that are **all caps**).

**High quality** articles can be defined as articles that are FA, or Featured Article, and GA, or Good Article.

Before we can look at results, we must capture these metrics from the dataset we have so far.

Below, we will count the number of articles for each country. This is simply done by the group_by() function, which is very similar to the GROUP BY function in SQL. 

In [25]:
# Count of Articles by Country
art_country = wikipedia_df_final.groupby('country').count()['article_name'].astype(int).reset_index()
art_country.rename(columns = {'article_name' : 'article_count'}, inplace = True)
art_country.head()

Unnamed: 0,country,article_count
0,Afghanistan,319
1,Albania,456
2,Algeria,116
3,Andorra,34
4,Angola,106


Merge the dataset we made of count of articles by country with the country population. This will allow us to find the proportion of articles per selected country population. We will need manually formulate the proportion using:

(Number of Articles/Country Population) * 100

In [26]:
# Merge Count of Articles by Country and Country Population
art_prop = art_country.merge(country_pop, left_on = 'country', right_on = 'Name', how = 'inner')

# Create a Proportion Column of Number of Articles Per the Country's Population
art_prop['percentage'] = (art_prop['article_count'] * 100) / art_prop['Population'] # multiply by 100 to get a percentage

art_prop.head()

Unnamed: 0,country,article_count,Name,Type,Population,percentage
0,Afghanistan,319,Afghanistan,Country,38928000,0.000819
1,Albania,456,Albania,Country,2838000,0.016068
2,Algeria,116,Algeria,Country,44357000,0.000262
3,Andorra,34,Andorra,Country,82000,0.041463
4,Angola,106,Angola,Country,32522000,0.000326


Create a dataset that will only contain articles with estimated quality of GA or FA. 

Find the number of high quality articles per country by grouping by country. Again use the group_by() function.

In [27]:
# Group high quality articles together (High Quality = 'FA' and 'GA')
high_quality = pd.concat([wikipedia_df_final.loc[wikipedia_df_final['article_quality_est.']=='FA'], 
                           wikipedia_df_final.loc[wikipedia_df_final['article_quality_est.']=='GA']])

# Obtain count of high quality articles
high_quality_group = high_quality.groupby('country').count()['article_name'].reset_index()

# Create a dataframe of country and high quality article count
high_quality_df = pd.DataFrame({'country':high_quality_group['country'], 'high_quality_article_count':high_quality_group['article_name']})
high_quality_df.head()

Unnamed: 0,country,high_quality_article_count
0,Afghanistan,13
1,Albania,3
2,Algeria,2
3,Argentina,16
4,Armenia,5


Merge the dataset containing the number of high quality articles with the dataset containing the total number of articles.

Find the proportion of high quality articles per total number of articles for each country by using the following formula:

(Number of High Quality Articles/Total Number of Articles) * 100

In [28]:
# Merge high quality df (which has the counts) with the number of articles by country dataframe
high_quality_prop = high_quality_df.merge(art_country, left_on = 'country', right_on = 'country', how = 'inner')

# Find the proportion of high quality articles and number of total articles by country
high_quality_prop['Percentage of Quality Articles'] = (high_quality_df['high_quality_article_count'] * 100) / high_quality_prop['article_count']

high_quality_prop.head()

Unnamed: 0,country,high_quality_article_count,article_count,Percentage of Quality Articles
0,Afghanistan,13,319,4.075235
1,Albania,3,456,0.657895
2,Algeria,2,116,1.724138
3,Argentina,16,491,3.258656
4,Armenia,5,193,2.590674


### **Step 6: Results**
We finally have the metrics we are looking for and now we can explore the results of our analysis.

We have a few different questions to look at involving top 10s and bottom 10s. 

They are as follows:


1.   Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2.   Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
6. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality



To get the answers to these questions, it will be easiest to sort our tables in descending order

In [29]:
# Sort article proportion data by country in descending order
rank_countries = art_prop.sort_values(['percentage'], ascending=[False])

## Sort High Quality article proportion by country in descending order
rank__countries_high_quality = high_quality_prop.sort_values(['Percentage of Quality Articles'], ascending=[False])

1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [30]:
# Top 10 countries with largest proportion of articles to country population
rank_countries.head(10)

Unnamed: 0,country,article_count,Name,Type,Population,percentage
169,Tuvalu,54,Tuvalu,Country,10000,0.54
117,Nauru,52,Nauru,Country,11000,0.472727
138,San Marino,81,San Marino,Country,34000,0.238235
110,Monaco,40,Monaco,Country,38000,0.105263
95,Liechtenstein,28,Liechtenstein,Country,39000,0.071795
104,Marshall Islands,37,Marshall Islands,Country,57000,0.064912
164,Tonga,63,Tonga,Country,99000,0.063636
70,Iceland,201,Iceland,Country,368000,0.05462
3,Andorra,34,Andorra,Country,82000,0.041463
52,Federated States of Micronesia,36,Federated States of Micronesia,Country,106000,0.033962


Tuvalu is the country with the largest proportion of politician articles per citizen. Tuvalu is one of the smallest countries population-wise.

2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [31]:
# Bottom 10 countries with largest proportion of articles to country population
rank_countries.tail(10)

Unnamed: 0,country,article_count,Name,Type,Population,percentage
13,Bangladesh,317,Bangladesh,Country,169809000,0.000187
114,Mozambique,58,Mozambique,Country,31166000,0.000186
162,Thailand,112,Thailand,Country,66534000,0.000168
84,"Korea, North",36,"Korea, North",Country,25779000,0.00014
181,Zambia,25,Zambia,Country,18384000,0.000136
51,Ethiopia,101,Ethiopia,Country,114916000,8.8e-05
176,Uzbekistan,28,Uzbekistan,Country,34174000,8.2e-05
34,China,1129,China,Country,1402385000,8.1e-05
72,Indonesia,209,Indonesia,Country,271739000,7.7e-05
71,India,968,India,Country,1400100000,6.9e-05


India is the country with the smallest proportion of politician articles per citizen. I imagine this is because the population is so large.

3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [32]:
# Top 10 countries with largest proportion of high quality articles to total number of articles
rank__countries_high_quality.head(10)

Unnamed: 0,country,high_quality_article_count,article_count,Percentage of Quality Articles
63,"Korea, North",8,36,22.222222
109,Saudi Arabia,15,117,12.820513
106,Romania,42,343,12.244898
23,Central African Republic,8,66,12.121212
140,Uzbekistan,3,28,10.714286
82,Mauritania,5,48,10.416667
46,Guatemala,7,83,8.433735
33,Dominica,1,12,8.333333
125,Syria,10,128,7.8125
11,Benin,7,91,7.692308


Interestingly, North Korea has the largest percentage of quality politican articles compared to any other country. I remember in class it was stated that quality is dictated based on the country/language it was written in, so based on the English wikipedia articles, North Korea has a high percentage. I imagine that the percentage could be even higher if we were to look at the Korean version of these wikipedia articles. 

4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [33]:
# Bottom 10 countries with largest proportion of high quality articles to total number of articles
rank__countries_high_quality.tail(10)

Unnamed: 0,country,high_quality_article_count,article_count,Percentage of Quality Articles
87,Morocco,1,206,0.485437
73,Lithuania,1,244,0.409836
27,Colombia,1,285,0.350877
104,Portugal,1,318,0.314465
94,Nigeria,2,676,0.295858
101,Peru,1,350,0.285714
89,Nepal,1,356,0.280899
124,Switzerland,1,402,0.248756
128,Tanzania,1,404,0.247525
10,Belgium,1,519,0.192678


Belgium has the lowest percentage of quality articles of any country. Even if you take into consideration the total number of articles Belgium has on politicians, it still only has 1 high quality article. I wonder why that is?