# Considering Bias in Data

#### **Homework #2**

This assignment is part of the course DATA 512 - Human Centered Data Science.  

The goal of this assignment is to explore the concept of bias in data using Wikipedia articles.  
We perform an analysis of the number of articles and their qualities across different countries and regions.  
A discussion of the results is presented in the Readme section of this repository.  

This notebook has the following sections:  
1. Getting the population data and list of articles
1. Getting the Article Quality data
1. Combining the two datasets
1. Analysis and Results


## 1. Getting the population data and list of articles

First, we import the libraries needed throughout this notebook.

In [1]:
# 
# These are standard python modules
import urllib, json, time
#
# You will need to install the modules below with pip/pip3 if you do not already have them
# You could also use the requirements.txt file in this repo to install all the dependencies
import pandas as pd
import numpy as np
import requests
from tqdm import tqdm

The **population data** and **list of articles** are present in the files `input/population_by_country_2022.csv` and `input/politicians_by_country_SEPT_2022.csv` respectively.  

The population data was sourced from the [world population data sheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau whereas the list of articles is sourced from the Politicians by nationality category from Wikipedia.

In [5]:
pop_df = pd.read_csv('../input/population_by_country_2022.csv')
politicians_df = pd.read_csv('../input/politicians_by_country_SEPT_2022.csv')

pop_df.head(10)

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5
5,Libya,6.8
6,Morocco,36.7
7,Sudan,46.9
8,Tunisia,11.8
9,Western Sahara,0.6


The population data in this file is stored in a hierarchical format: each region name is written entirely in uppercase letters, and the countries belonging to that region appear on the lines that follow in mixed case.  

We use this convention to convert the data into a tabular format, adding a column that records the closest region listed above each country.  

This is done by the following lines of code.

In [8]:
pop_df['is_region'] = pop_df['Geography'].apply(str.isupper)
pop_df.loc[pop_df['is_region'], 'region'] = pop_df['Geography']
pop_df['region'] = pop_df['region'].ffill()

population_df = pop_df[~pop_df['is_region']].copy()
population_df = population_df.drop('is_region', axis=1)
population_df

Unnamed: 0,Geography,Population (millions),region
3,Algeria,44.9,NORTHERN AFRICA
4,Egypt,103.5,NORTHERN AFRICA
5,Libya,6.8,NORTHERN AFRICA
6,Morocco,36.7,NORTHERN AFRICA
7,Sudan,46.9,NORTHERN AFRICA
...,...,...,...
228,Samoa,0.2,OCEANIA
229,Solomon Islands,0.7,OCEANIA
230,Tonga,0.1,OCEANIA
231,Tuvalu,0.0,OCEANIA
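
The region-propagation step can be checked on a small in-memory frame. The sketch below is only an illustration: the `toy` frame and its column of values are a hand-built stand-in for the real sheet, and the OCEANIA population figure is a placeholder rather than a value from the file.

```python
import pandas as pd

# Toy stand-in for the population sheet: region rows are uppercase,
# country rows follow the region they belong to.
# The OCEANIA population (45.0) is a placeholder, not a value from the file.
toy = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
                  "OCEANIA", "Samoa"],
    "Population (millions)": [7963.0, 1419.0, 251.0, 44.9, 103.5, 45.0, 0.2],
})

# Flag region rows, keep their names only on those rows, and forward-fill
# so every country inherits the closest region listed above it.
is_region = toy["Geography"].str.isupper()
toy["region"] = toy["Geography"].where(is_region).ffill()

# Keep only the country rows; .copy() gives an independent frame.
toy_countries = toy.loc[~is_region].copy()
print(toy_countries)
```

Because the forward fill always takes the most recent uppercase row, Algeria and Egypt are assigned NORTHERN AFRICA rather than AFRICA or WORLD.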


## 2. Getting the Article Quality data

Next, we get the article quality prediction from a machine learning system called [ORES](https://www.mediawiki.org/wiki/ORES).  
To get the article quality prediction from ORES, we need to supply the article's revision_id. We get the latest revision_id of each article using [API:Info](https://www.mediawiki.org/wiki/API:Info).  

We first write two classes that will be able to fetch the response from these APIs. These two classes are written below.

In [9]:
class ORES_APIRequest(object):
    def __init__(self) -> None:
        self.endpoint_url = "https://ores.wikimedia.org/v3"
        self.endpoint_params = "/scores/{context}/{revid}/{model}"
        self.request_headers = {
            'User-Agent': 'abhis1@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2022',
        }
        self.params_template = {
            "context": "enwiki",        # which WMF project for the specified revid
            "revid" : "",               # the revision to be scored - this will probably change each call
            "model": "articlequality"   # the AI/ML scoring model to apply to the revision
        }
        self.API_LATENCY_ASSUMED = 0.005
        self.API_THROTTLE_WAIT = (1.0/100.0) - self.API_LATENCY_ASSUMED

    def fetch(self, article_revid = None):
        # Make sure we have a revision id
        if not article_revid: return None

        # set the revision id into a copy of the template so the shared dict is not mutated
        params_template = dict(self.params_template)
        params_template['revid'] = article_revid

        # now, create a request URL by combining the endpoint_url with the parameters for the request
        request_url = self.endpoint_url + self.endpoint_params.format(**params_template)

        # make the request
        try:
            # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
            # occurs during the request processing - throttling is always a good practice with a free
            # data source like Wikipedia - or other community sources
            if self.API_THROTTLE_WAIT > 0.0:
                time.sleep(self.API_THROTTLE_WAIT)
            response = requests.get(request_url, headers=self.request_headers)
            json_response = response.json()
        except Exception as e:
            print(e)
            json_response = None
        return json_response


class WikiMedia_APIRequest(object):
    def __init__(self) -> None:
        self.endpoint_url = "https://en.wikipedia.org/w/api.php"
        self.request_headers = {
            'User-Agent': 'abhis1@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2022',
        }
        self.request_template = {
            "action": "query",
            "format": "json",
            "titles": "",           # to simplify this should be a single page title at a time
            "prop": "info",
            "inprop": ""
        }
        self.API_LATENCY_ASSUMED = 0.005
        self.API_THROTTLE_WAIT = (1.0/100.0) - self.API_LATENCY_ASSUMED

    def fetch(self, article_title):
        if not article_title: return None

        # copy the template so the shared dict is not mutated between calls
        request_template = dict(self.request_template)

        request_template['titles'] = article_title
            
        # make the request
        try:
            # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
            # occurs during the request processing - throttling is always a good practice with a free
            # data source like Wikipedia - or any other community sources
            if self.API_THROTTLE_WAIT > 0.0:
                time.sleep(self.API_THROTTLE_WAIT)
            response = requests.get(self.endpoint_url, headers=self.request_headers, params=request_template)
            json_response = response.json()
        except Exception as e:
            print(e)
            json_response = None
        return json_response
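
To make the URL construction in `ORES_APIRequest.fetch` concrete, here is a minimal sketch that builds the request URL without making any network call. The helper name `build_ores_url` and the module-level constants are hypothetical and exist only for this illustration; the endpoint strings are copied from the class above.

```python
ENDPOINT_URL = "https://ores.wikimedia.org/v3"
ENDPOINT_PARAMS = "/scores/{context}/{revid}/{model}"
PARAMS_TEMPLATE = {"context": "enwiki", "revid": "", "model": "articlequality"}


def build_ores_url(revid):
    """Return the ORES scoring URL for one revision id (hypothetical helper)."""
    # Build a fresh dict so the shared template is never mutated between calls.
    params = dict(PARAMS_TEMPLATE, revid=revid)
    return ENDPOINT_URL + ENDPOINT_PARAMS.format(**params)


url = build_ores_url(1099689043)
print(url)
```

Printing the URL before sending it is a cheap way to confirm that the context, revision id, and model end up in the expected path segments.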

Next we create two instances of the above two classes. We also create two lists that will store the revision_ids and quality predictions of the articles.

In [10]:
wiki_api_fetcher = WikiMedia_APIRequest()
ores_fetcher = ORES_APIRequest()

rev_id_list = []
predictions = []

Next, we loop through each article in the `politicians_df` dataframe, fetch the `revision_id`, then use that `revision_id` to fetch the quality prediction of that article.  

We also handle errors by marking an article as "NOT FOUND" if no revision_id is returned for it, or as "NO PRED" if its revision_id does not return a prediction.  

The revision ids and quality predictions are added to the politicians dataframe as two new columns, and we save this dataframe to the file `../data/politicians_with_revid_quality.csv`.

In [None]:
for article_title in tqdm(list(politicians_df['name'])):
    resp = wiki_api_fetcher.fetch(article_title)
    try:
        last_revid = list(resp['query']['pages'].values())[0]['lastrevid']
    except (TypeError, KeyError, IndexError):
        last_revid = "NOT FOUND"
        prediction = "NO PRED"

    if last_revid == "NOT FOUND": 
        rev_id_list.append(last_revid)
        predictions.append(prediction)
        continue

    ores_score = ores_fetcher.fetch(last_revid)
    try:
        prediction = ores_score["enwiki"]["scores"][str(last_revid)]["articlequality"]["score"]["prediction"]
    except (TypeError, KeyError):
        prediction = "NO PRED"

    rev_id_list.append(last_revid)
    predictions.append(prediction)


politicians_df['revision_id'] = rev_id_list
politicians_df['article_quality'] = predictions

politicians_df.to_csv('../data/politicians_with_revid_quality.csv')
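
The two lookups in the loop above can also be written as small helpers and exercised offline. This is a sketch under assumptions: the function names are hypothetical, and the sample payloads are hand-written to mirror the keys indexed in the loop (`query`, `pages`, `lastrevid`, `scores`, `score`, `prediction`); they are not captured API output, and the shapes of the "missing" and "error" payloads are assumed.

```python
def extract_last_revid(info_response):
    """Return the page's lastrevid, or None if it is unavailable (hypothetical helper)."""
    try:
        pages = info_response["query"]["pages"]
        # The pages dict is keyed by page id; one title is requested per call,
        # so the first (and only) entry is the one we want.
        return next(iter(pages.values()))["lastrevid"]
    except (TypeError, KeyError, AttributeError, StopIteration):
        return None


def extract_prediction(ores_response, revid, context="enwiki", model="articlequality"):
    """Return the predicted quality class, or None if no score came back (hypothetical helper)."""
    try:
        return ores_response[context]["scores"][str(revid)][model]["score"]["prediction"]
    except (TypeError, KeyError):
        return None


# Hand-written sample payloads (assumed shapes, not real API output).
found = {"query": {"pages": {"123": {"title": "Example", "lastrevid": 1099689043}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}
scored = {"enwiki": {"scores": {"1099689043": {"articlequality": {"score": {"prediction": "GA"}}}}}}
errored = {"enwiki": {"scores": {"1099689043": {"articlequality": {"error": {"type": "TimeoutError"}}}}}}

print(extract_last_revid(found), extract_last_revid(missing))
print(extract_prediction(scored, 1099689043), extract_prediction(errored, 1099689043))
```

Returning `None` instead of a sentinel string keeps the revision id column numeric; the notebook's "NOT FOUND" and "NO PRED" markers could be substituted at the call site if preferred.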

Once the data is saved to the `../data/` folder, we can load it directly from there so that we don't need to fetch it again.

In [11]:
politicians_df = pd.read_csv('../data/politicians_with_revid_quality.csv')

Let us look at the entries for which we did not get a `revision_id` or an `article_quality`:

In [17]:
politicians_df[np.logical_or(politicians_df['revision_id']=="NOT FOUND", politicians_df['article_quality']=="NO PRED")]

Unnamed: 0.1,Unnamed: 0,name,url,country,revision_id,article_quality
2446,2446,Prince Ofosu Sefah,https://en.wikipedia.org/wiki/Prince_Ofosu_Sefah,Ghana,NOT FOUND,NO PRED
2985,2985,Harjit Kaur Talwandi,https://en.wikipedia.org/wiki/Harjit_Kaur_Talw...,India,NOT FOUND,NO PRED
3212,3212,Abd al-Razzaq al-Hasani,https://en.wikipedia.org/wiki/'Abd_al-Razzaq_a...,Iraq,NOT FOUND,NO PRED
3784,3784,Kang Sun-nam,https://en.wikipedia.org/wiki/Kang_Sun-nam,"Korea, North",NOT FOUND,NO PRED
4865,4865,Abiodun Abimbola Orekoya,https://en.wikipedia.org/wiki/Abiodun_Abimbola...,Nigeria,NOT FOUND,NO PRED
4879,4879,Segun “Aeroland” Adewale,https://en.wikipedia.org/wiki/Segun_”Aeroland”...,Nigeria,NOT FOUND,NO PRED
5801,5801,Roman Konoplev,https://en.wikipedia.org/wiki/Roman_Konoplev,Russia,NOT FOUND,NO PRED
6344,6344,Nhlanhla “Lux” Dlamini,https://en.wikipedia.org/wiki/Nhlanhla_”Lux”_D...,South Africa,NOT FOUND,NO PRED


We simply filter out these entries for subsequent analysis.

In [18]:
politicians_df = politicians_df[~np.logical_or(politicians_df['revision_id']=="NOT FOUND", politicians_df['article_quality']=="NO PRED")]

In [19]:
politicians_df

Unnamed: 0.1,Unnamed: 0,name,url,country,revision_id,article_quality
0,0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan,1099689043,GA
1,1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan,943562276,Start
2,2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,852404094,Start
3,3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,1095102390,B
4,4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,1104998382,Start
...,...,...,...,...,...,...
7579,7579,Rekayi Tangwena,https://en.wikipedia.org/wiki/Rekayi_Tangwena,Zimbabwe,1073818982,Stub
7580,7580,Josiah Tongogara,https://en.wikipedia.org/wiki/Josiah_Tongogara,Zimbabwe,1106932400,C
7581,7581,Langton Towungana,https://en.wikipedia.org/wiki/Langton_Towungana,Zimbabwe,904246837,Stub
7582,7582,Herbert Ushewokunze,https://en.wikipedia.org/wiki/Herbert_Ushewokunze,Zimbabwe,959111842,Stub


## 3. Combining the two datasets

Next, we combine the populations dataframe with the politicians dataframe.  

We first convert the countries to lowercase and then merge the two tables on `country`. Then, we only keep the required columns.

In [20]:
population_df = population_df.assign(country=population_df['Geography'].str.lower())
politicians_df = politicians_df.assign(country=politicians_df['country'].str.lower())

merged_df = pd.merge(politicians_df, population_df, how='left', on='country')
merged_df = merged_df[['country', 'region', 'Population (millions)', 'name', 'revision_id', 'article_quality']]


Now, we check for countries whose region is null. This happens when a country in the politicians dataframe is not present in the population dataframe.  
We also filter out countries with a population of zero, since they cannot be analyzed at a per capita level.  

These countries are saved to the file `result/wp_countries-no_match.txt`.

In [23]:
null_regions = merged_df[np.logical_or(merged_df['region'].isnull(), merged_df['Population (millions)'] == 0)]
null_countries = null_regions['country'].unique()

np.savetxt('../result/wp_countries-no_match.txt', null_countries, delimiter=',', fmt='%s')
null_countries

array(['korean', 'liechtenstein', 'monaco', 'nauru', 'palau',
       'san marino', 'tuvalu'], dtype=object)
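
An alternative way to spot unmatched rows is the `indicator` option of `pd.merge`, which adds a `_merge` column describing where each row came from. The sketch below is illustrative only: the frames are tiny hand-built stand-ins and the article names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two dataframes; article names are hypothetical.
politicians = pd.DataFrame({
    "name": ["Article A", "Article B", "Article C"],
    "country": ["afghanistan", "monaco", "tuvalu"],
})
population = pd.DataFrame({
    "country": ["afghanistan", "tuvalu", "samoa"],
    "region": ["SOUTH ASIA", "OCEANIA", "OCEANIA"],
    "Population (millions)": [41.1, 0.0, 0.2],
})

# indicator=True adds a `_merge` column: 'both' for matched rows,
# 'left_only' for politicians whose country has no population row.
merged = politicians.merge(population, how="left", on="country", indicator=True)

no_match = merged["_merge"] == "left_only"
zero_pop = merged["Population (millions)"] == 0
unusable = sorted(merged.loc[no_match | zero_pop, "country"].unique())
print(unusable)
```

Checking `_merge` makes the reason for exclusion explicit, whereas a null `region` could in principle also come from missing data in the population file itself.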

We filter out the above countries from the merged dataframe and save it to the file `data/wp_politicians_by_country.csv`.

In [25]:
merged_df = merged_df[~np.logical_or(merged_df['region'].isnull(), merged_df['Population (millions)'] == 0)]

merged_df.to_csv('../data/wp_politicians_by_country.csv', index=False)
merged_df.head()

Unnamed: 0,country,region,Population (millions),name,revision_id,article_quality
0,afghanistan,SOUTH ASIA,41.1,Shahjahan Noori,1099689043,GA
1,afghanistan,SOUTH ASIA,41.1,Abdul Ghafar Lakanwal,943562276,Start
2,afghanistan,SOUTH ASIA,41.1,Majah Ha Adrif,852404094,Start
3,afghanistan,SOUTH ASIA,41.1,Haroon al-Afghani,1095102390,B
4,afghanistan,SOUTH ASIA,41.1,Tayyab Agha,1104998382,Start


## 4. Analysis and Results 

First, we load the saved file from `data/wp_politicians_by_country.csv` and create a new column `is_high_quality` which is set to True whenever the prediction is either 'GA' or 'FA'.  

The article quality estimates are, from best to worst:  
FA - Featured article  
GA - Good article  
B - B-class article  
C - C-class article  
Start - Start-class article  
Stub - Stub-class article  

In [29]:
analysis_df = pd.read_csv('../data/wp_politicians_by_country.csv')

analysis_df['is_high_quality'] = analysis_df['article_quality'].isin(['GA', 'FA'])
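
If the ordering of the quality classes listed above matters for later analysis, one option is to encode it as an ordered categorical. This is a sketch of an alternative, not what the notebook does; `QUALITY_ORDER` and the sample series are introduced here purely for illustration.

```python
import pandas as pd

# Quality classes from worst to best, following the list above.
QUALITY_ORDER = ["Stub", "Start", "C", "B", "GA", "FA"]

quality = pd.Series(["GA", "Start", "FA", "B", "Stub"])
ordered = quality.astype(pd.CategoricalDtype(categories=QUALITY_ORDER, ordered=True))

# With an ordered dtype, "GA or better" becomes a simple comparison.
is_high_quality = ordered >= "GA"
print(is_high_quality.tolist())
```

The threshold can then be moved (for example to "B or better") without rewriting the list of accepted labels.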

Then we calculate the total number of articles per country. We then divide the total number of articles by the population in millions to get articles per million population.

In [30]:
table = pd.pivot_table(
    analysis_df, 
    values=['name', 'Population (millions)'], 
    index=['region', 'country'],
    aggfunc={'name': np.size, 'Population (millions)': np.mean}
)
table['total_articles'] = table['name']

table['articles per million population'] = table['total_articles'] / table['Population (millions)']
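
The per-country calculation can be followed on a handful of rows. The sketch below uses `groupby` with named aggregation instead of `pivot_table`; the article names are hypothetical, and the column name `population_millions` is introduced only for this example.

```python
import pandas as pd

# Toy article-level rows: three articles for one country, one for another.
toy = pd.DataFrame({
    "region": ["SOUTH ASIA"] * 3 + ["OCEANIA"],
    "country": ["afghanistan"] * 3 + ["samoa"],
    "Population (millions)": [41.1, 41.1, 41.1, 0.2],
    "name": ["Article 1", "Article 2", "Article 3", "Article 4"],
})

# Count articles per country and carry the (constant) population along.
per_country = toy.groupby(["region", "country"]).agg(
    total_articles=("name", "count"),
    population_millions=("Population (millions)", "first"),
)
per_country["articles per million population"] = (
    per_country["total_articles"] / per_country["population_millions"]
)
print(per_country)
```

A country with a very small population and a single article (1 / 0.2 = 5 per million here) outranks a much larger country with more articles, which is the pattern visible in the rankings that follow.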

The top 10 countries by articles per million population are:

In [38]:
print(table.droplevel('region')['articles per million population'].nlargest(n=10).to_markdown())

| country                        |   articles per million population |
|:-------------------------------|----------------------------------:|
| antigua and barbuda            |                          170      |
| federated states of micronesia |                          130      |
| andorra                        |                          100      |
| barbados                       |                           93.3333 |
| marshall islands               |                           90      |
| montenegro                     |                           60      |
| seychelles                     |                           60      |
| luxembourg                     |                           52.8571 |
| bhutan                         |                           51.25   |
| grenada                        |                           50      |


The bottom 10 countries by articles per million population are:

In [39]:
print(table.droplevel('region')['articles per million population'].nsmallest(n=10).to_markdown())

| country      |   articles per million population |
|:-------------|----------------------------------:|
| china        |                        0.00139218 |
| mexico       |                        0.00784314 |
| saudi arabia |                        0.0817439  |
| romania      |                        0.105263   |
| india        |                        0.1256     |
| sri lanka    |                        0.133929   |
| egypt        |                        0.135266   |
| ethiopia     |                        0.202593   |
| taiwan       |                        0.215517   |
| vietnam      |                        0.27163    |


Next we do something similar but only for the high quality articles. We filter for only the high quality articles, then we calculate the total number of high quality articles per country.  
We then divide the total number of high quality articles by the population in millions to get high quality articles per million population.

In [40]:
high_quality_df = analysis_df[analysis_df['is_high_quality']==True]

table_high_quality = pd.pivot_table(
    high_quality_df, 
    values=['name', 'Population (millions)'], 
    index=['region', 'country'],
    aggfunc={'name': np.size, 'Population (millions)': np.mean}
)
table_high_quality['total_articles'] = table_high_quality['name']

table_high_quality['articles per million population'] = table_high_quality['total_articles'] / table_high_quality['Population (millions)']

The top 10 countries by high quality articles per million population are:

In [41]:
print(table_high_quality.droplevel('region')['articles per million population'].nlargest(n=10).to_markdown())

| country               |   articles per million population |
|:----------------------|----------------------------------:|
| andorra               |                         20        |
| montenegro            |                          5        |
| albania               |                          2.14286  |
| suriname              |                          1.66667  |
| bosnia-herzegovina    |                          1.47059  |
| lithuania             |                          1.07143  |
| croatia               |                          1.05263  |
| slovenia              |                          0.952381 |
| palestinian territory |                          0.925926 |
| gabon                 |                          0.833333 |


The bottom 10 countries by high quality articles per million population, among countries with at least one high quality article, are:

In [42]:
print(table_high_quality.droplevel('region')['articles per million population'].nsmallest(n=10).to_markdown())

| country   |   articles per million population |
|:----------|----------------------------------:|
| india     |                         0.0042337 |
| thailand  |                         0.0149701 |
| japan     |                         0.0160128 |
| nigeria   |                         0.0183066 |
| vietnam   |                         0.0201207 |
| colombia  |                         0.0203666 |
| uganda    |                         0.0211864 |
| pakistan  |                         0.0212044 |
| sudan     |                         0.021322  |
| iran      |                         0.0225734 |


Next, we compute articles per million population at the region level. We do this by summing the article counts and populations of the per-country table within each region and then dividing the two totals.

In [43]:
region_table = table.groupby(level=['region']).sum()
region_table['articles per million population'] = region_table['total_articles'] / region_table['Population (millions)']
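
The regional figure is a ratio of sums, which is not the same as averaging the per-country ratios. A small sketch with made-up numbers (the region `R1` and both countries are hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical region with one small and one large country.
countries = pd.DataFrame({
    "region": ["R1", "R1"],
    "country": ["small", "large"],
    "total_articles": [10, 10],
    "Population (millions)": [1.0, 99.0],
}).set_index(["region", "country"])

# Ratio of sums: total articles divided by total population, as in the cell above.
by_region = countries.groupby(level="region").sum()
by_region["articles per million population"] = (
    by_region["total_articles"] / by_region["Population (millions)"]
)
ratio_of_sums = by_region.loc["R1", "articles per million population"]

# Mean of ratios: average of the per-country values, dominated by the small country.
per_country = countries["total_articles"] / countries["Population (millions)"]
mean_of_ratios = per_country.groupby(level="region").mean().loc["R1"]

print(ratio_of_sums, mean_of_ratios)
```

Summing first weights each country by its population, so a single small country with a high per capita count does not dominate the regional value.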

The regions sorted by articles per million population are shown below:

In [47]:
print(region_table['articles per million population'].sort_values(ascending=False).to_markdown())

| region          |   articles per million population |
|:----------------|----------------------------------:|
| NORTHERN EUROPE |                          7.75148  |
| OCEANIA         |                          6.15385  |
| SOUTHERN EUROPE |                          5.88469  |
| CARIBBEAN       |                          5.08861  |
| WESTERN EUROPE  |                          3.47384  |
| EASTERN EUROPE  |                          2.55741  |
| WESTERN ASIA    |                          2.33095  |
| SOUTHERN AFRICA |                          1.71806  |
| EASTERN AFRICA  |                          1.3821   |
| CENTRAL ASIA    |                          1.35897  |
| SOUTH AMERICA   |                          1.33072  |
| WESTERN AFRICA  |                          1.31922  |
| CENTRAL AMERICA |                          1.09612  |
| MIDDLE AFRICA   |                          1.03624  |
| NORTHERN AFRICA |                          0.905826 |
| SOUTHEAST ASIA  |                          0.7

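Similarly, we get the high quality articles per million population at the region level by aggregating the high quality table created previously. Note that this table only contains countries with at least one high quality article, so the regional population totals used here exclude countries that have none.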

In [48]:
region_table_high_quality = table_high_quality.groupby(level=['region']).sum()
region_table_high_quality['articles per million population'] = region_table_high_quality['total_articles'] / region_table_high_quality['Population (millions)']

The regions sorted by high quality articles per million population are shown below:

In [49]:
print(region_table_high_quality['articles per million population'].sort_values(ascending=False).to_markdown())

| region          |   articles per million population |
|:----------------|----------------------------------:|
| SOUTHERN EUROPE |                         0.581542  |
| NORTHERN EUROPE |                         0.4       |
| CENTRAL AMERICA |                         0.296736  |
| CARIBBEAN       |                         0.235988  |
| WESTERN ASIA    |                         0.151025  |
| MIDDLE AFRICA   |                         0.139276  |
| EASTERN EUROPE  |                         0.13385   |
| WESTERN EUROPE  |                         0.117521  |
| CENTRAL ASIA    |                         0.115385  |
| OCEANIA         |                         0.107527  |
| SOUTH AMERICA   |                         0.0950988 |
| EAST ASIA       |                         0.0789733 |
| NORTHERN AFRICA |                         0.0684932 |
| SOUTHERN AFRICA |                         0.0660066 |
| EASTERN AFRICA  |                         0.0536097 |
| SOUTHEAST ASIA  |                         0.04