# A2: Bias in data

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. 
This notebook walks through the data transformations needed to analyze how the coverage of politicians on Wikipedia, and the quality of articles about them, vary between countries. The analysis consists of a series of tables that show:

1. The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. The countries with the highest and lowest proportion of high quality articles about politicians.
3. A ranking of geographic regions by articles-per-person and proportion of high quality articles.
[https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)/Assignments#A2:_Bias_in_data]  
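
The two metrics behind these tables can be sketched on made-up numbers. The column names mirror the ones used later in this notebook; the countries and counts are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical toy data: article counts and population per country.
toy = pd.DataFrame({
    'country': ['A', 'B'],
    'all_articles_count': [50, 200],
    'high_qual_articles_count': [5, 4],
    'population': [10_000, 2_000_000],
})

# Coverage: politician articles per person.
toy['coverage'] = toy['all_articles_count'] / toy['population']
# Quality: share of a country's articles that are rated FA or GA.
toy['quality'] = toy['high_qual_articles_count'] / toy['all_articles_count']

print(toy[['country', 'coverage', 'quality']])
```

A small country with few articles can still rank first on coverage, which is worth keeping in mind when reading the per-country tables.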

In [553]:
# import necessary packages
import pandas as pd
from functools import reduce

import warnings
warnings.filterwarnings('ignore')

### Getting the article and population data

We use two datasets:
  1. The Wikipedia politicians by country dataset - ./data_files/page_data.csv
  2. The population dataset - ./data_files/WPDS_2018_data.csv

### Cleaning the data

In [5]:
# Load necessary data files
page_data_path = './data_files/page_data.csv'
wpds_data_path = './data_files/WPDS_2018_data.csv'
page_data = pd.read_csv(page_data_path)
raw_wpds_data = pd.read_csv(wpds_data_path)

In [6]:
# The dataset contains some page names that start with the string "Template:".
# Since these pages are not Wikipedia articles, they are not included in the analysis.
page_data = page_data[~page_data['page'].str.startswith('Template:')]
page_data.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


In [7]:
raw_wpds_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


In [8]:
# Drop region rows (names in ALL CAPS), keep the countries, and rename the columns
wpds_data = raw_wpds_data[~raw_wpds_data['Geography'].str.isupper()].copy()
wpds_data = wpds_data.rename(columns={'Geography': 'country', 'Population mid-2018 (millions)': 'population'})
# Population is given in millions, possibly with thousands separators; convert to a head count
wpds_data['population'] = wpds_data['population'].apply(lambda x: float(str(x).replace(",", "")) * 1e6)
wpds_data.head()

Unnamed: 0,country,population
1,Algeria,42700000.0
2,Egypt,97000000.0
3,Libya,6500000.0
4,Morocco,35200000.0
5,Sudan,41700000.0


In [9]:
# Merge the population data with page_data (inner join on country)
merged_df = pd.merge(page_data, wpds_data, on='country')
merged_df.head()

Unnamed: 0,page,country,rev_id,population
0,Bir I of Kanem,Chad,355319463,15400000.0
1,Abdullah II of Kanem,Chad,498683267,15400000.0
2,Salmama II of Kanem,Chad,565745353,15400000.0
3,Kuri I of Kanem,Chad,565745365,15400000.0
4,Mohammed I of Kanem,Chad,565745375,15400000.0


In [10]:
len(set(page_data.country.unique()) & set(wpds_data['country'].unique()))

180
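
The count above only says how many country names appear in both datasets. A minimal sketch of how the names that fail to match could be listed, using `pd.merge` with `indicator=True` on hypothetical stand-in frames (the `pages` and `populations` frames and their contents are made up):

```python
import pandas as pd

# Hypothetical stand-ins for page_data and wpds_data.
pages = pd.DataFrame({'page': ['P1', 'P2', 'P3'],
                      'country': ['Chad', 'Atlantis', 'Chad']})
populations = pd.DataFrame({'country': ['Chad', 'Narnia'],
                            'population': [15_400_000.0, 1_000.0]})

# An outer merge with indicator=True adds a `_merge` column that records
# whether each key was found in the left frame, the right frame, or both.
both = pd.merge(pages, populations, on='country', how='outer', indicator=True)

matched = both[both['_merge'] == 'both']
unmatched = both[both['_merge'] != 'both']

print(sorted(unmatched['country'].unique().tolist()))
```

Here `unmatched` holds the rows whose country exists in only one of the two frames, which is the set an inner join silently drops.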

### Getting article quality predictions

We use the [ORES](https://www.mediawiki.org/wiki/ORES) API to estimate the quality of an article. 

This API returns a probability value for each of these categories:

  1. FA - Featured article
  2. GA - Good article
  3. B - B-class article
  4. C - C-class article
  5. Start - Start-class article
  6. Stub - Stub-class article
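
As an illustration of the response shape, the sample below is hand-written to mirror the nesting that the extraction code in this notebook relies on (`enwiki`, then `scores`, then the revision ID, then `wp10`, then `score`). The revision IDs, the `probability` values, and the error payload are assumptions made up for this sketch, not real API output.

```python
# Hand-written sample mimicking the assumed shape of an ORES v3 response.
sample_response = {
    'enwiki': {
        'scores': {
            '111': {
                'wp10': {
                    'score': {
                        'prediction': 'Stub',
                        'probability': {'FA': 0.01, 'GA': 0.02, 'B': 0.03,
                                        'C': 0.04, 'Start': 0.10, 'Stub': 0.80},
                    }
                }
            },
            # Assumed: a revision that could not be scored has no 'score' entry.
            '222': {'wp10': {'error': {'type': 'illustrative error'}}},
        }
    }
}

def predicted_labels(response):
    """Return {rev_id: predicted class} for revisions that have a score."""
    labels = {}
    for rev_id, models in response['enwiki']['scores'].items():
        score = models.get('wp10', {}).get('score')
        if score is not None:
            labels[rev_id] = score['prediction']
    return labels

labels = predicted_labels(sample_response)
print(labels)
```

Only the `prediction` field is needed for this analysis; the per-class probabilities are shown to make the structure concrete.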

The following function is derived from the [sample notebook](https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb).

In [11]:
import requests
import json

headers = {'User-Agent' : 'https://github.com/deepthimhegde', 'From' : 'dhegde@uw.edu'}

def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the headers so that the request identifies its sender
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response


In [12]:

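def extract_class_labels(response):
    # Record the predicted class of each scored revision in the global rev_scores dict
    for rev_id, val in response.items():
        try:
            rev_scores[rev_id] = val["wp10"]["score"]["prediction"]
        except KeyError:
            # Revisions that ORES could not score have no "score" entry; skip them
            pass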
        

In [13]:
page_data_rev_ids = page_data["rev_id"].tolist()
print(len(page_data_rev_ids))

rev_scores = {}
step = 50
for i in range(0, len(page_data_rev_ids), step):
    response = get_ores_data(page_data_rev_ids[i: i+step], headers)
    extract_class_labels(response["enwiki"]["scores"])


46701
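
The loop above sends the revision IDs in batches of 50. A minimal sketch of that chunking on its own, using a hypothetical helper named `chunked` and made-up IDs:

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

rev_ids = list(range(1, 121))          # 120 made-up revision IDs
batches = list(chunked(rev_ids, 50))

# Every ID lands in exactly one batch; only the last batch may be short.
print([len(batch) for batch in batches])
```

Slicing past the end of a list is safe in Python, so the final, shorter batch needs no special handling.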


In [14]:

article_quality_df = pd.DataFrame(list(rev_scores.items()), columns=['rev_id', 'article_quality'])
article_quality_df.head()


Unnamed: 0,rev_id,article_quality
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub


In [15]:
article_quality_df.to_csv("./data_files/article_quality.csv")

### Combining the datasets

In [16]:
article_quality_df['rev_id'] = article_quality_df['rev_id'].astype('int')

In [17]:
merged_quality_df = pd.merge(page_data, article_quality_df, on='rev_id', how='left')
print(len(merged_quality_df))
(merged_quality_df).head()

46701


Unnamed: 0,page,country,rev_id,article_quality
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub


In [18]:
wpds_data_no_match = merged_quality_df[merged_quality_df['article_quality'].isnull()]
len(wpds_data_no_match)
wpds_data_no_match.head()

Unnamed: 0,page,country,rev_id,article_quality
14,List of politicians in Poland,Poland,516633096,
21,Tingtingru,Vanuatu,550682925,
51,Daud Arsala,Afghanistan,627547024,
204,Bharat Saud,Nepal,671484594,
301,Robert Sych,Poland,684023803,


In [19]:
wpds_output_file_path = './data_files/wp_wpds_countries-no_match.csv'
wpds_data_no_match.to_csv(wpds_output_file_path)

In [20]:
merged_quality_df = merged_quality_df[~merged_quality_df['article_quality'].isnull()]
len(merged_quality_df)

46546

In [21]:
merged_quality_df.to_csv('./data_files/wp_wpds_politicians_by_country.csv')

### Computing article coverage and quality for each country

In [572]:
all_articles_count = merged_quality_df.groupby('country').size()
all_articles_count = all_articles_count.reset_index()
all_articles_count.columns = ['country', 'all_articles_count']
print(len(all_articles_count))
all_articles_count.head()

180


Unnamed: 0,country,all_articles_count
0,Afghanistan,320
1,Albania,457
2,Algeria,116
3,Andorra,34
4,Angola,106


In [573]:
# High-quality articles are those predicted to be FA (featured) or GA (good)
high_quality_mask = merged_quality_df['article_quality'].isin(['FA', 'GA'])
high_qual_articles_count = merged_quality_df[high_quality_mask].groupby('country').size()
high_qual_articles_count = high_qual_articles_count.reset_index()
high_qual_articles_count.columns = ['country', 'high_qual_articles_count']
print(len(high_qual_articles_count))
high_qual_articles_count.head()

142


Unnamed: 0,country,high_qual_articles_count
0,Afghanistan,12
1,Albania,3
2,Algeria,2
3,Argentina,12
4,Armenia,5


In [574]:

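article_counts_df = pd.merge(all_articles_count, high_qual_articles_count, on='country', how='left')
coverage_df = pd.merge(article_counts_df, wpds_data, on='country')
# Countries with no FA/GA articles are missing from the high-quality counts; treat them as 0
coverage_df = coverage_df.fillna(0)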
coverage_df.head()

Unnamed: 0,country,all_articles_count,high_qual_articles_count,population
0,Afghanistan,320,12.0,36500000.0
1,Albania,457,3.0,2900000.0
2,Algeria,116,2.0,42700000.0
3,Andorra,34,0.0,80000.0
4,Angola,106,0.0,30400000.0


In [575]:
coverage_df['coverage'] = coverage_df['all_articles_count']/coverage_df['population']
coverage_df['quality'] = coverage_df['high_qual_articles_count']/coverage_df['all_articles_count']

In [22]:
coverage_df.to_csv('./data_files/coverage.csv')

In [590]:
country_region_mapper = {}

region = ''
for country in raw_wpds_data['Geography']:
    if country.isupper():
        region = country
        continue
    country_region_mapper[country] = region
    
country_region_df = pd.DataFrame(list(country_region_mapper.items()), columns=['country', 'region'])
region_df = pd.merge(coverage_df, country_region_df, on='country')
# Sum only the count columns per region; the ratios are recomputed from the regional totals
count_columns = ['all_articles_count', 'high_qual_articles_count', 'population']
region_df = region_df.groupby('region')[count_columns].sum()
region_df['coverage'] = region_df['all_articles_count']/region_df['population']
region_df['quality'] = region_df['high_qual_articles_count']/region_df['all_articles_count']
region_df

region,all_articles_count,high_qual_articles_count,population,coverage,quality
AFRICA,6851,125.0,1172400000.0,6e-06,0.018246
ASIA,11531,310.0,4513100000.0,3e-06,0.026884
EUROPE,15864,322.0,734590000.0,2.2e-05,0.020298
LATIN AMERICA AND THE CARIBBEAN,5169,69.0,628270000.0,8e-06,0.013349
NORTHERN AMERICA,1921,99.0,365200000.0,5e-06,0.051536
OCEANIA,3128,66.0,39780000.0,7.9e-05,0.0211
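
The regional figures are computed from summed counts and populations rather than by averaging the per-country ratios. A sketch with made-up numbers (region `X` and its two countries are hypothetical) shows how far apart the two approaches can be:

```python
import pandas as pd

toy = pd.DataFrame({
    'region': ['X', 'X'],
    'country': ['Small', 'Large'],
    'all_articles_count': [50, 100],
    'population': [10_000.0, 100_000_000.0],
})

# Ratio of sums: total articles divided by total population in the region.
totals = toy.groupby('region')[['all_articles_count', 'population']].sum()
ratio_of_sums = totals['all_articles_count'] / totals['population']

# Mean of ratios: average of per-country coverage, dominated by the small country.
toy['coverage'] = toy['all_articles_count'] / toy['population']
mean_of_ratios = toy.groupby('region')['coverage'].mean()

print(ratio_of_sums['X'], mean_of_ratios['X'])
```

The ratio of sums weights every person equally, while the mean of ratios weights every country equally, so a single micro-state can inflate the latter by orders of magnitude.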


### Analysis

#### - Analysis of countries with the greatest and least coverage of politicians on Wikipedia compared to their population.

In [578]:
# 1. Ten countries with the highest number of politician articles per person
coverage_df.sort_values(by=['coverage'], ascending=False).head(10).reset_index(drop=True)

Unnamed: 0,country,all_articles_count,high_qual_articles_count,population,coverage,quality
0,Tuvalu,54,5.0,10000.0,0.0054,0.092593
1,Nauru,52,0.0,10000.0,0.0052,0.0
2,San Marino,81,0.0,30000.0,0.0027,0.0
3,Monaco,40,0.0,40000.0,0.001,0.0
4,Liechtenstein,28,0.0,40000.0,0.0007,0.0
5,Tonga,63,0.0,100000.0,0.00063,0.0
6,Marshall Islands,37,0.0,60000.0,0.000617,0.0
7,Iceland,201,2.0,400000.0,0.000503,0.00995
8,Andorra,34,0.0,80000.0,0.000425,0.0
9,Grenada,36,1.0,100000.0,0.00036,0.027778


In [579]:
# 2. Ten countries with the lowest number of politician articles per person
coverage_df.sort_values(by=['coverage']).head(10).reset_index(drop=True)

Unnamed: 0,country,all_articles_count,high_qual_articles_count,population,coverage,quality
0,India,980,17.0,1371300000.0,7.146503e-07,0.017347
1,Indonesia,210,10.0,265200000.0,7.918552e-07,0.047619
2,China,1130,41.0,1393800000.0,8.107332e-07,0.036283
3,Uzbekistan,28,2.0,32900000.0,8.510638e-07,0.071429
4,Ethiopia,101,2.0,107500000.0,9.395349e-07,0.019802
5,"Korea, North",36,7.0,25600000.0,1.40625e-06,0.194444
6,Zambia,25,0.0,17700000.0,1.412429e-06,0.0
7,Thailand,112,3.0,66200000.0,1.691843e-06,0.026786
8,Mozambique,58,0.0,30500000.0,1.901639e-06,0.0
9,Bangladesh,319,3.0,166400000.0,1.917067e-06,0.009404


#### - Analysis of countries with the highest and lowest proportion of high quality articles about politicians.

In [580]:
# 3. Ten countries with the highest proportion of FA/GA politician articles
coverage_df.sort_values(by=['quality'], ascending=False).head(10).reset_index(drop=True)

Unnamed: 0,country,all_articles_count,high_qual_articles_count,population,coverage,quality
0,"Korea, North",36,7.0,25600000.0,1e-06,0.194444
1,Saudi Arabia,118,15.0,33400000.0,4e-06,0.127119
2,Mauritania,48,6.0,4500000.0,1.1e-05,0.125
3,Central African Republic,66,8.0,4700000.0,1.4e-05,0.121212
4,Romania,343,39.0,19500000.0,1.8e-05,0.113703
5,Tuvalu,54,5.0,10000.0,0.0054,0.092593
6,Bhutan,33,3.0,800000.0,4.1e-05,0.090909
7,Dominica,12,1.0,70000.0,0.000171,0.083333
8,Syria,128,10.0,18300000.0,7e-06,0.078125
9,Benin,91,7.0,11500000.0,8e-06,0.076923


In [581]:
# 4. Ten countries with the lowest proportion of FA/GA politician articles
coverage_df.sort_values(by=['quality']).head(10).reset_index(drop=True)

Unnamed: 0,country,all_articles_count,high_qual_articles_count,population,coverage,quality
0,Slovakia,116,0.0,5400000.0,2.1e-05,0.0
1,Namibia,162,0.0,2500000.0,6.5e-05,0.0
2,Cape Verde,37,0.0,600000.0,6.2e-05,0.0
3,Mozambique,58,0.0,30500000.0,2e-06,0.0
4,Costa Rica,147,0.0,5000000.0,2.9e-05,0.0
5,Monaco,40,0.0,40000.0,0.001,0.0
6,Djibouti,37,0.0,1000000.0,3.7e-05,0.0
7,Moldova,423,0.0,3500000.0,0.000121,0.0
8,Uganda,185,0.0,44100000.0,4e-06,0.0
9,Eritrea,16,0.0,6000000.0,3e-06,0.0



#### - Ranking of geographic regions (in descending order) by politician articles per person and by the proportion of politician articles that are of GA or FA quality


In [595]:
# 5. Regions ranked by politician articles per person
region_df.sort_values(by=['coverage'], ascending=False).reset_index()

Unnamed: 0,region,all_articles_count,high_qual_articles_count,population,coverage,quality
0,OCEANIA,3128,66.0,39780000.0,7.9e-05,0.0211
1,EUROPE,15864,322.0,734590000.0,2.2e-05,0.020298
2,LATIN AMERICA AND THE CARIBBEAN,5169,69.0,628270000.0,8e-06,0.013349
3,AFRICA,6851,125.0,1172400000.0,6e-06,0.018246
4,NORTHERN AMERICA,1921,99.0,365200000.0,5e-06,0.051536
5,ASIA,11531,310.0,4513100000.0,3e-06,0.026884


In [596]:
# 6. Regions ranked by proportion of FA/GA politician articles
region_df.sort_values(by=['quality'], ascending=False).reset_index()

Unnamed: 0,region,all_articles_count,high_qual_articles_count,population,coverage,quality
0,NORTHERN AMERICA,1921,99.0,365200000.0,5e-06,0.051536
1,ASIA,11531,310.0,4513100000.0,3e-06,0.026884
2,OCEANIA,3128,66.0,39780000.0,7.9e-05,0.0211
3,EUROPE,15864,322.0,734590000.0,2.2e-05,0.020298
4,AFRICA,6851,125.0,1172400000.0,6e-06,0.018246
5,LATIN AMERICA AND THE CARIBBEAN,5169,69.0,628270000.0,8e-06,0.013349
