# Assignment 2, Bias in Data
Author: Alexander Van Roijen

Synopsis: We explore possible bias in English Wikipedia's coverage of political figures across countries and regions, using Wikimedia ORES quality predictions, page data, and population data. I conjecture that the results indicate a bias toward countries with large numbers of English speakers, high literacy rates, good economic standing, and open access to internet technologies. Further discussion appears at the end of this notebook and in the README.

In [105]:
## packages required to run this notebook
import pandas as pd
import numpy as np
from ores import api
import os
import pickle as pk

# First we read in the data and clean it

In [106]:
rawDataPage = pd.read_csv("data/page_data.csv")
rawDataPop = pd.read_csv("data/WPDS_2018_data.csv")

In [107]:
print(rawDataPage.head()) ## what does the page data look like?
print(rawDataPage.shape) ## how many are there?

                                 page   country     rev_id
0  Template:ZambiaProvincialMinisters    Zambia  235107991
1                      Bir I of Kanem      Chad  355319463
2   Template:Zimbabwe-politician-stub  Zimbabwe  391862046
3     Template:Uganda-politician-stub    Uganda  391862070
4    Template:Namibia-politician-stub   Namibia  391862409
(47197, 3)


In [123]:
## cleaning up the population column
toclean = rawDataPop['Population mid-2018 (millions)']
cleanedPops = []
for x in toclean:
    cleanedPops.append(float(x.replace(',',''))) ## removing the commas so it can be cast to float
rawDataPop['Population mid-2018 (millions)'] = cleanedPops
print(rawDataPop.head()) ## what does it look like now?
print(rawDataPop.shape) ## what are the dimensions?

  Geography  Population mid-2018 (millions)
0    AFRICA                          1284.0
1   Algeria                            42.7
2     Egypt                            97.0
3     Libya                             6.5
4   Morocco                            35.2
(207, 2)
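The comma-stripping loop above can also be written as a single vectorized column operation. The following is a minimal sketch on a small made-up sample (the `sample` frame and its values are assumptions for illustration, not rows read from the real CSV):

```python
import pandas as pd

# Hypothetical sample mimicking the raw population column, where
# large values are stored as strings with thousands separators.
col = "Population mid-2018 (millions)"
sample = pd.DataFrame({
    "Geography": ["REGION A", "Country A", "Country B"],
    col: ["1,284", "42.7", "97"],
})

# Strip the commas from every entry at once, then cast the column to float.
sample[col] = sample[col].str.replace(",", "", regex=False).astype(float)
print(sample)
```

Another option is to pass `thousands=","` to `pd.read_csv`, which lets pandas parse such values as numbers while loading the file.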


In [124]:
cleanDataPage = rawDataPage[~rawDataPage['page'].str.contains("Template:")] # keep only politician articles (with their country) by filtering out template pages
print(cleanDataPage.shape) # how many rows are left? we dropped about 500 template pages
print(cleanDataPage.head())

(46701, 3)
                                                 page                country  \
1                                      Bir I of Kanem                   Chad   
10  Information Minister of the Palestinian Nation...  Palestinian Territory   
12                                            Yos Por               Cambodia   
23                                       Julius Gregr         Czech Republic   
24                                       Edvard Gregr         Czech Republic   

       rev_id  
1   355319463  
10  393276188  
12  393822005  
23  395521877  
24  395526568  


In [125]:
## are there any duplicates?
cleanDataPage['rev_id'].shape == pd.unique(cleanDataPage['rev_id']).shape ## nope!

True
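The shape comparison above works, but pandas also offers direct ways to ask the same question. A small sketch on made-up revision IDs (the `rev_ids` values are only illustrative):

```python
import pandas as pd

# Hypothetical revision IDs standing in for cleanDataPage['rev_id'].
rev_ids = pd.Series([355319463, 393276188, 393822005])

# duplicated() flags every repeat of an earlier value; any() tells us if one exists.
has_duplicates = bool(rev_ids.duplicated().any())
print(has_duplicates)

# is_unique answers the same question from the opposite direction.
print(rev_ids.is_unique)
```

If duplicates did exist, `rev_ids[rev_ids.duplicated(keep=False)]` would list all of the offending rows.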

In [126]:
regions = rawDataPop[rawDataPop['Geography'].str.isupper()].rename(columns = {'Population mid-2018 (millions)':'population'}) # all regions and their populations
print(regions)

                           Geography  population
0                             AFRICA      1284.0
56                  NORTHERN AMERICA       365.0
59   LATIN AMERICA AND THE CARIBBEAN       649.0
95                              ASIA      4536.0
144                           EUROPE       746.0
189                          OCEANIA        41.0


## Let's create a mapping between our regions and countries for future reference

In [127]:
toparse = rawDataPop[rawDataPop['Geography'].str.isupper()].index
start=0
end=1
regionToCountryMapping={} ## this will hold that mapping in a dictionary, where the region is the key
regionsArray = [] ## this will be appended to our final csvs so we have the region and the country per page
for r in regions['Geography']:
    if(end < len(regions)):
        regionToCountryMapping[r]=rawDataPop.iloc[toparse[start]+1:toparse[end]]
        regionsArray.append(np.repeat(r,toparse[end]-(toparse[start]+1)))
    else:
        regionToCountryMapping[r] = rawDataPop.iloc[toparse[start]+1:]
        regionsArray.append(np.repeat(r,len(rawDataPop)-(toparse[start]+1)))

    start+=1
    end+=1
regionsListMapping = np.concatenate(regionsArray, axis=0 )
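The loop above relies on the file layout: each upper-case region row is followed by the countries that belong to it. Under that same assumption, the region label can also be attached with a forward fill. This is only a sketch on a toy table (the `pop_df` frame, its names, and its values are made up for illustration):

```python
import pandas as pd

# Toy table in the same layout as the population file:
# an upper-case region row followed by its countries.
pop_df = pd.DataFrame({
    "Geography": ["REGION A", "Country A", "Country B", "REGION B", "Country C", "Country D"],
    "population": [100.0, 40.0, 60.0, 50.0, 20.0, 30.0],
})

is_region = pop_df["Geography"].str.isupper()

# Keep the name only on region rows (NaN elsewhere), then forward-fill it
# so every country row inherits the most recent region above it.
pop_df["region"] = pop_df["Geography"].where(is_region).ffill()

# Drop the region rows themselves to keep only countries.
countries = pop_df[~is_region].reset_index(drop=True)
print(countries)
```

This avoids tracking start and end indices by hand, at the cost of not building the per-region dictionary that the loop produces.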

In [128]:
nonregions = rawDataPop[~rawDataPop['Geography'].str.isupper()].copy() # all non-regions and their populations; .copy() avoids pandas' SettingWithCopyWarning
nonregions['region'] = regionsListMapping # now we add the region along with the country
withRegionLabel = nonregions # renamed to be more descriptive


## We use ORES to get article quality ratings

In [137]:
arr = []
if(os.path.isfile('allqualities.pk')): ## If we already did this, just load in the pickle file!
    with open('allqualities.pk','rb') as f:
        arr = pk.load(f)
else:
    ores_session = api.Session("https://ores.wikimedia.org", "Class project <jmorgan@wikimedia.org>") ## api session for our class, change this appropriately
    results = ores_session.score("enwiki", ["articlequality"], cleanDataPage['rev_id']) ## api call for all scores from all my ids!
    # find information on the license and usage of this in the README
    for r in results:
        arr.append(r)
    with open('allqualities.pk','wb') as f:
        pk.dump(arr,f)
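The cell above follows a "load from disk if present, otherwise compute and save" pattern, which avoids repeating a slow API call. A generic sketch of that pattern is shown below; `load_or_compute` and `fake_expensive_call` are hypothetical helpers written for this illustration and are not part of the `ores` package:

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path`; if missing, call `compute`, save, and return it."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def fake_expensive_call():
    # Stand-in for the real API request; records each time it is invoked.
    calls.append(1)
    return [{"articlequality": {"score": {"prediction": "Stub"}}}]

# Use a temporary directory so the demo does not touch the real cache file.
cache_path = os.path.join(tempfile.mkdtemp(), "demo_cache.pk")
first = load_or_compute(cache_path, fake_expensive_call)
second = load_or_compute(cache_path, fake_expensive_call)
print(len(calls))  # the expensive call only ran on the first request
```

One caveat of this design: the cache is never invalidated, so the pickle file has to be deleted by hand if the list of revision IDs changes.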

In [138]:
with open('allrevs.pk','wb') as f:
    pk.dump(cleanDataPage['rev_id'],f) #so we can find the corresponding rev ids with the table above

## Extracting only the predicted class (the label with the highest probability) from each ORES response

In [139]:
justScores=[]

for r in arr:
    temp = r['articlequality']
    if 'error' in temp: ## ORES could not score this revision
        justScores.append(np.nan) ## np.nan rather than np.NaN, which was removed in NumPy 2.0
    else:
        justScores.append(temp['score']['prediction'])
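To make the parsing step easier to follow, here is a small self-contained sketch. The sample dictionaries are hypothetical and only mirror the keys the loop above reads (`articlequality`, `error`, `score`, `prediction`); real responses carry additional fields:

```python
import math

def extract_prediction(response):
    """Return the predicted quality class, or NaN if ORES reported an error."""
    quality = response["articlequality"]
    if "error" in quality:
        return float("nan")
    return quality["score"]["prediction"]

# Hypothetical responses shaped like the ones handled above.
sample_responses = [
    {"articlequality": {"score": {"prediction": "Stub"}}},
    {"articlequality": {"error": {"message": "made-up error for illustration"}}},
    {"articlequality": {"score": {"prediction": "GA"}}},
]

predictions = [extract_prediction(r) for r in sample_responses]
print(predictions)
```

Keeping a NaN placeholder for failed revisions preserves the one-to-one alignment between responses and rows, which is what lets the scores be attached as a column afterwards.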

## Some articles do not interact well with the API; we save these articles in a separate CSV

In [140]:
## attaching it to our original database
pageWScore = cleanDataPage.copy() # copy first so the new column is not set on a slice of rawDataPage
pageWScore['article_quality'] = justScores
nonScoreArticles = pageWScore[pageWScore.isnull().any(axis=1)] # articles that ORES could not score
pageWScore = pageWScore.dropna()
nonScoreArticles.to_csv('wp_pages-no_match.csv')


## We merge our population and wikipedia page data on country and regions
Countries that do not match up between the two tables are written to a separate CSV in the repository, wp_wpds_countries-no_match.csv

In [141]:
t1 = pageWScore.set_index('country')
t2 = nonregions.set_index('Geography')
res = t1.join(t2,how='left')


## Which countries with articles are missing population data?

In [142]:
hasArticles = pd.unique(res.index)
hasPopulation = pd.unique(t2.index)
unseen = np.setdiff1d(hasArticles,hasPopulation) ## countries that have articles but no population entry (setdiff1d is one-directional: values in the first array that are not in the second)
np.savetxt(fname='wp_wpds_countries-no_match.csv',X=unseen,delimiter=',',fmt='%s')
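Since `np.setdiff1d` only reports one direction, a merge with `indicator=True` is one way to see mismatches on both sides at once. A minimal sketch with made-up country names (the `pages` and `population` frames are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for the page table and the population table.
pages = pd.DataFrame({
    "country": ["Country A", "Country A", "Country C"],
    "page": ["Page 1", "Page 2", "Page 3"],
})
population = pd.DataFrame({
    "Geography": ["Country A", "Country D"],
    "population": [10.0, 5.0],
})

# One row per country that has at least one article.
page_countries = pages[["country"]].drop_duplicates()

# The outer merge keeps unmatched rows from both sides; the _merge column
# records whether a row came from the left table, the right table, or both.
merged = page_countries.merge(
    population, left_on="country", right_on="Geography",
    how="outer", indicator=True,
)

articles_only = merged.loc[merged["_merge"] == "left_only", "country"].tolist()
population_only = merged.loc[merged["_merge"] == "right_only", "Geography"].tolist()
print(articles_only, population_only)
```

Here `articles_only` corresponds to what the cell above saves, while `population_only` would list countries with population data but no politician articles.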

# Now we want to see which articles are of a "good" quality, and label them for future ease of analysis

In [143]:

## Dropping NAs before saving. NOTE: rows with ORES "error" returns were already removed above, so the NAs left after the join are articles whose country has no population match
res = res.dropna()
res = res.rename(columns = {'Population mid-2018 (millions)':'population'})
holdNewCol = []
for r in res['article_quality']:
    if(r == 'FA' or r == 'GA'): # FA = featured article, GA = good article, breakdown of different kinds are in README
        holdNewCol.append(1)
    else:
        holdNewCol.append(0)
res['isGood'] = holdNewCol # this will be used later to get an idea of percentage of good articles


res.drop('isGood',axis=1).to_csv('wp_wpds_politicians_by_country.csv') ## we drop isGood from the saved file, as it is only a helper column for this analysis

In [144]:
res.drop('isGood',axis=1).head()

country,page,rev_id,article_quality,population,region
Afghanistan,Murad Quenili,670462475,Stub,36.5,ASIA
Afghanistan,Badar,671455150,Stub,36.5,ASIA
Afghanistan,Mohammed Qalamuddin,671473289,Stub,36.5,ASIA
Afghanistan,Faizanullah Faizan,703507854,Stub,36.5,ASIA
Afghanistan,Mohammad Fahim Dashty,706112927,Stub,36.5,ASIA


# Analysis
## I will explore the article coverage by country with the following metrics
- articles per million people (articleCoverage): $\frac{\text{total articles}_{\text{country}}}{\text{population}_{\text{country}}}$, with population measured in millions
- proportion of high-quality articles (goodArticleCoverage): $\frac{\text{number of good articles}_{\text{country}}}{\text{total articles}_{\text{country}}}$

Where "number of good articles" is the number of a country's politician articles that ORES predicts to be of class "FA" or "GA". For more on ORES article quality classes, see https://www.mediawiki.org/wiki/ORES
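As a quick worked example of the two metrics, the sketch below plugs in the figures that appear for Tuvalu in the results tables of this notebook (54 articles, 5 of them FA/GA, population 0.01 million). The two helper functions are hypothetical names introduced only for this illustration:

```python
def article_coverage(num_articles, population_millions):
    """Articles per million people."""
    return num_articles / population_millions

def good_article_coverage(num_good_articles, num_articles):
    """Proportion of a country's articles predicted to be FA or GA."""
    return num_good_articles / num_articles

# Figures for Tuvalu as reported in the tables of this notebook.
coverage = article_coverage(54, 0.01)
good_share = good_article_coverage(5, 54)

print(round(coverage, 1))    # articles per million people
print(round(good_share, 6))  # share of articles that are FA or GA
```

Because the population is expressed in millions, a country of ten thousand people with a few dozen articles ends up with a coverage value in the thousands.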

In [145]:
numPages = res.groupby(res.index)['page'].count() ## number of articles per country
pop = res.groupby(res.index)['population'].min() ## population is constant within a country, so min() just picks that value

numPagesDF = pd.DataFrame(numPages).rename(columns={'page':'numArticles'})

numGoodPages = res.groupby(res.index)['isGood'].sum() ## number of FA/GA articles per country

numGoodPagesDF = pd.DataFrame(numGoodPages).rename(columns={'isGood':'numGoodArticles'})

dfCoverage = pd.DataFrame(numPages/pop) ## here we calculate the articles per million people
dfCoverage = dfCoverage.rename(columns={0:'articleCoverage'})
dfGoodCoverage = pd.DataFrame(numGoodPages/numPages) ## here we calculate the proportion of good (FA/GA) articles
dfGoodCoverage = dfGoodCoverage.rename(columns={0:'goodArticleCoverage'})

# below we join coverage, proportion of good articles, population, region, and article counts into one table
analysisDF = dfCoverage.join(dfGoodCoverage)
analysisDF = analysisDF.join(nonregions.set_index('Geography'))
analysisDF = analysisDF.join(numPagesDF)
analysisDF = analysisDF.join(numGoodPagesDF)

analysisDF.head()

country,articleCoverage,goodArticleCoverage,Population mid-2018 (millions),region,numArticles,numGoodArticles
Afghanistan,8.767123,0.0375,36.5,ASIA,320,12
Albania,157.586207,0.006565,2.9,EUROPE,457,3
Algeria,2.716628,0.017241,42.7,AFRICA,116,2
Andorra,425.0,0.0,0.08,EUROPE,34,0
Angola,3.486842,0.0,30.4,AFRICA,106,0


## Top 10 countries by coverage

In [146]:
analysisDF.sort_values(by='articleCoverage',ascending=False).head(10)

country,articleCoverage,goodArticleCoverage,Population mid-2018 (millions),region,numArticles,numGoodArticles
Tuvalu,5400.0,0.092593,0.01,OCEANIA,54,5
Nauru,5200.0,0.0,0.01,OCEANIA,52,0
San Marino,2700.0,0.0,0.03,EUROPE,81,0
Monaco,1000.0,0.0,0.04,EUROPE,40,0
Liechtenstein,700.0,0.0,0.04,EUROPE,28,0
Tonga,630.0,0.0,0.1,OCEANIA,63,0
Marshall Islands,616.666667,0.0,0.06,OCEANIA,37,0
Iceland,502.5,0.00995,0.4,EUROPE,201,2
Andorra,425.0,0.0,0.08,EUROPE,34,0
Grenada,360.0,0.027778,0.1,LATIN AMERICA AND THE CARIBBEAN,36,1


## Observations
We see that the best coverage is found in places with very small populations (on the order of tens of thousands of people) and a surprisingly large number of articles, particularly in Oceania and Europe. Because population is the denominator, even a few dozen articles produce a very high ratio for such countries. I would also propose that some of these articles may be about political figures who act in these countries' interest in other countries, or who come from other countries.

## Bottom 10 countries by coverage

In [147]:
analysisDF.sort_values(by='articleCoverage',ascending=True).head(10)

country,articleCoverage,goodArticleCoverage,Population mid-2018 (millions),region,numArticles,numGoodArticles
India,0.71465,0.017347,1371.3,ASIA,980,17
Indonesia,0.791855,0.047619,265.2,ASIA,210,10
China,0.810733,0.036283,1393.8,ASIA,1130,41
Uzbekistan,0.851064,0.071429,32.9,ASIA,28,2
Ethiopia,0.939535,0.019802,107.5,AFRICA,101,2
"Korea, North",1.40625,0.194444,25.6,ASIA,36,7
Zambia,1.412429,0.0,17.7,AFRICA,25,0
Thailand,1.691843,0.026786,66.2,ASIA,112,3
Mozambique,1.901639,0.0,30.5,AFRICA,58,0
Bangladesh,1.917067,0.009404,166.4,ASIA,319,3


## Observations 
Meanwhile, the worst coverage is found entirely in Asia and Africa. Many of these countries also have far larger populations than the best-performing ones, which pushes the per-capita ratio down. I believe a few other factors could help explain this low coverage. I would guess it is partly because many people in these countries do not have a firm grasp of the English language, and partly a question of who has easy access to internet resources. It is something we take for granted, but I feel some people may be unable to report on local politics partly because of this.

## Top 10 countries by relative quality

In [148]:
analysisDF.sort_values(by='goodArticleCoverage',ascending=False).head(10)

country,articleCoverage,goodArticleCoverage,Population mid-2018 (millions),region,numArticles,numGoodArticles
"Korea, North",1.40625,0.194444,25.6,ASIA,36,7
Saudi Arabia,3.532934,0.127119,33.4,ASIA,118,15
Mauritania,10.666667,0.125,4.5,AFRICA,48,6
Central African Republic,14.042553,0.121212,4.7,AFRICA,66,8
Romania,17.589744,0.113703,19.5,EUROPE,343,39
Tuvalu,5400.0,0.092593,0.01,OCEANIA,54,5
Bhutan,41.25,0.090909,0.8,ASIA,33,3
Dominica,171.428571,0.083333,0.07,LATIN AMERICA AND THE CARIBBEAN,12,1
Syria,6.994536,0.078125,18.3,ASIA,128,10
Benin,7.913043,0.076923,11.5,AFRICA,91,7


## Observations
Unlike the previous results for articles per million people, here the countries are spread much more evenly across regions. I wonder whether, for some of these countries, it is again a matter of access to internet resources. Unlike simple article coverage, this metric measures what share of a country's articles are high quality. If writing about politicians is limited to a select, elite few, a country may perform quite well on this metric.

## Bottom 10 countries by relative quality

In [149]:
analysisDF.sort_values(by='goodArticleCoverage',ascending=True).head(10)

country,articleCoverage,goodArticleCoverage,Population mid-2018 (millions),region,numArticles,numGoodArticles
Slovakia,21.481481,0.0,5.4,EUROPE,116,0
Namibia,64.8,0.0,2.5,AFRICA,162,0
Cape Verde,61.666667,0.0,0.6,AFRICA,37,0
Mozambique,1.901639,0.0,30.5,AFRICA,58,0
Costa Rica,29.4,0.0,5.0,LATIN AMERICA AND THE CARIBBEAN,147,0
Monaco,1000.0,0.0,0.04,EUROPE,40,0
Djibouti,37.0,0.0,1.0,AFRICA,37,0
Moldova,120.857143,0.0,3.5,EUROPE,423,0
Uganda,4.195011,0.0,44.1,AFRICA,185,0
Eritrea,2.666667,0.0,6.0,AFRICA,16,0


## Observations
"Slovak (official) 78.6%, Hungarian 9.4%, Roma 2.3%, Ruthenian 1%" (https://www.cia.gov/library/publications/the-world-factbook/geos/lo.html). This statistic, from the CIA World Factbook, gives the share of Slovakia's population speaking each language. Slovakia is one of apparently many countries with 0 good (GA or FA quality) articles. Again, this highlights a bias toward English-speaking countries, which are much more likely to have high-quality articles written in English. I would imagine that many of the countries on this list, as well as those in the bottom 10 for article coverage, share a similar pattern.

# Now we can analyze by region rather than country
We have seen some interesting results looking at the top and bottom 10 and the regions they belong to, but what about regions overall? There certainly were some outliers in some regions as we have seen, but what if we average it all out?

In [150]:
numPagesPerRegion = res.groupby('region')['page'].count() ## number of articles per region
numGoodPagesRegion = res.groupby('region')['isGood'].sum() ## number of FA/GA articles per region

numPagesRegionDF = pd.DataFrame(numPagesPerRegion).rename(columns={'page':'numArticles'})

combinedPopAndCoverage = numPagesRegionDF.join(regions.set_index('Geography')) ## adds each region's total population
combinedPopAndCoverage['articleCoverage'] = combinedPopAndCoverage['numArticles']/combinedPopAndCoverage['population']
combinedPopAndCoverage['goodArticleCoverage'] = numGoodPagesRegion/combinedPopAndCoverage['numArticles']
combinedPopAndCoverage['goodArticles'] = numGoodPagesRegion

combinedPopAndCoverage.to_csv("regionAnalysis.csv")
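The regional roll-up above can also be expressed with a single named aggregation. The sketch below uses a toy article table and made-up regional populations (`articles` and `region_pop` are assumptions for illustration), and it assumes pandas 0.25 or newer, which introduced this `agg` syntax:

```python
import pandas as pd

# Toy per-article table with the same columns used in the analysis.
articles = pd.DataFrame({
    "page": ["P1", "P2", "P3", "P4"],
    "region": ["REGION A", "REGION A", "REGION B", "REGION B"],
    "isGood": [1, 0, 0, 0],
})
# Made-up regional populations in millions.
region_pop = pd.Series({"REGION A": 100.0, "REGION B": 50.0}, name="population")

# Count articles and sum the good-article flag per region in one pass.
summary = articles.groupby("region").agg(
    numArticles=("page", "count"),
    goodArticles=("isGood", "sum"),
)

# Attach the population and derive both metrics.
summary = summary.join(region_pop)
summary["articleCoverage"] = summary["numArticles"] / summary["population"]
summary["goodArticleCoverage"] = summary["goodArticles"] / summary["numArticles"]
print(summary)
```

Computing both counts in one `agg` call keeps the article totals and good-article totals on the same grouping, which avoids mixing country-level and region-level intermediates.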

## Geographic regions by coverage

In [151]:
combinedPopAndCoverage.sort_values(by='articleCoverage',ascending=False)

region,numArticles,population,articleCoverage,goodArticleCoverage,goodArticles
OCEANIA,3128,41.0,76.292683,0.0211,66
EUROPE,15864,746.0,21.265416,0.020298,322
LATIN AMERICA AND THE CARIBBEAN,5169,649.0,7.964561,0.013349,69
AFRICA,6851,1284.0,5.33567,0.018246,125
NORTHERN AMERICA,1921,365.0,5.263014,0.051536,99
ASIA,11531,4536.0,2.542108,0.026884,310


## Geographic regions by relative quality

In [152]:
combinedPopAndCoverage.sort_values(by='goodArticleCoverage',ascending=False)

region,numArticles,population,articleCoverage,goodArticleCoverage,goodArticles
NORTHERN AMERICA,1921,365.0,5.263014,0.051536,99
ASIA,11531,4536.0,2.542108,0.026884,310
OCEANIA,3128,41.0,76.292683,0.0211,66
EUROPE,15864,746.0,21.265416,0.020298,322
AFRICA,6851,1284.0,5.33567,0.018246,125
LATIN AMERICA AND THE CARIBBEAN,5169,649.0,7.964561,0.013349,69


In [153]:
regionToCountryMapping['NORTHERN AMERICA']

index,Geography,Population mid-2018 (millions)
57,Canada,37.2
58,United States,328.0


## Observations
I am no expert on the population makeup and political systems of many of these regions; however, I think I can comment on the good article coverage shown in the second table. I would conjecture that Northern America has a larger share of good articles among its politician articles than the other regions for two reasons. One, in this dataset Northern America includes only the United States and Canada; two, their predominant language is English and they have well-educated populations with high literacy. This allows many more people from these countries to write detailed articles on politicians, which will be of higher quality on average.

# Summary
- Countries that do poorly in article coverage and in the proportion of high-quality articles are likely countries with low literacy, limited access to computers and internet resources (or perhaps access restricted to a select elite), and populations that mostly speak languages other than English
- There tend to be outliers, consistent with the previous hypotheses. Countries with limited resources likely have few articles, so a handful of articles can make them appear as extreme over-performers or under-performers
- Region-level results show that Northern America dominates in relative quality, most likely due to its predominantly English-speaking population and its particular subset of countries (U.S. and Canada, but not Mexico).

Lastly, I have a few questions regarding the methodology used to score the article quality. There may be inherent bias in how these ratings are determined that could explain the results we have found. As of now, the ORES system is quite a black box.

However, I do think many of the results, at least on article coverage, can likely be explained by examining local language, literacy rates, system of government, and economic status (1st, 2nd, 3rd world). Some countries didn't even make this table! Is there a bias here? (Most definitely, is my guess.)