# A2: Bias in Data - Ankit Tandon

In this assignment we look at bias in Wikipedia articles. Specifically, we use the ORES API to estimate the quality of Wikipedia articles about political figures from countries around the world.

The code below is provided under the MIT license from https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb
I have modified the headers object and added a return statement to the function.

In [1]:
import requests
import json

headers = {'User-Agent' : 'https://github.com/ankittandon', 'From' : 'antand@uw.edu'}

def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Send the identifying headers with the request (they were previously unused)
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response
#    print(json.dumps(response, indent=4, sort_keys=True))


# So if we grab some example revision IDs and turn them into a list and then call get_ores_data...
example_ids = [783381498, 807355596, 757539710]
get_ores_data(example_ids, headers)

{'enwiki': {'models': {'wp10': {'version': '0.6.1'}},
  'scores': {'757539710': {'wp10': {'score': {'prediction': 'Start',
      'probability': {'B': 0.05635270475191951,
       'C': 0.17635417131683803,
       'FA': 0.001919869734464717,
       'GA': 0.005517075264277984,
       'Start': 0.732764644204933,
       'Stub': 0.027091534727566813}}}},
   '783381498': {'wp10': {'score': {'prediction': 'Start',
      'probability': {'B': 0.039498449850621085,
       'C': 0.06068466061111685,
       'FA': 0.0029057427468351755,
       'GA': 0.007477221115409147,
       'Start': 0.5674464916024892,
       'Stub': 0.3219874340735285}}}},
   '807355596': {'wp10': {'score': {'prediction': 'Start',
      'probability': {'B': 0.04566408685167919,
       'C': 0.10144128886317841,
       'FA': 0.002651239009002438,
       'GA': 0.006433022662730785,
       'Start': 0.7675063182740381,
       'Stub': 0.07630404433937113}}}}}}}
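
The nested response shown above can be flattened into a simple revision-to-prediction mapping. The following is a minimal sketch: `extract_predictions` is a hypothetical helper name (not part of the ORES API), the sample dictionary is trimmed from the output above, and the unscored entry for revision `12345` is invented, with its payload shape being an assumption.

```python
def extract_predictions(response, project='enwiki', model='wp10'):
    """Return {revision_id: predicted_class} for revisions that have a score."""
    predictions = {}
    for rev_id, models in response[project]['scores'].items():
        score = models.get(model, {}).get('score')
        if score is not None:
            predictions[rev_id] = score['prediction']
    return predictions


sample_response = {
    'enwiki': {
        'scores': {
            '757539710': {'wp10': {'score': {'prediction': 'Start'}}},
            '783381498': {'wp10': {'score': {'prediction': 'Start'}}},
            # Hypothetical unscored revision; the payload shape is an assumption.
            '12345': {'wp10': {'error': {}}},
        }
    }
}

predictions = extract_predictions(sample_response)
print(predictions)
```

Entries without a `'score'` key are skipped rather than raising a `KeyError`.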

The code below is provided under the MIT license from https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb
I modified this code by adding encoding="utf8" so that the file is read with the proper encoding.

In [2]:
## getting the data from the CSV files
import csv
from collections import defaultdict
revisions = defaultdict(lambda: defaultdict(str))
with open('page_data.csv',encoding="utf8") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if row[2] != 'rev_id':
            revisions[row[2]]['article_name'] = row[0]
            revisions[row[2]]['country'] = row[1]
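
An alternative way to skip the header is to consume the first row explicitly instead of comparing every row against `'rev_id'`. This is only a sketch: `load_revisions` is a hypothetical helper, the in-memory sample stands in for `page_data.csv`, and the header names other than `rev_id` are assumptions.

```python
import csv
import io
from collections import defaultdict


def load_revisions(csvfile):
    """Read (article name, country, revision ID) rows keyed by revision ID."""
    revisions = defaultdict(lambda: defaultdict(str))
    reader = csv.reader(csvfile)
    next(reader, None)  # skip the header row, whatever its column names are
    for row in reader:
        article_name, country, rev_id = row[0], row[1], row[2]
        revisions[rev_id]['article_name'] = article_name
        revisions[rev_id]['country'] = country
    return revisions


# Assumed header names; only 'rev_id' is confirmed by the code above.
sample_csv = io.StringIO(
    "page,country,rev_id\n"
    "'Matlelima Hlalele,Lesotho,807355596\n"
)
loaded = load_revisions(sample_csv)
print(dict(loaded['807355596']))
```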

To avoid spamming the ORES API, I'll split the revision IDs into chunks of 100 and send one chunk per request instead of the entire list at once.

In [3]:
chunks = []
chunk = []
for revision in revisions:
    chunk.append(revision)
    if len(chunk) == 100:
        chunks.append(chunk)
        chunk = []
# Keep the final partial chunk so the last few revisions are not dropped
if chunk:
    chunks.append(chunk)
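
Chunking can also be written with list slicing, which naturally includes a final, shorter chunk when the number of IDs is not a multiple of the chunk size. A minimal sketch, where `chunked` is a hypothetical helper name and the integer range stands in for real revision IDs:

```python
def chunked(items, size=100):
    """Split items into consecutive lists of at most `size` elements."""
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]


sample_chunks = chunked(range(250), size=100)
print([len(c) for c in sample_chunks])  # [100, 100, 50]
```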

Now that the revision IDs are 'chunked' up, I can append each API response to a responses list.

In [4]:
responses = []
for chunk in chunks:
    responses.append(get_ores_data(chunk, headers))
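
To be gentler on the API, the loop could pause briefly between requests. The sketch below is an assumption about how that might look, not a documented ORES requirement: `fetch_all`, `fake_fetch`, and the pause length are all hypothetical, and `fake_fetch` stands in for `get_ores_data` so the example runs without network access.

```python
import time


def fetch_all(chunks, fetch, pause_seconds=1.0):
    """Call fetch(chunk) for each chunk, pausing between consecutive calls."""
    responses = []
    for index, chunk in enumerate(chunks):
        if index > 0:
            time.sleep(pause_seconds)
        responses.append(fetch(chunk))
    return responses


def fake_fetch(chunk):
    # Stand-in for a real API call; just reports how many IDs were requested.
    return {'requested': len(chunk)}


demo_responses = fetch_all([[1, 2, 3], [4, 5]], fake_fetch, pause_seconds=0)
print(demo_responses)
```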

In the response object there is a 'prediction' value for each article revision that ORES was able to score. I'm going to store that prediction on the matching entry in the revisions dictionary.

In [5]:
for response in responses:
    scores = response['enwiki']['scores']
    for r in scores:
        # Revisions that ORES could not score have no 'score' key
        if 'score' in scores[r]['wp10']:
            revisions[r]['article_quality'] = scores[r]['wp10']['score']['prediction']

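Printing the number of revisions we have. Note that this counts every revision, including any for which ORES returned no prediction, so on its own it does not show how many were lost to errors.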

In [6]:
print(len(revisions))

47197
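
To see how many revisions actually received a prediction, one option is to count the records with a non-empty 'article_quality' value. This is a minimal sketch: `count_scored` is a hypothetical helper, and the sample records other than revision 807355596 are invented.

```python
def count_scored(revisions):
    """Return (scored, unscored) counts based on a non-empty 'article_quality'."""
    scored = sum(1 for rec in revisions.values() if rec.get('article_quality'))
    return scored, len(revisions) - scored


sample_revisions = {
    '807355596': {'article_quality': 'Start'},
    '1': {'article_quality': ''},  # hypothetical: looked up but never scored
    '2': {},                       # hypothetical: no quality key at all
}
scored, unscored = count_scored(sample_revisions)
print(scored, unscored)
```

Using `.get` avoids inserting empty keys when the records are `defaultdict` objects.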


Now let's join our dataset with the population data.
First we're going to create a dictionary that maps each geography to its population.

In [7]:
population = {}
with open('WPDS_2018_data.csv',encoding="utf8") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if row[0] != 'Geography':
            population[row[0].lower()] = float(row[1].replace(',',''))

Second, let's look up the countries from our revisions dictionary in the population dictionary.

In [8]:
for revision in revisions:
    if revisions[revision]['country'].lower() in population:
        revisions[revision]['population'] = population[revisions[revision]['country'].lower()]

Just a simple check to make sure the lookups worked

In [9]:
print(revisions['807355596'])

defaultdict(<class 'str'>, {'article_name': "'Matlelima Hlalele", 'country': 'Lesotho', 'article_quality': 'Start', 'population': 2.3})
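
Because the lookup silently skips countries that have no match in the population data, it can help to list them. A minimal sketch under stated assumptions: `unmatched_countries` is a hypothetical helper, and 'Atlantis' is an invented country used only to show a failed match.

```python
def unmatched_countries(revisions, population):
    """Return the sorted country names that have no entry in `population`."""
    return sorted({
        rec['country']
        for rec in revisions.values()
        if rec['country'].lower() not in population
    })


sample_revisions = {
    '807355596': {'country': 'Lesotho'},
    '1': {'country': 'Atlantis'},  # hypothetical country with no population row
}
sample_population = {'lesotho': 2.3}
missing = unmatched_countries(sample_revisions, sample_population)
print(missing)
```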


Let's output this data to a CSV file.

In [10]:
with open('revisions_data_file.csv', mode='w', encoding='utf-8',newline='') as datafile:
    datafilewriter = csv.writer(datafile, delimiter=',', quotechar="'")
    datafilewriter.writerow(['country','article_name','revision_id','article_quality','population'])
    for revision in revisions:
        row = revisions[revision]
        datafilewriter.writerow([row['country'],row['article_name'],str(revision),row['article_quality'],row['population']])

Now it's time for analysis. Let's create a few dictionaries for number of articles per country and number of 'good' articles per country.

In [11]:
articles_per_country = defaultdict(int)
good_articles_per_country = defaultdict(int)
for revision in revisions:
    articles_per_country[revisions[revision]['country']] += 1
    if revisions[revision]['article_quality'] in ('FA', 'GA'):
        good_articles_per_country[revisions[revision]['country']] += 1
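
The same tallies can be expressed with `collections.Counter`. This is only a sketch of that alternative: `count_by_country` is a hypothetical helper and the sample records are invented for illustration.

```python
from collections import Counter

HIGH_QUALITY = {'FA', 'GA'}


def count_by_country(records):
    """Return (total, good) Counters of articles per country."""
    total = Counter()
    good = Counter()
    for rec in records:
        total[rec['country']] += 1
        if rec.get('article_quality') in HIGH_QUALITY:
            good[rec['country']] += 1
    return total, good


# Hypothetical records, not taken from the real dataset.
sample_records = [
    {'country': 'Lesotho', 'article_quality': 'Start'},
    {'country': 'Lesotho', 'article_quality': 'GA'},
    {'country': 'Iceland', 'article_quality': 'FA'},
    {'country': 'Iceland'},
]
total, good = count_by_country(sample_records)
print(total, good)
```

A `Counter` returns 0 for countries it has never seen, so `good['Tonga']` is safe to read even when no high-quality article exists.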

Now let's create dictionaries for the number of articles relative to each country's population and the proportion of high-quality articles.

In [12]:
articles_by_population = defaultdict(float)
proportion_of_high_quality = defaultdict(float)
for country in articles_per_country:
    if country.lower() in population:
        articles_by_population[country.lower()] = (articles_per_country[country] / population[country.lower()])
    # Countries with no GA or FA articles are left out of this dictionary
    if country in good_articles_per_country:
        proportion_of_high_quality[country.lower()] = (good_articles_per_country[country] / articles_per_country[country])
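
As a worked example of the two ratios, the sketch below uses invented numbers. If the population column is expressed in millions, as Lesotho's value of 2.3 suggests, the first ratio reads as articles per million people. The function names and the input values are hypothetical.

```python
def articles_per_population(article_count, population):
    """Articles divided by population (in whatever unit the population uses)."""
    return article_count / population


def high_quality_share(good_count, article_count):
    """Fraction of a country's articles predicted as GA or FA."""
    return good_count / article_count if article_count else 0.0


# Hypothetical country: 50 articles, 5 of them GA/FA, population value 2.5
per_population = articles_per_population(50, 2.5)
share = high_quality_share(5, 50)
print(per_population, share)
```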

In [13]:
largest_by_articles_by_population = []
values_largest_articles_by_population = []
smallest_by_articles_by_population = []
values_smallest_articles_by_population = []

largest_by_good_articles_by_population = []
values_largest_good_articles_by_population = []
smallest_by_good_articles_by_population = []
values_smallest_good_articles_by_population = []

for i in range(10):
    largest = max(articles_by_population,key=articles_by_population.get)
    largest_by_articles_by_population.append(largest)
    values_largest_articles_by_population.append(articles_by_population[largest])
    del articles_by_population[largest]
    smallest = min(articles_by_population,key=articles_by_population.get)
    smallest_by_articles_by_population.append(smallest)
    values_smallest_articles_by_population.append(articles_by_population[smallest])
    del articles_by_population[smallest]
    
    largest = max(proportion_of_high_quality,key=proportion_of_high_quality.get)
    largest_by_good_articles_by_population.append(largest)
    values_largest_good_articles_by_population.append(proportion_of_high_quality[largest])
    del proportion_of_high_quality[largest]
    smallest = min(proportion_of_high_quality,key=proportion_of_high_quality.get)
    smallest_by_good_articles_by_population.append(smallest)
    values_smallest_good_articles_by_population.append(proportion_of_high_quality[smallest])
    del proportion_of_high_quality[smallest]
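
The same top and bottom rankings can be obtained by sorting, which leaves the source dictionary untouched. A minimal sketch: `top_and_bottom` is a hypothetical helper, and the sample is a handful of values drawn from this notebook's printed results.

```python
def top_and_bottom(values, n=10):
    """Return the n largest and n smallest (key, value) pairs without mutating values."""
    top = sorted(values.items(), key=lambda item: item[1], reverse=True)[:n]
    bottom = sorted(values.items(), key=lambda item: item[1])[:n]
    return top, bottom


sample_values = {
    'tuvalu': 5500.0,
    'india': 0.7219426821264494,
    'nauru': 5300.0,
    'iceland': 515.0,
    'indonesia': 0.8107088989441931,
}
top, bottom = top_and_bottom(sample_values, n=2)
print(top)
print(bottom)
```

Because nothing is deleted, the dictionary can still be reused for later analysis.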

In [14]:
rank = 1
print("top 10 countries in terms of number of politician articles as a proportion of country population")
print("Rank\t\tCountry\t\t\t\tpolitician articles as a proportion of population")
for i in range(10):
    print(str(rank) + "\t\t" + str(largest_by_articles_by_population[i]) + "\t\t\t\t" + str(values_largest_articles_by_population[i]))
    rank += 1
print()
rank = 1
print("bottom 10 countries in terms of number of politician articles as a proportion of country population")
print("Rank\t\tCountry\t\t\t\tpolitician articles as a proportion of population")
for i in range(10):
    print(str(rank) + "\t\t" + str(smallest_by_articles_by_population[i]) + "\t\t\t\t" + str(values_smallest_articles_by_population[i]))
    rank += 1
print()
rank = 1
print("top 10 countries in terms of proportion of articles that are predicted as GA or FA a.k.a good articles")
print("Rank\t\tCountry\t\t\t\tproportion of good articles")
for i in range(10):
    print(str(rank) + "\t\t" + str(largest_by_good_articles_by_population[i]) + "\t\t\t\t" + str(values_largest_good_articles_by_population[i]))
    rank += 1
print()   
rank = 1
print("bottom 10 countries in terms of proportion of articles that are predicted as GA or FA a.k.a good articles")
print("Rank\t\tCountry\t\t\t\tproportion of good articles")
for i in range(10):
    print(str(rank) + "\t\t" + str(smallest_by_good_articles_by_population[i]) + "\t\t\t\t" + str(values_smallest_good_articles_by_population[i]))
    rank += 1

top 10 countries in terms of number of politician articles as a proportion of country population
Rank		Country				politician articles as a proportion of population
1		tuvalu				5500.0
2		nauru				5300.0
3		san marino				2733.3333333333335
4		monaco				1000.0
5		liechtenstein				725.0
6		tonga				630.0
7		marshall islands				616.6666666666667
8		iceland				515.0
9		andorra				425.0
10		federated states of micronesia				380.0

bottom 10 countries in terms of number of politician articles as a proportion of country population
Rank		Country				politician articles as a proportion of population
1		india				0.7219426821264494
2		indonesia				0.8107088989441931
3		china				0.8164729516429904
4		uzbekistan				0.8814589665653496
5		ethiopia				0.9767441860465116
6		zambia				1.4689265536723164
7		korea, north				1.5234375
8		thailand				1.6918429003021147
9		bangladesh				1.9471153846153846
10		mozambique				1.9672131147540983

top 10 countries in terms of proportion of articles that are predict