# Step 0: Pre-processing

In [2]:
import requests
import json
import pandas as pd
from pandas import json_normalize #top-level import; the pandas.io.json path is deprecated
from datetime import datetime
import numpy as np

Set base_path to a directory on your local machine that holds the source data files. The script also writes its output files there.

In [4]:
base_path = 'C:/Users/geoffc.REDMOND/OneDrive/Data512/A2/'

# Step 1: Data acquisition

First, we load the Wikipedia article data and the Population Reference Bureau data from the CSV files in the base_path location.

In [5]:
#wikipedia data
wiki_page_data = pd.read_csv(base_path+'page_data.csv',header=0)
wiki_page_data = wiki_page_data.sort_values(by=['rev_id'],ascending = True) #The data appears to be pre-sorted but better safe than sorry.

#population reference bureau data
prb_data = pd.read_csv(base_path+'population_prb.csv',header=2)
prb_data = prb_data.drop(prb_data.columns[[1,2,3,5]],axis=1) #Drop location type, timeframe, data type, footnotes
prb_data.columns = ['country','population'] #Rename columns

Next, define a function that calls the ORES API for a list of revision ids and returns JSON containing each revision id, its ORES quality prediction and several other fields for the associated article. This code is heavily based on an example use of this API provided by Oliver Keyes.

In [6]:
def get_ores_data(revision_ids, headers):
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    
    api_call = requests.get(endpoint.format(**params), headers=headers) #send the identifying headers with the request
    return json.dumps(json.loads(api_call.text))
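
To make the request format concrete, here is a minimal sketch that assembles the URL without sending it. The endpoint template and parameter values are copied from get_ores_data; `build_ores_url` is a hypothetical helper that is not part of the notebook, and the revision ids are placeholders.

```python
# Hypothetical helper (not in the notebook): build the ORES request URL
# without sending it, so the format can be inspected offline.
ENDPOINT = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

def build_ores_url(revision_ids, project='enwiki', model='wp10'):
    # Revision ids are joined with a pipe character, as in get_ores_data.
    revids = '|'.join(str(x) for x in revision_ids)
    return ENDPOINT.format(project=project, model=model, revids=revids)

# Placeholder revision ids, for illustration only.
url = build_ores_url([1001, 1002])
print(url)
```

Printing the URL before calling the API is a cheap way to confirm that the ids are joined as intended.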

Define a function that takes the JSON returned by the ORES API and turns it into a sorted pandas dataframe containing only the revision ids and ORES quality predictions.

In [7]:
def process_ores_data(ores_json):
    #orient is passed by keyword because newer pandas versions do not accept it positionally
    out = pd.read_json(pd.read_json(ores_json,orient='index',typ='frame')['scores'].to_json(),date_unit='us')['enwiki'].astype(str)

    #take only the prediction from the json
    out = out.str.split(',',expand=True)[0] 
    out = out.str.split("'",expand=True)[7]
    out = out.to_frame()
    out.columns = ['prediction']
    out['rev_id'] = out.index
    out = out.sort_values(by=['rev_id'],ascending = True)
    return out
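
The string splitting above depends on the exact text layout of each score entry. As a hedged alternative, the sketch below reads the prediction straight from the parsed JSON. The response shape in `sample` is an assumption trimmed to the fields used here, the revision ids and error message are made up, and `extract_predictions` is a hypothetical name rather than part of the notebook.

```python
import json

# Assumed shape of an ORES v3 response, trimmed to the fields used here.
# Revision ids and the error message are placeholders for illustration.
sample = json.dumps({
    'enwiki': {
        'scores': {
            '1002': {'wp10': {'error': {'type': 'RevisionNotFound',
                                        'message': 'RevisionNotFound: (placeholder message)'}}},
            '1001': {'wp10': {'score': {'prediction': 'Stub'}}},
        }
    }
})

def extract_predictions(ores_json, project='enwiki', model='wp10'):
    """Return (rev_id, label) pairs sorted by rev_id."""
    scores = json.loads(ores_json)[project]['scores']
    rows = []
    for rev_id, models in scores.items():
        result = models[model]
        if 'score' in result:
            label = result['score']['prediction']
        else:
            # Keep the error message so failed lookups can be filtered later.
            label = result['error']['message']
        rows.append((int(rev_id), label))
    return sorted(rows)

predictions = extract_predictions(sample)
print(predictions)
```

The pairs can be turned into a dataframe with pd.DataFrame(predictions, columns=['rev_id','prediction']) if the rest of the pipeline expects one.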

Define a function that, given request headers and a sorted list of rev_ids, sends the rev_ids to ORES in batches of 50 and returns a sorted dataframe of predictions.

In [8]:
def process_rev_id_list(revision_ids, headers): 
    start = revision_ids[0:1]
    out = process_ores_data(get_ores_data(start, headers))
    #print('out:', out)
    index = 1 
    inc = 50
    list_len = len(revision_ids)    
    while (index < list_len):
        end = min(list_len,index+inc)
        #print('end:',end)
        lst = revision_ids[index:end]
        res = process_ores_data(get_ores_data(lst, headers))
        #print('res:',res)
        out = pd.concat([out, res]) #DataFrame.append was removed in pandas 2.0
        index = end
    return out
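
The batching logic can be looked at in isolation. The following is a minimal sketch of splitting a list into chunks of at most 50 items; `batches` is a hypothetical helper, and unlike the loop above it does not send the first id in a request of its own.

```python
def batches(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 120 placeholder ids split into batches of 50.
chunks = list(batches(list(range(120)), size=50))
sizes = [len(c) for c in chunks]
print(sizes)
```

Each chunk could then be passed to get_ores_data and the results concatenated once at the end.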

Using our functions, process the list of revision ids from the Wikipedia article data.

In [9]:
headers = {'User-Agent' : 'https://github.com/gdc3000', 'From' : 'gdc3000@uw.edu'}
input_ids = wiki_page_data['rev_id'].tolist()
output_df = process_rev_id_list(input_ids,headers)

# Step 2: Data processing

Now that we have the Wikipedia page data, ORES quality scores and population data, we merge them into one table for analysis. First, we join the ORES dataframe with the other Wikipedia page fields.

In [10]:
wiki_page_data_wPrediction = pd.merge(wiki_page_data,output_df,on='rev_id',how='outer')

Define a function which outputs a given dataframe to a CSV file with given name at the given path.

In [11]:
def expToCSV(path,filename,dataframe):
    dataframe.to_csv(path_or_buf=path+filename,sep=',', encoding='utf-8',index=False) #write the dataframe passed in, not the global combined_data

Next, join the Wikipedia and population data together on country. Where a country appears in only one of the two datasets, we remove the row (i.e. we are doing an inner join).

In [12]:
#convert population to numeric type
prb_data['population'] = prb_data['population'].str.replace(',', '') 
prb_data['population'] = pd.to_numeric(prb_data['population'])

#combine data
combined_data = pd.merge(wiki_page_data_wPrediction,prb_data,on='country',how='inner')
combined_data = combined_data[['country','page','rev_id','prediction','population']]
combined_data.columns = ['country','article_name','revision_id','article_quality','population']
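
Because an inner join drops unmatched rows silently, it can help to see which countries fail to match. The sketch below uses toy frames and the `indicator` option of pd.merge. The frame names are hypothetical, 'Atlantis' and the page names are made up, and the populations are taken from the tables later in this notebook.

```python
import pandas as pd

# Toy stand-ins for the two datasets (illustrative values only).
wiki = pd.DataFrame({'country': ['Nauru', 'Tuvalu', 'Atlantis'],
                     'page': ['page_a', 'page_b', 'page_c']})
prb = pd.DataFrame({'country': ['Nauru', 'Tuvalu', 'Iceland'],
                    'population': [10860, 11800, 330828]})

# An outer join with indicator=True labels each row as
# 'both', 'left_only' or 'right_only'.
check = pd.merge(wiki, prb, on='country', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', 'country'].tolist()

# The inner join keeps only countries present in both frames.
inner = pd.merge(wiki, prb, on='country', how='inner')
print(sorted(unmatched), len(inner))
```

Reviewing the unmatched names can reveal spelling differences between the two sources before rows are discarded.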

For a few rows in the data, the ORES API couldn't return a quality score. In those rows the returned value starts with "RevisionNotFound". We'll remove these rows from the data, even though there appear to be only two of them.

In [14]:
print(combined_data[combined_data.article_quality.str.match('RevisionNotFound',case=False)].shape)
combined_data_clean = combined_data[~combined_data.article_quality.str.match('RevisionNotFound',case=False)]

(2, 5)


Before starting the analysis step, we will export the scrubbed, combined data to a CSV file.

In [15]:
expToCSV(base_path,'final_data_a2.csv',combined_data_clean)

# Step 3: Analysis

Taking the table resulting from step 2, flag any articles where article_quality is 'FA' or 'GA' as high quality and add these flags as a field to the table. The flag is assigned to the whole column in one step instead of through chained indexing, which triggers pandas' SettingWithCopyWarning.

In [16]:
#1 if the article is rated FA or GA, 0 otherwise
combined_data['high_quality'] = combined_data['article_quality'].isin(['FA','GA']).astype(int)


For each country, compute the proportion of politician articles that are high quality (high-quality articles divided by total politician articles).

In [17]:
qual_byarticlecount = combined_data.groupby('country',as_index=False).agg({'high_quality': np.sum, 'article_name': np.size})
qual_byarticlecount.columns = ['country','high_quality','article_count'] #fix column name

qual_byarticlecount['proportion'] = qual_byarticlecount['high_quality'] / qual_byarticlecount['article_count']
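
The same calculation can be checked on a small example. This is a minimal sketch on toy data using pandas named aggregation, which sets the output column names directly; the frame and its values are made up for illustration.

```python
import pandas as pd

# Toy data: country A has 1 high-quality article out of 3, country B has 0 out of 2.
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'article_name': ['a1', 'a2', 'a3', 'b1', 'b2'],
    'high_quality': [1, 0, 0, 0, 0],
})

# Named aggregation: output column = (input column, aggregation).
summary = (toy.groupby('country', as_index=False)
              .agg(high_quality=('high_quality', 'sum'),
                   article_count=('article_name', 'count')))
summary['proportion'] = summary['high_quality'] / summary['article_count']
print(summary)
```

With named aggregation the separate column-renaming step is not needed.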

Next, compute the number of politician articles relative to each country's population (article count divided by population).

In [18]:
qual_bypop = combined_data.groupby(['country','population'],as_index=False).agg({'article_name': np.size})
qual_bypop.columns = ['country','population','article_count'] #fix column name

qual_bypop['proportion'] = qual_bypop['article_count'] / qual_bypop['population']

Next, sort these tables by the proportion.

In [19]:
qual_bypop = qual_bypop.sort_values(by=['proportion','population'],ascending = [False,True])
qual_byarticlecount = qual_byarticlecount.sort_values(by=['proportion','article_count'],ascending = [False,True])
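
To illustrate how the second sort key breaks ties, here is a small sketch on made-up data. Rows with equal proportion are ordered by ascending article_count, so the largest count ends up last.

```python
import pandas as pd

# Toy data: A and B tie on proportion, C ranks highest.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'article_count': [10, 5, 20],
    'proportion': [0.0, 0.0, 0.5],
})

# Sort by proportion (descending), then article_count (ascending) within ties.
ranked = toy.sort_values(by=['proportion', 'article_count'], ascending=[False, True])
order = ranked['country'].tolist()
print(order)
```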

Display the 10 highest-ranked countries in terms of proportion of politician articles to a country's population.

In [20]:
qual_bypop.head(10)

| index | country | population | article_count | proportion |
|---|---|---|---|---|
| 120 | Nauru | 10860 | 53 | 0.00488 |
| 173 | Tuvalu | 11800 | 55 | 0.004661 |
| 141 | San Marino | 33000 | 82 | 0.002485 |
| 113 | Monaco | 38088 | 40 | 0.00105 |
| 97 | Liechtenstein | 37570 | 29 | 0.000772 |
| 107 | Marshall Islands | 55000 | 37 | 0.000673 |
| 72 | Iceland | 330828 | 206 | 0.000623 |
| 168 | Tonga | 103300 | 63 | 0.00061 |
| 3 | Andorra | 78000 | 34 | 0.000436 |
| 54 | Federated States of Micronesia | 103000 | 38 | 0.000369 |


Display the 10 lowest-ranked countries in terms of proportion of politician articles to a country's population. If any countries have an equal proportion, those with the highest population appear at the bottom.

In [21]:
qual_bypop.tail(10)

| index | country | population | article_count | proportion |
|---|---|---|---|---|
| 13 | Bangladesh | 160411000 | 324 | 2.019812e-06 |
| 38 | Congo, Dem. Rep. of | 73340200 | 142 | 1.936182e-06 |
| 166 | Thailand | 65121250 | 112 | 1.719869e-06 |
| 185 | Zambia | 15473900 | 26 | 1.680249e-06 |
| 86 | Korea, North | 24983000 | 39 | 1.561062e-06 |
| 53 | Ethiopia | 98148000 | 105 | 1.069813e-06 |
| 180 | Uzbekistan | 31290791 | 29 | 9.267902e-07 |
| 74 | Indonesia | 255741973 | 215 | 8.406911e-07 |
| 34 | China | 1371920000 | 1138 | 8.294944e-07 |
| 73 | India | 1314097616 | 990 | 7.533687e-07 |


Display the 10 highest-ranked countries in terms of the proportion of high-quality politician articles to total articles.

In [22]:
qual_byarticlecount.head(10)

| index | country | high_quality | article_count | proportion |
|---|---|---|---|---|
| 86 | Korea, North | 9 | 39 | 0.230769 |
| 138 | Romania | 45 | 348 | 0.12931 |
| 143 | Saudi Arabia | 15 | 119 | 0.12605 |
| 31 | Central African Republic | 8 | 68 | 0.117647 |
| 137 | Qatar | 5 | 51 | 0.098039 |
| 68 | Guinea-Bissau | 2 | 21 | 0.095238 |
| 183 | Vietnam | 18 | 191 | 0.094241 |
| 19 | Bhutan | 3 | 33 | 0.090909 |
| 77 | Ireland | 31 | 381 | 0.081365 |
| 178 | United States | 86 | 1098 | 0.078324 |


Display the 10 lowest-ranked countries in terms of the proportion of high-quality politician articles to total articles. For countries with an equivalent proportion of high quality articles, we show those with the highest total article_count at the bottom.

In [23]:
qual_byarticlecount.tail(10)

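| index | country | high_quality | article_count | proportion |
|---|---|---|---|---|
| 168 | Tonga | 0 | 63 | 0.0 |
| 100 | Macedonia | 0 | 65 | 0.0 |
| 26 | Burundi | 0 | 76 | 0.0 |
| 83 | Kazakhstan | 0 | 79 | 0.0 |
| 141 | San Marino | 0 | 82 | 0.0 |
| 151 | Solomon Islands | 0 | 98 | 0.0 |
| 170 | Tunisia | 0 | 140 | 0.0 |
| 121 | Nepal | 0 | 363 | 0.0 |
| 161 | Switzerland | 0 | 407 | 0.0 |
| 16 | Belgium | 0 | 523 | 0.0 |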
