# <center>A2 Assignment - Bias in Data</center>
<center>Darshan Mehta</center>

1. [Data Preparation](#Data-Preparation)
2. [Data Analysis](#Data-Analysis)
3. [Results](#Results)
4. [Reflections](#Reflections)

In [1]:
import oresapi as op
import pandas as pd
from IPython.display import display

### Data Preparation

Read the ```page_data.csv``` and the ```WPDS_2018_data.csv``` files. Display a sample of the input from both the files.

In [2]:
page_data = pd.read_csv('./page_data.csv')
wpds_data = pd.read_csv('./WPDS_2018_data.csv')

print('Wikipedia Politicians by Country Dataset')
display(page_data.head())
print()
print()
print('Population Data')
display(wpds_data.head())

Wikipedia Politicians by Country Dataset


Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409




Population Data


Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


In ```page_data.csv```, some of the page names start with the string "Template:". These pages do not represent Wikipedia articles, so we remove them since they don't concern this analysis.

In [3]:
filtered_page_data = page_data.loc[~page_data.page.str.startswith("Template:")]

Now, we notice that in ```WPDS_2018_data.csv``` some rows have all-caps values in the 'Geography' field. These rows give cumulative regional population counts rather than country-level counts. Reading the file row by row, every country listed below a region name (up to the next region name) belongs to that region. So we create a country-region mapper dataframe and isolate the country-only rows for our country-level analysis.

In [4]:
# Boolean mask of the rows where 'Geography' is in all caps (region rows)
regional_rows = wpds_data["Geography"].str.isupper()

country_level_counts = wpds_data[~regional_rows]

# Make the region-country mapper by iterating through the rows sequentially.
region = ""
region_country_mapper = []
for idx, row in wpds_data.iterrows():
    if row["Geography"].isupper():
        region = row["Geography"]
    else:
        region_country_mapper.append({'region': region, 
                                      'country': row['Geography']})
region_country_mapper = pd.DataFrame(region_country_mapper)
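
The same mapping can also be built without an explicit loop. The sketch below is an alternative, not the code used in this notebook: it runs on a small toy frame (abridged rows standing in for the real file) and relies on the same assumption that region rows are the all-caps ones.

```python
import pandas as pd

# Toy stand-in for WPDS_2018_data.csv; only the 'Geography' column matters here.
toy_wpds = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "ASIA", "India"],
})

is_region = toy_wpds["Geography"].str.isupper()

# Keep the name only on region rows, then carry it forward onto the rows below.
region_of_row = toy_wpds["Geography"].where(is_region).ffill()

toy_mapper = pd.DataFrame({
    "region": region_of_row[~is_region],
    "country": toy_wpds.loc[~is_region, "Geography"],
}).reset_index(drop=True)

print(toy_mapper)
```

The forward fill plays the role of the `region` variable in the loop: each country row inherits the most recent region name above it.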

Next, we rate each article in ```filtered_page_data``` using the ```oresapi``` library. This service assigns each article to one of the following six classes:

| Rank | Symbol | Description |
|------|--------|-------------|
| 1 | FA | Featured Article |
| 2 | GA | Good Article |
| 3 | B | B-class Article |
| 4 | C | C-class Article |
| 5 | Start | Start-class Article |
| 6 | Stub | Stub-class Article |

For each article, the response contains a probability score for each class, along with a prediction holding the symbol of the class with the highest probability score. A demo of how to call this API can be found [here](https://github.com/halfak/oresapi).
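
To make the response shape concrete, here is a minimal sketch that parses one mocked entry. The nesting follows the fields this notebook reads (`['articlequality']['score']['prediction']`); the `probability` key name and all numbers are assumptions made for illustration, and `extract_prediction` is a hypothetical helper, not part of `oresapi`.

```python
# Mocked response entry; the probabilities are made up for illustration.
sample_result = {
    "articlequality": {
        "score": {
            "prediction": "Stub",
            "probability": {
                "FA": 0.01, "GA": 0.02, "B": 0.03,
                "C": 0.04, "Start": 0.20, "Stub": 0.70,
            },
        }
    }
}


def extract_prediction(result):
    """Return the predicted class, or None when the entry carries no score."""
    score = result.get("articlequality", {}).get("score")
    return score["prediction"] if score else None


# The prediction should coincide with the class of highest probability.
probabilities = sample_result["articlequality"]["score"]["probability"]
top_class = max(probabilities, key=probabilities.get)

print(extract_prediction(sample_result), top_class)
```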

In [5]:
# Create a session for making the API calls
# Please specify the user-agent string to help the ORES team track requests
ores_session = op.Session("https://ores.wikimedia.org", user_agent="Class Project <darshanm@uw.edu>")

# Now obtain a list of revids from the filtered_page_data dataframe
rev_ids = filtered_page_data.rev_id.values

# Make the API call to retrieve 'articlequality' for each rev_id
results = ores_session.score("enwiki", ["articlequality"], rev_ids)

# Read the results and parse them into a dataframe, keeping only the rev_id and the prediction
# NOTE: Some rev_ids may not be found by the API. We collect these
# in a list and write them to a file.
df_results = []
error_revids = []
for rev_id, result in zip(rev_ids, results):
    try:
        result_dict = {'rev_id': rev_id, 
                       'article_quality': result['articlequality']['score']['prediction']}
        df_results.append(result_dict)
    except (KeyError, TypeError):
        error_revids.append(rev_id)
df_results = pd.DataFrame(df_results)

# Write the revision_ids which we couldn't make the prediction for to a file.
with open('invalid_rev_ids.txt', 'w') as err_file:
    err_file.write('\n'.join(list(map(str, error_revids))))

# Display a few samples of the curated results dataframe
df_results.head()

Unnamed: 0,article_quality,rev_id
0,Stub,355319463
1,Stub,393276188
2,Stub,393822005
3,Stub,395521877
4,Stub,395526568


Next, we merge this dataframe with the `filtered_page_data` based on the `rev_id` column.

In [6]:
scored_page_data = pd.merge(left=filtered_page_data, right=df_results,
                            on='rev_id')

# Display a few sample rows
scored_page_data.head()

Unnamed: 0,page,country,rev_id,article_quality
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub


Next, we check whether we have population data for every country in `scored_page_data`.

In [7]:
# Get the list of countries
countries = set(country_level_counts.Geography.values)

# Boolean mask of the rows in `scored_page_data` whose country is present in `countries`
valid_country_indices = scored_page_data.country.isin(countries)

# Split the dataframe based on the above indices
scored_page_data_valid = scored_page_data[valid_country_indices]
scored_page_data_invalid = scored_page_data[~valid_country_indices]

print('Number of records with no country match:', len(scored_page_data_invalid))

Number of records with no country match: 2082
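
To see which country names fail to match (for instance, because the two files spell a name differently), one option is a left merge with pandas' `indicator=True` flag. This is a sketch on toy rows, not the notebook's method; "Example Person" and "Atlantis" are made-up placeholders for a non-matching record.

```python
import pandas as pd

toy_pages = pd.DataFrame({
    "page": ["Bir I of Kanem", "Yos Por", "Example Person"],
    "country": ["Chad", "Cambodia", "Atlantis"],  # "Atlantis" is a made-up non-match
})
toy_population = pd.DataFrame({"Geography": ["Chad", "Cambodia"]})

# indicator=True adds a `_merge` column that says where each row was found.
merged = toy_pages.merge(
    toy_population, how="left",
    left_on="country", right_on="Geography", indicator=True,
)

# Count the unmatched records per country name.
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].value_counts()
print(unmatched)
```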


We save the records with no country match to `wp_wpds_countries-no_match.csv` and merge the rest with the population data. For clarity, we also rename `rev_id` to `revision_id`, `Population mid-2018 (millions)` to `population`, and `page` to `article_name`, and we drop `Geography` in favor of `country`. We then multiply the `population` column by $10^6$ to obtain the actual population. The final dataset is saved to `wp_wpds_politicians_by_country.csv`.

In [8]:
# Save invalid dataset to file
scored_page_data_invalid.to_csv('wp_wpds_countries-no_match.csv', index=False)

# Merge the rest with the population dataset
final_dataset = pd.merge(left=country_level_counts, right=scored_page_data_valid, 
                         left_on='Geography', right_on='country')

# Drop the `Geography` column
final_dataset = final_dataset.drop(columns=['Geography'])

# Rename the columns
final_dataset = final_dataset.rename(columns={'Population mid-2018 (millions)': 'population',
                                              'page': 'article_name', 
                                              'rev_id': 'revision_id'})

# Strip thousands separators, convert to float, then multiply the `population` column by 10^6
final_dataset['population'] = final_dataset['population'].apply(lambda x: float(x.replace(',', '')))
final_dataset['population'] = final_dataset.loc[:, 'population'] * 1e6

# Save the final dataset to `wp_wpds_politicians_by_country.csv`
final_dataset.to_csv('wp_wpds_politicians_by_country.csv', index=False)

# Display a sample of the final dataset
final_dataset.head()

Unnamed: 0,population,article_name,country,revision_id,article_quality
0,42700000.0,Ali Fawzi Rebaine,Algeria,686269631,Stub
1,42700000.0,Ahmed Attaf,Algeria,705910185,Stub
2,42700000.0,Ahmed Djoghlaf,Algeria,707427823,Stub
3,42700000.0,Hammi Larouissi,Algeria,708060571,Stub
4,42700000.0,Salah Goudjil,Algeria,708980561,Stub
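
The population conversion can also be written with vectorized string methods instead of `apply`. The sketch below assumes the raw column holds strings that may contain thousands separators; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical raw values, in millions, as they might be read from the CSV.
raw_population_millions = pd.Series(["1,284", "42.7", "0.01"])

population_counts = (
    pd.to_numeric(raw_population_millions.str.replace(",", "", regex=False)) * 1e6
)

print(population_counts)
```

`pd.to_numeric` raises on values it cannot parse, which surfaces malformed entries early rather than letting them pass silently.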


### Data Analysis

We now create two temporary columns to make our analysis code simpler. The column `article_count` will be just filled with ones and the column `is_good_article` will be 1 if `article_quality` is FA or GA.

In [9]:
final_dataset['article_count'] = 1
final_dataset['is_good_article'] = final_dataset.article_quality.isin(['FA', 'GA']).astype(int)

Next, we apply a transformation that replaces each row's `article_count` with the total number of articles for that row's country.

In [10]:
final_dataset_counts_country = final_dataset.copy()
final_dataset_counts_country['article_count'] = \
    final_dataset_counts_country.groupby('country')['article_count'].transform('sum')

Similarly, we create the column `good_article_count`.

In [11]:
final_dataset_counts_country['good_article_count'] = \
    final_dataset_counts_country.groupby('country')['is_good_article'].transform('sum')

Next, we create the following columns:

$$coverage = \frac{article\_count \times 100.0}{population}$$

$$relative\_quality = \frac{good\_article\_count \times 100.0}{article\_count}$$

In [12]:
final_dataset_counts_country['coverage'] = \
    (final_dataset_counts_country['article_count'] * 100.0 / 
     final_dataset_counts_country['population'])

final_dataset_counts_country['relative_quality'] = \
    (final_dataset_counts_country['good_article_count'] * 100.0 / 
     final_dataset_counts_country['article_count'])
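
As a quick arithmetic check of the two formulas, the sketch below plugs in the Tuvalu figures reported in the Results section (population 10,000, 54 articles, 5 of them FA or GA). The variable names are illustrative only.

```python
# Figures for Tuvalu, copied from the Results tables in this notebook.
tuvalu_population = 10_000
tuvalu_article_count = 54
tuvalu_good_article_count = 5

# Articles per 100 inhabitants.
tuvalu_coverage = tuvalu_article_count * 100.0 / tuvalu_population

# Percentage of articles rated FA or GA.
tuvalu_relative_quality = tuvalu_good_article_count * 100.0 / tuvalu_article_count

print(tuvalu_coverage, round(tuvalu_relative_quality, 6))
```

Both values agree with the Tuvalu row in the tables (0.54 and 9.259259).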

In [13]:
# Only keep one row corresponding to each country
final_dataset_counts_country = \
    final_dataset_counts_country \
        .drop_duplicates(subset='country') \
        .reset_index(drop=True)

# Keep a copy of the dataframe for our region analysis
final_dataset_country_bkup = final_dataset_counts_country.copy()

final_dataset_counts_country = final_dataset_counts_country[['country', 'coverage', 
                                                             'relative_quality', 
                                                             'population', 'article_count', 
                                                             'good_article_count']]
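
The `transform` plus `drop_duplicates` steps above could also be collapsed into a single `groupby(...).agg(...)`. The following is a sketch of that alternative on toy rows (the article qualities here are made up), assuming population is constant within a country so that taking the first value is safe.

```python
import pandas as pd

# Toy article-level rows; qualities are invented for illustration.
toy_articles = pd.DataFrame({
    "country": ["Tuvalu", "Tuvalu", "Nauru"],
    "population": [10_000.0, 10_000.0, 10_000.0],
    "article_quality": ["GA", "Stub", "Stub"],
})
toy_articles["is_good_article"] = (
    toy_articles["article_quality"].isin(["FA", "GA"]).astype(int)
)

# One row per country, using named aggregation.
per_country = toy_articles.groupby("country", as_index=False).agg(
    population=("population", "first"),
    article_count=("article_quality", "count"),
    good_article_count=("is_good_article", "sum"),
)
per_country["coverage"] = (
    per_country["article_count"] * 100.0 / per_country["population"]
)
per_country["relative_quality"] = (
    per_country["good_article_count"] * 100.0 / per_country["article_count"]
)

print(per_country)
```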

Now for our region-wise analysis, we will repeat the above steps but first begin with merging the region information into our `final_dataset_country_bkup` using the region-country mapper we prepared in the beginning. We also sum up the population of each country under the region to obtain the population of the entire region.

In [14]:
# Merge the dataframes to pull in the region information
final_data_region = pd.merge(left=region_country_mapper, right=final_dataset_country_bkup, 
                             left_on='country', right_on='country')

In [15]:
# Repeat all the above steps for preparing the columns `coverage` and `relative_quality`
# by grouping on `region` this time.
final_dataset_counts_region = final_data_region.copy()

# Get the total population for each region
final_dataset_counts_region['population'] = \
    final_dataset_counts_region.groupby('region')['population'].transform('sum')

final_dataset_counts_region['article_count'] = \
    final_dataset_counts_region.groupby('region')['article_count'].transform('sum')

final_dataset_counts_region['good_article_count'] = \
    final_dataset_counts_region.groupby('region')['good_article_count'].transform('sum')

final_dataset_counts_region['coverage'] = \
    (final_dataset_counts_region['article_count'] * 100.0 / 
     final_dataset_counts_region['population'])

final_dataset_counts_region['relative_quality'] = \
    (final_dataset_counts_region['good_article_count'] * 100.0 / 
     final_dataset_counts_region['article_count'])

# Only keep one row corresponding to each region
final_dataset_counts_region = \
    final_dataset_counts_region \
        .drop_duplicates(subset='region') \
        .reset_index(drop=True)[['region', 'coverage', 
                                 'relative_quality', 
                                 'population', 'article_count', 
                                 'good_article_count']]
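
One caveat: the regional population used here is the sum over the countries that survived the merges, which need not equal the regional total that WPDS reports. The sketch below compares the two for AFRICA using figures copied from this notebook's outputs (1284.0 million in the population sample, 1,172,400,000 and 6,851 articles in the region table). The gap presumably comes from countries without matching articles dropping out, but that explanation is an assumption.

```python
# Figures copied from the outputs shown in this notebook.
wpds_reported_millions = 1284.0         # AFRICA row in the population sample
summed_population = 1_172_400_000.0     # AFRICA row in the region table
africa_article_count = 6851

# Share of the reported regional population missing from the summed figure.
missing_share = 1 - summed_population / (wpds_reported_millions * 1e6)

# Coverage under each choice of denominator.
coverage_summed = africa_article_count * 100.0 / summed_population
coverage_reported = africa_article_count * 100.0 / (wpds_reported_millions * 1e6)

print(round(missing_share, 4), coverage_summed, coverage_reported)
```

Using the reported regional total would lower the coverage figure somewhat, so the choice of denominator is worth stating alongside the results.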

### Results

#### Top 10 countries by coverage

In [16]:
final_dataset_counts_country.sort_values('coverage', ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,country,coverage,relative_quality,population,article_count,good_article_count
0,Tuvalu,0.54,9.259259,10000.0,54,5
1,Nauru,0.52,0.0,10000.0,52,0
2,San Marino,0.27,0.0,30000.0,81,0
3,Monaco,0.1,0.0,40000.0,40,0
4,Liechtenstein,0.07,0.0,40000.0,28,0
5,Tonga,0.063,0.0,100000.0,63,0
6,Marshall Islands,0.061667,0.0,60000.0,37,0
7,Iceland,0.05025,0.995025,400000.0,201,2
8,Andorra,0.0425,0.0,80000.0,34,0
9,Federated States of Micronesia,0.036,0.0,100000.0,36,0


#### Bottom 10 countries by coverage

In [17]:
final_dataset_counts_country.sort_values('coverage').reset_index(drop=True).head(10)

Unnamed: 0,country,coverage,relative_quality,population,article_count,good_article_count
0,India,7.1e-05,1.734694,1371300000.0,980,17
1,Indonesia,7.9e-05,4.761905,265200000.0,210,10
2,China,8.1e-05,3.628319,1393800000.0,1130,41
3,Uzbekistan,8.5e-05,7.142857,32900000.0,28,2
4,Ethiopia,9.4e-05,1.980198,107500000.0,101,2
5,"Korea, North",0.000141,19.444444,25600000.0,36,7
6,Zambia,0.000141,0.0,17700000.0,25,0
7,Thailand,0.000169,2.678571,66200000.0,112,3
8,Mozambique,0.00019,0.0,30500000.0,58,0
9,Bangladesh,0.000192,0.940439,166400000.0,319,3


#### Top 10 countries by relative quality

In [18]:
final_dataset_counts_country.sort_values('relative_quality', ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,country,coverage,relative_quality,population,article_count,good_article_count
0,"Korea, North",0.000141,19.444444,25600000.0,36,7
1,Saudi Arabia,0.000353,12.711864,33400000.0,118,15
2,Mauritania,0.001067,12.5,4500000.0,48,6
3,Central African Republic,0.001404,12.121212,4700000.0,66,8
4,Romania,0.001759,11.370262,19500000.0,343,39
5,Tuvalu,0.54,9.259259,10000.0,54,5
6,Bhutan,0.004125,9.090909,800000.0,33,3
7,Dominica,0.017143,8.333333,70000.0,12,1
8,Syria,0.000699,7.8125,18300000.0,128,10
9,Benin,0.000791,7.692308,11500000.0,91,7


#### Bottom 10 countries by relative quality

In [19]:
final_dataset_counts_country.sort_values('relative_quality').reset_index(drop=True).head(10)

Unnamed: 0,country,coverage,relative_quality,population,article_count,good_article_count
0,Kazakhstan,0.000424,0.0,18400000.0,78,0
1,Eritrea,0.000267,0.0,6000000.0,16,0
2,San Marino,0.27,0.0,30000.0,81,0
3,Costa Rica,0.00294,0.0,5000000.0,147,0
4,Macedonia,0.003095,0.0,2100000.0,65,0
5,Mozambique,0.00019,0.0,30500000.0,58,0
6,Malta,0.0206,0.0,500000.0,103,0
7,Seychelles,0.021,0.0,100000.0,21,0
8,Uganda,0.00042,0.0,44100000.0,185,0
9,Zambia,0.000141,0.0,17700000.0,25,0
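
All ten rows above have a `relative_quality` of 0, so which ten countries appear depends on how the sort orders ties. A sketch of one way to inspect the tie, on a few rows copied from the tables in this section; the tie-break rule (largest article count first) is only one possible choice.

```python
import pandas as pd

# Rows copied from the result tables above.
toy_quality = pd.DataFrame({
    "country": ["Kazakhstan", "Eritrea", "Romania", "Uganda"],
    "relative_quality": [0.0, 0.0, 11.370262, 0.0],
    "article_count": [78, 16, 343, 185],
})

# Every country tied at zero, rather than an arbitrary ten of them.
zero_quality = toy_quality[toy_quality["relative_quality"] == 0]
n_tied = len(zero_quality)

# A deterministic tie-break: countries with the most articles first.
ranked = zero_quality.sort_values("article_count", ascending=False)

print(n_tied)
print(ranked)
```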


#### Geographic regions by coverage

In [20]:
final_dataset_counts_region.sort_values('coverage', ascending=False).reset_index(drop=True)

Unnamed: 0,region,coverage,relative_quality,population,article_count,good_article_count
0,OCEANIA,0.007863,2.109974,39780000.0,3128,66
1,EUROPE,0.00216,2.029753,734590000.0,15864,322
2,LATIN AMERICA AND THE CARIBBEAN,0.000823,1.334881,628270000.0,5169,69
3,AFRICA,0.000584,1.824551,1172400000.0,6851,125
4,NORTHERN AMERICA,0.000526,5.153566,365200000.0,1921,99
5,ASIA,0.000256,2.688405,4513100000.0,11531,310


#### Geographic regions by relative quality

In [21]:
final_dataset_counts_region.sort_values('relative_quality', ascending=False).reset_index(drop=True)

Unnamed: 0,region,coverage,relative_quality,population,article_count,good_article_count
0,NORTHERN AMERICA,0.000526,5.153566,365200000.0,1921,99
1,ASIA,0.000256,2.688405,4513100000.0,11531,310
2,OCEANIA,0.007863,2.109974,39780000.0,3128,66
3,EUROPE,0.00216,2.029753,734590000.0,15864,322
4,AFRICA,0.000584,1.824551,1172400000.0,6851,125
5,LATIN AMERICA AND THE CARIBBEAN,0.000823,1.334881,628270000.0,5169,69


### Reflections

One thing I noticed while working on the Wikipedia politicians dataset is that the `country` column refers to the country to which the politician belongs. One could use this dataset to study the freedom of political media coverage in a country, assuming that a country with high coverage has a good degree of freedom of speech. That reading overlooks the fact that the `country` column does not reflect the country of the person who actually wrote the article. For example, organizations such as the BBC and Reuters frequently write about politics and politicians from other countries. It is quite possible that writers from one country author the Wikipedia article about a politician from another. I strongly feel that the owners of the dataset should include the countries of the authors who have edited the corresponding Wikipedia page.


I've read about some techniques used to study fairness in machine learning, where a classifier is said to satisfy demographic parity under a distribution over $(X, A, Y)$ if its prediction $h(X)$ is statistically independent of the protected attribute $A$ (gender, ethnicity, etc.), i.e.,

$$ P(h(X) = \hat{y} \mid A = a) = P(h(X) = \hat{y}) \quad \forall\, a, \hat{y} $$

There are other criteria, such as Equalized Odds, for checking that a classifier is unbiased.
Most of these exist mainly as research practices. I wonder if we could have a package which, given a list of columns, could check that a model isn't biased on any of those columns.
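
As a rough illustration of what such a check might look like, the sketch below measures how far a set of predictions is from demographic parity for one protected column. `demographic_parity_gap` is a hypothetical helper written for this note, not a function from an existing package, and the toy predictions are made up.

```python
import pandas as pd


def demographic_parity_gap(frame, prediction_col, protected_col):
    """Largest spread, over predicted labels, of P(prediction | group) across groups."""
    # Each row of `rates` is one group; each column is P(label | group).
    rates = pd.crosstab(frame[protected_col], frame[prediction_col], normalize="index")
    return float((rates.max(axis=0) - rates.min(axis=0)).max())


# Made-up predictions for two groups of four individuals each.
toy_predictions = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "prediction": [1, 1, 0, 0, 1, 0, 0, 0],
})

gap = demographic_parity_gap(toy_predictions, "prediction", "group")
print(gap)
```

A gap of 0 means every group receives each label at the same rate, which matches the equality above on the sample; a package along the lines imagined here could loop this check over a list of protected columns.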

- **"What (potential) sources of bias did you discover in the course of your data processing and analysis?"**<br>
The Wikipedia guidelines [here](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment#Grades) classify an article as good when it is professional, thorough and of encyclopedic quality. Since this is an "English" Wikipedia dataset, it is reasonable for the quality rating to judge the English writing. However, one thing that gets overlooked is that English is not the primary language of many countries. So it is quite possible that an article is rich with correct information, but the non-native English writer's organization of thoughts is not up to the mark, and hence the article gets rated poorly. Such a measure is already very subjective to begin with, and this bias against non-native English speakers further adds to the bias in the training data of the machine learning model downstream. Also, in some of these countries the internet might not be a common source of news and media. In such cases, newspapers or other websites could be a more popular source than English Wikipedia, and this is something the "coverage" measure doesn't account for.


- **"What might your results suggest about (English) Wikipedia as a data source?"** <br>
One very obvious pattern in the relative_quality of the low-ranked countries is that English is not a widespread language in most of them. They could well have great articles in regional newspapers and websites in their regional languages, but English Wikipedia does not reflect that. So it is not a great idea to rely on English Wikipedia alone to study countries where English is not a primary language. One would have to supplement the data with information from the local websites and newspapers of the country being researched.


- **"How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you observed?"**<br>
This dataset and our analysis seem very shallow. We merely calculated aggregates and used them for our analyses. What we really need in this case is "thick" data, that is, data that is more descriptive. It could provide information on the authors, and perhaps allow some leeway for articles from countries whose primary language is not English. We could also supplement this dataset with articles and quality measures from Wikipedia editions in regional languages, so as to provide a more holistic view of the country's media system.

## _Fin._