# DATA 512 - Human-Centered Data Science
## Assignment 2
Will Wright 
Q4 2019

### Purpose and Methodology
TODO

### Load Modules and Set Parameters

In [3]:
import pandas as pd

In [4]:
pd.options.display.max_rows = 500

### Get Data

In [5]:
page_data = pd.read_csv("../data_raw/page_data.csv")
pop_data = pd.read_csv("../data_raw/WPDS_2018_data.csv")

In [6]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [7]:
pop_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


#### Clean Data
As shown above, the 'page' column of the page_data table includes rows that start with "Template:". These are not Wikipedia articles and need to be removed.

In [8]:
# Use regex to remove rows that start with "Template"
page_data = page_data[~page_data.page.str.contains("^Template:.*")]
page_data.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


Additionally, the pop_data table has some rows where 'Geography' is in all caps, indicating that the value is the population of an entire region. We want to retain these values but keep them in a separate table. After looking through the data, I verified that only the region-level rows are in all caps (there are no cases such as 'USA'), so it is safe to use all caps as the signal for splitting the table.

In [9]:
pop_region_data = pop_data[pop_data.Geography.str.contains("^[^a-z]*$")] # region rows: names with no lowercase letters
pop_country_data = pop_data[~pop_data.Geography.str.contains("^[^a-z]*$")] # country rows: names with at least one lowercase letter

In [10]:
pop_region_data

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284
56,NORTHERN AMERICA,365
59,LATIN AMERICA AND THE CARIBBEAN,649
95,ASIA,4536
144,EUROPE,746
189,OCEANIA,41


In [11]:
pop_country_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2
5,Sudan,41.7
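The all-caps check described above can also be sketched without a regex. This is only an illustration on a small hypothetical frame (`sample_pop`, built from values shown in the outputs above), and it swaps in `str.isupper()` for the notebook's pattern. The two are close but not identical: `isupper()` needs at least one cased character, while the regex also accepts strings with no letters at all.

```python
import pandas as pd

# Hypothetical sample mirroring the layout of WPDS_2018_data.csv
sample_pop = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "NORTHERN AMERICA", "Canada"],
    "Population mid-2018 (millions)": [1284.0, 42.7, 97.0, 365.0, 37.2],
})

# True where every cased character in the name is uppercase
is_region = sample_pop["Geography"].str.isupper()

sample_regions = sample_pop[is_region]
sample_countries = sample_pop[~is_region]

# List the all-caps names so they can be checked by eye for entries like 'USA'
print(sorted(sample_regions["Geography"]))
```

Printing the all-caps names is the manual verification step: if a country abbreviation showed up in that list, the split would need an explicit list of region names instead.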


With this data cleaned, we then write page_data to a .csv file to be read by the 'hcds-a2-bias_ores-rating.R' script. This is done because neither of the Python options provided (installing ORES or using the API) seems to work, despite much testing.

In [12]:
page_data.to_csv("../data_raw/page_data_nonTemplate.csv", index = False)

#### Generating Article Scores Using ORES (Done in R and Written to .csv)
Read in the ORES prediction results and join them to page_data, which carries the country, on rev_id.

In [38]:
prediction_data = pd.read_csv("../data_raw/page_data_predictions.csv")

In [39]:
page_predictions = pd.merge(page_data,
                            prediction_data,
                            on = "rev_id",
                            how = "left")

In [40]:
page_predictions.head()

Unnamed: 0,page,country,rev_id,prediction
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub


In [41]:
page_predictions.prediction.value_counts()

Stub                                                    24255
Start                                                   14650
C                                                        5856
B                                                         755
GA                                                        755
FA                                                        275
RevisionNotFound                                          146
TextDeleted: Text deleted (datasource.revision.text)        9
Name: prediction, dtype: int64

It looks like there were 146 cases of RevisionNotFound and 9 cases of TextDeleted. These rows will be separated into a log of unused articles.

In [42]:
# set aside RevisionNotFound and TextDeleted predictions
is_unused = page_predictions.prediction.str.contains('RevisionNotFound|TextDeleted')
page_predictions_unused = page_predictions[is_unused]
page_predictions_unused.to_csv("../data_raw/page_predictions_unused.csv", index = False)

# remove from the predictions dataframe
page_predictions = page_predictions[~is_unused]

In [43]:
page_predictions.prediction.value_counts()

Stub     24255
Start    14650
C         5856
B          755
GA         755
FA         275
Name: prediction, dtype: int64
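An alternative to matching the error strings is to keep only rows whose prediction is one of the six quality classes seen in the counts above. The sketch below assumes those six labels are the complete set of valid classes; `QUALITY_CLASSES`, `sample_predictions`, and the placeholder `rev_id` values are hypothetical names and data for illustration.

```python
import pandas as pd

# Quality classes that appear in the value counts above
QUALITY_CLASSES = ["FA", "GA", "B", "C", "Start", "Stub"]

# Hypothetical rows; the error strings mirror the formats seen in the output above
sample_predictions = pd.DataFrame({
    "rev_id": [1, 2, 3, 4, 5],  # placeholder ids, not real revisions
    "prediction": [
        "Stub",
        "RevisionNotFound",
        "TextDeleted: Text deleted (datasource.revision.text)",
        "GA",
        None,  # a revision with no prediction after a left join
    ],
})

# True only for rows with a recognized quality class
is_scored = sample_predictions["prediction"].isin(QUALITY_CLASSES)

scored = sample_predictions[is_scored]
unscored = sample_predictions[~is_scored]
print(unscored)
```

One reason to consider the allowlist: `isin` returns False for missing values, so rows left without a prediction by the left join land in the unused log, whereas `str.contains` returns a missing value for them unless its `na` argument is set.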

Add the country population data from the pop_data dataframe.

In [44]:
page_predictions_pop_data = pd.merge(page_predictions,
                       pop_data,
                       left_on = "country",
                       right_on = "Geography",
                       how = "left")

Because not all the countries in the page dataset will necessarily map to countries in the population dataset, let's investigate which countries fail to match and how many articles each one affects.

In [48]:
page_predictions_pop_data.country[page_predictions_pop_data['Population mid-2018 (millions)'].isnull()].value_counts()

Czech Republic                      251
Hondura                             187
Palestinian Territory               179
Congo, Dem. Rep. of                 142
Salvadoran                          116
South Korean                         96
Cape Colony                          81
Samoan                               76
Rhodesian                            75
Faroese                              74
Ivorian                              72
Cook Island                          67
Jersey                               61
Guadeloupe                           49
Saint Lucian                         47
Pitcairn Islands                     43
Chechen                              38
East Timorese                        36
Martinique                           34
Swaziland                            31
Saint Kitts and Nevis                30
French Guiana                        27
Montserratian                        27
Guernsey                             25
Omani                                24
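The same check can be made explicit with the `indicator` option of `pd.merge`, which labels each row by where it was found. The sketch below is a hedged illustration on hypothetical sample frames (`sample_pages`, `sample_pop`) built from values in the outputs above, not a rerun of the notebook's data.

```python
import pandas as pd

# Hypothetical samples; names and values are taken from the outputs above
sample_pages = pd.DataFrame({
    "page": ["Bir I of Kanem", "Julius Gregr", "Edvard Gregr"],
    "country": ["Chad", "Czech Republic", "Czech Republic"],
})
sample_pop = pd.DataFrame({
    "Geography": ["Chad", "Czechia"],
    "Population mid-2018 (millions)": [15.4, 10.6],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
merged = pd.merge(sample_pages, sample_pop,
                  left_on="country", right_on="Geography",
                  how="left", indicator=True)

# 'left_only' marks pages whose country has no row in the population table
unmatched_counts = merged.loc[merged["_merge"] == "left_only", "country"].value_counts()
print(unmatched_counts)
```

Using `_merge` rather than a null population value separates "no matching country" from "country matched but its population is missing", which the null test cannot distinguish.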


In [51]:
pop_data[pop_data.Geography.str.contains("Czech")]

Unnamed: 0,Geography,Population mid-2018 (millions)
166,Czechia,10.6


In [52]:
pop_data[pop_data.Geography.str.contains("Hondura")]

Unnamed: 0,Geography,Population mid-2018 (millions)
64,Honduras,9


Judging by the top two offenders, many (if not all) of the exceptions come from minor differences in spelling or naming format and could be mapped manually. Here, "Czechia" is the short-form name for "Czech Republic", and "Honduras" is the correct spelling of "Hondura".

However, since the instructions allow us to simply set these exceptions aside and avoid this tedious process, I'm going to take that route. If the results of this investigation were to have a real-world impact, doing the mapping would make much more sense.
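For reference, had the manual mapping been pursued, it might look like the sketch below. Only the two pairs confirmed by the lookups above are included; `COUNTRY_NAME_FIXES`, `sample_pages`, and the `country_wpds` column are hypothetical names, and a real mapping would need one entry per mismatched country.

```python
import pandas as pd

# Only these two pairs are confirmed by the lookups above; the rest of the
# unmatched names would each need to be checked by hand.
COUNTRY_NAME_FIXES = {
    "Czech Republic": "Czechia",
    "Hondura": "Honduras",
}

# Hypothetical sample of the 'country' column from page_data
sample_pages = pd.DataFrame({
    "country": ["Czech Republic", "Hondura", "Chad"],
})

# Replace known variants so they match 'Geography'; other names pass through unchanged
sample_pages["country_wpds"] = sample_pages["country"].replace(COUNTRY_NAME_FIXES)
print(sample_pages)
```

Writing the fixed names to a new column keeps the original label available, so the join key can be audited against the source data later.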

In [57]:
# set aside cases where there isn't a direct match
page_predictions_pop_data_unused = page_predictions_pop_data[page_predictions_pop_data['Population mid-2018 (millions)'].isnull()]
page_predictions_pop_data_unused.to_csv("../data_raw/wp_wpds_countries-no_match.csv", index = False)

# remove these exceptions from clean data
page_predictions_pop_data = page_predictions_pop_data[~page_predictions_pop_data['Population mid-2018 (millions)'].isnull()]

Finally, we'll adjust the dataframe to match the expected schema and save it to the data_clean folder.

In [61]:
# drop geography and reorder
page_predictions_pop_data = page_predictions_pop_data[['country','page','rev_id','prediction','Population mid-2018 (millions)']]

In [63]:
# rename
page_predictions_pop_data.columns = ['country','article_name','revision_id','article_quality','population']

In [64]:
page_predictions_pop_data.head()

Unnamed: 0,country,article_name,revision_id,article_quality,population
0,Chad,Bir I of Kanem,355319463,Stub,15.4
2,Cambodia,Yos Por,393822005,Stub,16.0
5,Canada,Robert Douglas Cook,401577829,Stub,37.2
6,Egypt,List of Grand Viziers of Egypt,442937236,Stub,97.0
7,Pakistan,Sehba Musharraf,448555418,Stub,200.6


In [65]:
# write to csv
page_predictions_pop_data.to_csv("../data_clean/wp_wpds_politicians_by_country.csv", index = False)
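The select-and-rename steps above depend on the column order staying in sync between the two cells. A dictionary-based `rename` ties each output name to its source column instead. The sketch below is one possible way to write it, using a hypothetical `SCHEMA` mapping and a one-row sample built from the outputs above.

```python
import pandas as pd

# Maps the notebook's column names to the output schema used above
SCHEMA = {
    "country": "country",
    "page": "article_name",
    "rev_id": "revision_id",
    "prediction": "article_quality",
    "Population mid-2018 (millions)": "population",
}

# Hypothetical one-row sample with the columns present after the merges
sample = pd.DataFrame({
    "page": ["Bir I of Kanem"],
    "country": ["Chad"],
    "rev_id": [355319463],
    "prediction": ["Stub"],
    "Geography": ["Chad"],
    "Population mid-2018 (millions)": [15.4],
})

# Select columns in schema order (dropping 'Geography'), then rename by key
clean = sample[list(SCHEMA)].rename(columns=SCHEMA)
print(clean)
```

Because the selection is driven by the dictionary keys, a missing source column raises a `KeyError` right away instead of silently mislabeling a column.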

#### Analysis