In [1]:
import numpy as np
import pandas as pd

# A2: BIAS IN DATA

This assignment aims to look at the possible biases in data, and present it in a format that is readable and reproducible.

This notebook is divided into the following sections:

- [Reading Data](#Reading-data)
- [Initial Data Cleaning](#Initial-Data-Cleaning)
- [ORES Analysis](#Using-ORES-API-for-predicting-article-quality)
- [Final Data Cleaning and consolidation](#Final-Data-Cleaning-and-Consolidation)
- [Analysis](#Data-Analysis)
    - [Creating Percentages](#In-the-first-half-of-this-analysis-we-will-calculate-the-percentages-and-store-them-in-a-data-frame)
    - [Looking at the tables](#The-Second-half-of-this-analysis-will-be-looking-at-the-10-lowest-and-highest-ranking-countries-in:)
- [Conclusion](#Conclusion)

## Reading data

### The data we will be analyzing is from two sources:

- Population data: this data is drawn from the world population datasheet at: https://www.prb.org/international/indicator/population/table/
- Politicle articles data by country is drawn from https://figshare.com/articles/Untitled_Item/5513449 (the documentation for the same is available at the link)

The raw data files are present in the data folder in the parent repo.

In the following cells we will read the data from the local folders into dataframes:

In [2]:
politician_data = pd.read_csv('data/raw_data/page_data.csv')

In [3]:
population_data = pd.read_csv('data/raw_data/WPDS_2018_data.csv')

## Initial Data Cleaning

As mentioned in the [A2 assignment wiki](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)/Assignments#Cleaning_the_data), certain rows in both dataframes need to be filtered out, as they have nothing to do with our current analysis.

### 1) We will remove all the rows containing the word "Template" in the page column of the politician_data
#### Step 1: Split the page column by the delimiter ':' into two columns

In [4]:
politician_data[['Template','Page_Only']] = politician_data.page.str.split(":",expand=True) 

In [5]:
politician_data.head()

Unnamed: 0,page,country,rev_id,Template,Page_Only
0,Template:ZambiaProvincialMinisters,Zambia,235107991,Template,ZambiaProvincialMinisters
1,Bir I of Kanem,Chad,355319463,Bir I of Kanem,
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Template,Zimbabwe-politician-stub
3,Template:Uganda-politician-stub,Uganda,391862070,Template,Uganda-politician-stub
4,Template:Namibia-politician-stub,Namibia,391862409,Template,Namibia-politician-stub


#### Step 2: Remove all rows with Template value "Template"

In [6]:
politician_data = politician_data[politician_data.Template != "Template"]

#### Step 3: Finally we will drop the newly generated columns

In [7]:
del politician_data["Template"]

In [8]:
del politician_data["Page_Only"]

In [9]:
politician_data.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


#### Step 4: We will write this clean data into a csv file for future analysis


In [10]:
politician_data.to_csv('data/generated_data/clean_political_article_data.csv')

### 2) We will now remove all rows with capital country names in the population_data , and store it in an intermediate csv file for future analysis

#### Step 1: Check for only uppercase values in the "Geography" column and store it in another column "is_upper"

In [11]:
population_data['is_upper'] = list(map(lambda x: x.isupper(), population_data['Geography']))

In [12]:
population_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions),is_upper
0,AFRICA,1284.0,True
1,Algeria,42.7,False
2,Egypt,97.0,False
3,Libya,6.5,False
4,Morocco,35.2,False


#### Step 2: Now we will store all rows with is_upper values as "True" in an intermediate csv

In [13]:
population_data_cumulative = population_data[population_data.is_upper == True]

In [14]:
del population_data_cumulative["is_upper"]

In [15]:
population_data_cumulative.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284
56,NORTHERN AMERICA,365
59,LATIN AMERICA AND THE CARIBBEAN,649
95,ASIA,4536
144,EUROPE,746


In [16]:
population_data_cumulative.to_csv('data/generated_data/cumulative_population.csv')

#### Step 3: Finally we will delete all rows with is_upper value True from our population_data DataFrame

In [17]:
population_data = population_data[population_data.is_upper != True]

#### Step 4: Deleting superfluous rows

In [18]:
del population_data['is_upper']

In [19]:
population_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2
5,Sudan,41.7


#### Step 5:  Write clean population data into its own csv

In [20]:
population_data.to_csv('data/generated_data/clean_population_data.csv')

## Using ORES API for predicting article quality

Our politician_data contains rev_ids which are used by the ORES API to categorize the article by "quality". The documentation for this API can be found [here](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model).

In the following cells we will use the [template](https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb) provided in the [A2 assignment wiki](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)/Assignments#Getting_article_quality_predictions) to get these predictions from the ORES API 

As per the template code provided in the A2 assignment, we know that the json format returned is as follows:

```{
    "enwiki": {
        "models": {
            "wp10": {
                "version": "0.8.1"
            }
        },
        "scores": {
            "757539710": {
                "wp10": {
                    "score": {
                        "prediction": "Start",
                        "probability": {
                            "B": 0.06907655349650586,
                            "C": 0.1730497923608886,
                            "FA": 0.003738253691275387,
                            "GA": 0.007083489019420698,
                            "Start": 0.7205318510650603,
                            "Stub": 0.02652006036684928
                        }
                    }
                }
            },
        }
    }
}```

And we want to isolate the enwiki -> scores -> {revid} -> wp10 -> score -> prediction value, 

#### Step 1: thus we will create a function that returns just that value

In [21]:
import requests
import json

headers = {'User-Agent' : 'https://github.com/apoorva-sh', 'From' : 'apshetty@uw.edu'}

def get_ores_data(revision_ids, headers):
    
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids) #'smushing' rev ids
              }
    api_call = requests.get(endpoint.format(**params))
    response = api_call.json()
    
    predictions = []
    
    # Isolating prediction value
    for key, value in response["enwiki"]["scores"].items():
        item_wp10 = value["wp10"]
        if "error" not in item_wp10: #filtering out error values
            prediction = [
                int(key),
                item_wp10["score"]["prediction"]
            ]
            predictions.append(prediction)
    
    return predictions


#### Step 2: Now we will create a function that sends 100 revision ids at a time to the get_ores_data function

In [22]:
def get_predictions(rev_ids):
    predictions = []
    rev_list = []
    i = 0
    for rev in rev_ids:
        if i < 100 :
            rev_list.append(rev)
            i = i+1
        else:
            predictions.append(get_ores_data(rev_list,headers))
            i=0
            rev_list = []
    return predictions

#### Step 3: Using the above two functions to get a set of predictions in encapsulated list format (lists within lists)

In [23]:
preds = get_predictions(politician_data.rev_id)

#### Step 4: Now preds contains prediction in subsets of 100, so we will convert it into a temporary collection of dataframes 

In [24]:
temporary = [pd.DataFrame(pred_sub) for pred_sub in preds]

#### Step 5: Now we will concatenate this into one prediction dataframe

In [25]:
quality_prediction = pd.concat(temporary)

In [26]:
quality_prediction.head()

Unnamed: 0,0,1
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub


In [27]:
#Renaming columns
quality_prediction = quality_prediction.rename(columns={0: "rev_id", 1: "prediction"})

quality_prediction.head()

Unnamed: 0,rev_id,prediction
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub


#### Step 6: We will store this value for further analysis

In [28]:
#Storing it in csv file

quality_prediction.to_csv('data/generated_data/quality_prediction.csv')


## Final Data Cleaning and Consolidation

### 1) Merging our clean politician data and population data

#### Step 1: To do this we will first read from our clean data csvs

In [29]:
#Reading from clean population data
clean_popdata = pd.read_csv('data/generated_data/clean_population_data.csv')

clean_popdata.head(1)

Unnamed: 0.1,Unnamed: 0,Geography,Population mid-2018 (millions)
0,1,Algeria,42.7


#### Step 2: Deleting superfluous columns and renaming "Geography" to "country"

In [30]:
del clean_popdata['Unnamed: 0']
clean_popdata = clean_popdata.rename(columns={'Geography':'country'})
clean_popdata.head(1)

Unnamed: 0,country,Population mid-2018 (millions)
0,Algeria,42.7


In [31]:
#Reading from clean political data
clean_poldata = pd.read_csv('data/generated_data/clean_political_article_data.csv')

clean_poldata.head(1)

Unnamed: 0.1,Unnamed: 0,page,country,rev_id
0,1,Bir I of Kanem,Chad,355319463


In [32]:
del clean_poldata['Unnamed: 0']

clean_poldata.head(1)

Unnamed: 0,page,country,rev_id
0,Bir I of Kanem,Chad,355319463


#### Now merging this data may lead us to losing some records (If there is no population data for a country/ no political article data for a country), looking at the shapes before merging:

In [33]:
clean_popdata.shape

(201, 2)

In [34]:
clean_poldata.shape

(46701, 3)

#### Step 3: Before merging our prediction data we will maintain a list of countries that are left out of this analysis, by seeing which countries contain no population data/political data:

In [35]:
pop_countries = clean_popdata.country.unique() #Get all unique countries for population data

In [36]:
pol_countries = clean_poldata.country.unique() #Get all unique countries for political data

In [37]:
missing_countries = []
for a in pop_countries: #Check which countries do not have political data 
    if a not in pol_countries:
        missing_countries.append(a)

for a in pol_countries: #Check which countries do not have population data
    if a not in pop_countries:
        missing_countries.append(a)

In [38]:
len(missing_countries)

60

#### Now we will write these 60 countries into a csv

In [39]:
missing_countries_df = pd.DataFrame(missing_countries) # Writing missing countries into a dataframe

In [40]:
missing_countries_df = missing_countries_df.rename(columns={0:"countries"}) #Renaming columns

In [41]:
missing_countries_df.to_csv('data/generated_data/wp_wpds_countries-no_match.csv') #Writing to csv

#### Step 4: Merge our clean political article data and population data

In [42]:
merged_data = pd.merge(clean_poldata,clean_popdata,on="country",how="inner")

Shape after merging:

In [43]:
merged_data.shape

(44618, 4)

### 2) Merging this intermediate data with prediction data
#### Step 1: Load predictions data

In [44]:
pred_data = pd.read_csv('data/generated_data/quality_prediction.csv')

In [45]:
del pred_data['Unnamed: 0'] #removing superfluous columns

pred_data.head(1)

Unnamed: 0,rev_id,prediction
0,355319463,Stub


In [46]:
merged_data = pd.merge(merged_data,pred_data,on="rev_id",how="inner")

In [47]:
merged_data.head(1)

Unnamed: 0,page,country,rev_id,Population mid-2018 (millions),prediction
0,Bir I of Kanem,Chad,355319463,15.4,Stub


#### Step 2: We will write this data into a csv file for further analysis

In [48]:
merged_data.to_csv('data/generated_data/wp_wpds_politicians_by_country.csv')

## Data Analysis

We want to analyze this data based on

    1) percentage of high quality articles out of total articles in a country
    
    2) percentage of articles to population in a country
    
    3) percentage of articles to region population
    
Now the ORES API ([documentation](https://www.mediawiki.org/wiki/ORES)) categorizes each article as :

    1) FA - Featured article
    2) GA - Good article
    3) B - B-class article
    4) C - C-class article
    5) Start - Start-class article
    6) Stub - Stub-class article
    
We will take the first two categories to mean a "high quality article"

### In the first half of this analysis we will calculate the percentages and store them in a data frame

#### Step 1: Read data from csv

In [49]:
data = pd.read_csv('data/generated_data/wp_wpds_politicians_by_country.csv')

In [50]:
#Remove superfluous columns

del data['Unnamed: 0']

data.head(1)

Unnamed: 0,page,country,rev_id,Population mid-2018 (millions),prediction
0,Bir I of Kanem,Chad,355319463,15.4,Stub


#### Step 2: Store tuples with prediction "FA" or "GA" in a separate dataframe

In [51]:
highquality_FA = data[data.prediction == 'FA']

In [52]:
highquality_GA = data[data.prediction == 'GA']

In [53]:
highquality_df = pd.concat([highquality_FA,highquality_GA],axis=0) #concatenate data one after the other

#### Step 3: Count of high quality articles by country

In [54]:
highquality_df = highquality_df.groupby('country').count()[['rev_id']] #Grouping by country we get 
                                                                       #the count of articles for each country


#### Step 4. Renaming rev_id to article count and sorting by article count we get:

In [55]:
highquality_df = highquality_df.rename(columns={"rev_id":"Num of highquality articles"})
highquality_df = highquality_df.sort_values('Num of highquality articles',ascending = False)

In [56]:
highquality_df.head(10)

Unnamed: 0_level_0,Num of highquality articles
country,Unnamed: 1_level_1
United States,77
United Kingdom,54
China,41
Romania,39
Australia,39
Spain,35
Russia,29
France,23
Canada,22
Ireland,21


Thus we have the top 10 countries with high quality articles as listed above 
(Note: this is without factoring in population), thus if an ORES user was to just analyze articles based on this data they would not take into account the overall population/ the overall number of articles generated from a country

### Now let us look at this in proportion to the total number of articles

#### Step 1: Get total count of articles for a country

In [57]:
allquality_df = data.groupby('country').count()[['rev_id']]

#### Step 2: Renaming rev_id to article count and sorting by article count we get:

In [58]:
allquality_df = allquality_df.rename(columns={"rev_id":"Num of articles"})
allquality_df = allquality_df.sort_values('Num of articles',ascending = False)

In [59]:
allquality_df.head(10)

Unnamed: 0_level_0,Num of articles
country,Unnamed: 1_level_1
France,1654
Australia,1544
China,1114
Mexico,1064
United States,1061
Pakistan,1011
India,970
Russia,865
Spain,864
United Kingdom,848


#### Step 3: Merging this data we get:

In [60]:
merged_data = pd.merge(highquality_df,allquality_df,on="country",how="outer")

In [61]:
merged_data.head(1)

Unnamed: 0_level_0,Num of highquality articles,Num of articles
country,Unnamed: 1_level_1,Unnamed: 2_level_1
United States,77.0,1061


In [62]:
merged_data = merged_data.fillna(0) #Some countries have no high quality articles, thus fill NaN values with 0

In [63]:
merged_data['percentage of highquality to total'] = merged_data['Num of highquality articles']*100/merged_data['Num of articles']

In [64]:
merged_data.head()

Unnamed: 0_level_0,Num of highquality articles,Num of articles,percentage of highquality to total
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
United States,77.0,1061,7.257304
United Kingdom,54.0,848,6.367925
China,41.0,1114,3.680431
Romania,39.0,340,11.470588
Australia,39.0,1544,2.525907


### Adding population data to this merged_data

#### Step 1. Getting population by country as a numeric value

In [65]:
data = data.rename(columns={"Population mid-2018 (millions)":"population"})
data["population"] = data["population"].apply(lambda s: s.replace(",", ""))
data['population'] = pd.to_numeric(data['population'],errors='coerce')*1000000 # Convert to numeric

In [66]:
population_data = data.groupby('country').mean()[['population']] #Getting population for each country

In [67]:
population_data.head(1)

Unnamed: 0_level_0,population
country,Unnamed: 1_level_1
Afghanistan,36500000.0


#### Step 2. Merging this with existing article data

In [68]:
final_data = pd.merge(merged_data,population_data,on="country",how="outer")
final_data.head()

Unnamed: 0_level_0,Num of highquality articles,Num of articles,percentage of highquality to total,population
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
United States,77.0,1061,7.257304,328000000.0
United Kingdom,54.0,848,6.367925,66400000.0
China,41.0,1114,3.680431,1393800000.0
Romania,39.0,340,11.470588,19500000.0
Australia,39.0,1544,2.525907,24100000.0


#### Step 3. Getting percentage of articles to population

In [69]:
final_data['percentage of articles to population'] = final_data['Num of articles']*100/final_data['population']

In [70]:
#The following code is to simplify the region specific calculations done in Step 4.
final_data.index.name = 'country'
final_data.reset_index(inplace=True)
final_data.head()

Unnamed: 0_level_0,Num of highquality articles,Num of articles,percentage of highquality to total,population,percentage of articles to population
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
United States,77.0,1061,7.257304,328000000.0,0.000323
United Kingdom,54.0,848,6.367925,66400000.0,0.001277
China,41.0,1114,3.680431,1393800000.0,8e-05
Romania,39.0,340,11.470588,19500000.0,0.001744
Australia,39.0,1544,2.525907,24100000.0,0.006407


#### Step 4. Loading region specific data

In [118]:
region_data = pd.read_csv('data/generated_data/cumulative_population.csv')

In [119]:
del region_data['Unnamed: 0'] #removing superfluous columns

Now from the data structure we have the following countries falling under each region:

- AFRICA:
    Algeria,
    Egypt,
    Libya,
    Morocco,
    Sudan,
    Tunisia,
    Western Sahara,
    Benin,
    Burkina Faso,
    Cape Verde,
    Cote d'Ivoire,
    Gambia,
    Ghana,
    Guinea,
    Guinea-Bissau,
    Liberia,
    Mali
    ,Mauritania
    ,Niger
    ,Nigeria
    ,Senegal
    ,Sierra Leone
    ,Togo
    ,Burundi
    ,Comoros
    ,Djibouti
    ,Eritrea
    ,Ethiopia
    ,Kenya
    ,Madagascar
    ,Malawi
    ,Mauritius
    ,Mozambique
    ,Rwanda
    ,Seychelles
    ,Somalia
    ,South Sudan
    ,Tanzania
    ,Uganda
    ,Zambia
    ,Zimbabwe
    ,Angola
    ,Cameroon
    ,Central African Republic
    ,Chad
    ,Congo
    ,Congo, Dem. Rep.
    ,Equatorial Guinea
    ,Gabon
    ,Sao Tome and Principe
    ,Botswana
    ,eSwatini
    ,Lesotho
    ,Namibia
    ,South Africa
    
- NORTHERN AMERICA:
    Canada, United States

- LATIN AMERICA AND THE CARIBBEAN:
    Belize
    ,Costa Rica
    ,El Salvador
    ,Guatemala
    ,Honduras
    ,Mexico
    ,Nicaragua
    ,Panama
    ,Antigua and Barbuda
    ,Bahamas
    ,Barbados
    ,Cuba
    ,Curacao
    ,Dominica
    ,Dominican Republic
    ,Grenada
    ,Haiti
    ,Jamaica
    ,Puerto Rico
    ,St. Kitts-Nevis
    ,Saint Lucia
    ,St. Vincent and the Grenadines
    ,Trinidad and Tobago
    ,Argentina
    ,Bolivia
    ,Brazil
    ,Chile
    ,Colombia
    ,Ecuador
    ,Guyana
    ,Paraguay
    ,Peru
    ,Suriname
    ,Uruguay
    ,Venezuela
    
- ASIA:
    Armenia
    ,Azerbaijan
    ,Bahrain
    ,Cyprus
    ,Georgia
    ,Iraq
    ,Israel
    ,Jordan
    ,Kuwait
    ,Lebanon
    ,Oman
    ,Qatar
    ,Saudi Arabia
    ,Syria
    ,Turkey
    ,United Arab Emirates
    ,Yemen
    ,Kazakhstan
    ,Kyrgyzstan
    ,Tajikistan
    ,Turkmenistan
    ,Uzbekistan
    ,Afghanistan
    ,Bangladesh
    ,Bhutan
    ,India
    ,Iran
    ,Maldives
    ,Nepal
    ,Pakistan
    ,Sri Lanka
    ,Brunei
    ,Cambodia
    ,Indonesia
    ,Laos
    ,Malaysia
    ,Myanmar
    ,Philippines
    ,Singapore
    ,Thailand
    ,Timor-Leste
    ,Vietnam
    ,China
    ,Japan
    ,Korea, North
    ,Korea, South
    ,Mongolia
    ,Taiwan
    
- EUROPE:
    Denmark
    ,Estonia
    ,Finland
    ,Iceland
    ,Ireland
    ,Latvia
    ,Lithuania
    ,Norway
    ,Sweden
    ,United Kingdom
    ,Austria
    ,Belgium
    ,France
    ,Germany
    ,Liechtenstein
    ,Luxembourg
    ,Monaco
    ,Netherlands
    ,Switzerland
    ,Belarus
    ,Bulgaria
    ,Czechia
    ,Hungary
    ,Moldova
    ,Poland
    ,Romania
    ,Russia
    ,Slovakia
    ,Ukraine
    ,Albania
    ,Andorra
    ,Bosnia-Herzegovina
    ,Croatia
    ,Greece
    ,Italy
    ,Kosovo
    ,Macedonia
    ,Malta
    ,Montenegro
    ,Portugal
    ,San Marino
    ,Serbia
    ,Slovenia
    ,Spain

- OCEANIA:
    Australia
    ,Federated States of Micronesia
    ,Fiji
    ,French Polynesia
    ,Guam
    ,Kiribati
    ,Marshall Islands
    ,Nauru
    ,New Caledonia
    ,New Zealand
    ,Palau
    ,Papua New Guinea
    ,Samoa
    ,Solomon Islands
    ,Tonga
    ,Tuvalu
    ,Vanuatu
    
    
    
#### Step 5. Adding up article counts for each region

In [120]:
#Article count set to zero
AFRICA = 0
NA = 0
LA = 0
ASIA = 0
EUROPE = 0
OCEANIA = 0

#High quality counts set to 0
AFRICA_hq = 0
NA_hq = 0
LA_hq = 0
ASIA_hq = 0
EUROPE_hq = 0
OCEANIA_hq = 0

#List of all countries belonging to a region
Africa_l = ["Algeria", "Egypt", "Libya", 
            "Morocco", "Sudan", "Tunisia", 
            "Western Sahara", "Benin", "Burkina Faso", 
            "Cape Verde", "Cote d'Ivoire", "Gambia", "Ghana", 
            "Guinea", "Guinea-Bissau", "Liberia, Mali" ,"Mauritania" ,"Niger" ,"Nigeria" ,"Senegal" 
            ,"Sierra Leone" ,"Togo" ,"Burundi" ,"Comoros" ,"Djibouti" ,"Eritrea" ,"Ethiopia" ,"Kenya" ,"Madagascar" ,
            "Malawi" ,"Mauritius" ,"Mozambique" ,"Rwanda" ,"Seychelles" ,"Somalia" ,"South Sudan" ,"Tanzania" ,"Uganda" ,
            "Zambia" ,
            "Zimbabwe" ,"Angola" ,"Cameroon" ,"Central African Republic" ,"Chad" ,"Congo" ,"Congo, Dem. Rep." ,"Equatorial Guinea" ,
            "Gabon" ,"Sao Tome and Principe" ,"Botswana" ,"eSwatini" ,"Lesotho" ,"Namibia" ,"South Africa"]
            
NA_l = ["Canada","United States"]

LA_l = ["Belize" ,"Costa Rica" ,"El Salvador" ,"Guatemala" ,"Honduras" ,"Mexico" ,"Nicaragua" ,"Panama" ,"Antigua and Barbuda" ,
        "Bahamas" ,"Barbados" ,"Cuba" ,"Curacao" ,"Dominica" ,"Dominican Republic" ,"Grenada" ,"Haiti" ,"Jamaica" ,"Puerto Rico" ,
        "St. Kitts-Nevis" ,"Saint Lucia" ,"St. Vincent and the Grenadines" ,"Trinidad and Tobago" ,"Argentina" ,"Bolivia" ,
        "Brazil" ,"Chile" ,"Colombia" ,"Ecuador" ,"Guyana" ,"Paraguay" ,"Peru" ,"Suriname" ,"Uruguay" ,"Venezuela"]

Asia_l = ["Armenia" ,"Azerbaijan" ,"Bahrain" ,"Cyprus" ,"Georgia" ,"Iraq" ,"Israel" ,"Jordan" ,"Kuwait" ,"Lebanon" ,"Oman" ,"Qatar" 
          ,"Saudi Arabia" ,"Syria" ,"Turkey" ,"United Arab Emirates" ,"Yemen" ,"Kazakhstan" ,"Kyrgyzstan" ,"Tajikistan" 
          ,"Turkmenistan" ,"Uzbekistan" ,"Afghanistan" ,"Bangladesh" ,"Bhutan" ,"India" ,"Iran" ,"Maldives" ,"Nepal" 
          ,"Pakistan" ,"Sri Lanka" ,"Brunei" ,"Cambodia" ,"Indonesia" ,"Laos" ,"Malaysia" ,"Myanmar" ,"Philippines" 
          ,"Singapore" ,"Thailand" ,"Timor-Leste" ,"Vietnam" ,"China" ,"Japan" ,"Korea, North" ,"Korea, South" ,"Mongolia" ,"Taiwan"]

Euro_l = ["Denmark" ,"Estonia" ,"Finland" ,"Iceland" ,"Ireland" ,"Latvia" ,"Lithuania" ,"Norway" ,"Sweden" ,"United Kingdom" ,
          "Austria" ,"Belgium" ,"France" ,"Germany" ,"Liechtenstein" ,"Luxembourg" ,"Monaco" ,"Netherlands" ,"Switzerland" ,
          "Belarus" ,"Bulgaria" ,"Czechia" ,"Hungary" ,"Moldova" ,"Poland" ,"Romania" ,"Russia" ,"Slovakia" ,"Ukraine" ,"Albania" ,
          "Andorra" ,"Bosnia-Herzegovina" ,"Croatia" ,"Greece" ,"Italy" ,"Kosovo" ,"Macedonia" ,"Malta" ,"Montenegro" ,"Portugal" ,
          "San Marino" ,"Serbia" ,"Slovenia" ,"Spain"]

Oceania_l = ["Australia" ,"Federated States of Micronesia" ,"Fiji" ,"French Polynesia" ,"Guam" ,"Kiribati" ,"Marshall Islands" 
             ,"Nauru" ,"New Caledonia" ,"New Zealand" ,"Palau" ,"Papua New Guinea" ,"Samoa" ,"Solomon Islands" ,"Tonga" ,"Tuvalu" 
             ,"Vanuatu"]

#Traversing a dataframe to get the number of articles (high quality and total)
for index,r in final_data.iterrows():
    if r["country"] in Africa_l:
        AFRICA = AFRICA + r['Num of articles']
        AFRICA_hq = AFRICA_hq + r['Num of highquality articles']
    elif r["country"] in NA_l:
        NA = NA + r['Num of articles']
        NA_hq = NA_hq + r['Num of highquality articles']
    elif r["country"] in LA_l:
        LA = LA + r['Num of articles']
        LA_hq = LA_hq + r['Num of highquality articles']
    elif r["country"] in Asia_l:
        ASIA = ASIA + r['Num of articles']
        ASIA_hq = ASIA_hq + r['Num of highquality articles']
    elif r["country"] in Euro_l:
        EUROPE = EUROPE + r['Num of articles']
        EUROPE_hq = EUROPE_hq + r['Num of highquality articles']
    elif r["country"] in Oceania_l:
        OCEANIA = OCEANIA + r['Num of articles']
        OCEANIA_hq = OCEANIA_hq + r['Num of highquality articles']
    

#### Step 6. Merging article counts and population data for regions

In [121]:
#Creating data frame for counts
region_data_article= pd.DataFrame({'Geography': ["AFRICA", "NORTHERN AMERICA", "LATIN AMERICA AND THE CARIBBEAN","ASIA","EUROPE","OCEANIA"], 'Num of Articles': [AFRICA, NA,LA,ASIA,EUROPE,OCEANIA], 'Num of highquality Articles':
                                  [AFRICA_hq,NA_hq,LA_hq,ASIA_hq,EUROPE_hq,OCEANIA_hq]}) 

In [122]:
region_data = pd.merge(region_data,region_data_article,on="Geography",how="inner")

In [129]:
region_data.head(1)

Unnamed: 0,Geography,population,Num of Articles,Num of highquality Articles,percentage of articles to population,percentage of high quality articles
0,AFRICA,1284000000,6507,116.0,0.000507,1.782696


#### Step 7. Converting population to numeric value, and calculatimg percentage of articles

In [124]:
region_data = region_data.rename(columns={"Population mid-2018 (millions)":"population"}) #renaming column
region_data["population"] = region_data["population"].apply(lambda s: s.replace(",", ""))
region_data['population'] = pd.to_numeric(region_data['population'],errors='coerce')*1000000 # Convert to numeric

In [125]:
region_data['percentage of articles to population'] = region_data['Num of Articles']*100/region_data['population']

In [126]:
region_data['percentage of high quality articles'] = region_data['Num of highquality Articles']*100/region_data['Num of Articles']

### The Second half of this analysis will be looking at the 10 lowest and highest ranking countries in:

#### a) Percentage of high quality articles
##### - TOP TEN

In [71]:
final_data.sort_values('percentage of highquality to total',ascending = False).head(10)

Unnamed: 0_level_0,Num of highquality articles,Num of articles,percentage of highquality to total,population,percentage of articles to population
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"Korea, North",6.0,34,17.647059,25600000.0,0.000133
Mauritania,6.0,47,12.765957,4500000.0,0.001044
Central African Republic,8.0,65,12.307692,4700000.0,0.001383
Saudi Arabia,14.0,115,12.173913,33400000.0,0.000344
Romania,39.0,340,11.470588,19500000.0,0.001744
Tuvalu,5.0,54,9.259259,10000.0,0.54
Bhutan,3.0,33,9.090909,800000.0,0.004125
Dominica,1.0,12,8.333333,70000.0,0.017143
Syria,10.0,127,7.874016,18300000.0,0.000694
Benin,7.0,91,7.692308,11500000.0,0.000791


##### - LOWEST TEN

In [72]:
final_data.sort_values('percentage of highquality to total',ascending = True).head(10)

Unnamed: 0_level_0,Num of highquality articles,Num of articles,percentage of highquality to total,population,percentage of articles to population
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Barbados,0.0,13,0.0,300000.0,0.004333
Macedonia,0.0,65,0.0,2100000.0,0.003095
Kazakhstan,0.0,77,0.0,18400000.0,0.000418
San Marino,0.0,81,0.0,30000.0,0.27
Malta,0.0,102,0.0,500000.0,0.0204
Cameroon,0.0,102,0.0,25600000.0,0.000398
Angola,0.0,106,0.0,30400000.0,0.000349
Slovakia,0.0,116,0.0,5400000.0,0.002148
Tonga,0.0,63,0.0,100000.0,0.063
Tunisia,0.0,138,0.0,11600000.0,0.00119


#### b) Percentage of articles to population
##### - TOP TEN

In [73]:
final_data.sort_values('percentage of articles to population',ascending = False).head(10)

Unnamed: 0_level_0,Num of highquality articles,Num of articles,percentage of highquality to total,population,percentage of articles to population
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Tuvalu,5.0,54,9.259259,10000.0,0.54
Nauru,0.0,52,0.0,10000.0,0.52
San Marino,0.0,81,0.0,30000.0,0.27
Monaco,0.0,40,0.0,40000.0,0.1
Liechtenstein,0.0,28,0.0,40000.0,0.07
Tonga,0.0,63,0.0,100000.0,0.063
Marshall Islands,0.0,37,0.0,60000.0,0.061667
Iceland,2.0,199,1.005025,400000.0,0.04975
Andorra,0.0,34,0.0,80000.0,0.0425
Grenada,1.0,36,2.777778,100000.0,0.036


##### - LOWEST TEN

In [74]:
final_data.sort_values('percentage of articles to population',ascending = True).head(10)

Unnamed: 0_level_0,Num of highquality articles,Num of articles,percentage of highquality to total,population,percentage of articles to population
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
India,17.0,970,1.752577,1371300000.0,7.1e-05
Indonesia,10.0,206,4.854369,265200000.0,7.8e-05
China,41.0,1114,3.680431,1393800000.0,8e-05
Uzbekistan,2.0,28,7.142857,32900000.0,8.5e-05
Ethiopia,2.0,101,1.980198,107500000.0,9.4e-05
"Korea, North",6.0,34,17.647059,25600000.0,0.000133
Zambia,0.0,24,0.0,17700000.0,0.000136
Thailand,3.0,111,2.702703,66200000.0,0.000168
Mozambique,0.0,57,0.0,30500000.0,0.000187
Bangladesh,3.0,319,0.940439,166400000.0,0.000192


#### c) Highest to lowest region-wise ranking of
##### - PERCENTAGE OF ARTICLES TO POPULATION

In [127]:
region_data.sort_values('percentage of articles to population',ascending=False)

Unnamed: 0,Geography,population,Num of Articles,Num of highquality Articles,percentage of articles to population,percentage of high quality articles
5,OCEANIA,41000000,3094,64.0,0.007546,2.06852
4,EUROPE,746000000,15694,317.0,0.002104,2.01988
2,LATIN AMERICA AND THE CARIBBEAN,649000000,5116,67.0,0.000788,1.309617
1,NORTHERN AMERICA,365000000,1893,99.0,0.000519,5.229794
0,AFRICA,1284000000,6507,116.0,0.000507,1.782696
3,ASIA,4536000000,11414,306.0,0.000252,2.680918


##### - PERCENTAGE OF HIGH QUALITY ARTICLES 

In [128]:
region_data.sort_values('percentage of high quality articles',ascending=False)

Unnamed: 0,Geography,population,Num of Articles,Num of highquality Articles,percentage of articles to population,percentage of high quality articles
1,NORTHERN AMERICA,365000000,1893,99.0,0.000519,5.229794
3,ASIA,4536000000,11414,306.0,0.000252,2.680918
5,OCEANIA,41000000,3094,64.0,0.007546,2.06852
4,EUROPE,746000000,15694,317.0,0.002104,2.01988
0,AFRICA,1284000000,6507,116.0,0.000507,1.782696
2,LATIN AMERICA AND THE CARIBBEAN,649000000,5116,67.0,0.000788,1.309617


### Conclusion

- From the above tables we see that looking at just the ORES data would give us a completely different analysis, as we are looking at only "English" articles the country with the most highest ranking articles is as expected a country with it's first language being English, however when you look at this as a percentage of the overall number of articles interestingly you see countries that do not have English as their first language
- Region wise this still holds to be true in terms of percentage of high-quality articles, which maybe because of the larger number of english articles written by these countries overpowers the granular look at countries with fewer articles

In [75]:
final_data.sort_values('population',ascending=False).head(10)

Unnamed: 0_level_0,Num of highquality articles,Num of articles,percentage of highquality to total,population,percentage of articles to population
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
China,41.0,1114,3.680431,1393800000.0,8e-05
India,17.0,970,1.752577,1371300000.0,7.1e-05
United States,77.0,1061,7.257304,328000000.0,0.000323
Indonesia,10.0,206,4.854369,265200000.0,7.8e-05
Brazil,5.0,541,0.924214,209400000.0,0.000258
Pakistan,19.0,1011,1.879327,200600000.0,0.000504
Nigeria,2.0,669,0.298954,195900000.0,0.000342
Bangladesh,3.0,319,0.940439,166400000.0,0.000192
Russia,29.0,865,3.352601,147300000.0,0.000587
Mexico,9.0,1064,0.845865,130800000.0,0.000813


- The analysis of percentage of articles to population is very much dependent on the population and says little about political articles Especially in largely populated countries such as India, China, Indonesia, just because a country has a higher population does not imply that they will have more politicians/ articles about politicians.
- This is also supported in the region wise glance at ratio of article to population, Asia, Africa, and North America ranking bottom in that analysis