# Analysis
This notebook contains the data processing and analysis code used to produce the six tables detailed below.

## Importing Libraries

In [5]:
import pandas as pd
import copy

## Loading the data and preliminary cleaning

Note: since population is reported in millions to one decimal place, two countries with corresponding Wikipedia articles (Monaco and Tuvalu) have a listed population of 0 million. Given this lack of granularity, I have chosen to fill in these populations as 50,000 (0.05 million), roughly the largest population that would still have been rounded down to 0.0 million rather than reported as 0.1 million. As a result, every ratio derived for these two countries is a lower bound (i.e. the true `articles_per_capita` must be greater than or equal to the reported number).

In [42]:
data = pd.read_csv("../data_clean/wp_politicians_by_country.csv")

# Replace zero populations with 0.05 (i.e. 50K, since population is in millions)
data.loc[data['population'] == 0, 'population'] = 0.05

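The lower-bound reasoning above can be checked with a small sketch. The toy rows and the `ASSUMED_MAX_POP` name are hypothetical and only illustrate the arithmetic; they are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy rows: one row per article, population in millions.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "population": [0.0, 0.0, 2.0, 2.0, 2.0],
})

# Assumed upper bound (in millions) for a population reported as 0.
ASSUMED_MAX_POP = 0.05
toy.loc[toy["population"] == 0, "population"] = ASSUMED_MAX_POP

# Articles per million people, per country.
n_articles = toy.groupby("country").size()
population = toy.groupby("country")["population"].first()
lower_bound_apc = n_articles / population
print(lower_bound_apc)
```

Because the imputed population is the largest value consistent with the rounding, any smaller true population would only increase the ratio, which is why the reported figure is a lower bound.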
## Table 1: Top 10 countries by coverage
The code below identifies the top 10 countries with the highest total articles per capita. Note that Monaco is one of the two countries for which the articles-per-capita figure is a lower bound.



In [77]:
# Compute the articles per capita
coverage_data = copy.deepcopy(data)
coverage_data = coverage_data.groupby('country').agg( Count = ('population', 'count'), population = ('population', 'mean')).reset_index()
coverage_data['articles_per_capita'] = coverage_data["Count"] / coverage_data["population"]

# Rename the apc column to make it more readable
coverage_data = coverage_data.rename(columns={'articles_per_capita': 'Articles per Capita (count/Million)', 'country': 'Country'})

#Print the Top 10 to a Table
print('Top 10 countries by coverage\n')
print(coverage_data.sort_values(by='Articles per Capita (count/Million)', ascending=False).head(10)[['Country', 'Articles per Capita (count/Million)']].round(2).to_string(index=False))

Top 10 countries by coverage

                       Country  Articles per Capita (count/Million)
           Antigua and Barbuda                               330.00
                        Monaco                               200.00
Federated States of Micronesia                               140.00
              Marshall Islands                               130.00
                         Tonga                               100.00
                      Barbados                                83.33
                    Seychelles                                60.00
                    Montenegro                                56.67
                      Maldives                                55.00
                        Bhutan                                55.00

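For readers who want to see the per-country computation in isolation, here is a minimal sketch on hypothetical toy data. It assumes, as in the cell above, one row per article with the country population repeated on each row; the column name `n_articles` is illustrative, and `nlargest` is used here as an alternative to sorting and taking `head`.

```python
import pandas as pd

# Hypothetical rows: one row per article, population in millions.
articles = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Z", "Z"],
    "population": [1.5, 1.5, 1.5, 10.0, 0.5, 0.5],
})

# Named aggregation: number of articles and the (constant) population per country.
per_country = (
    articles.groupby("country")
    .agg(n_articles=("population", "size"), population=("population", "first"))
    .reset_index()
)
per_country["articles_per_capita"] = per_country["n_articles"] / per_country["population"]

# Keep the two countries with the highest ratio.
top = per_country.nlargest(2, "articles_per_capita")
print(top.to_string(index=False))
```

One design note: `size` counts every row in the group, whereas `count` skips rows where the aggregated column is missing. The two agree whenever `population` has no missing values.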

## Table 2: Bottom 10 countries by coverage
The code below identifies the bottom 10 countries, i.e. those with the lowest total articles per capita.

In [76]:
#Print the Bottom 10 to a Table
print('Bottom 10 countries by coverage\n')
print(coverage_data.sort_values(by='Articles per Capita (count/Million)', ascending=True).head(10)[['Country', 'Articles per Capita (count/Million)']].round(2).to_string(index=False))

Bottom 10 countries by coverage

      Country  Articles per Capita (count/Million)
        China                                 0.01
        Ghana                                 0.09
        India                                 0.11
 Saudi Arabia                                 0.14
       Zambia                                 0.15
       Norway                                 0.18
       Israel                                 0.20
        Egypt                                 0.30
Cote d'Ivoire                                 0.32
     Ethiopia                                 0.35


## Table 3: Top 10 countries by high quality
The code below identifies the top 10 countries with the most high-quality articles (those rated FA or GA) per capita.


In [75]:
# Compute the articles per capita
hq_coverage_data = copy.deepcopy(data)

#Extract all the countries present
all_countries = pd.DataFrame({'country':hq_coverage_data['country'].unique()})

#Filter to only the high quality articles and compute aggregate scores
hq_coverage_data = hq_coverage_data[hq_coverage_data["article_quality"].isin(['FA', 'GA'])]
hq_coverage_data = hq_coverage_data.groupby('country').agg( Count = ('population', 'count'), population = ('population', 'mean')).reset_index()
hq_coverage_data['articles_per_capita'] = hq_coverage_data["Count"] / hq_coverage_data["population"]

#Remerge with the set of all countries and fill in the aggregates with 0
hq_coverage_data = pd.merge(all_countries, hq_coverage_data, on='country', how='left')
hq_coverage_data.fillna(0, inplace=True)


# Rename the apc column to make it more readable
hq_coverage_data = hq_coverage_data.rename(columns={'articles_per_capita': 'High Quality Articles per Capita (count/Million)', 'country': 'Country'})

#Print the Top 10 to a Table
print('Top 10 countries by high quality coverage\n')
print(hq_coverage_data.sort_values(by='High Quality Articles per Capita (count/Million)', ascending=False).head(10)[['Country', 'High Quality Articles per Capita (count/Million)']].round(2).to_string(index=False))

Top 10 countries by high quality coverage

              Country  High Quality Articles per Capita (count/Million)
           Montenegro                                              5.00
           Luxembourg                                              2.86
              Albania                                              2.59
               Kosovo                                              2.35
             Maldives                                              1.67
              Croatia                                              1.32
               Guyana                                              1.25
Palestinian Territory                                              1.09
            Lithuania                                              1.03
             Slovenia                                              0.95

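The filter, aggregate, and re-merge pattern used above can be sketched on hypothetical toy data as follows. This variant is an assumption-labeled alternative rather than the notebook's exact method: it takes each country's population before filtering and fills only the count column with 0, so countries without high-quality articles keep their real population instead of having every missing column zeroed.

```python
import pandas as pd

# Hypothetical per-article rows; population in millions.
rows = pd.DataFrame({
    "country": ["P", "P", "Q", "R"],
    "population": [4.0, 4.0, 2.0, 8.0],
    "article_quality": ["GA", "Stub", "Start", "FA"],
})

# Population for every country, taken before any filtering.
pop = rows.groupby("country", as_index=False)["population"].first()

# Count only the high-quality (FA or GA) articles per country.
hq_counts = (
    rows[rows["article_quality"].isin(["FA", "GA"])]
    .groupby("country")
    .size()
    .rename("hq_count")
    .reset_index()
)

# A left merge keeps countries that have no high-quality articles at all.
merged = pop.merge(hq_counts, on="country", how="left")
merged["hq_count"] = merged["hq_count"].fillna(0).astype(int)
merged["hq_per_capita"] = merged["hq_count"] / merged["population"]
print(merged.to_string(index=False))
```

The resulting ratios match what a blanket `fillna(0)` would give, but the population column stays usable for any later computation.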

## Table 4: Bottom 10 countries by high quality
The code below identifies the bottom 10 countries, i.e. those with the fewest high-quality articles per capita.

Note that many countries in the dataset (65, to be exact) do not have ANY high-quality articles. Therefore, the ten shown here are just 10 of the 65 countries with a high-quality coverage per capita score of 0.

In [74]:
#Print the Bottom 10 to a Table
print('Bottom 10 countries by high quality coverage\n')
print(hq_coverage_data.sort_values(by='High Quality Articles per Capita (count/Million)', ascending=True).head(10)[['Country', 'High Quality Articles per Capita (count/Million)']].round(2).to_string(index=False))

print('\nList of all 65 countries without any high quality articles\n')
print(hq_coverage_data[hq_coverage_data['High Quality Articles per Capita (count/Million)'] == 0]['Country'].to_list())

Bottom 10 countries by high quality coverage

   Country  High Quality Articles per Capita (count/Million)
    Tuvalu                                               0.0
  Barbados                                               0.0
     Benin                                               0.0
     India                                               0.0
    Monaco                                               0.0
     China                                               0.0
Seychelles                                               0.0
   Bahamas                                               0.0
    Taiwan                                               0.0
    Bhutan                                               0.0

List of all 65 countries without any high quality articles

['Malaysia', 'Senegal', 'Chad', 'Yemen', 'Djibouti', 'Comoros', 'Tanzania', 'Gambia', 'Niger', 'Qatar', 'Kuwait', 'Eritrea', 'Oman', 'Turkey', 'Uzbekistan', 'Zimbabwe', 'Congo', 'Namibia', 'Cape Verde', 'India', 'Barbados'

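Because so many countries tie at 0, which ten appear in a "bottom 10" depends on row order. A minimal sketch, on a hypothetical subset of the values above, of making the selection deterministic with a secondary sort key and deriving the count from the data rather than hard-coding it (the `hq_per_capita` column name is illustrative):

```python
import pandas as pd

# Hypothetical subset of country scores for illustration.
scores = pd.DataFrame({
    "Country": ["Tuvalu", "Benin", "India", "Croatia"],
    "hq_per_capita": [0.0, 0.0, 0.0, 1.32],
})

# Derive the list (and its length) of zero-score countries from the data.
zero_countries = scores.loc[scores["hq_per_capita"] == 0, "Country"].sort_values().tolist()
print(f"List of all {len(zero_countries)} countries without any high quality articles")
print(zero_countries)

# Break ties alphabetically so the bottom-N selection is reproducible.
bottom = scores.sort_values(by=["hq_per_capita", "Country"], ascending=[True, True]).head(2)
print(bottom.to_string(index=False))
```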
## Table 5: Geographic regions by total coverage

The code below produces a rank ordered list of geographic regions (in descending order) by total articles per capita.

Since the region-level population was not stored in the original `data` dataframe, we need to merge again with the "population_by_country_AUG.2024.csv" dataset to get each region's population.

In [73]:
# Load the population by country dataset and merge on region
pop_country = pd.read_csv("../data_raw/population_by_country_AUG.2024.csv").drop_duplicates()
data_region = pd.merge(data, pop_country, left_on='region', right_on='Geography', how='inner')
data_region = data_region.rename(columns={'Population': 'Region Population'})

# Compute the articles per capita
coverage_data_region = copy.deepcopy(data_region)
coverage_data_region = coverage_data_region.groupby('region').agg( Count = ('Region Population', 'count'), Region_Population = ('Region Population', 'mean')).reset_index()
coverage_data_region['articles_per_capita'] = coverage_data_region["Count"] / coverage_data_region["Region_Population"]

# Rename the apc column to make it more readable
coverage_data_region = coverage_data_region.rename(columns={'articles_per_capita': 'Articles per Capita (count/Million)', 'region': 'Region'})

#Print the output to a Table
print('All geographic region, ordered by coverage\n')
print(coverage_data_region.sort_values(by='Articles per Capita (count/Million)', ascending=False)[['Region', 'Articles per Capita (count/Million)']].round(2).to_string(index=False))

All geographic region, ordered by coverage

         Region  Articles per Capita (count/Million)
SOUTHERN EUROPE                                 5.17
      CARIBBEAN                                 4.93
 WESTERN EUROPE                                 2.48
 EASTERN EUROPE                                 2.47
   WESTERN ASIA                                 2.03
SOUTHERN AFRICA                                 1.76
NORTHERN EUROPE                                 1.74
        OCEANIA                                 1.60
 EASTERN AFRICA                                 1.36
  SOUTH AMERICA                                 1.34
   CENTRAL ASIA                                 1.29
NORTHERN AFRICA                                 1.18
 WESTERN AFRICA                                 1.16
  MIDDLE AFRICA                                 1.14
CENTRAL AMERICA                                 1.02
 SOUTHEAST ASIA                                 0.58
     SOUTH ASIA                                 0.33
  

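An inner merge silently drops any article whose region has no matching row in the population file. The sketch below, on hypothetical toy frames, shows one way to surface such mismatches before aggregating, using the `validate` and `indicator` options of `DataFrame.merge`. The region names and populations here are made up.

```python
import pandas as pd

# Hypothetical article rows and region-level population table.
articles = pd.DataFrame({
    "country": ["A", "B", "C"],
    "region": ["NORTH", "NORTH", "SOUTH"],
})
region_pop = pd.DataFrame({
    "Geography": ["NORTH", "EAST"],
    "Population": [100.0, 50.0],
})

# validate raises if a region name appears more than once on the right side;
# indicator adds a "_merge" column saying where each row came from.
checked = articles.merge(
    region_pop,
    left_on="region",
    right_on="Geography",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Regions present in the article data but missing from the population table.
unmatched = checked.loc[checked["_merge"] == "left_only", "region"].unique().tolist()
print(unmatched)
```

If `unmatched` is empty, the inner merge used above loses no articles; otherwise the listed regions would be excluded from the region tables.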
## Table 6: Geographic regions by high quality coverage

The code below produces a rank ordered list of geographic regions (in descending order) by high quality articles per capita.

In [72]:
#Compute high quality coverage by region
hq_coverage_data_region = copy.deepcopy(data_region)

#Extract all the regions present
all_regions = pd.DataFrame({'region':hq_coverage_data_region['region'].unique()})

#Filter to only the high quality articles and compute aggregate scores
hq_coverage_data_region = hq_coverage_data_region[hq_coverage_data_region["article_quality"].isin(['FA', 'GA'])]
hq_coverage_data_region = hq_coverage_data_region.groupby('region').agg( Count = ('Region Population', 'count'), Region_Population = ('Region Population', 'mean')).reset_index()
hq_coverage_data_region['articles_per_capita'] = hq_coverage_data_region["Count"] / hq_coverage_data_region["Region_Population"]

#Remerge with the set of all regions and fill in the aggregates with 0
hq_coverage_data_region = pd.merge(all_regions, hq_coverage_data_region, on='region', how='left')
hq_coverage_data_region.fillna(0, inplace=True)

#rename the apc column to make it more readable
hq_coverage_data_region = hq_coverage_data_region.rename(columns={'articles_per_capita': 'Articles per Capita (count/Million)', 'region': 'Region'})

#Print the output to a Table
print('All geographic region, ordered by quality article coverage\n')
print(hq_coverage_data_region.sort_values(by='Articles per Capita (count/Million)', ascending=False)[['Region', 'Articles per Capita (count/Million)']].round(2).to_string(index=False))

All geographic region, ordered by quality article coverage

         Region  Articles per Capita (count/Million)
SOUTHERN EUROPE                                 0.35
      CARIBBEAN                                 0.20
 EASTERN EUROPE                                 0.13
SOUTHERN AFRICA                                 0.11
 WESTERN EUROPE                                 0.11
   WESTERN ASIA                                 0.09
NORTHERN EUROPE                                 0.07
NORTHERN AFRICA                                 0.07
   CENTRAL ASIA                                 0.06
CENTRAL AMERICA                                 0.05
  SOUTH AMERICA                                 0.04
  MIDDLE AFRICA                                 0.04
 SOUTHEAST ASIA                                 0.04
 EASTERN AFRICA                                 0.03
 WESTERN AFRICA                                 0.03
        OCEANIA                                 0.02
     SOUTH ASIA                        