## Imports and Dependencies

This section imports the libraries used throughout the notebook for data processing.

### Libraries Overview

- **`pandas`**: A library for manipulating and analyzing tabular data.

In [113]:

# Install pandas with pip/pip3 if you do not already have it
import pandas as pd

## Loading the Final Merged Dataset and Population Adjustment

In this step, we load the final merged dataset from the previous step and perform an adjustment to the population values. Since the population data is provided in millions, we need to scale it to reflect the actual population count.

### Steps:
1. **Load the Final Merged Dataset**:
   - The final merged dataset is loaded from the CSV file `wp_politicians_by_country.csv`, which was saved in the previous step.
   
2. **Adjust the Population Values**:
   - The population column in the dataset represents population in millions. To get the actual population count, we multiply the values by 1,000,000.


In [114]:
# Load the final merged dataset from the previous step
merged_df = pd.read_csv('../data/wp_politicians_by_country.csv')

# Population is given in millions, so multiply by 1,000,000 to get the actual population count
merged_df['population'] = merged_df['population'] * 1_000_000
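
One thing to keep in mind with an in-place multiplication like the one above is that re-running the cell without reloading the CSV scales the column a second time. The sketch below shows one way to make the step safe to re-run by writing to a separate column. The toy values and the column name `population_count` are assumptions made for illustration and are not part of the original dataset.

```python
import pandas as pd

# Toy data with made-up values; 'population' is in millions, as in the source CSV
toy_df = pd.DataFrame({
    'country': ['A', 'B'],
    'population': [0.5, 12.0],
})

MILLION = 1_000_000

# Write the scaled values to a new column (hypothetical name) so that
# re-running this cell always starts from the untouched 'population' column
toy_df['population_count'] = toy_df['population'] * MILLION

# Running the same line again produces the same values
toy_df['population_count'] = toy_df['population'] * MILLION

print(toy_df)
```

Reloading the CSV at the top of the cell, as the notebook does, has the same effect as long as the load and the multiplication are always run together.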

## Country-by-Country Analysis: Articles Per Capita

In this section, we perform a country-by-country analysis of the total number of Wikipedia articles and high-quality articles (featured and good articles) per capita. The results are scaled to represent articles per 10 million people for better interpretability.

### Steps:
1. **Filter High-Quality Articles**:
   - High-quality articles are defined as those with ORES quality predictions of "FA" (Featured Article) or "GA" (Good Article).
   
2. **Remove Rows with Missing Revision IDs**:
   - Rows that do not have a revision ID are removed, as these do not have the required metadata for analysis.
   
3. **Take Population Once Per Country**:
   - Every article row repeats its country's population, so we take the value from the first row for each country rather than summing across rows.

4. **Total Articles Per Capita**:
   - The total number of articles per country is divided by the population of that country to get the articles per capita.

5. **High-Quality Articles Per Capita**:
   - The same calculation is done for high-quality articles, keeping only the articles marked as FA or GA.

6. **Create a DataFrame**:
   - A DataFrame is created to store both the total articles per capita and high-quality articles per capita.

7. **Handling Missing Values**:
   - Countries with no high-quality articles are filled with zeros.

8. **Scaling the Values**:
   - Both metrics (total and high-quality articles per capita) are scaled by multiplying them by 10^7 to represent articles per 10 million people.


In [115]:
# Define high-quality articles (FA and GA)
high_quality = ['FA', 'GA']

# Remove the rows with missing revision_id from the main DataFrame
merged_df = merged_df.dropna(subset=['revision_id'])

# Ensure population is calculated only once per country (taking the first occurrence)
country_population = merged_df.groupby('country')['population'].first()

# Calculate total-articles-per-capita for each country
country_articles_per_capita = merged_df.groupby('country').size() / country_population

# Keep only the high-quality articles (FA and GA)
country_high_quality_articles = merged_df[merged_df['article_quality'].isin(high_quality)]

# Calculate high-quality-articles-per-capita for each country
country_high_quality_per_capita = country_high_quality_articles.groupby('country').size() / country_population

# Create a DataFrame to store both metrics
country_analysis_df = pd.DataFrame({
    'total_articles_per_capita': country_articles_per_capita,
    'high_quality_articles_per_capita': country_high_quality_per_capita
})

# Fill missing values with 0 (if a country has no high-quality articles)
country_analysis_df = country_analysis_df.fillna(0)

# Scale the values by multiplying by 10^7 to make them more interpretable (articles per 10 million people)
scale_factor = 1e7
country_analysis_df['total_articles_per_capita_scaled'] = country_analysis_df['total_articles_per_capita'] * scale_factor
country_analysis_df['high_quality_articles_per_capita_scaled'] = country_analysis_df['high_quality_articles_per_capita'] * scale_factor

# Display the results for country-by-country analysis
print("\nCountry-by-Country Analysis (Articles Per Capita, scaled):")
display(country_analysis_df[['total_articles_per_capita_scaled', 'high_quality_articles_per_capita_scaled']].head())




Country-by-Country Analysis (Articles Per Capita, scaled):


country,total_articles_per_capita_scaled,high_quality_articles_per_capita_scaled
Afghanistan,20.04717,0.707547
Albania,259.259259,25.925926
Algeria,15.17094,0.213675
Angola,15.803815,0.544959
Antigua and Barbuda,3300.0,0.0
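
To see why the `fillna(0)` step is needed, it can help to trace the same calculation on a tiny example. The sketch below uses made-up rows (one row per article) and hypothetical output column names. A country with no FA or GA articles is missing from the high-quality `groupby` result, so the index-aligned division yields `NaN` for it.

```python
import pandas as pd

# Made-up rows: one row per article, population already in absolute counts
toy_df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'population': [2_000_000, 2_000_000, 2_000_000, 5_000_000, 5_000_000],
    'article_quality': ['FA', 'Stub', 'GA', 'Start', 'C'],
})

# One population value per country (first row of each group)
population = toy_df.groupby('country')['population'].first()
total_per_capita = toy_df.groupby('country').size() / population

high_quality_rows = toy_df[toy_df['article_quality'].isin(['FA', 'GA'])]
# Country B has no FA/GA rows, so it is absent from this groupby result;
# dividing by `population` aligns on the index and yields NaN for B
hq_per_capita = high_quality_rows.groupby('country').size() / population

toy_analysis = pd.DataFrame({
    'total_per_10m': total_per_capita * 1e7,
    'high_quality_per_10m': hq_per_capita.fillna(0) * 1e7,
})
print(toy_analysis)
```

In this toy case country A has 3 articles for 2 million people, which is about 15 articles per 10 million, and country B ends up with 0 high-quality articles per 10 million after the fill.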


## Region-by-Region Analysis: Articles Per Capita

In this section, we extend the analysis to a regional level, calculating the total number of Wikipedia articles and high-quality articles (featured and good articles) per capita for each region. The results are scaled to represent articles per 10 million people for clarity.

### Steps:
1. **Remove Duplicates**:
   - For calculating population, we ensure that each country is represented only once. This is done by removing duplicates in the merged dataset and keeping the first instance of each country.

2. **Calculate Total Articles Per Capita (Region)**:
   - Total articles per capita for a region is the region's article count divided by the summed population of the countries in that region.
   
3. **Filter High-Quality Articles**:
   - As before, high-quality articles are defined as articles with ORES predictions of "FA" (Featured Article) or "GA" (Good Article).

4. **Calculate High-Quality Articles Per Capita (Region)**:
   - The same process is followed for high-quality articles, calculating the number of high-quality articles per capita by dividing the article count by the population sum for each region.

5. **Create a DataFrame**:
   - A DataFrame is created to store the total articles per capita and high-quality articles per capita for each region.

6. **Handling Missing Values**:
   - If a region has no high-quality articles, we fill in the missing values with zeros.

7. **Scaling the Values**:
   - Both metrics (total and high-quality articles per capita) are scaled by multiplying them by 10^7 to represent articles per 10 million people.


In [118]:
# Remove duplicates by keeping only the first instance of each country in the merged dataset
deduped_population_df = merged_df.drop_duplicates(subset=['country'], keep='first')

# Calculate total-articles-per-capita for each region
# We use the entire merged_df for article counts, and the deduplicated dataframe for population
region_articles_per_capita = merged_df.groupby('region').size() / deduped_population_df.groupby('region')['population'].sum()

# Calculate high-quality-articles-per-capita for each region
# Filter for high-quality articles and then calculate the per capita based on the deduplicated population data
region_high_quality_articles = merged_df[merged_df['article_quality'].isin(high_quality)]
region_high_quality_per_capita = region_high_quality_articles.groupby('region').size() / deduped_population_df.groupby('region')['population'].sum()

# Create a DataFrame to store regional metrics
region_analysis_df = pd.DataFrame({
    'total_articles_per_capita': region_articles_per_capita,
    'high_quality_articles_per_capita': region_high_quality_per_capita
})

# Fill missing values with 0 (if a region has no high-quality articles)
region_analysis_df = region_analysis_df.fillna(0)

# Scale the values by multiplying by 10^7 to make them more interpretable (articles per 10 million people)
scale_factor = 1e7
region_analysis_df['total_articles_per_capita_scaled'] = region_analysis_df['total_articles_per_capita'] * scale_factor
region_analysis_df['high_quality_articles_per_capita_scaled'] = region_analysis_df['high_quality_articles_per_capita'] * scale_factor

# Display the results for region-by-region analysis
print("\nRegion-by-Region Analysis (Articles Per Capita, scaled):")
display(region_analysis_df[['total_articles_per_capita_scaled', 'high_quality_articles_per_capita_scaled']].head())




Region-by-Region Analysis (Articles Per Capita, scaled):


region,total_articles_per_capita_scaled,high_quality_articles_per_capita_scaled
CARIBBEAN,59.562842,2.459016
CENTRAL AMERICA,36.647173,1.949318
CENTRAL ASIA,13.18408,0.621891
EAST ASIA,0.972675,0.019198
EASTERN AFRICA,13.807444,0.353504
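
The de-duplication step matters because the dataset has one row per article, so a country's population appears once for every article it has. The sketch below, using made-up rows, contrasts a naive sum over article rows with the sum over one row per country. The variable names are chosen for this illustration only.

```python
import pandas as pd

# Made-up rows: country A has three articles, country B has one
toy_df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B'],
    'region': ['R1', 'R1', 'R1', 'R1'],
    'population': [2_000_000, 2_000_000, 2_000_000, 5_000_000],
})

# Summing over article rows counts country A's population three times
naive_population = toy_df.groupby('region')['population'].sum()

# Keeping one row per country first gives the intended regional population
one_row_per_country = toy_df.drop_duplicates(subset=['country'], keep='first')
region_population = one_row_per_country.groupby('region')['population'].sum()

# Article counts still come from the full, non-deduplicated frame
articles_per_10m = toy_df.groupby('region').size() / region_population * 1e7

print(naive_population['R1'], region_population['R1'])
print(articles_per_10m)
```

Here the naive sum is 11 million while the de-duplicated sum is 7 million, so the naive version would understate the region's articles per capita.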


## Results - Country and Regional Analysis

In this section, we present the analysis of Wikipedia article coverage and quality by country and region. The metrics analyzed are:
1. **Total Articles Per Capita**: The total number of Wikipedia articles per 10 million people.
2. **High-Quality Articles Per Capita**: The number of high-quality Wikipedia articles (Featured Articles, FA, and Good Articles, GA) per 10 million people.

### Top 10 Countries by Total Articles Per Capita
This table lists the 10 countries with the highest total articles per capita (scaled to 10 million people).

### Bottom 10 Countries by Total Articles Per Capita
This table lists the 10 countries with the lowest total articles per capita (scaled to 10 million people).

### Top 10 Countries by High-Quality Articles Per Capita
This table lists the 10 countries with the highest number of high-quality articles per capita (scaled to 10 million people).

### Bottom 10 Countries by High-Quality Articles Per Capita
This table lists the 10 countries with the lowest number of high-quality articles per capita (scaled to 10 million people).

### Geographic Regions by Total Articles Per Capita
Regions are ranked by total articles per capita (scaled to 10 million people). This analysis highlights which regions have the most Wikipedia article coverage per capita.

### Geographic Regions by High-Quality Articles Per Capita
Regions are ranked by high-quality articles per capita (scaled to 10 million people). This analysis highlights which regions have the most high-quality article coverage per capita.

In [119]:
# Results - Country and Regional Tables

# Top 10 countries by coverage (total articles per capita)
top_10_countries_coverage = country_analysis_df.sort_values(by='total_articles_per_capita_scaled', ascending=False).head(10)
print("Top 10 Countries by Coverage (Total Articles Per 10 Million People):")
display(top_10_countries_coverage[['total_articles_per_capita_scaled']])

# Bottom 10 countries by coverage (total articles per capita)
bottom_10_countries_coverage = country_analysis_df.sort_values(by='total_articles_per_capita_scaled', ascending=True).head(10)
print("\nBottom 10 Countries by Coverage (Total Articles Per 10 Million People):")
display(bottom_10_countries_coverage[['total_articles_per_capita_scaled']])

# Top 10 countries by high-quality articles per capita
top_10_countries_high_quality = country_analysis_df.sort_values(by='high_quality_articles_per_capita_scaled', ascending=False).head(10)
print("\nTop 10 Countries by High-Quality Articles Per 10 Million People:")
display(top_10_countries_high_quality[['high_quality_articles_per_capita_scaled']])

# Bottom 10 countries by high-quality articles per capita
bottom_10_countries_high_quality = country_analysis_df.sort_values(by='high_quality_articles_per_capita_scaled', ascending=True).head(10)
print("\nBottom 10 Countries by High-Quality Articles Per 10 Million People:")
display(bottom_10_countries_high_quality[['high_quality_articles_per_capita_scaled']])

# Geographic regions by total coverage (rank ordered by total articles per capita)
regions_by_total_coverage = region_analysis_df.sort_values(by='total_articles_per_capita_scaled', ascending=False)
print("\nGeographic Regions by Total Coverage (Total Articles Per 10 Million People):")
display(regions_by_total_coverage[['total_articles_per_capita_scaled']])

# Geographic regions by high-quality coverage (rank ordered by high-quality articles per capita)
regions_by_high_quality_coverage = region_analysis_df.sort_values(by='high_quality_articles_per_capita_scaled', ascending=False)
print("\nGeographic Regions by High-Quality Coverage (High-Quality Articles Per 10 Million People):")
display(regions_by_high_quality_coverage[['high_quality_articles_per_capita_scaled']])


Top 10 Countries by Coverage (Total Articles Per 10 Million People):


country,total_articles_per_capita_scaled
Antigua and Barbuda,3300.0
Federated States of Micronesia,1400.0
Marshall Islands,1300.0
Tonga,1000.0
Barbados,833.333333
Seychelles,600.0
Montenegro,600.0
Bhutan,550.0
Maldives,550.0
Samoa,400.0



Bottom 10 Countries by Coverage (Total Articles Per 10 Million People):


country,total_articles_per_capita_scaled
China,0.113371
Ghana,0.879765
India,1.056979
Saudi Arabia,1.355014
Zambia,1.485149
Norway,1.818182
Israel,2.040816
Egypt,3.041825
Cote d'Ivoire,3.236246
Ethiopia,3.478261



Top 10 Countries by High-Quality Articles Per 10 Million People:


country,high_quality_articles_per_capita_scaled
Montenegro,50.0
Luxembourg,28.571429
Albania,25.925926
Kosovo,23.529412
Maldives,16.666667
Lithuania,13.793103
Croatia,13.157895
Guyana,12.5
Palestinian Territory,10.909091
Slovenia,9.52381



Bottom 10 Countries by High-Quality Articles Per 10 Million People:


country,high_quality_articles_per_capita_scaled
Zimbabwe,0.0
Congo,0.0
Kuwait,0.0
St. Lucia,0.0
Cote d'Ivoire,0.0
St. Kitts and Nevis,0.0
Solomon Islands,0.0
Cyprus,0.0
Singapore,0.0
Djibouti,0.0



Geographic Regions by Total Coverage (Total Articles Per 10 Million People):


region,total_articles_per_capita_scaled
NORTHERN EUROPE,68.705036
OCEANIA,63.963964
CARIBBEAN,59.562842
SOUTHERN EUROPE,52.607261
CENTRAL AMERICA,36.647173
WESTERN EUROPE,26.861555
EASTERN EUROPE,26.63411
WESTERN ASIA,20.616114
SOUTHERN AFRICA,18.008785
EASTERN AFRICA,13.807444



Geographic Regions by High-Quality Coverage (High-Quality Articles Per 10 Million People):


region,high_quality_articles_per_capita_scaled
SOUTHERN EUROPE,3.49835
NORTHERN EUROPE,3.23741
CARIBBEAN,2.459016
CENTRAL AMERICA,1.949318
EASTERN EUROPE,1.427498
SOUTHERN AFRICA,1.171303
WESTERN EUROPE,1.158301
WESTERN ASIA,0.914015
OCEANIA,0.900901
NORTHERN AFRICA,0.664322
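
Every row in the bottom-10 table for high-quality articles has a value of 0.0. If more than ten countries are tied at zero, which ten appear is decided by sort order rather than by the metric. The sketch below, on a made-up Series, shows two possible ways to make such ties explicit: `nsmallest` with `keep='all'`, and ranking only the non-zero entries. It is an illustration of the idea, not the method used in this notebook.

```python
import pandas as pd

# Made-up scores; three entries are tied at zero
toy_scores = pd.Series(
    [0.0, 0.0, 0.0, 1.5, 4.0],
    index=['A', 'B', 'C', 'D', 'E'],
    name='high_quality_articles_per_capita_scaled',
)

# Count how many entries are tied at zero
n_zero = int((toy_scores == 0).sum())

# keep='all' returns every entry tied with the last selected value,
# so the result can be longer than the requested n
bottom_two_with_ties = toy_scores.nsmallest(2, keep='all')

# Alternatively, rank only entries with at least one high-quality article
bottom_two_nonzero = toy_scores[toy_scores > 0].nsmallest(2)

print(n_zero)
print(bottom_two_with_ties)
print(bottom_two_nonzero)
```

Reporting the count of zero-valued countries alongside the table is another simple way to show readers that the listed ten are not uniquely determined.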
