# Analysis

The data cleaning and merging are completed in [DATA512_HW2_DataRetrieval_ipynb](DATA512_HW2_DataRetrieval_ipynb).  
The output was written to a CSV file, which we read in here for the analysis.

In [76]:
import pandas as pd

In [158]:
articles_data = pd.read_csv("C:/Users/fioyu/Desktop/UW/DATA512/Homework/data-512-homework_2/output/wp_politicians_by_country.csv")

In [160]:
len(articles_data)

7147

In [161]:
# Display rows with any missing or empty values
rows_with_missing = articles_data[articles_data.isna().any(axis=1) | (articles_data == '').any(axis=1)]
rows_with_missing

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
898,Botche Candé,Guinea-Bissau,992221311,Stub,,
899,Juliano Fernandes,Guinea-Bissau,1015285975,Stub,,
900,Teodora Inácia Gomes,Guinea-Bissau,1060153270,C,,
901,Desejado Lima da Costa,Guinea-Bissau,1200012565,Stub,,
902,Aristide Menezes,Guinea-Bissau,1142925183,Stub,,
...,...,...,...,...,...,...
5932,Yun Il-seon,"Korea, South",1221352139,Start,,
5933,Yun So-ha,"Korea, South",1214500724,Start,,
5934,Yun Young-sun (minister),"Korea, South",1214785975,Stub,,
6544,Myriam Dossou D'Almeida,Togo,1244179914,,9.1000e+00,WESTERN AFRICA


After some examination, there are countries for which we have no population or region information. We drop these entries because the analysis below needs those values. Note that `dropna()` also removes rows with a missing `article_quality`, such as the Togo row shown above.

In [80]:
articles_data_cleaned = articles_data.dropna()
articles_data_cleaned.shape[0]

7001
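
To see which columns are responsible for the dropped rows, a per-column count of missing values can help. The sketch below is only an illustration: `toy` is a made-up frame that borrows the real column names, not the real data.

```python
import pandas as pd

# Hypothetical toy rows mirroring the real column names (not the real data).
toy = pd.DataFrame({
    "article_title": ["A", "B", "C", "D"],
    "country": ["X", "X", "Y", "Z"],
    "article_quality": ["Stub", None, "C", "GA"],
    "population": [1.5, 1.5, None, 3.0],
    "region": ["R1", "R1", None, "R2"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Countries that have at least one row without a population value.
countries_without_population = toy.loc[toy["population"].isna(), "country"].unique()
print(list(countries_without_population))

# dropna() with no arguments drops a row when any column is missing.
toy_cleaned = toy.dropna()
print(len(toy_cleaned))
```

Running the same two checks on `articles_data` would show how many of the dropped rows come from missing population or region values and how many from a missing quality score.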

In [172]:
articles_data_cleaned

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
0,Majah Ha Adrif,Afghanistan,1233202991,Start,4.2400e+01,SOUTH ASIA
1,Haroon al-Afghani,Afghanistan,1230459615,B,4.2400e+01,SOUTH ASIA
2,Tayyab Agha,Afghanistan,1225661708,Start,4.2400e+01,SOUTH ASIA
3,Khadija Zahra Ahmadi,Afghanistan,1234741562,Stub,4.2400e+01,SOUTH ASIA
4,Aziza Ahmadyar,Afghanistan,1195651393,Start,4.2400e+01,SOUTH ASIA
...,...,...,...,...,...,...
7142,Jacob Magnus Sprengtporten,Sweden,1235736355,C,5.6000e+00,NORTHERN EUROPE
7143,Hrant Maloyan,Syria,1250311867,C,3.0000e+00,WESTERN ASIA
7144,Yat Hwaidi,Thailand,1130454039,Stub,1.7000e+01,SOUTHEAST ASIA
7145,Sergey Abisov,Ukraine,1236963594,Start,1.4690e+02,EASTERN EUROPE


### Total articles per capita (country)

Note: `population_by_country_AUG.2024.csv` gives populations in millions. The code below divides by that value as given, so the "per capita" figures are articles per million people. Some of the resulting values are very small, so I display them in scientific notation.
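
Because the populations are in millions, dividing an article count by the population column gives a rate per million people. The sketch below shows the conversion to a rate per person. The population of 42.4 million is the Afghanistan value from the data above; the article count of 85 and the helper `per_person` are illustrative assumptions.

```python
PEOPLE_PER_MILLION = 1_000_000


def per_person(rate_per_million: float) -> float:
    """Convert a rate expressed per million people into a rate per person."""
    return rate_per_million / PEOPLE_PER_MILLION


articles = 85               # illustrative article count (assumption)
population_millions = 42.4  # population in millions, as stored in the CSV

rate_per_million = articles / population_millions
print(f"{rate_per_million:.4f} articles per million people")        # 2.0047
print(f"{per_person(rate_per_million):.4e} articles per person")    # 2.0047e-06
```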

In [82]:
# Just checking how many countries there are
len(articles_data_cleaned["country"].unique())

167

A country appears multiple times in the dataframe, but all of its rows carry the same population, so I take the population from the first row. <br>
Table 1:

In [179]:
countries_list = []
articles_per_capita_list = []

for country in articles_data_cleaned["country"].unique():
    article_numbers = (articles_data_cleaned['country'] == country).sum()

    population = articles_data_cleaned.loc[articles_data_cleaned['country'] == country, 'population'].values[0]
    
    if population == 0:
        articles_per_capita = 0
    else:
        articles_per_capita = article_numbers / population

    countries_list.append(country)
    articles_per_capita_list.append(articles_per_capita)

articles_per_capita_country = pd.DataFrame({
    'Country': countries_list,
    'Articles Per Capita': articles_per_capita_list
})
pd.options.display.float_format = '{:.4e}'.format
articles_per_capita_country.head(3)

Unnamed: 0,Country,Articles Per Capita
0,Afghanistan,2.0047
1,Albania,25.926
2,Algeria,1.5171
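
The loop above can also be written with `groupby`. This is a minimal sketch on hypothetical toy data (`toy` and its values are made up), not a replacement for the cell that produced Table 1. It keeps the same convention of returning 0 when the population is 0.

```python
import pandas as pd

# Hypothetical toy data; population is in millions, as in the real file.
toy = pd.DataFrame({
    "article_title": ["A", "B", "C", "D", "E"],
    "country": ["X", "X", "X", "Y", "Z"],
    "population": [1.5, 1.5, 1.5, 4.0, 0.0],
})

# One row per country: number of articles and the first population value.
per_country = toy.groupby("country").agg(
    articles=("article_title", "size"),
    population=("population", "first"),
)

# Same convention as the loop: a population of 0 yields 0 instead of dividing by zero.
per_country["Articles Per Capita"] = 0.0
nonzero = per_country["population"] != 0
per_country.loc[nonzero, "Articles Per Capita"] = (
    per_country.loc[nonzero, "articles"] / per_country.loc[nonzero, "population"]
)
print(per_country)
```

One difference to keep in mind: `groupby` sorts the countries alphabetically by default, while the loop keeps them in order of first appearance. Passing `sort=False` to `groupby` preserves the original order.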


In [181]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
articles_per_capita_country


Unnamed: 0,Country,Articles Per Capita
0,Afghanistan,2.0047
1,Albania,25.926
2,Algeria,1.5171
3,Angola,1.5804
4,Antigua and Barbuda,330.0
5,Argentina,1.3823
6,Armenia,12.0
7,Austria,8.0435
8,Azerbaijan,7.8431
9,Bahamas,22.5


### High quality articles per capita (country)

The difference from Table 1 is that we count only the high quality articles.
High quality is defined as having a score of "FA" or "GA".
Table 2:

In [164]:
countries_list = []
articles_per_capita_list = []

for country in articles_data_cleaned["country"].unique():
    country_data = articles_data_cleaned[articles_data_cleaned['country'] == country]
    
    FAarticles = (country_data['article_quality'] == "FA").sum()
    GAarticles = (country_data['article_quality'] == "GA").sum()
    article_numbers = FAarticles + GAarticles
    population = articles_data_cleaned.loc[articles_data_cleaned['country'] == country, 'population'].values[0]
    
    if population == 0:
        articles_per_capita = 0
    else:
        articles_per_capita = article_numbers / population

    countries_list.append(country)
    articles_per_capita_list.append(articles_per_capita)

high_quality_articles_per_capita_country = pd.DataFrame({
    'Country': countries_list,
    'High Quality Articles Per Capita': articles_per_capita_list
})
pd.options.display.float_format = '{:.4e}'.format
high_quality_articles_per_capita_country.head(3)

Unnamed: 0,Country,High Quality Articles Per Capita
0,Afghanistan,0.070755
1,Albania,2.5926
2,Algeria,0.021368
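
The "FA" or "GA" filter can be expressed with `isin`, which avoids counting the two classes separately. A minimal sketch on hypothetical toy data (the frame and the constant `HIGH_QUALITY` are assumptions for illustration), with all populations nonzero so no zero check is needed here:

```python
import pandas as pd

# Hypothetical toy data; population is in millions.
toy = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y"],
    "article_quality": ["FA", "Stub", "GA", "C", "GA"],
    "population": [2.0, 2.0, 2.0, 4.0, 4.0],
})

HIGH_QUALITY = ["FA", "GA"]  # quality classes treated as high quality

# True for rows whose quality class is FA or GA.
toy["is_high_quality"] = toy["article_quality"].isin(HIGH_QUALITY)

# Summing booleans counts the True rows per country.
per_country = toy.groupby("country").agg(
    high_quality_articles=("is_high_quality", "sum"),
    population=("population", "first"),
)
per_country["High Quality Articles Per Capita"] = (
    per_country["high_quality_articles"] / per_country["population"]
)
print(per_country)
```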


### Total articles per capita (region)

This table is similar to Table 1, except that we need to calculate the total population for each region (the sum of the populations of the countries in that region). The regional populations listed in [population_by_country_AUG.2024.csv](../input_data/population_by_country_AUG.2024.csv) will differ slightly, because the region rows in the CSV file do not include decimal places. 
Table 3:

In [165]:
region_list = []
articles_per_capita_list2 = []

for region in articles_data_cleaned["region"].unique():
    article_numbers = (articles_data_cleaned['region'] == region).sum()
    population = articles_data_cleaned.loc[articles_data_cleaned['region'] == region, 'population'].sum()
    if population == 0:
        articles_per_capita = 0
    else:
        articles_per_capita = article_numbers / population

    region_list.append(region)
    articles_per_capita_list2.append(articles_per_capita)

articles_per_capita_region = pd.DataFrame({
    'Region': region_list,
    'Articles Per Capita': articles_per_capita_list2
})
pd.options.display.float_format = '{:.4e}'.format
articles_per_capita_region.head(3)

Unnamed: 0,Region,Articles Per Capita
0,SOUTH ASIA,0.0025383
1,SOUTHERN EUROPE,0.04432
2,NORTHERN AFRICA,0.024872
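
The regional population is described above as the sum of the country populations in the region. One thing to check in the loop: summing the `population` column across article rows adds a country's population once per article, not once per country. The sketch below, on hypothetical toy data, contrasts the two sums and shows one way to count each country once. It assumes every row of a country carries the same population and region.

```python
import pandas as pd

# Hypothetical toy data: country X has three articles, country Y has one.
toy = pd.DataFrame({
    "article_title": ["A", "B", "C", "D"],
    "country": ["X", "X", "X", "Y"],
    "region": ["R1", "R1", "R1", "R1"],
    "population": [2.0, 2.0, 2.0, 4.0],
})

# Summing over article rows counts X three times: 2 + 2 + 2 + 4 = 10.
row_level_total = toy.loc[toy["region"] == "R1", "population"].sum()

# Keep one row per country first, then sum by region: 2 + 4 = 6.
country_populations = toy.drop_duplicates(subset="country")
region_population = country_populations.groupby("region")["population"].sum()

# Articles per region divided by the deduplicated regional population.
articles_per_region = toy.groupby("region").size()
articles_per_capita_region = articles_per_region / region_population

print(row_level_total)
print(region_population)
print(articles_per_capita_region)
```

With the deduplicated total, the regional figures should land close to the region rows in the population CSV, up to the rounding noted above.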


### High quality articles per capita (region)

This table is similar to Table 2, with population generated in the same manner as Table 3.
Table 4:

In [167]:
region_list = []
articles_per_capita_list2 = []

for region in articles_data_cleaned["region"].unique():
    region_data = articles_data_cleaned[articles_data_cleaned['region'] == region]
    
    FAarticles = (region_data['article_quality'] == "FA").sum()
    GAarticles = (region_data['article_quality'] == "GA").sum()
    article_numbers = FAarticles + GAarticles
    population = articles_data_cleaned.loc[articles_data_cleaned['region'] == region, 'population'].sum()
    if population == 0:
        articles_per_capita = 0
    else:
        articles_per_capita = article_numbers / population

    region_list.append(region)
    articles_per_capita_list2.append(articles_per_capita)

high_quality_articles_per_capita_region = pd.DataFrame({
    'Region': region_list,
    'High Quality Articles Per Capita': articles_per_capita_list2
})
pd.options.display.float_format = '{:.4e}'.format
high_quality_articles_per_capita_region.head(3)

Unnamed: 0,Region,High Quality Articles Per Capita
0,SOUTH ASIA,7.9557e-05
1,SOUTHERN EUROPE,0.0029622
2,NORTHERN AFRICA,0.0012519


# Results

## Top 10 countries by coverage 

The table below shows the 10 countries with the highest total articles per capita in descending order.

In [151]:
top10_coverage_country = articles_per_capita_country.sort_values(by='Articles Per Capita', ascending=False)
top10_coverage_country.head(10)

Unnamed: 0,Country,Articles Per Capita
4,Antigua and Barbuda,330.0
97,Federated States of Micronesia,140.0
95,Marshall Islands,130.0
150,Tonga,100.0
12,Barbados,83.333
130,Seychelles,60.0
101,Montenegro,60.0
17,Bhutan,55.0
92,Maldives,55.0
126,Samoa,40.0
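
Sorting and then calling `head(10)` works well for these rankings. As an alternative, pandas offers `nlargest` and `nsmallest`, which select the extreme rows directly. A minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy values, not taken from the tables in this notebook.
toy = pd.DataFrame({
    "Country": ["X", "Y", "Z", "W"],
    "Articles Per Capita": [2.0, 0.25, 0.0, 7.5],
})

# Two highest values, in descending order.
top2 = toy.nlargest(2, "Articles Per Capita")

# Two lowest values, in ascending order.
bottom2 = toy.nsmallest(2, "Articles Per Capita")

print(top2)
print(bottom2)
```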


## Bottom 10 countries by coverage

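The table below shows the 10 countries with the lowest total articles per capita in ascending order. Monaco and Tuvalu appear as 0 because their population is recorded as 0 (in millions), and the code assigns 0 in that case instead of dividing by zero.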

In [152]:
bot10_coverage_country = articles_per_capita_country.sort_values(by='Articles Per Capita', ascending=True)
bot10_coverage_country.head(10)

Unnamed: 0,Country,Articles Per Capita
99,Monaco,0.0
155,Tuvalu,0.0
166,Korean,0.0080321
32,China,0.011337
57,Ghana,0.087977
66,India,0.1057
127,Saudi Arabia,0.1355
164,Zambia,0.14851
109,Norway,0.18182
70,Israel,0.20408


## Top 10 countries by high quality

The table below shows the 10 countries with the highest high quality articles per capita in descending order. 

In [168]:
top10_quality_country = high_quality_articles_per_capita_country.sort_values(by='High Quality Articles Per Capita', ascending=False)
top10_quality_country.head(10)

Unnamed: 0,Country,High Quality Articles Per Capita
101,Montenegro,5.0
87,Luxembourg,2.8571
1,Albania,2.5926
77,Kosovo,2.3529
92,Maldives,1.6667
86,Lithuania,1.3793
38,Croatia,1.3158
62,Guyana,1.25
112,Palestinian Territory,1.0909
134,Slovenia,0.95238


## Bottom 10 countries by high quality

The table below shows the 10 countries with the lowest high quality articles per capita in ascending order. All ten values shown are 0, so the order among these rows carries no ranking information.

In [169]:
bot10_quality_country = high_quality_articles_per_capita_country.sort_values(by='High Quality Articles Per Capita', ascending=True)
bot10_quality_country.head(10)

Unnamed: 0,Country,High Quality Articles Per Capita
83,Lesotho,0.0
106,Nicaragua,0.0
104,Namibia,0.0
103,Mozambique,0.0
99,Monaco,0.0
97,Federated States of Micronesia,0.0
95,Marshall Islands,0.0
94,Malta,0.0
91,Malaysia,0.0
90,Malawi,0.0


## Geographic regions by total coverage

The table below shows the regions and their total articles per capita in descending order. 

In [170]:
coverage_region = articles_per_capita_region.sort_values(by='Articles Per Capita', ascending=False)
coverage_region

Unnamed: 0,Region,Articles Per Capita
17,OCEANIA,0.64806
15,NORTHERN EUROPE,0.16524
4,CARIBBEAN,0.15481
9,CENTRAL AMERICA,0.13077
16,CENTRAL ASIA,0.053269
6,WESTERN ASIA,0.045509
1,SOUTHERN EUROPE,0.04432
13,EASTERN AFRICA,0.028058
7,WESTERN EUROPE,0.026267
2,NORTHERN AFRICA,0.024872


## Geographic regions by high quality coverage

The table below shows the regions and their high quality articles per capita in descending order. 

In [171]:
quality_region = high_quality_articles_per_capita_region.sort_values(by='High Quality Articles Per Capita', ascending=False)
quality_region

Unnamed: 0,Region,High Quality Articles Per Capita
17,OCEANIA,0.0090009
15,NORTHERN EUROPE,0.0069942
9,CENTRAL AMERICA,0.006919
4,CARIBBEAN,0.0063622
1,SOUTHERN EUROPE,0.0029622
16,CENTRAL ASIA,0.0025127
6,WESTERN ASIA,0.0020177
8,EASTERN EUROPE,0.0013472
11,SOUTHERN AFRICA,0.0013436
2,NORTHERN AFRICA,0.0012519
