# ANALYSIS

Disclaimer: To understand how the csv files used here were generated and formatted, refer to the preceding notebooks in this repository.

Here, we perform the actual analysis of the data.

Before that, we first join the data, which we have as 4 separate csv files.

In [1]:
import pandas as pd
import json

Importing ores data from the csv file.

### Make sure to copy the csv file from the data_files folder and place it in the path of this notebook.

In [2]:
ores_data = pd.read_csv('article_quality.csv')

Importing population and politicians data from the csv files.

### Make sure to copy the csv file from the data_files folder and place it in the path of this notebook.

In [3]:
population_data = pd.read_csv('population_by_country_AUG.2024.csv')
politicians_data = pd.read_csv('politicians_by_country_AUG.2024.csv')

Converting the rev_id to type int

In [5]:
# Convert rev_id to int so it can be used as a join key
ores_data['rev_id'] = ores_data['rev_id'].astype(int)
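
The article page data read in a later cell shows `rev_id` as a float (for example `1.233203e+09`). One possible way to keep the join keys consistent is to cast both sides to pandas' nullable integer dtype before merging. This is a minimal sketch on toy frames, not the notebook's actual data; the names `left`, `right` and `joined` are hypothetical.

```python
import pandas as pd

# Toy stand-ins (hypothetical): float revision IDs with one missing value on the left,
# integer revision IDs on the right.
left = pd.DataFrame({'name': ['A', 'B', 'C'],
                     'rev_id': [1233203000.0, 1230460000.0, None]})
right = pd.DataFrame({'rev_id': [1233203000, 1230460000],
                      'article_quality': ['Start', 'B']})

# 'Int64' (capital I) is the nullable integer dtype: missing values become <NA>
# instead of forcing the whole column to float.
left['rev_id'] = left['rev_id'].astype('Int64')
right['rev_id'] = right['rev_id'].astype('Int64')

# Both key columns now share one dtype, so the join compares integers with integers.
joined = left.merge(right, on='rev_id', how='left')
print(joined)
```

The row with a missing `rev_id` stays in the result with no `article_quality`, which is the same outcome a left join gives in the cells that follow.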

### Make sure to copy the csv file from the data_files folder and place it in the path of this notebook.

In [7]:
article_page_info_df = pd.read_csv('wiki_information.csv')
article_page_info_df

Unnamed: 0,name,rev_id
0,Majah Ha Adrif,1.233203e+09
1,Haroon al-Afghani,1.230460e+09
2,Tayyab Agha,1.225662e+09
3,Khadija Zahra Ahmadi,1.234742e+09
4,Aziza Ahmadyar,1.195651e+09
...,...,...
7106,Josiah Tongogara,1.203429e+09
7107,Langton Towungana,1.246280e+09
7108,Sengezo Tshabangu,1.228478e+09
7109,Herbert Ushewokunze,9.591118e+08


To continue with the analysis, we need to join the 4 separate csv files so that we can answer the questions we have.

First, I merge the politicians and article page data. Both share the article title (which I have encoded as name), so I join on that column with a left join. Since the politicians data was provided as part of the assignment, I am using it on the left side of the join. I have named this intermediate merge merged_data_1.

In [8]:
# Left join politicians_data with article_page_info_df on name
merged_data_1 = pd.merge(politicians_data, article_page_info_df, on='name', how='left')

In [9]:
merged_data_1

Unnamed: 0,name,url,country,rev_id
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,1.233203e+09
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,1.230460e+09
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,1.225662e+09
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan,1.234742e+09
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan,1.195651e+09
...,...,...,...,...
7150,Josiah Tongogara,https://en.wikipedia.org/wiki/Josiah_Tongogara,Zimbabwe,1.203429e+09
7151,Langton Towungana,https://en.wikipedia.org/wiki/Langton_Towungana,Zimbabwe,1.246280e+09
7152,Sengezo Tshabangu,https://en.wikipedia.org/wiki/Sengezo_Tshabangu,Zimbabwe,1.228478e+09
7153,Herbert Ushewokunze,https://en.wikipedia.org/wiki/Herbert_Ushewokunze,Zimbabwe,9.591118e+08
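
A left join can silently add rows when a key appears more than once on the right side. The sketch below, on toy frames with hypothetical names (`politicians`, `pages`), shows one way to guard against that with the `validate` argument of `pd.merge`. It is an optional check and not part of the original workflow.

```python
import pandas as pd

# Toy stand-ins (hypothetical): 'A' appears twice in the page table.
politicians = pd.DataFrame({'name': ['A', 'B'], 'country': ['X', 'Y']})
pages = pd.DataFrame({'name': ['A', 'A', 'B'], 'rev_id': [101, 101, 102]})

# Without a check, the duplicated key on the right duplicates the left row.
inflated = pd.merge(politicians, pages, on='name', how='left')

# validate='many_to_one' raises MergeError when right-hand keys are not unique.
try:
    pd.merge(politicians, pages, on='name', how='left', validate='many_to_one')
    duplicate_keys_found = False
except pd.errors.MergeError:
    duplicate_keys_found = True

# Dropping duplicate keys first keeps exactly one row per politician.
clean = pd.merge(politicians, pages.drop_duplicates('name'),
                 on='name', how='left', validate='many_to_one')
print(len(inflated), duplicate_keys_found, len(clean))
```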


We now have merged_data_1, which holds the politicians data and the article page data. We will now merge this with the ores data. Both have rev_id in common, so I do a left join on it. I have named this merge merged_data_2.

In [30]:
merged_data_2 = pd.merge(merged_data_1, ores_data, left_on='rev_id', right_on='rev_id', how='left')

In [12]:
merged_data_2

Unnamed: 0,name,url,country,rev_id,article_quality
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,1.233203e+09,Start
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,1.230460e+09,B
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,1.225662e+09,Start
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan,1.234742e+09,Stub
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan,1.195651e+09,Start
...,...,...,...,...,...
7150,Josiah Tongogara,https://en.wikipedia.org/wiki/Josiah_Tongogara,Zimbabwe,1.203429e+09,C
7151,Langton Towungana,https://en.wikipedia.org/wiki/Langton_Towungana,Zimbabwe,1.246280e+09,Stub
7152,Sengezo Tshabangu,https://en.wikipedia.org/wiki/Sengezo_Tshabangu,Zimbabwe,1.228478e+09,Start
7153,Herbert Ushewokunze,https://en.wikipedia.org/wiki/Herbert_Ushewokunze,Zimbabwe,9.591118e+08,Stub


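The population data set mixes regions and countries in one column: region rows (written in all caps) are followed by the countries that belong to them. We extract the region into a separate column by mapping each country to the most recent region row above it, as shown below.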

In [31]:
# Define a variable to hold the current region
current_region_name = None
country_to_region = []

# Iterate through each row to map countries to regions
for idx, data_row in population_data.iterrows():
    geo_name = data_row['Geography']
    pop_count = data_row['Population']
    
    # Check if the geography is in all caps (indicating a region)
    if geo_name.isupper():
        current_region_name = geo_name
    else:
        # If it's a country, map it to the current region
        if current_region_name:
            country_to_region.append([geo_name, current_region_name, pop_count])

# Convert the mapping list into a DataFrame
df_mapped_population = pd.DataFrame(country_to_region, columns=['country', 'Region', 'Population'])
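
The same mapping can be sketched without an explicit loop: keep the all-caps names, blank out the rest, and forward-fill the region down to the country rows. This is an alternative illustration on toy data, assuming the same layout as above (region rows in all caps, followed by their countries); the frame `population` and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for population_data (hypothetical names and values)
population = pd.DataFrame({
    'Geography': ['REGION ONE', 'Country A', 'Country B', 'REGION TWO', 'Country C'],
    'Population': [10.0, 4.0, 6.0, 3.0, 3.0],
})

# True for region rows (all caps), False for country rows
is_region = population['Geography'].str.isupper()

# Keep region names only on region rows, then carry them forward to the rows below
population['Region'] = population['Geography'].where(is_region).ffill()

# Keep country rows that have a region, mirroring the loop's behaviour
mapped = (
    population[~is_region]
    .dropna(subset=['Region'])
    .rename(columns={'Geography': 'country'})
    .loc[:, ['country', 'Region', 'Population']]
    .reset_index(drop=True)
)
print(mapped)
```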


Now, we merge merged_data_2 with the region data we got above to get the region for each record. We join on the country column.

First, we perform an inner join of merged_data_2 and the region data. This keeps only the records whose country appears in both tables. We have named this merge merged_data_3.

In [14]:
merged_data_3 = pd.merge(merged_data_2, df_mapped_population, left_on='country', right_on='country', how='inner')
merged_data_3

Unnamed: 0,name,url,country,rev_id,article_quality,Region,Population
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,1.233203e+09,Start,SOUTH ASIA,42.4
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,1.230460e+09,B,SOUTH ASIA,42.4
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,1.225662e+09,Start,SOUTH ASIA,42.4
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan,1.234742e+09,Stub,SOUTH ASIA,42.4
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan,1.195651e+09,Start,SOUTH ASIA,42.4
...,...,...,...,...,...,...,...
7008,Josiah Tongogara,https://en.wikipedia.org/wiki/Josiah_Tongogara,Zimbabwe,1.203429e+09,C,EASTERN AFRICA,16.7
7009,Langton Towungana,https://en.wikipedia.org/wiki/Langton_Towungana,Zimbabwe,1.246280e+09,Stub,EASTERN AFRICA,16.7
7010,Sengezo Tshabangu,https://en.wikipedia.org/wiki/Sengezo_Tshabangu,Zimbabwe,1.228478e+09,Start,EASTERN AFRICA,16.7
7011,Herbert Ushewokunze,https://en.wikipedia.org/wiki/Herbert_Ushewokunze,Zimbabwe,9.591118e+08,Stub,EASTERN AFRICA,16.7


As per the assignment requirements, we are required to save the final merged data to a csv file. We will do that now.

Renaming the columns to make them more readable and understandable. Writing to csv named wp_politicians_by_country.csv

In [15]:
# Rename the columns to the names used in the output file
merged_data_3.rename(columns={
    'name': 'article_title',
    'rev_id': 'revision_id',
    'Population': 'population',
    'Region': 'region',
}, inplace=True)

fin_merged = merged_data_3[['country', 'region', 'population', 'article_title', 'revision_id', 'article_quality']]

#writing the final merged data to a csv file
fin_merged.to_csv('wp_politicians_by_country.csv', index=False)

We also want to find out which articles have missing data. We do that with an outer join of merged_data_2 and the region data, which also keeps the records whose country does not appear in both tables. We have named this merge merged_data_4.

In [16]:
merged_data_4 = pd.merge(merged_data_2, df_mapped_population, left_on='country', right_on='country', how='outer')

In [17]:
merged_data_4

Unnamed: 0,name,url,country,rev_id,article_quality,Region,Population
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,1.233203e+09,Start,SOUTH ASIA,42.4
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,1.230460e+09,B,SOUTH ASIA,42.4
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,1.225662e+09,Start,SOUTH ASIA,42.4
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan,1.234742e+09,Stub,SOUTH ASIA,42.4
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan,1.195651e+09,Start,SOUTH ASIA,42.4
...,...,...,...,...,...,...,...
7193,,,Kiribati,,,OCEANIA,0.1
7194,,,Nauru,,,OCEANIA,0.0
7195,,,New Caledonia,,,OCEANIA,0.3
7196,,,New Zealand,,,OCEANIA,5.2
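
One way to read the unmatched rows straight off an outer join is the `indicator=True` option of `pd.merge`, which adds a `_merge` column saying which side each row came from. The sketch below uses toy frames with hypothetical names (`articles`, `regions`) and is only an illustration of that option.

```python
import pandas as pd

# Toy stand-ins (hypothetical): one country on each side has no partner.
articles = pd.DataFrame({'name': ['A', 'B'], 'country': ['X', 'Korean']})
regions = pd.DataFrame({'country': ['X', 'Nauru'], 'Region': ['R1', 'OCEANIA']})

# indicator=True labels each row as 'both', 'left_only' or 'right_only'
both = pd.merge(articles, regions, on='country', how='outer', indicator=True)

# Countries with articles but no population/region row
no_population = sorted(both.loc[both['_merge'] == 'left_only', 'country'].unique())
# Countries with a population/region row but no articles
no_articles = sorted(both.loc[both['_merge'] == 'right_only', 'country'].unique())
print(no_population, no_articles)
```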


To list the countries that could not be matched, we take the set difference between the countries in merged_data_2 and those in df_mapped_population, in both directions. We will do that here.

We are also required to write these to a file. We get a total of 43 + 3 unmatched countries, which we write to the file.

In [32]:
# Countries that appear in only one of the two tables
politician_only = set(merged_data_2['country']) - set(df_mapped_population['country'])
population_only = set(df_mapped_population['country']) - set(merged_data_2['country'])
print(politician_only)
print(len(population_only))

# As asked in the question, write the unmatched countries to a file:
# population-only countries first, then politician-only countries
with open('wp_countries-no_match.txt', 'w') as f:
    f.write("\n".join(population_only))
    f.write("\n")
    f.write("\n".join(politician_only))



{'Korean', 'Korea, South', 'Guinea-Bissau'}
43


# ANALYSIS

We are required to analyse the data to answer the following questions:

1. Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).
2. Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order).
3. Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order).
4. Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).
5. Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.
6. Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

Let's look at each one in turn.

NOTE: All per capita calculations are done per million people.


### 1. Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).

In [33]:

# Filter out countries with zero population
merged_data_3 = merged_data_3[merged_data_3['population'] > 0]

# Calculate the number of articles per capita for each country
articles_per_capita = merged_data_3.groupby('country')['article_title'].count() / merged_data_3.groupby('country')['population'].first()

# Get the top 10 countries by articles per capita
top_10_countries_by_coverage = articles_per_capita.sort_values(ascending=False).head(10)
top_10_countries_by_coverage = top_10_countries_by_coverage.reset_index()
top_10_countries_by_coverage.columns = ['country', 'Articles per capita(10^6)']

# Display the top 10 countries
top_10_countries_by_coverage

Unnamed: 0,country,Articles per capita(10^6)
0,Antigua and Barbuda,330.0
1,Federated States of Micronesia,140.0
2,Marshall Islands,130.0
3,Tonga,100.0
4,Barbados,83.333333
5,Seychelles,60.0
6,Montenegro,60.0
7,Bhutan,55.0
8,Maldives,55.0
9,Samoa,40.0


It is interesting to see that Antigua and Barbuda has the highest number of articles per capita, followed by countries like the Federated States of Micronesia and the Marshall Islands. I expected more articles per capita in countries like the USA and the UK, but the countries with small populations come out on top. This makes sense: a country with a small population may still have a similar number of politicians, so dividing by a small population gives a high number of articles per capita.

### 2. Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order).

In [42]:
# Bottom 10 countries by coverage
# Sort the articles per capita in ascending order and get the bottom 10 countries
bottom_10_countries_by_coverage = articles_per_capita.sort_values(ascending=True).head(10)

# Reset the index to have a clean DataFrame
bottom_10_countries_by_coverage = bottom_10_countries_by_coverage.reset_index()

# Rename the columns for better readability
bottom_10_countries_by_coverage.columns = ['country', 'Articles per capita(10^6)']

# Display the bottom 10 countries by coverage
bottom_10_countries_by_coverage

Unnamed: 0,country,Articles per capita(10^6)
0,China,0.011337
1,India,0.105698
2,Ghana,0.117302
3,Saudi Arabia,0.135501
4,Zambia,0.148515
5,Norway,0.181818
6,Israel,0.204082
7,Egypt,0.304183
8,Cote d'Ivoire,0.323625
9,Ethiopia,0.347826


After the previous output, this is easier to understand. China and India have large populations, hence the ratio is low. But I wonder whether population is the only factor affecting articles per capita. It would be interesting to see whether other factors, such as literacy rate, internet penetration, or GDP, also play a role.

### 3. Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order).

In [43]:
# Filter articles with high quality (FA and GA)
high_quality_articles = merged_data_3[merged_data_3['article_quality'].isin(['FA', 'GA'])]

# Calculate high quality articles per capita for each country
high_quality_articles_per_capita = high_quality_articles.groupby('country')['article_title'].count() / merged_data_3.groupby('country')['population'].first()

# Get the top 10 countries by high quality articles per capita
top_10_high_quality_countries = high_quality_articles_per_capita.sort_values(ascending=False).head(10)

# Reset the index to have a clean DataFrame
top_10_high_quality_countries = top_10_high_quality_countries.reset_index()

# Rename the columns for better readability
top_10_high_quality_countries.columns = ['country', 'High Quality Articles per capita(10^6)']

# Display the top 10 countries by high quality articles per capita
top_10_high_quality_countries


Unnamed: 0,country,High Quality Articles per capita(10^6)
0,Montenegro,5.0
1,Luxembourg,2.857143
2,Albania,2.592593
3,Kosovo,2.352941
4,Maldives,1.666667
5,Lithuania,1.37931
6,Croatia,1.315789
7,Guyana,1.25
8,Palestinian Territory,1.090909
9,Slovenia,0.952381


Here, I anticipated that countries with large populations would have more high quality articles per capita, since more people might mean a better chance of vetting and thus higher quality articles. But it is interesting that countries like Montenegro and Luxembourg have more high quality articles per capita. This might be because, along with having small populations, they might have higher internet penetration, literacy rates, etc.

### 4. Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).

In [36]:
# Bottom 10 countries by high-quality article coverage

# Sort the high-quality articles per capita in ascending order and get the bottom 10 countries
bottom_10_high_quality = high_quality_articles_per_capita.sort_values(ascending=True).head(10)

# Reset the index to have a clean DataFrame
bottom_10_high_quality = bottom_10_high_quality.reset_index()

# Rename the columns for better readability
bottom_10_high_quality.columns = ['country', 'High Quality Articles per capita(10^6)']

# Display the bottom 10 countries by high-quality article coverage
bottom_10_high_quality


Unnamed: 0,country,High Quality Articles per capita(10^6)
0,Bangladesh,0.005764
1,Egypt,0.009506
2,Ethiopia,0.01581
3,Japan,0.016064
4,Pakistan,0.016632
5,Colombia,0.019157
6,Congo DR,0.01955
7,Vietnam,0.020222
8,Uganda,0.020576
9,Algeria,0.021368


It is interesting to see that although countries like India and China had the fewest articles per capita, they are not in the bottom 10 for high quality articles per capita. This might be because of the reasons I mentioned above. Countries like Bangladesh and Pakistan have the fewest high quality articles per capita. This might mean that while articles for these countries are present, they are not actually high quality and hence might not give us good insights into the workings of the country.
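
Dividing two grouped Series aligns them on country, so a country with no FA or GA articles ends up as NaN rather than 0, and `sort_values` places NaN last by default. The sketch below, on toy data with hypothetical names, shows one way such countries could be kept in the ranking with an explicit zero. It illustrates the alignment behaviour and is not a re-run of the results above.

```python
import pandas as pd

# Toy stand-in (hypothetical): country Y has no FA/GA articles.
df = pd.DataFrame({
    'country': ['X', 'X', 'Y', 'Z', 'Z'],
    'population': [2.0, 2.0, 5.0, 4.0, 4.0],
    'article_title': ['a', 'b', 'c', 'd', 'e'],
    'article_quality': ['GA', 'Stub', 'Start', 'FA', 'GA'],
})

population = df.groupby('country')['population'].first()
hq_counts = (df[df['article_quality'].isin(['FA', 'GA'])]
             .groupby('country')['article_title'].count())

# Plain division leaves Y as NaN because it is missing from hq_counts
with_nan = hq_counts / population

# Reindexing against every country and filling with 0 keeps Y in the ranking
with_zero = hq_counts.reindex(population.index, fill_value=0) / population
bottom = with_zero.sort_values(ascending=True)
print(bottom)
```

With explicit zeros, several countries may tie at 0, so a bottom 10 list would then need a tie-breaking rule.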

### 5. Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

In [44]:
# Calculate the total coverage of articles per region
# This divides the count of articles by the population value that .first() returns
# for each region, i.e. the population of the first country listed in that region
total_coverage_per_region = merged_data_3.groupby('region')['article_title'].count() / merged_data_3.groupby('region')['population'].first()

# Rank the regions by total coverage in descending order
total_coverage_per_region_ranked = total_coverage_per_region.sort_values(ascending=False)

# Calculate the high-quality coverage of articles per region
# High-quality articles are those with 'FA' or 'GA' quality
high_quality_articles = merged_data_3[merged_data_3['article_quality'].isin(['FA', 'GA'])]
high_quality_coverage_per_region = high_quality_articles.groupby('region')['article_title'].count() / merged_data_3.groupby('region')['population'].first()

# Reset the index and rename columns for better readability
total_coverage_per_region_ranked = total_coverage_per_region_ranked.reset_index()
total_coverage_per_region_ranked.columns = ['region', 'Articles per capita(10^6)']

# Display the ranked total coverage per region
total_coverage_per_region_ranked

Unnamed: 0,region,Articles per capita(10^6)
0,CARIBBEAN,2190.0
1,OCEANIA,710.0
2,CENTRAL AMERICA,376.0
3,SOUTHERN EUROPE,295.185185
4,WESTERN ASIA,203.333333
5,NORTHERN EUROPE,136.428571
6,EASTERN EUROPE,77.065217
7,WESTERN EUROPE,53.043478
8,EASTERN AFRICA,50.378788
9,SOUTHERN AFRICA,45.555556


It is interesting to note again that smaller regions have more articles per capita. This seems to be a trend. However, most Asian and African regions have fewer articles per capita. This might again be because of the reasons I mentioned above, like literacy rate and internet penetration. It might also be due to local governmental laws and policies in some places where there is no freedom of speech, access, etc.
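
Because `groupby('region')['population'].first()` returns the population of the first country in each region, the regional figures above are relative to that single country rather than to the whole region. A sketch of one alternative, on toy data with hypothetical names, is to sum the population of each distinct country within a region before dividing. The numbers here are illustrative only and do not reproduce the tables in this notebook.

```python
import pandas as pd

# Toy stand-in (hypothetical): region R1 has two countries, X and Y.
df = pd.DataFrame({
    'region': ['R1', 'R1', 'R1', 'R2'],
    'country': ['X', 'X', 'Y', 'Z'],
    'population': [1.0, 1.0, 9.0, 5.0],
    'article_title': ['a', 'b', 'c', 'd'],
})

# One row per country, then sum the country populations within each region
region_population = (df.drop_duplicates('country')
                       .groupby('region')['population'].sum())
article_counts = df.groupby('region')['article_title'].count()

# Articles per million people, using the summed regional population
per_million = article_counts / region_population

# For comparison: .first() uses only the first country's population
first_only = article_counts / df.groupby('region')['population'].first()
print(per_million, first_only, sep='\n')
```

This sums only countries that have at least one article. With the notebook's frames, the regional totals could instead be taken from `df_mapped_population`, which also includes countries without articles.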

### 6. Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

In [41]:
# Sort the high-quality coverage per region in descending order
sorted_high_quality_coverage_per_region = high_quality_coverage_per_region.sort_values(ascending=False)

# Reset the index to convert the Series to a DataFrame
sorted_high_quality_coverage_per_region = sorted_high_quality_coverage_per_region.reset_index()

# Rename the columns for better readability
sorted_high_quality_coverage_per_region.columns = ['region', 'High Quality Articles per capita(10^6)']

# Display the sorted DataFrame
sorted_high_quality_coverage_per_region

Unnamed: 0,region,High Quality Articles per capita(10^6)
0,CARIBBEAN,90.0
1,CENTRAL AMERICA,20.0
2,SOUTHERN EUROPE,19.62963
3,OCEANIA,10.0
4,WESTERN ASIA,9.0
5,NORTHERN EUROPE,6.428571
6,EASTERN EUROPE,4.130435
7,SOUTHERN AFRICA,2.962963
8,WESTERN EUROPE,2.282609
9,EASTERN AFRICA,1.287879


One difference from the previous ranking is that the Asian regions consistently lack high quality articles when compared to Africa. More literate and internet-connected regions like the Americas and Europe seem to have more high quality articles as well. This is an interesting observation. I expected the African regions to have fewer high quality articles per capita, but they have more than the Asian regions. One guess is that, because of greater Western influence in Africa, the Wikipedia coverage there might be richer.