## Imports and Dependencies

This section imports the libraries used throughout the notebook for data processing.

### Libraries Overview

- **`pandas`**: A data analysis and manipulation library, used here to load, merge, and save tabular data.

In [25]:

# Install pandas with pip or pip3 if it is not already available
import pandas as pd

## Loading Datasets for Analysis

In this step, we load two datasets: 
1. **Population Data** containing the countries, population, and corresponding regions.
2. **Politicians Data** with article scores, revision IDs, and other relevant information about political figures across various countries.

### Steps:
1. **Load the Population Dataset**: 
   - The `country_population_with_region.csv` file is loaded into a DataFrame called `population_df`. This dataset includes countries, their population, and the respective regions.
   
2. **Load the Politicians Dataset**: 
   - The `articles_scores.csv` file is loaded into a DataFrame called `politicians_df`. This dataset contains details about politicians, including article quality scores retrieved from the ORES API, along with other article-related metadata.
   
3. **Display Data**: 
   - To verify the data is loaded correctly, we display the first few rows of each DataFrame using `display()`.


In [17]:
# Load the datasets
population_df = pd.read_csv('../data/country_population_with_region.csv')
politicians_df = pd.read_csv('../data/articles_scores.csv')

# Display the first few rows of both DataFrames to verify they are loaded correctly
print("Politicians with Quality Data:")
display(politicians_df.head())

print("\nPopulation Data (Countries with Region):")
display(population_df.head())

Politicians with Quality Data:


| Unnamed: 0 | name | url | country | revision_id | quality_score |
|---|---|---|---|---|---|
| 0 | Majah Ha Adrif | https://en.wikipedia.org/wiki/Majah_Ha_Adrif | Afghanistan | 1233203000.0 | Start |
| 1 | Haroon al-Afghani | https://en.wikipedia.org/wiki/Haroon_al-Afghani | Afghanistan | 1230460000.0 | B |
| 2 | Tayyab Agha | https://en.wikipedia.org/wiki/Tayyab_Agha | Afghanistan | 1225662000.0 | Start |
| 3 | Khadija Zahra Ahmadi | https://en.wikipedia.org/wiki/Khadija_Zahra_Ah... | Afghanistan | 1234742000.0 | Stub |
| 4 | Aziza Ahmadyar | https://en.wikipedia.org/wiki/Aziza_Ahmadyar | Afghanistan | 1195651000.0 | Start |



Population Data (Countries with Region):


| Unnamed: 0 | country | population | region |
|---|---|---|---|
| 0 | Algeria | 46.8 | NORTHERN AFRICA |
| 1 | Egypt | 105.2 | NORTHERN AFRICA |
| 2 | Libya | 6.9 | NORTHERN AFRICA |
| 3 | Morocco | 37.0 | NORTHERN AFRICA |
| 4 | Sudan | 48.1 | NORTHERN AFRICA |
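
The `Unnamed: 0` column in both outputs usually appears when a CSV was written with its index and then read back without telling pandas about it. The sketch below shows one way to avoid it, assuming the leading column really is a saved index; the inline CSV text is a hypothetical stand-in for the real files.

```python
import io

import pandas as pd

# Hypothetical sample mimicking a CSV that was saved with its index included
csv_text = (
    ",country,population,region\n"
    "0,Algeria,46.8,NORTHERN AFRICA\n"
    "1,Egypt,105.2,NORTHERN AFRICA\n"
)

# Default read: the unnamed leading column is kept as "Unnamed: 0"
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the leading column as the index instead of as data
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

The notebook still works without this change, because the final column selection drops `Unnamed: 0` anyway.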


## Merging Datasets and Handling Mismatches

In this step, we merge the **Population** and **Politicians** datasets on the `country` field to consolidate data for further analysis. We record any countries that appear in only one of the datasets and keep only the matched rows for further calculations.

### Steps:
1. **Merge Datasets**:
   - We use the `pd.merge()` function to merge the two datasets (`population_df` and `politicians_df`) on the `country` field. The merge is performed with an outer join to ensure we identify countries that are present in only one of the datasets.
   
2. **Handle Mismatches**:
   - We check for countries that do not match between the two datasets (i.e., countries that are missing from either the population dataset or the politicians dataset). These countries are saved to a file named `wp_countries-no_match.txt`.
   
3. **Filter Matched Data**:
   - We keep only the records that matched in both datasets (i.e., rows where the `_merge` column is `'both'`).
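
As a minimal sketch of how the `indicator=True` flag behaves with an outer join, the toy example below uses made-up frames (`politicians`, `population`, and the country `Atlantis` are hypothetical, not taken from the real data). It also shows that `_merge` can tell the two kinds of mismatch apart, which the single no-match file does not record.

```python
import pandas as pd

# Toy data, for illustration only
politicians = pd.DataFrame({
    "name": ["A", "B", "C"],
    "country": ["Algeria", "Egypt", "Atlantis"],
})
population = pd.DataFrame({
    "country": ["Algeria", "Egypt", "Libya"],
    "population": [46.8, 105.2, 6.9],
})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
merged = pd.merge(politicians, population, on="country", how="outer", indicator=True)

# Countries that have politicians but no population entry
politicians_only = sorted(merged.loc[merged["_merge"] == "left_only", "country"].unique())

# Countries that have a population entry but no politicians
population_only = sorted(merged.loc[merged["_merge"] == "right_only", "country"].unique())

# Rows present in both inputs
matched = merged[merged["_merge"] == "both"]

print(politicians_only, population_only, len(matched))
```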


In [24]:
# Merge the two datasets on the 'country' field
merged_df = pd.merge(politicians_df, population_df, on='country', how='outer', indicator=True)

# Save the countries that did not match between the two datasets
no_match_df = merged_df[merged_df['_merge'] != 'both'][['country']].drop_duplicates()
no_match_df.to_csv('../data/wp_countries-no_match.txt', index=False, header=False)

# Keep only the successfully matched data
matched_df = merged_df[merged_df['_merge'] == 'both']
matched_df.head()

| Unnamed: 0 | name | url | country | revision_id | quality_score | population | region | _merge |
|---|---|---|---|---|---|---|---|---|
| 0 | Majah Ha Adrif | https://en.wikipedia.org/wiki/Majah_Ha_Adrif | Afghanistan | 1233203000.0 | Start | 42.4 | SOUTH ASIA | both |
| 1 | Haroon al-Afghani | https://en.wikipedia.org/wiki/Haroon_al-Afghani | Afghanistan | 1230460000.0 | B | 42.4 | SOUTH ASIA | both |
| 2 | Tayyab Agha | https://en.wikipedia.org/wiki/Tayyab_Agha | Afghanistan | 1225662000.0 | Start | 42.4 | SOUTH ASIA | both |
| 3 | Khadija Zahra Ahmadi | https://en.wikipedia.org/wiki/Khadija_Zahra_Ah... | Afghanistan | 1234742000.0 | Stub | 42.4 | SOUTH ASIA | both |
| 4 | Aziza Ahmadyar | https://en.wikipedia.org/wiki/Aziza_Ahmadyar | Afghanistan | 1195651000.0 | Start | 42.4 | SOUTH ASIA | both |
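
The `.0` suffix on `revision_id` in the output above suggests the column is stored as a float, which pandas typically does when some values are missing. If integer IDs are wanted, one option is the nullable `Int64` dtype. This is a sketch under that assumption; the values below are hypothetical.

```python
import pandas as pd

# Hypothetical values; None stands in for a missing revision ID
df = pd.DataFrame({"revision_id": [1233202991.0, None, 1195651393.0]})

# The nullable integer dtype keeps missing values as <NA> instead of forcing floats
df["revision_id"] = df["revision_id"].astype("Int64")

print(df["revision_id"].dtype)
```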


## Merging Datasets and Preparing the Final DataFrame

In this step, we merge the **Population** and **Politicians** datasets on the `country` field using an inner join to retain only successfully matched data. The final dataset is processed to select relevant columns for further analysis, and we ensure that columns are renamed to meet the assignment requirements.

### Steps:
1. **Merge Datasets**:
   - We use `pd.merge()` to merge the population and politicians data on the `country` field using an inner join, ensuring only countries present in both datasets are retained.
   
2. **Handle Mismatches**:
   - An inner join keeps only rows whose country appears in both datasets, so there are no unmatched rows to detect at this stage. The unmatched countries were already saved to `wp_countries-no_match.txt` by the outer join in the previous step, and that file is left untouched here.
   
3. **Filter and Select Columns**:
   - After the merge, we retain the columns required for further analysis, listed here by their final names: `country`, `region`, `population`, `article_title`, `revision_id`, and `article_quality`.
   
4. **Rename Columns**:
   - The columns are renamed according to the assignment's requirements:
     - `name` to `article_title`
     - `quality_score` to `article_quality`

In [22]:
# Merge the two datasets on the 'country' field, keeping only countries present in both
merged_df = pd.merge(politicians_df, population_df, on='country', how='inner')

# An inner join returns matched rows only, so no '_merge' filtering is needed here.
# The unmatched countries were written to wp_countries-no_match.txt by the outer join above;
# re-saving them from an inner join would overwrite that file with an empty list.
matched_df = merged_df

# Select relevant columns for the final DataFrame
final_df = matched_df[['country', 'region', 'population', 'name', 'revision_id', 'quality_score']]

# Rename columns as per the assignment's requirement
final_df = final_df.rename(columns={
    'name': 'article_title',
    'quality_score': 'article_quality'
})
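
The inner merge, column selection, and renaming above can be tried on toy data, as sketched below. The `validate="many_to_one"` argument is an optional addition, not part of the original cell: it makes pandas raise a `MergeError` if any country appears more than once in the population table, which would otherwise silently duplicate politician rows. All frame contents here are hypothetical.

```python
import pandas as pd

# Toy data, for illustration only
politicians = pd.DataFrame({
    "name": ["A", "B"],
    "country": ["Algeria", "Algeria"],
    "revision_id": [1, 2],
    "quality_score": ["Stub", "B"],
})
population = pd.DataFrame({
    "country": ["Algeria", "Egypt"],
    "population": [46.8, 105.2],
    "region": ["NORTHERN AFRICA", "NORTHERN AFRICA"],
})

# Many politicians map to one population row per country
merged = pd.merge(
    politicians, population, on="country", how="inner", validate="many_to_one"
)

# Select the needed columns, then rename to the final names
final = (
    merged[["country", "region", "population", "name", "revision_id", "quality_score"]]
    .rename(columns={"name": "article_title", "quality_score": "article_quality"})
)

print(list(final.columns))
```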

## Saving the Final Merged Dataset

After merging the **Population** and **Politicians** datasets and preparing the final DataFrame, the next step is to save this data for future use and analysis. We also print a short summary of the merge results.

### Steps:
1. **Save the Final Merged Dataset**:
   - The final DataFrame, which contains columns such as `country`, `region`, `population`, `article_title`, `revision_id`, and `article_quality`, is saved as `wp_politicians_by_country.csv`.
   
2. **Summary Output**:
   - We print a summary of the final merged dataset, including:
     - The total number of rows in the final DataFrame.
     - Confirmation that unmatched countries have been saved to `wp_countries-no_match.txt`.
     - Confirmation that the merged dataset has been saved to `wp_politicians_by_country.csv`.


In [20]:
# Save the final merged dataset to a CSV file
final_df.to_csv('../data/wp_politicians_by_country.csv', index=False)

# Summary
print("\nSummary of Merged Data:")
print(f"Final merged dataset: {final_df.shape[0]} rows")
print("Unmatched countries saved to 'wp_countries-no_match.txt'.")
print("Final merged dataset saved to 'wp_politicians_by_country.csv'.")


Summary of Merged Data:
Final merged dataset: 7002 rows
Unmatched countries saved to 'wp_countries-no_match.txt'.
Final merged dataset saved to 'wp_politicians_by_country.csv'.
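
A quick way to confirm the save worked is to read the file back and compare it with the DataFrame in memory. The sketch below does this with a hypothetical two-row frame and a temporary directory rather than the real `../data/` path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for final_df
final = pd.DataFrame({
    "country": ["Algeria", "Egypt"],
    "region": ["NORTHERN AFRICA", "NORTHERN AFRICA"],
    "population": [46.8, 105.2],
    "article_title": ["A", "B"],
    "revision_id": [1, 2],
    "article_quality": ["Stub", "B"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "wp_politicians_by_country.csv")
    # index=False keeps the row index out of the file
    final.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Same number of rows and columns, and the same column names in the same order
same_shape = reloaded.shape == final.shape
same_columns = list(reloaded.columns) == list(final.columns)
print(same_shape, same_columns)
```

Because the file is written with `index=False`, reading it back does not produce an `Unnamed: 0` column.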
