![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-viz-of-the-week&branch=main&subPath=world-childrens-day/world-childrens-day.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Callysto's Weekly Data Visualization


## World Children's Day

### Recommended Grade levels: 5-12

### Instructions

Click "Cell" and select "Run All".

This will import the data and run all the code, so you can see this week's data visualization. Scroll back to the top after you’ve run the cells.

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

**You don't need to do any coding to view the visualizations**.

The plots generated in this notebook are interactive. You can hover over and click on elements to see more information. 

Email contact@callysto.ca if you experience issues.

### About this Notebook

Callysto's Weekly Data Visualization is a learning resource that aims to develop data literacy skills. We provide Grades 5-12 teachers and students with a data visualization, like a graph, to interpret. This companion resource walks learners through how the data visualization is created and interpreted by a data scientist. 

The steps of the data analysis process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer?
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data, so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer the question. This includes creating visualizations. 
5. Interpret - Describe what's happening in the data visualization. 
6. Communicate - Explain how the evidence answers the question. 

## Question

<center><img src="images/world-childrens-day.jpg" width=334 height=240><br>
<i style="font-size:9px;">Photo from <a href="https://stock.adobe.com/ca/images/world-childrenas-day-text-with-happy-little-children-arm-in-arm-watercolor-illustration-generative-ai/593202474">Adobe Stock</a> by <a href="https://stock.adobe.com/ca/contributor/202718039/cartoon-it?load_type=author&prev_url=detail">Cartoon-IT</a></i></center>

[**World Children's Day**](https://www.un.org/en/observances/world-childrens-day) is an annual observance dedicated to promoting and celebrating the rights and well-being of children worldwide. Recognized on November 20th each year, this day marks the anniversary of the adoption of the [Convention on the Rights of the Child](https://www.unicef.org/child-rights-convention) by the United Nations in 1989. 


### Goal

Our goal in this notebook is to uncover trends in the global population, focusing on the percentage of individuals aged 14 and under, and the number of children worldwide who are out of primary school. Specifically, we want to see whether these numbers are rising or falling, and find out whether educational efforts are being made to help children attend primary school.

### Background

The well-being and education of younger generations play a pivotal role in shaping the future of our global community.  Children represent our future leaders, innovators, and caretakers, and their success contributes to the ongoing cycle of progress.

This background sets the stage for our exploration, emphasizing the significance of understanding and addressing trends in the global population, particularly those related to the percentage of individuals aged 14 and under, and the educational status of children worldwide.

## Gather

Our data is collected through [the World Bank](https://data.worldbank.org/). `Population ages 0-14 (% of total population)` is sourced by the *United Nations Population Division*, and can be found [here](https://data.worldbank.org/indicator/SP.POP.0014.TO.ZS). `Children out of school, primary` is sourced by the *UNESCO Institute for Statistics*, and can be found [here](https://data.worldbank.org/indicator/SE.PRM.UNER).

### Code: 

Run the code cells below to import the libraries we need for this project. Libraries are collections of pre-made code that make it easier to analyze our data.

In [None]:
%pip install -q pyodide_http plotly nbformat geopandas ipywidgets pycountry-convert folium
import pyodide_http
pyodide_http.patch_all()
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import pycountry_convert as pc
import ipywidgets
from ipywidgets import interact
import geopandas as gpd
import folium 

print("Libraries imported.")

### Data

In this notebook we will be working with 2 datasets. 

Our first dataset, called `percentage_pop`, focuses on *population ages 0-14 (% of total population)*. This metric is a key indicator of demographic distribution and shows how large the younger age group is on a global scale. The [World Bank](https://data.worldbank.org/indicator/SP.POP.0014.TO.ZS) describes the development relevance of this indicator as follows:

<div style="background-color: #1E90FF; color: #FFFFFF; border: 1px solid #000080; border-left: 5px solid #000080; padding: 10px;">
    <strong>Development Relevance: </strong>
    <p style="margin-top: 10px;">
        Patterns of development in a country are partly determined by the age composition of its population. Different age groups have different impacts on both the environment and on infrastructure needs. Therefore the age structure of a population is useful for analyzing resource use and formulating future policy and planning goals with regards infrastructure and development. This indicator is used for calculating age dependency ratio (percent of working-age population). The age dependency ratio is the ratio of the sum of the population aged 0-14 and the population aged 65 and above to the population aged 15-64. In many developing countries, the once rapidly growing population group of the under-15 population is shrinking. As a result, high fertility rates, together with declining mortality rates, are now reflected in the larger share of the 65 and older population.
    </p>
</div>

The second dataset, called `primary_school`, focuses on children out of elementary/primary school, shedding light on how many children are not receiving a primary education. The [World Bank](https://data.worldbank.org/indicator/SE.PRM.UNER) describes the development relevance of this indicator as follows:

<div style="background-color: #1E90FF; color: #FFFFFF; border: 1px solid #000080; border-left: 5px solid #000080; padding: 10px;">
    <strong>Development Relevance: </strong>
    <p style="margin-top: 10px;">
        Large numbers of children out of school create pressure to enroll children and provide classrooms, teachers, and educational materials, a task made difficult in many countries by limited education budgets. However, getting children into school is a high priority for countries and crucial for achieving universal primary education.
    </p>
</div>

### Import the data

In [None]:
percentage_pop = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/world-children's-day/percentageofpopchildren.csv", skiprows=4)
primary_school = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/world-children's-day/outofprimaryschool.csv", skiprows=4)
del percentage_pop['Unnamed: 67']
del primary_school['Unnamed: 67']
print("Datasets imported.")

### Comment on the data

Now that we've imported our data, let's take a deeper look into rows and columns in our dataset. Note, our dataset is now in a **dataframe** format. Essentially, think of a dataframe as a *spreadsheet* where we have *columns* representing different categories or attributes of the data, and *rows* representing individual records or observations. Just like a spreadsheet organizes information into rows and columns, a dataframe provides a structured format for storing and analyzing data in Python.
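
To make the spreadsheet comparison concrete, here is a minimal sketch of a tiny dataframe. The country names and numbers are made up for illustration and are not part of our datasets.

```python
import pandas as pd

# Hypothetical example data, not taken from the World Bank files
example_df = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "2021": [30.1, 18.4, 42.0],
    "2022": [29.8, 18.1, 41.5],
})

# shape gives (number of rows, number of columns)
print(example_df.shape)

# columns lists the column labels, like the header row of a spreadsheet
print(list(example_df.columns))
```

Each row here is one record (a country), and each column is one attribute (a name or a year).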

In [None]:
display(percentage_pop)

Our dataset `percentage_pop` consists of several columns. Let's go over the ones that are most applicable to analysis. 

- **Country Name**: Identifies the name of the country.

- **Country Code**: A unique 3-letter code for each country.

- **1960-2022**: The subsequent columns (from **1960** to **2022**) track the percentage of the population aged 0-14 for their corresponding country.

In [None]:
display(primary_school)


Similarly, for `primary_school` we have the following columns:

- **Country Name**: Identifies the name of the country.

- **Country Code**: A unique 3-letter code for each country, known as the [ISO 3166-1 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) code.

- **1960-2022**: The subsequent columns (from **1960** to **2022**) track the number of children out of primary school for their corresponding country.

## Organize

Now that we have a better sense of the different columns and rows in our dataframe, let's *organize* or *transform* our data for useful analysis. In coding terms, this is known as **data-cleaning**. 

Data cleaning involves the process of identifying and removing errors, inconsistencies, or missing values in a dataset to ensure that the data is accurate. 

We'll begin by finding how many entries in our dataframe are **None** or **NaN**. 

In a dataframe, "None" or "NaN" (Not a Number) typically represents missing or undefined values. It indicates that the data for a particular cell or entry is not available or is undefined. 
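
As a rough illustration of how missing values are counted, the sketch below uses a small made-up dataframe (the name `toy_df` and its values are assumptions for this example only). It shows why `.sum()` is called twice: once per column, then once more for a grand total.

```python
import numpy as np
import pandas as pd

# Hypothetical values; np.nan and None both mark a missing entry
toy_df = pd.DataFrame({
    "2020": [10.0, np.nan, 30.0],
    "2021": [11.0, 21.0, None],
})

# isnull() marks each missing cell as True;
# the first sum() counts the True values in each column
missing_per_column = toy_df.isnull().sum()

# the second sum() adds the per-column counts into one total
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(f"Total number of missing values: {total_missing}")
```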

In [None]:
total_none = percentage_pop.isnull().sum().sum()
print(f"Total number of missing values: {total_none}")

Now that we know how many values are missing from our dataframe, we are going to assign a **continent** to each of our countries. We will use the library `pycountry_convert` to perform this conversion, wrapped inside a **function**.

A function is a self-contained, reusable set of instructions that performs a specific task or operation. Think of it as a predefined set of code that can be re-purposed and re-used multiple times. 
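
For instance, a very small function might look like the sketch below. The name `percent_of_total` is hypothetical and is not used elsewhere in this notebook; it only shows how a function is defined once and then reused.

```python
def percent_of_total(part, total):
    """Return what percentage `part` is of `total`, rounded to 2 decimals."""
    return round(part / total * 100, 2)

# The same function can be reused with different inputs
print(percent_of_total(25, 100))
print(percent_of_total(3, 12))
```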

In [None]:
def country_code_to_continent(country_code):
    # Look-up table from 3-letter country codes to country names
    mapping = pc.map_country_alpha3_to_country_name()
    try:
        country_name = mapping[country_code]
        country_alpha2 = pc.country_name_to_country_alpha2(country_name)
        country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
        country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
        return country_continent_name
    except Exception:
        # Codes that cannot be converted (for example, regional groupings) get no continent
        return None

print('Test instance of our function, converting "CAN" (Canada):', country_code_to_continent('CAN'))

Testing our function, we get the correct output of *North America* for CAN (Canada). Now that we've tested the functionality of our function, we can apply it to each country in our dataframe in the two code blocks below. 

In [None]:
country_codes = list(percentage_pop['Country Code'])
print("List of country codes in our dataframe: ",country_codes)

In [None]:
# Convert country codes to continent names using our function country_code_to_continent
# (the function already returns None for any code it cannot convert)
for country in country_codes:
    percentage_pop.loc[percentage_pop['Country Code'] == country, 'Continent'] = country_code_to_continent(country)
percentage_pop

Looking at our new dataframe output above, we see that we've added a new column called `Continent`, which contains the corresponding continent for each country. However, we notice that some of our column values are **None**.

In our case, the function returned None mostly because these entries are *not countries*. Instead, they are figures for particular regions of the world or for groups of countries combined. It is worth scanning the list printed below in case a real country or territory was not recognized by `pycountry_convert`. For our purposes, we can remove these rows, as they will not be important for future analysis in our notebook.

In [None]:
pd.set_option('display.max_rows', None)
countries_without_label = percentage_pop[percentage_pop['Continent'].isna()].reset_index(drop=True)
temp = list(countries_without_label['Country Name'])
print("Rows that we can remove:  ")
display(temp)
pd.reset_option('display.max_rows')

Before removing these rows, there is **one** particular row that is useful despite not being a country: the row containing **World** data. This row holds the percentage for the world's population as a whole, so we can use it as a baseline for comparison.

In [None]:
world_df = percentage_pop[percentage_pop['Country Name'] == 'World']
world_df

Now that we've set aside the row we want to keep, we can safely remove the rest below.

In [None]:
percentage_pop = percentage_pop.dropna(subset=['Continent'])
percentage_pop

## Explore and Interpret

Now that our dataframe has been properly cleaned, we can perform some rudimentary analysis. We can start by looking at the countries that consistently have higher percentages of their population aged 14 or younger. Similarly, we can look at the flip side and view countries that have lower percentages of their population aged 14 or younger.

In [None]:
cols_to_check = percentage_pop.columns[4:-1]
hashmap_countries_max = {}
print("Countries that have the highest percentage of children aged 14 and under per year: \n")
for column in cols_to_check:
    max_val = percentage_pop[column].max()
    results = percentage_pop.loc[percentage_pop[column] == max_val]
    
    country_name = results['Country Name'].values[0]
    hashmap_countries_max[country_name] = 1+hashmap_countries_max.get(country_name, 0)
    
    print(f"{country_name} - {column}: {max_val}%")

In [None]:
cols_to_check = percentage_pop.columns[4:-1]
hashmap_countries_min = {}
print("Countries that have the lowest percentage of children aged 14 and under per year: \n")

for column in cols_to_check:
    min_val = percentage_pop[column].min()
    results = percentage_pop.loc[percentage_pop[column] == min_val]
    
    country_name = results['Country Name'].values[0]
    hashmap_countries_min[country_name] = 1+hashmap_countries_min.get(country_name, 0)
    
    print(f"{country_name} - {column}: {min_val}%")

Looking at the output for both highest and lowest percentages, we see two visible trends. First, both the *highest* and *lowest* percentages trend *downwards* over time. Second, *one country* tends to *persist* in the output for *consecutive years* before being replaced by another. As a result, there is little variety in the countries that have the highest and lowest percentages.

To gain a clearer understanding of the distribution of countries in our two outputs, we can identify each unique occurrence of a country and add to its occurrence count when it is seen more than once. 
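
The counting idea can be sketched on its own with a short made-up list. This example uses `collections.Counter` from Python's standard library as an alternative to the plain dictionary used in this notebook; the list `yearly_leaders` is hypothetical.

```python
from collections import Counter

# Hypothetical yearly "leaders", made up for illustration
yearly_leaders = [
    "Country A", "Country A", "Country B",
    "Country A", "Country B", "Country C",
]

# Counter tallies how many times each name appears in the list
leader_counts = Counter(yearly_leaders)

# most_common() lists the names from most to least frequent
for country, count in leader_counts.most_common():
    print(f"{country}, Total count: {count}")
```

A dictionary with `.get(name, 0) + 1`, as used in the cells above, produces the same tallies; `Counter` simply packages that pattern.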

In [None]:
hashmap_countries_max = {k: v for k, v in sorted(hashmap_countries_max.items(), key=lambda item: item[1])}

print(f"Total number of unique countries for highest percentage: {len(hashmap_countries_max)}")
for country in hashmap_countries_max:
    print(f"{country}, Total count: {hashmap_countries_max[country]}")

print('\n')
hashmap_countries_min = {k: v for k, v in sorted(hashmap_countries_min.items(), key=lambda item: item[1])}

print(f"Total number of unique countries for lowest percentage: {len(hashmap_countries_min)}")
for country in hashmap_countries_min:
    print(f"{country}, Total count: {hashmap_countries_min[country]}")

Looking at the output above, we get better insight into how the unique countries are distributed across the highest and lowest percentages.

For the highest percentage, *eight* different countries are identified, with varying counts for each. Notably, countries like *Uganda* and *Kenya* appear more frequently. On the other hand, the lowest percentage output involves only *four* distinct countries, with *Monaco* having the highest count.

Now that we've done some rudimentary analysis on our dataframes, we can start **visualizing** the trends in our data. We can start by using our `world_df` dataframe, which contains the percentage of the world's population who are 14 and under.

In [None]:
years = []
percentages = []

for column in cols_to_check:
    years.append(int(column))
    percentages.append(world_df[column].values[0])

px.line(x=years, y=percentages, labels={'x': 'Year', 'y': 'Percentage'}, title='World Population Percentage Under 14 from 1960-2022').show()

Looking at our output, we see that the percentage of the world's population aged 14 and under has fallen substantially since 1960. The smallest percentage shown in the figure above is in the year *2022*, at *25.27%* rounded. What does a declining percentage of children in our global population suggest about the future of different countries?

We can also narrow our focus by finding averages based on *continent*. 

In [None]:
individual_continents = percentage_pop[['Continent'] + list(cols_to_check)]

continents_avg = individual_continents.groupby('Continent').mean().reset_index()
continents_avg

Using our new `continents_avg` dataframe, which contains the average percentage across the countries in each continent, we can visualize the differences below.
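
To see what grouping and averaging does, here is a minimal sketch on a made-up table (the dataframe `toy` and its values are assumptions for illustration). Note that `.mean()` treats every country equally, regardless of how many people live there.

```python
import pandas as pd

# Hypothetical countries and percentages
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C", "D"],
    "Continent": ["Africa", "Africa", "Europe", "Europe"],
    "2022": [40.0, 44.0, 14.0, 16.0],
})

# Group the rows by continent, then average the year column within each group
toy_avg = toy[["Continent", "2022"]].groupby("Continent").mean().reset_index()
print(toy_avg)
```

Because each country counts once, a continental average computed this way can differ from a figure based on the continent's total population.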

In [None]:
continental_melted = pd.melt(continents_avg, id_vars=['Continent'], value_vars=cols_to_check, var_name='Year', value_name='Percentage')

px.line(continental_melted, x='Year', y='Percentage', color='Continent',labels={'x': 'Year', 'y': 'Percentage'}, title='Average Continental Population Percentage Under 14 from 1960-2022').show()

 **Africa** stands out with the *highest* population percentage under 14, illustrating the continent's *youthful* demographic profile. In contrast, **Europe** exhibits the *lowest* percentage, indicative of an *aging* population. Notably, **North America**, while initially starting with a relatively *high* percentage, has experienced a *rapid decline*, aligning with the overall downward trend observed across all continents. 

We can also visualize individual countries grouped by their respective continents to get a better idea of which countries exhibit rapid decline, alongside the few countries that show growth.

In [None]:
columns_to_plot = percentage_pop.columns[4:-1]

continents = ['North America', 'Asia', 'Europe', 'Oceania', 'Africa', 'South America']

continent_dropdown = ipywidgets.Dropdown(options=continents, description='Continent')
def update_plot(continent):
    continent_filtered = percentage_pop[percentage_pop['Continent'] == continent]

    per_year_df = pd.melt(continent_filtered, id_vars=['Country Name'], value_vars=columns_to_plot, var_name='Year', value_name='Percentage')

    world_avg_df = pd.melt(world_df, id_vars='Country Name', value_vars=columns_to_plot, var_name='Year', value_name='Percentage')
    # Set the country name for the world average
    world_avg_df['Country Name'] = 'World'  

    final_df = pd.concat([per_year_df, world_avg_df], ignore_index=True)

    continental_fig = px.line(final_df, x='Year', y='Percentage', color='Country Name', line_group='Country Name', hover_name='Country Name')
    continental_fig.update_layout(title=f'Countries in {continent} with Population Percentage Under 14', xaxis_title='Year', yaxis_title='Percentage', legend_title='Country', height=800).show()

interact(update_plot, continent=continent_dropdown)

We can also shift to a spatial perspective by using a [*Folium*](https://pypi.org/project/folium/) map. Using a [*geojson*](https://en.wikipedia.org/wiki/GeoJSON) file, we can show how the percentage of the population under 14 varies across different regions.

In [None]:
countries_geojson = gpd.read_file('https://raw.githubusercontent.com/callysto/data-files/main/SocialStudies/UnitedNations/countries.geojson')
countries_geojson

Before we visualize our Folium map, it's important to ensure that the data we have is *compatible*. Essentially, we need to *match* the countries found in the geojson file with our `percentage_pop` dataframe so that each country is paired with its correct `geometry`. The `geometry` column in this case contains information about a specific country's latitude, longitude, and borders so it can be accurately displayed on a map.
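
The matching step can be sketched with two small made-up tables. The codes and names below are hypothetical; the point is that a *left* merge keeps every row from the first table and fills in missing values where no match is found.

```python
import pandas as pd

# Hypothetical "shapes" table standing in for the geojson file
shapes = pd.DataFrame({
    "ADMIN": ["Country A", "Country B", "Country C"],
    "ISO_A3": ["AAA", "BBB", "CCC"],
})

# Hypothetical data table; note that CCC has no entry
values = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "2022": [30.0, 20.0],
})

# how='left' keeps all rows of `shapes`, matched or not
toy_merged = pd.merge(shapes, values, left_on="ISO_A3", right_on="Country Code", how="left")

# Rows without a match have missing values in the columns from `values`
unmatched = toy_merged[toy_merged["Country Code"].isna()]
print(toy_merged)
print("Unmatched:", list(unmatched["ADMIN"]))
```

Checking for unmatched rows like this is one way to spot countries that would have no data to display on the map.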

In [None]:
merged_df = pd.merge(countries_geojson, percentage_pop, left_on='ISO_A3', right_on='Country Code', how='left')
merged_df

Let's display our newly merged dataframe `merged_df`. Using the drop-down menu at the top, you can select a particular *year* to visualize. Do certain years look drastically different from others?

**Note**: The map below takes slightly longer to load in than other code cells. Let this cell run until the map is fully loaded/rendered properly. 

In [None]:
percentage_country_map = ipywidgets.Output(layout={'border': '1px solid black'})

column_names = merged_df.columns[7:-1].tolist()
dropdown_options = ipywidgets.Dropdown(
    options=column_names,
    value=column_names[0],
    description='Year:',
    disabled=False
)

def update_choropleth(change):
    percentage_country_map.clear_output()
    with percentage_country_map:
        m = folium.Map()
        folium.Choropleth(
            geo_data=countries_geojson,
            data=merged_df,
            columns=['ADMIN', dropdown_options.value],  
            key_on='feature.properties.ADMIN',  
            fill_color='YlGn',
            fill_opacity=0.7,
            line_opacity=0.2,
            legend_name=f'{dropdown_options.value}',
        ).add_to(m)
        display(m)

dropdown_options.observe(update_choropleth, names='value')
display(dropdown_options)
update_choropleth({'new': column_names[0]})

percentage_country_map

Moving on from our `percentage_pop` dataframe, we can shift our focus to our `primary_school` dataframe, which contains data centered around the number of children that are not in primary/elementary school. 

Similarly to our `percentage_pop` dataframe, we also need to perform data-cleaning to ensure the data is accurate and ready for further exploration.

In [None]:
total_none = primary_school.isnull().sum().sum()
print(f"Total number of missing values: {total_none}")

Compared to our `percentage_pop` dataframe, our `primary_school` dataframe contains many **NaN** values. This means many of the data entries are not suitable for analysis. As a result, let's remove any rows that are missing *more than half* of their entries.
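
Here is a minimal sketch of that filtering rule on a made-up table (the dataframe `toy` and the column list `year_cols` are assumptions for illustration). `count(axis=1)` counts the values that are *present* in each row, so a row is kept when at least half of its year columns are filled in.

```python
import numpy as np
import pandas as pd

# Hypothetical counts with some missing entries
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "2019": [100.0, np.nan, np.nan],
    "2020": [90.0, np.nan, 50.0],
    "2021": [80.0, 70.0, np.nan],
    "2022": [np.nan, np.nan, 40.0],
})
year_cols = ["2019", "2020", "2021", "2022"]

# Number of non-missing values in each row
non_null_counts = toy[year_cols].count(axis=1)

# Keep rows where at least half of the year columns have a value
kept = toy[non_null_counts >= len(year_cols) / 2].reset_index(drop=True)
print(kept)
```

In this toy table, country B has only one value out of four, so it is the only row removed.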

In [None]:
# primary_school has no 'Continent' column at the end, so slice to the last column to keep 2022
cols_to_check = primary_school.columns[4:]
cols_to_check

In [None]:
# count(axis=1) gives the number of non-missing values in each row
non_null_counts = primary_school[cols_to_check].count(axis=1)

# Eliminate any rows with more than half of the year columns missing
filtered_primary_school = primary_school[non_null_counts >= len(cols_to_check) / 2]
filtered_primary_school = filtered_primary_school.reset_index(drop=True)
display(filtered_primary_school)

Now that our dataframe has been cleaned properly, we can begin visualizing the number of children out of primary school in each country. As before, you can select a country to visualize using the drop-down menu titled *Country* below.

In [None]:
country_dropdown = ipywidgets.Dropdown(options=filtered_primary_school['Country Name'].unique(), description='Country')

def update_plot(country):
    country_data = filtered_primary_school[filtered_primary_school['Country Name'] == country]
    melted_country_data = pd.melt(country_data, id_vars=['Country Name'], value_vars=cols_to_check, var_name='Year', value_name='Population')
    px.line(melted_country_data, x='Year', y='Population',labels={'x': 'Year', 'y': 'Population'},title=f'Progression of Number of Children out of Primary School from 1960-2022 in {country}').show()
    
interact(update_plot, country=country_dropdown)

In our final visualization, we can also compare the progression of each country in terms of improving their means of providing access to primary education for children. 

We can achieve this by taking each country's most *recent year* that does not have a NaN value and compare it with their average for the *past 20 years*.
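
The idea of "most recent year with a value" can be sketched on a single made-up row. This example uses the pandas method `Series.last_valid_index()` as an alternative to a loop; the variable `yearly_counts` and its numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly counts for one country; 2022 has not been reported
yearly_counts = pd.Series({"2019": 500.0, "2020": 450.0, "2021": 400.0, "2022": np.nan})

# last_valid_index() returns the label of the last non-missing entry
recent_year = yearly_counts.last_valid_index()
recent_value = yearly_counts[recent_year]

# mean() skips missing values by default
average_value = yearly_counts.mean()

print(f"Most recent year with data: {recent_year} ({recent_value})")
print(f"Average over available years: {average_value}")
```

Here the most recent value (400) is below the average (450), which would suggest improvement for this made-up country.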

In [None]:
# The 20 years from 2002 to 2021, matching the 20-year average described above
years_to_check = [str(year) for year in range(2002, 2022)]

country_names = []
recent_years = []
recent_values = []
average_values = []

for index, row in filtered_primary_school.iterrows():
    # Walk backwards from 2022 to find the most recent year that has a value
    recent_year = None
    recent_value = None
    for year in range(2022, 1999, -1):
        value = row.get(str(year), None)
        if not pd.isna(value):
            recent_year = year
            recent_value = value
            break

    # Average over 2002-2021; mean() skips missing values
    average_2002_to_2021 = row[years_to_check].mean()

    country_names.append(row['Country Name'])
    recent_years.append(recent_year)
    recent_values.append(recent_value)
    average_values.append(average_2002_to_2021)

comparison_fig = go.Figure()

comparison_fig.add_trace(go.Bar(x=country_names, y=recent_values,text=recent_years, hovertemplate='Year: %{text}<br>Number of Children out of Primary School: %{y}',name='Most Recent Year', marker_color='blue'))
comparison_fig.add_trace(go.Bar(x=country_names, y=average_values, name='20-Year Average', marker_color='orange'))

comparison_fig.update_layout(title="Number of Children out of Primary School: Most Recent Year vs. 20-Year Average",xaxis_title='Country', yaxis_title='Number of Children out of Primary School',barmode='group', height=800).show()

Looking at the output, we see that the results are mixed. Many countries, such as *Portugal* and *Morocco*, have made strides in lowering the number of children out of primary school. Since Portugal is difficult to see at the current scale of the figure, zoom in by *dragging your mouse* over the section containing only Portugal. However, many countries still struggle to keep the number of children out of primary school low.

## Communicate

Below are some writing prompts to help you reflect on the new information that is presented from the data. When we look at the evidence, think about what you perceive about the information. Is this perception based on what the evidence shows? If others were to view it, what perceptions might they have?

- I used to think ____________________but now I know____________________. 
- I wish I knew more about ____________________. 
- This visualization reminds me of ____________________. 
- I really like ____________________.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)