![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=SocialStudies&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"></a>

# United Nations



The [Human Development Index](https://hdr.undp.org/data-center/human-development-index#/indicies/HDI) (HDI) stands as a pivotal tool designed by the United Nations to provide a comprehensive assessment of a nation's development. The emphasis behind the HDI is the fundamental principle that the well-being and capabilities of individuals should serve as the primary indicators of a country's progress, rather than relying solely on economic growth. Comprised of various key dimensions, the HDI aims to offer a holistic perspective on human development by evaluating the average progress on three fundamental aspects:

1. **A Long and Healthy Life**: This dimension encompasses the evaluation of life expectancy, reflecting the overall health and well-being of the population within a nation.

2. **Knowledge**: This dimension takes into account factors such as educational attainment and the overall accessibility to education, indicating the level of intellectual development within a society.

3. **A Decent Standard of Living**: The HDI also considers the standard of living within a country, taking into account various socioeconomic factors that contribute to the quality of life, such as income, employment opportunities, and access to basic amenities.

For a more thorough explanation, see the diagram in the UN's [technical notes](https://hdr.undp.org/sites/default/files/2021-22_HDR/hdr2021-22_technical_notes.pdf), which shows how the HDI scoring model is built:

<div style="text-align:center"><img src="images/HDI_diagram.png" alt="HDI Diagram" /></div>


Throughout this notebook, we will examine both HDI trends across different time periods (1990-2021) and the factors that contribute to a country's HDI score.

### Code: 

Run the code cells below to import the libraries we need for this project. Libraries are pre-made code that make it easier to analyze our data.

In [None]:
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import folium
import geopandas as gpd
import plotly.graph_objs as go
import ipywidgets
from ipywidgets import interact
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score
import warnings
import math
warnings.filterwarnings("ignore")
print("Libraries imported.")

### Download the Data

Let's begin by downloading the Human Development Index (HDI) datasets. 

In [None]:
HDI_components = pd.read_excel("https://raw.githubusercontent.com/callysto/data-files/main/SocialStudies/UnitedNations/HDI_components.xlsx")
HDI_years = pd.read_excel("https://raw.githubusercontent.com/callysto/data-files/main/SocialStudies/UnitedNations/HDI_years.xlsx")

display(HDI_components, HDI_years)

### Examining HDI Trends
Let's begin our analysis using the `HDI_years` dataset. In this case, the dataset has been loaded into a *dataframe*.

Think of a dataframe as similar to a [*spreadsheet*](https://en.wikipedia.org/wiki/Spreadsheet), much like the ones you might use in programs like Microsoft Excel or Google Sheets. It's a table-like structure where data is organized into rows and columns. Each row corresponds to an individual record or observation, while each column represents a specific attribute or characteristic. 
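To make the spreadsheet comparison concrete, here is a minimal sketch of a tiny dataframe. The country names and HDI values are made up for illustration and do not come from the UN data.

```python
import pandas as pd

# A tiny, made-up dataframe: each row is one record, each column is one attribute
example = pd.DataFrame({
    'Country': ['Country A', 'Country B'],
    'HDI 2021': [0.9, 0.6],
})

print(example.shape)       # (number of rows, number of columns)
print(example['Country'])  # select a single column by its name
print(example.loc[0])      # select a single row by its index label
```

The same column and row selections work on the much larger `HDI_years` dataframe.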

In [None]:
HDI_years

Looking at our dataframe output above, we see that this dataframe contains information regarding the HDI values for various countries across the years *1990-2021*. It also offers insights into the changes in *HDI rankings*, the *average annual HDI growth rates* during specific periods, and the overall *progress of countries* in terms of human development.

### Taking a Deeper Look
Let's start with some basic analysis by taking a deeper look into the different changes in HDI rankings of the countries in our dataframe using the `Change in HDI rank 2015-2021` column.

In [None]:
for index in HDI_years.index:
    print(f"Country Name: {HDI_years['Country'][index]}, Change in HDI Ranking: {HDI_years['Change in HDI rank 2015-2021'][index]}")

Interesting! However, the output above is extremely long. Instead, let's just try to find the largest positive and negative ranking changes.

In [None]:
try:
    maximum_change = HDI_years['Change in HDI rank 2015-2021'].max()
    minimum_change = HDI_years['Change in HDI rank 2015-2021'].min()
except TypeError:
    print("Values in `Change in HDI rank 2015-2021` column are not numeric. Let's convert them to numeric.")

### Data-Cleaning
We ran into an error! We tried to find the maximum and minimum values in the column `Change in HDI rank 2015-2021`, but it appears the values are not considered **numerical** (in coding terms, an *int* or *float*). We have to perform **data-cleaning**. 

Data cleaning is an essential process in data analysis that involves identifying and changing any inaccuracies, inconsistencies, or errors within a dataframe.  In this particular context, the column `Change in HDI rank 2015-2021` contains non-numeric values. To address this issue, we will convert the values in this column to numeric format, allowing us to perform meaningful data analysis and interpretation moving forward.
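As a small illustration of what this conversion does, the sketch below applies `pd.to_numeric` with `errors='coerce'` to a made-up column. The values, including the `'..'` placeholder for missing data, are assumptions chosen for the example rather than values from the dataset.

```python
import pandas as pd

# Made-up column mixing numbers, numeric text, and a placeholder for missing data
raw = pd.Series([3, '..', -2, '5'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
cleaned = pd.to_numeric(raw, errors='coerce')

print(cleaned)
print("Max:", cleaned.max(), "Min:", cleaned.min())  # NaN values are skipped
```

Because unparseable entries become `NaN` instead of raising an error, functions such as `max()` and `min()` can then run on the column.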

In [None]:
# The line of code below is converting the values in our column into numerical values
HDI_years['Change in HDI rank 2015-2021'] = pd.to_numeric(HDI_years['Change in HDI rank 2015-2021'], errors='coerce')

maximum_change = HDI_years['Change in HDI rank 2015-2021'].max()
minimum_change = HDI_years['Change in HDI rank 2015-2021'].min()

max_country = HDI_years.loc[HDI_years['Change in HDI rank 2015-2021'].idxmax()]['Country']
min_country = HDI_years.loc[HDI_years['Change in HDI rank 2015-2021'].idxmin()]['Country']

print(f"Largest Positive Change: {maximum_change}, Country: {max_country}")
print(f"Largest Negative Change: {minimum_change}, Country: {min_country}")

Perfect! We have now found which country has had the largest positive ranking change (*China*), alongside the country which has had the largest negative ranking change (*Venezuela*). 

### Reflection Questions

1. What factors might have contributed to China's substantial positive ranking change over the years? On the flip-side, what potential factors could have led to Venezuela's lowered ranking? 
   
2. How do changes in HDI rankings reflect a country's progress in terms of various aspects of human development? What are the key indicators that contribute to these changes, and how do they shape a nation's trajectory over time?

### Finding more Trends
Similarly, we can also find the highest and lowest annual HDI growth between certain time-periods. 

In [None]:
for col in HDI_years.columns[11:15]:
    try:
        highest_val = HDI_years[col].max()
        lowest_val = HDI_years[col].min()
    except:
        print(f"Values in {col} column are not numeric. Let's convert them to numeric.\n")
        HDI_years[col] = pd.to_numeric(HDI_years[col], errors='coerce')
        highest_val = HDI_years[col].max()
        lowest_val = HDI_years[col].min()

    index_highest_val = HDI_years.loc[HDI_years[col].idxmax()]['Country']
    index_lowest_val = HDI_years.loc[HDI_years[col].idxmin()]['Country']

    print(f"Highest {col}: {highest_val} Country: {index_highest_val} \nLowest {col}: {lowest_val} Country: {index_lowest_val}\n")

Examining the data output above, the biggest take-away appears to be that the overall percentages of *annual HDI growth* are not that large. 

The relatively modest percentages of annual HDI growth seem to align with the idea that significant changes in human development typically occur *gradually over time* rather than through *abrupt shifts*. 

Enhancing various aspects of people's lives, such as their *health*, *education*, and *living standards*, is a process that takes time and persistent effort. Achieving significant improvements in these areas involves addressing complex societal challenges and implementing long-term strategies. Additionally, considering the diverse obstacles that nations may encounter, like economic or political instabilities, it becomes clear that making substantial advancements in human development is a slow process.

### Visualizing Changes
We can also convey information using visualizations. The visualization below shows the highest HDI values found in each year spanning from 1990-2021. 

In [None]:
HDI_cols = HDI_years.columns[2:10]
for column in HDI_cols:
    HDI_years[column] = pd.to_numeric(HDI_years[column], errors='coerce')

temp_data = []
for col in HDI_cols:
    max_country = HDI_years.loc[HDI_years[col].idxmax()]['Country']
    max_value = HDI_years[col].max()
    temp_data.append({'Country': max_country, 'Max HDI Value': max_value, 'Year': col})

highest_HDI_vals = pd.DataFrame(temp_data)

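# `labels` maps column names to the axis titles shown on the figure
highest_HDI_fig = px.scatter(highest_HDI_vals, x='Country', y='Max HDI Value', text='Year',
                 labels={'Max HDI Value': 'Highest HDI Value'},
                 title="Highest HDI Value by Country from 1990-2021")
# Call show() separately so that `highest_HDI_fig` keeps the figure rather than None
highest_HDI_fig.show()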

Looking at the visualization above, *Norway* and *Switzerland* appear to have multiple years when they were considered to have the highest HDI value. 

The recurring presence of Norway and Switzerland with some of the highest HDI values suggests their consistent strong performance in various aspects of development over the years. Countries with continued high HDI scores usually have *robust systems* in healthcare, education, and other systems that support living standards. Their stable and effective governance, strong economies, and fair distribution of resources contribute to their ability to sustain these high rankings. This shows that their consistent focus on meeting the diverse needs of their people is essential for maintaining long-term progress in human development.

We can also categorize countries based on their HDI value. The [UN](https://hdr.undp.org/data-center/human-development-index#/indicies/HDI) separates countries based on their HDI scores as follows:

- Countries that are considered "Very High" have an HDI value of 0.8 or higher
- Countries that are considered "High" have an HDI value between 0.7-0.799
- Countries that are considered "Medium" have an HDI value between 0.55-0.699
- Countries that are considered "Low" have an HDI value below 0.55
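The thresholds above can be sketched as a small helper function. The name `hdi_category` is hypothetical and is only used here to illustrate the cut-offs.

```python
def hdi_category(hdi_value):
    """Return the UN development category for an HDI value between 0 and 1."""
    if hdi_value >= 0.8:
        return 'Very High'
    elif hdi_value >= 0.7:
        return 'High'
    elif hdi_value >= 0.55:
        return 'Medium'
    else:
        # Note: missing values (NaN) would also end up here, so filter them out first
        return 'Low'

for value in [0.85, 0.75, 0.6, 0.4]:
    print(value, hdi_category(value))
```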

We can implement this same categorization system and filter our countries based on this system. 

You can change the figure you want to visualize by changing the variable `which_fig` below. For example, instead of `which_fig = 'Low'` you can input `which_fig = 'Very High'`.

In [None]:
# Change this value to one of the 4 categories: Very High, High, Medium, or Low.
which_fig = 'Low'  

def create_hdi_fig(category, HDI_years):
    categories = {
        'Very High': HDI_years[HDI_years['HDI 2021'] >= 0.8],
        'High': HDI_years[(HDI_years['HDI 2021'] >= 0.7) & (HDI_years['HDI 2021'] < 0.8)],
        'Medium': HDI_years[(HDI_years['HDI 2021'] >= 0.55) & (HDI_years['HDI 2021'] < 0.7)],
        'Low': HDI_years[HDI_years['HDI 2021'] < 0.55]
    }

    selected_category = categories.get(category)

    if selected_category is not None:
        fig = go.Figure()
        for index, row in selected_category.iterrows():
            country = row['Country']
            hdi_values = [row[col] for col in HDI_cols]
            fig.add_trace(go.Scatter(x=HDI_cols, y=hdi_values, mode='lines+markers', name=country))
        fig.update_layout(title=f'HDI values for Countries in the {category} Category', xaxis_title='Year', yaxis_title='HDI', height=1000)
        return fig
    else:
        return None

selected_fig = create_hdi_fig(which_fig, HDI_years)

if selected_fig:
    selected_fig.show()
else:
    print("Invalid value for which_fig. Please select a valid option.")

The figure above can be difficult to read because many countries have similar HDI scores, so their points and lines overlap. We can solve this issue by creating a figure based on *year*. 

In [None]:
def update_scatter_plot(selected_year):
    data = []
    for country in HDI_years['Country']:
        trace = go.Scatter(x=[country], y=[HDI_years.loc[HDI_years['Country'] == country, selected_year].values[0]], mode='markers', name=country)
        data.append(trace)

    layout = go.Layout(title=f'HDI Scatter Plot ({selected_year})', xaxis=dict(showticklabels=False, title='Country'), yaxis=dict(title='HDI'))
    selected_year_fig = go.Figure(data=data, layout=layout)
    selected_year_fig.show()

interact(update_scatter_plot, selected_year=HDI_cols)

### Reflection Questions

1. Choose two countries that are near each other on the map. Analyze the trends between these countries. Do you notice any similarities or differences in their HDI scores, and what factors could potentially account for these variations?
   
2. Reflect on countries that have experienced low growth in their HDI rankings. What potential factors might have influenced their stagnated rank, and what can they do to potentially experience growth?

3. What impact might external factors, such as global economic fluctuations or environmental challenges, have on a country's HDI growth? How might these external influences shape a nation's development trajectory, and what measures can be implemented to mitigate potential negative effects?

### Map of Countries
In this next section, we'll be focusing on using *geojson* information to visualize our countries using a **Folium** map. 

In [None]:
countries_geojson = gpd.read_file('https://raw.githubusercontent.com/callysto/data-files/main/SocialStudies/UnitedNations/countries.geojson')
countries_geojson

Before we visualize our Folium map, it's important to ensure that the data we have is *compatible*. Essentially, we need to *map* the countries found in the geojson file to our `HDI_years` dataframe so that later we can match each country with its correct `geometry`. The `geometry` column contains information about a specific country's latitude, longitude, and borders so it can be accurately displayed on a map. 

After mapping, we have to perform some more data-cleaning, specifically checking to see if the countries are named the same in both dataframes. Ensuring that the countries share the same names in both dataframes is vital because we intend to map and merge the datasets based on the common identifier, which is the `Country` name. In this case, a mismatching country name means a country will not be mapped properly. 
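To see why mismatched names matter, here is a minimal sketch using two made-up dataframes. The HDI values and the `'shape A'`/`'shape B'` placeholders are illustrative assumptions, not real data.

```python
import pandas as pd

# Made-up data: the same country is spelled differently in each dataframe
hdi = pd.DataFrame({'Country': ['Canada', 'Viet Nam'], 'HDI': [0.9, 0.7]})
shapes = pd.DataFrame({'ADMIN': ['Canada', 'Vietnam'], 'geometry': ['shape A', 'shape B']})

# With a left merge, a country whose name has no match gets a missing geometry
before = hdi.merge(shapes, left_on='Country', right_on='ADMIN', how='left')
print(before)

# Renaming the country so both dataframes agree fixes the match
hdi['Country'] = hdi['Country'].replace({'Viet Nam': 'Vietnam'})
after = hdi.merge(shapes, left_on='Country', right_on='ADMIN', how='left')
print(after)
```

In the first merge, the row for `Viet Nam` has no geometry, so that country could not be drawn on a map.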

In [None]:
geojson_country_names = countries_geojson['ADMIN']

hdi_country_names = HDI_years['Country']

matching_countries = set(geojson_country_names).intersection(hdi_country_names)
non_matching_countries = set(hdi_country_names) - matching_countries
non_matching_countries_geojson = set(geojson_country_names) - matching_countries

print(f'Non-matching geojson: {non_matching_countries_geojson}')
print(f'Non-matching dataframe countries: {non_matching_countries}')

Looking at the output, the important information to check is the *non-matching* names in the `HDI_years` dataframe (the bottom text output). Luckily, it appears that every country in the bottom-output has a corresponding match in the top output. 

Therefore, we can just change the names of the `HDI_years` dataframe to match the `countries_geojson` dataframe, resulting in matching country names in both dataframes.

In [None]:
mapping = {
    'Russian Federation': 'Russia',
    'Micronesia (Federated States of)': 'Federated States of Micronesia',
    'Cabo Verde': 'Cape Verde',
    "Korea (Democratic People's Rep. of)": 'North Korea',
    'North Macedonia': 'Macedonia',
    'Bahamas': 'The Bahamas',
    'Tanzania (United Republic of)': 'United Republic of Tanzania',
    'Türkiye': 'Turkey',
    'Serbia': 'Republic of Serbia',
    'Eswatini (Kingdom of)': 'Swaziland',
    'Guinea-Bissau': 'Guinea Bissau',
    'Timor-Leste': 'East Timor',
    "Lao People's Democratic Republic": 'Laos',
    'Congo': 'Republic of Congo',
    'Syrian Arab Republic': 'Syria',
    'Brunei Darussalam': 'Brunei',
    'Viet Nam': 'Vietnam',
    'Iran (Islamic Republic of)': 'Iran',
    'Czechia': 'Czech Republic',
    'Congo (Democratic Republic of the)': 'Democratic Republic of the Congo',
    'Bolivia (Plurinational State of)': 'Bolivia',
    'Moldova (Republic of)': 'Moldova',
    'Korea (Republic of)': 'South Korea',
    "Côte d'Ivoire": 'Ivory Coast',
    'Palestine, State of': 'Palestine',
    'Venezuela (Bolivarian Republic of)': 'Venezuela',
    'Hong Kong, China (SAR)': 'Hong Kong S.A.R.',
    'United States': 'United States of America'
}

HDI_years['Country'] = HDI_years['Country'].replace(mapping)

The final step in our data-cleaning for our Folium map is to remove any countries that do not have a `HDI rank`. Showing these countries would be pointless, as they have an HDI value of 0.

In [None]:
no_HDI_years = HDI_years[HDI_years['HDI rank'].isnull()]
display(no_HDI_years)
HDI_years.dropna(subset=['HDI rank'], inplace=True)

We can now finally merge our two dataframes, `HDI_years` and `countries_geojson`. By merging the two dataframes, we can obtain the latitude, longitude, and borders of each country alongside the country's HDI values from 1990-2021. 

In [None]:
merged_data = pd.merge(HDI_years, countries_geojson, left_on='Country', right_on='ADMIN', how='left')
merged_data

Let's display our newly merged dataframe `merged_data`. Using the drop-down menu at the top, you can select a particular year column to visualize. Do certain years differ drastically from others? Similarly, are there countries that have HDI values in some years but not in others?

**Note**: This map is *slow* to render. It may take up to a minute to load.

In [None]:
HDI_by_country = ipywidgets.Output(layout={'border': '1px solid black'})

column_names = merged_data.columns[2:10].tolist()
dropdown_options = ipywidgets.Dropdown(
    options=column_names,
    value=column_names[0],
    description='Column:',
    disabled=False
)

def update_choropleth(change):
    HDI_by_country.clear_output()
    with HDI_by_country:
        m = folium.Map()
        folium.Choropleth(
            geo_data=countries_geojson,
            data=merged_data,
            columns=['ADMIN', dropdown_options.value],  
            key_on='feature.properties.ADMIN',  
            fill_color='YlGn',
            fill_opacity=0.7,
            line_opacity=0.2,
            legend_name=f'{dropdown_options.value} per Country',
        ).add_to(m)
        display(m)

dropdown_options.observe(update_choropleth, names='value')
display(dropdown_options)
update_choropleth({'new': column_names[0]})

HDI_by_country

### Reflection Questions

1. How do the *HDI values* of countries change over time? Are there discernible patterns or trends that emerge over the years?

2. How do the *geographical locations* of countries correlate with their HDI values? Are there any noticeable differences in HDI scores based on *regions* or geographical locations?
   
3. Considering the *average annual HDI growth percentages*, what can be inferred about the overall *progress* of countries in terms of human development? Are there *specific time periods* that exhibit notable improvements or setbacks in human development across various nations?

### Factors of Importance for HDI
Moving on from HDI yearly trends, we can now examine what *factors/components* directly contribute to a country's HDI score. 

In [None]:
HDI_components

Looking at the dataframe `HDI_components` above, we see there are 4 main components that contribute to a country's HDI score. The UN defines these columns as follows:

| Column Name | Description |
| --- | --- |
| Life expectancy at birth (years) | Number of years a newborn infant could expect to live if prevailing patterns of age-specific mortality rates at the time of birth stay the same throughout the infant’s life. |
| Expected years of schooling (years) | Number of years of schooling that a child of school entrance age can expect to receive if prevailing patterns of age-specific enrolment rates persist throughout the child’s life. |
| Mean years of schooling (years) | Average number of years of education received by people ages 25 and older, converted from education attainment levels using official durations of each level. |
| Gross national income (GNI) per capita (2017 PPP $) | Aggregate income of an economy generated by its production and its ownership of factors of production, less the incomes paid for the use of factors of production owned by the rest of the world, converted to international dollars using PPP rates, divided by midyear population. |

As before, we need to perform some *data-cleaning* before analyzing our dataframe. Let's start simple by changing some column names to more appropriate ones. 

First, `HDI rank.1` is supposed to refer to the HDI rank of the country in 2020. Let's change the name to `HDI rank 2020` to better reflect what this column means. Next, the column `Human Development Index (HDI) ` contains an extra space at the end of the column name. Let's remove this space for ease of work later on. 

In [None]:
HDI_components.rename(columns={'HDI rank.1': 'HDI rank 2020', 'Human Development Index (HDI) ': 'Human Development Index (HDI)'}, inplace=True)
HDI_components

As before, let's also find the countries that do not have an `HDI rank` and store them in `no_HDI`. We will be using these countries in a later part of the analysis. 

In [None]:
no_HDI = HDI_components[HDI_components['HDI rank'].isnull()]
no_HDI

Now that we've finished data-cleaning, let's perform similar analysis to our `HDI_years` dataframe. Let's find the *maximum*, *minimum*, and *mean* values in our 4 columns that contribute to a country's HDI score so that we can get a better understanding of the boundaries and trends within the dataframe. Furthermore, if these values are not numerical, we can also convert them here for future analysis.

In [None]:
for col in HDI_components.columns[2:7]:
    try:
        HDI_components[col].max()
        HDI_components[col].min()
        HDI_components[col].mean()
    except:
        print(f"Values in {col} column are not numeric. Let's convert them to numeric.\n")
        HDI_components[col] = pd.to_numeric(HDI_components[col], errors='coerce')
        HDI_components[col].max()
        HDI_components[col].min()
        HDI_components[col].mean()
    
    max_country = HDI_components.loc[HDI_components[col].idxmax()]['Country']
    min_country = HDI_components.loc[HDI_components[col].idxmin()]['Country']
    print(f"Maximum {col}: {HDI_components[col].max()}, Country: {max_country}")
    print(f"Minimum {col}: {HDI_components[col].min()}, Country: {min_country}")
    print(f"Mean {col}: {HDI_components[col].mean()}\n")

Looking at the output above, we can get a better sense of which values are considered great and not so great in the context of determining a country's HDI score. One interesting point is that the average HDI score across all countries is 0.72. This is a surprisingly high score, as it would be graded *High* under the UN's own ranking system. There are 2 possible explanations for this:

1. There are many countries that are highly ranked in the dataframe, leading to an overall higher average than expected.
2. There are few countries that are lowly ranked in the dataframe, leading to an overall higher average than expected.

Both scenarios lead to great outcomes, which is always a positive sight to see. 

### Gross National Income (GNI) compared to Human Development Index (HDI)

We can also perform some analysis in finding the correlation between a country's GNI ranking and their HDI ranking. Specifically, let's see if there are more countries that have a higher GNI ranking than HDI ranking, and vice-versa. 

Note: A negative `GNI per capita rank minus HDI rank` means the country is better ranked by GNI than by HDI value.
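The sign convention can be checked with a quick calculation. The ranks below are hypothetical, and `rank_difference` is a helper name introduced only for this example.

```python
def rank_difference(gni_rank, hdi_rank):
    """GNI per capita rank minus HDI rank (rank 1 is the best in both lists)."""
    return gni_rank - hdi_rank

# A hypothetical country ranked 10th by GNI per capita and 25th by HDI
diff = rank_difference(gni_rank=10, hdi_rank=25)
print(diff)  # negative: the country is ranked better by GNI than by HDI
```

Swapping the two ranks gives a positive value, meaning the country is ranked better by HDI than by GNI.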

In [None]:
diff_col = 'GNI per capita rank minus HDI rank'

def count_signs(column):
    # Zeros and missing values are not greater than 0, so they are counted as 'negative'
    return column.apply(lambda x: 'positive' if x > 0 else 'negative').value_counts()

try:
    counts = count_signs(HDI_components[diff_col])
except TypeError:
    print("Values are not numeric. Let's convert them to numeric. Converting...\n")
    HDI_components[diff_col] = pd.to_numeric(HDI_components[diff_col], errors='coerce')
    counts = count_signs(HDI_components[diff_col])

# Read the counts outside the try/except so they are defined in both cases
positive_count = counts.get('positive', 0)
negative_count = counts.get('negative', 0)

print("Number of Countries with a GNI ranking lower than their HDI ranking:", positive_count)
print("Number of Countries with a GNI ranking higher than their HDI ranking:", negative_count)

Looking at the output above, the split between countries ranked better by GNI and countries ranked better by HDI is very close. This relatively balanced distribution suggests that, across all countries, neither ranking is systematically higher than the other.

We can visualize these differences side-by-side using two figures below.

In [None]:
rankings_fig = make_subplots(rows=1, cols=2, subplot_titles=("GNI per Capita Rank minus HDI Rank per Country", "HDI Rank per Country"))

rankings_fig.add_trace(go.Scatter(x=HDI_components['Country'], y=HDI_components['GNI per capita rank minus HDI rank'], mode='markers', name='GNI Fig'), row=1, col=1)
rankings_fig.add_trace(go.Scatter(x=HDI_components['Country'], y=HDI_components['HDI rank'], mode='markers', name='HDI Fig'), row=1, col=2)

rankings_fig.update_traces(hovertemplate='Country: %{x}<br>GNI per capita rank minus HDI rank: %{y}', 
                           row=1, col=1)

rankings_fig.update_traces(hovertemplate='Country: %{x}<br>HDI rank: %{y}', 
                           row=1, col=2)

rankings_fig.update_layout(title_text="Comparison of GNI per capita rank minus HDI rank and HDI rank for Different Countries",
                  showlegend=False)

rankings_fig.update_yaxes(title_text="GNI per capita rank minus HDI rank", row=1, col=1)
rankings_fig.update_yaxes(title_text="HDI rank", row=1, col=2)
rankings_fig.show()

A notable point to mention is that the meaning of `GNI per capita rank minus HDI rank` depends on the *context* of the country's HDI.

A value close to 0 for a country with a high HDI rank (ranks are flipped, meaning 1 is the highest) indicates a fairly high GNI rank as well. However, if a country has a relatively low HDI rank (191 is the lowest) and also has a `GNI per capita rank minus HDI rank` close to 0, it means the country ranks poorly on both HDI and GNI. 

### Predicting HDI using Machine-Learning
We can investigate HDI more thoroughly by also trying to predict a country's HDI score using **machine-learning**. 

Machine learning is a powerful tool that enables computers to learn from data and make decisions without being explicitly programmed. It involves creating algorithms that can recognize patterns and make predictions or decisions based on those patterns. Think of it as a way for computers to learn from experience and improve over time. These algorithms are used in various applications, such as *recommendation systems*, *image and speech recognition*, and even *self-driving cars*, making them an essential part of our modern technological landscape.

**Note**: HDI can actually be calculated using a *formula* which will be implemented later in this notebook. The machine-learning portion is mainly meant to show how it is possible to create ways to accurately predict scores for metrics that do not have a scoring system in place. 

We'll be doing some machine-learning based on the columns: `Life expectancy at birth (years)`, `Expected years of schooling (years)`, `Mean years of schooling (years)`, `Gross national income (GNI) per capita (2017 PPP $)`. These will be called **features**. Features are essentially the *variables* or *attributes* that are analyzed and processed by a machine learning algorithm to produce an **output**. 

We'll also be using an **imputer** in our machine-learning analysis. An imputer is a technique/estimator used in machine learning to handle missing data by substituting missing values with estimated or calculated ones. It helps ensure that dataframes are complete and suitable for analysis, thereby improving the accuracy and reliability of the resulting models.
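As a rough illustration of the idea, here is a plain-Python sketch of mean imputation. It is not scikit-learn's implementation; `impute_mean` is a hypothetical helper and the schooling values are made up.

```python
import math

def impute_mean(column):
    """Replace missing values (NaN) with the mean of the observed values."""
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]

# Made-up column with one missing value
schooling = [10.0, math.nan, 14.0]
print(impute_mean(schooling))  # the gap is filled with the mean of 10.0 and 14.0
```

`SimpleImputer(strategy='mean')` applies the same idea to each feature column separately.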

In [None]:
# Keep only the countries that have an HDI rank (note: this reuses the name `no_HDI`)
no_HDI = HDI_components[HDI_components['HDI rank'].notnull()]
features = ['Life expectancy at birth (years)', 'Expected years of schooling (years)', 'Mean years of schooling (years)', 'Gross national income (GNI) per capita (2017 PPP $)']

X = no_HDI[features]
y = no_HDI['Human Development Index (HDI)']

imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred = [1.0 if pred > 1.0 else pred for pred in y_pred]

r2 = r2_score(y_test, y_pred)
print("R-squared value of our model:", r2)

Looking at the output above, we have successfully obtained an **r-squared** value for our machine-learning model. This r-squared value (also known as the coefficient of determination) refers to how well the line drawn in a graph fits the data points. A value *close to 1* means the line fits the points very well, while a value *closer to 0* means the line doesn't fit the points as well. 

An r-squared value of **0.973** is great for our model, indicating an extremely strong fit of the line to the data points. 
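For readers curious about how r-squared is computed, here is a minimal plain-Python sketch of the standard formula (1 minus the ratio of the residual sum of squares to the total sum of squares). The function name `r_squared` and the example numbers are illustrative only.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0]
print(r_squared(actual, [1.0, 2.0, 3.0]))  # perfect predictions
print(r_squared(actual, [2.0, 2.0, 2.0]))  # always predicting the mean
```

Perfect predictions give a value of 1, while always predicting the average gives a value of 0.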

Remember how we mentioned that the HDI score actually has a formula? This formula can be created in Python as well. The UN's [technical notes](https://hdr.undp.org/sites/default/files/2021-22_HDR/hdr2021-22_technical_notes.pdf) supplies diagrams that accurately portray how HDI scores are calculated:


<div style="text-align:center"><img src="images/table_vals.png" alt="HDI Diagram" /></div>

First, each *indicator* has boundaries that apply when calculating its dimensional index. If a value falls below the minimum, it is raised to the minimum (giving a dimensional index of 0). Similarly, if a value exceeds the maximum, it is lowered to the maximum (giving a dimensional index of 1.0, a perfect score). The United Nations (2022) explains why these boundaries are set: 

"Minimum and maximum values (goalposts) are set in order to transform the indicators expressed in different units into indices between 0 and 1. These goalposts act as ‘the natural zeros’ and ‘aspirational targets,’ respectively, from which component indicators are standardized" (p. 2).

<div style="text-align:center"><img src="images/dimensional_index.png" alt="HDI Diagram" /></div>

This value is then calculated using the formula above. 
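Written out as text, the dimensional index takes the following general form. The income index uses the natural logarithm of GNI per capita, with the goalposts of 100 and 75,000 that also appear in the code later in this notebook.

$$
\text{Dimension index} = \frac{\text{actual value} - \text{minimum value}}{\text{maximum value} - \text{minimum value}}
$$

$$
\text{Income index} = \frac{\ln(\text{GNI per capita}) - \ln(100)}{\ln(75{,}000) - \ln(100)}
$$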

<div style="text-align:center"><img src="images/HDI_score.png" alt="HDI Diagram" /></div>

The dimensional indices are then multiplied together and raised to the power of 1/3 (a geometric mean), resulting in the final HDI score.
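In symbols, with $I_{\text{Health}}$, $I_{\text{Education}}$, and $I_{\text{Income}}$ standing for the three dimensional indices (the symbol names are chosen here for illustration), the aggregation is:

$$
\text{HDI} = \left( I_{\text{Health}} \cdot I_{\text{Education}} \cdot I_{\text{Income}} \right)^{1/3}
$$

The education index is itself the average of the two schooling indices:

$$
I_{\text{Education}} = \frac{I_{\text{Expected schooling}} + I_{\text{Mean schooling}}}{2}
$$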

An example of calculation can be found below: 

<div style="text-align:center"><img src="images/example.png" alt="HDI Diagram" /></div>

In [None]:
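import math

def get_index(value, min_val, max_val, is_income_index=False):
    # Scale a value to the 0-1 range between its minimum and maximum goalposts
    if is_income_index:
        # Income is scaled on the natural logarithm of each value
        return (math.log(value) - math.log(min_val)) / (math.log(max_val) - math.log(min_val))
    else:
        return (value - min_val) / (max_val - min_val)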

def clamp(value, min_val, max_val):
    # Keep a value within its minimum and maximum goalposts
    return min(max(value, min_val), max_val)

def calculate_index(life_expectancy, expected_schooling, mean_schooling, gni_per_capita):
    # Clamp each indicator to its goalposts, then convert it to a 0-1 index
    health_index = get_index(clamp(life_expectancy, 20, 85), 20, 85)
    expected_schooling_index = get_index(clamp(expected_schooling, 0, 18), 0, 18)
    mean_schooling_index = get_index(clamp(mean_schooling, 0, 15), 0, 15)
    income_index = get_index(clamp(gni_per_capita, 100, 75000), 100, 75000, True)
    return health_index, expected_schooling_index, mean_schooling_index, income_index

def calculate_hdi(life_expectancy, expected_schooling, mean_schooling, gni_per_capita):
    health_index, expected_schooling_index, mean_schooling_index, income_index = calculate_index(
        life_expectancy, expected_schooling, mean_schooling, gni_per_capita
    )

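    # The education index is the average of the two schooling indices
    index_education = (expected_schooling_index + mean_schooling_index) / 2
    # The HDI is the geometric mean of the three dimensional indices
    hdi = (health_index * index_education * income_index) ** (1 / 3)
    return round(hdi, 3)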

print("Function defined.")

We can test our own implementation by taking values found in our dataframe and testing to see if they match up correctly. 

To check the boundary handling, we'll also try a set of out-of-range inputs, including 5 as the value for `life_expectancy`. Since this is lower than the minimum of 20, the health index becomes 0, so the resulting HDI score should be 0.

In [None]:
# Example usage for Switzerland, a valid HDI score
life_expectancy = 83.9872
expected_schooling = 16.500299
mean_schooling = 13.85966
gni_per_capita = 66933.00454

hdi = calculate_hdi(life_expectancy, expected_schooling, mean_schooling, gni_per_capita)
print(f"Switzerland HDI score: {hdi}")

# Example usage with out-of-range inputs (life expectancy is below the minimum of 20)
life_expectancy = 5
expected_schooling = 20
mean_schooling = 8
gni_per_capita = 80000

nonvalid_hdi = calculate_hdi(life_expectancy, expected_schooling, mean_schooling, gni_per_capita)
print(f"Calculated HDI score (should be 0): {nonvalid_hdi}")

Nice! These values are correct. We can now get back to our machine-learning implementations and use our derived formula to see if there are any significant deviations from our predicted scores. Let's use both values in a comparison visualization below. 

In [None]:
australia = HDI_components.loc[HDI_components['Country'] == 'Australia']
south_sudan = HDI_components.loc[HDI_components['Country'] == 'South Sudan']

australia_hdi = calculate_hdi(
    australia['Life expectancy at birth (years)'].values[0],
    australia['Expected years of schooling (years)'].values[0],
    australia['Mean years of schooling (years)'].values[0],
    australia['Gross national income (GNI) per capita (2017 PPP $)'].values[0],
)
south_sudan_hdi = calculate_hdi(
    south_sudan['Life expectancy at birth (years)'].values[0],
    south_sudan['Expected years of schooling (years)'].values[0],
    south_sudan['Mean years of schooling (years)'].values[0],
    south_sudan['Gross national income (GNI) per capita (2017 PPP $)'].values[0],
)

predicted_australia = model.predict(australia[features])
predicted_south_sudan = model.predict(south_sudan[features])

actual_vs_predicted = make_subplots(rows=2, cols=1, subplot_titles=['Australia', 'South Sudan'], shared_xaxes=True)

actual_vs_predicted.add_trace(go.Bar(x=[predicted_australia[0], australia_hdi],y=['Predicted', 'Actual'],orientation='h',name='Australia',text=['Predicted', 'Actual'],marker=dict(color=['salmon', 'salmon']),textposition='auto',textfont=dict(color='black'),showlegend=False),row=1, col=1)
actual_vs_predicted.add_trace(go.Bar(x=[predicted_south_sudan[0], south_sudan_hdi],y=['Predicted', 'Actual'],orientation='h',name='South Sudan',text=['Predicted', 'Actual'],marker=dict(color=['plum', 'plum']),textposition='auto',textfont=dict(color='black'),showlegend=False),row=2, col=1)

actual_vs_predicted.update_layout(title_text="Comparison of Actual and Predicted HDI Scores", barmode='group', yaxis=dict(autorange="reversed"))
actual_vs_predicted.update_yaxes(title_text="Performance", row=1, col=1)
actual_vs_predicted.update_yaxes(title_text="Performance", row=2, col=1)
actual_vs_predicted.update_xaxes(title_text="HDI Score", row=2, col=1).show()

Looking at our visualization output, the *predicted* values for both the high-scoring country (Australia) and the low-scoring country (South Sudan) are relatively close to the values calculated with the formula. This suggests that the model's predictions are reasonably accurate at both ends of the HDI range, although two countries are only a small sample.
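
"Relatively close" can be made more precise by measuring the gap between a formula score and a predicted score. The sketch below is one simple way to do that. The helper name `compare_scores` and the numbers are hypothetical placeholders, not the notebook's actual output; you could substitute the values from the chart above.

```python
def compare_scores(formula_hdi, predicted_hdi):
    """Return the absolute error and the error as a percentage of the formula score."""
    abs_error = abs(predicted_hdi - formula_hdi)
    # A formula score of 0 would make the percentage undefined
    pct_error = 100 * abs_error / formula_hdi if formula_hdi else float("nan")
    return abs_error, pct_error

# Hypothetical placeholder values: (formula score, predicted score)
examples = {
    "High-scoring country": (0.950, 0.957),
    "Low-scoring country": (0.390, 0.402),
}

for name, (formula, predicted) in examples.items():
    abs_err, pct_err = compare_scores(formula, predicted)
    print(f"{name}: absolute error {abs_err:.3f}, relative error {pct_err:.1f}%")
```

Note that the same absolute error is a larger relative error for a low-scoring country, so it can be worth looking at both measures.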

We can also apply our machine-learning model to all the countries in our dataframe. The visualization below will show the *actual* HDI values of the country on the left, and their *predicted* values using machine-learning on the right. 

In [None]:
X_test = imputer.fit_transform(no_HDI[features])

no_HDI['Predicted HDI'] = model.predict(X_test)
ml_fig = make_subplots(rows=1, cols=2, subplot_titles=("Actual HDI Value", "Predicted HDI Value"))

ml_fig.add_trace(go.Scatter(x=no_HDI['Country'], y=no_HDI['Human Development Index (HDI)'], name="Actual HDI Value"), row=1, col=1)
ml_fig.add_trace(go.Scatter(x=no_HDI['Country'], y=no_HDI['Predicted HDI'], name="Predicted HDI Value"), row=1, col=2)

ml_fig.update_layout(title_text="Actual vs Predicted HDI Value",
                  showlegend=True)

ml_fig.show()

Examining the graph above, we can see that the model has estimated each country's HDI score fairly accurately. There are some anomalies in the predicted values, such as HDI values greater than 1.0 (which are impossible under the HDI scoring formula), but apart from these small differences, the model appears to have worked successfully.
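
Predictions above 1.0 can happen because a regression model has no built-in knowledge that HDI scores must stay between 0 and 1. One common workaround, shown here as a sketch rather than as part of the notebook's pipeline, is to clip the predictions to the valid range afterwards. The function name `clip_to_unit_interval` and the sample values are assumptions made for illustration.

```python
def clip_to_unit_interval(scores):
    """Force every score into the valid HDI range of 0.0 to 1.0."""
    return [min(max(score, 0.0), 1.0) for score in scores]

# Hypothetical raw model outputs, including two out-of-range values
raw_predictions = [0.382, 0.871, 1.013, -0.004]
clipped = clip_to_unit_interval(raw_predictions)

print(clipped)
```

For a whole dataframe column, pandas' `Series.clip(lower=0, upper=1)` does the same thing. Keep in mind that clipping only hides the symptom: a prediction that lands outside the range is still a sign that the model is less reliable for that country.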

### Reflection Questions

1. What *potential* benefits could arise from integrating machine-learning methodologies into *existing scoring systems*, such as the HDI, to gain deeper insights or improve accuracy?
   
2. How can the integration of machine-learning models help in identifying *patterns* or *trends* within the data that may not be immediately apparent through *traditional* analysis methods?
   
3. What *challenges* or *limitations* might arise when implementing machine-learning techniques in the context of *well-established scoring systems*, and how can these challenges be addressed effectively?

### Examining Non-HDI Countries

In the final part of our notebook, we will examine countries that have no HDI rank and try to score them based on their *feature* values, even though some of those values are **null**. A null value means that no value was recorded at all.
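
A model cannot make a prediction from a missing number, so null values have to be filled in first (this is called *imputation*). The sketch below shows one common strategy, replacing each null with the average of the known values. This is an assumption for illustration: the `imputer` object used in this notebook was set up earlier and may use a different strategy, and the function name `impute_with_mean` and the sample values are made up.

```python
def impute_with_mean(values):
    """Replace each missing entry (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("Cannot impute a column with no known values")
    mean_value = sum(known) / len(known)
    return [mean_value if v is None else v for v in values]

# Made-up life expectancy column with one missing entry
life_expectancy = [82.0, None, 61.0, 73.0]
filled = impute_with_mean(life_expectancy)

print(filled)
```

An imputed value is only a guess, so predictions for countries with many null features should be treated with extra caution.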

In [None]:
null_countries = HDI_components[HDI_components['HDI rank'].isnull()]
null_countries

Now that we've identified which countries need predicted HDI values, let's use our machine-learning model to predict their HDI values, and plot them in a visualization.

In [None]:
temp = HDI_components.copy()
X_test = imputer.fit_transform(temp[features])
temp['Predicted HDI'] = model.predict(X_test)
null_countries = temp[temp['HDI rank'].isnull()]

null_countries_fig = go.Figure()
for index, row in null_countries.iterrows():
    null_countries_fig.add_trace(go.Scatter(x=[row['Country']], y=[row['Predicted HDI']], mode='markers', name=row['Country'], marker=dict(size=row['Predicted HDI']*14)))

null_countries_fig.update_layout(title_text="Predicted HDI Values for Countries without HDI Values",
                  xaxis_title="Country",
                  yaxis_title="Predicted HDI Value")

null_countries_fig.show()

### Reflection Questions

1. Do you think the *predicted* HDI values for the countries without HDI values are *accurate*? Are there any *external* factors that could be significantly biasing these countries' predicted scores?

2. What *additional* data points or variables could be included to provide a more *comprehensive* understanding of a country's HDI score?
   
3. In what ways might *cultural* or *historical* contexts impact the evaluation of different dimensions of the HDI?

# Conclusion

In this notebook, we imported data from the United Nations based on their research on the [Human Development Index](https://hdr.undp.org/data-center/human-development-index#/indicies/HDI) to gain valuable insights into the developmental aspects of various countries. We uncovered *trends* and *patterns* in HDI scores across different time periods and visualized these trends.

By delving into the dimensions that contribute to the HDI, such as life expectancy, education, and standard of living, we gained a *comprehensive* understanding of how these factors collectively shape a country's development landscape.

We also applied *machine-learning* techniques to predict HDI scores and assessed how well our model performed. We then extended the model to predict HDI scores for countries that could not be scored using the formula the United Nations developed.

Perhaps you can try *extension* activities such as investigating factors that would *decrease* a country's HDI score. 

If you'd like to learn more about extensions of HDI provided by the United Nations, see their [technical notes](https://hdr.undp.org/sites/default/files/2021-22_HDR/hdr2021-22_technical_notes.pdf). Some of these extensions include multi-dimensional poverty adjustment and adjusting for inequalities. 

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)