# The relationship between the Human Development Index and the prevalence of mental health disorders

## Research question

The aim of this analysis is to investigate the relationship between country-level socioeconomic status and the prevalence of mental health disorders. Studies have suggested that mental health is worse in relatively highly developed countries [(Barbalat & Franck, 2020)](https://bmjopen.bmj.com/content/bmjopen/10/4/e035055.full.pdf). A country's level of human development can be measured using the Human Development Index (HDI).

The HDI is a metric designed to quantify a country's socioeconomic status by using key dimensions of human development. The index is calculated using four key metrics: (1) life expectancy at birth - to assess a long and healthy life, (2) expected years of schooling - to assess access to knowledge of the younger generation, (3) average years of schooling - to assess access to knowledge of the older generation and (4) gross national income (GNI) per capita - to assess the standard of living. Each of these metrics is normalized to an index value (the two schooling indices are averaged into a single education index) and aggregated into the HDI using the following formula [(2020 HDR Technical Notes)](https://hdr.undp.org/system/files/documents/technical-notes-calculating-human-development-indices.pdf):

$HDI = (I_{Health} \cdot I_{Education} \cdot I_{Income})^{\frac{1}{3}}$
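
To make the aggregation concrete, the sketch below computes an HDI value from the four raw metrics. The goalposts (minimum and maximum per indicator) are written down as recalled from the 2020 technical notes and should be treated as assumptions to verify; the function names and the clamping of values to the goalpost range are illustrative choices, not part of the original analysis.

```python
import math

# Goalposts (min, max) per indicator, as recalled from the 2020 HDR
# technical notes. Treat them as assumptions and verify before reuse.
GOALPOSTS = {
    "life_expectancy": (20.0, 85.0),     # years
    "expected_schooling": (0.0, 18.0),   # years
    "mean_schooling": (0.0, 15.0),       # years
    "gni_per_capita": (100.0, 75000.0),  # PPP $
}


def dimension_index(value, low, high):
    """Min-max normalise a value to [0, 1].

    Clamping to the goalpost range is a simplification made here so that
    the geometric mean below never receives a negative factor.
    """
    clamped = min(max(value, low), high)
    return (clamped - low) / (high - low)


def hdi(life_expectancy, expected_schooling, mean_schooling, gni_per_capita):
    """Geometric mean of the health, education and income indices."""
    i_health = dimension_index(life_expectancy, *GOALPOSTS["life_expectancy"])

    # Education index: arithmetic mean of the two schooling indices
    i_education = (
        dimension_index(expected_schooling, *GOALPOSTS["expected_schooling"])
        + dimension_index(mean_schooling, *GOALPOSTS["mean_schooling"])
    ) / 2

    # Income enters on a logarithmic scale
    low, high = GOALPOSTS["gni_per_capita"]
    i_income = dimension_index(
        math.log(gni_per_capita), math.log(low), math.log(high)
    )

    return (i_health * i_education * i_income) ** (1 / 3)


# Hypothetical country: 70 years, 12 and 8 years of schooling, GNI of 10,000
print(round(hdi(70.0, 12.0, 8.0, 10000.0), 3))
```

Because the three indices are combined with a geometric mean, a low value in one dimension cannot be fully compensated by a high value in another.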

The HDI dataset contains HDI values from 1990 to 2021. The 'Global Trends in Mental Health Disorder' file contains the prevalence, expressed as a percentage, of schizophrenia, bipolar disorder, eating disorders, anxiety disorders, drug use disorders, depression and alcohol use disorders from 1990 to 2017.

### Data sources

The [Human Development Index](https://ourworldindata.org/human-development-index) was retrieved from Our World in Data (Roser, 2014) and originally published by the United Nations Development Programme. 

The [Mental Health dataset](https://www.kaggle.com/datasets/thedevastator/uncover-global-trends-in-mental-health-disorder?resource=download) was retrieved from Kaggle (author: Amit).



### Loading libraries

In [1]:
# Loading needed libraries
import yaml
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import json
import matplotlib.pyplot as mpl
import pylab as plt

from bokeh.io import output_file, show, output_notebook, export_png
from bokeh.models import ColumnDataSource, GeoJSONDataSource, LinearColorMapper, ColorBar, Label
from bokeh.plotting import figure
from bokeh.palettes import brewer

import panel as pn
import panel.widgets as pnw
import scipy
import scipy.stats as stats

from sklearn import linear_model

### Defining functions

In [2]:
# Getting data location from configuration file
def get_config():
    with open("config.yaml", 'r') as stream:
        config = yaml.safe_load(stream)
    return config
    
def data_inspection(df):
    
    # Displaying number of columns and rows, datatypes, missing values
    print(f'Dataset contains {df.shape[0]} rows and {df.shape[1]} columns. \n')
    print(f'Datatypes: \n{df.dtypes} \n')
    print(f'Missing data per column: \n{df.isnull().sum()}')

    # Displaying number of countries and years included
    print(f'\nNumber of countries included: {len(df.Entity.unique())}')
    print(f'Number of years included: {len(df.Year.unique())}')
    
    # Displaying timespan of included years
    print(f'Timespan: {df.Year.min()}-{df.Year.max()}')
   

def reformat_mh():
    # Drop index columns
    df = mental_disorders.drop('index', axis='columns')

    # Rename columns
    df = df.rename(columns = {'Schizophrenia (%)':'Schizophrenia',
    'Bipolar disorder (%)':'BPD', 
    'Eating disorders (%)':'ED',
    'Anxiety disorders (%)':'Anxiety',
    'Drug use disorders (%)':'Drugs',
    'Depression (%)':'Depression',
    'Alcohol use disorders (%)':'Alcohol'
    })

    # Drop missing values
    df = df.dropna()

    # Changing datatypes
    df['Year'] = df['Year'].astype(int)
    df['Schizophrenia'] = df['Schizophrenia'].astype(float)
    df['BPD'] = df['BPD'].astype(float)
    df['ED'] = df['ED'].astype(float)
    
    return df

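def correct_countries(df, mapping):
    # Replace country names in 'Entity' according to the {old: new} mapping
    # (parameter renamed so it no longer shadows the built-in dict)
    df['Entity'] = df['Entity'].replace(mapping)

    return df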


### Data loading

In [3]:
config = get_config()

# Human development index data
d1 = (config['dataset1'])
hdi = pd.read_csv(d1)

# Mental health disorder data
d2 = (config['dataset2'])
mental_disorders = pd.read_csv(d2)

# Shapefile containing geographical country data needed for visualisation
fp = (config['dataset3'])
# Reading file using geopandas (only loading needed columns)
map_df = gpd.read_file(fp)[['ADMIN', 'ADM0_A3', 'geometry']]



### Data inspection 

#### Mental health disorders

In [4]:
# Viewing dataset
mental_disorders.head()

|   | index | Entity | Code | Year | Schizophrenia (%) | Bipolar disorder (%) | Eating disorders (%) | Anxiety disorders (%) | Drug use disorders (%) | Depression (%) | Alcohol use disorders (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Afghanistan | AFG | 1990 | 0.16056 | 0.697779 | 0.101855 | 4.82883 | 1.677082 | 4.071831 | 0.672404 |
| 1 | 1 | Afghanistan | AFG | 1991 | 0.160312 | 0.697961 | 0.099313 | 4.82974 | 1.684746 | 4.079531 | 0.671768 |
| 2 | 2 | Afghanistan | AFG | 1992 | 0.160135 | 0.698107 | 0.096692 | 4.831108 | 1.694334 | 4.088358 | 0.670644 |
| 3 | 3 | Afghanistan | AFG | 1993 | 0.160037 | 0.698257 | 0.094336 | 4.830864 | 1.70532 | 4.09619 | 0.669738 |
| 4 | 4 | Afghanistan | AFG | 1994 | 0.160022 | 0.698469 | 0.092439 | 4.829423 | 1.716069 | 4.099582 | 0.66926 |


In [5]:
# Inspecting dataframe, datatypes, column and row counts and missing values
data_inspection(mental_disorders)

Dataset contains 108553 rows and 11 columns. 

Datatypes: 
index                          int64
Entity                        object
Code                          object
Year                          object
Schizophrenia (%)             object
Bipolar disorder (%)          object
Eating disorders (%)          object
Anxiety disorders (%)        float64
Drug use disorders (%)       float64
Depression (%)               float64
Alcohol use disorders (%)    float64
dtype: object 

Missing data per column: 
index                             0
Entity                            0
Code                           5412
Year                              0
Schizophrenia (%)             82678
Bipolar disorder (%)          89147
Eating disorders (%)           8317
Anxiety disorders (%)        102085
Drug use disorders (%)       102085
Depression (%)               102085
Alcohol use disorders (%)    102085
dtype: int64

Number of countries included: 276
Number of years included: 259
Timespan: 0-Year


Observations: 

1. A large share of the values is missing. Rows with missing values need to be removed, after which it should be checked whether the remaining data is still sufficient and whether the remaining countries correspond to those in the HDI dataset.

2. Several columns need to be converted to float.

3. Column names can be shortened for convenience.

4. The dataset includes an unexpectedly high number of years. This needs to be examined further.

In [6]:
# Checking time span 
print(mental_disorders['Year'].unique())

['1990' '1991' '1992' '1993' '1994' '1995' '1996' '1997' '1998' '1999'
 '2000' '2001' '2002' '2003' '2004' '2005' '2006' '2007' '2008' '2009'
 '2010' '2011' '2012' '2013' '2014' '2015' '2016' '2017' 'Year' '1800'
 '1801' '1802' '1803' '1804' '1805' '1806' '1807' '1808' '1809' '1810'
 '1811' '1812' '1813' '1814' '1815' '1816' '1817' '1818' '1819' '1820'
 '1821' '1822' '1823' '1824' '1825' '1826' '1827' '1828' '1829' '1830'
 '1831' '1832' '1833' '1834' '1835' '1836' '1837' '1838' '1839' '1840'
 '1841' '1842' '1843' '1844' '1845' '1846' '1847' '1848' '1849' '1850'
 '1851' '1852' '1853' '1854' '1855' '1856' '1857' '1858' '1859' '1860'
 '1861' '1862' '1863' '1864' '1865' '1866' '1867' '1868' '1869' '1870'
 '1871' '1872' '1873' '1874' '1875' '1876' '1877' '1878' '1879' '1880'
 '1881' '1882' '1883' '1884' '1885' '1886' '1887' '1888' '1889' '1890'
 '1891' '1892' '1893' '1894' '1895' '1896' '1897' '1898' '1899' '1900'
 '1901' '1902' '1903' '1904' '1905' '1906' '1907' '1908' '1909' '1910'
 '1911

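The 'Year' column currently contains years far outside the period of interest (the output shows values from 1800 onwards) as well as the literal string 'Year', which suggests that header rows are repeated inside the file. This probably also contributes to the high number of missing values. The dataset needs to be filtered to only contain data from the years 1990 to 2017.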
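
One possible way to apply this filter explicitly is sketched below on a small made-up frame that mimics the raw file (string years, a repeated header row and an out-of-range year). The frame and its values are purely illustrative; in the notebook itself the cleaning is done by `reformat_mh`, which relies on `dropna` instead.

```python
import pandas as pd

# Toy frame mimicking the raw file; all values are made up for illustration
raw = pd.DataFrame({
    "Entity": ["Afghanistan", "Afghanistan", "Entity", "Afghanistan"],
    "Year": ["1990", "2017", "Year", "1800"],
    "Schizophrenia (%)": ["0.16056", "0.166", "Schizophrenia (%)", None],
})

# Non-numeric entries such as the literal 'Year' become NaN
years = pd.to_numeric(raw["Year"], errors="coerce")

# Keep only rows inside the 1990-2017 window (NaN never falls inside it)
in_window = years.between(1990, 2017)
filtered = raw.loc[in_window].copy()

# Convert the remaining columns to proper numeric types
filtered["Year"] = years[in_window].astype(int)
filtered["Schizophrenia (%)"] = pd.to_numeric(filtered["Schizophrenia (%)"])

print(filtered)
```

Filtering on the year first makes it possible to see how many missing values remain within the period of interest before any rows are dropped.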

#### Human development index

In [7]:
# Viewing the dataframe
hdi.head()

|   | Entity | Code | Year | Human Development Index |
|---|---|---|---|---|
| 0 | Afghanistan | AFG | 1990 | 0.273 |
| 1 | Afghanistan | AFG | 1991 | 0.279 |
| 2 | Afghanistan | AFG | 1992 | 0.287 |
| 3 | Afghanistan | AFG | 1993 | 0.297 |
| 4 | Afghanistan | AFG | 1994 | 0.292 |


In [8]:
# Data inspection
data_inspection(hdi)

Dataset contains 5923 rows and 4 columns. 

Datatypes: 
Entity                      object
Code                        object
Year                         int64
Human Development Index    float64
dtype: object 

Missing data per column: 
Entity                       0
Code                       320
Year                         0
Human Development Index      0
dtype: int64

Number of countries included: 202
Number of years included: 32
Timespan: 1990-2021


Observations: 

1. There is missing data in the 'Code' column. However, 'Entity' has no missing data. Since these columns essentially contain the same information, 'Entity' can be used to identify countries instead of the country code.

2. Datatypes are correct.

3. The dataset contains fewer countries than the mental health dataset, but covers a longer timespan (1990 to 2021). The years after 2017 are dropped later on, when the datasets are merged.


### Data cleaning

In [9]:
# Removing missing data, renaming columns and converting datatypes
mh_clean = reformat_mh()

# Renaming country names to match other dataset
dict1 = {'Czech Republic':'Czechia','Macedonia':'North Macedonia'}
dict2 = {'Serbia':'Republic of Serbia',
    'Democratic Republic of Congo':'Democratic Republic of the Congo',
    'Bahamas':'The Bahamas',
    'Tanzania':'United Republic of Tanzania',
    'United States':'United States of America',  
    'Timor':'East Timor',
    "Cote d'Ivoire":'Ivory Coast',
    'Congo':'Republic of the Congo'}
mh_clean = correct_countries(mh_clean, dict1)
mh_clean = correct_countries(mh_clean, dict2)

In [10]:
# Renaming country names to match other dataframes
hdi_clean = correct_countries(hdi, dict2)

# Renaming columns geographical data to match other datasets
map_df.columns = ['Entity', 'Code', 'geometry']

# Dropping Antarctica by name (instead of by hard-coded row position) so it won't be displayed later
map_df = map_df[map_df['Entity'] != 'Antarctica']

### Data Exploration

In [12]:
import numpy as np
from bokeh.plotting import figure, show
pn.extension()

def plot_histogram(Variable, year):
    
    p = figure(width=500, height=400, toolbar_location=None,
           title="Distribution of the selected variable:")
    if Variable == "Human Development Index":
        df = hdi_clean

    else:
        df = mh_clean
        
    # Plotting per year
    df = df[df['Year'] == year]
        
    begin = min(df[Variable])
    end = max(df[Variable])
    bins = np.linspace(begin, end, 30)
    hist, edges = np.histogram(df[Variable], density=True, bins=bins)
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
             fill_color="darkslateblue", fill_alpha = 0.7, line_color="white")
    # The histogram is computed with density=True, so the y-axis shows a density
    p.xaxis.axis_label = Variable
    p.yaxis.axis_label = "Density"

    return p

years = list(range(min(mh_clean.Year),max(mh_clean.Year)+1))
variables = ["Schizophrenia", "BPD", "ED", "Anxiety", "Drugs", "Depression", "Alcohol","Human Development Index"]
mh = pn.interact(plot_histogram, Variable = variables, year = years)
pn.Row(mh)

### Data wrangling

In [13]:
# Merging HDI with mental disorders (inner join: only matching country-year rows are kept)
mh_hdi = pd.merge(mh_clean, hdi_clean, on=['Entity', 'Year', 'Code'])

# Merging geographical data
mh_hdi_geo  = map_df.merge(mh_hdi, on=['Entity', 'Code'])

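Both datasets are already normalized: the HDI is an index between 0 and 1 and the prevalence values are percentages, so no further scaling is applied.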
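
Since the merges above are inner joins, rows whose country names do not match are dropped without any warning. A minimal sketch of how such mismatches could be surfaced with the `indicator` option of `merge` is shown below; the two small frames and their values are made up for illustration.

```python
import pandas as pd

# Made-up example frames; 'Congo' is spelled differently in the two sources
left = pd.DataFrame({
    "Entity": ["Czechia", "Congo", "Norway"],
    "Year": [2017, 2017, 2017],
    "Anxiety": [3.5, 3.1, 7.0],
})
right = pd.DataFrame({
    "Entity": ["Czechia", "Republic of the Congo", "Norway"],
    "Year": [2017, 2017, 2017],
    "Human Development Index": [0.89, 0.57, 0.95],
})

# An outer merge keeps every row; '_merge' records where each row came from
check = left.merge(right, on=["Entity", "Year"], how="outer", indicator=True)

# Rows that would be lost in an inner join
unmatched = check.loc[check["_merge"] != "both", ["Entity", "_merge"]]
print(unmatched)
```

Any name that appears here as `left_only` or `right_only` is a candidate for the renaming dictionaries used in the data cleaning step.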

### Statistical analysis

#### Simple linear regression

Checking for a linear relationship between the HDI and the prevalence of each mental health disorder.

In [72]:
def simple_linear_regression(year,variable):
    
    # Plotting per year
    df = mh_hdi[mh_hdi['Year'] == year]
    
    # Setting plot parameters
    p = figure(width=500, height=400)
    x = df["Human Development Index"]
    y = df[variable]
    
    # Scatter plot
    p.circle(x,y, size=7, color="darkslateblue", alpha=0.6)
    
    # Initializing linear model
    regression_model = linear_model.LinearRegression()

    # Train the model using the data
    regression_model.fit(X = pd.DataFrame(df["Human Development Index"]), 
                     y = df[variable])
    
    # Trained model y-intercept
    intercept = regression_model.intercept_

    # Trained model coefficients
    slope = regression_model.coef_[0]
    
    # Getting model score
    score = regression_model.score(X = pd.DataFrame(df["Human Development Index"]), 
                     y = df[variable])
    
    # Adding model score (R2) to plot
    a = max(df["Human Development Index"]) - 0.15
    b = min(df[variable])
    
    mytext = Label(x=a, y=b, text = f'R\N{SUPERSCRIPT TWO} = {round(score,3)}', 
                       text_color = "darkred", text_font_size = "10pt")
    
    # Getting linear prediction
    y_predicted = [slope*i + intercept  for i in x]
    p.line(x,y_predicted,color='crimson', alpha = 0.7, line_width = 2,
            legend_label = 'y = '+str(round(slope,3))+'x + '+str(round(intercept,3)))
    
    p.add_layout(mytext)
    p.xaxis.axis_label = "Human Development Index (HDI)"
    p.yaxis.axis_label = "Prevalence (%)"
    p.legend.background_fill_alpha = 0.6
    
    return p 

In [73]:
years = list(range(min(mh_hdi.Year),max(mh_hdi.Year)+1))
variables = ["Schizophrenia", "BPD", "ED", "Anxiety", "Drugs", "Depression", "Alcohol"]

mh = pn.interact(simple_linear_regression, year=years, variable = variables)
pn.Row(mh)

There appears to be a positive correlation between HDI and all the variables except depression and alcohol disorders.
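
The plots report R² but not whether a slope differs significantly from zero. As a sketch of how this impression could be quantified, `scipy.stats.linregress` returns the slope together with a p-value. The data below is synthetic and only stands in for one year of `mh_hdi`; with the real data, `x` would be the HDI column and `y` one of the disorder columns.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one year of data (not the real dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0.3, 0.95, size=150)               # pseudo HDI values
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=150)   # pseudo prevalence (%)

# Ordinary least squares fit with a test of the null hypothesis slope == 0
result = stats.linregress(x, y)

print(f"slope     = {result.slope:.3f}")
print(f"intercept = {result.intercept:.3f}")
print(f"R2        = {result.rvalue ** 2:.3f}")
print(f"p-value   = {result.pvalue:.3g}")
```

A positive slope with a small p-value would support the visual impression, while a slope close to zero or a large p-value would not.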

#### Polynomial regression

In [74]:
def polynomial_regression(year,variable):
    
    # Plotting per year
    df = mh_hdi[mh_hdi['Year'] == year]
    
    # Setting plot parameters
    p = figure(width=500, height=400)
    x = df["Human Development Index"]
    y = df[variable]
    
    # Scatter plot
    p.circle(x,y, size=7, color="darkslateblue", alpha=0.6)
    
    # Initializing linear model
    poly_model = linear_model.LinearRegression()

    # Dataframe of predictor variables
    predictors = pd.DataFrame([x, x**2]).T
    
    # Train model
    poly_model.fit(X = predictors, 
                y = df[variable])
    
    # Trained model y-intercept
    intercept = poly_model.intercept_

    # Trained model coefficients
    slope = poly_model.coef_
    
    # Getting model score
    score = poly_model.score(X = predictors, 
                     y = y)
    
    # set curve range
    poly_line_range = np.arange(min(x), max(x), 0.05)
    
    # Get first and second order predictors from range
    poly_predictors = pd.DataFrame([poly_line_range,
                               poly_line_range**2]).T
    
    # Get corresponding y values from the model
    y_values = poly_model.predict(X = poly_predictors)
    
    # Adding model score (R2) to plot
    a = max(df["Human Development Index"]) - 0.15
    b = min(df[variable])
    
    mytext = Label(x=a, y=b, text = f'R\N{SUPERSCRIPT TWO} = {round(score,3)}', 
                       text_color = "darkred", text_font_size = "10pt")
    
    p.line(poly_line_range, y_values, color='crimson', alpha = 0.7, line_width = 2,
            legend_label = 'y = '+str(round(slope[1],3))
               +f'x\N{SUPERSCRIPT TWO} + '+str(round(slope[0],3))
               +'x + '+str(round(intercept,3)))
    
    p.add_layout(mytext)
    p.xaxis.axis_label = "Human Development Index (HDI)"
    p.yaxis.axis_label = "Prevalence (%)"
    p.legend.background_fill_alpha = 0.6
    
    return p 

In [75]:
years = list(range(min(mh_hdi.Year),max(mh_hdi.Year)+1))
variables = ["Schizophrenia", "BPD", "ED", "Anxiety", "Drugs", "Depression", "Alcohol"]

mh = pn.interact(polynomial_regression, year=years, variable = variables)
pn.Row(mh)
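
To judge whether the quadratic term adds anything, the R² of the linear and the quadratic fit can be compared side by side. The sketch below does this with `numpy.polyfit` on synthetic, deliberately curved data; the helper `r_squared` is introduced here for illustration and is not part of the notebook.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot


# Synthetic curved relationship (not the real dataset)
x = np.linspace(0.3, 0.95, 50)
y = 4.0 * (x - 0.6) ** 2 + 1.0

r2 = {}
for degree in (1, 2):
    coeffs = np.polyfit(x, y, deg=degree)   # coefficients, highest power first
    r2[degree] = r_squared(y, np.polyval(coeffs, x))

print(r2)
```

Note that on the data used for fitting, R² can never decrease when a term is added, so a small gain for the quadratic model is not by itself evidence of a curved relationship.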



#### Spearman's correlation test

In [None]:
scipy.stats.spearmanr
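
The cell above only references the function. A minimal sketch of how `scipy.stats.spearmanr` could be applied is given below, using a handful of made-up values; with the real data the two inputs would be the HDI column and one disorder column of `mh_hdi` for a single year.

```python
import numpy as np
from scipy import stats

# Made-up values for six hypothetical countries
hdi_values = np.array([0.35, 0.48, 0.55, 0.70, 0.81, 0.93])
prevalence = np.array([2.9, 3.1, 3.0, 3.8, 4.6, 6.9])

# Spearman's rho is the Pearson correlation of the ranks, so it measures
# monotonic rather than strictly linear association
rho, p_value = stats.spearmanr(hdi_values, prevalence)

print(f"rho = {rho:.3f}, p = {p_value:.3g}")
```

Because it works on ranks, the test is less sensitive to outliers and to the curvature seen in some of the scatter plots than the linear regression.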


### Data visualisation

In [None]:
# Only 2017 for now
df_2017 = mh_hdi[mh_hdi['Year'] == 2017]
df_2017 = df_2017[["Entity", "Code","Year", "Anxiety"]]

#Merge dataframes map_df and df_2017.
merged = map_df.merge(df_2017, on = ['Entity','Code'])
#Read data to json.
year_json = json.loads(merged.to_json())
#Convert to String like object.
json_data = json.dumps(year_json)
merged

In [None]:
from bokeh.io import output_notebook, show, output_file
from bokeh.plotting import figure
from bokeh.models import GeoJSONDataSource, LinearColorMapper, ColorBar
from bokeh.palettes import brewer

#Input GeoJSON source that contains features for plotting.
geosource = GeoJSONDataSource(geojson = json_data)

#Define a sequential multi-hue color palette.
palette = brewer['YlGnBu'][8]

#Reverse color order so that dark blue is highest prevalence.
palette = palette[::-1]

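#Instantiate LinearColorMapper that linearly maps numbers in a range, into a sequence of colors.
#The upper bound matches the last tick label ('>8%'); higher values are clipped to the darkest color.
color_mapper = LinearColorMapper(palette = palette, low = 0, high = 8)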

#Define custom tick labels for color bar.
tick_labels = {'0': '0%', '1': '1%', '2':'2%', '3':'3%', '4':'4%', '5':'5%', '6':'6%','7':'7%', '8': '>8%'}

#Create color bar. 
color_bar = ColorBar(color_mapper=color_mapper, label_standoff=8,width = 500, height = 20,
border_line_color=None,location = (0,0), orientation = 'horizontal', major_label_overrides = tick_labels)

#Create figure object (width/height are used here, consistent with the other figures in this notebook).
p = figure(title = 'Global Anxiety prevalence, 2017', height = 600 , width = 950, toolbar_location = None)
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None

#Add patch renderer to figure. 
p.patches('xs','ys', source = geosource,fill_color = {'field' :'Anxiety', 'transform' : color_mapper},
          line_color = 'black', line_width = 0.25, fill_alpha = 1)

#Specify figure layout.
p.add_layout(color_bar, 'below')

#Display figure inline in Jupyter Notebook.
output_notebook()
#Display figure.
show(p)

#### References
Barbalat, G., & Franck, N. (2020). Ecological study of the association between mental illness with human development, income inequalities and unemployment across OECD countries. BMJ open, 10(4), e035055.

Roser, M. (2014). Human Development Index (HDI). Our World in Data. Retrieved from https://ourworldindata.org/human-development-index