In [None]:
%config Completer.use_jedi = False

# World Development Indicators

To supplement the disaster dataset, we selected 55 world development indicators from the [World Bank DataBank](https://databank.worldbank.org/source/world-development-indicators). We took the yearly data from 2000 to 2020 for all available countries. A detailed description of each indicator is given in the metadata CSV file `databank_wdi_metadata.csv`, including the source, unit of measure, periodicity, aggregation method, statistical concept and methodology, development relevance, and limitations. The raw data is saved in `databank_wdi_data.csv`, and the preprocessed dataset created in this notebook is saved in `databank_wdi_data_clean.csv`.

In [None]:
import numpy as np
import pandas as pd

pd.set_option("display.precision", 2)

In [None]:
wdi = pd.read_csv("datasets/databank_wdi_data.csv", na_values="..")
wdi = wdi.rename(columns={
    "Country Name":"country_name",
    "Country Code":"country_code",
    "Series Name":"indicator_name",
    "Series Code":"indicator_code",
})
wdi = wdi.rename(columns={f"{y} [YR{y}]":f"{y}" for y in range(2000,2020+1)})
wdi.info()

In [None]:
wdi.describe()

In [None]:
print(f"Index unique: {wdi.index.is_unique}")
n_duplicates = wdi.duplicated().sum()  # number of fully duplicated rows
print(f"Dataframe has duplicates: {n_duplicates > 0}, n={n_duplicates}")

In [None]:
print(f"Number of countries: {len(wdi.country_name.unique())}")
indicators = wdi.indicator_name.unique()
print(f"Number of indicators: {len(indicators)}")
print("Indicators:")
for i, s in enumerate(indicators):
    print(f"\t{i} --> {s}")

In [None]:
interesting_indicators = [
    "Population, total",
    "Population density (people per sq. km of land area)",
    "Surface area (sq. km)",
    "School enrollment, secondary (% gross)",
    "GDP (current US$)",
    "Energy use (kg of oil equivalent per capita)",
]

print("A list of some interesting indicators:")
for i, s in enumerate(indicators):
    if s in interesting_indicators:
        print(f"\t{i} --> {s}")

In [None]:
croatia = wdi[wdi.country_name == "Croatia"]
# croatia
croatia[croatia.indicator_name.isin(interesting_indicators)]

In [None]:
haiti = wdi[wdi.country_name == "Haiti"]
haiti[haiti.indicator_name.isin(interesting_indicators)]

## Discussion

Some data is missing. For some countries, such as Haiti, there is no information at all on `School enrollment, secondary (% gross)`, whereas for Croatia only the 2020 value is missing. Missing data will need to be handled before applying machine learning models. Different models tolerate missing values differently, so we will have to decide on a treatment case by case. It will likely make sense to fill in some of the missing fields. For example, if a country's surface area was reported for the past 3 years but not for 2020 (as for Croatia), we can simply copy the value from the previous year, since it very likely did not change. Alternatively, we could estimate it by dividing the population by the population density; note that the density indicator is defined per sq. km of land area, so this yields the land area, which only approximates the surface area. Overall, cleaning the data may require a lot of work, and we should handle it carefully before moving on to machine learning models.
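The two filling strategies mentioned above can be sketched on a toy frame with the same wide layout as `wdi` (one row per country and indicator, one column per year). The values are made up for illustration, and whether forward fill or linear interpolation is appropriate for a given indicator is an assumption that should be checked per indicator.

```python
import numpy as np
import pandas as pd

# Toy data in the same wide layout as `wdi`; the numbers are invented.
toy = pd.DataFrame({
    "country_name": ["A", "A"],
    "indicator_name": ["Surface area (sq. km)", "Population, total"],
    "2017": [100.0, 1000.0],
    "2018": [100.0, 1010.0],
    "2019": [100.0, np.nan],
    "2020": [np.nan, 1030.0],
})
year_cols = ["2017", "2018", "2019", "2020"]

# Forward fill along the year axis: copy the last reported value into later gaps.
# Reasonable for near-constant quantities such as surface area.
filled = toy.copy()
filled[year_cols] = toy[year_cols].ffill(axis=1)

# Linear interpolation along the year axis, restricted to gaps that have
# reported values on both sides (limit_area="inside"), so nothing is extrapolated.
interp = toy.copy()
interp[year_cols] = toy[year_cols].interpolate(axis=1, limit_area="inside")

print(filled)
print(interp)
```

On the real data, `year_cols` would be the renamed year columns `"2000"` to `"2020"`, and the choice of method could be made per indicator rather than for the whole frame.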

TODO: Analyse correlations in the data

TODO: Use pandas `ffill`/`bfill` or `interpolate` to fill in missing values

TODO: Make sure that there are no obvious outliers by plotting some boxplots

TODO: Merge with the disasters

TODO: Use small multiples to create tons of graphs
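Several of the TODOs above (correlations, merging, small multiples) are probably easier once the data has one row per country and year and one column per indicator. The following is a minimal sketch of that reshape and a correlation matrix on invented data; the layout mirrors `wdi`, but the values and the choice of `melt` plus `pivot_table` are assumptions, not the notebook's final preprocessing.

```python
import pandas as pd

# Toy data in the same wide layout as `wdi`; the numbers are invented.
toy = pd.DataFrame({
    "country_name": ["A", "A", "B", "B"],
    "indicator_name": ["Population, total", "GDP (current US$)"] * 2,
    "2000": [1.0, 2.0, 3.0, 6.0],
    "2001": [2.0, 4.0, 4.0, 8.0],
})

# Wide -> long: one row per (country, indicator, year).
long = toy.melt(
    id_vars=["country_name", "indicator_name"],
    var_name="year",
    value_name="value",
)

# Long -> tidy: one row per (country, year), one column per indicator.
tidy = long.pivot_table(
    index=["country_name", "year"],
    columns="indicator_name",
    values="value",
)

# Pairwise Pearson correlation between indicators; missing values are
# ignored pairwise by DataFrame.corr.
corr = tidy.corr()
print(corr)
```

The `tidy` frame keeps `country_name` and `year` in its index, which could later serve as merge keys, assuming the disaster dataset carries matching columns.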
