# Title
----

## Executive Summary

Our analysis found that:
- Countries change income groups from year to year.
- Higher income groups are growing while lower income groups are shrinking.
- Indicators show growth in access and standards of living across all income groups.
- Taken together, these trends point to a world that is becoming more developed.


The major indicator that determines income group is GNI per capita. Countries are assigned to groups based on this feature, which is also correlated with many world development indicators and is worth a deep dive.
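
As a rough illustration of how a GNI per capita value maps to an income group, here is a minimal sketch. The function name `income_group` and the cut-off values in `EXAMPLE_UPPER_BOUNDS` are assumptions for illustration only; the actual thresholds change every year and should be taken from the World Bank's historical classification data.

```python
from bisect import bisect_left

# Illustrative upper bounds (GNI per capita, US$) for the three lower groups.
# These numbers are assumptions for this sketch; the real cut-offs vary by year.
EXAMPLE_UPPER_BOUNDS = (1005, 3955, 12235)
GROUPS = (
    "Low income",
    "Lower middle income",
    "Upper middle income",
    "High income",
)


def income_group(gni_per_capita, upper_bounds=EXAMPLE_UPPER_BOUNDS):
    """Return the income group whose GNI-per-capita range contains the value."""
    # bisect_left counts the bounds that are strictly below the value,
    # so a value exactly equal to a bound stays in the lower group.
    return GROUPS[bisect_left(upper_bounds, gni_per_capita)]


print(income_group(800))
print(income_group(20000))
```

Passing a different `upper_bounds` tuple per year would let the same function label each country-year with that year's thresholds.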

In analyzing our data, we were able to separate our features into various categories:
- Economic Policy and Debt
- Education and Gender Issues
- Access to Advanced Communications
- Environment, Resources and Population
- Social Protection and Labor
- Health

We want to predict the economic standing of countries from world development indicators, so we dropped features that pertained to GNI and GDP. Feature selection is an important step for reducing training time and guarding against overfitting. To select features, we used an ANOVA-based univariate method and a tree-based method. This gave us candidate feature sets of 37, 25, 17 and 8 features to use when training our models.
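
The two selection approaches can be sketched with scikit-learn as follows. This is not the project's actual code: the data is synthetic, and the choice of `ExtraTreesClassifier` as the tree-based estimator and `k=17` as the target size are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif

# Synthetic stand-in for the indicator matrix: 400 rows, 60 features, 4 classes.
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=12, n_classes=4, random_state=0
)

# Univariate selection: keep the 17 features with the highest ANOVA F-statistic.
anova_selector = SelectKBest(score_func=f_classif, k=17).fit(X, y)
X_anova = anova_selector.transform(X)

# Tree-based selection: keep the 17 features with the highest importances.
# threshold=-np.inf makes max_features the only selection criterion.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
tree_selector = SelectFromModel(forest, max_features=17, threshold=-np.inf).fit(X, y)
X_tree = tree_selector.transform(X)

print(X_anova.shape, X_tree.shape)
```

Calling `get_support()` on either selector returns a boolean mask that can be used to recover the names of the retained indicators.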

To train and test our models, we built a pipeline that evaluated every combination of feature set and hyperparameters using 5-fold cross-validation. The best configuration was then tested on unseen data and re-tuned according to its performance. Our models performed extremely well in predicting high income groups and very well in predicting low income groups. Accuracy, precision and recall dipped when predicting the upper middle and lower middle income groups. The best-performing model was the random forest, which used only 17 features to predict income groups. It had an accuracy, precision, recall and F1-score of just under 94%. What set it apart from the other models was its precision on the low income group, where it predicted 177 out of 178 correctly.
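
A search of this kind can be sketched with a scikit-learn `Pipeline` and `GridSearchCV`. The data below is synthetic, and the grid values, the `f1_weighted` scoring choice and the use of `SelectKBest` inside the pipeline are assumptions for illustration, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic four-class problem standing in for the income-group data.
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=10, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Feature selection and the classifier are tuned together, so selection is
# refit inside every cross-validation fold.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Hypothetical grid: feature-set sizes crossed with one model hyperparameter.
param_grid = {
    "select__k": [8, 17, 25, 37],
    "model__max_depth": [None, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_weighted")
search.fit(X_train, y_train)

# The held-out test set is only used after the search has finished.
print(search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

The per-class rows of `classification_report` are where a drop in precision or recall for the middle income groups would show up.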

## Introduction

Many factors determine how developed a country is. Typically, countries are considered to be developed if they:

- Are highly industrialized.
- Have a large female workforce and women in high ranking positions.
- Use a disproportionate amount of the world’s resources.
- Have stable birth and death rates.
- Have higher levels of debt, usually for production.
- Have a high gross domestic product (GDP) per capita.
- And much more.

Outside of the developed category, there are developing countries that range from highly developed to underdeveloped. For example, although it is the world’s second largest economy, China falls into the developing country category because of the disparity in its wealth, quality of life, and GDP per capita. Argentina is also considered a developing country because of its political disarray, economic uncertainty and erosion of quality of life, despite ranking higher than other countries in the vast majority of metrics.


### Problem Statement:
Development within a country is a continuous journey because conditions are always changing. To help maintain and raise development levels, we want to determine what stage of development a country is in and help it acquire the tools necessary to keep building itself.

In this project specifically, we look at the income group each country falls into and its development factors in order to build a multiclass classification model.

### Approach:
In this project, we look at world development indicators, provided by the World Bank, to predict income groups. A total of 1600 indicators help paint a picture of the development level of countries and regions, as well as their economic outlook. Visual and statistical EDA are performed to understand how several of these indicators correlate with each other, GNI per capita, income groups and world development. Some of these indicators pertain to access to basic utilities, population growth, education, employment, the environment and more. Machine learning classification models are built and tuned to predict the income group a country falls into based on world development indicators not related to GNI or GDP. Five separate models are tuned, tested and cross-referenced with various methods of feature selection in order to find the highest-performing model with the fewest features needed.

### The Client:
By using world development indicators to predict income groups, non-governmental organizations (NGOs), such as the World Health Organization, International Save the Children Alliance, World Youth Alliance and more, can identify disparities in economic prosperity at the micro and macro level and how they relate to the development of countries. NGOs can look at features that go hand in hand with development and national wealth and extract information that GNI alone can't always display. For instance, GNI may be underestimated in lower-income economies that have more informal, subsistence activities, and GNI does not reflect inequalities in income distribution. For example, China has shifted between lower middle income and upper middle income despite having the world's second largest economy. This reflects disparities in development and wealth throughout the country, and NGOs may choose to help with certain facets of the issue.

From our analysis and recommendations, these NGOs can determine where countries are headed in their development stages, whether they are progressing or regressing, and what can be done to help in the development process.

### Dataset
[World Bank World Development Indicator Datasets](https://datacatalog.worldbank.org/dataset/world-development-indicators)
- Contains datasets with values for each of the 1600 indicators for each country over time, explanations of each indicator grouped into specific topics, and information on each country's region, income group and more. The datasets are linked through unique values that can be used to combine them.

[World Bank Help Desk](https://datahelpdesk.worldbank.org/knowledgebase/articles/378834-how-does-the-world-bank-classify-countries) 
- Has historical information on income groups and the GNI range a country needed to fall within to be classified in each income group.



## Data Wrangling
After looking through the initial data, we discovered that four datasets were needed in order to move forward with this project.  
- WDIData.csv had information on each of the 1600 indicators for each country dealing with world development over time.
- WDISeries.csv explained each indicator and placed them into a specific topic.
- WDICountry.csv had information on the country's region, income group and more.
- OGHIST.xls had income group history and criteria.

The concatenated, raw dataset had 422,196 rows and 90 columns. Each row was an indicator for a country, region or subset of similar countries, while the columns contained values over time, categorical information and definitions. Initially, over half the data was missing, mostly because many indicators were not recorded as early as others. The data starts in 1960; to keep the analysis current, we used only data between 2005 and 2016.

Through data wrangling, we were able to:
- Read in raw data
- Merge datasets
- Categorize indicators
- Drop insufficient information
- Drop unrelated columns
- Standardize column names
- Extract countries
- Drop columns with too much missing data
- Drop indicators with too much missing data
- Find the income group of each country for each year
- Drop incorrect data
- Pivot the dataset for analysis and modeling

We were left with 202 countries and 341 world development indicators.
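
A few of the wrangling steps listed above (restricting the year range, dropping sparse indicators and pivoting) can be sketched in pandas as follows. The toy table, its values and the 0.5 missing-data cut-off are assumptions for illustration; the column names only mimic the general shape of `WDIData.csv`.

```python
import pandas as pd

# Toy wide-format table: one row per country/indicator, one column per year.
raw = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "BBB", "BBB"],
    "Indicator Code": ["IND.ONE", "IND.TWO", "IND.ONE", "IND.TWO"],
    "2004": [1.0, None, 2.0, None],
    "2005": [1.1, None, 2.1, 5.0],
    "2006": [1.2, None, 2.2, None],
})

# Wide to long: one row per country, indicator and year.
long_df = raw.melt(
    id_vars=["Country Code", "Indicator Code"], var_name="year", value_name="value"
)
long_df["year"] = long_df["year"].astype(int)

# Keep only the study window.
long_df = long_df[long_df["year"].between(2005, 2016)]

# Drop indicators whose share of missing values exceeds the assumed cut-off.
missing_share = long_df.groupby("Indicator Code")["value"].apply(
    lambda s: s.isna().mean()
)
keep = missing_share[missing_share <= 0.5].index
long_df = long_df[long_df["Indicator Code"].isin(keep)]

# Pivot so each row is a country-year and each column is an indicator.
tidy = long_df.pivot(
    index=["Country Code", "year"], columns="Indicator Code", values="value"
)
print(tidy)
```

The resulting country-year by indicator layout is the shape that can be fed directly into analysis and modeling.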

<img src="https://github.com/dametreusv/world_development_indicators/blob/master/images/wrangle_missing_data.png">

Source code for the data wrangling can be found [here](https://nbviewer.jupyter.org/github/dametreusv/world_development_indicators/blob/master/WDI_milestone_report.ipynb#Wrangle) in the milestone report or [here](https://github.com/dametreusv/world_development_indicators/blob/master/WDI_wrangle.ipynb) in the wrangling notebook.