# <span style='color:#15317E'>04 Capstone Report</span>

######
#### Author: Byron Stuart
#### Data Science Immersive Capstone Project
#### Date 06 June 2017
######

## <span style='color:#7D6115'>Table of Contents</span>

<a href='#jupyter_notebooks'>00 Capstone Project Navigation</a><br>
<a href='#executive'>01 Executive Summary</a><br>
<a href='#import'>02 Importing, Querying and Sorting of Data</a><br>
<a href='#parsing'>03 Parsing of Data</a><br>
<a href='#statistical_analysis'>04 Statistical Analysis</a><br>
<a href='#describe_plot'>05 Describe and Plot</a><br>
<a href='#perform_model'>06 Perform Model</a><br>
<a href='#tune_and_evaluate'>07 Tune and Evaluate Model</a><br>
<a href='#data_pipeline'>08 Data Pipeline</a><br>
<a href='#statistical_and_visual'>09 Statistical and Visual Analysis</a><br>
<a href='#model_selection'>10 Model Selection and Implementation Process</a><br>
<a href='#interpret_findings'>11 Interpret Findings</a><br>
<a href='#stakeholders'>12 Recommendations for Stakeholders</a><br>

**The above links and any other internal links throughout this notebook do not work when viewing from GitHub**

## <span style='color:#7D6115'>00 Capstone Project Navigation</span> <a name='jupyter_notebooks' />
<a href='./01%20capstone_setup.ipynb' target='_blank'>Step 1 - Capstone Setup Notebook</a><br>
<a href='./02%20capstone_indicators.ipynb' target='_blank'>Step 2 - Capstone Indicator Notebook</a><br>
<a href='./03%20capstone_eda_and_models.ipynb' target='_blank'>Step 3 - Capstone EDA and Models Notebook</a><br>
Step 4 - Capstone Report Notebook

## <span style='color:#7D6115'>01 Executive Summary</span> <a name='executive' />

### Overview
This project investigates and analyses the "Health Nutrition and Population Statistics" dataset, which attempts to depict the state of human health across the world. The dataset (https://www.kaggle.com/theworldbank/health-nutrition-and-population-statistics) is provided by https://www.kaggle.com/theworldbank. The Content and Context paragraphs (below) are text from the Kaggle overview section.

More specifically, the target of this project is to find indicators that are able to predict life expectancy in various groupings of countries around the world. Some of these groupings are geographic, such as continents or regions, and 3 special groupings will also be analysed: 'Fragile and conflict affected situations', 'Heavily indebted poor countries (HIPC)' and 'Least developed countries: UN classification'.

### Key Findings
Real data can be messy even when it is not dirty, and it does not always behave the way you want it to. In this dataset the values themselves are clean, but not all indicators have been collected for every region or country. This is not a surprise, as the ability to collect data depends on many factors. Regions and countries range from first world to third world and from vast to very small, and differ in many other ways. All of these can have a bearing on what data is available and what has actually been recorded.

Leaving aside the presence or absence of individual indicator data, the overriding factor is that many statistical algorithms work better with large data sets. In this project there were only 55 years of data available and a decision was made to reduce the scope to the last 40 years. The data consists of only 1 value per year for each indicator, and this data is not always available for every region or country. The bottom line is that the maximum possible number of values for any one indicator is 40. Some statistical algorithms really struggled with only 40 values per indicator, especially when combined with train/test splitting and cross validation of the models with multiple folds. See section '06 Perform Model' for more details.

### Significant Indicators
For the **3 World Bank special groupings** there were 3 indicators that appeared in each of the special groupings:

    primary completion rate, female (% of relevant age group)
    birth rate, crude (per 1,000 people)
    adolescent fertility rate (births per 1,000 women ages 15-19)
    
Notice that 'life expectancy' and 'primary completion rate, female (% of relevant age group)' are trending up whilst 'birth rate, crude (per 1,000 people)' and 'adolescent fertility rate (births per 1,000 women ages 15-19)' go down. Bear in mind that correlation does not imply causation.
    
<img src="./images/special_categories_indicators.jpg" alt="special categories indicators" style="width:800px;float:left;">

For the **13 geographic groupings** there was one indicator that appeared in 10 of them:

    adolescent fertility rate (births per 1,000 women ages 15-19)
    
Some of the country groupings with the lowest 'adolescent fertility rate (births per 1,000 women ages 15-19)' also have the highest 'life expectancy' but it is not a consistent trend. Also the country groupings with the 2 highest 'adolescent fertility rate (births per 1,000 women ages 15-19)' have the lowest 'life expectancy'. More study is required to better understand these relationships.

<img src="./images/country_combinations_indicators.jpg" alt="country combinations indicators" style="width:500px;float:left;">

There was 1 indicator that appeared in 9 groupings:
    
    population, female (% of total)

There was 1 indicator that appeared in 8 groupings:

    age dependency ratio (% of working-age population)

There were 5 indicators that appeared in 7 groupings:
    
    rural population
    unemployment, male (% of male labor force)
    age dependency ratio, old
    gni per capita, atlas method (current us$)
    school enrollment, primary, female (% gross)

## <span style='color:#7D6115'>Future Work</span>
It is apparent that there are many potential relationships in the data that could be studied. Taking such a broad approach of looking at many country groupings may have inhibited rather than enhanced the analysis process.

It is recommended that a narrower approach be taken, focusing on a few select regions or countries and trying to find some really strong relationships. Once these are found, it would be highly beneficial to find additional data sets to help triangulate the results and to overcome the problem of too few data values in the original World Bank data.

### Kaggle descriptions
#### Content
"HealthStats provides key health, nutrition and population statistics gathered from a variety of international sources. Themes include population dynamics, nutrition, reproductive health, health financing, medical resources and usage, immunization, infectious diseases, HIV/AIDS, DALY, population projections and lending. HealthStats also includes health, nutrition and population statistics by wealth quintiles."

#### Context
"This dataset includes 345 indicators, such as immunization rates, malnutrition prevalence, and vitamin A supplementation rates across 263 countries around the world. Data was collected on a yearly basis from 1960-2016."

## <span style='color:#7D6115'>02 Importing, Querying and Sorting of Data</span> <a name='import' />
### Importing and Querying
The raw data is contained in one CSV text file and is able to be read in using a single line of code.

Upon querying the data it was found to be in a long format, with 4 features: 'Country Name', 'Country Code', 'Indicator Name', and 'Indicator Code' that indicate what the data is about and year columns 1960 through to 2015 that contain the data in numerical form.

There do not appear to be any data input errors and no extreme outliers were observed. The data that is there appears to be accurate and is in the form of numbers that represent totals or percentages.

To see the relevant section of code click here -> <a href='./01%20capstone_setup.ipynb#import_data' target='_blank'>Import Data</a>.
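
As a rough illustration of the layout described above, the sketch below reads a tiny stand-in CSV with the same four identifying columns followed by year columns. The indicator codes and all values are made up for illustration; only the column names come from the text.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV file; codes and values are made up.
sample_csv = io.StringIO(
    "Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,2015\n"
    "World,WLD,indicator one,IND.ONE,52.5,53.0,71.9\n"
    "World,WLD,indicator two,IND.TWO,31.8,,19.2\n"
)

# The whole file is read in with a single call.
df = pd.read_csv(sample_csv)

id_columns = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code']
year_columns = [c for c in df.columns if c not in id_columns]

# Count the missing year values in each row (one row per country/indicator).
nulls_per_row = df[year_columns].isnull().sum(axis=1)
```

Here `nulls_per_row` gives a quick view of how sparsely each indicator is filled, which is the property the later subsetting steps rely on.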

### Data Parsing and Subsetting
Looking at the data, it is evident that in the earlier years (1960 onwards) many of the individual indicators have little or no data for many countries or groupings of countries. It was decided to use only the last 40 years of data to help alleviate this problem.

Further to this, it was also decided to use only features that have a minimum amount of data across the different geographic and special groupings so that comparisons can be made between them. More specifically, **100** was chosen as the threshold for the maximum number of allowable null years per indicator per set of countries chosen. If there are more nulls than this, the indicator is removed from any further analysis. To see the relevant section of code click here -> <a href='./01%20capstone_setup.ipynb#check_indicators' target='_blank'>check indicators code</a>.
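
The null threshold can be sketched as follows. This is not the notebook's code: the function name `indicators_to_keep` and the toy data are assumptions, and the example uses a threshold of 2 so the effect is visible on a small frame.

```python
import numpy as np
import pandas as pd

MAX_NULLS = 100  # threshold quoted in the text

def indicators_to_keep(df, year_columns, max_nulls=MAX_NULLS):
    """Return indicator names whose total null count, summed over all
    rows (countries) and year columns, does not exceed max_nulls."""
    nulls_per_row = df[year_columns].isnull().sum(axis=1)
    total_nulls = nulls_per_row.groupby(df['Indicator Name']).sum()
    return total_nulls[total_nulls <= max_nulls].index.tolist()

# Made-up example: two countries, two indicators, three years.
toy = pd.DataFrame({
    'Country Name': ['A', 'A', 'B', 'B'],
    'Indicator Name': ['x', 'y', 'x', 'y'],
    '2013': [1.0, np.nan, 2.0, np.nan],
    '2014': [1.5, np.nan, 2.5, 3.0],
    '2015': [2.0, 4.0, np.nan, 3.5],
})

# 'x' has 1 null in total and 'y' has 3, so only 'x' survives a threshold of 2.
kept = indicators_to_keep(toy, ['2013', '2014', '2015'], max_nulls=2)
```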

Some indicators record values per 1,000 people or per 100,000 people; these have been converted to percentages for later plotting in **Tableau**.
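
The conversion is simple arithmetic, sketched here with a hypothetical helper `per_n_to_percent` and made-up input values:

```python
def per_n_to_percent(value, per):
    """Convert a rate expressed 'per `per` people' into a percentage."""
    return value / per * 100.0

# Made-up inputs for illustration.
birth_rate_pct = per_n_to_percent(25.0, 1000)     # 25 per 1,000 is about 2.5%
incidence_pct = per_n_to_percent(150.0, 100000)   # 150 per 100,000 is about 0.15%
```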

<img src="./images/constants.jpg" alt="constants" style="width:800px;float:left;">

### Dataframes and sqlite
After the initial parsing and subsetting the original single dataframe was organised into separate dataframes for 'world' data, 'income' grouping data, 'combo' grouping data, and 'single' countries data as well as a dataframe containing the corresponding sets of indicators.

A sqlite database was created and the above dataframes were written to it to allow for easy access from other notebooks.
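
A minimal sketch of this hand-off between notebooks, assuming pandas' `to_sql` and `read_sql` are used with the standard library `sqlite3` module. The table name, column names and in-memory database are placeholders; the project would use a database file on disk.

```python
import sqlite3
import pandas as pd

# Placeholder frame standing in for one of the project's data sets.
world = pd.DataFrame({'indicator': ['x', 'y'], 'value': [1.0, 2.0]})

# In-memory database for illustration; a file path would be used in practice.
conn = sqlite3.connect(':memory:')
world.to_sql('world', conn, index=False, if_exists='replace')

# Another notebook would open the same database and read the table back.
restored = pd.read_sql('SELECT * FROM world', conn)
conn.close()
```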

### Initial Analysis
At first it was thought that it would be beneficial to group the indicators together under broad categories based on keywords. This at least helped to understand what the indicators were related to as it is very difficult to look at 345 indicators all together. A basic exploration was done on the **World** dataset exploring the relationship between 'life expectancy' and some broad indicators. The indicators chosen related to the following keywords, 'undernourished', 'hiv', 'health_expenditure', 'malnutrition', 'overweight', 'immunization', 'sanitation', 'water', 'anemia', and 'unemployment'.

This basic exploration found that:

* some indicators had more than 20 nulls and had to be dropped
* the indicators did not have normal distributions
* it was necessary to replace the remaining missing values with median values
* some high correlations were found between water, sanitation, hiv and immunization and life expectancy

#### World coefficients
<img src="./images/world_coeffs.jpg" alt="world coefficients" style="width:600px;float:left;">

### Plot of improved water source rural for World
<img src="./images/world_rural_water.png" alt="world rural water access" style="width:400px;float:left;">

## <span style='color:#7D6115'>03 Parsing of Data</span> <a name='parsing' />

### Indicator Removal
Some indicators were removed as they are subsets of a broader grouping. For example 'population, total' was removed as there are 'population, female' and 'population, male' indicators that together are the same data.

Indicators related to population demographics such as 'female population 20-24' were removed, as were indicators related to death, mortality or survival as none of these will help to predict the life expectancy.

This simple process resulted in the removal of 116 indicators; see the indicator removal code <a href='./02%20capstone_indicators.ipynb#age_indicators' target='_blank'>here</a>. A side benefit of removing indicators, especially when a large number are removed, is that it makes it easier to identify indicators that may be relevant to the target, that is, it removes some of the fog of analysis. The indicators kept were written into CSV files for further analysis.
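
Keyword-based removal of this kind can be sketched as below. The keyword tuple and the helper name `drop_by_keyword` are assumptions for illustration and are not taken from the notebook.

```python
# Assumed keyword list; the notebook's actual list may differ.
REMOVE_KEYWORDS = ('death', 'mortality', 'survival')

def drop_by_keyword(indicators, keywords=REMOVE_KEYWORDS):
    """Keep only indicator names that contain none of the keywords."""
    return [name for name in indicators
            if not any(k in name.lower() for k in keywords)]

names = [
    'mortality rate, infant (per 1,000 live births)',
    'birth rate, crude (per 1,000 people)',
    'survival to age 65, female (% of cohort)',
    'rural population',
]
kept_names = drop_by_keyword(names)
```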

### Data Cleaning for Country Combinations
Further data cleaning was done for any specific country combination chosen to analyse. This data cleaning consisted of removing any indicators that had more than **20** nulls for the country combination being processed. This removal of nulls was different to the process referred to in the 'Data Parsing and Subsetting' section above as the indicator is only removed from analysis for that country combination (not all country combinations). This allows for more individualised analysis on individual country combinations and also reflects the fact that the data is not consistently filled across all country combinations.

In this data cleaning wherever nulls are found they are replaced by the median of the column. Lastly the predictor variables (X) are standardised ready for use by any models chosen and plotted to check for any outliers. The code for this processing can be examined <a href='./03%20capstone_eda_and_models.ipynb#cleaning_and_standardised' target='_blank'>here</a>.
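
The three cleaning steps (drop sparse indicators, fill nulls with the median, standardise) might look like the following sketch. It assumes scikit-learn's `StandardScaler` for the standardisation; the function name and the random data are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def clean_and_standardise(X, max_nulls=20):
    # Drop indicators (columns) with more than max_nulls missing years.
    X = X.loc[:, X.isnull().sum() <= max_nulls]
    # Replace the remaining nulls with each column's median.
    X = X.fillna(X.median())
    # Standardise every column to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(X)
    return pd.DataFrame(scaled, columns=X.columns, index=X.index)

# Made-up data: 40 rows (one per year) on very different scales.
rng = np.random.RandomState(0)
X_raw = pd.DataFrame({
    'a': rng.normal(50, 5, 40),
    'b': rng.normal(1e6, 1e5, 40),
    'mostly_missing': [np.nan] * 30 + [1.0] * 10,
})
X_raw.loc[4, 'a'] = np.nan  # a single gap that is filled by the median

Xs = clean_and_standardise(X_raw)
```

After this step `Xs` contains only columns `a` and `b`, with no nulls, on a common scale.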

## <span style='color:#7D6115'>04 Statistical Analysis</span> <a name='statistical_analysis' />
The data consists of **continuous** floating point values in the form of percentages or numbers that relate specifically to the feature being measured for each year of data, for example an average age. Below are some sample statistics from the 'Fragile and conflict affected situations' data. Broadly, the indicators have similar data for each of the country groupings.

### Fragile and Conflict Affected Situations Sample Feature Statistics

<img src="./images/statistics01.jpg" alt="statistics01" style="width:300px;float:left;">

<img src="./images/statistics02.jpg" alt="statistics02" style="width:450px;float:left;">

## <span style='color:#7D6115'>05 Describe and Plot</span> <a name='describe_plot' />
As indicated previously, the data for the indicators does not follow a normal distribution. Some sample plots follow which give an idea of how the **continuous** data looks. Also note the different scales: indicators can be percentages, dollar amounts, populations etc. Sometimes the values are decimals, and they can also be in the thousands or millions. This makes it very important to standardise the data before performing any machine learning processes (see standardisation plot below).

It is possible to find many, many relationships between indicators for the same country combination, or to compare them with the same indicators for other country combinations, just by looking at plots. Indeed it would be possible to look for different relationships between different single countries within the same country combination. The possibilities are endless and it is beyond the scope of this project to look for many of these relationships.

A few sample plots are provided to give the reader a taste of the analysis possibilities.

### African Top 4 Lasso Predictors
<img src="./images/africa.jpg" alt="Africa" style="width:800px;float:left;">

### European Union Top 4 Lasso Predictors
<img src="./images/european_union.jpg" alt="European Union" style="width:800px;float:left;">

### GNI per capita
<img src="./images/gni.jpg" alt="GNI per capita" style="width:400px;float:left;">

### Standardised X example
As mentioned above here is an example plot of the X predictor variables after standardisation.
<img src="./images/standardised_x.jpg" alt="standardised X" style="width:800px;float:left;">

## <span style='color:#7D6115'>06 Perform Model</span> <a name='perform_model' />
Rather than attempting to select features manually or by a simple technique such as linear regression it was decided to let a machine learning technique do the work by brute force. Because of the still large number of features remaining - **217** - it was decided to use Lasso to reduce this number.

https://stats.stackexchange.com/questions/17251/what-is-the-lasso-in-regression-analysis 

"The Lasso (Least Absolute Shrinkage and Selection Operator) is a regression method that involves penalizing the absolute size of the regression coefficients. By penalizing (or equivalently constraining the sum of the absolute values of the estimates) you end up in a situation where some of the parameter estimates may be exactly zero. The larger the penalty applied, the further estimates are shrunk towards zero. This is convenient when we want some automatic feature/variable selection, or when dealing with highly correlated predictors, where standard regression will usually have regression coefficients that are 'too large'."

### Train and Test - Lasso
Prior to running the Lasso it is necessary to split the data into training and testing sets. The reason for this is to detect overfitting: the model is trained on the training set and is then tested on the testing set to see how accurate it is on unseen data. Due to the relatively small number of data values (40 years of data) the training and testing sets were split 50/50 rather than the more conventional method of assigning the training set a larger proportion of the values.

The LassoCV object is set up, which then selects its alpha parameter automatically by performing cross-validation on the training data it is fitted to. Once the alpha value is found, a Lasso object is set up with that alpha and fitted using the training data.

Lasso will be assessed for accuracy by analysing the cross validation score mean, cross predicted R2 and the test score.

The Lasso assigns coefficient values to the features it is given, any that are 0 will be discarded and not used for any future testing. Different n_alphas and cv values were tried for the LassoCV but these did not seem to make any difference to the features chosen by Lasso. To see the code for the Lasso tests <a href='./03%20capstone_eda_and_models.ipynb#lasso_train_test' target='_blank'>view here</a>.
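
The Lasso workflow described in this section can be sketched as follows. The synthetic data, random seeds and `cv=3` are assumptions chosen for illustration; only the 50/50 split and the LassoCV-then-Lasso sequence come from the text.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 40 "years", 10 standardised features,
# of which only the first two actually drive the target.
rng = np.random.RandomState(42)
X = rng.normal(size=(40, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=40)

# 50/50 split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

# LassoCV chooses alpha by cross validation on the training data.
lasso_cv = LassoCV(cv=3).fit(X_train, y_train)

# Fit a plain Lasso with the chosen alpha and keep the non-zero features.
lasso = Lasso(alpha=lasso_cv.alpha_).fit(X_train, y_train)
selected = np.flatnonzero(lasso.coef_)
test_r2 = lasso.score(X_test, y_test)
```

`selected` holds the column positions of the features with non-zero coefficients, which are the ones passed on to the evaluation models.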

### Train and Test - Evaluation Models
The same train and test techniques will be applied to the evaluation models, however they will only use the specific indicators chosen for each country combination. To see which models will be evaluated look at section '07 Tune and Evaluate Models' below.

The evaluation models will be assessed by the cross validation score mean, cross predicted R2 and the test score.

### Cross Validation Limitations
The cross validation throws a warning on the LogisticRegression:
'Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=3.' This is most likely a consequence of the relatively small number of data values (40 years of data). LogisticRegression treats each distinct integer life expectancy value as a class, so a value that occurs in only one year forms a class with a single member, and stratified cross validation cannot place that class in every fold.
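
A quick way to see why the warning appears is to count how often each integer class occurs. The values below are made up, and the `int` cast is an assumption about how the float target was converted for LogisticRegression.

```python
from collections import Counter

# Made-up life expectancy values, one per year of data.
life_expectancy = [58.2, 58.9, 59.4, 60.1, 60.8, 61.3, 62.0, 62.6]

# Assumed integer conversion of the target.
labels = [int(v) for v in life_expectancy]
class_counts = Counter(labels)

# Classes with fewer members than folds cannot appear in every fold.
n_folds = 3
too_small = sorted(c for c, n in class_counts.items() if n < n_folds)
```

With a slowly rising target, almost every integer value occurs only once or twice, so nearly all classes fall below the fold count.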

## <span style='color:#7D6115'>07 Tune and Evaluate Models</span> <a name='tune_and_evaluate' />

As there are 4 models being assessed they each have different parameters that can be set. The objective is to find parameters that work across the multiple data sets, not to try and tune separately for each data set.

**Parameters**

    LinearSVR
        penalty parameter C - the penalty is a squared l2 penalty, the bigger this parameter, the less regularization is used
        loss (function) -  'epsilon_insensitive' or 'squared_epsilon_insensitive'
        dual - whether to solve the dual or the primal optimization problem, dual=False (primal) is preferred when n_samples > n_features
        max_iter - maximum number of iterations to run
    KNeighborsRegressor
        n_neighbors - number of neighbors to use
        weights (function) - 'uniform' or 'distance'
    LinearRegression
        none
    LogisticRegression
        multi_class - must be 'multinomial' in this case
        solver - 'newton-cg', 'sag' or 'lbfgs'
        max_iter - maximum number of iterations taken for the solvers to converge
        
The models will be assessed by the cross validation score mean, cross predicted R2 and the test score.

**Evaluate - parameter values**

    LinearSVR
        C=1000.0, loss='squared_epsilon_insensitive', dual=False, max_iter=500
        These values produced the best results, low values of C and max_iter produce terrible results
    KNeighborsRegressor
        n_neighbors=3, weights='distance'
        Changing the values made no noticeable difference
    LinearRegression
        none
    LogisticRegression
        multi_class='multinomial', solver='sag', max_iter=500
        Changing the solver makes no difference, different max_iter make little difference
        Also note that LogisticRegression requires the target value to be an integer (not float)
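
The three assessment measures can be sketched for the regressors as below, using the parameter values listed above. The synthetic data and `cv=3` are assumptions, and LogisticRegression is left out of this sketch because it needs an integer target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import (cross_val_predict, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR

models = {
    'LinearSVR': LinearSVR(C=1000.0, loss='squared_epsilon_insensitive',
                           dual=False, max_iter=500),
    'KNeighborsRegressor': KNeighborsRegressor(n_neighbors=3,
                                               weights='distance'),
    'LinearRegression': LinearRegression(),
}

# Synthetic stand-in for one country combination (40 rows, 4 features).
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 4))
y = 70.0 + X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=40)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

results = {}
for name, model in models.items():
    # Mean of the per-fold scores (R2 by default for regressors).
    cv_mean = cross_val_score(model, X_train, y_train, cv=3).mean()
    # R2 of the out-of-fold predictions taken together.
    cv_pred_r2 = r2_score(
        y_train, cross_val_predict(model, X_train, y_train, cv=3))
    # Score on the held-out test half.
    test_score = model.fit(X_train, y_train).score(X_test, y_test)
    results[name] = (cv_mean, cv_pred_r2, test_score)
```

Using one fixed parameter set per model, as here, matches the stated aim of finding values that work across data sets rather than tuning each one separately.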

## <span style='color:#7D6115'>08 Data Pipeline</span> <a name='data_pipeline' />

### Sequential Steps
In order to get to the final result of selecting a model based on overall performance across the different country combination data sets the following importing/sorting/querying/munging steps were required. These steps are spread across 3 sequentially used Jupyter notebooks, click <a href=#jupyter_notebooks>here</a> to see the list of Jupyter notebooks.

**1st Jupyter notebook**
    
    01 Read the CSV file in using pandas.read_csv function
    02 Remove the empty column using pandas functions
    03 Define the maximum number of null year values (100) to accept before rejecting the indicator
    04 Remove the data for years 1960 - 1974 using pandas functions
    05 Create a Python data dictionary
    06 Convert all the indicator names to lowercase
    07 Convert per 100,000 and per 1,000 indicators to percentages for easier analysis by Tableau
    08 Use custom written function that returns data by any combination of countries, indicators and years to:
        Create 'World' data set which includes all indicators
        Create 'Income' data set which only includes those indicators with no more than 100 null year values
        Create 'Country Combinations' data set which only includes those indicators with no more than 100 null year values
        Create 'Single Countries' data set which only includes those indicators with no more than 100 null year values
    09 Write all the data sets and their corresponding indicators to a SQLite database

**2nd Jupyter notebook**

    10 Read all the data sets and their corresponding indicators from a SQLite database
    11 Remove some features that are subsets of a broader grouping (avoid collinearity)
    12 Remove age and demographic indicators (as they are related to the target)
    13 Write life expectancy (target) and chosen country combination indicators to CSV files

**3rd Jupyter notebook**

    14 Read all the data sets from a SQLite database
    15 Read the life expectancy (target) and chosen country combination indicators from CSV files
    16 Process the chosen country combinations (see next cell for the list) by doing the following:
        Run 'Lasso' to determine indicators to keep for the specific data set
        Run the 4 chosen models using the set of indicators returned by Lasso
    17 Summarise the predictors chosen by Lasso
        Do this first for the World Bank special categories
        Secondly do this for the Geographic Country Combinations
    18 Write the non-standardised data to a csv file for use by Tableau
        Only including the Lasso summarised predictors that occur more than once
        Do this separately for the World Bank special categories and for the Geographic Country Combinations
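
Step 17 and the "occur more than once" filter in step 18 can be sketched with a simple counter. The grouping labels and the selections below are hypothetical; only the indicator names are taken from this report.

```python
from collections import Counter

# Hypothetical Lasso selections for three groupings.
selected_by_grouping = {
    'Grouping A': ['adolescent fertility rate', 'rural population'],
    'Grouping B': ['adolescent fertility rate',
                   'population, female (% of total)'],
    'Grouping C': ['adolescent fertility rate', 'rural population'],
}

# Count how many groupings each predictor was selected for.
counts = Counter(name
                 for names in selected_by_grouping.values()
                 for name in names)

# Keep only predictors that occur in more than one grouping.
recurring = {name: n for name, n in counts.items() if n > 1}
```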

### Chosen Country Combinations
These are the datasets against which the models were evaluated.

**World Bank special categories**

    Fragile and conflict affected situations
    Heavily indebted poor countries (HIPC)
    Least developed countries: UN classification

**Geographic Country Combinations**

    Arab World
    Caribbean small states
    Central Europe and the Baltics
    East Asia & Pacific
    Euro area
    Europe & Central Asia
    European Union
    Latin America & Caribbean
    Middle East & North Africa
    North America
    Pacific island small states
    South Asia
    Sub-Saharan Africa

## <span style='color:#7D6115'>09 Statistical and Visual Analysis</span> <a name='statistical_and_visual' />
What follows are some predicted vs true value plots to give a rough idea of the performance of the 4 models that were evaluated. However, appearances can be deceiving, as models with very good looking plots sometimes performed surprisingly poorly on the scores. The caveat is that the dataset size (40 years of values) may also have had a debilitating effect on the ability of some models to perform.

**A LinearSVR plot that looks OK except for the outliers**<br>
<img src="./images/LinearSVR.jpg" alt="LinearSVR" style="width:400px;float:left;"><br>

**A pretty good KNeighborsRegressor plot but the scores suggest otherwise**<br>
<img src="./images/KNeighborsRegressor.jpg" alt="KNeighborsRegressor" style="width:400px;float:left;">

**A typically good LinearRegression plot**<br>
<img src="./images/LinearRegression.jpg" alt="LinearRegression" style="width:400px;float:left;">

**A bad looking LogisticRegression plot, compare to the next plot**<br>
<img src="./images/LogisticRegression_bad.jpg" alt="LogisticRegression_bad" style="width:400px;float:left;">

**A good looking LogisticRegression plot**<br>
<img src="./images/LogisticRegression_good.jpg" alt="LogisticRegression_good" style="width:400px;float:left;">

**Shown here are the most commonly occurring coefficients for special categories and country combinations.**
For a complete set of all coefficients for every country combination see the **03 capstone_eda_and_models** notebook.
<img src="./images/special_categories.jpg" alt="special_categories" style="width:600px;float:left;">
<img src="./images/country_combinations.jpg" alt="country_combinations" style="width:600px;float:left;">

## <span style='color:#7D6115'>10 Model Selection and Implementation Process</span> <a name='model_selection' />

### Predictor Model Evaluation
It was determined to use Lasso for feature selection and then to give these features to a variety of machine learning algorithms for evaluation to see which model is best at predicting the life expectancy target for each combination of countries.

Click <a href='./03%20capstone_eda_and_models.ipynb#other_models' target='_blank'>here</a> to see the code for running the model evaluations. See below for a brief overview of each model.
    
#### LinearSVR
Linear Support Vector Regression uses the same basic idea as the Support Vector Machine (SVM), a classification algorithm, but applies it to predict real values rather than a class. Compared with SVR using a linear kernel, it has more flexibility in the choice of penalties and loss functions and scales better to large numbers of samples. It supports both dense and sparse input. Note that kernel SVR can capture non-linearity in the data, whereas LinearSVR is restricted to a linear model.

#### KNeighborsRegressor
When KNeighborsRegressor is used for regression problems the prediction is based on the mean or the median of the K-most similar instances. The input consists of the k closest training examples in the feature space. KNeighborsRegressor works well with a small number of input variables (p), but struggles when the number of inputs is very large so it may not work so well for country combinations that have more predictors (as chosen by Lasso).

#### LinearRegression
Linear regression attempts to model the relationship between variables by fitting a linear equation to observed data. Multiple Linear Regression, which is what will be used in this case, models the relationship between a scalar dependent variable and more than one independent variable.

#### LogisticRegression
Multinomial Logistic Regression models how a multinomial response variable depends on a set of X explanatory indicators.

#### Adjusted R2 Note
http://people.duke.edu/~rnau/rsquared.htm

"Generally it is better to look at adjusted R-squared rather than R-squared and to look at the standard error of the regression rather than the standard deviation of the errors.  These are unbiased estimators that correct for the sample size and numbers of coefficients estimated. Adjusted R-squared is always smaller than R-squared, but the difference is usually very small unless you are trying to estimate too many coefficients from too small a sample in the presence of too much noise. Specifically, adjusted R-squared is equal to 1 minus (n - 1)/(n – k - 1) times 1-minus-R-squared, where n is the sample size and k is the number of independent variables. Adjusted R-squared bears the same relation to the standard error of the regression that R-squared bears to the standard deviation of the errors: one necessarily goes up when the other goes down for models fitted to the same sample of the same dependent variable."

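The formula in the quote above can be written as a small helper. The function name and the example numbers are illustrative; n=20 corresponds to half of 40 yearly values after a 50/50 split, and k=5 is an arbitrary number of predictors.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (n - 1) / (n - k - 1) * (1 - R2),
    where n is the sample size and k the number of independent variables."""
    if n - k - 1 <= 0:
        raise ValueError('adjusted R2 needs n > k + 1')
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r2)

# Illustrative values only: R2 of 0.90 on 20 samples with 5 predictors.
adj = adjusted_r2(0.90, n=20, k=5)   # roughly 0.864
```

With so few samples the penalty grows quickly as predictors are added, which is one reason small data sets make the scores hard to compare.
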
## <span style='color:#7D6115'>11 Interpret Findings</span> <a name='interpret_findings' />

It was not possible to find one model that was suitable for all country combinations. It is not known whether this is because of differences in the 'shape' of the data for each combination making them uniquely suited to one particular model and/or because of only having 40 years of data causing calculation problems.

### KNeighborsRegressor
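Of the 4 models evaluated, the KNeighborsRegressor model was a clear last. The values calculated for cross validated score mean were always negative, which was taken to suggest that something is wrong with the way these values are being calculated. Note, however, that if the score is R2 (the default for sklearn regressors), a negative value is a valid result: it means the model predicted the held-out fold worse than simply predicting the mean. As such this model will not be analysed any further. The poor scores may well be the result of only having 40 years of data, meaning there are only 40 values for any one indicator.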
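
To help interpret negative scores, the sketch below computes R2 by hand on made-up numbers, assuming the cross validation score is R2 (the default for sklearn regressors). Predicting the mean scores 0, and predictions worse than the mean score below 0.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]
score_mean = r2(y_true, [2.0, 2.0, 2.0])   # always predicting the mean
score_bad = r2(y_true, [3.0, 2.0, 1.0])    # predicting the trend backwards
```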

### LinearSVR
LinearSVR had similar problems to the KNeighborsRegressor model. The values calculated for cross validated score mean were always negative with 4 exceptions, suggesting something is wrong with the way these values are being calculated. Nevertheless LinearSVR was the best model for 'Fragile and conflict affected situations' and 'Latin America & Caribbean'. The other 2 exception datasets were:
    
    Middle East & North Africa
    South Asia

So overall LinearSVR was slightly better than KNeighborsRegressor but still clearly not acceptable over the datasets as a whole for modelling.

### LinearRegression
LinearRegression worked in some cases, however there were more instances where a negative value was returned for the cross validated score mean (perhaps again due to only having 40 years of data). In 3 of the country combinations LinearRegression was actually the most accurate model. In terms of the goal of determining which indicators contribute the most to the target of life expectancy, LinearRegression is able to show the coefficient values, which give the weighting for each indicator.

### LogisticRegression
On balance LogisticRegression easily performed the most consistently of all the models. For 11 of the country combinations LogisticRegression was the most accurate. LogisticRegression was the only model that did not return any negative values for the cross validated score mean. However, because a multinomial model was used, LogisticRegression returns an array of coefficient values for each feature. So although it can predict life expectancy, it cannot easily show which indicators contribute the most to the target.

### Model Accuracy Summary
Here is a summary of each model's strengths and/or weaknesses. Cross validation score mean is 'CV Score', cross validation predicted R2 is 'Adj R2'. See the **Model Performance** section for a plot of the values for LinearRegression and LogisticRegression. This summary can only be based on the results that were achieved, with the caveat that some models did not appear to have a large enough data set to calculate correctly.

| Model               | Predictions vs True | CV Score | Adj R2      | Test Score |
|---------------------|---------------------|----------|-------------|------------|
| LinearSVR           | good but outliers   | varies   | varies      | varies     |
| KNeighborsRegressor | good                | poor     | OK          | OK         |
| LinearRegression    | very good           | varies   | mostly good | good       |
| LogisticRegression  | varies              | varies   | good        | varies     |

### Model Chosen per Country Combination
These records are accurate as of the last run of the notebook; however, the models chosen can vary between runs due to the random nature of the train/test splits.

| Country Combination | Model Chosen |
|---------------------|--------------|
| Fragile and conflict affected situations | SupportVectorRegression |
| Heavily indebted poor countries (HIPC) | LogisticRegression |
| Least developed countries: UN classification | LogisticRegression |
| | |
| Arab World | LinearRegression |
| Caribbean small states | LinearRegression |
| Central Europe and the Baltics | LogisticRegression |
| East Asia & Pacific | LogisticRegression |
| Euro area | LogisticRegression |
| Europe & Central Asia | LogisticRegression |
| European Union | LogisticRegression |
| Latin America & Caribbean | SupportVectorRegression |
| Middle East & North Africa | LogisticRegression |
| North America | LogisticRegression |
| Pacific island small states | LinearRegression |
| South Asia | LogisticRegression |
| Sub-Saharan Africa | LogisticRegression |
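
The run-to-run variation in the chosen model comes from the random train/test split, and a fixed seed would make the split repeatable. scikit-learn's `train_test_split` accepts a `random_state` argument for this purpose. The sketch below shows the same idea in plain Python; `split_train_test` and the year range are hypothetical and only stand in for 40 yearly observations.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=None):
    """Shuffle a copy of rows and split it; a fixed seed makes the split repeatable."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    # The first n_test shuffled rows form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for 40 yearly observations.
years = list(range(1975, 2015))

train_a, test_a = split_train_test(years, seed=42)
train_b, test_b = split_train_test(years, seed=42)

# Same seed, same split: any model comparison built on it is reproducible.
print(test_a == test_b)
```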

### Model Performance
This plot clearly shows that LogisticRegression performs more consistently than LinearRegression. On average it has better cross validation scores and adjusted R2 scores. The plot also shows the LinearRegression anomalies of negative values indicating a possible failure to calculate correctly.

<img src="./images/model_performance.jpg" alt="model performance" style="width:1000px;float:left;">

### Combined Models Investigation
Some preliminary work was done on the sklearn VotingClassifier class to see if it was possible to combine more than one of the 4 evaluated models to achieve a better result. However, no matter which 2 of the 4 models were passed in as estimators to VotingClassifier, it always resulted in the following error:

'ValueError: Can't handle mix of multiclass and continuous'

Hence this investigation was terminated.
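
One reading of this error is that VotingClassifier expects class labels, while the regressors produce continuous predictions. A possible alternative for continuous targets is to average the predictions of several regressors; scikit-learn 0.21 and later provide `VotingRegressor` for this. The sketch below illustrates the averaging idea in plain Python with two hypothetical toy models (`fit_mean`, `fit_line`) and made-up data. It was not run as part of this project.

```python
def fit_mean(xs, ys):
    """Baseline regressor: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x for one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def average_ensemble(models):
    """Combine regressors by averaging their continuous predictions."""
    return lambda x: sum(model(x) for model in models) / len(models)


# Made-up training data: one feature and a continuous target.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [60.0, 62.0, 64.0, 66.0]

ensemble = average_ensemble([fit_mean(xs, ys), fit_line(xs, ys)])
prediction = ensemble(4.0)
print(prediction)
```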

## <span style='color:#7D6115'>12 Recommendations for Stakeholders</span> <a name='stakeholders' />

### World Bank Special Groupings
For aid agencies, policy makers and other interested parties in the disadvantaged regions defined by the World Bank, there are obvious areas they can target to help raise life expectancy:

    primary completion rate, female (% of relevant age group)
    birth rate, crude (per 1,000 people)
    adolescent fertility rate (births per 1,000 women ages 15-19)
    
The present data shows that the primary completion rate of females rises as life expectancy goes up, while the crude birth rate and adolescent fertility rate go down. Although it can't be definitively proven that programs to help raise the primary completion rate of females will raise life expectancy, it is an area that deserves priority consideration, not least because of all the other benefits that will flow. Similarly, programs that address lowering the crude birth rate and adolescent fertility rate should also be given priority consideration.

For aid agencies, governments and other interested parties there are many things to consider given the wide range of country groupings and varied indicators.

### Country Groupings
For aid agencies, policy makers and other interested parties, the most obvious area to address is the adolescent fertility rate. Although the pattern is not entirely uniform, a lower adolescent fertility rate appears to have a positive effect on life expectancy. Research into, and implementation of, programs designed to lower the adolescent fertility rate through education, contraception and other factors should be considered.

The next most common indicators, which appeared in 8 and 7 groupings, were: age dependency ratio (% of working-age population); rural population; unemployment, male (% of male labor force); age dependency ratio, old; and school enrollment, primary, female (% gross). These are more complex issues to deal with, so more research is required to break them down into smaller, more manageable areas of study.

Similarly, the indicators that appeared in 6 groupings were: rural population growth (annual %); age dependency ratio, young; gni per capita, atlas method (current us$); and immunization, measles (% of children ages 12-23 months). These are also complex issues, with the exception of immunizing for measles, which can easily be addressed. For the others, perhaps resources could be directed into ways of encouraging public and private investment in structures that help reduce age dependency, such as child care, aged care facilities and greater assistance for at-home care.