<a href="https://colab.research.google.com/github/Jagroop-Dev/Xistence-Engine-WHO/blob/main/WHO_ModelTesting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="display: flex;">
  <img src="https://github.com/Jagroop-Dev/Xistence-Engine-WHO/blob/main/WHOl1.png?raw=true"
       width="300px"
       height="300px"
       style="margin-right: 50px;" />
  
  <img src="https://github.com/Jagroop-Dev/Xistence-Engine-WHO/blob/main/WHO%20LOGO.png?raw=true"
       width="300px"
       height="300px" />
</div>

# Team 5: Xistence Engine Exploratory Data Analysis
        
### By Jagroop Singh, Graciela Diwa and Joachim Boyden

### imports

In [64]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

import statsmodels.api as sm
import statsmodels.tools.eval_measures  # rmse helper, accessed below as sm.tools.eval_measures.rmse

In [65]:
X_test_fe = pd.read_csv('https://raw.githubusercontent.com/Jagroop-Dev/Xistence-Engine-WHO/refs/heads/main/X_test_fe.csv')
X_train_fe = pd.read_csv('https://raw.githubusercontent.com/Jagroop-Dev/Xistence-Engine-WHO/refs/heads/main/X_train_fe.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/Jagroop-Dev/Xistence-Engine-WHO/refs/heads/main/y_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/Jagroop-Dev/Xistence-Engine-WHO/refs/heads/main/y_train.csv')

## Full Model

In [66]:
full_feature_cols = ['const', 'Year', 'Under_five_deaths', 'Infant_deaths',
       'Adult_mortality', 'Alcohol_consumption', 'Hepatitis_B',
       'BMI', 'GDP_per_capita',
       'Population_mln', 'Schooling',
       'Economy_status_Developed', 'Region_Asia',
       'Region_Central America and Caribbean', 'Region_European Union',
       'Region_Middle East', 'Region_North America', 'Region_Oceania',
       'Region_Rest of Europe', 'Region_South America', 'Polio']

## Feature Selection

* **Economy_status_Developing**: This had a correlation of 1 (in magnitude) with Economy_status_Developed, so dropping this feature loses almost no signal and reduces multicollinearity. This change alone often took the condition number from an eight-digit number to a two-digit number.

* **Country**: We dropped Country because, paired with Year, it acts as a pseudo-index: each country-year pair is unique.

* **Stepwise approach**: We utilised a stepwise function to include only features with a p-value below 0.05, which produced the remaining feature selection.
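
The effect described for the two economy-status dummies can be reproduced on toy data. The sketch below is illustrative only: the values are synthetic, the column names are simply borrowed from the dataset, and `np.linalg.cond` on the design matrix is assumed to be a reasonable stand-in for the condition number that statsmodels reports.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the real columns (values are made up)
developed = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "Economy_status_Developed": developed,
    "Economy_status_Developing": 1 - developed,  # exact complement of the column above
    "Schooling": rng.normal(size=n),
})

# Correlation between the two dummy columns
corr = toy["Economy_status_Developed"].corr(toy["Economy_status_Developing"])

# Design matrices with a constant column, with and without the redundant dummy
X_both = toy.assign(const=1.0)
X_one = X_both.drop(columns="Economy_status_Developing")

cond_both = np.linalg.cond(X_both.to_numpy(dtype=float))
cond_one = np.linalg.cond(X_one.to_numpy(dtype=float))

print(f"correlation between dummies:        {corr:.2f}")
print(f"condition number with both dummies: {cond_both:.3g}")
print(f"condition number with one dummy:    {cond_one:.3g}")
```

With both dummies and a constant present, the columns are linearly dependent, so the first condition number is extremely large; after dropping one dummy it stays small.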

## Model Results

Running this feature selection on the data, we got these metrics:
* R-squared:        0.98
* AIC:              7282
* BIC:              7403
* Condition Number: 27.32
* Train RMSE:       1.18
* Test RMSE:        1.21

The model is slightly overfit: the test RMSE is 2.83% higher than the train RMSE. Since this is under 5%, we have deemed it acceptable. We did experiment with adding regularisation (lasso and ridge), but it increased the overall RMSE for both by over 5% each time.

## Minimal Ethical Model

The best ethical model we found was using the following features:
```python
feature_cols = ['GDP_per_capita', 'Under_five_deaths', 'Adult_mortality', 'const',
                'Economy_status_Developed', 'Population_mln', 'Infant_deaths', 'Year',
                'Region_Asia', 'Region_Central America and Caribbean', 'Region_European Union',
                'Region_Middle East', 'Region_North America', 'Region_Oceania', 'Region_Rest of Europe',
                'Region_South America']
```
> Resulted in 3.57% difference between Train RMSE (1.188) and Test RMSE (1.230)

The primary principle guiding feature selection was whether a feature posed a potential risk to user privacy. Stepwise feature selection was also applied to find the best combination of features and to reduce multicollinearity.

**INCLUDED**
* `GDP_per_capita, Year, Population_mln` - publicly available and non identifiable
* `Under_five_deaths, Adult_mortality, Infant_deaths` - Although medical, this is aggregated data, not individual level, and provides vital signals for the model
* `Economy_status_Developed` - Binary classification, indicator of overall resources available to region
* `Region_` - Regional context to health and life expectancy

**EXCLUDED**
* `Alcohol_consumption` - highly sensitive and could bring reputational harm to individuals or lead to discrimination
* `Hepatitis_B, Measles, Polio, Diphtheria, Incidents_HIV` - private and medical information, highly sensitive. May potentially lead to reputational harm or discrimination
* `Schooling` - Historically valuable indicator but can reinforce systemic inequalities, signals similar information to GDP
* `Thinness_ten_nineteen_years, Thinness_five_nine_years` - private and medical; this data is also derived from children, who may not be able to fully consent to the collection or usage of their data
* `BMI` - private and medical, and widely regarded as inaccurate due to limitations across race, age and sex. Its inclusion also has only a marginal impact on model performance.


Further reasoning:\
Privacy and data sensitivity were considered. Medical information, such as that which would be protected under EU GDPR and UK GDPR, was considered for exclusion. Data that could be considered under Protected Characteristics was also excluded, as was data that could result in damage to reputation.

The aim of this feature selection was to minimise data used whilst aiming to return a good prediction.

### Ethical Model Results
>* R-squared:          0.98
>* AIC:                7324.64
>* BIC:                7416.42
>* Condition Number:   20.48
>* Train_RMSE:         1.19
>* Test_RMSE:          1.23

>* RMSE % difference:  3.57

### Full Model Results
* R-squared:        0.98
* AIC:              7282
* BIC:              7403
* Condition Number: 27.32
* Train RMSE:       1.18
* Test RMSE:        1.21



## Running the Ethical Model

In [57]:
# Fitting the ethical model
# feature selection
ethical_feature_cols = ['GDP_per_capita', 'Under_five_deaths', 'Adult_mortality', 'const',
                'Economy_status_Developed', 'Population_mln', 'Infant_deaths', 'Year',
                'Region_Asia', 'Region_Central America and Caribbean', 'Region_European Union',
                'Region_Middle East', 'Region_North America', 'Region_Oceania', 'Region_Rest of Europe',
                'Region_South America']



# initialising and fitting the model
ethical_lin_reg = sm.OLS(y_train, X_train_fe[ethical_feature_cols])
ethical_results = ethical_lin_reg.fit()
# ethical_results.summary()


# printing results for train and test
print('Ethical Model Results')
print(">* R-squared:         ", round(ethical_results.rsquared, 2))
print(">* AIC:               ", round(ethical_results.aic, 2))
print(">* BIC:               ", round(ethical_results.bic, 2))
print(">* Condition Number:  ", round(ethical_results.condition_number, 2))

ethical_y_pred_train = ethical_results.predict(X_train_fe[ethical_feature_cols])
ethical_rmse_train = sm.tools.eval_measures.rmse(y_train.values.flatten(), ethical_y_pred_train) # flatten if using X y csv
print(f'>* Train_RMSE:         {round(ethical_rmse_train, 2)}')

# running on test set
ethical_X_test_fe = X_test_fe[ethical_feature_cols]
ethical_y_test_pred = ethical_results.predict(ethical_X_test_fe)
ethical_rmse_test = sm.tools.eval_measures.rmse(y_test.values.flatten(), ethical_y_test_pred) # flatten if using X y csv
print(f'>* Test_RMSE:          {round(ethical_rmse_test, 2)}\n')

# RMSE percentage difference
percent_diff = 100 * abs(ethical_rmse_train - ethical_rmse_test) / ethical_rmse_train
print(f'>* RMSE % difference:  {round(percent_diff, 2)}')


Ethical Model Results
>* R-squared:          0.98
>* AIC:                7324.64
>* BIC:                7416.42
>* Condition Number:   20.48
>* Train_RMSE:         1.19
>* Test_RMSE:          1.23

>* RMSE % difference:  3.57


In [58]:
# Fitting the full model


# initialising and fitting the model
full_lin_reg = sm.OLS(y_train, X_train_fe[full_feature_cols])
full_results = full_lin_reg.fit()
# full_results.summary()


# printing results for train and test
print('full Model Results')
print(">* R-squared:         ", round(full_results.rsquared, 2))
print(">* AIC:               ", round(full_results.aic, 2))
print(">* BIC:               ", round(full_results.bic, 2))
print(">* Condition Number:  ", round(full_results.condition_number, 2))

full_y_pred_train = full_results.predict(X_train_fe[full_feature_cols])
full_rmse_train = sm.tools.eval_measures.rmse(y_train.values.flatten(), full_y_pred_train) # flatten if using X y csv
print(f'>* Train_RMSE:         {round(full_rmse_train, 2)}')

# running on test set
full_X_test_fe = X_test_fe[full_feature_cols]
full_y_test_pred = full_results.predict(full_X_test_fe)
full_rmse_test = sm.tools.eval_measures.rmse(y_test.values.flatten(), full_y_test_pred) # flatten if using X y csv
print(f'>* Test_RMSE:          {round(full_rmse_test, 2)}\n')

# RMSE percentage difference
percent_diff = 100 * abs(full_rmse_train - full_rmse_test) / full_rmse_train
print(f'>* RMSE % difference:  {round(percent_diff, 2)}')


full Model Results
>* R-squared:          0.98
>* AIC:                7282.74
>* BIC:                7403.21
>* Condition Number:   27.32
>* Train_RMSE:         1.18
>* Test_RMSE:          1.21

>* RMSE % difference:  2.83


# APPENDIX: Ethical model feature selection trials



## V0
>* R-squared:        0.9843535949579584
>* AIC:              7296.568331791556
>* BIC:              7445.723667535347
>* Condition Number: 1.1570115211473046e+16
>* RMSE:             1.1760419395325783

Population scaled, all original features
>* R-squared:        0.9846481072654
>* AIC:              7253.033812296681
>* BIC:              7402.189148040472
>* Condition Number: 1.1439814127157576e+16
>* RMSE:             1.1649210390771898

VIF Rec
>* R-squared:        0.9077510161357867
>* AIC:              20512.569565352685
>* BIC:              20621.567695319303
>* Condition Number: 61.15960669340647
>* RMSE:             21.105637864076023


```
# all the features to start with
feature_cols = ['const', 'Year', 'Infant_deaths', 'Under_five_deaths',
       'Adult_mortality', 'Alcohol_consumption', 'Hepatitis_B', 'Measles',
       'BMI', 'Polio', 'Diphtheria', 'Incidents_HIV', 'GDP_per_capita',
       'Population_mln', 'Thinness_ten_nineteen_years',
       'Thinness_five_nine_years', 'Schooling', 'Economy_status_Developed',
       'Economy_status_Developing', 'Region_Asia',
       'Region_Central America and Caribbean', 'Region_European Union',
       'Region_Middle East', 'Region_North America', 'Region_Oceania',
       'Region_Rest of Europe', 'Region_South America']
```


## V1

>* R-squared:        0.980020292649429
>* AIC:              7840.662722174737
>* BIC:              7943.924108458899
>* Condition Number: 1.5866767331111436e+16
>* RMSE:             1.3289544221444312

VIF rec
>* R-squared:        0.870998911499458
>* AIC:              12103.61533626432
>* BIC:              12178.193004136216
>* Condition Number: 30.503057505903552
>* RMSE:             3.376853444925223

Stepwise rec
>* R-squared:        0.9799646775857649
>* AIC:              7837.0310389619635
>* BIC:              7911.608706833859
>* Condition Number: 2.5294714076884115e+17
>* RMSE:             1.3308027605981834


* Removed Region, for similar reasons to Country

```
feature_cols = ['const', 'Year', 'Infant_deaths', 'Under_five_deaths',
       'Adult_mortality', 'Alcohol_consumption', 'Hepatitis_B', 'Measles',
       'BMI', 'Polio', 'Diphtheria', 'Incidents_HIV', 'GDP_per_capita',
       'Population_mln', 'Thinness_ten_nineteen_years',
       'Thinness_five_nine_years', 'Schooling', 'Economy_status_Developed',
       'Economy_status_Developing']
```
## V2 -- okay
>* R-squared:        0.9786602504084859
>* AIC:              7973.53511953705
>* BIC:              8025.165812679132
>* Condition Number: 9.676982049927363e+16
>* RMSE:             1.3734415595456866

VIF rec
>* R-squared:        0.9521672302217338
>* AIC:              9818.69160693097
>* BIC:              9858.848812708144
>* Condition Number: 24.97417024518387
>* RMSE:             2.0562612439672776

Stepwise rec
>* R-squared:        0.9786471744881255
>* AIC:              7972.938498668429
>* BIC:              8018.832448128057
>* Condition Number: 1.454726227865714e+17
>* RMSE:             1.3738622829033809

Stepwise, population scaled
>* R-squared:        0.9794466774001356
>* AIC:              7887.510589615549
>* BIC:              7939.1412827576305
>* Condition Number: 5965290688594460.0
>* RMSE:             1.3478964972383716

* Also excluding countries and regions
* Removed health-related features (things you don't have to share on a job application):
  `Alcohol_consumption`, `Hepatitis_B`, `Measles`, `BMI`, `Polio`, `Diphtheria`, `Incidents_HIV`, `Thinness_ten_nineteen_years`, `Thinness_five_nine_years`

```
feature_cols = ['const', 'Year', 'Infant_deaths', 'Under_five_deaths',
       'Adult_mortality', 'GDP_per_capita',
       'Population_mln', 'Schooling', 'Economy_status_Developed',
       'Economy_status_Developing']
```

## V3 -- best so far = stepwise and population scaled!
>* R-squared:        0.9836129771132165
>* AIC:              7384.52376758664
>* BIC:              7482.048410188349
>* Condition Number: 93.28867972312926
>* RMSE:             1.203553930233103

VIF rec
>* R-squared:        0.9521672302217338
>* AIC:              9818.69160693097
>* BIC:              9858.848812708144
>* Condition Number: 24.97417024518387
>* RMSE:             2.0562612439672776

Stepwise rec
>* R-squared:        0.9830894437665191
>* AIC:              7448.57187311378
>* BIC:              7523.149540985676
>* Condition Number: 6.55056667569356e+16
>* RMSE:             1.2226283777898956

Stepwise, population scaled
>* R-squared:        0.9840218527115269
>* AIC:              7324.6353752837
>* BIC:              7416.423274202956
>* Condition Number: 20.48436381513106
>* RMSE:             1.1884440355523156
>* TEST_RMSE:           1.230880967228096
>
>['GDP_per_capita', 'Under_five_deaths', 'Adult_mortality', 'const', 'Economy_status_Developed',
                'Population_mln', 'Infant_deaths', 'Year', 'Region_Asia',
       'Region_Central America and Caribbean', 'Region_European Union',
       'Region_Middle East', 'Region_North America', 'Region_Oceania',
       'Region_Rest of Europe', 'Region_South America']


* Regions back in, but mostly like V2 (health related removed)
```
feature_cols = ['const', 'Year', 'Infant_deaths', 'Under_five_deaths',
       'Adult_mortality', 'GDP_per_capita',
       'Population_mln', 'Schooling', 'Economy_status_Developed', 'Region_Asia',
       'Region_Central America and Caribbean', 'Region_European Union',
       'Region_Middle East', 'Region_North America', 'Region_Oceania',
       'Region_Rest of Europe', 'Region_South America']
```


## V4 -- not good

>* R-squared:        0.90064265468054
>* AIC:              11509.43995048745
>* BIC:              11595.491105724252
>* Condition Number: 81.05387425970994
>* RMSE:             2.9635722186026223

VIF rec
>* R-squared:        0.8922985444984101
>* AIC:              11692.186645133395
>* BIC:              11772.501056687744
>* Condition Number: 77.86857698012635
>* RMSE:             3.085505405046719

Stepwise rec
>* R-squared:        0.9006259476908346
>* AIC:              11507.825150953626
>* BIC:              11588.139562507975
>* Condition Number: 13.415535373613116
>* RMSE:             2.963821371240409 \



* Removed adult mortality, under-five deaths and economy status (otherwise mostly like V3)
```
feature_cols = ['const', 'Year', 'Infant_deaths', 'GDP_per_capita',
       'Population_mln', 'Schooling', 'Economy_status_Developed', 'Region_Asia',
       'Region_Central America and Caribbean', 'Region_European Union',
       'Region_Middle East', 'Region_North America', 'Region_Oceania',
       'Region_Rest of Europe', 'Region_South America']
```