# FINAL FUNCTION

<div style="display: flex;">
  <img src="https://github.com/Jagroop-Dev/Xistence-Engine-WHO/blob/main/WHOl1.png?raw=true"
       width="300px"
       height="300px"
       style="margin-right: 50px;" />
  
  <img src="https://github.com/Jagroop-Dev/Xistence-Engine-WHO/blob/main/WHO%20LOGO.png?raw=true"
       width="300px"
       height="300px" />
</div>

# Team 5: Xistence Engine Exploratory Data Analysis
        
### By Jagroop Singh, Graciela Diwa and Joachim Boyden

In [None]:
from sklearn.preprocessing import RobustScaler
import numpy as np
import pandas as pd
import statsmodels.api as sm

In [None]:

X_train = pd.read_csv('https://raw.githubusercontent.com/Jagroop-Dev/Xistence-Engine-WHO/refs/heads/main/X_train.csv')

### `valid_input` Function Explanation

This function validates user input. It repeatedly prompts the user until a valid response is received, checking the response against the specified data type, allowed options, and value range. Numeric inputs must also be non-negative.

In [None]:
def valid_input(ui, valid_options=None, data_type=int, min_value=None, max_value=None):
    while True:
        try:
            user_input = input(ui)
            if data_type == int:
                user_input = int(user_input)
            elif data_type == float:
                user_input = float(user_input)

            # Add check for negative values
            if data_type in [int, float] and user_input < 0:
                print("Please enter a non-negative value.")
            elif min_value is not None and user_input < min_value:
                print(f"Please enter a value greater than or equal to {min_value}.")
            elif max_value is not None and user_input > max_value:
                print(f"Please enter a value less than or equal to {max_value}.")
            elif valid_options and user_input not in valid_options:
                print(f"Please choose one of: {list(valid_options)}")
            else:
                return user_input
        except ValueError:
            print(f"Please enter a valid {data_type.__name__}.")
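
The checks that `valid_input` applies can be tried without typing anything at a prompt. The sketch below is a hypothetical, non-interactive helper (`check_value` is not part of the notebook) that applies the same rules in the same order to a raw string and returns either the converted value or the error message.

```python
def check_value(raw, data_type=int, min_value=None, max_value=None, valid_options=None):
    """Return (value, None) if raw passes every check, else (None, message).

    Hypothetical helper that mirrors the order of checks in valid_input.
    """
    try:
        value = data_type(raw)  # int("abc") or float("abc") raises ValueError
    except ValueError:
        return None, f"Please enter a valid {data_type.__name__}."

    if data_type in (int, float) and value < 0:
        return None, "Please enter a non-negative value."
    if min_value is not None and value < min_value:
        return None, f"Please enter a value greater than or equal to {min_value}."
    if max_value is not None and value > max_value:
        return None, f"Please enter a value less than or equal to {max_value}."
    if valid_options and value not in valid_options:
        return None, f"Please choose one of: {list(valid_options)}"
    return value, None


print(check_value("2010", int, min_value=2000, max_value=2024))
print(check_value("1999", int, min_value=2000, max_value=2024))
print(check_value("abc", float))
print(check_value("developed", str, valid_options=["Developed", "Developing"]))
```

The last call is rejected because the option check is case-sensitive, which is also how `valid_input` behaves for the economy status prompt.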

### `ask_questions` Function Explanation

This helper takes a `feature_name` and returns a formatted prompt string asking the user for that feature's value.

In [None]:
def ask_questions(feature_name):
    return f"Please enter your {feature_name}: "

### `ask_consent` Function Explanation

This function asks the user whether they consent to the use of advanced population data, which may be protected information. It prompts with a Y/N question and returns `'advanced_model'` if the answer is `y` (case-insensitive); any other response returns `'ethical_model'`.

In [None]:
def ask_consent():
    consent = input("Do you consent to using advanced population data which may be protected information, for better accuracy?(Y/N)").strip().lower()
    if consent == "y":
        print("You have selected to use the advanced model. ")
        return 'advanced_model'
    else:
        print("You have selected to use the standard model. ")
        return 'ethical_model'

### `encode_region` Function Explanation

This function takes a region name as input and returns the corresponding one-hot encoded column name used in the models. If the region is not recognized, it returns `None`.

In [None]:
def encode_region(region):
    regions = {
        "Middle East": "Region_Middle East",
        "European Union": "Region_European Union",
        "Asia": "Region_Asia",
        "North America": "Region_North America",
        "Central America and Caribbean": "Region_Central America and Caribbean",
        "South America": "Region_South America",
        "Rest of Europe": "Region_Rest of Europe",
        "Africa": "Region_Africa",
        "Oceania": "Region_Oceania"
    }
    return regions.get(region, None)
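
To show how a single encoded column name becomes a full one-hot row, here is a minimal sketch. The helper `one_hot_region` is hypothetical; it only illustrates the idea that the selected region's column is set to 1 and every other region column to 0, which is what `ask_features` does with the value returned by `encode_region`.

```python
# Region names as listed in encode_region.
REGIONS = [
    "Middle East", "European Union", "Asia", "North America",
    "Central America and Caribbean", "South America",
    "Rest of Europe", "Africa", "Oceania",
]


def one_hot_region(region):
    """Return a dict of Region_* columns with a single 1, or {} if unknown."""
    if region not in REGIONS:
        return {}
    return {f"Region_{name}": int(name == region) for name in REGIONS}


row = one_hot_region("Asia")
print(row["Region_Asia"], row["Region_Africa"])  # 1 0
print(one_hot_region("Atlantis"))                # {}
```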

### `ask_features` Function Explanation

This function is responsible for prompting the user to enter the values for the features required by the chosen model (ethical or advanced). It uses the `valid_input` function to ensure the inputs are of the correct data type and within valid ranges where specified. It also handles the encoding of the economy status and region based on user input.

In [None]:
def ask_features(model_choice, user_answers):
    ethical_model_features = [
        "Economy status (Developed or Developing)",
        "GDP per Capita", "Under five deaths", "Adult mortality",
        "Population (millions)", "Infant deaths", "Year"
    ]

    advanced_model_features = [
        "Alcohol consumption", "Hepatitis B", "BMI", "Polio", "Diphtheria",
        "Incidents HIV", "Thinness ten nineteen years", "Thinness five nine years", "Schooling"
    ]

    user_inputs = {}

    feature_prompts = {
        "Economy status (Developed or Developing)": "Please enter the Economy status (Developed or Developing): ",
        "GDP per Capita": "Please enter the Gross Domestic Product per capita (in USD): ",
        "Under five deaths": "Please enter the Number of under-five deaths per 1000 population: ",
        "Adult mortality": "Please enter the Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population): ",
        "Population (millions)": "Please enter the Population of the country (in millions): ",
        "Infant deaths": "Please enter the Number of Infant Deaths per 1000 population: ",
        "Year": "Please enter the Year (between 2000 and 2024): ",
        "Alcohol consumption": "Please enter the Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol): ",
        "Hepatitis B": "Please enter the Hepatitis B (HepB) immunization coverage among 1-year-olds (%): ",
        "BMI": "Please enter the Average Body Mass Index of entire population: ",
        "Polio": "Please enter the Polio (Pol3) immunization coverage among 1-year-olds (%): ",
        "Diphtheria": "Please enter the Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%): ",
        "Incidents HIV": "Please enter the Deaths per 1 000 live births HIV/AIDS (0-4 years): ",
        "Thinness ten nineteen years": "Please enter the Prevalence of thinness among children and adolescents for Age 10 to 19 (%): ",
        "Thinness five nine years": "Please enter the Prevalence of thinness among children for Age 5 to 9(%): ",
        "Schooling": "Please enter the Number of years of Schooling(years): "
    }


    for feature in ethical_model_features:
        prompt = feature_prompts.get(feature, f"Please enter your {feature}: ")
        if feature == "Economy status (Developed or Developing)":
            economy_status = valid_input(
                prompt,
                valid_options=['Developed', 'Developing'],
                data_type=str
            )
            user_inputs['Economy_status_Developed'] = 1 if economy_status == "Developed" else 0

            user_inputs['Economy_status_Developing'] = 1 if economy_status == "Developing" else 0
        elif feature == "GDP per Capita":
            user_inputs['GDP_per_capita'] = valid_input(prompt, data_type=float)
        elif feature == "Year":
            user_inputs[feature] = valid_input(prompt, data_type=int, min_value=2000, max_value=2024) # Changed max_value to 2024
        else:
            user_inputs[feature] = valid_input(prompt, data_type=float)


    if model_choice == 'advanced_model':
        for feature in advanced_model_features:
            prompt = feature_prompts.get(feature, f"Please enter your {feature}: ")
            # Store each answer under its underscore-separated column name,
            # e.g. "Hepatitis B" -> "Hepatitis_B".
            user_inputs[feature.replace(' ', '_')] = valid_input(prompt, data_type=float)


    region = user_answers['Region']
    encoded_region = encode_region(region)
    if encoded_region:
        user_inputs[encoded_region] = 1


        regions_list = [
            "Region_Middle East", "Region_European Union", "Region_Asia",
            "Region_North America", "Region_Central America and Caribbean",
            "Region_South America", "Region_Rest of Europe", "Region_Africa",
            "Region_Oceania"
        ]
        for reg in regions_list:
            if reg != encoded_region:
                user_inputs[reg] = 0
    else:
        print("Invalid region selected!")
        return {}


    if 'Region' in user_inputs:
        del user_inputs['Region']


    return user_inputs

### `make_prediction` Function Explanation

This function takes a model dictionary (containing feature coefficients) and a DataFrame of input data and calculates the predicted life expectancy.

In [None]:
def make_prediction(model: dict, data: pd.DataFrame) -> float:

    if 'const' not in data.columns:
        data = sm.add_constant(data, has_constant='add')

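    # Weighted sum of coefficients and the first row's values; .iloc[0] selects
    # the first row by position, regardless of the DataFrame's index labels.
    prediction = sum(model[feature] * data[feature].iloc[0] for feature in model if feature in data.columns)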

    return prediction
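
The prediction is a plain linear combination: each coefficient is multiplied by the matching input value and the products are summed, with the `const` column (always 1) contributing the intercept. A minimal sketch using plain dictionaries and made-up coefficients (the names `linear_predict`, `toy_model`, and `toy_row` are illustrative only):

```python
def linear_predict(coefficients, row):
    """Sum coefficient * value over features present in both dicts."""
    return sum(coef * row[name] for name, coef in coefficients.items() if name in row)


# Made-up numbers, chosen so the arithmetic is easy to follow.
toy_model = {"const": 70.0, "x1": 2.0, "x2": -1.5}
toy_row = {"const": 1.0, "x1": 0.5, "x2": 2.0, "unused": 9.0}

# 70.0 * 1.0 + 2.0 * 0.5 + (-1.5) * 2.0 = 68.0; "unused" has no coefficient and is ignored.
print(linear_predict(toy_model, toy_row))
```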

### `robust_scale` Function Explanation

This function fits a `RobustScaler` on the numeric columns of the training data (`X_train`) and then transforms the `input_data`, returning the scaled values as a dictionary.

In [None]:
def robust_scale(train: pd.DataFrame, input_data: dict) -> dict:
    rob = RobustScaler()


    numeric_cols = train.select_dtypes(include=['number']).columns
    scaler = rob.fit(train[numeric_cols])


    input_df = pd.DataFrame([input_data], columns=numeric_cols)


    scaled_input = scaler.transform(input_df)


    scaled_dict = pd.DataFrame(scaled_input, columns=numeric_cols).to_dict(orient='records')[0]

    return scaled_dict
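
With its default settings, `RobustScaler` subtracts the training median from each value and divides by the interquartile range (75th minus 25th percentile). The sketch below reproduces that arithmetic for a single column using only the standard library. It assumes that `statistics.quantiles` with `method="inclusive"` gives the same quartiles as the scaler's percentile computation, so treat it as an illustration rather than a replacement for `robust_scale`.

```python
import statistics


def robust_scale_value(train_values, x):
    """Scale x as (x - median) / IQR, with both statistics taken from train_values."""
    q1, _, q3 = statistics.quantiles(train_values, n=4, method="inclusive")
    median = statistics.median(train_values)
    return (x - median) / (q3 - q1)


train_column = [1, 2, 3, 4, 5]  # median 3, quartiles 2 and 4, IQR 2

print(robust_scale_value(train_column, 3))  # 0.0, the median maps to zero
print(robust_scale_value(train_column, 5))  # 1.0, one IQR above the median
```

Because the median and IQR come from the training data only, a user's input is scaled on the same basis as the data the coefficients were fitted on.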

### `feature_eng` Function Explanation

This function performs feature engineering on the user's input data. It applies a log transform, `log(x + 1)`, to `GDP_per_capita` and `Population_mln`, then scales the six numerical features shared by both models using the `robust_scale` function. Finally, it adds the constant column that the prediction models expect.

In [None]:
def feature_eng(df):
    print(f"Original user inputs: {df}")


    df_processed = df.copy()


    df_processed['GDP_per_capita'] = np.log(df_processed['GDP_per_capita'] + 1)
    df_processed['Population_mln'] = np.log(df_processed['Population_mln'] + 1)



    numerical_cols_for_scaling = ['GDP_per_capita', 'Under_five_deaths', 'Adult_mortality', 'Population_mln', 'Infant_deaths', 'Year']


    df_numerical_for_scaling = df_processed[numerical_cols_for_scaling]

    scaled_df_values = robust_scale(X_train[numerical_cols_for_scaling], df_numerical_for_scaling.iloc[0].to_dict())



    for col in numerical_cols_for_scaling:
        df_processed[col] = scaled_df_values[col]

    for col in df.columns:
        if col not in df_processed.columns:
            df_processed[col] = df[col]



    df_processed = sm.add_constant(df_processed, has_constant='add')



    return df_processed
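
The log step in `feature_eng` uses `log(x + 1)` rather than `log(x)`, which keeps an input of 0 well defined. A small standard-library sketch of that transform (the function name `log_plus_one` is illustrative):

```python
import math


def log_plus_one(value):
    """Natural log of (value + 1), the transform applied to GDP and population."""
    return math.log(value + 1)


print(log_plus_one(0))        # 0.0, whereas math.log(0) would raise ValueError
print(log_plus_one(1_000))    # about 6.91
print(log_plus_one(50_000))   # about 10.82
```

A fifty-fold difference in GDP per capita shrinks to a difference of about 3.9 on the log scale, which compresses the long right tail before scaling.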

### Model Coefficients

These dictionaries hold the coefficients for the advanced and ethical models, keyed by feature name. The `const` entry is the intercept, and `make_prediction` multiplies each remaining coefficient by the matching input column.

In [None]:
advanced_model = {'const'                  :70.676838,
'Year'                                     :0.257635,
'Under_five_deaths'                       :-2.883196,
'Infant_deaths'                           :-1.764888,
'Adult_mortality'                         :-6.114546,
'Alcohol_consumption'                     :-0.207149,
'Hepatitis_B'                             :-0.142623,
'BMI'                                     :-0.377977,
'GDP_per_capita'                           :1.084308,
'Population_mln'                           :0.279958,
'Schooling'                                :0.415376,
'Economy_status_Developing'               :-2.581425,
'Region_Asia'                              :0.251088,
'Region_Central America and Caribbean'     :1.994919,
'Region_European Union'                   :-0.606132,
'Region_Middle East'                       :0.071231,
'Region_North America'                    :0.242322,
'Region_Oceania'                          :-0.578854,
'Region_Rest of Europe'                    :0.552260,
'Region_South America'                     :1.590394,
'Polio'                                    :0.140878}

In [None]:
ethical_model = {'GDP_per_capita'                      : 1.0343,
                 'Under_five_deaths'                   : -2.9628,
                 'Adult_mortality'                     : -6.0962,
                 'const'                               : 70.7061,
                 'Economy_status_Developed'            : 2.8228,
                 'Population_mln'                      : 0.3300,
                 'Infant_deaths'                       : -1.7405,
                 'Year'                                : 0.2556,
                 'Region_Asia'                         : 0.4594,
                 'Region_Central America and Caribbean': 1.8377,
                 'Region_European Union'               : -0.8370,
                 'Region_Middle East'                  : -0.0440,
                 'Region_North America'                : 0.0686,
                 'Region_Oceania'                      : -0.7979,
                 'Region_Rest of Europe'               : 0.5545,
                 'Region_South America'                : 1.4445}
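
As a worked example of reading these coefficients, the sketch below copies three values from `ethical_model` and computes the prediction when every scaled numeric input is 0, which is the value robust scaling assigns to the training median. `Region_Africa` has no entry in either dictionary, so `make_prediction` skips it and Africa appears to act as the baseline region; that interpretation is an assumption based on the code, not something the notebook states.

```python
# Values copied from ethical_model above.
const = 70.7061
economy_status_developed = 2.8228
region_asia = 0.4594

# Developing country in Africa: both dummy terms are 0, so only the intercept remains.
baseline = const

# Developed country in Asia: add the two dummy coefficients to the intercept.
developed_asia = const + economy_status_developed * 1 + region_asia * 1

print(round(baseline, 4))        # 70.7061
print(round(developed_asia, 4))  # 73.9883
```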

### `standardize_feature_names` Function Explanation

This function standardizes the feature names from the user input (which might have spaces or different capitalization) to match the format used in the model dictionaries and the DataFrame columns.

In [None]:
def standardize_feature_names(user_inputs):

    standardized_inputs = {}

    feature_map = {
        "GDP per Capita": "GDP_per_capita",
        "Under five deaths": "Under_five_deaths",
        "Adult mortality": "Adult_mortality",
        "Population (millions)": "Population_mln",
        "Infant deaths": "Infant_deaths",
        "Economy status (Developed or Developing)": "Economy_status_Developed",
        "Year": "Year",
        "Region": "Region",
        "BMI": "BMI"
    }

    for feature, value in user_inputs.items():

        standardized_inputs[feature_map.get(feature, feature)] = value

    return standardized_inputs

### `life_expectancy_predictor` Function Explanation

This is the main function. It ties the helper functions together: it collects the user's input, processes it, and predicts life expectancy using either the ethical or the advanced model.

In [None]:
def life_expectancy_predictor():
    user_answers = {}

    model_choice = ask_consent()
    print("Model choice:", model_choice)

    print("Please select your region by choosing the correct number:\n")
    regions = {
        1: "Middle East",
        2: "European Union",
        3: "Asia",
        4: "North America",
        5: "Central America and Caribbean",
        6: "South America",
        7: "Rest of Europe",
        8: "Africa",
        9: "Oceania"
    }

    for key, value in regions.items():
        print(f"{key}. {value}")

    user_region = valid_input(
        "Enter your choice: ",
        valid_options=regions.keys(),
        data_type=int
    )


    user_answers['Region'] = regions[user_region]


    user_inputs = ask_features(model_choice, user_answers)




    standardized_user_inputs = standardize_feature_names(user_inputs)



    df = pd.DataFrame([standardized_user_inputs])



    engineered_user_inputs = feature_eng(df)


    if model_choice == 'advanced_model':
        prediction = make_prediction(advanced_model,engineered_user_inputs)
    else:
        prediction = make_prediction(ethical_model, engineered_user_inputs)


    print(f"Predicted Life Expectancy: {prediction:.2f} years")

In [None]:
life_expectancy_predictor()

Do you consent to using advanced population data which may be protected information, for better accuracy?(Y/N)n
You have selected to use the standard model. 
Model choice: ethical_model
Please select your region by choosing the correct number:

1. Middle East
2. European Union
3. Asia
4. North America
5. Central America and Caribbean
6. South America
7. Rest of Europe
8. Africa
9. Oceania
Enter your choice: 1
Please enter the Economy status (Developed or Developing): Developed


KeyboardInterrupt: Interrupted by user