# **Regression: Covid-19 Risk Prediction Notebook**

## Objectives

Fit and evaluate a regression model to predict the risk level for a Covid-19 patient based on their age and pre-existing health conditions.

## Inputs

* outputs/datasets/cleaned/TrainSetCleaned.csv
* outputs/datasets/cleaned/TestSetCleaned.csv

## Outputs

* Train set (features and target)
* Test set (features and target)
* ML pipeline to predict the Covid-19 risk level
* Labels map
* Feature Importance Plot 

---

# Change working directory

* We assume the notebooks are stored in a subfolder, so the working directory must be changed when running a notebook in the editor.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/milestone-covid-19-study/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/milestone-covid-19-study'

---

## Load Data

In [5]:
import numpy as np
import pandas as pd

df = (pd.read_csv("outputs/datasets/cleaned/TrainSetCleaned.csv")
      .drop(labels=['DIED'], axis=1)
     )

print(df.shape)
df.head(3)

(49788, 15)


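| Unnamed: 0 | SEX | INTUBED | PNEUMONIA | AGE | DIABETES | COPD | ASTHMA | INMSUPR | HIPERTENSION | OTHER_DISEASE | CARDIOVASCULAR | OBESITY | RENAL_CHRONIC | TOBACCO | ICU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | No | Yes | 47 | Yes | No | No | No | No | No | No | Yes | No | No | No |
| 1 | Male | No | Yes | 35 | No | No | No | No | No | No | No | No | No | No | Yes |
| 2 | Male | No | Yes | 37 | Yes | No | No | No | No | No | No | Yes | No | No | No |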


---

## ML Pipeline: Regressor

### Create ML pipeline

In [6]:
from sklearn.pipeline import Pipeline

# Feature Engineering
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection

# Feature Scaling
from sklearn.preprocessing import StandardScaler

# Feature Selection
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor

def PipelineOptimization(model):
    pipeline_base = Pipeline([
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=['SEX', 'INTUBED', 'PNEUMONIA', 'DIABETES', 'COPD', 
                                                                'ASTHMA', 'INMSUPR', 'HIPERTENSION', 'OTHER_DISEASE', 
                                                                'CARDIOVASCULAR', 'OBESITY', 'RENAL_CHRONIC', 
                                                                'TOBACCO', 'ICU'])),
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.6, selection_method="variance")),
        ("feat_scaling", StandardScaler()),
        ("feat_selection", SelectFromModel(model)),
        ("model", model),
    ])

    return pipeline_base

ModuleNotFoundError: No module named 'feature_engine'

Custom Class for hyperparameter optimisation

In [None]:
from sklearn.model_selection import GridSearchCV

class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOptimization(self.models[key])
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
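
The reshaping done inside `score_summary` can be hard to picture, so here is a minimal sketch of that step in isolation. The `cv_results` dictionary is a hand-made stand-in with made-up scores (2 candidates, 3 folds); it only mimics the `params` and `split{i}_test_score` keys that the method reads from `GridSearchCV.cv_results_`.

```python
import numpy as np

# Hypothetical stand-in for GridSearchCV.cv_results_ (scores are made up)
cv_results = {
    "params": [{"model__max_depth": 3}, {"model__max_depth": 10}],
    "split0_test_score": np.array([0.60, 0.50]),
    "split1_test_score": np.array([0.70, 0.40]),
    "split2_test_score": np.array([0.80, 0.60]),
}
n_splits = 3
n_candidates = len(cv_results["params"])

# Each split array holds one score per candidate; stacking them side by side
# gives one row per candidate and one column per fold
all_scores = np.hstack([
    cv_results[f"split{i}_test_score"].reshape(n_candidates, 1)
    for i in range(n_splits)
])

# Per-candidate summary, as in the row() helper
for params, scores in zip(cv_results["params"], all_scores):
    print(params, "min:", scores.min(), "mean:", round(float(scores.mean()), 3),
          "max:", scores.max())
```

Each row of `all_scores` then feeds the min, max, mean and standard deviation reported for one hyperparameter combination.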

### Split Train Test Set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['RISK_LEVEL'], axis=1),
    df['RISK_LEVEL'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)

### Grid Search CV - Sklearn

#### Use default hyperparameters to find most suitable algorithm

In [None]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}

Hyperparameter optimisation search using default hyperparameters

In [None]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Check Results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

#### Do an extensive search on the most suitable model to find the best hyperparameter configuration.

Define model and parameters, for Extensive Search

In [None]:
models_search = {
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

params_search = {
    "GradientBoostingRegressor": {
        'model__n_estimators': [100, 300],
        'model__learning_rate': [1e-1, 1e-2, 1e-3], 
        'model__max_depth': [3, 10, None],
    }
}
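
Before launching the extensive search, it can help to estimate its cost. The sketch below counts the combinations in the grid above using only the standard library; `cv_folds = 5` is taken from the `cv=5` argument used for the searches in this notebook.

```python
from itertools import product

# Grid copied from params_search above
param_grid = {
    'model__n_estimators': [100, 300],
    'model__learning_rate': [1e-1, 1e-2, 1e-3],
    'model__max_depth': [3, 10, None],
}
cv_folds = 5  # same value passed as cv to search.fit

# Every combination of hyperparameter values is one candidate pipeline
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

`GridSearchCV` refits the best candidate on the whole train set by default, so one additional fit follows the cross-validated ones.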

Extensive GridSearch CV

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Check Results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
print(grid_search_summary)

Check the best model

In [None]:
best_model = grid_search_summary.iloc[0, 0]
print(best_model)

Parameters for best model

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
print(best_parameters)

Define the best regressor, based on search

In [None]:
best_regressor_pipeline = grid_search_pipelines[best_model].best_estimator_
print(best_regressor_pipeline)

### Assess feature importance

### Evaluate on Train and Test Sets

Evaluate Performance
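
This cell is empty in the notebook. As a rough sketch of what such an evaluation could look like, the example below reports R2, MAE and RMSE for a train and a test split. The helper name `regression_evaluation`, the synthetic data and the standalone `GradientBoostingRegressor` are assumptions for illustration; in the notebook the fitted `best_regressor_pipeline` would take the model's place.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def regression_evaluation(X, y, pipeline):
    """Return R2, MAE and RMSE of the pipeline's predictions on X."""
    prediction = pipeline.predict(X)
    return {
        "r2": float(r2_score(y, prediction)),
        "mae": float(mean_absolute_error(y, prediction)),
        "rmse": float(np.sqrt(mean_squared_error(y, prediction))),
    }


# Synthetic stand-in data, not the Covid-19 dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Hypothetical model standing in for the tuned pipeline
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

train_scores = regression_evaluation(X_train, y_train, model)
test_scores = regression_evaluation(X_test, y_test, model)
print("Train:", train_scores)
print("Test:", test_scores)
```

Comparing the two dictionaries side by side makes a gap between train and test performance, a sign of overfitting, easy to spot.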

### Regressor with PCA

Let's explore potential values for PCA n_components.

Apply PCA separately to the scaled data
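
The notebook does not show this step, so the following is only a sketch of one common way to explore `n_components`: fit PCA with all components on scaled data and inspect the cumulative explained variance. The random data below is a placeholder; in the notebook the input would be the encoded and scaled train set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data: 300 rows, 6 features, two of them strongly correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=300)

# PCA is sensitive to feature scale, so scale first
X_scaled = StandardScaler().fit_transform(X)

# Keep every component so the full variance profile is visible
pca = PCA(n_components=X_scaled.shape[1]).fit(X_scaled)

explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)
for i, (ratio, total) in enumerate(zip(explained, cumulative), start=1):
    print(f"Component {i}: {ratio:.3f} (cumulative {total:.3f})")
```

A reasonable `n_components` is the smallest number of components whose cumulative share reaches whatever variance target is chosen for the project.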

### Rewrite ML Pipeline for Modelling

### Grid Search CV – Sklearn

#### Use standard hyperparameters to find the most suitable model.

Do a quick optimisation search

Check results

#### Do an extensive search on the most suitable model to find the best hyperparameter configuration.

Define model and parameters for extensive search

Extensive GridSearch CV

Check results

Check the best model

Parameters for best model

Define the best regressor

### Evaluate Regressor on Train and Test Sets

## Convert Regression to Classification

### Convert numerical target to bins, and check if it is balanced
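
The cell for this step is not included, so here is a minimal sketch using `pandas.qcut`, which may differ from the discretiser the notebook actually used. The `target` series is randomly generated for illustration, and the choice of three equal-frequency bins is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical target, not the notebook's data
rng = np.random.default_rng(0)
target = pd.Series(rng.uniform(0, 100, size=1000), name="target")

# Three equal-frequency bins; labels=False returns the bin index (0, 1, 2)
# and retbins=True also returns the bin edges
classes, bin_edges = pd.qcut(target, q=3, labels=False, retbins=True)

# Relative frequency per class shows whether the new target is balanced
print(classes.value_counts(normalize=True).sort_index())

# Map each class to the interval of the original target it covers
labels_map = {
    i: f"{bin_edges[i]:.1f} to {bin_edges[i + 1]:.1f}"
    for i in range(len(bin_edges) - 1)
}
print(labels_map)
```

Equal-frequency binning produces balanced classes by construction; equal-width binning with `pandas.cut` would not, which is why the balance check matters.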

### Rewrite ML Pipeline for Modelling

### Load algorithms for classification

### Split Train Test Sets

### Grid Search CV – Sklearn

#### Use standard hyperparameters to find the most suitable model

GridSearch CV

Check results

#### Do an extensive search on the most suitable model to find the best hyperparameter configuration.

Define models and parameters

Extensive GridSearch CV

Check results

Check the best model

Parameters for best model

- We are saving this content for later

Define the best clf pipeline

### Assess feature importance

We can assess feature importance for this model with .feature_importances_
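
As a sketch of how this might look with a pipeline shaped like the one in this notebook: the importances belong to the features that survive the `feat_selection` step, so the selector's mask is used to recover their names. The synthetic data, the column names and the choice of `GradientBoostingClassifier` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical column names
X_array, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X_array, columns=[f"feature_{i}" for i in range(6)])

pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("feat_selection", SelectFromModel(GradientBoostingClassifier(random_state=0))),
    ("model", GradientBoostingClassifier(random_state=0)),
]).fit(X, y)

# Boolean mask of the columns kept by the feature selection step
selected = X.columns[pipeline["feat_selection"].get_support()]

# The model only saw the selected columns, so lengths match
df_importance = (
    pd.DataFrame({
        "Feature": selected,
        "Importance": pipeline["model"].feature_importances_,
    })
    .sort_values("Importance", ascending=False)
    .reset_index(drop=True)
)
print(df_importance)
```

Calling `df_importance.plot(kind="bar", x="Feature", y="Importance")` would give a bar chart of the same table.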

### Evaluate Classifier on Train and Test Sets

Custom Function

List that relates each class to its target interval

We can create it manually

### Which pipeline to choose?

We fitted 3 pipelines:

- Regression
- Regression with PCA
- Classifier

The regressor pipelines didn't reach the expected performance threshold (0.7 R2 score) on the train and test sets.

The classifier was tuned on Recall for class 0 (tenure <4 months), since we are interested in detecting prospects that may churn soon.

- It has reasonable performance for class 0 (<4 months) and class 2 (+20 months)
- Class 1 (4 to 20 months) has weak performance.

### Refit pipeline with best features

#### Rewrite Pipeline

### Split Train Test Set, only with best features

Subset Best Features

#### Grid Search CV – Sklearn

We are using the same model from the previous GridSearchCV search

And the best parameters from the previous GridSearchCV search

You will need to type these in manually, since each hyperparameter value has to be given as a list. The previous dictionary is not in this format.
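
As an alternative to typing the values by hand, each value can be wrapped in a single-item list. This is only a sketch: the `best_parameters` dictionary and the `best_model_name` key below are made-up placeholders, not results from this notebook.

```python
# Hypothetical example of a best_params_ dictionary (values are illustrative)
best_parameters = {
    'model__learning_rate': 0.1,
    'model__max_depth': 3,
    'model__n_estimators': 100,
}

# Placeholder for the name of the model chosen in the previous search
best_model_name = "BestModel"

# GridSearchCV expects every hyperparameter to map to a list of candidates,
# so wrap each single best value in a one-item list
params_search = {
    best_model_name: {name: [value] for name, value in best_parameters.items()}
}
print(params_search)
```

With one value per hyperparameter, the grid contains a single candidate, so the search only cross-validates the already chosen configuration on the reduced feature set.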

GridSearch CV

Check results

Check the best model

Define the best clf pipeline

### Assess feature importance

### Evaluate Classifier on Train and Test Sets

### Push files to the repo

We will generate the following files

- Train set
- Test set
- Modelling pipeline
- Labels map
- Feature importance plot

### Train Set: features and target

### Test Set: features and target

### Modelling pipeline

ML pipeline for predicting the risk level
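
The saving cell is not shown, so the following is a sketch of one common approach: persisting a fitted pipeline with `joblib` and loading it back. The tiny pipeline, the toy data and the temporary folder are assumptions for illustration; the notebook would save its own fitted pipeline to a folder of the project's choosing.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy pipeline standing in for the notebook's fitted pipeline
pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", LinearRegression()),
])
pipeline.fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

# Hypothetical location; a temporary folder keeps the sketch self-contained
file_path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
joblib.dump(value=pipeline, filename=file_path)

# Reload and check that the pipeline still predicts
reloaded = joblib.load(file_path)
print(reloaded.predict([[3.0]]))
```

Saving the whole pipeline rather than only the model keeps the preprocessing steps and the estimator together, so new data can be passed in its raw form.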

### List mapping target levels to ranges

Map for converting the numerical variable to a categorical variable

### Feature importance plot