# Loan Prediction classification exercise

We consider the dataset file `dataset.csv`, contained in the `data/loan-prediction` directory.
A description of the dataset is available in the `README.txt` file in the same directory.

The **goal** is to use the information from past loan applicants contained in `dataset.csv` to predict whether a *new applicant* should be granted a loan or not.

### Load the dataset and handle missing values

In [76]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

DATASET_PATH = "./data/loan-prediction/dataset.csv"

# loading the dataset
data = pd.read_csv(DATASET_PATH, sep=",", index_col="Loan_ID")
print(f"The shape of the dataset is: {data.shape}")
print(data.head())

# handling missing values
from pandas.api.types import is_numeric_dtype
data = data.apply(lambda x: x.fillna(x.median()) if is_numeric_dtype(x) else x.fillna(x.mode().iloc[0]))

The shape of the dataset is: (614, 12)
         Gender Married Dependents     Education Self_Employed  \
Loan_ID                                                          
LP001002   Male      No          0      Graduate            No   
LP001003   Male     Yes          1      Graduate            No   
LP001005   Male     Yes          0      Graduate           Yes   
LP001006   Male     Yes          0  Not Graduate            No   
LP001008   Male      No          0      Graduate            No   

          ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
Loan_ID                                                                      
LP001002             5849                0.0         NaN             360.0   
LP001003             4583             1508.0       128.0             360.0   
LP001005             3000                0.0        66.0             360.0   
LP001006             2583             2358.0       120.0             360.0   
LP001008             6000     

### Handling outliers

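Winsorization is a technique for handling outliers in statistics. It replaces the most extreme values in a dataset with less extreme ones: values beyond a chosen percentile in each tail are set to the value at that percentile, which pulls them closer to the center of the distribution. Below, `limits=0.05` winsorizes the lowest and the highest 5% of each column.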
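
To make the idea concrete, here is a minimal pure-Python sketch of two-sided winsorization. The helper `winsorize` and the toy `incomes` list are hypothetical names introduced only for illustration; this is not SciPy's implementation, although it follows the same counting idea (clamp the `k = int(limit * n)` smallest and largest values).

```python
def winsorize(values, limit):
    """Clamp the lowest and highest `limit` fraction of the values.

    Illustrative helper (not SciPy's implementation): the k smallest values
    are raised to the (k+1)-th smallest one and the k largest values are
    lowered to the (k+1)-th largest one, where k = int(limit * n).
    """
    ordered = sorted(values)
    n = len(ordered)
    k = int(limit * n)
    low, high = ordered[k], ordered[n - k - 1]
    return [min(max(v, low), high) for v in values]


# toy data with one obvious outlier
incomes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
clipped = winsorize(incomes, limit=0.1)
print(clipped)  # [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]
```

Note that the outlier is not dropped: it is replaced by the largest value that survives the cut, so the number of rows stays the same.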

In [77]:
# winsorize ApplicantIncome, CoapplicantIncome and LoanAmount (5% on each tail)
import scipy.stats as stats
# assign the result back instead of relying on inplace=True, which may not
# propagate to the DataFrame depending on the pandas version
for col in ["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]:
  data[col] = np.asarray(stats.mstats.winsorize(data[col].to_numpy(), limits=0.05))

# Apply log-transformation to ApplicantIncome and assign it to a new column
data["Log_ApplicantIncome"] = data.ApplicantIncome.apply(np.log)
# Apply log-transformation to LoanAmount and assign it to a new column
data["Log_LoanAmount"] = data.LoanAmount.apply(np.log)

### Encoding categorical features: one-hot encoding

One-hot encoding is a way to convert categorical data into a numerical representation by creating a binary vector for each category. The resulting vector has a length equal to the number of categories, and each element in the vector corresponds to a specific category.
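
As a small illustration of the description above, the following sketch builds the binary vectors by hand. The helper `one_hot` and the toy `education` list are hypothetical and only meant to show the mechanics; in the notebook the actual encoding is delegated to `pd.get_dummies`.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 row per input value."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one position is "hot"
        rows.append(row)
    return categories, rows


education = ["Graduate", "Not Graduate", "Graduate"]
categories, rows = one_hot(education)
print(categories)  # ['Graduate', 'Not Graduate']
print(rows)        # [[1, 0], [0, 1], [1, 0]]
```

Each row sums to 1, which is why one-hot columns derived from the same feature are linearly dependent.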

In [78]:
# get all columns which are not numeric and not the loan status
categorical_features = [col for col in data.columns if not is_numeric_dtype(data[col]) and col != "Loan_Status"]
data_with_dummies = pd.get_dummies(data, columns=categorical_features)

# as a convention, I prefer to place the column to be predicted as the last one
columns = data_with_dummies.columns.tolist()
columns.insert(len(columns), columns.pop(columns.index("Loan_Status")))
data_with_dummies = data_with_dummies.loc[:, columns]

# encoding the Loan_Status label
data = data_with_dummies
data.Loan_Status = data.Loan_Status.map(lambda x: 1 if x == "Y" else -1)

# print(data.head())

print(data["Loan_Status"].value_counts()[1])
print(data["Loan_Status"].value_counts()[-1])

422
192
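
As a point of reference for the accuracy figures that follow, the two class counts printed above imply a majority-class baseline: a classifier that always predicts the most frequent label. The variable names below are hypothetical, and reading `1` as "approved" is our interpretation of the `"Y"` label.

```python
approved, rejected = 422, 192  # class counts printed above (1 and -1)
total = approved + rejected
majority_baseline = max(approved, rejected) / total
print(f"{majority_baseline:.3f}")  # 0.687
```

Any model worth keeping should score clearly above this value.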


### Building a predictive model

In [79]:
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# split the dataset into training and testing

# extract the feature matrix X from our dataframe: it is composed of all the
# columns except "Loan_Status" (the target class label), which is the last one
X = data.iloc[:, :-1]

# extract the target class labels into the vector y
y = data.Loan_Status

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43, stratify=y)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape {X_test.shape}")

Training set shape: (491, 22)
Test set shape (123, 22)


### Feature scaling: why/when

**REMEMBER**: not every learning model is sensitive to different feature scales!

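For example, in (unregularized) logistic regression, the parameter vector that minimizes the negative log-likelihood is not affected by different feature scales, except for a constant factor: if a feature is multiplied by a constant, the corresponding weight is divided by the same constant and the predictions do not change. Note that this holds for the minimizer itself; the speed at which gradient descent converges, and any fit that includes a regularization penalty (the `scikit-learn` default), do depend on feature scales.

You can convince yourself of this by computing the gradient of the negative log-likelihood using non-scaled and scaled features.

Other models, instead, are not invariant with respect to a rescaling of the input features, and can lead to completely different results if the features are not properly scaled.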
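
The invariance claimed for logistic regression can also be checked numerically. The sketch below assumes no regularization and uses made-up toy data (`rows`, `labels`, `weights` are hypothetical): dividing a feature by a constant and multiplying its weight by the same constant leaves the negative log-likelihood unchanged.

```python
import math


def nll(weights, rows, labels):
    """Negative log-likelihood of logistic regression, labels in {-1, +1}."""
    total = 0.0
    for x, y in zip(rows, labels):
        margin = y * sum(w * xi for w, xi in zip(weights, x))
        total += math.log1p(math.exp(-margin))
    return total


# toy data: (bias term, income in raw currency units)
rows = [(1.0, 2000.0), (1.0, 3500.0), (1.0, 5000.0), (1.0, 8000.0)]
labels = [-1, -1, 1, 1]
weights = [-4.0, 0.001]

# rescale the income feature and compensate on the corresponding weight
scale = 1000.0
rows_scaled = [(bias, income / scale) for bias, income in rows]
weights_scaled = [weights[0], weights[1] * scale]

loss = nll(weights, rows, labels)
loss_scaled = nll(weights_scaled, rows_scaled, labels)
print(loss, loss_scaled)
```

The two losses agree up to floating-point error. Adding a penalty on the size of the weights would break this equality, because the rescaled weight is 1000 times larger.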

### Feature scaling: how

Feature scaling **cannot** be done looking at the whole dataset!
In other words, whether you standardize or normalize your features, you must do it considering only the training set portion of your dataset.
The same scaling, then, should be applied to the test set.
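
The rule above can be sketched on a single toy feature using only the standard library. The helper names and the numbers are hypothetical; with `scikit-learn` the same pattern is "fit the scaler on the training set, then call `transform` on the test set".

```python
from statistics import fmean, pstdev


def fit_standardizer(train_column):
    """Learn mean and (population) standard deviation from training values only."""
    return fmean(train_column), pstdev(train_column)


def standardize(column, mean, std):
    """Apply previously learned statistics to any column (train or test)."""
    return [(v - mean) / std for v in column]


train_income = [2000.0, 4000.0, 6000.0]
test_income = [3000.0, 10000.0]

mean, std = fit_standardizer(train_income)         # statistics come from train only
train_scaled = standardize(train_income, mean, std)
test_scaled = standardize(test_income, mean, std)  # reuse them for the test set
print(test_scaled)
```

The scaled test values need not have zero mean or unit variance, and that is expected: the test set must not influence the statistics used for scaling.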

In [80]:
from sklearn import preprocessing

std_scaler = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scaler.transform(X_train)

minmax_scaler = preprocessing.MinMaxScaler().fit(X_train)
X_train_minmax = minmax_scaler.transform(X_train)

# this can also be done with pandas (ddof=0 matches StandardScaler's population std)
# train_mean = X_train.mean()
# train_std = X_train.std(ddof=0)
# X_train_std = (X_train - train_mean) / train_std
# train_max = X_train.max()
# train_min = X_train.min()
# X_train_minmax = (X_train - train_min) / (train_max - train_min)

Now we can work with 3 different feature matrices:
- X_train
- X_train_std
- X_train_minmax


In [81]:
def evaluate(true_values, predicted_values):
  print(f"accuracy = {accuracy_score(true_values, predicted_values):.3f}")
  print(f"area under the ROC curve = {roc_auc_score(true_values, predicted_values)}")

model = LogisticRegression(solver="liblinear")

model.fit(X_train, y_train)

print(f"***** performance on the test set *****")
evaluate(y_test, model.predict(X_test))

print(f"***** classification report *****")
print(classification_report(y_test, model.predict(X_test)))

***** performance on the test set *****
accuracy = 0.789
area under the ROC curve = 0.6651702786377709
***** classification report *****
              precision    recall  f1-score   support

          -1       0.93      0.34      0.50        38
           1       0.77      0.99      0.87        85

    accuracy                           0.79       123
   macro avg       0.85      0.67      0.68       123
weighted avg       0.82      0.79      0.75       123
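
A remark on the ROC AUC figure above: `evaluate` passes hard class predictions (not scores) to `roc_auc_score`, and in that case the AUC reduces to the average of the two per-class recalls. The sketch below checks this with a hand-written pairwise AUC. The confusion counts are reconstructed from the rounded report, so they are an assumption, and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Confusion counts reconstructed from the rounded report above (an assumption):
# 85 positives: 84 predicted +1, 1 predicted -1
# 38 negatives: 13 predicted -1, 25 predicted +1
pos_pred = [1] * 84 + [-1] * 1
neg_pred = [-1] * 13 + [1] * 25

auc = pairwise_auc(pos_pred, neg_pred)
mean_recall = (84 / 85 + 13 / 38) / 2
print(round(auc, 4), round(mean_recall, 4))  # 0.6652 0.6652
```

To measure how well the model ranks applicants, pass continuous scores (for example the second column of `predict_proba`) to `roc_auc_score` instead of the predicted labels.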



### Let's try to use cross-validation

In [82]:
model = LogisticRegression(solver="liblinear")
cv = cross_validate(model, X, y, cv=10, scoring=("roc_auc", "accuracy"), return_train_score=True)
pd.DataFrame(cv)

# model evaluation using cross-validation
print("***** Evaluate Average Performance on Cross-Validation Set *****")
print("Avg. Test Set Accuracy = {:.3f}".format(np.mean(cv["test_accuracy"])))
print("Avg. Test Set ROC AUC = {:.3f}".format(np.mean(cv["test_roc_auc"])))

model = LogisticRegression(solver = "liblinear")
k_fold = KFold(n_splits=10, shuffle=True, random_state=42)
cv = cross_validate(model, X, y, cv=k_fold, scoring=("roc_auc", "accuracy"), return_train_score=True)

# model evaluation using cross-validation
print("***** Evaluate Average Performance on Cross-Validation Set *****")
print("Avg. Test Set Accuracy = {:.3f}".format(np.mean(cv["test_accuracy"])))
print("Avg. Test Set ROC AUC = {:.3f}".format(np.mean(cv["test_roc_auc"])))

***** Evaluate Average Performance on Cross-Validation Set *****
Avg. Test Set Accuracy = 0.806
Avg. Test Set ROC AUC = 0.758
***** Evaluate Average Performance on Cross-Validation Set *****
Avg. Test Set Accuracy = 0.808
Avg. Test Set ROC AUC = 0.767


## Model selection and evaluation

So far, we have just focused on a very specific instance of a logistic regression model.

In other words, we haven't spent time trying to tune any "meta-parameter" (known as a hyperparameter) of our model.
We used the default values that `scikit-learn` assigns to the hyperparameters of our logistic regression model.

We didn't perform any actual model selection, as hyperparameters are fixed.
The figures we output for test accuracy/ROC AUC scores are our estimates of generalization performance of our model.

Most of the time we may need to do one of the following:
- Fix a "family" of models and perform hyperparameter selection.
- Choose between a set of models, each one with a fixed set of hyperparameters.
- A mixture of the above, where we have to select the best hyperparameters of the best model picked from a set of different models.

In any case, we also need to provide an estimate of the generalization performance of the chosen model.

## Select best hyperparameters of a fixed family of models
### Using validation set
We have just one model, logistic regression, with a single hyperparameter `C` and a grid of candidate values for it.
`C` controls the trade-off between achieving a low error rate on the training data and having a simpler model (i.e., smaller weights). In `scikit-learn`, `C` is the inverse of the regularization strength: smaller values mean stronger regularization.
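
To see where `C` enters, it helps to write down the objective. With labels $y_i \in \{-1, +1\}$ (the encoding used above) and the intercept omitted for brevity, the L2-penalized logistic regression problem is commonly written as follows. This is the textbook form as we recall it, so treat it as a sketch and not as the exact expression a given solver implements.

$$
\min_{w} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \, w^\top x_i\right)\right)
$$

The first term favors small weights, the second one measures the fit to the training data. A large `C` puts more weight on the data-fit term (weaker regularization), while a small `C` lets the penalty dominate.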

In [83]:
models_and_hyperparams = {
  "LogisticRegression": (
    LogisticRegression(solver="liblinear"),
    {"C": [0.01, 0.05, 0.1, 0.5, 1, 2]},
  )
}

# outer splitting: training vs test set (80/20)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=73, stratify=y
)

# inner splitting (within the outer training set): training vs validation (80/20)
# training set is used to train the model, validation set is used to select the best hyperparameters
X_train_train, X_validation, y_train_train, y_validation = train_test_split(
  X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

training_scores = {}
validation_scores = {}

best_training_score = {}
best_validation_score = {}

model = models_and_hyperparams["LogisticRegression"][0]
hyperparams = models_and_hyperparams["LogisticRegression"][1]

for hp in hyperparams:
  training_scores[hp] = {}
  validation_scores[hp] = {}
  
  for val in hyperparams[hp]:
    model.set_params(**{hp: val})
    
    model.fit(X_train_train, y_train_train)
    
    training_score = accuracy_score(y_train_train, model.predict(X_train_train))
    training_scores[hp][val] = training_score
    
    validation_score = accuracy_score(y_validation, model.predict(X_validation))
    validation_scores[hp][val] = validation_score
    
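    # keep the value with the highest validation accuracy (the first one wins ties);
    # checking `hp not in ...` also works when several hyperparameters are listed
    if hp not in best_validation_score or best_validation_score[hp][1] < validation_score:
      best_validation_score[hp] = (val, validation_score)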

print("***** Evaluate Performance on Validation Set *****")
print(validation_scores)
print("***** Best Accuracy Score on Validation Set *****")
print(best_validation_score)

# we set the model's hyperparameters to those leading to the best score on the validation set
best_params = {hp: val for hp, (val, _) in best_validation_score.items()}
model.set_params(**best_params)

# we fit this model to the whole training set portion
model.fit(X_train, y_train)
print("***** Evaluate Performance on the whole Test Set *****")
evaluate(y_test, model.predict(X_test))

***** Evaluate Performance on Validation Set *****
{'C': {0.01: 0.6868686868686869, 0.05: 0.7575757575757576, 0.1: 0.7676767676767676, 0.5: 0.797979797979798, 1: 0.7878787878787878, 2: 0.797979797979798}}
***** Best Accuracy Score on Validation Set *****
{'C': (0.5, 0.797979797979798)}
***** Evaluate Performance on the whole Test Set *****
accuracy = 0.813
area under the ROC curve = 0.7119195046439628


### Using cross-validation (single hyperparameter)

In [84]:
models_and_hyperparams = {
  "LogisticRegression": (
    LogisticRegression(solver="liblinear"),
    {"C": [0.01, 0.05, 0.1, 0.5, 1, 2]},
  )
}

X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=73, stratify=y
)

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

model = models_and_hyperparams["LogisticRegression"][0]
hyperparams = models_and_hyperparams["LogisticRegression"][1]

gs = GridSearchCV(
  estimator=model,
  param_grid=hyperparams,
  cv=k_fold,
  scoring="accuracy",
  verbose=True,
  return_train_score=True,
)
gs.fit(X_train, y_train)
pd.DataFrame(gs.cv_results_)

print("Best hyperparameter: {}".format(gs.best_params_))
print("Best accuracy score: {:.3f}".format(gs.best_score_))
evaluate(y_test, gs.predict(X_test))

Fitting 10 folds for each of 6 candidates, totalling 60 fits
Best hyperparameter: {'C': 0.5}
Best accuracy score: 0.813
accuracy = 0.813
area under the ROC curve = 0.7119195046439628


### Using cross-validation (multiple hyperparameters)

In [85]:
models_and_hyperparams = {
  "LogisticRegression": (
    LogisticRegression(solver="liblinear"),
    {"C": [0.01, 0.05, 0.1, 0.5, 1, 2], "penalty": ["l1", "l2"]},
  )
}

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=31)

model = models_and_hyperparams["LogisticRegression"][0]
hyperparams = models_and_hyperparams["LogisticRegression"][1]

gs = GridSearchCV(
  estimator=model,
  param_grid=hyperparams,
  cv=k_fold,
  scoring="accuracy",
  verbose=True,
  return_train_score=True,
)

gs.fit(X_train, y_train)
pd.DataFrame(gs.cv_results_)

print("Best hyperparameter: {}".format(gs.best_params_))
print("Best accuracy score: {:.3f}".format(gs.best_score_))
evaluate(y_test, gs.predict(X_test))

Fitting 10 folds for each of 12 candidates, totalling 120 fits
Best hyperparameter: {'C': 1, 'penalty': 'l1'}
Best accuracy score: 0.811
accuracy = 0.813
area under the ROC curve = 0.7119195046439628


## Select the best model out of a set of models with fixed hyperparameters
### Using cross validation

In [86]:
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=73, stratify=y
)

models = {
  "LogisticRegression": LogisticRegression(solver="liblinear", max_iter=1000),
  "LinearSVC": LinearSVC(),
  "DecisionTreeClassifier": DecisionTreeClassifier(),
  "RandomForestClassifier": RandomForestClassifier(),
  "GradientBoostingClassifier": GradientBoostingClassifier(),
  # i could add more models here
}

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = {}
for model_name, model in models.items():
  cv_scores[model_name] = cross_val_score(model, X_train, y_train, cv=k_fold, scoring="accuracy")

cv_df = pd.DataFrame(cv_scores).transpose()

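# compute both statistics from the per-fold scores only, so that the
# standard deviation does not include the freshly added avg_cv column
fold_columns = cv_df.columns.tolist()
cv_df['avg_cv'] = cv_df[fold_columns].mean(axis=1)
cv_df['std_cv'] = cv_df[fold_columns].std(axis=1, ddof=0)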
cv_df = cv_df.sort_values(['avg_cv', 'std_cv'], ascending=[False,True])

# model Selection: Logistic Regression is the best overall method, therefore we pick that!
# now we need to provide an estimate of its generalization performance. 
# to do so, we evaluate it against the test set portion we previously held out.
model = models[cv_df.index[0]]
# re-fit the best selected model on the whole training set
model.fit(X_train, y_train)
# evaluation
print("***** Evaluate Performance on Training Set *****")
evaluate(y_train, model.predict(X_train))
print("***** Evaluate Performance on Test Set *****")
evaluate(y_test, model.predict(X_test))



***** Evaluate Performance on Training Set *****
accuracy = 0.817
area under the ROC curve = 0.7166075763998613
***** Evaluate Performance on Test Set *****
accuracy = 0.813
area under the ROC curve = 0.7119195046439628


### Select the best hyperparameters AND the best model from a family of models

In [87]:
models_and_hyperparams = {
  "LogisticRegression": (
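    # note: the default solver (lbfgs) does not support the "l1" penalty, so those
    # candidates fail to fit; solver="liblinear" (as in the previous cells) supports both
    LogisticRegression(),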
    {"C": [0.01, 0.05, 0.1, 0.5, 1, 2], "penalty": ["l1", "l2"]},
  ),
  "RandomForestClassifier": (
    RandomForestClassifier(),
    {"n_estimators": [10, 50, 100]},
  ),
  "DecisionTreeClassifier": (
    DecisionTreeClassifier(),
    {
      "criterion": ["gini", "entropy"],
      "max_depth": [i for i in range(1, X.shape[1] + 1)],
    },
  ),
}

# create 10 folds for estimating generalization error
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# when we train on a certain fold, we use a second cross-validation
# split in order to choose hyperparameters
inner_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=73)

X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=37, stratify=y
)

# we will collect the average of the scores on the 10 outer folds in this dictionary
# with keys given by the names of the models in models_and_hyperparams
average_scores_across_outer_folds_for_each_model = dict()

# find the model with the best generalization error
for name, (model, params) in models_and_hyperparams.items():
  # this object is a classifier that also happens to choose
  # its hyperparameters automatically using inner_cv
  model_optimizing_hyperparams = GridSearchCV(
    estimator=model,
    param_grid=params,
    cv=inner_cv,
    scoring="accuracy",
    verbose=True,
  )

  # estimate generalization error on the 10-fold splits of the data
  scores_across_outer_folds = cross_val_score(
    model_optimizing_hyperparams, X_train, y_train, cv=outer_cv, scoring="accuracy"
  )

  # get the mean accuracy across each of outer_cv's 10 folds
  average_scores_across_outer_folds_for_each_model[name] = np.mean(
    scores_across_outer_folds
  )
  performance_summary = "Model: {name}\nAccuracy in the 10 outer folds: {scores}.\nAverage Accuracy: {avg}"
  print(
    performance_summary.format(
      name=name,
      scores=scores_across_outer_folds,
      avg=np.mean(scores_across_outer_folds),
    )
  )
  print()

print(
  "Average score across the outer folds: ",
  average_scores_across_outer_folds_for_each_model,
)

many_stars = "\n" + "*" * 100 + "\n"
print(
  many_stars
  + "Now we choose the best model and refit on the whole dataset"
  + many_stars
)

best_model_name, best_model_avg_score = max(
  average_scores_across_outer_folds_for_each_model.items(),
  key=(lambda name_averagescore: name_averagescore[1]),
)

# get the best model and its associated parameter grid
best_model, best_model_params = models_and_hyperparams[best_model_name]

# now we refit this best model (re-running the hyperparameter search) on the
# whole training portion, so that we can start making predictions on other data;
# the nested cross-validation above gives us an estimate of this model's
# generalization performance
final_model = GridSearchCV(best_model, best_model_params, cv=inner_cv)
final_model.fit(X_train, y_train)

print("Best model: \n\t{}".format(best_model), end="\n\n")
print(
  "Estimation of its generalization performance (accuracy):\n\t{}".format(
    best_model_avg_score
  ),
  end="\n\n",
)
print(
  "Best parameter choice for this model: \n\t{params}"
  "\n(according to cross-validation `{cv}` on the whole dataset).".format(
    params=final_model.best_params_, cv=inner_cv
  )
)


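# note: this report is computed on the full dataset (X, y), which includes the samples
# used for fitting, so it is optimistic; use (X_test, y_test) for a held-out estimate
y_true, y_pred, y_pred_prob = y, final_model.predict(X), final_model.predict_proba(X)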
print()
print(classification_report(y_true, y_pred))
roc = roc_auc_score(y_true, y_pred_prob[:, 1])
acc = accuracy_score(y_true, y_pred)
print("Accuracy = [{:.3f}]".format(acc))
print("Area Under the ROC = [{:.3f}]".format(roc))

Fitting 10 folds for each of 12 candidates, totalling 120 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Fitting 10 folds for each of 12 candidates, totalling 120 fits


Fitting 10 folds for each of 12 candidates, totalling 120 fits


Fitting 10 folds for each of 12 candidates, totalling 120 fits


Fitting 10 folds for each of 12 candidates, totalling 120 fits


Fitting 10 folds for each of 12 candidates, totalling 120 fits


Fitting 10 folds for each of 12 candidates, totalling 120 fits


Fitting 10 folds for each of 12 candidates, totalling 120 fits


Fitting 10 folds for each of 12 candidates, totalling 120 fits


Fitting 10 folds for each of 12 candidates, totalling 120 fits


Model: LogisticRegression
Accuracy in the 10 outer folds: [0.76       0.89795918 0.81632653 0.81632653 0.85714286 0.85714286
 0.79591837 0.79591837 0.75510204 0.7755102 ].
Average Accuracy: 0.812734693877551

Fitting 10 folds for each of 3 candidates, totalling 30 fits
Fitting 10 folds for each of 3 candidates, totalling 30 fits
Fitting 10 folds for each of 3 candidates, totalling 30 fits
Fitting 10 folds for each of 3 candidates, totalling 30 fits
Fitting 10 folds for each of 3 candidates, totalling 30 fits
Fitting 10 folds for each of 3 candidates, totalling 30 fits
Fitting 10 folds for each of 3 candidates, totalling 30 fits
Fitting 10 folds for each of 3 candidates, totalling 30 fits
Fitting 10 folds for each of 3 candidates, totalling 30 fits
Fitting 10 folds for each of 3 candidates, totalling 30 fits
Model: RandomForestClassifier
Accuracy in the 10 outer folds: [0.68       0.81632653 0.73469388 0.7755102  0.7755102  0.83673469
 0.81632653 0.81632653 0.73469388 0.7755102 ].
Avera

Best model: 
	LogisticRegression()

Estimation of its generalization performance (accuracy):
	0.812734693877551

Best parameter choice for this model: 
	{'C': 1, 'penalty': 'l2'}
(according to cross-validation `StratifiedKFold(n_splits=10, random_state=73, shuffle=True)` on the whole dataset).

              precision    recall  f1-score   support

          -1       0.81      0.46      0.59       192
           1       0.80      0.95      0.87       422

    accuracy                           0.80       614
   macro avg       0.80      0.71      0.73       614
weighted avg       0.80      0.80      0.78       614

Accuracy = [0.798]
Area Under the ROC = [0.791]


