---


<img width=25% src="https://raw.githubusercontent.com/gabrielcapela/Credit-Card-Fraud-Detection-/main/images/myself.png" align=right>

# **Fetal Health Classification Project**

*by Gabriel Capela*

[<img src="https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white"/>](https://www.linkedin.com/in/gabrielcapela)
[<img src="https://img.shields.io/badge/Medium-12100E?style=for-the-badge&logo=medium&logoColor=white" />](https://medium.com/@gabrielcapela)

---

**Reducing child mortality** is a fundamental goal of global health initiatives and a key indicator of human progress. The United Nations aims to eliminate preventable deaths of newborns and children under five by 2030, targeting an under‑5 mortality rate of at most 25 per 1,000 live births. Maternal mortality is another critical issue, with most deaths occurring in low-resource settings and being largely preventable.

Cardiotocography (CTG) offers a cost-effective and non-invasive method to assess fetal health, monitoring fetal heart rate (FHR), movements, and uterine contractions. **This project leverages AutoML techniques to classify fetal health conditions** based on CTG data, aiming to support early diagnosis and intervention, ultimately contributing to maternal and child health improvements.
<p align="center">
<img width=50% src="https://github.com/gabrielcapela/AutoML_Classification/blob/main/aditya-romansa-5zp0jym2w9M-unsplash.jpg?raw=true">
</p>

The purpose of this project is to apply **Automated Machine Learning** (AutoML) to a real classification problem and demonstrate how practical this type of tool is. The data comes from [Kaggle](https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification?resource=download).

# Business Understanding 

Child and maternal health are critical indicators of a country’s overall development and healthcare quality. The **reduction of child mortality is a priority** within the United Nations Sustainable Development Goals (SDGs), aiming to lower preventable deaths of newborns and children under five years old. The UN targets an under-5 mortality rate of at most 25 per 1,000 live births by 2030. Similarly, maternal mortality remains a major global health challenge, with approximately 295,000 deaths occurring during pregnancy and childbirth in 2017, 94% of which took place in low-resource settings. Many of these deaths could have been prevented with timely medical intervention.

One of the most effective methods to assess fetal health and reduce risks to both mother and baby is **Cardiotocography** (CTG). This technology is a non-invasive, cost-effective diagnostic tool that monitors fetal heart rate (FHR), fetal movements, uterine contractions, and other physiological parameters. By analyzing CTG data, **healthcare professionals can detect early signs of fetal distress**, allowing for timely medical decisions that may prevent complications such as stillbirths, hypoxia, and premature birth-related issues.

Despite its effectiveness, **the interpretation of CTG data remains challenging and subjective**, as it relies heavily on the expertise of healthcare providers. Classification systems powered by Machine Learning can significantly enhance this process by providing consistent, data-driven assessments of fetal health conditions. By leveraging historical CTG data, an ML approach can help classify fetal health status into categories such as:

*   Normal: No signs of distress, healthy fetal condition.
*   Suspect: Potential risk factors that may require further observation.
*   Pathological: Clear indicators of fetal distress, requiring immediate medical intervention.

This project aims to develop an **AutoML model** capable of classifying fetal health status based on CTG data. By doing so, it seeks to:

1.  Improve diagnostic accuracy by reducing human subjectivity in CTG interpretation.
2.  Assist healthcare professionals in making faster and more informed decisions.
3.  Enhance maternal and child health outcomes by enabling earlier interventions.

Ultimately, this initiative contributes to public health efforts by providing a data-driven approach to fetal health assessment, supporting medical professionals in preventing perinatal complications and aligning with global healthcare goals.

# Data Understanding

The dataset used can be downloaded from this [page](https://github.com/gabrielcapela/AutoML_Classification/blob/f36182cf3a150ce48632b0531ddd6373b2c0e3f9/fetal_health.csv).

## Obtaining and Summary Analysis of Data

Let's start by importing the data and previewing the first few rows to illustrate the meaning of each column:

In [1]:
# Importing the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Importing the dataset
df = pd.read_csv('https://github.com/gabrielcapela/AutoML_Classification/raw/refs/heads/main/fetal_health.csv')
# Reporting the dataset dimensions
print(f"The data has {df.shape[0]} rows and {df.shape[1]} variables")
# Showing the first 5 rows
df.head()

The data has 2126 rows and 22 variables


| | baseline value | accelerations | fetal_movement | uterine_contractions | light_decelerations | severe_decelerations | prolongued_decelerations | abnormal_short_term_variability | mean_value_of_short_term_variability | percentage_of_time_with_abnormal_long_term_variability | ... | histogram_min | histogram_max | histogram_number_of_peaks | histogram_number_of_zeroes | histogram_mode | histogram_mean | histogram_median | histogram_variance | histogram_tendency | fetal_health |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 120.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 73.0 | 0.5 | 43.0 | ... | 62.0 | 126.0 | 2.0 | 0.0 | 120.0 | 137.0 | 121.0 | 73.0 | 1.0 | 2.0 |
| 1 | 132.0 | 0.006 | 0.0 | 0.006 | 0.003 | 0.0 | 0.0 | 17.0 | 2.1 | 0.0 | ... | 68.0 | 198.0 | 6.0 | 1.0 | 141.0 | 136.0 | 140.0 | 12.0 | 0.0 | 1.0 |
| 2 | 133.0 | 0.003 | 0.0 | 0.008 | 0.003 | 0.0 | 0.0 | 16.0 | 2.1 | 0.0 | ... | 68.0 | 198.0 | 5.0 | 1.0 | 141.0 | 135.0 | 138.0 | 13.0 | 0.0 | 1.0 |
| 3 | 134.0 | 0.003 | 0.0 | 0.008 | 0.003 | 0.0 | 0.0 | 16.0 | 2.4 | 0.0 | ... | 53.0 | 170.0 | 11.0 | 0.0 | 137.0 | 134.0 | 137.0 | 13.0 | 1.0 | 1.0 |
| 4 | 132.0 | 0.007 | 0.0 | 0.008 | 0.0 | 0.0 | 0.0 | 16.0 | 2.4 | 0.0 | ... | 53.0 | 170.0 | 9.0 | 0.0 | 137.0 | 136.0 | 138.0 | 11.0 | 1.0 | 1.0 |


Below is the meaning of each variable:

*   **baseline value**: the baseline fetal heart rate (FHR) in beats per minute (bpm). Note that this column name contains a space rather than an underscore.
*   **accelerations**: the mean number of FHR accelerations per second.
*   **fetal_movement**: the mean number of detected fetal movements per second.
*   **uterine_contractions**: the mean number of detected uterine contractions per second.
*   **light_decelerations**: the mean number of mild FHR decelerations per second.
*   **severe_decelerations**: the mean number of severe FHR decelerations per second.
*   **prolongued_decelerations**: the mean number of prolonged FHR decelerations per second.
*   **abnormal_short_term_variability**: the percentage of time with abnormal short-term FHR variability.
*   **mean_value_of_short_term_variability**: the mean value of short-term FHR variability in bpm.
*   **percentage_of_time_with_abnormal_long_term_variability**: the percentage of time with abnormal long-term FHR variability.
*   **mean_value_of_long_term_variability**: the mean value of long-term FHR variability in bpm.
*   **histogram_width**: the width of the FHR histogram, indicating the range of variations.
*   **histogram_min**: the minimum recorded FHR value in the histogram.
*   **histogram_max**: the maximum recorded FHR value in the histogram.
*   **histogram_number_of_peaks**: the number of peaks in the FHR histogram.
*   **histogram_number_of_zeroes**: the number of zero values in the FHR histogram.
*   **histogram_mode**: the most frequent FHR value.
*   **histogram_mean**: the mean FHR value recorded in the histogram.
*   **histogram_median**: the median FHR value recorded in the histogram.
*   **histogram_variance**: the variance of FHR values in the histogram, indicating dispersion.
*   **histogram_tendency**: the tendency (asymmetry) of the FHR histogram, coded as -1 (left-asymmetric), 0 (symmetric), or 1 (right-asymmetric).

*   **fetal_health**: the classification of fetal health status:
    *   1 → Normal
    *   2 → Suspect
    *   3 → Pathological

## Pandas Profiling

To explore AutoML tools, I will use **Pandas Profiling** (now distributed as the ydata-profiling package and imported below as ydata_profiling), a tool that automates exploratory data analysis (EDA) and makes it easy to visualize and understand the characteristics of a dataset. It can **generate interactive reports in HTML** format with detailed insights into each variable, such as distributions, missing values, and outliers.

In [3]:
# Importing the required package
from ydata_profiling import ProfileReport

# Creating the report
profile = ProfileReport(df, title="Report", explorative=True)

# Transforming to html format
profile.to_file("report.html")


Click [**HERE**](https://gabrielcapela.github.io/fetal_health_classification/report.html) to see the report

**Some observations** can already be made:

*   The dataset shows several very **strong correlations** between certain variables, especially between histogram-related features (such as histogram_mean, histogram_max, histogram_median, etc.) and variability variables, such as abnormal_short_term_variability and mean_value_of_short_term_variability. These correlations indicate that the variables may be measuring similar aspects of the same information.

*   The report flags **11 duplicate rows** (0.5%), meaning that some observations are repeated. Duplicates can distort the results, since the model gives extra weight to the repeated patterns and the same observation can end up in both the training and the test set. These duplicates will be removed before the data is passed to PyCaret.

*   The severe_decelerations variable is **highly unbalanced**, with 96.8% of the rows sharing a single value. Since it is a feature rather than the target, its effect on model training is indirect.

*   The **fetal_health target** variable is unbalanced across its three classes; PyCaret will be configured to handle this issue.

*   Several variables, such as accelerations, fetal_movement, and prolongued_decelerations, have a **large number of zero values**. These zeros are probably not input errors, but rather indications that these events did not occur during the recording. They can therefore be treated as valid information representing the absence of those events.

# Data Preparation

As a first step, we remove the duplicate rows.

In [4]:
# Counting the duplicate rows that will be removed
print(f'Rows deleted: {len(df) - len(df.drop_duplicates())}')
# Dropping the duplicates (the first occurrence of each row is kept)
df.drop_duplicates(inplace=True)

Rows deleted: 13
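
Duplicate counts depend on the counting convention, so two tools can report different numbers for the same data. The sketch below uses a hypothetical toy frame (not the CTG data) to contrast the number of extra copies that drop_duplicates() removes with the number of distinct rows that occur more than once. Whether this explains any difference between a profiling report and the count above is an assumption that would need checking against the real data.

```python
import pandas as pd

# Hypothetical toy frame: row (1, 'a') appears three times, row (2, 'b') twice.
toy = pd.DataFrame({
    "x": [1, 1, 1, 2, 2, 3],
    "y": ["a", "a", "a", "b", "b", "c"],
})

# Rows that drop_duplicates() would remove (every repeat after the first).
extra_copies = int(toy.duplicated(keep="first").sum())

# Number of distinct rows that occur more than once.
repeated_patterns = len(toy[toy.duplicated(keep=False)].drop_duplicates())

print(extra_copies, repeated_patterns)
```

Here the two conventions give 3 and 2 respectively, even though both describe the same duplicated rows.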


## About PyCaret

**PyCaret** is a Python AutoML library that simplifies the process of building, testing, and optimizing machine learning models. It automates tasks such as model selection, hyperparameter tuning, and performance evaluation, making the workflow faster and more accessible.

For more details, visit [PyCaret](https://pycaret.org).

PyCaret can perform several **data preprocessing steps automatically**, such as missing value treatment, categorical variable encoding, feature scaling, and outlier removal. Manual preprocessing can sometimes lead to better results, but we will stick to PyCaret’s built-in capabilities to **evaluate the efficiency of AutoML**.

## Division of data

Although PyCaret automatically splits the data into training and test sets inside the setup() function, **setting aside a separate test set before** handing the data to AutoML makes the model evaluation more reliable and robust. The main benefits are:

*   Avoiding data leakage: if the model is selected and tuned against PyCaret's internal test set, it can indirectly "learn" patterns that it would not have access to in a real application.
*   Having an external set for final validation: this allows a more realistic evaluation of the model's performance before putting it into production.

In [5]:
# Separating data into test and training (fixed seed so the split is reproducible)
test = df.sample(frac=0.15, random_state=42)
train = df.drop(test.index)

test.reset_index(inplace=True, drop=True)
train.reset_index(inplace=True, drop=True)

print(test.shape)
print(train.shape)

(317, 22)
(1796, 22)
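
Because the target is unbalanced, a purely random 15% sample may under- or over-represent the rarer classes. One alternative is to sample within each class. The sketch below shows this on hypothetical toy data; the names toy, test_strat, and train_strat and the class counts are illustrative only, and this is not the split used in the rest of the notebook.

```python
import pandas as pd

# Hypothetical imbalanced toy data standing in for the CTG table.
toy = pd.DataFrame({
    "feature": range(100),
    "fetal_health": [1.0] * 78 + [2.0] * 14 + [3.0] * 8,
})

# Sample 15% within each class so the test set mirrors the class proportions.
test_strat = toy.groupby("fetal_health").sample(frac=0.15, random_state=42)
train_strat = toy.drop(test_strat.index)

print(test_strat["fetal_health"].value_counts().sort_index())
print(train_strat.shape)
```

Every class is guaranteed to appear in the test set in roughly its original proportion, which a single unstratified draw does not ensure.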


# Modeling

In [6]:
# Importing the necessary packages
from pycaret.classification import setup, compare_models, models, create_model, predict_model
from pycaret.classification  import tune_model, plot_model, evaluate_model, finalize_model
from pycaret.classification  import save_model, load_model

## Defining the setup

PyCaret's setup() is the main function for initializing a machine learning experiment. It prepares the data, sets up the pipeline configuration, and applies preprocessing automatically.

In [7]:
# Initializing the PyCaret environment with target variable balancing
classification_setup = setup(data=df, target='fetal_health', fix_imbalance=True)

| | Description | Value |
|---|---|---|
| 0 | Session id | 1791 |
| 1 | Target | fetal_health |
| 2 | Target type | Multiclass |
| 3 | Target mapping | 1.0: 0, 2.0: 1, 3.0: 2 |
| 4 | Original data shape | (2113, 22) |
| 5 | Transformed data shape | (4090, 22) |
| 6 | Transformed train set shape | (3456, 22) |
| 7 | Transformed test set shape | (634, 22) |
| 8 | Numeric features | 21 |
| 9 | Preprocess | True |
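
With fix_imbalance=True the training portion grows: the transformed train set has 3,456 rows, which is consistent with each of the three classes being brought to 1,152 rows (3 × 1,152 = 3,456). To my knowledge PyCaret's default resampler is SMOTE, which synthesizes new minority samples. The sketch below uses plain random oversampling instead, on hypothetical toy data, only to illustrate the balancing idea.

```python
import pandas as pd

# Hypothetical imbalanced training labels (not the real CTG data).
toy_train = pd.DataFrame({
    "feature": range(20),
    "fetal_health": [1.0] * 14 + [2.0] * 4 + [3.0] * 2,
})

majority_size = toy_train["fetal_health"].value_counts().max()

# Keep every original row and top up each smaller class by resampling
# its own rows with replacement until it matches the majority class.
parts = []
for _, group in toy_train.groupby("fetal_health"):
    parts.append(group)
    shortfall = majority_size - len(group)
    if shortfall > 0:
        parts.append(group.sample(n=shortfall, replace=True, random_state=0))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["fetal_health"].value_counts().sort_index())
```

Only the training portion should be resampled; the test portion keeps its original class proportions, as the unchanged test set size in the table above suggests.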


## Creating the pipeline


The pipeline is an **automated preprocessing and modeling workflow** that ensures that all transformations applied to the training data are replicated on the test data and new predictions.

When we call setup(), PyCaret creates a pipeline that can include steps such as missing value handling, categorical variable encoding, normalization, outlier removal, and feature selection. Some of these steps are only applied when requested through setup() arguments, as with normalize=True in the cell below. The pipeline is then applied to all trained models, ensuring consistency and **eliminating the need to repeat these steps manually**.

In [9]:
# Creating the pipeline
clf = setup(data = train,
            target = 'fetal_health',
            normalize = True,
            #log_experiment = True,
            experiment_name = 'test_01')

| | Description | Value |
|---|---|---|
| 0 | Session id | 2076 |
| 1 | Target | fetal_health |
| 2 | Target type | Multiclass |
| 3 | Target mapping | 1.0: 0, 2.0: 1, 3.0: 2 |
| 4 | Original data shape | (1796, 22) |
| 5 | Transformed data shape | (1796, 22) |
| 6 | Transformed train set shape | (1257, 22) |
| 7 | Transformed test set shape | (539, 22) |
| 8 | Numeric features | 21 |
| 9 | Preprocess | True |
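
The key property of such a pipeline is that transformation parameters are learned from the training data only and then reused unchanged on any later data. The sketch below illustrates this for z-score normalization on hypothetical values; it is a simplified stand-in written with pandas, not PyCaret's internal code.

```python
import pandas as pd

# Hypothetical values for one CTG-like feature.
train_col = pd.Series([120.0, 130.0, 140.0, 150.0])
test_col = pd.Series([125.0, 160.0])

# Statistics are learned from the training data only ...
mean, std = train_col.mean(), train_col.std(ddof=0)

# ... and then reused, unchanged, on any later data.
train_scaled = (train_col - mean) / std
test_scaled = (test_col - mean) / std

print(test_scaled.round(3).tolist())
```

Recomputing the mean and standard deviation on the test data would leak information about that data into the evaluation, which is exactly what the pipeline is meant to prevent.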


## Comparing the models

PyCaret's **compare_models()** function automatically evaluates multiple classification models and compares them based on performance metrics. It uses 10-fold cross-validation (fold=10), ensuring a more robust evaluation by splitting the training data into 10 parts and averaging the results.

By default, models are evaluated using the following metrics:

*   **Accuracy**: The percentage of correct predictions.
*   **AUC** (Area Under the Curve): Measures the model’s ability to distinguish between classes.
*   **Recall**: The proportion of true positive predictions among all actual positives.
*   **Precision**: The proportion of true positive predictions among all positive predictions.
*   **F1 Score**: The harmonic mean of precision and recall, balancing both metrics.
*   **Kappa**: Measures the agreement between predicted and actual classes, adjusted for chance.
*   **MCC** (Matthews Correlation Coefficient): Measures the quality of classification considering all confusion matrix outcomes, useful for imbalanced problems.

This function allows you to quickly identify the best-performing model based on your preferred metric, without having to manually train and tune each model.

In this project we will use **AUC-ROC** (Area Under the Receiver Operating Characteristic Curve) as the evaluation metric for this fetal health classification problem, as it measures the model's ability to rank the three classes correctly and is less sensitive to class imbalance than accuracy.
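
For a problem with more than two classes, AUC has to be aggregated in some way. A common choice is to compute a one-vs-rest AUC per class and average them. The sketch below does this in plain Python on hypothetical labels and probabilities; the function names binary_auc and macro_ovr_auc are made up for this example, and I am assuming, not asserting, that this matches the averaging PyCaret reports.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def macro_ovr_auc(y_true, proba, classes):
    """Average the one-vs-rest AUC over all classes."""
    aucs = []
    for k, cls in enumerate(classes):
        labels = [1 if y == cls else 0 for y in y_true]
        scores = [row[k] for row in proba]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / len(aucs)


# Hypothetical true classes and predicted probabilities for 6 recordings.
y_true = [1, 1, 1, 2, 2, 3]
proba = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
]
macro_auc = macro_ovr_auc(y_true, proba, classes=[1, 2, 3])
print(round(macro_auc, 4))
```

Because each class contributes equally to the average, the rare Pathological class weighs as much as the frequent Normal class.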

In [11]:
import logging
# Suppressing the log output from LightGBM and other libraries
logging.getLogger("pycaret").setLevel(logging.WARNING)
logging.getLogger("lightgbm").setLevel(logging.WARNING)

# Now running compare_models using the AUC metric for ordering
best = compare_models(sort='AUC')

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 0.9507 | 0.9873 | 0.9507 | 0.9507 | 0.9494 | 0.8603 | 0.8622 | 1.142 |
| xgboost | Extreme Gradient Boosting | 0.9483 | 0.9871 | 0.9483 | 0.9487 | 0.9472 | 0.8546 | 0.8564 | 0.188 |
| rf | Random Forest Classifier | 0.9284 | 0.9856 | 0.9284 | 0.9259 | 0.9247 | 0.7907 | 0.7951 | 0.336 |
| catboost | CatBoost Classifier | 0.9459 | 0.9854 | 0.9459 | 0.9453 | 0.9444 | 0.8459 | 0.8481 | 3.773 |
| et | Extra Trees Classifier | 0.9181 | 0.9784 | 0.9181 | 0.9137 | 0.9137 | 0.7597 | 0.7644 | 0.249 |
| knn | K Neighbors Classifier | 0.8862 | 0.9345 | 0.8862 | 0.8785 | 0.878 | 0.6554 | 0.6643 | 0.069 |
| nb | Naive Bayes | 0.7009 | 0.9215 | 0.7009 | 0.8629 | 0.7382 | 0.4289 | 0.4951 | 0.043 |
| dt | Decision Tree Classifier | 0.9094 | 0.8803 | 0.9094 | 0.9106 | 0.9086 | 0.7489 | 0.7514 | 0.048 |
| dummy | Dummy Classifier | 0.7804 | 0.5 | 0.7804 | 0.6091 | 0.6842 | 0.0 | 0.0 | 0.047 |
| lr | Logistic Regression | 0.8862 | 0.0 | 0.8862 | 0.8879 | 0.8842 | 0.6807 | 0.6841 | 1.904 |
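
The Dummy Classifier row shows why accuracy alone is misleading on unbalanced data: it reaches about 78% accuracy with a Kappa of 0. The sketch below reproduces that pattern on hypothetical labels with a similar imbalance, using a predictor that always outputs the most frequent class. The helper functions are written for this example and are not PyCaret's.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement between predictions and labels, corrected for chance."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Agreement expected if predictions were independent of the labels.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical labels with roughly the imbalance seen in this dataset.
y_true = [1] * 78 + [2] * 14 + [3] * 8
y_majority = [1] * len(y_true)  # always predict the most frequent class

print(accuracy(y_true, y_majority), cohen_kappa(y_true, y_majority))
```

The accuracy equals the share of the majority class, while Kappa is zero because the agreement is no better than what chance alone would produce.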


In [12]:
# Checking the best model 
print(best)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=2076, reg_alpha=0.0, reg_lambda=0.0, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)


## Instantiating the model

PyCaret's compare_models() function **does not produce** a definitive model; it evaluates and compares different classification algorithms with their default settings, displaying a grid with their cross-validated performance metrics. To work with a specific model, **you need to select and instantiate it manually** using the create_model() function.

The chosen model is the **Gradient Boosting Regressor**, which is based on decision trees that **combine multiple weak estimators to create a stronger and more accurate model**. It works by training the trees sequentially, where each new tree learns to correct the errors of the previous ones, reducing the residual error of the model.
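
The idea of each new tree correcting the errors of the previous ones can be sketched with one-split trees (stumps) fitted to the current residuals. This is a minimal illustration under squared-error loss on hypothetical toy data, not PyCaret's or scikit-learn's implementation; the helper names fit_stump, predict_stump, and boost are made up for this example.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = residual[x <= threshold], residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Sequentially add stumps, each fitted to the current residuals."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    errors = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - prediction)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        errors.append(float(np.mean((y - prediction) ** 2)))
    return prediction, errors


# Hypothetical one-dimensional toy data.
x = np.linspace(0.0, 6.0, 40)
y = np.sin(x)
_, errors = boost(x, y)
print(errors[0], errors[-1])
```

The training error shrinks round after round because every stump is fitted to what the ensemble still gets wrong. The learning rate scales each correction, trading speed of fitting against the risk of overfitting.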

In [None]:
# Instantiating the model
gbr = create_model('gbr')

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,2692.0808,25258915.6574,5025.8249,0.8378,0.4651,0.2847
1,2627.5565,21212231.2786,4605.6738,0.8843,0.3967,0.2439
2,2914.5591,30823481.7479,5551.8899,0.8179,0.4883,0.3436
3,1999.9408,13476677.2338,3671.0594,0.9179,0.3613,0.2777
4,3009.0081,34314709.4011,5857.8758,0.68,0.5787,0.2448
5,2414.9851,19645954.7801,4432.3757,0.8399,0.4593,0.3501
6,2722.7516,23964166.2875,4895.3209,0.8103,0.4066,0.3088
7,2502.4707,15051773.0486,3879.6615,0.918,0.3691,0.3014
8,2097.4436,11009028.2647,3317.9856,0.8705,0.3071,0.2518
9,1962.2882,11303775.4218,3362.1088,0.8813,0.4131,0.2735



## Model Tuning

The Model Tuning step in PyCaret consists of **adjusting the model's hyperparameters** to optimize its performance. Instead of using default values, this phase **searches for the best combination of parameters** through techniques such as Random Grid Search.

In [None]:
# Hyperparameter tuning
tuned_gbr = tune_model(gbr, optimize='R2')

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,3018.7954,27184014.1774,5213.8291,0.8254,0.4952,0.3282
1,2590.9251,21081326.1634,4591.4405,0.885,0.3957,0.2457
2,3006.9781,32118494.6325,5667.3181,0.8103,0.4703,0.3083
3,1967.5294,13712204.3697,3702.9994,0.9164,0.3939,0.3093
4,3082.5206,36577248.927,6047.9128,0.6589,0.585,0.2471
5,2400.0926,19009689.9243,4360.0103,0.845,0.4108,0.3131
6,2751.8003,24506509.1942,4950.405,0.806,0.4198,0.3211
7,2697.1972,18040227.5162,4247.3789,0.9017,0.377,0.3157
8,2166.6087,11732893.192,3425.3311,0.862,0.3032,0.2468
9,2159.3293,13238070.1823,3638.4159,0.861,0.4722,0.3057


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


In this case, the tuned model did not outperform the original, so PyCaret returned the original model. Even so, it is worth **highlighting the importance and practicality** of this function.
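
The behaviour seen above, a fixed budget of random candidates with a fallback to the original model, can be sketched in plain Python. The scoring function, search space, and baseline below are hypothetical placeholders for a cross-validated metric and a real hyperparameter grid; this is not PyCaret's implementation.

```python
import random


def validation_score(params):
    """Hypothetical stand-in for a cross-validated score (higher is better)."""
    return -((params["learning_rate"] - 0.1) ** 2) - 0.001 * abs(params["max_depth"] - 4)


search_space = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.5],
    "max_depth": [2, 3, 4, 6, 8],
}

rng = random.Random(0)
baseline = {"learning_rate": 0.5, "max_depth": 8}
best_params, best_score = baseline, validation_score(baseline)

# Try a fixed budget of random combinations; keep a candidate only if it wins.
for _ in range(10):
    candidate = {name: rng.choice(values) for name, values in search_space.items()}
    score = validation_score(candidate)
    if score > best_score:
        best_params, best_score = candidate, score

print(best_params, best_score)
```

Starting from the baseline guarantees that the returned configuration is never worse than the original on the chosen metric, which mirrors the message in the tuning output.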

# Evaluation

## Interactive Model Evaluation

The evaluate_model() function in PyCaret allows you to interactively visualize the performance of the fitted model. It generates several graphs and metrics that **help analyze the quality of predictions and identify potential problems**, such as overfitting.

In [None]:
# Evaluating the model
evaluate_model(tuned_gbr)


## Making the predictions

The predict_model() function in PyCaret is used to **generate predictions with a previously trained and tuned model**. It can be applied to both test data and new data sets to evaluate the model's performance.

When called without a specific data set, predict_model(tuned_gbr) **returns predictions for the test data automatically separated by PyCaret**.

Later on, we will also use it for data separated at the beginning, never exposed to PyCaret.

In [None]:
# Making the predictions
predict_model(tuned_gbr);

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Gradient Boosting Regressor,2597.8849,23370476.3041,4834.302,0.8639,0.4358,0.3044


The R2 on PyCaret's hold-out set is 0.8639, **very similar** to the mean of 0.8458 across the 10 cross-validation folds of the untuned model.

## Finalizing the model

The finalize_model() function in PyCaret is used to train the fine-tuned model on the **entire available dataset**, ensuring that it uses as much information as possible before deploying it.

By default, PyCaret sets aside a portion of the data for testing, but **when finalizing the model, it is re-trained on all the data passed to setup()**, including that hold-out portion, so that the deployed model has seen as many examples as possible.

In [None]:
# Finalizing the model
final_gbr = finalize_model(tuned_gbr)

In [None]:
type(final_gbr)

pycaret.internal.pipeline.Pipeline

In [None]:
# Making predictions with the model trained with all the data provided to PyCaret
pred_holdout = predict_model(final_gbr, data=train)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Gradient Boosting Regressor,2168.0185,14726116.5617,3837.4623,0.902,0.3537,0.2546


As expected, the metrics are better, since the model is being evaluated on data it was trained on. This score is therefore **optimistic** and does not measure how well the model generalizes.

In [None]:
# Checking the parameters
print(final_gbr)

Pipeline(memory=Memory(location=None),
         steps=[('numerical_imputer',
                 TransformerWrapper(include=['age', 'bmi', 'children'],
                                    transformer=SimpleImputer())),
                ('categorical_imputer',
                 TransformerWrapper(include=['sex', 'smoker', 'region'],
                                    transformer=SimpleImputer(strategy='most_frequent'))),
                ('ordinal_encoding',
                 TransformerWrapper(include=['sex', 'smoker'],
                                    transfor...
                                                                        {'col': 'smoker',
                                                                         'data_type': dtype('O'),
                                                                         'mapping': no     0
yes    1
NaN   -1
dtype: int64}]))),
                ('onehot_encoding',
                 TransformerWrapper(include=['region'],
                      

## Predicting on new data


We will now use the **dataset set aside before calling PyCaret**, which the model has never seen, to check how reliably it generalizes.

In [None]:
# Making Predictions with Unseen Data
unseen_predictions = predict_model(final_gbr, data=test)
unseen_predictions.head()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Gradient Boosting Regressor,2338.6557,16011178.8905,4001.3971,0.8718,0.4091,0.2845


Unnamed: 0,age,sex,bmi,children,smoker,region,charges,prediction_label
0,18,female,28.215,0,no,northeast,2200.830811,3907.452642
1,26,male,32.900002,2,yes,southwest,36085.21875,34719.015207
2,50,male,32.299999,2,no,southwest,9630.397461,11540.35269
3,49,female,27.17,0,no,southeast,8601.329102,11452.363425
4,57,female,38.0,2,no,southwest,12646.207031,14812.582825


The R2 of 0.8718 is **very similar** to the result obtained on PyCaret's test set (0.8639), which suggests that the model generalizes acceptably to unseen data.

# Saving the model

The save_model() function is used to save a trained and finalized model, allowing it to be easily loaded and used later without the need for retraining. This is especially **useful for production deployment**, where the model can be saved and reused to make predictions in real time or on new datasets.

The model is saved as a pickle file (.pkl), which is easy to store and move between environments, provided compatible library versions are installed where it is loaded.
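
The save-and-reload round trip can be illustrated with Python's standard pickle module. The dictionary below is a hypothetical stand-in for a fitted pipeline, and the file name is illustrative; PyCaret's save_model() and load_model() wrap this kind of serialization, but the sketch does not reproduce their internals.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fitted pipeline: any picklable Python object.
model = {"threshold": 0.5, "classes": [1.0, 2.0, 3.0]}

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "example_model.pkl"
    path.write_bytes(pickle.dumps(model))        # serialize to disk
    restored = pickle.loads(path.read_bytes())   # load it back

print(restored == model)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.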

In [None]:
save_model(final_gbr,'Modelo_Final_03_06_25')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['age', 'bmi', 'children'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=['sex', 'smoker', 'region'],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('ordinal_encoding',
                  TransformerWrapper(include=['sex', 'smoker'],
                                     transfor...
                                                                         {'col': 'smoker',
                                                                          'data_type': dtype('O'),
                                                                          'mapping': no     0
 yes    1
 NaN   -1
 dtype: int64}]))),
                 ('onehot_encoding',
                  TransformerWrapper(include=['region'],
    

## Loading a model

In [None]:
# Loading a saved model
saved_final_gbr = load_model('Modelo_Final_03_06_25')

Transformation Pipeline and Model Successfully Loaded


In [None]:
# Making new predictions with the saved model
new_prediction = predict_model(saved_final_gbr, data=test)
new_prediction.head()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Gradient Boosting Regressor,2227.6783,14881735.4755,3857.6852,0.8866,0.3522,0.2624


Unnamed: 0,age,sex,bmi,children,smoker,region,charges,prediction_label
0,50,male,34.200001,2,yes,southwest,42856.839844,42078.306828
1,27,female,24.1,0,no,southwest,2974.125977,7262.706422
2,51,male,33.330002,3,no,southeast,10560.492188,11780.298853
3,38,female,34.799999,2,no,southwest,6571.543945,8053.835581
4,27,male,32.584999,3,no,northeast,4846.919922,7291.681332
