# Intro to Data Science 
## Part VII. - Regression and Embedding pipelines

### Table of contents

- #### Regression
    - <a href="#What-is-Regression?">Theory</a>
    - <a href="#Linear-regression---OLS">Linear regression</a>
    - <a href="#Ridge-regression">Ridge regression</a>
    - <a href="#LASSO">LASSO regression</a>
    - <a href="#Bayesian-Ridge-Regression">Bayesian regression</a>
    - <a href="#Support-Vector-Regression">Support Vector regression</a>
    - <a href="#XGBoost">XGBoost</a>

- #### Managing model lifecycle
    - <a href="#Reusing-Trained-Pipelines">Reusing trained pipelines</a>
        - <a href="#Saving-Pipelines">Exporting pipelines</a>
        - <a href="#Loading-Pipelines">Loading pipelines</a>
    - <a href="#Tracking-sklearn-Models">Managing model lifecycle with MLflow</a>
        - <a href="#What-is-MLflow?">MLflow Experiments</a>
        - <a href="#Tracking-Experiments-with-MLflow">Tracking Experiments</a>
        - <a href="#Loading-saved-models">Saving and loading models</a>
    - <a href="#Track-and-save-regression-models">Track multiple experiments</a>

---

## What is Regression?
Regression is a type of supervised machine learning where the goal is to predict a continuous target variable based on one or more input features. Unlike classification, which assigns discrete labels, regression models estimate numeric values.

It is also _"a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a __dependent variable__ and one or more __independent variable__s (or 'predictors')."_ — from: <a href="https://en.wikipedia.org/wiki/Regression_analysis">Wiki</a>.

While traditional statistical regression focuses on understanding relationships between variables, in Data Science, regression is primarily used for predictive modeling.

## Why is it important?
Regression is one of the most fundamental techniques in machine learning, widely used for predicting continuous values based on past data. It serves as a baseline for more complex models and helps uncover relationships between variables.  

_"Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning."_ — from: <a href="https://en.wikipedia.org/wiki/Regression_analysis">Wiki</a>.  

Common applications include:
- **Stock market forecasting** – Predicting future stock prices based on historical trends.
- **Salary prediction** – Estimating salaries based on education, experience, and industry.
- **Network traffic analysis** – Predicting bandwidth usage over time.
- **Traffic flow estimation** – Forecasting vehicle congestion based on historical traffic data.

## Tools
Various regression techniques exist, each suited for different types of data and problem complexity:
- **Linear regression** – Fits a straight-line relationship between variables.
- **Ridge regression** – Adds regularization to prevent overfitting in high-dimensional data.
- **LASSO regression** – Performs feature selection by shrinking coefficients of less important variables to zero.
- **Bayesian regression** – Incorporates prior knowledge into the regression model.
- **Support Vector Regression (SVR)** – Captures non-linear relationships using kernel functions.
- **XGBoost** – A gradient-boosting technique that excels in structured data prediction.

Each method has strengths and limitations, and choosing the right one depends on the dataset and problem requirements.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import pandas as pd

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

from sklearn.pipeline import Pipeline

In [None]:
def plot_pred(y, predicted):
    fig, ax = plt.subplots()
    ax.scatter(y, predicted, edgecolors='k')
    ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
    ax.set_xlabel('Measured')
    ax.set_ylabel('Predicted')
    plt.show()

    
def plot_california(ax):
    ax.scatter(medinc_train, y_train, edgecolors='k', s=10)
    ax.set_xlabel("Median income in the block group ($10,000s)")
    ax.set_ylabel("Median house value ($100,000s)")
    ax.set_xlim([0, medinc_train.max()])
    ax.set_ylim([0, y_train.max() * 1.5])
    
    
def plot_curve(estimator, param, values, ax):   
    for color, value in zip(colors, values):
        estimator = estimator.set_params(**{param: value}).fit(medinc_train, y_train)
        ax.plot(curve_x, estimator.predict(curve_x), '-', c=color, lw=2, label=value)
    plot_california(ax)
    ax.legend(loc='upper right')

    
def show_score(model, X, y, cv=10, metric=None):
    scores = cross_val_score(model, X, y, cv=cv, scoring=metric)
    return "Score: {:.2f} (+/- {:.2f})".format(scores.mean(), scores.std() * 2)

colors = ['g', 'r', 'y', 'c', 'm', 'b']

## Variations on a Theme

The traditional linear regression problem is formulated as follows:
$$ y_i = {x}_i {\beta} $$
for each observation $i$, or more compactly:
$$ {y} = {X}{\beta} $$
where:
- $ {X} $ is the matrix of observed input values (features),
- $ {y} $ is the vector of observed outputs (target variable),
- $ {\beta} $ is the weight (coefficient) vector that we aim to estimate.  

In **Ordinary Least Squares (OLS)** regression, we estimate ${\beta}$ by minimizing a loss function, specifically the **sum of squared residuals (SSR)**:
$$ \mathrm{Cost}({\beta}) = \mathrm{SSR}({\beta}) = \sum _i (\hat y_i - y_i)^{2}. $$
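
To make the least-squares objective concrete, here is a minimal sketch on synthetic data. It uses `np.linalg.lstsq`, which returns the coefficient vector that minimizes the sum of squared residuals. The data, the "true" coefficients `beta_true`, and the noise level are assumptions made up for this illustration; they are not part of the housing example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix (assumed): an intercept column plus two features
n_samples = 200
X_demo = np.column_stack([np.ones(n_samples), rng.normal(size=(n_samples, 2))])
beta_true = np.array([1.0, 2.0, -3.0])
y_demo = X_demo @ beta_true + rng.normal(scale=0.1, size=n_samples)

# lstsq returns the beta that minimizes ||X beta - y||^2, i.e. the SSR
beta_hat, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

residuals = X_demo @ beta_hat - y_demo
ssr = float(np.sum(residuals ** 2))

print("Estimated coefficients:", np.round(beta_hat, 3))
print("Sum of squared residuals:", round(ssr, 3))
```

With little noise and many samples, the estimated coefficients should land close to `beta_true`.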

However, standard OLS can suffer from **overfitting**, especially when dealing with high-dimensional data. To address this, various **regularized** regression methods modify the loss function by adding a penalty on the size of the coefficients. These include:

- **Ridge regression** (L2 regularization), which adds the sum of the squared coefficients (with $i$ indexing observations and $j$ indexing coefficients):
  $$ \mathrm{Cost}({\beta}) = \sum _i (\hat y_i - y_i)^{2} + \alpha \sum _j \beta _j^{2}. $$

- **LASSO regression** (L1 regularization), which adds the sum of the absolute values of the coefficients:
  $$ \mathrm{Cost}({\beta}) = \sum _i (\hat y_i - y_i)^{2} + \alpha \sum _j \vert \beta _j \vert. $$

### **Why does this matter?**
This technique, known as <a href="https://en.wikipedia.org/wiki/Regularization_(mathematics)">**regularization**</a>, helps prevent **overfitting**, which occurs when a model learns noise in the training data rather than generalizable patterns.  

- **Ridge regression** prevents extreme coefficient values but does not force any coefficients to be exactly zero.
- **LASSO regression** can shrink some coefficients to zero, effectively performing **feature selection** by removing less important variables.
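
The difference between the two penalties can be seen on a small synthetic problem. The sketch below assumes made-up data with ten features, of which only the first two influence the target; the `alpha` values are arbitrary choices for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(42)

# Synthetic data (assumed): 10 features, only the first two matter
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

ols_demo = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=10.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

for name, model in [("OLS", ols_demo), ("Ridge", ridge_demo), ("LASSO", lasso_demo)]:
    n_zero = int(np.sum(model.coef_ == 0.0))
    print(f"{name:6s} zero coefficients: {n_zero:2d}, "
          f"coefficient norm: {np.linalg.norm(model.coef_):.3f}")
```

Ridge shrinks the coefficient vector but keeps every entry non-zero, while LASSO sets the coefficients of uninformative features to exactly zero.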

### **Example Dataset**  
To illustrate regularization, we will use the <a href="https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset">**California housing dataset**</a>, a modern alternative that provides median house values in California districts.  

However, it’s worth discussing the <a href="https://scikit-learn.org/1.0/modules/generated/sklearn.datasets.load_boston.html">**Boston housing dataset**</a>, which was widely used in the past but has been **deprecated in `scikit-learn` due to ethical concerns**. The dataset includes a feature representing the proportion of the population that is Black, which was used as a predictor for housing prices. This highlights a significant issue in data science: **historical biases in datasets can reinforce societal inequalities when used in predictive models**. Addressing bias in data is a critical part of responsible machine learning.  

For more details on Ridge and LASSO regression, check out this <a href="https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/">Analytics Vidhya tutorial</a>.  

---

## Loading the California housing dataset

In [None]:
from sklearn.datasets import fetch_california_housing

In [None]:
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_trimmed, y_trimmed = X[y < 5.0], y[y < 5.0]
X_train, X_test, y_train, y_test = train_test_split(X_trimmed, y_trimmed, random_state=42)

medinc = X_trimmed[:, 0][:, np.newaxis]
medinc_train = X_train[:, 0][:, np.newaxis]
medinc_test = X_test[:, 0][:, np.newaxis]

curve_x = np.linspace(0, 20, num=300)[..., np.newaxis]

print(housing.DESCR)

## Linear regression - OLS

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

In [None]:
ols = Pipeline([('poly', PolynomialFeatures()), 
                ('ols', LinearRegression())])
parameters = {'poly__degree': range(1,16)}
ols_grid = GridSearchCV(ols, 
                        parameters, 
                        cv=5,
                        n_jobs=2, 
                        scoring='neg_mean_squared_error')
ols_grid.fit(medinc_train, y_train)

In [None]:
ols_grid.best_estimator_

In [None]:
ols_grid.best_params_

In [None]:
show_score(ols_grid.best_estimator_, medinc_test, y_test, metric='neg_mean_squared_error')

In [None]:
y_hat = ols_grid.best_estimator_.predict(medinc_test)
plot_pred(y_test, y_hat)

Plot some example curves with different polynomial degrees.

In [None]:
fig, ax = plt.subplots()
plot_curve(ols, 'poly__degree', [1, 2, 3, 5, 13], ax)

## Ridge regression

In [None]:
import sklearn
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

In [None]:
ridge = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=5)), 
    ('ridge', Ridge())
])
params = {'ridge__alpha': np.logspace(-1, 13, 29)}
ridge_grid = GridSearchCV(ridge, 
                          params, 
                          cv=5,
                          n_jobs=2, 
                          scoring='neg_mean_squared_error')
ridge_grid.fit(medinc_train, y_train)

In [None]:
ridge_grid.best_params_

Available scorers are:

In [None]:
print(' -', '\n - '.join(sklearn.metrics.get_scorer_names()))

In [None]:
show_score(ridge_grid.best_estimator_, medinc_test, y_test, metric='neg_mean_squared_error')

In [None]:
y_hat = ridge_grid.best_estimator_.predict(medinc_test)
plot_pred(y_test, y_hat)

Plot some example curves to see how the regularization parameter "deforms" the degree-5 polynomial we saw in the previous plot.

In [None]:
fig, ax = plt.subplots()
plot_curve(ridge, 'ridge__alpha', [1e-13, 1e-6, 1e-1, 1e0, 1e2], ax)

## LASSO

LASSO stands for **least absolute shrinkage and selection operator**.

In [None]:
from sklearn.linear_model import Lasso

In [None]:
lasso = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=5)), 
    ('lasso', Lasso(max_iter=100_000))
])
params = {'lasso__alpha': np.logspace(-5, 13, 19)}
lasso_grid = GridSearchCV(lasso, 
                          params, 
                          cv=5,
                          scoring='neg_mean_squared_error')
lasso_grid.fit(medinc_train, y_train)

In [None]:
show_score(lasso_grid.best_estimator_, medinc_test, y_test, metric='neg_mean_squared_error')

In [None]:
y_hat = lasso_grid.best_estimator_.predict(medinc_test)
plot_pred(y_test, y_hat)

LASSO also serves as a **feature selection** tool. By setting the alpha parameter high enough, it drives some coefficients to **exactly zero**, effectively removing less important features from the model. However, if alpha is set too high, too many features are eliminated, leading to **underfitting**, where the model becomes too simple to capture the underlying patterns in the data.

In [None]:
coefs = pd.DataFrame()
pipe = Pipeline([('poly', PolynomialFeatures(degree=5)),
                 ('lasso', Lasso(max_iter=100_000))])

for alpha in np.logspace(-5, 13, 19):
    pipe = pipe.set_params(lasso__alpha=alpha).fit(medinc_train, y_train)
    coefs[alpha] = pipe.named_steps['lasso'].coef_[1:]

coefs.T

In [None]:
fig, ax = plt.subplots()
plot_curve(lasso, 'lasso__alpha', [1e-5, 1e-2, 1e-1, 1e1, 1e8], ax)

## Bayesian Ridge Regression

Bayesian Ridge Regression is similar to standard Ridge Regression, but instead of fixing the $\ell_{2}$ regularization strength in advance (the $\alpha$ of the Ridge cost above, written $\lambda$ here), it treats $\lambda$ as a random variable and estimates it from the data within a Bayesian framework. This lets the model account for **uncertainty** in the regularization strength and adapt it to each dataset.
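
As a rough sketch of what "estimated from the data" looks like in practice, the example below fits scikit-learn's `BayesianRidge` on synthetic one-dimensional data (assumed for illustration). In scikit-learn's parameterization, the fitted attribute `alpha_` is the estimated precision of the noise and `lambda_` is the estimated precision of the weights. Passing `return_std=True` to `predict` also returns a standard deviation for each prediction.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic 1-D data (assumed): y = 2x + noise
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=100)

bayes_demo = BayesianRidge().fit(X_demo, y_demo)

# Precisions estimated from the data rather than fixed by hand
print("Estimated noise precision (alpha_):", bayes_demo.alpha_)
print("Estimated weight precision (lambda_):", bayes_demo.lambda_)

# return_std=True also returns the standard deviation of the predictive distribution
X_new = np.array([[2.0], [5.0], [20.0]])
y_mean, y_std = bayes_demo.predict(X_new, return_std=True)
for x, m, s in zip(X_new[:, 0], y_mean, y_std):
    print(f"x={x:5.1f}  prediction={m:6.2f} +/- {s:.2f}")
```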

In [None]:
from sklearn.linear_model import BayesianRidge

In [None]:
bayes = Pipeline([('poly', PolynomialFeatures(degree=5)), 
                  ('bayes', BayesianRidge())])
params = {'bayes__alpha_1': np.logspace(-5, 5, 5),
          'bayes__alpha_2': np.logspace(-5, 13, 5),
          'bayes__lambda_1': np.logspace(-5, 13, 5),
          'bayes__lambda_2': np.logspace(-5, 13, 5)}
bayes_grid = GridSearchCV(bayes, 
                          params,
                          cv=5,
                          scoring='neg_mean_squared_error')
bayes_grid.fit(medinc_train, y_train)

In [None]:
show_score(bayes_grid.best_estimator_, medinc_test, y_test, metric='neg_mean_squared_error')

In [None]:
y_hat = bayes_grid.best_estimator_.predict(medinc_test)
plot_pred(y_test, y_hat)

In [None]:
fig, ax = plt.subplots()
plot_curve(bayes, 'bayes__alpha_1', [1e-5, 1e-2, 1e-1, 1e1, 1e2], ax)

## Support Vector Regression

Support Vector Machines (SVMs) are commonly used for classification, but they can also be adapted for regression tasks. This approach, called **Support Vector Regression (SVR)**, maintains the core idea of SVMs: instead of minimizing the error directly, it attempts to fit a model that ignores deviations within a certain margin while still penalizing larger errors.

The key steps in SVR are:  
a) Identify a subset of **support vectors**, which are the most influential training points that define the regression function.  
b) Fit a linear model that best captures the relationship between input and output while keeping deviations within a specified margin.  
c) (For nonlinear relationships) Transform data points into a **higher-dimensional space** where a linear model can be fitted more effectively.  
d) Instead of explicitly transforming the data (which can be computationally expensive), apply **kernel functions** to implicitly operate in the higher-dimensional space.  

This technique allows SVR to model both **linear and nonlinear relationships** while being robust to outliers and generalizing well to unseen data.  
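
The idea of ignoring deviations within a margin is usually expressed as the epsilon-insensitive loss, $\max(0, |y - \hat y| - \varepsilon)$. The NumPy sketch below is only meant to show the shape of this loss; the helper name `epsilon_insensitive_loss` and the sample values are made up for the example. In scikit-learn's `SVR`, the width of the margin is controlled by the `epsilon` parameter.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)


y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.05, 2.5, 3.0, 2.0])

losses = epsilon_insensitive_loss(y_true_demo, y_pred_demo, epsilon=0.1)
print(losses)
```

The first and third predictions fall inside the tube and cost nothing; the other two are penalized in proportion to how far they lie outside it.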

In [None]:
from sklearn.svm import SVR

In [None]:
svr = SVR(kernel='rbf', C=1e3, gamma=5e-5, degree=5)
svr.fit(medinc_train, y_train)
show_score(svr, medinc, y_trimmed, metric='neg_mean_squared_error')

In [None]:
y_hat = svr.predict(medinc_test)
plot_pred(y_test, y_hat)

In [None]:
fig, ax = plt.subplots()
plot_curve(svr, 'kernel', ['linear', 'poly', 'rbf'], ax)

In [None]:
fig, ax = plt.subplots()
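# The degree parameter only affects the polynomial kernel, so switch to it first
plot_curve(svr.set_params(kernel='poly'), 'degree', [2, 3, 4, 5], ax)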

## [XGBoost](https://xgboost.readthedocs.io/en/latest/model.html)

XGBoost (**Extreme Gradient Boosting**) is an optimized gradient-boosted tree algorithm designed for efficiency and scalability. Boosted trees are an **ensemble method**: the final prediction combines many trees trained on the same training set, which gives a more robust solution than any single tree. The model itself, a collection of trees, has the same form as a random forest; the difference lies in how it is trained.

XGBoost uses **additive training**: it builds the ensemble iteratively, adding one tree per step, with each new tree correcting the errors of the trees before it. The objective function includes a **regularization term** to prevent overfitting, so the best tree at each step is the **simplest tree** (minimal structure score) **with the most information gain**.

### How XGBoost Works  

1. **Initialize a weak model**  
   The algorithm starts with an initial prediction (e.g., the mean of the target variable).  

2. **Compute Residuals (Pseudo-Residuals)**  
   Instead of directly fitting the target variable, each new tree is fit to the negative gradient of the loss with respect to the current predictions. For squared-error loss this is simply the residual:  
   $$ r_i = y_i - \hat{y}_i $$  
   These residuals indicate how much improvement is needed in the next iteration.  

3. **Train a New Decision Tree**  
   A new tree is trained to predict these residuals. This tree learns to minimize the loss function by capturing patterns in the errors.  

4. **Compute Leaf Weights**  
   Each leaf of the tree gets a weight that determines how much it contributes to the final prediction. XGBoost optimizes these weights using a second-order Taylor approximation, making it faster and more stable.  

5. **Update the Predictions**  
   The new tree's output is scaled by a learning rate **(shrinkage factor, $\eta$)** and added to the previous prediction:  
   $$ \hat{y}_i^{(t+1)} = \hat{y}_i^{(t)} + \eta f_t(x_i) $$  
   This gradual update process helps prevent overfitting.  

6. **Apply Regularization**  
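   XGBoost adds a penalty on tree complexity to the training loss. Each tree $f_t$ is penalized for its number of leaves $T_t$ and for the size of its leaf weights $w_t$, using L2 (Ridge-style, $\lambda$) and L1 (LASSO-style, $\alpha$) terms:  
   $$ \mathrm{Obj} = \sum_i (y_i - \hat{y}_i)^2 + \sum_t \left( \gamma T_t + \tfrac{1}{2} \lambda \lVert w_t \rVert^2 + \alpha \lVert w_t \rVert_1 \right) $$  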

7. **Repeat Until Convergence**  
   Steps 2-6 are repeated, adding new trees iteratively until a stopping criterion (e.g., number of trees, early stopping) is met.  
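
The loop described above can be sketched with plain gradient boosting for squared-error loss. This is not XGBoost's actual implementation: it uses scikit-learn's `DecisionTreeRegressor` as the weak learner and leaves out the second-order leaf weights (step 4) and the regularization term (step 6). The synthetic data and the values of `eta`, `n_rounds`, and `max_depth` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic 1-D regression problem (assumed for illustration)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=300)

eta = 0.1        # learning rate (shrinkage factor)
n_rounds = 50    # number of boosting rounds

# Step 1: start from a constant prediction, the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]
trees = []

for _ in range(n_rounds):
    # Step 2: residuals of the current ensemble (negative gradient of squared error)
    residuals = y_demo - prediction
    # Step 3: fit a small tree to the residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    trees.append(tree)
    # Step 5: add the shrunken tree output to the running prediction
    prediction = prediction + eta * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Each round removes part of the remaining error, so the training error never increases from one round to the next. A small `eta` makes each step more cautious, which is why more rounds are needed.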

### Key Optimizations in XGBoost  

- **Gradient and Hessian-based Optimization**: Uses both first and second derivatives of the loss function for more efficient updates.  
- **Feature Importance**: Evaluates which features contribute most to predictions.  
- **Handling Missing Values**: Automatically learns optimal splits for missing data.  
- **Parallel Processing**: Splits computations across multiple cores for faster training.  
- **Early Stopping**: Stops training when validation performance stops improving.  

For further details, check the [XGBoost documentation](https://xgboost.readthedocs.io/en/latest/model.html).  
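
To illustrate the gradient and Hessian-based leaf weights, the sketch below evaluates the closed-form weight $w^* = -G / (H + \lambda)$ for a single leaf, where $G$ and $H$ are the sums of first and second derivatives of the loss over the samples in that leaf. The formula follows the derivation in the XGBoost tutorial; the helper name `optimal_leaf_weight` and the sample values are assumptions made for this example.

```python
import numpy as np


def optimal_leaf_weight(grad, hess, reg_lambda=1.0):
    """Leaf weight minimizing the second-order approximation of the objective."""
    G, H = np.sum(grad), np.sum(hess)
    return -G / (H + reg_lambda)


# Samples that fall into one leaf (values assumed for illustration)
y_leaf = np.array([3.0, 4.0, 5.0])
y_hat_leaf = np.array([2.0, 2.0, 2.0])   # current ensemble prediction

# Squared-error loss l = 0.5 * (y - y_hat)^2 gives g = y_hat - y and h = 1
grad = y_hat_leaf - y_leaf
hess = np.ones_like(y_leaf)

w_no_reg = optimal_leaf_weight(grad, hess, reg_lambda=0.0)
w_reg = optimal_leaf_weight(grad, hess, reg_lambda=1.0)
print(w_no_reg, w_reg)
```

Without regularization the weight equals the mean residual in the leaf; a positive $\lambda$ shrinks it toward zero.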

### Key Advantages of XGBoost  

- **Speed & Efficiency**: Optimized for parallel computation, handling large datasets efficiently.  
- **Built-in Regularization**: Helps in avoiding overfitting, making it more generalizable.  
- **Feature Importance**: Provides built-in methods to understand which features are the most influential.  
- **Handles Missing Values**: Learns a default split direction for missing entries, so no separate imputation step is required.  
- **Scalability**: Works well with large-scale datasets and supports distributed computing.  

### Installation  
To install XGBoost, you can use:  
```bash
conda activate szisz_ds_2025
pip install xgboost
```

For a more detailed explanation, refer to these [slides](https://web.njit.edu/~usman/courses/cs675_spring20/BoostedTree.pdf), this [tutorial](https://xgboost.readthedocs.io/en/latest/tutorials/model.html), or this [wiki page](https://en.wikipedia.org/wiki/Gradient_boosting) on gradient boosting.

In [None]:
from xgboost.sklearn import XGBRegressor

In [None]:
xgb = XGBRegressor()
xgb.fit(medinc_train, y_train)
y_hat = xgb.predict(medinc_test)
show_score(xgb, medinc, y_trimmed, metric='neg_mean_squared_error')

In [None]:
plot_pred(y_test, y_hat)

In [None]:
fig, ax = plt.subplots()
plot_curve(xgb, 'n_estimators', [1, 5, 10, 25, 100], ax)

## Feature Selection & Engineering for Regression

Effective feature selection and engineering are crucial for improving the performance and interpretability of regression models. Poorly chosen features can lead to **multicollinearity**, overfitting, or poor generalization, while well-engineered features can improve model accuracy and efficiency.

### Handling Multicollinearity

Multicollinearity occurs when two or more predictor variables are highly correlated, leading to unstable coefficient estimates in regression models. This can reduce the interpretability of the model and increase variance in coefficient estimates. To address multicollinearity:
- **Variance Inflation Factor (VIF)**: Compute VIF for each feature; high values (typically >10) indicate severe multicollinearity.
- **Principal Component Analysis (PCA)**: Transform correlated variables into uncorrelated principal components.
- **Feature Selection**: Remove one of the correlated features if both convey similar information.
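
The VIF of a feature can be computed by hand as $1 / (1 - R_j^2)$, where $R_j^2$ is obtained by regressing feature $j$ on all the other features. The sketch below does this on synthetic data (assumed for illustration). The helper `vif_from_r2` is a hypothetical name, and because it always fits an intercept, its values can differ from those of other implementations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def vif_from_r2(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of `a`
c = rng.normal(size=500)                  # independent of both
X_demo = np.column_stack([a, b, c])

vifs_demo = vif_from_r2(X_demo)
print(np.round(vifs_demo, 2))
```

The two nearly identical columns get a very large VIF, while the independent column stays close to 1.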

### Interaction Terms

Interaction terms capture relationships between two or more features that may have a combined effect on the target variable. Instead of treating each feature independently, we create new features by multiplying or combining them:
- **Example**: In a real estate model, `House Size` and `Number of Bedrooms` may interact, affecting price differently than when considered separately.
- **Polynomial Features**: Higher-order terms like $ x_1 x_2 $ or $ x_1^2 $ can capture nonlinear effects.

### Feature Scaling

Some regression models, particularly regularized methods like Ridge and LASSO, require proper scaling to ensure fair treatment of all features:
- **Standardization (Z-score normalization)**: Centers data to mean 0 and scales to unit variance.
- **Min-Max Scaling**: Rescales features to a fixed range, typically [0,1].
- **Robust Scaling**: Uses median and interquartile range, reducing sensitivity to outliers.
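
The three scalers can be compared on a tiny example. The sketch below applies scikit-learn's `StandardScaler`, `MinMaxScaler`, and `RobustScaler` to one made-up feature that contains a single large outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with a single large outlier (values assumed for illustration)
x_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(x_demo)
min_max = MinMaxScaler().fit_transform(x_demo)
robust = RobustScaler().fit_transform(x_demo)

for name, scaled in [("standard", standardized), ("min-max", min_max), ("robust", robust)]:
    print(f"{name:9s}", np.round(scaled.ravel(), 2))
```

With standardization and min-max scaling, the outlier compresses the remaining values into a narrow band. Robust scaling keeps the bulk of the data spread out around zero.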

### Encoding Categorical Variables

Regression models typically require numerical input, so categorical features must be converted into a numerical representation:
- **One-Hot Encoding**: Converts categorical variables into binary (0/1) columns.
- **Ordinal Encoding**: Assigns integer values to ordered categories.
- **Target Encoding**: Replaces categories with their mean target value (useful but prone to leakage).
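
A minimal sketch of the three encodings on a toy table follows. The column names and values are made up for illustration. The target encoding is computed with a plain pandas `groupby` on the full toy data, which is exactly the kind of shortcut that causes leakage on real data; there, the category means should come from the training folds only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data (assumed for illustration)
df_demo = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "size": ["small", "large", "medium", "small"],
    "price": [10.0, 30.0, 20.0, 12.0],
})

# One-hot encoding: one binary column per category
one_hot = OneHotEncoder().fit_transform(df_demo[["region"]]).toarray()

# Ordinal encoding: the category order is passed explicitly
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(df_demo[["size"]]).ravel()

# Target encoding: replace each category with its mean target value
target_means = df_demo.groupby("region")["price"].mean()
region_target = df_demo["region"].map(target_means).to_numpy()

print(one_hot)
print(size_codes)
print(region_target)
```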

Properly handling feature selection and engineering can significantly improve model performance and interpretability, making regression models more robust and generalizable.

In [None]:
# Handling Multicollinearity - Variance Inflation Factor (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(X):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = housing.feature_names
    vif_data["VIF"] = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
    return vif_data

vif_df = calculate_vif(X_train)
print(vif_df.sort_values(by="VIF", ascending=False))  # High VIF indicates multicollinearity

In [None]:
# Creating Interaction Terms
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_interaction = poly.fit_transform(X_train)
X_test_interaction = poly.transform(X_test)

print("Original feature count:", X_train.shape[1])
print("Feature count after adding interactions:", X_train_interaction.shape[1])

---

## Managing Model Lifecycle

### Reusing Trained Pipelines

Trained pipelines can be used outside of the training environment, making it possible to deploy models, share them across teams, or resume training later.

#### Saving Pipelines

First, we need to **serialize** the model. This process saves the entire pipeline object into a file, allowing us to move and reload it elsewhere.  

**Important:** The libraries and their versions used during saving and loading must be identical to ensure compatibility.

In [None]:
import pickle

with open('xgboost_model.pickle', 'wb') as picklefile:
    pickle.dump(xgb, picklefile)

#### Loading Pipelines

Loading and using a saved pipeline is straightforward, provided that the same libraries (with identical versions) are installed. This ensures compatibility and prevents potential errors due to differences in model serialization formats.
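
One possible way to guard against version mismatches is to store the library version next to the model and check it when loading. The sketch below is an assumption about how this could be done, not an established convention. It uses a small `LinearRegression` as a stand-in model and an in-memory buffer instead of a file, and the dictionary keys `"model"` and `"sklearn_version"` are names chosen for this example.

```python
import io
import pickle

import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression

# A small stand-in model (assumed for illustration)
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo[:, 0] + 1.0
model_demo = LinearRegression().fit(X_demo, y_demo)

# Bundle the model with the library version used to train it
bundle = {"model": model_demo, "sklearn_version": sklearn.__version__}

# An in-memory buffer stands in for a file on disk
buffer = io.BytesIO()
pickle.dump(bundle, buffer)
buffer.seek(0)

loaded = pickle.load(buffer)
if loaded["sklearn_version"] != sklearn.__version__:
    raise RuntimeError("Model was saved with a different scikit-learn version")

restored_model = loaded["model"]
print(restored_model.predict(np.array([[20.0]])))
```

Note that unpickling executes code from the file, so only load pickle files from sources you trust.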

In [None]:
import pickle

with open('xgboost_model.pickle', 'rb') as picklefile:
    model = pickle.load(picklefile)

In [None]:
show_score(model, medinc, y_trimmed, metric='neg_mean_squared_error')

### Tracking sklearn Models

A common mistake, even among experienced professionals, is training models without properly tracking experiments. When multiple pipeline configurations, hyperparameters, and models are tested, it becomes difficult to remember which combination performed best. To avoid this, a tracking solution like MLflow can be used.

#### What is <a href="https://mlflow.org/">MLflow</a>?

From its <a href="https://mlflow.org/docs/latest/index.html">documentation</a>:  

_"MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It covers four key areas:"_

- _Tracking experiments to log and compare parameters and results (MLflow Tracking)._
- _Packaging ML code in a reproducible format for collaboration and deployment (MLflow Projects)._
- _Managing and serving models across different ML frameworks and environments (MLflow Models)._
- _Providing a central repository for model versioning, stage transitions, and collaboration (MLflow Model Registry)._

#### Tracking Experiments with MLflow

To track your experiments with MLflow:

1. Install MLflow:
    ```bash
    conda activate szisz_ds_2025
    pip install mlflow
    ```
2. Start the MLflow tracking server:
    ```bash
    mlflow ui
    ```
3. Use the MLflow library to log and manage experiments.
4. Open the tracking UI in your browser:
    ```
    http://localhost:5000
    ```

Once the tracking server is running, you can monitor and compare your experiments through the MLflow UI.

In [None]:
import mlflow
import mlflow.sklearn

In [None]:
with mlflow.start_run(run_name="xgboost-default"):
    xgb = XGBRegressor()
    xgb.fit(medinc_train, y_train)
    
    # Log parameter values
    for param, val in xgb.get_params().items():
        mlflow.log_param(param, val)
    
    # Log metrics of the run
    predictions = xgb.predict(medinc_test)
    r2 = r2_score(y_test, predictions)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))  # RMSE is the square root of the MSE
    ev = explained_variance_score(y_test, predictions)
    
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("ev", ev)
    
    # Log pictures
    fig, ax = plt.subplots()
    plot_curve(xgb, 'n_estimators', [1, 5, 10, 25, 100], ax)
    fig.savefig('xgboost_default_model_curve.png', transparent=True)
    mlflow.log_artifact('xgboost_default_model_curve.png')
    
    # Log the model itself
    mlflow.sklearn.log_model(xgb, "xgboost_default_model", input_example=medinc_train[:10])

#### Loading saved models

Exported models can be loaded later. Check the logged model details in the MLflow UI to get the model path:
<img src="pics/mlflow_ui_model_details.png" width=500>

In [None]:
xgb_loaded = mlflow.sklearn.load_model("path/from/the/mlflow/ui")
show_score(xgb_loaded, medinc, y_trimmed, metric='neg_mean_squared_error')

### Track and save regression models

Use the pipelines we built previously to:
- track them using MLflow (kudos for using functions and/or loops)
- compare the results in the MLflow UI

----