<font color='navy'> 
    
# MLflow Recipes for Data Preprocessing and Model Training  

</font>

## Introduction
Machine learning (ML) projects often involve various data preprocessing steps and model training phases. In this lab, we will demonstrate how to use MLflow Recipes to streamline and automate these processes. MLflow Recipes provide a convenient way to define and execute reusable steps for data ingestion, preprocessing, model training, and more. We will apply MLflow Recipes to the Customer Churn dataset, exploring data preprocessing techniques and training machine learning models for churn prediction.



## Objectives
The main objectives of this lab are:

1. To introduce the concept of MLflow Recipes for automating machine learning workflows.
2. To demonstrate how to use MLflow Recipes to define and execute data preprocessing and model training steps.
3. To compare and evaluate the performance of different machine learning models for churn prediction.


## Tools and Libraries
For this lab, we will use the following tools and libraries:

1. Python 3.x
2. Jupyter Notebook
3. Pandas library for data manipulation and analysis
4. Numpy library for mathematical operations
5. Scikit-learn library for machine learning algorithms
6. Matplotlib library for data visualization
7. Seaborn library for data visualization
8. MLflow for managing machine learning experiments and pipelines

## Data
We will use the Customer Churn dataset for this lab. This dataset contains information about customers, including attributes like customer age, contract duration, and monthly charges. The goal is to predict whether a customer will churn (leave) or not based on these attributes.


## 1. Importing Libraries and Loading Data
Let's start by importing the necessary libraries. The Customer Churn dataset itself is loaded later by the recipe's `ingest` step.

In [1]:
#! pip install mlflow
#! pip install "flaml[automl]"

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import MinMaxScaler
import os
from mlflow.recipes import Recipe
import mlflow


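## 2. Enable Autoreload

In [3]:
# Enable IPython's autoreload extension so that edits to the recipe's step modules (e.g. train.py) are picked up without restarting the kernel
%load_ext autoreload
%autoreload 2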

## 3. Creating the MLflow Recipe


#### 3.1 Defining the steps
Create an MLflow Recipe and define the steps for data preprocessing and model training.

In [4]:
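# Create an MLflow Recipe using the "local" profile
r = Recipe(profile="local")

# Clean the recipe (remove cached outputs of previous step runs)
r.clean()

# Inspect the recipe (display an overview of its steps)
r.inspect()

# Run the 'ingest' step to load the dataset
r.run("ingest")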



MlflowException: Failed to find recipe.yaml!
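
The error above means that `Recipe` could not locate a `recipe.yaml` file, which it expects to find from the notebook's working directory. The sketch below is not part of MLflow; it uses a hypothetical helper, `find_recipe_root`, and assumes that the file may sit in the working directory or in one of its parents. It can be run before creating the `Recipe` to check where the notebook was started from.

```python
from pathlib import Path
from typing import Optional


def find_recipe_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains recipe.yaml.

    Hypothetical helper for diagnosing the error; returns None if nothing is found.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / "recipe.yaml").is_file():
            return candidate
    return None


root = find_recipe_root(Path.cwd())
if root is None:
    print("No recipe.yaml found; start the notebook from inside the recipe directory.")
else:
    print(f"Recipe root: {root}")
```

If no root is found, changing into the recipe directory (for example with `os.chdir`) before calling `Recipe(profile="local")` is one way to resolve the error.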

#### 3.2 Exploratory Data Analysis
Perform some basic Exploratory Data Analysis (EDA) on the ingested dataset. This step includes visualizing data distributions and relationships.

In [None]:
# Perform some basic EDA on the ingested dataset
ingested_data = r.get_artifact("ingested_data")

# Name of the target column; "Churn?" is assumed here, so adjust it to match your dataset
target_col = "Churn?"

# Iterate through columns for EDA
for col in ingested_data.columns:
    if col == target_col:  # Exclude the target variable
        continue

    plt.figure(figsize=(10, 6))

    if ingested_data[col].dtype == "object":  # Check if the column is non-numeric
        sns.countplot(data=ingested_data, x=col)
        plt.title(f'Distribution of {col}')
        plt.xticks(rotation=45)  # Rotate x-axis labels for readability
    else:
        # Numeric column: compare its distribution across the target classes
        sns.boxplot(data=ingested_data, x=target_col, y=col)
        plt.title(f'Distribution of {col} by Churn')

    plt.show()


#### 3.3 Data Splitting
Run the 'split' step to split the data into training, validation, and test sets.



In [None]:
# Run the 'split' step
r.run("split")
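
To make the idea behind the split step concrete, here is a minimal sketch of a seeded three-way random split in plain Python. It is not MLflow's implementation: the function name `three_way_split` and the ratios 0.75/0.125/0.125 are assumptions chosen for illustration, and in a real recipe the ratios would come from the recipe configuration.

```python
import random
from typing import Iterable, List, Sequence, Tuple


def three_way_split(
    rows: Iterable,
    ratios: Sequence[float] = (0.75, 0.125, 0.125),  # assumed ratios, for illustration
    seed: int = 42,
) -> Tuple[List, List, List]:
    """Shuffle rows reproducibly and cut them into train/validation/test lists."""
    if len(ratios) != 3 or abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must contain three values that sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, validation, test


train, validation, test = three_way_split(range(1000))
print(len(train), len(validation), len(test))
```

Keeping a validation set separate from the test set lets us tune models without touching the data used for the final performance estimate.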

#### 3.4 Data Transformation
Run the 'transform' step for data transformation and feature scaling.



In [None]:
# Run the 'transform' step
r.run("transform")
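
As an illustration of what feature scaling does, the sketch below applies min-max scaling to a single numeric column in plain Python. It is a simplified stand-in for a scaler such as scikit-learn's `MinMaxScaler`, not the recipe's actual transform code; the function name `min_max_scale` and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def min_max_scale(
    values: Sequence[float],
    feature_range: Tuple[float, float] = (0.0, 1.0),
) -> List[float]:
    """Linearly rescale values so the minimum maps to the low end of
    feature_range and the maximum maps to the high end."""
    lo, hi = min(values), max(values)
    out_lo, out_hi = feature_range
    if hi == lo:
        # A constant column carries no spread; map every value to the low end
        return [out_lo for _ in values]
    return [(v - lo) / (hi - lo) * (out_hi - out_lo) + out_lo for v in values]


# Hypothetical monthly charges, used only to show the effect of scaling
monthly_charges = [20.0, 45.0, 70.0, 120.0]
print(min_max_scale(monthly_charges))
```

In practice the minimum and maximum should be learned from the training set only and then reused on the validation and test sets, which is what fitting a transformer on the training split achieves.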

#### 3.5 Data Training
Run the 'train' step to train machine learning models.



In [None]:
# Run the 'train' step
r.run("train")

#### 3.6 Model Evaluation
Evaluate the trained models.


In [None]:
r.run("evaluate")

#### 3.7 Model Registry
Register and store the trained models.


In [None]:
r.run("register")

## Task 1: ML Models Performance Comparison

**Purpose:** Compare the performance of multiple machine learning models for churn prediction.

**Instructions:**

1. In the `3.5 Data Training` section, modify `train.py` to train classifiers other than Random Forests. For instance, consider using models like Logistic Regression.
2. After training the additional models, conduct a thorough evaluation of their performance on the test dataset.
3. Compare the performance of each machine learning model run. Ensure you have recorded the run ID for each model.
4. Provide explanatory comments to explain the significance of model comparison and highlight any insights derived from the evaluation.
5. Conclude by discussing which model appears to be the most suitable for the churn prediction task based on the evaluation results.
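
For instruction 1, a minimal sketch of what the modified `train.py` might contain is shown below. It assumes that `train.py` follows the MLflow Recipes template convention of exposing a function (named `estimator_fn` here) that returns an unfitted scikit-learn estimator; check your own `train.py` and `recipe.yaml` for the actual function name. The default `max_iter=1000` is an assumption chosen to reduce convergence warnings, not a tuned value.

```python
from typing import Any, Dict, Optional

from sklearn.linear_model import LogisticRegression


def estimator_fn(estimator_params: Optional[Dict[str, Any]] = None) -> Any:
    """Return an unfitted Logistic Regression estimator.

    Values in estimator_params override the defaults set below.
    """
    params: Dict[str, Any] = {"max_iter": 1000}  # assumed default, not tuned
    params.update(estimator_params or {})
    return LogisticRegression(**params)
```

Swapping the returned estimator back to `RandomForestClassifier` and rerunning the train and evaluate steps produces a separate run for each model, which can then be compared by run ID.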

Logistic Regression is unsuitable for this task: its extremely low recall (0.148) and F1 score (0.25) show that it fails to identify most churners.
Random Forest is the most suitable model, as it performs better on the metrics that matter for imbalanced classification: recall (0.475), F1 score (0.617), and precision-recall AUC (0.756).
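
To see why a low recall drags the F1 score down even when precision is reasonable, the following sketch computes precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical and are not taken from the runs in this lab; `precision_recall_f1` is an illustrative helper, not an MLflow function.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive (churn) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 100 actual churners, of which the model finds only 15,
# while raising 5 false alarms
p, r, f1 = precision_recall_f1(tp=15, fp=5, fn=85)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, it stays close to the smaller of the two, so a model that misses most churners cannot achieve a high F1 score.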

## Task 2: Automated Model Selection with automl/flaml


**Purpose:** Utilize the FLAML (Fast Library for Automated Machine Learning & Tuning) library for automated model selection.

**Instructions:**

1. Install the FLAML Python library using the command: `pip install "flaml[automl]"`.
2. Configure FLAML to perform automated machine learning tasks, including hyperparameter tuning, algorithm selection, and model evaluation. Hint: you need to modify the `recipe.yaml` file.
3. Discuss the benefits and limitations of using automated model selection with FLAML in comparison to manual model selection.
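
As a hint for instruction 2, the sketch below expresses the train-step settings as a Python dictionary that mirrors the nesting of `recipe.yaml`. It assumes that the train step selects FLAML through a `using: automl/flaml` entry and limits the search time with `time_budget_secs`; verify both key names against the documentation for your MLflow version. The helper `flaml_train_settings` is hypothetical and only illustrates the structure.

```python
from typing import Any, Dict


def flaml_train_settings(time_budget_secs: int = 30) -> Dict[str, Any]:
    """Build train-step settings shaped like the corresponding recipe.yaml section.

    Key names are assumptions; check them against the MLflow Recipes documentation.
    """
    if time_budget_secs <= 0:
        raise ValueError("time_budget_secs must be positive")
    return {
        "steps": {
            "train": {
                "using": "automl/flaml",  # assumed value that selects FLAML
                "time_budget_secs": time_budget_secs,  # assumed search-time limit
            }
        }
    }


settings = flaml_train_settings(time_budget_secs=60)
print(settings["steps"]["train"])
```

A longer time budget lets FLAML explore more candidate models and hyperparameters, at the cost of a slower train step.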

In comparison to manual model selection, FLAML offers speed, systematic exploration, and strong baseline performance, but at the cost of reduced interpretability, lower transparency, and reliance on correct metric and validation design. In this assignment, FLAML's selection of an ExtraTrees model demonstrates its effectiveness for tabular, imbalanced classification tasks, while manual model selection remains valuable for interpretability and domain-driven modeling decisions.