# Low-code Automated ML
## Introduction

This notebook was generated automatically by a low-code UI wizard from the settings you provided. Adjust those settings to fine-tune the results.

The main steps in this notebook are:

1. Load the Data
2. Featurization
3. Train an AutoML Trial with FLAML to Find the Best Model
4. Save the Final Machine Learning Model
5. Predict with the Saved Model

> [!IMPORTANT]
> **Automated ML is currently supported on Fabric Runtimes 1.2+ or any Fabric environment with Spark 3.4+.**

In [None]:
%pip install scikit-learn==1.5.1


In [None]:
import logging
import warnings
 
logging.getLogger('synapse.ml').setLevel(logging.CRITICAL)
logging.getLogger('mlflow.utils').setLevel(logging.CRITICAL)
warnings.simplefilter('ignore', category=FutureWarning)
warnings.simplefilter('ignore', category=UserWarning)

## Step 1: Load the Data

- Read the raw data from the given path and transform it according to the selected model type.


In [None]:
import re
import pandas as pd
import numpy as np

X = pd.read_csv(
    "abfss://FabricsWS@onelake.dfs.fabric.microsoft.com/AWLakeHouse.Lakehouse/Files/RawDataAdventureWorks_Sales_2017/AdventureWorks_Sales_2017.csv",
)
# Replace unsupported characters in column names with underscores so the names are valid for model training and saving
X = X.rename(columns=lambda c: re.sub('[^A-Za-z0-9_]+', '_', c))

target_col = re.sub('[^A-Za-z0-9_]+', '_', "OrderQuantity")
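The column renaming above can be illustrated on a toy frame. This is a minimal sketch: the `sanitize` helper and the demo column names are hypothetical and not part of the generated notebook; only the regular expression is taken from the cell above. Distinct names such as `a b` and `a-b` would both collapse to `a_b`, so it may be worth checking for duplicates after renaming.

```python
import re

import pandas as pd


def sanitize(name: str) -> str:
    """Replace every run of characters outside A-Z, a-z, 0-9 and _ with one underscore."""
    return re.sub("[^A-Za-z0-9_]+", "_", name)


# Hypothetical column names, chosen only to show the effect of the pattern
demo = pd.DataFrame(
    {"Order Quantity": [1, 2], "Unit-Price ($)": [9.5, 3.0], "ProductKey": [10, 11]}
)
demo = demo.rename(columns=sanitize)

sanitized_columns = list(demo.columns)
has_duplicates = demo.columns.duplicated().any()
print(sanitized_columns, has_duplicates)
```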


In [None]:
display(X)

## Step 2: Featurization

The next step is to prepare the data for training. This process involves casting the data types, handling missing values, etc. The data is then split into training and testing sets.
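As an illustration of the missing-value handling described above, here is a minimal sketch on a toy frame. The column names and values are hypothetical; the fill rules (mean for floats, median for integers, most frequent value for strings) mirror the ones this step applies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
demo = pd.DataFrame(
    {
        "price": [1.0, np.nan, 3.0],
        "qty": pd.array([2, None, 4], dtype="Int64"),
        "color": pd.array(["red", None, "red"], dtype="string"),
    }
)

filled = demo.copy()
# Floats: fill with the column mean
filled["price"] = filled["price"].fillna(filled["price"].mean())
# Integers: fill with the column median (cast to int to keep the Int64 dtype)
filled["qty"] = filled["qty"].fillna(int(filled["qty"].median()))
# Strings: fill with the most frequent value
filled["color"] = filled["color"].fillna(filled["color"].mode()[0])

print(filled)
```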


In [None]:
# Helper functions for featurization

import pandas as pd


def fillna_based_on_dtype(df):
    """Fill missing values in each column with a dtype-appropriate statistic."""
    df = df.copy()  # Work on a copy so the caller's DataFrame is left untouched
    for col in df.columns:
        if pd.api.types.is_float_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())  # Fill NaN with mean for floats
        elif pd.api.types.is_integer_dtype(df[col]) or pd.api.types.is_bool_dtype(df[col]):
            # Fill NaN with median for integers and booleans, then cast back to nullable Int64 when possible
            df[col] = df[col].astype('float64', errors="ignore").fillna(df[col].median()).astype('Int64', errors="ignore")
        elif pd.api.types.is_datetime64_any_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mode()[0])  # Fill NaT with mode for datetimes
        elif pd.api.types.is_string_dtype(df[col]) or isinstance(df[col].dtype, pd.CategoricalDtype):
            df[col] = df[col].fillna(df[col].mode()[0])  # Fill NaN with mode for strings and categories
        else:
            df[col] = df[col].fillna(0)  # Default fill value for other types

    return df

In [None]:


from sklearn.model_selection import train_test_split

# Convert object columns to the best-matching nullable dtypes
X = X.convert_dtypes()

X_train, X_test = train_test_split(X, test_size=0.2)

# Fill missing values and keep only the column types used for model training
X_train = fillna_based_on_dtype(X_train).select_dtypes(include=[int, float, 'datetime', 'category'])
X_test = fillna_based_on_dtype(X_test).select_dtypes(include=[int, float, 'datetime', 'category'])
y_train = X_train.pop(target_col) 
y_test = X_test.pop(target_col) 
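The split above is unseeded, so each run produces a different partition. If reproducible splits are wanted, one option is to pass `random_state` to `train_test_split`. The sketch below uses a hypothetical toy frame and assumes the value 41, the same value the AutoML settings in Step 3 use as `seed`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real table
demo = pd.DataFrame({"feature": range(10), "OrderQuantity": range(10, 20)})

# The same random_state yields the same partition on every call
train_a, test_a = train_test_split(demo, test_size=0.2, random_state=41)
train_b, test_b = train_test_split(demo, test_size=0.2, random_state=41)

# Separate the target from the features, as the notebook does with pop()
y_train_demo = train_a.pop("OrderQuantity")

print(len(train_a), len(test_a), test_a.index.equals(test_b.index))
```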


## Step 3: Train an AutoML Trial with FLAML to Find the Best Model

With your data in place, you can now define the model. You will also use MLflow and Fabric autologging to track the experiments.

### Set up MLflow experiment tracking

MLflow is an open-source platform that is deeply integrated into the Data Science experience in Fabric. It lets you track and compare the performance of different models and experiments without manual bookkeeping. For more information, see [Autologging in Microsoft Fabric](https://aka.ms/fabric-autologging).

In [None]:
# MLflow logging setup

import mlflow

mlflow.autolog(exclusive=False)
mlflow.set_experiment("AW_MLmodel-AutoMLExperiment")


### Configure the AutoML trial and settings

Import the required classes and modules from the FLAML package and instantiate AutoML, which automates the machine learning pipeline.

In [None]:
# Import the AutoML class from the FLAML package
from flaml import AutoML

# Define AutoML settings
settings = {
    "time_budget": 120,  # Total running time in seconds
    "task": "regression",  # Task type
    "log_file_name": "flaml_experiment.log",  # FLAML log file
    "eval_method": "cv",  # Evaluate candidates with cross-validation
    "n_splits": 3,  # Number of cross-validation folds
    "max_iter": 10,  # Maximum number of search iterations
    "force_cancel": True,  # Cancel Spark jobs that run past the time budget
    "seed": 41,  # Random seed
    "mlflow_exp_name": "AW_MLmodel-AutoMLExperiment",  # MLflow experiment name
    "use_spark": True,  # Whether to use Spark for distributed training
    "n_concurrent_trials": 3,  # Maximum number of concurrent trials
    "verbose": 1,  # Logging verbosity
    "featurization": "auto",  # Automatic featurization
}

# Create an AutoML instance
automl = AutoML(**settings)


### Run the AutoML trial

Run the AutoML trial inside a nested MLflow run so that it is tracked within the existing MLflow run context. The trial uses the processed dataset with `OrderQuantity` as the target variable. The settings defined above were already passed to the `AutoML` constructor, so `fit` only needs the training data.

In [None]:
with mlflow.start_run(nested=True, run_name="AW_MLmodel"):
    automl.fit(
        X_train=X_train, 
        y_train=y_train,  # target column of the training data 
    )
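The notebook sets aside `y_test` in Step 2, but no cell scores the trained model against it. The sketch below shows one way a held-out evaluation could look. It is not the generated workflow: the synthetic data and the `LinearRegression` model are stand-ins chosen so the example runs on its own. In the notebook, the fitted `automl` object also exposes `predict`, so `automl.predict(X_test)` would take the place of `model.predict(X_te)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic regression data (assumption: stands in for the sales table)
rng = np.random.default_rng(41)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Simple 80/20 holdout
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# Stand-in model; any object with fit/predict works the same way
model = LinearRegression().fit(X_tr, y_tr)
preds = model.predict(X_te)

# Score the predictions against the held-out targets
mae = mean_absolute_error(y_te, preds)
r2 = r2_score(y_te, preds)
print(f"MAE: {mae:.3f}, R^2: {r2:.3f}")
```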

## Step 4: Save the Final Machine Learning Model

After the AutoML trial completes, save the final tuned model as an ML model in Fabric.

In [None]:
model_path = f"runs:/{automl.best_run_id}/model"

# Register the model to the MLflow registry
registered_model = mlflow.register_model(model_uri=model_path, name="AW_MLmodel")

# Print the registered model's name and version
print(f"Model '{registered_model.name}' version {registered_model.version} registered successfully.")

## Step 5: Predict with the Saved Model

Microsoft Fabric allows users to operationalize machine learning models with a scalable function called `PREDICT`, which supports batch scoring (or batch inferencing) in any compute engine.

You can generate batch predictions directly from the Microsoft Fabric notebook or from a given model's item page. For more information on how to use `PREDICT`, see [Model scoring with PREDICT in Microsoft Fabric](https://aka.ms/fabric-predict).

1. Load the model for batch scoring and generate the prediction results.

In [None]:
from synapse.ml.predict import MLFlowTransformer

model_name = "AW_MLmodel"

feature_cols = X_train.columns.to_list()
model = MLFlowTransformer(
    inputCols=feature_cols,
    outputCol=target_col,
    modelName=model_name,
    modelVersion=registered_model.version,
)

df_test = spark.createDataFrame(X_test)
batch_predictions = model.transform(df_test)


In [None]:
display(batch_predictions)

2. Save the predictions to a table.

In [None]:
saved_name = "AdventureWorks_Sales_2017.csv_predictions".replace(".", "_")
batch_predictions.write.mode("overwrite").format("delta").option("overwriteSchema", "true").save(f"Tables/{saved_name}")
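The table name above is derived from the source file name, with dots replaced because the file name's `.csv` suffix would otherwise end up in the table name. A minimal sketch of that derivation follows; `predictions_table_name` is a hypothetical helper, not something the wizard generates, and the relative path is shortened from the one used in Step 1.

```python
from pathlib import PurePosixPath


def predictions_table_name(source_path: str) -> str:
    """Build a table name from a source file path: file name + '_predictions', dots to underscores."""
    file_name = PurePosixPath(source_path).name
    return f"{file_name}_predictions".replace(".", "_")


saved_name_demo = predictions_table_name(
    "Files/RawDataAdventureWorks_Sales_2017/AdventureWorks_Sales_2017.csv"
)
print(saved_name_demo)
```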

## Working with Data Wrangler

In [3]:
df = spark.read.format("csv").option("header","true").load("Files/RawDataADLS")
# df is now a Spark DataFrame containing the CSV data from "Files/RawDataADLS".
display(df)

**Code generated by Data Wrangler**

In [None]:
# Code generated by Data Wrangler for PySpark DataFrame

def clean_data(df):
    # Sort by column: 'ReturnDate' (ascending)
    df = df.sort(df['ReturnDate'].asc())
    return df

df_clean = clean_data(df)
display(df_clean)

In [None]:
# Code generated by Data Wrangler for pandas sample

def clean_data(pandas_df):
    # Sort by column: 'ReturnDate' (ascending)
    pandas_df = pandas_df.sort_values(['ReturnDate'])
    return pandas_df

# Loaded variable 'df' from kernel state
pandas_df = df.limit(5000).toPandas()

pandas_df_clean = clean_data(pandas_df.copy())
pandas_df_clean.head()
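One thing to watch for with the sort above: the CSV is read with only the `header` option, so Spark loads every column, including `ReturnDate`, as a string, and the sort is lexicographic rather than chronological. The sketch below shows the difference in pandas. The sample values and the `M/D/YYYY` format are assumptions made for illustration; the actual format in the file may differ.

```python
import pandas as pd

# Hypothetical date strings in an assumed M/D/YYYY format
demo = pd.DataFrame({"ReturnDate": ["2/1/2017", "10/15/2016", "1/20/2017"]})

# Sorting the raw strings compares characters, not dates
as_text = demo.sort_values("ReturnDate")["ReturnDate"].tolist()

# Parsing to datetime first gives a chronological order
demo["ReturnDate_parsed"] = pd.to_datetime(demo["ReturnDate"], format="%m/%d/%Y")
as_dates = demo.sort_values("ReturnDate_parsed")["ReturnDate"].tolist()

print(as_text)
print(as_dates)
```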