# **Notebook to create and register model**

In this notebook we create and register a machine learning model, using the MLflow library along with others. Some important features of this code snippet:

**Experiment Name**: The code checks if an experiment with the name “MyExperimentRevenue” exists and creates one if it doesn’t. Make sure that the name does not conflict with any existing experiments.

**Model Registration**: The model is registered with the name “MyModelRevenue”. If a model with this name already exists in the Model Registry, the run is added as a new version of that model, so make sure the name is not already used for an unrelated model.

**Data Split**: The data is split into training and test sets with a 70:30 ratio. Depending on the size and distribution of your data, you might want to adjust this.

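**Autologging**: Autologging is enabled with mlflow.spark.autolog(). According to the MLflow documentation, this call records Spark datasource information (path, version, and format) when data is read; parameters and models from spark.ml estimators are autologged by mlflow.pyspark.ml.autolog() instead. In this notebook the model and the evaluation metrics are logged manually, which is also the approach to use if you only want to log certain information.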
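
To make the 70:30 split described above more concrete, here is a minimal plain-Python sketch of a seeded random split. It is not Spark's implementation; `random_split` is a hypothetical helper that assumes each row is assigned independently, which is why the resulting ratio is only approximately 70:30, and why a fixed seed is needed to get the same split twice.

```python
import random

def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row independently to train or test (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        # Each row is sampled on its own, so the final ratio is only approximate
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows)
print(f"train: {len(train)} rows, test: {len(test)} rows")
```

On small datasets the deviation from the requested ratio can be noticeable, so it is worth checking the actual row counts after splitting.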


# Instructions 
- Please ensure the steps defined in "01-createtable" are completed.
- Please change lakeHouseName to the name of your own Lakehouse.


In [None]:
# Lakehouse name; replace with your own
lakeHouseName = "dataverse_development_cds2_workspace_unqf0798579be6eee118bc36045bd003"

# Load data into our DataFrame from the temp table.
# Important: we only load records where revenue is not zero.
query = f"""SELECT * FROM {lakeHouseName}.tbl_temp_traintestdata WHERE crffa_revenue != 0"""
dfmain = spark.sql(query)



In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
import mlflow
import mlflow.spark
from datetime import datetime
from mlflow.tracking import MlflowClient

# Enable autologging
mlflow.spark.autolog()

# set experiment and model name
experiment_name = "MyExperimentRevenue"
model_name = "MyModelRevenue"

# Check if the experiment exists
experiment = mlflow.get_experiment_by_name(experiment_name)

if experiment:
    experiment_id = experiment.experiment_id
else:
    # Create an experiment and get its ID
    experiment_id = mlflow.create_experiment(experiment_name)

# Define the feature columns
feature_columns = [
    "crffa_csat", "crffa_noofreturns", "crffa_educationalbackground",
    "crffa_yearofbirth", "crffa_recency", "MntWines", "MntFruits",
    "MntMeatProducts", "MntFishProducts", "MntBakeryProducts",
    "MntBeverageProds", "MntDairyProds", "NumDealsPurchases",
    "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases",
]

# Assemble the features into a feature vector
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
df = assembler.transform(dfmain)

# Split the data into training and test sets (70:30).
# A fixed seed keeps the split repeatable across runs on the same data.
train_data, test_data = df.randomSplit([0.7, 0.3], seed=42)

# Define the linear regression model
lr = LinearRegression(featuresCol="features", labelCol="crffa_revenue")

# Fit the model to the training data
lr_model = lr.fit(train_data)

# Log model
with mlflow.start_run(experiment_id=experiment_id) as run:
    mlflow.spark.log_model(lr_model, "model")
    run_id = run.info.run_id

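    # Register model. register_model adds a new version if the name already exists.
    model_uri = f"runs:/{run_id}/model"
    try:
        mlflow.register_model(model_uri, model_name)
    except Exception as e:
        # Show the actual error instead of assuming the cause, then fall back to the client API
        print(f"register_model failed for {model_name}: {e}. Trying MlflowClient.create_model_version.")
        client = MlflowClient()
        client.create_model_version(name=model_name, source=model_uri, run_id=run_id)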

    # Evaluate the model and log metrics
    evaluator = RegressionEvaluator(labelCol="crffa_revenue")
    
    # Evaluate on training data
    train_predictions = lr_model.transform(train_data)
    train_rmse = evaluator.evaluate(train_predictions, {evaluator.metricName: "rmse"})
    train_r2 = evaluator.evaluate(train_predictions, {evaluator.metricName: "r2"})
    mlflow.log_metric("train_rmse", train_rmse)
    mlflow.log_metric("train_r2", train_r2)

    # Evaluate on test data
    test_predictions = lr_model.transform(test_data)
    test_rmse = evaluator.evaluate(test_predictions, {evaluator.metricName: "rmse"})
    test_r2 = evaluator.evaluate(test_predictions, {evaluator.metricName: "r2"})
    mlflow.log_metric("test_rmse", test_rmse)
    mlflow.log_metric("test_r2", test_r2)

print(f"Run ID: {run_id}")
print(f"Model URI: {model_uri}")

# No mlflow.end_run() is needed here: the "with" block above already ended the run.
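
The cell above builds a `runs:/` URI that points at the model artifact inside a run. Once the model is registered, MLflow also accepts a `models:/<name>/<version>` URI for the same artifact. The sketch below only shows how the two strings are composed; `run_model_uri` and `registered_model_uri` are hypothetical helpers, not MLflow functions, and the run ID and version number used here are placeholders.

```python
def run_model_uri(run_id, artifact_path="model"):
    """URI of a model artifact logged under a specific run (hypothetical helper)."""
    return f"runs:/{run_id}/{artifact_path}"

def registered_model_uri(model_name, version):
    """URI of a specific version of a registered model (hypothetical helper)."""
    return f"models:/{model_name}/{version}"

# Placeholder values for illustration only
print(run_model_uri("<run_id>"))
print(registered_model_uri("MyModelRevenue", 1))
```

The registry-based form is usually more convenient in later notebooks, because it does not require knowing the run ID.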

In [None]:
# Print the metrics so we can assess the model performance.
print("Training Data Metrics:")
print(f"RMSE: {train_rmse}")
print(f"R2: {train_r2}")

print("\nTest Data Metrics:")
print(f"RMSE: {test_rmse}")
print(f"R2: {test_r2}")



Training Data Metrics:
RMSE: 9535.306863897882
R2: 0.7789848129153963

Test Data Metrics:
RMSE: 8678.230330837348
R2: 0.8166836336022797
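
For reference, the two metrics reported above can be computed by hand. The following is a small plain-Python sketch using the standard definitions of RMSE and R2 on made-up numbers; it is not the Spark evaluator itself, and the sample values are purely illustrative. RMSE is expressed in the same units as the label (here, revenue), while R2 is unitless.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up labels and predictions, for illustration only
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

print(f"RMSE: {rmse(y_true, y_pred)}")
print(f"R2: {r2(y_true, y_pred)}")
```

Comparing the training and test values of both metrics side by side is a quick way to spot overfitting: a much worse test score than training score would be a warning sign.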


In [None]:
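# Map each feature name to its fitted coefficient
coef_dict = dict(zip(feature_columns, lr_model.coefficients))
# Sort by absolute coefficient, largest first
sorted_features = sorted(coef_dict.items(), key=lambda x: abs(x[1]), reverse=True)

# Calculate total of absolute coefficients
total_coef = sum(abs(coef) for feature, coef in sorted_features)

# Print each feature's share of the total absolute coefficient.
# Note: raw coefficients depend on each feature's scale, so these shares are
# only a rough indicator unless the features are on comparable scales.
for feature, coef in sorted_features:
    percentage_influence = (abs(coef) / total_coef) * 100
    print(f"{feature}: {percentage_influence:.2f}%")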
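
Because the raw coefficients used above depend on the scale of each feature, one common adjustment is to multiply each absolute coefficient by the standard deviation of its feature before computing the shares. The sketch below illustrates the idea in plain Python; `influence_shares` is a hypothetical helper, and the feature names, coefficients, and standard deviations are made up. In the notebook the standard deviations would have to be computed from the training data first.

```python
def influence_shares(coefficients, feature_stddevs):
    """Percentage share of |coefficient| * stddev per feature (hypothetical helper)."""
    scaled = {
        name: abs(coef) * feature_stddevs[name]
        for name, coef in coefficients.items()
    }
    total = sum(scaled.values())
    return {name: 100 * value / total for name, value in scaled.items()}

# Made-up values: feature_b has a small coefficient but a wide spread
coefficients = {"feature_a": 2.0, "feature_b": 0.5}
feature_stddevs = {"feature_a": 1.0, "feature_b": 12.0}

shares = influence_shares(coefficients, feature_stddevs)
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {share:.2f}%")
```

In this toy example the feature with the smaller coefficient ends up with the larger share, which is exactly the effect that ranking by raw coefficients can hide.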
