
# MLflow and Spark Integration

This document describes how MLflow integrates with Apache Spark in machine learning workflows, using a 5G Quality of Service (QoS) dataset as the running example.

## MLflow Overview

MLflow is an open-source platform designed to manage the entire machine learning lifecycle.  Its key components address distinct challenges in ML development:

* **Tracking:** MLflow Tracking provides a centralized repository for experiments.  It logs parameters, code versions, metrics, and artifacts (models, plots, etc.) associated with each run. This allows for easy comparison of different model versions and hyperparameter settings, promoting reproducibility and facilitating the selection of the best-performing model.  In the context of a 5G QoS dataset, this could track the performance of different algorithms (e.g., regression, classification) with various hyperparameters on metrics relevant to QoS, such as latency, throughput, and packet loss.

* **Projects:** MLflow Projects define reproducible, reusable workflows.  Projects specify the code, dependencies, and environment required to run an ML experiment, ensuring consistency across different environments (local, cloud, etc.).  For the 5G QoS dataset, a project could encapsulate data preprocessing steps, model training, and evaluation using Spark, ensuring that the entire pipeline can be easily rerun or deployed.  This is particularly crucial when working with large datasets typical in 5G network analysis.

* **Models:** MLflow Models provide a standard format for packaging and deploying machine learning models.  The format supports various model flavors (e.g., scikit-learn, TensorFlow, PyTorch) and deployment targets (e.g., REST API, batch inference).  Once a model trained on the 5G QoS dataset is selected via tracking, it can be packaged into an MLflow Model for deployment to a production environment, such as a system monitoring network performance. This ensures consistency between training and inference.

* **Registry:** The MLflow Model Registry enables centralized model management, including versioning, stage transitions (e.g., staging, production), and annotations.  This component facilitates model governance and simplifies the deployment process. In the context of the 5G QoS data, different versions of QoS prediction models can be managed and deployed, with detailed notes about performance characteristics and the dataset used for training.

## Spark Integration

Apache Spark's distributed computing capabilities are crucial for handling large-scale datasets like those encountered in 5G network analysis. Integrating MLflow with Spark offers several advantages:

* **Scalable Model Training:** Spark's distributed processing allows for parallel model training on large 5G QoS datasets.  MLflow tracking calls run on the Spark driver, so the parameters, metrics, and artifacts of a distributed training job are recorded in a single run rather than scattered across executors.

* **Data Preprocessing with Spark:** Spark's DataFrame API provides tools for efficient data preparation and feature engineering. MLflow Projects can leverage these capabilities to preprocess the 5G data, ensuring consistency and reproducibility before model training.  Preprocessing steps (cleaning, transformation, feature extraction) can be logged in MLflow, enabling better understanding and reproducibility of results.


* **Model Deployment with Spark:** MLflow Models can be deployed in a Spark environment for real-time or batch inference. This facilitates integration into existing Spark-based data pipelines. A deployed model can analyze incoming 5G QoS data streams, enabling real-time monitoring and anomaly detection.

* **Unified Workflow:**  The integration combines Spark's data processing prowess with MLflow's ML lifecycle management, streamlining the end-to-end workflow for 5G QoS analysis.  This unified platform supports the development, deployment, and monitoring of QoS models.


By combining MLflow's experiment management and model deployment capabilities with Spark's distributed computing power, it becomes possible to build robust, scalable machine learning pipelines for analyzing 5G QoS datasets.  The entire process, from data ingestion to model deployment, can be effectively managed and monitored, leading to faster iteration, greater reproducibility, and more reliable results.


# MLflow and Spark Integration with 5G Quality of Service Dataset

In [None]:
# Install the required libraries
!pip install pyspark mlflow

In [None]:
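# Initialize the Spark session
from pyspark.sql import SparkSession

# The mlflow-spark package is only needed for mlflow.spark.autolog();
# logging and loading Spark ML models below works without it.
spark = SparkSession.builder \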
    .appName("MLflow_Spark_Integration_5G") \
    .config("spark.jars.packages", "org.mlflow:mlflow-spark:2.8.0") \
    .getOrCreate()

print("Spark session initialized.")

In [None]:
#Loading and preprocessing dataset
from pyspark.sql.functions import regexp_replace, col

# Load the dataset
file_path = "Quality of Service 5G.csv"  # Update with the correct path
df = spark.read.csv(file_path, header=True, inferSchema=True)

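# Preprocess the data: strip unit suffixes (dBm, ms, Mbps, %) and cast to float.
# With ANSI mode off, values that still contain non-numeric text after stripping become null.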
df = df.withColumn("Signal_Strength", regexp_replace(col("Signal_Strength"), " dBm", "").cast("float")) \
       .withColumn("Latency", regexp_replace(col("Latency"), " ms", "").cast("float")) \
       .withColumn("Required_Bandwidth", regexp_replace(col("Required_Bandwidth"), " Mbps", "").cast("float")) \
       .withColumn("Allocated_Bandwidth", regexp_replace(col("Allocated_Bandwidth"), " Mbps", "").cast("float")) \
       .withColumn("Resource_Allocation", regexp_replace(col("Resource_Allocation"), "%", "").cast("float"))

# Show the first few rows and schema
df.show(5)
df.printSchema()
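
The cast above only succeeds when the stripped suffix is the sole non-numeric text in the value. As a rough plain-Python illustration (the helper names `strip_unit` and `to_mbps` are hypothetical, and the sample strings are made up rather than taken from the dataset), the sketch below mimics the `regexp_replace` plus `cast("float")` chain on a single value and shows a unit-aware alternative that could be used if a bandwidth column turned out to mix Kbps and Mbps:

```python
import re
from typing import Optional

# Hypothetical lookup: how many of each unit make up one Mbps.
UNITS_PER_MBPS = {"kbps": 1000.0, "mbps": 1.0}


def strip_unit(value: str, unit_pattern: str) -> Optional[float]:
    """Mimic regexp_replace(col, unit_pattern, "") followed by cast("float")."""
    cleaned = re.sub(unit_pattern, "", value)
    try:
        return float(cleaned)
    except ValueError:
        # Stands in for the null Spark produces when ANSI mode is off.
        return None


def to_mbps(value: str) -> Optional[float]:
    """Parse '<number> <unit>' and normalize the result to Mbps."""
    match = re.fullmatch(r"\s*([-+]?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", value)
    if match is None:
        return None
    number, unit = match.groups()
    divisor = UNITS_PER_MBPS.get(unit.lower())
    if divisor is None:
        return None
    return float(number) / divisor


print(strip_unit("-75 dBm", " dBm"))    # suffix matches, value parses
print(strip_unit("500 Kbps", " Mbps"))  # suffix does not match, parse fails
print(to_mbps("500 Kbps"))              # unit-aware parse, expressed in Mbps
```

If the real data does contain such rows, the suffix-stripping approach leaves nulls in the column, and `VectorAssembler` rejects nulls under its default `handleInvalid="error"` setting, so counting nulls per column after preprocessing is a cheap sanity check.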

In [None]:
#Defining ML pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

In [None]:
# Index the Application_Type column (convert it to numeric categories)
indexer = StringIndexer(inputCol="Application_Type", outputCol="Application_Type_Indexed")
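
`StringIndexer` assigns indices by descending label frequency by default (`stringOrderType="frequencyDesc"`), so the most common application type becomes 0.0, and in Spark 3.0 and later ties are broken alphabetically. A small plain-Python sketch of that ordering, using made-up labels rather than values from the dataset and a hypothetical helper name:

```python
from collections import Counter
from typing import Dict, Iterable


def frequency_desc_index(labels: Iterable[str]) -> Dict[str, float]:
    """Map each label to an index: most frequent first, ties alphabetical."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


# Illustrative labels only; the real column values may differ.
sample = ["Streaming", "Video_Call", "Streaming", "Gaming", "Video_Call", "Streaming"]
mapping = frequency_desc_index(sample)
print(mapping)
```

Because the mapping depends on the label counts seen during `fit`, the same label can receive a different index when the indexer is refit on different data; the fitted pipeline model keeps the mapping it learned.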

In [None]:
# Assemble features
feature_columns = ["Signal_Strength", "Latency", "Required_Bandwidth", "Allocated_Bandwidth", "Resource_Allocation"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

In [None]:
# Logistic Regression Model (example use case: predicting Application_Type)
lr = LogisticRegression(featuresCol="features", labelCol="Application_Type_Indexed", maxIter=10)

# Pipeline
pipeline = Pipeline(stages=[indexer, assembler, lr])

In [None]:
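# Set up MLflow
import mlflow
import mlflow.spark

# Set the MLflow tracking URI (local or remote).
# A tracking server must already be running at this address, or the logging calls below will fail.
mlflow.set_tracking_uri("http://localhost:5000")  # Replace with your MLflow server URI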

# Set experiment name
experiment_name = "5G_Quality_of_Service_Experiment"
mlflow.set_experiment(experiment_name)

print(f"Tracking URI: {mlflow.get_tracking_uri()}")
print(f"Experiment Name: {experiment_name}")

In [None]:
# Train and track the model
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

with mlflow.start_run():
    # Split the data into training and test sets
    train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)

    # Fit the pipeline model
    model = pipeline.fit(train_data)

    # Log the Spark ML model
    mlflow.spark.log_model(model, "spark_model")

    # Evaluate the model
    predictions = model.transform(test_data)
    evaluator = MulticlassClassificationEvaluator(labelCol="Application_Type_Indexed", metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)

    # Log metrics and parameters
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_param("model_type", "Logistic Regression")

    print(f"Model accuracy: {accuracy}")
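
For reference, the `accuracy` metric reported above is the fraction of test rows whose predicted index equals the true index. A minimal plain-Python equivalent (the function name and sample values are illustrative only):

```python
from typing import Sequence


def accuracy(labels: Sequence[float], predictions: Sequence[float]) -> float:
    """Return the share of positions where prediction and label agree."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)


# Three of the four made-up predictions match their labels.
score = accuracy([0.0, 1.0, 2.0, 1.0], [0.0, 1.0, 1.0, 1.0])
print(score)
```

Accuracy can look flattering when one application type dominates the data; `MulticlassClassificationEvaluator` also accepts `metricName="f1"`, `"weightedPrecision"`, and `"weightedRecall"` for a fuller picture.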


In [None]:
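# Register the model
# The run above has already ended, so mlflow.active_run() would return None here;
# mlflow.last_active_run() returns the most recently finished run instead.
model_uri = f"runs:/{mlflow.last_active_run().info.run_id}/spark_model"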
registered_model_name = "5G_QoS_Logistic_Regression"

In [None]:
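# register_model returns a ModelVersion; keep it so the exact version can be loaded later
model_version = mlflow.register_model(model_uri, registered_model_name)
print(f"Model registered as: {registered_model_name} (version {model_version.version})")

In [None]:
# Deploy the model

# Load the registered version back into Spark and use it for predictions
loaded_model = mlflow.spark.load_model(f"models:/{registered_model_name}/{model_version.version}")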

In [None]:
# Use the model for prediction on new data
new_predictions = loaded_model.transform(test_data)
new_predictions.show(5)
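
The two model URIs used in the cells above follow different schemes: `runs:/<run_id>/<artifact_path>` points at an artifact logged by one run, while `models:/<name>/<version>` points at a version in the Model Registry. A tiny sketch of how they are composed (both helper functions are hypothetical conveniences, not MLflow APIs, and the run ID is a placeholder):

```python
def run_model_uri(run_id: str, artifact_path: str) -> str:
    """URI of a model artifact logged under a specific run."""
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(name: str, version: int) -> str:
    """URI of a specific version of a registered model."""
    return f"models:/{name}/{version}"


# Placeholder run ID; a real one comes from the run that logged the model.
print(run_model_uri("abc123", "spark_model"))
print(registry_model_uri("5G_QoS_Logistic_Regression", 1))
```

Loading through the registry URI decouples consumers from individual runs: a new version can be registered from a different run without changing the model name that downstream code refers to.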

In [None]:
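# Visualize the results
# If the tracking server configured above is running at http://localhost:5000,
# its UI is already available at that address.
# Otherwise, the command below starts a local UI (default port 5000);
# it blocks this cell until interrupted.
!mlflow ui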