# 2. Model Prototyping & MLflow Integration

**Objective:** Build a baseline sentiment analysis model, train it on the validated reference data, and integrate it with MLflow for experiment tracking and model registration. The logic developed here will be refactored into our `src/pipeline/tasks` for the production Prefect pipeline.

In [None]:
import pandas as pd
import mlflow
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import os

# Import our config settings.
# `from src.config.settings import ...` needs the project root (the parent
# of 'src') on the Python path, not the 'src' directory itself.
import sys

# Change CWD to project root if running from /notebooks
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")
    print(f"Changed directory to: {os.getcwd()}")

# Make the project root importable regardless of where the kernel started
project_root = os.getcwd()
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from src.config.settings import settings


## 1. Load & Prepare Data

We'll use the `reference` data for this prototype because it's our clean, validated dataset. The production pipeline will train on the `processed` data instead.

In [None]:
try:
    data = pd.read_csv(settings.REFERENCE_DATA_PATH)
except FileNotFoundError:
    print(f"Error: Data file not found at {settings.REFERENCE_DATA_PATH}")
    print("Please ensure you have run 'dvc pull' or the file exists.")
    raise  # Stop here: every cell below depends on `data`

data.head()

In [None]:
# Define features (X) and target (y)
X = data['text']
y = data['sentiment']

# Create a train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=settings.MODEL_TEST_SPLIT_SIZE,
    random_state=settings.MODEL_RANDOM_STATE,
    stratify=y  # Preserve the class proportions of y in both splits
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
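
Note that `stratify` does not balance the classes; it keeps each class's share the same in the training and test sets. A minimal sketch on made-up, deliberately imbalanced labels (the toy data and the `class_shares` helper are illustrative assumptions, not part of the project):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy, deliberately imbalanced data (illustrative only, not the project data)
texts = [f"sample {i}" for i in range(100)]
labels = ["positive"] * 80 + ["negative"] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)


def class_shares(values):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(values)
    return {label: count / len(values) for label, count in counts.items()}


train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)
print(f"Train shares: {train_shares}")
print(f"Test shares:  {test_shares}")
```

Both splits keep the original 80/20 ratio, so an imbalanced dataset stays imbalanced after splitting.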

## 2. Configure MLflow

We set the tracking URI (to our local `mlruns` directory) and the experiment name. If the experiment doesn't exist, MLflow creates it.

In [None]:
mlflow.set_tracking_uri(settings.MLFLOW_TRACKING_URI)
mlflow.set_experiment(settings.MLFLOW_EXPERIMENT_NAME)

experiment = mlflow.get_experiment_by_name(settings.MLFLOW_EXPERIMENT_NAME)
print(f"MLflow Experiment Name: {experiment.name}")
print(f"MLflow Experiment ID: {experiment.experiment_id}")
print(f"MLflow Tracking URI: {mlflow.get_tracking_uri()}")

## 3. Define and Train Model within an MLflow Run

We'll use `mlflow.start_run()` to create a new experiment run. Inside this context, we will:
1.  Define a `scikit-learn` pipeline (TfidfVectorizer + LogisticRegression).
2.  Log hyperparameters (params).
3.  Train the model.
4.  Evaluate the model and log metrics (accuracy, f1, etc.).
5.  Use `mlflow.sklearn.log_model()` to log the trained model pipeline.
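
Steps 1 and 2 rely on scikit-learn's `<step>__<parameter>` naming convention, which is why the hyperparameter keys in the next cell look like `tfidf__max_features`. The sketch below shows how `set_params` routes such keys to the named pipeline steps; the pipeline and values are throwaway examples, not the project's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Throwaway pipeline with the same step names as the notebook
demo_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression()),
])

# The prefix before the double underscore selects the step,
# the suffix selects that step's constructor parameter.
demo_pipeline.set_params(tfidf__ngram_range=(1, 2), logreg__C=0.5)

tfidf_step = demo_pipeline.named_steps["tfidf"]
logreg_step = demo_pipeline.named_steps["logreg"]
print(f"tfidf ngram_range: {tfidf_step.ngram_range}")
print(f"logreg C: {logreg_step.C}")
```

Because the same dictionary is passed to both `set_params` and `mlflow.log_params`, the logged parameter names match the pipeline's own names exactly.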

In [None]:
# Define parameters for our model
params = {
    "tfidf__ngram_range": (1, 2),  # Use unigrams and bigrams
    "tfidf__max_features": 1000,    # Limit feature space
    "logreg__C": 1.0,                # Logistic regression regularization strength
    "logreg__solver": "liblinear",
    "logreg__random_state": settings.MODEL_RANDOM_STATE
}

# Start an MLflow run
with mlflow.start_run() as run:
    run_id = run.info.run_id
    print(f"Starting MLflow Run ID: {run_id}")
    
    # 1. Define the model pipeline
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('logreg', LogisticRegression())
    ])
    
    # Set parameters
    pipeline.set_params(**params)
    
    # 2. Log parameters
    print("Logging parameters...")
    mlflow.log_params(params)
    mlflow.log_param("test_split_size", settings.MODEL_TEST_SPLIT_SIZE)
    
    # 3. Train the model
    print("Training model...")
    pipeline.fit(X_train, y_train)
    
    # 4. Evaluate and log metrics
    print("Evaluating model and logging metrics...")
    y_pred = pipeline.predict(X_test)
    
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_weighted": f1_score(y_test, y_pred, average="weighted"),
        "precision_weighted": precision_score(y_test, y_pred, average="weighted"),
        "recall_weighted": recall_score(y_test, y_pred, average="weighted")
    }
    
    mlflow.log_metrics(metrics)
    print(f"Metrics: {metrics}")
    
    # 5. Log the model
    print("Logging model...")
    mlflow.sklearn.log_model(
        sk_model=pipeline,
        artifact_path="model",  # This creates a 'model' subdirectory in the artifact store
        registered_model_name=settings.MODEL_REGISTRY_NAME # Register the model
    )
    
    # Log a tag for this run
    mlflow.set_tag("run_type", "prototype")
    
    print("MLflow run complete.")
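
The metrics above use `average="weighted"`, which weights each class's score by its number of true samples. On imbalanced data this can hide poor performance on a minority class. The sketch below compares weighted and macro F1 on made-up labels, using a degenerate model that always predicts the majority class; logging a macro score as well is a suggestion, not something the notebook does.

```python
from sklearn.metrics import f1_score

# Made-up labels: 8 positive, 2 negative
y_true_demo = ["pos"] * 8 + ["neg"] * 2
# Degenerate model that always predicts the majority class
y_pred_demo = ["pos"] * 10

# zero_division=0 scores the never-predicted class as 0 without a warning
weighted_f1 = f1_score(y_true_demo, y_pred_demo, average="weighted", zero_division=0)
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)

print(f"weighted F1: {weighted_f1:.3f}")
print(f"macro F1:    {macro_f1:.3f}")
```

The weighted score stays comparatively high because the majority class dominates, while the macro score drops since the minority class contributes an F1 of zero.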

## 4. Review Run in MLflow UI

Now, you can check the MLflow UI to see the run, its parameters, metrics, and the registered model.

1.  Open a terminal in the project root.
2.  Make sure your virtual environment is active.
3.  Run `mlflow ui`
4.  Open `http://127.0.0.1:5000` in your browser.

You should see the `SentimentModelRetraining` experiment with one run for each time the training cell was executed. Clicking a run shows its logged params and metrics, and its "Artifacts" section contains the `model` folder. The "Models" tab lists `prod-sentiment-classifier`: the first run creates Version 1, and each later run registers a new version.

## 5. Load Model from Registry (Simulation)

Let's simulate how another service (like `KubeSentiment`) would load this model for inference.

In [None]:
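# Load the model from the registry by its registered name.
# The 'latest' suffix resolves to the newest registered version; a 'None'
# suffix would only match versions still in the (deprecated) "None" stage.
model_uri = f"models:/{settings.MODEL_REGISTRY_NAME}/latest"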

print(f"Loading model from: {model_uri}")
loaded_model = mlflow.sklearn.load_model(model_uri)

# Test with a sample prediction
sample_text = ["This is a fantastic product!", "I am very angry."]
predictions = loaded_model.predict(sample_text)
print(f"Sample predictions: {list(zip(sample_text, predictions))}")
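
A consuming service usually wants control over which version it loads. MLflow registry URIs can address a fixed version (`models:/<name>/<version>`), the newest version (`models:/<name>/latest`), or, in MLflow versions that support aliases, an alias (`models:/<name>@<alias>`). The helper below is a hypothetical convenience function for building these strings, not part of MLflow or this project; the model name and alias are examples only.

```python
from typing import Optional, Union


def build_model_uri(
    name: str,
    version: Optional[Union[int, str]] = None,
    alias: Optional[str] = None,
) -> str:
    """Build an MLflow registry URI (hypothetical helper, not an MLflow API)."""
    if version is not None and alias is not None:
        raise ValueError("Pass either a version or an alias, not both.")
    if alias is not None:
        # Alias form, e.g. an alias a deployment process moves between versions
        return f"models:/{name}@{alias}"
    if version is not None:
        # Pinned form: always resolves to the same registered version
        return f"models:/{name}/{version}"
    # Default: whatever version was registered most recently
    return f"models:/{name}/latest"


pinned_uri = build_model_uri("prod-sentiment-classifier", version=1)
alias_uri = build_model_uri("prod-sentiment-classifier", alias="champion")
latest_uri = build_model_uri("prod-sentiment-classifier")
print(pinned_uri, alias_uri, latest_uri, sep="\n")
```

Pinning a version or alias keeps a service from silently picking up a freshly registered prototype, whereas `latest` is convenient for notebooks like this one.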