# MLflow Experiment Tracking with Logistic Regression

## What is MLflow?

**MLflow** is an open-source platform for managing the end-to-end machine learning lifecycle. It is especially useful for tracking experiments in machine learning projects, making sure that your results are reproducible, comparable, and well-documented.

### Why Use MLflow for Tracking?

In most machine learning workflows, you try out different:

* Algorithms or models
* Hyperparameters (like `C` in logistic regression)
* Data preprocessing strategies
* Training/test splits

This leads to multiple experiments whose results you need to compare and analyze. MLflow allows you to:

* Track all these experiments in one place
* Log every run's parameters and metrics
* Save trained models and artifacts
* Visualize results in a web UI

### Key Benefits of MLflow Tracking

| Feature            | Benefit                                             |
| ------------------ | --------------------------------------------------- |
| 📦 Log parameters  | Keep track of every setting (e.g., hyperparameters) |
| 📊 Log metrics     | Evaluate and compare accuracy, loss, etc.           |
| 💾 Save models     | Automatically store trained models                  |
| 🌐 Web UI          | Visually compare experiments across different runs  |
| 🧪 Reproducibility | Restore any previous experiment and re-run it       |

---

## Overview

We use a small dataset (`bankdata.csv`) to train a logistic regression classifier. The MLflow tracking API is used to:

* Log hyperparameters (e.g., regularization strength `C`)
* Log evaluation metrics (e.g., accuracy)
* Save the trained model artifact

All runs are recorded and can be viewed with MLflow's UI.

---

## Requirements

Install dependencies:

```bash
pip install mlflow scikit-learn pandas
```

---

## Dataset

The input file `bankdata.csv` should have a structure like:

```
age,job,marital,education,default,balance,housing,loan,contact,...,y
58,management,married,tertiary,no,2143,yes,no,unknown,...,no
```

* The target column is `y`, which holds the class label.
* String-valued (categorical) columns, including the target `y`, are label-encoded before training.
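
To make the label-encoding step concrete, here is a minimal sketch on a made-up `job` column. The values are illustrative and not taken from `bankdata.csv`. `LabelEncoder` assigns integer codes following the sorted order of the distinct values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only; these are not real rows from bankdata.csv
toy = pd.DataFrame({"job": ["management", "technician", "management", "blue-collar"]})

encoder = LabelEncoder()
toy["job_encoded"] = encoder.fit_transform(toy["job"])

# classes_ holds the original labels; each label's index is its integer code
print(list(encoder.classes_))        # ['blue-collar', 'management', 'technician']
print(toy["job_encoded"].tolist())   # [1, 2, 1, 0]
```

The integer codes imply an ordering that nominal categories do not really have. For a linear model such as logistic regression, one-hot encoding (for example with `pandas.get_dummies`) is a common alternative worth comparing.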

---

## Code Breakdown

### 1. Import Libraries

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
import pandas as pd
```

### 2. Load and Preprocess Dataset

```python
df = pd.read_csv("bankdata.csv")

# Label-encode categorical features
for col in df.select_dtypes(include=["object"]).columns:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop("y", axis=1)
y = df["y"]
```

### 3. Train-Test Split

```python
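# A fixed random_state makes the split, and therefore the logged accuracy, repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)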
```
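
If the classes in `y` are imbalanced, the split can also be stratified so that both parts keep the same class ratio. The sketch below uses synthetic labels with an assumed 80/20 imbalance, chosen only for illustration; the names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels for illustration: 80 negatives, 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,   # fixed seed: the same rows land in the test set every time
    stratify=y_toy,    # keep the class ratio the same in both splits
)

print(np.bincount(y_te).tolist())  # [16, 4]
print(np.bincount(y_tr).tolist())  # [64, 16]
```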

### 4. Start MLflow Run and Train Model

```python
C = 0.1
with mlflow.start_run():
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
```

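### 5. Log to MLflow

These calls continue the `with mlflow.start_run():` block from step 4, which is why they are indented. Logging must happen while the run is still active:

```python
    mlflow.log_param("C", C)               # Log hyperparameter
    mlflow.log_metric("accuracy", acc)     # Log evaluation metric
    mlflow.sklearn.log_model(model, "model")  # Log model artifact
```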

---

## Running the Script

Save the code in a Python file, e.g., `train.py`. Then run:

```bash
python train.py
```

Alternatively, run the code directly in a Jupyter notebook.

To launch the MLflow UI:

```bash
mlflow ui
```

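Then open [http://localhost:5000](http://localhost:5000) in a browser. Run the command from the same directory as the script so the UI picks up the run data stored there by default.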

---

## Output in MLflow

Each run logs:

* The hyperparameter `C`
* The `accuracy` score
* The trained model, saved under the run's artifacts as `model`

You can compare multiple runs directly in the UI.

---

## Next Steps

* Enable autologging with `mlflow.sklearn.autolog()` to automatically track parameters and metrics
* Register the model in the MLflow Model Registry for deployment and lifecycle management
* Add a model signature and an input example so the expected input schema is recorded with the model

---

For more details, see the [MLflow documentation](https://mlflow.org/docs/latest/index.html).

The complete script, combining the steps above:

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
import pandas as pd

df = pd.read_csv("bankdata.csv")

# Label-encode every string-valued column, including the target y
for col in df.select_dtypes(include=["object"]).columns:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop("y", axis=1)
y = df["y"]

# A fixed random_state makes the split, and therefore the logged accuracy, repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

C = 0.1
with mlflow.start_run():
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_param("C", C)                  # Log hyperparameter
    mlflow.log_metric("accuracy", acc)        # Log evaluation metric
    mlflow.sklearn.log_model(model, "model")  # Log model artifact
```