

## End-to-End MLOps Pipeline for Spam Detection

**Using Modular Python, DVC, DVC Live, and AWS S3**

This project builds a **production-grade MLOps pipeline** for a Spam Detection use case. It transitions from an experimental, notebook-based workflow to a **fully automated, versioned, and reproducible pipeline** using **DVC** and **AWS S3**.

The pipeline is modular, parameterized, experiment-driven, and cloud-backed, mirroring common industry MLOps practices.

---

## Phase 1: Project Initialization & Modular Setup

### 1. Repository Setup

* Create a new GitHub repository
* Clone it locally
* All development and experiments happen inside this repository

### 2. Modular Code Structure

* Replace the monolithic Jupyter Notebook with **modular Python scripts**
* Create a `src/` directory containing independent pipeline components
* Each script performs **one clearly defined task**

Example structure:

```
src/
 ├── data_ingestion.py
 ├── data_preprocessing.py
 ├── feature_engineering.py
 ├── model_training.py
 └── model_evaluation.py
```

### 3. Logging System

* Use Python’s `logging` module
* Configure:

  * **StreamHandler** → console logs
  * **FileHandler** → persistent `.log` files
* Support log levels such as `DEBUG`, `INFO`, and `ERROR`

This enables:

* Debugging during development
* Traceability during automated DVC runs
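
A minimal sketch of such a logging setup, using only the standard library. The helper name `get_logger`, the `logs/` directory, and the format string are assumptions made for illustration; the project may organize this differently.

```python
import logging
import os


def get_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes to the console and to <log_dir>/<name>.log."""
    os.makedirs(log_dir, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)

    # Avoid attaching duplicate handlers if the function is called twice.
    if logger.handlers:
        return logger

    formatter = logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    )

    # StreamHandler: console output during development.
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.DEBUG)
    console_handler.setFormatter(formatter)

    # FileHandler: persistent log file for later inspection.
    file_handler = logging.FileHandler(os.path.join(log_dir, f"{name}.log"))
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(formatter)

    logger.addHandler(console_handler)
    logger.addHandler(file_handler)
    return logger
```

Each script could then call `logger = get_logger("data_ingestion")` once at import time and use `logger.debug(...)`, `logger.info(...)`, and `logger.error(...)` throughout.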

### 4. Exception Handling

* Wrap all major logic inside `try-except` blocks
* Log errors instead of failing silently
* Raise exceptions after logging to fail fast when required

This ensures the pipeline **fails gracefully and transparently**.
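
The log-then-re-raise pattern described above might look like the following sketch. The function `read_text_file` and the logger name are hypothetical stand-ins for whatever a pipeline script actually does.

```python
import logging

logger = logging.getLogger("data_ingestion")


def read_text_file(path: str) -> str:
    """Read a file, logging any failure before re-raising it."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        # Log the specific failure, then let the caller (or DVC) see it.
        logger.error("File not found: %s", path)
        raise
    except Exception as exc:
        logger.error("Unexpected error while reading %s: %s", path, exc)
        raise
```

Re-raising matters under DVC: a stage whose script exits with an error is reported as failed, so downstream stages do not run on incomplete outputs.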

---

## Phase 2: Modular Pipeline Components

Each pipeline stage is implemented as an **independent Python module**.

### 1. Data Ingestion

* Pull raw data from a source (GitHub URL or AWS S3)
* Perform an initial **train–test split**
* Save outputs to the `data/raw/` directory
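
As a rough illustration of the split-and-save step, the sketch below uses only the standard library. It is a stand-in for what would more commonly be done with pandas and scikit-learn's `train_test_split`; the function name `split_and_save`, the file names `train.csv` and `test.csv`, and the seed are assumptions, and the rows are assumed to be already downloaded and parsed.

```python
import csv
import os
import random


def split_and_save(rows, header, out_dir="data/raw", test_size=0.2, seed=42):
    """Shuffle rows, split them into train/test, and write both as CSV files."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_test = max(1, round(len(shuffled) * test_size))
    test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]

    os.makedirs(out_dir, exist_ok=True)
    for name, subset in (("train.csv", train_rows), ("test.csv", test_rows)):
        path = os.path.join(out_dir, name)
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)
    return train_rows, test_rows
```

Fixing the seed keeps the split identical between runs, which is what lets later stages be reproduced from the same inputs.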

### 2. Data Preprocessing

* Drop irrelevant columns
* Encode target labels (spam → 1, ham → 0)
* Remove duplicate records
* Clean text (lowercasing, tokenization, stemming)
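
The cleaning steps can be sketched in plain Python as follows. This is only an illustration: `naive_stem` is a deliberately crude placeholder for a real stemmer (such as a Porter stemmer from an NLP library), and the function names and the `(label, text)` record shape are assumptions.

```python
import string

LABEL_MAP = {"ham": 0, "spam": 1}


def naive_stem(token: str) -> str:
    """Very rough suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, tokenize on whitespace, and stem."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return " ".join(naive_stem(tok) for tok in tokens)


def preprocess(records):
    """Encode labels, clean text, and drop duplicate (label, text) pairs."""
    seen = set()
    cleaned = []
    for label, text in records:
        row = (LABEL_MAP[label], clean_text(text))
        if row not in seen:  # keep only the first occurrence
            seen.add(row)
            cleaned.append(row)
    return cleaned
```

Note that this sketch removes duplicates after cleaning, so messages that differ only in case or punctuation collapse into one record; deduplicating on the raw text instead is an equally valid choice.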

### 3. Feature Engineering

* Convert text data into numerical vectors
* Use **TF-IDF Vectorization**
* Control vocabulary size (e.g., `max_features = 500`)
* Persist vectorized outputs
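
To make the idea concrete, here is a small pure-Python sketch of TF-IDF with a vocabulary cap. It uses the textbook definition (term frequency times `log(N / df)`); scikit-learn's `TfidfVectorizer`, which a project like this would more likely use, applies a smoothed IDF and L2 normalization by default, so its numbers will differ. The function name `tfidf_matrix` is an assumption.

```python
import math
from collections import Counter


def tfidf_matrix(docs, max_features=500):
    """Return (vocabulary, rows) with one TF-IDF row per whitespace-tokenized doc."""
    tokenized = [doc.split() for doc in docs]

    # Keep only the most frequent terms across the corpus.
    corpus_counts = Counter(tok for doc in tokenized for tok in doc)
    vocabulary = sorted(term for term, _ in corpus_counts.most_common(max_features))

    # Document frequency: number of docs containing each term.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    n_docs = len(tokenized)
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty docs
        rows.append([counts[term] / total * idf[term] for term in vocabulary])
    return vocabulary, rows
```

Capping the vocabulary keeps the feature matrix at a fixed, small width regardless of how many distinct words appear in the messages.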

### 4. Model Training

* Train a **Random Forest Classifier**
* Use hyperparameters defined externally
* Save the trained model as a `.pkl` file in `models/`
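
A hedged sketch of the training step, assuming scikit-learn is installed. The function name `train_and_save`, the file name `model.pkl`, the fixed `random_state`, and the two hyperparameter keys are illustrative assumptions rather than the project's actual code.

```python
import os
import pickle

from sklearn.ensemble import RandomForestClassifier


def train_and_save(X_train, y_train, model_params, model_path="models/model.pkl"):
    """Fit a Random Forest with externally supplied hyperparameters and pickle it."""
    clf = RandomForestClassifier(
        n_estimators=model_params["n_estimators"],
        max_depth=model_params["max_depth"],
        random_state=42,  # assumption: fixed seed for repeatable runs
    )
    clf.fit(X_train, y_train)

    # Create the output directory if needed, then serialize the model.
    os.makedirs(os.path.dirname(model_path) or ".", exist_ok=True)
    with open(model_path, "wb") as f:
        pickle.dump(clf, f)
    return clf
```

Passing the hyperparameters in as a dictionary keeps the training code unchanged when the values are tuned elsewhere.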

### 5. Model Evaluation

* Evaluate model performance
* Compute metrics:

  * Accuracy
  * Precision
  * Recall
* Save metrics to a `metrics.json` file
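
The three metrics can be computed directly from the confusion counts, as in the sketch below. It treats spam (label 1) as the positive class and is written in plain Python for clarity; in practice `sklearn.metrics` offers equivalent functions. The function name `evaluate` and the zero fallback for empty denominators are assumptions.

```python
import json


def evaluate(y_true, y_pred, metrics_path="metrics.json"):
    """Compute accuracy, precision, and recall for binary labels (spam = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    correct = sum(1 for t, p in pairs if t == p)

    metrics = {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
    with open(metrics_path, "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=4)
    return metrics
```

For spam filtering, precision reflects how often a flagged message really is spam, while recall reflects how much spam gets caught, so the two are worth tracking separately from accuracy.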

---

## Phase 3: Pipeline Automation with DVC (`dvc.yaml`)

Manual script execution is replaced with **DVC pipeline orchestration**.

### DVC Pipeline Concept

* Each pipeline step becomes a **DVC stage**
* Stages define:

  * Command (`cmd`)
  * Dependencies (`deps`)
  * Outputs (`outs`)

### Example `dvc.yaml`

```yaml
stages:
  data_ingestion:
    cmd: python src/data_ingestion.py
    deps:
      - src/data_ingestion.py
    outs:
      - data/raw
```

### Key Benefits

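* `dvc repro` runs the entire pipeline in dependency order
* DVC skips any stage whose dependencies are unchanged, based on the hashes recorded in `dvc.lock`
* Skipping unchanged stages saves time and compute
* Recorded hashes make it possible to reproduce the same outputs from the same inputs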
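
The skipping behaviour can be pictured with a simplified sketch: hash every dependency, compare against the hashes recorded after the last successful run, and re-run only on a mismatch. This is a conceptual illustration, not DVC's implementation; the JSON lock file, the function names, and the choice of SHA-256 are all assumptions.

```python
import hashlib
import json
import os


def file_hash(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_is_stale(deps, lock_path: str) -> bool:
    """Return True if any dependency hash differs from the recorded one."""
    current = {dep: file_hash(dep) for dep in deps}
    if not os.path.exists(lock_path):
        return True  # never run before
    with open(lock_path, "r", encoding="utf-8") as f:
        recorded = json.load(f)
    return recorded != current


def record_stage(deps, lock_path: str) -> None:
    """Store the current dependency hashes after a successful run."""
    with open(lock_path, "w", encoding="utf-8") as f:
        json.dump({dep: file_hash(dep) for dep in deps}, f, indent=2)
```

Because the check is content-based rather than timestamp-based, touching a file without changing it does not trigger a re-run.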

You can visualize dependencies using:

```
dvc dag
```

---

## Phase 4: Parameterization with `params.yaml`

Hard-coded values are moved out of the scripts and into a single configuration file.

### Centralized Configuration

All tunable values are stored in `params.yaml`:

```yaml
data_ingestion:
  test_size: 0.2

feature_engineering:
  max_features: 500

model_building:
  n_estimators: 100
  max_depth: 10
```

### Parameter Loader Function

```python
import logging

import yaml  # provided by the PyYAML package

logger = logging.getLogger(__name__)


def load_params(params_path: str) -> dict:
    """Load pipeline parameters from a YAML file."""
    try:
        with open(params_path, 'r') as file:
            params = yaml.safe_load(file)
        logger.debug('Parameters retrieved from %s', params_path)
        return params
    except FileNotFoundError:
        logger.error('File not found: %s', params_path)
        raise
    except yaml.YAMLError as e:
        logger.error('YAML error: %s', e)
        raise
    except Exception as e:
        logger.error('Unexpected error: %s', e)
        raise

### Usage in Pipeline Components

```python
params = load_params('params.yaml')
test_size = params['data_ingestion']['test_size']
max_features = params['feature_engineering']['max_features']
model_params = params['model_building']
```

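Because every script reads its values from `params.yaml`, a parameter change takes effect on the next run without touching the code. For `dvc repro` to detect the change and re-run the affected stages, each stage must also list the parameters it uses under `params` in `dvc.yaml` (for example, `data_ingestion.test_size`).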

---

## Phase 5: Experiment Tracking with DVC Live

### Setup

```bash
pip install dvclive
```

### DVC Live Integration

```python
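from dvclive import Live

# `accuracy`, `precision`, `recall`, and `params` are assumed to have been
# computed or loaded earlier in the evaluation script.
with Live(save_dvc_exp=True) as live:
    live.log_metric('accuracy', accuracy)
    live.log_metric('precision', precision)
    live.log_metric('recall', recall)
    live.log_params(params)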

### Experiment Workflow

* `dvc exp run` → executes an experiment
* Each run is tracked automatically
* Metrics and parameters are stored per experiment
* Compare results using:

  * `dvc exp show`
  * DVC VS Code extension

### Experiment Control

* `dvc exp remove <exp-name>` → delete experiment
* `dvc exp apply <exp-name>` → restore a previous run

This enables **systematic hyperparameter tuning** without manual bookkeeping.

---

## Phase 6: AWS S3 Integration for Remote Storage

### 1. IAM Configuration

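* Create an IAM user with **Programmatic Access**
* Attach `AdministratorAccess` only for throwaway development work; for anything shared, prefer a policy limited to S3 (such as `AmazonS3FullAccess`) or to the DVC bucket itself
* Generate:

  * Access Key ID
  * Secret Access Key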

### 2. S3 Bucket

* Create a bucket (e.g., `dvc-s3-proj`)
* This bucket becomes the **DVC remote backend**

### 3. DVC Remote Setup

```bash
aws configure
dvc remote add -d storage s3://dvc-s3-proj
```

### 4. Push Artifacts to Cloud

```bash
dvc push
```

This uploads:

* Data versions
* Models
* Pipeline outputs

Git tracks metadata, while S3 stores large artifacts.

---

## Final Result

This workflow delivers:

* Fully modular ML code
* Automated, dependency-aware pipelines
* Parameterized experimentation
* Experiment tracking with lineage
* Cloud-backed data and model versioning
* Reproducibility across machines and teams

---

## Command Reference for the End-to-End MLOps Workflow

**(Git + Modular Python + DVC + DVC Live + AWS S3)**

This section documents **all commands used throughout the lifecycle** of the Spam Detection MLOps project. Commands are grouped by purpose and follow the actual execution order used in a real pipeline.

---

## 1. Git and Environment Setup

These commands establish version control, enable collaboration, and track code evolution.

* **`git clone <url>`**
  Clones the GitHub repository from the remote server to the local machine.

* **`cd <folder_name>`**
  Moves into the project directory where all development will occur.

* **`code .`**
  Opens the project directory in Visual Studio Code.

* **`git status`**
  Displays the current repository state, including staged, unstaged, and untracked files.

* **`git add .`**
  Stages all modified and new files for commit.

* **`git commit -m "message"`**
  Records the staged changes locally with a descriptive commit message.

* **`git push origin main`**
  Pushes committed changes from the local branch to the remote GitHub repository.

These commands are executed repeatedly throughout the project as the pipeline evolves.

---

## 2. Python Pipeline Execution (Manual Runs)

Before automation, each pipeline component is executed independently to validate correctness.

* **`python src/data_ingestion.py`**
  Runs the script responsible for loading raw data and performing the initial train-test split.

* **`python src/data_preprocessing.py`**
  Executes data cleaning steps such as label encoding, duplicate removal, and text normalization.

* **`python src/feature_engineering.py`**
  Converts cleaned text into numerical features using TF-IDF vectorization.

* **`python src/model_training.py`**
  Trains the Random Forest classifier and saves the trained model artifact.

* **`python src/model_evaluation.py`**
  Computes evaluation metrics (accuracy, precision, recall) and stores them as structured outputs.

These commands confirm that **each module works independently** before being wired into DVC.

---

## 3. DVC Pipeline Management

DVC replaces manual execution with automated, dependency-aware pipelines.

* **`dvc init`**
  Initializes DVC inside the existing Git repository, creating the `.dvc/` directory and a `.dvcignore` file.

* **`dvc repro`**
  Executes the full pipeline defined in `dvc.yaml`.
  Only stages with changed dependencies are re-run.

* **`dvc dag`**
  Visualizes the pipeline as a Directed Acyclic Graph (DAG), showing stage dependencies.

* **`dvc commit`**
  Records the current state of tracked data or stage outputs in the DVC cache and updates `dvc.lock` or the `.dvc` files, without re-running the pipeline.

* **`dvc stage add`**
  Creates a new pipeline stage directly from the terminal. The command to run is given as the final argument, preceded by options such as:

  * `-n` for stage name
  * `-d` for dependencies
  * `-o` for outputs
  * `-p` for parameters

Once added, stages are stored in `dvc.yaml` and tracked by Git.

---

## 4. Experiment Tracking with DVC Live

These commands enable systematic experimentation and metric comparison.

* **`pip install dvclive`**
  Installs the DVC Live library used for logging metrics and parameters.

* **`dvc exp run`**
  Executes the pipeline as an experiment.
  Each run records metrics, parameters, and artifacts without creating commits on the current Git branch.

* **`dvc exp show`**
  Displays a comparison table of all experiments, including metrics and parameter values.

* **`dvc exp remove <exp_name>`**
  Deletes a specific experiment and its associated metadata.

* **`dvc exp apply <exp_name>`**
  Restores the project (code, data, and parameters) to the state of a selected experiment.

This allows controlled hyperparameter tuning with full lineage tracking.

---

## 5. AWS S3 and Remote Storage Integration

These commands connect the local DVC pipeline to cloud-based storage.

* **`pip install "dvc[s3]"`**
  Installs DVC with AWS S3 support. The quotes stop shells such as zsh from interpreting the square brackets.

* **`pip install awscli`**
  Installs the AWS Command Line Interface.

* **`aws configure`**
  Configures AWS credentials locally by prompting for and storing:

  * Access Key ID
  * Secret Access Key
  * Default region
  * Default output format

* **`dvc remote add -d <name> s3://<bucket-name>`**
  Registers an S3 bucket as the default DVC remote storage.

* **`dvc push`**
  Uploads all DVC-tracked data, models, and pipeline outputs to the S3 bucket.

Git stores metadata, while S3 stores large artifacts.

---

## Summary

This command set supports:

* Version-controlled ML development
* Modular pipeline validation
* Automated dependency-based execution
* Parameterized experimentation
* Cloud-backed data and model versioning


