# Day 70 - CI/CD Pipeline for Machine Learning

In this notebook, I learned how to automate the **training and testing of machine learning models** using a **Continuous Integration and Continuous Deployment (CI/CD)** pipeline.

The goal was to:
- Understand how CI/CD fits into the **MLOps lifecycle**
- Create a simple ML project structure compatible with CI/CD
- Automate testing and model training using **GitHub Actions**
- Learn how workflows can be triggered automatically on push or pull requests

This setup helps ensure that every change in the codebase is **automatically tested, trained, and validated**, making the ML workflow more robust and production-ready.

---

### **1. Introduction**

**CI/CD** stands for **Continuous Integration** and **Continuous Deployment (or Delivery)**.

In traditional software engineering, CI/CD automates the process of integrating new code, testing it, and deploying it into production.

For **Machine Learning (ML)** projects, CI/CD means that whenever data or code changes, the ML pipeline (**data processing, model training, evaluation, and deployment**) runs automatically, which keeps results consistent, reliable, and reproducible.

---

### **2. Why CI/CD is Used in Machine Learning**

ML projects differ from traditional software projects because they include:

* **Data dependency** — Models rely heavily on the quality and version of data.
* **Model retraining** — Models need regular retraining when new data arrives.
* **Experiment tracking** — Different hyperparameters, algorithms, and metrics must be recorded.
* **Automation** — Manual training and evaluation take time; automation ensures faster delivery and fewer human errors.

Hence, CI/CD in ML automates:

* The **training and evaluation** process whenever code or data changes.
* The **testing of code** (e.g., data validation, model accuracy thresholds).
* The **deployment** of updated models into production environments automatically.
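
As one illustration of the testing step above, a data validation check can run before any training starts. The sketch below is a minimal example and not part of the original project: the function name `validate_rows` and the specific rules (required columns present, feature values numeric and positive) are assumptions, and the column names follow the cleaned iris columns used later in this notebook.

```python
import csv
import io

# Assumed column names: the cleaned iris columns used later in this notebook.
FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
REQUIRED = FEATURES + ["target"]


def validate_rows(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name in FEATURES:
            try:
                value = float(row[name])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: {name} is not numeric")
                continue
            if value <= 0:
                problems.append(f"line {line_no}: {name} must be positive")
    return problems
```

In a CI run, a test would fail whenever the returned list is non-empty, which stops the pipeline before a model is trained on bad data.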

**Benefits:**

* Ensures **consistency** across environments.
* Reduces **manual intervention** and **errors**.
* Enables **faster iteration** and **reproducibility**.
* Simplifies **collaboration** through version control and automation.

---

### **3. Tools Commonly Used in CI/CD for ML**

| **Stage**                  | **Tools Used**                                  |
| -------------------------- | ----------------------------------------------- |
| **Version Control**        | Git, GitHub, GitLab                             |
| **CI/CD Automation**       | GitHub Actions, Jenkins, GitLab CI/CD, CircleCI |
| **Experiment Tracking**    | MLflow, DVC, Weights & Biases                   |
| **Containerization**       | Docker                                          |
| **Cloud/Deployment**       | AWS, GCP, Azure, Streamlit, Flask               |
| **Testing**                | Pytest, Unittest                                |
| **Environment Management** | Conda, pip, requirements.txt                    |

---

### **4. Typical CI/CD Workflow for ML Pipeline**

A **typical CI/CD ML pipeline** is structured as follows:

- **Step 1: Data Extraction and Cleaning**
- **Step 2: Train/Test Split**
- **Step 3: Model Building**
- **Step 4: Model Evaluation**
- **Step 5: Save Outputs**
- **Step 6: Push Code and Files to GitHub**
- **Step 7: Create a GitHub Actions Workflow**
- **Step 8: CI/CD Auto-Runs Pipeline**



## Model Training

In [6]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, f1_score, recall_score
import warnings
warnings.filterwarnings('ignore')
sns.set(style='white')

In [7]:
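# Load Data
# Note: this absolute Windows path only exists on the author's machine;
# a CI runner needs a path relative to the repository (e.g. 'iris.csv').
dataset = pd.read_csv(r'C:\Arman\FSDS GenAI 2025\Practice\CICD Pipeline\iris.csv')

# Clean feature names: drop the ' (cm)' suffix and replace spaces with underscores.
# str.strip(' (cm)') would strip any of those characters from both ends, not the suffix itself.
dataset.columns = [colname.replace(' (cm)', '').strip().replace(' ', '_') for colname in dataset.columns.tolist()]
features_names = dataset.columns.tolist()[:4]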

In [8]:
# Feature Engineering
dataset['sepal_length_width_ratio'] = dataset['sepal_length'] / dataset['sepal_width']
dataset['petal_length_width_ratio'] = dataset['petal_length'] / dataset['petal_width']

# Select Features (Correct the duplicate column issue)
dataset = dataset[['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 
                   'sepal_length_width_ratio', 'petal_length_width_ratio', 'target']]

In [9]:
# Split Data
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=44)

# X_train, y_train, X_test, y_test
X_train = train_data.drop('target', axis=1).values.astype('float32')
y_train = train_data.loc[:, 'target'].values.astype('int32')
X_test = test_data.drop('target', axis=1).values.astype('float32')
y_test = test_data.loc[:, 'target'].values.astype('int32')

In [10]:
# Logistic Regression
logreg = LogisticRegression(C=0.0001, solver='lbfgs', max_iter=100, multi_class='multinomial')
logreg.fit(X_train, y_train)
predictions_lr = logreg.predict(X_test)

In [15]:
cm_lr = confusion_matrix(y_test, predictions_lr)
f1_lr = f1_score(y_test, predictions_lr, average='micro')
prec_lr = precision_score(y_test, predictions_lr, average='micro')
recall_lr = recall_score(y_test, predictions_lr, average='micro')
# Accuracy
train_acc_lr = logreg.score(X_train, y_train) * 100
test_acc_lr = logreg.score(X_test, y_test) * 100

In [13]:
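# Random Forest
# Note: this is a regressor fitted on the integer class labels;
# its continuous outputs are rounded to class labels below.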
rf_reg = RandomForestRegressor()
rf_reg.fit(X_train, y_train)
predictions_rf = rf_reg.predict(X_test)

# Convert predictions to class labels
predictions_rf_class = np.round(predictions_rf).astype(int)

In [14]:
f1_rf = f1_score(y_test, predictions_rf_class, average='micro')
prec_rf = precision_score(y_test, predictions_rf_class, average='micro')
recall_rf = recall_score(y_test, predictions_rf_class, average='micro')

# R^2 (coefficient of determination): score() on a regressor is not classification accuracy
train_acc_rf = rf_reg.score(X_train, y_train) * 100
test_acc_rf = rf_reg.score(X_test, y_test) * 100

In [23]:
# Saving Scores to a File
# The train/test scores were already multiplied by 100 above. F1, recall, and
# precision are fractions in [0, 1], so they are scaled to percent here.
# For the random forest regressor, "Var" is R^2 (variance explained), not accuracy.
with open('scores.txt', "w") as score:
    score.write("Random Forest Train Var: %2.1f%%\n" % train_acc_rf)
    score.write("Random Forest Test Var: %2.1f%%\n" % test_acc_rf)
    score.write("F1 Score: %2.1f%%\n" % (f1_rf * 100))
    score.write("Recall Score: %2.1f%%\n" % (recall_rf * 100))
    score.write("Precision Score: %2.1f%%\n" % (prec_rf * 100))

    score.write("\n\n")

    score.write("Logistic Regression Train Accuracy: %2.1f%%\n" % train_acc_lr)
    score.write("Logistic Regression Test Accuracy: %2.1f%%\n" % test_acc_lr)
    score.write("F1 Score: %2.1f%%\n" % (f1_lr * 100))
    score.write("Recall Score: %2.1f%%\n" % (recall_lr * 100))
    score.write("Precision Score: %2.1f%%\n" % (prec_lr * 100))
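
A CI job can use this file to enforce a minimum quality bar. The sketch below is one possible gate and not part of the original pipeline: `parse_scores`, `gate`, and any threshold passed to them are hypothetical, and it assumes each line of `scores.txt` has the form `Label: value%` as written above.

```python
import re

# Matches lines such as "Logistic Regression Test Var: 93.3%".
LINE = re.compile(r"^(?P<label>.+?):\s*(?P<value>-?\d+(?:\.\d+)?)%\s*$")


def parse_scores(text):
    """Return (label, value) pairs in file order; labels may repeat."""
    pairs = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            pairs.append((match.group("label"), float(match.group("value"))))
    return pairs


def gate(text, label_prefix, minimum):
    """Return True if at least one label starts with label_prefix and all such scores are >= minimum."""
    values = [v for label, v in parse_scores(text) if label.startswith(label_prefix)]
    return bool(values) and all(v >= minimum for v in values)
```

A training script could end with something like `sys.exit(0 if gate(text, "Logistic Regression Test", 80.0) else 1)`. GitHub Actions marks a step as failed when its command exits with a non-zero code, so a score below the chosen threshold would fail the run.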

---

## **GitHub Setup for CI/CD Pipeline**

Once the local machine learning pipeline is complete, the next step is to automate it using **GitHub Actions**.

Follow these steps to set up the CI/CD pipeline for your ML project:

---

### **Step 1: Create a New GitHub Repository**

1. Go to [https://github.com/new](https://github.com/new).
2. Create a new repository named **`cicd_pipeline`** (it can be public or private).


### **Step 2: Upload Project Files**

After creating the repository, upload all the necessary files for the pipeline:

**Files to upload:**

* `iris.csv` → Dataset file
* `train_model.py` → Python script that trains and evaluates the model
* `requirements.txt` → List of dependencies required for the pipeline


### **Step 3: Create GitHub Actions Workflow Folder**

Inside the repository, create a hidden folder named **`.github/workflows/`**.

This folder will contain your CI/CD workflow configuration file.


### **Step 4: Add the Workflow File (`run.yml`)**

Create a new file named **`run.yml`** inside the `.github/workflows/` folder.

This file defines the CI/CD pipeline steps that GitHub Actions will execute every time you push changes to the repository.
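
The contents of `run.yml` are not shown in this notebook, so the following is only a sketch of what such a workflow might contain. It is written as a small Python helper (a hypothetical `write_workflow` function) that creates `.github/workflows/run.yml`. The workflow name matches the one visible in the Actions tab later in this notebook; the job name, step names, Python version, and action versions are assumptions.

```python
from pathlib import Path

# Assumed workflow: the author's real run.yml is not shown in this notebook.
WORKFLOW = """\
name: Model Training CICD

on: [push, pull_request]

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Train and evaluate the model
        run: python train_model.py
"""


def write_workflow(repo_root):
    """Create .github/workflows/run.yml under repo_root and return its path."""
    target = Path(repo_root) / ".github" / "workflows" / "run.yml"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(WORKFLOW, encoding="utf-8")
    return target


# Example: write_workflow(".")  # run from the repository root
```

Writing the file from Python keeps the example in the notebook's own language; creating the same file by hand in the GitHub web editor gives an identical result.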


### **Step 5: Commit and Push Changes**

After adding the workflow file, commit and push all files to GitHub (for example with `git add .`, `git commit -m "Add CI workflow"`, and `git push`).


### **Step 6: View CI/CD Pipeline in GitHub Actions Tab**

1. Go to your repository on GitHub.
2. Click on the **“Actions”** tab (next to Code, Issues, Pull Requests).
3. You’ll see the workflow running automatically. It is listed under the name set in `run.yml` (**“Model Training CICD”** in the run shown later in this notebook).
4. The pipeline performs the following:

   * Installs dependencies.
   * Executes your training script (`train_model.py`).
   * Marks the run as failed if any step exits with an error, which confirms whether training completed.

If everything is correctly configured, the workflow run is marked with a green check mark.


### **Final Outcome**

After completing all steps:

* Your ML pipeline is **automated**.
* Every push or pull request to the repository triggers the **training workflow**.
* The workflow ensures that:

  * All dependencies are installed,
  * The ML model is trained and evaluated,
  * Errors are caught early before deployment.

---

## **GitHub Setup and Execution Screenshots**

Below are the step-by-step visuals showing how the CI/CD pipeline was created and executed using **GitHub Actions**.

---

### **Step 1: Creating the Repository and Uploading Files**

You can see the newly created repository named **`cicd_pipeline`**.

The essential files — `iris.csv`, `train_model.py`, and `requirements.txt` — have been uploaded successfully.

*Repository structure after uploading files*

![Repository structure](screenshots/screenshot1.png)

---

### **Step 2: Creating the Workflow Folder and File**

Inside the repository, a new folder `.github/workflows/` is created, and within it, a file named **`run.yml`** is added.

This file defines the CI/CD workflow that will automatically execute the ML pipeline.

*Creating `.github/workflows/run.yml` file in GitHub*
![Creating run.yml file](screenshots/screenshot2.png)

---

### **Step 3: Adding the Workflow File (`run.yml`)**

The workflow file named `run.yml` was added inside `.github/workflows/`.

This file defines the steps that GitHub Actions will perform automatically whenever new code is pushed to the repository.


*Repository showing `.github/workflows/run.yml` file created*
![Workflow file added](screenshots/screenshot3.png)

---

### **Step 4: Initial Commit and Pipeline Trigger**


After committing and pushing the workflow file, the pipeline automatically starts running.

You can see the **“Model Training CICD”** workflow triggered under the **Actions** tab.

*Workflow run started in GitHub Actions*
![Pipeline triggered](screenshots/screenshot4.png)

---

### **Step 5: Debugging the First Pipeline Failure**

The initial workflow run failed because the dataset path used was an absolute local path.

This was fixed by updating the path in `train_model.py` to a relative one (`iris.csv`).
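
A plain relative path such as `iris.csv` works because GitHub Actions `run` steps start in the checked-out repository root by default. A slightly more defensive variation, shown here as an assumption and not as the fix actually used, resolves the file relative to the script's own location so it also works when the script is launched from another directory. The helper name `data_path` is hypothetical.

```python
from pathlib import Path


def data_path(script_file, filename="iris.csv"):
    """Return the path of filename located next to the given script file."""
    return Path(script_file).resolve().parent / filename


# Inside train_model.py this would be used as:
#   dataset = pd.read_csv(data_path(__file__))
```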

*Initial failed pipeline run showing the error in Actions tab*
![Failed pipeline](screenshots/screenshot5.png)

---

### **Step 6: Successful Pipeline Execution**

After correcting the path, a new commit triggered the workflow again — this time it executed successfully.

All steps (setup, checkout, train model) completed without errors, indicated by the green check mark ✅.

*Successful CI/CD pipeline run in GitHub Actions*
![Successful run](screenshots/screenshot6.png)

---

### **Final Outcome**

* The CI/CD pipeline automatically trains and evaluates the ML model on every push.
* Any future code or data updates will trigger the same automated workflow.
* This ensures consistent, reproducible, and testable ML model training.

---


## Conclusion

This notebook demonstrated how to integrate **CI/CD principles** into a machine learning workflow using **GitHub Actions**.  
The pipeline automatically:
- Triggers when new code is pushed to the repository  
- Executes training and evaluation scripts  
- Records model performance metrics in `scores.txt`  
- Ensures the workflow remains stable and reproducible  

This setup helps bridge the gap between **machine learning development and deployment**, creating a reliable automation process for model updates.


## Key Learnings

- Understood how CI/CD connects to the **MLOps lifecycle**
- Learned how to define a **GitHub Actions workflow** for ML automation
- Implemented **automated training and evaluation** through pipeline jobs
- Experienced how pipelines ensure **consistency and reproducibility** in ML models
- Observed how automation can catch issues early before production deployment

---