<a href="https://colab.research.google.com/github/annaphuongwit/ML-OPs/blob/main/05_github_actions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automating Your Workflow with GitHub Actions

This notebook will guide you through setting up your first automated workflow using GitHub Actions. Automation is a key skill for any developer or data scientist, as it helps catch errors early and ensures your project is always in a working state.

> This notebook shows how to set up a workflow for the data formatter your instructor used in the last lesson.

> If you want to follow the notebook in your own repository, make sure that the typo in `requirements.txt` (`panda` instead of `pandas`) is fixed.

### What are GitHub Actions?

GitHub Actions is a powerful automation tool built directly into GitHub. It allows you to create custom workflows that automatically run in response to events in your repository, like when you push code or create a pull request. We'll use it for Continuous Integration (CI), which means automatically running our script to ensure it works correctly whenever we update our project.

Workflows are defined in special files written in a format called YAML. Let's build ours.

---
## 1.&nbsp; Create the Workflow File 📝

GitHub needs to find your workflow files in a very specific location. You'll create this structure in your project using the VS Code file explorer.

1.  In the root directory of your project, create a new folder named `.github`. (The dot at the beginning is important!)
2.  Inside the `.github` folder, create another folder named `workflows`.
3.  Inside the `workflows` folder, create a new file named `ci.yml`.

Your project structure should now look like this:

```
project-folder/
├── .github/
│   └── workflows/
│       └── ci.yml
├── data/
├── src/
└── ... (other files)
```
4.  Open the empty `ci.yml` file. We'll add our code to it piece by piece.
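
If you would rather script this than click through the file explorer, the structure above could be created with a few lines of Python. This is only a minimal sketch: the helper name `create_workflow_file` is made up for this example, and it assumes you pass it the root folder of your project.

```python
from pathlib import Path


def create_workflow_file(project_root: Path) -> Path:
    """Create .github/workflows/ci.yml under project_root and return its path.

    Hypothetical helper for illustration; it is not part of the lesson's repository.
    """
    workflows_dir = project_root / ".github" / "workflows"
    # parents=True also creates .github; exist_ok=True makes reruns harmless.
    workflows_dir.mkdir(parents=True, exist_ok=True)
    workflow_file = workflows_dir / "ci.yml"
    # touch() creates an empty file and leaves an existing file's content alone.
    workflow_file.touch(exist_ok=True)
    return workflow_file
```

Calling `create_workflow_file(Path.cwd())` from your project root should leave you with the same empty `ci.yml` as the manual steps.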

> **Pre-made Workflows**
>
> While we are building our workflow from scratch to understand each part, you should know that GitHub provides pre-made templates for common projects.
>
> These templates are a fantastic starting point for your own projects and can save you a lot of time. You can explore them and learn more in the [GitHub documentation on starter workflows](https://docs.github.com/en/actions/using-workflows/using-starter-workflows).
>
> For this lesson, however, we'll continue building ours manually to learn what each line does.

---
## 2.&nbsp; Name the Workflow and Define Triggers 🏷️

Every workflow needs a name and a set of triggers that tell it when to run. Add the following code to your `ci.yml` file.

```yaml
name: CI

on:
  push:
    branches: [ main ]
    paths:
      - "src/**"
      - "data/**"
      - "requirements.txt"
      - ".github/workflows/**"
  pull_request:
    paths:
      - "src/**"
      - "data/**"
      - "requirements.txt"
      - ".github/workflows/**"
```

* `name: CI`: This is simply the display name for your workflow. You'll see this name on the "Actions" tab of your GitHub repository.
* `on:`: This keyword defines the trigger events.
* `push:`: This makes the workflow run whenever someone pushes code to the repository. We've added two filters to it:
    * `branches: [ main ]`: This specifies that the workflow should only run for pushes to the `main` branch.
    * `paths:`: This is a crucial optimisation. The workflow only runs if the pushed commits change at least one file that matches one of these patterns. The `**` is a wildcard that means "any file or any sub-folder inside this directory." This prevents the workflow from running unnecessarily when you only change a file like `README.md`.
* `pull_request:`: This trigger runs the workflow whenever a pull request is opened or updated. It uses the same `paths` filter but has no `branches` filter, so it applies to pull requests targeting any branch. This is great for checking that changes work correctly before they are merged into your main branch.
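
To build some intuition for the `paths` filter, here is a rough Python sketch of the decision it makes: run the workflow if at least one changed file matches at least one pattern. The function name `should_run` is invented for this example, and `fnmatch` is only an approximation of GitHub's own matcher (for instance, a single `*` in `fnmatch` also matches `/`, which GitHub's patterns do not).

```python
from fnmatch import fnmatchcase

# Patterns copied from the `paths:` filter in ci.yml.
PATH_FILTERS = [
    "src/**",
    "data/**",
    "requirements.txt",
    ".github/workflows/**",
]


def should_run(changed_files: list[str], patterns: list[str] = PATH_FILTERS) -> bool:
    """Return True if any changed file matches any pattern (approximation only)."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )


print(should_run(["README.md"]))                      # False
print(should_run(["README.md", "src/formatter.py"]))  # True
```

The second call returns `True` because one matching file is enough, even though `README.md` on its own would not trigger the workflow.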

---
## 3.&nbsp; Define the Job ⚙️

A workflow is made up of one or more jobs. A job is a set of steps that execute on a virtual machine called a runner. Add the following below the `on:` section:

```yaml
jobs:
  run-formatter:
    runs-on: ubuntu-latest
    timeout-minutes: 10
```

* `jobs:`: This keyword starts the section where we define our job(s).
* `run-formatter:`: This is the unique name (or ID) we've given our job.
* `runs-on: ubuntu-latest`: This tells GitHub to prepare a fresh virtual machine running the latest Ubuntu Linux image that GitHub currently offers for its hosted runners. This is a common and reliable choice for most projects.
* `timeout-minutes: 10`: This is a safety feature. If the job takes longer than 10 minutes for any reason, GitHub will automatically cancel it. Without this setting, a job can run for up to 360 minutes by default, so the limit prevents runaway processes from using up your free Actions minutes.

---
## 4.&nbsp; Add the Steps 🧱

Now we'll define the individual tasks, or steps, that our job will perform. These go inside the job, indented under `steps:`.

### 4.1. Check out repository
The first step is to get a copy of your code onto the runner, because a fresh runner does not contain your repository yet.

```yaml
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
```

* `steps:`: Starts the list of steps for the `run-formatter` job.
* `- name: ...`: Each step starts with a hyphen (`-`) and can have a `name` that describes what it does. This name appears in the logs, making it easy to follow along.
* `uses: actions/checkout@v4`: This is the most important part. `uses` tells the workflow to use a pre-built Action. An Action is a reusable piece of code that performs a common task.
    * The [`actions/checkout@v4`](https://github.com/marketplace/actions/checkout) action is the official action for checking out (downloading) your repository's code onto the runner. You can find it and thousands of other actions on the [GitHub Marketplace](https://github.com/marketplace?type=actions). Visiting the action's page shows you its documentation, different versions (`@v4` is the version), and examples of how to use it.

### 4.2. Set up Python
Next, we need to install the correct version of Python on the runner.

```yaml
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
```

* `uses: actions/setup-python@v5`: We use another official action here. As its name implies, the [`setup-python`](https://github.com/marketplace/actions/setup-python) action installs a specific Python version and adds it to the system's path.
* `with:`: This keyword allows you to provide input parameters to an action.
* `python-version: "3.11"`: We're telling the `setup-python` action that we specifically need version 3.11 of Python for our script to run.

### 4.3. Install dependencies
Just like on your local machine, you need to install the packages listed in `requirements.txt`.

```yaml
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
```

* `run:`: This keyword is different from `uses`. It executes shell commands directly on the runner.
* The Pipe Character (`|`): In YAML, the pipe character introduces a "literal block scalar." It means that the following indented lines are treated as a single, multi-line string with the line breaks preserved. This is the standard way to run a sequence of commands in a single `run` step, with each command on a new line.
* The commands themselves should be familiar: `python -m pip install --upgrade pip` ensures you have the latest version of pip, and `pip install -r requirements.txt` installs all the project dependencies.
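
One detail worth knowing about multi-line `run` steps: on Linux runners the script is executed with `bash -e` by default, so the step stops at the first command that fails. The sketch below imitates that locally from Python. It assumes `bash` is available on your machine, and the three-line script is a stand-in invented for this example, not the real install commands.

```python
import subprocess

# Stand-in for a multi-line `run:` block; `false` is a command that always fails.
script = "echo first\nfalse\necho never-printed\n"

# -e makes bash exit as soon as a command returns a non-zero status.
result = subprocess.run(
    ["bash", "-e", "-c", script],
    capture_output=True,
    text=True,
)

print(result.returncode)      # 1
print(result.stdout.strip())  # first
```

In the workflow this means that if upgrading pip fails, `pip install -r requirements.txt` is not attempted and the step is marked as failed.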

### 4.4. Run the formatter script
Finally, with the environment fully prepared, we can run our Python script.

```yaml
      - name: Run the formatter script
        run: |
          python src/formatter.py --input data/raw_sales_data.csv --output data/cleaned_sales_data.csv
```
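
The command in this step passes two options, `--input` and `--output`, to the script. The real `src/formatter.py` comes from the previous lesson and is not reproduced here; the sketch below only illustrates how a script might read such options with `argparse`. The helper `parse_args` and the placeholder body of `main` are assumptions made for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse the --input and --output paths (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Format a raw sales CSV file.")
    parser.add_argument("--input", required=True, help="Path to the raw CSV file")
    parser.add_argument("--output", required=True, help="Path for the cleaned CSV file")
    return parser.parse_args(argv)


def main(input_path, output_path):
    # Placeholder only: the real cleaning logic lives in the lesson's script.
    print(f"Would read {input_path} and write {output_path}")


# In a real script you would call parse_args() with no arguments so that it
# reads sys.argv; the explicit list here mirrors the command in the workflow.
args = parse_args([
    "--input", "data/raw_sales_data.csv",
    "--output", "data/cleaned_sales_data.csv",
])
main(args.input, args.output)
```

Because both options are marked `required=True` in this sketch, forgetting one of them in the workflow would make the script exit with an error, and the step would fail.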

---
## 5.&nbsp; The Complete Workflow File 📜

When you're finished, your `ci.yml` file should look exactly like this:

```yaml
name: CI

on:
  push:
    branches: [ main ]
    paths:
      - "src/**"
      - "data/**"
      - "requirements.txt"
      - ".github/workflows/**"
  pull_request:
    paths:
      - "src/**"
      - "data/**"
      - "requirements.txt"
      - ".github/workflows/**"

jobs:
  run-formatter:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run the formatter script
        run: |
          python src/formatter.py --input data/raw_sales_data.csv --output data/cleaned_sales_data.csv
```

---
## 6.&nbsp; Pushing and Exploring the Logs ☁️

Now that your file is complete, it's time to commit it and see it in action.

```bash
git add .github/workflows/ci.yml
git commit -m "feat(ci): Implement data formatting pipeline"
git push
```

Once you push, navigate to your repository on GitHub in your web browser and click on the Actions tab. You will see your new "CI" workflow running!

Click on the workflow run that was just started by your commit. On the left, you'll see your job, `run-formatter`. Click it.

You will see the list of steps you named in your YAML file. A green checkmark ✅ next to each one means it succeeded. You can click the arrow next to any step to expand it and see the detailed output, just as if you had run the commands in your own terminal. This is fantastic for checking that everything worked as expected.

---
## 7.&nbsp; What Happens When Things Go Wrong? 💥

A CI workflow isn't just to confirm success; its real power is in catching failures early. Let's intentionally break our code to see what the logs look like.

1.  In VS Code, open the `src/formatter.py` script.
2.  Find the `main` function and comment out the whole function, both the `def` line and its body. Your code might look something like this:
    ```python
    # def main(input_path, output_path):
    #     ... function logic ...
    ```
3.  Save the file. Now, commit and push this "broken" code:

    ```bash
    git add src/formatter.py
    git commit -m "break: Temporarily disable main function"
    git push
    ```

4.  Go back to the Actions tab on GitHub. A new workflow run will have started.
5.  This time, you will see a ❌ next to the workflow run, instantly telling you something is wrong.
6.  Click into the failed run and the `run-formatter` job. Find the step with the red 'X' (it will be "Run the formatter script") and expand it.
7.  Instead of a success message, you will see a full Python error message and traceback. This tells you exactly what went wrong and where. This is the core value of CI: fast, automated feedback.
8.  To fix it, simply uncomment the `main` function in your local file, commit, and push again. The workflow will re-run and go back to green.
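
Why does the step turn red? A `run` step fails when its command exits with a non-zero status, and Python exits with status 1 when an exception is not caught. The sketch below reproduces that locally. The one-line `broken_script` is a stand-in written for this example: it calls a `main` function that was never defined, similar to what happens after you comment the function out.

```python
import subprocess
import sys

# Stand-in for the broken script: main() is called but no longer defined.
broken_script = "main('in.csv', 'out.csv')"

result = subprocess.run(
    [sys.executable, "-c", broken_script],
    capture_output=True,
    text=True,
)

print(result.returncode)              # 1: non-zero, so a CI step would fail
print("NameError" in result.stderr)   # True: the traceback names the error
```

The traceback captured in `result.stderr` is the same kind of output you see when you expand the failed step in the Actions log.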

---
## 8.&nbsp; Challenge 😀

Now it's time to apply everything you've learned to your own `sentiment-analysis-project`.

Your challenge is to create a brand new GitHub Actions workflow from scratch. This workflow's job is to automatically run your `predict.py` script to ensure it can successfully load the trained model and make predictions on sample text.

Use the `ci.yml` file we built together in this lesson as your guide. Think about what needs to stay the same and what needs to change.

Commit your new workflow file to your repository and push it to GitHub. Can you get the green checkmark ✅ on your own project?

If you can, then purposefully break your code and explore the failure ❌ logs. (Of course, when you're done, get your code in working order again 😉).