
# Professionalising Your Project with Ruff

You've successfully built a collaborative workflow and a complete sentiment analysis pipeline. The project works, but now it's time to elevate your code to a professional standard. This involves ensuring your scripts are not just functional, but also clean, consistent, and free of potential errors.

In this notebook, you'll learn to use [Ruff](https://docs.astral.sh/ruff/), an extremely fast Python linter and formatter.

Our mission is to take a messy script and use Ruff to systematically find and fix all the issues, turning it into a professional-quality codebase.

---
## 1.&nbsp; The Code Cleanup Mission 🕵️

To learn Ruff's core features in a controlled environment, we'll start with a dummy script. A colleague has left `messy_data_processor.py` in a messy state. It works, but it contains a mix of obvious errors, stylistic inconsistencies, and a subtle, hidden bug.

1. Create a new file in your project named `messy_data_processor.py`.
2. Copy the code below into this new file.

In [None]:
import os
import pandas as pd

def process_data(df: pd.DataFrame) -> pd.DataFrame:
    """Cleans and processes the dataframe."""
    print("Starting data processing!")

    my_variable = 123

    # Clean column names by stripping whitespace and converting to lower case
    df.columns = df.columns.str.strip().str.lower()

    # Ensure 'price' is a numeric column, coercing errors
    if 'price' in df.columns:
        df["price"] = pd.to_numeric(df["price"], errors="coerce")

    return df

if __name__ == "__main__":
    data = {"Product Name": ["  Laptop ", " Mouse", "Keyboard"],
            "Price": ["1200", "25.50", "75"]
           }
    initial_df = pd.DataFrame(data)

    cleaned_df = process_data(initial_df)
    print("Cleaned DataFrame:")
    print(cleaned_df)

### 1.1. Install Your Toolkit

With our messy script ready, let's install our tool. Make sure your project's virtual environment is activated in the VS Code terminal, then install Ruff.

In [None]:
conda install -c conda-forge ruff

---
## 2.&nbsp; Finding & Fixing Errors 🔍

First, let's hunt for common and easily fixed problems using `ruff check`. This command is a [linter](https://docs.astral.sh/ruff/linter/); it analyses code to find potential errors, bugs, and stylistic issues based on a set of rules.

### 2.1. Finding the Errors

Let's run Ruff's linter on our entire project. The `.` tells Ruff to analyse all Python files in the current directory and any subdirectories. This is the standard way to ensure your whole codebase is clean.

> If you're actively working on a single file and want to check only that one, you can also provide a specific path, like `ruff check messy_data_processor.py`. For now, we'll stick with checking the whole project.

In [None]:
ruff check .

Ruff finds the unused import and variable:
```
messy_data_processor.py:1:8: F401 [*] `os` imported but unused
messy_data_processor.py:8:5: F841 Local variable `my_variable` is assigned to but never used
Found 2 errors.
[*] 1 fixable with the `--fix` option (1 hidden fix can be enabled with the `--unsafe-fixes` option).
```

### 2.2. Understanding the Rules

But what do `F401` and `F841` actually mean? A key skill is knowing how to find out. Ruff has a built-in command, `ruff rule`, that prints the documentation for any rule code. It's a great way to learn why a rule is important.

Run the following in your terminal:

In [None]:
ruff rule F841

The output gives you a clear explanation of the rule ("Local variable `...` is assigned to but never used") and an example. This is an invaluable tool for understanding any rule Ruff reports.
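
To make the idea behind a rule like `F401` more concrete, here is a minimal sketch of an unused-import check built on Python's standard `ast` module. It is not how Ruff is implemented (Ruff is written in Rust and handles many more cases); the `SOURCE` string and the `find_unused_imports` helper are hypothetical names used only for illustration.

```python
import ast

# Hypothetical sample code to analyse; `os` is imported but never used.
SOURCE = '''
import os
import json

def process(values):
    return json.dumps(values)
'''


def find_unused_imports(source: str) -> list[str]:
    """Return imported names that are never read (a rough F401 analogue)."""
    tree = ast.parse(source)
    imported: set[str] = set()
    used: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the name `a`; `import a as b` binds `b`.
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            # Every name that is read somewhere counts as "used".
            used.add(node.id)
    return sorted(imported - used)


print(find_unused_imports(SOURCE))  # ['os']
```

A real linter also has to deal with scopes, re-exports, and `__all__`, which is why a dedicated tool is preferable to a hand-rolled check.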

### 2.3. Autofixing What's Safe

You may have noticed this text at the end of the `ruff check` output:
```
Found 2 errors.
[*] 1 fixable with the `--fix` option (1 hidden fix can be enabled with the `--unsafe-fixes` option).
```

The `--fix` flag tells Ruff to automatically apply every fix it classifies as "safe", meaning changes that won't alter the logic of your code. Let's run it.

In [None]:
ruff check . --fix

The output now says `Found 2 errors (1 fixed, 1 remaining).`

If you check `messy_data_processor.py`, the `import os` line is gone. Ruff autofixed this because it classifies removing this unused import as a safe fix.

However, the `my_variable` line is still there. Ruff marks this fix as "unsafe": it can't know whether you put the assignment there on purpose, for example for debugging, so it leaves the final decision to you (or to an explicit `--unsafe-fixes` run).

After some consideration, you realise `my_variable` serves no purpose now or later. Manually delete the `my_variable = 123` line to finish tidying up your code.
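
Deleting `my_variable = 123` is harmless because the right-hand side is a plain constant. As a hedged illustration of why removing an unused assignment is not safe in general, consider a case where the right-hand side has a side effect. Both functions below are hypothetical and exist only to show the difference.

```python
def next_job_with_unused_name(queue: list[str]) -> list[str]:
    # `job` is never read (an F841-style finding), but pop() still mutates `queue`.
    job = queue.pop(0)
    return queue


def next_job_line_deleted(queue: list[str]) -> list[str]:
    # The whole assignment line was deleted, so nothing is popped any more.
    return queue


print(next_job_with_unused_name(["a", "b", "c"]))  # ['b', 'c']
print(next_job_line_deleted(["a", "b", "c"]))  # ['a', 'b', 'c']
```

The two functions return different results, so a tool cannot blindly delete such a line without risking a change in behaviour.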

---
## 3.&nbsp; Customising the Rulebook 📖

Our script still has a debug `print()` statement and inconsistent styling (mixed quotes). We'll now [configure Ruff](https://docs.astral.sh/ruff/configuration/) to catch these issues.

### 3.1. Configuring Ruff with `pyproject.toml`

We can customise Ruff's behaviour using a `pyproject.toml` file in our project's root directory. This is the standard for configuring Python tools.

1.  Create a file named `pyproject.toml` in the root of your project.
2.  Add the following configuration. This tells Ruff's linter (`[tool.ruff.lint]`) to select (`select = [...]`) several rule sets.

In [None]:
[tool.ruff.lint]
# Enable specific rule sets by their code
select = ["E", "F", "I", "T20"]

The rules you choose are a matter of project preference, but here are some common starting points:
- **`E`, `F`** (pycodestyle errors & Pyflakes): These are the essentials. `F` catches likely bugs (like undefined names and unused imports), while `E` catches PEP 8 style violations. Almost every project should start with these.
- **`I`** (isort-style import sorting): This rule set checks the order of your `import` statements and can reorder them automatically with `--fix`. It's a highly recommended 'quality of life' rule that reduces merge conflicts and makes code easier to read.
- **`T20`** (flake8-print): This is a stricter rule set. You enable it to enforce a 'no print statements in production code' policy. It's great for libraries or applications, but you might choose to disable it for simple, one-off analysis scripts.
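
As a rough illustration of what the `I` rules aim for, the sketch below shows a small import block before (in comments) and after sorting. It assumes Ruff's default isort settings, under which plain `import x` lines come before `from x import y` lines within a section; the modules are arbitrary standard-library examples.

```python
# Before sorting (order as typed):
#   import sys
#   from pathlib import Path
#   import json
#   import os

# Expected layout after `ruff check . --fix` with the `I` rules selected:
import json
import os
import sys
from pathlib import Path

# Use each import once so none of them is itself flagged as unused (F401).
summary = {
    "python_major": sys.version_info.major,
    "cwd_is_dir": Path(os.getcwd()).is_dir(),
}
print(json.dumps(summary))
```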

> **Where do you find other rules?**
> The best place to explore them is on the [official Ruff documentation website](https://docs.astral.sh/ruff/rules/). You can also see a full list in your terminal by running `ruff rule --all`.

### 3.2. Finding the Print Statement

With our new configuration, let's run the check again.

In [None]:
ruff check .

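Success! Ruff now reports `T201` for the debug `print()` statement. Because the rule applies to every `print()` call, the two calls in the `if __name__ == "__main__":` block are reported as well. Ruff doesn't remove any of them when you run `--fix`, because some `print` calls are a script's intended output, making automatic deletion unsafe.

Manually delete the `print("Starting data processing!")` line to complete this task. The two remaining calls print the script's result on purpose; section 5 shows how to make a deliberate exception for lines like these.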
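
If a progress message like this is still worth keeping, a common alternative to deleting it is the standard library's `logging` module, which `T201` does not flag. The sketch below is a minimal example under stated assumptions: `process_values` is a hypothetical stand-in for `process_data` that avoids the pandas dependency, and the logger name is arbitrary.

```python
import logging

# Hypothetical logger name; any string works.
logger = logging.getLogger("messy_data_processor")


def process_values(values: list[str]) -> list[str]:
    """Hypothetical stand-in for process_data: strip surrounding whitespace."""
    # A logging call instead of print(); it is silent unless logging is configured.
    logger.debug("Starting data processing")
    return [value.strip() for value in values]


if __name__ == "__main__":
    # Opt in to debug output only when running the script directly.
    logging.basicConfig(level=logging.DEBUG)
    process_values(["  Laptop ", " Mouse"])
```

The caller decides whether debug messages are shown, so the library code stays quiet by default.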

---
## 4.&nbsp; Enforcing a Consistent Style 🎨

The final cleanup step is to enforce a consistent style. This is a job for the [formatter](https://docs.astral.sh/ruff/formatter/). Unlike the linter, which finds errors, the formatter rewrites your code to match the style guide.

1.  Let's configure the formatter. Open `pyproject.toml` and add settings for `line-length` and `quote-style`.

In [None]:
[tool.ruff]
line-length = 88

[tool.ruff.lint]
select = ["E", "F", "I", "T20"]

[tool.ruff.format]
quote-style = "double"
docstring-code-format = true

- `line-length = 88`: We'll set the line length to 88. PEP 8 recommends 79 characters, but with modern widescreen monitors, 88 has become a common convention, popularised by the [black](https://black.readthedocs.io/en/stable/) formatter. It offers a good balance between readability and screen space.
- `quote-style = "double"`: We'll use double quotes, another black default. The most important thing is not which style you choose, but that you choose one and apply it consistently.
- `docstring-code-format = true`: This is a nice touch. It tells Ruff to also format any code examples inside your docstrings, keeping them consistent with the rest of your code.
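
To show what `docstring-code-format` is aimed at, here is a small sketch of a function whose docstring contains a doctest-style example. The function `apply_discount` is hypothetical, and the "as typed" line in the comment is an assumed starting point; the docstring shows the tidied form the formatter is expected to produce.

```python
# As typed, the docstring example might have read:
#     >>> apply_discount( 200.0,25.0 )
# With docstring-code-format enabled, the formatter is expected to tidy the
# example in the same way it tidies ordinary code.


def apply_discount(price: float, percent: float) -> float:
    """Return the price after a percentage discount.

    >>> apply_discount(200.0, 25.0)
    150.0
    """
    return price * (1 - percent / 100)


print(apply_discount(200.0, 25.0))  # 150.0
```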

### 4.1. Checking for Changes with `--check`
Run the formatter with the `--check` flag. This command doesn't change any files; it just tells you if any files would be reformatted.

In [None]:
ruff format . --check

Ruff reports `1 file would be reformatted` and exits with a non-zero status code. This confirms we have a style issue to fix.
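
The exit status is what makes `--check` useful in scripts and automated pipelines: a non-zero status signals failure. The sketch below shows the general pattern with Python's `subprocess` module. The helper `command_passes` is hypothetical, and the two stand-in commands simply exit with a fixed status so the example runs even where Ruff is not installed.

```python
import subprocess
import sys


def command_passes(command: list[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0


# Stand-in commands so the sketch runs without Ruff installed; in a real
# project you would pass ["ruff", "format", ".", "--check"] instead.
clean = [sys.executable, "-c", "raise SystemExit(0)"]
needs_formatting = [sys.executable, "-c", "raise SystemExit(1)"]

print(command_passes(clean))  # True
print(command_passes(needs_formatting))  # False
```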

> `ruff format .` is the standard command to automatically format every file in your project at once. This is typically what you want to do to ensure the whole codebase follows a single, consistent style. You can also format a specific file by providing its path, like `ruff format src/train.py`, which is useful when you only want to focus on one file.

### 4.2. Previewing Changes with `--diff`
Now that we know there's an issue, let's see exactly what Ruff wants to change using the `--diff` flag. This acts as a "dry run", showing you the proposed changes without modifying the file.

In [None]:
ruff format . --diff

### 4.3. Applying the Formatting
Once you're happy with the preview, run the formatter command to rewrite the file.

In [None]:
ruff format .

Check `messy_data_processor.py`. Ruff has automatically changed all the single-quoted strings to double quotes and reformatted the code to respect the 88-character line limit.
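
As an illustration of the layout changes, the sketch below shows the `data` dictionary from the messy script before (in comments) and in the form the formatter is expected to produce, assuming Ruff's Black-compatible defaults and the 88-character limit. In the real script the assignment sits indented inside the `if __name__ == "__main__":` block; it is shown at module level here so the snippet runs on its own.

```python
# Layout as typed in the messy script:
#   data = {"Product Name": ["  Laptop ", " Mouse", "Keyboard"],
#           "Price": ["1200", "25.50", "75"]
#          }

# Expected layout after `ruff format` (the one-line form would exceed 88 chars):
data = {
    "Product Name": ["  Laptop ", " Mouse", "Keyboard"],
    "Price": ["1200", "25.50", "75"],
}

print(list(data))  # ['Product Name', 'Price']
```

Only the layout changes; the dictionary's contents are identical before and after.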

---
## 5.&nbsp; Making an Exception ✂️

Sometimes, you need to intentionally break a rule, for example, when adding a temporary variable for debugging. Ruff allows you to do this with a `# noqa` comment.

In [None]:
def process_data(df: pd.DataFrame) -> pd.DataFrame:
    # We want to inspect the dataframe head during debugging, but it's unused.
    data_snapshot = df.head()  # noqa

    # ... rest of the function

By adding `# noqa`, you tell Ruff to ignore all rules for this line only.

However, a blanket `# noqa` silences every rule on that line, not just the one we meant. Here we only want to tell Ruff that we know the variable is unused, while still letting it check the line for other problems.

This is done by adding the rule code to the `noqa` comment:

In [None]:
def process_data(df: pd.DataFrame) -> pd.DataFrame:
    # We want to inspect the dataframe head during debugging, but it's unused.
    data_snapshot = df.head()  # noqa: F841

    # ... rest of the function

By adding `# noqa: F841`, you tell Ruff to specifically ignore the "unused variable" rule for this line only. This is much safer than a blanket `# noqa` which would ignore all rules.
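
A small, hypothetical example of the difference: the line below has two separate problems when the `E` and `F` rules are selected. The targeted comment suppresses only the unused-import finding, so Ruff should still be able to report the other one.

```python
# With a rule code, only F401 (unused import) is suppressed on this line.
# Other rules still apply, such as E401 (multiple imports on one line).
# A blanket `# noqa` would have hidden both findings.
import os, sys  # noqa: F401
```

The remaining report is a useful reminder that the line still needs attention, which a blanket suppression would have hidden.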

---
## 6.&nbsp; Challenges 😀

You've learned the fundamentals of cleaning a single script. Now it's time to apply these skills to your main `sentiment-analysis-project` and automate your quality checks.

### Challenge 1: Configure Your Ruleset

Your first task is to become familiar with the vast number of rules Ruff offers.

1.  **Explore the Docs:** Visit the [official Ruff rules documentation](https://docs.astral.sh/ruff/rules/).
2.  **Choose Your Rules:** Browse through the different categories. Do you want to enforce docstring conventions (`D`)? Find commented-out code (`ERA`)? Or perhaps use rules that encourage more modern Python code, like `pyupgrade` (`UP`)?
3.  **Update Your Config:** Add or change the rules in the `select` list in your `pyproject.toml` file to match your preferences. There's no single right answer; this is about choosing a standard that works for you.

### Challenge 2: Clean Your Sentiment Analysis Project

Now, apply your chosen ruleset to the project you've been building throughout this series.

1.  **Run the Linter:** In your `sentiment-analysis-project` directory, run `ruff check .` and see what it finds in your `src/train.py` and `src/predict.py` files.
2.  **Run the Formatter:** Run `ruff format .` to apply your chosen style guide consistently across the entire project.
3.  **Commit the Changes:** Once your project is clean and adheres to your new rules, commit all the changes with a clear message like `style: Apply Ruff linting and formatting`.

### Challenge 3: Automate Your Quality Gate

Your final and most important task is to add Ruff to your GitHub Actions workflow. This will create a "quality gate" that automatically checks every new push and pull request, ensuring no messy code ever makes it into your main branch.

1.  **Update `requirements.txt`:** Make sure `ruff` is listed in your project's dependencies.
2.  **Edit your `ci.yml` file:** Add a new job to your workflow for linting and formatting. Think back to how you added the `pytest` job in the previous lesson. You will need steps to check out the code, set up Python, install dependencies, and then run `ruff check .` and `ruff format . --check`.
3.  **Push and Verify:** Commit your updated workflow file and push it to GitHub. Go to the **Actions** tab on your repository and confirm that your new `lint-and-format` job runs and passes with a green checkmark ✅.

---
## Bonus Challenge: The Pre-commit Hook (For Early Finishers) 🏎️

If you've finished the main challenges and want to explore a more advanced, professional technique, this one is for you. This is an optional, self-directed task.

Your bonus challenge is to investigate and implement a pre-commit hook for Ruff. A pre-commit hook automatically runs checks on your code before you're allowed to make a commit, ensuring no errors ever enter your project's history.

We won't provide instructions for this. Part of being a developer is learning how to research and integrate new tools. You'll need to find the `pre-commit` framework documentation, figure out how to configure it for Ruff, and decide if it's a workflow you want to adopt.