<div class='bar_title'></div>

*Enterprise AI*

# Tutorial 6 - Code Modularization and Experiment Tracking

Gunther Gust / Viet Nguyen<br>
Chair of Enterprise AI

Summer Semester 25

<img src="https://github.com/GuntherGust/tds2_data/blob/main/images/d3.png?raw=true" style="width:20%; float:left;" />

In previous assignments, you explored how to use Jupyter notebooks to quickly prototype simple machine learning projects. Notebooks are designed for rapid experimentation, allowing you to combine code, results, visualizations, and Markdown-based documentation in a single, shareable environment. This format is especially effective for teaching and collaboration (arguably, some hardcore programmers think otherwise $^{[1]}$).

Despite their widespread adoption in the scientific and data science communities, Jupyter notebooks have several limitations when it comes to modern software development. If you inspect a notebook as a plain text file, you'll notice that it stores all content (code, outputs, images, and Markdown texts) in a single JSON object. This design introduces several drawbacks:

1. *Version control is difficult*: Every time a notebook is executed, the outputs change, even if the code doesn’t. This makes it hard to track meaningful changes and collaborate effectively in an iterative development process.
2. *Non-linear execution*: Code in notebooks can be run in any order, which breaks the logical, linear flow that most programming languages rely on. This can lead to hidden state issues and non-reproducible behavior.
3. *Poor modularity and code reuse*: Since notebooks often contain all logic in a single document, it's harder to separate concerns, build reusable components, or maintain clean architecture.
4. *Limited support for testing and deployment*: Notebooks lack native mechanisms for automated testing, continuous integration, or production deployment workflows.

The last two issues are especially problematic and make notebooks unsuitable for building production-level software systems.
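To make the version-control drawback concrete, here is a minimal sketch of how outputs end up inside the notebook file. The JSON below is a simplified, hand-written stand-in for the real `.ipynb` format, and the helper names `make_notebook` and `strip_outputs` are hypothetical:

```python
import json


def make_notebook(source, output_text):
    """Build a simplified notebook-like JSON document with one code cell."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": 1,
                "metadata": {},
                "outputs": [
                    {"name": "stdout", "output_type": "stream", "text": [output_text]}
                ],
                "source": [source],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def strip_outputs(notebook):
    """Return a copy of the notebook without outputs and execution counts."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in cleaned["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


code = "import random\nprint(random.random())"
run_1 = make_notebook(code, "0.417\n")  # made-up output of a first execution
run_2 = make_notebook(code, "0.720\n")  # made-up output of a second execution

# The stored files differ even though the code is identical ...
files_differ = json.dumps(run_1) != json.dumps(run_2)
# ... and the difference disappears once the outputs are removed.
code_differs = json.dumps(strip_outputs(run_1)) != json.dumps(strip_outputs(run_2))
print(files_differ, code_differs)
```

Both "runs" contain the same code, yet a version control system would report a change because the stored outputs differ. Plain `.py` files do not have this problem, since they contain only code.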

In this tutorial, we will walk through the next step toward production-level practices by modularizing our notebook code into separate Python files (`.py`). This means that we will extract the core logic such as data processing, model training, and model evaluation into separate modules that can be `imported` and reused. This module structure not only makes the codebase easier to maintain and test, but also helps to integrate our code with tools like [Weights & Biases](https://wandb.ai/) for experiment tracking. By isolating components, we gain flexibility to rerun or update parts of the pipeline without touching the entire notebook. This structure is essential for deploying models in real-world systems, where reproducibility, versioning, and automation are critical.

---
$^{[1]}$: In 2018, there was a debate on Twitter between Joel Grus and Jeremy Howard regarding the merits of Jupyter notebooks. Joel Grus voiced his criticisms in a talk titled [_I Don’t Like Notebooks_](https://youtu.be/7jiPeIFXb6U?feature=shared), arguing that notebooks encourage poor software engineering practices. Two years later, Jeremy Howard offered a rebuttal in his talk [_I Like Notebooks_](https://youtu.be/9Q6sLbz37gk?feature=shared) to advocate how notebooks have been instrumental to the success of his `fastai` project and to accessible deep learning education.

Both Grus and Howard are respected figures in the data science community, though they appeal to different audiences. Joel Grus is known for his emphasis on foundational understanding and software discipline, while Jeremy Howard is recognized for democratizing deep learning through practical, code-first education at scale.

## 1. Modularization of the notebook

In this tutorial, we will work with a modularized version of the code located in the `tutorial_06_python_code` folder. The structure of the project is organized as follows:

```
tutorial_06_python_code/
│
├── data/                              # data folder
│   └── housing.csv                    # dataset
│
├── src/                               # Source code directory
│   ├── __init__.py                    # Making the folder into a Python package
│   ├── data_loader.py                 # Functions to load and inspect data
│   ├── preprocessing.py               # Feature engineering methods (e.g., encoding, imputing missing values)
│   ├── model.py                       # Model definition, training, and evaluation
│   ├── config.py                      # Constants like test size, random state, etc.
│   └── utils.py                       # Helper functions (e.g., for visualization, metrics)
│
└── run.py                             # Entry point script to execute the training pipeline
```

This structure allows us to reuse some of the logic developed in Tutorial 2, while improving clarity and maintainability. Rather than placing all code inside a single notebook, we break the workflow into separate components:
- `data/`: contains the datasets used in the experiments.
- `src/`: contains the core logic for each step of the machine learning workflow, including data loading, preprocessing, modeling, and supporting utilities.
- `run.py`: serves as the entry point to the pipeline. It integrates all components from the `src` directory into a coherent and executable training process.

Note that we simplify the modularization by implementing everything as plain `functions` rather than `classes`. You can read more about modularization of a machine learning project in chapter 10 of the online book [The Pragmatic Programmer for Machine Learning](https://ppml.dev/) by Marco Scutari and Mauro Malvestio.

### 1.1. What is `__init__.py`?
It is a special file that marks a directory as a regular **Python package**. Its presence tells Python and related tools to treat this folder as a package so its modules can be imported. For example, this allows you to use *relative imports*. Suppose you want to import a function `load_data` from `data_loader.py` into `model.py`:

```python
# In model.py
from .data_loader import load_data
```

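Without `__init__.py`:
- Python 3.3+ can still import the folder as a *namespace package*, but not every tool treats it as a regular package, so imports may behave differently depending on how the code is run
- Tools like `pytest` or packaging tools might fail to locate your modules and throw `ModuleNotFoundError`
- It is harder to transition your code into a proper library or deployment pipeline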

To avoid unwanted bugs and make your code structure behave consistently in all contexts, it is recommended to add an empty `__init__.py` to the `src` folder and to all of its sub-folders.
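The following sketch builds a throwaway package on disk and imports from it, to show a relative import working inside a package. The package name `demo_src` and the file contents are hypothetical; they only mirror the layout used in this tutorial:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a temporary project folder containing a small package.
project_dir = Path(tempfile.mkdtemp())
package_dir = project_dir / "demo_src"
package_dir.mkdir()

(package_dir / "__init__.py").write_text("")  # empty file marking the package
(package_dir / "config.py").write_text("TEST_SIZE = 0.2\n")
(package_dir / "data_loader.py").write_text(
    "from .config import TEST_SIZE  # relative import inside the package\n"
    "\n"
    "def describe_split():\n"
    "    return f'test size = {TEST_SIZE}'\n"
)

# Make the project folder importable, then load the module through its package.
sys.path.insert(0, str(project_dir))
data_loader = importlib.import_module("demo_src.data_loader")
print(data_loader.describe_split())
```

Note that the relative import only works because `data_loader` is loaded as part of the `demo_src` package. Running `python data_loader.py` directly would fail, since a script executed on its own has no parent package.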

### 1.2. Modularized Code

If you're working on this notebook locally, you'll notice that the code logic has now been modularized into separate components inside the `src` folder, following the project structure described earlier. In `run.py`, we bring these components together to form a complete training pipeline using the following imports:

```python
from src.config import TEST_SIZE, DATA_PATH, DATA_RANDOM_STATE, MODEL_RANDOM_STATE
from src.data_loader import load_housing_data
from src.preprocessing import split_data, impute_train_data, impute_test_data, remove_categorical_columns
from src.model import create_forest_regressor, train_model, get_predictions
from src.utils import compute_mae, compute_mape
```

Note that this modularization isn't meant to be production-ready or fully optimized. Its purpose is to demonstrate how separating different responsibilities (like configuration, preprocessing, modeling, and utilities) makes your code cleaner, more maintainable, and reusable.

For example, the function `impute_train_data` is written in a way that allows reuse with different imputation strategies. In the current code, we apply it with the `mean` strategy to impute numerical columns, while removing categorical columns for simplicity. However, if you decide to handle categorical features, you can reuse the same function like this:

```python
# Fit the imputer on the training split (not the test split)
cat_imputer, cat_train = impute_train_data(X_train, ["furnishingstatus"], "most_frequent")
```
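To illustrate why a train-specific and a test-specific imputation function are kept separate, here is a dependency-free sketch of the "learn on train, reuse on test" pattern. The helpers `fit_fill_value` and `apply_fill_value` are hypothetical and are not the course implementation of `impute_train_data`:

```python
from collections import Counter
from statistics import mean


def fit_fill_value(values, strategy):
    """Learn a fill value from the non-missing training values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        return mean(observed)
    if strategy == "most_frequent":
        return Counter(observed).most_common(1)[0][0]
    raise ValueError(f"Unknown strategy: {strategy}")


def apply_fill_value(values, fill_value):
    """Replace missing entries (None) with a previously learned fill value."""
    return [fill_value if v is None else v for v in values]


# Made-up example data; None marks a missing value.
train_status = ["furnished", None, "unfurnished", "furnished"]
test_status = [None, "semi-furnished"]

fill = fit_fill_value(train_status, "most_frequent")  # learned on train only
train_imputed = apply_fill_value(train_status, fill)
test_imputed = apply_fill_value(test_status, fill)    # same value reused on test
print(fill, test_imputed)
```

Because the strategy is a parameter, the same two functions cover numerical columns (`"mean"`) and categorical columns (`"most_frequent"`), which is the kind of reuse described above.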
Similarly, the utility functions `compute_mae` and `compute_mape` are designed to be generic. They can be reused across different datasets as long as the inputs follow the expected format described in each function’s docstring. Modular design like this is especially helpful when scaling up to more complex projects or transitioning toward deployment, where testability, maintainability, and reusability are essential.
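As a rough idea of what such generic metric helpers could look like, here is a pure-Python sketch. The names `sketch_mae` and `sketch_mape` are hypothetical stand-ins; the actual `compute_mae` and `compute_mape` in `src/utils.py` may be implemented differently:

```python
def sketch_mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def sketch_mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.24 means 24%).

    Assumes that no true value is zero.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration.
prices = [100.0, 200.0, 400.0]
predictions = [110.0, 180.0, 400.0]

mae = sketch_mae(prices, predictions)    # (10 + 20 + 0) / 3
mape = sketch_mape(prices, predictions)  # (0.1 + 0.1 + 0) / 3
print(round(mae, 2), round(mape, 4))
```

Nothing in these functions refers to a particular dataset, which is what makes them reusable: any pair of true and predicted values in the expected format will do.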

### 1.3. Running the code

You can run the code by opening a terminal, navigating to the project folder, and typing the command below:

> python run.py

You will see this output in the terminal:

![Running code in terminal](images/code_terminal.png)

If you are not familiar with using terminals, we can also run the code from this notebook:

In [1]:
!python ./tutorial_06_python_code/run.py

In-sample: Mean Absolute Error: 414244.20	 Mean Absolute Percentage Error: 0.09
Out-sample: Mean Absolute Error: 1045208.27	 Mean Absolute Percentage Error: 0.24


When you add `!` at the start of a line in a Jupyter code cell, the rest of the line is passed to the system shell (terminal) behind the scenes, similar to running it directly in the terminal. Since the notebook sits outside the `tutorial_06_python_code` folder, we have to specify the path to the `run.py` file. The `./` prefix makes this a relative path, which is resolved starting from the *current directory*. Here, that means we start in the `course-materials` folder and go one level down into the `tutorial_06_python_code` folder.
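Outside of Jupyter, a similar effect can be achieved from plain Python with the standard `subprocess` module. The sketch below is only an approximation of what `!python ...` does; the script `hello_run.py` is a hypothetical stand-in for `run.py` so that the example runs anywhere:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Create a tiny stand-in script in a temporary folder.
script_path = Path(tempfile.mkdtemp()) / "hello_run.py"
script_path.write_text("print('pipeline finished')\n")

# Start a separate Python process for the script and capture what it prints.
result = subprocess.run(
    [sys.executable, str(script_path)],
    capture_output=True,
    text=True,
    check=True,  # raise an error if the script exits with a failure code
)
print(result.stdout.strip())
```

In both cases the script runs in its own process, so variables defined in `run.py` are not available in the notebook afterwards. To reuse individual functions, import them instead.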

### 1.4. Importing functions into a notebook

In addition to running the `run.py` script, you can import and reuse the functions defined in the `src` folder inside a Jupyter notebook. For example, let's re-use the `load_data` function:

In [2]:
from tutorial_06_python_code.src.data_loader import load_data

Instead of using the `housing.csv` data, we can load `insurance.csv` from the `course-materials/solutions/data` folder:

In [3]:
path = "./solutions/data/insurance.csv"
df = load_data(path)
df.head()

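| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |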


Before re-using the imputation function, let's check which columns contain missing values:

In [4]:
df.isna().sum()

age         0
sex         0
bmi         4
children    0
smoker      0
region      7
charges     0
dtype: int64

If we assume that our entire dataset is a train set, we can reuse the `impute_train_data` function to impute the categorical `region` column using the `most_frequent` strategy:

In [5]:
from tutorial_06_python_code.src.preprocessing import impute_train_data
imputer, cat_imputed = impute_train_data(df, ["region"], "most_frequent")
df["region"] = cat_imputed
df.isna().sum()

age         0
sex         0
bmi         4
children    0
smoker      0
region      0
charges     0
dtype: int64

This short demo shows how modularizing your code not only makes your project easier to maintain, but also allows you to reuse specific functions directly in a notebook for experimentation or analysis. Whether you're running the full pipeline via `run.py` or interacting in Jupyter, separating logic into modules gives you flexibility and clarity. This is a key habit for scaling up to real-world projects.

## 2. Experiment Tracking with Weights & Biases (WandB)

Weights & Biases (WandB) is a powerful tool that helps data scientists track machine learning experiments, datasets, and system information with just a few lines of code. WandB is free to use for individuals and research organizations, which makes it widely accessible for academic and non-commercial projects. It supports a wide range of machine learning frameworks without requiring users to switch tools. These include TensorFlow, Keras, PyTorch, Scikit-learn, FastAI, and more.

<div style="text-align: center;">
  <img src="images/wandb.webp" alt="WanDB" width="700"/>
</div>

### Key features
All tracked information is sent to an intuitive user interface (UI) provided by Weights & Biases. This dashboard allows for:
- Easy visualization and analysis of metrics and logs
- Fast and interactive model comparison, including hyperparameter tuning results and experiment histories
- Collaboration support: you can share results with your team via the platform's web interface or through custom reports.

Weights & Biases is a highly flexible and scalable tool used by individual developers and large AI research teams (e.g., OpenAI, DeepMind, and NVIDIA) to track experiments, visualize results, optimize models, and ensure reproducibility in machine learning workflows.

### 2.1. Why is WandB important in large-scale Machine Learning?

Deep learning models often involve training large neural networks on massive datasets, which can take hours, days, or even weeks to complete. During this long process, many experiments are run to try different architectures, hyperparameters, or training strategies to improve performance. Keeping track of all these experiments manually is nearly impossible.

Weights & Biases (WandB) is a powerful tool that helps data scientists and researchers to automatically track, visualize, and compare these experiments in real-time. It enables easy monitoring of training progress, recording of metrics like loss and accuracy, saving model checkpoints, and sharing results with collaborators - all in one centralized platform. This makes managing complex deep learning workflows much more organized and efficient.

### 2.2. Getting started with WandB

Before using WandB, we need to set up a WandB account and install the library:

1. Create an account: visit [WandB's website](https://wandb.ai/) and register with your email
2. Install wandb (for local users): `pip install wandb`. If you use VS CodeSpace, skip this.
3. After logging in, click your profile in the top-right corner of the dashboard, then select `API key` and copy the key
4. Initialize a wandb project using `wandb.init()`, which will prompt you to input your API key

We will do step 4 in the next section.

### 2.3. Tracking an experiment with WandB

In this section, we will explore another dataset stored in `tutorial_06_python_code/data/advertising.csv`, build a machine learning model, and use WandB to track the training process.

In [6]:
adv = load_data("./tutorial_06_python_code/data/advertising.csv")
adv.head(3)

| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |


Let's initialize a WandB project. If you run the code below for the first time, it will ask you to input the API key:

In [7]:
import wandb

wandb.init(
    project="demo-sale-regression"
)

[34m[1mwandb[0m: Currently logged in as: [33maveragejett[0m ([33mdmt-linkpred[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Next, we create a pipeline as follows:
- Split the dataset
- Create a ML model
- Fit the train set
- Evaluate on the test set using MSE
- Log the training loss and the test results to WandB

In [8]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X = adv.drop("Sales", axis=1)
Y = adv["Sales"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=0)

model = AdaBoostRegressor(random_state=0)
# train
model.fit(X_train, Y_train)
# log training loss for each boosting iteration using staged_predict
for i, y_pred_train in enumerate(model.staged_predict(X_train)):
    train_loss = mean_squared_error(Y_train, y_pred_train)
    wandb.log({"train_mse": train_loss, "boosting_iteration": i})

# test predictions
preds = model.predict(X_test)
mse = mean_squared_error(Y_test, preds)  # argument order: (y_true, y_pred)
wandb.log({"test_mse": mse})

Once you have finished running the experiments, call `wandb.finish()` to end the run and upload any remaining data to WandB:

In [9]:
wandb.finish()

0,1
boosting_iteration,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███
test_mse,▁
train_mse,█▆▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

0,1
boosting_iteration,49.0
test_mse,1.46527
train_mse,0.6175


You can open the WandB dashboard to see the visualizations of the above run:

<div style="text-align: center;">
  <img src="./images/exp.png" alt="WanDB" width="1000"/>
</div>


This was a simple experiment. When you want to explore which model performs best among several choices, or when tuning different configurations of the same model, the number of experiments can quickly grow out of control. It is critical to keep track of all experiments carefully. This means recording not only the final results but also the settings used (like hyperparameters, data splits, and random seeds) and intermediate progress (such as loss curves). This careful tracking allows you to:

- Easily compare models and configurations to identify the best performing one without guessing or manual record keeping.
- Ensure reproducibility, meaning that anyone (including future you!) can rerun an experiment and get the same results by following the saved configuration and data.
- Debug and recover from interruptions by resuming training from saved checkpoints instead of starting over.
- Share your work efficiently with teammates, who can then understand, reproduce, and build upon your experiments.
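To make "recording settings, intermediate progress, and final results" tangible, here is a minimal, local-only sketch of an experiment log. It does not use the WandB API; the helpers `log_run` and `load_runs` are hypothetical, and all numbers are made up. A tracking tool automates this bookkeeping and adds dashboards and sharing on top:

```python
import json
import tempfile
from pathlib import Path


def log_run(log_path, config, history, summary):
    """Append one experiment (settings, per-step metrics, final results) as a JSON line."""
    record = {"config": config, "history": history, "summary": summary}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_runs(log_path):
    """Read all logged experiments back into a list of dictionaries."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"

# Two hypothetical runs that differ in a single hyperparameter.
log_run(
    log_path,
    config={"model": "adaboost", "n_estimators": 50, "random_state": 0, "test_size": 0.1},
    history=[{"step": 0, "train_mse": 2.0}, {"step": 1, "train_mse": 1.2}],
    summary={"test_mse": 1.5},
)
log_run(
    log_path,
    config={"model": "adaboost", "n_estimators": 100, "random_state": 0, "test_size": 0.1},
    history=[{"step": 0, "train_mse": 2.0}, {"step": 1, "train_mse": 1.0}],
    summary={"test_mse": 1.3},
)

# Compare runs and pick the configuration with the lowest test error.
runs = load_runs(log_path)
best = min(runs, key=lambda run: run["summary"]["test_mse"])
print(best["config"])
```

Because each record stores the configuration next to the results, the best run can be identified and repeated without relying on memory or manual notes.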

### 2.4. WandB in Research

In one of our research projects, a model with more than 10 different configurations was run multiple times. Since the dataset and the model are huge, it took nearly a month to complete all experiments! Below are 40 runs that took about a week to complete:

<div style="text-align: center;">
  <img src="./images/wandb_research.png" alt="WanDB" width="1000"/>
</div>

Using WandB can help us to:
- Track training progress by monitoring loss and other metrics over time, ensuring the model trains properly.
- Recover training in case of crashes or interruptions by loading checkpoints instead of retraining from scratch (saving weeks of compute time).
- Organize experiments and easily identify the best hyperparameter combinations without manual notes or spreadsheet chaos.
- Visualize results with intuitive dashboards and share findings instantly with our collaborators.
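The idea of recovering from an interruption by loading a checkpoint can be sketched with a toy training loop. This is a generic illustration that does not use the WandB API; the function `train`, the JSON checkpoint format, and the halving "loss" are all hypothetical stand-ins for a real training step:

```python
import json
import tempfile
from pathlib import Path


def train(total_steps, checkpoint_path, stop_after=None):
    """Toy training loop that saves its state after every step and resumes if possible."""
    if checkpoint_path.exists():
        # Resume from the last saved state instead of starting over.
        state = json.loads(checkpoint_path.read_text())
    else:
        state = {"step": 0, "loss": 10.0}

    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulate a crash or interruption
        state["loss"] *= 0.5  # stand-in for one real training step
        state["step"] += 1
        checkpoint_path.write_text(json.dumps(state))  # save a checkpoint
    return state


checkpoint = Path(tempfile.mkdtemp()) / "checkpoint.json"

interrupted = train(total_steps=4, checkpoint_path=checkpoint, stop_after=2)
resumed = train(total_steps=4, checkpoint_path=checkpoint)
print(interrupted["step"], resumed["step"], resumed["loss"])
```

The second call continues at step 2 rather than step 0. For runs that take days or weeks, this is the difference between losing a few minutes and losing the whole run.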

## Conclusion

In this tutorial, you've learned the importance of modularizing your code by moving from a monolithic Jupyter notebook to a more organized, script-based workflow. This is a crucial step in building scalable and maintainable machine learning projects. You also explored how to use Weights & Biases (WandB) to track experiments, monitor model performance, and ensure reproducibility, which are all essential practices for real-world ML development. 

In the next assignment, you'll take this one step further by integrating WandB with ZenML. With ZenML, you'll learn how to structure your workflows into reusable pipeline steps (such as data loading, training, evaluation, and deployment), while WandB will continue to serve as a tracking and visualization layer for these steps. Together, they provide a powerful foundation for building reproducible, automated, and collaborative ML systems from experimentation to deployment.