# **Machine Learning Project Structure & Infrastructure Best Practices**


---

# **1. Why Project Structure Matters**

Machine learning projects tend to become:

* a mixture of scripts, notebooks, configs, logs, datasets
* code called from different locations (VS Code, Jupyter, CLI, Slurm, Hydra, wandb)
* full of fragile relative paths, duplicated logic, and ad-hoc scripts

If you don’t impose structure early, the project turns into chaos.

A well-designed structure ensures:

* reproducibility
* clarity and separation of responsibilities
* stable paths regardless of where the interpreter is launched
* easy packaging / deployment / dockerization
* compatibility with MLflow, wandb, Hydra, DVC, and CI/CD
* easier onboarding of collaborators

The layout below is a common starting point for team projects.

---

# **2. The Recommended ML Project Layout (“src layout”)**

This follows the **src layout** described in the Python Packaging User Guide, which is widely used for installable ML projects.

```
myproject/
│
├── pyproject.toml
├── README.md
├── LICENSE
│
├── src/
│   └── myproject/
│       ├── __init__.py
│       ├── core/
│       │   ├── __init__.py
│       │   ├── training.py
│       │   ├── evaluation.py
│       │   └── model_builder.py
│       │
│       ├── data/
│       │   ├── __init__.py
│       │   ├── loaders.py
│       │   └── dataset_utils.py
│       │
│       └── utils/
│           ├── __init__.py
│           └── paths.py
│
├── notebooks/
│   └── experiments.ipynb
│
├── scripts/
│   ├── train.py
│   ├── evaluate.py
│   ├── infer.py
│   └── prepare_data.py
│
├── configs/
│   ├── train.yaml
│   ├── eval.yaml
│   └── model.yaml
│
├── data/
│   ├── raw/
│   ├── processed/
│   └── metadata/
│
├── checkpoints/
└── logs/
```

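Combined with an editable install (`pip install -e .`) and the path helper in section 4, this layout removes most import and file-path problems and keeps the project organized as it grows.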

---

# **3. What Each Folder Is For (Professional Explanation)**

### **3.1 src/myproject/** — Your importable Python package

Contains all reusable code.

If someone runs:

```bash
pip install .
```

this is the only folder that becomes importable.

Inside:

### **core/**

Contains the *logic*:

* training loop
* evaluation / metrics
* model builder
* loss functions
* augmentation pipeline

### **data/**

Contains:

* dataset utilities
* PyTorch DataLoaders
* transforms
* data preprocessing modules

### **utils/**

Contains utility functions:

* stable `project_root()`
* logging setup
* reproducibility utilities
* file/resource handling
* argument parsing

Never mix ML logic with utilities.

---

### **3.2 scripts/**

Contains *executable scripts* only:

* train.py
* evaluate.py
* infer.py
* hyperparameter search
* data preparation

Scripts should:

* be short
* only orchestrate
* never define logic
* call functions from `src/myproject/`

---

### **3.3 configs/**

All configurations live here:

* hyperparameters
* model architecture names
* dataset paths
* optimizer settings
* experiment metadata

Even without Hydra, YAML configs are crucial for reproducibility.

---

### **3.4 data/**

Organized datasets:

```
data/raw/
data/processed/
data/metadata/
```

Never commit large raw datasets to Git.
Exclude them with `.gitignore`.

---

### **3.5 notebooks/**

Keep notebooks separate from scripts.

Rules:

* notebooks explore, scripts run
* never import notebooks from code
* never put logic inside notebooks
* notebooks should call functions from `src/myproject`

---

### **3.6 checkpoints/**

Saved model weights.

Rules:

* never commit them to Git
* include a timestamp in each checkpoint filename
* optionally track them with DVC or W&B Artifacts
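
The timestamp rule above can be sketched as a small helper. The function name `checkpoint_path`, the `run_name` argument, and the filename pattern are illustrative assumptions, not part of any library or of the original project.

```python
from __future__ import annotations

from datetime import datetime, timezone
from pathlib import Path


def checkpoint_path(
    root: Path, run_name: str, epoch: int, now: datetime | None = None
) -> Path:
    """Build a timestamped checkpoint path under `root` (hypothetical naming scheme)."""
    # Use UTC so filenames sort consistently across machines and time zones.
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d-%H%M%S")
    # Zero-padded epoch keeps lexicographic order equal to training order.
    return root / f"{run_name}_{stamp}_epoch{epoch:03d}.pt"
```

Passing `now` explicitly is only there to make the helper testable; in training code it would normally be left at its default.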

---

### **3.7 logs/**

Training logs and MLflow or wandb outputs.

Not version controlled.

---

# **4. The Most Important Utility in ML Projects: Stable Paths**

You **must not** depend on:

* the current working directory
* where VS Code launched the Python file
* where a Jupyter kernel was started
* how Hydra changed your cwd

Always resolve paths in code, relative to the repository root.

A professional utility module:

### `src/myproject/utils/paths.py`

```python
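from __future__ import annotations

import sys
from pathlib import Path

# Files or directories whose presence marks the repository root.
_MARKERS = ("pyproject.toml", ".git")


def project_root() -> Path:
    """Return the nearest ancestor directory that contains a root marker."""
    here = Path(__file__).resolve()
    for p in here.parents:
        if any((p / m).exists() for m in _MARKERS):
            return p
    # No marker found (e.g. a non-editable install): fall back to this file's directory.
    return here.parent


def entry_script_dir() -> Path | None:
    """Return the directory of the launched script, or None if it cannot be determined."""
    try:
        p = Path(sys.argv[0]).resolve()
        return p.parent if p.exists() else None
    except Exception:
        return None


def resource_path(*parts: str) -> Path:
    """Build an absolute path under the project root, e.g. resource_path("configs", "train.yaml")."""
    return project_root().joinpath(*parts)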

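With these helpers, paths resolve the same way from the CLI, VS Code, Jupyter, or a Hydra run, as long as the code runs from the repository or the package is installed in editable mode (`pip install -e .`).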
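
To see the marker search in isolation, the sketch below runs the same upward walk against a temporary directory tree. `find_root` and `demo` are hypothetical names: `find_root` is a standalone variant of `project_root()` that takes the starting path as an argument so it can be exercised without a real repository.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

MARKERS = ("pyproject.toml", ".git")


def find_root(start: Path, markers: tuple[str, ...] = MARKERS) -> Path | None:
    """Walk upward from `start` and return the first directory containing a marker."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if any((candidate / m).exists() for m in markers):
            return candidate
    return None  # the caller decides how to handle a missing root


def demo() -> bool:
    """Create a throwaway repo layout and check the root is found from deep inside it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve() / "myproject"
        nested = root / "src" / "myproject" / "utils"
        nested.mkdir(parents=True)
        (root / "pyproject.toml").write_text("[project]\nname = 'myproject'\n")
        return find_root(nested) == root
```

Returning `None` instead of a fallback directory is a design choice for this sketch: it makes a missing marker visible to the caller rather than silently pointing at the wrong folder.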

---

# **5. Best Practices for ML Code Structure**

## **5.1 Training code best practices**

Training code lives in:

```
src/myproject/core/training.py
```

Training scripts:

```
scripts/train.py
```

Call the training logic like:

```python
import yaml

from myproject.core.training import train
from myproject.utils.paths import resource_path

# Load the config through an absolute path so the script works from any cwd.
with open(resource_path("configs", "train.yaml")) as f:
    cfg = yaml.safe_load(f)

train(cfg)
```

Your script stays small and only orchestrates.

---

## **5.2 Evaluation code**

* Evaluation should need only a saved model and a dataset
* Keep clean separation between training and evaluation logic
* Put all metrics in `evaluation.py`
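
As a rough illustration of the points above, an `evaluation.py` could keep metrics as plain functions and accept the model as a callable. The names `accuracy` and `evaluate`, and the use of integer class labels, are assumptions made for this sketch; it deliberately avoids any framework so it runs on its own.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable, Sequence
from typing import TypeVar

X = TypeVar("X")


def accuracy(predictions: Sequence[int], targets: Sequence[int]) -> float:
    """Fraction of predictions that match the targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if len(targets) == 0:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def evaluate(
    predict: Callable[[X], int], dataset: Iterable[tuple[X, int]]
) -> dict[str, float]:
    """Run `predict` over (input, label) pairs and return a metrics dict."""
    predictions: list[int] = []
    targets: list[int] = []
    for x, y in dataset:
        predictions.append(predict(x))
        targets.append(y)
    return {"accuracy": accuracy(predictions, targets)}
```

Because `evaluate` knows nothing about training, it can be called from `scripts/evaluate.py`, from a notebook, or from a test with a stub model.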

---

## **5.3 Model builder pattern**

Files like:

```
src/myproject/core/model_builder.py
```

Provide a function:

```python
def build_model(cfg):
    name = cfg["name"]
    num_classes = cfg["num_classes"]
    ...
```

This keeps `train.py` clean.
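
One possible way to organize a builder like this is a name-to-constructor registry. Everything in the sketch below other than `build_model(cfg)` and the `name` / `num_classes` keys is a hypothetical stand-in: `register_model`, `_REGISTRY`, and the toy `LinearStub` class are invented for illustration, and a real project would register its actual model classes instead.

```python
from __future__ import annotations

from typing import Any, Callable

# Maps a config name to a callable that constructs the model (hypothetical registry).
_REGISTRY: dict[str, Callable[..., Any]] = {}


def register_model(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that makes a model constructor available under `name`."""
    def decorator(factory: Callable[..., Any]) -> Callable[..., Any]:
        if name in _REGISTRY:
            raise ValueError(f"Model {name!r} is already registered")
        _REGISTRY[name] = factory
        return factory
    return decorator


@register_model("linear_stub")
class LinearStub:
    """Toy placeholder standing in for a real model class."""

    def __init__(self, num_classes: int) -> None:
        self.num_classes = num_classes


def build_model(cfg: dict[str, Any]) -> Any:
    """Look up cfg["name"] and pass the remaining keys to the constructor."""
    name = cfg["name"]
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown model {name!r}; known: {sorted(_REGISTRY)}") from None
    kwargs = {k: v for k, v in cfg.items() if k != "name"}
    return factory(**kwargs)
```

With this arrangement, adding an architecture means registering one more class; the training script and the `model:` section of the config do not need structural changes.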

---

## **5.4 Config files**

Use YAML for hyperparameters.

Example:

`configs/train.yaml`:

```yaml
model:
  name: resnet18
  num_classes: 10

training:
  epochs: 30
  lr: 0.001

data:
  batch_size: 32
  train_dir: data/raw/train
  val_dir: data/raw/val
```

Configs make experiments reproducible.
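
Once the YAML is loaded into a dict, converting it to typed objects can catch typos and missing keys before training starts. The sketch below assumes the three sections from `configs/train.yaml`; the dataclass names and `parse_config` are hypothetical, and the `RAW` dict stands in for the result of `yaml.safe_load` so the example has no third-party dependency.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ModelConfig:
    name: str
    num_classes: int


@dataclass(frozen=True)
class TrainingConfig:
    epochs: int
    lr: float


@dataclass(frozen=True)
class DataConfig:
    batch_size: int
    train_dir: str
    val_dir: str


@dataclass(frozen=True)
class Config:
    model: ModelConfig
    training: TrainingConfig
    data: DataConfig


def parse_config(raw: dict[str, Any]) -> Config:
    """Convert a nested dict into frozen dataclasses.

    Unknown or missing keys inside a section raise TypeError;
    a missing section raises KeyError.
    """
    return Config(
        model=ModelConfig(**raw["model"]),
        training=TrainingConfig(**raw["training"]),
        data=DataConfig(**raw["data"]),
    )


# Stand-in for yaml.safe_load() applied to configs/train.yaml.
RAW = {
    "model": {"name": "resnet18", "num_classes": 10},
    "training": {"epochs": 30, "lr": 0.001},
    "data": {"batch_size": 32, "train_dir": "data/raw/train", "val_dir": "data/raw/val"},
}
```

Note that the dataclass constructors do not check value types, so this catches structural mistakes (a misspelled `epochs`, a missing `lr`) rather than wrong values.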

---

## **5.5 Use logging, not print()**

Create:

```
src/myproject/utils/logger.py
```

Use Python’s `logging` module with a timestamped format.
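
A minimal sketch of such a module, using only the standard library. The function name `get_logger`, the default logger name, and the exact format string are assumptions chosen for illustration.

```python
import logging
import sys


def get_logger(name: str = "myproject", level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes timestamped lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Attach the handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter(
                fmt="%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
        )
        logger.addHandler(handler)
        # Stop records from also reaching the root logger's handlers.
        logger.propagate = False
    return logger
```

A training module would then call `log = get_logger(__name__)` once at import time and use `log.info(...)` in place of `print()`.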

---

## **5.6 Reproducibility utilities**

```
src/myproject/utils/reproducibility.py
```

Set:

```python
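import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU and all CUDA devices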

---

# **6. Best Practices for ML Infrastructure**

## **6.1 Version control (Git)**

Do not commit:

```
data/
checkpoints/
logs/
__pycache__/
.ipynb_checkpoints/
```

Use `.gitignore`.

---

## **6.2 Experiment tracking**

Industry-standard tools:

* wandb
* MLflow
* TensorBoard

Prefer MLflow or wandb.

---

## **6.3 Dataset versioning (DVC)**

If you have large datasets:

```
pip install dvc
```

Track data versions.

---

## **6.4 Environment management**

Use one of:

* conda
* poetry
* pyenv + uv

My recommendation: **Poetry** or **uv**.

---

## **6.5 Virtual environment inside project**

Standard layout:

```
myproject/.venv/
```

Avoid global conda environments.

---

## **6.6 Use a lightweight Dockerfile**

If deploying:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir .
CMD ["python", "scripts/train.py"]
```

---

## **6.7 Continuous Integration**

GitHub Actions for:

* lint
* tests
* style
* packaging

---

# **7. How VS Code Should Be Configured**

## **7.1 Set the workspace root**

Your VS Code workspace root should be:

```
myproject/
```

## **7.2 Set Python path**

In `.vscode/settings.json`:

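```json
{
  "python.analysis.extraPaths": ["./src"]
}
```

This lets the editor resolve `myproject` imports for completion and analysis. It does not change runtime imports; for those, install the package in editable mode with `pip install -e .`.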

## **7.3 Configure debugging**

Create:

`.vscode/launch.json`:

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Train",
      "type": "debugpy",
      "request": "launch",
      "program": "${workspaceFolder}/scripts/train.py",
      "env": { "PYTHONPATH": "${workspaceFolder}/src" }
    }
  ]
}
```

With `PYTHONPATH` set, `myproject` is importable in the debugger even if the package has not been installed.

---


Variations on this structure, in particular an installable package under `src/` with separate scripts and configs, appear in many open-source ML codebases and research lab repositories. The exact layout differs from project to project.

---


