Demo project for applying good practices to Jupyter notebooks.
The initial code is lacking in many ways. A series of refactoring steps can turn it into a well-organized and reusable project.
- Improve notebook
  - Add Markdown cells to describe the notebook and indicate its main parts.
  - Change code to be more pythonic.
  - Create functions to prevent code duplication.
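    For instance, plotting code repeated for each feature could be extracted into a helper. This is a hypothetical sketch (the function name and column names are assumptions, not taken from the notebook):

    ```python
    import matplotlib.pyplot as plt


    def plot_feature_distribution(df, feature, target="species"):
        """Plot one histogram per class for a given feature, instead of copy-pasting the plotting code."""
        for cls in df[target].unique():
            plt.hist(df.loc[df[target] == cls, feature], alpha=0.5, label=str(cls))
        plt.xlabel(feature)
        plt.legend()
        plt.show()
    ```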
- Manage dependencies
  - Install and init Poetry.
  - Configure Poetry to create the virtual environment in the project folder:

    ```bash
    poetry config virtualenvs.in-project true
    ```

  - Configure dependencies according to the following versions:

    ```toml
    [tool.poetry.dependencies]
    python = "^3.7"
    jupyter = "^1.0.0"
    matplotlib = "^3.3.2"
    sklearn = "^0.0"
    pandas = "^1.1.3"
    tensorflow = "^2.3.1"
    seaborn = "^0.11.0"
    black = "^21.10b0"
    ```
  - Install dependencies and test the notebook.
- Organize notebooks
  - Split the notebook into four separate parts for loading and transforming data, visualizing data, training models and visualizing results. Transformed data should be serialized as JSON files like so:

    ```python
    import json

    training_data = {
        "x_train": x_train.tolist(),
        "x_test": x_test.tolist(),
        "y_train": y_train.tolist(),
        "y_test": y_test.tolist(),
        "n_features": n_features,
        "n_classes": n_classes,
    }
    with open("iris_training_data.json", "w+") as f:
        json.dump(training_data, f, ensure_ascii=False, indent=4)
    # This file can be opened with json.load()
    # Same for visualization data
    ```
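    In the downstream notebooks, the serialized data can be read back like so. This is a minimal sketch (the variable names simply mirror the dictionary keys above):

    ```python
    import json

    import numpy as np

    with open("iris_training_data.json") as f:
        training_data = json.load(f)

    x_train = np.array(training_data["x_train"])
    y_train = np.array(training_data["y_train"])
    n_classes = training_data["n_classes"]
    ```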
  - Move raw and processed data into a dedicated `data/` subdirectory.
  - Save models after training into a dedicated `models/` subdirectory.
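    A minimal sketch of such a saving step, assuming the trained models are Keras models (the helper name and file naming scheme are illustrative, not taken from the project):

    ```python
    from pathlib import Path


    def save_models(models, folder="../models"):
        """Persist trained Keras models into the models/ subdirectory."""
        Path(folder).mkdir(parents=True, exist_ok=True)
        for i, model in enumerate(models):
            model.save(f"{folder}/model_{i}.h5")
    ```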
- Externalize code
  - Move code for data preprocessing, model creation and training, and visualizations to functions in Python files `data_transformation.py`, `training.py` and `visualizations.py`. These files must be located in a dedicated `src/` subfolder also containing an empty `__init__.py` file.
  - Import and use this code in the notebooks like so:

    ```python
    import sys

    sys.path.append("..")
    from src.data_transformation import one_hot_encode, get_data_and_names, scale, split
    ```
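    For reference, a hypothetical sketch of what one of these externalized functions could look like (the real signatures and implementations in `src/data_transformation.py` may differ):

    ```python
    from tensorflow.keras.utils import to_categorical


    def one_hot_encode(y, n_classes):
        """Convert integer class labels into one-hot encoded vectors."""
        return to_categorical(y, num_classes=n_classes)
    ```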
- Add unit tests
  - Add pytest to development dependencies and install it.
  - Create unit tests for data preprocessing functions in a dedicated `tests/` subfolder. Tests should be run using the following command:

    ```bash
    python -m pytest tests/
    ```
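    A minimal sketch of such a test, assuming the `one_hot_encode(y, n_classes)` signature shown earlier (the real function may differ):

    ```python
    # tests/test_data_transformation.py
    import numpy as np

    from src.data_transformation import one_hot_encode


    def test_one_hot_encode_shape_and_values():
        y = np.array([0, 1, 2])
        encoded = one_hot_encode(y, 3)
        assert encoded.shape == (3, 3)
        assert encoded[0].tolist() == [1.0, 0.0, 0.0]
    ```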
- Create a notebook workflow
  - Add papermill to dependencies and install it.
  - Update the training notebook to parameterize the number of models created, through a Papermill parameter named `n_models` (see the sketch of the tagged cell after the workflow example below).
  - Create a workflow notebook which runs the four other notebooks in correct order, using Papermill like so:

    ```python
    import papermill as pm
    import time
    from pathlib import Path
    import os

    timestamp = time.time()
    base_folder = f"./papermill_executions/{timestamp}"
    Path(base_folder).mkdir(parents=True, exist_ok=True)

    # Same for other notebooks (excepting parameter)
    notebook_name = "3-train-models.ipynb"
    pm.execute_notebook(
        notebook_name,
        os.path.join(base_folder, notebook_name),
        parameters=dict(n_models=4),
    );
    ```
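    As referenced above, Papermill picks up parameters through a notebook cell tagged `parameters`. A minimal sketch of such a cell in the training notebook (the default value is an assumption):

    ```python
    # Cell tagged "parameters" in 3-train-models.ipynb; Papermill injects the value
    # passed through parameters=dict(n_models=...) right after this cell at run time.
    n_models = 1
    ```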