# Production 1: Getting Started

## Introduction

+ Working with code in production is hard. 
+ We will rarely get the chance to work on a greenfield project where we can define every specification. Most of the time, we will work in collaborative environments and use code to produce outputs and to communicate with colleagues (and ourselves in the future).
+ Sometimes, we may be offered the option to scrap a system and start from scratch. This option should be considered carefully and, most of the time, rejected.
+ Working with legacy code will be the norm:

    - Legacy code includes our own code.
    - Legacy code may have been written by colleagues with different approaches, philosophies, and skills.
    - Legacy code may have been written for old technology.

+ Most of the time, legacy code works and *this* is the reason we are working with it.

## Software Entropy

+ Software entropy is the natural evolution of code towards chaos.
+ Messy code is a natural consequence of change:

    - Requirements change.
    - Technology changes.
    - Business processes change.
    - People change.

+ Software entropy can be managed. Some techniques include:

    - Apply a code style.
    - Reduce inconsistency.
    - Continuous refactoring.
    - Apply reasonable architectures.
    - Apply design patterns.
    - Testing and CI/CD.
    - Documentation.

## Technical Debt

+ *Technical debt* is future work that is owed to fix issues with the current codebase.
+ Technical debt has principal and interest: complexity spreads and what was a simple *duct tape* solution becomes the source of complexity in downstream consumers.

## Complexity of ML Systems

+ ML systems are complex: they involve many components, and the interactions among those components determine the behaviour of the system. Poor software development practices add further complexity that can be avoided.
+ Building ML systems is, most of the time, a team sport. Our tools should be designed for collaboration.

# Reference Architecture

+ [Agrawal and others (2019)](https://arxiv.org/abs/1909.00084) propose the reference architecture below.

<div>
<img src="./images/01_flock_ref_arhitecture.png" height="500"/>
</div>


+ Throughout the course, we will write Python code for the different components of this architecture.

# Source Control

## Git and GitHub

+ Git is a version control system that lets you manage and keep track of your source code history.
+ If you have not done so, please get an account on [GitHub](https://github.com/) and set up SSH authentication:

    - Check for [existing SSH keys](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/checking-for-existing-ssh-keys).
    - If needed, create an [SSH Key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent).
    - [Add SSH key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account#adding-a-new-ssh-key-to-your-account) to your GitHub account.

+ If you need a refresher on Git commands, a good reference is [Pro Git](https://git-scm.com/book/en/v2) (Chacon and Straub, 2014).

## What do we include in a commit?

* Generally, we will use Git to maintain data transformation and movement *code*.
* It is good practice to not use Git to maintain data inputs or outputs of any kind. 
* Some exceptions include settings and experimental notebooks used to document design choices.
* Things to avoid putting in a repo: Personally Identifiable Information (PII), passwords, and keys.
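
One way to keep passwords and keys out of a repo is to read them from the environment at run time. The sketch below is a minimal illustration rather than a prescribed approach: the helper `get_secret` and the variable name `DB_PASSWORD` are hypothetical, and the value is assumed to be set in the shell or in an untracked `.env` file.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from the environment, failing loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        # Fail early instead of continuing with an empty credential.
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Example usage (DB_PASSWORD is a hypothetical variable name):
# db_password = get_secret("DB_PASSWORD")
```

Because the secret lives outside the codebase, the same code can run in development and production with different credentials, and nothing sensitive reaches the commit history.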

## Version Control System Best Practices

+ Commit early and commit often.
+ Use meaningful commits:

    - The drawback of committing very frequently is that the history will contain incomplete commits, errors, and reversals, with messages such as "Committing before switching to another task", "Oops", "Undoing previous idea", or "Fire alarm".
    - In Pull Requests, squash commits and write meaningful messages. 

+ Apply a branch strategy.
+ Submit clean pull requests: verify that the latest changes from the target branch have been merged into yours and resolve any conflicts.

## Commit Messages

+ Clear commit messages help document your code and allow you to trace the reasoning behind design decisions. 
+ A few guidelines for writing commit messages:

    - Use markdown where it is rendered: GitHub renders pull request descriptions as markdown, while commit messages are displayed as plain text.
    - First line is a subject:

        * No period at the end.
        * Use uppercase as appropriate.
    
    - Write in imperative form in the subject line and whenever possible:

        * Do:  "Add connection to db", "Connect to db"
        * Do not:  "This commit adds a connection to db", "Connection to db added"

    - The body of the message should explain why the change was made and not what was changed.

        * Diff will show changes in the code, but not the reasoning behind it.

    - Same rules apply for Pull Requests.

+ Many of these points are taken from [How to Write a Git Commit Message](https://cbea.ms/git-commit/) by Chris Beams.
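
The subject-line guidelines above can be partly checked mechanically. The sketch below is a rough illustration under stated assumptions: `check_subject` and `NON_IMPERATIVE_OPENERS` are hypothetical names, "use uppercase as appropriate" is interpreted as "start with an uppercase letter", and the imperative check only catches a few obvious openers.

```python
# Openers that signal a descriptive rather than imperative subject (illustrative list).
NON_IMPERATIVE_OPENERS = ("this commit", "this change", "this patch")


def check_subject(subject: str) -> list[str]:
    """Return the guideline violations found in a commit subject line."""
    text = subject.strip()
    if not text:
        return ["subject is empty"]

    problems = []
    if text.endswith("."):
        problems.append("ends with a period")
    if not text[0].isupper():
        problems.append("does not start with an uppercase letter")
    # Heuristic only: it will miss cases such as "Connection to db added".
    if text.lower().startswith(NON_IMPERATIVE_OPENERS):
        problems.append("may not be in imperative form")
    return problems


print(check_subject("Add connection to db"))
print(check_subject("This commit adds a connection to db"))
```

A check like this cannot judge whether the body explains *why* a change was made; that part still depends on the author and the reviewer.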

## Branching Strategies

+ When working standalone or in a team, you should consider your [branching strategy](https://www.atlassian.com/agile/software-development/branching).
+ A branching strategy is a way to organise the progression of code in your repo. 
+ In a [trunk-based branching strategy](https://www.atlassian.com/continuous-delivery/continuous-integration/trunk-based-development), each developer works from the *trunk* or *main* branch (Ryaboy, 2021).

<div>
<img src="./images/01_trunk_based_development.png" height="300"/>
</div>

+ After each bug fix, enhancement, or upgrade is complete, the change is integrated into *main*.
+ Generally, this is part of a larger Continuous Integration/Continuous Deployment (CI/CD) process.

## VS Code and Git

+ An Integrated Development Environment (IDE) is software that helps you write code.
+ IDEs are, ultimately, a matter of personal taste, but there are advantages to using the popular solutions:

    - Active development and bug fixes.
    - Plugin and extension ecosystems.
    - Active community for help, support, tutorials, etc.

+ Avoid the *l33t coder* trap: *vim* and *emacs* may work for some, but *nano* and VS Code are great solutions too. 
+ Reference: [Using Git source control in VS Code](https://code.visualstudio.com/docs/sourcecontrol/overview).
+ A few tips:

    - From the source control menu, one can easily stage files, commit, and push/pull to origin.

    - Other commands can be accessed via the command palette (`Ctrl + Shift + P`). For instance, one can select or create a new branch using the option *Git: Checkout to*.

# Python Virtual Environments

+ There are many reasons to control our development environment, including version numbers for Python and all the libraries that we are using:

    - Reproducibility: We want to be able to reproduce our process in a production environment with as few changes as possible. 
    - Backup and archiving: Saving our work in a way that can be used in the future, despite Python and libraries evolving.
    - Collaboration: Working with colleagues on different portions of the code involves everyone having a standard platform to run the codebase.

+ We can achieve the objectives above in many ways, including virtualising our environments, packaging our code in containers, and using virtual machines, among others.
+ Most of the time, creating a virtual environment will be part of the initial development setup. This virtual environment will help us *freeze* the Python version and the versions of some libraries.
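
A quick way to confirm which interpreter and environment a script or notebook is using is to inspect `sys` and `platform`. The sketch below is a minimal check, assuming the environment was created with a venv-style tool; `in_virtual_environment` is a hypothetical helper name, and other environment managers may not be detected this way.

```python
import platform
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a virtual environment, sys.prefix points to the environment, while
    # sys.base_prefix points to the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Python version: {platform.python_version()}")
print(f"Interpreter: {sys.executable}")
print(f"Virtual environment active: {in_virtual_environment()}")
```

Running this at the top of a notebook makes it easier to spot a kernel that is attached to the wrong environment.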

## Setting Up the Environment with uv

+ [uv](https://docs.astral.sh/uv/) is a fast command-line tool for Python package and environment management. It combines the roles of `venv` and `pip`.
+ From the terminal, create a virtual environment with: `uv venv <env-name> --python <version>`. For example, `uv venv production-env --python 3.11` creates a new environment called `production-env` using Python 3.11.
+ Activate the environment with:
  - macOS/Linux: `source <env-name>/bin/activate`
  - Windows (Git Bash): `source <env-name>/Scripts/activate`
+ Other useful commands are:

    - Verify uv installation: `uv --version`
    - Add a new package to the environment: `uv add <package-name> --active`
    - Install all required packages for the project: `uv sync --active`
    - Create a lockfile of exact package versions: `uv lock`

+ You can find more detailed instructions in [setup.md](../../SETUP.md).

# Set Up a Logger

+ We will use Python's logging module and will provision our standard loggers through our first module.
+ The module is located in `./05_src/utils/logger.py`.
+ Our notebooks will need to add `../05_src/` to their path and load environment variables from `../05_src/.env`. Notice that these paths are based on the notebook's location. 

### Logger highlights

A few highlights about `./05_src/utils/logger.py`:

+ This logger has two handlers: 

    - A `FileHandler` that saves logs to files indexed by date and time.
    - A `StreamHandler` that outputs messages to stdout.

+ Each logger can set its own format. 
+ The log directory and log level are obtained from the environment.
+ According to the [Advanced Logging Tutorial](https://docs.python.org/2/howto/logging.html#logging-advanced-tutorial): 

    >"A good convention to use when naming loggers is to use a module-level logger, in each module that uses logging, named as follows: 
    >
    >`logger = logging.getLogger(__name__)`.
    >
    >This means that logger names track the package/module hierarchy, and it's intuitively obvious where events are logged just from the logger name."
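
The highlights above can be sketched as follows. This is not the course's actual `logger.py`, only an illustration of the same ideas: the environment variable names `LOG_DIR` and `LOG_LEVEL`, their defaults, and the timestamped file name are assumptions, and the format string is chosen to resemble the sample output shown in the verification cells.

```python
import logging
import os
import sys
from datetime import datetime
from pathlib import Path


def get_logger(name: str) -> logging.Logger:
    """Return a logger that writes to a timestamped file and to stdout."""
    # LOG_DIR and LOG_LEVEL are assumed variable names with assumed defaults.
    log_dir = Path(os.getenv("LOG_DIR", "./logs"))
    log_level = os.getenv("LOG_LEVEL", "INFO").upper()
    log_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(log_level)

    # Attach handlers only once, so re-running a notebook cell
    # does not duplicate every message.
    if not logger.handlers:
        formatter = logging.Formatter(
            "%(asctime)s, %(filename)s, %(lineno)d, %(levelname)s, %(message)s"
        )
        log_file = log_dir / f"{datetime.now():%Y%m%d_%H%M%S}.log"
        file_handler = logging.FileHandler(log_file)
        stream_handler = logging.StreamHandler(sys.stdout)
        for handler in (file_handler, stream_handler):
            handler.setFormatter(formatter)
            logger.addHandler(handler)

    return logger
```

Reading the directory and level from the environment lets the same module log verbosely in development and more quietly in production without code changes.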

Run the code below to verify that your setup is working.

```python
%load_ext dotenv
%dotenv
```

```
The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
```

```python
from pathlib import Path
import sys

notebook_dir = Path.cwd()
src_path = (notebook_dir / "../../05_src").resolve()

if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))  # insert(0) gives it priority
```

```python
notebook_dir
```

```
WindowsPath('c:/Users/JesusCalderon/work/dsi_production/01_materials/labs')
```

```python
from utils.logger import get_logger
_logs = get_logger(__name__)
_logs.info("Hello world!")
```

```
2026-01-14 18:20:01,042, 492669213.py, 3, INFO, Hello world!
```


# A Few Remarks

## On Jupyter Notebooks

+ Jupyter Notebooks are great for drafting code, fast experimentation, demos, documentation, and some prototypes.
+ They are not ideal for production code or experiment tracking.

## On Copilot and other AI Code Generators

+ AI-assisted coding is a reality, and developers are incorporating it into their day-to-day activities.
+ This technology can help you answer questions, resolve syntax issues, and generate new ideas. However, you may want to consider a few items:

    - You are still responsible for your code: understand what the code assistant has proposed and make appropriate changes.
    - System architecture is important, and generative AI may steer you toward architectural decisions that significantly affect system performance.
    - If you are starting out as a developer, give yourself a chance to make mistakes and learn by trial and error. Code assistants can help you when you get stuck, but experimentation is great for learning.
    - Not all models are the same: some perform better than others.