# Production 1: Setting Up A Repo

## Introduction

+ Working with code in production is hard. We will rarely have the chance to work on a greenfield project and define all of its specifications.
+ Sometimes, we may be offered the option of scrapping a system and starting from scratch. This option should be considered carefully and, most of the time, rejected.
+ Working with legacy code will be the norm:

    - Legacy code includes our own code.
    - Legacy code may have been written by colleagues with different approaches, philosophies, and skills.
    - Legacy code may have been written for old technology.

+ Most of the time, legacy code works and *this* is the reason we are working with it.

## Software Entropy

+ Software entropy is the natural evolution of code towards chaos.
+ Messy code is a natural consequence of change:

    - Requirements change.
    - Technology changes.
    - Business processes change.
    - People change.

+ Software entropy can be managed. Some techniques include:

    - Apply a code style.
    - Reduce inconsistency.
    - Continuous refactoring.
    - Apply reasonable architectures.
    - Apply design patterns.
    - Testing and CI/CD.
    - Documentation.

+ *Technical debt* is future work that is owed to fix issues with the current codebase.
+ Technical debt has principal and interest: complexity spreads and what was a simple *duct tape* solution becomes the source of complexity in downstream consumers.
+ ML systems are complex: they involve many components, and the interaction among those components determines the behaviour of the system. We can at least avoid adding further complexity through poor software development practices.
+ Building ML systems is, most of the time, a team sport. Our tools should be designed for collaboration.

# A Reference Architecture

## What are we building?

+ [Agrawal and others](https://arxiv.org/abs/1909.00084) propose the reference architecture below.

![Flock reference architecture (Agrawal et al., 2019)](./img/flock_ref_arhitecture.png)

+ Through the course, we will write the code in Python for the different components of this architecture. 


# Repo File Structure

+ A simple standard file structure can go a long way in reducing entropy. Some frameworks will impose file structures, but generally the following pattern works well:

```
./
./data/
./env/
./logs/
./src/
./tests/
...
./docs/
./.gitignore
./readme.md
./requirements.txt
...
```

+ A few notes on the file structure:

    - `./data/` is a general data folder, usually subdivided into namespaces.
    
        * This is an optional component and meant to hold development data or a small feature store.
        * Include in `.gitignore`. 

    - `./env/` is the Python virtual environment. It is included in `.gitignore`.
    - `./logs/` is the location of log files. It is included in `.gitignore`.
    - `./src/` is the source folder. 

        * Contains most of the modules and module folders.
        * It is the reference directory for all relative paths.

    - `./src/.env` contains environment variable definitions. 

        * We will add settings to this file and read them throughout our setup.
        * A single convenient location to maintain connection strings to DB, directory locations, and settings.
    
    - `./readme.md` contains a description of the project and general guidance on where to find information in the repo.
    - `./requirements.txt` lists the Python libraries that are required for this repo.
    - `./.gitignore` lists all files that should be excluded from version control.

+ The pattern lends itself to standard `.gitignore` files that include `./env/`.
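
The structure above can be created by hand, but a short script makes it repeatable. The following is a minimal sketch, not part of the course code: `scaffold_repo` is a hypothetical helper, the starter files are created empty, and `./env/` is deliberately left out because `python -m venv env` creates it.

```python
from pathlib import Path

# Folders from the structure above; ./env/ is left to `python -m venv env`.
FOLDERS = ["data", "logs", "src", "tests", "docs"]
# Folders that the notes above say belong in .gitignore.
IGNORED = ["data/", "env/", "logs/"]
# Starter files, created empty (hypothetical defaults for this sketch).
FILES = ["readme.md", "requirements.txt", "src/.env"]


def scaffold_repo(root):
    """Create the folder skeleton and starter files under `root`."""
    root_path = Path(root)
    for folder in FOLDERS:
        (root_path / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root_path / name).touch(exist_ok=True)
    gitignore = root_path / ".gitignore"
    if not gitignore.exists():
        # Do not overwrite a .gitignore that is already there.
        gitignore.write_text("\n".join(IGNORED) + "\n")
    return root_path
```

Calling `scaffold_repo(".")` from an empty repo root creates the skeleton. Running it a second time leaves existing files untouched.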

# Source Control



## Git and GitHub

+ Git is a version control system that lets you manage and keep track of your source code history.
+ If you have not done so, please get an account on [GitHub](https://github.com/) and set up SSH authentication:

    - Check for [existing SSH keys](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/checking-for-existing-ssh-keys).
    - If needed, create an [SSH Key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent).
    - [Add SSH key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account#adding-a-new-ssh-key-to-your-account) to your GitHub account.

+ If you need a refresher on Git commands, a good reference is [Pro Git](https://git-scm.com/book/en/v2) (Chacon and Straub, 2014).

## What do we include in a commit?

* Generally, we will use Git to maintain data transformation and movement *code*.
* It is good practice to not use Git to maintain data inputs or outputs of any kind. 
* Some exceptions include settings and experimental notebooks used to document design choices.
* Things to avoid putting in a repo: Personally Identifiable Information (PII), passwords, and keys.
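
One common way to keep passwords and keys out of commits is to read them from environment variables at run time rather than hard-coding them in source files. The function below is a minimal sketch of that idea; `get_setting` and the `DB_URL` name in the comment are hypothetical and not part of the course code.

```python
import os


def get_setting(name, default=None):
    """Return the environment variable `name`, or fail loudly if it is missing."""
    value = os.getenv(name, default)
    if value is None:
        # Failing early is easier to debug than a None that surfaces later.
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# Hypothetical usage: the connection string lives in the environment,
# not in the committed source file.
# db_url = get_setting("DB_URL")
```

Raising an error for a missing setting is a design choice: the process stops at start-up instead of failing later with a less obvious message.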

## Version Control System Best Practices

+ Commit early and commit often.
+ Use meaningful commits:

    - The drawback of committing very frequently is that the history will contain incomplete commits, errors, and reversals, with messages such as "Committing before switching to another task", "Oops", "Undoing previous idea", "Fire alarm", etc.
    - In Pull Requests, squash commits and write meaningful messages. 

+ Apply a branch strategy.
+ Submit clean pull requests: verify that the latest changes from the target branch are merged into yours and review any conflicts.

## Commit Messages

+ Clear commit messages help document your code and allow you to trace the reasoning behind design decisions.
+ A few guidelines for writing commit messages:

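    - Markdown can help: GitHub renders it in pull request descriptions, which are often pre-filled from commit messages, although the commit view itself shows plain text.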
    - First line is a subject:

        * No period at the end.
        * Use uppercase as appropriate.
    
    - Write in imperative form in the subject line and whenever possible:

        * Do:  "Add connection to db", "Connect to db"
        * Do not:  "This commit adds a connection to db", "Connection to db added"

    - The body of the message should explain why the change was made and not what was changed.

        * Diff will show changes in the code, but not the reasoning behind it.

    - Same rules apply for Pull Requests.

+ Many of these points are taken from [How to Write a Git Commit Message](https://cbea.ms/git-commit/) by Chris Beams.
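
Some of the subject-line guidelines above can be checked mechanically. The function below is a minimal sketch: `check_subject` and its word list are hypothetical, the uppercase rule follows the capitalization advice in the Beams article, and the imperative check is only a rough heuristic that will miss cases such as "Connection to db added".

```python
# Openings that usually signal a non-imperative subject (illustrative list only).
NON_IMPERATIVE = {"this", "added", "adds", "adding", "fixed", "fixes", "fixing"}


def check_subject(subject):
    """Return a list of guideline violations for a commit subject line."""
    subject = subject.strip()
    if not subject:
        return ["Subject line is empty."]
    problems = []
    if subject.endswith("."):
        problems.append("Remove the period at the end.")
    if not subject[0].isupper():
        problems.append("Start with an uppercase letter.")
    first_word = subject.split()[0].lower()
    if first_word in NON_IMPERATIVE:
        problems.append("Use the imperative form, e.g. 'Add ...' or 'Fix ...'.")
    return problems
```

For example, `check_subject("Add connection to db")` returns an empty list, while `check_subject("This commit adds a connection to db")` reports the imperative-form problem.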

## VS Code and Git

+ An Integrated Development Environment (IDE) is software to help you code.
+ IDEs are, ultimately, a matter of personal taste, but there are advantages to using the popular solutions:

    - Active development and bug fixes.
    - Plugin and extension ecosystems.
    - Active community for help, support, tutorials, etc.

+ Avoid the *l33t coder* trap: *vim* and *emacs* may work for some, but *nano* and VS Code are great solutions too. 
+ VS Code integrates with Git. 
+ Reference: [Using Git source control in VS Code](https://code.visualstudio.com/docs/sourcecontrol/overview).
+ A few tips:

    - VS Code has a Git tab indicated with the icon: 
    ![](./img/source_control_icon.png)

    - From the source control menu, one can easily stage files, commit, and push/pull to origin.

    - Other commands can be accessed via the command palette (`Ctrl + Shift + P`). For instance, one can select or create a new branch using the option *Git: Checkout to*.

# Python Virtual Environments

+ There are many reasons to control our development environment, including version numbers for Python and all the libraries that we are using:

    - Reproducibility: we want to be able to reproduce our process in a production environment with as little change as possible. 
    - Backup and archiving: saving our work in a way that can be used in the future, despite Python and libraries evolving.
    - Collaboration: working with colleagues on different portions of the code requires everyone to have a standard platform to run the codebase.

+ We can achieve the objectives above in many ways, including virtualizing our environments, packaging our code in containers, and using virtual machines, among others.
+ Most of the time, creating a virtual environment will be part of the initial development setup. This virtual environment will help us *freeze* the Python version and the versions of some libraries.

## Setting up the environment

+ The simplest way to add a new virtual environment is to use the command: `python -m venv env`.
+ This command will start a new virtual environment in the subfolder `./env`.
+ To *activate* this environment, use `./env/Scripts/Activate.ps1` (Windows PowerShell) or `source ./env/bin/activate` (macOS and Linux).
+ Optionally, consider the Python add-on for VS Code that activates the environment automatically for you.
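
Once the environment is active, it can be useful to confirm from Python that the interpreter really is running inside it and to list the library versions it contains. The following is a minimal sketch using only the standard library; `in_virtual_env` and `installed_requirements` are hypothetical helper names, and the second function is similar in spirit to `pip freeze` without reproducing all of its behaviour.

```python
import sys
from importlib import metadata


def in_virtual_env():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the base Python installation.
    return sys.prefix != sys.base_prefix


def installed_requirements():
    """List installed distributions as `name==version` lines, sorted by name."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)
```

Writing the result of `installed_requirements()` to `requirements.txt`, one entry per line, is one way to record the library versions used during development.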

# Set Up a Logger

+ We will use Python's logging module and will provision our standard loggers through our first module.
+ The module is located in `./src/logger.py`.
+ Our notebooks will need to add `../src/` to their path and load environment variables from `../src/.env`. Notice that these paths are based on the notebook's location. 

```python
%load_ext dotenv
%dotenv ../src/.env
```

```python
import sys
sys.path.append("../src")
```

```python
from logger import get_logger
_logs = get_logger(__name__)
_logs.info("Hello world!")
```

# A Note about Copilot

+ AI-assisted coding is a reality. I would like your opinions about the use of this technology.
+ I will start the course with Copilot on, but if it becomes too distracting, I will be happy to turn it off. 
+ Copilot is a nice tool, but it is not for everyone. If you are starting to code or are trying to level up, I recommend that you leave AI assistants (Copilot, ChatGPT, etc.) for later.