# The Computational Almanac

## Standards and conventions

### Why do we use conventions and standards?
By establishing or using a common convention that everyone adheres to, you reduce mental
overhead and communication errors in your team.

### When should you use conventions?
- When a decision is not relevant to your research question
- If there is no clear advantage to a decision, default to convention

On the flip side, established conventions can be good candidates for new research.

### Version control

Probably one of the most important skills to learn.

Use git. Learn the basics first (`git add`, `git commit`, `git push`, `git branch`, etc.), then expand from there as needed.

Use a web-based version control tool like GitHub, GitLab, or BitBucket (these all interface with git).

#### Benefits
- Git is freedom: you don't have to fear breaking your code (or your colleagues' code), because you can always revert to a working version.
- You no longer have to write file names like this: final_version36_last_one_i_promise_number2.txt

#### Tips
- Make changes on a new branch
- Make changes in small increments
- Commit often. If you make multiple changes and then your code throws an error, it is harder to determine which change caused the bug.
- Write helpful commit messages; they make it easier to find the commit you need to revert to.

### Automate the boring stuff

We want as much of our brain power as possible going towards our research question. Questions like "Should I use tabs or spaces?" or "How many characters long should my lines be?" take up a small (but significant) amount of mental time and energy. Luckily, there are tools that automate these menial tasks.

- Code formatting: black, autopep8
- Linting: ruff
- Type checking: mypy

### Naming

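Choose a style and stick to it: e.g., snake_case, CamelCase, etc. Different languages have different conventions (Python uses snake_case for functions and variables and CamelCase for classes; Java uses camelCase for methods and variables and CamelCase for classes).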

- Variable/class names should describe what they are.
- Function/method names should describe what they do.

In [None]:
# Bad
def process_data(x, y, z, w):
    result = y * (x - z) - w
    return result

# Good
def process_total_payment(base_price, quantity, discount, tax):
    total_payment = quantity * (base_price - discount) - tax
    return total_payment

### Documentation

In academia, often you are the only person (aside from possibly a colleague or
two) who is ever going to view your code, so writing clean, documented code
is going to primarily benefit you.

You are your own closest collaborator, and past you isn’t going
to answer questions about anything present you has forgotten.

#### Comments
Tip: don't say what the code does; say why the code was written the way it was.

If you made a conscious design choice for a particular piece of code that is unintuitive, make sure to document it so that your colleagues (or future you) don't come back and try to rewrite it, only to hit the same dead ends you have already worked through.

In [None]:
# Bad
def safe_divide(numerator, denominator):
    epsilon = 1e-8
    # the numerator is divided by the denominator and a small constant
    return numerator / (denominator + epsilon)

# Good
def safe_divide(numerator, denominator):
    epsilon = 1e-8
    # epsilon is added to the denominator to avoid division by zero
    return numerator / (denominator + epsilon)

### Type hints

- They help you reason about your code
- You don't have to keep track of object types in your head: less mental overhead, and the code becomes self-documenting
- They become much more powerful when combined with a static type checker such as mypy, which catches type errors before the code runs

In [None]:
def count_amino_acid(sequence: list[str], amino_acid: str) -> int:
    return sum(1 for aa in sequence if aa == amino_acid)

count_amino_acid(sequence=["C", "A", "R", "V", "A", "Y"], amino_acid="A")

2
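
As a minimal sketch of the kind of bug a type checker can catch, consider the hypothetical function `mean_length` below (it is not part of this repository). Passing a single string instead of a list of strings runs without error but silently computes something else; a static type checker such as mypy should flag the call because `str` is not `list[str]`.

```python
def mean_length(sequences: list[str]) -> float:
    """Return the average length of the given sequences."""
    return sum(len(sequence) for sequence in sequences) / len(sequences)


# Intended use: a list of sequences.
correct = mean_length(["CAR", "VAYA"])

# Bug: a bare string is iterated character by character, so every "sequence"
# has length 1. Python runs this without complaint, but the argument does not
# match the annotated type list[str].
wrong = mean_length("CARVAY")

print(correct, wrong)
```

The second call is exactly the sort of mistake that is easy to miss by eye and cheap to catch automatically.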

### Virtual environments

Different projects require different (and often conflicting) dependencies. Virtual environments provide isolation that helps prevent potential conflicts. They also help ensure consistent environments across different machines (to a degree; see Docker containers for a deeper level of isolation).

`python3 -m venv .venv`

or

`conda create --name myenv`
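
Once an environment is created and activated (for example with `source .venv/bin/activate` on Linux/macOS, or `conda activate myenv`), it can be useful to confirm that Python is actually running inside it. A small sketch using only the standard library follows. The helper name `in_virtual_environment` is made up for illustration, and the check applies to `venv`-style environments only (a conda environment is a separate installation and will not be detected this way).

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtual_environment()}")
```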

### When to use notebooks vs scripts?

#### Notebooks

Good for: experimenting, visualization, presenting

Bad for: complex projects

#### Scripts

Good for: code that will be reused (hint: most code should be)

Bad for: ?

Advantages: 
- Enforces linear execution of code
- Easier to manage variable scope
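
A minimal sketch of what a script-style layout might look like. The function names and the toy computation are placeholders, not code from this repository. Variables live inside functions rather than in a shared global namespace, and execution always starts from `main()` and proceeds top to bottom.

```python
def load_measurements() -> list[float]:
    """Stand-in for reading data from disk (the values are made up)."""
    return [1.0, 2.0, 3.0, 4.0]


def summarize(measurements: list[float]) -> float:
    """Return the mean of the measurements."""
    return sum(measurements) / len(measurements)


def main() -> None:
    # Each variable is local to main(), so nothing leaks between steps.
    measurements = load_measurements()
    print(f"Mean measurement: {summarize(measurements)}")


# Runs only when the file is executed as a script, not when it is imported.
if __name__ == "__main__":
    main()
```

Because the logic sits in importable functions, the same file can be reused from a notebook or covered by tests.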

### Packaging is easy. No downsides. Only upsides.

- Only takes one small file (that you can copy-paste) to set up a package
- Simplifies importing your code both within a project and between projects
- Makes it easier to share your code with others and yourself

For example, you can install this package with

```bash
pip install git+https://github.com/Elliot-D-Hill/template.git
```

You can also install in local development ("editable") mode, where changes to the code are automatically reflected in the program's behavior.

```bash
pip install -e .
```

### Project structure

Pick a structure and stick to it. This repository follows a format similar to many other Python projects.

```bash
.
├── LICENSE                 # Who can use your project and how.
├── README.md               # Background, setup, and basic usage of your software
├── .gitignore              # Determines which files will not be tracked by git
├── config.toml             # Consolidates important variables into one place
├── data                    # Stores data; often has multiple subdirectories
├── log.ipynb               # The sandbox
├── pyproject.toml          # Used for packaging
├── requirements.txt        # Project dependencies are listed here
├── src                     # Package source code
│   └── example
│       ├── __init__.py
│       ├── __main__.py
│       ├── config.py
│       ├── dataset.py
│       └── model.py
└── tests                   # Unit tests that test the source code
    ├── test_dataset.py
    └── test_model.py
```

### When should you write your own code vs import from a package?

Import from a package when:

- The package has the functionality you need
- The package is trusted (a high star count on GitHub is a rough but useful indicator)

Benefits of using other people's code:
- Good packages write tests for their code (you probably don't)
- Popular packages are used, tested, and informally reviewed by their users every day, which raises our confidence in them
- Adopting other people's code lets you get to your research problem faster. As researchers, we often change only one or two aspects of a method/model/algorithm for a given project, so we can save a lot of time by importing as many of the parts we are not customizing as possible.

Example:

Your project is to design a custom loss function. You probably want to reuse a standard optimizer and model architecture rather than writing them from scratch.

### When should you optimize your code?

- Only when you need to
- When it is free or easy to do, i.e., it doesn't take much time or add much technical debt
- If you must optimize your code, focus on bottlenecks first; find them with a code profiler, e.g., cProfile in Python

Performance optimization is typically the last part of your code you want
to improve. This is because optimizations are often time-consuming to
implement and can introduce substantial code complexity.
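
As a rough sketch of how profiling might look in practice, the block below uses the standard-library `cProfile` and `pstats` modules to time a toy workload. The functions `slow_sum_of_squares` and `analysis` are invented for illustration; in a real project you would profile your own entry point instead.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n: int) -> int:
    """Deliberately naive loop that stands in for an expensive step."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def analysis() -> int:
    """Toy workload whose time is dominated by slow_sum_of_squares."""
    return slow_sum_of_squares(200_000)


# Collect timing information only for the code between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
result = analysis()
profiler.disable()

# Write the five most expensive entries (by cumulative time) into a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The functions near the top of the report are the ones worth looking at first; anything further down is unlikely to repay the effort.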

### Configuration files

Configuration files consolidate parameters into fewer places.

They let you change program behavior without touching the source code (and possibly introducing new errors).

Having all tunable parameters in one place also makes it easier to write the methods section of your paper.

They help you avoid "magic" values: unexplained literals hard-coded in the source.

In [None]:
# Bad
def calculate_area():
    return 3.14159 * 5 * 5

# Good
from math import pi

def calculate_area(radius: float) -> float:
    area = pi * radius * radius
    return area

### Write modular code
A good rule of thumb: functions and classes should have roughly one responsibility.

Modular code is easier to read, reuse, reason about, debug, and test.

In [None]:
# Bad
def process_data(data):
    # Responsibility 1: Calculate average
    total = sum(data)
    average = total / len(data)
    # Responsibility 2: Filter outliers (here: values at least twice the average)
    filtered = [x for x in data if x < 2 * average]
    # Responsibility 3: Generate a report
    n = len(data)
    report = f"Average: {average}, Total data points: {n}, Outliers removed: {n - len(filtered)}"
    return filtered, report

# Good
def calculate_average(data):
    return sum(data) / len(data)

def filter_outliers(data, average):
    # Keep values below twice the average; everything else counts as an outlier
    return [x for x in data if x < 2 * average]

def generate_report(data, average, filtered):
    n = len(data)
    return f"Average: {average}, Total data points: {n}, Outliers removed: {n - len(filtered)}"

def process_data(data):
    average = calculate_average(data)
    filtered = filter_outliers(data, average)
    report = generate_report(data, average, filtered)
    return filtered, report


### Unit tests

Automates manual testing.

Allows you to refactor fearlessly.

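You are likely already writing them informally: every time you run a function by hand and check that the output looks right, you are doing what a unit test automates.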

In [None]:
def test_process_total_payment():
    expected = 90
    result = process_total_payment(base_price=10, quantity=10, discount=0, tax=10)
    assert expected == result

test_process_total_payment()
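
Tests are most useful when they also cover edge cases. The sketch below reuses the `calculate_average` helper from the modular-code example (redefined here so the block is self-contained) and checks both a typical input and an empty one. Test runners such as pytest collect functions whose names start with `test_`; here they are simply called directly.

```python
def calculate_average(data: list[float]) -> float:
    """Return the arithmetic mean of the data."""
    return sum(data) / len(data)


def test_calculate_average_typical_input():
    assert calculate_average([1, 2, 3]) == 2


def test_calculate_average_empty_input():
    # An empty list has no mean; this documents that the function raises
    # ZeroDivisionError instead of returning a misleading number.
    try:
        calculate_average([])
    except ZeroDivisionError:
        pass
    else:
        raise AssertionError("expected ZeroDivisionError for empty input")


test_calculate_average_typical_input()
test_calculate_average_empty_input()
```

Writing the edge case down as a test forces a decision about what should happen, which is often where bugs hide.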

# Design patterns (Under construction)

### Compose complex objects out of simpler objects

Design patterns and techniques that help with this:

- Factory method pattern
- Strategy pattern
- Builder pattern
- Composition
- Dependency injection

#### Factory method pattern

In [None]:
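from torch.nn import BatchNorm1d, BCEWithLogitsLoss, LayerNorm, MSELoss

def get_loss_function(loss_fn: str):
    # Map a configuration string to a loss class using structural pattern matching
    match loss_fn:
        case "MSE":
            return MSELoss
        case "BCE":
            return BCEWithLogitsLoss
        case _:
            raise NotImplementedError(f"Loss function '{loss_fn}' is not implemented.")

def get_pooling_layer(layer: str):
    # Same idea with if/elif; the concrete layers are left unspecified
    if layer == "max":
        ...
    elif layer == "mean":
        ...
    else:
        raise NotImplementedError(f"Pooling layer '{layer}' is not implemented.")

def get_normalization_layer(layer: str):
    # Same idea with a dictionary lookup.
    # BatchNorm1d is an assumption; pick the variant that matches your input shape.
    layers = {
        "batch": BatchNorm1d,
        "layer": LayerNorm,
    }
    try:
        return layers[layer]
    except KeyError:
        raise NotImplementedError(f"Normalization layer '{layer}' is not implemented.") from None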

In [None]:
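from torch.nn import Linear, Module

class Network(Module):
    def __init__(self, criterion, pooling_layer, normalization_layer, in_features: int, out_features: int):
        super().__init__()  # Module must be initialized before submodules are assigned
        self.criterion = criterion
        self.pooling_layer = pooling_layer
        self.normalization_layer = normalization_layer
        self.linear = Linear(in_features=in_features, out_features=out_features)

In [None]:
def make_network(loss_fn: str, pool_layer: str, norm_layer: str, in_features: int, out_features: int):
    # Each component is selected by name, so swapping one only requires changing a string
    criterion = get_loss_function(loss_fn=loss_fn)
    pooling_layer = get_pooling_layer(layer=pool_layer)
    normalization_layer = get_normalization_layer(layer=norm_layer)
    return Network(
        criterion=criterion,
        pooling_layer=pooling_layer,
        normalization_layer=normalization_layer,
        in_features=in_features,
        out_features=out_features,
    )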
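
The factory functions above choose a component by name. A related idea from the list above is to pass the chosen behavior into the object that uses it (the strategy pattern combined with dependency injection). A minimal sketch in plain Python follows; `Summarizer`, `mean`, and `maximum` are hypothetical names chosen for illustration and are independent of the PyTorch code above.

```python
from typing import Callable

# A "strategy" here is any function that reduces a list of numbers to one number.
Aggregator = Callable[[list[float]], float]


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


def maximum(values: list[float]) -> float:
    return max(values)


class Summarizer:
    """Summarizes data using whichever aggregation strategy it is given."""

    def __init__(self, aggregate: Aggregator) -> None:
        # The strategy is injected, so Summarizer never needs to know
        # which concrete aggregation it is running.
        self.aggregate = aggregate

    def summarize(self, values: list[float]) -> float:
        return self.aggregate(values)


data = [1.0, 2.0, 3.0]
print(Summarizer(mean).summarize(data))
print(Summarizer(maximum).summarize(data))
```

Adding a new aggregation only requires writing one more function; `Summarizer` itself does not change.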

# Style (Under construction)

### Establishing a baseline
Your goal at the start of a project should be to establish a baseline result as fast as possible. From there you can optimize as needed.

### Coding style

Prefer simple to clever code.