# Research engineering:<br>Scientific research workflow

Anton Babkin

March 23, 2018

# Objectives
- Transparent, well documented research
- Reproducibility 
  - run provided code and get same results
- Collaboration
- Reusability

# Tools
- Python as main programming language.
- R, Julia and other community-supported kernels.
- Jupyter Notebook: documentation, presentation, interactive work.
- Git: collaboration, backup and record keeping.
- Google Cloud Storage: data warehousing and archiving.

# Project and code structure
## Managing complexity

> Dijkstra pointed out that no one's skull is really big enough to contain a modern computer program (Dijkstra 1972), which means that we as software developers shouldn't try to cram whole programs into our skulls at once; we should try to organize our programs in such a way that we can safely focus on one part of it at a time.
> The goal is to minimize the amount of a program you have to think about at any one time. You might think of this as mental juggling - the more mental balls the program requires you to keep in the air at once, the more likely you'll drop one of the balls, leading to a design or coding error.

Steve McConnell, Code Complete, 2nd ed.

## Levels of design
Steve McConnell, Code Complete, 2nd ed.

1. Software system.
2. Division into subsystems or packages.  
Typically needed on any project that takes longer than a few weeks.
3. Division into classes.  
Project takes longer than a few days.
4. Division into routines.  
Project takes more than a few hours.
5. Internal routine design.

## File structure

- `main.ipynb`: Project entry point. Overview. Links to parts. Main results.
- `lib/`: Code and notebooks. Serves as Python package for the project.
- `data/`: Permanent data storage.
- `tmp/`: Temporary data and other auxiliary files.
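
As a rough sketch, the skeleton above could be created with a few lines of `pathlib`. The helper name `scaffold` is made up for illustration, and the empty `lib/__init__.py` is an assumption based on `lib/` serving as a Python package; `main.ipynb` is left to be created from Jupyter.

```python
import tempfile
from pathlib import Path


def scaffold(root):
    """Create the directory skeleton described above under `root` (hypothetical helper)."""
    root = Path(root)
    for folder in ("lib", "data", "tmp"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    # An empty __init__.py lets `lib/` be imported as a regular package.
    (root / "lib" / "__init__.py").touch()
    # Return the created entries as sorted, POSIX-style relative paths.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


layout = scaffold(tempfile.mkdtemp())
print(layout)  # ['data', 'lib', 'lib/__init__.py', 'tmp']
```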

## Version control
- The entire project is under Git version control.
- `data/` and `tmp/` are ignored.
- `data/` is manually synced with a central data repository. Snapshots are created at the time of each major "release" for reproducibility.

## Git

- Central *private* repository on GitLab/GitHub.
- Consists of permanent "master" and "master-public" branches, and temporary branches for features and collaboration.
- Completed features are merged back into "master".
- Every time the project reaches a certain milestone on "master", it gets tagged.
- "master-public" contains a limited set of files and only reflects history between milestone releases.
  - this branch is synced with a *public* repository.
  - how to do it? :)
- In the local repo, don't work on "master"; create "wip" branches for features.
- Only push to central repo branches that need to be shared with others.

# Modularity and encapsulation in Python

While working on one small piece, forget how everything outside of it works. After the piece is finished, forget about its internals. It's easier to think 11 times about 10 lines of code at a time (10 functions of 10 lines each, plus one that calls them) than to think once about 100 lines of code.

- Functions
- Classes and objects
- Modules and packages

## Functions
- local scope
- use temporary variables
- can use bad variable names like `a`, `b`, `tmp`, `x1` and `x2`, yay! :)
- don't forget `global foo` if you want to reassign a module-level `foo` from inside a function (`nonlocal foo` for a variable of an enclosing function)
  - although it's easier to think about functions that have no side effects
- it's okay to make functions with one line of code in them
- it's okay to make functions that are only called once
- generators
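
A minimal sketch of the points above. The names `counter`, `increment`, `mean` and `squares` are invented for illustration.

```python
counter = 0  # module-level (global) variable


def increment():
    """Rebind the module-level `counter`; this needs the `global` statement."""
    global counter
    counter += 1


def mean(values):
    """No side effects: the result depends only on the argument."""
    tmp = list(values)  # `tmp` is local and disappears when the function returns
    return sum(tmp) / len(tmp)


def squares(n):
    """Generator: yields values one at a time instead of building a list."""
    for x in range(n):
        yield x * x


increment()
increment()
print(counter)           # 2
print(mean([1, 2, 3]))   # 2.0
print(list(squares(4)))  # [0, 1, 4, 9]
```

`mean` is the easiest of the three to reason about: nothing outside it changes when it runs.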

## Classes and objects
- bundle related functions together
- bundle state + methods
- cookie cutter
- interface vs internals
- `_underscore` members: pseudo-private
- `__double_underscore__` functions: `__init__`, `__str__`, `__add__`, ...
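
The bullets above can be illustrated with a small class. `Vector` and its methods are a made-up example, not code from this project.

```python
class Vector:
    """A 2D vector: state (x, y) bundled with the methods that use it."""

    def __init__(self, x, y):
        # Called when an object is created: Vector(1, 2)
        self.x = x
        self.y = y

    def _scaled(self, k):
        # Leading underscore: internal helper, not part of the public interface.
        return Vector(self.x * k, self.y * k)

    def double(self):
        # Public method; callers don't need to know about _scaled.
        return self._scaled(2)

    def __add__(self, other):
        # Called for `a + b`.
        return Vector(self.x + other.x, self.y + other.y)

    def __str__(self):
        # Called by print() and str().
        return f"Vector({self.x}, {self.y})"


v = Vector(1, 2) + Vector(3, 4)
print(v)           # Vector(4, 6)
print(v.double())  # Vector(8, 12)
```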

## Modules and packages

- Bundle closely related code, separate loosely related or unrelated code.
- "module" is any Python script file, and it can be imported.
- Module may contain executable code and definitions.
- "package" is a folder with `__init__.py` in it (since Python 3.3 a folder without it can still be imported as a namespace package).
- Package may contain modules and subpackages.
- `__init__.py` is usually either empty or imports from submodules and subpackages.
- Module vs Class?

### `import`

- When you run `from pack.subpack.mod import func`
  - `pack/__init__.py` gets executed
  - `pack/subpack/__init__.py` gets executed
  - `pack/subpack/mod.py` gets executed
  - `func` defined in `mod.py` remains in current namespace
  - !!! all module-level variables defined along the way stay in memory (inside the module objects cached in `sys.modules`); `func` can access and modify the globals of `mod.py`
- If you then `import pack`, *nothing* is executed
  - even if `pack/__init__.py` has changed
  - because it has already been executed before
  - to pick up changes: restart the kernel, call `importlib.reload`, or use [`%autoreload`](http://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html)
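
The execution order can be checked with a throwaway package built in a temporary directory. This is only a sketch: `demo_pack`, `subpack`, `mod` and the helper module `demo_log` are hypothetical names chosen for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / "demo_pack" / "subpack").mkdir(parents=True)

# `demo_log.events` records which files get executed, in order.
(root / "demo_log.py").write_text("events = []\n")
record = "import demo_log\ndemo_log.events.append(__name__)\n"
(root / "demo_pack" / "__init__.py").write_text(record)
(root / "demo_pack" / "subpack" / "__init__.py").write_text(record)
(root / "demo_pack" / "subpack" / "mod.py").write_text(
    record + "def func():\n    return 42\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

from demo_pack.subpack.mod import func
import demo_log

print(demo_log.events)
# ['demo_pack', 'demo_pack.subpack', 'demo_pack.subpack.mod']

import demo_pack  # already cached in sys.modules: nothing runs again
print(len(demo_log.events))  # 3

importlib.reload(demo_pack)  # re-executes demo_pack/__init__.py only
print(demo_log.events[-1], len(demo_log.events))  # demo_pack 4
```

Note that `importlib.reload` re-runs only the module it is given; the submodules stay as they were.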

### `from ... import ...` vs `import ...`

> Note that when using `from package import item`, the item can be either a submodule (or subpackage) of the package, or some other name defined in the package, like a function, class or variable. The import statement first tests whether the item is defined in the package; if not, it assumes it is a module and attempts to load it. If it fails to find it, an ImportError exception is raised.

> Contrarily, when using syntax like `import item.subitem.subsubitem`, each item except for the last must be a package; the last item can be a module or a package but can’t be a class or function or variable defined in the previous item.

[Modules and packages](https://docs.python.org/3/tutorial/modules.html) - official Python tutorial.
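
The difference can be tried out on a standard library package. The sketch below uses `json` purely as an example; any package with submodules would do.

```python
from json import decoder   # item is a submodule of the `json` package
from json import loads     # item is a function defined in the package

import json.decoder        # fine: the last item is a module

try:
    import json.loads      # fails: the last item is a function, not a module
except ImportError as err:  # ModuleNotFoundError is a subclass of ImportError
    error_message = str(err)

print(error_message)
print(loads("[1, 2]"))  # [1, 2]
```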

# Example 1: This project
- `lib/`
  - [`__init__.py`](lib/__init__.py) - top level declarations of `lib` package
  - [`chap1.ipynb`](lib/chap1.ipynb) - keep all code in the notebook
  - [`chap2.ipynb`](lib/chap2.ipynb) - import single module, interface vs implementation
  - [`chap2.py`](lib/chap2.py) - module for `chap2.ipynb`
  - [`chap3.ipynb`](lib/chap3.ipynb) - bundle multiple modules in subpackage
  - `chap3/` - subpackage for `chap3.ipynb`
    - [`__init__.py`](lib/chap3/__init__.py) - declarations of `chap3` subpackage
    - [`sec1.py`](lib/chap3/sec1.py)
    - [`sec2.py`](lib/chap3/sec2.py)
  - [`chap4.ipynb`](lib/chap4.ipynb) - reference other chapters
  - [`chap4.py`](lib/chap4.py)

## Show main results in the top level notebook

Importing executes the script and brings everything defined in it into the `chap2` namespace.

```python
%cd ..
```

```python
from lib import chap2
%matplotlib inline
```

Defined functions are available.

```python
chap2.print_answer()
```

```python
chap2.summarize()
```

All module global variables are available too.

```python
chap2.df.plot()
```

# Example 2: UMetrics

1. How to organize code, documentation, data, writeups and slides in a project of nontrivial size?
2. Two parallel lines of work in notebooks. How to merge them?


# Misc

- Use config files and list them in `.gitignore`.
- There are tools to connect code from different notebooks.
  - `%run lib/chap1.ipynb`
    - executes `chap1.ipynb` in isolated environment
    - then brings all variables to current namespace
  - [`ipynb`](https://github.com/ipython/ipynb)
    - `from ipynb.fs.full import chap1`: execute all code
    - `from ipynb.fs.defs import chap2`: only execute definitions
    - fails if notebooks use ipython magics, see [this issue](https://github.com/ipython/ipynb/issues/6)
- This needs further understanding, but a class defined in a top-level module and imported from it is shared between lower-level modules. This means that using it as a namespace is not safe.
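
The last point can be reproduced with three throwaway modules. This is a sketch under assumptions: `demo_top`, `demo_a`, `demo_b` and the `Settings` class are hypothetical stand-ins for a top-level module and two lower-level modules that import from it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# Top-level module with a class used as a bag of settings.
(root / "demo_top.py").write_text("class Settings:\n    verbose = False\n")
# Two lower-level modules import the same class.
(root / "demo_a.py").write_text(
    "from demo_top import Settings\n"
    "def enable():\n"
    "    Settings.verbose = True\n"
)
(root / "demo_b.py").write_text(
    "from demo_top import Settings\n"
    "def is_verbose():\n"
    "    return Settings.verbose\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_a
import demo_b

print(demo_a.Settings is demo_b.Settings)  # True: one class object, two names
before = demo_b.is_verbose()
demo_a.enable()
after = demo_b.is_verbose()
print(before, after)  # False True: the change made via demo_a is visible in demo_b
```

Because `demo_top` is executed only once, both modules hold a reference to the same class object, so class attributes behave like shared global state.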