# Tutorial 1: Reproducible Environments
(Continued from `README.md`)

## Overview

* Requirements: The Bare Minimum 

* Using a Data Science Template: `cookiecutter`

* Virtual Environments: `conda` and environment files
* Revision Control: git and a git workflow
   * Installing, Enabling, and using nbdime
* The Data Science DAG
   * make, Makefiles and data flow
* Python Modules
   * Creating an editable module
* Testing: doctest, pytest, hypothesis

We'll start by checking that all the requirements from the previous exercises (started in `README.md`) are met.

### Exercise 1: Install the requirements

* Anaconda
* Cookiecutter
* make
* git

### Test your installation

In [1]:
!conda --version   # or `$CONDA_EXE --version` in some environments

conda 4.13.0


In [2]:
!make --version

GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

This program built for i386-apple-darwin11.3.0


In [3]:
!git --version

git version 2.35.1
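The same checks can be scripted from Python. The sketch below is only an illustration: the helper name `find_tools` and the tool list are assumptions made for this example, not part of the project template. It uses `shutil.which` from the standard library to report where each required command lives on your `PATH`, or `None` if it is missing.

```python
import shutil

# Command-line tools this tutorial expects to find on PATH (assumed names).
REQUIRED_TOOLS = ["conda", "cookiecutter", "make", "git"]


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool:13s} {path or 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` needs to be installed (or added to your `PATH`) before continuing.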


### Exercise 2: Start your cookiecutter-based project

Create a project called `Bus Number Tutorial`:

    Use conda as your virtualenv manager
    Use python 3.6 or greater

When complete, you should have a fully populated project directory, complete with customized README.md.

We will be working in this project from now on.

### Exercise 2b:

Explore the `README.md` from your new `bus_number_tutorial` repo

(Hint: You can use the `%load` magic, or `!cat` to look at it in your notebook)

In [5]:
!cat ../README.md

bus-number-101

bus 101

GETTING STARTED
---------------

* Create and switch to the  virtual environment:
```
cd bus-number-101
make create_environment
conda activate bus-number-101
make requirements
```
* Explore the notebooks in the `notebooks` directory

Project Organization
------------
* `LICENSE`
* `Makefile`
    * top-level makefile. Type `make` for a list of valid commands
* `README.md`
    * this file
* `data`
    * Data directory. often symlinked to a filesystem with lots of space
    * `data/raw`
        * Raw (immutable) hash-verified downloads
    * `data/interim`
        * Extracted and interim data representations
    * `data/processed`
        * The final, canonical data sets for modeling.
* `docs`
    * A default Sphinx project; see sphinx-doc.org for details
* `models`
    * Trained and serialized models, model predictions, or model summaries
    * `models/trained`
        * Trained models
    * `models/output`
        * predicti

### Exercise 3: Set up your virtual environment and install all dependencies

Create and activate your `bus_number_tutorial` conda environment using the above make commands.

Your `active environment` should be `bus_number_tutorial`


In [6]:
!conda info


     active environment : bus-number-101
    active env location : /Users/hector/opt/anaconda3/envs/bus-number-101
            shell level : 2
       user config file : /Users/hector/.condarc
 populated config files : 
          conda version : 4.13.0
    conda-build version : 3.18.11
         python version : 3.7.6.final.0
       virtual packages : __osx=10.16=0
                          __unix=0=0
                          __archspec=1=x86_64
       base environment : /Users/hector/opt/anaconda3  (writable)
      conda av data dir : /Users/hector/opt/anaconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://repo.anaconda.com/pkgs/main/osx-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/osx-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /Users/hector/opt/anaconda3/pkgs
                          /Users/hector/

**Note:** If you are using **JupyterHub**, the bash magics `!` and `%%bash` will not work as expected: they drop you into your root JupyterHub environment rather than the conda kernel that you are running this notebook in, so you will not see `bus_number_tutorial`. To get around this, run the bash commands in this notebook from a terminal instance with your `bus_number_tutorial` conda environment activated.
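One way to see which environment the notebook kernel itself is using, without going through a shell, is to ask Python directly. This is a minimal sketch: the function name `kernel_environment` is hypothetical, and taking the last component of `sys.prefix` as the environment name is a heuristic that assumes the usual conda layout (`.../envs/<name>`).

```python
import os
import sys


def kernel_environment():
    """Describe the Python environment this kernel is actually running in."""
    return {
        "executable": sys.executable,  # interpreter running this kernel
        "prefix": sys.prefix,          # root directory of the environment
        # Heuristic: conda environments usually live in .../envs/<name>.
        "env_name": os.path.basename(sys.prefix),
        # Set by `conda activate`; may be absent inside a JupyterHub kernel.
        "conda_default_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in kernel_environment().items():
    print(f"{key:18s} {value}")
```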

If done correctly, you should also be able to import from `src`

In [7]:
# if importing src doesn't work, try `make requirements`
import src

### Exercise 4: Pick up this tutorial in your new repo

* Run jupyter notebook and open `notebooks/10-reproducible-environment.ipynb`

If you're currently running this notebook and the checks from the previous exercises worked, then you're in business!

Keep going from here!

## Revision Control: `git`

How do we keep track of our changes? We use **git**.

Before we do anything interesting, let's initialize a git repository (repo) here.


### Exercise 5: Initialize a git repo for `bus_number_tutorial`

```
git init
git add .
git commit -m "Initial Import"
```

In [8]:
!git status

On branch main
nothing to commit, working tree clean


We will get back to using git again soon.

### Exercise 6: Add a dependency
Modify the environment file so that `make requirements` installs some additional packages
* install `joblib` using conda
* install `nbdime` using pip

In [9]:
# Check that you now have joblib and nbdime installed
# Don't forget to run `make requirements` once you've changed the `environment.yml` file
import joblib
import nbdime
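If an import fails, it can help to check several packages at once without stopping at the first error. The following sketch (the helper name `missing_packages` is made up for this example) uses `importlib.util.find_spec`, which returns `None` when a top-level module cannot be found.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both packages are visible to this kernel.
print(missing_packages(["joblib", "nbdime"]))
```

If a package is listed as missing, re-check `environment.yml` and run `make requirements` again.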

### Exercise 7: Basic git interactions

Check the changes to your `environment.yml` file into your git repo

See what has changed with git:

In [10]:
!git status

On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   ../environment.lock
	modified:   ../environment.yml
	modified:   10-reproducible-environment.ipynb

no changes added to commit (use "git add" and/or "git commit -a")


In [11]:
!git diff -u ../environment.yml

diff --git a/environment.yml b/environment.yml
index 5e8c244..d9d7e11 100644
--- a/environment.yml
+++ b/environment.yml
@@ -2,9 +2,10 @@ name: bus-number-101
 dependencies:
   - pip
   - pip:
-    - -e .  # conda >= 4.4 only
-    - python-dotenv>=0.5.1
-    - nbval
+      - -e . # conda >= 4.4 only
+      - python-dotenv>=0.5.1
+      - nbval
+      - nbdime
   - setuptools
   - wheel
   - sphinx
@@ -15,5 +16,5 @@ dependencies:
   - nb_conda
   - pandas
   - requests
+  - joblib
   - python>=3.6
-


To add or reject your changes incrementally:

In [None]:
#!git add -p
#!git reset -p


Commit the changes

In [12]:
!git commit -v

hint: Waiting for your editor to close the file...
# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
#
# On branch main
# Changes to be committed:
#	modified:   ../environment.lock
#	modified:   ../environment.yml
#	modified:   10-reproducible-environment.ipynb
#
# ------------------------ >8 ------------------------
# Do not modify or remove the line above.
# Everything below it will be ignored.
diff --git a/environment.lock b/environment.lock
index 9547873..53606c9 100644
--- a/environment.lock
+++ b/environment.lock
@@ -41,6 +41,7 @@ dependencies:
   - ipywidgets=7.6.5=pyhd3eb1b0_1
   - jedi=0.18.1=py39he

In [14]:
# You should have no differences in your branch now
# Except for those that you've made by running notebooks
!git status

On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   10-reproducible-environment.ipynb

no changes added to commit (use "git add" and/or "git commit -a")


## The Data Science DAG
DAG = Directed Acyclic Graph. 

That means the process eventually stops. (This is a good thing!) 

It also means we can use a super old, but incredibly handy tool to implement this workflow: `make`.

### Make, Makefiles, and the Data Flow


We use a `Makefile` to organize and invoke the various steps in our Data Science pipeline.
You have already used this file when you created your virtual environment in the first place:
```
make create_environment
```
Here are the steps we will be working through in this tutorial:
<img src="references/cheat_sheet.png" alt="Reproducible Data Science Workflow" width="400"/>

A [PDF version of the cheat sheet](references/cheat_sheet.pdf) is also available.



### What's my make target doing?
If you are ever curious what commands a `make` command will invoke (including any invoked dependencies), use `make -n`, which lists the commands without executing them:

In [16]:
%%bash
cd .. && make -n requirements

python3 test_environment.py


We use a cute **self-documenting makefiles trick** (borrowed from `cookiecutter-datascience`) to make it easy to document the various targets that you add. This documentation is produced when you type a plain `make`:

In [17]:
%%bash
cd .. && make

To get started:
  >>> make create_environment
  >>> conda activate bus-number-101

Project Variables:
PROJECT_NAME = bus-number-101

Available rules:
analysis            Convert predictions / transforms / experiments into output
                    data
clean               Delete all compiled Python files
clean_interim       Delete all interim (DataSource) files
clean_models        Delete all trained models
clean_predictions   Delete all predictions
clean_processed     Delete all processed datasets
clean_raw           Delete the raw downloads directory
create_environment  Set up virtual environment for this project
data                convert raw datasets into fully processed datasets
delete_environment  Delete the virtual environment for this project
lint                Lint using flake8
predict             predict / transform / run experiments

### Under the Hood: The Format of a Makefile

```
## Comment to appear in the auto-generated documentation
thing_to_build: space separated list of dependencies
	command_to_run            # there is a tab before this command.
	another_command_to_run    # every line gets run in a *new shell*
```
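To make the `##` documentation convention concrete, here is a rough Python sketch of the idea: pair each `## comment` line with the target defined on the line that follows it. This is only an illustration; the template does this work inside the Makefile itself, and the function name `parse_help` and the two example rules below are invented for the demonstration.

```python
import re


def parse_help(makefile_text):
    """Pair each `## comment` line with the target defined on the next line."""
    help_entries = {}
    pending_comment = None
    for line in makefile_text.splitlines():
        if line.startswith("## "):
            pending_comment = line[3:].strip()
            continue
        # A rule line starts at column 0 with a target name followed by a colon.
        match = re.match(r"^([A-Za-z0-9_.-]+)\s*:", line)
        if match and pending_comment is not None:
            help_entries[match.group(1)] = pending_comment
        pending_comment = None
    return help_entries


# Hypothetical Makefile fragment; recipe lines start with a tab (\t).
example = """\
## Delete all compiled Python files
clean:
\tfind . -name '*.pyc' -delete

## Lint using flake8
lint:
\tflake8 src
"""

for target, description in parse_help(example).items():
    print(f"{target:10s} {description}")
```

A target without a `##` line directly above it is simply left out of the listing, which is a handy way to keep helper targets private.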



In [18]:
%%file Makefile.test

data: raw
	@echo "Build Datasets"
train_test_split:
	@echo "do train/test split"
train: data transform_data train_test_split
	@echo "Train Models"
transform_data:
	@echo "do a data transformation"
raw:
	@echo "Fetch raw data"


Writing Makefile.test


In [19]:
# Note: you can run a specific Makefile with the -f option
!make -f Makefile.test data

Fetch raw data
Build Datasets


Note: If you see: ```*** missing separator.  Stop.``` it's because you have used spaces instead of **tabs** before your commands. 
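A second common surprise is the "new shell" rule from the format description: because every recipe line runs in its own shell, a `cd` on one line is forgotten by the next. The sketch below imitates that behaviour from Python; it is not how `make` is implemented, the helper name `run_like_make` is hypothetical, and it assumes a POSIX shell and Python 3.7 or later.

```python
import subprocess
import tempfile


def run_like_make(recipe_lines, cwd):
    """Run each recipe line in its own shell and collect what it prints."""
    outputs = []
    for line in recipe_lines:
        result = subprocess.run(
            line, shell=True, cwd=cwd, capture_output=True, text=True
        )
        outputs.append(result.stdout.strip())
    return outputs


start_dir = tempfile.gettempdir()

# Two separate lines: the `cd` is forgotten before `pwd` runs.
print(run_like_make(["cd /", "pwd"], start_dir))

# One line joined with `&&`: both commands share a single shell.
print(run_like_make(["cd / && pwd"], start_dir))
```

In a Makefile, the usual fix is the same as in the second call: join dependent commands on one line with `&&`.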

### Exercise 8: What does this `Makefile.test` print when you run `make train`?

In [20]:
!make -f Makefile.test train

Fetch raw data
Build Datasets
do a data transformation
do train/test split
Train Models
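The order of that output follows from a depth-first walk of the dependency graph: before running a target, `make` first brings each of its prerequisites up to date, left to right. Here is a simplified Python sketch of that walk using the rules from `Makefile.test`. It ignores file timestamps (none of these targets are real files, so every recipe runs) and assumes the graph has no cycles; the name `build_order` is made up for this example.

```python
# Dependencies copied from Makefile.test: target -> list of prerequisites.
RULES = {
    "data": ["raw"],
    "train_test_split": [],
    "train": ["data", "transform_data", "train_test_split"],
    "transform_data": [],
    "raw": [],
}


def build_order(rules, target, built=None):
    """Return targets in the order a depth-first build would run them."""
    if built is None:
        built = []
    # Build every prerequisite (and its prerequisites) first, left to right.
    for prerequisite in rules.get(target, []):
        if prerequisite not in built:
            build_order(rules, prerequisite, built)
    if target not in built:
        built.append(target)
    return built


print(build_order(RULES, "train"))
```

The resulting list, `raw`, `data`, `transform_data`, `train_test_split`, `train`, lines up with the five messages printed by `make -f Makefile.test train`.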


### Exercise 9: What happens when you add a cycle to a Makefile?
Set up a makefile with a cyclic dependency and run it

In [21]:
%%file Makefile.test

data: raw
	@echo "Build Datasets"
train_test_split:
	@echo "do train/test split"
train: data transform_data train_test_split
	@echo "Train Models"
transform_data:
	@echo "do a data transformation"
raw: data
	@echo "Fetch raw data"

Overwriting Makefile.test


In [22]:
!make -f Makefile.test data

make: Circular raw <- data dependency dropped.
Fetch raw data
Build Datasets
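`make` noticed the loop, dropped the offending dependency, and carried on. Detecting such a loop is a small graph search. The following is a minimal sketch, not `make`'s actual algorithm: the function name `find_cycle` is hypothetical, and it follows prerequisites depth-first while remembering the path taken so far.

```python
def find_cycle(rules, target, path=()):
    """Return a list of targets forming a cycle reachable from `target`, or None."""
    if target in path:
        # Slice from the first visit of `target` to show just the loop.
        return list(path[path.index(target):]) + [target]
    for prerequisite in rules.get(target, []):
        cycle = find_cycle(rules, prerequisite, path + (target,))
        if cycle:
            return cycle
    return None


# The two rules that form the loop in the modified Makefile.test.
cyclic_rules = {"data": ["raw"], "raw": ["data"]}
print(find_cycle(cyclic_rules, "data"))

# The original rules have no loop, so nothing is reported.
acyclic_rules = {"data": ["raw"], "raw": []}
print(find_cycle(acyclic_rules, "data"))
```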


Using a Makefile like this is an easy way to set up a process flow expressed as a Directed Acyclic Graph (DAG).

**Note**: We have only scratched the surface here. There are lots of interesting tricks you can do with make.
* http://zmjones.com/make/
* http://blog.byronjsmith.com/makefile-shortcuts.html
* https://www.gnu.org/software/make/manual/


## Back to Revision Control: git workflows

Git isn't really a collaboration tool. It's more a tool for implementing collaboration workflows.

What do we mean by workflow? A process built on top of git that incorporates **pull requests** and **branches**. Typically, this is provided by sites like GitHub, GitLab, and Bitbucket.


### Exercise 10: 

Create a GitHub/GitLab/BitBucket repo and sync your repo to it.


In [23]:
# your remote repo should now show up
!git remote -v

origin	git@github.com:cuevash/bus-number-101.git (fetch)
origin	git@github.com:cuevash/bus-number-101.git (push)


For example (using SSH):

    origin	git@github.com:${GITHUB_USERNAME}/bus_number_tutorial.git (fetch)
   
    origin	git@github.com:${GITHUB_USERNAME}/bus_number_tutorial.git (push)


## GitHub workflow cheatsheet
See https://github.com/hackalog/bus_number/wiki/Github-Workflow-Cheat-Sheet

## Life Rules for using `git`

* Always work on a branch: `git checkout -b my_branch_name`. Delete branches once they are merged.
* **Never** push to master (or `main`, the default branch name in newer repos). Always **work on a branch** and do a pull request.
* Seriously, don't do work on master if you are collaborating with **anyone**.
* If you pushed it anywhere, or shared it with anyone, don't `git rebase`. In fact, if you're reading this, don't `git rebase`. Save that for when you are comfortable solving git merge nightmares on your own.

Here are some common tasks in git/GitHub:

### Starting the day. Where was I? What was I doing?
```
git branch         # What branch am I currently on? e.g. {my_branch}
git status         # anything I forgot to commit? If so...
git commit ...     # Commit work in progress
```

### Didn't I do some work at home last night?
```
git checkout master       # leave whatever branch I was on
git fetch origin --prune  # Check for something new
git merge origin/master   # If updates available, update!
git branch --merged master # check for any merged branches that can be safely deleted
git branch -d {name_of_merged_branch} # delete any fully merged branches
```

### Anything fun happening upstream?

```
git checkout master
git fetch upstream --prune  # grab latest changes from upstream repo
git merge upstream/master   # merge them into local copy of my fork
git push origin master      # push latest upstream changes to my forked repo
git branch --merged master # check for any merged branches that can be safely deleted
git branch -d {name_of_merged_branch} # delete any fully merged branches
```

Now that `master` is up to date, you should merge whatever happened in `master` into your development branch:
```
git checkout {my_branch}
git merge master               # merges master->{my_branch}
git push origin {my_branch}    # Let Github know about the merge
```

#### Some useful references if this git workflow isn't second nature to you yet
* Introduction to GitHub tutorial: https://lab.github.com/githubtraining/introduction-to-github
* Git Handbook: https://guides.github.com/introduction/git-handbook/

### Exercise 11:
* Create a branch called `add_sklearn`
* Add a scikit-learn dependency
* Check in these changes using git to your local repo
* Push the new branch to GitHub
* Create a pull request to merge this branch into master
* Merge your PR (delete the branch afterwards)
* Sync your local repo with GitHub, including deleting the merged branches

## Python Modules
By default, we keep our source code in a module called `src`. (This can be overridden in the cookiecutter.)

This is enabled via one line in `environment.yml`:
```
- pip:
  - -e .
```

This creates an **editable module**: pip looks in the current directory for a file called `setup.py` to determine the module name and location.

This lets you easily use your code in notebooks and other scripts, and avoids any `sys.path.append` silliness.

### ASIDE: Semantic Versioning

Semantic versioning (or *semver*) refers to the convention of versioning with a triple:

    MAJOR.MINOR.PATCH

With the following convention: when releasing new versions, increment the:

*    MAJOR version when you make **incompatible API changes**,
*    MINOR version when you **add functionality** in a backwards-compatible manner, and
*    PATCH version when you make backwards-compatible **bug fixes**.

If you have no other plan, this is a great convention to follow.

For an obscene amount of detail on this concept, see https://semver.org/
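As a small illustration of the increment rules, here is a sketch of a version-bumping helper. The function name `bump_version` is invented for this example, and it only handles plain `MAJOR.MINOR.PATCH` strings (no pre-release or build suffixes). Note that bumping a part resets every part to its right to zero.

```python
def bump_version(version, part):
    """Increment one part of a MAJOR.MINOR.PATCH string and reset lower parts."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump_version("1.4.2", "major"))  # 2.0.0
print(bump_version("1.4.2", "minor"))  # 1.5.0
print(bump_version("1.4.2", "patch"))  # 1.4.3
```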

### Exercise 11b:
* add your favorite utility function to `src/utils`
* increment the version number of the editable package (do this in `setup.py`)
* run `make requirements` (required if you added dependencies for your utility function)
* import your utility function and run it from this notebook

In [None]:
# A handy magic that allows us to edit modules and have them stay up to date in the notebook. In this case, src.
%load_ext autoreload
%autoreload 2

## Testing: doctest, pytest, coverage


Python has built-in testing frameworks:
* doctest: https://docs.python.org/3/library/doctest.html#module-doctest
* unittest: https://docs.python.org/3/library/unittest.html

Additionally, you'll want to make regular use of:
* pytest: https://docs.pytest.org/en/latest/
* pytest-cov: https://pypi.org/project/pytest-cov/
* hypothesis: https://hypothesis.readthedocs.io/en/latest

Cookiecutter (vanilla flavoured) comes with a setup for the `tox` testing framework built in.
* https://tox.readthedocs.io/en/latest/
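To show how a doctest and a pytest-style test differ, here is a minimal sketch. The function `normalize_route` is a hypothetical utility invented for this example, not something that ships with the project. The doctest lives in the docstring as an interactive-session transcript; the pytest test is an ordinary function whose name starts with `test_`.

```python
def normalize_route(route):
    """Strip whitespace and upper-case a bus route label.

    >>> normalize_route("  101a ")
    '101A'
    """
    return route.strip().upper()


def test_normalize_route():
    # pytest collects functions named test_* and reports a failed assert as a failure.
    assert normalize_route(" 7b") == "7B"
    assert normalize_route("42") == "42"
```

Assuming the code is saved in a file (say `example.py`, a placeholder name), `python -m doctest -v example.py` runs the docstring example, and `pytest --doctest-modules example.py` runs both kinds of test in one pass.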

### Exercise 12:

Add a `make test` target to your makefile that:
* runs doctests
* runs pytest unit tests
* (extra credit) Displays test coverage results
    
When you run `make test`, you will find tests that will fail in `src/test_example.py`. Fix them in the next exercise.

In [None]:
!cd .. && make test

***Note:*** `make test` is normally functionality built into `cookiecutter-easydata`. We're building it from scratch here for the sake of practice.

### Exercise 13:
Fix the failing tests

In [None]:
# Should pass all tests now!
!cd .. && make test

### Exercise 14:
* Check in all your changes to git
* Merge them into your master branch via a PR in GitHub

In [None]:
!git status