## Stage 3: What do I need to install?
Maybe your experience with Python dependency management looks like this (https://xkcd.com/1987/):

<img src=https://imgs.xkcd.com/comics/python_environment.png>

Furthermore, data science packages can have all sorts of additional non-Python dependencies, which makes things even more confusing, and we end up spending more time sorting out our dependencies than doing data science. If you take nothing else home from this tutorial, learn this stage. I promise: it will save you, and everyone who works with you, many days of your life.



### Reproducibility Issues:
* (NO-ENVIRONMENT-INSTRUCTIONS) Chicken-and-egg issue with environments: there is no `environment.yml` file or the like, even if there are some instructions in a notebook.
* (NO-VERSION-PIN) Versions are not pinned, e.g. the code uses a dev branch without a clear indication of when it was released.
* (IMPOSSIBLE-ENVIRONMENT) Dependencies are not resolvable due to version clashes (e.g. one requirement needs <=0.48 and another needs >=0.49).
* (ARCH-DIFFERENCE) The same code runs differently on different architectures
* (MONOLITHIC-ENVIRONMENT) One environment to rule (or fail) them all. 
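
To make the IMPOSSIBLE-ENVIRONMENT case concrete, here is a minimal sketch of why `<=0.48` and `>=0.49` can never be satisfied together. The helper names `parse_version` and `is_resolvable` are made up for this illustration, and the check only handles plain dotted numeric versions; a real solver such as conda's deals with far more than this.

```python
def parse_version(text):
    """Turn a dotted version such as '0.48' into a tuple of ints: (0, 48)."""
    return tuple(int(part) for part in text.split("."))


def is_resolvable(at_least, at_most):
    """Return True if some version v satisfies at_least <= v <= at_most."""
    return parse_version(at_least) <= parse_version(at_most)


# One package needs <=0.48, another needs >=0.49: no version satisfies both.
print(is_resolvable(at_least="0.49", at_most="0.48"))  # False
# Relaxing the upper bound to <=0.50 makes the pair resolvable.
print(is_resolvable(at_least="0.49", at_most="0.50"))  # True
```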



### Default Better Principles
* **Use (at least) one virtual environment per repo**: And use the same name for the environment as the repo.
* **Generate lock files**: Lock files include every single dependency in your dependency chain. Lock files are necessarily platform specific, so you need one per platform that you support. This way you have a perfect version pin on the environment that you used for that moment in time.
* **Check in your environment creation instructions**: That means an `environment.yml` file for conda, and its matching lock file(s). 

## The Easydata way: `make create_environment`
We like `conda` for environment management since it's the least bad option for most data science workflows. There are no perfect ways of doing this. Here are some basics.



### Setting up your environment
### Clone the repo
```
   git clone https://github.com/acwooding/easydata-tutorial
   cd easydata-tutorial
```

### Initial setup

* **YOUR FIRST TASK OF THIS STAGE**: Check whether a `CONDA_EXE` environment variable is set to the full path of your conda binary, e.g. by running:

```
export | grep CONDA_EXE
```
* **NOTE:** If there is no `CONDA_EXE`, you will need to find your conda binary and record its location in the `CONDA_EXE` line of `Makefile.include`.

Recent versions of conda have made finding the actual binary harder than it should be. This might work:
```
   $ which conda
   ~/miniconda3/bin/conda
```
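
If you would rather do the same lookup from Python, the sketch below mirrors the two steps above: prefer `CONDA_EXE`, then fall back to searching your `PATH`. The function name `find_conda_exe` is made up for this example and is not part of Easydata.

```python
import os
import shutil


def find_conda_exe(environ=None):
    """Return the path to the conda binary, or None if it cannot be found.

    Prefers the CONDA_EXE environment variable and falls back to
    searching PATH for an executable named "conda".
    """
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_EXE") or shutil.which("conda", path=environ.get("PATH"))


# Prints a path if conda was found, otherwise None.
print(find_conda_exe())
```

If this prints `None`, conda is probably available only as a shell function, and you will have to track down the binary by hand.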

* Create and switch to the virtual environment:
```
make create_environment
conda activate easydata-tutorial
make update_environment
```

Now you're ready to run `jupyter notebook` (or `jupyter lab`) and explore the notebooks in the `notebooks` directory.

From within jupyter, re-open this notebook and run the cells below.


**Your next Task**: Run the next cell to ensure that the packages got added to the python environment correctly.

In [None]:
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
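
If one of these imports fails, it can help to see every missing package at once rather than one traceback at a time. Here is a small sketch using the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is made up for this example.

```python
import importlib.util


def missing_packages(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# These are import names, which can differ from the package names in
# environment.yml (scikit-learn is imported as "sklearn").
required = ["sklearn", "matplotlib", "pandas"]
print(missing_packages(required))  # an empty list means everything was found
```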

### Updating your conda and pip environments
The `make` commands `make create_environment` and `make update_environment` are wrappers that allow you to easily manage your conda and pip environments using a file called `environment.yml`, which lists the packages you want in your Python environment.

(If you ever forget which `make` subcommand to run, you can run `make` by itself and it will provide a list of subcommands that are available.)


When adding packages to your Python environment, **never do a `pip install` or `conda install` directly**. Always edit `environment.yml` and run `make update_environment` instead.

Your `environment.yml` file will look something like this:
```
name: easydata-tutorial
dependencies:
  - pip
  - pip:
    - -e .  # conda >= 4.4 only
    - python-dotenv>=0.5.1
    - nbval
    - nbdime
    - umap-learn
    - gdown
    - # Add more pip dependencies here
  - setuptools
  - wheel
  - git>=2.5  # for git worktree template updating
  - sphinx
  - bokeh
  - click
  - colorcet
  - coverage
  - coveralls
  - datashader
  - holoviews
  - matplotlib
  - jupyter
  - # Add more conda dependencies here
...
```
Notice that you can add conda and pip dependencies separately. For good reproducibility, we recommend that you use the conda version of a package whenever one is available.
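
To see how the two kinds of dependencies are laid out, here is a rough sketch that splits a parsed `environment.yml` into its conda and pip parts. It assumes the packages sit under a top-level `dependencies` key, as conda's environment file format expects, and that the file has already been loaded into a dict by a YAML parser (for instance PyYAML's `yaml.safe_load`, which is an assumption and not something this tutorial requires). The function name `split_dependencies` and the example dict are made up for illustration.

```python
def split_dependencies(env):
    """Split a parsed environment.yml into (conda_packages, pip_packages)."""
    conda_packages, pip_packages = [], []
    for entry in env.get("dependencies") or []:
        if isinstance(entry, dict):
            # The nested "pip:" section parses as a one-key dict.
            pip_packages.extend(p for p in (entry.get("pip") or []) if p)
        elif entry:
            # Skip empty entries such as a bare "- # comment" line.
            conda_packages.append(entry)
    return conda_packages, pip_packages


# Hand-written stand-in for what a YAML parser would return.
example = {
    "name": "easydata-tutorial",
    "dependencies": [
        "pip",
        {"pip": ["-e .", "python-dotenv>=0.5.1", "nbval", None]},
        "setuptools",
        "matplotlib",
        None,
    ],
}
conda_packages, pip_packages = split_dependencies(example)
print("conda:", conda_packages)
print("pip:", pip_packages)
```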

Once you're done with your edits, run `make update_environment` and voilà, your Python environment is up to date.

**Git Bonus Task:** To save or share your updated environment, check in your `environment.yml` file using git.


**YOUR NEXT TASK** in the Quest: Update your Python environment to include the `seaborn` package. But first, a quick tip on using `conda` environments in notebooks:

#### Using your conda environment in a jupyter notebook
If you make a new notebook and your packages don't seem to be available, make sure to select the `easydata-tutorial` kernel from within the notebook. If you are somehow in another kernel, select **Kernel -> Change kernel -> Python[conda env:easydata-tutorial]**. If you don't seem to have that option, make sure that you ran `jupyter notebook` with the `easydata-tutorial` conda environment activated, and that `which jupyter` points to the correct (`easydata-tutorial`) version of jupyter.

You can see what's in your notebook's conda environment by putting the following in a cell and running it:

In [None]:
%conda info
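
For a quicker check of which environment the kernel itself is running from, you can look at `sys.prefix`. The sketch below assumes the usual conda layout, where a named environment lives in a directory with the same name (e.g. `.../envs/easydata-tutorial`). The helper name `env_name_from_prefix` is made up for this example.

```python
import os
import sys


def env_name_from_prefix(prefix):
    """Return the last path component of an environment prefix."""
    return os.path.basename(os.path.normpath(prefix))


expected = "easydata-tutorial"
actual = env_name_from_prefix(sys.prefix)
print(f"kernel is running from {sys.prefix!r}")
print("matches expected environment:", actual == expected)
```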

Another useful cell to include is the following. It loads IPython's `autoreload` extension, which reloads modules whose source has changed before each cell runs, so that your changes are immediately available in your running notebooks without restarting the kernel:

In [None]:
%load_ext autoreload
%autoreload 2

If you did your task correctly, the following import will succeed.

In [None]:
import seaborn as sns

Remember, you should **never** do a `pip install` or `conda install` manually. You want to make sure your environment changes are saved to your data science repo. Instead, edit `environment.yml` and run `make update_environment`.

Your **NEXT TASK of this stage**: Run `make env_challenge` and follow the instructions if it works.

### BONUS Task: Lockfiles
*Do this if there's time*

Lockfiles are a way of separating the list of "packages I want" from the list of "packages I need to install to make everything work". For reproducibility reasons, we want to keep track of both lists, but not in the same place. Usually, this separation is done with something called a "lockfile."

Unlike several other virtual environment managers, conda doesn't have lockfiles. To work around this limitation, Easydata generates a basic lockfile from `environment.yml` whenever you run `make update_environment`.

This lockfile is a file called `environment.{$ARCH}.lock.yml` (e.g. `environment.i386.lock.yml`). It keeps a record of the exact set of packages currently installed in your conda environment `easydata-tutorial`. If you ever need to reproduce an environment exactly, you can install from the `.lock.yml` file. (Note: these files are architecture dependent, so don't expect a macOS lockfile to work on Linux, and vice versa.)
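
To get a feel for the difference between the two lists, here is a rough sketch that reports which locked packages were never requested directly, i.e. the ones that were pulled in as dependencies of something else. It assumes lock entries look like conda's `name=version=build` specs; the exact format Easydata writes may differ. The helper names and the example entries are hypothetical and for illustration only.

```python
import re


def package_name(spec):
    """Extract the bare package name from a spec like 'git>=2.5' or 'numpy=1.19.2=build_0'."""
    return re.split(r"[=<>!~ ]", spec.strip(), maxsplit=1)[0].lower()


def transitive_only(requested, locked):
    """Return locked package names that were never requested directly."""
    wanted = {package_name(spec) for spec in requested}
    return sorted({package_name(spec) for spec in locked} - wanted)


# Hypothetical entries, for illustration only.
requested = ["matplotlib", "git>=2.5"]
locked = [
    "matplotlib=3.3.2=build_0",
    "git=2.23.0=build_0",
    "numpy=1.19.2=build_0",
    "kiwisolver=1.3.0=build_0",
]
print(transitive_only(requested, locked))  # ['kiwisolver', 'numpy']
```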

For more instructions on setting up and maintaining your python environment (including how to point your environment at your custom forks and work in progress) see [Setting up and Maintaining your Conda Environment Reproducibly](../reference/easydata/conda-environments.md).


**Your BONUS Task** in the Quest: Take a look at the lockfile, and compare its content to `environment.yml`. Then ask yourself, "aren't I glad I don't have to maintain this list manually?"