## Stage 3: What do I need to install?
Maybe your experience with Python dependency management looks like this (https://xkcd.com/1987/):

<img src="https://imgs.xkcd.com/comics/python_environment.png" alt="xkcd 1987: Python Environment">

Furthermore, data science packages can have all sorts of additional non-Python dependencies, which makes things even more confusing, and we end up spending more time sorting out our dependencies than doing data science. If you take nothing else home from this tutorial, learn this stage. I promise it will give you, and everyone who works with you, many days of your life back.



### Reproducibility Issues:
* (NO-ENVIRONMENT-INSTRUCTIONS) Chicken-and-egg issue with environments: there is no `environment.yml` file or the like, even if there are some instructions in a notebook.
* (NO-VERSION-PIN) Versions are not pinned, e.g. the code uses a dev branch with no clear indication of when it was released.
* (IMPOSSIBLE-ENVIRONMENT) Dependencies are not resolvable due to version clashes (e.g. need <=0.48 and >=0.49).
* (ARCH-DIFFERENCE) The same code runs differently on different architectures.
* (MONOLITHIC-ENVIRONMENT) One environment to rule (or fail) them all.
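To make IMPOSSIBLE-ENVIRONMENT concrete, here is a minimal sketch in plain Python of why `<=0.48` and `>=0.49` can never be satisfied together. The helper names (`parse`, `satisfies`) and the list of candidate releases are made up for illustration; a real resolver such as conda does far more than this.

```python
import operator

# Hypothetical helpers for illustration only; this is not how conda resolves versions.
OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}


def parse(version):
    """Turn '0.48' into (0, 48) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, constraint):
    """Check one constraint such as '<=0.48' against a version string."""
    op, bound = constraint[:2], constraint[2:]
    return OPS[op](parse(version), parse(bound))


candidates = ["0.47", "0.48", "0.49", "0.50"]  # made-up release list
constraints = ["<=0.48", ">=0.49"]             # the clash from the list above

# Keep only the releases that satisfy every constraint at once.
usable = [v for v in candidates if all(satisfies(v, c) for c in constraints)]
print(usable)  # empty: no release satisfies both constraints
```

However many releases you add to `candidates`, the list stays empty, which is exactly what a resolver reports as an unsolvable environment.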



### Default Better Principles
* **Use (at least) one virtual environment per repo**: And use the same name for the environment as the repo.
* **Generate lock files**: Lock files include every single dependency in your dependency chain. Lock files are necessarily platform-specific, so you need one per platform that you support. This gives you an exact version pin for the environment you used at that moment in time.
* **Check in your environment creation instructions**: That means an `environment.yml` file for conda, and its matching lock file(s). 

## The Easydata way: `make create_environment`
We like `conda` for environment management since it's the least bad option for most data science workflows. There are no perfect ways of doing this. Here are some basics.



### Setting up your environment

#### Initial setup

* Make note of the path to your conda binary:
```
   $ which conda
   ~/miniconda3/bin/conda
```
* Ensure your `CONDA_EXE` environment variable is set to this value (or edit `Makefile.include` directly):
```
    export CONDA_EXE=~/miniconda3/bin/conda
```
* Create and switch to the virtual environment:
```
cd easydata-tutorial
make create_environment
conda activate easydata-tutorial
make update_environment
```

Now you're ready to run `jupyter notebook` (or jupyterlab) and explore the notebooks in the `notebooks` directory.

From within jupyter, run this notebook and the cells below.

For more instructions on setting up and maintaining your Python environment (including how to point your environment at your custom forks and work in progress), see [Setting up and Maintaining your Conda Environment Reproducibly](../reference/easydata/conda-environments.md).


Time to see what we got from our initial install!

```
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
```
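If you also want to know which versions the install gave you, the standard library's `importlib.metadata` can report them. This is a small sketch rather than part of Easydata; `installed_versions` is a hypothetical helper name, and the package list is just the distributions behind the imports above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names for the imports above (sklearn is published as scikit-learn).
for name, ver in installed_versions(["scikit-learn", "matplotlib", "pandas"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Note that the lookup uses the distribution name, not the import name, which is why `sklearn` appears here as `scikit-learn`.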

### Updating your conda and pip environments
The `make` commands `make create_environment` and `make update_environment` are wrappers that let you easily manage your conda and pip environments using the `environment.yml` file.

(If you ever forget which `make` command to run, you can run `make` by itself and it will provide a list of commands that are available.)


When adding packages to your Python environment, **do not `pip install` or `conda install` directly**. Always edit `environment.yml` and run `make update_environment` instead.

Your `environment.yml` file will look something like this:
```
name: easydata-tutorial
dependencies:
  - pip
  - pip:
    - -e .  # conda >= 4.4 only
    - python-dotenv>=0.5.1
    - nbval
    - nbdime
    - umap-learn
    - gdown
    - # Add more pip dependencies here
  - setuptools
  - wheel
  - git>=2.5  # for git worktree template updating
  - sphinx
  - bokeh
  - click
  - colorcet
  - coverage
  - coveralls
  - datashader
  - holoviews
  - matplotlib
  - jupyter
  - # Add more conda dependencies here
...
```
To add any package available from conda, add it to the end of the list. If you have a PyPI dependency that's not available via conda, add it to the list of pip-installable dependencies under `  - pip:`.

Once you're done with your edits, run `make update_environment` and voila, your Python environment is up to date.

To save or share your updated environment, check in your `environment.yml` file using git.
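To see how the conda and pip sections of the file relate, here is a rough, line-based sketch that splits an `environment.yml`-shaped text into the two lists. `split_dependencies` is a hypothetical helper, the sample text is a shortened stand-in for the file above, and the scanner only understands this simple layout; a real YAML parser is the robust choice.

```python
def split_dependencies(text):
    """Split environment.yml-shaped text into (conda, pip) dependency lists.

    Line-based sketch: it only handles the simple layout used in this tutorial.
    """
    conda, pip = [], []
    section = None     # name of the current top-level key
    pip_indent = None  # indentation of the "- pip:" line once we have seen it
    for raw in text.splitlines():
        line = raw.split(" #", 1)[0].rstrip()  # strip trailing comments
        entry = line.lstrip()
        if line and not line[0].isspace() and not entry.startswith("- "):
            section = line.split(":", 1)[0]    # e.g. "name" or "dependencies"
            continue
        if section != "dependencies" or not entry.startswith("- "):
            continue                           # skip blanks, comments, other lists
        indent = len(line) - len(entry)
        item = entry[2:].strip()
        if item == "pip:":
            pip_indent = indent                # nested entries below belong to pip
        elif pip_indent is not None and indent > pip_indent:
            pip.append(item)
        else:
            conda.append(item)
    return conda, pip


# Shortened stand-in for the file shown above (assumed layout).
example = """\
name: easydata-tutorial
dependencies:
  - pip
  - pip:
    - -e .  # conda >= 4.4 only
    - nbval
  - setuptools
  - git>=2.5  # for git worktree template updating
"""

conda_deps, pip_deps = split_dependencies(example)
print("conda:", conda_deps)
print("pip:  ", pip_deps)
```

The nesting is the point: anything indented under `- pip:` is installed by pip, and everything else at the outer level is installed by conda.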


Now try updating your environment to include `seaborn`. But first, a tip on using `conda` environments in notebooks:

#### Using your conda environment in a jupyter notebook
If you make a new notebook, select the `easydata-tutorial` environment from within the notebook. If you are somehow in another kernel, select **Kernel -> Change kernel -> Python[conda env:easydata-tutorial]**. If you don't seem to have that option, make sure that you ran `jupyter notebook` with the `easydata-tutorial` conda environment activated, and that `which jupyter` points to the correct (`easydata-tutorial`) version of jupyter.
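A quick way to confirm which environment a running kernel is actually using is to ask the interpreter itself. The sketch below assumes the usual conda layout, where an environment lives in a directory named after it; `active_environment_name` is a hypothetical helper, not an Easydata function.

```python
import os
import sys


def active_environment_name(prefix=None):
    """Guess the environment name from the interpreter prefix (its last path component)."""
    prefix = sys.prefix if prefix is None else prefix
    return os.path.basename(os.path.normpath(prefix))


print("interpreter:", sys.executable)
print("environment:", active_environment_name())
# Set by `conda activate` in the shell that launched jupyter; may be None.
print("CONDA_DEFAULT_ENV:", os.environ.get("CONDA_DEFAULT_ENV"))
```

If the environment name printed here is not `easydata-tutorial`, the kernel is running somewhere else, and packages you add via `environment.yml` will not show up in it.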

If you want your environment changes to be immediately available in your running notebooks, make sure to run a notebook cell containing:

```
%load_ext autoreload
%autoreload 2
```

```
import seaborn as sns
```

#### Lock files
Now, we'll admit that this workflow isn't perfectly reproducible, in the sense that conda still has to resolve versions from the `environment.yml`. To make it more reproducible, running either `make create_environment` or `make update_environment` will generate an `environment.{$ARCH}.lock.yml` (e.g. `environment.i386.lock.yml`). This file records the exact environment that is currently installed in your conda environment `easydata-tutorial`. If you ever need to reproduce an environment exactly, you can install from the `.lock.yml` file. (Note: these files are architecture-dependent.)
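To illustrate why there is one lock file per architecture, here is a small sketch that builds a lock file name following the pattern described above. It uses `platform.machine()` as a stand-in for the architecture string; how Easydata actually computes `ARCH` is not shown here, so treat that choice (and the helper name `lock_file_name`) as an assumption.

```python
import platform


def lock_file_name(arch=None):
    """Build a name like environment.<arch>.lock.yml for the given architecture."""
    # Assumption: platform.machine() approximates whatever Easydata uses for ARCH.
    arch = platform.machine() if arch is None else arch
    return f"environment.{arch}.lock.yml"


print(lock_file_name())        # name for the machine running this code
print(lock_file_name("i386"))  # the example from the text
```

Two collaborators on different hardware would get different file names, so each platform's pinned environment can be checked in side by side.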

Run `make env_challenge` to test your environment and complete this Challenge.