# Stage 5: Does it work as expected?

The analysis is done. The results are in. They look good. It's time to share with someone else, not just my future self.

* How easy is it for someone else to run my code? 
* How likely are they to get the same results?
* How hard is it for someone else to check that things work as expected?

### Reproducibility Issues

* (MISSING-STATE) Can't reproduce because of some missing state, e.g. cells were run out of sequence, or a variable was changed but the notebook wasn't rerun.
* (VARIABLE-SCARCITY) A variable name was re-used (possibly as a result of copy/pasting code from elsewhere), creating cognitive dissonance or confusing code.
* (ARCH-DIFFERENCE) The same code runs differently on different architectures.
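
To make `MISSING-STATE` concrete, here is a minimal sketch that treats notebook cells as source strings executed against a shared namespace. The cell contents and the `run_cells` helper are hypothetical, made up for illustration; they are not part of Easydata.

```python
# Hypothetical notebook cells, stored as source strings for illustration.
cells = [
    "scale = 2",
    "values = [1, 2, 3]",
    "scaled = [v * scale for v in values]",
]


def run_cells(cell_sources):
    """Execute cells top to bottom in a fresh namespace, like Restart & Run All."""
    namespace = {}
    for source in cell_sources:
        exec(source, namespace)
    return namespace


# Interactive session: run everything, then edit the first cell and rerun only it.
session = run_cells(cells)
exec("scale = 10", session)
stale_result = session["scaled"]  # still computed with the old scale

# Clean run of the edited notebook: every cell sees the new value.
edited_cells = ["scale = 10"] + cells[1:]
clean_result = run_cells(edited_cells)["scaled"]

print("stale:", stale_result)
print("clean:", clean_result)
```

The interactive session and the clean run disagree even though the notebook text is identical, which is exactly what a reader running your notebook from scratch would trip over.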

### Default Better Principles

Write tests. Use them. Even if only to check that your imports work, Datasets load, and notebooks run to completion.
* **Test running clean notebooks**: We can't say this enough. Always test a clean run of a notebook before checking it in! `Kernel -> Restart & Run All` for the win. Most if not all `MISSING-STATE` and `VARIABLE-SCARCITY` issues would be resolved by this simple workflow step. Automate this to make it even easier. Running analysis steps can be slow. Make the slow ones manual tests, and the quick ones, like this notebook, run in CI. 
* **Test run from a fresh environment**: Blow it away. Start from scratch. See if you have any unaccounted for `MISSING-STATE` hiding in your environment.
* **BONUS-Test on another architecture**: This is where CI is your friend. It's usually easy to run on at least one platform that's different from your own.
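
One way to automate a clean run is to execute the notebook from the command line with `jupyter nbconvert --execute`, which starts a fresh kernel and runs every cell in order. The sketch below only wraps that command. The helper names, the notebook filename, and the timeout value are illustrative assumptions, and this is not necessarily how Easydata does it.

```python
import subprocess
from pathlib import Path


def build_clean_run_command(notebook, output_dir, timeout_seconds=600):
    """Build a `jupyter nbconvert` command that executes a notebook top to bottom."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        f"--ExecutePreprocessor.timeout={timeout_seconds}",
        "--output-dir", str(output_dir),
        str(notebook),
    ]


def notebook_runs_clean(notebook, output_dir="executed"):
    """Return True if the notebook executes without errors in a fresh kernel."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    result = subprocess.run(build_clean_run_command(notebook, output_dir))
    return result.returncode == 0


# Hypothetical notebook name, used only to show the generated command.
print(" ".join(build_clean_run_command("my-analysis.ipynb", "executed")))
```

Writing the executed copy to a separate directory keeps the checked-in notebook untouched, so the check can run repeatedly without creating diffs.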

## The Easydata Way: `make test`
We recognize that being creative is an integral part of doing data science work and research. We don't want to get in the way of that. In particular, we don't specify how you should do your work in the brainstorming phase (other than recommending that you check-in your work via git whenever something seems to work).

We love test-driven development, but we're agnostic about its use with your data science workflow. This isn't software engineering, this is data science. However, once you hit the editing phase of your work, once you start to put together something beyond a scattered Sucky First Draft (SFD), it's time to start wearing that SEng hat. And if you care about sharing with your data science neighbours and friends, testing is paramount. 

Let's see what this looks like in our Penguin Example.

## The Penguin Data Analysis

The previous notebook gives us some idea of what the penguins data looks like by giving us all the
2D views of the data. However, the data is 4-dimensional. While four dimensions is low enough that we can sort of reconstruct what the full dimensional data looks like in our heads, it's better if we can reduce the dimension of our data to visualize it. 

Let's try out [UMAP](https://github.com/lmcinnes/umap) as a dimension reduction technique. By reducing the
dimension in a way that preserves as much of the structure of the data
as possible we can get a visualisable representation of the data
allowing us to "see" the data and its structure and begin to get some
intuition about the data itself.

### Default Better Principle: Test run from a fresh environment


In [None]:
import umap

from src.data import Dataset
from src import paths

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Turns out, we're missing `umap`! If you `pip install` or `conda install` UMAP from the command line, it will lead to **MISSING-STATE** in your environment when you test run your work from your own machine. Instead, let's add `umap-learn` to our `environment.yml` file to avoid this bug.

**Hint:** To add [UMAP](https://github.com/lmcinnes/umap), you'll need to update your environment. We recommend installing umap via the conda-forge channel. For example, you can add `my-package` from `my-conda-channel` via a line like this in your `environment.yml`.
```
- my-conda-channel::my-package
```
In this case:
```
- conda-forge::umap-learn
```
And don't forget to `make update_environment` after adding it!
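
A quick way to confirm that a dependency is recorded in `environment.yml`, rather than only installed by hand, is to scan the file for it. This is a minimal sketch using plain string handling (no YAML parser). The `listed_packages` helper and the sample file contents are hypothetical, and the helper only understands simple `- channel::package` and `- package=version` entries; it does not distinguish a `channels:` section from `dependencies:`.

```python
def listed_packages(environment_yml_text):
    """Return package names from simple `- [channel::]name[=version]` lines."""
    names = set()
    for line in environment_yml_text.splitlines():
        entry = line.strip()
        if not entry.startswith("- "):
            continue
        spec = entry[2:].strip()
        if spec.endswith(":"):  # e.g. "- pip:" opens a nested list, not a package
            continue
        spec = spec.split("::")[-1]          # drop an optional channel prefix
        for separator in ("=", "<", ">", " "):
            spec = spec.split(separator)[0]  # drop version constraints
        if spec:
            names.add(spec)
    return names


# Hypothetical file contents for illustration only.
sample = """
dependencies:
  - pip
  - conda-forge::umap-learn
  - seaborn=0.11
"""

print("umap-learn" in listed_packages(sample))
```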

## Load up the Datasets we'll use

In [None]:
penguins = Dataset.load("penguins-clean").data
scaled_penguin_data = Dataset.load("penguins-scaled").data

UMAP follows the sklearn API and has a method ``fit`` which we
pass the data we want the model to learn from. Since, at the end of the
day, we are going to want a reduced representation of the data, we will
use the ``fit_transform`` method which first calls ``fit`` and
then returns the transformed data as a numpy array.
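
The `fit` / `fit_transform` convention can be illustrated without UMAP itself. The toy `MeanCenterer` class below is a hypothetical stand-in, not part of sklearn or UMAP: `fit` learns something from the data, `transform` applies it, and `fit_transform` does both in one call. The real libraries return numpy arrays; this sketch uses plain lists to stay dependency-free.

```python
class MeanCenterer:
    """Toy transformer following the fit / transform / fit_transform pattern."""

    def fit(self, rows):
        # Learn the per-column mean from the data.
        n_rows = len(rows)
        self.means_ = [sum(column) / n_rows for column in zip(*rows)]
        return self  # returning self allows chaining, as sklearn estimators do

    def transform(self, rows):
        # Apply what was learned: subtract each column's mean.
        return [
            [value - mean for value, mean in zip(row, self.means_)]
            for row in rows
        ]

    def fit_transform(self, rows):
        # Equivalent to calling fit and then transform on the same data.
        return self.fit(rows).transform(rows)


data = [[1.0, 10.0], [3.0, 30.0]]
centered = MeanCenterer().fit_transform(data)
print(centered)
```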

In [None]:
reducer = umap.UMAP()

In [None]:
embedding = reducer.fit_transform(scaled_penguin_data)
embedding.shape

The result is an array with 334 samples, but only two feature columns
(instead of the four we started with). This is because, by default, UMAP
reduces down to 2D. Each row of the array is a 2-dimensional
representation of the corresponding penguin. Thus we can plot the
``embedding`` as a standard scatterplot and color by the target array
(since it applies to the transformed data which is in the same order as
the original).

In [None]:
plt.scatter(
    embedding[:, 0], 
    embedding[:, 1], 
    c=[sns.color_palette()[x] for x in penguins.species_short.map({"Adelie":0, "Chinstrap":1, "Gentoo":2})])
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the Penguin dataset', fontsize=12)

### Reference image
Here's what we got for the image in the previous cell. According to your eye-ball test, did you get the expected results? Does your image look similar? Exactly the same?
<img src="../reports/figures/umap_penguins_42_reference.png" width=500 height=400 />
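
Whether you get exactly the same picture depends largely on randomness. UMAP is a stochastic algorithm, and its constructor accepts a `random_state` argument to fix the seed. The sketch below shows the general idea with Python's standard `random` module rather than UMAP itself; the `noisy_layout` function is a hypothetical stand-in for any seeded computation.

```python
import random


def noisy_layout(n_points, seed=None):
    """Return n_points pseudo-random 2D coordinates; a seed makes them repeatable."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_points)]


# Same seed: identical output on every run.
same_seed_matches = noisy_layout(5, seed=42) == noisy_layout(5, seed=42)

# No seed: each call draws fresh entropy, so results almost surely differ.
unseeded_matches = noisy_layout(5) == noisy_layout(5)

print(same_seed_matches, unseeded_matches)
```

Even with a fixed seed, results may still differ slightly across platforms or library versions (the `ARCH-DIFFERENCE` issue), so "similar" can be the realistic expectation when comparing against a reference image.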


## It's Test Time

First let's see what happens if we run things from a clean notebook and environment!

### Default Better Principle: Test running clean notebooks
There were a bunch of **MISSING-STATE** and **VARIABLE-SCARCITY** bugs in this Notebook to start with. Did you fix them all? Try testing it now.
```
Kernel -> Restart & Run All
```

### Default Better Principle: Test run from a fresh environment
You could have used `pip install umap-learn` or `conda install -c conda-forge umap-learn` to install UMAP from the command line. This would lead to **MISSING-STATE** in your environment when you test run your work from your own machine. To check that you don't have state hiding in your environment that you didn't add to your `environment.yml`, run the following:

```
conda deactivate
make delete_environment
make create_environment
conda activate easydata-tutorial
make update_environment
```
Then, finally, try the full Notebook run again:
```
Kernel -> Restart & Run All
```
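
After rebuilding the environment, a quick import check can flag missing packages before the full notebook run. This is a minimal sketch using only the standard library. The `missing_modules` helper is hypothetical, and the module list mirrors the third-party imports used earlier in this notebook.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Top-level third-party modules imported at the start of this notebook.
required = ["umap", "matplotlib", "seaborn"]
missing = missing_modules(required)
if missing:
    print("Missing from this environment:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything shows up as missing after a fresh `make create_environment`, it was installed by hand at some point and still needs a line in `environment.yml`.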

**Fun Fact:** Easydata has a built-in `run_notebook` utility function. It's perfect for automating testing of notebook runs.

### Default Better Principle: Have unit tests and continuous integration (CI) tests

Easydata makes it easy for you to run your tests via `make test`. By default, it runs all tests (slow and CI) locally, and in CI it runs only the CI tests (so you can separate your fast, slow, and necessarily local tests). 
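
One common way to separate fast tests from slow ones is to skip the slow tests when an environment variable signals a CI run. The sketch below uses the standard library's `unittest`. The `CI` variable name and the test names are assumptions for illustration, and Easydata's own mechanism may differ.

```python
import os
import unittest

# Assumption: the CI service sets CI=true; many do, but check your provider.
RUNNING_IN_CI = os.environ.get("CI", "").lower() == "true"


class TestAnalysis(unittest.TestCase):
    def test_fast_sanity_check(self):
        # Quick test: cheap enough to run everywhere, including CI.
        self.assertEqual(sum([1, 2, 3]), 6)

    @unittest.skipIf(RUNNING_IN_CI, "slow test: run locally only")
    def test_slow_full_analysis(self):
        # Placeholder for an expensive step, such as a full notebook run.
        self.assertTrue(True)


# Run the tests programmatically so the sketch also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAnalysis)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all passed:", result.wasSuccessful())
```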

Give it a go. `make test` now.

If you passed `make test` continue on to complete this challenge.

## Complete the challenge
Fingers crossed, now that things are tested from scratch, everything actually now works as expected. Let's find out. 


Run `make test_challenge` to check that you've completed the challenge and continue with your reproducibility quest.