# Stage 4: Where do I find the data?

This is data science and data analysis after all. We need data. And when we have data, we need to know what data we have. Here are the big questions around finding and prepping data:
* Where do I find the data?
* How can I use the data?
* How do I know that I have the right data?
* **BONUS** How can I make it easy to retrieve and use again? (Without copying and pasting code)

### Imagine: a hypothetical workflow for downloaded data 
(Assuming you're using reproducible environments already!)

1. **Downloading data**: First, you look for a dataset to use. You find one. You download the data, or you copy and paste a script you found on the internet that downloads it.
2. **Data prep**: Then you copy and paste the munging script that you saw for how to get it into a pandas dataframe.
3. **Data prep-2**: You try some analysis, but it doesn't quite work. You need to clean the data up a bit more, so you go back and do some more prep, this time your own way.
4. **Analysis**: Finally, you can run your analysis and do some data science work. 
5. ...time passes...
6. **Re-downloading**: You mention the result and a colleague asks you about the work, so you go back to dig up the code. You decide to see if you can share your work. You start to run your code, but then realize that the dataset has been updated. It's not clear what's been changed, but you notice in the output that there seem to be more entries in the dataframe. When your script downloaded it, you replaced the previous dataset with the new one. 
7. **Checking licenses**: You try to figure out what happened and start looking up info on the dataset. The source you used for information the first time doesn't seem to be up-to-date, so you keep digging until you find the original source. In the process, you realize you forgot to look at the license. Better check on that before sharing your work. Thankfully, it's CC-BY-NC, and you're not using it for work purposes. You breathe a sigh of relief, as this could have been a show-stopper, and you continue.
8. **Data prep-3**: You can't roll back to the old data, and the new data has some new gotchas, so you spend some time trying to sort that out. Finally you have something that you can re-run your analysis on, and share with your colleague. 
9. **Analysis-2**: Fingers crossed that, with the new larger dataset, things work out roughly the same. There's no easy way now to figure out what's going on if the results look different from before...
10. **Sharing**: Phew. Wiping the sweat from your brow, things look roughly similar. You can pass on the work to your colleague.

In the hypothetical (or not...this may be pulling from personal experience) situation above, getting to the analysis stage the second time around took just as long, if not longer. The extra time went into figuring out what had changed in the dataset and tracking down the dataset information and license, which were no longer as easy to find.

### Some issues
In the previous workflow, here are some of the issues that made the work hard to reproduce:

* **NO-DATA-LICENSE**: No data license with the data. In fact, we forgot to even check the first time. It wasn't a showstopper, this time, but it could have easily been.
* **NO-DATA-METADATA**: No metadata with the data. We had no description that made it easy to see what the data was or, in this case, what might have changed in the new version.
* **NO-DATA-HASH**: No data hash. This meant that there was no way to catch early on that the data was different. 
* **COOKED-DATA**: Data gets cooked when it gets changed. In this case, the change was downloading the new data on top of the old. Most frequently, we see this happen when prepped data gets saved over the raw data. Repeat after me: "Data is immutable".
* **BLACK-MAGIC**: The data prep was a form of black magic. In this example, we copied and pasted the scripts without paying attention to what was inside them; they just worked the first time. In particular, the download script had a hard-coded destination path and re-downloaded the file into the same place every time.
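
The COOKED-DATA and BLACK-MAGIC issues above both come down to a download step that silently overwrites an existing file. Here is a minimal sketch of a guard against that, using only the standard library. The helper `save_raw_once` and the file names are hypothetical, for illustration only, and are not part of Easydata.

```python
import tempfile
from pathlib import Path


def save_raw_once(content: bytes, target: Path) -> Path:
    """Write raw bytes to `target`, refusing to overwrite an existing file.

    Hypothetical helper for illustration; not part of Easydata.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "xb" creates the file exclusively and raises FileExistsError
    # if it already exists, so an earlier download is never clobbered.
    with open(target, "xb") as f:
        f.write(content)
    return target


# Usage: the first save succeeds, the second one is rejected.
raw_dir = Path(tempfile.mkdtemp()) / "raw"
first = save_raw_once(b"a,b\n1,2\n", raw_dir / "example.csv")
try:
    save_raw_once(b"a,b\n1,2\n3,4\n", raw_dir / "example.csv")
    overwritten = True
except FileExistsError:
    overwritten = False
```

Failing loudly here is a design choice: it forces a decision about whether the new download is really a new dataset.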


## The Easydata Way

### Principles

* **Data and metadata always stay together**: Keep your data and your metadata together at all times. This includes licenses and hashes. 
* **Keep a hash of your data and check it**: This is especially true for your raw data, your starting points from which the rest of your data is derived.
* **Data is immutable**: Never edit a raw data file. Especially not manually. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable. 
* **Keep (easy-to-use) data recipes**: The code you write should move the raw data through a pipeline to your final analysis. Keep data recipes as a way of recovering your data from its raw format. Your data recipe ideally has an easy-to-use API that gives people you're sharing with a *common start line* when working with the data.
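
To make the first two principles concrete, here is a rough sketch of keeping a description, license, and hash in a JSON file next to the raw data, and checking the hash later. The sidecar layout and the helper names (`sha256_of`, `write_sidecar`, `matches_sidecar`) are assumptions made for this example; Easydata stores its metadata in its own way.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(data_path: Path, descr: str, license_text: str) -> Path:
    """Store description, license, and hash in a JSON file next to the data."""
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    metadata = {
        "descr": descr,
        "license": license_text,
        "sha256": sha256_of(data_path),
    }
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar


def matches_sidecar(data_path: Path) -> bool:
    """Check the file on disk against the hash recorded in its sidecar."""
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    recorded = json.loads(sidecar.read_text())["sha256"]
    return sha256_of(data_path) == recorded


# Usage with a throwaway file (toy content, not real data)
data_path = Path(tempfile.mkdtemp()) / "example.csv"
data_path.write_bytes(b"a,b\n1,2\n")
write_sidecar(data_path, descr="Toy example data", license_text="CC0")
unchanged = matches_sidecar(data_path)

data_path.write_bytes(b"a,b\n1,2\n3,4\n")  # simulate an upstream update
changed = not matches_sidecar(data_path)
```

With a check like this in place, the silent dataset update from the workflow above would have been caught at load time rather than in the analysis output.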

### How it's done in Easydata

Now we'll take a look at how we do this in Easydata. First, some imports that we'll need.

In [None]:
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline

In [None]:
from src.data import Dataset
import logging
from src.log import logger
from src.helpers import notebook_as_transformer

In [None]:
sns.set(style='white', context='notebook', rc={'figure.figsize':(14,10)})

### Set to debug log level
Let's set the log level to DEBUG on the Easydata `src` module to see what the Dataset code is doing under the hood...

In [None]:
logger.setLevel(logging.DEBUG)

## Dataset objects

In [None]:
ds = Dataset.load("dataset-challenge")

In [None]:
print(ds.DESCR)

In [None]:
ds.data

In [None]:
print(ds.LICENSE)

Penguin data
------------
<img src="https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png" alt="Penguins" width=400px/>

Time to try it out. The next step is to get some data to work with. To ease us into things
we'll start with the [penguin
dataset](https://github.com/allisonhorst/penguins). It isn't very
representative of what real data would look like, but it is small both
in number of points and number of features, and will let us quickly and easily get started.

## Load the pre-created Dataset recipe

Here we use the dataset recipe to get the raw penguins dataset. When running 

```ds = Dataset.load('penguins-raw')```

Easydata uses an entry in the Dataset Catalog to find the `penguins-raw` Dataset recipe.

Here's what it will do:
* Look in the Catalog for the entry `penguins-raw`
* Trace back through the dependency chain of datasets, looking for what's already cached and passes the hash check, and what needs to be created from scratch
* If this is your first time running, it will download the dataset and create it, populating the `data/raw` directory with the raw download
* Do a hash check
* Create the `penguins-raw` Dataset
* If successful, it will save a processed version of the data in `data/processed` for quick and easy loading any time

Say you need more space on your machine. You can blow away the contents of your `data/processed` directory at any time, and `.load("penguins-raw")` will recreate the Dataset again from `data/raw` whenever you need to use it.
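
The load-or-rebuild behaviour described above can be sketched roughly as follows. This is not Easydata's implementation: `build_from_raw`, `load_cached`, and the pickle-based cache are hypothetical stand-ins, and the hash check is left out to keep the example short.

```python
import pickle
import tempfile
from pathlib import Path

stats = {"builds": 0}  # counts how often the expensive step runs


def build_from_raw(raw_path: Path) -> list:
    """Stand-in for the expensive step: parse the raw file into rows."""
    stats["builds"] += 1
    return [line.split(",") for line in raw_path.read_text().splitlines()]


def load_cached(name: str, raw_path: Path, processed_dir: Path) -> list:
    """Return the processed data, rebuilding it from raw only when needed."""
    cache_file = processed_dir / f"{name}.pkl"
    if cache_file.exists():
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    data = build_from_raw(raw_path)
    processed_dir.mkdir(parents=True, exist_ok=True)
    with open(cache_file, "wb") as f:
        pickle.dump(data, f)
    return data


# Usage with a throwaway directory layout
root = Path(tempfile.mkdtemp())
raw_path = root / "raw" / "example.csv"
raw_path.parent.mkdir(parents=True)
raw_path.write_text("a,b\n1,2\n")
processed_dir = root / "processed"

first = load_cached("example", raw_path, processed_dir)   # builds and caches
second = load_cached("example", raw_path, processed_dir)  # reads the cache
(processed_dir / "example.pkl").unlink()                   # blow away the cache
third = load_cached("example", raw_path, processed_dir)   # rebuilds from raw
```

The point of the sketch is that the processed copy is disposable: only the raw file and the recipe need to be kept.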

**BONUS** Beyond the scope of this tutorial, your `data/raw` directory doesn't have to be local and can be kept on a bigger machine or even in your cloud storage (AWS, Azure, etc.). This can be handled by the local paths structure introduced in `02-Environment-Challenge`.


In [None]:
ds = Dataset.load('penguins-raw')

To see what happens with a saved cached copy, load it again.

In [None]:
ds = Dataset.load('penguins-raw')

The `penguins-raw` dataset contains information on where to find a single raw data file: `penguins_size.csv`. This is the raw data. 

In [None]:
ds.EXTRA

In [None]:
print(ds.DESCR)

<img src="https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/culmen_depth.png" alt="Diagram of culmen measurements on a penguin" width=300px/>

In [None]:
print(ds.LICENSE)

Without going into the details, `.EXTRA` lets us manage files off of our local system if necessary and keeps track of raw data file details for our Dataset recipes. To get the fully qualified path to `penguins_size.csv` or any `.EXTRA` file, we can use `.extra_file()`.

`.EXTRA` will always keep track of relative paths so you are never checking in anything except relative paths to your repo, while using your local paths to resolve where to find the files via `.extra_file()`.

In [None]:
ds.extra_file('penguins_size.csv')
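
The idea of checking in only relative paths and resolving them against a per-machine data directory can be illustrated with a small sketch. The dictionary layout, the `resolve` helper, and the directory names below are assumptions for illustration; they do not reflect the actual structure of `.EXTRA` or the implementation of `.extra_file()`.

```python
from pathlib import Path

# What gets checked in: file names mapped to paths relative to a data root.
# (Hypothetical layout, not Easydata's actual `.EXTRA` structure.)
tracked_files = {"penguins_size.csv": "raw/penguins_size.csv"}


def resolve(file_name: str, data_root: Path) -> Path:
    """Turn a tracked relative path into a full path on this machine."""
    relative = Path(tracked_files[file_name])
    if relative.is_absolute():
        raise ValueError(f"{file_name} must be tracked as a relative path")
    return data_root / relative


# Two users with different local setups resolve the same tracked entry.
alice = resolve("penguins_size.csv", Path("/home/alice/project/data"))
bob = resolve("penguins_size.csv", Path("/mnt/shared/data"))
```

Only `tracked_files` would live in the repo; each machine supplies its own `data_root`.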

Let's start looking at the data

In [None]:
penguins = pd.read_csv(ds.extra_file('penguins_size.csv'))
penguins.head()

First up, we will get rid of the NAs in
the data.

In [None]:
penguins = penguins.dropna()
penguins.species_short.value_counts()

Visualizing this data is a little bit
tricky since we can't plot in 4 dimensions easily. Fortunately four is
not that large a number, so we can just do a pairwise feature
scatterplot matrix to get an idea of what is going on. Seaborn makes
this easy.

In [None]:
sns.pairplot(penguins, hue='species_short')


Before we can do any work with the data it will help to clean it up a
little. We won't need NAs, we just want the measurement columns, and
since the measurements are on entirely different scales it will be
helpful to convert each feature into z-scores (number of standard
deviations from the mean) for comparability.


In [None]:
penguin_data = penguins[
    [
        "culmen_length_mm",
        "culmen_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
    ]
].values
scaled_penguin_data = StandardScaler().fit_transform(penguin_data)
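
As a reminder of what the z-score conversion does, here is a standard-library sketch for a single feature. The measurement values are made up for illustration. `StandardScaler` applies the same idea column by column and, by default, divides by the population standard deviation (the equivalent of `ddof=0` in NumPy), which is why `statistics.pstdev` is used here rather than `statistics.stdev`.

```python
import statistics

# Made-up flipper lengths in mm, purely illustrative.
flipper_lengths_mm = [181.0, 186.0, 195.0, 193.0, 190.0]

mean = statistics.fmean(flipper_lengths_mm)
std = statistics.pstdev(flipper_lengths_mm)  # population standard deviation

# z-score: number of standard deviations each value lies from the mean
z_scores = [(x - mean) / std for x in flipper_lengths_mm]

# After scaling, the feature has mean ~0 and standard deviation ~1.
scaled_mean = statistics.fmean(z_scores)
scaled_std = statistics.pstdev(z_scores)
```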

We're ready for our analysis. Since we don't want to have to do this work again repeatedly, this is a nice place to save our work (i.e. let Easydata remember the recipe for us), and have the notebook here as a handy reference still if we want to refer back to it. 

## Save the new dataset using this Notebook as a Transformer

We will now save new data as a derived Dataset, say `penguins-clean`, so that we can access the data from anywhere this repo is installed via `Dataset.load("penguins-clean")`. For our purposes that will be the next notebook! We'll do it in another notebook to separate the grungy prep work from the analysis itself.

In [None]:
new_dataset_name = "penguins-clean"
new_data = scaled_penguin_data
new_metadata = ds.metadata.copy() # start with the same metadata

# add some new metadata description reflecting what we did
added_descr_txt = "\n\nData cleaned up by removing NAs and scaling. See notebook 04."

new_metadata['descr'] += added_descr_txt

new_ds = Dataset(dataset_name=new_dataset_name, data=new_data,
                 metadata=new_metadata)

In [None]:
# Due to various design choices in Jupyter, we need to specify this name manually.
nbname = '04-Data-Challenge.ipynb'
dsdict = notebook_as_transformer(notebook_name=nbname,
                                 input_datasets=[ds],
                                 output_datasets=[new_ds],
                                 overwrite_catalog=True)

Check that we can now load our new dataset from the Dataset Catalog.

In [None]:
new_ds = Dataset.load("penguins-clean")

Let's check that `.data` is the same as expected:

In [None]:
(new_ds.data == scaled_penguin_data).all()

And that the `.DESCR` text has been updated the way we stated.

In [None]:
print(new_ds.DESCR)

Finally, let's check that the license passed on appropriately:

In [None]:
print(new_ds.LICENSE)

In [None]:
#from src.quest import data_test
## Test that we can ds.load the penguins-clean dataset
## Test that it has the expected metadata.

Now when you come back and want to use this dataset, all it takes is running `Dataset.load('penguins-clean')`.

Everything will just work. In our example with our colleague at the beginning, as long as we still had the `data/raw` file lying around, we would have had everything that we needed to reproduce the work as is (especially since we already included the metadata the first time).

Say your colleague ran the recipe but got a different download (as expected). They would have a different hash for the raw data file, so you could compare hashes and go from there. There are two options at that point.
1. **Reproduce the existing work**: As long as you have the right to redistribute the dataset (as in this example with CC-0), you could send the raw data file to your colleague, or host it in another way (and update the Dataset recipe accordingly via manual download or a new URL). That means you can recreate the exact work with no more issues. You can either stay here, or move on to accommodate the new data.
2. **Compare the new and old results**: If you decide to try and work with the new data, create a new recipe for the new data, and go from there. This will also allow you to compare your results with your old work.
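
If you keep both versions of the raw file, a quick first look at what changed might be a row-level comparison. The sketch below uses the standard library and small made-up CSV contents; `rows_of` is a hypothetical helper. Because it compares rows as sets, it ignores row order and duplicate rows, so treat it as a starting point rather than a full diff.

```python
import csv
import io

# Made-up contents standing in for the old and new raw downloads.
old_csv = "id,species\n1,Adelie\n2,Gentoo\n"
new_csv = "id,species\n1,Adelie\n2,Gentoo\n3,Chinstrap\n"


def rows_of(text: str) -> set:
    """Parse CSV text into a set of row tuples, skipping the header."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return {tuple(row) for row in reader}


added = rows_of(new_csv) - rows_of(old_csv)      # rows only in the new file
removed = rows_of(old_csv) - rows_of(new_csv)    # rows only in the old file
```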

References:

* Datasets: https://cookiecutter-easydata.readthedocs.io/en/latest/datasets/
* How to create a dataset from scratch: https://cookiecutter-easydata.readthedocs.io/en/latest/New-Dataset-Template/
* Create a Penguins Dataset: https://github.com/allisonhorst/palmerpenguins
* The reference notebooks for dataset creation
* Dataset DAG post
* Local paths...in the cloud