# Practical Exploratory Data Analysis

## Course 1: Setup, Environment, Backups, Reproducibility

## 1. PSC Startup

This course is intended to be taken on an [interactive node](https://www.psc.edu/bridges/user-guide/running-jobs) at the Pittsburgh Supercomputing Center (PSC).  Before we start up our Jupyter Notebooks, we must first make sure our PSC working area is set up correctly, along with the necessary tools.

### 1.1 Recommended Prior to Installation:
1. `module load git` -> loads a recent version of git, needed for checking out software.
1. Making a ".condarc" file.  I suggest:
```yaml
# Specify where to place environments and packages
envs_dirs:
  - /path/to/conda_envs

pkgs_dirs:
  - /path/to/conda_envs/pkgs
```
Make these directories in areas in which you have a lot of space (not /home/username).  You are only granted 10 GB of space on home, and checking out large conda packages can quickly use it all up.  Make sure `~/.conda/pkgs` is set to read-only mode, as conda may still try to use that as a default package download location if it exists.
1. Symlink your other directories to your home (`ln -s /path/to/other/userspace /home/username`).  You'll be doing most of your work outside of "home", so it makes sense to easily navigate to those areas.

### 1.2 Loading Interactive Node and Anaconda
We need some computing power to handle the upcoming tasks.  We will use an interactive CPU node to provide that power.
```shell
interact -p RM --egress -t 02:00:00 -A XXXXXX --mem=120GB
module load AI/anaconda3-5.1.0_gpu
cd /my/working/dir
```

If this is the first time we're starting up, we'll need to grab the course from GitHub and set up the environment:
```shell
git clone pollackscience:data_course
cd data_course
conda env create -f environment.yml
```

Otherwise, load up the existing environment:
```shell
source activate data_course
```

### 1.3 Jupyter Notebook Startup (for PSC)
PSC requires an additional special script to use Jupyter Notebook.  After we launch the kernel, we need a way of connecting our browser to the notebook.  This is accomplished with the `startupjupyter` script included in this course.

# 2. Minimizing Painful Surprises

Unexpected and unintentional changes can derail data science projects.  As you saw when beginning this course, dozens of packages were downloaded and installed, many of which will never be directly used, but are required dependencies for the top-level packages.  Extremely complex and multifaceted software is necessary for machine learning research, and that software is not static.  Functions become deprecated and eventually removed, default parameters change, expected behavior is updated, and (of course) bugs are fixed.  Any of these changes can impact your active project.  The impact may be minor or negligible, but it could leave you wondering whether you introduced a bug, or the world simply shifted around you.  The following tips are intended to preserve the integrity of your projects and your peace of mind.

## 2.1 Virtual Environments
Virtual environments are an excellent way to separate your projects and prevent them from interfering with one another, or with your system in general.  A virtual environment isolates projects from one another on the same infrastructure.  Each project has its own software, packages, paths, variables, etc., so changes in one project cannot affect another.  Therefore, the following rule should always be followed:
- **Every project must live in its own virtual environment**

Python projects typically use one of two virtual environment managers: [virtualenv](https://virtualenv.pypa.io/en/latest/) and [conda envs](https://conda.io/docs/user-guide/tasks/manage-environments.html#).  As we are already relying on Anaconda to manage our Python installations and dependencies on PSC, we will focus primarily on the latter.  Using `virtualenv` and `conda` together can lead to conflicts, so it's best to choose one or the other and stick with it.  PSC provides a helpful page on the details of the Bridges conda installation, with recommendations for creating virtual environments with the pre-configured ML software: https://www.psc.edu/user-resources/software/anaconda
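A quick way to confirm that a notebook or script is really running inside the intended environment is to ask the interpreter itself.  The following is a minimal sketch and not part of the course code: the function name `describe_environment` is hypothetical, and `CONDA_DEFAULT_ENV` is a variable that conda sets on activation, so it may be absent in other setups.

```python
import os
import sys


def describe_environment():
    """Return basic facts about the interpreter that is currently running."""
    return {
        # Path to the python executable; it should live inside your env directory.
        "executable": sys.executable,
        # Root directory of the active installation.
        "prefix": sys.prefix,
        # Name of the active conda env, or None if conda did not set it.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `prefix` points at a system-wide or base installation rather than your project's environment directory, the environment was probably not activated.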

```shell
# Exporting your current environment to share with collaborators:
conda env export -p /hard/path/to/conda_envs/data_course > environment.yml
```

```shell
# Download a new package that has a large ripple effect on already-installed packages:
conda install pytorch
# wait for install to complete....
# ....
# Install has changed a bunch of packages and installed a bunch of dependencies.
# List the revisions to tell us how to roll back:
conda list --revisions
# From the list of revisions, pick the next-to-most-recent one (XX) to undo what we just did:
conda install --revision XX
# Wait for the install.  We've reverted our working environment and hopefully averted disaster!
```
```
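Alongside conda's own tooling, it can be useful to record the versions of key packages from inside Python, for example at the top of a notebook.  The sketch below is one possible approach, assuming Python 3.8 or newer (for `importlib.metadata`); the helper name `snapshot_versions` and the package list are hypothetical.

```python
from importlib import metadata


def snapshot_versions(package_names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


# Hypothetical list; substitute the top-level packages your project imports.
snapshot = snapshot_versions(["numpy", "pandas", "not-a-real-package-xyz"])
for name, version in snapshot.items():
    print(f"{name}: {version or 'not installed'}")
```

Saving this output next to your results gives a lightweight record of what was installed when they were produced.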


## 2.2 Version Control and GitHub
Whether working on solo projects or collaborative efforts, software versioning is a must.  As projects grow and increase in complexity, so does the chance that bugs and other misfortunes will strike.  Version control allows a developer to manage and track the changes in their software, which can help with:
- Recovering accidentally deleted work.
- Undoing bugs of unknown origin.
- Separating large-scale development into smaller tasks.
- Preventing multiple developers from writing conflicting software.
- Improving ease of code sharing and collaboration.
- Impressing future employers with your portfolio.

"Git" is currently the most popular and widely used version control software, and is typically used in conjunction with the [GitHub](https://github.com) hosting service.  If you do not have a GitHub account, I highly suggest creating one now.  If you are unfamiliar with git and GitHub, there are hundreds of useful tutorials on the web.  The three most commonly used commands when developing are:
```shell
git add <files>
git commit -m "my commit message"
git push
```

These commands add changed files to your current git staging area, store that info in a new commit (along with a commit message), and then push that commit to your remote repository (typically on your GitHub account).
- **add, commit, and push your work frequently**

There is very little benefit from versioning if you don't actively version.  Think of it as saving your work, so you don't lose progress.  There is no downside to committing frequently.  However, be careful not to commit sensitive material or excessively large files (including notebooks with large embedded images or plots) to your repo.
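One way to guard against committing oversized files is to scan the working tree before running `git add`.  The following is a rough sketch under stated assumptions: the helper `find_large_files` and the 5 MB threshold are hypothetical choices, not a git feature or part of the course code.

```python
import os


def find_large_files(root, max_bytes):
    """Return (path, size) pairs for files under root larger than max_bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's internal storage; it is not part of the working tree.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Broken symlinks or unreadable files are ignored.
                continue
            if size > max_bytes:
                large.append((path, size))
    # Largest files first.
    return sorted(large, key=lambda item: item[1], reverse=True)


# Hypothetical threshold of 5 MB; adjust to suit your repository.
for path, size in find_large_files(".", max_bytes=5 * 1024 * 1024):
    print(f"{size / 1e6:8.1f} MB  {path}")
```

Anything this reports is a candidate for a `.gitignore` entry, or for clearing notebook outputs before committing.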

## 2.3 Unit Testing

Validation of results is integral to any data science project, and this concept translates to software development as well.  How do you know if an innocuous change to the code base unintentionally affected the results?  You would have to run a segment of your code, and then compare the output against a known quantity.  This process can be automated, and falls under the category of "unit testing".  The package [pytest](https://docs.pytest.org/en/latest/) is a popular and easy-to-use choice for performing unit tests and, once set up, can be invoked with a single command.
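To make the idea concrete, here is a minimal sketch of what a test file can look like.  It is a hypothetical example, not the contents of the course's own test file: the function `normalize` and the test names are invented for illustration, and plain `assert` statements are used so the file has no dependencies beyond the standard library.

```python
import math


def normalize(values):
    """Scale a list of numbers so that they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


# pytest collects functions whose names start with "test_".
def test_normalize_sums_to_one():
    result = normalize([1, 2, 3, 4])
    assert math.isclose(sum(result), 1.0)


def test_normalize_preserves_ratios():
    result = normalize([2, 6])
    assert math.isclose(result[0], 0.25)
    assert math.isclose(result[1], 0.75)


def test_normalize_rejects_zero_total():
    try:
        normalize([0, 0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for a zero total")
```

Saved under a name such as `test_sketch.py` (hypothetical), running `pytest` would collect all three tests, while `pytest -k ratios` would select only the test whose name contains "ratios".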

Investigate the `tests/test_example.py` file, and run these commands:
```shell
pytest -k ex0
pytest -k ex1
pytest -k ex2
pytest -k ex3
pytest
```


## 2.4 Debugging with **pdb**

In [1]:
import math
from IPython.core.debugger import set_trace

def complicated_function(x, y, z):
    set_trace()
    x = y*z
    z = y-x**2
    y = y*2-y/z
    x,y,z = y%z,z//x, x+y+z
    
    return (x,y,z)

complicated_function(200, 11.1, math.pi)

> <ipython-input-1-ec44d5f369dd>(6)complicated_function()
      4 def complicated_function(x, y, z):
      5     set_trace()
----> 6     x = y*z
      7     z = y-x**2
      8     y = y*2-y/z

ipdb> print(x)
200
ipdb> n
> <ipython-input-1-ec44d5f369dd>(7)complicated_function()
      5     set_trace()
      6 [...]

(-1182.724746135079, -35.0, -1147.8530676802322)
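The session above is interactive, so it helps to know the handful of commands used most often.  The sketch below is an illustration only: it swaps IPython's `set_trace` for the standard library `pdb` module (which accepts the same basic commands) and feeds the commands from a string instead of the keyboard, so it can run unattended.  The function `scale_and_shift` is hypothetical.

```python
import io
import pdb


def scale_and_shift(x, y):
    z = x * y
    return z + 1


# Commands a user would normally type at the prompt, one per line:
#   p x -> print the value of x
#   n   -> run the current line and stop at the next one
#   p z -> print the value of z (assigned by the line just executed)
#   c   -> continue running until the function finishes
commands = io.StringIO("p x\nn\np z\nc\n")
transcript = io.StringIO()

# readrc=False ignores any personal .pdbrc; nosigint=True leaves signal handlers alone.
debugger = pdb.Pdb(stdin=commands, stdout=transcript, nosigint=True, readrc=False)
result = debugger.runcall(scale_and_shift, 2, 3)

print(transcript.getvalue())
print("returned:", result)
```

In everyday use you would type these commands yourself at the `ipdb>` prompt; other useful ones include `s` (step into a call), `l` (list source around the current line), and `q` (quit).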

# 3. General Good Habits

## 3.1 Linting

"[Linting](https://en.wikipedia.org/wiki/Lint_%28software%29)" is the act of running software that analyzes your code and checks for syntax errors, style violations, and other likely mistakes.  It's a very useful tool that encourages you to write code that not only runs, but looks professional.  This makes your code easier to read, debug, and understand for both yourself and your collaborators.  IDEs like [PyCharm](https://www.jetbrains.com/pycharm/) typically have [linting built in](https://www.jetbrains.com/help/pycharm/code-inspection.html), but it can also be run externally by using packages such as [Flake8](http://flake8.pycqa.org/en/latest/).  Remember, just because code runs does not mean that the output is correct!  Well-formatted and consistent code will help you determine the source of an error more quickly than sifting through messy garbage code.  Your future self will thank you, your collaborators will thank you, and humanity as a whole can hold its head a little higher.

Python style standards are based on the [PEP 8 (Python Enhancement Proposal 8)](https://www.python.org/dev/peps/pep-0008/) style guide.  The PEPs, in general, are where all official Python improvements and features are documented.  PEP 8 details the "shoulds" and "should nots" for Python programming and stylistic changes.  Here's an example of two pieces of code that produce the same thing, one that follows the PEP 8 standard and one that does not:

In [2]:
# PEP 8 ignored.  Code is offensive to the eye and shameful.

# Mult some nmbrs against a
import math,datetime
from numpy import *
random.seed(1)
a = random.normal(size=100)
def RanNumMaFunc(a1,a2):
    __a1    = math.ceil(a1)
    aa1=random.normal(size=(len(a),__a1));aa2=random.normal(size=(__a1,a2))
    _outMat1 = matmul(a,aa1); _outMat2 = matmul(_outMat1, aa2)
    d=datetime.datetime.now()
    dd = "Today's date is %s" % d
    print(dd);return _outMat2
print(RanNumMaFunc(2.1,5))

Today's date is 2019-02-22 14:11:22.622978
[ 6.95407571 -7.51067268 -6.15986129 -5.0952413   7.99606841]


In [3]:
# PEP 8 is followed.  Code is beautiful, angels sing.
import datetime
import math

import numpy as np

np.random.seed(1)
_global_data = np.random.normal(size=100)


def random_num_mult(shape1, shape2):
    """Multiply the global data by two random matrices of shape
    (len(_global_data), shape1) and (shape1, shape2).

    Also prints the current time.
    """

    shape1 = math.ceil(shape1)
    mat1 = np.random.normal(size=(len(_global_data), shape1))
    mat2 = np.random.normal(size=(shape1, shape2))

    # Chained matrix mult
    output = np.matmul(np.matmul(_global_data, mat1),
                       mat2)

    current_time = datetime.datetime.now()
    print(f"Today's date is {current_time}")

    return output


print(random_num_mult(2.1, 5))

Today's date is 2019-02-22 14:11:24.226727
[ 6.95407571 -7.51067268 -6.15986129 -5.0952413   7.99606841]


While it's great to internalize proper coding style, tools can help you build the habit.  Try out flake8 on the bad version of the code:

```shell
$ flake8 data_modules/pep8_example_bad.py

data_modules/pep8_example_bad.py:3:12: E401 multiple imports on one line
data_modules/pep8_example_bad.py:3:12: E231 missing whitespace after ','
data_modules/pep8_example_bad.py:5:1: F405 'random' may be undefined, or defined from star imports: numpy
data_modules/pep8_example_bad.py:6:5: F405 'random' may be undefined, or defined from star imports: numpy
data_modules/pep8_example_bad.py:7:1: E302 expected 2 blank lines, found 0
data_modules/pep8_example_bad.py:7:20: E231 missing whitespace after ','
data_modules/pep8_example_bad.py:9:8: E225 missing whitespace around operator
data_modules/pep8_example_bad.py:9:9: F405 'random' may be undefined, or defined from star imports: numpy
data_modules/pep8_example_bad.py:9:35: E231 missing whitespace after ','
data_modules/pep8_example_bad.py:9:42: E702 multiple statements on one line (semicolon)
data_modules/pep8_example_bad.py:9:42: E231 missing whitespace after ';'
data_modules/pep8_example_bad.py:9:46: E225 missing whitespace around operator
data_modules/pep8_example_bad.py:9:47: F405 'random' may be undefined, or defined from star imports: numpy
data_modules/pep8_example_bad.py:9:71: E231 missing whitespace after ','
data_modules/pep8_example_bad.py:10:16: F405 'matmul' may be undefined, or defined from star imports: numpy
data_modules/pep8_example_bad.py:10:24: E231 missing whitespace after ','
data_modules/pep8_example_bad.py:10:29: E702 multiple statements on one line (semicolon)
data_modules/pep8_example_bad.py:10:42: F405 'matmul' may be undefined, or defined from star imports: numpy
data_modules/pep8_example_bad.py:11:6: E225 missing whitespace around operator
data_modules/pep8_example_bad.py:15:1: W391 blank line at end of file
```
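The F405 warnings above all trace back to the star import, which hides where names like `random` and `matmul` come from.  To give a feel for how a linter arrives at such findings, here is a toy check built on the standard library `ast` module.  It is a simplified sketch of the general idea, not how Flake8 itself is implemented, and the function name `find_star_imports` is hypothetical.

```python
import ast


def find_star_imports(source):
    """Return the line numbers of every `from module import *` statement."""
    # Parse the code into a syntax tree without executing or importing anything.
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            if any(alias.name == "*" for alias in node.names):
                lines.append(node.lineno)
    return sorted(lines)


sample = "import math\nfrom numpy import *\nx = random.normal(size=3)\n"
print(find_star_imports(sample))  # prints [2]
```

Real linters bundle many checks of this kind, which is why a single command can report so many distinct issues.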

## 3.2 Project Structure
`__init__.py` structure example, maybe autoreload as well
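As a placeholder illustration, the sketch below builds a throwaway package on disk to show the role of `__init__.py`, then edits a module and reloads it by hand.  Everything here is hypothetical (the package name `demo_pkg`, the module `stats`, and its functions).  In a notebook, IPython's `%load_ext autoreload` followed by `%autoreload 2` performs the reload step automatically; the plain-Python equivalent shown here is `importlib.reload`.

```python
import importlib
import pathlib
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Layout:  demo_pkg/__init__.py  and  demo_pkg/stats.py
    package_dir = pathlib.Path(tmp) / "demo_pkg"
    package_dir.mkdir()

    # __init__.py marks the directory as a package and exposes its submodule.
    (package_dir / "__init__.py").write_text("from . import stats\n")
    (package_dir / "stats.py").write_text(
        "def mean(values):\n    return sum(values) / len(values)\n"
    )

    # Make the temporary project importable, then import the package.
    sys.path.insert(0, tmp)
    import demo_pkg

    first_result = demo_pkg.stats.mean([1, 2, 3])

    # Simulate editing the module in an editor: add a second function.
    (package_dir / "stats.py").write_text(
        "def mean(values):\n    return sum(values) / len(values)\n\n\n"
        "def spread(values):\n    return max(values) - min(values)\n"
    )

    # Without a reload, the running session would still see the old module.
    importlib.invalidate_caches()
    importlib.reload(demo_pkg.stats)
    second_result = demo_pkg.stats.spread([1, 2, 3])

    sys.path.remove(tmp)

print(first_result, second_result)
```

Keeping reusable code in a package like this, rather than in notebook cells, also makes it straightforward to cover with tests.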

## 3.3 Docstrings
Basic docstring usage, notebook integration
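As a starting point, here is a small sketch of a documented function and two ways of reading its docstring.  The function `rescale` is hypothetical, and the NumPy-style section layout is just one common convention, not a requirement of this course.

```python
import inspect


def rescale(values, lower=0.0, upper=1.0):
    """Linearly rescale numbers to the interval [lower, upper].

    Parameters
    ----------
    values : list of float
        Input numbers; must contain at least two distinct values.
    lower, upper : float
        Bounds of the output interval.

    Returns
    -------
    list of float
        The rescaled numbers, in the original order.
    """
    low, high = min(values), max(values)
    scale = (upper - lower) / (high - low)
    return [lower + (v - low) * scale for v in values]


# The docstring travels with the function object.
summary = inspect.getdoc(rescale).splitlines()[0]
print(summary)

# help() prints the full docstring along with the signature.
help(rescale)
```

In a Jupyter notebook, typing `rescale?` in a cell displays the same text, so a good docstring doubles as inline documentation while you explore.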