# Practical Exploratory Data Analysis

## Course 1: Setup, Environment, Backups, Reproducibility

This series of courses is a guide to exploratory data analysis (EDA), built around practical, interactive examples.  In particular, the following skills, techniques, and concepts will be emphasized:

1. Introductory Tasks.
    1. Setting up an appropriate workspace.
    1. The benefits of virtual environments.
    1. Versioning, redundancy, and reproducibility.
    1. Style and organization.
1. Basic Analysis Tools.
    1. I/O and manipulation.
    1. Visualization.

## PSC Startup

This course is intended to be taken on an [interactive node](https://www.psc.edu/bridges/user-guide/running-jobs) at the Pittsburgh Supercomputing Center (PSC).  Before we start up our Jupyter Notebooks, we must first make sure that our PSC working area and the necessary tools are set up correctly.

### Recommended Prior to Installation:
1. `module load git` -> loads a recent version of git, needed for checking out software.
1. Make a `.condarc` file.  I suggest:
    ```yaml
    # Specify where to place environments and packages
    envs_dirs:
      - /path/to/conda_envs

    pkgs_dirs:
      - /path/to/conda_envs/pkgs
    ```
    Make these directories somewhere you have plenty of space (not `/home/username`).  You are only granted 10 GB of space in your home directory, and downloading large conda packages can quickly use it all up.  If `~/.conda/pkgs` exists, make sure it is set to read-only mode, as conda may otherwise still try to use it as a default package download location.
1. Symlink your other directories into your home (`ln -s /path/to/other/userspace /home/username`).  You'll be doing most of your work outside of "home", so it makes sense to be able to navigate to those areas easily.
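
The directory and `.condarc` steps above can be sketched as two small shell functions.  This is a minimal sketch rather than an official PSC procedure: the function names `setup_conda_dirs` and `lock_default_pkgs` are hypothetical, and every path in the usage comment is a placeholder to replace with your own.

```shell
# Hypothetical helper: create the conda environment and package
# directories under ROOT, then write a .condarc that points conda at them.
setup_conda_dirs() {
    root="$1"      # directory on a filesystem with plenty of space
    condarc="$2"   # usually "$HOME/.condarc"

    mkdir -p "$root/pkgs" || return 1

    # Refuse to overwrite an existing configuration file.
    if [ -e "$condarc" ]; then
        echo "refusing to overwrite existing $condarc" >&2
        return 1
    fi

    cat > "$condarc" <<EOF
# Specify where to place environments and packages
envs_dirs:
  - $root

pkgs_dirs:
  - $root/pkgs
EOF
}

# Hypothetical helper: make the default package cache read-only,
# but only if that directory already exists.
lock_default_pkgs() {
    default_pkgs="$1"   # usually "$HOME/.conda/pkgs"
    if [ -d "$default_pkgs" ]; then
        chmod -R a-w "$default_pkgs"
    fi
}

# Example usage (placeholder paths):
#   setup_conda_dirs /path/to/conda_envs "$HOME/.condarc"
#   lock_default_pkgs "$HOME/.conda/pkgs"
```

Wrapping the steps in functions keeps them from running until you call them with paths you have checked, and the overwrite guard protects any `.condarc` you already have.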

### Loading Interactive Node and Anaconda
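We need some computing power to handle the upcoming tasks.  We will use an interactive CPU node to provide that power.
```shell
# Request a 2-hour interactive session on the RM partition with 120 GB of memory
# (replace XXXXXX with your own allocation ID)
interact -p RM --egress -t 02:00:00 -A XXXXXX --mem=120GB
# Load the Anaconda module
module load AI/anaconda3-5.1.0_gpu
# Move to your working directory
cd /my/working/dir
```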

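If this is the first time we're starting up, we'll need to grab the course from GitHub and set up the environment.  As written, the clone address is scp-style (`host:path`), so it only resolves if `pollackscience` is defined as a host alias in your SSH config; otherwise, substitute the repository's full GitHub address.
```shell
# Download the course materials
git clone pollackscience:data_course
cd data_course
# Build the conda environment described in environment.yml
conda env create -f environment.yml
```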

Otherwise (and once the environment has been created for the first time), activate the existing environment:
```shell
source activate data_course
```

# Minimizing Painful Surprises

Unexpected and unintentional changes can derail data science projects.  As you saw when beginning this course, dozens of packages were downloaded and installed, many of which will never be directly used, but are required dependencies for the top-level packages.  Extremely complex and multifaceted software is necessary for machine learning research, and that software is not static.  Functions become deprecated and eventually removed, default parameters change, expected behavior is updated, and (of course) bugs are fixed.  Any of these changes can impact your active project.  The impact may be minor or negligible, but it could leave you wondering whether you introduced a bug, or the world simply shifted around you.  The following tips are intended to preserve the integrity of your projects and your peace of mind.
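
One way to tell "I introduced a bug" apart from "the world shifted around me" is to keep dated snapshots of your environment and compare them.  The sketch below assumes the snapshots are plain-text files, for example the output of `conda env export`; the function name `compare_env_snapshots` and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: report whether two environment snapshots differ.
# Returns 0 if identical, 1 if they differ, 2 if a file cannot be read.
compare_env_snapshots() {
    old="$1"
    new="$2"

    if [ ! -r "$old" ] || [ ! -r "$new" ]; then
        echo "cannot read one of the snapshots: $old $new" >&2
        return 2
    fi

    # diff prints nothing and succeeds when the files are identical.
    # Otherwise, lines marked '<' appear only in the old snapshot
    # and lines marked '>' appear only in the new one.
    if diff "$old" "$new"; then
        echo "environment unchanged"
    else
        echo "environment changed (differences listed above)"
        return 1
    fi
}

# Example usage (hypothetical file names):
#   conda env export > env_before.yml
#   ...update or install packages...
#   conda env export > env_after.yml
#   compare_env_snapshots env_before.yml env_after.yml
```

If a result changes unexpectedly, a non-empty diff points at the packages whose versions moved, while an empty diff suggests looking at your own code first.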

## Virtual Environments
Virtual environments are an excellent way to separate your projects and prevent them from interfering with one another, or with your system in general.  A virtual environment isolates projects that share the same infrastructure: each project has its own software, packages, paths, variables, etc., so changes in one project cannot affect another.  Therefore, the following rule should always be followed:
- **Every project must live in its own virtual environment**

Python projects typically use one of two virtual environment managers: [virtualenv](https://virtualenv.pypa.io/en/latest/) and [conda envs](https://conda.io/docs/user-guide/tasks/manage-environments.html#).  As we are already relying on Anaconda to manage our Python installations and dependencies on PSC, we will focus primarily on the latter.  Using `virtualenv` and `conda` together can lead to conflicts, so it's best to choose one or the other and stick with it.  PSC provides a helpful page on the details of the conda installation on Bridges, with recommendations for creating virtual environments with the pre-configured ML software: https://www.psc.edu/user-resources/software/anaconda
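
A small guard can help enforce the one-environment-per-project rule by refusing to run a script in the wrong environment.  The sketch below relies on `CONDA_DEFAULT_ENV`, the variable conda sets when an environment is activated; the function name `require_conda_env` is hypothetical.  It assumes the environment was activated by name (as with `source activate data_course`), since an environment activated by path may be reported by its path instead.

```shell
# Hypothetical helper: succeed only if the expected conda environment
# is the active one, according to CONDA_DEFAULT_ENV.
require_conda_env() {
    expected="$1"
    if [ "${CONDA_DEFAULT_ENV:-}" != "$expected" ]; then
        echo "expected conda env '$expected', found '${CONDA_DEFAULT_ENV:-none}'" >&2
        return 1
    fi
}

# Example usage at the top of a project script:
#   require_conda_env data_course || exit 1
```

Placing the check at the top of a script makes a wrong or missing environment fail immediately with a clear message, rather than partway through an analysis with a confusing import error.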

## Version Control and GitHub
Whether working on solo projects or collaborative efforts, software versioning is a must.  As projects grow and increase in complexity, so does the chance that bugs and other misfortunes will strike.  Version control allows a developer to manage and track the changes in their software, which can help with:
- Recovering accidentally deleted work.
- Undoing bugs of unknown origin.
- Separating large-scale development into smaller tasks.
- Preventing multiple developers from writing conflicting software.
- Improving ease of code sharing and collaboration.
- Impressing future employers with your portfolio.
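
As a minimal sketch of the first benefit in the list above, the function below commits a file, deletes it "by accident", and restores it from the last commit.  The function name `demo_recover_deleted_file`, the file `analysis.py`, and the demo identity are all hypothetical; it is meant to be run in an empty scratch directory, not in a real project.

```shell
# Hypothetical demo: commit a file, delete it, then restore it with git.
demo_recover_deleted_file() {
    repo="$1"   # an empty scratch directory
    (
        # The subshell keeps the caller's working directory unchanged.
        cd "$repo" || exit 1
        git init -q . || exit 1

        echo "import pandas as pd" > analysis.py
        git add analysis.py

        # -c sets the identity for this one command only, so the demo
        # does not depend on (or modify) your global git configuration.
        git -c user.name="Demo User" -c user.email="demo@example.com" \
            commit -q -m "Add analysis script" || exit 1

        rm analysis.py                 # the "accident"
        git checkout -- analysis.py    # restore from the last commit
    )
}

# Example usage:
#   scratch="$(mktemp -d)"
#   demo_recover_deleted_file "$scratch" && ls "$scratch"
```

Only committed work can be recovered this way, which is one reason to commit small changes often.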

GitHub, which is built on the Git version control system, is currently the most popular and widely used version control platform 