# Data@NNL

Presented by: Brian McClune and Austin Stegall

# Outline

1. Introduction
2. Obtaining Data at NNL
3. Scrubbing Data at NNL

# Introduction

1. The Data in Data Science
2. One Thing Before Data
3. The Data Science Process

1. The Data in Data Science

In this section, we cover some basics up front:

* What is data science?
    * It's using data to answer questions
    * It can involve:
        * Statistics, computer science, mathematics
        * Data cleaning and formatting
        * Data visualization
      
[The Data Science Venn Diagram](https://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)

* What is data?
    * "Data are individual facts, statistics, or items of information, often numeric.
      In a more technical sense, data are a set of values of qualitative or
      quantitative variables about one or more persons or objects..."
    * Key points:
        * A *set* of values
        * Composed of *variables*
        * Describing *qualitative* or *quantitative* characteristics or measures of
          a person or object

[Data](https://en.wikipedia.org/wiki/Data)

* What is big data?
    * "Data that exceeds the processing capacity of conventional database systems."
    * The characteristics of big data:
        * Volume
        * Velocity
        * Variety
  
[Volume, Velocity, Variety: What You Need to Know About Big Data](https://www.forbes.com/sites/oreillymedia/2012/01/19/volume-velocity-variety-what-you-need-to-know-about-big-data/?sh=67a336ef1b6d)

* What does data look like?
    * Sequencing data (e.g. for DNA)
    * Census data
    * Medical records data
    * Geographical information system (GIS) data
    * Language data (e.g. free form text and documents)
    * Website data (e.g. traffic history)
    * Audio data 
    * Video data
    * Sensor data
    * Network data
    * ...

2. One Thing Before Data

In this section, we stress that the data is important, but it is secondary.
The question the audience wants to answer is primary. Depending upon the
*type* of question, the data requirements can differ, and so can the *type*
of analysis.

This is also the first dimension along which we orient our audience: we
aren't covering modeling here, but the analyses that precede modeling
(which begins with number 3 below) are relevant.

Types of analysis (in rough order of difficulty):

1. Descriptive
    * Goal is to describe or summarize a set of data
    * Common descriptive statistics:
        * Measures of central tendency (e.g., mean, median)
        * Measures of variability (e.g., range, standard deviation)
2. Exploratory
    * Goal is to explore data and find relationships
    * Often used as a step to form or reinforce hypotheses before further analysis
3. Inferential
    * Goal is to understand the impact of input variables on some outcome
4. Predictive
    * Goal is to understand how input variables can be used to predict some outcome
5. Causal
    * Goal is to understand what happens to one variable when we manipulate another
    * Hard to do with observed data alone; typically done as part of designed studies
6. Mechanistic
    * "[The] goal of mechanistic analysis is to understand the exact changes in
      variables that lead to exact changes in other variables. These analyses are
      exceedingly hard to use to infer much, except in simple situations or in those
      that are nicely modeled by deterministic equations."
      (Coursera: The Data Scientist's Toolbox, Johns Hopkins)

[Good discussions distinguishing inference and prediction](https://stats.stackexchange.com/questions/244017/what-is-the-difference-between-prediction-and-inference)
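
To make the descriptive analysis in item 1 above concrete, here is a minimal sketch using only Python's standard `statistics` module. The `readings` values and the `describe` helper are made up for illustration; they are not taken from any laboratory data set.

```python
import statistics

# Hypothetical sample: made-up readings, for illustration only.
readings = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]


def describe(values):
    """Return common descriptive statistics for a list of numbers."""
    return {
        # Measures of central tendency
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Measures of variability
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


summary = describe(readings)
print(summary)
```

The same four numbers are often the first thing worth computing on a new data set, before any exploratory plots or modeling.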

3. The Data Science Process

In this section, we present to the audience the *data science process*.

We also use this as a second dimension for orienting our audience: we
aren't covering modeling and interpretation here, but we do need to
familiarize them with:

* How to obtain data depending upon its format and location
* Not how to scrub, but what characteristics of data they might need
  to consider when going to scrub for a specific problem
* Not how to explore, but some basic exploration (either descriptive
  or visual) that they might consider for a specific problem

* The Data Science Process
    * An acronym introduced by Hilary Mason in 2010: OSEMN (rhymes with "possum")
        * **O**btain
        * **S**crub
        * **E**xplore
        * **M**odel
        * i**N**terpret

[A Taxonomy of Data Science](http://www.dataists.com/2010/09/a-taxonomy-of-data-science/)
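
As a rough sketch of how the first three OSEMN steps chain together, the example below runs obtain, scrub, and explore on a tiny inline data set. The function names, the `RAW` text, and the choice to drop rows with missing values are all assumptions made for illustration; a real problem may call for very different scrubbing rules.

```python
import csv
import io
import statistics

# Hypothetical raw data with one missing value, inlined for illustration.
RAW = "sensor,value\na,1.0\nb,\nc,3.0\n"


def obtain(text):
    """Obtain: read raw rows from CSV text (all values arrive as strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows):
    """Scrub: drop rows with a missing value and convert the rest to float."""
    return [float(row["value"]) for row in rows if row["value"] != ""]


def explore(values):
    """Explore: compute a simple descriptive summary."""
    return {"count": len(values), "mean": statistics.mean(values)}


result = explore(scrub(obtain(RAW)))
print(result)
```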

# Obtaining Data at NNL

1. Data sources
2. Data formats
3. Techniques

1. Data sources

In this section, we provide an overview for the audience: where
might they find data? The intent is twofold:

* Help audience members understand how many places they might find
  data at the laboratory
* Motivate the next topic of discussion: data formats

2. Data formats

In this section, we provide an overview of some of the data formats
audience members might encounter as data scientists at the laboratory.

Some examples:

* Excel
* CSV
* JSON (web API)
* YAML
* XML
* HTML (web scraping)
* HDF5
* DB (Oracle, SQL Server, Hadoop)
* Serialized binary formats
    * Pickle (`.pkl`)
    * NumPy binary format (`.npy`)
    * Protobuf (schema in `.proto`; serialized messages often `.pb`)
    * TensorFlow TFRecord (`.tfrec`)
    * R RDS (`.rds`)
    * ...
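
To give a feel for how a few of these formats differ in practice, the sketch below reads the same two hypothetical records as CSV and as JSON, then round-trips them through pickle. It uses only the Python standard library, and the data is inlined as text (an assumption made so the example runs without any files).

```python
import csv
import io
import json
import pickle

# Hypothetical records, inlined as text so the sketch runs without any files.
CSV_TEXT = "id,temperature\n1,20.5\n2,21.0\n"
JSON_TEXT = '[{"id": 1, "temperature": 20.5}, {"id": 2, "temperature": 21.0}]'


def read_csv(text):
    """Parse CSV text into a list of dicts; every value arrives as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json(text):
    """Parse JSON text; numbers keep their numeric types."""
    return json.loads(text)


csv_rows = read_csv(CSV_TEXT)
json_rows = read_json(JSON_TEXT)

# Pickle round trip: serialize to bytes, then restore an equal Python object.
# Only unpickle data from sources you trust.
restored = pickle.loads(pickle.dumps(json_rows))

print(csv_rows[0])
print(json_rows[0])
print(restored == json_rows)
```

Note the difference in types: the CSV reader hands back `"20.5"` as a string, while the JSON parser returns the number `20.5`. That difference is one of the characteristics to keep in mind when scrubbing.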

3. Techniques

In this section, we take a brief look at code examples for
obtaining data (in its raw format, prior to any cleaning) from the
variety of sources and formats introduced. The intentions here
are to:

* Illustrate what different approaches to obtaining data look like,
  depending upon its source and format
* Demystify for the audience the process of doing so
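
As one example of what obtaining data from a database can look like, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite with an in-memory database is a stand-in chosen so the example is self-contained; connecting to Oracle or SQL Server would need a different driver and connection details. The `measurements` table, its rows, and the `fetch_rows` helper are hypothetical.

```python
import sqlite3


def fetch_rows(conn, min_value):
    """Run a parameterized query and return the rows as a list of tuples."""
    cursor = conn.execute(
        "SELECT sensor, value FROM measurements WHERE value >= ? ORDER BY value",
        (min_value,),
    )
    return cursor.fetchall()


# In-memory database populated with made-up rows so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.5), ("b", 3.0), ("c", 4.5)],
)

rows = fetch_rows(conn, 2.0)
print(rows)
conn.close()
```

Passing `min_value` as a query parameter (the `?` placeholder) rather than formatting it into the SQL string is the usual way to avoid quoting bugs and SQL injection.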

# Scrubbing Data at NNL