# Foundations of Data Science 

### Building best practices toward reproducible research 

# Plan for Today

- **What is Data Science?**
- **Reproducibility** &rarr; what this means and how to build code that you and others can replicate
- **Working from the Commandline** &rarr; why and how
- **Version Control** &rarr; why and how

# What does data science look like?

## Most think this

![image.gif](figures/lecture01_gapminder-animation.gif)

## Or this
<img src="figures/lecture01_ethnic-exposure-boko-haram.png" width="600">

## Or this
![](figures/lecture01_ROC-curve-example.png)

## But in reality it's more like this...

![](figures/lecture01_messy-files.png)

### Data Science requires that we explore data from multiple angles in various ways 

### and this path is <u>rarely</u> linear


# Reproducibility...

### is an essential feature of "**good science**": we should be able to follow the same path you took and come to the same conclusions.

+ allows researchers to probe and explore others' work
+ test validity
+ test the bounds of a result

### but it's also a **practical reality** given how projects come together.

+ often juggling multiple versions of the same file
+ collaboration can create conflicts across versions
+ projects are picked up and put down &rarr; tracing the progression of a spiderweb of files is not always easy (or possible)
+ new people enter the fray &rarr; getting them up-to-speed wastes time and resources


# What do data scientists in policy and social science do?

# The term "data scientist" is a catch-all

Consider four motivations:
- **A**. `"I made a plot, I'm a data scientist! Data science is about using code."`
- **B**. `"With enough data, I can answer any question. It's not about expertise, it's about having access to the right information and crunching it. Data science is its own thing: it emerged from the revolution of personal computing and massive data."`
- **C**. `"No clue what 'data science' really means, but neither does anyone else! I'm an opportunist. I'll use our mutual ignorance to get a nice job."`
- **D**. `"Data science is just another way of looking at a social process. Sometimes the things we care about can't be isolated to a single case or reproduced in a lab. The world plays out at different levels of aggregation, and we need to be able to understand how processes play out at those different levels."`

### Reality is likely a mixture of all these

## The task of a data scientist engaged in social science research is...

## to generate <font color="blue">compelling</font> answers

## But <font color="blue">compelling</font> answers...

- require thorough **exploration**
- are often **discovered** while working with data
- emerge through **collaboration**
- require **describing** the data and results in multiple ways

## The task of a data scientist engaged in social science research is... 

## to generate <font color="blue">compelling</font>, <font color="red">valid</font> answers

## But <font color="red">valid</font> answers require...

- **Scrutiny** &rarr; peer review
- **Discussion** &rarr; regarding...
    + Conceptual Definitions
    + Measurement
    + Scope conditions
    + Engaging with past work
- **Reproducibility** &rarr; others can trace your steps

## The task of a data scientist engaged in social science research is... 

## to generate <font color="blue">compelling</font>, <font color="red">valid</font>, <font color="orange">unbiased</font> answers

## But <font color="orange">unbiased</font> answers require...

- **Introspection** &rarr; how might measurements and algorithms reinforce human error
- **Diversity** &rarr; seeing different problems requires different vantage points
- **Reproducibility** &rarr; work that different points of view can chime in on and engage with.

## The task of a data scientist engaged in social science research is...

### to <span style="color:#477acc"><u>acknowledge the limitations of the design</u></span>

### What can and can't we say from these results?

### In what ways do missing data and other measurement challenges limit the kinds of inferences we can make?

### How might we improve future studies in this domain? 
+ what sort of data would we need?
+ what sort of skills might help solve future problems?
+ what sort of expertise might offer new insights?

### To what degree should these results inform or not inform policy?
+ Data science and statistics have a mixed history.
    - Horrible policies have resulted from relying too heavily on poor data and theory, e.g. eugenics, the Iraq war, the anti-vaccination movement, to name a few.
+ We need to convey the uncertainty that is baked into our results.

## The task of a data scientist engaged in social science research is...

### to <span style="color:#477acc"><u>acknowledge the limitations of the design</u></span> 

### while <span style="color:#477acc"><u>emphasizing</u></span> what we learn from approaching the problem in this way.

# Generating Reproducible Work

### 1. Readable
### 2. Portable
### 3. Well-Named
### 4. Repeatable
### 5. Version Control

# Readable

- **Well Commented Code and Functions**

```python
x = np.linspace(1,10,100,dtype=float)
y = 1 + 2*x + np.random.normal(0,1,100)
plt.scatter(x,y)
```

vs.


```python
# Monte Carlo Simulation of the Linear Model
N = 100 # Simulated sample size
x = np.linspace(start=1,stop=10,num=N,dtype=float) # generate exogenous variable
e = np.random.normal(loc=0,scale=1,size=N) # simulate error
y = 1 + 2*x + e # generate y as a function of x and error
plt.scatter(x,y) # plot values
```

- **Well Commented Code and Functions**
- **Well-Named Objects**

```python
# Monte Carlo Simulation of the Linear Model
N = 100 # Simulated sample size
x = np.linspace(start=1,stop=10,num=N,dtype=float) # generate exogenous variable
e = np.random.normal(loc=0,scale=1,size=N) # simulate error
y = 1 + 2*x + e # generate y as a function of x and error
plt.scatter(x,y) # plot values
```

vs.

```python
# Monte Carlo Simulation of the Linear Model
sample_size = 100 # Simulated sample size
indep_var = np.linspace(start=1,stop=10,num=sample_size,dtype=float) # generate exogenous variable
error = np.random.normal(loc=0,scale=1,size=sample_size) # simulate error
dep_var = 1 + 2*indep_var + error # generate dependent var as a function of independent var and error
plt.scatter(indep_var,dep_var) # plot values
```


- **Well Commented Code and Functions**
- **Well-Named Objects**
- **Leverage Spacing**

```python
# Monte Carlo Simulation of the Linear Model
sample_size = 100 # Simulated sample size
indep_var = np.linspace(start=1,stop=10,num=sample_size,dtype=float) # generate exogenous variable
error = np.random.normal(loc=0,scale=1,size=sample_size) # simulate error
dep_var = 1 + 2*indep_var + error # generate dependent var as a function of independent var and error
plt.scatter(indep_var,dep_var) # plot values
```
vs.

```python
# Monte Carlo Simulation of the Linear Model

# Simulated sample size
sample_size = 100 

# Generate exogenous variable
indep_var = np.linspace(start=1,stop=10,num=sample_size,dtype=float) 

# simulate error
error = np.random.normal(loc=0,scale=1,size=sample_size) 

# Generate dependent var as a function of independent var and error
dep_var = 1 + 2*indep_var + error 

# Plot Relationship
plt.scatter(indep_var,dep_var)
```


- **Well Commented Code and Functions**
- **Well-Named Objects**
- **Leverage Spacing**

**To a degree, code, like writing, should be more Hemingway than Faulkner:**

- concise
- clear
- readable without running

# Portable

- Project can easily **travel across computers**
    - e.g. R Project (`.rproj`) and python's Virtual Environments (`venv`)

- Project can easily **travel across computers**
    - e.g. `R` Project (`.rproj`) and python's Virtual Environments (`venv`)
- Scripts avoid **"machine" specific designations**
    + Avoid **specific file paths**: 
    
`/Users/my-user-name/data-projects/my-project`
 
 vs.

`~/data-projects/my-project`
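
Inside a Python script, the same idea might look like the sketch below, which uses the standard-library `pathlib` module. The file `raw-data/survey.csv` is a hypothetical name used only for illustration. Note that Python does not expand `~` on its own; it has to be asked to.

```python
from pathlib import Path

# "~" is shell shorthand; Python only expands it when expanduser() is called.
project_dir = Path("~/data-projects/my-project").expanduser()

# Path.home() is an equivalent, explicit way to reach the same place.
same_dir = Path.home() / "data-projects" / "my-project"

# Build file paths relative to the project directory instead of hard-coding them.
# "survey.csv" is a hypothetical file name.
data_file = project_dir / "raw-data" / "survey.csv"

print(project_dir == same_dir)
print(data_file.relative_to(project_dir))
```

Because nothing in the script names a particular user, it should run unchanged on a collaborator's machine as long as the project sits in the same place under their home directory.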

- Project can easily **travel across computers**
- Scripts avoid **"machine" specific designations**
    + Avoid **specific file paths**: 
    + **Retain software and package versions**, so that when the project is spun up months or years later, everything still runs (even though Python or R have long since moved on).
        - e.g. `R`'s `packrat` package
        - Python's Virtual Environments (`venv`), which keep the project's module installations and its own copy of `pip`
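
Retaining versions starts with knowing which versions are in use. The sketch below records the Python version and every installed package as `name==version`, roughly what `pip freeze` reports from the command line. The function name `snapshot_environment` is an assumption made for this example.

```python
import platform
from importlib import metadata


def snapshot_environment():
    """Return the Python version and a sorted list of 'name==version' strings."""
    pins = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip installs with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    return platform.python_version(), sorted(pins, key=str.lower)


python_version, pins = snapshot_environment()
print(f"# Python {python_version}")
for pin in pins:
    print(pin)
```

Saving this output alongside the project (for example in a text file under version control) gives a future reader a record of the environment the results came from.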

- Project can easily **travel across computers**
- Scripts avoid **"machine" specific designations**
- **Use text files**
    + Not software dependent (unlike, e.g., `.docx` or `.ai` files)
    + Can open on any system
    + Can be easily searched via the commandline
    + Easy to track changes via version control

# Well-named

- **No spaces!**
    + A space between designations can mean many things
    + spaces are ambiguous for the computer
    + will become immediately clear once we move into the commandline
 
```bash
data analysis 2.py
histogram of age distribution.pdf
```

vs

```bash
data-analysis-2.py
histogram-of-age-distribution.pdf
```
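
If a batch of existing files already has spaces in their names, the renaming rule can be written down once and reused. A minimal sketch, assuming a helper named `command_line_friendly` (a made-up name) that only computes the new name and leaves the actual renaming to the caller:

```python
import re


def command_line_friendly(filename):
    """Replace each run of whitespace in a file name with a single hyphen."""
    return re.sub(r"\s+", "-", filename.strip())


for name in ["data analysis 2.py", "histogram of age distribution.pdf"]:
    print(command_line_friendly(name))
```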

- **No spaces!**
- **Names that state the purpose of the file** (no matter how long).

```bash
data-analysis-2.py
histogram-of-age-distribution.pdf
```

vs

```bash
Analysis01_wrangling-census-data-for-visualization_v2.py
Figure02_histogram_age-distribution-of-sample-respondents.pdf
```

- **No spaces!**
- **Names that state the purpose of the file** (no matter how long).
- Maintain **designated folders** for different aspects of the project.

e.g.

```bash
data-project
├── raw-data/        # Where our input data lives
├── output-data/     # Where our manipulated data lives
├── py/              # Where our Python functions live
├── R/               # Where our R functions live
├── figures/         # Where our generated figures live
├── reports/         # Where our text-based reports (.tex/.md/.txt) live
└── analysis/        # Where our analyses live
```
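
A layout like this can itself be created by code, so every new project starts the same way. The sketch below is one possible approach using `pathlib`; the function name `create_project` is an assumption, and the demo builds the folders in a temporary directory so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path

# Folder names follow the layout shown above.
PROJECT_FOLDERS = ["raw-data", "output-data", "py", "R", "figures", "reports", "analysis"]


def create_project(root):
    """Create the project root and its designated sub-folders; return the root."""
    root = Path(root)
    for folder in PROJECT_FOLDERS:
        # exist_ok=True makes the function safe to re-run on an existing project
        (root / folder).mkdir(parents=True, exist_ok=True)
    return root


with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "data-project")
    print(sorted(p.name for p in project.iterdir()))
```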

# Repeatable

- Every step of the project can be expressed as code
- Automate what you can
- Use functions to repeat common tasks
- Clearly state all dependencies (i.e. packages/modules) at the top of every script
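
A small sketch of what these points can look like in one script: dependencies stated at the top, and two functions reused for every input. The function names, the `age` column, and the inline CSV text (standing in for files in `raw-data/`) are all hypothetical.

```python
# Dependencies (standard library only)
import csv
import io
import statistics


def read_ages(csv_text):
    """Parse CSV text with an 'age' column; skip rows where age is missing."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [int(row["age"]) for row in rows if row["age"].strip()]


def summarize(values):
    """Return the sample size, mean, and median of a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


# Hypothetical inline data standing in for files in raw-data/
raw_files = {
    "wave-1": "id,age\n1,34\n2,\n3,51\n",
    "wave-2": "id,age\n1,29\n2,40\n3,45\n",
}

# The same functions are applied to every input
for name, text in raw_files.items():
    print(name, summarize(read_ages(text)))
```

When a third wave of data arrives, nothing is copied and pasted; it is just one more entry in the loop.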

# Version Control

- Retain a **record of all changes** made throughout the project's lifespan
- Easily handle **collaboration**: 
    + track who did what
    + uniform method for dealing with conflicting changes
- Provides **room for experimentation** and non-linear exploration
- No more **versioned file names**!

More on this later...

# Working from the Command Line


### The commandline offers an easy way in which to navigate the computer. 

From it, we can:

- create, move, edit files
- install new functionality onto our computer
- run scripts in `R` or `python`

And much much more...



### Why use the commandline?

- **contextualizes** where things are on the computer (as the computer does)
    + We'll call to files from python and R using the file `path`.
- **reproducible** 
    + we can retain all our traditionally "point and click" steps as a `.sh` script. 
    + "point and click" is not reproducible as each step is not recorded and thus cannot be retraced.
- **automated**
    + batch run scripts
    + have multiple processes running in the background
- **remote into virtual machines**, such as AWS or Azure

### Let's take a quick tour using the commandline by building a simple project 

### Where are we on the computer?

```bash
pwd
```
```
/Users/ericdunford
```

### Move to the Desktop

```bash
cd ~/Desktop
pwd
```
```
/Users/ericdunford/Desktop
```

### What files are in this directory?

```bash
ls
```
```
old-files
tmp-files
notes
test-script.py
```

Looks like there are **three directories** and **one python file**.

### Let's find out more information

Almost all bash commands accept options that change the nature of the output.

```bash
ls -a 
```
```
.
..
.DS_Store
.ipynb_checkpoints
old-files
tmp-files
notes
test-script.py
```

`-a` lists 'all' files. Files whose names begin with `.` are "hidden": they don't appear in the normal point-and-click interface.

### Let's find out more information

Almost all bash commands accept options that change the nature of the output.

```bash
ls -l
```
```
total 0
drwxr-xr-x  6 ericdunford  staff  192 Aug 28 18:01 old-files
drwxr-xr-x  9 ericdunford  staff  288 Aug 30 14:54 tmp-files
drwxr-xr-x  2 ericdunford  staff  133 Aug 30 14:54 notes
-rw-r--r--  1 ericdunford  staff   10 Sep  1 16:28 test-script.py
```

`-l` prints a "long" listing: permissions, owner, group, size, and modification time for each file.

### Make a directory for our project 

```bash
mkdir my-project
ls
```
```
old-files
tmp-files
notes
test-script.py
my-project
```

### Move into the project and create the following folders

- `raw-data/`
- `output-data/`
- `py/`
- `notes/`

```bash
cd my-project
mkdir raw-data output-data py notes  # multiple directories can be specified at once
ls
```
```
notes       output-data py          raw-data
```

### Let's create a new note in the `notes` directory

```bash
touch notes/new-note.txt   # 'touch' will create a file on the fly
ls notes/                  # we list specific directories by pointing to them
```
```
new-note.txt
```

### We can also populate files directly by redirecting (`>`) the output of a command

```bash
echo 'This is a new file!' > notes/new-note-2.txt
ls notes/
```
```
new-note.txt
new-note-2.txt
```

### Print the file content in `new-note-2.txt`

```bash
cat notes/new-note-2.txt
```
```
This is a new file!
```

### Open files

```bash
open notes/new-note-2.txt
```

### Open files with a specific application 

```bash
open -a "Sublime Text" notes/new-note-2.txt
```

### Create a python script and then run it

```bash
echo 'print("Running a really important program...")' > py/my-script.py
python3 py/my-script.py
```
```
Running a really important program...
```

## Moving around and Managing Files

### Delete files

```bash
rm notes/new-note.txt
ls notes/
```
```
new-note-2.txt
```
> **Warning!** Once you `rm` a file, it is gone. 

### Delete directories

```bash
ls 
```
```
notes raw-data output-data py
```
```bash
rmdir raw-data
ls 
```
```
notes output-data py
```

### Delete directories with files in them

```bash
rmdir notes
```
```
rmdir: notes: Directory not empty
```
```bash
rm -rf notes
ls
```
```
output-data py
```
> **Warning!** Once again, once it's gone it's gone.

### Moving files

```bash
mv py/my-script.py ~/Desktop
ls ~/Desktop
```
```
old-files
tmp-files
notes
test-script.py
my-project
my-script.py
```

### Navigating

```bash
cd ..     # go up one level, to the parent directory
cd        # go to your home directory
cd -      # go back to the directory you were in before
cd ~/Desktop/my-project/raw-data/file-1/ # go to a specific location
```

### Why spaces in file names are a problem


```bash
open this\ file\ of\ mine.txt
```
vs
```bash
open this-file-of-mine.txt
```

Escaping every space means more typing, and such names are harder to handle when referring to files from `R` or `python`.
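
To see how a shell would split these commands into arguments, Python's standard-library `shlex` module can be used as a stand-in. This is only an illustration of the splitting rules, not something the commands above require.

```python
import shlex

# Unescaped spaces: the shell sees a command plus four separate arguments
print(shlex.split("open this file of mine.txt"))

# Escaped spaces: one argument, at the cost of extra typing
print(shlex.split(r"open this\ file\ of\ mine.txt"))

# Hyphenated name: one argument, no escaping needed
print(shlex.split("open this-file-of-mine.txt"))

# If a name with spaces is unavoidable, shlex.quote makes it safe for the shell
print(shlex.quote("this file of mine.txt"))
```

In the first case `open` would go looking for four files called `this`, `file`, `of`, and `mine.txt`, which is almost never what was meant.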

## Package managers

Assist in downloading additional functionality onto your system.

[homebrew](https://brew.sh/) (Mac/Linux) 
```bash
brew install trash
```
and

[chocolatey](https://chocolatey.org/) (Windows) 
```bash
choco install trash
```


## Installing python modules via `pip3`

Make sure that you have a current version of `python3` on your system. Since version 3.4, `pip` comes with the installation of python. 
```bash
python3 --version
```
```
Python 3.7.0
```
```bash
pip3 --version
```
```
pip 18.0 from /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pip (python 3.7)
```

Install modules as follows: `pip3 install [package name]`

e.g.
```bash
pip3 install numpy
```
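
After installing, it can be handy to confirm from inside Python that a module is actually visible to the interpreter you are running. A minimal sketch using the standard-library `importlib`; the helper name `is_installed` is an assumption for this example.

```python
from importlib import util


def is_installed(module_name):
    """Return True if Python can find the named top-level module."""
    return util.find_spec(module_name) is not None


for module in ["json", "numpy"]:
    status = "available" if is_installed(module) else "missing (try pip3 install)"
    print(f"{module}: {status}")
```

If `numpy` shows as missing right after a successful `pip3 install numpy`, the usual cause is that `pip3` and `python3` point at different Python installations.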

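## Asking for help

Many commands print a short usage summary when called with `-h` or `--help`, or when a required argument is missing (as with `grep` here, where `-h` is actually an output option).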

```bash
grep -h
```
```
usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
	[-e pattern] [-f file] [--binary-files=value] [--color=when]
	[--context[=num]] [--directories=action] [--label] [--line-buffered]
	[--null] [pattern] [file ...]
```

## Asking for help

`man` is short for "manual":
```bash
man grep
```

Opens the full manual in a pager (usually `less`), which we navigate as follows:

- `space` = next page,
- `b` = previous page,
- `q` = quit

### Cozying up to the commandline

- Can be a bit awkward at first.
- Will ease the process of switching between languages. 
- Necessary for version control 
    + (though there are point-and-click interfaces, such as the Git integration built into RStudio)
- We will use it throughout the semester, so you'll have plenty of time to learn it.
- More on the command line next time!

<br><br>
#### Remember: the goal is to move away from any process that is 'point and click' (keep each step reproducible)



# Using Git 

## What is version control?

### Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.

### For many, "version-control" means something like this...

```
my-project
├── my-paper_v5.2.docx
├── Drafts/
    ├── my-paper_v5.1.docx
    ├── my-paper_v5.docx
    ├── my-paper_v4.7.docx
    ├── my-paper_v4.6.docx
    ├── my-paper_v4.5.docx
    ...
```


### Version control in this manner is error prone and complicates collaboration

```
my-project
├── my-paper_v5.2.docx
├── Drafts/
    ├── my-paper_v5.emg-edits.docx
    ├── my-paper_v5.dfb-edits-v2.docx
    ├── my-paper_v5.dfb-edits-v1.docx
    ├── my-paper_v5.1.docx
    ├── my-paper_v4.6.emg-final-edits.docx
    ...
```

#### What's the operative version?

#### Where's that good description of var1 that we scrapped? We want it back.

## What is Git?

- `git` is one of many possible version control systems (others include Mercurial, Bazaar, Darcs, ...)
- It is useful because it saves snapshots rather than just tracking differences: it treats **data as a stream of snapshots**.
- Changes are made **locally** and can then be easily incorporated and merged with the work of others. Most operations need no network connection.
- Everything is checksummed before it is stored, so it's impossible to change the contents of any tracked file or directory without Git knowing about it.
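
Git notices every change because it identifies stored content by a checksum of that content. The sketch below reproduces how an ID is computed for a single file (a "blob"), assuming Git's default SHA-1 object format; the function name `git_blob_id` and the two sample strings are made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git assigns to a file ("blob") with this content."""
    # Git hashes a small header ("blob <size>" plus a null byte) followed by the content
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


original = b"This is a repository that shows some cool code.\n"
edited = b"This is a repository that shows some very cool code.\n"

print(git_blob_id(original))
print(git_blob_id(edited))  # any change to the content yields a different ID
```

Two files with identical content get the same ID no matter what they are called, and a one-character edit produces a completely different one.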

## How Git works...

Three states:

- **Committed** &rarr; data is safely stored in your local database.
- **Modified** &rarr; changed the file but have not committed it to the database yet.
- **Staged** &rarr; marked a modified file in its current version to go into the next commit snapshot.


![](figures/lecture01_git-process-image.png)
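
The three states can be mimicked with a toy model. To be clear, this is not how Git is implemented; the class `ToyRepo` and its methods are invented purely to make the modified, staged, committed cycle concrete.

```python
class ToyRepo:
    """A toy model of Git's three states (not how Git is implemented)."""

    def __init__(self):
        self.working = {}   # file name -> current content on disk
        self.staged = {}    # content marked to go into the next commit
        self.commits = []   # list of (message, snapshot) pairs

    def edit(self, name, content):
        self.working[name] = content

    def add(self, name):
        self.staged[name] = self.working[name]

    def commit(self, message):
        # Each commit is a full snapshot: the previous one plus staged changes
        last = self.commits[-1][1] if self.commits else {}
        self.commits.append((message, {**last, **self.staged}))
        self.staged = {}

    def state(self, name):
        # Simplification: a brand-new file is reported as "modified"
        last = self.commits[-1][1] if self.commits else {}
        content = self.working.get(name)
        if name in self.staged and self.staged[name] == content:
            return "staged"
        if last.get(name) == content:
            return "committed"
        return "modified"


repo = ToyRepo()
repo.edit("README.md", "first draft")
print(repo.state("README.md"))   # modified
repo.add("README.md")
print(repo.state("README.md"))   # staged
repo.commit("This is my first commit")
print(repo.state("README.md"))   # committed
```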

# Git Basics

## Initializing a Repository

```bash
mkdir my-project
cd my-project 
git init
```

or just...

```bash
git init my-project
```

## Checking the Status of Things

Assuming we're in the `my-project` directory...

```bash
git status
```

```
On branch master

No commits yet

nothing to commit (create/copy files and use "git add" to track)
```

## Staging Files

First, let's generate a file and check the status.
```bash
echo 'This is a repository that shows some cool code.' > README.md
git status
```

```git
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	README.md

nothing added to commit but untracked files present (use "git add" to track)
```

## Staging Files

Let's stage it. 
```bash
git add README.md
git status
```

```git
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   README.md
```

## Committing Files

Let's now commit the file.
```bash
git commit -m "This is my first commit"
```

```git
[master (root-commit) 5680e24] This is my first commit
 1 file changed, 1 insertion(+)
 create mode 100644 README.md
```
```bash
git status
```
```
On branch master
nothing to commit, working tree clean
```

## Reviewing Commits

Let's look at our history...
```bash
git log
```

```git
commit 5680e24ccf49d7796f0c3a59a13d50e1e1d991ed (HEAD -> master)
Author: edunford <eric.dunford@georgetown.edu>
Date:   Tue Sep 4 15:25:54 2018 -0400

    This is my first commit
```

What's going on here?

## Reviewing Commits

Printing logs in more compact formats...

```bash
git log --oneline
```

```git
5680e24 (HEAD -> master) This is my first commit
```
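
The init, stage, commit, and log steps so far can also be scripted, which is handy for checking that a fresh machine is set up correctly. The sketch below drives the real `git` command from Python via `subprocess`. The helper names `git` and `demo_commit_cycle` are assumptions, the user name and email are placeholders, and the demo only runs when `git` is found on the system.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run a git command inside repo and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=repo, capture_output=True, text=True, check=True
    )
    return result.stdout


def demo_commit_cycle(repo):
    """Initialize a repository, commit one file, and return the one-line log."""
    git(repo, "init")
    # Identity is set for this repository only, so the sketch also runs on
    # machines without a global git config; both values are placeholders.
    git(repo, "config", "user.name", "Example User")
    git(repo, "config", "user.email", "user@example.com")
    Path(repo, "README.md").write_text(
        "This is a repository that shows some cool code.\n"
    )
    git(repo, "add", "README.md")
    git(repo, "commit", "-m", "This is my first commit")
    return git(repo, "log", "--oneline")


if shutil.which("git"):  # only run when git is installed
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_commit_cycle(tmp))
```

The commit ID printed will differ from the one on the slides, since it depends on the author and the time of the commit.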

## Continue working...

```bash
mkdir raw-data output-data py analysis
git status
```

```
On branch master
nothing to commit, working tree clean
```

What's going on here?


## Continue working...

```bash
touch raw-data/some-file1.txt output-data/manip-file.txt py/file2.py analysis/file3.py
git status
```

```
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	analysis/
	output-data/
	py/
	raw-data/

nothing added to commit but untracked files present (use "git add" to track)
```

## Continue working...

```bash
git add .     # the dot stages all new and modified files in the current directory
git commit -m "Added some files"
git log --oneline
```

```
4831662 (HEAD -> master) Added some files
5680e24 This is my first commit
```

## Fast Forward a Few Steps in the project...

```bash
git log --oneline
```

```
333fb20 (HEAD -> master) Started analysis
ede382e Added data entries to raw-data
e1c8db1 Added module to .py file
4831662 Added some files
5680e24 This is my first commit
```

## Connecting to Remote Git Repositories


## What is GitHub?

- The single largest host for Git repositories (i.e. the `.git` directory that `git init` produces)
- We can `push` our local commits to upload them to the remote repository.
- Recall that a git repository is an **entire snapshot** of a project and all its data, so the fact that we can upload and download files, work on them locally, and then incorporate those changes back into the main work flow is quite powerful!


## Create a new Git Repository

![](figures/lecture01_git-rep-step1.png)

## Create a new Git Repository

![](figures/lecture01_git-rep-step2.png)

## Create a new Git Repository

![](figures/lecture01_git-rep-step3.png)

## Create a new Git Repository

![](figures/lecture01_git-rep-step4.png)

## Create a new Git Repository

![](figures/lecture01_git-rep-step5.png)

## Create a new Git Repository

![](figures/lecture01_git-rep-step6.png)

## Link the local repository to the GitHub remote

```bash
git remote add origin https://github.com/edunford/example-rep1.git
```

Now "push" the changes to the remote repository. 

```bash
git push -u origin master
```

List the remotes associated with this repository

```bash
git remote
```
```
origin
```

## Check for updated information

**fetch** to pull down data (but not merge it with your existing version). You have to merge it manually.
```bash
git fetch
```
**pull** to automatically fetch and then merge.
```bash
git pull
```

## Cloning 

We can now download the repository easily on any local computer by **cloning** 
```bash
git clone https://github.com/edunford/example-rep1.git
```


### Next Time

- Delve deeper into managing work flow with Git.
- Perform more advanced operations with the command line.
- Begin delving into python

<br><br>
**References**

Images and wording for some slides were pulled from: Scott Chacon and Ben Straub. (2014). ‘Pro Git’. Ed. 2: https://git-scm.com/book/en/v2