<p style = "text-align:center; font-size:200%;"> Data Practices, Datalad & Beyond </p>

<p style = "text-align:center; font-size:110%;">Data Science Basics in Neuroscience</p>
<p style = "text-align:center; font-size:110%;">March 22nd, 2021</p>

Goals for Today
---
<img src="https://media.giphy.com/media/L17xM7PvLcqJggsCYa/giphy.gif" style="float:center"/>

 - Why we should care about reproducibility

 - What we've learned to do so far

 - How to bring data into these practices

 - Full research projects with the help of `datalad` and `neurodocker`

Why reproducibility matters - Dr. Daniel Bolnick
---

 - Find the relevant article and blog post [here](http://retractionwatch.com/2016/12/08/sinking-feeling-gut-diary-retraction/)
>"Recently, Dr. Tony Wilson from CUNY Brooklyn tried to recreate my analysis, so that he could figure out how it worked and apply it to his own data... he couldn't quite recreate some of my core results."
  <img src="https://3.bp.blogspot.com/-CFkSpxbvWnY/WEFOZy07Z3I/AAAAAAAAEGM/ZNkimbvdLh4TIbBUdp2rFwrRrRj20AT6QCLcB/s400/Figure%2B2.png" height="120" style="float:right">
 <img src="https://3.bp.blogspot.com/-JMw37lxif0o/WEFOsCVkf0I/AAAAAAAAEGQ/-BYBthSzx2IC0_Q4lmqgN1JvLxsA_8SSgCLcB/s1600/Figure%2B3.jpeg" height="80" style="float:right">

>"So: how many results, negative or positive, that enter the published literature are tainted by a coding mistake as mine was. We just don't know. Which raises an important question: why don't we review code (or other custom software) as part of the peer-review process?"

Why Reproducibility Matters
---

 - More recent & relevant example: Neuroimaging Analysis Replication and Prediction Study (NARPS)
<img src="images/narps.png" width="560" style="align:center">

 - 70 independent teams analyzed the same dataset, testing the same 9 hypotheses.

 - No two teams chose identical workflows.

 - This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline.

"Good enough" practices so far ...
---

 - Using the terminal to:
     1. Automate repetitive tasks across multiple files
     2. Wrangle large files without opening Excel or other clunky software
     3. Multitask and parallelize your jobs

 - **Notebooks** are lowering barriers to reproducible research projects
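
The three terminal uses listed above can be sketched in a few lines of Bash. This is a minimal illustration only: the file names (`sub-01_events.csv` and so on) and their contents are made up for the example, and everything happens in a temporary directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so no real files are touched.
workdir="$(mktemp -d)"
cd "$workdir"

# Hypothetical sample data: three small CSV files with a header row.
for subject in sub-01 sub-02 sub-03; do
    printf 'trial,rt\n1,0.52\n2,0.61\n' > "${subject}_events.csv"
done

# 1. Automate a repetitive task: count the data rows in every file.
for csv in sub-*_events.csv; do
    rows=$(( $(wc -l < "$csv") - 1 ))   # subtract the header line
    echo "$csv: $rows rows"
done

# 2. Peek at a file without opening it in a spreadsheet program.
head -n 2 sub-01_events.csv

# 3. Run up to three jobs in parallel: compress each file with gzip.
printf '%s\n' sub-*_events.csv | xargs -n 1 -P 3 gzip

# Count the compressed files that now exist.
compressed=(sub-*_events.csv.gz)
compressed_count=${#compressed[@]}
echo "compressed files: $compressed_count"
```

The same loop works unchanged for three files or three thousand, which is the main argument for doing this in the terminal rather than by hand.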
 
<img src="https://jupyter.org/assets/jupyterpreview.png" width="500" style="float:left">
<img src="https://d33wubrfki0l68.cloudfront.net/3215c7166555d2ac02ef678fd025c171f90db23c/4e60a/images/bandone.png" width="500" style="float:right">

<img src="images/version_control.png" style="align:center">

<img src="https://inundata.org/talks/open-code/open-code/d42194071e688066055df2a68b1c3fd8.png" style="align:center">

### In a GitHub Repo:

 1. Overview document for the project

 2. Clear dependencies and requirements for code

 3. Decompose programs into modular functions

 4. Provide explanatory comments at the start of each module/script

 5. Deposit code in a persistent, version-controlled repository

 6. Provide a citation for users

## Complete GitHub research compendium examples:

 - [Scripts and sample data](https://github.com/IStevant/XY-mouse-gonad-scRNA-seq)
 - [Files, environment, DOI](https://github.com/boettiger-lab/pomdp-intro)

#### [guides.github.com/activities/citable-code/](https://guides.github.com/activities/citable-code/)

<img src="https://inundata.org/talks/open-code/open-code/12dbee11604742f79f2a4140a864384b.png" width="600" style="align:center">

<p style="text-align:center; font-size:200%">But is just "good enough" enough?</p>


<center>
    <img src="https://media.giphy.com/media/3o6YglDndxKdCNw7q8/giphy.gif" height="250">
</center>

<p style="text-align:center; font-size:200%">How do we integrate good code with large datasets?</p>

<p style="text-align:center; font-size:200%">Brief Intro to Datalad</p>
<center>
<img src="https://raw.githubusercontent.com/datalad-handbook/artwork/59ad3d7f256ce6eb443bab955972c73083f06eb0/src/reproduced.svg">
</center>

## `Datalad` is a command-line data management multi-tool that can assist you in handling the entire life cycle of digital objects!
---

<img src="http://handbook.datalad.org/en/latest/_images/dataset.svg" width=350 style="float:right">
 - Every `datalad` command works on a DataLad dataset. A dataset is simply a directory on any computer.

---
<img src="http://handbook.datalad.org/en/latest/_images/local_wf.svg" width="350" style="float:right">
 - Git and git-annex work under the hood to version control metadata and large files in your dataset.


```
$ datalad clone --dataset . \
 https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow
```

```
[INFO] Cloning dataset to Dataset(/home/me/dl-101/DataLad-101/recordings/longnow) 
[INFO] Attempting to clone from https://github.com/datalad-datasets/longnow-podcasts.git to /home/me/dl-101/DataLad-101/recordings/longnow 
[INFO] Start enumerating objects 
[INFO] Start counting objects 
[INFO] Start compressing objects 
[INFO] Start receiving objects 
[INFO] Start resolving deltas 
[INFO] Completed clone attempts for Dataset(/home/me/dl-101/DataLad-101/recordings/longnow) 
[INFO] Remote origin not usable by git-annex; setting annex-ignore 
install(ok): recordings/longnow (dataset)
add(ok): recordings/longnow (file)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  install (ok: 1)
  save (ok: 1)
```

```
$ tree -d   # we limit the output to directories
.
├── books
└── recordings
    └── longnow
        ├── Long_Now__Conversations_at_The_Interval
        └── Long_Now__Seminars_About_Long_term_Thinking

5 directories
```

```
$ cd recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking
$ ls
2003_11_15__Brian_Eno__The_Long_Now.mp3
2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3
2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
2004_02_14__James_Dewar__Long_term_Policy_Analysis.mp3
2004_03_13__Rusty_Schweickart__The_Asteroid_Threat_Over_the_Next_100_000_Years.mp3
2004_04_10__Daniel_Janzen__Third_World_Conservation__It_s_ALL_Gardening.mp3
2004_05_15__David_Rumsey__Mapping_Time.mp3
2004_06_12__Bruce_Sterling__The_Singularity__Your_Future_as_a_Black_Hole.mp3
2004_07_10__Jill_Tarter__The_Search_for_Extra_terrestrial_Intelligence__Necessarily_a_Long_term_Strategy.mp3
2004_08_14__Phillip_Longman__The_Depopulation_Problem.mp3
2004_09_11__Danny_Hillis__Progress_on_the_10_000_year_Clock.mp3
2004_10_16__Paul_Hawken__The_Long_Green.mp3
2004_11_13__Michael_West__The_Prospects_of_Human_Life_Extension.mp3
```

```
$ cd ../      # in longnow/
$ du -sh      # Unix command to show size of contents
3.7M	.
```

<center>
    <img src="https://media.giphy.com/media/toB3AnUDkqE3GENKx0/giphy.gif" width=300>
</center>

```
$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
get(ok): Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 (file) [from web...]
    

$ datalad status --annex all
236 annex'd files (35.7 MB/15.4 GB present/total size)
nothing to save, working tree clean
```

<p style="text-align:center; font-size:200%">What does this accomplish?</p>

<center>
    <img src="https://media.giphy.com/media/l0ExazFt3nDN6v49W/giphy.gif">
</center>

### One `datalad` dataset  for everything:

---
<img src="https://raw.githubusercontent.com/myyoda/talk-principles/master/pics/dataset_linkage.png" width="600" style="float:right">
 - Linking & recording datasets! Each dataset records its data dependencies and the access URLs of its individual files. You can start building reusable data subunits with:
 
 ```
 datalad clone --dataset . <url> <path>
 ```

<center>
    <img src="https://raw.githubusercontent.com/myyoda/talk-principles/master/pics/dataset_modules.png" width=980>
</center>

 ## Data sharing and collaboration
 
 - Install existing datasets by cloning the structure instead of copying all large files.
 - Create sibling datasets for publishing, sharing, and collaboration.
    
<center>
    <img src="http://handbook.datalad.org/en/latest/_images/collaboration.svg" width=800>
</center>

### Data Provenance

<img src="https://media.giphy.com/media/yxF4HmDIXw83S/giphy.gif" width=500 style="float:right">

How many times have you asked:
 - Where did this spreadsheet/file come from?
 - How was this file produced?
 - Why can't I reproduce the same results as before?

### `datalad run`
---
```
$ datalad run -m "create a list of podcast titles" "bash code/list_titles.sh > recordings/podcasts.tsv"
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): recordings/podcasts.tsv (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (notneeded: 1, ok: 1)
```

<center>
    <img src="https://media.giphy.com/media/8kqrtQiz9YqnS/giphy.gif" width=400 style="align:center">
</center>

```
$ git log -p -n 1
commit eee1356bb7e8f921174e404c6df6aadcc1f158f0
Author: Elena Piscopia <elena@example.net>
Date:   Tue Jun 23 21:03:48 2020 +0200

    [DATALAD RUNCMD] create a list of podcast titles
    
    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
     "dsid": "1cdc8632-b584-11ea-90a2-3119e6b9cf19",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [],
     "outputs": [],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

diff --git a/recordings/podcasts.tsv b/recordings/podcasts.tsv
new file mode 100644
index 0000000..f691b53
--- /dev/null
+++ b/recordings/podcasts.tsv
@@ -0,0 +1,206 @@
+2003-11-15	Brian Eno  The Long Now
+2003-12-13	Peter Schwartz  The Art Of The Really Long View
+2004-01-10	George Dyson  There s Plenty of Room at the Top  Long term Thinking About Large scale Computing
+2004-02-14	James Dewar  Long term Policy Analysis
```
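
Because the run record is plain JSON stored between two marker lines of the commit message, it can be pulled back out with standard text tools. The sketch below is one way to do that, not DataLad's own mechanism: the commit message is pasted in as a string so the example runs anywhere, and the variable names are made up for illustration. In a real dataset the message would presumably come from something like `git log -1 --format=%B <commit>`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Commit message copied from the `git log` output above (without indentation).
commit_message='[DATALAD RUNCMD] create a list of podcast titles

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
 "dsid": "1cdc8632-b584-11ea-90a2-3119e6b9cf19",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^'

# Select the marker lines and everything between them,
# then drop the first and last line (the markers themselves).
run_record=$(printf '%s\n' "$commit_message" |
    sed -n '/^=== Do not change lines below ===$/,/^\^\^\^ Do not change lines above \^\^\^$/p' |
    sed '1d;$d')

# Pull the recorded command out of the "cmd" field.
recorded_cmd=$(printf '%s\n' "$run_record" |
    sed -n 's/^ *"cmd": "\(.*\)",$/\1/p')

echo "$recorded_cmd"
```

Note that `inputs` and `outputs` are empty in this record because the `datalad run` call above did not pass `--input` or `--output`. The container example later in this deck shows a record where both are filled in.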

### `datalad rerun`
---

<img src="https://media.giphy.com/media/111ebonMs90YLu/giphy.gif" width=350  style="float:right">

```
$ datalad rerun eee1356bb7e8f921174e404c6df6aadcc1f158f0
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
add(ok): recordings/podcasts.tsv (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (notneeded: 1, ok: 1)
  unlock (notneeded: 1)
```
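
To build intuition for what a successful rerun means, here is a hand-rolled analogy in plain Bash. It is not how DataLad works internally. It executes a stand-in command twice and checks whether the two outputs are byte-identical. The command, file names, and variable names are all hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory.
workdir="$(mktemp -d)"
cd "$workdir"

# Hypothetical stand-in for a recorded command. It has to be deterministic
# for a re-execution to produce byte-identical output.
recorded_cmd='printf "2003-11-15\tBrian Eno\n2003-12-13\tPeter Schwartz\n" | sort'

# First execution: the "original" result.
bash -c "$recorded_cmd" > titles_original.tsv

# Second execution: the "rerun".
bash -c "$recorded_cmd" > titles_rerun.tsv

# cmp -s exits with status 0 only when both files are byte-identical.
if cmp -s titles_original.tsv titles_rerun.tsv; then
    rerun_status="identical"
else
    rerun_status="changed"
fi
echo "rerun output is $rerun_status"
```

If the command depended on the current time, a random seed, or an unpinned software version, the comparison would likely report `changed`, which is one reason to capture the software environment as well.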

<center>
    <img src="https://media.giphy.com/media/9V1F9o1pBjsxFzHzBr/giphy.gif" width=250>
</center>

### Complete  provenance capture with containers

```
$ datalad containers-run -n nilearn \
  --input 'inputs/mri_aligned/sub-*/in_bold3Tp2/sub-*_task-avmovie_run-*_bold*' \
  --output 'sub-*/LC_timeseries_run-*.csv' \
  "bash -c 'for sub in sub-*; do for run in run-1 ... run-8;
     do python3 code/extract_lc_timeseries.py \$sub \$run; done; done'"

-- Git commit -- Michael Hanke <michael.hanke@gmail.com>; Fri Jul 6 11:02:28 2018
    [DATALAD RUNCMD] singularity exec --bind {pwd} .datalad/e...
    === Do not change lines below ===
    {
     "cmd": "singularity exec --bind {pwd} .datalad/environments/nilearn.simg bash..",
     "dsid": "92ea1faa-632a-11e8-af29-a0369f7c647e",
     "exit": 0,
     "inputs": [
      "inputs/mri_aligned/sub-*/in_bold3Tp2/sub-*_task-avmovie_run-*_bold*",
      ".datalad/environments/nilearn.simg"
     ],
     "outputs": ["sub-*/LC_timeseries_run-*.csv"],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^
---
 sub-01/LC_timeseries_run-1.csv | 1 +
 sub-01/LC_timeseries_run-2.csv | 1 +
```

<img src="https://media.giphy.com/media/OK27wINdQS5YQ/giphy.gif" width=400 style="float:right">

 - Of course you can re-run these container commits too.

### Integration - built-in commands for sharing

<center>
    <img src="http://handbook.datalad.org/en/latest/_images/thirdparty.svg" width=1400>
</center>

### Metadata Handling - Powerful queries into your datasets

<center>
    <img src="http://handbook.datalad.org/en/latest/_images/metadata_prov_imaging.svg" width=1500>
</center>

<center>
    <img src="images/datalad_overview.png">
</center>

## Takeaways:
 - Reproducibility is an incremental journey.
 - There are plenty of tools to make this journey easier.
 - Only worry about 2 things with `datalad`: datasets and associated files!
     - A dataset is a git repository with optional large file content tracking.
     - Data provenance is important, useful, and scalable!
 - All these skills are highly desirable for data science and data engineering positions!
 - Sharing promotes the social aspects of scientific research!

#### Resources:
 - [repronim.org](https://www.repronim.org/webinar-series.html)
 - [The Datalad Handbook](http://handbook.datalad.org/en/latest/index.html)
 - [Talk by Dr. Karthik Ram at Reproducibility for Biomedical Researchers seminar @ UCSF](https://github.com/karthik/ucsf19)