### NIAID / OCICB / BCBB

### Python Programming for Biologist Seminar Series

---
# Reproducible Science with Jupyter Notebook

https://reproducible-science-curriculum.github.io/workshop-RR-Jupyter/

---

### Instructor: R. Burke Squires
- To see recorded videos and archived materials, visit the NIAID Bioinformatics Portal: https://bioinformatics.niaid.nih.gov
- To download materials for each class, please see: https://github.com/burkesquires
 
---

## Learning Objectives

The following are the overarching learning objectives for the curriculum.

* Understand the value of reproducible research practices for more effective research for current and future you.
* Understand the value of reproducible research practices for advancing research as a whole.
* Understand what is meant by making your research more reproducible.
* Know practices to make your research more reproducible, in particular by using Jupyter Notebooks, and have the skills to do so.
* Have the confidence and foundation to continue improving reproducibility of your research.
* Understand what is possible and what you can still learn to be more effective with reproducible research.

## Workshop outline

A Reproducible Science with Jupyter Notebooks Curriculum workshop currently has seven modules:

1. [Introduction to the workshop](#i-workshop-introduction)
2. [Data and Project Organization](#ii-data-and-project-organization)
3. [Introduction to the Jupyter notebook](#iii-introduction-to-the-jupyter-notebook)
4. [Data Exploration](#iv-data-exploration)
5. [Automation](#v-automation)
6. [Publication](#vi-publication)
7. [Sharing](#vii-sharing)

### I. Workshop Introduction

**Goals**: Introduction to the workshop, including motivation, agenda and goals for the workshop.

**Materials**<br/>
*Repository*: <https://reproducible-science-curriculum.github.io/workshop-introduction-RR-Jupyter/>

### II. Data and Project Organization

**Goals**: Students will learn to recognize common data file formats and import them into a Jupyter notebook; be able to design and justify a directory structure and file naming convention for a project; and be able to move from an empty notebook through exploratory analysis into a more refined script or set of notebooks that communicates results reproducibly.

**Instructor's skills**: Good understanding of file organisation in research projects. Understanding of file structure on major operating systems (Windows, Linux/Unix, Mac OS) and the interface/commands for managing files and folders. Understanding of basic file types (binary vs. text). At least a basic overview of how files are stored (and deleted) in different operating systems. Understanding of file and folder naming conventions (names, extensions etc.).

**Materials**<br/>
*Repository*: <https://reproducible-science-curriculum.github.io/organization-RR-Jupyter/>
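
As a rough illustration of recognizing a file format before importing it, the sketch below dispatches on the file extension using only the Python standard library. The function name `load_table` is hypothetical and not part of the workshop materials, and the sketch assumes small text files (CSV, TSV, or JSON holding a list of records).

```python
import csv
import json
from pathlib import Path


def load_table(path):
    """Load a small CSV, TSV, or JSON file (hypothetical helper).

    CSV/TSV files are returned as a list of dictionaries keyed by the
    header row; all values stay strings. JSON files are returned as parsed.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix in (".csv", ".tsv"):
        delimiter = "\t" if suffix == ".tsv" else ","
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle, delimiter=delimiter))
    if suffix == ".json":
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)
    # Binary or unknown formats need a dedicated reader.
    raise ValueError(f"unrecognized file format: {suffix}")
```

In a notebook, a call such as `load_table("data/raw/samples.tsv")` (path assumed) would return one dictionary per row. For larger tables, `pandas.read_csv` is the more common choice.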

    A boilerplate for reproducible and transparent science with close resemblances to the philosophy of [Cookiecutter Data Science](https://github.com/drivendata/cookiecutter-data-science): *A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.*

    Requirements
    ------------
    Install `cookiecutter` command line:

    

        pip install cookiecutter

    Usage
    -----
    To start a new science project, run the following in a Terminal (macOS/Linux) or Command Prompt (Windows) window:

        cookiecutter gh:burkesquires/cookiecutter-reproducible-science

    Project Structure
    -----------------

    ```
    .
    ├── AUTHORS.md
    ├── LICENSE
    ├── README.md
    ├── bin                <- Your compiled model code can be stored here (not tracked by git)
    ├── config             <- Configuration files, e.g., for doxygen or for your model if needed
    ├── data
    │   ├── external       <- Data from third party sources.
    │   ├── interim        <- Intermediate data that has been transformed.
    │   ├── processed      <- The final, canonical data sets for modeling.
    │   └── raw            <- The original, immutable data dump.
    ├── docs               <- Documentation, e.g., doxygen or scientific papers (not tracked by git)
    ├── notebooks          <- IPython or R notebooks
    ├── reports            <- For a manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
    │   └── figures        <- Figures for the manuscript or reports
    └── src                <- Source code for this project
        ├── data           <- scripts and programs to process data
        ├── external       <- Any external source code, e.g., pull other git projects, or external libraries
        ├── models         <- Source code for your own model
        ├── tools          <- Any helper scripts go here
        └── visualization  <- Scripts for visualisation of your results, e.g., matplotlib, ggplot2 related.
    ```

    Check out my latest research project, which successfully applied the `cookiecutter` philosophy: [SEMIC: an efficient surface energy and mass balance model applied to the Greenland ice sheet](https://gitlab.pik-potsdam.de/krapp/semic-project).
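
If `cookiecutter` is not available, the directory layout shown above can be approximated with a few lines of Python. This is only a sketch: the function name `scaffold_project` is hypothetical, and it creates empty folders and placeholder files rather than reproducing what the cookiecutter template actually generates.

```python
from pathlib import Path

# Directory names copied from the project structure listed above.
PROJECT_DIRS = [
    "bin",
    "config",
    "data/external",
    "data/interim",
    "data/processed",
    "data/raw",
    "docs",
    "notebooks",
    "reports/figures",
    "src/data",
    "src/external",
    "src/models",
    "src/tools",
    "src/visualization",
]
PROJECT_FILES = ["AUTHORS.md", "LICENSE", "README.md"]


def scaffold_project(root):
    """Create the folder tree and empty top-level files under `root`."""
    root = Path(root)
    created = []
    for relative in PROJECT_DIRS:
        path = root / relative
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    for name in PROJECT_FILES:
        path = root / name
        path.touch(exist_ok=True)  # placeholder; fill in by hand
        created.append(path)
    return created
```

Calling `scaffold_project("my-project")` creates the tree in the current working directory. Because existing folders and files are left in place, re-running it does not overwrite earlier work.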

### Another example of a project layout:
    
https://github.com/jfear/example_project

### III. Introduction to the Jupyter notebook

**Goals**: Students will understand the concept, importance, and components of reproducible research; understand the strengths of Jupyter Notebooks as a tool for reproducible research; be able to create and navigate through a Jupyter Notebook containing Markdown and Code cells; and know how to access the broader Jupyter and Python ecosystems and communities.

**Instructor's skills:** Familiarity with Jupyter notebooks; familiarity with markdown; basic python skills.

**Materials**<br/>
*Repository*: <https://github.com/Reproducible-Science-Curriculum/introduction-RR-Jupyter>
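
To make the idea of Markdown and Code cells concrete, the sketch below builds a minimal notebook as plain JSON, which is how `.ipynb` files are stored on disk. The helper names are hypothetical, and the structure assumes notebook format version 4 with no kernel metadata, so Jupyter may ask which kernel to use when the file is opened.

```python
import json


def markdown_cell(text):
    """Return a Markdown cell holding narrative text."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Return an unexecuted code cell (no outputs, no execution count)."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


def build_notebook(cells):
    """Wrap a list of cells in the top-level notebook structure."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = build_notebook([
    markdown_cell("# My analysis\nNarrative text goes in Markdown cells."),
    code_cell("print('Hello, reproducible science!')"),
])

# Saving this string to a file ending in .ipynb gives a notebook Jupyter can open.
notebook_json = json.dumps(notebook, indent=1)
```

In practice notebooks are created through the Jupyter interface. The point of the sketch is that a notebook is an ordinary text file, which is why it can be versioned and shared like any other project file.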


### IV. Data Exploration

**Goals**: Students will be able to assess the structure and cleanliness of their dataset; be able to describe their findings, translate results, and summarize their thought process in a narrative comprised of Markdown text and Python code in a Jupyter Notebook; learn practices for modifying raw data to prepare a clean data set in a reproducible and documented way; and be able to assess whether their data is “Tidy”, and how to arrange it into a tidy format.

**Instructor's skills**: Facility with tabular data. Understanding of the steps needed to reshape, merge, and subset data. Knowledge of different types of plots, and which types of plots are appropriate for various kinds of data. Familiarity with regular expressions, `pandas`, and `matplotlib` is helpful.

**Materials**<br/>
*Repository*: <https://reproducible-science-curriculum.github.io/data-exploration-RR-Jupyter/>
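
As a small illustration of arranging data into a tidy format, the sketch below reshapes a "wide" table (one column per time point) into "long" rows with one observation each. The `melt` function, the column names, and the values are all made up for illustration. With `pandas` installed, `DataFrame.melt` performs the same reshaping.

```python
def melt(rows, id_column, var_name, value_name):
    """Reshape wide rows into tidy rows: one observation per row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row
            tidy.append({
                id_column: row[id_column],
                var_name: column,
                value_name: value,
            })
    return tidy


# Hypothetical wide table: one row per sample, one column per time point.
wide = [
    {"sample": "A", "day1": 0.5, "day2": 0.7},
    {"sample": "B", "day1": 0.4, "day2": 0.9},
]

tidy = melt(wide, id_column="sample", var_name="day", value_name="od600")
for record in tidy:
    print(record)
```

Each tidy row now answers one question (which sample, which day, what value), which is the shape most plotting and grouping tools expect.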

### V. Automation

**Goals**: Students will learn how to programmatically assemble a manuscript using elements generated by a notebook, including text, headings and figures generated from code and data.

**Instructor's skills:** Good understanding of programming concepts, in particular code modularisation, writing and using functions, code reusability and so on. Good understanding of selected software engineering concepts such as project build and automation, code testing, continuous integration and so on. Solid knowledge of Python, Jupyter, and relevant packages (consult the materials for details). Understanding of basic statistical concepts (consult the materials for details).

**Materials**<br/>
*Repository*: <https://reproducible-science-curriculum.github.io/automation-RR-Jupyter/>
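
One way to picture programmatic assembly of a manuscript is a function that computes the reported numbers and writes them into Markdown text, so that the prose updates whenever the data change. The sketch below is a minimal example under stated assumptions: the measurements, the function name `build_report`, and the figure path are hypothetical and not taken from the workshop materials.

```python
import statistics

# Hypothetical measurements; a real project would load these from data/processed.
measurements = [4.1, 3.9, 4.4, 4.0]


def build_report(title, values, figure_path):
    """Return Markdown text whose numbers are computed rather than typed by hand."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    lines = [
        f"# {title}",
        "",
        "## Results",
        "",
        f"We analysed {len(values)} measurements "
        f"(mean {mean:.2f}, standard deviation {stdev:.2f}).",
        "",
        f"![Distribution of measurements]({figure_path})",
        "",
    ]
    return "\n".join(lines)


report = build_report(
    "Example manuscript",
    measurements,
    "reports/figures/distribution.png",
)
print(report)
```

The design choice is that no number appears in the text unless code produced it. Rerunning the notebook after correcting the data regenerates the headings, text, and figure references together.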

### VI. Publication

**Goals**: Students will learn how to export their notebooks in a variety of formats for publication; be able to describe the utility of documentation to themselves and others; be able to describe and compose appropriate and descriptive keywords for a given record; be able to define and describe the importance of unique identifiers for data, publication and software; and learn how to select an appropriate license for their research artifacts.

**Instructor's skills:** Understanding of requirements for reproducible publication. Understanding of differences between publication and sharing. Understanding the difference between open and restricted access publication. Overview of tools and repositories for publishing research outputs. Knowledge of different licensing models and ability to discuss major differences between the most commonly used licenses in research.

**Materials**<br/>
*Repository*: <https://reproducible-science-curriculum.github.io/publication-RR-Jupyter/>
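
Exporting a notebook to other formats is usually done with the `jupyter nbconvert` command-line tool. The sketch below only builds the command lines, so it runs without Jupyter installed. The function name `nbconvert_command` and the notebook name `analysis.ipynb` are hypothetical, and the format list is a subset chosen for illustration.

```python
import shlex

# A subset of nbconvert output formats, chosen for illustration.
FORMATS = ("html", "markdown", "latex", "pdf", "script")


def nbconvert_command(notebook, output_format):
    """Build the `jupyter nbconvert` argument list for one export."""
    if output_format not in FORMATS:
        raise ValueError(f"unsupported format: {output_format}")
    return ["jupyter", "nbconvert", "--to", output_format, notebook]


# Print one command per format; each could be pasted into a terminal.
for fmt in FORMATS:
    print(shlex.join(nbconvert_command("analysis.ipynb", fmt)))
```

To run an export from Python, the returned list can be passed to `subprocess.run(command, check=True)`. Note that PDF export additionally requires a LaTeX installation.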

### VII. Sharing

**Goals**: Students will learn how to share their Jupyter notebooks online, both static (using GitHub) and interactive (using Binder).

**Instructor's skills:** Some familiarity with GitHub, understanding of software dependencies and (containerized) environments.

**Materials**<br/>
*Repository*: <https://reproducible-science-curriculum.github.io/sharing-RR-Jupyter/>
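
The two sharing routes mentioned above differ mainly in the link handed to readers. The sketch below builds both links for a notebook in a public GitHub repository. The user, repository, and notebook names are hypothetical, and the URL patterns reflect how GitHub and mybinder.org links are commonly written, so check them against the current Binder documentation before relying on them.

```python
from urllib.parse import quote


def github_url(user, repo, notebook, ref="HEAD"):
    """Link to the static rendering of a notebook on GitHub."""
    return f"https://github.com/{user}/{repo}/blob/{ref}/{quote(notebook)}"


def binder_url(user, repo, notebook, ref="HEAD"):
    """Link that asks mybinder.org to launch the notebook interactively."""
    # The notebook path is passed as a query value, so slashes are escaped too.
    path = quote(notebook, safe="")
    return f"https://mybinder.org/v2/gh/{user}/{repo}/{ref}?labpath={path}"


# Hypothetical repository and notebook names.
print(github_url("example-user", "example-repo", "notebooks/analysis.ipynb"))
print(binder_url("example-user", "example-repo", "notebooks/analysis.ipynb"))
```

For the interactive link to work, the repository also needs a file describing its software dependencies, such as `requirements.txt` or `environment.yml`, so that Binder can build a matching environment.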


## Ongoing work

These materials are being developed and revised on an ongoing basis. The list of [GitHub issues for the Reproducible-Science-Curriculum](https://github.com/issues?user=Reproducible-Science-Curriculum) gives a pretty good idea of what is happening and what needs to be done.