## Reproducible Research Checklist

### Do: Start with Good Science
- Garbage in, garbage out
- A coherent, focused question simplifies many problems
- Working with good collaborators reinforces good practices
- Something that's interesting to you will (hopefully) motivate good habits

### Don't: Do Things by Hand
- Editing spreadsheets of data to 'clean it up'
    - Removing outliers
    - QA / QC
    - Validating
- Editing tables or figures (e.g. rounding, formatting)
- Downloading data from a web site (clicking links in a web browser)
- Moving data around your computer; splitting / reformatting data files
- 'We're just going to do this once ...'

Things done by hand need to be precisely documented (this is harder than it sounds)
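
One way to avoid hand-editing is to write the cleaning rule down as code, so the rule itself becomes the documentation. The following is a minimal sketch, not a recommended outlier rule: the data frame `raw`, the column `value`, the function `clean_values`, and the 1.5 × IQR cutoff are all assumptions made for the example.

```r
# Hypothetical raw data; in practice this would be read from the untouched raw file
raw <- data.frame(id = 1:8,
                  value = c(10.1, 9.8, 10.4, 10.0, 55.0, 9.9, 10.2, NA))

# Scripted cleaning rule: drop missing values, then drop points that lie more
# than k * IQR outside the quartiles (an assumed rule, for illustration only)
clean_values <- function(df, k = 1.5) {
  df <- df[!is.na(df$value), ]
  q <- quantile(df$value, c(0.25, 0.75))
  fence <- k * (q[2] - q[1])
  keep <- df$value >= q[1] - fence & df$value <= q[2] + fence
  df[keep, ]
}

cleaned <- clean_values(raw)
nrow(raw) - nrow(cleaned)  # number of rows removed by the rule
```

Because the raw data are never edited, anyone can rerun the script and see exactly which rows were dropped and why.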

### Don't: Point and Click
- Many data processing / statistical analysis packages have graphical user interfaces (GUIs)
- GUIs are convenient / intuitive but the actions you take with a GUI can be difficult for others to reproduce
- Some GUIs produce a log file or script which includes equivalent commands; these can be saved for later examination
- In general, be careful with data analysis software that is highly interactive; ease of use can sometimes lead to non-reproducible analyses
- Other interactive software, such as text editors, is usually fine

### Do: Teach a Computer
- If something needs to be done as part of your analysis / investigation, try to teach your computer to do it (even if you only need to do it once)  
- In order to give your computer instructions, you need to write down exactly what you mean to do and how it should be done  
- Teaching a computer *almost* guarantees reproducibility  
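
As a minimal sketch of teaching the computer to fetch data instead of clicking through a browser, the function below downloads a file and records where and when it came from. The function name `fetch_data`, the `.source.txt` sidecar file, and the URL in the usage comment are hypothetical; only `download.file()` and the other base R calls are real.

```r
# Download a file with code and record its provenance next to it.
# `fetch_data` and the ".source.txt" sidecar file are illustrative names.
fetch_data <- function(url, destfile) {
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  status <- download.file(url, destfile, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed for ", url)
  # Write down the source and the retrieval time so the step is documented
  writeLines(c(paste("url:", url),
               paste("downloaded:", format(Sys.time(), tz = "UTC", usetz = TRUE))),
             paste0(destfile, ".source.txt"))
  invisible(destfile)
}

# Usage (placeholder URL, not a real data source):
# fetch_data("https://example.org/data/measurements.csv", "data/raw/measurements.csv")
```

Writing the URL and destination into the script forces you to state exactly what was downloaded, which is the point of the bullets above.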

### Do: Use Some Version Control
- Slow things down: commit deliberately rather than in a rush
- Add changes in small chunks (don't just do one massive commit)
- Track / tag snapshots; revert to old versions
- Services like GitHub / Bitbucket / SourceForge make it easy to publish results

### Do: Keep Track of Your Software Environment
- If you work on a complex project involving many tools/datasets, the software and computing environment can be critical for reproducing your analysis
- **Computer architecture**: CPU (Intel, AMD, ARM), GPUs
- **Operating system**: Windows, Mac OS, Linux / Unix
- **Software toolchain**: Compilers, interpreters, command shell, programming languages (C, Perl, Python, etc.), database backends, data analysis software
- **Supporting software / infrastructure**: Libraries, R packages, dependencies
- **External dependencies**: Web sites, data repositories, remote databases, software repositories
- **Version numbers**: Ideally, for everything (if available)
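
In R, much of this information can be captured with base functions and saved next to the analysis. The sketch below is one possible way to do it; the variable `env_file` and the file name `session-info.txt` are arbitrary choices for the example, and the file is written to a temporary directory only to keep the example self-contained.

```r
# Save a snapshot of the software environment alongside the analysis.
# The file name "session-info.txt" is an arbitrary choice for this example.
env_file <- file.path(tempdir(), "session-info.txt")

env <- c(
  paste("R version:", R.version.string),
  paste("Platform:", R.version$platform),
  paste("OS:", Sys.info()[["sysname"]], Sys.info()[["release"]]),
  "",
  capture.output(sessionInfo())  # attached and loaded packages with versions
)
writeLines(env, env_file)
```

In a real project the file would more likely live in the project directory and be committed together with the code.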

### Don't: Save Output
- Avoid saving data analysis output (tables, figures, summaries, processed data, etc.), except perhaps temporarily for efficiency purposes
- If a stray output file cannot be easily connected with the means by which it was created, then it is not reproducible
- Save the data + code that generated the output, rather than the output itself
- Intermediate files are OK as long as there is clear documentation of how they were created

### Do: Set Your Seed
`set.seed()`
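
Random number generators produce pseudo-random numbers from an initial seed, so fixing the seed makes the stream of numbers exactly repeatable. A small sketch of the effect (the seed value 10 and the variable names are arbitrary):

```r
set.seed(10)       # 10 is an arbitrary seed chosen for this example
x <- rnorm(5)

set.seed(10)       # resetting the same seed restarts the same stream
y <- rnorm(5)

identical(x, y)    # TRUE: both draws are exactly the same

z <- rnorm(5)      # without resetting, the stream simply continues
identical(x, z)    # FALSE
```

Whenever an analysis involves simulation, resampling, or random splits, setting the seed at the top of the script lets others obtain the same numbers.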

### Do: Think About the Entire Pipeline
- Data analysis is a lengthy process; it is not just tables / figures / reports
- Raw Data --> Processed Data --> Analysis --> Report
- How you got to the end is just **as important as the end itself**
- The more of the data analysis pipeline you can make reproducible, the better for everyone
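
The pipeline above can be sketched as a chain of functions, one per stage, so the whole path from raw data to report can be rerun with a single call. This is only an illustration: every function name and the toy data frame are hypothetical.

```r
# Each stage is a function, so the whole path from raw data to report can be rerun.
# All names and the toy data are hypothetical.
load_raw <- function() {
  data.frame(group = c("a", "a", "b", "b", "b"),
             value = c(1.0, 3.0, 2.0, NA, 4.0))
}

# Raw Data -> Processed Data: drop rows with missing values
process <- function(raw) raw[!is.na(raw$value), ]

# Processed Data -> Analysis: mean value per group
analyze <- function(processed) {
  aggregate(value ~ group, data = processed, FUN = mean)
}

# Analysis -> Report: one formatted line per group
report <- function(result) {
  sprintf("Group %s: mean = %.1f", as.character(result$group), result$value)
}

run_pipeline <- function() report(analyze(process(load_raw())))
run_pipeline()
```

Keeping each stage separate makes it clear which step produced which intermediate result.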

In [1]:
sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=zh_CN.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_4.0.2      ellipsis_0.3.1      IRdisplay_0.7.0    
 [4] pbdZMQ_0.3-3        tools_4.0.2         htmltools_0.5.0    
 [7] pillar_1.4.6        base64enc_0.1-3     crayon_1.3.4       
[10] uuid_0.1-4          IRkernel_1.1.1.9000 jsonlite_1.7.0     
[13] digest_0.6.25   

In [10]:
Sys.setenv(LANG = "en_US.UTF-8")
Sys.setlocale("LC_MESSAGES", 'en_US.UTF-8')
Sys.setlocale("LC_TIME", 'en_US.UTF-8')
Sys.setlocale("LC_MONETARY", 'en_US.UTF-8')
Sys.setlocale("LC_PAPER", 'en_US.UTF-8')
Sys.setlocale("LC_MEASUREMENT", 'en_US.UTF-8')

In [9]:
# Check that LC_TIME is set to en_US.UTF-8 so that dates are formatted in English
# e.g. the following command should output "Wednesday" rather than "星期三"
tempDate <- as.Date("2020-08-12")
weekdays(tempDate)

## Reproducible Research with Evidence-based Data Analysis

### Replication and Reproducibility
- Replication
    - Focuses on the validity of the scientific claim
    - 'Is the claim true?'
    - The ultimate standard for strengthening scientific evidence
    - New investigators, data, analytical methods, laboratories, instruments, etc.
    - Particularly important in studies that can impact broad policy or regulatory decisions
- Reproducibility
    - Focuses on the validity of the data analysis
    - 'Can we trust this analysis?'
    - Arguably a minimum standard for any scientific study
    - New investigators, same data, same methods
    - Important when replication is impossible

### What Problem Does Reproducibility Solve?

- What we get
    - Transparency
    - Data Availability
    - Software / Methods Availability
    - Improved Transfer of Knowledge
- What we do NOT get
    - Validity / Correctness of the analysis

An analysis can be reproducible and still be wrong  
We want to know "can we trust this analysis?"  
Does requiring reproducibility deter bad analysis?

### Problems with Reproducibility
The premise of reproducible research is that with data/code available, people can check each other's work and the whole system is self-correcting

- Addresses the most "downstream" aspect of the research process - post-publication
- Assumes everyone plays by the same rules and wants to achieve the same goals (i.e. scientific discovery)