# Table of Contents
 <p><div class="lev1"><a href="#Open-science-101">Open science 101</a></div><div class="lev2"><a href="#Note-on-these-materials">Note on these materials</a></div><div class="lev1"><a href="#What-does-open-science-mean?">What does open science mean?</a></div><div class="lev1"><a href="#The-lifecycle-of-a-project">The lifecycle of a project</a></div><div class="lev2"><a href="#Where-this-cycle-can-go-wrong...">Where this cycle can go wrong...</a></div><div class="lev2"><a href="#Where-does-data-reproducibility-fit-in?">Where does data reproducibility fit in?</a></div><div class="lev1"><a href="#The-open-science-ecosystem.">The open-science ecosystem.</a></div><div class="lev2"><a href="#Sharing-and-publishing-your-work">Sharing and publishing your work</a></div><div class="lev2"><a href="#Storing-larger-datasets">Storing larger datasets</a></div><div class="lev2"><a href="#Analyzing-your-data-effectively">Analyzing your data effectively</a></div><div class="lev2"><a href="#Open-source-languages">Open-source languages</a></div><div class="lev2"><a href="#Other-key-tools-in-open-science">Other key tools in open-science</a></div><div class="lev1"><a href="#Organizations-and-resources-in-open-science">Organizations and resources in open-science</a></div><div class="lev2"><a href="#Software-Carpentry"><a href="https://software-carpentry.org/" target="_blank">Software Carpentry</a></a></div><div class="lev2"><a href="#Data-Carpentry"><a href="http://www.datacarpentry.org/" target="_blank">Data Carpentry</a></a></div><div class="lev2"><a href="#On-Berkeley's-campus">On Berkeley's campus</a></div><div class="lev1"><a href="#Where-to-go-from-here?">Where to go from here?</a></div><div class="lev2"><a href="#A-quick-primer-on-jupyter"><a href="extras/Jupyter_Intro_Background.ipynb" target="_blank">A quick primer on jupyter</a></a></div><div class="lev2"><a href="#Intro-to-scientific-python"><a href="extras/scientific python.ipynb" target="_blank">Intro to scientific 
python</a></a></div>

# Open science 101

Today we will cover the broad communities and tools at your disposal for collecting, managing, and analyzing your data in a sensible manner. These are all prerequisites for making your work reproducible - fortunately, they also let you do things much more efficiently and effectively.

## Note on these materials
These materials are written in Jupyter notebooks, a tool for integrating narrative-style prose with code. Jupyter is one of the many tools in the open-science ecosystem that we'll talk about. These tools are incredibly useful, and important for making your work more reproducible. There will be more information about Jupyter later.

# What does open science mean?
That's a tough question, as the field is changing all the time. **Open science** means doing work that is:

* Easy to discover

* Easy to understand

* Easy to replicate

* Easy to build off of

At its core, open science is about refining the process of science itself so that it is more efficient and useful for the broader community.

# The lifecycle of a project
Any project moves in cycles and science is no different.

In fact, understanding the cycle of science - and where it can go awry - is a crucial part of calling yourself a scientist. Here's a really high-level view of what happens in any scientific project:

1. Be interested in something
1. Read the literature
1. Plan an experiment
1. Conduct the experiment / collect the data
1. Store and catalogue the data
1. Analyze the data
1. Refine the analysis
1. Write up the results

Embedded in this straightforward list are a number of feedback loops, where you move back a few steps and repeat a step using the knowledge that you've gained thus far.

At its core, open science is about improving this process.

## Where this cycle can go wrong...
It turns out there are a lot of ways that the steps above run really inefficiently right now. For example:

**Sharing knowledge**
* Be interested in something <span style="color:red"><-- Scientists aren't well-trained in communication skills, making it difficult for people to connect with their ideas.</span>

* Read the literature  <span style="color:red"><-- Reading the literature is difficult if you're not at a university that pays large licensing fees.</span>

* Planning and conducting an experiment  <span style="color:red"><-- Details on exactly how an experiment was run are often incomplete, and only describe the one experiment that worked.</span>

**Working with your data**
* Storing and cataloging the data  <span style="color:red"><-- Most scientists get zero formal training on how to structure and store data.</span>

* Analyze the data  <span style="color:red"><-- On that note, most scientists also aren't given any formal training in data analysis.</span>

* Refine the analysis  <span style="color:red"><-- Running one analysis is hard enough; it's harder still to incorporate it into a broader cycle that asks questions using data.</span>

* Sharing your results  <span style="color:red"><-- Some results just don't make sense as a journal article, but this is the only method we have for sharing results.</span>

Open science tries to address many of the problems described above. It does this by ensuring that the fruits of our labor are as discoverable, understandable, and useful as possible.

## Where does data reproducibility fit in?

It turns out that doing data analysis reproducibly has a lot of overlap with doing science openly and effectively.

### The Four Facets of reproducibility:
* **Documentation**: note the difference between binary files (e.g. docx) and .txt files and why text files are preferred for documentation.
* **Organization**: tools to organize your projects so that you don’t have a single folder with hundreds of files.
* **Automation**: the power of scripting to create automated data analyses.
* **Dissemination**: publishing is not the end of your analysis, rather it is a way station towards your future research and the future research of others.
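
To make the organization and automation facets concrete, here is a minimal sketch in Python. The folder names, the `reaction_time` column, and the numbers are all made up for illustration; this is one possible layout under those assumptions, not a required standard.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def make_project(root):
    """Create a simple, hypothetical project layout under `root`."""
    root = Path(root)
    for name in ("data", "scripts", "results"):
        (root / name).mkdir(parents=True, exist_ok=True)
    # Documentation lives in a plain-text file that any editor can open.
    (root / "README.txt").write_text("Raw data in data/, outputs in results/.\n")
    return root


def summarize(data_file, results_file):
    """Read one numeric column from a CSV and write its size, mean, and SD."""
    with open(data_file, newline="") as f:
        values = [float(row["reaction_time"]) for row in csv.DictReader(f)]
    summary = {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }
    with open(results_file, "w") as f:
        for key, value in summary.items():
            f.write(f"{key}: {value}\n")
    return summary


# Demo on made-up numbers, inside a temporary folder that is deleted afterwards.
with tempfile.TemporaryDirectory() as tmp:
    project = make_project(tmp)
    data_file = project / "data" / "trials.csv"
    data_file.write_text("reaction_time\n0.5\n0.75\n1.0\n")
    summary = summarize(data_file, project / "results" / "summary.txt")
    print(summary)
```

Because the whole analysis is a script, re-running it on new data is a single function call, and the script itself documents exactly what was done.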

### The basic unit of science is data

Nowadays, doing your science openly, reproducibly, and efficiently means getting a set of skills **with data and computers** that is not part of the traditional scientific education.

The benefit to you is that doing science reproducibly will also make your science more efficient and effective.

**These lectures will focus on the iterative process of analyzing and understanding your data.**

# The open-science ecosystem.

This is a quick run-down of the big players in the world of open-science. It'll cover the high-level concepts and we can dive into the details later. We'll see how each of these fits into the scientific workflow we discussed above.

It's broken down into two main sections:

* **Sharing and publishing work**
* **Analyzing data effectively**

## Sharing and publishing your work

This broadly covers how we learn about the scientific ideas that are out there, and how we build upon those ideas in order to produce new science.

### The publishing industry and the Open Access movement

Reproducible science means accessible science. While we all treat this as a given, most of science exists behind publisher paywalls that prevent you from accessing the fruits of our labor.

In recent years we have seen growth in the number of journals that offer "Open Access" material. This means that the readers aren't the ones paying for the content, so you don't need to get past a paywall in order to access it.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/25/Open_Access_logo_PLoS_white.svg/640px-Open_Access_logo_PLoS_white.svg.png" style="margin: 0 auto; width: 10%">

Here are some big open access players:

**Higher-end journals**
* PLOS-XXX journals, e.g., [PLOS Biology](http://journals.plos.org/plosbiology/)
* [eLife](https://elifesciences.org/)
* [Nature Communications](http://www.nature.com/ncomms/)

**Other OA journals**
* [peerj](https://peerj.com/)
* [Frontiers](http://www.frontiersin.org/)
* [PLOS ONE](http://journals.plos.org/plosone/)

### Pre-print servers

A more recent shift has come in the world of pre-print servers. These offer a website where you can post articles **before** they're peer-reviewed. Among other things, this gives you benefits like:

* You're less worried about being scooped while your article is in peer-review.
* Your work is immediately (and freely) available to others.
* You get feedback on your article and can improve it before a final submission.
* You get a better idea of what the journal does to improve the paper.

**The most common preprint server in our field is [biorxiv](http://biorxiv.org/)**

![](images/biorxiv.png)

### Publishing data or code

Papers have always been the dominant currency in science, but they are often a poor way to share scientific results.

For example, it's becoming more common for data to be made available with a paper. Where do you put it? What about the code used to analyze the data?

There currently isn't a single solution to this problem, but this landscape is changing very quickly. For example,

* [Binder](http://mybinder.org) allows you to create a live, interactive computing environment using a repository of code you store on github. (see [the LIGO binder](http://mybinder.org/repo/minrk/ligo-binder/notebooks/GW150914_tutorial.ipynb) for example).
* [Figshare](https://figshare.com/) allows you to deposit data, figures, and code.
* [Github](https://github.com) is the best place to store code so that others can use it.

<img src="https://12hoy26budd28l29v3kl2fq1-wpengine.netdna-ssl.com/wp-content/uploads/2015/02/logo-figshare.png" style="height: 150px; margin: 0 30%">
<img src="https://assets-cdn.github.com/images/modules/logos_page/GitHub-Logo.png" style="float: left; height: 100px; margin: 0 30%">


## Storing larger datasets
* The [CRCNS repository](http://crcns.org) will let you deposit your data and share it with others.
* There are also domain-specific repositories such as [openfmri](https://openfmri.org/)


<img src="images/crcns.png"/>

## Analyzing your data effectively

Once you've run an experiment and collected some data (or downloaded an open dataset to play around with), you need to start analyzing that data.

The traditional graduate education doesn't include much training in these methods, but there are many organizations and materials out there to help you improve. It's what the rest of these lectures will focus on.

## Open-source languages

While we have many tools for storing and analyzing our data, you should use the tools that leverage an open-source language. The open-source community shares many of the same principles that science does, and in the last decade we have seen a rapid movement away from closed languages such as Matlab and SPSS towards their open-source counterparts.

Fortunately, open-source languages are generally **more powerful**, **more flexible**, and **more useful outside of science**. For these reasons alone you should start your scientific career working with an open-source language. The fact that you get all of these tools for free (and can freely share them with others) is icing on the cake.

The two most common open-source languages for scientific analysis are:

* **R** - which is designed for statistical analysis. It is lightweight and fast to get working, especially with the excellent [RStudio](https://www.rstudio.com/) software. It also has some beautiful visualization capabilities using the [ggplot](http://ggplot2.org/) library.
* **Python** - which is a general computing language with a **lot** of data analytic packages built for it. Python has a steeper learning curve but is more flexible (and ultimately more generally utilized) than R.

**These materials will focus on Python**. While any open-source language is fine, I've found python to be the most flexible and promising in looking towards the future. We'll go over a few pieces of the python ecosystem now (though again, I don't care what language you use as long as it's open).

![](https://www.ibm.com/developerworks/community/blogs/jfp/resource/BLOGS_UPLOADED_IMAGES/trends0.png)

**A quick note on Matlab**: I know many of you are thinking "where does Matlab fit into all of this?". That's a complicated question. Matlab is decidedly *not* an open-source language. It was built as a linear algebra engine many years ago, and has grown in size and scope over the years. There are a *ton* of great packages that people have built for matlab, but you still need to pay many thousands of dollars for a license to use the language. If you want a more future-proof skillset, an open-source language is the way to go.

## Other key tools in open-science
On top of the scientific python world there are a number of other key tools you'll find useful. These include:

* **bash** for interacting with your computer's filesystem, doing things when you don't have a GUI, and basically doing anything when you're not in another language like python or R.
* **git** for keeping track of how your code and analyses change over time.
* **jupyter** for managing your projects and workflows, and doing all kinds of other things with your data.
* **Markdown**, **Latex**, and other text-based markup formats for writing documents that both people and machines can read.
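
One reason text-based files and version control go so well together is that changes to text can be shown line by line. The sketch below uses Python's built-in `difflib` module to compare two made-up versions of an analysis script. The file names and script contents are hypothetical, and git has its own diff machinery, so treat this only as an illustration of the idea.

```python
import difflib

# Two hypothetical versions of the same short analysis script,
# stored as lists of lines (the lines are only text; nothing here is run).
old_version = [
    "threshold = 0.05\n",
    "data = load('trials.csv')\n",
]
new_version = [
    "threshold = 0.01\n",
    "data = load('trials.csv')\n",
]

# unified_diff yields the kind of line-by-line summary that
# version-control tools display when a text file changes.
diff = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="analysis_v1.py",
        tofile="analysis_v2.py",
    )
)
print("".join(diff))
```

Lines starting with `-` were removed and lines starting with `+` were added. A binary format like docx can't be compared this way, which is part of why plain text is preferred for code and documentation.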

# Organizations and resources in open-science

Many of these skills take a lot of time and energy to properly learn. Fortunately, there are a lot of organizations out there dedicated to helping you out. This is especially true at Berkeley, which is one of the focal points of the open science movement.

## [Software Carpentry](https://software-carpentry.org/)
<img src="https://software-carpentry.org/img/software-carpentry-banner.png" style="width: 400px">

* Basic software and coding principles
* Core tools and languages like Bash, Git, Python, and R
* Focus on software skills

## [Data Carpentry](http://www.datacarpentry.org/)
<img src="https://www.software.ac.uk/sites/default/files/images/content/DC1_logo.jpg" style="width: 400px">

* More data-centric
* More domain-specific
* Analytics and "best practices" in data management


## On Berkeley's campus

* The [D-Lab](http://dlab.berkeley.edu/) - workshops, training, working groups
* The [Berkeley Institute for Data Science](https://bids.berkeley.edu/) - Consulting and project help, events, lectures
* The [Berkeley Data Science Education Initiative](http://data.berkeley.edu/) - For upcoming classes in data science

# Where to go from here?
The next two lectures will emphasize the data analytic parts of this process. This means deciding how you'll structure your data, deciding how to analyze it, and deciding where to go from there.

* **Week 2** will focus on data management, structuring your data, and managing your analysis workflow
* **Week 3** will focus on data analysis and statistics, as well as visualization

## [A quick primer on jupyter](extras/Jupyter_Intro_Background.ipynb)
Describes the Jupyter ecosystem and how it can help you improve your work.

## [Intro to scientific python](extras/scientific python.ipynb)
A short introduction to using python within jupyter notebooks. It will get you up and running, and we can dive into more detail in the next class.
