# Organizing your project

You have just started a new job and have to take over the work of a previous employee who has left the lab and gone off the grid. You receive an Excel file of this person's life's work. Your boss has instructed you to:

1. Make sense of all the data they have collected
2. Write a report on the findings to share with others in the lab, so they may use the data and analyses in their own work.

The file which was sent to you can be downloaded here: [gapminderDataFiveYear_superDirty.xlsx](data/gapminderDataFiveYear_superDirty.xlsx).

Download the file. With the goal of _making sense of the data_, what can you tell me about this data, and how do you know that?

- What is it?
- Where did it come from?
- When was it collected?
- Has anything been changed? If so, why was it changed?

## Organize projects for your future self.

**Goals**

- Recognize the difference between raw and modified data
- Document the life history of a file as it gets modified
- Associate file history with directory structure
- Make it easy to answer the 5 W's

## Rule 1: Always look at the raw data

Let's take a look at the data in Excel before we start working on it. Does everything look correct?

![Screenshot](assets/img/gapminderDataFiveYear_superDirty-screenshot.png)

> ## Should we fix any of the problems we see in this file? Why or why not?

> No, keep original data forever unchanged.
>
> There should always be a copy of the data in its most raw state. Never make corrections to the original data file. There are several ways to ensure that the original data stays unchanged.

## Rule 2: Never change the raw data.

What if we were to change this file?

- Changes made by hand (in GUI applications) are incredibly hard to track. Some tools are better than others, but this is outside our control.
- If we changed our only copy of the raw data, we may never be able to reproduce what we did.
- We want to track **exactly** what changed in the file, and if we must do it by hand, we should make that easy to see.

Some ideas to detect changes:

- Include the date stamp in the file name
    - `2017-03-15_01_dleehr-fixes_gapminderDataFiveYear_superDirty.xlsx`
- Color the changed cells to highlight the changes
- Side-by-side comparison with original
- Record exact changes in a README file.

Clearly label the file, or save it in a location that will make sense to future you and your collaborators.
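If you copy files often, building the date-stamped name by hand gets error-prone. The following is a minimal sketch of how such a name could be generated; `stamped_name` and its parameters are hypothetical names chosen for this illustration, and the pattern simply mirrors the example file name above.

```python
from datetime import date
from pathlib import Path


def stamped_name(original_name, author, sequence=1, on=None):
    """Build a name like 2017-03-15_01_dleehr-fixes_<original file name>."""
    on = on or date.today()
    # ISO 8601 dates (YYYY-MM-DD) sort chronologically when sorted as text
    return f"{on.isoformat()}_{sequence:02d}_{author}-fixes_{Path(original_name).name}"


print(stamped_name("gapminderDataFiveYear_superDirty.xlsx", "dleehr", on=date(2017, 3, 15)))
# prints 2017-03-15_01_dleehr-fixes_gapminderDataFiveYear_superDirty.xlsx
```

Leaving out `on` uses today's date, and bumping `sequence` distinguishes several copies made on the same day.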

## Rule 3: Structure your data in folders / sub-folders
(and keep raw vs. modified data separate)

**Let's create some folders for our project**

The most important way to delineate file types and organize the workflow of your project is to design your file structure at the very start of your project's history.

Often you do not know exactly where your project will lead, so it is okay if your initial design evolves as the project does. There are a few standards that we will walk through that work well for most projects.

Create a folder that will house all of your projects and call it `projects`.

```
projects/
```

Now let's make a folder within that folder for our new project, called `gapminder`:

```
projects/
    gapminder/
```

Since we have data, we want to create a `data` folder.

```
projects/
    gapminder/
        data/
```

Since we have this Excel spreadsheet as our original, raw data, we will save it in a folder called `00_raw` to indicate that it is the starting point for our project. Create a subfolder in `data/` called `00_raw` and put the `gapminderDataFiveYear_superDirty.xlsx` file there. Now you should have a folder structure that looks like this:

```
projects/
    gapminder/
        data/
            00_raw/
                gapminderDataFiveYear_superDirty.xlsx
```

Since we want to change something in this file, we will create a *new* folder in `data/` that shows that it is different from `00_raw/`, and we will call it `01_cleaning/`.

Make a subfolder in `data/` called `01_cleaning/` and put a copy of the `gapminderDataFiveYear_superDirty.xlsx` file there.

Then, make an *empty* subfolder called `02_cleaned` in the `data/` folder.

```
projects/
    gapminder/
        data/
            00_raw/
                gapminderDataFiveYear_superDirty.xlsx
            01_cleaning/
                gapminderDataFiveYear_superDirty.xlsx
            02_cleaned/
```

With this structure, we can capture and communicate most of the characteristics of data as it exists in our project: history, function, and format. In addition, there is a clear understanding of how to proceed to the next step, data cleaning.
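You can create these folders by hand in your file browser. If you would rather script it, here is a minimal sketch in Python; `make_project_skeleton` is a hypothetical helper name, and it only creates the folders (copying the spreadsheet into `00_raw/` and `01_cleaning/` is still up to you).

```python
from pathlib import Path


def make_project_skeleton(root):
    """Create gapminder/data/{00_raw,01_cleaning,02_cleaned} under root."""
    data_dir = Path(root) / "gapminder" / "data"
    for stage in ("00_raw", "01_cleaning", "02_cleaned"):
        # parents=True creates gapminder/ and data/ as needed;
        # exist_ok=True makes the function safe to run more than once
        (data_dir / stage).mkdir(parents=True, exist_ok=True)
    return data_dir


if __name__ == "__main__":
    data_dir = make_project_skeleton("projects")
    print(sorted(p.name for p in data_dir.iterdir()))
```

The numeric prefixes keep the folders listed in workflow order, which is why the sketch keeps them as part of the names.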

## Why `01_cleaning` and `02_cleaned`?

Realistically, we may not completely **clean** a file in one pass. Sometimes we'll get interrupted, and sometimes we'll catch things later and need to revise the file.

1. Copy the original raw data into `01_cleaning` (e.g. the xlsx file)
2. Edit the file in `01_cleaning`, saving there and recording changes.
3. Export a cleaned version in CSV or Tab-delimited format to `02_cleaned`.

We don't put a file in `02_cleaned` until we're done with it. If we catch things later and have more cleaning to do, we start again from `01_cleaning` and produce a new file in `02_cleaned`.

Let's say we've gone through the original Excel file and think it looks OK. We'll save it as a "tab-delimited file", which means that:

1. It is now just a regular text file, not an Excel file.
2. Its columns are separated by tab (`\t`) characters.
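To see what "tab-delimited" means in practice, here is a small sketch using Python's standard `csv` module. The rows are illustrative stand-ins for a few spreadsheet cells, not an export of the actual file, and the sketch writes to an in-memory buffer rather than to `02_cleaned/`.

```python
import csv
import io

# Illustrative sample rows standing in for the spreadsheet contents
rows = [
    ["country", "year", "pop"],
    ["Afghanistan", 1952, 8425333],
    ["Albania", 1952, 1282697],
]

buffer = io.StringIO()
# delimiter="\t" puts a tab character between columns instead of a comma
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
tsv_text = buffer.getvalue()

print(repr(tsv_text.splitlines()[1]))
# prints 'Afghanistan\t1952\t8425333'
```

Because the result is plain text, it can be opened in any text editor and compared line by line between versions.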

> ## Tips and Tricks
>
> - Make your raw data read-only, so you cannot inadvertently change the file.
> - Put gnarly incoming data in **quarantine**, then convert, rename, and document.
> - Use ISO 8601 (YYYY-MM-DD) dates at the beginning of data file names to make them sortable
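Making a file read-only can be done from your file browser's properties dialog, or with a few lines of Python. This is a minimal sketch; `make_read_only` is a hypothetical helper name, and on Windows the call only toggles the file's read-only flag.

```python
import os
import stat


def make_read_only(path):
    """Remove write permission for owner, group, and others."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    # Keep every existing permission bit except the three write bits
    os.chmod(path, current & ~write_bits)
```

You would call it once on the raw file, for example `make_read_only("data/00_raw/gapminderDataFiveYear_superDirty.xlsx")`, assuming you run it from the `gapminder/` folder.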

## Rule 4: Add metadata to your projects

> **What is metadata?**

## Why READMEs

Every project should tell its users what the purpose of the project is. This is commonly done in a README file. As the starting point for a project, the README is written in plain text (or [markdown](https://guides.github.com/features/mastering-markdown/)) so that it is easy to read. A README file should include the following information:

- The project name
- The date the README was created
- Contact information for the person(s) who maintains the project
- Three or four sentences about the goal of the project
- If the project uses data from an external source, where the data is from

Think about the beginning of this lesson, when we had nothing but a file with a name. These are the things that would have made it easy to make sense of that data.

So, before we make any modifications to the raw data, we need a practice for how to record the initial state of the data, as well as our modifications.

## Adding a Top Level README

To add a README to our project, open a text editor. We'd recommend either [Sublime Text](https://www.sublimetext.com/) or [Atom](https://atom.io/).

Now, let's make a README:

* Open text editor
* Start writing

~~~
Project name
Today's date
Maintainer's contact info
Data Origin
3-4 sentences about the goal of the project
~~~

* Save as `README` in the project directory.

This file serves as the starting point for future you, or anyone who receives this data.
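If you start projects often, the same template can be filled in by a short script. The following is a sketch only: `readme_text` and `write_readme` are hypothetical helper names, and the values passed in the example call are placeholders, not real project details.

```python
from datetime import date
from pathlib import Path


def readme_text(project, maintainer, data_origin, goal, created=None):
    """Fill in the README template with today's date by default."""
    created = created or date.today()
    return (
        f"{project}\n"
        f"Created: {created.isoformat()}\n"
        f"Maintainer: {maintainer}\n"
        f"Data origin: {data_origin}\n"
        f"\n{goal}\n"
    )


def write_readme(project_dir, text):
    """Write README into project_dir, refusing to overwrite an existing one."""
    path = Path(project_dir) / "README"
    # Mode "x" raises FileExistsError instead of replacing an existing file
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(text)
    return path


print(readme_text(
    "gapminder",
    "Your Name <you@example.org>",          # placeholder contact
    "Spreadsheet inherited from a former lab member",
    "Three or four sentences about the goal of the project go here.",
))
```

Opening the file in exclusive mode is a deliberate choice here: a README that already exists probably contains notes you do not want to lose.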

## Adding a README in a Subdirectory

README files in subdirectories are a good idea too. Often there are many files, and it's distracting to fill the top-level README with details about smaller pieces of the project.

- For raw data directories, you should include the location (e.g. URL) where the file was retrieved or generated.
- For modified data directories, you should include the exact tools and steps used to modify the data, along with dates.
- For other directories like code or documentation, the README should communicate what the directories contain.

## Keeping the READMEs up-to-date

- Dates in file names are helpful here too.
- When you change something in a directory, add a line to that directory's README.
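Adding that line is easier to remember when it takes one function call. Here is a minimal sketch, assuming a tab-separated "date, who, what" layout for each entry; `log_change` is a hypothetical helper name.

```python
from datetime import date


def log_change(readme_path, who, what, when=None):
    """Append one dated, tab-separated change entry to a README."""
    when = when or date.today()
    line = f"{when.isoformat()}\t{who}\t{what}\n"
    # Mode "a" appends to the end, so earlier entries are never rewritten
    with open(readme_path, "a", encoding="utf-8") as handle:
        handle.write(line)
    return line
```

A call such as `log_change("data/01_cleaning/README", "you@example.org", "Fixed country name typos")` would add one line stamped with today's date; the path and message here are placeholders.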

## Rule 5: Make your project self-documenting

## Self-documenting Projects

READMEs are commentary on what we consider the "real work", and realistically can be an afterthought. We've all had projects under a deadline or someone asking for a result, and the documentation step is easy to defer until later.

Later never comes, or we forget the details by the time it does. So another good practice is to use good, descriptive names for files, directories, and in code. These names are for our benefit, not the computer's.

> ## Project README
> `gapminder/README.md`
> ~~~
> gapminder
> =========
>
> ## Project Summary
>
> This project analyzes population-level statistics about many countries to
> determine if there is a relationship between x and y.
>
> Started: 2017-03-15
> Maintainer: Dan Leehr dan.leehr@duke.edu
>
> ## Data Origin
>
> This data is the gapminder dataset, originally collected and published in XXX,
> and retrieved from [1]. The dataset reflects population-level statistics about
> many countries spanning the last several decades.
>
> [1] https://github.com/Reproducible-Science-Curriculum/organization-RR-Jupyter/raw/gh-pages/data/gapminderDataFiveYear_superDirty.xlsx
>
>
> ## Summary of changes
>
> 2017-03-15	dan.leehr@duke.edu	Inherited gapminderDataFiveYear_superDirty.xlsx from Frank Grimes.
> 					Placed raw data in 00_raw.
> 2017-03-15	dan.leehr@duke.edu	Cleaned gapminderDataFiveYear_superDirty.xlsx file in 01_cleaning and
> 					exported to CSV format in 02_cleaned.
> ~~~

## Rule 6: Be flexible

The goal is not to follow the rules above perfectly, but to do whatever makes your data and analyses easy to follow for your particular use case. The important thing is that your folders are **structured**, **easy to follow**, and **rich with metadata**.
