# Module 2: File Management (1 hour)

From "[Piled Higher and Deeper](http://phdcomics.com/comics/archive.php?comicid=1531)" by Jorge Cham
<img alt="filenaming" src="http://www.phdcomics.com/comics/archive/phd101212s.gif" width="500px" />

## Organizing Projects
This section draws from Karl Broman's [steps toward reproducible research](http://kbroman.org/steps2rr/pages/organize.html)

* Encapsulate everything in one directory
* Separate raw data from derived data
* Separate the data from the code 
* Use relative paths
* Choose file names carefully -- more on that in the following section
* Avoid using "final" in a file name
* Write README files
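
The checklist above can be sketched as a small script. This is a minimal illustration, not a prescribed layout: the folder names (`data/raw`, `data/derived`, `code`, `docs`) and the function name `create_project` are assumptions made for this example.

```python
from pathlib import Path

# Hypothetical layout: these folder names are illustrative, not required.
SUBDIRS = ["data/raw", "data/derived", "code", "docs"]


def create_project(root):
    """Create one self-contained project directory with a README stub."""
    root = Path(root)
    for sub in SUBDIRS:
        # Raw data, derived data, and code each get their own folder.
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(
            f"# {root.resolve().name}\n\n"
            "Describe the project, its data, and how to run the code.\n",
            encoding="utf-8",
        )
    return root
```

Scripts kept in `code/` can then reach the data with a relative path such as `Path("..") / "data" / "raw"`, so the project keeps working when the whole directory is moved or shared.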

## File Naming
This section draws from Stanford Library's [data best practices](https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming)

### Basic Information

* Project or experiment name or acronym
* Location/spatial coordinates
* Researcher name/initials
* Date or date range of experiment
* Type of data
* Conditions
* Version number of file
* Three-letter file extension for application-specific files

### Tips
* For dates, use YYYYMMDD or YYMMDD.
* Keep the file names short! 
* Avoid special characters. 
* When numbering, use leading zeros. 
* Do not use spaces. 
* Do use underscores, dashes, or CamelCase.
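
As a rough sketch of how these tips combine, the hypothetical helper below assembles a file name from a few of the elements listed under Basic Information. The function name, the argument order, and the choice of underscores as separators are assumptions for illustration only.

```python
import re
from datetime import date


def build_filename(project, data_type, when, sequence, version, extension):
    """Join name elements with underscores, following the tips above."""
    parts = [
        project,
        data_type,
        when.strftime("%Y%m%d"),  # dates as YYYYMMDD
        f"{sequence:03d}",        # leading zeros when numbering
        f"v{version:02d}",        # version number of the file
    ]
    # Drop spaces and special characters from every element.
    cleaned = [re.sub(r"[^A-Za-z0-9-]", "", part) for part in parts]
    return "_".join(cleaned) + "." + extension.lstrip(".")


print(build_filename("SoilSurvey", "pH readings", date(2017, 7, 13), 7, 2, "csv"))
# prints: SoilSurvey_pHreadings_20170713_007_v02.csv
```

Putting the date in YYYYMMDD form and padding numbers with zeros means that an alphabetical sort of the folder is also a chronological and numerical sort.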

### Example 

http://bit.ly/naming_exemplar

## File Formats
This section draws from Stanford Library's [data best practices](https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-formats)

### Discussion
* What formats are you saving your files as? 
* Are they proprietary or open formats? 
* How does this affect your ability to read the file in the future? 
* Can you think of an example when proprietary is the better option? 

### Some preferred file formats

* Containers: TAR, GZIP, ZIP
* Databases: XML, CSV
* Geospatial: SHP, DBF, GeoTIFF, NetCDF
* Moving images: MOV, MPEG, AVI, MXF
* Sounds: WAVE, AIFF, MP3, MXF
* Statistics: ASCII, DTA, POR, SAS, SAV
* Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
* Tabular data: CSV
* Text: XML, PDF/A, HTML, ASCII, UTF-8
* Web archive: WARC

## Data Provenance and Version Control

### What is provenance? 
This section draws from the Wikipedia article [Provenance](https://en.wikipedia.org/wiki/Provenance).
> **Provenance** is the chronology of the ownership, custody or location of a historical object. The primary purpose of tracing the provenance of an object or entity is normally to provide contextual and circumstantial evidence for its original production or discovery, by establishing, as far as practicable, its later history, especially the sequences of its formal ownership, custody, and places of storage. [E]stablishing provenance is essentially a matter of documentation. 

### What is data provenance? 
This section draws from the Wikipedia article [Data lineage](https://en.wikipedia.org/wiki/Data_lineage).
> Data lineage includes the data's origins, what happens to it and where it moves over time. **Data provenance** documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins. The generated evidence supports essential forensic activities such as data-dependency analysis, error/compromise detection and recovery, and auditing and compliance analysis. "*Lineage* is a simple type of *why provenance*."
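
One lightweight way to document inputs and processes is to save a small record next to each derived file. The sketch below is only an illustration under stated assumptions: the function name `provenance_record` and the field names are made up for this example and do not follow any formal provenance standard.

```python
import hashlib
import platform
from datetime import datetime, timezone
from pathlib import Path


def provenance_record(inputs, process, output):
    """Describe which inputs and which process produced an output file."""
    return {
        "output": str(output),
        "process": process,  # e.g. the name of the script that was run
        "inputs": [
            {
                "path": str(path),
                # A checksum identifies the exact version of each input.
                "sha256": hashlib.sha256(Path(path).read_bytes()).hexdigest(),
            }
            for path in inputs
        ],
        "python_version": platform.python_version(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
```

The returned dictionary can be written out with `json.dump` as a sidecar file, for example `results.csv.provenance.json` (a hypothetical naming choice).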

### How is version control related to provenance? 
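It's part of documentation! A version control system records what changed, when, and by whom, which is the kind of history that provenance depends on.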

<img src="https://guides.github.com/activities/hello-world/branching.png" alt="github" />
From [GitHub Guides: Hello World](https://guides.github.com/activities/hello-world/)

## Storage/Backup/Archive/Preservation

### Definitions
This section draws from DataONE's [Education Module Lesson 6](https://www.dataone.org/sites/all/documents/education-modules/handouts/L06_DataProtection_Handout.pdf).

* **Storage**: the medium you're using to keep your active files, backups, and archives (e.g. hard drive, flash drive, CD/DVD, magnetic tape, cloud)
* **Backup**: periodic snapshots of the current version; stored for the short to near-long term; often made on a fairly frequent schedule
* **Archive**: final version for historical reference or disasters; stored for long-term; created at end of project or at major milestone
* **Preservation**: includes backups and archiving as well as processes such as data conversion, reformatting, and rescue

### Cloud Storage Options at The University of Utah
<table>
<tr>
    <td><a href="http://box.utah.edu/"><img src="http://box.utah.edu/_images/ubox-logo.png" width="250" /></a></td>
    <td><a href="http://gcloud.utah.edu/"><img src="http://gcloud.utah.edu/_images/logo-google-apps.png" width="250" /></a></td>
    <td><a href="https://redcap01.brisc.utah.edu/ccts/redcap/"><img src="https://redcap01.brisc.utah.edu/ccts/redcap/redcap_v7.0.19/Resources/images/redcap-logo-large.png" width="250" /></a></td>
    <td><a href="http://campusguides.lib.utah.edu/labarchives"><img src="https://www.labarchives.com/wp-content/uploads/2015/06/LA_2ColorLogo.png" width="250" /></a></td>
</tr>
</table>

| Resource | Security<span title="See University Policy 4-004C for definitions and explanations of restricted and sensitive data"><sup>1</sup></span> | Collaborate<span title="with research groups outside of the University"><sup>2</sup></span> | Backup | Max file size | Max allocated space | File type | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **[REDCap](https://redcap01.brisc.utah.edu/ccts/redcap/)** | Restricted | yes | Yes | na | unlimited | any | free | 
| **[LabArchives](http://campusguides.lib.utah.edu/labarchives)** | Sensitive | yes | Yes | 4GB | unlimited | any | free | 
| **[CHPC Group Space](http://campusguides.lib.utah.edu/ld.php?content_id=33672291)** | Sensitive | ?? | Upon request | na | na | ?? | \$150/TB |
| **[CHPC Archive](http://campusguides.lib.utah.edu/ld.php?content_id=33672291)** | Sensitive | ?? | Yes | na | na | ?? | \$120/TB | 
| **[UBox](http://box.utah.edu/)** | Restricted | yes | Yes | 15GB | 1TB | See [Link](https://community.box.com/t5/How-to-Guides-for-Managing/What-file-types-and-fonts-are-supported-by-Box-s-Content-Preview/ta-p/327) | free | 
| **[Google Drive](http://gcloud.utah.edu/)** | Sensitive | yes | yes | See [FAQ](https://support.google.com/drive/answer/37603?hl=en&amp;ref_topic=7000756) | unlimited | See [FAQ](https://support.google.com/drive/answer/37603?hl=en&amp;ref_topic=7000756) | free | 

The CHPC also provides tools for secure large file transfer such as [Globus](https://www.chpc.utah.edu/documentation/software/globus.php). The table above was adapted from a handout created by Daureen Nesdill (Data Management Librarian at Marriott Library).

### What are important considerations for creating backups? 

This section draws from DataONE's [Education Module Lesson 6](https://www.dataone.org/sites/all/documents/education-modules/handouts/L06_DataProtection_Handout.pdf).

* Are there **existing policies** that might affect how and when you do backups? Do you need to create **backups of the backups**? What will happen to your backups when **funding ceases, the project ends, or staff leave**?
* **How often** should you do backups? Will backups be **manual or automatic**?
* Should you do **partial or full** backups? What will you do with **non-digital content**?
* **Where** will you back up your files?
* How will you **verify** that a backup has been performed successfully?
* **How long** will you keep your backups?
* How do you **recover files** from your backup? Can you read data off of **older backups**? How will outdated data be **disposed of**?
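
For the verification question in particular, one common approach is to compare checksums of the originals against the copies. The sketch below assumes the backup is a plain mirrored folder; the function names `sha256_of` and `find_backup_problems` are hypothetical, and backup tools that compress or deduplicate files need their own verification commands.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_backup_problems(source_dir, backup_dir):
    """List source files that are missing from, or differ in, the backup."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        copy = backup_dir / rel
        if not copy.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif sha256_of(src) != sha256_of(copy):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

An empty list means every source file has an identical copy in the backup folder. The check does not report extra files that exist only in the backup.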

### The 3-2-1 Rule
This section draws from Peter Krogh's [Backup Overview](http://dpbestflow.org/node/262).

* **3** copies of any important file (a primary and two backups)
* **2** different media types (such as hard drive and optical media), to protect against different types of hazards.
* **1** copy should be stored offsite (or at least offline).
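
The rule can also be written down as a quick self-check. In this minimal sketch the `FileCopy` class, its fields, and the `check_321` function are hypothetical names chosen for illustration; the media labels are free text that you supply yourself.

```python
from dataclasses import dataclass


@dataclass
class FileCopy:
    label: str     # where the copy lives, e.g. "laptop"
    media: str     # media type, e.g. "hard drive", "optical", "cloud"
    offsite: bool  # True if stored away from the primary location


def check_321(copies):
    """Report which parts of the 3-2-1 rule a set of copies satisfies."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media_types": len({c.media for c in copies}) >= 2,
        "one_offsite": any(c.offsite for c in copies),
    }


plan = [
    FileCopy("laptop", "hard drive", False),
    FileCopy("desk drawer DVD", "optical", False),
    FileCopy("UBox", "cloud", True),
]
print(check_321(plan))
# prints: {'three_copies': True, 'two_media_types': True, 'one_offsite': True}
```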

## Exercise
**Organize your primary workspace**
1. Inventory what project files you have. 
2. For a particular project, decide on a naming convention and implement it. 
3. Look at your file formats. Are you future-proofing? 
4. Create a README file for that project. For suggestions, see [Wikipedia](https://en.wikipedia.org/wiki/README) or [GitHub](https://guides.github.com/features/wikis/#Formatting-a-readme).

**Develop a backup plan**
1. Remember the 3-2-1 rule. 
2. Find existing policies or create one for yourself. 

## Extra Reading for Fun 
* Manes S. [README? Sure--before I buy!](https://login.ezproxy.lib.utah.edu/login?url=http://search.ebscohost.com/login.aspx?direct=true&AuthType=ip,uid&db=aph&bquery=JN+%26quot%3bPCWorld%26quot%3b+AND+DT+19961101+AND+readme&type=1&site=ehost-live). PCWorld [serial on the Internet]. (1996, Nov), [cited July 13, 2017]: 366. 
* Library of Congress' [Recommended Formats Statement](https://www.loc.gov/preservation/resources/rfs/)