# File archival and recovery

Create files to play with.

In [None]:
echo "aaaaa" > a.txt
echo "bbbbb" > b.txt
echo "ccccc" > c.txt

## File compression and archival

In [None]:
ls ?.txt

### Combine multiple files into single file

In [None]:
tar -cvf abc.tar ?.txt

In [None]:
ls

### Compress concatenated file

In [None]:
gzip abc.tar

In [None]:
ls

In [None]:
rm abc.tar.gz

### In one step

In [None]:
tar -czvf abc.tar.gz ?.txt 

In [None]:
ls
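
Before deleting the originals, it can be worth checking what the archive actually contains. The following is a minimal sketch, not part of the original walkthrough: it rebuilds the same three files in a scratch directory (an assumption, so that the files above are left alone) and lists the archive with `tar -t`, which reads the archive without extracting anything.

```shell
# Scratch directory (an assumption) so a.txt, b.txt and c.txt above are untouched.
workdir=$(mktemp -d)
for name in a b c; do
    echo "${name}${name}${name}${name}${name}" > "$workdir/$name.txt"
done

# -C makes tar change into the scratch directory before adding the files,
# so the archive stores plain names such as a.txt rather than full paths.
tar -czf "$workdir/abc.tar.gz" -C "$workdir" a.txt b.txt c.txt

# -t lists the archive members; nothing is extracted.
listing=$(tar -tzf "$workdir/abc.tar.gz")
echo "$listing"

rm -r "$workdir"
```

If the listing shows every file you expect, removing the originals is less risky.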

In [None]:
rm ?.txt

In [None]:
ls

### Recover files

In [None]:
tar -xzvf abc.tar.gz

In [None]:
ls

In [None]:
rm abc.tar.gz

## Checksums

When working with genomic data, we deal with very large files. There is a small risk that these files will be corrupted over time or during data transfer. To detect such changes, we use a “checksum” function. This is a function that generates a long, essentially random-looking number, called a checksum, from the contents of the file. When the file contents change, the checksum will almost certainly change too. In theory, two different files can generate the same checksum, but in practice the probability of this happening by accident is too small to worry about.
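
To make the idea concrete, here is a small sketch showing that a checksum depends only on the contents of a file, not on its name. The file names and contents are hypothetical and live in a scratch directory; `md5sum` is used as one example algorithm.

```shell
# Scratch directory and file names are assumptions made for illustration.
workdir=$(mktemp -d)
printf 'ACGT\n' > "$workdir/original.txt"
cp "$workdir/original.txt" "$workdir/copy.txt"   # same contents, different name
printf 'ACGA\n' > "$workdir/altered.txt"         # one character differs

# md5sum prints "<checksum>  <filename>"; keep only the checksum field.
sum_original=$(md5sum "$workdir/original.txt" | cut -d ' ' -f 1)
sum_copy=$(md5sum "$workdir/copy.txt" | cut -d ' ' -f 1)
sum_altered=$(md5sum "$workdir/altered.txt" | cut -d ' ' -f 1)

echo "original: $sum_original"
echo "copy:     $sum_copy"
echo "altered:  $sum_altered"

rm -r "$workdir"
```

The first two checksums should be identical, while the third should differ even though only a single character was changed.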

There are several different algorithms for generating the checksums, and at least 3 Unix commands to do so, but they all work very similarly for our purposes.

The strategy is:

- Generate and store a checksum together with a data file whose integrity you care about
- When you use or download the data, re-generate the checksum (using the same algorithm, e.g. MD5) and compare it with the stored checksum
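
The two steps of this strategy can be sketched as follows. This is a minimal illustration rather than a fixed recipe: `sample.txt` is a hypothetical stand-in for a large data file, the scratch directory is an assumption, and MD5 is just one possible algorithm.

```shell
# Scratch directory and sample.txt are assumptions made for illustration.
workdir=$(mktemp -d)
printf 'ACGTACGT\n' > "$workdir/sample.txt"

# Step 1: generate a checksum and store it next to the data file.
(cd "$workdir" && md5sum sample.txt > sample.txt.md5)

# Step 2: later, re-generate the checksum and compare it with the stored one.
# md5sum -c exits with status 0 only if every listed file still matches.
if (cd "$workdir" && md5sum -c sample.txt.md5 > /dev/null 2>&1); then
    first_check="unchanged"
else
    first_check="changed"
fi

# Simulate corruption by altering one character, then verify again.
printf 'ACGTACGA\n' > "$workdir/sample.txt"
if (cd "$workdir" && md5sum -c sample.txt.md5 > /dev/null 2>&1); then
    second_check="unchanged"
else
    second_check="changed"
fi

echo "before alteration: $first_check"
echo "after alteration:  $second_check"

rm -r "$workdir"
```

Relying on the exit status rather than reading the printed message makes the check easy to use inside scripts.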


In [None]:
cat > hello.txt << EOF
1 Hello, bash
2 Hello, again
3 Hello
4 again
EOF

In [None]:
cat hello.txt

### Different algorithms to generate checksums

In [None]:
cksum hello.txt

In [None]:
md5sum hello.txt

In [None]:
sha1sum hello.txt

### If we alter hello.txt in any way the checksum will be different

In [None]:
cat hello.txt

In [None]:
md5sum hello.txt > hello.md5

In [None]:
cat hello.md5

In [None]:
cat > hello.txt << EOF
1 Hello, bash
2 Hello, again
2 Hello
4 again
EOF

In [None]:
cat hello.txt

In [None]:
md5sum -c hello.md5

### Working with multiple files

In [None]:
md5sum ?.txt > MD5SUM

In [None]:
echo "ccacc" > c.txt

In [None]:
md5sum -c MD5SUM
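
With many files, the per-file `OK` lines can bury the one failure you care about. As a sketch, assuming GNU `md5sum` and a scratch directory (so the files above are untouched), the `--quiet` option prints only the files that fail verification, and the exit status tells a script whether anything went wrong.

```shell
# Scratch directory (an assumption) so a.txt, b.txt and c.txt above are untouched.
workdir=$(mktemp -d)
for name in a b c; do
    echo "${name}${name}${name}${name}${name}" > "$workdir/$name.txt"
done

# Record the checksums of all three files in a single manifest.
(cd "$workdir" && md5sum ?.txt > MD5SUM)

# Alter one file, then verify. --quiet suppresses the "OK" lines, so only
# failing files are printed; the summary warning on stderr is discarded here.
echo "ccacc" > "$workdir/c.txt"
if failed=$(cd "$workdir" && LC_ALL=C md5sum -c --quiet MD5SUM 2>/dev/null); then
    status=0
else
    status=$?
fi

echo "failed files: $failed"
echo "exit status:  $status"

rm -r "$workdir"
```

A non-zero exit status means that at least one file in the manifest did not match its stored checksum.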

## Exercises

1. Create an MD5 checksum for all notebooks (files with extension .ipynb) and store it as MD5_CHECKSUM.

2. Archive all notebooks (files with extension .ipynb) in this directory and the checksum file as notebooks.tar.gz.

3. Delete all notebooks (files with extension .ipynb).

4. Recover the notebooks from the archived file, and check that the files have not changed during the process of tarring and untarring.