# The File System

As we learned in Chapter 2 of [The Theory and Craft of Digital Preservation](https://jhupbooks.press.jhu.edu/title/theory-and-craft-digital-preservation), the *file* and *file formats* are foundational concepts for digital curation, and for computing in general.

> File formats enable most modern computing. A file format is a convention that established the rules of how information is structured and stored in a file. File extensions (.mp3, .jpg, .doc), in part, define the file and enable it to be interpreted. (p. 47)

We will be taking a closer look at some of these formats in the coming weeks. But first we are going to review some of what you learned in Introduction to Programming (INST126) about how to interact with files and [file systems](https://en.wikipedia.org/wiki/File_system). While there are many types of file systems with different properties, many of them offer the same Application Programming Interface, or [API](https://en.wikipedia.org/wiki/API), which allows you to interact with files using read and write operations.

Digital curation practices often require us to interact with the file system in order to read, write and update files. Most computer operating systems (Windows, OS X, etc) provide access to the file system using applications like the File Explorer (Windows) and the Finder (OS X). Digital curation regularly requires *documenting* and performing *repeatable* interactions with files and the file system. Programming languages like Python are useful for automating digital curation tasks and workflows, and for documenting the curatorial behaviors. Being able to create and modify these programs is a useful digital curation skill that we will be exploring this semester.

This Jupyter notebook provides some examples of reading from the filesystem and then asks you to perform a similar task.

## Paths

Python's [pathlib](https://docs.python.org/3/library/pathlib.html) module provides an object-oriented way to interact with the filesystem. It is named *pathlib* because it is organized around the idea of *file paths* which are locations for files on the file system. Normally you see these using a notation something like:


    /home/ed/documents/resume.pdf
    
or on Windows:

    C:\Users\Ed\Documents\resume.pdf
    
The two different notations are expressing the same information, a location for a file on the file system by starting at the *root* of the file system and then navigating down through *folders* (also known as *directories*) until there is a name for a *file*. This is known as an *absolute path* because it is anchored to the root of the filesystem.
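To make the idea of "the same information in two notations" concrete, here is a small sketch that takes both example paths apart. It uses pathlib's `PurePosixPath` and `PureWindowsPath` classes, which only work with the text of a path and never look at the disk, so the example should run the same way on any operating system. The variable names are just for illustration.

```python
import pathlib

# "Pure" paths only manipulate the text of a path and never touch the disk,
# so both notations can be examined on any operating system.
posix_path = pathlib.PurePosixPath('/home/ed/documents/resume.pdf')
windows_path = pathlib.PureWindowsPath(r'C:\Users\Ed\Documents\resume.pdf')

for path in (posix_path, windows_path):
    print(path.parts)          # the root, each folder, then the file name
    print(path.name)           # the final component: the file itself
    print(path.is_absolute())  # True, since both start at a root
```

In both cases the first part is the root, the middle parts are folders, and the last part is the file name.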

So, for example, we can create a path for this computer's root file system. Since the Jupyter notebook is running in a virtual machine in the Google Cloud, you will be inspecting the file system of the computer running in the cloud, and not your own computer. If you were running the Jupyter notebook locally on your computer then you would be examining the file system on your computer.

In [None]:
import pathlib

root = pathlib.Path('/')
print(root)

/


Now that we have a `Path` instance stored in the variable named `root` we can list the contents of this path location using its [iterdir](https://docs.python.org/3/library/pathlib.html#pathlib.Path.iterdir) method in a little loop:

In [None]:
for p in root.iterdir():
    print(p)

/var
/boot
/opt
/root
/run
/etc
/sbin
/dev
/tmp
/media
/home
/lib
/usr
/mnt
/srv
/proc
/bin
/sys
/lib64
/.dockerenv
/tools
/datalab
/swift
/tensorflow-1.15.2
/content
/lib32


This may look different when you run it on your computer, since it may have a different root directory. In this case we see what is in the root directory of the Linux operating system that Google Cloud gave us for our Colab notebook.
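The listing above mixes folders with at least one ordinary file, and `iterdir` by itself does not say which is which. As a rough sketch, `Path` objects also have `is_dir` and `is_file` methods that can tell them apart. To keep the example runnable anywhere, it builds a small throwaway directory with Python's `tempfile` module; the names `reports` and `notes.txt` are made up for illustration.

```python
import pathlib
import tempfile

# Build a small throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    example = pathlib.Path(tmp)
    (example / 'reports').mkdir()                # a folder
    (example / 'notes.txt').write_text('hello')  # a file

    labels = {}
    # iterdir() returns entries in no particular order, so sort them first.
    for p in sorted(example.iterdir()):
        labels[p.name] = 'directory' if p.is_dir() else 'file'
        print(p.name, labels[p.name])
```

The same loop should work with `root` in place of `example` if you want to label the entries in the root directory.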

Paths can also be *relative*. Relative paths are not anchored to the root of the file system. Instead they refer to a location that is relative to the program's *current working directory*. You can think of this as the home for your program, or where it was started from. Relative paths look something like this:

    ed/resume.pdf

Most programming languages let you determine the current working directory. Python's [os](https://docs.python.org/3/library/os.html) module has a function called [getcwd](https://docs.python.org/3/library/os.html?highlight=os#os.getcwd):

In [None]:
import os

os.getcwd()

'/content'
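As a small sketch of how the two kinds of paths relate, a relative path can be turned into an absolute one by joining it to the current working directory. The example below uses `Path.cwd()`, which is pathlib's counterpart to `os.getcwd`, and the `/` operator that pathlib provides for joining paths. The file `ed/resume.pdf` does not need to exist for this to work, since we are only building the path.

```python
import pathlib

relative = pathlib.Path('ed/resume.pdf')
print(relative.is_absolute())   # False: it does not start at the root

# Joining with the current working directory produces an absolute path.
absolute = pathlib.Path.cwd() / relative
print(absolute)
print(absolute.is_absolute())   # True: it now starts at the root
```

What gets printed for `absolute` depends on where the program is running, which is exactly why relative paths can point at different files on different computers.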

## Reading Files

Digital curation is fundamentally concerned with *the care of data*, and data is almost always stored in files of some kind. When caring for data we often want to *read* the files and folders on the file system to determine the formats that the data is expressed in, its size, and file fixity. We'll be discussing some of these operations in future modules. For now, as an example, we are going to focus on reading the files in a directory and figuring out how much data they contain.

In order to make this a bit more interesting we're going to work with a set of files from [Digital Corpora](https://digitalcorpora.org/corpora/files), which is a project that provides materials for digital forensics education. Ideally we would share this data in Google Drive since it integrates nicely with Jupyter notebooks in Colab. But I haven't heard from all students yet about their preferred Google Account email address to add them. So until then I've created a small utility for downloading the data into our notebooks.

In [None]:
! pip install --upgrade -q git+https://github.com/edsu/inst341data.git

  Building wheel for inst341data (setup.py) ... done


In [None]:
from inst341data import get_module_2

get_module_2("inst341")

Downloaded inst341


Now we have a directory called `inst341` which we can interact with. For example, we can create a `Path` object for the directory we just created. We can use a relative path since the zip file unpacked into our current working directory:

In [None]:
data = pathlib.Path('inst341')
print(data)

inst341


We can list the files that were unpacked from the ZIP file:

In [None]:
for p in data.iterdir():
    print(p)

inst341/710097.pdf
inst341/481368.csv
inst341/441236.ps
inst341/120637.ppt
inst341/215842.txt
inst341/789265.ppt
inst341/512608.ppt
inst341/761213.ps
inst341/278141.jpg
inst341/447656.html
inst341/763624.html
inst341/286538.pdf
inst341/377087.txt
inst341/925740.doc
inst341/098807.doc
inst341/306840.html
inst341/033333.pdf
inst341/368751.html
inst341/115389.jpg
inst341/064568.pdf
inst341/116114.dbase3
inst341/362088.pdf
inst341/837467.jpg
inst341/951980.gz
inst341/356028.pdf


The directory looks like a collection of different types of files (pdf, doc, jpg, etc.). If we want, we can try to read one of them and print it out. Take `inst341/377087.txt` as an example. We can create a `Path` object for it and then use the `read_text` method to read its contents into a variable that we can then print out:

In [None]:
p = pathlib.Path('inst341/377087.txt')
contents = p.read_text()
print(contents)


000
FONT12 KNHC 100237
PWSAT2
TROPICAL DEPRESSION PALOMA WIND SPEED PROBABILITIES NUMBER  19      
NWS TPC/NATIONAL HURRICANE CENTER MIAMI FL   AL172008               
0300 UTC MON NOV 10 2008                                            
                                                                    
AT 0300Z THE CENTER OF TROPICAL DEPRESSION PALOMA WAS LOCATED NEAR  
LATITUDE 22.0 NORTH...LONGITUDE 78.0 WEST WITH MAXIMUM SUSTAINED    
WINDS NEAR 25 KTS...30 MPH...45 KM/HR.                              
                                                                    
Z INDICATES COORDINATED UNIVERSAL TIME (GREENWICH)                  
   ATLANTIC STANDARD TIME (AST)...SUBTRACT 4 HOURS FROM Z TIME      
   EASTERN  STANDARD TIME (EST)...SUBTRACT 5 HOURS FROM Z TIME      
   CENTRAL  STANDARD TIME (CST)...SUBTRACT 6 HOURS FROM Z TIME      
                                                                    
                                                                    
I. 

This is a somewhat random assortment of files, so perhaps it isn't the best example. But it looks like this text file contains a National Hurricane Center weather report for Tropical Depression Paloma from November 2008.
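`read_text` works here because the file is plain text. Many of the other files in the directory (the PDFs and JPEGs, for example) are not, and trying to read them as text will likely fail or produce unreadable output. For those, `Path` objects also have a `read_bytes` method that returns the raw bytes without trying to decode them. The sketch below shows the difference using a small temporary file, so the file name `sample.txt` and its contents are made up for illustration.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / 'sample.txt'
    sample.write_text('TROPICAL DEPRESSION PALOMA\n', encoding='utf-8')

    text = sample.read_text(encoding='utf-8')  # decoded into a str
    raw = sample.read_bytes()                  # left as a bytes object

    print(repr(text))
    print(raw[:8])  # just the first eight bytes
```

Passing `encoding='utf-8'` explicitly is a choice made for this sketch; the notebook cell above relies on Python's default encoding.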

## File Sizes and Storage

Knowing how much storage data takes up is very important for digital curation tasks because storage space is often limited and can be expensive to maintain over time. Some files can be compressed to save space. Some files may not be worth saving if they are already available in compressed formats. Being able to programmatically determine how much space files use is therefore very useful.

A `Path` object has a [stat](https://docs.python.org/3/library/pathlib.html#pathlib.Path.stat) method which returns information about the file, such as the size. For example here is the size in [bytes](https://en.wikipedia.org/wiki/Byte) of the weather report we just looked at:

In [None]:
p = pathlib.Path('inst341/377087.txt')
info = p.stat()
print(info.st_size)

5817


By using a loop again we can print out the sizes for all the files in the directory:

In [None]:
data = pathlib.Path('inst341')

for p in data.iterdir():
    info = p.stat()
    print(p, info.st_size)

inst341/710097.pdf 42019
inst341/481368.csv 36096
inst341/441236.ps 57926
inst341/120637.ppt 59392
inst341/215842.txt 1244
inst341/789265.ppt 5257728
inst341/512608.ppt 33792
inst341/761213.ps 11196
inst341/278141.jpg 39479
inst341/447656.html 12258
inst341/763624.html 28650
inst341/286538.pdf 133799
inst341/377087.txt 5817
inst341/925740.doc 34816
inst341/098807.doc 755200
inst341/306840.html 34040
inst341/033333.pdf 223686
inst341/368751.html 11267
inst341/115389.jpg 71167
inst341/064568.pdf 641356
inst341/116114.dbase3 11394
inst341/362088.pdf 3555072
inst341/837467.jpg 6654
inst341/951980.gz 205652
inst341/356028.pdf 142229


We can also figure out the total size in bytes of all of the files:

In [None]:
data = pathlib.Path('inst341')
total_size = 0

for p in data.iterdir():
    info = p.stat()
    total_size += info.st_size
    
print(total_size)

11411929
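Large byte counts like this one are hard to read at a glance. Here is a minimal sketch of a helper that converts a number of bytes into a friendlier unit. The function `human_size` is not part of pathlib or the course utility; it is a hypothetical helper written for this example, and it assumes binary units where 1 KiB is 1024 bytes.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ('bytes', 'KiB', 'MiB', 'GiB'):
        # Stop at the first unit that keeps the number under 1024,
        # or at GiB, the largest unit this sketch knows about.
        if size < 1024 or unit == 'GiB':
            return f'{size:.1f} {unit}'
        size /= 1024

print(human_size(5817))      # the weather report: 5.7 KiB
print(human_size(11411929))  # the whole directory: 10.9 MiB
```

Some tools divide by 1000 instead of 1024, so the numbers you see in a file browser may differ slightly.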


Or, to get a little bit more fancy, we can use a Python [Counter](https://docs.python.org/3/library/collections.html?highlight=counter#collections.Counter) to count up how much space the files take up by file extension:

In [None]:
import collections

sizes = collections.Counter()

for p in data.iterdir():
    info = p.stat()
    size = info.st_size
    
    extension = p.suffix
    sizes[extension] += size
    
for extension, size in sizes.most_common():
    print(extension, size)

.ppt 5350912
.pdf 4738161
.doc 790016
.gz 205652
.jpg 117300
.html 86215
.ps 69122
.csv 36096
.dbase3 11394
.txt 7061


It looks like the .ppt (PowerPoint) and .pdf (Portable Document Format) files take up the majority of the space.
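To put a number on "the majority", we can divide each extension's size by the total. The sketch below does not read the file system; instead it copies the sizes from the output above into a `Counter` so that it can run on its own. The variable name `total` is just for illustration.

```python
import collections

# Sizes in bytes per extension, copied from the output above.
sizes = collections.Counter({
    '.ppt': 5350912, '.pdf': 4738161, '.doc': 790016, '.gz': 205652,
    '.jpg': 117300, '.html': 86215, '.ps': 69122, '.csv': 36096,
    '.dbase3': 11394, '.txt': 7061,
})

total = sum(sizes.values())

# most_common(3) returns the three largest entries, biggest first.
for extension, size in sizes.most_common(3):
    print(extension, f'{size / total:.1%}')
```

Together the .ppt and .pdf files account for roughly 88% of the bytes in the directory.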

## Exercise

I have created a special zip file for you to download and analyze. Each file is uniquely named for you. You will need to replace **USERNAME** in the command below with your UMD username. Your UMD username is the portion of your UMD email address before the **@** sign. So for example if your email address is:

    edsu@umd.edu

Your username would be **edsu** and you would update the command like so:

```python
get_module_2('edsu')
```

In [None]:
get_module_2('USERNAME')

### 1. Directory Contents

Create a `Path` object for your directory and use a loop to print out the files that are in the directory.



### 2. Total Size

Use a loop to calculate and print out the total size in bytes of your directory.

### 3. Largest File

Use Python to determine the largest file in your directory. One way to do this would be to use a loop like above and a variable to keep track of the largest file that you've seen.