# External Metadata

As we learned this week in our reading of [The Numbers Don't Speak for Themselves](https://data-feminism.mitpress.mit.edu/pub/czq9dfs5/release/2) in [Data Feminism](https://data-feminism.mitpress.mit.edu/), the *description* of data is extremely important. Description is often recorded separately from the data as *external metadata*. External metadata is especially important for representing the context in which data was generated.

We also learned previously in Chapter 6 of [The Theory and Craft of Digital Preservation](https://jhupbooks.press.jhu.edu/title/theory-and-craft-digital-preservation) that, as a practical matter, this context often includes a record of what a file looks like at a particular point in time. This is known as a file's *fixity*. Knowing what files should be present, and that their content is what we expect it to be, is a fundamental requirement for caring for data. It lets us notice when things have gone wrong with our data.

## Fixity

One popular way of managing fixity information for files is to create what's called a digital fingerprint or *hash* for a file. As Owens says:

> A cryptographic hash function is an algorithm that takes a given set of data (like a file) and computes a sequence of characters that then serves as a fingerprint for that data. Even changing a single bit in a file will result in a totally different sequence of characters. For example, I computed an MD5 hash for an image file which returned the value "4937A316849E472473608D43446EBF9EF". Now if I compute the hash for another copy of that file and get the same result, I'll have rather high confidence that those two copies are exactly the same. Similarly, if I have a stored value from the last time I computed a hash for that file, when I recompute the hash in the future and get that value again, I have a high degree of confidence that the file has not been changed or altered.
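
The claim that a single changed bit produces a totally different fingerprint can be tried directly. The cell below is only a minimal sketch: the two byte strings are made-up stand-ins for file contents (they are not from the class data), and they differ in exactly one bit.

```python
import hashlib

# Hypothetical stand-ins for the contents of two files.
# The last bytes are 'a' (0x61) and 'c' (0x63), which differ by a single bit.
original = b"data"
altered = b"datc"

# Count how many bits differ between the two final bytes.
bits_changed = bin(original[-1] ^ altered[-1]).count("1")
print("bits changed:", bits_changed)

# Compute an MD5 fingerprint for each byte string.
md5_original = hashlib.md5(original).hexdigest()
md5_altered = hashlib.md5(altered).hexdigest()
print(md5_original)
print(md5_altered)

# An exact copy of the original content gets the same fingerprint.
md5_copy = hashlib.md5(bytes(original)).hexdigest()
print("copy matches original:", md5_copy == md5_original)
print("altered matches original:", md5_altered == md5_original)
```

The two printed fingerprints should look unrelated even though the inputs are nearly identical, while the copy matches the original exactly.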

## Manifests

It's not uncommon to store a list of files and their fixity values in a special file called a *manifest*. A manifest is an example of *external metadata*. The idea of a manifest is not unique to digital curation, and comes from an [older practice](https://en.wikipedia.org/wiki/Manifest_(transportation)) from transportation. When shipping things long distances by boat it was (and still is) very important to make sure what was put on the boat didn't disappear en route. Below is an example of a shipping manifest for *people* who were immigrating into the United States from Turkey.

<img src="https://raw.githubusercontent.com/edsu/inst341/master/modules/module-05/images/manifest.jpg">

The same concept that is used to track things as they move through space can be applied to things as they travel in time. A manifest simply lists the things we expect to be present, and what their state should be.

In light of the D'Ignazio and Klein chapter, it's important to consider who the manifest is being made by, what it contains (and doesn't contain), and who it is being made for. As an analog to that question, think about the shipping manifest above, and how Armenian names are westernized as they are recorded in the receipt manifest at Ellis Island. Could there be a parallel for manifests in digital preservation?

## Generating Fixity

In this notebook we will experiment with generating fixity values, and storing them in a machine readable manifest. We will also check the manifest to make sure the files look ok.

First let's install some data to work with. We're going to use the inst341data package instead of Google Drive this week so that you can get data customized for you during the exercise. But first we're going to download the generic data for the class to use to illustrate some examples.

In [1]:
! pip install --quiet inst341data

import inst341data
inst341data.get_module_5('inst341')

  Building wheel for inst341data (setup.py) ... done
Downloaded inst341


We can create a Path object for the data in the `inst341` directory that was just created on the file system. Then we can use it to print out the files in the directory.

In [2]:
from pathlib import Path

data = Path('inst341')
for p in data.iterdir():
  print(p)

inst341/591796.xml
inst341/185621.kmz
inst341/731818.html
inst341/327479.html
inst341/016351.txt
inst341/395504.txt
inst341/436739.txt
inst341/303329.png
inst341/109068.jpg
inst341/061020.html
inst341/250963.html
inst341/134884.html
inst341/826550.xml
inst341/158095.txt
inst341/344176.txt
inst341/282539.html
inst341/283668.txt
inst341/492605.html
inst341/170887.txt
inst341/450254.txt
inst341/476277.csv
inst341/319760.txt
inst341/484768.html
inst341/600980.html
inst341/manifest.json
inst341/836391.html


We will be looking at these more in a moment, but for now notice that there are a bunch of numbered files with different extensions as well as a `manifest.json` file.

In order to calculate the `fixity` value for one of these files we're going to create a little function that uses Python's [hashlib](https://docs.python.org/3/library/hashlib.html?highlight=hashlib#module-hashlib) module to make it easy to generate a [SHA256](https://en.wikipedia.org/wiki/SHA-2) checksum for a `Path` object. SHA256 is a hashing algorithm similar to the MD5 discussed above.

In [24]:
import hashlib

def get_sha256(p):
  h = hashlib.sha256()
  data = p.open('rb').read()
  h.update(data)
  return h.hexdigest()  

Let's try using our `get_sha256` function by passing it a `Path` object for one of our files:

In [25]:
get_sha256(Path('inst341/492605.html'))

'aa0f647bb649edf138b984356098bd412fcaa724c61fa802c6e741fd33886fee'

So the value `aa0f647bb649edf138b984356098bd412fcaa724c61fa802c6e741fd33886fee` is a fingerprint that, for practical purposes, uniquely identifies the contents of the file stored at `inst341/492605.html`.

## Reading a Manifest

The `inst341/manifest.json` file is a manifest for all the files and their fixities stored in the [JavaScript Object Notation (JSON)](https://en.wikipedia.org/wiki/JSON) format. You probably have used JSON in your INST126 or INST326 classes since it's one of the most common data formats on the Web.

This particular JSON file contains a `list` of `objects`, or as they are called in Python, `dictionaries`. Each of these dictionaries contains two key/value pairs: `path` and `sha256`.

There are many different formats for manifests that are used in the digital preservation community. However, no matter the representation, the concept is essentially the same: you need a file name and a fixity value.
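
To make the shape of such a manifest concrete, here is a minimal sketch of how a list of `path` and `sha256` entries could be built and written as JSON. It is not how the class `manifest.json` was actually produced; the temporary directory, the file names `a.txt` and `b.txt`, and their contents are all made up for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Work in a throwaway directory so nothing in inst341/ is touched.
workdir = Path(tempfile.mkdtemp())
(workdir / "a.txt").write_bytes(b"hello\n")
(workdir / "b.txt").write_bytes(b"world\n")

# One dictionary per file, using the same two keys as the class manifest.
entries = []
for p in sorted(workdir.iterdir()):
  digest = hashlib.sha256(p.read_bytes()).hexdigest()
  entries.append({"path": str(p), "sha256": digest})

# Write the list out as JSON (after hashing, so the manifest does not list itself).
manifest_path = workdir / "manifest.json"
manifest_path.write_text(json.dumps(entries, indent=2))

# Read it back to confirm it is still a list of dictionaries.
loaded = json.loads(manifest_path.read_text())
for entry in loaded:
  print(entry["path"], entry["sha256"])
```

The exact paths printed will vary from run to run because the temporary directory gets a random name, but each line should pair a path with a 64-character SHA256 value.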

Reading in our JSON manifest is relatively easy with Python's [json](https://docs.python.org/3/library/json.html) module. We just need to open the file and pass the file object to `json.load` which will parse all the JSON data into Python native data structures (a list of dictionaries) that we can then use like any list or dictionary.

In [6]:
import json

manifest = json.load(open('inst341/manifest.json'))

Once we have read it in we can then go through each item in the manifest and see the filename and the sha256 value.

In [7]:
for entry in manifest:
  print(entry['path'], entry['sha256'])

inst341/484768.html aea41602cbc7f44e4c399bd88d2de54d2888c7ee7bf265842d82d3b28c964c7d
inst341/319760.txt 067aa5090b866d1d096566f29bbebecb54eee65784e191cd4163f746675e27da
inst341/492605.html aa0f647bb649edf138b984356098bd412fcaa724c61fa802c6e741fd33886fee
inst341/826550.xml 0fea1e400379ea6d9c098ec4c6e92b94a1dd630b34b5073cbc903c6df2525a71
inst341/134884.html 5941e809a59fe455fed36a356dfa1814b584c8ca1bec3c205dc71b0d8adc2bbd
inst341/170887.txt d91972e1a035460c97d39f5ec6aeab16a6bdd0f69201f3a7524d67a5581745d1
inst341/836391.html ebc1c7d9112dc304442b1ccba8bd40646fc837d3f86d6875efdc296b75cfed91
inst341/250963.html a6cfa3970c12fe65dfb9f1d43df8585d4fd0206688f7c9c3514bf7993bc4228d
inst341/436739.txt 589881e15694ad6b46c9a07e085d805e4eda1c200526a13f5d6844d091766483
inst341/016351.txt 0076f44ae20cefd13b75ee2b839f6bc67ccb0ac42487b826da250dbf84cb4791
inst341/282539.html 9ef2cb9e8df9ea12cf9042cda75b7160c0bf085df2133fcef8ca80d06f25fde7
inst341/327479.html 8f125e336cb4b1b2239805f5751199f747a276f15688e83ec9

Remember, these are the files and sha256 values *in the manifest*. Hopefully they match the files we see on the file system. But you won't know until the manifest is *validated*.

## Validate the Manifest

Now let's put all the pieces together to read in our manifest (`inst341/manifest.json`) and verify that each path's sha256 value matches what is found on the file system. We do this by calculating the sha256 by giving the `get_sha256` function a Path for a file, and comparing the result with what the manifest says it should be.

In [26]:
import json
import pathlib

manifest = json.load(open('inst341/manifest.json'))

for entry in manifest:
  p = pathlib.Path(entry['path'])
  sha256 = get_sha256(p)

  if sha256 == entry['sha256']:
    print(p, 'is ok')
  else:
    print(p, 'is invalid: found', sha256, 'but expected', entry['sha256'])

inst341/484768.html is ok
inst341/319760.txt is ok
inst341/492605.html is ok
inst341/826550.xml is ok
inst341/134884.html is ok
inst341/170887.txt is ok
inst341/836391.html is ok
inst341/250963.html is ok
inst341/436739.txt is ok
inst341/016351.txt is ok
inst341/282539.html is ok
inst341/327479.html is ok
inst341/600980.html is ok
inst341/061020.html is ok
inst341/591796.xml is ok
inst341/303329.png is ok
inst341/450254.txt is ok
inst341/185621.kmz is ok
inst341/344176.txt is ok
inst341/158095.txt is ok
inst341/476277.csv is ok
inst341/283668.txt is ok
inst341/731818.html is ok
inst341/395504.txt is ok
inst341/109068.jpg is ok


Whew, the manifest looks valid! All the files in the manifest have a sha256 value that matches what we find when we recalculate it using the file on the filesystem. That means our data is what we expect it to be!
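
The loop above only looks at files that the manifest lists. Since fixity is also about knowing which files should be present, it can help to compare the set of listed paths with the set of files actually on disk. The cell below is a sketch of that idea using a made-up directory (the names `a.txt`, `b.txt` and `c.txt` and the placeholder sha256 strings are hypothetical, and no hashes are checked here).

```python
import json
import tempfile
from pathlib import Path

# Build a tiny stand-in directory: the manifest lists a.txt and b.txt,
# but on disk b.txt is missing and an unlisted c.txt has appeared.
workdir = Path(tempfile.mkdtemp())
(workdir / "a.txt").write_bytes(b"hello\n")
(workdir / "c.txt").write_bytes(b"surprise\n")
listed_entries = [
  {"path": str(workdir / "a.txt"), "sha256": "(not checked here)"},
  {"path": str(workdir / "b.txt"), "sha256": "(not checked here)"},
]
(workdir / "manifest.json").write_text(json.dumps(listed_entries))

# Compare what the manifest says should exist with what is really there.
demo_manifest = json.loads((workdir / "manifest.json").read_text())
listed = {Path(entry["path"]) for entry in demo_manifest}
present = {p for p in workdir.iterdir() if p.name != "manifest.json"}

missing = sorted(listed - present)   # in the manifest, not on disk
unlisted = sorted(present - listed)  # on disk, not in the manifest

print("missing:", [p.name for p in missing])
print("unlisted:", [p.name for p in unlisted])
```

With the stand-in files above, this reports `b.txt` as missing and `c.txt` as unlisted. Note that the validation loop shown earlier would stop with an error, rather than print a message, if a listed file could not be opened.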

## Exercise

### 1. Get Data

First download your Module 5 data by replacing USERNAME in the string below with your UMD username (the same one you used in the Module 2 and 3 notebooks).

In [None]:
import inst341data

inst341data.get_module_5('USERNAME')

If that generated an error, make sure you have run the cell above that does:

    pip install --quiet inst341data

### 2. Read in Your Manifest

Read in your manifest, which will be in the username directory that was just created. For example, if my username is `edsu` my manifest would be found at `edsu/manifest.json`. Print out each item in the manifest: its path and sha256 value.

### 3. Validate Your Manifest

Use the example above to validate your manifest. Remember you want to validate *your* files not the ones in the `inst341` directory. Are there any files that failed validation?

### 4. **Optional:** Create a Validation Function

If you'd like a challenge see if you can create a function called `validate` that is given the path to a manifest and will return True or False depending on whether the manifest is valid or not.


### 5. **Really Optional:** Efficiency

Do you see any problem with the `get_sha256` function above? How could it be improved?