Format Corpus Analysis
======================

Here, we run Python scripts that scan through the contents of [various test corpora](http://www.digipres.org/real-data/), invoke format identification tools on each file, and analyse the results.

Using an [IPython Notebook](http://ipython.org/notebook.html) makes it very easy to regenerate the results by [re-running](https://github.com/paulgb/runipy) these analyses as part of a [continuous integration process](https://travis-ci.org/openplanets/format-corpus). The notebook also generates output that is easy to publish on the web as static pages.

See also https://github.com/richardlehane/comparator

The tools can be run from the command line as follows, where `DIR` is the directory to scan:

    sf DIR

    java -jar ~/droid/droid-command-line-6.1.5.jar -Ns ~/.droid6/signature_files/DROID_SignatureFile_V81.xml -Nc ~/.droid6/container_sigs/container-signature-20150218.xml -recurse -Nr DIR

    droid -Ns ~/.droid6/signature_files/DROID_SignatureFile_V81.xml -Nc ~/.droid6/container_sigs/container-signature-20150218.xml -recurse -Nr DIR
    

    python fido.py -recurse DIR
    
    find systems-showcase-files -type f -exec file -I {} \;
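
To use the output of the `file -I` command in later analysis, each line has to be split into a path and a MIME type. The following is a minimal sketch of that step, assuming the usual `path: type/subtype; charset=name` layout of `file -I` output. The helper name `parse_file_mime` and the sample line are hypothetical and are not part of the original notebook.

```python
def parse_file_mime(line):
    """Split one line of `file -I` output into (path, mime_type, charset)."""
    # The path and the description are separated by the last ": " on the line.
    path, _, description = line.rstrip("\n").rpartition(": ")
    # The description is assumed to look like "text/plain; charset=us-ascii".
    mime_type, _, params = description.partition(";")
    charset = None
    params = params.strip()
    if params.startswith("charset="):
        charset = params[len("charset="):]
    return path, mime_type.strip(), charset


# Illustrative input only, not real tool output from the corpus:
sample = "systems-showcase-files/readme.md: text/plain; charset=us-ascii"
print(parse_file_mime(sample))
```

Splitting on the last `": "` rather than the first keeps paths that themselves contain a colon intact.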
    

Outline of the process
----------------------

### 1. Update the remote corpora

Where appropriate, the format corpus pulls in existing corpora by remote reference rather than duplicating them in the main repository. Therefore, the first step is to create or update the local copies of those resources.

### 2. Run the tools

We then run various tools of interest, and collect the results.

### 3. Summarise the results

We then summarise the results from the various tools.
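
As a rough illustration of what such a summary could look like, the sketch below counts how many files were assigned each type. It assumes tab-separated result lines with the identified type in the second column, which is the layout the Tika scanning loop later in this notebook writes to `scan-results.out`. The function name `summarise_types` and the sample lines are hypothetical.

```python
from collections import Counter


def summarise_types(lines):
    """Count how many files were assigned each type in tab-separated result lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            # Skip blank or malformed lines.
            continue
        # Column 2 holds the identified type ("None" when no type was found).
        counts[fields[1]] += 1
    return counts


# Illustrative lines only (path, type, return code, stderr flag, duration):
sample_lines = [
    "a/one.pdf\tapplication/pdf\t0\tFalse\t1.2\n",
    "a/two.pdf\tapplication/pdf\t0\tFalse\t1.1\n",
    "b/three.bin\tNone\t0\tTrue\t0.9\n",
]
for mime_type, n in summarise_types(sample_lines).most_common():
    print(mime_type, n)
```

In practice the lines would come from iterating over the opened results file rather than from a hard-coded list.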

### 4. Compare with previous results

We take the latest results and combine them with earlier sets of results, in order to see how things have changed over time.
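
One simple way to see what has changed is to compare two mappings from file path to identified type. The sketch below is an assumption about how such a comparison might be structured, not the method this notebook uses. The function name `compare_results` and the sample data are hypothetical.

```python
def compare_results(previous, latest):
    """Compare two {path: type} mappings and report what changed between them."""
    changes = {"added": [], "removed": [], "changed": []}
    for path in sorted(set(previous) | set(latest)):
        if path not in previous:
            changes["added"].append(path)
        elif path not in latest:
            changes["removed"].append(path)
        elif previous[path] != latest[path]:
            # Record the path together with its old and new type.
            changes["changed"].append((path, previous[path], latest[path]))
    return changes


# Illustrative data only:
previous = {"a.pdf": "application/pdf", "b.bin": "application/octet-stream", "c.txt": "text/plain"}
latest = {"a.pdf": "application/pdf", "b.bin": "image/jp2", "d.png": "image/png"}
print(compare_results(previous, latest))
```

Keeping added, removed and changed files in separate lists makes it easier to tell a change in tool behaviour apart from a change in the corpus itself.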

### 5. Generate website

The data and graphs produced in this way are then used to build a static website via Jekyll.

1. Update the remote corpora
----------------------------

...???...

2. Run the tools
----------------

...

In [7]:
# __future__ imports must come before any other statement when run as a script:
from __future__ import print_function
import os
import subprocess
import time

def run_command(cmd):
    '''Start cmd (a list of arguments) and return the Popen object, with stdin, stdout and stderr piped.'''
    return subprocess.Popen(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            stdin=subprocess.PIPE)

def run_tika(fp, out_fp):
    '''Run "tika -m" on fp, saving stdout to out_fp.out and any stderr to out_fp.err.'''
    start_time = time.time()
    p = run_command(["tika", "-m", fp])
    # communicate() reads both pipes until the process exits, so a full
    # stderr buffer cannot stall the child and no trailing output is lost:
    out, errs = p.communicate()
    with open(out_fp + ".out", 'wb') as of:
        of.write(out)
    tika_type = None
    for line in out.splitlines():
        # Convert bytes to string, as UTF-8:
        line = line.decode('utf-8', 'replace')
        if "Content-Type" in line:
            tika_type = line.split(':', 1)[1].strip()
    # Determine run-time (wall-clock seconds):
    run_time = time.time() - start_time
    # Record stderr output, if any:
    has_stderr = len(errs) > 0
    if has_stderr:
        with open(out_fp + ".err", 'wb') as ef:
            ef.write(errs)
    # Return:
    return { 'type': tika_type, 'returncode': p.returncode, 'has_stderr': has_stderr, 'duration': run_time }


prefix = '/Users/andy/Documents/workspace/format-corpus/'
indir = prefix+'corpora/'
outdir = prefix+'tool-output/tika/'
count = 0

of = open(prefix+"scan-results.out",'w')

for root, dirs, filenames in os.walk(indir):
    rel_path = root.replace(indir, "")
    out_name = rel_path.replace("/",".")
    
    for f in filenames:
        # Set up input and output filenames:
        fp = os.path.join(root, f)
        rel_fp = os.path.join(rel_path,f)
        out_fp = os.path.join(outdir,out_name+"."+f)
        out_fp = out_fp.replace(" ","_")
        # Run tools
        print("Running Tika on",f)
        tika = run_tika(fp,out_fp)
        of.write("%s\t%s\t%s\t%s\t%s\n" % (rel_fp, tika['type'], tika['returncode'], tika['has_stderr'], tika['duration']))
        # Count files processed:
        count+=1
        # Only process one file per folder right now:
        break
    
of.close()

print("DONE")
        

Running Tika on Curation outline3.nmind.tar
Running Tika on 00000019.300.tif
Running Tika on Neddy_Flyer_ft_HeatherRyan.jpg
Running Tika on Aesops-Fables.azw
Running Tika on .DS_Store
Running Tika on create-variations.sh
Running Tika on lorem-ipsum-openprintcopypw.pdf
Running Tika on lorem-ipsum-plus-image-updated-opencopyprintpw.pdf
Running Tika on readme.md
Running Tika on MAPS.ARJ
Running Tika on readme.md
Running Tika on !
Running Tika on .gitignore
Running Tika on null
Running Tika on readme.md
Running Tika on 008677.pdf
Running Tika on 020747.pdf
Running Tika on balloon.j2c
Running Tika on diagram.png
Running Tika on balloon.jpg
Running Tika on balloon_eciRGBv2.tif
Running Tika on balloon.tif
Running Tika on ConceptDraw Format metadata template.csv
Running Tika on copac-uknuc.mmp
Running Tika on Curation outline 3.nmind
Running Tika on readme.md
Running Tika on KSBASE.WK1
Running Tika on PEYTREND.WK3
Running Tika on KSBASE.WQ1
Running Tika on KS4000.WQ2
Running Tika on MonteCarlo