# Microfiche to Dataset: The Process

## Exploring Files

### Initial File Structure and Count

Initial Structure:
``` 
MICROFICHE/
    BATCH/
        CALLSIGN/
            FILING/
                SCAN.jpg
```

Initial Naming Convention:
```
Batch #/CALLSIGN/MONTH YYYY - NOTE/*(#).jpg 
Example:

~/Batch_25/KNKN991/APRIL 1991 STEP 1 OF 4:
KNKN991-APRIL-1991-STEP 1 OF 4- (1).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (10).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (11).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (12).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (13).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (14).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (2).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (3).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (4).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (5).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (6).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (7).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (8).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (9).jpg
```


### Useful Functions

Used `cd`, `ls`, `wc` functions to understand directory structure, and count folders and files:

```bash
# navigate to parent folder
cd MICROFICHE/ 
# list one BATCH folder per line PIPE to count lines = count of folders
ls -1 BATCH* | wc -l 
# find all jpg (incl. subdirectories) PIPE to count
$(find $(pwd) -name "*.jpg" | wc -l) 
# lists all files incl. subdirectories
ls -R 
```

### Issues
- WHITESPACE
    - Difficult to pass files into functions as arguments. Unless whitespace is escaped, file names are read as *multiple* arugments, leading to errors. 
- SORTING
    - Images displayed in the order: 1, 10, 11, 12,.. 19, 2, 20.. etc.
    - FILING folder name didn't include CALLSIGN -- combining into a single folder for chronological sorting wasn't possible as multiple FILING's were named "MAY 1994", (with different parent BATCH/CALLSIGN/)
    - BATCH designation was unclear. It was unclear if BATCH/CALLSIGN folders were comprehensive for that particular callsign (e.g. KNKN991).
- INTERPRETATION
    - Date, and NOTE were based on quick readings of the folder contents. However, each FILING contained multiple dates. It was not clear whether the date was the signature date, date received/filed, approval date. Dates in a single FILING could span over 12 months. 

### Final Structure After Preprocessing

## Tidying File Paths

Below are transformations I used to tidy file paths (folder and file names) before passing files to `imagemagick` `convert` for image transformation. 

### Set path variables as shortcuts

In [3]:
MICROFICHE=~/Desktop/test-scans;
WORK=~/Desktop/test-working;
clean=${WORK}/cleaningLog.txt;

### Replace whitespace in BATCH/ with underscore

Reference: [Parameter Expansion - Search and Replace, bash-hackers.org](http://wiki.bash-hackers.org/syntax/pe#search_and_replace)

In [50]:
# Remove whitespace from BATCH/ folder name
# Use Search and Replace
# loop over every BATCH/ folder

cd $MICROFICHE
for f in *[0-9];
do echo mv -v $f ${f/\ /_}; # TEST RUN
# mv -v "${f}" ${f/\ /_};
done

# UNCOMMENT to run

mv -v Batch 25 Batch_25
Batch 25 -> Batch_25
mv -v Batch 26 Batch_26
Batch 26 -> Batch_26
mv -v Batch 27 Batch_27
Batch 27 -> Batch_27
mv -v Batch 28 Batch_28
Batch 28 -> Batch_28
mv -v Batch 29 Batch_29
Batch 29 -> Batch_29
mv -v Batch 30 Batch_30
Batch 30 -> Batch_30


### Rename FILING folders

- Remove NOTE
- Convert MONTH words to numbers "mm"



[Incrementing using Double Parenthesis](http://tldp.org/LDP/abs/html/dblparens.html)

In [3]:
cd $MICROFICHE

for BATCH in */;
do cd $MICROFICHE/$BATCH;
# echo $BATCH;
    for CALLSIGN in *;
    do cd $MICROFICHE/$BATCH/$CALLSIGN;
    echo $CALLSIGN/:;
    i=0;
        for FILING in *;
        do 
            rm1=${FILING/\ /_}; # replace first whitespace btwn MONTH YYYY with '_'
            rm2=${rm1%% *}; # remove all characters up to last whitespace from end
            rm3=${rm2%%-*}; # remove final '-'
            rm4="${rm3//_/ }"; # replace all underscore with whitespace
            case "$rm4" in # replace MONTH with MM
                "JAN"* ) MM='01';;
                "FEB"* ) MM='02';;
                "MAR"* ) MM='03';;
                "APR"* ) MM='04';;
                "MAY"* ) MM='05';;
                "JUN"* ) MM='06';;
                "JUL"* ) MM='07';;
                "AUG"* ) MM='08';;
                "SEP"* ) MM='09';;
                "OCT"* ) MM='10';;
                "NOV"* ) MM='11';;
                "DEC"* ) MM='12';;
                esac;
            YYYY=${rm4#* }; # extract YYYY from $rm4
            mv -v "${FILING}" $YYYY-$MM-$CALLSIGN-0$i >> $clean
            ((i++)); ## crude numbering for same MONTH YYYY but different NOTE
            # TESTING
            #echo $FILING;
            #echo $rm1;
            #echo $rm2;
            #echo $rm3;
            #echo $rm4;
            #echo $i
        done
    tail -n $i $clean; ## print renames to screen
    done
done


Adobe/:
bash: $clean: ambiguous redirect



### Check for irregular FILING folder names

In [41]:
# CHECK FOR folder name length > 18 [yyyy-mm-callsign-0i]
cd $MICROFICHE

i=0;
declare -a fix_list=();
fix_list[0]="Folders to rename";
echo ${fix_list[0]}
for f in */*/*;
do FILING=${f##*/};
    if [ ${#FILING} != 18 ]
    then
	    ((i++));
        fix_list[$i]=${MICROFICHE}/${f};
        echo ${i}: ${fix_list[$i]};
    fi
done

tofix=${i}
echo ${fix_list[0]}: $tofix

Folders to rename
1: /Users/cynthiiee/Desktop/test-scans/Batch_25/KNKN298/1994-11-KNKN298-010
Folders to rename: 1


In [35]:
echo ${fix_list[*]}
        echo ${fix_list[$i]##*/};
        echo ${fix_list[$i]%/*}/;

1
Folders to rename /Users/cynthiiee/Desktop/test-scans/Batch_25/KNKN298/1994-11-KNKN298-010
1994-11-KNKN298-010
/Users/cynthiiee/Desktop/test-scans/Batch_25/KNKN298/


### Correct FILING folder names

REMEMBER TO RENAME!

```
mv -v $tooLong $correctName >> $clean;
tail -n 1 $clean
```

no error: `exit code $?=0` 

error: `exit code $?=1` 
ref:https://www.linuxjournal.com/article/10844

In [36]:
i=1;
until [ ${i} -gt ${tofix} ]; 
do
    read -p "What do you want to rename ${fix_list[$i]##*/} to?" target_list[$i];
        if [ -d ${fix_list[$i]%/*}/${target_list[$i]}]
        then 
            mv -v ${fix_list[$i]} ${target_list[$i]} >> $clean;
            tail -1 $clean;
            ((i++));
        fi
done

What do you want to rename 1994-11-KNKN298-010?


testing reference: http://wiki.bash-hackers.org/commands/classictest

### Remove whitespace from SCAN.jpg

Install [rename utility](http://plasmasturm.org/code/rename/)

```
brew install rename
```

In [149]:
cd $MICROFICHE;
rename -v "s/ *//g" */*/*/*.jpg | wc -l

       0


### Add Leading zero to SCAN.jpg numbering

In [157]:
echo $MICROFICHE

/Users/cynthiiee/Desktop/test-scans


In [173]:
cd $MICROFICHE;
for f in $(find $(pwd) -name "*([1-9])*.jpg");  # find parentheses with single digits
    do
	    # short=${f##*/};
        # echo $short ${short//\(/\(0};
        # echo ${f//\(/\(0}; # replace all ( with (0
        mv -v $f ${f//\(/\(0} # >> $clean;
            # tail -1 $clean;
    done

### BACK UP to 02-.../ 

## Image Transformation

Requires ImageMagick. Check installation using:
`convert -version`

### Split image in two, and convert to gray scale

In [116]:
convert -version

Version: ImageMagick 7.0.6-0 Q16 x86_64 2017-06-12 http://www.imagemagick.org
Copyright: © 1999-2017 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC HDRI Modules 
Delegates (built-in): bzlib freetype jng jpeg ltdl lzma png tiff xml zlib


In [152]:
function graycrop {
	convert 
    -colorspace Gray 
    -crop 50%x100% +repage "$1" "$1";
	rm -v "$1" #>> $clean;
	#tail -1 $clean; #$clean is a log of all operations
}

- Check original number of files
- Run loop
- Check new number of files

```shell
~$ for f in $(find $SCANS -name"*.jpg"); do      echo $f; done | wc -l
    4634
## ^ can return inaccurate count if file path runs over two lines
~$ for f in $(find $SCANS -name"*.jpg"); do      graycrop $f; done | wc -l
    4634
~$ find $SCANS -name "*.jpg" | wc -l
    9268
```    

In [185]:
cd $SCANS
#find $(pwd) -name "*.jpg"
unsplit=$(find $(pwd) -name "*).jpg" | wc -l); # count only unsplit jpgs [name ends in ).jpg]
echo $unsplit;
#find $SCANS -name "*.jpg" 
#for img in $(find $(pwd) -name "*.jpg"); do echo ${img##*/}; done | wc -l


0
2802


In [153]:
for f in $(find $(pwd) -name "*.jpg");
do graycrop $f;
done | wc -l

    1402


In [186]:
split=$(find $(pwd) -name "*-[0,1].jpg" | wc -l)
echo $split;

2802


### [BACK UP] to 03-GrayCrop/

### [SKIP] Check and Correct Split Pages 

These are functions written to rejoin document pages that were incorrectly split due to the original SCAN positioning.

![split doc](incorrect-split.jpg)

This step was used for one set of BATCH files, but eventually skipped in favour of making a note during the [meta-tagging](#section: meta-tagging) stage. 

```bash
rmi(imgno,[0/1]), unsplit(imgno.), mvv(mistake,correction), Lfnlname(), cdmvi()

function unsplit {
local img0=$(ls *\($1\)-0.jpg);
local img1=$(ls *\($1\)-1.jpg);
	convert $img0 $img1 +append ${img0/-0/};
	rm -v $img0 $img1;
	echo "unsplit ${img0/-0/}" >> $clean;
	tail -1 $clean;
}

function cdmvi {
        cd ..;
	mv -v $(pwd)/$fol $menter/ >> $clean;  
	tail -1 $clean;  
	fol='';
}

function rmi {
	rm -v $(find $(pwd) -name "*($1)-$2.jpg") >> $clean;
	tail -1 $clean
}
```

### Rename SCAN for meta-tagging

In [None]:
function Lfnlname {
     local i=1
     local fol=$1
     for f in "$fol"/*.jpg;
     do
	  if [ $i -lt 10 ]
	  then
      		mv -v "$f" "$fol/${fol##*/}-0$i.jpg" >> $clean;
     	  else
      		mv -v "$f" "$fol/${fol##*/}-$i.jpg" >> $clean;
     	  fi
	  tail -1 $clean
          let i=i+1 ## equivalent to ((i++))
     done
}


## $master script for terminal

In [9]:
echo $work;
ls $work/scripts
cat $work/scripts/fcc-project.sh

/Users/cynthiiee/Dropbox/FCC/Working/
Grayson Codes.txt	archive			fcc-project.sh
Grayson codes colin.txt	copyFromListToFolder.sh	list file names.txt
#!/bin/bash
# 
# Structure of file
#
# -re loads re-org functions
#   - mvlater(folder)
#   - mvfol(licence,MMM,yyyy):mv to $mwork
#   - Lmvfol():loops current folder
#   - mvv(folder, target):mv with log to $sesslog
#   - Lnumber(folder): renumber all ([1-9]) to (0[1-]) in given folder & sub folders using find
#
# 1. Set locations
# This includes FCC, microfiche, scripts, docs, xls, progress
#
# 2. Load functions and scripts
#   - scripts contain prompts (``startSess, imgpro, dataE, endSess, mvlater)
#   - functions set global variables, run loops, and group commands
#        entryO(yyyy,mm,licence)
#        [*.jpg].nameorder.sh
#        [*.jpg].grapcrop()
#        unsplit(img1,img2)
#        [*.jpg].finalrename()
#        checkD()
#        [*.jpg].Lfindrepjpg(find,rep)
#        ocr(img,doctype)
#        entryC()

# Set locations

## path

## Meta-Tagging

<a id='section: meta-tagging'></a>