# Microfiche to Dataset: The Process

## Exploring Files

### Initial File Structure and Count

Initial Structure:
``` 
MICROFICHE/
    BATCH/
        CALLSIGN/
            FILING/
                SCAN.jpg
```

Initial Naming Convention:
```
Batch #/CALLSIGN/MONTH YYYY - NOTE/*(#).jpg 
Example:

~/Batch_25/KNKN991/APRIL 1991 STEP 1 OF 4:
KNKN991-APRIL-1991-STEP 1 OF 4- (1).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (10).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (11).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (12).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (13).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (14).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (2).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (3).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (4).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (5).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (6).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (7).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (8).jpg
KNKN991-APRIL-1991-STEP 1 OF 4- (9).jpg
```


### Useful Functions

I used the `cd`, `ls`, `find`, and `wc` commands to understand the directory structure and to count folders and files:

```bash
# navigate to parent folder
cd /Volumes/Seagate/FCC-backups/K5/K5-00-originals
# list the BATCH folders themselves (-d), one per line, PIPE to count lines = count of folders
ls -1d Batch* | wc -l
# find all jpg (incl. subdirectories) PIPE to count
find "$(pwd)" -name "*.jpg" | wc -l
# lists all files incl. subdirectories
ls -R 
```

### Issues
- WHITESPACE
    - Folder and file names containing spaces are difficult to pass to functions as arguments. Unless the whitespace is escaped or the name is quoted, the shell reads one name as *multiple* arguments, leading to errors.
- SORTING
    - Images were displayed in lexical rather than numeric order: 1, 10, 11, 12, ... 19, 2, 20, etc.
    - FILING folder names didn't include the CALLSIGN, so combining them into a single folder for chronological sorting wasn't possible: multiple FILING folders were named "MAY 1994" (with different parent BATCH/CALLSIGN/ folders).
    - The BATCH designation was unclear. It was also unclear whether a BATCH/CALLSIGN folder was comprehensive for that particular callsign (e.g. KNKN991).
- INTERPRETATION
    - The date and NOTE were based on quick readings of the folder contents. However, each FILING contained multiple dates, and it was not clear whether the date in the folder name was the signature date, the date received/filed, or the approval date. Dates in a single FILING could span more than 12 months.
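
The whitespace and sorting issues above can be reproduced in a few lines. This is a minimal sketch: `count_args` is a hypothetical helper written for illustration, not part of the original scripts, and the names passed to `sort` are shortened stand-ins for the real SCAN names.

```bash
# count_args is a hypothetical helper that prints how many arguments it received
count_args() { echo "$#"; }

filing="APRIL 1991 STEP 1 OF 4"

count_args $filing      # unquoted: the shell splits the name on whitespace
count_args "$filing"    # quoted: the name is passed as a single argument

# Plain lexical sorting places (10) before (2)
printf '%s\n' "(1)" "(10)" "(2)" | LC_ALL=C sort

# Zero-padded counters sort in numeric order
printf '%s\n' "(001)" "(010)" "(002)" | LC_ALL=C sort
```

Quoting every variable that holds a path, and zero-padding the counters, are the two fixes applied in the tidying steps that follow.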

### Final Structure After Preprocessing

## Tidying File Paths

Below are transformations I used to tidy file paths (folder and file names) before passing files to `imagemagick` `convert` for image transformation. 

### Set path variables as shortcuts

In [1]:
m00=/Volumes/Seagate/FCC-backups/K5/K5-01-originals;
MICROFICHE=/Volumes/Seagate/FCC-backups/K5/K5-01-initialSort;
WORK=~/Dropbox/UoM/FCC/Working;
clean=${WORK}/documentation/logs/cleaningLog.txt;

### Replace whitespace in BATCH/ with underscore

Reference: [Parameter Expansion - Search and Replace, bash-hackers.org](http://wiki.bash-hackers.org/syntax/pe#search_and_replace)

In [3]:
# Remove whitespace from BATCH/ folder name
# Use Search and Replace
# loop over every BATCH/ folder

cd "$MICROFICHE"
for f in *[0-9];
do 
#echo mv -v "$f" "${f/ /_}"; # TEST RUN
mv -v "$f" "${f/ /_}"; # replace the first space with '_'
done
done

# ls $MICROFICHE
# UNCOMMENT to run

Batch 1 -> Batch_1
Batch 2 -> Batch_2
Batch 3 -> Batch_3
Batch 4 -> Batch_4
Batch 6 -> Batch_6
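
The search-and-replace expansion used in the loop comes in two forms, which matters when a name contains more than one space. A small sketch, using a hypothetical folder name that is not from the archive:

```bash
name="Batch 1 of 2"       # hypothetical folder name with several spaces

first_only=${name/ /_}    # single slash: replace the first match only
all=${name// /_}          # double slash: replace every match

echo "$first_only"
echo "$all"
```

The BATCH folders contain a single space, so the single-slash form is enough there.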


Correcting Batch 1 and 2 names

```bash
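# rootdir is assumed to hold the batch folder path; strip text up to the first '-' from each name
for f in *; do cd "$rootdir/$f"; for l in *; do mv -v "$l" "${l#*-}"; done; done
# TEST RUN: preview replacing every '-' with a space
for f in *; do cd "$rootdir/$f"; for l in *; do echo mv "$l" "${l//-/ }"; done; done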
```
```bash
batchdir=$(pwd)

for CALLSIGN in *;
do cd "$batchdir/$CALLSIGN";
    i=0;
    for FILING in *;
    do
        MM=xx;
        YYYY=xxxx;
        lic=${FILING%%-*};    # text before the first '-'
        month=${FILING##*-};  # text after the last '-' (MONTH)
        YYYY=${FILING:8:4};   # 4 characters starting at offset 8 (the YYYY position in these names)
        case "$month" in
            "JAN"* ) MM='01';;
            "FEB"* ) MM='02';;
            "MAR"* ) MM='03';;
            "APR"* ) MM='04';;
            "MAY"* ) MM='05';;
            "JUN"* ) MM='06';;
            "JUL"* ) MM='07';;
            "AUG"* ) MM='08';;
            "SEP"* ) MM='09';;
            "OCT"* ) MM='10';;
            "NOV"* ) MM='11';;
            "DEC"* ) MM='12';;
        esac;
        # TEST RUN: print the rename instead of performing it
        echo mv -v "${FILING}" "->" "../$YYYY-$MM-${CALLSIGN/-/_}-0$i";
        unset -v lic month MM YYYY;
        ((i++));
    done;
    cd "$batchdir"
done;
```

### Rename FILING folders

- Remove NOTE
- Convert MONTH words to numbers "mm"
- Remove CALLSIGN folder level



[Incrementing using Double Parenthesis](http://tldp.org/LDP/abs/html/dblparens.html)

In [None]:

MICROFICHE=/Volumes/Seagate/FCC-backups/K5/K5-01-initialSort/
cd "$MICROFICHE"
BATCH=Batch_3

#for BATCH in *;
#do
    cd "$MICROFICHE/$BATCH";
    for CALLSIGN in *;
    do cd "$MICROFICHE/$BATCH/$CALLSIGN";
    #echo ${BATCH}/${CALLSIGN}/:;
    i=0;
        for FILING in *;
        do 
            MM=xx;
            YYYY=xxxx;
            rm1=${FILING/ /_}; # replace first whitespace btwn MONTH YYYY with '_'
            rm2=${rm1%% *}; # drop everything from the next whitespace onward (the NOTE)
            rm3=${rm2%%-*}; # drop everything from the first '-' onward
            rm4="${rm3//_/ }"; # replace all underscore with whitespace
            case "$rm4" in # replace MONTH with MM
                "JAN"* ) MM='01';;
                "FEB"* ) MM='02';;
                "MAR"* ) MM='03';;
                "APR"* ) MM='04';;
                "MAY"* ) MM='05';;
                "JUN"* ) MM='06';;
                "JUL"* ) MM='07';;
                "AUG"* ) MM='08';;
                "SEP"* ) MM='09';;
                "OCT"* ) MM='10';;
                "NOV"* ) MM='11';;
                "DEC"* ) MM='12';;
                esac;
            YYYY=${rm4#* }; # extract YYYY from $rm4
            # TEST RUN: print the rename; the target matches the mv below
            echo mv -v "${FILING}" "->" "../../$YYYY-$MM-${CALLSIGN/-/_}-0$i"
            #mv -v "${FILING}" "../../$YYYY-$MM-${CALLSIGN/-/_}-0$i" >> "$clean"
            unset -v rm1 rm2 rm3 rm4 MM YYYY
            ((i++)); ## crude numbering for same MONTH YYYY but different NOTE
            # TESTING
            #echo $FILING;
            #echo $rm1;
            #echo $rm2;
            #echo $rm3;
            #echo $rm4;
            #echo $i
        done
    #tail -n $i $clean; ## print renames to screen
    done
#done | less
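
The chain of expansions is easier to check in isolation. The sketch below wraps the same steps in a hypothetical function, `filing_prefix`, which only prints the `YYYY-MM` prefix and touches no files. The function name and the "RENEWAL" note are assumptions for illustration.

```bash
# filing_prefix is a hypothetical helper: "MONTH YYYY NOTE" -> "YYYY-MM"
filing_prefix() {
    local filing=$1 month_year mm yyyy
    month_year=${filing/ /_}        # join MONTH and YYYY: "APRIL_1991 STEP 1 OF 4"
    month_year=${month_year%% *}    # drop the NOTE (everything from the next space)
    month_year=${month_year%%-*}    # drop anything from the first '-' onward
    case "$month_year" in
        JAN*) mm=01 ;; FEB*) mm=02 ;; MAR*) mm=03 ;; APR*) mm=04 ;;
        MAY*) mm=05 ;; JUN*) mm=06 ;; JUL*) mm=07 ;; AUG*) mm=08 ;;
        SEP*) mm=09 ;; OCT*) mm=10 ;; NOV*) mm=11 ;; DEC*) mm=12 ;;
        *)    mm=xx ;;              # unknown month: keep the placeholder
    esac
    yyyy=${month_year#*_}           # text after the '_' is the year
    echo "$yyyy-$mm"
}

filing_prefix "APRIL 1991 STEP 1 OF 4"
filing_prefix "MAY 1994 - RENEWAL"
```

Running the function over `ls` output before any `mv` gives a quick way to spot folder names that do not follow the `MONTH YYYY` pattern, since they come back with an `xx` month.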

Batch_3

```bash
batchdir=$(pwd)

for CALLSIGN in *;
do cd $batchdir/$CALLSIGN;
    i=0;
        for FILING in *;
        do
            MM=xx;
            YYYY=xxxx;
            rm1=${FILING/\ /_}; 
            rm2=${rm1%% *}; 
            rm3=${rm2%%-*}; 
            rm4="${rm3//_/ }"; # replace all underscore with whitespace
            case "$rm4" in 
                "JAN"* ) MM='01';;
                "FEB"* ) MM='02';;
                "MAR"* ) MM='03';;
                "APR"* ) MM='04';;
                "MAY"* ) MM='05';;
                "JUN"* ) MM='06';;
                "JUL"* ) MM='07';;
                "AUG"* ) MM='08';;
                "SEP"* ) MM='09';;
                "OCT"* ) MM='10';;
                "NOV"* ) MM='11';;
                "DEC"* ) MM='12';;
                esac;
            YYYY=${rm4#* }; 
            mv -v "${FILING}" $YYYY-$MM-${CALLSIGN/-/_}-0$i 
            unset -v rm1 rm2 rm3 rm4 MM YYYY
            ((i++)); 
        done
        cd $batchdir
    done
```

Batch_6, Batch_4


```bash
batchdir=/Volumes/Seagate/FCC-backups/K5/K5-01-initialSort/Batch_6
cd "$batchdir"  # start in the batch folder so the glob below lists CALLSIGN folders

for CALLSIGN in *;
do cd $batchdir/$CALLSIGN;
    i=0;
        for FILING in *;
        do
            MM=xx;
            YYYY=xxxx;
            rm1=${FILING/\ /_}; 
            rm2=${rm1%% *}; 
            rm3=${rm2%%-*}; 
            rm4="${rm3//_/ }"; # replace all underscore with whitespace
            case "$rm4" in 
                "JAN"* ) MM='01';;
                "FEB"* ) MM='02';;
                "MAR"* ) MM='03';;
                "APR"* ) MM='04';;
                "MAY"* ) MM='05';;
                "JUN"* ) MM='06';;
                "JUL"* ) MM='07';;
                "AUG"* ) MM='08';;
                "SEP"* ) MM='09';;
                "OCT"* ) MM='10';;
                "NOV"* ) MM='11';;
                "DEC"* ) MM='12';;
                esac;
            YYYY=${rm4#* }; 
            #echo $YYYY-$MM-${CALLSIGN/-/_}-0$i $CALLSIGN/$FILING 
            mv -v "${FILING}" $batchdir/$YYYY-$MM-${CALLSIGN/-/_}-0$i 
            unset -v rm1 rm2 rm3 rm4 MM YYYY
            ((i++)); 
        done
        cd $batchdir
    done
```

### Remove whitespace from SCAN.jpg

Install [rename utility](http://plasmasturm.org/code/rename/)

```
brew install rename
```

In [None]:
cd "$MICROFICHE";
for f in *;
do rename -v "s/ *//g" "$f"/*.jpg;
done
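
If installing `rename` is not an option, the same result can be had with plain bash parameter expansion. This is a sketch of an alternative, not the method used for the archive; it runs in a throwaway directory with a made-up folder so it is safe to try.

```bash
# Work in a throwaway directory so the sketch is safe to run anywhere
tmp=$(mktemp -d)
mkdir "$tmp/1991-04-KNKN991-00"
touch "$tmp/1991-04-KNKN991-00/KNKN991-APRIL-1991-STEP 1 OF 4- (1).jpg"

for f in "$tmp"/*/*.jpg; do
    dir=${f%/*}        # folder part of the path
    base=${f##*/}      # file name only
    new=${base// /}    # delete every space in the file name
    # skip files whose names contain no spaces
    [ "$base" = "$new" ] || mv -v "$f" "$dir/$new"
done

ls "$tmp/1991-04-KNKN991-00"
```

Splitting the path into `dir` and `base` keeps the substitution away from the folder names, which matters if a parent folder still contains a space.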

### Add Leading zero to SCAN.jpg numbering

In [None]:
echo $MICROFICHE

In [None]:
cd "$MICROFICHE";
for f in $(find "$(pwd)" -name "*([1-9])*.jpg");  # find parentheses with a single digit
    do
        #echo 
        mv -v "$f" "${f//\(/(00}"
    done
    
for f in $(find "$(pwd)" -name "*([1-9][0-9])*.jpg");  # find parentheses with two digits
    do
        #echo 
        mv -v "$f" "${f//\(/(0}"
    done
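
The two passes can also be collapsed into one by letting `printf` do the padding. The sketch below only computes the new name; `pad_scan` is a hypothetical helper, and it assumes each name contains at most one `(N)` counter.

```bash
# pad_scan is a hypothetical helper: pads the "(N)" counter to three digits
pad_scan() {
    local name=$1
    local re='^(.*)\(([0-9]+)\)(.*)$'
    if [[ $name =~ $re ]]; then
        # 10# forces base 10 so that a counter such as 08 is not read as octal
        printf '%s(%03d)%s\n' "${BASH_REMATCH[1]}" "$((10#${BASH_REMATCH[2]}))" "${BASH_REMATCH[3]}"
    else
        printf '%s\n' "$name"    # no "(N)" counter found: leave the name unchanged
    fi
}

pad_scan "KNKN991-APRIL-1991-STEP1OF4-(7).jpg"
pad_scan "KNKN991-APRIL-1991-STEP1OF4-(12).jpg"
```

Names without a counter, such as the two listed under manual fixes, pass through unchanged and still need to be corrected by hand.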

Manual fixes

KNKN256-NOVEMBER-1989-l001458.jpg

KNKN316-AUGUST-1993-2--001096.jpg


### BACK UP to 02-.../ 

In [None]:
MICROFICHE=~/Desktop/K4-02-renamed
echo $MICROFICHE

```
./1998-11-KNKQ449-00:
KNKQ449-NOVEMBER-1998-(001).jpg
KNKQ449-NOVEMBER-1998-(002).jpg
KNKQ449-NOVEMBER-1998-(003).jpg
```

```
MICROFICHE/
    YYYY-MM-CALLSIGN-xx/
        YYYY-MM-CALLSIGN-xx-001.jpg
```

## Image Transformation

Requires ImageMagick. Check installation using:
`convert -version`

Install:
`brew install imagemagick`

### Split image in two, and convert to gray scale

In [None]:
convert -version

In [None]:
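function graycrop {
    # read "$1", convert to gray scale, split into left/right halves (NAME-0.jpg, NAME-1.jpg)
    convert "$1" -colorspace Gray -crop 50%x100% +repage "$1";
    rm -v "$1" >> "$clean"; # remove the unsplit original and log it
    tail -1 "$clean"; #$clean is a log of all operations
}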

- Check original number of files
- Run loop
- Check new number of files

pwd=/Volumes/Seagate/FCC-backups/K5/K5-03-grayCrop

```shell
~$ for f in $(find $(pwd) -name "*.jpg"); do      echo ${f##*/}; done | wc -l
    4634
## ^ can return inaccurate count if file path runs over two lines
~$ for f in $(find $(pwd) -name "*.jpg"); do      graycrop $f; done | wc -l
    4634
~$ find $SCANS -name "*.jpg" | wc -l
    9268
```    
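
The counts above rely on one path per line, which is the weakness noted in the comment. A NUL-delimited count avoids it. The sketch below uses a throwaway directory with made-up file names, one of which contains a newline, to show the difference.

```bash
tmp=$(mktemp -d)
touch "$tmp/a (1).jpg" "$tmp/b.jpg"
touch "$tmp/c"$'\n'"d.jpg"    # a file name containing a newline

# Line-based count: the newline inside one name inflates the total
by_lines=$(find "$tmp" -name "*.jpg" | wc -l | tr -d ' ')

# NUL-delimited count: each path ends in exactly one NUL byte, so count those
by_nul=$(find "$tmp" -name "*.jpg" -print0 | tr -dc '\0' | wc -c | tr -d ' ')

echo "lines: $by_lines  files: $by_nul"
```

The same `-print0` output can feed a `while IFS= read -r -d '' f` loop when each file needs to be processed rather than counted.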

### [BACK UP] to 03-GrayCrop/

### [SKIP] Check and Correct Split Pages 

These are functions written to rejoin document pages that were incorrectly split due to the original SCAN positioning.

![split doc](incorrect-split.jpg)

This step was used for one set of BATCH files, but eventually skipped in favour of making a note during the [meta-tagging](#section: meta-tagging) stage. 

```bash
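# Helpers: rmi(imgno,[0/1]), unsplit(imgno), mvv(mistake,correction), Lfnlname(), cdmvi()

# rejoin the two halves of scan number $1 side by side
function unsplit {
    local img0=$(ls *\($1\)-0.jpg);
    local img1=$(ls *\($1\)-1.jpg);
    local joined="${img0%-0.jpg}.jpg"; # original file name, without the -0 suffix
    convert "$img0" "$img1" +append "$joined";
    rm -v "$img0" "$img1";
    echo "unsplit $joined" >> "$clean";
    tail -1 "$clean";
}

# move folder $fol into $menter (both are expected to be set beforehand)
function cdmvi {
    cd ..;
    mv -v "$(pwd)/$fol" "$menter/" >> "$clean";
    tail -1 "$clean";
    fol='';
}

# delete one half ($2 = 0 or 1) of scan number $1
function rmi {
    find "$(pwd)" -name "*($1)-$2.jpg" -exec rm -v {} + >> "$clean";
    tail -1 "$clean"
}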
```

### Rename SCAN for meta-tagging

In [None]:
function Lfnlname {
     local i=1
     local fol=$1
     local num
     for f in "$fol"/*.jpg;
     do
          printf -v num '%03d' "$i"   # zero-pad the counter to three digits
          mv -v "$f" "$fol/${fol##*/}-$num.jpg";
          i=$((i+1))
     done
}

Move jpgs into year folders for tagging

```bash
mkdir 1991

for f in 1991-*; do mv "$f"/*.jpg 1991/; done

rmdir 1991-* # check if all empty
```
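
The same move can be repeated for every year by reading the year from the folder name. This is a sketch under the assumption that all FILING folders follow the `YYYY-MM-CALLSIGN-xx` pattern; it runs in a throwaway directory with made-up folders.

```bash
tmp=$(mktemp -d)
mkdir "$tmp/1991-04-KNKN991-00" "$tmp/1994-05-KNKN991-00"
touch "$tmp/1991-04-KNKN991-00/1991-04-KNKN991-00-001.jpg"
touch "$tmp/1994-05-KNKN991-00/1994-05-KNKN991-00-001.jpg"

(
    cd "$tmp" || exit 1
    for d in [0-9][0-9][0-9][0-9]-*/; do
        d=${d%/}              # strip the trailing slash added by the glob
        year=${d%%-*}         # text before the first '-' is the year
        mkdir -p "$year"
        mv -v "$d"/*.jpg "$year"/
        rmdir "$d"            # only succeeds if the folder is now empty
    done
)

ls "$tmp"
```

As in the single-year version, `rmdir` doubles as the check: any folder that still holds files is left in place and reported.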

## $master script for terminal

In [None]:
echo "$WORK";
ls "$WORK/scripts"
cat "$WORK/scripts/fcc-project.sh"

## Meta-Tagging

<a id='section: meta-tagging'></a>