# Determine BATCH II Scan Errors

#### Updated: April 13, 2023 by Ashley Ferreira

#### Introduction (to the issues we attempt to quantify)
There are two potential issues with the scanning that we are trying to isolate, and we want to estimate how prevalent each one is:
1. Images are cropped too soon (causes aspect ratio to be less wide than expected)
2. Images are out of phase (can see two sets of metadata on either side of ionogram instead of middle)

Issue 1 is the easiest to detect: read in the pixel dimensions and calculate the aspect ratio. Issue 2 requires an optical character recognition (OCR) program to estimate the number of characters in the metadata. Both issues prevent us from seeing the ionogram trace correctly. Some images, such as the first and last ones in most subdirectories, are expected to be off and lack the normal ionogram aspect ratio, so a small rate of flagged images is normal.
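
As a rough illustration of the two checks, the sketch below flags a single image from its pixel dimensions and digit count. The function name `flag_scan` and its arguments are hypothetical. The 1.2 and 3 aspect ratio bounds are the exploratory limits that appear in `read_all_rolls()`, and 15 is the usual number of metadata digits noted in the `read_image()` docstring. Treating more than 15 digits as a sign of an out-of-phase image is an assumption made here, not a validated threshold.

```python
def flag_scan(height, width, char_count, expected_chars=15,
              min_ratio=1.2, max_ratio=3.0):
    """Return (aspect_ratio, off_aspect_ratio, possibly_out_of_phase)."""
    # issue 1: images cropped too soon have an unusual width/height ratio
    aspect_ratio = width / height
    off_aspect_ratio = aspect_ratio < min_ratio or aspect_ratio > max_ratio

    # issue 2 (assumed rule): more digits than expected suggests that
    # two sets of metadata are visible in the same image
    possibly_out_of_phase = char_count > expected_chars

    return aspect_ratio, off_aspect_ratio, possibly_out_of_phase


# made-up dimensions and counts, for illustration only
print(flag_scan(height=400, width=800, char_count=15))
print(flag_scan(height=400, width=400, char_count=30))
```

Both thresholds are keyword arguments so they can be tuned once the real distribution of aspect ratios and character counts is known.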

#### Setup 
You will likely need to pip install tensorflow and keras_ocr, as they do not come with Anaconda by default. Uncomment the cells below to do this if needed. Then run the cells to import the libraries and set some of the default parameters.

In [None]:
#! pip install tensorflow --user

In [None]:
#! pip install keras_ocr --user

In [1]:
# imports
import cv2
import pandas as pd
import numpy as np
import os
import sys
import json

import random
from random import randrange
from IPython.display import clear_output

from matplotlib import pyplot as plt
from matplotlib import rcParams as rc
from matplotlib import rcParamsDefault
from matplotlib.ticker import MaxNLocator
import matplotlib.image as mpimg
rc.update(rcParamsDefault)

# replace this with your own library path for --user pip installs
sys.path.append('C:/Users/aferreira/AppData/Roaming/Python/Python38/Scripts')
import keras_ocr

pipeline = keras_ocr.pipeline.Pipeline()

Looking for C:\Users\aferreira\.keras-ocr\craft_mlt_25k.h5
Looking for C:\Users\aferreira\.keras-ocr\crnn_kurapan.h5


In [2]:
# set paths
batchDir = 'L:/DATA/Alouette_I/BATCH_II_raw/'
save_dir = 'U:/Downloads/' 
outFile = save_dir + 'chars_and_aspect_ratios_v2.csv'

# set default saving settings
append2outFile = True
saveImages = False

#### Initialize Functions
The main processing uses two functions. read_all_rolls() loops over all the batch 2 raw data ionogram images and saves the outputs from read_image() to a CSV file. read_image() reads in one image at a time, whose path is passed to it by read_all_rolls(), and outputs the height and width along with the estimated character count of the metadata and whether the text 'isis' was detected.

In [3]:
def read_image(image_path, plotting=False):
    '''
    This function reads in one image a time and outputs the height 
    and width along with the estimated character count of the metadata.

    Parameters:

        image_path (str): path to the image

        plotting (bool, optional): True for a verbose display mode to
                                   illustrate the analysis in detail, 
                                   False otherwise

    Returns:

        char_count (int): estimated number of characters in the ionogram
                          metadata (right now, only looks for numbers along
                          bottom 20% of the image, usually only 15 expected)
    
        height (int): number of pixels along y-axis of original image
        
        width (int): number of pixels along x-axis of original image

        says_isis (bool): True if 'isis' independent of capitalization is 
                          present within the detected text, False otherwise
    '''

    # read in image using keras_ocr
    image = keras_ocr.tools.read(image_path) 

    # extract height and width of image in pixels 
    height, width = image.shape[0], image.shape[1]

    # cut image to just include bottom 20% of pixels
    cropped_image = [image[height-height//5:height,:]]

    # create predictions for location and value of characters
    # on the cropped image, will output (word, box) tuples
    prediction = pipeline.recognize(cropped_image)[0]

    if plotting == True:
        # display original image to make sure it is alright
        plt.imshow(image)
        plt.show()

        # display the cropped image
        plt.imshow(cropped_image[0])
        plt.show()

    # initialize outputs so they are defined even if nothing is detected
    char_count = 0
    says_isis = False

    # if characters are found look at the predictions,
    # otherwise move on with a count of zero
    if len(prediction) > 0:
        if plotting == True:
            # plot the predicted boxes and words
            keras_ocr.tools.drawAnnotations(image=cropped_image[0], predictions=prediction)
            plt.show()

        # check how many are numbers since letters are often picked up from noise 
        # (sometimes something like a '0' maps to an 'o', but we only expect digits in metadata)

        # loop over predicted (word, box) tuples and count number of digit characters
        for p in prediction:

            # select just the word part of the tuple
            value = p[0]

            if 'isis' in value.lower():
                says_isis = True

            # if word is composed of just integers then 
            # count how many and increment char_count
            if value.isdigit():
                char_count += len(value)

    return char_count, height, width, says_isis

In [4]:
def read_all_rolls(batchDir=batchDir, saveImages=saveImages):
   '''
   This function loops over all images nested within batchDir
   and saves the outputs from read_image() to a CSV file.

   Parameters:

      batchDir (str, optional): path to directory of entire batch 
                                of ionogram scan images to analyze

      saveImages (bool, optional): True to save all images with irregular
                                    aspect ratios for visual inspection, 
                                    False otherwise

   Returns:

      None

   '''
   # initialize lists to save values to in loop
   rolls, subdirs, images = [], [], []
   heights, widths, char_counts = [], [], []
   says_isis_lst = []
   
   # loop over all rolls in the batch 2 raw data directory
   raw_contents = os.listdir(batchDir)
   for roll in raw_contents:

      # loop over all subdirectories within the roll
      roll_contents = os.listdir(batchDir + roll) 
      for subdir in roll_contents:

         # loop over all images in the subdirectory
         subdir_contents = os.listdir(batchDir + roll + '/' + subdir) 
         for image in subdir_contents:

            # save full path of image
            image_path = batchDir + roll + '/' + subdir + '/' + image

            # make sure path exists
            pathExist = os.path.exists(image_path)
            if pathExist:

               # save id of image
               rolls.append(roll)
               subdirs.append(subdir)
               images.append(image)

               # send to read_image to get aspect ratio and character count
               num_of_chars, h, w, says_isis = read_image(image_path)

               # aspect ratio could also be read in like
               #im = cv2.imread(image_path, 0)
               #h, w = im.shape

               # save values
               char_counts.append(num_of_chars)
               heights.append(h)
               widths.append(w)
               says_isis_lst.append(says_isis)
               
               # this is not used anymore but was helpful  
               # in exploring the acceptable aspect ratios
               if saveImages == True:
                  if (w/h < 1.2) or (w/h > 3):
                     save_name = save_dir + 'off_aspect_ratios/' + roll + '_' + subdir + '_' + image
                     im = cv2.imread(image_path, 0)
                     cv2.imwrite(save_name, im)   
                     print('aspect ratio is off, saved image to:', save_name)               

            # initialize dataframe and save results to csv
            # (redoing this each iteration to not lose information)
            df_mapping_results = pd.DataFrame()

            df_mapping_results['roll'] = rolls
            df_mapping_results['subdir'] = subdirs
            df_mapping_results['image'] = images
            df_mapping_results['character_count'] = char_counts
            df_mapping_results['height'] = heights
            df_mapping_results['width'] = widths
            df_mapping_results['says_isis'] = says_isis_lst

            # mode = 'a' means it will append to existing data within the file
            if append2outFile == True:
               mode = 'a' 

               # wipe lists now that they have been saved
               rolls, subdirs, images = [], [], []
               heights, widths, char_counts = [], [], []
               says_isis_lst = []

               if os.path.exists(outFile):
                     header = False 
               else:
                     header = True
               
            else: 
               # this overwrites existing file
               mode = 'w'
               header = True

            df_mapping_results.to_csv(outFile, mode=mode, index=False, header=header)

Why are images cropped to contain only bottom 20% of pixels?
- The metadata we are interested in is expected to be at the bottom of the image, below the ionogram graph
- Including the data from the ionogram graph makes the keras_ocr functions more likely to predict random words from the noise, so it's best to lower the chances of this by cropping the image
- Some images are upside down, or for another reason will have their metadata cut off by the crop, but it is okay for us to underestimate, and this is preferred over overestimating
- Some images contain just long sections of metadata. These are not out of phase but would contribute to an overestimation, so cropping the bottom 20% as opposed to the bottom 200 pixels allows us to filter out the characters on those images from view
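
To make the difference between the two cropping strategies concrete, the sketch below applies both to blank arrays. The helper names and the image sizes are made up for illustration and are not measured from the batch. The fractional crop uses the same slicing as `read_image()`.

```python
import numpy as np


def crop_bottom_fraction(image, denominator=5):
    """Keep the bottom 1/denominator of the rows (20% by default)."""
    height = image.shape[0]
    return image[height - height // denominator:height, :]


def crop_bottom_pixels(image, n_pixels=200):
    """Keep a fixed number of rows from the bottom of the image."""
    height = image.shape[0]
    return image[max(height - n_pixels, 0):height, :]


# hypothetical sizes: a full-height scan and a much shorter image
tall_image = np.zeros((1000, 2000, 3), dtype=np.uint8)
short_image = np.zeros((250, 2000, 3), dtype=np.uint8)

for name, img in [('tall', tall_image), ('short', short_image)]:
    rows_fraction = crop_bottom_fraction(img).shape[0]
    rows_pixels = crop_bottom_pixels(img).shape[0]
    print(name, 'fraction crop rows:', rows_fraction, 'fixed crop rows:', rows_pixels)
```

With these made-up sizes the two crops agree on the tall image, while on the short image the fixed crop keeps most of the frame and the fractional crop keeps only a thin strip, so less text is exposed to the OCR step.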

Possible improvements to these functions:
- loading the images in batches (tried, but could only pass two at a time to keras; it may still be worthwhile to load a whole subdirectory and then pass two images to keras at a time)
- writing to the CSV without pandas, or in a cleaner way than constantly appending (e.g. only append after every 100 images or so)
- improve the OCR detection and recognition parts
- make the code more efficient (not a major focus so far, since the L drive and keras OCR are the main limiting factors)
- use a consistent method of naming variables
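
One possible way to approach the CSV improvement is sketched below using only the standard library `csv` module. The class `BufferedCsvWriter` and its `flush_every` parameter are hypothetical and not part of this notebook. The column names match those written by `read_all_rolls()`, and the demo rows are made-up values written to a temporary directory.

```python
import csv
import os
import tempfile


class BufferedCsvWriter:
    """Collect rows in memory and append them to a CSV in chunks."""

    def __init__(self, path, fieldnames, flush_every=100):
        self.path = path
        self.fieldnames = fieldnames
        self.flush_every = flush_every
        self.buffer = []

    def add(self, row):
        # store the row and write to disk once the buffer is full
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # only write the header if the file is new or empty
        need_header = (not os.path.exists(self.path)
                       or os.path.getsize(self.path) == 0)
        with open(self.path, 'a', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=self.fieldnames)
            if need_header:
                writer.writeheader()
            writer.writerows(self.buffer)
        self.buffer = []


# demo with made-up rows in a temporary directory
columns = ['roll', 'subdir', 'image', 'character_count',
           'height', 'width', 'says_isis']
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_results.csv')
demo_writer = BufferedCsvWriter(demo_path, columns, flush_every=2)
for i in range(5):
    demo_writer.add({'roll': 'roll_a', 'subdir': 'subdir_1',
                     'image': str(i) + '.png', 'character_count': 15,
                     'height': 400, 'width': 800, 'says_isis': False})
demo_writer.flush()  # write any rows still left in the buffer

with open(demo_path, newline='') as f:
    rows = list(csv.DictReader(f))
print(len(rows))
```

The trade-off is that up to `flush_every - 1` rows can be lost if the run is interrupted, so a final `flush()` (for example in a `try`/`finally`) would be needed to keep the "do not lose information" property of the current approach.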

#### Running the functions
Below, I run the read_all_rolls() function on the batch 2 raw data directory, and it saves the results as it processes. On my local computer, the processing time is ~2s per image on average, but it depends heavily on the image, ranging from less than 1s to more than 7s.

In [5]:
# run function for batch 2 results
read_all_rolls() 

