# Determine BATCH II Scan Errors

#### Updated: June 3, 2023 by Ashley Ferreira

Note that all of my code so far uses the old naming convention where box = roll, which we now know is false; the code has just not been updated yet.

#### Setup 
You will likely need to pip install line_profiler, tensorflow, and keras_ocr, as they do not come with Anaconda by default; uncomment the lines in the cell below to do this if needed. Then run the cells to import the libraries and set some of the default parameters.

In [4]:
# uncomment below to download non-standard libraries
#! pip install tensorflow --user
#! pip install keras_ocr --user
#! pip install line_profiler

In [1]:
# enter your network username to have correct paths
username = 'aferreira'

# enter True to use GPU, False for CPU
gpu_use = True

In [2]:
# imports
import sys
import cv2
import os
import gc
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from threading import Thread
#from PIL import Image
#from numba import jit, cuda, njit

In [3]:
# replace this with your own library path for --user pip installs
sys.path.append('C:/Users/' + username + '/AppData/Roaming/Python/Python38/Scripts')
    
if gpu_use:
    sys.path.insert(0, 'u:/temp/' + username + '/python/envs/tf210/lib/site-packages/')

In [4]:
import tensorflow as tf
import keras_ocr
print(tf.__version__) # for gpu use should be version 2.10.*
print(tf.config.list_physical_devices('GPU')) # for GPU use this should show something

2.10.0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [6]:
pipeline = keras_ocr.pipeline.Pipeline()
%load_ext line_profiler
print('number of CPU cores available:', os.cpu_count())

number of CPU cores available: 6


In [10]:
# set paths
batchDir = 'L:/DATA/Alouette_I/BATCH_II_raw/'
save_dir = 'U:/Downloads/test_runs/' 
outFile = save_dir + 'notebook20_outputs_june3.csv'

# make the directory to save into  
# if it doesn't already exist
if not(os.path.exists(save_dir)):
    os.makedirs(save_dir)

# set default saving settings
# (not sure if code works anymore if these are changed)
append2outFile = True 
saveImages = False

#### Initialize functions
The main processing for this code uses two functions. read_all_rolls() loops over all the batch 2 raw data ionogram images and saves the outputs from read_image() to a CSV file. read_image() reads in one image at a time, whose path is passed to it by read_all_rolls(), and outputs the height and width along with the estimated digit count of the metadata and whether the text 'isis' was detected.

After playing around with downsizing the image, I found that a downsizing factor of 2-4 is likely the sweet spot, where the time taken for inference is significantly reduced but the image is still somewhat legible. Alternatively, one could fix the length of one axis and scale the other accordingly, or use different downscale factors for large and small images. The Image library (PIL) would likely allow for better interpolation, but it is not easily compatible with keras_ocr.
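
The fixed-axis alternative mentioned above could look something like the following. This is only a sketch: scaled_size() and the target_width default of 1000 pixels are hypothetical and not part of the notebook. The returned tuple is in the (width, height) order that cv2.resize() expects.

```python
def scaled_size(height, width, target_width=1000):
    """Return (new_width, new_height) with the width capped at target_width
    and the height scaled to keep the aspect ratio."""
    # images already narrower than the target are left at their original size
    if width <= target_width:
        return width, height
    scale = target_width / width
    return target_width, max(1, round(height * scale))

# usage with the notebook's resize call would be along the lines of:
# image = cv2.resize(image, scaled_size(height, width))
```

Unlike a fixed down_factor, this keeps inference cost roughly constant across large and small scans.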

In [11]:
def read_image(image_path, plotting=False, just_digits=False, down_factor=2):
    '''
    This function reads in one image at a time and outputs the height 
    and width along with the estimated digit count of the metadata.

    Parameters:

        image_path (str): path to the image

        plotting (bool, optional): True for a verbose display mode to
                                   illustrate the analysis in detail, 
                                   False otherwise

        just_digits (bool, optional): if True only count characters that are 
                                      integers, False to count any characters

        down_factor (int, optional): factor by which to integer divide height
                                     and width to scale down size of image

    Returns:

        digit_count (int): estimated number of integers in the ionogram
                          metadata (right now, only looks for numbers along
                          bottom 20% of the image, usually only 15 expected)
                          ~~~~ done for cropped image ~~~~
    
        height (int): number of pixels along y-axis of original image
        
        width (int): number of pixels along x-axis of original image

        says_isis (bool): True if 'isis', independent of capitalization, is 
                          present within the detected text, False otherwise
                          ~~~~ done for original image ~~~~
    '''
    try: 

        # read in image using keras_ocr
        image = keras_ocr.tools.read(image_path) 

        # extract height and width of image in pixels 
        height, width = image.shape[0], image.shape[1]

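        # row (in original-image pixels) where the bottom 20% of the image begins
        cropped_height = height-height//5
        #cropped_image = [image[cropped_height:height,:]]

        # downsize the image to speed up inference
        down_size = (width//down_factor, height//down_factor)
        image = cv2.resize(image, down_size)

        # predicted boxes are in downsized-image pixels, so scale the
        # crop boundary to match before comparing against them
        cropped_height = cropped_height//down_factor

        # create predictions for location and value of characters
        # on the downsized image, will output (word, box) tuples
        prediction = pipeline.recognize([image])[0]

        # if no characters are found move on
        if len(prediction) == 0:
            digit_count = 0
            says_isis = False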

        # if characters are found look at the predictions
        else:
            if plotting:
                # plot the predicted (word, box) tuples
                keras_ocr.tools.drawAnnotations(image=image, predictions=prediction)
                plt.show()

            # loop over predicted (word, box) tuples and count number of digit characters
            digit_count = 0
            says_isis = False
            for p in prediction:

                # select word and box part of the tuple
                value, box = p[0], p[1]

                # check for 'isis' of any capitalization in image
                if 'isis' in value.lower(): # may want to check 1, I, 5 variations on this
                    says_isis = True
                    print('found potential ISIS text')
                
                # if word is composed of just integers (or any characters
                # are allowed) then count how many and increment digit_count
                if not just_digits or value.isdigit():

                    # check that box is within the cropped height
                    in_bounds = True
                    for b in box:
                        if b[1] < cropped_height:
                            in_bounds = False
                            break
                            
                    if in_bounds:
                        digit_count += len(value)

        print('digits count:', digit_count)

    except Exception as e:
        print('ERR:', e)
        digit_count, height, width, says_isis = 'ERR', 'ERR', 'ERR', 'ERR'

    return digit_count, height, width, says_isis
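
The comment in read_image() about checking 1, I, and 5 variations could be handled with a small pattern match. The sketch below assumes the OCR may confuse i with 1, l, or | and s with 5; looks_like_isis() and these confusion sets are hypothetical and not part of the notebook.

```python
import re

# assumed OCR confusions: 'i' read as '1', 'l' or '|', and 's' read as '5'
ISIS_PATTERN = re.compile(r'[i1l|][s5][i1l|][s5]')

def looks_like_isis(word):
    """Return True if word contains 'isis' or an assumed OCR lookalike of it."""
    for match in ISIS_PATTERN.finditer(word.lower()):
        # require at least one letter so a purely numeric run such as '1515'
        # (plausible in the digit metadata) is not flagged
        if any(c.isalpha() for c in match.group()):
            return True
    return False
```

This could replace the 'isis' in value.lower() test, at the cost of some extra false positives on short letter-digit mixes.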

In [12]:
def read_all_rolls(outFile=outFile, append2outFile=True, batchDir=batchDir, plotting=False, max_images=None, save_each=100):
   '''
   This function loops over all images nested within batchDir
   and saves the outputs from read_image() to a CSV file.

   Parameters:

      outFile (str, optional): path to CSV file where results from this 
                               function can be stored 

      append2outFile (bool, optional): if True will append to data in outFile 
                                       (if any exists), otherwise overwrites

      batchDir (str, optional): path to directory of entire batch 
                                of ionogram scan images to analyze

      plotting (bool, optional): just passes directly to read_image()

      max_images (int, optional): maximum number of images used to iterate over

      save_each (int, optional): save results to CSV after this number of images

   Returns:

      None

   '''
   # check if there is already data in the output file 
   if os.path.exists(outFile) and os.path.getsize(outFile)!=0:
      found = False
      header = False 

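      # read everything as strings so numeric-looking names (e.g. subdir '145')
      # can be joined back into a path
      df = pd.read_csv(outFile, dtype=str)
      last_entry = batchDir + df['roll'].iloc[-1] + '/' + df['subdir'].iloc[-1] + '/' + df['image'].iloc[-1]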
      del df 

      # garbage collector
      collected = gc.collect()
      print("Garbage collector: collected",
               "%d objects." % collected)

   else: 
      found = True
      header = True
      last_entry = ''

   # initialize lists to save values to in loop
   rolls, subdirs, images = [], [], []
   heights, widths, digit_counts = [], [], []
   says_isis_lst = []

   images_saved = 0
   
   # loop over all rolls in the batch 2 raw data directory
   raw_contents = os.listdir(batchDir)
   for roll in raw_contents:

      # loop over all subdirectories within the roll
      roll_contents = os.listdir(batchDir + roll) 
      for subdir in roll_contents:
         
         # loop over all images in the subdirectory
         subdir_contents = os.listdir(batchDir + roll + '/' + subdir) 
         for image in subdir_contents:

            # save full path of image
            image_path = batchDir + roll + '/' + subdir + '/' + image

            # skip over images already analyzed in the CSV
            # (including the last saved entry itself)
            if found == False:
               if last_entry == image_path:
                  found = True
               continue

            if found == True:
               images_saved += 1

               if max_images != None and images_saved > max_images:
                  sys.exit()

               # save id of image
               rolls.append(roll)
               subdirs.append(subdir)
               images.append(image)

               # send to read_image to get aspect ratio, digit count, and isis text
               num_of_digits, h, w, says_isis = read_image(image_path)

               # save values
               digit_counts.append(num_of_digits)
               heights.append(h)
               widths.append(w)
               says_isis_lst.append(says_isis)              

               # save to csv after a set number of images (perhaps best to make proportional to max images)
               if images_saved % save_each == 0:

                  # initialize dataframe and save results to csv
                  # (redoing this each iteration to not lose information)
                  df_mapping_results = pd.DataFrame()

                  df_mapping_results['roll'] = rolls
                  df_mapping_results['subdir'] = subdirs
                  df_mapping_results['image'] = images
                  df_mapping_results['digit_count'] = digit_counts
                  df_mapping_results['height'] = heights
                  df_mapping_results['width'] = widths
                  df_mapping_results['says_isis'] = says_isis_lst

                  # mode = 'a' means it will append to existing data within the file
                  if append2outFile == True:
                     mode = 'a' 

                     # wipe lists now that they have been saved
                     rolls, subdirs, images = [], [], []
                     heights, widths, digit_counts = [], [], []
                     says_isis_lst = []
                     
                  else: 
                     # this overwrites existing file
                     mode = 'w'
                     header = True

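                  df_mapping_results.to_csv(outFile, mode=mode, index=False, header=header)
                  del df_mapping_results

                  # when appending, only the first write should include the header
                  if append2outFile == True:
                     header = False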

                  collected = gc.collect()
                  print("Garbage collector: collected",
                           "%d objects." % collected)

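The save step in read_all_rolls() comes down to appending a chunk of rows while writing the header only once. A minimal sketch of that pattern follows, using the standard-library csv module in place of pandas so the example has no dependencies; append_rows() and the demo file are hypothetical, while the column names are the ones used in the notebook.

```python
import csv
import os
import tempfile

COLUMNS = ['roll', 'subdir', 'image', 'digit_count', 'height', 'width', 'says_isis']

def append_rows(out_file, rows):
    """Append rows (a list of dicts) to out_file, writing the header
    only if the file does not exist yet or is empty."""
    write_header = not os.path.exists(out_file) or os.path.getsize(out_file) == 0
    with open(out_file, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)

# usage: two appends give one header row followed by two data rows
with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, 'demo.csv')
    row = {'roll': 'R014207709', 'subdir': '145', 'image': '1.png',
           'digit_count': 0, 'height': 100, 'width': 200, 'says_isis': False}
    append_rows(demo_file, [row])
    append_rows(demo_file, [dict(row, image='10.png')])
    with open(demo_file, newline='') as f:
        demo_lines = f.read().splitlines()
```

Deciding on the header from the file itself, rather than from a flag carried through the loop, keeps re-runs from inserting a second header row.
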
#### Running the functions
Below, I run the read_all_rolls() function for the batch 2 raw data directory and it saves the results as it processes.

Profilers are run on the functions as well. On VDI, the read_all_rolls() call below spends 99.1% of the processing time on the num_of_digits, h, w, says_isis = read_image(image_path) line, 0.1% saving the CSV, and 0.8% on the forced garbage collection. For the first read_image() call, 97.4% of the processing time is spent on the prediction = pipeline.recognize([image])[0] line and 2.5% on the image = keras_ocr.tools.read(image_path) line.
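
Where line_profiler is not available (for example, outside a notebook), a coarser version of the same breakdown could be collected by hand. The sketch below is an assumption-laden stand-in, not what produced the numbers above: timed(), stage_times, and percent_breakdown() are hypothetical names, and the two timed blocks are placeholders for the image read and the OCR call.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_times = defaultdict(float)  # seconds accumulated per named stage

@contextmanager
def timed(stage):
    """Add the wall-clock time spent inside the with-block to stage_times[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_times[stage] += time.perf_counter() - start

def percent_breakdown(times):
    """Convert accumulated seconds into percentages of the total."""
    total = sum(times.values())
    if total == 0:
        return {}
    return {stage: 100 * seconds / total for stage, seconds in times.items()}

# placeholder work standing in for keras_ocr.tools.read() and pipeline.recognize()
with timed('read'):
    sum(range(1000))
with timed('recognize'):
    sum(range(100000))
```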

In [7]:
# profile the read_all_rolls() function
# run it for 10 only to get results
%lprun -f read_all_rolls read_all_rolls(max_images=10, save_each=5)

digits count: 0
digits count: 0
digits count: 0
digits count: 0
digits count: 0
Garbage collector: collected 8585 objects.
digits count: 0
digits count: 0
digits count: 0
digits count: 0
digits count: 0
Garbage collector: collected 1449 objects.
*** SystemExit exception caught in code being profiled.

In [8]:
# (just one iteration here)
image_path = batchDir + 'R014207709' + '/' + '145' + '/' + '1.png'
%lprun -f read_image read_image(image_path)

digits count: 0


In [9]:
# (just one iteration here)
image_path = batchDir + 'R014207709' + '/' + '145' + '/' + '10.png'
%lprun -f read_image read_image(image_path)

digits count: 0


#### Multiprocessing & Multithreading Implementations
Testing to see if either of these is a worthwhile approach. Multiprocessing and multithreading both worked and did a better job of maximizing CPU use with good memory usage, but after a while CPU usage still drops to only ~30% utilization, which significantly slows down performance.

I don't think true multithreading will ever work well with Python, but it's useful here, as there does seem to be some time where CPU utilization dips, such as when retrieving an image over the ethernet.

I've tried a few different ways to do the multithreading/processing. One was designating each thread to a set roll (this can be seen in an earlier version of this notebook), but that means it's a long time before they join up. So below I've implemented the multithreading/processing where the different threads are assigned to subdirectories within the processing loop. This means an alternate version of read_all_rolls() is created below; read_image() can stay the same as defined earlier.

In [13]:
def subdir_analysis(roll, subdir, outFile=outFile, batchDir=batchDir, save_each=100):
   '''
   to support read_all_rolls_multithread() in multithreading
   '''
   # initialize values
   rolls, subdirs, images = [], [], []
   heights, widths, digit_counts = [], [], []
   says_isis_lst, images_saved, mode = [], 0, 'a'

   # loop over all images in the subdirectory
   subdir_contents = os.listdir(batchDir + roll + '/' + subdir) 
   total_images = len(subdir_contents)
   for image in subdir_contents:

      # save full path of image
      image_path = batchDir + roll + '/' + subdir + '/' + image
      images_saved += 1
      print(images_saved, image_path)

      # save id of image
      rolls.append(roll)
      subdirs.append(subdir)
      images.append(image)

      # send to read_image to get aspect ratio, digit count, and isis text
      num_of_digits, h, w, says_isis = read_image(image_path)

      # save values
      digit_counts.append(num_of_digits)
      heights.append(h)
      widths.append(w)
      says_isis_lst.append(says_isis)              

      # save to csv after a set number of images or if last image
      if images_saved % save_each == 0 or images_saved == total_images:

         # initialize dataframe and save results to csv
         df_mapping_results = pd.DataFrame()

         df_mapping_results['roll'] = rolls
         df_mapping_results['subdir'] = subdirs
         df_mapping_results['image'] = images
         df_mapping_results['digit_count'] = digit_counts
         df_mapping_results['height'] = heights
         df_mapping_results['width'] = widths
         df_mapping_results['says_isis'] = says_isis_lst

         # wipe lists now that they have been saved
         rolls, subdirs, images = [], [], []
         heights, widths, digit_counts = [], [], []
         says_isis_lst = []
            
         df_mapping_results.to_csv(outFile, mode=mode, index=False, header=False)
         del df_mapping_results

         collected = gc.collect()
         print("Garbage collector: collected",
                  "%d objects." % collected)

In [14]:
def read_all_rolls_multithread(thread_count=6):
   '''
   A more bare bones version of read_all_rolls() tailored to support multithreading
   (reference read_all_rolls() for more info)
   --> does not yet support re-runs with no overwriting like the original
   '''
   # initialize header for the output file
   header = {'roll':[], 'subdir':[], 'image':[], 'digit_count':[],  
                  'height':[], 'width':[], 'says_isis':[]}
   df_header = pd.DataFrame(data=header)
   df_header.to_csv(outFile, mode='w', index=False, header=True)

   subdir_threads = []

   # loop over all rolls in the batch 2 raw data directory
   raw_contents = os.listdir(batchDir)
   for roll in raw_contents:

      # loop over all subdirectories within the roll
      roll_contents = os.listdir(batchDir + roll) 
      for subdir in roll_contents:

         # setup a set of subdirectories to run, storing the roll with each
         # subdirectory so that a set can span a roll boundary
         if len(subdir_threads) < thread_count:
            subdir_threads.append((roll, subdir))

         else:
            # create threads (replace 'Thread' with 'Process' for multiprocessing)
            threads = [Thread(target=subdir_analysis, args=[r, s]) 
                        for r, s in subdir_threads]

            # start the threads
            for thread in threads:
               thread.start()

            # wait for completion
            for thread in threads:
               thread.join()

            print('#### All threads done, beginning a new set ####')
            
            # clear list and append first one
            subdir_threads = []
            subdir_threads.append((roll, subdir))

   # run whatever subdirectories are left over once the loops finish
   threads = [Thread(target=subdir_analysis, args=[r, s]) 
               for r, s in subdir_threads]
   for thread in threads:
      thread.start()
   for thread in threads:
      thread.join()
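
Running the subdirectories in fixed sets means every set waits for its slowest thread, and several threads append to the same CSV at once. One possible alternative is a thread pool with a lock around the write. This is only a sketch: run_pool(), csv_lock, and fake_subdir_analysis() are hypothetical, with the last standing in for subdir_analysis() so the example runs without any images.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

csv_lock = threading.Lock()  # guards writes to the shared output

def run_pool(tasks, worker, thread_count=2):
    """Run worker(roll, subdir) for every (roll, subdir) task, starting a
    new task as soon as any of the thread_count threads is free."""
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        futures = [pool.submit(worker, roll, subdir) for roll, subdir in tasks]
        # results come back in submission order; exceptions re-raise here
        return [future.result() for future in futures]

saved = []  # stand-in for the CSV file

def fake_subdir_analysis(roll, subdir):
    # the OCR work would happen here, outside the lock
    with csv_lock:  # only one thread writes at a time
        saved.append((roll, subdir))
    return subdir

demo_tasks = [('R014207709', '145'), ('R014207709', '146'), ('R014207710', '147')]
demo_results = run_pool(demo_tasks, fake_subdir_analysis)
```

Because each task carries its own roll, the pool also runs across roll boundaries without extra bookkeeping.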

In [None]:
read_all_rolls_multithread(thread_count=2)

1 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/145/1.png
1 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/146/1.png
digits count: 0
2 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/145/10.png
digits count: 0
3 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/145/11.png
digits count: 0
2 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/146/10.png
digits count: 0
4 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/145/12.png
digits count: 0
5 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/145/13.png
digits count: 0
3 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/146/11.png
digits count: 0
4 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/146/12.png
digits count: 0
6 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/145/14.png
digits count: 0
5 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/146/13.png
digits count: 0
7 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/145/15.png
digits count: 0
8 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/145/16.png
digits count: 0
6 L:/DATA/Alouette_I/BATCH_II_raw/R014207709/146/14.png
digits count: 0
9 L:/DATA/

#### EasyOCR Implementation
We are pretty limited with just running these models on CPU, but this should be faster. Let's see how well it does...

In [None]:
#! pip install easyocr

In [9]:
import easyocr
text_reader = easyocr.Reader(['en']) # load model into memory

CUDA not available - defaulting to CPU. Note: This module is much faster with a GPU.

In [16]:
def read_image_easyOCR(image_path, down_factor=1):
    '''
    '''
    try: 
        # read in with cv2 
        image_png = cv2.imread(image_path, 0)
        height, width = image_png.shape

        # downsize 
        down_size = (width//down_factor, height//down_factor)
        image_png = cv2.resize(image_png, down_size)

        # save as jpeg (overwrite)
        image_name = image_path.split('/')[-1]
        temp_image_path = save_dir + image_name.replace('.png', '.jpg') # just want the name part
        cv2.imwrite(temp_image_path, image_png) # need to delete these if actually scaling this

        # do reading 
        results = text_reader.readtext(temp_image_path)
        for (bbox, text, prob) in results:
            print(text)

    except Exception as e:
        print('ERR:', e)
        digit_count, height, width, says_isis = 'ERR', 'ERR', 'ERR', 'ERR'

EasyOCR does not give good results for reading the ISIS 1 metadata image, but is not so bad for ionogram film annotations. Overall I found worse performance and not as much of a speed difference as I expected while just on CPU. Batching and GPU acceleration should help, but perhaps not enough to make redoing this with EasyOCR worthwhile.
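
read_image_easyOCR() only prints the detected text, so a like-for-like comparison with read_image() would need the same digit count. A minimal sketch of that reduction follows, assuming results is the list of (bbox, text, prob) tuples that the function already loops over; count_digits(), the min_prob threshold, and the sample values are hypothetical.

```python
def count_digits(results, just_digits=True, min_prob=0.0):
    """Count characters in EasyOCR-style (bbox, text, prob) results,
    mirroring the word-level digit count in read_image()."""
    digit_count = 0
    for bbox, text, prob in results:
        # skip low-confidence detections
        if prob < min_prob:
            continue
        # EasyOCR text can contain spaces, so drop them before testing
        word = text.replace(' ', '')
        if not just_digits or word.isdigit():
            digit_count += len(word)
    return digit_count

# made-up results in the same shape as text_reader.readtext() output
box = [[0, 0], [10, 0], [10, 5], [0, 5]]
sample_results = [(box, '12 345', 0.9), (box, 'SC 6S', 0.4), (box, '07', 0.2)]
```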

In [18]:
image_path = batchDir + 'R014207709' + '/' + 'C-111-50' + '/' + '9.png' 
#image_path = batchDir + 'R014207709' + '/' + '145' + '/' + '1.png'
#image_path = batchDir + 'R014207709' + '/' + 'C-109-06' + '/' + '13.png'
read_image_easyOCR(image_path)

In [20]:
# results from profiler on local
%lprun -f read_image_easyOCR read_image_easyOCR(image_path)

SC 6S
'Sc
CSSX Sc
SC &S
S
Timer unit: 1e-07 s

Total time: 46.7089 s
File: <ipython-input-16-c490ec352688>
Function: read_image_easyOCR at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def read_image_easyOCR(image_path, down_factor=1):
     2                                               '''
     3                                               '''
     4         1         10.0     10.0      0.0      try: 
     5                                                   # read in with cv2 
     6         1   55643461.0 55643461.0     11.9          image_png = cv2.imread(image_path, 0)
     7         1         79.0     79.0      0.0          height, width = image_png.shape
     8                                           
     9                                                   # downsize 
    10         1         22.0     22.0      0.0          down_size = (width//down_factor, height//down_factor)
    11         1      36