# attempt notebook with API-first structure
In the previous attempt (ocr-test.ipynb), I got all the pieces working, but the code was difficult to read and made moving to a relational database difficult. I'll prototype an "API-first notebook" to see if that would be a better structure to bridge the data/backend/frontend.

In [17]:
# installs, imports
%pip install -q \
    pandas

import pandas as pd
import shutil

Note: you may need to restart the kernel to use updated packages.


In [None]:
# define "API" first, where each function can be replaced by a simple API call

# 
# *** IN PSEUDO CODE ***
# 

## "SERVICES"
## notebook: pure functions
## webdev:   eg lambda services, easy to isolate and scale horizontally as needed

# upload images
def uploadImage(rawFileURL):
    # notebook: move files from `upload` folder to `rawFiles` folder
    # webdev: write to temporary storage, keep files in case crop doesn't do a good job and needs to be reverted by the user
    uploadedFileURL = './store/rawFiles/' + rawFileURL.split('/')[-1]
    shutil.move(rawFileURL, uploadedFileURL)
    return uploadedFileURL
# crop/convert/postprocess images
def cropImages(rawFiles):
    # write to storage
    # write to postProcessedFiles table
    return [postProcessedFile1.Id, postProcessedFile2.Id, postProcessedFile3.Id] # could return success/failure, but prefer to return IDs for reference, might want to return objects depending on how it'll be used later
# ocr images to create receiptTexts
def ocrImage(croppedImage):
    # contains data for boundingbox, text
    # contains reference to filename
    return [receiptText1.Id, receiptText2.Id, receiptText3.Id]
# create receipt
# def createReceipt(receiptTexts):
#     # contains reference to receiptTexts, filenames via receiptTexts # note one receipt could have multiple images, and texts across these images
#     # write to receipts table
#     return receipt1.Id

## "DATA"
## notebook: pandas dataframes-->ORM-->to API external call directly
## webdev:   eg API create endpoints

# define/import referenceItems
def writeReferenceItem(name, quantity, unitOfMeasure, price, pricePerWeight, referenceUrl):
    # contains data for name, quantity, unitOfMeasure, price, pricePerWeight, referenceUrl
    # write to referenceItems table
    return referenceItem1.Id

# define/import search queries tied to referenceItems
def writeEligibleProduct(productName="Schar Gluten Free Hot Dog Buns", referenceItem="Hot dog buns"):
    # contains data for productName, referenceItem
    # write to eligibleProducts table
    return searchQuerry1.Id

def writeEligibleExpense(description, amount, date, receiptTextId,referenceItemId):
    # contains data for description, amount, date, receiptTextId,referenceItem
    # write to eligibleExpenses table
    return eligibleExpense1.Id

## "PROCESSING"
## notebook: impure functions on pandas dataframes
## webdev:   backend controllers / helpers, harder to isolate

# parse receiptTexts against possible product names to find and create eligibleExpenses (description, amount, date, searchQuerry, referenceItem)
def checkForEligibleProductName(receiptTextId):
    eligibleProduct = None
    def checkProductName(receiptTextId):
        # check if text in receiptText is in eligibleProducts
        return True
    def lookForProductPrice(receiptTextId):
        # bunch of magic here
        return price
    if checkProductName(receiptTextId):
        price = lookForProductPrice(receiptTextId)
        eligibleProduct = (True, price)  # reuse the results instead of calling both helpers a second time
    return eligibleProduct

def parseTextForEligibleExpenses(receiptTextId): 
    if checkForEligibleProductName(receiptTextId):
        writeEligibleExpense(description, amount, date, receiptTextId, referenceItemId)  # match the signature defined above; values still to be extracted
    return # success/failure? not sure yet

### Seeding

There's still a step for seeding the referenceItems table. In a notebook, that means loading the data into a pandas dataframe; in a product, it's data that is already in the app for all users to use. This is problematic in this data structure, because there should be referenceItems that are public to everyone as well as ones that are private to a user. For simplicity, I will assume all referenceItems are shared between all users for the purpose of the prototype.

In [None]:
# SEEDING
# create referenceItems
referenceitems = pd.read_csv('referenceitems.csv')
for _, referenceitem in referenceitems.iterrows():  # iterate rows; looping over the dataframe itself yields column names
    writeReferenceItem(
        referenceitem['name'], 
        referenceitem['quantity'], 
        referenceitem['unitOfMeasure'], 
        referenceitem['price'], 
        referenceitem['pricePerWeight'], 
        referenceitem['referenceUrl']
        # TODO: likely should have one to many relationship with searchQuerries by ID (i.e. different text strings it appears as on receipts)
    )

# create searchQuerries, ...

### Main function stuff

This is where the two worlds meet, but are different in how they would be approached. 

A notebook's flow would look like:
- for image in images_in_folder process and ocr all images; analyze data
- for each string for each receipt: look for eligible expenses; analyze data
- for each eligible expense join on reference item info; analyze data
- export a single table with all expenses and information

(!) this is the main difference in the two approaches / ways of thinking
It's a bit easier to think in batches in a notebook (eg ocr all images, parse all texts, etc.) but that's not good for a webdev flow (mostly because it doesn't work well with relational data).
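The batch way of thinking in the notebook flow above could be sketched roughly like this. It's only an illustration: the dataframes stand in for OCR output and the seeded tables, the prices and the non-matching receipt lines are made up, and the substring match is a placeholder for the real parsing.

```python
import pandas as pd

# Made-up stand-in for OCR output: one row per text line found on a receipt image.
receipt_texts = pd.DataFrame({
    "receiptTextId": [1, 2, 3],
    "fileName": ["a.jpg", "a.jpg", "b.jpg"],
    "text": ["Schar Gluten Free Hot Dog Buns 6.99", "Milk 2.49", "Bread 3.00"],
})
# Stand-ins for the seeded tables (the price is a placeholder value).
eligible_products = pd.DataFrame({
    "productName": ["Schar Gluten Free Hot Dog Buns"],
    "referenceItem": ["Hot dog buns"],
})
reference_items = pd.DataFrame({"name": ["Hot dog buns"], "price": [3.49]})

# Step 2, in bulk: pair every text with every product name, keep the pairs that match.
pairs = receipt_texts.merge(eligible_products, how="cross")
mask = [p.lower() in t.lower() for p, t in zip(pairs["productName"], pairs["text"])]
matches = pairs[mask]

# Step 3, in bulk: join each match on its reference item info.
expenses = matches.merge(
    reference_items, left_on="referenceItem", right_on="name", how="left"
)
```

Every step consumes and produces a whole dataframe, which is convenient for inspecting results between steps but has no notion of "this one receipt that was just uploaded".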

consider the user journey:
- select images to upload, wait
- get a list of receipts to manually review, edit expenses, save (and ideally mark as reviewed, but not for minimum viable prototype)

so a webdev flow would look like:
1. select one or multiple images, upload
2. background cron job to process ocr on each image, look for eligible expenses, match to reference items
3. line item level update mutations


one approach is waterfall

- main(): 
    - "upload" images (for file in files_in_folder)
        - cropImage
            - ocrImage (returns receiptTexts)
                - parse text for eligible expenses
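A minimal sketch of that waterfall, assuming stub versions of the steps (`crop_image`, `ocr_image` and `parse_text` here are hypothetical stand-ins, not the functions defined earlier):

```python
def crop_image(raw_file):
    # Stub: pretend to crop and return the new file name.
    return "cropped_" + raw_file

def ocr_image(cropped_file):
    # Stub: pretend OCR found a single line of text.
    return ["text from " + cropped_file]

def parse_text(text):
    # Stub: pretend every text becomes one expense.
    return {"description": text}

def main(raw_files):
    expenses = []
    for raw_file in raw_files:              # "upload" images
        cropped = crop_image(raw_file)      # cropImage
        for text in ocr_image(cropped):     # ocrImage (returns receiptTexts)
            expenses.append(parse_text(text))  # parse text for eligible expenses
    return expenses

result = main(["a.jpg"])
```

Each step only runs as part of the loop in `main()`, so a failure or a change in one step means re-running the whole chain for that file.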

another approach is to do it async as cron jobs, which has many advantages (scalability, error handling)
- "upload" images (go through folder)
- cropImages that were uploaded but not cropped
- ocrImage that were cropped but not OCRed
- parse receiptTexts for eligible expenses line items
- chron() to run every so often

... on the front-end the user will then pick up this data and edit individual Expense line items

The async cron job approach above might be a good match for notebooks, in the sense that I can run each step in bulk and look at the results before building the next block of code. Arguably the code can be written sequentially right away, or refactored later, but my objective is to make it easy on both sides to build support tooling.
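To make the "uploaded but not cropped", "cropped but not OCRed" idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: plain dicts stand in for the tables, and the crop/OCR work is faked with string operations.

```python
# Hypothetical in-memory "tables", all keyed by the raw file's ID.
raw_files = {}       # rawFileId -> uploaded path
cropped_images = {}  # rawFileId -> cropped path
receipt_texts = {}   # rawFileId -> OCR text

def pending(upstream, downstream):
    """IDs present upstream that the next stage has not handled yet."""
    return [key for key in upstream if key not in downstream]

def run_once():
    """One pass of the polling loop; returns the IDs handled by each stage."""
    to_crop = pending(raw_files, cropped_images)
    for file_id in to_crop:
        cropped_images[file_id] = "cropped_" + raw_files[file_id]  # fake crop
    to_ocr = pending(cropped_images, receipt_texts)
    for file_id in to_ocr:
        receipt_texts[file_id] = "text of " + cropped_images[file_id]  # fake OCR
    return {"cropped": to_crop, "ocred": to_ocr}

raw_files[1] = "a.jpg"   # first "upload"
first = run_once()
raw_files[2] = "b.jpg"   # second "upload", some time later
second = run_once()      # only picks up the new file
```

Each stage works out its own to-do list from what is already stored, so nothing has to be passed from one step to the next, and a pass that finds no new work simply does nothing.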

In [None]:
# First attempt: waterfall (see `main()` in [ocr-test.ipynb])
# I originally tried writing my notebook in the first approach, where I would create a function in a block above the main function block, and have the main function run the `for` loop with reference to all the individual pieces, but I found it hard to read and edit. 

In [15]:
# Second attempt: async

import time # mostly for testing
import threading

stop_flag = False

def chron():
    while not stop_flag:
        # check for raw files to crop
        # ... probably by looking up at a table of rawFiles that were "uploaded"
        #     ^ this is the key difference, there isn't an entire dataframe being passed to the next step
        # ... tempNewRawFiles = ...
        # cropImages(tempNewRawFiles)

        # check for croppedImages to ocr
        # ... tempNewCroppedImages = ...
        # ocrImage(tempNewCroppedImages)

        # check for receiptTexts
        # ... tempNewReceiptTexts = ...
        # parseTextForEligibleExpenses(tempNewReceiptTexts)

        # 🏁 now there should be Expenses created, ready for user to manually review and modify

        time.sleep(0.500) # slow things down for testing
        pass

def stop_chron():
    global stop_flag
    stop_flag = True
    watch_thread.join()  # Wait for the thread to finish
    pass

watch_thread = threading.Thread(target=chron)
watch_thread.start()
# elsewhere I can use `stop_chron()` to stop the thread

# kick things off; note this nicely parallels the action the user would take
def main():
    # "upload" images
    # rawFiles = ... # from os folder list etc
    # for rawFile in rawFiles:
    #     uploadImage(rawFile)
    pass

In [16]:
# from a webdev perspective I would have a docker instance running the above code
# here I turn everything off for the purpose of using the notebook
stop_chron()