# `ai_parse_document()` Visualizer Backend

#### Overview
This notebook acts as the backend for the AI Parse Visualizer App. 

Documents placed in the volume's `input` directory are parsed with `ai_parse_document()`, and the results are saved to a Delta table.

Documents already in the results table are not parsed again unless `force_reprocess_all` is set to `True`.
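
As a rough picture of the skip logic, the sketch below reproduces it in plain Python: paths listed in the `input` directory are compared against paths already in the results table, and the remaining file names are joined into a brace glob. The function name `build_unprocessed_glob` and the example paths are hypothetical, and the notebook itself performs this step with a Spark anti-join rather than Python sets.

```python
from typing import Iterable, Optional


def build_unprocessed_glob(listed_paths: Iterable[str],
                           processed_paths: Iterable[str]) -> Optional[str]:
    """Return a brace glob of file names not yet processed, or None."""
    processed = set(processed_paths)
    # Keep only paths with no matching row in the results table (an anti-join).
    names = [p.split("/")[-1] for p in listed_paths if p not in processed]
    if not names:
        return None
    return "{" + ",".join(names) + "}"


# Hypothetical paths, for illustration only.
listed = [
    "dbfs:/Volumes/main/demo/docs/input/a.pdf",
    "dbfs:/Volumes/main/demo/docs/input/b.pdf",
    "dbfs:/Volumes/main/demo/docs/input/c.pdf",
]
already_parsed = ["dbfs:/Volumes/main/demo/docs/input/b.pdf"]

print(build_unprocessed_glob(listed, already_parsed))  # {a.pdf,c.pdf}
```

When every listed file is already in the results table, the function returns `None` and nothing needs to be parsed.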

#### Prerequisites
- You must have an existing catalog, schema, and volume
- You must create two directories named `input` and `output` in the volume
- You must place a few documents in the `input` directory. Keep the documents short to start with, because `ai_parse_document()` takes time to process each page
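
If you prefer to create the two directories from code, a minimal sketch follows. It assumes the volume is reachable as a regular file path (on Databricks, typically under `/Volumes/<catalog>/<schema>/<volume>`); the helper name `ensure_volume_dirs` and the example path are hypothetical.

```python
import os


def ensure_volume_dirs(volume_root: str) -> list:
    """Create the input and output directories under volume_root if missing."""
    created = []
    for name in ("input", "output"):
        path = os.path.join(volume_root, name)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Hypothetical example; replace with your own volume path:
# ensure_volume_dirs("/Volumes/my_catalog/my_schema/my_volume")
```

Because `exist_ok=True` is used, running the helper again on the same volume leaves existing directories and their contents untouched.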

#### Instructions
1. Update the variables in the next cell to reflect your catalog, schema, and volume name
2. Update the `results_table` name; this is where parsed document results will be stored. Avoid changing the name of this table when re-running the notebook
3. To parse a single file, enter its filename for `input_file`. Otherwise, leave it blank and all documents not already in the results table will be parsed
4. Run the notebook! 
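
Before running, it can help to confirm that none of the required variables were left blank. The check below is a small sketch of that idea and is not part of the notebook; `missing_settings` and the example values are hypothetical stand-ins for the variables in the configuration cell.

```python
def missing_settings(settings: dict) -> list:
    """Return the names of required settings that are empty or whitespace."""
    return [name for name, value in settings.items() if not str(value).strip()]


# Hypothetical values standing in for the notebook variables.
settings = {
    "catalog": "my_catalog",
    "schema": "my_schema",
    "volume": "",
    "results_table": "ai_parse_results",
}

missing = missing_settings(settings)
if missing:
    print("Please set: " + ", ".join(missing))
```

`input_file` is deliberately left out of the check, since a blank value there is valid.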

#### Output

You will end up with a catalog structure like the one below:

```text
catalog
└─ schema
   ├─ volume
   │  ├─ input/
   │  └─ output/
   └─ tables
      └─ parsed results table
```

#### Limitations
- Removing documents from the volume `input` directory does not remove their parsed results from the `output` directory or the results table. To stop them from surfacing in the AI Parse Visualizer App, you need to delete their rows from the results table
- This notebook is meant for visualizing parsed results on a small test set, not for visualizing results over hundreds of documents
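
One way to drop stale documents from the results table is a `DELETE` on the `path` column. The sketch below only builds the statement as a string; the helper name `build_delete_sql`, the table name, and the path are hypothetical, and the stored path format (for example, whether it carries a `dbfs:` prefix) is an assumption you should check against your own table first.

```python
def build_delete_sql(table: str, paths: list) -> str:
    """Build a DELETE statement removing the given paths from the results table."""
    if not paths:
        raise ValueError("paths must not be empty")
    # Refuse characters that would need escaping inside a SQL string literal.
    if any("'" in p or "\\" in p for p in paths):
        raise ValueError("paths must not contain quotes or backslashes")
    quoted = ", ".join("'" + p + "'" for p in paths)
    return f"DELETE FROM {table} WHERE path IN ({quoted})"


# Hypothetical table name and path, for illustration only.
statement = build_delete_sql(
    "my_catalog.my_schema.ai_parse_results",
    ["dbfs:/Volumes/my_catalog/my_schema/my_volume/input/old.pdf"],
)
print(statement)
```

In the notebook, the resulting string could then be passed to `spark.sql(...)`. Files already written to the `output` directory are not affected by this statement.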


In [0]:
from delta.tables import DeltaTable
from pyspark.sql.functions import lit

In [0]:
catalog = ""
schema = ""
volume = ""
results_table = ""
input_file = "" # single file name (e.g. my.pdf) or leave blank

# WARNING - setting to True will overwrite all previous parsed results in the results table
force_reprocess_all = False

# DO NOT MODIFY BELOW

In [0]:
sql = f"""
  CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{results_table} (
    createdAt TIMESTAMP NOT NULL,
    path STRING NOT NULL,
    parsed VARIANT NOT NULL,
    CONSTRAINT {results_table}_pk PRIMARY KEY (path)
  )
"""

spark.sql(sql)

In [0]:
def get_unprocessed_filenames(volume_path: str, result_path: str):
  """Returns a brace glob of the filenames in volume_path that are not in the results table, or None if there are none"""

  sql = f"LIST 'dbfs:{volume_path}'"
  
  input_volume_files = spark.sql(sql)
    
  sql = f"SELECT path FROM {result_path}"
  processed_files = spark.sql(sql)
    
  unprocessed_files = input_volume_files.join(
      processed_files,
      on="path",
      how="left_anti"
  )
  
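  # Keep only the file name (last path segment) of each unprocessed file
  unprocessed_filenames = [row.path.split("/")[-1] for row in unprocessed_files.select("path").collect()]
  
  if unprocessed_filenames:
      # Brace glob matching only these files, e.g. {a.pdf,b.pdf}
      process_glob = "{" + ",".join(unprocessed_filenames) + "}"
  else:
      process_glob = None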
  
  return process_glob


In [0]:
volume_path = f"/Volumes/{catalog}/{schema}/{volume}"
results_table_path = f"{catalog}.{schema}.{results_table}"

if force_reprocess_all:
  files_to_process = "*"
elif input_file:
  # Parse only the single file named in input_file
  files_to_process = input_file
else:
  files_to_process = get_unprocessed_filenames(f"{volume_path}/input", results_table_path)

input_files_path = f"{volume_path}/input/{files_to_process}"
output_results_path = f"{volume_path}/output"

if files_to_process:
  sql = f"""
  WITH parsed_documents AS (
      SELECT
        path,
        ai_parse_document(
          content,
          map(
            'imageOutputPath', '{output_results_path}',
            'descriptionElementTypes', '*'
          )
        ) AS parsed
      FROM READ_FILES('{input_files_path}', format => 'binaryFile')
    )
  SELECT 
    current_timestamp() AS createdAt,
    path,
    parsed
  FROM parsed_documents
  """

  df = spark.sql(sql)

  delta_table = DeltaTable.forName(spark, results_table_path)

  delta_table.alias("target").merge(
      df.alias("source"),
      "target.path = source.path AND source.path IS NOT NULL"
  ).whenMatchedUpdate(
      set={
          "createdAt": "source.createdAt",
          "path": "source.path",
          "parsed": "source.parsed"
      }
  ).whenNotMatchedInsert(
      values={
          "path": "source.path",
          "createdAt": "source.createdAt",
          "parsed": "source.parsed"
      }
  ).execute()

