# `ai_parse_document()` Visualizer Backend

#### Overview
Documents in the volume's `input` directory are parsed with `ai_parse_document()`, and the results are stored in the `ai_parse_results` table. Documents already in the `ai_parse_results` table are not parsed again unless `force_reprocess_all` is set to `True`.
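
The skip logic described above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation (which performs the comparison with a Spark anti-join); the helper name `select_files_to_parse` and the sample filenames are hypothetical.

```python
def select_files_to_parse(input_files, parsed_files, force_reprocess_all=False):
    """Return the input files that still need parsing, preserving input order."""
    if force_reprocess_all:
        # Reprocess everything, including files that already have results
        return list(input_files)
    already_parsed = set(parsed_files)
    return [name for name in input_files if name not in already_parsed]


# Hypothetical filenames for illustration only
pending = select_files_to_parse(["a.pdf", "b.pdf", "c.pdf"], ["b.pdf"])
print(pending)
```

With `force_reprocess_all=True`, the same call would return all three files.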

#### Prerequisites
- Existing catalog, schema, and volume  
- The volume must contain two directories named `input` and `output`  
- A few documents in the `input` directory. Keep this set small to start with, because `ai_parse_document()` takes time to process each document
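
One way to check the directory prerequisite before running the notebook is sketched below. It assumes the volume is reachable through the local file system at `/Volumes/<catalog>/<schema>/<volume>`; the helper name `missing_volume_dirs` and the placeholder names are hypothetical.

```python
import os


def missing_volume_dirs(volume_root, required=("input", "output")):
    """Return the required subdirectory names that do not exist under volume_root."""
    return [
        name for name in required
        if not os.path.isdir(os.path.join(volume_root, name))
    ]


# Hypothetical placeholder names; substitute your own catalog, schema, and volume
volume_root = "/Volumes/my_catalog/my_schema/my_volume"
missing = missing_volume_dirs(volume_root)
if missing:
    print(f"Create these directories under {volume_root}: {missing}")
```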

#### Instructions
1. Update the variables in the next cell to reflect your catalog, schema, and volume names.  
2. To parse a single file, enter its filename for `input_file`. Otherwise, leave it blank and all documents not already in the `ai_parse_results` table will be parsed.  
3. Run the notebook. 
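
The choice between a single file and all unprocessed files in step 2 can be sketched as a small path-pattern helper. This is an illustrative sketch, assuming the reader accepts a brace glob such as `{a.pdf,b.pdf}`; the helper name `resolve_input_pattern` and the sample values are hypothetical.

```python
def resolve_input_pattern(volume_path, input_file="", unprocessed=()):
    """Return the path pattern to read, or None when there is nothing to parse."""
    input_dir = f"{volume_path}/input"
    if input_file:
        # A single filename was requested, so read only that file
        return f"{input_dir}/{input_file}"
    if not unprocessed:
        return None
    # Brace glob that matches exactly the listed filenames
    return input_dir + "/{" + ",".join(unprocessed) + "}"


# Hypothetical volume path and filenames for illustration only
print(resolve_input_pattern("/Volumes/c/s/v", "my.pdf"))
print(resolve_input_pattern("/Volumes/c/s/v", "", ["a.pdf", "b.pdf"]))
```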

**Note**: Removing documents from the `input` directory does not remove their parsed results from the `output` directory or the `ai_parse_results` table. To stop a document from surfacing in the Databricks App, delete its row from the `ai_parse_results` table.
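
Removing a document's row, as the note describes, might look like the sketch below. It assumes the `path` column holds the full file path exactly as stored by the notebook and that your PySpark version supports named parameters in `spark.sql`; the helper name `build_delete_statement` and the sample path are hypothetical.

```python
def build_delete_statement(catalog, schema, stored_path):
    """Return (sql, args) for deleting one document's row from ai_parse_results."""
    table = f"{catalog}.{schema}.ai_parse_results"
    # :path is a named parameter marker, so the value is bound rather than concatenated
    sql = f"DELETE FROM {table} WHERE path = :path"
    return sql, {"path": stored_path}


# Hypothetical names and path for illustration only
stmt, args = build_delete_statement(
    "my_catalog",
    "my_schema",
    "dbfs:/Volumes/my_catalog/my_schema/my_volume/input/old.pdf",
)
print(stmt)

# In a notebook with an active Spark session (assumption: named parameters are supported):
# spark.sql(stmt, args=args)
```

Check the stored value first with `SELECT path FROM ... ai_parse_results`, since the delete only matches an identical path string.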

#### Output

You will end up with a catalog structure like the one below:

```text
catalog
└─ schema
   ├─ volume
   │  ├─ input/
   │  └─ output/
   └─ tables
      └─ ai_parse_results
```


In [0]:
from delta.tables import DeltaTable
from pyspark.sql.functions import lit

In [0]:
catalog = "users"
schema = "david_hurley"
volume = "ai_parse_files"

input_file = ""  # single filename (e.g. my.pdf), or leave blank to parse all unprocessed files

# WARNING - setting to True will overwrite all previous parsed results in the ai_parse_results table
force_reprocess_all = False

### DO NOT MODIFY BELOW

In [0]:
sql = f"""
  CREATE TABLE IF NOT EXISTS {catalog}.{schema}.ai_parse_results (
    createdAt TIMESTAMP NOT NULL,
    path STRING NOT NULL,
    parsed VARIANT NOT NULL,
    CONSTRAINT ai_parse_results_pk PRIMARY KEY (path)
  )
"""

spark.sql(sql)

In [0]:
def get_unprocessed_filenames(volume_path: str, result_path: str):
  """Return a brace glob of input filenames not yet in the results table, or None if there are none."""

  # LIST returns one row per file, including its full `path`
  input_volume_files = spark.sql(f"LIST 'dbfs:{volume_path}'")

  # Paths that already have parsed results
  processed_files = spark.sql(f"SELECT path FROM {result_path}")

  # The anti-join keeps only input files with no matching row in the results table
  unprocessed_files = input_volume_files.join(
    processed_files,
    on="path",
    how="left_anti"
  )

  unprocessed_filenames = [row.path.split("/")[-1] for row in unprocessed_files.select("path").collect()]

  if not unprocessed_filenames:
    return None

  # Build a {a,b,c} alternation glob that matches exactly the unprocessed filenames
  return "{" + ",".join(unprocessed_filenames) + "}"


In [0]:
volume_path = f"/Volumes/{catalog}/{schema}/{volume}"
results_table_path = f"{catalog}.{schema}.ai_parse_results"

if force_reprocess_all:
  files_to_process = "*"
elif input_file:
  # Parse only the requested file
  files_to_process = input_file
else:
  files_to_process = get_unprocessed_filenames(f"{volume_path}/input", results_table_path)

input_files_path = f"{volume_path}/input/{files_to_process}"
output_results_path = f"{volume_path}/output"

if files_to_process:
  sql = f"""
  WITH parsed_documents AS (
      SELECT
        path,
        ai_parse_document(
          content,
          map(
            'imageOutputPath', '{output_results_path}',
            'descriptionElementTypes', '*'
          )
        ) AS parsed
      FROM READ_FILES('{input_files_path}', format => 'binaryFile')
    )
  SELECT 
    current_timestamp() AS createdAt,
    path,
    parsed
  FROM parsed_documents
  """

  df = spark.sql(sql)

  delta_table = DeltaTable.forName(spark, results_table_path)

  delta_table.alias("target").merge(
      df.alias("source"),
      "target.path = source.path AND source.path IS NOT NULL"
  ).whenMatchedUpdate(
      set={
          "createdAt": "source.createdAt",
          "path": "source.path",
          "parsed": "source.parsed"
      }
  ).whenNotMatchedInsert(
      values={
          "path": "source.path",
          "createdAt": "source.createdAt",
          "parsed": "source.parsed"
      }
  ).execute()

