# ZIP Flatten + Unzip PoC

This notebook provides a repeatable proof-of-concept for **flattening and unzipping ERP delivery ZIP files** into a **single output folder** in a Databricks Volume. It is designed to validate that our approach can:
- Discover ZIPs in nested directory structures
- Unzip archives (including ZIPs that contain internal folder paths)
- Flatten all extracted files into a single destination directory
- Avoid filename collisions using a deterministic naming strategy
- Continue processing when corrupt ZIPs are encountered (skip + continue)

### How it works
1. Recursively searches a **source folder** for ZIP files
2. For each ZIP:
   - Extracts contents into a **temporary folder** under the destination directory
   - Moves each extracted file into the destination root folder using the naming format
     `<ZipBaseName>_<Hash(zip_path)>_<FlattenedInternalPath>`,
     where the hash is the first 12 hex characters of the SHA-256 of the ZIP path
3. Removes the temporary folder to keep the destination clean
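
The naming format above can be sketched in Python as follows. This is a minimal illustration, not the notebook's implementation (which is the `%sh` cell): the function name `flattened_name` and the example paths are hypothetical, and the hash is assumed to be SHA-256 over the UTF-8 bytes of the ZIP path, truncated to 12 hex characters.

```python
import hashlib
import os


def flattened_name(zip_path: str, internal_path: str) -> str:
    """Build <ZipBaseName>_<Hash(zip_path)>_<FlattenedInternalPath>."""
    # First 12 hex characters of SHA-256 over the ZIP path string.
    short_hash = hashlib.sha256(zip_path.encode("utf-8")).hexdigest()[:12]
    zip_name = os.path.basename(zip_path)
    # Strip a trailing .zip extension in any letter case.
    base = zip_name[:-4] if zip_name.lower().endswith(".zip") else zip_name
    # Replace path separators so nested members land in one flat folder.
    flat = internal_path.replace("/", "_")
    return f"{base}_{short_hash}_{flat}"


# Hypothetical paths: same ZIP name and member name, different source folders.
a = flattened_name("/Volumes/demo/in/a/drop001.zip", "data/file.txt")
b = flattened_name("/Volumes/demo/in/b/drop001.zip", "data/file.txt")
print(a)
print(b)
```

Because the hash is derived from the full ZIP path, two ZIPs with the same file name in different folders produce different output names, and re-running against the same path yields the same name.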

### Inputs / Outputs
- **Source Volume Path** (`src_vol` widget): folder containing ZIP files (recursive search)
- **Destination Volume Path** (`dst_vol` widget): output folder where all extracted files will be written (flat)

### Test Scenarios (validated below)
The notebook output is validated against test packs covering:
- Single-file ZIPs (1:1)
- Multi-file ZIPs (1:N expansion)
- Filename collisions across ZIPs (same TXT name)
- Encoding edge cases (unicode / special chars)
- Unexpected file types inside ZIP
- Corrupt ZIP handling (skip + continue)
- Deeply nested ZIP locations (recursive discovery)
- Internal folder structure inside ZIP (nested paths + colliding basenames)
- Zero-byte TXT files
- High-file-count ZIPs (large fan-out, e.g., 1000 members)


In [0]:
# This assumes the data already exists at dbfs:/FileStore/stage_test_pack/incoming

dbutils.fs.cp(
  "dbfs:/FileStore/stage_test_pack/incoming",
  "/Volumes/raw_wilson/erp/incoming/test_pack",
  recurse=True
)

---------------------------------------------------------------------------
ExecutionError                            Traceback (most recent call last)
File <command-5569020497463083>, line 1
----> 1 dbutils.fs.cp(
      2   "dbfs:/FileStore/stage_test_pack/incoming",
      3   "/Volumes/raw_wilson/erp/incoming/test_pack",
      4   recurse=True
      5 )

File /databricks/python_shell/lib/dbruntime/remotefshandler/RemoteFsHandler.py:54, in prettify_exception_message.<locals>.f_with_exception_handling(*args, **kwargs)
     51 class ExecutionError(Exception):
     52     pass
---> 54

In [0]:
import os

# Notebook parameters 
dbutils.widgets.text("src_vol", "/Volumes/raw_wilson/erp/incoming/_stage_unzip_behavior_pack/unzip_behavior_pack/", "Source")
dbutils.widgets.text("dst_vol", "/Volumes/raw_wilson/erp/incoming/test_unzipped", "Destination")

# Read widget values
src_vol = dbutils.widgets.get("src_vol")
dst_vol = dbutils.widgets.get("dst_vol")


# Pass values into the shell environment for the %sh cell
os.environ["SRC_VOL"] = src_vol
os.environ["DST_VOL"] = dst_vol

In [0]:
%sh
# Folder_Hash_File
# - Recursively find ZIP files in the source directory
# - Unzip each one into a temporary folder under the destination
# - Move extracted files into the destination root (flattened)
# - Rename to avoid collisions: <zipBase>_<hash(zip_path)>_<internalPathFlattened>
# - Clean up temp folder
#
# Note: xargs -P 4 enables parallel processing of up to 4 ZIPs at a time.

src_vol="$SRC_VOL"
dst_vol="$DST_VOL"

run_ts=$(date -u +"%Y%m%dT%H%M%SZ")

log_dir="$dst_vol/_log"
log_file="$log_dir/unzip_failures_${run_ts}.log"
mkdir -p "$log_dir"

find "$src_vol" -iname "*.zip" -print0 |
  xargs -0 -P 4 -I {} bash -c '
    zip="$1"
    dst="$2"
    log_file="$3"

    # Create a deterministic short hash from the zip path (used for uniqueness)
    hash=$(printf "%s" "$zip" | shasum -a 256 | cut -c1-12)

    # Zip filename and base name (strip .zip/.ZIP)
    zname=$(basename "$zip")
    zbase=${zname%.[Zz][Ii][Pp]}

    # Create a temporary extraction folder specific to this zip + process id.
    # This prevents clashes when running in parallel.
    tmp="$dst/tmp_${hash}_$$"
    mkdir -p "$tmp"

    # Extract quietly into temp. If unzip fails (corrupt zip), cleanup, log and skip.
    if ! err_out=$(unzip -oq "$zip" -d "$tmp" 2>&1); then
      ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
      msg=$(printf "%s" "$err_out" | tr "\n" " " | cut -c1-800)
      printf "%s | UNZIP_FAIL | %s | %s\n" "$ts" "$zip" "$msg" >> "$log_file"
      rm -rf "$tmp"
      exit 0
    fi

    # For each extracted file:
    # - compute its relative path inside the zip extraction
    # - replace "/" with "_" to flatten internal folder structures
    # - move into the destination root with a collision-safe name
    while IFS= read -r -d "" f; do
      rel=${f#"$tmp"/}
      safe=${rel//\//_}
      mv "$f" "$dst/${zbase}_${hash}_$safe"
    done < <(find "$tmp" -type f -print0)

    # Remove temp extraction folder
    rm -rf "$tmp"
  ' bash {} "$dst_vol" "$log_file"

In [0]:
%sh
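# Optional helper: list or clear the flattened output between test runs.
# Both commands are commented out, so this cell does nothing unless edited.
dst_vol="/Volumes/raw_wilson/erp/incoming/test_unzipped"
#find "$dst_vol" -maxdepth 1 -type f -print
#find "$dst_vol" -maxdepth 1 -type f -exec rm -f {} +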

# Tested Scenarios




### 1. Single-file ZIPs (1:1)

**Scenario**
- Each ZIP contained exactly one TXT file

**Expected Behaviour**
- Each TXT file is extracted
- All files land directly in the target root directory

Outcome: **Pass**

All TXT files were successfully unpacked and written to the base output folder.

### 2. Multi-file ZIPs (1:N expansion)

**Scenario**
- Multiple ZIP files
- Each ZIP contains more than one TXT file

**Expected Behaviour**
- All TXT files extracted
- Flattened into the base output directory

Outcome: **Pass**

Each ZIP expanded correctly, resulting in 15 TXT files written to the root folder.


### 3. Filename collisions across ZIPs

**Scenario**
- 2 ZIP files
- Both ZIPs contained a TXT file with the same filename

**Expected Behaviour**
- No overwrite
- Files uniquely named using a deterministic naming strategy

**Naming Strategy**

`<ZipBaseName>_<Hash(zip_path)>_<FlattenedInternalPath>`

Outcome: **Pass**

Both files were successfully extracted and written without collision using the naming convention.

### 4. Encoding edge cases (special characters)

**Scenario**
- ZIP containing TXT files with non-ASCII / special characters in filenames

**Expected Behaviour**
- Files should extract without errors
- Filenames preserved safely

Outcome: **Pass**

Files were unpacked correctly and written to the base folder.


### 5. Unexpected file types inside ZIP

**Scenario**
- 1 ZIP containing non-TXT files (e.g. image, JSON)

**Expected Behaviour**
- Files extracted successfully **(content type not restricted at unzip stage yet)**

Outcome: **Pass**

Unexpected file types were unpacked without issue.


### 6. Corrupt ZIP handling

**Scenario**
- One intentionally corrupted ZIP file

**Expected Behaviour**
- `unzip` should fail on the corrupt ZIP
- Processing should continue for the other ZIPs
- The corrupt file should not block the run

Outcome: **Expected Failure / Handled Gracefully**

The corrupt ZIP failed to extract and was skipped; the script records such failures in `_log/unzip_failures_<run_ts>.log` under the destination. All other ZIPs processed successfully.
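
The skip-and-continue behaviour can be illustrated with a small Python sketch using the standard `zipfile` module. This is an analogue of the shell logic rather than the notebook's code: the function name `extract_all_or_skip`, the `.d` suffix for per-ZIP output folders, and the sample files are all hypothetical.

```python
import os
import tempfile
import zipfile


def extract_all_or_skip(zip_paths, dest_root):
    """Extract each ZIP; collect failures instead of stopping the run."""
    failures = []
    for zip_path in zip_paths:
        target = os.path.join(dest_root, os.path.basename(zip_path) + ".d")
        try:
            with zipfile.ZipFile(zip_path) as zf:
                zf.extractall(target)
        except (zipfile.BadZipFile, OSError) as exc:
            # Record the problem and move on to the next archive.
            failures.append((zip_path, str(exc)))
    return failures


# Build one valid ZIP and one file that only pretends to be a ZIP.
work = tempfile.mkdtemp()
good = os.path.join(work, "good.zip")
bad = os.path.join(work, "corrupt.zip")
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("a.txt", "hello")
with open(bad, "wb") as fh:
    fh.write(b"this is not a zip archive")

failures = extract_all_or_skip([bad, good], os.path.join(work, "out"))
print(failures)
```

The corrupt archive is listed first on purpose, to show that a failure early in the run does not prevent later ZIPs from being extracted.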

### 7. Deeply nested ZIP location

**Scenario**
- ZIP file located in a deeply nested directory structure
- ZIP path example: `deep_nesting/.../DEEP_NEST_ZIP_LOCATION__drop001.zip`
- ZIP contained multiple TXT files

**Expected Behaviour**
- ZIP should be discovered regardless of directory depth
- All TXT files extracted successfully
- Extracted files flattened into the target root directory (no subfolders preserved)

Outcome: **Pass**

The ZIP was correctly discovered despite deep nesting, all TXT files were unpacked, and the extracted files were written directly to the base output folder.
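
For comparison with the `find "$src_vol" -iname "*.zip"` step, recursive, case-insensitive discovery could be sketched in Python as below. The helper name `find_zips` and the sample directory layout are hypothetical and exist only to illustrate depth-independent matching.

```python
import os
import tempfile


def find_zips(root):
    """Yield paths of *.zip files (any letter case) at any depth under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(".zip"):
                yield os.path.join(dirpath, name)


# Hypothetical layout: one deeply nested ZIP, one upper-case ZIP, one non-ZIP.
root = tempfile.mkdtemp()
deep_dir = os.path.join(root, "a", "b", "c", "d")
os.makedirs(deep_dir)
for path in (
    os.path.join(deep_dir, "deep.zip"),
    os.path.join(root, "top.ZIP"),
    os.path.join(root, "notes.txt"),
):
    open(path, "wb").close()

found = sorted(os.path.relpath(p, root) for p in find_zips(root))
print(found)
```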

### 8. Internal folder structure inside ZIP (colliding basenames)

**Scenario**
- 1 ZIP file containing an internal directory hierarchy
- ZIP members stored in nested paths (e.g. `folderA/`, `folderB/sub/`)
- Multiple TXT files shared the same basename but existed in different internal folders

**Expected Behaviour**
- All files should be extracted regardless of internal ZIP folder structure
- Internal directory paths should not be recreated in the output; they are folded into the filename (`/` replaced with `_`)
- Files flattened into the target root directory
- Filename collisions handled using deterministic renaming

Outcome: **Pass**

All TXT files were successfully extracted from nested paths within the ZIP. Despite colliding basenames, files were flattened into the base output directory without overwrites using the configured naming strategy.
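
One caveat worth checking in future test packs: replacing `/` with `_` is not a one-to-one mapping, so two different member paths inside the same ZIP can flatten to the same name. The sketch below shows a possible pre-check; the helper name `flatten_collisions` and the member list are hypothetical and are not taken from the test pack used here.

```python
from collections import defaultdict


def flatten_collisions(member_paths):
    """Group ZIP member paths that flatten to the same output name."""
    groups = defaultdict(list)
    for path in member_paths:
        groups[path.replace("/", "_")].append(path)
    # Keep only the flattened names produced by more than one member.
    return {flat: paths for flat, paths in groups.items() if len(paths) > 1}


# Hypothetical members: the first and third flatten to the same name.
members = ["folderA/data.txt", "folderB/sub/data.txt", "folderA_data.txt"]
clashes = flatten_collisions(members)
print(clashes)
```

An empty result means flattening is safe for that ZIP; a non-empty result identifies members that would overwrite each other under the current naming format.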

### 9. Zero-byte files inside ZIP

**Scenario**
- ZIP containing one or more zero-byte TXT files

**Expected Behaviour**
- Zero-byte files should be extracted successfully
- Files should not be skipped or dropped
- Zero-byte files should be written to the target root directory

Outcome: **Pass**

Zero-byte TXT files were successfully extracted and written to the base output folder with no errors or omissions.
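
A check like the following could be used to confirm that zero-byte members survive extraction. It is a sketch only: the helper name `zero_byte_files` and the sample ZIP contents are hypothetical, and it uses Python's `zipfile` rather than the `unzip` command the notebook runs.

```python
import os
import tempfile
import zipfile


def zero_byte_files(folder):
    """Return the sorted names of files in folder whose size is zero."""
    return sorted(
        entry.name
        for entry in os.scandir(folder)
        if entry.is_file() and entry.stat().st_size == 0
    )


# Build a hypothetical ZIP with one empty and one non-empty member.
work = tempfile.mkdtemp()
zip_path = os.path.join(work, "zero.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("empty.txt", "")
    zf.writestr("full.txt", "x")

out_dir = os.path.join(work, "out")
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(out_dir)

empties = zero_byte_files(out_dir)
print(empties)
```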

### 10. High-file-count ZIP (large fan-out)

**Scenario**
- ZIP containing a high number of member files
- Example: `MANY_SMALL_FILES__1000_members.zip`

**Expected Behaviour**
- All member files should be extracted successfully
- Files should be flattened into the target root directory
- Temporary extraction space used to safely handle large fan-out

Outcome: **Pass**

All files were successfully extracted into a temporary directory, flattened, and written to the base output folder.
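
For large fan-out ZIPs, a count reconciliation makes the pass criterion explicit: the number of file members in the archive should equal the number of output files carrying that ZIP's prefix. The sketch below assumes hypothetical helper names (`member_file_count`, `output_file_count`), a synthetic 1000-member ZIP, and a placeholder prefix standing in for `<ZipBaseName>_<Hash(zip_path)>_`.

```python
import os
import tempfile
import zipfile


def member_file_count(zip_path):
    """Count file members in a ZIP, ignoring directory entries."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(1 for info in zf.infolist() if not info.is_dir())


def output_file_count(dest_dir, prefix):
    """Count files in dest_dir whose names start with the given prefix."""
    return sum(
        1
        for entry in os.scandir(dest_dir)
        if entry.is_file() and entry.name.startswith(prefix)
    )


# Build a synthetic 1000-member ZIP with members spread over ten folders.
work = tempfile.mkdtemp()
zip_path = os.path.join(work, "many.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    for i in range(1000):
        zf.writestr(f"part{i % 10}/file_{i:04d}.txt", "x")

# Stand-in for the flatten step: write each member under a flat name.
dest = os.path.join(work, "out")
os.makedirs(dest)
prefix = "many_demo_"  # placeholder for <ZipBaseName>_<Hash(zip_path)>_
with zipfile.ZipFile(zip_path) as zf:
    for info in zf.infolist():
        if not info.is_dir():
            flat = info.filename.replace("/", "_")
            with open(os.path.join(dest, prefix + flat), "wb") as out:
                out.write(zf.read(info))

expected = member_file_count(zip_path)
actual = output_file_count(dest, prefix)
print(expected, actual)
```

A mismatch between the two counts would point to dropped members or to overwrites caused by colliding flattened names.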