In [0]:
%sql
SHOW EXTERNAL LOCATIONS;

In [0]:
%sql
LIST 'abfss://warehouse@dbxdl.dfs.core.windows.net/lineitems/' WITH (CREDENTIAL `dbxdl-storage-account-creds`) LIMIT 3;


In [0]:
%sql
CREATE CATALOG IF NOT EXISTS MASTERCLASS;

In [0]:
%sql
USE CATALOG MASTERCLASS;

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS BRONZE
 MANAGED LOCATION "abfss://delta@dbxdl.dfs.core.windows.net/bronze/";

In [0]:
%sql
CREATE OR REPLACE TABLE BRONZE.USEDCARS (
  V VARIANT
) USING DELTA;

In [0]:
%sql
SELECT * FROM PARQUET.`abfss://warehouse@dbxdl.dfs.core.windows.net/lineitems/*`;

In [0]:
%sql
COPY INTO BRONZE.USEDCARS
FROM 'abfss://warehouse@dbxdl.dfs.core.windows.net/lineitems/*'
FILEFORMAT = PARQUET
FORMAT_OPTIONS ('singleVariantColumn' = 'true');

Why don’t COPY INTO / Auto Loader help here?
* singleVariantColumn and the Auto Loader option of the same name exist only for JSON sources, so a workaround is needed.
* For Parquet you must read the data with Spark (batch or Structured Streaming) and do the conversion yourself.

Below is a workflow for loading Parquet into VARIANT on Databricks Runtime 15.3 or later (or Serverless); 15.3 is the first release that exposes the VARIANT type in Delta Lake. The key idea is that each Parquet record is first read as a STRUCT and then converted into a single VARIANT column, here by serializing it with to_json and parsing the result with parse_json. Because Parquet has no “singleVariantColumn” option the way JSON does, you have to do this conversion yourself.

1  Prerequisites

| Requirement                                               | Why it matters                                         |
|-----------------------------------------------------------|--------------------------------------------------------|
| DBR ≥ 15.3                                                | VARIANT is only supported from 15.3 up                 |
| Delta table backed by Unity Catalog or the Hive metastore | VARIANT is a first-class Delta data type               |
| Parquet rows ≤ 16 MB                                      | Larger rows are silently redirected to _malformed_data |
| Maps in your data must have string keys                   | VARIANT rejects maps with non-string keys              |

In [0]:
%sql

TRUNCATE TABLE BRONZE.USEDCARS;

INSERT INTO BRONZE.USEDCARS (V)
SELECT parse_json(to_json(struct(*))) AS V
FROM parquet.`abfss://warehouse@dbxdl.dfs.core.windows.net/lineitems/*`;

In [0]:
%sql

SELECT * FROM BRONZE.USEDCARS LIMIT 3;

Let's try pulling the attributes out of the VARIANT column:

In [0]:
%sql

SELECT
  variant_get(V, '$.L_ORDERKEY') AS order_key,
  variant_get(V, '$.L_PARTKEY') AS part_key,
  variant_get(V, '$.L_SUPPKEY') AS supp_key,
  variant_get(V, '$.L_LINENUMBER') AS line_number,
  variant_get(V, '$.L_QUANTITY') AS quantity,
  variant_get(V, '$.L_EXTENDEDPRICE') AS extended_price,
  variant_get(V, '$.L_DISCOUNT') AS discount,
  variant_get(V, '$.L_TAX') AS tax,
  variant_get(V, '$.L_RETURNFLAG') AS return_flag,
  variant_get(V, '$.L_LINESTATUS') AS line_status,
  variant_get(V, '$.L_SHIPDATE') AS ship_date,
  variant_get(V, '$.L_COMMITDATE') AS commit_date,
  variant_get(V, '$.L_RECEIPTDATE') AS receipt_date,
  variant_get(V, '$.L_SHIPINSTRUCT') AS ship_instruct,
  variant_get(V, '$.L_SHIPMODE') AS ship_mode,
  variant_get(V, '$.L_COMMENT') AS comment
FROM BRONZE.USEDCARS
LIMIT 3;
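
The values returned by the query above are themselves VARIANT, and their underlying types depend on how the rows were loaded. Since the INSERT went through to_json and parse_json, anything JSON cannot represent natively, such as a date, presumably arrives as a string. The plain-Python analogy below is not Databricks code, and the row values are invented for illustration; it only shows the same effect with the standard json module.

```python
import json
from datetime import date

# Hypothetical row; the column names mirror the query above, the values are made up.
row = {"L_ORDERKEY": 1, "L_RETURNFLAG": "N", "L_SHIPDATE": date(1996, 3, 13)}


def encode_non_json(value):
    """Render types JSON cannot represent; only dates are handled in this sketch."""
    if isinstance(value, date):
        return value.isoformat()
    raise TypeError(f"unsupported type: {type(value).__name__}")


# Serialize to JSON text, then parse it back, loosely mirroring to_json + parse_json.
json_text = json.dumps(row, default=encode_non_json)
round_tripped = json.loads(json_text)

# The integer and the string survive unchanged; the date comes back as a string.
for key, value in round_tripped.items():
    print(f"{key}: {type(value).__name__} = {value!r}")
```

If typed columns are needed, variant_get also accepts an optional third argument naming the target type, for example variant_get(V, '$.L_SHIPDATE', 'date'); whether that cast succeeds depends on how the value was actually stored.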

# What's the problem here?

INSERT INTO … SELECT … FROM <files> is a generic SQL append: every time it runs, it re-reads whatever files you point it at and appends those rows, with no memory of what it loaded before. COPY INTO, in contrast, is an idempotent, file-aware loader: it keeps a load history for the target table and skips any file it has already processed, even if you rerun the same command or schedule it repeatedly. Snowflake’s COPY INTO also tracks loaded files, which is why it too avoids duplicates while a plain INSERT INTO does not.
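
The difference can be sketched with a toy model in plain Python. Nothing here is a Databricks API: ToyTable, blind_insert, file_aware_copy, and the file names are hypothetical stand-ins, and the "load history" is just a set of file paths.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for a table; not a Databricks API."""
    rows: list = field(default_factory=list)
    loaded_files: set = field(default_factory=set)  # toy "load history"


def blind_insert(table, files):
    """Like INSERT INTO ... SELECT: append every row from every file, every time."""
    for rows in files.values():
        table.rows.extend(rows)


def file_aware_copy(table, files):
    """Like an idempotent loader: skip files already recorded in the load history."""
    newly_loaded = 0
    for path, rows in files.items():
        if path in table.loaded_files:
            continue  # seen before, so loading it again would duplicate rows
        table.rows.extend(rows)
        table.loaded_files.add(path)
        newly_loaded += 1
    return newly_loaded


# Two hypothetical source files holding three rows in total.
files = {
    "part-0000.parquet": [{"id": 1}, {"id": 2}],
    "part-0001.parquet": [{"id": 3}],
}

appended = ToyTable()
blind_insert(appended, files)
blind_insert(appended, files)  # the rerun appends the same rows again

copied = ToyTable()
first_run = file_aware_copy(copied, files)
second_run = file_aware_copy(copied, files)  # the rerun finds nothing new

print(len(appended.rows), len(copied.rows), first_run, second_run)
```

Running the blind append twice leaves six rows in the toy table, while the file-aware loader stays at three and reports zero newly loaded files on its second run.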

# No DataFrame‑level substitute (and why)

* There is no DataFrameWriter.copyInto() or similar; the Spark-side write APIs, such as write.mode("append").save(...), keep no record of which input files were already loaded, so re-running an append duplicates rows.
* You could hand-roll idempotency by writing input_file_name() into a “manifest” column and merging on it, but that just recreates what COPY INTO already does for you.
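
For illustration, a hand-rolled version of the manifest idea might look like the sketch below. It rests on several assumptions: SOURCE_FILE is a hypothetical extra column that would have to be added to BRONZE.USEDCARS first, build_idempotent_insert is a made-up helper, and the file path is taken from the _metadata.file_path column rather than input_file_name(), a substitution that has not been verified against this workspace. The helper only builds the SQL text; in a notebook the statement would then be passed to spark.sql.

```python
def build_idempotent_insert(table, source_glob, file_column="SOURCE_FILE"):
    """Build an INSERT that skips files whose path is already stored in file_column.

    file_column is a hypothetical manifest column; it must already exist on the table.
    """
    return f"""
INSERT INTO {table} (V, {file_column})
SELECT parse_json(to_json(struct(*))) AS V,
       _metadata.file_path AS {file_column}
FROM parquet.`{source_glob}`
WHERE _metadata.file_path NOT IN (
  SELECT {file_column} FROM {table} WHERE {file_column} IS NOT NULL
)
""".strip()


stmt = build_idempotent_insert(
    "BRONZE.USEDCARS",
    "abfss://warehouse@dbxdl.dfs.core.windows.net/lineitems/*",
)
print(stmt)
# In a Databricks notebook the statement could then be run with spark.sql(stmt).
```

The inner query filters out NULLs because NOT IN returns no rows when the subquery contains a NULL. Unlike COPY INTO, this check is not protected against two loads running at the same time, which is part of why it is only a partial substitute.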