
# What is Databricks SQL's read_files function?

The read_files function in Databricks SQL is a powerful table-valued function that allows you to directly query and ingest data from files stored in cloud object storage or Unity Catalog volumes. It lets you treat raw data files as if they were tables, enabling immediate SQL-based analysis without requiring prior table creation or full data loading.

This function is ideal for:

- Ad-hoc data exploration: Quickly examine data directly from files.
- Initial data ingestion: Serve as the source for creating streaming tables, especially when combined with Auto Loader for incremental processing.

## How read_files simplifies data access

Accessing diverse file types at scale can be challenging. `read_files` makes it easy, offering these benefits:

- Direct Access: Read data directly from files in formats like JSON, CSV, Parquet, AVRO, ORC, and more.
- Automatic Schema Inference: It intelligently infers the schema across your files, or you can provide one.
- Flexible File Discovery: Easily target individual files or entire directories, including recursive discovery, using glob patterns.
- Simplified Data Engineering: Streamlines the initial step of bringing data into your Lakehouse, serving as a flexible entry point for your data pipelines.

For more details, refer to the Databricks [read_files documentation](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files).
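
The queries in this notebook all follow the same shape: a path followed by optional named arguments written as `name => value`. As a rough sketch of that shape, the helper below assembles such a query string in plain Python. The function names `render_option` and `build_read_files_query` and the example path are hypothetical, introduced only for illustration; they are not part of Databricks.

```python
def render_option(value):
    """Render a Python value as a SQL literal for a read_files argument."""
    # bool must be checked before int, because bool is a subclass of int
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    # Keep the sketch simple: refuse values that would need SQL escaping
    if "'" in text:
        raise ValueError(f"single quotes are not supported in this sketch: {text!r}")
    return f"'{text}'"


def build_read_files_query(path, limit=None, **options):
    """Build a SELECT * query over read_files with named options."""
    args = [render_option(path)]
    args += [f"{name} => {render_option(value)}" for name, value in options.items()]
    query = f"SELECT * FROM read_files({', '.join(args)})"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


# Hypothetical path, used only to show the generated SQL text
print(build_read_files_query("/Volumes/demo/users_csv", limit=5, format="csv", header=False))
```

The resulting string could then be passed to `spark.sql(...)` in the same way as the hand-written queries in the cells that follow.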

In [0]:
%run ./_resources/00-setup $reset_all_data=false

In [0]:
display(dbutils.fs.ls(volume_folder))


## 1. Basic Usage: Automatic Format Detection

One of the key advantages of `read_files` is automatic format detection. Let's read several file formats from our demo data folder and let `read_files` detect the format of each one.

In [0]:
display(spark.sql(f"SELECT * FROM read_files('{volume_folder}/user_json') LIMIT 5"))

In [0]:
display(spark.sql(f"SELECT * FROM read_files('{volume_folder}/user_csv') LIMIT 5"))

In [0]:
display(spark.sql(f"SELECT * FROM read_files('{volume_folder}/user_parquet') LIMIT 5"))

In [0]:
display(spark.sql(f"SELECT year, month, COUNT(*) as records FROM read_files('{volume_folder}/user_parquet_partitioned') GROUP BY year, month"))
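
The query above groups by `year` and `month` even though they may not be stored inside the Parquet files themselves. Assuming the folder uses Hive-style partition directories such as `year=2023/month=7` (an assumption; the actual layout of the demo data is not shown here), the sketch below illustrates how such column values can be recovered from a file path. The function name and the sample path are hypothetical.

```python
def partition_values(file_path):
    """Extract key=value directory segments from a Hive-style partitioned path."""
    values = {}
    # Skip the last segment: it is the file name, not a partition directory
    for segment in file_path.split("/")[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


# Hypothetical path, for illustration only
print(partition_values("user_parquet_partitioned/year=2023/month=7/part-0000.parquet"))
```

This is only a conceptual model of partition discovery; the values come back as strings here, whereas a query engine would typically also infer a type for each partition column.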


`read_files` also supports glob patterns for selective file reading, so you can target only the folders or files of the format you want to read.

In [0]:
display(spark.sql(f"SELECT 'JSON Files' as source, * FROM read_files('{volume_folder}/*json*') LIMIT 3"))
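
To get a feel for what a pattern like `*json*` selects, the sketch below applies Python's `fnmatch` module to a list of folder names taken from this notebook. Python's matching rules only approximate the glob semantics of cloud storage paths, so treat this as an illustration rather than a description of how `read_files` resolves patterns. The helper name `match_folders` is hypothetical.

```python
from fnmatch import fnmatchcase


def match_folders(names, pattern):
    """Return the names that match a shell-style glob pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Folder names used elsewhere in this notebook
folders = [
    "user_json",
    "user_csv",
    "user_csv_no_headers",
    "user_parquet",
    "user_parquet_partitioned",
]

print(match_folders(folders, "*json*"))
print(match_folders(folders, "user_csv*"))
```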


## 2. Schema Inference

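Schema inference capabilities and performance differ by format. Parquet files carry their schema in the file metadata, whereas for JSON and CSV the column types have to be inferred from the data itself.

We can also use schema hints to override the schema inference.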

In [0]:
json_schema = spark.sql(f"SELECT * FROM read_files('{volume_folder}/user_json') LIMIT 0").schema
print(json_schema.treeString())

In [0]:
csv_schema = spark.sql(f"SELECT * FROM read_files('{volume_folder}/user_csv') LIMIT 0").schema  
print(csv_schema.treeString())

In [0]:
parquet_schema = spark.sql(f"SELECT * FROM read_files('{volume_folder}/user_parquet') LIMIT 0").schema
print(parquet_schema.treeString())

In [0]:
display(spark.sql(f"""
SELECT
  format,
  MAX(id_type) AS id_type,
  MAX(age_group_type) AS age_group_type,
  MAX(date_type) AS date_type
FROM (
SELECT 
  'JSON' as format,
  typeof(id) as id_type,
  typeof(age_group) as age_group_type,
  typeof(creation_date) as date_type
FROM read_files('{volume_folder}/user_json')
UNION ALL
SELECT 
  'CSV' as format,
  typeof(id) as id_type, 
  typeof(age_group) as age_group_type,
  typeof(creation_date) as date_type
FROM read_files('{volume_folder}/user_csv')
UNION ALL
SELECT 
  'Parquet' as format,
  typeof(id) as id_type,
  typeof(age_group) as age_group_type, 
  typeof(creation_date) as date_type
FROM read_files('{volume_folder}/user_parquet')
) type_comparison
GROUP BY format
"""))

In [0]:
display(spark.sql(f"""
SELECT 
  id,
  typeof(id) as id_type_after_hint,
  age_group,
  typeof(age_group) as age_group_type_after_hint
FROM read_files(
  '{volume_folder}/user_json',
  schemaHints => 'id bigint, age_group string'
) LIMIT 5
"""))
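
Conceptually, a schema hint replaces the inferred type for the named columns and leaves every other column as inferred. The sketch below models that behavior with plain dictionaries. The inferred types in the example are made up for illustration, the function names are hypothetical, and the parser is deliberately simple: it does not handle types that contain commas, such as `decimal(10,2)` or `struct<...>`.

```python
def parse_schema_hints(hints):
    """Parse 'id bigint, age_group string' into {'id': 'bigint', 'age_group': 'string'}."""
    parsed = {}
    for part in hints.split(","):
        # Split each 'name type' pair on the first run of whitespace
        name, column_type = part.strip().split(None, 1)
        parsed[name] = column_type.strip()
    return parsed


def apply_schema_hints(inferred, hints):
    """Return a copy of the inferred schema with hinted column types overriding it."""
    merged = dict(inferred)
    merged.update(parse_schema_hints(hints))
    return merged


# Made-up inferred schema, for illustration only
inferred_schema = {"id": "string", "age_group": "string", "email": "string"}
print(apply_schema_hints(inferred_schema, "id bigint, age_group string"))
```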


## 3. Format-Specific Features

Some `read_files` options are specific to a file format, such as `header` and `sep` for CSV.

In [0]:
display(spark.sql(f"SELECT * FROM read_files('{volume_folder}/user_csv_no_headers', format => 'csv', header => 'false') LIMIT 5"))

In [0]:
display(spark.sql(f"""
SELECT * FROM read_files(
  '{volume_folder}/user_csv_no_headers',
  format => 'csv',
  schema => 'id bigint, creation_date string, firstname string, lastname string, email string, address string, gender double, age_group double'
) LIMIT 5
"""))

In [0]:
display(spark.sql(f"""
SELECT * FROM read_files(
  '{volume_folder}/user_csv_pipe_delimited',
  format => 'csv',
  sep => '|'  
) LIMIT 5
"""))

In [0]:
display(spark.sql(f"""
SELECT 
  firstname,
  lastname,
  id,
  typeof(id) as id_inferred_type,
  age_group,
  typeof(age_group) as age_group_inferred_type
FROM read_files(
  '{volume_folder}/user_json',
  inferColumnTypes => true
) LIMIT 5  
"""))

In [0]:
# Demonstrate column pruning (Parquet's key advantage)
import time

print("⚡ Parquet Column Pruning Demo:")

# Caveat: count() needs no column values, so Spark may skip reading column data
# in both queries. Treat the timings below as indicative only.

# Read all columns
start_time = time.time()
all_cols_count = spark.sql(f"SELECT * FROM read_files('{volume_folder}/user_parquet')").count()
all_cols_time = time.time() - start_time

# Read only specific columns
start_time = time.time()
select_cols_count = spark.sql(f"SELECT id, firstname FROM read_files('{volume_folder}/user_parquet')").count()
select_cols_time = time.time() - start_time

print(f"📊 All columns: {all_cols_count:,} records in {all_cols_time:.2f}s")
print(f"📊 2 columns: {select_cols_count:,} records in {select_cols_time:.2f}s")
# Guard against a zero elapsed time before computing the ratio
if select_cols_time > 0:
    print(f"⚡ Time ratio (all columns / 2 columns): {all_cols_time/select_cols_time:.1f}x")


## 4. Streaming Usage

`read_files` can be used in streaming tables to ingest files into Delta Lake. When it is used in a streaming query (with the `STREAM` keyword), `read_files` leverages Auto Loader.

In [0]:
# Create a streaming view
spark.sql(f"""
CREATE OR REPLACE TEMPORARY VIEW streaming_json_users AS
SELECT 
  *,
  current_timestamp() as processing_time
FROM STREAM read_files(
  '{volume_folder}/user_json',
  maxFilesPerTrigger => 5,
  schemaLocation => '{volume_folder}/read_files_streaming_schema'
)
""")

display(spark.sql("SELECT COUNT(*) as total_records FROM streaming_json_users"))

In [0]:
spark.sql(f"""
CREATE OR REPLACE TEMPORARY VIEW streaming_csv_users AS  
SELECT 
  *,
  'CSV' as source_format,
  current_timestamp() as processing_time
FROM STREAM read_files(
  '{volume_folder}/user_csv',
  schemaLocation => '{volume_folder}/csv_streaming_schema'
)
""")

spark.sql(f"""
CREATE OR REPLACE TEMPORARY VIEW streaming_parquet_users AS
SELECT 
  *,
  'PARQUET' as source_format, 
  current_timestamp() as processing_time
FROM STREAM read_files(
  '{volume_folder}/user_parquet',
  schemaLocation => '{volume_folder}/parquet_streaming_schema'
)
""")

display(spark.sql("""
SELECT 'CSV' as format, COUNT(*) as records FROM streaming_csv_users
UNION ALL  
SELECT 'PARQUET' as format, COUNT(*) as records FROM streaming_parquet_users
"""))
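
The `maxFilesPerTrigger` option used above caps how many new files each micro-batch picks up. The sketch below is a simplified, plain-Python model of that idea: files already processed are skipped, and the remainder is split into batches of bounded size. It is not how Auto Loader is implemented (which tracks its state in a checkpoint); the function name, the file names, and the choice to order files by name are assumptions made for illustration.

```python
def plan_micro_batches(all_files, already_processed, max_files_per_trigger):
    """Split not-yet-processed files into batches of at most max_files_per_trigger."""
    if max_files_per_trigger <= 0:
        raise ValueError("max_files_per_trigger must be positive")
    # Assumption: process new files in name order to keep the example deterministic
    new_files = sorted(f for f in all_files if f not in already_processed)
    return [
        new_files[i:i + max_files_per_trigger]
        for i in range(0, len(new_files), max_files_per_trigger)
    ]


# Hypothetical file names, for illustration only
files = [f"part-{n}.json" for n in range(1, 8)]
processed = {"part-1.json", "part-2.json"}

for batch in plan_micro_batches(files, processed, max_files_per_trigger=2):
    print(batch)
```

Only the five unprocessed files are scheduled, which is the property that makes incremental ingestion cheaper than rescanning the whole folder on every run.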


## 5. read_files vs Auto Loader

We have covered the basic features of `read_files`, which raises the question of when to use it and when to use Auto Loader directly.

The comparison and decision matrices below can help you choose. They compare `read_files` in batch queries against Auto Loader; as noted in the previous section, `read_files` with `STREAM` uses Auto Loader itself.


| Capability | read_files | Auto Loader | Winner |
|-----------|------------|-------------|---------|
| Ad-hoc queries | ✅ Perfect | ❌ Not designed | read_files |
| Multi-format API | ✅ Unified API | ❌ Single format | read_files |
| Streaming performance | ⚠️ Basic | ✅ Optimized | Auto Loader |
| Schema evolution | ⚠️ Manual | ✅ Automatic | Auto Loader |
| Incremental processing | ❌ Full scan | ✅ Incremental | Auto Loader |
| Setup complexity | ✅ Zero setup | ⚠️ More config | read_files |
| Cost efficiency | ⚠️ Pay per query | ✅ Incremental | Auto Loader |
| File notifications | ❌ No | ✅ Cloud notifications | Auto Loader |
| Batch processing | ✅ Excellent | ⚠️ Streaming focus | read_files |


| Scenario | Recommended Tool | Why |
|----------|------------------|-----|
| Data exploration | read_files | Zero setup, immediate results |
| Production streaming | Auto Loader | Optimized for continuous ingestion |
| Multi-format analysis | read_files | Unified API across formats |
| One-time migration | read_files | Simple, no infrastructure setup |
| Real-time pipelines | Auto Loader | Incremental processing, notifications |
| Cross-format reporting | read_files | Query multiple formats easily |
| Cost-sensitive streaming | Auto Loader | Only processes new data |
| Quick prototyping | read_files | Fastest time to value |
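
If it helps to keep the decision matrix close to your code, it can be encoded as a small lookup. The sketch below copies the scenarios and recommendations from the table above; the dictionary and function names are hypothetical.

```python
# Scenario -> recommended tool, copied from the decision matrix above
RECOMMENDATIONS = {
    "data exploration": "read_files",
    "production streaming": "Auto Loader",
    "multi-format analysis": "read_files",
    "one-time migration": "read_files",
    "real-time pipelines": "Auto Loader",
    "cross-format reporting": "read_files",
    "cost-sensitive streaming": "Auto Loader",
    "quick prototyping": "read_files",
}


def recommend_tool(scenario):
    """Look up the recommended tool for a scenario (case-insensitive)."""
    key = scenario.strip().lower()
    if key not in RECOMMENDATIONS:
        raise KeyError(f"unknown scenario: {scenario!r}")
    return RECOMMENDATIONS[key]


print(recommend_tool("Production streaming"))
print(recommend_tool("Quick prototyping"))
```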


## Conclusion

We have seen what Databricks SQL's `read_files` can do: format detection, schema inference and hints, format-specific options, and streaming ingestion. You can now apply it in your own projects.

In [0]:
DBDemos.stop_all_streams()