
<div  style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://www.opc-router.com/wp-content/uploads/2020/08/was-ist-json_600x250px-1.jpg" alt="Databricks Learning" style="width: 600px">
</div>

## Querying JSON 

List the sample datasets

In [0]:
%python
files = dbutils.fs.ls("dbfs:/databricks-datasets/samples/")
display(files)

Read a JSON file directly from a SQL statement

In [0]:
SELECT * FROM json.`dbfs:/databricks-datasets/samples/people/people.json`

Use a wildcard to read several files

In [0]:
SELECT * FROM json.`dbfs:/databricks-datasets/samples/people/*.json`
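
As a rough illustration of what a `*.json` pattern selects, here is a minimal sketch in plain Python. It uses `fnmatch` from the standard library as a stand-in for the path matching; Spark resolves wildcards through its own file system layer, so treat this only as an approximation. The file names in `listing` are hypothetical and do not describe the real contents of the samples folder.

```python
from fnmatch import fnmatch

# Hypothetical directory listing, used only for illustration.
listing = ["people.json", "people_2.json", "notes.txt", "_SUCCESS"]

def match_files(names, pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatch(name, pattern)]

json_files = match_files(listing, "*.json")
print(json_files)  # ['people.json', 'people_2.json']
```

The pattern keeps only the files whose names end in `.json`, which is why a wildcard is a convenient way to skip unrelated files that share the directory.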

Read the entire directory

In [0]:
SELECT * FROM json.`dbfs:/databricks-datasets/samples/people/`

Run aggregate functions

In [0]:
SELECT count(*) FROM json.`dbfs:/databricks-datasets/samples/people/`

Use the `_metadata` column to see which source file each row was read from (available with Unity Catalog)

In [0]:
SELECT *,
  _metadata.file_path AS file_path
FROM json.`dbfs:/databricks-datasets/samples/people/`

## Querying text Format

Query the files as plain text

In [0]:
SELECT * FROM text.`dbfs:/databricks-datasets/samples/people/`

## Querying binaryFile Format

Query the files as binary content

In [0]:
SELECT * FROM binaryFile.`dbfs:/databricks-datasets/samples/people/`
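
The queries above all share one shape: a format name followed by a backtick-quoted path. The sketch below shows one way to build such statements from Python. The helper `direct_file_query` and its `columns` parameter are hypothetical names introduced here for illustration; they are not part of Spark or Databricks.

```python
def direct_file_query(fmt: str, path: str, columns: str = "*") -> str:
    """Build a SELECT statement that reads files directly by format and path."""
    supported = {"json", "csv", "text", "binaryFile"}  # formats used in this notebook
    if fmt not in supported:
        raise ValueError(f"unsupported format: {fmt!r}")
    if "`" in path:
        # A backtick would terminate the quoted path early.
        raise ValueError("path must not contain a backtick")
    return f"SELECT {columns} FROM {fmt}.`{path}`"

query = direct_file_query("json", "dbfs:/databricks-datasets/samples/people/")
print(query)  # SELECT * FROM json.`dbfs:/databricks-datasets/samples/people/`
```

In a notebook the resulting string could be passed to `spark.sql(query)`. Restricting `fmt` to a fixed set is a design choice of this sketch, so that arbitrary text never ends up in the statement.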


## Querying CSV 

Query CSV flat files

In [0]:
%fs ls dbfs:/databricks-datasets/samples/population-vs-price/

In [0]:
SELECT * FROM csv.`dbfs:/databricks-datasets/samples/population-vs-price/data_geo.csv`
LIMIT 3

In [0]:
SELECT * FROM text.`dbfs:/databricks-datasets/samples/population-vs-price/data_geo.csv`
LIMIT 3

Read the flat file and infer its schema to create a view

In [0]:
CREATE OR REPLACE TEMP VIEW v_population_data
USING csv
OPTIONS (
  path "dbfs:/databricks-datasets/samples/population-vs-price/data_geo.csv",
  header "true",
  inferSchema "true",
  sep ","
);
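
To show what the `header`, `sep` and `inferSchema` options ask the reader to do, here is a loose analogy built on Python's standard `csv` module. It is not how Spark implements inference: `infer_type` is a hypothetical helper with deliberately naive rules, and the sample rows are invented (only the header names match the file).

```python
import csv
import io

# Invented rows; only the header line mirrors data_geo.csv.
sample = io.StringIO(
    "2014 rank,City,State,State Code,2014 Population estimate,2015 median sales price\n"
    "1,Alpha City,Alpha State,AA,1000000,250.5\n"
    "2,Beta Town,Beta State,BB,500000,\n"
)

def infer_type(values):
    """Pick the narrowest of int, double, string that fits every non-empty value."""
    present = [v for v in values if v != ""]  # empty fields are treated as missing
    for caster, label in ((int, "int"), (float, "double")):
        try:
            for v in present:
                caster(v)
            return label
        except ValueError:
            continue
    return "string"

# header: the first line supplies the column names; sep: the field delimiter.
rows = list(csv.DictReader(sample, delimiter=","))
# inferSchema: an extra pass over the data to guess a type per column.
schema = {name: infer_type([row[name] for row in rows]) for name in rows[0]}
print(schema)
```

The sketch makes the cost of inference visible: every value has to be read once before any type can be assigned, which is one reason an explicit schema is often preferred for large files.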


In [0]:
SELECT * FROM v_population_data;


## Limitations of Non-Delta Tables

In [0]:
DESCRIBE EXTENDED v_population_data

In [0]:
%python
files = dbutils.fs.ls("dbfs:/databricks-datasets/samples/population-vs-price")
display(files)

In [0]:
SELECT COUNT(*) FROM v_population_data

## Rename Columns

Read CSV files with PySpark

In [0]:
%python
df = spark.read.csv("dbfs:/databricks-datasets/samples/population-vs-price/", header=True)
display(df.limit(3))

Define the schema as a DDL string

In [0]:
%python
schema_csv="""
rank_2014 INT, city STRING, state STRING, state_code STRING, population_estimate_2014 BIGINT, median_sales_price_2015 DOUBLE
"""
df = spark.read.csv("dbfs:/databricks-datasets/samples/population-vs-price/", header=True, schema=schema_csv)
display(df.limit(3))
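
When the column list is long, the DDL string can be assembled from name and type pairs instead of being typed by hand. The sketch below does this in plain Python; `to_ddl` is a hypothetical helper introduced for illustration, and the pairs are copied from the schema string above.

```python
# Column names and types copied from the DDL string used in this notebook.
columns = [
    ("rank_2014", "INT"),
    ("city", "STRING"),
    ("state", "STRING"),
    ("state_code", "STRING"),
    ("population_estimate_2014", "BIGINT"),
    ("median_sales_price_2015", "DOUBLE"),
]

def to_ddl(cols):
    """Join (name, type) pairs into a DDL-formatted schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in cols)

schema_csv = to_ddl(columns)
print(schema_csv)
```

The resulting string can be passed as the `schema` argument of `spark.read.csv`, in the same way as the hand-written version.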

Define the schema with PySpark type classes

In [0]:
%python
from pyspark.sql.types import (
    StructType, StructField,
    StringType, IntegerType, DoubleType, LongType
)
schema_file = StructType([
    StructField("rank_2014", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("state", StringType(), True),
    StructField("state_code", StringType(), True),
    StructField("population_estimate_2014", LongType(), True),
    StructField("median_sales_price_2015", DoubleType(), True)
])
df = spark.read.csv("dbfs:/databricks-datasets/samples/population-vs-price/", header=True, schema=schema_file)
display(df.limit(3))

Rename columns based on a mapping

In [0]:
%python
from pyspark.sql import DataFrame 

def rename_columns(df: DataFrame) -> DataFrame:
    rename_map = {
        "2014 rank": "rank_2014",
        "City": "city",
        "State": "state",
        "State Code": "state_code",
        "2014 Population estimate": "population_estimate_2014",
        "2015 median sales price": "median_sales_price_2015"
    }
    for old_col, new_col in rename_map.items():
        df = df.withColumnRenamed(old_col, new_col)
    return df
df = spark.read.csv("dbfs:/databricks-datasets/samples/population-vs-price/", header=True)
df = rename_columns(df)
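
`withColumnRenamed` does nothing when the old column name is not present, so a typo in the map goes unnoticed. A small check before renaming can surface that. The sketch below works on plain lists; `check_rename_map` is a hypothetical helper, and the shortened map with the misspelled key `"Sate"` is only there to trigger the check.

```python
def check_rename_map(columns, rename_map):
    """Return (missing, unmapped): map keys absent from the columns,
    and columns that have no entry in the map."""
    missing = [old for old in rename_map if old not in columns]
    unmapped = [c for c in columns if c not in rename_map]
    return missing, unmapped

# Header names as they appear in rename_map above.
columns = [
    "2014 rank", "City", "State", "State Code",
    "2014 Population estimate", "2015 median sales price",
]
# Shortened map with a deliberate typo ("Sate") for demonstration.
rename_map = {"2014 rank": "rank_2014", "City": "city", "Sate": "state"}

missing, unmapped = check_rename_map(columns, rename_map)
print(missing)   # ['Sate']
print(unmapped)  # ['State', 'State Code', '2014 Population estimate', '2015 median sales price']
```

In the notebook the same check could be run as `check_rename_map(df.columns, rename_map)` before calling `rename_columns`.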

Rename columns with a generic function

In [0]:
%python
import re
from pyspark.sql import DataFrame

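def normalize_columns_for_hive(df: DataFrame) -> DataFrame:
    new_cols = []
    for name in df.columns:
        new_col = name.lower()
        # Replace every character outside [a-z0-9_] with an underscore
        new_col = re.sub(r'[^a-z0-9_]', '_', new_col)
        # Collapse runs of underscores and trim them from both ends
        new_col = re.sub(r'_+', '_', new_col).strip('_')
        # Prefix names that start with a digit. This runs after trimming so
        # that a name such as " 2014" does not end up as "2014"
        if re.match(r'^[0-9]', new_col):
            new_col = f"col_{new_col}"
        new_cols.append(new_col)
    return df.toDF(*new_cols)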
df = spark.read.csv("dbfs:/databricks-datasets/samples/population-vs-price/", header=True)
df = normalize_columns_for_hive(df)
display(df.limit(3))
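
A generic normalizer can map two different headers to the same name, which leaves the DataFrame with ambiguous columns. The sketch below restates the normalization rules on plain strings and adds a collision check. `normalize_name` and `find_collisions` are hypothetical helpers; in this sketch the leading-digit check runs after the underscores are trimmed, and the `"state-code"` header is invented to force a clash.

```python
import re

def normalize_name(name: str) -> str:
    """Lowercase, replace unsupported characters, and avoid a leading digit."""
    new = re.sub(r"[^a-z0-9_]", "_", name.lower())
    new = re.sub(r"_+", "_", new).strip("_")
    if re.match(r"[0-9]", new):
        new = f"col_{new}"
    return new

def find_collisions(names):
    """Map each normalized name to the originals that produce it, keeping only clashes."""
    seen = {}
    for name in names:
        seen.setdefault(normalize_name(name), []).append(name)
    return {new: olds for new, olds in seen.items() if len(olds) > 1}

print(normalize_name("2014 Population estimate"))  # col_2014_population_estimate
print(find_collisions(["State Code", "state-code", "City"]))
# {'state_code': ['State Code', 'state-code']}
```

Running `find_collisions(df.columns)` before `toDF` would make such clashes visible instead of letting them surface later as ambiguous column references.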