# Session 2 Databricks - Introduction to Apache Spark: RDDs, DataFrames, Datasets, and Spark SQL


All the content in this session can be explored further at:
- https://docs.databricks.com/aws/en/
- https://spark.apache.org/docs/latest/api/python/index.html
- https://learn.microsoft.com/en-us/azure/databricks/notebooks/best-practices

## Cell types

In Databricks, we can create cells of the following types:
* Markdown
* Python
* R
* Scala
* SQL
* Bash

In [0]:
# %scala
# object HelloWorld {
#   def main(args: Array[String]): Unit = {
#     println("Hello, World!")
#   }
# }

In [0]:
%sh
ls -la /

In [0]:
# %r
# x <- sqrt(18)
# print(x)

In [0]:
print("Hello, world!")

## dbutils

dbutils is a built-in Databricks utility that provides commands for interacting with the notebook runtime environment. It is useful for tasks such as manipulating files in DBFS, working with secrets, running notebooks from other notebooks, and more.

In [0]:
dbutils.help()

### dbutils.secrets
Secrets can be backed by one of the following stores:
* Azure Key Vault.
* A database managed by Databricks.
* AWS Secrets Manager (with an IAM role configured to allow access).

In [0]:
dbutils.secrets.help()

### dbutils.fs
To interact with the file system, we use *dbutils.fs*. Commands of interest:
* dbutils.fs.cp
* dbutils.fs.ls
* dbutils.fs.mv
* dbutils.fs.rm

In [0]:
dbutils.fs.help()

In [0]:
#dbutils.fs.mounts()

In [0]:
dbutils.fs.ls("/")

### dbutils.notebook
To interact with other notebooks, we use *dbutils.notebook*.

In [0]:
%python
dbutils.notebook.help()

### dbutils.widgets
To parameterize notebooks, we use widgets, of which there are 4 types:
* text
* dropdown
* combobox
* multiselect

In [0]:
dbutils.widgets.text("schema_name", "databricks_david_schema")
# dbutils.widgets.dropdown("state", "CA", ["CA", "IL", "MI", "NY", "OR", "VA"])
# dbutils.widgets.combobox(
#   name='fruits_combobox',
#   defaultValue='banana',
#   choices=['apple', 'banana', 'coconut', 'dragon fruit'],
#   label='Fruits'
# )
# dbutils.widgets.multiselect(
#   name='days_multiselect',
#   defaultValue='Tuesday',
#   choices=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
#     'Friday', 'Saturday', 'Sunday'],
#   label='Days of the Week'
# )


To get the value of a parameter that has been set:

In [0]:
dbutils.widgets.get("schema_name") 

To get all the parameters that have been set, as a Python dictionary:


In [0]:
dbutils.widgets.getAll()

To remove one or all of the widgets:

In [0]:

#dbutils.widgets.remove('schema_name')
#dbutils.widgets.removeAll()

## Catalog

In Databricks, a Catalog is the top level of the namespace hierarchy used to organize data. Introduced with Unity Catalog, it lets you manage catalogs, schemas, and tables centrally and securely, which is especially useful in multi-user environments with integrated access control.

The object hierarchy is:

Catalog > Schema (Database) > Table / View / Function

Each catalog can contain multiple schemas, and each schema can contain multiple tables or views.
You can create a new catalog with SQL as follows:

In [0]:
%sql
CREATE CATALOG IF NOT EXISTS sesion2;
-- CREATE CATALOG IF NOT EXISTS customer_cat COMMENT 'This is customer catalog';
-- CREATE CATALOG customer_cat MANAGED LOCATION 's3://depts/finance';
-- CREATE FOREIGN CATALOG postgresql_catalog USING CONNECTION postgresql_connection OPTIONS (database = 'my_db');

## Schemas


A **Schema** (also known as a Database in other environments) is a logical container within a catalog that groups tables, views, functions, and other related objects.

In [0]:
%sql
-- In the default catalog
CREATE SCHEMA IF NOT EXISTS schema_in_default_catalog

You can make your notebook more flexible by using widgets to capture the schema name as a parameter:

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS ${schema_name}

You can also read the widget value in Python and build the SQL statement dynamically, which is considerably more flexible:

In [0]:
# Get the widget value
custom_schema = dbutils.widgets.get("schema_name")

# Build the SQL statement with the schema name
query = f"""
    CREATE SCHEMA IF NOT EXISTS {custom_schema}
"""

# Run the dynamic SQL
spark.sql(query)

This is useful when:
* You want to run validations before executing.
* The schema name depends on additional logic.
* You are creating multiple objects programmatically.
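
One validation worth running before interpolating a widget value into SQL is checking that it looks like a plain identifier. The following is a minimal sketch, assuming that schema names are restricted to letters, digits, and underscores. The helper names `validate_identifier` and `build_create_schema`, and the naming rule itself, are illustrative assumptions and not part of any Databricks API.

```python
import re

# Assumed naming rule for this sketch: letters, digits and underscores,
# not starting with a digit. Adjust it to your own conventions.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_identifier(name: str) -> str:
    """Return `name` unchanged if it is a plain identifier, else raise ValueError."""
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"Invalid identifier: {name!r}")
    return name


def build_create_schema(schema: str) -> str:
    """Build the CREATE SCHEMA statement only from a validated name."""
    return f"CREATE SCHEMA IF NOT EXISTS {validate_identifier(schema)}"


print(build_create_schema("databricks_david_schema"))
```

In the notebook, the resulting string would then be passed to `spark.sql(...)`.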

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS sesion2.${schema_name} COMMENT 'This is customer catalog';
-- CREATE SCHEMA customer_sc MANAGED LOCATION 's3://depts/finance';

Now let's retrieve the schema information:

In [0]:
%sql
DESCRIBE SCHEMA EXTENDED ${catalog_name}.${schema_name};

## Volumes

A **Volume** is a storage location within a Unity Catalog schema that is used to hold files such as CSV, JSON, Parquet, images, etc.  
It is a structured and secure way to work with files inside the Databricks environment, using unified access controls (ACLs).

---

### 🔹 What is a Volume?

- They are directories mounted within Unity Catalog.
- They can be **managed** (internal, handled by Databricks) or **external** (in an S3 bucket or ADLS).
- They are useful for handling non-tabular data or raw data that has not yet been ingested into tables.

In [0]:
%sql
-- CREATE VOLUME my_another_volume
CREATE VOLUME IF NOT EXISTS ${catalog_name}.${schema_name}.landing;
-- Create an external volume on the specified location with comment
-- CREATE EXTERNAL VOLUME my_catalog.my_schema.my_external_volume
-- LOCATION 's3://my-bucket/my-location/my-path'
-- COMMENT 'This is my example external volume on S3'

Now let's download a CSV file into the volume using curl:

In [0]:
%sh
curl -L https://raw.githubusercontent.com/dvddepennde/crops_training_school/refs/heads/main/nutrients_csvfile.csv -o /Volumes/sesion2/databricks_david_schema/landing/nutrients_csvfile.csv

In [0]:
%sql
SELECT *
FROM csv.`/Volumes/${catalog_name}/${schema_name}/landing/nutrients_csvfile.csv`

## Exercises: dbutils, catalog, schema, and volumes

1. Create parameters for the catalog name, the schema name, and the volume name. Assign any values you like; they will be used in the following exercises.

In [0]:
dbutils.widgets.text(
    name="catalog_name",
    defaultValue="sesion2",
    label="Catalog name"
)
dbutils.widgets.text(name="schema_name", defaultValue="databricks_david_schema", label="Schema name")
dbutils.widgets.text(name="volume", defaultValue="landing", label="Volume name")


2. Get the value of each widget and print them on a single line using the format "Catalog name: {}, Schema name: {}, Volume name: {}"

In [0]:
catalog_name = dbutils.widgets.get("catalog_name")
schema_name = dbutils.widgets.get("schema_name")
volume = dbutils.widgets.get("volume")

print(f"Catalog name: {catalog_name}, Schema name: {schema_name}, Volume name: {volume}")

3. Create the catalog, then the schema inside it, and finally the volume inside the schema. Write the statements both in SQL cells and in Python.

*IMPORTANT*: You must use variables rather than writing the parameter values directly in the statement.

In [0]:
# Create the catalog
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog_name}")

# Create the schema inside the catalog
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.{schema_name}")

# Create the volume inside the schema
spark.sql(f"CREATE VOLUME IF NOT EXISTS {catalog_name}.{schema_name}.{volume}")

In [0]:
%sql
-- Run in a SQL cell once the widgets are defined (${...} substitutes widget values)
CREATE CATALOG IF NOT EXISTS ${catalog_name};
CREATE SCHEMA IF NOT EXISTS ${catalog_name}.${schema_name};
CREATE VOLUME IF NOT EXISTS ${catalog_name}.${schema_name}.${volume};

4. Download the file from the URL provided into the volume you created earlier. Once it has been downloaded, verify that it exists.

URL: https://raw.githubusercontent.com/dvddepennde/luxury-watches-analysis/refs/heads/main/data/Luxury%20watch.csv

The cell does not necessarily have to be a Python cell...

In [0]:
%sh
curl -L https://raw.githubusercontent.com/dvddepennde/crops_training_school/refs/heads/main/nutrients_csvfile.csv -o /Volumes/sesion2/databricks_david_schema/landing/nutrients_csvfile.csv

In [0]:
import requests
# File URL
url = "https://raw.githubusercontent.com/dvddepennde/crops_training_school/refs/heads/main/nutrients_csvfile.csv"

# Destination path inside the Unity Catalog volume
destination_path = "/Volumes/sesion2/databricks_david_schema/landing/nutrients_csvfile.csv"

# Download the file and fail early on an HTTP error
response = requests.get(url)
response.raise_for_status()

# Save the file to the volume
with open(destination_path, "wb") as f:
    f.write(response.content)

print("File downloaded successfully.")
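
The exercise also asks to verify that the file exists. Since the cell above writes to the volume path with a regular `open` call, one way to check is with `pathlib`. This is a minimal sketch: the helper name `file_is_present` is hypothetical, and a temporary directory stands in for the volume so that the snippet runs anywhere.

```python
import tempfile
from pathlib import Path


def file_is_present(path: str) -> bool:
    """True if `path` is an existing, non-empty regular file."""
    p = Path(path)
    return p.is_file() and p.stat().st_size > 0


# Stand-in for the volume directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "nutrients_csvfile.csv"
    missing_before = file_is_present(str(demo_path))
    demo_path.write_bytes(b"Food,Measure\n")
    present_after = file_is_present(str(demo_path))

print(missing_before, present_after)
```

In the notebook you would call `file_is_present(destination_path)`. Listing the volume directory with `dbutils.fs.ls` is an alternative.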


5. Upload a CSV to the volume, either programmatically or manually, for later analysis. You can look for one at https://www.kaggle.com/datasets . Once it is uploaded, query its contents with Spark SQL.

6. Create a notebook named sesion2_notebook_called, which must receive the following parameters:
* db_name: Simulates a database name; if you are unsure what to use, use "public".
* table_name: Simulates a table name; if you are unsure what to use, use your own name.
* num: An integer generated at random with random.

The notebook must read the values, validate that "num" is a number, and return:
* If the validation of num fails, a message like the following: "{'status': 'FAILED', 'custom_message': '<REPLACE_WITH_CUSTOM_MSJ>'}". The notebook must not continue.
* If the validation succeeds, print the message "Simulating operations..." and return: "{'status': 'OK', 'custom_message': '<REPLACE_WITH_CUSTOM_MSJ>', ... }" where ... stands for the same parameters it received.

Tip: to check whether the value is an integer, you can use this code fragment, or any alternative you prefer:

```python3
try:
    # Try to convert it to an integer
    num = int(num_str)
    # Code for when it IS an integer
except ValueError:
    # Code for when it is not numeric (integer)
    pass
```


In [0]:
import random

result = dbutils.notebook.run(
    "/Workspace/Users/tortilla-huso-43@icloud.com/sesion2_notebook_called",
    timeout_seconds=60,
    arguments={
        "db_name": "public",
        "table_name": "david",
        "num": f"{random.randint(0,20)}"  # Also try a fixed value such as "123", or a non-integer such as "abc"
    }
)

print(result)
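
A possible shape for the logic inside `sesion2_notebook_called` is sketched below as a plain function, so it can be tried without a cluster. The function name `build_result` and the message texts are assumptions. In the notebook, the three inputs would come from `dbutils.widgets.get(...)`, and the JSON string would be handed to `dbutils.notebook.exit(...)`, which stops the notebook and returns the value to the caller.

```python
import json


def build_result(db_name: str, table_name: str, num_str: str) -> str:
    """Validate `num_str` and return the JSON payload the notebook would exit with."""
    try:
        num = int(num_str)
    except ValueError:
        # Validation failed: report it and do no further work.
        return json.dumps({
            "status": "FAILED",
            "custom_message": f"'num' must be an integer, got {num_str!r}",
        })

    print("Simulating operations...")
    return json.dumps({
        "status": "OK",
        "custom_message": "Operations simulated successfully",
        "db_name": db_name,
        "table_name": table_name,
        "num": num,
    })


ok = json.loads(build_result("public", "david", "7"))
failed = json.loads(build_result("public", "david", "seven"))
print(ok["status"], failed["status"])
```

Returning JSON rather than a Python `repr` string is a design choice of this sketch: the caller can parse `result` with `json.loads` instead of evaluating text.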

# Spark

When PySpark reads a file (for example, a CSV file), it can try to guess the data type of each column from the data itself. This process is called schema inference. It requires an extra pass over the data, which by default covers the whole file rather than only the first records.

Some documentation on the topic:
* https://docs.databricks.com/aws/en/getting-started/dataframes
* https://spark.apache.org/docs/latest/quick-start.html
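
To build intuition for why inference can be fragile, here is a toy illustration in plain Python. It is not Spark's actual algorithm: the functions `infer_value_type` and `infer_column_type` and the three-type ordering (`int`, `double`, `string`) are simplifications made up for this sketch. The idea it shows is that a single inconsistent value widens the type of the whole column.

```python
def infer_value_type(value: str) -> str:
    """Classify one raw CSV token as 'int', 'double' or 'string'."""
    for caster, type_name in ((int, "int"), (float, "double")):
        try:
            caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_column_type(values: list[str]) -> str:
    """Pick the widest type seen in the column; empty tokens count as nulls."""
    order = {"int": 0, "double": 1, "string": 2}
    seen = [infer_value_type(v) for v in values if v != ""]
    return max(seen, key=order.__getitem__, default="string")


print(infer_column_type(["40.0", "43.5", "41.0"]))
print(infer_column_type(["9500", "5,800", "4200"]))
```

In this toy model, the thousands separator in `5,800` is enough to turn an otherwise numeric column into a string column.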

In [0]:
%sql
CREATE CATALOG IF NOT EXISTS ${catalog_name};
CREATE SCHEMA IF NOT EXISTS ${catalog_name}.${schema_name};
CREATE VOLUME IF NOT EXISTS ${catalog_name}.${schema_name}.${volume};

In [0]:
import requests
# File URL
url = "https://raw.githubusercontent.com/dvddepennde/luxury-watches-analysis/refs/heads/main/data/Luxury%20watch.csv"

# Destination path inside the Unity Catalog volume
destination_path = "/Volumes/sesion4/databricks_david_schema/landing/luxury_watch.csv"

# Download the file and fail early on an HTTP error
response = requests.get(url)
response.raise_for_status()

# Save the file to the volume
with open(destination_path, "wb") as f:
    f.write(response.content)

print("File downloaded successfully.")

File downloaded successfully.


In [0]:
path_luxury_watches = destination_path
df = spark.read.csv(
    path=path_luxury_watches,
    header=True,      # Use the first row as column names
    inferSchema=True  # SCHEMA INFERENCE ENABLED
)
display(df)

Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price (USD)
Rolex,Submariner,Stainless Steel,Stainless Steel,Automatic,300 meters,40.0,13.0,20.0,Black,Sapphire,Date,48 hours,9500.0
Omega,Seamaster,Titanium,Rubber,Automatic,600 meters,43.5,14.47,21.0,Blue,Sapphire,Date,60 hours,5800.0
Tag Heuer,Carrera,Stainless Steel,Leather,Automatic,100 meters,41.0,13.0,20.0,White,Sapphire,Chronograph,42 hours,4200.0
Breitling,Navitimer,Stainless Steel,Stainless Steel,Automatic,30 meters,43.0,14.25,22.0,Black,Sapphire,Chronograph,70 hours,7900.0
Cartier,Tank Solo,Stainless Steel,Leather,Quartz,30 meters,31.0,6.05,20.0,Silver,Sapphire,,,2800.0
Jaeger-LeCoultre,Reverso,Stainless Steel,Leather,Manual,30 meters,42.9,9.2,20.0,Black,Sapphire,,45 hours,5500.0
Seiko,Prospex,Stainless Steel,Rubber,Automatic,200 meters,44.3,12.9,20.0,Black,Sapphire,Date,50 hours,1400.0
Citizen,Promaster,Stainless Steel,Stainless Steel,Eco-Drive,200 meters,42.0,13.0,22.0,Black,Mineral,Chronograph,270 days,1200.0
Tissot,Le Locle,Stainless Steel,Leather,Automatic,30 meters,39.3,9.75,19.0,White,Sapphire,Date,38 hours,650.0
Hamilton,Khaki Field,Stainless Steel,Leather,Automatic,100 meters,38.0,9.8,20.0,Black,Sapphire,,80 hours,495.0


Now, in this case, we will declare the DataFrame's schema ourselves:

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.sql.functions import when, col
from pyspark.sql.functions import lit

In [0]:
# Define the schema manually
schema = StructType([
    # Name, data type, nullable
    StructField("Brand", StringType(), True),
    StructField("Model", StringType(), True),
    StructField("Case Material", StringType(), True),
    StructField("Strap Material", StringType(), True),
    StructField("Movement Type", StringType(), True),
    StructField("Water Resistance", StringType(), True),
    StructField("Case Diameter (mm)", DoubleType(), True),
    StructField("Case Thickness (mm)", DoubleType(), True),
    StructField("Band Width (mm)", DoubleType(), True),
    StructField("Dial Color", StringType(), True),
    StructField("Crystal Material", StringType(), True),
    StructField("Complications", StringType(), True),
    StructField("Power Reserve", StringType(), True),
    StructField("Price (USD)", StringType(), True)
])

# Load the CSV file using the manually defined schema
# "PERMISSIVE" is Spark's default parse mode: values that cannot be parsed become null
df = spark.read.option("mode", "PERMISSIVE").csv(path=path_luxury_watches, header=True, schema=schema)

# Show the data
display(df)


Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price (USD)
Rolex,Submariner,Stainless Steel,Stainless Steel,Automatic,300 meters,40.0,13.0,20.0,Black,Sapphire,Date,48 hours,9500.0
Omega,Seamaster,Titanium,Rubber,Automatic,600 meters,43.5,14.47,21.0,Blue,Sapphire,Date,60 hours,5800.0
Tag Heuer,Carrera,Stainless Steel,Leather,Automatic,100 meters,41.0,13.0,20.0,White,Sapphire,Chronograph,42 hours,4200.0
Breitling,Navitimer,Stainless Steel,Stainless Steel,Automatic,30 meters,43.0,14.25,22.0,Black,Sapphire,Chronograph,70 hours,7900.0
Cartier,Tank Solo,Stainless Steel,Leather,Quartz,30 meters,31.0,6.05,20.0,Silver,Sapphire,,,2800.0
Jaeger-LeCoultre,Reverso,Stainless Steel,Leather,Manual,30 meters,42.9,9.2,20.0,Black,Sapphire,,45 hours,5500.0
Seiko,Prospex,Stainless Steel,Rubber,Automatic,200 meters,44.3,12.9,20.0,Black,Sapphire,Date,50 hours,1400.0
Citizen,Promaster,Stainless Steel,Stainless Steel,Eco-Drive,200 meters,42.0,13.0,22.0,Black,Mineral,Chronograph,270 days,1200.0
Tissot,Le Locle,Stainless Steel,Leather,Automatic,30 meters,39.3,9.75,19.0,White,Sapphire,Date,38 hours,650.0
Hamilton,Khaki Field,Stainless Steel,Leather,Automatic,100 meters,38.0,9.8,20.0,Black,Sapphire,,80 hours,495.0


As for the differences, keep the following characteristics in mind:


| Characteristic               | Infer schema (automatic)             | Define schema (manual)           |
|------------------------------|--------------------------------------|----------------------------------|
| **Ease of use**              | Very easy, no intervention required. | Requires more upfront work.      |
| **Accuracy**                 | Can be inaccurate with inconsistent data. | High accuracy (full control). |
| **Performance**              | Slower on large files.               | Faster, since nothing needs to be inferred. |
| **Flexibility**              | Well suited to data that changes frequently. | Less flexible, requires manual updates. |
| **Robustness**               | Can fail in certain cases (for example, non-numeric values). | More robust and safer. |


But...
- What happens if the schema does not match?
- What happens if one of the declared columns does not exist?
- What happens if the source has more columns than those defined in the schema?
- When should a fixed schema be used, and when should it be inferred?

Let's display the schema of the columns:

In [0]:
df.printSchema()

root
 |-- Brand: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- Case Material: string (nullable = true)
 |-- Strap Material: string (nullable = true)
 |-- Movement Type: string (nullable = true)
 |-- Water Resistance: string (nullable = true)
 |-- Case Diameter (mm): double (nullable = true)
 |-- Case Thickness (mm): double (nullable = true)
 |-- Band Width (mm): double (nullable = true)
 |-- Dial Color: string (nullable = true)
 |-- Crystal Material: string (nullable = true)
 |-- Complications: string (nullable = true)
 |-- Power Reserve: string (nullable = true)
 |-- Price (USD): string (nullable = true)



Show only the first N rows:

In [0]:
df.show(5)

+---------+----------+---------------+---------------+-------------+----------------+------------------+-------------------+---------------+----------+----------------+-------------+-------------+-----------+
|    Brand|     Model|  Case Material| Strap Material|Movement Type|Water Resistance|Case Diameter (mm)|Case Thickness (mm)|Band Width (mm)|Dial Color|Crystal Material|Complications|Power Reserve|Price (USD)|
+---------+----------+---------------+---------------+-------------+----------------+------------------+-------------------+---------------+----------+----------------+-------------+-------------+-----------+
|    Rolex|Submariner|Stainless Steel|Stainless Steel|    Automatic|      300 meters|              40.0|               13.0|           20.0|     Black|        Sapphire|         Date|     48 hours|      9,500|
|    Omega| Seamaster|       Titanium|         Rubber|    Automatic|      600 meters|              43.5|              14.47|           21.0|      Blue|        Sapph

Get the dimensions of the DataFrame:

In [0]:
rows = df.count()
columns = len(df.columns)
print(f"Number of rows: {rows}, columns: {columns}")


Number of rows: 507, columns: 14


Get the list of columns of the DataFrame:

In [0]:
print(df.columns)


['Brand', 'Model', 'Case Material', 'Strap Material', 'Movement Type', 'Water Resistance', 'Case Diameter (mm)', 'Case Thickness (mm)', 'Band Width (mm)', 'Dial Color', 'Crystal Material', 'Complications', 'Power Reserve', 'Price (USD)']


Convert the DataFrame to a pandas DataFrame:

In [0]:
## BE CAREFUL WITH THIS: as the method's documentation states:
# This method should only be used if the resulting Pandas ``pandas.DataFrame`` is
# expected to be small, as all the data is loaded into the driver's memory.

pandas_df = df.toPandas()
pandas_df.sample(4)

Unnamed: 0,Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price (USD)
481,Audemars Piguet,Royal Oak,Stainless Steel,Stainless Steel,Automatic,50 meters,41.0,10.4,20.0,Blue,Sapphire,Date,60 hours,19500
135,Ulysse Nardin,Marine,Titanium,Rubber,Automatic,300 meters,44.0,14.5,22.0,Blue,Sapphire,Date,60 hours,9500
191,Blancpain,Fifty Fathoms,Stainless Steel,NATO Strap,Automatic,300 meters,45.0,15.5,23.0,Black,Sapphire,,120 hours,12500
461,Vacheron Constantin,Patrimony,White Gold,Leather,Manual,30 meters,40.0,6.79,20.0,Silver,Sapphire,,65 hours,28000


Sort by a column, in ascending/descending order:

In [0]:
display(df.orderBy("Case Diameter (mm)", ascending=True).limit(5))
## Which is equivalent to
# df.sort("Case Diameter (mm)", ascending=True).show(5)


Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price (USD)
Cartier,Santos-Dumont,18K Yellow Gold,Leather,Manual,30 meters,27.5,6.5,18.0,Silver,Sapphire,,,14200
Chopard,Happy Sport,Stainless Steel,Stainless Steel,Quartz,30 meters,30.0,7.2,15.0,Silver,Sapphire,Date,,5800
Chopard,Happy Sport,Stainless Steel,Leather,Quartz,30 meters,30.0,10.5,17.0,White,Sapphire,,,5000
Cartier,Tank Solo,Stainless Steel,Leather,Quartz,30 meters,31.0,6.05,20.0,Silver,Sapphire,,,2800
Cartier,Tank Solo,Stainless Steel,Leather,Quartz,30 meters,31.0,6.05,20.0,Silver,Sapphire,,,2800


Sort by several columns:

In [0]:
display(df.sort(['Band Width (mm)',"Case Diameter (mm)"], ascending=[False, False]).limit(5))
## Which is equivalent to
# df.orderBy(['Band Width (mm)',"Case Diameter (mm)"], ascending=[False, False]).show(5)

Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price (USD)
Bulgari,Octo Finissimo,Titanium,Titanium,Automatic,30 meters,40.0,5.15,28.0,Black,Sapphire,,55 hours,10000
Bulgari,Octo Finissimo,Titanium,Leather,Automatic,100 meters,40.0,5.15,28.0,Black,Sapphire,,60 hours,12800
Hublot,Big Bang,Carbon Fiber,Rubber,Automatic,100 meters,44.0,14.5,26.0,Black,Sapphire,Chronograph,72 hours,12000
Hublot,Big Bang,Ceramic,Rubber,Automatic,100 meters,44.0,14.5,26.0,Black,Sapphire,Chronograph,72 hours,17900
Girard-Perregaux,Laureato,Stainless Steel,Stainless Steel,Automatic,100 meters,41.0,10.88,26.0,Black,Sapphire,Chronograph,46 hours,12000


Rename a column:

In [0]:
df = df.withColumnRenamed(
    "Price (USD)", "Price"
)
# Alternative with select + alias (note that select keeps only the listed columns):
# from pyspark.sql.functions import col
# df = df.select(col("Brand").alias("brand"), col("Price (USD)").alias("price"))
display(df)


Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price
Rolex,Submariner,Stainless Steel,Stainless Steel,Automatic,300 meters,40.0,13.0,20.0,Black,Sapphire,Date,48 hours,9500.0
Omega,Seamaster,Titanium,Rubber,Automatic,600 meters,43.5,14.47,21.0,Blue,Sapphire,Date,60 hours,5800.0
Tag Heuer,Carrera,Stainless Steel,Leather,Automatic,100 meters,41.0,13.0,20.0,White,Sapphire,Chronograph,42 hours,4200.0
Breitling,Navitimer,Stainless Steel,Stainless Steel,Automatic,30 meters,43.0,14.25,22.0,Black,Sapphire,Chronograph,70 hours,7900.0
Cartier,Tank Solo,Stainless Steel,Leather,Quartz,30 meters,31.0,6.05,20.0,Silver,Sapphire,,,2800.0
Jaeger-LeCoultre,Reverso,Stainless Steel,Leather,Manual,30 meters,42.9,9.2,20.0,Black,Sapphire,,45 hours,5500.0
Seiko,Prospex,Stainless Steel,Rubber,Automatic,200 meters,44.3,12.9,20.0,Black,Sapphire,Date,50 hours,1400.0
Citizen,Promaster,Stainless Steel,Stainless Steel,Eco-Drive,200 meters,42.0,13.0,22.0,Black,Mineral,Chronograph,270 days,1200.0
Tissot,Le Locle,Stainless Steel,Leather,Automatic,30 meters,39.3,9.75,19.0,White,Sapphire,Date,38 hours,650.0
Hamilton,Khaki Field,Stainless Steel,Leather,Automatic,100 meters,38.0,9.8,20.0,Black,Sapphire,,80 hours,495.0


Add a column with conditional values:

In [0]:

df = df.withColumn(
    "Case Diameter Class",
    when(col("Case Diameter (mm)") <= 30, "Small")
    .when((col("Case Diameter (mm)") > 30) & (col("Case Diameter (mm)") <= 35), "Medium")
    .when((col("Case Diameter (mm)") > 35) & (col("Case Diameter (mm)") <= 42), "Big")
    .otherwise("Gigant")
)
# Note: pandas-style assignment (df['new_column'] = ...) is not supported on Spark DataFrames

display(df)

Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price,Case Diameter Class
Rolex,Submariner,Stainless Steel,Stainless Steel,Automatic,300 meters,40.0,13.0,20.0,Black,Sapphire,Date,48 hours,9500.0,Big
Omega,Seamaster,Titanium,Rubber,Automatic,600 meters,43.5,14.47,21.0,Blue,Sapphire,Date,60 hours,5800.0,Gigant
Tag Heuer,Carrera,Stainless Steel,Leather,Automatic,100 meters,41.0,13.0,20.0,White,Sapphire,Chronograph,42 hours,4200.0,Big
Breitling,Navitimer,Stainless Steel,Stainless Steel,Automatic,30 meters,43.0,14.25,22.0,Black,Sapphire,Chronograph,70 hours,7900.0,Gigant
Cartier,Tank Solo,Stainless Steel,Leather,Quartz,30 meters,31.0,6.05,20.0,Silver,Sapphire,,,2800.0,Medium
Jaeger-LeCoultre,Reverso,Stainless Steel,Leather,Manual,30 meters,42.9,9.2,20.0,Black,Sapphire,,45 hours,5500.0,Gigant
Seiko,Prospex,Stainless Steel,Rubber,Automatic,200 meters,44.3,12.9,20.0,Black,Sapphire,Date,50 hours,1400.0,Gigant
Citizen,Promaster,Stainless Steel,Stainless Steel,Eco-Drive,200 meters,42.0,13.0,22.0,Black,Mineral,Chronograph,270 days,1200.0,Big
Tissot,Le Locle,Stainless Steel,Leather,Automatic,30 meters,39.3,9.75,19.0,White,Sapphire,Date,38 hours,650.0,Big
Hamilton,Khaki Field,Stainless Steel,Leather,Automatic,100 meters,38.0,9.8,20.0,Black,Sapphire,,80 hours,495.0,Big
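
The `when` chain above is evaluated top to bottom and the first matching branch wins, so the explicit lower bounds (`> 30`, `> 35`) are redundant but harmless. As a sanity check of the boundaries, the same rule can be written as a plain Python function. `classify_diameter` is a hypothetical helper for illustration, and the labels are copied from the cell above.

```python
from typing import Optional


def classify_diameter(diameter_mm: Optional[float]) -> str:
    """Mirror the bucketing rule of the Spark `when` chain in plain Python."""
    if diameter_mm is None:
        # In Spark a comparison with null is not true, so the row falls through to otherwise().
        return "Gigant"
    if diameter_mm <= 30:
        return "Small"
    if diameter_mm <= 35:
        return "Medium"
    if diameter_mm <= 42:
        return "Big"
    return "Gigant"


for value in (30.0, 31.0, 42.0, 42.9, None):
    print(value, classify_diameter(value))
```

Note that rows with a missing diameter end up in the last bucket. If that is not desired, a dedicated `when(col(...).isNull(), ...)` branch could be added first.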


Add a column as a literal:

In [0]:
df = df.withColumn("Currency", lit("USD"))
df = df.withColumn("File", lit(f"{path_luxury_watches}"))
display(df)

Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price,Case Diameter Class,Currency,File
Rolex,Submariner,Stainless Steel,Stainless Steel,Automatic,300 meters,40.0,13.0,20.0,Black,Sapphire,Date,48 hours,9500.0,Big,USD,/Volumes/sesion4/databricks_david_schema/landing/luxury_watch.csv
Omega,Seamaster,Titanium,Rubber,Automatic,600 meters,43.5,14.47,21.0,Blue,Sapphire,Date,60 hours,5800.0,Gigant,USD,/Volumes/sesion4/databricks_david_schema/landing/luxury_watch.csv
Tag Heuer,Carrera,Stainless Steel,Leather,Automatic,100 meters,41.0,13.0,20.0,White,Sapphire,Chronograph,42 hours,4200.0,Big,USD,/Volumes/sesion4/databricks_david_schema/landing/luxury_watch.csv
Breitling,Navitimer,Stainless Steel,Stainless Steel,Automatic,30 meters,43.0,14.25,22.0,Black,Sapphire,Chronograph,70 hours,7900.0,Gigant,USD,/Volumes/sesion4/databricks_david_schema/landing/luxury_watch.csv
Cartier,Tank Solo,Stainless Steel,Leather,Quartz,30 meters,31.0,6.05,20.0,Silver,Sapphire,,,2800.0,Medium,USD,/Volumes/sesion4/databricks_david_schema/landing/luxury_watch.csv
Jaeger-LeCoultre,Reverso,Stainless Steel,Leather,Manual,30 meters,42.9,9.2,20.0,Black,Sapphire,,45 hours,5500.0,Gigant,USD,/Volumes/sesion4/databricks_david_schema/landing/luxury_watch.csv
Seiko,Prospex,Stainless Steel,Rubber,Automatic,200 meters,44.3,12.9,20.0,Black,Sapphire,Date,50 hours,1400.0,Gigant,USD,/Volumes/sesion4/databricks_david_schema/landing/luxury_watch.csv
Citizen,Promaster,Stainless Steel,Stainless Steel,Eco-Drive,200 meters,42.0,13.0,22.0,Black,Mineral,Chronograph,270 days,1200.0,Big,USD,/Volumes/sesion4/databricks_david_schema/landing/luxury_watch.csv
Tissot,Le Locle,Stainless Steel,Leather,Automatic,30 meters,39.3,9.75,19.0,White,Sapphire,Date,38 hours,650.0,Big,USD,/Volumes/sesion4/databricks_david_schema/landing/luxury_watch.csv
Hamilton,Khaki Field,Stainless Steel,Leather,Automatic,100 meters,38.0,9.8,20.0,Black,Sapphire,,80 hours,495.0,Big,USD,/Volumes/sesion4/databricks_david_schema/landing/luxury_watch.csv


Drop a column:

In [0]:
df = df.drop("File")
display(df.limit(3))

Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price,Case Diameter Class,Currency
Rolex,Submariner,Stainless Steel,Stainless Steel,Automatic,300 meters,40.0,13.0,20.0,Black,Sapphire,Date,48 hours,9500,Big,USD
Omega,Seamaster,Titanium,Rubber,Automatic,600 meters,43.5,14.47,21.0,Blue,Sapphire,Date,60 hours,5800,Gigant,USD
Tag Heuer,Carrera,Stainless Steel,Leather,Automatic,100 meters,41.0,13.0,20.0,White,Sapphire,Chronograph,42 hours,4200,Big,USD


Select only the columns we are interested in:

In [0]:
display(df.select("Brand", "Model", "Price"))

Brand,Model,Price
Rolex,Submariner,9500.0
Omega,Seamaster,5800.0
Tag Heuer,Carrera,4200.0
Breitling,Navitimer,7900.0
Cartier,Tank Solo,2800.0
Jaeger-LeCoultre,Reverso,5500.0
Seiko,Prospex,1400.0
Citizen,Promaster,1200.0
Tissot,Le Locle,650.0
Hamilton,Khaki Field,495.0


Show the distinct values of a column:

In [0]:
display(df.select("Brand").distinct())

Brand
Blancpain
Piaget
A. Lange & Sohne
Oris
Zenith
Ulysse Nardin
Frederique Constant
IWC
Hublot
Bulgari


In [0]:
df.createOrReplaceTempView("watches")

distinct_brands = spark.sql("SELECT DISTINCT Brand FROM watches")
display(distinct_brands)

Brand
Blancpain
Piaget
A. Lange & Sohne
Oris
Zenith
Ulysse Nardin
Frederique Constant
IWC
Hublot
Bulgari


Concatenate the Brand column with the Model column:


In [0]:
from pyspark.sql.functions import concat_ws, col
df = df.withColumn(
    "Brand_Model",
    concat_ws(" ", col("Brand"), col("Model"))  # Concatenate 'Brand' and 'Model' separated by a space
)
display(df)

Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price,Case Diameter Class,Currency,Brand_Model
Rolex,Submariner,Stainless Steel,Stainless Steel,Automatic,300 meters,40.0,13.0,20.0,Black,Sapphire,Date,48 hours,9500.0,Big,USD,Rolex Submariner
Omega,Seamaster,Titanium,Rubber,Automatic,600 meters,43.5,14.47,21.0,Blue,Sapphire,Date,60 hours,5800.0,Gigant,USD,Omega Seamaster
Tag Heuer,Carrera,Stainless Steel,Leather,Automatic,100 meters,41.0,13.0,20.0,White,Sapphire,Chronograph,42 hours,4200.0,Big,USD,Tag Heuer Carrera
Breitling,Navitimer,Stainless Steel,Stainless Steel,Automatic,30 meters,43.0,14.25,22.0,Black,Sapphire,Chronograph,70 hours,7900.0,Gigant,USD,Breitling Navitimer
Cartier,Tank Solo,Stainless Steel,Leather,Quartz,30 meters,31.0,6.05,20.0,Silver,Sapphire,,,2800.0,Medium,USD,Cartier Tank Solo
Jaeger-LeCoultre,Reverso,Stainless Steel,Leather,Manual,30 meters,42.9,9.2,20.0,Black,Sapphire,,45 hours,5500.0,Gigant,USD,Jaeger-LeCoultre Reverso
Seiko,Prospex,Stainless Steel,Rubber,Automatic,200 meters,44.3,12.9,20.0,Black,Sapphire,Date,50 hours,1400.0,Gigant,USD,Seiko Prospex
Citizen,Promaster,Stainless Steel,Stainless Steel,Eco-Drive,200 meters,42.0,13.0,22.0,Black,Mineral,Chronograph,270 days,1200.0,Big,USD,Citizen Promaster
Tissot,Le Locle,Stainless Steel,Leather,Automatic,30 meters,39.3,9.75,19.0,White,Sapphire,Date,38 hours,650.0,Big,USD,Tissot Le Locle
Hamilton,Khaki Field,Stainless Steel,Leather,Automatic,100 meters,38.0,9.8,20.0,Black,Sapphire,,80 hours,495.0,Big,USD,Hamilton Khaki Field


Convert the Price column to a numeric type:

In [0]:
from pyspark.sql.functions import col, regexp_replace

df_cleaned = df.withColumn(
    "Price",
    regexp_replace(col("Price"), r"[^\d.]", "")  # Remove everything that is not a digit or a dot
).withColumn(
    "Price",
    col("Price").cast("double")  # Convert to a numeric type (Double)
)
df_cleaned.printSchema()

root
 |-- Brand: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- Case Material: string (nullable = true)
 |-- Strap Material: string (nullable = true)
 |-- Movement Type: string (nullable = true)
 |-- Water Resistance: string (nullable = true)
 |-- Case Diameter (mm): double (nullable = true)
 |-- Case Thickness (mm): double (nullable = true)
 |-- Band Width (mm): double (nullable = true)
 |-- Dial Color: string (nullable = true)
 |-- Crystal Material: string (nullable = true)
 |-- Complications: string (nullable = true)
 |-- Power Reserve: string (nullable = true)
 |-- Price: double (nullable = true)
 |-- Case Diameter Class: string (nullable = false)
 |-- Currency: string (nullable = false)
 |-- Brand_Model: string (nullable = false)
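
To see what the pattern `[^\d.]` keeps and what it drops, the same substitution can be tried in plain Python with the `re` module. This is only an illustrative sketch: the sample strings are hypothetical (not taken from the dataset) and `clean_price` is a name introduced here. Returning `None` for unparseable input is meant to mimic the `null` that a Spark cast typically produces when ANSI mode is disabled.

```python
import re

def clean_price(raw):
    # Same idea as regexp_replace(col("Price"), r"[^\d.]", "") + cast("double"):
    # keep only digits and dots, then try to parse the remainder as a float.
    if raw is None:
        return None
    digits = re.sub(r"[^\d.]", "", raw)
    try:
        return float(digits)
    except ValueError:
        # Nothing parseable left (e.g. an empty string)
        return None

# Hypothetical sample values, not from the dataset
for raw in ["9,500", "$4,200.50", "N/A", None]:
    print(repr(raw), "->", clean_price(raw))
```

Because the pattern removes thousands separators written as commas but keeps dots, it assumes the dot is the decimal separator.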



Now let's sort by watch price, highest first:

In [0]:
display(df_cleaned.orderBy("Price", ascending=False).limit(10))


Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price,Case Diameter Class,Currency,Brand_Model
Patek Philippe,Nautilus,Stainless Steel,Stainless Steel,Automatic,120 meters,40.5,8.3,22.0,Blue,Sapphire,"Date, Moon Phase",45 hours,70000.0,Big,USD,Patek Philippe Nautilus
Patek Philippe,Nautilus,Stainless Steel,Stainless Steel,Automatic,120 meters,40.5,8.3,20.0,Blue,Sapphire,"Date, Chronograph",55 hours,67000.0,Big,USD,Patek Philippe Nautilus
Patek Philippe,Nautilus,Stainless Steel,Stainless Steel,Automatic,120 meters,40.8,8.3,21.0,Blue,Sapphire,Date,45 hours,62500.0,Big,USD,Patek Philippe Nautilus
Patek Philippe,Nautilus,Stainless Steel,Stainless Steel,Automatic,120 meters,40.8,8.3,20.0,Blue,Sapphire,Date,45 hours,57000.0,Big,USD,Patek Philippe Nautilus
Audemars Piguet,Royal Oak,Rose Gold,Rose Gold,Automatic,50 meters,41.0,9.8,20.0,Blue,Sapphire,Date,60 hours,55000.0,Big,USD,Audemars Piguet Royal Oak
Patek Philippe,Nautilus,Stainless Steel,Stainless Steel,Automatic,120 meters,40.8,8.3,20.0,Blue,Sapphire,Date,45 hours,52000.0,Big,USD,Patek Philippe Nautilus
Patek Philippe,Nautilus,Stainless Steel,Stainless Steel,Automatic,120 meters,40.5,8.3,22.0,Blue,Sapphire,Date,45 hours,51000.0,Big,USD,Patek Philippe Nautilus
Patek Philippe,Nautilus,Stainless Steel,Stainless Steel,Automatic,120 meters,40.8,8.3,21.0,Blue,Sapphire,Date,55 hours,51000.0,Big,USD,Patek Philippe Nautilus
Patek Philippe,Nautilus,Stainless Steel,Stainless Steel,Automatic,120 meters,40.5,8.3,20.0,Blue,Sapphire,Date,45 hours,49800.0,Big,USD,Patek Philippe Nautilus
Patek Philippe,Nautilus,Stainless Steel,Stainless Steel,Automatic,120 meters,40.8,8.3,20.0,Blue,Sapphire,Date,45 hours,47000.0,Big,USD,Patek Philippe Nautilus


### Filters


Filter by an exact string value:

In [0]:
# Keep the records where the 'Movement Type' column equals 'Automatic'
df_filtered = df_cleaned.filter(df_cleaned['Movement Type'] == 'Automatic')

# Show the first 5 rows
display(df_filtered.limit(5))

Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price,Case Diameter Class,Currency,Brand_Model
Rolex,Submariner,Stainless Steel,Stainless Steel,Automatic,300 meters,40.0,13.0,20.0,Black,Sapphire,Date,48 hours,9500.0,Big,USD,Rolex Submariner
Omega,Seamaster,Titanium,Rubber,Automatic,600 meters,43.5,14.47,21.0,Blue,Sapphire,Date,60 hours,5800.0,Gigant,USD,Omega Seamaster
Tag Heuer,Carrera,Stainless Steel,Leather,Automatic,100 meters,41.0,13.0,20.0,White,Sapphire,Chronograph,42 hours,4200.0,Big,USD,Tag Heuer Carrera
Breitling,Navitimer,Stainless Steel,Stainless Steel,Automatic,30 meters,43.0,14.25,22.0,Black,Sapphire,Chronograph,70 hours,7900.0,Gigant,USD,Breitling Navitimer
Seiko,Prospex,Stainless Steel,Rubber,Automatic,200 meters,44.3,12.9,20.0,Black,Sapphire,Date,50 hours,1400.0,Gigant,USD,Seiko Prospex


Filter with substring conditions:

In [0]:
display(df_cleaned.filter(
  (df_cleaned['Case Material'].contains('Steel')) & 
  ~(df_cleaned['Strap Material'].contains('eather')) &
  (df_cleaned['Water Resistance'].contains('100'))
))

Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price,Case Diameter Class,Currency,Brand_Model
Oris,Big Crown ProPilot,Stainless Steel,Textile,Automatic,100 meters,41.0,12.0,20.0,Black,Sapphire,"Date, GMT",38 hours,1800.0,Big,USD,Oris Big Crown ProPilot
Bell & Ross,Aviation,Stainless Steel,Rubber,Automatic,100 meters,42.0,11.5,24.0,Black,Sapphire,Chronograph,42 hours,4500.0,Big,USD,Bell & Ross Aviation
Rolex,GMT-Master II,Stainless Steel,Stainless Steel,Automatic,100 meters,40.0,12.5,20.0,Black,Sapphire,GMT,48 hours,14000.0,Big,USD,Rolex GMT-Master II
Rolex,Datejust,Stainless Steel,Jubilee,Automatic,100 meters,36.0,12.0,20.0,Silver,Sapphire,Date,70 hours,9000.0,Big,USD,Rolex Datejust
Zenith,El Primero,Stainless Steel,Alligator,Automatic,100 meters,42.0,12.75,20.0,Silver,Sapphire,Chronograph,50 hours,6500.0,Big,USD,Zenith El Primero
Piaget,Polo S,Stainless Steel,Stainless Steel,Automatic,100 meters,42.0,9.4,24.0,Blue,Sapphire,Date,50 hours,10000.0,Big,USD,Piaget Polo S
Ulysse Nardin,Marine,Stainless Steel,Rubber,Automatic,100 meters,44.0,12.5,22.0,Blue,Sapphire,Date,60 hours,9500.0,Gigant,USD,Ulysse Nardin Marine
Girard-Perregaux,Laureato,Stainless Steel,Stainless Steel,Automatic,100 meters,42.0,10.88,20.0,Blue,Sapphire,,54 hours,7800.0,Big,USD,Girard-Perregaux Laureato
Girard-Perregaux,Laureato,Stainless Steel,Stainless Steel,Automatic,100 meters,38.0,10.88,25.0,Blue,Sapphire,,54 hours,6700.0,Big,USD,Girard-Perregaux Laureato
Rolex,Datejust,Stainless Steel,Stainless Steel,Automatic,100 meters,41.0,11.2,20.0,Silver,Sapphire,Date,70 hours,8900.0,Big,USD,Rolex Datejust
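
The condition above combines three predicates with `&` (and) and `~` (not). Searching for `'eather'` rather than `'Leather'` presumably also catches a lowercase `leather`. A more explicit alternative is to lowercase the value before comparing. The plain-Python sketch below reproduces that logic on three rows copied from the outputs in this section. `matches` is a hypothetical helper, and in Spark the equivalent would be built from column expressions rather than a Python function.

```python
# (case_material, strap_material, water_resistance) copied from rows shown above
rows = [
    ("Stainless Steel", "Leather", "100 meters"),  # Tag Heuer Carrera
    ("Stainless Steel", "Rubber", "100 meters"),   # Bell & Ross Aviation
    ("Titanium", "Rubber", "600 meters"),          # Omega Seamaster
]

def matches(case, strap, water):
    # steel case AND strap is not leather (case-insensitive) AND "100" in resistance
    return (
        "Steel" in case
        and "leather" not in strap.lower()
        and "100" in water
    )

selected = [r for r in rows if matches(*r)]
print(selected)
```

Note that a substring test on `'100'` would also match a value such as `'1000 meters'`. Comparing against the full string `'100 meters'` avoids that.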


Filter with the greater-than operator:

In [0]:
# Keep the records where the 'Price' column is greater than 65000
df_filtered = df_cleaned.filter(df_cleaned['Price'] > 65000)
display(df_filtered)

Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price,Case Diameter Class,Currency,Brand_Model
Patek Philippe,Nautilus,Stainless Steel,Stainless Steel,Automatic,120 meters,40.5,8.3,22.0,Blue,Sapphire,"Date, Moon Phase",45 hours,70000.0,Big,USD,Patek Philippe Nautilus
Patek Philippe,Nautilus,Stainless Steel,Stainless Steel,Automatic,120 meters,40.5,8.3,20.0,Blue,Sapphire,"Date, Chronograph",55 hours,67000.0,Big,USD,Patek Philippe Nautilus


Multiple conditions:

In [0]:
df_filtered = df_cleaned.filter(
  (df_cleaned['Brand'] == 'Rolex') & 
  (df_cleaned['Movement Type'] == 'Automatic') & 
  (df_cleaned['Power Reserve'] == '55 hours') & 
  (df_cleaned['Price'] > 65000)
)
display(df_filtered)


Brand,Model,Case Material,Strap Material,Movement Type,Water Resistance,Case Diameter (mm),Case Thickness (mm),Band Width (mm),Dial Color,Crystal Material,Complications,Power Reserve,Price,Case Diameter Class,Currency,Brand_Model
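
The result is empty because no row satisfies all four conditions at once: the earlier price filter shows that the only watches above 65000 are Patek Philippe models. When a combined filter unexpectedly returns nothing, one way to locate the culprit is to apply the conditions one at a time and count the survivors. A minimal plain-Python sketch of that idea, using four rows copied from the outputs above (the variable names are hypothetical):

```python
# (brand, movement_type, power_reserve, price) copied from rows shown above
rows = [
    ("Rolex", "Automatic", "48 hours", 9500.0),
    ("Rolex", "Automatic", "48 hours", 14000.0),
    ("Patek Philippe", "Automatic", "55 hours", 67000.0),
    ("Patek Philippe", "Automatic", "45 hours", 70000.0),
]

conditions = [
    ("brand == 'Rolex'", lambda r: r[0] == "Rolex"),
    ("movement == 'Automatic'", lambda r: r[1] == "Automatic"),
    ("power_reserve == '55 hours'", lambda r: r[2] == "55 hours"),
    ("price > 65000", lambda r: r[3] > 65000),
]

survivors = rows
counts = []
for label, predicate in conditions:
    # Apply one more condition and record how many rows are left
    survivors = [r for r in survivors if predicate(r)]
    counts.append((label, len(survivors)))
    print(f"{label:<30} -> {len(survivors)} rows")
```

In Spark the same approach amounts to chaining `.filter(...)` calls and calling `.count()` after each one.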


### Aggregations

Count the watches per brand:

In [0]:
display(
    df_cleaned.groupBy(
        'Brand'
    ).count(

    ).withColumnRenamed(
        "count", "Nº de relojes"
    ).orderBy(
        'Nº de relojes', ascending=False
    ).limit(5)
)

Brand,Nº de relojes
IWC,39
Audemars Piguet,38
Patek Philippe,33
Zenith,30
Blancpain,29
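
Conceptually, `groupBy('Brand').count()` tallies rows per key, and `orderBy(...).limit(5)` keeps the largest tallies. For small in-memory data the same idea is available in the standard library as `collections.Counter`. The brand list below is a hypothetical toy sample, not the real dataset:

```python
from collections import Counter

# Hypothetical toy sample of the Brand column
brands = ["IWC", "Zenith", "IWC", "Rolex", "Zenith", "IWC"]

# Count occurrences per brand and keep the two most frequent
top = Counter(brands).most_common(2)
print(top)  # [('IWC', 3), ('Zenith', 2)]
```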


Average price per brand and model:

In [0]:
#display(df_cleaned.groupBy(['Brand', 'Model']).avg())
#display(df_cleaned.select(['Brand', 'Model' , 'Price']).groupBy(['Brand', 'Model']).avg())
display(
    df_cleaned.select([
        'Brand', 'Model' , 'Price']
    ).groupBy(
        ['Brand', 'Model']
    ).avg().withColumnRenamed(
        "avg(Price)", "AVG Price"
    ).sort(['AVG Price', 'Brand'], ascending=[False, True])
)

Brand,Model,AVG Price
Patek Philippe,Nautilus,42230.769230769234
Rolex,Daytona,27500.0
Audemars Piguet,Royal Oak,23734.21052631579
Patek Philippe,Calatrava,23385.714285714286
A. Lange & Sohne,Saxonia,23000.0
Breguet,Classique,21658.33333333333
Vacheron Constantin,Patrimony,21000.0
A. Lange & Söhne,Saxonia,20400.0
Piaget,Altiplano,19450.0
Vacheron Constantin,Overseas,19042.85714285714
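
The output above lists `A. Lange & Sohne` and `A. Lange & Söhne` as two different brands, so their Saxonia averages are computed separately. Grouping keys are compared as exact strings, so inconsistent spellings split a group. One possible fix, sketched here in plain Python under the assumption that stripping diacritics is acceptable for this column, is to normalise the key before grouping. `strip_accents` is a hypothetical helper, and applying it in Spark would need an equivalent column expression or a UDF.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining diacritical marks removed."""
    # NFKD splits "ö" into "o" plus a combining diaeresis, which is then dropped
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

names = ["A. Lange & Sohne", "A. Lange & Söhne"]
normalised = {strip_accents(n) for n in names}
print(normalised)
```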


Databricks visualization. Run in Databricks to view.


### Table

In [0]:
df_cleaned.columns

['Brand',
 'Model',
 'Case Material',
 'Strap Material',
 'Movement Type',
 'Water Resistance',
 'Case Diameter (mm)',
 'Case Thickness (mm)',
 'Band Width (mm)',
 'Dial Color',
 'Crystal Material',
 'Complications',
 'Power Reserve',
 'Price',
 'Case Diameter Class',
 'Currency',
 'Brand_Model']

In [0]:
catalog = dbutils.widgets.get('catalog_name')
schema = dbutils.widgets.get('schema_name')
table_name = "luxury_watches"
table_name_simple = f"{table_name}_simple"

In [0]:
df_simple_table = df_cleaned.select(['Brand', 'Model', 'Price'])

## pandas
# df.to_parquet("./file.parquet")  -> writes a single file: file.parquet

### PySpark
# df_simple_table.write.format("parquet").mode("overwrite").save("file.parquet")  -> writes N Parquet part files inside a directory, e.g. file.parquet/part-0001.snappy.parquet

df_simple_table.write.format("delta").mode("overwrite").saveAsTable(f"{catalog}.{schema}.{table_name_simple}")
print(f"Tabla creada: {catalog}.{schema}.{table_name_simple}")

Tabla creada: sesion4.databricks_david_schema.luxury_watches_simple


In [0]:
## Rename the columns to snake_case, since accents and spaces are not accepted in column names
new_cols = [
    "brand", "model", "case_material", "strap_material", "movement_type",
    "water_resistance", "case_diameter", "case_thickness", "band_width",
    "dial_color", "crystal_material", "complications", "power_reserve",
    "price", "case_diameter_class", "currency", "brand_model"
]
# Apply all the renames in a single call
df_renamed = df_cleaned.toDF(*new_cols)

df_renamed.write.format("delta").mode("overwrite").saveAsTable(f"{catalog}.{schema}.{table_name}")
print(f"Tabla creada: {catalog}.{schema}.{table_name}")

Tabla creada: sesion4.databricks_david_schema.luxury_watches
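
The list of new names above is written by hand. As an alternative, the names could be derived from the existing ones. The sketch below is one possible rule set (lowercase, runs of non-alphanumeric characters collapsed to `_`). `to_snake_case` is a hypothetical helper, and unlike the hand-written list it keeps the unit, producing `case_diameter_mm` rather than `case_diameter`.

```python
import re

def to_snake_case(name):
    """Lowercase the name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")

# A few of the original column names
old_cols = ["Brand", "Case Material", "Case Diameter (mm)", "Brand_Model"]
derived = [to_snake_case(c) for c in old_cols]
print(derived)
```

The resulting list could then be passed to `toDF(*names)` in the same way as the hand-written one.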


Query the table we just created:


In [0]:
data = spark.sql(f"""SELECT * FROM {catalog}.{schema}.{table_name}""")
display(data)

brand,model,case_material,strap_material,movement_type,water_resistance,case_diameter,case_thickness,band_width,dial_color,crystal_material,complications,power_reserve,price,case_diameter_class,currency,brand_model
Rolex,Submariner,Stainless Steel,Stainless Steel,Automatic,300 meters,40.0,13.0,20.0,Black,Sapphire,Date,48 hours,9500.0,Big,USD,Rolex Submariner
Omega,Seamaster,Titanium,Rubber,Automatic,600 meters,43.5,14.47,21.0,Blue,Sapphire,Date,60 hours,5800.0,Gigant,USD,Omega Seamaster
Tag Heuer,Carrera,Stainless Steel,Leather,Automatic,100 meters,41.0,13.0,20.0,White,Sapphire,Chronograph,42 hours,4200.0,Big,USD,Tag Heuer Carrera
Breitling,Navitimer,Stainless Steel,Stainless Steel,Automatic,30 meters,43.0,14.25,22.0,Black,Sapphire,Chronograph,70 hours,7900.0,Gigant,USD,Breitling Navitimer
Cartier,Tank Solo,Stainless Steel,Leather,Quartz,30 meters,31.0,6.05,20.0,Silver,Sapphire,,,2800.0,Medium,USD,Cartier Tank Solo
Jaeger-LeCoultre,Reverso,Stainless Steel,Leather,Manual,30 meters,42.9,9.2,20.0,Black,Sapphire,,45 hours,5500.0,Gigant,USD,Jaeger-LeCoultre Reverso
Seiko,Prospex,Stainless Steel,Rubber,Automatic,200 meters,44.3,12.9,20.0,Black,Sapphire,Date,50 hours,1400.0,Gigant,USD,Seiko Prospex
Citizen,Promaster,Stainless Steel,Stainless Steel,Eco-Drive,200 meters,42.0,13.0,22.0,Black,Mineral,Chronograph,270 days,1200.0,Big,USD,Citizen Promaster
Tissot,Le Locle,Stainless Steel,Leather,Automatic,30 meters,39.3,9.75,19.0,White,Sapphire,Date,38 hours,650.0,Big,USD,Tissot Le Locle
Hamilton,Khaki Field,Stainless Steel,Leather,Automatic,100 meters,38.0,9.8,20.0,Black,Sapphire,,80 hours,495.0,Big,USD,Hamilton Khaki Field


In [0]:
from pyspark.sql.functions import col, upper, concat_ws

columnas_a_agregar = {
    "Brand": upper(col("Brand")),
    "ModelName": concat_ws(" ", col("Brand"), col("Model")),
    "Band Width (mm)": col("Band Width (mm)") ** 2
}
df = df.withColumns(columnas_a_agregar)
display(df.select(['Brand', 'ModelName', 'Band Width (mm)']))

Brand,ModelName,Band Width (mm)
ROLEX,Rolex Submariner,400.0
OMEGA,Omega Seamaster,441.0
TAG HEUER,Tag Heuer Carrera,400.0
BREITLING,Breitling Navitimer,484.0
CARTIER,Cartier Tank Solo,400.0
JAEGER-LECOULTRE,Jaeger-LeCoultre Reverso,400.0
SEIKO,Seiko Prospex,400.0
CITIZEN,Citizen Promaster,484.0
TISSOT,Tissot Le Locle,361.0
HAMILTON,Hamilton Khaki Field,400.0
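
Note in the output that `ModelName` still reads `Rolex Submariner` although `Brand` is now `ROLEX`: every expression passed to `withColumns` is evaluated against the columns of the input DataFrame, not against the other new columns in the same call. The plain-Python sketch below mimics that behaviour on a single row. The dictionary-of-functions layout is only an illustration, not how Spark is implemented.

```python
# One row from the dataset, reduced to the columns involved
row = {"Brand": "Rolex", "Model": "Submariner", "Band Width (mm)": 20.0}

new_columns = {
    "Brand": lambda r: r["Brand"].upper(),
    "ModelName": lambda r: f'{r["Brand"]} {r["Model"]}',
    "Band Width (mm)": lambda r: r["Band Width (mm)"] ** 2,
}

# Every expression reads the ORIGINAL row; the results are merged in one step
result = {**row, **{name: fn(row) for name, fn in new_columns.items()}}
print(result)
```

If `ModelName` should use the uppercased brand instead, the two columns would have to be added in separate, consecutive calls.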


### Spark exercises

1. Filter by the Rolex brand

In [0]:
display(data.filter(col('brand') == 'Rolex'))

brand,model,case_material,strap_material,movement_type,water_resistance,case_diameter,case_thickness,band_width,dial_color,crystal_material,complications,power_reserve,price,case_diameter_class,currency,brand_model
Rolex,Submariner,Stainless Steel,Stainless Steel,Automatic,300 meters,40.0,13.0,20.0,Black,Sapphire,Date,48 hours,9500.0,Big,USD,Rolex Submariner
Rolex,GMT-Master II,Stainless Steel,Stainless Steel,Automatic,100 meters,40.0,12.5,20.0,Black,Sapphire,GMT,48 hours,14000.0,Big,USD,Rolex GMT-Master II
Rolex,Datejust,Stainless Steel,Jubilee,Automatic,100 meters,36.0,12.0,20.0,Silver,Sapphire,Date,70 hours,9000.0,Big,USD,Rolex Datejust
Rolex,Datejust,Stainless Steel,Stainless Steel,Automatic,100 meters,41.0,11.2,20.0,Silver,Sapphire,Date,70 hours,8900.0,Big,USD,Rolex Datejust
Rolex,Submariner,Stainless Steel,Stainless Steel,Automatic,300 meters,40.0,13.0,20.0,Black,Sapphire,Date,48 hours,9500.0,Big,USD,Rolex Submariner
Rolex,GMT-Master II,Stainless Steel,Stainless Steel,Automatic,100 meters,40.0,12.5,20.0,Black,Sapphire,"GMT, Date",70 hours,9950.0,Big,USD,Rolex GMT-Master II
Rolex,Datejust,Stainless Steel,Stainless Steel,Automatic,100 meters,41.0,11.8,20.0,Black,Sapphire,Date,70 hours,8200.0,Big,USD,Rolex Datejust
Rolex,Datejust,Stainless Steel,Jubilee,Automatic,100 meters,36.0,12.5,20.0,Black,Sapphire,Date,70 hours,8000.0,Big,USD,Rolex Datejust
Rolex,GMT-Master II,Stainless Steel,Stainless Steel,Automatic,100 meters,40.0,12.4,20.0,Black,Sapphire,"GMT, Date",70 hours,9100.0,Big,USD,Rolex GMT-Master II
Rolex,Datejust,Stainless Steel,Jubilee Bracelet,Automatic,100 meters,36.0,12.5,20.0,Silver,Sapphire,Date,70 hours,8900.0,Big,USD,Rolex Datejust


2. Query the table samples.tpch.orders and build some kind of chart (or use a similar table from samples.tpch, or the extracted CSV)


In [0]:
%sql
select * from 
samples.tpch.orders;

o_orderkey,o_custkey,o_orderstatus,o_totalprice,o_orderdate,o_orderpriority,o_clerk,o_shippriority,o_comment
13710944,227285,O,162169.66,1995-10-11,1-URGENT,Clerk#000000432,0,accounts. ruthlessly regular accounts alongside of the car
13710945,225010,O,252273.67,1997-09-29,5-LOW,Clerk#000002337,0,ironic platelets snooze slyly. instru
13710946,238820,O,179947.16,1997-10-31,2-HIGH,Clerk#000004135,0,ole requests. regularly
13710947,581233,O,33843.49,1995-05-25,2-HIGH,Clerk#000000138,0,arefully final platelets. carefully express packages boost careful
13710948,10033,O,42500.65,1995-09-04,4-NOT SPECIFIED,Clerk#000003398,0,regular requests use furiously. fluffily
13710949,615502,O,48225.35,1995-07-13,3-MEDIUM,Clerk#000004639,0,ate quickly along the enticing ideas. furiously i
13710950,710665,F,265761.0,1992-11-29,2-HIGH,Clerk#000000735,0,", sly ideas among the ideas promise furiously about the furiously e"
13710951,382528,F,137666.86,1993-05-21,5-LOW,Clerk#000000777,0,. blithely pending packages nag furiously against the carefully unusual ac
13710976,122618,O,158725.42,1998-03-06,4-NOT SPECIFIED,Clerk#000001281,0,ages. final packages wake carefully according
13710977,575623,O,178703.66,1998-05-04,5-LOW,Clerk#000003371,0,", final requests hinder s"


Databricks visualization. Run in Databricks to view.

3. Using your own CSV that you uploaded to the volume, create a table in the catalog you created and query it with SQL. Try adding extra columns to the CSV data, for example a refresh_date set to the current date, the name of the source file...

# Version control + Best practices + Modularization
At this point we will link our GitHub repository, https://github.com/dvddepennde/databricks_notebook_bp, to the Workspace and work with it, performing a series of operations and showing how it should be structured.