# Notebook explanation

This notebook processes and enriches employee and sales order data using PySpark in Databricks. The main steps are described below:

1. **Data load**:  
   The `Bronze_Employes` and `Silver_Orders` tables are loaded from the Spark catalog as DataFrames.

2. **Order processing**:  
   - Duplicate orders are removed using the `order_id` field.
   - The `event_date` column is parsed as a timestamp and stored in a new column, `approved_date`.
   - Orders are grouped by `employee_id` to calculate:
     - Date of the first and last sale.
     - Total number of orders and products sold.
     - Minimum and maximum quantity of products sold in a single order.

3. **Employee enrichment**:  
   - The aggregated order information is joined with the employee data.
   - The load date (`load_date`) and the number of days since the first sale (`days_since_first_sale`) are added.

4. **Storage**:  
   - The resulting DataFrame is saved as a Delta table in the specified schema, overwriting the existing table if there is one.

This flow allows you to maintain a table of employees enriched with sales metrics, useful for subsequent analysis and reports.
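
The order-processing step can be illustrated without a cluster. The following is a minimal plain-Python sketch, not the notebook's Spark code: it mirrors the deduplication and per-employee metrics on a few made-up rows. The sample data and the helper name `aggregate_orders` are hypothetical; only the column names come from the notebook.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (order_id, employee_id, event_date, quantity_products)
sample_orders = [
    (1, 10, "01/03/2024 09:00:00", 3),
    (1, 10, "01/03/2024 09:00:00", 3),  # duplicate order_id, dropped below
    (2, 10, "05/03/2024 14:30:00", 7),
    (3, 20, "02/03/2024 11:15:00", 1),
]


def aggregate_orders(rows):
    # Keep the first row seen per order_id (Spark's dropDuplicates keeps an arbitrary one)
    unique = {}
    for order_id, employee_id, event_date, quantity in rows:
        unique.setdefault(order_id, (employee_id, event_date, quantity))

    # Parse the date string and group the remaining orders by employee
    grouped = defaultdict(list)
    for employee_id, event_date, quantity in unique.values():
        approved_date = datetime.strptime(event_date, "%d/%m/%Y %H:%M:%S")
        grouped[employee_id].append((approved_date, quantity))

    # Compute the same metrics the notebook stores per employee
    result = {}
    for employee_id, sales in grouped.items():
        dates = [d for d, _ in sales]
        quantities = [q for _, q in sales]
        result[employee_id] = {
            "first_sale": min(dates),
            "last_sale": max(dates),
            "total_orders": len(sales),
            "total_products": sum(quantities),
            "min_quantity_sold": min(quantities),
            "max_quantity_sold": max(quantities),
        }
    return result


metrics = aggregate_orders(sample_orders)
print(metrics[10]["total_orders"], metrics[10]["total_products"])  # 2 10
```

Employee 10 ends up with two orders and ten products because the repeated `order_id` 1 is counted only once.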

In [0]:
%run ../Transversal/config

In [0]:
%run ../Transversal/utils

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.




In [0]:
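from datetime import datetime
import pytz

# horario_colombia, Bronze_Employes and Silver_Orders are not defined in this notebook;
# they are presumably provided by the %run cells above.
current_date = datetime.now(horario_colombia).strftime("%d/%m/%Y %H:%M:%S")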

employees = spark.table(Bronze_Employes)
orders = spark.table(Silver_Orders)
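
`current_date` is a plain string, and the same layout is later parsed by Spark with the pattern `dd/MM/yyyy HH:mm:ss`. The sketch below shows the round trip in plain Python. It assumes a fixed UTC-5 offset as a stand-in for `horario_colombia`, whose actual definition is not shown in this notebook.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for the notebook's horario_colombia
colombia_offset = timezone(timedelta(hours=-5))

PYTHON_FORMAT = "%d/%m/%Y %H:%M:%S"    # strftime layout used for current_date
SPARK_PATTERN = "dd/MM/yyyy HH:mm:ss"  # matching pattern passed to to_timestamp

now_local = datetime.now(colombia_offset)
as_text = now_local.strftime(PYTHON_FORMAT)

# The string carries no zone and no sub-second part, so parsing it back
# yields a naive datetime truncated to whole seconds.
parsed = datetime.strptime(as_text, PYTHON_FORMAT)
assert parsed == now_local.replace(microsecond=0, tzinfo=None)

print(as_text, "->", parsed.isoformat())
```

Because the string has no zone information, Spark interprets it in the session time zone when it parses it. This is worth keeping in mind if the cluster is not configured for Colombian time.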

In [0]:
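# Import the functions module under an alias so that Spark's min, max and sum
# do not shadow the Python built-ins for the rest of the notebook.
from pyspark.sql import functions as F


def generate_silver_employees(orders_df, employees_df, table_silver, current_date):
    """Aggregate order metrics per employee, join them to the employee data,
    save the result as a Delta table and return it."""
    timestamp_format = "dd/MM/yyyy HH:mm:ss"

    # Keep one row per order and parse event_date into a timestamp
    clean_orders_df = (
        orders_df.dropDuplicates(["order_id"])
        .withColumn("approved_date", F.to_timestamp(F.col("event_date"), timestamp_format))
    )

    # Sales metrics per employee
    orders_agg = clean_orders_df.groupBy("employee_id").agg(
        F.min("approved_date").alias("first_sale"),
        F.max("approved_date").alias("last_sale"),
        F.count("order_id").alias("total_orders"),
        F.sum("quantity_products").alias("total_products"),
        F.min("quantity_products").alias("min_quantity_sold"),
        F.max("quantity_products").alias("max_quantity_sold"),
    )

    # These columns are added before the join, so employees without orders
    # end up with null load_date and days_since_first_sale.
    orders_agg = (
        orders_agg
        .withColumn("load_date", F.to_timestamp(F.lit(current_date), timestamp_format))
        .withColumn("days_since_first_sale", F.datediff(F.col("load_date"), F.col("first_sale")))
    )

    # A left join keeps every employee, including those with no orders
    df_merged = employees_df.join(orders_agg, on="employee_id", how="left")

    # Overwrite the target Delta table, replacing its schema if it changed
    (
        df_merged.write
        .format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .saveAsTable(table_silver)
    )

    return df_merged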

In [0]:
generate_silver_employees(
    orders_df=orders,
    employees_df=employees,
    table_silver=Silver_Employees,
    current_date=current_date
)

DataFrame[employee_id: bigint, name: string, phone: string, email: string, address: string, comission: double, first_sale: timestamp, last_sale: timestamp, total_orders: bigint, total_products: bigint, min_quantity_sold: int, max_quantity_sold: int, load_date: timestamp, days_since_first_sale: int]
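
Two details of the resulting table are easy to miss: `days_since_first_sale` counts calendar days rather than elapsed 24-hour periods, and the order-derived columns are null for employees who have no orders, because the join is a left join and `load_date` is computed on the aggregated orders. The following plain-Python sketch imitates both behaviours on hypothetical data. The names `days_between`, `employees` and `orders_agg` are illustrative, and `None` plays the role of a SQL NULL.

```python
from datetime import datetime


def days_between(end, start):
    # Mirrors datediff(end, start): calendar days, time of day ignored;
    # a missing value on either side gives a missing result.
    if end is None or start is None:
        return None
    return (end.date() - start.date()).days


# Hypothetical employee rows and per-employee aggregates
employees = [
    {"employee_id": 10, "name": "Ana"},
    {"employee_id": 30, "name": "Luis"},  # has no orders
]
orders_agg = {
    10: {
        "first_sale": datetime(2024, 3, 1, 23, 30),
        "load_date": datetime(2024, 3, 3, 0, 15),
    },
}

# Left join: every employee is kept; unmatched ones get None in the order columns
merged = []
for emp in employees:
    agg = orders_agg.get(emp["employee_id"], {})
    first_sale = agg.get("first_sale")
    load_date = agg.get("load_date")
    merged.append({
        **emp,
        "first_sale": first_sale,
        "load_date": load_date,
        "days_since_first_sale": days_between(load_date, first_sale),
    })

for row in merged:
    print(row["employee_id"], row["load_date"], row["days_since_first_sale"])
```

For employee 10, just under 25 hours separate the two timestamps, yet the result is 2 because two date boundaries are crossed. Employee 30 keeps `None` in every order-derived column, including `load_date`.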