# Transformations and Standardizations – Trusted Layer

This notebook transforms the data coming from the RAW layer, applying rules for cleaning, normalization, type standardization, column renaming, and structural modeling.

In [2]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DecimalType, IntegerType, LongType
from pyspark.sql import functions as f
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

In [3]:
builder: SparkSession.Builder = SparkSession.builder \
    .appName("Preparação TRS do Simples") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "6g")

In [4]:
spark: SparkSession = configure_spark_with_delta_pip(builder).getOrCreate()

spark

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
26/01/13 18:35:21 WARN Utils: Your hostname, wilcb, resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
26/01/13 18:35:21 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/wilcb/spark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/wilcb/.ivy2.5.2/cache
The jars for the packages stored in: /home/wilcb/.ivy2.5.2/jars
io.delta#delta-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-938df015-3d3b-49c7-9904-0ac400f45133;1.0
	confs: [default]
	found io.delta#delta-spark_2.13;4.0.0 in central
	found io.delta#delta-storage;4.0.0 in central
	found org.antlr#antlr4-runtime;4.13.1 in central
:: resolution report :: resolve 196ms :: artifacts dl 7ms
	:: modules in use:
	io.delta#delta-spark_2.13;4.0.0 from central in [de

## Helper Functions

In [5]:
def null_count(df: DataFrame) -> None:
    """
    Count and display the number of null values per column in a PySpark DataFrame.

    For every column of the given DataFrame, the function computes the number of
    null (`NULL`) values and displays the result with `.show()`.

    Args:
        df (DataFrame): DataFrame to inspect.

    Returns:
        None: The function only prints the result.
    """
    try:
        # Build one expression per column that sums its null values
        nulls = [
            f.sum(f.col(c).isNull().cast("int")).alias(c)
            for c in df.columns
        ]

        # Display the null totals per column
        df.select(nulls).show()
    except Exception as e:
        print(f"[ERROR] Failed to count null values: {e}")
        raise

## Reading the DeltaTable from the RAW Layer

In [7]:
try:
    dt: DeltaTable = DeltaTable.forPath(spark, "../../RAW/simples/2025-12")
except Exception as e:
    print(f"[ERROR] Failed to read the DeltaTable: {e}")
    raise

In [8]:
try:
    df: DataFrame = dt.toDF()
    df.show(truncate=False)
except Exception as e:
    print(f"[ERROR] Failed to convert to DataFrame: {e}")
    raise

26/01/13 18:43:13 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+-----------+-------------+------------------+---------------------+---------+--------------+-----------------+--------------------------+
|cnpj_basico|opcao_simples|data_opcao_simples|data_exclusao_simples|opcao_mei|data_opcao_mei|data_exclusao_mei|data_ingestao             |
+-----------+-------------+------------------+---------------------+---------+--------------+-----------------+--------------------------+
|23331913   |S            |20150908          |00000000             |N        |00000000      |00000000         |2026-01-09 19:54:31.653903|
|23331914   |N            |20190101          |20191231             |N        |00000000      |00000000         |2026-01-09 19:54:31.653903|
|23331915   |S            |20240101          |00000000             |S        |20240101      |00000000         |2026-01-09 19:54:31.653903|
|23331916   |S            |20150923          |00000000             |S        |20150923      |00000000         |2026-01-09 19:54:31.653903|
|23331917   |N            |

Count the null values in each column

In [9]:
null_count(df)



+-----------+-------------+------------------+---------------------+---------+--------------+-----------------+-------------+
|cnpj_basico|opcao_simples|data_opcao_simples|data_exclusao_simples|opcao_mei|data_opcao_mei|data_exclusao_mei|data_ingestao|
+-----------+-------------+------------------+---------------------+---------+--------------+-----------------+-------------+
|          0|            0|                 0|                    0|        0|             0|                0|            0|
+-----------+-------------+------------------+---------------------+---------+--------------+-----------------+-------------+
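A count of zero nulls does not mean every field carries a value: in the preview, the date columns hold the string `00000000` where no date applies. As a rough sketch, the check below counts that placeholder per column in plain Python, using only the four complete rows from the preview. The names `PLACEHOLDER`, `DATE_COLUMNS`, and `count_placeholders` are hypothetical, and reading `00000000` as "no date" is an assumption drawn from the sample, not a documented rule.

```python
from typing import Dict, Iterable, List

# Assumption: "00000000" marks "no date" in the RAW date columns.
PLACEHOLDER = "00000000"

DATE_COLUMNS: List[str] = [
    "data_opcao_simples",
    "data_exclusao_simples",
    "data_opcao_mei",
    "data_exclusao_mei",
]


def count_placeholders(
    rows: Iterable[Dict[str, str]], columns: List[str]
) -> Dict[str, int]:
    """Count, per column, how many rows hold the placeholder value."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            if row.get(c) == PLACEHOLDER:
                counts[c] += 1
    return counts


# Date values of the four complete rows shown in the df.show() preview.
sample_rows = [
    dict(zip(DATE_COLUMNS, ("20150908", "00000000", "00000000", "00000000"))),
    dict(zip(DATE_COLUMNS, ("20190101", "20191231", "00000000", "00000000"))),
    dict(zip(DATE_COLUMNS, ("20240101", "00000000", "20240101", "00000000"))),
    dict(zip(DATE_COLUMNS, ("20150923", "00000000", "20150923", "00000000"))),
]

print(count_placeholders(sample_rows, DATE_COLUMNS))
# {'data_opcao_simples': 0, 'data_exclusao_simples': 3, 'data_opcao_mei': 2, 'data_exclusao_mei': 4}
```

On the full table the same idea could be expressed in Spark by comparing each column to the placeholder instead of calling `isNull()`.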

In [13]:
df.printSchema()

root
 |-- cnpj_basico: string (nullable = true)
 |-- opcao_simples: string (nullable = true)
 |-- data_opcao_simples: string (nullable = true)
 |-- data_exclusao_simples: string (nullable = true)
 |-- opcao_mei: string (nullable = true)
 |-- data_opcao_mei: string (nullable = true)
 |-- data_exclusao_mei: string (nullable = true)
 |-- data_ingestao: timestamp (nullable = true)
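The schema shows that every date column and both option flags arrive as strings. The sketch below illustrates, in plain Python, the kind of type-standardization rule a Trusted-layer step might apply: `yyyyMMdd` strings become dates, the `00000000` placeholder becomes `None`, and `S`/`N` flags become booleans. The function names `parse_raw_date` and `parse_flag` are hypothetical, and both the placeholder meaning and the `S` = yes / `N` = no reading are assumptions based on the sample rows. In Spark, a comparable rule would typically combine `f.when` with `f.to_date(col, "yyyyMMdd")`.

```python
from datetime import date, datetime
from typing import Optional

# Assumption: "00000000" marks "no date" in the RAW date columns.
PLACEHOLDER = "00000000"


def parse_raw_date(raw: Optional[str]) -> Optional[date]:
    """Convert a yyyyMMdd string to a date.

    Missing, empty, placeholder, or unparseable values become None.
    """
    if raw is None:
        return None
    value = raw.strip()
    if value == "" or value == PLACEHOLDER:
        return None
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except ValueError:
        # Not a valid calendar date (for example, month 13)
        return None


def parse_flag(raw: Optional[str]) -> Optional[bool]:
    """Map 'S' to True and 'N' to False; anything else becomes None.

    Assumption: 'S' means yes (sim) and 'N' means no (não).
    """
    if raw is None:
        return None
    value = raw.strip().upper()
    if value == "S":
        return True
    if value == "N":
        return False
    return None


# Values taken from the first row of the df.show() preview.
print(parse_raw_date("20150908"))  # 2015-09-08
print(parse_raw_date("00000000"))  # None
print(parse_flag("S"))             # True
```

Returning `None` for invalid input keeps the rule total, so a later null count would surface both placeholders and malformed values in one pass.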

