Transformations to apply

- [ ] Remove records whose ping time differs from the request time.
- [ ] Adjust the time in SP by subtracting 3 hours.
- [ ] In Curitiba, when the “codigolinha” field is “REC”, the bus is not in operation, so the record is removed.
- [ ] A missing value in the “linha” field in BSB indicates the bus is not in operation, so the record should be removed.
- [ ] Convert time and date fields to ISO 8601, e.g. 2024-02-24T13:05Z.
- [ ] Standardize the line's direction of travel in SP and CWB as integers: 1 = outbound (ida), 2 = return (volta).
- [ ] Standardize the bus identifiers, e.g. CUR_idOnibus.
- [ ] Add the file name to a field, something like "query_timestamp".
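
The two timestamp items in the checklist (the 3-hour adjustment and the ISO 8601 format) can be sketched in plain Python. This is a minimal sketch, assuming the source timestamps are in UTC; `GMT_MINUS_3` and `to_gmt3_iso` are hypothetical names that do not appear in the notebook.

```python
from datetime import datetime, timedelta, timezone

# Fixed GMT-3 offset (assumption: no daylight saving time is applied).
GMT_MINUS_3 = timezone(timedelta(hours=-3))


def to_gmt3_iso(utc_text: str) -> str:
    """Convert an ISO 8601 timestamp (assumed UTC) to GMT-3, keeping the offset explicit."""
    parsed = datetime.fromisoformat(utc_text)
    # Naive timestamps are treated as UTC (assumption stated above).
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(GMT_MINUS_3).isoformat(timespec="seconds")


print(to_gmt3_iso("2024-02-24T16:05:00+00:00"))
```

Writing the offset (`-03:00`) instead of a trailing `Z` keeps the value unambiguous, since `Z` in ISO 8601 denotes UTC.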


In [1]:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Spark session with Delta Lake support
builder = (
    SparkSession.builder.appName("geral")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "8g")
    .config("spark.sql.parquet.enableVectorizedReader", "false")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

INPUT_PATH = "/home/felipe/dados_vm"
BRONZE_PATH = "/home/felipe/code/topicos_dados/lake/bronze/"
SILVER_PATH = "/home/felipe/code/topicos_dados/real_lake/silver"

print(spark.version)


24/03/26 08:56:04 WARN Utils: Your hostname, desktop resolves to a loopback address: 127.0.1.1; using 192.168.0.106 instead (on interface enp6s0)
24/03/26 08:56:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/home/felipe/.local/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/felipe/.ivy2/cache
The jars for the packages stored in: /home/felipe/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-b6501f29-7ab1-43a5-a6cf-11bfd81cc772;1.0
	confs: [default]
	found io.delta#delta-spark_2.12;3.1.0 in central
	found io.delta#delta-storage;3.1.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 476ms :: artifacts dl 10ms
	:: modules in use:
	io.delta#delta-spark_2.12;3.1.0 from central in [default]
	io.delta#delta-storage;3.1.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0

3.5.0


In [2]:
from pyspark.sql import functions as F

def checkInvalidTimeRows(df):
    """
    Keep only the rows whose ping time ("updated_at") is within
    300 seconds (5 minutes) of the request time ("queried_at").
    """
    df = df.withColumn("updated_at", F.to_timestamp("updated_at"))
    df = df.withColumn("queried_at", F.to_timestamp("queried_at"))

    timediff = F.abs(F.unix_timestamp("updated_at") - F.unix_timestamp("queried_at"))

    return df.filter(timediff < 300)
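
The same tolerance rule can be checked outside Spark, which is handy for unit tests. The sketch below is a plain-Python equivalent, not the notebook's implementation; `within_tolerance` and `MAX_SECONDS` are hypothetical names, and the sample values are taken from the São Paulo output shown in this notebook.

```python
from datetime import datetime

MAX_SECONDS = 300  # same 5-minute window used by checkInvalidTimeRows


def within_tolerance(updated_at: datetime, queried_at: datetime,
                     max_seconds: int = MAX_SECONDS) -> bool:
    """Return True when the ping and the request are less than max_seconds apart."""
    return abs((updated_at - queried_at).total_seconds()) < max_seconds


queried = datetime(2024, 2, 2, 5, 31, 16)
print(within_tolerance(datetime(2024, 2, 2, 5, 30, 28), queried))  # 48 s apart
print(within_tolerance(datetime(2024, 2, 2, 5, 25, 0), queried))   # 376 s apart
```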


In [4]:
# São Paulo

from pyspark.sql.functions import udf, input_file_name, col
from pyspark.sql.types import StringType, TimestampType
from datetime import datetime, timedelta


# Common Functions
def changeBusIdSP(onibus_id):
    """
    Change the column "id_onibus" to the following pattern: CITY_id_onibus

    Example: SPO_78288
    """
    return f"SPO_{onibus_id}"

def changeTimestamp(timestamp):
    """ 
    Subtract 3 hours to express the timestamp in GMT-3 and return it in ISO 8601 format
    """
    datetime_object = datetime.fromisoformat(timestamp)
    return (datetime_object-timedelta(hours=3)).isoformat()

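def addTimestampQueryTime(filenames):
    """ 
    Build the "queried_at" value from the epoch timestamp in the input file name
    
    file:///home/felipe/code/topicos_dados/dados/cb_micro/1706665101.9679544.parquet -> 1706665101.9679544
    """
    # Take the file name and strip the extension, so the result does not depend on the directory length
    file_name = str(filenames).rsplit("/", 1)[-1].removesuffix(".parquet")
    # fromtimestamp() returns local time, so the "Z" suffix is only accurate on a machine running in UTC
    return datetime.fromtimestamp(float(file_name)).strftime("%Y-%m-%dT%H:%M:%SZ")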

# UDFS

udf_transformBusIdSpo = udf(changeBusIdSP,StringType())
udf_changeTimestamp = udf(changeTimestamp,StringType())
udf_addTimestampFile = udf(addTimestampQueryTime,StringType())

# Reading DF
sp = spark.read.format("parquet").option("inferSchema","true").option("header","true").load(f"{INPUT_PATH}/sp").withColumn("inputFiles",input_file_name())


# Changing DF
sp = sp.withColumn("queried_at",udf_addTimestampFile(col("inputFiles")))
sp = sp.withColumn("bus_id",udf_transformBusIdSpo(col("id_onibus")))
sp = sp.withColumn("updated_at",udf_changeTimestamp(col("tempo_captura")))

# Dropping columns
sp = sp.drop("inputFiles","id_onibus","tempo_captura")

# Renaming columns

sp = sp.withColumnsRenamed({"lt0":"station0","lt1":"station1","c":"bus_code"})


# Testing filtering rows

sp = checkInvalidTimeRows(sp)

# saving
sp.write.format("delta").option("path", f"{SILVER_PATH}/silver_sp_geral").saveAsTable("silver_sp_geral")

sp.show()

24/03/26 09:04:30 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+-------------------+-------------------+--------+-----------------+--------------+-------------------+---------+-------------------+
|           latitude|          longitude|bus_code|         station0|      station1|         queried_at|   bus_id|         updated_at|
+-------------------+-------------------+--------+-----------------+--------------+-------------------+---------+-------------------+
|-23.650333250000003|       -46.70932225| 6044-10| TERM. STO. AMARO|   JD. D. JOSÉ|2024-02-02 05:31:16|SPO_78288|2024-02-02 05:30:28|
|      -23.641039625|       -46.71706575| 6044-10| TERM. STO. AMARO|   JD. D. JOSÉ|2024-02-02 05:31:16|SPO_78113|2024-02-02 05:27:23|
|       -23.64464325|         -46.727139| 6044-10| TERM. STO. AMARO|   JD. D. JOSÉ|2024-02-02 05:31:16|SPO_78061|2024-02-02 05:30:50|
|         -23.643037|        -46.7420015| 6044-10| TERM. STO. AMARO|   JD. D. JOSÉ|2024-02-02 05:31:16|SPO_78879|2024-02-02 05:27:06|
|        -23.6620535|         -46.779414| 6044-10| TERM. STO. 
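
The query time is recovered from the input file name, which holds the epoch timestamp of the request. A minimal sketch of that step in plain Python follows; `epoch_from_path` and `query_time_utc` are hypothetical names, and formatting in UTC (rather than local time, as the cell above does) is an assumption made here so that the trailing `Z` is accurate.

```python
from datetime import datetime, timezone


def epoch_from_path(path: str) -> float:
    """Extract the epoch seconds encoded in a '<epoch>.parquet' file name."""
    file_name = path.rsplit("/", 1)[-1]
    stem = file_name[: -len(".parquet")] if file_name.endswith(".parquet") else file_name
    return float(stem)


def query_time_utc(path: str) -> str:
    """Format the file-name epoch as an ISO 8601 UTC timestamp."""
    moment = datetime.fromtimestamp(epoch_from_path(path), tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


sample = "file:///home/felipe/code/topicos_dados/dados/cb_micro/1706665101.9679544.parquet"
print(epoch_from_path(sample))
print(query_time_utc(sample))
```

Splitting on the last `/` avoids hard-coding the length of the directory prefix, which differs from one city folder to another.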

In [3]:
spark.sql("DROP TABLE IF EXISTS silver_sp_geral")
# spark.sql("DROP TABLE IF EXISTS silver_bsb")
# spark.sql("DROP TABLE IF EXISTS silver_rj")
# spark.sql("DROP TABLE IF EXISTS silver_cwb")

DataFrame[]

In [79]:
# Curitiba

from pyspark.sql.functions import udf, input_file_name, col
from pyspark.sql.types import StringType, IntegerType, TimestampType, DoubleType
from datetime import datetime, timedelta
from pyspark.sql import functions as F


def changeBusIdCWB(onibus_id):
    """
    Change the column "id_onibus" to the following pattern: CITY_id_onibus

    Example: CWB_HI602
    """
    return f"CWB_{onibus_id}"

def changeTimestampCwb(time,timestamp):
    """ 
    Build an ISO 8601 timestamp (2024-02-24T13:05:00Z) for Curitiba

    The 'tempo_captura' field in Curitiba holds only the time, e.g. "22:40",
    so the date part (e.g. "2024-01-30") is taken from the query timestamp.
    """
    return f"{timestamp[:10]}T{time}:00Z"

def addTimestampQueryTime(filenames):
    """ 
    Build the "queried_at" value from the epoch timestamp in the input file name
    
    file:///home/felipe/code/topicos_dados/dados/cb_micro/1706665101.9679544.parquet -> 1706665101.9679544
    """
    # Take the file name and strip the extension, so the result does not depend on the directory length
    file_name = str(filenames).rsplit("/", 1)[-1].removesuffix(".parquet")
    return datetime.fromtimestamp(float(file_name)).strftime("%Y-%m-%dT%H:%M:%SZ")

def changeSentidoField(sentido):
    """
    Map the direction of travel to an integer: 1 = IDA, 2 = VOLTA, 0 = unknown
    """
    sentidoMap = {
        'IDA': 1,
        'VOLTA': 2
    }

    return sentidoMap.get(sentido, 0)

def removeInactiveBus(df):
    """
    Remove buses that are not in operation (line code "REC")
    """
    filtered_df = df.filter(df['linha']!="REC")
    return filtered_df



# UDFS

udf_transformBusIdCwb = udf(changeBusIdCWB,StringType())
udf_changeTimestamp = udf(changeTimestampCwb,StringType())
udf_addTimestampFile = udf(addTimestampQueryTime,StringType())
udf_changeSentido = udf(changeSentidoField,IntegerType())

# Reading DF

cwb = spark.read.format("parquet").option("inferSchema","true").option("header","true").load(f"{INPUT_PATH}/cwb").withColumn("inputFiles",input_file_name())

# Transformation

cwb = cwb.withColumn("queried_at",udf_addTimestampFile(col("inputFiles")))
cwb = cwb.withColumn("bus_id",udf_transformBusIdCwb(col("id_onibus")))
cwb = cwb.withColumn("updated_at",udf_changeTimestamp(col("tempo_captura"),col('queried_at')))
cwb = cwb.withColumn("bus_direction",udf_changeSentido(col("sentido")))

cwb = removeInactiveBus(cwb)

# Dropping

cwb = cwb.drop("tempo_captura","sentido","inputFiles","id_onibus")

# Renaming

cwb = cwb.withColumnsRenamed({
    "adaptado":"is_adapted",
    "linha":"bus_code",
    "tipo_veiculo":"type_vehicle",
    "situacao":"situation",
    "situacao_2":"situation_2",
    "tabela":"table"
})

cwb = checkInvalidTimeRows(cwb)

# Casting types

cwb = cwb.withColumn("latitude", cwb["latitude"].cast(DoubleType()))
cwb = cwb.withColumn("longitude", cwb["longitude"].cast(DoubleType()))
cwb = cwb.withColumn("is_adapted", cwb["is_adapted"].cast(IntegerType()))

# Saving

cwb.write.format("delta").option("path", f"{SILVER_PATH}/silver_cwb_geral").saveAsTable("silver_cwb_geral")

cwb.show()
# from pyspark.sql.functions import col, isnan, when, count
# excluded_columns = ["queried_at", "updated_at"]
# columns_to_check = [c for c in cwb.columns if c not in excluded_columns]

# # Check which columns have null values
# null_counts = cwb.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in columns_to_check])

# # Show the result
# cwb.filter(cwb["longitude"].isNotNull()).count()




24/03/26 02:05:30 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 95,00% for 8 writers
24/03/26 02:05:30 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 84,44% for 9 writers
24/03/26 02:05:30 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 76,00% for 10 writers
24/03/26 02:05:30 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 69,09% for 11 writers
24/03/26 02:05:31 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 63,33% for 12 writers
24/03/26 02:05:32 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 69,09% for 11 writers
24/03/26 02:05:32 WARN MemoryManager: Total allocation exceeds 95,

+----------+----------+------------+--------+----------+---------------+-----+----------+-------------------+---------+-------------------+-------------+
|  latitude| longitude|type_vehicle|bus_code| situation|    situation_2|table|is_adapted|         queried_at|   bus_id|         updated_at|bus_direction|
+----------+----------+------------+--------+----------+---------------+-----+----------+-------------------+---------+-------------------+-------------+
| -25.57997|-49.335138|  ARTICULADO|     685|NO HORÁRIO|REALIZANDO ROTA|    1|         1|2024-02-07 14:36:23|CWB_HI602|2024-02-07 14:35:00|            2|
|-25.542625|-49.313965|  ARTICULADO|     685|  ATRASADO|REALIZANDO ROTA|    2|         1|2024-02-07 14:36:23|CWB_HA614|2024-02-07 14:36:00|            2|
|-25.538276|-49.313716|  ARTICULADO|     685|  ATRASADO|REALIZANDO ROTA|    4|         1|2024-02-07 14:36:23|CWB_HA616|2024-02-07 14:36:00|            1|
|-25.576793| -49.33179|  ARTICULADO|     685|NO HORÁRIO|REALIZANDO ROTA|  5-
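
Because Curitiba reports only "HH:MM", the cell above borrows the date from the query timestamp. Around midnight the ping and the query can fall on different days, in which case the borrowed date is off by one and the row would fail the 5-minute check. The sketch below shows one possible way to handle that case in plain Python; `combine_ping_time` and `HALF_DAY` are hypothetical names, and the 12-hour threshold is an assumption, not something the notebook defines.

```python
from datetime import datetime, timedelta

HALF_DAY = timedelta(hours=12)  # assumed threshold for detecting a day rollover


def combine_ping_time(hhmm: str, queried_at: datetime) -> datetime:
    """Attach the date closest to queried_at to an 'HH:MM' ping time."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    candidate = queried_at.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate - queried_at > HALF_DAY:
        # Ping happened just before midnight, query just after it.
        candidate -= timedelta(days=1)
    elif queried_at - candidate > HALF_DAY:
        # Ping happened just after midnight, query just before it.
        candidate += timedelta(days=1)
    return candidate


print(combine_ping_time("22:40", datetime(2024, 1, 30, 22, 38, 21)))
print(combine_ping_time("23:59", datetime(2024, 1, 31, 0, 1, 10)))
```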

In [76]:
spark.sql("drop table if exists silver_cwb_geral")

DataFrame[]

In [56]:
# Brasília

from pyspark.sql.functions import udf, input_file_name, col
from pyspark.sql.types import StringType, IntegerType, TimestampType, DoubleType
from datetime import datetime, timedelta
from pyspark.sql import functions as F


def changeBusIdBSB(onibus_id):
    """
    Change the column "id_onibus" to the following pattern: CITY_id_onibus

    Example: BSB_338923
    """
    return f"BSB_{onibus_id}"

def changeTimestampBsb(timestamp):
    """ 
    Change timestamp to ISO 8601 (2024-02-24T13:05:00Z) for Brasília

    This is the 'tempo_captura' field in Brasília: "1701355514000" (epoch milliseconds)
    Note that this value isn't the same as the query timestamp.
    """
    # Keep the first 10 digits (epoch seconds) and drop the milliseconds
    time = (datetime.fromtimestamp(float(str(timestamp)[:10]))).strftime("%Y-%m-%dT%H:%M:%SZ")
    return time

def addTimestampQueryTime(filenames):
    """ 
    Build the "queried_at" value from the epoch timestamp in the input file name
    
    file:///home/felipe/code/topicos_dados/dados/cb_micro/1706665101.9679544.parquet -> 1706665101.9679544
    """
    # Take the file name and strip the extension, so the result does not depend on the directory length
    file_name = str(filenames).rsplit("/", 1)[-1].removesuffix(".parquet")
    return datetime.fromtimestamp(float(file_name)).strftime("%Y-%m-%dT%H:%M:%SZ")

def changeSentidoField(sentido):
    """
    Map the direction of travel to an integer: 1 = IDA, 2 = VOLTA, 0 = unknown
    """
    sentidoMap = {
        'IDA': 1,
        'VOLTA': 2
    }

    return sentidoMap.get(sentido, 0)

def removeInactiveBus(df):
    """
    Remove buses that are not in operation (empty "linha" field)
    """
    filtered_df = df.filter(df['linha']!="")
    return filtered_df

# UDFS

udf_transformBusIdBSB = udf(changeBusIdBSB,StringType())
udf_changeTimestamp = udf(changeTimestampBsb,StringType())
udf_addTimestampFile = udf(addTimestampQueryTime,StringType())
udf_changeSentido = udf(changeSentidoField,IntegerType())

bsb = spark.read.format("parquet").option("inferSchema","true").option("header","true").load(f"{INPUT_PATH}/df").withColumn("inputFiles",input_file_name())


column_type = bsb.select("velocidade").schema[0].dataType

# Check whether the column type is INT32
if column_type == IntegerType():
    # Convert the "velocidade" column to DoubleType
    bsb = bsb.withColumn("velocidade", bsb["velocidade"].cast(DoubleType()))


bsb = bsb.withColumn("queried_at",udf_addTimestampFile(col("inputFiles")))
bsb = bsb.withColumn("updated_at",udf_changeTimestamp(col("tempo_captura")))
bsb = bsb.withColumn("bus_id",udf_transformBusIdBSB(col("id_onibus")))
bsb = bsb.withColumn("bus_direction",udf_changeSentido(col("sentido")))

bsb = removeInactiveBus(bsb)
bsb = checkInvalidTimeRows(bsb)

# dropping 

bsb = bsb.drop("tempo_captura","sentido","inputFiles","id_onibus")

# renaming

bsb = bsb.withColumnsRenamed({
    "velocidade":"bus_speed",
    "linha":"bus_code",
    "direcao":"direction"
})

# 

bsb.write.format("delta").option("path", f"{SILVER_PATH}/silver_bsb_geral").saveAsTable("silver_bsb_geral")
# bsb.write.format("delta").mode("append").option("path",f"/home/felipe/code/topicos_dados/lake/silver/silver_bsb")

bsb.show()


24/03/26 01:33:21 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 95,00% for 8 writers
24/03/26 01:33:21 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 84,44% for 9 writers
24/03/26 01:33:21 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 76,00% for 10 writers
24/03/26 01:33:21 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 69,09% for 11 writers
24/03/26 01:33:21 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 63,33% for 12 writers
24/03/26 01:33:23 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 69,09% for 11 writers
24/03/26 01:33:23 WARN MemoryManager: Total allocation exceeds 95,

+---------+---------+---------+----------+--------+-------------------+-------------------+----------+-------------+
|longitude| latitude|bus_speed| direction|bus_code|         queried_at|         updated_at|    bus_id|bus_direction|
+---------+---------+---------+----------+--------+-------------------+-------------------+----------+-------------+
|-47.95694|-15.86222|    14.44|28.6733942|   0.175|2024-01-31 04:27:31|2024-01-31 04:22:43|BSB_338923|            1|
|-47.90689|-15.73643|    13.33|13.0850245|   0.884|2024-01-31 04:27:31|2024-01-31 04:22:49|BSB_336416|            2|
|-48.14956|-15.88419|     5.56|288.014693|   0.373|2024-01-31 04:27:31|2024-01-31 04:22:55|BSB_335924|            2|
|-48.07032|-15.80553|     5.56|101.689369|   0.805|2024-01-31 04:27:31|2024-01-31 04:22:59|BSB_338290|            1|
|-48.01963|-15.87825|     8.61|122.974859|   368.1|2024-01-31 04:27:31|2024-01-31 04:23:00|BSB_338281|            1|
|-48.05585|-15.86868|     1.39|77.0053832|   807.9|2024-01-31 04
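
The Brasília feed stores the ping time as epoch milliseconds. Instead of slicing the first 10 characters, the value can be divided by 1000, which also works for values with a different number of digits. The following is a minimal sketch under the assumption that UTC output is wanted; `epoch_ms_to_iso_utc` is a hypothetical name.

```python
from datetime import datetime, timezone


def epoch_ms_to_iso_utc(epoch_ms) -> str:
    """Convert epoch milliseconds (int or numeric string) to an ISO 8601 UTC timestamp."""
    seconds = int(epoch_ms) / 1000
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


# Sample value from the changeTimestampBsb docstring
print(epoch_ms_to_iso_utc("1701355514000"))
```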

In [55]:
spark.sql("DROP TABLE IF EXISTS silver_bsb_geral")

DataFrame[]

In [42]:
# from pyspark.sql.functions import isNotNull
teste = spark.read.format("parquet").load("/home/felipe/dados_vm/df/1706961424.3244605.parquet")
teste = teste.withColumn("velocidade",teste['velocidade'].cast(DoubleType()))
teste.show()

+----------+----------+-------------+---------+----------+-------+-------+-----+
| longitude|  latitude|tempo_captura|id_onibus|velocidade|sentido|direcao|linha|
+----------+----------+-------------+---------+----------+-------+-------+-----+
|-47.755812|-15.913563|1684647792000|   228141|      NULL|   NULL|    0.0|     |
|-48.057682|-16.006287|1692211340000|   232211|      NULL|    IDA|    0.0|205.1|
| -48.03334|-16.024967|1706961396000|   232297|      NULL|   NULL|    0.0|     |
|-47.756367|-15.913235|1706597704000|   232696|      NULL|   NULL|    0.0|     |
|-47.756489|-15.913743|1706961392000|   232831|      NULL|   NULL|    0.0|     |
|-47.963997|-16.047623|1706961260000|   229555|      NULL|    IDA|-301.35| 3305|
| -47.78577|-15.802013|1706961378000|   233200|      NULL|    IDA| 202.39|764.2|
|-47.954914|-15.873674|1706961392000|   233161|      NULL|    IDA|  59.65|124.6|
|-47.909599|-15.850035|1706961396000|   233170|      NULL|    IDA|    0.0|0.102|
|-48.057519|-16.006291|16944

In [8]:
# Rio de Janeiro

from pyspark.sql.functions import udf, input_file_name, col
from pyspark.sql.types import StringType, TimestampType, DoubleType
from datetime import datetime, timedelta
from pyspark.sql import functions as F


def changeBusIdRj(onibus_id):
    """
    Change the column "id_onibus" to the following pattern: CITY_id_onibus

    Example: RJO_A72157
    """
    return f"RJO_{onibus_id}"

def changeTimestampRj(timestamp):
    """ 
    Change the timestamp to the "YYYY-MM-DD HH:MM:SS" format for Rio de Janeiro

    This is the 'tempo_captura' field in Rio de Janeiro: "1701355514000" (epoch milliseconds)
    """
    # Drop the last three digits (milliseconds) to get epoch seconds
    time = (datetime.fromtimestamp(float(timestamp[:-3]))).strftime("%Y-%m-%d %H:%M:%S")
    return time

def addTimestampQueryTime(filenames):
    """ 
    Build the "queried_at" value from the epoch timestamp in the input file name
    
    file:///home/felipe/code/topicos_dados/dados/cb_micro/1706665101.9679544.parquet -> 1706665101.9679544
    """
    # Take the file name and strip the extension, so the result does not depend on the directory length
    file_name = str(filenames).rsplit("/", 1)[-1].removesuffix(".parquet")
    return datetime.fromtimestamp(float(file_name)).strftime("%Y-%m-%d %H:%M:%S")

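def changeSeparatorString(values):
    # Replace the decimal comma with a dot; null values are passed through
    return values.replace(",",".") if values is not None else None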

# UDFS

udf_transformBusIdRj = udf(changeBusIdRj,StringType())
udf_changeTimestamp = udf(changeTimestampRj,StringType())
udf_addTimestampFile = udf(addTimestampQueryTime,StringType())
udf_changeSeparator = udf(changeSeparatorString,StringType())

rj = spark.read.format("parquet").option("inferSchema","true").option("header","true").load(f"{INPUT_PATH}/rj").withColumn("inputFiles",input_file_name())

rj = rj.withColumn("queried_at",udf_addTimestampFile(col("inputFiles")))
rj = rj.withColumn("updated_at",udf_changeTimestamp(col("tempo_captura")))
rj = rj.withColumn("bus_id",udf_transformBusIdRj(col("id_onibus")))
rj = rj.withColumn("latitude",udf_changeSeparator(col("latitude")))
rj = rj.withColumn("longitude",udf_changeSeparator(col("longitude")))

# dropping

rj = rj.drop("tempo_captura","id_onibus","inputFiles")

# Renaming

rj = rj.withColumnsRenamed({
    "velocidade":"bus_speed",
    "linha":"bus_code",
})

rj = checkInvalidTimeRows(rj)

# casting

rj = rj.withColumn("latitude", rj["latitude"].cast(DoubleType()))
rj = rj.withColumn("longitude", rj["longitude"].cast(DoubleType()))
rj = rj.withColumn("bus_speed", rj["bus_speed"].cast(DoubleType()))


rj.write.format("delta").option("path", f"{SILVER_PATH}/silver_rj_geral").saveAsTable("silver_rj_geral")

rj.show()


24/03/25 23:39:01 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 95,00% for 8 writers
24/03/25 23:39:01 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 84,44% for 9 writers
24/03/25 23:39:01 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 76,00% for 10 writers
24/03/25 23:39:01 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 69,09% for 11 writers
24/03/25 23:39:01 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 63,33% for 12 writers
24/03/25 23:39:12 WARN MemoryManager: Total allocation exceeds 95,00% (1.020.054.720 bytes) of heap memory
Scaling row group sizes to 69,09% for 11 writers
24/03/25 23:39:12 WARN MemoryManager: Total allocation exceeds 95,

+---------+---------+---------+--------+-------------------+-------------------+----------+
| latitude|longitude|bus_speed|bus_code|         queried_at|         updated_at|    bus_id|
+---------+---------+---------+--------+-------------------+-------------------+----------+
|-22.91628|-43.22865|     14.0|     422|2024-02-07 17:21:00|2024-02-07 17:18:09|RJO_A72157|
|-22.92749|-43.25945|      7.0|     422|2024-02-07 17:21:00|2024-02-07 17:19:45|RJO_A72046|
|-22.93584|-43.18974|      3.0|     422|2024-02-07 17:21:00|2024-02-07 17:19:31|RJO_A72110|
| -22.9187| -43.1833|     16.0|     007|2024-02-07 17:21:00|2024-02-07 17:19:37|RJO_A72048|
|-22.91828|-43.19426|     37.0|     410|2024-02-07 17:21:00|2024-02-07 17:17:17|RJO_A72180|
| -22.9024|-43.18304|     31.0|     422|2024-02-07 17:21:00|2024-02-07 17:19:35|RJO_A72194|
|-22.92338|-43.17703|     11.0|     422|2024-02-07 17:21:00|2024-02-07 17:17:58|RJO_A72013|
|-22.93288|-43.17916|      0.0|     422|2024-02-07 17:21:00|2024-02-07 17:18:03|
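
The Rio de Janeiro cell replaces the decimal comma before casting the coordinates to double. A plain-Python sketch of that conversion, with explicit handling of missing or malformed values, is shown below; `parse_decimal_comma` is a hypothetical name, and returning `None` for bad input is an assumption about the desired behavior.

```python
from typing import Optional


def parse_decimal_comma(text: Optional[str]) -> Optional[float]:
    """Parse a number written with a decimal comma, e.g. '-22,91628'."""
    if text is None:
        return None
    cleaned = text.strip().replace(",", ".")
    if not cleaned:
        return None
    try:
        return float(cleaned)
    except ValueError:
        # Malformed values become None instead of failing the whole job.
        return None


print(parse_decimal_comma("-22,91628"))
print(parse_decimal_comma(None))
```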

In [19]:
spark.sql("drop table if exists silver_sp_geral")

DataFrame[]