# 01 - Initial Exploration of the Pings Dataset and OSM Shapefiles
This notebook loads the geolocated ping data, deduplicates it, and runs a structured data analysis on it, as well as on the OpenStreetMap shapefiles for El Salvador.


## Geolocated Pings File
Data for December 2024 in El [Salvador](url)

In [0]:
%sql
select * from sv_12_2023 limit 40

Aggregate Descriptive Statistics

In [0]:
%sql
select count(1),
       count(distinct device_id),
       count(distinct date),
       max(latitude), min(latitude),
       max(longitude), min(longitude)
from sv_12_2023

Finding Duplicate Values

In [0]:
%sql
select timestamp, device_id,latitude,longitude,count(1)
from sv_12_2023
group by 1,2,3,4
having count(1)>1
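
The duplicate check above groups on the full `(timestamp, device_id, latitude, longitude)` tuple and keeps only groups with more than one row. As a minimal sketch of the same idea outside Spark, assuming a small in-memory list of rows (the sample values below are hypothetical, not taken from `sv_12_2023`):

```python
from collections import Counter

# Hypothetical sample rows: (timestamp, device_id, latitude, longitude).
pings = [
    (1701455400, "dev-a", 13.6929, -89.2182),
    (1701455400, "dev-a", 13.6929, -89.2182),  # exact duplicate of the row above
    (1701455460, "dev-a", 13.6931, -89.2180),
    (1701455400, "dev-b", 13.9942, -89.5597),
]


def find_duplicates(rows):
    """Return {row: count} for every row that appears more than once."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


duplicates = find_duplicates(pings)
print(duplicates)
```

A row counts as a duplicate only when all four fields match, which mirrors `group by 1,2,3,4 having count(1)>1`.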

Removing duplicates and creating date-derived variables of interest

In [0]:
%sql
CREATE TABLE sv_12_2023_deduped AS
with raw as (
SELECT
  distinct  from_utc_timestamp(from_unixtime(timestamp), "America/El_Salvador") datetime,
  timestamp,
  device_id,
  latitude,
  longitude
FROM
  sv_12_2023
)
select datetime,
       hour(datetime) hora,
       date(datetime) fecha,
       CASE 
        WHEN dayofweek(datetime) IN (1, 7) THEN 1
        ELSE 0
       END AS is_weekend,
       CASE dayofweek(datetime)
        WHEN 1 THEN 'Domingo'
        WHEN 2 THEN 'Lunes'
        WHEN 3 THEN 'Martes'
        WHEN 4 THEN 'Miércoles'
        WHEN 5 THEN 'Jueves'
        WHEN 6 THEN 'Viernes'
        WHEN 7 THEN 'Sábado'
      END AS nombre_dia,
       timestamp,
       device_id,
       latitude,
       longitude
from raw
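
The derived columns depend on two conventions that are easy to get wrong: the conversion from Unix time to local time, and Spark's `dayofweek` numbering (1 = Sunday through 7 = Saturday). The sketch below reproduces them in plain Python for a single timestamp. It assumes a fixed UTC-6 offset as a stand-in for `America/El_Salvador`, and the helper names `spark_dayofweek` and `date_features` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-6 offset stands in for "America/El_Salvador".
EL_SALVADOR = timezone(timedelta(hours=-6))

# Day names keyed by Spark's dayofweek() convention: 1 = Sunday ... 7 = Saturday.
NOMBRE_DIA = {
    1: "Domingo", 2: "Lunes", 3: "Martes", 4: "Miércoles",
    5: "Jueves", 6: "Viernes", 7: "Sábado",
}


def spark_dayofweek(dt):
    """Map Python's isoweekday() (1 = Monday ... 7 = Sunday) to Spark's numbering."""
    return dt.isoweekday() % 7 + 1


def date_features(unix_ts):
    """Build the same derived columns as the query above for one Unix timestamp."""
    local = datetime.fromtimestamp(unix_ts, tz=EL_SALVADOR)
    dow = spark_dayofweek(local)
    return {
        "datetime": local,
        "hora": local.hour,
        "fecha": local.date(),
        "is_weekend": 1 if dow in (1, 7) else 0,
        "nombre_dia": NOMBRE_DIA[dow],
    }


# 2023-12-04 03:00 UTC is still Sunday evening (21:00) in El Salvador.
ts = int(datetime(2023, 12, 4, 3, 0, tzinfo=timezone.utc).timestamp())
print(date_features(ts))
```

The example timestamp falls on a Monday in UTC but on a Sunday locally, which is why the weekend flag has to be computed after the timezone conversion.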


Record count after removing duplicates

In [0]:
%sql
select count(1) from sv_12_2023_deduped

Distribution of pings and unique devices by day of week

In [0]:
%sql
select concat(dayofweek(datetime),'.',nombre_dia) dia,count(1) registros, count(distinct device_id) dispositivos_unicos from sv_12_2023_deduped group by 1

Databricks visualization. Run in Databricks to view.

Histogram of pings per day per device

In [0]:
%sql
select concat(dayofweek(datetime),'.',nombre_dia) dia,fecha, device_id,count(1) pings_por_dia
from sv_12_2023_deduped group by 1,2,3

Databricks visualization. Run in Databricks to view.

Box plot of pings per day per device

In [0]:
%sql
select concat(dayofweek(datetime),'.',nombre_dia) dia,fecha, device_id,count(1) pings_por_dia
from sv_12_2023_deduped group by 1,2,3

Databricks visualization. Run in Databricks to view.
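
Since the chart itself is only visible in Databricks, the following sketch shows how the numbers behind such a box plot could be computed by hand: count pings per `(fecha, device_id)` and then take a five-number summary of those counts. The sample rows are hypothetical, and the `inclusive` quantile method is an assumption; the method used by the Databricks visualization is not stated in this notebook.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical deduplicated rows: one (fecha, device_id) entry per ping.
rows = (
    [("2023-12-01", "dev-a")] * 1
    + [("2023-12-01", "dev-b")] * 2
    + [("2023-12-02", "dev-a")] * 3
    + [("2023-12-02", "dev-b")] * 4
    + [("2023-12-03", "dev-a")] * 5
)


def pings_per_device_day(rows):
    """Count pings for each (fecha, device_id) pair."""
    return Counter(rows)


def five_number_summary(values):
    """Return min, quartiles and max; needs at least two values."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": median, "q3": q3, "max": ordered[-1]}


summary = five_number_summary(pings_per_device_day(rows).values())
print(summary)
```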

## OpenStreetMap Shapefiles

In [0]:
%pip install geopandas

In [0]:
import os
import geopandas as gpd
import pandas as pd

In [0]:
shapes_dir = "../Proyecto de Grado/shapes/"
gdf_list = []

for filename in os.listdir(shapes_dir):
    if filename.endswith(".geojson"):
        file_path = os.path.join(shapes_dir, filename)
        gdf = gpd.read_file(file_path)
        gdf['source_file'] = os.path.splitext(filename)[0]
        gdf_selected = gdf[
            ['source_file', 'osm_id', 'code', 'fclass', 'name', 'geometry']
        ]
        gdf_list.append(gdf_selected)

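# pd.concat raises ValueError on an empty list, so fail early with a clear message.
if not gdf_list:
    raise FileNotFoundError(f"No .geojson files found in {shapes_dir}")

gdf_all = gpd.GeoDataFrame(
    pd.concat(gdf_list, ignore_index=True),
    crs=gdf_list[0].crs
)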

In [0]:
# GeoDataFrame has no .display() method; preview the first rows instead.
gdf_all.head()

In [0]:
# Convert GeoDataFrame to Pandas DataFrame (geometry as WKT)
gdf_all['geometry_wkt'] = gdf_all['geometry'].apply(lambda geom: geom.wkt if geom is not None else None)
pdf = gdf_all.drop(columns='geometry')

# Convert to Spark DataFrame
spark_df = spark.createDataFrame(pdf)

# Register as a temporary view or save as a table
spark_df.createOrReplaceTempView("osm_shapes")

# Optionally, save as a permanent table
spark_df.write.mode("overwrite").saveAsTable("osm_shapes")

In [0]:
%sql
select * from osm_shapes limit 40

In [0]:
%sql
select count(1) from osm_shapes

In [0]:
%sql
select source_file,count(1) from osm_shapes group by 1

In [0]:
%sql
SELECT
  fclass,
  COUNT(1) AS cnt,
  ROUND(100.0 * COUNT(1) / SUM(COUNT(1)) OVER (), 2) AS pct
from osm_shapes
group by 1

Databricks visualization. Run in Databricks to view.
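
The `SUM(COUNT(1)) OVER ()` expression in the query above divides each group's count by the grand total. A minimal plain-Python sketch of the same percentage calculation, using hypothetical `fclass` values:

```python
from collections import Counter

# Hypothetical fclass values, one per shape.
fclasses = [
    "building", "building", "building", "park",
    "school", "park", "building", "building",
]


def class_shares(values):
    """Return {fclass: (cnt, pct)} where pct is the share of the total, rounded to 2 decimals."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: (n, round(100.0 * n / total, 2)) for k, n in counts.items()}


shares = class_shares(fclasses)
print(shares)
```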

In [0]:
%sql
select osm_id,source_file, fclass,name,count(1) over (partition by osm_id) cnt
from osm_shapes

Removing duplicates and keeping only the following layers (stored in `source_file` with an `sv_` prefix):
- gis_osm_pois_a_free_1
- gis_osm_pofw_a_free_1
- gis_osm_buildings_a_free_1
- gis_osm_landuse_a_free_1
- gis_osm_traffic_a_free_1

When the same `osm_id` appears in more than one layer, the following priority order determines which record is kept after removing duplicates:
1. gis_osm_pois_a_free_1
2. gis_osm_pofw_a_free_1
3. gis_osm_buildings_a_free_1
4. gis_osm_landuse_a_free_1
5. gis_osm_traffic_a_free_1


In [0]:
%sql
CREATE TABLE osm_shapes_deduped AS
with raw as (
select osm_id,
       source_file,
       fclass clase,
       coalesce(name,fclass) nombre,
       case when source_file='sv_gis_osm_pois_a_free_1' then 1
            when source_file='sv_gis_osm_pofw_a_free_1' then 2
            when source_file='sv_gis_osm_buildings_a_free_1' then 3
            when source_file='sv_gis_osm_landuse_a_free_1' then 4
            when source_file='sv_gis_osm_traffic_a_free_1' then 5
        else 6 end priority,
      geometry_wkt
from osm_shapes
where source_file in ('sv_gis_osm_pois_a_free_1','sv_gis_osm_pofw_a_free_1','sv_gis_osm_buildings_a_free_1','sv_gis_osm_landuse_a_free_1','sv_gis_osm_traffic_a_free_1')
)
, row_numbered as (
select *, row_number() over (partition by osm_id order by priority) rn
from raw
)
select * except (rn)
from row_numbered 
where rn=1
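
The `row_number() ... order by priority` pattern keeps, for each `osm_id`, the record from the highest-priority layer. A minimal sketch of that rule in plain Python is shown below. The function name `dedupe_by_priority`, the sample records, and the `sv_gis_osm_roads_free_1` layer name are hypothetical and only serve the illustration.

```python
# Priority per layer, matching the CASE expression in the query above.
PRIORITY = {
    "sv_gis_osm_pois_a_free_1": 1,
    "sv_gis_osm_pofw_a_free_1": 2,
    "sv_gis_osm_buildings_a_free_1": 3,
    "sv_gis_osm_landuse_a_free_1": 4,
    "sv_gis_osm_traffic_a_free_1": 5,
}

# Hypothetical sample records.
records = [
    {"osm_id": "1", "source_file": "sv_gis_osm_buildings_a_free_1", "fclass": "building"},
    {"osm_id": "1", "source_file": "sv_gis_osm_pois_a_free_1", "fclass": "school"},
    {"osm_id": "2", "source_file": "sv_gis_osm_landuse_a_free_1", "fclass": "park"},
    {"osm_id": "3", "source_file": "sv_gis_osm_roads_free_1", "fclass": "residential"},
]


def dedupe_by_priority(records):
    """Keep one record per osm_id, preferring the layer with the lowest priority number."""
    best = {}
    for rec in records:
        rank = PRIORITY.get(rec["source_file"])
        if rank is None:
            continue  # layer is not in the selected set, so drop the record
        current = best.get(rec["osm_id"])
        if current is None or rank < PRIORITY[current["source_file"]]:
            best[rec["osm_id"]] = rec
    return list(best.values())


deduped = dedupe_by_priority(records)
print(deduped)
```

With the sample data, `osm_id` 1 is kept from the POIs layer rather than the buildings layer, and `osm_id` 3 is dropped because its layer is not in the selected set.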


In [0]:
%sql
select count(1) from osm_shapes_deduped

Verifying that there are no duplicates

In [0]:
%sql
select osm_id,source_file, clase,nombre,count(1) over (partition by osm_id) cnt
from osm_shapes_deduped
order by 5 desc