# Publishing a dataset following best practices

In this tutorial, we walk through the main steps required to publish a dataset as open access. We take an existing dataset and transform it for publication in a digital repository. Beyond the publication process itself, we also cover the best practices to follow throughout: documenting the resource correctly, including adequate metadata, and publishing the code needed to replicate the experiments.


In [None]:
# Import all the libraries needed for this notebook
import pandas as pd
from datasets import Dataset

## Filtering the dataset

We download the 2024 edition of the [Accidentes de tráfico de la ciudad de Madrid](https://datos.madrid.es/portal/site/egob/menuitem.c05c1f754a33a9fbe4b2e4b284f1a5a0/?vgnextoid=7c2843010d9c3610VgnVCM2000001f4a900aRCRD&vgnextchannel=374512b9ace9f310VgnVCM100000171f5a0aRCRD&vgnextfmt=default) dataset (traffic accidents in the city of Madrid), published by the Madrid City Council, and save it in the `data/` folder.

In [None]:
original_dataset_path= "/home/jovyan/dataset-publication-tutorial/data/2024_Accidentalidad.csv"
original_df= pd.read_csv(original_dataset_path, delimiter=';')
original_df.head(n=3)

For this example we want to turn this dataset into a dataset of fatal accidents, so we start from the [documentation](https://datos.madrid.es/FWProjects/egob/Catalogo/Seguridad/Ficheros/Estructura_ConjuntoDatos_Accidentesv2.pdf) that accompanies it. According to that documentation, the records in which the person involved died are those whose severity (`lesividad`) code is `04`. In the CSV this code is stored in the `cod_lesividad` column, which the filter below compares against the integer `4`.

In [None]:
fallecidos_df= original_df[original_df["cod_lesividad"]==4]
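The documentation writes the code as `04`, while the filter compares against the integer `4`. The following is a minimal sketch, using a toy extract with made-up record numbers, of why both views can select the same rows: by default pandas parses a zero-padded code as a number, and the padded text form is only kept if the column is read as a string. Whether the real file stores `04` or `4` is not checked here.

```python
import io

import pandas as pd

# Toy semicolon-delimited extract (hypothetical records, not real data)
raw = (
    "num_expediente;cod_lesividad\n"
    "2024S900001;04\n"
    "2024S900002;07\n"
    "2024S900003;\n"
    "2024S900004;04\n"
)

# Default parsing: the codes become numbers, so "04" is read as 4
as_numbers = pd.read_csv(io.StringIO(raw), delimiter=";")
fatal_numeric = as_numbers[as_numbers["cod_lesividad"] == 4]

# Reading the column as text keeps the codes exactly as written in the file
as_text = pd.read_csv(io.StringIO(raw), delimiter=";", dtype={"cod_lesividad": str})
fatal_text = as_text[as_text["cod_lesividad"] == "04"]

print(fatal_numeric["num_expediente"].tolist())
print(fatal_text["num_expediente"].tolist())
```

Both filters keep the same two records in this toy case. Rows with a blank code are read as missing values and are dropped by either comparison.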

To keep the link between our dataset and the original one transparent, we keep a reference to each record's row index in the original dataset. This is a personal choice, not a requirement.

In [None]:
fallecidos_df= fallecidos_df.reset_index().rename(columns={'index': 'original_row_index'})
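# Write with the same semicolon separator as the original file, as stated in the datasheet
fallecidos_df.to_csv("2024_Mortalidad.csv", sep=";", index=False)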
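As a rough illustration of what the `original_row_index` column enables, the sketch below repeats the filter and the `reset_index`/`rename` step on a toy DataFrame (all values are made up) and then uses the stored index to look the filtered records up in the source table.

```python
import pandas as pd

# Toy stand-in for the original dataset (hypothetical values)
original = pd.DataFrame(
    {
        "num_expediente": ["2024S900001", "2024S900002", "2024S900003",
                           "2024S900004", "2024S900005"],
        "cod_lesividad": [7, 4, 14, 4, 1],
    }
)

# Same steps as in the notebook: filter, then turn the old index into a column
fatal = original[original["cod_lesividad"] == 4]
fatal = fatal.reset_index().rename(columns={"index": "original_row_index"})

# Trace every filtered record back to its row in the original table
source_rows = original.loc[fatal["original_row_index"]]

print(fatal)
print(source_rows["num_expediente"].tolist() == fatal["num_expediente"].tolist())
```

Note that this index is positional, so it only points at the right rows for the exact version of the original file that was filtered. If the source is republished with rows added or reordered, `num_expediente` is likely a more stable reference.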

The result of this filter looks as follows:

In [None]:
fallecidos_df.head(n=3)

## Documenting the dataset

The most important step before publishing the dataset is writing its documentation (datasheet). Most platforms accept documentation in Markdown, so it can be drafted before choosing a digital repository. For this task, we try to follow the FAIR principles and the recommendations of the following works:

- *Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, et al. “Datasheets for Datasets.” Commun. ACM 64, no. 12 (2021): 86–92. https://doi.org/10.1145/3458723.*
- *Shaina Raza, Shardul Ghuge, Chen Ding, Elham Dolatabadi, Deval Pandya; FAIR Enough: Develop and Assess a FAIR-Compliant Dataset for Large Language Model Training?. Data Intelligence 2024; 6 (2): 559–585. doi: https://doi.org/10.1162/dint_a_00255*
- *Wu, Y., Ajmani, L., Longpre, S., & Li, H. (2025). A systematic review of NeurIPS dataset management practices. Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada. Red Hook, NY, USA: Curran Associates Inc. https://dl.acm.org/doi/10.5555/3737916.3738948*



---

In [None]:
"""
# Fatal traffic accidents in the city of Madrid during 2024

## Dataset Summary

This dataset contains fatal traffic accidents in the city of Madrid during 2024. It is derived from the [original dataset](https://datos.madrid.es/portal/site/egob/menuitem.c05c1f754a33a9fbe4b2e4b284f1a5a0/?vgnextoid=7c2843010d9c3610VgnVCM2000001f4a900aRCRD&vgnextchannel=374512b9ace9f310VgnVCM100000171f5a0aRCRD&vgnextfmt=default) published by the City Council of Madrid, keeping only the records in which the person involved died within 24 hours of the accident (severity code `04`).

The motivation behind this dataset is to facilitate the analysis of black spots (high-risk accident areas) in Madrid and to contribute to reducing road fatalities.

This resource is also part of a tutorial designed to learn how to publish an open dataset, following best practices for documentation, metadata, and reproducibility. It was created by Ibai Guillén-Pacho and supported by the Predoctoral Grant (PIPF-2022/COM-25947) of the Consejería de Educación, Ciencia y Universidades de la Comunidad de Madrid, Spain.

## How to Use

The tutorial materials explaining the dataset publication workflow are available at the following [repository](https://github.com/iguillenp/dataset-publication-tutorial). 

The dataset is distributed as a CSV file with semicolon (`;`) as the field separator. It can be processed with any software that supports CSV format, such as spreadsheet applications (Excel, LibreOffice, Google Sheets) or programming environments (Python, R, etc.).

``` python
# Python Example:

import pandas as pd

# Load the dataset (semicolon-delimited CSV)
df = pd.read_csv("2024_Accidentalidad.csv", delimiter=";")

# Inspect the first records
print(df.head())
```

**PENDING**

## Dataset Details
This section provides a detailed description of the dataset, including a preview, its structure, the collection process, the transformations applied, and the maintenance policy. 

### Data Preview
|   original_row_index | num_expediente   | fecha      | hora     | localizacion                                  | numero   |   cod_distrito | distrito           | tipo_accidente        | estado_meteorológico   | tipo_vehiculo           | tipo_persona   | rango_edad      | sexo   |   cod_lesividad | lesividad          |   coordenada_x_utm |   coordenada_y_utm | positiva_alcohol   |   positiva_droga |
|---------------------:|:-----------------|:-----------|:---------|:----------------------------------------------|:---------|---------------:|:-------------------|:----------------------|:-----------------------|:------------------------|:---------------|:----------------|:-------|----------------:|:-------------------|-------------------:|-------------------:|:-------------------|-----------------:|
|                  487 | 2024S000313      | 05/01/2024 | 12:20:00 | AVDA. SAN DIEGO / CALL. PUERTO DE LA BONAIGUA | 122      |             13 | PUENTE DE VALLECAS | Atropello a persona   | Despejado              | Furgoneta               | Peatón         | Más de 74 años  | Hombre |               4 | Fallecido 24 horas |             443449 |        4.47073e+06 | N                  |              nan |
|                 3329 | 2024S002968      | 27/01/2024 | 2:01:00  | AUTOV. M-30, TÚNEL BYPASS 11XC98              | 11XC98   |              2 | ARGANZUELA         | Solo salida de la vía | Despejado              | Motocicleta hasta 125cc | Conductor      | De 25 a 29 años | Hombre |               4 | Fallecido 24 horas |             442010 |        4.47118e+06 | N                  |              nan |
|                 4100 | 2024S003750      | 02/02/2024 | 13:00:00 | AVDA. REYES CATOLICOS, 1                      | 1        |              9 | MONCLOA-ARAVACA    | Atropello a persona   | Despejado              | Turismo                 | Peatón         | Más de 74 años  | Hombre |               4 | Fallecido 24 horas |             438905 |        4.47655e+06 | N                  |              nan |

### Data Structure
The dataset structure is specified in the [original dataset documentation](https://datos.madrid.es/FWProjects/egob/Catalogo/Seguridad/Ficheros/Estructura_ConjuntoDatos_Accidentesv2.pdf). However, we include here a translated version of the original documentation to facilitate a better understanding of the dataset content. In addition, we have added the field `original_row_index`, which preserves the row index from the original dataset, allowing users to trace each filtered record back to its source.

**IMPORTANT:** The file includes one record per person involved in the accident (drivers, passengers, pedestrians, witnesses, etc.).


| Nº | Column Name            | Description                                                                 | Expected Values |
|----|------------------------|-----------------------------------------------------------------------------|-----------------|
| 0  | original_row_index     | Index of the row in the original dataset, useful to trace records back      | Number          |
| 1  | num_expediente         | AAAASNNNNNN, where: <br>• AAAA = accident year <br>• S = accident record    | Text            |
| 2  | fecha                  | Date in format dd/mm/yyyy                                                   | Date            |
| 3  | hora                   | Time expressed in 1-hour ranges                                             | Time            |
| 4  | localizacion           | Street 1 – Street 2 (intersection) or a single street                       | Text            |
| 5  | numero                 | Street number, when applicable                                              | Number          |
| 6  | cod_distrito           | District code                                                               | Number          |
| 7  | distrito               | District name                                                               | Text            |
| 8  | tipo_accidente*        | Type of accident (see typology below)                                       | Text            |
| 9  | estado_meteorológico   | Weather conditions description                                              | Text            |
| 10 | tipo_vehiculo          | Type of vehicle involved                                                    | Text            |
| 11 | tipo_persona           | Type of person involved                                                     | Text            |
| 12 | rango_edad             | Age range of affected person                                                | Text            |
| 13 | sexo                   | Gender: male, female, or not assigned                                       | Text            |
| 14 | cod_lesividad*         | Severity code (see typology below)                                          | Number          |
| 15 | lesividad              | Description of severity                                                     | Text            |
| 16 | coordenada_x_utm       | X coordinate (UTM projection)                                               | Number          |
| 17 | coordenada_y_utm       | Y coordinate (UTM projection)                                               | Number          |
| 18 | positiva_alcohol       | Alcohol test result: N (No) or S (Yes)                                      | N / S           |
| 19 | positiva_droga         | Drug test result: NULL or 1                                                 | NULL / 1        |


**TYPES OF ACCIDENTS**
- Double collision: Traffic accident involving two moving vehicles (head-on, head-side, or side collision).
- Multiple collision: Traffic accident involving more than two moving vehicles.
- Rear-end collision: Accident occurring when a moving or stopped vehicle (due to traffic conditions) is struck from behind by another vehicle.
- Collision with obstacle or road element: Accident occurring between a moving vehicle with a driver and a stationary object occupying the roadway or its surroundings (e.g., parked vehicle, tree, lamppost).
- Pedestrian run-over: Accident involving a vehicle and a pedestrian occupying the roadway or walking on sidewalks, crossings, or other areas of the public road not intended for vehicle traffic.
- Rollover: Accident involving a vehicle with more than two wheels that, for some reason, loses tire contact with the road and ends up on its side or roof.
- Fall: Includes all falls related to traffic circumstances (motorcycle, moped, bicycle, bus passenger, etc.).
- Other causes: Includes accidents such as animal run-over, cliff fall, running off the road, and others.

**SEVERITY**
- 01: Emergency care without subsequent hospitalization – Minor
- 02: Hospitalization ≤ 24 hours – Minor
- 03: Hospitalization > 24 hours – Serious
- 04: Deceased within 24 hours – Fatal
- 05: Outpatient medical care afterwards – Minor
- 06: Immediate medical care in a health center or mutual insurance – Minor
- 07: Medical care only at the accident site – Minor
- 14: No medical assistance
- 77: Unknown
- Blank: No medical assistance

### Data Collection
The dataset is derived from the official [2024 traffic accident records](https://datos.madrid.es/portal/site/egob/menuitem.c05c1f754a33a9fbe4b2e4b284f1a5a0/?vgnextoid=7c2843010d9c3610VgnVCM2000001f4a900aRCRD&vgnextchannel=374512b9ace9f310VgnVCM100000171f5a0aRCRD&vgnextfmt=default) published by the City Council of Madrid.

### Data Processing
From the original dataset, we extracted only the records corresponding to fatal accidents, defined as those in which the column `cod_lesividad` has the code `04`. The transformed dataset is stored in CSV format and keeps the columns of the original file, plus the added `original_row_index` column, but contains only the subset of records that meet this criterion. The scripts used for this processing are available in the [repository](https://github.com/iguillenp/dataset-publication-tutorial).

### Data Maintenance
The original dataset is maintained and periodically updated by the City Council of Madrid. In contrast, the transformed dataset presented here is designed for tutorial purposes, maintained by the associated teaching team. No further updates are planned beyond potential revisions for new versions of the tutorial. Users interested in updated traffic accident data should always refer to the [City of Madrid Open Data Portal](https://datos.madrid.es).

## License
The dataset is distributed under the license terms of the [City of Madrid Open Data Portal](https://datos.madrid.es/egob/catalogo/aviso-legal).

The tutorial materials (including documentation, code, and instructions provided in this repository) are distributed under the Creative Commons Attribution-NonCommercial 4.0 International ([CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode.en)) license.

## Citation
Original dataset: *Ayuntamiento de Madrid. Accidentes de tráfico en la ciudad de Madrid (2024). Portal de Datos Abiertos. https://datos.madrid.es/portal/site/egob*

Transformed dataset: *Guillén-Pacho I. (2025). Fatal Traffic Accidents in Madrid 2024.**PENDING***

## Acknowledgements
This work is supported by the Predoctoral Grant (PIPF-2022/COM-25947) of the Consejería de Educación, Ciencia y Universidades de la Comunidad de Madrid, Spain.
"""


---
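Before publishing, it can help to check the datasheet draft mechanically. The sketch below is only one possible approach: it assumes the datasheet is available as a string, and the helper names `missing_sections` and `pending_lines` are hypothetical. It reports which of the level-2 sections used in the datasheet above are absent and which lines still carry a `PENDING` placeholder.

```python
import re

# Level-2 sections used by the datasheet in this tutorial
REQUIRED_SECTIONS = [
    "Dataset Summary",
    "How to Use",
    "Dataset Details",
    "License",
    "Citation",
    "Acknowledgements",
]


def missing_sections(datasheet: str, required=REQUIRED_SECTIONS) -> list:
    """Return the required '## ' headings that do not appear in the datasheet."""
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^## (.+)$", datasheet, flags=re.MULTILINE)
    }
    return [section for section in required if section not in found]


def pending_lines(datasheet: str, marker: str = "PENDING") -> list:
    """Return the 1-based line numbers that still contain the placeholder marker."""
    return [
        number
        for number, line in enumerate(datasheet.splitlines(), start=1)
        if marker in line
    ]


# Small hypothetical draft used only to exercise the helpers
draft = """# Title
## Dataset Summary
Some text.
## How to Use
**PENDING**
## License
"""

print(missing_sections(draft))
print(pending_lines(draft))
```

For this draft the helpers report three missing sections and one unresolved placeholder on line 5.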

## Publishing the dataset

### Option 1: Zenodo

For this option we upload the dataset in a widely recognized format such as CSV or XLSX. We also fill in as much metadata as possible and adapt the format of the datasheet so that it renders correctly on the platform. The final result is published at: https://doi.org/10.5281/zenodo.17054802
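The metadata can also be drafted as structured data before it is entered on the platform. The sketch below is an assumption-laden illustration rather than the procedure used for the published record: the field names (`upload_type`, `creators`, `related_identifiers`, and so on) follow Zenodo's deposition metadata as we understand it, the keywords are illustrative only, and nothing is sent to Zenodo.

```python
import json

# Portal of the source dataset, as cited in the datasheet
ORIGINAL_DATASET_URL = "https://datos.madrid.es"

# Hypothetical deposition metadata; values come from the datasheet
# except the keywords, which are purely illustrative.
zenodo_metadata = {
    "metadata": {
        "title": "Fatal traffic accidents in the city of Madrid during 2024",
        "upload_type": "dataset",
        "description": (
            "Fatal traffic accidents in the city of Madrid during 2024, "
            "derived from the records published by the City Council of Madrid."
        ),
        "creators": [{"name": "Guillén-Pacho, Ibai"}],
        "keywords": ["traffic accidents", "Madrid", "open data"],
        "related_identifiers": [
            {"identifier": ORIGINAL_DATASET_URL, "relation": "isDerivedFrom"}
        ],
    }
}

# Fields treated as mandatory in this sketch (an assumption, not an official list)
REQUIRED_FIELDS = ("title", "upload_type", "description", "creators")
missing = [
    field for field in REQUIRED_FIELDS if not zenodo_metadata["metadata"].get(field)
]
assert not missing, f"Missing metadata fields: {missing}"

# Serialize without escaping accented characters
payload = json.dumps(zenodo_metadata, ensure_ascii=False, indent=2)
print(payload)
```

Keeping the metadata in one place like this makes it easier to reuse the same title, description, and provenance link across platforms.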

---

### Option 2: Kaggle

For this option we upload the dataset in a widely recognized format such as CSV or XLSX. We also fill in as much metadata as possible and adapt the format of the datasheet so that it renders correctly on the platform. The final result is published at: https://www.kaggle.com/datasets/ibaiguillenpacho/fatal-traffic-accidents-in-the-city-of-madrid-2024/
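Whichever platform receives the CSV, it may be worth confirming that the exported file reads back unchanged before uploading it. The following is a minimal sketch on a toy DataFrame with hypothetical values; it writes to an in-memory buffer instead of a file and uses pandas' default comma separator.

```python
import io

import pandas as pd

# Toy rows (hypothetical values) with text, integer, and missing data
toy = pd.DataFrame(
    {
        "original_row_index": [10, 25],
        "distrito": ["ARGANZUELA", "MONCLOA-ARAVACA"],
        "positiva_droga": [float("nan"), 1.0],
    }
)

# Export to an in-memory CSV and read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Raises an AssertionError if values, column names, or dtypes differ
pd.testing.assert_frame_equal(toy, restored)
print("Round trip OK:", restored.shape)
```

A check like this tends to surface silent changes, such as a forgotten `index=False` adding an extra column, before they reach the published file.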

---

### Option 3: Hugging Face

For this option we upload the dataset programmatically, which requires converting it to the `Dataset` format of the `datasets` library. To showcase Hugging Face's support for multiple configurations and splits, we artificially upload three configurations with three splits each. The configurations are the full year, the first half of the year, and the second half; the splits are all records (`all`), run-over accidents (`atropellos`), and everything else (`otros`).

In [None]:
hf_repo= "dcdc-upm/fatal_traffic_accidents_in_the_city_of_madrid_2024"


fallecidos_df["fecha"]= pd.to_datetime(fallecidos_df["fecha"], format="%d/%m/%Y")


# Full year

data= Dataset.from_pandas(fallecidos_df, preserve_index=False)
atropellados= Dataset.from_pandas(fallecidos_df[fallecidos_df["tipo_accidente"].str.startswith("Atropello")], preserve_index=False)
otros= Dataset.from_pandas(fallecidos_df[~fallecidos_df["tipo_accidente"].str.startswith("Atropello")], preserve_index=False)

data.push_to_hub(hf_repo, config_name="Full Year", split="all")
atropellados.push_to_hub(hf_repo, config_name="Full Year", split="atropellos")
otros.push_to_hub(hf_repo, config_name="Full Year", split="otros")


# Filter the first half of the year (month <= 6)
fallecidos_df_first_half = fallecidos_df[fallecidos_df["fecha"].dt.month <= 6]

data_first_half= Dataset.from_pandas(fallecidos_df_first_half, preserve_index=False)
atropellados_first_half= Dataset.from_pandas(fallecidos_df_first_half[fallecidos_df_first_half["tipo_accidente"].str.startswith("Atropello")], preserve_index=False)
otros_first_half= Dataset.from_pandas(fallecidos_df_first_half[~fallecidos_df_first_half["tipo_accidente"].str.startswith("Atropello")], preserve_index=False)

data_first_half.push_to_hub(hf_repo, config_name="First Semester", split="all")
atropellados_first_half.push_to_hub(hf_repo, config_name="First Semester", split="atropellos")
otros_first_half.push_to_hub(hf_repo, config_name="First Semester", split="otros")

# Filter the second half of the year (month > 6)
fallecidos_df_second_half = fallecidos_df[fallecidos_df["fecha"].dt.month > 6]

data_second_half= Dataset.from_pandas(fallecidos_df_second_half, preserve_index=False)
atropellados_second_half= Dataset.from_pandas(fallecidos_df_second_half[fallecidos_df_second_half["tipo_accidente"].str.startswith("Atropello")], preserve_index=False)
otros_second_half= Dataset.from_pandas(fallecidos_df_second_half[~fallecidos_df_second_half["tipo_accidente"].str.startswith("Atropello")], preserve_index=False)

data_second_half.push_to_hub(hf_repo, config_name="Second Semester", split="all")
atropellados_second_half.push_to_hub(hf_repo, config_name="Second Semester", split="atropellos")
otros_second_half.push_to_hub(hf_repo, config_name="Second Semester", split="otros")
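The partitioning logic used above can be checked locally before anything is pushed. The sketch below reproduces it with pandas only, on toy rows with hypothetical dates; the helper `build_splits` is not part of the notebook, and `na=False` is an added precaution so that a missing `tipo_accidente` falls into `otros` instead of breaking the boolean mask.

```python
import pandas as pd

# Toy records (hypothetical) with the two columns that drive the partition
toy = pd.DataFrame(
    {
        "fecha": ["05/01/2024", "30/06/2024", "01/07/2024", "15/12/2024"],
        "tipo_accidente": [
            "Atropello a persona",
            "Solo salida de la vía",
            "Atropello a persona",
            "Solo salida de la vía",
        ],
    }
)
toy["fecha"] = pd.to_datetime(toy["fecha"], format="%d/%m/%Y")


def build_splits(df: pd.DataFrame) -> dict:
    """Return the three splits of one configuration as DataFrames."""
    is_run_over = df["tipo_accidente"].str.startswith("Atropello", na=False)
    return {"all": df, "atropellos": df[is_run_over], "otros": df[~is_run_over]}


first_half = toy["fecha"].dt.month <= 6
configs = {
    "Full Year": build_splits(toy),
    "First Semester": build_splits(toy[first_half]),
    "Second Semester": build_splits(toy[~first_half]),
}

for name, splits in configs.items():
    # "atropellos" and "otros" together must cover the "all" split exactly
    assert len(splits["atropellos"]) + len(splits["otros"]) == len(splits["all"])
    print(name, {split: len(rows) for split, rows in splits.items()})
```

Once the data is on the Hub, a single configuration and split can presumably be read back with `load_dataset(hf_repo, "First Semester", split="atropellos")` from the `datasets` library.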



---