# Format standardization

The following code shows an example of standardizing data to make it easier to use, based on the same data as the example in file *1_Access_binary_format_files*.

The page from which the data were downloaded also offers a document describing how to interpret them. It is very convenient to translate this contextual information into explicit information in the data, as will be seen with the new “Severity” field (`gravedad`) and with the restriction of the possible values for those fields that have a defined set of values.

In [1]:
import pandas as pd
import numpy as np

1. We read the previously saved data, having discarded certain columns in order to match the file downloaded in 2020, as it had undergone some modifications.

In [2]:
accidentes = pd.read_parquet('./accidentes.parquet')

In [3]:
accidentes.shape

(51811, 13)

2. We generate a new field that encodes the severity levels derived from “lesividad” (the injury code), as specified in the data description document.

In [4]:
accidentes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51811 entries, 0 to 51810
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   num_expediente        51811 non-null  object        
 1   fecha                 51811 non-null  datetime64[ns]
 2   hora                  51811 non-null  object        
 3   localizacion          51811 non-null  object        
 4   numero                51811 non-null  object        
 5   distrito              51811 non-null  object        
 6   tipo_accidente        51811 non-null  object        
 7   estado_meteorológico  46681 non-null  object        
 8   tipo_vehiculo         51635 non-null  object        
 9   tipo_persona          51811 non-null  object        
 10  rango_edad            46517 non-null  object        
 11  sexo                  46768 non-null  object        
 12  cod_lesividad         30042 non-null  float64       
dtypes: datetime64[ns

In [5]:
accidentes.cod_lesividad.isnull().sum()

21769

In [6]:
c_gravedad = pd.api.types.CategoricalDtype(categories = ['Ileso', 'Leve', 'Grave','Fallecido'], ordered = True)

In [7]:
dict_gravedad = {14.0: 'Ileso', 3.0: 'Grave', 4.0: 'Fallecido'}

In [8]:
# Codes absent from the dictionary map to 'Leve'; missing codes (NaN) are treated as 'Ileso'
accidentes['gravedad'] = accidentes['cod_lesividad'].apply(
    lambda x: dict_gravedad.get(x, 'Leve') if pd.notna(x) else 'Ileso').astype(c_gravedad)
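
The same mapping can also be written without a row-by-row `apply`. The following is a minimal sketch on a small hypothetical sample of injury codes (the `codigos` series is invented for illustration and is not part of the real data set); it reuses the `c_gravedad` and `dict_gravedad` definitions from the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of injury codes, not taken from the real data set
codigos = pd.Series([14.0, 7.0, 3.0, np.nan, 4.0, 1.0])

c_gravedad = pd.api.types.CategoricalDtype(
    categories=['Ileso', 'Leve', 'Grave', 'Fallecido'], ordered=True)
dict_gravedad = {14.0: 'Ileso', 3.0: 'Grave', 4.0: 'Fallecido'}

# Codes listed in the dictionary are translated; everything else becomes NaN
gravedad = codigos.map(dict_gravedad)
# Codes that are present but not listed are 'Leve'; missing codes stay NaN for now
gravedad = gravedad.where(gravedad.notna() | codigos.isna(), 'Leve')
# Missing codes are finally treated as 'Ileso'
gravedad = gravedad.fillna('Ileso').astype(c_gravedad)

print(gravedad.tolist())
```

Keeping the two default rules on separate lines makes it easy to change the treatment of missing codes later, for example if they should remain undefined instead of being counted as “Ileso”.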

3. We decode the time and add it to the date to have a date-time type field, easier to use.

In [9]:
# Validate the column 'hora' and normalize it to 'HH:MM:SS' strings
accidentes['hora'] = pd.to_datetime(accidentes['hora'], format = '%H:%M:%S').dt.strftime('%H:%M:%S')

In [10]:
# Combine the column 'fecha' and 'hora'.
accidentes['fecha'] += pd.to_timedelta(accidentes['hora'])
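
As an illustration of why this works, a time of day can be read as an offset from midnight, so adding it as a timedelta to a date yields a full timestamp. The sketch below uses a tiny hypothetical frame (the values and the `fecha_hora` column name are assumptions chosen for the example).

```python
import pandas as pd

# Hypothetical two-row sample with the same column names as the notebook
df = pd.DataFrame({
    'fecha': pd.to_datetime(['2019-01-01', '2019-04-02']),
    'hora': ['03:45:00', '09:10:00'],
})

# Parsing with an explicit format rejects malformed times;
# strftime then returns uniform 'HH:MM:SS' strings
df['hora'] = pd.to_datetime(df['hora'], format='%H:%M:%S').dt.strftime('%H:%M:%S')

# 'HH:MM:SS' strings convert to timedeltas, which can be added to the dates
df['fecha_hora'] = df['fecha'] + pd.to_timedelta(df['hora'])

print(df['fecha_hora'])
```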

In [11]:
# We eliminate columns that are no longer needed
accidentes = accidentes.drop(columns = ['hora', 'localizacion', 'numero'])

4. We convert all columns starting from the third one into categorical types.

In [12]:
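# Assigning through .iloc can keep the original dtypes, so the columns are replaced by name
cols = accidentes.columns[2:]
accidentes[cols] = accidentes[cols].astype('category')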
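
Converting with `astype('category')` infers the categories from whatever values appear in the data. When the description document defines a closed list of values, an explicit `CategoricalDtype` can enforce it. The sketch below assumes a hypothetical list of valid values for `sexo`; the real list should come from the data description document.

```python
import pandas as pd

# Hypothetical sample; 'Desconocido' stands in for a value outside the documented list
sexo = pd.Series(['Hombre', 'Mujer', 'Desconocido', 'Hombre'])

# Inferred categories: every distinct value found in the data is accepted
inferida = sexo.astype('category')

# Explicit categories (assumed list): values outside the list become NaN
c_sexo = pd.api.types.CategoricalDtype(categories=['Hombre', 'Mujer'])
restringida = sexo.astype(c_sexo)

print(list(inferida.cat.categories))
print(restringida.isna().sum())
```

Counting the resulting NaN values is a quick way to detect entries that do not conform to the documented values.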

In [13]:
accidentes[:5]

|   | num_expediente | fecha | distrito | tipo_accidente | estado_meteorológico | tipo_vehiculo | tipo_persona | rango_edad | sexo | cod_lesividad | gravedad |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018S017842 | 2019-04-02 09:10:00 | CENTRO | Colisión lateral | Despejado | Motocicleta > 125cc | Conductor | De 45 a 49 años | Hombre | 7.0 | Leve |
| 1 | 2018S017842 | 2019-04-02 09:10:00 | CENTRO | Colisión lateral | Despejado | Turismo | Conductor | De 30 a 34 años | Mujer | 7.0 | Leve |
| 2 | 2019S000001 | 2019-01-01 03:45:00 | CARABANCHEL | Alcance | NaN | Furgoneta | Conductor | De 40 a 44 años | Hombre | NaN | Ileso |
| 3 | 2019S000001 | 2019-01-01 03:45:00 | CARABANCHEL | Alcance | NaN | Turismo | Conductor | De 40 a 44 años | Mujer | NaN | Ileso |
| 4 | 2019S000001 | 2019-01-01 03:45:00 | CARABANCHEL | Alcance | NaN | Turismo | Conductor | De 45 a 49 años | Mujer | NaN | Ileso |


In [14]:
accidentes.to_parquet('./accidentes1.parquet')

In [15]:
accidentes['gravedad'].value_counts()

Ileso        38378
Leve         12860
Grave          539
Fallecido       34
Name: gravedad, dtype: int64

The lines of code that build the `gravedad` field (cells 6 to 8) require a little explanation.

- The data interpretation document defines the different possible values of the “LESIVIDAD” field, and adds that value 3 is considered “Grave” (serious), value 14 is considered “Ileso” (uninjured), value 4 corresponds to the death of the injured person (“Fallecido”), and all other values are considered “Leve” (minor).

- In another part of the document, it is specified that if the injury value is not present, it is because there was no victim in the accident requiring medical attention (which we are going to consider here as “Ileso”, although it could be that in some cases it was “Leve”).

- All of those rules in the document translate into three lines of code. Once the new “GRAVEDAD” field is created (which, of course, adds no information that was not already present in the original data), the document does not need to be consulted again for the rest of the analyses made with that data.

- In addition, the order of the injury codes is arbitrary, while the severity levels are ordered. This makes it possible to compare them and to answer questions such as “in what percentage of accidents between a car and a motorcycle is the severity for the motorcycle driver greater than that for any of the occupants of the car?”.
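
The ordering described in the last point can be sketched as follows. This is a simplified illustration on a hypothetical five-row sample: the vehicle labels are shortened, the persons are not distinguished by role, and the helper column `nivel` is an assumption introduced for the example.

```python
import pandas as pd

c_gravedad = pd.api.types.CategoricalDtype(
    categories=['Ileso', 'Leve', 'Grave', 'Fallecido'], ordered=True)

# Hypothetical sample: two accidents, each involving a motorcycle and a car
df = pd.DataFrame({
    'num_expediente': ['A', 'A', 'B', 'B', 'B'],
    'tipo_vehiculo': ['Motocicleta', 'Turismo', 'Motocicleta', 'Turismo', 'Turismo'],
    'gravedad': pd.Series(['Grave', 'Leve', 'Ileso', 'Leve', 'Ileso']).astype(c_gravedad),
})

# Ordered categories can be compared directly against a label
graves = df['gravedad'] >= 'Grave'

# Integer codes follow the category order (0 = 'Ileso' ... 3 = 'Fallecido')
df['nivel'] = df['gravedad'].cat.codes

# Worst severity level per accident and vehicle type
peor = df.pivot_table(index='num_expediente', columns='tipo_vehiculo',
                      values='nivel', aggfunc='max')

# Share of accidents where the motorcycle side fared worse than the car side
moto_peor = peor['Motocicleta'] > peor['Turismo']
porcentaje = 100 * moto_peor.mean()

print(porcentaje)
```

None of these comparisons would be meaningful on the raw `cod_lesividad` values, because their numeric order carries no meaning. Note that `cat.codes` returns -1 for missing values, so they should be handled before comparing.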

This is an example of a highly recommended practice in data analysis: translating implicit codings into explicit ones, and translating confusing codings into codings that are easy to understand and use.