# Feature generation

This section applies some of the techniques described in the previous sections to an example based on real data.

We start from traffic and accident data for the city of Madrid in 2019. Importing these data is described in the files *1_Access_binary_format_files* and *2_Format_standardization*.

For the traffic data, we use a feature already included in the original data that gives a good estimate of the level of traffic congestion and is described on the website the data comes from: the “degree of occupancy” of the street in question. It is a non-linear indicator generated from the number of vehicles crossing that street, the capacity of the street, the speed of passing vehicles and the speed limit. When few cars pass at high speed, congestion is very low; when many cars pass but still at high speed, congestion is low; when many cars pass and the speed decreases, there is some congestion; and when congestion is high, the speed is very low and the volume of cars decreases.

Unfortunately the documentation does not explain the details of the formula for calculating that indicator.

## First of all.

We start with a simple analysis of accident risk by vehicle type.

To do this, we group the accident data by vehicle type and severity, and generate a synthetic indicator that shows how much more or less likely each outcome of an accident (from uninjured to fatal) is for a given vehicle type than for accidents overall.

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

**1. We read the previously saved clean crash data.**

In [2]:
accidentes = pd.read_parquet('./accidentes1.parquet')

**2. We count the accidents by type of vehicle and by severity, using unique case numbers (`num_expediente`) and drivers only, not passengers.**

In [3]:
grav = accidentes[accidentes['tipo_persona'] == 'Conductor']\
    .groupby(['tipo_vehiculo', 'gravedad'])['num_expediente'].nunique()

**3. We calculate the percentage of serious, minor, etc. accidents for the total number of accidents and for each type of vehicle.**

In [4]:
grav_rel = grav.groupby('gravedad').sum()/grav.sum()
grav_rel_veh = grav/grav.groupby('tipo_vehiculo').sum()

**4. We generate an indicator that compares the severity of accidents by type of vehicle with the overall severity.**

In [5]:
grav_ratio = (grav_rel_veh/grav_rel) - 1
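To make the indicator easier to read, here is a minimal sketch of steps 2 to 4 on a tiny made-up dataset. The eight rows and the `*_toy` names are hypothetical and chosen only so that the arithmetic can be checked by hand; the column names mirror those of the real data. A value of 0.5 means that the outcome is 50% more frequent for that vehicle type than overall, and -0.5 means it is 50% less frequent.

```python
import pandas as pd

# Hypothetical toy data: one row per driver involved in an accident.
toy = pd.DataFrame({
    'num_expediente': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8'],
    'tipo_vehiculo':  ['Turismo'] * 4 + ['Motocicleta'] * 4,
    'gravedad':       ['Ileso', 'Ileso', 'Ileso', 'Leve',
                       'Ileso', 'Leve', 'Leve', 'Leve'],
})

# Accidents per (vehicle type, severity), counted by unique case number.
grav_toy = toy.groupby(['tipo_vehiculo', 'gravedad'])['num_expediente'].nunique()

# Share of each severity overall, and within each vehicle type.
rel_toy = grav_toy.groupby('gravedad').sum() / grav_toy.sum()
rel_veh_toy = grav_toy / grav_toy.groupby('tipo_vehiculo').sum()

# Relative deviation of each vehicle type from the overall share.
ratio_toy = (rel_veh_toy / rel_toy) - 1
print(ratio_toy.unstack('gravedad'))
```

In this toy data half of all accidents are `Leve`, but three out of four motorcycle accidents are, so the indicator for (`Motocicleta`, `Leve`) is 0.75 / 0.5 - 1 = 0.5.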

**5. We select the nine most common vehicle types and display the data in a table.**

In [6]:
top_accidentes = accidentes['tipo_vehiculo'].value_counts().head(9).index.values

In [7]:
grav_ratio[top_accidentes].unstack(1).loc[top_accidentes].style.format('{:.0%}')

| tipo_vehiculo | Ileso | Leve | Grave | Fallecido |
|---|---|---|---|---|
| Turismo | 16% | -44% | -82% | -91% |
| Motocicleta > 125cc | -62% | 165% | 372% | 478% |
| Furgoneta | 24% | -66% | -96% | -28% |
| Motocicleta hasta 125cc | -70% | 191% | 253% | 299% |
| Autobús | 30% | -83% | -100% | -100% |
| Camión rígido | 30% | -84% | -89% | -100% |
| Bicicleta | -68% | 188% | 222% | 128% |
| Ciclomotor | -68% | 188% | 193% | -100% |
| Todo terreno | 26% | -73% | -79% | -100% |


As might be expected, the data show that, once involved in an accident, motorcycle and bicycle riders are at much higher risk of serious injury than truck or SUV drivers.

## Next...

The traffic patterns in the city are analyzed.

To do this, two features are created from the date and time data: the day of the week and the time of day.

It is important to note that these features are not simply finer subdivisions of the timeline; each one regroups the data in a different way. The day of the week groups together the data for all Mondays, all Tuesdays, and so on, while the time of day groups together the data from eight to nine for all days, from nine to ten for all days, and so on.

Then we group by these two new variables, average the congestion level and display it in a graph.
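As a minimal sketch of this regrouping, the following toy example derives both features from a timestamp column and averages over them. The four readings are hypothetical; only the column names `fecha` and `ocupacion` follow the real data. Note that pandas numbers the days of the week from 0 (Monday) to 6 (Sunday).

```python
import pandas as pd

# Hypothetical readings: two Mondays and one Tuesday at 8h, plus Tuesday at 17h.
toy = pd.DataFrame({
    'fecha': pd.to_datetime(['2019-01-07 08:15', '2019-01-14 08:45',
                             '2019-01-08 08:30', '2019-01-08 17:00']),
    'ocupacion': [10.0, 20.0, 30.0, 50.0],
})

# Derived features: hour of day (0-23) and day of week (0 = Monday ... 6 = Sunday).
toy['hora'] = toy['fecha'].dt.hour
toy['dia_semana'] = toy['fecha'].dt.dayofweek

# The two Monday readings, a week apart, end up in the same group.
media = toy.groupby(['hora', 'dia_semana'])['ocupacion'].mean()
print(media)
```

The two Monday readings at eight o'clock are averaged together even though they belong to different weeks, which is exactly the regrouping described above.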

In [8]:
import altair as alt
trafico = pd.read_parquet('./trafico.parquet')

**1. We generate the average of the street occupancy indicator as a function of two synthetic variables: time of day and day of the week.**

In [9]:
data = trafico.groupby([trafico.fecha.dt.hour, trafico.fecha.dt.dayofweek])\
    ['ocupacion'].mean().rename_axis(['hora','dia_semana']).reset_index()

**2. We create a simple heatmap to show the results.**

In [10]:
alt.Chart(data).mark_rect().encode(
    x = 'hora:O',
    y = 'dia_semana:O',
    color = alt.Color('ocupacion:Q', legend = None)).properties(title = 'Densidad de tráfico en Madrid 2019')

The weekday morning peak between eight and nine o'clock is clearly visible, and on Fridays the afternoon peak shifts to three o'clock.

## Finally,...

In the last example, accident data are combined with traffic data.

Again, a specific indicator is created, which in this case indicates the relationship between the number of accidents and traffic intensity, to try to understand at what times traffic is more dangerous.

**1. We generate accident and traffic intensity indicators.**

In [11]:
data_a = accidentes.groupby([accidentes.fecha.dt.hour, accidentes.fecha.dt.dayofweek])\
    ['num_expediente'].count().rename_axis(['hora', 'dia_semana'])

In [12]:
urbano = trafico[trafico['tipo_elem'] == 'URB']  # keep only rows with tipo_elem == 'URB'
data_t = urbano.groupby([urbano.fecha.dt.hour, urbano.fecha.dt.dayofweek])\
    ['carga'].mean().rename_axis(['hora', 'dia_semana'])

**2. With the data aligned, we compute the ratio between the two indicators.**

In [13]:
data_at = (data_a/data_t).rename('Accidentalidad').reset_index()
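The division works because both series share the same (`hora`, `dia_semana`) index, so pandas matches the values by label rather than by position. The following sketch, with made-up numbers and hypothetical names (`acc`, `carga`, `ratio_filled`), shows what happens when a time slot is present in only one of the two series: the result is `NaN` for that slot. Filling the missing counts with zero, as in the last line, is only valid under the assumption that a missing slot means that no accident was recorded.

```python
import pandas as pd

names = ['hora', 'dia_semana']
idx_a = pd.MultiIndex.from_tuples([(8, 0), (9, 0)], names=names)
idx_t = pd.MultiIndex.from_tuples([(8, 0), (9, 0), (10, 0)], names=names)

acc = pd.Series([4, 6], index=idx_a)                 # hypothetical accident counts
carga = pd.Series([20.0, 60.0, 40.0], index=idx_t)   # hypothetical mean traffic load

# Values are matched by index label; the slot (10, 0) has no accidents entry.
ratio = (acc / carga).rename('Accidentalidad')
print(ratio)

# Assumption: a slot with no accident rows really had zero accidents.
ratio_filled = acc.reindex(carga.index, fill_value=0) / carga
```

Slots left as `NaN` will typically appear as empty cells in the heatmap, which is one possible source of gaps in the chart.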

**3. We create a simple heatmap to display the results.**

In [14]:
alt.Chart(data_at).mark_rect().encode(
    x = 'hora:O',
    y = 'dia_semana:O',
    color = alt.Color('Accidentalidad:Q',
                         scale = alt.Scale(scheme = 'lightorange'), legend = None)
    ).properties(title = 'Relación entre accidentalidad y carga de tráfico')

Although the generated indicator is very noisy and probably shows some “artifacts” (errors introduced in the data processing), it is enough to draw some interesting conclusions. The most dangerous times for traffic are the early morning hours on weekends, and there is a peak of danger around four o'clock in the afternoon on Saturdays and Sundays (probably related to people returning from copious meals with excess alcohol).