# UK Car Accidents 2015

## Context

UK police forces collect data on every vehicle collision in the UK on a form called Stats19. Data from this form ends up at the Department for Transport (DfT) and is published at https://data.gov.uk/dataset/road-accidents-safety-data

## Content
There are 3 files in this set. 
**Accidents** is the primary one and is linked through `Accident_Index` to the **Casualties** and **Vehicles** tables. A number of additional files explain the coded fields in the tables.

The tables contain the following information:

- **Accidents** table contains details on the accident 
- **Vehicles** table contains details on the vehicles involved in the accident
- **Casualties** table contains details on the casualties caused by the vehicles involved in the accident


All variables are stored as numeric codes rather than text strings. 
Some values may be missing or out of range; these are coded as `-1` ("Data missing or out of range").
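
Because `-1` stands for "data missing or out of range", it can help to convert it to a real missing value before computing statistics. The following is a minimal sketch on a hand-made sample (the values are illustrative, not taken from the dataset), and it assumes `-1` is never a legitimate value in the columns involved:

```python
import numpy as np
import pandas as pd

# Illustrative sample only; column names follow the Vehicles table.
sample = pd.DataFrame({
    "Age_of_Driver": [45, 25, -1, 50],
    "Engine_Capacity_(CC)": [1794, -1, -1, 4462],
})

# Replace the -1 sentinel with NaN so that mean(), count(), etc. ignore it.
cleaned = sample.replace(-1, np.nan)

missing_per_column = cleaned.isna().sum()
mean_age = cleaned["Age_of_Driver"].mean()
print(missing_per_column)
print(mean_age)
```

Without the replacement, the `-1` entries would be averaged in as if they were real ages.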

## File Format

The data files are in **comma-separated CSV** format.

## Questions

We will try to answer the following questions:

- The 10 accidents with the highest number of casualties
- The 10 accidents with the most vehicles involved
- Accidents by severity
- Accidents by driver age band
- Which day of the week has the most accidents?
- At what hour of the day do most accidents occur?
- Which drivers have more accidents, men or women?
- Which police district recorded the most accidents in 2015?
- Which casualty class is most frequent among men and among women?
- left hand side

# 0- Data Preprocessing

The data is preprocessed to clean and normalize it so that the remaining exercises can use it more easily.

The output is a set of CSV files containing the following datasets:

- **Accidentes:** the accidents, with the month and hour of each accident derived and the unused columns removed.
- **Vehiculos:** all vehicles involved in each accident, with the unused columns removed.
- **Victimas:** the casualties per vehicle and accident.

### Codes and Categories

We load the category and code lookup files, which give the descriptions of the coded fields used in the accident, vehicle and casualty files.

In [2]:
import numpy as np
import pandas as pd

In [3]:
df_categorias = pd.read_excel("Data/Categories.xlsx",header=None,names =['id_category','category_name','category_description'])

In [4]:
df_categorias = df_categorias [['id_category','category_name']]
df_categorias.head()

Unnamed: 0,id_category,category_name
0,1,Police Force
1,2,Accident Severity
2,3,Day of Week
3,4,Local Authority (District)
4,5,Local Authority (Highway Authority - ONS code)


In [5]:
df_codigos= pd.read_excel("Data/Codes.xlsx",header=None,names =['id_code','code','category','code_name'])
df_codigos.head()

Unnamed: 0,id_code,code,category,code_name
0,1,1,Police Force,Metropolitan Police
1,2,3,Police Force,Cumbria
2,3,4,Police Force,Lancashire
3,4,5,Police Force,Merseyside
4,5,6,Police Force,Greater Manchester


### Accidents Preprocessing

In [6]:
df_accidents = pd.read_csv("Data/Accidents_2015.csv", parse_dates=[9], dayfirst=True, header = 0)

In [7]:
df_accidents.head()

Unnamed: 0,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,...,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location
0,201501BS70001,525130.0,180050.0,-0.198465,51.505538,1,3,1,1,2015-01-12,...,0,0,4,1,1,0,0,1,1,E01002825
1,201501BS70002,526530.0,178560.0,-0.178838,51.491836,1,3,1,1,2015-01-12,...,0,0,1,1,1,0,0,1,1,E01002820
2,201501BS70004,524610.0,181080.0,-0.20559,51.51491,1,3,1,1,2015-01-12,...,0,1,4,2,2,0,0,1,1,E01002833
3,201501BS70005,524420.0,181080.0,-0.208327,51.514952,1,3,1,1,2015-01-13,...,0,0,1,1,2,0,0,1,2,E01002874
4,201501BS70008,524630.0,179040.0,-0.206022,51.496572,1,2,2,1,2015-01-09,...,0,5,1,2,2,0,0,1,2,E01002814


We check that the accidents file contains no duplicates: the number of distinct `Accident_Index` values must equal the number of rows.

In [8]:
df_accidentes= df_accidents[["Accident_Index"]].drop_duplicates()
len(df_accidentes)

140056

In [9]:
len(df_accidents)

140056
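
Comparing the two lengths works. As an alternative sketch, pandas can report uniqueness directly and list the offending rows. The sample below is illustrative, with one index deliberately repeated:

```python
import pandas as pd

# Illustrative sample; the last index is deliberately repeated.
sample = pd.DataFrame({
    "Accident_Index": ["201501BS70001", "201501BS70002", "201501BS70002"],
})

# True only when every Accident_Index value appears exactly once.
is_unique = sample["Accident_Index"].is_unique

# Rows involved in a duplicate (keep=False marks every occurrence).
duplicates = sample[sample["Accident_Index"].duplicated(keep=False)]

print(is_unique)
print(len(duplicates))
```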

We set the DataFrame index to serve as the ID of each accident

In [10]:
df_accidents.index= np.arange(1, len(df_accidents)+1)
##df_accidents.columns.names = ['id_accident']
df_accidents.head()

Unnamed: 0,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,...,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location
1,201501BS70001,525130.0,180050.0,-0.198465,51.505538,1,3,1,1,2015-01-12,...,0,0,4,1,1,0,0,1,1,E01002825
2,201501BS70002,526530.0,178560.0,-0.178838,51.491836,1,3,1,1,2015-01-12,...,0,0,1,1,1,0,0,1,1,E01002820
3,201501BS70004,524610.0,181080.0,-0.20559,51.51491,1,3,1,1,2015-01-12,...,0,1,4,2,2,0,0,1,1,E01002833
4,201501BS70005,524420.0,181080.0,-0.208327,51.514952,1,3,1,1,2015-01-13,...,0,0,1,1,2,0,0,1,2,E01002874
5,201501BS70008,524630.0,179040.0,-0.206022,51.496572,1,2,2,1,2015-01-09,...,0,5,1,2,2,0,0,1,2,E01002814


We extract the month of the year (as a number) in which each accident occurred.

In [11]:
df_accidents['Month'] = df_accidents['Date'].dt.month
df_accidents.head()

Unnamed: 0,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,...,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location,Month
1,201501BS70001,525130.0,180050.0,-0.198465,51.505538,1,3,1,1,2015-01-12,...,0,4,1,1,0,0,1,1,E01002825,1
2,201501BS70002,526530.0,178560.0,-0.178838,51.491836,1,3,1,1,2015-01-12,...,0,1,1,1,0,0,1,1,E01002820,1
3,201501BS70004,524610.0,181080.0,-0.20559,51.51491,1,3,1,1,2015-01-12,...,1,4,2,2,0,0,1,1,E01002833,1
4,201501BS70005,524420.0,181080.0,-0.208327,51.514952,1,3,1,1,2015-01-13,...,0,1,1,2,0,0,1,2,E01002874,1
5,201501BS70008,524630.0,179040.0,-0.206022,51.496572,1,2,2,1,2015-01-09,...,5,1,2,2,0,0,1,2,E01002814,1


We extract the hour of the accident from the `Time` field

In [12]:
import time

In [13]:
df_accidents['Time'] = pd.to_datetime(df_accidents['Time'], format='%H:%M')

In [14]:
# Vectorized hour extraction; rows with a missing time yield NaN
df_accidents['Hour'] = df_accidents['Time'].dt.hour

In [15]:
# NaN hours (missing times) are replaced by 0 before casting to int
df_accidents['Hour'] = np.nan_to_num(df_accidents['Hour']).astype(int)
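
Note that `np.nan_to_num` turns a missing time into hour 0, which cannot then be told apart from a real midnight accident. As a hedged alternative (not what this notebook does), the pandas nullable `Int64` dtype keeps missing hours as `<NA>`. The sample values below are illustrative:

```python
import pandas as pd

# Illustrative sample; None stands for a record with no reported time.
times = pd.Series(["18:45", "07:50", None])

parsed = pd.to_datetime(times, format="%H:%M")

# .dt.hour yields NaN for missing times; the nullable Int64 dtype keeps
# them as <NA> instead of turning them into hour 0.
hours = parsed.dt.hour.astype("Int64")

print(hours.tolist())
print(int(hours.isna().sum()))
```

Counts per hour computed with `value_counts()` would then skip the missing entries by default.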

In [16]:
df_accidents.head()

Unnamed: 0,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,...,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location,Month,Hour
1,201501BS70001,525130.0,180050.0,-0.198465,51.505538,1,3,1,1,2015-01-12,...,4,1,1,0,0,1,1,E01002825,1,18
2,201501BS70002,526530.0,178560.0,-0.178838,51.491836,1,3,1,1,2015-01-12,...,1,1,1,0,0,1,1,E01002820,1,7
3,201501BS70004,524610.0,181080.0,-0.20559,51.51491,1,3,1,1,2015-01-12,...,4,2,2,0,0,1,1,E01002833,1,18
4,201501BS70005,524420.0,181080.0,-0.208327,51.514952,1,3,1,1,2015-01-13,...,1,1,2,0,0,1,2,E01002874,1,7
5,201501BS70008,524630.0,179040.0,-0.206022,51.496572,1,2,2,1,2015-01-09,...,1,2,2,0,0,1,2,E01002814,1,7


We select the columns of interest

In [17]:
df_accidentes = df_accidents[['Accident_Index','Month','Hour','Day_of_Week','Accident_Severity', 
                              'Number_of_Vehicles','Number_of_Casualties', 'Weather_Conditions', 
                              'Road_Surface_Conditions', 'Urban_or_Rural_Area','Road_Type','Speed_limit',
                              'Local_Authority_(District)','Did_Police_Officer_Attend_Scene_of_Accident']]

In [18]:
print(len(df_accidentes))

140056


In [19]:
df_accidentes.head()

Unnamed: 0,Accident_Index,Month,Hour,Day_of_Week,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Road_Type,Speed_limit,Local_Authority_(District),Did_Police_Officer_Attend_Scene_of_Accident
1,201501BS70001,1,18,2,3,1,1,1,1,1,6,30,12,1
2,201501BS70002,1,7,2,3,1,1,1,1,1,6,30,12,1
3,201501BS70004,1,18,2,3,1,1,2,2,1,6,30,12,1
4,201501BS70005,1,7,3,3,1,1,1,2,1,6,30,12,2
5,201501BS70008,1,7,6,2,2,1,2,2,1,6,30,12,2


### Vehicles Preprocessing

In [20]:
df_vehicles = pd.read_csv("Data/Vehicles_2015.csv", header = 0)

In [21]:
df_vehicles.head()

Unnamed: 0,Accident_Index,Vehicle_Reference,Vehicle_Type,Towing_and_Articulation,Vehicle_Manoeuvre,Vehicle_Location-Restricted_Lane,Junction_Location,Skidding_and_Overturning,Hit_Object_in_Carriageway,Vehicle_Leaving_Carriageway,...,Journey_Purpose_of_Driver,Sex_of_Driver,Age_of_Driver,Age_Band_of_Driver,Engine_Capacity_(CC),Propulsion_Code,Age_of_Vehicle,Driver_IMD_Decile,Driver_Home_Area_Type,Vehicle_IMD_Decile
0,201506E098757,2,9,0,18,0,8,0,0,0,...,6,1,45,7,1794,1,11,-1,1,-1
1,201506E098766,1,9,0,9,0,8,0,0,0,...,6,2,25,5,1582,2,1,-1,-1,-1
2,201506E098766,2,9,0,18,0,8,0,0,0,...,6,1,51,8,-1,-1,-1,-1,1,-1
3,201506E098777,1,20,0,4,0,0,0,0,0,...,1,1,50,8,4462,2,1,-1,1,-1
4,201506E098780,1,9,0,15,0,1,0,0,0,...,6,1,27,6,1598,2,-1,-1,1,-1


We set the DataFrame index to serve as the ID of each vehicle

In [22]:
df_vehicles.index= np.arange(1, len(df_vehicles)+1)
##df_vehicles.columns.names = ['id_vehicle']
df_vehicles.head()

Unnamed: 0,Accident_Index,Vehicle_Reference,Vehicle_Type,Towing_and_Articulation,Vehicle_Manoeuvre,Vehicle_Location-Restricted_Lane,Junction_Location,Skidding_and_Overturning,Hit_Object_in_Carriageway,Vehicle_Leaving_Carriageway,...,Journey_Purpose_of_Driver,Sex_of_Driver,Age_of_Driver,Age_Band_of_Driver,Engine_Capacity_(CC),Propulsion_Code,Age_of_Vehicle,Driver_IMD_Decile,Driver_Home_Area_Type,Vehicle_IMD_Decile
1,201506E098757,2,9,0,18,0,8,0,0,0,...,6,1,45,7,1794,1,11,-1,1,-1
2,201506E098766,1,9,0,9,0,8,0,0,0,...,6,2,25,5,1582,2,1,-1,-1,-1
3,201506E098766,2,9,0,18,0,8,0,0,0,...,6,1,51,8,-1,-1,-1,-1,1,-1
4,201506E098777,1,20,0,4,0,0,0,0,0,...,1,1,50,8,4462,2,1,-1,1,-1
5,201506E098780,1,9,0,15,0,1,0,0,0,...,6,1,27,6,1598,2,-1,-1,1,-1


We select the columns of interest

In [23]:
df_vehiculos = df_vehicles[['Accident_Index','Vehicle_Reference','Vehicle_Type','Age_of_Vehicle',
                            'Age_of_Driver','Age_Band_of_Driver','Sex_of_Driver','Towing_and_Articulation',
                            'Vehicle_Location-Restricted_Lane','1st_Point_of_Impact','Was_Vehicle_Left_Hand_Drive?',
                            'Journey_Purpose_of_Driver','Engine_Capacity_(CC)','Propulsion_Code']]

In [24]:
print(len(df_vehiculos))

257845


In [25]:
df_vehiculos.head()

Unnamed: 0,Accident_Index,Vehicle_Reference,Vehicle_Type,Age_of_Vehicle,Age_of_Driver,Age_Band_of_Driver,Sex_of_Driver,Towing_and_Articulation,Vehicle_Location-Restricted_Lane,1st_Point_of_Impact,Was_Vehicle_Left_Hand_Drive?,Journey_Purpose_of_Driver,Engine_Capacity_(CC),Propulsion_Code
1,201506E098757,2,9,11,45,7,1,0,0,3,1,6,1794,1
2,201506E098766,1,9,1,25,5,2,0,0,4,1,6,1582,2
3,201506E098766,2,9,-1,51,8,1,0,0,1,1,6,-1,-1
4,201506E098777,1,20,1,50,8,1,0,0,1,1,1,4462,2
5,201506E098780,1,9,-1,27,6,1,0,0,4,1,6,1598,2


### Casualties Preprocessing

In [26]:
df_casualties = pd.read_csv("Data/Casualties_2015.csv", header = 0)

In [27]:
df_casualties.head()

Unnamed: 0,Accident_Index,Vehicle_Reference,Casualty_Reference,Casualty_Class,Sex_of_Casualty,Age_of_Casualty,Age_Band_of_Casualty,Casualty_Severity,Pedestrian_Location,Pedestrian_Movement,Car_Passenger,Bus_or_Coach_Passenger,Pedestrian_Road_Maintenance_Worker,Casualty_Type,Casualty_Home_Area_Type,Casualty_IMD_Decile
0,201597UA71710,2,1,1,2,75,10,3,0,0,0,0,0,9,3,-1
1,201597UA71810,2,1,2,2,63,9,2,0,0,0,4,0,11,3,-1
2,201597UA71810,2,2,2,2,75,10,2,0,0,0,4,0,11,1,-1
3,201597UA71810,2,3,2,1,78,11,2,0,0,0,4,0,11,1,-1
4,201597UA71810,2,4,2,1,67,10,2,0,0,0,4,0,11,1,-1


We set the DataFrame index to serve as the ID of each casualty


In [28]:
df_casualties.index= np.arange(1, len(df_casualties)+1)
##df_casualties.columns.names = ['id_casualty']
df_casualties.head()

Unnamed: 0,Accident_Index,Vehicle_Reference,Casualty_Reference,Casualty_Class,Sex_of_Casualty,Age_of_Casualty,Age_Band_of_Casualty,Casualty_Severity,Pedestrian_Location,Pedestrian_Movement,Car_Passenger,Bus_or_Coach_Passenger,Pedestrian_Road_Maintenance_Worker,Casualty_Type,Casualty_Home_Area_Type,Casualty_IMD_Decile
1,201597UA71710,2,1,1,2,75,10,3,0,0,0,0,0,9,3,-1
2,201597UA71810,2,1,2,2,63,9,2,0,0,0,4,0,11,3,-1
3,201597UA71810,2,2,2,2,75,10,2,0,0,0,4,0,11,1,-1
4,201597UA71810,2,3,2,1,78,11,2,0,0,0,4,0,11,1,-1
5,201597UA71810,2,4,2,1,67,10,2,0,0,0,4,0,11,1,-1


We select the columns of interest

In [29]:
df_victimas = df_casualties[['Accident_Index','Vehicle_Reference','Casualty_Reference','Casualty_Class',
                             'Casualty_Type','Age_of_Casualty','Age_Band_of_Casualty','Sex_of_Casualty',
                             'Casualty_Severity']]

In [30]:
print(len(df_victimas))

186189


In [31]:
df_victimas.head()

Unnamed: 0,Accident_Index,Vehicle_Reference,Casualty_Reference,Casualty_Class,Casualty_Type,Age_of_Casualty,Age_Band_of_Casualty,Sex_of_Casualty,Casualty_Severity
1,201597UA71710,2,1,1,9,75,10,2,3
2,201597UA71810,2,1,2,11,63,9,2,2
3,201597UA71810,2,2,2,11,75,10,2,2
4,201597UA71810,2,3,2,11,78,11,1,2
5,201597UA71810,2,4,2,11,67,10,1,2


In [32]:
# to_csv returns None when a path is given, so the result is not assigned
df_accidentes.to_csv("Data/accidentes.csv", index=True, index_label='id_accident')
df_vehiculos.to_csv("Data/vehiculos.csv", index=True, index_label='id_vehicle')
df_victimas.to_csv("Data/victimas.csv", index=True, index_label='id_casualty')

From the longitude and latitude we can obtain the city in which each accident was recorded using the Google geocoding API.

In [1]:
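try:
    from urllib2 import urlopen  # Python 2, as used when this notebook was run
except ImportError:
    from urllib.request import urlopen  # Python 3
import json

def getplace(lat, lon):
    """Return the postal town for a latitude/longitude pair, or None if absent."""
    url = "http://maps.googleapis.com/maps/api/geocode/json?"
    url += "latlng=%s,%s&sensor=false" % (lat, lon)
    v = urlopen(url).read()
    j = json.loads(v)
    components = j['results'][0]['address_components']
    town = None  # avoids an UnboundLocalError when no postal_town is returned
    for c in components:
        if "postal_town" in c['types']:
            town = c['long_name']
    return town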

We check that it works correctly on individual coordinates

In [2]:
print(getplace(51.1, 0.1))
print(getplace(51.505538,-0.198465))
print(getplace(51.495171,-0.173519))

Hartfield
London
London


The Google API is limited to 10,000 lookups per day in its free version and 100,000 per day in the premium version. We therefore cannot look up the city for all 140,056 accidents in the dataset every time the notebook is run. Instead, we will use the `Local_Authority_(District)` field to answer the question of which location recorded the most accidents in 2015.
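
Counting accidents per district needs no external service. A minimal sketch on an illustrative sample (the district codes below are made up for the example):

```python
import pandas as pd

# Illustrative sample; the codes are made up for the example.
sample = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3", "A4"],
    "Local_Authority_(District)": [12, 12, 300, 12],
})

# Accidents per district code, most frequent first.
per_district = sample["Local_Authority_(District)"].value_counts()

# Code of the district with the most accidents.
top_district = per_district.idxmax()
print(per_district)
print(top_district)
```

The resulting code can then be translated into a district name through the codes lookup file.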

# 1- PostgreSQL

<img src="http://logonoid.com/images/postgresql-logo.png" alt="PostgreSQL Logo" style="width: 300px; PADDING-LEFT: 5px"/>



## 1.1 Conceptual Model


<tr>
<td> <img src="images/Modelo_Conceptual_Accident.jpg" width="600" height="600"> </td>
<td> <img src="images/Modelo_Conceptual_Vehicle.jpg" width="600" height="600"> </td>
<td> <img src="images/Modelo_Conceptual_Casualties.jpg" width="600" height="600"> </td>
</tr>

## 1.2. Relational Model

<br><br> 
<img src="images/Modelo_Relacion_UKAccidents.jpg" width="600" height="600">
<br><br>

## 1.3. Database Creation

### Deleting Existing Data

In [11]:
!echo 'learner' | sudo -S -u postgres dropdb accidentes

sh: 0: getcwd() failed: No such file or directory
[sudo] password for learner: could not identify current directory: No such file or directory
dropdb: database removal failed: ERROR:  database "accidentes" is being accessed by other users
DETAIL:  There is 1 other session using the database.


In [12]:
!echo 'learner' | sudo -S -u postgres createdb accidentes -O learner

sh: 0: getcwd() failed: No such file or directory
[sudo] password for learner: could not identify current directory: No such file or directory
createdb: database creation failed: ERROR:  database "accidentes" already exists


In [13]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [15]:
%sql postgresql://learner:learner@localhost/accidentes

u'Connected: learner@accidentes'

## 1.4. Loading the Data into Pandas

We start from the cleaned and normalized data stored in the CSV files.

In [40]:
df_accidentes = pd.read_csv("Data/accidentes.csv", header = 0)
df_accidentes.head()

Unnamed: 0,id_accident,Accident_Index,Month,Hour,Day_of_Week,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Road_Type,Speed_limit,Local_Authority_(District),Did_Police_Officer_Attend_Scene_of_Accident
0,1,201501BS70001,1,18,2,3,1,1,1,1,1,6,30,12,1
1,2,201501BS70002,1,7,2,3,1,1,1,1,1,6,30,12,1
2,3,201501BS70004,1,18,2,3,1,1,2,2,1,6,30,12,1
3,4,201501BS70005,1,7,3,3,1,1,1,2,1,6,30,12,2
4,5,201501BS70008,1,7,6,2,2,1,2,2,1,6,30,12,2


In [41]:
df_vehiculos = pd.read_csv("Data/vehiculos.csv", header = 0)
df_vehiculos.head()

Unnamed: 0,id_vehicle,Accident_Index,Vehicle_Reference,Vehicle_Type,Age_of_Vehicle,Age_of_Driver,Age_Band_of_Driver,Sex_of_Driver,Towing_and_Articulation,Vehicle_Location-Restricted_Lane,1st_Point_of_Impact,Was_Vehicle_Left_Hand_Drive?,Journey_Purpose_of_Driver,Engine_Capacity_(CC),Propulsion_Code
0,1,201506E098757,2,9,11,45,7,1,0,0,3,1,6,1794,1
1,2,201506E098766,1,9,1,25,5,2,0,0,4,1,6,1582,2
2,3,201506E098766,2,9,-1,51,8,1,0,0,1,1,6,-1,-1
3,4,201506E098777,1,20,1,50,8,1,0,0,1,1,1,4462,2
4,5,201506E098780,1,9,-1,27,6,1,0,0,4,1,6,1598,2


In [42]:
df_victimas = pd.read_csv("Data/victimas.csv",header = 0)
df_victimas.head()

Unnamed: 0,id_casualty,Accident_Index,Vehicle_Reference,Casualty_Reference,Casualty_Class,Casualty_Type,Age_of_Casualty,Age_Band_of_Casualty,Sex_of_Casualty,Casualty_Severity
0,1,201597UA71710,2,1,1,9,75,10,2,3
1,2,201597UA71810,2,1,2,11,63,9,2,2
2,3,201597UA71810,2,2,2,11,75,10,2,2
3,4,201597UA71810,2,3,2,11,78,11,1,2
4,5,201597UA71810,2,4,2,11,67,10,1,2


## 1.5. Creating the Tables

To store the information we create 3 tables: one for accidents, one for vehicles and one for casualties. 
The three tables are related through the `Accident_Index` field, which has the same name in all of them.

In addition, we use two more tables to build a master lookup table that provides the descriptions of the codes in the original dataset.

<br><br> 

<img src="images/Tablas_PostgreSQL.jpg" width="800" height="800">


<br><br>
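
Since the vehicle and casualty tables refer to accidents through `Accident_Index`, it may be worth checking before loading that every child row has a matching accident. A minimal sketch on illustrative frames (the index values are made up, and `"A9"` deliberately has no parent):

```python
import pandas as pd

# Illustrative samples; "A9" deliberately has no matching accident.
accidents = pd.DataFrame({"Accident_Index": ["A1", "A2"]})
vehicles = pd.DataFrame({
    "Accident_Index": ["A1", "A1", "A2", "A9"],
    "Vehicle_Reference": [1, 2, 1, 1],
})

# Rows whose Accident_Index is absent from the parent table would violate
# a foreign key on accident_index.
orphans = vehicles[~vehicles["Accident_Index"].isin(accidents["Accident_Index"])]

print(len(orphans))
print(orphans["Accident_Index"].tolist())
```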

In [20]:
%sql DROP TABLE accidentes
%sql DROP TABLE vehiculos
%sql DROP SEQUENCE seq_vehicle_id;
%sql DROP TABLE victimas
%sql DROP SEQUENCE seq_casualty_id;
%sql DROP TABLE codigos CASCADE
%sql DROP TABLE categorias


Done.
(psycopg2.ProgrammingError) table "vehiculos" does not exist
 [SQL: 'DROP TABLE vehiculos']
Done.
(psycopg2.ProgrammingError) table "victimas" does not exist
 [SQL: 'DROP TABLE victimas']
Done.
(psycopg2.ProgrammingError) table "codigos" does not exist
 [SQL: 'DROP TABLE codigos CASCADE']
(psycopg2.ProgrammingError) table "categorias" does not exist
 [SQL: 'DROP TABLE categorias']
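
The errors above come from dropping objects that do not exist yet. PostgreSQL accepts `DROP TABLE IF EXISTS`, which turns that case into a no-op. The sketch below demonstrates the idiom with Python's built-in `sqlite3` module, used here only as a stand-in so the example runs without a database server (SQLite supports the same clause):

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the PostgreSQL server.
conn = sqlite3.connect(":memory:")

# Dropping a table that does not exist raises an error...
try:
    conn.execute("DROP TABLE vehiculos")
    plain_drop_failed = False
except sqlite3.OperationalError:
    plain_drop_failed = True

# ...whereas IF EXISTS makes the statement a no-op in that case.
conn.execute("DROP TABLE IF EXISTS vehiculos")

# When the table does exist, IF EXISTS drops it as usual.
conn.execute("CREATE TABLE vehiculos (id_vehicle INTEGER)")
conn.execute("DROP TABLE IF EXISTS vehiculos")
remaining = conn.execute(
    "SELECT count(*) FROM sqlite_master WHERE name = 'vehiculos'"
).fetchone()[0]
conn.close()
print(plain_drop_failed, remaining)
```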


In [21]:
%sql CREATE SEQUENCE seq_vehicle_id;
%sql CREATE SEQUENCE seq_casualty_id;

Done.
Done.


[]

In [22]:
%%sql 
CREATE TABLE accidentes (
    id_accident                  int not null ,
    accident_index               varchar(13) PRIMARY KEY,
    month                        int,
    hour                         int,
    day_of_week                  int,
    accident_severity            int,
    number_of_vehicles           int,
    number_of_casualties         int, 
    weather_conditions           int,
    road_surface_conditions      int,
    urban_rural_area             int,
    road_type                    int,
    speed_limit                  int,
    local_authority_district     int,
    police_officer_attend        int,
    CONSTRAINT pk_accidentes UNIQUE(id_accident)
);

Done.


[]

In [23]:
%%sql 
CREATE TABLE vehiculos (
    id_vehicle                   int not null default nextval('seq_vehicle_id'),
    accident_index               varchar(13) not null REFERENCES accidentes(accident_index),
    vehicle_reference            int,
    vehicle_type                 int,
    age_of_vehicle               int,
    age_of_driver                int,
    age_band_of_driver           int,
    sex_of_driver                int, 
    towing_articulation          int,
    vehicle_location             int,
    urban_rural_area             int,
    first_point_of_impact        int,
    left_hand_drive              int,
    journey_purpose_of_driver    int,
    engine_capacity              int,
    propulsion_code              int,
    CONSTRAINT pk_vehiculos UNIQUE(id_vehicle)
);

Done.


[]

In [24]:
%%sql 
CREATE TABLE victimas (
    id_casualty                  int not null default nextval('seq_casualty_id'),
    accident_index               varchar(13) not null REFERENCES accidentes(accident_index),
    vehicle_reference            int,
    casualty_reference           int,
    casualty_class               int,
    casualty_type                int,
    age_of_casualty              int,
    age_band_of_casualty         int, 
    sex_of_casualty              int,
    casualty_severity            int,
    CONSTRAINT pk_victimas UNIQUE(id_casualty)
);

Done.


[]

Creating the lookup tables

In [25]:
%%sql 
CREATE TABLE categorias (
    id_category              int not null PRIMARY KEY,
    category_name            varchar(200) not null
);

Done.


[]

In [26]:
%%sql 
CREATE TABLE codigos (
    id_code                  int not null PRIMARY KEY,
    code                     varchar(20) not null,
    category                 varchar(200),
    code_name                varchar(200)
);

Done.


[]

## Exporting the Data to PostgreSQL
To export the data to the database, we use the pandas feature for writing a DataFrame to a relational database.

In [50]:
from sqlalchemy import create_engine

In [51]:
engine = create_engine('postgresql://learner:learner@localhost:5432/accidentes')

In [52]:
df_accidentes.to_sql('accidentes', engine, if_exists = 'append', index = False, chunksize=10000)

KeyError: 'Local_Authority_(District'
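
The `KeyError` most likely arises because the DataFrame column `Local_Authority_(District)` contains parentheses, which clash with the named-parameter syntax the driver uses. Several DataFrame column names also differ from those in `CREATE TABLE accidentes`. One possible fix, sketched here on an illustrative one-row frame, is to rename the columns to the table's names before calling `to_sql`. The `column_map` dictionary is a hypothetical helper, and an in-memory SQLite connection stands in for the PostgreSQL engine so the example is self-contained:

```python
import sqlite3
import pandas as pd

# Illustrative sample with three of the original column names.
df = pd.DataFrame({
    "Accident_Index": ["201501BS70001"],
    "Local_Authority_(District)": [12],
    "Urban_or_Rural_Area": [1],
})

# Hypothetical mapping from DataFrame names to the names in CREATE TABLE accidentes.
column_map = {
    "Local_Authority_(District)": "local_authority_district",
    "Urban_or_Rural_Area": "urban_rural_area",
    "Did_Police_Officer_Attend_Scene_of_Accident": "police_officer_attend",
}
renamed = df.rename(columns=column_map)  # keys absent from df are ignored
renamed.columns = [name.lower() for name in renamed.columns]

# In-memory SQLite connection as a stand-in for the PostgreSQL engine.
conn = sqlite3.connect(":memory:")
renamed.to_sql("accidentes", conn, if_exists="append", index=False)
loaded = conn.execute("SELECT local_authority_district FROM accidentes").fetchone()[0]
conn.close()
print(list(renamed.columns), loaded)
```

With the real data, the renamed frame would be passed to `to_sql` with the SQLAlchemy `engine` created above instead of the SQLite connection.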

We load the category and code data

In [53]:
df_categorias.to_sql('categorias', engine, if_exists = 'append', index = False)

In [54]:
df_codigos.to_sql('codigos', engine, if_exists = 'append', index = False)

In [55]:
%%sql
CREATE VIEW CatCodes as 
SELECT 
    codigos.code as CodeName,
    categorias.category_name as CategoryName,
    codigos.code_name as CodeDescription
FROM categorias, codigos
WHERE categorias.category_name = codigos.category;

Done.


[]

In [56]:
%%sql 
SELECT * FROM CatCodes
LIMIT 10;

10 rows affected.


codename,categoryname,codedescription
98,Police Force,Dumfries and Galloway
97,Police Force,Strathclyde
96,Police Force,Central
95,Police Force,Lothian and Borders
94,Police Force,Fife
93,Police Force,Tayside
92,Police Force,Grampian
91,Police Force,Northern
63,Police Force,Dyfed-Powys
62,Police Force,South Wales


## Querying the Data

We first run a few queries to verify that the insertion into the database went well

In [57]:
%%sql
SELECT count(*)
FROM CatCodes

1 rows affected.


count
1032


In [58]:
%%sql 
SELECT count(*)
FROM accidentes

1 rows affected.


count
0


In [59]:
%%sql 
SELECT count(*)
FROM vehiculos

(psycopg2.ProgrammingError) relation "vehiculos" does not exist
LINE 2: FROM vehiculos
             ^
 [SQL: 'SELECT count(*)\nFROM vehiculos']


In [60]:
%%sql 
SELECT count(*)
FROM victimas

(psycopg2.ProgrammingError) relation "victimas" does not exist
LINE 2: FROM victimas
             ^
 [SQL: 'SELECT count(*)\nFROM victimas']


### The 10 accidents with the highest number of casualties

In [68]:
%%sql 
select accident_index, number_of_casualties
from accidentes
order by number_of_casualties desc
limit 10

0 rows affected.


accident_index,number_of_casualties
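
As a cross-check that does not depend on the database load, the same ranking can be computed directly in pandas with `nlargest`. A minimal sketch on an illustrative sample (with the real data one would call it on `df_accidentes` with `10` instead of `2`):

```python
import pandas as pd

# Illustrative sample; values are made up for the example.
sample = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3", "A4"],
    "Number_of_Casualties": [1, 7, 3, 2],
})

# Rows with the largest casualty counts, highest first.
top = sample.nlargest(2, "Number_of_Casualties")[
    ["Accident_Index", "Number_of_Casualties"]
]
print(top)
```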


### The 10 accidents with the most vehicles

In [73]:
%%sql 
select accident_index, number_of_vehicles
from accidentes 
order by number_of_vehicles desc
limit 10

0 rows affected.


accident_index,number_of_vehicles


### Accidents by severity

In [77]:
%%sql 
select CatCodes.CodeDescription, count(*)
from accidentes, CatCodes
where accidentes.accident_severity = cast(CatCodes.CodeName as int8)
    and CatCodes.CategoryName = 'Accident Severity'
group by CatCodes.CodeDescription

0 rows affected.


codedescription,count
