# **Exploratory Data Analysis (EDA) of the Accident Dataset**


This notebook presents a detailed analysis of car accident data from the DATATRAN dataset. It explores several aspects of the data: correlations, severity, spatial and temporal patterns, the impact of weather conditions, and feature quality. Each section includes relevant visualizations to aid interpretation.

## **1. Introduction**
In this section, we load the dataset and inspect its first rows.

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
import sys
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from typing import Dict, List




In [5]:
from src.eda.correlation_analysis import CorrelationAnalysis
from src.eda.severity_analysis import SeverityAnalysis
from src.eda.spatial_analysis import SpatialAnalysis
from src.eda.temporal_analysis import TemporalAnalysis
from src.eda.weather_analysis import WeatherAnalysis
from src.eda.trend_analysis import TrendAnalysis
from src.eda.feature_analysis import FeatureAnalysis
from src.eda.visualization import AccidentVisualizer

In [3]:
notebook_path = Path.cwd()
project_root = notebook_path.parent
sys.path.append(str(project_root))

In [4]:
def load_data() -> pd.DataFrame:
    """Load the merged DATATRAN CSV (semicolon-separated) into a DataFrame."""
    data_path = Path("../files/processed/datatran_merged.csv")
    return pd.read_csv(data_path, sep=";")

In [8]:
df = load_data()
# Show the first rows of the dataset
df.head()

Unnamed: 0,id,data_inversa,dia_semana,horario,uf,br,km,municipio,causa_acidente,tipo_acidente,...,tracado_via,uso_solo,pessoas,mortos,feridos_leves,feridos_graves,ilesos,ignorados,feridos,veiculos
0,175827.0,2007-01-01,Segunda,17:30:00,MG,381.0,485.0,BETIM,Falta de atenção,Colisão traseira,...,Cruzamento,Urbano,2,0,0,0,2,0,0,2
1,174220.0,2007-01-01,Segunda,14:15:00,RJ,40.0,43.9,AREAL,Velocidade incompatível,Saída de Pista,...,Curva,Rural,1,0,0,0,1,0,0,1
2,175540.0,2007-01-01,Segunda,16:00:00,PE,101.0,32.0,IGARASSU,Não guardar distância de segurança,Colisão lateral,...,Reta,Rural,2,0,0,0,2,0,0,2
3,175544.0,2007-01-01,Segunda,12:00:00,MG,50.0,0.2,ARAGUARI,Outras,Saída de Pista,...,Curva,Rural,4,0,3,1,0,0,4,1
4,175545.0,2007-01-01,Segunda,08:40:00,MG,381.0,397.4,NOVA UNIAO,Outras,Saída de Pista,...,Curva,Rural,1,0,0,0,1,0,0,1


## **2. Initial Exploration of the Dataset**
This section gives an overview of the information the dataset contains.

### 2.1 Dataset Size
First, we check the size of the dataset: the number of records and the number of columns.

In [68]:
# Dataset size (number of rows and columns)
dataset_size = df.shape
print(f"Dataset size: {dataset_size[0]} records and {dataset_size[1]} columns")



Dataset size: 2122296 records and 25 columns


### 2.2 Basic Column Information
Next, we list each column's name, data type, and counts of non-null and null values to better understand the dataset's structure.

In [69]:
# Collect each column's data type and its non-null / null counts
column_info = pd.DataFrame({
    'Data Type': df.dtypes,
    'Non-Null Values': df.notnull().sum(),
    'Null Values': df.isnull().sum()
})

# Display the per-column summary table
display(column_info)


Column,Data Type,Non-Null Values,Null Values
id,float64,2122296,0
data_inversa,object,2122296,0
dia_semana,object,2122296,0
horario,object,2122296,0
uf,object,2122296,0
br,float64,2122284,12
km,float64,2122284,12
municipio,object,2122296,0
causa_acidente,object,2122296,0
tipo_acidente,object,2122255,41


### 2.3 Descriptive Statistics for Numeric Variables
Now we compute and display descriptive statistics for the dataset's numeric variables.

In [70]:
# Descriptive statistics for numeric variables
print("\nDescriptive statistics for numeric variables:")
stats_df = df.describe().transpose()


styled_stats = stats_df.style.format("{:,.2f}").background_gradient(cmap='coolwarm', axis=None)
display(styled_stats)



Descriptive statistics for numeric variables:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,2122296.0,18970845.72,34359383.73,8.0,391425.75,715183.5,1276920.25,83529889.0
br,2122284.0,212.19,129.32,0.0,101.0,158.0,324.0,958.0
km,2122284.0,262.18,229.4,-870.3,78.8,201.0,411.9,9967.1
pessoas,2122296.0,2.25,1.7,1.0,1.0,2.0,3.0,248.0
mortos,2122296.0,0.06,0.3,0.0,0.0,0.0,0.0,37.0
feridos_leves,2122296.0,0.55,1.02,0.0,0.0,0.0,1.0,61.0
feridos_graves,2122296.0,0.19,0.58,0.0,0.0,0.0,0.0,222.0
ilesos,2122296.0,1.31,1.33,0.0,1.0,1.0,2.0,99.0
ignorados,2122296.0,0.16,0.54,0.0,0.0,0.0,0.0,88.0
feridos,2122296.0,0.74,1.21,0.0,0.0,0.0,1.0,239.0


## **3. Correlation Between Variables**
### 3.1 Numeric Correlations

We start with the correlations between numeric variables.

In [71]:
# Instantiate the correlation module
corr_analysis = CorrelationAnalysis(df)

# Display the correlation matrix for the numeric variables
corr_analysis.get_numeric_correlations()


Unnamed: 0,id,br,km,pessoas,mortos,feridos_leves,feridos_graves,ilesos,ignorados,feridos,veiculos
id,1.0,-0.0,0.0,-0.02,-0.01,-0.02,-0.01,0.02,-0.06,-0.02,-0.04
br,-0.0,1.0,0.05,0.01,0.0,0.01,0.02,-0.01,0.0,0.02,0.01
km,0.0,0.05,1.0,-0.0,0.02,0.01,0.0,-0.02,0.01,0.01,-0.05
pessoas,-0.02,0.01,-0.0,1.0,0.22,0.52,0.33,0.61,0.23,0.59,0.46
mortos,-0.01,0.0,0.02,0.22,1.0,0.03,0.16,-0.06,0.08,0.1,0.06
feridos_leves,-0.02,0.01,0.01,0.52,0.03,1.0,0.09,-0.15,0.01,0.88,0.01
feridos_graves,-0.01,0.02,0.0,0.33,0.16,0.09,1.0,-0.13,0.03,0.55,0.04
ilesos,0.02,-0.01,-0.02,0.61,-0.06,-0.15,-0.13,1.0,-0.06,-0.19,0.44
ignorados,-0.06,0.0,0.01,0.23,0.08,0.01,0.03,-0.06,1.0,0.03,0.44
feridos,-0.02,0.02,0.01,0.59,0.1,0.88,0.55,-0.19,0.03,1.0,0.03


### 3.2 Categorical Associations
Here we explore associations between pairs of categorical variables using the chi-square test.

In [72]:
# Display the associations between categorical variables
corr_analysis.get_categorical_associations()


{'data_inversa x dia_semana': {'chi2': 27589848.000000004, 'p_value': 0.0},
 'data_inversa x horario': {'chi2': 11157446.265419167, 'p_value': 0.0},
 'data_inversa x uf': {'chi2': 256225.33111509742, 'p_value': 0.0},
 'data_inversa x municipio': {'chi2': 17968615.41865172, 'p_value': 0.0},
 'data_inversa x tipo_acidente': {'chi2': 1895568.0244225315, 'p_value': 0.0},
 'data_inversa x fase_dia': {'chi2': 825621.0586145234, 'p_value': 0.0},
 'data_inversa x sentido_via': {'chi2': 23551.441444378925, 'p_value': 0.0},
 'data_inversa x tipo_pista': {'chi2': 32043.028063898375, 'p_value': 0.0},
 'data_inversa x tracado_via': {'chi2': 13935167.857849205, 'p_value': 0.0},
 'data_inversa x uso_solo': {'chi2': 2175769.353447743, 'p_value': 0.0},
 'dia_semana x horario': {'chi2': 98998.75165683286, 'p_value': 0.0},
 'dia_semana x uf': {'chi2': 15669.644533989176, 'p_value': 0.0},
 'dia_semana x municipio': {'chi2': 149598.0149074538, 'p_value': 0.0},
 'dia_semana x tipo_acidente': {'chi2': 122303
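
With more than two million records, a chi-square p-value of 0.0 is expected for almost any pair of variables, so it says little about how strong an association is. One common complement, which is not part of the notebook, is Cramér's V, an effect size between 0 and 1 derived from the same statistic. The sketch below is a minimal version under that assumption; the helper name `cramers_v` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V (without bias correction) for two categorical series."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # smaller table dimension minus one
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else float("nan")


# Toy data using two column names from the dataset.
sample = pd.DataFrame({
    "uso_solo": ["Urbano", "Rural", "Rural", "Urbano", "Rural", "Urbano"],
    "tracado_via": ["Cruzamento", "Curva", "Reta", "Cruzamento", "Curva", "Reta"],
})

v = cramers_v(sample["uso_solo"], sample["tracado_via"])
print(f"Cramér's V: {v:.3f}")
```

Pairs such as `data_inversa x dia_semana` would score close to 1 for a trivial reason (a date determines its weekday), which is a hint that such pairs can be excluded from the comparison.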

## **4. Accident Severity Analysis**
### 4.1 Severity by Cause
We analyze the mean severity of accidents grouped by cause.

In [74]:
# Instantiate the severity module
severity_analysis = SeverityAnalysis(df)

# Display severity by cause
severity_analysis.get_severity_by_cause()


causa_acidente,indice_severidade (mean),mortos (sum),mortos (mean),feridos_graves (sum),feridos_graves (mean),feridos_leves (sum),feridos_leves (mean)
(null),0.83,0,0.00,1,0.50,0,0.00
Acessar a via sem observar a presença dos outros veículos,1.56,1451,0.06,8102,0.34,21535,0.89
Acesso irregular,1.71,168,0.06,984,0.36,2434,0.88
Acostamento em desnível,1.99,39,0.05,173,0.24,628,0.87
Acumulo de areia ou detritos sobre o pavimento,1.95,21,0.02,220,0.21,975,0.94
...,...,...,...,...,...,...,...
Ultrapassagem Indevida,2.81,2849,0.23,6186,0.51,12216,1.00
Ultrapassagem indevida,2.39,6433,0.18,13365,0.38,23914,0.68
Velocidade Incompatível,2.44,5263,0.11,14098,0.28,42544,0.86
Velocidade incompatível,1.80,8885,0.07,25652,0.19,74527,0.56


## **5. Spatial Analysis of Accidents**
### 5.1 Statistics by State
We analyze accident statistics by state (UF).

In [75]:
# Instantiate the spatial analysis module
spatial_analysis = SpatialAnalysis(df)

# Display statistics by state
spatial_analysis.get_state_stats()


uf,id (count),mortos (sum),mortos (mean),feridos (sum),feridos (mean),veiculos (sum)
(null),12,5,0.42,17,1.42,18
AC,6322,385,0.06,6403,1.01,10804
AL,23419,2057,0.09,19674,0.84,41193
AM,2948,308,0.1,3055,1.04,4854
AP,3353,245,0.07,3953,1.18,5537
BA,116390,11430,0.1,93006,0.8,207597
CE,44773,3563,0.08,35761,0.8,81972
DF,20737,868,0.04,19422,0.94,38043
ES,87355,3743,0.04,63071,0.72,166527
GO,96380,6830,0.07,80225,0.83,167114


### 5.2 Accident Density by Road Segment
We check the density of accidents by highway (`br`) and kilometer marker (`km`).

In [76]:
# Display accident density by road segment
spatial_analysis.get_accident_density()


Unnamed: 0,br,km,acidentes
0,0.0,0.0,1237
1,0.0,9.5,1
2,0.0,68.0,1
3,0.0,69.3,1
4,0.0,75.3,1
...,...,...,...
211399,869.0,490.4,1
211400,870.0,1.0,1
211401,884.0,46.0,1
211402,931.0,123.0,1
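
Most `(br, km)` pairs in the table above occur only once, because `km` is recorded to one decimal place. A count per exact marker can be reproduced with a plain `groupby`, and grouping markers into fixed-length segments gives a denser picture. The sketch below shows both; the 10 km segment length and the `km_inicio` label are assumptions for illustration, and the notebook's `get_accident_density` may work differently.

```python
import pandas as pd

# Toy records: highway number and kilometer marker.
sample = pd.DataFrame({
    "br": [381.0, 381.0, 381.0, 40.0, 101.0],
    "km": [485.0, 485.0, 487.2, 43.9, 32.0],
})

# Count per exact (br, km) pair, matching the shape of the table above.
density = (
    sample.groupby(["br", "km"])
    .size()
    .reset_index(name="acidentes")
)

# Assumed 10 km segments: floor each marker to the start of its segment.
segment_start = (sample["km"] // 10 * 10).rename("km_inicio")
by_segment = (
    sample.groupby(["br", segment_start])
    .size()
    .reset_index(name="acidentes")
    .sort_values("acidentes", ascending=False)
)
print(by_segment)
```

Records with `br` and `km` both equal to 0.0 (1,237 in the output above) may be placeholders for an unknown location and could be inspected separately before ranking segments.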


## **6. Temporal Analysis of Accidents**
### 6.1 Yearly and Monthly Patterns
We analyze how accidents are distributed across years and months.

In [77]:
# Instantiate the temporal analysis module
temporal_analysis = TemporalAnalysis(df)

# Compute yearly accident statistics (not rendered: a cell shows only its last expression)
temporal_analysis.get_yearly_stats()

# Display monthly accident patterns
temporal_analysis.get_monthly_pattern()


mes,id,mortos,feridos
1,179714,0.06,0.76
2,164889,0.05,0.74
3,178074,0.05,0.72
4,174502,0.05,0.72
5,178294,0.06,0.72
6,172331,0.06,0.73
7,178037,0.06,0.75
8,173041,0.06,0.75
9,172771,0.06,0.75
10,178552,0.06,0.75


## **7. Impact of Weather Conditions**
### 7.1 Statistics by Weather Condition
We analyze the number of accidents and the mean severity for each weather condition.

In [78]:
# Instantiate the weather analysis module
weather_analysis = WeatherAnalysis(df)

# Display the weather statistics
weather_analysis.get_weather_stats()


condicao_metereologica,id (count),mortos (sum),mortos (mean),feridos (sum),feridos (mean)
Ceu Claro,753559,41804,0.06,479822,0.64
Nublado,376487,20779,0.06,274345,0.73
Chuva,335288,15253,0.05,225736,0.67
Céu Claro,331325,27775,0.08,365500,1.1
Sol,242894,9069,0.04,159173,0.66
Ignorada,28422,2469,0.09,19931,0.7
Garoa/Chuvisco,19568,1318,0.07,21061,1.08
Nevoeiro/neblina,16232,1245,0.08,11642,0.72
Ignorado,7925,861,0.11,8078,1.02
Vento,5395,417,0.08,3971,0.74


## **8. Trends Over Time**
### 8.1 Yearly and Monthly Trends
We examine how the number of accidents evolves across years and months.

In [79]:
# Instantiate the trend module
trend_analysis = TrendAnalysis(df)

# Compute the yearly accident trend (not rendered: a cell shows only its last expression)
trend_analysis.get_yearly_trend()

# Display the monthly accident trend
trend_analysis.get_monthly_trend()


ano,id (mes 1),id (mes 2),id (mes 3),id (mes 4),id (mes 5),id (mes 6),id (mes 7),id (mes 8),id (mes 9),id (mes 10),...,feridos (mes 3),feridos (mes 4),feridos (mes 5),feridos (mes 6),feridos (mes 7),feridos (mes 8),feridos (mes 9),feridos (mes 10),feridos (mes 11),feridos (mes 12)
2007,10611,9624,9997,10191,10490,10459,11191,10504,10479,10890,...,6268,6828,6473,6288,7122,6626,6866,6914,6448,8364
2008,11994,10659,11529,11574,11536,11380,11368,11417,11669,11798,...,6851,6807,6765,7060,6818,6670,6942,6871,7469,8900
2009,12264,11733,12689,12186,13186,12775,13487,13548,13019,13932,...,7481,6992,7635,7508,7847,7924,7466,7734,8131,10196
2010,14987,13499,14971,14635,15037,14170,15759,15303,15464,15679,...,8318,7892,8476,7919,9186,8320,8257,8869,8740,10651
2011,16154,14813,15612,16641,16477,15502,16520,15879,15626,16225,...,8365,9310,8784,8688,9190,8747,8600,9133,8391,10057
2012,15710,14537,15369,15548,15176,15277,15909,13403,14836,16152,...,8358,8735,8540,8394,9022,7997,8465,9019,8409,10086
2013,15630,13532,16095,14750,15821,15365,16011,15620,15395,15282,...,8820,7877,8443,8350,8638,8716,8717,8488,8794,10471
2014,15090,14005,14800,14685,14220,13211,13651,13961,13597,13905,...,8613,8212,8248,7987,8012,8462,8137,8570,8415,9461
2015,12230,11422,12448,12064,12709,10681,8592,8214,8049,8188,...,7662,7610,8017,7282,7024,7356,6981,7236,6889,8897
2016,8521,7478,8515,7981,8116,7718,8026,8294,7621,7944,...,7438,6978,7019,6821,7672,7304,6898,7281,6826,8011


## **9. Analysis of the Dataset's Features**
### 9.1 Missing Values
We identify the columns with the most missing values.

In [80]:
# Instantiate the feature analysis module
feature_analysis = FeatureAnalysis(df)

# Display missing values
feature_analysis.get_missing_values()


Unnamed: 0,Valores Ausentes,Porcentagem
tipo_acidente,41,0.001932
br,12,0.000565
km,12,0.000565
classificacao_acidente,10,0.000471
condicao_metereologica,3,0.000141
fase_dia,1,4.7e-05
id,0,0.0
uso_solo,0,0.0
feridos,0,0.0
ignorados,0,0.0


## **10. Conclusions**
After analyzing the dataset along these dimensions, we can identify the factors that most affect accident severity, as well as the spatial and temporal patterns of accidents. This helps direct prevention efforts, such as safety policies for the most critical highways and the highest-risk times of day.