# **Exploratory Data Analysis (EDA) of the Accident Dataset**


This notebook presents a detailed analysis of car accident data from the DATATRAN dataset. The sections below explore correlations, severity, spatial and temporal patterns, the impact of weather conditions, and more. Each section includes the relevant output to aid interpretation.

## 1. Introduction
In this section, we load the dataset and inspect its first rows.

In [6]:
import pandas as pd
import numpy as np
from pathlib import Path
import sys
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from typing import Dict, List


In [7]:
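# Add the project's src directory to the import path.
# Joining with pathlib keeps the path separator portable across operating systems.
notebook_path = Path.cwd()
project_root = str(notebook_path.parent.parent / "src")
sys.path.append(project_root)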

In [8]:
project_root

'd:\\projetos\\roadSafeAi\\RoadSafeAI\\src'

In [9]:
from eda.correlation_analysis import CorrelationAnalysis
from eda.severity_analysis import SeverityAnalysis
from eda.spatial_analysis import SpatialAnalysis
from eda.temporal_analysis import TemporalAnalysis
from eda.weather_analysis import WeatherAnalysis
from eda.trend_analysis import TrendAnalysis
from eda.feature_analysis import FeatureAnalysis
from eda.visualization import AccidentVisualizer

In [22]:
def load_data():
  data_path = Path("../../files/processed/maranhao/datatran_ma_merged_base_2007_2024.csv")
  return pd.read_csv(data_path, sep=";")

In [23]:
df = load_data()
# Show the first rows of the dataset
df.head()

Unnamed: 0,id,data_inversa,dia_semana,horario,uf,br,km,municipio,causa_acidente,tipo_acidente,...,tracado_via,uso_solo,pessoas,mortos,feridos_leves,feridos_graves,ilesos,ignorados,feridos,veiculos
0,402769.0,2007-01-01,Segunda,09:50:00,MA,316,516.1,CAXIAS,Ultrapassagem indevida,Capotamento,...,Curva,Rural,2,0,0,0,2,0,0,2
1,210762.0,2007-01-01,Segunda,03:30:00,MA,222,556.3,BOM JESUS DAS SELVAS,Velocidade incompatível,Saída de Pista,...,Curva,Rural,2,1,1,0,0,0,1,1
2,173998.0,2007-01-01,Segunda,18:00:00,MA,230,23.9,BARAO DE GRAJAU,Animais na Pista,Atropelamento de animal,...,Reta,Rural,4,0,3,0,1,0,3,1
3,173939.0,2007-01-01,Segunda,15:00:00,MA,10,216.8,GOVERNADOR EDISON LOBAO,Defeito na via,Saída de Pista,...,Reta,Rural,1,0,0,0,1,0,0,1
4,175809.0,2007-01-01,Segunda,08:20:00,MA,135,23.0,SAO LUIS,Falta de atenção,Colisão lateral,...,Reta,Rural,2,0,0,0,2,0,0,2


## **2. Initial Exploration of the Dataset**
Here we get an overview of the information the dataset contains.

### 2.1 Dataset Size
First, we check the size of the dataset: the number of records and the number of columns.

In [24]:
# Dataset size (number of rows and columns)
dataset_size = df.shape
print(f"Tamanho do dataset: {dataset_size[0]} registros e {dataset_size[1]} colunas")



Tamanho do dataset: 34153 registros e 25 colunas


### 2.2 Basic Column Information
Next, we list each column's name, data type, non-null count, and null count to better understand the structure of the dataset.

In [25]:
# Collect each column's data type, non-null count, and null count
column_info = pd.DataFrame({
    'Tipo de Dado': df.dtypes,
    'Valores Não Nulos': df.notnull().sum(),
    'Valores Nulos': df.isnull().sum()
})

# Display the summary table for the columns
display(column_info)


Unnamed: 0,Tipo de Dado,Valores Não Nulos,Valores Nulos
id,float64,34153,0
data_inversa,object,34153,0
dia_semana,object,34153,0
horario,object,34153,0
uf,object,34153,0
br,int64,34153,0
km,float64,34153,0
municipio,object,34153,0
causa_acidente,object,34153,0
tipo_acidente,object,34151,2


### 2.3 Descriptive Statistics for Numeric Variables
Now we compute and display descriptive statistics for the numeric variables in the dataset.

In [26]:
# Descriptive statistics for numeric variables
print("\nEstatísticas descritivas para variáveis numéricas:")
stats_df = df.describe().transpose()


styled_stats = stats_df.style.format("{:,.2f}").background_gradient(cmap='coolwarm', axis=None)
display(styled_stats)



Estatísticas descritivas para variáveis numéricas:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,34153.0,18698280.71,34184894.63,132.0,371732.0,698365.0,1267106.0,83529724.0
br,34153.0,166.24,107.75,0.0,135.0,135.0,230.0,402.0
km,34153.0,250.28,203.13,0.0,50.4,250.2,392.1,684.9
pessoas,34153.0,2.41,1.94,1.0,2.0,2.0,3.0,54.0
mortos,34153.0,0.14,0.45,0.0,0.0,0.0,0.0,11.0
feridos_leves,34153.0,0.51,1.08,0.0,0.0,0.0,1.0,37.0
feridos_graves,34153.0,0.3,0.69,0.0,0.0,0.0,0.0,20.0
ilesos,34153.0,1.25,1.36,0.0,0.0,1.0,2.0,49.0
ignorados,34153.0,0.24,0.72,0.0,0.0,0.0,0.0,35.0
feridos,34153.0,0.82,1.37,0.0,0.0,1.0,1.0,50.0


## **3. Correlation Between Variables**
### 3.1 Numeric Correlations

We analyze the correlations between the numeric variables.

In [27]:
# Instantiate the correlation module
corr_analysis = CorrelationAnalysis(df)

# Show the correlation matrix for the numeric variables
corr_analysis.get_numeric_correlations()


Unnamed: 0,id,br,km,pessoas,mortos,feridos_leves,feridos_graves,ilesos,ignorados,feridos,veiculos
id,1.0,-0.01,0.01,-0.02,-0.01,-0.01,-0.03,0.02,-0.07,-0.02,-0.03
br,-0.01,1.0,0.39,-0.01,0.09,0.01,0.04,-0.09,0.0,0.03,-0.12
km,0.01,0.39,1.0,0.03,0.08,0.08,0.05,-0.08,0.03,0.08,-0.07
pessoas,-0.02,-0.01,0.03,1.0,0.26,0.57,0.38,0.54,0.32,0.64,0.35
mortos,-0.01,0.09,0.08,0.26,1.0,0.03,0.13,-0.11,0.12,0.09,0.04
feridos_leves,-0.01,0.01,0.08,0.57,0.03,1.0,0.15,-0.1,0.07,0.87,-0.01
feridos_graves,-0.03,0.04,0.05,0.38,0.13,0.15,1.0,-0.14,0.04,0.62,0.02
ilesos,0.02,-0.09,-0.08,0.54,-0.11,-0.1,-0.14,1.0,-0.07,-0.15,0.36
ignorados,-0.07,0.0,0.03,0.32,0.12,0.07,0.04,-0.07,1.0,0.07,0.45
feridos,-0.02,0.03,0.08,0.64,0.09,0.87,0.62,-0.15,0.07,1.0,0.01
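
The implementation of `CorrelationAnalysis.get_numeric_correlations()` is not shown in this notebook. A table like the one above could plausibly be produced with plain pandas, as in this minimal sketch. The helper name `numeric_correlations` and the toy values are assumptions made for illustration, not part of the project code.

```python
import pandas as pd

# Toy data standing in for a few DATATRAN columns (values are made up).
toy = pd.DataFrame({
    "pessoas":       [2, 2, 4, 1, 2, 5],
    "mortos":        [0, 1, 0, 0, 0, 2],
    "feridos_leves": [0, 1, 3, 0, 0, 1],
    "ilesos":        [2, 0, 1, 1, 2, 2],
    "municipio":     ["CAXIAS", "SAO LUIS", "CAXIAS", "SAO LUIS", "CAXIAS", "SAO LUIS"],
})


def numeric_correlations(frame: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation matrix over the numeric columns, rounded to 2 places."""
    # select_dtypes drops text columns such as municipio before correlating
    return frame.select_dtypes(include="number").corr().round(2)


corr = numeric_correlations(toy)
print(corr)
```

Some of the high values in the matrix above are structural: `feridos` correlates strongly with `feridos_leves` (0.87) and `feridos_graves` (0.62), which would be expected if `feridos` is their sum, so these pairs say little about accident dynamics.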


### **3.2 Categorical Associations**
Here we explore associations between categorical variables using the chi-square test.

In [28]:
# Show associations between categorical variables
corr_analysis.get_categorical_associations()


{'data_inversa x dia_semana': {'chi2': 443988.99999999994, 'p_value': 0.0},
 'data_inversa x horario': {'chi2': 7108366.749370254, 'p_value': 0.0},
 'data_inversa x uf': {'chi2': 0.0, 'p_value': 1.0},
 'data_inversa x municipio': {'chi2': 741267.5908098859,
  'p_value': 3.4261175029965097e-242},
 'data_inversa x tipo_acidente': {'chi2': 243244.26378312858, 'p_value': 0.0},
 'data_inversa x fase_dia': {'chi2': 41466.2026559483,
  'p_value': 2.3859506454299227e-248},
 'data_inversa x sentido_via': {'chi2': 19544.444146556758,
  'p_value': 7.552085525532551e-285},
 'data_inversa x tipo_pista': {'chi2': 13723.099019240082,
  'p_value': 8.338850099802055e-08},
 'data_inversa x tracado_via': {'chi2': 1020876.7991824619, 'p_value': 0.0},
 'data_inversa x uso_solo': {'chi2': 49915.00913381678, 'p_value': 0.0},
 'dia_semana x horario': {'chi2': 16602.27453391302,
  'p_value': 1.170511456985344e-136},
 'dia_semana x uf': {'chi2': 0.0, 'p_value': 1.0},
 'dia_semana x municipio': {'chi2': 2870.261
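
To make the dictionary above easier to read, here is a minimal sketch of a chi-square test of independence for one pair of columns, using `pd.crosstab` and `scipy.stats.chi2_contingency`. The helper `chi2_association` and the toy values are assumptions for illustration. The project's `get_categorical_associations()` may be implemented differently.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_association(frame: pd.DataFrame, col_a: str, col_b: str) -> dict:
    """Chi-square test of independence between two categorical columns."""
    # Contingency table of observed counts for each pair of categories
    table = pd.crosstab(frame[col_a], frame[col_b])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return {"chi2": float(chi2), "p_value": float(p_value), "dof": int(dof)}


# Toy data (made-up values) standing in for DATATRAN categorical columns
toy = pd.DataFrame({
    "uso_solo":    ["Rural", "Rural", "Urbano", "Urbano", "Rural", "Urbano", "Rural", "Urbano"],
    "tracado_via": ["Curva", "Curva", "Reta", "Reta", "Curva", "Reta", "Reta", "Curva"],
    "uf":          ["MA"] * 8,  # a single category, as in this dataset
})

result = chi2_association(toy, "uso_solo", "tracado_via")
constant = chi2_association(toy, "uso_solo", "uf")
print(result)
print(constant)
```

A column with a single category, such as `uf` here, leaves the test with zero degrees of freedom, which is consistent with the `chi2: 0.0, p_value: 1.0` entries reported for every pair involving `uf`. Pairs involving near-unique columns such as `data_inversa` produce very large statistics mostly because the contingency table is huge and sparse, so those p-values should be read with caution.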

## **4. Accident Severity Analysis**
### 4.1 Severity by Cause
We analyze the mean severity of accidents grouped by cause.

In [29]:
# Instantiate the severity module
severity_analysis = SeverityAnalysis(df)

# Show severity by cause
severity_analysis.get_severity_by_cause()


causa_acidente,indice_severidade (mean),mortos (sum),mortos (mean),feridos_graves (sum),feridos_graves (mean),feridos_leves (sum),feridos_leves (mean)
Acessar a via sem observar a presença dos outros veículos,2.16,64,0.12,258,0.49,332,0.63
Acesso irregular,1.76,5,0.08,26,0.41,45,0.70
Acostamento em desnível,2.37,7,0.21,5,0.15,19,0.58
Acumulo de areia ou detritos sobre o pavimento,1.78,1,0.06,2,0.11,16,0.89
Acumulo de água sobre o pavimento,2.78,6,0.17,5,0.14,33,0.92
...,...,...,...,...,...,...,...
Ultrapassagem Indevida,3.51,120,0.30,252,0.63,306,0.77
Ultrapassagem indevida,3.13,283,0.27,489,0.46,726,0.68
Velocidade Incompatível,3.81,80,0.29,115,0.41,232,0.83
Velocidade incompatível,4.44,128,0.20,251,0.40,539,0.86
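
The table above lists some causes twice with different capitalization (for example `Ultrapassagem Indevida` and `Ultrapassagem indevida`), which splits their totals across two rows. The sketch below shows one way to normalize the labels before aggregating. The weights in `WEIGHTS` are hypothetical: the notebook does not show how `SeverityAnalysis` defines `indice_severidade`, so the numbers here will not reproduce the table.

```python
import pandas as pd

# Hypothetical weights, NOT the definition used by SeverityAnalysis.
WEIGHTS = {"mortos": 5, "feridos_graves": 3, "feridos_leves": 1}

# Toy data (made-up counts) with case-variant cause labels
toy = pd.DataFrame({
    "causa_acidente": [
        "Ultrapassagem Indevida", "Ultrapassagem indevida",
        "Velocidade incompatível", "Velocidade Incompatível",
        "Animais na Pista",
    ],
    "mortos":         [1, 0, 0, 1, 0],
    "feridos_graves": [0, 1, 2, 0, 0],
    "feridos_leves":  [2, 0, 1, 0, 3],
})


def severity_by_cause(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean severity index and casualty totals per normalized cause."""
    out = frame.copy()
    # Merge labels that differ only by surrounding spaces or capitalization
    out["causa_acidente"] = out["causa_acidente"].str.strip().str.capitalize()
    # Weighted sum of casualties per accident (assumed definition)
    out["indice_severidade"] = sum(out[col] * w for col, w in WEIGHTS.items())
    return (
        out.groupby("causa_acidente")
        .agg(
            indice_severidade=("indice_severidade", "mean"),
            mortos=("mortos", "sum"),
            feridos_graves=("feridos_graves", "sum"),
            feridos_leves=("feridos_leves", "sum"),
            acidentes=("mortos", "size"),
        )
        .sort_values("indice_severidade", ascending=False)
    )


result = severity_by_cause(toy)
print(result)
```

Reporting the number of accidents per cause next to the mean helps flag causes whose high mean severity rests on only a handful of records.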


## **5. Spatial Analysis of Accidents**
### 5.1 Statistics by State
We analyze accident statistics by state (UF). Because every record in this file belongs to Maranhão (MA), the result has a single row.

In [30]:
# Instantiate the spatial analysis module
spatial_analysis = SpatialAnalysis(df)

# Show statistics by state
spatial_analysis.get_state_stats()


uf,id (count),mortos (sum),mortos (mean),feridos (sum),feridos (mean),veiculos (sum)
MA,34153,4749,0.14,27863,0.82,64283


### 5.2 Accident Density by Road Segment
We check the number of accidents per highway (`br`) and kilometer marker (`km`).

In [31]:
# Show accident density by road segment
spatial_analysis.get_accident_density()


Unnamed: 0,br,km,acidentes
0,0,0.0,14
1,1,168.0,1
2,10,0.0,5
3,10,12.6,1
4,10,15.5,1
...,...,...,...
9520,402,156.7,1
9521,402,164.0,2
9522,402,164.7,1
9523,402,168.0,1
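
Counting by exact kilometer marker yields thousands of groups, most of them with a single accident. The sketch below reproduces that kind of count and adds an optional binning step that groups markers into fixed-length segments. The function name `accident_density`, the `segment_km` parameter, and the toy values are assumptions for illustration, not the project's `SpatialAnalysis` code.

```python
from typing import Optional

import pandas as pd

# Toy data (made-up records) with highway number and kilometer marker
toy = pd.DataFrame({
    "br": [135, 135, 135, 316, 316, 10],
    "km": [23.0, 23.0, 31.5, 516.1, 516.1, 216.8],
})


def accident_density(frame: pd.DataFrame, segment_km: Optional[float] = None) -> pd.DataFrame:
    """Count accidents per (br, km); optionally bin km into fixed-length segments."""
    data = frame.copy()
    if segment_km is not None:
        # Replace each marker with the start of its segment, e.g. 31.5 -> 0.0 for 50 km bins
        data["km"] = (data["km"] // segment_km) * segment_km
    return data.groupby(["br", "km"]).size().reset_index(name="acidentes")


exact = accident_density(toy)
binned = accident_density(toy, segment_km=50)
print(exact)
print(binned.sort_values("acidentes", ascending=False))
```

Sorting the binned result by `acidentes` gives a quick ranking of candidate hotspots. The row with `br = 0` and `km = 0.0` in the output above may correspond to records without a valid location and is worth checking before ranking.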


## **6. Temporal Analysis of Accidents**
### 6.1 Yearly and Monthly Patterns
We analyze the distribution of accidents across years and months.

In [32]:
# Instantiate the temporal analysis module
temporal_analysis = TemporalAnalysis(df)

# Yearly accident patterns (not rendered below: a cell shows only its last expression)
temporal_analysis.get_yearly_stats()

# Monthly accident patterns
temporal_analysis.get_monthly_pattern()


mes,id,mortos,feridos
1,2959,0.14,0.82
2,2686,0.13,0.78
3,2841,0.13,0.76
4,2780,0.11,0.75
5,2829,0.13,0.78
6,2630,0.14,0.81
7,2926,0.14,0.89
8,2807,0.14,0.89
9,2871,0.15,0.82
10,2816,0.14,0.82
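
A monthly table with this shape (accident count, mean deaths, and mean injured per month) could be derived from `data_inversa` roughly as follows. This is a sketch under the assumption that `data_inversa` holds ISO dates, as in the first rows shown earlier. The helper `monthly_pattern` and the toy values are hypothetical.

```python
import pandas as pd

# Toy data (made-up records) using the DATATRAN column names
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "data_inversa": ["2007-01-01", "2007-01-15", "2007-02-03", "2008-01-20", "2008-02-11"],
    "mortos": [0, 1, 0, 0, 2],
    "feridos": [1, 0, 3, 0, 1],
})


def monthly_pattern(frame: pd.DataFrame) -> pd.DataFrame:
    """Accident count plus mean deaths and injuries per calendar month."""
    dates = pd.to_datetime(frame["data_inversa"], format="%Y-%m-%d")
    return (
        frame.assign(mes=dates.dt.month)
        .groupby("mes")
        .agg(id=("id", "count"), mortos=("mortos", "mean"), feridos=("feridos", "mean"))
        .round(2)
    )


monthly = monthly_pattern(toy)
print(monthly)
```

Because the months are pooled over all years, this view shows seasonality rather than trend. A year with partial data would bias the counts for the months it covers.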


## **7. Impact of Weather Conditions**
### 7.1 Statistics by Weather Condition
We analyze the number of accidents and the mean severity for each weather condition.

In [33]:
# Instantiate the weather analysis module
weather_analysis = WeatherAnalysis(df)

# Show the weather statistics
weather_analysis.get_weather_stats()


condicao_metereologica,id (count),mortos (sum),mortos (mean),feridos (sum),feridos (mean)
Ceu Claro,13988,1704,0.12,10194,0.73
Céu Claro,6678,1365,0.2,7190,1.08
Sol,4317,401,0.09,2986,0.69
Chuva,3742,362,0.1,2976,0.8
Nublado,3686,472,0.13,2946,0.8
Ignorada,1034,247,0.24,858,0.83
Ignorado,294,93,0.32,285,0.97
Garoa/Chuvisco,211,39,0.18,205,0.97
Nevoeiro/neblina,126,49,0.39,150,1.19
Vento,33,5,0.15,33,1.0
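
The output above contains label variants that describe the same condition (`Ceu Claro` and `Céu Claro`, `Ignorada` and `Ignorado`), so their statistics are split. The sketch below merges them by stripping accents, lowercasing, and applying a small alias table. It uses the counts reported above as input. The `ALIASES` mapping and the helper `normalize_label` are assumptions for illustration.

```python
import unicodedata

import pandas as pd

# Hypothetical alias table for variants that differ by more than accents or case
ALIASES = {"ignorada": "ignorado"}


def normalize_label(label: str) -> str:
    """Strip accents, lowercase, trim, then apply known aliases."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    key = no_accents.strip().lower()
    return ALIASES.get(key, key)


# Accident and death counts copied from four rows of the table above
reported = pd.DataFrame({
    "condicao_metereologica": ["Ceu Claro", "Céu Claro", "Ignorada", "Ignorado"],
    "acidentes": [13988, 6678, 1034, 294],
    "mortos": [1704, 1365, 247, 93],
})

reported["condicao_norm"] = reported["condicao_metereologica"].map(normalize_label)
merged = reported.groupby("condicao_norm")[["acidentes", "mortos"]].sum()
# Recompute the mean from the merged totals rather than averaging the two means
merged["mortos_mean"] = (merged["mortos"] / merged["acidentes"]).round(2)
print(merged)
```

In a real pipeline the same normalization would be applied to the raw column before calling the analysis modules, so that every downstream table benefits from it.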


## **8. Trends Over Time**
### 8.1 Yearly and Monthly Trends
We check how the number of accidents evolves across years and months.

In [34]:
# Instantiate the trend module
trend_analysis = TrendAnalysis(df)

# Yearly accident trend (not rendered below: a cell shows only its last expression)
trend_analysis.get_yearly_trend()

# Monthly accident trend
trend_analysis.get_monthly_trend()


ano,id (mes 1),id (mes 2),id (mes 3),id (mes 4),id (mes 5),id (mes 6),id (mes 7),id (mes 8),id (mes 9),id (mes 10),...,feridos (mes 3),feridos (mes 4),feridos (mes 5),feridos (mes 6),feridos (mes 7),feridos (mes 8),feridos (mes 9),feridos (mes 10),feridos (mes 11),feridos (mes 12)
2007,163,181,195,186,171,168,218,156,187,145,...,165,106,172,147,176,98,190,115,112,160
2008,256,177,214,189,174,165,168,203,184,181,...,134,118,113,134,167,127,128,131,148,157
2009,185,194,193,168,197,171,192,185,188,198,...,182,136,153,112,149,123,129,140,160,162
2010,215,188,196,210,201,214,231,245,203,220,...,109,140,146,140,168,155,158,186,183,158
2011,258,224,215,267,257,256,259,226,291,266,...,149,181,221,186,208,188,166,171,158,164
2012,230,260,272,258,248,233,276,218,262,269,...,207,170,116,148,167,159,146,167,148,159
2013,269,238,272,229,241,214,267,268,228,222,...,173,117,123,124,189,188,150,140,127,203
2014,249,228,216,229,225,157,212,211,226,233,...,145,156,139,105,139,120,119,168,144,170
2015,200,178,205,198,227,154,150,145,118,142,...,146,118,147,98,128,207,90,94,131,169
2016,144,127,135,104,125,118,122,119,118,127,...,102,70,112,93,107,111,109,151,153,134
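
The wide year-by-month layout above is what a pivot table produces. As a rough sketch, assuming `data_inversa` holds ISO dates, both the monthly and the yearly views could be built as shown below. The toy values are made up, and the column order may differ from the output of `TrendAnalysis`.

```python
import pandas as pd

# Toy data (made-up records) using the DATATRAN column names
toy = pd.DataFrame({
    "id": list(range(1, 7)),
    "data_inversa": [
        "2007-01-01", "2007-01-09", "2007-02-03",
        "2008-01-20", "2008-02-11", "2008-02-12",
    ],
    "feridos": [1, 0, 3, 0, 1, 2],
})

dates = pd.to_datetime(toy["data_inversa"], format="%Y-%m-%d")
toy = toy.assign(ano=dates.dt.year, mes=dates.dt.month)

# One row per year, one column per (measure, month) pair
monthly_trend = toy.pivot_table(
    index="ano",
    columns="mes",
    values=["id", "feridos"],
    aggfunc={"id": "count", "feridos": "sum"},
    fill_value=0,
)

# Compact yearly view: accident count and total injured per year
yearly_trend = toy.groupby("ano").agg(acidentes=("id", "count"), feridos=("feridos", "sum"))

print(monthly_trend)
print(yearly_trend)
```

The yearly view is usually the easier one to read for long-run trends, while the pivot helps spot individual months that deviate from their usual level.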


## **9. Analysis of the Dataset Features**
### 9.1 Missing Values
We identify the columns with the most missing values.

In [35]:
# Instantiate the feature analysis module
feature_analysis = FeatureAnalysis(df)

# Show missing values
feature_analysis.get_missing_values()


Unnamed: 0,Valores Ausentes,Porcentagem
tipo_acidente,2,0.005856
id,0,0.0
condicao_metereologica,0,0.0
feridos,0,0.0
ignorados,0,0.0
ilesos,0,0.0
feridos_graves,0,0.0
feridos_leves,0,0.0
mortos,0,0.0
pessoas,0,0.0
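
The `Porcentagem` column above is expressed in percent: 2 missing values out of 34153 records is about 0.005856%. A report with the same two columns could be computed as in this minimal sketch. The helper `missing_values` and the toy data are assumptions, not the project's `FeatureAnalysis` code.

```python
import numpy as np
import pandas as pd

# Toy data (made-up records) with two missing accident types
toy = pd.DataFrame({
    "id": [1.0, 2.0, 3.0, 4.0],
    "tipo_acidente": ["Capotamento", None, "Colisão lateral", np.nan],
    "mortos": [0, 1, 0, 0],
})


def missing_values(frame: pd.DataFrame) -> pd.DataFrame:
    """Missing-value count and percentage per column, most affected first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "Valores Ausentes": counts,
        "Porcentagem": counts / len(frame) * 100,
    })
    return report.sort_values("Valores Ausentes", ascending=False)


report = missing_values(toy)
print(report)
```

Note that this only counts true nulls. Placeholder categories such as `Ignorado` in `condicao_metereologica` are not nulls, so they do not appear in a report like this even though they also represent unknown information.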


## **10. Conclusions**
After analyzing the dataset along these dimensions, we can identify the factors that most affect accident severity, along with the spatial and temporal patterns of accidents. These findings help direct prevention efforts, such as safety policies that target the most critical highways and the highest-risk times of day.