The rendered version is at [`/pdf_notebooks/02-US_data_analysis.pdf`](./pdf_notebooks/02-US_data_analysis.pdf)

In [None]:
import pandas as pd
import numpy as np
import re
import math

In [None]:
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

In this notebook, we analyse the US Traffic Accident dataset to derive insights and select features for predictive models.

## 1. Load Data

In [None]:
%%time
data = pd.read_csv("data/US_Accidents_March23_Clean.csv")
data.shape

The dataset contains close to 7 million rows and takes up 1.5 GB or more. From this point on, we focus solely on the years 2021 and 2022 so that our insights and predictive models rest on the most recent and relevant data available.
* **Recent and Relevant Data:** The years 2021 and 2022 give the most current picture of the factors that influence accident severity.
* **Higher Data Volume:** Although 2021 and 2022 make up less than half of the dataset (43%), they still contain a substantial number of recorded accidents, which supports robust analysis and modeling.
* **Accuracy in Predictions:** By analyzing recent years, we aim to produce predictive models that reflect present-day accident trends and conditions, which should make our forecasts more reliable.
* **Resource Optimization:** Restricting the analysis to these years means less data to process and concentrates effort on the data most likely to yield actionable insights.
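
On the resource point, the year filter could in principle be applied while the file is being read, so that the full table never has to sit in memory. The following is a minimal sketch and not the notebook's method: `load_years` is a hypothetical helper, the toy CSV is made-up data standing in for the real file, and the only assumption about the dataset is that it has a `Year` column.

```python
import io
import pandas as pd

def load_years(source, years, chunksize=100_000):
    """Read a CSV in chunks and keep only rows whose Year is in `years`.

    Hypothetical helper for illustration; assumes a `Year` column exists.
    """
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        kept.append(chunk[chunk["Year"].isin(years)])
    return pd.concat(kept, ignore_index=True)

# Toy CSV standing in for the real file (values are made up)
toy_csv = io.StringIO(
    "ID,Year,Severity\n"
    "a,2019,2\n"
    "b,2021,3\n"
    "c,2022,2\n"
    "d,2020,4\n"
    "e,2022,1\n"
)
recent = load_years(toy_csv, years=[2021, 2022], chunksize=2)
print(recent.shape)
```

The same `isin` test can replace the paired `==` comparisons used in the cells below if more years are ever added.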

In [None]:
print(f'Portion of data for 2021 to 2022: {100*data[(data["Year"] == 2021) | (data["Year"] == 2022)].shape[0] / data.shape[0]:.2f}%')
print(f'Portion of data for other years: {100*data[(data["Year"] != 2021) & (data["Year"] != 2022)].shape[0] / data.shape[0]:.2f}%')

In [None]:
fig = px.histogram(data, x='Year', title='Distribution of Accidents by Year')
fig.update_layout(width=450, height=350)
fig.show(config={'staticPlot': True})

In [None]:
data = data[(data["Year"] == 2021) | (data["Year"] == 2022)].copy()
data.shape

The dataset is also heavily imbalanced: accidents of `Severity` 2 make up over 80% of all records. We downsample this class to three times the size of the second-largest class, which brings it closer to the other classes and makes the analysis more effective.

In [None]:
downsampled_count = int(data["Severity"].value_counts().sort_values(ascending=False).iloc[1] * 3.0)
downsampled_count

In [None]:
df_majority_downsampled = data[data["Severity"] == 2].sample(n=downsampled_count, random_state=42)
df_rest = data[data["Severity"] != 2]
df_balanced = pd.concat([df_majority_downsampled, df_rest])

In [None]:
df_balanced.shape

We are down to about 940,000 rows, which is more manageable.
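
The downsampling steps above can be gathered into one reusable function. This is a sketch under stated assumptions: `downsample_majority` is a hypothetical name, the toy frame is made-up data, and the `min` guard is an addition that is not in the cells above (it avoids asking `sample` for more rows than the majority class holds).

```python
import pandas as pd

def downsample_majority(frame, target="Severity", ratio=3.0, seed=42):
    """Cap the largest class at `ratio` times the second-largest class.

    Hypothetical helper; assumes `target` has at least two classes.
    """
    counts = frame[target].value_counts()  # sorted in descending order
    majority_label = counts.index[0]
    # Added guard: never request more rows than the majority class has
    cap = int(min(counts.iloc[1] * ratio, counts.iloc[0]))
    majority = frame[frame[target] == majority_label].sample(n=cap, random_state=seed)
    rest = frame[frame[target] != majority_label]
    return pd.concat([majority, rest])

# Made-up class sizes: 90 / 6 / 3 / 1
toy = pd.DataFrame({"Severity": [2] * 90 + [3] * 6 + [4] * 3 + [1] * 1})
balanced = downsample_majority(toy)
print(balanced["Severity"].value_counts())
```

Fixing `random_state` keeps the sample reproducible between runs, as in the original cell.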

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=data['Severity'], opacity=0.7, name='Original Data'))
fig.add_trace(go.Histogram(x=df_balanced['Severity'], opacity=0.7, name='Sampled Data'))
fig.update_layout(
    height=300, width=450,
    title='Comparison of Severity Distribution',
    xaxis_title='Severity', yaxis_title='Count',
    barmode='overlay', bargap=0.1, bargroupgap=0.1,
    xaxis=dict(tickmode='linear',  tick0=min(data['Severity']), dtick=1)
)
# Show the plot
fig.show(config={'staticPlot': True})

In [None]:
df = df_balanced.copy()

In [None]:
# Free the memory held by the intermediate frames
del data
del df_balanced

We convert the date and time columns to proper datetime dtypes.

In [None]:
df["Start_Time"] = pd.to_datetime(df["Start_Time"])
df["End_Time"] = pd.to_datetime(df["End_Time"])
df["Date"] = pd.to_datetime(df["Date"])
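
If some timestamps turn out to be malformed, `pd.to_datetime` raises by default. A hedged sketch of a more tolerant conversion follows, using made-up rows; `errors="coerce"` turns unparseable values into `NaT` so they can be counted, and `Duration_min` is a hypothetical column name used only for this illustration.

```python
import pandas as pd

# Made-up rows; the third start time is deliberately invalid
toy = pd.DataFrame({
    "Start_Time": ["2021-03-01 08:15:00", "2022-07-04 17:30:00", "not a date"],
    "End_Time": ["2021-03-01 09:00:00", "2022-07-04 19:00:00", "2022-01-01 00:00:00"],
})

for col in ["Start_Time", "End_Time"]:
    # Unparseable strings become NaT instead of raising an exception
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Hypothetical derived column: duration in minutes
toy["Duration_min"] = (toy["End_Time"] - toy["Start_Time"]).dt.total_seconds() / 60

print(toy.dtypes)
print("Unparseable start times:", toy["Start_Time"].isna().sum())
```

Counting the `NaT` values after conversion shows how many rows were affected before deciding whether to drop or repair them.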

## 2. Descriptive Analysis

In [None]:
numerical_vars = df.select_dtypes(include=['number']).columns.tolist()
boolean_vars = df.select_dtypes(include=['bool']).columns.tolist()
categorical_vars = df.select_dtypes(include=['object','category']).columns.tolist()
datetime_vars = df.select_dtypes(include=['datetime']).columns.tolist()

### 1. Univariate Analysis

#### Numerical and ordinal categorical variables
* **Hourly Patterns:** Peak accident times are during the late afternoon (16:00 - 17:00), likely due to the evening rush hour. The early morning hours (2:00 - 5:00) have the fewest accidents.
* **Daily Patterns:** Accidents are evenly spread across the days of the month, with minor fluctuations. This indicates no specific days are particularly prone to accidents.
* **Weekly Patterns:** Weekdays see a higher number of accidents compared to weekends. Fridays have the highest number of accidents, possibly due to end-of-week fatigue and increased travel. Sundays have the fewest, suggesting reduced traffic.
* **Monthly Patterns:** December has the highest number of accidents, possibly due to winter weather and holiday travel. October has the lowest, which might be attributed to milder weather.
* **Weather Conditions:** Most accidents occur under clear and cloudy conditions, with fewer accidents in severe weather such as snowstorms and thunderstorms. The mean temperature during accidents is 63°F, and accidents occur across a wide range of temperatures. The average visibility is 9 miles, and wind speeds are generally low (mean of 7.38 mph), although extreme values point to occasional severe conditions.
* **Distance and Duration:** The median accident duration is approximately 78 minutes, and the wide range of durations reflects variability in accident severity and response times. The average road distance affected by an accident is short (0.73 miles); in most cases the affected area is at or near the accident location.
* **Traffic Features:** Traffic signals, crossings, and junctions are common at accident sites. Notably, a significant portion of accidents occur at night (30.20%) and on highways (32.63%), suggesting these conditions require special attention for safety improvements.
* **State-Level Insights:** California and Florida have the highest number of accidents, reflecting their large populations and extensive road networks. States like Wyoming and Vermont have significantly fewer accidents, likely due to smaller populations and less traffic.
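
The hourly, weekly, and monthly counts behind these observations could be obtained from `Start_Time` along the following lines. This is a sketch on made-up timestamps; the cleaned dataset may already carry equivalent columns, and the variable names here are illustrative only.

```python
import pandas as pd

# Made-up timestamps for illustration
toy = pd.DataFrame({"Start_Time": pd.to_datetime([
    "2021-12-03 16:20:00",  # Friday
    "2021-12-03 17:05:00",  # Friday
    "2021-12-05 03:10:00",  # Sunday
    "2022-10-12 16:45:00",  # Wednesday
])})

# Count accidents per hour of day, per weekday name, and per month number
by_hour = toy["Start_Time"].dt.hour.value_counts().sort_index()
by_weekday = toy["Start_Time"].dt.day_name().value_counts()
by_month = toy["Start_Time"].dt.month.value_counts().sort_index()

print(by_hour)
print(by_weekday)
print(by_month)
```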

In [None]:
len(numerical_vars)

In [None]:
df[numerical_vars].info()

In [None]:
df["Severity"].value_counts()

In [None]:
df[numerical_vars].describe()

In [None]:
fig = make_subplots(rows=4, cols=4, subplot_titles=numerical_vars)
for i, col in enumerate(numerical_vars):
    row = i // 4 + 1
    col_pos = i % 4 + 1
    fig.add_trace(
        go.Histogram(x=df[col], nbinsx=25 if col != 'Severity' else 4, showlegend=False),
        row=row, col=col_pos
    )

# Update layout
fig.update_layout(height=700, width=1120, title_text="Histograms of Numerical Variables")
fig.update_xaxes(tickvals=[1, 2, 3, 4], row=1, col=1)
fig.update_xaxes(tickvals=list(range(1, 13)), row=4, col=3)

# Show plot
fig.show(config={'staticPlot': True})

#### Boolean variables

In [None]:
len(boolean_vars)

In [None]:
df[boolean_vars].info()

In [None]:
true_counts = df[boolean_vars].sum()
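
Percentages such as the share of accidents near a given road feature follow directly from boolean columns, because the mean of a boolean column is its fraction of `True` values. A minimal sketch on made-up data; the column names `Junction` and `Traffic_Signal` are assumed here for illustration.

```python
import pandas as pd

# Made-up flags; column names are assumptions for this example
toy = pd.DataFrame({
    "Junction": [True, False, False, False],
    "Traffic_Signal": [True, True, False, False],
})

bool_cols = toy.select_dtypes(include=["bool"]).columns
# Mean of a boolean column = fraction of True values
true_share = (100 * toy[bool_cols].mean()).round(2)
print(true_share)
```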

In [None]:
num_plots = len(boolean_vars)
num_cols = 8
num_rows = math.ceil(num_plots / num_cols)

# Create Plotly figure with subplots
fig = make_subplots(rows=num_rows, cols=num_cols, subplot_titles=boolean_vars, 
                    specs=[[{'type':'pie'}]*num_cols]*num_rows)
# Populate subplots with pie charts
for i, column in enumerate(boolean_vars):
    row = i // num_cols + 1  # Plotly subplots start from row 1
    col = i % num_cols + 1   # Plotly subplots start from col 1
    counts = df[column].value_counts()
    fig.add_trace(
        go.Pie(labels=counts.index, values=counts, textinfo='percent', sort=False),
        row=row, col=col
    )
# Update layout once, after all traces are added
fig.update_layout(
    title='Distribution of Boolean Columns',
    height=350, width=1000,
    template='plotly_white',
    font=dict(size=10),
    margin=dict(l=10, r=10, t=40, b=10),  # Adjust margins for better layout
    showlegend=False,
)

fig.show(config={'staticPlot': True})

#### Categorical variables

Some variables are not very useful in their raw form, so we use their transformed versions or new variables extracted from them instead.

In [None]:
categorical_vars.remove("Street")
categorical_vars.remove("Weather_Condition")
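
As an illustration of what a transformed version of a raw column can look like, free-text weather descriptions could be collapsed into a few coarse categories. The rules below are hypothetical and are not the mapping used to build this dataset's `Weather_Category` column, which is not shown in this notebook.

```python
import pandas as pd

# Hypothetical keyword-to-category rules, checked in order
RULES = [
    ("thunder", "Storm"),
    ("snow", "Snow"),
    ("rain", "Rain"),
    ("cloud", "Cloudy"),
    ("overcast", "Cloudy"),
    ("clear", "Clear"),
    ("fair", "Clear"),
]

def weather_category(condition):
    """Map a free-text weather description to a coarse category."""
    if pd.isna(condition):
        return "Unknown"
    text = condition.lower()
    for keyword, category in RULES:
        if keyword in text:
            return category
    return "Other"

# Made-up sample values
conditions = pd.Series(
    ["Light Rain", "Mostly Cloudy", "Heavy Snow", "Fair", None, "Thunderstorm", "Haze"]
)
categories = conditions.map(weather_category)
print(categories.tolist())
```

Because the rules are checked in order, a description matching several keywords takes the first matching category.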

In [None]:
len(categorical_vars)

In [None]:
df[categorical_vars].info()

In [None]:
print("Unique values counts")
for col in categorical_vars:
    print(f"{col}: {df[col].unique().shape[0]}")
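
Note that `unique()` counts a missing value as its own entry, whereas `nunique()` skips missing values by default. A small sketch on made-up data shows the difference:

```python
import pandas as pd

# Made-up data with one missing State
toy = pd.DataFrame({
    "State": ["CA", "FL", "CA", None],
    "City": ["LA", "Miami", "LA", "LA"],
})

# unique() includes the missing value; nunique() drops it by default
with_missing = {col: toy[col].unique().shape[0] for col in toy.columns}
without_missing = toy.nunique().to_dict()

print(with_missing)
print(without_missing)
```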

In [None]:
# Create a subplot for Weather Category, City, and County
fig = make_subplots(rows=1, cols=3, subplot_titles=['Weather Category', 'City', 'County'])

# Plot Weather Category (bar plot)
weather_cat_counts = df['Weather_Category'].value_counts()
fig.add_trace(go.Bar(x=weather_cat_counts.index, y=weather_cat_counts.values, marker_color='orange'), row=1, col=1)

# Plot City (bar plot, 15 most frequent)
city_counts = df['City'].value_counts().nlargest(15)
fig.add_trace(go.Bar(x=city_counts.index, y=city_counts.values, marker_color='green'), row=1, col=2)

# Plot County (bar plot, 15 most frequent)
county_counts = df['County'].value_counts().nlargest(15)
fig.add_trace(go.Bar(x=county_counts.index, y=county_counts.values, marker_color='purple'), row=1, col=3)

fig.update_layout(title='Weather Category, Top Cities and Top Counties',
                  height=340, width=900, showlegend=False)

fig.show(config={'staticPlot': True})

In [None]:
# Plot State (choropleth map)
state_counts = df['State'].value_counts().reset_index()
state_counts.columns = ['State', 'Counts']

# Create choropleth map for State
fig = go.Figure(data=go.Choropleth(
    locations=state_counts['State'],
    z=state_counts['Counts'],
    locationmode='USA-states',
    colorscale='Reds',
    colorbar_title='Number of Accidents'
))

fig.update_layout(
    title_text='Total Number of Accidents in the US (2021-2022)',
    geo=dict(scope='usa', projection_type='albers usa'),
    height=400, width=800,
)

# Show the plot
fig.show(config={'staticPlot': True})