The rendered version is at `/pdf_notebooks/02-data_analysis.pdf` 

In [None]:
import pandas as pd
import numpy as np
import re
import math

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
%matplotlib inline

In [None]:
!pip install -U kaleido

In [None]:
from scipy.stats import spearmanr, chi2_contingency, pointbiserialr

In this notebook, we analyse the US Traffic Accident dataset to derive insights and select features for predictive models.

In [None]:
%%time
data = pd.read_csv("data/US_Accidents_March23_Clean.csv")
data.shape

In [None]:
%%time
fig = px.histogram(data, x='Year', title='Distribution of Accidents by Year')
fig.update_layout(width=450, height=350)
fig.show(config={'staticPlot': True})

In [None]:
print(f'Portion of data for 2021 to 2022: {data[(data["Year"] == 2021) | (data["Year"] == 2022)].shape[0] / data.shape[0]:.2%}')
print(f'Portion of data for other years: {data[(data["Year"] != 2021) & (data["Year"] != 2022)].shape[0] / data.shape[0]:.2%}')

The dataset contains over 7 million rows and takes up 1.5 GB or more. From here on, we focus solely on the years 2021 and 2022 so that our insights and predictive models are based on the most recent and relevant data available.
* Recent and Relevant Data: Data from 2021 and 2022 best reflects the factors that currently influence accident severity.
* High Data Volume: Although these two years make up only part of the dataset (43%), they still contain a substantial number of recorded accidents, which supports robust analysis and modeling.
* Accuracy in Predictions: By analyzing recent years, we aim to produce predictive models that accurately reflect present-day accident trends and conditions, enhancing the reliability of our forecasts.
* Resource Optimization: Prioritizing these years optimizes our resources (less data to process) by concentrating efforts on data that is more likely to yield actionable insights.

In [None]:
data = data[(data["Year"] == 2021) | (data["Year"] == 2022)].copy()
data.shape

The dataset is highly imbalanced: severity level 2 makes up 89% of the data. We understand that nearly all reported accidents in the US were of severity 2. However, since we are mainly interested in the factors behind accident severity, we downsample the majority class to make the dataset more balanced for analysis and modeling.

In [None]:
data["Severity"].value_counts()

In [None]:
data["Severity"].value_counts() / len(data)

In [None]:
downsampled_count = int(data["Severity"].value_counts().sort_values(ascending=False).iloc[1] * 1.55)
downsampled_count

In [None]:
df_majority_downsampled = data[data["Severity"] == 2].sample(n=downsampled_count, random_state=42)
df_rest = data[data["Severity"] != 2]
df_balanced = pd.concat([df_majority_downsampled, df_rest])

In [None]:
df_balanced["Severity"].value_counts() / len(df_balanced)

In [None]:
df_balanced.shape
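
The downsampling step above can be illustrated on a toy frame. This is a minimal sketch rather than the notebook's exact pipeline: `downsample_majority` and its `ratio` parameter are hypothetical names, and the toy labels are made up for illustration.

```python
import pandas as pd

def downsample_majority(frame, label_col, ratio, seed=42):
    """Cap the largest class at `ratio` times the size of the second-largest class."""
    counts = frame[label_col].value_counts()  # sorted in descending order
    majority_label = counts.index[0]
    # Never ask for more rows than the majority class actually has
    target = min(int(counts.iloc[1] * ratio), int(counts.iloc[0]))
    majority = frame[frame[label_col] == majority_label].sample(n=target, random_state=seed)
    rest = frame[frame[label_col] != majority_label]
    return pd.concat([majority, rest])

# Toy data (made up): 90 rows of class 2, 6 of class 3, 4 of class 4
toy = pd.DataFrame({"Severity": [2] * 90 + [3] * 6 + [4] * 4})
balanced = downsample_majority(toy, "Severity", ratio=1.55)
print(balanced["Severity"].value_counts())
```

On this toy frame the majority class is cut from 90 to 9 rows (`int(6 * 1.55)`), while the minority classes are kept whole.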

We are down to a bit more than half a million rows, which is still a lot given our resources. We downsample again while trying to maintain the distribution of key variables.

In [None]:
%%time
category_year_proportions = df_balanced.groupby(['Severity', 'Year']).size() / len(df_balanced)
sample_size = 0.8
# Sample based on the proportions of each Severity and Year combination
df = df_balanced.groupby(['Severity', 'Year'], group_keys=False).apply(
    lambda x: x.sample(frac=sample_size * category_year_proportions[x.name], random_state=42)
).reset_index(drop=True)

df.shape
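
Note that a per-group fraction that scales with the group's share samples large groups more heavily than small ones. If the goal is to keep the `Severity` and `Year` proportions unchanged, one option is a constant fraction for every group. Below is a minimal sketch under that assumption, on made-up toy data (`toy` and `frac` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Toy data (made up): two stratification keys with unequal group sizes
toy = pd.DataFrame({
    "Severity": [2] * 60 + [3] * 30 + [4] * 10,
    "Year": ([2021] * 30 + [2022] * 30) + ([2021] * 20 + [2022] * 10) + ([2021] * 4 + [2022] * 6),
})

frac = 0.5  # same fraction for every (Severity, Year) group

# GroupBy.sample draws `frac` of the rows inside each group
sampled = toy.groupby(["Severity", "Year"]).sample(frac=frac, random_state=42)

# Joint shares before and after sampling
before = toy.groupby(["Severity", "Year"]).size() / len(toy)
after = sampled.groupby(["Severity", "Year"]).size() / len(sampled)
print(pd.DataFrame({"before": before, "after": after}))
```

With a constant fraction, each group shrinks by the same factor, so the joint shares of the sample match those of the input (up to rounding of group sizes).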

We then compare the `Year` and `Severity` distributions of the original and sampled data.

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=data['Year'], opacity=0.7, name='Original Data'))
fig.add_trace(go.Histogram(x=df['Year'], opacity=0.7, name='Sampled Data'))
fig.update_layout(
    height=300, width=450,
    title='Comparison of Year Distribution',
    xaxis_title='Year',
    yaxis_title='Count',
    barmode='overlay',
    bargap=0.1,
    bargroupgap=0.1,
    xaxis=dict(
        tickmode='linear',  tick0=min(data['Year']), dtick=1
    )
)
# Show the plot
fig.show(config={'staticPlot': True})

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=data['Severity'], opacity=0.7, name='Original Data'))
fig.add_trace(go.Histogram(x=df['Severity'], opacity=0.7, name='Sampled Data'))
fig.update_layout(
    height=300, width=450,
    title='Comparison of Severity Distribution',
    xaxis_title='Severity',
    yaxis_title='Count',
    barmode='overlay',
    bargap=0.1,
    bargroupgap=0.1,
    xaxis=dict(
        tickmode='linear',  tick0=min(data['Severity']), dtick=1
    )
)
# Show the plot
fig.show(config={'staticPlot': True})
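
The visual comparison can be complemented with a numerical one. This is a minimal sketch, assuming two Series holding the same categorical variable; `proportion_gap` is a hypothetical helper and the values are made up.

```python
import pandas as pd

def proportion_gap(original, sample):
    """Largest absolute difference between the category shares of two Series."""
    p = original.value_counts(normalize=True)
    q = sample.value_counts(normalize=True)
    # Align on the union of categories; a category missing on one side counts as 0
    p, q = p.align(q, fill_value=0.0)
    return float((p - q).abs().max())

original = pd.Series([2021] * 40 + [2022] * 60)
sample = pd.Series([2021] * 9 + [2022] * 11)
gap = proportion_gap(original, sample)
print(f"max share gap: {gap:.3f}")
```

A gap close to zero suggests the sample preserves the distribution of that variable; a large gap is expected wherever the data was deliberately rebalanced.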

In [None]:
del data, df_balanced

In [None]:
%%time
df["Start_Time"] = pd.to_datetime(df["Start_Time"])
df["End_Time"] = pd.to_datetime(df["End_Time"])
df["Date"] = pd.to_datetime(df["Date"])

In [None]:
df = df.drop(columns=["Start_Time", "End_Time", "Date"])

## Exploratory Data Analysis

In [None]:
numerical_cols = df.select_dtypes(include=['number']).columns.tolist()
boolean_cols = df.select_dtypes(include=['bool']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object','category']).columns.tolist()

### 1. Univariate Analysis

#### Numerical and ordinal categorical variables
* Time of Day: The distribution of accidents by hour shows two peaks corresponding to morning and evening rush hours, with a significant drop during early morning hours (12-5 AM). These periods correlate with traffic density.
* Day of Week: Accidents are more frequent on weekdays, especially on Fridays. This could indicate lower traffic density on weekends, when some people stay home and fewer people travel during the same hours.
* Monthly Trends: There is a clear seasonality effect with the highest number of accidents in winter months, suggesting that weather conditions might play a significant role.
* Yearly Trends: The number of accidents increases each year, potentially indicating either an increase in traffic volume or improved data collection over time.
* Clear Conditions: Accidents occur frequently in clear weather, likely due to the higher number of vehicles on the road under normal conditions (normal temperature, low humidity, normal but higher pressure, high visibility, lower wind speed, and low-to-no precipitation).
* Humidity levels: More accidents occurred at higher humidity levels. This might correlate with poor visibility and wet road conditions, which could increase accident risk.

In [None]:
len(numerical_cols)

In [None]:
df[numerical_cols].info()

In [None]:
df[numerical_cols].describe()

In [None]:
%%time
fig = make_subplots(rows=4, cols=4, subplot_titles=numerical_cols)
for i, col in enumerate(numerical_cols):
    row = i // 4 + 1
    col_pos = i % 4 + 1
    fig.add_trace(
        go.Histogram(x=df[col], nbinsx=25 if col != 'Severity' else 4, showlegend=False),
        row=row, col=col_pos
    )

# Update layout
fig.update_layout(height=700, width=1120, title_text="Histograms of Numerical Variables")
fig.update_xaxes(tickvals=[1, 2, 3, 4], row=1, col=1)
fig.update_xaxes(tickvals=list(range(1, 13)), row=4, col=3)

# Show plot
fig.show(config={'staticPlot': True})

#### Boolean variables
* Most boolean variables (`Amenity`, `Bump`, `Give_Way`, `No_Exit`, `Railway`, `Roundabout`, `Station`, `Stop`, `Traffic_Calming`) are predominantly False (95% or more), indicating that these features are not present in the majority of accidents. Their absence, however, could play a role in the severity of an accident.
* High-Risk Areas: Areas with crossings, junctions, and traffic signals see significant accident counts. As with the other boolean variables, these columns are predominantly False (85% or more), though more balanced.
* Night-time accidents: Most accidents (69.7%) occurred during the day, matching traffic volume trends. The remaining 30% or so occurred at night, a significant proportion of accidents happening in low-light conditions.
* Highway Accidents: Although most accidents happened on local roads (72%), accidents on highways account for a notable portion (28%).
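
The shares quoted above can be read directly from the column means, since the mean of a boolean column is its fraction of True values. This is a minimal sketch on made-up data; the column names mirror flags used in this notebook, but the values are purely illustrative.

```python
import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({
    "Crossing": [True, False, False, False],
    "Bump": [False, False, False, False],
    "Is_Highway": [True, True, False, False],
})

# Fraction of True values per column, largest first
true_share = toy.mean().sort_values(ascending=False)
print((true_share * 100).round(1))
```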

In [None]:
len(boolean_cols)

In [None]:
df[boolean_cols].info()

In [None]:
true_counts = df[boolean_cols].sum()

In [None]:
fig = go.Figure(data=go.Bar(x=true_counts.index, y=true_counts.values))
fig.update_layout(
    height=350, width=500,
    xaxis_tickangle=-90,
    xaxis_title='Boolean Columns',
    yaxis_title='Count',
    title='Total Number of True Values for Each Boolean Column',
    template='plotly_white',
)
fig.show(config={'staticPlot': True})

In [None]:
num_plots = len(boolean_cols)
num_cols = 8
num_rows = math.ceil(num_plots / num_cols)

# Create Plotly figure with subplots
fig = make_subplots(rows=num_rows, cols=num_cols, subplot_titles=boolean_cols, 
                    specs=[[{'type':'pie'}]*num_cols]*num_rows)
# Populate subplots with pie charts
for i, column in enumerate(boolean_cols):
    row = i // num_cols + 1  # Plotly subplots start from row 1
    col = i % num_cols + 1   # Plotly subplots start from col 1
    counts = df[column].value_counts()
    fig.add_trace(
        go.Pie(labels=counts.index, values=counts, textinfo='percent', sort=False),
        row=row, col=col
    )
# Update layout once, after all traces have been added
fig.update_layout(
    title='Distribution of Boolean Columns',
    font=dict(size=10),
    margin=dict(l=10, r=10, t=40, b=10),  # Adjust margins for better layout
    showlegend=False,
    height=350, width=1000,
    template='plotly_white',
)

fig.show(config={'staticPlot': True})

#### Categorical variables
* Weather Conditions: Accidents are most frequent under clear and cloudy conditions.
* Location Factor: Cities and States with higher accident counts often coincide with high population density, extensive road networks, and potentially varying driving conditions due to local climate and geography.

Some variables are not very useful in their raw form, so we use their transformed versions or new variables extracted from them instead.

In [None]:
categorical_cols.remove("Street")
categorical_cols.remove("Weather_Condition")
categorical_cols.remove("ID")

In [None]:
len(categorical_cols)

In [None]:
df[categorical_cols].info()

In [None]:
print("Unique values counts")
for col in categorical_cols:
    print(f"{col}: {df[col].unique().shape[0]}")

In [None]:
# Create a subplot for Weather Category, Weather Intensity, City, and County
fig = make_subplots(rows=1, cols=4, subplot_titles=['Weather Category', 'Weather Intensity', 'City', 'County'])

# Plot Weather Category (bar plot)
weather_cat_counts = df['Weather_Category'].value_counts()
fig.add_trace(go.Bar(x=weather_cat_counts.index, y=weather_cat_counts.values, marker_color='orange'), row=1, col=1)

# Plot Weather Intensity (bar plot)
weather_intensity_counts = df['Weather_Intensity'].value_counts()
fig.add_trace(go.Bar(x=weather_intensity_counts.index, y=weather_intensity_counts.values, marker_color='blue'), row=1, col=2)

# Plot City (bar plot)
city_counts = df['City'].value_counts().nlargest(15)
fig.add_trace(go.Bar(x=city_counts.index, y=city_counts.values, marker_color='green'), row=1, col=3)

# Plot County (bar plot)
county_counts = df['County'].value_counts().nlargest(15)
fig.add_trace(go.Bar(x=county_counts.index, y=county_counts.values, marker_color='purple'), row=1, col=4)

fig.update_layout(title='Weather Category, Intensity, Top Cities and Top Counties',
                  height=340, width=1150, showlegend=False)

fig.show(config={'staticPlot': True})

* Most accidents occur under regular weather intensity.
* Cities with higher populations and more traffic density tend to have more accidents.

In [None]:
# Plot State (choropleth map)
state_counts = df['State'].value_counts().reset_index()
state_counts.columns = ['State', 'Counts']

# Create choropleth map for State
fig = go.Figure(data=go.Choropleth(
    locations=state_counts['State'],
    z=state_counts['Counts'],
    locationmode='USA-states',
    colorscale='Reds',
    colorbar_title='Number of Accidents'
))

fig.update_layout(
    title_text='Total Number of Accidents in the US (2021-2022)',
    geo=dict(scope='usa', projection_type='albers usa'),
    height=400, width=800,
    showlegend=True,
    barmode='group',
)

# Show the plot
fig.show(config={'staticPlot': True})

* States with higher populations and extensive road networks, like California, tend to report more accidents.

### 2. Multivariate Analysis

#### Numerical variables

In [None]:
numerical_cols.remove("Month")
numerical_cols.remove("Day_of_Week")

In [None]:
coordinate_vars = ["Start_Lng", "Start_Lat"]

In [None]:
#%%time
#fig = px.scatter(df, x='Start_Lng', y='Start_Lat', color='Severity', opacity=0.5)
#fig.update_layout(title='Start_Lat vs Start_Lng', height=450, width=780)
#fig.show(config={'staticPlot': True})

* A scatterplot of the coordinates (`Start_Lat` and `Start_Lng`) shows that the most severe accidents occurred in the most populated cities, mostly on the eastern side of the US. We can also notice accidents along interstate highways.

In [None]:
time_vars = ['Hour', 'Day']

In [None]:
severities = sorted(df['Severity'].unique())
subplot_titles = [f'{time_vars[i]} - Severity {severities[j]}' for i in range(len(time_vars)) for j in range(len(severities))]

# Plot time variables
fig = make_subplots(rows=2, cols=4, subplot_titles=subplot_titles, shared_yaxes=False)

for i, var in enumerate(time_vars):
    var_counts = df[var].value_counts().sort_index()
    for j, severity in enumerate(severities):
        df_grouped = df[df['Severity'] == severity][var].value_counts().sort_index()
        df_grouped = df_grouped.div(var_counts, fill_value=float('NaN'))
        fig.add_trace(go.Bar(x=df_grouped.index, y=df_grouped.values * 100, name=f'{var} - Severity {severity}'), row=i+1, col=j+1)

fig.update_layout(height=400, width=1150, title='Time Variables vs Severity', showlegend=False)
fig.show(config={'staticPlot': True})

Time of the Day:
* Severity 1: Most accidents occur early in the morning, peaking at 7 AM, with another, less intense peak around 5 PM.
* Severity 2 and 3: Accidents follow commuting trends, with higher frequencies during rush hours (5 AM-9 AM and 1 PM-7 PM).
* Severity 4: The distribution is more uniform throughout the day, with a slight increase in the later hours.

Day of the Month:
* The distribution of accidents is fairly uniform throughout the month for all severity levels with a slight trend at the end of the month.
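
The per-hour severity shares behind these observations can also be tabulated directly. This is a minimal sketch on made-up data, using `pd.crosstab` with `normalize="index"` so that each row (one hour) sums to 100%.

```python
import pandas as pd

# Toy data (made up): four accidents at 7 AM and four at 5 PM
toy = pd.DataFrame({
    "Hour": [7, 7, 7, 7, 17, 17, 17, 17],
    "Severity": [1, 2, 2, 2, 2, 2, 3, 4],
})

# Share of each severity level within each hour, in percent
share = pd.crosstab(toy["Hour"], toy["Severity"], normalize="index") * 100
print(share.round(1))
```

Normalizing within each hour removes the effect of overall traffic volume, so the table shows the severity mix rather than raw counts.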

In [None]:
# Separate ordinal categorical variables from numerical variables
ordinal_cols = ["Month", "Day_of_Week"]

In [None]:
num_vars = [var for var in numerical_cols if var not in time_vars and var not in ordinal_cols and var not in coordinate_vars]
num_vars.remove("Year")
num_vars.remove("Severity")
len(num_vars)

In [None]:
%%time
fig = make_subplots(rows=4, cols=2, subplot_titles=num_vars)
for i, var in enumerate(num_vars):
    row = i // 2 + 1
    col = i % 2 + 1
    fig.add_trace(
        go.Box(y=df[var], x=df['Severity'], name=var, boxpoints="suspectedoutliers"),
        row=row, col=col
    )

# Update layout once, after all traces have been added
fig.update_layout(height=1050, width=800, title='Numerical Variables vs Severity', showlegend=False)

fig.show(config={'staticPlot': True})

* Distance affected: Overall, the distance slightly increases with severity, indicating more severe accidents may disrupt a longer stretch of road.
* Temperature: Higher severity accidents occur in a wider range of temperatures, but often cooler conditions.
* Humidity and Pressure: Do not show strong differentiation between severity levels.
* Visibility: Generally relatively high for all severities, but slightly lower for more severe accidents.
* Wind Speed: Higher wind speeds are associated with more severe accidents.
* Precipitation: Low across all severities, but slightly higher for more severe accidents.
* Duration: Increases with severity, indicating more severe accidents take longer to resolve.

#### Ordinal Categorical variables

In [None]:
severities = sorted(df['Severity'].unique())
subplot_titles = [f'{ordinal_cols[i]} - Severity {severities[j]}' for i in range(len(ordinal_cols)) for j in range(len(severities))]

# Plot ordinal variables
fig = make_subplots(rows=2, cols=4, subplot_titles=subplot_titles, shared_yaxes=False)

for i, var in enumerate(ordinal_cols):
    var_counts = df[var].value_counts().sort_index()
    for j, severity in enumerate(severities):
        df_grouped = df[df['Severity'] == severity][var].value_counts().sort_index()
        df_grouped = df_grouped.div(var_counts, fill_value=float('NaN'))
        fig.add_trace(go.Bar(x=df_grouped.index, y=df_grouped.values * 100, name=f'{var} - Severity {severity}'), row=i+1, col=j+1)

fig.update_layout(height=400, width=1000, title='Ordinal Categorical Variables vs Severity', showlegend=False)
fig.show(config={'staticPlot': True})

Month of the year:
* Accidents are relatively evenly distributed across the months for all severity levels.
* There is a slight decrease in accidents of severity 3 during the winter months (November to February) while we notice an increase in accidents of severity 4 in December, likely due to weather or end of the year celebrations. Similarly, accidents of severity 1 peak during summer time.

Day of Week:
* Accidents are relatively evenly distributed across the weekdays.
* Weekdays, especially Fridays, have higher accident rates across all severities, likely due to increased traffic and commuting. Weekends generally see lower accident rates, which may indicate safer driving conditions.

#### Categorical variables

To simplify our analysis, we decided to drop the "City" and "County" variables due to their high cardinality and the lack of additional context such as population size. This makes them less useful for our current scope. By excluding them, we can focus on more immediately actionable and interpretable variables.

In [None]:
categorical_cols.remove("City")
categorical_cols.remove("County")
categorical_cols.remove("Weather_Intensity")

In [None]:
len(categorical_cols)

In [None]:
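df_weather_grouped = df.groupby(['Weather_Category', 'Severity']).size().reset_index(name='Count')
df_weather_grouped['Percentage'] = 100 * df_weather_grouped['Count'] / df_weather_grouped.groupby('Weather_Category')['Count'].transform('sum')
# Cast Severity to string so Plotly treats it as a discrete category (matches category_orders below)
df_weather_grouped['Severity'] = df_weather_grouped['Severity'].astype(str)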

# Plotting
fig = px.bar(df_weather_grouped, x='Weather_Category', y='Percentage', color='Severity', barmode='group', 
             category_orders={'Severity': ['1', '2', '3', '4']},
             labels={'Percentage': 'Percentage (%)'},
             title='Weather Category vs Severity')
fig.update_layout(showlegend=True, height=350, width=370)
fig.show(config={'staticPlot': True})

* Severity 1: Generally, all weather conditions contribute to relatively low proportions of severity 1 accidents, indicating that driving conditions may be safer under these circumstances.
* Severity 2: Clear, cloudy, and various adverse weather conditions significantly increase the likelihood of severity 2 accidents, with precipitation, snowstorms, thunderstorms, and visibility issues having the highest impact.
* Severity 3 and 4: While severity 2 dominates across weather categories, severe accidents (severity 3 and 4) are less frequent but still notable under adverse weather conditions such as precipitation, snowstorms, thunderstorms, and visibility issues (smoke, fog, etc.)

In [None]:
severities = sorted(df['Severity'].unique())
# Store counts across all severity levels
all_severity_state_counts = df["State"].value_counts().sort_index()
# Create a list to hold each choropleth map figure
figs = []
# Loop through each severity level and create a choropleth map
for severity in severities:
    # Filter data for the current severity level
    severity_data = df[df['Severity'] == severity]
    # Count occurrences of each state for the current severity level
    state_counts = severity_data['State'].value_counts()
    state_counts = state_counts.div(all_severity_state_counts).reset_index()
    state_counts.columns = ['State', 'Proportion']
    # Create choropleth map for the current severity level
    fig = go.Figure(data=go.Choropleth(
        locations=state_counts['State'],
        z=state_counts['Proportion'],
        locationmode='USA-states',
        colorscale='Reds',
        colorbar=dict(title='Proportion'),
        hoverinfo='location+z'
    ))
    # Update layout for the choropleth map
    fig.update_layout(
        height=350, width=700,
        title=f'Proportion of Severity {severity} Accidents by State',
        geo=dict(scope='usa', projection_type='albers usa'),
    )
    # Add the figure to the list
    figs.append(fig)

# Display each choropleth map separately
for fig in figs:
    fig.show(config={'staticPlot': True})

* Generally, severity 1 accidents show less variation across states, suggesting that factors influencing minor accidents might be more uniformly managed or influenced.
* States vary widely in terms of severity 2 accident proportions, with some states experiencing significantly higher rates, potentially due to factors like population density, weather conditions, or road infrastructure.
* There is more variability across states for more severe accidents (severity 3 and 4), with some states having higher proportions compared to others.

#### Boolean variables

In [None]:
len(boolean_cols)

In [None]:
num_plots = len(boolean_cols)
num_cols = 8
num_rows = math.ceil(num_plots / num_cols)

fig = make_subplots(rows=num_rows, cols=num_cols, subplot_titles=boolean_cols)

# Loop through each boolean variable and create a stacked bar plot
for i, var in enumerate(boolean_cols):
    row = i // num_cols + 1
    col = i % num_cols + 1
    
    # Group by boolean variable and Severity, then calculate percentage
    df_grouped = df.groupby([var, 'Severity']).size().reset_index(name='Count')
    df_grouped['Proportion'] = df_grouped['Count'] / df_grouped.groupby(var)['Count'].transform('sum')
    # Cast Severity to string so Plotly draws one discrete trace per severity level
    df_grouped['Severity'] = df_grouped['Severity'].astype(str)

    # Plotting (only the traces are reused; layout options are set on the subplot figure)
    fig_bar = px.bar(df_grouped, x=var, y='Proportion', color='Severity', barmode='stack',
                     category_orders={'Severity': ['1', '2', '3', '4']},
                     title=f'{var} vs Severity').data
    
    # Add traces to subplot
    for trace in fig_bar:
        fig.add_trace(trace, row=row, col=col)

# Update layout
fig.update_layout(
    title='Boolean Variables vs Severity',
    height=500, width=1200,
    barmode='stack',  # stacking must be set on the subplot figure itself
    showlegend=True
)

fig.show(config={'staticPlot': True})

* Strong Correlation with Severity 2: Amenities, bumps, crossings, no exit, stations, stop signs, and traffic calming measures.
* Moderate Correlation with Severity 2: Railways, traffic signals, and nighttime conditions.
* High Correlation with Severity 3 and 4: Highways, junctions, and nighttime conditions.

### 3. Correlation Analysis

In [None]:
# Store results
correlations = {'Variable': [], 'Correlation': [], 'Type': []}

In [None]:
# Spearman's Rank Correlation for numerical and time-related variables
for var in num_vars + time_vars:
    corr, _ = spearmanr(df[var], df['Severity'])
    correlations['Variable'].append(var)
    correlations['Correlation'].append(corr)
    correlations['Type'].append('Numerical/Time')
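
Spearman's coefficient measures monotonic rather than linear association, which suits an ordinal target such as `Severity`. It can be understood as the Pearson correlation of the ranks. This is a minimal sketch on made-up values; `spearman_via_ranks` is a hypothetical helper, not the SciPy implementation.

```python
import numpy as np
import pandas as pd

def spearman_via_ranks(x, y):
    """Spearman's rho computed as the Pearson correlation of average ranks."""
    rx = pd.Series(x).rank()  # ties receive the average rank by default
    ry = pd.Series(y).rank()
    return float(np.corrcoef(rx, ry)[0, 1])

x = [1, 2, 3, 4, 5]
rho_monotone = spearman_via_ranks(x, [v ** 3 for v in x])  # increasing but non-linear
rho_reversed = spearman_via_ranks(x, [10, 8, 6, 4, 2])     # strictly decreasing
print(rho_monotone, rho_reversed)
```

A strictly increasing relationship gives a coefficient of 1 even when it is far from linear, and a strictly decreasing one gives -1.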

In [None]:
# Chi-Square and Cramér's V for categorical variables
for var in categorical_cols:
    contingency_table = pd.crosstab(df[var], df['Severity'])
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    cramers_v = np.sqrt(chi2 / (df.shape[0] * (min(contingency_table.shape) - 1)))
    correlations['Variable'].append(var)
    correlations['Correlation'].append(cramers_v)
    correlations['Type'].append('Categorical')
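
Cramér's V rescales the chi-square statistic to the range 0 to 1: the statistic is divided by the sample size times (the smaller table dimension minus 1), and the square root is taken. This is a minimal sketch that computes the statistic by hand on made-up data; `cramers_v` is a hypothetical helper and applies no continuity correction (SciPy's `chi2_contingency` applies one by default for 2x2 tables).

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V from the Pearson chi-square statistic (no continuity correction)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

a = pd.Series(["x", "x", "y", "y"] * 5)
v_same = cramers_v(a, a)                             # a variable against itself
v_indep = cramers_v(a, pd.Series(["p", "q"] * 10))   # balanced and independent
print(v_same, v_indep)
```

Because V is unsigned, it indicates the strength of an association but not its direction.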

In [None]:
# Point-Biserial Correlation for boolean variables
for var in boolean_cols:
    corr, _ = pointbiserialr(df[var], df['Severity'])
    correlations['Variable'].append(var)
    correlations['Correlation'].append(corr)
    correlations['Type'].append('Boolean')
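
The point-biserial coefficient is mathematically equivalent to the Pearson correlation when the boolean variable is coded as 0/1. This is a minimal sketch on made-up values that compares the two forms; the variable names are illustrative.

```python
import numpy as np

flag = np.array([0, 0, 1, 1, 1], dtype=float)   # boolean variable coded as 0/1
score = np.array([1, 2, 2, 3, 4], dtype=float)  # made-up numeric outcome

# Pearson correlation with the 0/1 coding
r_pearson = np.corrcoef(flag, score)[0, 1]

# Textbook point-biserial form: (M1 - M0) / s * sqrt(p * q),
# where s is the population standard deviation of the outcome
m1, m0 = score[flag == 1].mean(), score[flag == 0].mean()
p = flag.mean()
r_pb = (m1 - m0) / score.std() * np.sqrt(p * (1 - p))
print(r_pearson, r_pb)
```

The sign tells whether the outcome tends to be higher (positive) or lower (negative) when the flag is True.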

In [None]:
# Spearman's Rank Correlation for ordinal variables
for var in ordinal_cols:
    # Encode ordinal variables as integer codes without modifying df
    codes = df[var].astype('category').cat.codes
    corr, _ = spearmanr(codes, df['Severity'])
    correlations['Variable'].append(var)
    correlations['Correlation'].append(corr)
    correlations['Type'].append('Ordinal')

In [None]:
# Convert to DataFrame
correlations_df = pd.DataFrame(correlations)

In [None]:
fig = go.Figure()

# Iterate over unique types to create grouped bars
for type_name in correlations_df['Type'].unique():
    df_filtered = correlations_df[correlations_df['Type'] == type_name]
    
    fig.add_trace(
        go.Bar(x=df_filtered['Variable'], y=df_filtered['Correlation'], name=type_name)
    )

# Update layout
fig.update_layout(
    title='Strength of Relationships with Severity',
    xaxis_title='Variable',
    yaxis_title='Correlation Coefficient',
    xaxis_tickangle=90,
    legend_title='Type',
    legend=dict(x=1, y=1, traceorder='normal'),
    barmode='group',
    height=550,
    width=850
)

fig.show(config={'staticPlot': True})

In [None]:
corr_threshold = 0.1

In [None]:
correlations_df[np.abs(correlations_df["Correlation"]) >= corr_threshold]

Some variables have a relatively strong relationship with the severity of the accident. We could use them as features in a predictive model.
* `Distance(mi)`: Moderate negative correlation (-0.447) suggests it is a significant predictor.
* `Duration(min)`: Weak-to-moderate negative correlation (-0.289) indicates it should be included.
* `State`: Moderate association (Cramér's V = 0.237, an unsigned measure) implies state-specific factors affecting severity should be accounted for.
* `Weather_Category`: Weak association (Cramér's V = 0.044, below the 0.1 threshold), but domain knowledge suggests considering weather conditions.
* `Is_Highway`: Moderate positive correlation (0.251) indicates the importance of distinguishing accidents on highways.
* `Crossing`: Weak negative correlation (-0.125) highlights the impact of accidents involving crossings.
* `Hour`: Hourly patterns can directly relate to traffic conditions, commuter behavior, and visibility, which are critical factors influencing accident severity.