<div style="border: 2px solid #007acc; padding: 15px; border-radius: 10px; background-color: #f0f8ff; text-align: left;">
    <h1 style="text-align: center;"><b><i>Road Traffic Accident Analysis in the United Kingdom. </i></b></h1>
</div>

<div style="border: 2px solid #007acc; padding: 3px; border-radius: 10px; background-color: #007acc; text-align: left;"> </div>

<div style="border: 2px solid #007acc; padding: 15px; border-radius: 10px; background-color: #f0f8ff; text-align: left;">
    <h2 style="text-align: left; font-size: 27px;"><b>Step 3: 📝 Exploratory Data Analysis (EDA) & Visualizations.</b></h2>
</div>

<div style="border: 2px solid #007acc; padding: 3px; border-radius: 10px; background-color: #007acc; text-align: left;"> </div>

#### **About:**
##### Exploratory Data Analysis (EDA) & Visualizations is the process of examining a dataset to understand its structure, detect patterns, identify anomalies, and uncover relationships between variables.
##### Through statistical summaries and clear visualizations, EDA helps transform raw data into meaningful insights, guiding further analysis and decision-making.

#### **Role in Accident Analysis:**
##### Exploratory Data Analysis (EDA) & Visualizations in this project focuses on understanding patterns, trends, and anomalies within UK road traffic accident records.
##### By analyzing accident characteristics such as severity, road type, weather, lighting, and location, and visualizing them through charts and graphs, we aim to **uncover factors that influence accident occurrence and impact**.
##### This step provides the foundation for deeper analysis, helping to identify key variables for predictive modeling and informed road safety strategies.

# 3.1 📥 Importing Libraries and Loading the Data.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #007acc; text-align: left;width: 600px;"> </div>

## 3.1.1 Import Necessary Libraries.

<div style="border: 2px solid #007acc; padding: 3px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 37%;"> </div>

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

## 3.1.2 Data Loading.

<div style="border: 2px solid #007acc; padding: 3px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 20%;"> </div>

In [None]:
# Data Loading
print('\n Loading the Road Accident Data...\n')

# Read the CSV file. The file is assumed to be in the specified directory.
df = pd.read_csv(r'C:\Entri\Project\data\my_clean_data_index.csv', encoding='UTF-8-SIG')
#df = pd.read_csv(r'C:\Entri\Project\data\Road Accident Data.csv', keep_default_na=False, na_values=[], encoding='UTF-8-SIG')

# Display the DataFrame shape.
print('Dataset loaded with shape:', df.shape)
print("Dataset has:", df.shape[0], "rows and", df.shape[1], "columns")
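
As a small sketch of the loading step, the snippet below reads a tiny in-memory CSV instead of the project file and reports its shape. The rows are invented for illustration, the column names mirror those used later in this notebook, and `toy_csv` and `toy_df` are hypothetical names. Passing `parse_dates` is one possible way to get a datetime column at load time; it is not what the cell above does.

```python
import io
import pandas as pd

# Toy CSV text standing in for the cleaned file (all values are invented).
toy_csv = (
    "Accident_Index,Accident_DateTime,Accident_Severity,Number_of_Casualties\n"
    "A001,2021-01-05 08:30:00,Slight,1\n"
    "A002,2021-11-20 17:10:00,Serious,2\n"
    "A003,2022-03-14 23:45:00,Slight,1\n"
)

# parse_dates converts the named column to a datetime dtype while reading.
toy_df = pd.read_csv(io.StringIO(toy_csv), parse_dates=["Accident_DateTime"])

print("Dataset loaded with shape:", toy_df.shape)
print(toy_df.dtypes)
```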

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #007acc; text-align: left;width: 99%;"> </div>

# 3.2 🧬 Checking, Verifying and Exploring the Cleaned Dataset.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #007acc; text-align: left;width: 73%;"> </div>

## First Few Rows From Cleaned Data Set.

##### The dataset has already been cleaned: it has no missing values and no duplicate rows.

In [None]:
df.head()

### Data Structure Info.

##### The dataset has 18 fields. Accident_Index has been converted to the index. Three fields, Number_of_Casualties, Number_of_Vehicles and Speed_limit, are retained as numeric, and the other fields are stored as object. Accident_DateTime holds the merged date and time, and GeoPoint holds the merged latitude/longitude values.

In [None]:
# Structure
df.info()

In [None]:
#Structure
df.columns.tolist()
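
The dtype description above can also be checked programmatically. The sketch below is illustrative only: it uses a toy frame with a few of the notebook's column names and invented values, and `expected_numeric` is a hypothetical list taken from the description above.

```python
import pandas as pd

# Toy frame with a few of the notebook's column names (values are invented).
toy_df = pd.DataFrame({
    "Accident_Severity": ["Slight", "Serious"],
    "Number_of_Casualties": [1, 2],
    "Number_of_Vehicles": [2, 1],
    "Speed_limit": [30, 60],
    "Road_Type": ["Single carriageway", "Dual carriageway"],
})

# Columns that the description says should be numeric.
expected_numeric = ["Number_of_Casualties", "Number_of_Vehicles", "Speed_limit"]

# select_dtypes(include="number") keeps integer and float columns only.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()

# Symmetric difference: columns that are numeric but unexpected, or expected but not numeric.
unexpected = sorted(set(numeric_cols) ^ set(expected_numeric))

print("Numeric columns:", numeric_cols)
print("Mismatches:", unexpected)
```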

### Statistical Summary of the Numeric Columns.

In [None]:
df.describe().T

### Statistical Summary of the Descriptive Columns.

In [None]:
df.describe(include = 'object').T

### Verification of Non Null Values.

In [None]:
# Checking of null values.
df.isnull().sum()

In [None]:
# 2. Data types & Missing values
dtypes_missing = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing_count": df.isna().sum(),
    "missing_percent": (df.isna().sum()/len(df))*100
}).reset_index().rename(columns={"index":"column"})
print("\nData Types & Missing Values:")
print(dtypes_missing)

### Verification of Duplicated Rows.

In [None]:
# 3. Duplicate check
print("\nFull row duplicates:", df.duplicated().sum())
if "Accident_Index" in df.columns:
    print("Duplicate Accident_Index:", df["Accident_Index"].duplicated().sum())
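
The null and duplicate checks above can be bundled into one small helper so that the "no missing values, no duplicates" claim is easy to re-verify. This is a minimal sketch on invented toy data; `check_clean` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def check_clean(frame, key=None):
    """Return basic data-quality counts for a DataFrame."""
    report = {
        "missing_cells": int(frame.isna().sum().sum()),   # total NaN cells
        "duplicate_rows": int(frame.duplicated().sum()),  # fully duplicated rows
    }
    # Optionally count repeated values in an identifier column.
    if key is not None and key in frame.columns:
        report["duplicate_keys"] = int(frame[key].duplicated().sum())
    return report

# Toy data (invented values) using column names from this notebook.
toy_df = pd.DataFrame({
    "Accident_Index": ["A001", "A002", "A003"],
    "Accident_Severity": ["Slight", "Serious", "Slight"],
    "Number_of_Casualties": [1, 2, 1],
})

report = check_clean(toy_df, key="Accident_Index")
print(report)
```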

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #007acc; text-align: left;width: 99%;"> </div>

# 3.3 📈 Univariate Analysis.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #007acc; text-align: left;width: 32%;"> </div>

### 1. Accident Severity – Bar chart of distribution.

<div style="border: 2px solid #007acc; padding: 3px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 41%;"> </div>

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(
    x="Accident_Severity",
    data=df,
    order=df["Accident_Severity"].value_counts().index,
    hue="Accident_Severity",        # assign palette to this
    legend=False,                   # no duplicate legend
    palette="Set2"
)

plt.title("Distribution of Accident Severity", fontsize=14)
plt.xlabel("Accident Severity")
plt.ylabel("Number of Accidents")
plt.show()

# Print frequency table
severity_counts = df["Accident_Severity"].value_counts(normalize=True) * 100

print("Accident Severity Distribution (%):\n", severity_counts.round(2))

#### 🔎 Insights:
##### The majority of road accidents are classified as Slight, followed by Serious.
##### Fatal accidents account for only a small fraction of the total dataset, but they are critical from a road safety perspective.
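
Counts and percentages are easier to compare in a single table. The sketch below builds one from invented toy labels (not the real distribution); `toy_severity` and `severity_table` are hypothetical names.

```python
import pandas as pd

# Toy severity labels (invented proportions, for illustration only).
toy_severity = pd.Series(
    ["Slight"] * 8 + ["Serious"] * 3 + ["Fatal"] * 1,
    name="Accident_Severity",
)

# Absolute counts and percentage share, aligned on the severity label.
counts = toy_severity.value_counts()
percent = (toy_severity.value_counts(normalize=True) * 100).round(2)
severity_table = pd.DataFrame({"count": counts, "percent": percent})

print(severity_table)
```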

### 2. Plotting Accident_DateTime.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 29%;"> </div>

In [None]:
df["Accident_DateTime"] = pd.to_datetime(df["Accident_DateTime"], errors="coerce")

# Extract features
df["Year"] = df["Accident_DateTime"].dt.year
df["Month"] = df["Accident_DateTime"].dt.month
df["DayOfWeek"] = df["Accident_DateTime"].dt.day_name()
df["Hour"] = df["Accident_DateTime"].dt.hour
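
Because `errors="coerce"` silently turns unparseable values into `NaT`, it can be worth counting them before relying on the derived features. The snippet below is a sketch on invented toy strings; `toy_times`, `parsed` and `features` are hypothetical names.

```python
import pandas as pd

# Toy timestamps (invented); the second entry is deliberately invalid.
toy_times = pd.Series(["2021-01-05 08:30:00", "not a date", "2022-03-14 23:45:00"])

# Invalid strings become NaT instead of raising an error.
parsed = pd.to_datetime(toy_times, errors="coerce")
n_unparsed = int(parsed.isna().sum())
print("Unparseable timestamps:", n_unparsed)

# Derived features are missing wherever the timestamp is NaT.
features = pd.DataFrame({
    "Hour": parsed.dt.hour,
    "DayOfWeek": parsed.dt.day_name(),
})
print(features)
```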

In [None]:
# 1. Yearly trend
plt.figure(figsize=(6,4))
sns.countplot(x="Year", data=df, hue="Year", palette="Set3", legend=False)
plt.title("Accidents per Year")
plt.show()

### 3. Monthly Trend.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 17%;"> </div>

In [None]:
# Monthly trend
plt.figure(figsize=(6,4))
sns.countplot(x="Month", data=df, hue="Month", palette="Set2", legend=False)
plt.title("Accidents per Month")
plt.show()

#### Insight:
##### A sharp rise is seen in October and November, with November recording the highest number of accidents (~29,000), possibly due to bad weather, shorter daylight hours, and increased traffic during the pre-winter/holiday season.

### 4. Group by Year and Month.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 27%;"> </div>

In [None]:
# Group by Year and Month
monthly_counts = df.groupby(["Year","Month"]).size().reset_index(name="Accident_Count")

# Pivot for easier plotting
monthly_pivot = monthly_counts.pivot(index="Month", columns="Year", values="Accident_Count")

# Plot
plt.figure(figsize=(10,6))
sns.lineplot(data=monthly_pivot, marker="o")

plt.title("Monthly Accident Trend Comparison (2021 vs 2022)", fontsize=14)
plt.xlabel("Month")
plt.ylabel("Number of Accidents")
plt.xticks(range(1,13), 
           ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"])
plt.legend(title="Year")
plt.show()


#### Insight:
##### 2021 consistently records more accidents than 2022 across almost all months.
##### Both years dip slightly in August, possibly due to summer holidays (less commuting).
##### Both years rise again in autumn.
##### November is the peak month in both years, but 2021 (~15.5k) is still noticeably higher than 2022 (~13.6k).
##### This aligns with seasonal risk factors like shorter daylight hours, bad weather, and higher traffic.

#### The seasonal pattern is similar in both years (dip in Feb & Aug, peak in Nov), but 2022 has fewer accidents overall, suggesting improvements in road safety or reduced travel exposure.
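
To put the year-on-year gap in relative terms, the approximate November figures quoted above (about 15.5k for 2021 and 13.6k for 2022) can be converted to a percentage change. The helper `pct_change` is a hypothetical name, and the inputs are the rounded values from the text, not exact counts.

```python
def pct_change(old, new):
    """Percentage change from old to new; old must be non-zero."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100

# Rounded November counts quoted in the insight above (approximate values).
nov_2021 = 15_500
nov_2022 = 13_600

change = pct_change(nov_2021, nov_2022)
print(f"November change, 2022 vs 2021: {change:.1f}%")
```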

### 5. Day of Week Trend.

<div style="border: 2px solid #007acc; padding: 3px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 21%;"> </div>

In [None]:
# Day of Week trend
plt.figure(figsize=(7,4))
sns.countplot(x="DayOfWeek", data=df, order=["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"],
              hue="DayOfWeek", palette="Set1", legend=False)
plt.title("Accidents by Day of Week")
plt.xticks(rotation=30)
plt.show()

#### Insight:
##### Accident counts are highest at the weekend, especially on Saturday, and lowest on Monday.
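
One way to summarise a weekday/weekend comparison is the average count per day in each group. The sketch below uses invented daily counts, so its numbers say nothing about the real data; `toy_counts` is a hypothetical name.

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Invented accident counts per day of week (illustration only).
toy_counts = pd.Series([10, 11, 12, 12, 14, 16, 13], index=day_order)

# Boolean mask marking the two weekend days.
is_weekend = toy_counts.index.isin(["Saturday", "Sunday"])

# Average per day, so the 5-day and 2-day groups are comparable.
weekend_mean = toy_counts[is_weekend].mean()
weekday_mean = toy_counts[~is_weekend].mean()

print("Average per weekend day:", weekend_mean)
print("Average per weekday:", weekday_mean)
```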

### 6. Group by Year and DayOfWeek.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 31%;"> </div>

In [None]:
# Group by Year and DayOfWeek
dow_counts = df.groupby(["Year","DayOfWeek"]).size().reset_index(name="Accident_Count")

# Set day order manually for consistency
day_order = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
dow_counts["DayOfWeek"] = pd.Categorical(dow_counts["DayOfWeek"], categories=day_order, ordered=True)

# Pivot for overlapped plotting
dow_pivot = dow_counts.pivot(index="DayOfWeek", columns="Year", values="Accident_Count")

# Plot
plt.figure(figsize=(10,6))
sns.lineplot(data=dow_pivot, marker="o")

plt.title("Accidents by Day of Week (2021 vs 2022)", fontsize=14)
plt.xlabel("Day of Week")
plt.ylabel("Number of Accidents")
plt.legend(title="Year")
plt.show()


#### Insight:
##### Accidents peak on Saturday in both years, suggesting weekend travel is consistently risky.
✅ Key takeaway: Accident trends by day of week are stable across years, but overall accidents decreased in 2022 compared to 2021 — possibly due to improved safety measures, reduced traffic volumes, or other external factors.

### 7. Hourly Trend.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 15%;"> </div>

In [None]:
# Hourly trend
plt.figure(figsize=(8,4))
sns.countplot(x="Hour", data=df, hue="Hour", palette="Paired", legend=False)
plt.title("Accidents by Hour of Day")
plt.show()

#### Insight:

##### Rush hours (7–9 AM and 3–6 PM) are the most accident-prone times.

##### Peak risk is at 5 PM.

##### Road safety interventions (like traffic control, awareness campaigns, and stricter monitoring) could be prioritized during these hours.
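
The rush-hour observation can be quantified as the share of accidents that fall inside those windows. This sketch assumes the windows cover hours 7 to 9 and 15 to 18 inclusive, which is one reading of "7–9 AM and 3–6 PM", and it runs on invented hours; `toy_hours` and `rush_hours` are hypothetical names.

```python
import pandas as pd

# Invented accident hours (0-23), for illustration only.
toy_hours = pd.Series([8, 8, 9, 12, 15, 17, 17, 17, 18, 23], name="Hour")

# Assumed rush-hour windows: 07:00-09:59 and 15:00-18:59.
rush_hours = list(range(7, 10)) + list(range(15, 19))

# Share of accidents that fall in a rush-hour window.
in_rush = toy_hours.isin(rush_hours)
rush_share = in_rush.mean() * 100

# Hour with the most accidents.
peak_hour = toy_hours.value_counts().idxmax()

print(f"Share in rush hours: {rush_share:.1f}%")
print("Peak hour:", peak_hour)
```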

### 8. Group by Year and Hour.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 25%;"> </div>

In [None]:
# Group by Year and Hour
hour_counts = df.groupby(["Year","Hour"]).size().reset_index(name="Accident_Count")

# Pivot for overlapped plotting
hour_pivot = hour_counts.pivot(index="Hour", columns="Year", values="Accident_Count")

# Plot
plt.figure(figsize=(12,6))
sns.lineplot(data=hour_pivot, marker="o")

plt.title("Accidents by Hour of Day (2021 vs 2022)", fontsize=14)
plt.xlabel("Hour of Day (0 = Midnight, 23 = 11PM)")
plt.ylabel("Number of Accidents")
plt.xticks(range(0,24))
plt.legend(title="Year")
plt.grid(alpha=0.3)
plt.show()


#### 🔎 Insights:

##### Both years peak around 5 PM (17 hrs).
##### This is the most accident-prone time of day in both years.
##### 2022 shows a drop in accident counts across almost all hours compared to 2021.
##### The biggest reductions are during midday (10–14 hrs) and evening rush (17–19 hrs).
##### Despite the drop, rush hours remain the most dangerous.

### 9. Hourly distribution of accidents (2021 vs 2022 overlapped).

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 57%;"> </div>

In [None]:
# Hourly distribution of accidents (2021 vs 2022 overlapped)
plt.figure(figsize=(12,6))

# discrete=True draws one bar per integer hour, centred on the hour value
sns.histplot(data=df[df["Year"] == 2021], x="Hour", discrete=True, color="blue", alpha=0.5, label="2021")
sns.histplot(data=df[df["Year"] == 2022], x="Hour", discrete=True, color="orange", alpha=0.5, label="2022")

plt.title("Hourly Distribution of Accidents (2021 vs 2022)")
plt.xlabel("Hour of Day")
plt.ylabel("Number of Accidents")
plt.legend()
plt.show()


### 10. Percentage Change in Accidents by Hour (2022 vs 2021).

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 55%;"> </div>

In [None]:
# Group by Year & Hour
hourly_counts = df.groupby(["Year", "Hour"]).size().reset_index(name="Accident_Count")

# Pivot to get 2021 and 2022 side by side
hourly_pivot = hourly_counts.pivot(index="Hour", columns="Year", values="Accident_Count").fillna(0)

# Calculate percentage change
hourly_pivot["% Change (2022 vs 2021)"] = ((hourly_pivot[2022] - hourly_pivot[2021]) / hourly_pivot[2021]) * 100

# Reset index for plotting
hourly_change = hourly_pivot.reset_index()

# --- Line Chart of % Change ---
plt.figure(figsize=(10,6))
sns.lineplot(x="Hour", y="% Change (2022 vs 2021)", data=hourly_change, marker="o", color="darkblue")

# Add reference line (0% change)
plt.axhline(0, color="red", linestyle="--", linewidth=1)

# Titles & labels
plt.title("Percentage Change in Accidents by Hour (2022 vs 2021)", fontsize=14, weight="bold")
plt.xlabel("Hour of Day (0-23)")
plt.ylabel("% Change")
plt.xticks(range(0,24))  # Show all hours
plt.grid(True, linestyle="--", alpha=0.6)
plt.show()

# Show the % change table
pd.set_option("display.precision", 2)
hourly_pivot
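
If an hour had no accidents in the base year, the percentage-change formula would divide by zero and produce `inf`. A possible guard, sketched here on an invented pivot table, is to treat a zero base as missing; `toy_pivot` and the `pct_change` column name are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented hourly counts; hour 1 has no accidents in the base year.
toy_pivot = pd.DataFrame(
    {2021: [10, 0, 20], 2022: [8, 3, 25]},
    index=pd.Index([0, 1, 2], name="Hour"),
)

# Replace a zero base with NaN so the ratio is undefined rather than infinite.
base = toy_pivot[2021].replace(0, np.nan)
toy_pivot["pct_change"] = (toy_pivot[2022] - toy_pivot[2021]) / base * 100

print(toy_pivot)
```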

### 11. Number of Casualties.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 23%;"> </div>

In [None]:
# Univariate Analysis - Number of Casualties
plt.figure(figsize=(12,5))

# Histogram
plt.subplot(1,2,1)
sns.histplot(df["Number_of_Casualties"], bins=20, kde=True, color="skyblue")
plt.title("Distribution of Number of Casualties")

# Boxplot
plt.subplot(1,2,2)
sns.boxplot(x=df["Number_of_Casualties"], color="lightcoral")
plt.title("Boxplot of Number of Casualties")

plt.tight_layout()
plt.show()


The histogram shows whether most accidents involve few casualties (likely concentrated at 1 or 2).

The boxplot highlights outliers (rare but severe accidents with many casualties).
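
The points a boxplot draws as outliers can also be listed numerically. The sketch below applies the usual 1.5 × IQR rule (the default whisker length in seaborn's `boxplot`) to invented casualty counts; `toy_casualties` is a hypothetical name.

```python
import pandas as pd

# Invented casualty counts with one extreme value (illustration only).
toy_casualties = pd.Series([1, 1, 1, 1, 2, 1, 2, 1, 3, 12],
                           name="Number_of_Casualties")

# Quartiles and interquartile range.
q1 = toy_casualties.quantile(0.25)
q3 = toy_casualties.quantile(0.75)
iqr = q3 - q1

# Values above the upper fence are the ones a boxplot shows as outliers.
upper_fence = q3 + 1.5 * iqr
outliers = toy_casualties[toy_casualties > upper_fence]

print("Upper fence:", upper_fence)
print("Outliers:", outliers.tolist())
```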

### 12. Number of Vehicles.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 21%;"> </div>

In [None]:
# Univariate Analysis - Number of Vehicles

# Summary statistics
print("Summary Statistics for Number_of_Vehicles:")
print(df["Number_of_Vehicles"].describe())

# Visualization
plt.figure(figsize=(12,5))

# Histogram
plt.subplot(1,2,1)
sns.histplot(df["Number_of_Vehicles"], bins=20, kde=True, color="lightgreen")
plt.title("Distribution of Number of Vehicles in Accidents")

# Boxplot
plt.subplot(1,2,2)
sns.boxplot(x=df["Number_of_Vehicles"], color="orange")
plt.title("Boxplot of Number of Vehicles")

plt.tight_layout()
plt.show()


### 13. Speed Limits.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 15%;"> </div>

In [None]:
# Histogram of Speed Limits
plt.figure(figsize=(7,4))
sns.histplot(df["Speed_limit"], bins=10, kde=False, color="coral", edgecolor="black")
plt.title("Distribution of Speed Limits in Accidents")
plt.xlabel("Speed Limit (mph)")
plt.ylabel("Number of Accidents")
plt.show()


#### Insight:
##### The histogram highlights which speed limits are most common in accident records (for example, 30 mph is usually dominant in urban areas, while 60-70 mph occurs in rural and motorway accidents).
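
Speed limits take only a handful of distinct values, so an exact frequency table can complement a binned histogram. This is a sketch on invented values; `toy_limits` is a hypothetical name.

```python
import pandas as pd

# Invented speed limits in mph (illustration only).
toy_limits = pd.Series([30, 30, 30, 20, 60, 70, 30, 40, 60, 30],
                       name="Speed_limit")

# Exact count per distinct limit, ordered by the limit itself.
limit_counts = toy_limits.value_counts().sort_index()
most_common = limit_counts.idxmax()

print(limit_counts)
print("Most common limit (mph):", most_common)
```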

### 14. Road Type.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 15%;"> </div>

In [None]:
# Bar chart of Road Type 
plt.figure(figsize=(7,4))
sns.countplot(
    x="Road_Type",
    data=df,
    order=df["Road_Type"].value_counts().index,
    hue="Road_Type",          # <-- add hue
    palette="Set2",
    legend=False              # <-- hide duplicate legend
)
plt.title("Distribution of Accidents by Road Type")
plt.xlabel("Road Type")
plt.ylabel("Number of Accidents")
plt.xticks(rotation=30, ha="right")
plt.show()

In [None]:
# Which road types are more accident-prone?
ans = df.groupby('Road_Type').agg(Total_Accidents=('Accident_Index', 'count')).reset_index().sort_values(by='Total_Accidents', ascending=False)
ans

In [None]:
plt.figure(figsize=(5, 5))
plt.pie(ans['Total_Accidents'], labels=ans['Road_Type'], autopct='%1.1f%%', startangle=140)
plt.title('Accidents by Road Type')
plt.axis('equal')  # Ensures the pie is drawn as a circle
plt.show()

##### Most accidents happen on single carriageways.

### 15. Weather Conditions.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 22%;"> </div>

In [None]:
# Weather Conditions – bar chart
plt.figure(figsize=(8,4))
sns.countplot(
    x="Weather_Conditions",
    data=df,
    order=df["Weather_Conditions"].value_counts().index,
    hue="Weather_Conditions",   # avoids FutureWarning
    palette="Set2",
    legend=False
)
plt.title("Distribution of Accidents by Weather Conditions")
plt.xlabel("Weather Conditions")
plt.ylabel("Number of Accidents")
plt.xticks(rotation=30, ha="right")
plt.show()


### 16. Light Condition.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 19%;"> </div>

In [None]:
# Count of each Light Condition
plt.figure(figsize=(10,6))
sns.countplot(x="Light_Conditions", data=df, order=df['Light_Conditions'].value_counts().index)
plt.title("Distribution of Light Conditions")
plt.xlabel("Light Conditions")
plt.ylabel("Number of Accidents")
plt.xticks(rotation=45)  # Rotate x labels if needed
plt.show()


### 17. Urban or Rural Area.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 22%;"> </div>

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(y="Urban_or_Rural_Area", data=df, order=df['Urban_or_Rural_Area'].value_counts().index)
plt.title("Accidents by Urban/Rural Area")
plt.xlabel("Number of Accidents")
plt.ylabel("Area Type")
plt.show()


### 18. Road Surface Conditions.

<div style="border: 2px solid #007acc; padding: 3px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 27%;"> </div>

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(y="Road_Surface_Conditions", data=df, order=df['Road_Surface_Conditions'].value_counts().index)
plt.title("Accidents by Road Surface Conditions")
plt.xlabel("Number of Accidents")
plt.ylabel("Road Surface")
plt.show()


### 19. Junction_Control.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 19%;"> </div>

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(y="Junction_Control", data=df, order=df['Junction_Control'].value_counts().index)
plt.title("Accidents by Junction Control Type")
plt.xlabel("Number of Accidents")
plt.ylabel("Junction Control")
plt.show()


### 20. Local Authority District.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 25%;"> </div>

In [None]:
plt.figure(figsize=(12,8))
top_districts = df['Local_Authority_District'].value_counts().head(20)
sns.barplot(x=top_districts.values, y=top_districts.index)
plt.title("Top 20 Districts by Number of Accidents")
plt.xlabel("Number of Accidents")
plt.ylabel("Local Authority District")
plt.show()


### 21. Police Force.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 15%;"> </div>

In [None]:
plt.figure(figsize=(12,8))
top_forces = df['Police_Force'].value_counts().head(20)
sns.barplot(x=top_forces.values, y=top_forces.index)
plt.title("Top 20 Police Forces by Number of Accidents")
plt.xlabel("Number of Accidents")
plt.ylabel("Police Force")
plt.show()


### 22. Vehicle Type.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 15%;"> </div>

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(y="Vehicle_Type", data=df, order=df['Vehicle_Type'].value_counts().index)
plt.title("Accidents by Vehicle Type")
plt.xlabel("Number of Accidents")
plt.ylabel("Vehicle Type")
plt.show()


### General Function for Study Purposes.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 35%;"> </div>

In [None]:
# ************ For study purposes: a general function for the bar charts above.

# A common helper written to generalise the horizontal bar chart.
def plot_categorical_bar_matplotlib(df, column, top_n=None, figsize=(10,6), title=None, cmap='viridis'):
    """
    Plots a horizontal bar chart for a categorical column using matplotlib.
    """
    counts = df[column].value_counts()
    if top_n:
        counts = counts.head(top_n)
    
    plt.figure(figsize=figsize)
    
    # Get colormap and generate colors for each bar
    cmap_obj = matplotlib.colormaps.get_cmap(cmap)
    colors = cmap_obj(np.linspace(0, 1, len(counts)))
    
    plt.barh(counts.index, counts.values, color=colors)
    
    plt.xlabel("Number of Accidents")
    plt.ylabel(column.replace('_',' '))
    plt.title(title if title else f"Accidents by {column.replace('_',' ')}")
    plt.gca().invert_yaxis()  # Highest count on top
    plt.show()


In [None]:
#plot_categorical_bar_matplotlib(df, "Vehicle_Type")
#plot_categorical_bar_matplotlib(df, "Urban_or_Rural_Area")
#plot_categorical_bar_matplotlib(df, "Local_Authority_District", top_n=20)

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #007acc; text-align: left;width: 99%;"> </div>

# 3.4 🔗 Bivariate Analysis.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #007acc; text-align: left;width: 37%;"> </div>

## 3.4.1 General Function for Categorical X Categorical.

<div style="border: 2px solid #007acc; padding: 3px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 52%;"> </div>

In [None]:
def plot_cat_vs_cat(df, cat1, cat2, top_n_cat1=None, top_n_cat2=None, figsize=(12,6), title=None):
    """
    Plots a grouped bar chart for two categorical variables.

    Parameters:
    - df: pandas DataFrame
    - cat1: string, column for x-axis
    - cat2: string, column for hue
    - top_n_cat1: int or None, top N categories to show for cat1
    - top_n_cat2: int or None, top N categories to show for cat2
    - figsize: tuple, figure size
    - title: string, chart title
    """
    data = df.copy()
    
    # Keep only top N categories if specified
    if top_n_cat1:
        top1 = data[cat1].value_counts().nlargest(top_n_cat1).index
        data = data[data[cat1].isin(top1)]
        
    if top_n_cat2:
        top2 = data[cat2].value_counts().nlargest(top_n_cat2).index
        data = data[data[cat2].isin(top2)]
    
    plt.figure(figsize=figsize)
    sns.countplot(x=cat1, hue=cat2, data=data)
    plt.xlabel(cat1.replace('_',' '))
    plt.ylabel("Number of Accidents")
    plt.title(title if title else f"{cat1.replace('_',' ')} vs {cat2.replace('_',' ')}")
    plt.xticks(rotation=45)
    plt.legend(title=cat2.replace('_',' '))
    plt.show()


### 1. Accident Severity vs Light Conditions.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 37%;"> </div>

In [None]:
#Accident Severity vs Light Conditions
plot_cat_vs_cat(df, "Accident_Severity", "Light_Conditions")

### 2. Accident Severity vs Urban/Rural Area.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 39%;"> </div>

In [None]:
# Accident Severity vs Urban/Rural Area
plot_cat_vs_cat(df, "Accident_Severity", "Urban_or_Rural_Area")

### 3. Accident Severity vs Weather Conditions.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 41%;"> </div>

In [None]:
# Accident Severity vs Weather Conditions (top 10 weather types only)
plot_cat_vs_cat(df, "Accident_Severity", "Weather_Conditions", top_n_cat2=10)

### 4. Accident Severity vs Road_Type.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 33%;"> </div>

In [None]:
plot_cat_vs_cat(df, "Accident_Severity", "Road_Type", top_n_cat2=10)

### 5. Accident Severity vs Junction_Control.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 39%;"> </div>

In [None]:
plot_cat_vs_cat(df, "Accident_Severity", "Junction_Control")

### 6. Accident Severity vs Road_Surface_Conditions.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 47%;"> </div>

In [None]:
plot_cat_vs_cat(df, "Accident_Severity", "Road_Surface_Conditions")

### 7. Accident Severity vs Vehicle_Type.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 35%;"> </div>

In [None]:
plot_cat_vs_cat(df, "Accident_Severity", "Vehicle_Type", top_n_cat2=8)
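
Grouped count charts are dominated by the largest category, so a row-normalised cross-tabulation can make the severity mix within each condition easier to compare. The sketch below uses invented rows and `pd.crosstab` with `normalize="index"`; `toy_df` and `share` are hypothetical names.

```python
import pandas as pd

# Invented records (illustration only) using two of the notebook's columns.
toy_df = pd.DataFrame({
    "Light_Conditions": ["Daylight"] * 6 + ["Darkness"] * 4,
    "Accident_Severity": (["Slight"] * 5 + ["Serious"] * 1      # Daylight rows
                          + ["Slight"] * 2 + ["Serious"] * 1    # Darkness rows
                          + ["Fatal"] * 1),
})

# normalize="index" makes each row (light condition) sum to 1; scale to percent.
share = pd.crosstab(
    toy_df["Light_Conditions"],
    toy_df["Accident_Severity"],
    normalize="index",
) * 100

print(share.round(1))
```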

## 3.4.2 General Function for Categorical X Numerical.

<div style="border: 2px solid #007acc; padding: 3px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 57%;"> </div>

##### The general function below produces a bar plot (of the mean), a box plot (of the distribution) and a statistical summary for any categorical field. Only the field name and the required parameters need to be passed, which keeps the analysis of each field short and consistent.

In [None]:
def analyze_categorical(df, field, target="Number_of_Casualties", 
                        top_n=None, figsize=(8,4), colors="Set2", order_by="frequency"):
    """
    Analyze a categorical field against a target variable.
    
    Parameters:
    df        : DataFrame
    field     : str, categorical column name
    target    : str, numeric column (e.g., "Number_of_Casualties" or "Number_of_Vehicles")
    top_n     : int, optional (limit to top N categories after sorting)
    figsize   : tuple, size of the plots (width, height)
    colors    : str or list, seaborn palette name or custom color list
    order_by  : str, "frequency" | "alphabetical" | "mean"
    """

    print(f"\n====== {field} vs {target} ======")

    # --- Summary table on full dataset ---
    summary_table = df.groupby(field)[target].describe()

    # --- Ordering ---
    if order_by in ["frequency", "count"]:
        order = df[field].value_counts().index
    elif order_by == "alphabetical":
        order = sorted(df[field].dropna().unique())
    elif order_by == "mean":
        order = df.groupby(field)[target].mean().sort_values(ascending=False).index
    else:
        order = df[field].unique()

    # Apply top_n AFTER sorting
    if top_n:
        order = order[:top_n]

    # Restrict data and summary_table to final order
    data = df[df[field].isin(order)]
    summary_table = summary_table.reindex(order)

    categories = len(order)

    # --- Generate consistent palette ---
    if isinstance(colors, list):
        if len(colors) < categories:
            colors = sns.color_palette("husl", categories)  # auto-expand
        else:
            colors = colors[:categories]
    else:
        colors = sns.color_palette(colors, categories)

    # --- Barplot (Mean Target) ---
    plt.figure(figsize=figsize)
    sns.barplot(
        x=field, 
        y=target, 
        data=data, 
        estimator=np.mean, 
        errorbar=None, 
        palette=colors,
        order=order,
        hue=field,        
        legend=False      
    )
    plt.xticks(rotation=45)
    plt.title(f"Average {target} by {field}", fontsize=14)
    plt.ylabel(f"Average {target}")
    plt.xlabel(field)
    plt.show()

    # --- Boxplot (Distribution) ---
    plt.figure(figsize=figsize)
    sns.boxplot(
        x=field, 
        y=target, 
        data=data, 
        palette=colors,
        order=order,
        hue=field,       
        legend=False     
    )
    plt.xticks(rotation=45)
    plt.title(f"Distribution of {target} by {field}", fontsize=14)
    plt.ylabel(target)
    plt.xlabel(field)
    plt.show()

    return summary_table


# Sample call
# Order categories by frequency
#analyze_categorical(df, "Weather_Conditions", target="Number_of_Casualties", order_by="frequency")

# Order alphabetically
#analyze_categorical(df, "Vehicle_Type", target="Number_of_Vehicles", order_by="alphabetical", colors="Set3")

# Order by mean casualties
#analyze_categorical(df, "Junction_Control", target="Number_of_Casualties", order_by="mean", top_n=5, colors="coolwarm")
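
The summary table returned by `analyze_categorical` comes from a `groupby(...).describe()` that is then reordered. The sketch below reproduces just that step on invented data, ordering categories by mean; `toy_df` is a hypothetical name and the values are not from the real dataset.

```python
import pandas as pd

# Invented records (illustration only).
toy_df = pd.DataFrame({
    "Accident_Severity": ["Fatal", "Fatal", "Serious", "Serious",
                          "Slight", "Slight", "Slight"],
    "Number_of_Casualties": [2, 4, 1, 2, 1, 1, 1],
})

grouped = toy_df.groupby("Accident_Severity")["Number_of_Casualties"]

# Per-category count, mean, std, min, quartiles and max.
summary = grouped.describe()

# Order categories by mean (highest first) and apply it to the table.
order = grouped.mean().sort_values(ascending=False).index
summary = summary.reindex(order)

print(summary[["count", "mean", "max"]])
```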

In [None]:
# Another general function: it draws two charts side by side to support the analysis.

def dual_analysis(data, field, target="Number_of_Casualties", order="alphabetical", top_n=None):
    """
    Create two side-by-side plots:
      1. Accident frequency by category
      2. Mean casualties (or target) by category

    Parameters:
        data   : DataFrame
        field  : str, categorical field name
        target : str, numeric field name (default 'Number_of_Casualties')
        order  : 'alphabetical' | 'frequency' | 'mean'  (sorting of categories)
        top_n  : int or None, take only top N categories (after sorting)
    """

    # --- Aggregate ---
    summary = data.groupby(field)[target].agg(
        count="count", mean="mean"
    ).reset_index()

    # --- Sorting ---
    if order == "alphabetical":
        summary = summary.sort_values(field)
    elif order == "frequency":
        summary = summary.sort_values("count", ascending=False)
    elif order == "mean":
        summary = summary.sort_values("mean", ascending=False)

    if top_n:
        summary = summary.head(top_n)

    # --- Plot ---
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Accident Frequency
    sns.barplot(
        data=summary, x=field, y="count",
        ax=axes[0], color="skyblue"
    )
    axes[0].set_title(f"Accident Frequency by {field}")
    axes[0].set_ylabel("Number of Accidents")
    axes[0].set_xlabel(field)
    axes[0].tick_params(axis="x", rotation=75)

    # Mean Casualties
    sns.barplot(
        data=summary, x=field, y="mean",
        ax=axes[1], color="salmon"
    )
    axes[1].set_title(f"Mean {target} by {field}")
    axes[1].set_ylabel(f"Average {target}")
    axes[1].set_xlabel(field)
    axes[1].tick_params(axis="x", rotation=75)

    plt.tight_layout()
    plt.show()


In [None]:
# General function for analysis
# Plots top N categories of a categorical field: bar shows count, line shows mean of a numeric field.

def plot_top_n_risk_sorted(df, field, numeric_field, top_n=10, sort_by='count'):
    """
    Plots top_n categories with bars for count and line for mean numeric_field.
    
    Parameters:
        df (pd.DataFrame): Original dataframe (not modified)
        field (str): Categorical column name
        numeric_field (str): Numeric column to calculate mean
        top_n (int): Number of top categories to show
        sort_by (str): 'count', 'mean', or 'alphabet' for sorting
    Returns:
        pd.DataFrame: grouped dataframe with count and mean
    """
    # Group by field
    grouped = df.groupby(field)[numeric_field].agg(count='count', mean='mean').reset_index()
    
    # Sort based on parameter
    if sort_by == 'count':
        grouped_sorted = grouped.sort_values(by='count', ascending=False)
        sort_label = "Count"
    elif sort_by == 'mean':
        grouped_sorted = grouped.sort_values(by='mean', ascending=False)
        sort_label = f"Mean {numeric_field}"
    elif sort_by == 'alphabet':
        grouped_sorted = grouped.sort_values(by=field, ascending=True)
        sort_label = "Alphabetical Order"
    else:
        raise ValueError("sort_by must be 'count', 'mean', or 'alphabet'")
    
    # Take top_n
    grouped_top = grouped_sorted.head(top_n)
    
    # Plot: bars (count) on the left axis, line (mean) on the right axis
    fig, ax1 = plt.subplots(figsize=(12, 6))
    sns.barplot(x=field, y='count', data=grouped_top, color='skyblue', ax=ax1)
    ax1.set_ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    
    ax2 = ax1.twinx()
    sns.lineplot(x=field, y='mean', data=grouped_top, sort=False, marker='o', color='red', ax=ax2)
    ax2.set_ylabel(f'Mean {numeric_field}')
    
    # Dynamic title reflecting the chosen sort order
    ax1.set_title(f"Top {top_n} {field} sorted by {sort_label}\n(with Mean {numeric_field} line)")
    
    fig.tight_layout()
    plt.show()
    
    return grouped_top


### 8. Accident Severity vs Number_of_Casualties.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 43%;"> </div>

In [None]:
# Accident Severity
analyze_categorical(df, "Accident_Severity", target="Number_of_Casualties", order_by="alphabetical", colors="Set2")

| Accident Severity | Inference |
|---|---|
| Fatal | Average of 1.8 casualties per accident, with high variation (up to 48 casualties in extreme cases). Fatal accidents tend to involve more casualties than the other severities. |
| Serious | Average of 1.46 casualties per accident; mostly 1–2 casualties, with occasional extreme cases (up to 42 casualties). |
| Slight | Average of 1.33 casualties per accident; the majority involve only 1 person, with rare large-scale accidents (up to 43 casualties). |

##### Overall: 
Fatal accidents are generally more severe in terms of casualties per accident, while slight accidents are usually limited to single-casualty events.
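
The per-severity averages and maxima quoted above can be tabulated directly with a `groupby`. The sketch below is only an illustration: `toy` is a small made-up frame standing in for `df`, so the figures it prints are not the dataset's.

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "Accident_Severity": ["Fatal", "Fatal", "Serious", "Serious", "Slight", "Slight", "Slight"],
    "Number_of_Casualties": [1, 5, 1, 2, 1, 1, 3],
})

# Count, mean, median and max of casualties per severity level.
severity_stats = (
    toy.groupby("Accident_Severity")["Number_of_Casualties"]
       .agg(count="count", mean="mean", median="median", max="max")
       .round(2)
)
print(severity_stats)
```

Reporting the median next to the mean helps here, because a handful of very large accidents can pull the mean upward while the typical accident still involves a single casualty.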

### 9. Accident Severity vs Number_of_Vehicles.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 43%;"> </div>

In [None]:
# Accident Severity
analyze_categorical(df, "Accident_Severity", target="Number_of_Vehicles", order_by="alphabetical", colors="Set2")

| Accident Severity | Inference |
|---|---|
| Fatal | Average of 1.76–1.80 vehicles per accident. The majority involve 2 vehicles, but rare extreme multi-vehicle cases occur (up to 16–48 vehicles in the worst cases). |
| Serious | Average of 1.46–1.68 vehicles per accident. Mostly 1–2 vehicles, with a few large accidents (up to 19–42 vehicles). |
| Slight | Average of 1.33–1.85 vehicles per accident. The majority involve 1–2 vehicles, though occasional large pile-ups happen (up to 32–43 vehicles). |

##### Overall: 
Slight and serious accidents usually involve fewer vehicles (1–2), while fatal accidents tend to involve slightly more vehicles on average and sometimes extreme multi-vehicle collisions.

### 10. Road_Type by Number_of_Vehicles.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 37%;"> </div>

In [None]:
# Road Type
analyze_categorical(df, "Road_Type", target="Number_of_Vehicles", order_by="alphabetical", colors="Set2")

In [None]:
# violinplot
plt.figure(figsize=(8,4))
sns.violinplot(x="Road_Type", y="Number_of_Vehicles", data=df)
plt.xlabel("Road Type")
plt.ylabel("Number of Vehicles")
plt.title("Number of Vehicles by Road Type")
plt.xticks(rotation=45)
plt.show()


##### Overall:
Most accidents across all road types involve 1–2 vehicles.

Dual carriageways and roundabouts show a higher tendency for multi-vehicle collisions.

One-way streets generally have the lowest vehicle involvement due to traffic flow restrictions.
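
One way to put a number on the "tendency for multi-vehicle collisions" is the share of accidents per road type that involve three or more vehicles. The sketch below uses invented rows in `toy`; the threshold of 3 is an assumption chosen for illustration, not something fixed by the dataset.

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "Road_Type": ["Dual carriageway"] * 4 + ["One way street"] * 4,
    "Number_of_Vehicles": [2, 3, 4, 1, 1, 1, 2, 1],
})

# Assumed threshold: an accident counts as "multi-vehicle" at 3 or more vehicles.
MULTI_VEHICLE_THRESHOLD = 3

# Flag multi-vehicle accidents, then average the flag per road type to get a share.
toy["multi_vehicle"] = toy["Number_of_Vehicles"] >= MULTI_VEHICLE_THRESHOLD
multi_share = (
    toy.groupby("Road_Type")["multi_vehicle"]
       .mean()
       .sort_values(ascending=False)
)
print(multi_share)
```

A share is easier to compare across road types than the violin shapes, which are dominated by the 1–2 vehicle bulk.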

### 11. Road_Surface_Conditions by Number_of_Casualties.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 52%;"> </div>

In [None]:
# Road Surface Conditions
analyze_categorical(df, "Road_Surface_Conditions", target="Number_of_Casualties", order_by="alphabetical", colors="Set2")

##### Overall:
Accidents on dry roads are the most frequent but relatively less severe, while flooded and wet conditions increase casualty severity. Extreme surfaces like ice can cause rare but very high-casualty accidents.

### 12. Urban_or_Rural_Area by Number_of_Casualties.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 47%;"> </div>

In [None]:
# Urban or Rural Area
analyze_categorical(df, "Urban_or_Rural_Area", target="Number_of_Casualties", order_by="alphabetical", colors="Set2")

### 13. Junction_Control by Number_of_Vehicles.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 43%;"> </div>

In [None]:
# Junction Control
analyze_categorical(df, "Junction_Control", target="Number_of_Vehicles", order_by="alphabetical", colors="Set2")

##### 📝 Key Takeaway

Stop sign junctions → highest casualty severity per crash.

Uncontrolled junctions → biggest contributor in terms of volume.

The share of records with missing junction-control data is too large to ignore and may distort the real patterns.

Pileups (non-junction accidents) create extreme casualty outliers.

##### 👉 Policy/Safety implications:

Improve enforcement at uncontrolled junctions (biggest share).

Reassess stop sign junction safety design (highest severity).

Strengthen signal compliance monitoring.
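
The takeaway about missing data can be checked by measuring how large the missing share is and how the category shares change once those rows are excluded. This is a minimal sketch on invented rows; the placeholder label in `MISSING_LABELS` is an assumption and should be replaced with whatever the dataset actually uses.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "Junction_Control": [
        "Give way or uncontrolled", "Data missing or out of range", np.nan,
        "Stop sign", "Give way or uncontrolled", "Data missing or out of range",
    ]
})

# Assumed placeholder labels that mean "missing"; adjust to the real dataset.
MISSING_LABELS = ["Data missing or out of range"]

is_missing = toy["Junction_Control"].isna() | toy["Junction_Control"].isin(MISSING_LABELS)
missing_share = is_missing.mean()
print(f"Missing share: {missing_share:.1%}")

# Category shares with and without the missing rows, to see how much the picture shifts.
shares_all = toy["Junction_Control"].value_counts(normalize=True, dropna=False)
shares_known = toy.loc[~is_missing, "Junction_Control"].value_counts(normalize=True)
print(shares_known)
```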

### 14. Junction_Detail by Number_of_Vehicles.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 40%;"> </div>

In [None]:
# Junction Detail
analyze_categorical(df, "Junction_Detail", target="Number_of_Vehicles", order_by="alphabetical", colors="Set2")

### 15. Junction_Detail by Number_of_Casualties.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 43%;"> </div>

In [None]:
# Junction_Detail vs Casualties
dual_analysis(df, "Junction_Detail", target="Number_of_Casualties", order="alphabetical")

### 16. Junction_Detail by Number_of_Vehicles.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 40%;"> </div>

In [None]:
# Junction_Detail vs Vehicles
dual_analysis(df, "Junction_Detail", target="Number_of_Vehicles", order="alphabetical")

##### 📝 Key Takeaways

Slip roads and private driveways are riskier per crash (higher severity).

T-junctions and “Not at junction” areas are biggest contributors overall (because of sheer volume).

Roundabouts & crossroads → moderate severity, but still significant due to volume.

Pileups happen mostly outside junctions (max casualties = 32).

##### 👉 Policy/Safety implications:

Enforce speed control & signage at slip roads and private driveways → reduce high-severity crashes.

Target T-junctions and non-junction stretches with better visibility, road design, and traffic calming → reduce total crash burden.

Roundabouts are relatively safer (lower variability), but still need monitoring due to frequency.

### 17. Light_Conditions by Number_of_Casualties.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 45%;"> </div>

In [None]:
# Light Conditions
analyze_categorical(df, "Light_Conditions", target="Number_of_Casualties", order_by="alphabetical", colors="Set2")

##### ✅ Overall Key Takeaways

Most accidents happen during daylight (likely due to traffic volume).

Accidents in complete darkness are slightly more severe on average.

The majority of accidents are slight, but poor lighting increases risk of more serious outcomes.

Outliers in the casualty counts need checking (a maximum of 48 seems unrealistic) and may require data cleaning.
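
A simple way to shortlist suspicious casualty counts is the interquartile-range (IQR) rule. The sketch below applies it to an invented series; the 1.5 multiplier is the conventional default rather than a value tuned to this dataset, and flagged rows are candidates for manual review, not automatic deletion.

```python
import pandas as pd

# Invented casualty counts with one extreme value, for illustration only.
casualties = pd.Series([1, 1, 1, 2, 1, 2, 3, 1, 2, 48], name="Number_of_Casualties")

# Quartiles and the conventional 1.5 * IQR upper fence.
q1, q3 = casualties.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Rows above the fence are flagged for review.
flagged = casualties[casualties > upper_fence]
print(f"Upper fence: {upper_fence}")
print(flagged)
```

A very high count can still be genuine (for example, a crash involving a full coach), so each flagged record is worth checking against its other fields before it is dropped or capped.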

### 18. Local_Authority_District by Number_of_Vehicles.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 49%;"> </div>

In [None]:
# Local Authority District
analyze_categorical(df, "Local_Authority_District", target="Number_of_Vehicles", order_by="count", top_n=20, colors="Set2")

In [None]:
# Example usage:
#grouped_df = plot_top_n_risk_sorted(df, field='Local_Authority_District', numeric_field='Number_of_Casualties', top_n=10, sort_by='count')
grouped_df = plot_top_n_risk_sorted(df, field='Local_Authority_District', numeric_field='Number_of_Vehicles', top_n=20, sort_by='count')
#grouped_df = plot_top_n_risk_sorted(df, field='Local_Authority_District', numeric_field='Number_of_Casualties', top_n=10, sort_by='alphabet')
grouped_df

##### 💡 Summary Insight:

High frequency + moderate mean → busy urban areas with frequent minor collisions.

Lower frequency + higher mean → less traffic, but more serious or multi-vehicle accidents.

Targeting accident prevention measures may differ:

Urban districts: focus on congestion & minor collision mitigation.

Suburban/rural districts: focus on multi-vehicle accident prevention and major collision response.
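
The frequency-versus-mean reading above can be made explicit by splitting districts at the median of each measure. The sketch assumes a summary table shaped like the one returned by `plot_top_n_risk_sorted` (columns `count` and `mean`); the district names and numbers are invented, and the median split is only one possible cut-off.

```python
import numpy as np
import pandas as pd

# Hypothetical per-district summary; values are invented for illustration only.
summary = pd.DataFrame({
    "Local_Authority_District": ["A", "B", "C", "D"],
    "count": [900, 850, 120, 100],
    "mean": [1.7, 2.1, 2.2, 1.6],
})

# Split each measure at its median.
high_freq = summary["count"] >= summary["count"].median()
high_mean = summary["mean"] >= summary["mean"].median()

# Assign one of four profiles to every district.
summary["profile"] = np.select(
    [high_freq & high_mean, high_freq & ~high_mean, ~high_freq & high_mean],
    ["high frequency, high mean", "high frequency, lower mean", "lower frequency, high mean"],
    default="lower frequency, lower mean",
)
print(summary)
```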

### 19. Carriageway_Hazards by Number_of_Vehicles.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 47%;"> </div>

In [None]:
# Carriageway Hazards
analyze_categorical(df, "Carriageway_Hazards", target="Number_of_Vehicles", order_by="mean", colors="Set2")

##### Summary Insight:

High-severity accidents are associated with previous accident locations and vehicle loads.

Most frequent accidents occur under normal conditions (no visible hazard) and are mostly minor.

Pedestrian/animal-related accidents are low-impact but still important for safety measures.

### 20. Police_Force by Number_of_Casualties.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 39%;"> </div>

In [None]:
# Police Force
analyze_categorical(df, "Police_Force", target="Number_of_Casualties", order_by="mean", top_n=15, colors="Set2")

In [None]:
# Example usage:
#grouped_df = plot_top_n_risk_sorted(df, field='Police_Force', numeric_field='Number_of_Casualties', top_n=10, sort_by='count')
grouped_df = plot_top_n_risk_sorted(df, field='Police_Force', numeric_field='Number_of_Casualties', top_n=15, sort_by='mean')
#grouped_df = plot_top_n_risk_sorted(df, field='Police_Force', numeric_field='Number_of_Casualties', top_n=10, sort_by='alphabet')
grouped_df

##### 💡 Summary Insight:

High-traffic areas like Greater Manchester and West Yorkshire have the highest accident frequency.

Most accidents are minor (1–2 vehicles) across all police forces.

Occasional large multi-vehicle incidents occur, mostly in less urban areas, but they are rare.

### 21. Weather_Conditions by Number_of_Casualties.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 47%;"> </div>

In [None]:
# Weather Conditions
analyze_categorical(df, "Weather_Conditions", target="Number_of_Casualties", order_by="alphabetical", colors="Set2")


##### 💡 Summary Insight:

Most accidents happen in fine weather, indicating driver behavior is a major factor.

Poor weather slightly increases multi-vehicle involvement, but these accidents are less frequent.

Rare large collisions can occur under any weather conditions.

### 22. Vehicle_Type by Number_of_Vehicles.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 39%;"> </div>

In [None]:
# Vehicle Type vs Number_of_Vehicles
analyze_categorical(df, "Vehicle_Type", target="Number_of_Vehicles", order_by="alphabetical", colors="Set2")


In [None]:
# Vehicle_Type vs Casualties
dual_analysis(df, "Vehicle_Type", target="Number_of_Casualties", order="alphabetical")

In [None]:
# Example usage:
#grouped_df = plot_top_n_risk_sorted(df, field='Vehicle_Type', numeric_field='Number_of_Casualties', top_n=10, sort_by='count')
#grouped_df = plot_top_n_risk_sorted(df, field='Vehicle_Type', numeric_field='Number_of_Vehicles', top_n=15, sort_by='mean')
grouped_df = plot_top_n_risk_sorted(df, field='Vehicle_Type', numeric_field='Number_of_Vehicles', top_n=10, sort_by='alphabet')
grouped_df

##### 💡 Summary Insight:

Cars and vans dominate accident counts but usually involve small numbers of vehicles (1–2).

Pedal cycles and small motorcycles show slightly higher mean vehicles involved, suggesting higher multi-vehicle interaction.

Ridden horses are rare and low-risk in terms of multi-vehicle collisions.

Extreme multi-vehicle accidents exist but are uncommon, mostly involving cars.

## 3.4.3 General Function for Numerical X Numerical.

<div style="border: 2px solid #007acc; padding: 3px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 52%;"> </div>

In [None]:
def plot_scatter_with_summary(df, x, y, hue=None, figsize=(6,4), bin_x=True, bins=10):
    """
    Scatter plot, summary table, and bar+line plot for numeric vs numeric variables.

    Parameters:
        df     : DataFrame
        x, y   : numeric columns
        hue    : optional categorical column
        figsize: tuple
        bin_x  : bool, whether to bin x for grouping
        bins   : number of bins if bin_x=True

    Returns:
        summary_table : DataFrame of descriptive stats per bin/hue
    """
    data = df[[x, y]].copy()
    if hue:
        data[hue] = df[hue]

    # Bin x if needed
    if bin_x and data[x].nunique() > bins:
        data['x_bin'] = pd.qcut(data[x], q=bins, duplicates='drop')
        group_field = 'x_bin'
    else:
        group_field = hue if hue else None

    # --- Scatter plot ---
    plt.figure(figsize=figsize)
    sns.scatterplot(data=data, x=x, y=y, hue=hue, alpha=0.5)
    plt.title(f'{y} vs {x}', fontsize=14)
    plt.xlabel(x)
    plt.ylabel(y)
    plt.tight_layout()
    plt.show()

    # --- Summary table ---
    if group_field:
        summary_table = data.groupby(group_field, observed=False)[y].describe()
    else:
        summary_table = data[y].describe().to_frame().T

    # --- Bar + Line plot ---
    if group_field:
        agg = data.groupby(group_field, observed=False)[y].agg(count='count', mean='mean').reset_index()
        
        # Convert Interval or categorical to string for plotting
        agg[group_field] = agg[group_field].astype(str)

        fig, ax1 = plt.subplots(figsize=figsize)
        
        sns.barplot(x=group_field, y='count', data=agg, color='skyblue', ax=ax1)
        ax1.set_xlabel(group_field)
        ax1.set_ylabel('Count', color='skyblue')
        ax1.tick_params(axis='y', labelcolor='skyblue')
        plt.xticks(rotation=45, ha='right')

        ax2 = ax1.twinx()
        sns.lineplot(x=group_field, y='mean', data=agg, sort=False, marker='o', color='red', ax=ax2)
        ax2.set_ylabel(f'Mean {y}', color='red')
        ax2.tick_params(axis='y', labelcolor='red')

        plt.title(f'{y} count and mean per {group_field}')
        fig.tight_layout()
        plt.show()

    return summary_table


### 23. Number_of_Vehicles vs Number_of_Casualties.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 47%;"> </div>

In [None]:
#Number_of_Vehicles vs Number_of_Casualties
summary = plot_scatter_with_summary(df, 'Number_of_Vehicles', 'Number_of_Casualties', bin_x=True, bins=10, hue='Road_Surface_Conditions' )
summary

##### 🔑 Summary:

Frequency decreases as the number of vehicles increases.

Severity (mean casualties) increases with the number of vehicles.

Most accidents are minor, but high-vehicle accidents, though rare, can be very severe.
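
Both statements (falling frequency, rising mean casualties) can be read off a table grouped on the vehicle count itself. Because `Number_of_Vehicles` is a heavily tied count, quantile bins with `duplicates='drop'` may merge into fewer bins than requested, so the sketch below groups on the raw count and folds the sparse tail into a "4+" band. The data and the cut-off of 4 are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "Number_of_Vehicles":   [1, 1, 1, 1, 2, 2, 2, 3, 3, 5],
    "Number_of_Casualties": [1, 1, 1, 2, 1, 2, 2, 2, 3, 6],
})

# Collapse the sparse tail into a single "4+" band so every group has some support.
vehicle_band = (
    toy["Number_of_Vehicles"].clip(upper=4).astype(str).replace({"4": "4+"})
)

# Accident count and mean casualties per vehicle band.
by_vehicles = (
    toy.groupby(vehicle_band)["Number_of_Casualties"]
       .agg(count="count", mean="mean")
)
print(by_vehicles)
```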

### 24. Speed_limit vs Number_of_Vehicles.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 37%;"> </div>

In [None]:
# Speed_limit vs Number_of_Vehicles
# Candidate hue columns: Road_Type, Road_Surface_Conditions, Weather_Conditions

summary = plot_scatter_with_summary(df, 'Speed_limit', 'Number_of_Vehicles', bin_x=True, bins=10, hue='Weather_Conditions')
summary

##### 💡 Insight: 
Weather alone is not the main factor; traffic density and human behavior likely play a bigger role in accident counts and casualties.

### 25. Speed_limit vs Number_of_Casualties.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 39%;"> </div>

In [None]:
#plot_scatter(df, 'Speed_limit', 'Number_of_Casualties')

summary = plot_scatter_with_summary(df, 'Speed_limit', 'Number_of_Casualties', bin_x=True, bins=10, hue='Road_Type')
summary

##### ✅ Overall Insight:

Most accidents happen on single carriageways (high exposure), but dual carriageways have higher casualty severity per crash.

Roundabouts and one-way streets are comparatively safer.

Road safety strategies should target single and dual carriageways with stricter traffic management, speed control, and visibility improvements.

### 26. Correlation Heatmap.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 23%;"> </div>

In [None]:
numeric_cols = ['Number_of_Vehicles', 'Number_of_Casualties', 'Speed_limit']
plt.figure(figsize=(6,5))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()


##### 📌 Key Insights

Casualties are not strongly explained by vehicle count or speed limit alone.
→ Other variables like weather, light conditions, road type, junction details, vehicle type likely explain accident severity better.

Weak correlations suggest complex interactions.
→ For example, a 1-vehicle accident (like a bus crash) can cause more casualties than a 3-vehicle crash.

Speed alone is not the strongest predictor.
→ Urban low-speed roads may have high casualty counts due to pedestrian involvement, while high-speed highways often have fewer casualties but more severe outcomes.
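
Since the heatmap uses Pearson coefficients, which measure only linear association and are sensitive to extreme rows, it can help to compare them with rank-based (Spearman) coefficients. The sketch below does this on invented data; Spearman is computed as Pearson on the ranks, which is its definition.

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "Number_of_Vehicles":   [1, 1, 2, 2, 2, 3, 3, 4, 1, 2],
    "Number_of_Casualties": [1, 9, 1, 2, 1, 2, 1, 3, 1, 1],
    "Speed_limit":          [30, 30, 30, 40, 60, 60, 70, 70, 30, 40],
})

# Pearson: strength of the linear relationship on the raw values.
pearson = toy.corr(method="pearson")

# Spearman: Pearson applied to ranks, so it captures monotonic (not only linear)
# association and is less dominated by a single extreme row.
spearman = toy.rank().corr(method="pearson")

# Side-by-side view of how each feature relates to casualties.
comparison = pd.DataFrame({
    "pearson": pearson["Number_of_Casualties"],
    "spearman": spearman["Number_of_Casualties"],
}).drop(index="Number_of_Casualties")
print(comparison.round(2))
```

If the two columns differ noticeably on the real data, that would suggest the weak Pearson values are partly driven by outliers or by a non-linear relationship.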

In [None]:
sns.lmplot(x='Number_of_Vehicles', y='Number_of_Casualties', data=df, height=5, aspect=1.5)

##### This is a Seaborn lmplot of Number_of_Vehicles vs Number_of_Casualties.
Casualties do not scale linearly with vehicles: 1-vehicle accidents (like a car hitting pedestrians) can cause just as many — or more — casualties than multi-vehicle pileups.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #007acc; text-align: left;width: 99%;"> </div>

# 3.5 🔗 Multivariate Analysis.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #007acc; text-align: left;width: 40%;"> </div>

### 1. Pairwise Relationships: Casualties, Vehicles, Speed & Severity.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 59%;"> </div>

#### Graph: Pair Plot

This visualization displays scatter plots and KDE distributions for three numeric variables: Number_of_Casualties, Number_of_Vehicles, and Speed_limit.
Data points are color-coded by Accident_Severity to reveal severity-specific patterns.
The diagonal shows the distribution of each variable, while the off-diagonals reveal relationships and clustering patterns across variables.

#### ⚡Focus:
Explore interrelationships among accident severity, casualties, vehicles involved, and speed limits to identify potential trends or risk factors.

In [None]:
# Pairplot with better styling
sns.pairplot(
    df[["Number_of_Casualties", "Number_of_Vehicles", "Speed_limit", "Accident_Severity"]],
    hue="Accident_Severity",
    diag_kind="kde",
    height=2.5,
    palette="Set2"
)

plt.suptitle("Pairwise Relationships: Casualties, Vehicles, Speed & Severity", y=1.02)
plt.show()


#### 🔍 Insight: 
Higher accident severity often corresponds to a larger number of casualties, while speed limits and vehicle counts show moderate correlation patterns. Severe accidents tend to cluster in areas with both higher casualties and vehicle counts.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 2. Correlation Heatmap of Numeric Features.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 43%;"> </div>

#### Graph: Heatmap
This visualization displays the correlation matrix for numeric variables, excluding Year, Month, and Hour. Each cell shows the Pearson correlation coefficient between two variables, with color intensity indicating the strength and direction of the relationship. Strong positive correlations appear in warm colors, while strong negative correlations appear in cool colors, helping identify dependent or independent features.

#### ⚡Focus:
Examine relationships between numeric variables in the dataset while ignoring time-related columns (Year, Month, Hour).

In [None]:
# Select only numeric columns, excluding Year, Month, Hour
numeric_df = df.select_dtypes(include=['int64', 'float64']).drop(columns=['Year','Month','Hour'], errors='ignore')

plt.figure(figsize=(10,6))
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap of Numeric Features", fontsize=16)
plt.show()


#### 🔍 Insight:
The heatmap shows that Number_of_Vehicles and Number_of_Casualties are strongly positively correlated, indicating that accidents involving more vehicles tend to result in higher casualties. Speed_limit shows weaker correlations with other variables, suggesting it is more independent. Accident_Severity has moderate correlations with casualties and vehicles, highlighting its connection to accident impact.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 3. FacetGrid Scatter Plot – Vehicle vs Casualties Analysis.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 53%;"> </div>

#### Graph: FacetGrid Scatter Plot

This visualization displays scatter plots of Number_of_Vehicles vs Number_of_Casualties for each Light_Condition, with points color-coded by Weather_Conditions. Multiple subplots reveal how accident patterns vary under different lighting conditions, while the color indicates the influence of weather.

#### ⚡Focus:
Examine how the number of vehicles and casualties relate across different light and weather conditions, highlighting patterns specific to each scenario.

In [None]:
# FacetGrid Scatter Plot: Number of Vehicles vs Number of Casualties by Light and Weather Conditions
g = sns.FacetGrid(df, 
                  col="Light_Conditions", 
                  hue="Weather_Conditions", 
                  col_wrap=3, 
                  height=4, 
                  sharex=True, 
                  sharey=True)

# Plot scatter points
g.map(sns.scatterplot, "Number_of_Vehicles", "Number_of_Casualties", alpha=0.6)

# Add legend and titles
g.add_legend(title="Weather Conditions")
g.set_axis_labels("Number of Vehicles", "Number of Casualties")
g.set_titles(col_template="{col_name}")  # Show Light Condition as subplot title
# Main title and subtitle; reserve space at the top of the figure for both
g.fig.suptitle("Accidents: Vehicles vs Casualties by Light and Weather Conditions", fontsize=16)
plt.subplots_adjust(top=0.88)
g.fig.text(0.5, 0.92, "Accident patterns across different Light and Weather Conditions", ha='center', fontsize=12)

plt.show()


#### 💡Insight:
Higher numbers of vehicles generally correlate with more casualties, especially under poor lighting conditions such as darkness or low visibility. Certain weather conditions, like rain or fog, amplify this effect, showing clusters with higher casualty counts.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 4. Casualty Analysis by Light Conditions and Accident Severity.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 60%;"> </div>

#### Graph: Boxplot – Casualties by Light & Severity

This visualization shows the distribution of Number_of_Casualties across different Light_Conditions, with Accident_Severity indicated by color. Each box represents the spread of casualties for a specific light condition and severity, highlighting medians, quartiles, and outliers.

#### ⚡Focus:

Compare how accident severity affects casualty counts under different lighting conditions, emphasizing patterns in low-light or adverse situations.

In [None]:
# Accident_Severity (cat) × Light_Conditions (cat) × Number_of_Casualties (num).
# Main Heading: Casualty Analysis by Light Conditions and Accident Severity

plt.figure(figsize=(12,6))
sns.boxplot(
    data=df,
    x="Light_Conditions",          # categorical (x-axis)
    y="Number_of_Casualties",      # numeric (y-axis)
    hue="Accident_Severity"        # second categorical (color/hue)
)

# Titles and labels for clarity
plt.xticks(rotation=45)
plt.xlabel("Light Conditions")
plt.ylabel("Number of Casualties")
plt.title("Casualties by Light Conditions & Accident Severity", fontsize=16)
plt.legend(title="Accident Severity")
plt.tight_layout()
plt.show()

#### 💡Insight:

Severe accidents tend to result in higher casualties, especially in poor lighting conditions like darkness or dusk. Minor accidents usually cluster near lower casualty counts, while outliers indicate rare but extreme casualty events. This helps identify high-risk light conditions for targeted safety measures.
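
The quartiles that each box draws can also be tabulated, which makes small differences between light conditions easier to compare than by eye. The sketch below uses invented rows in `toy`; the column names match the plot above, and the values do not come from the dataset.

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "Light_Conditions": ["Daylight", "Daylight", "Daylight", "Darkness", "Darkness", "Darkness"],
    "Accident_Severity": ["Slight", "Slight", "Fatal", "Slight", "Fatal", "Fatal"],
    "Number_of_Casualties": [1, 2, 3, 1, 2, 6],
})

# Quartiles per (light condition, severity) pair: the same statistics a boxplot draws.
box_stats = (
    toy.groupby(["Light_Conditions", "Accident_Severity"])["Number_of_Casualties"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
box_stats.columns = ["q1", "median", "q3"]
print(box_stats)
```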

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 5. Casualty Analysis by Road Type and Accident Severity.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 54%;"> </div>

#### Graph: Boxplot – Casualties by Road Type & Severity

This visualization shows the distribution of Number_of_Casualties across different Road_Type categories, with Accident_Severity indicated by color. Each box represents the spread of casualties for a specific road type and severity, highlighting medians, quartiles, and outliers.

#### ⚡Focus:

Compare how accident severity affects casualty counts across different types of roads, highlighting high-risk road types.

In [None]:
# Accident Severity × Road Type × Number of Casualties
# Main Heading: Casualty Analysis by Road Type and Accident Severity

plt.figure(figsize=(12,6))
sns.boxplot(
    data=df,
    x="Road_Type",                # categorical (x-axis)
    y="Number_of_Casualties",     # numeric (y-axis)
    hue="Accident_Severity"       # second categorical (color/hue)
)

# Titles and labels for clarity
plt.xticks(rotation=45)
plt.xlabel("Road Type")
plt.ylabel("Number of Casualties")
plt.title("Casualties by Road Type & Accident Severity", fontsize=16)
plt.legend(title="Accident Severity")
plt.tight_layout()
plt.show()


#### 💡Insight:

Severe accidents on highways and major roads tend to result in higher casualties, while minor roads generally show lower casualty counts. Outliers indicate rare events with unusually high casualties, emphasizing the importance of safety measures on high-speed or heavily trafficked roads.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 6. Accident Severity by Area Type and Weather Conditions.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 55%;"> </div>

#### Graph: FacetGrid – Urban/Rural × Weather × Accident Severity

This visualization displays the count of accidents for each Accident_Severity across Urban and Rural areas, with Weather_Conditions indicated by color. Each subplot represents an area type, making it easier to compare patterns without clutter.

#### ⚡Focus:
Analyze how accident severity is influenced by the combination of area type and weather conditions, highlighting high-risk scenarios across all weather conditions.

In [None]:
# FacetGrid: separate plots for Urban and Rural areas
g = sns.catplot(
    data=df,
    kind="count",
    x="Accident_Severity",
    hue="Weather_Conditions",
    col="Urban_or_Rural_Area",
    height=5,
    aspect=1
)

# Titles and labels
g.fig.suptitle("Accident Severity by Area Type and Weather Conditions", fontsize=16)
g.set_axis_labels("Accident Severity", "Count")
g.set_titles("{col_name} Area")
g._legend.set_title("Weather Conditions")

plt.tight_layout()
plt.show()


#### 💡Insight:

Rural areas generally have higher counts of severe and fatal accidents under adverse weather, while urban areas mostly show minor accidents. This multi-dimensional view highlights the need for safety measures targeting specific weather and area conditions.
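
Raw counts in each panel also reflect how many accidents were recorded in that area type overall, so a row-normalised cross-tabulation is a useful complementary check. The sketch below uses invented rows in `toy` and shows the share of each severity within urban and rural areas.

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "Urban_or_Rural_Area": ["Urban"] * 6 + ["Rural"] * 4,
    "Accident_Severity": ["Slight"] * 5 + ["Fatal"] + ["Slight"] * 2 + ["Serious", "Fatal"],
})

# normalize="index" turns each row into shares that sum to 1 within the area type.
severity_share = pd.crosstab(
    toy["Urban_or_Rural_Area"], toy["Accident_Severity"], normalize="index"
)
print(severity_share.round(2))
```

The same call can be repeated on a subset filtered to a single weather condition to compare shares under adverse and fine weather.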

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 7. Extended Correlation Analysis – Numerical & Categorical Features.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 65%;"> </div>

#### Graph: Heatmap – Numerical + Encoded Categorical Features

This visualization displays the correlation matrix between numeric variables (Number_of_Vehicles, Number_of_Casualties, Speed_limit) and encoded categorical features. Each cell shows the Pearson correlation coefficient, with color intensity indicating the strength and direction of relationships. Warm colors represent strong positive correlations, while cool colors represent strong negative correlations.

#### ⚡Focus:

Identify and emphasize strong linear relationships between numeric features and encoded categorical variables. This helps quickly spot which factors most influence accident severity and casualty counts, while weak correlations are de-emphasized for clarity.

In [None]:
# Encode categorical variables using LabelEncoder and compute correlations 
# between numerical features (Number_of_Vehicles, Number_of_Casualties, Speed_limit) 
# and encoded categorical features. Finally, visualize the correlation matrix 
# as a heatmap to explore potential linear relationships.

from sklearn.preprocessing import LabelEncoder

# Copy dataframe to avoid overwriting original
df_encoded = df.copy()

# Categorical columns to encode
categorical_cols = [
    'Accident_Severity', 'Road_Type', 'Weather_Conditions',
    'Light_Conditions', 'Urban_or_Rural_Area',
    'Junction_Control', 'Junction_Detail',
    'Road_Surface_Conditions', 'Carriageway_Hazards',
    'Vehicle_Type', 'Local_Authority_District'
]

# Encode categorical columns
le = LabelEncoder()
for col in categorical_cols:
    if col in df_encoded.columns:
        df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))

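# Select numeric + encoded columns that are actually present in the dataframe
numeric_cols = ['Number_of_Vehicles', 'Number_of_Casualties', 'Speed_limit']
selected_cols = [col for col in numeric_cols + categorical_cols if col in df_encoded.columns]
corr = df_encoded[selected_cols].corr()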

# Highlight strong correlations
strong_corr = corr.copy()
strong_corr[np.abs(strong_corr) < 0.5] = 0  # set weak correlations to 0 for emphasis

# Plot heatmap
plt.figure(figsize=(14,10))
sns.heatmap(
    corr, 
    annot=True, 
    fmt=".2f", 
    cmap="coolwarm", 
    center=0,
    cbar_kws={'label': 'Correlation Coefficient'},
    linewidths=0.5
)

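# Outline strong off-diagonal correlations (|r| >= 0.5) with a bold border;
# the diagonal is skipped because every feature correlates perfectly with itself
ax = plt.gca()
for i in range(len(corr)):
    for j in range(len(corr)):
        if i != j and abs(corr.iloc[i, j]) >= 0.5:
            ax.add_patch(plt.Rectangle((j, i), 1, 1, fill=False, edgecolor='black', lw=2))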

# Titles and labels
plt.title("Extended Correlation Heatmap (Numerical + Encoded Categorical Features & Strong Correlations Highlighted)", fontsize=18)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


#### 💡Insight:

Number_of_Vehicles and Number_of_Casualties have a strong positive correlation, confirming that accidents involving more vehicles tend to result in higher casualties.

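Accident_Severity shows moderate to strong positive correlations with Number_of_Casualties and with certain features such as Light_Conditions and Road_Type, indicating that severe accidents are associated with specific road and lighting conditions. Because the severity labels are encoded in sorted order (Fatal = 0, Serious = 1, Slight = 2), the sign of these coefficients reflects that coding rather than increasing severity.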

Vehicle_Type and Carriageway_Hazards show strong correlations with other variables, suggesting their influence on accident outcomes.

Many encoded categorical variables exhibit weak correlations (|r| < 0.5) with numeric features, indicating little or no linear relationship.

This heatmap allows quick identification of key features for predictive modeling, focusing on those with strong correlations to casualties and severity.
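
The sketch below illustrates why Pearson coefficients on label-encoded nominal features should be interpreted with care. It uses a small toy table (hypothetical values, not taken from the accident dataset) and `pd.factorize(..., sort=True)`, which numbers categories in sorted order just as LabelEncoder does. Renumbering the same categories changes the coefficient even though the data are identical.

```python
import pandas as pd

# Toy data (hypothetical values, not from the accident dataset)
toy = pd.DataFrame({
    "Road_Type": ["Dual carriageway", "Roundabout", "Single carriageway"] * 2,
    "Number_of_Casualties": [1, 3, 2] * 2,
})

# Codes in sorted label order, the same order LabelEncoder uses:
# Dual carriageway=0, Roundabout=1, Single carriageway=2
sorted_codes, _ = pd.factorize(toy["Road_Type"], sort=True)
r_sorted = pd.Series(sorted_codes).corr(toy["Number_of_Casualties"])

# A different but equally arbitrary numbering of the same categories
custom_map = {"Dual carriageway": 0, "Single carriageway": 1, "Roundabout": 2}
r_custom = toy["Road_Type"].map(custom_map).corr(toy["Number_of_Casualties"])

print(f"sorted codes: r = {r_sorted:.2f}")  # 0.50
print(f"custom codes: r = {r_custom:.2f}")  # 1.00
```

For features with no natural order, the one-hot view later in this notebook avoids this dependence on the numbering.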

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 8. Feature Correlation with Number of Casualties.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 47%;"> </div>

#### Graph: Bar Plot – Feature Correlation

This visualization shows the Pearson correlation of all numeric and encoded categorical features with Number_of_Casualties. Features are sorted by correlation value; bars with |r| ≥ 0.5 are drawn in dark red and all others in light gray.

#### ⚡Focus:

Quickly identify the features most strongly associated with the number of casualties, emphasizing those with significant linear relationships.

This complements the Insight section and aligns with the visual emphasis on strong correlations.

In [None]:
plt.figure(figsize=(8,6))

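# Correlation of all features with Number_of_Casualties
# (the target itself is dropped because its self-correlation is always 1)
casualty_corr = corr['Number_of_Casualties'].drop('Number_of_Casualties').sort_values(ascending=False)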

# Create color array: strong correlations (>=0.5) in dark red, others in light gray
colors = ['firebrick' if abs(x) >= 0.5 else 'lightgray' for x in casualty_corr.values]

# Barplot
sns.barplot(
    x=casualty_corr.values,
    y=casualty_corr.index,
    hue=casualty_corr.index,
    palette=colors,            # one color per feature, in sorted order
    dodge=False,
    legend=False               # hide duplicate legend
)
plt.title("Correlation of Features with Number of Casualties", fontsize=16)
plt.xlabel("Correlation with Number of Casualties")
plt.ylabel("Features")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


#### 💡Insight:

Number_of_Vehicles shows the strongest positive correlation with casualties, emphasizing that accidents involving more vehicles tend to result in higher casualty counts.

Accident_Severity and Light_Conditions have moderate positive correlations, highlighting their influence on accident outcomes.

Other features, including most encoded categorical variables, show weak correlations (|r| < 0.5), indicating they have minimal linear relationship with casualty counts.

Highlighting strong correlations visually allows quick identification of key factors to prioritize for modeling, safety analysis, or targeted interventions.
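
As a complement to Pearson correlation on encoded labels, the correlation ratio (eta squared) measures how much of the variance in a numeric column is explained by membership in a nominal category, without assuming any order among the categories. The helper below is a minimal sketch; `correlation_ratio` is a hypothetical name introduced here, and the toy values are illustrative only.

```python
import pandas as pd

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Return eta squared: the share of variance in `values` explained by the groups."""
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0  # constant target: nothing to explain
    grouped = values.groupby(categories)
    # Between-group sum of squares: group size times squared distance of group mean
    ss_between = (grouped.size() * (grouped.mean() - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Light_Conditions": ["Daylight", "Daylight", "Darkness", "Darkness"],
    "Separated": [1, 1, 3, 3],   # groups differ completely
    "Mixed": [1, 3, 1, 3],       # groups have identical means
})

eta_separated = correlation_ratio(toy["Light_Conditions"], toy["Separated"])
eta_mixed = correlation_ratio(toy["Light_Conditions"], toy["Mixed"])
print(eta_separated, eta_mixed)  # 1.0 0.0
```

Values near 1 mean the category almost fully determines the numeric value; values near 0 mean the group means barely differ.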

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 9. Feature Correlation with Number of Casualties (One-Hot Encoded).

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 65%;"> </div>

#### Graph: Bar Plot – Correlation of All Features

This visualization shows the correlation of numeric and one-hot encoded categorical features with Number_of_Casualties, sorted by value. The coolwarm palette is applied in sorted order, so bar colors reflect each feature's position in the ranking rather than the exact coefficient.

#### ⚡Focus:

Examine which one-hot encoded features (e.g., specific road types, light conditions, or junction controls) are most associated with casualty counts to identify important factors influencing accidents.

In [None]:
# One-hot encode selected categorical columns
df_encoded = pd.get_dummies(df, columns=["Road_Type", "Weather_Conditions", 
                                         "Light_Conditions", "Junction_Control"],
                            drop_first=True)

# Compute correlation with Number_of_Casualties, excluding the target's self-correlation
casualty_corr = (
    df_encoded.corr(numeric_only=True)["Number_of_Casualties"]
    .drop("Number_of_Casualties")
    .sort_values(ascending=False)
)

# Plot barplot
plt.figure(figsize=(10,6))
sns.barplot(x=casualty_corr.values, y=casualty_corr.index, palette="coolwarm", hue=casualty_corr.index, legend=False)
plt.title("Correlation of All Features with Number of Casualties (After One-Hot Encoding)", fontsize=16)
plt.xlabel("Correlation with Number of Casualties")
plt.ylabel("Features")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

#### 💡Insight:

Original numeric features like Number_of_Vehicles remain the features most strongly correlated with casualties.

Certain one-hot encoded categories, such as specific Road_Type or Light_Conditions values, show moderate correlations, highlighting their association with casualty counts.

This detailed view helps identify specific scenarios (e.g., particular road types or junction controls) that contribute more to casualty counts, aiding targeted safety measures or predictive modeling.
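
The ranking step can be wrapped in a small helper so that it is easy to rerun for a different target or a different set of categorical columns. This is a sketch under stated assumptions: `one_hot_target_corr` is a hypothetical name, the toy values are illustrative, and `drop_first` is left off so that every category receives its own coefficient (useful for reading the chart, not necessarily for modeling).

```python
import pandas as pd

def one_hot_target_corr(frame: pd.DataFrame, cat_cols: list, target: str) -> pd.Series:
    """One-hot encode `cat_cols` and rank features by |Pearson r| with `target`."""
    encoded = pd.get_dummies(frame, columns=cat_cols, dtype=int)
    corr = encoded.corr(numeric_only=True)[target].drop(target)
    # Sort by absolute value so strong negative associations are not buried
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Road_Type": ["Roundabout", "Roundabout", "Single carriageway", "Single carriageway"],
    "Number_of_Vehicles": [1, 2, 2, 3],
    "Number_of_Casualties": [1, 1, 2, 2],
})

ranked = one_hot_target_corr(toy, ["Road_Type"], "Number_of_Casualties")
print(ranked)
```

For a 0/1 dummy column, the Pearson coefficient is the point-biserial correlation, so its sign shows whether the category's mean casualty count lies above or below that of the remaining records.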

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 10. Casualties by Severity × Junction Control × Light Conditions.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 60%;"> </div>

#### Graph: FacetGrid Boxplot – Casualties Across Junctions and Light Conditions

This visualization displays the distribution of Number_of_Casualties for each Light_Conditions, with Accident_Severity indicated by color. Multiple subplots represent different Junction_Control types, allowing comparison of accident severity patterns across junction types and lighting conditions.

#### ⚡Focus:

Examine how accident severity and lighting conditions interact at different junction types to identify scenarios with higher casualty risks.

In [None]:
# FacetGrid Boxplot: Number of Casualties by Light Conditions, Accident Severity, and Junction Control

# Increase figure size for readability
g = sns.catplot(
    x="Light_Conditions",
    y="Number_of_Casualties",
    hue="Accident_Severity",
    col="Junction_Control",
    data=df,
    kind="box",
    col_wrap=3,   # makes multiple columns wrap into rows
    height=5,     # size of each small plot
    aspect=1.2    # width/height ratio
)

# Improve readability
g.set_titles("Junction Control: {col_name}")
g.set_axis_labels("Light Conditions", "Number of Casualties")
g.set_xticklabels(rotation=45)
plt.subplots_adjust(top=0.85)
g.fig.suptitle("Casualties by Severity × Junction Control × Light Conditions", fontsize=18)

plt.show()


#### 💡Insight:

Severe accidents often result in higher casualties, particularly under poor lighting conditions.

Certain junction types, such as crossroads or mini-roundabouts, show wider variability in casualties, indicating higher-risk locations.

Slight accidents are generally clustered at lower casualty counts, while extreme outliers highlight rare but serious events.

This multi-dimensional view helps prioritize safety interventions at specific junctions under certain light conditions.
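
Boxplots across many facets can be hard to compare by eye, so a numeric table of the same statistics may help. The sketch below computes the group size, median, upper quartile, and maximum per group; `casualty_summary` is a hypothetical helper and the toy rows are illustrative values only. The group size `n` is included because boxes built from a handful of records are unreliable.

```python
import pandas as pd

def casualty_summary(frame: pd.DataFrame, by: list) -> pd.DataFrame:
    """Summarize Number_of_Casualties per group with the statistics a boxplot shows."""
    return (
        frame.groupby(by, observed=True)["Number_of_Casualties"]
        .agg(n="count", median="median", q75=lambda s: s.quantile(0.75), max="max")
        .reset_index()
    )

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Junction_Control": ["Give way"] * 4 + ["Auto traffic signal"] * 2,
    "Light_Conditions": ["Daylight"] * 6,
    "Number_of_Casualties": [1, 1, 2, 4, 1, 3],
})

summary = casualty_summary(toy, ["Junction_Control", "Light_Conditions"])
print(summary)
```

On the full dataset, the same call with `["Junction_Control", "Light_Conditions", "Accident_Severity"]` would mirror the facets of the plot above.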

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 11. Casualties by Severity × Weather Conditions × Area Type.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 57%;"> </div>

#### Graph: Faceted Boxplot – Casualties across Weather Conditions and Area Types

This visualization compares the Number_of_Casualties for each Weather_Conditions category, segmented by Urban_or_Rural_Area. Accident severity levels are represented with colors to highlight patterns across environmental and geographical conditions.

#### ⚡Focus:

Understand how weather and urban vs. rural settings interact to influence casualty numbers across different severity levels.

In [None]:
# Faceted boxplot: Casualties by Severity × Weather Conditions × Urban/Rural Area

g = sns.catplot(
    x="Weather_Conditions",
    y="Number_of_Casualties",
    hue="Accident_Severity",
    col="Urban_or_Rural_Area",
    data=df,
    kind="box",
    col_wrap=3,     # Wrap columns into rows
    height=5,
    aspect=1.3
)

# Improve readability
g.set_titles("Area: {col_name}")
g.set_axis_labels("Weather Conditions", "Number of Casualties")
g.set_xticklabels(rotation=45)
plt.subplots_adjust(top=0.85)
g.fig.suptitle("Casualties by Severity × Weather Conditions × Area Type", fontsize=16)

plt.show()


#### 💡Insight:

Severe weather (e.g., rain, snow, fog) tends to correlate with higher casualties in rural areas due to poor visibility and road conditions.

Urban areas often show higher accident counts but fewer casualties per accident, possibly due to lower speed limits.

Rural accidents under poor weather conditions show wider variability, indicating potential for severe outcomes.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 12. Casualties by Severity × Urban/Rural Area × Weather Conditions.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 65%;"> </div>

#### Graph: Faceted Barplot

This visualization shows the average number of casualties for each weather condition, with one panel per urban/rural area and accident severity as the hue.

#### ⚡Focus

Understand how weather conditions and urban/rural location jointly affect accident severity and casualty counts.

In [None]:
# Calculate average casualties per Weather_Conditions for sorting
order = (
    df.groupby("Weather_Conditions")["Number_of_Casualties"]
    .mean()
    .sort_values(ascending=False)
    .index
)

# Faceted barplot: Casualties by Severity × Urban/Rural Area × Weather Conditions
g = sns.catplot(
    x="Weather_Conditions",
    y="Number_of_Casualties",
    hue="Accident_Severity",
    col="Urban_or_Rural_Area",
    data=df,
    kind="bar",
    col_wrap=2,
    height=5,
    aspect=1.5,
    order=order     # Sort by average casualties
)

# Improve readability
g.set_titles("Area: {col_name}")
g.set_axis_labels("Weather Conditions", "Avg. Number of Casualties")
g.set_xticklabels(rotation=45)
plt.subplots_adjust(top=0.85)
g.fig.suptitle("Casualties by Severity × Urban/Rural Area × Weather Conditions", fontsize=16)

plt.show()


#### 💡Insight

Rainy and Foggy conditions show higher average casualties, especially in rural areas.

Urban areas have more accidents, but rural areas see higher casualties per accident under severe weather.

Clear weather has the highest frequency of accidents but often lower casualties per accident, likely due to better control and visibility.
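
The bar chart shows averages only, so statements about accident frequency need a separate count. The sketch below computes both quantities side by side; `frequency_and_mean` is a hypothetical helper name, and the toy rows (including the category labels) are illustrative values rather than results from the dataset.

```python
import pandas as pd

def frequency_and_mean(frame: pd.DataFrame) -> pd.DataFrame:
    """Return accident counts and mean casualties per area type and weather condition."""
    return (
        frame.groupby(["Urban_or_Rural_Area", "Weather_Conditions"], observed=True)
        ["Number_of_Casualties"]
        .agg(accidents="size", mean_casualties="mean")
        .reset_index()
    )

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Urban_or_Rural_Area": ["Urban", "Urban", "Urban", "Rural"],
    "Weather_Conditions": ["Fine", "Fine", "Fine", "Fine"],
    "Number_of_Casualties": [1, 1, 1, 3],
})

table = frequency_and_mean(toy)
print(table)
```

Reading `accidents` and `mean_casualties` together makes it possible to check whether a group with many accidents also has a high casualty count per accident.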

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 13. Mean Number of Casualties by Road Surface × Weather × Severity.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 67%;"> </div>

#### Graph: Faceted Heatmap.

Each subplot shows the average number of casualties for combinations of Road Surface Conditions (Y-axis) and Weather Conditions (X-axis), separated by Accident Severity. Cells are color-coded according to the mean casualties, and high-risk values (at or above the 75th percentile of all group means) are marked with asterisks.

#### ⚡Focus

Examine how different road surface types interact with weather conditions to influence accident severity and casualty counts.

Identify high-risk scenarios where the combination of surface and weather leads to more casualties.

Compare patterns across different severity levels side by side.

In [None]:
# All unique road surfaces and weather conditions
all_roads = sorted(df["Road_Surface_Conditions"].unique())
all_weather = sorted(df["Weather_Conditions"].unique())

# Group and calculate mean casualties
heatmap_data = (
    df.groupby(["Road_Surface_Conditions", "Weather_Conditions", "Accident_Severity"])
      ["Number_of_Casualties"]
      .mean()
      .reset_index()
)

# Threshold for high-risk cells
high_risk_threshold = heatmap_data["Number_of_Casualties"].quantile(0.75)

# Unique severities
severities = heatmap_data["Accident_Severity"].unique()
num_severities = len(severities)

# Find common color scale limits
vmin = heatmap_data["Number_of_Casualties"].min()
vmax = heatmap_data["Number_of_Casualties"].max()

# Create subplots (squeeze=False keeps `axes` iterable even with a single severity level)
fig, axes = plt.subplots(1, num_severities, figsize=(6*num_severities, 6), sharey=True, squeeze=False)
axes = axes.ravel()

for ax, severity in zip(axes, severities):
    # Pivot and reindex so every road surface and weather condition is present;
    # combinations with no accidents stay NaN and are left blank in the heatmap
    pivot_table = (
        heatmap_data[heatmap_data["Accident_Severity"] == severity]
        .pivot(index="Road_Surface_Conditions", columns="Weather_Conditions", values="Number_of_Casualties")
        .reindex(index=all_roads, columns=all_weather)
    )

    # Annotate cells, marking high-risk values with asterisks and leaving empty cells blank
    def format_cell(x):
        if pd.isna(x):
            return ""
        return f"{x:.2f}" if x < high_risk_threshold else f"**{x:.2f}**"

    annot = pivot_table.apply(lambda column: column.map(format_cell))

    sns.heatmap(
        pivot_table,
        annot=annot,
        fmt='',
        cmap="YlOrRd",
        vmin=vmin, vmax=vmax,
        cbar=(ax==axes[-1]),
        cbar_kws={'label': 'Mean Casualties'},
        ax=ax
    )
    
    ax.set_title(f"Severity = {severity}", fontsize=14)
    ax.set_xlabel("Weather Conditions", fontsize=12)
    ax.set_ylabel("Road Surface Conditions", fontsize=12)
    
    # Ensure all road surface labels appear on Y-axis
    ax.set_yticks([i + 0.5 for i in range(len(all_roads))])
    ax.set_yticklabels(all_roads, rotation=0, fontsize=10)
    
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right", fontsize=10)

plt.suptitle("Mean Number of Casualties by Road Surface × Weather × Severity", fontsize=16, y=1.05)
plt.tight_layout()
plt.show()


#### 💡Insight

Surfaces like Wet, Snow, or Ice combined with adverse weather (Rain, Snow, Fog) show higher average casualties, especially for severe accidents.

Dry surfaces under clear weather generally have low casualties, even for moderate severity.

Highlighting high-risk cells makes it easy to pinpoint dangerous combinations, which can inform road safety interventions and preventive measures.

The consistent color scale across severities helps quickly compare severity levels for the same road/weather conditions.
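
Rare combinations of surface and weather can produce extreme means from only a few records. One possible safeguard, sketched below, is to blank out cells built on fewer than a minimum number of accidents before plotting. The helper name `mean_casualty_grid`, the `min_count` parameter, and the toy rows are all assumptions made for illustration.

```python
import pandas as pd

def mean_casualty_grid(frame: pd.DataFrame, min_count: int = 30) -> pd.DataFrame:
    """Mean casualties per surface x weather cell; cells with < min_count rows become NaN."""
    grouped = frame.groupby(
        ["Road_Surface_Conditions", "Weather_Conditions"], observed=True
    )["Number_of_Casualties"]
    stats = grouped.agg(["mean", "count"])
    # Keep the mean only where the cell has enough observations
    reliable_means = stats["mean"].where(stats["count"] >= min_count)
    return reliable_means.unstack("Weather_Conditions")

# Toy data (hypothetical values); a low threshold is used because the table is tiny
toy = pd.DataFrame({
    "Road_Surface_Conditions": ["Dry", "Dry", "Wet", "Wet", "Wet"],
    "Weather_Conditions": ["Fine", "Fine", "Raining", "Raining", "Fine"],
    "Number_of_Casualties": [1, 3, 2, 4, 5],
})

grid = mean_casualty_grid(toy, min_count=2)
print(grid)
```

The resulting grid can be passed to `sns.heatmap`, which leaves NaN cells unshaded.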

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

### 14. Average Casualties by Junction Detail × Road Surface Conditions × Severity (Sorted by Total Casualties).

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 89%;"> </div>

#### Graph: Faceted Barplot.

Each subplot represents a Road Surface Condition. Bars show the average number of casualties for each Junction Detail, sorted from highest to lowest by the sum of the severity-level averages. Accident Severity (Slight, Serious, Fatal) is color-coded using a custom palette (green → slight, orange → serious, red → fatal). Each bar is annotated with its exact value.

#### ⚡Focus

Identify which junctions contribute most to casualties under each road surface condition.

Compare severity-specific risks across junction types and road surfaces.

Make high-risk junctions immediately visible through sorting and annotations.

In [None]:
# Compute mean casualties per combination
grouped = (
    df.groupby(["Road_Surface_Conditions", "Junction_Detail", "Accident_Severity"])
      ["Number_of_Casualties"].mean()
      .reset_index()
)

# Define severity palette
palette = {
    "Slight": "#4daf4a",   # Green
    "Serious": "#ff7f00",  # Orange
    "Fatal": "#e41a1c"     # Red
}

# Facet grid barplots
g = sns.FacetGrid(
    grouped, 
    col="Road_Surface_Conditions", 
    col_wrap=3, 
    height=4, 
    sharey=False
)

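# Draw bars ordered by the sum of the severity-level mean casualties per junction
def sorted_barplot(data, **kwargs):
    # FacetGrid passes a default `color`; drop it so the severity palette is used
    kwargs.pop("color", None)
    # Sum the severity-level means for sorting
    order = data.groupby("Junction_Detail")["Number_of_Casualties"].sum().sort_values(ascending=False).index
    ax = sns.barplot(
        data=data,
        x="Junction_Detail",
        y="Number_of_Casualties",
        hue="Accident_Severity",
        hue_order=list(palette),   # keep Slight/Serious/Fatal in the same position in every facet
        palette=palette,
        order=order,
        **kwargs
    )
    # Annotate bars, skipping empty or zero-height patches
    for p in ax.patches:
        height = p.get_height()
        if np.isfinite(height) and height > 0:
            ax.annotate(f'{height:.1f}',
                        (p.get_x() + p.get_width() / 2., height),
                        ha='center', va='bottom', fontsize=8, rotation=90)
    return ax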

g.map_dataframe(sorted_barplot)

# Styling
g.add_legend(title="Accident Severity")
g.set_axis_labels("Junction Detail", "Avg. Number of Casualties")
g.set_titles("{col_name}")
for ax in g.axes.flatten():
    ax.tick_params(axis="x", rotation=45)

plt.subplots_adjust(top=0.9)
g.fig.suptitle("Average Casualties by Junction Detail × Road Surface Conditions × Severity", fontsize=14)
plt.show()


#### 💡Insight

Crossroads and Mini-roundabouts consistently have the highest average casualties, especially under Wet or Snow surfaces.

Slight severity dominates simpler junctions like T-junctions, while Serious and Fatal accidents cluster at complex junctions.

Annotated values and sorted bars make it easy to prioritize high-risk locations for targeted safety measures.

This visualization clearly shows how road surface conditions amplify risk at complex junctions.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 99%;"> </div>

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #007acc; text-align: left;width: 99%;"> </div>

## Geographic Map

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #007acc; text-align: left;width: 19%;"> </div>

In [None]:
# Check the data type of the lat/long field.
type(df.loc[0, 'GeoPoint'])

In [None]:
# Split the GeoPoint string into two full-precision float fields: Latitude and Longitude.
df['Latitude'] = df['GeoPoint'].apply(lambda x: float(x.strip('()').split(',')[0]))
df['Longitude'] = df['GeoPoint'].apply(lambda x: float(x.strip('()').split(',')[1]))

# Optional: display full precision
pd.set_option('display.float_format', lambda x: '%.6f' % x)

# Check
df[['GeoPoint','Latitude','Longitude']].head()
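
The conversion above raises an error on any value that does not follow the "(lat, lon)" layout. A more defensive variant is sketched below: it parses the whole column in one vectorized pass and returns NaN for values it cannot read. The helper name `split_geopoint` and the sample strings are assumptions for illustration; the pattern handles plain decimal numbers only.

```python
import pandas as pd

def split_geopoint(geopoint: pd.Series) -> pd.DataFrame:
    """Parse '(lat, lon)' strings into float Latitude/Longitude columns (NaN if malformed)."""
    pattern = (
        r"^\(?\s*(?P<Latitude>-?\d+(?:\.\d+)?)\s*,"
        r"\s*(?P<Longitude>-?\d+(?:\.\d+)?)\s*\)?$"
    )
    # Non-matching rows (including missing values) come back as NaN
    return geopoint.astype(str).str.strip().str.extract(pattern).astype(float)

# Sample values (hypothetical): one valid point, one malformed entry, one missing value
sample = pd.Series(["(51.507400, -0.127800)", "bad value", None])
coords = split_geopoint(sample)
print(coords)
```

The result can be attached with `df[['Latitude', 'Longitude']] = split_geopoint(df['GeoPoint'])`, after which rows with NaN coordinates can be inspected or dropped before mapping.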


### Compute the Risk Score and Map the Top 10 Districts.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 45%;"> </div>

In [None]:
# Risk score Calculation.
total_groupby = df['Local_Authority_District'].value_counts()
serious_or_fatal = df[df['Accident_Severity'].isin(['Serious', 'Fatal'])]
serious_fatal_total_groupby = serious_or_fatal['Local_Authority_District'].value_counts()

#total_groupby
#serious_or_fatal
#serious_fatal_total_groupby

In [None]:
risk_df = pd.DataFrame({
    'Total_Accidents': total_groupby,
    'Serious_or_Fatal': serious_fatal_total_groupby
}).fillna(0)

risk_df['Risk_Score (%)'] = (risk_df['Serious_or_Fatal'] / risk_df['Total_Accidents']) * 100
top_risky = risk_df.sort_values(by='Risk_Score (%)', ascending=False).head(10)  # keep the top 10 to match the titles

print("Top 10 Most Dangerous Districts Based on Risk Score:\n")
print(top_risky)
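
A ratio-based score can be dominated by districts with very few records: a district with one accident that happens to be fatal scores 100%. One possible refinement, sketched below, is to require a minimum number of accidents before ranking. The helper `district_risk`, its `min_accidents` threshold, and the toy rows are assumptions for illustration; a suitable threshold would depend on the dataset.

```python
import pandas as pd

def district_risk(frame: pd.DataFrame, min_accidents: int = 50, top_n: int = 10) -> pd.DataFrame:
    """Share of Serious/Fatal accidents per district, ignoring low-volume districts."""
    severe = frame["Accident_Severity"].isin(["Serious", "Fatal"])
    risk = (
        frame.assign(Severe=severe)
        .groupby("Local_Authority_District")["Severe"]
        .agg(Total_Accidents="size", Serious_or_Fatal="sum")
    )
    risk["Risk_Score (%)"] = 100 * risk["Serious_or_Fatal"] / risk["Total_Accidents"]
    risk = risk[risk["Total_Accidents"] >= min_accidents]
    return risk.sort_values("Risk_Score (%)", ascending=False).head(top_n)

# Toy data (hypothetical districts A, B, C)
toy = pd.DataFrame({
    "Local_Authority_District": ["A", "A", "A", "A", "B", "C", "C"],
    "Accident_Severity": ["Serious", "Slight", "Slight", "Slight", "Fatal", "Serious", "Slight"],
})

ranking = district_risk(toy, min_accidents=2)
print(ranking)
```

In the toy table, district B has a 100% score from a single accident and is excluded, leaving C (50%) ahead of A (25%).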

In [None]:
#pip install folium

### Create the Map with the Top Districts Marked and Tooltips.

<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #f0f8ff; text-align: left;width: 55%;"> </div>

# This working code is disabled for now: the interactive map is not static, so it does not render on GitHub, and the code is blocking.
# It is therefore kept in a Markdown cell.

import folium
import matplotlib.cm as cm
import matplotlib.colors as mcolors
import pandas as pd

# Step 0: Ensure district column exists
top_risky_coords = top_risky.copy()
top_risky_coords['District'] = top_risky_coords.index  # create column from index

# Step 1: Get mean lat/lon per district
district_coords = df.groupby('Local_Authority_District')[['Latitude','Longitude']].mean().reset_index()
district_coords.rename(columns={'Local_Authority_District':'District'}, inplace=True)

# Step 2: Merge top risky districts with coordinates
top_risky_coords = pd.merge(top_risky_coords, district_coords, on='District', how='left')

# Step 3: Normalize Risk_Score for color mapping and radius
norm = mcolors.Normalize(vmin=top_risky_coords['Risk_Score (%)'].min(),
                         vmax=top_risky_coords['Risk_Score (%)'].max())
cmap = cm.Reds

# Radius scaling: adjust min/max radius as needed
min_radius = 5
max_radius = 20
risk_min = top_risky_coords['Risk_Score (%)'].min()
risk_max = top_risky_coords['Risk_Score (%)'].max()
def scale_radius(risk):
    return min_radius + (risk - risk_min)/(risk_max - risk_min)*(max_radius - min_radius)

# Step 4: Create Folium map with fixed frame size
m = folium.Map(
    location=[54.0, -2.0],
    zoom_start=6,
    tiles='OpenStreetMap',
    width=900,    # frame width in pixels
    height=400    # frame height in pixels
)

# Step 5: Add circles with serial numbers and tooltips
for i, row in top_risky_coords.iterrows():
    rgba_color = cmap(norm(row['Risk_Score (%)']))
    hex_color = mcolors.to_hex(rgba_color)
    radius = scale_radius(row['Risk_Score (%)'])
    
    tooltip_text = (
        f"<b>{i+1}. {row['District']}</b><br>"
        f"Total Accidents: {row['Total_Accidents']}<br>"
        f"Serious/Fatal: {row['Serious_or_Fatal']}<br>"
        f"Risk Score: {row['Risk_Score (%)']:.2f}%"
    )
    
    # Circle marker
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=radius,
        color=hex_color,
        fill=True,
        fill_color=hex_color,
        fill_opacity=0.7,
        popup=folium.Popup(tooltip_text, max_width=250),
        tooltip=folium.Tooltip(tooltip_text, sticky=True)
    ).add_to(m)
    
    # Serial number on top
    folium.map.Marker(
        [row['Latitude'], row['Longitude']],
        icon=folium.DivIcon(
            html=f'<div style="font-size:10pt; font-weight:bold; color:black">{i+1}</div>'
        )
    ).add_to(m)

m
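
The `scale_radius` helper in Step 3 divides by `risk_max - risk_min`, which is zero when all selected districts share the same score. A guarded min-max scaling is sketched below; `scale_linear` is a hypothetical name, and falling back to the midpoint of the output range is just one reasonable choice.

```python
def scale_linear(value: float, lo: float, hi: float,
                 out_min: float = 5.0, out_max: float = 20.0) -> float:
    """Map `value` from [lo, hi] onto [out_min, out_max]; use the midpoint if lo == hi."""
    if hi == lo:
        # All inputs are identical, so every marker gets the same mid-sized radius
        return (out_min + out_max) / 2
    return out_min + (value - lo) / (hi - lo) * (out_max - out_min)

print(scale_linear(0, 0, 10))    # 5.0
print(scale_linear(10, 0, 10))   # 20.0
print(scale_linear(7, 7, 7))     # 12.5
```

The same function could replace `scale_radius` by passing `risk_min` and `risk_max` as `lo` and `hi`.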


<div style="border: 2px solid #007acc; padding: 2px; border-radius: 10px; background-color: #007acc; text-align: left;width: 99%;"> </div>

In [None]:
# Static fallback for the map. The interactive Folium code above is the full working version.

import matplotlib.pyplot as plt

# Copy of top risky districts with coordinates
top_risky_plot = top_risky.copy()
top_risky_plot['District'] = top_risky_plot.index
top_risky_plot = pd.merge(
    top_risky_plot,
    df.groupby('Local_Authority_District')[['Latitude','Longitude']].mean().reset_index().rename(columns={'Local_Authority_District':'District'}),
    on='District', how='left'
)

# Scale marker size by Risk Score (matplotlib's `s` is the marker area in points^2)
max_radius = 3000  # largest marker area; adjust for visibility
top_risky_plot['Radius'] = (top_risky_plot['Risk_Score (%)'] / top_risky_plot['Risk_Score (%)'].max()) * max_radius

# UK map bounding box
fig, ax = plt.subplots(figsize=(6,8))
ax.set_xlim([-8, 2])   # approx longitudes of UK
ax.set_ylim([49, 61])  # approx latitudes of UK

# Add circles
for i, row in top_risky_plot.iterrows():
    ax.scatter(row['Longitude'], row['Latitude'],
               s=row['Radius'], color='red', alpha=0.6)
    ax.text(row['Longitude'], row['Latitude'],
            f"{i+1}\n{int(row['Total_Accidents'])}/{int(row['Serious_or_Fatal'])}",
            fontsize=9, ha='center', va='center', fontweight='bold')

# Title and axes
plt.title("Top 10 Most Dangerous UK Districts by Risk Score\n(Total Accidents / Serious-Fatal)", fontsize=16)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
plt.tight_layout()
plt.savefig("top_10_risky_districts_static.png", dpi=300)
plt.show()
