# Exploratory Analysis

Having cleaned the raw data in the previous notebook, we now have a dataset that is ready for analysis.

Our main goal is to predict the `Severity` of an accident from the other variables we are given. We therefore want to understand how each of those variables relates to `Severity`.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import datetime as dt
sb.set()

In [None]:
cal_accident_df = pd.read_csv('california_accident_data.csv')

We first look at a summary of the data.

In [None]:
cal_accident_df.head()

In [None]:
print('shape:', cal_accident_df.shape)
cal_accident_df.info()  # info() prints its report directly and returns None

In [None]:
cal_accident_df.describe()

## Analysis: Severity

`Severity`: A scale from 1 to 4 describing how severe the accident is, with 1 being the least severe and 4 being the most severe.

Now, let's do some EDA on `Severity`.

In [None]:
cal_accident_df['Severity'].describe()

In [None]:
# print count and ratio of total count of the different severity levels
print('number of Severity levels:', len(cal_accident_df['Severity'].unique()))
count = cal_accident_df['Severity'].value_counts()
ratio = round(cal_accident_df['Severity'].value_counts(normalize=True) * 100, 2)
tmp = pd.concat([count, ratio], axis=1, keys=['count', '%'])
print(tmp)

# plot bar and pie chart
f, axes = plt.subplots(1, 2, figsize=(14, 8))

axes[0].set_title('Percentage Severity Distribution')
cal_accident_df['Severity'].value_counts().plot.pie(autopct = '%1.1f%%', ax = axes[0])

axes[1].set_title('Count')
sb.countplot(x = 'Severity', data = cal_accident_df, ax = axes[1])

plt.show()

From the above, we can see that the typical `Severity` level is 2: the mean, median and mode are all approximately 2. We also note that `Severity` levels 1 and 4 are rare, at less than 1% of accidents. This imbalance will be helpful to keep in mind later on in our EDA.
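
As a minimal sketch of how these summary figures can be checked, the snippet below computes the mean, median, mode and class shares for a small made-up `Severity` column. The `toy_df` frame, its values and the 8% cut-off are illustrative assumptions, not the accident data.

```python
import pandas as pd

# Hypothetical toy data standing in for cal_accident_df['Severity']
toy_df = pd.DataFrame({'Severity': [2] * 17 + [3] * 2 + [1]})

severity = toy_df['Severity']
summary = {
    'mean': severity.mean(),
    'median': severity.median(),
    'mode': severity.mode().iloc[0],  # mode() returns a Series; take the first value
}

# Share of each level in percent
share = severity.value_counts(normalize=True).mul(100).round(2)

# Levels below an arbitrary cut-off (chosen only for this toy example)
rare_levels = share[share < 8].index.tolist()

print(summary)
print(share)
print('rare levels:', rare_levels)
```

On the real data, the same calls on `cal_accident_df['Severity']` would show how far the mean sits from the median and which levels fall under a chosen cut-off.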

## Null Values

We want to see how many null values each variable has.

In [None]:
count = cal_accident_df.isna().sum()
ratio = round(cal_accident_df.isna().mean() * 100, 2)
null_values = pd.concat([count, ratio], axis = 1, keys = ['count', '%'])
print(null_values.sort_values(by='%', ascending=False))

## Analysis: Date & Time

The variable `Start_Time` provides the date and time of the accident. Let us first convert the column to a pandas `datetime` type, which makes comparing and grouping the values easier.

In [None]:
# convert Start_Time to datetime object
cal_accident_df['Start_Time'] = pd.to_datetime(cal_accident_df['Start_Time'])
print(cal_accident_df['Start_Time'].dtype)

### Month

Let us group the accidents by month and severity level.

In [None]:
# Group DataFrame by month and severity, and count occurrences
month_counts = cal_accident_df.groupby([cal_accident_df['Start_Time'].dt.month, 'Severity']).size().unstack(fill_value=0)
print(month_counts)

# Plot a stacked bar graph
month_counts.plot(kind='bar', stacked=True)
plt.xlabel('Month')
plt.ylabel('Count')
plt.title('Number of Occurrences by Month and Severity')
plt.xticks(rotation=0)  # Ensure x-axis labels are not rotated
plt.legend(title='Severity')
plt.tight_layout()
plt.show()

In [None]:
# group them by severity
acc_severity_1 = pd.DataFrame(cal_accident_df[cal_accident_df['Severity'] == 1])
acc_severity_2 = pd.DataFrame(cal_accident_df[cal_accident_df['Severity'] == 2])
acc_severity_3 = pd.DataFrame(cal_accident_df[cal_accident_df['Severity'] == 3])
acc_severity_4 = pd.DataFrame(cal_accident_df[cal_accident_df['Severity'] == 4])

month_counts_1 = acc_severity_1.groupby(acc_severity_1['Start_Time'].dt.month).size()
percentage_1 = round((month_counts_1 / month_counts_1.sum()) * 100, 2)
month_counts_2 = acc_severity_2.groupby(acc_severity_2['Start_Time'].dt.month).size()
percentage_2 = round((month_counts_2 / month_counts_2.sum()) * 100, 2)
month_counts_3 = acc_severity_3.groupby(acc_severity_3['Start_Time'].dt.month).size()
percentage_3 = round((month_counts_3 / month_counts_3.sum()) * 100, 2)
month_counts_4 = acc_severity_4.groupby(acc_severity_4['Start_Time'].dt.month).size()
percentage_4 = round((month_counts_4 / month_counts_4.sum()) * 100, 2)

tmp = pd.concat([month_counts_1, percentage_1,
                 month_counts_2, percentage_2,
                 month_counts_3, percentage_3,
                 month_counts_4, percentage_4],
                 axis=1, keys=['1', '%', '2', '%', '3', '%', '4', '%'])
tmp = tmp.fillna(0)
tmp = tmp.sort_index(ascending=True)
print(tmp)


# plot graph
f, axes = plt.subplots(1, 4, figsize= (14, 6))
month_counts_1.plot(kind='bar', title='Severity = 1', ax=axes[0])
month_counts_2.plot(kind='bar', title='Severity = 2', ax=axes[1])
month_counts_3.plot(kind='bar', title='Severity = 3', ax=axes[2])
month_counts_4.plot(kind='bar', title='Severity = 4', ax=axes[3])

### Conclusion

It seems that the number of accidents is highest at the start and end of the year and lowest in the middle of the year. However, upon further inspection, this seems to be true only for `Severity=2`.

Further inspection of other factors may help explain the differing trends.
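
Because `Severity=2` dominates the raw counts, one way to compare the monthly trends is to look at each level's own percentage distribution in a single table. The sketch below does this with `pd.crosstab(..., normalize='columns')` on made-up records; `toy_df` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy records; the notebook itself would use cal_accident_df
toy_df = pd.DataFrame({
    'Start_Time': pd.to_datetime([
        '2021-01-05', '2021-01-20', '2021-06-11', '2021-12-02',
        '2021-12-15', '2021-06-30', '2021-01-09', '2021-12-25',
    ]),
    'Severity': [2, 2, 2, 2, 2, 3, 3, 3],
})

month = toy_df['Start_Time'].dt.month.rename('Month')

# normalize='columns' makes each severity column sum to 1, so the monthly
# shape of a rare level is not hidden by a much more common one
month_share = (
    pd.crosstab(month, toy_df['Severity'], normalize='columns')
    .mul(100)
    .round(2)
)
print(month_share)
```

Each column can then be read as "of all accidents at this severity, what percentage happened in each month", which is the same quantity as the `%` columns in the table above.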

### Day

Let us group the accidents by day of the week and severity level.

In [None]:
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Group DataFrame by day of week and severity, and count occurrences
day_counts = cal_accident_df.groupby([cal_accident_df['Start_Time'].dt.day_name(), 'Severity']).size().unstack(fill_value=0)
day_counts = day_counts.reindex(day_order)
print(day_counts)

# Plot a stacked bar graph
day_counts.plot(kind='bar', stacked=True)
plt.xlabel('Day')
plt.ylabel('Count')
plt.title('Number of Occurrences by Day and Severity')
plt.xticks(rotation=0)  # Ensure x-axis labels are not rotated
plt.legend(title='Severity')
plt.tight_layout()
plt.show()

In [None]:
day_counts_1 = acc_severity_1.groupby(acc_severity_1['Start_Time'].dt.day_name()).size()
percentage_1 = round((day_counts_1 / day_counts_1.sum()) * 100, 2)
day_counts_2 = acc_severity_2.groupby(acc_severity_2['Start_Time'].dt.day_name()).size()
percentage_2 = round((day_counts_2 / day_counts_2.sum()) * 100, 2)
day_counts_3 = acc_severity_3.groupby(acc_severity_3['Start_Time'].dt.day_name()).size()
percentage_3 = round((day_counts_3 / day_counts_3.sum()) * 100, 2)
day_counts_4 = acc_severity_4.groupby(acc_severity_4['Start_Time'].dt.day_name()).size()
percentage_4 = round((day_counts_4 / day_counts_4.sum()) * 100, 2)

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_counts_1 = day_counts_1.reindex(day_order)
day_counts_2 = day_counts_2.reindex(day_order)
day_counts_3 = day_counts_3.reindex(day_order)
day_counts_4 = day_counts_4.reindex(day_order)

tmp = pd.concat([day_counts_1, percentage_1,
                 day_counts_2, percentage_2,
                 day_counts_3, percentage_3,
                 day_counts_4, percentage_4],
                 axis=1, keys=['1', '%', '2', '%', '3', '%', '4', '%'])

tmp = tmp.reindex(day_order)  # keep rows in calendar order, since the percentage Series were not reindexed
print(tmp)


# plot graph
f, axes = plt.subplots(1, 4, figsize= (14, 6))
day_counts_1.plot(kind='bar', title='Severity = 1', ax=axes[0])
day_counts_2.plot(kind='bar', title='Severity = 2', ax=axes[1])
day_counts_3.plot(kind='bar', title='Severity = 3', ax=axes[2])
day_counts_4.plot(kind='bar', title='Severity = 4', ax=axes[3])

### Conclusion

It seems that all `Severity` levels follow the same trend: more accidents on weekdays, with the most occurring on Friday, and a significant drop on weekends. This is as expected, since most people work on weekdays and need to commute to work, and with more people on the roads, more accidents tend to happen.
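
To put a rough number on the weekday/weekend gap, the counts can be split with `dt.dayofweek` and divided by the number of day names in each group (five weekdays, two weekend days). This is a minimal sketch on made-up timestamps; `toy_df` and the derived names are assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy timestamps (2024-01-01 is a Monday)
toy_df = pd.DataFrame({
    'Start_Time': pd.to_datetime([
        '2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05',
        '2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08', '2024-01-12',
    ]),
})

# dayofweek: Monday=0 ... Sunday=6, so values >= 5 are Saturday and Sunday
is_weekend = toy_df['Start_Time'].dt.dayofweek >= 5
counts = is_weekend.map({False: 'weekday', True: 'weekend'}).value_counts()

# Divide by the number of day names in each group so that five weekdays
# are not compared directly against two weekend days
days_in_group = pd.Series({'weekday': 5, 'weekend': 2})
per_day_avg = counts / days_in_group
print(per_day_avg)
```

The same split could be applied within each severity level by grouping on `Severity` first.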

### Hour

Let us group the accidents by hour of the day and severity level.

In [None]:
# Group DataFrame by hour and severity, and count occurrences
hour_counts = cal_accident_df.groupby([cal_accident_df['Start_Time'].dt.hour, 'Severity']).size().unstack(fill_value=0)
print(hour_counts)

# Plot a stacked bar graph
hour_counts.plot(kind='bar', stacked=True)
plt.xlabel('Hour')
plt.ylabel('Count')
plt.title('Number of Occurrences by Hour and Severity')
plt.xticks(rotation=0)  # Ensure x-axis labels are not rotated
plt.legend(title='Severity')
plt.tight_layout()
plt.show()

In [None]:
hour_counts_1 = acc_severity_1.groupby(acc_severity_1['Start_Time'].dt.hour).size()
percentage_1 = round((hour_counts_1 / hour_counts_1.sum()) * 100, 2)
hour_counts_2 = acc_severity_2.groupby(acc_severity_2['Start_Time'].dt.hour).size()
percentage_2 = round((hour_counts_2 / hour_counts_2.sum()) * 100, 2)
hour_counts_3 = acc_severity_3.groupby(acc_severity_3['Start_Time'].dt.hour).size()
percentage_3 = round((hour_counts_3 / hour_counts_3.sum()) * 100, 2)
hour_counts_4 = acc_severity_4.groupby(acc_severity_4['Start_Time'].dt.hour).size()
percentage_4 = round((hour_counts_4 / hour_counts_4.sum()) * 100, 2)

tmp = pd.concat([hour_counts_1, percentage_1,
                 hour_counts_2, percentage_2,
                 hour_counts_3, percentage_3,
                 hour_counts_4, percentage_4],
                 axis=1, keys=['1', '%', '2', '%', '3', '%', '4', '%'])

tmp = tmp.fillna(0).sort_index()  # hours with no accidents at a severity level count as 0
print(tmp)


# plot graph
f, axes = plt.subplots(1, 4, figsize= (14, 6))
hour_counts_1.plot(kind='bar', title='Severity = 1', ax=axes[0], xticks=range(0, 24, 4))
hour_counts_2.plot(kind='bar', title='Severity = 2', ax=axes[1], xticks=range(0, 24, 4))
hour_counts_3.plot(kind='bar', title='Severity = 3', ax=axes[2], xticks=range(0, 24, 4))
hour_counts_4.plot(kind='bar', title='Severity = 4', ax=axes[3], xticks=range(0, 24, 4))

### Conclusion

It seems that there is a spike in the number of accidents from hour 6-9 and 

## Analysis: Boolean Values

From the summary, we can see that columns 12 to 24 consist of boolean values. These variables describe road features at the accident location, such as objects, road signs and buildings. We now look at how these variables relate to `Severity`.

In [None]:
bool_cols = ['Amenity','Bump','Crossing','Give_Way','Junction','No_Exit','Railway','Roundabout','Station','Stop',
             'Traffic_Calming','Traffic_Signal','Turning_Loop']

In [None]:
# print count and corresponding ratio of T/F
for col in bool_cols:
    count = cal_accident_df[col].value_counts()
    ratio = round(cal_accident_df[col].value_counts(normalize=True) * 100, 2)
    tmp = pd.concat([count, ratio], axis=1, keys=['count', '%'])
    print(tmp)
    print('-'*40)

### Conclusion

From the above data, we can see that almost all of the columns are `False` for more than 90% of accidents (`Junction` is 89% `False`, which we round up to 90%).

This is plausible, as the presence of such road features may make drivers more alert, so that they react more appropriately to their surroundings and environment.
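
The counts above describe each feature on its own. To relate a feature to `Severity`, one option is a row-normalised cross-tabulation, which shows the severity mix with and without the feature. The sketch below uses made-up values for `Junction`; `toy_df` and the `present`/`absent` labels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are made up
toy_df = pd.DataFrame({
    'Junction': [True, True, False, False, False, False, False, False],
    'Severity': [3, 2, 2, 2, 2, 3, 2, 2],
})

# Readable row labels instead of True/False
feature = toy_df['Junction'].map({True: 'present', False: 'absent'})

# normalize='index' makes each row sum to 1, so the severity mix can be
# compared even though 'present' rows are far fewer than 'absent' rows
severity_mix = (
    pd.crosstab(feature, toy_df['Severity'], normalize='index')
    .mul(100)
    .round(2)
)
print(severity_mix)
```

Looping the same call over `bool_cols` would give one small table per feature.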

# Analysing numerical data

In [None]:
numeric_data = pd.DataFrame(cal_accident_df[["Severity", "Humidity(%)", "Distance", "Visibility", "Wind_Speed", "Temperature", 
                                             "Wind_Chill", "Precipitation", "Pressure"]])

f, axes = plt.subplots(9, 3, figsize=(30, 40))

count = 0
for var in numeric_data:
    sb.boxplot(data = numeric_data[var], orient = "h", ax = axes[count,0])
    sb.histplot(data = numeric_data[var], ax = axes[count,1])
    sb.violinplot(data = numeric_data[var], orient = "h", ax = axes[count,2])
    count += 1



In [None]:
# Correlation Matrix
print(numeric_data.corr())

# Heatmap of the Correlation Matrix
f = plt.figure(figsize=(20, 20))
sb.heatmap(numeric_data.corr(), vmin = -1, vmax = 1, linewidths = 1,
           annot = True, fmt = ".2f", annot_kws = {"size": 18}, cmap = "RdBu")

In [None]:
# Draw pairs of variables against one another
sb.pairplot(data = numeric_data)

In [None]:
severity_4_df = pd.DataFrame(cal_accident_df[cal_accident_df["Severity"] == 4])
severity_3_df = pd.DataFrame(cal_accident_df[cal_accident_df["Severity"] == 3])
severity_2_df = pd.DataFrame(cal_accident_df[cal_accident_df["Severity"] == 2])
severity_1_df = pd.DataFrame(cal_accident_df[cal_accident_df["Severity"] == 1])

In [None]:
severity_4_df.info()


## Severity = 4

In [None]:
severity_4_df[["Humidity(%)", "Distance", "Visibility", "Wind_Speed", "Temperature", 
                "Wind_Chill", "Precipitation", "Pressure"]].describe()

In [None]:
numeric_value = ["Humidity(%)", "Distance", "Visibility", "Wind_Speed", "Temperature", 
                "Wind_Chill", "Precipitation", "Pressure"]

f, axes = plt.subplots(8, 3, figsize=(30, 40))

count = 0
for var in numeric_value:
    temp_db = pd.DataFrame(severity_4_df[var])
    sb.boxplot(data = temp_db, orient = "h", ax = axes[count, 0])
    sb.histplot(data = temp_db, ax = axes[count,1])
    sb.violinplot(data = temp_db, orient = "h", ax = axes[count,2])
    count += 1



## Severity = 3

In [None]:
severity_3_df[["Humidity(%)", "Distance", "Visibility", "Wind_Speed", "Temperature", 
                "Wind_Chill", "Precipitation", "Pressure"]].describe()

In [None]:
f, axes = plt.subplots(8, 3, figsize=(30, 40))

count = 0
for var in numeric_value:
    temp_db = pd.DataFrame(severity_3_df[var])
    sb.boxplot(data = temp_db, orient = "h", ax = axes[count, 0])
    sb.histplot(data = temp_db, ax = axes[count,1])
    sb.violinplot(data = temp_db, orient = "h", ax = axes[count,2])
    count += 1



## Severity = 2

In [None]:
severity_2_df[["Humidity(%)", "Distance", "Visibility", "Wind_Speed", "Temperature", 
                "Wind_Chill", "Precipitation", "Pressure"]].describe()

In [None]:
f, axes = plt.subplots(8, 3, figsize=(30, 40))

count = 0
for var in numeric_value:
    temp_db = pd.DataFrame(severity_2_df[var])
    sb.boxplot(data = temp_db, orient = "h", ax = axes[count, 0])
    sb.histplot(data = temp_db, ax = axes[count,1])
    sb.violinplot(data = temp_db, orient = "h", ax = axes[count,2])
    count += 1

## Severity = 1

In [None]:
severity_1_df[["Humidity(%)", "Distance", "Visibility", "Wind_Speed", "Temperature", 
                "Wind_Chill", "Precipitation", "Pressure"]].describe()

In [None]:
f, axes = plt.subplots(8, 3, figsize=(30, 40))

count = 0
for var in numeric_value:
    temp_db = pd.DataFrame(severity_1_df[var])
    sb.boxplot(data = temp_db, orient = "h", ax = axes[count, 0])
    sb.histplot(data = temp_db, ax = axes[count,1])
    sb.violinplot(data = temp_db, orient = "h", ax = axes[count,2])
    count += 1

From the analysis of the variables `Humidity(%)`, `Distance`, `Visibility`, `Wind_Speed`, `Temperature`, `Wind_Chill`, `Precipitation` and `Pressure`, we can see that most accidents occur in normal weather conditions. This suggests that drivers may drive less carefully in good weather conditions than in poor weather conditions.
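
The four per-severity `describe()` tables can also be condensed into one table by grouping on `Severity`, which makes the levels easier to compare side by side. This is a minimal sketch on made-up values; `toy_df` and the choice of `median` and `count` as statistics are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names follow the notebook, values are made up
toy_df = pd.DataFrame({
    'Severity':    [2, 2, 2, 3, 3, 4],
    'Visibility':  [10.0, 10.0, 8.0, 10.0, 6.0, 2.0],
    'Humidity(%)': [40.0, 60.0, 50.0, 70.0, 90.0, 95.0],
})

numeric_cols = ['Visibility', 'Humidity(%)']

# One row per severity level, one column per (variable, statistic) pair
by_severity = toy_df.groupby('Severity')[numeric_cols].agg(['median', 'count'])
print(by_severity)
```

Keeping `count` next to `median` is a reminder of how few rows back the statistics for the rare severity levels.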