# CSC 405 - Project Stage I Report  
**Group 11**  
**Project Title:** US Traffic Accident Severity Prediction  
**Dataset Source:** https://www.kaggle.com/sobhanmoosavi/us-accidents  

## Task 1 Problem Framing  
Our project is a supervised classification task: we aim to predict the severity of US traffic accidents. We also want to identify which variables (e.g. weather, time, traffic, and road conditions) most affect the severity of an accident. This study can benefit drivers and commuters by providing insight into the conditions that increase accident risk and where defensive driving matters most. It could also help government and city planners improve road safety and prevent future accidents. The US Accidents dataset fits this goal because it is large (about 7.7 million records), covers 49 US states over seven years, and includes more than 40 features related to traffic and weather that are useful for building our predictive model.

## Task 2 Data Exploration
**Why:** To view the first ten records of our dataset.

In [2]:
import pandas as pd

df = pd.read_csv("../data/US_Accidents_March23.csv")

df.head(10)

,ID,Source,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,A-1,Source2,3,2016-02-08 05:46:00,2016-02-08 11:00:00,39.865147,-84.058723,,,0.01,...,False,False,False,False,False,False,Night,Night,Night,Night
1,A-2,Source2,2,2016-02-08 06:07:59,2016-02-08 06:37:59,39.928059,-82.831184,,,0.01,...,False,False,False,False,False,False,Night,Night,Night,Day
2,A-3,Source2,2,2016-02-08 06:49:27,2016-02-08 07:19:27,39.063148,-84.032608,,,0.01,...,False,False,False,False,True,False,Night,Night,Day,Day
3,A-4,Source2,3,2016-02-08 07:23:34,2016-02-08 07:53:34,39.747753,-84.205582,,,0.01,...,False,False,False,False,False,False,Night,Day,Day,Day
4,A-5,Source2,2,2016-02-08 07:39:07,2016-02-08 08:09:07,39.627781,-84.188354,,,0.01,...,False,False,False,False,True,False,Day,Day,Day,Day
5,A-6,Source2,3,2016-02-08 07:44:26,2016-02-08 08:14:26,40.10059,-82.925194,,,0.01,...,False,False,False,False,False,False,Day,Day,Day,Day
6,A-7,Source2,2,2016-02-08 07:59:35,2016-02-08 08:29:35,39.758274,-84.230507,,,0.0,...,False,False,False,False,False,False,Day,Day,Day,Day
7,A-8,Source2,3,2016-02-08 07:59:58,2016-02-08 08:29:58,39.770382,-84.194901,,,0.01,...,False,False,False,False,False,False,Day,Day,Day,Day
8,A-9,Source2,2,2016-02-08 08:00:40,2016-02-08 08:30:40,39.778061,-84.172005,,,0.0,...,False,False,False,False,False,False,Day,Day,Day,Day
9,A-10,Source2,3,2016-02-08 08:10:04,2016-02-08 08:40:04,40.10059,-82.925194,,,0.01,...,False,False,False,False,False,False,Day,Day,Day,Day


**Why:** To view the number of records and features in our dataset.  

In [4]:
print(df.shape)

(7728394, 46)
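
With roughly 7.7 million rows, reading the whole file can be slow on a laptop. One possible workaround, sketched below, is to read only the columns of interest in chunks and keep a random fraction of each chunk, so the sample is spread across the file instead of coming only from its first rows. The helper name `sample_csv`, its parameters, and the tiny in-memory CSV are illustrative assumptions, not part of the notebook.

```python
import io
import pandas as pd

def sample_csv(path_or_buffer, columns, frac=0.1, chunksize=100_000, seed=0):
    """Read only `columns` in chunks and keep a random fraction of each chunk."""
    parts = []
    reader = pd.read_csv(path_or_buffer, usecols=columns, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Vary the seed per chunk so every chunk is sampled independently
        parts.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)

# Tiny in-memory stand-in for the real CSV file (hypothetical values)
toy_csv = io.StringIO(
    "ID,Severity,State,Description\n"
    + "".join(f"A-{i},{2 + i % 2},OH,text\n" for i in range(1, 21))
)

# 20 rows read in 2 chunks of 10; half of each chunk is kept
sample = sample_csv(toy_csv, columns=["Severity", "State"], frac=0.5, chunksize=10)
print(sample.shape)
```

On the real file, the same call would take the CSV path in place of `toy_csv`; the right `frac` depends on how much memory is available.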


**Why:** To get information on each feature in our dataset, including its name and raw data type.  

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7728394 entries, 0 to 7728393
Data columns (total 46 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   ID                     object 
 1   Source                 object 
 2   Severity               int64  
 3   Start_Time             object 
 4   End_Time               object 
 5   Start_Lat              float64
 6   Start_Lng              float64
 7   End_Lat                float64
 8   End_Lng                float64
 9   Distance(mi)           float64
 10  Description            object 
 11  Street                 object 
 12  City                   object 
 13  County                 object 
 14  State                  object 
 15  Zipcode                object 
 16  Country                object 
 17  Timezone               object 
 18  Airport_Code           object 
 19  Weather_Timestamp      object 
 20  Temperature(F)         float64
 21  Wind_Chill(F)          float64
 22  Humidity(%)       

### Calculating missing percent
**Why:** This calculates the percentage of records that are missing a value for each feature. It shows which features have the most missing data and where we will need to fill in values, for example with the median.

In [6]:
missing_percent = (df.isnull().sum() / len(df)) * 100

missing_summary = pd.DataFrame({
    'Feature': df.columns,
    'Missing %': missing_percent.round(2)
})

missing_summary

,Feature,Missing %
ID,ID,0.0
Source,Source,0.0
Severity,Severity,0.0
Start_Time,Start_Time,0.0
End_Time,End_Time,0.0
Start_Lat,Start_Lat,0.0
Start_Lng,Start_Lng,0.0
End_Lat,End_Lat,44.03
End_Lng,End_Lng,44.03
Distance(mi),Distance(mi),0.0
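
With 46 features, the summary is easier to scan when only the features that actually have missing values are listed, sorted from most to least missing. The sketch below shows one way to do that; the helper name `missing_report` and the toy values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def missing_report(frame):
    """Return features with missing values, sorted by percent missing (descending)."""
    pct = frame.isnull().mean().mul(100).round(2)
    # Turn the Series into a two-column table: Feature, Missing %
    report = pct.rename("Missing %").rename_axis("Feature").reset_index()
    report = report[report["Missing %"] > 0]
    return report.sort_values("Missing %", ascending=False, ignore_index=True)

# Toy data (hypothetical values) with gaps in two of three columns
toy = pd.DataFrame({
    "Severity": [2, 3, 2, 2],
    "End_Lat": [np.nan, np.nan, 39.1, 40.2],
    "Temperature(F)": [36.9, np.nan, 36.0, 35.1],
})
report = missing_report(toy)
print(report)
```

Calling `missing_report(df)` on the full dataset should give the same percentages as `missing_summary`, just filtered and ordered.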


## Task 3 Feature Exploration
Here, we handle the missing values and outliers in our dataset.  
**Why:** We need a clean dataset so that missing values do not leave gaps in our analysis and outliers do not affect its outcome.

In [13]:
import pandas as pd
import numpy as np

# Reading all records was very slow on my MacBook, so load only the first 100,000 rows for now
df = pd.read_csv("../data/US_Accidents_March23.csv", nrows=100000)

# Columns or features we plan to clean and the target feature
clean_columns = [
    "Severity",
    "Start_Time", "Sunrise_Sunset",
    "Temperature(F)", "Precipitation(in)", "Visibility(mi)",
    "Traffic_Signal", "Junction", "Crossing",
    "State", "City"
]
df = df[clean_columns].copy()
# df.head(10)

# Weather Features
print("Before Missing %:")
print((df[["Temperature(F)", "Precipitation(in)", "Visibility(mi)"]].isnull().mean() * 100).round(2))

# Fill missing temperature values with the state median; if a state has no median, fall back to the US median
state_temp_median = df.groupby("State")["Temperature(F)"].transform("median")
global_temp_median = df["Temperature(F)"].median()
df["Temperature(F)"] = df["Temperature(F)"].fillna(state_temp_median).fillna(global_temp_median)

# Fill missing precipitation values with 0.0 (assume no precipitation)
df["Precipitation(in)"] = df["Precipitation(in)"].fillna(0.0)

# Fill missing visibility values with the median visibility
df["Visibility(mi)"] = df["Visibility(mi)"].fillna(df["Visibility(mi)"].median())

# Cap outliers by clipping each feature to a realistic range so extreme values do not distort the analysis
df["Temperature(F)"] = df["Temperature(F)"].clip(-50, 130)
df["Precipitation(in)"] = df["Precipitation(in)"].clip(0, 5)
df["Visibility(mi)"] = df["Visibility(mi)"].clip(0, 50)

print("After Missing %:")
print((df[["Temperature(F)", "Precipitation(in)", "Visibility(mi)"]].isnull().mean() * 100).round(2))
df.head(50)

Before Missing %:
Temperature(F)        1.59
Precipitation(in)    92.63
Visibility(mi)        1.85
dtype: float64
After Missing %:
Temperature(F)       0.0
Precipitation(in)    0.0
Visibility(mi)       0.0
dtype: float64
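
The two-step temperature fill (state median first, overall median as the fallback) can be checked on a small example. This is a minimal sketch on toy data; the helper name `fill_with_group_median` and the values are illustrative assumptions. The "TX" group has no observed temperature at all, so it exercises the fallback.

```python
import numpy as np
import pandas as pd

def fill_with_group_median(frame, value_col, group_col):
    """Fill NaNs in value_col with the group's median, then with the overall median."""
    group_median = frame.groupby(group_col)[value_col].transform("median")
    overall_median = frame[value_col].median()
    return frame[value_col].fillna(group_median).fillna(overall_median)

toy = pd.DataFrame({
    "State": ["OH", "OH", "OH", "CA", "CA", "TX"],
    "Temperature(F)": [30.0, 40.0, np.nan, 70.0, np.nan, np.nan],
})
filled = fill_with_group_median(toy, "Temperature(F)", "State")
print(filled.tolist())
# [30.0, 40.0, 35.0, 70.0, 70.0, 40.0]
```

The missing OH value gets the OH median (35.0), the missing CA value gets 70.0, and the TX value falls back to the overall median of the observed values (40.0).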


,Severity,Start_Time,Sunrise_Sunset,Temperature(F),Precipitation(in),Visibility(mi),Traffic_Signal,Junction,Crossing,State,City
0,3,2016-02-08 05:46:00,Night,36.9,0.02,10.0,False,False,False,OH,Dayton
1,2,2016-02-08 06:07:59,Night,37.9,0.0,10.0,False,False,False,OH,Reynoldsburg
2,2,2016-02-08 06:49:27,Night,36.0,0.0,10.0,True,False,False,OH,Williamsburg
3,3,2016-02-08 07:23:34,Night,35.1,0.0,9.0,False,False,False,OH,Dayton
4,2,2016-02-08 07:39:07,Day,36.0,0.0,6.0,True,False,False,OH,Dayton
5,3,2016-02-08 07:44:26,Day,37.9,0.03,7.0,False,False,False,OH,Westerville
6,2,2016-02-08 07:59:35,Day,34.0,0.0,7.0,False,False,False,OH,Dayton
7,3,2016-02-08 07:59:58,Day,34.0,0.0,7.0,False,False,False,OH,Dayton
8,2,2016-02-08 08:00:40,Day,33.3,0.0,5.0,False,False,False,OH,Dayton
9,3,2016-02-08 08:10:04,Day,37.4,0.02,3.0,False,False,False,OH,Westerville
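
Clipping changes values silently, so it can help to count how many values fall outside each range before capping them. The sketch below reuses the ranges from the cell above; the names `BOUNDS` and `clip_and_count` and the toy values are illustrative assumptions.

```python
import pandas as pd

# Ranges taken from the cleaning cell above; the dict name is hypothetical
BOUNDS = {
    "Temperature(F)": (-50, 130),
    "Precipitation(in)": (0, 5),
    "Visibility(mi)": (0, 50),
}

def clip_and_count(frame, bounds):
    """Clip each column to its bounds and count how many values were out of range."""
    clipped = frame.copy()
    changed = {}
    for col, (low, high) in bounds.items():
        outside = (frame[col] < low) | (frame[col] > high)
        changed[col] = int(outside.sum())
        clipped[col] = frame[col].clip(low, high)
    return clipped, changed

# Toy data (hypothetical values) with a few unrealistic readings
toy = pd.DataFrame({
    "Temperature(F)": [36.9, 170.0, -89.0],
    "Precipitation(in)": [0.0, 0.02, 9.5],
    "Visibility(mi)": [10.0, 140.0, 3.0],
})
clipped, changed = clip_and_count(toy, BOUNDS)
print(changed)
# {'Temperature(F)': 2, 'Precipitation(in)': 1, 'Visibility(mi)': 1}
```

If a count turns out to be large on the real data, that may be a sign the range is too tight or that the feature needs a closer look.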


# Summary

**Alexa**: Focused on the weather features and handled missing values in Temperature, Precipitation, and Visibility. Missing temperature values were filled with the state median temperature; where no state median was available, the US median was used. Missing visibility values were filled with the median visibility, and missing precipitation was assumed to mean no precipitation (0.0 in). Outliers were capped rather than removed: temperature was clipped to -50 to 130 °F, precipitation to 0 to 5 in, and visibility to 0 to 50 mi.

**Zeta**: 

**Liz**:

