# CSC 405 - Project Stage I Report  
**Group 11**  
**Project Title:** US Traffic Accident Severity Prediction  
**Dataset Source:** https://www.kaggle.com/sobhanmoosavi/us-accidents  

## Task 1 Problem Framing  
Our project is a prediction task: we aim to predict the severity of US traffic accidents and to identify which variables (e.g., weather, time, traffic, and road conditions) most affect it. This study can benefit drivers and commuters by providing insight into the conditions that increase accident risk and call for defensive driving. It could also help government and city planners improve road safety and prevent future accidents. The US Accidents dataset fits this task because it is large (7 million records), covers 49 US states over seven years, and includes over 40 traffic- and weather-related features that are useful for building our predictive model.

## Task 2 Data Exploration
**Why:** We view the first ten records and collect information on every feature in order to build a dataset dictionary. The missing percent is the percentage of records in which a given feature is missing. It shows which features have the most missing data and where we will need to fill in values, for example with the median.

In [5]:
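```python
import pandas as pd

# Load the full dataset
df = pd.read_csv("../data/US_Accidents_March23.csv")

# Preview the first ten records (a notebook displays only the last expression
# in a cell, so run this line in its own cell to see the preview)
df.head(10)

# Use the first record as the example value for each feature
first_example = df.iloc[0]

# Percent of records in which each feature is missing
missing_percent = (df.isnull().sum() / len(df) * 100).round(2)

# One row per feature: name, dtype, example value, and missing percent
dataset_dictionary = pd.DataFrame({
    "Feature": df.columns,
    "Type": df.dtypes,
    "Example": [first_example[col] for col in df.columns],
    "Missing %": missing_percent.values
})

dataset_dictionary
```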

| Feature | Type | Example | Missing % |
|---|---|---|---|
| ID | object | A-1 | 0.0 |
| Source | object | Source2 | 0.0 |
| Severity | int64 | 3 | 0.0 |
| Start_Time | object | 2016-02-08 05:46:00 | 0.0 |
| End_Time | object | 2016-02-08 11:00:00 | 0.0 |
| Start_Lat | float64 | 39.865147 | 0.0 |
| Start_Lng | float64 | -84.058723 | 0.0 |
| End_Lat | float64 | NaN | 44.03 |
| End_Lng | float64 | NaN | 44.03 |
| Distance(mi) | float64 | 0.01 | 0.0 |
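
The missing percentages above require loading the whole file at once. As a sketch of a lower-memory alternative (not the approach used in this report), the same percentages could be accumulated chunk by chunk with the `chunksize` parameter of `pd.read_csv`. The helper name `missing_percent_chunked` and the tiny in-memory CSV are made up for illustration.

```python
import io
import pandas as pd

def missing_percent_chunked(csv_source, chunksize=100_000):
    """Percent of missing values per column, reading the CSV in chunks."""
    missing_counts = None
    total_rows = 0
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        counts = chunk.isnull().sum()
        # Add this chunk's missing counts to the running totals
        missing_counts = counts if missing_counts is None else missing_counts + counts
        total_rows += len(chunk)
    if missing_counts is None:
        # No data was read at all
        return pd.Series(dtype=float)
    return (missing_counts / total_rows * 100).round(2)

# Made-up stand-in for the real file: End_Lat is missing in two of four rows
sample_csv = io.StringIO(
    "ID,Severity,End_Lat\n"
    "A-1,3,\n"
    "A-2,2,39.9\n"
    "A-3,2,\n"
    "A-4,3,40.1\n"
)
result = missing_percent_chunked(sample_csv, chunksize=2)
print(result)
```

On the real dataset the first argument would be the CSV path instead of the in-memory buffer.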


## Task 3 Feature Exploration
Here, we deal with the missing values and outliers in our dataset.  
**Why:** We need a clean dataset so that our analysis is complete and outliers do not affect its outcome.

In [7]:
```python
import os

import pandas as pd

# Load only the first 100,000 rows for now; reading every record was too slow on a laptop.
# The percentages printed below therefore describe this sample, not the full dataset.
df = pd.read_csv("../data/US_Accidents_March23.csv", nrows=100000)

# Features we plan to clean, plus the target feature (Severity)
clean_columns = [
    "Severity",
    "Start_Time", "Sunrise_Sunset",
    "Temperature(F)", "Precipitation(in)", "Visibility(mi)",
    "Traffic_Signal", "Junction", "Crossing", "Railway",
    "State", "City"
]
df = df[clean_columns].copy()

# Weather features
weather_columns = ["Temperature(F)", "Precipitation(in)", "Visibility(mi)"]
print("Before Missing %:")
print((df[weather_columns].isnull().mean() * 100).round(2))

# Fill missing temperatures with the state median; if a state has no readings,
# fall back to the median of all loaded rows
state_temp_median = df.groupby("State")["Temperature(F)"].transform("median")
global_temp_median = df["Temperature(F)"].median()
df["Temperature(F)"] = df["Temperature(F)"].fillna(state_temp_median).fillna(global_temp_median)

# Treat missing precipitation as no precipitation
df["Precipitation(in)"] = df["Precipitation(in)"].fillna(0.0)

# Fill missing visibility with the median visibility
df["Visibility(mi)"] = df["Visibility(mi)"].fillna(df["Visibility(mi)"].median())

# Cap outliers: values outside each range are set to the nearest bound (no rows are dropped)
df["Temperature(F)"] = df["Temperature(F)"].clip(-50, 130)
df["Precipitation(in)"] = df["Precipitation(in)"].clip(0, 5)
df["Visibility(mi)"] = df["Visibility(mi)"].clip(0, 50)

print("After Missing %:")
print((df[weather_columns].isnull().mean() * 100).round(2))

# Save the cleaned subset
out_dir = "../data"
os.makedirs(out_dir, exist_ok=True)

out_path = os.path.join(out_dir, "newdata.csv")
df.to_csv(out_path, index=False, encoding="utf-8")

print(f"Saved cleaned dataset: {out_path}")
```

```
Before Missing %:
Temperature(F)        1.59
Precipitation(in)    92.63
Visibility(mi)        1.85
dtype: float64
After Missing %:
Temperature(F)       0.0
Precipitation(in)    0.0
Visibility(mi)       0.0
dtype: float64
Saved cleaned dataset: ../data/newdata.csv
```
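
To make the fill-and-cap logic easier to check, here is a small worked example on made-up records (the states and temperatures are illustrative, not taken from the dataset). It uses the same pandas calls as the cell above: a per-state median via `groupby(...).transform("median")`, a fallback to the overall median, and `clip` to cap out-of-range values.

```python
import numpy as np
import pandas as pd

# Made-up records: OH has one missing reading, NV has no readings at all,
# and CA contains an unrealistic 200 F value
toy = pd.DataFrame({
    "State": ["OH", "OH", "OH", "CA", "CA", "NV"],
    "Temperature(F)": [30.0, np.nan, 40.0, 70.0, 200.0, np.nan],
})

# Per-row state median; NaN for NV because that group has no values
state_median = toy.groupby("State")["Temperature(F)"].transform("median")
# Median over all non-missing rows: (40 + 70) / 2 = 55
global_median = toy["Temperature(F)"].median()

# State median first, overall median as the fallback
filled = toy["Temperature(F)"].fillna(state_median).fillna(global_median)
print(filled.tolist())   # [30.0, 35.0, 40.0, 70.0, 200.0, 55.0]

# clip caps the outlier at the upper bound; the row itself is kept
capped = filled.clip(-50, 130)
print(capped.tolist())   # [30.0, 35.0, 40.0, 70.0, 130.0, 55.0]
```

The OH gap takes the OH median (35), the NV gap falls back to the overall median (55), and the 200 F reading becomes 130 while the row count stays at six.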


## Summary

**Alexa**: Focused on weather features and handled missing values in Temperature, Precipitation, and Visibility. Missing temperature values were filled with the state median temperature; if that was not available, the US median temperature was used. Missing visibility values were filled with the median, and missing precipitation was assumed to mean no precipitation. Outliers such as unrealistic temperatures and precipitation amounts were capped to a realistic range.

**Zeta**: Focused on traffic features such as Railway, Crossing, Junction, and Traffic_Signal. There were no missing values in any of these features.

**Liz**: Focused on time-of-day features.
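
No code for the time-of-day features appears in this stage. As a sketch only, assuming `Start_Time` holds timestamp strings like the example in the dataset dictionary, hour and weekday features could be derived as follows. The new column names (`Hour`, `Day_Of_Week`, `Is_Weekend`) and the sample timestamps after the first are hypothetical.

```python
import pandas as pd

# Made-up sample; the first timestamp matches the dataset dictionary example
toy = pd.DataFrame({
    "Start_Time": [
        "2016-02-08 05:46:00",
        "2016-02-08 17:30:00",
        "2016-02-13 23:05:00",
    ],
})

# Parse the strings; values that cannot be parsed become NaT instead of raising
start = pd.to_datetime(toy["Start_Time"], errors="coerce")

toy["Hour"] = start.dt.hour               # 0-23
toy["Day_Of_Week"] = start.dt.dayofweek   # Monday=0 ... Sunday=6
toy["Is_Weekend"] = start.dt.dayofweek >= 5

print(toy)
```

Using `errors="coerce"` keeps the parse from failing on malformed timestamps, at the cost of producing missing values that would then need handling like the weather features.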

