# Machine Learning - Final Project
## Accident Severity Prediction
**Team**: *Jennifer Lord, Konstantinos Georgiou, Russ Limber, Sanjeev Singh, Sara Howard*

## Where to put the code
- Place the preprocessing functions/classes in [\<project root\>/project_libs/project/preprocessing.py](https://github.com/UTK-ML-Dream-Team/accident-severity-prediction/blob/master/project_libs/project/preprocessing.py)
- The models in [\<project root\>/project_libs/project/models.py](https://github.com/UTK-ML-Dream-Team/accident-severity-prediction/blob/master/project_libs/project/models.py)
- Any plotting related functions in [\<project root\>/project_libs/project/plotter.py](https://github.com/UTK-ML-Dream-Team/accident-severity-prediction/blob/master/project_libs/project/plotter.py)


**The code is reloaded automatically. Any class object needs to reinitialized though.** 

## Config file
The yml/config file is located at: [confs/prototype1.yml](https://github.com/UTK-ML-Dream-Team/accident-severity-prediction/blob/master/confs/prototype1.yml)<br>
To load it run:
```python
config_path='confs/prototype1.yml'
conf = Configuration(config_src=config_path)
# Get the dataset loader config
loader_config = conf.get_config('data_loader')['config']['dataset'] # type = Dict
print(books.keys())
print(loader_config['url'])
```
To reload the config just run the 2nd and 3rd command.

## Libraries Overview:
All the libraries are located under *"\<project root>/project_libs"*
- project_libs/**project**: This project's code (imported later)
- project_libs/**configuration**: Class that creates config objects from yml files
- project_libs/**fancy_logger**: Logger that can be used instead of prints for text formatting (color, bold, underline etc)

## For more info check out:
- the **[Project Board](https://github.com/UTK-ML-Dream-Team/accident-severity-prediction/projects/1)**
- and the **[Current Issues](https://github.com/UTK-ML-Dream-Team/accident-severity-prediction/issues)**

# ------------------------------------------------------------------

## On Google Collab?
- **If yes, run the two cells and press the two buttons below:**
- Otherwise go to "***Import the base Libraries***"

In [12]:
# Import Jupyter Widgets
import os
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.display import display
# Clone the repository if you're in Google Collab
def clone_project(is_collab: bool = False):
    print("Cloning Project..")
    !git clone https://github.com/UTK-ML-Dream-Team/accident-severity-prediction.git
    print("Project cloned.")
       
print("Clone project?")
print("(If you do this you will ovewrite local changes on other files e.g. configs)")
print("Not needed if you're not on Google Collab")
btn = widgets.Button(description="Yes, clone")
btn.on_click(clone_project)
display(btn)

Clone project?
(If you do this you will ovewrite local changes on other files e.g. configs)
Not needed if you're not on Google Collab


Button(description='Yes, clone', style=ButtonStyle())

In [13]:
# Clone the repository if you're in Google Collab
def change_dir(is_collab: bool = False):
    try:
        print("Changing dir..")
        os.chdir('/content/accident-severity-prediction')
        print('done')
        print("Current dir:")
        print(os.getcwd())
        print("Dir Contents:")
        print(os.listdir())
        print("\nInstalling Requirements")
        !pip install -r requirements.txt
    except Exception:
        print("Error: Project not cloned")
       
print("Are you on Google Collab?")
btn = widgets.Button(description="Yes")
btn.on_click(change_dir)
display(btn)

Are you on Google Collab?


Button(description='Yes', style=ButtonStyle())

### To commit and push Google Collab notebook to Github
Click **File > Save a copy on Gihtub**

# ------------------------------------------------------------------

# Initializations

## Import the base Libraries

In [14]:
# Imports
%load_ext autoreload
%autoreload 2
from importlib import reload as reload_lib
from typing import *
import os
import re
from pprint import pprint
import datetime 
# Numpy
import numpy as np
import pandas as pd

# Import preprocessing lib
from project_libs.project import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load the YML file

In [15]:
from project_libs import Configuration

In [42]:
# The path of configuration and log save path
config_path = "confs/prototype1.yml"
# !cat "$config_path"
# Load the configuratåion
conf = Configuration(config_src=config_path)
# Get the books dict
loader_config = conf.get_config('data_loader')['config']['dataset']
preprocessing_config = conf.get_config('data_loader')['config']['preprocessing']
# print(books.keys())
# pprint(books)  # Pretty print the books dict

2021-11-16 19:31:13 Config       INFO     [1m[37mConfiguration file loaded successfully from path: /Users/gkos/Insync/delfinas7kostas@gmail.com/Google Drive/Projects/UTK/accident-severity-detection-prediction/confs/prototype1.yml[0m
2021-11-16 19:31:13 Config       INFO     [1m[37mConfiguration Tag: prototype_1[0m


# ------------------------------------------------------------------

# Start of Project Code

In [17]:
from project_libs import project as proj

## Data Loading

In [18]:
# Download Dataset again if requested
should_download = loader_config['download']
if should_download:
    kaggle_dataset_name = loader_config['kaggle_dataset_name']
    !mkdir ~/.kaggle
    !cp confs/kaggle.json ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json
    !mkdir data
    !cd data && kaggle datasets download -d $kaggle_dataset_name && unzip -o us-accidents.zip && rm us-accidents.zip

mkdir: /Users/gkos/.kaggle: File exists
mkdir: data: File exists
Downloading us-accidents.zip to /Users/gkos/Insync/delfinas7kostas@gmail.com/Google Drive/Projects/UTK/accident-severity-detection-prediction/data
 99%|███████████████████████████████████████▌| 116M/117M [00:07<00:00, 18.5MB/s]
100%|████████████████████████████████████████| 117M/117M [00:07<00:00, 16.2MB/s]
Archive:  us-accidents.zip
  inflating: US_Accidents_Dec20_updated.csv  


In [49]:
# Load Dataset
accidents_df_original = pd.read_csv (loader_config['local_dataset_name'])
accidents_df = accidents_df_original.copy()

## Exploration - Sampling Tests

In [20]:
# Print Basic Info
print(f"Number of rows: {accidents_df.shape[0]}")
print(f"Number of Columns: {accidents_df.shape[1]}")
print(f"Columns: {accidents_df.columns}")

Number of rows: 1516064
Number of Columns: 47
Columns: Index(['ID', 'Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
       'End_Lat', 'End_Lng', 'Distance(mi)', 'Description', 'Number', 'Street',
       'Side', 'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone',
       'Airport_Code', 'Weather_Timestamp', 'Temperature(F)', 'Wind_Chill(F)',
       'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction',
       'Wind_Speed(mph)', 'Precipitation(in)', 'Weather_Condition', 'Amenity',
       'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway',
       'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal',
       'Turning_Loop', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
       'Astronomical_Twilight'],
      dtype='object')


In [23]:
# -- Filter By Cities -- #
print("Unique Cities: ")
cities = accidents_df.City.unique().tolist()
print(f"{cities[:10]}, ..")
print(f"Number of cities: {len(cities)}")
print("----------------------------------------------------")

# Try different number of cities filters
for num_cities in [20, 50, 100, 500, 1000, 1500, 2000]:
    current_num = accidents_df[accidents_df.City.isin(cities[:num_cities])].shape[0]
    print(f"Number of rows when only FIRST {num_cities} were included: {current_num}")

Unique Cities: 
['Dublin', 'Dayton', 'Cincinnati', 'Akron', 'Williamsburg', 'Batavia', 'Cleveland', 'Lima', 'Westerville', 'Jamestown'], ..
Number of cities: 10658
----------------------------------------------------
Number of rows when only FIRST 20 were included: 28208
Number of rows when only FIRST 50 were included: 49824
Number of rows when only FIRST 100 were included: 62172
Number of rows when only FIRST 500 were included: 279012
Number of rows when only FIRST 1000 were included: 693738
Number of rows when only FIRST 1500 were included: 775703
Number of rows when only FIRST 2000 were included: 1037886


In [25]:
# -- Filter By Date -- #
accidents_df.loc[:, 'Start_Time_dt'] = pd.to_datetime(accidents_df.Start_Time)
print(f"Earliest date: {accidents_df.Start_Time_dt.min()}")
print(f"Most Recent date: {accidents_df.Start_Time_dt.max()}")
print("----------------------------------------------------")

# Try different date filters
dates = [(2017, 1), (2018, 1), (2019, 1), (2020, 1), (2020, 6), (2020, 9)]
for year, month in dates:
    condition = accidents_df.Start_Time_dt.dt.date>=datetime.date(year=year,month=month,day=1)
    current_num = accidents_df.Start_Time_dt[condition].shape[0]
    print(f"Number of rows when only dates STARTED FROM {month}/{year} were included: {current_num}")


Earliest date: 2016-02-08 00:37:08
Most Recent date: 2020-12-31 23:28:56
----------------------------------------------------
Number of rows when only dates STARTED FROM 1/2017 were included: 1386739
Number of rows when only dates STARTED FROM 1/2018 were included: 1216640
Number of rows when only dates STARTED FROM 1/2019 were included: 1049704
Number of rows when only dates STARTED FROM 1/2020 were included: 787932
Number of rows when only dates STARTED FROM 6/2020 were included: 546313
Number of rows when only dates STARTED FROM 9/2020 were included: 480503


In [26]:
# -- Filter By States and Date -- #
print("Unique States: ")
states = accidents_df.State.unique().tolist()
print(states[:10])
print(f"Number of states: {len(states)}")
print("----------------------------------------------------")

# Filter By NE States
staes_of_choice =  ['PA', 'NY', 'VY', 'ME', 'NH', 'MA', 'RI', 'CT', 'NJ', 'DE', 'DC', 'MD']
accidents_df_filtered = accidents_df[accidents_df.State.isin(staes_of_choice)].copy()
print(f"Number of rows when only North Eastern states were included: {accidents_df_filtered.shape[0]}")

# By Date
year, month = 2020, 1
condition = accidents_df_filtered.Start_Time_dt.dt.date>=datetime.date(year=year,month=month,day=1)
current_num = accidents_df_filtered.Start_Time_dt[condition].shape[0]
print(f"Number of rows when only dates STARTED FROM {month}/{year} for the North Easter States were included: {current_num}")

Unique States: 
['OH', 'IN', 'KY', 'WV', 'MI', 'PA', 'CA', 'NV', 'MN', 'TX']
Number of states: 49
----------------------------------------------------
Number of rows when only North Eastern states were included: 206216
Number of rows when only dates STARTED FROM 1/2020 for the North Easter States were included: 114565


## Preprocessing

In [82]:
from project_libs.project import preprocessing as pre

In [86]:
# Create a copy of the df
accidents_df_isolated = accidents_df_original.copy()

# Isolate city state
city_list = preprocessing_config['city_list']
state_list = preprocessing_config['state_list']
accidents_df_isolated = pre.isolate_city_state(accidents_df_isolated, city_list, state_list)
display(accidents_df_isolated)

Unnamed: 0,ID,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
98138,A-2814738,3,2016-11-30 15:01:56,2016-11-30 21:01:56,33.662770,-111.999580,33.66622,-111.99952,0.238,At AZ-101-LOOP/Exit 15 - Accident.,...,False,False,False,False,False,False,Day,Day,Day,Day
98139,A-2814739,2,2016-11-30 15:19:52,2016-11-30 21:19:52,33.668760,-112.072750,33.66898,-112.05849,0.820,At 7th St/Exit 26 - Accident.,...,False,False,False,False,False,False,Day,Day,Day,Day
98141,A-2814741,2,2016-11-30 17:05:39,2016-11-30 23:05:39,33.484250,-112.113190,33.47491,-112.11320,0.645,At Thomas Rd/Exit 201 - Accident.,...,False,False,False,False,False,False,Day,Day,Day,Day
98142,A-2814742,2,2016-11-30 17:10:37,2016-11-30 23:10:37,33.295215,-111.972420,33.28853,-111.97132,0.466,At Chandler Blvd/Exit 160 - Accident.,...,False,False,False,False,False,False,Day,Day,Day,Day
98146,A-2814746,2,2016-11-30 18:28:58,2016-12-01 00:28:58,33.461900,-112.092145,33.46190,-112.09904,0.397,At 19th Ave/Exit 143C - Accident.,...,False,False,False,False,False,False,Night,Night,Night,Day
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1515761,A-4239104,3,2019-08-23 16:58:31,2019-08-23 17:26:11,41.815300,-87.630480,41.81212,-87.63042,0.220,At 47th St/Exit 56B - Accident.,...,False,False,False,False,False,False,Day,Day,Day,Day
1515762,A-4239105,3,2019-08-23 16:03:57,2019-08-23 16:33:31,41.715130,-87.630040,41.71499,-87.63090,0.045,At I-57/Exit 63 - Accident. unconfirmed report.,...,False,False,False,False,False,False,Day,Day,Day,Day
1515763,A-4239106,3,2019-08-23 16:03:57,2019-08-23 16:33:31,41.943350,-87.716690,41.94859,-87.72162,0.442,At Addison St/Exit 45A - Accident.,...,False,False,False,False,False,False,Day,Day,Day,Day
1515764,A-4239107,3,2019-08-23 16:03:57,2019-08-23 16:33:31,41.718620,-87.625210,41.71513,-87.63004,0.347,Ramp to I-57 Southbound - Accident. unconfirme...,...,False,False,False,False,False,False,Day,Day,Day,Day


In [84]:
### --- Russ's Code --- ###

# Create a copy of the df
accidents_df_russ = accidents_df_isolated.copy()

env_vars = preprocessing_config['env_vars']
accidents_df_russ = pre.subset_df(accidents_df_russ, env_vars)

print('Percent of missing rows by column', '\n\n', accidents_df_russ.isnull().sum()/len(accidents_df_russ))

temp_wind = pre.subset_df(accidents_df_russ, ['Temperature(F)', 'Wind_Speed(mph)'])
pre.OLS(temp_wind, np.array(accidents_df_russ['Wind_Chill(F)']))

accidents_df_russ['Wind_Chill(F)'].fillna((accidents_df_russ['Temperature(F)']*1.0778 + accidents_df_russ['Wind_Speed(mph)']*-0.7083), inplace=True)

print('Percent of missing rows by column', '\n\n', accidents_df_russ.isnull().sum()/len(accidents_df_russ))

accidents_df_russ = pre.basic_impute(accidents_df_russ)

print('Percent of missing rows by column', '\n\n', accidents_df_russ.isnull().sum()/len(accidents_df_russ))


Percent of missing rows by column 

 Weather_Timestamp        0.007985
Temperature(F)           0.010175
Wind_Chill(F)            0.373365
Humidity(%)              0.010639
Pressure(in)             0.010352
Visibility(mi)           0.009920
Wind_Direction           0.017584
Wind_Speed(mph)          0.103427
Precipitation(in)        0.376229
Weather_Condition        0.009721
Sunrise_Sunset           0.000000
Civil_Twilight           0.000000
Nautical_Twilight        0.000000
Astronomical_Twilight    0.000000
dtype: float64
                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.999
Model:                            OLS   Adj. R-squared (uncentered):              0.999
Method:                 Least Squares   F-statistic:                          2.042e+07
Date:                Tue, 16 Nov 2021   Prob (F-statistic):                        0.00
Time:                        20:

In [89]:
### --- Sanjeev's Code --- ###

# Create a copy of the df
accidents_df_sanjeev = accidents_df_isolated.copy()

infra_vars = preprocessing_config['infra_vars']
accidents_df_infra = accidents_df_isolated[infra_vars].copy()
print('Number of missing rows by column', '\n', accidents_df_infra.isnull().sum())


# There is no missing value and duplicated meaing variables, all of the infra structures variables would be used. 
accidents_df_sanjeev = pre.preprocess_loc_basic_var(accidents_df_sanjeev)
print(accidents_df_sanjeev.columns)

Number of missing rows by column 
 Traffic_Signal     0
Crossing           0
Station            0
Amenity            0
Bump               0
Give_Way           0
Junction           0
No_Exit            0
Railway            0
Roundabout         0
Stop               0
Traffic_Calming    0
Turning_Loop       0
dtype: int64
Index(['Severity', 'Start_Time', 'End_Time', 'Distance(mi)', 'Description',
       'Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng', 'Street', 'Side',
       'City', 'County', 'Timezone'],
      dtype='object')


## Create Model

In [None]:
from project_libs.project import models

## Hyperparameter Tuning

## Training

## Testing

## Evaluation

## Plots

In [None]:
from project_libs.project import plotter as pl