# Machine Learning - Final Project
## Accident Severity Prediction
**Team**: *Jennifer Lord, Konstantinos Georgiou, Russ Limber, Sanjeev Singh, Sara Howard*

## Where to put the code
- Place the preprocessing functions/classes in [\<project root\>/project_libs/project/preprocessing.py](https://github.com/UTK-ML-Dream-Team/accident-severity-prediction/blob/master/project_libs/project/preprocessing.py)
- The models in [\<project root\>/project_libs/project/models.py](https://github.com/UTK-ML-Dream-Team/accident-severity-prediction/blob/master/project_libs/project/models.py)
- Any plotting related functions in [\<project root\>/project_libs/project/plotter.py](https://github.com/UTK-ML-Dream-Team/accident-severity-prediction/blob/master/project_libs/project/plotter.py)


**The code is reloaded automatically. Any class object needs to reinitialized though.** 

## Config file
The yml/config file is located at: [confs/prototype1.yml](https://github.com/UTK-ML-Dream-Team/accident-severity-prediction/blob/master/confs/prototype1.yml)<br>
To load it run:
```python
config_path='confs/prototype1.yml'
conf = Configuration(config_src=config_path)
# Get the dataset loader config
loader_config = conf.get_config('data_loader')['config']['dataset'] # type = Dict
print(books.keys())
print(loader_config['url'])
```
To reload the config just run the 2nd and 3rd command.

## Libraries Overview:
All the libraries are located under *"\<project root>/project_libs"*
- project_libs/**project**: This project's code (imported later)
- project_libs/**configuration**: Class that creates config objects from yml files
- project_libs/**fancy_logger**: Logger that can be used instead of prints for text formatting (color, bold, underline etc)

## For more info check out:
- the **[Project Board](https://github.com/UTK-ML-Dream-Team/accident-severity-prediction/projects/1)**
- and the **[Current Issues](https://github.com/UTK-ML-Dream-Team/accident-severity-prediction/issues)**

# ------------------------------------------------------------------

## On Google Collab?
- **If yes, run the two cells and press the two buttons below:**
- Otherwise go to "***Import the base Libraries***"

In [1]:
# Import Jupyter Widgets
import os
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.display import display
# Clone the repository if you're in Google Collab
def clone_project(is_collab: bool = False):
    print("Cloning Project..")
    !git clone https://github.com/UTK-ML-Dream-Team/accident-severity-prediction.git
    print("Project cloned.")
       
print("Clone project?")
print("(If you do this you will ovewrite local changes on other files e.g. configs)")
print("Not needed if you're not on Google Collab")
btn = widgets.Button(description="Yes, clone")
btn.on_click(clone_project)
display(btn)

Clone project?
(If you do this you will ovewrite local changes on other files e.g. configs)
Not needed if you're not on Google Collab


Button(description='Yes, clone', style=ButtonStyle())

In [2]:
# Clone the repository if you're in Google Collab
def change_dir(is_collab: bool = False):
    try:
        print("Changing dir..")
        os.chdir('/content/accident-severity-prediction')
        print('done')
        print("Current dir:")
        print(os.getcwd())
        print("Dir Contents:")
        print(os.listdir())
        print("\nInstalling Requirements")
        !pip install -r requirements.txt
    except Exception:
        print("Error: Project not cloned")
       
print("Are you on Google Collab?")
btn = widgets.Button(description="Yes")
btn.on_click(change_dir)
display(btn)

Are you on Google Collab?


Button(description='Yes', style=ButtonStyle())

### To commit and push Google Collab notebook to Github
Click **File > Save a copy on Gihtub**

# ------------------------------------------------------------------

# Initializations

## Import the base Libraries

In [3]:
# Imports
%load_ext autoreload
%autoreload 2
from importlib import reload as reload_lib
from typing import *
import os
import re
from pprint import pprint
import datetime 
# Numpy
import numpy as np
import pandas as pd

# Import preprocessing lib
from project_libs.project import *

## Load the YML file

In [4]:
from project_libs import Configuration

In [5]:
# The path of configuration and log save path
config_path = "confs/prototype1.yml"
# !cat "$config_path"
# Load the configuratåion
conf = Configuration(config_src=config_path)
# Get the books dict
loader_config = conf.get_config('data_loader')['config']['dataset']
# print(books.keys())
# pprint(books)  # Pretty print the books dict

2021-11-05 19:03:45 Config       INFO     [1m[37mConfiguration file loaded successfully from path: /Users/gkos/Insync/delfinas7kostas@gmail.com/Google Drive/Projects/UTK/accident-severity-detection-prediction/confs/prototype1.yml[0m
2021-11-05 19:03:45 Config       INFO     [1m[37mConfiguration Tag: prototype_1[0m


## Setup Logger and Example

In [6]:
from project_libs import ColorizedLogger
log_path = "logs/prototype1.log"
# Load and setup logger
logger = ColorizedLogger(logger_name='Notebook', color='cyan')
ColorizedLogger.setup_logger(log_path=log_path, debug=False, clear_log=True)
# Examples
logger.info("Logger Examples:")
logger.nl(num_lines=1) # New lines
logger.warn("Logger Warning underlined", attrs=['underline']) 
# Atrs:  bold, dark, underline, blink, reverse, concealed
logger.error("Logger Error in red&yellow", color="yellow", on_color="on_red")
# Colors: on_grey, on_red, on_green, on_yellow, on_blue, on_magenta, on_cyan, on_white

2021-11-05 19:03:45 FancyLogger  INFO     [1m[37mLogger is set. Log file path: /Users/gkos/Insync/delfinas7kostas@gmail.com/Google Drive/Projects/UTK/accident-severity-detection-prediction/logs/prototype1.log[0m
2021-11-05 19:03:45 Notebook     INFO     [1m[36mLogger Examples:[0m

2021-11-05 19:03:45 Notebook     ERROR    [1m[41m[33mLogger Error in red&yellow[0m


# ------------------------------------------------------------------

# Start of Project Code

In [7]:
from project_libs import project as proj

## Data Loading

In [8]:
# Download Dataset again if requested
should_download = loader_config['download']
if should_download:
    kaggle_dataset_name = loader_config['kaggle_dataset_name']
    !mkdir ~/.kaggle
    !cp confs/kaggle.json ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json
    !mkdir data
    !cd data && kaggle datasets download -d $kaggle_dataset_name && unzip -o us-accidents.zip && rm us-accidents.zip

mkdir: /Users/gkos/.kaggle: File exists
mkdir: data: File exists
Downloading us-accidents.zip to /Users/gkos/Insync/delfinas7kostas@gmail.com/Google Drive/Projects/UTK/accident-severity-detection-prediction/data
100%|████████████████████████████████████████| 117M/117M [00:11<00:00, 10.7MB/s]
100%|████████████████████████████████████████| 117M/117M [00:11<00:00, 11.0MB/s]
Archive:  us-accidents.zip
  inflating: US_Accidents_Dec20_updated.csv  


In [10]:
# Load Dataset
accidents_df_original = pd.read_csv (loader_config['local_dataset_name'])
accidents_df = accidents_df_original.copy()

## Exploration

In [11]:
# Print Basic Info
logger.info(f"Number of rows: {accidents_df.shape[0]}")
logger.info(f"Number of Columns: {accidents_df.shape[1]}")
logger.info(f"Columns: {accidents_df.columns}")

2021-11-05 19:04:29 Notebook     INFO     [1m[36mNumber of rows: 1516064[0m
2021-11-05 19:04:29 Notebook     INFO     [1m[36mNumber of Columns: 47[0m
2021-11-05 19:04:29 Notebook     INFO     [1m[36mColumns: Index(['ID', 'Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
       'End_Lat', 'End_Lng', 'Distance(mi)', 'Description', 'Number', 'Street',
       'Side', 'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone',
       'Airport_Code', 'Weather_Timestamp', 'Temperature(F)', 'Wind_Chill(F)',
       'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction',
       'Wind_Speed(mph)', 'Precipitation(in)', 'Weather_Condition', 'Amenity',
       'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway',
       'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal',
       'Turning_Loop', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
       'Astronomical_Twilight'],
      dtype='object')[0m


In [12]:
# -- Filter By Cities -- #
logger.info("Unique Cities: ")
cities = accidents_df.City.unique().tolist()
logger.info(f"{cities[:10]}, ..")
logger.info(f"Number of cities: {len(cities)}")
logger.info("----------------------------------------------------")

# Try different number of cities filters
for num_cities in [20, 50, 100, 500, 1000, 1500, 2000]:
    current_num = accidents_df[accidents_df.City.isin(cities[:num_cities])].shape[0]
    logger.info(f"Number of rows when only FIRST {num_cities} were included: {current_num}")

2021-11-05 19:04:29 Notebook     INFO     [1m[36mUnique Cities: [0m
2021-11-05 19:04:29 Notebook     INFO     [1m[36m['Dublin', 'Dayton', 'Cincinnati', 'Akron', 'Williamsburg', 'Batavia', 'Cleveland', 'Lima', 'Westerville', 'Jamestown'], ..[0m
2021-11-05 19:04:29 Notebook     INFO     [1m[36mNumber of cities: 10658[0m
2021-11-05 19:04:29 Notebook     INFO     [1m[36m----------------------------------------------------[0m
2021-11-05 19:04:29 Notebook     INFO     [1m[36mNumber of rows when only FIRST 20 were included: 28208[0m
2021-11-05 19:04:29 Notebook     INFO     [1m[36mNumber of rows when only FIRST 50 were included: 49824[0m
2021-11-05 19:04:29 Notebook     INFO     [1m[36mNumber of rows when only FIRST 100 were included: 62172[0m
2021-11-05 19:04:29 Notebook     INFO     [1m[36mNumber of rows when only FIRST 500 were included: 279012[0m
2021-11-05 19:04:29 Notebook     INFO     [1m[36mNumber of rows when only FIRST 1000 were included: 693738[0m
2021-11

In [13]:
# -- Filter By Date -- #
accidents_df.loc[:, 'Start_Time_dt'] = pd.to_datetime(accidents_df.Start_Time)
logger.info(f"Earliest date: {accidents_df.Start_Time_dt.min()}")
logger.info(f"Most Recent date: {accidents_df.Start_Time_dt.max()}")
logger.info("----------------------------------------------------")

# Try different date filters
dates = [(2017, 1), (2018, 1), (2019, 1), (2020, 1), (2020, 6), (2020, 9)]
for year, month in dates:
    condition = accidents_df.Start_Time_dt.dt.date>=datetime.date(year=year,month=month,day=1)
    current_num = accidents_df.Start_Time_dt[condition].shape[0]
    logger.info(f"Number of rows when only dates STARTED FROM {month}/{year} were included: {current_num}")


2021-11-05 19:04:30 Notebook     INFO     [1m[36mEarliest date: 2016-02-08 00:37:08[0m
2021-11-05 19:04:30 Notebook     INFO     [1m[36mMost Recent date: 2020-12-31 23:28:56[0m
2021-11-05 19:04:30 Notebook     INFO     [1m[36m----------------------------------------------------[0m
2021-11-05 19:04:30 Notebook     INFO     [1m[36mNumber of rows when only dates STARTED FROM 1/2017 were included: 1386739[0m
2021-11-05 19:04:31 Notebook     INFO     [1m[36mNumber of rows when only dates STARTED FROM 1/2018 were included: 1216640[0m
2021-11-05 19:04:31 Notebook     INFO     [1m[36mNumber of rows when only dates STARTED FROM 1/2019 were included: 1049704[0m
2021-11-05 19:04:31 Notebook     INFO     [1m[36mNumber of rows when only dates STARTED FROM 1/2020 were included: 787932[0m
2021-11-05 19:04:32 Notebook     INFO     [1m[36mNumber of rows when only dates STARTED FROM 6/2020 were included: 546313[0m
2021-11-05 19:04:32 Notebook     INFO     [1m[36mNumber of rows 

In [14]:
# -- Filter By States and Date -- #
logger.info("Unique States: ")
states = accidents_df.State.unique().tolist()
logger.info(states[:10])
logger.info(f"Number of states: {len(states)}")
logger.info("----------------------------------------------------")

# Filter By NE States
staes_of_choice =  ['PA', 'NY', 'VY', 'ME', 'NH', 'MA', 'RI', 'CT', 'NJ', 'DE', 'DC', 'MD']
accidents_df_filtered = accidents_df[accidents_df.State.isin(staes_of_choice)].copy()
logger.info(f"Number of rows when only North Eastern states were included: {accidents_df_filtered.shape[0]}")

# By Date
year, month = 2020, 1
condition = accidents_df_filtered.Start_Time_dt.dt.date>=datetime.date(year=year,month=month,day=1)
current_num = accidents_df_filtered.Start_Time_dt[condition].shape[0]
logger.info(f"Number of rows when only dates STARTED FROM {month}/{year} for the North Easter States were included: {current_num}")

2021-11-05 19:04:32 Notebook     INFO     [1m[36mUnique States: [0m
2021-11-05 19:04:32 Notebook     INFO     [1m[36m['OH', 'IN', 'KY', 'WV', 'MI', 'PA', 'CA', 'NV', 'MN', 'TX'][0m
2021-11-05 19:04:32 Notebook     INFO     [1m[36mNumber of states: 49[0m
2021-11-05 19:04:32 Notebook     INFO     [1m[36m----------------------------------------------------[0m
2021-11-05 19:04:32 Notebook     INFO     [1m[36mNumber of rows when only North Eastern states were included: 206216[0m
2021-11-05 19:04:32 Notebook     INFO     [1m[36mNumber of rows when only dates STARTED FROM 1/2020 for the North Easter States were included: 114565[0m


## Preprocessing

In [15]:
from project_libs.project import preprocessing as pre

In [16]:
accidents_df = accidents_df_original.copy()

## Load Model

In [None]:
from project_libs.project import models

## Hyperparameter Tuning

## Training

## Testing

## Evaluation

## Plots

In [None]:
from project_libs.project import plotter as pl