## You're here! 
Welcome to your first competition in the [ITI's AI Pro training program](https://ai.iti.gov.eg/epita/ai-engineer/)! We hope you enjoy it and learn as much as we did preparing this competition.


## Introduction

The goal of this competition is to predict the `Severity` of a car crash from information about the crash, e.g., its location.

This is the getting started notebook. Things are kept simple so that it's easier to understand the steps and modify it.

Feel free to `Fork` this notebook and share it with your modifications **OR** use it to create your submissions.

### Prerequisites
You should know how to use Python and a little bit of Machine Learning. You can apply the techniques you learned in the training program and submit the new solutions!

### Checklist
You can participate in this competition the way you prefer. However, I recommend following these steps if this is your first time joining a competition on Kaggle.

* Fork this notebook and run the cells in order.
* Submit this solution.
* Make changes to the data processing step as you see fit.
* Submit the new solutions.

*You can submit up to 5 submissions per day. You can select only one of the submissions you make to be considered in the final ranking.*


Don't hesitate to leave a comment or contact me if you have any questions!

## Import the libraries

We'll use `pandas` to load and manipulate the data. Other libraries will be imported in the relevant sections.

In [219]:
import pandas as pd
import os


## Exploratory Data Analysis
In this step, one should load the data and analyze it. However, I'll load the data and do minimal analysis. You are encouraged to do thorough analysis!

Let's load the data using `pandas` and have a look at the generated `DataFrame`.

In [220]:

df = pd.read_csv('train.csv')

print("The shape of the dataset is {}.\n\n".format(df.shape))

df.head()

The shape of the dataset is (6407, 16).




Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,Severity,timestamp
0,0,37.76215,-122.40566,False,0.044,False,False,False,False,False,False,False,True,R,2,2016-03-25 15:13:02
1,1,37.719157,-122.448254,False,0.0,False,False,False,False,False,False,False,False,R,2,2020-05-05 19:23:00
2,2,37.808498,-122.366852,False,0.0,False,False,False,False,False,False,True,False,R,3,2016-09-16 19:57:16
3,3,37.78593,-122.39108,False,0.009,False,False,True,False,False,False,False,False,R,1,2020-03-29 19:48:43
4,4,37.719141,-122.448457,False,0.0,False,False,False,False,False,False,False,False,R,2,2019-10-09 08:47:00


We've got 6407 examples in the dataset with 14 features, 1 ID, and the `Severity` of the crash.

Looking at the features and a sample of the data, the features appear to be a mix of numerical and categorical types. What about some descriptive statistics?

In [221]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6407 entries, 0 to 6406
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            6407 non-null   int64  
 1   Lat           6407 non-null   float64
 2   Lng           6407 non-null   float64
 3   Bump          6407 non-null   bool   
 4   Distance(mi)  6407 non-null   float64
 5   Crossing      6407 non-null   bool   
 6   Give_Way      6407 non-null   bool   
 7   Junction      6407 non-null   bool   
 8   No_Exit       6407 non-null   bool   
 9   Railway       6407 non-null   bool   
 10  Roundabout    6407 non-null   bool   
 11  Stop          6407 non-null   bool   
 12  Amenity       6407 non-null   bool   
 13  Side          6407 non-null   object 
 14  Severity      6407 non-null   int64  
 15  timestamp     6407 non-null   object 
dtypes: bool(9), float64(3), int64(2), object(2)
memory usage: 406.8+ KB


In [222]:
df.drop(columns='ID').describe()

Unnamed: 0,Lat,Lng,Distance(mi),Severity
count,6407.0,6407.0,6407.0,6407.0
mean,37.765653,-122.40599,0.135189,2.293429
std,0.032555,0.028275,0.39636,0.521225
min,37.609619,-122.51044,0.0,1.0
25%,37.737096,-122.41221,0.0,2.0
50%,37.768238,-122.404835,0.0,2.0
75%,37.787813,-122.392478,0.041,3.0
max,37.825626,-122.349734,6.82,4.0


The output shows descriptive statistics for the numerical features, `Lat`, `Lng`, `Distance(mi)`, and `Severity`. I'll use the numerical features to demonstrate how to train the model and make submissions. **However, you shouldn't use the numerical features only to make the final submission if you want to make it to the top of the leaderboard.**

In [223]:
for col in df.columns:
    print(df[col].value_counts())
    print('////////////////////////////\n')

2047    1
3387    1
1314    1
3363    1
5416    1
       ..
4727    1
2684    1
637     1
4735    1
0       1
Name: ID, Length: 6407, dtype: int64
////////////////////////////

37.808498    265
37.752502    106
37.808110     94
37.808253     89
37.807710     74
            ... 
37.811572      1
37.825462      1
37.770149      1
37.792115      1
37.806953      1
Name: Lat, Length: 2061, dtype: int64
////////////////////////////

-122.366852    269
-122.367190     94
-122.366974     92
-122.367640     74
-122.403008     63
              ... 
-122.385828      1
-122.475652      1
-122.447708      1
-122.420078      1
-122.461198      1
Name: Lng, Length: 1937, dtype: int64
////////////////////////////

False    6407
Name: Bump, dtype: int64
////////////////////////////

0.000    3923
0.010     499
0.037     158
0.420      40
0.208      35
         ... 
0.789       1
0.359       1
1.682       1
0.498       1
0.230       1
Name: Distance(mi), Length: 579, dtype: int64
//////////////////////

In [224]:
df.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,Severity,timestamp
0,0,37.76215,-122.40566,False,0.044,False,False,False,False,False,False,False,True,R,2,2016-03-25 15:13:02
1,1,37.719157,-122.448254,False,0.0,False,False,False,False,False,False,False,False,R,2,2020-05-05 19:23:00
2,2,37.808498,-122.366852,False,0.0,False,False,False,False,False,False,True,False,R,3,2016-09-16 19:57:16
3,3,37.78593,-122.39108,False,0.009,False,False,True,False,False,False,False,False,R,1,2020-03-29 19:48:43
4,4,37.719141,-122.448457,False,0.0,False,False,False,False,False,False,False,False,R,2,2019-10-09 08:47:00


### Read Weather CSV

In [225]:

weather_df = pd.read_csv('weather-sfcsv.csv')

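# NOTE: this prints the shape of the crash DataFrame `df`; use `weather_df.shape` to see the shape of the weather data.
print("The shape of the dataset is {}.\n\n".format(df.shape))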

weather_df.head()

The shape of the dataset is (6407, 16).




Unnamed: 0,Year,Day,Month,Hour,Weather_Condition,Wind_Chill(F),Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected
0,2020,27,7,18,Fair,64.0,0.0,64.0,70.0,20.0,10.0,No
1,2017,30,9,17,Partly Cloudy,,,71.1,57.0,9.2,10.0,No
2,2017,27,6,5,Overcast,,,57.9,87.0,15.0,9.0,No
3,2016,7,9,9,Clear,,,66.9,73.0,4.6,10.0,No
4,2019,19,10,2,Fair,52.0,0.0,52.0,89.0,0.0,9.0,No


In [226]:
weather_df.isna().sum()

Year                    0
Day                     0
Month                   0
Hour                    0
Weather_Condition       1
Wind_Chill(F)        3609
Precipitation(in)    3327
Temperature(F)          2
Humidity(%)             2
Wind_Speed(mph)       345
Visibility(mi)          1
Selected                0
dtype: int64

In [227]:
print(weather_df.head())

weather_df = weather_df.drop(columns=['Wind_Chill(F)'])
weather_df['Precipitation(in)'] = weather_df['Precipitation(in)'].fillna(weather_df['Precipitation(in)'].median())
weather_df['Wind_Speed(mph)'] = weather_df['Wind_Speed(mph)'].fillna(weather_df['Wind_Speed(mph)'].median())
weather_df.isna().sum()

   Year  Day  Month  ...  Wind_Speed(mph) Visibility(mi)  Selected
0  2020   27      7  ...             20.0           10.0        No
1  2017   30      9  ...              9.2           10.0        No
2  2017   27      6  ...             15.0            9.0        No
3  2016    7      9  ...              4.6           10.0        No
4  2019   19     10  ...              0.0            9.0        No

[5 rows x 12 columns]


Year                 0
Day                  0
Month                0
Hour                 0
Weather_Condition    1
Precipitation(in)    0
Temperature(F)       2
Humidity(%)          2
Wind_Speed(mph)      0
Visibility(mi)       1
Selected             0
dtype: int64
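
Whether to drop a column or fill its gaps can be guided by how much of it is missing. The sketch below is only an illustration: the tiny table and the 0.5 threshold are assumptions, not values taken from the competition data.

```python
import pandas as pd

# Hypothetical miniature weather table (values invented for illustration)
toy_weather = pd.DataFrame({
    'Wind_Chill(F)': [64.0, None, None, None],
    'Wind_Speed(mph)': [20.0, 9.2, None, 4.6],
})

# Fraction of missing values per column
missing_ratio = toy_weather.isna().mean()

# Assumed rule of thumb: drop columns that are more than half empty
threshold = 0.5
too_sparse = missing_ratio[missing_ratio > threshold].index
toy_weather = toy_weather.drop(columns=too_sparse)

# Fill the remaining gaps with each column's median
toy_weather = toy_weather.fillna(toy_weather.median())
print(toy_weather)
```

With a rule like this, the decision for each column is made from the data instead of being hard-coded.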

In [228]:
weather_df = weather_df.dropna()
weather_df.isna().sum()

Year                 0
Day                  0
Month                0
Hour                 0
Weather_Condition    0
Precipitation(in)    0
Temperature(F)       0
Humidity(%)          0
Wind_Speed(mph)      0
Visibility(mi)       0
Selected             0
dtype: int64

In [229]:
print(weather_df.shape)

(6899, 11)


In [230]:
weather_df['Date'] = weather_df['Year']*10000 + weather_df['Month']*100 + weather_df['Day']
weather_df.head()

Unnamed: 0,Year,Day,Month,Hour,Weather_Condition,Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected,Date
0,2020,27,7,18,Fair,0.0,64.0,70.0,20.0,10.0,No,20200727
1,2017,30,9,17,Partly Cloudy,0.0,71.1,57.0,9.2,10.0,No,20170930
2,2017,27,6,5,Overcast,0.0,57.9,87.0,15.0,9.0,No,20170627
3,2016,7,9,9,Clear,0.0,66.9,73.0,4.6,10.0,No,20160907
4,2019,19,10,2,Fair,0.0,52.0,89.0,0.0,9.0,No,20191019


In [231]:
weather_df = weather_df.rename(columns={'Hour': 'Time'})
weather_df = weather_df.drop(columns=['Year', 'Day', 'Month'])
weather_df.head()

Unnamed: 0,Time,Weather_Condition,Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected,Date
0,18,Fair,0.0,64.0,70.0,20.0,10.0,No,20200727
1,17,Partly Cloudy,0.0,71.1,57.0,9.2,10.0,No,20170930
2,5,Overcast,0.0,57.9,87.0,15.0,9.0,No,20170627
3,9,Clear,0.0,66.9,73.0,4.6,10.0,No,20160907
4,2,Fair,0.0,52.0,89.0,0.0,9.0,No,20191019


In [232]:
weather_df['Weather_Condition'] = weather_df['Weather_Condition'].astype('category')
weather_df['Weather_Condition'] = weather_df['Weather_Condition'].cat.codes
weather_df.head()

Unnamed: 0,Time,Weather_Condition,Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected,Date
0,18,3,0.0,64.0,70.0,20.0,10.0,No,20200727
1,17,17,0.0,71.1,57.0,9.2,10.0,No,20170930
2,5,16,0.0,57.9,87.0,15.0,9.0,No,20170627
3,9,0,0.0,66.9,73.0,4.6,10.0,No,20160907
4,2,3,0.0,52.0,89.0,0.0,9.0,No,20191019


In [233]:
weather_df['datetime'] = weather_df['Date'] * 100 + weather_df['Time']
weather_df.head()

Unnamed: 0,Time,Weather_Condition,Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected,Date,datetime
0,18,3,0.0,64.0,70.0,20.0,10.0,No,20200727,2020072718
1,17,17,0.0,71.1,57.0,9.2,10.0,No,20170930,2017093017
2,5,16,0.0,57.9,87.0,15.0,9.0,No,20170627,2017062705
3,9,0,0.0,66.9,73.0,4.6,10.0,No,20160907,2016090709
4,2,3,0.0,52.0,89.0,0.0,9.0,No,20191019,2019101902
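
The integer key above is built with arithmetic. As an alternative sketch (not what this notebook does), `pd.to_datetime` can assemble a timestamp from year/month/day/hour columns, which also rejects impossible dates. The frame `toy` and the column `when` are hypothetical names used only here.

```python
import pandas as pd

# Two rows copied from the weather table shown above
toy = pd.DataFrame({'Year': [2020, 2017], 'Month': [7, 9],
                    'Day': [27, 30], 'Hour': [18, 17]})

# to_datetime can assemble timestamps from year/month/day/hour columns
parts = toy.rename(columns=str.lower)[['year', 'month', 'day', 'hour']]
toy['when'] = pd.to_datetime(parts)

# Same integer key as the arithmetic version, derived from the timestamp
toy['datetime'] = toy['when'].dt.strftime('%Y%m%d%H').astype('int64')
print(toy[['when', 'datetime']])
```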


In [234]:
print(weather_df.shape)

(6899, 10)


In [235]:
weather_df = weather_df.drop_duplicates(subset='datetime', keep='last')
weather_df.head()

Unnamed: 0,Time,Weather_Condition,Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected,Date,datetime
0,18,3,0.0,64.0,70.0,20.0,10.0,No,20200727,2020072718
1,17,17,0.0,71.1,57.0,9.2,10.0,No,20170930,2017093017
2,5,16,0.0,57.9,87.0,15.0,9.0,No,20170627,2017062705
3,9,0,0.0,66.9,73.0,4.6,10.0,No,20160907,2016090709
8,15,14,0.0,55.0,64.0,18.4,10.0,No,20190214,2019021415


In [236]:
print(weather_df.shape)

(5970, 10)
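
To see what `keep='last'` does, here is a small sketch on invented readings. It also shows one possible alternative, averaging the duplicated hours, which is an assumption of mine and not what the notebook does.

```python
import pandas as pd

# Hypothetical readings: the first two rows share the same hour key
toy = pd.DataFrame({
    'datetime': [2020072718, 2020072718, 2017093017],
    'Temperature(F)': [64.0, 66.0, 71.1],
})

# keep='last' retains only the final row for each repeated key
last_only = toy.drop_duplicates(subset='datetime', keep='last')

# Alternative: average the readings that share a key
averaged = toy.groupby('datetime', as_index=False)['Temperature(F)'].mean()

print(last_only)
print(averaged)
```

Either way, the weather table ends up with one row per hour, so the later left merge cannot multiply crash rows.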


### Feature Manipulation

In [237]:
cols_to_encode = ['Give_Way' , 'No_Exit','Crossing', 'Junction', 'Railway', 'Stop', 'Amenity', 'Side']
for col in cols_to_encode:
    df[col] = df[col].astype('category')
    df[col] = df[col].cat.codes
    

df.head()

Unnamed: 0,ID,Lat,Lng,Bump,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Stop,Amenity,Side,Severity,timestamp
0,0,37.76215,-122.40566,False,0.044,0,0,0,0,0,False,0,1,1,2,2016-03-25 15:13:02
1,1,37.719157,-122.448254,False,0.0,0,0,0,0,0,False,0,0,1,2,2020-05-05 19:23:00
2,2,37.808498,-122.366852,False,0.0,0,0,0,0,0,False,1,0,1,3,2016-09-16 19:57:16
3,3,37.78593,-122.39108,False,0.009,0,0,1,0,0,False,0,0,1,1,2020-03-29 19:48:43
4,4,37.719141,-122.448457,False,0.0,0,0,0,0,0,False,0,0,1,2,2019-10-09 08:47:00
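
It helps to know which integer each category received. A minimal sketch, assuming `Side` takes the values `L` and `R` (only `R` appears in the sample above):

```python
import pandas as pd

# Hypothetical sample of the Side column
side = pd.Series(['R', 'L', 'R'], dtype='category')

# Categories are sorted, and each value's code is its position in that list
mapping = dict(enumerate(side.cat.categories))
codes = side.cat.codes

print(mapping)
print(codes.tolist())
```

Keeping such a mapping around makes it possible to apply the same encoding to the test set later.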


In [238]:
df[['Date', 'Time']] = df['timestamp'].str.split(' ', expand=True)
df['Date'] = df['Date'].apply(lambda x: x[0:4] + x[5:7] + x[8:])
df['Time'] = df['Time'].apply(lambda x: x[0:2])
df['Date'] = df['Date'].astype('int')
df['Time'] = df['Time'].astype('int')
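
The same `Date` and `Time` values can also be obtained by parsing the timestamp instead of slicing strings. This is a sketch of that alternative on two timestamps from the sample above; `toy` and `parsed` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({'timestamp': ['2016-03-25 15:13:02', '2019-10-09 08:47:00']})

# Parse once, then derive both columns from the parsed timestamps
parsed = pd.to_datetime(toy['timestamp'])
toy['Date'] = parsed.dt.strftime('%Y%m%d').astype('int64')  # YYYYMMDD as an integer
toy['Time'] = parsed.dt.hour                                 # hour of day, 0-23

print(toy)
```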

In [239]:
df = df.drop(columns=['Bump', 'Roundabout'])
df.head()

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Stop,Amenity,Side,Severity,timestamp,Date,Time
0,0,37.76215,-122.40566,0.044,0,0,0,0,0,0,1,1,2,2016-03-25 15:13:02,20160325,15
1,1,37.719157,-122.448254,0.0,0,0,0,0,0,0,0,1,2,2020-05-05 19:23:00,20200505,19
2,2,37.808498,-122.366852,0.0,0,0,0,0,0,1,0,1,3,2016-09-16 19:57:16,20160916,19
3,3,37.78593,-122.39108,0.009,0,0,1,0,0,0,0,1,1,2020-03-29 19:48:43,20200329,19
4,4,37.719141,-122.448457,0.0,0,0,0,0,0,0,0,1,2,2019-10-09 08:47:00,20191009,8


In [240]:
#Remove Timestamp
df = df.drop(columns=['timestamp'])
df.head()

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Stop,Amenity,Side,Severity,Date,Time
0,0,37.76215,-122.40566,0.044,0,0,0,0,0,0,1,1,2,20160325,15
1,1,37.719157,-122.448254,0.0,0,0,0,0,0,0,0,1,2,20200505,19
2,2,37.808498,-122.366852,0.0,0,0,0,0,0,1,0,1,3,20160916,19
3,3,37.78593,-122.39108,0.009,0,0,1,0,0,0,0,1,1,20200329,19
4,4,37.719141,-122.448457,0.0,0,0,0,0,0,0,0,1,2,20191009,8


### Merge Crash and Weather Datasets

In [241]:
df = pd.merge(df, weather_df, on=['Date', 'Time'], how='left')
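
A left merge keeps every crash, even when no weather row matches. One way to count the unmatched rows is the `indicator` argument of `pd.merge`, sketched here on invented frames (`crashes` and `weather` are hypothetical names):

```python
import pandas as pd

crashes = pd.DataFrame({'ID': [0, 1, 2],
                        'Date': [20160325, 20200505, 20160916],
                        'Time': [15, 19, 19]})
weather = pd.DataFrame({'Date': [20160325, 20200505],
                        'Time': [15, 19],
                        'Temperature(F)': [66.0, 59.0]})

# indicator=True adds a `_merge` column telling where each row came from
merged = pd.merge(crashes, weather, on=['Date', 'Time'], how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']

print(len(merged), len(unmatched))
```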

In [242]:
print(df.isna().sum())

ID                   0
Lat                  0
Lng                  0
Distance(mi)         0
Crossing             0
Give_Way             0
Junction             0
No_Exit              0
Railway              0
Stop                 0
Amenity              0
Side                 0
Severity             0
Date                 0
Time                 0
Weather_Condition    2
Precipitation(in)    2
Temperature(F)       2
Humidity(%)          2
Wind_Speed(mph)      2
Visibility(mi)       2
Selected             2
datetime             2
dtype: int64


In [243]:
df = df.dropna()

In [244]:
print(df.shape)

(6405, 23)


In [245]:
df.head()

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Stop,Amenity,Side,Severity,Date,Time,Weather_Condition,Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),Selected,datetime
0,0,37.76215,-122.40566,0.044,0,0,0,0,0,0,1,1,2,20160325,15,17.0,0.0,66.0,54.0,17.3,10.0,No,2016033000.0
1,1,37.719157,-122.448254,0.0,0,0,0,0,0,0,0,1,2,20200505,19,14.0,0.0,59.0,78.0,20.0,10.0,No,2020051000.0
2,2,37.808498,-122.366852,0.0,0,0,0,0,0,1,0,1,3,20160916,19,0.0,0.0,62.1,80.0,9.2,10.0,No,2016092000.0
3,3,37.78593,-122.39108,0.009,0,0,1,0,0,0,0,1,1,20200329,19,3.0,0.0,58.0,70.0,10.0,10.0,No,2020033000.0
4,4,37.719141,-122.448457,0.0,0,0,0,0,0,0,0,1,2,20191009,8,3.0,0.0,58.0,65.0,3.0,10.0,No,2019101000.0


In [246]:
df = df.drop(columns=['datetime', 'Selected'])

### Read the holidays file and convert it to a DataFrame

In [247]:
from bs4 import BeautifulSoup
dataset_path = '/kaggle/input/car-crashes-severity-prediction/'

# Read the XML file; the `with` block closes it automatically
with open('holidays.xml') as file:
    contents = file.read()
soup = BeautifulSoup(contents, 'xml')

# Each holiday has a <date> and a <description> element
date = soup.find_all('date')
description = soup.find_all('description')

data = []
for i in range(0, len(date)):
    rows = [date[i].get_text(), description[i].get_text()]
    data.append(rows)

# Both columns hold strings, so no numeric dtype is requested
x_df = pd.DataFrame(data, columns=['date','description'])
print(x_df)


          date                            description
0   2012-01-02                           New Year Day
1   2012-01-16             Martin Luther King Jr. Day
2   2012-02-20  Presidents Day (Washingtons Birthday)
3   2012-05-28                           Memorial Day
4   2012-07-04                       Independence Day
..         ...                                    ...
85  2020-09-07                              Labor Day
86  2020-10-12                           Columbus Day
87  2020-11-11                           Veterans Day
88  2020-11-26                       Thanksgiving Day
89  2020-12-25                          Christmas Day

[90 rows x 2 columns]


In [248]:
x_df['date'] = x_df['date'].apply(lambda x: x[0:4] + x[5:7] + x[8:])
print(x_df)

        date                            description
0   20120102                           New Year Day
1   20120116             Martin Luther King Jr. Day
2   20120220  Presidents Day (Washingtons Birthday)
3   20120528                           Memorial Day
4   20120704                       Independence Day
..       ...                                    ...
85  20200907                              Labor Day
86  20201012                           Columbus Day
87  20201111                           Veterans Day
88  20201126                       Thanksgiving Day
89  20201225                          Christmas Day

[90 rows x 2 columns]


In [249]:
x_df.head()

Unnamed: 0,date,description
0,20120102,New Year Day
1,20120116,Martin Luther King Jr. Day
2,20120220,Presidents Day (Washingtons Birthday)
3,20120528,Memorial Day
4,20120704,Independence Day


In [250]:
#Try Merge and add description field
x_df.rename(columns={'date': 'Date'}, inplace=True)
x_df['Date'] = x_df['Date'].astype('int')
df = pd.merge(df, x_df, on='Date', how='left')
df.head()

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Stop,Amenity,Side,Severity,Date,Time,Weather_Condition,Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),description
0,0,37.76215,-122.40566,0.044,0,0,0,0,0,0,1,1,2,20160325,15,17.0,0.0,66.0,54.0,17.3,10.0,
1,1,37.719157,-122.448254,0.0,0,0,0,0,0,0,0,1,2,20200505,19,14.0,0.0,59.0,78.0,20.0,10.0,
2,2,37.808498,-122.366852,0.0,0,0,0,0,0,1,0,1,3,20160916,19,0.0,0.0,62.1,80.0,9.2,10.0,
3,3,37.78593,-122.39108,0.009,0,0,1,0,0,0,0,1,1,20200329,19,3.0,0.0,58.0,70.0,10.0,10.0,
4,4,37.719141,-122.448457,0.0,0,0,0,0,0,0,0,1,2,20191009,8,3.0,0.0,58.0,65.0,3.0,10.0,


In [251]:
df.isna().sum()

ID                      0
Lat                     0
Lng                     0
Distance(mi)            0
Crossing                0
Give_Way                0
Junction                0
No_Exit                 0
Railway                 0
Stop                    0
Amenity                 0
Side                    0
Severity                0
Date                    0
Time                    0
Weather_Condition       0
Precipitation(in)       0
Temperature(F)          0
Humidity(%)             0
Wind_Speed(mph)         0
Visibility(mi)          0
description          6257
dtype: int64

In [252]:
df['description'] = df['description'].fillna('None')

In [253]:
df.head()

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Stop,Amenity,Side,Severity,Date,Time,Weather_Condition,Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),description
0,0,37.76215,-122.40566,0.044,0,0,0,0,0,0,1,1,2,20160325,15,17.0,0.0,66.0,54.0,17.3,10.0,
1,1,37.719157,-122.448254,0.0,0,0,0,0,0,0,0,1,2,20200505,19,14.0,0.0,59.0,78.0,20.0,10.0,
2,2,37.808498,-122.366852,0.0,0,0,0,0,0,1,0,1,3,20160916,19,0.0,0.0,62.1,80.0,9.2,10.0,
3,3,37.78593,-122.39108,0.009,0,0,1,0,0,0,0,1,1,20200329,19,3.0,0.0,58.0,70.0,10.0,10.0,
4,4,37.719141,-122.448457,0.0,0,0,0,0,0,0,0,1,2,20191009,8,3.0,0.0,58.0,65.0,3.0,10.0,


In [254]:
df['description'].value_counts()

None                                     6257
Presidents Day (Washingtons Birthday)      23
Thanksgiving Day                           19
Christmas Day                              19
Columbus Day                               18
Veterans Day                               18
Labor Day                                  14
Independence Day                           12
New Year Day                               10
Martin Luther King Jr. Day                  9
Memorial Day                                6
Name: description, dtype: int64
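
Since each individual holiday has only a handful of crashes, a simpler feature to try is a single yes/no flag. A minimal sketch, where `is_holiday` and `holiday_dates` are hypothetical names and the two dates come from the holiday table printed above:

```python
import pandas as pd

toy = pd.DataFrame({'Date': [20160325, 20201225, 20191009]})

# Two dates from the holiday table (Christmas Day and Labor Day 2020)
holiday_dates = {20201225, 20200907}

# 1 if the crash happened on a listed holiday, 0 otherwise
toy['is_holiday'] = toy['Date'].isin(holiday_dates).astype(int)
print(toy)
```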

In [255]:
df['description'] = df['description'].astype('category')
df['description'] = df['description'].cat.codes
df.head()

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Stop,Amenity,Side,Severity,Date,Time,Weather_Condition,Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),description
0,0,37.76215,-122.40566,0.044,0,0,0,0,0,0,1,1,2,20160325,15,17.0,0.0,66.0,54.0,17.3,10.0,7
1,1,37.719157,-122.448254,0.0,0,0,0,0,0,0,0,1,2,20200505,19,14.0,0.0,59.0,78.0,20.0,10.0,7
2,2,37.808498,-122.366852,0.0,0,0,0,0,0,1,0,1,3,20160916,19,0.0,0.0,62.1,80.0,9.2,10.0,7
3,3,37.78593,-122.39108,0.009,0,0,1,0,0,0,0,1,1,20200329,19,3.0,0.0,58.0,70.0,10.0,10.0,7
4,4,37.719141,-122.448457,0.0,0,0,0,0,0,0,0,1,2,20191009,8,3.0,0.0,58.0,65.0,3.0,10.0,7


In [256]:
#print(df['holiday'].value_counts())

In [257]:
df.head()

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Stop,Amenity,Side,Severity,Date,Time,Weather_Condition,Precipitation(in),Temperature(F),Humidity(%),Wind_Speed(mph),Visibility(mi),description
0,0,37.76215,-122.40566,0.044,0,0,0,0,0,0,1,1,2,20160325,15,17.0,0.0,66.0,54.0,17.3,10.0,7
1,1,37.719157,-122.448254,0.0,0,0,0,0,0,0,0,1,2,20200505,19,14.0,0.0,59.0,78.0,20.0,10.0,7
2,2,37.808498,-122.366852,0.0,0,0,0,0,0,1,0,1,3,20160916,19,0.0,0.0,62.1,80.0,9.2,10.0,7
3,3,37.78593,-122.39108,0.009,0,0,1,0,0,0,0,1,1,20200329,19,3.0,0.0,58.0,70.0,10.0,10.0,7
4,4,37.719141,-122.448457,0.0,0,0,0,0,0,0,0,1,2,20191009,8,3.0,0.0,58.0,65.0,3.0,10.0,7


#### Feature Removal

In [258]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6405 entries, 0 to 6404
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 6405 non-null   int64  
 1   Lat                6405 non-null   float64
 2   Lng                6405 non-null   float64
 3   Distance(mi)       6405 non-null   float64
 4   Crossing           6405 non-null   int8   
 5   Give_Way           6405 non-null   int8   
 6   Junction           6405 non-null   int8   
 7   No_Exit            6405 non-null   int8   
 8   Railway            6405 non-null   int8   
 9   Stop               6405 non-null   int8   
 10  Amenity            6405 non-null   int8   
 11  Side               6405 non-null   int8   
 12  Severity           6405 non-null   int64  
 13  Date               6405 non-null   int64  
 14  Time               6405 non-null   int64  
 15  Weather_Condition  6405 non-null   float64
 16  Precipitation(in)  6405 

In [259]:
df = df.drop(columns=['Humidity(%)', 'Amenity'])

In [260]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6405 entries, 0 to 6404
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 6405 non-null   int64  
 1   Lat                6405 non-null   float64
 2   Lng                6405 non-null   float64
 3   Distance(mi)       6405 non-null   float64
 4   Crossing           6405 non-null   int8   
 5   Give_Way           6405 non-null   int8   
 6   Junction           6405 non-null   int8   
 7   No_Exit            6405 non-null   int8   
 8   Railway            6405 non-null   int8   
 9   Stop               6405 non-null   int8   
 10  Side               6405 non-null   int8   
 11  Severity           6405 non-null   int64  
 12  Date               6405 non-null   int64  
 13  Time               6405 non-null   int64  
 14  Weather_Condition  6405 non-null   float64
 15  Precipitation(in)  6405 non-null   float64
 16  Temperature(F)     6405 

In [261]:
df.head()

Unnamed: 0,ID,Lat,Lng,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Stop,Side,Severity,Date,Time,Weather_Condition,Precipitation(in),Temperature(F),Wind_Speed(mph),Visibility(mi),description
0,0,37.76215,-122.40566,0.044,0,0,0,0,0,0,1,2,20160325,15,17.0,0.0,66.0,17.3,10.0,7
1,1,37.719157,-122.448254,0.0,0,0,0,0,0,0,1,2,20200505,19,14.0,0.0,59.0,20.0,10.0,7
2,2,37.808498,-122.366852,0.0,0,0,0,0,0,1,1,3,20160916,19,0.0,0.0,62.1,9.2,10.0,7
3,3,37.78593,-122.39108,0.009,0,0,1,0,0,0,1,1,20200329,19,3.0,0.0,58.0,10.0,10.0,7
4,4,37.719141,-122.448457,0.0,0,0,0,0,0,0,1,2,20191009,8,3.0,0.0,58.0,3.0,10.0,7


In [262]:
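import math
# Project Lat/Lng onto Cartesian coordinates (x, y, z) and drop the originals.
# NOTE: math.cos and math.sin expect radians, while Lat and Lng are in degrees;
# the values are passed through unconverted here.
df['x'] = df['Lat'].apply(math.cos) * df['Lng'].apply(math.cos)
df['y'] = df['Lat'].apply(math.cos) * df['Lng'].apply(math.sin)
df['z'] = df['Lat'].apply(math.sin)
df = df.drop(columns=['Lat', 'Lng'])
df.head()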

Unnamed: 0,ID,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Stop,Side,Severity,Date,Time,Weather_Condition,Precipitation(in),Temperature(F),Wind_Speed(mph),Visibility(mi),description,x,y,z
0,0,0.044,0,0,0,0,0,0,1,2,20160325,15,17.0,0.0,66.0,17.3,10.0,7,-0.991254,-0.11596,0.062996
1,1,0.0,0,0,0,0,0,0,1,2,20200505,19,14.0,0.0,59.0,20.0,10.0,7,-0.997073,-0.073778,0.020044
2,2,0.0,0,0,0,0,0,1,1,3,20160916,19,0.0,0.0,62.1,9.2,10.0,7,-0.982066,-0.153714,0.109168
3,3,0.009,0,0,1,0,0,0,1,1,20200329,19,3.0,0.0,58.0,10.0,10.0,7,-0.987693,-0.130167,0.086709
4,4,0.0,0,0,0,0,0,0,1,2,20191009,8,3.0,0.0,58.0,3.0,10.0,7,-0.997089,-0.073575,0.020028
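
`math.cos` and `math.sin` take angles in radians, whereas `Lat` and `Lng` are stored in degrees. A variant of the projection that converts to radians first could look like the sketch below. `to_cartesian` is a hypothetical helper, and using it would change the `x`, `y`, `z` values shown in the tables in this notebook.

```python
import math

def to_cartesian(lat_deg, lng_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lng = math.radians(lng_deg)
    x = math.cos(lat) * math.cos(lng)
    y = math.cos(lat) * math.sin(lng)
    z = math.sin(lat)
    return x, y, z

# First crash in the sample above
x, y, z = to_cartesian(37.76215, -122.40566)
print(x, y, z)
```

It could be applied row by row, for example with `df.apply(lambda r: to_cartesian(r['Lat'], r['Lng']), axis=1)`.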


In [263]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6405 entries, 0 to 6404
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 6405 non-null   int64  
 1   Distance(mi)       6405 non-null   float64
 2   Crossing           6405 non-null   int8   
 3   Give_Way           6405 non-null   int8   
 4   Junction           6405 non-null   int8   
 5   No_Exit            6405 non-null   int8   
 6   Railway            6405 non-null   int8   
 7   Stop               6405 non-null   int8   
 8   Side               6405 non-null   int8   
 9   Severity           6405 non-null   int64  
 10  Date               6405 non-null   int64  
 11  Time               6405 non-null   int64  
 12  Weather_Condition  6405 non-null   float64
 13  Precipitation(in)  6405 non-null   float64
 14  Temperature(F)     6405 non-null   float64
 15  Wind_Speed(mph)    6405 non-null   float64
 16  Visibility(mi)     6405 

In [264]:
#Normalize Features
for col in df.columns:
    if col not in ['ID', 'Severity', 'x', 'y', 'z', 'Date', 'Time']:
        df[col] = (df[col] - df[col].mean()) / df[col].std()
df.head()

Unnamed: 0,ID,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Stop,Side,Severity,Date,Time,Weather_Condition,Precipitation(in),Temperature(F),Wind_Speed(mph),Visibility(mi),description,x,y,z
0,0,-0.229882,-0.299713,-0.021646,-0.571717,-0.012495,-0.16511,-0.329099,0.231183,2,20160325,15,0.85458,-0.164097,0.759585,1.013952,0.339836,0.073219,-0.991254,-0.11596,0.062996
1,1,-0.340898,-0.299713,-0.021646,-0.571717,-0.012495,-0.16511,-0.329099,0.231183,2,20200505,19,0.422216,-0.164097,-0.116255,1.436932,0.339836,0.073219,-0.997073,-0.073778,0.020044
2,2,-0.340898,-0.299713,-0.021646,-0.571717,-0.012495,-0.16511,3.038122,0.231183,3,20160916,19,-1.595482,-0.164097,0.271617,-0.254989,0.339836,0.073219,-0.982066,-0.153714,0.109168
3,3,-0.318191,-0.299713,-0.021646,1.748845,-0.012495,-0.16511,-0.329099,0.231183,1,20200329,19,-1.163118,-0.164097,-0.241375,-0.129662,0.339836,0.073219,-0.987693,-0.130167,0.086709
4,4,-0.340898,-0.299713,-0.021646,-0.571717,-0.012495,-0.16511,-0.329099,0.231183,2,20191009,8,-1.163118,-0.164097,-0.241375,-1.226277,0.339836,0.073219,-0.997089,-0.073575,0.020028
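
The cell above standardizes with the mean and standard deviation of the whole dataset. A common alternative, sketched here on invented numbers, is to compute those statistics on the training part only and reuse them for the validation part, so no information from the validation rows enters the preprocessing.

```python
import pandas as pd

# Hypothetical train/validation values for a single feature
train = pd.DataFrame({'Temperature(F)': [50.0, 60.0, 70.0]})
val = pd.DataFrame({'Temperature(F)': [65.0]})

# Statistics come from the training rows only
mean = train['Temperature(F)'].mean()
std = train['Temperature(F)'].std()  # pandas uses the sample std (ddof=1)

train['Temperature(F)'] = (train['Temperature(F)'] - mean) / std
val['Temperature(F)'] = (val['Temperature(F)'] - mean) / std

print(train)
print(val)
```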


#### Add Day_Of_Week column

In [265]:
import datetime

In [266]:
def get_day_of_week(date):
    date = int(date)
    day = date%100
    date = date//100
    month = date%100
    year = date//100
    return datetime.datetime(year, month, day).weekday()

In [267]:
df['DayOfWeek'] = df['Date'].apply(get_day_of_week)
df.head()

Unnamed: 0,ID,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Stop,Side,Severity,Date,Time,Weather_Condition,Precipitation(in),Temperature(F),Wind_Speed(mph),Visibility(mi),description,x,y,z,DayOfWeek
0,0,-0.229882,-0.299713,-0.021646,-0.571717,-0.012495,-0.16511,-0.329099,0.231183,2,20160325,15,0.85458,-0.164097,0.759585,1.013952,0.339836,0.073219,-0.991254,-0.11596,0.062996,4
1,1,-0.340898,-0.299713,-0.021646,-0.571717,-0.012495,-0.16511,-0.329099,0.231183,2,20200505,19,0.422216,-0.164097,-0.116255,1.436932,0.339836,0.073219,-0.997073,-0.073778,0.020044,1
2,2,-0.340898,-0.299713,-0.021646,-0.571717,-0.012495,-0.16511,3.038122,0.231183,3,20160916,19,-1.595482,-0.164097,0.271617,-0.254989,0.339836,0.073219,-0.982066,-0.153714,0.109168,4
3,3,-0.318191,-0.299713,-0.021646,1.748845,-0.012495,-0.16511,-0.329099,0.231183,1,20200329,19,-1.163118,-0.164097,-0.241375,-0.129662,0.339836,0.073219,-0.987693,-0.130167,0.086709,6
4,4,-0.340898,-0.299713,-0.021646,-0.571717,-0.012495,-0.16511,-0.329099,0.231183,2,20191009,8,-1.163118,-0.164097,-0.241375,-1.226277,0.339836,0.073219,-0.997089,-0.073575,0.020028,2
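
The same weekday can be obtained by letting `datetime` parse the `YYYYMMDD` integer directly. This is a sketch of an equivalent helper (`day_of_week` is a hypothetical name); as with `get_day_of_week`, Monday is 0 and Sunday is 6.

```python
import datetime

def day_of_week(date_int):
    """Return the weekday (Monday=0 ... Sunday=6) of a YYYYMMDD integer."""
    return datetime.datetime.strptime(str(date_int), '%Y%m%d').weekday()

# Two dates from the sample above
print(day_of_week(20160325))
print(day_of_week(20200329))
```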


#### Get Time of Day

In [268]:
def get_time_of_day(time):
    if time >= 0 and time < 6:
        return 0
    if time >= 6 and time < 12:
        return 1
    if time >= 12 and time < 18:
        return 2
    return 3

In [269]:
df['TimeOfDay'] = df['Time'].apply(get_time_of_day)
df.head()

Unnamed: 0,ID,Distance(mi),Crossing,Give_Way,Junction,No_Exit,Railway,Stop,Side,Severity,Date,Time,Weather_Condition,Precipitation(in),Temperature(F),Wind_Speed(mph),Visibility(mi),description,x,y,z,DayOfWeek,TimeOfDay
0,0,-0.229882,-0.299713,-0.021646,-0.571717,-0.012495,-0.16511,-0.329099,0.231183,2,20160325,15,0.85458,-0.164097,0.759585,1.013952,0.339836,0.073219,-0.991254,-0.11596,0.062996,4,2
1,1,-0.340898,-0.299713,-0.021646,-0.571717,-0.012495,-0.16511,-0.329099,0.231183,2,20200505,19,0.422216,-0.164097,-0.116255,1.436932,0.339836,0.073219,-0.997073,-0.073778,0.020044,1,3
2,2,-0.340898,-0.299713,-0.021646,-0.571717,-0.012495,-0.16511,3.038122,0.231183,3,20160916,19,-1.595482,-0.164097,0.271617,-0.254989,0.339836,0.073219,-0.982066,-0.153714,0.109168,4,3
3,3,-0.318191,-0.299713,-0.021646,1.748845,-0.012495,-0.16511,-0.329099,0.231183,1,20200329,19,-1.163118,-0.164097,-0.241375,-0.129662,0.339836,0.073219,-0.987693,-0.130167,0.086709,6,3
4,4,-0.340898,-0.299713,-0.021646,-0.571717,-0.012495,-0.16511,-0.329099,0.231183,2,20191009,8,-1.163118,-0.164097,-0.241375,-1.226277,0.339836,0.073219,-0.997089,-0.073575,0.020028,2,1
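
The four six-hour buckets can also be expressed with `pd.cut`, which avoids the chain of `if` statements. A minimal sketch on a few invented hours, using the same bucket edges as `get_time_of_day`:

```python
import pandas as pd

hours = pd.Series([15, 19, 8, 0, 23])

# right=False makes each bin include its left edge: [0, 6), [6, 12), [12, 18), [18, 24)
time_of_day = pd.cut(hours, bins=[0, 6, 12, 18, 24],
                     right=False, labels=[0, 1, 2, 3]).astype(int)

print(time_of_day.tolist())
```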


In [270]:
df = df.drop(columns='Time')

## Data Splitting

Now it's time to split the dataset for the training step. Typically the dataset is split into 3 subsets, namely, the training, validation and test sets. In our case, the test set is already predefined. So we'll split the "training" set into training and validation sets with a 0.8:0.2 ratio.

*Note: a good way to generate reproducible results is to set the seed for the algorithms that depend on randomization. This is done with the argument `random_state` in the following command.*

In [271]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['Severity']) # `stratify` keeps the Severity proportions the same in both splits

X_train = train_df.drop(columns=['ID','Severity'])
y_train = train_df['Severity']

X_val = val_df.drop(columns=['ID' ,'Severity'])
y_val = val_df['Severity']
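
The effect of the `stratify` argument is easy to check on a toy example. The sketch below uses an invented label distribution (60/30/10) and compares the class shares before and after the split; `toy`, `train_part` and `val_part` are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60% class 2, 30% class 3, 10% class 1
toy = pd.DataFrame({'Severity': [2] * 60 + [3] * 30 + [1] * 10})
toy['feature'] = range(len(toy))

train_part, val_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy['Severity'])

# Class shares in the full data and in the validation part
print(toy['Severity'].value_counts(normalize=True).sort_index())
print(val_part['Severity'].value_counts(normalize=True).sort_index())
```

Without stratification, a rare class can end up under-represented in the validation set just by chance.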


As pointed out earlier, I'll use the numerical features to train the classifier. **However, you shouldn't use the numerical features only to make the final submission if you want to make it to the top of the leaderboard.**

In [272]:
#Checking for duplicates
duplicates = X_train[X_train.duplicated()]
print(duplicates)

      Distance(mi)  Crossing  Give_Way  ...         z  DayOfWeek  TimeOfDay
3658     -0.340898 -0.299713 -0.021646  ...  0.020397          1          3
2313     -0.340898 -0.299713 -0.021646  ...  0.035551          5          2
2037     -0.340898 -0.299713 -0.021646  ...  0.068614          1          3
3717     -0.340898 -0.299713 -0.021646  ...  0.053365          3          1
2275      1.079603 -0.299713 -0.021646  ...  0.047844          1          0
...            ...       ...       ...  ...       ...        ...        ...
1207     -0.151666 -0.299713 -0.021646  ...  0.061799          5          1
4773     -0.340898 -0.299713 -0.021646  ...  0.111188          3          3
4177     -0.340898 -0.299713 -0.021646  ...  0.062837          4          2
5135     -0.315668 -0.299713 -0.021646  ...  0.109168          2          3
3449      0.448830 -0.299713 -0.021646  ...  0.040943          1          1

[107 rows x 20 columns]
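
The cell above only lists the duplicated feature rows. If you decide to drop them, the labels have to be filtered with the same mask, and it can be worth checking whether identical feature rows carry different labels. A minimal sketch on made-up data (the `toy_*` names and values are hypothetical):

```python
import pandas as pd

# Hypothetical features and labels sharing the same index.
toy_X = pd.DataFrame(
    {"a": [1, 1, 2, 3, 3], "b": [0.5, 0.5, 0.7, 0.9, 0.9]},
    index=[10, 11, 12, 13, 14],
)
toy_y = pd.Series([2, 2, 3, 2, 4], index=toy_X.index, name="Severity")

# duplicated() marks the second and later occurrences of each repeated row.
dup_mask = toy_X.duplicated()

# Apply the same mask to features and labels so they stay aligned.
X_dedup = toy_X[~dup_mask]
y_dedup = toy_y[~dup_mask]
print(X_dedup.index.tolist())

# Identical feature rows with more than one distinct label.
labels_per_row = toy_X.assign(Severity=toy_y).groupby(["a", "b"])["Severity"].nunique()
conflicting = labels_per_row[labels_per_row > 1]
print(conflicting)
```

Whether dropping duplicates helps is a judgment call: repeated rows with the same label may simply reflect real repeated events, while conflicting labels put a ceiling on the accuracy any model can reach on those rows.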


## Model Training

Let's train a model with the data! We'll train a Random Forest Classifier to demonstrate the process of making submissions. 

In [273]:
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the classifier
classifier = RandomForestClassifier(max_depth=8, random_state=0)

# Train the classifier
classifier = classifier.fit(X_train, y_train)

Now let's test our classifier on the validation dataset and see the accuracy.

In [274]:
print("The accuracy of the classifier on the validation set is ", (classifier.score(X_val, y_val)))

The accuracy of the classifier on the validation set is  0.834504293520687


Well. That's a good start, right? A classifier that predicts every example's `Severity` as 2 will get around 0.63. You should get a better score as you add more features and do better data preprocessing.
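
The majority-class baseline mentioned above can be reproduced with scikit-learn's `DummyClassifier`. Here is a minimal sketch on synthetic labels; the class shares are made up and the `toy_*` names are not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic features and imbalanced labels (class 2 is the most frequent).
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(500, 3))
toy_y = rng.choice([1, 2, 3, 4], size=500, p=[0.05, 0.65, 0.25, 0.05])

# "most_frequent" ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X, toy_y)

baseline_acc = baseline.score(toy_X, toy_y)
majority_share = np.mean(toy_y == 2)
print(baseline_acc, majority_share)
```

The baseline accuracy equals the share of the majority class, so any real model should be compared against that number rather than against zero.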

In [275]:
#Train with all data
X_train = df.drop(columns=['ID', 'Severity'])
y_train = df['Severity']

# Create an instance of the classifier
classifier = RandomForestClassifier(max_depth=8, random_state=0)

# Train the classifier
classifier = classifier.fit(X_train, y_train)

## Submission File Generation

We have built a model and we'd like to submit our predictions on the test set! In order to do that, we'll load the test set, predict the class and save the submission file. 

First, we'll load the data.

In [276]:
test_df = pd.read_csv('test.csv')

Note that the test set has the same features as the training set but doesn't have the `Severity` column.
At this stage, one must **NOT** forget to apply to the test set's features the same processing that was applied to the training set.
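
One place where "the same processing" can silently differ is category encoding: calling `.astype('category').cat.codes` separately on each set assigns codes from whatever values happen to be present in that set. The sketch below uses made-up `Side` values to show the effect and one possible fix, reusing the training categories through `pd.CategoricalDtype`. Whether this matters for the real data depends on whether both sets contain the same values, which is an assumption to check.

```python
import pandas as pd

# Hypothetical data: "L" appears in training but never in the test set.
toy_train = pd.DataFrame({"Side": ["R", "L", "R", "R"]})
toy_test = pd.DataFrame({"Side": ["R", "R", "R"]})

# Encoding each frame independently: "R" is 1 in train but 0 in test.
naive_train = toy_train["Side"].astype("category").cat.codes
naive_test = toy_test["Side"].astype("category").cat.codes
print(naive_train.tolist(), naive_test.tolist())

# Fix the categories from the training set and reuse them for the test set.
side_dtype = pd.CategoricalDtype(categories=sorted(toy_train["Side"].unique()))
train_codes = toy_train["Side"].astype(side_dtype).cat.codes
test_codes = toy_test["Side"].astype(side_dtype).cat.codes
print(train_codes.tolist(), test_codes.tolist())
```

With a shared dtype, a value that was never seen in training is encoded as `-1`, which at least makes the mismatch visible.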

Now we'll add the `Severity` column to the test `DataFrame` and fill it with the predicted classes.

**I'll select the numerical features here as I did in the training set. DO NOT forget to change this step as you change the preprocessing of the training data.**

In [277]:
X_test = test_df.drop(columns=['ID'])
print(X_test)

# You should update/remove the next line once you change the features used for training
#X_test = X_test[['Lat', 'Lng', 'Distance(mi)']]
for col in cols_to_encode:
    X_test[col] = X_test[col].astype('category')
    X_test[col] = X_test[col].cat.codes
X_test = X_test.drop(columns=['Bump', 'Roundabout'])

X_test[['Date', 'Time']] = X_test['timestamp'].str.split(' ', expand=True)
X_test['Date'] = X_test['Date'].apply(lambda x: x[0:4] + x[5:7] + x[8:])
X_test['Time'] = X_test['Time'].apply(lambda x: x[0:2])
X_test['Date'] = X_test['Date'].astype('int')
X_test['Time'] = X_test['Time'].astype('int')
X_test = X_test.drop(columns=['timestamp'])


X_test = pd.merge(X_test, weather_df, on=['Date', 'Time'], how='left')

X_test = X_test.dropna()
X_test = X_test.drop(columns='datetime')


x_df = x_df.rename(columns={'date': 'Date'})
x_df['Date'] = x_df['Date'].astype('int')

#mask = (X_test.set_index([ 'Date' ]).index.isin(x_df.set_index([ 'Date' ]).index))

#X_test['holiday'] = np.where(mask, 1, 0)
X_test = pd.merge(X_test, x_df, on='Date', how='left')
#X_test.head()
X_test['description'] = X_test['description'].fillna('None')
X_test['description'] = X_test['description'].astype('category')
X_test['description'] = X_test['description'].cat.codes

X_test = X_test.drop(columns=['Humidity(%)', 'Amenity', 'Selected'])

X_test['DayOfWeek'] = X_test['Date'].apply(get_day_of_week)
X_test['TimeOfDay'] = X_test['Time'].apply(get_time_of_day)
X_test = X_test.drop(columns='Time')


X_test['x'] = X_test['Lat'].apply(math.cos) * X_test['Lng'].apply(math.cos)
X_test['y'] = X_test['Lat'].apply(math.cos) * X_test['Lng'].apply(math.sin)
X_test['z'] = X_test['Lat'].apply(math.sin)
X_test = X_test.drop(columns=['Lat', 'Lng'])

print(X_test.info())

            Lat         Lng   Bump  ...  Amenity  Side            timestamp
0     37.786060 -122.390900  False  ...    False     R  2016-04-04 19:20:31
1     37.769609 -122.415057  False  ...    False     R  2020-10-28 11:51:00
2     37.807495 -122.476021  False  ...    False     R  2019-09-09 07:36:45
3     37.761818 -122.405869  False  ...    False     R  2019-08-06 15:46:25
4     37.732350 -122.414100  False  ...    False     R  2018-10-17 09:54:58
...         ...         ...    ...  ...      ...   ...                  ...
1596  37.812973 -122.362335  False  ...    False     R  2020-06-26 22:32:22
1597  37.761818 -122.405861  False  ...    False     R  2016-12-03 07:16:30
1598  37.732260 -122.431970  False  ...    False     R  2017-02-20 06:32:44
1599  37.786782 -122.390126  False  ...    False     R  2019-10-31 20:35:00
1600  37.773040 -122.406570  False  ...    False     R  2019-05-27 20:45:47

[1601 rows x 14 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1601 entries
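
A note on the `x`, `y`, `z` features computed above: `math.cos` and `math.sin` interpret their argument as radians, while the printed `Lat`/`Lng` values look like degrees. The sketch below shows the conversion with `np.radians`, using two coordinate pairs copied from the output above. This is only an illustration; if you switch to it, the training features must be recomputed the same way, since the model was trained on the existing values.

```python
import numpy as np
import pandas as pd

# Two coordinate pairs copied from the printed test set (in degrees).
toy = pd.DataFrame({
    "Lat": [37.786060, 37.769609],
    "Lng": [-122.390900, -122.415057],
})

# Convert degrees to radians before applying trigonometric functions.
lat = np.radians(toy["Lat"])
lng = np.radians(toy["Lng"])

# Unit-sphere Cartesian coordinates.
toy["x"] = np.cos(lat) * np.cos(lng)
toy["y"] = np.cos(lat) * np.sin(lng)
toy["z"] = np.sin(lat)

# Each point should lie on the unit sphere.
norms = np.sqrt(toy["x"] ** 2 + toy["y"] ** 2 + toy["z"] ** 2)
print(toy[["x", "y", "z"]])
print(norms.tolist())
```

The vectorized NumPy calls also avoid the row-by-row `apply`, which tends to be slower on larger frames.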

In [278]:
y_test_predicted = classifier.predict(X_test)

test_df['Severity'] = y_test_predicted
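
The assignment above only works if `X_test` still has one row per row of `test_df`. Steps such as `dropna()` can remove rows, and then the prediction array no longer matches. A minimal sketch of the failure and of an index-aligned alternative, on made-up data (`toy_pred` is a stand-in for the output of `classifier.predict`):

```python
import numpy as np
import pandas as pd

# Hypothetical test set with one missing feature value.
toy_test = pd.DataFrame({"ID": [0, 1, 2, 3], "feat": [0.1, np.nan, 0.3, 0.4]})
toy_X = toy_test.drop(columns=["ID"]).dropna()  # drops the row with index 1

toy_pred = np.array([2, 2, 3])  # stand-in for classifier.predict(toy_X)

# Assigning a shorter array directly raises a ValueError.
mismatch_raised = False
try:
    toy_test["Severity"] = toy_pred
except ValueError:
    mismatch_raised = True
print(mismatch_raised)

# Aligning on the index shows which IDs received no prediction.
toy_test["Severity"] = pd.Series(toy_pred, index=toy_X.index)
missing_ids = toy_test.loc[toy_test["Severity"].isna(), "ID"].tolist()
print(missing_ids)
```

Since a submission usually needs a prediction for every `ID`, filling missing values instead of dropping rows is often the safer choice for the test set.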

Now we're ready to generate the submission file. The submission file needs the columns `ID` and `Severity` only.

In [279]:
test_df[['ID', 'Severity']].to_csv('submission.csv', index=False)
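
Before submitting, it can help to read the file back and check its shape. The sketch below writes a made-up submission to a temporary directory and runs a few checks; the specific checks are suggestions, not requirements stated by the competition.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for test_df[['ID', 'Severity']].
toy_submission = pd.DataFrame({"ID": [0, 1, 2], "Severity": [2, 3, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    toy_submission.to_csv(path, index=False)
    check_df = pd.read_csv(path)  # read back what was actually written

# Only the two expected columns, no missing values, one row per ID.
has_expected_columns = list(check_df.columns) == ["ID", "Severity"]
has_no_missing = not check_df.isna().any().any()
has_unique_ids = check_df["ID"].is_unique

print(has_expected_columns, has_no_missing, has_unique_ids)
```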

The remaining steps for submitting the generated file are as follows.

1. Press `Save Version` on the upper right corner of this notebook.
2. Write a `Version Name` of your choice and choose `Save & Run All (Commit)` then click `Save`.
3. Wait for the saved notebook to finish running, then go to the saved notebook.
4. Scroll down until you see the output files then select the `submission.csv` file and click `Submit`.

Now your submission will be evaluated and your score will be updated on the leaderboard! CONGRATULATIONS!!

## Conclusion

In this notebook, we have demonstrated the essential steps for getting "slightly" familiar with the data and the submission process. We chose not to go into detail in each step, to keep this welcoming notebook simple and leave room for improvement.

You're encouraged to `Fork` the notebook, edit it, add your insights and use it to create your submission.