# IBM Data Science Professional Certificate

## Capstone Project

#### This is the Applied Data Science Capstone Project notebook created by <a href="https://www.linkedin.com/in/chin-hung-kwok-07927b152">Chin Hung Kwok</a>.

### Introduction/Business Problem

Background: <br>
Car accidents are caused by many different factors, such as weather conditions, time of day, light conditions and road conditions. Once an accident has happened, police and ambulance crews have to arrive as soon as possible to handle the case. Without any prediction, they may not be prepared for what they find. Therefore, I would like to create a map that fetches real-time traffic and weather conditions and shows the predicted severity of car accidents that might happen. Police and ambulance services could use this map to prepare for potential accidents.

_Given the day of the week, time, weather, light and road conditions, predict accident severity within the operating geographic area of a police force and ambulance service._

### Data Source

##### UK Accidents 10 years history with many variables
<br>
link: <a href='https://www.kaggle.com/benoit72/uk-accidents-10-years-history-with-many-variables'>https://www.kaggle.com/benoit72/uk-accidents-10-years-history-with-many-variables</a>

The data source consists of three CSV files, namely Accidents0514, Casualties0514, and Vehicles0514, which contain the data for the accidents, casualties, and vehicles, respectively.

In this project, the data files "Accidents0514" and "Vehicles0514" are mainly used.

The data includes 32 variables, such as the accident index, the longitude and latitude, speed limit, etc. The independent variables chosen for this project are:

1. Day of week
2. Time (hour)
3. Light Condition
4. Weather Conditions
5. Road conditions
6. Number of Vehicles
7. Sex of Driver
8. Age band of Driver
9. 1st point of impact

### Load Data

In [None]:
import pandas as pd
import numpy as np

In [None]:
df_acc = pd.read_csv('data/accidents.csv')
df_veh = pd.read_csv('data/vehicles.csv')

In [None]:
df_agg = pd.merge(df_acc,df_veh,how='inner',on='Accident_Index')
df_agg.head()

In [None]:
df_agg.shape

In [None]:
df_agg['Accident_Severity'].value_counts()

### Data Visualisation/Exploration

1. Day of week
2. Time (hour)
3. Light Condition
4. Weather Conditions
5. Road conditions
6. Number of Vehicles
7. Sex of Driver
8. Age band of Driver
9. 1st point of impact

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap

In [None]:
ax = df_agg['Accident_Severity'].value_counts().plot(kind='bar',
                                                     color=['forestgreen', 'darkorange', 'mediumblue'],
                                                     figsize=(10, 6))
ax.set_xticklabels(['Slight','Serious','Fatal'])
ax.set_title('Distribution of the Car Accident Severity')

In [None]:
df_acc_dow = pd.pivot_table(df_agg, values='Accident_Index', \
                            index=['Accident_Severity'], \
                            columns=['Day_of_Week'], \
                            aggfunc=np.count_nonzero).transpose()
ax = df_acc_dow.plot(kind='bar', stacked=True)
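
The pivot tables in this section count accidents per category and severity. A minimal sketch of the same tally with `pd.crosstab`, using a handful of made-up rows rather than the real data (the toy values are invented for illustration only):

```python
import pandas as pd

# Toy rows: severity codes (1 = fatal, 2 = serious, 3 = slight) and day-of-week codes.
toy = pd.DataFrame({
    'Accident_Severity': [3, 3, 2, 3, 1, 2],
    'Day_of_Week':       [1, 1, 1, 2, 2, 2],
})

# Rows are days, columns are severities, cells are record counts.
# Combinations that never occur are filled with 0.
counts = pd.crosstab(toy['Day_of_Week'], toy['Accident_Severity'])
print(counts)

# counts.plot(kind='bar', stacked=True) would draw the same kind of stacked bar chart.
```

`pd.crosstab` returns the days as rows directly, so no `transpose()` is needed.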

In [None]:
df_agg['Time'] = pd.to_datetime(df_agg['Time']).dt.hour
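
A small sketch of what this conversion does, assuming the `Time` column holds strings such as `'17:42'` (the sample values below are made up). It also shows why the conversion should run only once: applied to hours that are already integers, `pd.to_datetime` reads them as nanoseconds since the epoch.

```python
import pandas as pd

# Hypothetical raw values; a missing time stays missing after parsing.
times = pd.Series(['17:42', '09:05', None])
hours = pd.to_datetime(times, format='%H:%M').dt.hour
print(hours.tolist())

# Pitfall: converting integer hours a second time yields hour 0 for every row,
# because the integers are interpreted as nanoseconds after 1970-01-01 00:00.
already_hours = pd.Series([17, 9])
reparsed = pd.to_datetime(already_hours).dt.hour
print(reparsed.tolist())
```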

In [None]:
df_acc_time = pd.pivot_table(df_agg, values='Accident_Index', \
                            index=['Accident_Severity'], \
                            columns=['Time'], \
                            aggfunc=np.count_nonzero).transpose()

df_acc_time.plot(kind='bar',stacked=True)

In [None]:
df_acc_light = pd.pivot_table(df_agg, values='Accident_Index', \
                            index=['Accident_Severity'], \
                            columns=['Light_Conditions'], \
                            aggfunc=np.count_nonzero).transpose()
ax = df_acc_light.plot(kind='bar',stacked=True)

light_lookup = pd.read_excel("data/metadata.xls",sheet_name="Light Conditions")
xticks = light_lookup['label'].values[0:len(light_lookup['code'].values)-1]
ax.set_xticklabels(xticks)

In [None]:
df_agg.drop(df_agg[df_agg['Weather_Conditions'] == -1].index)['Weather_Conditions'].value_counts()

In [None]:
df_acc_weather = pd.pivot_table(df_agg.drop(df_agg[df_agg['Weather_Conditions'] == -1].index), values='Accident_Index', \
                            index=['Accident_Severity'], \
                            columns=['Weather_Conditions'], \
                            aggfunc=np.count_nonzero).transpose()

ax = df_acc_weather.plot(kind='bar',stacked=True)

weather_lookup = pd.read_excel("data/metadata.xls",sheet_name="Weather")
xticks = np.insert(weather_lookup['label'].values[0:len(weather_lookup['code'].values)-1],0,'')
ax.set_xticklabels(xticks)
ax.minorticks_off()


In [None]:
df_acc_sl = pd.pivot_table(df_acc, values='Accident_Index', \
                            index=['Accident_Severity'], \
                            columns=['Speed_limit'], \
                            aggfunc=np.count_nonzero).transpose()

df_acc_sl.plot(kind='bar',stacked=True)

In [None]:
ax = pd.pivot_table(df_agg.drop(df_agg[(df_agg['Sex_of_Driver'] == -1) | (df_agg['Sex_of_Driver'] == 3)].index), values='Accident_Index',
                index=['Accident_Severity'],
                columns=['Sex_of_Driver'],
                aggfunc=np.count_nonzero).transpose().plot(kind='bar',stacked=True)

sex_lookup = pd.read_excel("data/metadata.xls",sheet_name="Sex of Driver")
xticks = sex_lookup['label'].values[0:len(sex_lookup['code'].values)-2]
ax.set_xticklabels(xticks)


In [None]:
ax = pd.pivot_table(df_agg.drop(df_agg[df_agg['1st_Point_of_Impact'] == -1].index), values='Accident_Index', \
                index=['Accident_Severity'], \
                columns=['1st_Point_of_Impact'], \
                aggfunc=np.count_nonzero).transpose().plot(kind='bar',stacked=True)

impact_lookup = pd.read_excel("data/metadata.xls",sheet_name="1st Point of Impact")
xticks = impact_lookup['label'].values[0:len(impact_lookup['code'].values)-1]
ax.set_xticklabels(xticks)

In [None]:
ax = pd.pivot_table(df_agg.drop(df_agg[df_agg['Age_Band_of_Driver'] == -1].index), values='Accident_Index', \
                index=['Accident_Severity'], \
                columns=['Age_Band_of_Driver'], \
                aggfunc=np.count_nonzero).transpose().plot(kind='bar',stacked=True)

age_lookup = pd.read_excel("data/metadata.xls",sheet_name="Age Band")
xticks = age_lookup['label'].values[0:len(age_lookup['code'].values)-1]
ax.set_title('Distribution of Age band of Driver')
ax.set_xlabel('Age band of Driver')
ax.set_xticklabels(xticks)

In [None]:
# !conda install -c conda-forge folium=0.5.0 --yes


In [None]:
#Preprocessing
df_agg['Latitude'] = df_agg['Latitude'].astype(float)
df_agg['Longitude'] = df_agg['Longitude'].astype(float)
heat_df = df_agg[df_agg['Accident_Severity'] == 1].loc[:,['Latitude', 'Longitude']]
heat_df = heat_df.dropna(axis=0, subset=['Latitude','Longitude'])
heat_data = heat_df.sample(len(heat_df)).values

#Heatmap
m = folium.Map(location=[54.251186, -4.463196],width=800,height=800, min_zoom=5, max_zoom=18, zoom_start=6, min_lat=48, max_lat=60, min_lon=-13, max_lon=4)
HeatMap(heat_data.tolist(),radius=9.5).add_to(m)
m
                                

In [None]:
#Preprocessing
df_agg['Latitude'] = df_agg['Latitude'].astype(float)
df_agg['Longitude'] = df_agg['Longitude'].astype(float)
heat_df = df_agg[df_agg['Accident_Severity'] == 2].loc[:,['Latitude', 'Longitude']]
heat_df = heat_df.dropna(axis=0, subset=['Latitude','Longitude'])
heat_data = heat_df.sample(len(heat_df)).values

#Heatmap
m = folium.Map(location=[54.251186, -4.463196],width=800,height=800, min_zoom=5, max_zoom=18, zoom_start=6, min_lat=48, max_lat=60, min_lon=-13, max_lon=4)
HeatMap(heat_data.tolist(),radius=9.5).add_to(m)
m
                                

In [None]:
#Preprocessing
df_agg['Latitude'] = df_agg['Latitude'].astype(float)
df_agg['Longitude'] = df_agg['Longitude'].astype(float)
heat_df = df_agg[df_agg['Accident_Severity'] == 3].loc[:,['Latitude', 'Longitude']]
heat_df = heat_df.dropna(axis=0, subset=['Latitude','Longitude'])
heat_data = heat_df.sample(len(heat_df)).values

#Heatmap
m = folium.Map(location=[54.251186, -4.463196],width=800,height=800, min_zoom=5, max_zoom=18, zoom_start=6, min_lat=48, max_lat=60, min_lon=-13, max_lon=4)
HeatMap(heat_data.tolist(),radius=9.5).add_to(m)
m
                                

### Data preprocessing

Independent Variables:

1. Day of week
2. Time (hour)
3. Light Condition
4. Weather Conditions
5. Road conditions
6. Number of Vehicles
7. Number of Casualties
8. Sex of Driver
9. Age band of Driver
10. 1st point of impact
11. Road Type
12. Hit Object in Carriageway
13. Hit Object off Carriageway
14. Vehicle Leaving Carriageway
15. Special Conditions at Site
16. Skidding and Overturning

In [None]:
df = df_agg.loc[:,['Accident_Severity','Day_of_Week','Time','Light_Conditions','Weather_Conditions','Road_Surface_Conditions','Number_of_Vehicles'\
                      ,'Number_of_Casualties','Sex_of_Driver','Age_Band_of_Driver','1st_Point_of_Impact','Road_Type','Hit_Object_in_Carriageway','Hit_Object_off_Carriageway'\
                      ,'Vehicle_Leaving_Carriageway','Special_Conditions_at_Site','Skidding_and_Overturning']].copy()

In [None]:
df.head()

In [None]:
df.shape

In [None]:
# df.drop(df[df['Accident_Severity'] == 3].sample(frac=0.95,replace=False,random_state=1).index,inplace=True)

In [None]:
# df['Accident_Severity'].value_counts()

In [None]:
# df.drop(df[df['Accident_Severity'] == 2].sample(frac=0.8,replace=False,random_state=1).index,inplace=True)

In [None]:
# df['Accident_Severity'].value_counts()
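
The commented-out cells above describe random undersampling of the two larger classes. A minimal sketch of that idea on a toy frame follows; the helper name `undersample` and the class sizes are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

# Toy frame: 100 slight (3), 20 serious (2) and 5 fatal (1) records.
toy = pd.DataFrame({'Accident_Severity': [3] * 100 + [2] * 20 + [1] * 5})

def undersample(frame, label, frac_to_drop, seed=1):
    """Return a copy of `frame` with a random fraction of one class removed."""
    rows = frame[frame['Accident_Severity'] == label]
    drop_idx = rows.sample(frac=frac_to_drop, replace=False, random_state=seed).index
    return frame.drop(drop_idx)

# Same fractions as in the commented-out cells: drop 95% of class 3, 80% of class 2.
balanced = undersample(toy, label=3, frac_to_drop=0.95)
balanced = undersample(balanced, label=2, frac_to_drop=0.8)

counts = balanced['Accident_Severity'].value_counts().to_dict()
print(counts)
```

Returning a new frame instead of using `inplace=True` keeps the original data available for comparison.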

In [None]:
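# 'Time' is already the hour of day (converted on df_agg above); re-parsing it would read the hours as nanoseconds and return 0 for every row.
df['Time'].head()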

In [None]:
df.dropna(inplace=True)

In [None]:
df.shape

#### Clean Missing Data for each column

##### Weather Conditions

In [None]:
df['Weather_Conditions'].value_counts()

In [None]:
df = df.drop(df[df['Weather_Conditions'] == -1].index)

In [None]:
df['Weather_Conditions'].value_counts()

##### Road Surface Conditions

In [None]:
df['Road_Surface_Conditions'].value_counts()

In [None]:
df = df.drop(df[df['Road_Surface_Conditions'] == -1].index)
df['Road_Surface_Conditions'].value_counts()

##### Sex of Driver

In [None]:
df['Sex_of_Driver'].value_counts()

In [None]:
df = df.drop(df[(df['Sex_of_Driver'] == -1) | (df['Sex_of_Driver'] == 3)].index)
df['Sex_of_Driver'].value_counts()

##### Age band of Driver

In [None]:
df['Age_Band_of_Driver'].value_counts()

In [None]:
df = df.drop(df[(df['Age_Band_of_Driver'] == -1)].index)
df['Age_Band_of_Driver'].value_counts()

##### 1st Point of Impact

In [None]:
df['1st_Point_of_Impact'].value_counts()

In [None]:
df = df.drop(df[(df['1st_Point_of_Impact'] == -1)].index)
df['1st_Point_of_Impact'].value_counts()

##### Hit Object in Carriageway

In [None]:
df['Hit_Object_in_Carriageway'].value_counts()

In [None]:
df = df.drop(df[(df['Hit_Object_in_Carriageway'] == -1)].index)
df['Hit_Object_in_Carriageway'].value_counts()

##### Hit Object off Carriageway

In [None]:
df['Hit_Object_off_Carriageway'].value_counts()

In [None]:
df = df.drop(df[(df['Hit_Object_off_Carriageway'] == -1)].index)
df['Hit_Object_off_Carriageway'].value_counts()

##### Vehicle Leaving Carriageway

In [None]:
df['Vehicle_Leaving_Carriageway'].value_counts()

In [None]:
df = df.drop(df[(df['Vehicle_Leaving_Carriageway'] == -1)].index)
df['Vehicle_Leaving_Carriageway'].value_counts()

##### Special Conditions at Site

In [None]:
df['Special_Conditions_at_Site'].value_counts()

In [None]:
df = df.drop(df[(df['Special_Conditions_at_Site'] == -1)].index)
df['Special_Conditions_at_Site'].value_counts()

##### Skidding and Overturning

In [None]:
df['Skidding_and_Overturning'].value_counts()

In [None]:
df = df.drop(df[(df['Skidding_and_Overturning'] == -1)].index)
df['Skidding_and_Overturning'].value_counts()
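
The cells above repeat one pattern: drop every row whose code is -1 (missing), plus code 3 for the sex of the driver. A compact sketch of the same filter as a single boolean mask, shown on a made-up toy frame with only three of the columns:

```python
import pandas as pd

# Toy data: -1 marks a missing value; 3 in Sex_of_Driver is also excluded above.
toy = pd.DataFrame({
    'Weather_Conditions':      [1, -1, 2, 1],
    'Road_Surface_Conditions': [1, 1, -1, 2],
    'Sex_of_Driver':           [1, 2, 1, 3],
})

coded_cols = ['Weather_Conditions', 'Road_Surface_Conditions', 'Sex_of_Driver']

# Keep a row only if none of the coded columns is -1 and the driver's sex is not 3.
mask = (toy[coded_cols] != -1).all(axis=1) & (toy['Sex_of_Driver'] != 3)
cleaned = toy[mask]
print(cleaned)
```

The per-column cells have the advantage of showing `value_counts()` before and after each step; the mask is only shorter.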

#### Separate Features and Target

In [None]:
X = df.iloc[:,1:]

In [None]:
X[0:5]

In [None]:
y = df['Accident_Severity'].values

In [None]:
y[0:5]

### Normalise Data

In [None]:
from sklearn import preprocessing

In [None]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
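
The cell above fits the scaler on the full feature matrix. A common variation, not what this notebook does, is to fit the scaler on the training split only and reuse it for the test split, so that no test-set statistics reach the model. A sketch on random toy data (all names and values below are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy features and severity-like labels in {1, 2, 3}.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
y_toy = rng.integers(1, 4, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=4)

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # same parameters applied to the test rows

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```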

### Classification

The models used are:

1. K Nearest Neighbor (KNN)
2. Decision Tree
3. Support Vector Machine
4. Logistic Regression

The evaluation metrics are:

1. Jaccard score
2. F1-score
3. Log loss (for Logistic Regression only)
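
To make the three metrics concrete, here is a small sketch on made-up labels. It uses `jaccard_score` with `average='weighted'`, which is the current scikit-learn function; the notebook itself calls the older `jaccard_similarity_score`, so the numbers are not directly comparable. The labels and probabilities are invented for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score, log_loss

# Toy labels using the dataset's severity codes (1 = fatal, 2 = serious, 3 = slight).
y_true = np.array([3, 3, 3, 2, 2, 1])
y_pred = np.array([3, 3, 2, 2, 3, 1])

# Hypothetical predicted probabilities, one column per class in the order [1, 2, 3].
y_prob = np.array([
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
    [0.1, 0.5, 0.4],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.7, 0.2, 0.1],
])

# Per-class scores are averaged with weights equal to each class's support.
jaccard = jaccard_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')

# Log loss needs probabilities, which is why it applies to Logistic Regression only.
ll = log_loss(y_true, y_prob, labels=[1, 2, 3])

print(jaccard, f1, ll)
```

For Jaccard and F1, higher is better; for log loss, lower is better.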

#### Split the dataset into train set and test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 4)
print('Train Set: ', X_train.shape, y_train.shape)
print('Test Set: ', X_test.shape, y_test.shape)

In [None]:
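from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
try:
    from sklearn.metrics import jaccard_similarity_score  # scikit-learn < 0.23
except ImportError:
    # Removed in scikit-learn 0.23. For multiclass labels it was equivalent to accuracy_score.
    from sklearn.metrics import accuracy_score as jaccard_similarity_score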

In [None]:
test_results = {'Algorithm': [],'Jaccard': [],'F1-score': [],'LogLoss': []}

### KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))

for n in range(1,Ks):
    KNN_model = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    KNN_yhat = KNN_model.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test,KNN_yhat)
    std_acc[n-1] = np.std(KNN_yhat==y_test)/np.sqrt(KNN_yhat.shape[0])

print(mean_acc)
print("The best accuracy was with", mean_acc.max(), "with k =", mean_acc.argmax()+1) 


In [None]:
import matplotlib.pyplot as plt

plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks), mean_acc - 1 * std_acc, mean_acc + 1 * std_acc, alpha = 0.30)
plt.legend(('Accuracy', '+/- 1xstd'))
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

In [None]:
KNN_model = KNeighborsClassifier(n_neighbors=9).fit(X_train, y_train)

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

Tree_model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
Tree_model.fit(X_train,y_train)

In [None]:
DT_yhat = Tree_model.predict(X_test)

print('Decision Tree')
print('F1-score: ', f1_score(y_test, DT_yhat, average = 'weighted'))
print('Jaccard: ', jaccard_similarity_score(y_test, DT_yhat))

test_results['Algorithm'].append('Decision Tree')
test_results['Jaccard'].append(jaccard_similarity_score(y_test, DT_yhat))
test_results['F1-score'].append(f1_score(y_test, DT_yhat, average = 'weighted'))
test_results['LogLoss'].append('NA')

In [None]:
DT_yhat[0:5]

In [None]:
# !conda install -c conda-forge pydotplus -y
# !conda install -c conda-forge python-graphviz -y

In [None]:
from io import StringIO  # sklearn.externals.six was removed in scikit-learn 0.23
import pydotplus
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
from sklearn import tree
%matplotlib inline 

In [None]:
dot_data = StringIO()
filename = "severityTree.png"
featureNames = df.columns[1:]
targetNames = df["Accident_Severity"].unique().tolist()
out=tree.export_graphviz(Tree_model,
                         feature_names=featureNames,
                         out_file=dot_data,
                         class_names= ['1','2','3'],
                         filled=True,
                         special_characters=True,
                         rotate=False)  
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# jaccard_similarity_score is already imported with the other metrics above

In [None]:
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR_yhat = LR.predict(X_test)
LR_yhat_prob = LR.predict_proba(X_test)

print('Logistic Regression')
print('F1-score: ', f1_score(y_test, LR_yhat, average = 'weighted'))
print('Jaccard: ', jaccard_similarity_score(y_test, LR_yhat))

test_results['Algorithm'].append('LogisticRegression')
test_results['Jaccard'].append(jaccard_similarity_score(y_test, LR_yhat))
test_results['F1-score'].append(f1_score(y_test, LR_yhat, average = 'weighted'))
test_results['LogLoss'].append(log_loss(y_test, LR_yhat_prob))

### Results

In [None]:
pd.DataFrame(test_results)