# Hackital 2019: Smart City Hackathon @GWU
## Road Junction Traffic Prediction by Chang Feng

## 1. Data Retrieval and Understanding the Data

In [1]:
import numpy as np
import pandas as pd

df_traf = pd.read_csv("data.csv")
df_traf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48120 entries, 0 to 48119
Data columns (total 4 columns):
DateTime    48120 non-null object
Junction    48120 non-null int64
Vehicles    48120 non-null int64
ID          48120 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


In [2]:
# The first ten digits of ID are read as YYYYMMDDHH.
# Slice the string form of the whole column at once instead of applying row by row.
ids = df_traf["ID"].astype(str)

df_traf['year'] = ids.str[0:4]
df_traf['month'] = ids.str[4:6]
df_traf['day'] = ids.str[6:8]
df_traf['hour'] = ids.str[8:10]

df_traf = df_traf.drop(['DateTime','ID'],axis=1)
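
Since the time features come from slicing `ID` rather than from `DateTime`, it may be worth confirming that the two columns agree before `DateTime` is dropped. The following is a minimal sketch of such a check. The sample rows are hypothetical, and both the `DateTime` format (`YYYY-MM-DD HH:MM:SS`) and the `ID` layout (ten timestamp digits first) are assumptions, not something the data description confirms.

```python
import pandas as pd

# Hypothetical rows; the DateTime format and the ID layout are assumptions.
sample = pd.DataFrame({
    "DateTime": ["2015-11-01 00:00:00", "2017-06-30 23:00:00"],
    "ID": [20151101001, 20170630234],
})

ids = sample["ID"].astype(str)
parsed = pd.to_datetime(sample["DateTime"])

# Rebuild the YYYYMMDDHH prefix from the parsed timestamp and compare it with the ID.
prefix_from_datetime = parsed.dt.strftime("%Y%m%d%H")
ids_match = bool((ids.str[0:10] == prefix_from_datetime).all())

print(ids_match)
```

If the check fails on the real data, the slices used for `year`, `month`, `day` and `hour` would need to be revisited.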

In [19]:
df_traf.describe()

       Junction   Vehicles
count  48120.0    48120.0
mean   2.180549   22.791334
std    0.966955   20.750063
min    1.0        1.0
25%    1.0        9.0
50%    2.0        15.0
75%    3.0        29.0
max    4.0        180.0


In [3]:
from sklearn.model_selection import train_test_split

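# A fixed random_state makes the split repeatable (the scores below came from an unseeded split).
train_set, test_set = train_test_split(df_traf, test_size=0.2, random_state=42)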


## 2. Data Cleaning 

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


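# Treat every feature as categorical and one-hot encode it.
cat_attribs = ['Junction','year','month','day','hour']
full_pipeline = ColumnTransformer([
        # Note: scikit-learn 1.2+ renames `sparse` to `sparse_output`.
        ("cat", OneHotEncoder(sparse=False, categories='auto'), cat_attribs),
    ])

# Fit the encoder on the training split only.
X_train = full_pipeline.fit_transform(train_set)
y_train = train_set["Vehicles"].values

# Reuse the fitted encoder so the test columns line up with the training columns.
X_test = full_pipeline.transform(test_set)
y_test = test_set["Vehicles"].values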
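
Why the encoder should be fitted on the training split and only applied to the test split can be seen on a toy example. The sketch below uses hypothetical data in which the test split happens to lack one category. It calls `OneHotEncoder` directly, without the `ColumnTransformer`, to keep the example short.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): the test split happens to lack hour "02".
train = pd.DataFrame({"hour": ["00", "01", "02", "01"]})
test = pd.DataFrame({"hour": ["00", "01"]})

encoder = OneHotEncoder(categories="auto")
X_train_toy = encoder.fit_transform(train)

# Reusing the fitted encoder keeps the three training columns.
X_test_ok = encoder.transform(test)

# Refitting on the test split learns only two categories,
# so the columns no longer match what a model was trained on.
X_test_refit = OneHotEncoder(categories="auto").fit_transform(test)

print(X_train_toy.shape, X_test_ok.shape, X_test_refit.shape)
```

On the full dataset every category most likely appears in both splits, so the column counts would match either way, but applying the fitted encoder removes the dependence on that.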


## 3. Model Selection and Training

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error 

lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train)

lin_reg_pred = lin_reg.predict(X_test)
lin_rmse = np.sqrt(mean_squared_error(y_test, lin_reg_pred))

print(lin_rmse)

11.121696800001606


In [15]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)

tree_reg.fit(X_train, y_train)

tree_reg_pred = tree_reg.predict(X_test)
tree_rmse = np.sqrt(mean_squared_error(y_test, tree_reg_pred))
tree_rmse

6.828912058318615

In [11]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(n_estimators = 50, random_state = 42)
forest_reg.fit(X_train, y_train)

forest_reg_pred = forest_reg.predict(X_test)
forest_rmse = np.sqrt(mean_squared_error(y_test, forest_reg_pred))
print(forest_rmse)

5.907436167927504


## 4. Conclusion 

In this hackathon, I built a Random Forest model that predicts the number of vehicles at a road junction. The average number of vehicles at a junction at a given time is about 23, and the model reaches a Root Mean Squared Error (RMSE) of about 5.9 on the test set, compared with 6.8 for a decision tree and 11.1 for linear regression. I believe this accuracy is good enough to help set the length of traffic light phases, since each car takes only a short time to pass through the junction.
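
To put the errors in context, each RMSE can be expressed relative to the mean and the standard deviation of `Vehicles`. The sketch below only reuses figures printed earlier in this notebook. The standard deviation is taken over the full dataset, so treating it as the approximate score of a constant mean prediction assumes the test split resembles the full data.

```python
# Figures copied from the outputs earlier in this notebook.
mean_vehicles = 22.791334
std_vehicles = 20.750063
rmse = {
    "linear regression": 11.121696800001606,
    "decision tree": 6.828912058318615,
    "random forest": 5.907436167927504,
}

# For each model: (RMSE / mean, RMSE / std).
relative = {
    name: (value / mean_vehicles, value / std_vehicles)
    for name, value in rmse.items()
}

for name, (vs_mean, vs_std) in relative.items():
    print(f"{name}: {vs_mean:.2f} of the mean, {vs_std:.2f} of the std")
```

A ratio to the standard deviation well below 1 suggests the model explains much of the variation in traffic, although whether an error of this size is acceptable still depends on how the predictions are used.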