<a href="https://colab.research.google.com/github/Vishu-Gupta/MLProjects/blob/main/01%20Kaggle%20Projects/06%20Forecast_Traffic_Flow_TPS_Mar'22/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a Kaggle competition, part of the Tabular Playground Series (TPS) for March 2022.

Link: https://www.kaggle.com/c/tabular-playground-series-mar-2022

## 01 Linking with Kaggle and getting the datasets

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [2]:
!mkdir -p ~/.kaggle  # -p: no error if the directory already exists


In [3]:
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
! chmod 600 ~/.kaggle/kaggle.json
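The credential setup above can also be done from Python so that the cell is safe to rerun. The sketch below is only an illustration: `install_kaggle_credentials` is a hypothetical helper, and temporary paths stand in for the Drive folder and `~/.kaggle`.

```python
import os
import shutil
import stat
import tempfile

def install_kaggle_credentials(src_path, dest_dir):
    """Copy a Kaggle API token into dest_dir with owner-only permissions."""
    os.makedirs(dest_dir, exist_ok=True)  # no error if the directory exists
    dest_path = os.path.join(dest_dir, "kaggle.json")
    shutil.copyfile(src_path, dest_path)
    os.chmod(dest_path, 0o600)  # same effect as `chmod 600`
    return dest_path

# Demonstration with temporary paths (assumed stand-ins, not the real locations)
with tempfile.TemporaryDirectory() as tmp:
    fake_token = os.path.join(tmp, "kaggle.json")
    with open(fake_token, "w") as fh:
        fh.write('{"username": "placeholder", "key": "placeholder"}')
    target_dir = os.path.join(tmp, "dot_kaggle")
    installed = install_kaggle_credentials(fake_token, target_dir)
    installed_again = install_kaggle_credentials(fake_token, target_dir)  # rerun is safe
    mode = stat.S_IMODE(os.stat(installed).st_mode)
```

In Colab the call would presumably look like `install_kaggle_credentials('/content/drive/MyDrive/kaggle.json', os.path.expanduser('~/.kaggle'))`.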

In [4]:
! kaggle competitions download tabular-playground-series-mar-2022

test.csv: Skipping, found more recently modified local copy (use --force to force download)
sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
Downloading train.csv.zip to /content
100% 4.69M/4.69M [00:00<00:00, 17.9MB/s]



In [5]:
!unzip train.csv.zip

Archive:  train.csv.zip
replace train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: train.csv               


In [6]:
!rm train.csv.zip # removing the zip file

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [8]:
data = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

## 02 Data Description
train.csv - the training set, comprising measurements of traffic congestion across 65 roadways from April through September of 1991.

row_id - a unique identifier for this instance

time - the 20-minute period in which each measurement was taken

x - the east-west midpoint coordinate of the roadway

y - the north-south midpoint coordinate of the roadway

direction - the direction of travel of the roadway. EB indicates "eastbound" travel, for example, while SW indicates a "southwest" direction of travel.

congestion - congestion level for the roadway during each 20-minute period; the target. The congestion measurements have been normalized to the range 0 to 100.

test.csv - the test set; you will predict congestion at each 20-minute time step for roadways identified by a coordinate location and a direction of travel on the day of 1991-09-30.
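The description above can be turned into a quick sanity check. This is a minimal sketch, assuming the column names listed above; `check_training_frame` is a hypothetical helper, and the toy rows are the first three rows of `train.csv` shown later in the notebook.

```python
import pandas as pd

EXPECTED_COLUMNS = ["row_id", "time", "x", "y", "direction", "congestion"]

def check_training_frame(frame):
    """Return a list of problems found; an empty list means the frame looks sane."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    if not frame["row_id"].is_unique:
        problems.append("row_id is not unique")
    if not frame["congestion"].between(0, 100).all():
        problems.append("congestion outside 0-100")
    minutes = pd.to_datetime(frame["time"]).dt.minute
    if not minutes.isin([0, 20, 40]).all():
        problems.append("time not on a 20-minute boundary")
    return problems

sample = pd.DataFrame({
    "row_id": [0, 1, 2],
    "time": ["1991-04-01 00:00:00"] * 3,
    "x": [0, 0, 0],
    "y": [0, 0, 0],
    "direction": ["EB", "NB", "SB"],
    "congestion": [70, 49, 24],
})
sample_problems = check_training_frame(sample)

# A deliberately broken copy to show what a failure looks like
broken = sample.assign(congestion=[70, 49, 120])
broken_problems = check_training_frame(broken)
```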

## 03 Data Preparation and EDA

In [9]:
data.shape # size of training data

(848835, 6)

In [10]:
df_test.shape # size of test data

(2340, 5)

In [11]:
data.head(20)

Unnamed: 0,row_id,time,x,y,direction,congestion
0,0,1991-04-01 00:00:00,0,0,EB,70
1,1,1991-04-01 00:00:00,0,0,NB,49
2,2,1991-04-01 00:00:00,0,0,SB,24
3,3,1991-04-01 00:00:00,0,1,EB,18
4,4,1991-04-01 00:00:00,0,1,NB,60
5,5,1991-04-01 00:00:00,0,1,SB,58
6,6,1991-04-01 00:00:00,0,1,WB,26
7,7,1991-04-01 00:00:00,0,2,EB,31
8,8,1991-04-01 00:00:00,0,2,NB,49
9,9,1991-04-01 00:00:00,0,2,SB,46


In [12]:
data.nunique()

row_id        848835
time           13059
x                  3
y                  4
direction          8
congestion       101
dtype: int64

In [13]:
data.isnull().sum() # check for missing values

row_id        0
time          0
x             0
y             0
direction     0
congestion    0
dtype: int64
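`isnull()` only catches missing cell values; it does not reveal whole timestamps that are absent from the series. A possible follow-up check, sketched on toy data and assuming a regular 20-minute grid (`missing_timestamps` is a hypothetical helper):

```python
import pandas as pd

def missing_timestamps(times, freq="20min"):
    """Return the timestamps of a regular grid that do not occur in `times`."""
    observed = pd.DatetimeIndex(pd.to_datetime(times)).unique().sort_values()
    expected = pd.date_range(observed.min(), observed.max(), freq=freq)
    return expected.difference(observed)

# Toy series with the 00:40 slot left out on purpose
toy_times = ["1991-04-01 00:00:00", "1991-04-01 00:20:00", "1991-04-01 01:00:00"]
gaps = missing_timestamps(toy_times)
```

On the real data this would be called as `missing_timestamps(data['time'])`.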

In [14]:
# No missing values.
# Next, check the column data types.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 848835 entries, 0 to 848834
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   row_id      848835 non-null  int64 
 1   time        848835 non-null  object
 2   x           848835 non-null  int64 
 3   y           848835 non-null  int64 
 4   direction   848835 non-null  object
 5   congestion  848835 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 38.9+ MB


In [15]:
# `time` is stored as object (string); convert it to a datetime dtype
data['time'] = pd.to_datetime(data['time'])
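As an alternative to converting after loading, `pd.read_csv` can parse the column directly via `parse_dates`. A small sketch using an in-memory CSV (the two rows are copied from the head of `train.csv`):

```python
import io
import pandas as pd

csv_text = """row_id,time,x,y,direction,congestion
0,1991-04-01 00:00:00,0,0,EB,70
1,1991-04-01 00:00:00,0,0,NB,49
"""

# parse_dates converts the named column while reading
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])
time_is_datetime = pd.api.types.is_datetime64_any_dtype(toy["time"])
```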

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 848835 entries, 0 to 848834
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   row_id      848835 non-null  int64         
 1   time        848835 non-null  datetime64[ns]
 2   x           848835 non-null  int64         
 3   y           848835 non-null  int64         
 4   direction   848835 non-null  object        
 5   congestion  848835 non-null  int64         
dtypes: datetime64[ns](1), int64(4), object(1)
memory usage: 38.9+ MB


In [17]:
# The given features are the x and y coordinates and the direction of travel.
# Traffic also depends on the day of the week, the hour of the day and the
# 20-minute interval within the hour, so derive these from the timestamp.

data['day_of_week'] = data['time'].dt.day_of_week
data['hour'] = data['time'].dt.hour
data['minute_interval'] = data['time'].dt.minute
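Since the same three features are needed for the test set later, the derivation could be wrapped in a function. A minimal sketch; `add_time_features` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def add_time_features(frame, time_col="time"):
    """Return a copy of frame with day_of_week, hour and minute_interval columns."""
    out = frame.copy()
    stamps = pd.to_datetime(out[time_col])
    out["day_of_week"] = stamps.dt.dayofweek  # alias of day_of_week; Monday=0 ... Sunday=6
    out["hour"] = stamps.dt.hour
    out["minute_interval"] = stamps.dt.minute  # 0, 20 or 40 for 20-minute data
    return out

# 1991-04-01 was a Monday, 1991-04-06 a Saturday
toy = pd.DataFrame({"time": ["1991-04-01 00:20:00", "1991-04-06 13:40:00"]})
toy_features = add_time_features(toy)
```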

In [18]:
data.nunique()

row_id             848835
time                13059
x                       3
y                       4
direction               8
congestion            101
day_of_week             7
hour                   24
minute_interval         3
dtype: int64

In [19]:
# Keep hour as a numeric feature; one-hot encode all the others.
# drop_first=True drops the first level of each column to avoid redundant dummies.

categ_cols = ['x','y','direction','day_of_week','minute_interval']
df = pd.get_dummies(data,columns=categ_cols,drop_first=True)

In [20]:
df.head()

Unnamed: 0,row_id,time,congestion,hour,x_1,x_2,y_1,y_2,y_3,direction_NB,direction_NE,direction_NW,direction_SB,direction_SE,direction_SW,direction_WB,day_of_week_1,day_of_week_2,day_of_week_3,day_of_week_4,day_of_week_5,day_of_week_6,minute_interval_20,minute_interval_40
0,0,1991-04-01,70,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1991-04-01,49,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2,1991-04-01,24,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3,3,1991-04-01,18,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4,1991-04-01,60,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [21]:
features = list(df.columns)
features.remove('row_id')
features.remove('congestion')
features.remove('time')

In [22]:
len(features)

21

In [23]:
# apply the same transformations to the test data
df_test['time'] = pd.to_datetime(df_test['time'])
df_test['day_of_week'] = df_test['time'].dt.day_of_week
df_test['hour'] = df_test['time'].dt.hour
df_test['minute_interval'] = df_test['time'].dt.minute
df_test = pd.get_dummies(df_test,columns=categ_cols,drop_first=True)

In [24]:
df_test.shape

(2340, 17)

In [25]:
features

['hour',
 'x_1',
 'x_2',
 'y_1',
 'y_2',
 'y_3',
 'direction_NB',
 'direction_NE',
 'direction_NW',
 'direction_SB',
 'direction_SE',
 'direction_SW',
 'direction_WB',
 'day_of_week_1',
 'day_of_week_2',
 'day_of_week_3',
 'day_of_week_4',
 'day_of_week_5',
 'day_of_week_6',
 'minute_interval_20',
 'minute_interval_40']

In [33]:
# The test set covers a single day (1991-09-30, a Monday, i.e. day_of_week 0),
# so get_dummies with drop_first=True creates no day_of_week columns for it.
# Add them as zeros (Monday is the dropped baseline level) to match the training features.
for day in range(1, 7):
    df_test[f'day_of_week_{day}'] = 0
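A more general way to line up the encoded test columns with the training columns is `DataFrame.reindex` with `fill_value=0`. The sketch below uses toy frames. It relies on the same assumption as the cell above: the only level present in the test data is the one that `drop_first=True` dropped in training.

```python
import pandas as pd

train_toy = pd.DataFrame({"day_of_week": [0, 1, 2], "hour": [0, 1, 2]})
test_toy = pd.DataFrame({"day_of_week": [0, 0], "hour": [12, 13]})

train_enc = pd.get_dummies(train_toy, columns=["day_of_week"], drop_first=True)
# With a single level and drop_first=True, no day_of_week column survives
test_enc = pd.get_dummies(test_toy, columns=["day_of_week"], drop_first=True)

# Add any column the test frame lacks (filled with 0) and fix the column order
feature_cols = list(train_enc.columns)
test_aligned = test_enc.reindex(columns=feature_cols, fill_value=0)
```

If the test data contained a level other than the training baseline, this would silently encode it as the baseline, so the assumption is worth checking first.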

In [27]:
from sklearn.model_selection import train_test_split
df_train , df_validate = train_test_split(df,train_size=0.8,random_state=42)
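The random split above mixes past and future timestamps between the two sets. Because the test set lies after the training period, a chronological holdout may give a validation score closer to the leaderboard. This is an alternative sketch, not what the notebook does; `time_based_split` is a hypothetical helper shown on toy data.

```python
import pandas as pd

def time_based_split(frame, time_col="time", holdout_fraction=0.2):
    """Split so that every validation timestamp is later than every training one."""
    times = pd.Index(frame[time_col]).unique().sort_values()
    cutoff = times[int(round(len(times) * (1 - holdout_fraction)))]
    train_part = frame[frame[time_col] < cutoff]
    valid_part = frame[frame[time_col] >= cutoff]
    return train_part, valid_part

# Toy frame: 10 time steps, 2 roadways per step
steps = pd.date_range("1991-04-01", periods=10, freq="20min")
toy = pd.DataFrame({"time": steps.repeat(2), "congestion": range(20)})
toy_train, toy_valid = time_based_split(toy)
```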

## 04 Model building

In [28]:
import statsmodels.api as sm
model = sm.OLS(df_train['congestion'],sm.add_constant(df_train[features]))
results = model.fit()
results.summary()



0,1,2,3
Dep. Variable:,congestion,R-squared:,0.24
Model:,OLS,Adj. R-squared:,0.24
Method:,Least Squares,F-statistic:,10220.0
Date:,"Tue, 01 Mar 2022",Prob (F-statistic):,0.0
Time:,19:08:52,Log-Likelihood:,-2786400.0
No. Observations:,679068,AIC:,5573000.0
Df Residuals:,679046,BIC:,5573000.0
Df Model:,21,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,41.5297,0.084,492.121,0.000,41.364,41.695
hour,0.3545,0.003,138.165,0.000,0.349,0.359
x_1,7.2087,0.047,152.503,0.000,7.116,7.301
x_2,7.1951,0.047,152.849,0.000,7.103,7.287
y_1,-2.2911,0.055,-41.764,0.000,-2.399,-2.184
y_2,3.9788,0.053,74.738,0.000,3.874,4.083
y_3,-3.9003,0.053,-73.223,0.000,-4.005,-3.796
direction_NB,2.2778,0.058,38.953,0.000,2.163,2.392
direction_NE,-10.6695,0.069,-154.105,0.000,-10.805,-10.534

0,1,2,3
Omnibus:,2981.851,Durbin-Watson:,2.002
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3018.2
Skew:,0.162,Prob(JB):,0.0
Kurtosis:,2.96,Cond. No.,108.0


In [29]:
# Dropping the minute_interval dummies, as congestion shows no dependence on them

features.remove('minute_interval_20')
features.remove('minute_interval_40')

# Model 2
model = sm.OLS(df_train['congestion'],sm.add_constant(df_train[features]))
results = model.fit()
results.summary()



0,1,2,3
Dep. Variable:,congestion,R-squared:,0.24
Model:,OLS,Adj. R-squared:,0.24
Method:,Least Squares,F-statistic:,11300.0
Date:,"Tue, 01 Mar 2022",Prob (F-statistic):,0.0
Time:,19:08:54,Log-Likelihood:,-2786400.0
No. Observations:,679068,AIC:,5573000.0
Df Residuals:,679048,BIC:,5573000.0
Df Model:,19,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,41.5290,0.081,515.475,0.000,41.371,41.687
hour,0.3545,0.003,138.166,0.000,0.349,0.359
x_1,7.2087,0.047,152.503,0.000,7.116,7.301
x_2,7.1951,0.047,152.849,0.000,7.103,7.287
y_1,-2.2911,0.055,-41.764,0.000,-2.399,-2.184
y_2,3.9788,0.053,74.738,0.000,3.874,4.083
y_3,-3.9003,0.053,-73.223,0.000,-4.005,-3.796
direction_NB,2.2778,0.058,38.953,0.000,2.163,2.392
direction_NE,-10.6695,0.069,-154.105,0.000,-10.805,-10.534

0,1,2,3
Omnibus:,2981.857,Durbin-Watson:,2.002
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3018.206
Skew:,0.162,Prob(JB):,0.0
Kurtosis:,2.96,Cond. No.,108.0
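Both models report the same R-squared, which is what one expects when the dropped columns carry no signal. The sketch below reproduces that effect with plain NumPy least squares on synthetic data; the variables and coefficients are made up for illustration and are unrelated to the traffic data.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 5000
signal = rng.normal(size=n)           # feature that drives the target
irrelevant = rng.normal(size=n)       # feature unrelated to the target
target = 3.0 * signal + rng.normal(size=n)

r2_full = r_squared(np.column_stack([signal, irrelevant]), target)
r2_reduced = r_squared(signal.reshape(-1, 1), target)
```

Removing the unrelated column changes R-squared only marginally, mirroring the unchanged 0.24 above.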


In [30]:
# The linear model explains little of the variance (R-squared of 0.24); try a non-linear model
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state=42)
dt_base = dt.fit(df_train[features],df_train['congestion'])
df_train['dt_base_pred'] = dt_base.predict(df_train[features])
df_validate['dt_base_pred'] = dt_base.predict(df_validate[features])

In [31]:
from sklearn.metrics import mean_absolute_error
print('DT-Base : Training error is ',mean_absolute_error(df_train['congestion'],df_train['dt_base_pred']))
print('DT-Base : Validation error is ',mean_absolute_error(df_validate['congestion'],df_validate['dt_base_pred']))

DT-Base : Training error is  6.302402783719343
DT-Base : Validation error is  6.393968368193805
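The errors printed above are mean absolute errors (MAE), the average of |actual - predicted|. For scale, a hand-rolled version and two constant-prediction baselines are sketched below on the first five congestion values from `data.head()`. The constant predictors are illustrative only, not models used in this notebook.

```python
import numpy as np

def mean_absolute_error_np(y_true, y_pred):
    """MAE: average absolute difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

actual = np.array([70, 49, 24, 18, 60])  # first five congestion values in train.csv

# Constant predictors: the median minimizes MAE among constants
median_mae = mean_absolute_error_np(actual, np.full(len(actual), np.median(actual)))
mean_mae = mean_absolute_error_np(actual, np.full(len(actual), actual.mean()))
```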


In [34]:
df_test['dt_base_pred'] = dt_base.predict(df_test[features])

In [35]:
df_test[['row_id','dt_base_pred']].rename(columns={'dt_base_pred':'congestion'}).to_csv('Submission dt_base.csv',index=False)
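Before uploading, the submission frame can be built and checked in one place. A minimal sketch, assuming the expected header is `row_id,congestion` (as produced by the cell above) and that predictions should stay within the normalized 0 to 100 range; `build_submission` and the toy ids are hypothetical.

```python
import io
import numpy as np
import pandas as pd

def build_submission(row_ids, predictions):
    """Return a two-column submission frame with predictions clipped to [0, 100]."""
    row_ids = list(row_ids)
    preds = np.clip(np.asarray(predictions, dtype=float), 0, 100)
    if len(preds) != len(row_ids):
        raise ValueError("row_ids and predictions differ in length")
    return pd.DataFrame({"row_id": row_ids, "congestion": preds})

# Toy ids and predictions, including out-of-range values to show the clipping
toy_submission = build_submission([0, 1, 2], [48.5, -3.0, 104.2])

buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
```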

In [37]:
!kaggle competitions submit tabular-playground-series-mar-2022 -f 'Submission dt_base.csv' -m "Untuned Decision tree model"

100% 53.4k/53.4k [00:07<00:00, 7.10kB/s]
Successfully submitted to Tabular Playground Series - Mar 2022