# Metro Interstate Traffic Volume Data

#### Life cycle of Machine learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

### 1) Problem statement
-Nowadays, traffic is a major issue for everyone, and it is a source of stress for anyone
who has to deal with it on a daily basis. The growth of the population delays traffic and
makes it worse day by day. The settlement of modern civilization looks at it, but it is
unable to act in such a way as to protect people. We can watch traffic, collect data, and
anticipate the next and subsequent observations using a variety of approaches and
patterns. The observation agency then makes observations, which are then required out
and predictions are made. Being stuck in a cosmopolitan city's traffic is the most
common occurrence in one's life.
The goal of this project is to build a prediction model using multiple machine learning
techniques and to use a template to document the end-to-end stages. We're trying to
forecast the value of a continuous variable with the Metro Interstate Traffic Volume
dataset, which is a regression issue.


### 2) Data Collection
- Dataset Source - https://www.kaggle.com
- The data consists of 9 column and 48204 rows.

### 2.1 Import Data and Required Packages
####  Importing Pandas, Numpy, Matplotlib, Seaborn and Warings Library.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns



import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## Data PreProcessing

In [3]:
data = pd.read_csv(r"F:\study material\Data Science\modular coding assignment\metro interstate traffic volume prediction\notebooks\data\data.csv")
data

Unnamed: 0.1,Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
0,0,no holiday,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545
1,1,no holiday,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516
2,2,no holiday,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767
3,3,no holiday,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026
4,4,no holiday,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918
...,...,...,...,...,...,...,...,...,...,...
48199,48199,no holiday,283.45,0.0,0.0,75,Clouds,broken clouds,2018-09-30 19:00:00,3543
48200,48200,no holiday,282.76,0.0,0.0,90,Clouds,overcast clouds,2018-09-30 20:00:00,2781
48201,48201,no holiday,282.73,0.0,0.0,90,Thunderstorm,proximity thunderstorm,2018-09-30 21:00:00,2159
48202,48202,no holiday,282.09,0.0,0.0,90,Clouds,overcast clouds,2018-09-30 22:00:00,1450


In [4]:
data['holiday'].unique()

array(['no holiday', 'yes holiday'], dtype=object)

In [5]:
data.drop("Unnamed: 0",axis=1,inplace=True)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   holiday              48204 non-null  object 
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64  
 5   weather_main         48204 non-null  object 
 6   weather_description  48204 non-null  object 
 7   date_time            48204 non-null  object 
 8   traffic_volume       48204 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB


In [7]:
data.isnull().sum()

holiday                0
temp                   0
rain_1h                0
snow_1h                0
clouds_all             0
weather_main           0
weather_description    0
date_time              0
traffic_volume         0
dtype: int64

In [8]:
data.describe()

Unnamed: 0,temp,rain_1h,snow_1h,clouds_all,traffic_volume
count,48204.0,48204.0,48204.0,48204.0,48204.0
mean,281.20587,0.334264,0.000222,49.362231,3259.818355
std,13.338232,44.789133,0.008168,39.01575,1986.86067
min,0.0,0.0,0.0,0.0,0.0
25%,272.16,0.0,0.0,1.0,1193.0
50%,282.45,0.0,0.0,64.0,3380.0
75%,291.806,0.0,0.0,90.0,4933.0
max,310.07,9831.3,0.51,100.0,7280.0


# Feature Engineering and model Training

In [9]:
data.sort_values(by=['date_time'],inplace=True)
# sorting values according to time

In [10]:
data['date_time']=pd.to_datetime(data['date_time'])


In [11]:
data['weekday'] = data.date_time.dt.weekday
data['hour'] = data.date_time.dt.hour
data['month'] = data.date_time.dt.month
data['year'] = data.date_time.dt.year


In [12]:
def remove_outlier(df,x):
    Q3,Q1 = np.percentile(df,[75,25])
    IQR = Q3 - Q1
    # Upper bound
    upper = np.where(df >= (Q3+1.5*IQR))
    # Lower bound
    lower = np.where(df <= (Q1-1.5*IQR))
 
    #Removing the Outliers
    x.drop(upper[0], inplace = True)
    x.drop(lower[0], inplace = True)
    
remove_outlier(data['temp'],data)

In [13]:
def hour_modify(x):
    Early_Morning = [4,5,6,7]
    Morning = [8,9,10,11]
    Afternoon = [12,13,14,15]
    Evening = [16,17,18,19]
    Night = [20,21,22,23]
    Late_Night = [24,1,2,3]
    if x in Early_Morning:
        return 'Early Morning'
    elif x in Morning:
        return 'Morning'
    elif x in Afternoon:
        return 'Afternoon'
    elif x in Evening:
        return 'Evening'
    elif x in Night:
        return 'Night'
    else:
        return 'Late Night'
    
data['hour'] = data.hour.apply(hour_modify)

In [14]:
data[['month','weekday']] = data[['month','weekday']] .astype('object')


In [15]:
data.drop(['rain_1h','snow_1h'],axis=1,inplace=True)

In [16]:
data.drop(['weather_description','year'], axis=1, inplace=True)


In [17]:
data.set_index('date_time',inplace=True)


In [18]:
# data.to_csv("cleaned.csv")

In [19]:
import numpy as np
import pandas as pd
import seaborn as sns



import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [20]:
# data = pd.read_csv(r"F:\study material\Data Science\modular coding assignment\metro interstate traffic volume prediction\notebooks\data\cleaned.csv")

In [21]:
# data.set_index('date_time',inplace=True)

In [22]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48194 entries, 2012-10-02 09:00:00 to 2018-09-30 23:00:00
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   holiday         48194 non-null  object 
 1   temp            48194 non-null  float64
 2   clouds_all      48194 non-null  int64  
 3   weather_main    48194 non-null  object 
 4   traffic_volume  48194 non-null  int64  
 5   weekday         48194 non-null  object 
 6   hour            48194 non-null  object 
 7   month           48194 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 3.3+ MB


In [23]:
pip install scikit-learn


Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip available: 22.2.2 -> 23.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [24]:
# Basic Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
# Modelling
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge,Lasso
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import warnings

In [25]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [26]:

X = data.drop('traffic_volume',axis=1)
y = data["traffic_volume"]

In [27]:
categorical_cols = X.select_dtypes(include="object").columns
numerical_cols = X.select_dtypes(exclude="object").columns

In [28]:
numerical_cols

Index(['temp', 'clouds_all'], dtype='object')

In [34]:
weekday = data['weekday'].unique()
hour = data['hour'].unique()
month = data['month'].unique()
weather_main = data['weather_main'].unique()
holiday = data['holiday'].unique()


array([1, 2, 3, 4, 5, 6, 0], dtype=object)

In [30]:
num_pipeline = Pipeline(
    steps=[
        ('scaler' , StandardScaler())
    ]
)

cat_pipeline = Pipeline(
    steps=[
        ('Ordinalencoder' , OrdinalEncoder(categories=[holiday,weather_main,weekday,hour,month])),
        ('scaler' , StandardScaler())
    ]
)


preprocessor = ColumnTransformer([
    ('num_pipeline' , num_pipeline , numerical_cols),    
    ('cat_pipeline' , cat_pipeline , categorical_cols)    
    
    ])


In [132]:
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size=0.1,random_state=351)

In [133]:

X_train = pd.DataFrame(preprocessor.fit_transform(X_train),columns= preprocessor.get_feature_names_out())
X_test = pd.DataFrame(preprocessor.transform(X_test),columns= preprocessor.get_feature_names_out())

In [142]:
models = {
    'AdaBoostRegressor' : AdaBoostRegressor(),
    'RandomForestRegressor' : RandomForestRegressor(),
    'KNeighborsRegressor' : KNeighborsRegressor(),
    'DecisionTreeRegressor' : DecisionTreeRegressor(),
    'LinearRegression' : LinearRegression(),
    'Lasso' : Lasso(),
    'Ridge' :  Ridge()   
    }

7

In [149]:
report = {}
def evaluate_models(models):
    for i in range(len(models)):
        model = list(models.values())[i]
        model.fit(X_train,y_train)
        y_pred = model.predict(X_test)
        score = r2_score(y_pred,y_test)
        report[model] = score
    return report

In [151]:
evaluate_models(models)
#  as we can see that random forest is best for our model so we will choose that one

{AdaBoostRegressor(): 0.46554208282787624,
 RandomForestRegressor(): 0.6779059972597472,
 KNeighborsRegressor(): 0.5888454532730667,
 DecisionTreeRegressor(): 0.5659332382023572,
 LinearRegression(): -0.9244919596658667,
 Lasso(): -0.929659989559334,
 Ridge(): -0.9246029484180627}

In [152]:
model = RandomForestRegressor()

In [153]:
model.fit(X_train,y_train)

In [154]:
y_pred = model.predict(X_test)

In [155]:
r2_score(y_pred,y_test)

0.6778288869694096

In [156]:
scores = []
for i in range(10):
    X_train , X_test , y_train , y_test = train_test_split(X,y,test_size=0.30,random_state=i)
    
    X_train = pd.DataFrame(preprocessor.fit_transform(X_train),columns= preprocessor.get_feature_names_out())
    X_test = pd.DataFrame(preprocessor.transform(X_test),columns= preprocessor.get_feature_names_out())
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    scores.append(r2_score(y_test,y_pred))
    

In [157]:
np.argmax(scores)

8

In [158]:
scores[np.argmax(scores)]

0.7501246439601276

### RandomForestRegressor with random_state = 8 has the best accuracy for our model so we will go with this