# I-94 Interstate Highway Traffic Analysis & Visualization

The aim of this analysis and exploration is to look at what causes high traffic volumes on the I-94 Interstate highway.
<br> <br>
This data was recorded at a station located between Minneapolis and Saint Paul and covers westbound traffic only. The data and the results of this analysis should therefore not be generalized to all traffic on the I-94. For more information, please see the documentation from the data source linked below.
<br> <br>
Data Source: https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

In [None]:
import os 
for dirname, _, filenames in os.walk('/kaggle/input'): 
    for filename in filenames: 
        print(os.path.join(dirname, filename))

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

i94 = pd.read_csv('../input/metro-traffic-volume/Metro_Interstate_Traffic_Volume.csv')

i94.head()

In [None]:
i94.tail()

In [None]:
# checking columns in the dataset

i94.info()

In [None]:
# the date_time column has dtype 'object', so we convert it to a datetime type

i94['date_time'] = pd.to_datetime(i94['date_time'])

In [None]:
# checking if our column 'date_time' is type datetime64
i94.info()

# Exploration

In [None]:
i94['traffic_volume'].plot.hist()
plt.show()

In [None]:
#examining traffic_volume column

i94['traffic_volume'].describe()

Multiple factors influence traffic volume, and we can assume that time of day is one of them. The date_time column contains both the date and the time of day, so we will isolate the time of day and examine traffic volumes.

We will assume the following:
1) Day time - 7 am to 7 pm 
<br>
2) Night time - 7 pm to 7 am
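The boundaries of this split matter: 7 am belongs to day time and 7 pm belongs to night time. A minimal sketch of the same rule on a few made-up timestamps (the `times` series below is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical timestamps chosen to sit on the boundaries of the split
times = pd.Series(pd.to_datetime([
    "2016-05-02 06:00",
    "2016-05-02 07:00",
    "2016-05-02 18:00",
    "2016-05-02 19:00",
]))

hours = times.dt.hour

# Day time: hour 7 up to and including hour 18; everything else is night time
is_day = (hours >= 7) & (hours < 19)
is_night = ~is_day

print(pd.DataFrame({"time": times, "is_day": is_day, "is_night": is_night}))
```

Because night is defined as the complement of day, every row falls into exactly one of the two groups.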

# Time as a Factor

In [None]:
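# filter first, then copy, so later column assignments on 'day' do not raise SettingWithCopyWarning
day = i94[(i94['date_time'].dt.hour >= 7) & (i94['date_time'].dt.hour < 19)].copy()
day.describe()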

In [None]:
# night time is everything outside the 7 am to 7 pm window
night = i94[(i94['date_time'].dt.hour >= 19) | (i94['date_time'].dt.hour < 7)].copy()
night.describe()

Just by using the 'describe()' method, it's clear that day time has a much higher traffic volume than night. Let's visualize it to see it better.

In [None]:
plt.figure(figsize = (12,3.5))

# plotting day time 
plt.subplot(1,2,1)
plt.hist(day['traffic_volume'])
plt.ylim(0,8000)
plt.xlim(-100, 7500)
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.title('Traffic Volume: Day')

#plotting night time
plt.subplot(1,2,2)
plt.hist(night['traffic_volume'])
plt.ylim(0,8000)
plt.xlim(-100, 7500)
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.title('Traffic Volume: Night')


plt.show()

We can observe a few things from these plots:
- The day time plot is skewed to the left, so most day time values are high.
- The night time plot is skewed to the right, so most night time values are low.
- Day time traffic volume is much higher than night time traffic volume, which matches what the 'describe()' method showed earlier.
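The direction of skew can also be checked numerically rather than read off the histogram. Below is a small sketch using pandas' `Series.skew()`, where a negative value indicates left skew and a positive value indicates right skew. The two samples are made-up stand-ins for `day['traffic_volume']` and `night['traffic_volume']`, not values from the dataset:

```python
import pandas as pd

# Hypothetical toy samples: most "day" values are high, most "night" values are low
day_sample = pd.Series([1000, 3500, 4500, 4800, 5000, 5200, 5500, 6000])
night_sample = pd.Series([300, 400, 500, 600, 800, 1200, 2500, 5000])

day_skew = day_sample.skew()      # negative: long tail towards low values
night_skew = night_sample.skew()  # positive: long tail towards high values

print(f"day skew: {day_skew:.2f}, night skew: {night_skew:.2f}")
```

On the real data, the same call would be `day['traffic_volume'].skew()` and `night['traffic_volume'].skew()`.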

### What causes high traffic volumes during the day?

## Time of the day

In [None]:
# let's take a look at traffic volumes by month

day['month'] = day['date_time'].dt.month
# numeric_only=True skips the text columns, which cannot be averaged
by_month = day.groupby('month').mean(numeric_only=True)
by_month['traffic_volume']

In [None]:
# creating a line plot to show how traffic volume changes with time

by_month['traffic_volume'].plot.line()
plt.xlabel('Month of the Year')
plt.ylabel('Traffic Volume')
plt.title('Traffic Volume Throughout The Year')

plt.show()

As expected, the winter months (December - February) show dips in traffic volume. Interestingly, we also see a dip in July. Let's look at July for each year from 2013-2018 to see if we observe the same pattern.

In [None]:
day['year'] = day['date_time'].dt.year
only_july = day[day['month'] == 7]
# select the column before averaging so the text columns are never aggregated
only_july.groupby('year')['traffic_volume'].mean().plot.line()
plt.show()

It appears that the decrease in July traffic comes from the year 2016. Apparently this is due to construction on the I-94 highway in the summer of 2016.

Let's take a look at the average traffic volume for each day of the week.

In [None]:
day['dayofweek'] = day['date_time'].dt.dayofweek
# numeric_only=True skips the text columns, which cannot be averaged
by_dayofweek = day.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek['traffic_volume']

0 - Monday <br>
1 - Tuesday <br>
2 - Wednesday <br>
3 - Thursday <br>
4 - Friday <br>
5 - Saturday <br>
6 - Sunday <br>
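The mapping above can be reproduced directly from the timestamps, which avoids having to remember which number is which day. A short sketch using `dt.dayofweek` and `dt.day_name()` on a few made-up dates (the `dates` series is hypothetical; 2018-01-01 was a Monday):

```python
import pandas as pd

# Hypothetical dates: a Monday, a Friday, a Saturday and a Sunday
dates = pd.Series(pd.to_datetime([
    "2018-01-01", "2018-01-05", "2018-01-06", "2018-01-07",
]))

lookup = pd.DataFrame({
    "dayofweek": dates.dt.dayofweek,  # 0 = Monday ... 6 = Sunday
    "day_name": dates.dt.day_name(),  # English day names by default
})

print(lookup)
```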

In [None]:
#plotting average traffic per day

by_dayofweek['traffic_volume'].plot.line()
plt.xlabel('Day of Week')
plt.ylabel('Traffic Volume')
plt.title('Traffic Volume Per Day of Week')
plt.show()

Again as expected, Saturday and Sunday have lower traffic volumes than the rest of the week.

**We will now split our data into weekend days and business days for a fairer analysis**

In [None]:
day.head()

In [None]:
day['hour'] = day['date_time'].dt.hour

# dayofweek 5 and 6 are Saturday and Sunday; 0 to 4 are Monday to Friday
weekend_days = day[day['dayofweek'] >= 5].copy()
business_days = day[day['dayofweek'] <= 4].copy()


In [None]:
weekend_days.head()

In [None]:
business_days.head()

Let's group by hour of the day and examine both datasets.

In [None]:
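# numeric_only=True skips the text columns, which cannot be averaged
by_hour_weekend = weekend_days.groupby('hour').mean(numeric_only=True)
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)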


In [None]:
by_hour_weekend.head()

In [None]:
by_hour_business.head()

In [None]:
plt.figure(figsize = (12,4))

# Weekend Days
plt.subplot(1,2,1)
by_hour_weekend['traffic_volume'].plot.line()
plt.xlabel('Hour of the Day')
plt.ylabel('Traffic Volume')
plt.xlim(6,20)
plt.ylim(1500,6500)
plt.title('Average Traffic Volume Per Hour during Weekends')

#Business Days
plt.subplot(1,2,2)
by_hour_business['traffic_volume'].plot.line()
plt.xlabel('Hour of the Day')
plt.ylabel('Traffic Volume')
plt.xlim(6,20)
plt.ylim(1500,6500)
plt.title('Average Traffic Volume Per Hour during Business Days')

plt.show()


Given these plots, we can deduce the following: <br>
- On average, business days have higher traffic volumes than weekend days.
- On weekend days, traffic volume rises from early morning (7 am) and reaches a peak at midday (12 pm).
- On business days, traffic volume decreases from 6 - 8 am, which likely reflects the morning rush hour.
- We then see another increase from 1 - 4 pm, likely due to people coming back from work.
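Peak hours can also be read from the grouped data instead of the plot. The sketch below uses `idxmax()` on an hourly series. The `by_hour` values are made-up numbers in the same shape as `by_hour_business['traffic_volume']` and are not the dataset's averages:

```python
import pandas as pd

# Hypothetical hourly averages indexed by hour of day (7 to 18)
by_hour = pd.Series(
    [6000, 5500, 4900, 4400, 4600, 4900, 4900, 5200, 5600, 6200, 5800, 4400],
    index=range(7, 19),
    name="traffic_volume",
)

# .loc slices by label and includes both end points
morning = by_hour.loc[7:12]
afternoon = by_hour.loc[13:18]

morning_peak = morning.idxmax()      # hour with the highest morning average
afternoon_peak = afternoon.idxmax()  # hour with the highest afternoon average

print(f"morning peak hour: {morning_peak}, afternoon peak hour: {afternoon_peak}")
```

With the real data, `by_hour` would be replaced by `by_hour_business['traffic_volume']` or `by_hour_weekend['traffic_volume']`.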

# Weather as a Factor

Weather may play a big part in traffic volumes, so let's take a look. Since we're only looking at day time data, we will use the 'day' dataset we created earlier.

In [None]:
day.head()

In [None]:
# numeric_only=True restricts the correlation matrix to numeric columns
day.corr(numeric_only=True)

In [None]:
# let's visualize the correlation values using a heatmap

correlation = day.corr(numeric_only=True)

plt.figure(figsize = (12,6))

sns.heatmap(correlation, annot = True, linecolor='white',linewidths=0.1, cmap = 'BuGn')
plt.title('Correlation Matrix')

plt.show()

Among the weather columns, traffic volume correlates most strongly with temperature (r = 0.13). This is not a strong correlation, so we will instead look at the categorical variables 'weather_main' and 'weather_description'.
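Rather than scanning the heatmap, the correlations with a single column can be pulled out and ranked by absolute value. The sketch below does this on randomly generated toy data. The `toy` frame, its column values and the relationship between `temp` and `traffic_volume` are all invented for illustration, and the call assumes a pandas version where `DataFrame.corr` accepts `numeric_only`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical data with two numeric weather columns and one text column
toy = pd.DataFrame({
    "temp": rng.normal(280, 10, n),
    "rain_1h": rng.exponential(0.2, n),
    "weather_main": rng.choice(["Clear", "Clouds", "Rain"], n),
})
toy["traffic_volume"] = 4500 + 15 * toy["temp"] + rng.normal(0, 800, n)

corr_with_volume = (
    toy.corr(numeric_only=True)["traffic_volume"]  # text columns are skipped
    .drop("traffic_volume")                         # drop the self-correlation of 1.0
    .sort_values(key=abs, ascending=False)          # strongest relationship first
)

print(corr_with_volume)
```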

In [None]:
# grouping by weather; numeric_only=True skips the text columns
by_weather_main = day.groupby('weather_main').mean(numeric_only=True)

In [None]:
by_weather_main.head()

In [None]:
plt.figure(figsize = (12,7))

by_weather_main['traffic_volume'].plot.barh()
plt.xlabel('Traffic Volume')
plt.ylabel('Weather')
plt.title('Average Traffic Volumes for each Weather from 2012-2018 on I-94 Interstate Highway')

plt.show()

The trend is not easy to observe: most weather conditions are associated with an average traffic volume above 4000. Let's take a look at 'weather_description', since it's much more detailed.

In [None]:
# numeric_only=True skips the text columns, which cannot be averaged
by_weather_description = day.groupby('weather_description').mean(numeric_only=True)

In [None]:
by_weather_description.head()

In [None]:
plt.figure(figsize = (14,12))

by_weather_description['traffic_volume'].plot.barh()
plt.xlabel('Traffic Volume')
plt.ylabel('Weather Description')
plt.title('Average Traffic Volumes for each Weather description from 2012-2018 on I-94 Interstate Highway')
plt.axvline(x = 5000, color = 'Red')

plt.show()

We can see that three weather descriptions have an average traffic volume of over 5000:
1) Shower Snow <br>
2) Proximity thunderstorm with drizzle <br>
3) Light rain and snow <br>
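When reading a list like this, it may help to look at how many rows sit behind each average, since a mean based on very few observations can be unstable. The following is only a sketch of such a check, using a made-up `toy` frame with the same column names as the dataset. The values and the resulting counts are hypothetical and say nothing about the real data:

```python
import pandas as pd

# Hypothetical observations; one description appears only once
toy = pd.DataFrame({
    "weather_description": [
        "shower snow",
        "sky is clear", "sky is clear", "sky is clear",
        "light rain", "light rain",
    ],
    "traffic_volume": [5600, 4800, 4700, 4900, 4600, 4500],
})

# Mean traffic volume and number of rows for each description
summary = (
    toy.groupby("weather_description")["traffic_volume"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

# Descriptions whose average exceeds the 5000 threshold used in the plot
above = summary[summary["mean"] > 5000]

print(above)
```

On the real data, the same aggregation would be applied to `day` instead of `toy`.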

# Conclusion

The aim of this EDA was to look at what factors cause high traffic volumes on the I-94 Interstate highway. We have concluded that time (time of day, day of week, month) and weather (description) are the biggest contributors to high traffic volumes from 2012-2018 on the I-94 Interstate highway.
<br>
Below is a summary of my findings: <br>
- Traffic volumes are higher during the day.
- Traffic volumes are higher from March-October and lower from November-February (July was an exception).
- Business days show higher volumes than weekend days; rush hour appears to run from 7 to 16.
- Traffic volume shows a weak correlation with temperature (r = 0.13).
- 'weather_main' was also not a great indicator of traffic volumes.
- Three weather descriptions appear to correlate best with high traffic volumes (> 5000):
    - Shower Snow
    - Proximity thunderstorm with drizzle
    - Light rain and snow