# Westbound I-94 Traffic 

## Table of Contents

1. [**Introduction**](#1)
    - Project Description
    - Data Description
2. [**Acquiring and Loading Data**](#2)
    - Importing Libraries and Notebook Setup
    - Loading Data
    - Basic Data Exploration
    - Areas to Fix
3. [**Data Preprocessing**](#3)
4. [**Exploratory Data Analysis**](#4)
5. [**Conclusion**](#5)
    - Insights
    - Suggestions
    - Possible Next Steps
6. [**Epilogue**](#6) 
    - References
    - Versioning

---

# 1

## Introduction

![Minneapolis - St. Paul](mn-stpl.png)

### Project Description

**Goal/Purpose:** 

The goal of this project is to determine indicators of heavy traffic on I-94. 

<p>&nbsp;</p>

**Questions to be Answered:**

- How does weather impact traffic? 
- What are the seasonal impacts on traffic?
- What is the average impact on travel during heavy commute periods?

<p>&nbsp;</p>

**Assumptions/Methodology/Scope:** 

Briefly describe assumptions, processing steps, and the scope of this project.

<p>&nbsp;</p>

### Data Description

**Content:** 

This dataset is a CSV file of hourly westbound traffic volume on I-94 in the Minneapolis-St. Paul area. It spans 2012 to 2018 and includes weather and holiday information alongside the traffic counts.

<p>&nbsp;</p>

**Description of Attributes:** 

Each column of the dataset is described below.

| Column | Type | Description |
| :----- | :--- | :---------- |
| holiday | Categorical | US national holidays plus one regional holiday, the Minnesota State Fair |
| temp | Numeric | Average temperature in kelvin |
| rain_1h | Numeric | Rain that fell during the hour, in mm |
| snow_1h | Numeric | Snow that fell during the hour, in mm |
| clouds_all | Numeric | Cloud cover, in percent |
| weather_main | Categorical | Short textual description of the current weather |
| weather_description | Categorical | Longer textual description of the current weather |
| date_time | DateTime | Hour of the data collection, in local time (CST) |
| traffic_volume | Numeric | Hourly westbound traffic volume reported by I-94 ATR 301 |

<p>&nbsp;</p>

**Acknowledgements:** 

This dataset is provided by John Hogue (Social Data Science & General Mills). The original source is the [Metro Interstate Traffic Volume Data Set - UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume#).

---

# 2

## Acquiring and Loading Data
### Importing Libraries and Notebook Setup

In [114]:
# Ignore warnings if needed
import warnings
warnings.filterwarnings('ignore')

# Data manipulation
import datetime
import numpy as np
import pandas as pd
import pandas.api.types as ptypes

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Pandas settings
pd.options.display.max_columns = None
pd.options.display.max_colwidth = 60
pd.options.display.float_format = '{:,.3f}'.format

# Visualization settings
from matplotlib import rcParams
plt.style.use('dark_background')
rcParams['figure.figsize'] = (16, 5)   
rcParams['axes.spines.right'] = False
rcParams['axes.spines.top'] = False
rcParams['font.size'] = 12
# rcParams['figure.dpi'] = 300
rcParams['savefig.dpi'] = 300
plt.rc('xtick', labelsize=11)
plt.rc('ytick', labelsize=11)
custom_palette = ['#003f5c', '#444e86', '#955196', '#dd5182', '#ff6e54', '#ffa600']
custom_hue = ['#004c6d', '#346888', '#5886a5', '#7aa6c2', '#9dc6e0', '#c1e7ff']
custom_divergent = ['#00876c', '#6aaa96', '#aecdc2', '#f1f1f1', '#f0b8b8', '#e67f83', '#d43d51']
sns.set_palette(custom_palette)
%config InlineBackend.figure_format = 'retina'

### Loading Data

In [115]:
# Load DataFrame
file = 'metro_interstate_traffic_volume.csv'
traffic = pd.read_csv(file)

### Basic Data Exploration

#### Number of Rows and Columns

In [116]:
# Show rows and columns count
print(f"Rows count: {traffic.shape[0]}\nColumns count: {traffic.shape[1]}")

Rows count: 48204
Columns count: 9


#### Display First and Last Rows

In [117]:
# Look at first 5 rows
traffic.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545
1,,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026
4,,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918


In [118]:
# Look at last 5 rows
traffic.tail()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
48199,,283.45,0.0,0.0,75,Clouds,broken clouds,2018-09-30 19:00:00,3543
48200,,282.76,0.0,0.0,90,Clouds,overcast clouds,2018-09-30 20:00:00,2781
48201,,282.73,0.0,0.0,90,Thunderstorm,proximity thunderstorm,2018-09-30 21:00:00,2159
48202,,282.09,0.0,0.0,90,Clouds,overcast clouds,2018-09-30 22:00:00,1450
48203,,282.12,0.0,0.0,90,Clouds,overcast clouds,2018-09-30 23:00:00,954


#### Check Data Types

In [119]:
# Show data types
traffic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   holiday              61 non-null     object 
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64  
 5   weather_main         48204 non-null  object 
 6   weather_description  48204 non-null  object 
 7   date_time            48204 non-null  object 
 8   traffic_volume       48204 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB


- `holiday`, `weather_main`, `weather_description`, `date_time` are **strings**.
- `temp`, `rain_1h`, and `snow_1h` are **floats**.
- `clouds_all` and `traffic_volume` are **integers**.

#### Check Missing Data

In [120]:
# Print percentage of missing values
missing_percent = traffic.isna().mean().sort_values(ascending=False)
print('---- Percentage of Missing Values (%) -----')
if missing_percent.sum():
    print(missing_percent[missing_percent > 0] * 100)
else:
    print(None)

---- Percentage of Missing Values (%) -----
holiday   99.873
dtype: float64


#### Check for Duplicate Rows

In [121]:
# Show number of duplicated rows
print(f"No. of entirely duplicated rows: {traffic.duplicated().sum()}")

# Show duplicated rows
traffic[traffic.duplicated()]

No. of entirely duplicated rows: 17


Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
18697,,286.29,0.0,0.0,1,Clear,sky is clear,2015-09-30 19:00:00,3679
23851,,289.06,0.0,0.0,90,Clouds,overcast clouds,2016-06-01 10:00:00,4831
26784,,289.775,0.0,0.0,56,Clouds,broken clouds,2016-09-21 15:00:00,5365
26980,,287.86,0.0,0.0,0,Clear,Sky is Clear,2016-09-29 19:00:00,3435
27171,,279.287,0.0,0.0,56,Clouds,broken clouds,2016-10-07 18:00:00,4642
28879,,267.89,0.0,0.0,90,Snow,light snow,2016-12-06 18:00:00,4520
29268,,254.22,0.0,0.0,1,Clear,sky is clear,2016-12-19 00:00:00,420
34711,,295.01,0.0,0.0,40,Clouds,scattered clouds,2017-06-21 11:00:00,4808
34967,,292.84,0.0,0.0,1,Clear,sky is clear,2017-06-30 10:00:00,4638
34969,,294.52,0.0,0.0,1,Clear,sky is clear,2017-06-30 11:00:00,4725


#### Check Uniqueness of Data

In [122]:
# Print the number of unique values
num_unique = traffic.nunique().sort_values()
print('---- Number of Unique Values -----')
print(num_unique)

---- Number of Unique Values -----
holiday                   11
weather_main              11
snow_1h                   12
weather_description       38
clouds_all                60
rain_1h                  372
temp                    5843
traffic_volume          6704
date_time              40575
dtype: int64
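
The counts above show 40,575 unique `date_time` values across 48,204 rows, so some hours appear more than once, and far more often than the 17 fully duplicated rows account for. Whether that matters depends on the analysis. The following is a minimal sketch of how such repeated timestamps could be inspected; the `sample` frame and its values are made up for illustration and stand in for `traffic`.

```python
import pandas as pd

# Hypothetical stand-in for the traffic DataFrame (values are made up)
sample = pd.DataFrame({
    'date_time': ['2013-05-19 10:00:00', '2013-05-19 10:00:00', '2013-05-19 11:00:00'],
    'weather_description': ['light rain', 'mist', 'sky is clear'],
    'traffic_volume': [4000, 4000, 4200],
})

# Keep every row whose timestamp occurs more than once
repeated = sample[sample.duplicated(subset='date_time', keep=False)]

# Count how many rows share each repeated timestamp
rows_per_timestamp = repeated.groupby('date_time').size()
print(rows_per_timestamp)
```

In this made-up sample the repeated hour differs only in `weather_description`, which is one possible reason a timestamp could occur several times.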


#### Check Data Range

In [123]:
# Print summary statistics
traffic.describe(include='all')

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
count,61,48204.0,48204.0,48204.0,48204.0,48204,48204,48204,48204.0
unique,11,,,,,11,38,40575,
top,Labor Day,,,,,Clouds,sky is clear,2013-05-19 10:00:00,
freq,7,,,,,15164,11665,6,
mean,,281.206,0.334,0.0,49.362,,,,3259.818
std,,13.338,44.789,0.008,39.016,,,,1986.861
min,,0.0,0.0,0.0,0.0,,,,0.0
25%,,272.16,0.0,0.0,1.0,,,,1193.0
50%,,282.45,0.0,0.0,64.0,,,,3380.0
75%,,291.806,0.0,0.0,90.0,,,,4933.0


### Areas to Fix
**Data Types**
- `date_time` should be a **datetime** type instead.

**Missing Data**
- The `holiday` column is missing in 99.87% of rows. A missing value most likely indicates a regular day rather than a holiday.

**Duplicate Rows**
- 17 duplicate rows to be removed

**Uniqueness of Data**
- Data uniqueness does not pose a concern

**Data Range**
- `rain_1h` and `snow_1h` both have a fairly limited range of values. However, the maximum value of `rain_1h` appears to be an erroneous entry.

---

# 3

## Data Preprocessing

Here you can add sections like:

- Renaming columns
- Drop Redundant Columns
- Changing Data Types
- Dropping Duplicates
- Handling Missing Values
- Handling Unreasonable Data Ranges
- Feature Engineering / Transformation

Use `assert` where possible to show that preprocessing is done.

### Rename Columns

In [124]:
# Rename columns
columns_to_rename = {
    'holiday':'holiday',
    'temp':'temp_K',
    'rain_1h':'rain_mmph',
    'snow_1h':'snow_mmph',
    'clouds_all':'cloudcover',
    'weather_main':'weather_main',
    'weather_description':'weather_description',
    'date_time':'date_time',
    'traffic_volume':'traffic_volume'
}
traffic.rename(columns=columns_to_rename, inplace=True)

In [125]:
# Verify columns are renamed
traffic.columns

Index(['holiday', 'temp_K', 'rain_mmph', 'snow_mmph', 'cloudcover',
       'weather_main', 'weather_description', 'date_time', 'traffic_volume'],
      dtype='object')
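
With `temp` renamed to `temp_K`, the unit is explicit, but kelvin is awkward to read in plots. A minimal sketch of adding a Celsius column follows; the `sample` frame and the `temp_C` column name are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the traffic DataFrame (a few temperatures in kelvin)
sample = pd.DataFrame({'temp_K': [288.28, 273.15, 254.22]})

# Celsius = kelvin - 273.15
sample['temp_C'] = sample['temp_K'] - 273.15
print(sample)
```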

### Drop Redundant Columns

In [126]:
# Check the proportion of the most frequent value in each column
print('---- Frequency of the Mode (%) -----')
mode_dict = {col: (traffic[col].value_counts().iat[0] / traffic[col].size * 100) for col in traffic.columns}
mode_series = pd.Series(mode_dict)
mode_series

---- Frequency of the Mode (%) -----


holiday                0.015
temp_K                 0.266
rain_mmph             92.808
snow_mmph             99.869
cloudcover            34.109
weather_main          31.458
weather_description   24.199
date_time              0.012
traffic_volume         0.104
dtype: float64

In [127]:
# Show the value frequency of each column greater than the mode's threshold
threshold = 80
for col in mode_series[mode_series > threshold].index:
    print(traffic[col].value_counts(dropna=False))
    print()

rain_mmph
0.000    44737
0.250      948
0.510      256
1.020      123
0.300      121
         ...  
1.280        1
1.470        1
4.660        1
2.080        1
2.350        1
Name: count, Length: 372, dtype: int64

snow_mmph
0.000    48141
0.050       14
0.060       12
0.510        6
0.250        6
0.130        6
0.100        6
0.320        5
0.170        3
0.440        2
0.080        2
0.210        1
Name: count, dtype: int64



Although both columns have a limited range of values, both are highly important to the intent of this analysis. The `snow_mmph` column needs to be treated with particular caution because of its lack of variety: 99.87% of its values are zero.
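
Because most rain and snow values are zero, one option is to reduce each column to a yes/no indicator before comparing traffic volumes. The sketch below shows the idea on a made-up `sample` frame; the `has_rain` and `has_snow` column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the traffic DataFrame (values are made up)
sample = pd.DataFrame({
    'rain_mmph': [0.0, 0.25, 0.0, 1.02],
    'snow_mmph': [0.0, 0.0, 0.05, 0.0],
})

# True when any precipitation was recorded during the hour
sample['has_rain'] = sample['rain_mmph'] > 0
sample['has_snow'] = sample['snow_mmph'] > 0

# Share of hours with precipitation, in percent
print(sample[['has_rain', 'has_snow']].mean() * 100)
```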

### Changing Data Types

In [128]:
# Convert date_time from string to datetime
# (infer_datetime_format is deprecated in pandas 2.x and no longer needed)
traffic['date_time'] = pd.to_datetime(traffic['date_time'])

In [129]:
# Verify conversion
assert ptypes.is_datetime64_any_dtype(traffic['date_time']) 
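
Once `date_time` is a datetime column, calendar features for the seasonal and commute questions can be derived through the `.dt` accessor. The following is a minimal sketch on a made-up `sample` frame; the new column names (`hour`, `day_of_week`, `month`, `is_weekend`) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the converted date_time column
sample = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2012-10-02 09:00:00',
        '2016-07-11 17:00:00',
        '2018-09-30 23:00:00',
    ])
})

# Derive calendar features from the timestamp
sample['hour'] = sample['date_time'].dt.hour
sample['day_of_week'] = sample['date_time'].dt.dayofweek  # Monday=0, Sunday=6
sample['month'] = sample['date_time'].dt.month
sample['is_weekend'] = sample['day_of_week'] >= 5
print(sample)
```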

### Dropping Duplicates

In [130]:
# Drop entirely duplicated rows
traffic.drop_duplicates(inplace=True, ignore_index=True)

In [131]:
# Verify rows dropped
assert traffic.duplicated().sum()==0

### Handling Missing Values

In [132]:
# Label rows without a holiday as 'na' (assign back instead of a chained in-place fill)
traffic['holiday'] = traffic['holiday'].fillna('na')

In [133]:
traffic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48187 entries, 0 to 48186
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   holiday              48187 non-null  object        
 1   temp_K               48187 non-null  float64       
 2   rain_mmph            48187 non-null  float64       
 3   snow_mmph            48187 non-null  float64       
 4   cloudcover           48187 non-null  int64         
 5   weather_main         48187 non-null  object        
 6   weather_description  48187 non-null  object        
 7   date_time            48187 non-null  datetime64[ns]
 8   traffic_volume       48187 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(2), object(3)
memory usage: 3.3+ MB
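
Only 61 rows carried a holiday name before filling, which suggests the label is attached to a single row per holiday rather than to every hour of that day; this is an inference from the counts, not something the dataset description states. Under that assumption, a minimal sketch of spreading the label to all hours of the same date could look like this (the `sample` frame and the `is_holiday` column are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in: the holiday name appears on one row only
sample = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2016-07-04 00:00:00',
        '2016-07-04 01:00:00',
        '2016-07-05 00:00:00',
    ]),
    'holiday': ['Independence Day', 'na', 'na'],
})

# Calendar day of each row (time set to midnight)
day = sample['date_time'].dt.normalize()

# Days on which at least one row carries a holiday name
holiday_days = day[sample['holiday'] != 'na'].unique()

# Flag every hour that falls on one of those days
sample['is_holiday'] = day.isin(holiday_days)
print(sample)
```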


### Handling Unreasonable Data Ranges

In [134]:
traffic['rain_mmph'].describe()

count   48,187.000
mean         0.334
std         44.797
min          0.000
25%          0.000
50%          0.000
75%          0.000
max      9,831.300
Name: rain_mmph, dtype: float64

Rainfall above roughly 50 mm in one hour is generally classed as violent, so a maximum of 9,831.3 mm is not plausible. The cells below inspect the rows that report more than 60 mm of rain per hour and then drop them.

In [135]:
print(traffic[traffic.rain_mmph > 60])

      holiday  temp_K  rain_mmph  snow_mmph  cloudcover weather_main   
24870      na 302.110  9,831.300      0.000          75         Rain  \

      weather_description           date_time  traffic_volume  
24870     very heavy rain 2016-07-11 17:00:00            5535  


In [136]:
# Keep only rows at or below the 60 mm threshold inspected above
traffic = traffic[traffic.rain_mmph <= 60]
print(len(traffic))

48186
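
The summary statistics earlier also show a minimum `temp` of 0.0 K, which is just as implausible as the rain maximum. A minimal sketch of checking several columns against plausible bounds in one pass follows. The `sample` frame is made up, and the bounds in `bounds` are illustrative assumptions rather than values taken from the dataset documentation.

```python
import pandas as pd

# Hypothetical stand-in for the traffic DataFrame (values are made up)
sample = pd.DataFrame({
    'temp_K': [288.28, 0.0, 302.11],
    'rain_mmph': [0.0, 0.25, 9831.3],
})

# Assumed plausible (low, high) bounds per column, both ends inclusive
bounds = {
    'temp_K': (200.0, 330.0),
    'rain_mmph': (0.0, 60.0),
}

# A row is kept only if every checked column is within its bounds
in_range = pd.Series(True, index=sample.index)
for col, (low, high) in bounds.items():
    in_range &= sample[col].between(low, high)

cleaned = sample[in_range].reset_index(drop=True)

# Verify that the out-of-range rows are gone
assert cleaned['rain_mmph'].max() <= 60
assert cleaned['temp_K'].min() >= 200
print(len(cleaned))
```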


---

# 4

## Exploratory Data Analysis

Here is where your analysis begins. You can add different sections based on your project goals.
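
As one possible starting point for the commute question posed in the introduction, the sketch below averages traffic volume by hour of day. It uses a made-up `sample` frame in place of `traffic`, so the numbers are not results from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned traffic DataFrame (values are made up)
sample = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2018-09-24 07:00:00',
        '2018-09-25 07:00:00',
        '2018-09-24 23:00:00',
        '2018-09-25 23:00:00',
    ]),
    'traffic_volume': [6000, 6200, 900, 1100],
})

# Mean traffic volume for each hour of the day
hourly_mean = sample.groupby(sample['date_time'].dt.hour)['traffic_volume'].mean()
print(hourly_mean)
```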

### Exploring `Column Name`

In [137]:
# Code and visualization

**Observations**
- Ob 1
- Ob 2
- Ob 3

---

# 5

## Conclusion

### Insights 
State the insights/outcomes of your project or notebook.

### Suggestions

Make suggestions based on insights.

### Possible Next Steps
Areas to expand on:
- (if there is any)

---

# 6

## Epilogue

### References

This is how we use inline citation[<sup id="fn1-back">[1]</sup>](#fn1).

[<span id="fn1">1.</span>](#fn1-back) _subject (date)._ Title. Available at: https://website.com (Accessed: Date). 

> Use [https://www.citethisforme.com/](https://www.citethisforme.com/) to create citations.

### Versioning
Notebook and insights by (author).
- Version: 1.0.0
- Date: 