# INFO 3402 – Week 04: Assignment

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)

## Background

In lecture, we explored the effects of the spring 2020 COVID-19 shutdowns on DIA's passenger traffic. We're going to extend that analysis by looking at road traffic as well.

The Colorado Department of Transportation maintains the [Traffic Data Explorer](https://dtdapps.coloradodot.info/otis/TrafficData#ui/0/0/2/criteria//119/false/true/) with a network of continuous traffic recorders throughout Colorado. Historical data is available going back to 1996 or before. I contacted the office and they shared data for 2019 through 2021 for every station in the state.

## Question 01: Load libraries and inspect datasets (4 pts)

Import the `numpy` and `pandas` libraries.

Load the CDOT "traffic" CSV and store as `traffic_df`. Make sure to use the "parse_dates" parameter in `read_csv` on one column for future steps. Inspect the first few rows. (2 pts)

Load the CDOT "station" CSV and store as `station_df`. Set the index to be "COUNTSTATIONID". Inspect the first few rows. (2 pts)
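As a minimal sketch of the two `read_csv` options mentioned above, the example below reads two tiny in-memory CSVs. The `toy_` names and the `io.StringIO` stand-ins are assumptions for illustration only; the real files have many more rows and columns.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the real CSV files (a few rows and columns only)
traffic_csv = io.StringIO(
    "COUNTSTATIONID,DIR,COUNTDATE,HOUR0,HOUR1\n"
    "1,EAST,20191119,4,6\n"
    "1,EAST,20191120,104,6\n"
)
station_csv = io.StringIO(
    "COUNTSTATIONID,Location,Latitude,Longitude\n"
    '1,"ON SH 6 E/O SH 59, HAXTUN",40.62971,-102.572141\n'
)

# parse_dates converts the listed column(s) to datetime64 while reading
toy_traffic_df = pd.read_csv(traffic_csv, parse_dates=["COUNTDATE"])

# index_col sets the index at load time, similar to calling set_index afterwards
toy_station_df = pd.read_csv(station_csv, index_col="COUNTSTATIONID")

print(toy_traffic_df.dtypes)
print(toy_station_df.head())
```

Checking `.dtypes` after loading is a quick way to confirm that the date column was actually parsed rather than left as integers or strings.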

## Question 02: Tidy the traffic data (8 pts)

Identify the columns that are "id_vars" *or* identify the columns that are "value_vars" as a Raw NBConvert cell. (1 pt)

Based on the `.shape` of `traffic_df` and the value_vars you've identified, perform a calculation estimating the number of rows in the reshaped tidy data. (2 pts)

Reshape the DataFrame into a tidy format using either a `melt` or `stack` strategy and call the DataFrame `traffic_tidy_df`. Make sure the variable and value columns in the resulting reshaped DataFrame are labeled "Hour" and "Count". (4 pts)

Check how the shape of the melted data compares to your prediction. (1 pt)
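The relationship between the wide and tidy shapes can be sketched on a toy table. The data and the `toy_` names below are made up for illustration and use only three hourly columns; the idea is that each wide row contributes one tidy row per value column.

```python
import pandas as pd

# Hypothetical wide table: 2 id columns and 3 hourly columns
toy_wide_df = pd.DataFrame({
    "COUNTSTATIONID": [1, 1, 2],
    "DIR": ["EAST", "WEST", "EAST"],
    "HOUR0": [4, 6, 5],
    "HOUR1": [6, 3, 2],
    "HOUR2": [5, 9, 7],
})

id_vars = ["COUNTSTATIONID", "DIR"]
value_vars = [c for c in toy_wide_df.columns if c not in id_vars]

# Each wide row becomes one tidy row per value column
expected_rows = toy_wide_df.shape[0] * len(value_vars)

toy_tidy_df = toy_wide_df.melt(
    id_vars=id_vars,
    value_vars=value_vars,
    var_name="Hour",    # label for the former column names
    value_name="Count", # label for the former cell values
)

print(expected_rows, toy_tidy_df.shape)
```

Here 3 rows times 3 value columns gives 9 tidy rows, and the tidy table has one column per id variable plus the "Hour" and "Count" columns.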

## Question 03: Calculate some aggregate statistics (12 pts)

Use the `dt.year` ([docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.year.html)) accessor to extract the year from "COUNTDATE" and store it as a column called "Year" in `traffic_tidy_df`. (2 pts)

Use the `dt.month` ([docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.month.html)) accessor to extract the month from "COUNTDATE" and store it as a column called "Month" in `traffic_tidy_df`. (2 pts)

Group by the "COUNTSTATIONID", "DIR", "Year", and "Month" columns. Aggregate to the total traffic and store as `station_agg_traffic`. (2 pts)
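A small sketch of the extract-then-aggregate pattern, using made-up counts and hypothetical `toy_` names. It assumes the date column is already a datetime dtype, which the `.dt` accessor requires.

```python
import pandas as pd

# Hypothetical tidy data with made-up counts
toy_tidy_df = pd.DataFrame({
    "COUNTSTATIONID": [4, 4, 4, 4],
    "DIR": ["EAST", "EAST", "EAST", "WEST"],
    "COUNTDATE": pd.to_datetime(
        ["2019-03-01", "2019-03-02", "2020-03-01", "2019-03-01"]
    ),
    "Count": [100, 150, 60, 90],
})

# The .dt accessor exposes datetime components as integer Series
toy_tidy_df["Year"] = toy_tidy_df["COUNTDATE"].dt.year
toy_tidy_df["Month"] = toy_tidy_df["COUNTDATE"].dt.month

# Sum the counts within each station/direction/year/month group
group_cols = ["COUNTSTATIONID", "DIR", "Year", "Month"]
toy_agg = toy_tidy_df.groupby(group_cols)["Count"].sum()

print(toy_agg)
```

Selecting `["Count"]` before aggregating returns a Series with a four-level MultiIndex; selecting `[["Count"]]` instead would return a one-column DataFrame with the same index.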

Station ID "4" is on US-36 near McCaslin Blvd. Access the `station_agg_traffic` DataFrame using `.loc` and extract Station 4's eastbound and westbound traffic counts for March 2019 and March 2020 (four values in total). (4 pts)

Calculate and print the percentage change in traffic (**both** directions) for March 2020 compared to March 2019. (2 pts)
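As a sketch of the arithmetic, percentage change is (new - old) / old × 100. The example below applies it to a toy MultiIndexed Series with made-up totals; the `toy_agg` name and every number in it are hypothetical, not the real Station 4 counts.

```python
import pandas as pd

# Hypothetical monthly totals indexed like the aggregated traffic data
index = pd.MultiIndex.from_product(
    [[4], ["EAST", "WEST"], [2019, 2020], [3]],
    names=["COUNTSTATIONID", "DIR", "Year", "Month"],
)
toy_agg = pd.Series([1000, 600, 800, 400], index=index, name="Count")

# A full-depth tuple passed to .loc returns a single value
march_2019 = toy_agg.loc[(4, "EAST", 2019, 3)] + toy_agg.loc[(4, "WEST", 2019, 3)]
march_2020 = toy_agg.loc[(4, "EAST", 2020, 3)] + toy_agg.loc[(4, "WEST", 2020, 3)]

# Percentage change relative to the earlier period
pct_change = (march_2020 - march_2019) / march_2019 * 100

print(f"{pct_change:.1f}%")
```

A negative result indicates a decrease relative to the 2019 baseline.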

## Question 04: Reshape annual traffic data (6 pts)

Unstack the "Year" column from `station_agg_traffic`, assign to `annual_traffic_unstack`, and inspect to ensure it is a simple MultiIndexed DataFrame with three columns. (2 pts)

Calculate the *percentage* difference between each row's 2020 traffic and 2019 traffic, store as a new column "PctDiff_20_19", and inspect. (2 pts)

Use the "PctDiff_20_19" Series to identify the station that had the largest drop in traffic from 2019 to 2020. What month did this occur? What was the percentage drop in traffic? Use the "Location" field from `station_df` we loaded in Question 01 to identify the station. (3 pts)
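The unstack, difference, and lookup steps can be sketched end to end on toy data. All values, the `toy_` names, and the placeholder location strings below are hypothetical, and the toy data covers only two years.

```python
import pandas as pd

# Hypothetical monthly totals for two stations, two years, two months
index = pd.MultiIndex.from_product(
    [[4, 105], ["EAST"], [2019, 2020], [3, 4]],
    names=["COUNTSTATIONID", "DIR", "Year", "Month"],
)
toy_agg = pd.Series(
    [1000, 1000, 600, 500, 2000, 2000, 1500, 100], index=index, name="Count"
)

# Placeholder station metadata (not the real Location strings)
toy_station_df = pd.DataFrame(
    {"Location": ["Hypothetical site A", "Hypothetical site B"]},
    index=pd.Index([4, 105], name="COUNTSTATIONID"),
)

# Move the Year level from the row index into the columns
toy_unstack = toy_agg.unstack("Year")

# Percentage difference of 2020 relative to 2019, row by row
toy_unstack["PctDiff_20_19"] = (
    (toy_unstack[2020] - toy_unstack[2019]) / toy_unstack[2019] * 100
)

# idxmin returns the index label (a tuple here) of the smallest value
worst = toy_unstack["PctDiff_20_19"].idxmin()
worst_pct = toy_unstack.loc[worst, "PctDiff_20_19"]
worst_location = toy_station_df.loc[worst[0], "Location"]

print(worst, worst_pct, worst_location)
```

Because the row index still holds the station, direction, and month, the tuple returned by `idxmin` answers the "which station" and "which month" questions at once.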

### Extra credit (4 points)

We expected traffic to decrease significantly but a 95% drop-off on one of the busiest interstates seems unusual. Let's dig into the data to understand what's causing this.

Use any combination of Boolean indexing, aggregating, or reshaping to find the number of unique "COUNTDATE" values in April 2020 for eastbound station ID 105. (1 pt)

What percentage of stations in this data are similarly missing at least one date? (2 pts)

What is the average temporal coverage (n_unique days / n_possible days) across stations? (1 pt)
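One way to think about coverage is sketched below on a toy table. The three-day window, the station IDs, and the `toy_` names are assumptions for illustration, not properties of the real data.

```python
import pandas as pd

# Hypothetical observations: station 1 has all three days, station 2 is missing one
toy_tidy_df = pd.DataFrame({
    "COUNTSTATIONID": [1, 1, 1, 2, 2],
    "COUNTDATE": pd.to_datetime(
        ["2020-04-01", "2020-04-02", "2020-04-03", "2020-04-01", "2020-04-03"]
    ),
})

# Assumed observation window for this toy example
n_possible = len(pd.date_range("2020-04-01", "2020-04-03"))

# Boolean indexing: unique dates for one station in one month
mask = (
    (toy_tidy_df["COUNTSTATIONID"] == 2)
    & (toy_tidy_df["COUNTDATE"].dt.year == 2020)
    & (toy_tidy_df["COUNTDATE"].dt.month == 4)
)
station_2_days = toy_tidy_df.loc[mask, "COUNTDATE"].nunique()

# Unique observed days per station, then coverage as a fraction of the window
n_unique = toy_tidy_df.groupby("COUNTSTATIONID")["COUNTDATE"].nunique()
coverage = n_unique / n_possible

# Share of stations missing at least one day, as a percentage
pct_missing = (n_unique < n_possible).mean() * 100

print(station_2_days, pct_missing, coverage.mean())
```

Taking the mean of a Boolean Series gives the fraction of `True` values, which is why it works for the "percentage of stations" calculation.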

## Appendix

This documents how I cleaned up the data for use in the assignment. There's nothing you need to do here for the assignment, but I think there are some valuable patterns and examples to reuse. There are definitely some hints here about how to approach the assignment as well.

In [3]:
import pandas as pd

# Read the first sheet of the CDOT workbook
cdot_df = pd.read_excel('cdot_2019_21.xlsx', sheet_name=0)
cdot_df.head()

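| | COUNTSTATIONID | Location | Latitude | Longitude | DIR | COUNTDATE | DAYOFWEEK | CALYR | HOUR0 | HOUR1 | HOUR2 | HOUR3 | HOUR4 | HOUR5 | HOUR6 | HOUR7 | HOUR8 | HOUR9 | HOUR10 | HOUR11 | HOUR12 | HOUR13 | HOUR14 | HOUR15 | HOUR16 | HOUR17 | HOUR18 | HOUR19 | HOUR20 | HOUR21 | HOUR22 | HOUR23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ON SH 6 E/O SH 59, HAXTUN | 40.62971 | -102.572141 | EAST | 20191119 | 3 | 2019 | 4 | 6 | 5 | 6 | 13 | 16 | 43 | 70 | 117 | 67 | 61 | 62 | 70 | 64 | 65 | 74 | 84 | 52 | 49 | 35 | 23 | 24 | 17 | 9 |
| 1 | 1 | ON SH 6 E/O SH 59, HAXTUN | 40.62971 | -102.572141 | EAST | 20191120 | 4 | 2019 | 104 | 6 | 4 | 3 | 10 | 16 | 34 | 81 | 62 | 119 | 58 | 70 | 69 | 84 | 71 | 57 | 70 | 70 | 30 | 192 | 26 | 17 | 15 | 5 |
| 2 | 1 | ON SH 6 E/O SH 59, HAXTUN | 40.62971 | -102.572141 | EAST | 20191121 | 5 | 2019 | 2 | 2 | 3 | 4 | 10 | 14 | 32 | 71 | 69 | 66 | 60 | 52 | 69 | 345 | 312 | 97 | 64 | 70 | 48 | 46 | 21 | 18 | 9 | 4 |
| 3 | 1 | ON SH 6 E/O SH 59, HAXTUN | 40.62971 | -102.572141 | EAST | 20191122 | 6 | 2019 | 5 | 3 | 9 | 6 | 6 | 18 | 35 | 93 | 49 | 70 | 129 | 58 | 78 | 74 | 88 | 114 | 102 | 165 | 79 | 47 | 42 | 31 | 22 | 16 |
| 4 | 1 | ON SH 6 E/O SH 59, HAXTUN | 40.62971 | -102.572141 | EAST | 20191123 | 7 | 2019 | 8 | 5 | 3 | 3 | 9 | 22 | 26 | 31 | 105 | 40 | 38 | 64 | 45 | 45 | 63 | 55 | 57 | 54 | 55 | 50 | 224 | 34 | 19 | 8 |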


Check that each station ID maps to exactly one latitude and one longitude.

In [18]:
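# Count the distinct coordinate values recorded for each station and direction
unique_lat_longs = pd.pivot_table(
    data = cdot_df,
    index = 'COUNTSTATIONID',
    columns = ['DIR'],
    values = ['Latitude','Longitude'],
    aggfunc = 'nunique'
)

# Move DIR into the row index; a count above 1 would flag inconsistent coordinates
unique_lat_longs.stack().sort_values('Latitude',ascending=False)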

| COUNTSTATIONID | DIR | Latitude | Longitude |
|---|---|---|---|
| 1 | EAST | 1.0 | 1.0 |
| 506 | SOUTH | 1.0 | 1.0 |
| 312 | SOUTH | 1.0 | 1.0 |
| 314 | NORTH | 1.0 | 1.0 |
| 314 | SOUTH | 1.0 | 1.0 |
| ... | ... | ... | ... |
| 215 | SOUTH | 1.0 | 1.0 |
| 216 | NORTH | 1.0 | 1.0 |
| 216 | SOUTH | 1.0 | 1.0 |
| 217 | EAST | 1.0 | 1.0 |
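Because the displayed output is truncated, a programmatic check can be more reliable than scanning the table by eye. The sketch below is one possible alternative using `groupby` and `nunique` on a toy table; the `toy_cdot_df` name and its rows are hypothetical, and this is not the check used to prepare the assignment data.

```python
import pandas as pd

# Hypothetical records: each station repeats the same coordinates
toy_cdot_df = pd.DataFrame({
    "COUNTSTATIONID": [1, 1, 2, 2],
    "Latitude": [40.62971, 40.62971, 39.739875, 39.739875],
    "Longitude": [-102.572141, -102.572141, -104.671758, -104.671758],
})

# Number of distinct latitude and longitude values per station
coord_counts = toy_cdot_df.groupby("COUNTSTATIONID")[["Latitude", "Longitude"]].nunique()

# True only if every station has exactly one latitude and one longitude
all_consistent = bool((coord_counts == 1).all().all())

print(all_consistent)
```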


Tidy up the data by removing the duplicated Location, Latitude, and Longitude values.

In [34]:
# Keep one row per station, indexed by station ID
station_cols = ['COUNTSTATIONID', 'Location', 'Latitude', 'Longitude']
stationid_df = cdot_df[station_cols].drop_duplicates().set_index('COUNTSTATIONID')
stationid_df.to_csv('cdot_stations.csv')
stationid_df.head()

| COUNTSTATIONID | Location | Latitude | Longitude |
|---|---|---|---|
| 1 | ON SH 6 E/O SH 59, HAXTUN | 40.62971 | -102.572141 |
| 2 | ON I-70 W/O SH 36, AIR PARK RD, AURORA | 39.739875 | -104.671758 |
| 3 | ON SH 470 NW/O SH 85, SANTA FE DR, LITTLETON | 39.567108 | -105.054139 |
| 4 | ON SH 36 SE/O SH 170, MCCASLIN BLVD, SUPERIOR | 39.946916 | -105.148988 |
| 7 | ON SH 14 MI E/O CR 33, AULT | 40.582532 | -104.747863 |


Drop columns for Location, Latitude, Longitude, CALYR, and DAYOFWEEK.

In [35]:
cdot_drop_df = cdot_df.drop(columns = ['Location','Latitude','Longitude','CALYR','DAYOFWEEK'])
cdot_drop_df.to_csv('cdot_traffic.csv',index=False)
cdot_drop_df.head()

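| | COUNTSTATIONID | DIR | COUNTDATE | HOUR0 | HOUR1 | HOUR2 | HOUR3 | HOUR4 | HOUR5 | HOUR6 | HOUR7 | HOUR8 | HOUR9 | HOUR10 | HOUR11 | HOUR12 | HOUR13 | HOUR14 | HOUR15 | HOUR16 | HOUR17 | HOUR18 | HOUR19 | HOUR20 | HOUR21 | HOUR22 | HOUR23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | EAST | 20191119 | 4 | 6 | 5 | 6 | 13 | 16 | 43 | 70 | 117 | 67 | 61 | 62 | 70 | 64 | 65 | 74 | 84 | 52 | 49 | 35 | 23 | 24 | 17 | 9 |
| 1 | 1 | EAST | 20191120 | 104 | 6 | 4 | 3 | 10 | 16 | 34 | 81 | 62 | 119 | 58 | 70 | 69 | 84 | 71 | 57 | 70 | 70 | 30 | 192 | 26 | 17 | 15 | 5 |
| 2 | 1 | EAST | 20191121 | 2 | 2 | 3 | 4 | 10 | 14 | 32 | 71 | 69 | 66 | 60 | 52 | 69 | 345 | 312 | 97 | 64 | 70 | 48 | 46 | 21 | 18 | 9 | 4 |
| 3 | 1 | EAST | 20191122 | 5 | 3 | 9 | 6 | 6 | 18 | 35 | 93 | 49 | 70 | 129 | 58 | 78 | 74 | 88 | 114 | 102 | 165 | 79 | 47 | 42 | 31 | 22 | 16 |
| 4 | 1 | EAST | 20191123 | 8 | 5 | 3 | 3 | 9 | 22 | 26 | 31 | 105 | 40 | 38 | 64 | 45 | 45 | 63 | 55 | 57 | 54 | 55 | 50 | 224 | 34 | 19 | 8 |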
