This notebook is being developed as part of the Code Kentucky Python Data Analyst pathway.<br>

Technical Specifications:
- Python 3.11.7<br>
- Instructions on setting up a Python virtual environment are contained towards the bottom of the README.md<br>
- Dependencies are available by running the following command: pip install -r requirements.txt

Code Kentucky Required Features #1, 2, 4, 5 are contained in the first cell blocks.<br>
- Feature #1: Read multiple data files (JSON, CSV, Excel, etc.)<br>
- Feature #2: Clean the data and perform a pandas merge, then calculate some new values based on the new data set.<br>
- Feature #4:  Utilize a Python virtual environment and include instructions in your README on how the user should set one up.

Goal: Quantify the impact of road closures based on three metrics:
1) Total number of closures<br>
2) Frequency of closures<br>
3) Duration of closures.<br>

Below are a few sample questions that I hope to answer:<br>
1) How many closures occur statewide each year? (Normal bar graph showing count per year?)<br>
2) How many road closures occur in each county per year? (Normal bar graph with year as x-axis and count of closures?)<br>
3) How frequently is a single road closed due to rainfall? (Horizontal bar graph with road name as the y-axis, or pivot table output?)<br>
4) What is the average duration of road closures?

Methodology:
1) Import/load road closure data directly from hosted web server into Pandas.<br>
2) Parse out the Latitude and Longitude by stripping unneeded hyperlink characters.<br>
3) Produce standalone latitude and longitude columns/fields, which is preferred for mapping in most BI software.<br>
4) Standardize timestamps to assist with calculating duration.<br>
5) Modify the duration calculation to show hours as float64, making it easier to use in popular BI tools.<br>
6) Summarize the results by year, county, and roadway using record counts and calculated durations.<br>
7) If time allows, develop an overall score that takes into consideration the frequency and duration of events.
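
As a sketch of the coordinate-parsing step, the two Route_Link layouts handled later in this notebook could also be read with the standard library's URL parser instead of string replacement. The function name `parse_route_link` is hypothetical, and the sample URLs are assembled from the prefixes and coordinates that appear in the cells below, so treat them as illustrative rather than as rows copied from the data.

```python
from urllib.parse import urlparse, parse_qs


def parse_route_link(url: str) -> tuple[float, float]:
    """Return (latitude, longitude) from either Route_Link layout (hypothetical helper)."""
    query = parse_qs(urlparse(url).query)
    if "center" in query:
        # 2021 layout: ...&center=<longitude>,<latitude>
        lon_text, lat_text = query["center"][0].split(",")
    else:
        # 2022 and later layout: ?lat=<latitude>&lng=<longitude>&zoom=...
        lat_text, lon_text = query["lat"][0], query["lng"][0]
    return float(lat_text), float(lon_text)


# Illustrative URLs built from the prefixes used in this notebook.
url_2021 = ("https://kytc.maps.arcgis.com/apps/webappviewer/index.html"
            "?id=327a38decc8c4e5cb882dc6cd0f9d45d&zoom=14&center=-88.979133,37.005148")
url_2022 = "https://goky.ky.gov/?lat=36.907913&lng=-88.917011&zoom=14"

print(parse_route_link(url_2021))
print(parse_route_link(url_2022))
```

Because only the `center`, `lat`, and `lng` parameters are read, extra text after the zoom level should not matter as long as those parameters are intact. With pandas, the helper could be applied to a column with `df['Route_Link'].map(parse_route_link)`.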

DISCLAIMER:  Results may vary.  In addition to historic data, this notebook is also utilizing current year data.  The data source is updated every 2 minutes but only when there are active road closures due to weather related events.

-Chris Lambert

In [1]:
#Import Python Libraries / Dependencies
import pandas as pd

In [2]:
#Analyze the 2021 dataset.
#I prefer to perform the import and cleaning in-memory as opposed to using local files.

#Load the 2021 dataset from the web server. 
df2021=pd.read_csv("https://storage.googleapis.com/kytc-its-2020-openrecords/toc/KYTC-TOC-Weather-Closures-Historic-2021.csv")

#Export the 2021 dataset to a csv file as a method of copying the data to the local machine. 
df2021.to_csv("KYTC-TOC-Weather-Closures-Historic-2021.csv", index=False)

#The url needs to be cleaned to reveal the latitude and longitude, just in case I need them for mapping.
#A literal find/replace (regex=False) removes the url prefix and leaves "longitude,latitude".
df2021['Route_Link'] = df2021['Route_Link'].str.replace('https://kytc.maps.arcgis.com/apps/webappviewer/index.html?id=327a38decc8c4e5cb882dc6cd0f9d45d&zoom=14&center=', '', regex=False)
df2021[['longitude','latitude']] = df2021.Route_Link.str.split(",",expand=True,)
df2021 = df2021.drop('Route_Link', axis=1)

#Reported_On and End_Date are formatted differently: End_Date carries a '+00:00' suffix.
#Strip that suffix before calculating the duration.
df2021['End_Date'] = df2021['End_Date'].str.replace('+00:00', '', regex=False)
df2021['Duration_Default'] = pd.to_datetime(df2021['End_Date']) - pd.to_datetime(df2021['Reported_On'])
df2021['Duration_Hours'] = df2021['Duration_Default'].dt.total_seconds() / 3600

#Print the first 3 rows just to check the data.
print(df2021.head(3))

#Export the clean 2021 dataset to a csv file to show progress.
df2021.to_csv("kytc-closures-2021.csv", index=False)


   District    County    Route          Road_Name  Begin_MP  End_MP  \
0         1   Ballard  KY-1345           MYERS RD       2.0     2.0   
1         1   Ballard  KY-1345           MYERS RD       2.0     2.0   
2         1  Calloway  KY-1536  OUTLAND SCHOOL RD       1.0     1.0   

                                            Comments          Reported_On  \
0  KY 1345/Myers Road Blocked at 2mm in Ballard C...  2021-07-28 00:05:00   
1  KY 1345/Myers Road Blocked both directions at ...  2021-07-28 00:05:00   
2  Blocked just north of the intersection with Cl...  2021-12-06 13:00:00   

              End_Date   longitude   latitude Duration_Default  Duration_Hours  
0  2021-07-28 04:14:01  -88.979133  37.005148  0 days 04:09:01        4.150278  
1  2021-07-28 16:04:01  -88.979133  37.005148  0 days 15:59:01       15.983611  
2  2021-12-07 16:58:01  -88.255129  36.609757  1 days 03:58:01       27.966944  


In [16]:
dtypes = df2021.dtypes
print(dtypes)

District                      int64
County                       object
Route                        object
Road_Name                    object
Begin_MP                    float64
End_MP                      float64
Comments                     object
Reported_On                  object
End_Date                     object
longitude                    object
latitude                     object
Duration_Default    timedelta64[ns]
Duration_Hours              float64
dtype: object
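
The dtype listing above shows longitude and latitude stored as `object` (strings), because they come out of `str.split`. A minimal sketch of converting them to floats with `pd.to_numeric` follows. The small `coords` frame is a stand-in built from two rows of the printed 2021 output, not the full dataset.

```python
import pandas as pd

# Stand-in frame: coordinate strings as they appear after str.split.
coords = pd.DataFrame({
    "longitude": ["-88.979133", "-88.255129"],
    "latitude": ["37.005148", "36.609757"],
})

# errors="coerce" turns any unparseable value into NaN instead of raising.
for column in ["latitude", "longitude"]:
    coords[column] = pd.to_numeric(coords[column], errors="coerce")

print(coords.dtypes)
```

Numeric columns are generally easier to map in BI software, and coercing bad values to NaN makes malformed links easy to find with `coords.isna()`.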


In [3]:
#Analyze the 2022 dataset.

#Load the 2022 dataset from the web server. 
df2022=pd.read_csv("https://storage.googleapis.com/kytc-its-2020-openrecords/toc/KYTC-TOC-Weather-Closures-Historic-2022.csv")

#Export the 2022 dataset to a csv file as a method of copying the data to the local machine. 
df2022.to_csv("KYTC-TOC-Weather-Closures-Historic-2022.csv", index=False)

#The url needs to be cleaned.
#Beginning in 2022, the url is different, requiring some additional stripping.
#The latitude and longitude are in a different order in this url.
df2022['Route_Link'] = df2022['Route_Link'].str.replace('https://goky.ky.gov/?lat=', '', regex=False)
df2022['Route_Link'] = df2022['Route_Link'].str.replace('&lng=', ',', regex=False)
df2022['Route_Link'] = df2022['Route_Link'].str.replace('&zoom=14', '', regex=False)
df2022[['latitude','longitude']] = df2022.Route_Link.str.split(",",expand=True,)
df2022 = df2022.drop('Route_Link', axis=1)

#Reported_On and End_Date already share the same timestamp format in this dataset.
#The '+00:00' strip is kept just in case the suffix appears.
df2022['End_Date'] = df2022['End_Date'].str.replace('+00:00', '', regex=False)
df2022['Duration_Default'] = pd.to_datetime(df2022['End_Date']) - pd.to_datetime(df2022['Reported_On'])
df2022['Duration_Hours'] = df2022['Duration_Default'].dt.total_seconds() / 3600

#Print the first 3 rows just to check the data.
print(df2022.head(3))

#Export the clean 2022 dataset to a csv file to show progress.
df2022.to_csv("kytc-closures-2022.csv", index=False)


   District    County    Route Road_Name  Begin_MP  End_MP  \
0         1  Carlisle   KY-121  KY-121 N       7.3     7.3   
1         1  Carlisle  KY-1628   KY-1628       2.0     2.0   
2         1  Carlisle  KY-1628   KY-1628       2.0     2.0   

                                            Comments          Reported_On  \
0  Roadway blocked due to flooding. Updates as av...  2022-04-12 08:56:00   
1                 Roadway Closed due to Flood damage  2022-04-26 08:11:00   
2  Roadway blocked due to flood damage. Updates a...  2022-04-26 08:11:00   

              End_Date   latitude   longitude Duration_Default  Duration_Hours  
0  2022-04-12 10:04:00  36.907913  -88.917011  0 days 01:08:00        1.133333  
1  2022-04-26 10:50:00  36.922194  -88.898473  0 days 02:39:00        2.650000  
2  2022-04-28 08:02:01  36.922194  -88.898473  1 days 23:51:01       47.850278  


In [4]:
#Analyze the 2023 dataset.
#An error was found in the 2023 Route_Link column.
#The ending characters were incorrectly published and included the road name in addition to the zoom level.

#Load the 2023 dataset from the web server.
df2023=pd.read_csv("https://storage.googleapis.com/kytc-its-2020-openrecords/toc/KYTC-TOC-Weather-Closures-Historic-2023.csv")

#Export the 2023 dataset to a csv file as a method of copying the data to the local machine.
df2023.to_csv("KYTC-TOC-Weather-Closures-Historic-2023.csv", index=False)

#The url needs to be cleaned.
#Beginning in 2022, the url is different, requiring some additional stripping.
#The placement of latitude and longitude is consistent across 2022-2024, but the malformed url ending in 2023 requires a regex.
df2023['Route_Link'] = df2023['Route_Link'].str.replace('https://goky.ky.gov/?lat=', '', regex=False)
df2023['Route_Link'] = df2023['Route_Link'].str.replace('&lng=', ',', regex=False)
df2023['Route_Link'] = df2023['Route_Link'].str.replace('&.*', '', regex=True) #Regex removes everything from the next '&' onward, compensating for the output error in the 2023 data.
df2023[['latitude','longitude']] = df2023.Route_Link.str.split(",",expand=True,)
df2023 = df2023.drop('Route_Link', axis=1)

#Reported_On and End_Date already share the same timestamp format in this dataset.
#The '+00:00' strip is kept just in case the suffix appears.
df2023['End_Date'] = df2023['End_Date'].str.replace('+00:00', '', regex=False)
df2023['Duration_Default'] = pd.to_datetime(df2023['End_Date']) - pd.to_datetime(df2023['Reported_On'])
df2023['Duration_Hours'] = df2023['Duration_Default'].dt.total_seconds() / 3600

#Print the first 3 rows just to check the data.
print(df2023.head(3))

#Export the clean 2023 dataset to a csv file to show progress.
df2023.to_csv("kytc-closures-2023.csv", index=False)

   District   County   Route          Road_Name  Begin_MP  End_MP  \
0         1  Ballard  KY-310  TURNER LANDING RD       6.0     6.0   
1         1  Ballard  KY-310  TURNER LANDING RD       6.0     6.0   
2         1  Ballard  KY-358     HINKLEVILLE RD       4.0     4.0   

                                            Comments          Reported_On  \
0  Roadway blocked due to flooding. Updates as av...  2023-07-19 17:58:02   
1     Road blocked due to flooding at this location.  2023-07-26 12:29:13   
2  Roadway blocked due to flooding. Updates as av...  2023-07-19 18:00:48   

              End_Date   latitude   longitude Duration_Default  Duration_Hours  
0  2023-07-22 20:02:01  37.131104    -89.0201  3 days 02:03:59       74.066389  
1  2023-07-26 15:10:01  37.131104    -89.0201  0 days 02:40:48        2.680000  
2  2023-07-22 20:06:00  37.046533  -88.933658  3 days 02:05:12       74.086667  


In [5]:
#Analyze the 2024 dataset.
#Expect results to change: this dataset will be updated throughout the year, as events occur.
#This will produce different calculations as the year progresses.

#Load the 2024 dataset from the web server.
df2024=pd.read_csv("https://storage.googleapis.com/kytc-its-2020-openrecords/toc/KYTC-TOC-Weather-Closures-Historic-2024.csv")

#Export the 2024 dataset to a csv file as a method of copying the data to the local machine.
df2024.to_csv("KYTC-TOC-Weather-Closures-Historic-2024.csv", index=False)

#The url needs to be cleaned.
#Beginning in 2022, the url is different, requiring some additional stripping.
#The placement of latitude and longitude is consistent across 2022-2024; the regex from the 2023 cell is reused to drop everything after the coordinates.
df2024['Route_Link'] = df2024['Route_Link'].str.replace('https://goky.ky.gov/?lat=', '', regex=False)
df2024['Route_Link'] = df2024['Route_Link'].str.replace('&lng=', ',', regex=False)
df2024['Route_Link'] = df2024['Route_Link'].str.replace('&.*', '', regex=True)
df2024[['latitude','longitude']] = df2024.Route_Link.str.split(",",expand=True,)
df2024 = df2024.drop('Route_Link', axis=1)

#Reported_On and End_Date already share the same timestamp format in this dataset.
#The '+00:00' strip is kept just in case the suffix appears.
df2024['End_Date'] = df2024['End_Date'].str.replace('+00:00', '', regex=False)
df2024['Duration_Default'] = pd.to_datetime(df2024['End_Date']) - pd.to_datetime(df2024['Reported_On'])
df2024['Duration_Hours'] = df2024['Duration_Default'].dt.total_seconds() / 3600

#Print the first 3 rows just to check the data.
print(df2024.head(3))

#Export the clean 2024 dataset to a csv file to show progress.
df2024.to_csv("kytc-closures-2024.csv", index=False)

   District   County   Route          Road_Name  Begin_MP  End_MP  \
0         1  Ballard  KY-310  TURNER LANDING RD       6.0     6.0   
1         1  Ballard  KY-310  TURNER LANDING RD       6.0     6.0   
2         1  Ballard  KY-358     HINKLEVILLE RD       4.0     4.0   

                                            Comments          Reported_On  \
0  Roadway blocked due to flooding. Updates as av...  2023-07-19 17:58:02   
1     Road blocked due to flooding at this location.  2023-07-26 12:29:13   
2  Roadway blocked due to flooding. Updates as av...  2023-07-19 18:00:48   

              End_Date   latitude   longitude Duration_Default  Duration_Hours  
0  2023-07-22 20:02:01  37.131104    -89.0201  3 days 02:03:59       74.066389  
1  2023-07-26 15:10:01  37.131104    -89.0201  0 days 02:40:48        2.680000  
2  2023-07-22 20:06:00  37.046533  -88.933658  3 days 02:05:12       74.086667  


In [6]:
#Merge the dataframes together.
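
One possible way to combine the yearly frames is sketched below. It uses `pd.concat` (stacking rows) rather than `DataFrame.merge`, since every year has the same columns; that is a substitution on my part, not a decision the notebook has made. The two single-row frames are stand-ins built from the printed output above, and the `Year` column is an assumed addition derived from Reported_On.

```python
import pandas as pd

# Stand-in frames: one row each, taken from the printed 2021 and 2022 output.
# Note the coordinate columns are in a different order, as in the real frames.
frame_2021 = pd.DataFrame({
    "County": ["Ballard"],
    "Reported_On": ["2021-07-28 00:05:00"],
    "longitude": ["-88.979133"],
    "latitude": ["37.005148"],
    "Duration_Hours": [4.150278],
})
frame_2022 = pd.DataFrame({
    "County": ["Carlisle"],
    "Reported_On": ["2022-04-12 08:56:00"],
    "latitude": ["36.907913"],
    "longitude": ["-88.917011"],
    "Duration_Hours": [1.133333],
})

# concat aligns columns by name, so the differing column order is harmless.
closures = pd.concat([frame_2021, frame_2022], ignore_index=True)

# Assumed helper column: the calendar year in which each closure was reported.
closures["Year"] = pd.to_datetime(closures["Reported_On"]).dt.year

print(closures)
```

In the notebook itself, the list passed to `pd.concat` would presumably be `[df2021, df2022, df2023, df2024]`.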



- Feature list #3 choice: Make 3 matplotlib (or another plotting library) visualizations to display your data.
- Feature list #3 optional: I may use Tableau if I have time.

In [7]:
#quantify the total count of road closures by year

In [8]:
#quantify the total count of road closures by county

In [9]:
#quantify the count of closures by individual roadways

In [10]:
#quantify the duration of closures by year

In [11]:
#quantify the duration of closures by county

In [12]:
#quantify the duration of closures by road
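
The summary cells above are still placeholders. A minimal sketch of how the counts and durations might be computed with one `groupby` helper follows. The function name `summarize`, the output column names, and the `Year` column are assumptions. The sample rows reuse values from the printed 2021 and 2022 output.

```python
import pandas as pd

# Stand-in for the combined frame, using rows from the printed output above.
closures = pd.DataFrame({
    "Year": [2021, 2021, 2021, 2022],
    "County": ["Ballard", "Ballard", "Calloway", "Carlisle"],
    "Road_Name": ["MYERS RD", "MYERS RD", "OUTLAND SCHOOL RD", "KY-121 N"],
    "Duration_Hours": [4.150278, 15.983611, 27.966944, 1.133333],
})


def summarize(frame: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Count closures and aggregate Duration_Hours per group (hypothetical helper)."""
    return (
        frame.groupby(keys)
        .agg(
            Closures=("Duration_Hours", "size"),
            Mean_Hours=("Duration_Hours", "mean"),
            Total_Hours=("Duration_Hours", "sum"),
        )
        .reset_index()
    )


by_year = summarize(closures, ["Year"])
by_county = summarize(closures, ["Year", "County"])
by_road = summarize(closures, ["County", "Road_Name"])

print(by_year)
print(by_county)
print(by_road)
```

Each result is a flat frame, so it can be passed to a plotting call such as `by_year.plot.bar(x="Year", y="Closures")` or exported to a csv file for a BI tool.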

In [13]:
#BONUS:  Map out the closures on a geospatial map.
#Possible methods include using folium or geopandas.

#https://python-visualization.github.io/folium/latest/getting_started.html

In [14]:
#BONUS:  Perform the entire analysis in PowerBI and leave the file in the repository.