# Evaluate Clusters

This notebook creates plots of the areas around the five clusters with the highest fatality rates. All plots are saved as HTML files in the `plots/fatality_plots` subdirectory (the `PLOTS_DIR` defined below).

**The five locations with the highest fatality rates are:**

| Location                                   | Fatality Rate |
|--------------------------------------------|---------------|
| TX 54 eighteen miles north of Van Horn, TX | 0.57          |
| US 277 sixty miles south of Sonora, TX     | 0.53          |
| TX 349 on the Pecos/Terrell county line    | 0.50          |
| US 380 thirteen miles east of Tahoka, TX   | 0.45          |
| US 90 ninety miles west of Del Rio, TX     | 0.44          |

**Summary of Findings:**

1. The most dangerous cluster in Texas is on TX 54, a rural road in West Texas north of Van Horn. This may be attributable to the increased traffic volume generated by the oil and gas industry, as well as by the rocket development industry in the area.

2. The KMeans algorithm does not perform well when clustering distinct intersections in urban environments. This can be seen in plot 2 (US 277) by panning to Del Rio, TX, where single cluster centroids clearly cover multiple intersections. Each intersection should form its own cluster so that those needing design improvements to increase safety can be identified.

3. Some clusters cover both side streets and major highways, so it is unclear whether the dangerous roadway segment is on the highway or on a side street. This can be seen in plot 4 on US 380 south of Lubbock, TX.

4. Four of the five most dangerous clusters on this list are on rural roads in West Texas. This could be due to low crash counts in these areas skewing the fatality rates. The analysis could be improved by separating crashes in urban and rural environments before clustering. Urban crash clusters might be more accurate if a different clustering algorithm, such as DBSCAN, were used. Overall, the KMeans algorithm seems sufficient for clustering.
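To illustrate how low crash counts can skew a fatality rate, here is a minimal sketch that computes the rate per cluster and optionally drops clusters with too few crashes. The toy data, the column names `cluster_id` and `is_fatal`, and the `min_crashes` threshold are all assumptions made for illustration; they are not taken from the processed crash data or from `evaluate_clusters`.

```python
import pandas as pd

# Toy data only. `cluster_id` and `is_fatal` are hypothetical column names.
crashes = pd.DataFrame({
    "cluster_id": [0] * 10 + [1] * 2 + [2] * 4,
    "is_fatal": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # cluster 0: 1 of 10
                 1, 0,                          # cluster 1: 1 of 2
                 1, 0, 0, 0],                   # cluster 2: 1 of 4
})


def fatality_rates(df, min_crashes=1):
    """Return crash count and fatality rate per cluster, highest rate first."""
    stats = df.groupby("cluster_id")["is_fatal"].agg(crashes="count", fatal="sum")
    stats["fatality_rate"] = stats["fatal"] / stats["crashes"]
    # Drop clusters whose rate rests on very few crashes.
    stats = stats[stats["crashes"] >= min_crashes]
    return stats.sort_values("fatality_rate", ascending=False)


all_clusters = fatality_rates(crashes)
filtered = fatality_rates(crashes, min_crashes=4)
print(all_clusters)
print(filtered)
```

In this toy data the two-crash cluster ranks first with a rate of 0.5 until the threshold removes it, which is the kind of small-sample effect that may be inflating the rural rankings.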

In [1]:
from pathlib import Path
import sys

# get root directory of project
ROOT_DIR = Path.cwd().parent
PLOTS_DIR = ROOT_DIR / 'plots' / 'fatality_plots'
sys.path.append(str(ROOT_DIR))

from src import evaluate_clusters
import config

**Get the processed crash data from the S3 bucket**

In [2]:
s3_url = config.S3_OUTPUT_DIR # 's3://public-crash-data/clean-data/'
crash_data_df = evaluate_clusters.read_csv_from_s3(s3_url)

**1. TX 54 North of Van Horn, TX**

TX 54 eighteen miles north of Van Horn, TX has the highest fatality rate in the entire state of Texas. This could be because of increased traffic from the oil and gas industry in the area. Additionally, Blue Origin has a rocket development facility in the area, and the roads might not be built to handle the resulting traffic volume.

In [3]:
# plot the cluster and save the plot to PLOTS_DIR (plots/fatality_plots)
map_plot = evaluate_clusters.plot_highest_fatality(crash_data_df, rank=1, distance=50, save_path=PLOTS_DIR)

Fatality Rate: 0.57


**2. US 277 South of Sonora, TX**

US 277 approximately 60 miles south of Sonora, TX is the second most fatal roadway segment in Texas. The KMeans clustering model performs well when clustering and evaluating rural segments, but it is lacking in urban environments. The difference can be seen by panning to Del Rio in the plot below, where the model combines many intersections into one cluster. Realistically, each intersection should remain distinct to determine which ones are more dangerous than others.
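The summary suggests DBSCAN as a possible alternative for urban areas. The sketch below shows how density-based clustering could keep nearby intersections apart, using scikit-learn's `DBSCAN` with the haversine metric. The coordinates are synthetic, and the 50 m radius and `min_samples=3` are arbitrary assumptions, not tuned values from this project.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000

# Synthetic (lat, lon) points in degrees, not real crash data:
# two groups about 500 m apart plus one isolated point.
coords_deg = np.array([
    [29.3700, -100.9000], [29.3701, -100.9001], [29.3699, -100.8999],
    [29.3745, -100.9000], [29.3746, -100.9001], [29.3744, -100.8999],
    [29.5000, -101.2000],
])

# The haversine metric works on radians, so eps is an angle:
# a 50 m radius divided by the Earth's radius.
model = DBSCAN(
    eps=50 / EARTH_RADIUS_M,
    min_samples=3,
    metric="haversine",
    algorithm="ball_tree",
)
labels = model.fit_predict(np.radians(coords_deg))
print(labels)  # label -1 marks points treated as noise
```

Unlike KMeans, DBSCAN does not need the number of clusters up front, so dense urban areas can yield many small clusters while isolated crashes are left unassigned.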

In [4]:
# plot the cluster and save the plot to PLOTS_DIR (plots/fatality_plots)
map_plot = evaluate_clusters.plot_highest_fatality(crash_data_df, rank=2, distance=50, save_path=PLOTS_DIR)

Fatality Rate: 0.53


**3. TX 349 on the County Line of Pecos/Terrell County**

TX 349 on the county line of Pecos and Terrell County is the third most dangerous roadway segment. There is no cluster in the town of Sheffield. This could be because the dataset contains no crashes in Sheffield, or because the KMeans clustering model performed poorly in this area. Additional visual analysis should be done here to determine the accuracy of the clusters.
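One way to check whether the dataset contains any crashes near a town is to count the crash points within a fixed radius of it. The following is a rough sketch using the haversine formula; the helper names, the placeholder coordinates, and the 5 km radius are assumptions for illustration and are not part of `evaluate_clusters`.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def crashes_within(points, center, radius_km):
    """Return the (lat, lon) points lying within radius_km of center."""
    return [
        p for p in points
        if haversine_km(center[0], center[1], p[0], p[1]) <= radius_km
    ]


# Placeholder coordinates, not taken from the crash dataset.
town_center = (30.69, -101.82)
crash_points = [(30.70, -101.82), (30.69, -101.80), (31.20, -102.00)]

nearby = crashes_within(crash_points, town_center, radius_km=5)
print(len(nearby), "crash point(s) within 5 km")
```

An empty result would point to missing data in the area rather than to a clustering problem.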

In [5]:
# plot the cluster and save the plot to PLOTS_DIR (plots/fatality_plots)
map_plot = evaluate_clusters.plot_highest_fatality(crash_data_df, rank=3, distance=50, save_path=PLOTS_DIR)

Fatality Rate: 0.5


**4. US 380 east of Tahoka, TX**

US 380 thirteen miles east of Tahoka, TX is the fourth most dangerous roadway segment in Texas. This is the first cluster located in a more urban environment. The centroid sits slightly north of US 380, which raises the question of whether the fatal crashes took place on US 380 or on a side street. Additional visual analysis of crashes at this location should be performed to determine on which streets the fatal crashes occurred.
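Alongside a visual check, the fatal crashes in a cluster could be tallied by street. This is only a sketch: the records are made up, and the field names `street` and `is_fatal` are assumptions about the schema, which may differ in the processed crash data.

```python
from collections import Counter

# Made-up records for a single cluster; field names are hypothetical.
cluster_crashes = [
    {"street": "US 380", "is_fatal": True},
    {"street": "US 380", "is_fatal": False},
    {"street": "US 380", "is_fatal": True},
    {"street": "Side Street A", "is_fatal": True},
    {"street": "Side Street A", "is_fatal": False},
]


def fatal_counts_by_street(records):
    """Count fatal crashes per street name within one cluster."""
    return Counter(r["street"] for r in records if r["is_fatal"])


counts = fatal_counts_by_street(cluster_crashes)
for street, n_fatal in counts.most_common():
    print(f"{street}: {n_fatal} fatal crash(es)")
```

If one street accounts for most of the fatal crashes, the centroid's offset from the highway matters less for interpreting the cluster.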

In [6]:
# plot the cluster and save the plot to PLOTS_DIR (plots/fatality_plots)
map_plot = evaluate_clusters.plot_highest_fatality(crash_data_df, rank=4, distance=50, save_path=PLOTS_DIR)

Fatality Rate: 0.45


**5. US 90 West of Del Rio**

US 90 approximately 90 miles west of Del Rio is the fifth most dangerous cluster in Texas. Four of the top five clusters are in rural areas of West Texas, which have much less traffic volume than the denser urban areas of the state. This study could perhaps be improved by separating rural and urban areas when performing the clustering analysis. Although urban clusters might have lower fatality rates, there could still be good justification for design improvements in these areas to make Texas roads safer.

In [7]:
# plot the cluster and save the plot to PLOTS_DIR (plots/fatality_plots)
map_plot = evaluate_clusters.plot_highest_fatality(crash_data_df, rank=5, distance=50, save_path=PLOTS_DIR)

Fatality Rate: 0.44
