# Create a Subset of the Data for Testing

This notebook loads `combined_cleaned_group_crash.csv` from an S3 bucket and extracts a subset of the data containing all crashes in four regions. The original data contains 2,884,080 crashes, so the region subsets were defined to keep local development of functions and modules simple.

The results were written to S3 under the key `processed_pandas_data/{region}_crash_data.csv` and to a local CSV file stored in the `data/pandas_data/` folder with the same file naming convention.

**Four Subset Regions:**

1. Boerne, TX: 3,111 crashes
2. Downtown San Antonio, TX: 4,326 crashes
3. Sugar Land, TX: 20,046 crashes
4. Downtown Austin, TX: 4,326 crashes

**Warning**

The dataset is approximately 1.3 GB, so loading all of it into a DataFrame takes a long time on my local machine.
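One way to soften the load time and memory cost of a file this size is to read it in chunks and keep only the rows of interest. The sketch below is not how this notebook loads the data; `load_filtered` and the tiny in-memory CSV with made-up rows are assumptions for illustration, while `chunksize` is a standard `pd.read_csv` parameter.

```python
import io

import pandas as pd


def load_filtered(csv_source, keep_rows, chunksize=100_000):
    """Read a large CSV in chunks, keeping only rows where keep_rows(chunk) is True.

    Hypothetical helper: `keep_rows` takes a chunk (DataFrame) and returns a boolean mask.
    """
    kept = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        kept.append(chunk[keep_rows(chunk)])
    return pd.concat(kept, ignore_index=True)


# Tiny in-memory stand-in for the real file (made-up rows).
sample_csv = io.StringIO(
    "Crash_ID,Crash_Fatal_Fl\n"
    "1,N\n"
    "2,Y\n"
    "3,N\n"
    "4,Y\n"
)
fatal_only = load_filtered(sample_csv, lambda df: df["Crash_Fatal_Fl"] == "Y", chunksize=2)
print(fatal_only)
```

Only the filtered rows of each chunk stay in memory, so peak usage depends on the chunk size rather than on the full file.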

In [1]:
from pathlib import Path
import sys

# get root directory of project
ROOT_DIR = Path.cwd().parent
print(f"ROOT_DIR: {ROOT_DIR}")

# Add the root directory to sys.path
sys.path.append(str(ROOT_DIR))

from src import pandas_clustering
import config

DATA_DIR = ROOT_DIR / 'data'
print(f"DATA_DIR: {DATA_DIR}")

ROOT_DIR: c:\Users\alipe\OneDrive\Desktop\Rice_Classes\COMP643\final_project_git
DATA_DIR: c:\Users\alipe\OneDrive\Desktop\Rice_Classes\COMP643\final_project_git\data


**Read in the Crash Data**

In [2]:
s3_url = config.S3_RAW_DATA_URL
crash_data_pandas = pandas_clustering.read_crash_data(s3_url)

  return  pd.read_csv(csv_buffer)
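The body of `pandas_clustering.read_crash_data` is not shown in this notebook, but the output fragment above suggests that it passes an in-memory buffer to `pd.read_csv`. A minimal sketch of that pattern follows, assuming the object's bytes have already been downloaded. The function name `read_csv_bytes` is hypothetical, and `low_memory=False` is an optional choice here, not something the project is known to use.

```python
import io

import pandas as pd


def read_csv_bytes(raw: bytes) -> pd.DataFrame:
    """Parse CSV bytes (e.g. the body of a downloaded object) into a DataFrame."""
    csv_buffer = io.BytesIO(raw)
    # low_memory=False lets pandas infer each column's dtype from the whole
    # column at once, at the cost of higher peak memory while parsing.
    return pd.read_csv(csv_buffer, low_memory=False)


# Two rows taken from the head() output in this notebook, reduced to two columns.
raw = b"Crash_ID,Crash_Date\n15657177,06/02/2019\n16406486,05/09/2019\n"
df = read_csv_bytes(raw)
print(df.dtypes)
```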


**Columns**

Inspect the columns that exist in the pandas DataFrame.

In [3]:
crash_data_pandas.head()

Unnamed: 0,Crash_ID,Crash_Fatal_Fl,Cmv_Involv_Fl,Schl_Bus_Fl,Rr_Relat_Fl,Medical_Advisory_Fl,Amend_Supp_Fl,Active_School_Zone_Fl,Crash_Date,Crash_Time,...,Investigator_Narrative,Secondary_Crash_Fl,Rpt_Dir_Traffic,Rpt_Sec_Speed_Limit,Investigat_Time_Rwy_Clrd,Investigat_Time_Scn_Clrd,Investigat_Notify_Date,Investigat_Arrv_Date,Investigat_Date_Rwy_Clrd,Investigat_Date_Scn_Clrd
0,15657177,N,N,N,N,N,Y,N,06/02/2019,12:58 PM,...,,,,,,,,,,
1,16406486,N,N,N,N,N,Y,N,05/09/2019,03:22 PM,...,,,,,,,,,,
2,16473665,N,N,N,N,N,Y,N,06/15/2019,11:00 AM,...,,,,,,,,,,
3,16871051,N,Y,N,N,N,Y,N,06/12/2019,09:53 AM,...,,,,,,,,,,
4,16995273,N,N,N,N,N,Y,N,05/01/2019,02:40 PM,...,,,,,,,,,,


## Filter Rows to Only Contain Crashes from 2022

In [4]:
crash_data_2022 = pandas_clustering.filter_by_year(crash_data_pandas, 2022)

DataFrame saved to txdot-crash-data-rice/processed_pandas_data/crash_data_2022 in s3
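The implementation of `filter_by_year` lives in `src/pandas_clustering` and is not reproduced here; judging by its output, it also saves the result to S3. The filtering step alone might look like the sketch below. The name `filter_rows_by_year`, the `date_col` parameter, and the sample rows are assumptions. The date format handles both the zero-padded (`06/02/2019`) and unpadded (`8/12/2023`) styles that appear in `Crash_Date`.

```python
import pandas as pd


def filter_rows_by_year(df: pd.DataFrame, year: int, date_col: str = "Crash_Date") -> pd.DataFrame:
    """Return the rows whose date column falls in the given year.

    Hypothetical helper: unparseable dates become NaT and are dropped by the filter.
    """
    dates = pd.to_datetime(df[date_col], format="%m/%d/%Y", errors="coerce")
    return df[dates.dt.year == year]


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "Crash_ID": [1, 2, 3],
        "Crash_Date": ["06/02/2019", "3/15/2022", "8/12/2023"],
    }
)
print(filter_rows_by_year(sample, 2022))
```

The filtered frame keeps the original index labels, which matches the non-zero-based index visible in the `info()` output of the 2022 subset.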


In [5]:
crash_data_2022.info()

<class 'pandas.core.frame.DataFrame'>
Index: 635424 entries, 1828859 to 2464282
Columns: 180 entries, Crash_ID to Investigat_Date_Scn_Clrd
dtypes: datetime64[ns](1), float64(79), int64(35), object(65)
memory usage: 877.5+ MB
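The `+` in the memory figure means pandas did not measure the string contents of the `object` columns; `crash_data_2022.info(memory_usage="deep")` reports the full amount. Many of those columns are `Y`/`N` flags, so one possible way to shrink the frame is to store them as `category`. The sketch below is an optional idea, not a step in this notebook, and `flags_to_category` and its `_Fl` suffix rule are assumptions based on the column names shown earlier.

```python
import pandas as pd


def flags_to_category(df: pd.DataFrame, suffix: str = "_Fl") -> pd.DataFrame:
    """Return a copy in which columns whose name ends in `suffix` are stored as category."""
    out = df.copy()
    for col in out.columns:
        if str(col).endswith(suffix):
            out[col] = out[col].astype("category")
    return out


# Made-up sample: 1,000 rows with one flag column.
sample = pd.DataFrame({"Crash_ID": range(1000), "Crash_Fatal_Fl": ["N", "Y"] * 500})
before = sample.memory_usage(deep=True).sum()
after = flags_to_category(sample).memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```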


## Create Regions

I am going to create 4 small regions to test the clustering model on:

1. Boerne, TX
2. Downtown San Antonio, TX
3. Sugar Land, TX
4. Downtown Austin, TX

**Boerne, TX**

In [None]:
lat_bounds = (29.7, 29.8)  # example latitude range
long_bounds = (-98.8, -98.6)  # example longitude range
region_filename = 'boerne_crash_data.csv'

pandas_clustering.create_region_csv(crash_data_pandas, lat_bounds, long_bounds, region_filename)
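The source of `create_region_csv` is not shown in this notebook. Given its arguments, it presumably keeps the crashes inside a latitude/longitude bounding box and then writes them out. The filtering step could be sketched as follows; the function name `filter_bounding_box`, the column names `Latitude` and `Longitude`, and the sample points are assumptions, and writing to S3 or disk is left out.

```python
import pandas as pd


def filter_bounding_box(df, lat_bounds, long_bounds, lat_col="Latitude", long_col="Longitude"):
    """Keep rows whose coordinates fall inside the inclusive bounding box.

    Hypothetical helper: rows with missing coordinates are excluded.
    """
    in_lat = df[lat_col].between(*lat_bounds)
    in_long = df[long_col].between(*long_bounds)
    return df[in_lat & in_long]


# Made-up points: one inside the Boerne box, one outside, one with a missing latitude.
points = pd.DataFrame(
    {
        "Crash_ID": [1, 2, 3],
        "Latitude": [29.75, 29.42, None],
        "Longitude": [-98.70, -98.49, -98.70],
    }
)
boerne = filter_bounding_box(points, (29.7, 29.8), (-98.8, -98.6))
print(boerne)
```

`Series.between` is inclusive on both ends by default, so crashes exactly on the boundary are kept.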

**Downtown San Antonio, TX**

In [6]:
lat_bounds = (29.4, 29.45)  #  latitude range
long_bounds = (-98.5, -98.45)  #  longitude range
region_filename = 'north_san_antonio_crash_data.csv'

pandas_clustering.create_region_csv(crash_data_pandas, lat_bounds, long_bounds, region_filename)

DataFrame saved to txdot-crash-data/north_san_antonio_crash_data.csv


Unnamed: 0,Crash_ID,Crash_Fatal_Fl,Cmv_Involv_Fl,Schl_Bus_Fl,Rr_Relat_Fl,Medical_Advisory_Fl,Amend_Supp_Fl,Active_School_Zone_Fl,Crash_Date,Crash_Time,...,Investigator_Narrative,Secondary_Crash_Fl,Rpt_Dir_Traffic,Rpt_Sec_Speed_Limit,Investigat_Time_Rwy_Clrd,Investigat_Time_Scn_Clrd,Investigat_Notify_Date,Investigat_Arrv_Date,Investigat_Date_Rwy_Clrd,Investigat_Date_Scn_Clrd
381,17030974,N,N,N,N,N,N,N,04/24/2019,07:00 PM,...,,,,,,,,,,
651,17031991,N,N,N,N,N,N,N,04/24/2019,06:30 PM,...,,,,,,,,,,
808,17032423,N,N,N,N,N,N,N,04/24/2019,08:30 PM,...,,,,,,,,,,
884,17032586,N,N,N,N,N,N,N,04/24/2019,05:41 PM,...,,,,,,,,,,
1047,17032970,N,N,N,N,N,N,N,04/26/2019,02:54 AM,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2890029,19734240,N,N,N,N,N,N,N,8/12/2023,4:55 AM,...,,,,,,,,,,
2890107,19735273,N,N,N,N,N,N,N,8/16/2023,8:25 AM,...,,,,,,,,,,
2890119,19735477,N,N,N,N,N,N,N,8/17/2023,8:51 AM,...,,,,,,,,,,
2890328,19738863,N,N,N,N,N,N,N,8/16/2023,11:00 AM,...,,,,,,,,,,


**Sugar Land, TX**

In [7]:
lat_bounds = (29.55, 29.65)  #  latitude range
long_bounds = (-95.7, -95.5)  #  longitude range
region_filename = 'sugarland_crash_data.csv'

pandas_clustering.create_region_csv(crash_data_pandas, lat_bounds, long_bounds, region_filename)

DataFrame saved to txdot-crash-data/sugarland_crash_data.csv


Unnamed: 0,Crash_ID,Crash_Fatal_Fl,Cmv_Involv_Fl,Schl_Bus_Fl,Rr_Relat_Fl,Medical_Advisory_Fl,Amend_Supp_Fl,Active_School_Zone_Fl,Crash_Date,Crash_Time,...,Investigator_Narrative,Secondary_Crash_Fl,Rpt_Dir_Traffic,Rpt_Sec_Speed_Limit,Investigat_Time_Rwy_Clrd,Investigat_Time_Scn_Clrd,Investigat_Notify_Date,Investigat_Arrv_Date,Investigat_Date_Rwy_Clrd,Investigat_Date_Scn_Clrd
182,17030052,N,N,N,N,N,N,N,04/24/2019,08:03 AM,...,,,,,,,,,,
369,17030918,N,N,N,N,N,N,N,04/24/2019,01:55 PM,...,,,,,,,,,,
531,17031598,N,N,N,N,N,N,N,04/24/2019,09:14 AM,...,,,,,,,,,,
997,17032854,N,N,N,N,N,N,N,04/25/2019,04:00 PM,...,,,,,,,,,,
1334,17033659,N,N,N,N,N,N,N,04/25/2019,03:12 PM,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2888974,19727636,N,N,N,N,N,N,N,8/17/2023,5:20 PM,...,,,,,,,,,,
2889351,19730377,N,N,N,N,N,N,N,8/18/2023,8:57 PM,...,,,,,,,,,,
2889750,19731594,N,N,N,N,N,N,N,8/4/2023,6:39 PM,...,,,,,,,,,,
2889982,19733825,N,N,N,N,N,N,N,8/16/2023,4:16 PM,...,,,,,,,,,,


**Downtown Austin, TX**

In [8]:
lat_bounds = (30.25, 30.35)  #  latitude range
long_bounds = (-97.73, -97.63)  #  longitude range
region_filename = 'austin_crash_data.csv'

pandas_clustering.create_region_csv(crash_data_pandas, lat_bounds, long_bounds, region_filename)

DataFrame saved to txdot-crash-data/austin_crash_data.csv


Unnamed: 0,Crash_ID,Crash_Fatal_Fl,Cmv_Involv_Fl,Schl_Bus_Fl,Rr_Relat_Fl,Medical_Advisory_Fl,Amend_Supp_Fl,Active_School_Zone_Fl,Crash_Date,Crash_Time,...,Investigator_Narrative,Secondary_Crash_Fl,Rpt_Dir_Traffic,Rpt_Sec_Speed_Limit,Investigat_Time_Rwy_Clrd,Investigat_Time_Scn_Clrd,Investigat_Notify_Date,Investigat_Arrv_Date,Investigat_Date_Rwy_Clrd,Investigat_Date_Scn_Clrd
124,17029759,N,N,N,N,N,N,N,04/24/2019,12:21 AM,...,,,,,,,,,,
127,17029763,N,N,N,N,N,N,N,04/24/2019,08:11 AM,...,,,,,,,,,,
154,17029849,N,N,N,N,N,N,N,04/24/2019,09:35 AM,...,,,,,,,,,,
156,17029851,N,Y,N,N,N,N,N,04/24/2019,10:35 AM,...,,,,,,,,,,
1044,17032955,N,N,N,N,N,Y,N,04/24/2019,11:56 PM,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2889930,19733143,N,N,N,N,N,N,N,8/13/2023,3:45 AM,...,,,,,,,,,,
2890314,19738559,N,N,N,N,N,N,N,8/5/2023,11:23 PM,...,,,,,,,,,,
2890350,19739194,N,N,N,N,N,N,N,8/4/2023,1:55 PM,...,,,,,,,,,,
2890602,19742914,N,N,N,N,N,N,N,7/31/2023,7:16 PM,...,,,,,,,,,,
