# Data Cleaning Notebook

Because the dataset is so large, cleaning it took many careful steps. This introduction gives a high-level overview of those steps. The granular details, along with the justification for each decision, are in the corresponding sections of this notebook.

## Overview

I cleaned each of the three datasets (crashes_df, people_df, and vehicles_df) with the primary goals of making the data more manageable and reducing noise by eliminating unnecessary features and handling null values.

Throughout the cleaning process, I kept my target variable, most_severe_injury, in focus. Since most_severe_injury is a crash-level feature, the analysis in this project is centered on crash-level data. This distinction was especially important when working with people_df (person-level data) and vehicles_df (vehicle-level data), because merging these datasets with crashes_df required careful attention to avoid one-to-many and many-to-many relationships that could skew feature values. Without aggregation, a single crash involving multiple people would be duplicated once for each person in people_df, which overrepresents that crash and inflates counts, averages, and totals for features such as the “number of injured persons per crash.” Proper aggregation ensures that each crash appears only once in the merged dataset.

To address this, I aggregated the cleaned people_df and vehicles_df into people_aggregated and vehicles_aggregated, respectively, before merging them with crashes_cleaned. This ensured a smooth merge process that maintained a one-to-one relationship with crash_record_id across the combined dataset.
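The aggregate-then-merge pattern can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual aggregation: the `age` column and the `num_people` and `mean_age` summaries are hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-ins for the cleaned tables; every column except
# crash_record_id is hypothetical.
crashes = pd.DataFrame({
    "crash_record_id": ["a", "b"],
    "most_severe_injury": ["fatal", "no indication of injury"],
})
people = pd.DataFrame({
    "crash_record_id": ["a", "a", "a", "b"],
    "age": [25, 40, 31, 52],
})

# Collapse person-level rows to one row per crash before merging.
people_agg = (
    people.groupby("crash_record_id")
          .agg(num_people=("age", "size"), mean_age=("age", "mean"))
          .reset_index()
)

# validate="one_to_one" raises a MergeError if either side repeats a key,
# so an accidental row-multiplying merge fails loudly.
merged = crashes.merge(people_agg, on="crash_record_id", how="left",
                       validate="one_to_one")
```

The `validate` argument is a cheap guard: if the aggregation step were skipped, the merge would raise an error rather than silently duplicating crashes.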

To simplify the analysis, I then recoded the target variable (most_severe_injury) as a binary classification: serious injuries (fatalities and incapacitating injuries) versus non-serious injuries (minor or no injuries). This focused the modeling on predicting significant crash outcomes and gave a clearer, more manageable classification problem.
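A binary recoding of this kind might look like the sketch below. The label strings are assumptions made for illustration; the real labels should be checked with `.value_counts()` first.

```python
import pandas as pd

# Assumed label strings, for illustration only.
serious_labels = {"fatal", "incapacitating injury"}

injuries = pd.Series([
    "fatal",
    "no indication of injury",
    "incapacitating injury",
    "nonincapacitating injury",
])

# 1 = serious (fatal or incapacitating), 0 = everything else.
is_serious = injuries.isin(serious_labels).astype(int)
```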

The final merged dataset still contained over 600,000 records, which posed computational challenges. To address this, I employed stratified random sampling to create a representative subset of the data. This approach preserved the proportional distribution of classes in my target variable, most_severe_injury. To ensure reproducibility, I set a fixed random_state during sampling.
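One common way to draw a stratified sample is scikit-learn's `train_test_split` with the `stratify` argument. The sketch below uses toy data and a hypothetical `target` column, and it is not necessarily the exact call used later in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 90/10 class imbalance; "target" is a hypothetical column.
df = pd.DataFrame({
    "feature": range(1000),
    "target": [1] * 100 + [0] * 900,
})

# Keep 20% of the rows while preserving the class proportions.
# A fixed random_state makes the sample reproducible.
sample, _ = train_test_split(df, train_size=0.2, stratify=df["target"],
                             random_state=42)
```

Here the sample keeps the 10% share of the minority class, which a plain random sample would only match approximately.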

The resulting stratified sample, ready for modeling, was uploaded to Kaggle to streamline reproducibility and accessibility for others.

## 1.0 Importing Necessary Libraries

In [1]:
# for getting data
import os
import zipfile
import json
from pathlib import Path

# for managing data
import gc

# for checking runtime
import time

# for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
%matplotlib inline

# for feature selection
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

### 1.1 Environment Setup and Data Loading

For reproducibility, the next two code blocks allow the user to download the data for this project from Kaggle.

Please be aware that running this code requires you to enter your Kaggle username and API key. The download will not succeed unless you provide accurate information.

To create a Kaggle account, [click here](https://www.kaggle.com/account/login).

For more details on obtaining your Kaggle API key, [click here](https://github.com/Kaggle/kaggle-api/blob/main/docs/README.md).

In [2]:
# Get the Kaggle username and API key from user input
from getpass import getpass

os.environ["KAGGLE_USERNAME"] = input("Enter Kaggle username: ")
# getpass hides the key while it is typed, so it is not echoed into the notebook output
os.environ["KAGGLE_KEY"] = getpass("Enter Kaggle key: ")

# Detect the environment (Google Colab or local machine)
if 'google.colab' in str(get_ipython()):
    # For Google Colab, use the /root/.kaggle directory
    kaggle_path = Path('/root/.kaggle')
else:
    # For local machine, use the home directory
    kaggle_path = Path.home() / '.kaggle'

# Create the .kaggle directory if it doesn't exist
os.makedirs(kaggle_path, exist_ok=True)

# Create the kaggle.json file with the correct API credentials
kaggle_json = {
    "username": os.getenv("KAGGLE_USERNAME"),
    "key": os.getenv("KAGGLE_KEY")
}

# Write the kaggle.json file in the correct location
with open(kaggle_path / 'kaggle.json', 'w') as f:
    json.dump(kaggle_json, f)

# Set file permissions to secure the API key (optional but recommended)
os.chmod(kaggle_path / 'kaggle.json', 0o600)

# Check if the credentials are set correctly (for debugging purposes)
print("Kaggle credentials are set up successfully.")

Enter Kaggle username: ckucewicz
Enter Kaggle key: [redacted]
Kaggle credentials are set up successfully.


In [3]:
# Step 2: Download dataset using Kaggle API

# Set dataset identifier and download path
dataset_identifier = 'ckucewicz/Chicago-Traffic-Data'
download_path = Path('./data')  # Local or Colab download folder

# Ensure the download path exists
os.makedirs(download_path, exist_ok=True)

# Detect the environment and run the appropriate download command
if 'google.colab' in str(get_ipython()):
    print("Downloading dataset in Google Colab...")
    !kaggle datasets download -d {dataset_identifier} --path {download_path}
else:
    print("Downloading dataset in Jupyter Notebook...")
    os.system(f"kaggle datasets download -d {dataset_identifier} --path {download_path}")

# Step 3: Unzip the dataset
zip_filename = download_path / 'Chicago-Traffic-Data.zip'  # Adjust the ZIP filename
unzip_path = download_path / 'Chicago-Traffic-Data'

# Ensure the extraction path exists
os.makedirs(unzip_path, exist_ok=True)

# Unzip the dataset
try:
    with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
        zip_ref.extractall(unzip_path)
    print(f"Dataset extracted to: {unzip_path}")
except FileNotFoundError:
    print(f"Error: {zip_filename} not found. Ensure the dataset was downloaded successfully.")

# Step 4: Load CSV files into pandas DataFrames
csv_files = {
    'people': 'chicago_traffic_data/people.csv',
    'traffic_crashes': 'chicago_traffic_data/traffic_crashes.csv',
    'vehicles': 'chicago_traffic_data/vehicles.csv',
}

# Initialize a dictionary to store DataFrames
dataframes = {}

for key, relative_path in csv_files.items():
    csv_path = unzip_path / relative_path  # Create the full path
    print(f"Loading {key} from {csv_path}...")
    
    try:
        # Load CSV into pandas DataFrame
        dataframes[key] = pd.read_csv(csv_path, low_memory=True)
        print(f"{key} DataFrame loaded successfully.")
    except FileNotFoundError:
        print(f"Error: {relative_path} not found in the extracted files.")

Downloading dataset in Jupyter Notebook...

100%|██████████| 394M/394M [00:09<00:00, 44.1MB/s]

Dataset URL: https://www.kaggle.com/datasets/ckucewicz/Chicago-Traffic-Data
License(s): apache-2.0
Downloading Chicago-Traffic-Data.zip to data

Dataset extracted to: data/Chicago-Traffic-Data
Loading people from data/Chicago-Traffic-Data/chicago_traffic_data/people.csv...
people DataFrame loaded successfully.
Loading traffic_crashes from data/Chicago-Traffic-Data/chicago_traffic_data/traffic_crashes.csv...
traffic_crashes DataFrame loaded successfully.
Loading vehicles from data/Chicago-Traffic-Data/chicago_traffic_data/vehicles.csv...
vehicles DataFrame loaded successfully.


In [4]:
# stores each dataset in its own variable
people_df = dataframes['people']
traffic_crashes_df = dataframes['traffic_crashes']
vehicles_df = dataframes['vehicles']

## 2.0 Data Cleaning

### 2.1 Crashes

The crashes dataset included my target variable, so I was careful with how I handled data cleaning. 

My steps included:
* **Preview Data**: `.head()`
* **Understand Dataset Structure**: `.info()`
* **Format Feature Names and Row Values**: `.lower()`
* **Drop Features with Overly High Null Values**: `.isna().sum() / len(df) * 100` for the percentage of nulls in each feature
* **Check for Duplicates**: `.duplicated().sum()`
* **Keep or Drop Features with Remaining Nulls**:
    * make intentional decisions to keep or drop using the `.value_counts()` distribution and domain knowledge
* **Inspect Remaining Features**: `.value_counts()`
    * make intentional decisions to keep or drop using the `.value_counts()` distribution and domain knowledge
    * make note of any features to keep that will need cleaning, cardinality reduction, etc.
* **Remove Features That Are Not Useful**:
    * `.drop()` the list of features deemed not useful for analysis
    * store the trimmed DataFrame as 'df_name_cleaned'
* **Reduce Feature Cardinality with Label Reclassification**:
    * `trafficway_type` and `lane_cnt`
    * `crash_hour`, `crash_day_of_week`, `crash_month`
    * `posted_speed_limit`
    * `traffic_control_device`
    * `prim_contributory_cause`
    * `most_severe_injury`
* **Convert Data Types**:
    * store each feature with the data type that reflects its true type (text variables as strings, numeric variables as int, categorical variables as category, etc.)
* **Remove Remaining Nulls**: `.dropna()`

#### 2.1.1 Preview Data: `.head()`

In [5]:
# previews the first 5 rows of the dataframe
traffic_crashes_df.head()

Unnamed: 0,CRASH_RECORD_ID,CRASH_DATE_EST_I,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,...,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION
0,6c1659069e9c6285a650e70d6f9b574ed5f64c12888479...,,08/18/2023 12:50:00 PM,15,OTHER,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,OTHER,...,1.0,0.0,1.0,0.0,12,6,8,,,
1,5f54a59fcb087b12ae5b1acff96a3caf4f2d37e79f8db4...,,07/29/2023 02:45:00 PM,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,DIVIDED - W/MEDIAN (NOT RAISED),...,0.0,0.0,1.0,0.0,14,7,7,41.85412,-87.665902,POINT (-87.665902342962 41.854120262952)
2,61fcb8c1eb522a6469b460e2134df3d15f82e81fd93e9c...,,08/18/2023 05:58:00 PM,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PEDALCYCLIST,NOT DIVIDED,...,1.0,0.0,1.0,0.0,17,6,8,41.942976,-87.761883,POINT (-87.761883496974 41.942975745006)
3,004cd14d0303a9163aad69a2d7f341b7da2a8572b2ab33...,,11/26/2019 08:38:00 AM,25,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PEDESTRIAN,ONE-WAY,...,0.0,0.0,1.0,0.0,8,3,11,,,
4,a1d5f0ea90897745365a4cbb06cc60329a120d89753fac...,,08/18/2023 10:45:00 AM,20,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,FIXED OBJECT,OTHER,...,0.0,0.0,1.0,0.0,10,6,8,,,


#### 2.1.2 Understand Structure: `.info()`

In [6]:
# provides info about the dataframe features, non-null values, and datatypes
traffic_crashes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 901446 entries, 0 to 901445
Data columns (total 48 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   CRASH_RECORD_ID                901446 non-null  object 
 1   CRASH_DATE_EST_I               66531 non-null   object 
 2   CRASH_DATE                     901446 non-null  object 
 3   POSTED_SPEED_LIMIT             901446 non-null  int64  
 4   TRAFFIC_CONTROL_DEVICE         901446 non-null  object 
 5   DEVICE_CONDITION               901446 non-null  object 
 6   WEATHER_CONDITION              901446 non-null  object 
 7   LIGHTING_CONDITION             901446 non-null  object 
 8   FIRST_CRASH_TYPE               901446 non-null  object 
 9   TRAFFICWAY_TYPE                901446 non-null  object 
 10  LANE_CNT                       199022 non-null  float64
 11  ALIGNMENT                      901446 non-null  object 
 12  ROADWAY_SURFACE_COND          

#### 2.1.3 Format Feature Names and Row Values: `.lower()`

From the first two steps of previewing the data and understanding its structure, two things stand out: there is a lot of data, and it is messy, with varying amounts of null values in each feature. I made the feature names lowercase simply for readability. As a precursor cleaning step, I also made all string values lowercase, in the hope that this would resolve some inconsistent label spellings.

In [7]:
# converts all feature names to lower case 
traffic_crashes_df.columns = traffic_crashes_df.columns.str.lower()

In [8]:
# Convert all string values in object columns to lowercase
for col in traffic_crashes_df.select_dtypes(include='object').columns:
    traffic_crashes_df[col] = traffic_crashes_df[col].str.lower()
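As a small illustration of what lowercasing can and cannot fix, the sketch below normalizes a toy label column. The added `.str.strip()` call is an assumption for illustration and is not part of the cell above.

```python
import pandas as pd

labels = pd.Series(["Clear", "CLEAR", "clear ", "Rain", None])

# Lowercasing merges case variants; strip() additionally removes stray
# whitespace. Neither step repairs genuine misspellings.
normalized = labels.str.lower().str.strip()
```

Missing values pass through `.str` methods unchanged, so nulls are preserved for the later null-handling steps.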

#### 2.1.4 Drop features with overly high null values

While there are a lot of records in this DataFrame, some features also have a lot of null values. Features in which a significant majority of values are null will not be helpful for analysis, so I started by identifying these features and removing them right off the bat. Rather than look at the count of nulls, I used `(.isna().sum() / len(df)) * 100` to get the percentage of nulls for each feature. This is easier to grasp than null counts in the tens and hundreds of thousands.

For quick cleaning, I chose a 90% null threshold at which features are removed automatically. For features with 60-90% nulls, I decided to inspect them more closely before blindly removing them.
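The two thresholds can be sketched on a toy DataFrame as follows. The column names here are hypothetical; `isna().mean()` is an equivalent shorthand for `isna().sum() / len(df)`.

```python
import numpy as np
import pandas as pd

# Toy columns with 95%, 70%, and 0% nulls (names are hypothetical).
df = pd.DataFrame({
    "mostly_null": [np.nan] * 19 + [1],
    "some_null": [np.nan] * 14 + [1] * 6,
    "complete": range(20),
})

# Percentage of nulls per column.
null_pct = df.isna().mean() * 100

# 90% or more: drop automatically. 60% to under 90%: inspect first.
drop_now = null_pct[null_pct >= 90].index.tolist()
inspect_first = null_pct[(null_pct >= 60) & (null_pct < 90)].index.tolist()
```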

In [9]:
# stores the percentages of null values for each feature, rounded to 2 places, in a variable
missing_percentage = round((traffic_crashes_df.isna().sum()/len(traffic_crashes_df)*100), 2)
missing_percentage

crash_record_id                   0.00
crash_date_est_i                 92.62
crash_date                        0.00
posted_speed_limit                0.00
traffic_control_device            0.00
device_condition                  0.00
weather_condition                 0.00
lighting_condition                0.00
first_crash_type                  0.00
trafficway_type                   0.00
lane_cnt                         77.92
alignment                         0.00
roadway_surface_cond              0.00
road_defect                       0.00
report_type                       3.11
crash_type                        0.00
intersection_related_i           77.03
not_right_of_way_i               95.45
hit_and_run_i                    68.64
damage                            0.00
date_police_notified              0.00
prim_contributory_cause           0.00
sec_contributory_cause            0.00
street_no                         0.00
street_direction                  0.00
street_name              

In [11]:
# selects all features in which 90% or more of the values are null
high_null_features = traffic_crashes_df.columns[(traffic_crashes_df.isna().sum() / len(traffic_crashes_df) * 100) >= 90]
high_null_features

Index(['crash_date_est_i', 'not_right_of_way_i', 'photos_taken_i',
       'statements_taken_i', 'dooring_i', 'work_zone_i', 'work_zone_type',
       'workers_present_i'],
      dtype='object')

In [12]:
# creating a list of features with 90% or more null values
high_null_features_list = list(high_null_features)
high_null_features_list

['crash_date_est_i',
 'not_right_of_way_i',
 'photos_taken_i',
 'statements_taken_i',
 'dooring_i',
 'work_zone_i',
 'work_zone_type',
 'workers_present_i']

In [13]:
# drops all features with 90% or more null values and saves the new DataFrame as crashes_cleaned
crashes_cleaned = traffic_crashes_df.drop(columns=high_null_features_list)

#### 2.1.5 Check for duplicates: `.duplicated().sum()`

Checking for duplicates is another common cleaning step. Because each crash should appear exactly once, I checked for repeated `crash_record_id` values. In this case, there were none.

In [14]:
# Check for duplicate crash_record_id values in crashes_cleaned
print(f"Number of duplicate crash_record_id values in crashes_cleaned: {crashes_cleaned['crash_record_id'].duplicated().sum()}")

Number of duplicate crash_record_id values in crashes_cleaned: 0
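Duplicate keys and duplicate rows are different checks, and the sketch below shows both on toy data. The `damage` values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "crash_record_id": ["a", "b", "b"],
    "damage": ["over $1,500", "$500 or less", "over $1,500"],
})

# Rows that are identical in every column.
full_row_dupes = df.duplicated().sum()

# Rows that repeat an ID, even if other columns differ.
key_dupes = df["crash_record_id"].duplicated().sum()
```

In this toy data the ID `b` appears twice with different values, so the key check flags it while the full-row check does not.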


#### 2.1.6 Keep or drop features with remaining nulls

As mentioned earlier, I wanted to at least look at the features with 60-90% null values before removing them, so I inspected the `value_counts()` for these three features:
* `hit_and_run_i` describes the aftermath of the crash, not a contributing factor, so I dropped it.
* `intersection_related_i` initially seemed like a good feature to keep, but on closer inspection it is vague. CDOT's definition, "A field observation by the police officer whether an intersection played a role in the crash. Does not represent whether or not the crash occurred within the intersection", makes me confident in removing it: the value is ultimately up to the officer's discretion, which may not lead to useful analysis, so I dropped it.
* `lane_cnt`: Despite the high percentage of null values, I decided to keep this. My domain knowledge suggests that the number of lanes could affect how cautiously drivers drive. Drivers generally drive more cautiously down a narrow one- or two-lane road than on an open six-lane freeway.

Once I had created a new DataFrame that leaves out the features I removed, I no longer needed the original dataset, so I deleted it to free the large amount of memory it occupied.

In [15]:
# saves all features with between 60% and 90% null values as a variable
null_pct = traffic_crashes_df.isna().sum() / len(traffic_crashes_df) * 100
medium_null_features = traffic_crashes_df.columns[(null_pct >= 60) & (null_pct < 90)]

In [16]:
# converts the above variable to list type
medium_null_features_list = list(medium_null_features)

In [17]:
# iterates through list of medium null features and prints the value_counts() for each
for feature in medium_null_features_list:
    print(f"Value counts for column '{feature}':")
    print(crashes_cleaned[feature].value_counts())
    print("-"* 32)

Value counts for column 'lane_cnt':
2.0          91162
4.0          49589
1.0          32550
3.0           8678
0.0           8032
6.0           4502
5.0           1940
8.0           1908
7.0            184
10.0           162
99.0           108
9.0             66
11.0            30
12.0            29
20.0            15
22.0            13
15.0             7
16.0             7
14.0             5
30.0             5
40.0             4
60.0             3
21.0             3
25.0             2
100.0            2
902.0            1
24.0             1
80.0             1
218474.0         1
45.0             1
17.0             1
299679.0         1
19.0             1
400.0            1
13.0             1
1191625.0        1
35.0             1
433634.0         1
41.0             1
28.0             1
44.0             1
Name: lane_cnt, dtype: int64
--------------------------------
Value counts for column 'intersection_related_i':
y    197181
n      9873
Name: intersection_related_i, dtype: int64
------

In [18]:
# deletes the traffic_crashes_df dataframe to clear up memory

del traffic_crashes_df

In [19]:
# Perform garbage collection to free up memory by releasing unreferenced objects
# This helps to manage memory usage, especially when working with large datasets or memory-intensive operations.

gc.collect()

84
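To see how much memory a DataFrame holds before deleting it, `memory_usage(deep=True)` can be used. The sketch below runs on a toy DataFrame, not on the crash data.

```python
import gc

import pandas as pd

df = pd.DataFrame({"text": ["some long string value"] * 10_000})

# deep=True measures the string contents, not just the column's references.
size_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"DataFrame uses about {size_mb:.2f} MB")

# Drop the only reference, then ask the garbage collector to run.
del df
gc.collect()
```

The integer that `gc.collect()` returns is the number of unreachable objects it found, which is what the `84` above reports.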

#### 2.1.7 Inspect remaining features: `.value_counts()`

Once the columns with high null percentages were taken care of, I used `.value_counts()` to inspect many of the remaining columns for cardinality, misspellings, class labels, stored data types, and so on. I used domain knowledge to avoid looking through every single feature.

Some features that I decided to keep because they might help predict my target had unclear labels or a high number of labels (high cardinality). To make sure these features were both useful for analysis and ready for the modeling phase, I used domain knowledge to reclassify their labels into fewer, more easily understandable ones.

In [20]:
# prints the distribution of values within latitude feature
crashes_cleaned['latitude'].value_counts()

41.976201    1438
41.900959     817
41.791420     628
41.751461     615
41.722257     489
             ... 
41.812116       1
41.742553       1
41.757199       1
41.936622       1
41.951640       1
Name: latitude, Length: 319343, dtype: int64

In [21]:
# prints the distribution of values within longitude feature
crashes_cleaned['longitude'].value_counts()

-87.905309    1438
-87.619928     817
-87.580148     628
-87.585972     615
-87.585276     489
              ... 
-87.610788       1
-87.555370       1
-87.647808       1
-87.616396       1
-87.732269       1
Name: longitude, Length: 319308, dtype: int64

In [22]:
# prints the distribution of values within location feature
crashes_cleaned['location'].value_counts()

point (-87.905309125103 41.976201139024)    1438
point (-87.619928173678 41.900958919109)     817
point (-87.580147768689 41.791420282098)     628
point (-87.585971992965 41.751460603167)     615
point (-87.585275565077 41.722257273006)     489
                                            ... 
point (-87.639121523275 41.869503004763)       1
point (-87.602411667064 41.804009965883)       1
point (-87.674723215169 41.96342865936)        1
point (-87.755347401896 41.896861551221)       1
point (-87.732268503259 41.951640180538)       1
Name: location, Length: 319546, dtype: int64

While location data seems like it could be very insightful, its cardinality is simply too high to help with prediction, and it would strain my limited computing power. With more time, in future iterations of this project I hope to process the location data into a form that is insightful enough to keep. For the purposes of a minimum viable product, I will remove latitude, longitude, and location.
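As one possible direction for that future work, coordinates could be rounded into coarse grid cells to reduce cardinality. This is only a sketch of the idea, not something done in this notebook, and the `grid_cell` column name is hypothetical.

```python
import pandas as pd

coords = pd.DataFrame({
    "latitude": [41.976201, 41.976250, 41.900959],
    "longitude": [-87.905309, -87.905100, -87.619928],
})

# Rounding to 2 decimal places groups nearby points into cells of
# roughly 1 km at Chicago's latitude.
coords["grid_cell"] = (
    coords["latitude"].round(2).astype(str)
    + "_"
    + coords["longitude"].round(2).astype(str)
)
```

The first two points fall into the same cell, so three distinct locations become two categories.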

In [23]:
# prints the distribution of values within posted_speed_limit feature
crashes_cleaned['posted_speed_limit'].value_counts()

30    664045
35     59626
25     57789
20     37717
15     32112
10     21096
40      8612
0       7584
45      5951
5       4957
55       883
50       276
3        221
9         96
39        95
99        66
60        53
1         41
24        38
2         31
65        20
32        20
34        16
33        14
11        11
26        11
36         8
6          7
70         7
7          6
18         4
12         4
22         4
14         4
23         3
29         3
31         2
8          2
38         2
16         2
4          2
62         1
63         1
44         1
49         1
46         1
Name: posted_speed_limit, dtype: int64

This feature could be important, but it requires cardinality reduction. To make it easier to handle in modeling, I reclassified it into 3 speed groups: low (**25 mph and under**), medium (**26-40 mph**), and high (**over 40 mph**).

These cutoffs were intentional: reports from both Philadelphia's and Chicago's transportation departments cite speed as a key factor, with speeds greater than 40 mph resulting in a high probability of death for pedestrians who are hit.

In [24]:
# Categorize speed limits directly without using a function
crashes_cleaned['speed_limit_category'] = pd.cut(
    crashes_cleaned['posted_speed_limit'],
    bins=[-float('inf'), 25, 40, float('inf')],
    labels=['Low', 'Medium', 'High'],
    right=True
)

# Check the result
crashes_cleaned['speed_limit_category'].value_counts()

Medium    732454
Low       161731
High        7261
Name: speed_limit_category, dtype: int64
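With `right=True`, each bin includes its upper edge, so a 25 mph limit lands in 'Low' and a 40 mph limit lands in 'Medium'. A quick check on a few boundary values, using the same bins as the cell above:

```python
import pandas as pd

speeds = pd.Series([0, 25, 26, 40, 41, 99])

# Same bin edges and labels as above; right=True makes the intervals
# (-inf, 25], (25, 40], and (40, inf).
categories = pd.cut(
    speeds,
    bins=[-float("inf"), 25, 40, float("inf")],
    labels=["Low", "Medium", "High"],
    right=True,
)
```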

In [25]:
# prints the distribution of values within traffic_control_device feature
crashes_cleaned['traffic_control_device'].value_counts()

no controls                 510287
traffic signal              249882
stop sign/flasher            89361
unknown                      38328
other                         6096
yield                         1365
lane use marking              1226
other reg. sign               1103
pedestrian crossing sign       636
railroad crossing gate         581
flashing control signal        373
school zone                    353
delineators                    352
police/flagman                 309
rr crossing sign               195
other railroad crossing        192
no passing                      58
bicycle crossing sign           34
Name: traffic_control_device, dtype: int64

During my time working with Philadelphia's Vision Zero department, I learned that traffic signals can increase the chance of speeding compared with stop signs, mainly because drivers speed up at yellow lights to avoid getting stuck at the light. With this in mind, I think this feature could be an insightful predictor, but its cardinality needs to be reduced. I reclassified it into 4 categories: Signal, Sign, Markings & Lanes, and Other. Lower cardinality means fewer columns after one-hot encoding, which improves computational efficiency.
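The effect of grouping on one-hot dimensionality can be seen in a small sketch. It uses `pd.get_dummies` for brevity rather than scikit-learn's `OneHotEncoder`, and the grouping shown is an abbreviated subset chosen for illustration.

```python
import pandas as pd

devices = pd.Series([
    "traffic signal", "flashing control signal",
    "stop sign/flasher", "yield", "no controls",
])

# Abbreviated grouping, for illustration only.
grouped = devices.map({
    "traffic signal": "Signal",
    "flashing control signal": "Signal",
    "stop sign/flasher": "Sign",
    "yield": "Sign",
    "no controls": "Other",
})

# One-hot encoding creates one column per distinct label.
cols_before = pd.get_dummies(devices).shape[1]
cols_after = pd.get_dummies(grouped).shape[1]
```

Five raw labels produce five columns, while the grouped version produces three.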

In [26]:
# Create a dictionary to map the original 'traffic_control_device' values to more specific categories
traffic_control_mapping = {
    'traffic signal': 'Signal',
    'flashing control signal': 'Signal',
    'pedestrian crossing sign': 'Signal',  # If it's a signal
    'railroad crossing gate': 'Signal',    # If it uses lights
    
    'stop sign/flasher': 'Sign',
    'yield': 'Sign',
    'school zone': 'Sign',
    'railroad crossing sign': 'Sign',      # If static sign
    'other warning sign': 'Sign',
    'bicycle crossing sign': 'Sign',
    'no passing': 'Sign',
    
    'lane use marking': 'Markings & Lanes',
    'delineators': 'Markings & Lanes',
    
    'no controls': 'Other',
    'unknown': 'Other',
    'other': 'Other',
    'police/flagman': 'Other',
    'other railroad crossing': 'Other'
}

# Apply the mapping to the 'traffic_control_device' column
crashes_cleaned['traffic_control_category'] = crashes_cleaned['traffic_control_device'].map(traffic_control_mapping)

# Check the value counts for the new grouped categories
crashes_cleaned['traffic_control_category'].value_counts()

Other               555212
Signal              251472
Sign                 91886
Markings & Lanes      1578
Name: traffic_control_category, dtype: int64
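`Series.map` returns NaN for any value that is not a key in the dictionary, so it is worth checking for unmapped labels after a reclassification like this. The four category counts above sum to 900,148, while the DataFrame has 901,446 rows, which suggests that some labels did not match a key. For example, 'rr crossing sign' and 'other reg. sign' appear in the earlier value counts but not in the dictionary. A minimal sketch of such a check, on toy data:

```python
import pandas as pd

devices = pd.Series(["traffic signal", "rr crossing sign", "other reg. sign"])
mapping = {"traffic signal": "Signal", "railroad crossing sign": "Sign"}

mapped = devices.map(mapping)

# Labels present in the data but absent from the dictionary become NaN.
unmapped = sorted(devices[mapped.isna()].unique())
```

Running the same check against the real column would list every label that still needs a category before nulls are dropped.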

In [27]:
# prints the distribution of values within device_condition feature
crashes_cleaned['device_condition'].value_counts()

no controls                 516329
functioning properly        307784
unknown                      63428
other                         6836
functioning improperly        4113
not functioning               2562
worn reflective material       295
missing                         99
Name: device_condition, dtype: int64

The two most frequent categories, which make up the majority of this feature, are 'no controls' and 'functioning properly'. The next two most frequent are 'unknown' and 'other'. Because of this, the feature is unlikely to be particularly insightful, so I will drop it.

In [28]:
# prints the distribution of values within lighting_condition feature
crashes_cleaned['lighting_condition'].value_counts()

daylight                  578548
darkness, lighted road    197098
unknown                    42569
darkness                   42455
dusk                       25737
dawn                       15039
Name: lighting_condition, dtype: int64

This seems like it could offer insightful predictive capacity, and the cardinality is not high. I will keep this. 

In [29]:
# prints the distribution of values within first_crash_type feature
crashes_cleaned['first_crash_type'].value_counts()

parked motor vehicle            208646
rear end                        199321
sideswipe same direction        138501
turning                         129668
angle                            97996
fixed object                     41874
pedestrian                       21320
pedalcyclist                     14331
sideswipe opposite direction     12509
rear to front                     9252
other object                      8981
head on                           7639
rear to side                      5512
other noncollision                2745
rear to rear                      1907
animal                             655
overturned                         543
train                               46
Name: first_crash_type, dtype: int64

This information could be insightful, but I expect other features to offer better insights. To reduce the dataset to only the most useful features, I will remove this one. It could be included in future iterations of this project.

In [30]:
# prints the distribution of values within trafficway_type feature
crashes_cleaned['trafficway_type'].value_counts()

not divided                        388246
divided - w/median (not raised)    142466
one-way                            114072
four way                            62561
parking lot                         61011
divided - w/median barrier          50946
other                               24388
alley                               14802
t-intersection                      12409
unknown                             10603
center turn lane                     6374
driveway                             2890
ramp                                 2834
unknown intersection type            2762
five point, or more                  1385
y-intersection                       1350
traffic route                        1166
not reported                          687
roundabout                            308
l-intersection                        186
Name: trafficway_type, dtype: int64

This information feels like it could be important, so I will keep it. With some label reclassification, I reduced this feature to 6 categories, making for a more insightful and computationally efficient feature for prediction. The reclassified feature is named 'road_category', so I will keep it and drop the original trafficway_type.

In [31]:
# Define intersection types
intersection_types = ['roundabout', 'l-intersection', 'y-intersection', 
                      'five point, or more', 'center turn lane', 
                      't-intersection', 'unknown intersection type']

# Define the conditions in priority order (np.select uses the first condition that matches)
conditions = [
    (crashes_cleaned['trafficway_type'] == 'one-way') & (crashes_cleaned['lane_cnt'] == 1),
    (crashes_cleaned['trafficway_type'] == 'one-way') & (crashes_cleaned['lane_cnt'] > 1),
    (crashes_cleaned['trafficway_type'].isin(intersection_types)),
    (crashes_cleaned['trafficway_type'].isin(['unknown', 'not reported'])) | 
    (pd.isnull(crashes_cleaned['trafficway_type'])) | 
    (pd.isnull(crashes_cleaned['lane_cnt'])),
    (crashes_cleaned['trafficway_type'].isin(['parking lot', 'driveway', 'ramp', 'alley', 'other'])),
    # 'multi-lane bidirectional': more than one lane, excluding the trafficway types listed below
    (crashes_cleaned['lane_cnt'] > 1) & 
    (~crashes_cleaned['trafficway_type'].isin([
        'one-way', 'four way', 'unknown', 'not reported', 
        'other', 'parking lot', 'driveway', 'ramp', 'alley'
    ]))
]

# Define corresponding categories
choices = [
    'single-lane one way',
    'multi-lane one way',
    'intersection',
    'unknown',  # Combined "unknown" and "not reported"
    'other',
    'multi-lane bidirectional'
]

# Apply classification
crashes_cleaned['road_category'] = np.select(conditions, choices, default='unknown')

In [32]:
# Check the distribution of categories in the new column
crashes_cleaned['road_category'].value_counts()

unknown                     689310
multi-lane bidirectional    138918
intersection                 24774
other                        19589
single-lane one way          17992
multi-lane one way           10863
Name: road_category, dtype: int64
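
The large 'unknown' count appears to follow largely from how `np.select` evaluates the conditions above: for each row, the first condition that is true wins, and rows that match nothing receive the default. The sketch below uses a tiny made-up frame (`toy` and its rows are hypothetical, and the condition list is a simplified subset of the one above) to show how a missing `lane_cnt` or an excluded type such as 'four way' ends up as 'unknown'.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from the crash data
toy = pd.DataFrame({
    'trafficway_type': ['one-way', 'one-way', 't-intersection', 'not divided', 'four way'],
    'lane_cnt': [1, np.nan, np.nan, 4, 2],
})

# Simplified subset of the notebook's conditions, in the same priority order
conditions = [
    (toy['trafficway_type'] == 'one-way') & (toy['lane_cnt'] == 1),
    toy['trafficway_type'].isin(['t-intersection']),
    toy['lane_cnt'].isna(),
    (toy['lane_cnt'] > 1) & (~toy['trafficway_type'].isin(['one-way', 'four way'])),
]
choices = ['single-lane one way', 'intersection', 'unknown', 'multi-lane bidirectional']

# Rows matching no condition (here: 'four way' with 2 lanes) receive the default
toy['road_category'] = np.select(conditions, choices, default='unknown')
print(toy)
```

Comparisons against NaN evaluate to False, so a one-way street with a missing lane count skips the one-way conditions and is only caught by the null check.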

In [33]:
# prints the distribution of values within alignment feature
crashes_cleaned['alignment'].value_counts()

straight and level       880103
straight on grade         11022
curve, level               6352
straight on hillcrest      2267
curve on grade             1313
curve on hillcrest          389
Name: alignment, dtype: int64

While this feature would ideally be helpful as a predictor, the data here is not conducive to analysis. About 98% of the entries are 'straight and level', so I will remove it.
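
To put a number on that skew, the share held by the most common label can be computed directly from the counts. This is a minimal sketch that re-enters the printed counts by hand (`alignment_counts` is a name introduced here); on the real dataframe, `value_counts(normalize=True)` would give the same information.

```python
import pandas as pd

# Counts copied from the alignment value_counts() output above
alignment_counts = pd.Series({
    'straight and level': 880103,
    'straight on grade': 11022,
    'curve, level': 6352,
    'straight on hillcrest': 2267,
    'curve on grade': 1313,
    'curve on hillcrest': 389,
})

# Share of rows held by the single most common label
dominant_share = alignment_counts.max() / alignment_counts.sum()
print(f"{alignment_counts.idxmax()}: {dominant_share:.1%}")
```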

In [34]:
# prints the distribution of values within weather_condition feature
crashes_cleaned['weather_condition'].value_counts()

clear                       709235
rain                         77962
unknown                      51500
snow                         28844
cloudy/overcast              26333
other                         2789
freezing rain/drizzle         1787
fog/smoke/haze                1353
sleet/hail                    1026
blowing snow                   453
severe cross wind gate         156
blowing sand, soil, dirt         8
Name: weather_condition, dtype: int64

This feature seems important, but I will have to compare it to the roadway_surface_cond feature, as they both contain similar information. I will decide which is more insightful and drop the other.

In [35]:
# prints the distribution of values within roadway_surface_cond feature
crashes_cleaned['roadway_surface_cond'].value_counts()

dry                667224
wet                117323
unknown             80085
snow or slush       28524
ice                  5678
other                2290
sand, mud, dirt       322
Name: roadway_surface_cond, dtype: int64

This is somewhat redundant with weather_condition. I chose to remove weather_condition and keep roadway_surface_cond due to its lower cardinality.
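
One way to gauge how much two categorical features overlap is a row-normalized cross-tabulation. The sketch below runs on a small hypothetical sample (`toy` and its rows are made up for illustration); on the real data, the two columns of crashes_cleaned would take its place.

```python
import pandas as pd

# Hypothetical toy sample standing in for the two crashes_cleaned columns
toy = pd.DataFrame({
    'weather_condition': ['clear', 'clear', 'clear', 'rain', 'rain', 'snow', 'snow', 'clear'],
    'roadway_surface_cond': ['dry', 'dry', 'wet', 'wet', 'wet', 'snow or slush', 'ice', 'dry'],
})

# For each weather label, the share of rows falling under each surface condition
overlap = pd.crosstab(
    toy['weather_condition'],
    toy['roadway_surface_cond'],
    normalize='index',
)
print(overlap.round(2))
```

Rows dominated by a single column suggest that one feature largely determines the other, which would support keeping only one of them.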

In [36]:
# prints the distribution of values within street_direction feature
crashes_cleaned['street_direction'].value_counts()

w    322771
s    301079
n    216752
e     60840
Name: street_direction, dtype: int64

Even though it feels like this feature could have some predictive power, I chose to drop it, as I think it will not be as insightful as other features. With the vast number of features across three datasets, I am only trying to keep the ones I feel are most important. With more time I could run a simple decision tree and obtain the feature_importances to help with this decision, but with an approaching deadline I opt to simply remove it.

In [37]:
# prints the distribution of values within street_name feature
crashes_cleaned['street_name'].value_counts()

western ave        24619
pulaski rd         21778
cicero ave         20285
ashland ave        19606
halsted st         17440
                   ...  
franklin sd            1
lacey ave              1
stetson sub ave        1
11th pl                1
29th pl                1
Name: street_name, Length: 1648, dtype: int64

Again, I feel this could offer some predictive insight, but with 1,648 unique values its cardinality is too high, so I will remove it.

In [38]:
# prints the distribution of values within road_defect feature
crashes_cleaned['road_defect'].value_counts()

no defects           718022
unknown              166233
rut, holes             6350
other                  4893
worn surface           3741
shoulder defect        1547
debris on roadway       660
Name: road_defect, dtype: int64

The two main labels here are "no defects" and "unknown". This will not be helpful for analysis, so I will remove it.

In [39]:
# prints the distribution of values within crash_type feature
crashes_cleaned['crash_type'].value_counts()

no injury / drive away              658842
injury and / or tow due to crash    242604
Name: crash_type, dtype: int64

This feature describes the aftermath of the crash which is not helpful for this model. I will remove it.

In [40]:
# prints the distribution of values within beat_of_occurrence feature
crashes_cleaned['beat_of_occurrence'].value_counts()

1834.0    10913
114.0      9281
813.0      9093
815.0      8590
1831.0     8244
          ...  
1653.0      502
1655.0      313
1652.0      241
1650.0       69
6100.0        7
Name: beat_of_occurrence, Length: 276, dtype: int64

This is another potentially insightful feature, but given its high cardinality (276 unique beats) and my limited computing power, I choose to remove it. In future iterations, it would be interesting to inspect this feature further.
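
For a future iteration, one option that is not used in this notebook is to keep only the most frequent labels and collapse the rest into a single bucket. The sketch below is only an illustration: the beat frequencies are hypothetical, and the cutoff `top_n` is an arbitrary value rather than a tuned one.

```python
import pandas as pd

# Hypothetical beat values and frequencies, for illustration only
beats = pd.Series([1834] * 5 + [114] * 4 + [813] * 3 + [1650, 1652, 6100])

top_n = 3  # illustrative cutoff, not a tuned value
keep = beats.value_counts().nlargest(top_n).index

# Keep the most frequent beats as labels and collapse the long tail into one bucket
beats_bucketed = beats.astype(str).where(beats.isin(keep), 'other')
print(beats_bucketed.value_counts())
```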

In [41]:
# prints the distribution of values within most_severe_injury feature
crashes_cleaned['most_severe_injury'].value_counts()

no indication of injury     772801
nonincapacitating injury     71130
reported, not evident        39463
incapacitating injury        15074
fatal                          985
Name: most_severe_injury, dtype: int64

This is my target variable, so I will keep it, but to improve its interpretability and ensure it is aligned with the problem context of predicting fatal or serious injuries, I reclassify it to have two classes: non-serious injury and serious injury. 

In [42]:
# Replace the string 'nan' with actual NaN values
crashes_cleaned['most_severe_injury'] = crashes_cleaned['most_severe_injury'].replace('nan', np.nan)

# Now categorize the injuries into 'Serious' and 'Non-serious'
crashes_cleaned['severity_category'] = crashes_cleaned['most_severe_injury'].replace({
    'no indication of injury': 'Non-serious',
    'nonincapacitating injury': 'Non-serious',
    'reported, not evident': 'Non-serious',
    'incapacitating injury': 'Serious',
    'fatal': 'Serious'
})
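
One caveat with `.replace()` and a dictionary is that labels missing from the dictionary pass through unchanged, so an unexpected label would land in severity_category unnoticed. As a hedged sanity check, the sketch below compares it with `.map()`, which turns unmapped labels into NaN. The sample series and the 'unexpected label' value are hypothetical.

```python
import pandas as pd

severity_map = {
    'no indication of injury': 'Non-serious',
    'nonincapacitating injury': 'Non-serious',
    'reported, not evident': 'Non-serious',
    'incapacitating injury': 'Serious',
    'fatal': 'Serious',
}

# Hypothetical sample, including one label that is not in the mapping
sample = pd.Series(['fatal', 'no indication of injury', 'unexpected label'])

replaced = sample.replace(severity_map)  # unmapped label passes through unchanged
mapped = sample.map(severity_map)        # unmapped label becomes NaN

print(replaced.tolist())
print(mapped.isna().sum())
```

Counting the NaN values after `.map()` is a quick way to confirm that every label was covered.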

In [43]:
# prints the distribution of values within injuries_total feature
crashes_cleaned['injuries_total'].value_counts()

0.0     772815
1.0      95189
2.0      21269
3.0       6479
4.0       2302
5.0        825
6.0        325
7.0        133
8.0         53
9.0         27
10.0        16
11.0         9
15.0         8
12.0         6
21.0         4
13.0         3
17.0         1
14.0         1
19.0         1
16.0         1
Name: injuries_total, dtype: int64

This information is redundant with my target so I choose to drop it.

In [44]:
# prints the distribution of values within injuries_fatal feature
crashes_cleaned['injuries_fatal'].value_counts()

0.0    898482
1.0       912
2.0        64
3.0         8
4.0         1
Name: injuries_fatal, dtype: int64

This information is redundant with my target so I choose to drop it.

In [45]:
# prints the distribution of values within injuries_incapacitating feature
crashes_cleaned['injuries_incapacitating'].value_counts()

0.0     884243
1.0      13370
2.0       1395
3.0        312
4.0        107
5.0         29
6.0          7
7.0          2
10.0         1
8.0          1
Name: injuries_incapacitating, dtype: int64

This information is redundant with my target, since it is already captured in the 'most_severe_injury' feature, so I choose to drop it.

In [46]:
# prints the distribution of values within prim_contributory_cause feature
crashes_cleaned['prim_contributory_cause'].value_counts()

unable to determine                                                                 352689
failing to yield right-of-way                                                        99589
following too closely                                                                86950
not applicable                                                                       47632
improper overtaking/passing                                                          44963
failing to reduce speed to avoid crash                                               37868
improper backing                                                                     34796
improper lane usage                                                                  32108
driving skills/knowledge/experience                                                  30632
improper turning/no signal                                                           30203
disregarding traffic signals                                                         17608

This feature seems particularly insightful, so I will keep it, but I will reclassify its labels to reduce cardinality and to improve interpretability with more understandable grouping labels.

In [47]:
# Create a mapping for the primary contributory causes
cause_mapping = {
    'distraction - from inside vehicle': 'Distraction',
    'distraction - from outside vehicle': 'Distraction',
    'cell phone use other than texting': 'Distraction',
    'distraction - other electronic device (navigation device, dvd player, etc.)': 'Distraction',
    'texting': 'Distraction',
    'bicycle advancing legally on red light': 'Distraction',
    'motorcycle advancing legally on red light': 'Distraction',
    
    'operating vehicle in erratic, reckless, careless, negligent or aggressive manner': 'Aggressive/Reckless Driving',
    'failing to reduce speed to avoid crash': 'Aggressive/Reckless Driving',
    'exceeding authorized speed limit': 'Aggressive/Reckless Driving',
    'exceeding safe speed for conditions': 'Aggressive/Reckless Driving',
    'driving on wrong side/wrong way': 'Aggressive/Reckless Driving',
    'disregarding stop sign': 'Aggressive/Reckless Driving',
    'disregarding traffic signals': 'Aggressive/Reckless Driving',
    'disregarding yield sign': 'Aggressive/Reckless Driving',
    'passing stopped school bus': 'Aggressive/Reckless Driving',
    'improper overtaking/passing': 'Aggressive/Reckless Driving',
    'failing to yield right-of-way': 'Aggressive/Reckless Driving',
    'following too closely': 'Aggressive/Reckless Driving',
    'improper lane usage': 'Aggressive/Reckless Driving',
    'improper turning/no signal': 'Aggressive/Reckless Driving',
    
    'driving skills/knowledge/experience': 'Driver\'s Condition/Experience',
    'physical condition of driver': 'Driver\'s Condition/Experience',
    'vision obscured (signs, tree limbs, buildings, etc.)': 'Driver\'s Condition/Experience',
    'under the influence of alcohol/drugs (use when arrest is effected)': 'Driver\'s Condition/Experience',
    'had been drinking (use when arrest is not made)': 'Driver\'s Condition/Experience',
    
    'weather': 'Environmental and Road Conditions',
    'road engineering/surface/marking defects': 'Environmental and Road Conditions',
    'road construction/maintenance': 'Environmental and Road Conditions',
    'evasive action due to animal, object, nonmotorist': 'Environmental and Road Conditions',
    'animal': 'Environmental and Road Conditions',
    
    'unable to determine': 'Unknown/Other',
    'not applicable': 'Unknown/Other',
    'related to bus stop': 'Unknown/Other',
    'obstructed crosswalks': 'Unknown/Other',
    
    # Add the missing categories
    'improper backing': 'Aggressive/Reckless Driving',
    'equipment - vehicle condition': 'Driver\'s Condition/Experience',
    'disregarding other traffic signs': 'Aggressive/Reckless Driving',
    'disregarding road markings': 'Aggressive/Reckless Driving',
    'turning right on red': 'Aggressive/Reckless Driving'
}

# Apply the mapping to categorize the causes
crashes_cleaned['crash_cause_category'] = crashes_cleaned['prim_contributory_cause'].map(cause_mapping)

In [48]:
# Find unique values in 'prim_contributory_cause' that are not in the 'cause_mapping'
missing_values = crashes_cleaned[~crashes_cleaned['prim_contributory_cause'].isin(cause_mapping.keys())]['prim_contributory_cause'].unique()

print(missing_values)

[]


In [49]:
# Check the value counts in the new category column
crashes_cleaned['crash_cause_category'].value_counts()

Aggressive/Reckless Driving          417688
Unknown/Other                        400902
Driver's Condition/Experience         51717
Environmental and Road Conditions     19367
Distraction                           11772
Name: crash_cause_category, dtype: int64

In [50]:
# prints the distribution of values within sec_contributory_cause feature
crashes_cleaned['sec_contributory_cause'].value_counts()

not applicable                                                                      371652
unable to determine                                                                 324878
failing to reduce speed to avoid crash                                               33161
failing to yield right-of-way                                                        28925
driving skills/knowledge/experience                                                  28101
following too closely                                                                23735
improper overtaking/passing                                                          14021
improper lane usage                                                                  12692
weather                                                                               9915
improper turning/no signal                                                            9382
improper backing                                                                      7194

This feature is redundant with prim_contributory_cause, and a large majority of its values are either 'not applicable' or 'unable to determine', so it will be dropped.

In [51]:
# prints the distribution of values within crash_date feature
crashes_cleaned['crash_date'].value_counts()

12/29/2020 05:00:00 pm    30
11/10/2017 10:30:00 am    27
02/17/2022 03:30:00 pm    21
11/21/2024 10:30:00 am    20
11/21/2024 10:00:00 am    20
                          ..
12/23/2016 12:41:00 pm     1
10/03/2020 05:32:00 pm     1
08/02/2021 05:15:00 pm     1
01/08/2020 02:35:00 pm     1
09/13/2023 01:08:00 pm     1
Name: crash_date, Length: 592919, dtype: int64

I will remove this feature. Its information is captured in crash_hour, crash_day_of_week, and crash_month. I keep those three features and reclassify them to help prepare for modeling.
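
To illustrate why crash_date adds nothing beyond those three features, the sketch below parses two of the timestamps printed above and derives the hour, month, and day of week from them. It assumes the Sunday = 1 coding that the notebook's day-of-week labels use; the `derived` frame is a name introduced here.

```python
import pandas as pd

# Two timestamps copied from the crash_date value_counts() output above
crash_date = pd.Series(['12/29/2020 05:00:00 pm', '11/10/2017 10:30:00 am'])

# Upper-case the am/pm marker, then parse with an explicit 12-hour format
parsed = pd.to_datetime(crash_date.str.upper(), format='%m/%d/%Y %I:%M:%S %p')

derived = pd.DataFrame({
    'crash_hour': parsed.dt.hour,
    'crash_month': parsed.dt.month,
    # pandas uses Monday = 0; shift to the Sunday = 1 ... Saturday = 7 coding
    'crash_day_of_week': (parsed.dt.dayofweek + 1) % 7 + 1,
})
print(derived)
```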

In [52]:
# prints the distribution of values within crash_hour feature
crashes_cleaned['crash_hour'].value_counts()

15    69825
16    68993
17    67144
14    60189
18    55381
13    54478
12    52818
8     47683
11    45742
9     41217
10    40942
19    40838
7     38207
20    33003
21    29440
22    27107
23    23508
0     19638
6     19488
1     16760
2     14336
5     12390
3     11848
4     10471
Name: crash_hour, dtype: int64

In [53]:
# prints the distribution of values within crash_day_of_week feature
crashes_cleaned['crash_day_of_week'].value_counts()

6    146122
7    133158
5    129717
3    128456
4    127880
2    123620
1    112493
Name: crash_day_of_week, dtype: int64

In [54]:
# prints the distribution of values within crash_month feature
crashes_cleaned['crash_month'].value_counts()

10    86680
9     82227
8     80821
7     78568
11    78175
6     77697
5     77268
12    74429
3     67812
4     66417
1     66068
2     65284
Name: crash_month, dtype: int64

In [55]:
# groups crash_hour into four 6-hour blocks and saves as new feature: time_of_day
crashes_cleaned['time_of_day'] = pd.cut(
    crashes_cleaned['crash_hour'], 
    bins=[-1, 5, 11, 17, 23], 
    labels=['Night (Late)', 'Morning', 'Afternoon', 'Night (Early)'],
    right=True
)
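
To see which hours land in which label, the same bins can be applied to the hours 0 through 23. This is a small standalone sketch (`hours` and `time_of_day` are local names here, not the dataframe columns): with `right=True`, each bin includes its upper edge, so the first bin covers hours 0 to 5.

```python
import pandas as pd

hours = pd.Series(range(24))
time_of_day = pd.cut(
    hours,
    bins=[-1, 5, 11, 17, 23],
    labels=['Night (Late)', 'Morning', 'Afternoon', 'Night (Early)'],
    right=True,
)

# First and last hour that fall into each label
print(hours.groupby(time_of_day, observed=True).agg(['min', 'max']))
```

Starting the first bin at -1 is what keeps hour 0 from falling outside the bins and becoming NaN.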

In [56]:
# maps the numeric crash_day_of_week codes to day names and saves as new feature: day_of_week
crashes_cleaned['day_of_week'] = crashes_cleaned['crash_day_of_week'].replace({
    1: 'Sun',
    2: 'Mon',
    3: 'Tues',
    4: 'Wed',
    5: 'Thur',
    6: 'Fri',
    7: 'Sat'
})

In [57]:
# prints the distribution of values within day_of_week feature
crashes_cleaned['day_of_week'].value_counts()

Fri     146122
Sat     133158
Thur    129717
Tues    128456
Wed     127880
Mon     123620
Sun     112493
Name: day_of_week, dtype: int64

In [58]:
# groups crash_months into seasons and saves as new feature: season

crashes_cleaned['season'] = pd.cut(
    crashes_cleaned['crash_month'], 
    bins=[0, 2, 5, 8, 11, 12], 
    labels=['Winter', 'Spring', 'Summer', 'Fall', 'Winter'],
    right=True,
    ordered=False
)

In [59]:
# prints the distribution of values within the new season feature
crashes_cleaned['season'].value_counts()

Fall      247082
Summer    237086
Spring    211497
Winter    205781
Name: season, dtype: int64
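
As a cross-check on the season bins, the monthly counts printed earlier can be re-binned and summed; the totals should reproduce the season counts above. The sketch re-enters the monthly counts by hand, and `month_counts` and `season_totals` are names introduced here.

```python
import pandas as pd

# Monthly counts copied from the crash_month value_counts() output above
month_counts = pd.Series({
    1: 66068, 2: 65284, 3: 67812, 4: 66417, 5: 77268, 6: 77697,
    7: 78568, 8: 80821, 9: 82227, 10: 86680, 11: 78175, 12: 74429,
})

season = pd.cut(
    month_counts.index.to_series(),
    bins=[0, 2, 5, 8, 11, 12],
    labels=['Winter', 'Spring', 'Summer', 'Fall', 'Winter'],
    right=True,
    ordered=False,  # duplicate labels are only allowed when ordered=False
)

# Sum the monthly counts within each season
season_totals = month_counts.groupby(season, observed=True).sum()
print(season_totals)
```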

In [60]:
# prints the count of null values within each feature
crashes_cleaned.isna().sum()

crash_record_id                       0
crash_date                            0
posted_speed_limit                    0
traffic_control_device                0
device_condition                      0
weather_condition                     0
lighting_condition                    0
first_crash_type                      0
trafficway_type                       0
lane_cnt                         702424
alignment                             0
roadway_surface_cond                  0
road_defect                           0
report_type                       28066
crash_type                            0
intersection_related_i           694392
hit_and_run_i                    618754
damage                                0
date_police_notified                  0
prim_contributory_cause               0
sec_contributory_cause                0
street_no                             0
street_direction                      4
street_name                           1
beat_of_occurrence                    5


#### 2.1.8 Remove unuseful features: `.drop()` for list of features deemed not useful for analysis and store trimmed df as ‘crashes_cleaned’

Domain knowledge and a better understanding of the features and business problem led me to remove several features. Features such as 'damage' and 'date_police_notified' describe the aftermath of a crash and thus do not offer much predictive insight, so they are removed.

In [61]:
crashes_cleaned.drop(columns = [
    'crash_date',
    'hit_and_run_i',
    'device_condition',
    'weather_condition',
    'road_defect',
    'crash_type',
    'damage',
    'date_police_notified',
    'sec_contributory_cause',
    'street_no',
    'report_type',
    'beat_of_occurrence',
    'num_units',
    'alignment',
    'injuries_total',
    'injuries_fatal',
    'injuries_incapacitating',
    'injuries_non_incapacitating',
    'injuries_reported_not_evident',
    'injuries_no_indication',
    'injuries_unknown',
    'location',
    'street_direction',
    'lane_cnt', 
    'intersection_related_i',
    'trafficway_type', 
    'crash_hour', 
    'crash_day_of_week', 
    'crash_month', 
    'posted_speed_limit', 
    'traffic_control_device', 
    'street_name', 
    'most_severe_injury',
    'prim_contributory_cause',
    'latitude',
    'longitude',
    'first_crash_type'
], inplace = True)

#### 2.1.10 Convert data types: stored data types to reflect true data types (categorical variables as strings, numeric variables as int, etc.)

Ensuring that features' stored data types match their true data types is another common cleaning step. Most of the features in the cleaned dataframe were categorical variables, so I saved them as the 'category' data type. I could also have stored them as object types, but the category type uses less memory for low-cardinality columns, which is important when dealing with such a vast amount of data.

In [62]:
# Convert all the columns (except 'crash_record_id') to category type
cat_cols = [col for col in crashes_cleaned.columns if col != 'crash_record_id']
crashes_cleaned[cat_cols] = crashes_cleaned[cat_cols].astype('category')

# Verify the changes
crashes_cleaned.dtypes

crash_record_id               object
lighting_condition          category
roadway_surface_cond        category
speed_limit_category        category
traffic_control_category    category
road_category               category
severity_category           category
crash_cause_category        category
time_of_day                 category
day_of_week                 category
season                      category
dtype: object
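
The memory benefit of the category type can be checked with `memory_usage(deep=True)`. The sketch below uses a hypothetical low-cardinality column (four labels repeated many times) rather than the real dataframe, so the exact figures are only illustrative.

```python
import pandas as pd

# Hypothetical low-cardinality column, repeated to mimic a larger dataset
labels = pd.Series(['dry', 'wet', 'unknown', 'snow or slush'] * 25_000)

# deep=True counts the string contents as well as the pointers to them
as_object = labels.astype('object').memory_usage(deep=True)
as_category = labels.astype('category').memory_usage(deep=True)

print(f"object:   {as_object / 1e6:.2f} MB")
print(f"category: {as_category / 1e6:.2f} MB")
```

The saving comes from storing each distinct label once and representing rows as small integer codes, so it shrinks as cardinality grows.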

Once data types were addressed and unhelpful features removed, I checked the distribution of my target feature. I also had to decide how to handle the remaining null values. Since they made up such a small part of the total, I decided to remove any row with a null value, knowing that this would still leave me with plenty of data.

In [63]:
# prints the distribution of the target variable severity_category
crashes_cleaned['severity_category'].value_counts()

Checking the distribution of my target, I can see there is significant class imbalance, and this will inform some of my future data preparation decisions. Right now I still have about 16k rows in my target class, but I will be cautious about removing rows, as I want to make sure there is enough data.

In [64]:
# checks the percentage of null values within each feature
(crashes_cleaned.isna().sum()/ len(crashes_cleaned))* 100

In [65]:
# drops any remaining rows that contain null values
crashes_cleaned.dropna(inplace = True)

With such small percentages of remaining null values, I feel confident simply removing them, as opposed to imputing values, which could potentially introduce noise.
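
Per-column null percentages can understate the cost of `dropna()`, because a row is removed if any of its columns is null. A minimal sketch of measuring the row-level share before dropping is shown below; the `toy` frame and its label values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with a few missing values
toy = pd.DataFrame({
    'lighting_condition': ['daylight', 'darkness', None, 'daylight', 'dusk'],
    'season': ['Winter', 'Fall', 'Fall', None, 'Summer'],
})

# Share of rows containing at least one null, i.e. what dropna() would remove
share_dropped = toy.isna().any(axis=1).mean()
rows_before = len(toy)
toy_clean = toy.dropna()

print(f"dropping {share_dropped:.0%} of rows "
      f"({rows_before - len(toy_clean)} of {rows_before})")
```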

The crashes_cleaned dataframe is now prepared for merging. 

### 2.2 People 

Steps:

* **Preview Data**: `.head()`


* **Understand Structure**: `.info()`


* **Format Feature names and Row Values**: `.lower()`


* **Drop features with overly high null values**: `.isna().sum()/ len(df)` for percentage of nulls for each feature


* **Check for duplicates**: `.duplicated().sum()`


* **Inspect remaining features**: `.value_counts()`

    * make intentional decisions to keep or drop using `.value_counts()` distribution and domain knowledge; 
    * make note of any features to keep that will need cleaning/cardinality reduction/etc.


* **Remove unuseful features**: 

    * `.drop()` for list of features deemed not useful for analysis; 
    * store trimmed df as 'df_name_cleaned'


* **Convert data types**: 

    * stored data types to reflect true data types 
    * (categorical variables as strings, numeric variables as int, etc.)


* **Reduce feature cardinality with label reclassification**:

    * 'safety_equipment' to 'safety_equipment_category'
    * 'age' to 'age_group'


* **Remove remaining nulls**: `.dropna()`

#### 2.2.1 Preview data

In [66]:
# previews the first 5 rows of the data
people_df.head()

#### 2.2.2 Understand Structure

In [67]:
# provides info about the dataframe
people_df.info()

#### 2.2.3 Format Feature names and Row Values

Initial observations after previewing the first 5 rows and the dataframe structure with `.info()` are that this dataset is massive with nearly 2 million records. Extensive cleaning will need to take place to reduce the size of this dataset for computational efficiency. 

As a first step, I made the feature names lower case simply for readability, and as a precursor cleaning step I made all string values lower case so that labels differing only in capitalization are treated as the same value.

In [68]:
# converts all feature names to lower case
people_df.columns = people_df.columns.str.lower()

In [69]:
# Convert all string values in object columns to lowercase
for col in people_df.select_dtypes(include='object').columns:
    people_df[col] = people_df[col].str.lower()

#### 2.2.4 Drop features with overly high null values

While there are a lot of records in this dataframe, some features also have a lot of null values. I know that features with a significant majority of null values will not be helpful for analysis, so I start by identifying these features and removing them right off the bat. Rather than look at the count of nulls, I used `(.isna().sum()/ len(df)) *100` to get the percentage of nulls for each feature. This is easier to grasp than null counts in the tens and hundreds of thousands.

For quick cleaning, I chose a 90% null threshold to automatically remove features. For the features with less than 90% nulls, I decided to inspect them more closely before blindly removing them.

In [70]:
# prints the percentage of null values per feature, rounded to 2 places
round((people_df.isna().sum()/ len(people_df)*100), 2)

In [71]:
# selects all features where 90% or more of the values are null and stores them as a list
ppl_high_null_features = list(people_df.columns[(people_df.isna().sum() / len(people_df) * 100) >= 90])
ppl_high_null_features

['ems_agency',
 'ems_run_no',
 'pedpedal_action',
 'pedpedal_visibility',
 'pedpedal_location',
 'bac_result value',
 'cell_phone_use']

In [72]:
# drops all features with 90% or more null values
people_cleaned = people_df.drop(columns=ppl_high_null_features)
people_cleaned.info()

#### 2.2.5 Deleting people_df from memory

Once I had created people_cleaned by dropping the high-null features, I no longer needed the original people_df, so I deleted it to free the large amount of memory it occupied.

In [73]:
# deletes the people_df dataframe to clear up memory

del people_df

In [74]:
# Perform garbage collection to free up memory by releasing unreferenced objects
# This helps to manage memory usage, especially when working with large datasets or memory-intensive operations.

gc.collect()

464

#### 2.2.6 Checking for duplicates

This dataset has three different ID features (person_id, crash_record_id, and vehicle_id). To get a better sense of what each row represents and how this dataset relates to the crashes and vehicles datasets, I counted the duplicated values in each ID feature.

In [75]:
# Define the list of ID columns to check for duplicates
id_columns = ['person_id', 'crash_record_id', 'vehicle_id']

# Loop through each column to check for duplicates
for column in id_columns:
    # Directly calculate and print the count of duplicates
    duplicates_count = people_cleaned[column].duplicated().sum()
    print(f"Number of duplicate {column} rows: {duplicates_count}")

    # If needed, display duplicate rows (uncommon for large datasets due to memory concerns)
    if duplicates_count > 0:
        print(f"Example duplicate rows for {column}:")
        print(people_cleaned[people_cleaned[column].duplicated()].head())  # Show only the first few rows
    print("\n" + "="*50)  # Separator for readability

##### Explanation of Output:

1. **Duplicate person_id Rows:**
   - No duplicate person_id rows were found, which means each person_id is unique in this dataset.

2. **Duplicate crash_record_id Rows:**
   - A total of 1,080,392 rows are duplicates based on crash_record_id. This suggests that multiple individuals (drivers, passengers, etc.) may have been associated with the same crash. This is expected if there are multiple people involved in the same crash event.

3. **Duplicate vehicle_id Rows:**
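   - A total of 421,011 rows are duplicates based on vehicle_id. This indicates that some vehicles appear in multiple records, potentially due to different passengers or crashes involving the same vehicle. Note that `.duplicated()` treats null values as equal to each other, so rows with a missing vehicle_id (for example, pedestrians) may also be counted here.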
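
The behaviour behind these counts can be reproduced on a few rows. In the sketch below, `toy` is a hypothetical person-level frame with two crashes and two people who have no vehicle_id; it shows how `.duplicated()` counts repeats, how null IDs affect the count, and how many people belong to each crash.

```python
import numpy as np
import pandas as pd

# Hypothetical person-level rows: two crashes, two people without a vehicle_id
toy = pd.DataFrame({
    'person_id': ['p1', 'p2', 'p3', 'p4', 'p5'],
    'crash_record_id': ['c1', 'c1', 'c1', 'c2', 'c2'],
    'vehicle_id': [10.0, 10.0, np.nan, 20.0, np.nan],
})

# duplicated() flags every repeat after the first occurrence and treats NaN as equal to NaN
dupes_with_nulls = toy['vehicle_id'].duplicated().sum()

# Excluding nulls counts only true repeats of a vehicle_id
dupes_without_nulls = toy['vehicle_id'].dropna().duplicated().sum()

# People per crash, which is what the later aggregation has to collapse to one row
people_per_crash = toy.groupby('crash_record_id').size()

print(dupes_with_nulls, dupes_without_nulls)
print(people_per_crash)
```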

#### 2.2.7 Inspect remaining features

Once columns with high null values were taken care of, I used `.value_counts()` to inspect many of the remaining columns for things like cardinality, misspellings, class labels, and stored data types. I used some domain knowledge to avoid looking through every single feature. Similar to my process for the crashes dataframe, for useful features with unclear labels or high cardinality, I reclassified the labels into fewer, more easily understandable ones.

In [76]:
# prints the distribution of values within person_type feature
people_cleaned['person_type'].value_counts()

In [77]:
# prints the distribution of values within crash_date feature
people_cleaned['crash_date'].value_counts()

In [78]:
# prints the distribution of values within seat_no feature
people_cleaned['seat_no'].value_counts()

In [79]:
# prints the normalized distribution of values within safety_equipment feature
people_cleaned['safety_equipment'].value_counts(normalize = True)

Could be an important predictor, but I need to reduce its cardinality. In the cell below, I reclassify the `safety_equipment` labels into 4 interpretable classes. This reduces cardinality, which will help during one-hot encoding and improve computational efficiency.

In [80]:
# Create a dictionary to map the original 'safety_equipment' values to broader categories
safety_equipment_mapping = {
    # Used Equipment
    'safety belt used': 'Used',
    'child restraint used': 'Used',
    'child restraint - forward facing': 'Used',
    'bicycle helmet (pedacyclist involved only)': 'Used',
    'child restraint - type unknown': 'Used',
    'child restraint - rear facing': 'Used',
    'dot compliant motorcycle helmet': 'Used',
    'helmet used': 'Used',
    'booster seat': 'Used',
    'child restraint used improperly': 'Used',

    # Not Used Equipment
    'safety belt not used': 'Not Used',
    'helmet not used': 'Not Used',
    'child restraint not used': 'Not Used',
    'not dot compliant motorcycle helmet': 'Not Used',
    'should/lap belt used improperly': 'Not Used',

    # Unknown Equipment Usage
    'usage unknown': 'Unknown',

    # Other/Special Case Equipment
    'none present': 'Other/Special Case', 
    'wheelchair': 'Other/Special Case',
    'stretcher': 'Other/Special Case',
    
    # Catch-all for any unknown or missing values
    'unknown': 'Other/Special Case',  
}

# Apply the mapping to the 'safety_equipment' column
people_cleaned['safety_equipment_category'] = people_cleaned['safety_equipment'].map(safety_equipment_mapping)

In [81]:
# Check the value counts for the new grouped categories
people_cleaned['safety_equipment_category'].value_counts(normalize = True)

Used                  0.478260
Unknown               0.476501
Other/Special Case    0.033962
Not Used              0.011277
Name: safety_equipment_category, dtype: float64

Based on this output, even after recategorization, this feature will not be very useful. About 95% of the data is split between "used" and "unknown", with the remaining 5% split between "other/special case" and "not used". I will drop this feature.

In [82]:
# prints the distribution of values within airbag_deployed feature
people_cleaned['airbag_deployed'].value_counts()

did not deploy                            987711
not applicable                            424194
deployment unknown                        397461
deployed, front                            61565
deployed, combination                      50895
deployed, side                             17944
deployed other (knee, air, belt, etc.)       973
Name: airbag_deployed, dtype: int64

This feature might be more helpful if we simply knew: did the airbag deploy or not? In the cell below, I reclassify the `airbag_deployed` labels into 3 interpretable classes. This reduces cardinality, which will help during one-hot encoding and improve computational efficiency.

In [83]:
# Define the mapping for airbag_deployed
airbag_mapping = {
    'did not deploy': 'Not Deployed',
    'not applicable': 'Not Deployed',  # Assumes "not applicable" means no airbag deployed
    'deployment unknown': 'Unknown',
    'deployed, front': 'Deployed',
    'deployed, combination': 'Deployed',
    'deployed, side': 'Deployed',
    'deployed other (knee, air, belt, etc.)': 'Deployed'
}

# Apply the mapping to the 'airbag_deployed' column
people_cleaned['airbag_deployed'] = people_cleaned['airbag_deployed'].map(airbag_mapping)

In [84]:
# check the new feature's value counts
people_cleaned['airbag_deployed'].value_counts()

Not Deployed    1411905
Unknown          397461
Deployed         131377
Name: airbag_deployed, dtype: int64
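
One thing to keep in mind with `.map()` and a dictionary: any label that has no key in the dictionary silently becomes NaN. Below is a minimal sketch of a check for such labels, using toy data rather than the real column. The label 'deployed, rear' is hypothetical and only there to trigger the check.

```python
import pandas as pd

# Toy stand-in for people_cleaned['airbag_deployed'] (hypothetical values)
raw = pd.Series(['did not deploy', 'deployed, front', 'deployment unknown',
                 'deployed, rear', None])

# Deliberately incomplete mapping: 'deployed, rear' has no key
airbag_mapping = {
    'did not deploy': 'Not Deployed',
    'deployment unknown': 'Unknown',
    'deployed, front': 'Deployed',
}

mapped = raw.map(airbag_mapping)

# Labels that were present in the raw data but are missing from the mapping
unmapped_labels = sorted(raw[mapped.isna() & raw.notna()].unique())
print(unmapped_labels)
```

If the list is empty, every non-null label was covered by the mapping and any NaN in the result was already null in the source.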

In [85]:
# prints the distribution of values within the ejection feature
people_cleaned['ejection'].value_counts()

none                  1820806
unknown                125622
totally ejected          5904
partially ejected        1449
trapped/extricated       1196
Name: ejection, dtype: int64

This feature might normally be helpful, but it is far too skewed to be useful for this analysis: only about 8.5k values are something other than 'none' or 'unknown'. I will remove it.

In [86]:
# prints the distribution of values within the injury_classification feature
people_cleaned['injury_classification'].value_counts()

no indication of injury     1803602
nonincapacitating injury      98271
reported, not evident         58267
incapacitating injury         17878
fatal                          1089
Name: injury_classification, dtype: int64

We can drop this. It contains similar information to 'most_severe_injury' but is more imbalanced, so I will keep 'most_severe_injury' as my target. This decision is reinforced by the fact that this project focuses on crash-level data rather than person-level data: I want to know which crashes resulted in serious injury, not which people were seriously injured. While I could derive this from injury_classification, most_severe_injury already describes what I'm analyzing and will require less target preparation.

In [87]:
# prints class distribution percentages of driver_action, rounded to 2 places
round(people_cleaned['driver_action'].value_counts(normalize = True), 2)

none                                 0.36
unknown                              0.25
failed to yield                      0.09
other                                0.09
followed too closely                 0.06
improper backing                     0.03
improper turn                        0.03
improper lane change                 0.03
improper passing                     0.02
disregarded control devices          0.02
too fast for conditions              0.01
wrong way/side                       0.00
improper parking                     0.00
overcorrected                        0.00
evading police vehicle               0.00
cell phone use other than texting    0.00
emergency vehicle on call            0.00
texting                              0.00
stopped school bus                   0.00
license restrictions                 0.00
Name: driver_action, dtype: float64

This feature is similar to 'prim_contributory_cause' in the crashes dataset, but it contains 20% nulls. I will drop it and keep prim_contributory_cause instead to avoid multicollinearity.

In [88]:
# prints the distribution of values within the driver_vision feature
people_cleaned['driver_vision'].value_counts()

not obscured              784184
unknown                   753832
other                      15389
moving vehicles             8800
parked vehicles             5429
windshield (water/ice)      4169
blinded - sunlight          1879
trees, plants                614
buildings                    558
blinded - headlights         168
blowing materials            108
hillcrest                    102
embankment                    85
signboard                     38
Name: driver_vision, dtype: int64

This feature is too skewed to provide any real analytical benefit. The two largest classes, 'not obscured' and 'unknown', each contain more than 750k values, while every other class has fewer than 16k, so I will drop this feature.

In [89]:
# prints the distribution of values within the physical_condition feature
people_cleaned['physical_condition'].value_counts()

normal                          1020511
unknown                          527441
impaired - alcohol                 6635
removed by ems                     5697
other                              4585
emotional                          4213
fatigued/asleep                    4108
illness/fainted                    1408
had been drinking                  1128
impaired - drugs                    726
impaired - alcohol and drugs        416
medicated                           193
Name: physical_condition, dtype: int64

Most values here are 'normal' or 'unknown', which is not particularly insightful, so I will remove this feature.

In [90]:
# prints the distribution of values within the bac_result feature
people_cleaned['bac_result'].value_counts()

test not offered                   1554246
test refused                         16163
test performed, results unknown       3715
test taken                            2784
Name: bac_result, dtype: int64

Most values are 'test not offered' or 'test refused'. Again, this is not very insightful, so I will drop this feature.

In [91]:
# prints the distribution of values within the sex feature
people_cleaned['sex'].value_counts()

m    1023595
f     743117
x     179808
Name: sex, dtype: int64

This feature could be insightful. I will change the label names to be more descriptive.

In [92]:
# Replace the values in the 'sex' feature
people_cleaned['sex'] = people_cleaned['sex'].replace({'m': 'male', 'f': 'female', 'x': 'other'})

# Verify the changes
people_cleaned['sex'].unique()

array(['male', 'other', 'female', nan], dtype=object)

In [93]:
# prints the distribution of values within the age feature
people_cleaned['age'].value_counts()

 26.0     39239
 25.0     39229
 27.0     39187
 28.0     38579
 24.0     38031
          ...  
-40.0         1
-177.0        1
-49.0         1
-47.0         1
-59.0         1
Name: age, Length: 117, dtype: int64

This feature could be an insightful predictor, but its high cardinality (117 distinct values, including invalid negative ages) means it needs reclassifying first. I reclassified it into age groups: ages too young to drive (**<16**), ages that are eligible to drive but whose brains are not fully developed (**16-26**), ages with fully developed adult brains (**27-65**), and older adults (over 65, labeled **65+**). I am hopeful that these labels will provide more insight than the raw ages and help improve computational efficiency during modeling.

In [94]:
# Define the bin edges for the age groups; with right=False each bin is [left, right),
# so the top edge is 116 to let age 115 fall in the last group
age_bins = [1, 16, 27, 66, 116]

# Labels for the age groups
age_labels = ['1-15', '16-26', '27-65', '65+']

# Set ages outside the valid range (below 1 or above 115) to NaN
people_cleaned['age'] = people_cleaned['age'].where(people_cleaned['age'].between(1, 115))

# Apply pd.cut() to create a new 'age_group' column
people_cleaned['age_group'] = pd.cut(people_cleaned['age'], bins=age_bins, labels=age_labels, right=False)

# Print the first few rows to verify the new grouping
print(people_cleaned[['age', 'age_group']].head())

    age age_group
0  25.0     16-26
1  37.0     27-65
2   NaN       NaN
3   NaN       NaN
4   NaN       NaN
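
Because `right=False` makes each bin closed on the left and open on the right, the last bin edge is exclusive. The following is a minimal sketch on toy ages (not the real column) that compares an upper edge of 115 with an upper edge of 116, to show which boundary ages land in which group. The names `with_115`, `with_116`, and `comparison` are illustrative only.

```python
import pandas as pd

# Toy ages chosen to sit on the bin boundaries
ages = pd.Series([0, 1, 15, 16, 26, 27, 65, 66, 114, 115])
age_labels = ['1-15', '16-26', '27-65', '65+']

# right=False makes every bin closed on the left and open on the right: [left, right)
with_115 = pd.cut(ages, bins=[1, 16, 27, 66, 115], labels=age_labels, right=False)
with_116 = pd.cut(ages, bins=[1, 16, 27, 66, 116], labels=age_labels, right=False)

comparison = pd.DataFrame({'age': ages, 'edge_115': with_115, 'edge_116': with_116})
print(comparison)
```

With an upper edge of 115, an age of exactly 115 falls outside every bin and becomes NaN; with 116 it is placed in the last group.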


#### 2.2.8 remove unuseful features

Using domain knowledge and an understanding of the business problem and target variable, I decided that several features, such as driver's license state and class, are not relevant, so I dropped them. After feature removal, the remaining dataset had 4 features, one of which was the foreign key: crash_record_id.

In [95]:
people_cleaned.drop(columns=['person_id',
                      'person_type',
                      'vehicle_id',
                      'drivers_license_state', 
                      'drivers_license_class',
                      'city', 
                      'state', 
                      'zipcode',
                      'hospital', 
                      'crash_date',
                      'seat_no',
                      'ejection',
                      'injury_classification',
                      'driver_vision',
                      'driver_action',
                      'physical_condition',
                      'bac_result',
                      'age', 
                      'safety_equipment',
                      'safety_equipment_category'
                     ], inplace = True)
people_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1979859 entries, 0 to 1979858
Data columns (total 4 columns):
 #   Column           Dtype   
---  ------           -----   
 0   crash_record_id  object  
 1   sex              object  
 2   airbag_deployed  object  
 3   age_group        category
dtypes: category(1), object(3)
memory usage: 47.2+ MB


In [96]:
# Convert the specified columns to category type
people_cleaned[['age_group', 'airbag_deployed', 'sex']] = people_cleaned[['age_group', 'airbag_deployed', 'sex']].astype('category')
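
To illustrate why the category dtype suits columns with only a few distinct labels, here is a minimal sketch comparing memory usage on a toy column. The data is made up for the example, and the exact byte counts will vary by pandas version and platform.

```python
import pandas as pd

# Toy column with few distinct labels repeated many times (hypothetical data)
s_object = pd.Series(['male', 'female', 'other'] * 100_000, dtype='object')
s_category = s_object.astype('category')

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = s_object.memory_usage(deep=True)
category_bytes = s_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

A categorical column stores each distinct label once plus a small integer code per row, which is where the saving comes from.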

In [97]:
# prints the percentages of null values within each feature
(people_cleaned.isna().sum()/ len(people_cleaned)*100)

By printing the percentage of null values in the remaining columns I see that sex and airbag_deployed only have a small percentage of null values, while age_group has almost 15 times more. Despite this, age_group still contains a majority of non-null values. With the dataset now containing nearly 2 million records, I feel fine simply dropping all rows with null values, as I will still have plenty of data to work with. 

In [98]:
# drops all rows with null values
people_cleaned.dropna(inplace = True)

In [99]:
# confirms that cleaning of this dataset is complete
people_cleaned.isna().sum()

With zero null values in each of the 4 features, the next step is to aggregate this data to help with merging the three datasets. I provide justifications for the aggregations and my merging process further along in this notebook.

### 2.3 Vehicles

Steps:

* **Preview Data**: `.head()`


* **Understand Structure**: `.info()`


* **Format Feature names and Row Values**: `.lower()`


* **Drop features with overly high null values**: `.isna().sum()/ len(df)` for percentage of nulls for each feature


* **Check for duplicates**: `.duplicated().sum()`


* **inspect remaining features**: `.value_counts()`; 

    * make intentional decisions to keep or drop using `.value_counts()` distribution and domain knowledge; 
    * make note of any features to keep that will need cleaning/cardinality reduction/etc.


* **remove unuseful features**: 

    * `.drop()` for list of features deemed not useful for analysis; 
    * store trimmed df as 'df_name_cleaned'


* **reduce feature cardinality with label reclassification**:

    * 'vehicle_type' to 'vehicle_category'
    * 'maneuver' to 'maneuver_category'
 
 
* **Convert data types**: 

    * convert stored data types to reflect true data types 
    * (text as string, categorical variables as categories, numeric variables as int, etc.) 
 
* **remove remaining nulls**: `.dropna()`
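
The first few steps above can be sketched as a small helper. This is a minimal sketch on toy data, not the code used in this notebook: `basic_clean` and `null_threshold` are hypothetical names, and the toy columns only mimic the real ones.

```python
import numpy as np
import pandas as pd

def basic_clean(df, null_threshold=90):
    """Hypothetical helper: lowercase names and values, drop mostly-null columns."""
    out = df.copy()
    out.columns = out.columns.str.lower()
    # Lowercase the string values in object columns
    for col in out.select_dtypes(include='object').columns:
        out[col] = out[col].str.lower()
    # Percentage of nulls per column; drop columns at or above the threshold
    null_pct = out.isna().mean() * 100
    high_null = list(null_pct[null_pct >= null_threshold].index)
    return out.drop(columns=high_null), high_null

# Toy data that only mimics the vehicles columns
toy = pd.DataFrame({
    'CRASH_RECORD_ID': ['A1', 'A1', 'B2'],
    'MAKE': ['FORD', 'Honda', None],
    'FIRE_I': [np.nan, np.nan, np.nan],
})

cleaned, dropped = basic_clean(toy)
n_duplicates = int(cleaned.duplicated().sum())
print(dropped, n_duplicates)
```

The remaining steps (inspecting `.value_counts()`, dropping features, and reclassifying labels) depend on judgment about each feature, so they are handled cell by cell below rather than in a helper.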

#### 2.3.1 Preview Data

In [100]:
vehicles_df.head()

#### 2.3.2 Understand Structure

In [101]:
vehicles_df.info()

#### 2.3.3 Format Feature names and Row Values

In [102]:
vehicles_df.columns = vehicles_df.columns.str.lower()

In [103]:
# Convert all string values in object columns to lowercase
for col in vehicles_df.select_dtypes(include='object').columns:
    vehicles_df[col] = vehicles_df[col].str.lower()

#### 2.3.4 Drop features with overly high null values

In [104]:
# creates a list of all features in which 90% or more of the values are null
high_null_features = list(vehicles_df.columns[(vehicles_df.isna().sum() / len(vehicles_df) * 100) >= 90])
high_null_features

['cmrc_veh_i',
 'fire_i',
 'exceed_speed_limit_i',
 'towed_by',
 'towed_to',
 'area_00_i',
 'area_03_i',
 'area_04_i',
 'area_09_i',
 'cmv_id',
 'usdot_no',
 'ccmc_no',
 'ilcc_no',
 'commercial_src',
 'gvwr',
 'carrier_name',
 'carrier_state',
 'carrier_city',
 'hazmat_placards_i',
 'hazmat_name',
 'un_no',
 'hazmat_present_i',
 'hazmat_report_i',
 'hazmat_report_no',
 'mcs_report_i',
 'mcs_report_no',
 'hazmat_vio_cause_crash_i',
 'mcs_vio_cause_crash_i',
 'idot_permit_no',
 'wide_load_i',
 'trailer1_width',
 'trailer2_width',
 'trailer1_length',
 'trailer2_length',
 'total_vehicle_length',
 'axle_cnt',
 'vehicle_config',
 'cargo_body_type',
 'load_type',
 'hazmat_out_of_service_i',
 'mcs_out_of_service_i',
 'hazmat_class']

In [105]:
vehicles_cleaned = vehicles_df.drop(columns=high_null_features)

In [106]:
# Check for duplicates based on crash_unit_id, crash_record_id, and vehicle_id
vehicles_cleaned[vehicles_cleaned.duplicated(subset=['crash_unit_id', 'crash_record_id', 'vehicle_id'], keep=False)]

Unnamed: 0,crash_unit_id,crash_record_id,crash_date,unit_no,unit_type,num_passengers,vehicle_id,make,model,lic_plate_state,...,area_02_i,area_05_i,area_06_i,area_07_i,area_08_i,area_10_i,area_11_i,area_12_i,area_99_i,first_contact_point


The output tells us that there are no duplicate rows in the vehicles_cleaned DataFrame based on the specified subset of columns: crash_unit_id, crash_record_id, and vehicle_id.

#### 2.3.5 deleting vehicles dataset from memory

In [107]:
# deletes the vehicles_df dataframe to clear up memory

del vehicles_df

In [108]:
# Perform garbage collection to free up memory by releasing unreferenced objects
# This helps to manage memory usage, especially when working with large datasets or memory-intensive operations.

gc.collect()

81

In [109]:
vehicles_cleaned['num_passengers'].value_counts()

This is redundant information: it counts passengers without the driver, and the same information is captured in occupant_cnt. I will drop this feature.

In [110]:
vehicles_cleaned['unit_no'].value_counts()

This describes the aftermath of the crash and is unhelpful for analysis, so I will remove it.

In [111]:
vehicles_cleaned['unit_type'].value_counts()

Most of the values are drivers or parked cars, so this will not be useful for analysis.

In [112]:
vehicles_cleaned['make'].value_counts()

High cardinality, so I will drop this feature.

In [113]:
vehicles_cleaned['model'].value_counts()

This feels like it could be helpful, but it has many 'unknown' and 'other' values and very high cardinality. The important information we would gain from it is already included in vehicle_type, so we can drop it.

In [114]:
vehicles_cleaned['vehicle_defect'].value_counts()

Most of the values are 'none' or 'unknown', so this will not be particularly useful for analysis and can be dropped.

In [115]:
vehicles_cleaned['vehicle_type'].value_counts()

In [116]:
vehicles_cleaned['travel_direction'].value_counts()

Unhelpful for analysis, so I will remove it.

In [117]:
vehicles_cleaned['maneuver'].value_counts()

This feature could be important as it has to do with what was happening prior to the crash.

In [118]:
vehicles_cleaned['towed_i'].value_counts()

This describes the aftermath of the crash, so it is unhelpful for analysis.

In [119]:
vehicles_cleaned['occupant_cnt'].value_counts()

It is unclear what the area_##_i features represent. They will be removed

In [120]:
vehicles_cleaned['first_contact_point'].value_counts()

This feature could indicate

In [121]:
vehicle_features_to_drop = ['num_passengers', 
                            'crash_unit_id',
                            'crash_date',
                            'unit_type',
                            'make', 
                            'model',
                            'vehicle_id',
                           'vehicle_defect',
                           'unit_no',
                           'lic_plate_state',
                            'vehicle_year',
                           'vehicle_use',
                           'travel_direction',
                           'towed_i',
                            'area_01_i',
                           'area_02_i', 
                            'area_05_i',
                            'area_06_i',
                            'area_07_i',
                            'area_08_i',
                            'area_10_i',
                            'area_11_i',
                            'area_12_i',
                            'area_99_i', 
                           'first_contact_point']

In [122]:
vehicles_cleaned = vehicles_cleaned.drop(columns=vehicle_features_to_drop)

In [123]:
vehicles_cleaned.info()

In [124]:
list(vehicles_cleaned.columns)

['crash_record_id', 'vehicle_type', 'maneuver', 'occupant_cnt']

In [125]:
# Create a dictionary to map the original vehicle types to broader categories
vehicle_type_mapping = {
    'passenger': 'Passenger Vehicles',
    'sport utility vehicle (suv)': 'SUVs',
    'van/mini-van': 'Passenger Vehicles',
    'pickup': 'Trucks',
    'truck - single unit': 'Trucks',
    'single unit truck with trailer': 'Trucks',
    'other': 'Other',
    'bus over 15 pass.': 'Buses',
    'bus up to 15 pass.': 'Buses',
    'tractor w/ semi-trailer': 'Trucks',
    'tractor w/o semi-trailer': 'Trucks',
    'motorcycle (over 150cc)': 'Motorcycles',
    'other vehicle with trailer': 'Other',
    'autocycle': 'Motorcycles',
    'moped or motorized bicycle': 'Motorcycles',
    'motor driven cycle': 'Motorcycles',
    'all-terrain vehicle (atv)': 'Recreational/Off-Highway Vehicles',
    'farm equipment': 'Farm and Specialized Equipment',
    '3-wheeled motorcycle (2 rear wheels)': 'Motorcycles',
    'recreational off-highway vehicle (rov)': 'Recreational/Off-Highway Vehicles',
    'snowmobile': 'Recreational/Off-Highway Vehicles',
    'unknown/na': np.nan  # Set 'unknown/na' to NaN
}

# Apply the mapping to the 'vehicle_type' column
vehicles_cleaned['vehicle_category'] = vehicles_cleaned['vehicle_type'].map(vehicle_type_mapping)

# Check the value counts for the new grouped categories
vehicles_cleaned['vehicle_category'].value_counts()

Passenger Vehicles                   1210579
SUVs                                  250398
Trucks                                114924
Buses                                  25067
Other                                  24559
Motorcycles                             6140
Recreational/Off-Highway Vehicles        237
Farm and Specialized Equipment            87
Name: vehicle_category, dtype: int64

In [126]:
# Filter out rows with 'Recreational/Off-Highway Vehicles' and 'Farm and Specialized Equipment'
vehicles_cleaned = vehicles_cleaned[~vehicles_cleaned['vehicle_category'].isin(['Recreational/Off-Highway Vehicles', 'Farm and Specialized Equipment'])]

# Check the value counts after removing those categories
vehicles_cleaned['vehicle_category'].value_counts()


Passenger Vehicles    1210579
SUVs                   250398
Trucks                 114924
Buses                   25067
Other                   24559
Motorcycles              6140
Name: vehicle_category, dtype: int64

In [127]:
vehicles_cleaned.info()

In [128]:
# Modify the maneuver mapping to treat 'unknown/na' as NaN
maneuver_mapping = {
    'straight ahead': 'Standard Movement',
    'slow/stop in traffic': 'Standard Movement',
    'passing/overtaking': 'Standard Movement',
    'unknown/na': np.nan,  # Set 'unknown/na' to NaN
    
    'parked': 'Reversing/Stopping',
    'entering traffic lane from parking': 'Reversing/Stopping',
    'starting in traffic': 'Reversing/Stopping',
    
    'turning left': 'Turn/Change of Direction',
    'turning right': 'Turn/Change of Direction',
    'u-turn': 'Turn/Change of Direction',
    'changing lanes': 'Turn/Change of Direction',
    'turning on red': 'Turn/Change of Direction',
    
    'backing': 'Reversing/Stopping',
    'avoiding vehicles/objects': 'Avoidance/Emergency Response',
    'skidding/control loss': 'Avoidance/Emergency Response',
    'negotiating a curve': 'Avoidance/Emergency Response',
    
    'leaving traffic lane to park': 'Reversing/Stopping',
    'enter from drive/alley': 'Reversing/Stopping',
    
    'driving wrong way': 'Special Cases',
    'diverging': 'Special Cases',
    'driverless': 'Special Cases',
    'disabled': 'Special Cases',
    
    'other': 'Special Cases',
}

# Apply the mapping to the 'maneuver' column
vehicles_cleaned['maneuver_category'] = vehicles_cleaned['maneuver'].map(maneuver_mapping)

# Check the value counts for the new 'maneuver_category'
vehicles_cleaned['maneuver_category'].value_counts()

In [129]:
# Replace 0 and 99 in 'occupant_cnt' with NaN for future dropping
# (negative values are not replaced here; they fall outside the bins below)
vehicles_cleaned['occupant_cnt'] = vehicles_cleaned['occupant_cnt'].replace([0, 99], np.nan)

# Define the bin edges; with right=True the bins are (1, 4], (4, 8], (8, 19], (19, 98], (98, inf)
bins = [1, 4, 8, 19, 98, float('inf')]
labels = ['Single Occupancy', 'Small Group', 'Medium Group', 'Large Group', 'Very Large Group']  # 5 labels for 5 bins

# Use pd.cut to categorize the 'occupant_cnt' column based on the bins
# Note: with include_lowest=False the first bin excludes its left edge, so a count of
# exactly 1 (like any other value outside the bins) is left as NaN in 'occupant_category'
vehicles_cleaned['occupant_category'] = pd.cut(
    vehicles_cleaned['occupant_cnt'], 
    bins=bins, 
    labels=labels, 
    right=True, 
    include_lowest=False
)

# Check the categories to ensure the correct label reclassification
vehicles_cleaned['occupant_category'].value_counts()

Single Occupancy    248056
Small Group           8763
Medium Group           429
Large Group             91
Very Large Group         0
Name: occupant_category, dtype: int64
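
To make the bin edges above easier to reason about, here is a minimal sketch on toy occupant counts (not the real column) that shows where boundary values land with `include_lowest=False` versus `include_lowest=True`. The names `as_written` and `with_lowest` are illustrative only.

```python
import pandas as pd

# Toy occupant counts chosen to sit on the bin boundaries
occupants = pd.Series([1, 2, 4, 5, 8, 9, 19, 20, 98])
bins = [1, 4, 8, 19, 98, float('inf')]
labels = ['Single Occupancy', 'Small Group', 'Medium Group',
          'Large Group', 'Very Large Group']

# right=True makes each bin open on the left and closed on the right: (left, right]
as_written = pd.cut(occupants, bins=bins, labels=labels,
                    right=True, include_lowest=False)

# include_lowest=True additionally closes the first bin on the left, so 1 is included
with_lowest = pd.cut(occupants, bins=bins, labels=labels,
                     right=True, include_lowest=True)

print(pd.DataFrame({'occupant_cnt': occupants,
                    'as_written': as_written,
                    'with_lowest': with_lowest}))
```

With `include_lowest=False`, a count of exactly 1 is not assigned to any bin, so the 'Single Occupancy' label covers counts of 2 through 4. With `include_lowest=True`, a count of 1 is assigned to 'Single Occupancy' as well.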

In [130]:
# Remove 'Very Large Group' rows since it contains zero values
vehicles_cleaned = vehicles_cleaned[vehicles_cleaned['occupant_category'] != 'Very Large Group']

# Drop 'Very Large Group' from the categorical data if it still exists
vehicles_cleaned['occupant_category'] = vehicles_cleaned['occupant_category'].cat.remove_categories('Very Large Group')

# Check the categories to ensure the correct label reclassification
vehicles_cleaned['occupant_category'].value_counts()

Single Occupancy    248056
Small Group           8763
Medium Group           429
Large Group             91
Name: occupant_category, dtype: int64

In [131]:
vehicles_cleaned.info()

In [132]:
vehicles_cleaned.isna().sum()

In [133]:
vehicles_cleaned['occupant_cnt'].isna().sum()

In [134]:
vehicles_cleaned['occupant_cnt'].value_counts()

In [135]:
# Drop the original and intermediate columns that are no longer needed ('occupant_category', 'occupant_cnt', 'vehicle_type', 'maneuver')
vehicles_cleaned = vehicles_cleaned.drop(columns=['occupant_category', 'occupant_cnt', 'vehicle_type', 'maneuver'])


In [136]:
vehicles_cleaned.isna().sum()

crash_record_id           0
vehicle_category     206831
maneuver_category    204227
dtype: int64

In [137]:
# Drop rows with any null (NaN) values
vehicles_cleaned = vehicles_cleaned.dropna()

In [138]:
vehicles_cleaned.info()

In [139]:
vehicles_cleaned.isna().sum()

### Merging `crashes_cleaned`, `people_cleaned`, & `vehicles_cleaned`

Relationships Between Tables and Justification for Merging

1. Relationship Between crashes_cleaned and people_cleaned

* crash_record_id is the primary key in crashes_cleaned and appears in people_cleaned.
* Each crash in crashes_cleaned can involve multiple people (drivers, passengers, pedestrians).

This means the relationship is:

* One-to-Many: One crash (crashes_cleaned) can have many people (people_cleaned) associated with it.

2. Relationship Between crashes_cleaned and vehicles_cleaned

* crash_record_id is the primary key in crashes_cleaned and appears in vehicles_cleaned.
* Each crash in crashes_cleaned can involve multiple vehicles.

This means the relationship is:

* One-to-Many: One crash (crashes_cleaned) can have many vehicles (vehicles_cleaned) associated with it.

3. Relationship Between people_cleaned and vehicles_cleaned

* Both tables are linked via crash_record_id, but they describe different entities.
* People (people_cleaned) and vehicles (vehicles_cleaned) may not have a direct relationship unless there’s another shared identifier (e.g., vehicle_id or person_id).

This means the relationship is:

* Many-to-Many: Many people can be in many vehicles within the same crash. (However, this relationship is indirectly expressed through crash_record_id.)
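
The effect of these one-to-many relationships on a merge can be seen in a minimal sketch with toy tables (the ids and values are made up). It compares merging person-level rows directly with merging after aggregating to one row per crash, and uses the `validate` argument of `DataFrame.merge` to assert that the second merge is one-to-one.

```python
import pandas as pd

# Toy crash-level and person-level tables (hypothetical ids and values)
crashes = pd.DataFrame({'crash_record_id': ['c1', 'c2'],
                        'severity_category': ['Serious', 'Non-serious']})
people = pd.DataFrame({'crash_record_id': ['c1', 'c1', 'c1', 'c2'],
                       'sex': ['male', 'female', 'male', 'female']})

# Merging person-level rows directly repeats each crash once per person
naive = crashes.merge(people, on='crash_record_id', how='inner')

# Aggregating to one row per crash first keeps the merge one-to-one
people_agg = (people.groupby('crash_record_id')['sex']
              .agg(lambda x: x.mode().iloc[0])
              .reset_index())
safe = crashes.merge(people_agg, on='crash_record_id', how='inner',
                     validate='one_to_one')

print(len(naive), len(safe))
```

If the aggregated table ever contained a repeated crash_record_id, `validate='one_to_one'` would raise a `MergeError` instead of silently duplicating rows.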

### Aggregation

The goal of aggregating the people_cleaned dataset is to handle the many-to-one relationship between the people_cleaned and crashes_cleaned datasets. Each crash_record_id in crashes_cleaned may have multiple associated records in people_cleaned, as a single crash may involve multiple people. Since our focus is on predicting the severity of crashes using the severity_category (the target variable in crashes_cleaned), we need to aggregate the people data to ensure each crash record has only one corresponding row of data.

The aggregation process involves grouping the people_cleaned dataset by crash_record_id, which is the shared key between the two datasets. For features like sex, age_group, and airbag_deployed, we use the most frequent value (the mode) for each crash. When there is a tie (multiple values with the same frequency), `mode()` returns all tied values in sorted order and we keep the first one with `.iloc[0]`. This makes the result deterministic, although the choice among tied values is arbitrary.

Similarly, the vehicles_cleaned dataset also has a many-to-one relationship with crashes_cleaned, where each crash_record_id in crashes_cleaned can have multiple associated vehicle records. As with the people_cleaned data, we need to aggregate the vehicle data to ensure that each crash record has a corresponding single row. The aggregation will allow us to focus on features like vehicle_category and other vehicle-specific attributes that might affect crash severity.

By grouping the vehicles_cleaned dataset by crash_record_id, we can apply the same aggregation logic as with the people data. This ensures that we retain the most important vehicle-specific features while also maintaining a consistent one-to-one relationship between crashes_cleaned and the aggregated datasets.
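
A minimal sketch of the per-crash mode aggregation on toy data (hypothetical ids and values), including a crash where two values are tied. The helper name `first_mode` is illustrative only.

```python
import pandas as pd

# Toy person-level data: crash c1 has a tie, crash c2 has a clear mode
people = pd.DataFrame({
    'crash_record_id': ['c1', 'c1', 'c2', 'c2', 'c2'],
    'sex': ['male', 'female', 'male', 'male', 'female'],
})

def first_mode(values):
    """Return the first of the modes; Series.mode() returns them in sorted order."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else None

result = people.groupby('crash_record_id')['sex'].agg(first_mode)
print(result)
```

For the tied crash c1 the result is 'female', because the tied modes come back sorted and the first one is kept. For c2 the result is 'male', the clear majority.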

In [140]:
# Start the timer
start_time = time.time()

# Perform a single groupby and aggregate all columns at once
people_aggregated = (
    people_cleaned.groupby('crash_record_id').agg({
        'sex': lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan,
        'age_group': lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan,
        'airbag_deployed': lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan
    }).reset_index()
)

# Stop the timer
end_time = time.time()

# Calculate and print the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

# Previews the aggregated results
people_aggregated.head()

Elapsed time: 298.16 seconds


                                     crash_record_id     sex age_group airbag_deployed
0  000013b0123279411e0ec856dae95ab9f0851764350b7f...  female     27-65    Not Deployed
1  00002c0771fb6f2c70ba775b7f6b501608cadea85c1dd1...  female     27-65    Not Deployed
2  00005696946846c8b8a1d378dba4e2a5ed84a9b2876fe0...    male     27-65    Not Deployed
3  000070ed7a6357c3298f5edc6fb7d5ce925a10f46660f3...    male     27-65    Not Deployed
4  0000c280b9c15e9ec96aa2eed34bf0f3ef1d604c6ea460...  female     27-65    Not Deployed
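
Applying a Python lambda to every group is slow on a dataset of this size. One possible alternative, shown here as a minimal sketch on toy data and not as the method used in this notebook, is to count each (crash, value) pair once and keep the most frequent pair per crash. The helper `fast_mode` is hypothetical. On string columns it breaks ties alphabetically, which matches `mode().iloc[0]`; tie-breaking on categorical columns should be checked before relying on it.

```python
import pandas as pd

# Toy person-level data (hypothetical ids and values)
people = pd.DataFrame({
    'crash_record_id': ['c1', 'c1', 'c2', 'c2', 'c2'],
    'sex': ['male', 'female', 'male', 'male', 'female'],
})

def fast_mode(df, key, col):
    """Hypothetical helper: most frequent value of `col` per `key` without a lambda."""
    # Count each (key, value) pair once
    counts = df.groupby([key, col], observed=True).size().reset_index(name='n')
    # Highest count first within each key; ties broken by the value's sort order
    counts = counts.sort_values([key, 'n', col], ascending=[True, False, True])
    # Keep the top row per key
    return counts.drop_duplicates(key).set_index(key)[col]

fast = fast_mode(people, 'crash_record_id', 'sex')
slow = people.groupby('crash_record_id')['sex'].agg(lambda x: x.mode().iloc[0])
print(fast.equals(slow))
```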


Large Data Processing:

* If you’re working with large datasets (e.g., in data science or machine learning) and have explicitly deleted variables (e.g., using del) but still notice high memory usage, calling gc.collect() forces a garbage-collection pass, which can free objects that are no longer reachable (for example, those caught in reference cycles).
    
Long-Running Programs:

* For applications that run continuously (like servers or data pipelines), invoking gc.collect() at specific intervals can help manage memory.

In [141]:
# deletes the people_cleaned dataframe to clear up memory
del people_cleaned

# Perform garbage collection to free up memory by releasing unreferenced objects
# This helps to manage memory usage, especially when working with large datasets or memory-intensive operations.

gc.collect()

21

In [142]:
# Start the timer
start_time = time.time()

# Aggregating the vehicles data by crash_record_id
vehicles_aggregated = vehicles_cleaned.groupby('crash_record_id').agg({
    'vehicle_category': lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan,  # Select the first mode
    'maneuver_category': lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan   # Select the first mode
}).reset_index()

# Stop the timer
end_time = time.time()

# Calculate and print the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

# View the aggregated vehicles data
vehicles_aggregated.head()

Elapsed time: 266.49 seconds


                                     crash_record_id    vehicle_category   maneuver_category
0  000013b0123279411e0ec856dae95ab9f0851764350b7f...  Passenger Vehicles  Reversing/Stopping
1  00002c0771fb6f2c70ba775b7f6b501608cadea85c1dd1...  Passenger Vehicles   Standard Movement
2  00005696946846c8b8a1d378dba4e2a5ed84a9b2876fe0...              Trucks  Reversing/Stopping
3  000070ed7a6357c3298f5edc6fb7d5ce925a10f46660f3...  Passenger Vehicles   Standard Movement
4  0000b70a00c8809f76b5234f81753264d9160c314cc5e6...  Passenger Vehicles  Reversing/Stopping


In [143]:
# deletes the vehicles_cleaned dataframe to clear up memory
del vehicles_cleaned

# Perform garbage collection to free up memory by releasing unreferenced objects
# This helps to manage memory usage, especially when working with large datasets or memory-intensive operations.
gc.collect()

21

In [144]:
# Merge the dataframes using an inner join
merged_df = crashes_cleaned.merge(people_aggregated, on='crash_record_id', how='inner')

In [145]:
# deletes the crashes_cleaned dataframe to clear up memory
del crashes_cleaned

# Perform garbage collection to free up memory by releasing unreferenced objects
# This helps to manage memory usage, especially when working with large datasets or memory-intensive operations.

gc.collect()

42

In [146]:
merged_df = merged_df.merge(vehicles_aggregated, on='crash_record_id', how='inner')

In [147]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 705346 entries, 0 to 705345
Data columns (total 16 columns):
 #   Column                    Non-Null Count   Dtype   
---  ------                    --------------   -----   
 0   crash_record_id           705346 non-null  object  
 1   lighting_condition        705346 non-null  category
 2   roadway_surface_cond      705346 non-null  category
 3   speed_limit_category      705346 non-null  category
 4   traffic_control_category  705346 non-null  category
 5   road_category             705346 non-null  category
 6   severity_category         705346 non-null  category
 7   crash_cause_category      705346 non-null  category
 8   time_of_day               705346 non-null  category
 9   day_of_week               705346 non-null  category
 10  season                    705346 non-null  category
 11  sex                       705346 non-null  object  
 12  age_group                 705346 non-null  object  
 13  airbag_deployed           705

In [148]:
# Convert all the columns (except 'crash_record_id') to category type
feature_cols = [col for col in merged_df.columns if col != 'crash_record_id']
merged_df[feature_cols] = merged_df[feature_cols].astype('category')

In [149]:
# Verify the changes
merged_df.dtypes

crash_record_id               object
lighting_condition          category
roadway_surface_cond        category
speed_limit_category        category
traffic_control_category    category
road_category               category
severity_category           category
crash_cause_category        category
time_of_day                 category
day_of_week                 category
season                      category
sex                         category
age_group                   category
airbag_deployed             category
vehicle_category            category
maneuver_category           category
dtype: object

In [150]:
merged_df.columns = merged_df.columns.str.lower()

In [151]:
# Convert all string values in object columns to lowercase
for col in merged_df.select_dtypes(include='object').columns:
    merged_df[col] = merged_df[col].str.lower()

In [152]:
# Check for remaining null values
merged_df.isna().sum()

crash_record_id             0
lighting_condition          0
roadway_surface_cond        0
speed_limit_category        0
traffic_control_category    0
road_category               0
severity_category           0
crash_cause_category        0
time_of_day                 0
day_of_week                 0
season                      0
sex                         0
age_group                   0
airbag_deployed             0
vehicle_category            0
maneuver_category           0
dtype: int64

In [153]:
merged_df['severity_category'].value_counts()

Non-serious    692193
Serious         13153
Name: severity_category, dtype: int64

In [154]:
# Check for duplicate crash_record_id values
merged_df[merged_df.duplicated(subset='crash_record_id', keep=False)]

Unnamed: 0,crash_record_id,lighting_condition,roadway_surface_cond,speed_limit_category,traffic_control_category,road_category,severity_category,crash_cause_category,time_of_day,day_of_week,season,sex,age_group,airbag_deployed,vehicle_category,maneuver_category


In [155]:
# Iterate over each column in the DataFrame and print value counts for each feature
for column in merged_df.columns:
    print(f"Value counts for {column}:")
    print(merged_df[column].value_counts())
    print("-" * 50)  # Optional separator for readability

Value counts for crash_record_id:
6c1659069e9c6285a650e70d6f9b574ed5f64c12888479093dfeef179c0344ec6d2057eae224b5c0d5dfc278c0a237f8c22543f07fdef2e4a95a3849871c9345    1
b1099e521b2895496fd1e07adf0a7a8e5951524324b46fc6f1bb55a7cdd91151d6333fef63913fdfe3e75cd4ba47c9e8aba1a54d3af37ce3db416a6e5cf584f8    1
b108accb609c2d67307d0c8c8c7d76be4792faed6554abe99f3780468e738e7493f8ee9089239c632fe500868edeefe1859fc5b2820fb5be47986b67c91985a8    1
b108c6a47e34dd63f88879fd045946d5a95c90cae2910b8aab3f26cbd958601fcdf779fbbca2bd13650741a81064e7d5d19c87a776db46b32b5b3104c6708e69    1
b108cdbe0649a3def6b17541148bff1b8a3b5289ac888ae34c615e96c2e745ec79fe85bed2fb00953f34362e65d01d30dc0a6aa62c140d7b7f89ea2c79a8ea6a    1
                                                                                                                                   ..
559aa3a1f71ea855105d98432e7b25145eaef80df6e74f369827fa2ae563859454e044ee166f7c0b0a3e8180eebdb0a49c6bb7a90187bd1ea5ce1a021133ecca    1
559aa4fee2087be0d8d2e17dafe8

In [156]:
merged_df.isna().sum()

crash_record_id             0
lighting_condition          0
roadway_surface_cond        0
speed_limit_category        0
traffic_control_category    0
road_category               0
severity_category           0
crash_cause_category        0
time_of_day                 0
day_of_week                 0
season                      0
sex                         0
age_group                   0
airbag_deployed             0
vehicle_category            0
maneuver_category           0
dtype: int64

In [157]:
# Columns in which placeholder values should be removed
columns_to_clean = ['airbag_deployed', 'roadway_surface_cond', 'lighting_condition']

# Iterate over each column in the list
for column in columns_to_clean:
    # Drop rows whose value in the current column matches an unwanted label
    merged_df = merged_df[~merged_df[column].str.contains(
        'unknown|not applicable|other object|unknown/other', case=False, na=False)]

# Verify the changes
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 639462 entries, 0 to 705345
Data columns (total 16 columns):
 #   Column                    Non-Null Count   Dtype   
---  ------                    --------------   -----   
 0   crash_record_id           639462 non-null  object  
 1   lighting_condition        639462 non-null  category
 2   roadway_surface_cond      639462 non-null  category
 3   speed_limit_category      639462 non-null  category
 4   traffic_control_category  639462 non-null  category
 5   road_category             639462 non-null  category
 6   severity_category         639462 non-null  category
 7   crash_cause_category      639462 non-null  category
 8   time_of_day               639462 non-null  category
 9   day_of_week               639462 non-null  category
 10  season                    639462 non-null  category
 11  sex                       639462 non-null  category
 12  age_group                 639462 non-null  category
 13  airbag_deployed           639
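
One side effect of filtering rows in categorical columns is that labels with no remaining rows stay registered as categories, so they can still show up with a count of zero in `value_counts()`. The sketch below illustrates this on a toy frame (the values are illustrative, not taken from the crash data) and shows one optional way to drop them with `cat.remove_unused_categories()`.

```python
import pandas as pd

# Toy frame standing in for merged_df; values are illustrative, not from the real data
toy_df = pd.DataFrame({
    "lighting_condition": pd.Categorical(
        ["daylight", "unknown", "darkness", "daylight"]
    ),
})

# Same style of filter as above: drop rows whose label matches an unwanted pattern
mask = toy_df["lighting_condition"].str.contains("unknown", case=False, na=False)
filtered = toy_df[~mask].copy()

# 'unknown' has no rows left but is still a registered category
leftover = list(filtered["lighting_condition"].cat.categories)

# Optionally drop categories that no longer occur in the data
filtered["lighting_condition"] = (
    filtered["lighting_condition"].cat.remove_unused_categories()
)
remaining = list(filtered["lighting_condition"].cat.categories)

print("before cleanup:", leftover)
print("after cleanup: ", remaining)
```

Whether this cleanup matters depends on the later steps; it is mainly relevant if empty categories would otherwise produce all-zero columns during encoding.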

In [158]:
# Drop the 'crash_record_id' column from the merged dataframe
merged_df = merged_df.drop('crash_record_id', axis=1)

In [159]:
merged_df.shape

(639462, 15)

In [160]:
merged_df.isna().sum()

lighting_condition          0
roadway_surface_cond        0
speed_limit_category        0
traffic_control_category    0
road_category               0
severity_category           0
crash_cause_category        0
time_of_day                 0
day_of_week                 0
season                      0
sex                         0
age_group                   0
airbag_deployed             0
vehicle_category            0
maneuver_category           0
dtype: int64

In [161]:
merged_df['severity_category'].value_counts(normalize = True)

Non-serious    0.980882
Serious        0.019118
Name: severity_category, dtype: float64

From the above output, we see that the classes are heavily imbalanced: roughly 98.1% of crashes are non-serious and only about 1.9% are serious. I will need to address this during the modeling phase.
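
As a rough illustration of what handling the imbalance could look like, here is a minimal sketch of inverse-frequency class weights, the same heuristic scikit-learn applies when `class_weight='balanced'` is used. The labels are hypothetical and only mimic the proportions above; this is one possible option, not necessarily the approach taken in the modeling notebook.

```python
import pandas as pd

# Hypothetical labels mimicking the imbalance above (2% 'Serious'); not the real data
labels = pd.Series(["Non-serious"] * 98 + ["Serious"] * 2, name="severity_category")

counts = labels.value_counts()

# Inverse-frequency weights: n_samples / (n_classes * class_count)
class_weights = (len(labels) / (len(counts) * counts)).to_dict()

print(class_weights)
```

The rare class receives a much larger weight, so misclassifying a serious crash costs the model more than misclassifying a non-serious one.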

In [162]:
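# Encode the binary target: 1 = Serious, 0 = Non-serious
# (explicit assignment instead of a chained inplace replace, which newer pandas versions may not apply to merged_df)
merged_df['severity_category'] = merged_df['severity_category'].map({
    'Serious': 1,
    'Non-serious': 0
}).astype('int64')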

## Sampling a subset for feature-importance modeling

The full dataset is very large, so I take a stratified 15% sample for the feature-importance modeling. This gives a subset of 95,919 records, which keeps computation time and memory usage manageable while still leaving adequate data.

In [163]:
# Define the desired sample size (e.g., 15% of the dataset)
sample_size = 0.15 

# Perform stratified sampling to retain class distribution
crashes_finalized_df, _ = train_test_split(
    merged_df, 
    test_size=1-sample_size,  # Retain only `sample_size` fraction
    stratify=merged_df['severity_category'], 
    random_state=42  # For reproducibility
)

In [164]:
crashes_finalized_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95919 entries, 517577 to 154697
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   lighting_condition        95919 non-null  category
 1   roadway_surface_cond      95919 non-null  category
 2   speed_limit_category      95919 non-null  category
 3   traffic_control_category  95919 non-null  category
 4   road_category             95919 non-null  category
 5   severity_category         95919 non-null  int64   
 6   crash_cause_category      95919 non-null  category
 7   time_of_day               95919 non-null  category
 8   day_of_week               95919 non-null  category
 9   season                    95919 non-null  category
 10  sex                       95919 non-null  category
 11  age_group                 95919 non-null  category
 12  airbag_deployed           95919 non-null  category
 13  vehicle_category          95919 non-null

In [165]:
# Confirm the class distribution remains the same
print("Original class distribution:")
print(merged_df['severity_category'].value_counts(normalize=True))

Original class distribution:
0    0.980882
1    0.019118
Name: severity_category, dtype: float64


In [166]:
print("\nSampled class distribution:")
print(crashes_finalized_df['severity_category'].value_counts(normalize=True))


Sampled class distribution:
0    0.98088
1    0.01912
Name: severity_category, dtype: float64
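
The visual comparison above can also be expressed as a numeric check. The following is a minimal sketch on synthetic data (a stand-in for `merged_df`, not the real crash records) that measures the largest gap between the original and sampled class shares after a stratified `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for merged_df (2% positive class); not the real crash data
toy_df = pd.DataFrame({"severity_category": [0] * 980 + [1] * 20})

sample_df, _ = train_test_split(
    toy_df,
    test_size=0.85,  # keep roughly 15%, as in the sampling cell
    stratify=toy_df["severity_category"],
    random_state=42,
)

original = toy_df["severity_category"].value_counts(normalize=True)
sampled = sample_df["severity_category"].value_counts(normalize=True)

# Largest absolute gap between class shares; should be tiny under stratification
max_gap = (original - sampled).abs().max()
print(f"max difference in class share: {max_gap:.4f}")
```

A check like this could be turned into an assertion with a small tolerance if the sampling step is ever rerun with a different fraction or seed.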


## Exporting dataset to Kaggle

I save the cleaned DataFrame as a CSV file and manually upload it to Kaggle for use in the rest of the project.

In [167]:
# Save the cleaned and finalized DataFrame to a CSV file in the 'data' directory
crashes_finalized_df.to_csv('./data/crashes_finalized_df.csv', index=False)
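
One caveat when reloading the exported file: CSV stores no dtype information, so the `category` columns will not come back as `category` after `pd.read_csv`. The sketch below shows the round trip on a toy frame (illustrative values, with an in-memory buffer in place of the real file) and one way to re-cast the feature columns after loading.

```python
import io

import pandas as pd

# Toy frame standing in for crashes_finalized_df; values are illustrative
toy_df = pd.DataFrame({
    "season": pd.Categorical(["winter", "summer", "winter"]),
    "severity_category": [0, 1, 0],
})

# Round-trip through CSV text (an in-memory buffer instead of a file on disk)
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The categorical dtype is lost in the round trip
dtype_after_reload = reloaded["season"].dtype

# Re-cast every feature column except the integer target
feature_cols = [c for c in reloaded.columns if c != "severity_category"]
reloaded = reloaded.astype({c: "category" for c in feature_cols})

print(dtype_after_reload, "->", reloaded["season"].dtype)
```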