# Maryland Traffic Violation Analysis

## Summer 2024 Data Science Project

Abebaw Tereda, Oscar Javier Soto, and Geremew Belew


# Introduction

Analyzing traffic violations in Maryland offers insight into the patterns and trends of traffic-related incidents in the region. Using a dataset sourced from Kaggle, we aim to identify the most common types of traffic violations and their frequencies. We also analyze temporal patterns, including peak times and seasonal variations, examine the geographic distribution of violations across different areas, and explore how demographic factors such as race and gender relate to violation trends.

The dataset includes detailed information on each traffic violation, such as the date and time of the incident, location, description of the violation, and demographic details of the individuals involved. This helps us analyze the temporal and spatial distribution of traffic violations and examine potential correlations with demographic variables.

Understanding the dynamics of traffic violations is crucial for developing targeted interventions to improve road safety and reduce traffic-related incidents. Insights derived from this analysis can inform policymakers, law enforcement agencies, and community stakeholders, enabling them to implement data-driven strategies to enhance traffic management and public safety.

The analysis will employ various statistical and data visualization techniques to uncover patterns and relationships within the dataset. We will use Pandas for data manipulation, Matplotlib for visualization, and machine learning algorithms to predict trends and identify high-risk factors associated with traffic violations.


# Data Curation
The dataset used in this analysis is sourced from Kaggle: [Traffic Violations in Maryland County](https://www.kaggle.com/datasets/rounak041993/traffic-violations-in-maryland-county). 

Each record describes a single traffic violation: the date and time of the incident, the location, a description of the violation, and demographic details of the individuals involved. These fields let us analyze the temporal and spatial distribution of traffic violations and examine potential correlations with demographic variables.

## Data preprocessing

### Loading and understanding the Dataset

First, we import the necessary libraries and load the dataset into a Pandas DataFrame. Pandas is a powerful data manipulation library in Python that simplifies data handling and preparation.

In [3]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore, chi2_contingency

Load the `Traffic_Violations.csv` dataset into the `traffic_violations_df` DataFrame and display its first rows.

In [4]:
# Load the CSV file and display the first five rows
traffic_violations_df = pd.read_csv("Traffic_Violations.csv")
traffic_violations_df.head()



| | Date Of Stop | Time Of Stop | Agency | SubAgency | Description | Location | Latitude | Longitude | Accident | Belts | ... | Charge | Article | Contributed To Accident | Race | Gender | Driver City | Driver State | DL State | Arrest Type | Geolocation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 09/24/2013 | 17:11:00 | MCP | 3rd district, Silver Spring | DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGI... | 8804 FLOWER AVE | | | No | No | ... | 13-401(h) | Transportation Article | No | BLACK | M | TAKOMA PARK | MD | MD | A - Marked Patrol | |
| 1 | 08/29/2017 | 10:19:00 | MCP | 2nd district, Bethesda | DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC... | WISCONSIN AVE@ ELM ST | 38.981725 | -77.092757 | No | No | ... | 21-201(a1) | Transportation Article | No | WHITE | F | FAIRFAX STATION | VA | VA | A - Marked Patrol | (38.981725, -77.0927566666667) |
| 2 | 12/01/2014 | 12:52:00 | MCP | 6th district, Gaithersburg / Montgomery Village | FAILURE STOP AND YIELD AT THRU HWY | CHRISTOPHER AVE/MONTGOMERY VILLAGE AVE | 39.162888 | -77.229088 | No | No | ... | 21-403(b) | Transportation Article | No | BLACK | F | UPPER MARLBORO | MD | MD | A - Marked Patrol | (39.1628883333333, -77.2290883333333) |
| 3 | 08/29/2017 | 09:22:00 | MCP | 3rd district, Silver Spring | FAILURE YIELD RIGHT OF WAY ON U TURN | CHERRY HILL RD./CALVERTON BLVD. | 39.056975 | -76.954633 | No | No | ... | 21-402(b) | Transportation Article | No | BLACK | M | FORT WASHINGTON | MD | MD | A - Marked Patrol | (39.056975, -76.9546333333333) |
| 4 | 08/28/2017 | 23:41:00 | MCP | 6th district, Gaithersburg / Montgomery Village | FAILURE OF DR. TO MAKE LANE CHANGE TO AVAIL. L... | 355 @ SOUTH WESTLAND DRIVE | | | No | No | ... | 21-405(e1) | Transportation Article | No | WHITE | M | GAITHERSBURG | MD | MD | A - Marked Patrol | |


Display the column names.

In [5]:
#Display the column list
traffic_violations_df.columns

Index(['Date Of Stop', 'Time Of Stop', 'Agency', 'SubAgency', 'Description',
       'Location', 'Latitude', 'Longitude', 'Accident', 'Belts',
       'Personal Injury', 'Property Damage', 'Fatal', 'Commercial License',
       'HAZMAT', 'Commercial Vehicle', 'Alcohol', 'Work Zone', 'State',
       'VehicleType', 'Year', 'Make', 'Model', 'Color', 'Violation Type',
       'Charge', 'Article', 'Contributed To Accident', 'Race', 'Gender',
       'Driver City', 'Driver State', 'DL State', 'Arrest Type',
       'Geolocation'],
      dtype='object')

In this step, we inspect the data types of each column in the original dataset. Understanding the data types is crucial because it helps us identify which columns need type conversion or further preprocessing. For instance, date and time columns should be converted to datetime objects, and categorical variables might need to be converted to numerical formats for analysis. By examining the data types, we ensure that the data is in the correct format for subsequent analysis and manipulation.

In [6]:
# Display the data type of each column in the original dataset
traffic_violations_df.dtypes

Date Of Stop                object
Time Of Stop                object
Agency                      object
SubAgency                   object
Description                 object
Location                    object
Latitude                   float64
Longitude                  float64
Accident                    object
Belts                       object
Personal Injury             object
Property Damage             object
Fatal                       object
Commercial License          object
HAZMAT                      object
Commercial Vehicle          object
Alcohol                     object
Work Zone                   object
State                       object
VehicleType                 object
Year                       float64
Make                        object
Model                       object
Color                       object
Violation Type              object
Charge                      object
Article                     object
Contributed To Accident     object
Race                
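As a rough illustration of the conversions mentioned above, the sketch below parses the date and time strings into a single datetime column and converts a Yes/No flag and a text column into more specific types. It runs on a tiny hand-built frame whose values are copied from the preview; the names `sample_df` and `Stop Datetime` are assumptions made for this example and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row sample mirroring a few columns of the dataset.
sample_df = pd.DataFrame({
    "Date Of Stop": ["09/24/2013", "08/29/2017", "12/01/2014"],
    "Time Of Stop": ["17:11:00", "10:19:00", "12:52:00"],
    "Accident": ["No", "No", "No"],
    "Race": ["BLACK", "WHITE", "BLACK"],
})

# Combine the date and time strings and parse them with an explicit format.
sample_df["Stop Datetime"] = pd.to_datetime(
    sample_df["Date Of Stop"] + " " + sample_df["Time Of Stop"],
    format="%m/%d/%Y %H:%M:%S",
)

# Map the Yes/No flag to booleans and store Race as a categorical.
sample_df["Accident"] = sample_df["Accident"].map({"Yes": True, "No": False})
sample_df["Race"] = sample_df["Race"].astype("category")

print(sample_df.dtypes)
```

Passing an explicit `format` avoids ambiguity between month-first and day-first dates, which matters for values such as `12/01/2014`.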

To understand the size and structure of our dataset, we check the number of rows and columns. This information provides an overview of the dataset's dimensions and helps in assessing the volume of data we are working with. Knowing the number of rows and columns is also useful for later steps in data analysis, such as memory management and performance optimization.

In [7]:
#Display the number of rows and columns
number_of_rows, number_of_columns = traffic_violations_df.shape
print("Number of rows: ", number_of_rows)
print("Number of columns: ", number_of_columns)

Number of rows:  1292399
Number of columns:  35
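To make the memory-management point concrete, the following sketch measures how much memory a frame occupies and how that changes when repetitive text columns are stored as categoricals. The frame `sample_df` is a hypothetical stand-in built for this example; the actual savings on the full dataset are not measured here.

```python
import pandas as pd

# Hypothetical sample with repetitive text columns, as in the real dataset.
sample_df = pd.DataFrame({
    "Agency": ["MCP"] * 1000,
    "Gender": ["M", "F"] * 500,
})

rows, cols = sample_df.shape

# deep=True measures the string contents as well as the column containers.
bytes_before = sample_df.memory_usage(deep=True).sum()

# Repetitive string columns usually shrink when stored as categoricals.
compact_df = sample_df.astype("category")
bytes_after = compact_df.memory_usage(deep=True).sum()

print(f"{rows} rows x {cols} columns")
print(f"as loaded: {bytes_before} bytes, as category: {bytes_after} bytes")
```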


To ensure the quality and completeness of our dataset, we need to identify columns with missing values. By using the count function, we can determine the number of non-null entries in each column. Comparing these counts with the total number of rows in the dataset helps us pinpoint columns that contain missing data. Identifying these columns is the first step in handling missing values, which is crucial for accurate data analysis and modeling.

In [8]:
# Use count() to see how many non-null values each column has
traffic_violations_df.count()

Date Of Stop               1292399
Time Of Stop               1292399
Agency                     1292399
SubAgency                  1292389
Description                1292390
Location                   1292397
Latitude                   1197045
Longitude                  1197045
Accident                   1292399
Belts                      1292399
Personal Injury            1292399
Property Damage            1292399
Fatal                      1292399
Commercial License         1292399
HAZMAT                     1292399
Commercial Vehicle         1292399
Alcohol                    1292399
Work Zone                  1292399
State                      1292340
VehicleType                1292399
Year                       1284325
Make                       1292342
Model                      1292212
Color                      1276272
Violation Type             1292399
Charge                     1292399
Article                    1227230
Contributed To Accident    1292399
Race                
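The comparison described above can be sketched in a few lines: subtracting each column's non-null count from the total number of rows gives the number of missing entries. The frame `sample_df` and its values are hypothetical and serve only to illustrate the arithmetic.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few gaps, loosely modeled on the dataset.
sample_df = pd.DataFrame({
    "Agency": ["MCP", "MCP", "MCP", "MCP"],
    "Latitude": [np.nan, 38.981725, 39.162888, np.nan],
    "Color": ["BLACK", None, "WHITE", "RED"],
})

# count() returns the non-null entries per column; subtracting it from the
# row count gives the number of missing entries per column.
missing_per_column = len(sample_df) - sample_df.count()

# Keep only the columns that actually have gaps.
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```

The result matches what `isnull().sum()` reports, so either approach can be used to locate incomplete columns.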

To ensure the integrity and uniqueness of our dataset, we need to check for duplicate entries. Duplicate rows can distort analysis results and lead to incorrect conclusions. By using the `duplicated()` function, we can identify rows that are exact duplicates of others. This step helps us maintain a clean dataset by identifying and removing redundant data points.

In [9]:
duplicate_rows = traffic_violations_df.duplicated()
print("Duplicate entries: ", traffic_violations_df[duplicate_rows].shape[0])

Duplicate entries:  1588
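One possible way to remove the flagged rows is `drop_duplicates()`, shown here on a small hypothetical frame. The names `sample_df` and `deduplicated_df` are assumptions for this sketch, and whether dropping exact duplicates is appropriate for the full dataset is a separate judgment call.

```python
import pandas as pd

# Hypothetical sample in which the last row repeats the first one exactly.
sample_df = pd.DataFrame({
    "Date Of Stop": ["09/24/2013", "08/29/2017", "09/24/2013"],
    "Charge": ["13-401(h)", "21-201(a1)", "13-401(h)"],
})

# duplicated() flags every repeat after the first occurrence by default.
duplicate_mask = sample_df.duplicated()
print("Duplicate entries:", duplicate_mask.sum())

# drop_duplicates() keeps the first occurrence; reset_index renumbers rows.
deduplicated_df = sample_df.drop_duplicates().reset_index(drop=True)
print("Rows after removal:", len(deduplicated_df))
```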


In this step, we identify which columns have missing values and how many missing values each column contains. This information is essential for understanding the completeness of our dataset. By using the `isnull().sum()` function, we can quickly see the number of missing entries in each column. This helps us decide how to handle these missing values, whether by removing rows, filling in missing values, or using other imputation methods.

In [10]:
# Check which columns have missing values
missing_values_count = traffic_violations_df.isnull().sum()
missing_values_count

Date Of Stop                   0
Time Of Stop                   0
Agency                         0
SubAgency                     10
Description                    9
Location                       2
Latitude                   95354
Longitude                  95354
Accident                       0
Belts                          0
Personal Injury                0
Property Damage                0
Fatal                          0
Commercial License             0
HAZMAT                         0
Commercial Vehicle             0
Alcohol                        0
Work Zone                      0
State                         59
VehicleType                    0
Year                        8074
Make                          57
Model                        187
Color                      16127
Violation Type                 0
Charge                         0
Article                    65169
Contributed To Accident        0
Race                           0
Gender                         0
Driver Cit
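Raw counts are easier to weigh as a share of all rows. The sketch below takes the five largest missing-value counts and the row total from the outputs above and converts them to percentages; the variable names are assumptions introduced for this example.

```python
import pandas as pd

# Missing counts and the row total, copied from the outputs above.
total_rows = 1292399
missing_counts = pd.Series({
    "Latitude": 95354,
    "Longitude": 95354,
    "Article": 65169,
    "Color": 16127,
    "Year": 8074,
})

# Express each count as a percentage of all rows, largest first.
missing_percent = (missing_counts / total_rows * 100).round(2)
print(missing_percent.sort_values(ascending=False))
```

On these figures the coordinate columns have the largest gap, at roughly 7.4% of rows, which is useful context when choosing between dropping rows and imputing values.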

### Data cleaning