**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update or change the relevant sections where you lost those points, and make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - EDA Checkpoint

# Names

- Anthony Mitine
- Nyla Janmohamed
- Jocelyn Huang
- Mei Man Teng Lam

# Research Questions

- How did environmental conditions (weather, lighting, and roadway surface) impact the rate of car crashes in Chicago from 2016 to 2024?
- How did the time of day, day of the week, and holiday periods influence the rate of car crashes in Chicago from 2016 to 2024?
- Did the geographic location within Chicago impact the likelihood of car crashes from 2016 to 2024?


## Background and Prior Work


This study investigates the impact of environmental conditions, temporal patterns, and geographic distribution on the frequency of car crashes in Chicago from 2016 to 2024. By analyzing these factors, our research aims to uncover patterns that could inform traffic management strategies and contribute to road safety improvements.

**Temporal Patterns**: Research indicates that the timing of car accidents can vary significantly depending on the day of the week and time of day. A study from Dopplr covering 2016 to 2020 suggests that the highest frequency of accidents occurred during the afternoon rush hours of 2:00 PM to 5:00 PM, with Fridays being particularly prone to collisions<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2). This pattern raises questions about the role of traffic volume and driver behavior at these times, which our study will explore further.

**Environmental Conditions**: Previous studies have shown that weather conditions play a significant role in road safety. For instance, the Kaggle notebook "Chicago Car Crashes Data Analysis" found that overcast and partially cloudy weather during spring and early summer was associated with the highest rate of car crashes, with collisions involving fixed objects, pedestrians, and angle impacts being the categories with more severe accidents<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3).

**Geographic Distribution**: Our research will also examine how the geographic location within Chicago impacts the likelihood of car crashes. We hypothesize that different areas within the city may exhibit varied crash rates due to factors such as road design, traffic volume, and local enforcement of traffic laws. Understanding the spatial distribution of accidents can help identify high-risk areas, guiding targeted interventions and safety improvements. By mapping accident data across Chicago, we aim to pinpoint where the highest concentrations of accidents occur and analyze potential underlying causes related to the geographic characteristics of these areas.

**Chicago’s Traffic Safety Efforts**: The Kaggle notebook "Chicago Car Crashes Data Analysis" identified the key factors contributing to road accidents as failing to yield the right of way, tailgating, overtaking, not adjusting speed, and improper backing maneuvers<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3). The city’s ongoing traffic safety initiatives, including the Vision Zero plan, aim to reduce traffic fatalities by addressing factors like vehicle speed and reckless driving<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). Our study will examine how these efforts may have influenced accident rates and the persistence of high-risk zones throughout the city.

**Footnotes:**

1. <a name="cite_note-1"></a> [^](#cite_ref-1) City of Chicago, 'Traffic Safety.' *Chicago.gov*. [https://www.chicago.gov/city/en/sites/complete-streets-chicago/home/traffic-safety.html](https://www.chicago.gov/city/en/sites/complete-streets-chicago/home/traffic-safety.html).

2. <a name="cite_note-2"></a> [^](#cite_ref-2) Dopplr, 'When Do Most Car Accidents Occur in Chicago (2016-2020)?' [https://www.dopplr.com/when-do-most-car-accidents-occur-in-chicago-2016-2020/](https://www.dopplr.com/when-do-most-car-accidents-occur-in-chicago-2016-2020/).

3. <a name="cite_note-3"></a> [^](#cite_ref-3) Kaggle, 'Chicago Car Crashes Data Analysis.' [https://www.kaggle.com/code/lashametreveli/chicago-car-crashes-data-analysis](https://www.kaggle.com/code/lashametreveli/chicago-car-crashes-data-analysis).

# Hypotheses


- We hypothesize that in Chicago from 2016 to 2024, the rate of car crashes was significantly higher during periods of snowfall and in areas with inadequate lighting. Due to the compounded adverse effects of reduced visibility and traction, we predict that more crashes would occur during these specific conditions.

- We hypothesize that in Chicago from 2016 to 2024, the rate of car crashes was significantly higher during evening hours, on weekends, and on holidays. We attribute this increase to higher road congestion, as people are more likely to be going out, alongside an increase in impaired driving due to alcohol consumption during these times.

- We hypothesize that in Chicago from 2016 to 2024, the likelihood of car crashes was significantly higher near highway and freeway on-ramps and off-ramps, where frequent merging and diverging traffic increases the potential for accidents. Additionally, areas with higher traffic volumes and intersections with complex traffic patterns may also exhibit a higher incidence of crashes.


# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: Traffic Crashes - Crashes
  - Link to the website: https://catalog.data.gov/dataset/traffic-crashes-crashes
  - Link to the dataset: https://data.cityofchicago.org/api/views/85ca-t3if/rows.csv?accessType=DOWNLOAD
  - Number of observations in total dataset: 920,016
  - Number of variables in total dataset: 48
  - Number of observations in cleaned dataset: 896,842
  - Number of variables in cleaned dataset: 13

This dataset provides information about car crashes in Chicago that were reported to the police department between 2016 and 2024. Some important variables are the crash date, time, and location; weather and road condition details; and information about the parties involved. These can help us determine which factors have the biggest impact on the type of crash and its severity. The data cleaning process mostly involves handling NaN values and converting dates and times to a single consistent type, resulting in a more focused dataset with 13 key variables. It is important to note that this dataset only reflects reported crashes, potentially omitting minor or unreported incidents.
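As a minimal sketch of the NaN handling mentioned above, the snippet below counts missing values per column and then treats missing coordinates and missing categories differently. The small `sample` frame is made up for illustration, the column names `WEATHER_CONDITION`, `LATITUDE`, and `LONGITUDE` are assumed to match the dataset's schema, and both cleaning choices are assumptions rather than the rule that produced the cleaned dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the crash data (values are made up)
sample = pd.DataFrame({
    'WEATHER_CONDITION': ['CLEAR', 'SNOW', None, 'RAIN'],
    'LATITUDE': [41.88, np.nan, 41.79, 41.95],
    'LONGITUDE': [-87.63, np.nan, -87.60, -87.70],
})

# Count missing values in each column to see where the gaps are
missing_counts = sample.isna().sum()

# Assumption: rows without coordinates cannot support the geographic question
with_location = sample.dropna(subset=['LATITUDE', 'LONGITUDE'])

# Assumption: keep rows with a missing category, but label the gap explicitly
with_location = with_location.assign(
    WEATHER_CONDITION=with_location['WEATHER_CONDITION'].fillna('UNKNOWN')
)

print(missing_counts)
print(with_location)
```

Reporting `missing_counts` before dropping anything makes it easier to explain the difference between the total and cleaned observation counts.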



## Traffic Crashes - Crashes

In [None]:
# Import pandas
import pandas as pd


In [None]:
# Load the data
chicago_crash = pd.read_csv('https://data.cityofchicago.org/api/views/85ca-t3if/rows.csv?accessType=DOWNLOAD')
chicago_crash.head()

In [None]:
# Clean the data: Convert CRASH_DATE to datetime format
chicago_crash['CRASH_DATE'] = pd.to_datetime(chicago_crash['CRASH_DATE'], format='%m/%d/%Y %I:%M:%S %p')

In [None]:
# Extract the year from CRASH_DATE for easier aggregation
chicago_crash['YEAR'] = chicago_crash['CRASH_DATE'].dt.year

In [None]:
# Filter the data to include only years between 2016 and 2024
chicago_crash = chicago_crash[(chicago_crash['YEAR'] >= 2016) & (chicago_crash['YEAR'] <= 2024)]

In [None]:
# Drop the columns that are not needed for our analysis
columns_to_drop = [
    'CRASH_RECORD_ID', 'CRASH_DATE_EST_I', 'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION', 'FIRST_CRASH_TYPE',
    'TRAFFICWAY_TYPE', 'LANE_CNT', 'ALIGNMENT', 'REPORT_TYPE', 'CRASH_TYPE', 'DAMAGE', 'DATE_POLICE_NOTIFIED',
    'STREET_NO', 'STREET_DIRECTION', 'STREET_NAME', 'BEAT_OF_OCCURRENCE', 'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I',
    'DOORING_I', 'WORK_ZONE_I', 'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I', 'NUM_UNITS', 'MOST_SEVERE_INJURY',
    'INJURIES_TOTAL', 'INJURIES_FATAL', 'INJURIES_INCAPACITATING', 'INJURIES_NON_INCAPACITATING', 
    'INJURIES_REPORTED_NOT_EVIDENT', 'INJURIES_NO_INDICATION', 'INJURIES_UNKNOWN', 'CRASH_HOUR', 
    'CRASH_DAY_OF_WEEK', 'CRASH_MONTH', 'LOCATION', 'YEAR'
]
# Drop the columns from chicago_crash dataframe
chicago_crash = chicago_crash.drop(columns=columns_to_drop)

In [None]:
# Sort the data by CRASH_DATE in ascending order
chicago_crash = chicago_crash.sort_values(by='CRASH_DATE', ascending=True)

In [None]:
# Confirm the cleaned and sorted data
chicago_crash.head()
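Because `CRASH_HOUR`, `CRASH_DAY_OF_WEEK`, and `CRASH_MONTH` are among the dropped columns, the temporal research question needs those features derived again from `CRASH_DATE`. The sketch below shows one way to do that on made-up timestamps; the `HOUR`, `DAY_NAME`, and `IS_WEEKEND` column names are hypothetical and not part of the dataset.

```python
import pandas as pd

# Hypothetical timestamps in the same string format as the raw CRASH_DATE column
sample = pd.DataFrame({
    'CRASH_DATE': pd.to_datetime(
        [
            '01/05/2018 05:30:00 PM',  # a Friday
            '01/06/2018 11:15:00 PM',  # a Saturday
            '01/08/2018 08:05:00 AM',  # a Monday
        ],
        format='%m/%d/%Y %I:%M:%S %p',
    )
})

# Derive temporal features directly from the datetime column
sample['HOUR'] = sample['CRASH_DATE'].dt.hour
sample['DAY_NAME'] = sample['CRASH_DATE'].dt.day_name()
# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
sample['IS_WEEKEND'] = sample['CRASH_DATE'].dt.dayofweek >= 5

# Number of crashes in each hour of the day
crashes_per_hour = sample.groupby('HOUR').size()

print(sample)
print(crashes_per_hour)
```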

# Results

## Exploratory Data Analysis

Carry out whatever EDA you need to for your project.  Because every project will be different we can't really give you much of a template at this point. But please make sure you describe the what and why in text here as well as providing interpretation of results and context.

### Section 1 of EDA - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.
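One possible starting point for the environmental question is to tabulate crashes by weather and lighting. The sketch below uses a made-up `sample` frame and assumes the columns are named `WEATHER_CONDITION` and `LIGHTING_CONDITION`. Note that these are shares of reported crashes, not exposure-adjusted rates: clear weather is also the most common weather, so a higher share does not by itself mean higher risk.

```python
import pandas as pd

# Hypothetical sample of crash records (values are made up)
sample = pd.DataFrame({
    'WEATHER_CONDITION': ['CLEAR', 'CLEAR', 'SNOW', 'RAIN', 'CLEAR', 'SNOW'],
    'LIGHTING_CONDITION': ['DAYLIGHT', 'DARKNESS', 'DARKNESS',
                           'DAYLIGHT', 'DAYLIGHT', 'DAYLIGHT'],
})

# Share of crashes recorded under each weather condition
weather_share = sample['WEATHER_CONDITION'].value_counts(normalize=True)

# Crash counts for each combination of weather and lighting
weather_by_lighting = pd.crosstab(
    sample['WEATHER_CONDITION'], sample['LIGHTING_CONDITION']
)

print(weather_share)
print(weather_by_lighting)
```

Turning these counts into rates would require a denominator, such as the number of hours or days each condition occurred, which the crash dataset alone does not provide.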

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

### Section 3 of EDA if you need it  - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.

### Section 4 of EDA - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.
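For the geographic question, a simple first pass is to bin crash coordinates into a coarse grid and count crashes per cell. The sketch below is an illustration on made-up coordinates, assuming the columns are named `LATITUDE` and `LONGITUDE`; the `LAT_BIN` and `LON_BIN` names and the two-decimal cell size (roughly 1 km north to south) are arbitrary choices for the example.

```python
import pandas as pd

# Hypothetical crash coordinates (values are made up)
sample = pd.DataFrame({
    'LATITUDE': [41.8781, 41.8790, 41.9484, 41.8779],
    'LONGITUDE': [-87.6298, -87.6301, -87.6553, -87.6302],
})

# Round coordinates to two decimals so nearby crashes share a grid cell
sample['LAT_BIN'] = sample['LATITUDE'].round(2)
sample['LON_BIN'] = sample['LONGITUDE'].round(2)

# Count crashes per grid cell, with the densest cells first
cell_counts = (
    sample.groupby(['LAT_BIN', 'LON_BIN'])
    .size()
    .sort_values(ascending=False)
)

print(cell_counts)
```

As with the other counts, dense cells may simply reflect heavier traffic, so this identifies where crashes concentrate rather than where driving is riskiest.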

# Ethics & Privacy

- To protect privacy, the dataset we found does not include the police report number, license plate information, or any car details, and victims' names and other identifiable information have been removed. As a result, no one can easily identify the person involved in each crash. The location and time details could reveal some information, but because the data comes from a large city, the chances of identification are minimal.

- There could be a slight reporting bias, since minor incidents are less likely to be reported. This could skew the results toward severe incidents, which would make up a disproportionate share of the records. It’s also unclear whether the data covers a large section of Chicago or only small parts, which could affect the results due to differences in police response rates and population density across neighborhoods.

- To address these issues, we will analyze the data to check for strong skews or outliers. Based on this analysis, we will write an honest assessment that acknowledges the limitations of the data and potential confounding factors outside of the specific independent variables we are focusing on.

- Additionally, we will be careful not to draw conclusions that might reinforce biases or misrepresent trends. Our goal is to present a fair and accurate analysis while being mindful of the ethical implications of our findings.


# Team Expectations 


- Team expectation 1: Attend discussion sections regularly so we can work on the group project and split up tasks each week.
- Team expectation 2: Respond to group messages in a timely manner. If you're busy, send a text letting the group know when you'll be able to respond or finish your work.
- Team expectation 3: Everyone should make an effort to participate and contribute equally.


# Project Timeline Proposal

| Meeting Date  | Meeting Time | Completed Before Meeting               | Discuss at Meeting              |
|---------------|--------------|---------------------------------------|--------------------------------|
| 2/7           | 4 PM         | Review dataset, think about team expectations | Draft Proposal                 |
| 2/12          | 1 PM         | Revise raw dataset                    | Wrangle, clean, and tidy data. Assign group member roles for each task component |
| 2/19          | 1 PM         | Read any necessary material related to Checkpoint 1 | Discuss Checkpoint 1           |
| 2/21          | 4 PM         | Brainstorm sections within Checkpoint 1 | Write Checkpoint 1              |
| 2/26          | 1 PM         | Review/Edit wrangling work            | Discuss project updates         |
| 3/5           | 1 PM         | Read any necessary material related to Checkpoint 2 | Discuss Checkpoint 2           |
| 3/7           | 4 PM         | Brainstorm sections within Checkpoint 2 | Write Checkpoint 2              |
| 3/12          | 1 PM         | Read any necessary material related to Final Report + Video | Discuss Final Report + Video  |
| 3/17          | 3 PM         | Work on finalizing the report and wrapping up the video | Final Report + Video           |
