# Collisions in NY City
 Carlos Arbonés & Benet Ramió

## Data extraction

We acquired the dataset from the [Motor Vehicle Collisions](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) source. Before downloading it, we filtered the records to keep only the months of June to September in both 2018 and 2020.

## General Info

The dataset contains 115740 rows and 30 columns. 

## Preprocessing

We review every column and check data types, null values, clustering, and so on.

### Crash Date

We keep the text data type. There are no null values and the data is consistent. We sort the rows by Crash Date.

### Crash Time

There are no null values. Some times, such as 00:00 and 13:00, have a large number of accidents. Round times like 15:20 or 6:45 appear more often than times with minutes such as 19 or 23, which may be because officers round the time for convenience.
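One way to check the rounding effect is to measure how many crash times fall on a five-minute mark. The sketch below assumes the times are `H:MM` strings; the function names and the sample values are hypothetical and only illustrate the idea. If minutes were spread evenly, about 20% of the times (12 of 60 minute values) would fall on a five-minute mark, so a much higher share suggests rounding.

```python
from collections import Counter

def minute_counts(times):
    """Count how often each minute value (0-59) appears in 'H:MM' strings."""
    counts = Counter()
    for t in times:
        _, minute = t.strip().split(":")
        counts[int(minute)] += 1
    return counts

def share_on_five_minute_marks(times):
    """Fraction of times whose minute is a multiple of 5."""
    counts = minute_counts(times)
    total = sum(counts.values())
    rounded = sum(n for m, n in counts.items() if m % 5 == 0)
    return rounded / total if total else 0.0

# Hypothetical sample, not taken from the dataset.
sample = ["0:00", "13:00", "15:20", "6:45", "8:19", "17:23"]
print(share_on_five_minute_marks(sample))
```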

### Borough

There are only 6 distinct values: BRONX, BROOKLYN, MANHATTAN, QUEENS, STATEN ISLAND, and blank. 40671 rows are blank.

### Zip Code

We change the data type to number. There are 208 distinct zip codes. Most rows that have no borough have no zip code either, so there are many blanks. From the zip code we can derive the borough:
- Manhattan: 10001-10282
- Staten Island: 10301-10314
- Bronx: 10451-10475
- Queens: 11004-11109, 11351-11697
- Brooklyn: 11201-11256
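
The ranges above can be turned into a small lookup, sketched here in plain Python. The names `ZIP_RANGES` and `borough_from_zip` are hypothetical, and the sketch assumes the bounds are inclusive and that a blank or out-of-range zip code should leave the borough blank (`None`).

```python
# Ranges copied from the list above; (low, high) bounds are assumed inclusive.
ZIP_RANGES = {
    "MANHATTAN": [(10001, 10282)],
    "STATEN ISLAND": [(10301, 10314)],
    "BRONX": [(10451, 10475)],
    "QUEENS": [(11004, 11109), (11351, 11697)],
    "BROOKLYN": [(11201, 11256)],
}

def borough_from_zip(zip_code):
    """Return the borough for a zip code, or None if it is blank or out of range."""
    if zip_code is None or str(zip_code).strip() == "":
        return None
    try:
        z = int(zip_code)
    except (TypeError, ValueError):
        return None
    for borough, ranges in ZIP_RANGES.items():
        for low, high in ranges:
            if low <= z <= high:
                return borough
    return None

print(borough_from_zip("11375"))
```

A lookup like this could fill the Borough column for rows where the zip code is present but the borough is blank.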

### Latitude, Longitude and Location

We changed the data type of latitude and longitude to number. We removed the Location column since the two previous columns already contain all of its information. There are 7667 blanks.

There were atypical values, such as longitudes of -201, which are impossible. All of these values were assigned to the same street: "QUEENSBORO BRIDGE UPPER BROADWAY". We looked up the real longitude, which is -73.954224, and replaced the values with it.

There was also a strange longitude value related to NASSAU EXPRESSWAY. We corrected the previous value (-32) to -73.7813672.

We found one more wrong longitude value: we changed the longitude of "WEST SHORE EXPRESSWAY" from -74.7 to -74.1864671.

For the rows where both latitude and longitude are equal to 0, we changed the values to blank to keep the data uniform.
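
The coordinate fixes described above can be sketched as a single row-level function. This is only an illustration: the row keys (`street`, `latitude`, `longitude`), the plausible longitude range `LON_MIN`/`LON_MAX`, and the use of `None` for blank are assumptions, not part of the original workflow. The corrected longitudes are the ones reported in the text.

```python
# Corrected longitudes reported in the text, keyed by street name.
LONGITUDE_FIXES = {
    "QUEENSBORO BRIDGE UPPER BROADWAY": -73.954224,
    "NASSAU EXPRESSWAY": -73.7813672,
    "WEST SHORE EXPRESSWAY": -74.1864671,
}

# Assumed plausible longitude range for New York City (hypothetical bounds).
LON_MIN, LON_MAX = -74.3, -73.6

def clean_coordinates(row):
    """Return a copy of the row with (0, 0) blanked and known bad longitudes fixed."""
    row = dict(row)
    lat, lon = row.get("latitude"), row.get("longitude")
    if lat == 0 and lon == 0:
        # (0, 0) is a placeholder, not a real location: treat it as blank.
        row["latitude"], row["longitude"] = None, None
        return row
    if lon is not None and not (LON_MIN <= lon <= LON_MAX):
        fix = LONGITUDE_FIXES.get(row.get("street"))
        if fix is not None:
            row["longitude"] = fix
    return row

print(clean_coordinates({"street": "NASSAU EXPRESSWAY", "latitude": 40.66, "longitude": -32.0}))
```

Only implausible longitudes are replaced, so rows on the same streets that already have a valid longitude are left untouched.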

### ON STREET NAME, CROSS STREET NAME and OFF STREET NAME

We remove these columns for three reasons: they have many blank values (On Street Name has 28000, Cross Street Name has more than 57000 and Off Street Name roughly 80000), the information is redundant, and there are too many streets to build interesting grouped views. If needed, the latitude and longitude columns give more precise location information.

### NUMBER OF PERSONS INJURED, NUMBER OF PERSONS KILLED, NUMBER OF PEDESTRIANS INJURED, NUMBER OF PEDESTRIANS KILLED, NUMBER OF CYCLISTS INJURED, NUMBER OF CYCLISTS KILLED, NUMBER OF MOTORISTS INJURED, NUMBER OF MOTORISTS KILLED

We change the data type of all these columns to number and rename them to shorter, more informative names. The new names are:

- TOTAL_INJURED
- TOTAL_KILLED
- PEDASTRIANS_INJURED
- PEDASTRIANS_KILLED
- CYCLISTS_INJURED
- CYCLISTS_KILLED
- MOTORISTS_INJURED
- MOTORISTS_KILLED
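
The renaming pattern and the text-to-number conversion can be sketched as follows. The helper names `short_name` and `to_count` are hypothetical, and the rule (drop "NUMBER OF", use TOTAL for PERSONS, join with underscores) is inferred from the list above rather than stated by the original workflow. Blank cells are assumed to become `None`.

```python
def short_name(column):
    """Derive the short name: drop 'NUMBER OF', use TOTAL for PERSONS, join with '_'."""
    words = column.upper().split()
    if words[:2] == ["NUMBER", "OF"]:
        words = words[2:]
    words = ["TOTAL" if w == "PERSONS" else w for w in words]
    return "_".join(words)

def to_count(value):
    """Convert a text cell to an integer; blank cells become None."""
    text = str(value).strip() if value is not None else ""
    return int(text) if text else None

print(short_name("NUMBER OF PERSONS INJURED"))
print(to_count(" 2 "))
```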



### COLLISION_ID

We remove this column since we do not need it for our analysis.

### VEHICLE TYPE CODE 1 & 2

For both Vehicle Type Code 1 and 2, we clustered the values with the Nearest Neighbor method and then with key collision to unify the final names. Since many human errors were not clustered properly by these automatic methods, we also reviewed all the names and clustered more of them manually. For example, one vehicle was entered as GLP050VXEV; an internet search showed that it is a forklift model, so we renamed it. We did the same for many other vehicles. We also generalized some vehicle types that were too specific; for example, we changed fedex, ups, mail and others to delivery to reduce the number of vehicle classes. Furthermore, all unknown and NA values were set to blank, and all vehicle types with fewer than ten collisions were grouped into the Others type to reduce the number of distinct names. Even after all these transformations, some vehicle types were not interpretable, and we could not correct them or assign them to a cluster.

It is important to note that, before any transformation of the vehicle type names, there were 361 distinct names in code 1 and 373 in code 2.
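
To illustrate the idea behind key-collision clustering and the grouping of rare types, here is a minimal sketch in plain Python. It is not the tool used for the project: it assumes a simple fingerprint key (lowercase, punctuation removed, unique tokens sorted), which is one common way to do key collision, and the function names are hypothetical. The nearest-neighbor step and the manual corrections are not covered.

```python
import re
from collections import Counter, defaultdict

def fingerprint(value):
    """Key for key-collision clustering: lowercase, drop punctuation, sort unique tokens."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))

def cluster_names(values):
    """Map each raw value to the most frequent spelling that shares its fingerprint."""
    groups = defaultdict(Counter)
    for v in values:
        groups[fingerprint(v)][v] += 1
    canonical = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
    return {v: canonical[fingerprint(v)] for v in set(values)}

def group_rare(values, min_count=10, other="Others"):
    """Replace values that appear fewer than min_count times with a catch-all label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Hypothetical raw entries, not taken from the dataset.
raw = ["Sedan"] * 3 + ["SEDAN", "sedan."]
print(cluster_names(raw))
```

Values that differ only in case, punctuation or word order share a fingerprint and collapse to one name; misspellings such as a missing letter do not, which is why a nearest-neighbor pass and a manual review are still needed.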

The names and number of examples for "VEHICLE TYPE CODE 1" after applying all the transformations mentioned are the following:

- Sedan - 54025
- Station Wagon/Sport Utility Vehicle - 40658
- Taxi - 4806
- Pickup - 3479
- Box truck - 2325
- Bike - 1938
- Bus - 1420
- Truck - 1087
- Motorcycle - 1051
- Van - 836
- Ambulance - 467
- Convertible - 411
- Dump - 325
- E-Scooter - 240
- Flat bed - 232
- Garbage - 193
- Carry All - 146
- E-Bike - 146
- Moped - 135
- Tow truck - 106
- Scooter - 90
- Chassis Cab - 83
- Fire truck - 78
- Tanker - 70
- Motorbike - 64
- Concrete Mixer - 63
- Trailer - 63
- Flat rack - 42
- Delivery - 39
- Armored Truck - 33
- Beverage Truck - 30
- 3-Door - 28
- Lift boom - 21
- Multi-Wheeled Vehicle - 19
- Forklift - 16
- Commercial - 13
- Stake or Rack - 12
- Utility - 12
- School bus - 11
- Tractor - 11
- Others - 156
- (blank) - 750

The names and number of examples for "VEHICLE TYPE CODE 2" after applying all the transformations mentioned are the following:

- Sedan - 38643
- Station Wagon/Sport Utility Vehicle - 29520
- Taxi - 3622
- Bike - 3605
- Pickup - 3051
- Box truck - 2532
- Bus - 1279
- Truck - 1034
- Motorcycle - 839
- Van - 800
- Dump - 369
- E-Scooter - 357
- Ambulance - 265
- E-Bike - 256
- Convertible - 252
- Flat bed - 246
- Garbage - 212
- Moped - 152
- Tow truck - 142
- Carry All - 116
- Fire truck - 95
- Motorbike - 83
- Motorscooter - 73
- Chassis Cab - 68
- Concrete Mixer - 50
- Tanker - 46
- Delivery - 38
- Scooter - 37
- Trailer - 34
- Flat rack - 27
- Beverage Truck - 25
- Lift boom - 24
- 3-Door - 22
- Armored Truck - 22
- Stake or Rack - 21
- Commercial - 19
- Forklift - 19
- Electric - 18
- Minibike - 15
- Pedicab - 13
- Multi-Wheeled Vehicle - 12
- Open Body - 11
- Street cleaning - 10
- Others - 186
- (blank) - 27480

### VEHICLE TYPE CODE 3, 4 & 5

Since most accidents occur between two vehicles, these columns were almost entirely blank (code 3: 107095, code 4: 113658 and code 5: 115154), so we decided to remove them.