# Clean Data

![clean-merge-data](img/scrub-process-diagram.png)

## Import Libraries

### External Libraries

In [4]:
import geopandas as gpd

## Define Variables

In [5]:
nyc_street_flooding_input = 'data/street-flooding/street-flood-complaints_rows-all.geojson'
nyc_street_flooding_output = 'data/street-flooding/clean_street-flood-complaints_rows-all.geojson'

## Get Original Data

In [6]:
street_flooding_gdf = gpd.read_file(nyc_street_flooding_input)

## Before Count

In [7]:
street_flooding_complaints_before_count = len(street_flooding_gdf)
print(f'There were {street_flooding_complaints_before_count:,} street flooding complaints from 2010 to the present.')

There were 35,051 street flooding complaints from 2010 to the present.


## Set `unique_key` as Index

In [8]:
street_flooding_gdf.set_index('unique_key', inplace=True)

## Remove Rows With Missing `geometry`

In [9]:
street_flooding_gdf.dropna(subset=['geometry'], inplace=True)
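
In GeoJSON, a feature without a location carries `"geometry": null`, which typically surfaces as a missing geometry once the file is loaded. As a rough illustration of what the `dropna` call filters out, the sketch below applies the same rule to a small, made-up FeatureCollection using only the standard library. The sample features and the `has_geometry` helper are hypothetical and not part of the notebook.

```python
import json

# Hypothetical sample; the real file holds tens of thousands of features.
sample_geojson = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"unique_key": "1"},
     "geometry": {"type": "Point", "coordinates": [-73.92178, 40.58778]}},
    {"type": "Feature", "properties": {"unique_key": "2"}, "geometry": null},
    {"type": "Feature", "properties": {"unique_key": "3"},
     "geometry": {"type": "Point", "coordinates": [-74.14329, 40.63866]}}
  ]
}
""")


def has_geometry(feature):
    """Return True when the feature carries a non-null geometry."""
    return feature.get("geometry") is not None


# Keep only located features, mirroring dropna(subset=['geometry']).
located = [f for f in sample_geojson["features"] if has_geometry(f)]
dropped = len(sample_geojson["features"]) - len(located)
print(f"kept {len(located)}, dropped {dropped}")
```

Only the feature whose geometry is `null` is discarded; its other properties play no part in the decision.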

## After Count

In [10]:
street_flooding_complaints_after_count = len(street_flooding_gdf)
print(f'There were {street_flooding_complaints_after_count:,} street flooding complaints after rows with missing geometry have been removed.')

There were 34,044 street flooding complaints after rows with missing geometry have been removed.
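
The before and after counts printed above imply how many complaints were dropped. A minimal sketch of that arithmetic follows, with the counts hard-coded from the printed output; the variable names here are illustrative, and in the notebook the existing `street_flooding_complaints_before_count` and `street_flooding_complaints_after_count` could be used instead.

```python
# Counts copied from the notebook output above.
before_count = 35_051
after_count = 34_044

# Rows removed because their geometry was missing.
removed_count = before_count - after_count
removed_share = removed_count / before_count

print(f'{removed_count:,} complaints ({removed_share:.1%}) had no geometry and were removed.')
```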


## Preview Street Flooding Data

In [11]:
street_flooding_gdf[['created_date', 'borough', 'bbl', 'geometry']].head(10)

unique_key,created_date,borough,bbl,geometry
15639934,2010-01-02 08:26:00,BROOKLYN,3089000064.0,POINT (-73.92178 40.58778)
15640572,2010-01-02 12:00:00,STATEN ISLAND,,POINT (-74.14329 40.63866)
15640664,2010-01-02 17:45:00,QUEENS,4120050012.0,POINT (-73.79530 40.68140)
15655327,2010-01-04 16:47:00,QUEENS,4106210008.0,POINT (-73.73843 40.72006)
15668560,2010-01-05 10:37:00,BROOKLYN,3086550021.0,POINT (-73.90969 40.61250)
15674300,2010-01-06 19:26:00,BROOKLYN,3029270015.0,POINT (-73.93297 40.71584)
15674896,2010-01-06 08:24:00,QUEENS,4119960122.0,POINT (-73.80255 40.67925)
15674924,2010-01-06 09:17:00,STATEN ISLAND,5040740044.0,POINT (-74.10646 40.55866)
15675505,2010-01-06 06:00:00,QUEENS,4030030044.0,POINT (-73.87694 40.71804)
15683503,2010-01-07 10:16:00,STATEN ISLAND,5014850078.0,POINT (-74.14943 40.61979)


In [12]:
street_flooding_gdf[['created_date', 'borough', 'bbl', 'geometry']].tail(10)

unique_key,created_date,borough,bbl,geometry
56879833,2023-02-23 15:23:00,QUEENS,4137320024.0,POINT (-73.74717 40.65495)
56881026,2023-02-23 10:56:00,QUEENS,4142340587.0,POINT (-73.83003 40.65729)
56883191,2023-02-23 13:23:00,BROOKLYN,3080520059.0,POINT (-73.90364 40.63474)
56883535,2023-02-24 12:08:00,QUEENS,4137350024.0,POINT (-73.74752 40.65428)
56889664,2023-02-24 10:56:00,STATEN ISLAND,,POINT (-74.13515 40.61709)
56894127,2023-02-25 21:17:00,QUEENS,4066360043.0,POINT (-73.82293 40.71523)
56895026,2023-02-25 12:47:00,QUEENS,4067470075.0,POINT (-73.81219 40.73705)
56899909,2023-02-25 20:08:00,BROOKLYN,3056230001.0,POINT (-73.99062 40.63595)
56900879,2023-02-26 09:08:00,QUEENS,4015360120.0,POINT (-73.88446 40.73925)
56904542,2023-02-26 18:05:00,STATEN ISLAND,5061080026.0,POINT (-74.20391 40.54321)


## Save Clean Dataset

In [13]:
street_flooding_gdf.to_file(nyc_street_flooding_output, driver='GeoJSON')
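
One way to sanity-check a saved file is to load it again and compare its feature count with the count held in memory. The sketch below shows the idea with the standard library on a temporary file. The `count_features` helper, the file name, and the two-feature sample are hypothetical; in the notebook, reading `nyc_street_flooding_output` back with `gpd.read_file` and comparing against `street_flooding_complaints_after_count` would serve the same purpose.

```python
import json
import os
import tempfile


def count_features(path):
    """Count the features in a GeoJSON FeatureCollection file."""
    with open(path, encoding="utf-8") as f:
        return len(json.load(f)["features"])


# Hypothetical two-feature collection standing in for the cleaned dataset.
collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"unique_key": "15639934"},
         "geometry": {"type": "Point", "coordinates": [-73.92178, 40.58778]}},
        {"type": "Feature", "properties": {"unique_key": "15640664"},
         "geometry": {"type": "Point", "coordinates": [-73.79530, 40.68140]}},
    ],
}

# Write the sample to a temporary file, then read it back and count.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clean_sample.geojson")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(collection, f)
    saved_count = count_features(path)

print(saved_count == len(collection["features"]))
```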