# Cleaning up data

Our review of average rental prices showed that the earliest reported spike happened in 2019, so we'll keep only buildings built in 2019 or later. We'll also drop buildings that fall outside our area of interest.

In [1]:
# importing libraries
import pandas as pd

In [2]:
df = pd.read_csv("mott-haven-streeteasy-buildings.csv")
df.dtypes

building_name     object
link              object
address           object
coordinates       object
total_units      float64
total_stories    float64
year_built       float64
dtype: object

In [3]:
len(df)

488

## Keeping only buildings built in 2019 or later

In [4]:
filtered_df = df[df["year_built"] >= 2019]
len(filtered_df)

23
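
Because `year_built` is stored as `float64`, the column may contain missing values (`NaN`). A comparison against `NaN` evaluates to `False`, so a `>=` filter silently drops buildings with no recorded year. The sketch below uses a small made-up DataFrame (not the StreetEasy data) to show that behavior and to count how many rows would be excluded this way.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the StreetEasy file
sample = pd.DataFrame({
    "building_name": ["A", "B", "C", "D"],
    "year_built": [2018.0, 2019.0, np.nan, 2022.0],
})

# NaN >= 2019 evaluates to False, so building "C" is dropped along with "A"
recent = sample[sample["year_built"] >= 2019]

# Count rows with no recorded year, to see how many are excluded by default
missing_year = int(sample["year_built"].isna().sum())

print(recent["building_name"].tolist())
print(missing_year)
```

If the count of missing years is large, it may be worth checking those buildings by hand before discarding them.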

## Coordinates for Port Morris area

Here, we filter `filtered_df` further by dropping buildings that fall outside our target area. The polygon was drawn with [geojson.io](https://geojson.io/).

In [5]:
# splitting lat and long from `coordinates`
filtered_df[['latitude', 'longitude']] = filtered_df['coordinates'].str.split(",", expand=True)

# convert to float
filtered_df['latitude'] = pd.to_numeric(filtered_df['latitude'])
filtered_df['longitude'] = pd.to_numeric(filtered_df['longitude'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df[['latitude', 'longitude']] = filtered_df['coordinates'].str.split(",", expand=True)
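
The warning above appears because `filtered_df` was created by slicing `df`, so pandas cannot tell whether the new columns are meant for the slice or for the original frame. One common way to avoid it is to take an explicit `.copy()` before adding columns. This is a minimal sketch with made-up coordinates, assuming the `coordinates` column holds `"latitude,longitude"` strings as in the cell above:

```python
import pandas as pd

# Hypothetical sample data, not taken from the StreetEasy file
sample = pd.DataFrame({
    "building_name": ["A", "B", "C"],
    "coordinates": ["40.8101,-73.9305", "40.8035,-73.9208", "40.8062,-73.9272"],
    "year_built": [2018.0, 2019.0, 2021.0],
})

# .copy() makes the filtered frame independent of `sample`
recent = sample[sample["year_built"] >= 2019].copy()

# Split the string once, then convert each half to a float column
parts = recent["coordinates"].str.split(",", expand=True)
recent["latitude"] = pd.to_numeric(parts[0])
recent["longitude"] = pd.to_numeric(parts[1])

print(recent[["building_name", "latitude", "longitude"]])
```

The original frame is left untouched, and the new columns exist only on the copy.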

In [6]:
# declaring coordinates for our target area

polygon_coords = [
    (-73.9190382254446, 40.79911470202015),
    (-73.91660044458266, 40.80239535458509),
    (-73.91714217366301, 40.80369390143031),
    (-73.92711901756049, 40.807828578714776),
    (-73.9312271297539, 40.81097212789376),
    (-73.93149799429405, 40.811928830692864),
    (-73.93280717290484, 40.811416313051694),
    (-73.9327168847248, 40.80895617326601),
    (-73.92860877253183, 40.80396727647616),
    (-73.92766074664124, 40.80256621746361),
    (-73.92662243257016, 40.80239535458509),
    (-73.92359777853783, 40.80246369978926),
    (-73.92174687084673, 40.80157520664537),
    (-73.92048283632549, 40.79959314061813),
    (-73.9190382254446, 40.79911470202015)
]

In [7]:
# extract the polygon's bounding box (min/max longitude and latitude)
lngs, lats = zip(*polygon_coords)
min_lng, max_lng = min(lngs), max(lngs)
min_lat, max_lat = min(lats), max(lats)

# keep buildings inside the bounding box (an approximation of the polygon)
final_df = filtered_df[
    filtered_df["latitude"].between(min_lat, max_lat)
    & filtered_df["longitude"].between(min_lng, max_lng)
]

# checking our data
final_df[["building_name", "latitude", "longitude"]]

,building_name,latitude,longitude
0,The Arches +NYC,40.810096,-73.930533
1,The Arches,40.809963,-73.931007
2,One38,40.803534,-73.920762
3,Bruckner House,40.806294,-73.927228
5,Maven Mott Haven,40.808753,-73.931452
8,101 Bruckner Boulevard,40.805482,-73.925894
9,Third at Bankside,40.808674,-73.931915
10,The Motto,40.809346,-73.929697
11,Lincoln at Bankside,40.807106,-73.930036
14,Brown Place Pearl,40.805283,-73.92219


In [8]:
len(final_df)

16
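
Note that the filter above compares each building against the polygon's bounding box (its minimum and maximum latitude and longitude), so it can keep points that fall inside that rectangle but outside the polygon itself. If a stricter test is needed, a point-in-polygon check can be applied afterward. The sketch below is a plain-Python ray-casting test; the function name `point_in_polygon` and the sample triangle are illustrative and not part of the original notebook.

```python
def point_in_polygon(lng, lat, polygon):
    """Return True if (lng, lat) lies inside `polygon`, a list of (lng, lat) vertices.

    Ray casting: count how many polygon edges a horizontal ray from the point
    crosses. An odd count means the point is inside. Points lying exactly on
    an edge are not handled specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside


# Illustrative triangle whose bounding box is the unit square
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

in_both = point_in_polygon(0.2, 0.2, triangle)      # inside box and triangle
in_box_only = point_in_polygon(0.8, 0.8, triangle)  # inside box, outside triangle

print(in_both, in_box_only)
```

With the notebook's data, the same function could be applied row by row, for example `final_df.apply(lambda r: point_in_polygon(r["longitude"], r["latitude"], polygon_coords), axis=1)`, and the resulting boolean Series used as a mask.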

## Saving as `csv`

In [9]:
final_df.to_csv("filtered_buildings.csv", encoding="UTF-8", index=False)
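
As a quick sanity check after saving, the file can be read back to confirm that the row count and column names survived the round trip. The sketch below uses a made-up DataFrame and a temporary file (`sample_buildings.csv` is a hypothetical name) rather than the notebook's output:

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data with a non-default index, as left behind by filtering
sample = pd.DataFrame(
    {
        "building_name": ["A", "B"],
        "latitude": [40.81, 40.80],
        "longitude": [-73.93, -73.92],
    },
    index=[3, 7],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_buildings.csv")
    # index=False leaves the old row labels out of the file
    sample.to_csv(path, encoding="utf-8", index=False)
    reloaded = pd.read_csv(path)

same_rows = len(reloaded) == len(sample)
same_columns = list(reloaded.columns) == list(sample.columns)

print(same_rows, same_columns)
```

Because `index=False` is used, the reloaded frame gets a fresh index starting at 0 instead of the original row labels.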