# Capstone Project - The Battle of Neighborhoods

### Introduction

The goal of this study is to give insights to a real estate investor from Manhattan. Which areas are the best to invest in according to your needs? What are the main characteristics of the real estate market in each neighbourhood?

### Business Problem

Let's say you are a real estate investor who wants to focus your next steps on the New York area, but first you want to know which areas best fit your appetite. Maybe you are more interested in suburban areas with a low price per square meter, or maybe you are into the luxury segment and want to know which places high-end clients prefer.

With this study, we'll group neighbourhoods into clusters that describe which kind of investment suits each of them best.

### Data

We use the data provided in the Kaggle repository _NYC Property Sales_:

[NYC Property Sales](https://www.kaggle.com/new-york-city/nyc-property-sales)

The data will be used as follows:

+ Clean the data, focusing on the 'ADDRESS' and 'SALE PRICE' columns.
+ Assign latitude and longitude coordinates to each row, so we can query the _Foursquare_ API.
+ Build the dataframe that relates the main characteristics of real estate sales to each neighbourhood.
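
As a rough illustration of the last step, the sketch below aggregates sales per neighbourhood. It assumes the sale prices have already been converted to numbers. The toy values, the `neighborhood_profile` name and the choice of statistics (number of sales, median price, median price per gross square foot) are assumptions made for this example, not the final feature set of the study. Areas are kept in square feet because that is the unit the dataset reports.

```python
import pandas as pd

# Toy stand-in for the cleaned sales data (values are made up for illustration).
sales = pd.DataFrame({
    "NEIGHBORHOOD": ["ALPHABET CITY", "ALPHABET CITY", "CHELSEA", "CHELSEA"],
    "SALE PRICE": [1_000_000.0, 3_000_000.0, 2_000_000.0, 4_000_000.0],
    "GROSS SQUARE FEET": [1_000.0, 2_000.0, 1_000.0, 1_000.0],
})

# Price per gross square foot for every individual sale.
sales["PRICE PER SQFT"] = sales["SALE PRICE"] / sales["GROSS SQUARE FEET"]

# One row per neighbourhood with a few summary statistics.
neighborhood_profile = (
    sales.groupby("NEIGHBORHOOD")
    .agg(
        n_sales=("SALE PRICE", "size"),
        median_price=("SALE PRICE", "median"),
        median_price_per_sqft=("PRICE PER SQFT", "median"),
    )
    .reset_index()
)
print(neighborhood_profile)
```

Medians are used here instead of means because a handful of very expensive sales would otherwise dominate a neighbourhood's profile.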

#### Example

Import all the packages needed for data processing.

In [19]:
import pandas as pd
import numpy as np
import requests # library to handle requests

!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values




We read the _csv_ file downloaded from Kaggle.

In [71]:
ny_sales_df = pd.read_csv('nyc-rolling-sales.csv')
ny_sales_df.head()

| Unnamed: 0.1 | Unnamed: 0 | BOROUGH | NEIGHBORHOOD | BUILDING CLASS CATEGORY | TAX CLASS AT PRESENT | BLOCK | LOT | EASE-MENT | BUILDING CLASS AT PRESENT | ADDRESS | ... | RESIDENTIAL UNITS | COMMERCIAL UNITS | TOTAL UNITS | LAND SQUARE FEET | GROSS SQUARE FEET | YEAR BUILT | TAX CLASS AT TIME OF SALE | BUILDING CLASS AT TIME OF SALE | SALE PRICE | SALE DATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 392 | 6 |  | C2 | 153 AVENUE B | ... | 5 | 0 | 5 | 1633 | 6440 | 1900 | 2 | C2 | 6625000 | 2017-07-19 00:00:00 |
| 1 | 5 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 26 |  | C7 | 234 EAST 4TH STREET | ... | 28 | 3 | 31 | 4616 | 18690 | 1900 | 2 | C7 | - | 2016-12-14 00:00:00 |
| 2 | 6 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 39 |  | C7 | 197 EAST 3RD STREET | ... | 16 | 1 | 17 | 2212 | 7803 | 1900 | 2 | C7 | - | 2016-12-09 00:00:00 |
| 3 | 7 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2B | 402 | 21 |  | C4 | 154 EAST 7TH STREET | ... | 10 | 0 | 10 | 2272 | 6794 | 1913 | 2 | C4 | 3936272 | 2016-09-23 00:00:00 |
| 4 | 8 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 404 | 55 |  | C2 | 301 EAST 10TH STREET | ... | 6 | 0 | 6 | 2369 | 4615 | 1900 | 2 | C2 | 8000000 | 2016-11-17 00:00:00 |


We see that some rows in the 'SALE PRICE' column hold the placeholder ' -  ' instead of a number, so those rows will have to be dropped. The comparison below flags them:

In [82]:
ny_sales_df['SALE PRICE'] == ' -  '

0        False
1         True
2         True
3        False
4        False
         ...  
84543    False
84544    False
84545    False
84546    False
84547    False
Name: SALE PRICE, Length: 84548, dtype: bool
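
The mask above only flags the affected rows. One possible way to actually drop them is sketched below on a toy frame, since the full file is not loaded here. It relies on `pd.to_numeric` with `errors="coerce"`, which turns every non-numeric string into `NaN`, followed by `dropna`. The `raw` and `cleaned` names are placeholders for this example.

```python
import pandas as pd

# Toy frame mimicking the raw column: prices arrive as strings,
# with ' -  ' standing in for a missing price.
raw = pd.DataFrame({
    "ADDRESS": ["153 AVENUE B", "234 EAST 4TH STREET", "154 EAST 7TH STREET"],
    "SALE PRICE": ["6625000", " -  ", "3936272"],
})

# errors="coerce" converts anything that is not a number into NaN.
raw["SALE PRICE"] = pd.to_numeric(raw["SALE PRICE"], errors="coerce")

# Keep only the rows that have a usable price.
cleaned = raw.dropna(subset=["SALE PRICE"]).reset_index(drop=True)
print(cleaned)
```

Coercing to numbers instead of filtering on the exact ' -  ' string also catches any other non-numeric value that might be hiding in the column.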

We build the address variable by appending the ', New York, NY' string to the 'ADDRESS' column. For now we only do this for the first five rows:

In [63]:
address = ny_sales_df['ADDRESS'].head() + ', New York, NY'
address[0]

'153 AVENUE B, New York, NY'
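
Some raw 'ADDRESS' values contain runs of spaces (for example '234 EAST 4TH   STREET'), which may make geocoding less reliable. A minimal sketch of normalising the whitespace before appending the suffix is shown below. The `addresses_raw` series is a toy stand-in for the column, and collapsing whitespace is a suggestion of this example, not a step taken in the cells above.

```python
import pandas as pd

# Toy stand-in for ny_sales_df['ADDRESS'], including untidy spacing.
addresses_raw = pd.Series(
    ["153 AVENUE B", "234 EAST 4TH   STREET", " 197 EAST 3RD   STREET "]
)

full_address = (
    addresses_raw
    .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
    .str.strip()                           # drop leading/trailing spaces
    + ", New York, NY"
)
print(full_address.tolist())
```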

Finally, with the address variable we can use the _geolocator_ (geopy's Nominatim geocoder) to get the latitude and longitude of each address.

In [83]:
# address = '153 AVENUE B, New York, NY'
geolocator = Nominatim(user_agent="foursquare_agent")
for loc in address:
    # geocode() returns None when the address cannot be resolved
    location = geolocator.geocode(loc)
    if location is None:
        print(loc, '-> not found')
        continue
    print(loc, '->', location.latitude, location.longitude)

153 AVENUE B, New York, NY -> 40.726572950000005 -73.97987037365662
234 EAST 4TH   STREET, New York, NY -> 40.723315 -73.98313696338553
197 EAST 3RD   STREET, New York, NY -> 40.6471074 -73.9779171
154 EAST 7TH STREET, New York, NY -> 40.72541255 -73.9824410244452
301 EAST 10TH   STREET, New York, NY -> 40.72778205 -73.98166031239643
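
In the output above, the result for '197 EAST 3RD   STREET' has a latitude of about 40.647, clearly south of the other four (all between 40.72 and 40.73), which suggests the geocoder matched a street with the same name outside Manhattan. The sketch below shows one way to guard against this when geocoding many rows. The helper names `in_bbox` and `geocode_addresses` are hypothetical, and the bounding box is a rough assumption that should be checked before use. The pause between calls is there because the public Nominatim service asks for at most one request per second. To keep the example runnable offline, it is driven by a fake geocoder that replays two of the results printed above.

```python
import time
from types import SimpleNamespace

# Rough bounding box for Manhattan (an assumption, not taken from the dataset).
MANHATTAN_BBOX = {"lat_min": 40.68, "lat_max": 40.89,
                  "lon_min": -74.03, "lon_max": -73.90}


def in_bbox(lat, lon, bbox=MANHATTAN_BBOX):
    """Return True if the point lies inside the bounding box."""
    return (bbox["lat_min"] <= lat <= bbox["lat_max"]
            and bbox["lon_min"] <= lon <= bbox["lon_max"])


def geocode_addresses(addresses, geocode, delay_seconds=1.0):
    """Map each address to (lat, lon), or to None if not found or out of bounds.

    `geocode` is any callable that returns an object with .latitude and
    .longitude (or None), such as Nominatim(...).geocode from geopy.
    """
    results = {}
    for addr in addresses:
        location = geocode(addr)
        if location is None or not in_bbox(location.latitude, location.longitude):
            results[addr] = None
        else:
            results[addr] = (location.latitude, location.longitude)
        time.sleep(delay_seconds)  # be polite to the geocoding service
    return results


# Fake geocoder replaying two results from the output above (no network needed).
fake_results = {
    "153 AVENUE B, New York, NY": SimpleNamespace(
        latitude=40.726572950000005, longitude=-73.97987037365662),
    "197 EAST 3RD STREET, New York, NY": SimpleNamespace(
        latitude=40.6471074, longitude=-73.9779171),
}
coords = geocode_addresses(
    list(fake_results) + ["NO SUCH ADDRESS, New York, NY"],
    fake_results.get,   # returns None for unknown addresses
    delay_seconds=0,    # no need to wait when nothing is sent over the network
)
print(coords)
```

With the real service, the call would be `geocode_addresses(address, geolocator.geocode)`, keeping the default one-second delay. Rows that come back as `None` can then be retried with a more specific address or dropped.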
