# The Battle Of Neighbourhoods

## 1. Business Problem Section

### 1.1 Background

New York City is one of the world's major financial and cultural centres. As a result, its real estate market sees a lot of buying and selling activity.
Following the financial crisis of 2008, NYC saw a downturn in real estate prices, but the market has since shown some recovery. Investors looking at real estate want to know how best to invest in the current market.

### 1.2 Business Problem

A homebuyer client wants an optimal recommendation, based on his requirements, for buying a house in NYC. The recommendation should give him enough information to decide where in NYC to buy.

## 2. Data Description and Methodology

### 2.1 Data Description:

Using the data science techniques learnt in this course together with Foursquare location data, we will provide recommendations to the client.
We are going to cluster New York neighborhoods in order to recommend venues and report the current average price of real estate in areas where homebuyers can invest.
We will also recommend profitable venue types, e.g. pharmacies, restaurants, hospitals and grocery stores.

The Department of Finance (DOF) maintains records for all property sales in New York City, including sales of family homes in each borough (https://data.cityofnewyork.us/api/views/948r-3ads/rows.csv?accessType=DOWNLOAD).
This list includes all sales of 1-, 2-, and 3-family homes from January 1, 2009 to December 31, 2009 with a sale price of $150,000 or more. The Building Class Category for Sales is based on the Building Class at the time of the sale.
To explore and target recommended locations according to the presence of amenities and essential facilities, we will access venue data through the Foursquare API and arrange it as a dataframe for visualization. By merging the DOF data on New York property sales and prices with the Foursquare data on amenities and essential facilities surrounding those properties, we will be able to recommend profitable real estate investments.

### 2.2 Methodology

1. Collect Inspection Data
2. Explore and Understand Data
3. Data preparation and preprocessing 
4. Modeling

In [1]:
# Importing Libraries
#Beautifulsoup library helps in web scraping data from webpage
from bs4 import BeautifulSoup
# lxml library is the parser used to parse the content of different HTML tags
import lxml
# Requests library helps in getting the content of the webpage
import requests as req
# library to handle data in a vectorized manner
import numpy as np
#library for Data Analysis
import pandas as pd
# library to handle JSON files
import json 
# convert an address into latitude and longitude values
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim 
# library to handle requests
import requests 
# transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize 
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
# map rendering library
!conda install -c conda-forge folium=0.5.0 --yes
import folium 
# library to find median of List
from numpy import median
print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries imported.


In [14]:
# Download the New York neighborhoods dataset
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [16]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    newyork_data

In [23]:
neighborhoods_data = newyork_data['features']

In [24]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [25]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


In [27]:
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)
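
Note that `DataFrame.append` was removed in pandas 2.0, so the loop above fails on current pandas versions. A minimal sketch of one alternative, assuming the same nesting as `newyork_data['features']`: collect plain dictionaries in a list and build the dataframe once. The `sample_features` entries and the `features_to_frame` helper are hypothetical names used only for illustration, and the `drop_duplicates` call is an added safeguard in case the same features are loaded more than once.

```python
import pandas as pd

# Hypothetical stand-in for newyork_data['features'] (same nesting as the real file)
sample_features = [
    {"properties": {"borough": "Bronx", "name": "Wakefield"},
     "geometry": {"coordinates": [-73.847201, 40.894705]}},
    {"properties": {"borough": "Bronx", "name": "Co-op City"},
     "geometry": {"coordinates": [-73.829939, 40.874294]}},
]

def features_to_frame(features):
    """Build the neighborhoods dataframe in one pass instead of row-by-row appends."""
    records = []
    for feature in features:
        # coordinates are stored as [longitude, latitude]
        lon, lat = feature["geometry"]["coordinates"][:2]
        records.append({
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            "Latitude": lat,
            "Longitude": lon,
        })
    frame = pd.DataFrame(records, columns=["Borough", "Neighborhood", "Latitude", "Longitude"])
    # Drop exact repeats in case the same features were loaded twice
    return frame.drop_duplicates().reset_index(drop=True)

# Passing the features twice simulates an accidental double load
neighborhoods_sketch = features_to_frame(sample_features + sample_features)
print(neighborhoods_sketch)
```

Building the frame from a list of records avoids copying the dataframe on every iteration, which is what the row-by-row pattern does.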

## Data Exploration

In [29]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [30]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 612 neighborhoods.


In [32]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [33]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [34]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7900869, -73.9598295.


In [35]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

In [36]:
CLIENT_ID = 'N4XVQHVUYVXOWT3YJWZFDH3JLDX2BWFVOZLX0N521KOTMXI3' # Foursquare ID
CLIENT_SECRET = 'JKHTU3I0CKQDHUOZZRNZ2Y0EZQNUSJIHLDSAYF3A3GXQXC4Z' # Foursquare Secret
VERSION = '20190226' # Foursquare API version
LIMIT=100
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: N4XVQHVUYVXOWT3YJWZFDH3JLDX2BWFVOZLX0N521KOTMXI3
CLIENT_SECRET:JKHTU3I0CKQDHUOZZRNZ2Y0EZQNUSJIHLDSAYF3A3GXQXC4Z


In [37]:
# First Neighbourhood 
manhattan_data.loc[0, 'Neighborhood']

'Marble Hill'

In [38]:
neighborhood_latitude = manhattan_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = manhattan_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = manhattan_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Marble Hill are 40.87655077879964, -73.91065965862981.


In [39]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=N4XVQHVUYVXOWT3YJWZFDH3JLDX2BWFVOZLX0N521KOTMXI3&client_secret=JKHTU3I0CKQDHUOZZRNZ2Y0EZQNUSJIHLDSAYF3A3GXQXC4Z&v=20190226&ll=40.87655077879964,-73.91065965862981&radius=500&limit=100'
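
The URL above is assembled with string formatting and embeds the credentials directly. As a hedged alternative, the sketch below builds the same kind of query string with the standard library's `urlencode` and reads the credentials from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are assumptions made for this example, not part of the original notebook.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, radius=500, limit=100, version="20190226"):
    """Assemble the explore URL; credentials come from (hypothetical) environment variables."""
    params = {
        # Placeholders are used when the environment variables are not set
        "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID"),
        "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET"),
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes values, so the comma in "ll" becomes %2C
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

example_url = build_explore_url(40.87655077879964, -73.91065965862981)
print(example_url)
```

Keeping the keys outside the notebook means they are not printed in cell outputs or saved with the file.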

In [40]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d46b477787dba003836e332'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Marble Hill',
  'headerFullLocation': 'Marble Hill, New York',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 26,
  'suggestedBounds': {'ne': {'lat': 40.88105078329964,
    'lng': -73.90471933917806},
   'sw': {'lat': 40.87205077429964, 'lng': -73.91659997808156}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4b4429abf964a52037f225e3',
       'name': "Arturo's",
       'location': {'address': '5198 Broadway',
        'crossStreet': 'at 225th St.',
        'lat': 40.87441177110231,
        'lng': -73.91027100981574,
        'labeledLatLngs': [{'label'

In [41]:
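# function that extracts the category of the venue
def get_category_type(row):
    # the category list is stored under 'categories' or, after flattening, 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # venues without any category get None
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']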

In [42]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Arturo's,Pizza Place,40.874412,-73.910271
1,Bikram Yoga,Yoga Studio,40.876844,-73.906204
2,Tibbett Diner,Diner,40.880404,-73.908937
3,Starbucks,Coffee Shop,40.877531,-73.905582
4,Dunkin',Donut Shop,40.877136,-73.906666
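
The dotted column names used above (`venue.name`, `venue.location.lat`, and so on) come from flattening the nested venue dictionaries. The sketch below is a simplified illustration of that idea in plain Python; it is not pandas' implementation, and `flatten` and `sample_item` are hypothetical names introduced for this example.

```python
def flatten(record, parent_key="", sep="."):
    """Recursively flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value  # lists (e.g. categories) are kept as-is
    return flat

# Hypothetical item shaped like one entry of results['response']['groups'][0]['items']
sample_item = {
    "venue": {
        "name": "Arturo's",
        "categories": [{"name": "Pizza Place"}],
        "location": {"lat": 40.87441177110231, "lng": -73.91027100981574},
    }
}

flat_item = flatten(sample_item)
print(sorted(flat_item))
```

Because the category list is left untouched by the flattening step, a helper such as `get_category_type` is still needed to pull out the first category name.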


In [43]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

26 venues were returned by Foursquare.


In [44]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
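
In `getNearbyVenues`, the expression `v['venue']['categories'][0]['name']` raises an `IndexError` if a venue comes back with an empty category list. A minimal sketch of a more defensive row builder, assuming the same item structure as the response shown earlier; `extract_venue_row` and `sample_items` are hypothetical names used only here.

```python
def extract_venue_row(neighborhood, lat, lng, item):
    """Return one output row, using None when a field or the category is missing."""
    venue = item.get("venue", {})
    location = venue.get("location", {})
    categories = venue.get("categories") or []
    category_name = categories[0].get("name") if categories else None
    return (neighborhood, lat, lng, venue.get("name"),
            location.get("lat"), location.get("lng"), category_name)

# Hypothetical items: one complete, one with an empty category list
sample_items = [
    {"venue": {"name": "Arturo's",
               "location": {"lat": 40.874412, "lng": -73.910271},
               "categories": [{"name": "Pizza Place"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 40.875, "lng": -73.91},
               "categories": []}},
]

rows = [extract_venue_row("Marble Hill", 40.876551, -73.91066, item) for item in sample_items]
for row in rows:
    print(row)
```

Rows with a `None` category can then be dropped or counted separately before the one-hot encoding step.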

In [45]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards
Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyve

In [46]:
print(manhattan_venues.shape)
manhattan_venues.head()

(6654, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop
4,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop


In [47]:
manhattan_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Battery Park City,200,200,200,200,200,200
Carnegie Hill,200,200,200,200,200,200
Central Harlem,84,84,84,84,84,84
Chelsea,200,200,200,200,200,200
Chinatown,200,200,200,200,200,200
Civic Center,200,200,200,200,200,200
Clinton,200,200,200,200,200,200
East Harlem,86,86,86,86,86,86
East Village,200,200,200,200,200,200
Financial District,200,200,200,200,200,200


In [48]:
print('There are {} unique categories.'.format(len(manhattan_venues['Venue Category'].unique())))

There are 340 unique categories.


In [49]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Volleyball Court,Watch Shop,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [50]:
manhattan_onehot.shape

(6654, 341)

In [51]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Volleyball Court,Watch Shop,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Battery Park City,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.03,0.0,0.02,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.01,0.03,0.0,0.01,0.03
2,Central Harlem,0.0,0.0,0.0,0.071429,0.047619,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Chelsea,0.0,0.0,0.0,0.0,0.03,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0
4,Chinatown,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Civic Center,0.0,0.0,0.0,0.0,0.03,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.01,0.0,0.03
6,Clinton,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.03,0.0,0.0,0.0
7,East Harlem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,East Village,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.02,0.01,...,0.0,0.0,0.0,0.0,0.0,0.05,0.02,0.0,0.0,0.0
9,Financial District,0.01,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.04,0.0,0.01,0.0


In [53]:
manhattan_grouped.shape

(40, 341)
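
Each value in `manhattan_grouped` is the mean of a one-hot column, which is the share of a neighborhood's venues that fall in that category. The sketch below reproduces that arithmetic in plain Python on a handful of made-up rows, so the meaning of the frequencies is easy to check by hand. The `sample_venues` pairs and the `category_shares` helper are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, venue category) pairs, not taken from the Foursquare results
sample_venues = [
    ("Marble Hill", "Pizza Place"),
    ("Marble Hill", "Coffee Shop"),
    ("Marble Hill", "Coffee Shop"),
    ("Marble Hill", "Diner"),
    ("Chinatown", "Chinese Restaurant"),
    ("Chinatown", "Bakery"),
]

def category_shares(pairs):
    """Return, per neighborhood, the fraction of its venues in each category."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for hood, counter in counts.items()
    }

shares = category_shares(sample_venues)
print(shares["Marble Hill"])
```

Because the values are shares rather than counts, neighborhoods with many venues and neighborhoods with few venues are put on the same scale before clustering.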

In [54]:
num_top_venues = 5

for hood in manhattan_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = manhattan_grouped[manhattan_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Battery Park City----
                venue  freq
0                Park  0.08
1         Coffee Shop  0.06
2               Hotel  0.05
3       Memorial Site  0.04
4  Italian Restaurant  0.03


----Carnegie Hill----
         venue  freq
0  Coffee Shop  0.06
1  Pizza Place  0.06
2          Bar  0.04
3         Café  0.04
4          Gym  0.03


----Central Harlem----
                 venue  freq
0   African Restaurant  0.07
1   Seafood Restaurant  0.05
2    French Restaurant  0.05
3   Chinese Restaurant  0.05
4  American Restaurant  0.05


----Chelsea----
                venue  freq
0         Coffee Shop  0.06
1  Italian Restaurant  0.05
2      Ice Cream Shop  0.05
3           Nightclub  0.04
4              Bakery  0.04


----Chinatown----
                   venue  freq
0     Chinese Restaurant  0.09
1           Cocktail Bar  0.05
2  Vietnamese Restaurant  0.04
3    American Restaurant  0.04
4     Salon / Barbershop  0.04


----Civic Center----
                  venue  freq
0  Gym / Fit

In [55]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [56]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,Park,Coffee Shop,Hotel,Memorial Site,Gym,Wine Shop,Italian Restaurant,Playground,BBQ Joint,Women's Store
1,Carnegie Hill,Coffee Shop,Pizza Place,Bar,Café,Bakery,Bookstore,Spa,Cosmetics Shop,Yoga Studio,Japanese Restaurant
2,Central Harlem,African Restaurant,Fried Chicken Joint,French Restaurant,Seafood Restaurant,Chinese Restaurant,Cosmetics Shop,Public Art,Bar,American Restaurant,Southern / Soul Food Restaurant
3,Chelsea,Coffee Shop,Ice Cream Shop,Italian Restaurant,Nightclub,Bakery,Theater,Hotel,American Restaurant,Seafood Restaurant,Art Gallery
4,Chinatown,Chinese Restaurant,Cocktail Bar,Salon / Barbershop,American Restaurant,Vietnamese Restaurant,Bakery,Bubble Tea Shop,Ice Cream Shop,Spa,Dumpling Restaurant
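
The column-naming loop above relies on a `try`/`except` around a three-element suffix list, which is fine for ten columns but would produce labels such as "21th" for larger values. A small sketch of an explicit ordinal helper, in case more columns are ever needed; `ordinal` and `column_labels` are hypothetical names for this example.

```python
def ordinal(n):
    """Return n with its English ordinal suffix (1st, 2nd, 3rd, 4th, 11th, 21st, ...)."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 10th through 20th all take "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

# Same column layout as the notebook: a name column plus ten ranked venue columns
column_labels = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(column_labels)
```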


In [57]:
# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 1, 1, 2, 1, 2, 2, 0, 1, 1], dtype=int32)
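
The number of clusters above is set by hand to 5. One common heuristic for sanity-checking that choice is to compare the k-means inertia (within-cluster sum of squared distances) across several values of k and look for the point where it stops dropping sharply. The original notebook does not do this; the sketch below only illustrates the idea on synthetic data, and the `features` array is a made-up stand-in for `manhattan_grouped_clustering`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the clustering features: 3 well-separated groups of points
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
features = np.vstack([center + 0.3 * rng.randn(20, 2) for center in centers])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(features)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 2))
```

On this synthetic data the inertia falls steeply up to k = 3 and only slowly afterwards, which is the "elbow" the heuristic looks for. On real venue-frequency data the curve is usually less clear-cut, so it is a guide rather than a rule.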

In [59]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged = manhattan_data

# merge neighborhoods_venues_sorted into manhattan_data to add the cluster label and top venues for each neighborhood
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Manhattan,Marble Hill,40.876551,-73.91066,4,Sandwich Place,Coffee Shop,Discount Store,Yoga Studio,Diner,Steakhouse,Supplement Shop,Shopping Mall,Seafood Restaurant,Tennis Stadium
1,Manhattan,Chinatown,40.715618,-73.994279,1,Chinese Restaurant,Cocktail Bar,Salon / Barbershop,American Restaurant,Vietnamese Restaurant,Bakery,Bubble Tea Shop,Ice Cream Shop,Spa,Dumpling Restaurant
2,Manhattan,Washington Heights,40.851903,-73.9369,0,Café,Grocery Store,Deli / Bodega,Bakery,Mobile Phone Shop,Clothing Store,Spanish Restaurant,Supermarket,Tapas Restaurant,Mexican Restaurant
3,Manhattan,Inwood,40.867684,-73.92121,0,Mexican Restaurant,Café,Lounge,Deli / Bodega,Bakery,Pizza Place,Restaurant,Frozen Yogurt Shop,Spanish Restaurant,Park
4,Manhattan,Hamilton Heights,40.823604,-73.949688,0,Pizza Place,Café,Mexican Restaurant,Coffee Shop,Park,Caribbean Restaurant,Chinese Restaurant,School,Sandwich Place,Bank


In [60]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
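
The map cell above indexes the palette with `rainbow[cluster-1]`, so cluster label 0 picks up the last color in the list. That still gives one color per cluster, but indexing directly by label is easier to follow. The sketch below builds a per-label palette with the standard library's `colorsys` module instead of matplotlib's rainbow colormap; `cluster_palette` is a hypothetical helper, and the saturation and brightness values are arbitrary choices.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return one hex color per cluster label, evenly spaced around the hue wheel."""
    palette = []
    for label in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(label / n_clusters, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255)))
    return palette

palette = cluster_palette(5)
print(palette)
```

With such a list, a marker for cluster label `c` would simply use `palette[c]` for both `color` and `fill_color`.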

In [61]:
# Cluster 1 (cluster label 0)
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Washington Heights,Café,Grocery Store,Deli / Bodega,Bakery,Mobile Phone Shop,Clothing Store,Spanish Restaurant,Supermarket,Tapas Restaurant,Mexican Restaurant
3,Inwood,Mexican Restaurant,Café,Lounge,Deli / Bodega,Bakery,Pizza Place,Restaurant,Frozen Yogurt Shop,Spanish Restaurant,Park
4,Hamilton Heights,Pizza Place,Café,Mexican Restaurant,Coffee Shop,Park,Caribbean Restaurant,Chinese Restaurant,School,Sandwich Place,Bank
5,Manhattanville,Coffee Shop,Liquor Store,Mexican Restaurant,Italian Restaurant,Seafood Restaurant,Park,Lounge,Bike Trail,Other Nightlife,Sushi Restaurant
7,East Harlem,Mexican Restaurant,Bakery,Deli / Bodega,Thai Restaurant,Latin American Restaurant,Spa,Pizza Place,French Restaurant,Steakhouse,Spanish Restaurant
25,Manhattan Valley,Pizza Place,Indian Restaurant,Coffee Shop,Yoga Studio,Mexican Restaurant,Playground,Café,Deli / Bodega,Thai Restaurant,Bar
36,Tudor City,Mexican Restaurant,Greek Restaurant,Park,Café,Pizza Place,Hotel,Deli / Bodega,Dog Run,Sushi Restaurant,Burger Joint
42,Washington Heights,Café,Grocery Store,Deli / Bodega,Bakery,Mobile Phone Shop,Clothing Store,Spanish Restaurant,Supermarket,Tapas Restaurant,Mexican Restaurant
43,Inwood,Mexican Restaurant,Café,Lounge,Deli / Bodega,Bakery,Pizza Place,Restaurant,Frozen Yogurt Shop,Spanish Restaurant,Park
44,Hamilton Heights,Pizza Place,Café,Mexican Restaurant,Coffee Shop,Park,Caribbean Restaurant,Chinese Restaurant,School,Sandwich Place,Bank


In [62]:
# Cluster 2 (cluster label 1)
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 1, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Chinatown,Chinese Restaurant,Cocktail Bar,Salon / Barbershop,American Restaurant,Vietnamese Restaurant,Bakery,Bubble Tea Shop,Ice Cream Shop,Spa,Dumpling Restaurant
6,Central Harlem,African Restaurant,Fried Chicken Joint,French Restaurant,Seafood Restaurant,Chinese Restaurant,Cosmetics Shop,Public Art,Bar,American Restaurant,Southern / Soul Food Restaurant
9,Yorkville,Coffee Shop,Gym,Italian Restaurant,Bar,Pizza Place,Sushi Restaurant,Deli / Bodega,Wine Shop,Japanese Restaurant,Mexican Restaurant
10,Lenox Hill,Coffee Shop,Sushi Restaurant,Italian Restaurant,Pizza Place,Sporting Goods Shop,Burger Joint,Gym,Cosmetics Shop,Gym / Fitness Center,Gift Shop
11,Roosevelt Island,Sandwich Place,Coffee Shop,Market,Gym / Fitness Center,Gym,Park,Greek Restaurant,School,Liquor Store,Hotel
12,Upper West Side,Italian Restaurant,Wine Bar,Bar,Indian Restaurant,Mediterranean Restaurant,Bakery,Cosmetics Shop,Coffee Shop,Vegetarian / Vegan Restaurant,Pet Store
16,Murray Hill,Coffee Shop,Japanese Restaurant,Sandwich Place,Hotel,Italian Restaurant,Gym / Fitness Center,Bar,Gym,French Restaurant,Salon / Barbershop
19,East Village,Bar,Wine Bar,Chinese Restaurant,Mexican Restaurant,Pizza Place,Ice Cream Shop,Ramen Restaurant,Cocktail Bar,Vegetarian / Vegan Restaurant,Coffee Shop
20,Lower East Side,Coffee Shop,Chinese Restaurant,Pizza Place,Ramen Restaurant,Café,Japanese Restaurant,Cocktail Bar,Sandwich Place,Art Gallery,Bakery
27,Gramercy,Bar,American Restaurant,Pizza Place,Bagel Shop,Italian Restaurant,Grocery Store,Ice Cream Shop,Thai Restaurant,Wine Shop,Thrift / Vintage Store


In [63]:
# Cluster 3 (cluster label 2)
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 2, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Upper East Side,Italian Restaurant,Exhibit,Art Gallery,Bakery,Gym / Fitness Center,Coffee Shop,French Restaurant,Juice Bar,Spa,Hotel
13,Lincoln Square,Theater,Gym / Fitness Center,Café,Concert Hall,Plaza,Italian Restaurant,Opera House,Indie Movie Theater,Performing Arts Venue,French Restaurant
14,Clinton,Theater,Gym / Fitness Center,Hotel,American Restaurant,Italian Restaurant,Coffee Shop,Spa,Sandwich Place,Wine Shop,Food Court
15,Midtown,Hotel,Clothing Store,Theater,Coffee Shop,Cocktail Bar,Japanese Restaurant,Spa,Bakery,Bookstore,Steakhouse
17,Chelsea,Coffee Shop,Ice Cream Shop,Italian Restaurant,Nightclub,Bakery,Theater,Hotel,American Restaurant,Seafood Restaurant,Art Gallery
18,Greenwich Village,Italian Restaurant,Clothing Store,Sushi Restaurant,French Restaurant,Café,Seafood Restaurant,Cosmetics Shop,Chinese Restaurant,Indian Restaurant,Coffee Shop
21,Tribeca,Spa,Park,Italian Restaurant,Café,American Restaurant,Boutique,Wine Shop,Wine Bar,Coffee Shop,Gym
22,Little Italy,Bakery,Café,Salon / Barbershop,Sandwich Place,Clothing Store,Bubble Tea Shop,Cocktail Bar,Mediterranean Restaurant,Italian Restaurant,Women's Store
23,Soho,Clothing Store,Boutique,Art Gallery,Women's Store,Shoe Store,Men's Store,Italian Restaurant,Hotel,Bakery,Sporting Goods Shop
24,West Village,Italian Restaurant,Cosmetics Shop,New American Restaurant,Wine Bar,Park,American Restaurant,Cocktail Bar,Jazz Club,Bakery,Coffee Shop


## Observations & Results

Let us examine the clusters by neighbourhood area.

Cluster 0:
1. The average and median prices of Cluster 0 neighborhoods are 403649.400000 and 406267.950000 respectively.
2. The cluster contains the following places - CAMBRIA HEIGHTS
JAMAICA
LAURELTON
ROSEDALE
SOUTH JAMAICA
SPRINGFIELD GARDENS
ST. ALBANS 
3. The most common venues nearby are food corners, restaurants, banks and parks. The number of sales is low relative to the number of available properties.
4. These properties are the best to buy: the average and median prices are very reasonable, and the basic amenities for daily needs are close by.
5. The area is strong on food and restaurants, but other amenities such as hospitals and schools are less frequent.

Cluster 1:
1. The average and median prices of Cluster 1 neighborhoods are 610196.027397 and 607131.506849 respectively.
2. The cluster contains the following places - ASTORIA
BAYSIDE
BRIARWOOD
CORONA
DOUGLASTON
EAST ELMHURST
ELMHURST
FLUSHING-NORTH
FLUSHING-SOUTH
FOREST HILLS
FRESH MEADOWS
GLENDALE
HILLCREST
JACKSON HEIGHTS
KEW GARDENS
LITTLE NECK
LONG ISLAND CITY
MASPETH
MIDDLE VILLAGE
OAKLAND GARDENS
REGO PARK
RICHMOND HILL
RIDGEWOOD
SUNNYSIDE
WOODSIDE
3. The average and median prices are higher than in all other clusters. The most common venues nearby are supermarkets, restaurants, bars, parks and bagel shops.

Cluster 2:
1. The average and median prices of Cluster 2 neighborhoods are 474991.333333 and 458104.6 respectively.
2. The cluster contains the following places - ARVERNE
BELLE HARBOR
FAR ROCKAWAY
HAMMELS
NEPONSIT
ROCKAWAY PARK 
3. The most common venues nearby are beaches, pizza places, banks, bus stops and all kinds of food corners.
4. These should be the second most preferred properties, after those in Cluster 0, because of their average and median prices.

Cluster 3:

1. The average and median prices of Cluster 3 neighborhoods are 511496.795918 and 458104.600000 respectively.
2. The cluster contains the following places - AIRPORT LA GUARDIA
BEECHHURST
BELLEROSE
BROAD CHANNEL
COLLEGE POINT
FLORAL PARK
GLEN OAKS
HOLLIS
HOLLIS HILLS
HOLLISWOOD
HOWARD BEACH
JAMAICA BAY
JAMAICA ESTATES
JAMAICA HILLS
OZONE PARK
QUEENS VILLAGE
SO. JAMAICA-BAISLEY PARK SOUTH OZONE PARK
WHITESTONE
WOODHAVEN 
3. The most common venues nearby are airport lounges, burger joints, pharmacies, coffee shops, parks, etc.
4. These properties are the second most expensive, after those in Cluster 1.
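
The per-cluster average and median prices quoted above come from grouping sale prices by cluster label. The notebook cells shown here do not include that step, so the following is only a sketch of how such a summary could be computed with the standard library. The `sample_prices` values are placeholders, not figures from the DOF data, and `price_summary` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean, median

# Placeholder (cluster label, sale price) pairs; NOT taken from the DOF dataset
sample_prices = [
    (0, 400000.0), (0, 420000.0), (0, 380000.0),
    (1, 600000.0), (1, 650000.0),
]

def price_summary(pairs):
    """Group prices by cluster label and return the mean and median for each cluster."""
    grouped = defaultdict(list)
    for cluster, price in pairs:
        grouped[cluster].append(price)
    return {
        cluster: {"mean": mean(prices), "median": median(prices)}
        for cluster, prices in sorted(grouped.items())
    }

summary = price_summary(sample_prices)
for cluster, stats in summary.items():
    print(cluster, stats)
```

Reporting both statistics is useful because a large gap between mean and median suggests that a few very expensive sales are pulling the average up.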

## Conclusion

The goal was to help homebuyer clients purchase suitable real estate in New York using machine learning algorithms.
The business problem we posed was:

How can we advise homebuyer clients on purchasing suitable real estate in New York in this depreciating economy?
To solve this business problem, we clustered New York neighborhoods in order to recommend venues and report the current average price of real estate in areas where homebuyers can invest. We also set out to recommend profitable venue types, e.g. pharmacies, restaurants, hospitals and grocery stores.
First, we gathered data from the Department of Finance (DOF), which maintains records for all property sales in New York City, including sales of family homes in each borough (https://data.cityofnewyork.us/api/views/948r-3ads/rows.csv?accessType=DOWNLOAD).
This list includes all sales of 1-, 2-, and 3-family homes from January 1, 2009 to December 31, 2009 with a sale price of $150,000 or more. The Building Class Category for Sales is based on the Building Class at the time of the sale.

To explore and target recommended locations according to the presence of amenities and essential facilities, we accessed venue data through the Foursquare API and arranged it as a dataframe for visualization. Merging the DOF data on New York property sales and prices with the Foursquare data on amenities and essential facilities surrounding those properties lets us recommend profitable real estate investments.

Finally, we can analyze the results according to the five clusters we produced. All clusters offer a good range of facilities and amenities, but they differ in price.
Cluster 3 - The average and median prices are fairly close to each other and the common venues are similar across its neighborhoods, but its properties are the second most expensive, after those in Cluster 1.

Cluster 0 and 2 - The average and median prices are lower than in the other clusters.

Cluster 1 - The average and median prices are higher than in the other clusters.