# Capstone Project - The Battle of Neighborhoods to get optimal Real-Estate properties

## Business Problem section

### Background

New York City’s housing market has largely recovered from the financial crisis of 2008, but that doesn’t necessarily mean that buying a home here is, in the long run, a good investment. That’s the conclusion from a new report by StreetEasy, which looks at how home values in the city have changed in the 10 years since the Great Recession.
Additionally, home values have overall gone up since the post-crisis low of November 2011. StreetEasy found that those have risen by a whopping 30 percent in the past seven years, at an average of nearly four percent per year.

### Business Problem

The goal is to recommend suitable real estate in New York to homebuyers using machine learning algorithms.

As a result, the business problem we are currently posing is:

**How can we help homebuyers choose suitable real estate in New York in this depreciating economy?**

To solve this business problem, we are going to cluster New York neighborhoods in order to recommend areas, based on nearby venues and the current average price of real estate, where homebuyers can make a real estate investment. We will also recommend profitable venues, i.e. pharmacies, restaurants, hospitals and grocery stores.

## Data Section

The Department of Finance (DOF) maintains records for all property sales in New York City, including sales of family homes in each borough (https://data.cityofnewyork.us/api/views/948r-3ads/rows.csv?accessType=DOWNLOAD).

This list includes all sales of 1-, 2-, and 3-family homes from January 1, 2009 to December 31, 2009 with a sale price of $150,000 or more. The Building Class Category for Sales is based on the Building Class at the time of the sale.

To explore and target recommended locations according to the presence of amenities and essential facilities, we will access venue data through the Foursquare API and arrange it in a dataframe for visualization. By merging the DOF sale-price data for New York properties with the Foursquare data on the amenities and essential facilities surrounding them, we will be able to recommend profitable real estate investments.

## Methodology

1. Collect Inspection Data
2. Explore and Understand Data
3. Data preparation and preprocessing 
4. Modeling

# Implementation

In [1]:
#Beautifulsoup library helps in web scraping data from webpage
from bs4 import BeautifulSoup
#lxml library is the parser used to parse the content from different HTML tags
import lxml
# Requests library helps in getting the content of the webpage
import requests as req
# library to handle data in a vectorized manner
import numpy as np
#library for Data Analysis
import pandas as pd
# library to handle JSON files
import json 
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 
# library to handle requests
import requests 
# transform JSON data into a pandas dataframe (top-level import, available in pandas >= 1.0)
from pandas import json_normalize 
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
# map rendering library
import folium 
# library to find median of List
from numpy import median
print('Libraries imported.')

Libraries imported.


## 1. Collect Inspection data

In [2]:
# Download the Neighbourhood of NewYork with price dataset
!wget -O ny_neighbourhood.csv  https://data.cityofnewyork.us/api/views/948r-3ads/rows.csv?accessType=DOWNLOAD

--2019-03-22 22:08:00--  https://data.cityofnewyork.us/api/views/948r-3ads/rows.csv?accessType=DOWNLOAD
Resolving data.cityofnewyork.us (data.cityofnewyork.us)... 52.206.140.205, 52.206.140.199, 52.206.68.26
Connecting to data.cityofnewyork.us (data.cityofnewyork.us)|52.206.140.205|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘ny_neighbourhood.csv’

ny_neighbourhood.cs     [ <=>                  ]  18.18K  --.-KB/s   in 0s     

2019-03-22 22:08:01 (164 MB/s) - ‘ny_neighbourhood.csv’ saved [18612]



## 2. Explore Data

In [3]:
#Reading the dataset value to DataFrame
original_data=pd.read_csv('ny_neighbourhood.csv')
original_data.head()

Unnamed: 0,NEIGHBORHOOD,TYPE OF HOME,TOTAL NO. OF PROPERTIES,NUMBER OF SALES,LOWEST SALE PRICE,AVERAGE SALE PRICE,MEDIAN SALE PRICE,HIGHEST SALE PRICE
0,AIRPORT LA GUARDIA,01 ONE FAMILY HOMES,84,1,485000.0,485000.0,485000.0,485000.0
1,AIRPORT LA GUARDIA,02 TWO FAMILY HOMES,14,1,480000.0,480000.0,480000.0,480000.0
2,ARVERNE,01 ONE FAMILY HOMES,696,32,161000.0,297194.0,310276.0,390291.0
3,ARVERNE,02 TWO FAMILY HOMES,1528,112,160000.0,505043.0,427868.0,1170987.0
4,ARVERNE,03 THREE FAMILY HOMES,137,6,165000.0,414658.0,506796.0,582320.0


##  3. Preprocessing 

In [None]:
#Label Encoding for Type of Homes
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
original_data['TYPE OF HOME'] = labelencoder.fit_transform(original_data['TYPE OF HOME'])
original_data.head()

Unnamed: 0,NEIGHBORHOOD,TYPE OF HOME,TOTAL NO. OF PROPERTIES,NUMBER OF SALES,LOWEST SALE PRICE,AVERAGE SALE PRICE,MEDIAN SALE PRICE,HIGHEST SALE PRICE
0,AIRPORT LA GUARDIA,0,84,1,485000.0,485000.0,485000.0,485000.0
1,AIRPORT LA GUARDIA,1,14,1,480000.0,480000.0,480000.0,480000.0
2,ARVERNE,0,696,32,161000.0,297194.0,310276.0,390291.0
3,ARVERNE,1,1528,112,160000.0,505043.0,427868.0,1170987.0
4,ARVERNE,2,137,6,165000.0,414658.0,506796.0,582320.0


The label-encoded values map to the original categories as follows:

**0  = 01 ONE FAMILY HOMES**  
**1  = 02 TWO FAMILY HOMES**  
**2  = 03 THREE FAMILY HOMES**
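
The mapping above follows from how `LabelEncoder` assigns codes: it sorts the distinct labels and numbers them from zero. The sketch below reproduces that ordering rule in plain Python so it runs without scikit-learn; the helper name `encode_labels` is hypothetical and is not part of the notebook.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in sorted label order.

    This mirrors the ordering rule LabelEncoder uses (sorted unique classes);
    it is an illustrative stand-in, not the scikit-learn implementation.
    """
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[label] for label in labels], classes


home_types = [
    "01 ONE FAMILY HOMES",
    "02 TWO FAMILY HOMES",
    "01 ONE FAMILY HOMES",
    "03 THREE FAMILY HOMES",
]
codes, classes = encode_labels(home_types)
for code, label in enumerate(classes):
    print(code, "=", label)
```

In the notebook itself, `labelencoder.classes_` holds the same ordered list, and `labelencoder.inverse_transform` maps codes back to the original labels.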

In [None]:
lat=[]
lon=[]
from geopy.geocoders import Nominatim
# create the geocoder once instead of once per row
geolocator = Nominatim(user_agent="ny_explorer")
for i in original_data['NEIGHBORHOOD']:
    address = i+' , New York, USA'
    location = geolocator.geocode(address)
    if location is None:
        # fall back to a default coordinate when the address cannot be resolved
        lat.append(40.7136)
        lon.append(-73.7965)
        continue
    lat.append(location.latitude)
    lon.append(location.longitude)

#Outlier Reduction and Treatment
# replace coordinates outside the expected range with the median value
lat_med=median(lat)
lon_med=median(lon)
lat=[lat_med if v>41 else v for v in lat]
lon=[lon_med if v<-74 else v for v in lon]

original_data['LONGITUDE']=lon
original_data['LATITUDE']=lat
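
The cell above geocodes every row, substitutes a default coordinate when nothing is found, and replaces out-of-range values with the median. Because each neighborhood appears once per home type, the same address is looked up several times. The sketch below shows one way to cache lookups per distinct name. It takes the geocoder as a plain callable so that it runs offline; `geocode_neighborhoods`, `replace_outliers` and `fake_geocode` are hypothetical names, and the coordinates in `fake_geocode` are made up. The fallback coordinate and the thresholds 41 and -74 are the ones used in the notebook.

```python
from statistics import median

FALLBACK = (40.7136, -73.7965)  # default used by the notebook when geocoding fails


def geocode_neighborhoods(names, geocode, fallback=FALLBACK):
    """Return (lats, lons) for names, calling geocode once per distinct name.

    `geocode` is any callable taking an address string and returning a
    (lat, lon) tuple or None; in the notebook this role is played by Nominatim.
    """
    cache = {}
    lats, lons = [], []
    for name in names:
        if name not in cache:
            found = geocode(name + ", New York, USA")
            cache[name] = found if found is not None else fallback
        lat, lon = cache[name]
        lats.append(lat)
        lons.append(lon)
    return lats, lons


def replace_outliers(values, is_outlier):
    """Replace flagged values with the median of the full list."""
    mid = median(values)
    return [mid if is_outlier(v) else v for v in values]


# Hypothetical offline geocoder with made-up coordinates, for illustration only.
def fake_geocode(address):
    table = {
        "ARVERNE, New York, USA": (40.59, -73.79),
        "ASTORIA, New York, USA": (40.77, -73.93),
        "BAYSIDE, New York, USA": (42.50, -73.77),  # deliberately too far north
    }
    return table.get(address)


names = ["ARVERNE", "ARVERNE", "ASTORIA", "BAYSIDE", "UNKNOWN PLACE"]
lats, lons = geocode_neighborhoods(names, fake_geocode)
lats = replace_outliers(lats, lambda v: v > 41)   # same threshold as the notebook
lons = replace_outliers(lons, lambda v: v < -74)  # same threshold as the notebook
print(list(zip(names, lats, lons)))
```

Passing the geocoder in as an argument also makes it easy to wrap the real one with a delay between requests, which public geocoding services usually expect.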



In [None]:
#Getting the Data 
original_data.head()


In [None]:
#Getting the Geographical Location both Longitude and Latitude of New York , USA
from geopy.geocoders import Nominatim
address = 'NEW YORK'
geolocator = Nominatim(user_agent="tr_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

## Plotting the Neighborhoods of New York Present in the Dataset

In [None]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, price, street in zip(original_data['LATITUDE'], original_data['LONGITUDE'], original_data['AVERAGE SALE PRICE'], original_data['NEIGHBORHOOD']):
    label = '{}, {}'.format(street, price)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [None]:
import os
# Read the Foursquare credentials from the environment instead of hard-coding them.
# The environment variable names below are placeholders chosen for this notebook.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '') # Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '') # Foursquare Secret
VERSION = '20190226' # Foursquare API version
LIMIT=100
print('Credentials loaded:', bool(CLIENT_ID and CLIENT_SECRET))

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=2500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Street', 
                  'Street Latitude', 
                  'Street Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
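
The list comprehension in `getNearbyVenues` assumes that every response contains a group and that every venue has at least one category. The sketch below shows a more defensive version of that flattening step. The helper `flatten_venues` is hypothetical, and the payload is a made-up sample that only mimics the nesting the function indexes into; it is not real API output.

```python
def flatten_venues(name, lat, lng, payload):
    """Turn one explore-style response into rows of
    (street, street lat, street lng, venue, venue lat, venue lng, category).

    Missing groups yield no rows; venues without a category or
    without coordinates are skipped.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        location = venue.get("location", {})
        if not categories or "lat" not in location or "lng" not in location:
            continue
        rows.append((name, lat, lng, venue.get("name"),
                     location["lat"], location["lng"], categories[0]["name"]))
    return rows


# Made-up payload that mimics the nesting used in getNearbyVenues.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Corner Pizza",
               "location": {"lat": 40.60, "lng": -73.80},
               "categories": [{"name": "Pizza Place"}]}},
    {"venue": {"name": "Unlabelled Spot",
               "location": {"lat": 40.61, "lng": -73.81},
               "categories": []}},
]}]}}

rows = flatten_venues("ARVERNE", 40.59, -73.79, sample)
print(rows)
print(flatten_venues("ARVERNE", 40.59, -73.79, {"response": {}}))
```

The rows have the same column order as the dataframe built at the end of `getNearbyVenues`, so they could be passed to `pd.DataFrame` in the same way.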

## Getting Nearby Venues of New York

In [None]:
newyork_venues = getNearbyVenues(names=original_data['NEIGHBORHOOD'],
                                   latitudes=original_data['LATITUDE'],
                                   longitudes=original_data['LONGITUDE']
                                  )
newyork_venues

In [None]:
# Grouping the venues according to Street
newyork_venues.groupby('Street').count()


In [None]:
ax = newyork_venues['Street'].value_counts().plot(kind='bar', figsize=(20, 10),color=['#CD5C5C'],fontsize=14, width=0.4)

ax.set_title('Frequency distribution of nearby venues in New York',fontsize=14)
#ax.spines['right'].set_visible(False)
#ax.spines['top'].set_visible(False)
#ax.spines['left'].set_visible(False)

ax.set_ylabel('Count',fontsize=14)

In [None]:
# get the List of Unique Categories
print('There are {} unique categories.'.format(len(newyork_venues['Venue Category'].unique())))

In [None]:
newyork_venues.shape

In [None]:
# one hot encoding
venues_onehot = pd.get_dummies(newyork_venues[['Venue Category']], prefix="", prefix_sep="")

# add street column back to dataframe
venues_onehot['Street'] = newyork_venues['Street'] 

# move street column to the first column
fixed_columns = [venues_onehot.columns[-1]] + list(venues_onehot.columns[:-1])

#fixed_columns
venues_onehot = venues_onehot[fixed_columns]

venues_onehot.head()

In [None]:
newyork_grouped = venues_onehot.groupby('Street').mean().reset_index()
newyork_grouped.head()
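
Averaging the one-hot columns per street gives, for each neighborhood, the fraction of its venues that fall in each category. The sketch below performs the same computation in plain Python on made-up rows, to make that interpretation explicit; `category_shares` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict


def category_shares(rows):
    """rows: iterable of (street, category). Returns {street: {category: share}}."""
    counts = defaultdict(Counter)
    for street, category in rows:
        counts[street][category] += 1
    shares = {}
    for street, counter in counts.items():
        total = sum(counter.values())
        shares[street] = {cat: n / total for cat, n in counter.items()}
    return shares


# Made-up venue rows for illustration.
rows = [
    ("ARVERNE", "Beach"), ("ARVERNE", "Beach"),
    ("ARVERNE", "Pizza Place"), ("ARVERNE", "Bank"),
    ("ASTORIA", "Bar"), ("ASTORIA", "Bagel Shop"),
]
shares = category_shares(rows)
print(shares["ARVERNE"])  # Beach accounts for half of ARVERNE's venues
```

These per-neighborhood shares are the feature vectors that the clustering step later groups.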

In [None]:
newyork_grouped.shape

## Top Five Venues

In [None]:
num_top_venues = 5

for hood in newyork_grouped['Street']:
    print("----"+hood+"----")
    temp = newyork_grouped[newyork_grouped['Street'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
# Define a function to return the most common venues/facilities near each neighborhood

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Street']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))
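
The `try`/`except` above takes the suffix for ranks 1 to 3 from the `indicators` list and falls back to "th" for everything else, which is enough for ten columns. A minimal sketch of an explicit helper that also handles values such as 11, 12 and 21 is shown here; `ordinal` is a hypothetical name.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Street"] + [f"{ordinal(i)} Most Common Venue"
                        for i in range(1, num_top_venues + 1)]
print(columns[:4])
```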

In [None]:
# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Street'] = newyork_grouped['Street']

for ind in np.arange(newyork_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(newyork_grouped.iloc[ind, :], num_top_venues)

In [None]:
venues_sorted.head()


## 4. Modelling

## Determining the Optimal K Value for KMeans algorithm

In [None]:
newyork_grouped_clustering = newyork_grouped.drop(columns='Street')
Sum_of_squared_distances = []
K = range(1,15)
for kclusters in K:
    kmeans = KMeans(n_clusters=kclusters, random_state=2).fit(newyork_grouped_clustering)
    Sum_of_squared_distances.append(kmeans.inertia_)

In [None]:
import matplotlib.pyplot as plt
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

### From the graph above, the elbow is at **k=4**, which we take as the optimal number of clusters for K-Means
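
Reading the elbow off the plot is a judgment call. One common heuristic, shown here as a hedged sketch and not as the method used in this notebook, picks the k whose point lies farthest from the straight line joining the first and last points of the inertia curve. The inertia values below are invented for illustration, and `elbow_k` is a hypothetical helper.

```python
def elbow_k(ks, inertias):
    """Return the k whose (k, inertia) point is farthest from the chord
    joining the first and last points of the curve."""
    x1, y1 = ks[0], inertias[0]
    x2, y2 = ks[-1], inertias[-1]
    # Normalise both axes to [0, 1] so the distance is not dominated by inertia units.
    xs = [(k - x1) / (x2 - x1) for k in ks]
    ys = [(v - y2) / (y1 - y2) for v in inertias]
    # After normalising, the chord runs from (0, 1) to (1, 0), i.e. x + y - 1 = 0,
    # so |x + y - 1| is proportional to the distance from the chord.
    distances = [abs(x + y - 1) for x, y in zip(xs, ys)]
    best = max(range(len(ks)), key=distances.__getitem__)
    return ks[best]


ks = list(range(1, 9))
inertias = [100.0, 60.0, 35.0, 20.0, 17.0, 15.0, 14.0, 13.5]  # invented values
print(elbow_k(ks, inertias))
```

With the notebook's own data, `K` and `Sum_of_squared_distances` could be passed in place of the invented lists; the result is only a suggestion and should still be checked against the plot.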

In [None]:
# set number of clusters
kclusters = 4

newyork_grouped_clustering = newyork_grouped.drop(columns='Street')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(newyork_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

In [None]:
#newyork_merged.drop('Cluster Labels',1)
#venues_sorted.drop('Cluster Labels',1,inplace=True)


In [None]:
# add clustering labels
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

newyork_merged = original_data

# merge the sorted venues (with cluster labels) into the New York data, which holds latitude/longitude for each neighborhood
newyork_merged = newyork_merged.join(venues_sorted.set_index('Street'), on='NEIGHBORHOOD')

newyork_merged.head() # check the last columns!


In [None]:
#newyork_merged.to_csv('A_data.csv')

In [None]:
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(newyork_merged['LATITUDE'], newyork_merged['LONGITUDE'], newyork_merged['NEIGHBORHOOD'], newyork_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Data in Cluster 0 after KMeans

In [None]:
newyork_merged.loc[newyork_merged['Cluster Labels'] == 0, newyork_merged.columns[[1] + list(range(5, newyork_merged.shape[1]))]]

## Data in Cluster 1 after KMeans


In [None]:
newyork_merged.loc[newyork_merged['Cluster Labels'] == 1, newyork_merged.columns[[1] + list(range(5, newyork_merged.shape[1]))]]

## Data in Cluster 2 after KMeans


In [None]:
newyork_merged.loc[newyork_merged['Cluster Labels'] == 2, newyork_merged.columns[[1] + list(range(5, newyork_merged.shape[1]))]]

## Data in Cluster 3 after KMeans


In [None]:
newyork_merged.loc[newyork_merged['Cluster Labels'] == 3, newyork_merged.columns[[1] + list(range(5, newyork_merged.shape[1]))]]

In [None]:
mean_data=newyork_merged.groupby('Cluster Labels')['AVERAGE SALE PRICE'].mean()
mean_data=pd.DataFrame(mean_data)
mean_data
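
The group-by above takes a simple mean of the per-row average prices, so a row with one sale counts as much as a row with over a hundred. As a hedged illustration of how much that choice can matter, the sketch below compares the simple mean with a sales-weighted mean for the three ARVERNE rows shown in the data preview. The weighted variant is an alternative for comparison, not what the notebook computes, and `simple_mean` and `weighted_mean` are hypothetical helpers.

```python
def simple_mean(prices):
    """Unweighted mean, as computed by groupby(...).mean() on the price column."""
    return sum(prices) / len(prices)


def weighted_mean(prices, weights):
    """Mean of prices weighted by the number of sales behind each price."""
    return sum(p * w for p, w in zip(prices, weights)) / sum(weights)


# ARVERNE rows from the data preview: (NUMBER OF SALES, AVERAGE SALE PRICE)
arverne = [(32, 297194.0), (112, 505043.0), (6, 414658.0)]
sales = [n for n, _ in arverne]
prices = [p for _, p in arverne]

simple = simple_mean(prices)
weighted = weighted_mean(prices, sales)
print(round(simple, 2), round(weighted, 2))
```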

In [None]:
ax = mean_data['AVERAGE SALE PRICE'].plot(kind='bar', figsize=(10, 5),color=['#CD5C5C'],fontsize=14, width=0.2)

ax.set_title('Mean Sale Price of Data for all clusters ',fontsize=14)
#ax.spines['right'].set_visible(False)
#ax.spines['top'].set_visible(False)
#ax.spines['left'].set_visible(False)

ax.set_ylabel('AVERAGE SALE PRICE',fontsize=14)

In [None]:

median_data=newyork_merged.groupby('Cluster Labels')['MEDIAN SALE PRICE'].mean()
median_data=pd.DataFrame(median_data)
median_data

In [None]:
ax = median_data['MEDIAN SALE PRICE'].plot(kind='bar', figsize=(10, 5),color=['#CD5C5C'],fontsize=14, width=0.2)

ax.set_title('Median Sale Price of Data for all clusters ',fontsize=14)
#ax.spines['right'].set_visible(False)
#ax.spines['top'].set_visible(False)
#ax.spines['left'].set_visible(False)

ax.set_ylabel('MEDIAN SALE PRICE',fontsize=14)

## Result

First of all, even though the New York housing market may be in a rut, it is still an "ever-green" for business affairs.

Key Observations under the Results:

We examine the clusters according to the New York neighborhoods they contain.


Cluster 0:

1. The average and median prices of Cluster 0 neighborhoods are 403,649.40 and 406,267.95 respectively.
2. The cluster contains the following places:
    CAMBRIA HEIGHTS          
    JAMAICA                  
    LAURELTON                
    ROSEDALE                 
    SOUTH JAMAICA            
    SPRINGFIELD GARDENS      
    ST. ALBANS               
3. The most common venues nearby are food corners, restaurants, banks and parks. The number of sales is low relative to the number of available properties.
4. These properties are the best to buy: the average and median prices are very reasonable, and the basic amenities for daily needs are close by.
5. The area is strong for food and restaurants, but other amenities such as hospitals and schools are less frequent.

Cluster 1:

1. The average and median prices of Cluster 1 neighborhoods are 610,196.03 and 607,131.51 respectively.
2. The cluster contains the following places:
    ASTORIA                  
    BAYSIDE                  
    BRIARWOOD                
    CORONA                   
    DOUGLASTON               
    EAST ELMHURST            
    ELMHURST                 
    FLUSHING-NORTH           
    FLUSHING-SOUTH           
    FOREST HILLS             
    FRESH MEADOWS            
    GLENDALE                 
    HILLCREST                
    JACKSON HEIGHTS          
    KEW GARDENS              
    LITTLE NECK              
    LONG ISLAND CITY         
    MASPETH                  
    MIDDLE VILLAGE           
    OAKLAND GARDENS          
    REGO PARK                
    RICHMOND HILL            
    RIDGEWOOD                
    SUNNYSIDE                
    WOODSIDE                 

3. The average and median prices are higher than in all other clusters. The most common venues nearby are supermarkets, restaurants, bars, parks and bagel shops.

Cluster 2:

1. The average and median prices of Cluster 2 neighborhoods are 474,991.33 and 458,104.60 respectively.
2. The cluster contains the following places:
    ARVERNE                  
    BELLE HARBOR             
    FAR ROCKAWAY             
    HAMMELS                  
    NEPONSIT                 
    ROCKAWAY PARK            
3. The most common venues nearby are beaches, pizza places, banks, bus stops and all kinds of food corners.
4. These should be the second most preferred properties after those in Cluster 0, given their average and median prices.

Cluster 3:

1. The average and median prices of Cluster 3 neighborhoods are 511,496.80 and 458,104.60 respectively.
2. The cluster contains the following places:
    AIRPORT LA GUARDIA       
    BEECHHURST               
    BELLEROSE                
    BROAD CHANNEL            
    COLLEGE POINT            
    FLORAL PARK              
    GLEN OAKS                
    HOLLIS                   
    HOLLIS HILLS             
    HOLLISWOOD               
    HOWARD BEACH             
    JAMAICA BAY              
    JAMAICA ESTATES          
    JAMAICA HILLS            
    OZONE PARK               
    QUEENS VILLAGE           
    SO. JAMAICA-BAISLEY PARK 
    SOUTH OZONE PARK         
    WHITESTONE               
    WOODHAVEN                
3. The most common venues nearby are airport lounges, burger joints, pharmacies, coffee shops, parks, etc.
4. These properties are the second most expensive, after those in Cluster 1.

## Conclusion

To conclude, we restate the problem scenario: recommending suitable real estate in New York to homebuyers using machine learning algorithms.

The business problem we posed was:

How can we help homebuyers choose suitable real estate in New York in this depreciating economy?

To solve this business problem, we clustered New York neighborhoods in order to recommend areas, based on nearby venues and the current average price of real estate, where homebuyers can make a real estate investment. We also looked at profitable venues, i.e. pharmacies, restaurants, hospitals and grocery stores.

First, we gathered data from the Department of Finance (DOF), which maintains records for all property sales in New York City, including sales of family homes in each borough (https://data.cityofnewyork.us/api/views/948r-3ads/rows.csv?accessType=DOWNLOAD).

This list includes all sales of 1-, 2-, and 3-family homes from January 1, 2009 to December 31, 2009 with a sale price of $150,000 or more. The Building Class Category for Sales is based on the Building Class at the time of the sale.

To explore and target recommended locations according to the presence of amenities and essential facilities, we accessed venue data through the Foursquare API and arranged it in a dataframe for visualization. Merging the DOF sale-price data for New York properties with the Foursquare data on the surrounding amenities and essential facilities allowed us to recommend profitable real estate investments.


Finally, we can analyze our results according to the four clusters we have produced. Although all clusters offer a good range of facilities and amenities, they differ in price:

Cluster 3 - Its average price (about 511,497) sits above its median (about 458,105). The common venues are similar across its neighborhoods, and its properties are the second most expensive after Cluster 1.

Cluster 0 and 2 - The average and median prices are lower than in the other clusters.

Cluster 1 - The average and median prices are higher than in the other clusters.