<h1 align=center><font size = 6>Best Airbnb Rentals for NYU Students</font></h1>

<h2 align=center><font size = 5>By Charles Hu</font></h2>

## Introduction

In this project, we explore the best Airbnb options for students attending New York University. We analyze the 'New York City Airbnb Open Data' dataset from Kaggle and use the Foursquare API to find the venues near each Airbnb listing.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Choosing Desirable Airbnb Features</a>

3. <a href="#item3">Foursquare Data Import and Analysis</a>

4. <a href="#item4">Conclusion</a> 
</font>
</div>

Some imports first.

In [1]:
import pandas as pd
import numpy as np

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## 1. Download and Explore Dataset

We will import the full dataset as 'airbnb', which was downloaded from: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv

### First Glance

In [2]:
airbnb = pd.read_csv('/Users/Wayne/Downloads/AB_NYC_2019.csv')

Taking a preliminary look at the dataset:

In [3]:
airbnb.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [4]:
airbnb.shape

(48895, 16)

In [5]:
airbnb.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

### Dropping / Filling Null

We can already see that some features won't be useful in our analysis. 'id', 'host_id', 'last_review', 'reviews_per_month', and 'calculated_host_listings_count' carry little information for our purposes. However, 'id' and 'host_id' may help students look up a listing on Airbnb, so we keep them and drop only 'last_review', 'reviews_per_month', and 'calculated_host_listings_count'.

In [6]:
airbnb.drop(['last_review', 'reviews_per_month', 'calculated_host_listings_count'], axis=1, inplace=True)
airbnb.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,0


### Plotting a Map

Let's plot a map around NYU to get a better understanding of what region it is located in.

In [7]:
# Getting coordinates of NYU
address = 'New York University, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
nyu_latitude = location.latitude
nyu_longitude = location.longitude
print('The geographical coordinates of New York University are {}, {}.'.format(nyu_latitude, nyu_longitude))

The geographical coordinates of New York University are 40.72925325, -73.99625393609625.


In [8]:
# create map around NYU using latitude and longitude values
map_nyu = folium.Map(location=[nyu_latitude, nyu_longitude], zoom_start=12)

# add markers to map

label = folium.Popup('NYU', parse_html=True)
folium.CircleMarker(
    [nyu_latitude, nyu_longitude],
    radius=7,
    popup=label,
    color='black',
    fill=True,
    fill_color='purple',
    fill_opacity=0.7,
    parse_html=False).add_to(map_nyu)  
    
map_nyu

As we can see from the map, NYU is situated in Downtown Manhattan. We will be looking for Airbnbs close to this area, which will mainly be in Manhattan or Brooklyn.

In [9]:
# Checking the number of Airbnbs in each neighbourhood group
airbnb.neighbourhood_group.value_counts().to_frame(name='Number of Airbnbs')

Unnamed: 0,Number of Airbnbs
Manhattan,21661
Brooklyn,20104
Queens,5666
Bronx,1091
Staten Island,373


## 2. Choosing Desirable Airbnb Features (Data Cleaning)

### Limiting distance to NYU

An important factor for students choosing where to live is how close the housing is to the university. Housing near the university saves travel time to and from classes and makes it more convenient to visit friends or participate in clubs. We therefore limit our Airbnb listings to those within 4 km (2.5 mi) of the university.
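
One common way to measure such a distance is the great-circle distance given by the haversine formula. As a sketch, with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ in radians, and $R$ an approximate Earth radius in km:

$$
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right), \qquad
c = 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right), \qquad
d = R\,c
$$

The `getDist` function in this section follows this form with $R = 6373$. For a listing at (40.75362, -73.98377) and NYU at (40.72925, -73.99625), it gives roughly 2.91 km.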

In [10]:
# NYU coords
nyulat=40.72925
nyulon=-73.99625

# Import trig stuff
from math import sin, cos, sqrt, atan2,radians

# Distance function between two coordinates
def getDist(lat1,lon1,lat2,lon2):
  R = 6373.0

  lat1 = radians(lat1)
  lon1 = radians(lon1)
  lat2 = radians(lat2)
  lon2 = radians(lon2)

  dlon = lon2 - lon1
  dlat = lat2 - lat1

  a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
  c = 2 * atan2(sqrt(a), sqrt(1 - a))

  return R * c

# Applying distance function to dataframe
airbnb['dist'] = [getDist(lat, lon, nyulat, nyulon) for lat, lon in zip(airbnb['latitude'], airbnb['longitude'])]

# This will give us Airbnbs within 4 km, or 2.5 mi, of NYU
airbnb = airbnb[airbnb['dist']<4].reset_index(drop=True)
print(airbnb.shape)
airbnb.head()

(14322, 14)


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,availability_365,dist
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,355,2.907561
1,5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,74,129,2.721247
2,5238,Cute & Cozy Lower East Side 1 bdrm,7549,Ben,Manhattan,Chinatown,40.71344,-73.99037,Entire home/apt,150,1,160,188,1.827068
3,5441,Central Manhattan/near Broadway,7989,Kate,Manhattan,Hell's Kitchen,40.76076,-73.98867,Private room,85,2,188,39,3.562585
4,6090,West Village Nest - Superhost,11975,Alina,Manhattan,West Village,40.7353,-74.00525,Entire home/apt,120,90,27,0,1.014045


In [11]:
airbnb.neighbourhood_group.value_counts()

Manhattan    11560
Brooklyn      2723
Queens          39
Name: neighbourhood_group, dtype: int64

### Limiting Airbnb price

Next, we look for sufficiently cheap accommodation.

In [12]:
# Accommodation at NYU for one academic year costs around $12,000.
# There are approx. 240 days in the year when students must live close to NYU.
# Average NYU housing price per day = $12,000 / 240 = $50

Airbnbs usually offer discounts of about 50-60% for long-term stays. To look for Airbnbs significantly cheaper than NYU housing, we restrict the dataset to listings with a nightly price under $50.
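
The arithmetic behind this threshold can be written out as a quick check. The figures are the approximate ones quoted above; the `discounted_price` helper and the example listing price are hypothetical and only illustrate how a long-stay discount would lower the nightly cost.

```python
# Approximate figures quoted in this section
annual_housing_cost = 12000   # USD for one academic year of NYU housing
days_near_campus = 240        # days per year a student needs housing

nyu_daily_cost = annual_housing_cost / days_near_campus
print(nyu_daily_cost)  # 50.0


def discounted_price(nightly_price, discount):
    """Nightly price after a fractional long-stay discount (hypothetical helper)."""
    return nightly_price * (1 - discount)


# Hypothetical listing at 45 USD per night with a 50% long-stay discount
print(discounted_price(45, 0.5))  # 22.5
```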

In [13]:
airbnb = airbnb[airbnb['price'] < 50].reset_index(drop=True)
airbnb.shape

(181, 14)

### Limiting other factors

We now consider other desirable factors in listings. Reviews are an important indicator of how other guests felt about a listing, but since this dataset has no review ratings, students will have to read the reviews themselves. What we can do, however, is keep only Airbnbs with at least one review.

In [14]:
airbnb = airbnb[airbnb.number_of_reviews > 0].reset_index(drop=True)
airbnb.shape

(127, 14)

The minimum-nights requirement of a listing determines whether it suits NYU students, who will generally rent for a semester or less (around 100 days). We therefore keep only Airbnbs whose minimum stay is under 100 nights.

In [15]:
airbnb = airbnb[airbnb.minimum_nights < 100].reset_index(drop=True)
airbnb.shape

(124, 14)

Finally, some listings have been marked as unavailable: their 'availability_365', the number of days in the year on which the listing can be rented, is 0. We eliminate these here:

In [16]:
airbnb = airbnb[airbnb.availability_365 > 0].reset_index(drop=True)
airbnb.shape

(61, 14)

We now have a small subset of 61 Airbnbs, perfect for looking into further!

In [17]:
airbnb.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,availability_365,dist
0,12048,LowerEastSide apt share shortterm 1,7549,Ben,Manhattan,Lower East Side,40.71401,-73.98917,Shared room,40,1,214,188,1.797145
1,392948,Williamsburg near soho .support artist living,312722,Kristian & Di,Brooklyn,Williamsburg,40.70994,-73.96573,Private room,45,15,15,242,3.351582
2,1399273,NYC - furnished room - Greenpoint,7130382,Walter,Brooklyn,Greenpoint,40.72991,-73.95208,Private room,45,4,7,88,3.7238
3,1578721,"Huge, Sunny Greenpoint Flat",999689,Eric,Brooklyn,Greenpoint,40.72887,-73.95163,Private room,35,30,5,97,3.761273
4,1801036,Private room with sleeping loft,8200820,Valerie,Brooklyn,Williamsburg,40.7187,-73.95133,Private room,45,2,14,36,3.964275


### Plotting map of desirable Airbnbs

Now that we have created a small subset of Airbnbs with characteristics desirable to NYU students, let's plot a map to see where they are.

In [18]:
# create map of airbnbs around nyu using latitude and longitude values
finalmap_airbnb = folium.Map(location=[nyu_latitude, nyu_longitude], zoom_start=13)

# add markers to map
for lat, lng, label in zip(airbnb['latitude'], airbnb['longitude'], airbnb['name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(finalmap_airbnb)
    
# add NYU marker to map
label = folium.Popup('NYU', parse_html=True)
folium.CircleMarker(
    [nyu_latitude, nyu_longitude],
    radius=7,
    popup=label,
    color='black',
    fill=True,
    fill_color='purple',
    fill_opacity=1,
    parse_html=False).add_to(finalmap_airbnb)  
    
finalmap_airbnb

## 3. Foursquare Data Import and Analysis

So far we have chosen desirable listings based on their Airbnb information, but what about the types of venues near them? This is where the Foursquare API comes in: we use it to extract these venues.

### Getting Nearby Venues 

In [19]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


After entering our Foursquare Client ID and Client Secret, we can define a function that sends a GET request for nearby venues for each listing in turn. We limit the search radius to 500 m around the listing and the number of venues to 20 in order to reduce computation time.
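
As a rough sketch of how the request URL for a single listing can be assembled, the snippet below builds the query string with the standard library's `urlencode` rather than string formatting. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration only; the endpoint and parameter names are the ones used in this notebook's request. No network call is made.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=20):
    """Return the explore URL for one coordinate pair (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude of the listing
        'radius': radius,                # search radius in metres
        'limit': limit,                  # maximum number of venues returned
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# Placeholder credentials and the coordinates of one listing from the dataset
url = build_explore_url('ID', 'SECRET', '20180605', 40.71401, -73.98917)
print(url)
```

Note that `urlencode` percent-encodes the comma in `ll` as `%2C`.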

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            20)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['name', 
                  'latitude', 
                  'longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

We execute the function:

In [21]:
airbnb_venues = getNearbyVenues(names=airbnb['name'],
                                   latitudes=airbnb['latitude'],
                                   longitudes=airbnb['longitude']
                                  )

LowerEastSide apt share shortterm 1
Williamsburg near soho .support artist living
NYC - furnished room - Greenpoint 
Huge, Sunny Greenpoint Flat
Private room with sleeping loft
Coffee,Tea&Milk Astor Place Lodging
LowerEastSide apt share shortterm 3
Williamsburg private clean cozy RM
Lovely Large Room with Private Bathroom
Shared male Room on Manhattan. Amazing view! II
Comfy room in a loft in  Greenpoint/Williamsburg!
Shared male room on Manhattan with crazy view! I
Private Room in Greenpoint, Brooklyn
The best deal, really close to Times Square
Williamsburg Couch on a Budget
RG - Budget Friendly room in the Greenpoint area!
Ladies Only: Spacious Shared Apt
Amazing Loft in the heart of Williamsburg
Cozy East Village Room in huge apartment
Lovely & bright Brooklyn room close to train
Location Location - Cozy Room In Time Square
Private Room- Greenpoint, Willamsburg local
Manhattan Phenomenal Deal! BEST Chelsea Room
Single Bed A in Sharing Room near Grand Central
Single Bed B in Sharing 

We save the result as a pickle file in case we need to use it again.

In [22]:
airbnb_venues.to_pickle('airbnb_venues')

In [23]:
print(airbnb_venues.shape)
airbnb_venues.head()

(1183, 7)


Unnamed: 0,name,latitude,longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,LowerEastSide apt share shortterm 1,40.71401,-73.98917,Metrograph,40.714999,-73.991035,Indie Movie Theater
1,LowerEastSide apt share shortterm 1,40.71401,-73.98917,Hawa Smoothies,40.7142,-73.98939,Juice Bar
2,LowerEastSide apt share shortterm 1,40.71401,-73.98917,Little Canal,40.714317,-73.990361,Coffee Shop
3,LowerEastSide apt share shortterm 1,40.71401,-73.98917,Ice & Vice,40.714375,-73.986956,Ice Cream Shop
4,LowerEastSide apt share shortterm 1,40.71401,-73.98917,Forgtmenot,40.714459,-73.991546,New American Restaurant


### Manipulating Venue Data

In order to derive insights from the data on nearby venues, we will rank the 5 most common venue categories for each listing. We start by one-hot encoding the venue categories:

In [24]:
# one hot encoding
airbnb_onehot = pd.get_dummies(airbnb_venues[['Venue Category']], prefix="", prefix_sep="")

# add name column back to dataframe
airbnb_onehot['name'] = airbnb_venues['name'] 

# move name column to the first column
fixed_columns = [airbnb_onehot.columns[-1]] + list(airbnb_onehot.columns[:-1])
airbnb_onehot = airbnb_onehot[fixed_columns]

print(airbnb_onehot.shape)
airbnb_onehot.head()

(1183, 173)


Unnamed: 0,name,Accessories Store,American Restaurant,Arepa Restaurant,Art Gallery,Art Museum,Asian Restaurant,Australian Restaurant,BBQ Joint,Bagel Shop,...,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Yoga Studio
0,LowerEastSide apt share shortterm 1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,LowerEastSide apt share shortterm 1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,LowerEastSide apt share shortterm 1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,LowerEastSide apt share shortterm 1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,LowerEastSide apt share shortterm 1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, we group by the listing name:

In [25]:
airbnb_grouped = airbnb_onehot.groupby('name').mean().reset_index()
print(airbnb_grouped.shape)
airbnb_grouped.head()

(61, 173)


Unnamed: 0,name,Accessories Store,American Restaurant,Arepa Restaurant,Art Gallery,Art Museum,Asian Restaurant,Australian Restaurant,BBQ Joint,Bagel Shop,...,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Yoga Studio
0,Amazing Loft in the heart of Williamsburg,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05
1,Amazing cozy and warm male room on Manhattan IV,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Angela's Sweet Suite (Shared-Females Only),0.0,0.05,0.0,0.0,0.0,0.0,0.05,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.05,0.0
3,BEST DEAL in NYC - Cute and Cozy Room Times Sq...,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,BEST LOCATION - Lovely room by Times Square!!!,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0
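
The group-and-average step can be read as computing, for each listing, the share of its nearby venues that fall in each category. A value of 0.05 therefore corresponds to 1 venue out of 20. The following is a minimal pure-Python sketch of that idea on made-up data: the listing names and categories are hypothetical, and it uses `collections.Counter` instead of the pandas calls used in this notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (listing, venue category) pairs standing in for airbnb_venues
venue_rows = [
    ('Listing A', 'Coffee Shop'),
    ('Listing A', 'Coffee Shop'),
    ('Listing A', 'Bar'),
    ('Listing A', 'Theater'),
    ('Listing B', 'Theater'),
    ('Listing B', 'Bar'),
]

# Count how often each category appears near each listing
counts = defaultdict(Counter)
for listing, category in venue_rows:
    counts[listing][category] += 1

# Convert counts to shares, which is what the mean of one-hot rows yields
shares = {
    listing: {cat: n / sum(c.values()) for cat, n in c.items()}
    for listing, c in counts.items()
}

print(shares['Listing A'])
print(counts['Listing A'].most_common(1))  # most frequent category for Listing A
```

Sorting each listing's shares in descending order then gives its most common venue categories.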


Let's define a function that returns the 5 most common venue categories for each listing, and then store them in the dataframe airbnb_venues_sorted.

In [26]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [27]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
airbnb_venues_sorted = pd.DataFrame(columns=columns)
airbnb_venues_sorted['name'] = airbnb_grouped['name']

for ind in np.arange(airbnb_grouped.shape[0]):
    airbnb_venues_sorted.iloc[ind, 1:] = return_most_common_venues(airbnb_grouped.iloc[ind, :], num_top_venues)

airbnb_venues_sorted.head()

Unnamed: 0,name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Amazing Loft in the heart of Williamsburg,Café,Yoga Studio,Bike Shop,Cycle Studio,Diner
1,Amazing cozy and warm male room on Manhattan IV,Chinese Restaurant,Coffee Shop,Mexican Restaurant,Cocktail Bar,Tea Room
2,Angela's Sweet Suite (Shared-Females Only),Bar,Indian Restaurant,Ice Cream Shop,Grocery Store,Coffee Shop
3,BEST DEAL in NYC - Cute and Cozy Room Times Sq...,Theater,Burger Joint,Japanese Restaurant,Sandwich Place,Sushi Restaurant
4,BEST LOCATION - Lovely room by Times Square!!!,Sandwich Place,Theater,Burger Joint,Indie Theater,Japanese Restaurant


### Clustering similar Airbnbs

To gain further insight into the venues near each listing, we can group listings with similar nearby venues. We use k-means clustering with 3 clusters to do this.

In [28]:
# set number of clusters
kclusters = 3

airbnb_grouped_clustering = airbnb_grouped.drop(columns='name')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(airbnb_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 0, 1, 2, 2, 1, 1, 1, 1, 1], dtype=int32)

In [29]:
# add clustering labels
airbnb_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

airbnb_merged = airbnb

# join airbnb_merged with airbnb_venues_sorted to attach the cluster label and top venues to each listing
airbnb_merged = airbnb_merged.join(airbnb_venues_sorted.set_index('name'), on='name')

airbnb_merged.head() 

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,availability_365,dist,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,12048,LowerEastSide apt share shortterm 1,7549,Ben,Manhattan,Lower East Side,40.71401,-73.98917,Shared room,40,1,214,188,1.797145,0,Coffee Shop,Mediterranean Restaurant,Dumpling Restaurant,Spanish Restaurant,Mexican Restaurant
1,392948,Williamsburg near soho .support artist living,312722,Kristian & Di,Brooklyn,Williamsburg,40.70994,-73.96573,Private room,45,15,15,242,3.351582,1,Pizza Place,Cocktail Bar,Café,Yoga Studio,Bar
2,1399273,NYC - furnished room - Greenpoint,7130382,Walter,Brooklyn,Greenpoint,40.72991,-73.95208,Private room,45,4,7,88,3.7238,1,Coffee Shop,Mexican Restaurant,Café,Yoga Studio,Szechuan Restaurant
3,1578721,"Huge, Sunny Greenpoint Flat",999689,Eric,Brooklyn,Greenpoint,40.72887,-73.95163,Private room,35,30,5,97,3.761273,1,Coffee Shop,Mexican Restaurant,Polish Restaurant,Bagel Shop,Laundry Service
4,1801036,Private room with sleeping loft,8200820,Valerie,Brooklyn,Williamsburg,40.7187,-73.95133,Private room,45,2,14,36,3.964275,1,Bar,Yoga Studio,Pet Store,Peruvian Restaurant,Dessert Shop


Let's plot a map to see where each cluster is located.

In [30]:
# create map
map_clusters = folium.Map(location=[nyu_latitude, nyu_longitude], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(airbnb_merged['latitude'], airbnb_merged['longitude'], airbnb_merged['name'], airbnb_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
    
# add NYU marker to map
label = folium.Popup('NYU', parse_html=True)
folium.CircleMarker(
    [nyu_latitude, nyu_longitude],
    radius=7,
    popup=label,
    color='black',
    fill=True,
    fill_color='purple',
    fill_opacity=1,
    parse_html=False).add_to(map_clusters)  
       
map_clusters

### Examining Clusters

Let's try to find some common features within each cluster.

#### Cluster 1 (red markers)

A quick look at the cluster:

In [31]:
Cluster_1 = airbnb_merged.loc[airbnb_merged['Cluster Labels'] == 0, airbnb_merged.columns[[1] + list(range(5, airbnb_merged.shape[1]))]].reset_index(drop=True)
print(Cluster_1.shape)
Cluster_1.head()

(11, 16)


Unnamed: 0,name,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,availability_365,dist,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,LowerEastSide apt share shortterm 1,Lower East Side,40.71401,-73.98917,Shared room,40,1,214,188,1.797145,0,Coffee Shop,Mediterranean Restaurant,Dumpling Restaurant,Spanish Restaurant,Mexican Restaurant
1,Shared male Room on Manhattan. Amazing view! II,Lower East Side,40.70993,-73.987,Shared room,35,14,6,201,2.286069,0,Ice Cream Shop,Vegetarian / Vegan Restaurant,French Restaurant,Mediterranean Restaurant,Café
2,Shared male room on Manhattan with crazy view! I,Lower East Side,40.70985,-73.98724,Shared room,35,14,2,320,2.287639,0,Cocktail Bar,Vegetarian / Vegan Restaurant,French Restaurant,Mediterranean Restaurant,Café
3,Fresh and cozy male room on Manhattan III,Lower East Side,40.7111,-73.98865,Shared room,32,14,4,341,2.118048,0,Chinese Restaurant,Mexican Restaurant,Malay Restaurant,Tea Room,Cocktail Bar
4,Amazing cozy and warm male room on Manhattan IV,Lower East Side,40.71172,-73.98665,Shared room,35,14,3,322,2.111138,0,Chinese Restaurant,Coffee Shop,Mexican Restaurant,Cocktail Bar,Tea Room


In [32]:
Cluster_1['1st Most Common Venue'].value_counts()

Chinese Restaurant    6
Cocktail Bar          3
Coffee Shop           1
Ice Cream Shop        1
Name: 1st Most Common Venue, dtype: int64

We can see that the most common venues are Chinese restaurants. As shown on the map, this cluster is indeed located around New York City's Chinatown.

In [33]:
Cluster_1['price'].mean()

34.63636363636363

With a mean price of about 34.64 USD, Cluster 1 is cheaper than the other two clusters (about 39.80 USD and 39.75 USD).

Let's rename Cluster 1 as 'Cheaper_Chinatown_Region' to reflect these two insights.

In [34]:
Cheaper_Chinatown_Region = Cluster_1

#### Cluster 2 (purple markers)

In [35]:
Cluster_2 = airbnb_merged.loc[airbnb_merged['Cluster Labels'] == 1, airbnb_merged.columns[[1] + list(range(5, airbnb_merged.shape[1]))]].reset_index(drop=True)
print(Cluster_2.shape)
Cluster_2.head()

(46, 16)


Unnamed: 0,name,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,availability_365,dist,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Williamsburg near soho .support artist living,Williamsburg,40.70994,-73.96573,Private room,45,15,15,242,3.351582,1,Pizza Place,Cocktail Bar,Café,Yoga Studio,Bar
1,NYC - furnished room - Greenpoint,Greenpoint,40.72991,-73.95208,Private room,45,4,7,88,3.7238,1,Coffee Shop,Mexican Restaurant,Café,Yoga Studio,Szechuan Restaurant
2,"Huge, Sunny Greenpoint Flat",Greenpoint,40.72887,-73.95163,Private room,35,30,5,97,3.761273,1,Coffee Shop,Mexican Restaurant,Polish Restaurant,Bagel Shop,Laundry Service
3,Private room with sleeping loft,Williamsburg,40.7187,-73.95133,Private room,45,2,14,36,3.964275,1,Bar,Yoga Studio,Pet Store,Peruvian Restaurant,Dessert Shop
4,"Coffee,Tea&Milk Astor Place Lodging",East Village,40.72672,-73.98851,Shared room,30,30,125,304,0.710522,1,Coffee Shop,Pizza Place,Vegetarian / Vegan Restaurant,Japanese Restaurant,New American Restaurant


Cluster 2 is the largest cluster by far, with 46 listings.

In [36]:
Cluster_2['1st Most Common Venue'].value_counts()

Coffee Shop                 10
Bakery                       4
Indian Restaurant            4
Grocery Store                3
Pizza Place                  3
Bar                          3
Polish Restaurant            2
Cocktail Bar                 2
Gym / Fitness Center         2
Café                         2
Wine Bar                     1
Yoga Studio                  1
American Restaurant          1
Italian Restaurant           1
Thrift / Vintage Store       1
Clothing Store               1
Pub                          1
Mediterranean Restaurant     1
Thai Restaurant              1
Japanese Restaurant          1
Music Venue                  1
Name: 1st Most Common Venue, dtype: int64

In [37]:
Cluster_2['price'].mean()

39.80434782608695

As there are not many defining factors for Cluster 2 - prices are about average, and locations are spread far apart - we rename it 'General_Region'.

In [38]:
General_Region = Cluster_2

#### Cluster 3 (turquoise markers)

In [39]:
Cluster_3 = airbnb_merged.loc[airbnb_merged['Cluster Labels'] == 2, airbnb_merged.columns[[1] + list(range(5, airbnb_merged.shape[1]))]].reset_index(drop=True)
print(Cluster_3.shape)
Cluster_3.head()

(4, 16)


Unnamed: 0,name,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,availability_365,dist,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"The best deal, really close to Times Square",Theater District,40.75863,-73.98812,Private room,40,1,163,102,3.338979,2,Theater,Burger Joint,Office,Taco Place,Resort
1,BEST DEAL in NYC - Cute and Cozy Room Times Sq...,Hell's Kitchen,40.75854,-73.98988,Private room,35,3,101,68,3.301851,2,Theater,Burger Joint,Japanese Restaurant,Sandwich Place,Sushi Restaurant
2,BEST LOCATION - Lovely room by Times Square!!!,Hell's Kitchen,40.75859,-73.99195,Private room,39,3,84,17,3.28354,2,Sandwich Place,Theater,Burger Joint,Indie Theater,Japanese Restaurant
3,KING Room w Private Entrance - Times Square,Hell's Kitchen,40.75759,-73.99006,Private room,45,3,72,86,3.195124,2,Theater,Hotel,Jazz Club,Cocktail Bar,Exhibit


In [40]:
Cluster_3['1st Most Common Venue'].value_counts()

Theater           3
Sandwich Place    1
Name: 1st Most Common Venue, dtype: int64

In [41]:
Cluster_3['price'].mean()

39.75

We see that Cluster 3 is the smallest cluster, and that its most common venue is usually a theater. We therefore rename it 'Theatre_Region'.

In [42]:
Theatre_Region = Cluster_3

## 4. Conclusion

In this project, we picked out 61 listings based on features desirable to students attending NYU, such as price, distance to the university, and availability. We then grouped these 61 listings into 3 clusters according to the types of nearby venues, as starting points for further exploration by stakeholders. Future research could improve on this by using more recent data and by using the elbow method to choose the optimal number of clusters.
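
To illustrate the elbow method mentioned above, here is a minimal sketch on made-up 2-D points. It uses a plain-Python k-means with a deterministic start, not the scikit-learn `KMeans` used in this notebook, and the data and helper names are hypothetical. The idea is to compute the within-cluster sum of squares for several values of k and look for the point where the decrease levels off.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans_inertia(points, k, n_iter=20):
    """Run plain Lloyd's k-means and return the within-cluster sum of squares."""
    # Deterministic start for illustration: k evenly spaced input points
    step = len(points) // k
    centers = [points[i * step] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty
        centers = [mean_point(c) if c else centers[j] for j, c in enumerate(clusters)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Made-up 2-D data with three well-separated groups
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 6)}
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

On this toy data the decrease flattens after k = 3, which is the "elbow". With the notebook's data, a similar curve could be drawn from the `inertia_` attribute of fitted `KMeans` models.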