# Manhattan Rental Apartments Clustering

### Xian Lai
Apr.2017

### Abstract:
A city functions like a gigantic, sophisticated network. Within it, buildings and blocks are connected by visible transportation systems and invisible functional dependencies. 

On the other hand, differences in location and function also divide the city into many sub-areas. The boundaries of these sub-areas depend on the purpose: for political administration we have boroughs, community districts and neighbourhoods, and for postal service we have zip codes. 

In this project, I use a data set of online rental apartment listings and the New York building footprint data set to explore possible geographic boundaries or patterns in the apartment rental market.

Finding boundaries is equivalent to finding groups, so applying an unsupervised clustering technique to find the best grouping of buildings with respect to their location and rental market popularity helps us understand the existing rental market data and gain insight into its geographic form.

<img src="images/title_image.jpg" width="1200">

In [1]:
import pandas as pd
import numpy as np
import random
import copy

In [2]:
import clustering as c
pd.options.mode.chained_assignment = None

In [3]:
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.layouts import row
from bokeh.palettes import Spectral11
output_notebook()

# 0. Data sets & Preprocessing

In this project, I have 2 data sets. 

The first is the market data set, consisting of about 46,000 online rental apartment listings. Each listing contains the apartment’s geographic location, its popularity (defined by the number of visits to the listing webpage) and other descriptive features such as facilities, number of bedrooms and bathrooms, rental price, etc.

In [4]:
df_market = pd.read_pickle('market_set_cleaned.csv')
df_market.head()

Unnamed: 0,bath,bed,y,x,center_pt,price_bin,broker,elevator,fitness_center,no_fee,popularity
10,1.5,3,356555.874048,-9608865.0,"(-9608865.029217346, 356555.87404817063)",1,0,0,0,0,2
10000,1.0,2,365557.19018,-9616935.0,"(-9616934.736082882, 365557.1901797713)",3,0,1,1,0,1
100004,1.0,1,356477.16348,-9616369.0,"(-9616368.679438747, 356477.1634797137)",1,0,0,0,0,3
100007,1.0,1,360258.281928,-9614124.0,"(-9614124.415577795, 360258.28192826593)",2,0,0,0,1,1
100013,1.0,4,370278.122028,-9617337.0,"(-9617336.91450809, 370278.1220278065)",2,1,0,0,0,1


Because the market data set covers only some of the buildings in Manhattan, we can’t perform clustering on these data points directly. So I brought in a second data set, the building data set, which simply consists of every building in Manhattan and its location. 

Assuming the popularity of rental apartments is geographically continuous, namely that the popularity of one building is similar to that of the surrounding buildings, I can interpolate the popularity of every building from the rental market data set. I then performed clustering on the building data set instead.

In [5]:
df_building = pd.read_pickle('building_set_cleaned.csv')
df_building.head()

Unnamed: 0,DOITT_ID,footprint,center_pt,x,y
31950,313119,"[[-73.9834, -73.9835, -73.9835, -73.9836, -73....","(-9613469.10811105, 355447.84763150103)",-9613469.0,355447.847632
31986,554919,"[[-73.9882, -73.9883, -73.9883, -73.9883, -73....","(-9617184.379490266, 361048.91133206384)",-9617184.0,361048.911332
31987,665185,"[[-73.9797, -73.9798, -73.9799, -73.9799, -73....","(-9613400.620183097, 356174.110258732)",-9613401.0,356174.110259
31988,79482,"[[-73.9937, -73.9937, -73.9938, -73.9938, -73....","(-9614161.458572928, 354308.31446485035)",-9614161.0,354308.314465
31989,301212,"[[-73.9874, -73.9874, -73.9875, -73.9875, -73....","(-9614705.761096902, 356760.94259855757)",-9614706.0,356760.942599


# 1. Building "Popularity Level" Interpolation

Based on the assumption that the popularity of a building is related to the popularity of its surrounding buildings, I chose inverse distance weighting (IDW) as my interpolation method.

To interpolate the popularity value of one building $y_k$ in the building data set, I select the rental apartments in the market data set within a 0.7-mile distance range of $y_k$, denoted $\{x_1, x_2, ..., x_I\}$. I then take the inverse-distance-weighted average of the popularities of $\{x_1, x_2, ..., x_I\}$ and assign it as the popularity of $y_k$.

Let $u(x_i)$ be the popularity of $x_i$. Then the formula is:

If $d(y_k,x_i) \neq 0$ for all $x_i$:
$$u(y_k)=\frac{\sum_{i=1}^{I} w(x_i) \, u(x_i)}{\sum_{i=1}^{I} w(x_i)}$$
where 
$$w(x_i)=\frac{1}{d(y_k,x_i)^p}$$
If $d(y_k,x_i) = 0$ for some $x_i$:
$$u(y_k)=u(x_i)$$

The parameter $p$ controls how strongly distance affects the weighting. 
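
As a small worked example of the formula above, the sketch below computes the weighted average for two hypothetical reference points: popularity 1 at distance 1 and popularity 3 at distance 2. The helper name `idw_average` and the numbers are made up for illustration and assume all distances are non-zero. With $p=1$ the weights are $1$ and $1/2$, so the result is $(1 \cdot 1 + 0.5 \cdot 3)/1.5 = 5/3$.

```python
def idw_average(dists, values, p):
    """Inverse-distance-weighted average of `values`.

    Illustrative helper only; assumes every distance is strictly positive.
    """
    weights = [1.0 / d ** p for d in dists]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical reference points: popularity 1 at distance 1, popularity 3 at distance 2.
toy_dists, toy_pops = [1.0, 2.0], [1.0, 3.0]

for p in (0.1, 1, 3):
    # A larger p pulls the result toward the popularity of the closer point.
    print(p, round(idw_average(toy_dists, toy_pops, p), 3))
```

With $p=0$ every weight equals 1 and the result is the plain mean of the popularities.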

In [6]:
def IDW(dists, popularities, power):
    """ input: series of distances between the interpolated point and the
    reference points, series of popularity levels of the reference points, and
    the power p controlling how fast the weights decay with distance.
    output: inverse-distance-weighted average of the popularities. If any
    distance is 0, return the mean popularity of the 0-distance reference points.
    """
    zero = dists[dists == 0]
    if not zero.empty:
        return popularities.loc[zero.index].mean()
    
    else:
        weights = 1 / (dists**power)
        weights = weights / weights.sum()
        return sum(popularities * weights)


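def interpolation(mapped_point, mapping_points, closestK, power):
    """ input: building center point, nearby market points (DataFrame), whether
    to keep only the 5 closest points, and the IDW power p.
    output: inverse-distance-weighted popularity of the building.
    """
    # Manhattan (L1) distance between a market point and the building
    dist = lambda x: abs(x[0] - mapped_point[0]) + abs(x[1] - mapped_point[1])
    mapping_points = mapping_points.copy()  # avoid mutating the caller's frame
    mapping_points['dist'] = mapping_points['center_pt'].apply(dist)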
        
    mapping_points = mapping_points.sort_values('dist')
    
    if closestK:
        mapping_points = mapping_points.iloc[:5]
    
    popularities = mapping_points['popularity']
        
    result = IDW(mapping_points['dist'], popularities, power=power)
    
    return result

## How does the value of p affect interpolation?

Here I test different values of p using a toy example:

Assume the mapped data point y is located at the origin, with 6 mapping points nearby: 2 close points and 4 farther points.

If we assign 3 different popularity values {1, 2, 3} to the 2 close points and vary the value of p, we get the following plot. When p is small, the closest points carry little extra weight in the final popularity value of y; as p increases, the interpolated popularity value gets closer and closer to the popularity of the 2 close points.

In [7]:
def dists(pop,mapped_point):
    mapping_points = {'center_pt':[[0.1,0.1],[-0.1,0.1],[0.3,0.4],[-0.4,0.2],[-0.5,0.7],[0.6,-0.8]],'popularity':[pop,pop,1,2,1,3]}
    mapping_points = pd.DataFrame(mapping_points)

    popularity = []
    for p in np.arange(0.1, 5, 0.2):
        popularity.append(interpolation(mapped_point, mapping_points, closestK=False, power=p))
        
    return popularity

In [8]:
x = [0,0] # the interpolation point
popularities = []
for pop in range(1,4):
    popularities.append(dists(pop, x))
    
xs = list(np.arange(0.1, 5, 0.2))

In [9]:
p1 = figure(plot_width=500, plot_height=300, title='Effect of different values of p',\
           x_axis_label='values of p',y_axis_label='popularity of y')
p1.line(xs, popularities[0],color="firebrick", legend='close pt = 1', line_width=4)
p1.line(xs, popularities[1],color="green", legend='close pt = 2', line_width=4)
p1.line(xs, popularities[2],color="navy", legend='close pt = 3', line_width=4)


p2 = figure(plot_width=300, plot_height=300, title='Example')
p2.circle(x[0], x[1], size=10, color='navy')
p2.x([0.1,-0.1], [0.1,0.1], size=15, color='navy',legend='close pts')
p2.x([0.3,-0.4,-0.5,0.6], [0.4,0.2,0.7,-0.8], size=10, color='grey',legend='further pts')
p2.legend.location = "bottom_left"
show(row(p2,p1))

## Interpolation
A larger value of p lets closer apartments weigh more in the interpolation. 
To see which value works best for the New York rental market, I ran the interpolation with several p values: 0.1, 0.5, 1 and 3.

In [10]:
def findMappingPoints(mapped_point, df):
    """ Input: building center point represented as a list [x, y], in the same
    projected coordinates as df['x'] and df['y'].
    Output: data points in df within a +/-600-unit box around the input point.
    """
    mask_x = (df['x'] >= mapped_point[0]-600) & (df['x'] <= mapped_point[0]+600)
    mask_y = (df['y'] >= mapped_point[1]-600) & (df['y'] <= mapped_point[1]+600)
    
    return df[mask_x & mask_y]

def buildingsInterpolation(mapped_point, power, df_market=df_market):
    """ input: the building center point represented as a list with lng and lat
    output: the popularity of this building
    """
    mapping_points = findMappingPoints(mapped_point, df_market)
    
    if len(mapping_points) == 0:  # if no close points exist, set popularity to 0
        result = 0
        
    elif len(mapping_points) <= 5:  # if there are 5 or fewer, consider them all
        result = interpolation(mapped_point, mapping_points, False, power)
        
    else:  # if more than 5, consider closest 5
        result = interpolation(mapped_point, mapping_points, True, power)
        
    return result

In [11]:
df_building['IDW_0.1'] = df_building['center_pt'].apply(buildingsInterpolation, power=0.1)
df_building['IDW_0.5'] = df_building['center_pt'].apply(buildingsInterpolation, power=0.5)
df_building['IDW_1'] = df_building['center_pt'].apply(buildingsInterpolation, power=1)
df_building['IDW_3'] = df_building['center_pt'].apply(buildingsInterpolation, power=3)

In [12]:
df_building.head()

Unnamed: 0,DOITT_ID,footprint,center_pt,x,y,IDW_0.1,IDW_0.5,IDW_1,IDW_3
31950,313119,"[[-73.9834, -73.9835, -73.9835, -73.9836, -73....","(-9613469.10811105, 355447.84763150103)",-9613469.0,355447.847632,1.2,1.2,1.2,1.2
31986,554919,"[[-73.9882, -73.9883, -73.9883, -73.9883, -73....","(-9617184.379490266, 361048.91133206384)",-9617184.0,361048.911332,1.0,1.0,1.0,1.0
31987,665185,"[[-73.9797, -73.9798, -73.9799, -73.9799, -73....","(-9613400.620183097, 356174.110258732)",-9613401.0,356174.110259,1.387856,1.336792,1.271134,1.079
31988,79482,"[[-73.9937, -73.9937, -73.9938, -73.9938, -73....","(-9614161.458572928, 354308.31446485035)",-9614161.0,354308.314465,1.812472,1.875882,1.950081,1.999751
31989,301212,"[[-73.9874, -73.9874, -73.9875, -73.9875, -73....","(-9614705.761096902, 356760.94259855757)",-9614706.0,356760.942599,1.19521,1.17564,1.15089,1.066971


# 2. Clustering Models Selection

With a popularity value assigned to every building, I performed clustering on their longitude, latitude and popularity. 

Clustering a large number of data points is computationally expensive, so for model selection I use 10,000 data points sampled from the original building data set.
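
Section 4 mentions that the variables are normalized before clustering, but the notebook does not show how (presumably this happens inside the `clustering` module). The sketch below shows one common option, min-max scaling, purely as an assumption. The helper `min_max_normalize` and the toy values are hypothetical. Without some rescaling, coordinates in the hundreds of thousands would dominate a popularity value between 1 and 3 in any distance computation.

```python
import pandas as pd


def min_max_normalize(df):
    """Rescale every column of `df` linearly to the range [0, 1].

    Assumed normalization scheme; the notebook does not state which one it uses.
    """
    return (df - df.min()) / (df.max() - df.min())


# Hypothetical rows in the same units as the building table.
toy = pd.DataFrame({
    'x': [-9613469.0, -9617184.0, -9613401.0],
    'y': [355447.8, 361048.9, 356174.1],
    'IDW_0.1': [1.2, 1.0, 1.39],
})
toy_norm = min_max_normalize(toy)
print(toy_norm)
```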

In [13]:
df_building = df_building.sample(frac=1)
df_clustering = df_building.head(n=10000)

### Clustering Method:
In this project I chose hierarchical clustering. One reason is that I don’t know how many clusters I need before experimenting. More importantly, different numbers of clusters give different points of view, and they all make sense. A higher cut on the dendrogram, with fewer clusters, gives a more general picture of the relation between different areas. A lower cut, with more clusters, gives detailed insight into relations within neighbourhoods. 

### Distance Metric:
For the distance metric between data points, I have 2 choices: Manhattan distance and Euclidean distance. Without knowing which is better, I tried both.
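
To make the difference between the two metrics concrete, here is a minimal sketch on two hypothetical 3-dimensional points (two coordinates plus a popularity value). The function names are illustrative only.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences (the 'cityblock' metric)."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Hypothetical points that differ by 3 and 4 in the first two coordinates.
a, b = (0.0, 0.0, 1.0), (3.0, 4.0, 1.0)
print(manhattan(a, b))  # 7.0
print(euclidean(a, b))  # 5.0
```

The Manhattan distance is never smaller than the Euclidean distance, and the two coincide only when the points differ in at most one coordinate.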

### Linkage
For the distance between clusters, I experimented with complete linkage, average linkage, weighted linkage, centroid linkage and Ward linkage. (Single linkage merges most of the buildings into one cluster and leaves the rest as singleton clusters, so I don’t use it in this project.)
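
The `clustering` module used in this notebook is not shown. Its method and metric names match those of `scipy.cluster.hierarchy.linkage`, so the sketch below assumes a SciPy-based approach and should be read as an illustration, not as the module's actual code. The toy data (two well-separated blobs) is made up. In SciPy, the centroid and Ward methods are only correctly defined for the Euclidean metric, which fits the fact that only Euclidean variants of those two linkages are tested here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two tight blobs of 20 points each in (x, y, popularity) space.
rng = np.random.default_rng(0)
pts = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 3)),
    rng.normal(5.0, 0.1, size=(20, 3)),
])

# Build the dendrogram once, then cut it at the desired number of clusters.
Z_ward = linkage(pts, method='ward', metric='euclidean')
labels_ward = fcluster(Z_ward, t=2, criterion='maxclust')

# Other linkages accept other metrics, e.g. average linkage with Manhattan distance.
Z_avg = linkage(pts, method='average', metric='cityblock')
labels_avg = fcluster(Z_avg, t=2, criterion='maxclust')

print(sorted(set(labels_ward)), sorted(set(labels_avg)))
```

Because the linkage matrix encodes the whole dendrogram, the same `Z_ward` can be cut again at a different number of clusters without recomputing the merges.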

## Evaluation:
So far, the rental market clustering model has 3 parameters: the power parameter p of IDW, the distance metric and the linkage. Their combinations give 32 clustering models in total (4 values of p times 8 linkage and metric pairs). 

To evaluate these 32 models, I set up 6 statistical criteria that act as 6 scoring systems:
1. The size of 15% largest clusters.
2. The number of singleton clusters.

3. The size of 85% largest clusters.
4. The area of 85% largest clusters.

5. The within cluster popularity variance.
6. The between cluster popularity variance*.

The first 4 criteria evaluate whether a model yields balanced clustering with respect to both size and area. For this application, we shouldn’t have too many small clusters or too many large clusters.

The last 2 criteria evaluate whether a model puts nearby buildings with similar popularity in the same cluster and those with different popularity into different clusters.

* This variance is calculated by first computing the variance of popularity across each cluster and its nearest 5 clusters, and then taking the mean of these variances. So it indicates the between-cluster popularity variance for nearby clusters.
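
The statistics themselves are computed by `clusterStats()` in the `clustering` module, which is not shown. As a rough sketch of what such criteria can look like, the example below computes a singleton count, a mean within-cluster variance and a simplified between-cluster variance on hypothetical data. Note the simplification: the between-cluster variance here is taken over all cluster means, not over each cluster's nearest 5 neighbours as described above.

```python
import pandas as pd

# Hypothetical cluster assignment: clusters 1 and 2 have two buildings, cluster 3 has one.
toy = pd.DataFrame({
    'cluster':    [1, 1, 2, 2, 3],
    'popularity': [1.0, 3.0, 2.0, 2.0, 5.0],
})
grouped = toy.groupby('cluster')['popularity']

# Number of clusters that contain exactly one building.
n_singletons = int((grouped.size() == 1).sum())

# Mean of the per-cluster population variances (ddof=0 so singletons give 0, not NaN).
within_var = grouped.var(ddof=0).mean()

# Variance of the cluster means (simplified, global version).
between_var = grouped.mean().var(ddof=0)

print(n_singletons, within_var, between_var)
```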

In [14]:
cnt_small = [] # The size of 15% largest clusters. The higher the better.
cnt_singlton = [] # The number of singleton clusters. The smaller the better.

cnt_large = [] # The size of 85% largest clusters. The smaller the better.
areas_large = [] # The area of 85% largest clusters. The smaller the better.

interVars = [] # The within cluster popularity variance. The smaller the better.
intraVars = [] # The between cluster popularity variance. The bigger the better.

models = []

In [15]:
df_norms = [df_clustering[['x','y','IDW_0.1']], df_clustering[['x','y','IDW_0.5']], \
            df_clustering[['x','y','IDW_1']], df_clustering[['x','y','IDW_3']]]

IDWs = ['IDW_0.1', 'IDW_0.5', 'IDW_1', 'IDW_3']

In [16]:
for df_norm, idw in zip(df_norms, IDWs):

    #### get linkages for different methods and metrics
    Z_average_N1 = c.Clustering(df_norm, method='average', metric='cityblock')
    Z_average_N2 = c.Clustering(df_norm, method='average', metric='euclidean')

    Z_weighted_N1 = c.Clustering(df_norm, method='weighted', metric='cityblock')
    Z_weighted_N2 = c.Clustering(df_norm, method='weighted', metric='euclidean')

    Z_centroid_N2 = c.Clustering(df_norm, method='centroid', metric='euclidean')

    Z_complete_N1 = c.Clustering(df_norm, method='complete', metric='cityblock')
    Z_complete_N2 = c.Clustering(df_norm, method='complete', metric='euclidean')

    Z_ward_N2 = c.Clustering(df_norm, method='ward', metric='euclidean')

    Zs = [Z_average_N1,Z_average_N2,Z_weighted_N1,Z_weighted_N2,\
          Z_centroid_N2,Z_complete_N1,Z_complete_N2,Z_ward_N2]
    Ls = ['average_N1','average_N2','weighted_N1','weighted_N2',\
          'centroid_N2','complete_N1','complete_N2','ward_N2']

    #### evaluating clusters
    for Z, L in zip(Zs, Ls):

        Z.getClusters(k=800)
        result = Z.clusterStats()

        cnt_small.append(result[0])
        cnt_large.append(result[1])
        cnt_singlton.append(result[2])
        areas_large.append(result[3])
        interVars.append(result[4])
        intraVars.append(result[5])

        models.append(idw+';'+L)

In [17]:
df_perf = pd.DataFrame({'cnt_small':cnt_small,'cnt_large':cnt_large,'cnt_singlton':cnt_singlton,\
                        'areas_large':areas_large,'interVars':interVars,'intraVars':intraVars})

## Combine Multiple Selection Criteria

After obtaining the evaluations given by the 6 criteria for each clustering model, I first normalized and sorted them to produce the score and rank of each model under each criterion, so that they can be compared properly. 

Let $S_m(n)$ and $R_m(n)$ be the score and rank given by the $m^{th}$ criterion to the $n^{th}$ model respectively. We have $S_m(n) \in [0,1]$ with the highest score = 1 and $R_m(n) \in [1,N]$ with the highest rank = 1.
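
As a worked example of these definitions, the sketch below applies min-max scoring and descending ranking to one hypothetical criterion evaluated on three models A, B and C. The raw values are made up.

```python
import pandas as pd

# Hypothetical raw values of one criterion for three models.
raw = pd.Series([10.0, 40.0, 20.0], index=['A', 'B', 'C'])

# Score: min-max normalization to [0, 1], the best model gets 1.
score = (raw - raw.min()) / (raw.max() - raw.min())

# Rank: 1 for the highest raw value, N for the lowest.
rank = raw.rank(method='min', ascending=False)

print(pd.DataFrame({'raw': raw, 'score': score, 'rank': rank}))
```

Model B has the highest raw value, so it receives score 1 and rank 1; model A has the lowest, so it receives score 0 and rank 3.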

In [18]:
def transform_rank(sr):
    """ transform the values to their ranks in range [1, n]; rank 1 is the highest value
    """
    sr = sr.rank(method='min', ascending=False) # high(1), low(n)
    return sr


def transform_score(sr):
    """ min-max normalize the values to scores in range [0, 1]
    """
    min_sr = sr.min()
    sr = sr - min_sr
    
    max_sr = sr.max()
    sr = sr / max_sr

    return sr


def transform_SRC(sr):
    """ transform the values to their score/rank ratio, which lies in range [0, 1]
    """
    score = transform_score(sr)
    rank = transform_rank(sr)
    sr = score/rank
    
    return sr


df_rank = df_perf.apply(transform_rank, axis=0)
df_score = df_perf.apply(transform_score, axis=0)
df_src = df_perf.apply(transform_SRC, axis=0)

In [19]:
scores = copy.deepcopy(df_score)
scores.index = models
scores

Unnamed: 0,areas_large,cnt_large,cnt_singlton,cnt_small,interVars,intraVars
IDW_0.1;average_N1,0.681814,0.680412,0.083916,0.0,1.0,0.795785
IDW_0.1;average_N2,0.755028,0.793814,0.111888,0.0,0.996728,0.893752
IDW_0.1;weighted_N1,0.71661,0.731959,0.195804,0.2,0.993853,0.837344
IDW_0.1;weighted_N2,0.788239,0.701031,0.202797,0.2,0.994344,1.0
IDW_0.1;centroid_N2,0.744268,0.762887,0.0,0.0,0.998957,0.899686
IDW_0.1;complete_N1,0.88707,0.71134,0.713287,0.4,0.953347,0.934088
IDW_0.1;complete_N2,1.0,0.762887,0.699301,0.4,0.952931,0.908809
IDW_0.1;ward_N2,0.93163,1.0,0.951049,0.8,0.970035,0.456518
IDW_0.5;average_N1,0.344818,0.402062,0.300699,0.2,0.50235,0.294611
IDW_0.5;average_N2,0.347013,0.402062,0.27972,0.2,0.566942,0.541699


There are several ways of combining the outputs of the scoring systems, including score combination, rank combination, voting, average combination and weighted combination. Following Hsu and Taksa's research*, we can investigate the scoring behavior of the different criteria through the Rank-Score Characteristic (RSC):
$$RSC_m(n) = \frac{S_m(n)}{R_m(n)}$$

The RSC curves of the criteria form a rank-score graph that shows how differently each criterion assigns its scores. The following picture illustrates 3 scoring systems. A scoring system that assigns scores in a linearly decreasing fashion has a linear rank-score curve, like $f_2$. A system that habitually assigns high scores to a large subset of its top-ranked candidates has a graph that is not a straight line, but has a low slope for the top-ranked candidates and a higher slope for the remainder, like $f_3$. A third class of scoring behavior is exemplified by $f_1$: here the expert habitually gives higher scores to a small subset of its top-ranked candidates and much lower scores to the rest. 
![Rank_score](images/RSC_graph.png)

Hsu and Taksa indicate that a diversity measure based on the rank-score graph can be used to determine whether a score or rank fusion will produce a better result. When the rank-score graphs of two systems are very SIMILAR, then a Score Combination will produce the best fusion. When the rank-score graphs are very DIFFERENT, then a Rank Combination produces the better result.
 * Hsu, D.F. and Taksa, I., Comparing rank and score combination methods for data fusion in information retrieval.
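
To illustrate why the choice between the two fusion methods matters, the sketch below applies a plain (unweighted) score combination and a rank combination to three hypothetical criteria and three models. All numbers are made up, and averaging is only one possible way to combine. In this toy case the two methods pick different winners.

```python
import pandas as pd

# Hypothetical normalized scores: rows are models, columns are criteria.
scores = pd.DataFrame(
    {'c1': [1.0, 0.9, 0.0], 'c2': [1.0, 0.9, 0.0], 'c3': [0.0, 0.9, 1.0]},
    index=['A', 'B', 'C'],
)

# Score combination: average the scores, the highest average wins.
score_comb = scores.mean(axis=1)

# Rank combination: rank within each criterion (1 = best), the lowest average rank wins.
ranks = scores.rank(axis=0, method='min', ascending=False)
rank_comb = ranks.mean(axis=1)

print("best by score combination:", score_comb.idxmax())
print("best by rank combination:", rank_comb.idxmin())
```

Model B is never ranked first but scores consistently high, so it wins on scores, while model A wins on ranks because it is ranked first by two of the three criteria.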

In [20]:
SRC_cnt_small = df_src['cnt_small'].sort_values(ascending=False).reset_index(drop=True)
SRC_cnt_large = df_src['cnt_large'].sort_values(ascending=False).reset_index(drop=True)
SRC_cnt_singlton = df_src['cnt_singlton'].sort_values(ascending=False).reset_index(drop=True)
SRC_areas_large = df_src['areas_large'].sort_values(ascending=False).reset_index(drop=True)
SRC_interVars = df_src['interVars'].sort_values(ascending=False).reset_index(drop=True)
SRC_intraVars = df_src['intraVars'].sort_values(ascending=False).reset_index(drop=True)

In [21]:
p = figure(plot_width=500, plot_height=400)
p.line(SRC_cnt_small.index, SRC_cnt_small, line_width=2, color=Spectral11[0], alpha=0.8, legend='size_smaller_clusters')
p.line(SRC_cnt_large.index, SRC_cnt_large, line_width=2, color=Spectral11[2], alpha=0.8, legend='size_large_clusters')
p.line(SRC_cnt_singlton.index, SRC_cnt_singlton, line_width=2, color=Spectral11[4], alpha=0.8, legend='count_singleton_clusters')
p.line(SRC_areas_large.index, SRC_areas_large, line_width=2, color=Spectral11[6], alpha=0.8, legend='area_large_clusters')
p.line(SRC_intraVars.index, SRC_intraVars, line_width=2, color=Spectral11[8], alpha=0.8, legend='within_cluster_var')
p.line(SRC_interVars.index, SRC_interVars, line_width=2, color=Spectral11[10], alpha=0.8, legend='btw_cluster_var')
show(p)

We observe that the RSC curves of all 6 criteria have a concave shape, which means they give higher scores to a small subset of their top-ranked candidates and much lower scores to the rest. They all show similar scoring behavior.

So I combine the scores of the 6 criteria using Mahalanobis combination and find the best clustering model.
$$SC(n)=\sum_{m=1}^{M} w_m \, S_m(n)$$

where $\sigma_m^2$ is the variance of the scores given by the $m^{th}$ criterion and $$w_m=\frac{\frac{1}{\sigma_m^2}}{\sum_{j=1}^{M} \frac{1}{\sigma_j^2}}$$
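
As a small numeric check of the weights, the sketch below uses two hypothetical criteria `a` and `b` scored on three models. The sample variances are $0.25$ and $1/3$, so the precisions are $4$ and $3$ and the weights are $4/7$ and $3/7$. The data is made up, and pandas' default sample variance (`ddof=1`) is used.

```python
import pandas as pd

# Hypothetical normalized scores of two criteria on three models.
toy_scores = pd.DataFrame({'a': [0.0, 0.5, 1.0], 'b': [0.0, 0.0, 1.0]})

# Precision = 1 / variance; weights are the precisions normalized to sum to 1.
precision = 1.0 / toy_scores.var()
weights = precision / precision.sum()

# Weighted sum over criteria for each model (weights align with columns by label).
combined = (toy_scores * weights).sum(axis=1)

print(weights)
print(combined)
```

A criterion whose scores vary less across models receives a larger weight, which is why `a` outweighs `b` here.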


In [22]:
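def combineScore(df):
    """ combine the criterion scores with inverse-variance weights
    """
    # precision (1 / variance) of each criterion's scores
    precs = [1/df[attr].var() for attr in df.columns]
    # weight each column by its normalized precision
    scores = [(precs[i] / sum(precs)) * df.loc[:, df.columns[i]] for i in range(len(df.columns))]
    # sum over all criteria, as in the SC(n) formula
    combined_score = sum(scores)
    combined_score = transform_score(combined_score)
    
    return combined_score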

combined_score = combineScore(df_score)

index = combined_score.sort_values().index[-1]
print("The model with best combined score is {}.".format(models[index]))

The model with best combined score is IDW_0.1;ward_N2.


# 3. Clustering

Knowing the best parameters for interpolation and clustering, I went ahead and performed clustering on the building data set (because of limited computing power, I used 30,000 data points out of 45,000) and chose the cut on the dendrogram at 600 clusters for visualization. 

In [23]:
def clt(df, popularity='IDW_0.1'):
    df_norm = df[['x','y',popularity]]
    df_norm = df_norm.head(n=30000)
    return df_norm

In [24]:
df_norm = clt(df_building)

In [25]:
cls = c.Clustering(df_norm, method='ward', metric='euclidean')
cls.getClusters(k=600)

In [26]:
df_cluster = cls.getClusterPopularity()

In [27]:
def basePlot(plot_width, plot_height,tools='pan,wheel_zoom,reset,save'):
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height,\
               outline_line_color=None, title='Manhattan Rental Market Clustering')
    
    p.axis.visible = True
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    return p

In [28]:
def clusterPlot(df_cluster, p_alpha=3.3,plot_width=600, plot_height=2160):
    #cm = [random.choice(Spectral11) for i in range(1,601)]
    #colors = [cm[int(i)-1] for i in df_cluster['cluster'].values]
    cm = df_cluster['cluster_popularity']*10
    cm = np.floor(cm)
    colors = [Spectral11[int(i)] for i in cm.values]
    
    alphas = [alpha**p_alpha+0.01 for alpha in df_cluster['cluster_popularity'].values]
    
    lngs, lats = df_cluster['x'].values, df_cluster['y'].values

    p = basePlot(plot_width, plot_height)
    p.background_fill_color = "black"
    p.scatter(lngs, lats, size=1, color=colors, alpha=alphas)
    
    return p

In [52]:
p = clusterPlot(df_cluster)
#show(p)

The plot visualizes the clustering, with bright red indicating popular and dark green indicating less popular clusters. We see that there is a geographic pattern for Manhattan rental apartments. In general, Midtown and Downtown are more popular, especially the West Village and Chinatown areas. The areas around Lincoln Center, the Port Authority Bus Terminal and West 125th St are the least popular.

<img src="images/all.jpg" width="350">

# 4. Other Variables Affecting the Clustering

Besides finding the best clustering for the rental market using only location and popularity, I also investigate how other variables, such as price, number of bedrooms, doorman, etc., affect the clustering.

## Price

In [30]:
df_market.head()

Unnamed: 0,bath,bed,y,x,center_pt,price_bin,broker,elevator,fitness_center,no_fee,popularity
10,1.5,3,356555.874048,-9608865.0,"(-9608865.029217346, 356555.87404817063)",1,0,0,0,0,2
10000,1.0,2,365557.19018,-9616935.0,"(-9616934.736082882, 365557.1901797713)",3,0,1,1,0,1
100004,1.0,1,356477.16348,-9616369.0,"(-9616368.679438747, 356477.1634797137)",1,0,0,0,0,3
100007,1.0,1,360258.281928,-9614124.0,"(-9614124.415577795, 360258.28192826593)",2,0,0,0,1,1
100013,1.0,4,370278.122028,-9617337.0,"(-9617336.91450809, 370278.1220278065)",2,1,0,0,0,1


The quartiles of the price variable in the rental market data set are around \$2500, \$3200 and \$4100. To investigate how the price of rental apartments affects people’s interest and thus produces a different clustering, I selected the data points in the market data set with price under \$2500 and those with price above \$4100 as two new market data sets, and performed interpolation and clustering on each of them separately. 
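
The `price_bin` column comes from the cleaned data set, and how it was built is not shown here. One plausible way to derive quartile bins 0 to 3 is `pandas.qcut`, sketched below on made-up prices. Treat this as an assumption about the preprocessing, not a description of it.

```python
import pandas as pd

# Hypothetical monthly rents.
prices = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000])

# Split into 4 equal-frequency bins; labels=False returns integer codes 0..3.
price_bin = pd.qcut(prices, 4, labels=False)

# Lowest and highest quartile, analogous to price_bin == 0 and price_bin == 3.
low_price = prices[price_bin == 0]
high_price = prices[price_bin == 3]

print(list(price_bin))
print(list(low_price), list(high_price))
```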

In [31]:
price_markets = [df_market[df_market['price_bin']==0], df_market[df_market['price_bin']==3]]

In [32]:
low_price_group = df_building[['DOITT_ID', 'center_pt', 'x', 'y']]
low_price_group['IDW_0.1'] = low_price_group['center_pt'].apply(buildingsInterpolation, \
                                                              power=0.1, df_market=price_markets[0])

high_price_group = df_building[['DOITT_ID', 'center_pt', 'x', 'y']]
high_price_group['IDW_0.1'] = high_price_group['center_pt'].apply(buildingsInterpolation, \
                                                                power=0.1, df_market=price_markets[1])

In [33]:
def clusterDifferentiate(group):
    samples = clt(group)
    samples = samples.head(n=30000)
    clustering = c.Clustering(samples, method='ward', metric='euclidean')
    clustering.getClusters(k=600)
    result = clustering.getClusterPopularity()
    p = clusterPlot(result)
    return p

In [34]:
p_1 = clusterDifferentiate(low_price_group)
p_4 = clusterDifferentiate(high_price_group)

From the clustering plots, we can see that the geographic patterns of low-price and high-price apartments are quite different. 
* The left plot shows that people’s interest in low-price rentals is much more evenly distributed. 
* The right plot, for high-price rentals, is darker and greener. It indicates that when people look for high-price rentals, they focus more on the midtown and downtown areas. (This is not caused by fewer people looking for high-price rentals, because I normalized the variables before clustering.)

In [51]:
#show(row(p_1, p_4))

<img src="images/price.jpg" width="650">

## Bedroom

In [36]:
bedroom_markets = [df_market[df_market['bed'] < 2], df_market[df_market['bed'] >= 2]]

In [37]:
less_bedroom_group = df_building[['DOITT_ID', 'center_pt', 'x', 'y']]
less_bedroom_group['IDW_0.1'] = less_bedroom_group['center_pt'].apply(buildingsInterpolation, \
                                                              power=0.1, df_market=bedroom_markets[0])

more_bedroom_group = df_building[['DOITT_ID', 'center_pt', 'x', 'y']]
more_bedroom_group['IDW_0.1'] = more_bedroom_group['center_pt'].apply(buildingsInterpolation, \
                                                                power=0.1, df_market=bedroom_markets[1])

In [38]:
p_1 = clusterDifferentiate(less_bedroom_group)
p_2 = clusterDifferentiate(more_bedroom_group)

In [53]:
#show(row(p_1, p_2))

<img src="images/bedroom.jpg" width="650">

## Broker Fee

In [40]:
bf_markets = [df_market[df_market['no_fee'] == 1], df_market[df_market['no_fee'] == 0]]

no_bf_group = df_building[['DOITT_ID', 'center_pt', 'x', 'y']]
no_bf_group['IDW_0.1'] = no_bf_group['center_pt'].apply(buildingsInterpolation, \
                                                              power=0.1, df_market=bf_markets[0])

has_bf_group = df_building[['DOITT_ID', 'center_pt', 'x', 'y']]
has_bf_group['IDW_0.1'] = has_bf_group['center_pt'].apply(buildingsInterpolation, \
                                                                power=0.1, df_market=bf_markets[1])

In [41]:
p_1 = clusterDifferentiate(no_bf_group)
p_2 = clusterDifferentiate(has_bf_group)

In [54]:
#show(row(p_1, p_2))

<img src="images/fees.jpg" width="650">

## Fitness Center

Most interesting and surprising is the impact of a fitness center. The left plot is the clustering for buildings without a fitness center, which is more or less the same as the other clusterings. The right plot is the clustering for buildings with a fitness center, in which we can clearly see some outstanding clusters, meaning that people who want a fitness center in the building look at apartments in these areas more.

In [43]:
fc_markets = [df_market[df_market['fitness_center'] == 0], df_market[df_market['fitness_center'] == 1]]

no_fc_group = df_building[['DOITT_ID', 'center_pt', 'x', 'y']]
no_fc_group['IDW_0.1'] = no_fc_group['center_pt'].apply(buildingsInterpolation, \
                                                              power=0.1, df_market=fc_markets[0])

has_fc_group = df_building[['DOITT_ID', 'center_pt', 'x', 'y']]
has_fc_group['IDW_0.1'] = has_fc_group['center_pt'].apply(buildingsInterpolation, \
                                                                power=0.1, df_market=fc_markets[1])

In [44]:
p_1 = clusterDifferentiate(no_fc_group)
p_2 = clusterDifferentiate(has_fc_group)

In [49]:
#show(row(p_1, p_2))

<img src="images/fitness.jpg" width="650">

## Elevator

When we look at the clustering of buildings with and without an elevator, we find that the grouping is almost identical to the fitness center grouping. This may be due to a high correlation between these 2 variables, which makes sense because a building that has a fitness center normally has an elevator as well.
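
The correlation hypothesis can be checked directly on the market data. The sketch below shows the computation on a hypothetical six-row table that uses the same 0/1 column names as `df_market`. The values are made up, so the printed number says nothing about the real data.

```python
import pandas as pd

# Hypothetical listings: 1 = amenity present, 0 = absent.
toy_market = pd.DataFrame({
    'elevator':       [1, 1, 0, 0, 1, 0],
    'fitness_center': [1, 1, 0, 0, 0, 0],
})

# Pearson correlation between the two binary indicators.
r = toy_market['elevator'].corr(toy_market['fitness_center'])

# Contingency table of the two amenities.
table = pd.crosstab(toy_market['elevator'], toy_market['fitness_center'])

print(round(r, 3))
print(table)
```

On the real data, the same two calls applied to `df_market` would show whether the similarity of the two clusterings can be explained by the amenities occurring together.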

In [46]:
elevator_markets = [df_market[df_market['elevator'] == 0], df_market[df_market['elevator'] == 1]]

no_elevator_group = df_building[['DOITT_ID', 'center_pt', 'x', 'y']]
no_elevator_group['IDW_0.1'] = no_elevator_group['center_pt'].apply(buildingsInterpolation, \
                                                              power=0.1, df_market=elevator_markets[0])

has_elevator_group = df_building[['DOITT_ID', 'center_pt', 'x', 'y']]
has_elevator_group['IDW_0.1'] = has_elevator_group['center_pt'].apply(buildingsInterpolation, \
                                                                power=0.1, df_market=elevator_markets[1])

In [47]:
p_1 = clusterDifferentiate(no_elevator_group)
p_2 = clusterDifferentiate(has_elevator_group)

In [50]:
#show(row(p_1, p_2))

<img src="images/elevator.jpg" width="650">

## Conclusion: 
By performing interpolation and clustering, we show that a geographic pattern does exist for the Manhattan rental market, and that it is affected by many variables. 

Some variables, like price and fitness center, have a heavier impact on the clustering. Other variables, like elevator, rarely affect the building grouping. They may be important, but they are independent of geographic location. 