# Manhattan Rental Apartments Clustering
## - Clustering

### Xian Lai
xlai4@fordham.edu   
Apr.2017

=======================================================

In the previous Jupyter notebook, we selected the final clustering model by combined score. Its hyperparameters are:  
**{'IDWpower':0.1, 'n_ngb':13, 'method':'complete', 'metric':'euclidean'}**

Next, we apply this model to the full building dataset and query the information we are interested in.
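The hyperparameter names `IDWpower` and `n_ngb` suggest inverse distance weighting (IDW) over a fixed number of nearest neighbors. The `modules/HierarchicalClustering` source is not shown in this notebook, so the following is only a minimal sketch of that general technique, assuming each building's popularity is a weighted average of its nearest listings. The function name `idw_interpolate` and the toy arrays are hypothetical.

```python
import numpy as np

def idw_interpolate(known_xy, known_values, query_xy, power=0.1, n_neighbors=13):
    """Estimate a value at each query point from its nearest known points."""
    # Pairwise Euclidean distances, shape (n_query, n_known).
    diff = query_xy[:, None, :] - known_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Indices of the closest known points for every query point.
    k = min(n_neighbors, known_xy.shape[0])
    nearest = np.argsort(dist, axis=1)[:, :k]
    near_dist = np.take_along_axis(dist, nearest, axis=1)
    near_vals = known_values[nearest]
    # Inverse-distance weights; the small epsilon avoids division by zero.
    weights = 1.0 / (near_dist + 1e-12) ** power
    return (weights * near_vals).sum(axis=1) / weights.sum(axis=1)

# Toy stand-ins (assumptions): 50 listings with a popularity level, 5 buildings.
rng = np.random.default_rng(0)
listings_xy = rng.uniform(0, 1, size=(50, 2))
listings_pop = rng.integers(0, 3, size=50).astype(float)
buildings_xy = rng.uniform(0, 1, size=(5, 2))
building_pop = idw_interpolate(listings_xy, listings_pop, buildings_xy)
```

With a power as small as 0.1 the weights decay slowly with distance, so the estimate stays close to a plain average of the 13 nearest listings.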

In [1]:
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import sys; sys.path.append('modules')
import HierarchicalClustering as hc
import Visual_static as vis
from bokeh.io import output_notebook, push_notebook, show
from bokeh.layouts import row
output_notebook()

In [3]:
df_listing = pd\
    .read_pickle('data/market_set_cleaned.pickle')\
    .reset_index(drop=True)
df_building = pd\
    .read_pickle('data/building_set_cleaned.pickle')\
    .drop(labels=['DOITT_ID', 'footprint'], axis=1)\
    .reset_index(drop=True)
    
HC = hc.HierarchicalClustering(df_listing, df_building)
HC.prepareData(IDWpower=0.1, n_ngb=13)
HC.clustering(
    n_clusters=600, method='complete', metric='euclidean'
)

Interpolation done.
Clustering done.
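The `method` and `metric` arguments match the parameter names of SciPy's `linkage` function, although the notebook does not confirm that `HC.clustering` uses SciPy internally. Under that assumption, a minimal sketch of complete-linkage clustering cut into a fixed number of clusters could look like this. The random `features` array is a toy stand-in for the building table.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy stand-in (assumption): one row per building with x, y and popularity.
features = rng.normal(size=(200, 3))

# Build the merge tree with complete linkage on Euclidean distances.
Z = linkage(features, method='complete', metric='euclidean')

# Cut the tree so that at most n_clusters flat clusters remain.
n_clusters = 20
labels = fcluster(Z, t=n_clusters, criterion='maxclust')
```

Complete linkage merges clusters by their farthest pair of points, which tends to produce compact groups of similar diameter.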


## Querying clustering using cluster statistics
During clustering, we calculate several statistics for each cluster:
- Popularity mean
- Popularity variance
- Cluster size
- Cluster area
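As an illustration of how such per-cluster statistics can be derived, here is a small pandas sketch on a toy table. It is not the code inside `HierarchicalClustering`. The column names mirror the output shown below, the data is invented, and cluster area is omitted because it would require the building footprints.

```python
import pandas as pd

# Toy data (assumption): one row per building with its cluster label.
points = pd.DataFrame({
    'cluster':    [0, 0, 0, 1, 1, 2],
    'popularity': [0.2, 0.4, 0.6, 1.0, 1.0, -0.5],
})

grouped = points.groupby('cluster')['popularity']
cluster_stats = pd.DataFrame({
    'mean': grouped.mean(),
    'variance': grouped.var(ddof=0),  # population variance, defined for size-1 clusters
    'size': grouped.size(),
})

# Attach each cluster's statistics to every point that belongs to it.
points_with_stats = points.join(cluster_stats, on='cluster')
```

Keeping one table indexed by cluster and one table per point makes both filtering (on the cluster table) and plotting (on the point table) straightforward.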

These statistics can be used in two ways:
1. Filtering clusters. For example, we can select the 100 clusters with the highest mean popularity.
2. Color coding a visualization. For example, we can plot the clusters colored by their popularity mean (green → red: unpopular → popular).
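The color coding can be sketched as a two-step mapping: bin the statistic, then assign each bin a color between green and red. The helpers `make_bins` and `green_to_red` below are hypothetical and are not part of `Visual_static`, and the equal-width binning is an assumption.

```python
import numpy as np

def make_bins(values, n_bins=5):
    """Equal-width bin edges spanning the range of the values."""
    return np.linspace(np.min(values), np.max(values), n_bins + 1)

def green_to_red(values, bins):
    """Map each value to a hex color: green for the lowest bin, red for the highest."""
    n_bins = len(bins) - 1
    # Digitizing against the interior edges yields a bin index from 0 to n_bins - 1.
    idx = np.digitize(values, bins[1:-1])
    colors = []
    for i in idx:
        frac = i / max(n_bins - 1, 1)
        red, green = int(round(255 * frac)), int(round(255 * (1 - frac)))
        colors.append('#%02x%02x00' % (red, green))
    return colors

means = np.array([0.03, 0.34, 0.31, 0.19, 0.28])
bins = make_bins(means)
colors = green_to_red(means, bins)
```

Reusing one set of bin edges for several plots keeps their colors comparable, which appears to be the purpose of passing `bins=v1.bins` in the second example of this notebook.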

In [4]:
HC.clusters.head()  # each data point together with the statistics of its cluster

| | x | y | popularity | cluster | area | mean | size | variance |
|---|---|---|---|---|---|---|---|---|
| 0 | -1.653851 | 1.595556 | -1.234482 | 31.0 | 0.371805 | 0.029672 | 0.046512 | 0.319083 |
| 1 | -0.95724 | -0.942289 | 1.18544 | 102.0 | 0.280429 | 0.344522 | 0.116279 | 0.025513 |
| 2 | -0.431375 | 1.157017 | 0.96891 | 164.0 | 0.727016 | 0.307627 | 0.296512 | 0.01689 |
| 3 | 0.794382 | -0.943578 | -0.027957 | 194.0 | 0.413933 | 0.192525 | 0.209302 | 0.012023 |
| 4 | 0.138062 | 1.03502 | 0.600564 | 316.0 | 0.358621 | 0.282154 | 0.668605 | 0.155873 |


In [25]:
v = vis.Visual(figWidth=400)
# v.plotClustering(HC.clusters, color='mean')

<img src="images/all.png" width="400">

### Examples:
In the first example, we select the 200 largest clusters by area and plot them colored by mean, variance and size respectively.

To do this, we first sort the cluster statistics by the area column and keep the top 200 rows. We then filter the point dataset down to the points whose cluster index is among those 200. Finally, we plot the clustering colored by each statistic.

In [6]:
largeClusters_200 = HC.clusterStats\
    .copy()\
    .sort_values('area', ascending=False)\
    .head(n=200)
data = HC.clusters.copy()
mask = data['cluster'].isin(largeClusters_200.index)
data = data[mask]

In [26]:
v1 = vis.Visual(figWidth=300, title='colored by mean')
v1.plotClustering(data, color='mean', show=False)
v2 = vis.Visual(figWidth=300, title='colored by variance', 
     x_range=v1.p.x_range, y_range=v1.p.y_range)
v2.plotClustering(data, color='variance', show=False)
v3 = vis.Visual(figWidth=300, title='colored by size', 
     x_range=v1.p.x_range, y_range=v1.p.y_range)
v3.plotClustering(data, color='size', show=False)
# show(row(v1.p, v2.p, v3.p))

<img src="images/top_area_clusters.png" width="1300">

In the second example, we investigate the 200 clusters with the largest popularity variance and look at their popularity mean.

In [8]:
largeVarClusters = HC.clusterStats.copy()\
    .sort_values('variance', ascending=False)\
    .head(n=200)
data = HC.clusters.copy()
data = data[data['cluster'].isin(largeVarClusters.index)]

In [27]:
v1 = vis.Visual(figWidth=300, title="Full clustering")
v1.plotClustering(HC.clusters, color='mean', show=False)
v2 = vis.Visual(figWidth=300, title="Large variance clusters colored by mean")
v2.plotClustering(data, color='mean', bins=v1.bins, show=False)
v3 = vis.Visual(figWidth=300, title="Large variance clusters colored by variance")
v3.plotClustering(data, color='variance', show=False)
# show(row(v1.p, v2.p, v3.p))

<img src="images/top_variance_clusters.png" width="1300">

## Compare different datasets

Besides querying one clustering by its statistics, we can also compare different clusterings. Recall that the listing dataset has features such as price level, whether there is a broker fee, and whether the building has a fitness center. We can use these features to split the listings into subsets, interpolate each subset onto the buildings, and compare the resulting clusterings.

As examples, we will compare the following clusterings:
- low price rentals vs median price rentals vs high price rentals
- 1-bedroom rentals vs 2-bedroom rentals vs 4-bedroom rentals
- rentals without elevators vs rentals with elevators
- rentals without gym vs rentals with gym
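Each comparison follows the same pattern: split the listings on one column, then run the pipeline once per subset. The splitting step can be sketched with a small helper. `split_by_values` and the toy table are hypothetical and only illustrate the pattern. The cells in this notebook build the lists by hand instead.

```python
import pandas as pd

def split_by_values(df, column, values):
    """Return {value: rows where df[column] == value}, skipping empty subsets."""
    subsets = {}
    for value in values:
        subset = df[df[column] == value]
        if len(subset) > 0:
            subsets[value] = subset
    return subsets

# Toy listing table (assumption) with two of the feature columns.
toy_listing = pd.DataFrame({
    'bed':      [1, 2, 4, 1, 2, 1],
    'elevator': [0, 1, 1, 0, 0, 1],
})

by_bed = split_by_values(toy_listing, 'bed', [1, 2, 4])
by_elevator = split_by_values(toy_listing, 'elevator', [0, 1])

for value, subset in by_bed.items():
    print("bed == %d: %d listings" % (value, len(subset)))
```

Keying the subsets by feature value keeps each clustering paired with the value it came from, which helps when titling the plots.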

In [10]:
df_listing.head()

| | bath | bed | y | x | center_pt | price_bin | broker | elevator | fitness_center | popularity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 3 | 352061.960979 | -9614608.0 | (-9614608.448523302, 352061.96097883163) | 3 | 0 | 1 | 1 | 2 |
| 1 | 1.0 | 2 | 367089.294251 | -9615811.0 | (-9615810.566116467, 367089.2942514741) | 0 | 1 | 0 | 0 | 2 |
| 2 | 1.5 | 3 | 365770.867282 | -9613973.0 | (-9613972.983376194, 365770.86728249304) | 1 | 1 | 0 | 0 | 1 |
| 3 | 2.0 | 3 | 365840.105803 | -9616608.0 | (-9616607.920002967, 365840.10580306733) | 2 | 0 | 0 | 0 | 1 |
| 4 | 2.0 | 2 | 359408.88919 | -9617688.0 | (-9617688.486000193, 359408.8891899623) | 3 | 0 | 1 | 1 | 2 |


### Price levels:
The listings fall into 4 price bins, coded 0 to 3. We compare the clusterings interpolated from low (bin 0), median (bin 1) and high (bin 3) price listings.

In [11]:
df_listing['price_bin'].describe()

count    10000.000000
mean         1.515300
std          1.100221
min          0.000000
25%          1.000000
50%          1.000000
75%          2.000000
max          3.000000
Name: price_bin, dtype: float64
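Because `price_bin` is a categorical code rather than a continuous quantity, a per-bin count can be easier to read than the summary above. A minimal sketch on an invented series (the values are toy data, not the real listings):

```python
import pandas as pd

# Toy bin codes (assumption); the real column is df_listing['price_bin'].
toy_bins = pd.Series([0, 1, 1, 2, 3, 3, 1, 0], name='price_bin')

# Number of listings per bin, ordered by bin code.
counts = toy_bins.value_counts().sort_index()

# The same information as a share of all listings.
shares = toy_bins.value_counts(normalize=True).sort_index()
```

Applied to the real column, this would report the sizes of all four bins in one step.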

In [12]:
listing_datasets = [
    df_listing[df_listing['price_bin']==0],
    df_listing[df_listing['price_bin']==1],
    df_listing[df_listing['price_bin']==3]
]
print("The size of low price apartment listing dataset: %d" %len(listing_datasets[0]))
print("The size of median price apartment listing dataset: %d" %len(listing_datasets[1]))
print("The size of high price apartment listing dataset: %d" %len(listing_datasets[2]))

The size of low price apartment listing dataset: 2311
The size of median price apartment listing dataset: 2717
The size of high price apartment listing dataset: 2492


In [13]:
HCs_price = []
for listing_dataset in listing_datasets:
    HC = hc.HierarchicalClustering(listing_dataset, df_building)
    HC.prepareData(IDWpower=0.1, n_ngb=13)
    HC.clustering(
        n_clusters=600, method='complete', metric='euclidean'
    )
    HCs_price.append(HC)

Interpolation done.
Clustering done.
Interpolation done.
Clustering done.
Interpolation done.
Clustering done.


In [28]:
datas = [HC.clusters for HC in HCs_price]
titles=[
    'low price rentals clustering', 
    'median price rentals clustering', 
    'high price rentals clustering'
]
v1 = vis.Visual(figWidth=300, title=titles[0])
v1.plotClustering(datas[0], show=False)
v2 = vis.Visual(figWidth=300, title=titles[1])
v2.plotClustering(datas[1], show=False)
v3 = vis.Visual(figWidth=300, title=titles[2])
v3.plotClustering(datas[2], show=False)
# show(row(v1.p, v2.p, v3.p))

<img src="images/prices.png" width="1300">

### Number of bedrooms:

In [15]:
df_listing['bed'].describe()

count    10000.000000
mean         1.509300
std          1.063403
min          0.000000
25%          1.000000
50%          1.000000
75%          2.000000
max          4.000000
Name: bed, dtype: float64

In [16]:
listing_datasets = [
    df_listing[df_listing['bed']==1],
    df_listing[df_listing['bed']==2],
    df_listing[df_listing['bed']==4]
]
print("The size of 1-bedroom apartment listing dataset: %d" %len(listing_datasets[0]))
print("The size of 2-bedroom apartment listing dataset: %d" %len(listing_datasets[1]))
print("The size of 4-bedroom apartment listing dataset: %d" %len(listing_datasets[2]))

The size of 1-bedroom apartment listing dataset: 3315
The size of 2-bedroom apartment listing dataset: 2993
The size of 4-bedroom apartment listing dataset: 356


In [17]:
HCs_br = []
for listing_dataset in listing_datasets:
    HC = hc.HierarchicalClustering(listing_dataset, df_building)
    HC.prepareData(IDWpower=0.1, n_ngb=13)
    HC.clustering(
        n_clusters=600, method='complete', metric='euclidean'
    )
    HCs_br.append(HC)

Interpolation done.
Clustering done.
Interpolation done.
Clustering done.
Interpolation done.
Clustering done.


In [29]:
datas = [HC.clusters for HC in HCs_br]
titles=[
    '1-bedroom rentals clustering', 
    '2-bedroom rentals clustering', 
    '4-bedroom rentals clustering'
]
v1 = vis.Visual(figWidth=300, title=titles[0])
v1.plotClustering(datas[0], show=False)
v2 = vis.Visual(figWidth=300, title=titles[1])
v2.plotClustering(datas[1], show=False)
v3 = vis.Visual(figWidth=300, title=titles[2])
v3.plotClustering(datas[2], show=False)
# show(row(v1.p, v2.p, v3.p))

<img src="images/bedrooms.png" width="1300">

### Elevators:

In [19]:
listing_datasets = [
    df_listing[df_listing['elevator']==0],
    df_listing[df_listing['elevator']==1]
]
print("The size of non-elevator apartment listing dataset: %d" %len(listing_datasets[0]))
print("The size of elevator apartment listing dataset: %d" %len(listing_datasets[1]))

The size of non-elevator apartment listing dataset: 7321
The size of elevator apartment listing dataset: 2679


In [20]:
HCs_elev = []
for listing_dataset in listing_datasets:
    HC = hc.HierarchicalClustering(listing_dataset, df_building)
    HC.prepareData(IDWpower=0.1, n_ngb=13)
    HC.clustering(
        n_clusters=600, method='complete', metric='euclidean'
    )
    HCs_elev.append(HC)

Interpolation done.
Clustering done.
Interpolation done.
Clustering done.


In [30]:
datas = [HC.clusters for HC in HCs_elev]
titles=['non-elevator rentals clustering', 'elevator rentals clustering']
v1 = vis.Visual(figWidth=450, title=titles[0])
v1.plotClustering(datas[0], show=False)
v2 = vis.Visual(figWidth=450, title=titles[1])
v2.plotClustering(datas[1], show=False)
# show(row(v1.p, v2.p))

<img src="images/elevators.png" width="600">

### Gyms:

In [22]:
listing_datasets = [
    df_listing[df_listing['fitness_center']==0],
    df_listing[df_listing['fitness_center']==1]
]
print("The size of non-fitness_center apartment listing dataset: %d" 
      %len(listing_datasets[0]))
print("The size of fitness_center apartment listing dataset: %d" 
      %len(listing_datasets[1]))

The size of non-fitness_center apartment listing dataset: 7321
The size of fitness_center apartment listing dataset: 2679


In [23]:
HCs_fc = []
for listing_dataset in listing_datasets:
    HC = hc.HierarchicalClustering(listing_dataset, df_building)
    HC.prepareData(IDWpower=0.1, n_ngb=13)
    HC.clustering(
        n_clusters=600, method='complete', metric='euclidean'
    )
    HCs_fc.append(HC)

Interpolation done.
Clustering done.
Interpolation done.
Clustering done.


In [31]:
datas_fc = [HC.clusters for HC in HCs_fc]
titles=[
    'non-fitness_center rentals clustering', 
    'fitness_center rentals clustering'
]
v_fc_0 = vis.Visual(figWidth=450, title=titles[0])
v_fc_0.plotClustering(datas_fc[0], show=False)
v_fc_1 = vis.Visual(figWidth=450, title=titles[1])
v_fc_1.plotClustering(datas_fc[1], show=False)
# show(row(v_fc_0.p, v_fc_1.p))

<img src="images/gyms.png" width="600">


## Conclusion: 
By performing interpolation and clustering, we show that geographic patterns do exist for Manhattan rental apartments, and that different apartment properties affect them differently.

Some properties, such as price and number of bedrooms, have a heavy impact on the clustering. Others, such as elevators or gyms, shift the building grouping in the same way but to a lesser degree. They may be important, but they appear to be largely independent of geographic location.
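The comparisons in this notebook are visual. One possible way to put a number on how much two clusterings differ is the adjusted Rand index, which is not used anywhere in the original analysis. The sketch below assumes both clusterings label the same buildings in the same order, and the function `adjusted_rand_index` is a hypothetical helper written for illustration.

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two labelings: 1.0 if identical up to renaming, near 0 if unrelated."""
    _, ia = np.unique(labels_a, return_inverse=True)
    _, ib = np.unique(labels_b, return_inverse=True)
    # Contingency table: how many points fall in cluster i of A and cluster j of B.
    table = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(table, (ia, ib), 1)

    def pairs(n):
        return n * (n - 1) / 2.0

    total = pairs(table.sum())
    if total == 0:
        return 1.0  # fewer than two points: nothing to disagree on
    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / total
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # both labelings put every point in a single cluster
    return (sum_cells - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1, 2, 2], [5, 5, 7, 7, 9, 9])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A lower index between, say, the low price and high price clusterings than between the elevator and non-elevator clusterings would be consistent with the visual impression described above.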