### Setup:

In [1]:
from os import path

import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

import scipy.stats as st

In [2]:
# Set matplotlib default styles
mpl_update = {'font.size':16,
              'xtick.labelsize':14,
              'ytick.labelsize':14,
              'figure.figsize':[14.0,7.0], 
              'axes.labelsize':16,
              'axes.labelcolor':'#677385',
              'axes.titlesize':20,
              'lines.color':'#0055A7',
              'lines.linewidth':3,
              'text.color':'#677385'}
mpl.rcParams.update(mpl_update)

# Set seaborn style
sns.set_style('whitegrid')

### Import NYC RE Data:

In [3]:
directory = 'data'
file_name = 'NYC_RealEstate_Data.json'

data = pd.read_json(path.join(directory,file_name))

In [4]:
data.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,interest_level,latitude,listing_id,longitude,manager_id,photos,price,street_address
0,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,1466754864000,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],medium,40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue
1,1.0,2,c5c8a357cba207596b04d1afd1e4f130,1465733967000,,Columbus Avenue,"[Doorman, Elevator, Fitness Center, Cats Allow...",low,40.7947,7150865,-73.9667,7533621a882f71e25173b27e3139d83d,[https://photos.renthop.com/2/7150865_be3306c5...,5465,808 Columbus Avenue
10,1.0,0,0,1460596230000,New to the market! Spacious studio located in ...,York Avenue,[],low,40.7769,6869199,-73.9467,e32475a6134d6d18279946b7b20a0f12,[https://photos.renthop.com/2/6869199_06b2601f...,1950,1661 York Avenue
100,1.0,2,e3ea799fc85b5ed5a65cb662e6eebafa,1460523347000,Beautiful 2 Bed apartment in bustling ...,8518 3rd Avenue,[],medium,40.624,6866364,-74.0312,6f63020874d0bac3287ec5cdf202e270,[https://photos.renthop.com/2/6866364_50f3ac50...,2000,8518 3rd Avenue
1000,1.0,1,db572bebbed10ea38c6c47ab41619059,1460433932000,Amazing building in a Prime location! just ste...,W 57 St.,"[Swimming Pool, Roof Deck, Doorman, Elevator, ...",medium,40.767,6859853,-73.9841,2b14eec3be2c4d669ce5949cf863de6f,[https://photos.renthop.com/2/6859853_db2bbf20...,3275,322 W 57 St.


### High Level Analysis of Interest Level:

In [5]:
p_ilvl = data.groupby('interest_level')['listing_id'].count().reset_index().set_index('interest_level')
p_ilvl.rename(columns={'listing_id':'count'}, inplace=True)
p_ilvl.sort_values('count', ascending=False, inplace=True)

# Calculate fraction of total
p_ilvl['p'] = (p_ilvl['count'] / p_ilvl['count'].sum()).round(3)

p_ilvl

interest_level,count,p
low,34228,0.694
medium,11225,0.228
high,3835,0.078


In [6]:
ax = data['interest_level'].value_counts().plot.bar(figsize=(10,5),rot=0)

ax.set_xlabel('Interest Level')
ax.set_ylabel('Count')
ax.set_title('Number of Listings per Interest Level')
ax.set_xticklabels(['Low', 'Medium', 'High'])

directory = 'figures'
file_name = 'MVP_BarChart_NumberOfListingsPerInterestLevel.png'
plt.savefig(path.join(directory, file_name), bbox_inches='tight')

#plt.show()
plt.close()

![](figures/MVP_BarChart_NumberOfListingsPerInterestLevel.png)

From the table and bar chart above, it is evident that high interest properties are a rare occurrence: only 7.8% of listings are high interest, compared with 69.4% low interest and 22.8% medium interest.
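
The count and proportion columns in the table above can also be produced directly with `value_counts`. The following is a minimal sketch on a toy series; `levels` and `summary` are illustrative names, and the values are made up rather than taken from the listing data:

```python
import pandas as pd

# Toy stand-in for data['interest_level']; the values are illustrative only.
levels = pd.Series(['low', 'low', 'low', 'medium', 'high',
                    'low', 'medium', 'low', 'low', 'low'])

# value_counts() sorts by frequency; normalize=True returns fractions of the total.
summary = pd.DataFrame({
    'count': levels.value_counts(),
    'p': levels.value_counts(normalize=True).round(3),
})

print(summary)
```

Because both columns come from the same series, the index labels align automatically and no manual division is required.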

### High Level Analysis of Price

Price is one of the more intuitive factors one would expect to have an impact on the interest level in a listing, so we will start with some basic analysis of the available price data and its summary statistics across each interest level.

##### Distribution of Price Data

In [7]:
fig, axs = plt.subplots(1,2, figsize=(15,5))

data['price'].plot.kde(ax=axs[0])
axs[0].set_xlabel('Price')
price_max = data['price'].max()
axs[0].set_xlim(0, price_max)
axs[0].set_title('Full Range (0-{})'.format(price_max))

data['price'].plot.kde(ax=axs[1])
axs[1].set_xlabel('Price')
axs[1].set_xlim(0,10000)
axs[1].set_title('< 10000')

fig.suptitle('Distribution of Listing Prices', size=24)

fig.tight_layout()
fig.subplots_adjust(top=0.8)

directory = 'figures'
file_name = 'MVP_KDE_Price.png'

fig.savefig(path.join(directory, file_name))

#plt.show()
plt.close()

![](figures/MVP_KDE_Price.png)

In [8]:
print('Mean Price = {}'.format(data['price'].mean()))

Mean Price = 3664.0422212303197


In [9]:
print('Skew = {}, Kurtosis = {}'.format(st.skew(data['price']),st.kurtosis(data['price'])))

Skew = 8.703749824887856, Kurtosis = 194.16351576157203


In [10]:
price_mask = data['price'] <= 10000
print('Skew = {}, Kurtosis = {}'.format(st.skew(data[price_mask]['price']),st.kurtosis(data[price_mask]['price'])))

Skew = 1.4523326600473045, Kurtosis = 2.6457663279667623


Price data is extremely skewed to the right. Restricting the sample to prices of 10000 or less reduces the skew from about 8.7 to about 1.45, but a skew above 1 still indicates a highly skewed distribution. Consequently, any application of normal distribution statistics is likely to lead to biased results.
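
One practical consequence of a long right tail is that the mean is pulled upwards while the median barely moves, which matches the mean (\$3664.04) sitting above the median (\$3150.00) in this dataset. A minimal sketch with hypothetical rents, where the single extreme value is an assumption chosen for illustration:

```python
import statistics

# Hypothetical monthly rents; the last value mimics a single extreme listing.
prices = [2000, 2500, 3000, 3500, 100000]

mean_price = statistics.mean(prices)      # sensitive to the extreme value
median_price = statistics.median(prices)  # robust to the extreme value

print('Mean = {}, Median = {}'.format(mean_price, median_price))
```

This is why the comparisons that follow lean on medians and quartiles rather than means.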

##### Converting Interest Level to Integer

For the purpose of performing numerical analysis of interest level, the Low/Medium/High values will be converted to an integer equivalent.

In [11]:
def get_ilevel_int(ilevel):
    '''
    Converts a string value interest level (high/medium/low) into an integer value of 1, 0 or -1.

    Any unrecognised value falls back to 0, the same value as 'medium'.
    '''
    if ilevel == 'high':
        return 1
    if ilevel == 'low':
        return -1

    # 'medium' and any unrecognised value both map to 0
    return 0

High, Medium and Low interest levels are assigned an integer value of 1, 0 and -1, respectively. The reasoning for these weights is to allow for a focus on concentrations of high interest properties, whilst enabling low interest properties to detract from results.

If medium interest properties were also being considered, these could potentially be given something like half the weight of high interest (i.e. 0.5). However, we will simply proceed with calculations using the values above.
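
The same weighting can be written as a lookup table, which makes the weights easy to change. This is a sketch of an alternative, not the approach used in this notebook; `ILEVEL_WEIGHTS` and the toy `levels` series are hypothetical names:

```python
import pandas as pd

# Hypothetical lookup table holding the weights described above.
ILEVEL_WEIGHTS = {'high': 1, 'medium': 0, 'low': -1}

# Toy stand-in for data['interest_level'], including one missing value.
levels = pd.Series(['medium', 'low', 'low', 'high', None])

# Series.map returns NaN for anything not in the table; fill those with 0,
# mirroring the fallback in get_ilevel_int.
levels_int = levels.map(ILEVEL_WEIGHTS).fillna(0).astype(int)

print(levels_int.tolist())
```

Giving medium interest half the weight of high interest would then only require changing the `'medium'` entry to `0.5` and dropping the final `astype(int)`.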

In [12]:
# Creating new column with integer values for interest level

data_price_ilvl = data[['price', 'interest_level']].copy()
data_price_ilvl['interest_level_int'] = data_price_ilvl['interest_level'].apply(get_ilevel_int)
data_price_ilvl.head()

Unnamed: 0,price,interest_level,interest_level_int
0,3000,medium,0
1,5465,low,-1
10,1950,low,-1
100,2000,medium,0
1000,3275,medium,0


##### Boxplot of Price by Interest Level

The first step is to visualize the distribution of prices for each interest level. Boxplots are a convenient means of displaying and comparing the spread of price data for all interest levels:

In [14]:
ax = data_price_ilvl[['interest_level_int', 'price']].boxplot(by='interest_level_int', figsize=(15,7))

ax.set_xticklabels(['low','medium','high'])
ax.set_title('Boxplot: Price grouped by Interest Level')
ax.set_xlabel('Interest Level')
ax.set_ylabel('Price')

plt.suptitle('')

directory = 'figures'
file_name = 'MVP_Boxplot_PriceByInterestLevel.png'
plt.savefig(path.join(directory, file_name))

#plt.show()
plt.close()

![](figures/MVP_Boxplot_PriceByInterestLevel.png)

In the figure above, it is difficult to distinguish some of the finer differences in median, IQR, etc. due to the large number of outliers above the upper whisker for low interest properties. The boxplot is repeated below, this time ignoring the outliers.

In [15]:
# Plotting boxplot without outliers for a closer view.

ax = data_price_ilvl[['interest_level_int', 'price']].boxplot(by='interest_level_int', figsize=(15,7), showfliers=False)

ax.set_xticklabels(['low','medium','high'])
ax.set_title('Boxplot: Price grouped by Interest Level')
ax.set_xlabel('Interest Level')
ax.set_ylabel('Price')

plt.suptitle('')

directory = 'figures'
file_name = 'MVP_Boxplot_PriceByInterestLevelNoFliers.png'
plt.savefig(path.join(directory, file_name))

plt.close()

![](figures/MVP_Boxplot_PriceByInterestLevelNoFliers.png)

Whilst there does appear to be an inverse relationship between the median price and the interest level (lower price = higher interest level), at first glance the differences between the levels are not as pronounced as one might have anticipated. There is substantial overlap between the three IQRs and the overall price ranges. As for the medians:

In [16]:
price_median = data['price'].median()
print("Overall Price Median = ${:.2f}".format(price_median))

Overall Price Median = $3150.00


In [17]:
ilvl_price_medians = data.groupby('interest_level')['price'].median()[['low', 'medium', 'high']]

print("Median Prices by Interest Level:")
for i, p in ilvl_price_medians.items():
    print("{} = ${:.2f}".format(i, p))

Median Prices by Interest Level:
low = $3300.00
medium = $2895.00
high = $2400.00


In [18]:
ilvl_price_median_delta = ilvl_price_medians.max() - ilvl_price_medians.min()

print('Range in Median Prices across Interest Level = ${:.2f}'.format(ilvl_price_median_delta))

Range in Median Prices across Interest Level = $900.00


In [19]:
print('Difference in Price Medians and Overall Median per Interest Level:')
for i, p in ilvl_price_medians.items():
    delta = p - price_median
    p_delta = 100 * (p/price_median - 1)
    sign = '+' if delta >= 0 else '-'
    print("{} = {}${:.2f} ({}{:.0f}%)".format(i, sign, abs(delta), sign, abs(p_delta)))

Difference in Price Medians and Overall Median per Interest Level:
low = +$150.00 (+5%)
medium = -$255.00 (-8%)
high = -$750.00 (-24%)



There is a \$900 difference between the high interest and low interest median prices. Comparing each median price to the overall median, it is evident that the high interest properties show the largest difference, with a median \$750 below the overall median, whereas low interest properties have a median price \$150 above it. These differences may not appear to be as significant as one might expect. However, it is important to consider them within the context of monthly rental payments, in which even a difference of \$500 per month adds up to an additional \$6000 per year.
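
The annual figures follow from multiplying each monthly difference by 12. A small sketch using the medians printed above; the variable names are illustrative:

```python
# Median monthly prices as printed in the cells above.
medians = {'low': 3300, 'medium': 2895, 'high': 2400}
overall_median = 3150

# Monthly difference from the overall median, scaled to a 12-month year.
annual_delta = {level: 12 * (m - overall_median) for level, m in medians.items()}

# Annual cost of the gap between the highest and lowest medians.
annual_range = 12 * (max(medians.values()) - min(medians.values()))

for level, delta in annual_delta.items():
    print('{}: {:+d} per year'.format(level, delta))
print('Range: {} per year'.format(annual_range))
```

On these numbers, a typical high interest listing costs \$9000 less per year than a listing at the overall median, and the gap between the low and high interest medians is \$10800 per year.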

##### Evaluating price by Standard Deviation bins

Another possible view of prices across interest levels is a bar chart splitting the properties into standard deviation bins. This should give some insight into how the properties are distributed across the standard deviations below and above the mean.

First, each listing is placed in the appropriate bin, using intervals of 0.5 standard deviations:

In [20]:
price_mean = data['price'].mean()
price_std = data['price'].std()

print('Mean Price = ${:.2f}'.format(price_mean))
print('Price Std. = ${:.2f}'.format(price_std))

Mean Price = $3664.04
Price Std. = $2382.97


In [21]:
def get_price_sbin(price, mean, std):
    '''
    Returns the standard deviation bin for a given price according to the provided mean & standard deviation.
    
    Bins are rounded to the nearest 0.5 standard deviation.
    '''
    dev = (price-mean)/std
    dev = round(2*dev)/2 # Round to nearest 0.5
    
    # Due to skew of price data, will simplify the higher prices as belonging to the 3+ std. bin
    if dev > 3:
        dev = 3
    
    return dev

In [22]:
# Set new price standard dev. column
data['price_std_bin'] = data['price'].apply(lambda x: get_price_sbin(x, price_mean, price_std))

In [23]:
data[data['price'] == 100000]

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,interest_level,latitude,listing_id,longitude,manager_id,photos,price,street_address,price_std_bin
45674,6.0,6,d20bce0bc08b2731f726067a1e501162,1460773335000,MANSION IN THE SKY!!! This six-bedroom full-fl...,230 West 56th Street,"[Doorman, Elevator, Furnished, Laundry in Unit...",low,40.7654,6881666,-73.9822,37ffeac28297e956deecd7b31940c6e7,[https://photos.renthop.com/2/6881666_bedcd181...,100000,230 West 56th Street,3.0
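
The binning rule can be checked by hand using the mean and standard deviation printed above. The sketch below uses `price_sbin`, a hypothetical compact restatement of `get_price_sbin`, applied to a few prices from the first rows of the data:

```python
# Mean and standard deviation as printed earlier (rounded to cents).
MEAN, STD = 3664.04, 2382.97

def price_sbin(price, mean, std, cap=3.0):
    '''Round the z-score to the nearest 0.5 and cap it at `cap`.'''
    dev = (price - mean) / std
    return min(round(2 * dev) / 2, cap)

for price in [0, 1950, 3000, 5465, 100000]:
    print(price, price_sbin(price, MEAN, STD))
```

The \$100000 listing is roughly 40 standard deviations above the mean and is capped at 3.0, matching the output above. At the other end, the mean is only about 1.54 standard deviations above zero, so no non-negative price can fall below the -1.5 bin. This asymmetry is a direct result of the skew.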


In [24]:
# Plot the bar charts for each interest level

mask_high_ilvl = data['interest_level'] == 'high'
mask_med_ilvl = data['interest_level'] == 'medium'
mask_low_ilvl = data['interest_level'] == 'low'

fig, axs = plt.subplots(1,3, figsize=(14,5))

import numpy as np  # pd.np was deprecated in pandas 1.0 and removed in 2.0

hbins = np.arange(-3.25,3.5,0.5)

data[mask_high_ilvl]['price_std_bin'].plot.hist(bins=hbins, ax=axs[0], xlim=(-3.25,3.25), title='High Interest',edgecolor='k')
data[mask_med_ilvl]['price_std_bin'].plot.hist(bins=hbins, ax=axs[1], xlim=(-3.25,3.25), title='Medium Interest', edgecolor='k')
data[mask_low_ilvl]['price_std_bin'].plot.hist(bins=hbins, ax=axs[2], xlim=(-3.25,3.25), title='Low Interest', edgecolor='k')


for ax in axs:
    ax.set_xlabel('# Std. Deviations from Mean Price')
    ax.set_xticks(range(-3,4))
    ax.grid(False)

plt.tight_layout()

directory = 'figures'
file_name = 'MVP_Hist_StDevPriceperInterestLevel.png'
plt.savefig(path.join(directory, file_name))

plt.close()

![](figures/MVP_Hist_StDevPriceperInterestLevel.png)

In [25]:
# Perform crosstab & heatmap for alternative visualization

ilvl_x_pricebin = pd.crosstab(data['interest_level'], data['price_std_bin']).reindex(['low','medium','high'])

# Convert to percentage/proportion to allow for useful comparison between interest levels
p_ilvl_x_pricebin = ilvl_x_pricebin.divide(ilvl_x_pricebin.sum(axis=1), axis=0)

plt.figure(figsize=(10,5))

ax = sns.heatmap(p_ilvl_x_pricebin, cmap='GnBu')

plt.xlabel('Price Standard Deviation')
plt.ylabel('Interest Level')
plt.suptitle('Heatmap: Price Standard Dev. from Mean, per Interest Level')

directory = 'figures'
file_name = 'MVP_Heatmap_StDevPricePerInterestLevel.png'
plt.savefig(path.join(directory, file_name))

plt.close()

![](figures/MVP_Heatmap_StDevPricePerInterestLevel.png)

From the two figures above, one can observe an increased bias towards the lower standard deviation bins for the medium and high interest listings. However, we also observe the same trend, albeit with less intensity, for the low interest properties. Given the earlier findings on the skew of the price data, standard deviation bins may not provide the clearest distinction between interest levels.

##### Evaluating Price by Quartile bins

As an alternative to standard deviation, we will repeat the above analysis using quartiles as bins.

In [26]:
def get_price_qbin(price, quartiles):
    '''
    Return which of the provided quartiles the price falls in.
    '''
    for i, q in enumerate(quartiles):
        if price <= q:
            return i+1

In [27]:
# Set new quartile bin column using 4 quartiles (0-25, 25-50, 50-75, 75-100)

quartiles = data['price'].quantile([0.25,0.5,0.75, 1])

data['price_q_bin'] = data['price'].apply(lambda x: get_price_qbin(x, quartiles))

data[['price','price_q_bin']].head()

Unnamed: 0,price,price_q_bin
0,3000,2
1,5465,4
10,1950,1
100,2000,1
1000,3275,3
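
pandas also provides `pd.qcut`, which assigns quantile bins in one vectorized call. The sketch below is an alternative to `get_price_qbin`, shown on a toy series of hypothetical prices rather than the full dataset:

```python
import pandas as pd

# Toy stand-in for data['price']; the values are illustrative only.
prices = pd.Series([3000, 5465, 1950, 2000, 3275, 2400, 4100, 2895])

# q=4 splits at the 25th, 50th and 75th percentiles. Intervals are closed on
# the right, matching the `price <= q` test used in get_price_qbin.
q_bin = pd.qcut(prices, q=4, labels=[1, 2, 3, 4]).astype(int)

print(pd.DataFrame({'price': prices, 'price_q_bin': q_bin}))
```

One caveat: `pd.qcut` raises an error if two quantile edges coincide, which can happen when many listings share the same price. In that case the `duplicates='drop'` argument (with labels adjusted to match) or the explicit loop above may be the safer choice.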


In [29]:
# Heatmap of the prices across the four quartiles

ilvl_x_pricebin = pd.crosstab(data['interest_level'], data['price_q_bin']).reindex(['low','medium','high'])

# Convert to proportion of total in a given interest level to allow comparison across interest levels
p_ilvl_x_pricebin = ilvl_x_pricebin.divide(ilvl_x_pricebin.sum(axis=1), axis=0)

sns.heatmap(p_ilvl_x_pricebin, cmap='GnBu')

plt.xlabel('Price Quartile')
plt.ylabel('Interest Level')
plt.suptitle('Heatmap: Price Quartile per Interest Level')

directory = 'figures'
file_name = 'MVP_Heatmap_PriceQuartileperInterestLevel.png'
plt.savefig(path.join(directory, file_name))

plt.close()

![](figures/MVP_Heatmap_PriceQuartileperInterestLevel.png)

The figure above provides a much more distinct picture of how price varies across interest level. The high and medium interest properties clearly favor prices in the 1st quartile, whereas the low interest properties are most concentrated in the 4th quartile.
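
The row-wise normalization used for both heatmaps can also be requested from `pd.crosstab` directly through its `normalize` argument. A minimal sketch on a toy frame, where the column values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the interest level and quartile bin columns.
toy = pd.DataFrame({
    'interest_level': ['low', 'low', 'low', 'low', 'medium', 'medium', 'high', 'high'],
    'price_q_bin':    [4, 4, 3, 1, 1, 2, 1, 1],
})
order = ['low', 'medium', 'high']

# Manual approach used above: divide each row by its own total.
counts = pd.crosstab(toy['interest_level'], toy['price_q_bin'])
manual = counts.divide(counts.sum(axis=1), axis=0).reindex(order)

# Built-in equivalent: normalize='index' scales every row to sum to 1.
builtin = pd.crosstab(toy['interest_level'], toy['price_q_bin'],
                      normalize='index').reindex(order)

print(builtin)
```

Normalizing per row matters here because low interest listings outnumber the others; raw counts would make the low interest row dominate the color scale.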

### Next Steps:

Following this initial analysis of price and interest level, there is a lot left to explore within the data for possible influencers on interest level. Some of the immediate possibilities are as follows:

More in-depth analysis of price & interest level with respect to other factors:
* Price per bedroom
* Price per bathroom
* Price per total number of rooms
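
As a starting point for the per-room measures, the sketch below computes a price per bedroom for the first three listings shown earlier. Treating a studio (0 bedrooms) as a single room is an assumption made here only to avoid division by zero; other conventions are equally plausible:

```python
import pandas as pd

# Price and bedroom counts of the first three listings in data.head().
toy = pd.DataFrame({'price': [3000, 5465, 1950],
                    'bedrooms': [3, 2, 0]})

# Assumption: count a studio as one room so the ratio stays finite.
toy['price_per_bedroom'] = toy['price'] / toy['bedrooms'].clip(lower=1)

print(toy)
```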

Analysis of Features and other non-numerical attributes:
* Frequency of various features across interest levels
* Presence of photos, number of photos
* Presence of description, length of description
* Presence of features, number of features

Analysis of Location:
* Distribution of low, medium, high interest throughout the city. Areas of concentration.
* Analysis of interest level by Borough
* Analysis of interest level vs. surrounding listings (e.g. price, interest level, features of nearby listings)