# STATISTICAL INFERENCE

Statistical inference is a critical step that digs deeper into the relationships observed during exploratory data analysis.

# QUESTIONS TO ANSWER

<ul>
    <li>Is price normally distributed? If not, see if it is normally distributed on a log10 scale.</li>
    <li>How is price correlated to the number of bedrooms?</li>
    <li>How is price correlated to the number of bathrooms?</li>
    <li>How is price correlated to the number of beds?</li>
    <li>How is price correlated to the number of accommodates?</li>
    <li>How is price correlated to review_scores_rating?</li>
    <li>How is price correlated to bedroom_bath_ratio?</li>
    <li>How is price correlated to number_of_bookings?</li>
    <li>How is price correlated to number_of_reviews?</li>
    <li>How does each room_type impact price?</li>
    <li>How does each neighbourhood impact price?</li>
    <li>What is the price correlation of bedrooms, bathrooms, beds and accommodates for the Westlake Hills neighbourhood?</li>
    <li>How does bedrooms impact price?</li>
    <li>How does bathrooms impact price?</li>
    <li>How does beds impact price?</li>
    <li>How does accommodates impact price?</li>
</ul>

I will use statistical inference to explore and answer the questions above.

In [38]:
# Import required libraries
import pandas as pd
from scipy import stats
import statsmodels
import numpy as np
import plotly
import cufflinks as cf
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
init_notebook_mode(connected=True)
cf.go_offline()

# Read in the csv file
df = pd.read_csv('Data/airbnb_clean.csv')
df.head()

Unnamed: 0,listing_id,zip_code,latitude,longitude,room_type,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,number_of_reviews,review_scores_rating,neighbourhood,number_of_bookings,bedroom_bath_ratio
0,2265,78702,30.2775,-97.71398,Entire home/apt,4,2.0,2.0,2.0,225.0,30,24,93.0,East Downtown,365.0,100.0
1,5245,78702,30.27577,-97.71379,Private room,2,1.0,1.0,2.0,100.0,30,9,91.0,East Downtown,354.0,100.0
2,5456,78702,30.26112,-97.73448,Entire home/apt,3,1.0,1.0,2.0,95.0,2,499,96.0,East Downtown,74.0,100.0
3,75174,78702,30.24773,-97.72584,Entire home/apt,3,1.0,1.0,1.0,130.0,2,249,98.0,East Downtown,131.0,100.0
4,76911,78702,30.26775,-97.72695,Entire home/apt,10,3.0,5.0,12.0,821.0,2,126,99.0,East Downtown,56.0,60.0


<b>Is the price normally distributed? If not, check if it is normally distributed on a log10 scale.</b>

To ease the prediction and inference of price, its distribution should be approximately normal. From the prior exploratory data analysis, I observed that price is heavily skewed to the right, which means it is not normally distributed. I will transform price to a log10 scale and then check the distribution again.

In [39]:
# Plot reminder of positively skewed price data
df['price'].iplot(
    kind='hist',
    bins=20,
    xTitle='price',
    linecolor='black',
    yTitle='count',
    title='Price Distribution')

In [40]:
# Create a new column for log 10 of price
df['log_price'] = np.log10(df.price)

# Inspect the dataframe
df.head()

Unnamed: 0,listing_id,zip_code,latitude,longitude,room_type,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,number_of_reviews,review_scores_rating,neighbourhood,number_of_bookings,bedroom_bath_ratio,log_price
0,2265,78702,30.2775,-97.71398,Entire home/apt,4,2.0,2.0,2.0,225.0,30,24,93.0,East Downtown,365.0,100.0,2.352183
1,5245,78702,30.27577,-97.71379,Private room,2,1.0,1.0,2.0,100.0,30,9,91.0,East Downtown,354.0,100.0,2.0
2,5456,78702,30.26112,-97.73448,Entire home/apt,3,1.0,1.0,2.0,95.0,2,499,96.0,East Downtown,74.0,100.0,1.977724
3,75174,78702,30.24773,-97.72584,Entire home/apt,3,1.0,1.0,1.0,130.0,2,249,98.0,East Downtown,131.0,100.0,2.113943
4,76911,78702,30.26775,-97.72695,Entire home/apt,10,3.0,5.0,12.0,821.0,2,126,99.0,East Downtown,56.0,60.0,2.914343


In [41]:
df['log_price'].iplot(
    kind='hist',
    bins=20,
    xTitle='log price',
    linecolor='black',
    yTitle='count',
    title='Log Price Distribution')

From the histogram alone, log_price appears much closer to a normal distribution than the original price variable. I will use log_price for the correlation analysis that follows.
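A histogram is a visual check only. As a rough numerical complement, the sketch below compares skewness and a normality test before and after a log10 transform. It is not part of the original analysis: the data is synthetic (a log-normal sample standing in for `df['price']`), and the choice of `scipy.stats.skew` and `scipy.stats.normaltest` is an assumption about which checks are appropriate.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "prices" (log-normal); a stand-in for df['price'], not the real data
rng = np.random.default_rng(0)
price = rng.lognormal(mean=5.0, sigma=1.0, size=5000)
log_price = np.log10(price)

# Skewness near 0 suggests a symmetric distribution; large positive values mean a long right tail
skew_raw = stats.skew(price)
skew_log = stats.skew(log_price)

# D'Agostino-Pearson test: the null hypothesis is that the sample comes from a normal distribution
_, p_raw = stats.normaltest(price)
_, p_log = stats.normaltest(log_price)

print(f"skewness: raw={skew_raw:.2f}, log10={skew_log:.2f}")
print(f"normaltest p-value: raw={p_raw:.3g}, log10={p_log:.3g}")
```

On real listings the log-transformed prices will rarely pass a formal normality test at this sample size, so the skewness comparison is usually the more useful of the two numbers.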

<b>How is price correlated to the number of bedrooms?</b>

In [42]:
# Calculate correlation for bedrooms and price
bedrooms_coef, bedrooms_p_value = stats.pearsonr(df.log_price, df.bedrooms)
print("Correlation Coefficient:", bedrooms_coef)
print("P-value:", bedrooms_p_value)

Correlation Coefficient: 0.5247087791441427
P-value: 0.0


The correlation coefficient of 0.52 indicates a moderate positive correlation between log price and bedrooms. The p-value prints as 0.0 because it is too small to represent as a floating-point number, so the correlation is statistically significant.
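When a p-value is this small, an interval estimate for the coefficient can say more than the p-value itself. The sketch below is not part of the original analysis: it applies the Fisher z-transform to the bedrooms coefficient printed above, and it assumes n = 11,333 listings (the sum of the room_type counts reported later in this notebook). `pearson_ci` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

def pearson_ci(r, n, alpha=0.05):
    """Approximate confidence interval for a Pearson r via the Fisher z-transform."""
    z = np.arctanh(r)                       # Fisher transform of r
    se = 1.0 / np.sqrt(n - 3)               # approximate standard error of z
    z_crit = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

# r is taken from the output above; n is an assumption (see lead-in)
r, n = 0.5247, 11333
low, high = pearson_ci(r, n)
print(f"95% CI for r: ({low:.3f}, {high:.3f})")
```

With a sample this large the interval is narrow, which matches the very small p-value: the estimate of r is precise, whatever its practical importance.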

<b>How is price correlated to the number of bathrooms?</b>

In [43]:
# Calculate correlation for bathrooms and price
bathrooms_coef, bathrooms_p_value = stats.pearsonr(df.log_price, df.bathrooms)
print("Correlation Coefficient:", bathrooms_coef)
print("P-value:", bathrooms_p_value)

Correlation Coefficient: 0.5057768087586119
P-value: 0.0


The correlation coefficient of 0.51 indicates a moderate positive correlation between log price and bathrooms. The p-value prints as 0.0 because it is too small to represent as a floating-point number, so the correlation is statistically significant.

<b>How is price correlated to the number of beds?</b>

In [44]:
# Calculate correlation for beds and price
bed_coef, bed_p_value = stats.pearsonr(df.log_price, df.beds)
print("Correlation Coefficient:", bed_coef)
print("P-value:", bed_p_value)

Correlation Coefficient: 0.4115329390429835
P-value: 0.0


The correlation coefficient of 0.41 indicates a moderate positive correlation between log price and beds. The p-value prints as 0.0 because it is too small to represent as a floating-point number, so the correlation is statistically significant.

<b>How is price correlated to the number of accommodates?</b>

In [45]:
# Calculate correlation for accommodates and price
accommodates_coef, accommodates_p_value = stats.pearsonr(df.log_price, df.accommodates)
print("Correlation Coefficient:", accommodates_coef)
print("P-value:", accommodates_p_value)

Correlation Coefficient: 0.5505050002800288
P-value: 0.0


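The correlation coefficient of 0.55 indicates a moderate positive correlation between log price and accommodates, the strongest of the four size-related features. The p-value prints as 0.0 because it is too small to represent as a floating-point number, so the correlation is statistically significant.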

<b>How is price correlated to review_scores_rating?</b>

In [46]:
# Calculate correlation for review_scores_rating and price
reviewratings_coef, reviewratings_p_value = stats.pearsonr(df.log_price, df.review_scores_rating)
print("Correlation Coefficient:", reviewratings_coef)
print("P-value:", reviewratings_p_value)

Correlation Coefficient: -0.18459286554830479
P-value: 2.021060328068111e-87


The correlation coefficient of -0.18 indicates a weak negative correlation between review_scores_rating and log price. The p-value is close to zero, so the correlation is statistically significant, but with a sample this large even a weak relationship reaches significance; review_scores_rating explains little of the variation in price.

<b>How is price correlated to bedroom_bath_ratio?</b>

In [47]:
# Calculate correlation for bedroom_bath_ratio and price
bedbathratio_coef, bedbathratio_p_value = stats.pearsonr(df.log_price, df.bedroom_bath_ratio)
print("Correlation Coefficient:", bedbathratio_coef)
print("P-value:", bedbathratio_p_value)

Correlation Coefficient: -0.11659327796210625
P-value: 1.3427722639743032e-35


The correlation coefficient of -0.12 indicates a weak negative correlation between bedroom_bath_ratio and log price. The p-value is close to zero, so the correlation is statistically significant even though the effect is small.

<b>How is price correlated to number_of_bookings?</b>

In [48]:
# Calculate correlation for number_of_bookings and price
numberbookings_coef, numberbookings_p_value = stats.pearsonr(df.log_price, df.number_of_bookings)
print("Correlation Coefficient:", numberbookings_coef)
print("P-value:", numberbookings_p_value)

Correlation Coefficient: -0.17575857961611144
P-value: 2.6460881788881057e-79


The correlation coefficient of -0.18 indicates a weak negative correlation between number_of_bookings and log price. The p-value is close to zero, so the correlation is statistically significant even though the effect is small.

<b>How is price correlated to number_of_reviews?</b>

In [49]:
# Calculate correlation for number_of_reviews and price
numberreviews_coef, numberreviews_p_value = stats.pearsonr(df.log_price, df.number_of_reviews)
print("Correlation Coefficient:", numberreviews_coef)
print("P-value:", numberreviews_p_value)

Correlation Coefficient: -0.13628297915262633
P-value: 4.071317217464367e-48


The correlation coefficient of -0.14 indicates a weak negative correlation between number_of_reviews and log price. The p-value is close to zero, so the correlation is statistically significant even though the effect is small.
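The eight correlation cells above repeat the same call with a different column each time. A minimal sketch of doing this in one pass is shown below; `correlate_with_target` is a hypothetical helper, and the toy DataFrame is invented for illustration rather than taken from `airbnb_clean.csv`.

```python
import pandas as pd
from scipy import stats

def correlate_with_target(df, features, target="log_price"):
    """Return Pearson r and p-value of each feature against the target column."""
    rows = []
    for col in features:
        r, p = stats.pearsonr(df[target], df[col])
        rows.append({"feature": col, "r": r, "p_value": p})
    # Strongest positive correlations first
    return pd.DataFrame(rows).set_index("feature").sort_values("r", ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    "log_price": [1.9, 2.0, 2.1, 2.4, 2.9],
    "bedrooms": [1, 1, 1, 2, 5],
    "number_of_reviews": [499, 249, 126, 24, 9],
})
summary = correlate_with_target(toy, ["bedrooms", "number_of_reviews"])
print(summary)
```

On the real data the call would be `correlate_with_target(df, [...])` with the eight column names used above, which puts all coefficients side by side in one table.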

<b>How does each room_type impact price?</b>

In [50]:
df['room_type'].value_counts()

Entire home/apt    8408
Private room       2628
Shared room         181
Hotel room          116
Name: room_type, dtype: int64

In [51]:
# Compare each room_type against all other listings with a two-sample t-test
rows = []
for room_type, count in df['room_type'].value_counts().items():
    in_group = df[df['room_type'] == room_type]['price']
    out_group = df[df['room_type'] != room_type]['price']
    statistic, pvalue = stats.ttest_ind(in_group, out_group)
    rows.append({'room_type': room_type, 'stat': statistic,
                 'pvalue': "{:.8f}".format(float(pvalue)),
                 'mean_True': in_group.mean(), 'mean_False': out_group.mean(),
                 'count_has': count, 'mean_diff': in_group.mean() - out_group.mean()})

# Build the frame once; Series.iteritems() and DataFrame.append() were removed in pandas 2.0
room_type_stats = pd.DataFrame(rows)

In [52]:
# Set room_type as the index
room_type_stats = room_type_stats.set_index('room_type')

# Sort by mean difference
room_type_stats.sort_values(by=['mean_diff'], ascending=False)

room_type,stat,pvalue,mean_True,mean_False,count_has,mean_diff
Entire home/apt,15.174123,0.0,484.886299,120.764444,8408,364.121854
Hotel room,1.112208,0.26607239,506.905172,389.708389,116,117.196783
Shared room,-3.832423,0.00012757,72.044199,396.083214,181,-324.039015
Private room,-14.845579,0.0,107.075723,476.595635,2628,-369.519912


<b>How does each neighbourhood impact price?</b>

In [53]:
# Compare each neighbourhood against all other listings with a two-sample t-test
rows = []
for neighbourhood, count in df['neighbourhood'].value_counts().items():
    in_group = df[df['neighbourhood'] == neighbourhood]['price']
    out_group = df[df['neighbourhood'] != neighbourhood]['price']
    statistic, pvalue = stats.ttest_ind(in_group, out_group)
    rows.append({'neighbourhood': neighbourhood, 'stat': statistic,
                 'pvalue': "{:.8f}".format(float(pvalue)),
                 'mean_True': in_group.mean(), 'mean_False': out_group.mean(),
                 'count_has': count, 'mean_diff': in_group.mean() - out_group.mean()})

# Build the frame once; Series.iteritems() and DataFrame.append() were removed in pandas 2.0
neighbourhood_stats = pd.DataFrame(rows)

In [54]:
# Set neighbourhood as the index
neighbourhood_stats = neighbourhood_stats.set_index('neighbourhood')

# Sort by mean difference
neighbourhood_stats.sort_values(by=['mean_diff'], ascending=False)

neighbourhood,stat,pvalue,mean_True,mean_False,count_has,mean_diff
Westlake Hills,11.416735,0.0,1102.990476,370.549828,315,732.440649
Barton Creek,7.614269,0.0,1017.961957,380.559243,184,637.402714
Highland,4.472501,7.81e-06,871.862385,386.237259,109,485.625126
Gracywoods,4.530266,5.95e-06,749.085,384.473457,200,364.611543
Long Canyon,1.8497,0.06438283,685.58,389.602145,50,295.977855
Old West Austin,6.176242,0.0,668.734454,375.51341,595,293.221043
Brentwood,1.680852,0.09281924,541.794872,388.802004,156,152.992868
Downtown,1.715742,0.0862367,464.197568,386.390445,658,77.807123
University of Texas,0.672649,0.50118402,421.162479,389.225596,597,31.936883
Steiner Ranch,0.312205,0.75489069,412.144981,390.391631,269,21.753351


<b>How does bedrooms impact price?</b>

In [60]:
# Compare listings with each bedroom count against all other listings with a two-sample t-test
rows = []
for bedrooms, count in df['bedrooms'].value_counts().items():
    in_group = df[df['bedrooms'] == bedrooms]['price']
    out_group = df[df['bedrooms'] != bedrooms]['price']
    statistic, pvalue = stats.ttest_ind(in_group, out_group)
    rows.append({'bedrooms': bedrooms, 'stat': statistic,
                 'pvalue': "{:.8f}".format(float(pvalue)),
                 'mean_True': in_group.mean(), 'mean_False': out_group.mean(),
                 'count_has': count, 'mean_diff': in_group.mean() - out_group.mean()})

# Build the frame once; Series.iteritems() and DataFrame.append() were removed in pandas 2.0
bedrooms_stats = pd.DataFrame(rows)

In [61]:
# Set bedrooms as the index
bedrooms_stats = bedrooms_stats.set_index('bedrooms')

# Sort by mean difference
bedrooms_stats.sort_values(by=['mean_diff'], ascending=False)

bedrooms,stat,pvalue,mean_True,mean_False,count_has,mean_diff
14.0,,,4500.0,390.545358,1,4109.454642
23.0,,,4485.0,390.546682,1,4094.453318
15.0,,,2893.0,390.687169,1,2502.312831
13.0,,,2100.0,390.757148,1,1709.242852
12.0,,,2100.0,390.757148,1,1709.242852
8.0,4.66841,3.07e-06,1910.333333,389.297412,12,1521.035921
7.0,6.575017,0.0,1789.535714,387.443874,28,1402.09184
6.0,11.390379,0.0,1756.689655,380.342166,87,1376.347489
9.0,2.114073,0.03453082,1458.0,390.43697,5,1067.56303
5.0,13.849955,0.0,1383.393305,369.526681,239,1013.866624


<b>How does bathrooms impact price?</b>

In [62]:
# Compare listings with each bathroom count against all other listings with a two-sample t-test
rows = []
for bathrooms, count in df['bathrooms'].value_counts().items():
    in_group = df[df['bathrooms'] == bathrooms]['price']
    out_group = df[df['bathrooms'] != bathrooms]['price']
    statistic, pvalue = stats.ttest_ind(in_group, out_group)
    rows.append({'bathrooms': bathrooms, 'stat': statistic,
                 'pvalue': "{:.8f}".format(float(pvalue)),
                 'mean_True': in_group.mean(), 'mean_False': out_group.mean(),
                 'count_has': count, 'mean_diff': in_group.mean() - out_group.mean()})

# Build the frame once; Series.iteritems() and DataFrame.append() were removed in pandas 2.0
bathrooms_stats = pd.DataFrame(rows)

In [63]:
# Set bathrooms as the index
bathrooms_stats = bathrooms_stats.set_index('bathrooms')

# Sort by mean difference
bathrooms_stats.sort_values(by=['mean_diff'], ascending=False)

bathrooms,stat,pvalue,mean_True,mean_False,count_has,mean_diff
5.75,,,6437.0,390.374426,1,6046.625574
7.0,10.877239,0.0,5007.571429,388.054653,7,4619.516776
3.75,8.143851,0.0,4490.4,389.098517,5,4101.301483
17.0,,,4485.0,390.546682,1,4094.453318
6.5,12.693233,0.0,4063.066667,386.041173,15,3677.025493
3.25,,,3470.0,390.636251,1,3079.363749
11.0,,,2893.0,390.687169,1,2502.312831
7.5,3.542936,0.00039729,2699.0,390.296823,3,2308.703177
5.5,8.190443,0.0,2271.125,386.917765,24,1884.207235
6.0,8.472579,0.0,2259.038462,386.612276,26,1872.426186


<b>How does beds impact price?</b>

In [64]:
# Compare listings with each bed count against all other listings with a two-sample t-test
rows = []
for beds, count in df['beds'].value_counts().items():
    in_group = df[df['beds'] == beds]['price']
    out_group = df[df['beds'] != beds]['price']
    statistic, pvalue = stats.ttest_ind(in_group, out_group)
    rows.append({'beds': beds, 'stat': statistic,
                 'pvalue': "{:.8f}".format(float(pvalue)),
                 'mean_True': in_group.mean(), 'mean_False': out_group.mean(),
                 'count_has': count, 'mean_diff': in_group.mean() - out_group.mean()})

# Build the frame once; Series.iteritems() and DataFrame.append() were removed in pandas 2.0
beds_stats = pd.DataFrame(rows)

In [66]:
# Set beds as the index
beds_stats = beds_stats.set_index('beds')

# Sort by mean difference
beds_stats.sort_values(by=['mean_diff'], ascending=False)

beds,stat,pvalue,mean_True,mean_False,count_has,mean_diff
132.0,,,4500.0,390.545358,1,4109.454642
61.0,,,4485.0,390.546682,1,4094.453318
36.0,,,2899.0,390.68664,1,2508.31336
39.0,,,2893.0,390.687169,1,2502.312831
17.0,3.778916,0.00015832,2002.142857,389.912149,7,1612.230708
19.0,2.849425,0.00438773,1998.75,390.340277,4,1608.409723
15.0,3.456187,0.00054988,1623.8,389.819129,10,1233.980871
14.0,3.444599,0.00057398,1468.461538,389.670495,13,1078.791044
26.0,1.771535,0.07649859,1390.75,390.554947,4,1000.195053
13.0,4.179832,2.939e-05,1373.26087,388.910256,23,984.350613


<b>How does accommodates impact price?</b>

In [68]:
# Compare listings with each accommodates value against all other listings with a two-sample t-test
rows = []
for accommodates, count in df['accommodates'].value_counts().items():
    in_group = df[df['accommodates'] == accommodates]['price']
    out_group = df[df['accommodates'] != accommodates]['price']
    statistic, pvalue = stats.ttest_ind(in_group, out_group)
    rows.append({'accommodates': accommodates, 'stat': statistic,
                 'pvalue': "{:.8f}".format(float(pvalue)),
                 'mean_True': in_group.mean(), 'mean_False': out_group.mean(),
                 'count_has': count, 'mean_diff': in_group.mean() - out_group.mean()})

# Build the frame once; Series.iteritems() and DataFrame.append() were removed in pandas 2.0
accommodates_stats = pd.DataFrame(rows)

In [69]:
# Set accommodates as the index
accommodates_stats = accommodates_stats.set_index('accommodates')

# Sort by mean difference
accommodates_stats.sort_values(by=['mean_diff'], ascending=False)

accommodates,stat,pvalue,mean_True,mean_False,count_has,mean_diff
18,13.119278,0.0,7740.75,388.312914,4,7352.437086
21,6.700187,0.0,4165.5,389.575249,4,3775.924751
20,10.363698,0.0,3900.909091,387.497792,11,3513.411299
30,3.903266,9.545e-05,3505.0,390.358309,2,3114.641691
19,,,1900.0,390.774797,1,1509.225203
32,1.868624,0.06170092,1882.5,390.644692,2,1491.855308
24,1.542828,0.12290049,1622.5,390.690583,2,1231.809417
28,1.327325,0.18442779,1450.5,390.720943,2,1059.779057
15,4.169389,3.077e-05,1312.423077,388.78898,26,923.634097
14,6.02229,0.0,1217.880597,385.989881,67,831.890716
