## Airbnb Data
#### This public dataset contains Airbnb listing data; the original source is available at http://insideairbnb.com/

#### This notebook explores the dataset and draws conclusions from some basic statistical analysis

#### We begin with the usual imports

In [19]:
import pandas as pd
import glob, os
from IPython.display import IFrame
import plotly.express as px  # bundled with plotly >= 4; replaces the standalone plotly_express package
import plotly.graph_objects as go

In [20]:
df = pd.read_csv(r'AB_NYC_2019.csv')

#### Scenario 2: Exploratory Data Analysis

##### The dataset covers the five NYC boroughs:

In [21]:
for x in df['neighbourhood_group'].unique():
    print(x)

Brooklyn
Manhattan
Queens
Staten Island
Bronx


##### Let's count hosts and listings per borough, then exclude the properties with no reviews

In [22]:
# Total unique hosts in this dataset
df['host_id'].nunique()

37457

In [23]:
# Listings per borough (count() counts rows, so a host with several listings is counted once per listing)
df.groupby('neighbourhood_group')['host_id'].count()

# Brooklyn and Manhattan have by far the most properties, followed by Queens

neighbourhood_group
Bronx             1091
Brooklyn         20104
Manhattan        21661
Queens            5666
Staten Island      373
Name: host_id, dtype: int64
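
Because `count()` on `host_id` counts rows, the figures above are listings per borough rather than distinct hosts. The following is a minimal sketch of the difference, using a made-up toy frame (the values are hypothetical and not taken from `AB_NYC_2019.csv`); `nunique()` is one way to count each host once per borough.

```python
import pandas as pd

# Toy data: hypothetical values, only the column names match the dataset
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 3],
    "neighbourhood_group": ["Bronx", "Bronx", "Bronx",
                            "Queens", "Queens", "Queens"],
})

by_borough = toy.groupby("neighbourhood_group")["host_id"]
listings_per_borough = by_borough.count()   # one per row, i.e. per listing
hosts_per_borough = by_borough.nunique()    # each host counted once per borough

print(pd.DataFrame({"listings": listings_per_borough,
                    "hosts": hosts_per_borough}))
```

Note that a host with listings in two boroughs would still be counted once in each, so the per-borough host counts need not add up to the total number of unique hosts.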

In [24]:
# Removing listings with no reviews (rows where 'reviews_per_month' or 'last_review' is missing)
df.dropna(subset=['reviews_per_month', 'last_review'], inplace=True)
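
Before dropping rows it can help to check how many are affected. The sketch below does this on a small made-up frame; it assumes, as the cell above does, that a missing `reviews_per_month` or `last_review` marks a listing without reviews. The toy values are hypothetical.

```python
import pandas as pd

# Toy data: hypothetical values, column names follow the dataset
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "number_of_reviews": [5, 0, 12],
    "last_review": ["2019-05-01", None, "2019-06-15"],
    "reviews_per_month": [0.4, None, 1.1],
})

# Number of missing values in each of the two review columns
missing = toy[["reviews_per_month", "last_review"]].isna().sum()
print(missing)

# Non-inplace variant of the drop, so the original frame stays available
reviewed = toy.dropna(subset=["reviews_per_month", "last_review"])
print(len(toy), "->", len(reviewed), "rows")
```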

In [25]:
# Separating short term and long term stay properties
shortStays = df[df['minimum_nights']<=7].copy()
longStays = df[df['minimum_nights']>7].copy()

In [26]:
# Count number of Shortstay properties per Borough
x1 = shortStays.groupby('neighbourhood_group')['host_id'].count().reset_index()
x1.columns=['Borough','host_id_Shortstay']

In [27]:
# Count number of Longstay properties per Borough
x2 = longStays.groupby('neighbourhood_group')['host_id'].count().reset_index()
x2.columns = ['Borough','host_id_Longstay']

In [28]:
X = pd.concat([x1,x2], axis=1)
X

         Borough  host_id_Shortstay        Borough  host_id_Longstay
0          Bronx                833          Bronx                43
1       Brooklyn              14743       Brooklyn              1704
2      Manhattan              14181      Manhattan              2451
3         Queens               4220         Queens               354
4  Staten Island                300  Staten Island                14
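
`pd.concat(..., axis=1)` aligns the two frames on their row index, which is why the Borough column appears twice. A possible alternative, sketched here with the counts copied from the table above, is to merge on `Borough`. The `share_Longstay` column is an illustrative addition and not part of the original analysis.

```python
import pandas as pd

boroughs = ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]

# Counts copied from the table above
x1 = pd.DataFrame({"Borough": boroughs,
                   "host_id_Shortstay": [833, 14743, 14181, 4220, 300]})
x2 = pd.DataFrame({"Borough": boroughs,
                   "host_id_Longstay": [43, 1704, 2451, 354, 14]})

# Merge on the key column so each borough appears once,
# even if the two frames were ordered differently
merged = x1.merge(x2, on="Borough", how="outer")

# Illustrative extra column: share of long-stay listings per borough
total = merged["host_id_Shortstay"] + merged["host_id_Longstay"]
merged["share_Longstay"] = merged["host_id_Longstay"] / total

print(merged)
```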


In [29]:
# Price distribution for short stay properties, coloured by borough
s1 = px.histogram(shortStays, x="price", color="neighbourhood_group")
s1.update_layout(title='Price distribution in short stay properties')
s1.show()

In [30]:
# Ranking Shortstay properties by popularity - I make the assumption that popularity is 'number_of_reviews' weighted by the maximum number of bookings that fit in a 30-day month (30 / 'minimum_nights')

shortStays['MaxBookingPerMonth'] = 30/shortStays['minimum_nights']
shortStays['Popularity'] = shortStays['number_of_reviews']*shortStays['MaxBookingPerMonth']
shortStays.sort_values('Popularity', ascending=False, inplace=True)
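
To make the score concrete, here is a small worked example of the same formula, `Popularity = number_of_reviews * 30 / minimum_nights`, on three made-up listings (the names and values are hypothetical).

```python
import pandas as pd

# Toy listings: hypothetical values, column names follow the dataset
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "minimum_nights": [1, 3, 7],
    "number_of_reviews": [600, 600, 70],
})

# Same two steps as in the cell above
toy["MaxBookingPerMonth"] = 30 / toy["minimum_nights"]
toy["Popularity"] = toy["number_of_reviews"] * toy["MaxBookingPerMonth"]

print(toy.sort_values("Popularity", ascending=False))
```

Listings A and B have the same number of reviews, but A scores three times higher because its one-night minimum allows three times as many bookings per month. The score therefore rewards short minimum stays as much as it rewards reviews.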

In [39]:
# Let's map the Shortstay properties with Popularity above 10000
popSst = px.scatter_mapbox(shortStays[shortStays['Popularity']>10000], lat="latitude", lon="longitude", color="Popularity", hover_name = "name",
                        hover_data=["price","minimum_nights","Popularity"], 
                        zoom=10, color_continuous_scale='Inferno')
popSst.update_layout(mapbox_style="carto-positron", title="Shortstay properties with Popularity>10000!")
popSst.show()

In [40]:
# Applying the same analysis to Longstay properties

longStays['MaxBookingPerMonth'] = 30/longStays['minimum_nights']
longStays['Popularity'] = longStays['number_of_reviews']*longStays['MaxBookingPerMonth']
longStays.sort_values('Popularity', ascending=False, inplace=True)
popLst = px.scatter_mapbox(longStays[longStays['Popularity']>300], lat="latitude", lon="longitude", color="Popularity", hover_name = "name",
                        hover_data=["price","minimum_nights","Popularity"], 
                        zoom=10, color_continuous_scale='Inferno')
popLst.update_layout(mapbox_style="carto-positron", title="Longstay properties with Popularity>300!")
popLst.show()

In [41]:
# Finally, plotting price against popularity to look for a relationship

fig1 = px.scatter(shortStays, x="Popularity", y="price", color="price",
                 hover_data=['price', 'Popularity'])
fig1.show()

In [42]:
# Checking this in Longstay too

fig2 = px.scatter(longStays, x="Popularity", y="price", color="price",
                 hover_data=['price', 'Popularity'])
fig2.show()

#### Under our definition of 'Popularity', the scatter plots suggest that price and Popularity are inversely related in both Short and Longstays.
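
A relationship read off a scatter plot can be checked with a rank correlation. The sketch below computes Spearman's rho as the Pearson correlation of the ranks, on made-up values (hypothetical, not drawn from the dataset); a value close to -1 would support an inverse relationship. This is one possible check, not a step the notebook performs.

```python
import pandas as pd

# Toy data: hypothetical Popularity and price values
toy = pd.DataFrame({
    "Popularity": [18000, 6000, 3000, 900, 300],
    "price": [45, 80, 60, 150, 400],
})

# Spearman's rho equals the Pearson correlation of the rank-transformed columns
rho = toy["Popularity"].rank().corr(toy["price"].rank())
print(round(rho, 3))
```

The same two lines could be applied to `shortStays` and `longStays` to put a number on the trend for each group.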

In [47]:
# Room types among the most popular Shortstay properties (Popularity > 15000)
shortStays[shortStays['Popularity']>15000].groupby(['room_type'])['room_type'].count()

room_type
Private room    7
Name: room_type, dtype: int64

In [48]:
# Room types among the most popular Longstay properties (Popularity > 300)
longStays[longStays['Popularity']>300].groupby(['room_type'])['room_type'].count()

room_type
Entire home/apt    4
Private room       5
Name: room_type, dtype: int64

#### The two results above suggest that tourists preferring short stays are happy with a Private room, whereas those staying longer choose either the entire home / apartment or a Private room. With only 7 and 9 properties in these groups, this is an indication rather than a firm conclusion.