#AIRBNB NYC OPEN DATA EXPLORATORY ANALYSIS

The goal of this notebook is to explore the "New York City Airbnb Open Data" dataset, available at https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data. The dataset has approximately 49,000 rows, each describing an individual Airbnb listing: its host, geographical location (coordinates, borough, and neighborhood), a "room type" classification, and review information. While the dataset includes the number of reviews and the date of the most recent review as of the time the data was collected, actual ratings are absent. The dataset is from 2019.

The goal is to answer a few basic questions about the data:
1: Which boroughs and neighborhoods commanded the highest prices?
2: Which host was the most reviewed?
3: Which accommodation type was the most popular: "Entire home/apt", "Private room", or "Shared room"?

After this exploratory analysis is done, the data will also be visualized in Tableau.

In [2]:
import pandas as pd

In [3]:
data=pd.read_csv('AB_NYC_2019.csv') #importing the data with pandas

##Overview of the data
The next cell shows the first five rows of data, better illustrating what information is included.

In [5]:
data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [6]:
data.price.describe()

count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64

##Outliers
As the summary above shows, prices vary widely, from 0 to 10,000 dollars. Rows with a price of 0, or of 9,999 dollars or more, will be removed so that they do not distort the mean.

In [7]:
data=data[data['price']>0]
data=data[data['price']<9999]
data.price.describe() #It turns out the outliers did not impact the mean very much after all.

count    48878.000000
mean       151.546319
std        213.974393
min         10.000000
25%         69.000000
50%        106.000000
75%        175.000000
max       8500.000000
Name: price, dtype: float64
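
The two summaries above show the mean moving from 152.72 to 151.55 while the median stays at 106, which is why the median is often the safer summary for skewed prices. The sketch below shows the same filter written as a single boolean mask, together with a before/after comparison of mean and median. It runs on a tiny made-up `toy` frame (an assumption for illustration, not the real listings), so the numbers it prints are not results from the dataset.

```python
import pandas as pd

# Toy stand-in for the listings table; these prices are made up for illustration.
toy = pd.DataFrame({"price": [0, 50, 100, 150, 10000]})

# Keep rows with 0 < price < 9999, the same bounds as the two-step filter above.
mask = (toy["price"] > 0) & (toy["price"] < 9999)
filtered = toy[mask]

# Compare how the mean and the median react to dropping the extreme rows.
summary = pd.DataFrame({
    "before": toy["price"].agg(["mean", "median"]),
    "after": filtered["price"].agg(["mean", "median"]),
})
print(summary)
```

In the toy data the mean drops sharply once the extreme rows are removed, while the median does not move at all.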

##Question 1: Boroughs, Neighborhoods, and Prices
The goal is to find out which boroughs and neighborhoods command the highest prices on Airbnb. 

In [8]:
data.groupby('neighbourhood_group').price.mean()

neighbourhood_group
Bronx             87.577064
Brooklyn         123.947447
Manhattan        195.074344
Queens            97.769991
Staten Island    114.812332
Name: price, dtype: float64

Perhaps as expected, the borough of Manhattan has the highest average price for Airbnb listings. Now, let's drill down to the neighborhood level. The following cell prints the mean price of each individual neighborhood.

In [9]:
pd.set_option('display.max_rows', None)
neighborhood_means=data.groupby(['neighbourhood_group','neighbourhood']).price.mean()
print(neighborhood_means)

neighbourhood_group  neighbourhood             
Bronx                Allerton                       87.595238
                     Baychester                     75.428571
                     Belmont                        77.125000
                     Bronxdale                      57.105263
                     Castle Hill                    63.000000
                     City Island                   173.000000
                     Claremont Village              87.464286
                     Clason Point                  112.761905
                     Co-op City                     77.500000
                     Concourse                      86.180000
                     Concourse Village              73.781250
                     East Morrisania                94.444444
                     Eastchester                   141.692308
                     Edenwald                       82.000000
                     Fieldston                      75.083333
                     F

The next cell will select the highest mean price in each borough. 

In [94]:
means_dataframe=pd.DataFrame(neighborhood_means) #new dataframe made to hold the above
means_reset_index=means_dataframe.reset_index() #resetting the index to a default 0..n-1 range
max_indices=means_reset_index.groupby('neighbourhood_group').price.idxmax() #index labels of the max mean price in each borough
means_reset_index.loc[max_indices] #idxmax returns labels, so select with .loc

Unnamed: 0,neighbourhood_group,neighbourhood,price
34,Bronx,Riverdale,442.090909
88,Brooklyn,Sea Gate,487.857143
121,Manhattan,Tribeca,490.638418
163,Queens,Neponsit,274.666667
188,Staten Island,Fort Wadsworth,800.0
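
If more than one neighborhood per borough is of interest, one possible variation is to sort the means and keep the first few rows of each group. The sketch below uses a small made-up Series shaped like `neighborhood_means`; the neighborhood names, prices, and the `TOP_N` constant are hypothetical.

```python
import pandas as pd

# Hypothetical neighbourhood means, indexed the same way as neighborhood_means.
toy_means = pd.Series(
    [90.0, 400.0, 120.0, 75.0, 60.0],
    index=pd.MultiIndex.from_tuples(
        [("Bronx", "A"), ("Bronx", "B"), ("Bronx", "C"),
         ("Queens", "D"), ("Queens", "E")],
        names=["neighbourhood_group", "neighbourhood"],
    ),
    name="price",
)

TOP_N = 2  # hypothetical: how many neighbourhoods to keep per borough

# Sort by borough, then by descending mean price, and keep the first TOP_N rows per borough.
top_n = (
    toy_means.reset_index()
    .sort_values(["neighbourhood_group", "price"], ascending=[True, False])
    .groupby("neighbourhood_group")
    .head(TOP_N)
)
print(top_n)
```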


This information can give a person a good idea of which areas would fit their budget if they were planning a stay in the city. It might also give clues as to which areas have higher or lower costs of living if they were planning a move. Of course, these 2019 figures may be out of date because of the pandemic, but the same approach can be applied to newer data.
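
A round mean such as the 800.0 reported for Fort Wadsworth may rest on very few listings; the output above does not show how many listings stand behind each mean, so this is a possibility rather than a finding. One way to check is to compute the listing count next to the mean and, optionally, ignore neighborhoods below some minimum. The sketch below assumes made-up rows and a hypothetical `MIN_LISTINGS` threshold.

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names; not real listings.
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Bronx", "Queens", "Queens"],
    "neighbourhood": ["A", "A", "B", "C", "C"],
    "price": [80, 100, 400, 60, 90],
})

# Mean price and number of listings for every neighbourhood.
stats = (
    toy.groupby(["neighbourhood_group", "neighbourhood"])["price"]
    .agg(["mean", "count"])
    .reset_index()
)

MIN_LISTINGS = 2  # hypothetical threshold; choose one that suits the real data

# Drop thinly populated neighbourhoods, then pick the highest mean per borough.
reliable = stats[stats["count"] >= MIN_LISTINGS]
top = reliable.loc[reliable.groupby("neighbourhood_group")["mean"].idxmax()]
print(top)
```

In the toy data, neighbourhood "B" has the highest mean in the Bronx but only one listing, so it is excluded and "A" is reported instead.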

##Question 2: Which host was the most reviewed?
This question is less involved than the previous one. We can group by "host_id" and sum the number of reviews. The cell below shows the top 25 hosts by total number of reviews across their properties. Although this dataset doesn't contain ratings for each property, this information can be useful to a potential customer deciding where to stay, provided they can find the host's listings on the website.

In [95]:
popular_hosts=data.groupby('host_id').number_of_reviews.sum().sort_values(ascending=False)
popular_hosts[:25] 

host_id
37312959     2273
344035       2205
26432133     2017
35524316     1971
40176101     1818
4734398      1798
16677326     1355
6885157      1346
219517861    1281
23591164     1269
59529529     1229
47621202     1205
22959695     1157
58391491     1154
21641206     1062
137814       1059
156948703    1052
156684502    1046
3441272      1013
7831209       970
2680820       959
50600973      949
417504        935
303939        915
22384027      902
Name: number_of_reviews, dtype: int64
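
Host IDs alone are hard to read. Since the dataset also carries a `host_name` column, a possible refinement is to group by both columns and report the number of listings next to the review total. This is a sketch on made-up hosts, not output from the real data; the `total_reviews` and `listings` column names are introduced here for illustration.

```python
import pandas as pd

# Made-up hosts that reuse the dataset's column names.
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 3],
    "host_name": ["Ana", "Ana", "Ben", "Cy"],
    "number_of_reviews": [10, 5, 40, 0],
})

host_reviews = (
    # dropna=False keeps hosts whose name is missing (needs pandas 1.1 or later).
    toy.groupby(["host_id", "host_name"], dropna=False)
    .agg(
        total_reviews=("number_of_reviews", "sum"),
        listings=("number_of_reviews", "count"),
    )
    .reset_index()
    .sort_values("total_reviews", ascending=False)
)
print(host_reviews.head(25))
```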

##Question 3: Which accommodation type was the most popular?
The accommodation types are "Entire home/apt", "Private room", and "Shared room". One way to approach this question is to see which type had the highest average number of reviews. The "reviews per month" data may also be useful for this purpose.

In [103]:
print(data.groupby('room_type').number_of_reviews.mean().sort_values(ascending=False))
print('\n')
print(data.groupby('room_type').reviews_per_month.mean().sort_values(ascending=False))
print('\n')
print(data.groupby('room_type').number_of_reviews.sum().sort_values(ascending=False)) #total reviews per room type

room_type
Private room       24.105883
Entire home/apt    22.847459
Shared room        16.622625
Name: number_of_reviews, dtype: float64


room_type
Shared room        1.474775
Private room       1.444981
Entire home/apt    1.306754
Name: reviews_per_month, dtype: float64


room_type
Entire home/apt    580394
Private room       537971
Shared room         19249
Name: number_of_reviews, dtype: int64


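It appears that private rooms had the highest average number of reviews per listing, while shared rooms had the highest average number of reviews per month. However, the total number of reviews for each category shows that shared rooms were reviewed far less often overall (19,249 reviews, against 580,394 for entire homes/apartments and 537,971 for private rooms). Dividing each total by its per-listing average also implies far fewer shared-room listings: roughly 1,160, against about 25,400 entire homes/apartments and 22,300 private rooms. This suggests that entire homes/apartments and private rooms were the most popular; without actual booking data, however, this remains an educated guess.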
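
One caveat on the "reviews per month" averages: in the first five rows shown earlier, the listing with 0 reviews has an empty `reviews_per_month`, and pandas skips missing values when computing a mean. The per-month averages may therefore describe only listings that have been reviewed at least once. The sketch below, on made-up rows, puts listing counts, review totals, and both versions of the per-month average side by side. Treating a missing value as 0 reviews per month is an assumption, not something the dataset states.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse the dataset's column names.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "number_of_reviews": [10, 0, 6, 4, 3],
    "reviews_per_month": [1.0, np.nan, 0.5, 0.5, 2.0],
})

summary = toy.groupby("room_type").agg(
    listings=("number_of_reviews", "count"),
    total_reviews=("number_of_reviews", "sum"),
    # mean() skips NaN, so this averages only listings with at least one review.
    mean_rpm_reviewed_only=("reviews_per_month", "mean"),
)

# Assumption: a missing reviews_per_month means 0 reviews per month.
rpm_filled = toy["reviews_per_month"].fillna(0)
summary["mean_rpm_all"] = rpm_filled.groupby(toy["room_type"]).mean()
print(summary)
```

In the toy data, the "Entire home/apt" average falls from 1.0 to 0.5 once the never-reviewed listing is counted, which shows how much the choice can matter.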