# Section 1: Overview
>## Background
This notebook is to explore the datasets of airbnb markets. The resource includes the listing file, review file and location file called neighbourhood. This analysis will focus on the listing file, which include information such as, location, listing keyword, host id & name, room type, price, review and etc. 
>## Use cases 
This notebook aims to explore the data and analyse them from aspects of prices, users and listings. Through the analysis, this notebook is expected to understand several features of airbnb hence satisfy the needs of stakeholders(users and hosts).
>>### 1. Listing prices and its location distribution
>>### 2. User and host profile analysis
>>### 3. Listing keyword analysis





# Section 2: Preprocession

In [1]:
## Import library to support the analysis
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

In [1]:
import geopandas

ModuleNotFoundError: No module named 'geopandas'

In [2]:
## Read in file
airBnb_listing = pd.read_csv('DataSource_AirBnb/listings.csv')
airBnb_listing.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,9835,Beautiful Room & House,33057,Manju,,Manningham,-37.77268,145.09213,Private room,60,1,4,2015-09-12,0.03,1,365,0,
1,12936,St Kilda 1BR+BEACHSIDE+BALCONY+WIFI+AC,50121,The A2C Team,,Port Phillip,-37.85999,144.97662,Entire home/apt,95,3,42,2020-03-15,0.3,10,0,0,
2,33111,Million Dollar Views Over Melbourne,143550,Paul,,Melbourne,-37.81997,144.96834,Private room,1000,1,2,2012-01-27,0.02,1,265,0,
3,38271,Melbourne - Old Trafford Apartment,164193,Daryl & Dee,,Casey,-38.05725,145.33936,Entire home/apt,110,1,171,2021-12-16,1.26,1,313,18,
4,41836,CLOSE TO CITY & MELBOURNE AIRPORT,182833,Diana,,Darebin,-37.69729,145.00082,Private room,40,7,159,2018-08-22,1.17,2,0,0,


In [3]:
airBnb_listing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17409 entries, 0 to 17408
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              17409 non-null  int64  
 1   name                            17407 non-null  object 
 2   host_id                         17409 non-null  int64  
 3   host_name                       17405 non-null  object 
 4   neighbourhood_group             0 non-null      float64
 5   neighbourhood                   17409 non-null  object 
 6   latitude                        17409 non-null  float64
 7   longitude                       17409 non-null  float64
 8   room_type                       17409 non-null  object 
 9   price                           17409 non-null  int64  
 10  minimum_nights                  17409 non-null  int64  
 11  number_of_reviews               17409 non-null  int64  
 12  last_review                     

In [4]:
airBnb_listing.isnull().sum()

id                                    0
name                                  2
host_id                               0
host_name                             4
neighbourhood_group               17409
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                        3925
reviews_per_month                  3925
calculated_host_listings_count        0
availability_365                      0
number_of_reviews_ltm                 0
license                           17409
dtype: int64

## 2. Data cleaning

In [5]:
## check the rows with missing listing name
airBnb_listing.loc[airBnb_listing['name'].isna()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
1021,5406148,,27981790,Celine,,Melbourne,-37.82165,144.95687,Entire home/apt,125,1,0,,,1,0,0,
3804,15822412,,39805494,Bernadette,,Bayside,-37.89076,144.99128,Private room,120,1,17,2019-05-18,0.29,1,88,0,


**There seems to be no way of filling the information from inference of other cells. Hence fill them with 'unknown'**

In [16]:
airBnb_listing['name'].fillna('unknown', inplace=True)
airBnb_listing['name'].isna().sum()

0

In [6]:
## check the rows with missing host name
airBnb_listing.loc[airBnb_listing['host_name'].isna()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
1203,6147642,Large Room Just Across Crown Hotel,31891911,,,Melbourne,-37.82445,144.96246,Private room,75,1,3,2015-07-12,0.04,1,0,0,
2750,11999648,"Private Bedroom, central as it gets",64163227,,,Melbourne,-37.81085,144.96711,Private room,68,1,1,2016-04-12,0.01,2,0,0,
7616,25766554,Lucky Home,193648165,,,Bayside,-37.9826,145.05115,Shared room,80,1,1,2018-08-10,0.02,1,88,0,
9658,32327308,Rowville Beauty,64163227,,,Knox,-37.93473,145.22225,Entire home/apt,225,2,3,2019-04-22,0.09,2,0,0,


**There seems to be no other listings with the same listing ids as the missing host name ones. Hence 'unknown' will be used.**

In [17]:
airBnb_listing['host_name'].fillna('unknown', inplace=True)
airBnb_listing['host_name'].isna().sum()

0

In [18]:
## Read in review file
airBnb_review = pd.read_csv('DataSource_AirBnb/reviews.csv')
airBnb_review.head()

Unnamed: 0,listing_id,date
0,9835,2011-05-24
1,9835,2013-02-26
2,9835,2014-12-08
3,9835,2015-09-12
4,12936,2010-08-04


In [10]:
airBnb_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 469148 entries, 0 to 469147
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   listing_id  469148 non-null  int64 
 1   date        469148 non-null  object
dtypes: int64(1), object(1)
memory usage: 7.2+ MB


In [35]:
## check the price range
airBnb_listing['price'].describe()

count    17409.000000
mean       190.626458
std        411.494624
min          0.000000
25%         75.000000
50%        122.000000
75%        200.000000
max      15000.000000
Name: price, dtype: float64

In [36]:
px.box(airBnb_listing, x='price')

In [25]:
## check the price 0s
price0 = airBnb_listing.loc[airBnb_listing['price']==0]
price0

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
12813,41278626,Free bushfire accommodation for family in need,175946909,Shaun,,Yarra Ranges,-37.78599,145.38469,Private room,0,1,0,,,1,0,0,


**No other inference about the price hence this listing will be dropped in this notebook**

In [38]:
## check the extremely above uper bound 
priceOut = airBnb_listing.loc[airBnb_listing['price']>387]
priceEx.describe()

Unnamed: 0,id,host_id,neighbourhood_group,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
count,189.0,189.0,0.0,189.0,189.0,189.0,189.0,189.0,105.0,189.0,189.0,189.0,0.0
mean,33164830.0,122113200.0,,-37.828902,145.060108,2708.153439,16.751323,14.89418,0.724952,10.693122,184.402116,1.47619,
std,15732170.0,122051500.0,,0.088367,0.171424,2707.067604,57.665087,49.918971,1.036859,16.314431,145.46436,4.697862,
min,997517.0,1695158.0,,-38.18977,144.85772,1001.0,1.0,0.0,0.01,1.0,0.0,0.0,
25%,19753290.0,12336760.0,,-37.85269,144.96138,1280.0,1.0,0.0,0.09,1.0,27.0,0.0,
50%,36834890.0,64874110.0,,-37.82428,144.98765,1680.0,2.0,1.0,0.29,2.0,180.0,0.0,
75%,47454290.0,192661300.0,,-37.81026,145.06623,2712.0,7.0,6.0,0.9,11.0,349.0,1.0,
max,54168970.0,436051700.0,,-37.59966,145.73929,15000.0,400.0,401.0,5.09,53.0,365.0,38.0,


In [39]:
## clean outliers and 0
airBnb_listing_priceCleaned = airBnb_listing.loc[(airBnb_listing['price']>0)
                                                 & (airBnb_listing['price']<=387)]

In [40]:
airBnb_listing_priceCleaned.describe()

Unnamed: 0,id,host_id,neighbourhood_group,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
count,16024.0,16024.0,0.0,16024.0,16024.0,16024.0,16024.0,16024.0,12512.0,16024.0,16024.0,16024.0,0.0
mean,29171610.0,113513500.0,,-37.826888,145.014131,132.043372,7.039191,27.763542,0.96945,8.240327,127.461495,4.920432,
std,14970030.0,110435500.0,,0.071504,0.142913,79.789695,35.517971,54.442116,1.346857,21.549172,141.346706,11.6794,
min,9835.0,9082.0,,-38.22411,144.54161,13.0,1.0,0.0,0.01,1.0,0.0,0.0,
25%,17497960.0,23829000.0,,-37.85462,144.957527,70.0,1.0,1.0,0.12,1.0,0.0,0.0,
50%,29467260.0,71109030.0,,-37.818405,144.97941,115.0,2.0,5.0,0.5,1.0,72.0,0.0,
75%,41387220.0,174864600.0,,-37.80059,145.026832,175.0,3.0,29.0,1.32,4.0,282.0,4.0,
max,54189530.0,438118500.0,,-37.4823,145.83784,387.0,1125.0,666.0,30.85,152.0,365.0,306.0,


In [59]:
airBnb_listing_priceCleaned['availability_365'].unique()

array([365,   0, 313, 308, 145, 299, 357, 364,   2, 306, 337, 356, 129,
       336, 260, 124,   3, 329,  55,   8, 275, 295, 252, 327,  90, 256,
       343, 271, 361, 291, 359, 358,  70, 338,  80, 154, 328, 178, 315,
       353, 363, 286, 310, 342, 172,  60, 305,  37,  58, 181, 254, 189,
       196, 314,  99, 179,  26, 175,  98, 228, 287, 335, 183, 173,  25,
       194, 244, 298, 158,  47, 307,  88, 106, 208, 227, 159, 167, 135,
       354, 341,  40, 294, 283, 273, 213, 230, 248,  67, 279,  63, 282,
       150,   7,   9, 362, 350, 192, 317, 331, 191, 235, 347,  66, 334,
       161, 200, 166, 360, 133, 188, 257, 262, 261,  59, 348, 195, 155,
       351, 222, 100, 233,  89,   1,  81, 333,  85, 311, 344, 164, 325,
       177, 111, 119, 160, 204, 349, 319,  77,  53,  57, 210, 339, 346,
       162,  48, 221, 115,  18, 243, 125,  15, 253, 267, 301, 355,  97,
        11, 126, 289, 312, 138, 242, 345, 247, 352, 156, 137,  32, 277,
       330, 292,  21, 131,  73, 212,  64, 207, 239, 293, 265, 17

In [62]:
airBnb_listing_priceCleaned['availability_365'].describe()

count    16024.000000
mean       127.461495
std        141.346706
min          0.000000
25%          0.000000
50%         72.000000
75%        282.000000
max        365.000000
Name: availability_365, dtype: float64

In [61]:
px.scatter(airBnb_listing_priceCleaned, x='availability_365', y='price')

In [63]:
## check the 0 availability listing
airBnb_listing_priceCleaned.loc[airBnb_listing_priceCleaned['availability_365']==0]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
1,12936,St Kilda 1BR+BEACHSIDE+BALCONY+WIFI+AC,50121,The A2C Team,,Port Phillip,-37.85999,144.97662,Entire home/apt,95,3,42,2020-03-15,0.30,10,0,0,
4,41836,CLOSE TO CITY & MELBOURNE AIRPORT,182833,Diana,,Darebin,-37.69729,145.00082,Private room,40,7,159,2018-08-22,1.17,2,0,0,
8,66754,Richmond CITY EDGE 60s COOL 1BR+WIFI+AC,50121,The A2C Team,,Yarra,-37.82127,144.99408,Entire home/apt,94,3,70,2020-03-14,0.53,10,0,0,
10,68411,Large Bayside suburban house,334095,Alec,,Bayside,-37.93378,145.01600,Entire home/apt,300,1,0,,,1,0,0,
16,80986,Richmond CENTRAL PARK EDGE 1BR+WIFI,50121,The A2C Team,,Yarra,-37.81541,145.00157,Entire home/apt,84,3,80,2020-03-24,0.64,10,0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16977,53690637,Queesplace High floor 2BR2BA+WIFI,68749128,City,,Melbourne,-37.81155,144.95862,Entire home/apt,170,2,3,2022-01-01,3.00,24,0,3,
16985,53696858,Stylish 1BRBA apt in Melbourne CBD+WIFI,68749128,City,,Melbourne,-37.80990,144.95995,Entire home/apt,135,1,0,,,24,0,0,
17037,53757759,Art & Luxury.,56782577,Rhonda,,Port Phillip,-37.86255,144.97740,Entire home/apt,364,2,0,,,1,0,0,
17161,53857948,"Trendy, Modern Apartment in Heart of St Kilda",279001183,MadeComfy,,Port Phillip,-37.85770,144.98496,Entire home/apt,155,1,0,,,123,0,0,


**0 availability has no inference within dataset to explain. Hence, this analysis will focus on the major part of dataset without it.**

In [64]:
airbnbListing = airBnb_listing_priceCleaned.loc[airBnb_listing_priceCleaned['availability_365']>0]

In [67]:
airbnbListing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9776 entries, 0 to 17407
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              9776 non-null   int64  
 1   name                            9776 non-null   object 
 2   host_id                         9776 non-null   int64  
 3   host_name                       9776 non-null   object 
 4   neighbourhood_group             0 non-null      float64
 5   neighbourhood                   9776 non-null   object 
 6   latitude                        9776 non-null   float64
 7   longitude                       9776 non-null   float64
 8   room_type                       9776 non-null   object 
 9   price                           9776 non-null   int64  
 10  minimum_nights                  9776 non-null   int64  
 11  number_of_reviews               9776 non-null   int64  
 12  last_review                     7

# Section 3: Data analysis
## 1. Price analysis

In [65]:
px.histogram(airbnbListing, x='price')

In [66]:
airbnbListing['price'].describe()

count    9776.000000
mean      148.389218
std        82.298546
min        16.000000
25%        85.000000
50%       133.000000
75%       196.000000
max       387.000000
Name: price, dtype: float64

In [68]:
## check the relationship inner features
corr=airbnbListing.corr()
px.imshow(corr, color_continuous_scale='Blues')

**Feature with most obvious relations to price is longitude.**

In [69]:
## check the price spread on neighbourhood & longitude
px.box(airbnbListing, x = 'neighbourhood', y = 'price')

In [70]:
px.scatter(airbnbListing, x='longitude', y='price')

**Price varies obvious with neighbourhood. No obvious pattern with longitude.**

In [71]:
## The listing spread with location
airBnb_pie = airbnbListing.groupby('neighbourhood').size().reset_index(name='count')
airBnb_pie

Unnamed: 0,neighbourhood,count
0,Banyule,133
1,Bayside,144
2,Boroondara,257
3,Brimbank,84
4,Cardinia,95
5,Casey,112
6,Darebin,241
7,Frankston,112
8,Glen Eira,238
9,Greater Dandenong,84


In [72]:
## visualize the location spread 
fig = px.pie(airBnb_pie, values='count', names='neighbourhood')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

## 2. User analysis

In [73]:
airbnbListing['host_name'].describe()

count        9776
unique       2870
top       Valeria
freq          152
Name: host_name, dtype: object

In [90]:
airbListing_Hname = airbnbListing.groupby('host_name')['name'].count().reset_index(name='count')
airbListing_Hname['count'].describe()

count    2870.000000
mean        3.406272
std         7.208741
min         1.000000
25%         1.000000
50%         1.000000
75%         3.000000
max       152.000000
Name: count, dtype: float64

**Most host has 1-3 listing while 25% of them has more than 3 listing.**

In [91]:
px.box(airbListing_Hname, x='count')

In [93]:
## check the amount after upperfence
hostName_extreme = airbListing_Hname.loc[airbListing_Hname['count']>6]

In [94]:
## visualize it
px.treemap(hostName_extreme, path=['host_name'], values='count')

In [86]:
airbnbListing.groupby(['host_id', 'host_name'])['name'].count().reset_index(name='count')


Unnamed: 0,host_id,host_name,count
0,9082,Dennis,1
1,26687,Rachel,1
2,33057,Manju,1
3,112497,Fleur,1
4,117431,Lorraine,2
...,...,...,...
5599,437807374,Irineu,1
5600,438093729,Alex,1
5601,438112924,Nicholas,1
5602,438117736,Martin,1


**There is almost twice host id of host name.**

In [None]:
## check the df with host name and more listing 

## 3. Listing analysis