# Section 1: Overview
>## Background
This notebook explores Airbnb market datasets. The source data comprises a listings file, a reviews file, and a location file called neighbourhood. The analysis focuses on the listings file, which includes each listing's location, name (keywords), host id and name, room type, price, and review statistics.
>## Use cases 
This notebook explores the data from three angles: prices, users, and listings. The analysis aims to identify features of Airbnb listings that matter to its stakeholders (users and hosts).
>>### 1. Listing prices and their location distribution
>>### 2. User and host profile analysis
>>### 3. Listing keyword analysis





# Section 2: Preprocessing
## 1. Data exploration

In [1]:
## Import libraries to support the analysis
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

In [1]:
import geopandas

ModuleNotFoundError: No module named 'geopandas'

In [2]:
## Read in file
airBnb_listing = pd.read_csv('DataSource_AirBnb/listings.csv')
airBnb_listing.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,9835,Beautiful Room & House,33057,Manju,,Manningham,-37.77268,145.09213,Private room,60,1,4,2015-09-12,0.03,1,365,0,
1,12936,St Kilda 1BR+BEACHSIDE+BALCONY+WIFI+AC,50121,The A2C Team,,Port Phillip,-37.85999,144.97662,Entire home/apt,95,3,42,2020-03-15,0.3,10,0,0,
2,33111,Million Dollar Views Over Melbourne,143550,Paul,,Melbourne,-37.81997,144.96834,Private room,1000,1,2,2012-01-27,0.02,1,265,0,
3,38271,Melbourne - Old Trafford Apartment,164193,Daryl & Dee,,Casey,-38.05725,145.33936,Entire home/apt,110,1,171,2021-12-16,1.26,1,313,18,
4,41836,CLOSE TO CITY & MELBOURNE AIRPORT,182833,Diana,,Darebin,-37.69729,145.00082,Private room,40,7,159,2018-08-22,1.17,2,0,0,


In [3]:
airBnb_listing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17409 entries, 0 to 17408
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              17409 non-null  int64  
 1   name                            17407 non-null  object 
 2   host_id                         17409 non-null  int64  
 3   host_name                       17405 non-null  object 
 4   neighbourhood_group             0 non-null      float64
 5   neighbourhood                   17409 non-null  object 
 6   latitude                        17409 non-null  float64
 7   longitude                       17409 non-null  float64
 8   room_type                       17409 non-null  object 
 9   price                           17409 non-null  int64  
 10  minimum_nights                  17409 non-null  int64  
 11  number_of_reviews               17409 non-null  int64  
 12  last_review                     

In [4]:
airBnb_listing.isnull().sum()

id                                    0
name                                  2
host_id                               0
host_name                             4
neighbourhood_group               17409
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                        3925
reviews_per_month                  3925
calculated_host_listings_count        0
availability_365                      0
number_of_reviews_ltm                 0
license                           17409
dtype: int64
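
The null counts show that `neighbourhood_group` and `license` are missing in all 17409 rows, so they carry no information for this analysis. A minimal sketch of detecting and dropping such columns follows. It uses a toy frame (`sample`, a hypothetical stand-in for `airBnb_listing`, with values copied from the preview) rather than the real file:

```python
import numpy as np
import pandas as pd

# Toy stand-in for airBnb_listing; only a few columns and rows are reproduced
sample = pd.DataFrame({
    'id': [9835, 12936, 33111],
    'neighbourhood_group': [np.nan, np.nan, np.nan],
    'price': [60, 95, 1000],
    'license': [np.nan, np.nan, np.nan],
})

# Collect the columns in which every value is missing
empty_cols = [col for col in sample.columns if sample[col].isna().all()]

# Drop them; the original frame is left untouched
cleaned = sample.drop(columns=empty_cols)
print(empty_cols)
print(cleaned.columns.tolist())
```

Whether to drop these columns in the notebook itself is a judgment call; the sketch only shows one way to do it.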

In [5]:
airBnb_listing.loc[airBnb_listing['name'].isna()]


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
1021,5406148,,27981790,Celine,,Melbourne,-37.82165,144.95687,Entire home/apt,125,1,0,,,1,0,0,
3804,15822412,,39805494,Bernadette,,Bayside,-37.89076,144.99128,Private room,120,1,17,2019-05-18,0.29,1,88,0,


In [6]:
airBnb_listing.loc[airBnb_listing['host_name'].isna()]


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
1203,6147642,Large Room Just Across Crown Hotel,31891911,,,Melbourne,-37.82445,144.96246,Private room,75,1,3,2015-07-12,0.04,1,0,0,
2750,11999648,"Private Bedroom, central as it gets",64163227,,,Melbourne,-37.81085,144.96711,Private room,68,1,1,2016-04-12,0.01,2,0,0,
7616,25766554,Lucky Home,193648165,,,Bayside,-37.9826,145.05115,Shared room,80,1,1,2018-08-10,0.02,1,88,0,
9658,32327308,Rowville Beauty,64163227,,,Knox,-37.93473,145.22225,Entire home/apt,225,2,3,2019-04-22,0.09,2,0,0,


In [10]:
airBnb_listing.loc[airBnb_listing['host_id']==31891911]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
1203,6147642,Large Room Just Across Crown Hotel,31891911,,,Melbourne,-37.82445,144.96246,Private room,75,1,3,2015-07-12,0.04,1,0,0,


In [11]:
airBnb_listing.loc[airBnb_listing['host_id']==64163227]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
2750,11999648,"Private Bedroom, central as it gets",64163227,,,Melbourne,-37.81085,144.96711,Private room,68,1,1,2016-04-12,0.01,2,0,0,
9658,32327308,Rowville Beauty,64163227,,,Knox,-37.93473,145.22225,Entire home/apt,225,2,3,2019-04-22,0.09,2,0,0,


In [12]:
airBnb_listing.loc[airBnb_listing['host_id']==193648165]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
7616,25766554,Lucky Home,193648165,,,Bayside,-37.9826,145.05115,Shared room,80,1,1,2018-08-10,0.02,1,88,0,


In [41]:
airBnb_listing['host_name'].duplicated().value_counts()

True     12618
False     4791
Name: host_name, dtype: int64

In [42]:
airBnb_listing['host_id'].duplicated().value_counts()

False    11352
True      6057
Name: host_id, dtype: int64
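
The two counts above suggest that host names are shared far more often than host ids (4791 first occurrences of a name versus 11352 of an id), so `host_id` looks like the safer key for identifying a host. A small sketch of how `duplicated()` relates to the number of distinct values, and of how to spot names shared by several hosts. The ids and names in `sample` are made up for illustration:

```python
import pandas as pd

# Toy data: host ids and names are hypothetical
sample = pd.DataFrame({
    'host_id':   [1, 1, 2, 3, 3, 3],
    'host_name': ['Alex', 'Alex', 'Alex', 'Kate', 'Kate', 'Kate'],
})

# duplicated() flags every repeat after the first occurrence,
# so the number of False flags equals the number of distinct values
n_hosts = int((~sample['host_id'].duplicated()).sum())

# Number of listings per host
listings_per_host = sample['host_id'].value_counts()

# Distinct hosts behind each name; more than one means the name is ambiguous
hosts_per_name = sample.groupby('host_name')['host_id'].nunique()
shared_names = hosts_per_name[hosts_per_name > 1].index.tolist()
print(n_hosts, shared_names)
```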

In [51]:
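# Count listings per (host_id, host_name) pair.
# groupby drops rows whose key is NaN by default, so listings without a host_name are excluded.
hostName_ref = airBnb_listing.groupby(['host_id', 'host_name'])['name'].count().reset_index(name='count')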


In [59]:
hostName_ref[hostName_ref['count']==0]

InvalidIndexError: (0        False
1        False
2        False
3        False
4        False
         ...  
11344    False
11345    False
11346    False
11347    False
11348    False
Name: count, Length: 11349, dtype: bool,          host_id host_name
0           9082    Dennis
1          18785      Kate
2          26687    Rachel
3          33057     Manju
4          40864      Jane
...          ...       ...
11344  437807374    Irineu
11345  438093729      Alex
11346  438112924  Nicholas
11347  438117736    Martin
11348  438118547      Linh

[11349 rows x 2 columns])

In [46]:

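# Map each host_id to its host name (a plain string, not a one-element list)
dict_hostName = dict(zip(hostName_ref['host_id'], hostName_ref['host_name']))


In [47]:
def isNaN(value):
    # NaN is the only value that is not equal to itself
    return value != value

In [48]:
# Fill missing host names from the lookup; .loc avoids chained assignment
for r, name in enumerate(airBnb_listing['host_name'].values):
    host_id = airBnb_listing.loc[r, 'host_id']
    if isNaN(name) and host_id in dict_hostName:
        airBnb_listing.loc[r, 'host_name'] = dict_hostName[host_id]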

In [50]:
airBnb_listing['host_name'].isna().sum()

4
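
The count stays at 4. A likely reason, judging from the per-host lookups earlier, is that none of the three affected hosts has another listing with a name recorded, so there is nothing to copy from. A sketch of checking this directly, with a toy frame that reproduces the four affected rows plus one complete row from the preview (`sample`, `known` and `recoverable` are illustrative names):

```python
import numpy as np
import pandas as pd

# The four listings with a missing host_name, plus one complete row
sample = pd.DataFrame({
    'host_id': [31891911, 64163227, 193648165, 64163227, 33057],
    'host_name': [np.nan, np.nan, np.nan, np.nan, 'Manju'],
})

# Hosts for which at least one listing records a name
known = (
    sample.dropna(subset=['host_name'])
          .drop_duplicates('host_id')
          .set_index('host_id')['host_name']
)

# For each listing with a missing name, is a name available elsewhere?
missing = sample['host_name'].isna()
recoverable = sample.loc[missing, 'host_id'].isin(known.index)
n_recoverable = int(recoverable.sum())
print(n_recoverable)
```

If `n_recoverable` is zero on the real data as well, the missing names cannot be filled from this file and would need a placeholder or another source.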

In [8]:
## Read in file
airBnb_review = pd.read_csv('DataSource_AirBnb/reviews.csv')
airBnb_review.head()

Unnamed: 0,listing_id,date
0,9835,2011-05-24
1,9835,2013-02-26
2,9835,2014-12-08
3,9835,2015-09-12
4,12936,2010-08-04


In [10]:
airBnb_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 469148 entries, 0 to 469147
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   listing_id  469148 non-null  int64 
 1   date        469148 non-null  object
dtypes: int64(1), object(1)
memory usage: 7.2+ MB
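
The `date` column is read as `object`, that is, plain strings. If later steps need date arithmetic or grouping by year, one option is to parse it with `pd.to_datetime`. A sketch on the five preview rows (`reviews` is a stand-in name, not the notebook's variable):

```python
import pandas as pd

# The five rows shown in the preview of reviews.csv
reviews = pd.DataFrame({
    'listing_id': [9835, 9835, 9835, 9835, 12936],
    'date': ['2011-05-24', '2013-02-26', '2014-12-08', '2015-09-12', '2010-08-04'],
})

# Parse the ISO-formatted strings into datetime values
reviews['date'] = pd.to_datetime(reviews['date'], format='%Y-%m-%d')

# Example uses: reviews per calendar year and first review per listing
reviews_per_year = reviews.groupby(reviews['date'].dt.year).size()
first_review = reviews.groupby('listing_id')['date'].min()
print(reviews_per_year)
print(first_review)
```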


## 2. Data cleaning

In [None]:
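# Join each review to its listing; the key is assumed to be `id` (listings) = `listing_id` (reviews)
airbnb = pd.merge(airBnb_listing, airBnb_review, left_on='id', right_on='listing_id')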
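
For the merge of listings and reviews, a minimal sketch on toy data. It assumes that `id` in the listings file corresponds to `listing_id` in the reviews file, which the shared values 9835 and 12936 in the two previews suggest. The listing with id 99999 is hypothetical and stands for a listing without reviews:

```python
import pandas as pd

# Toy listings; id 99999 is a made-up listing with no reviews
listings = pd.DataFrame({'id': [9835, 12936, 99999], 'price': [60, 95, 120]})

# The five review rows shown in the preview
reviews = pd.DataFrame({
    'listing_id': [9835, 9835, 9835, 9835, 12936],
    'date': ['2011-05-24', '2013-02-26', '2014-12-08', '2015-09-12', '2010-08-04'],
})

# A left join keeps listings without reviews; validate raises an error
# if `id` turns out not to be unique in the listings table
merged = pd.merge(
    listings, reviews,
    how='left', left_on='id', right_on='listing_id',
    validate='one_to_many',
)
n_unreviewed = int(merged['listing_id'].isna().sum())
print(len(merged), n_unreviewed)
```

The result has one row per review, so listing-level columns such as `price` repeat for each review. A default inner join would instead drop listings that have no reviews, which may or may not be what the analysis needs.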

# Section 3: Data analysis
## 1. Price analysis

## 2. User analysis

## 3. Listing analysis