# Airbnb - Business Understanding/Data Understanding (Vancouver)

Initial exploratory analysis of the dataset to build an understanding of it. The raw data is read in and examined using basic statistics and visualizations. Some basic processing is done to allow this.

## Imports

In [4]:
import pandas as pd
import numpy as np

## Functions

In [5]:
#No functions used in this initial analysis

## Basic Setup

Read data, create global variables

In [32]:
#Read in .csv files
Vancouver_Cal = pd.read_csv('Data/calendar.csv.gz')
Vancouver_List = pd.read_csv('Data/listings.csv.gz')
Vancouver_Rev = pd.read_csv('Data/reviews.csv.gz')

## Business Understanding

Airbnb is a business that lets users rent out properties to one another directly. Airbnb generates revenue when users host and book on its platform.

Initial questions are as follows:

 - Does the price vary seasonally?
 - Which Airbnb listing features have the largest impact on price?
 - Is it possible to predict prices based on features and season?
 
## Data Understanding

Ensure that the data is of sufficient quality and quantity to answer the above questions.

### Vancouver

__Calendar.csv Exploratory Analysis__

In [8]:
#First 5 rows
Vancouver_Cal.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,257163,2019-09-17,f,$65.00,$65.00,3,14
1,257163,2019-09-18,f,$65.00,$65.00,3,14
2,5731,2019-09-17,f,$40.00,$40.00,2,30
3,5731,2019-09-18,f,$40.00,$40.00,2,30
4,5731,2019-09-19,f,$40.00,$40.00,2,30
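
The preview shows `price` stored as a string with a currency symbol and `available` stored as a `t`/`f` flag, so both need converting before any price statistics can be computed. A minimal sketch of that conversion follows. It runs on a small made-up frame rather than the real file, and the helper name `clean_calendar` and the second sample row are assumptions for illustration only.

```python
import pandas as pd

# Illustrative frame mimicking the calendar.csv layout (not real data)
sample_cal = pd.DataFrame({
    "listing_id": [257163, 5731],
    "date": ["2019-09-17", "2019-09-17"],
    "available": ["f", "t"],
    "price": ["$65.00", "$1,250.00"],
})

def clean_calendar(df):
    """Return a copy with numeric prices, boolean availability and datetime dates."""
    out = df.copy()
    # Strip the dollar sign and thousands separator, then cast to float
    out["price"] = out["price"].str.replace(r"[$,]", "", regex=True).astype(float)
    # Map the t/f flags to real booleans
    out["available"] = out["available"].map({"t": True, "f": False})
    out["date"] = pd.to_datetime(out["date"])
    return out

cleaned_cal = clean_calendar(sample_cal)
```

The same function could be applied to `Vancouver_Cal`, assuming its columns follow the format shown in the preview.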


In [11]:
#Proportion of NaN's
pd.isnull(Vancouver_Cal).sum()/len(Vancouver_Cal)*100

listing_id        0.0
date              0.0
available         0.0
price             0.0
adjusted_price    0.0
minimum_nights    0.0
maximum_nights    0.0
dtype: float64
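
Since the same missing-value check is repeated for each file, it could be wrapped in a small helper. The sketch below is one possible form. `missing_percent` is a hypothetical name, and the sample frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_percent(df):
    """Percentage of missing values per column, largest first."""
    # The mean of a boolean mask is the fraction of True values
    return (df.isna().mean() * 100).sort_values(ascending=False)

# Illustrative frame: column "a" is half missing, column "b" is complete
sample = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", "z", "w"],
})
missing = missing_percent(sample)
```

Taking the mean of `isna()` gives the same result as dividing the sum by `len(df)`.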

In [12]:
#Count # of ID's
len(Vancouver_Cal.listing_id.unique())

6176

In [15]:
#Date range
print(pd.to_datetime(Vancouver_Cal.date.min()))
print(pd.to_datetime(Vancouver_Cal.date.max()))

2019-09-17 00:00:00
2020-09-15 00:00:00


In [16]:
#Number of days
len(Vancouver_Cal.date.unique())

365

In [20]:
#Check that every listing has a row for each of the 365 days
(Vancouver_Cal.groupby(['listing_id']).count() == 365)['date'].value_counts()

True    6176
Name: date, dtype: int64
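
Counting rows per listing confirms 365 entries each, but a duplicated date would still pass that test. A slightly stricter variant counts distinct dates per listing instead. The sketch below uses a small made-up frame, and `has_full_coverage` is a hypothetical helper name.

```python
import pandas as pd

# Illustrative data: listing 2 has three rows but repeats 2019-09-17
sample_cal = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 2],
    "date": ["2019-09-17", "2019-09-18", "2019-09-19",
             "2019-09-17", "2019-09-17", "2019-09-19"],
})

def has_full_coverage(cal):
    """Boolean Series per listing: True if it covers every distinct date in the file."""
    expected_days = cal["date"].nunique()
    days_per_listing = cal.groupby("listing_id")["date"].nunique()
    return days_per_listing == expected_days

coverage = has_full_coverage(sample_cal)
```

Here listing 1 is flagged as complete and listing 2 is not, even though both have three rows.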

__Listings.csv Exploratory Analysis__

In [21]:
Vancouver_List.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,5731,https://www.airbnb.com/rooms/5731,20190917034805,2019-09-17,Mai Lodging - Room Single bed 5,The rental area has been remodeled as of April...,"Located right next to a beautiful open park, K...",The rental area has been remodeled as of April...,none,,...,t,f,strict_14_with_grace_period,f,f,6,2,4,0,0.81
1,10080,https://www.airbnb.com/rooms/10080,20190917034805,2019-09-17,D1 - Million Dollar View 2 BR,"Stunning two bedroom, two bathroom apartment. ...","Bed setup: 2 x queen, option to add up to 2 tw...","Stunning two bedroom, two bathroom apartment. ...",none,,...,f,f,strict_14_with_grace_period,f,f,38,38,0,0,0.17
2,13188,https://www.airbnb.com/rooms/13188,20190917034805,2019-09-17,Garden level studio in ideal loc.,Garden level studio suite with garden patio - ...,Very Close (3min walk) to Nat Bailey baseball ...,Garden level studio suite with garden patio - ...,none,The uber hip Main street area is a short walk ...,...,t,f,moderate,f,f,2,2,0,0,1.87
3,13357,https://www.airbnb.com/rooms/13357,20190917034805,2019-09-17,! Wow! 2bed 2bath 1bed den Harbour View Apartm...,Very spacious and comfortable with very well k...,"Mountains and harbour view 2 bedroom,2 bath,1 ...",Very spacious and comfortable with very well k...,none,Amanzing bibrant professional neighbourhood. C...,...,f,f,strict_14_with_grace_period,t,t,3,1,2,0,0.49
4,13490,https://www.airbnb.com/rooms/13490,20190917034805,2019-09-17,Vancouver's best kept secret,This apartment rents for one month blocks of t...,"Vancouver city central, 700 sq.ft., main floor...",This apartment rents for one month blocks of t...,none,"In the heart of Vancouver, this apartment has ...",...,f,f,strict_14_with_grace_period,f,f,1,1,0,0,0.82


In [26]:
#All features
Vancouver_List.columns.to_list()

['id',
 'listing_url',
 'scrape_id',
 'last_scraped',
 'name',
 'summary',
 'space',
 'description',
 'experiences_offered',
 'neighborhood_overview',
 'notes',
 'transit',
 'access',
 'interaction',
 'house_rules',
 'thumbnail_url',
 'medium_url',
 'picture_url',
 'xl_picture_url',
 'host_id',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_listings_count',
 'host_total_listings_count',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'street',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'city',
 'state',
 'zipcode',
 'market',
 'smart_location',
 'country_code',
 'country',
 'latitude',
 'longitude',
 'is_location_exact',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'amenities',


In [27]:
#See if number of ID's is identical to calendar.csv
len(Vancouver_List.id.unique())

6176

In [29]:
#Is each ID listed only once? (The row count should equal the 6176 unique IDs above)
Vancouver_List.shape

(6176, 106)
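
Matching counts suggest that each ID appears once and that both files cover the same listings, but equal counts alone do not prove the IDs are the same. A more direct check, sketched here on made-up frames, tests uniqueness and set equality explicitly.

```python
import pandas as pd

# Illustrative frames standing in for listings.csv and calendar.csv
sample_list = pd.DataFrame({"id": [5731, 10080, 13188]})
sample_cal = pd.DataFrame({"listing_id": [5731, 5731, 10080, 13188]})

# True if no listing ID is repeated in the listings file
ids_unique = sample_list["id"].is_unique

# True if both files refer to exactly the same set of listings
same_ids = set(sample_list["id"]) == set(sample_cal["listing_id"])
```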

In [30]:
#Check for NaN's and show highest frequency columns
data = pd.isnull(Vancouver_List).sum()/len(Vancouver_List)*100
data.sort_values(ascending=False).head(30)

host_acceptance_rate            100.000000
neighbourhood_group_cleansed    100.000000
xl_picture_url                  100.000000
medium_url                      100.000000
thumbnail_url                   100.000000
square_feet                      98.737047
weekly_price                     91.207902
monthly_price                    90.560233
notes                            49.789508
host_about                       37.435233
access                           35.994171
interaction                      31.395725
transit                          26.748705
house_rules                      26.489637
neighborhood_overview            26.408679
space                            20.709197
license                          18.636658
security_deposit                 14.054404
review_scores_checkin            13.244819
review_scores_location           13.244819
review_scores_value              13.244819
review_scores_communication      13.212435
review_scores_cleanliness        13.196244
review_scor

__Reviews.csv Exploratory Analysis__

In [33]:
Vancouver_Rev.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,5731,1609,2009-04-19,11052,Carmen,This was just perfect for a professional visit...
1,5731,1919,2009-04-30,11973,Meseret,Mai Lodging is very clean and fully furnished....
2,5731,3963,2009-06-13,18827,Rob,"Very nice place, and good location for getting..."
3,5731,3967,2009-06-13,18888,Armin,Awesome place! Best price! It's just really ni...
4,5731,5186,2009-07-07,17965,Michel,Great food right in the neighbourhood (vietnam...


In [34]:
#Is number of ID's identical to other files
len(Vancouver_Rev.listing_id.unique())

5385
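
Only 5385 of the 6176 listings appear in the reviews file. If every reviewed listing is also present in Listings.csv (an assumption not verified here), that leaves 6176 - 5385 = 791 listings without reviews. A minimal sketch of how those listings could be identified, using made-up frames:

```python
import pandas as pd

# Illustrative frames standing in for listings.csv and reviews.csv
sample_list = pd.DataFrame({"id": [5731, 10080, 13188, 13357]})
sample_rev = pd.DataFrame({"listing_id": [5731, 5731, 13188]})

# Flag the listings that have at least one review
reviewed = sample_list["id"].isin(sample_rev["listing_id"])

# IDs of listings that never appear in the reviews file
unreviewed_ids = sample_list.loc[~reviewed, "id"].tolist()
```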

In [35]:
#Date range
print(pd.to_datetime(Vancouver_Rev.date.min()))
print(pd.to_datetime(Vancouver_Rev.date.max()))

2009-04-19 00:00:00
2019-09-16 00:00:00


In [36]:
#Check for NaN's
pd.isnull(Vancouver_Rev).sum()/len(Vancouver_Rev)*100

listing_id       0.000000
id               0.000000
date             0.000000
reviewer_id      0.000000
reviewer_name    0.000000
comments         0.041078
dtype: float64

__Key Findings__

 - All three files share a listing ID (`id` in Listings.csv, `listing_id` in Calendar.csv and Reviews.csv)
 - Reviews.csv does not cover every listing ID: only 5385 of the 6176 listings appear, so the remainder presumably have no reviews yet
 - Calendar.csv and Reviews.csv have multiple rows per listing
 - Calendar.csv has no missing data
 - Some features in Listings.csv have a substantial amount of missing data (square feet, weekly and monthly price, etc.)
 - Reviews.csv goes back much further in time (to April 2009) and has very little missing data