# Airbnb - Boston and Seattle Data: First Look

In this notebook, we take a first look at the Boston and Seattle Airbnb data sets to understand their features.

In [1]:
# Import all necessary packages
import numpy as np
import pandas as pd

In [2]:
# Increase the number of displayed rows so that more records are visible at once
pd.set_option('display.max_rows', 500)

In [3]:
# Load Seattle Airbnb data

# Forward slashes avoid invalid escape sequences such as '\C' and work on every OS
seattle_calendar = pd.read_csv('Seattle/Calendar.csv')
seattle_listings = pd.read_csv('Seattle/listings.csv')

In [4]:
# Load Boston Airbnb data

# Forward slashes avoid invalid escape sequences such as '\C' and work on every OS
boston_calendar = pd.read_csv('Boston/Calendar.csv')
boston_listings = pd.read_csv('Boston/listings.csv')

In [5]:
# Preview the first rows of the Boston calendar data
boston_calendar.head()

Unnamed: 0,listing_id,date,available,price
0,12147973,2017-09-05,f,
1,12147973,2017-09-04,f,
2,12147973,2017-09-03,f,
3,12147973,2017-09-02,f,
4,12147973,2017-09-01,f,


In [6]:
boston_listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.0
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25


In [7]:
boston_calendar.shape

(1308890, 4)

In [8]:
boston_listings.shape

(3585, 95)

In [9]:
# Compare the data types of Seattle and Boston data 

print(seattle_calendar.dtypes == boston_calendar.dtypes)

listing_id    True
date          True
available     True
price         True
dtype: bool



In [10]:
# This raises a ValueError because the two listings data sets do not have the same columns
print(seattle_listings.dtypes == boston_listings.dtypes)

ValueError: Can only compare identically-labeled Series objects
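
The ValueError arises because the two dtype Series carry different labels. A minimal sketch of one way to see which columns differ, using small made-up frames in place of the real listings data (the helper name `column_differences` and the toy frames are hypothetical, not part of the original notebook):

```python
import pandas as pd


def column_differences(left: pd.DataFrame, right: pd.DataFrame):
    """Return (columns only in left, columns only in right) as sorted lists."""
    only_left = sorted(set(left.columns) - set(right.columns))
    only_right = sorted(set(right.columns) - set(left.columns))
    return only_left, only_right


# Toy frames standing in for seattle_listings and boston_listings
toy_seattle = pd.DataFrame({'id': [1], 'name': ['a']})
toy_boston = pd.DataFrame({'id': [2], 'name': ['b'], 'access': ['x']})

only_seattle, only_boston = column_differences(toy_seattle, toy_boston)
print(only_seattle, only_boston)
```

Calling the same helper on the real `seattle_listings` and `boston_listings` frames should list the columns that prevent the direct dtype comparison.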

In [11]:
# The Boston listings data set has three columns that Seattle lacks, so we drop them
boston_listings.drop(columns=['access', 'interaction', 'house_rules'], inplace=True)

In [12]:
print(seattle_listings.dtypes == boston_listings.dtypes)

id                                   True
listing_url                          True
scrape_id                            True
last_scraped                         True
name                                 True
summary                              True
space                                True
description                          True
experiences_offered                  True
neighborhood_overview                True
notes                                True
transit                              True
thumbnail_url                        True
medium_url                           True
picture_url                          True
xl_picture_url                       True
host_id                              True
host_url                             True
host_name                            True
host_since                           True
host_location                        True
host_about                           True
host_response_time                   True
host_response_rate                
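
The full boolean Series is long, so it can help to keep only the columns whose dtypes disagree. A minimal sketch, assuming the comparison is restricted to columns present in both frames (the helper name `dtype_mismatches` and the toy frames are illustrative, not part of the original notebook):

```python
import pandas as pd


def dtype_mismatches(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Map each shared column with differing dtypes to (left dtype, right dtype)."""
    shared = left.columns.intersection(right.columns)
    return {
        col: (str(left[col].dtype), str(right[col].dtype))
        for col in shared
        if left[col].dtype != right[col].dtype
    }


# Toy frames: a missing value forces the first column to float64
toy_seattle = pd.DataFrame({'host_listings_count': [1.0, None], 'name': ['a', 'b']})
toy_boston = pd.DataFrame({'host_listings_count': [1, 749], 'name': ['c', 'd']})

mismatches = dtype_mismatches(toy_seattle, toy_boston)
print(mismatches)
```

Returning a plain dictionary keeps the result short enough to read without raising `display.max_rows`.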

In [18]:
# The results above show that 5 fields have different data types; let's dig in
boston_listings['host_listings_count'].value_counts()

1      1616
2       498
3       220
4       157
749     136
5        85
7        83
558      79
6        67
313      61
363      58
11       54
52       50
24       48
22       45
18       33
8        28
307      25
15       25
30       24
12       24
13       22
16       21
37       20
21       17
14       15
20       14
10       14
122      13
17       11
9        10
28        8
0         2
71        1
45        1
Name: host_listings_count, dtype: int64

In [19]:
seattle_listings['host_listings_count'].value_counts()

1.0      2179
2.0       620
3.0       261
4.0       151
5.0        98
34.0       67
6.0        48
48.0       46
169.0      39
37.0       37
36.0       36
7.0        30
9.0        28
10.0       26
8.0        23
11.0       22
21.0       21
18.0       19
17.0       16
13.0       12
12.0       12
354.0      10
19.0        4
163.0       4
15.0        3
84.0        2
502.0       2
Name: host_listings_count, dtype: int64

In [16]:
boston_listings['license'].value_counts()

Series([], Name: license, dtype: int64)

In [17]:
seattle_listings['license'].value_counts()

Series([], Name: license, dtype: int64)
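
An empty `value_counts()` result suggests the column holds no non-null values, because `value_counts()` drops NaN by default. A quick sketch of a more direct check, shown on a toy frame rather than the real listings data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a listings table with an empty 'license' column
toy = pd.DataFrame({'license': [np.nan, np.nan, np.nan], 'id': [1, 2, 3]})

# True when every value in the column is missing
license_is_empty = toy['license'].isna().all()

# Names of all columns that contain no data at all
empty_columns = toy.columns[toy.isna().all()].tolist()
print(license_is_empty, empty_columns)
```

Passing `dropna=False` to `value_counts()` is another option when the count of missing values itself is of interest.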

In [None]:
seattle_calendar.head()

In [None]:
seattle_listings.head()

After this quick look at the data, we learned the following:
1. The Boston and Seattle data sets are almost identical in structure.
2. The Boston listings data set has three extra columns compared to Seattle: 'access', 'interaction', and 'house_rules'.
3. Five columns have different data types in the two data sets: 'host_listings_count', 'host_total_listings_count', 'neighbourhood_group_cleansed', 'has_availability', and 'jurisdiction_names'.
4. The 'license' field contains only NaN values in both data sets.
5. The 'market' field can be used to identify whether a record belongs to Boston or Seattle.
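
If the two listings tables are later stacked into one, the origin of each row can also be recorded explicitly rather than inferred from an existing field. A minimal sketch on toy frames, where the `city` column name is an assumption introduced for illustration and not a column in the original data:

```python
import pandas as pd

# Toy frames standing in for boston_listings and seattle_listings
toy_boston = pd.DataFrame({'id': [1, 2], 'access': ['a', 'b']})
toy_seattle = pd.DataFrame({'id': [3]})

# Keep only the columns the two frames have in common
shared = toy_boston.columns.intersection(toy_seattle.columns)

# Tag each frame with its city (hypothetical column), then stack them
combined = pd.concat(
    [
        toy_boston[shared].assign(city='Boston'),
        toy_seattle[shared].assign(city='Seattle'),
    ],
    ignore_index=True,
)
print(combined)
```

Restricting both frames to their shared columns before stacking avoids introducing all-NaN columns for fields that exist in only one city.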