Lines from internet: 10

Own lines of code: 65

We used Plotly to plot the data, whereas the code found on the internet plots it with Matplotlib.

## Description of Data
The data is sourced from the **Inside Airbnb** website `http://insideairbnb.com/get-the-data.html`, which hosts publicly available data from the Airbnb site.
The dataset comprises three main tables:
* `listings` - Detailed listings data with 106 attributes for each listing. Attributes used in the analysis include `price` (continuous), `longitude` (continuous), `latitude` (continuous), `listing_type` (categorical), `is_superhost` (categorical), `neighbourhood` (categorical) and `ratings` (continuous), among others.
* `reviews` - Detailed guest reviews with 6 attributes. Key attributes include `date` (datetime), `listing_id` (discrete), `reviewer_id` (discrete) and `comment` (textual).
* `calendar` - Booking details for the next year, by listing. Four attributes in total: `listing_id` (discrete), `date` (datetime), `available` (categorical) and `price` (continuous).
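
As a rough illustration of the attribute types listed above, the sketch below parses a tiny in-memory extract shaped like the `calendar` table. The three rows are made up, and the `t`/`f` encoding of `available` and the `$1,145.00` price style are assumptions based on how similar columns appear in `listings`.

```python
import io
import pandas as pd

# Hypothetical three-row extract standing in for the real calendar file.
sample_calendar_csv = io.StringIO(
    "listing_id,date,available,price\n"
    "3781,2019-02-09,f,$125.00\n"
    "3781,2019-02-10,t,$125.00\n"
    "5506,2019-02-09,t,\"$1,145.00\"\n"
)

# Parse the date column as datetime while reading.
calendar = pd.read_csv(sample_calendar_csv, parse_dates=["date"])

# Map the assumed 't'/'f' flags to booleans.
calendar["available"] = calendar["available"].map({"t": True, "f": False})

# Strip '$' and ',' before converting price to a float.
calendar["price"] = pd.to_numeric(
    calendar["price"].str.replace(r"[$,]", "", regex=True)
)

print(calendar.dtypes)
```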


In [2]:
import pandas as pd
import numpy as np
#import plotly
# plotly.plotly moved to the chart_studio package in plotly 4 and is not used in this notebook
#import plotly.plotly as py
import plotly.offline as off
import plotly.graph_objs as go
from collections import Counter
off.init_notebook_mode(connected = True)

In [3]:
listings = pd.read_csv('../data/listings.csv', parse_dates = True)
listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,3781,https://www.airbnb.com/rooms/3781,20190209175027,2019-02-09,HARBORSIDE-Walk to subway,Fully separate apartment in a two apartment bu...,This is a totally separate apartment located o...,Fully separate apartment in a two apartment bu...,none,"Mostly quiet ( no loud music, no crowed sidewa...",...,f,f,super_strict_30,f,f,1,1,0,0,0.32
1,5506,https://www.airbnb.com/rooms/5506,20190209175027,2019-02-09,**$79 Special ** Private! Minutes to center!,This is a private guest room with private bath...,**THE BEST Value in BOSTON!!*** PRIVATE GUEST ...,This is a private guest room with private bath...,none,"Peacful, Architecturally interesting, historic...",...,t,f,strict_14_with_grace_period,f,f,6,6,0,0,0.66
2,6695,https://www.airbnb.com/rooms/6695,20190209175027,2019-02-09,$99 Special!! Home Away! Condo,,** WELCOME *** FULL PRIVATE APARTMENT In a His...,** WELCOME *** FULL PRIVATE APARTMENT In a His...,none,"Peaceful, Architecturally interesting, histori...",...,t,f,strict_14_with_grace_period,f,f,6,6,0,0,0.73
3,6976,https://www.airbnb.com/rooms/6976,20190209175027,2019-02-09,Mexican Folk Art Haven in Boston Residential Area,Come stay with me in Boston's Roslindale neigh...,"This is a well-maintained, two-family house bu...",Come stay with me in Boston's Roslindale neigh...,none,The LOCATION: Roslindale is a safe and diverse...,...,f,f,moderate,t,f,1,0,1,0,0.64
4,8789,https://www.airbnb.com/rooms/8789,20190209175027,2019-02-09,Curved Glass Studio/1bd facing Park,"Bright, 1 bed with curved glass windows facing...",Fully Furnished studio with enclosed bedroom. ...,"Bright, 1 bed with curved glass windows facing...",none,Beacon Hill is a historic neighborhood filled ...,...,f,f,strict_14_with_grace_period,f,f,10,10,0,0,0.4


In [25]:
listings['host_since'] = pd.to_datetime(listings["host_since"], format="%Y/%m/%d")
listings.groupby(listings.host_since.dt.year)['host_id'].count()

host_since
2008.0       1
2009.0      66
2010.0     112
2011.0     322
2012.0     289
2013.0     687
2014.0    1293
2015.0    1101
2016.0    1061
2017.0     613
2018.0     572
2019.0      36
Name: host_id, dtype: int64

In [5]:
# Compute the counts once so the x labels (years) and y values share the same order
hosts_per_year = listings.groupby(listings.host_since.dt.year)['host_id'].count()
data_g1 = [
    go.Bar(
        x=hosts_per_year.index,
        y=hosts_per_year.values
    )
]
layout_g1 = go.Layout(
    #barmode='group',
    title = 'Number of new hosts added every year'
)
fig1 = go.Figure(data=data_g1, layout=layout_g1)
off.iplot(fig1,filename='Sample')

From the graph above, we can infer that **Airbnb** started operating in **Boston** in `2008`. The number of new hosts also rose sharply between 2012 and 2014, from 289 to 1293 per year.
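
When plotting counts per year, it helps to take the labels and the values from the same grouped result, because `unique()` returns years in order of appearance while a grouped count is sorted by year. A minimal sketch on a made-up five-row sample (the `sample` frame is hypothetical, not the real `listings` table):

```python
import pandas as pd

# Hypothetical mini-sample; the real notebook uses the `listings` table.
sample = pd.DataFrame({
    "host_id": [1, 2, 3, 4, 5],
    "host_since": pd.to_datetime(
        ["2014-03-01", "2008-11-20", "2014-07-15", None, "2012-01-05"]
    ),
})

# value_counts() drops missing dates; sort_index() orders the result by year.
hosts_per_year = sample["host_since"].dt.year.value_counts().sort_index()

# Labels and counts now come from the same Series, so they stay aligned.
years = hosts_per_year.index.astype(int).tolist()
counts = hosts_per_year.tolist()
print(list(zip(years, counts)))
```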

In [26]:
def prices_to_numbers(price_string):
    price_numeric = float(str(price_string).replace(',', '').split('$')[-1])
    return price_numeric

In [27]:
listings['price'] = listings.price.apply(prices_to_numbers)
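
The same conversion can also be written without a row-by-row `apply`. The sketch below is an alternative under the assumption that prices look like `$1,250.00`; the sample strings are made up.

```python
import pandas as pd

# Hypothetical price strings in the same style as the `price` column.
raw_prices = pd.Series(["$79.00", "$1,250.00", "99"])

# Remove '$' and thousands separators, then convert to float.
clean_prices = pd.to_numeric(raw_prices.str.replace(r"[$,]", "", regex=True))
print(clean_prices.tolist())
```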

In [28]:
# Select several columns with a list; tuple-style selection fails in pandas 1.0 and later
listings.groupby(listings.neighbourhood_cleansed)[['accommodates','beds']].sum()

neighbourhood_cleansed,accommodates,beds
Allston,1000,533.0
Back Bay,1675,836.0
Bay Village,145,82.0
Beacon Hill,799,411.0
Brighton,1007,564.0
Charlestown,590,325.0
Chinatown,517,269.0
Dorchester,1755,1015.0
Downtown,1622,780.0
East Boston,1088,568.0


In [9]:
# Average guests accommodated and beds per listing in each neighbourhood (sum / count is the mean)
nhab = listings.groupby('neighbourhood_cleansed')[['accommodates','beds']].mean().reset_index()
nhab.head()

Unnamed: 0,neighbourhood_cleansed,accommodates,beds
0,Allston,3.039514,1.620061
1,Back Bay,3.390688,1.69574
2,Bay Village,3.717949,2.102564
3,Beacon Hill,3.133333,1.611765
4,Brighton,2.812849,1.575419
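
Dividing a grouped sum by a grouped count gives the per-group mean, because both skip missing values. A small check on made-up data (the `sample` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical sample with one missing `beds` value.
sample = pd.DataFrame({
    "neighbourhood_cleansed": ["Allston", "Allston", "Back Bay", "Back Bay"],
    "accommodates": [2, 4, 3, 5],
    "beds": [1.0, 2.0, None, 3.0],
})

grouped = sample.groupby("neighbourhood_cleansed")[["accommodates", "beds"]]

# sum() and count() both ignore missing values, so their ratio is the mean.
by_ratio = grouped.sum() / grouped.count()
by_mean = grouped.mean()
print(by_mean)
```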


In [10]:
# Average guests accommodated and price per listing for each property type
nhap = listings.groupby('property_type')[['accommodates','price']].mean().reset_index()
nhap

Unnamed: 0,property_type,accommodates,price
0,Aparthotel,4.0,250.0
1,Apartment,3.424399,216.447717
2,Barn,4.0,275.0
3,Bed and breakfast,2.446809,148.893617
4,Boat,4.5,272.357143
5,Boutique hotel,2.684211,186.631579
6,Bungalow,2.5,121.25
7,Camper/RV,5.0,112.5
8,Chalet,2.0,54.0
9,Condominium,3.486819,251.337434


In [11]:
trace1 = go.Bar(
    x=nhab.neighbourhood_cleansed,
    y=nhab.accommodates.round(),
    name='Accommodates'
)
trace2 = go.Bar(
    x=nhab.neighbourhood_cleansed,
    y=nhab.beds.round(),
    name='Beds'
)

data_g2 = [trace1, trace2]
layout_g2 = go.Layout(
    barmode='group',
    title = 'Accommodates vs Beds'
)

fig = go.Figure(data=data_g2, layout=layout_g2)
off.iplot(fig, filename='grouped-bar')

The chart shows per-listing averages (rounded), not listing counts. Back Bay listings, for example, accommodate about 3.4 guests and have about 1.7 beds on average.

In [12]:
trace3 = go.Bar(
    x=nhap.property_type,
    y=nhap.price,
    name='Price'
)
trace4 = go.Bar(
    x=nhap.property_type,
    y=nhap.accommodates.round(),
    name='Accommodates'
)

data = [trace3, trace4]
layout = go.Layout(
    barmode='group',
    title = 'Property Type vs Average Price'
)

fig = go.Figure(data=data, layout=layout)
off.iplot(fig, filename='grouped-bar')

From the graph, we can say that serviced apartments have the highest average price of all property types.

In [13]:
nh = listings['neighbourhood_cleansed'].value_counts()
neighbourhood_table = pd.DataFrame({"Count":nh})
neighbourhood_table.head()

Unnamed: 0,Count
Dorchester,537
Jamaica Plain,514
Back Bay,494
Downtown,453
South End,445


In [14]:
data_g3 = [
    go.Bar(
        x=neighbourhood_table.index, # neighbourhood names
        y=neighbourhood_table.Count
    )
]
layout_g3 = go.Layout(
    #barmode='group',
    title = 'Listings per neighbourhood'
)
fig1 = go.Figure(data=data_g3, layout=layout_g3)
off.iplot(fig1,filename='Sample')

In [18]:
r = pd.read_csv('../data/reviews.csv')
r['date'] = pd.to_datetime(r["date"], format="%m/%d/%Y")
#r.groupby([r.date.dt.year,r.listing_id])['date'].count()

In [36]:
# Count the reviews each listing received on each date
yearwise_reviews = pd.DataFrame({"Count":r.groupby([r.date,r.listing_id])['date'].count()})
yearwise_reviews = yearwise_reviews.reset_index()

In [31]:
cL = pd.read_csv('../sentiment_data.csv')
cL.head()

Unnamed: 0.1,Unnamed: 0,listing_id,pos,neu,neg,compound,name,latitude,longitude,review_scores_rating,neighbourhood_cleansed,distance_airport,distance_tophub,distance_prudential,distance_royale,distance_harvard
0,3195,22162098,1.0,0.0,0.0,0.6588,Beautiful 2 Bedroom in The heart of Boston!,42.332205,-71.112811,80.0,Mission Hill,5.765146,1.781861,1.862614,2.711111,3.581266
1,4001,26362139,1.0,0.0,0.0,0.4404,Two Bedroom in Boston's Back Bay #201,42.346761,-71.079739,100.0,Back Bay,3.818834,0.211154,0.144703,0.758202,3.815722
2,4374,28492660,1.0,0.0,0.0,0.6588,Classic 2BR in South End by Sonder,42.342961,-71.065314,100.0,South End,3.252235,0.980294,0.92697,0.480489,4.568061
3,4811,30505480,0.902,0.098,0.0,0.44815,Sophisticated 2BR in Lower Allston by Sonder,42.364704,-71.131268,100.0,Allston,6.227948,2.732877,2.774433,3.515281,1.148113
4,4874,31261651,0.83,0.17,0.0,0.8676,(116-9) Back Bay - Perfect Loft Studio,42.353872,-71.077603,80.0,Back Bay,3.573977,0.600067,0.52824,0.674416,3.633181


In [32]:
random_listings = cL[100:115]  # fixed slice of 15 consecutive rows (100-114), not a random draw
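
The slice `cL[100:115]` always picks the same 15 consecutive rows. If a random selection is wanted, a seeded sample is one option. The sketch below uses a made-up stand-in table; `sample_cl`, `sampled_listings` and the seed value are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the sentiment table `cL`.
sample_cl = pd.DataFrame({
    "listing_id": range(200),
    "compound": [i / 200 for i in range(200)],
})

# Draw 15 rows at random; random_state makes the draw repeatable.
sampled_listings = sample_cl.sample(n=15, random_state=42)
print(len(sampled_listings))
```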

In [33]:
trace_1 = go.Bar(
    x=random_listings.name,
    y=random_listings.pos.values,
    name='Positive '
)
trace_2 = go.Bar(
    x=random_listings.name,
    y=random_listings.neg.values,
    name='Negative '
)
trace_3 = go.Bar(
    x= random_listings.name,
    y= random_listings['compound'],
    name='Compound '
)

data_1 = [trace_1, trace_2,trace_3]
layout_1 = go.Layout(
    barmode='stack',
    title = 'Review Sentiment Distribution on Listings'
)

fig_1 = go.Figure(data=data_1, layout=layout_1)
off.iplot(fig_1, filename='grouped-bar')

## How popular is Airbnb across years?

In [35]:
# Reviews per year (kept for reference; the scatter below uses yearwise_reviews)
grouped = r.groupby(r.date.dt.year)['date'].count()
trace5 = go.Scatter(
        x=yearwise_reviews.date, # review date
        y=yearwise_reviews['Count'], # reviews a listing received on that date
        mode = 'markers')

data_g4 = [trace5]
layout_g4 = go.Layout(
    #barmode='group',
    title = 'Increase in number of reviews over years for listings',
    xaxis = dict(
        range = ['2009-01-01','2019-04-25'])
)
fig1 = go.Figure(data=data_g4, layout=layout_g4)
off.plot(fig1,filename='Sample')

'file://C:\\Users\\chowd\\INFO6105\\Sample.html'

The number of reviews per listing has increased sharply over the years.
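
One way to put a number on this trend is to divide the yearly review count by the number of distinct listings reviewed that year. The sketch below does this on a made-up sample with the same `listing_id` and `date` columns as the reviews table; the rows are hypothetical and the result says nothing about the real data.

```python
import pandas as pd

# Hypothetical mini-sample with the same columns as the reviews table.
reviews = pd.DataFrame({
    "listing_id": [1, 1, 2, 1, 2, 2, 3],
    "date": pd.to_datetime([
        "2016-05-01", "2017-06-10", "2017-07-04",
        "2018-01-15", "2018-02-20", "2018-03-03", "2018-08-30",
    ]),
})

year = reviews["date"].dt.year

# Total reviews written in each year.
reviews_per_year = reviews.groupby(year)["listing_id"].count()

# Number of distinct listings that received at least one review that year.
listings_per_year = reviews.groupby(year)["listing_id"].nunique()

# Average reviews per reviewed listing, by year.
reviews_per_listing = reviews_per_year / listings_per_year
print(reviews_per_listing)
```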