![](https://cdn.shopify.com/s/files/1/0053/6513/7472/products/newyorkcitysunset911.jpg?v=1544036357)
([Image Source](https://gettyphotography.com/products/new-york-city-empire-state-building-sunset-911))

# Table of Contents

[1. Acknowledgement](#1)<br>
&nbsp;&nbsp;&nbsp;[Introduction](#1.1)<br>
[2. Import Library](#2)<br>
[3. Data Cleaning](#3)<br>
[4. Exploratory Data Analysis (EDA)](#4)<br>
&nbsp;&nbsp;&nbsp;[Extend the Summary Data](#4.1)<br>
[5. Hypothesis](#5)<br>
[6. Conclusion](#6)<br>
[7. Summary](#7)<br>

# Acknowledgement<a id='1'></a>
This Notebook would not have been possible without the fantastic dataset provided by [Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data). Also, shoutout to TKH volunteers: [Michael](https://www.linkedin.com/in/michael-lieberman-65786ba/), [Saurabh](https://www.linkedin.com/in/sauragar/) and our Instructors: Anil and [Malcolm](https://www.linkedin.com/in/malcolm-holliday/) who helped us with this project.

# Introduction<a id='1.1'></a>


[Airbnb](https://www.airbnb.com) is the biggest player in the short-term rental market, with more than 7 million listings in over 220 countries. Over the years, its rampant growth and lack of transparency have made it a target for everything from charges of fueling overtourism and turning formerly residential neighborhoods into tourist zones to enabling raucous parties despite complaints and virus-related restrictions on gatherings. 

### The COVID effect
The coronavirus has taken a massive toll on the travel industry, which in turn has created a major challenge for Airbnb. 

After laying off a quarter of its work force in the spring, Airbnb jettisoned some new ventures, including forays into transportation and entertainment, and hunkered down to focus on its core strength, lodging, even as its valuation fell from a high of \\$31 billion to, recently, \\$18 billion, according to The Wall Street Journal.

During the fourth quarter of last year, Airbnb raked in \\$1.11 billion in revenue. But by the second quarter of this year, which covered the height of the pandemic, Airbnb’s revenue had dwindled to \\$334.78 million, down 72 percent year-over-year. The rising number of cancellations and slowing number of bookings ballooned Airbnb’s losses to \\$575.6 million in the second quarter, compared to a loss of \\$297.4 million during the same period one year prior. 

Last year, prior to suffering any effects from the pandemic, the company lost \\$674.33 million, far greater than its \\$16.9 million loss in 2018. Meanwhile, in the first nine months of 2020, the company has already lost \\$696.9 million, more than during all of 2019. [source](https://fortune.com/2020/11/16/airbnb-ipo-initial-public-offering-coronavirus-impact/)

### Our Study

**The [current](https://www.kaggle.com/ivanovskia1/nyc-airbnb-rental-data-october-2017) NYC Airbnb Rental Data (October 2017) dataset contains information about Airbnb listings. It gives each listing's location by latitude and longitude as well as its neighborhood and borough. It also includes the price per night, the number of bedrooms and bathrooms, etc.**

The aim of our study is to explore data produced by Airbnb listings and look for factors that might have contributed to Airbnb's success. Furthermore, we want to find out whether we can identify any patterns and predict the location and price of a listing. 

## Technology Stack
In this analysis, we used Python as the primary programming language because of its rich palette of tools that make data analysis a cinch. Some of the packages we used are:
1. [Matplotlib](https://matplotlib.org/) is an extremely versatile library of tools for generating interactive plots that are easy to interpret and customise.
2. [NumPy](https://github.com/numpy/numpy) is a popular library used for array manipulation and vector operations. It is used extensively across Python projects that require scientific computing.
3. [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) is another library for data science that is just as popular as NumPy. It provides easy-to-use data structures and functions to manipulate structured data.
4. [Seaborn](https://seaborn.pydata.org/) is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
5. [Sqlite3](https://www.sqlite.org/index.html) is a C library that provides a lightweight disk-based database that allows accessing the database using a nonstandard variant of the SQL query language.
6. [Folium](https://python-visualization.github.io/folium/) makes it easy to visualize data that’s been manipulated in Python on an interactive leaflet map. It enables both the binding of data to a map for choropleth visualizations as well as passing rich vector/raster/HTML visualizations as markers on the map.

These tools are well documented and come with several examples that make it easy to start using them. You can check out the linked documentation pages for more information.

# Import Library<a id='2'></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3 as sql
# import folium
# from folium.plugins import HeatMap

In [5]:
db = 'project_2.db'
conn = sql.connect(db)

In [6]:
df = pd.read_sql("SELECT * from airbnb2017", conn, index_col='index')
df

index,id,host_response_time,host_response_rate,host_is_superhost,host_has_profile_pic,neighbourhood_cleansed,latitude,longitude,is_location_exact,property_type,...,maximum_nights,calendar_updated,availability_30,number_of_reviews,review_scores_rating,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,reviews_per_month
0,18461891,,,f,t,Ditmars Steinway,40.774142,-73.916246,t,Apartment,...,6,5 months ago,0,0,,f,f,strict,f,
1,20702398,within an hour,100%,f,t,City Island,40.849191,-73.786509,f,House,...,21,2 weeks ago,19,2,100.0,f,f,moderate,f,2.00
2,6627449,within an hour,100%,f,t,City Island,40.849775,-73.786609,t,Apartment,...,21,2 weeks ago,28,21,95.0,f,f,strict,f,0.77
3,19949243,within a few hours,100%,f,t,City Island,40.848838,-73.782276,f,Boat,...,1125,6 days ago,30,0,,t,f,strict,f,
4,1886820,,,f,t,City Island,40.841144,-73.783052,t,House,...,90,16 months ago,30,0,,f,f,strict,f,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44312,20530309,within an hour,90%,f,t,Flatlands,40.618675,-73.932736,f,Apartment,...,1125,2 weeks ago,30,1,100.0,t,f,flexible,f,0.81
44313,20459907,within a few hours,100%,f,t,Bushwick,40.684681,-73.905174,t,Apartment,...,30,2 weeks ago,4,0,,t,f,strict,f,
44314,4287386,within an hour,100%,f,t,Rockaway Beach,40.583865,-73.819245,f,Apartment,...,60,2 weeks ago,1,6,87.0,f,f,moderate,f,3.91
44315,20939747,within an hour,100%,f,t,Rosedale,40.679998,-73.720787,f,Apartment,...,1125,a week ago,7,0,,f,f,strict,f,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44317 entries, 0 to 44316
Data columns (total 30 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   id                             44317 non-null  int64  
 1   host_is_superhost              44317 non-null  object 
 2   host_has_profile_pic           44317 non-null  object 
 3   neighbourhood_cleansed         44317 non-null  object 
 4   latitude                       44317 non-null  float64
 5   longitude                      44317 non-null  float64
 6   is_location_exact              44317 non-null  object 
 7   property_type                  44317 non-null  object 
 8   room_type                      44317 non-null  object 
 9   accommodates                   44317 non-null  int64  
 10  bathrooms                      44317 non-null  float64
 11  bedrooms                       44317 non-null  float64
 12  beds                           44317 non-null 

In [58]:
df2 = pd.read_sql("SELECT * from opennyc", conn, index_col='index')

In [61]:
info = df2.groupby(["neighbourhood","neighbourhood_group"])

In [62]:
pd.DataFrame(info)[0]

0                   (Allerton, Bronx)
1      (Arden Heights, Staten Island)
2           (Arrochar, Staten Island)
3                   (Arverne, Queens)
4                   (Astoria, Queens)
                    ...              
216       (Windsor Terrace, Brooklyn)
217               (Woodhaven, Queens)
218                 (Woodlawn, Bronx)
219          (Woodrow, Staten Island)
220                (Woodside, Queens)
Name: 0, Length: 221, dtype: object

In [63]:
data= {key:value for key,value in pd.DataFrame(info)[0]}

In [64]:
data

{'Allerton': 'Bronx',
 'Arden Heights': 'Staten Island',
 'Arrochar': 'Staten Island',
 'Arverne': 'Queens',
 'Astoria': 'Queens',
 'Bath Beach': 'Brooklyn',
 'Battery Park City': 'Manhattan',
 'Bay Ridge': 'Brooklyn',
 'Bay Terrace': 'Queens',
 'Bay Terrace, Staten Island': 'Staten Island',
 'Baychester': 'Bronx',
 'Bayside': 'Queens',
 'Bayswater': 'Queens',
 'Bedford-Stuyvesant': 'Brooklyn',
 'Belle Harbor': 'Queens',
 'Bellerose': 'Queens',
 'Belmont': 'Bronx',
 'Bensonhurst': 'Brooklyn',
 'Bergen Beach': 'Brooklyn',
 'Boerum Hill': 'Brooklyn',
 'Borough Park': 'Brooklyn',
 'Breezy Point': 'Queens',
 'Briarwood': 'Queens',
 'Brighton Beach': 'Brooklyn',
 'Bronxdale': 'Bronx',
 'Brooklyn Heights': 'Brooklyn',
 'Brownsville': 'Brooklyn',
 "Bull's Head": 'Staten Island',
 'Bushwick': 'Brooklyn',
 'Cambria Heights': 'Queens',
 'Canarsie': 'Brooklyn',
 'Carroll Gardens': 'Brooklyn',
 'Castle Hill': 'Bronx',
 'Castleton Corners': 'Staten Island',
 'Chelsea': 'Manhattan',
 'Chinatown': 'M

In [65]:
data['Gerritsen Beach']= 'Brooklyn'

In [66]:
data['Glen Oaks']= 'Queens'

In [67]:
data['Hollis Hills']= 'Queens'

In [68]:
df["borought"] = df["neighbourhood_cleansed"].apply(lambda x:data[x])

In [71]:
df

index,id,host_is_superhost,host_has_profile_pic,neighbourhood_cleansed,latitude,longitude,is_location_exact,property_type,room_type,accommodates,...,calendar_updated,availability_30,number_of_reviews,review_scores_rating,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,reviews_per_month,borought
0,18461891,f,t,Ditmars Steinway,40.774142,-73.916246,t,Apartment,Entire home/apt,2,...,5 months ago,0,0,0.0,f,f,strict,f,0.00,Queens
1,20702398,f,t,City Island,40.849191,-73.786509,f,House,Private room,2,...,2 weeks ago,19,2,100.0,f,f,moderate,f,2.00,Bronx
2,6627449,f,t,City Island,40.849775,-73.786609,t,Apartment,Entire home/apt,3,...,2 weeks ago,28,21,95.0,f,f,strict,f,0.77,Bronx
3,19949243,f,t,City Island,40.848838,-73.782276,f,Boat,Entire home/apt,4,...,6 days ago,30,0,0.0,t,f,strict,f,0.00,Bronx
4,1886820,f,t,City Island,40.841144,-73.783052,t,House,Entire home/apt,4,...,16 months ago,30,0,0.0,f,f,strict,f,0.00,Bronx
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44312,20530309,f,t,Flatlands,40.618675,-73.932736,f,Apartment,Private room,1,...,2 weeks ago,30,1,100.0,t,f,flexible,f,0.81,Brooklyn
44313,20459907,f,t,Bushwick,40.684681,-73.905174,t,Apartment,Entire home/apt,6,...,2 weeks ago,4,0,0.0,t,f,strict,f,0.00,Brooklyn
44314,4287386,f,t,Rockaway Beach,40.583865,-73.819245,f,Apartment,Entire home/apt,4,...,2 weeks ago,1,6,87.0,f,f,moderate,f,3.91,Queens
44315,20939747,f,t,Rosedale,40.679998,-73.720787,f,Apartment,Entire home/apt,2,...,a week ago,7,0,0.0,f,f,strict,f,0.00,Queens


In [78]:
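def wifi_locator(column):
    # Caution: both return statements sit inside the loop, so this function
    # only ever inspects the first row of the column before returning.
    for row in df[column]:
        if "Wireless Internet" in row:
            return 1
        else:
            return 0
wifi_locator("amenities")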

1

In [80]:
df["Wifi Access"] = df["amenities"].apply(lambda x: 1 if "Wireless Internet" in x else 0)

In [84]:
df["amenities"].iloc[44314]

'{"Wheelchair accessible",Kitchen,"Free parking on premises",Breakfast,Heating,"Smoke detector","Carbon monoxide detector",Essentials,Shampoo,Hangers,"Lake access",Beachfront}'
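The `amenities` value shown above is a brace-wrapped, comma-separated string in which multi-word names are quoted. A minimal sketch of turning such a string into a Python set, so that membership tests match whole amenity names rather than substrings. `parse_amenities` is a hypothetical helper introduced here for illustration, and the sample string is a shortened copy of the output above:

```python
import csv

def parse_amenities(raw):
    """Split a string like '{"Smoke detector",Kitchen}' into a set of names.

    Hypothetical helper, not part of the original notebook.
    """
    inner = raw.strip().strip("{}")
    if not inner:
        return set()
    # csv.reader handles the optional double quotes around multi-word names
    return set(next(csv.reader([inner])))

sample = '{"Wheelchair accessible",Kitchen,"Free parking on premises",Breakfast}'
amenities = parse_amenities(sample)

# 1 if the listing advertises wireless internet, 0 otherwise
has_wifi = int("Wireless Internet" in amenities)
```

The listing printed above does not list "Wireless Internet", so under this approach its flag would be 0.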

In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44317 entries, 0 to 44316
Data columns (total 30 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   id                             44317 non-null  int64  
 1   host_is_superhost              44317 non-null  object 
 2   host_has_profile_pic           44317 non-null  object 
 3   neighbourhood_cleansed         44317 non-null  object 
 4   latitude                       44317 non-null  float64
 5   longitude                      44317 non-null  float64
 6   is_location_exact              44317 non-null  object 
 7   property_type                  44317 non-null  object 
 8   room_type                      44317 non-null  object 
 9   accommodates                   44317 non-null  int64  
 10  bathrooms                      44317 non-null  float64
 11  bedrooms                       44317 non-null  float64
 12  beds                           44317 non-null 

In [None]:
# df.to_sql("updated_airbnb2017", conn, if_exists="replace")

# Data Cleaning<a id='3'></a>

Before we start generating any kind of information from these data sources, it is important to clean the data and make sure that the datasets are compatible with each other. Since most of the data is organized on a per-host-ID basis, we must make sure that all rows have values and share the same formatting. 

# Exploratory Data Analysis (EDA)<a id='4'></a>

## Standardize Borough Names

### Steps followed:
- Identify a dictionary with neighborhood:borough relationship 
- Plan on how to use this dictionary to parse the neighborhood column to produce a new column with each borough 
- Implement in code 
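The steps above can be sketched with pandas on toy data. The two small frames below are stand-ins for the real tables (their contents are assumptions), and `Series.map` is used so that a neighbourhood missing from the dictionary shows up as `NaN` instead of raising a `KeyError`:

```python
import pandas as pd

# Toy lookup table standing in for the neighbourhood/borough columns
lookup = pd.DataFrame({
    "neighbourhood": ["Astoria", "Bushwick", "Allerton", "Astoria"],
    "neighbourhood_group": ["Queens", "Brooklyn", "Bronx", "Queens"],
})
# Toy listings; 'Glen Oaks' is deliberately absent from the lookup
listings = pd.DataFrame(
    {"neighbourhood_cleansed": ["Bushwick", "Glen Oaks", "Astoria"]}
)

# Step 1: build the neighbourhood -> borough dictionary
mapping = (
    lookup.drop_duplicates("neighbourhood")
    .set_index("neighbourhood")["neighbourhood_group"]
    .to_dict()
)

# Step 2: map, then list the neighbourhoods the dictionary does not cover
listings["borough"] = listings["neighbourhood_cleansed"].map(mapping)
missing = sorted(
    listings.loc[listings["borough"].isna(), "neighbourhood_cleansed"].unique()
)

# Step 3: patch the gaps by hand and map again
mapping.update({"Glen Oaks": "Queens"})
listings["borough"] = listings["neighbourhood_cleansed"].map(mapping)
```

Inspecting `missing` before patching makes it easy to see which neighbourhoods need a manual entry.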

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.describe().T

In [None]:
df.head()

In [None]:
print('Number of unique hosts:', df['host_id'].nunique())
df['host_id'].value_counts()

In [None]:
# number of neighbourhood groups
df['neighbourhood_group'].value_counts()

In [None]:
fig, axes = plt.subplots(figsize=(10,10))
axes.set(title="Share of listings by borough")
# Take the labels from the same value_counts() result so they line up with the wedges
borough_counts = df['neighbourhood_group'].value_counts()
axes.pie(borough_counts, labels=borough_counts.index, explode=[0,0.05,0,0,0], autopct='%1.1f%%', shadow=True)
plt.legend()
plt.show()

In [None]:
# number of different neighbourhood
len(df['neighbourhood'].unique())

In [None]:
from matplotlib import style

plt.style.use('Solarize_Light2')
fig,ax=plt.subplots(figsize=(20,10))
color = ("blue", "red", "purple",'cadetblue','darksalmon')
df.neighbourhood.value_counts().sort_values(ascending=False)[:10].sort_values().plot(kind='barh',color=color)
ax.set_title("Top 10 neighbourhood by the number of listings available to book",size=24)
ax.set_xlabel('number of listings',size=15)
ax.set_ylabel('neighborhoods', size=15)
plt.show()

These are the 10 neighborhoods with the most listings, and therefore the most rooms available; they appear to be very popular.

In [None]:
# number of room types
sns.countplot(x='room_type', data=df, order=df['room_type'].value_counts().index).set_title('Room Type')

This is a breakdown of the type of room or apartment offered. As we can see, these are mostly entire apartments or private rooms. Very few shared rooms are offered as an option for renting on Airbnb.

In [None]:
plt.style.use('Solarize_Light2')
fig,ax=plt.subplots(figsize=(8,8))
df[['price']].boxplot()
plt.title('Price per night distribution')
#plt.xlabel('Days avaiable')
plt.ylabel('Cost per night') 
plt.show()

It seems that there are a few price outliers at $10,000. Let's see how many there are and what they are.

In [None]:
df.sort_values(by=['price'], ascending=False).head(20)

# Should we just drop row 9151?
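One common way to decide which prices count as outliers, rather than picking rows by eye, is the 1.5 × IQR rule used by box plots. A minimal sketch on made-up prices (the values are assumptions, not taken from the dataset):

```python
import pandas as pd

# Made-up nightly prices with one extreme value
prices = pd.Series([60, 80, 95, 100, 120, 150, 175, 200, 10000], name="price")

# Interquartile range and the upper whisker limit of a standard box plot
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Rows above the fence are candidate outliers to inspect or drop
outliers = prices[prices > upper_fence]
```

Whether to drop, cap, or keep such rows is a judgment call; the rule only flags candidates.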

In [None]:
x = df['price'].values
y = df['neighbourhood_group'].values
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(15,5))
ax.scatter(x,y, s=50, c='b', alpha=0.3)
ax.set_xlabel('price')
ax.set_ylabel('boroughs')
ax.set_title('Price distribution per neighborhood_group')
plt.show()

Also, let's drop last_review and reviews_per_month since they have lots of missing values (or better, let's fill them with 0s, since these are probably new properties).

In [None]:
#df.drop(['last_review', 'reviews_per_month'], axis = 1, inplace = True)
#or
# df.fillna(value = {'last_review':0}, inplace = True)
# df.fillna(value={'reviews_per_month':0}, inplace = True)
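A small sketch of the fill-with-zero option on toy data. It assumes, as the text above suggests, that a missing `reviews_per_month` belongs to a listing with no reviews yet, and it only fills rows where `number_of_reviews` is 0 so that any other gaps stay visible:

```python
import numpy as np
import pandas as pd

# Toy frame; the values are illustrative only
reviews = pd.DataFrame({
    "number_of_reviews": [0, 2, 21],
    "reviews_per_month": [np.nan, 2.00, 0.77],
})

# Fill only where the listing genuinely has no reviews
no_reviews = reviews["number_of_reviews"] == 0
reviews.loc[no_reviews, "reviews_per_month"] = (
    reviews.loc[no_reviews, "reviews_per_month"].fillna(0)
)

# Any NaN left at this point is a real gap worth investigating
remaining_missing = int(reviews["reviews_per_month"].isna().sum())
```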

In [None]:
plt.style.use('tableau-colorblind10')
fig, ax = plt.subplots(figsize=(15,5))
plt.hist(df['availability_365'], bins = 30)
plt.title('Availability')
plt.xlabel('Days available')
plt.ylabel('Number of listings') 
plt.show()

# Most listings are available short-term (more than 2 weeks)
fig, ax = plt.subplots(figsize=(15,5))
plt.hist(df[(df['minimum_nights'] <= 30) & (df['minimum_nights'] > 0)]['minimum_nights'], bins = 30)
plt.title('Minimum nights booking requirement')
plt.xlabel('Minimum number of nights required to book')
plt.ylabel('Number of listings')
plt.show()

# Most listings require a minimum of 1 to 3 nights
from matplotlib import colors
plt.style.use('tableau-colorblind10')
fig, ax = plt.subplots(figsize=(15,5))
plt.hist2d(df['number_of_reviews'],df['price'], bins=10, norm = colors.LogNorm(), cmap ="RdBu")
plt.title('Relationship between Number of Reviews and Price')
plt.xlabel('Number of reviews')
plt.ylabel('Price')
plt.show()

In [None]:
# I will probably remove the last histogram. I wanted to do a 2D histogram, but can't really find a good relationship for it. Ideas?

In [None]:
sns.pairplot(df)

Airbnb Listing Distribution on a Heatmap

In [None]:
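import folium
from folium.plugins import HeatMap
# Named nyc_map so it does not overwrite the neighbourhood-to-borough dict `data`
nyc_map = folium.Map([40.7128,-74.0060],zoom_start=11)
HeatMap(df[['latitude','longitude']],radius=8,gradient={0.2:'blue',0.4:'purple',0.6:'orange',1.0:'red'}).add_to(nyc_map)
display(nyc_map)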

The highest-density areas are marked in red and the lowest-density areas in blue.

In [None]:
plt.figure(figsize=(17,8))
plt.scatter(df.longitude, df.latitude, c=df.availability_365, cmap='plasma', edgecolor='black', linewidth=1, alpha=0.75)
cbar = plt.colorbar()
cbar.set_label('availability_365')
plt.title('Availability')
plt.show()

The yellow points on the map are listings with more availability throughout the year, meaning that they are offered for rent year-round rather than on a seasonal or weekly basis.

In [None]:
#word cloud
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
text = " ".join(str(each) for each in df.name)
# Create and generate a word cloud image:
wordcloud = WordCloud(width=800, height=400, margin=0,colormap='Blues').generate(text)
# Display the generated image:
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()

In [None]:
# This could be our banner at the end

# Hypothesis <a id='5'></a>
#### What do we understand from this? 

## Linear Regression
#### Hypothesis: 

The following features are the best predictors for ***price*** of airbnb listings:

- Boroughs
- Accommodates
- Bathrooms
- Bedrooms 
- Beds 
- Number of guests included (in price)

## Multinomial Logistic Regression
#### Hypothesis:  

The following features are the best predictors for ***location*** of airbnb listings:

- Accommodates
- Bathrooms
- Bedrooms 
- Beds 
- Number of guests included (in price)
- Minimum nights

In [1]:
#Machine Learning Models

# Conclusion<a id='6'></a>

In our first linear regression model, we tested the effect of location (the five NYC boroughs) on price (the cost of an Airbnb listing) and found the lowest coefficient of determination, around **0.04**. This might imply that price **doesn't** depend much on which borough the listing is located in.

The second linear regression model, which includes all features mentioned in the hypothesis except borough, had a higher coefficient of determination of around **0.5**. From this, we can conclude that these features might not necessarily be the strongest indicators but **do have** an effect on the price of an Airbnb listing. 
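For reference, the coefficient of determination compares the residual sum of squares of the fitted model with the total sum of squares around the mean: `R^2 = 1 - SS_res / SS_tot`. The models themselves are not shown in this notebook, so the following is only a sketch on synthetic data using NumPy's least-squares solver; `r_squared`, the feature, and the numbers are all assumptions for illustration:

```python
import numpy as np

def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    X = np.column_stack([np.ones(len(X)), X])      # prepend intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
    residuals = y - X @ coef
    ss_res = np.sum(residuals ** 2)                # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)           # total variation
    return 1 - ss_res / ss_tot

# Synthetic data: price rises with bedrooms, plus random noise
rng = np.random.default_rng(0)
bedrooms = rng.integers(1, 5, size=200).astype(float)
price = 50 + 40 * bedrooms + rng.normal(0, 20, size=200)

r2 = r_squared(bedrooms.reshape(-1, 1), price)
```

An R^2 of 0.04 means the features explain about 4% of the variation in price, while 0.5 means they explain about half of it.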

Furthermore, our first multinomial regression model had a score of **0.481** in predicting the location of a listing based on all the features mentioned in the hypothesis except minimum nights. We observed that this model couldn't tell the difference between the boroughs, which could be a sign of underfitting. 

In hopes of increasing the accuracy of our predictions, we built another multinomial regression model. In the second model, we added an extra feature, minimum nights, and increased the training size. Even though we increased the training size, the score decreased to **0.4774**. This could be due to having biased data to begin with. In other words, our data could consist mostly of luxurious apartments in the boroughs, which would mean that the features of the listings are too similar for our model to tell them apart. 
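Assuming the reported scores are classification accuracy, they are easier to interpret next to a simple baseline: the accuracy obtained by always predicting the most common borough. A minimal sketch with made-up labels (`majority_baseline_accuracy` is a hypothetical helper and the label counts are not from the dataset):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up borough labels, for illustration only
boroughs = ["Manhattan"] * 5 + ["Brooklyn"] * 4 + ["Queens"]
baseline = majority_baseline_accuracy(boroughs)
```

If a model's score is close to this baseline, it has likely learned little beyond the class frequencies, which would be consistent with the model failing to tell the boroughs apart.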

The next logical solution would be to analyze if we can predict 