> ### Please upvote if you like it and leave any suggestions in the comments section.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/Airbnb_Logo_Bélo.svg/1920px-Airbnb_Logo_Bélo.svg.png)

**Airbnb, Inc.** is an online marketplace for arranging or offering lodging, primarily homestays, or tourism experiences. The company does not own any of the real estate listings, nor does it host events; it acts as a broker, receiving commissions from each booking.

Airbnb provides a platform for hosts to accommodate guests with short-term lodging and tourism-related activities. Guests can search for lodging using filters such as lodging type, dates, location, and price. Hosts provide prices and other details for their rental or event listings, such as the allowed number of guests, home type, rules, and amenities. Pricing is determined by the host, with recommendations from Airbnb. Hosts and guests can leave reviews about the experience.

![](https://www.macobserver.com/wp-content/uploads/2018/05/airbnb-summer-vacation-768x449.png)

The dataset covers Airbnb listings and their metrics in NYC from 2008 to 2019. It includes information about hosts, geographical availability, and the metrics needed for analysis. First we will look at which details are provided, then analyse them and make predictions.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime 
import seaborn as sb
import matplotlib.pyplot as plt
import folium
from folium.plugins import MarkerCluster
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

> # About the data

For data analysis, **pandas** is the library commonly used to load or read the data. Its `info` function provides the basic information about the dataset, such as the total number of rows (entries), the total number of columns (features), and the data type of each feature.
 
 
**What extra information can we get from the dataset?**
1. Count of non-null values in each feature.
2. Count of columns of each data type.
3. Memory usage of the dataset.
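
Each of these three items can also be read off separately, which helps when the `info()` output gets long. The following is a minimal sketch on a small made-up frame; the `toy` data and its values are assumptions for illustration, not taken from the Airbnb file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "name": ["a", None, "c"],
    "price": [100, 80, 120],
    "reviews_per_month": [0.5, np.nan, 1.2],
})

non_null_counts = toy.notna().sum()               # 1. non-null entries per column
dtype_counts = toy.dtypes.value_counts()          # 2. number of columns per data type
memory_bytes = toy.memory_usage(deep=True).sum()  # 3. total memory in bytes

print(non_null_counts, dtype_counts, memory_bytes, sep="\n\n")
```

`deep=True` asks pandas to measure the actual size of string values instead of only the pointers to them.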

### **Information on the Dataset**

In [None]:
ny_data = pd.read_csv('../input/new-york-city-airbnb-open-data/AB_NYC_2019.csv',index_col = 'host_id')
ny_data.info()

The dataset contains 15 features (columns) and about 48.9k entries (rows), which makes roughly 733.4k data points in total. The feature names are listed above and are covered in detail in the next section. There are three data types, float, int and object (string), and the dataset occupies around 6 MB of memory. The features name, host name, last review and reviews per month have missing values.

> # Columns in the Data

Columns are also called features. Each feature acts like a dimension: if we take two features and plot their relation we get a line (not necessarily a straight one), and with three features the relation becomes a surface (a plane in the simplest case). So with N features in total, their relationship can be pictured as an (N-1)-dimensional surface.

The features **Name**, **Host Id** and **Host Name** are unlikely to help in analysis or prediction. **No of reviews** and **Reviews per month** are related, so one of them might seem redundant; however, we only have the date of the last review and not the first, so one cannot be derived from the other and both are required. We will use all the data for analysis except **Name**, **Host Id** and **Host Name**.

**Some other important things to notice**
1. The dataset has missing values, so we need to handle them first.
2. Last review is a date, but its data type is shown as object.
3. We should also check for duplicate entries.

These steps are a must in any data analysis project before getting into the analysis itself. So we first clean the dataset and then analyse it.
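
Points 2 and 3 can be checked with a couple of pandas calls. This is a small sketch on made-up rows (the `toy` frame is an assumption for illustration), not the cleaning code used later in this notebook.

```python
import pandas as pd

# Hypothetical rows; the third row repeats the first one exactly.
toy = pd.DataFrame({
    "id": [1, 2, 1],
    "last_review": ["2019-07-01", None, "2019-07-01"],
})

# Parse the text column into real dates; missing values become NaT.
toy["last_review"] = pd.to_datetime(toy["last_review"])

n_duplicates = toy.duplicated().sum()   # rows identical to an earlier row
deduplicated = toy.drop_duplicates()    # keeps the first copy of each row

print(toy.dtypes["last_review"], n_duplicates, len(deduplicated))
```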

> # Clean the Dataset

Here we make sure that the dataset is ready for analysis without errors. The errors we check for are:
1. Missing values
2. Duplicate entries 
3. Other minor corrections

**Missing values**

From the previous section, name, host name, last review and reviews per month have missing values. Name and host name are unlikely to be useful in analysis or prediction, so we only handle the missing values in the last review and reviews per month columns.

**How do we handle missing values?**

We first check whether there is a link between the columns with missing values and the other columns, so that the missing values can be filled in from them.
If we do not find any link and cannot fill in the missing values, the remaining option is to remove those rows from the dataset.
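
One way to look for such a link is to cross-tabulate "value is missing" against a condition on another column. The sketch below uses a made-up `toy` frame (an assumption, not the real data) whose column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: reviews_per_month is missing exactly where there are no reviews.
toy = pd.DataFrame({
    "number_of_reviews": [0, 3, 0, 10],
    "last_review": [None, "2019-06-01", None, "2019-07-08"],
    "reviews_per_month": [np.nan, 0.4, np.nan, 2.1],
})

no_reviews = toy["number_of_reviews"] == 0
missing_rpm = toy["reviews_per_month"].isna()

# If the off-diagonal cells are 0, the two conditions always occur together.
link = pd.crosstab(no_reviews, missing_rpm)
same_rows = bool((no_reviews == missing_rpm).all())

print(link)
print(same_rows)
```

When the two conditions coincide like this, the missing values carry a clear meaning ("no reviews yet") and can be filled in instead of dropped.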

### **Missing values**

In [None]:

# Rows with no reviews: check what the two review-related columns contain.
print(ny_data[ny_data['number_of_reviews']==0][['number_of_reviews','last_review','reviews_per_month']])

Interestingly, the number of entries with missing values in both last review and reviews per month is 10,052, and the number of entries with number of reviews = 0 is also 10,052. **The table above tells us that whenever number of reviews = 0, last review and reviews per month are not defined, i.e. NaN.**


**What is the solution?**
1. If the total number of reviews is 0, we can replace reviews per month with 0.
2. For the missing values in last review, note that the latest review in the whole dataset is from **8 July 2019**. We can therefore create a new feature, **No of days since last review**, which is the difference between 8 July 2019 and the last review. If last review is NaN, we set the new feature to -1, which tells us that the listing has never been reviewed.

In [None]:
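# No reviews means a review rate of zero.
ny_data['reviews_per_month'] = ny_data['reviews_per_month'].fillna(0)

# Latest review date in the data. Listings that were never reviewed get the day
# after it (same YYYY-MM-DD format as the column), so their difference is -1.
today = datetime.datetime(2019,7,8)
tomorrow = (today + datetime.timedelta(days=1)).strftime('%Y-%m-%d')
ny_data['last_review'] = ny_data['last_review'].fillna(tomorrow)

ny_data['days_since_last_review'] = (pd.to_datetime(today) - pd.to_datetime(ny_data['last_review']))/np.timedelta64(1,'D')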

After handling the missing values, you can check the info and the head of the dataset.

 ### **Information on data after cleaning the data**

In [None]:
print(ny_data.info())

> # Analysis

First, let's remove the columns that are not helpful in the analysis: **id, name, host name, last review**.

### **After Removing unnecessary features**

In [None]:
ny_data = ny_data.drop(['name','id','host_name','last_review'],axis = 1)
ny_data.head()

### **Initial Understanding**

In [None]:
plt.figure(figsize = (18,18));
plt.subplot(221);
counts = ny_data['room_type'].value_counts();
plt.pie(counts,labels = counts.index,colors=['darkcyan','skyblue','powderblue'],autopct='%1.2f%%');
plt.title('Distribution of data with respect to room type');

plt.subplot(222);
sb.countplot(data = ny_data ,x= 'neighbourhood_group',color = 'cadetblue');
plt.title('Distribution of data with respect to neighbourhood group');
sb.despine(offset = 10,left = True,bottom = True);

plt.subplot(223);
plt.hist(ny_data['price'],bins =60,color = 'cadetblue');
plt.title('Distribution of data with respect to price');
plt.xlim(0,2500);
plt.xlabel('price $');
plt.ylabel('count');

plt.subplot(224);
plt.hist(ny_data['availability_365'],bins =20,color='cadetblue');
plt.title('Distribution of data with respect to availability');
plt.xlabel('availability');
plt.ylabel('count');

sb.despine(left=True,bottom=True);

***Summary:***

1. Around 98% of all listings in New York offer either an entire home or a private room.
2. Brooklyn and Manhattan have the most listings, so a high share of the hosts are from those areas.
3. The price of a typical listing can be expected to be below \$500.
4. Almost 50% of the rooms are available for only around 20 days in the whole year, while the availability of the remaining 50% varies, i.e. we have only about a 50% chance of getting a room whenever we want.
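
Figures like these can also be computed directly instead of being read off the plots. The sketch below shows the calls on a made-up `toy` frame; the values are assumptions for illustration and do not reproduce the percentages above.

```python
import pandas as pd

# Hypothetical listings with the same column names as the dataset.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Private room", "Shared room"],
    "price": [200, 80, 60, 40],
    "availability_365": [0, 10, 300, 365],
})

room_share = toy["room_type"].value_counts(normalize=True) * 100  # percent per room type
share_under_500 = (toy["price"] < 500).mean() * 100               # percent of listings below $500
median_availability = toy["availability_365"].median()            # half the listings are at or below this

print(room_share, share_under_500, median_availability, sep="\n")
```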


### **Complete Analysis neighbourhood-groupwise**

Let's look at the data by neighbourhood group and see what it tells us about each area. **We will look at only two neighbourhood groups, Brooklyn and Queens, and then proceed to the prediction algorithm.**

<h2><center>Brooklyn</center></h2>

### **Distribution of all listings in brooklyn**
(select those numbered points to zoom in)

In [None]:
brooklyn_data = ny_data.loc[ny_data['neighbourhood_group']=='Brooklyn']

map_brooklyn=folium.Map(location=[40.638177,-73.964160],tiles="stamenterrain",zoom_start=11)
marker_cluster = MarkerCluster().add_to(map_brooklyn)

locations = brooklyn_data[['latitude', 'longitude']]
locationlist = locations.values.tolist()

for point in range(0, len(locationlist)):
    folium.Marker(locationlist[point]).add_to(marker_cluster)
    
map_brooklyn

The map above gives an overview of all the listings in Brooklyn. Select a bubble and the map zooms in to the locations inside it. It is simply a graphical view of where the listings are, for the reader's convenience.

### **Listings view with respect to room type**

In [None]:
fig = plt.figure(figsize=(18,12))
fig.add_subplot(1,2,1)
# Recent seaborn versions require x and y to be passed as keywords.
g=sb.scatterplot(x=brooklyn_data.longitude,y=brooklyn_data.latitude,hue=brooklyn_data.room_type,palette='deep',alpha =0.6)
g.set_title('Distribution of listings based on room type according to lat-long location points');
sb.despine(offset = 10,trim=True,ax=g);
fig.add_subplot(2,2,2)
x=sb.violinplot(x=brooklyn_data.room_type,y=brooklyn_data.price,palette='deep',inner="quartile");
plt.ylim(-100,500);
plt.title('Price distribution according to room type');
sb.despine(offset = 10,bottom = True,left=True,ax=x);
fig.add_subplot(2,2,4);
y=sb.violinplot(x=brooklyn_data.room_type,y=brooklyn_data.availability_365,palette='deep',inner="quartile");
sb.despine(offset = 10,bottom = True,left=True,ax=y);
plt.title('Availability distribution according to room type');

Brooklyn has about 20.1k listings, a few of which are very high priced, so I limited the price range to \$500 to get a better view of the data. The left graph (lat-long) contains all the listings, unlike the price plot on the right, where the price axis is restricted.

### **Parameters variance with price according to room type**

In [None]:
new_data=brooklyn_data[brooklyn_data.columns[4:]] 
new_data=new_data[new_data['price']<2100]
y=new_data.columns
count=0
plt.figure(figsize=(18,18));
for i in range(2,8):
    plt.subplot(3,3,count+1);
    sb.scatterplot(data=new_data,x='price',y=y[i],hue='room_type',palette='deep');
    count=count+1
sb.despine(offset = 10,bottom = True,left=True);

The graphs above show the relation between price and the other parameters in the data. The relations do not look linear. All the plots are split by room type, and each type keeps the same colour in every graph, including the previous ones.

In [None]:
brooklyn_data = ny_data.loc[ny_data['neighbourhood_group']=='Brooklyn']
plt.figure(figsize = (12,14));

plt.subplot(221);
new_data = brooklyn_data.groupby('neighbourhood')['price'].mean();
new_data = pd.DataFrame({'neighbourhood':new_data.index,'price':new_data.values}).sort_values(by=['price']).head(10);
barlist=plt.barh(new_data['neighbourhood'],new_data['price'],color = 'cadetblue');
for i, v in enumerate(new_data['price']):
    plt.text(v + 1,i-0.3, str(round(v,2)), color='grey', fontweight='bold');
sb.despine(left=True,bottom=True);
plt.xticks([]);
barlist[7].set_color('lightcoral');
plt.title('Best Places in Brooklyn by average price $');


plt.subplot(222);
new_data = brooklyn_data.groupby('neighbourhood')['availability_365'].mean();
new_data = pd.DataFrame({'neighbourhood':new_data.index,'availability_365':new_data.values}).sort_values(by=['availability_365']).tail(10);
barlist=plt.barh(new_data['neighbourhood'],new_data['availability_365'],color = 'cadetblue');
for i, v in enumerate(new_data['availability_365']):
    plt.text(v + 1,i-0.3, str(round(v,2)), color='grey', fontweight='bold');
sb.despine(left=True,bottom=True);
plt.xticks([]);
barlist[4].set_color('lightcoral');
plt.title('Best Places in Brooklyn by average availability in days');

plt.subplot(212);
new_data = brooklyn_data.groupby('neighbourhood')['number_of_reviews'].mean();
new_data = pd.DataFrame({'neighbourhood':new_data.index,'number_of_reviews':new_data.values}).sort_values(by=['number_of_reviews']).tail(10);
barlist = plt.bar(new_data['neighbourhood'],new_data['number_of_reviews'],color = 'cadetblue');
for i, v in enumerate(new_data['number_of_reviews']):
    plt.text(i-0.3,v + 1, str(round(v,2)), color='grey', fontweight='bold');
sb.despine(left=True,bottom=True);
plt.yticks([]);
barlist[1].set_color('lightcoral');
plt.title('Best Places in Brooklyn by average no of reviews');
plt.tight_layout()

***Summary:***
1. The median price of an entire home/apt is around \$150, which is higher than that of private and shared rooms in Brooklyn. In terms of availability, shared rooms have the highest rate, with a median of around 200 days.
2. Shared rooms are priced lower and have high availability in Brooklyn, but few listings offer them.
3. In all the relationships, private/shared rooms show high y-axis values at low prices, whereas entire homes/apts show high values at higher prices.
4. Number of reviews and reviews per month are lower at higher prices, which is expected because fewer people can afford those room types at that price.
5. For high-priced rooms, availability is either very low (2-5 days) or very high (300-365 days); most of these are entire homes/apts.
6. Interestingly, at low prices a set of houses has a high minimum-nights requirement of around 200 days. There are also low-priced houses with a small minimum-nights requirement, but above \$250 minimum nights are rarely a concern.
7. Days since last review, combined with number of reviews and reviews per month, tells us how frequently a room is being rated. Note that not every guest leaves a rating, since it is not compulsory.
8. East New York is the one neighbourhood that appears in the top 10 of all three categories: average price, average availability and average number of reviews.
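
A neighbourhood that appears in all three rankings, as in the last point, can be found with a set intersection instead of comparing the bar charts by eye. This is a minimal sketch on made-up data; the `toy` frame, the neighbourhood names `A`, `B`, `C` and the cut-off `k` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical listings in three neighbourhoods.
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "C", "C"],
    "price": [50, 60, 200, 220, 90, 100],
    "availability_365": [300, 320, 100, 120, 10, 20],
    "number_of_reviews": [40, 50, 30, 20, 5, 1],
})

means = toy.groupby("neighbourhood")[["price", "availability_365", "number_of_reviews"]].mean()
k = 2  # the charts above use 10; 2 keeps the toy example readable

cheapest = set(means["price"].nsmallest(k).index)                  # lowest average price
most_available = set(means["availability_365"].nlargest(k).index)  # highest average availability
most_reviewed = set(means["number_of_reviews"].nlargest(k).index)  # highest average review count

in_all_three = cheapest & most_available & most_reviewed
print(in_all_three)
```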


<h2><center>Queens</center></h2>

### **Distribution of all listings in Queens**
(select those numbered points to zoom in)

In [None]:
queens_data = ny_data.loc[ny_data['neighbourhood_group']=='Queens']

map_queens=folium.Map(location=[40.6582,-73.7949],tiles="stamenterrain",zoom_start=11)
marker_cluster = MarkerCluster().add_to(map_queens)

locations = queens_data[['latitude', 'longitude']]
locationlist = locations.values.tolist()

for point in range(0, len(locationlist)):
    folium.Marker(locationlist[point]).add_to(marker_cluster)
    
map_queens

The map above gives an overview of all the listings in Queens. Select a bubble and the map zooms in to the locations inside it. It is simply a graphical view of where the listings are, for the reader's convenience.

### **Listings view with respect to room type**

In [None]:
fig = plt.figure(figsize=(18,12))
fig.add_subplot(1,2,1)
# Recent seaborn versions require x and y to be passed as keywords.
g=sb.scatterplot(x=queens_data.longitude,y=queens_data.latitude,hue=queens_data.room_type,palette='deep',alpha =0.6)
g.set_title('Distribution of listings based on room type according to lat-long location points');
sb.despine(offset = 10,trim=True,ax=g);
fig.add_subplot(2,2,2)
x=sb.violinplot(x=queens_data.room_type,y=queens_data.price,palette='deep',inner="quartile");
plt.ylim(-100,500);
plt.title('Price distribution according to room type');
sb.despine(offset = 10,bottom = True,left=True,ax=x);
fig.add_subplot(2,2,4);
y=sb.violinplot(x=queens_data.room_type,y=queens_data.availability_365,palette='deep',inner="quartile");
sb.despine(offset = 10,bottom = True,left=True,ax=y);
plt.title('Availability distribution according to room type');

Queens has about 5.6k listings and, as with the Brooklyn data above, a few of them are very high priced, so I limited the price range to \$500 to get a better view of the data. The left graph (lat-long) contains all the listings, unlike the price plot on the right, where the price axis is restricted.

### **Parameters variance with price according to room type**

In [None]:
new_data=queens_data[queens_data.columns[4:]] 
new_data=new_data[new_data['price']<2100]
y=new_data.columns
count=0
plt.figure(figsize=(18,18));
for i in range(2,8):
    plt.subplot(3,3,count+1);
    sb.scatterplot(data=new_data,x='price',y=y[i],hue='room_type',palette='deep');
    count=count+1
sb.despine(offset = 10,bottom = True,left=True);

The relationships are represented in the same way as for Brooklyn, but there is less data than for Brooklyn. The price range is again limited to around \$2000 and none of the relations is linear. The colour coding is the same as for Brooklyn.

In [None]:
# Use the Queens subset under its own name so the Brooklyn data is not overwritten.
queens_data = ny_data.loc[ny_data['neighbourhood_group']=='Queens']
plt.figure(figsize = (12,18));

plt.subplot(221);
new_data = queens_data.groupby('neighbourhood')['price'].mean();
new_data = pd.DataFrame({'neighbourhood':new_data.index,'price':new_data.values}).sort_values(by=['price']).head(10);
barlist=plt.barh(new_data['neighbourhood'],new_data['price'],color = 'cadetblue');
for i, v in enumerate(new_data['price']):
    plt.text(v + 1,i-0.3, str(round(v,2)), color='grey', fontweight='bold');
sb.despine(left=True,bottom=True);
plt.xticks([]);
barlist[9].set_color('lightcoral');
plt.title('Best Places in Queens by average price $');


plt.subplot(222);
new_data = queens_data.groupby('neighbourhood')['availability_365'].mean();
new_data = pd.DataFrame({'neighbourhood':new_data.index,'availability_365':new_data.values}).sort_values(by=['availability_365']).tail(10);
barlist=plt.barh(new_data['neighbourhood'],new_data['availability_365'],color = 'cadetblue');
for i, v in enumerate(new_data['availability_365']):
    plt.text(v + 1,i-0.3, str(round(v,2)), color='grey', fontweight='bold');
sb.despine(left=True,bottom=True);
plt.xticks([]);
barlist[5].set_color('lightcoral');
plt.title('Best Places in Queens by average availability in days');

plt.subplot(212);
new_data = queens_data.groupby('neighbourhood')['number_of_reviews'].mean();
new_data = pd.DataFrame({'neighbourhood':new_data.index,'number_of_reviews':new_data.values}).sort_values(by=['number_of_reviews']).tail(10);
barlist = plt.bar(new_data['neighbourhood'],new_data['number_of_reviews'],color = 'cadetblue');
for i, v in enumerate(new_data['number_of_reviews']):
    plt.text(i-0.3,v + 1, str(round(v,2)), color='grey', fontweight='bold');
sb.despine(left=True,bottom=True);
plt.yticks([]);
barlist[7].set_color('lightcoral');
plt.xticks(rotation=15);
plt.title('Best Places in Queens by average no of reviews');
plt.tight_layout()

***Summary:***
1. The median price of an entire home/apt is around \$130, which is fairly close to that of private and shared rooms, with median prices of around \$80 and \$50 respectively. The average availability of shared rooms is around 200 days, but unlike Brooklyn, the average availability of private rooms and entire homes/apts here is around 100 days.
2. Listings of shared rooms are few, but their average price is low and their average availability is high, at around 200 days.
3. The minimum number of nights decreases as the price increases. From a price of around \$250, minimum nights are no longer an issue.
4. Number of reviews and reviews per month are higher at low prices than at high prices, as fewer people choose high-priced rooms.
5. As in Brooklyn, the spread of availability is wide at low prices, while at higher prices availability is either very low or very high.
6. At low prices some listings have a high value for days since last review, which suggests that not every guest who rents rates the residence. Combined with number of reviews and reviews per month, it tells us how frequently a room is being rated.
7. South Ozone Park is the neighbourhood that appears in the top 10 of all three categories: average price, average availability and average number of reviews.

### Prediction
As we can see, the missing values are already handled. The remaining steps to perform before prediction are:
1. Removing outliers in the data
2. Feature Selection
3. Feature Encoding
4. Feature Scaling
5. Train Test Split
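
As a rough illustration of the last two steps, the sketch below splits a frame into train and test parts and standardises the features using only pandas and NumPy. The `toy` data, the 80/20 ratio and the choice of standardisation are assumptions for illustration; scikit-learn's `train_test_split` and `StandardScaler` are the usual tools for the same job.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with the same column names as the dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "minimum_nights": rng.integers(1, 30, size=100),
    "number_of_reviews": rng.integers(0, 200, size=100),
    "price": rng.integers(40, 400, size=100),
})

# Train/test split: 80% of the rows for training, the rest for testing.
train = toy.sample(frac=0.8, random_state=0)
test = toy.drop(train.index)

# Feature scaling (standardisation) with statistics from the training rows only,
# so that no information from the test rows leaks into training.
features = ["minimum_nights", "number_of_reviews"]
mean, std = train[features].mean(), train[features].std()
X_train = (train[features] - mean) / std
X_test = (test[features] - mean) / std
y_train, y_test = train["price"], test["price"]

print(X_train.shape, X_test.shape)
```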


In [None]:
ny_data.info()


### 1. Removing Outliers 

Outliers are dangerous for the training data because extreme values affect model training heavily. To avoid this we remove them from the training data. To identify outliers in a single variable we can use box plots.
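
A box plot with default settings marks as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically; the sketch below uses a made-up `prices` series (an assumption, not the real price column).

```python
import pandas as pd

# Hypothetical prices with one obvious extreme value.
prices = pd.Series([50, 60, 70, 80, 90, 100, 1000])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1                     # interquartile range: height of the box
lower = q1 - 1.5 * iqr            # lower whisker limit
upper = q3 + 1.5 * iqr            # upper whisker limit

outliers = prices[(prices < lower) | (prices > upper)]
print(lower, upper, outliers.tolist())
```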

In [None]:
plt.figure(figsize=(16,10))
plt.subplot(231)
ny_data.boxplot(column=['price']);
plt.subplot(232)
ny_data.boxplot(column=['calculated_host_listings_count']);
plt.subplot(233)
ny_data.boxplot(column=['days_since_last_review']);
plt.subplot(234)
ny_data.boxplot(column=['reviews_per_month']);
plt.subplot(235)
ny_data.boxplot(column=['number_of_reviews']);
plt.subplot(236)
ny_data.boxplot(column=['minimum_nights']);

In some cases, removing outliers from one column also reduces the outliers in the other columns, but that is not the case here. In the example below I restrict the price to below 200 and redraw the box plots to see what this does to the outliers in the other columns.


In [None]:
newdata = ny_data[ny_data['price']<200]
plt.figure(figsize=(16,10))
plt.subplot(231)
newdata.boxplot(column=['price']);
plt.subplot(232)
newdata.boxplot(column=['calculated_host_listings_count']);
plt.subplot(233)
newdata.boxplot(column=['days_since_last_review']);
plt.subplot(234)
newdata.boxplot(column=['reviews_per_month']);
plt.subplot(235)
newdata.boxplot(column=['number_of_reviews']);
plt.subplot(236)
newdata.boxplot(column=['minimum_nights']);

As you can see, there is very little change in the outliers of the other columns. Outliers should also be rare, but here there are a lot of them, which means they have an impact on the rest of the data. Let's look at the multivariate relations to spot possible outliers.

In [None]:
plt.figure(figsize=(16,10))
plt.subplot(231)
plt.scatter(data=ny_data,x='calculated_host_listings_count',y='price',alpha=0.2);
plt.xlabel('calculated_host_listings_count');
plt.ylabel('Price');
plt.subplot(232)
plt.scatter(data=ny_data,x='days_since_last_review',y='price',alpha=0.2);
plt.xlabel('days_since_last_review');
plt.ylabel('Price');
plt.subplot(233)
plt.scatter(data=ny_data,x='reviews_per_month',y='price',alpha=0.2);
plt.xlabel('reviews_per_month');
plt.ylabel('Price');
plt.subplot(234)
plt.scatter(data=ny_data,x='number_of_reviews',y='price',alpha=0.2);
plt.xlabel('number_of_reviews');
plt.ylabel('Price');
plt.subplot(235)
plt.scatter(data=ny_data,x='minimum_nights',y='price',alpha=0.2);
plt.xlabel('minimum_nights');
plt.ylabel('Price');
plt.subplot(236)
plt.scatter(data=ny_data,x='availability_365',y='price',alpha=0.2);
plt.xlabel('availability_365');
plt.ylabel('Price');

From five of the graphs above, each showing a feature against price, we can cut off values beyond a certain point so that training does not consider them. These extreme values would considerably affect the training. Instead of deleting them, let's collect them in a separate dataset.

1. From price vs calculated host listings count -> the limit I chose for outliers is calculated host listings count > 60.
2. From price vs days since last review -> the limit I chose for outliers is days since last review >= 2000.
3. From price vs reviews per month -> the limit I chose for outliers is reviews per month >= 15.
4. From price vs number of reviews -> the limit I chose for outliers is number of reviews > 500.
5. From price vs minimum nights -> the limit I chose for outliers is minimum nights >= 300.

In all the graphs, points with price > 4500 are very few in number, but their effect on price is much larger than that of the rest of the data, so we also move them into the outlier data.

### Info on Outlier data After separating from main data

In [None]:
# Rows that cross any of the limits chosen above are treated as outliers.
outlier_mask = ( (ny_data['price']>4500) | (ny_data['calculated_host_listings_count']>60) | (ny_data['days_since_last_review']>=2000) 
                | (ny_data['reviews_per_month']>=15) | (ny_data['number_of_reviews']>500) | (ny_data['minimum_nights']>=300) )

outlier_data = ny_data[outlier_mask].copy()
# Keep only the remaining rows in the main data.
ny_data = ny_data[~outlier_mask].copy()

outlier_data.head()

In [None]:
outlier_data.info()

### 2. Feature Selection

Three techniques that we can use for feature selection.
1. Univariate Selection -> Filter based model
2. Feature Importance -> embedded model
3. Correlation Matrix with HeatMap

### 1. Univariate Selection
This combines a statistical test with the `SelectKBest` method of sklearn to select the K features that score best in the chosen test. There are different statistical filters, such as Pearson's, Spearman's, ANOVA, Kendall and Chi-squared, and the right one depends on the types of the input feature and the target feature. **Kendall and Spearman** come from the **scipy** library; the rest of the filters are available in **sklearn**.


The target variable is numerical, and the data has 8 numerical features (other than the target variable) and 3 categorical ones. None of them appears to have a linear relation with the target, so we apply Spearman's filter to the numerical input variables and Kendall's filter to the categorical ones to obtain their scores. Based on the scores we pick the top 8 columns to use for prediction.
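
To make the scoring idea concrete, here is a minimal sketch of ranking numerical features by Spearman's coefficient. Spearman's coefficient is the Pearson correlation of the ranks, so it can be computed with pandas alone; `scipy.stats.spearmanr` returns the same coefficient together with a p-value. The `toy` frame and the choice of keeping one feature are assumptions for illustration, not the selection actually run on the dataset.

```python
import pandas as pd

# Hypothetical data: price grows with minimum_nights, but not linearly.
toy = pd.DataFrame({
    "minimum_nights": [1, 2, 3, 4, 5, 6],
    "number_of_reviews": [30, 5, 12, 40, 2, 9],
})
toy["price"] = toy["minimum_nights"] ** 3

features = ["minimum_nights", "number_of_reviews"]

# Spearman = Pearson correlation computed on ranks.
ranks = toy.rank()
spearman_scores = ranks[features].corrwith(ranks["price"]).abs().sort_values(ascending=False)

k = 1  # the text above keeps the top 8; 1 is enough for two toy features
top_features = spearman_scores.head(k).index.tolist()
print(spearman_scores)
print(top_features)
```

Because the coefficient depends only on ranks, a monotonic but non-linear relation such as the cubic one here still gets a perfect score, which is why it suits data where the relations are not linear.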

### 3. Feature Encoding

This is where we convert the categorical data into numeric form so that training can build a relation between the target column and the categorical columns.

Three columns are categorical: neighbourhood_group, neighbourhood and room_type. All of them should be converted into numeric form.
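
The effect of one-hot encoding is easiest to see on a tiny example. The sketch below encodes a made-up `toy` frame (an assumption for illustration) with `pd.get_dummies`; `drop_first=True` drops the first category so that the remaining columns are not redundant.

```python
import pandas as pd

# Hypothetical listings with one categorical column.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [80, 150, 40],
})

# One indicator column per category, minus the first one (alphabetically).
encoded = pd.get_dummies(toy, columns=["room_type"], prefix=["room_type"], drop_first=True)
print(encoded)
```

A row with zeros in all remaining `room_type_*` columns belongs to the dropped category, here `Entire home/apt`.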

### main data

In [None]:
ny_data=pd.get_dummies(ny_data,prefix=['neighbourhood_group','neighbourhood','room_type'], drop_first=True)
ny_data.head()

In [None]:
ny_data.columns

### outlier data

In [None]:
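outlier_data=pd.get_dummies(outlier_data,prefix=['neighbourhood_group','neighbourhood','room_type'])
# Drop the same reference categories that drop_first=True removed from the main data.
outlier_data.drop(columns=['room_type_Entire home/apt','neighbourhood_group_Bronx'],inplace=True)
# Add, filled with 0, every column that exists in the main data but not among the outliers.
new_cols = list(ny_data.columns.difference(outlier_data.columns))
for col in new_cols:
    outlier_data[col]=0
outlier_data.head()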


In [None]:
outlier_data.columns