## **AirBnb Booking Analysis**



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. Today, Airbnb is a one-of-a-kind service that is used and recognized worldwide. Analysis of the millions of listings on Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customer and host behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.

This dataset has around 49,000 observations and 16 columns, a mix of categorical and numeric values.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**EDA**
Exploratory Data Analysis (EDA) is an approach to analyze the data using visual techniques. It is used to discover trends, patterns, or to check assumptions with the help of statistical summary and graphical representations.



*   Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. Today, Airbnb is a one-of-a-kind service that is used and recognized worldwide.
*   Analysis of the millions of listings on Airbnb is crucial for the company.
*   These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customer and host behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.










#### **Define Your Business Objective?**



*   Find the total number of listings available in New York.
*   Find the average price.
*   Find the hosts with the most listings throughout New York.
*   Find out how price relates to the most-reviewed places.
*   See how price varies across listings throughout New York.







# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from datetime import datetime

**Now we mount the drive and import the dataset (AirBnb NYC 2019)**

In [None]:
#Drive Mounting
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# import dataset from drive
file_path= '/content/drive/MyDrive/Colab Notebooks/data/Airbnb NYC 2019.csv'
airbnb_df=pd.read_csv(file_path)
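
The guidelines above ask for exception handling, so the load step could be guarded. The following is only a minimal sketch: `load_airbnb_data` is a hypothetical helper name that does not appear in the original notebook, and it assumes the file is a plain CSV readable by `pd.read_csv`.

```python
from pathlib import Path

import pandas as pd


def load_airbnb_data(path):
    """Read the listings CSV, failing early with a clear message.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        # Raise before pandas does, so the message names the missing path.
        raise FileNotFoundError(f"Dataset not found at: {csv_path}")
    return pd.read_csv(csv_path)
```

It would be called as `airbnb_df = load_airbnb_data(file_path)`, so a wrong Drive path stops the notebook at this cell instead of failing later.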

### Dataset First View

In [None]:
# Dataset First Look
# Top 5 Row of data
airbnb_df.head()

In [None]:
# Last 5 rows of data
airbnb_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
airbnb_df.shape

### Dataset Information

In [None]:
# Dataset Info
airbnb_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
airbnb_df.duplicated().sum()

There are 0 duplicate values present.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
airbnb_df.isna().sum()


There are 16 missing values in the "name" column and 21 in "host_name". These are very few compared to the size of the data frame, and we can ignore them because each listing and host is already identified by a unique id.

The "last_review" and "reviews_per_month" columns, however, have a noticeable number of missing records.

We can replace the NaN values in "reviews_per_month" with 0 and drop the "last_review" column.
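
To judge whether a column's missing values are "very few" or "noticeable", the counts can be expressed as percentages. This is a small sketch on a made-up frame; `toy_df` and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for airbnb_df (illustrative values only).
toy_df = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room"],
    "reviews_per_month": [0.5, np.nan, np.nan, 2.0],
    "price": [100, 150, 80, 60],
})

# isna() gives booleans; their mean is the fraction of missing values per column.
missing_pct = toy_df.isna().mean().mul(100).round(1)
print(missing_pct)
```

The same expression applied to `airbnb_df` would put the null counts on a comparable scale across columns.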

In [None]:
# Visualizing the missing values
import missingno as msno  # to visualize the missing values


In [None]:
msno.bar(airbnb_df)

In [None]:
msno.matrix(airbnb_df)

The black area shows non-null values and the white area shows null values.
"last_review" and "reviews_per_month" have a lot of missing values.
We will replace the missing values of "reviews_per_month" with 0 and drop the "last_review" column.

In [None]:
airbnb_df.shape

# ***Treating Null Values***

In [None]:
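# Replace null values of reviews_per_month with 0
# (assigning back avoids the chained-assignment pitfall of inplace=True on a column)
airbnb_df['reviews_per_month'] = airbnb_df['reviews_per_month'].fillna(0)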

In [None]:
#dropping columns that are not significant or could be unethical to use for our future data exploration and predictions
airbnb_df.drop(['last_review', 'id', 'host_name'], axis=1, inplace= True)

In [None]:
airbnb_df.info()

In [None]:
airbnb_df.head(2)

In [None]:
#checking the null value again
airbnb_df.isna().sum()

### What did you know about your dataset?

In [None]:
#Know the columns names
airbnb_df.columns

Let's look more closely at the price column, as it needs more attention than the other columns.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Describe
airbnb_df.price.describe()

# **Treating Outliers**

In [None]:
plt.figure(figsize=(10,5))
plt.ylim(0,1500)
sns.boxplot(x='neighbourhood_group', y='price', data=airbnb_df)
plt.show()

In [None]:
len(airbnb_df[airbnb_df['price']>500])


Price ranges from 0 to 10,000, and there is a wide range of outliers. We will limit the effect of the outliers instead of removing them, because we do not want to lose any data. Only 1044 records have a price above 500, so we cap prices at 600 (any price above 600 is set to 600), which reduces the number of outliers.
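
Capping of this kind can also be expressed with `Series.clip`. The sketch below uses made-up prices; only the cap value of 600 is taken from the code cell that follows.

```python
import pandas as pd

# Illustrative prices, not taken from the dataset.
toy_prices = pd.Series([50, 120, 480, 750, 10000], name="price")

# Values above the cap are replaced by the cap; all others are unchanged.
capped = toy_prices.clip(upper=600)
print(capped.tolist())  # [50, 120, 480, 600, 600]
```

No rows are dropped, which matches the stated goal of keeping all the data.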

In [None]:
# Cap prices above 600 at 600; .loc avoids chained assignment
airbnb_df.loc[airbnb_df['price'] > 600, 'price'] = 600

In [None]:
plt.figure(figsize=(10,5))
plt.ylim(0,1000)
sns.boxplot(x='neighbourhood_group', y='price',data=airbnb_df)
plt.show()


Now that we have treated the outliers (reduced rather than removed) and the NaN values, we will start analysing what observations can be drawn from the dataframe.

### **Exploratory Data Analysis (EDA)**

### 1) **Rentals/properties present in Neighbourhood group, Neighbourhood, Room type**
### **a) No. of rentals/properties which are grouped by room_type in each neighbourhood**

In [None]:
#Lets find the neighbourhood group
airbnb_df.neighbourhood_group.unique()

In [None]:
airbnb_df.groupby(['neighbourhood_group'],dropna=True)['room_type'].value_counts()


In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='neighbourhood_group', hue='room_type', data=airbnb_df)
plt.show()

Throughout NY, the properties are located in 5 neighbourhood groups. Manhattan and Brooklyn clearly have the most properties. The chart also shows the room_type split.

There are 3 categories in room_type, and we can see the number of properties of each room_type in the 5 neighbourhood groups. We can conclude:

(1) Manhattan has more entire home/apt properties than any other neighbourhood group.

(2) The number of private rooms varies between neighbourhood groups. Outside Manhattan, the private room count is higher than or almost equal to the entire home/apt count.

(3) Shared rooms are very few in every neighbourhood group.



**b) Total number of rentals/properties in each neighbourhood group**


In [None]:
list(airbnb_df.neighbourhood_group.unique())

In [None]:
# Series.value_counts replaces the deprecated top-level pd.value_counts
nbd_grp_counts = airbnb_df['neighbourhood_group'].value_counts()
print(nbd_grp_counts)
plt.plot(nbd_grp_counts)
plt.xlabel("Neighbourhood Group")
plt.ylabel("Counts")
plt.title("Number of Properties in Neighbourhood Group")

This gives the count of properties in the 5 neighbourhood groups. Brooklyn and Manhattan have more properties than the other three. Manhattan in particular is the busiest, with the most properties on offer.

## **Total count of room types available in NYC**

In [None]:
# Count listings per room type (no private pandas import is needed)
airbnb_df['room_type'].value_counts()

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='room_type', data= airbnb_df)
plt.show()

Looking at room_type overall, most properties are Entire home/apt, which makes it the most widely offered room_type throughout NY.

## **Overall contributions of each neighbourhood in the count of listings throughout NYC**

In [None]:
plt.figure(figsize=(20,8))
plt.title("Neighbourhood Group")
plt.pie(airbnb_df.neighbourhood_group.value_counts(), labels=airbnb_df.neighbourhood_group.value_counts().index,autopct='%1.1f%%', startangle=180)

Manhattan has the highest share of listings, with 44.3% of the total.
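
The percentages drawn on the pie chart can also be computed directly, which makes them easier to quote. This is a sketch on an illustrative series; `toy_groups` is an assumption and its shares are not the dataset's.

```python
import pandas as pd

# Illustrative labels only: 4 Manhattan, 3 Brooklyn, 2 Queens, 1 Bronx.
toy_groups = pd.Series(
    ["Manhattan"] * 4 + ["Brooklyn"] * 3 + ["Queens"] * 2 + ["Bronx"]
)

# normalize=True returns fractions instead of counts.
share_pct = toy_groups.value_counts(normalize=True).mul(100).round(1)
print(share_pct)
```

On the real data, the same call on `airbnb_df['neighbourhood_group']` would give the shares the pie chart labels.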

## **Availability_365 and the neighbourhood_group**

In [None]:
sns.boxplot(data=airbnb_df, x='neighbourhood_group',y='availability_365')

The box plot shows that the median availability of listings in Brooklyn, Manhattan and Queens lies between 0 and 100 days a year.
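
A box plot marks the median rather than the mean, so a numeric summary per neighbourhood group helps when reading it. The sketch below uses a made-up frame; `toy_avail` and its values are assumptions for illustration.

```python
import pandas as pd

# Illustrative availability values, not taken from the dataset.
toy_avail = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn"] * 3 + ["Staten Island"] * 3,
    "availability_365": [0, 30, 200, 100, 250, 365],
})

# Median and mean side by side show how skew pulls the two apart.
availability_summary = (
    toy_avail.groupby("neighbourhood_group")["availability_365"]
    .agg(["median", "mean"])
)
print(availability_summary)
```

In the toy data, Brooklyn's median is 30 days while its mean is about 77, which is why the two should not be used interchangeably.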






### **Average price for each neighbourhood group**

We found that most properties are in Manhattan. Now we also want to know the average price of properties in each neighbourhood group.

In [None]:
avg_price_nbd= airbnb_df.groupby(['neighbourhood_group','room_type'],dropna=True)['price'].mean().reset_index()
avg_price_nbd

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='neighbourhood_group',y='price', hue='room_type', data=avg_price_nbd)
plt.show()

Manhattan leads in all respects. Its prices for every room_type are considerably higher than in the other neighbourhood groups. It is indeed a busy and famous area.

In [None]:
airbnb_df.groupby(['neighbourhood_group'])['price'].mean().reset_index()

In [None]:
def nbd_avg_price(df,x_axis,y_axis):
  group_price = df.groupby([x_axis],as_index=False)[y_axis].mean().reset_index(drop=True)
  plt.figure(figsize=(10,5))
  sns.barplot(x=group_price[x_axis],y=group_price[y_axis])
  plt.ylabel('mean ' + y_axis)
  plt.show()

In [None]:
nbd_avg_price(airbnb_df,'neighbourhood_group','price')

We already saw that Manhattan has high prices for all room_type properties. Here we visualize price against neighbourhood_group alone.

## **Average Price for each neighbourhood areas in respective neighbourhood_groups**

Here we look at the price trend for each neighbourhood within each neighbourhood group.

In [None]:
n = airbnb_df.neighbourhood_group.unique()
nbd_grp = list(n)

for i in nbd_grp:
  nbd_price= airbnb_df[airbnb_df['neighbourhood_group']==i].groupby(['neighbourhood'])['price'].mean().reset_index(drop=False)
  nbd_price = pd.DataFrame(nbd_price)
  plt.figure(figsize=(10,5))
  plt.xticks(rotation=90)
  plt.title(i)
  plt.plot(nbd_price['neighbourhood'],nbd_price['price'])

Prices fluctuate considerably between neighbourhoods in many neighbourhood groups. Manhattan in particular has neighbourhoods with both very low and very high prices.

### **Price distribution data in every neighbourhood_group**

We already saw how price varies across the neighbourhoods of each neighbourhood group. Now we look at the price distribution for each of the 5 neighbourhood groups as a whole.

In [None]:
for i in nbd_grp:
  df_price=pd.DataFrame(airbnb_df['price'][airbnb_df['neighbourhood_group']==i])
  print(i)
  print(df_price.describe(),"\n")

## **Overall Price distribution throughout NY**

So far we discussed price only by neighbourhood, neighbourhood_group, and room_type. Now let's look at how price varies throughout NY.

In [None]:
plt.figure(figsize=(10,5))
plt.xlim(0,2000)
plt.xlabel('Price')
plt.ylabel('Records')
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(airbnb_df['price'], bins=10, kde=True)
plt.show()
# A kernel density estimate (KDE) plot visualizes the distribution of observations in a dataset, analogous to a histogram.

Most prices fall in the range 0 to 250. In other words, most properties are not very expensive unless you opt for a fancy, high-priced one.
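
A statement such as "most prices are in a given range" can be backed by a single number. This is a sketch on made-up prices; the threshold of 250 comes from the text above, while `toy_prices` is an assumption.

```python
import pandas as pd

# Illustrative prices, not taken from the dataset.
toy_prices = pd.Series([40, 75, 120, 180, 240, 300, 450, 600])

# The mean of a boolean mask is the fraction of rows that satisfy it.
share_le_250 = (toy_prices <= 250).mean()
print(f"Share of listings priced at 250 or less: {share_le_250:.1%}")
```

Applied to `airbnb_df['price']`, the same expression would quantify how much of the distribution sits below 250.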

### **In search of a famous Host**


Here I searched for the most famous host. I treat the host with the most properties (whether private room, entire home, or shared room) as the most famous one: a higher property count means the host has more to offer customers.

In [None]:
# Count listings per host and keep the 20 hosts with the most listings.
# rename_axis/reset_index(name=...) gives the same column names across pandas versions.
famous_host = (
    airbnb_df['host_id'].value_counts().head(20)
    .rename_axis('Host_id')
    .reset_index(name='Count')
)
famous_host.head(2)

In [None]:
# Top 20 hosts by number of listings
plt1 = sns.barplot(x='Host_id',y='Count',data=famous_host)
plt1.set_xticklabels(plt1.get_xticklabels(), rotation=90)

The chart above shows the top 20 hosts by number of properties offered.

In [None]:
top_host_id = famous_host['Host_id'][famous_host['Count']==famous_host['Count'].max()]
top_host_name = airbnb_df[['neighbourhood_group']][airbnb_df['host_id']==top_host_id[0]].head()
top_host_name

In [None]:
airbnb_df[['neighbourhood_group','neighbourhood','room_type','price','availability_365']][airbnb_df['host_id']==top_host_id[0]]

The host with host_id 219517861, whom we identified as the famous host, has all of their properties in Manhattan.

(1) Almost all of the properties offered are 'Entire home/apt'.

(2) Many of the properties are in the Financial District, which may be more popular or more affordable. The prices are also reasonable (neither high nor low).

(3) All the properties are available for almost 300+ days a year.

So we can conclude that this host is the famous one because they offer the best property type (entire homes) in the best location (Manhattan).

**Number_of_reviews**


So far we analysed the data with respect to neighbourhood_group, neighbourhood, room_type, host_id, and price. Now we analyse the number of reviews and see what interpretation we can draw from it.

In [None]:
top_reviewed_place=airbnb_df.nlargest(10,'number_of_reviews')
top_reviewed_place.head(3)

### **Average price for the most_reviewed place:**

In [None]:
price_avrg=top_reviewed_place.price.mean()
print('Average price per night: {}'.format(price_avrg))

### **with respect to host ID**

In [None]:
nor_host = airbnb_df.groupby(['host_id'])['number_of_reviews'].max().reset_index()
nor_host = nor_host.sort_values(['number_of_reviews'],ascending=False).head(10)
nor_host

In [None]:
# Recent seaborn versions require x and y as keyword arguments
plt2 = sns.barplot(x='host_id', y='number_of_reviews', data=nor_host)
plt2.set_xticklabels(plt2.get_xticklabels(), rotation=90)

We found the top 10 hosts whose place got the most number of reviews.

### **Number of reviews for each neighbourhood_group**

In [None]:
nor_nbd_grp = airbnb_df.groupby(['neighbourhood_group','room_type'])['number_of_reviews'].max().reset_index()
nor_nbd_grp

In [None]:
sns.barplot(x=nor_nbd_grp['neighbourhood_group'],hue=nor_nbd_grp['room_type'],y=nor_nbd_grp['number_of_reviews'])


Queens has the most-reviewed place in NY. This visualization also gives a good comparison with the other neighbourhood groups.

Overall,

(1) The average price per night of the 10 most-reviewed places is 65.4.

(2) host_id 47621202 has the highest number of reviews for a single property, which is in Queens. We could also consider this host the famous host, since they have the most-reviewed place in NY. It depends on how we define "famous".

(3) The most-reviewed place has a very low price. This suggests that people tend to choose low-priced homes or private rooms, which would explain the high number of reviews.

(4) From the results above, Queens has 5666 properties in total, and a single property there received 629 reviews, which is a lot. So Queens is the neighbourhood group with the most-reviewed place. This may be because it is comparatively cheap next to Manhattan or Brooklyn.

(5) The top-reviewed place was available for almost 333 days out of 365. This suggests that people tend to choose properties that are low-priced and mostly available.
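
The link between low price and many reviews is inferred here from a handful of top-reviewed places. One way to check it across all listings is a rank correlation between `price` and `number_of_reviews`. The sketch below uses made-up values (`toy_reviews` is an assumption) and computes Spearman's correlation as the Pearson correlation of the ranks.

```python
import pandas as pd

# Illustrative values, not taken from the dataset.
toy_reviews = pd.DataFrame({
    "price": [50, 60, 80, 150, 300],
    "number_of_reviews": [600, 400, 300, 100, 20],
})

# Spearman correlation = Pearson correlation computed on the ranks.
rank_corr = toy_reviews["price"].rank().corr(
    toy_reviews["number_of_reviews"].rank()
)
print(round(rank_corr, 3))
```

A value near -1 would support the claim that cheaper places collect more reviews; a value near 0 on the real data would suggest the top-10 pattern does not generalize.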

### **Average price for room_type throughout NY**

Now we focus on the room_type column alone and find the average price of each room type throughout NY.

In [None]:
#avg_price_room_type = airbnb_df.groupby(['room_type']).agg({'price': ['mean', 'min', 'max']})
avg_price_room_type = airbnb_df.groupby(['room_type'])['price'].mean().reset_index()
avg_price_room_type

In [None]:
# Recent seaborn versions require x and y as keyword arguments
sns.barplot(x='room_type', y='price', data=avg_price_room_type)

Entire home/apt is in demand and has a higher price than private rooms and shared rooms.

### **Average price of the place which is most available**

In [None]:
most_available_place=airbnb_df.nlargest(10,'availability_365')
most_available_place.head(2)

In [None]:
price_avrg=most_available_place.price.mean()
print('Average price of most available place: {}'.format(price_avrg))

The average price of the most available places is 150.3, which is quite reasonable compared to the overall price range.

## **Average price in descending order based on minimum_nights of stay**

In [None]:
min_nights_stay = airbnb_df.groupby(['minimum_nights','neighbourhood_group','neighbourhood'],dropna=True)['price'].mean().reset_index()


In [None]:
min_nights_stay.sort_values('minimum_nights',ascending=False,inplace=True)
min_nights_stay

In [None]:
plt.scatter(min_nights_stay['minimum_nights'],min_nights_stay['price'])
plt.xlabel('Minimum_nights')
plt.ylabel('Price')

The minimum_nights versus price plot does not show a clear relationship. There are low-priced places with minimum_nights as high as 1000, and vice versa, because price depends mainly on the neighbourhood_group where you stay.
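
When a scatter plot is hard to read, grouping `minimum_nights` into a few buckets and comparing average prices can make any pattern (or its absence) clearer. This is a sketch under assumptions: the bucket edges and the `toy_stays` values are illustrative choices, not from the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative values, not taken from the dataset.
toy_stays = pd.DataFrame({
    "minimum_nights": [1, 1, 3, 5, 14, 30, 90, 1000],
    "price": [100, 200, 150, 50, 80, 120, 60, 40],
})

# Right-closed intervals: (0, 1], (1, 7], (7, 30], (30, inf).
bins = [0, 1, 7, 30, np.inf]
labels = ["1", "2-7", "8-30", "31+"]
toy_stays["stay_bucket"] = pd.cut(
    toy_stays["minimum_nights"], bins=bins, labels=labels
)

# observed=True keeps only buckets that actually contain rows.
avg_price_by_bucket = toy_stays.groupby("stay_bucket", observed=True)["price"].mean()
print(avg_price_by_bucket)
```

The bucket edges are a design choice; different edges may be more natural for nightly, weekly, and monthly stays.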

### **Average price per night**

In [None]:
x = ['neighbourhood_group','room_type','minimum_nights','price']
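# .copy() makes prvsmin independent of airbnb_df, so adding a column later raises no SettingWithCopyWarning
prvsmin = airbnb_df[x].copy()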
prvsmin.head(5)

In [None]:
# creating a new column which gives price per a single night
prvsmin['price_per_night'] = prvsmin['price']/prvsmin['minimum_nights']
prvsmin.head(5)

## **Average price per night for each neighbourhood along with room_type**

In [None]:
# finding the average price for each neighbourhood
avg_price_per_night_nbd = prvsmin.groupby(['neighbourhood_group','room_type'])['price'].mean().reset_index()
avg_price_per_night_nbd

In [None]:
sns.barplot(x=avg_price_per_night_nbd['neighbourhood_group'],hue=avg_price_per_night_nbd['room_type'],y=avg_price_per_night_nbd['price'])


Overall, Entire home/apt has the highest price in every neighbourhood group, and Manhattan has the highest average price for all room types.

### **Average price per night for all 5 neighbourhood**

In [None]:
avg_price_per_night = prvsmin.groupby(['neighbourhood_group'])['price'].mean().reset_index()
avg_price_per_night

In [None]:
# Recent seaborn versions require x and y as keyword arguments
sns.barplot(x='neighbourhood_group', y='price', data=avg_price_per_night)


We can see that average price per night is high in Manhattan followed by Brooklyn.

### **Average price per night for each room type throughout New York**

In [None]:
avg_price_per_night_nbd_room_type = prvsmin.groupby(['room_type'])['price'].mean().reset_index()
avg_price_per_night_nbd_room_type

In [None]:
# Recent seaborn versions require x and y as keyword arguments
sns.barplot(x='room_type', y='price', data=avg_price_per_night_nbd_room_type)


We can observe that Entire home/Apartment price per night is high when compared to other room types

### **Longitude and Latitude**

We have considered all the other columns for visualization, so why not longitude and latitude? Here we make the visualization more attractive by plotting the listings geographically.

**Map based on properties of Neighbourhood_group/cities**

In [None]:
# Recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x='longitude', y='latitude', hue='neighbourhood_group', data=airbnb_df)

**Maps with respect to room_types available throughout NY**

In [None]:
# Recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x='longitude', y='latitude', hue='room_type', palette='Accent', data=airbnb_df)


**Maps which tells us about the price variation of all available properties**

In [None]:
# Recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x='longitude', y='latitude', hue='price', data=airbnb_df)

**Conclusion:**

(1) Price depends on neighbourhood_group. It is highest in Manhattan.

(2) Within each neighbourhood_group, price fluctuates between neighbourhoods. Manhattan shows the most ups and downs, containing both the highest and the lowest prices.

(3) We can identify the famous host (host_id 219517861, from Manhattan) by the number of properties offered, or alternatively (Jordan, host_id 47621202, from Queens) by the most-reviewed place.

(4) The most-reviewed properties are in Queens, which also has some of the lowest prices.

(5) Throughout NY, Entire home/apt is the room_type most in demand.

(6) Manhattan is a well-known area and can be a good option for companies looking to invest in Entire home/apt properties.