<a href="https://colab.research.google.com/github/arun-saraswat/Data_Analysis/blob/main/Airbnb_Booking_Analysis_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Airbnb Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**



Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to experience the world in a more unique, personalized way. Today, Airbnb is a one-of-a-kind service that is used and recognized around the world. Analysis of the millions of listings on Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding the behavior and performance of customers and providers (hosts) on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.

This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values.
Explore and analyze the data to discover key insights (not limited to these) such as:

What can we learn about different hosts and areas?

What can we learn from predictions? (e.g., locations, prices, reviews)

Which hosts are the busiest and why?

Is there any noticeable difference in traffic among different areas, and what could be the reason for it?

# **GitHub Link -**

https://github.com/arun-saraswat/Data_Analysis

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
#using pandas library and 'read_csv' to read Airbnb csv file
airbnb=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Airbnb NYC 2019.csv")


### Dataset First View

In [None]:
# Dataset First Look
airbnb.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
airbnb.shape

### Dataset Information

In [None]:
# Dataset Info
airbnb.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
airbnb.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
airbnb.isnull().sum()

In [None]:
# Visualizing the missing values
# visualising using seaborn heatmap
plt.figure(figsize=(10,6))
sns.heatmap(airbnb.isnull().transpose(),
            cmap="YlGnBu",
            cbar_kws={'label': 'Missing Data'})
plt.show()

### What did you know about your dataset?

This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values.
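
The split between categorical and numeric columns can be listed directly from the column dtypes. The following is a minimal sketch, assuming non-numeric columns are treated as categorical; the three-row `sample` frame and the helper name `split_columns_by_kind` are made up for illustration and are not part of the notebook. On the real data, the helper would be called as `split_columns_by_kind(airbnb)`.

```python
import pandas as pd

# Hypothetical three-row sample that mirrors a few of the dataset's columns.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens"],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "price": [150, 80, 45],
    "reviews_per_month": [1.2, None, 0.4],
})

def split_columns_by_kind(df):
    """Return (numeric column names, non-numeric column names)."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
    return numeric_cols, categorical_cols

numeric_cols, categorical_cols = split_columns_by_kind(sample)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```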

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
airbnb.columns

In [None]:
# Dataset Describe
# describe() summarizes the numeric columns by default, so no column list is needed
airbnb.describe()

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# nunique() returns the number of distinct values in each column
unique_airbnb = airbnb.nunique()
unique_airbnb

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
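# Write your code to make your dataset analysis ready.

# replace the null values in reviews_per_month and last_review with zero
# (assigning the result back avoids chained in-place modification, which newer pandas versions warn about)
airbnb['reviews_per_month'] = airbnb['reviews_per_month'].fillna(0)
airbnb['last_review'] = airbnb['last_review'].fillna(0)

# null counts after cleaning
airbnb.isnull().sum()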


In [None]:
#last five rows
airbnb.tail()

In [None]:
#type of rooms in this data
number_of_room_type = airbnb['room_type'].unique()
number_of_room_type

In [None]:
#number of locations
number_of_place=airbnb["neighbourhood_group"].unique()
number_of_place

In [None]:
#total rooms by locations
total_room_by_location= airbnb.groupby(['neighbourhood_group', 'room_type']).size().reset_index(name="total_room")
print (total_room_by_location)

In [None]:
#location-wise total availability (sum of availability_365 per neighbourhood group)
total_availability= airbnb.groupby(['neighbourhood_group'])['availability_365'].sum().reset_index()
total_availability

In [None]:
#locations with total reviews
location_reviews =airbnb.groupby(['neighbourhood_group'])['number_of_reviews'].sum().reset_index()
location_reviews

In [None]:
#maximum number of reviews at each price (head(5) keeps the five lowest prices)
price_area = airbnb.groupby(['price'])['number_of_reviews'].max().reset_index()
price_reviews=price_area.head(5)
price_reviews

In [None]:
#top five busiest hosts
busiest_hosts = airbnb.groupby(['host_name','host_id','room_type'])['number_of_reviews'].max().reset_index()
busiest_hosts = busiest_hosts.sort_values(by='number_of_reviews', ascending=False).head()
busiest_hosts

In [None]:
#number of listings per neighbourhood group and room type (count of minimum_nights entries)
traffic_areas = airbnb.groupby(['neighbourhood_group','room_type'])['minimum_nights'].count().reset_index()
traffic_areas = traffic_areas.sort_values(by='minimum_nights', ascending=False)
traffic_areas

### What all manipulations have you done and insights you found?

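Missing values in `reviews_per_month` and `last_review` were replaced with 0. The data was then grouped by neighbourhood group, room type, price, and host to obtain listing counts, total availability, total reviews, and the most-reviewed hosts. These tables feed the charts in the next section.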
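
The wrangling code above fills missing `last_review` values with 0, which mixes numbers into a date column. An alternative worth considering is to parse the column as dates and keep missing entries as `NaT`. The following is a minimal sketch under that assumption; the three-row `sample` frame and the `has_review` flag are made up for illustration.

```python
import pandas as pd

# Hypothetical sample with one listing that has never been reviewed.
sample = pd.DataFrame({
    "last_review": ["2019-05-21", None, "2018-11-03"],
    "reviews_per_month": [0.38, None, 4.64],
})

cleaned = sample.copy()

# A listing without reviews has a review rate of zero.
cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0)

# Parse dates; missing or unparseable entries become NaT instead of 0.
cleaned["last_review"] = pd.to_datetime(cleaned["last_review"], errors="coerce")

# Explicit flag so that later filters do not rely on a sentinel value.
cleaned["has_review"] = cleaned["last_review"].notna()

print(cleaned)
```

Keeping the column as a datetime type makes later date arithmetic (for example, time since the last review) possible without special-casing the 0 placeholder.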

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#total listings chart
plt.figure(figsize=(10,6))
plt.title("Total listings by neighbourhood group",fontsize = 16)
sns.countplot(data=airbnb,x='neighbourhood_group')
plt.xlabel("neighbourhood_group",fontsize = 16)
plt.ylabel("count",fontsize = 16)
plt.show()

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#share of listings per neighbourhood group
plt.figure(figsize=(10,6))
plt.title("Neighbourhood Group",fontsize = 16)
plt.pie(airbnb.neighbourhood_group.value_counts(), labels=airbnb.neighbourhood_group.value_counts().index,autopct='%1.1f%%')
plt.show()

*The bar and pie charts above show that Manhattan and Brooklyn have the largest shares of Airbnb listings.*
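
The shares read off the pie chart can also be tabulated. The following is a minimal sketch; the five-row `sample` frame and the helper name `listing_share` are made up for illustration. On the real data, the call would be `listing_share(airbnb, "neighbourhood_group")`.

```python
import pandas as pd

# Hypothetical sample: five listings across three neighbourhood groups.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
})

def listing_share(df, column):
    """Count listings per category and express each count as a percentage."""
    counts = df[column].value_counts()
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"listings": counts, "share_pct": share})

share_table = listing_share(sample, "neighbourhood_group")
print(share_table)
```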



#### Chart - 3

In [None]:
# Chart - 3 visualization code
#types of room
plt.figure(figsize=(10,6))
plt.title("Type of Room",fontsize = 16)
sns.countplot(data=airbnb,x='room_type')
plt.xlabel("room_type",fontsize=16)
plt.ylabel("count",fontsize=16)
plt.show()

*We can see that Entire home/apt has the highest share of listings, followed by Private room; Shared room is the least common.*



#### Chart - 4

In [None]:
# Chart - 4 visualization code
#rooms type per location
plt.figure(figsize=(10,6))
plt.title("Room Type on Neighbourhood Group",fontsize=16)
sns.countplot(data=airbnb,x='neighbourhood_group',hue='room_type')
plt.xlabel("neighbourhood_group",fontsize=16)
plt.ylabel("count",fontsize=16)
plt.show()

*The graph shows that Entire home/apt listings are most common in Manhattan, while in Brooklyn the numbers of Private room and Entire home/apt listings are nearly equal.*
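
The same comparison can be made with a cross-tabulation, which gives the exact counts behind the grouped bars. The following is a minimal sketch; the five-row `sample` frame is made up for illustration, and on the real data the two columns would come from `airbnb`.

```python
import pandas as pd

# Hypothetical sample of five listings.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room"],
})

# Absolute number of listings per (neighbourhood group, room type) pair.
counts = pd.crosstab(sample["neighbourhood_group"], sample["room_type"])

# Share of each room type within a neighbourhood group (each row sums to 1).
row_share = pd.crosstab(sample["neighbourhood_group"], sample["room_type"],
                        normalize="index")

print(counts)
print(row_share.round(2))
```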

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#availability per year
plt.figure(figsize=(10,6))
plt.title("Neighbourhood Group vs. Availability Room",fontsize=16)
sns.boxplot(data=airbnb, x='neighbourhood_group',y='availability_365')
plt.xlabel('neighbourhood_group',fontsize=16)
plt.ylabel("availability_365",fontsize=16)
plt.show()

*The box plot above shows the distribution of room availability (`availability_365`) for each neighbourhood group.*



#### Chart - 6

In [None]:
# Chart - 6 visualization code
#locations
plt.figure(figsize=(10,6))
sns.scatterplot(x=airbnb.longitude,y=airbnb.latitude,hue=airbnb.neighbourhood_group)
plt.xlabel("longitude",fontsize=16)
plt.ylabel("latitude",fontsize=16)
plt.title("location of neighbourhood_group ",fontsize=16)
plt.show()

*The scatter plot above shows the listings by location, coloured by neighbourhood group.*



#### Chart - 7

In [None]:
# Chart - 7 visualization code
#price map for listings priced below 100
airbnb[airbnb.price<100].plot(kind='scatter', x='longitude',y='latitude',label='Map of Price Distribution',c='price',cmap=plt.get_cmap('jet'),colorbar=True,alpha=0.4)
plt.show()

*In the graph above, the red dots are the rooms with higher prices. We can also see that rooms in the Manhattan region are more expensive.*

#### Chart - 8

In [None]:
# Chart - 8 visualization code
#prices per location
plt.figure(figsize=(10,6))
plt.title("Neighbourhood Group Price Distribution<500",fontsize=16)
sns.boxplot(y="price",x ='neighbourhood_group',data = airbnb[airbnb.price<500])
plt.xlabel("neighbourhood_group",fontsize=16)
plt.ylabel("price",fontsize=16)
plt.show()

1-*Manhattan has the highest price range for the listings, followed by Brooklyn.*

2-*Queens and Staten Island seem to have very similar distributions.*

3-*The Bronx is the cheapest.*
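
The ranking read from the box plot can be backed by a per-group price summary. The following is a minimal sketch, assuming the same `price < 500` cutoff used for the chart; the seven-row `sample` frame and the helper name `price_summary` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; the 900 entry is excluded by the price cutoff.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Bronx", "Bronx"],
    "price": [200, 150, 900, 100, 120, 60, 70],
})

def price_summary(df, max_price=500):
    """Median, mean, and count of prices below max_price per neighbourhood group."""
    capped = df[df["price"] < max_price]
    summary = capped.groupby("neighbourhood_group")["price"].agg(["median", "mean", "count"])
    return summary.sort_values("median", ascending=False)

summary = price_summary(sample)
print(summary)
```

Sorting by the median rather than the mean keeps the ranking less sensitive to the few very expensive listings that remain under the cutoff.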

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#number of reviews per area
area = location_reviews['neighbourhood_group']
review = location_reviews['number_of_reviews']

fig = plt.figure(figsize = (10, 6))
plt.bar(area,review, width = 0.4)
plt.title("Area vs Number of reviews",fontsize=16)
plt.xlabel("Area",fontsize=16)
plt.ylabel("Reviews",fontsize=16)
plt.show()

*The bar plot above shows that Brooklyn has the most reviews, followed by Manhattan; Staten Island has the fewest.*



#### Chart - 10

In [None]:
# Chart - 10 visualization code
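#price and review relation
#plot reviews against price (a bare .plot() would draw both columns against the row index)
price_reviews.plot(x='price', y='number_of_reviews')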
plt.title('price vs number of review',fontsize = 16)
plt.xlabel("Price",fontsize = 16)
plt.ylabel("Number_of_review",fontsize = 16)
plt.show()

*From the above analysis we can say that most people prefer to stay in lower-priced places.*
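
The relation between price and reviews can also be checked numerically with a correlation coefficient. The following is a minimal sketch, assuming Pearson correlation is an acceptable measure here; the five-row `sample` frame is made up for illustration. On the real data, the same call would be `airbnb["price"].corr(airbnb["number_of_reviews"])`.

```python
import pandas as pd

# Hypothetical sample in which review counts fall as the price rises.
sample = pd.DataFrame({
    "price": [50, 80, 120, 200, 400],
    "number_of_reviews": [90, 60, 40, 20, 5],
})

# Pearson correlation between the two columns (pandas default method).
price_review_corr = sample["price"].corr(sample["number_of_reviews"])

# Full correlation matrix for the selected numeric columns.
corr_matrix = sample[["price", "number_of_reviews"]].corr()

print(round(price_review_corr, 3))
print(corr_matrix.round(3))
```

A value close to 0 on the real data would indicate that price and review count are only weakly related in a linear sense.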



#### Chart - 11

In [None]:
# Chart - 11 visualization code
#top hosts by maximum number of reviews on a listing
name = busiest_hosts['host_name']
reviews = busiest_hosts['number_of_reviews']

fig = plt.figure(figsize = (10,6))
plt.bar(name, reviews ,width = 0.4)
plt.xlabel("Name of the Host",fontsize = 16)
plt.ylabel("Number of Reviews",fontsize = 16)
plt.title("Top_5_Busiest_Hosts",fontsize = 16)
plt.show()

*The bar chart above shows the top five busiest hosts:*

1-*Dona*

2-*Jj*

3-*Maya*

4-*Carol*

5-*Danielle*
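
The ranking above uses the largest review count on a single listing. A host with several listings could instead be ranked by total reviews across all listings. The following is a minimal sketch of that alternative; the four-row `sample` frame and the helper name `busiest_hosts_by_total_reviews` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: host 1 owns two listings, hosts 2 and 3 own one each.
sample = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "host_id": [1, 1, 2, 3],
    "host_name": ["A", "A", "B", "C"],
    "number_of_reviews": [100, 50, 120, 10],
})

def busiest_hosts_by_total_reviews(df, top_n=5):
    """Rank hosts by the sum of reviews over all of their listings."""
    # Note: groupby drops rows with a missing host_name by default.
    per_host = (
        df.groupby(["host_id", "host_name"])
          .agg(total_reviews=("number_of_reviews", "sum"),
               listings=("id", "count"))
          .reset_index()
    )
    return per_host.sort_values("total_reviews", ascending=False).head(top_n)

top_hosts = busiest_hosts_by_total_reviews(sample, top_n=2)
print(top_hosts)
```

Grouping on `host_id` as well as `host_name` keeps different hosts who share a first name apart.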

#### Chart - 12

In [None]:
# Chart - 12 visualization code
#number of listings per room type, split by neighbourhood group
#(traffic_areas['minimum_nights'] holds listing counts, not nights stayed)
plt.figure(figsize = (8,5))
sns.barplot(data=traffic_areas, x='room_type', y='minimum_nights', hue='neighbourhood_group')
plt.xlabel("Room Type",fontsize = 16)
plt.ylabel("Number of listings",fontsize = 16)
plt.title("Traffic Areas",fontsize = 16)
plt.show()

*From the above analysis we can say that people prefer Entire home/apt or Private room listings in Manhattan, Brooklyn, and Queens, and that they prefer lower-priced listings.*

# **Conclusion**

1- Guests who choose an entire home or apartment tend to stay somewhat longer in that neighbourhood.

2- Guests who choose a private room tend to stay for shorter periods than those who choose an entire home or apartment.

3- Most people prefer lower-priced listings.

4- A neighbourhood group with a large number of reviews is likely to be a tourist destination.

5- Guests who stay no more than one night are likely to be travelers.

6- The dataset contains a total of 221 different areas, of which “Williamsburg” has the most listings.

7- There are a total of 37,457 hosts, and the host with host id 219517861 (“Sonder”) is the top host with 327 listings.

8- No strong correlation was observed between price, reviews, and location.

9- Of the 5 neighbourhood groups in the dataset, Manhattan has the most listings, with 44.3% of the total.

10- The top five busiest hosts are Dona, Jj, Maya, Carol, and Danielle.
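
The counts quoted in points 6, 7, and 9 can be reproduced with a few pandas calls. The following is a minimal sketch; the four-row `sample` frame and the helper name `headline_counts` are made up for illustration, so the printed values are not the dataset's figures. On the real data, the call would be `headline_counts(airbnb)`.

```python
import pandas as pd

# Hypothetical four-row sample with the three columns the summary needs.
sample = pd.DataFrame({
    "neighbourhood": ["Williamsburg", "Williamsburg", "Harlem", "Astoria"],
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Manhattan", "Queens"],
    "host_id": [1, 1, 2, 3],
})

def headline_counts(df):
    """Collect the area, host, and neighbourhood-group figures in one dict."""
    area_counts = df["neighbourhood"].value_counts()
    host_counts = df["host_id"].value_counts()
    group_share = df["neighbourhood_group"].value_counts(normalize=True) * 100
    return {
        "n_areas": df["neighbourhood"].nunique(),
        "top_area": area_counts.idxmax(),
        "n_hosts": df["host_id"].nunique(),
        "top_host_id": host_counts.idxmax(),
        "top_host_listings": int(host_counts.max()),
        "top_group": group_share.idxmax(),
        "top_group_share_pct": round(float(group_share.max()), 1),
    }

facts = headline_counts(sample)
print(facts)
```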

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***