<a href="https://colab.research.google.com/github/blackxhrt2102/AirBnB-Booking-Analysis/blob/main/AIRBNB_Booking_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Airbnb EDA (Exploratory Data Analysis) Capstone Project**

Since 2008, guests and hosts have used Airbnb to expand their travel possibilities and to experience the world in a more unique, personalized way. Today, Airbnb has become a one-of-a-kind service that is used and recognized around the world. Analyzing the millions of listings on the platform is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customer and host behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.
This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values.



---------------------------------------------------------------------------------

**Explore and analyze the data to discover key insights (not limited to these) such as:**
* **What can we learn about different hosts and areas?**
* **What can we learn from predictions? (e.g. locations, prices, reviews, etc.)**
* **Which hosts are the busiest and why?**
* **Is there any noticeable difference in traffic among different areas, and what could be the reason for it?**
--------------------------------------------------------------------------------

# **Business Model of Airbnb:**
 **Airbnb is a community-based, two-sided online platform that facilitates the booking of private living spaces for travelers. On one side, it enables owners to list their space and earn rental income. On the other side, it gives travelers easy access to private homes for rent. With over 1,500,000 listings in 34,000 cities and 190 countries, its wide coverage enables travelers to rent private homes all over the world. Personal profiles as well as a rating and reviewing system provide information about the host and what is on offer. Conversely, hosts decide for themselves whom to rent their space to.**

 **Airbnb receives a commission from two sources on every booking, namely the host and the guest. For every booking, Airbnb charges the guest 6-12% of the booking fee. In addition, Airbnb charges the host 3% for every successful transaction.**
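
The commission structure described above can be illustrated with a small worked example. This is only a sketch: the function name `airbnb_commission` and the 200 booking fee are hypothetical, and it assumes that both the guest rate and the host rate are applied to the same booking fee, which the text does not state explicitly.

```python
# Hypothetical helper; the 6-12% and 3% rates come from the paragraph above.
def airbnb_commission(booking_fee, guest_rate, host_rate=0.03):
    """Return (guest_fee, host_fee, total_commission) for one booking."""
    if not 0.06 <= guest_rate <= 0.12:
        raise ValueError("guest_rate is expected to lie between 6% and 12%")
    guest_fee = booking_fee * guest_rate   # paid by the guest on top of the booking
    host_fee = booking_fee * host_rate     # assumed to be taken from the same booking fee
    return guest_fee, host_fee, guest_fee + host_fee

# Lower and upper bound of the commission on a hypothetical 200 booking
low = airbnb_commission(200.0, 0.06)
high = airbnb_commission(200.0, 0.12)
print(low, high)
```

Under these assumptions, the total commission on a single booking ranges from 9% to 15% of the booking fee.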

# **CODING AND ANALYSIS PART:**

In [None]:
# Importing modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


sns.set_style('darkgrid')
%matplotlib inline

In [None]:
# Mounting google drive

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Loading the dataset

df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/AlmaBetter/EDA Capstone Project/Airbnb NYC 2019.csv')

# Analysing the Data

In [None]:
# Printing the first 5 rows

df.head()

In [None]:
# Printing the last 5 rows

df.tail()

In [None]:
# Checking the shape of the dataset

df.shape

In [None]:
# Checking the basic information of the dataset

df.info()

In [None]:
# Checking the number of unique values in the host id column

df['host_id'].nunique()

In [None]:
# Checking the unique values of the neighbourhood group

df['neighbourhood_group'].unique()

**Information provided by each column:**


*   **id:** A unique number for each observation (listing).
*   **name:** Short description of the listing.
*   **host_id:** The id of the host who owns the Airbnb.
*   **host_name:** The name of the host who owns the Airbnb.
*   **neighbourhood_group:** The 5 boroughs (administrative districts) of New York City.
*   **neighbourhood:** Neighbourhoods within the 5 boroughs.
*   **latitude:** Latitude of the Airbnb.
*   **longitude:** Longitude of the Airbnb.
*   **room_type:** Room types available for booking.
 1.   Entire Home/Apartment
 2.   Private Room
 3.   Shared Room
*   **price:** Price of the Airbnb for one night.
*   **minimum_nights:** Minimum number of nights a guest must book.
*   **number_of_reviews:** Number of reviews received by the Airbnb.
*   **last_review:** Date of the most recent review.
*   **reviews_per_month:** Mean number of reviews received by the Airbnb per month.
*   **calculated_host_listings_count:** Number of listings owned by the host.
*   **availability_365:** Number of days in the year (out of 365) on which the Airbnb is available.
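
One way to sanity-check a column description is to recompute the column from the others. The sketch below assumes that `calculated_host_listings_count` holds the number of listings belonging to the same `host_id`, which is an interpretation the notebook itself does not verify. The three-row `sample` frame is hypothetical and stands in for `df`.

```python
import pandas as pd

# Hypothetical miniature of the dataset: host 10 owns two listings, host 20 owns one.
sample = pd.DataFrame({
    'id': [1, 2, 3],
    'host_id': [10, 10, 20],
    'calculated_host_listings_count': [2, 2, 1],
})

# Count listings per host and broadcast that count back onto every row.
listings_per_host = sample.groupby('host_id')['id'].transform('count')

# True when the stored column agrees with the recomputed count on every row.
matches = (listings_per_host == sample['calculated_host_listings_count']).all()
print(matches)
```

Running the same comparison on `df` before any columns are dropped would show whether the interpretation holds for the real data.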


### **Data Preprocessing**

In [None]:
# Dropping the columns that are not required

df.drop(columns=['name','host_name','latitude','longitude','last_review','calculated_host_listings_count'], inplace = True)

NaN values in the reviews per month column mean that those Airbnbs have not received any reviews, so those NaN values can be replaced with 0.

In [None]:
# Replacing the null values of reviews per month column with 0

df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
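
The reasoning behind this replacement can be checked before filling. The sketch below uses a hypothetical four-row `sample` frame rather than `df`, and confirms that every missing `reviews_per_month` belongs to a listing with zero reviews before writing 0 into it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df with two unreviewed listings.
sample = pd.DataFrame({
    'number_of_reviews': [5, 0, 12, 0],
    'reviews_per_month': [0.4, np.nan, 1.1, np.nan],
})

# Rows where reviews_per_month is missing.
missing = sample['reviews_per_month'].isna()

# Every missing value should belong to a listing that has no reviews at all.
only_unreviewed = bool((sample.loc[missing, 'number_of_reviews'] == 0).all())

# Fill with 0 only if the assumption holds.
if only_unreviewed:
    sample['reviews_per_month'] = sample['reviews_per_month'].fillna(0)
```

If the check failed on the real data, filling with 0 would hide listings whose review rate is genuinely unknown.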

The price of an Airbnb cannot be 0, so the 0-priced Airbnbs are replaced with the median of the price column.

In [None]:
# Replacing the 0 priced Airbnbs with the median of the price column

df['price'] = df['price'].replace(0,np.median(df['price']))
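
Note that the median above is taken over the whole column, zeros included. A possible variant, shown here as a sketch on a hypothetical `prices` series and not as the notebook's method, computes the median over positive prices only, so that the invalid zeros do not influence the replacement value.

```python
import pandas as pd

# Hypothetical prices containing two invalid zero entries.
prices = pd.Series([0, 50, 100, 150, 0, 200], name='price')

# Median over positive prices only, so the zeros do not pull it down.
median_positive = prices[prices > 0].median()

# Replace the zero entries with that median; all other values are kept.
cleaned = prices.mask(prices == 0, median_positive)
```

With only a handful of zero-priced rows in a large dataset, the two medians are likely to be close, so the difference matters mainly when invalid values are common.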

In [None]:
# Statistical Description of the Dataset

df.describe()

## Classifying the categorical and numerical data:

*   **Categorical:**

 1. host_id 
 2. neighbourhood_group
 3. neighbourhood
 4. room_type


*   **Numerical:**

 1. price
 2. minimum_nights
 3. number_of_reviews
 4. reviews_per_month
 5. availability_365
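
Because `host_id` is stored as an integer, pandas treats it as numeric by default (for example in `describe()`). The classification above can be made explicit by casting the categorical columns to the `category` dtype. This is a minimal sketch on a hypothetical `sample` frame; the column names match the dataset, but the values are made up.

```python
import pandas as pd

# Hypothetical rows using a subset of the dataset's column names.
sample = pd.DataFrame({
    'host_id': [10, 10, 20],
    'neighbourhood_group': ['Manhattan', 'Brooklyn', 'Manhattan'],
    'neighbourhood': ['Harlem', 'Williamsburg', 'Midtown'],
    'room_type': ['Private room', 'Entire home/apt', 'Private room'],
    'price': [80, 150, 120],
})

# Columns classified as categorical in the list above.
categorical_cols = ['host_id', 'neighbourhood_group', 'neighbourhood', 'room_type']

# Cast them to the category dtype so they are no longer treated as numbers or plain strings.
sample = sample.astype({col: 'category' for col in categorical_cols})

# Whatever remains numeric is the numerical group.
numerical_cols = sample.select_dtypes(include='number').columns.tolist()
```

After the cast, `sample.describe()` summarises only the numerical columns, which matches the split given above.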


# **Univariate Analysis:**
**It involves the analysis of a single variable**

In [None]:
# Creating a dataframe representing the top 10 host IDs who own maximum number of Airbnbs
top_10_host_id = pd.DataFrame({'Host ID':df['host_id'].value_counts(ascending=False).index,
                               'No of Airbnbs':df['host_id'].value_counts(ascending=False).values})[:10]

# Defining the size of the plot
fig,ax=plt.subplots(figsize=(15,8))

# Plotting a bar graph
figure = sns.barplot(x='Host ID', y='No of Airbnbs', data = top_10_host_id)

# Defining the title of the graph
figure.set(title='Host IDs that own the maximum number of Airbnbs')

# Displaying the graph (plt.show() does not take the figure as an argument)
plt.show()
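
The top-10 dataframe above calls `value_counts()` twice. An equivalent construction that counts only once could look like the sketch below; the `host_ids` series is hypothetical and the example keeps the top 2 instead of the top 10 to stay small.

```python
import pandas as pd

# Hypothetical host ids: host 7 owns three listings, host 3 two, host 5 one.
host_ids = pd.Series([7, 3, 7, 5, 7, 3], name='host_id')

# value_counts() sorts in descending order by default, so head(n) gives the top n.
top_hosts = (
    host_ids.value_counts()
    .head(2)
    .rename_axis('Host ID')               # name the index that holds the host ids
    .reset_index(name='No of Airbnbs')    # turn it into a two-column dataframe
)
```

The resulting frame has the same two columns as `top_10_host_id` and can be passed to `sns.barplot` in the same way.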

## **Distribution of the Airbnbs in the neighbourhood groups**

In [None]:
# Creating a dataframe representing the distribution of Airbnbs in the neighbourhood groups
airbnb_count = pd.DataFrame({'Neighbourhood Groups':df['neighbourhood_group'].value_counts(ascending=False).index,
                             'No of Airbnbs':df['neighbourhood_group'].value_counts(ascending=False).values})

# Defining the size of the plot
plt.figure(figsize=(10,5))

# Plotting a bar graph
figure = sns.barplot(x='Neighbourhood Groups', y='No of Airbnbs', data = airbnb_count)

# Defining the title of the graph
figure.set(title='Distribution of the Airbnbs in the neighbourhood groups')

# Displaying the graph
plt.show()

# Percentage distribution of the Airbnbs in the neighbourhood groups  
print('\n\nPercentage distribution of Airbnbs in the neighbourhood groups:')
df['neighbourhood_group'].value_counts(normalize=True)*100
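
As a small illustration of what `value_counts(normalize=True)*100` returns, the sketch below applies it to a hypothetical series of four listings. The borough names are real, but the counts are made up.

```python
import pandas as pd

# Hypothetical listings: two in Manhattan, one each in Brooklyn and Queens.
groups = pd.Series(['Manhattan', 'Brooklyn', 'Manhattan', 'Queens'])

# normalize=True returns fractions that sum to 1; multiplying by 100 gives percentages.
share = groups.value_counts(normalize=True).mul(100).round(2)
```

Here `share['Manhattan']` is 50.0 and the three percentages sum to 100.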

## **Distribution of Airbnbs on the basis of room types**

In [None]:
# Creating a dataframe representing the distribution of Airbnbs on the basis of room types
room_type_count = pd.DataFrame({'Room Type':df['room_type'].value_counts(ascending=False).index,
                                'No of Airbnbs':df['room_type'].value_counts(ascending=False).values})

# Defining the size of the plot
plt.figure(figsize=(10,5))

# Plotting a bar graph
figure = sns.barplot(x='Room Type', y='No of Airbnbs', data=room_type_count)

# Defining the title of the graph
figure.set(title='Distribution of Airbnbs on the basis of room types')

# Displaying the graph
plt.show()

# Percentage distribution of the Airbnbs on the basis of room types
print('\n\nPercentage distribution of Airbnbs on the basis of room types:')
df['room_type'].value_counts(normalize=True)*100

## **Price distribution of the Airbnbs across New York City**

In [None]:
# Defining the size of the graph
plt.figure(figsize=(15,5))

# Defining the labels of the graph
plt.xlabel("Price")
plt.ylabel("Number of Airbnbs")

# Plotting histogram of the price distribution
figure = sns.histplot(x='price', data=df)

# Defining the title of the graph
figure.set(title='Price distribution of Airbnbs')

# Displaying the graph
plt.show()

In [None]:
# Statistical description of price column
df['price'].describe()

In [None]:
# Creating a dataset of the Airbnbs priced at 200 or below
price_below_200 = df[df['price']<=200]

# Defining the size of the graph
plt.figure(figsize=(15,5))

# Defining the labels of the graph
plt.xlabel("Price")
plt.ylabel("Number of Airbnbs")

# Plotting histogram of the price distribution
figure_1 = sns.histplot(x='price', data=price_below_200, kde=True, color='red')

# Defining the title of the graph
figure_1.set(title='Price distribution of Airbnbs priced at 200 or below')

# Displaying the graph
plt.show()
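
When a histogram is restricted to a price cutoff, it helps to know how much of the data the cutoff keeps. The sketch below computes that share for a hypothetical `prices` series; the 200 threshold mirrors the filter used above, and the values are made up.

```python
import pandas as pd

# Hypothetical nightly prices with a long right tail.
prices = pd.Series([50, 80, 120, 150, 199, 200, 450, 1000])

# Boolean mask of the listings kept by the <= 200 filter.
at_or_below = prices <= 200

# The mean of a boolean mask is the fraction of True values.
share_kept = at_or_below.mean() * 100
```

In this example 6 of the 8 prices pass the filter, so `share_kept` is 75.0. The same two lines applied to `df['price']` would report the share for the real dataset.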

## **Top 20 neighbourhoods having maximum number of Airbnbs**

In [None]:
# Creating a dataframe representing the top 20 neighbourhoods having maximum number of Airbnbs
cities = pd.DataFrame({'Neighbourhood':df['neighbourhood'].value_counts().index,
                       'No of Airbnb':df['neighbourhood'].value_counts().values})[:20]

# Defining the size of the plot
fig,ax=plt.subplots(figsize=(15,10))

# Plotting a bar graph on the axes created above
figure = sns.barplot(x='Neighbourhood',y='No of Airbnb',data=cities,ax=ax)

# Defining the title and rotating the x-axis labels (done after plotting, once the ticks exist)
ax.set_title('Top 20 neighbourhoods having maximum number of Airbnbs.')
ax.tick_params(axis='x', rotation=45)

# Displaying the graph
plt.show()