
# **Project Name**    -  **Airbnb Booking Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**


Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalised way of experiencing the world. Today, Airbnb has become a one-of-a-kind service that is used and recognized around the world. Analysing the data from the millions of listings on Airbnb is crucial for the company. These listings generate a lot of data that can be analysed and used for security, business decisions, understanding customers' and providers' (hosts') behaviour and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more. This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. Explore and analyse the data to discover key understandings.

#### **Define Your Business Objective?**

***Explore and analyse data to discover key understandings***

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each and every chart the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.


5. You have to create at least 20 logical & meaningful charts having important insights.

[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]







# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
file_path = "/content/drive/My Drive/Alma better Projects/Module 2/Capstone project/Airbnb_NYC_2019.csv"  # file path for the dataset
airbnb_dataframe = pd.read_csv(file_path)  # load the dataset into a dataframe
print(airbnb_dataframe)  # print the dataframe
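
The guidelines above ask for exception handling, so the loading step could be wrapped in a small helper. The following is only a sketch: `load_listings` is a hypothetical name introduced for illustration, and the checks it performs (file exists, at least one row) are assumptions about what would be useful here.

```python
from pathlib import Path

import pandas as pd


def load_listings(csv_path):
    """Read a listings CSV and fail with a clear message if it is missing or empty.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = Path(csv_path)
    if not path.is_file():
        # Typical cause in Colab: Drive is not mounted or the path has a typo
        raise FileNotFoundError(f"Dataset not found at: {path}")
    frame = pd.read_csv(path)
    if frame.empty:
        raise ValueError(f"Dataset at {path} contains no rows")
    return frame
```

With this in place, `airbnb_dataframe = load_listings(file_path)` would behave like the plain `pd.read_csv` call when the file is present, and stop early with a readable message otherwise.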

### Dataset First View

In [None]:
# Dataset First 5 rows
airbnb_dataframe.head()

In [None]:
#dataset last 5 rows
airbnb_dataframe.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns
airbnb_dataframe.shape

### Dataset Information

In [None]:
# Dataset Info
airbnb_dataframe.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(airbnb_dataframe[airbnb_dataframe.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(airbnb_dataframe.isnull().sum())

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(airbnb_dataframe.isnull(), cbar=False)

### What did you know about your dataset?

The dataset contains Airbnb listing data for New York City (2019), and the task is to analyse it and uncover the insights behind it.

The dataset has 48,895 rows and 16 columns. It has no duplicate rows, but it does contain many missing values.
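
To see which columns carry the missing values, the null counts can be put next to their percentage share. The sketch below uses a small made-up frame (`toy`) in place of `airbnb_dataframe`, so the numbers it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for airbnb_dataframe (values are made up)
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "price": [150, 80, 60, 200],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})

# Count of nulls and share of nulls per column, most affected columns first
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)

print(missing_summary)
```

Replacing `toy` with `airbnb_dataframe` should give the same kind of table for the real data.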

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
airbnb_dataframe.columns

In [None]:
# Dataset Describe
airbnb_dataframe.describe(include='all')

### Variables Description

* **id:** Unique listing id

* **name:** Name of the listing

* **host_id:** Unique host id

* **host_name:** Name of the host

* **neighbourhood_group:** Borough in which the listing is located

* **neighbourhood:** Area within the borough

* **latitude:** Latitude of the listing

* **longitude:** Longitude of the listing

* **room_type:** Type of listing (entire home/apartment, private room, or shared room)

* **price:** Price of the listing (in US dollars)

* **minimum_nights:** Minimum number of nights to be paid for

* **number_of_reviews:** Number of reviews

* **last_review:** Date of the last review

* **reviews_per_month:** Average number of reviews per month

* **calculated_host_listings_count:** Total number of listings the host has

* **availability_365:** Number of days in the year the listing is available


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in airbnb_dataframe.columns.tolist():
  print("No. of unique values in ",i,"is",airbnb_dataframe[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

Dropping the columns that contain missing values

In [None]:
#dropping columns for handling null values
airbnb_dataframe.drop(['name', 'host_name', 'last_review', 'reviews_per_month'], axis=1, inplace=True)

In [None]:
# Checking Null Value drop activity by plotting Heatmap
sns.heatmap(airbnb_dataframe.isnull(), cbar=False)
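
Dropping is the approach taken here. As an alternative sketch only, `reviews_per_month` could be kept and filled with 0, assuming its nulls occur exactly where `number_of_reviews` is 0. That assumption is checked in the code rather than taken for granted, and the frame below is made-up toy data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the two review columns (values are made up)
toy = pd.DataFrame({
    "number_of_reviews": [9, 0, 0, 45],
    "reviews_per_month": [0.21, np.nan, np.nan, 0.38],
})

# Assumption to verify: a null rate appears only for listings with zero reviews
nulls_match_zero_reviews = (
    toy["reviews_per_month"].isnull() == (toy["number_of_reviews"] == 0)
).all()

if nulls_match_zero_reviews:
    # A listing with no reviews has a review rate of 0 per month
    toy["reviews_per_month"] = toy["reviews_per_month"].fillna(0)

print(toy)
```

If the check fails on the real data, filling with 0 would not be justified and dropping the column remains the simpler choice.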

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Univariate charts


In [None]:
# Chart - 1 Bar chart for room type distribution
plt.figure(figsize=(10,5))
airbnb_dataframe['room_type'].value_counts().plot(kind='bar', color='green' )
plt.title("Room Type Distribution")
plt.xlabel("Room Type")
plt.ylabel("Count")
plt.show()

**Key Insights from Room Type Distribution Chart**:

* Entire homes/apartments are the most common listings, followed by private rooms.

* Shared rooms make up a very small percentage, indicating lower demand or availability.

* The high number of entire homes suggests Airbnb is used for full-property rentals rather than shared stays.
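
The bar chart shows raw counts; the "very small percentage" claim can be backed with shares. A minimal sketch on made-up data, using `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy room_type column (made-up counts, not the real distribution)
toy_room_types = pd.Series(
    ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"] * 1,
    name="room_type",
)

# Share of each room type as a percentage of all listings
room_type_share = (toy_room_types.value_counts(normalize=True) * 100).round(1)

print(room_type_share)
```

On the real data the same call would be `airbnb_dataframe['room_type'].value_counts(normalize=True)`.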

In [None]:
#chart2 barchart for neighbourhood group distribution
plt.figure(figsize=(10,5))
airbnb_dataframe['neighbourhood_group'].value_counts().plot(kind='bar', color='green' )
plt.title("Neighbourhood Group Distribution")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Count")
plt.show()


**Key Insights from Neighbourhood Group Distribution Chart**:
* Manhattan and Brooklyn dominate the Airbnb market, accounting for the majority of listings.
* Queens has a moderate number of listings, while Bronx and Staten Island have significantly fewer.
* The low number of listings in Bronx & Staten Island may indicate lower tourist demand or fewer short-term rental properties.

In [None]:
#chart 3 bar chart for neighbourhood distribution
plt.figure(figsize=(30,5))
airbnb_dataframe['neighbourhood'].value_counts().plot(kind='bar', color='green' )
plt.title("Neighbourhood Distribution")
plt.xlabel("Neighbourhood")
plt.ylabel("Count")
plt.show()

**Key Insights from Neighbourhood Distribution Chart**:

* Williamsburg, Bedford-Stuyvesant, and Harlem have the highest number of Airbnb listings.

* Most listings are concentrated in a few key neighborhoods, with a long tail of areas having fewer listings.

* Many neighborhoods have minimal listings, indicating lower demand or fewer rental properties.
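
One way to quantify the "long tail" is the cumulative share of listings covered by the top N neighbourhoods. The sketch below runs on made-up counts; `top_n` is an illustrative parameter.

```python
import pandas as pd

# Toy neighbourhood column (made-up counts)
toy_neighbourhoods = pd.Series(
    ["Williamsburg"] * 5 + ["Harlem"] * 3 + ["Astoria"] * 1 + ["Tottenville"] * 1,
    name="neighbourhood",
)

# value_counts sorts from most to fewest listings
counts = toy_neighbourhoods.value_counts()

# Running share of all listings as neighbourhoods are added in that order
cumulative_share = counts.cumsum() / counts.sum()

top_n = 2
print(f"Top {top_n} neighbourhoods hold {cumulative_share.iloc[top_n - 1]:.0%} of listings")
```

A cumulative share that climbs quickly and then flattens is what a concentrated market with a long tail looks like.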

In [None]:
#chart 4 histplot for availability across stays
plt.figure(figsize=(20,5))
sns.histplot(airbnb_dataframe['availability_365'],bins=50, kde=True, color='green' )
plt.title("Availability Distribution")
plt.xlabel("Availability")
plt.ylabel("Count")
plt.show()

**Key Insights from Availability Distribution Chart**:
* Most listings have very low availability (0–10 days/year), suggesting many inactive or rarely rented properties.
* A smaller peak near 365 days indicates a subset of listings available year-round, likely from full-time hosts.
* Gradual decline between 50-300 days suggests varied listing strategies, with some properties available seasonally.
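
The two peaks described above can be measured by bucketing availability with `pd.cut`. This is a sketch on made-up values; the bucket edges (10 and 354 days) are assumptions chosen to match the ranges mentioned in the insights.

```python
import pandas as pd

# Toy availability_365 values (made up)
toy_availability = pd.Series([0, 0, 3, 45, 120, 200, 300, 364, 365, 365])

# Buckets are right-inclusive: (-1, 10], (10, 354], (354, 365]
bins = [-1, 10, 354, 365]
labels = ["0-10 days", "11-354 days", "355-365 days"]
buckets = pd.cut(toy_availability, bins=bins, labels=labels)

# Fraction of listings in each bucket, kept in bucket order
bucket_share = buckets.value_counts(normalize=True).reindex(labels)

print(bucket_share)
```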

In [None]:
#chart 5 histplot for price distribution across stays
plt.figure(figsize=(10, 5))
sns.histplot(airbnb_dataframe['price'], bins=50, color= 'green',kde=True)
plt.title("Price Distribution")
plt.show()


**Key Insights from Price Distribution Chart**:

* Highly right-skewed distribution: most listings are priced under $500.

* Majority of listings are concentrated under $200, indicating budget to mid-range accommodations dominate the market.

* Few luxury listings exist at very high prices ($2000+), which distort the average price.
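
The skew and its effect on the average can be checked numerically by comparing mean and median. A short sketch on made-up prices, where one very expensive listing plays the role of the luxury tail:

```python
import pandas as pd

# Toy prices (made up); the last value mimics a single luxury listing
toy_prices = pd.Series([50, 60, 75, 80, 100, 120, 150, 180, 250, 2500])

mean_price = toy_prices.mean()
median_price = toy_prices.median()
skewness = toy_prices.skew()                  # positive for a right-skewed distribution
share_under_200 = (toy_prices < 200).mean()   # fraction of listings priced below $200

print(f"mean={mean_price}, median={median_price}, share under $200={share_under_200:.0%}")
```

A mean far above the median, together with positive skewness, is the numeric counterpart of the long right tail in the histogram.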

###   Bivariate charts

In [None]:
# chart1 scatterplot for listings by location
plt.figure(figsize=(8, 5))
sns.scatterplot(x=airbnb_dataframe['longitude'], y=airbnb_dataframe['latitude'], hue=airbnb_dataframe['neighbourhood_group'])
plt.title("Airbnb Listings by Location")
plt.show()
#chart2 scatterplot for listings by room type
plt.figure(figsize=(8, 5))
sns.scatterplot(x=airbnb_dataframe['longitude'], y=airbnb_dataframe['latitude'], hue=airbnb_dataframe['room_type'])
plt.title("Airbnb Listings room type across locations")
plt.show()



**Key Insights from Airbnb Listings Location Analysis**:
* Manhattan and Brooklyn have the highest listing density, while Bronx and Staten Island have significantly fewer rentals.
* Entire homes/apartments dominate most areas, especially in Brooklyn and Manhattan, whereas private rooms are scattered throughout the city.
* Shared rooms are rare, with minimal presence across all boroughs.

In [None]:
#chart3 interactive map of the listing locations
import folium

# Centre the map on the mean coordinates of all listings
m = folium.Map(location=[airbnb_dataframe['latitude'].mean(), airbnb_dataframe['longitude'].mean()], zoom_start=12)
# Add one small circle marker per listing (this loop can be slow for tens of thousands of rows)
for index, row in airbnb_dataframe.iterrows():
    folium.CircleMarker([row['latitude'], row['longitude']], radius=2, color='blue').add_to(m)

m  # Displays map in Jupyter Notebook


In [None]:
#chart4 boxplot for price distribution
# Remove price outliers with the 1.5*IQR rule before plotting, so the chart and the later cells use the same data
q1 = airbnb_dataframe['price'].quantile(0.25)
q3 = airbnb_dataframe['price'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
airbnb_dataframe = airbnb_dataframe[(airbnb_dataframe['price'] >= lower_bound) & (airbnb_dataframe['price'] <= upper_bound)]

plt.figure(figsize=(10, 10))
sns.boxplot(x=airbnb_dataframe['room_type'], y=airbnb_dataframe['price'], color='green')
plt.xlabel("Room Type")
plt.ylabel("Price")
plt.ylim(0, 400)
plt.title("Price Variation Across Room Types")
plt.show()



**Key Insights from Price Variation Across Room Types**:
* Entire homes/apartments have the highest median price.
* Private rooms are mid-priced, but have many high-price outliers.
* Shared rooms are the cheapest, but some high-price outliers exist.
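
The outlier filter used in the cell above follows the 1.5 × IQR rule: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are dropped. A self-contained sketch on made-up prices, with `iqr_bounds` as a hypothetical helper name:

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds of the k*IQR rule for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy prices (made up); 5000 is an obvious outlier
toy = pd.DataFrame({"price": [40, 60, 80, 100, 120, 140, 160, 5000]})

lower, upper = iqr_bounds(toy["price"])

# Keep only rows whose price lies inside the bounds (inclusive)
trimmed = toy[toy["price"].between(lower, upper)]

print(lower, upper, len(trimmed))
```

Computing the bounds once and reusing them keeps every later chart on the same filtered data; recomputing them on already trimmed data would tighten the bounds a second time.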

In [None]:
# chart5 violinplot for price distribution across neighbourhood groups
# Price outliers were already removed with the IQR rule in the previous cell, so the filter is not repeated here
plt.figure(figsize=(10, 5))
sns.violinplot(x=airbnb_dataframe['neighbourhood_group'], y=airbnb_dataframe['price'], color='green')
plt.xlabel("Location")
plt.ylabel("Price")
plt.ylim(0, 400)
plt.title("Price Variation Across neighbourhood group")
plt.show()


**Key Insights from Price Variation Across Neighborhood Groups**:
* Manhattan has the highest price range; Brooklyn follows with a slightly lower median price.
* Queens, Bronx, and Staten Island have lower overall prices.
* All boroughs show a long tail of high-priced outliers, indicating the presence of luxury listings.

In [None]:
#chart6 boxplot for price distribution across room types
plt.figure(figsize=(10, 5))
sns.boxplot(x=airbnb_dataframe['room_type'], y=airbnb_dataframe['price'],hue= airbnb_dataframe['neighbourhood_group'])
plt.xlabel("Room type")
plt.ylabel("Price")
plt.ylim(0, 400)
plt.title("Price Variation Across Room Types")
plt.show()


**Key Insights from Price Variation Across Room Types & Neighborhoods**:

* Manhattan has the highest prices across all room types, especially for entire homes/apartments.
* Brooklyn follows with slightly lower prices, but still has a wide range of listings.
* Queens, Bronx, and Staten Island have lower median prices, with Staten Island showing more variability.
* Outliers exist in all boroughs, especially for private and shared rooms in Manhattan and Brooklyn.

In [None]:
#chart7 scatterplot for price variation vs reviews
plt.figure(figsize=(10, 5))
sns.scatterplot(x=airbnb_dataframe['number_of_reviews'], y=airbnb_dataframe['price'],hue= airbnb_dataframe['room_type'])
plt.xlabel("Reviews")
plt.ylabel("Price")
plt.ylim(0, 400)
plt.title("Price Variation vs reviews")
plt.show()


**Key Insights from Price vs. Reviews Across Room Types**:
* Lower-priced listings tend to have the most reviews, while high-priced listings generally have fewer, suggesting that occupancy falls as price rises.

* Private rooms and shared rooms dominate the lower price range, while entire homes/apartments are spread across all price levels.

* Some budget listings with high reviews indicate strong customer preference for affordable stays.
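
The scatterplot impression can be cross-checked by grouping listings into price bands and comparing the median number of reviews per band. The sketch uses made-up data, and the band edges (100, 200, 400) are assumptions for illustration.

```python
import pandas as pd

# Toy price / review pairs (made up)
toy = pd.DataFrame({
    "price": [40, 55, 70, 90, 120, 150, 220, 300],
    "number_of_reviews": [120, 80, 95, 60, 30, 25, 8, 4],
})

# Right-inclusive bands: (0, 100], (100, 200], (200, 400]
price_bands = pd.cut(
    toy["price"], bins=[0, 100, 200, 400], labels=["<=100", "101-200", "201-400"]
)

# Median review count within each price band
median_reviews = toy.groupby(price_bands, observed=True)["number_of_reviews"].median()

print(median_reviews)
```

A median that falls from band to band would support the reading that cheaper listings collect more reviews.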

In [None]:
#chart8 boxplot room availability vs room type
plt.figure(figsize=(10, 5))
sns.boxplot(x=airbnb_dataframe['room_type'], y=airbnb_dataframe['availability_365'],hue= airbnb_dataframe['neighbourhood_group'])
plt.xlabel("Room type")
plt.ylabel("Room availability")
plt.ylim(0, 400)
plt.title("Room availability across Room Types")
plt.show()


**Key Insights from Room Availability Chart**:

* Shared rooms have the highest availability, especially in Queens and Brooklyn.
* Private and entire homes show higher variation, with Manhattan & Brooklyn having lower median availability.
* Outliers in shared rooms suggest some listings are available year-round.

In [None]:
#chart9 scatterplot for Price Variation vs availabilty across Room Types
plt.figure(figsize=(10, 5))
sns.scatterplot(x=airbnb_dataframe['price'], y=airbnb_dataframe['availability_365'],hue= airbnb_dataframe['room_type'])
plt.xlabel("price")
plt.ylabel("availability")
plt.ylim(0, 400)
plt.title("Price Variation vs availabilty across Room Types")
plt.show()

**Key Insights from Price vs. Availability Across Room Types Chart**:

* Lower-priced listings have the highest availability, mostly private and shared rooms.

* Entire homes/apartments dominate higher price ranges, with varied availability.

* No strong pattern between price and availability.

* High-priced listings tend to have limited availability.

In [None]:
airbnb_dataframe.groupby(['neighbourhood_group','room_type'])['price'].mean() #mean pricing for room type across neighbourhood

**Insights**:

* Manhattan is the most expensive among the neighbourhood groups.

* Brooklyn & Staten Island have mid-range entire home prices, while private rooms are cheaper.

* Queens & Bronx offer budget-friendly stays for entire homes and private rooms.

* Shared rooms are the cheapest in all boroughs, with Manhattan still being the priciest of them.
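
The grouped means above are easier to compare side by side as a borough by room type table. A sketch with `pivot_table` on made-up data:

```python
import pandas as pd

# Toy listings (made-up prices)
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [200, 240, 110, 150, 70, 60],
})

# Rows: borough, columns: room type, cells: mean price (NaN where no listing exists)
price_table = toy.pivot_table(
    index="neighbourhood_group", columns="room_type", values="price", aggfunc="mean"
)

print(price_table)
```

Switching `aggfunc` to `"median"` would give a view that is less sensitive to the remaining high-priced listings.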

### Correlation

In [None]:
airbnb_backup = airbnb_dataframe.copy()  # keep a copy of the main dataframe before dropping columns
airbnb_dataframe.drop(['host_id', 'neighbourhood','latitude', 'longitude'], axis=1, inplace=True)  # dropping for correlation heatmap

In [None]:
# Select only numerical features for correlation analysis
numerical_features = airbnb_dataframe.select_dtypes(include=np.number)

# Calculate the correlation matrix for numerical features
correlation_matrix = numerical_features.corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()


**Key Insights from the Correlation Heatmap**:
* Price has weak correlations with all other features (highest being 0.16 with calculated_host_listings_count).
* number_of_reviews and id have a moderate negative correlation (-0.32), which implies that newer listings (higher id values) tend to have fewer reviews.
* There is a weak positive correlation between availability_365 and calculated_host_listings_count.
* minimum_nights has almost no correlation with the other features.
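
Reading the strongest relationships off a heatmap by eye is error-prone; the correlations with `price` can also be ranked by absolute value. A sketch on made-up numeric columns:

```python
import pandas as pd

# Toy numeric features (made up)
toy = pd.DataFrame({
    "price": [100, 150, 80, 200, 120],
    "minimum_nights": [1, 2, 3, 1, 30],
    "number_of_reviews": [50, 10, 75, 5, 20],
})

# Pearson correlation matrix (the pandas default)
correlation_matrix = toy.corr()

# Correlations of every other feature with price, strongest first by magnitude
price_correlations = (
    correlation_matrix["price"].drop("price").sort_values(key=abs, ascending=False)
)

print(price_correlations)
```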


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

**Optimize Pricing**
* Implement dynamic pricing based on demand and seasonality.

* Reduce prices of listings with few reviews and offer long-stay discounts.

**Luxury Properties**
* Enhance visibility through premium branding and marketing.

* Adjust pricing based on demand and seasonal trends.

**New Hosts and Underrepresented Areas**

* Encourage new hosts in low-competition areas with attractive pricing.

* Offer incentives for hosts in underrepresented neighborhoods.

* Promote budget-friendly options in Queens & Bronx.

**Host Engagement**
* Improve host engagement for better guest experiences.

* Promote high-review properties and increase availability.

# **Conclusion**

**Market Dominance**:

* Manhattan and Brooklyn have the highest number of listings.

* Queens follows, with fewer in Staten Island and the Bronx.

**Pricing Trends**:

* Manhattan listings are the most expensive, especially for entire homes.

* Queens, Bronx, and Staten Island are more affordable.

* Luxury listings ($200+) have fewer reviews.

**Availability Patterns**:

* Many listings have low availability, suggesting they are rarely rented.

* Shared rooms have higher availability.

* High-review listings are often budget-friendly.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***