# **Project Name**    - Airbnb Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

### **Project Summary: Airbnb Booking Analysis – Exploratory Data Analysis**

The rapid growth of Airbnb as a global platform for short-term lodging has revolutionized the travel and hospitality industry. Understanding customer behavior, host preferences, seasonal trends, and geographical patterns is crucial for both Airbnb and its users—guests and hosts alike. This project, titled **"Airbnb Booking Analysis – Exploratory Data Analysis (EDA)"**, aims to uncover valuable insights from Airbnb’s booking data to help stakeholders make informed decisions.

#### **Objective**

The primary goal of this project is to perform an in-depth exploratory data analysis on Airbnb datasets to identify meaningful patterns, trends, and correlations in booking behavior. This involves understanding the dynamics of pricing, availability, popular destinations, guest preferences, and factors influencing listing popularity and occupancy rates. The insights derived can benefit multiple stakeholders: Airbnb can enhance its service offerings, hosts can optimize pricing and listings, and travelers can plan better stays.

#### **Methodology**

The analysis was conducted using Python, leveraging libraries such as **Pandas** for data manipulation, **Matplotlib** and **Seaborn** for data visualization, and **NumPy** for numerical operations. The approach followed these key steps:

1. **Data Cleaning and Preprocessing:**
   The initial phase focused on handling missing values, removing duplicates, parsing date columns, and converting data types. Columns with excessive missing or irrelevant data were dropped or imputed based on context.

2. **Univariate and Bivariate Analysis:**
   Distributions of individual features (like price, number of reviews, and availability) were analyzed. Relationships between features, such as price vs. number of reviews or availability vs. room type, were visualized using scatter plots, heatmaps, and boxplots.

3. **Geospatial Analysis:**
   Latitude and longitude data enabled mapping listings on a city-level basis using geospatial libraries like Folium or Plotly. This helped identify hotspots, tourism-friendly areas, and price zones across cities.

4. **Temporal Patterns:**
   Time-based features were analyzed to detect seasonal trends in bookings and pricing. This included availability trends, booking patterns across months, and variation in demand during holidays or peak seasons.

5. **Sentiment and Review Analysis (Optional Extension):**
   If review text data was included, Natural Language Processing (NLP) techniques were used to analyze guest sentiments and common keywords, offering insights into guest satisfaction.

#### **Key Insights**

* **Price Distribution:** The price per night follows a right-skewed distribution, with most listings priced below a certain threshold, but a few high-end properties significantly inflating the average.
* **Room Type Popularity:** Entire homes/apartments are the most listed and booked, followed by private rooms. Shared rooms are the least preferred.
* **Location-Based Trends:** Certain neighborhoods consistently show higher average prices and higher review counts, indicating popularity and higher demand.
* **Review Influence:** Listings with more positive reviews and higher ratings are often more frequently booked, showing a direct link between customer feedback and listing success.
* **Seasonal Demand:** Booking frequency and prices spike during holiday seasons and popular travel months, suggesting seasonality is a strong factor in listing dynamics.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The aim of this project is to perform data wrangling, cleaning, and exploratory data analysis (EDA) on a dataset containing hotel bookings information. The dataset consists of various attributes such as booking details, guest information, and stay characteristics. The goal is to gain valuable insights into booking patterns, customer behavior, and factors influencing booking cancellations. This project will involve the following tasks:

1. **Data Wrangling and Cleaning:**
 * Handle missing values, outliers, and inconsistencies in the dataset.
 * Standardize data formats and correct any data entry errors.
 * Perform feature engineering to derive new features if necessary.

2. **Exploratory Data Analysis (EDA):**
 * Explore the distribution of key variables such as booking lead time, stay duration, and average daily rate.
 * Analyze the relationship between booking cancellations and various factors such as lead time, market segment, and previous cancellations.
 * Investigate seasonal patterns in hotel bookings and cancellations.
 * Identify trends and patterns in customer preferences, such as meal type, room type, and special requests.

3. **Visualization:**
 * Create visualizations (e.g., histograms, box plots, heatmaps) to illustrate the distribution and relationships of different variables.
 * Generate insights from visualizations to better understand the dataset and highlight key findings.
 * Visualize trends over time (e.g., monthly booking trends, changes in cancellation rates) using line plots or time series analysis.

4. **Insights and Recommendations:**
 * Summarize key insights obtained from the data analysis.
 * Provide recommendations for hotel management based on the findings, such as optimizing booking strategies, improving customer satisfaction, and reducing cancellation rates.

By completing these tasks, this project aims to provide actionable insights for hotel management to optimize their operations and enhance the overall guest experience.

#### **Define Your Business Objective?**

Business Objective:

The primary objective of this project is to leverage data analysis techniques to optimize the operations and improve the overall performance of the hotel business. By analyzing the hotel bookings dataset, the following specific goals are targeted:

**Enhance Booking Efficiency:** Identify factors influencing booking patterns and optimize booking strategies to maximize occupancy rates and revenue generation.

**Reduce Booking Cancellations:** Understand the drivers behind booking cancellations and implement measures to minimize cancellations, thereby increasing revenue stability and guest satisfaction.

**Improve Customer Experience:** Gain insights into customer preferences and behavior to tailor services, such as meal offerings, room assignments, and special requests, to enhance the overall guest experience.

**Optimize Resource Allocation:** Utilize data-driven insights to optimize resource allocation, such as staffing levels, room inventory management, and parking space availability, to meet guest demands efficiently.

**Inform Strategic Decision-Making:** Provide actionable insights to inform strategic decision-making processes, including marketing strategies, pricing policies, and operational improvements, to drive sustainable business growth and competitiveness in the hospitality industry.

By achieving these objectives, the project aims to contribute to the long-term success and profitability of the hotel business while ensuring exceptional guest satisfaction and loyalty.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data = pd.read_csv('/content/drive/My Drive/Airbnb booking analysis EDA/Airbnb Dataset/Airbnb NYC 2019.csv')
# Before we start the EDA and make the data analysis-ready, let's create a copy of the DataFrame.
airbnb_booking = data.copy()
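Since the guidelines ask for exception handling, the loading step could be wrapped in a small helper. The following is only a sketch: `load_listings` is a hypothetical name, and the checks it performs are assumptions about what a useful failure message would look like, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_listings(csv_path: str) -> pd.DataFrame:
    """Read the listings CSV, failing early with a readable message."""
    path = Path(csv_path)
    if not path.is_file():
        # Typical cause in Colab: Drive is not mounted or the path has a typo
        raise FileNotFoundError(
            f"Dataset not found at {path}; check that Drive is mounted and the path is correct."
        )
    df = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"{path} was read but contains no rows.")
    return df
```

With this helper, the cell above could become `data = load_listings('<path to the CSV>')` followed by the same `data.copy()`.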

### Dataset First View

In [None]:
# Dataset First Look
airbnb_booking.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns
airbnb_booking.shape

### Dataset Information

In [None]:
# Dataset Info
airbnb_booking.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
airbnb_booking.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(airbnb_booking.isnull().sum())

The dataset contains missing values in four columns: name, host_name, last_review, and reviews_per_month.

The column **name** has **16** missing values.

The column **host_name** has **21** missing values.

The column **last_review** has **10052** missing values.

The column **reviews_per_month** has **10052** missing values.
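The per-column counts above can also be expressed as a share of all rows, which makes it easier to judge how serious each gap is. Below is a minimal sketch; `missing_summary` is a hypothetical helper and the toy frame uses made-up values standing in for `airbnb_booking`.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


# Toy frame (hypothetical values) standing in for airbnb_booking
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "price": [120, 80, 60, 150],
    "reviews_per_month": [1.2, None, None, 0.4],
})
print(missing_summary(toy))
```

On the real data, `missing_summary(airbnb_booking)` should list the same four columns reported above.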

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(airbnb_booking.isnull(), cbar=False)

### What did you know about your dataset?

### **So far we know that:**

**Number of Rows:** The dataset contains **48895** entries or rows.

**Number of Columns:** There are **16** columns in the dataset, each representing a different attribute of an Airbnb listing.

**Data Types:** The dataset contains a mix of data types, including integers (int64), floats (float64), and objects (object). This suggests that the dataset includes both numerical and categorical variables.

**Duplicates Rows:**
The dataset contains **0** duplicate entries.

**Missing Values:** While the majority of columns contain complete data, four columns have missing values. Specifically, the **last_review** and **reviews_per_month** columns each show a substantial count of **10052** missing values. Additionally, the **name** and **host_name** columns contain **16** and **21** missing values, respectively.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
airbnb_booking.columns

In [None]:
# Dataset Describe
airbnb_booking.describe(include='all')

### Variables Description

**Here is a description of each column in the dataset:**

**id**: Unique ID of the listing

**name**: Name of the Airbnb listing

**host_id**: Unique ID of the host

**host_name**: Name of the host

**neighbourhood_group**: Borough in which the listing is located (for example, Manhattan or Brooklyn)

**neighbourhood**: Neighbourhood in which the listing is located

**latitude**: Latitude of the listing

**longitude**: Longitude of the listing

**room_type**: Type of room listed (Entire home/apt, Private room, or Shared room)

**price**: Price of the listing (in $)

**minimum_nights**: Minimum number of nights that must be booked

**number_of_reviews**: Number of reviews given by customers

**last_review**: Date of the most recent review

**reviews_per_month**: Average number of reviews per month

**calculated_host_listings_count**: Number of listings the host has on Airbnb

**availability_365**: Number of days in the year on which the listing is available

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in airbnb_booking.columns.tolist():
  print("No. of unique values in ",i,"is",airbnb_booking[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Let's Start with data cleaning.
# Remove duplicate rows from the airbnb_booking DataFrame, modifying it in place
airbnb_booking.drop_duplicates(inplace=True)

In [None]:

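# Fill missing values in 'name' and 'host_name' with 'Unknown'
# (assign the result back; calling fillna with inplace=True on a single column triggers warnings in newer pandas)
airbnb_booking['name'] = airbnb_booking['name'].fillna('Unknown')
airbnb_booking['host_name'] = airbnb_booking['host_name'].fillna('Unknown')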

In [None]:
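# Fill missing values in 'last_review' with 'No review' and in 'reviews_per_month' with 0
# (assign the result back; calling fillna with inplace=True on a single column triggers warnings in newer pandas)
airbnb_booking['last_review'] = airbnb_booking['last_review'].fillna('No review')
airbnb_booking['reviews_per_month'] = airbnb_booking['reviews_per_month'].fillna(0)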

In [None]:
airbnb_booking.isnull().sum()

### What all manipulations have you done and insights you found?

Filled missing values in 'name' and 'host_name' with 'Unknown'.

Filled missing values in 'last_review' with 'No review'.

Filled missing values in 'reviews_per_month' with 0.
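The manipulations listed above can be collected into one function that returns a cleaned copy. This is a sketch under stated assumptions, not the notebook's method: `clean_listings` is a hypothetical name, and it swaps the 'No review' placeholder for a parsed date column (with `NaT` for unreviewed listings), which is one possible alternative if date arithmetic on `last_review` is needed later.

```python
import pandas as pd


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    out = df.drop_duplicates().copy()
    # Text columns: label missing entries explicitly
    out["name"] = out["name"].fillna("Unknown")
    out["host_name"] = out["host_name"].fillna("Unknown")
    # A listing without reviews has zero reviews per month
    out["reviews_per_month"] = out["reviews_per_month"].fillna(0)
    # Assumption: keep last_review as a date (NaT when missing)
    # instead of the 'No review' string used in the notebook
    out["last_review"] = pd.to_datetime(out["last_review"], errors="coerce")
    return out


# Toy frame (hypothetical values) with the same column names
toy = pd.DataFrame({
    "name": ["Cozy loft", None],
    "host_name": [None, "Sam"],
    "last_review": ["2019-05-21", None],
    "reviews_per_month": [0.8, None],
})
cleaned = clean_listings(toy)
print(cleaned.dtypes)
```

Returning a copy keeps the raw frame available for comparison, at the cost of some extra memory.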



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Distribution Of Airbnb Bookings Price Range

In [None]:
# Set the seaborn theme to darkgrid (before creating the figure so that it takes effect)
sns.set_theme(style='darkgrid')

# Create a figure with a custom size
plt.figure(figsize=(12, 5))

# Plot a density histogram of the 'price' column of airbnb_booking with a KDE curve, in red
# (sns.distplot is deprecated; sns.histplot with kde=True is its replacement)
sns.histplot(airbnb_booking['price'], kde=True, stat='density', color='r')

# Add labels to the x-axis and y-axis
plt.xlabel('Price', fontsize=14)
plt.ylabel('Density', fontsize=14)

# Add a title to the plot
plt.title('Distribution of Airbnb Prices',fontsize=15)

##### 1. Why did you pick the specific chart?

The histogram is a popular graphing tool. It is used to summarize discrete or continuous data that are measured on an interval scale. It is often used to illustrate the major features of the distribution of the data in a convenient form. It is also useful when dealing with large data sets (greater than 100 observations). It can help detect any unusual observations (outliers) or any gaps in the data.

Thus, I used the histogram to analyze how the variable is distributed over the whole dataset and whether that distribution is symmetric.

##### 2. What is/are the insight(s) found from the chart?

*   Most prices charged on Airbnb appear to fall between roughly **20 and 330 dollars**, with the majority of listings in the **50 to 150 dollar** range.

*   The distribution peaks in the **50 to 150 dollar** range, with a relatively lower density of listings at both higher and lower prices.

*   Listings priced above **250 dollars** are comparatively rare, as the density of listings drops significantly in this range.
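The ranges read off the chart can be cross-checked numerically with quantiles and a skewness measure. The sketch below uses a hypothetical helper, `price_profile`, and a toy price series with made-up values; on the real data it would be called as `price_profile(airbnb_booking['price'])`.

```python
import pandas as pd


def price_profile(prices: pd.Series) -> pd.Series:
    """Summarise a price column: central values, spread, and skewness."""
    return pd.Series({
        "median": prices.median(),
        "mean": prices.mean(),
        "p05": prices.quantile(0.05),  # 5% of listings are cheaper than this
        "p95": prices.quantile(0.95),  # 5% of listings are pricier than this
        "skew": prices.skew(),         # > 0 indicates a long right tail
    })


# Toy prices (hypothetical): mostly moderate, one very expensive listing
toy_prices = pd.Series([40, 60, 75, 90, 100, 120, 150, 200, 2000])
profile = price_profile(toy_prices)
print(profile)
```

A mean well above the median, together with positive skewness, is the numerical signature of the right-skewed shape described in the project summary.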

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained can create a positive business impact by helping hosts optimize their pricing strategies. Knowing that most Airbnb listings fall between \$50 and \$150 allows hosts to stay competitive within this peak range, maximizing occupancy and revenue. However, there can also be negative growth if a host prices above \$250 without offering proportional value. This is supported by the sharp drop in listing density in this range, suggesting lower demand. Overpricing may lead to reduced bookings, decreased visibility in search results, and ultimately, revenue loss—highlighting the importance of aligning pricing with market trends and customer expectations.


#### Total Listing/Property count in Each Neighborhood Group

In [None]:
# Count the number of listings in each neighborhood group and store the result in a Pandas series
counts = airbnb_booking['neighbourhood_group'].value_counts()

# Reset the index of the series so that the neighborhood groups become columns in the resulting dataframe
Top_Neighborhood_group = counts.reset_index()

# Rename the columns of the dataframe to be more descriptive
Top_Neighborhood_group.columns = ['Neighborhood_Groups', 'Listing_Counts']

# display the resulting DataFrame
Top_Neighborhood_group

In [None]:
# Set the figure size
plt.figure(figsize=(12, 8))

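# Create a countplot of the neighbourhood group data
# (pass the column name as x; recent seaborn versions treat the first positional argument as data)
sns.countplot(x='neighbourhood_group', data=airbnb_booking)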

# Set the title of the plot
plt.title('Neighbourhood_group Listing Counts in NYC', fontsize=15)

# Set the x-axis label
plt.xlabel('Neighbourhood_Group', fontsize=14)

# Set the y-axis label
plt.ylabel('total listings counts', fontsize=14)

##### 1. Why did you pick the specific chart?

A **countplot** is a powerful data visualization tool that displays the frequency of categorical data. It helps identify patterns, trends, and imbalances within categories quickly. Ideal for exploratory data analysis, it simplifies comparison across groups and supports clear, intuitive insights, making data interpretation and communication more effective and efficient.


##### 2. What is/are the insight(s) found from the chart?

*   Manhattan and Brooklyn have the highest number of listings on Airbnb, with 21661 and 20104 listings respectively.

*   Queens and the Bronx have significantly fewer listings compared to Manhattan and Brooklyn, with 5666 and 1091 listings, respectively.

*   Staten Island has the fewest number of listings, with only 373.

*   The distribution of listings across the different neighborhood groups is skewed, with a concentration of listings in Manhattan and Brooklyn.

*   Despite being larger in size, the neighborhoods in Queens, the Bronx, and Staten Island have fewer listings on Airbnb compared to Manhattan, which has a smaller geographical area.

*   This could suggest that the demand for Airbnb rentals is higher in Manhattan compared to the other neighborhoods, leading to a higher concentration of listings in this area.

*   Alternatively, it could be that the supply of listings is higher in Manhattan due to a higher number of homeowners or property owners in this neighborhood who are willing to list their properties on Airbnb.
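To quantify how concentrated the listings are, the counts quoted above can be converted into percentage shares. This is a small worked sketch using only those five counts; on the full DataFrame, `airbnb_booking['neighbourhood_group'].value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Listing counts per neighbourhood group, as quoted in the insights above
listing_counts = pd.Series({
    "Manhattan": 21661,
    "Brooklyn": 20104,
    "Queens": 5666,
    "Bronx": 1091,
    "Staten Island": 373,
})

# Share of all listings held by each group, in percent
share_pct = (listing_counts / listing_counts.sum() * 100).round(1)
print(share_pct)

# Combined share of the two largest groups
top_two_pct = round(
    listing_counts[["Manhattan", "Brooklyn"]].sum() / listing_counts.sum() * 100, 1
)
print(f"Manhattan + Brooklyn: {top_two_pct}% of listings")
```

Manhattan and Brooklyn together account for roughly 85% of the listings, which puts a number on the skew described above.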

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact by guiding investment and marketing strategies toward high-demand areas like Manhattan and Brooklyn, which have the highest number of Airbnb listings. This concentration suggests strong demand or a more active host base. However, this skewed distribution may also indicate oversaturation, leading to increased competition and reduced profitability. In contrast, underrepresented areas like Queens, the Bronx, and Staten Island may present growth opportunities. Ignoring these could lead to missed market expansion. Thus, while the insights mostly support positive growth, strategic decisions must avoid reinforcing imbalances that could hinder long-term sustainability.


#### Top Neighborhoods by Listing/property

In [None]:
# create a new DataFrame that displays the top 10 neighborhoods in the Airbnb NYC dataset based on the number of listings in each neighborhood
Top_Neighborhoods = airbnb_booking['neighbourhood'].value_counts()[:10].reset_index()

# rename the columns of the resulting DataFrame to 'Top_Neighborhoods' and 'Listing_Counts'
Top_Neighborhoods.columns = ['Top_Neighborhoods', 'Listing_Counts']

# display the resulting DataFrame
Top_Neighborhoods

In [None]:
# Get the top 10 neighborhoods by listing count
top_10_neighbourhoods = airbnb_booking['neighbourhood'].value_counts().nlargest(10)

# Create a list of colors to use for the bars
colors = ['c', 'g', 'olive', 'y', 'm', 'orange', '#C0C0C0', '#800000', '#008000', '#000080']

# Create a bar plot of the top 10 neighborhoods using the specified colors
top_10_neighbourhoods.plot(kind='bar', figsize=(15, 6), color=colors)

# Set the x-axis label
plt.xlabel('Neighbourhood', fontsize=14)

# Set the y-axis label
plt.ylabel('Total Listing Counts', fontsize=14)

# Set the title of the plot
plt.title('Listings by Top Neighborhoods in NYC', fontsize=15)

##### 1. Why did you pick the specific chart?

A bar plot offers a clear, visual comparison of categorical data. It highlights differences in quantities across categories, making trends and patterns easy to identify. Bar plots are simple to create, interpret, and effective for both small and large datasets, supporting decision-making and data presentation in various fields.


##### 2. What is/are the insight(s) found from the chart?

*   The top neighborhoods in New York City in terms of listing counts are Williamsburg, Bedford-Stuyvesant, Harlem, Bushwick, and the Upper West Side.

*   The top neighborhoods are primarily located in Brooklyn and Manhattan. This may be due to the fact that these boroughs have a higher overall population and a higher demand for housing.

*   The number of listings alone may not be indicative of the overall demand for housing in a particular neighborhood, as other factors such as the cost of living and the availability of housing may also play a role.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can create a positive business impact by helping property managers and investors focus their efforts on high-demand neighborhoods like Williamsburg and Harlem, potentially maximizing occupancy and revenue. However, relying solely on listing counts could lead to negative growth. For instance, a high number of listings in a neighborhood might indicate market saturation or high turnover, suggesting instability. Without considering cost of living and housing availability, businesses risk investing in areas with limited profitability. Therefore, combining listing data with economic and demographic factors is crucial for making informed, growth-oriented decisions in the real estate market.


#### Top Hosts With More Listing/Property

In [None]:
# create a new DataFrame that displays the top 10 hosts in the Airbnb NYC dataset based on the number of listings each host has
top_10_hosts = airbnb_booking['host_name'].value_counts()[:10].reset_index()

# rename the columns of the resulting DataFrame to 'host_name' and 'Total_listings'
top_10_hosts.columns = ['host_name', 'Total_listings']

# display the resulting DataFrame
top_10_hosts

In [None]:
# Get the top 10 hosts by listing count
top_hosts = airbnb_booking['host_name'].value_counts()[:10]

# Create a bar plot of the top 10 hosts
top_hosts.plot(kind='bar', color='peru', figsize=(18, 7))

# Set the x-axis label
plt.xlabel('top10_hosts', fontsize=14)

# Set the y-axis label
plt.ylabel('total_NYC_listings', fontsize=14)

# Set the title of the plot
plt.title('top 10 hosts on the basis of no of listings in entire NYC!', fontsize=15)




##### 1. Why did you pick the specific chart?

A bar plot offers a clear, visual comparison of categorical data. It highlights differences in quantities across categories, making trends and patterns easy to identify. Bar plots are simple to create, interpret, and effective for both small and large datasets, supporting decision-making and data presentation in various fields.

##### 2. What is/are the insight(s) found from the chart?

*   The top three host names in terms of total listings are Michael, David, and John, with 417, 403, and 327 listings, respectively. Because the count is grouped by `host_name`, different hosts who share a first name are combined, so these figures may overstate what any single host owns.

*   There is a relatively large gap between the top two names and the rest. For example, John has 327 listings, which is significantly fewer than Michael's 417 listings.

*   Within this top 10 list, Maria has 204 listings, fewer than half of Michael's 417. This could indicate that there is a lot of variation in the success of different hosts on Airbnb.

*   There are relatively few hosts with a large number of listings. This could indicate that the Airbnb market is relatively concentrated, with a small number of hosts holding a large portion of the listings.
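Because the counts above group by `host_name`, unrelated hosts who share a first name are merged into one bar. A possible refinement is to group by `host_id` and keep the name only for display. The sketch below assumes the `id`, `host_id`, and `host_name` columns of the dataset; `top_hosts_by_id` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def top_hosts_by_id(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Count listings per host_id and attach a host_name for display."""
    return (
        df.groupby("host_id")
          .agg(host_name=("host_name", "first"), total_listings=("id", "count"))
          .sort_values("total_listings", ascending=False)
          .head(n)
          .reset_index()
    )


# Toy data (hypothetical): two different hosts are both called Michael
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "host_id": [10, 10, 10, 20, 30, 30],
    "host_name": ["Michael", "Michael", "Michael", "Michael", "Ana", "Ana"],
})

by_name = toy["host_name"].value_counts()  # merges both Michaels
by_id = top_hosts_by_id(toy)               # keeps them separate
print(by_name)
print(by_id)
```

In the toy data, counting by name credits "Michael" with four listings, while counting by ID shows that the largest single host has three.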

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact by identifying top-performing hosts and understanding market concentration. This knowledge enables Airbnb to form strategic partnerships, design targeted incentives, and support underperforming hosts for balanced growth. However, the data also reveals a potential negative insight: market dominance by a few hosts may discourage new entrants and reduce competition, potentially impacting customer experience and pricing fairness. For instance, with Michael having more than double the listings of Maria, such imbalance could limit diversity in listings and innovation, leading to stagnation or negative perception among users and smaller hosts.


#### Number Of Active Hosts Per Location

In [None]:
# create a new DataFrame with the number of listings in each neighborhood group in the Airbnb NYC dataset
# (each 'id' is one listing, so this is a proxy for host activity rather than a count of distinct hosts)
hosts_per_location = airbnb_booking.groupby('neighbourhood_group')['id'].count().reset_index()

# rename the columns of the resulting DataFrame to 'Neighbourhood_Groups' and 'Host_counts'
hosts_per_location.columns = ['Neighbourhood_Groups', 'Host_counts']

# display the resulting DataFrame
hosts_per_location

In [None]:
# Group the data by neighbourhood_group and count the number of listings for each group
hosts_per_location = airbnb_booking.groupby('neighbourhood_group')['id'].count()

# Get the list of neighbourhood_group names
locations = hosts_per_location.index

# Get the listing counts for each neighbourhood_group (used as a proxy for host activity)
host_counts = hosts_per_location.values

# Set the figure size
plt.figure(figsize=(12, 5))

# Create the line chart with some experiments using marker function
plt.plot(locations, host_counts, marker='o', ms=12, mew=4, mec='r')

# Add a title and labels to the x-axis and y-axis
plt.title('Number of Active Hosts per Location', fontsize=15)
plt.xlabel('Location', fontsize=14)
plt.ylabel('Number of Active Hosts', fontsize=14)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A line chart with markers makes the rise and fall in counts between locations easy to see at a glance. Line charts are usually intended for ordered or time-based data, however. Because neighbourhood groups are unordered categories, the connecting lines here should not be read as a trend, and a bar chart would be an equally valid choice.


##### 2. What is/are the insight(s) found from the chart?

*   Manhattan has the largest count with 21661, and Brooklyn has the second largest with 20104. These figures count listings per location, which serves here as a proxy for host activity.

*   Queens follows with 5666 and the Bronx with 1091, while Staten Island has the fewest with 373.

*   Manhattan and Brooklyn each have more than three times the count of Queens and more than 18 times the count of the Bronx.
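The figures above come from counting `id`, which counts listings rather than hosts. To count distinct hosts, `host_id` can be aggregated with `nunique`. The following is a sketch with a hypothetical helper, `hosts_and_listings`, and made-up toy rows; the real numbers would come from running it on `airbnb_booking`.

```python
import pandas as pd


def hosts_and_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Per neighbourhood group: number of listings and number of distinct hosts."""
    return (
        df.groupby("neighbourhood_group")
          .agg(listings=("id", "count"), distinct_hosts=("host_id", "nunique"))
          .sort_values("listings", ascending=False)
    )


# Toy data (hypothetical): host 10 and host 12 each own two listings
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "host_id": [10, 10, 11, 12, 12],
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Queens", "Queens"],
})
summary = hosts_and_listings(toy)
print(summary)
```

Comparing the two columns side by side shows how far listing counts overstate the number of hosts in areas where multi-listing hosts are common.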

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact by identifying high-demand areas like Manhattan and Brooklyn, guiding targeted marketing, pricing strategies, and investment decisions. However, insights could also indicate potential negative growth. For example, Staten Island and the Bronx have significantly fewer hosts, suggesting low demand or unattractive market conditions. Investing in these areas without understanding the root causes—such as limited tourist interest or poor connectivity—could lead to low occupancy and revenue. Thus, while high-host areas offer growth opportunities, low-host regions require cautious evaluation to avoid unprofitable expansion and ensure strategic resource allocation.


#### Total Counts Of Each Room Type

In [None]:
# create a new DataFrame that displays the number of listings of each room type in the Airbnb NYC dataset
top_room_type = airbnb_booking['room_type'].value_counts().reset_index()

# rename the columns of the resulting DataFrame to 'Room_Type' and 'Total_counts'
top_room_type.columns = ['Room_Type', 'Total_counts']

# display the resulting DataFrame
top_room_type

In [None]:
# Set the figure size
plt.figure(figsize=(10, 6))

# Get the room type counts
room_type_counts = airbnb_booking['room_type'].value_counts()

# Set the labels and sizes for the pie chart
labels = room_type_counts.index
sizes = room_type_counts.values

# Create the pie chart
plt.pie(sizes, labels=labels, autopct='%1.1f%%')

# Add a legend to the chart
plt.legend(title='Room Type', bbox_to_anchor=(0.8, 0, 0.5, 1), fontsize=12)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart expresses a part-to-whole relationship in the data. It makes percentage comparisons easy to explain through the area each category covers in the circle. I therefore used a pie chart to compare the percentage share of each room type.

##### 2. What is/are the insight(s) found from the chart?

*  The majority of listings on Airbnb are for entire homes or apartments, with 25409 listings, followed by private rooms with 22326 listings, and shared rooms with 1160 listings.

*  There is a significant difference in the number of listings for each room type. For example, there are about 22 times as many listings for entire homes or apartments as there are for shared rooms.

*  The data suggests that travelers using Airbnb have a wide range of accommodation options to choose from, including shared rooms, private rooms, and entire homes or apartments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can create a positive business impact by helping Airbnb optimize its platform to match user preferences. The dominance of entire homes or apartments (25409 listings) suggests a high demand for privacy and space, guiding Airbnb to prioritize such listings in marketing and user experience design. However, the low number of shared room listings (1160) could indicate underperformance or low demand. Investing in promoting shared rooms may not yield high returns and could lead to negative growth. Therefore, Airbnb should focus on enhancing its most popular offerings while reconsidering investment in low-demand segments like shared rooms.


#### Stay Requirement Counts by Minimum Nights Using Bar Chart

In [None]:
# Group the DataFrame by the minimum_nights column and count the number of rows in each group
min_nights_count = airbnb_booking.groupby('minimum_nights').size().reset_index(name = 'count')

# Sort the resulting DataFrame in descending order by the count column
min_nights_count = min_nights_count.sort_values('count', ascending=False)

# Select the top 15 rows
min_nights_count = min_nights_count.head(15)

# Reset the index
min_nights_count = min_nights_count.reset_index(drop=True)

# Display the resulting DataFrame
min_nights_count

In [None]:
# Extract the minimum_nights and count columns from the DataFrame
minimum_nights = min_nights_count['minimum_nights']
count = min_nights_count['count']

# Set the figure size
plt.figure(figsize=(12, 4))

# Create the bar plot
plt.bar(minimum_nights, count)

# Add axis labels and a title
plt.xlabel('Minimum Nights', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Stay Requirement by Minimum Nights', fontsize=15)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as percentages.

##### 2. What is/are the insight(s) found from the chart?

*   The majority of listings on Airbnb have a minimum stay requirement of 1 or 2 nights, with 12720 and 11696 listings, respectively.

*   The number of listings generally decreases as the minimum stay requirement increases, with 7999 listings requiring a minimum stay of 3 nights, and so on.

*   Listings with a minimum stay requirement of 30 nights or more are less common than short-stay listings; the two counts reported in this range are 3760 and 201 listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact by informing hosts and Airbnb about optimal minimum stay requirements. Shorter minimum stays (1–2 nights) attract more bookings, maximizing occupancy and revenue. However, listings with high minimum stay requirements (30+ nights) may face reduced visibility and booking rates, leading to negative growth. This could be due to limited traveler flexibility or preference for shorter stays. If many listings enforce longer stays without matching demand, they risk lower occupancy and profitability. Hence, adjusting stay policies based on demand patterns can help optimize listing performance and improve overall platform efficiency.
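
One way to compare short and long stay requirements is to group `minimum_nights` into buckets and look at the share of listings in each. This is a sketch under stated assumptions: the sample uses only the four counts quoted above (1, 2, 3 and 30 nights), and the bucket boundaries are illustrative choices rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: only the four counts quoted above are used
# (1, 2, 3 and 30 nights); the real column has many more values.
quoted_counts = {1: 12720, 2: 11696, 3: 7999, 30: 3760}
minimum_nights = pd.Series(
    np.repeat(list(quoted_counts.keys()), list(quoted_counts.values())),
    name="minimum_nights",
)

# Bucket boundaries are an illustrative assumption, not from the notebook
bins = [0, 2, 7, 29, np.inf]
labels = ["1-2 nights", "3-7 nights", "8-29 nights", "30+ nights"]
stay_bucket = pd.cut(minimum_nights, bins=bins, labels=labels)

# Share of listings in each bucket, in percent
bucket_share = (
    stay_bucket.value_counts(normalize=True)
    .reindex(labels)
    .mul(100)
    .round(1)
)
print(bucket_share)
```

On the full dataset, replacing the sample with `airbnb_booking['minimum_nights']` would give the shares for all listings.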


#### Most Reviewed Room Type per Month in Neighbourhood Groups

In [None]:
# create a figure of size (10, 8)
f, ax = plt.subplots(figsize=(10, 8))

# create a stripplot that displays the number of reviews per month for each room type in the Airbnb NYC dataset
ax = sns.stripplot(x='room_type', y='reviews_per_month', hue='neighbourhood_group', dodge=True, data=airbnb_booking, palette='Set1')

# set the title of the plot
ax.set_title('Most Reviewed Room Types in Each Neighbourhood Group', fontsize='14')

# show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A strip plot displays individual data points along a categorical axis, making it useful for visualizing the distribution and clustering of values. It highlights variation and potential outliers and helps compare categories. Jitter and `dodge=True` reduce, but do not eliminate, overplotting, so it is simple yet effective for revealing trends, especially when combined with other categorical plot overlays.


##### 2. What is/are the insight(s) found from the chart?

*   We can see that Private rooms received the most reviews per month, with Manhattan having the highest values for Private rooms at more than 50 reviews/month.

*   Manhattan and Queens got the most reviews for the Entire home/apt room type.

*   Shared rooms received fewer reviews than the other room types, and those reviews came from Staten Island, followed by the Bronx.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can positively impact business by helping hosts focus on popular room types and high-demand locations. For example, prioritizing Private rooms in Manhattan or Entire home/apts in Manhattan and Queens can drive more bookings and revenue. However, low reviews for Shared rooms in Staten Island and Bronx may indicate low demand or customer dissatisfaction, potentially leading to negative growth if not addressed. This could be due to poor amenities, safety concerns, or accessibility issues. Businesses should consider improving quality or reallocating resources to high-performing segments to maximize returns and avoid losses in underperforming areas.
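
The strip plot can be backed up with a numeric summary of `reviews_per_month` per room type and neighbourhood group. The sketch below assumes a tiny hypothetical `listings` frame with made-up values that only reuse the notebook's column names; on the real data the same `groupby` would be applied to `airbnb_booking`.

```python
import pandas as pd

# Hypothetical toy rows; values are invented for illustration only
listings = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Shared room", "Shared room"],
    "neighbourhood_group": ["Manhattan", "Queens", "Manhattan",
                            "Queens", "Staten Island", "Bronx"],
    "reviews_per_month": [5.0, 3.0, 4.0, 3.5, 0.5, 1.0],
})

# Count, median and maximum reviews per month for each combination
summary = (
    listings
    .groupby(["room_type", "neighbourhood_group"])["reviews_per_month"]
    .agg(["count", "median", "max"])
    .sort_values("max", ascending=False)
)

# Combination with the highest single reviews_per_month value
top_group = summary.index[0]
print(summary)
print("Highest maximum:", top_group)
```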


#### Count of Each Room Type in Entire NYC Using Multiple Bar Plot

In [None]:
# Analyze room type counts by neighbourhood group in NYC

# Set the size of the plot
plt.rcParams['figure.figsize'] = (8, 5)

# Create a countplot using seaborn
ax = sns.countplot(y='room_type', hue='neighbourhood_group', data=airbnb_booking, palette='bright')

# Calculate the total number of room_type values
total = len(airbnb_booking['room_type'])

# Add percentage labels to each bar in the plot
for p in ax.patches:
    percentage = f'{100 * p.get_width() / total:.1f}%'
    x = p.get_x() + p.get_width() + 0.02
    y = p.get_y() + p.get_height() / 2
    ax.annotate(percentage, (x, y))

# Add a title to the plot
plt.title('Count of each room type in entire NYC', fontsize='15')

# Add a label to the x-axis
plt.xlabel('Room counts', fontsize='14')

# Rotate the x-tick labels
plt.xticks(rotation=90)

# Add a label to the y-axis
plt.ylabel('Rooms', fontsize='14')

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

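A multiple (grouped) bar plot shows the count of each room type split by neighbourhood group, which makes it easy to compare how room types are distributed across the boroughs. The percentage labels show each bar's share of all listings.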

##### 2. What is/are the insight(s) found from the chart?

*   Manhattan has the most Entire home/apt listings, at around 27.0% of all listed properties, followed by Brooklyn with around 19.6%.

*   Private rooms are most common in Brooklyn, at 20.7% of all listed properties, followed by Manhattan with 16.3% and Queens with 6.9%.

*   Shared rooms make up a very small share of all listings, and they are almost absent in Staten Island and the Bronx.

*   We can infer that Brooklyn, Queens, and the Bronx have more Private room listings, while Manhattan, which has the highest number of listings in NYC, has more Entire home/apt listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained can help create a positive business impact by guiding targeted marketing and pricing strategies. For example, focusing on entire home/apt listings in Manhattan can attract high-paying guests, while promoting private rooms in Brooklyn or Queens may appeal to budget-conscious travelers. However, an overreliance on shared or private rooms in areas with lower demand—like Bronx or Staten Island—may lead to negative growth due to limited customer interest and lower occupancy rates. Hence, investing in the right room types and locations is crucial to optimize occupancy and maximize returns, avoiding stagnation in less-demanded segments.
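
The percentages annotated on the bars can also be produced as a table. This is a minimal sketch, assuming a hypothetical ten-row `listings` frame; `pd.crosstab` with `normalize="all"` divides every cell by the total number of rows, which matches how the bar labels are computed above.

```python
import pandas as pd

# Hypothetical toy rows reusing the notebook's column names
listings = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 5 + ["Private room"] * 4
                 + ["Shared room"],
    "neighbourhood_group": ["Manhattan"] * 3 + ["Brooklyn"] * 2
                           + ["Brooklyn"] * 2 + ["Manhattan"] * 2
                           + ["Queens"],
})

# Share of all listings for each room type / borough pair, in percent
share = (
    pd.crosstab(
        listings["room_type"],
        listings["neighbourhood_group"],
        normalize="all",
    )
    .mul(100)
    .round(1)
)
print(share)
```

Passing `normalize="index"` instead would give the borough split within each room type, which answers a different question from the chart labels.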

#### Correlation Heatmap Visualization

In [None]:
# Calculate pairwise correlations between columns
corr = airbnb_booking.corr(numeric_only=True)

# Display the correlation between columns
corr


In [None]:
# Set the figure size
plt.figure(figsize=(12,6))

# Visualize correlations as a heatmap
sns.heatmap(corr, cmap='BrBG',annot=True)

# Display heatmap
plt.show()




##### 1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1,1].

##### 2. What is/are the insight(s) found from the chart?

*   There is a moderate positive correlation (0.59) between the host_id and id columns. This most likely reflects how identifiers are assigned rather than a meaningful relationship between hosts and listings.

*   There is a very weak positive correlation (0.057) between the price column and the calculated_host_listings_count column, so hosts with more listings show at most a slight tendency to charge higher prices.

*   There is a weak positive correlation (0.23) between the calculated_host_listings_count column and the availability_365 column, which suggests that hosts with more listings tend to have more days of availability in the next 365 days.

*   There is a moderate positive correlation (0.59) between the number_of_reviews column and the reviews_per_month column, which suggests that listings with more total reviews tend to have more reviews per month.
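
Reading pairs off a heatmap by eye is error-prone, so the correlations can also be ranked by absolute value. The sketch below assumes a synthetic `df` generated with NumPy (the values are not from the dataset) and drops the identifier columns before ranking, since their correlation says little about listing behavior.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; column names follow the notebook
rng = np.random.default_rng(0)
n = 200
number_of_reviews = rng.poisson(20, n)
df = pd.DataFrame({
    "id": np.arange(n),
    "host_id": np.arange(n) * 3,
    "price": rng.gamma(2.0, 75.0, n),
    "number_of_reviews": number_of_reviews,
    "reviews_per_month": number_of_reviews / 12 + rng.normal(0, 0.1, n),
})

# Drop identifier columns before computing correlations
features = df.drop(columns=["id", "host_id"])
corr = features.corr(numeric_only=True)

# Keep the upper triangle only so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```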

#### Chart - 15 - Pair Plot

In [None]:
# create a pairplot using the seaborn library to visualize the relationships between different variables in the Airbnb NYC dataset
sns.pairplot(airbnb_booking)

# show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is used to find the set of features that best explains the relationship between two variables or that forms the most clearly separated clusters. It also helps in building simple classification models by suggesting where straight lines could separate the data.

##### 2. What is/are the insight(s) found from the chart?


* The first two columns (id and host_id) show a strong positive correlation with each other. This is likely due to the way IDs are assigned but not necessarily meaningful in a real-world sense unless you're specifically analyzing user or host behavior.

* The latitude and longitude variables form a clustered pattern, likely representing geographic groupings (e.g., a city map). You can observe horizontal and vertical bands, which suggest fixed coordinate boundaries (e.g., neighborhoods or boroughs).

* The price variable has a long tail, indicating a few listings are priced extremely high. Most listings seem to be priced on the lower end (left side of the plot), and the scatter plots show vertical clustering due to these few high-priced outliers.

* Most listings have a low minimum night requirement and low number of reviews, but some outliers exist with very high values, possibly skewing the data.

* There appears to be a positive correlation between reviews_per_month and number_of_reviews, which is expected, as more overall reviews likely mean more activity per month.

* Many listings have availability either near 0 or close to 365, suggesting two major types of listings: occasional (perhaps part-time hosts) and full-time rental properties.

* Some plots show gaps or bands (e.g., in minimum_nights, price), which may indicate data entry limitations, default values, or rules applied during data collection.
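
The long tail in price noted above can compress the rest of a plot into one corner. Two common ways to handle this before plotting are a log transform and capping at a high percentile. This is a sketch on synthetic prices (drawn from a log-normal distribution purely for illustration); the 99th percentile cap is an assumed threshold, not a value from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices for illustration only
rng = np.random.default_rng(42)
price = pd.Series(rng.lognormal(mean=4.8, sigma=0.7, size=2000), name="price")

# Skewness well above 0 indicates a long right tail
raw_skew = price.skew()

# log1p compresses large values and stays defined for a price of 0
log_price = np.log1p(price)
log_skew = log_price.skew()

# Alternative: cap values at the 99th percentile (illustrative threshold)
capped_price = price.clip(upper=price.quantile(0.99))

print(f"Skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
print(f"Max before: {price.max():.0f}, after capping: {capped_price.max():.0f}")
```

Either transformed column can then be passed to `sns.pairplot` in place of the raw price so that the bulk of the listings is easier to see.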

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

*   Manhattan and Brooklyn appear to have the highest demand for Airbnb rentals, as suggested by the large number of listings in these boroughs. This could make them attractive areas for hosts to invest in property.

*   Manhattan is world-famous for its parks, museums, buildings, gardens, and markets, and it draws a substantial number of tourists throughout the year, so it makes sense that both demand and prices are high.

*   Brooklyn comes in second, with a significant number of listings and cheaper prices than Manhattan. With most listings located in Williamsburg and Bedford-Stuyvesant, two neighborhoods strategically close to Manhattan, tourists get the chance to enjoy both boroughs while spending less.

*   Williamsburg, Bedford-Stuyvesant, Harlem, Bushwick, and the Upper West Side are the top neighborhoods in terms of listing counts, indicating strong demand for Airbnb rentals in these areas.

*   The average price of a listing in New York City is higher in the center of the city (Manhattan) than in the outer boroughs, which could indicate that investing in property in Manhattan may be more lucrative for Airbnb rentals. However, Manhattan and Brooklyn also have the largest number of hosts, indicating a high level of competition in these boroughs.

*   The data suggests that Airbnb rentals are primarily used for short-term stays, with relatively few listings requiring a minimum stay of 30 nights or more. Hosts may want to consider investing in property that can accommodate shorter stays in order to maximize their occupancy rate.

*   The majority of listings on Airbnb are for entire homes or apartments and private rooms, with relatively few listings for shared rooms. This suggests that travelers using Airbnb have a wide range of accommodation options to choose from, and hosts may want to consider investing in property that can accommodate multiple guests.

*   The data indicates that the availability of Airbnb rentals varies significantly across neighborhoods, with some neighborhoods having a high concentration of listings and others having relatively few.

*   The data indicates that there is a high level of competition among Airbnb hosts, with a small number of hosts dominating a large portion of the market. Hosts may want to consider investing in property in areas with relatively fewer listings in order to differentiate themselves from the competition.

*   The neighborhoods near the airport in Queens would be expected to have a higher average number of reviews, as they are likely to attract many tourists or visitors who are passing through the area. Proximity to the airport could make these neighborhoods a convenient and appealing place for short-term stays at lower cost, because prices are higher in Manhattan and Brooklyn.
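
The point about a small number of hosts holding a large part of the market can be quantified as the share of listings owned by the top hosts. The following is a minimal sketch on a hypothetical `host_id` column with ten made-up listings; `top_n` is an illustrative parameter, and on the real data the series would be `airbnb_booking['host_id']`.

```python
import pandas as pd

# Hypothetical host_id column: host 1 owns five of the ten listings
host_id = pd.Series([1, 1, 1, 1, 1, 2, 2, 3, 4, 5], name="host_id")

# Number of listings per host, largest first
listings_per_host = host_id.value_counts()

# Share of all listings held by the top_n hosts, in percent
top_n = 1
top_share = listings_per_host.head(top_n).sum() / len(host_id) * 100

print(listings_per_host)
print(f"Top {top_n} host(s) hold {top_share:.1f}% of listings")
```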

# **Conclusion**

This EDA project on Airbnb bookings successfully highlights the importance of data-driven decision-making in the hospitality industry. The insights gained not only help Airbnb hosts better understand and optimize their listings but also allow the platform to improve user experience and strategic planning. Future work can incorporate predictive modeling to forecast pricing or booking likelihood, offering deeper value through machine learning applications.
