# **Project Name**  -   Airbnb Dataset – Exploratory Data Analysis (EDA) 📊



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

## 📌 Project Summary: Airbnb NYC 2019 Exploratory Data Analysis (EDA)

### 🎯 Objective:
The primary goal of this project is to analyze Airbnb listing data from New York City (2019) to derive meaningful business insights that can improve listing strategies, customer satisfaction, and platform performance. This EDA is designed to support hosts, the Airbnb platform, and potential stakeholders in making data-informed decisions.

---

### 📂 Dataset Overview:
- **Source**: Airbnb NYC 2019 Open Data  
- **Size**: 48,895 listings  
- **Key Variables**: `price`, `room_type`, `neighbourhood_group`, `availability_365`, `minimum_nights`, `number_of_reviews`, `reviews_per_month`

---

### 🧪 Methodology:
We followed the **UBM structure**:
- **U**: Univariate Analysis – Exploring single-variable distributions  
- **B**: Bivariate Analysis – Understanding relationships between two variables  
- **M**: Multivariate Analysis – Investigating complex patterns across three or more variables  

Data was cleaned, filtered (prices above $1,000 removed, with most price charts zoomed to the $0–$500 range), and visualized using Seaborn and Matplotlib. Each of the 15 charts includes:
- The reason for chart selection  
- Observed insights  
- Business impact (positive & negative)

---

### 📊 Key Insights:

#### 🔷 Univariate:
- **Entire homes** dominate listings, followed by **private rooms**.
- **Manhattan** and **Brooklyn** have the highest concentration of listings.
- Most listings are priced under **$200**; availability clusters at the two extremes (0 days or the full 365 days).
- Minimum stay is typically **1–2 nights**, but outliers exist.

#### 🔶 Bivariate:
- **Private rooms** receive the most reviews per month – indicating strong guest engagement.
- **Price** and **minimum nights** vary significantly across boroughs.
- Availability is highest in **entire homes** but more balanced in **private rooms**.
- Listings in outer boroughs are often lower-priced and more budget-friendly.

#### 🔷 Multivariate:
- Listings with high **availability**, **moderate pricing**, and good **review activity** perform best.
- **Room type** and **neighbourhood group** together strongly influence price.
- Correlation heatmap shows that **review metrics** are interrelated, but **price** is weakly correlated with most numeric variables — indicating the need for contextual pricing models.
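
The correlation claim above can be checked numerically before drawing any heatmap. The snippet below is a minimal sketch on a small toy table (the values are illustrative, not taken from the dataset); with the real data, the cleaned `df` would take the place of `toy`.

```python
import pandas as pd

# Toy stand-in for the cleaned listings table (illustrative values only)
toy = pd.DataFrame({
    "price": [50, 80, 120, 200, 95, 150],
    "minimum_nights": [1, 2, 3, 30, 2, 1],
    "number_of_reviews": [10, 40, 5, 0, 60, 20],
    "reviews_per_month": [0.5, 2.0, 0.3, 0.0, 3.1, 1.0],
    "availability_365": [365, 120, 0, 300, 200, 90],
})

# Pearson correlation between every pair of numeric columns
corr = toy.select_dtypes("number").corr()

# Correlation of price with each other variable, largest magnitude first
price_corr = corr["price"].drop("price").sort_values(key=abs, ascending=False)
print(price_corr.round(2))
```

Sorting by absolute value keeps strong negative relationships from being buried at the bottom of the list.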

---

### 💼 Business Recommendations:
1. **Optimize pricing** in the $50–$150 range for higher booking volume.
2. Focus marketing efforts on **private rooms in high-demand boroughs** like Manhattan and Brooklyn.
3. Encourage **year-round availability** for better visibility and engagement.
4. Use **review activity** as a quality signal to rank listings higher.
5. Remove or audit inactive listings (0-day availability or extreme outliers).
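
Recommendation 5 could be prototyped as a simple flagging rule. This is only a sketch: the `needs_audit` column and the `PRICE_CAP` constant are hypothetical names, and the toy rows are illustrative rather than real listings.

```python
import pandas as pd

# Toy listings (illustrative values only)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [90, 4000, 120, 60],
    "availability_365": [0, 200, 365, 0],
})

PRICE_CAP = 1000  # assumed threshold, matching the cap used during data wrangling

# Flag listings that are never available or priced above the cap
toy["needs_audit"] = (toy["availability_365"] == 0) | (toy["price"] > PRICE_CAP)
flagged_ids = toy.loc[toy["needs_audit"], "id"].tolist()
print(flagged_ids)
```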

---

### ✅ Final Outcome:
The analysis provides actionable insights for Airbnb to improve user experience, booking rates, and revenue. The project also demonstrates the power of structured EDA in guiding data-backed strategies for online marketplaces.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**With thousands of Airbnb listings across New York City, understanding patterns in pricing, availability, and location is essential for both hosts and guests. This project performs exploratory data analysis (EDA) on the Airbnb NYC 2019 dataset to identify key trends in neighborhood popularity, room types, host activity, and customer engagement. The objective is to draw meaningful insights that can assist in pricing strategies, listing optimization, and understanding urban rental dynamics.**



#### **Define Your Business Objective?**

The primary business objective of this project is to analyze Airbnb listings in New York City to gain actionable insights into the short-term rental market. By examining factors such as price, availability, location, room type, and host behavior, the aim is to:

- Help hosts optimize their listings to improve occupancy and revenue.
- Assist travelers in identifying value-for-money accommodations.
- Support data-driven decisions related to pricing strategies and neighborhood targeting.
- Provide urban stakeholders with a better understanding of rental patterns and density.

This analysis can further serve as a foundation for developing dynamic pricing models, neighborhood recommendations, and regulatory strategies.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


5. You have to create at least 20 logical & meaningful charts having important insights.

[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]







# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Mount Google Drive first so the dataset path below is reachable (Colab only)
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Copy of Airbnb NYC 2019.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])


### Dataset Information

In [None]:
# Dataset Info
df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)

# View duplicate rows
df[df.duplicated()]


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values: count and percentage per column
missing = df.isnull().sum()
missing_percent = (missing / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Values': missing,
    'Percentage (%)': missing_percent
})

# Show only columns with missing values
missing_df = missing_df[missing_df['Missing Values'] > 0]
missing_df.sort_values(by='Percentage (%)', ascending=False)

In [None]:
# Visualizing the missing values (each highlighted cell marks a null entry)
plt.figure(figsize=(12,6))
sns.heatmap(df.isnull(), cbar=False, cmap="YlGnBu")
plt.title("Missing Values Heatmap")
plt.show()





### What did you know about your dataset?

- The dataset contains **48,895 rows** and **16 columns**, representing Airbnb listings in **New York City (2019)**.
- It includes information about:
  - 📍 **Location**: `neighbourhood_group`, `neighbourhood`, `latitude`, `longitude`
  - 🏠 **Listing details**: `room_type`, `price`, `minimum_nights`
  - 👤 **Host details**: `host_id`, `host_name`, `calculated_host_listings_count`
  - 💬 **Customer activity**: `number_of_reviews`, `last_review`, `reviews_per_month`
  - 📅 **Availability**: `availability_365`

### 🔍 Initial Observations:
- The five **neighbourhood groups** are:
  `'Manhattan'`, `'Brooklyn'`, `'Queens'`, `'Bronx'`, and `'Staten Island'`
- There are **missing values** in:
  - `name`, `host_name`, `last_review`, and `reviews_per_month`
- Some listings have:
  - Extremely **high prices** (possible outliers)
  - Very **high minimum nights** (not practical for short-term stays)
- Most listings are **either entire homes or private rooms**.
- Listings are concentrated in **Manhattan and Brooklyn**.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns.tolist())

In [None]:
# Dataset Describe
df.describe()

### Variables Description

## ✅ Understanding Your Variables

Below is a description of all the columns in the Airbnb NYC 2019 dataset:

| Column                          | Description                                                                 |
|---------------------------------|-----------------------------------------------------------------------------|
| `id`                            | Unique identifier for each listing                                          |
| `name`                          | Name/title of the listing                                                  |
| `host_id`                       | Unique identifier for the host                                             |
| `host_name`                     | Name of the host                                                           |
| `neighbourhood_group`          | Borough/region in NYC (e.g., Manhattan, Brooklyn, Queens)                  |
| `neighbourhood`                | Specific neighborhood where the listing is located                         |
| `latitude`                     | Latitude coordinate of the listing                                         |
| `longitude`                    | Longitude coordinate of the listing                                        |
| `room_type`                    | Type of accommodation (Entire home/apt, Private room, Shared room)         |
| `price`                        | Price per night in USD                                                     |
| `minimum_nights`               | Minimum number of nights required for booking                              |
| `number_of_reviews`           | Total number of reviews received                                           |
| `last_review`                 | Date of the most recent review                                             |
| `reviews_per_month`           | Average number of reviews received per month                               |
| `calculated_host_listings_count` | Number of listings managed by the host                                  |
| `availability_365`            | Number of days the listing is available in a year                          |



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    unique_count = df[column].nunique()
    print(f"{column}: {unique_count} unique values")

## 3. ***Data Wrangling***

## Data Wrangling Code

In [None]:
# Data Wrangling

import pandas as pd

# Reload the raw data so this cell always starts from the unmodified file
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Copy of Airbnb NYC 2019.csv")

# 1. Handle Missing Values
# Check nulls
print("Missing values before cleaning:")
print(df.isnull().sum())

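# Fill missing reviews_per_month with 0 (assume no reviews)
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

# Drop rows with missing 'name', 'host_name', or 'last_review'
# Note: 'last_review' is empty for listings that have never been reviewed,
# so this step also removes those listings
df.dropna(subset=['name', 'host_name', 'last_review'], inplace=True)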


# 2. Convert Data Types
# Convert 'last_review' to datetime
df['last_review'] = pd.to_datetime(df['last_review'])

# 3. Remove Invalid or Outlier Data

# Remove listings with non-positive prices
df = df[df['price'] > 0]

# Remove extreme outliers in price (e.g., > $1000)
df = df[df['price'] <= 1000]

# Remove listings with unrealistic minimum_nights (e.g., > 365)
df = df[df['minimum_nights'] <= 365]

# Reset index after cleaning
df.reset_index(drop=True, inplace=True)

# Final Check
print("\nMissing values after cleaning:")
print(df.isnull().sum())

print("\nCleaned dataset shape:", df.shape)




### What all manipulations have you done and insights you found?

### ✅ Data Wrangling Summary

Below are the data cleaning and preprocessing steps (manipulations) applied to make the dataset analysis-ready:

1. **Handled Missing Values**
   - Filled missing values in `reviews_per_month` with 0 (assumed no reviews).
   - Dropped rows where critical fields like `name`, `host_name`, and `last_review` were missing.

2. **Converted Data Types**
   - Converted `last_review` column from object (string) to proper datetime format for time-based analysis.

3. **Filtered Outliers**
   - Removed listings with non-positive or extremely high prices (above $1000), which skew analysis.
   - Removed listings with `minimum_nights > 365` as they're unrealistic for short-term rentals.

4. **Reset Index**
   - Reset the DataFrame index after row deletions to maintain clean structure.

---

### 💡 Initial Insights Found:

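- Rows with a missing `name` or `host_name` are rare, but `last_review` is missing for every listing that has never been reviewed, so dropping on that column removes a noticeably larger share of rows. The cleaned shape printed above gives the exact count.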
- Extreme outliers in `price` and `minimum_nights` were removed to avoid misleading visualizations and statistics.
- Most hosts list only 1 property, but some commercial hosts list 50+ properties, which will be explored further.
- The dataset is now clean, with correct data types and no missing or corrupt entries — ready for UBM (Univariate, Bivariate, Multivariate) analysis.
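
The wrangling steps summarized above could also be wrapped in a reusable function with a basic column check, in the spirit of the exception-handling guideline. This is a minimal sketch, not the notebook's actual implementation: `clean_listings` and its parameters are hypothetical names, and the toy rows are illustrative.

```python
import pandas as pd

def clean_listings(raw: pd.DataFrame, max_price: int = 1000,
                   max_min_nights: int = 365) -> pd.DataFrame:
    """Return a cleaned copy of the listings table (hypothetical helper)."""
    required = {"name", "host_name", "last_review",
                "reviews_per_month", "price", "minimum_nights"}
    missing_cols = required - set(raw.columns)
    if missing_cols:
        raise KeyError(f"Missing expected columns: {sorted(missing_cols)}")

    # Drop rows lacking key text/date fields, working on a copy of the input
    out = raw.dropna(subset=["name", "host_name", "last_review"]).copy()
    # Treat a missing review rate as zero reviews
    out["reviews_per_month"] = out["reviews_per_month"].fillna(0)
    # Parse review dates for time-based analysis
    out["last_review"] = pd.to_datetime(out["last_review"])
    # Keep positive prices up to the cap and realistic minimum stays
    keep = (
        (out["price"] > 0)
        & (out["price"] <= max_price)
        & (out["minimum_nights"] <= max_min_nights)
    )
    return out[keep].reset_index(drop=True)

# Toy input (illustrative values only)
toy = pd.DataFrame({
    "name": ["A", None, "C", "D", "E"],
    "host_name": ["h1", "h2", "h3", "h4", "h5"],
    "last_review": ["2019-05-01", "2019-06-01", None, "2019-01-15", "2019-03-03"],
    "reviews_per_month": [1.2, 0.4, None, 2.0, 0.1],
    "price": [100, 80, 60, 0, 5000],
    "minimum_nights": [2, 1, 3, 1, 1],
})
cleaned = clean_listings(toy)
print(cleaned.shape)
```

Because the function works on a copy, the raw frame stays untouched and the cleaning can be rerun with different thresholds.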



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart 1: Price Distribution

import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
plt.figure(figsize=(10, 5))
sns.histplot(df['price'], bins=100, kde=True, color='skyblue')
plt.xlim(0, 500)  # Focused on usable price range
plt.title("Distribution of Airbnb Prices in NYC (Capped at $500)")
plt.xlabel("Price (USD per night)")
plt.ylabel("Number of Listings")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A histogram is ideal for understanding the distribution of numerical variables. Since price is central to both user decisions and host strategies, it’s critical to know how prices are spread across listings.



##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insights from the chart:
- Most Airbnb listings in NYC are priced between $50 and $150 per night.
- The distribution is **right-skewed**, indicating that while most listings are affordable, a few are priced exceptionally high.
- Prices between $500 and $1,000 remain in the cleaned data but are rare and fall outside the plotted range.
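
Right skew and the share of listings in a price band can be quantified alongside the histogram. The sketch below uses a toy price series (illustrative values, not dataset figures); with the real data, `df['price']` would replace `prices`.

```python
import pandas as pd

# Toy nightly prices (illustrative values only)
prices = pd.Series([40, 55, 60, 75, 80, 95, 110, 150, 300, 900], name="price")

skewness = prices.skew()                        # > 0 indicates a right-skewed distribution
median, mean = prices.median(), prices.mean()   # mean above median is another sign of right skew
share_50_150 = prices.between(50, 150).mean()   # fraction of listings priced $50-$150 (inclusive)

print(f"skew={skewness:.2f}, median={median}, mean={mean}, share_50_150={share_50_150:.0%}")
```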


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### 💼 Business Impact:
**Positive Impact**:
- Hosts can use this data to **align their pricing strategy** with market norms to stay competitive.
- Travelers can use this range to **filter and budget** their stay better.

**Negative Impact**:
 - Outlier listings (too expensive) might lead to **lower occupancy** unless clearly justified by location or amenities.

#### Chart - 2

In [None]:
# Chart - 2 visualization
# Chart 2: Room Type Distribution

import matplotlib.pyplot as plt
import seaborn as sns

# Set figure size
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='room_type', palette='pastel')
plt.title("Distribution of Room Types in NYC Airbnb Listings")
plt.xlabel("Room Type")
plt.ylabel("Number of Listings")
plt.grid(True, axis='y')
plt.show()


##### 1. Why did you pick the specific chart?


A **countplot** is perfect for understanding how frequently each **category** appears — in this case, types of rooms offered on Airbnb. It's a vital metric for supply-side analysis in the marketplace.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insights from the chart:
- **Entire home/apt** listings are the most common type on Airbnb NYC.
- **Private rooms** also make up a significant portion of the market.
- **Shared rooms** are rare, suggesting they’re less preferred or under-supplied.
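
The countplot shows raw counts; percentage shares make "dominant" and "rare" concrete. A minimal sketch on toy labels follows (the proportions are illustrative only); `df['room_type']` would be used in the notebook.

```python
import pandas as pd

# Toy room types (illustrative mix only)
room_type = pd.Series(
    ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"] * 1,
    name="room_type",
)

# Share of listings per room type, largest first
shares = room_type.value_counts(normalize=True)
print((shares * 100).round(1))
```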


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### 💼 Business Impact:
✅ **Positive**:
- Hosts can see that **entire apartments dominate the platform**, meaning it’s a preferred choice for travelers — they may earn more by offering full-unit rentals.
- Travelers who want budget stays might target **private rooms**, which are the second most available.

⚠️ **Negative Impact**:
- Shared rooms may not be attractive in NYC, possibly due to lack of privacy or local expectations — hosts offering them may see **lower demand**.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Chart 3: Availability of Listings Throughout the Year

plt.figure(figsize=(10, 5))
sns.histplot(df['availability_365'], bins=40, kde=False, color='lightgreen')
plt.title("Distribution of Listings by Availability (0–365 Days)")
plt.xlabel("Number of Available Days")
plt.ylabel("Number of Listings")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A **histogram** is ideal for showing how listings are distributed based on how many days they’re available throughout the year. This gives insight into **supply patterns** — whether listings are short-term, seasonal, or available year-round.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insights from the chart:
- A large number of listings are available **either 0 days or the full 365 days**.
- This shows a **bimodal pattern**: many hosts either block their listings completely (possibly inactive or seasonal) or keep them available all year.
- Few listings offer **partial-year availability**, which may indicate seasonal or personal-use listings.
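
One way to check the bimodal reading is to bucket availability into "never", "partial", and "full year" and compare the shares. The sketch below assumes these three bucket labels (they are not from the notebook) and uses toy values.

```python
import pandas as pd

# Toy availability values in days per year (illustrative only)
availability = pd.Series([0, 0, 0, 365, 365, 120, 45, 200], name="availability_365")

# Right-inclusive bins: (-1, 0] -> 0 days, (0, 364] -> partial, (364, 365] -> full year
buckets = pd.cut(
    availability,
    bins=[-1, 0, 364, 365],
    labels=["0 days", "1-364 days", "365 days"],
)

# Share of listings in each bucket, kept in bucket order
shares = buckets.value_counts(normalize=True, sort=False)
print(shares)
```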


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### 💼 Business Impact:
✅ **Positive**:
- Year-round listings are ideal for Airbnb as they ensure **continuous revenue** and better **platform reliability**.
- Hosts keeping their listings live all year can gain better visibility and reviews.

⚠️ **Negative Impact**:
- Listings with **0-day availability** could be stale or **inactive**, which affects search quality and user experience.
- Airbnb may consider deactivating or reviewing listings that are perennially unavailable.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Chart 4: Minimum Nights Requirement Distribution

plt.figure(figsize=(10, 5))
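# Plot only stays of up to 30 nights with one bar per night; 50 bins over the full
# 1-365 range would be about a week wide each and hide the 1-3 night detail
sns.histplot(df.loc[df['minimum_nights'] <= 30, 'minimum_nights'], discrete=True, color='salmon')
plt.xlim(0, 31)  # Focus on realistic booking windows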
plt.title("Distribution of Minimum Nights Required (Capped at 30 Days)")
plt.xlabel("Minimum Nights")
plt.ylabel("Number of Listings")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A **histogram** helps visualize how many nights hosts typically require guests to stay. It’s important to know if listings are suitable for **short-term stays**, which is the primary use case on Airbnb.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insight(s) from the chart:
- The majority of listings require a minimum of **1 to 3 nights**, making them ideal for short-term travelers.
- A small portion of listings require longer stays — like **7, 14, or even 30 days** — possibly due to host preferences or local regulations.
- Extreme values were already filtered (e.g., >365).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### 💼 Business Impact:
✅ **Positive**:
- The dominance of 1–3 night minimums suggests that NYC’s Airbnb market is well-aligned with **tourist needs**.
- Hosts offering short stays can achieve **higher occupancy** through flexibility.

⚠️ **Negative Impact**:
- Listings with long minimum stays may experience **low booking frequency**, especially from tourists or weekend travelers.
- Airbnb might recommend flexible stay durations for new hosts to improve their visibility and booking rate.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Chart 5: Neighbourhood Group Distribution

plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='neighbourhood_group', palette='Set2', order=df['neighbourhood_group'].value_counts().index)
plt.title("Distribution of Listings Across NYC Boroughs")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Number of Listings")
plt.grid(True, axis='y')
plt.show()


##### 1. Why did you pick the specific chart?

A **countplot** is ideal for showing the frequency of each **neighbourhood_group** (borough). This helps identify which areas of NYC have the most Airbnb supply — essential for geographic strategy.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insight(s) from the chart:
- **Manhattan** has the highest number of Airbnb listings, followed by **Brooklyn**.
- **Staten Island** and **Bronx** have the fewest listings.
- This shows a **centralized supply** in the most tourist-heavy zones.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### 💼 Business Impact:
✅ **Positive**:
- High listing density in Manhattan and Brooklyn confirms strong **tourist demand** and market potential.
- Hosts in these boroughs may experience **higher visibility and competition**, requiring good pricing and review strategies.

⚠️ **Negative Impact**:
- Low listing counts in areas like Staten Island may indicate **low demand or strict regulation**, limiting growth opportunities there.


#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Chart 6: Price vs Room Type

plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='room_type', y='price')
plt.ylim(0, 500)  # Focus on common price range
plt.title("Price Comparison by Room Type (Capped at $500)")
plt.xlabel("Room Type")
plt.ylabel("Price (USD)")
plt.grid(True)
plt.show()



##### 1. Why did you pick the specific chart?

A **boxplot** shows the **distribution, median, and outliers** for each room type’s price. It’s perfect for comparing pricing strategies across categories.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insight(s) from the chart:
- **Entire home/apt** listings have the **highest median price**, followed by **private rooms**.
- **Shared rooms** are priced the lowest, as expected.
- All room types have some **outliers**, but entire homes show the **widest price range**.
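
The boxplot's medians and box widths can be read off as numbers with a grouped summary. This is a sketch on toy prices (illustrative values only); in the notebook the cleaned `df` would replace `toy`.

```python
import pandas as pd

# Toy prices per room type (illustrative values only)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4 + ["Shared room"] * 3,
    "price": [150, 200, 250, 400, 60, 80, 90, 120, 30, 40, 50],
})

# Quartiles per room type: the same numbers a boxplot draws as the box and its median line
stats = toy.groupby("room_type")["price"].describe()[["25%", "50%", "75%"]]
stats["iqr"] = stats["75%"] - stats["25%"]  # box height, a robust measure of spread
stats = stats.sort_values("50%", ascending=False)
print(stats)
```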


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### 💼 Business Impact:
✅ **Positive**:
- Hosts can decide pricing by comparing room types in their area.
- Guests can set expectations — e.g., paying more for full privacy.

⚠️ **Negative Impact**:
- Outlier prices in shared/private rooms could lead to **overpricing and low bookings** if not justified.
- Entire home hosts may need to justify higher prices with better amenities, reviews, or locations.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Chart 7: Price vs Neighbourhood Group

plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x='neighbourhood_group', y='price')
plt.ylim(0, 500)  # Focus on realistic price range
plt.title("Price Distribution by Neighbourhood Group (Capped at $500)")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Price (USD)")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A **boxplot** is ideal to compare how price varies across boroughs. Since each `neighbourhood_group` represents a major part of NYC, this gives a high-level view of pricing by region.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insight(s) from the chart:
- **Manhattan** listings have the **highest median prices**, followed by **Brooklyn**.
- **Queens**, **Bronx**, and **Staten Island** are priced significantly lower.
- All groups contain **price outliers**, but Manhattan shows the **widest spread**.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### 💼 Business Impact:
✅ **Positive**:
- This helps hosts and investors determine which areas are **premium zones** and which are **budget-friendly**.
- Travelers can use this insight to **balance location vs cost**.

⚠️ **Negative Impact**:
- Overpricing in low-demand boroughs like Staten Island could lead to **low occupancy**.
- High variance in Manhattan may confuse guests if not paired with clear listing value (e.g., amenities, photos, reviews).

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Chart 8: Availability by Room Type

plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='room_type', y='availability_365')
plt.title("Availability (Days/Year) by Room Type")
plt.xlabel("Room Type")
plt.ylabel("Availability (0–365 Days)")
plt.grid(True)
plt.show()



##### 1. Why did you pick the specific chart?

A **boxplot** is perfect here to compare the **distribution of availability** across room types. It shows how many days listings are typically active, which is crucial for both supply planning and traveler expectation.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insight(s) from the chart:
- **Entire home/apt** and **private rooms** have a wide availability spread — some are available full-time (365 days), while others are limited or seasonal.
- **Shared rooms** tend to have lower and tighter availability ranges, possibly due to personal space constraints.
- Median availability is fairly similar across room types but varies more in range.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### 💼 Business Impact:
✅ **Positive**:
- Hosts offering 365-day availability may see **higher visibility** and **more consistent bookings**.
- Airbnb can prioritize these listings during high-demand seasons.

⚠️ **Negative Impact**:
- Lower availability in shared rooms and some private rooms may reflect **inconsistent hosting** or personal usage, affecting reliability.
- Seasonal or inactive listings may degrade **user experience** when searching.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Chart 9: Number of Reviews by Room Type

plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='room_type', y='number_of_reviews')
plt.ylim(0, 200)  # Focus on meaningful range for better visibility
plt.title("Guest Engagement: Number of Reviews by Room Type")
plt.xlabel("Room Type")
plt.ylabel("Number of Reviews")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

This **boxplot** helps compare how many reviews listings receive based on room type. Reviews serve as a strong proxy for **customer engagement** and trust.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insight(s) from the chart:
- **Private rooms** tend to have the **highest median number of reviews**, suggesting higher booking volume or more accessible price points.
- **Entire homes** also have steady reviews but show more variability.
- **Shared rooms** have fewer reviews on average, with fewer outliers.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### 💼 Business Impact:
✅ **Positive**:
- Private room listings appear to be **highly bookable and trusted**, possibly due to affordability.
- Hosts can use this insight to decide which room type yields better **guest interaction and visibility**.

⚠️ **Negative Impact**:
- Fewer reviews in shared rooms may indicate **low demand** or lack of trust.
- Entire homes with low reviews might be overpriced or suffer from poor visibility.



#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Chart 10: Minimum Nights vs Neighbourhood Group

plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x='neighbourhood_group', y='minimum_nights')
plt.ylim(0, 30)  # Focused on realistic short-term rental values
plt.title("Minimum Nights Requirement by Neighbourhood Group")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Minimum Nights")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A **boxplot** is ideal to show how different boroughs enforce or encourage different **minimum stay durations** — crucial for understanding rental behavior, policies, or regulations in specific areas.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insight(s) from the chart:
- Most boroughs have a **median minimum night stay of 1–2 nights** — suitable for short-term bookings.
- **Staten Island** and **Bronx** show occasional high values, possibly due to personal-use or longer rentals.
- **Outliers** are present in all groups, but more prominent in Manhattan — likely due to policy workarounds or luxury listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


#### 💼 Business Impact:
✅ **Positive**:
- Flexible short stays across boroughs suggest high **tourist-friendliness**.
- Short minimums improve **booking frequency** and **platform stickiness**.

⚠️ **Negative Impact**:
- Listings with long minimum stays may be **ignored by travelers** looking for short-term stays.
- Airbnb might monitor extreme outliers in areas like Manhattan to ensure policy compliance.


#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Chart 11: Price by Room Type and Neighbourhood Group

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='neighbourhood_group', y='price', hue='room_type')
plt.ylim(0, 500)  # Focused on realistic price range
plt.title("Price Distribution by Room Type Across NYC Boroughs")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Price (USD)")
plt.legend(title='Room Type')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A **grouped boxplot** is perfect to compare **price variations across both boroughs and room types**. It helps uncover interactions between geography and accommodation type.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insight(s) from the chart:
- In **all boroughs**, **entire homes/apts** are priced the highest, followed by **private rooms**, and then **shared rooms**.
- **Manhattan** has the highest median prices for all room types, especially entire apartments.
- **Bronx** and **Staten Island** offer significantly lower pricing across all types.
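
A pivot table gives the same borough-by-room-type comparison as the grouped boxplot in tabular form. The sketch below uses toy rows (illustrative values, not dataset results).

```python
import pandas as pd

# Toy listings (illustrative values only)
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4 + ["Bronx"] * 2,
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Private room"] * 2
                 + ["Entire home/apt", "Private room"],
    "price": [200, 240, 100, 120, 150, 170, 70, 80, 90, 50],
})

# Median price for every borough / room type combination
median_price = toy.pivot_table(
    index="neighbourhood_group",
    columns="room_type",
    values="price",
    aggfunc="median",
)
print(median_price)
```

The median is used here rather than the mean because it is less sensitive to the high-price outliers noted earlier.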



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### 💼 Business Impact:
✅ **Positive**:
- Hosts can optimize pricing by seeing **what others charge for similar listings in similar boroughs**.
- Guests can better **compare value for money** based on location and room type.

⚠️ **Negative Impact**:
- Listings priced too high in outer boroughs (e.g., shared room in Queens above $200) may face **low demand**.
- Airbnb can use this to guide **pricing suggestions** and surface more relevant listings to users.


#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Chart 12: Average Reviews per Month

plt.figure(figsize=(12, 6))
sns.barplot(
    data=df,
    x='neighbourhood_group',
    y='reviews_per_month',
    hue='room_type',
    estimator='mean',
    errorbar=None  # Replaces deprecated ci=None
)
plt.title("Average Reviews per Month by Room Type & Neighbourhood Group")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Avg. Reviews per Month")
plt.legend(title="Room Type")
plt.grid(True, axis='y')
plt.show()



##### 1. Why did you pick the specific chart?

A **grouped bar plot** is perfect for comparing average review activity across boroughs and room types. Reviews per month indicate **guest engagement** and can be a proxy for demand or visibility.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insight(s) from the chart:
- **Private rooms in Manhattan and Brooklyn** receive the **highest average reviews per month**, suggesting strong demand and affordability.
- **Entire homes** get fewer reviews on average — possibly due to higher cost or longer stays.
- **Shared rooms** have the lowest review volume across all boroughs.
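
Group averages can be misleading when a borough and room type combination contains only a handful of listings. The sketch below pairs each mean with the number of values behind it. The helper name `review_summary`, the `min_values` threshold, and the toy rows are assumptions for illustration, and the count only excludes missing `reviews_per_month` values if they were not filled during cleaning:

```python
import pandas as pd

def review_summary(df: pd.DataFrame, min_values: int = 2) -> pd.DataFrame:
    """Mean reviews_per_month per group, with the number of non-missing values."""
    summary = (
        df.groupby(['neighbourhood_group', 'room_type'])['reviews_per_month']
          .agg(mean_reviews='mean', n_values='count')
          .reset_index()
    )
    # Flag groups whose average rests on very few values (threshold is hypothetical).
    summary['small_group'] = summary['n_values'] < min_values
    return summary

# Toy rows standing in for the notebook's `df` (values are made up).
sample = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn', 'Brooklyn', 'Brooklyn', 'Staten Island'],
    'room_type': ['Private room', 'Private room', 'Entire home/apt', 'Shared room'],
    'reviews_per_month': [2.0, 1.0, 0.5, float('nan')],
})

summary = review_summary(sample)
print(summary)
```

A larger threshold would be more sensible on the full dataset; the value used here only suits the toy frame.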

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### 💼 Business Impact:
✅ **Positive**:
- Hosts can see that offering **private rooms in tourist-heavy boroughs** may bring consistent engagement.
- Airbnb can surface these listings higher in search or recommend them for budget travelers.

⚠️ **Negative Impact**:
- Hosts with entire homes in outer boroughs that receive few reviews may need to **rethink pricing or marketing**.
- Shared rooms may be **less viable long-term**, given the low interaction.


#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Chart 13: Availability vs Price by Room Type (Bubble Style)

plt.figure(figsize=(12, 6))
sns.scatterplot(
    data=df[df['price'] <= 500],  # Filter to focus on meaningful price range
    x='availability_365',
    y='price',
    hue='room_type',
    size='number_of_reviews',
    sizes=(20, 200),
    alpha=0.6
)
plt.title("Availability vs Price by Room Type (Point Size = Number of Reviews)")
plt.xlabel("Availability (days/year)")
plt.ylabel("Price (USD)")
# No single legend title: the legend lists both room type (color) and review count (size)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A **scatter plot with color and size encoding** reveals how listings are priced relative to availability, with room type (color) adding segmentation and number of reviews (point size) adding engagement. It combines four variables in one view.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insight(s) from the chart:
- Most listings cluster between **100–365 availability days** and **$50–$200** in price.
- **Entire homes** appear at higher price points, especially when available year-round.
- Listings with more reviews (larger dots) tend to be **moderately priced and available year-round** — a sweet spot for engagement.
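
Dot sizes are hard to compare by eye in a dense scatter plot, so the "sweet spot" reading can be cross-checked with a banded summary. A rough sketch, assuming the same column names as `df`; the band edges, the helper name `median_reviews_by_band`, and the toy rows are illustrative choices rather than values from the analysis:

```python
import pandas as pd

def median_reviews_by_band(df: pd.DataFrame) -> pd.Series:
    """Median number_of_reviews per (price band, availability band)."""
    capped = df[df['price'] <= 500].copy()  # same price cap as the chart
    capped['price_band'] = pd.cut(
        capped['price'], bins=[0, 50, 200, 500],
        labels=['0-50', '51-200', '201-500'], include_lowest=True,
    )
    capped['availability_band'] = pd.cut(
        capped['availability_365'], bins=[-1, 0, 100, 365],
        labels=['0', '1-100', '101-365'],
    )
    return (
        capped.groupby(['price_band', 'availability_band'], observed=True)
              ['number_of_reviews'].median()
    )

# Toy rows standing in for the notebook's `df` (values are made up).
sample = pd.DataFrame({
    'availability_365': [0, 50, 300, 365, 200],
    'price': [40, 120, 150, 180, 900],
    'number_of_reviews': [0, 10, 80, 40, 5],
})

result = median_reviews_by_band(sample)
print(result)
```

The median is used instead of the mean because review counts are typically skewed by a few heavily reviewed listings.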



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


#### 💼 Business Impact:
✅ **Positive**:
- This chart helps Airbnb and hosts identify the **ideal balance** between price and availability.
- High-review listings indicate successful hosting strategies that **could be replicated**.

⚠️ **Negative Impact**:
- Listings priced high with low availability may underperform.
- Some low-priced but rarely available listings may be **inactive or used seasonally**, affecting booking consistency.



#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Chart 14: Correlation Heatmap

plt.figure(figsize=(10, 6))
numerical_cols = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']
corr_matrix = df[numerical_cols].corr()

sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Heatmap of Numerical Variables")
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A **correlation heatmap** quickly shows **linear relationships** between numerical variables. It helps identify which variables move together, which is useful for business logic and modeling, although a correlation on its own does not show that one variable drives another.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insight(s) from the chart:
- **`reviews_per_month` and `number_of_reviews`** have a **strong positive correlation**, which makes sense — more total reviews usually means more monthly activity.
- **`price` is weakly correlated** with other variables, showing that price alone doesn’t determine review count or availability.
- **`minimum_nights`** shows **no strong correlation** with any other variable.
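
The coefficient that `DataFrame.corr()` computes by default is Pearson's, which only captures linear association and is sensitive to extreme values such as very high prices. A rank-based (Spearman) check is one way to see whether a weak coefficient hides a monotonic relationship. The sketch below uses made-up values, not results from the dataset:

```python
import pandas as pd

# Made-up values: reviews fall steadily as price rises, with one extreme price.
toy = pd.DataFrame({
    'price': [50, 100, 150, 200, 1000],
    'number_of_reviews': [40, 30, 20, 10, 5],
})

pearson = toy.corr(method='pearson').loc['price', 'number_of_reviews']
spearman = toy.corr(method='spearman').loc['price', 'number_of_reviews']

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

In this toy frame the ordering is perfectly monotonic, so the Spearman coefficient is -1 while the Pearson coefficient is weaker because of the single extreme price. Running `df[numerical_cols].corr(method='spearman')` would show whether anything similar happens in the real data.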

#### 💼 Business Impact:
✅ **Positive**:
- Helps identify potential predictors if Airbnb wanted to build a **recommendation or pricing model**.
- Confirms that **review activity is a solid metric for listing engagement**.

⚠️ **Negative Impact**:
- Weak correlation between price and most other variables means **pricing must consider multiple non-numerical factors** like location, room type, photos, etc.



#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Chart 15: Pair Plot of Numerical Features Colored by Room Type

import seaborn as sns
import matplotlib.pyplot as plt

# Select the numeric features to plot, plus room_type for the hue
selected_cols = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365', 'room_type']

# Create a filtered DataFrame (limit price to $500 to reduce outlier impact)
df_pair = df[df['price'] <= 500][selected_cols]

# Create pairplot
sns.pairplot(df_pair, hue='room_type', diag_kind='kde', corner=True, plot_kws={'alpha': 0.5})
plt.suptitle("Pair Plot of Numerical Variables Colored by Room Type", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

A **pair plot** shows **every pairwise relationship among the selected numeric variables in one figure**, color-coded by `room_type`. The diagonal gives each variable's **distribution**, and the off-diagonal panels are **pairwise scatter plots** that reveal clustering or patterns across room types.


##### 2. What is/are the insight(s) found from the chart?

#### 🔍 Insight(s) from the chart:
- **Entire home/apt** listings generally occupy the **upper range** of `price` and `availability_365`.
- **Private rooms** cluster in **lower price** and **mid-review** zones — showing affordability and engagement.
- There’s a visible **positive trend** between `reviews_per_month` and `number_of_reviews`, across all room types.
- **Minimum nights** is widely dispersed with no clear grouping, so it does little to separate room types.

#### 💼 Business Impact:
✅ **Positive**:
- Helps identify **natural groupings** of room types based on behavior.
- Airbnb can segment listings better for **search ranking** or **dynamic pricing**.

⚠️ **Negative Impact**:
- Some variables (e.g. `minimum_nights`) don’t cluster well and may be less useful for segmentation models.
- Pair plots can become hard to read with large datasets — filtering by price (e.g. ≤ $500) improves readability.
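
Besides capping the price, plotting a subsample is another common way to keep a pair plot readable and quick to render. A minimal sketch that samples the same fraction from every room type so the mix of types is preserved; the helper name `stratified_sample`, the fraction, and the toy rows are assumptions for illustration:

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, frac: float, seed: int = 42) -> pd.DataFrame:
    """Sample the same fraction of rows from each room type."""
    return df.groupby('room_type').sample(frac=frac, random_state=seed)

# Toy rows standing in for `df_pair` (values are made up).
toy = pd.DataFrame({
    'room_type': ['Entire home/apt'] * 6 + ['Private room'] * 4,
    'price': [150, 180, 200, 220, 250, 300, 60, 70, 80, 90],
})

subset = stratified_sample(toy, frac=0.5)
print(subset['room_type'].value_counts())
```

The returned frame could be passed to `sns.pairplot` in place of `df_pair`; fixing `random_state` keeps the figure reproducible between runs.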




## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

To achieve the objective of improving booking rates, guest satisfaction, and revenue on the Airbnb NYC platform, the client (Airbnb or individual hosts) should focus on the following actionable strategies derived from the EDA:

✅ **1. Focus on Popular Room Types & Locations**
- Entire homes in Manhattan and Brooklyn are in high demand but priced high.
- Private rooms, especially in Brooklyn and Queens, receive higher monthly reviews, showing strong budget traveler engagement.

➤ **Strategy**: Promote or subsidize private room listings in high-demand neighborhoods to attract more short-stay guests.

✅ **2. Optimize Pricing Using Market Benchmarks**
- Listings priced between $50–$150 receive the most reviews, used here as a proxy for bookings because the dataset contains no booking records.
- Listings with extremely high prices or long minimum stays tend to underperform.

➤ **Strategy**: Use dynamic pricing tools to keep listings within this high-engagement range, especially for new or underperforming hosts.
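
A price-band benchmark like the one described in this strategy could be computed along these lines. This is a sketch under stated assumptions: the band edges, labels, and the helper name `reviews_by_price_band` are illustrative, and the toy rows are made up rather than taken from the dataset:

```python
import pandas as pd

def reviews_by_price_band(df: pd.DataFrame) -> pd.Series:
    """Mean reviews_per_month per price band (left-closed bands)."""
    bands = pd.cut(
        df['price'],
        bins=[0, 50, 150, 500, float('inf')],
        labels=['<50', '50-149', '150-499', '500+'],
        right=False,  # a $50 listing falls in the '50-149' band
    )
    return df.groupby(bands, observed=True)['reviews_per_month'].mean()

# Toy rows standing in for the notebook's `df` (values are made up).
sample = pd.DataFrame({
    'price': [30, 60, 120, 300, 800],
    'reviews_per_month': [0.5, 2.0, 3.0, 1.0, 0.2],
})

band_means = reviews_by_price_band(sample)
print(band_means)
print("Most reviewed band:", band_means.idxmax())
```

Reviews per month stand in for bookings here, so the band with the highest mean indicates engagement rather than confirmed conversions.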

✅ **3. Improve Listing Availability & Visibility**
- Listings available year-round (365 days) tend to receive more reviews, which suggests more bookings.
- Many listings have 0-day availability and should be reviewed or flagged as potentially inactive.

➤ **Strategy**: Encourage consistent listing availability and remove inactive listings to enhance platform quality.
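
Flagging zero-availability listings is straightforward to prototype. A minimal sketch, assuming the same column names as `df`; the helper name `zero_availability_share` and the toy rows are illustrative only:

```python
import pandas as pd

def zero_availability_share(df: pd.DataFrame) -> pd.Series:
    """Share of listings with availability_365 == 0, per borough."""
    is_zero = df['availability_365'].eq(0)
    return is_zero.groupby(df['neighbourhood_group']).mean()

# Toy rows standing in for the notebook's `df` (values are made up).
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan',
                            'Queens', 'Queens', 'Queens', 'Queens'],
    'availability_365': [0, 365, 0, 0, 120, 0],
})

share = zero_availability_share(toy)
print(share)
```

A zero value may mean a listing is inactive or simply fully booked, so a flag of this kind is a starting point for review rather than grounds for removal.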

✅ **4. Boost Guest Trust & Engagement**
- Listings with high reviews per month show strong recent guest activity, which likely signals trust and frequent bookings.

➤ **Strategy**: Promote listings with strong recent activity and encourage hosts to respond quickly and maintain quality standards.

✅ **5. Refine Search and Recommendation Systems**
Use multivariate trends (price + location + availability + review activity) to improve:
- Personalized search results
- Automated recommendations

➤ **Strategy**: Feed these EDA findings into Airbnb's ranking and recommendation models to surface high-quality listings based on user preferences and booking trends.

# **Conclusion**

## ✅ Conclusion

This Exploratory Data Analysis of the Airbnb NYC 2019 dataset provided deep insights into pricing, availability, listing types, guest engagement, and geographical patterns using the UBM (Univariate, Bivariate, Multivariate) approach.

---

### 🔍 Key Takeaways:
- **Manhattan and Brooklyn** are the most active boroughs in terms of listings and guest engagement.
- **Entire home/apt** listings dominate the platform but **private rooms** attract the most reviews per month, indicating high turnover and affordability.
- Listings priced between **$50–$150** with **short minimum stays** and **year-round availability** attract the most review activity, the proxy for performance used in this analysis.
- **Outliers** in pricing and minimum nights exist and may reduce booking probability if not optimized.
- Most variable pairs, such as `price` against `availability_365` or the review counts, have weak linear correlations, suggesting that **non-linear or categorical patterns** (such as location and room type) matter more for performance.

---

### 💡 Final Recommendation:
To improve business performance, Airbnb should focus on **pricing intelligence**, **host availability**, **review management**, and **hyperlocal room-type strategies**. Data-backed decision-making can enhance guest satisfaction, host earnings, and platform efficiency.

