# **Project Name**    - Exploratory Data Analysis on Airbnb NYC 2019 Dataset



##### **Project Type**    - EDA (Airbnb)
## **Individual - Diksha CSE-B 4th year**


# **Project Summary -**


## 📄 **Project Summary: Exploratory Data Analysis on Airbnb NYC 2019 Dataset**

This project focused on conducting a comprehensive exploratory data analysis (EDA) of the Airbnb NYC 2019 dataset to extract actionable insights that could support business decisions for Airbnb, hosts, and guests. The dataset contains over 49,000 listings with various features including location, price, room type, availability, number of reviews, and more.

The analysis aimed to uncover patterns in listing performance, pricing behavior, geographical trends, and supply distribution across New York City. The ultimate goal was to identify strategies to enhance user satisfaction, improve host efficiency, balance market supply and demand, and drive positive business outcomes for the platform.

The dataset was first cleaned by handling missing values (notably in the `reviews_per_month` column), removing outliers in key features such as `price` and `minimum_nights`, and filtering out invalid or extreme geographic coordinates. The dataset was then explored through 15 visualizations, each designed to highlight a specific business aspect.

Key findings include that **room type**, **location**, and **availability** are the most significant factors influencing listing success. Listings with **shorter minimum night requirements** (1–3 days) and **high annual availability** are more likely to receive frequent bookings. **Private rooms** are consistently available and affordable, making them appealing to solo travelers and budget-conscious guests. Meanwhile, **entire home/apartment** listings are priced higher and often concentrated in premium areas like Manhattan.

A detailed breakdown of borough-level insights showed that **Manhattan and Brooklyn** dominate the listing volume, but this saturation may lead to regulatory challenges and internal competition. Conversely, **Staten Island, the Bronx, and parts of Queens** are underserved markets with growth potential. Targeted host acquisition strategies in these areas could help Airbnb balance its supply geographically.

From a pricing perspective, we observed that `price` does not strongly correlate with individual numerical features like `minimum_nights` or `availability_365`. Instead, it is more closely influenced by room type and location, suggesting that dynamic, context-aware pricing models would be more effective than flat-rate strategies. Additionally, properties with high availability but low review activity may need platform support to improve visibility or quality.

The use of geospatial scatter plots and heatmaps provided insight into how listings are distributed across NYC, with clusters of dense activity in central areas. Pair plots and correlation matrices helped validate assumptions and explore inter-feature relationships.

Based on these findings, a number of business recommendations were proposed, including optimizing pricing strategies using machine learning, encouraging hosts to reduce minimum stay requirements, promoting room-type diversity in concentrated markets, and onboarding hosts in underserved areas. These actions are expected to improve booking rates, guest satisfaction, host success, and overall platform performance.


In conclusion, this EDA project provided a data-driven foundation for strategic decision-making in the Airbnb ecosystem. By leveraging the insights uncovered through this analysis, Airbnb and its stakeholders can improve platform efficiency, enhance user experience, and sustainably grow their market presence in New York City.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Airbnb faces challenges in understanding what drives listing success in a competitive market like New York City. With thousands of listings varying by price, room type, location, and availability, it becomes difficult to identify patterns that lead to higher bookings and guest satisfaction.

This project aims to perform exploratory data analysis (EDA) on the Airbnb NYC 2019 dataset to uncover key trends, such as pricing behavior, room type distribution, and geographic demand. The goal is to provide actionable insights that help Airbnb and its hosts optimize listings, improve user experience, and make data-driven business decisions.

#### **Define Your Business Objective?**

The objective of this project is to perform exploratory data analysis (EDA) on the Airbnb NYC 2019 dataset to identify the main factors that impact listing performance—such as price, room type, availability, location, and reviews.

The aim is to uncover insights that can help:

- Increase booking rates through better pricing and availability strategies.
- Guide hosts to optimize their listings for visibility and competitiveness.
- Support Airbnb’s business decisions by highlighting underserved markets and areas of growth.
- Improve the overall guest experience by ensuring a balanced supply of diverse and well-priced listings across NYC.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


5. You have to create at least 20 logical & meaningful charts having important insights.

[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]







# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns





In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
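The guidelines above ask for exception handling, and the loading cell reads the CSV directly. A more defensive loader might look like the sketch below. The function name `load_listings` and the `REQUIRED_COLUMNS` list are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical subset of columns that the later chart cells rely on.
REQUIRED_COLUMNS = ["price", "room_type", "neighbourhood_group", "availability_365"]


def load_listings(csv_path):
    """Read the listings CSV and fail early with a clear message."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path}")
    frame = pd.read_csv(path)
    # Strip stray whitespace from headers, as the chart cells do.
    frame.columns = frame.columns.str.strip()
    missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame
```

With this in place, the loading cell could call `load_listings("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")` and stop with a readable error if the Drive path is wrong.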

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Rows:", df.shape[0])
print("Columns:", df.shape[1])


### Dataset Information

In [None]:
# Dataset Information
df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Check for duplicate rows
duplicate_rows = df.duplicated()

# Total number of duplicate rows
print("Number of duplicate rows:", duplicate_rows.sum())


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Check for missing values
missing_values = df.isnull().sum()

# Display columns with missing values only
missing_values[missing_values > 0]


In [None]:
# Visualizing the missing values
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()


### What did you know about your dataset?

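The checks above cover the dataset's structure (row and column counts), data types, duplicate rows, and missing values. The dataset holds over 49,000 listings, and the most notable gap is in the `reviews_per_month` column.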
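The separate checks for data types, missing values, and unique counts can also be gathered into one table. The sketch below is one possible way to do that; the helper name `summarize_quality` and the tiny sample frame are made up for illustration and stand in for the real listings data.

```python
import numpy as np
import pandas as pd


def summarize_quality(frame):
    """Return one row per column: dtype, missing count/share, unique values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "missing_pct": (frame.isnull().mean() * 100).round(2),
        "unique": frame.nunique(),
    })


# Tiny made-up sample standing in for the real listings table.
sample = pd.DataFrame({
    "price": [100, 250, 100, 80],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
    "reviews_per_month": [0.5, np.nan, 0.5, np.nan],
})

quality = summarize_quality(sample)
duplicate_count = int(sample.duplicated().sum())
print(quality)
print("Duplicate rows:", duplicate_count)
```

On the real data, `summarize_quality(df)` would give the same overview in a single call.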

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# List all columns
print(df.columns.tolist())


In [None]:
# Dataset Describe
# Describe numerical columns
df.describe()


### Variables Description


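- `id`: Unique listing ID assigned by Airbnb.
- `name`: Title of the listing created by the host.
- `host_id`: Unique ID for the host.
- `host_name`: Name of the host.
- `neighbourhood_group`: Borough in which the listing is located (for example, Manhattan or Brooklyn).
- `latitude`, `longitude`: Geographic coordinates of the listing.
- `room_type`: Entire home/apt, private room, or shared room.
- `price`: Listing price in USD.
- `minimum_nights`: Minimum number of nights a guest must book.
- `number_of_reviews`: Total number of reviews the listing has received.
- `last_review`: Date of the most recent review.
- `reviews_per_month`: Average number of reviews per month.
- `availability_365`: Number of days per year the listing is available for booking.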

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Check unique values for each column
for column in df.columns:
    unique_count = df[column].nunique()
    print(f"{column}: {unique_count} unique values")


## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# 1. Import libraries
import pandas as pd
import numpy as np

# 2. Load the dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
# 3. Drop duplicate rows
df.drop_duplicates(inplace=True)

# 4. Handle missing values
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
df.dropna(subset=['name', 'host_name'], inplace=True)  # Drop rows with critical nulls

# 5. Convert data types
df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')
df['neighbourhood_group'] = df['neighbourhood_group'].astype('category')
df['room_type'] = df['room_type'].astype('category')

# 6. Create new derived columns
df['price_per_available_day'] = df['price'] / df['availability_365'].replace(0, np.nan)

# 7. Handle outliers (optional filters)
df = df[df['price'] <= 1000]
df = df[df['minimum_nights'] <= 365]

# 8. Rename columns (for better readability, optional)
df.rename(columns={
    'neighbourhood_group': 'area',
    'room_type': 'room_category'
}, inplace=True)

# 9. Reset index
df.reset_index(drop=True, inplace=True)

# 10. Final structure check
print("Cleaned and wrangled dataset:")
df.info()  # info() prints its report directly and returns None


### What all manipulations have you done and insights you found?

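The dataset was loaded and duplicate rows were dropped. Missing `reviews_per_month` values were filled with 0, and rows without a `name` or `host_name` were removed. `last_review` was converted to datetime, and `neighbourhood_group` and `room_type` were cast to categories. A derived column, `price_per_available_day`, was added; listings priced above 1000 or requiring more than 365 minimum nights were filtered out; two columns were renamed (to `area` and `room_category`); and the index was reset. Note that each chart cell below reloads the raw CSV, so these changes do not carry over to the charts.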
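The wrangling steps can be wrapped in a function so that the same cleaning is applied wherever the data is used. The sketch below is a minimal version under stated assumptions: the name `clean_listings` and its threshold parameters are hypothetical, and the column renaming is deliberately left out so that code using the original column names keeps working.

```python
import numpy as np
import pandas as pd


def clean_listings(raw, max_price=1000, max_min_nights=365):
    """Apply the wrangling steps to a copy and return the cleaned frame."""
    frame = raw.drop_duplicates().dropna(subset=["name", "host_name"]).copy()
    frame["reviews_per_month"] = frame["reviews_per_month"].fillna(0)
    frame["last_review"] = pd.to_datetime(frame["last_review"], errors="coerce")
    for col in ("neighbourhood_group", "room_type"):
        frame[col] = frame[col].astype("category")
    # Zero availability becomes NaN so the ratio is undefined, not infinite.
    frame["price_per_available_day"] = (
        frame["price"] / frame["availability_365"].replace(0, np.nan)
    )
    keep = (frame["price"] <= max_price) & (frame["minimum_nights"] <= max_min_nights)
    return frame[keep].reset_index(drop=True)


# Made-up rows: one has no name, one has an extreme price.
raw_sample = pd.DataFrame({
    "name": ["Cozy room", None, "Loft", "Penthouse"],
    "host_name": ["Ann", "Bob", "Cy", "Di"],
    "neighbourhood_group": ["Brooklyn", "Queens", "Manhattan", "Manhattan"],
    "room_type": ["Private room", "Private room", "Entire home/apt", "Entire home/apt"],
    "price": [80, 60, 200, 5000],
    "minimum_nights": [2, 1, 3, 1],
    "last_review": ["2019-05-01", None, "2019-06-15", None],
    "reviews_per_month": [1.2, np.nan, 0.4, np.nan],
    "availability_365": [100, 0, 50, 365],
})

cleaned = clean_listings(raw_sample)
print(cleaned[["name", "price", "price_per_available_day"]])
```

Because the function returns a new frame and never modifies its input, the raw data stays available for comparison.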

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

**Price Distribution by Room Type**

In [None]:
# Chart - 1 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")

# Clean column names
df.columns = df.columns.str.strip()

# Filter out extreme price values
df_filtered = df[df['price'] <= 500]

# Create boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_filtered, x='room_type', y='price', palette='Set2')
plt.title('Price Distribution by Room Type (≤ $500)')
plt.xlabel('Room Type')
plt.ylabel('Price (USD)')
plt.grid(axis='y')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot is ideal to show how prices vary across categories like room_type.

It highlights median prices, price spread, and outliers in a visually intuitive way.

It helps stakeholders quickly understand which room types are more premium.
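The numbers a boxplot draws (median, quartiles, and points beyond the whiskers) can also be computed directly, which makes the comparison between room types easier to quote. The sketch below assumes the common 1.5 × IQR whisker rule; the helper `box_stats` and the sample prices are made up for illustration.

```python
import pandas as pd


def box_stats(values):
    """Median, quartiles, and outlier count using 1.5 * IQR whisker limits."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return pd.Series({
        "q1": q1,
        "median": median,
        "q3": q3,
        "outliers": int(((values < lower) | (values > upper)).sum()),
    })


# Made-up prices for two room types.
prices = pd.DataFrame({
    "room_type": ["Private room"] * 5 + ["Shared room"] * 4,
    "price": [50, 60, 70, 80, 300, 30, 35, 40, 45],
})

# One row of statistics per room type.
stats = pd.DataFrame({
    name: box_stats(group) for name, group in prices.groupby("room_type")["price"]
}).T
print(stats)
```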

##### 2. What is/are the insight(s) found from the chart?

Entire homes/apartments are priced higher than the other room types.

Private rooms are more affordable and have a moderate price range.

Shared rooms are the cheapest, with very tight pricing variation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact
**Positive:**
Airbnb can use this to set pricing recommendations for new hosts.

Helps Airbnb target different customer segments (budget vs luxury).

**Negative:**
Price recommendations based only on room type could misprice listings in neighbourhoods with little comparable data or for first-time hosts.



#### Chart - 2

**Availability (Days per Year) by Room Type**

In [None]:
# Chart - 2 visualization code
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load your dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='room_type', y='availability_365', palette='Set3')
plt.title('Availability (Days per Year) by Room Type')
plt.xlabel('Room Type')
plt.ylabel('Availability (0–365 days)')
plt.grid(axis='y')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Availability is crucial for both revenue potential and user experience.

A boxplot shows how each room type varies in availability, from low to high, including outliers.

##### 2. What is/are the insight(s) found from the chart?

Private rooms tend to be more consistently available across the year.

Entire home/apt listings show more variability — possibly used seasonally.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive**:
Helps Airbnb predict availability gaps in certain categories or regions.

Can suggest calendar optimization to hosts with low availability (boost revenue).

Helps filter out inactive listings or mark them as limited-availability for guests.

**Negative**:
Listings with extremely low availability waste space in search results and frustrate guests.

If entire homes dominate searches but are rarely available, this leads to drop in booking conversions.
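To act on the idea of marking inactive or limited-availability listings, `availability_365` could be grouped into bands. The sketch below uses `pd.cut`; the band edges and labels are hypothetical choices, since the notebook does not define any thresholds.

```python
import pandas as pd

# Hypothetical band edges; the notebook does not define any thresholds.
BAND_EDGES = [-1, 0, 90, 270, 365]
BAND_LABELS = ["inactive (0)", "limited (1-90)", "regular (91-270)", "year-round (271-365)"]

# Made-up listings standing in for the real data.
sample = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Shared room"],
    "availability_365": [0, 300, 45, 365, 120],
})

# Each interval is closed on the right, so 0 falls into the first band.
sample["availability_band"] = pd.cut(
    sample["availability_365"], bins=BAND_EDGES, labels=BAND_LABELS
)
band_counts = pd.crosstab(sample["room_type"], sample["availability_band"])
print(band_counts)
```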



#### Chart - 3

**Price Distribution by Neighbourhood Group**

In [None]:
# Chart - 3 visualization code
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")

# Clean column names
df.columns = df.columns.str.strip()

# Filter to remove extreme price outliers for better visualization
df_filtered = df[df['price'] <= 500]

# Plot boxplot of price by neighbourhood group
plt.figure(figsize=(12, 6))
sns.boxplot(data=df_filtered, x='neighbourhood_group', y='price', palette='Set1')
plt.title('Chart 3: Price Distribution by Neighbourhood Group (≤ $500)')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Price (USD)')
plt.grid(axis='y')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot shows the spread and skewness of prices within each neighbourhood group.

Helps us compare typical price ranges, medians, and outliers across city regions.



##### 2. What is/are the insight(s) found from the chart?

Manhattan clearly has the highest median and widest range of prices.

Brooklyn and Queens follow, with lower price medians but some overlap in spread.

Bronx and Staten Island consistently have lower prices and fewer outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impacts:**
Airbnb can inform new hosts of expected price bands based on location.

Pricing algorithms can be location-aware, preventing overpricing in low-demand zones.

Guests can be guided to areas that suit their budget preferences.

**Negative:**
Price overlap in cheaper areas may create confusion or reduce perceived value.

Inconsistent pricing in low-demand neighborhoods may lead to low conversion rates.

#### Chart - 4

**Number of Reviews by Room Type**

In [None]:
# Chart - 4 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Optional: Filter to remove extreme outliers in reviews
df_filtered = df[df['number_of_reviews'] <= 200]

# Plot boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_filtered, x='room_type', y='number_of_reviews', palette='Accent')
plt.title('Chart 4: Number of Reviews by Room Type (≤ 200 Reviews)')
plt.xlabel('Room Type')
plt.ylabel('Number of Reviews')
plt.grid(axis='y')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot is ideal to visualize how number_of_reviews is distributed for each room_type. It highlights the median, spread, and presence of highly reviewed listings.

##### 2. What is/are the insight(s) found from the chart?

- Shared and private rooms typically get more reviews (possibly due to affordability).
- Entire homes have fewer but possibly more expensive bookings.
- Significant outliers exist: some listings have accumulated 100+ reviews in total (`number_of_reviews` is a cumulative count, not a yearly figure).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**
- Improve listing ranking algorithms.
- Identify high-engagement property types.

**Negative:**
- Hosts with few or no reviews may suffer in visibility.
- Over-reliance on reviews could suppress newer listings in search rankings.

#### Chart - 5

**Average Reviews per Month by Neighbourhood Group**


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Handle missing review data
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

# Group and calculate mean reviews per month by neighbourhood group
avg_reviews = df.groupby('neighbourhood_group')['reviews_per_month'].mean().sort_values()

# Plot Chart 5
plt.figure(figsize=(10, 6))
sns.barplot(x=avg_reviews.values, y=avg_reviews.index, palette='viridis')
plt.title('Chart 5: Average Reviews per Month by Neighbourhood Group')
plt.xlabel('Average Reviews per Month')
plt.ylabel('Neighbourhood Group')
plt.grid(axis='x')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart of the mean `reviews_per_month` per neighbourhood group reveals how frequently listings in each area are reviewed, which reflects guest activity and popularity.

##### 2. What is/are the insight(s) found from the chart?

- Central areas (like Manhattan, Brooklyn) have higher review activity.
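Because missing `reviews_per_month` values are filled with 0 before averaging, the chart's mean mixes never-reviewed listings with active ones. The sketch below compares that mean with one computed over reviewed listings only; the sample values are made up for illustration, and which definition is more useful remains a judgment call.

```python
import numpy as np
import pandas as pd

# Made-up sample: one Manhattan listing has never been reviewed.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Queens", "Queens"],
    "reviews_per_month": [2.0, np.nan, 1.0, 3.0],
})

by_group = sample.groupby("neighbourhood_group")["reviews_per_month"]
comparison = pd.DataFrame({
    # Matches the chart: never-reviewed listings count as 0.
    "mean_with_zeros": sample.assign(
        reviews_per_month=sample["reviews_per_month"].fillna(0)
    ).groupby("neighbourhood_group")["reviews_per_month"].mean(),
    # Alternative: mean() skips NaN, so only reviewed listings contribute.
    "mean_reviewed_only": by_group.mean(),
    # Share of listings in each group without any review data.
    "share_unreviewed": by_group.apply(lambda s: s.isnull().mean()),
})
print(comparison)
```

A large gap between the two means signals that a group's average is driven by the share of unreviewed listings and not by how active the reviewed ones are.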


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**

Helps Airbnb and hosts understand high-demand areas and replicate success in underperforming zones.

**Negative:**

Imbalance in listing engagement can lead to oversaturation in certain regions and missed opportunity elsewhere.


#### Chart - 6

**Availability vs Price**

In [None]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and clean the dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Filter data to remove extreme outliers
df_filtered = df[(df['price'] <= 500) & (df['availability_365'] > 0)]

# Plot Chart 6
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_filtered, x='availability_365', y='price', hue='room_type', alpha=0.6)
plt.title('Chart 6: Availability vs Price (≤ $500)')
plt.xlabel('Availability in Days (per year)')
plt.ylabel('Price (USD)')
plt.grid(True)
plt.legend(title='Room Type')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot shows how two continuous variables (availability and price) relate. Helps spot patterns, clusters, or lack of correlation.


##### 2. What is/are the insight(s) found from the chart?

- Listings with lower availability tend to have higher prices — possibly premium or seasonal.
- Highly available listings tend to cluster around moderate prices.
- Shared/private rooms are more affordable and often more available.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**

Airbnb can use this insight to suggest dynamic pricing strategies for listings with low availability. Also helps identify ideal availability-price combos.

**Negative:**

Listings with low availability and high price may receive fewer bookings. Could lead to underutilization of property potential.

#### Chart - 7

**Room Type Distribution Across Neighbourhood Groups**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and clean the dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Create a grouped count of room_type per neighbourhood_group
room_dist = df.groupby(['neighbourhood_group', 'room_type']).size().reset_index(name='count')

# Plot Chart 7
plt.figure(figsize=(12, 6))
sns.barplot(data=room_dist, x='neighbourhood_group', y='count', hue='room_type', palette='Set2')
plt.title('Chart 7: Room Type Distribution Across Neighbourhood Groups')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Number of Listings')
plt.legend(title='Room Type')
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

This bar chart shows the supply mix by region, helping us understand what types of properties dominate each borough — critical for business strategy.


##### 2. What is/are the insight(s) found from the chart?

- Manhattan and Brooklyn have a high number of Entire home/apt and Private rooms.
- Bronx and Staten Island have fewer listings, with a higher ratio of private/shared rooms.
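A count-based bar chart shows volume, but a claim about the ratio of room types is easier to check with row-wise shares. The sketch below uses `pd.crosstab` with `normalize="index"`; the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up listings standing in for the real data.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Bronx"] * 2,
    "room_type": ["Entire home/apt", "Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
})

# Absolute counts, as plotted in the grouped bar chart.
counts = pd.crosstab(sample["neighbourhood_group"], sample["room_type"])

# Row-wise shares: each neighbourhood group sums to 1.
shares = pd.crosstab(
    sample["neighbourhood_group"], sample["room_type"], normalize="index"
)
print(counts)
print(shares.round(2))
```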


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**

Airbnb can balance supply, run targeted promotions, and guide host onboarding by region and demand type.


**Negative:**

Some boroughs may lack diversity in room types, which could limit appeal to broader guest segments (e.g., no budget or luxury options).

#### Chart - 8

**Distribution of Minimum Nights**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and clean the dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Filter to remove extreme outliers in minimum nights
df_filtered = df[df['minimum_nights'] <= 30]  # focus on 1 month or shorter stays

# Plot Chart 8: Distribution of Minimum Nights
plt.figure(figsize=(10, 6))
sns.histplot(df_filtered['minimum_nights'], bins=30, kde=True, color='skyblue')
plt.title('Chart 8: Distribution of Minimum Nights (≤ 30)')
plt.xlabel('Minimum Nights Required')
plt.ylabel('Number of Listings')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A histogram helps visualize the frequency distribution of minimum night requirements. It shows what's common and what’s unusual.


##### 2. What is/are the insight(s) found from the chart?

- Most listings have a minimum night stay of 1–3 days.
- There are fewer listings with stays longer than a week.
- Spikes at values like 7, 14, or 30 may indicate weekly, two-week, or monthly settings.
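The shares behind these observations can be computed directly from the column. The sketch below uses a made-up series in place of `df['minimum_nights']`, so the printed figures are illustrative only.

```python
import pandas as pd

# Made-up values standing in for df['minimum_nights'].
minimum_nights = pd.Series([1, 1, 2, 3, 3, 5, 7, 14, 30, 30], name="minimum_nights")

# Share of listings that require only 1 to 3 nights.
share_short = minimum_nights.between(1, 3).mean()

# Share of listings at each distinct value, then the candidate spike values.
share_by_value = minimum_nights.value_counts(normalize=True).sort_index()
spikes = share_by_value.reindex([7, 14, 30], fill_value=0.0)

print(f"Share with 1-3 minimum nights: {share_short:.0%}")
print(spikes)
```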


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**Positive:**

Airbnb can encourage hosts with long minimum stays to reduce them for better booking rates. Helps align listings with customer preferences for short stays.

**Negative:**

Listings with very high minimum nights could have low occupancy or appear less often in searches — leading to revenue loss.


#### Chart - 9

**Minimum Nights vs Price (Scatter Plot)**

In [None]:
# Chart - 9 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and prepare dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Filter for clarity (remove extreme outliers)
df_filtered = df[(df['price'] <= 500) & (df['minimum_nights'] <= 30)]

# Chart 9: Minimum Nights vs Price
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_filtered, x='minimum_nights', y='price', hue='room_type', alpha=0.6)
plt.title('Chart 9: Minimum Nights vs Price (≤ $500 & ≤ 30 Nights)')
plt.xlabel('Minimum Nights Required')
plt.ylabel('Price (USD)')
plt.grid(True)
plt.legend(title='Room Type')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot shows how two numerical variables (minimum nights and price) relate. Coloring the points by room type helps spot patterns, clusters, or a lack of correlation.

##### 2. What is/are the insight(s) found from the chart?

- Price shows no strong relationship with the minimum night requirement.
- Room type separates the price levels more clearly than `minimum_nights` does.
- Most points sit at short minimum stays (1–3 nights), with vertical bands at values like 7, 14, or 30.
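One way to put a number on the relationship plotted in Chart 9 is a rank correlation between `minimum_nights` and `price`, computed overall and per room type. The sketch below is an assumption about how that check could be done, not part of the original analysis; the helper `spearman` and the sample values are made up, and the sample is constructed so that opposite trends within room types cancel out overall.

```python
import pandas as pd


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return x.rank().corr(y.rank())


# Made-up sample: price rises with minimum nights for private rooms
# and falls for entire homes.
sample = pd.DataFrame({
    "room_type": ["Private room"] * 4 + ["Entire home/apt"] * 4,
    "minimum_nights": [1, 2, 3, 30, 1, 2, 3, 30],
    "price": [60, 70, 80, 90, 250, 200, 150, 100],
})

overall = spearman(sample["minimum_nights"], sample["price"])
per_room_type = {
    name: spearman(group["minimum_nights"], group["price"])
    for name, group in sample.groupby("room_type")
}

print(f"Overall: {overall:.2f}")
for name, value in per_room_type.items():
    print(f"{name}: {value:.2f}")
```

An overall value near zero can hide opposite trends within groups, which is why the per-room-type breakdown is worth checking alongside it.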

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**

Airbnb can encourage hosts with long minimum stays to reduce them for better booking rates. Helps align listings with customer preferences for short stays.

**Negative:**

Listings with very high minimum nights could have low occupancy or appear less often in searches — leading to revenue loss.

#### Chart - 10

**Average Price by Room Type per Neighbourhood Group**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and prepare dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Filter to remove extreme outliers for clarity
df_filtered = df[df['price'] <= 500]

# Group by neighbourhood_group and room_type to calculate average price
grouped_price = df_filtered.groupby(['neighbourhood_group', 'room_type'])['price'].mean().reset_index()

# Plot Chart 10
plt.figure(figsize=(12, 6))
sns.barplot(data=grouped_price, x='neighbourhood_group', y='price', hue='room_type', palette='Paired')
plt.title('Chart 10: Average Price by Room Type per Neighbourhood Group (≤ $500)')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Average Price (USD)')
plt.legend(title='Room Type')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This grouped bar chart shows how prices vary by both room type and neighbourhood, giving deeper insight than either factor alone.

##### 2. What is/are the insight(s) found from the chart?

- Entire homes in Manhattan are the most expensive.
- Private rooms stay consistently more affordable across all boroughs.
- Brooklyn and Queens offer a mix of prices appealing to different audiences.
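The same averages can be laid out as a table, which also makes it simple to compute how much more an entire home costs than a private room in each borough. The sketch below uses `pivot_table` on made-up prices; the `premium` ratio is an illustrative extra, not a figure from the analysis.

```python
import pandas as pd

# Made-up listings standing in for the filtered data.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room"],
    "price": [200, 300, 100, 150, 75],
})

# Rows: neighbourhood groups; columns: room types; cells: mean price.
price_table = sample.pivot_table(
    index="neighbourhood_group", columns="room_type", values="price", aggfunc="mean"
)

# Ratio of entire-home to private-room mean price per neighbourhood group.
premium = price_table["Entire home/apt"] / price_table["Private room"]
print(price_table)
print(premium)
```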

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**

Helps Airbnb optimize dynamic pricing, shows hosts how their price compares to similar listings nearby, and guides guests in budgeting per area.

**Negative:**

Some boroughs show little variation between room types, which may indicate poor segmentation or mispriced properties. This could lead to revenue inefficiency.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and prepare dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Filter availability values for clarity (remove 0 availability)
df_filtered = df[df['availability_365'] > 0]

# Plot Chart 11
plt.figure(figsize=(12, 6))
sns.boxplot(data=df_filtered, x='room_type', y='availability_365', palette='pastel')
plt.title('Chart 11: Availability Distribution by Room Type')
plt.xlabel('Room Type')
plt.ylabel('Availability (Days per Year)')
plt.grid(axis='y')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot is ideal to compare how availability differs across room types, highlighting full-time vs part-time listings.

##### 2. What is/are the insight(s) found from the chart?

- Entire homes often have more limited availability — likely due to owner use.
- Private and shared rooms are more often available year-round.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**

Helps Airbnb identify which room types contribute most to overall inventory year-round, and when to push seasonal promotions.

**Negative:**

Low availability listings might dominate certain search results, frustrating users who can’t book long stays. Airbnb can recommend availability increases for underutilized listings.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
import pandas as pd
import matplotlib.pyplot as plt

# Load and clean the dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Pivot the data: count of room types per neighbourhood group
pivot_df = df.pivot_table(index='neighbourhood_group',
                          columns='room_type',
                          aggfunc='size',
                          fill_value=0)

# Plot Chart 12: Stacked bar chart
pivot_df.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='tab20')
plt.title('Chart 12: Number of Listings by Room Type in Each Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Number of Listings')
plt.legend(title='Room Type')
plt.tight_layout()
plt.grid(axis='y')
plt.show()


##### 1. Why did you pick the specific chart?

A stacked bar chart is perfect to show the relative mix and total volume of room types per neighbourhood group, all in one view.

##### 2. What is/are the insight(s) found from the chart?

- Manhattan has a higher ratio of entire homes than other boroughs.
- Brooklyn and Queens have a more diverse room mix (many private rooms).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**

Airbnb can adjust supply by encouraging specific room types in areas lacking them (e.g., more shared rooms in Manhattan, more entire homes in Bronx).

**Negative:**

If one room type dominates an area, it limits guest options and could affect conversion. Airbnb could promote listing diversity for a better user experience.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and clean dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Filter to remove extreme or invalid coordinates
df_filtered = df[(df['longitude'] < -73.7) & (df['longitude'] > -74.3) &
                 (df['latitude'] > 40.5) & (df['latitude'] < 40.95)]

# Plot Chart 13: Geographical scatterplot
plt.figure(figsize=(10, 8))
sns.scatterplot(data=df_filtered, x='longitude', y='latitude', hue='room_type', alpha=0.5, s=10, palette='tab10')
plt.title('Chart 13: Geographical Distribution of Listings by Room Type')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(title='Room Type', loc='upper right')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This scatter map helps visualize spatial patterns, urban density, and room type clustering — especially useful for regional supply planning.

##### 2. What is/are the insight(s) found from the chart?

- Manhattan and Brooklyn have high-density clusters of listings.
- Entire homes are more spread across all boroughs.
- Shared rooms are mostly in Manhattan, possibly student- or hostel-style.
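Listing density can be quantified by binning coordinates into a grid and counting listings per cell. The sketch below uses `np.histogram2d` with the same bounding box as the coordinate filter in the Chart 13 cell; the grid size (6 × 9 cells) and the sample points are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Bounding box taken from the coordinate filter in the Chart 13 cell.
LON_RANGE = (-74.3, -73.7)
LAT_RANGE = (40.5, 40.95)

# Made-up coordinates: three points close together, two elsewhere.
sample = pd.DataFrame({
    "longitude": [-73.98, -73.99, -73.97, -73.95, -74.15],
    "latitude": [40.77, 40.76, 40.78, 40.68, 40.58],
})

# Count listings per grid cell (6 longitude bins x 9 latitude bins).
counts, lon_edges, lat_edges = np.histogram2d(
    sample["longitude"].to_numpy(),
    sample["latitude"].to_numpy(),
    bins=(6, 9),
    range=(LON_RANGE, LAT_RANGE),
)

# Locate the cell with the most listings and report its bounds.
lon_idx, lat_idx = np.unravel_index(np.argmax(counts), counts.shape)
densest_cell = {
    "lon": (lon_edges[lon_idx], lon_edges[lon_idx + 1]),
    "lat": (lat_edges[lat_idx], lat_edges[lat_idx + 1]),
    "listings": int(counts[lon_idx, lat_idx]),
}
print(densest_cell)
```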

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:**

Airbnb can guide regional promotions, host recruitment, or city-specific strategies based on listing density.

**Negative:**

Overconcentration in specific areas (e.g. central Manhattan) may cause regulatory attention, local community friction, or price cannibalization. Airbnb may need to diversify inventory spatially.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and prepare dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Fill missing values for correlation calculations
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

# Select only numeric columns relevant to business insights
num_cols = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']
corr = df[num_cols].corr()

# Plot Chart 14: Correlation Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Chart 14: Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart quantifies the strength and direction of relationships between key numerical variables — essential for feature selection and pattern recognition.

##### 2. What is/are the insight(s) found from the chart?

- number_of_reviews and reviews_per_month are strongly positively correlated.
- price shows very weak correlation with the other numeric features, which suggests it is driven more by location and room type than by the variables in this heatmap.
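
The heatmap itself cannot confirm that location and room type drive price, because both are categorical. One way to probe that suggestion is a table of median prices per group. This is a rough sketch under assumptions: the borough column is taken to be `neighbourhood_group`, the rows in `sample` are invented, and `median_price_table` is a hypothetical helper.

```python
import pandas as pd

# Made-up rows for illustration only; `neighbourhood_group` is assumed to be the borough column.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [250, 210, 110, 150, 70, 60],
})

def median_price_table(df):
    """Median price for each borough (rows) and room type (columns)."""
    return df.pivot_table(index="neighbourhood_group", columns="room_type",
                          values="price", aggfunc="median")

table = median_price_table(sample)
print(table)
```

The median is used instead of the mean so that a few very expensive listings do not dominate each cell. Large gaps between cells on the real data would support the reading above.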

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load and clean dataset
df = pd.read_csv("/content/drive/MyDrive/airbnb/Copy of Airbnb NYC 2019.csv")
df.columns = df.columns.str.strip()

# Fill missing reviews_per_month with 0 for completeness
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

# Filter out extreme prices, long minimum stays, and listings with no availability for clearer visuals
df_filtered = df[(df['price'] <= 500) & (df['minimum_nights'] <= 30) & (df['availability_365'] > 0)]

# Select numerical columns for pair plot
num_features = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']

# Plot Chart 15
sns.pairplot(df_filtered[num_features], corner=True, diag_kind='kde', plot_kws={'alpha':0.4})
plt.suptitle('Chart 15: Pair Plot of Key Numerical Variables', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

It gives a comprehensive overview of how all numeric variables interact, making it easier to spot hidden relationships, distributions, and clusters.

##### 2. What is/are the insight(s) found from the chart?

- price has a wide spread and visible outliers but shows no clear correlation with most other features.
- reviews_per_month and number_of_reviews are tightly clustered and positively related.
- availability_365 shows moderate variance across features.
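
Because price is skewed and has outliers, Pearson correlation may understate a relationship that is monotonic but not linear. A rank-based (Spearman) correlation is one possible cross-check. The sketch below uses invented values and a hypothetical helper, `compare_correlations`; it shows the technique only and does not reproduce any result from the dataset.

```python
import pandas as pd

# Made-up values for illustration; one extreme price mimics the outliers seen in the data.
sample = pd.DataFrame({
    "price": [50, 60, 70, 80, 90, 2000],
    "availability_365": [10, 40, 90, 150, 200, 365],
})

def compare_correlations(df, cols):
    """Return Pearson and Spearman correlation matrices for the given columns."""
    pearson = df[cols].corr(method="pearson")
    # Spearman works on ranks, so a single extreme value has limited influence.
    spearman = df[cols].corr(method="spearman")
    return pearson, spearman

pearson, spearman = compare_correlations(sample, ["price", "availability_365"])
print(pearson.round(2))
print(spearman.round(2))
```

If the two matrices differ noticeably on the real data, the relationship is probably non-linear or outlier-driven and worth a closer look.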

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

1. Boost Bookings:
Encourage lower minimum night stays (1–3 days) and improve availability (>200 days/year), especially for private rooms.

2. Optimize Pricing:
Use dynamic, location-based pricing. Flag underpriced or overpriced listings in premium areas.

3. Expand Supply in Underserved Areas:
Focus on onboarding hosts in low-listing areas like Staten Island and the Bronx to balance supply.

4. Improve Host Performance:
Support hosts with low review activity. Promote high-performing hosts to drive trust and conversions.

5. Enhance Guest Experience:
Ensure room type variety across boroughs and improve listing availability, pricing transparency, and quality.

6. Address Regulatory Risk:
Reduce overconcentration in Manhattan and audit listings with suspiciously high minimum stays or no availability.

7. Adopt Data-Driven Tools:
Use advanced analytics and machine learning for better pricing, demand prediction, and strategic planning.
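
The first two recommendations can be turned into simple listing-level flags. The following is a minimal sketch, not a production rule set: the thresholds for minimum nights and availability come from the text above, while the 50% price band, the `neighbourhood_group` column name, the invented rows in `sample`, and the helper `add_recommendation_flags` are all assumptions made for illustration.

```python
import pandas as pd

# Made-up listings for illustration only.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Bronx"] * 3,
    "room_type": ["Private room"] * 7,
    "price": [100, 110, 120, 400, 50, 55, 20],
    "minimum_nights": [1, 2, 30, 3, 1, 7, 2],
    "availability_365": [300, 150, 365, 250, 220, 10, 210],
})

def add_recommendation_flags(df, max_min_nights=3, min_availability=200, price_band=0.5):
    """Flag booking-friendly listings and listings priced far from their peer median.

    `price_band` is a hypothetical tolerance: 0.5 means more than 50% above or
    below the median price of listings in the same borough and room type.
    """
    out = df.copy()
    out["booking_friendly"] = (
        (out["minimum_nights"] <= max_min_nights)
        & (out["availability_365"] > min_availability)
    )
    # Median price of comparable listings, broadcast back to each row.
    peer_median = out.groupby(["neighbourhood_group", "room_type"])["price"].transform("median")
    out["peer_median_price"] = peer_median
    out["underpriced"] = out["price"] < peer_median * (1 - price_band)
    out["overpriced"] = out["price"] > peer_median * (1 + price_band)
    return out

flagged = add_recommendation_flags(sample)
print(flagged[["price", "peer_median_price", "booking_friendly", "underpriced", "overpriced"]])
```

Comparing each listing with its own borough and room type keeps a Manhattan entire home from being judged against a Bronx private room. The band width would need tuning against real booking outcomes.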

# **Conclusion**

This exploratory data analysis of the Airbnb NYC 2019 dataset revealed key patterns in listing availability, pricing, room types, and location-based trends. The insights highlight that availability, location, and room type are the most influential factors in determining listing success. While Manhattan and Brooklyn dominate the market, underserved areas like the Bronx and Staten Island present growth opportunities.

Effective use of dynamic pricing, diverse inventory, and data-driven strategies can help optimize bookings, improve host performance, and deliver a better guest experience. Moving forward, Airbnb can use these insights to make better-informed decisions, balance supply and demand, and remain compliant with local regulations.

