<a href="https://colab.research.google.com/github/gandharvbajaj/wallamart-/blob/main/EDA_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **CAPSTONE PROJECT**: **Airbnb Booking Analysis**



##### **Project Type**    - EDA

---


##### **Contribution**    - Individual/Gandharv Bajaj


# **Project Summary -**

* The main goal of this analysis is to uncover the factors influencing Airbnb pricing in New York City, recognize variable patterns, and offer valuable insights for travelers, hosts, and the Airbnb business sector.

* Initially, the project focused on data exploration and cleaning to make the dataset ready for analysis. This involved examining the data to understand its characteristics, such as types, missing values, and value distributions. The cleaning process addressed issues like errors, missing entries, and duplicate records, and outliers were removed.

* By identifying and resolving data issues, we ensured the dataset was high-quality and free from biases or errors that could skew results. This foundational step is crucial for any analysis project to ensure accuracy and reliability in subsequent analyses.

* With the data cleaned and prepped, we moved on to exploration and summarization. This included describing the data, creating visualizations, and spotting patterns and trends. We also examined relationships between variables and the possible reasons behind certain patterns or trends.

* Data visualization played a key role in uncovering and understanding patterns in the Airbnb data. Various graphs and charts were created to illustrate the data, accompanied by observations and insights to aid in interpretation and highlight significant findings.

* Through these visualizations, we identified trends and relationships that are not easily seen in raw data. For instance, we discovered that factors like minimum nights, number of reviews, and host listing count significantly influence pricing. Additionally, availability varies widely across different neighborhoods.

* The insights gained from this process are beneficial for future analysis and decision-making concerning Airbnb. Our findings provide valuable information for both travelers and hosts in New York City, aiding in better decision-making and strategic planning.

# **GitHub Link -**

https://github.com/gandharvbajaj

# **Problem Statement**


1. What are the most popular neighborhoods for Airbnb rentals in New York City? How do prices and availability vary by neighborhood?

2. How has the Airbnb market in New York City changed over time? Have there been any significant trends in terms of the number of listings, prices, or occupancy rates?

3. Are there any patterns or trends in terms of the types of properties that are being rented out on Airbnb in New York City? Are certain types of properties more popular or more expensive than others?

4. Are there any factors that seem to be correlated with the prices of Airbnb rentals in New York City?

5. Which area of New York City is best for a host to buy property at a good price while still getting high guest traffic?

6. How do the lengths of stay for Airbnb rentals in New York City vary by neighborhood? Do certain neighborhoods tend to attract longer or shorter stays?

7. How do the ratings of Airbnb rentals in New York City compare to their prices? Are higher-priced rentals more likely to have higher ratings?

8. Find the total number of reviews and the maximum number of reviews for each neighborhood group.

9. Find the most reviewed room type in each neighborhood group per month.

10. Find the best listing/property locations for travelers.

11. Find the best listing/property locations for hosts.

12. Find the price variation across NYC neighborhood groups.

13. There are many problem statements, and we have to find information and insights for each of them, so let's start.
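Question 4 above asks which factors are correlated with price. A minimal sketch of one way to get a first look is a Pearson correlation of each numeric column against `price`. The `toy_df` values below are hypothetical and only stand in for the real dataset; they are not results.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be airbnb_df
toy_df = pd.DataFrame({
    "price": [50, 100, 150, 200, 250],
    "minimum_nights": [1, 2, 3, 4, 5],
    "number_of_reviews": [50, 40, 30, 20, 10],
})

# Pearson correlation of every numeric column with price (price itself dropped)
price_corr = toy_df.corr()["price"].drop("price")
print(price_corr)
```

Correlation only measures linear association, so a weak value does not rule out a non-linear relationship.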

#### **Define Your Business Objective?**

Answer: Our business objective is to serve as many customers and guests as possible with quality services, and to:
* **Maximize Exposure and Accessibility:**
1. Increase visibility to attract more guests.
2. Ensure competitive pricing across neighborhoods.
3. Foster a positive host community.
* **Accommodation Variety and Pricing:**
1. Optimize minimum stay requirements and room availability.
2. Tailor marketing strategies based on seasonal guest patterns.
3. Invest strategically in high-return neighborhoods.
* **Listing Details and User Experience:**
1. Diversify room type offerings for guest preferences.
2. Visualize pricing dynamics for informed decisions.
3. Improve guest satisfaction through review analysis.

* **Stay Duration and Neighborhood Preferences:**
     1. Analyze and optimize stay duration policies.
     2. Identify cost-effective and luxury neighborhoods.





# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
import missingno as msno
import os

### Dataset Loading

In [None]:
#drive mounting
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
airbnb_df  = pd.read_csv ('/content/drive/MyDrive/Airbnb NYC 2019.csv')

### Dataset First View

In [None]:
# Dataset First Look
airbnb_df.head()

In [None]:
airbnb_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
airbnb_df.shape

### Dataset Information

In [None]:
# Dataset Info
airbnb_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
airbnb_df.duplicated()

In [None]:
airbnb_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
airbnb_df.isnull().sum()

In [None]:
# Visualizing the missing values
missing_values = airbnb_df.isnull()

# Plotting heatmap using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(missing_values, cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

In [None]:
# Visualize missing values using a matrix
msno.matrix(airbnb_df)

### What did you know about your dataset?

Answer: We know the following about the dataset:
* The dataset contains 48895 entries with 16 columns.
* The dataset contains variables with integer, float and object data types.
* There are no duplicate values present in the dataset.
* The dataset provides information about Airbnb listings in New York City, including details about hosts, neighborhoods, room types, prices, minimum nights, reviews, and availability.
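The checks behind these observations can be gathered into one small helper. This is only a sketch: `summarize_dataset` and `toy_df` are hypothetical names introduced here for illustration, and the toy rows stand in for the real CSV.

```python
import pandas as pd

def summarize_dataset(df: pd.DataFrame) -> dict:
    """Collect row/column counts, duplicate rows and missing values per column."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isnull().sum().to_dict(),
    }

# Hypothetical toy data standing in for the Airbnb dataset
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Cozy loft", None, "Sunny room"],
    "price": [150, 80, 60],
    "reviews_per_month": [0.5, None, 1.2],
})

summary = summarize_dataset(toy_df)
print(summary)
```

In the notebook, calling `summarize_dataset(airbnb_df)` would report the same facts in a single cell.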

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
airbnb_df.columns

In [None]:
airbnb_df.dtypes

In [None]:
# Create a new DataFrame with column names and data types
data_types_df = pd.DataFrame({'Column Name': airbnb_df.columns, 'Data Type': airbnb_df.dtypes})

# Print the DataFrame
print(data_types_df)

In [None]:
# Dataset Describe
airbnb_df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = airbnb_df.nunique()
print(unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
airbnb_df.duplicated().sum()

*There are no duplicate values in the dataset, so we can move forward. If duplicates were present, we would remove them with*

**airbnb_df.drop_duplicates(inplace=True)**
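As a small illustration of that step, the pandas method is spelled `drop_duplicates`. The sketch below uses a hypothetical `toy_df` with one repeated row rather than the real data.

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row
toy_df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [100, 80, 80, 60],
})

# Count fully duplicated rows, then drop them and rebuild the index
n_duplicates = int(toy_df.duplicated().sum())
deduped_df = toy_df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped_df))
```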

In [None]:
airbnb_df.isnull().sum()

In [None]:
airbnb_df.head(5)

In [None]:
# Impute values in the empty entries
airbnb_df['name'] = airbnb_df['name'].fillna('Unknown')
airbnb_df['host_name'] = airbnb_df['host_name'].fillna('None')
airbnb_df['last_review'] = airbnb_df['last_review'].fillna('Not Known')
# Listings without reviews get 0 reviews per month; the column stays float
# because casting to int64 would truncate values such as 0.21 to 0
airbnb_df['reviews_per_month'] = airbnb_df['reviews_per_month'].fillna(0)

In [None]:
airbnb_df.isnull().sum()

### What all manipulations have you done and insights you found?

Answer: The manipulations I have done are the following.

1. I checked whether any duplicates are present.

2. I checked the number of missing values.

3. I imputed the missing values with 'Unknown', 'None', 'Not Known' and 0 to make the data more consistent for analysis and visualization.
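The imputation can also be written as a single `fillna` call with a column-to-value mapping. This is a sketch on hypothetical toy rows; `fill_values` and `toy_df` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy rows with the same kinds of gaps as the real data
toy_df = pd.DataFrame({
    "name": ["Cozy loft", None],
    "host_name": [None, "Sam"],
    "last_review": ["2019-05-21", None],
    "reviews_per_month": [0.21, None],
})

# One placeholder per column, matching the values used in the cell above
fill_values = {
    "name": "Unknown",
    "host_name": "None",
    "last_review": "Not Known",
    "reviews_per_month": 0,
}

imputed_df = toy_df.fillna(value=fill_values)
remaining_nulls = int(imputed_df.isnull().sum().sum())
print(imputed_df)
```

Keeping `reviews_per_month` as a float preserves fractional rates such as 0.21.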

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
airbnb_df['neighbourhood_group'].value_counts()

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(x='neighbourhood_group', data=airbnb_df, order=airbnb_df['neighbourhood_group'].value_counts().index)

# Add title and labels
plt.title('Count of Listings by Neighborhood Group')
plt.xlabel('Neighborhood Group')
plt.ylabel('Count')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Answer: I chose a bar chart because it effectively shows the number of listings across the neighborhood groups. Bar charts are ideal for comparing a numerical value across categories, making it clear which neighborhood groups have the most and the fewest listings.

##### 2. What is/are the insight(s) found from the chart?

Answer: Manhattan and Brooklyn have the highest number of listings, and Staten Island has the lowest. This might be because Manhattan is a business hub, so more owners there are interested in rental income than in Staten Island.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: The larger number of listings in Manhattan suggests more demand for rental property and more visitors there. It may also reflect a larger number of homeowners residing in Manhattan.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#airbnb price distribution
plt.figure(figsize=(12, 5))

# Set the seaborn theme to darkgrid
sns.set_theme(style='darkgrid')

# Create a histogram of the 'price' column of the airbnb_df DataFrame
# using the seaborn histplot function with a KDE overlay, colored yellow
sns.histplot(airbnb_df['price'], kde=True, color='y')

# Add labels to the x-axis and y-axis (histplot shows counts by default)
plt.xlabel('Price', fontsize=14)
plt.ylabel('Count', fontsize=14)

# Add a title to the plot
plt.title('Distribution of Airbnb Prices with Kernel Density Estimate',fontsize=15)
plt.xlim(0, 500)  # Adjust these values as needed to zoom in



# Average Price by Room Type
plt.figure(figsize=(8, 6))
sns.barplot(x='room_type', y='price', data=airbnb_df)
plt.title('Average Price by Room Type')
plt.xlabel('Room Type')
plt.ylabel('Average Price')
plt.show()


##### 1. Why did you pick the specific chart?

Answer: The visualizations used above are a histogram and a bar chart. I used the bar chart because it effectively compares a numerical quantity across categories.

I used the histogram to show the distribution of price, because together with the kernel density estimate it explains the distribution very well.

##### 2. What is/are the insight(s) found from the chart?

Answer: The histogram shows that most prices charged on Airbnb lie roughly between 20 and 330, with the majority of listings falling in the 50 to 150 range.
The distribution peaks in the 50 to 150 range, with a lower density of listings at higher and lower prices.
There are few listings at prices above 250 dollars, as the density drops significantly in this range.

The bar chart compares the average price by room type. It shows that Entire home/apt has the highest average price (about 220 dollars) and Shared room the lowest (about 65 dollars), which makes sense.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: The insights from the above visualizations are useful because they raise follow-up questions: why are some listings priced above 250 dollars? Are there specific neighbourhood groups with higher prices, and what might be the reason?
The second visualization shows that entire homes/apartments are priced higher than the other room types.
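The bar heights in the second chart are per-category means, because `sns.barplot` aggregates with the mean by default. A sketch of computing the same numbers directly with `groupby` follows; the `toy_df` prices are hypothetical and are not the dataset's actual averages.

```python
import pandas as pd

# Hypothetical toy listings; in the notebook this would be airbnb_df
toy_df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "price": [180, 220, 90, 70, 50],
})

# Mean price per room type, highest first
avg_price_by_room = (
    toy_df.groupby("room_type")["price"].mean().sort_values(ascending=False)
)
print(avg_price_by_room)
```

Printing the table next to the chart makes it easier to quote exact averages in the written answers.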

#### Chart - 3

In [None]:
# Create a new column for price range
price_bins = [0, 100, 200, 300, 400, 500, 600, float('inf')]
price_labels = ['0-100', '100-200', '200-300', '300-400', '400-500', '500-600', '600+']
airbnb_df['price_range'] = pd.cut(airbnb_df['price'], bins=price_bins, labels=price_labels, right=False)

# Plot the scatter plot with the new price range column as hue
plt.figure(figsize=(10, 6))
sns.scatterplot(x='longitude', y='latitude', hue='price_range', data=airbnb_df, palette='viridis')
plt.title('Map of Listings')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(title='Price Range')
plt.show()

##### 1. Why did you pick the specific chart?




Answer: The visualization used here is a scatter plot of longitude against latitude, colored by price range, which works like a map. It provides visual and geographical context for the data and shows how price varies by location.

##### 2. What is/are the insight(s) found from the chart?

Answer: The map shows how price varies across the city, highlighting where the expensive apartments are and where the affordable ones are.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: The insight tells us that most listings lie in the 200-250 price range, while only a few are priced above 600; these are shown in parrot green/light green.
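To back such a statement with numbers, the listings in each price range can be counted using the same bins as the chart code. This is a sketch on hypothetical `toy_prices`; note that with `right=False` each bin includes its left edge, so a price of exactly 100 falls into '100-200'.

```python
import pandas as pd

# Same bin edges and labels as the Chart 3 code
price_bins = [0, 100, 200, 300, 400, 500, 600, float("inf")]
price_labels = ["0-100", "100-200", "200-300", "300-400",
                "400-500", "500-600", "600+"]

# Hypothetical toy prices; in the notebook this would be airbnb_df['price']
toy_prices = pd.Series([50, 99, 100, 250, 650, 1200])

# Assign each price to a left-closed bin and count listings per bin
toy_ranges = pd.cut(toy_prices, bins=price_bins, labels=price_labels, right=False)
range_counts = toy_ranges.value_counts().sort_index()
print(range_counts)
```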

---



#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Calculate the distribution percentage
neighbourhood_group_counts = airbnb_df['neighbourhood_group'].value_counts()
neighbourhood_group_percentages = (neighbourhood_group_counts / neighbourhood_group_counts.sum()) * 100

# Plot a pie chart
plt.figure(figsize=(10, 7))
plt.pie(neighbourhood_group_percentages, labels=neighbourhood_group_percentages.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.tab20.colors)
plt.title('Neighbourhood Group Distribution Percentage')
plt.axis('equal')  # Equal aspect ratio ensures the pie chart is circular
plt.show()

##### 1. Why did you pick the specific chart?

*Answer*: The visualization I used here is a pie chart.
Pie charts are excellent for showing the proportions of a whole. In this case, it shows the proportion of listings by neighbourhood group in New York City. Each slice of the pie represents a different neighbourhood group, and the size of the slice corresponds to its share of the total listings.

##### 2. What is/are the insight(s) found from the chart?

Answer: **Manhattan** has the highest share of listings and **Staten Island** the lowest. As the chart shows, Manhattan has **44.3%** of the listings and Staten Island only **0.8%**.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: The insight is useful: Manhattan has the largest market share and Staten Island the smallest, probably because Manhattan is a business hub that many traders and visitors come to every day, whereas Staten Island is more remote and may attract fewer visitors.
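The percentage shares shown on the pie chart can be computed in one step with `value_counts(normalize=True)`. The sketch below uses a hypothetical `toy_groups` series, so its percentages are illustrative only.

```python
import pandas as pd

# Hypothetical toy column; in the notebook this would be
# airbnb_df['neighbourhood_group']
toy_groups = pd.Series(["Manhattan"] * 4 + ["Brooklyn"] * 4 + ["Queens"] * 2)

# Share of listings per group as a percentage, rounded to one decimal
share_pct = (toy_groups.value_counts(normalize=True) * 100).round(1)
print(share_pct)
```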

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10, 6))
plt.hist(airbnb_df['number_of_reviews'], bins=30, color='blue', edgecolor='black')
plt.title('Number of Reviews Distribution')
plt.xlabel('Number of Reviews')
plt.ylabel('Frequency')
plt.grid(True)

# Limit the x-axis to 0-250 reviews to zoom in on the bulk of the listings
plt.xlim(0,250)

plt.show()

##### 1. Why did you pick the specific chart?

Answer: The visualization I used here is Histogram.

**Histogram**: Histograms display the distribution of a numerical variable, in this case the number of reviews.

##### 2. What is/are the insight(s) found from the chart?

Answer: The visualization gives a clear picture of the distribution of the number of reviews. By finding how many listings receive many reviews and how many receive few or none, we can identify problems faced by users and encourage guests to leave reviews so that the business can improve.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: The information gained can be used in the following ways.

**Optimizing Pricing and Marketing**: Listings with a higher number of reviews can justify higher pricing due to perceived popularity and positive guest feedback. Conversely, listings with lower review counts might benefit from targeted marketing campaigns or promotional strategies to increase visibility and bookings.

**Improving Guest Satisfaction**: Insights from review distributions can highlight areas where improvements are needed in terms of amenities, cleanliness, or customer service. Addressing issues raised in reviews can lead to higher guest satisfaction, more positive reviews, and increased repeat bookings.

**Strategic Planning**: Understanding seasonal trends in review distributions helps hosts anticipate demand fluctuations and adjust pricing and availability accordingly. For instance, during peak seasons with higher review counts, hosts might implement dynamic pricing strategies to maximize revenue.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# sns.countplot(data= airbnb_df, x = 'neighbourhood_group', hue = 'room_type')

# Create a count plot using Seaborn
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
sns.countplot(data=airbnb_df, x='neighbourhood_group', hue='room_type')

# Add title to the plot
plt.title('Distribution of Airbnb Listings by Room Type in Each Neighbourhood Group', fontsize=16)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

*Answer*: The chart I picked here is a grouped bar chart (count plot). It displays the neighbourhood groups and the room types within each of them. The distribution of room types across neighbourhood groups is shown by bars of different colors.

##### 2. What is/are the insight(s) found from the chart?

Answer: Staten Island and the Bronx show almost no Shared Room listings, probably because these areas are less dense and space is less of an issue. There, only two room types, Private Room and Entire home/apt, are meaningful. In the future, however, we may find opportunities to invest in shared rooms near the more crowded areas.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: The room type preferences in the different neighbourhood groups tell us about people's needs and demand, and we can optimize the platform accordingly.
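The counts behind the grouped bars can be tabulated with `pd.crosstab`, which makes small categories easier to read than on the chart. This is a sketch on a hypothetical `toy_df`, not the real listings.

```python
import pandas as pd

# Hypothetical toy listings; in the notebook this would be airbnb_df
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn",
                            "Brooklyn", "Bronx"],
    "room_type": ["Entire home/apt", "Shared room", "Private room",
                  "Entire home/apt", "Private room"],
})

# Rows: neighbourhood group, columns: room type, cells: listing counts
room_mix = pd.crosstab(toy_df["neighbourhood_group"], toy_df["room_type"])
print(room_mix)
```

A zero in the table confirms that a room type is truly absent from a group rather than just too small to see.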

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='longitude', y='latitude', hue='neighbourhood_group', data=airbnb_df, palette='viridis')
plt.title('Map of Listings')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

##### 1. Why did you pick the specific chart?

Answer: The visualization I chose above is a scatter plot. A scatter plot is effective for visualizing geospatial data, such as longitude and latitude coordinates. Each point on the plot represents a specific location on a map, which in this case corresponds to an Airbnb listing.

##### 2. What is/are the insight(s) found from the chart?


Answer:The plot can reveal the spatial distribution of Airbnb listings across the geographical area covered by the dataset. You can observe whether listings are concentrated in specific regions or spread out evenly across the map.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

**Strategic Pricing:** Businesses already operating in food and hospitality or real estate can use this data to price their listings strategically based on the patterns in different neighbourhoods.

**Investment Decisions:** This data can guide businesses on where to invest in new properties. Areas with higher prices might indicate high demand and could be lucrative investment opportunities.

**Potential Negative Impact:**

**Over-reliance on Data:** While the map provides valuable insights, over-reliance on this data could lead to missed opportunities. For instance, areas with lower prices might be up-and-coming neighborhoods with potential for growth.

**Neglecting Other Factors:** Price is just one factor to consider. Other factors like property condition, local amenities, and future development plans also significantly impact the business.


#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Calculate the number of listings in top 10 neighbourhood
neighbourhood_counts = airbnb_df['neighbourhood'].value_counts().reset_index().head(10)
neighbourhood_counts.columns = ['Neighbourhood', 'Listings Count']

# Create an interactive bar chart using Plotly
fig = px.bar(neighbourhood_counts, x='Neighbourhood', y='Listings Count',
             title='Listings Count by Neighbourhood',
             labels={'Listings Count': 'Number of Listings'},
             color='Listings Count',
             color_continuous_scale=px.colors.sequential.Viridis)

fig.update_layout(xaxis_title='Neighbourhood', yaxis_title='Number of Listings', title_font_size=20)
fig.show()

##### 1. Why did you pick the specific chart?

Answer: The visualization I picked here is a bar chart. Bar charts are excellent for comparing a numerical value across categories and give better insight into the data. In this case each bar represents a neighbourhood, and the height of the bar shows its number of listings. For example, Williamsburg has the most listings (3,920). The visualization shows the top 10 neighbourhoods by listing count.

##### 2. What is/are the insight(s) found from the chart?

Answer: The chart helps in finding the number of listings in the top neighbourhoods. Each neighbourhood can then be broken down further to dive deeper into the data and find scope for improvement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: The insight we find here can help solve several problems.

**Strategic planning**: These insights can be used to plan investments by showing which neighbourhoods are the most popular and how many listings each of them has.

**Finding competition**: The insight is useful for gauging competition among hosts. If a host plans to buy or build a property for business purposes, it shows which neighbourhood would suit them best and how many other listings they would compete with.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Create an interactive map of listings using Plotly Express
fig = px.scatter_mapbox(airbnb_df,
                        lat='latitude',
                        lon='longitude',
                        color_continuous_scale=px.colors.cyclical.IceFire,
                        size_max=15,
                        zoom=10)
fig.update_layout(mapbox_style='open-street-map')
fig.update_layout(title='Map of Airbnb Listings')
fig.show()

##### 1. Why did you pick the specific chart?

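Answer: An interactive map places each listing on the street map of the city, which helps with the following: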
**Geographical Distribution**: You can observe the clustering or dispersion of Airbnb listings across the map. Concentrations of markers indicate areas with higher density of listings, which can signify popular or densely populated regions.

**Spatial Relationships**: Insights into proximity to amenities, attractions, or city centers can be inferred by the distribution of markers. Listings closer to central areas might attract higher demand or command premium pricing.

**Market Insights**: Patterns in listing distribution can reveal potential gaps or opportunities in specific neighborhoods or regions, guiding strategic decisions for property acquisition or marketing efforts.


#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Group the DataFrame by the minimum_nights column and count the number of rows in each group
min_nights_count = airbnb_df.groupby('minimum_nights').size().reset_index(name = 'count')

# Sort the resulting DataFrame in descending order by the count column
min_nights_count = min_nights_count.sort_values('count', ascending=False)

# Select the top 15 rows
min_nights_count = min_nights_count.head(15)

# Reset the index
min_nights_count = min_nights_count.reset_index(drop=True)

# Extract the minimum_nights and count columns from the DataFrame
minimum_nights = min_nights_count['minimum_nights']
count = min_nights_count['count']

# Set the figure size
plt.figure(figsize=(12, 4))

# Create the bar plot
plt.bar(minimum_nights, count)

# Add axis labels and a title
plt.xlabel('Minimum Nights', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Stay Requirement by Minimum Nights', fontsize=15)

# Show the plot
plt.show()



##### 1. Why did you pick the specific chart?

Answer: I picked a bar chart because it is well suited to comparing counts across categories. In this case each bar represents a minimum-nights value, and the height of the bar shows how many listings have that minimum-nights requirement.





##### 2. What is/are the insight(s) found from the chart?

Answer: The majority of listings require a minimum stay of one night (nearly 12,700 listings). For 2 nights the listing count is around 11,600, and for 3 nights it is nearly 8,000.
As the minimum number of nights increases, the listing count decreases. This continues up to a minimum of 5 nights, after which there is a sudden drop to barely any listings.
Beyond that, there are a few listings at 10 and 15 nights, and the count rises again at 30 nights, which makes sense because some guests book a place for one month.

##### 3. Will the gained insights help creating a positive business impact?


Answer: The insight gained helps in providing good facilities to customers.

**Strategic Pricing**: Recognizing high demand for specific stay durations (e.g., 5 nights) allows businesses to implement premium pricing strategies for these sought-after options. Adjusting prices accordingly can capitalize on customer preferences and maximize profitability.
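The group, sort and head steps used for this chart can be condensed, since `value_counts` already returns counts sorted in descending order. The sketch below runs on a hypothetical `toy_min_nights` series; `top_min_nights` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy column; in the notebook this would be
# airbnb_df['minimum_nights']
toy_min_nights = pd.Series([1, 1, 1, 2, 2, 3, 30])

# Counts per minimum-nights value, most common first, top 2 kept
top_min_nights = toy_min_nights.value_counts().head(2)
print(top_min_nights)
```

Swapping `head(2)` for `head(15)` would reproduce the selection used in the chart code.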



#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Calculate the percentage distribution of room types
room_type_counts = airbnb_df['room_type'].value_counts()
room_type_percentages = room_type_counts / room_type_counts.sum() * 100

# Plot a pie chart of room type percentages
plt.figure(figsize=(10, 6))
plt.pie(room_type_percentages, labels=room_type_percentages.index, autopct='%1.1f%%',
        startangle=140, colors=sns.color_palette('pastel'))
plt.title('Percentage Distribution of Room Types')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

##### 1. Why did you pick the specific chart?

Answer: The chart I picked here is a pie chart, because pie charts are excellent for visualizing proportions or percentages of a whole.

##### 2. What is/are the insight(s) found from the chart?

Answer: Shared Room has the lowest share (2.4%) and Entire home/apt has the highest (52%).
* **Entire home/apt**: 52%
* **Private Room**: 45.7%
* **Shared Room**: 2.4%


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: The insight here can be used to address a business problem.

* **Shared Room**: There is a lot of scope in this category, since shared rooms make up only 2.4% of listings. We can analyze the areas in which the shared-room concept is working and use that analysis to find new areas where it could be worthwhile to invest. Shared rooms could also be an option in areas where prices are rising.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Work on a copy so airbnb_df keeps all of its rows and its original index
reviews_df = airbnb_df.copy()
reviews_df['last_review'] = pd.to_datetime(reviews_df['last_review'], errors='coerce')

# Drop rows where 'last_review' is NaT (listings without a review date)
reviews_df = reviews_df.dropna(subset=['last_review'])

# Set the 'last_review' column as the index
reviews_df = reviews_df.set_index('last_review')

# Verify the index is a DatetimeIndex
print(type(reviews_df.index))  # Should output <class 'pandas.core.indexes.datetimes.DatetimeIndex'>

# Resample to monthly frequency, calculating the mean price for each month
monthly_avg_price = reviews_df['price'].resample('M').mean()

# Reset the index to get 'last_review' back as a column
monthly_avg_price = monthly_avg_price.reset_index()

# Plot a line chart
plt.figure(figsize=(12, 6))
plt.plot(monthly_avg_price['last_review'], monthly_avg_price['price'], marker='o', linestyle='-')
plt.title('Average Price per Month')
plt.xlabel('Month')
plt.ylabel('Average Price')
plt.grid(True)
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

*Answer*: The visualization I picked here is a **line chart**.

**Line charts** are great for showing how a variable changes over another, which is often time. In this case the average price is plotted by month from **2011** to **2019**.

##### 2. What is/are the insight(s) found from the chart?

Answer: The average price skyrocketed in the last four months of 2013 and then, interestingly, dropped sharply in the first four months of 2014.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer: The insight here is helpful in analyzing trends. The chart shows that from 2017 to 2019 the average price is stable, while before that the prices fluctuate a lot: they spike in 2013 and 2015 and dip in 2014. This suggests that there might have been tax increases or some event in those quarters.
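
Before reading too much into the early spikes, it may help to check how many listings sit behind each monthly mean, since a month with very few listings can swing the average. The following is a minimal sketch on a small made-up frame: the values are invented for illustration, only the `last_review` and `price` column names come from the notebook, and the `min_listings` cutoff is a hypothetical choice.

```python
import pandas as pd

# Toy data, invented for illustration; the notebook itself works on airbnb_df.
toy = pd.DataFrame({
    "last_review": pd.to_datetime([
        "2013-10-05", "2014-01-12", "2014-01-20",
        "2019-06-01", "2019-06-10", "2019-06-15", "2019-06-20",
    ]),
    "price": [900, 60, 80, 120, 150, 130, 140],
})

# Group by the calendar month of the last review and keep both mean and count.
monthly = (
    toy.groupby(toy["last_review"].dt.to_period("M"))["price"]
       .agg(mean_price="mean", n_listings="count")
)

# Hypothetical cutoff: months with fewer listings are flagged as unreliable.
min_listings = 3
monthly["reliable"] = monthly["n_listings"] >= min_listings
print(monthly)
```

In this toy example the very high mean for October 2013 comes from a single listing, which is the kind of pattern worth ruling out before attributing a spike to taxes or events.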

#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(10, 6))
# 'viridis' matches the colors described in the answers below (purple, green, yellow)
sns.scatterplot(x='longitude', y='latitude', hue='room_type', data=airbnb_df, palette='viridis')
plt.title('Map of Listings')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

##### 1. Why did you pick the specific chart?

Answer: The chart I picked here is a scatter plot of listing coordinates, which works as a simple map and provides geographical context for the data. It displays the distribution of Airbnb listings across longitude and latitude, with each point colored according to the type of room (using the 'viridis' color palette).

##### 2. What is/are the insight(s) found from the chart?

Answer: The chart shows the distribution of listings by room type across latitude and longitude. The colors show where the different room types are located and give us a rough idea of how the listings are distributed geographically, and the contrast between colors makes the room types easy to tell apart. Yellow marks shared rooms, green marks entire homes/apartments, and purple marks private rooms. The insight we gain is that toward the outskirts of the urban area, i.e. in Queens and Staten Island, there are very few or no shared rooms.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*Answer*: The insight we got from the above visualization can be used to analyze the geographical areas in which there is scope for improvement.
* **Manhattan**: Manhattan has a lot of private rooms and is the most visited area by number of visitors, with a lot of business travel. Shared-room concepts for students and interns can be improved and invested in there.
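
The visual impression from the map can be backed with numbers by tabulating the share of each room type per borough. The sketch below uses invented toy rows, and it assumes the borough column is called `neighbourhood_group`, which this section of the notebook does not confirm; `room_type` is the column used in the plot above.

```python
import pandas as pd

# Toy listings, invented for illustration only.
toy = pd.DataFrame({
    "neighbourhood_group": [  # assumed column name for the borough
        "Manhattan", "Manhattan", "Manhattan", "Manhattan",
        "Queens", "Queens", "Staten Island", "Staten Island",
    ],
    "room_type": [
        "Private room", "Entire home/apt", "Shared room", "Private room",
        "Private room", "Entire home/apt", "Entire home/apt", "Private room",
    ],
})

# Row-normalised crosstab: each row sums to 1, so cells are shares per borough.
share = pd.crosstab(toy["neighbourhood_group"], toy["room_type"], normalize="index")

# Percentage of shared rooms in each borough, lowest first.
shared_pct = (share["Shared room"] * 100).sort_values()
print(shared_pct.round(1))
```

Run on the real data, a table like this would show directly which boroughs have few or no shared rooms, rather than relying on point colors alone.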

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Calculate the correlation matrix for the numeric columns of airbnb_df
numeric_columns = airbnb_df.select_dtypes(include='number')
correlation_matrix_num = numeric_columns.corr()

# Create a heatmap using seaborn
plt.figure(figsize=(8, 8))
sns.heatmap(correlation_matrix_num, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap for Numerical Columns')
plt.show()


##### 1. Why did you pick the specific chart?

Answer:
* Heatmaps visually summarize complex datasets using color coding.
* More intense colors signify stronger correlations (with the `coolwarm` palette, red for positive and blue for negative).
* They allow for efficient simultaneous analysis of relationships across all variables.
* Heatmaps help identify patterns and relationships that may be less obvious in raw data.

##### 2. What is/are the insight(s) found from the chart?

Answer Here:
The above heatmap shows the following relationships between the variables:
 * **Number of reviews and Reviews per month**: The correlation between number_of_reviews and reviews_per_month is moderately positive, i.e. (0.52). This suggests that listings with more reviews_per_month also have a higher number_of_reviews.

 * **Number of reviews and id**: The correlation between number_of_reviews and id is negative, i.e. (-0.33).
 * **Price and Number of Reviews**: The correlation between price and number_of_reviews is near zero or slightly negative, which indicates that there is no linear relationship between them.

 * **Price and availability_365**: There is not much of a relationship between price and availability_365, i.e. (0.08).
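
Reading exact values off an annotated heatmap is error-prone, so one option is to list the pairs as a ranked table. The following is a rough sketch on invented toy numbers, not the notebook's data; it keeps each pair once by taking the upper triangle of the correlation matrix and sorts by absolute strength.

```python
import numpy as np
import pandas as pd

# Toy numeric frame, invented for illustration only.
toy = pd.DataFrame({
    "price": [100, 150, 80, 200, 120, 90],
    "number_of_reviews": [10, 2, 40, 1, 15, 30],
    "reviews_per_month": [0.5, 0.1, 2.0, 0.1, 0.8, 1.5],
    "availability_365": [300, 10, 200, 365, 50, 120],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute strength, strongest first, keeping the sign visible.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(2))
```

Because the sign is preserved in the output, a table like this also makes it harder to misreport a negative correlation as positive.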

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
numeric_columns = airbnb_df.select_dtypes(include='number')

# Create a pairplot
sns.pairplot(numeric_columns)
plt.show()

##### 1. Why did you pick the specific chart?

Answer:
Visualizing Relationships with Pairplot

**Pairplot Overview:**
Seaborn's pairplot function helps visualize the relationships between different variables in a dataset. It creates a matrix of plots showing both individual (univariate) distributions and relationships between pairs of variables (bivariate distributions).

**Handling Variable Types:**
pairplot can manage both continuous and categorical variables, making it a versatile tool for diverse datasets.

**Univariate and Bivariate Distributions:**
The diagonal plots show the distribution of individual variables, while the off-diagonal plots show how pairs of variables relate to each other.

**High-Level Interface:**
Seaborn offers an easy-to-use, high-level interface for creating informative and attractive statistical graphics.

##### 2. What is/are the insight(s) found from the chart?

Answer:

**Price and Minimum Nights:** There seems to be no strong linear relationship between price and minimum nights. Prices vary widely regardless of the minimum nights required.

**Number of Reviews and Reviews per Month:** There's a positive correlation between the number of reviews and reviews per month. Listings with more reviews per month tend to have a higher total number of reviews.

**Price and Availability:** Price and availability don't show a clear linear relationship. Some expensive listings are available all year, while some affordable ones aren't.

**Number of Reviews and Availability:** Listings with higher availability tend to have more reviews, which might indicate that popular listings are available more often.

**Price and Number of Reviews:** Price and the number of reviews don't have a clear linear relationship. Some highly-priced listings have many reviews, while some don't.

In summary, the pair plot helps us understand how different numerical features in the dataset relate to each other, which can be useful for making data-driven decisions for Airbnb hosting.
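
Several of the observations above say there is no clear linear relationship. A rank-based (Spearman) correlation is one way to check whether a monotonic relationship exists that a few extreme prices are hiding from the Pearson coefficient. The sketch below uses invented toy values with one deliberately extreme listing; the column names `price` and `minimum_nights` follow the dataset, but nothing here reproduces the notebook's results.

```python
import pandas as pd

# Toy data, invented for illustration; the last listing is an extreme outlier.
toy = pd.DataFrame({
    "price": [50, 60, 70, 80, 90, 5000],
    "minimum_nights": [1, 2, 3, 4, 5, 6],
})

# Pearson measures linear association; Spearman works on ranks instead.
pearson = toy.corr(method="pearson").loc["price", "minimum_nights"]
spearman = toy.corr(method="spearman").loc["price", "minimum_nights"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

In this toy case the ordering of the two columns is perfectly monotonic, so Spearman is higher than Pearson, which is pulled down by the single extreme price. On the real data the two coefficients may well agree; the comparison is only a sanity check.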

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer:

**1. Optimize Listing Distribution:**
* Data-Driven Insights: Utilize analytics to pinpoint high-demand and emerging neighborhoods.
* Execution: Encourage new listings in these areas by offering incentives, ensuring balanced distribution across key locations.

**2. Dynamic Pricing Strategies:**
* Data-Driven Insights: Develop a pricing strategy that incorporates seasonal trends, local events, and historical data.
* Execution: Implement machine learning to continually update prices and provide hosts with optimal pricing recommendations.

**3. Host Recognition and Engagement:**
* Data-Driven Insights: Identify and reward top hosts based on guest reviews and booking frequency.
* Execution: Create a rewards program, and organize training sessions and community events to engage and support hosts.

**4. Enhance Guest Experience:**
* Data-Driven Insights: Analyze guest feedback to pinpoint common issues and areas for improvement.
* Execution: Provide hosts with actionable feedback and resources to improve amenities and communication, establishing a continuous improvement loop.

**5. Strategic Investment in Neighborhoods:**
* Data-Driven Insights: Focus investments on neighborhoods with high listings and demand.
* Execution: Collaborate with local businesses and invest in neighborhood enhancement projects to boost attractiveness and safety.

**6. Geographical Insights for Marketing:**
* Data-Driven Insights: Utilize geographical data to understand guest trends and preferences by region.
* Execution: Tailor marketing campaigns to highlight unique neighborhood features and target specific guest experiences.

**7. Optimize Minimum Nights Stayed Policies:**
* Data-Driven Insights: Analyze booking data to understand the impact of minimum stay policies.
* Execution: Adjust minimum stay requirements to balance guest convenience and host profitability, providing data-driven recommendations to hosts.

**8. Diversify Offerings in Cost-Friendly Areas:**
* Data-Driven Insights: Identify budget-friendly neighborhoods with potential for increased demand.
* Execution: Encourage hosts to offer a variety of accommodations and introduce promotions to attract budget-conscious travelers.

**9. Regularly Review and Adapt Strategies:**
* Data-Driven Insights: Continuously monitor market trends, competitor actions, and guest feedback.
* Execution: Establish a regular review process to evaluate and adjust strategies based on data analytics, ensuring agility in a dynamic market.

**Additional Recommendations:**
* Sustainability Initiatives: Promote eco-friendly practices among hosts to attract environmentally conscious travelers.
* Technology Integration: Invest in advanced technologies like AI-powered customer service, virtual tours, and seamless payment systems.
* Localized Experiences: Partner with local entities to offer unique experiences, such as guided tours and cultural activities, to enhance guest stays.

By implementing these refined strategies, Airbnb can optimize its operations, improve experiences for both guests and hosts, and maximize profitability across various neighborhoods. Leveraging data analytics will ensure the business remains responsive to market changes and evolving customer preferences.
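
As a starting point for recommendation 7, minimum-stay requirements can be grouped into buckets and compared on price and review activity. This is only a sketch under stated assumptions: the rows are invented, the bucket edges are a hypothetical choice, and `reviews_per_month` is used as a rough stand-in for booking activity, which the dataset does not record directly.

```python
import pandas as pd

# Toy listings, invented for illustration only.
toy = pd.DataFrame({
    "minimum_nights": [1, 2, 3, 7, 14, 30, 30, 60],
    "price": [120, 100, 110, 95, 90, 80, 85, 70],
    "reviews_per_month": [3.0, 2.5, 2.0, 1.0, 0.6, 0.3, 0.2, 0.1],
})

# Hypothetical bucket edges; intervals are right-inclusive, so 7 falls in "4-7".
bins = [0, 3, 7, 29, float("inf")]
labels = ["1-3", "4-7", "8-29", "30+"]
toy["stay_bucket"] = pd.cut(toy["minimum_nights"], bins=bins, labels=labels)

# Compare listing count, median price, and median review rate per bucket.
summary = toy.groupby("stay_bucket", observed=True).agg(
    listings=("price", "size"),
    median_price=("price", "median"),
    median_reviews_per_month=("reviews_per_month", "median"),
)
print(summary)
```

Medians are used rather than means so that a few very expensive listings do not dominate a bucket.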

# **Conclusion**
The Airbnb EDA Analysis project has yielded significant insights into the New York City Airbnb market. By examining listings distribution, pricing strategies, host performance, and guest behavior, the analysis offers valuable guidance for improving business strategies.

Key recommendations include optimizing listing distribution across neighborhoods, refining pricing strategies to balance competitiveness and revenue, and recognizing top hosts to foster a positive ecosystem. Enhancements based on guest feedback aim to elevate the guest experience.

Seasonal guest trends, neighborhood investment strategies, and geographical pricing insights provide a foundation for targeted marketing and resource allocation. The detailed exploration of neighborhood dynamics through visual tools like sunburst charts enables more precise marketing efforts.

Overall, the findings emphasize the importance of staying adaptive to market changes, understanding guest preferences, and employing flexible strategies to enhance business outcomes. These insights empower hosts to make informed decisions and contribute to the dynamic and efficient Airbnb marketplace in New York City.









### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***