<a href="https://colab.research.google.com/github/anuraj0012/project/blob/module/ANURAJ_EDA_Airbnb_Submission_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual

# **Project Summary -**

### This project centers on analyzing a substantial dataset from Airbnb, consisting of approximately 49,000 observations. The primary objective is to extract meaningful insights that will aid management and stakeholders in making informed decisions to foster business growth and enhance user satisfaction. By identifying patterns in the data, we aim to offer actionable recommendations for expanding and improving the business.

### **Data Analysis Objectives:**

1. **Identify Guest Preferences:** We will analyze data on room types, pricing, and neighborhood preferences to understand what guests favor. This includes examining:
   - **Room Type Preferences:** What types of rooms (e.g., entire home, private room) are most popular.
   - **Price Sensitivity:** Preferred price ranges among different guest segments.
   - **Neighborhood Preferences:** Which neighborhoods are most attractive to guests.

2. **Improve Host Listings:** Insights gained will help hosts adjust their offerings to better meet guest needs. This includes:
   - **Pricing Strategies:** Suggestions on pricing adjustments based on guest preferences and market trends.
   - **Room Type Adjustments:** Recommendations on optimizing room types and amenities.

3. **Create a Filter System for Guests:** Develop a user-friendly filtering system that allows guests to easily find listings that fit their budget and preferences. This system will enhance the booking experience by matching guests with the most relevant listings.

**Data and Tools:**

- **Data Components:** The dataset includes key variables such as listing counts, neighborhood distribution, pricing, reviews, and room type preferences.
- **Data Wrangling:** We will use **pandas** for data cleaning and organization. This involves:
  - Handling missing or inconsistent data.
  - Structuring the dataset to facilitate analysis.

- **Numerical Analysis:** **Numpy** will be employed for numerical computations and creating arrays necessary for ranking and analysis. Key tasks include:
  - Calculating summary statistics.
  - Performing rankings and comparisons.

- **Data Visualization:** To present findings effectively, we will use **Matplotlib** and **Seaborn**. These tools will help in:
  - Creating visual representations of data trends and patterns.
  - Generating charts and graphs that illustrate key insights.

**Expected Insights and Recommendations:**

1. **Guest Preferences:** By analyzing guest preferences, we will identify:
   - The most sought-after room types and price ranges.
   - The most popular neighborhoods, guiding future investments and adjustments.

2. **Host Improvement:** Recommendations will be provided to hosts on:
   - Enhancing their listings based on successful patterns observed.
   - Adapting pricing and room offerings to attract more guests.

3. **Enhanced User Experience:** The filter system will:
   - Allow guests to quickly find listings that meet their specific needs.
   - Improve satisfaction by aligning available options with guest preferences.

**Learning Outcomes:**

- **Business Model Understanding:** Gain insight into how Airbnb operates and how data-driven decisions can impact the business.
- **Technical Skills Development:** Enhance skills in data wrangling, numerical analysis, and data visualization using tools like pandas, numpy, Matplotlib, and Seaborn.
- **Problem-Solving:** Develop the ability to identify and address business challenges through data analysis. Improve critical thinking skills by applying complex concepts to real-world data.

This project will not only deepen my knowledge of the Airbnb business model but also improve my ability to analyze and interpret data in practical scenarios. By translating data insights into actionable strategies, I will contribute to both guest satisfaction and host success, ultimately driving business growth.



# **GitHub Link -**

[ANURAJ_EDA_Airbnb_Submission_.ipynb](ANURAJ_EDA_Airbnb_Submission_.ipynb)

# **Problem Statement**


The task of this project is to analyze the provided dataset to uncover valuable insights. These insights will be used by stakeholders to inform and guide business improvements, ultimately enhancing decision-making and driving strategic growth.

#### **Define Your Business Objective?**

The goal of this project is to uncover opportunities for business improvement by analyzing customer preferences and identifying key patterns. By deriving actionable insights from the data, we aim to enhance customer satisfaction and optimize overall business performance.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import math
import seaborn as sns
import openpyxl
pd.set_option('display.max_columns', 200)

### Dataset Loading

In [None]:
# Load Dataset
airbnb_df = pd.read_csv("/content/Airbnb NYC 2019.csv")  # read_csv already returns a DataFrame

### Dataset First View

In [None]:
# Dataset First Look
airbnb_df.head(5)

This is the first look at the dataset we will be working with. Let's dive deeper into it.

In [None]:
# Let's check for all the columns we have in our dataset.
airbnb_df.columns

Here is the list of all our columns in the dataset.


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
airbnb_df.shape

The shape shows that the dataset has far more rows than columns: 48,895 rows and 16 columns.

### Dataset Information

In [None]:
# Dataset Info
airbnb_df.info()

Here we can see how the columns split into numerical and categorical data types:
*   3 columns have float64 data values
*   7 columns have int64 data type values
*   6 columns have object data type values
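
The dtype breakdown above can also be tallied programmatically instead of reading it off `info()`. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are hypothetical stand-ins for `airbnb_df`, not the real data):

```python
import pandas as pd

# Toy frame standing in for airbnb_df (hypothetical values, not the real data).
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Loft", "Studio", "Room"],
    "price": [120, 80, 45],
    "reviews_per_month": [0.5, 1.2, None],
})

# Count how many columns share each dtype.
dtype_counts = toy_df.dtypes.value_counts()
print(dtype_counts)

# Split the column names into numerical and non-numerical groups.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
other_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)
```

Running the same two calls on `airbnb_df` should reproduce the counts listed above.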

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicated_values = airbnb_df[airbnb_df.duplicated()]
duplicated_values

There are no exact duplicate rows. However, we should also check individual columns such as id and name, since these values are unlikely to repeat across genuinely different listings.

In [None]:
# Let us first check for 'id' column.
duplicated_id = airbnb_df[airbnb_df.duplicated(subset=['id'])]
duplicated_id

As we can see, there are no duplicated 'id' values in our dataset.

In [None]:
# Now let us check for 'name' column.
duplicated_name = airbnb_df[airbnb_df.duplicated(subset=['name'])]
duplicated_name


Now we can see that 998 rows repeat a name that already appears earlier in the dataset. To get an idea of why these names are duplicated, let us inspect one of them.

In [None]:
# We can check for any specific row having the name value included in the duplicated data, using the query command.
# Let's check for where name is 'Superior @ Box House', which is in our duplicated data.
airbnb_df.query('name == "Superior @ Box House"')

Here we can observe that almost all the values are the same except for id, latitude, longitude, number_of_reviews, reviews_per_month, and availability_365. These columns can legitimately differ between listings, so they are not decisive when judging whether two rows are duplicates.
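
Rather than comparing the rows by eye, one way to find which columns differ among rows sharing a name is to count distinct values per column. This is only a sketch on a made-up frame (`toy_df`, its names and values are hypothetical):

```python
import pandas as pd

# Hypothetical listings: "Box House" appears twice with a different id and price.
toy_df = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["Box House", "Box House", "Garden Flat"],
    "host_name": ["Ann", "Ann", "Raj"],
    "price": [180, 199, 90],
})

# All rows whose name occurs more than once (keep=False marks every copy).
dupes = toy_df[toy_df.duplicated(subset=["name"], keep=False)]

# For each duplicated name, count distinct values per column; more than 1 means the column varies.
varying = dupes.groupby("name").nunique().gt(1)
varying_cols = [col for col in varying.columns if varying[col].any()]
print(varying_cols)
```

Columns that never vary within a duplicated name are good candidates for the `subset` used when dropping duplicates.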

In [None]:
# To tackle this issue we take an explicit copy of the columns we will work with
# (currently all 16 of them), so that later changes are made on an independent data frame.
airbnb_df = airbnb_df[['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365']].copy()

Now that we have created this dataframe we will be able to handle duplicated values more efficiently.

In [None]:
# Let's check for the duplicated values of any other row.
airbnb_df.query('name == "Loft w/ Terrace @ Box House Hotel"')

As we can see, almost every value is the same except for the price. The prices may have been adjusted to market conditions, but the hotel is the same.

In [None]:
# In this case we look for duplicates across the columns that matter and may affect the analysis if duplicated.
# Checking for rows where 'name', 'host_name', 'neighbourhood_group', 'neighbourhood' and 'room_type' are all duplicated.

airbnb_df.loc[airbnb_df.duplicated(subset= ['name', 'host_name', 'neighbourhood_group', 'neighbourhood', 'room_type'])]

In [None]:
# Let's check for any specific value.
airbnb_df.query('name == "✿✿✿ COUNTRY COTTAGE IN THE CITY✿✿✿"')

Now we can see that all the values are the same except for last_review and reviews_per_month, which are not important when judging duplication.

To drop the duplicated values we will sort the dataframe so that the latest entries come first. This needs a timestamp, and the 'last_review' column can serve as one.



In [None]:
# Converting the 'last_review' column in a datetime format.
airbnb_df['last_review'] = pd.to_datetime(airbnb_df['last_review'])

In [None]:
# Now let us sort the data by the last_review column
airbnb_df = airbnb_df.sort_values(by='last_review', ascending=False).reset_index(drop=True)

As we can see, there are some NA values in the date column. Let's fill these with the latest date present in our dataframe.

In [None]:
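# Replacing NA values with the latest review date in the dataset.
# Assigning the result back avoids a chained in-place replace, which newer pandas versions do not apply reliably.
airbnb_df['last_review'] = airbnb_df['last_review'].fillna(airbnb_df['last_review'].max())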

# Let's sort the values again
airbnb_df = airbnb_df.sort_values(by='last_review', ascending=False).reset_index(drop=True)

In [None]:
# Dropping duplicated values
airbnb_df = airbnb_df.drop_duplicates(subset=['name', 'host_name', 'neighbourhood_group', 'neighbourhood', 'room_type'], keep='first').reset_index(drop=True)

# Now that the duplicated rows have been dropped, let's check the current shape of our data frame.
airbnb_df.shape # (48655, 16)

# Let's check if there are any duplicated values now
airbnb_df[airbnb_df.duplicated(subset=['name', 'host_name', 'neighbourhood_group', 'neighbourhood', 'room_type'])] # No values

airbnb_df.query('name == "✿✿✿ COUNTRY COTTAGE IN THE CITY✿✿✿"') # The latest value shows up

We have successfully dropped the duplicated values and now we only have latest data values.
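
The sort-then-keep-first pattern used above can be seen in isolation on a tiny example. This is a minimal sketch with made-up rows (`toy_df` and its values are hypothetical); it assumes, as in the notebook, that `last_review` acts as the timestamp:

```python
import pandas as pd

# Hypothetical listings: "Cottage" appears twice with different review dates and prices.
toy_df = pd.DataFrame({
    "name": ["Cottage", "Cottage", "Loft"],
    "host_name": ["Mia", "Mia", "Leo"],
    "price": [100, 110, 250],
    "last_review": ["2019-01-05", "2019-06-20", "2018-11-02"],
})
toy_df["last_review"] = pd.to_datetime(toy_df["last_review"])

# Newest rows first, then keep the first (newest) row of every duplicate group.
latest = (
    toy_df.sort_values("last_review", ascending=False)
    .drop_duplicates(subset=["name", "host_name"], keep="first")
    .reset_index(drop=True)
)
print(latest)
```

Only the "Cottage" row reviewed on 2019-06-20 survives, together with the single "Loft" row.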

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_values = airbnb_df.isnull()
null_value_count = null_values.sum()

In [None]:
# Visualizing the missing values
null_value_count

As we can see, there are missing values in the name and host_name columns. Since the corresponding ids are present, we can still work with these rows. Next, let us check for outliers.

In [None]:
# To check the outliers we need to check the columns which are having numerical data.
airbnb_df.describe()

Looking at this summary, the price and minimum_nights columns are the most concerning ones. In these two columns we need to find the outliers and handle them.

In [None]:
# As we see above, the column name 'calculated_host_listings_count' is quite long
# Let's change it to 'listings'
airbnb_df.rename(columns={
    "calculated_host_listings_count": 'listings'
}, inplace=True)

# Let's check the price column once
airbnb_df['price'].describe()

# As we can see the min price is 0, which is not a realistic nightly rate.
# According to the current website the price range starts from $25.
# So we raise every price below $25 to 25 to handle the low-end outliers in the price column.

airbnb_df['price'] = airbnb_df['price'].clip(lower=25) # cap from below; assigning back avoids a chained in-place replace
airbnb_df['price'].describe()

A few price values were below $25, which is unlikely to be correct, so we raised them to 25. This handles the low-end outliers while keeping the other useful information in those rows.

In [None]:
# Now let's check the minimum_nights column.
airbnb_df['minimum_nights'].describe()

# As the minimum value in this column is 1, we only need to handle the upper limit.
# Any value above 365 days is capped at 365.

airbnb_df['minimum_nights'] = airbnb_df['minimum_nights'].clip(upper=365) # cap from above
airbnb_df['minimum_nights'].describe()

Now that we have capped the outliers, we can move ahead with a dataset that is ready to be wrangled for insights.
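
Before capping values it can help to count how many rows are affected, so the size of the adjustment is known. The sketch below uses a made-up frame (`toy_df` is hypothetical) with the same thresholds as above, 25 for price and 365 for minimum_nights:

```python
import pandas as pd

# Hypothetical values chosen to include both low prices and very long minimum stays.
toy_df = pd.DataFrame({
    "price": [0, 10, 25, 150],
    "minimum_nights": [1, 30, 400, 1250],
})

# How many rows fall outside the accepted limits?
low_price_rows = int((toy_df["price"] < 25).sum())
long_stay_rows = int((toy_df["minimum_nights"] > 365).sum())
print(low_price_rows, long_stay_rows)

# Cap the values at the limits; all other values are left unchanged.
toy_df["price"] = toy_df["price"].clip(lower=25)
toy_df["minimum_nights"] = toy_df["minimum_nights"].clip(upper=365)
print(toy_df)
```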

### What did you know about your dataset?

The Airbnb NYC 2019 dataset comprises 48,895 rows and 16 columns. It includes both numerical and categorical data types, organized as follows:

*   3 columns have float64 data values (Numerical)
*   7 columns have int64 data type values (Numerical)
*   6 columns have object data type values (Categorical)

The primary key of our dataset is the "id" column, which holds a unique ID for each listing.

Here is the list of columns:- ['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']

Data Quality and Handling:

Duplicate Values: The dataset had no exact duplicate rows, but there were 998 duplicate entries in the name column. These duplicates primarily differed in price, likely due to price changes over time. To address this, we sorted the dataframe by the last_review column, which we used as a timestamp to retain the most recent entry.

Missing Values:

last_review: 10052 missing values

reviews_per_month: 10052 missing values

Because the date column had missing values, we replaced the NA values with the latest date present in our dataset so that the column could serve as a complete timestamp.

A few "Names" and "Host Names" are also missing:

name = 16

host_name = 21

We have now successfully formatted our dataframe and it is now ready for data wrangling.

## ***2. Understanding Your Variables***

In [None]:
airbnb_df.head()

In [None]:
# Dataset Columns
df_columns = airbnb_df.columns
df_columns

In [None]:
# Dataset Describe
df_describe = airbnb_df.describe()
df_describe # These are all the numerical variables in our dataset.

### Variables Description

The summary statistics above can be analyzed to derive insights and draw informed conclusions. For instance, although we observe a broad range of prices, the average listing price is approximately $150. This helps us understand the budgets our guests are likely working with.

### Check Unique Values for each variable.

In [None]:
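# Check Unique Values for each variable.
def unique_value(df):
  """Print the unique values of every column in df."""
  for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values for '{column}': {unique_values}")

# Call the function so that the unique values are actually printed.
unique_value(airbnb_df)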

## 3. ***Data Wrangling***

### Data Wrangling Code

Let's create a new column with the price range distribution. We will use it while working with the price column.

In [None]:
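# Write your code to make your dataset analysis ready.
start = 0
end = 10000
# 101 evenly spaced edges give 100 bins of width 100: 0, 100, ..., 10000
breakpoints = np.linspace(start, end, num=101).astype(int)

def price_range(amt):
    """Return the label of the first 100-wide bin containing amt, e.g. '100 - 200'."""
    bp = breakpoints
    for i in range(len(bp) - 1):
        # Both bounds are inclusive, so a boundary price such as 100 lands in the lower bin ("0 - 100").
        if bp[i] <= amt <= bp[i+1]:
            return f"{bp[i]} - {bp[i+1]}"
    return None  # price outside the 0 to 10000 range


airbnb_df['price_range'] = airbnb_df['price'].apply(price_range)
airbnb_df['price_range']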

Creating a price range column offers several benefits:

- **Data Summarization**: It provides a clear summary of the price distribution within our dataset.
- **Visualization**: A price range column is more effective for visualization compared to individual price points.
- **Segmentation and Analysis**: It segments prices into distinct ranges, making it easier to compare different segments.
- **Decision Making**: This information can aid in decision-making by helping tailor recommendations to customer budgets and requirements.
- **Communicating Insights**: It clearly conveys customer budgets and preferences, revealing where most customers fall in terms of purchasing power.
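
As an alternative to the loop-based helper, the same 100-wide ranges could be built with `pd.cut`. This is only a sketch, not the method used in this notebook; the prices below are made up, and the bin edges and label format are assumed to match the ones defined above:

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including boundary values such as 100 and 10000.
prices = pd.Series([25, 99, 100, 101, 250, 10000])

# Same edges as above: 0, 100, ..., 10000, with labels like "0 - 100".
edges = np.arange(0, 10001, 100)
labels = [f"{lo} - {hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# Bins are closed on the right, so 100 falls into "0 - 100";
# include_lowest=True also keeps a price of exactly 0 in the first bin.
binned = pd.cut(prices, bins=edges, labels=labels, include_lowest=True)
print(binned.astype(str).tolist())
```

Because `pd.cut` returns an ordered categorical, `binned.value_counts().sort_index()` lists the ranges in numeric order rather than alphabetical order.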

In [None]:
# Let's take a look at our dataset once.
airbnb_df.head()

# As we can see our dataset is sorted according to the last review, let us sort our dataset according to the price.
airbnb_df.sort_values(by='price', ascending=False, inplace=True)

# Let's take a look at our price sorted dataset once.
airbnb_df.head()

In [None]:
# Let us check our columns and filter those columns that will work as our features.
airbnb_df.columns

In [None]:
# Let's filter the dataframe down to only the required columns.
feature_df = airbnb_df[['id', 'name', 'host_id', 'neighbourhood_group',
       'neighbourhood', 'room_type', 'price', 'minimum_nights',
       'number_of_reviews', 'reviews_per_month', 'listings',
       'availability_365', 'price_range']]

feature_df = feature_df.reset_index(drop=True)
feature_df.head()

In [None]:
# Let's check the top-performing hosts by total listing count.
# We group by 'host_id' so that each group is one host (grouping by 'name' would group individual listings).
# airbnb_df is used here because it still carries the 'host_name' column.
host_groups = airbnb_df.groupby('host_id')
hosts = []
no_of_listings = []
host_prices = []
for host_id, data in host_groups:
  hosts.append(data['host_name'].iloc[0])
  # 'listings' already stores the host's total listing count on every row, so we take it once instead of summing it.
  no_of_listings.append(data['listings'].max())
  host_prices.append(data['price'].mean())

host_df = pd.DataFrame({
    'Host Name': hosts,
    'Total Listings': no_of_listings,
    'Price': host_prices
})
host_df = host_df.sort_values(by='Total Listings', ascending=False).reset_index(drop=True)
top_10_hosts = host_df.head(10)

# Rough revenue proxy: listing count times mean nightly price (not actual booking revenue).
host_df['Revenue'] = (host_df['Total Listings'])*(host_df['Price'])
top_host_revenue = host_df.sort_values(by='Revenue', ascending=False).reset_index(drop=True)
top_host_revenue.head(10)

In [None]:
# First let's check how many neighbourhood_groups are there.
feature_df['neighbourhood_group'].unique() # There are 5 different groups.

# ['Brooklyn', 'Queens', 'Manhattan', 'Staten Island', 'Bronx']
#  let's group our dataset according to the neighbourhood groups.

n_groups = feature_df.groupby('neighbourhood_group')

# Let's check which group is most preferred as per the listings, and no. of reviews.
groups = [] # To save the groups
listings = []
reviews = []
max_price = []
min_price = []
for group, data in n_groups:
  groups.append(group)
  listings.append(data['listings'].sum())
  reviews.append(data['number_of_reviews'].sum())
  max_price.append(data['price'].max())
  min_price.append(data['price'].min())

group_feat_df = pd.DataFrame({
    'Group': groups,
    'Listing Count': listings,
    'No._of_reviews': reviews,
    'Min Price': min_price,
    'Max Price': max_price
})

group_feat_df
# Here we can see that Manhattan group is the most preferred group as per the listings.
# However the most reviews are given to the Brooklyn group.

# Now let's divide our dataset as per the room types, and create their groups.
# First let's check how many room_types there are.

feature_df['room_type'].unique() # There are 3 room types.

# ['Entire home/apt', 'Private room', 'Shared room']
#  let's group our dataset according to the room types.

room_groups = feature_df.groupby('room_type')
rooms = []
room_listings = []
room_reviews = []
max_room_price = []
min_room_price = []
for room_type, room_data in room_groups:
    rooms.append(room_type)
    room_listings.append(room_data['listings'].sum())
    room_reviews.append(room_data['number_of_reviews'].sum()) # needed for the review comparison below
    max_room_price.append(room_data['price'].max())
    min_room_price.append(room_data['price'].min())

room_feat_df = pd.DataFrame({
    'Group': rooms,
    'Listing Count': room_listings,
    'No._of_reviews': room_reviews,
    'Min Price': min_room_price,
    'Max Price': max_room_price
})

room_feat_df
# Here we can see the most preferred room type is the Entire home/apt.
# The most reviews are also given to the Entire home/apt.
# Shared rooms are the least preferred.

In [None]:
# Now let us check how many different neighbourhoods there are in total.
feature_df['neighbourhood'].nunique() # number of distinct neighbourhoods (count() would return the number of rows)

# Now let us check what are the top 10 most preferred neighbourhoods.
# Also, let's check their average pricing.
area_groups = feature_df.groupby('neighbourhood')
areas = []
listing_count = []
avg_price = []
for area, n_data in area_groups:
  areas.append(area)
  listing_count.append(n_data['listings'].sum())
  avg_price.append(round(n_data['price'].mean(), 2))

area_feat_df = pd.DataFrame({
    'Area': areas,
    'Listing Count': listing_count,
    'Average Price': avg_price,
})

area_feat_df = area_feat_df.sort_values(by='Listing Count', ascending=False).reset_index(drop=True)
area_feat_df.head(10)

In [None]:
# Let us now work with relationships.
# Relationship: Price vs. Room Type
# Checking the distribution of prices for different room types.
# "room_groups" is our grouped dataframe we will be using this.

avg_room_price = []
for room, data in room_groups:
  avg_room_price.append(data['price'].mean())

room_vs_price = pd.DataFrame({
    'Room Type': rooms,
    'Avg_Price': avg_room_price
})

room_vs_price

# As we can see here, the Entire home/apt has the highest average price.

In [None]:
# Relationship: Reviews per Month vs. Room Type
# Comparing the distribution of reviews per month for different room types.

total_reviews = []
for room, data in room_groups:
  total_reviews.append(data['reviews_per_month'].sum())

room_vs_reviews = pd.DataFrame({
    'Room Type': rooms,
    'Reviews': total_reviews
})

room_vs_reviews
# As we can see here, the Entire home/apt has the highest total of reviews per month.

### What all manipulations have you done and insights you found?

This dataset had well-distributed information, which was great for our study. Since it's from Airbnb, the locations are very important, and the data was well-organized by location and area.

We performed several manipulations to provide useful insights for stakeholders and help hosts understand customer preferences, as promised in our plan.

Here’s a summary of the steps we took:

1. **Created a Price Range Column**: This column helps categorize prices and gives us a clearer idea of customer budgets. We aimed to make it as precise as possible.

2. **Sorted the Dataset by Price**: We sorted the data by price to highlight the most expensive listings at the top, since the original data was not organized in a meaningful way.

3. **Filtered the Data**: We streamlined the dataset by focusing only on the columns necessary for analysis, making it simpler and less chaotic.

4. **Divided Data into Categories**: We organized the data into different groups to gain specific insights:
   - **Neighborhood Groups**: To identify the most popular neighborhoods.
   - **Room Type Groups**: To find out which room types are preferred.
   - **Neighborhood Area Groups**: To determine the most favored areas within neighborhoods.

We also explored some key relationships to help with decision-making:

1. **Price vs. Room Type**: We examined how prices vary by room type to identify which types are most expensive.

2. **Reviews per Month vs. Room Type**: We looked at the number of reviews for different room types to gauge customer satisfaction and engagement.

3. **Price vs. Neighborhood**: We compared prices across neighborhoods to find out where prices are higher or lower and to understand price variations.

These manipulations were done to gather useful information that can support business growth and improve the experience for both hosts and customers. We will provide detailed insights and visualizations to make the findings clearer and easier to understand.
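
The per-group summaries described above were built with explicit loops. The same kind of table could be produced in one step with `groupby(...).agg(...)`. The sketch below runs on a made-up frame (`toy_df` and its values are hypothetical); the output column names are illustrative and differ from those used in the notebook:

```python
import pandas as pd

# Hypothetical rows standing in for feature_df.
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Queens"],
    "listings": [2, 1, 1, 3],
    "number_of_reviews": [10, 5, 40, 0],
    "price": [200, 150, 90, 60],
})

# One named aggregation per output column: new_name=(source_column, function).
summary = (
    toy_df.groupby("neighbourhood_group")
    .agg(
        listing_count=("listings", "sum"),
        reviews=("number_of_reviews", "sum"),
        min_price=("price", "min"),
        max_price=("price", "max"),
    )
    .reset_index()
)
print(summary)
```

Named aggregation keeps each output column next to the calculation that produces it, which makes it harder for a list to be declared and then never filled.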

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Let's visualize the most preferred neighbourhood group with a bar chart

group_feat_df # As we have already created a dataframe for the groups

plt.figure(figsize=(10,5))
# plt.plot(groups,listings, color='black', marker = "o", markerfacecolor = 'blue', markeredgecolor='blue',linestyle='-')
colors = ['red', 'blue', 'green', 'orange', 'purple'] # To make each bar with a different color
sns.barplot(x="Group", y='Listing Count', data=group_feat_df, hue='Group', palette=colors, legend=False)
plt.xlabel('Groups')
plt.ylabel('Listing Counts')
plt.title('Most preferred neighbourhood group')
for x, y in zip(range(len(groups)), listings):
    plt.text(x, y, f'{y}', ha='center', va='bottom') # To annotate each bar with the exact value
plt.show()

##### 1. Why did you pick the specific chart?

To determine which group is the most preferred, we need to evaluate the values for each group and compare them. A bar chart is the ideal choice for this, as it effectively displays and contrasts the preferences among all the groups.

##### 2. What is/are the insight(s) found from the chart?

It’s clear that Manhattan is the most preferred group by a significant margin, outperforming all others. Brooklyn follows as the second choice among our customers after Manhattan.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Certainly! Understanding our customers' preferences allows us to focus on the areas in high demand and tailor our offerings to meet their needs effectively. In this case, since Manhattan is the most preferred group, we should prioritize targeting our customers in that area to ensure their satisfaction.

Regarding the negative growth in other neighborhoods, the low demand in these areas indicates a need for further investigation. By analyzing why these neighborhoods are less favored, we can identify the root causes of the low demand. Addressing these issues will help us understand the disparities and devise strategies to improve engagement in those areas.
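
For annotating bars with their values, Matplotlib also provides `Axes.bar_label` (available from Matplotlib 3.4), which avoids computing the text positions by hand. This is a minimal sketch with made-up counts (`group_names` and `counts` are hypothetical, not the values from `group_feat_df`):

```python
import matplotlib.pyplot as plt

# Hypothetical groups and counts, for illustration only.
group_names = ["Bronx", "Brooklyn", "Manhattan"]
counts = [3, 12, 20]

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.bar(group_names, counts, color=["red", "blue", "green"])

# bar_label places one text label on top of each bar and returns the created annotations.
annotations = ax.bar_label(bars)
label_texts = [a.get_text() for a in annotations]

ax.set_xlabel("Groups")
ax.set_ylabel("Listing Counts")
ax.set_title("Bar chart with value labels")
print(label_texts)
```

In a notebook the figure is rendered when the cell finishes running.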


#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Let us now visualize the average price division with room types.

data = {
    'Room Type': room_vs_price['Room Type'],
    'Avg_Price': room_vs_price['Avg_Price'],
    'Listings': ((room_feat_df['Listing Count']/room_feat_df['Listing Count'].sum())*100)
}

# Creating a DataFrame
df = pd.DataFrame(data)

df_melted = pd.melt(df, id_vars='Room Type', var_name='Attribute', value_name='Value')
# Creating a grouped bar plot
sns.barplot(x='Room Type', y='Value', data=df_melted, hue="Attribute", palette=['blue', 'grey'], legend=True)

# Applying annotations on the values
for p in plt.gca().patches:
    plt.gca().annotate('{:.1f}'.format(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
                       ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                       textcoords='offset points')

plt.show()

##### 1. Why did you pick the specific chart?

Since we have only 3 room types, a bar chart is a good way to visualize the average price alongside each type's percentage share of the listings.

##### 2. What is/are the insight(s) found from the chart?

We can clearly see a direct relationship between average price and listing share: the Entire home/apt room type has both the highest average price and the largest share of listings. This can be quite useful when making decisions such as pricing.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Seeing what percentage of customers is ready to pay which amount for their preferred room type lets us design better pricing strategies, which should support growth.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Let's visualize the most preferred room type using a pie chart

room_feat_df # Our grouped df created earlier

# Plotting the data into a pie chart
plt.pie(room_feat_df['Listing Count'], labels=room_feat_df['Group'], autopct="%1i%%", explode=(0.1, 0, 0.1), shadow=True)
plt.show()

##### 1. Why did you pick the specific chart?

Given that we only had three types of rooms, a pie chart was an ideal choice. It provides a clear and intuitive visualization of how room listings are distributed, making it easy to compare the proportions of each type. Additionally, the pie chart displays percentage differences, offering a detailed overview of the data and emphasizing the variations among the room types.

##### 2. What is/are the insight(s) found from the chart?

It is evident that the "Entire Home/Apt" room type is the most preferred, indicating that people value their privacy and prefer to stay in a space where they have complete control and minimal interference. On the other hand, shared rooms are the least favored, typically chosen by students or individuals seeking more affordable accommodation. This information can be leveraged to develop more targeted strategies for addressing the needs and preferences of different customer segments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since we can see what the majority of our customers prefer, we can spend advertising money more efficiently and reduce wasted cost. Beyond the majority, knowing which kinds of customers prefer the other room types lets us target them according to their needs. This can grow our consumer market even in low-demand segments, which can work as a decoy for us.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Now let's check which price range has the most listings; it gives us an idea of the most preferred price range of our customers.
# Let's plot the top 10 most preferred price ranges on a chart.
prefered_range = airbnb_df['price_range'].value_counts().head(10)
plt.figure(figsize=(10,5))

# Order the ranges by their numeric lower bound; plain string sorting would put "1000 - 1100" before "200 - 300".
prefered_range_sorted = prefered_range.sort_index(key=lambda idx: idx.str.split(' - ').str[0].astype(int))
# Draw the dots and the line from the same sorted series so that both share one x-axis order.
plt.scatter(prefered_range_sorted.index, prefered_range_sorted, color='blue', marker='o')
plt.plot(prefered_range_sorted.index, prefered_range_sorted, color='black', linestyle='-', linewidth=2, label='Line Connecting Dots')

for x, y in zip(prefered_range_sorted.index, prefered_range_sorted): # To annotate each plotted value
    plt.text(x, y, f'{y}', ha='center', va='bottom')
plt.title("Top 10 Preferred Price Ranges")
plt.xlabel("Price Range")
plt.ylabel("Listings")
plt.grid(color='black')
plt.show()

##### 1. Why did you pick the specific chart?

A line plot with markers gives a graphical view of our customers' price preferences and of how listing counts change from one price range to the next. The chart shows the number of listings falling as the price range increases.

##### 2. What is/are the insight(s) found from the chart?

Looking at this chart, we can clearly see the budget preferences of our customers: the most preferred price range is 0-100. This range contains no listings priced at 0, as we have already dropped them. This gives us an idea of our customers' spending power, which will help us set better and more specific prices. At listing time, we can also make recommendations to customers within their preferred price ranges.
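
For reference, a `price_range` column of this kind can be built with `pd.cut`. This is only a sketch under assumptions: the bin width of 100, the label format, and the toy prices are illustrative choices and not necessarily how the column was created earlier in this notebook.

```python
import pandas as pd

# Toy prices; the values are made up for illustration.
toy_df = pd.DataFrame({"price": [45, 80, 99, 120, 150, 250, 310, 95]})

bin_width = 100  # assumed bin width
upper = int(toy_df["price"].max() // bin_width + 1) * bin_width
edges = list(range(0, upper + 1, bin_width))             # [0, 100, 200, 300, 400]
labels = [f"{lo}-{lo + bin_width}" for lo in edges[:-1]]  # "0-100", "100-200", ...

# right=False puts a price of exactly 100 into "100-200" rather than "0-100".
toy_df["price_range"] = pd.cut(toy_df["price"], bins=edges, labels=labels, right=False)

# The result is an ordered categorical, so sort_index() follows price order
# instead of alphabetical order of the labels.
counts = toy_df["price_range"].value_counts().sort_index()
print(counts)
```

Using an ordered categorical matters once prices pass 1000, because plain string labels such as "1000-1100" would sort before "200-300".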

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, definitely. Customers like personalized interfaces and options, so if we focus on showing them the options that suit them, they will not have to go through a long search to find what they are specifically looking for, and they will appreciate that.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Reviews vs Review per month
sns.scatterplot(x='number_of_reviews', y='reviews_per_month', data=feature_df)

##### 1. Why did you pick the specific chart?

A scatter plot shows how two numeric variables relate and where the points concentrate. Here we can see that the review traffic is not widely scattered and is instead clustered together.

##### 2. What is/are the insight(s) found from the chart?

We can clearly see that there is an outlier. Apart from it, the relationship between the number of reviews and the reviews per month is quite direct: listings with more reviews also tend to receive more reviews per month. This suggests that customers are satisfied where the number of reviews is higher, and that those hosts are getting good reach in the consumer market.
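
Both observations can be quantified. The sketch below uses made-up numbers (`toy_df` is an assumption, only the column names come from our data): it computes a Pearson correlation, a rank (Spearman-style) correlation that is less sensitive to outliers, and flags outliers with the common 1.5 x IQR rule, which is one convention among several.

```python
import pandas as pd

# Toy data; the last row plays the role of the outlier seen in the scatter plot.
toy_df = pd.DataFrame({
    "number_of_reviews": [5, 20, 40, 80, 120, 600],
    "reviews_per_month": [0.2, 0.8, 1.5, 2.9, 4.1, 58.0],
})

x = toy_df["number_of_reviews"]
y = toy_df["reviews_per_month"]

pearson = x.corr(y)                 # linear correlation, sensitive to the outlier
spearman = x.rank().corr(y.rank())  # Pearson on ranks, i.e. a rank correlation

# Flag outliers in reviews_per_month with the 1.5 * IQR rule.
q1, q3 = y.quantile(0.25), y.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = toy_df[y > upper_fence]

print(round(pearson, 3), round(spearman, 3))
print(outliers)
```

A high correlation here says that review volume and review frequency move together. It does not by itself show that the reviews are positive.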

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. We can examine the differences in the reviews received and make specific recommendations to our hosts about what they can change to increase their reach to customers. This could be one of our premium services and could help our hosts considerably.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Now that we have the price ranges with the most listings, let us check the neighbourhoods with the most listings.

areas = []
area_listing = []
area_group = feature_df.groupby('neighbourhood')
for area, data in area_group:
    areas.append(area)
    area_listing.append(data['listings'].sum())

area_data_df = pd.DataFrame({
    "Area": areas,
    "Values": area_listing
})

area_data_df = area_data_df.sort_values(by='Values', ascending=False).reset_index(drop=True)
plt.figure(figsize=(17,5))
colors = ['red', 'blue', 'green', 'orange', 'purple']  # five colors, cycled by seaborn across the ten bars
sns.barplot(x="Area", y='Values', data=area_data_df.head(10), hue="Area", palette=colors, legend=False)
plt.title('Top 10 preferred neighbourhood areas')
plt.xlabel('Area')
plt.ylabel('Listing Counts')
for x, y in zip(area_data_df['Area'].head(10), area_data_df['Values'].head(10)):
    plt.text(x, y, f'{y}', ha='center', va='bottom')

##### 1. Why did you pick the specific chart?

The bar chart clearly reflects the differences between the areas and the gaps between them in listing counts, showing the ten most listed neighbourhoods.

##### 2. What is/are the insight(s) found from the chart?

We can see that the most listed neighbourhood is the 'Financial District', with a major gap between it and the others in the list. This indicates that the Financial District is the most preferred neighbourhood among our customers.
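
The size of that gap can be expressed as a ratio between the first and second neighbourhood. The sketch below assumes, as in the chart code, that the `listings` column holds a per-row listing count; the toy values are made up. It also shows that a single `groupby(...).sum()` yields the same totals as the explicit loop.

```python
import pandas as pd

# Toy data; the counts are made up for illustration.
toy_df = pd.DataFrame({
    "neighbourhood": ["Financial District", "Financial District", "Midtown", "Harlem", "Midtown"],
    "listings": [200, 150, 60, 30, 40],
})

# Total listings per neighbourhood, largest first.
area_totals = toy_df.groupby("neighbourhood")["listings"].sum().sort_values(ascending=False)

# How many times larger the top neighbourhood is than the runner-up.
gap_ratio = area_totals.iloc[0] / area_totals.iloc[1]
print(area_totals)
print(f"Top neighbourhood has {gap_ratio:.1f}x the listings of the second")
```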

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This insight points to the areas with the greatest scope for business, where we can create more opportunities and target our advertising more precisely.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Find the price ranges with the highest listing totals to get a more specific idea of price preference.

# Sum the 'listings' column within each price range. This is computed directly from
# feature_df so the cell does not rely on sorted_range and range_group, which are only
# defined later in the Chart - 8 cell.
range_list_df = (
    feature_df.groupby('price_range')['listings'].sum()
    .rename('Listing Count')
    .rename_axis('Range')
    .reset_index()
)

range_list_df = range_list_df.sort_values(by='Listing Count', ascending=False).reset_index(drop=True)

# Plot the top 5 price ranges.
plt.pie(range_list_df['Listing Count'].head(), labels=range_list_df['Range'].head(), autopct="%1i%%", explode=(0.1, 0.1, 0.1, 0, 0), shadow=True)
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is an effective chart for seeing distributions and preferences at a glance; it makes it easier to notice how listings are divided by price range.

##### 2. What is/are the insight(s) found from the chart?

As we can see from the chart, the top 3 price ranges with the most listings are:

- 200-300: 33%
- 100-200: 28%
- 0-100: 22%
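
One point to keep in mind when reading these percentages: `plt.pie` normalizes the values it is given, so each slice is a share of the five plotted ranges, not of all listings. The sketch below illustrates the difference on made-up totals (`toy_counts` is an assumption, not our actual data).

```python
import pandas as pd

# Toy listing totals per price range; the numbers are made up for illustration.
toy_counts = pd.Series({
    "200-300": 330, "100-200": 280, "0-100": 220,
    "300-400": 100, "400-500": 70, "500-600": 60, "600-700": 40,
})

top5 = toy_counts.nlargest(5)

# Share within the five plotted slices (what the pie chart shows).
share_of_top5 = top5 / top5.sum() * 100
# Share of all listings (what a reader may assume the pie chart shows).
share_of_all = top5 / toy_counts.sum() * 100

print(pd.DataFrame({"% of top 5": share_of_top5.round(1), "% of all": share_of_all.round(1)}))
```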



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can see that the listings are closely divided among these price ranges. This can be a great way to keep track of all the listings that fall within these ranges and to create recommendations according to customers' preferred neighbourhoods.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Let's visualize the relationship between listing price and the number of reviews received.
# Because there are many price ranges, we plot only 10 of them at a time.
# For the listing prices we use price_range.

sorted_range = feature_df.sort_values(by='price_range', ascending=False)
sorted_range = sorted_range.groupby('price_range')
range_group = []
review_count = []
for price_range_label, data in sorted_range:  # named to avoid shadowing the built-in range()
  range_group.append(price_range_label)
  review_count.append(data['number_of_reviews'].sum())

range_df = pd.DataFrame({
    'Range': range_group,
    'Review Count': review_count
})
plt.figure(figsize=(15, 5))
sns.scatterplot(x='Range', y='Review Count', data=range_df.head(10), color='blue', legend=False, marker='o')
plt.plot(range_df['Range'].head(10), range_df['Review Count'].head(10), color='grey', linestyle='--', linewidth=2, label='Line Connecting Dots')
plt.title("Price vs Reviews of top 10 most preferred price ranges")
for x, y in zip(range_df['Range'].head(10), range_df['Review Count'].head(10)):
  plt.text(x, y, f'{y}', ha='center', va='bottom')

In [None]:
# Let us also check the 10 least preferred price ranges
plt.figure(figsize=(15, 5))
sns.scatterplot(x='Range', y='Review Count', data=range_df.tail(10), color='blue', legend=False, marker='o')
plt.plot(range_df['Range'].tail(10), range_df['Review Count'].tail(10), color='grey', linestyle='--', linewidth=2, label='Line Connecting Dots')
plt.title("Price vs Reviews of the 10 least preferred price ranges")
for x, y in zip(range_df['Range'].tail(10), range_df['Review Count'].tail(10)):
  plt.text(x, y, f'{y}', ha='center', va='bottom')

##### 1. Why did you pick the specific chart?

Because the number of reviews differs hugely across the price ranges, a line plot with markers is quite handy: each point can be seen clearly along with its value and its difference from the others.

##### 2. What is/are the insight(s) found from the chart?

As we can see, the price range and the number of reviews have an inverse relationship, which clearly indicates that the budget of the majority of our customer base lies in the 0-200 range.

We can also see a few preferences in the higher-budget section, where there is competition in prices.
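
The inverse relationship can be checked with a rank correlation between the lower bound of each price range and its review total. This is a sketch on made-up numbers: `toy_range_df` is an assumption, and it also assumes the labels have the form "low-high" so the lower bound can be parsed from the text.

```python
import pandas as pd

# Toy review totals per price range; the numbers are made up for illustration.
toy_range_df = pd.DataFrame({
    "Range": ["0-100", "100-200", "200-300", "300-400", "400-500"],
    "Review Count": [5000, 4200, 1500, 600, 200],
})

# Parse the numeric lower bound from labels of the assumed form "low-high".
toy_range_df["lower"] = toy_range_df["Range"].str.split("-").str[0].astype(int)

# Rank correlation (Pearson on ranks): -1 means reviews fall steadily as price rises.
rho = toy_range_df["lower"].rank().corr(toy_range_df["Review Count"].rank())
print(round(rho, 3))
```

A value close to -1 would support the inverse relationship, while a value near 0 would suggest the pattern seen in the first ten ranges does not hold overall.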

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This suggests that we can make our pricing policies more targeted and specific, growing our customer base by offering prices that fit within customers' budgets.

#### Chart - 9 - Correlation Heatmap

In [None]:
# Chart - 9 visualization code
feature_df.describe()

# using corr() function to check the correlation between numeric columns.
corr_df = feature_df[['price', 'minimum_nights', 'number_of_reviews',
             'reviews_per_month', 'listings', 'availability_365']].corr()

# Visualizing correlation using seaborn heatmap, using annotations.
sns.heatmap(corr_df, annot=True)

##### 1. Why did you pick the specific chart?

To check the correlation between numeric columns, a heatmap is a great option: it encodes the strength of each correlation as a color, from lighter to darker, and it gives an at-a-glance view of the data we want to check.

##### 2. What is/are the insight(s) found from the chart?

This chart helped us to identify which pairs of variables have strong positive or negative correlations.

It also helped us understand which variables are most closely related to each other.

It even highlighted potential multicollinearity issues (high correlations between independent variables) if we plan to use regression analysis.
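
To turn the heatmap into a concrete list of candidate multicollinearity pairs, the correlation matrix can be filtered by a threshold. The sketch below uses made-up data, and the cutoff of 0.8 is an assumed rule of thumb rather than a fixed standard.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up for illustration.
toy_df = pd.DataFrame({
    "number_of_reviews": [1, 10, 20, 40, 80],
    "reviews_per_month": [0.1, 1.0, 2.1, 3.9, 8.2],
    "price": [300, 90, 150, 60, 200],
})

corr = toy_df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

threshold = 0.8  # assumed cutoff for a "strong" correlation
strong_pairs = pairs[pairs.abs() > threshold]
print(strong_pairs)
```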

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These correlations will assist us when creating predictive models. This information is useful and important because it reflects the current standing of our business and points to possible opportunities in different areas.

#### Chart - 10 - Pair Plot

In [None]:
# Chart - 10 visualization code
# Let's create a pairplot to see the relationships between a few of our variables.

pair_df = feature_df[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'listings', 'availability_365', 'price_range']]
sns.pairplot(pair_df)

##### 1. Why did you pick the specific chart?

Here we can easily see the distribution of each numerical variable. The pair plot also gives us an overview of our dataset and of the relationships between our variables.

##### 2. What is/are the insight(s) found from the chart?

We can see that the vast majority of our complete data set lies within the price range of 0-2500.
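
A claim like this is easy to back with a number. The sketch below uses made-up prices (`toy_price` is an assumption) to compute the share of listings at or below 2500 along with a few quantiles, which describe the spread more precisely than reading it off the pair plot.

```python
import pandas as pd

# Toy prices; the values are made up for illustration.
toy_price = pd.Series([40, 60, 75, 90, 110, 150, 180, 250, 400, 9000])

# Percentage of listings priced at or below 2500.
share_below = (toy_price <= 2500).mean() * 100

# Median, 90th and 99th percentile of price.
quantiles = toy_price.quantile([0.5, 0.9, 0.99])

print(f"{share_below:.1f}% of listings are priced at or below 2500")
print(quantiles)
```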

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight gives us an overview of our dataset and of the ranges the numerical values fall in, making it easier to decide where to focus our work.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?


To address our business objectives, we have thoroughly analyzed the dataset and drawn several key conclusions. Here are the actionable insights:

1. **Pricing Strategy:** The most preferred price range is $0-300, which should guide our pricing strategy to align with customer expectations.

2. **Room Type Preferences:** The "Entire Home/Apt" is the most favored room type, indicating a strong customer preference for privacy and safety. We should implement relevant policies to enhance security and privacy, which will help build trust and foster goodwill.

3. **Preferred Area:** The Financial District in Manhattan is the most popular area. We should tailor our recommendations and pricing strategies to this area’s preferences.

4. **Customer Feedback:** The Financial District also receives the highest number of positive reviews. Analyzing these reviews can highlight aspects that satisfy our customers, which we can then apply to other neighborhoods to potentially increase business in those areas.

5. **Additional Data:** Including a column for customer occupation in the dataset could provide valuable insights. It would help segment customers by profession, allowing us to better understand the accommodation preferences of different occupational groups.

6. **Budget Insights:** The dataset offers a clear view of the budget preferences of our target market, showing us the preferred pricing ratio.

7. **Business Focus:** We can see where most of our business is concentrated and which customer group is most profitable. This information will help us improve services in key areas and explore opportunities to expand into less saturated sectors.

By leveraging these insights, we can refine our strategies to better meet customer needs, enhance our offerings, and grow our business.
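
As an illustration of how the pricing, room type, and area insights could feed a guest-facing filter, here is a minimal sketch. The function `filter_listings`, its parameters, and the toy data are hypothetical and for illustration only; the column names follow our dataset.

```python
import pandas as pd

def filter_listings(df, max_price, room_type=None, neighbourhood=None):
    """Return listings within budget, optionally narrowed by room type and area.

    Hypothetical helper for illustration; not part of the notebook's analysis.
    """
    mask = df["price"] <= max_price
    if room_type is not None:
        mask &= df["room_type"] == room_type
    if neighbourhood is not None:
        mask &= df["neighbourhood"] == neighbourhood
    # Cheapest matches first.
    return df[mask].sort_values("price")

# Toy listings; the values are made up for illustration.
toy_df = pd.DataFrame({
    "price": [80, 120, 60, 300],
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Entire home/apt"],
    "neighbourhood": ["Harlem", "Financial District", "Harlem", "Financial District"],
})

budget_matches = filter_listings(toy_df, max_price=100)
entire_home_matches = filter_listings(toy_df, max_price=150, room_type="Entire home/apt")
print(budget_matches)
print(entire_home_matches)
```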

# **Conclusion**

Our analysis of Airbnb data shows that about 90% of our business is focused on a specific area, whether it's related to location or pricing. This highlights a strong preference among our customers. To optimize our performance, we should use these insights to:

- **Enhance Popular Areas:** Focus on improving listings and services in the preferred locations or pricing categories.
- **Expand Strategically:** Apply successful strategies from these areas to other locations to boost their appeal.

By aligning our offerings with customer preferences, we can better meet their needs and improve our overall business.
