# **Project Name**    - Airbnb booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

Airbnb has revolutionized the way people travel and experience new places by providing a unique and personalized alternative to traditional hotel stays. By connecting hosts who have extra space in their homes or apartments with travelers looking for a more authentic and local experience, Airbnb has created a thriving community of people from all over the world who are passionate about exploring new cultures and sharing their own.

As Airbnb has grown in popularity, the company has also recognized the value of data analysis in understanding its customers and hosts, making informed business decisions, and identifying new opportunities for growth. With millions of listings on its platform, Airbnb generates a vast amount of data that can be used to gain insights into the behavior and performance of hosts and guests.

This project aims to explore and analyze a dataset of approximately 49,000 Airbnb listings in order to gain a better understanding of the factors that drive success on the platform. The dataset includes both categorical and numeric variables, such as the location of the listing, the room type, the price, the number of reviews, and the yearly availability.

One of the key questions we can explore using this data is what sets successful hosts apart from those who may be struggling. By examining variables such as the number of potential bookings and reviews, the location of the listings, and the room type, we can identify patterns and trends that may be correlated with higher levels of success.

In addition to studying the characteristics of individual hosts, we can also use the data to understand the factors that influence demand for Airbnb listings in different areas. Factors such as the availability of local attractions, the cost of living, and the overall quality of life in an area may all play a role in determining the popularity of Airbnb listings.

Another area of interest for Airbnb is predicting future demand and pricing for its listings. By using machine learning techniques such as regression or classification, we can build models that can predict the likelihood of a listing being booked or the expected price based on factors such as the location, type of property, and amenities offered. These predictions can be valuable for both hosts and guests, as they can help hosts set competitive prices and guests make informed decisions about where to stay.

Overall, the data generated by Airbnb's millions of listings provides a rich source of information that can be used to understand the behavior and preferences of hosts and guests, make informed business decisions, and identify new opportunities for growth and innovation. By exploring and analyzing this data, we can gain a deeper understanding of the factors that drive success on Airbnb and use this knowledge to drive the company's continued success.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


As Airbnb continues to expand, it is important for the company to use data analysis to gain insights into the behavior and performance of hosts and customers on the platform, and to use this knowledge to inform business decisions and identify opportunities for growth and innovation. Specifically, the goal is to understand the factors that drive success on Airbnb, including the characteristics of successful hosts, the demand for listings in different locations, and the factors that influence pricing. The analysis explores the data for signals about demand, pricing, and the busiest hosts, and identifies areas where Airbnb can improve its services or expand its offerings. By understanding these factors, Airbnb can make informed decisions about how to optimize its platform and services to better meet the needs of its users and drive continued growth.

#### **Define Your Business Objective?**

In the context of this project, the business objective is to use data analysis to identify the characteristics of successful hosts, understand the demand for listings in different locations, identify new markets or demographics to target with marketing efforts, and predict factors such as pricing and demand.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
pd.options.display.float_format = "{:.2f}".format


### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount("/content/drive")
path = "/content/drive/My Drive/Airbnb NYC 2019.csv"
airbnb_df = pd.read_csv(path)



### Dataset First View

In [None]:
# Dataset First Look
airbnb_df

In [None]:
# display() is needed because a cell only renders its last expression
display(airbnb_df.head())  # first 5 rows of the dataset
display(airbnb_df.tail())  # last 5 rows of the dataset

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'Shape: {airbnb_df.shape}') #to check no. of rows and columns

### Dataset Information

In [None]:
# Dataset Info
airbnb_df.info() #info method used to check non-Null count and datatype of columns

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
airbnb_df.duplicated().sum()  # number of fully duplicated rows

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
airbnb_df.isnull().sum() #to get the null value count

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(airbnb_df.isnull(),cbar= True);

In [None]:
# First 5 rows of the columns that contain at least one null value
airbnb_df.loc[:,airbnb_df.isna().sum()!=0][:5]

In [None]:
# Missing Value Count Function
def showMissing():
    missing = airbnb_df.columns[airbnb_df.isnull().any()].tolist()
    return missing

missingVal = pd.DataFrame()
missingVal['Missing Data Count'] = airbnb_df[showMissing()].isnull().sum().sort_values(ascending = False)
missingVal['Missing Data Percentage'] = missingVal['Missing Data Count']/len(airbnb_df)*100
print(missingVal)


### What did you know about your dataset?

The following observations can be drawn from the analysis above:

- There are 48,895 observations and 16 features with a mixture of integer, float, and object data types, i.e., the dataset contains both numerical and categorical features.
- The last_review feature is a date but is stored as object, so it needs to be converted to the correct data type.
- The dataset contains no duplicate rows. Duplicates can cause problems in downstream analyses, such as biasing results or making it difficult to summarize the data accurately.
- Some features, namely name, host_name, last_review, and reviews_per_month, have null values.
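The checks behind these observations can be sketched on a small frame. This is a minimal illustration only: the toy values are made up, and only the column names `name` and `last_review` come from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values); only the column names mirror the dataset.
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room"],
    "last_review": ["2019-05-21", None, "2018-11-03"],
})

# Parse the text dates; missing entries become NaT instead of raising an error.
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")

# Number of fully duplicated rows and null count per column.
n_duplicates = int(toy.duplicated().sum())
null_counts = toy.isnull().sum()

print(toy.dtypes)
print(f"Duplicate rows: {n_duplicates}")
print(null_counts)
```

The same three calls (`pd.to_datetime`, `duplicated`, `isnull`) apply unchanged to the full `airbnb_df`.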

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(f'Features : {airbnb_df.columns.to_list()}')


In [None]:
# Dataset Describe
airbnb_df.describe()

### Variables Description

**id** : Unique Id generated while storing data

**name** : Name of the listing

**host_id** : The host id is assigned to each host by Airbnb and is used to identify and distinguish hosts from one another.

**host_name** : The host name is typically the name of the person who owns the property or is authorized to list it on Airbnb.

**neighbourhood_group** : Location of the listing / categorical variable that indicates the general geographic area in which a listing is located.

**neighbourhood** : Neighbourhood (area) in which the listing is located

**latitude** : Latitude of the listing

**longitude** : Longitude of the listing

**room_type** : Type of the listing

**price** : Price of the listing

**minimum_nights** : Minimum number of nights a guest must book


**number_of_reviews** : Number of reviews for the listing

**last_review** : Date of the most recent review

**reviews_per_month** : Average number of reviews that a listing receives per month

**calculated_host_listings_count** : Total number of listings that a host has on the Airbnb platform

**availability_365** : The number of days in a year that a listing is available for booking on the Airbnb platform. It is based on the listing's calendar and reflects the number of future days that the listing is marked as available.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
airbnb_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df = airbnb_df.copy()

In [None]:
# Convert last_review from object to datetime
df['last_review'] = pd.to_datetime(df['last_review'])

# Count of listings where number_of_reviews is 0
len(df[df['number_of_reviews'] == 0])

In [None]:
# Replace null values with sensible defaults

# Categorical columns name and host_name: fill nulls with 'not known'
# (assigning back avoids the deprecated inplace call on a column selection)
df['name'] = df['name'].fillna('not known')
df['host_name'] = df['host_name'].fillna('not known')

# reviews_per_month is filled with 0, since number_of_reviews is 0 for those rows
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

In [None]:
# Drop id (unique per row, so it carries no analytical signal) and
# last_review (it only reflects how recent the data is)
df.drop(['id', 'last_review'], axis=1,inplace = True)
df.isnull().sum()



In [None]:
# Calculate the median of the feature price
median = df['price'].median()

# Replace 0 with the median (assign back instead of inplace on a column selection)
df['price'] = df['price'].replace(0, median)

In [None]:
# Confirm that no listing with a price of 0 remains
len(df[df['price']== 0])

In [None]:
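# Potential bookings per year: how many minimum-length stays fit into the available days
df['number_of_bookings'] = (df['availability_365'] / df['minimum_nights']).astype(int)
# Rough revenue proxy: potential bookings multiplied by the listing price
df['potential_revenue_per_year'] = df['number_of_bookings']* df['price']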


In [None]:
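# Unique room types within each neighbourhood group
for count, group in enumerate(df['neighbourhood_group'].unique(), start=1):
    # Filter to the current group so the room types are specific to it
    room_types = df.loc[df['neighbourhood_group'] == group, 'room_type'].unique()
    print(f'{count}. {group} has room type {room_types}')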

In [None]:
df.groupby('neighbourhood_group')[['price','number_of_reviews', 'reviews_per_month',
 'calculated_host_listings_count','number_of_bookings',
 'potential_revenue_per_year']].agg(['mean', 'median','min','max','sum']).T


In [None]:
#finding total review when price is maximum
df_max_reviews = df[df['price'] == df['price'].max()].groupby('neighbourhood_group')[[
    'reviews_per_month','number_of_bookings']].sum().reset_index().sort_values(
        'reviews_per_month', ascending = False)
max_price = df['price'].max()
print(f'Maximum price in the dataset is {max_price}')
df_max_reviews

In [None]:
#finding total review when price is minimum
df_min_reviews = df[df['price'] == df['price'].min()].groupby('neighbourhood_group')[[
    'reviews_per_month','number_of_bookings']].sum().reset_index().sort_values(
        'reviews_per_month', ascending = False)
Min_price = df['price'].min()
print(f'Minimum price in the dataset is {Min_price}')
df_min_reviews

In [None]:
#finding for price above min and below max
df_between_max_min = df[(df['price'] > df['price'].min()) & (
    df['price'] < df['price'].max())].groupby('neighbourhood_group')[[
    'reviews_per_month','number_of_bookings']].sum().reset_index().sort_values(
        'reviews_per_month', ascending = False)
print(f'Prices considered are above {Min_price} and below {max_price}')
df_between_max_min

In [None]:

#considering only those reviews which are more than average and price range between min and max
df_above_avg_reviews = df[(df['reviews_per_month']> df['reviews_per_month'].mean()) &
    (df['price'] > df['price'].min()) & (df['price'] < df['price'].max())].groupby(
    'neighbourhood_group')[['reviews_per_month','number_of_bookings']].sum().reset_index().sort_values(
        'reviews_per_month', ascending = False)
avg = df['reviews_per_month'].mean()
print(f'Average Review: {avg}')
df_above_avg_reviews

In [None]:
# Label engagement as poor or good relative to the average reviews per month
# (the mean is computed once instead of once per row)
avg_reviews = df['reviews_per_month'].mean()
df['review_quality'] = np.where(df['reviews_per_month'] < avg_reviews,
                                'Poor Engagement', 'Good Engagement')

In [None]:
#checking number of booking based on review quality
pd.DataFrame(df.groupby('review_quality')['number_of_bookings'].value_counts().reset_index()).head()

In [None]:
#checking number of booking based on review quality
pd.DataFrame(df.groupby('review_quality')['number_of_bookings'].value_counts().reset_index()).tail()

In [None]:
# Average price of listings with good engagement but zero potential bookings
df_good = df[(df['number_of_bookings']==0) & (df['review_quality']=='Good Engagement')]
df_good['price'].mean()

In [None]:
# Average price of listings with poor engagement but more than 200 potential bookings
df_bad = df[(df['number_of_bookings']>200) & (df['review_quality']=='Poor Engagement')]
df_bad['price'].mean()

### What all manipulations have you done and insights you found?

To make the data consistent before looking for insights, I applied the following steps:

- **Data type.** The last_review feature holds dates but was stored as object, so I converted it to datetime.
- **Null values.** Nulls can skew results when looking for relationships, trends, and patterns among features, so I handled them before feature engineering. Nulls in name and host_name were replaced with 'not known'. The number of listings with zero reviews was equal to the number of nulls in reviews_per_month; with no reviews the monthly rate should also be zero, so those nulls were replaced with 0.
- **Zero prices.** The minimum price was zero, and the data gives no reason for it (for example, whether a discount was applied). I replaced zeros with the median, which is less sensitive to outliers than the mean and gives a more accurate picture of the central tendency in this case.
- **New features.** I added number_of_bookings and potential_revenue_per_year to estimate the revenue a listing could generate over a year if it were booked for all of its available days.
- **Dropped columns.** The last_review column can show how recent the data is and could support a review-based analysis, but it is not used in this analysis, so I dropped it. I also dropped id because every row has a unique value, so it cannot provide any information about relationships with other features.
- **Zero availability.** Listings with availability_365 equal to zero were left unchanged. A host would normally be available for at least one day in the year, so these values need a closer look to understand how they affect the overall business.

The cleaned data was then exported to CSV format for visual analysis in Tableau.

After making the data consistent, I calculated the mean, median, minimum, maximum, and sum of each numeric feature, and then looked at neighbourhood_group and room type to see what the maximum and minimum prices show about customer engagement. Visualization allows a better analysis, but a few findings stand out already:

- The minimum price was 10 and the maximum price was 10,000 across the neighbourhood groups.
- Listings at the maximum price show very little engagement, mainly because they show no potential bookings.
- Manhattan shows the highest engagement with respect to reviews per month.
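The wrangling steps described above can be sketched end to end on a toy frame. The values below are invented for illustration, and `Series.mask` is used here as one possible way to swap zero prices for the median; the column names are the dataset's own.

```python
import pandas as pd

# Toy data (hypothetical values) using the dataset's column names.
toy = pd.DataFrame({
    "number_of_reviews": [0, 12, 0, 3],
    "reviews_per_month": [None, 1.5, None, 0.2],
    "price": [0, 100, 250, 80],
    "minimum_nights": [1, 3, 30, 2],
    "availability_365": [365, 90, 0, 11],
})

# Check that reviews_per_month is missing exactly where there are zero reviews.
nulls_match = bool(
    (toy["reviews_per_month"].isnull() == (toy["number_of_reviews"] == 0)).all()
)

# Fill the gaps with 0 and swap zero prices for the median price.
toy["reviews_per_month"] = toy["reviews_per_month"].fillna(0)
median_price = toy["price"].median()
toy["price"] = toy["price"].mask(toy["price"] == 0, median_price)

# Whole minimum-length stays that fit into the available days,
# and the revenue proxy derived from them.
toy["number_of_bookings"] = toy["availability_365"] // toy["minimum_nights"]
toy["potential_revenue_per_year"] = toy["number_of_bookings"] * toy["price"]

print(f"Nulls coincide with zero reviews: {nulls_match}")
print(toy)
```

Running the coincidence check before filling with 0 is a cheap way to confirm that the fill value is justified rather than assumed.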

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

Univariate

#### Chart - 1 (Distribution of price)

In [None]:
# Chart - 1 visualization code
sns.histplot(df, x="price");
plt.show()

In [None]:
sns.histplot(df[df['price']<=1000], x= 'price');
plt.show()

In [None]:
#price lower than 1000
price_less_than_1000 = df[df['price'] <= 1000]
col = 'price'
sns.histplot(price_less_than_1000[col], color = '#055E85');
feature = price_less_than_1000[col]
plt.axvline(feature.mean(), color='#ff033e', linestyle='dashed', linewidth=3,label= 'mean');  # Rose-red line marks the mean
plt.axvline(feature.median(), color='#A020F0', linestyle='dashed', linewidth=3,label='median'); # Purple line marks the median

plt.legend(bbox_to_anchor = (1.0, 1), loc = 'upper right')
# Add a title to the plot with custom font size and color
plt.title('Distribution of Price', fontsize=20, color='red')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

The best way to understand the distribution of continuous data such as price is a histogram or a KDE plot, so I used a histogram (sns.histplot) and marked the mean and median on it.

##### 2. What is/are the insight(s) found from the chart?

From the first plot, I found that prices range up to 10,000, which makes the full range hard to read from the graph. Even so, it shows that most listings are priced below 1,000 and very few above 1,000.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of price is important for any business because it can affect the demand for the product or service. If the price is too high, it may discourage potential customers from making a purchase, leading to lower sales and revenue. On the other hand, if the price is too low, the business may not be able to cover its costs (in this case, maintenance costs) or make a profit. Therefore, it is important for businesses to carefully consider the distribution of prices in order to find a balance that maximizes revenue while still being attractive to customers.

Only three listings are priced at 10,000, which is very high compared with the range under 1,000 where most listings fall. This may have a negative effect on business revenue. However, confirming whether it really does requires comparing price with other features.
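What the histogram suggests can also be put into numbers. The following is a minimal sketch on made-up prices (not the dataset's values): it compares the mean with the median, measures the share of listings at or below 1,000, and computes the sample skewness.

```python
import pandas as pd

# Hypothetical prices with one extreme value, mimicking a right-skewed column.
prices = pd.Series([50, 80, 100, 120, 150, 200, 300, 10000], name="price")

mean_price = prices.mean()                # pulled upward by the extreme value
median_price = prices.median()            # stays near the bulk of the data
share_under_1000 = (prices <= 1000).mean()
skewness = prices.skew()                  # positive for a long right tail

print(f"mean={mean_price}, median={median_price}")
print(f"share at or below 1000: {share_under_1000:.1%}")
print(f"skewness: {skewness:.2f}")
```

A mean far above the median together with a positive skewness is the numeric counterpart of the long right tail seen in the plot.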

#### Chart - 2 (Distribution of price using box plot)

In [None]:
# Chart - 2 visualization code
# Calculate Q1, Q3, and IQR
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1

# Identify outliers using the IQR method: rows below the lower fence or above the upper fence
outliers = df[(df['price'] < (Q1 - 1.5 * IQR)) | (df['price'] > (Q3 + 1.5 * IQR))]

# Remove the outliers from the dataset
df_clean = df[((df['price'] >= (Q1 - 1.5 * IQR)) & (df['price'] <= (Q3 + 1.5 * IQR)))]



In [None]:
col = 'price'
sns.histplot(df_clean[col], color = '#055E85');
feature = df_clean[col]
plt.axvline(feature.mean(), color='#ff033e', linestyle='dashed', linewidth=3,label= 'mean');  # Rose-red line marks the mean
plt.axvline(feature.median(), color='#A020F0', linestyle='dashed', linewidth=3,label='median'); # Purple line marks the median

plt.legend(bbox_to_anchor = (1.0, 1), loc = 'upper right')
# Add a title to the plot with custom font size and color
plt.title('Distribution of Price', fontsize=20, color='red')

# Show the plot
plt.show()

In [None]:
#box plot for outlier visualization
sns.boxplot(y='price', data=df_clean).set_title('Price Distribution');

##### 1. Why did you pick the specific chart?

A box plot summarizes continuous data through its quartiles and makes outliers easy to spot, so I used it together with a histogram of the price after removing outliers with the IQR method.

##### 2. What is/are the insight(s) found from the chart?

I found that price was right-skewed, so I tried removing outliers using the IQR method. However, most of this project does not use the cleaned data.
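The IQR rule itself can be sketched on a handful of values. The prices below are invented for illustration; the 1.5 multiplier is the conventional choice also used in the notebook code.

```python
import pandas as pd

# Hypothetical prices with a single extreme value.
prices = pd.Series([40, 60, 80, 100, 120, 140, 160, 5000], name="price")

# Quartiles and interquartile range.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (prices < lower) | (prices > upper)
outliers = prices[is_outlier]
cleaned = prices[~is_outlier]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower}, {upper})")
print(f"Outliers: {outliers.tolist()}")
print(f"Rows kept: {len(cleaned)}")
```

Building one boolean mask and reusing it (negated) for the cleaned data keeps the outlier set and the cleaned set complementary by construction.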

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. As with Chart 1, knowing how prices are distributed helps in setting prices that attract customers while still covering costs such as maintenance.

Only three listings are priced at 10,000, far above the range under 1,000 where most listings fall. Such extreme prices may hurt business revenue, but confirming this requires comparing price with other features.

#### Chart - 3 (Popular neighbourhood_group)

In [None]:
# Chart - 3 visualization code
df_new = pd.DataFrame(df['neighbourhood_group'].value_counts().reset_index())
fig = px.pie(df_new, values='count', names='neighbourhood_group', title='neighbourhood_group');
fig.update_traces(textposition='outside', textinfo='percent+label');
fig.show();



##### 1. Why did you pick the specific chart?

A pie chart is easy to understand because each slice represents a different category, which helps in comparing proportions. A pie chart of the neighbourhood_group field shows the distribution of listings across the different neighbourhood groups.

##### 2. What is/are the insight(s) found from the chart?

Among all the areas, Manhattan and Brooklyn have the most Airbnb listings and Staten Island has the fewest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight is useful for understanding the popularity of different neighbourhood groups and for comparing the supply of listings between them.

With comparatively few listings, Staten Island will generate less revenue, so a change in marketing strategy, such as promotional offers, may help.

#### Chart - 4 (Popular Host)

In [None]:
# Chart - 4 visualization code
#grouping all host id to know top 10 host
df_popular_host = df.groupby(['host_id','host_name'])['calculated_host_listings_count'
                                ].max().reset_index().sort_values(
                            'calculated_host_listings_count',ascending = False)[:10]

plt.bar(df_popular_host['host_name'], df_popular_host['calculated_host_listings_count'])
plt.xlabel('Host Name')
plt.ylabel('Calculated Host Listings Count')
plt.title('Bar Plot of Calculated Host Listings Count by Host Name')
plt.xticks(rotation=90, color='red')  # Rotate x-axis labels for better readability
plt.show()


##### 1. Why did you pick the specific chart?

Host names are discrete categories, so a bar plot is the best option for comparing listing counts across hosts.

##### 2. What is/are the insight(s) found from the chart?

Sonder (NYC) is the host with the highest number of Airbnb listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Sonder (NYC) has the maximum number of listings, which means this host is likely to generate more revenue.

There are hosts with only one listing. If they do not generate a profit, they can hurt the brand, so strategies can be designed to help them generate more revenue.

Bivariate

#### Chart - 5 (Price Point of neighbourhood_group)

In [None]:
# Chart - 5 visualization code
#grouping and taking mean of price
df_avg_price = df.groupby('neighbourhood_group')['price'].mean().reset_index().sort_values(
    'price',ascending = False)

#line plot
plt.figure(figsize=(8,6));

ax = sns.lineplot(data = df_avg_price,x='neighbourhood_group', y = 'price',
  marker= 'o', color = 'green',linewidth=2);

# Set the font size of the tick labels to 12
ax.tick_params(axis='both', which='major', labelsize=12);
# Set the x-label with a font size of 14
ax.set_xlabel("neighbourhood_group", fontsize=14)

# Set the y-label with a font size of 14
ax.set_ylabel("Price", fontsize=14)
plt.title('Price Point of neighbourhood_group', fontsize=20, color='red');



##### 1. Why did you pick the specific chart?

I wanted to see how the average price changes across the neighbourhood groups. Price is continuous data (data that can take any value), and a line plot with markers makes the change from one group to the next easy to follow.

##### 2. What is/are the insight(s) found from the chart?

The average price of listings in Manhattan is higher than in the other groups. The earlier analysis showed that Staten Island has the fewest listings, yet its price point is relatively high compared with the others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The price point of a listing is the specific price at which the service is offered. Manhattan has a competitive average price of 196, which is a good value for generating profitable revenue and enhancing brand perception.

The price point in Staten Island is relatively high given how few listings it has, which can decrease demand and cause negative brand perception.
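Because price is right-skewed, a group average can be driven by a few expensive listings, especially in a group with few listings. A minimal sketch of how the mean and median can be compared per group follows; the prices are made up and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical listings: one small group contains a single very expensive listing.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 3 + ["Staten Island"] * 3,
    "price": [150, 200, 250, 60, 70, 5000],
})

# Mean, median, and number of listings per neighbourhood group.
summary = toy.groupby("neighbourhood_group")["price"].agg(["mean", "median", "count"])

print(summary)
```

In this toy data the smaller group has the higher mean but the lower median, which is why reporting both figures side by side can prevent a misleading reading of the average price.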

#### Chart - 6 (Customer Engagement)

In [None]:
# Chart - 6 visualization code
df_engage = df.groupby(['name'])['number_of_reviews'].sum().reset_index().sort_values(
    'number_of_reviews',ascending = False)[:10]
#barplot
plt.figure(figsize=(8,5));
ax = sns.barplot(data = df_engage, y='name', x = 'number_of_reviews');

# Set the font size of the tick labels to 10
ax.tick_params(axis='both', which='major', labelsize=10);
# Set the x-label with a font size of 12
ax.set_xlabel("Total Number of Reviews", fontsize=12, color='red')

# Set the y-label with a font size of 12
ax.set_ylabel("Airbnb Name", fontsize=12, color='red')
plt.title('Top 10 Customer Engaging Airbnb', fontsize=20, color='purple');

##### 1. Why did you pick the specific chart?

I used a bar plot to show the total number of reviews for the ten most-reviewed listings, since a bar plot makes it easy to compare data across categories.

##### 2. What is/are the insight(s) found from the chart?

Judging by the total number of reviews, most of the top listings are in the Manhattan and Brooklyn neighbourhood groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Customer engagement matters to any business, and higher engagement generally points to more bookings and revenue.

Negative impact is hard to judge from review counts alone. It would require sentiment analysis of the review text, which this dataset does not include.

#### Chart - 7 (Customer Engagement)

In [None]:
# Chart - 7 visualization code
#reviews_per_month
df_engage2 = df.groupby(['name'])['reviews_per_month'].sum().reset_index().sort_values(
    'reviews_per_month', ascending = False)[:10]
#barplot
plt.figure(figsize=(8,5));
ax = sns.barplot(data = df_engage2, y='name', x = 'reviews_per_month');

# Set the font size of the tick labels to 10
ax.tick_params(axis='both', which='major', labelsize=10);
# Set the x-label with a font size of 12
ax.set_xlabel("Reviews Per Month", fontsize=12)

# Set the y-label with a font size of 12
ax.set_ylabel("Airbnb Name", fontsize=12)
plt.title('Top 10 Customer Engaging Airbnb', fontsize=20, color='purple');

##### 1. Why did you pick the specific chart?

I used a bar plot to show reviews per month for the ten most engaging listings, since a bar plot makes it easy to compare data across categories.

##### 2. What is/are the insight(s) found from the chart?

Judging by reviews per month, most of the top listings are in the Manhattan and Brooklyn neighbourhood groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Customer engagement matters to any business, and higher engagement generally points to more bookings and revenue.

Negative impact is hard to judge from review counts alone. It would require sentiment analysis of the review text, which this dataset does not include.

#### Chart - 8 (Customer Engagement)

In [None]:
# Chart - 8 visualization code
#grouping all host id to know top 10 host
df_popular_host = df.groupby(['host_id','host_name'])['reviews_per_month'].sum().reset_index().sort_values(
                            'reviews_per_month',ascending = False)[:10]
#barplot
ax = sns.barplot(df_popular_host,x='host_name',y='reviews_per_month');

# Set the x-label with a font size of 10
ax.set_xlabel("Host Name", fontsize=10)
plt.xticks(fontsize = 8, rotation = 90);

# Set the y-label with a font size of 10
ax.set_ylabel("Reviews Per Month", fontsize=10)
plt.title('Top 10 Customer Engaging Host', fontsize=20, color='purple');

##### 1. Why did you pick the specific chart?

I used a bar plot to show reviews per month for the ten most engaging hosts, since a bar plot makes it easy to compare data across categories.

##### 2. What is/are the insight(s) found from the chart?

The host named Sonder has the most reviews per month, ahead of the other hosts by a huge margin.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Customer engagement matters to any business, and higher engagement generally points to more bookings and revenue.

Negative impact is hard to judge from review counts alone. It would require sentiment analysis of the review text, which this dataset does not include.

#### Chart - 9 (customer engagement with room type)

In [None]:
# Chart - 9 visualization code
#checking engagement based on room type
df_room = df.groupby('room_type')['reviews_per_month'].sum().reset_index().sort_values(
    'reviews_per_month', ascending = False)
ax = sns.barplot(df_room,x='room_type',y='reviews_per_month')
# Set the x-label with a font size of 12
ax.set_xlabel("Room Type", fontsize=12)

# Set the y-label with a font size of 12
ax.set_ylabel("Reviews Per Month", fontsize=12)
plt.title('Customer Engagement based on Room Type', fontsize=14, color='purple');

##### 1. Why did you pick the specific chart?

I used a bar plot to show the relative share of reviews per room type, since a bar plot makes it easy to compare data across categories.

##### 2. What is/are the insight(s) found from the chart?

Entire home/apt and private room have more reviews as compared to any other room type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Customer engagement is the heart and soul of any business, and higher engagement generally supports higher profit.

A negative impact could only be identified through sentiment analysis of the reviews. It is hard to predict from review counts alone, and this dataset does not contain the data needed for that analysis.
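
Because the chart sums reviews_per_month, room types with more listings will tend to score higher. A minimal sketch of comparing the total with the per-listing mean, assuming a small made-up DataFrame (`toy` is hypothetical and only mirrors the `room_type` and `reviews_per_month` columns):

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be `df`.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Private room", "Shared room"],
    "reviews_per_month": [1.0, 3.0, 2.0, 2.0, 2.0, 4.0],
})

# Total reviews per month (what the chart plots) next to the
# per-listing mean and the number of listings in each room type.
engagement = (
    toy.groupby("room_type")["reviews_per_month"]
       .agg(total="sum", per_listing="mean", listings="count")
       .sort_values("total", ascending=False)
)
print(engagement)
```

If the ranking by `per_listing` differs from the ranking by `total`, the difference is driven by how many listings each room type has rather than by engagement per listing.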



#### Chart - 10 (price distribution with roomtype)

In [None]:
# Chart - 10 visualization code
# Note: this analysis uses df_clean, the data without outliers that was cleaned earlier in the analysis
plt.figure(figsize=(12,6))
sns.violinplot(data=df_clean,x="room_type", y="price").set_title(
    'Price Distribution for Room Type',fontsize=20, color='purple')
plt.show()

##### 1. Why did you pick the specific chart?

I used a violin plot because it is mainly used to study the distribution of quantitative data across one categorical variable. Here it shows how price is distributed and lets us compare the price distributions between groups, in this case the different room types.

##### 2. What is/are the insight(s) found from the chart?

The entire home or apartment room type has a higher price range than private and shared rooms.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Different price ranges cater to guests with different budgets, and the higher-priced room types offer more room for profit.

No negative impact is evident from this chart.

#### Chart - 11 (Possible Revenue Generation based on Availability)

In [None]:
# Chart - 11 visualization code
# Total potential revenue per host, keeping the top 10
df_revenue = df.groupby(['host_id','host_name'])[
    'potential_revenue_per_year'].sum().reset_index().sort_values(
        'potential_revenue_per_year', ascending = False)[:10]

plt.figure(figsize=(10,8));
sns.barplot(data=df_revenue, x='host_name', y='potential_revenue_per_year');
plt.title('Possible Revenue Per Year',fontsize=20,color='red');

##### 1. Why did you pick the specific chart?

A bar plot makes it easy to compare potential revenue per year across individual hosts.

##### 2. What is/are the insight(s) found from the chart?

Sonder (NYC) could generate the highest revenue if every available day were booked, with each booking lasting the minimum number of nights.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This analysis models the ideal case, to understand the total revenue each host could generate based on the possible number of bookings.
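
The ideal-case estimate can be illustrated with a small sketch. The formulas below are assumptions for illustration only (the notebook derives `number_of_bookings` and `potential_revenue_per_year` earlier and may define them differently), and `toy` is hypothetical data:

```python
import pandas as pd

# Hypothetical toy listings; column names mirror the dataset.
toy = pd.DataFrame({
    "host_name": ["A", "A", "B"],
    "price": [100, 200, 150],
    "minimum_nights": [2, 30, 1],
    "availability_365": [365, 90, 10],
})

# ASSUMPTION: each booking lasts exactly `minimum_nights` and every
# available day is filled, so bookings = availability_365 // minimum_nights.
toy["number_of_bookings"] = toy["availability_365"] // toy["minimum_nights"]

# ASSUMPTION: revenue per booking = nightly price * minimum_nights.
toy["potential_revenue_per_year"] = (
    toy["price"] * toy["minimum_nights"] * toy["number_of_bookings"]
)

# Sum over each host's listings, highest first.
revenue_by_host = (
    toy.groupby("host_name")["potential_revenue_per_year"]
       .sum()
       .sort_values(ascending=False)
)
print(revenue_by_host)
```

Under these assumptions a host with many listings and high availability dominates the ranking, which matches the shape of the chart above.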

#### Chart - 12 (High Price Hosts with Few Bookings)

In [None]:
# Chart - 12 visualization code
# Highest price per host and booking count, keeping hosts with at most 5 bookings
df_non_functional = df.groupby(['host_id', 'host_name','number_of_bookings'])[
    'price'].max().reset_index().sort_values('price', ascending = False)
df_non_functional = df_non_functional[df_non_functional['number_of_bookings']<=5]
# Bar plot of the 10 highest-priced hosts among them
ax = sns.barplot(data=df_non_functional.head(10), x='host_name', y='price');
ax.tick_params(axis='x', which='major', labelsize=10, rotation=90)
# Set the x-label with a font size of 25
ax.set_xlabel("Host Name", fontsize=25)

# Set the y-label with a font size of 25
ax.set_ylabel("Price", fontsize=25)
plt.title('Top 10 Hosts with High Price and at most 5 Bookings per Year', fontsize=20, color='purple');


##### 1. Why did you pick the specific chart?

A bar plot makes it easy to compare the price charged by each host.

##### 2. What is/are the insight(s) found from the chart?

Jelena, Kathrine, Erin, Olson, and Amy have the highest prices but 5 or fewer bookings per year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This analysis models the ideal case, to understand the total revenue each host could generate based on the possible number of bookings.

There are hosts like Jelena Catherine who offer 5 or fewer bookings per year at a price far above the average price on Airbnb. This can harm the brand, and when such listings receive no bookings the revenue they generate is zero, so they bring no profit to the business.

#### Chart - 13 (Price Point for each Room Type in each Neighbourhood group)

In [None]:
# Chart - 13 visualization code
# Average price per neighbourhood group and room type
df_avgPrice_roomType = df.groupby(['neighbourhood_group','room_type'])[
    'price'].mean().reset_index().sort_values('price',ascending=False)

plt.figure(figsize=(10,8));
ax = sns.barplot(data= df_avgPrice_roomType,x='neighbourhood_group',y='price',hue='room_type');

# Set the font size of the tick labels to 12
ax.tick_params(axis='both', which='major', labelsize=12);
# Set the x-label with a font size of 14
ax.set_xlabel("Neighbourhood Group", fontsize=14)

# Set the y-label with a font size of 14
ax.set_ylabel("Average Price", fontsize=14)
plt.title('Price Point for each Room Type in Neighbourhood Group', fontsize=20, color='purple');

##### 1. Why did you pick the specific chart?

A grouped bar plot is a good way to visualize three variables (neighbourhood group, room type, and average price) in one chart.

##### 2. What is/are the insight(s) found from the chart?

Manhattan has the highest price point across room types and the Bronx has the lowest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The spread of price points supports maximum profit while keeping an affordable price range for all classes of people.

A lower average price suggests a larger number of listings that may not be generating the required profit, so the marketing strategy for those areas may need to change.
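
The grouped averages behind this chart can also be read as a table. A minimal sketch, assuming a small made-up DataFrame (`toy` is hypothetical; in the notebook this would be `df`), using `pivot_table` to put neighbourhood groups on rows and room types on columns:

```python
import pandas as pd

# Hypothetical toy data with the same three columns used in the chart.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Bronx", "Bronx"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Entire home/apt", "Private room"],
    "price": [250, 120, 150, 100, 60],
})

# Mean price for every (neighbourhood group, room type) combination.
price_table = toy.pivot_table(
    index="neighbourhood_group",
    columns="room_type",
    values="price",
    aggfunc="mean",
)
print(price_table.round(1))
```

Reading across a row compares room types within one neighbourhood group, and reading down a column compares neighbourhood groups for one room type.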

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
heatmap_df = df[[
    'host_id', 'price', 'minimum_nights', 'number_of_reviews',
    'calculated_host_listings_count', 'availability_365',
    'number_of_bookings', 'potential_revenue_per_year',
]]
sns.heatmap(heatmap_df.corr(), annot=True, linewidth=.5, cmap="PiYG");


##### 1. Why did you pick the specific chart?

I chose this chart because a correlation heatmap is the easiest way to identify which variables are correlated and how strong the correlation is. It can help identify multicollinearity, which is useful when deciding which variables to include in a model or analysis.

##### 2. What is/are the insight(s) found from the chart?

The features availability_365 and number_of_bookings are highly correlated, which indicates multicollinearity. As a result, these variables should not be used together to train a model: either they should be combined using a formula or relation, or one of them should be dropped from further analysis.
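
Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. A minimal sketch, assuming made-up data (`toy` is hypothetical) and an arbitrary cut-off of 0.8 for "high" correlation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; `b` is built to track `a` closely.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [5, 1, 4, 2, 6, 3],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each
# pair of columns appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# ASSUMPTION: 0.8 is an arbitrary threshold chosen for illustration.
threshold = 0.8
high_pairs = pairs[pairs.abs() > threshold]
print(high_pairs)
```

In the notebook, replacing `toy` with `heatmap_df` would list the feature pairs that are candidates for combining or dropping.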

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
pairplot_df= df[['neighbourhood_group','price','availability_365']]
sns.pairplot(pairplot_df, hue="neighbourhood_group");

##### 1. Why did you pick the specific chart?

A pairplot can be a valuable tool for data analysis when trying to understand and analyze the relationships between different variables and identify patterns and trends in the data.

##### 2. What is/are the insight(s) found from the chart?

It shows the relation between price and availability_365, with neighbourhood_group as the hue parameter.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the analysis, it is clear that:

1. There are hosts that take 0 bookings for the entire year and are not even available for booking, yet list a high booking price. These listings should either be removed or undergo physical verification to learn the exact reason.
2. To increase profit for hosts with few listings, Airbnb should give them special promotional offers and personalised marketing to promote business in their specific regions.
3. Price points in a few areas are relatively high compared with others given the number of listings in those other areas; these could be adjusted.
4. Hosts who provide the highest customer engagement should be rewarded to keep them motivated; this will help with host retention and profit generation.
5. Since entire home/apt and private room listings get the most reviews per month, these kinds of listings could be increased based on customer demand.


# **Conclusion**

From the above analysis, we could conclude that:

- Most Airbnb prices fall below 1000, with the average price in the range of 150-200.
- Among all the areas, Manhattan and Brooklyn have the most Airbnb listings and Staten Island has the fewest.
- Sonder (NYC) is the most popular host, with the highest number of Airbnb listings.
- The prices Airbnb offers in Manhattan are higher than in other areas.
- The listing "Room near JFK Queen Bed" has the highest number of reviews, but by average reviews per month the listing "Enjoy great views of the City in our Deluxe Room!", the host Sonder (NYC), and the Entire home/apt room type show the highest customer engagement.
- Sonder (NYC) could generate the highest revenue if every available day were booked, with each booking lasting the minimum number of nights.
- Manhattan has the highest price point across room types and the Bronx has the lowest.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***