<a href="https://colab.research.google.com/github/akashjimain/AirBnb-Booking-Analysis-EDA-Project/blob/main/AirBnb_Booking_Analysis_EDA_Submission_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - AirBnb Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Akash Tiwari


# **Project Summary -**

**Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalized way of experiencing the world. Today, Airbnb is a one-of-a-kind service that is used and recognized around the world. Analyzing the millions of listings provided through Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers' and hosts' behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.
This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values.**


# **GitHub Link -**

https://github.com/akashjimain/AirBnb-Booking-Analysis-EDA-Project

# **Problem Statement**


# **Explore and analyze the data to discover key understandings (not limited to these) such as :**


*   What can we learn about different hosts and areas?
*   What can we learn from predictions? (ex: locations, prices, reviews, etc)

*   Which hosts are the busiest and why?
*   Is there any noticeable difference of traffic among different areas and what could be the reason for it?






#### **Define Your Business Objective?**

The goal of data exploration is to learn about the characteristics and potential problems of a data set without formulating assumptions about the data beforehand. In statistics, data exploration is often referred to as “exploratory data analysis” and contrasts with traditional hypothesis testing.

It can help with the detection of obvious errors, a better comprehension of data patterns, the detection of outliers or unexpected events, and the discovery of interesting correlations between variables.

Data science technologies are at the core of identifying drivers of trust in order to engage more users and to find novel ways to build trust. Data science is a key differentiator in the rapid growth of Airbnb and in how it makes better recommendations by matching the right people together.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly
import missingno as msno
from wordcloud import WordCloud, ImageColorGenerator

### Dataset Loading

In [None]:
# Load Dataset

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Give path to access data

path= '/content/drive/MyDrive/Airbnb NYC 2019.csv'

In [None]:
# Read files using pandas module.
airbnb = pd.read_csv(path)

### Dataset First View

In [None]:
# Dataset First Look
airbnb.head()

In [None]:
# Read The Last five Rows
airbnb.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
airbnb.shape

### Dataset Information

In [None]:
# Dataset Info
airbnb.info()

In [None]:
# Dataset Columns
airbnb.columns

In [None]:
# Dataset Describe
airbnb.describe(include = 'all')

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(airbnb[airbnb.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(airbnb.isnull().sum())

In [None]:
# Visualizing the missing values
# The matrix below shows the NaN values in each feature of the data
# The horizontal white lines in each column represent NaN values
# The last_review and reviews_per_month columns contain the most NaN values

msno.matrix(airbnb)



### Data Cleaning

In [None]:
# Drop the columns that are not needed for the analysis: id, host_name and last_review
airbnb.drop(['id','host_name','last_review'],axis=1,inplace=True)
airbnb.head()

### What did you know about your dataset?

Of the remaining columns, reviews_per_month contains the most NaN values.
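
A minimal sketch of how the share of missing values per column could be checked. The `toy` frame below is made-up illustration data, not the Airbnb file; on the real data the same expression would be applied to `airbnb`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real `airbnb` DataFrame
toy = pd.DataFrame({
    "name": ["Cozy room", None, "Loft", "Studio"],
    "price": [100, 80, 250, 120],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting the result puts the column with the most gaps at the top, which makes it easy to decide which columns need filling or dropping.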

### Handling missing and NaN values in reviews_per_month

In [None]:
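# Replace missing values in 'reviews_per_month' with zero and in 'name' with 'Absent'.
# Assigning the result back avoids chained in-place modification, which newer pandas versions warn about.
airbnb['reviews_per_month'] = airbnb['reviews_per_month'].fillna(0)
airbnb['name'] = airbnb['name'].fillna('Absent')
airbnb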

In [None]:
# Check whether any missing values remain in the data.
airbnb.isnull().any()
# 'False' for every column means there are no missing values left.

In [None]:
# let us check shape of our dataframe now.
airbnb.shape

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
airbnb.columns

In [None]:
# Dataset Describe
airbnb.describe(include = 'all')

### Variables Description



*   **id :** Unique ID of the listing
*   **name :** Name of the listing
*   **host_id :** Unique ID of the host
*   **host_name :** Name of the host
*   **neighbourhood_group :** Borough (location)
*   **neighbourhood :** Area within the borough
*   **latitude :** Latitude coordinate
*   **longitude :** Longitude coordinate
*   **room_type :** Type of listing
*   **price :** Price of the listing
*   **minimum_nights :** Minimum number of nights to be paid for
*   **number_of_reviews :** Number of reviews
*   **last_review :** Date of the last review
*   **reviews_per_month :** Number of reviews per month
*   **calculated_host_listings_count :** Total number of listings of the host
*   **availability_365 :** Number of days in the year the listing is available









### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in airbnb.columns.tolist():
  print("No. of unique values in ",i,"is",airbnb[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Let us find unique areas in data using 'neighbourhood'column.
airbnb.neighbourhood.unique()


In [None]:
# Let us find total count of areas (neighbourhood)  in dataset
len(airbnb.neighbourhood.unique())

In [None]:
# Let us first find total number of hosts in dataset
Number_of_hosts=airbnb.host_id.unique()
Number_of_hosts

In [None]:
len(Number_of_hosts)

In [None]:
top_host_id = airbnb['host_id'].value_counts().head(10)
top_host_id

In [None]:
# Build a two-column frame with explicit names. This works across pandas versions,
# whose default column names for value_counts() differ.
top_host_df = top_host_id.rename_axis('Host_ID').reset_index(name='P_Count')
top_host_df

In [None]:
top_host_id.sum()

In [None]:
# Let us find the location with the maximum number of listings in the given data.
airbnb.neighbourhood_group.unique()

In [None]:
locations = airbnb['neighbourhood_group'].value_counts()
locations

In [None]:
len(locations)

In [None]:
# let us find relation between location i.e.neighbourhood_group and price.
price_vs_location = airbnb.groupby(['neighbourhood_group'])['price'].mean()
price_vs_location

In [None]:
# Let us find the top 10 most reviewed listings based on the total number of reviews.
most_reviewed_listings=airbnb.nlargest(10,'number_of_reviews')
most_reviewed_listings

In [None]:
reviewed_per_listings= airbnb.filter(['neighbourhood_group','number_of_reviews'])
reviewed_per_listings

In [None]:
top_reviewed_listings=reviewed_per_listings.nlargest(10,'number_of_reviews')
top_reviewed_listings

In [None]:
# Let us find the busiest hosts, using the mean of minimum_nights per host_id as the measure.
Busy_host=airbnb.groupby(['host_id']).minimum_nights.mean()
Busy_host=Busy_host.sort_values(ascending=True)
Busy_host

In [None]:
# Let us find top 10 busy hosts.
Top_busy_hosts=Busy_host.tail(10)
Top_busy_hosts

In [None]:
#let's find out more about our neighbourhood groups: 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island', and 'Bronx'

#Brooklyn
sub_1=airbnb.loc[airbnb['neighbourhood_group'] == 'Brooklyn']
price_sub1=sub_1[['price']]
#Manhattan
sub_2=airbnb.loc[airbnb['neighbourhood_group'] == 'Manhattan']
price_sub2=sub_2[['price']]
#Queens
sub_3=airbnb.loc[airbnb['neighbourhood_group'] == 'Queens']
price_sub3=sub_3[['price']]
#Staten Island
sub_4=airbnb.loc[airbnb['neighbourhood_group'] == 'Staten Island']
price_sub4=sub_4[['price']]
#Bronx
sub_5=airbnb.loc[airbnb['neighbourhood_group'] == 'Bronx']
price_sub5=sub_5[['price']]
#putting all the prices' dfs in the list
price_list_by_n=[price_sub1, price_sub2, price_sub3, price_sub4, price_sub5]

In [None]:
#creating an empty list that we will append later with price distributions for each neighbourhood_group
p_l_b_n_2=[]
#creating list with known values in neighbourhood_group column
nei_list=['Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx']
#creating a for loop to get statistics for price ranges and append it to our empty list
for x in price_list_by_n:
    i=x.describe(percentiles=[.25, .50, .75])
    i=i.iloc[3:]
    i.reset_index(inplace=True)
    i.rename(columns={'index':'Stats'}, inplace=True)
    p_l_b_n_2.append(i)
#changing names of the price column to the area name for easier reading of the table
for stats, area in zip(p_l_b_n_2, nei_list):
    stats.rename(columns={'price': area}, inplace=True)
#finalizing our dataframe for final view
stat_df=[df.set_index('Stats') for df in p_l_b_n_2]
stat_df=stat_df[0].join(stat_df[1:])
stat_df

In [None]:
# room_type in neighbourhood group
room_type_var = airbnb.groupby(['neighbourhood_group','room_type'])['room_type'].count().unstack()
room_type_var

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# let us visualize  top_host_id using bar chart.
viz_1=sns.barplot(x="Host_ID", y="P_Count", data=top_host_df,
                 palette='Blues_d')
viz_1.set_title('Hosts with the most listings in NYC')
viz_1.set_ylabel('Count of listings')
viz_1.set_xlabel('Host IDs')
viz_1.set_xticklabels(viz_1.get_xticklabels(), rotation=45)

##### 1. Why did you pick the specific chart?

A bar chart represents each category with a rectangular bar whose length is proportional to its value, with the bars arranged in order. It presents the value of each category intuitively, which makes it easy to compare different categories.

##### 2. What is/are the insight(s) found from the chart?

From the bar chart above, we observe that the host with host_id 219517861 has 327 listings, and that the hosts with host_id 12243051 and 16098958 have the same number of listings as each other.
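
The observation about tied hosts can also be checked without reading bar heights. The following is a minimal sketch on made-up host IDs (the `toy` frame and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical host IDs; each row represents one listing
toy = pd.DataFrame({"host_id": [11, 11, 11, 22, 22, 33, 33, 44]})

# Count listings per host and name both columns explicitly
listings_per_host = (
    toy["host_id"].value_counts()
    .rename_axis("Host_ID")
    .reset_index(name="P_Count")
)

# Hosts that share their listing count with at least one other host
tied = listings_per_host[listings_per_host.duplicated("P_Count", keep=False)]
print(listings_per_host)
print(tied)
```

`duplicated(..., keep=False)` marks every row whose count occurs more than once, so all tied hosts are returned rather than only the later ones.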

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer here

#### Chart - 2

In [None]:
# Chart - 2 visualization code (Most expensive area)
#we can see from our statistical table that we have some extreme values, therefore we need to remove them for the sake of a better visualization

#creating a sub-dataframe with no extreme values / less than 500
sub_6=airbnb[airbnb.price < 500]
#using violinplot to showcase density and distribution of prices
viz_2=sns.violinplot(data=sub_6, x='neighbourhood_group', y='price')
viz_2.set_title('Density and distribution of prices for each neighbourhood_group')

##### 1. Why did you pick the specific chart?

Violin plots are used when you want to observe the distribution of numeric data, and are especially useful when you want to make a comparison of distributions between multiple groups. The peaks, valleys, and tails of each group's density curve can be compared to see where groups are similar or different.

##### 2. What is/are the insight(s) found from the chart?

Great, with a statistical table and a violin plot we can observe a couple of things about the distribution of Airbnb prices across the NYC boroughs. First, Manhattan has the widest range of listing prices, with a median price of \$150, followed by Brooklyn with \$90 per night. Queens and Staten Island appear to have very similar distributions, and the Bronx is the cheapest of them all. This distribution and density of prices was expected: it is no secret that Manhattan is one of the most expensive places in the world to live in, whereas the Bronx appears to have lower standards of living.
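
The quartiles read off the violin plot can be computed directly. This is a minimal sketch on made-up prices (the `toy` frame is an assumption, not the real data), using the same cut-off of 500 as the chart:

```python
import pandas as pd

# Hypothetical toy prices for two boroughs
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Bronx"] * 4,
    "price": [100, 150, 200, 900, 40, 60, 80, 100],
})

# Same idea as the violin plot: ignore listings priced at 500 or more
capped = toy[toy["price"] < 500]

# 25%, 50% and 75% quantiles of price for each borough, one row per borough
quartiles = (
    capped.groupby("neighbourhood_group")["price"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

The median (the `0.5` column) is less sensitive to extreme prices than the mean, which is why it is a reasonable summary for a skewed price distribution.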

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#Visualise number of listings  in different locations with help of pie chart.
plt.style.use('fivethirtyeight')
plt.figure(figsize=(13,7))
g=plt.pie(airbnb.neighbourhood_group.value_counts(),labels=airbnb.neighbourhood_group.value_counts().index,autopct='%1.1f%%',startangle=180)
plt.title('neighbourhood group')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart expresses a part-to-whole relationship in the data. It makes percentage comparisons easy to explain through the area each category covers in a circle, shown in different colors, and it is frequently used wherever percentages need to be compared. So, I used a pie chart, which helped me compare the percentage share of each category of the variable.

##### 2. What is/are the insight(s) found from the chart?

From the pie chart above, we observe that the largest share of listings in New York is found in Manhattan (44.3% of total listings).
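
The percentages shown on the pie chart can be reproduced as numbers. A minimal sketch, assuming a made-up `boroughs` series rather than the real column:

```python
import pandas as pd

# Hypothetical borough labels, one per listing
boroughs = pd.Series(
    ["Manhattan"] * 5 + ["Brooklyn"] * 4 + ["Queens"],
    name="neighbourhood_group",
)

# Share of listings per borough in percent, largest first
share = boroughs.value_counts(normalize=True).mul(100).round(1)
print(share)
```

On the real data, the same call on `airbnb['neighbourhood_group']` would give the figures that `autopct` prints on the chart.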

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Let us visualise price_vs_location .
ax= price_vs_location.plot.bar(figsize=(10,5),fontsize=14)
ax.set_title('price per different location',fontsize=20)
ax.set_xlabel('neighbourhood',fontsize=15)
ax.set_ylabel('price',fontsize=15)

##### 1. Why did you pick the specific chart?

A bar chart represents each category with a rectangular bar whose length is proportional to its value, with the bars arranged in order. It presents the value of each category intuitively, which makes it easy to compare different categories.

##### 2. What is/are the insight(s) found from the chart?

From the plot above, we observe that Manhattan is the most expensive location in the given dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Let us visualise top 10 busy hosts to find busiest host using bar plot.
plt.rcParams['figure.figsize']=(10,5)
ax = Top_busy_hosts.plot(kind='bar')
ax.set_title('Top_busy_host')
ax.set_ylabel('minimum_nights')
ax.set_xlabel('host_id')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart represents each category with a rectangular bar whose length is proportional to its value, with the bars arranged in order. It presents the value of each category intuitively, which makes it easy to compare different categories.

##### 2. What is/are the insight(s) found from the chart?

From the bar plot above, we observe that the host with host_id 17550546 is the busiest host in the given dataset, because the listings belonging to this host have the highest average minimum number of nights.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#let's now combine this with our boroughs and room type for a rich visualization we can make

#grabbing top 10 neighbourhoods for sub-dataframe
sub_7=airbnb.loc[airbnb['neighbourhood'].isin(['Williamsburg','Bedford-Stuyvesant','Harlem','Bushwick',
                 'Upper West Side','Hell\'s Kitchen','East Village','Upper East Side','Crown Heights','Midtown'])]
#using catplot to represent multiple interesting attributes together and a count
viz_3=sns.catplot(x='neighbourhood', hue='neighbourhood_group', col='room_type', data=sub_7, kind='count')
viz_3.set_xticklabels(rotation=90)


##### 1. Why did you pick the specific chart?

This chart provides access to several axes-level functions that show the relationship between a numerical and one or more categorical variables using one of several visual representations.

##### 2. What is/are the insight(s) found from the chart?

Let's break down what we can see from this plot. First, the plot consists of 3 subplots, which is the power of using catplot: with such output, we can easily compare distributions among attributes of interest. The Y and X axes stay exactly the same for each subplot; the Y-axis represents a count of observations and the X-axis the observations we want to count. However, there are 2 more important elements, column and hue, which differentiate the subplots. After we specify the column and the hue, we are able to observe and compare our Y and X axes across the specified column as well as by color. So, what do we learn from this? The observation that stands out the most is that the 'Shared room' type of Airbnb listing is barely available among the 10 most listing-populated neighborhoods. Then, we can see that for these 10 neighborhoods only 2 boroughs are represented: Manhattan and Brooklyn. That was somewhat expected, as Manhattan and Brooklyn are among the most traveled destinations and would therefore have the most listing availability. We can also observe that Bedford-Stuyvesant and Williamsburg are the most popular neighborhoods in the Brooklyn borough, and Harlem in Manhattan.
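
The list of top neighbourhoods and the borough each one belongs to can be derived from the data instead of being typed by hand. A minimal sketch on made-up rows (the `toy` frame is an assumption; the notebook would use `airbnb` and `head(10)`):

```python
import pandas as pd

# Hypothetical toy listings standing in for the real `airbnb` DataFrame
toy = pd.DataFrame({
    "neighbourhood": ["Harlem", "Harlem", "Harlem", "Williamsburg",
                      "Williamsburg", "Bedford-Stuyvesant"],
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Brooklyn",
                            "Brooklyn", "Brooklyn"],
})

# Neighbourhoods with the most listings (top 2 here, top 10 in the notebook)
top_n = toy["neighbourhood"].value_counts().head(2).index.tolist()

# Listing counts per neighbourhood and borough, then the borough of each neighbourhood
counts = pd.crosstab(toy["neighbourhood"], toy["neighbourhood_group"])
borough_of = counts.idxmax(axis=1)
print(top_n)
print(borough_of)
```

A cross-tabulation like this is a quick way to confirm which borough a neighbourhood belongs to before writing the interpretation.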

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
#Let's look at listing distribution across latitude and longitude with different Neighbourhood group.
lat = sns.scatterplot(x=airbnb.longitude, y=airbnb.latitude, hue=airbnb.neighbourhood_group, palette="husl")

lat.set_title('Latitudinal and longitudinal distribution of listings across neighbours', weight='bold', fontsize = 16)
lat.set_xlabel('Longitude')
lat.set_ylabel('Latitude')
lat.legend(loc='upper left', title='Neighbourhood')

In [None]:
#Room type distribution across latitude and longitudes
rm = sns.scatterplot(x=airbnb.longitude, y=airbnb.latitude, hue=airbnb.room_type, palette="husl")
rm.set_title('Room type distribution across latitude and longitudes', weight='bold', fontsize = 16)
rm.legend(loc='upper left', title='Room type')

##### 1. Why did you pick the specific chart?

When the two variables in a scatter plot are geographical coordinates – latitude and longitude – we can overlay the points on a map to get a scatter map (aka dot map). This can be convenient when the geographic context is useful for drawing particular insights and can be combined with other third-variable encodings like point size and color.

##### 2. What is/are the insight(s) found from the chart?



*   It appears there are a few scattered listings across Queens and Staten Island. In contrast, Brooklyn and Manhattan have a crowded listing situation in their respective regions.
*   In terms of the distribution of room types, we can see there is a good mix of different types available across the region. Compared with shared rooms, private rooms and entire homes clearly dominate.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Different types of room distribution
plt.style.use('fivethirtyeight')
p = plt.pie(airbnb.room_type.value_counts(), labels= airbnb.room_type.value_counts().index, autopct='%1.1f%%')
plt.title("Listings of each room type")
plt.show()


In [None]:
r = sns.catplot(data = airbnb, x='neighbourhood_group', hue='neighbourhood_group', col='room_type', palette='flare',kind="count")
r.set_xticklabels(rotation=90)
sns.set(rc={'figure.figsize':(10,6)})
sns.set(style="white")

##### 1. Why did you pick the specific chart?



*   A pie chart expresses a part-to-whole relationship in the data. It makes percentage comparisons easy to explain through the area each category covers in a circle, shown in different colors, and it is frequently used wherever percentages need to be compared. So, I used a pie chart, which helped me compare the percentage share of each category of the variable.
*   This chart provides access to several axes-level functions that show the relationship between a numerical and one or more categorical variables using one of several visual representations.



##### 2. What is/are the insight(s) found from the chart?

There is a very clear percentage division of the three different room types across the region, with 'Entire home/apt' accounting for 52.3% of listings, and shared rooms representing just 2.2%.
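
To see how the room type split differs between boroughs, the shares can be tabulated per borough. A minimal sketch with made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy listings: borough and room type of each listing
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4,
    "room_type": ["Entire home/apt"] * 3 + ["Private room"]
                 + ["Entire home/apt"] + ["Private room"] * 2 + ["Shared room"],
})

# Percentage of each room type within each borough (every row sums to 100)
room_share = pd.crosstab(
    toy["neighbourhood_group"], toy["room_type"], normalize="index"
).mul(100)
print(room_share)
```

`normalize="index"` divides each row by its own total, so the table answers "what mix of room types does this borough have" rather than "where are the rooms of this type".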

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Analyzing the number of reviews and room availability with respect to price
sns.relplot(x=airbnb.number_of_reviews, y=airbnb.price, hue=airbnb.neighbourhood_group, col=airbnb.neighbourhood_group, palette="flare")
sns.set(rc={'figure.figsize':(10,6)})
sns.set(style="white")

In [None]:
# Let's have a closer look, limiting the price range to less than 400 dollars.
sns.relplot(x=airbnb.number_of_reviews, y=airbnb.price, hue=airbnb.neighbourhood_group, col=airbnb.neighbourhood_group, palette="flare")
sns.set(rc={'figure.figsize':(10,6)})
plt.ylim(0,400)
sns.set(style="white")

In [None]:
sns.relplot(x=airbnb.availability_365, y=airbnb.price, hue=airbnb.room_type, palette="husl")
sns.set(rc={'figure.figsize':(10,6)})
sns.set(style="white")

##### 1. Why did you pick the specific chart?

This chart provides access to several different axes-level functions that show the relationship between two variables with semantic mappings of subsets and allows us to visualise how variables within a dataset relate to each other

##### 2. What is/are the insight(s) found from the chart?



*   From the first graph, we can see a negative relationship between price and the number of reviews. There are more reviews for properties with lower prices since they are booked more frequently.
*   According to the second graph, the price of properties does not vary much in relation to their availability.
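
A visual impression of a negative relationship can be backed up with a rank correlation. The sketch below is not part of the original analysis: it uses made-up numbers and computes Spearman's rho as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical toy listings: higher price, fewer reviews
toy = pd.DataFrame({
    "price": [50, 80, 120, 200, 400],
    "number_of_reviews": [90, 60, 40, 10, 2],
})

# Spearman's rho equals the Pearson correlation computed on the ranks
rho = toy["price"].rank().corr(toy["number_of_reviews"].rank())
print(round(rho, 3))
```

A value close to -1 indicates that reviews fall steadily as price rises; a value near 0 would mean the scatter plot shows no monotonic trend.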



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Finding Relation between neighbourhood group and availability of rooms
# Let us visualise relationship  between neighbourhood group and availability 365 using boxplot
plt.figure (figsize= (10,10))
ax= sns.boxplot(data=airbnb,x='neighbourhood_group',y='availability_365',palette='plasma')
ax.set_title('Relation between Neighbourhood group & Availability of rooms').set_fontsize('15')


##### 1. Why did you pick the specific chart?

Box plots are used to show distributions of numeric data values, especially when you want to compare them between multiple groups.

##### 2. What is/are the insight(s) found from the chart?



*   We can see that customers prefer Brooklyn over Manhattan due to its reasonable price.

*   Customers only go for Queens when both Brooklyn and Manhattan are not available or are expensive.

*   Customers prefer to go for Staten Island and the Bronx only in the later days of the year.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Frequency of listings with respect to the price
sns.histplot(data=airbnb, x="price").set_title('Frequency of listings with respect to the price')
sns.set(rc={'figure.figsize':(10,6)})
sns.set(style="white")


In [None]:
price_range = airbnb[airbnb['price'] <= 1000]
sns.histplot(data=price_range, x="price", kde=True, bins = 80).set_title('Frequency of listings with respect to the price<=1000')
sns.set(rc={'figure.figsize':(10,6)})
sns.set(style="dark")

##### 1. Why did you pick the specific chart?

A histogram is a traditional visualization tool that counts the number of data that fall into discrete bins to illustrate the distribution of one or more variables. This function can add a smooth curve derived using a kernel density estimate to the statistic computed within each bin to estimate frequency, density, or probability mass.

##### 2. What is/are the insight(s) found from the chart?

By using the histogram, we can now see how prices are distributed. We have a large number of values concentrated below 200 dollars.
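
The concentration of prices below 200 dollars can be expressed as a single number. A minimal sketch on made-up prices (the `prices` series is an assumption, not the real column):

```python
import pandas as pd

# Hypothetical toy prices with a long right tail
prices = pd.Series([50, 90, 120, 150, 180, 199, 250, 400, 1000, 5000])

# Fraction of listings priced below 200 dollars
share_below_200 = (prices < 200).mean()

# A mean far above the median is a sign of a right-skewed distribution
print(share_below_200, prices.median(), prices.mean())
```

Comparing a boolean mask's mean with the histogram is a quick sanity check on what the chart appears to show.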

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Frequency of listings with respect to the minimum number of nights
sns.histplot(data=airbnb, x="minimum_nights", kde=True, bins = 80).set_title('Frequency of listings with respect to the minimum number of nights')
sns.set(rc={'figure.figsize':(10,6)})
sns.set(style="white")

In [None]:
min_night = airbnb[airbnb['minimum_nights'] <= 50]
sns.histplot(data=min_night, x="minimum_nights", kde=True, bins = 80).set_title('Frequency of listings with respect to the minimum number of nights<=50')
sns.set(rc={'figure.figsize':(10,6)})
sns.set(style="white")

##### 1. Why did you pick the specific chart?

A histogram is a traditional visualization tool that counts the number of data that fall into discrete bins to illustrate the distribution of one or more variables. This function can add a smooth curve derived using a kernel density estimate to the statistic computed within each bin to estimate frequency, density, or probability mass.

##### 2. What is/are the insight(s) found from the chart?

In this case, most of the listings have listed out their minimum night record below 10. One unusual thing to note here is the peak in the listing frequency for the minimum night of 30. It is possible that some owners have listed their properties on a monthly rental basis, which may explain this.
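
The spike at 30 nights can be located by counting exact values instead of reading histogram bins. A minimal sketch with made-up values (the `nights` series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy values for minimum_nights
nights = pd.Series([1, 1, 1, 1, 1, 2, 2, 3, 5, 7, 30, 30, 30, 60])

# Number of listings for each exact minimum-night value
counts = nights.value_counts().sort_index()

# Most common value among the longer minimum stays (10 nights or more)
long_stay_peak = counts[counts.index >= 10].idxmax()

# Share of listings with a minimum stay below 10 nights
share_under_10 = (nights < 10).mean()
print(counts)
print(long_stay_peak, round(share_under_10, 2))
```

Restricting the search to longer stays keeps the very common one-night minimum from hiding the secondary peak.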

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13 (Textual Data Mining to find out the host's mindset)

In [None]:
# Chart - 13 visualization code
text = ' '.join(str(n).lower() for n in airbnb.name)

wordcloud = WordCloud(max_words=200, background_color = 'white').generate(text)

plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

##### 1. Why did you pick the specific chart?

We will be using the Wordcloud library for textual data mining on the name column. Word clouds use frequency counts of the words as input and return a beautiful graphic display of the most frequently occurring words with their size proportional to their relative frequency. We can see a large number of naming patterns used by our hosts for their listings. Using the word cloud as a tool for analysis, we can uncover some interesting trends that may help us understand our hosts' behavior and mindset.

##### 2. What is/are the insight(s) found from the chart?



1.   Based on the above resultant word cloud, it is evident that hosts are using simple and location-oriented keywords to differentiate their listings. In this case, location is the key indicator, since the words "Manhattan" and "Williamsburg" are prominently displayed.
2.   Furthermore, we can see some adjectives such as "beautiful", "quiet", "cozy", and "gorgeous" bedrooms/apartments, which indicates that visitor comfort is a top priority.


3.   Several mentions of "private rooms" indicate the popularity of this room type in the city.
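
The word cloud can be complemented with explicit word counts, which are easier to quote. This is a minimal sketch using only the standard library; the `names` list and the small `stopwords` set are made-up assumptions, not the real listing names.

```python
import re
from collections import Counter

# Hypothetical listing names, including one missing value
names = [
    "Cozy private room in Williamsburg",
    "Beautiful Manhattan apartment",
    "Quiet private room near park",
    None,
    "Cozy studio in Manhattan",
]

# Small illustrative stop word list
stopwords = {"in", "near", "the", "a", "of", "with"}

word_counts = Counter(
    word
    for name in names
    if isinstance(name, str)  # skip missing names instead of counting them as text
    for word in re.findall(r"[a-z]+", name.lower())
    if word not in stopwords
)
print(word_counts.most_common(5))
```

Skipping missing names avoids a placeholder string showing up as a frequent "word" in the results.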





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
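# Let us see the correlation between the numeric variables such as location, price and reviews
# Only numeric columns are used, since correlation is undefined for text columns
corr=airbnb.select_dtypes('number').corr(method='kendall')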
plt.figure(figsize=(13,10))
sns.heatmap(corr,annot=True).set_title('correlation between location, price,reviews\n')
plt.show()

##### 1. Why did you pick the specific chart?

Correlation heatmaps can be used to find potential relationships between variables and to understand the strength of these relationships. In addition, correlation plots can be used to identify outliers and to detect linear and nonlinear relationships.

##### 2. What is/are the insight(s) found from the chart?

From the correlation plot above, we observe that there is no strong correlation between any pair of variables, but calculated_host_listings_count and availability_365 are weakly correlated.
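
Reading the strongest pairs off a heatmap by eye is error-prone, so they can be ranked programmatically. The sketch below uses made-up data and the default Pearson method rather than Kendall (an assumption made to keep the example dependency-free); the ranking step is the same for any correlation matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one non-numeric column
toy = pd.DataFrame({
    "price": [100, 150, 200, 250, 300],
    "minimum_nights": [1, 3, 2, 5, 4],
    "availability_365": [300, 250, 200, 150, 100],
    "room_type": ["a", "b", "a", "b", "a"],
})

# Correlation matrix of the numeric columns only
corr = toy.select_dtypes("number").corr()

# Keep each pair once (upper triangle, diagonal excluded) and rank by strength
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value puts strong negative relationships next to strong positive ones, which a color scale can make easy to overlook.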

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

sns.pairplot(airbnb)

##### 1. Why did you pick the specific chart?

A pair plot is used to plot multiple pairwise bivariate distributions in a dataset. The Seaborn pairplot allows us to plot pairwise relationships between variables within a dataset. This creates a nice visualisation and helps us understand the data by summarising a large amount of data in a single figure.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.



   The dataset has a limited set of attributes for classifying the various categories of properties.

   Customer experience and category-wise ratings for hosts are missing; these could have played an important role in identifying star hosts.

   A lot of guest information is missing, such as purpose of visit and number of guests, which could have given a sense of the relation between customer footfall and neighbourhood.

   Key attributes of properties such as number of beds, closets, bathrooms, gym, sauna, property age, and distances from the nearest hospitals, shopping complexes, airport and station are missing.

   Some local tours can be bundled with longer visits, encouraging customers to stay longer.

   Pricing: The analysis reveals a wide range of prices for Airbnb listings in NYC. Factors such as location and room type significantly impact listing prices. Hosts should consider these factors when setting competitive prices for their listings.

   Popular Neighborhoods: Certain neighborhoods in NYC have higher demand for Airbnb accommodations. Hosts can use this information to focus their efforts on popular areas to attract more bookings.

   Room Types: The dataset shows that private rooms and entire homes/apartments are the most common room types listed on Airbnb in NYC. Hosts should consider offering these room types to align with customer preferences.

   Availability: The analysis indicates that availability over the year varies between listings and boroughs. Hosts can use this information to adjust their pricing and availability to maximize bookings during peak periods.

   Reviews: Customer reviews play a crucial role in the success of Airbnb listings. Analysis of review data can help hosts identify areas for improvement and enhance the guest experience, leading to better reviews and increased bookings.

   Amenities: Certain amenities, such as Wi-Fi, air conditioning, and kitchen facilities, are highly valued by guests. Hosts should ensure their listings offer desirable amenities to attract bookings.









# **Conclusion**



*   Our top ten hosts have a substantial number of listings, with the top host having over 300.

*   The listings are dispersed among the five boroughs of New York City, with Manhattan having the highest percentage and the Bronx and Staten Island having the lowest.

*   The percentage distribution of the three distinct room types is: entire home or apartment: 52.3%, private room: 45.5%, shared room: 2.2%.

*   According to the analysis of client needs, customers strongly prefer an entire home or apartment. Offering shared rooms carries the greatest risk of losing customers.

*   Shared rooms tend to have higher availability than the other room types, and entire homes/apartments, which account for the highest proportion of listings, are mostly at the expensive end.

*   The Bronx and Staten Island are mostly preferred for shorter visits, while the other boroughs are preferred for slightly longer stays.

*   Manhattan and Brooklyn are the posh areas in New York, as they have the maximum footfall and their properties are on the higher side in terms of both prices and number of listings. Manhattan and Brooklyn have the highest number of hosts.

*   Manhattan has the highest number of private rooms and entire homes/apartments combined, followed by Brooklyn.

*   Staten Island seems more available for booking throughout the year compared to the other boroughs.













### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***