# **Project Name**    - Exploratory Data Analysis on Airbnb Bookings Dataset



##### **Project Type**    - EDA
##### **Contribution**    - Amal ks


# **Project Summary -**


The Airbnb EDA (Exploratory Data Analysis) project involved analyzing a dataset containing information about Airbnb listings, including various attributes such as location, property type, price, availability, and guest reviews. The goal of the project was to gain insights into the data and provide recommendations to help Airbnb achieve its business objectives.

Here's a summary of the key steps and findings from the Airbnb EDA project:

* Data Collection: The project started with collecting the Airbnb dataset, which contained information about listings, hosts, and guests. The dataset included features such as listing price, property type, neighborhood, availability, and guest reviews.
* Data Cleaning and Preprocessing: The dataset contains null values and zero values in several columns, as well as outliers in the continuous variables. Three columns (id, name and last_review) were removed because they have little influence on the analysis. Null values in the reviews_per_month column were replaced with 0.
* Exploratory Data Analysis (EDA): The main focus of the project was on EDA, which involved analyzing the dataset to gain insights into various aspects of Airbnb listings and guest behavior. This included visualizations such as histograms, bar plots, box plots, and scatter plots to explore relationships, distributions, and trends in the data.

The following plots were used to evaluate the data:
* Whether private rooms are preferred over other room types: analyzed with a bar plot showing the sum of reviews per month on the y-axis and room type on the x-axis.
* Which neighbourhood group generates more revenue with respect to room type: shown with a stacked bar chart, with neighbourhood groups on the x-axis and the total price per room type on the y-axis.
* Which neighbourhood group is most preferred with respect to reviews per month: shown with a pie chart of each neighbourhood group's percentage share.
* Distribution of room availability vs neighbourhood group: visualized with a box plot, which shows the spread of availability in each neighbourhood group along with the minimum, maximum and median values.
* Correlation between variables: analyzed for selected features using a heatmap.
* Top 10 busiest hosts with respect to reviews per month: a bar chart of the top 10 hosts, used to identify the busiest host.
* Price vs minimum nights: a scatter plot of price against minimum nights to understand the spread of the data, plus box plots of price and minimum nights to understand their distributions.

Key Insights: Through EDA, several key insights were uncovered, including:

* Distribution of listing prices across different neighborhoods and property types.
* Correlations between listing attributes such as price, minimum nights, and availability.
* Impact of guest reviews and ratings on listing performance and occupancy rates.

Recommendations: Based on the insights gained from the EDA, recommendations were provided to Airbnb to help achieve its business objectives. These recommendations included optimizing pricing strategies, improving listing performance, enhancing customer experience, targeting marketing efforts, and expanding inventory and market reach.

### Conclusion

Manhattan is the urban core of New York City, and many people from across the US come here for a decent lifestyle. It has many globally recognised tourist attractions, so people from around the globe flock to it. It is also the most expensive of the neighbourhood groups.

Brooklyn is both a residential and an industrial hotspot. Many people from around the country and the globe come here in search of employment. It is the second most expensive neighbourhood group.

Sonder (NYC), Blueground, Michael and David are the top four highest-spending customers. Their most popular destinations are Manhattan, Brooklyn and Queens.

To start a business associated with Airbnb, one option is to acquire an Entire home/apt in the Williamsburg neighbourhood. Williamsburg accounts for nearly 8% of the total listings in New York City, which suggests that rooms there are in demand for most of the year. And because the average price of an Entire home/apt is much higher than that of other room types, it might provide a high return on investment.

# **GitHub Link -**

[link text](https://)

# **Problem Statement**


Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalized way of experiencing the world. Today, Airbnb has become a one-of-a-kind service that is used and recognized around the world. Analysis of the millions of listings provided through Airbnb is a crucial factor for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding the behavior and performance of customers and providers (hosts) on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.
This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values.
Explore and analyse the data to discover key insights.


#### **Define Your Business Objective?**



*   Maximize revenue by attracting more guests to book accommodations and experiences through the platform.
*   Provide a seamless and enjoyable experience for both hosts and guests. This includes ensuring accurate listings, efficient booking processes, and reliable customer support.
*   Encourage hosts to list their properties and engage with guests, which helps the business maintain a diverse range of accommodations and experiences on its platform.







# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
# File path to Airbnb NYC 2019 csv file
file_path = '/content/Airbnb NYC 2019.csv'

# Reading the csv file using pandas read_csv
airbnb_df = pd.read_csv(file_path)
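
The guidelines above ask for exception handling, so the load step could be wrapped in a small helper. The sketch below is one possible approach, not the notebook's original code: `load_listings` is a hypothetical name, and the checks it performs (file exists, at least one row) are assumptions about what counts as a usable dataset.

```python
from pathlib import Path

import pandas as pd


def load_listings(csv_path):
    """Read the listings CSV, failing with a clear message if it is missing or has no rows."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at: {path}")
    df = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"Dataset at {path} contains no rows")
    return df


# Usage with the path from the cell above (commented out so the sketch runs anywhere):
# airbnb_df = load_listings('/content/Airbnb NYC 2019.csv')
```

Failing early with a descriptive message makes a missing upload easier to spot than a traceback raised further down the notebook.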


### Dataset First View

In [None]:
# Dataset First Look

# Dataset first 5 rows
print('First 5 rows of Airbnb booking NYC dataset : ')
airbnb_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# Displaying the Rows & Columns count using pandas shape method
print(f'Airbnb NYC 2019 dataset has {airbnb_df.shape[0]} rows and {airbnb_df.shape[1]} columns')

### Dataset Information

In [None]:
# Dataset Info

# Displaying the column names corresponding data type of the column and non null count using pandas info method
airbnb_df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# Checking for the count of duplicate values using pandas duplicated method and finding total count of duplicates using sum method
print(f'Total number of duplicate rows in the dataset is : {airbnb_df.duplicated().sum()}')


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

# Finding the number of missing/null values in each column using the isnull and sum methods
print(f'Total number of Missing/Null values in each column of the dataset is : \n{airbnb_df.isnull().sum()}')

In [None]:
# Visualizing the missing values

# Calculates the number of missing values for each feature in the dataset and stores the result in the missing_values variable

missing_values = airbnb_df.isnull().sum()

# Plot missing values

# Creating a plot of figure size (10,6)
plt.figure(figsize=(10, 6))

# Using barplot from seaborn library  and assigning X-axis as the column names and Y-axis as the no.of null values
sns.barplot(x=missing_values.index, y=missing_values.values, palette='viridis')

# Adjusting the X-axis label
plt.xticks(rotation=90)

# Defining X-label as Features
plt.xlabel('Features')

# Defining Y-label as Number of Missing values
plt.ylabel('Number of Missing Values')

# Defining title of the plot
plt.title('Missing Values in Airbnb Dataset')

plt.show()



### What did you know about your dataset?

The Airbnb dataset has 16 columns and 48895 rows. Of the 16 columns, 7 (id, host_id, price, minimum_nights, number_of_reviews, calculated_host_listings_count, availability_365) are of type int64, 3 (latitude, longitude, reviews_per_month) are of type float64, and 6 (name, host_name, neighbourhood_group, neighbourhood, room_type, last_review) are of type object. The id column holds the unique id of each listing, and the host_id column holds the unique id of each host. The dataset has no duplicate rows. Four columns contain null/missing values: name has 16, host_name has 21, and last_review and reviews_per_month each have 10052.
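
Raw null counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of such a summary; `missing_summary` is a hypothetical helper and the toy frame uses made-up values in place of `airbnb_df`.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return count and percentage of nulls per column, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(df) * 100).round(2),
    })
    return summary[summary['missing_count'] > 0].sort_values('missing_count', ascending=False)


# Toy frame standing in for airbnb_df (values are made up for illustration)
toy = pd.DataFrame({
    'host_name': ['Ann', None, 'Raj', 'Li'],
    'reviews_per_month': [0.5, np.nan, np.nan, 1.2],
    'price': [100, 80, 250, 60],
})
print(missing_summary(toy))
```

Calling `missing_summary(airbnb_df)` on the real data would list only the columns that contain nulls.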
                           
             
  

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

# Column names of the dataset
print(f'Column names of the dataset are : \n{airbnb_df.columns}')

In [None]:
# Dataset Describe

airbnb_df.describe()

### Variables Description

The dataset has 16 features. The price feature has a minimum value of 0, which is probably wrong because a host is unlikely to offer a property for free. The availability_365 column also has a minimum value of 0, which is suspicious for an active listing. These values may have entered the dataset through incorrect data entry or an incorrect data pull. The price column has a maximum value of 10000 while its mean is 149.571777, which indicates that the maximum is an outlier.
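
The zero prices and the extreme maximum described above can be quantified rather than read off `describe()`. The sketch below uses Tukey's 1.5 × IQR fences, which is one common convention and not necessarily the rule used in this project. `iqr_bounds` is a hypothetical helper and the prices are made up.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices for illustration only
prices = pd.Series([0, 60, 80, 100, 120, 150, 200, 10000], name='price')

lower, upper = iqr_bounds(prices)
zero_priced = int((prices == 0).sum())
outliers = prices[(prices < lower) | (prices > upper)]

print(f'Zero-priced listings: {zero_priced}')
print(f'Fences: ({lower:.2f}, {upper:.2f}); flagged values: {outliers.tolist()}')
```

On the real data, `iqr_bounds(airbnb_df['price'])` would give the fences that the box plot whiskers in Chart 7 are based on.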

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# Iterate over each column in the DataFrame
for column in airbnb_df.columns:

  # Print the column name and the no.of unique values for the column
  print(f"No.of Unique values for the column '{column}' is : {len(airbnb_df[column].unique())}")

  # Add an empty line for better readability
  print()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Create a function called data_wrangler()
def data_wrangler(airbnb_df):

  # Dropping the rows where host_name is null
  airbnb_df.dropna(axis=0,subset=['host_name'],inplace=True)

  # Replace the null/missing values in the column reviews_per_month with 0
  airbnb_df.fillna({'reviews_per_month':0},inplace=True)

  # Dropping the id, name and last_review columns from the dataset
  airbnb_df.drop(['name','id','last_review'],axis=1,inplace=True)

  return airbnb_df

airbnb_df=data_wrangler(airbnb_df)

### What all manipulations have you done and insights you found?


Manipulations done on the dataset
*   Removed the rows where host_name is null.
*   Replaced the null/missing values in the reviews_per_month column with 0.
*   Removed the id, name and last_review columns from the dataset.

Insights found

*   The id, name and last_review columns are identifiers or descriptive fields (id contains only unique values), so they have no effect on the analysis.
*   The price column contains 16 rows with a value of zero. This is a small number compared to the total number of rows, so it does not affect the analysis.
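
Because the wrangling cell modifies the frame in place and drops columns by name, running it a second time would raise a `KeyError`. A possible alternative, sketched below under the assumption that the same three steps are wanted, works on a copy and tolerates repeated runs. `wrangle_copy` is a hypothetical name and the rows are made up.

```python
import numpy as np
import pandas as pd


def wrangle_copy(df):
    """Return a cleaned copy; the input frame is left untouched."""
    cleaned = df.dropna(subset=['host_name']).copy()
    cleaned['reviews_per_month'] = cleaned['reviews_per_month'].fillna(0)
    # errors='ignore' lets the function run again on already-cleaned data
    return cleaned.drop(columns=['name', 'id', 'last_review'], errors='ignore')


# Toy rows standing in for the raw dataset (values are made up)
raw = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Loft', None, 'Studio'],
    'host_name': ['Ann', 'Raj', None],
    'last_review': ['2019-05-01', None, None],
    'reviews_per_month': [0.4, np.nan, np.nan],
    'price': [120, 90, 75],
})

clean = wrangle_copy(raw)
clean_again = wrangle_copy(clean)  # second pass changes nothing
print(clean)
```

Keeping the raw frame intact also makes it possible to compare values before and after cleaning.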












## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### 1. Are private rooms preferred over other rooms, based on reviews per month

In [None]:
# Chart - 1 visualization code

# Group by room type and sum the reviews per month
count_wrt_roomtype=airbnb_df.groupby('room_type')['reviews_per_month'].sum()

# Create barplot
barplot=sns.barplot(count_wrt_roomtype,palette={'Entire home/apt':'green','Private room':'red','Shared room':'blue'})

# Adding labels above the bars, rounded to two decimals
for index, value in enumerate(count_wrt_roomtype):
    barplot.text(index, value, f'{value:.2f}', ha='center', va='bottom')

# Show plot
plt.show()



##### 1. Why did you pick the specific chart?

A bar plot is suitable for comparing a value across categories, such as the total reviews per month for each room type in this case. Bar plots are simple, making them easy to understand for most viewers. They can be easily customized to add information, such as labels above the bars, and they support different color palettes and styling options to enhance visual appeal and convey additional meaning.

##### 2. What is/are the insight(s) found from the chart?

The most common room type among the listings is "Entire home/apt". The least preferred room type is "Shared room". The height of the bars in the above plot allows us to compare the choice of room type easily.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

People mostly prefer Entire home/apt and private rooms over shared rooms. Since "Entire home/apt" is the most common room type, marketing campaigns can be designed to highlight the features and benefits of such accommodations to attract more customers. Businesses can allocate resources and prioritize listings based on the demand for each room type. This can help ensure that popular room types are adequately stocked and available to meet customer demand. By understanding customer preferences for different room types, businesses can tailor their offerings to better meet customer needs. This can lead to higher customer satisfaction and loyalty, ultimately contributing to a positive business impact.

### 2. Which neighbourhood is generating more revenue with respect to the room type

##### 1. Why did you pick the specific chart?

A stacked bar chart is well suited to visualizing the values of one variable across two categories at once. Here price is the continuous variable, and neighbourhood group and room type are the two categorical variables.

In [None]:
# Chart - 2 visualization code

#  Create dataframe to show total money spent on each neighbourhood_group wrt room_types
df1 = pd.DataFrame(airbnb_df.groupby(['neighbourhood_group', 'room_type'])['price'].sum())

# Create a table from the dataframe
table = pd.pivot_table(df1, index = ['neighbourhood_group'], columns = ['room_type'], values = 'price')

# Plot a stacked bar chart
table.plot(kind='bar',ylabel='Total money spent',stacked=True,figsize=[12,6])

# Giving title to the chart
plt.title('Money spent on each neighbourhood wrt room type')

# Show the plot
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Entire home/apt is the most preferred choice and thus accounts for the most money spent in every neighbourhood group, followed by Private room and Shared room respectively. Most of the money is spent in Manhattan, followed by Brooklyn, Queens, Bronx and Staten Island respectively.
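
The table behind the stacked bar chart can also be built in one step with `groupby` and `unstack`, and turned into within-group shares to compare room types on a common scale. This is a sketch on made-up rows, not the notebook's original code. Note that the quantity being summed is the listed nightly price, so it is only a proxy for money actually spent.

```python
import pandas as pd

# Made-up listings for illustration only
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Queens'],
    'room_type': ['Entire home/apt', 'Private room', 'Entire home/apt', 'Private room', 'Private room'],
    'price': [300, 120, 200, 80, 60],
})

# Sum of listed prices per group and room type; unstack turns room types into columns
price_table = (
    toy.groupby(['neighbourhood_group', 'room_type'])['price']
       .sum()
       .unstack(fill_value=0)
)

# Row-wise share of each room type within its neighbourhood group
share_table = price_table.div(price_table.sum(axis=1), axis=0).round(3)

print(price_table)
print(share_table)
```

`price_table.plot(kind='bar', stacked=True)` would draw the same kind of chart as the cell above.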

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

People mostly choose the Manhattan and Brooklyn neighbourhood groups. By understanding the preference for and popularity of each room type and neighbourhood, businesses can promote those neighbourhoods more and tailor more offerings there. This can lead to greater customer satisfaction and a positive business impact.

### 3. Which neighbourhood is most preferred with respect to reviews per month

In [None]:
# Chart - 3 visualization code

# Using groupby and sum to find the total reviews per month for each neighbourhood group
df1 = pd.DataFrame(airbnb_df.groupby('neighbourhood_group')['reviews_per_month'].sum())

# Plotting a pie chart
plt.pie((df1.values).flatten(),labels=df1.index,autopct='%1.1f%%',explode=(.1,0,0,0,0),radius=1.5)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart can effectively display the relative sizes of different categories or segments in a dataset. Each wedge in the pie represents a proportion of the whole, making it easy to compare the sizes visually. The circular shape and the division into slices make it easy for viewers to grasp the relative proportions. Pie charts work best when there are a limited number of categories or segments to display.

##### 2. What is/are the insight(s) found from the chart?

Manhattan is the most preferred neighbourhood group. The previous visualization also shows that Manhattan is the highest revenue-generating neighbourhood group.
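
The percentages printed on the pie can be reproduced directly, which makes the ranking easy to check. The sketch below also derives the `explode` offsets from the data so that the largest slice is the one pulled out. With a fixed tuple such as `(.1,0,0,0,0)`, the offset always lands on the first slice in index order. The totals are made up for illustration.

```python
import pandas as pd

# Made-up totals for illustration only
reviews_by_group = pd.Series(
    {'Bronx': 50.0, 'Brooklyn': 400.0, 'Manhattan': 450.0, 'Queens': 90.0, 'Staten Island': 10.0},
    name='reviews_per_month',
)

# Percentage share of each group, the same quantity autopct prints on the pie
share_pct = (reviews_by_group / reviews_by_group.sum() * 100).round(1)

# Explode offsets: 0.1 for the largest slice, 0 for the rest
largest = reviews_by_group.idxmax()
explode = tuple(0.1 if group == largest else 0 for group in reviews_by_group.index)

print(share_pct)
print(largest, explode)
```

These values can be passed on as `plt.pie(reviews_by_group.values, labels=reviews_by_group.index, explode=explode, autopct='%1.1f%%')`.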

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Manhattan is the urban core of New York City and its economic powerhouse. Many globally famous tourist attractions are located there, such as the Empire State Building, the Statue of Liberty and Central Park. Thus, many people from around the globe flock here to experience American richness and culture. From our analysis, we can see that many people spend most of their time in this neighbourhood group for various reasons. They most likely spend most of their money in Manhattan rather than in other neighbourhood groups.

Brooklyn is both residential and industrial and also handles a vast amount of oceangoing traffic. It too has a few tourist attractions, such as the Brooklyn Museum and Coney Island. Many people come to this neighbourhood group mainly because of its industries.

Queens is a less visited neighbourhood group, followed by Bronx and Staten Island respectively.

### 4. Distribution of availability of room vs neighbourhood group


In [None]:
# Chart - 4 visualization code

# Create a figure of size (10,8)
plt.figure(figsize=[10,8])

# Create a boxplot
sns.boxplot(airbnb_df,x='neighbourhood_group',y='availability_365',hue='neighbourhood_group',palette="dark")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A box plot shows the minimum, maximum and median values in a single visualization, and it also shows the outliers.

##### 2. What is/are the insight(s) found from the chart?

Some neighbourhood groups show a minimum availability of zero, which means those listings are not available. The chance of finding an available room is higher in Staten Island than in any other neighbourhood group, because its median availability is the highest.
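
The centre line of a box plot is the median, not the mean, so it can help to print both alongside the share of listings with zero availability. The following is a minimal sketch on made-up values; on the real data the same calls would use `airbnb_df`.

```python
import pandas as pd

# Made-up availability values for illustration only
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan'] * 4 + ['Staten Island'] * 4,
    'availability_365': [0, 0, 30, 365, 90, 200, 250, 365],
})

# Summary statistics per neighbourhood group
stats = (
    toy.groupby('neighbourhood_group')['availability_365']
       .agg(['min', 'median', 'mean', 'max'])
)

# Share of listings in each group with zero available days
stats['zero_share'] = toy['availability_365'].eq(0).groupby(toy['neighbourhood_group']).mean()

print(stats)
```

In the toy data the Manhattan median (15) and mean (98.75) differ widely, which shows why the two should not be used interchangeably for skewed columns.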

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

From the visualization we can see that some rooms have an availability of zero, which shows that they are not available. Removing such listings can give customers more accurate data when they select listings. Staten Island has a higher median availability than any other neighbourhood group, so businesses can present it in their portal/dashboard as the neighbourhood group with the best chance of finding a stay.

### 5. Checking correlation between variables

In [None]:
# Chart - 5 visualization code

# Calculating the correlation matrix
corr = airbnb_df[['price', 'minimum_nights', 'number_of_reviews', 'calculated_host_listings_count', 'availability_365']].corr()

# Using seaborn heatmap to create a chart
sns.heatmap(corr, cmap='RdBu', fmt='.2f', square=True, linecolor='white', annot=True);

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is used to visualize the correlation between variables. It displays the strength of each correlation as colour intensity.

##### 2. What is/are the insight(s) found from the chart?

As we can see, the variables do not seem to be significantly correlated to one another.
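
To back up a reading of the heatmap with numbers, the correlation matrix can be flattened into a ranked list of variable pairs. This is a sketch on made-up values. It keeps only the upper triangle so that each pair appears once and the diagonal is excluded.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns for illustration only
toy = pd.DataFrame({
    'price': [100, 150, 80, 300, 120, 90],
    'minimum_nights': [1, 2, 1, 30, 3, 2],
    'number_of_reviews': [50, 10, 80, 0, 25, 60],
    'availability_365': [10, 200, 0, 365, 90, 30],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order pairs by absolute strength, strongest first
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```

Applied to the `corr` frame computed in the cell above, the top of `ranked` would show which pairs, if any, come close to a meaningful correlation.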

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No significant correlation between variables

### 6. Top 10 busiest hosts wrt reviews per month

In [None]:
# Chart - 6 visualization code

# Finding sum of reviews per month for host id by using groupby and sum functions
data=airbnb_df.groupby(['host_id'])['reviews_per_month'].sum().sort_values(ascending=False).head(10)

# Creating a figure of size (10,8)
plt.figure(figsize=[10,8])

# Plotting the bar plot
data.plot(kind='bar',ylabel='reviews per month for a host',color = ['firebrick', 'green', 'blue', 'black', 'red','purple', 'seagreen', 'skyblue', 'black', 'tomato'])

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot is suitable for comparing a value across categories, such as the total reviews per month for each host in this case. Bar plots are simple, making them easy to understand for most viewers. They can be easily customized to add information, such as labels above the bars, and they support different color palettes and styling options to enhance visual appeal and convey additional meaning.

##### 2. What is/are the insight(s) found from the chart?

The above chart shows the top 10 busiest hosts with respect to reviews per month. It shows that host id 219517861 is the busiest host. These hosts may be busy because of affordable pricing, good service, and locations that are popular with tourists.
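
Host ids alone are hard to interpret, so the ranking can carry the host name and the number of listings along with the total. The sketch below uses made-up rows and a hypothetical output column name, `total_reviews_per_month`.

```python
import pandas as pd

# Made-up listings for illustration only
toy = pd.DataFrame({
    'host_id': [1, 1, 2, 3, 3, 3],
    'host_name': ['Ann', 'Ann', 'Raj', 'Li', 'Li', 'Li'],
    'reviews_per_month': [2.0, 3.5, 4.0, 1.0, 1.5, 0.5],
})

# Total reviews per month and listing count per host, keeping the two busiest hosts
top_hosts = (
    toy.groupby(['host_id', 'host_name'])
       .agg(total_reviews_per_month=('reviews_per_month', 'sum'),
            listings=('reviews_per_month', 'count'))
       .nlargest(2, 'total_reviews_per_month')
       .reset_index()
)
print(top_hosts)
```

The listing count helps separate hosts who are busy because of one popular listing from hosts who simply manage many listings.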

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Other hosts can adapt their service or other customer satisfaction practices based on these examples. The business can promote these busiest hosts to customers with high confidence, backed by these stats.
The business can also classify hosts by quality based on this stat, which will help customers find suitable hosts without spending more time on the site.

### 7. Price vs minimum nights

In [None]:
# Chart - 7 visualization code

# Create a box plot of price column
airbnb_df['price'].plot(kind='box', vert=False, figsize=(15,3))

# Show the plot
plt.show()

# Create a box plot of minimum nights
airbnb_df['minimum_nights'].plot(kind='box', vert=False, figsize=(15,3))

# Show the plot
plt.show()

# Create a figure of size (20,6)
plt.figure(figsize=[20, 6])

# Plotting a scatter plot (points are coloured by room type through hue)
sns.scatterplot(airbnb_df, x='price',y='minimum_nights',hue='room_type')

# Labelling x-axis
plt.xlabel('price')

# Labelling y-axis
plt.ylabel('minimum nights')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Scatter plot is often used to visualize the relationship between two continuous variables. It's particularly useful for identifying patterns, trends, and outliers in the data.

##### 2. What is/are the insight(s) found from the chart?

*   Most of the minimum-night values belong to rooms priced below 2000.
*   Very few listings are priced above 2000.
*   It is also quite logical that people staying in highly expensive accommodation might not stay for many days.
*   Shared rooms have comparatively lower prices than the other two room types.
*   Minimum nights are also low for shared rooms.
*   Entire home/apt listings mostly have higher prices than shared rooms and private rooms.
*   Only one host sets a minimum stay of around 1000 days, for a shared room.
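
The observations above can be cross-checked numerically. Per-room-type medians are robust to the extreme points visible in the scatter plot, and trimming the frame zooms in on the dense region. The sketch below uses made-up rows, and the cut-offs `price_cap` and `nights_cap` are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Made-up listings for illustration only
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Private room',
                  'Private room', 'Shared room', 'Shared room'],
    'price': [250, 9000, 90, 110, 40, 55],
    'minimum_nights': [3, 30, 2, 1, 1, 999],
})

# Median price and minimum nights per room type
medians = toy.groupby('room_type')[['price', 'minimum_nights']].median()

# Hypothetical cut-offs used only to zoom in on the dense region of the plot
price_cap, nights_cap = 2000, 365
trimmed = toy[(toy['price'] <= price_cap) & (toy['minimum_nights'] <= nights_cap)]

print(medians)
print(f'Rows kept after trimming: {len(trimmed)} of {len(toy)}')
```

Plotting `trimmed` instead of the full frame would keep a handful of extreme listings from compressing the rest of the scatter plot.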


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Customers may be willing to pay more for certain room types even with shorter minimum stays. The business can encourage more hosts to list Entire home/apt and private rooms; this will increase the availability of the most preferred room types and thereby the revenue.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.



*   Utilize insights from the analysis of pricing variations across different neighborhoods, property types, and listing characteristics to optimize pricing strategies. This could involve adjusting prices based on demand trends, competitor pricing, and seasonal fluctuations to maximize revenue and occupancy rates.
*   Provide hosts with recommendations and best practices based on the analysis of high-performing listings.
*   Identify opportunities to enhance the customer experience by analyzing guest feedback, reviews, and ratings.
*   Use insights from the analysis of booking trends and guest demographics to inform targeted marketing and promotion efforts.









# **Conclusion**

Manhattan is the urban core of New York City, and many people from across the US come here for a decent lifestyle. It has many globally recognised tourist attractions, so people from around the globe flock to it. It is also the most expensive of the neighbourhood groups.

Brooklyn is both a residential and an industrial hotspot. Many people from around the country and the globe come here in search of employment. It is the second most expensive neighbourhood group.

Sonder (NYC), Blueground, Michael and David are the top four highest-spending customers. Their most popular destinations are Manhattan, Brooklyn and Queens.

To start a business associated with Airbnb, one option is to acquire an Entire home/apt in the Williamsburg neighbourhood. Williamsburg accounts for nearly 8% of the total listings in New York City, which suggests that rooms there are in demand for most of the year. And because the average price of an Entire home/apt is much higher than that of other room types, it might provide a high return on investment.
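
The listing share and average price figures quoted in this conclusion are not computed in any cell shown above. A minimal sketch of how they could be derived follows. It uses made-up rows, so the numbers it prints do not reproduce the values quoted in the text.

```python
import pandas as pd

# Made-up listings for illustration only
toy = pd.DataFrame({
    'neighbourhood': ['Williamsburg', 'Williamsburg', 'Harlem', 'Astoria', 'Williamsburg',
                      'Harlem', 'Bushwick', 'Astoria', 'Bushwick', 'Williamsburg'],
    'room_type': ['Entire home/apt', 'Private room', 'Private room', 'Entire home/apt',
                  'Entire home/apt', 'Entire home/apt', 'Private room', 'Private room',
                  'Shared room', 'Private room'],
    'price': [220, 90, 85, 150, 260, 180, 70, 65, 40, 95],
})

# Share of all listings per neighbourhood, as a percentage
listing_share = (toy['neighbourhood'].value_counts(normalize=True) * 100).round(1)

# Average listed price per room type
avg_price = toy.groupby('room_type')['price'].mean().round(2)

print(listing_share)
print(avg_price)
```

Running the same two expressions on `airbnb_df` would confirm or correct the share and price figures before they are used in a recommendation.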







### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***