<a href="https://colab.research.google.com/github/ayushmuhana/testrepo/blob/main/EDA_Submission_AirBnb_Booking_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - AirBnb Booking Analysis



##### **Project Type**    - Exploratory Data Analysis
##### **Contribution**    - Individual

# **Project Summary -**

This is an Exploratory Data Analysis project in which I explore a dataset and build visualizations to draw inferences about business outcomes. The dataset is about Airbnb bookings in New York in 2019. It has 48895 rows and 16 columns: id and name, which describe the listing; host_id and host_name, which describe the host; neighbourhood_group, neighbourhood, latitude and longitude, which give the location of the listing; additional listing data in room_type, price, minimum_nights, number_of_reviews, reviews_per_month, last_review and availability_365; and calculated_host_listings_count, which shows the number of listings that host has. My job is to understand the data, make a few visualization charts, draw inferences from those charts and form hypotheses on how they affect the business model. To do this I loaded the dataset CSV by mounting my Google Drive, cleaned and wrangled the data, and built the charts using pandas, numpy, matplotlib and seaborn.

# **GitHub Link -**

https://github.com/ayushmuhana/testrepo/blob/f88c02786472d2990bd9ad72da46484d8a528a0c/EDA_Submission_Template.ipynb

# **Problem Statement**


My task is to understand the Airbnb dataset, make visualization charts, draw inferences from those charts and form hypotheses on how they affect the business model.

#### **Define Your Business Objective?**

To understand how different variables affect bookings of Airbnb listings, and to maximize profitability by suggesting ideas that enhance the customer experience.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive

### Dataset Loading

In [None]:
# Load Dataset
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/My Drive/Airbnb NYC 2019.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape
print(f"Rows: {rows}, Columns: {columns}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(f"number of duplicate rows in dataset: {df.duplicated().sum()}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

In [None]:
# finding reason for high missing values in last_review and reviews_per_month column
selected_columns = ["id", "host_name", "neighbourhood", "minimum_nights", "number_of_reviews", "last_review", "reviews_per_month", "calculated_host_listings_count", "availability_365"]

filtered_df = df[df["last_review"].isna()][selected_columns]
filtered_df

From the above result, we can say that last_review and reviews_per_month are missing because these listings have a number_of_reviews of 0.
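The observation above can also be checked programmatically instead of by reading the filtered table. This is only a minimal sketch: `toy` is a small made-up frame standing in for `df`, and the names `no_review` and `explained` are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; only the three relevant columns are included.
toy = pd.DataFrame({
    "number_of_reviews": [9, 0, 45, 0],
    "last_review": ["2018-10-19", np.nan, "2019-05-21", np.nan],
    "reviews_per_month": [0.21, np.nan, 0.38, np.nan],
})

# Rows where last_review is missing
no_review = toy[toy["last_review"].isna()]

# True only if every such row also has zero reviews and no monthly rate
explained = bool(
    (no_review["number_of_reviews"] == 0).all()
    and no_review["reviews_per_month"].isna().all()
)
print(explained)
```

Running the same two checks on the real `df` would confirm (or contradict) the explanation for every affected row at once.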

In [None]:
# Visualizing the missing values
df.isna().sum().plot(kind = 'barh', ylabel = "Columns", xlabel = "Values Missing")
plt.show()

### What did you know about your dataset?

*   This dataset has 48895 rows and 16 columns.
*   There are no duplicate values in the dataset which means all the rows are distinct.
*   There are very few missing values in the name and host_name columns, and they have no significant effect on the analysis.
*   About one fifth of the values in last_review and reviews_per_month are missing, for listings that have no reviews, which could affect some analysis relating to reviews.
*   last_review is a date column which is in a string format.




## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

id: This column contains unique identifiers for each entry in the dataset. It is of integer data type.

name: This column contains the name or title of the accommodation listing. It is of string data type.

host_id: This column contains unique identifiers for the hosts of the accommodation listings. It is of integer data type.

host_name: This column contains the names of the hosts. It is of string data type.

neighbourhood_group: This column contains the name of the New York borough where the accommodation is located. It is of string data type.

neighbourhood: This column contains the name of the specific neighborhood where the accommodation is located. It is of string data type.

latitude: This column contains the latitude coordinates of the accommodation's location. It is of float data type.

longitude: This column contains the longitude coordinates of the accommodation's location. It is of float data type.

room_type: This column contains the type of room or accommodation offered (e.g., 'Private room,' 'Entire home/apt'). It is of string data type.

price: This column contains the price of the accommodation. It is of integer data type.

minimum_nights: This column contains the minimum number of nights required for booking the accommodation. It is of integer data type.

number_of_reviews: This column contains the number of reviews received for the accommodation. It is of integer data type.

last_review: This column contains the date of the last review for the accommodation. It is of string data type.

reviews_per_month: This column contains the average number of reviews per month for the accommodation. It is of float data type.

calculated_host_listings_count: This column contains the count of listings managed by the host. It is of integer data type.

availability_365: This column contains the number of available days for booking the accommodation within a year (365 days). It is of integer data type.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# nunique() counts the distinct non-null values in each column
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

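# Replace missing values (assigning back avoids inplace updates on a column selection, which newer pandas versions warn about)
df["name"] = df["name"].fillna("Unnamed Listing")
df["host_name"] = df["host_name"].fillna("Unknown host")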

#data transformation
df["last_review"] = pd.to_datetime(df["last_review"])

df.head()

### What all manipulations have you done and insights you found?

*   Since there are no duplicate rows, no action was required.
*   Replaced missing names and host names with "Unnamed Listing" and "Unknown host" for data consistency.
*   Converted the "last_review" column to a datetime column, as the values it contains are dates. This makes it easier to work with the data wherever date comparison is required.
*   Did NOT replace the missing values in "last_review" and "reviews_per_month", as that would make the findings inaccurate. These values are missing because those listings have no reviews, so a last review date and an average number of reviews per month don't exist.
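The wrangling steps listed above can be sketched on a small made-up frame. This is an illustration under assumptions, not the notebook's exact code: `toy` is hypothetical, the dictionary form of `fillna` is one of several equivalent options, and `errors="coerce"` is an optional safeguard the original cell does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df with one missing value per column
toy = pd.DataFrame({
    "name": ["Cozy loft", np.nan],
    "host_name": [np.nan, "Ana"],
    "last_review": ["2019-05-21", np.nan],
})

# Fill only the two text columns; last_review is deliberately left untouched
toy = toy.fillna({"name": "Unnamed Listing", "host_name": "Unknown host"})

# errors="coerce" turns unparseable strings into NaT instead of raising
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")

print(toy.dtypes)
```

Missing dates stay as `NaT`, so later date-based charts simply skip listings that were never reviewed.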







## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# price distribution per neighbourhood group chart

plt.figure(figsize = (6,6))
sns.set(style = "whitegrid")

sns.boxplot(x = "neighbourhood_group", y = "price", data = df)

plt.title("price distribution per neighbourhood group")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Price")
plt.show()

##### 1. Why did you pick the specific chart?

I wanted to show the price distribution per neighbourhood group. A box plot best represents how prices vary across areas, and it also shows the outliers, which gives a better understanding of the spread.

##### 2. What is/are the insight(s) found from the chart?

The insight here is that room prices in the Airbnb dataset are highest in Manhattan and lowest in the Bronx.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it can. Now that we know the Bronx has cheaper rooms than Manhattan, we can suggest Bronx listings to people looking to visit Manhattan, as the two are pretty close. This could increase bookings from people who are looking for rooms in Manhattan but are budget constrained.
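A box plot is read by eye, so it can help to put numbers next to it. The sketch below, on a made-up frame (`toy`, with invented prices), computes the median and mean price per neighbourhood group; the median is the line inside each box and is far less sensitive to extreme outliers than the mean.

```python
import pandas as pd

# Hypothetical prices; the 5000 value plays the role of an outlier
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Bronx", "Bronx", "Bronx"],
    "price": [150, 200, 5000, 60, 70, 80],
})

# Median and mean per group, most expensive group (by median) first
summary = (
    toy.groupby("neighbourhood_group")["price"]
    .agg(["median", "mean"])
    .sort_values("median", ascending=False)
)
print(summary)
```

On the real data, replacing `toy` with `df` would give the figures behind the "highest" and "lowest" reading of the chart.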

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Count of room type per neighbourhood_group
count_per_room = df.groupby(["room_type", "neighbourhood_group"]).size().reset_index(name="Count")
plt.figure(figsize = (6,6))

sns.barplot(x = "neighbourhood_group", y = "Count", hue = "room_type", palette = "colorblind", data = count_per_room)

plt.title("Number of rooms per room type in each neighbourhood group")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Count per Neighbourhood")
plt.show()

##### 1. Why did you pick the specific chart?

I chose this chart in particular because a grouped bar chart makes it easier to show multivariate categorical data. I had to split the data by neighbourhood group and show the number of rooms per room type in each one. This gives a clear idea of how they are distributed.

##### 2. What is/are the insight(s) found from the chart?

This chart shows that entire home/apt listings are the most common type in Manhattan, while all the other areas have more private rooms than entire home/apt listings. Also, there are not a lot of shared rooms listed on Airbnb in any of these areas.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. While the number of entire home/apt and private room listings is high in Manhattan and Brooklyn, there are very few listings from the other areas. We can come up with incentives for more hosts to list their apartments in these areas. We can also take steps to list more shared rooms.
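The counts behind a grouped bar chart like this one can also be viewed as a table. A minimal sketch with `pd.crosstab`, using a made-up frame (`toy`) rather than the real data; `normalize="index"` converts each row to shares, which makes small and large neighbourhood groups comparable.

```python
import pandas as pd

# Hypothetical listings
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Queens", "Queens", "Queens"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Private room", "Shared room"],
})

# Absolute counts: one row per neighbourhood group, one column per room type
counts = pd.crosstab(toy["neighbourhood_group"], toy["room_type"])

# Row-wise shares: each row sums to 1
shares = pd.crosstab(toy["neighbourhood_group"], toy["room_type"], normalize="index")

print(counts)
print(shares.round(2))
```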

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# average monthly reviews per room type per area

avg_review = df.groupby(["neighbourhood_group", "room_type"])["reviews_per_month"].mean().reset_index()

sns.lineplot(x = "neighbourhood_group", y = "reviews_per_month", hue = "room_type", data = avg_review)

plt.title("average monthly reviews per room type per neighbourhood group")
plt.xlabel("Neighbourhood Group")
plt.ylabel("Average Reviews per neighbourhood")
plt.show()

##### 1. Why did you pick the specific chart?

In this chart, I have tried to show the average reviews per month for each neighbourhood group and how the groups differ from each other. I went with a line chart because it makes it easy to compare each room type across neighbourhood groups, although the groups are categories rather than an ordered sequence, so the slopes between them carry no meaning. It gives a clear picture of how reviews differ in each neighbourhood group.

##### 2. What is/are the insight(s) found from the chart?

From the above chart we can see that the Bronx has high average monthly reviews for entire home/apt listings, while Manhattan has the highest for shared rooms and Queens for private rooms. This can help us understand the interest of customers in those areas and how they respond to it.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can help a business understand how its customers behave in a particular area, so that it can take measures to promote business accordingly. For example, shared rooms in Manhattan attract the most reviews per month, which suggests demand for them, so the business could increase the number of shared rooms there and incentivize customers to go for them.
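Because `reviews_per_month` is missing for listings without reviews, each group mean is computed only over the reviewed listings. The sketch below, on a made-up frame (`toy`), reports the mean together with how many listings it is based on; the column names `avg`, `reviewed` and `listings` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data; NaN marks listings that have no reviews
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Manhattan", "Manhattan", "Manhattan"],
    "room_type": ["Shared room"] * 5,
    "reviews_per_month": [2.0, np.nan, 1.0, 3.0, np.nan],
})

# mean and count skip NaN; size counts every listing in the group
stats = (
    toy.groupby(["neighbourhood_group", "room_type"])["reviews_per_month"]
    .agg(avg="mean", reviewed="count", listings="size")
    .reset_index()
)
print(stats)
```

A group mean backed by only a handful of reviewed listings deserves less weight than one backed by thousands.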

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Year of each listing's last review (missing for listings with no reviews)
df["Year"] = df["last_review"].dt.to_period("Y")
last_review_by_year = df["Year"].value_counts().sort_index()

plt.figure(figsize = (6,6))
ax = last_review_by_year.plot(kind = "bar", color = "skyblue")
plt.title("Number of Listings with Last Reviews by Year")
plt.xlabel("Year")
plt.ylabel("Number of listings")

for p in ax.patches:
  # Bar heights are floats; cast to int so the labels show whole counts
  ax.annotate(f"{int(p.get_height())}", (p.get_x() + p.get_width() / 2, p.get_height()), ha = "center", va = "bottom")

plt.show()

##### 1. Why did you pick the specific chart?

Here, I wanted to show the number of listings grouped by the year in which they received their last review. Since this is categorical data where I wanted to show value counts per category, a bar chart was a good fit.

##### 2. What is/are the insight(s) found from the chart?

Through this chart, we can get an understanding of how many listings are up to date with reviews. Of the 48895 listings in this dataset, 25209 have their last review in 2019, the current year of the data.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart gives us a crucial input. We can see the number of listings that are not up to date with their reviews, which is almost half of the total. It helps the business owner understand how many listings might need attention.
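The shares discussed above can be computed directly from the review dates. A minimal sketch on a made-up frame (`toy`); the variable names are illustrative and 2019 is taken as the current year of the data, as in the text.

```python
import pandas as pd

# Hypothetical review dates; None marks a listing with no reviews
toy = pd.DataFrame({
    "last_review": pd.to_datetime(["2019-06-01", "2019-01-15", "2017-03-02", None]),
})

# .dt.year yields NaN where the date is missing
year = toy["last_review"].dt.year

reviewed_2019 = (year == 2019).mean()   # share last reviewed in 2019
never_reviewed = year.isna().mean()     # share with no review at all
stale = 1 - reviewed_2019               # everything not reviewed in 2019

print(reviewed_2019, never_reviewed, stale)
```

Separating "never reviewed" from "reviewed, but not recently" matters, because the two groups call for different follow-up actions.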

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Comparison between price and number_of_reviews for each neighbourhood_group
sns.scatterplot(x = "price", y = "number_of_reviews", hue = "neighbourhood_group", data = df)
plt.title("Comparison between price and number of reviews")
plt.xlabel("Price")
plt.ylabel("Number of Reviews")

plt.show()

##### 1. Why did you pick the specific chart?

Here I am trying to show how price relates to the number of reviews. I used a scatter plot as it gives the best visual depiction of the relationship between two numerical variables. It also makes any outliers easy to identify.

##### 2. What is/are the insight(s) found from the chart?

This chart shows that among the cheaper listings there is not much correlation between price and number of reviews, but a pattern appears as the price increases: the higher the price, the fewer the reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

While there is no direct business impact from this chart, it can be used to infer how price relates to the number of reviews. The expensive listings have fewer reviews, which could affect the business, so hosts can get an idea of where to work.
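The visual impression from a scatter plot can be backed by a correlation coefficient. The sketch below uses a made-up frame (`toy`) and computes both the Pearson coefficient and a rank-based (Spearman) one; the latter is obtained here by ranking both columns first, which is an implementation choice for this example rather than something the notebook does.

```python
import pandas as pd

# Hypothetical data where reviews fall steadily as price rises
toy = pd.DataFrame({
    "price": [50, 100, 150, 200, 400],
    "number_of_reviews": [90, 60, 40, 10, 5],
})

# Pearson measures linear association
pearson = toy["price"].corr(toy["number_of_reviews"])

# Spearman is Pearson on ranks, so it captures any monotonic trend
spearman = toy["price"].rank().corr(toy["number_of_reviews"].rank())

print(round(pearson, 3), round(spearman, 3))
```

A rank-based coefficient is usually the safer choice for price data, since a few very expensive listings can dominate the Pearson value.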

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#histogram of price distribution
plt.figure(figsize=(10,6))

bins = list(range(0,2001,100))  # 100-dollar bins from 0 to 2000; listings priced above 2000 are not drawn
sns.histplot(df["price"], bins = bins, edgecolor = "black")

plt.title("Price Distribution Histogram Chart")
plt.xlabel("Price")
plt.ylabel("Number of Listings")

plt.show()

##### 1. Why did you pick the specific chart?

I wanted to show how the listings are distributed on the basis of price. A histogram allowed me to group prices into bins, which made it easier to show the data.

##### 2. What is/are the insight(s) found from the chart?

The majority of listings in the Airbnb dataset are priced below 2000 dollars, while there are very few high-end listings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It can, depending on what consumers can afford. Manhattan has a lot of high-paying customers, who usually go for higher-class rooms, which are usually more expensive. There can be more focus on getting hosts of such rooms to list on Airbnb.
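To turn the histogram into figures, prices can be grouped into bands and expressed as shares of all listings. This is a sketch on made-up prices (`toy`); the band edges and labels are assumptions chosen for the example, not values taken from the notebook.

```python
import pandas as pd

# Hypothetical prices, including one extreme value
toy = pd.DataFrame({"price": [40, 80, 120, 150, 220, 900, 10000]})

# Assumed band edges; the last band is open-ended so no listing is dropped
edges = [0, 100, 200, 500, float("inf")]
labels = ["0-100", "101-200", "201-500", "over 500"]

# include_lowest=True keeps a price of exactly 0 in the first band
bands = pd.cut(toy["price"], bins=edges, labels=labels, include_lowest=True)

# Share of listings per band, in band order
band_share = bands.value_counts(normalize=True).sort_index()
print(band_share.round(3))
```

Unlike the fixed 0 to 2000 axis of the chart, the open-ended last band also accounts for listings priced above the plotted range.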

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# pie chart of neighbourhood groups
neighbour_count = df["neighbourhood_group"].value_counts()
colors = sns.color_palette("Set2")

plt.figure(figsize = (6,6))
plt.pie(neighbour_count, labels = neighbour_count.index, autopct = "%1.1f%%", colors = colors)
plt.title("Pie Chart of Listings Distributions in each neighbourhood")

plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is best used to show how a total is divided into categories. In this case, I tried to show how the listings are divided among the neighbourhood groups.

##### 2. What is/are the insight(s) found from the chart?

We can see that the majority of our listings are in Manhattan and Brooklyn, which together cover almost 85% of the dataset, while Staten Island and the Bronx have very few listings, adding up to only about 3%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

While there is a high number of listings in Manhattan and Brooklyn, promoting areas like the Bronx can help generate more revenue, as it is very close to Manhattan and much cheaper.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Minimum Nights Histogram
plt.figure(figsize = (10,6))

bins = list(range(0,31))
sns.histplot(df["minimum_nights"], bins = bins, color = "green", edgecolor = "black")

plt.title("Distribution of Min Nights per booking")
plt.xlabel("Min number of nights")
plt.ylabel("Number of Listings")
plt.xticks(bins)

plt.show()

##### 1. Why did you pick the specific chart?

I wanted to show how the listings are distributed by the minimum number of nights required per booking. A histogram gives the best idea of the distribution of the data.

##### 2. What is/are the insight(s) found from the chart?

Here we get an understanding of how many listings require each minimum number of nights. We see that most listings require a minimum of fewer than 5 nights, but there is a spike at the top of the chart, around the 30-night mark.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We get an understanding of how bookings are affected by the minimum nights required, and can use this data to understand how the business is affected by it. We can also use this chart to investigate why there is a spike in listings with a minimum of around 30 nights.
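Since the histogram bins stop at 30, exact counts help to separate listings that require 30 nights from those that require more. A minimal sketch on made-up values (`toy`); the thresholds 5 and 30 mirror the ones mentioned in the text.

```python
import pandas as pd

# Hypothetical minimum-night requirements
toy = pd.DataFrame({"minimum_nights": [1, 1, 2, 3, 30, 30, 45, 365]})

# Exact count per distinct value, in ascending order
nights_counts = toy["minimum_nights"].value_counts().sort_index()

short_stay = (toy["minimum_nights"] < 5).mean()    # fewer than 5 nights
monthly = (toy["minimum_nights"] == 30).mean()     # exactly 30 nights
beyond_axis = (toy["minimum_nights"] > 30).mean()  # not visible with bins up to 30

print(nights_counts)
print(short_stay, monthly, beyond_axis)
```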

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize = (10,6))

sns.scatterplot(x = df["availability_365"], y = df["price"], alpha = 0.5, color = "violet")

plt.grid(False)
plt.title("Relationship between room availability and price")
plt.xlabel("Room availability in 365 days")
plt.ylabel("Price")

plt.show()

##### 1. Why did you pick the specific chart?

I wanted to understand whether the number of days a room is available has an effect on price. A scatter plot works best here as it shows how the individual data points sit within the overall picture.

##### 2. What is/are the insight(s) found from the chart?

We get an understanding that price and room availability are not correlated with each other. Data points are spread fairly evenly across the availability range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Currently, availability levels do not affect price levels. Logically, though, a listing with few available days may be in high demand and could command a higher price. Other factors have to be considered before making the call, but this can be used as a business input.

#### Chart - 10

In [None]:
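# Chart - 10 visualization code
# Total reviews per host name (host_name is not unique, so hosts who share a name are summed together)
top_10_hosts = df.groupby("host_name")["number_of_reviews"].sum().nlargest(10)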

plt.figure(figsize = (6,6))

plt.barh(top_10_hosts.index, top_10_hosts, color = "skyblue")

plt.title("Top 10 hosts (by reviews)")
plt.xlabel("Number of Reviews")
plt.ylabel("Host Names")
plt.gca().invert_yaxis()
plt.grid(False)

plt.show()

##### 1. Why did you pick the specific chart?

We are trying to find the hosts with the highest number of reviews. This horizontal bar chart gives a clear picture of who leads and by how much.

##### 2. What is/are the insight(s) found from the chart?

Among the top hosts with the most reviews, Michael and David are leading with over 10k and 8k reviews respectively, while almost everyone else has a little over 4k reviews. Because the chart groups by host_name, each total combines every host who shares that name.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This can help Airbnb identify the hosts who are really putting in the effort to make their listings the best. These hosts can be rewarded, which might motivate others to do the same and have an overall impact on the business.
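Host names are not unique identifiers, so a ranking by `host_name` can merge several different hosts into one bar. The sketch below, on a made-up frame (`toy`), contrasts grouping by name with grouping by `host_id`; the column name `total_reviews` is illustrative, and this is an alternative to the chart's code rather than what the notebook runs.

```python
import pandas as pd

# Hypothetical listings: hosts 1 and 2 are different people who share a name
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 3],
    "host_name": ["Michael", "Michael", "Michael", "Dana"],
    "number_of_reviews": [100, 50, 120, 200],
})

# Grouping by name merges hosts 1 and 2 into a single "Michael"
by_name = toy.groupby("host_name")["number_of_reviews"].sum()

# Grouping by host_id keeps each host separate; the name is kept for labelling
by_host = (
    toy.groupby("host_id")
    .agg(host_name=("host_name", "first"), total_reviews=("number_of_reviews", "sum"))
    .nlargest(10, "total_reviews")
)

print(by_name)
print(by_host)
```

In this toy example "Michael" tops the name-based ranking with 270 reviews, while the id-based ranking shows that the single most reviewed host is actually host 3.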

#### Chart - 11 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
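# numeric_only=True restricts the matrix to numeric columns; newer pandas versions raise an error on text and date columns otherwise
corr_matrix = df.corr(numeric_only=True)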

plt.figure(figsize = (12,10))
sns.heatmap(corr_matrix, annot =True, cmap = "mako", linewidths = .5)

plt.title("Correlation Heatmap of the Airbnb 2019 Dataset")

plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is best used when we want to understand the relationships between all the numeric variables in a dataset. Here, we see how strongly each pair of numeric columns is related.

##### 2. What is/are the insight(s) found from the chart?

When you read the chart, you can see that most of the values lie between -0.5 and 0.5, which means that the correlation between the different columns of this dataset is weak. In other words, a change in any one variable is not strongly associated with a change in any other variable in the dataset.
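Reading individual cells off a heatmap is error-prone, so it can help to list the variable pairs ordered by the strength of their correlation. This is a sketch on a made-up frame (`toy`); the names `upper` and `pairs` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data; room_type is text and must be excluded from the matrix
toy = pd.DataFrame({
    "price": [50, 100, 150, 200, 400],
    "number_of_reviews": [90, 60, 40, 10, 5],
    "reviews_per_month": [4.5, 3.0, 2.0, 0.5, 0.25],
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Entire home/apt"],
})

corr = toy.corr(numeric_only=True)

# Keep only the cells above the diagonal so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format, sorted by absolute correlation (strongest first)
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Applied to the real `df`, the top of this list shows at a glance whether any pair falls outside the -0.5 to 0.5 range.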

#### Chart - 12 - Pair Plot

In [None]:
# Pair Plot visualization code
pair_plot_columns = df[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']]

sns.pairplot(pair_plot_columns, height = 2)

plt.show()

##### 1. Why did you pick the specific chart?

Pair plots are best used to understand the relationships between the numerical variables of a dataset in a visual format. This pair plot shows how price, minimum nights, number of reviews, reviews per month and availability relate to each other.

##### 2. What is/are the insight(s) found from the chart?

We can see that most of these graphs show little to no correlation, as the data points are scattered throughout the charts. We do see a slight pattern in the price vs number of reviews graph, where the number of reviews decreases as price increases, and the same effect in the minimum nights vs number of reviews graph. This can be used as an input to understand how the number of reviews relates to the different parameters of bookings.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

After understanding the dataset and how different variables are linked, my suggestions to the client would be:

*   Change pricing strategies by including other factors like number of reviews, host ratings and location.
*   Conduct regular reviews, as almost half of the listings have no review from the current year (2019).
*   Focus on getting more listings by incentivizing hosts in neighbourhood groups like the Bronx to improve demand.
*   Reward top-performing hosts to motivate other hosts to improve their reviews.

These ideas can help improve customer booking experience, improve host initiatives and overall profitability.

# **Conclusion**

The dataset contains 48895 rows and 16 columns.

There were no duplicate rows.

There were null values in the name and host_name columns, which were filled with "Unnamed Listing" and "Unknown host".

Null values in last_review and reviews_per_month were left as is because those values don't exist, considering there have been no reviews for those listings.

All the numeric data in the dataset had very little correlation with each other, indicating room for improvement in pricing and customer experience review strategies.



### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***