# **EDA Project: Exploratory Data Analysis on Airbnb Dataset**



##### **Project Type** - Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a data analysis process where the primary goal is to understand the main characteristics of a dataset before diving into more advanced modeling or hypothesis testing.


## ***Business Objective:***

*Objective: The primary objective of this EDA project is to gain insights into Airbnb booking patterns, user behaviors, and property characteristics, and into the main features and patterns within the Airbnb NYC 2019 dataset.*

*Through a systematic examination of the data, we aim to uncover relationships, identify trends, and understand the distribution of variables, providing a foundation for subsequent analysis.*

*By thoroughly exploring the booking dataset, we aim to identify trends, patterns, and key factors that influence booking decisions. This analysis provides a foundation for understanding the dynamics of Airbnb reservations and aids in making informed decisions for both hosts and users.*

# **Project Summary -**

1. *Insight into Booking Patterns: Understanding these patterns is vital for hosts to optimize pricing strategies, allocate resources efficiently, and enhance the overall guest experience during high-demand periods.*

2. *User Behavior Understanding: Analyzing demographic information and user preferences through EDA helps in grasping the diverse needs and expectations of Airbnb guests. This insight is invaluable for hosts looking to tailor their property listings, amenities, and services to match the preferences of their target audience, ultimately attracting more bookings.*

3. *Pricing Strategy Optimization: Examining the distribution of property prices and understanding the factors influencing pricing through EDA is essential for hosts to develop effective pricing strategies. Hosts can adjust their prices based on seasonal variations, local events, and the competitive landscape, ultimately maximizing revenue and occupancy rates.*

4. *Guest Review Analysis: EDA enables hosts to delve into user reviews, identifying factors contributing to positive or negative feedback. Understanding the aspects that impact guest satisfaction allows hosts to make informed decisions to enhance the quality of their listings, improve service, and foster positive reviews, thereby attracting more bookings.*

5. *Strategic Decision-Making: By uncovering patterns and trends, EDA empowers stakeholders with the knowledge needed for strategic decision-making. Whether it's optimizing property listings, adjusting pricing, or enhancing the guest experience, EDA provides the foundation for making informed choices that positively impact the success of Airbnb hosts and the satisfaction of guests.*

# **Problem Statement**


1. *Price Distribution: How is the price distributed across different room types?*

2. *Geographical Insights: What are the most common locations for the listed hotels?*

3. *Variability in Prices: Which locations exhibit the highest variability in hotel prices?*

4. *Review Analysis: How does the number of reviews vary across different hotels?*

5. *Comparison of Hotels Based on Location: Are there notable differences in the number of reviews across locations?*

6. *Comparison of Hotels Based on Location: Which locations have the highest and lowest average hotel prices?*

7. *Popular Hotel Names: What are the most commonly occurring hotel names in the dataset? Do certain hotel names attract more reviews?*

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # `from numpy import math` fails on recent NumPy releases; use the standard library module
from numpy import loadtxt
import seaborn as sb
import matplotlib.pyplot as mp
%matplotlib inline
from matplotlib import rcParams

### Dataset Loading

In [None]:
# Load Dataset
df=pd.read_csv("/content/drive/MyDrive/Python-Colab/Airbnb NYC 2019.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head(2)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

In [None]:
df.columns

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())
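
Raw null counts are easier to judge next to the share of rows they affect. The following is a minimal sketch, assuming a hypothetical helper `missing_summary` and a small made-up frame standing in for `df`; it is not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns that actually contain missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for df (values are made up for illustration)
toy = pd.DataFrame({
    "host_name": ["Ann", None, "Raj", "Lee"],
    "price": [100, 80, 0, 150],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})

summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df)` would show which columns drive the row loss when `dropna` is applied later.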

In [None]:
host_name0 = df[df['host_name'].isna()]['host_name']

In [None]:
host_name = df[~df['host_name'].isna()]['host_name']
host_name

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sb.heatmap(df.isnull(), cbar = False)

In [None]:
df.shape

In [None]:
df.dropna(inplace=True)
df.shape

In [None]:
# removing the data where price is 0
ndf = df[df['price']!=0]
ndf.shape

### What did you know about your dataset?

The dataset comes from the hotel booking industry (Airbnb listings in New York City for 2019), and we have to analyze customer reviews and the insights behind them.

Review prediction is an analytical study of the possibility of a customer abandoning a product or service. The goal is to understand this and take steps to change it before the customer gives up the product or service.

The raw dataset has 48895 rows and 16 columns. No duplicate rows were found. Rows with missing values were dropped with `dropna`, and listings with a price of 0 were filtered out, so the cleaned frame `ndf` has no missing values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
ndf.columns

In [None]:
# Dataset Describe
ndf.describe(include='all')

### Variables Description
1. name
2. host_name
3. neighbourhood_group
4. neighbourhood
5. latitude
6. longitude
7. room_type
8. price
9. minimum_nights

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in ndf.columns.tolist():
  print("No. of unique values in ",i,"is",ndf[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# copying the original dataset
cdf=ndf.copy()
cdf.shape

In [None]:
#price distributed across different room types
#grouping by roomtype & price
price_rt = pd.DataFrame(cdf.groupby('room_type')['price'].mean())
price_rt.reset_index(inplace=True)
price_rt
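
The mean alone can hide how prices are spread within each room type, because a few very expensive listings pull it upward. The following is a minimal sketch on made-up data (the frame `toy` and its values are assumptions, not taken from the Airbnb file) that reports the median and range alongside the mean.

```python
import pandas as pd

# Toy data standing in for cdf; prices are invented for illustration
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4 + ["Shared room"] * 3,
    "price": [120, 150, 200, 1000, 50, 60, 70, 90, 30, 40, 45],
})

# Several summary statistics per room type in one pass
price_stats = (
    toy.groupby("room_type")["price"]
       .agg(["count", "mean", "median", "min", "max"])
       .reset_index()
)
print(price_stats)
```

In this toy frame the single 1000 listing lifts the 'Entire home/apt' mean well above its median, which is the kind of skew worth checking in `cdf` before interpreting averages.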

In [None]:
#Geographical Insights: total cost for the listed locations
#grouping by neighbourhood & price
nh_price_sum = pd.DataFrame(cdf.groupby('neighbourhood_group')['price'].sum()).sort_values(by='price', ascending=False)
nh_price_sum.reset_index(inplace=True)
nh_price_sum

In [None]:
#Which locations have the highest and lowest average hotel prices?
#grouping by neighbourhood & avg. price
nh_price_mean = pd.DataFrame(cdf.groupby('neighbourhood_group')['price'].mean()).sort_values(by='price', ascending=False)
nh_price_mean.reset_index(inplace=True)
nh_price_mean
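
The problem statement also asks which locations show the highest price variability, which an average does not answer. One possible approach, sketched here on made-up data (the frame `toy` and the helper `iqr` are assumptions for illustration), is to compare the standard deviation and interquartile range per `neighbourhood_group`.

```python
import pandas as pd

# Toy data standing in for cdf; prices are invented for illustration
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4 + ["Bronx"] * 3,
    "price": [100, 150, 200, 950, 80, 100, 120, 140, 60, 70, 80],
})

def iqr(series: pd.Series) -> float:
    """Interquartile range: spread of the middle 50% of values."""
    return series.quantile(0.75) - series.quantile(0.25)

# Standard deviation is sensitive to outliers; IQR is more robust
price_spread = (
    toy.groupby("neighbourhood_group")["price"]
       .agg(std="std", iqr=iqr)
       .sort_values("std", ascending=False)
       .reset_index()
)
print(price_spread)
```

Replacing `toy` with `cdf` would give the corresponding ranking for the real listings.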

In [None]:
#Review Analysis: How does the number of reviews vary across different locations?
#grouping by neighbourhood & avg. no. of reviews
nh_reviews = pd.DataFrame(cdf.groupby('neighbourhood_group')['number_of_reviews'].mean()).sort_values(by='number_of_reviews')
nh_reviews.reset_index(inplace=True)
nh_reviews

In [None]:
#Which locations have the highest hotel prices?
#grouping by expensive name of hotel & price
Most_Expensive =  pd.DataFrame(cdf.groupby(['name','neighbourhood_group']).agg({'price':'sum'})).sort_values(by='price', ascending=False).head(5)
Most_Expensive.reset_index(inplace=True)
Most_Expensive

In [None]:
#Top 2 highest hotel prices according to locations
#Most expensive hotels neighbourhood group wise
Expensive_Loc_All = cdf.sort_values(by='price', ascending=False).groupby('neighbourhood_group').head(2).reset_index(drop=True)
Spec_Expensive_Loc = Expensive_Loc_All[['name','neighbourhood_group','price']]
Spec_Expensive_Loc

In [None]:
#Which locations have the lowest hotel prices?
#Least expensive hotels neighbourhood group wise
Affordable_Loc_All = cdf.sort_values(by='price').groupby('neighbourhood_group').head(1).reset_index(drop=True)
Least_Expensive_Loc = Affordable_Loc_All[['name','neighbourhood_group','price']]
Least_Expensive_Loc

In [None]:
#mp.figure(figsize=(10,5))
# sb.heatmap(nh_reviews['number_of_reviews'])

In [None]:
#Popular Hotel Names: What are the most commonly occurring hotel names in the dataset? Do certain hotel names attract more reviews?
#groupby review wise
Most_Reviewed = pd.DataFrame(cdf.groupby('name')['number_of_reviews'].sum()).sort_values(by='number_of_reviews', ascending=False).head(10)
Most_Reviewed.reset_index(inplace=True)
Most_Reviewed.rename(columns={'name':'hotel_name'},inplace=True)
Most_Reviewed
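
Summing reviews by name answers which names attract the most reviews, but not which names occur most often. The following is a minimal sketch on made-up data (the frame `toy` and the column names `listings`, `total_reviews`, and `avg_reviews` are assumptions for illustration) that counts occurrences and review totals side by side.

```python
import pandas as pd

# Toy data standing in for cdf; names and review counts are invented
toy = pd.DataFrame({
    "name": ["Cozy room", "Cozy room", "Cozy room", "Loft", "Loft", "Studio"],
    "number_of_reviews": [1, 2, 3, 50, 40, 10],
})

# How often each name occurs, plus total and average reviews per name
name_stats = (
    toy.groupby("name")["number_of_reviews"]
       .agg(listings="count", total_reviews="sum", avg_reviews="mean")
       .sort_values("listings", ascending=False)
       .reset_index()
)
print(name_stats)
```

In this toy frame the most common name is not the most reviewed one, so the two questions can have different answers.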

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1
#price distributed across different room types
mp.pie(price_rt['price'], labels = price_rt['room_type'], autopct='%1.2f%%')
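
Because `price_rt['price']` holds mean prices, each pie percentage is that room type's mean divided by the sum of the means, not a share of bookings. The following is a minimal sketch of that arithmetic, using hypothetical mean prices in a frame named `price_rt_toy` (an assumption for illustration).

```python
import pandas as pd

# Hypothetical mean prices per room type (not the real values)
price_rt_toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "price": [200.0, 100.0, 100.0],
})

# A pie chart normalizes its input: each wedge is value / sum(values)
price_rt_toy["share_pct"] = (
    price_rt_toy["price"] / price_rt_toy["price"].sum() * 100
).round(2)
print(price_rt_toy)
```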

#### Chart - 2

In [None]:
# Chart - 2
#Geographical Insights: total cost for the listed locations
mp.bar(nh_price_sum['neighbourhood_group'], nh_price_sum['price'], color=np.random.rand(len(nh_price_sum['neighbourhood_group']),4))
mp.xlabel('neighbourhood_group')
mp.ylabel('total price of hotel')

#### Chart - 3

In [None]:
# Chart - 3
#Which locations have the highest and lowest average hotel prices?
mp.barh(nh_price_mean['neighbourhood_group'], nh_price_mean['price'])
mp.xlabel('avg. price')
mp.ylabel('neighbourhood_group')
mp.title('Avg price according to neighbourhood_group')

for i,v in enumerate(nh_price_mean['price']):
  mp.text(v,i, str(round(v,2)), color='purple')
mp.show()

#### Chart - 4

In [None]:
# Chart - 4
#Review Analysis: How does the number of reviews vary across different locations?
mp.bar(nh_reviews['neighbourhood_group'], nh_reviews['number_of_reviews'], color='green')
mp.xlabel('neighbourhood_group')
mp.ylabel('number_of_reviews')
mp.title('number_of_reviews according to neighbourhood_group')

for i,v in enumerate(nh_reviews['number_of_reviews']):
  mp.text(i,v, str(round(v,2)), ha='center', va='bottom' ,color='purple')
mp.show()

#### Chart - 5

In [None]:
# Chart - 5
#Which locations have the highest hotel prices? (Top 2 highest hotels )
sb.barplot(y=Spec_Expensive_Loc['neighbourhood_group'], x=Spec_Expensive_Loc['price'])

#### Chart - 6

In [None]:
# Chart - 6
#Which hotels have the highest prices?
sb.barplot(y=Most_Expensive['name'], x= Most_Expensive['price'])
mp.show()

#### Chart - 7.1

In [None]:
# Chart - 7.1
#Which hotels have the lowest prices?
sb.barplot(y=Least_Expensive_Loc['name'], x=Least_Expensive_Loc['price'])

#### Chart - 7.2

In [None]:
# Chart - 7.2
#Popular Hotel Names: What are the most commonly occurring hotel names in the dataset? Do certain hotel names attract more reviews?
sb.barplot(y = Most_Reviewed['hotel_name'], x = Most_Reviewed['number_of_reviews'])
mp.show()

## **5. Solution to Business Objective**

1. *Price Distribution: 'Entire home/apt' has the highest average price, followed by 'Private room' and 'Shared room'. Each room type's share of the summed average prices is:
['Entire home/apt': 57.12%, 'Private room': 24.44%, 'Shared room': 18.44%]*

2. *Geographical Insights: Manhattan and Brooklyn have a much higher total price than Queens, Bronx and Staten Island. These two areas are by far the most expensive in terms of total hotel prices.*

3. *Variability in Prices ($): Manhattan has the highest avg price (180.07), followed by Brooklyn (121.53), Queens (95.78), Staten Island (89.96) and Bronx (79.65).*

4. *Review Analysis: Staten Island has received the highest average number of reviews, while Manhattan has received the lowest. This suggests an inverse relationship between the number of reviews and the avg price.*

5. *Comparison of Hotels Based on Location & Avg Price:*
  *   *Most Expensive:*
      1. *Luxury 1 bedroom apt. -stunning Manhattan views, Brooklyn*
      2. *Furnished room in Astoria apartment, Queens*

  *   *Least Expensive:*
      1. *Spacious 2-bedroom Apt in Heart of Greenpoint, Brooklyn*
      2. *Girls only, cozy room one block from Times Square, Manhattan*

6. *Comparison of Hotels Based on Location & reviews:*
  *   Manhattan (27.32)
  *   Brooklyn (29.57)
  *   Bronx (32.35)
  *   Queens (34.31)
  *   Staten Island (36.75)



7. *Popular Hotel Names:*
  *   Private Bedroom in Manhattan (666)
  *   Room near JFK Queen Bed (629)
  *   Beautiful Bedroom in Manhattan (617)
  *   Great Bedroom in Manhattan (607)
  *   Room Near JFK Twin Beds (576)
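
The inverse relationship between average price and average number of reviews noted in item 4 can be checked numerically. The following is a minimal sketch that computes the Pearson correlation from the five borough-level averages reported in items 3 and 6; with only five data points this is indicative at best, and the variable names are assumptions for illustration.

```python
import numpy as np

# Borough-level averages copied from items 3 and 6 above
boroughs = ["Manhattan", "Brooklyn", "Queens", "Staten Island", "Bronx"]
avg_price = np.array([180.07, 121.53, 95.78, 89.96, 79.65])
avg_reviews = np.array([27.32, 29.57, 34.31, 36.75, 32.35])

# Pearson correlation coefficient between the two series
r = np.corrcoef(avg_price, avg_reviews)[0, 1]
print(f"Pearson r between avg price and avg reviews: {r:.2f}")
```

A negative `r` is consistent with the observation, though a listing-level correlation on `cdf[['price', 'number_of_reviews']]` would be a stronger check than five aggregated points.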


# **Conclusion**

*This EDA project provides a comprehensive understanding of the Airbnb NYC 2019 booking data, offering valuable insights for hosts, users, and stakeholders. The findings serve as a basis for strategic decision-making within the dynamic Airbnb platform. The insights gained serve as a foundation for subsequent phases of analysis, guiding the development of hypotheses and informing potential modeling strategies.*

*Property Characteristics: Investigated the diversity of Airbnb properties in terms of type, size, and amenities. Explored the impact of property attributes on booking frequency and user ratings.*

*Inspection: Conducted a thorough examination of the dataset to identify and address missing values, outliers, and data quality issues. Checked for consistency in data types and formatting.*

*Pricing Analysis: Examined the distribution of property prices. Identified factors influencing pricing, such as location, property type, and seasonal variations.*

*User Preferences: Highlighted user demographics and preferences shaping booking choices.*

*Property Impact: Explored the influence of property characteristics on booking frequency and user satisfaction.*