<a href="https://colab.research.google.com/github/arsh457/Exploratory-data-analysis/blob/main/Airbnb_Bookings_Analysis_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalized way of experiencing the world. Today, Airbnb has become a one-of-a-kind service that is used and recognized worldwide. Analysis of the millions of listings on Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers' and hosts' behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more. </b>

## <b>This dataset has around 49,000 observations in it with 16 columns and it is a mix between categorical and numeric values. </b>

## <b> Explore and analyze the data to discover key insights (not limited to these), such as: 
* What can we learn about different hosts and areas?
* What can we learn from predictions? (e.g. locations, prices, reviews)
* Which hosts are the busiest, and why?
* Is there any noticeable difference in traffic among different areas, and what could be the reason for it? </b>

Let's begin exploring the data!
First, let's see what Airbnb means.

**Airbnb**, as in “Air Bed and Breakfast,” is a service that lets property 
owners rent out their spaces to travelers looking for a place to stay. Travelers can rent a space for multiple people to share, a shared space with private rooms, or the entire property for themselves.

Airbnb was started in 2008 by Brian Chesky and Joe Gebbia, two industrial designers who had recently moved to San Francisco.

Airbnb is based on a peer-to-peer business model, which makes it simple and easy to use and tends to be more profitable for both parties. The model also gives hosts the opportunity to customize and personalize their guests' experience.

Now let's look at the data.

As mentioned in the problem statement, the data consists of around 49,000 observations and 16 columns. Of these 16 columns, 10 are numeric and 6 are categorical.

Let's first understand what these two terms mean:

**Numeric:** In Python, numeric data types represent values that are numbers. A numeric value can be an integer, a floating-point number, or a complex number, represented by the `int`, `float`, and `complex` classes respectively. In this dataset the numeric columns are stored as `int64` or `float64`.

**Categorical:** Categorical variables can take on only a limited, and usually fixed, number of possible values. Categorical data may have an order, but numerical operations cannot be performed on it. pandas provides a dedicated `category` data type for such values; in this dataset the non-numeric columns are loaded as `object`.
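The numeric/categorical split described above can be checked programmatically. The following is a minimal sketch on a made-up three-row frame (`toy_df` and its values are hypothetical; only the column names are borrowed from the dataset), using `select_dtypes` to separate the two groups and `astype("category")` to obtain the pandas categorical type.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are made up.
toy_df = pd.DataFrame({
    "price": [149, 225, 150],
    "latitude": [40.64, 40.75, 40.80],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
})

# Columns holding numbers (int or float) versus everything else.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
other_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

# Convert a text column to the dedicated pandas categorical type.
toy_df["room_type"] = toy_df["room_type"].astype("category")
room_categories = list(toy_df["room_type"].cat.categories)

print(numeric_cols, other_cols, room_categories)
```

On the full dataset the same two `select_dtypes` calls could be applied to `airbnb_df` to confirm the 10/6 split.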


### Import libraries
 

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [18]:
## load the data
file_path = '/content/Airbnb NYC 2019 .csv'


In [20]:
airbnb_df = pd.read_csv(file_path)

In [28]:
airbnb_df.head()

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 10/19/2018 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 5/21/2019 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 7/5/2019 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 11/19/2018 | 0.1 | 1 | 0 |


In [24]:
##print shape of dataset with rows and columns
print(airbnb_df.shape)

(48895, 16)


We now know the exact number of rows and columns. Next we need to clean the data, since it may contain NaN values.

First we will go through the columns in detail, and then move on to the NaN values.


In [26]:
airbnb_df.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

**Detailed explanation of the columns:**

**id** is the listing ID

**name** is the name of the listing

**host_id** is the ID of the host of the listing

**host_name** is the name of the person who offers the space for a stay

**neighbourhood_group** is the broader location of the listing (for example Brooklyn or Manhattan)

**neighbourhood** is the specific area within the neighbourhood group (for example Kensington or Harlem)

**latitude & longitude** are the coordinates of the listing

**room_type** is the type of space offered (for example private room or entire home/apt)

**price (in dollars)** is the amount charged for the stay

**minimum_nights** is the minimum number of nights a guest must book

**number_of_reviews** is the count of reviews for the listing

**last_review** is the date of the most recent review for the listing

**reviews_per_month** is the number of reviews the listing receives per month, which serves as an indicator of its popularity

**calculated_host_listings_count** is the number of listings per host

**availability_365** is the number of days in the year on which the listing is available for booking

This gives us a clear picture of the data.
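To make the "number of listings per host" idea concrete, here is a small sketch that recomputes such a count from `host_id` alone. The frame `toy_df` and the column name `listings_per_host` are hypothetical, and whether `calculated_host_listings_count` in the dataset was derived exactly this way is an assumption.

```python
import pandas as pd

# Hypothetical toy frame: five listings spread over three hosts.
toy_df = pd.DataFrame({
    "host_id": [2787, 2787, 2845, 4632, 2787],
    "name": ["A", "B", "C", "D", "E"],
})

# Count listings per host, then attach that count to every listing row.
listing_counts = toy_df["host_id"].value_counts()
toy_df["listings_per_host"] = toy_df["host_id"].map(listing_counts)

# The host with the most listings in this toy sample.
busiest_host = listing_counts.idxmax()

print(toy_df)
print("busiest host:", busiest_host)
```

A count like this is one possible starting point for the "which hosts are the busiest" question, although listing count alone does not measure bookings.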

In [29]:
##now let's see the datatype of the columns and also the nan values
airbnb_df.dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

In [30]:
##as mentioned above, the dataset has some NaN values; find out which columns have missing values before further analysis
airbnb_df.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

As we can see above, the last_review and reviews_per_month columns have a very large number of NaN values (10052 each) compared with the name (16) and host_name (21) columns.

So we will drop last_review, along with id and host_name, which are not needed for further analysis, and replace the NaN values in reviews_per_month with 0.
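Raw counts are easier to judge as a share of all rows; in the output above, 10052 of 48895 rows is roughly 20.6%. The sketch below shows one way to compute that share per column. It runs on a made-up four-row frame (`toy_df` is hypothetical), not on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy_df = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "reviews_per_month": [0.21, np.nan, np.nan, 4.64],
    "price": [149, 225, 150, 89],
})

# isnull() gives True/False per cell; the column mean of that is the missing fraction.
missing_share = toy_df.isnull().mean().mul(100).round(1)

print(missing_share)
```

The same expression applied to `airbnb_df` would give the percentage of missing values in each of its columns.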

In [31]:
##drop the required columns which are not necessary for further analysis
airbnb_df.drop(['id','host_name','last_review'], axis=1, inplace=True)

In [32]:
##now replace the nan values with 0 in the reviews_per_month column
airbnb_df.fillna({'reviews_per_month':0}, inplace=True)

In [35]:
##count of number of null values for each attribute again
airbnb_df.isnull().sum()

name                              16
host_id                            0
neighbourhood_group                0
neighbourhood                      0
latitude                           0
longitude                          0
room_type                          0
price                              0
minimum_nights                     0
number_of_reviews                  0
reviews_per_month                  0
calculated_host_listings_count     0
availability_365                   0
dtype: int64

The data is now cleaned and ready for further analysis. Only the name column still has 16 missing values, which are left as they are.

Let's also look at the columns to check that the drop was successful.
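As an aside, the same cleaning steps can be written without `inplace=True`, which leaves the original frame untouched and makes the cell safe to re-run. This is a sketch on a made-up frame (`toy_df` and `cleaned` are hypothetical names), not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kinds of gaps as the real data.
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "host_name": ["John", None, "Laura"],
    "last_review": ["10/19/2018", None, "7/5/2019"],
    "number_of_reviews": [9, 0, 270],
    "reviews_per_month": [0.21, np.nan, 4.64],
})

# Sanity check: rows with a missing review rate are rows with zero reviews.
no_rate = toy_df["reviews_per_month"].isnull()
rate_missing_only_without_reviews = bool((toy_df.loc[no_rate, "number_of_reviews"] == 0).all())

# Chain the steps; each call returns a new frame instead of modifying toy_df.
cleaned = (
    toy_df
    .drop(columns=["id", "host_name", "last_review"])
    .fillna({"reviews_per_month": 0})
)

print(cleaned)
```

In this toy sample the missing rate coincides with zero reviews, which is the reasoning behind filling with 0; whether that holds for every row of the real dataset would need the same check on `airbnb_df`.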

In [39]:
airbnb_df.head()

| | name | host_id | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Clean & quiet apt home by the park | 2787 | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 0.21 | 6 | 365 |
| 1 | Skylit Midtown Castle | 2845 | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 0.38 | 2 | 355 |
| 2 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | 0.0 | 1 | 365 |
| 3 | Cozy Entire Floor of Brownstone | 4869 | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 4.64 | 1 | 194 |
| 4 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 0.1 | 1 | 0 |


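In [27]:
##summary statistics of the numeric columns; this cell was run before the drop and fillna cells (see the execution counts), so id still appears and reviews_per_month counts only its 38843 non-null values
airbnb_df.describe()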

| | id | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 38843.0 | 48895.0 | 48895.0 |
| mean | 19017140.0 | 67620010.0 | 40.728949 | -73.95217 | 152.720687 | 7.029962 | 23.274466 | 1.373221 | 7.143982 | 112.781327 |
| std | 10983110.0 | 78610970.0 | 0.05453 | 0.046157 | 240.15417 | 20.51055 | 44.550582 | 1.680442 | 32.952519 | 131.622289 |
| min | 2539.0 | 2438.0 | 40.49979 | -74.24442 | 0.0 | 1.0 | 0.0 | 0.01 | 1.0 | 0.0 |
| 25% | 9471945.0 | 7822033.0 | 40.6901 | -73.98307 | 69.0 | 1.0 | 1.0 | 0.19 | 1.0 | 0.0 |
| 50% | 19677280.0 | 30793820.0 | 40.72307 | -73.95568 | 106.0 | 3.0 | 5.0 | 0.72 | 1.0 | 45.0 |
| 75% | 29152180.0 | 107434400.0 | 40.763115 | -73.936275 | 175.0 | 5.0 | 24.0 | 2.02 | 2.0 | 227.0 |
| max | 36487240.0 | 274321300.0 | 40.91306 | -73.71299 | 10000.0 | 1250.0 | 629.0 | 58.5 | 327.0 | 365.0 |
